
ID: 0013843
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2017-12-15 21:49
Reporter: thompsop
Priority: high
Severity: minor
Reproducibility: always
Status: new
Resolution: open
Platform: x86_64
OS:
OS Version: 7.4.1708
Product Version: 7.3.1611
Target Version:
Fixed in Version:
Summary: 0013843: xfsaild blocks after a certain time
Description: After upgrading to vmlinuz-3.10.0-693.2.2.el7.x86_64, we started to see a problem where xfsaild becomes blocked after a period of time and processes wanting to access the file system grind to a halt. Chatty I/O applications such as auditd are usually hit first. A reboot solves the problem until the next time. The file system is in a mirrored volume group, and neither the volume group nor the individual disks report any problems. Reverting to vmlinuz-3.10.0-514.26.2.el7.x86_64 seems to make the problem go away.
Steps To Reproduce: Run vmlinuz-3.10.0-693.2.2.el7.x86_64 and wait a variable amount of time, usually less than 12 hours.
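While waiting, one way to catch the symptom before the box wedges completely is to watch for tasks stuck in uninterruptible sleep (state D, the state the blocked touch process shows in the trace below). A minimal sketch, not the reporter's tooling, assuming a standard Linux /proc layout and an arbitrary 10-second poll interval:

#!/usr/bin/env python
# Sketch (not the reporter's tooling): list tasks stuck in
# uninterruptible sleep (state D), the state the blocked touch
# process in the trace below is in. Assumes a standard /proc layout.
import os
import time

def d_state_tasks():
    stuck = []
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        try:
            with open('/proc/%s/stat' % pid) as f:
                stat = f.read()
        except IOError:        # task exited while we were scanning
            continue
        # /proc/<pid>/stat: "pid (comm) state ..."; comm may contain
        # spaces, so split on the last ') ' to find the state field.
        head, _, tail = stat.rpartition(') ')
        comm = head.split('(', 1)[1] if '(' in head else head
        if tail.split(' ', 1)[0] == 'D':
            stuck.append((pid, comm))
    return stuck

if __name__ == '__main__':
    while True:                # arbitrary 10-second poll interval
        for pid, comm in d_state_tasks():
            print('D-state task: pid %s (%s)' % (pid, comm))
        time.sleep(10)

A few transient D-state tasks are normal under heavy I/O; the symptom here is tasks that stay in D indefinitely.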
Additional Information:
Sep 15 02:23:36 lol1093 kernel: INFO: task touch:9955 blocked for more than 120 seconds.
Sep 15 02:23:36 lol1093 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 15 02:23:36 lol1093 kernel: touch D ffff8801fbffcf10 0 9955 9951 0x00000080
Sep 15 02:23:36 lol1093 kernel: ffff88017306f6f0 0000000000000082 ffff8801fbffcf10 ffff88017306ffd8
Sep 15 02:23:36 lol1093 kernel: ffff88017306ffd8 ffff88017306ffd8 ffff8801fbffcf10 ffff880210e307b0
Sep 15 02:23:36 lol1093 kernel: 7fffffffffffffff 0000000000000002 0000000000000000 ffff8801fbffcf10
Sep 15 02:23:36 lol1093 kernel: Call Trace:
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816a8887>] __down_common+0xaa/0x104
Sep 15 02:23:36 lol1093 kernel: [<ffffffff810d1223>] ? find_busiest_group+0x143/0x980
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0444410>] ? _xfs_buf_find+0x170/0x330 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816a88fe>] __down+0x1d/0x1f
Sep 15 02:23:36 lol1093 kernel: [<ffffffff810b6691>] down+0x41/0x50
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc044420c>] xfs_buf_lock+0x3c/0xd0 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0444410>] _xfs_buf_find+0x170/0x330 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc04445fa>] xfs_buf_get_map+0x2a/0x240 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc04451a0>] xfs_buf_read_map+0x30/0x160 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0474ec1>] xfs_trans_read_buf_map+0x211/0x400 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0434bad>] xfs_read_agi+0x9d/0x110 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0434c54>] xfs_ialloc_read_agi+0x34/0xd0 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc04356a8>] xfs_dialloc+0xe8/0x280 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0457581>] xfs_ialloc+0x71/0x530 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0464b94>] ? xlog_grant_head_check+0x54/0x100 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0457ab3>] xfs_dir_ialloc+0x73/0x1f0 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816a87b2>] ? down_write+0x12/0x3d
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc045837e>] xfs_create+0x43e/0x6c0 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc045522b>] xfs_vn_mknod+0xbb/0x240 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc04553e3>] xfs_vn_create+0x13/0x20 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffff8120d60d>] vfs_create+0xcd/0x130
Sep 15 02:23:36 lol1093 kernel: [<ffffffff8121079a>] do_last+0x10ea/0x12c0
Sep 15 02:23:36 lol1093 kernel: [<ffffffff81210a32>] path_openat+0xc2/0x490
Sep 15 02:23:36 lol1093 kernel: [<ffffffff8118295b>] ? unlock_page+0x2b/0x30
Sep 15 02:23:36 lol1093 kernel: [<ffffffff811ad6a6>] ? do_read_fault.isra.44+0xe6/0x130
Sep 15 02:23:36 lol1093 kernel: [<ffffffff81212fcb>] do_filp_open+0x4b/0xb0
Sep 15 02:23:36 lol1093 kernel: [<ffffffff8111f757>] ? __audit_getname+0x97/0xb0
Sep 15 02:23:36 lol1093 kernel: [<ffffffff8122022a>] ? __alloc_fd+0x8a/0x130
Sep 15 02:23:36 lol1093 kernel: [<ffffffff811ffc13>] do_sys_open+0xf3/0x1f0
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816b0325>] ? do_page_fault+0x35/0x90
Sep 15 02:23:36 lol1093 kernel: [<ffffffff811ffd2e>] SyS_open+0x1e/0x20
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
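The trace shows the touch process stuck in down() inside xfs_buf_lock() while allocating an inode (xfs_ialloc -> xfs_read_agi), i.e. waiting on a buffer semaphore that another task, presumably the wedged xfsaild, never releases. Hung-task reports in this fixed format are easy to pull out of the kernel log when triaging several machines; a small sketch, assuming kernel log text (for example from journalctl -k or dmesg) on stdin:

#!/usr/bin/env python
# Sketch: extract hung-task reports like the one above from kernel
# log text, e.g.  journalctl -k | python hung_tasks.py
import re
import sys

# Matches lines such as:
#   INFO: task touch:9955 blocked for more than 120 seconds.
HUNG = re.compile(r'INFO: task (?P<comm>\S+):(?P<pid>\d+) '
                  r'blocked for more than (?P<secs>\d+) seconds')

for line in sys.stdin:
    m = HUNG.search(line)
    if m:
        print('%s (pid %s) blocked for more than %s seconds'
              % (m.group('comm'), m.group('pid'), m.group('secs')))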
Tags: No tags attached.


Notes

~0030400

thompsop (reporter)

Here's a similar, recent, but not identical problem:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1681914

~0030468

thompsop (reporter)

The problem persists on 3.10.0-693.5.2.el7.x86_64; I must revert to 3.10.0-514.26.2.el7.x86_64, and then the machine runs fine for weeks.

~0030693

Neil Mukerji (reporter)

I'd like to add that I'm experiencing this issue on a number (10+) of machines. My experiments also show that the instability started with kernel 3.10.0-693.2.2.el7.x86_64 and persists with 3.10.0-693.5.2.el7.x86_64; regressing to kernel 3.10.0-514.26.2.el7.x86_64 brings stability.

We run software RAID 1, and usually the 10G /tmp partition is the one to lock up. The issue only occurs on SSD drives; our servers with traditional drives never encounter this problem. We use systemd's weekly timer to trim the disks, and we have also tried trimming daily in case that was relevant. Anecdotally it feels like these trims have reduced the frequency of the xfsaild lock-ups, but they are still occurring. The only fix we've found so far is to regress the kernel, which isn't okay in the long term.
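For context on the trim schedule mentioned above, a sketch of a systemd timer/service pair that trims a single partition daily. This is not the poster's actual configuration; the unit names and the /tmp mount point are illustrative, and fstrim comes from util-linux:

# /etc/systemd/system/fstrim-tmp.service  (illustrative name)
[Unit]
Description=Discard unused blocks on /tmp

[Service]
Type=oneshot
ExecStart=/usr/sbin/fstrim -v /tmp

# /etc/systemd/system/fstrim-tmp.timer  (illustrative name)
[Unit]
Description=Daily fstrim of /tmp

[Timer]
OnCalendar=daily
Persistent=true

[Install]
WantedBy=timers.target

On EL7 this would be enabled with "systemctl enable fstrim-tmp.timer" followed by "systemctl start fstrim-tmp.timer".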

~0030694

thompsop (reporter)

I have found that running the current debug kernel masks the problem; 3.10.0-693.5.2.el7.x86_64.debug is stable for me, for instance. However, if the machine accidentally boots back into the normal kernel, the problem returns.

~0030761

gregb (reporter)

I'm not sure if this is related, but we just had this happen on one of our machines today: a task became blocked by xfsaild, and logins to the machine became blocked at that point too; a reboot was required to fix it.

This may be an upstream XFS issue -- we were running a plain vanilla 4.14.6 kernel at the time because of a different XFS issue (the vanilla kernel did not fix that other issue, but this lockup still happened).

Issue History
Date Modified Username Field Change
2017-09-18 14:19 thompsop New Issue
2017-10-18 11:34 thompsop Note Added: 0030400
2017-10-25 21:44 thompsop Note Added: 0030468
2017-12-03 10:21 Neil Mukerji Note Added: 0030693
2017-12-03 18:29 thompsop Note Added: 0030694
2017-12-15 21:49 gregb Note Added: 0030761