
ID: 0013843    Project: CentOS-7    Category: kernel    View Status: public    Last Update: 2018-01-31 12:26
Reporter: thompsop
Priority: high    Severity: minor    Reproducibility: always
Status: new    Resolution: open
Platform: x86_64    OS Version: 7.4.1708
Product Version: 7.3.1611
Summary: 0013843: xfsaild blocks after a certain amount of time
Description: With vmlinuz-3.10.0-693.2.2.el7.x86_64 we started to have a problem where xfsaild becomes blocked after a period of time and processes wanting to access the file system begin to grind to a halt. It most often hits chatty I/O applications such as auditd first. A reboot solves the problem until the next time. The file system is in a mirrored volume group, and neither the volume group nor the individual disks report any problems. Reverting to vmlinuz-3.10.0-514.26.2.el7.x86_64 seems to make the problem go away.
Steps To Reproduce: Run vmlinuz-3.10.0-693.2.2.el7.x86_64 and wait a variable amount of time, usually less than 12 hours.
Additional Information:
Sep 15 02:23:36 lol1093 kernel: INFO: task touch:9955 blocked for more than 120 seconds.
Sep 15 02:23:36 lol1093 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 15 02:23:36 lol1093 kernel: touch D ffff8801fbffcf10 0 9955 9951 0x00000080
Sep 15 02:23:36 lol1093 kernel: ffff88017306f6f0 0000000000000082 ffff8801fbffcf10 ffff88017306ffd8
Sep 15 02:23:36 lol1093 kernel: ffff88017306ffd8 ffff88017306ffd8 ffff8801fbffcf10 ffff880210e307b0
Sep 15 02:23:36 lol1093 kernel: 7fffffffffffffff 0000000000000002 0000000000000000 ffff8801fbffcf10
Sep 15 02:23:36 lol1093 kernel: Call Trace:
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816a94e9>] schedule+0x29/0x70
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816a6ff9>] schedule_timeout+0x239/0x2c0
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816a8887>] __down_common+0xaa/0x104
Sep 15 02:23:36 lol1093 kernel: [<ffffffff810d1223>] ? find_busiest_group+0x143/0x980
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0444410>] ? _xfs_buf_find+0x170/0x330 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816a88fe>] __down+0x1d/0x1f
Sep 15 02:23:36 lol1093 kernel: [<ffffffff810b6691>] down+0x41/0x50
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc044420c>] xfs_buf_lock+0x3c/0xd0 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0444410>] _xfs_buf_find+0x170/0x330 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc04445fa>] xfs_buf_get_map+0x2a/0x240 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc04451a0>] xfs_buf_read_map+0x30/0x160 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0474ec1>] xfs_trans_read_buf_map+0x211/0x400 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0434bad>] xfs_read_agi+0x9d/0x110 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0434c54>] xfs_ialloc_read_agi+0x34/0xd0 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc04356a8>] xfs_dialloc+0xe8/0x280 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0457581>] xfs_ialloc+0x71/0x530 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0464b94>] ? xlog_grant_head_check+0x54/0x100 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc0457ab3>] xfs_dir_ialloc+0x73/0x1f0 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816a87b2>] ? down_write+0x12/0x3d
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc045837e>] xfs_create+0x43e/0x6c0 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc045522b>] xfs_vn_mknod+0xbb/0x240 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffffc04553e3>] xfs_vn_create+0x13/0x20 [xfs]
Sep 15 02:23:36 lol1093 kernel: [<ffffffff8120d60d>] vfs_create+0xcd/0x130
Sep 15 02:23:36 lol1093 kernel: [<ffffffff8121079a>] do_last+0x10ea/0x12c0
Sep 15 02:23:36 lol1093 kernel: [<ffffffff81210a32>] path_openat+0xc2/0x490
Sep 15 02:23:36 lol1093 kernel: [<ffffffff8118295b>] ? unlock_page+0x2b/0x30
Sep 15 02:23:36 lol1093 kernel: [<ffffffff811ad6a6>] ? do_read_fault.isra.44+0xe6/0x130
Sep 15 02:23:36 lol1093 kernel: [<ffffffff81212fcb>] do_filp_open+0x4b/0xb0
Sep 15 02:23:36 lol1093 kernel: [<ffffffff8111f757>] ? __audit_getname+0x97/0xb0
Sep 15 02:23:36 lol1093 kernel: [<ffffffff8122022a>] ? __alloc_fd+0x8a/0x130
Sep 15 02:23:36 lol1093 kernel: [<ffffffff811ffc13>] do_sys_open+0xf3/0x1f0
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816b0325>] ? do_page_fault+0x35/0x90
Sep 15 02:23:36 lol1093 kernel: [<ffffffff811ffd2e>] SyS_open+0x1e/0x20
Sep 15 02:23:36 lol1093 kernel: [<ffffffff816b5009>] system_call_fastpath+0x16/0x1b
Tags: 7.4, file system, upstream, xfs


Notes

~0030400

thompsop (reporter)

Here's a similar, recent, but not identical problem:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1681914

~0030468

thompsop (reporter)

The problem persists on 3.10.0-693.5.2.el7.x86_64; I must revert to 3.10.0-514.26.2.el7.x86_64, and then the machine will run fine for weeks.

~0030693

Neil Mukerji (reporter)

I'd like to add that I'm experiencing this issue on a number (10+) of machines. My experiments also show that the instability started with kernel 3.10.0-693.2.2.el7.x86_64 and persists with 3.10.0-693.5.2.el7.x86_64; regressing to kernel 3.10.0-514.26.2.el7.x86_64 brings stability.

We run software RAID 1, and usually the 10G /tmp partition is the one to lock up. The issue only occurs with SSD drives; our servers with traditional drives never encounter this problem. We use systemd's weekly timer to trim the disks, and we have also tried trimming daily in case that was relevant. Anecdotally it feels like these trims have reduced the frequency of the xfsaild lock-ups, but they are still occurring. The only fix we've found so far is to regress the kernel, which isn't okay in the long term.
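
For reference, a minimal sketch of switching the stock util-linux fstrim.timer from weekly to daily (assuming the standard fstrim.timer unit is what is in use here):

systemctl edit fstrim.timer       # in the drop-in, add:
                                  #   [Timer]
                                  #   OnCalendar=
                                  #   OnCalendar=daily
systemctl restart fstrim.timer
systemctl list-timers fstrim.timer   # confirm the next trigger time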

~0030694

thompsop (reporter)

I have found that running the current debug kernel masks the problem.
3.10.0-693.5.2.el7.x86_64.debug is stable for me, for instance.
However, if the machine accidentally boots back to the normal kernel, the problem returns.
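
If anyone else wants to pin the debug kernel so an accidental reboot doesn't land back on the normal one, something like this should work (a sketch; the exact version string is whatever rpm -q kernel-debug reports):

yum install -y kernel-debug
grubby --set-default=/boot/vmlinuz-3.10.0-693.5.2.el7.x86_64.debug
grubby --default-kernel    # verify which kernel will boot next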

~0030761

gregb (reporter)

I'm not sure if this is related, but we just had this happen on one of our machines today - a task became blocked by xfsaild, and logins to the machine became blocked at that point too; a reboot was required to recover.

This may be an upstream XFS issue -- we were trying a plain vanilla 4.14.6 kernel at the time because of a different XFS issue (the plain vanilla kernel did not fix the other issue, but this lockup happened).

~0030855

smuku@salesforce.com (reporter)

We are seeing the exact same behaviour on multiple hosts running this kernel, with processes getting blocked by xfsaild.

~0030856

gbeeley (reporter)

I was able to reproduce this (or a similar) issue on demand with 4.14.6 through the following procedure:

0. Be running vanilla 4.14.6.
1. Have a mounted and active XFS filesystem.
2. Go to /sys/devices/system/cpu and disable all but cpu0 by writing "0" to cpuXX/online for each cpu other than cpu0.
3. Re-enable one other cpu by writing "1" to cpuXX/online for that cpu. The selection of which cpu to re-enable *might* depend on hyperthreading and/or NUMA details, but I have not confirmed that yet.
4. Observe system lockup apparently due to xfsaild.
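
For anyone trying to reproduce this, the procedure above boils down to roughly the following (a sketch; run as root on a test machine only, since it can lock up the box, and the choice of cpu1 is arbitrary):

cd /sys/devices/system/cpu
for c in cpu[1-9]*; do echo 0 > "$c/online"; done   # take every cpu except cpu0 offline
echo 1 > cpu1/online                                # re-enable one other cpu
dmesg -w | grep -i "blocked for more than"          # watch for hung-task reports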

~0030981

joerg (reporter)

I am witnessing the same failure after updating from
vmlinuz-3.10.0-514.16.1.el7.x86_64
to
vmlinuz-3.10.0-693.11.6.el7.x86_64

We are also running on a mdadm-raid1:

[284165.322308] INFO: task xfsaild/md127:525 blocked for more than 120 seconds.
[284165.322667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[284165.323070] xfsaild/md127 D ffff881039386eb0 0 525 2 0x00000000
[284165.323081] Call Trace:
[284165.323100] [<ffffffff8109ae1e>] ? try_to_del_timer_sync+0x5e/0x90
[284165.323115] [<ffffffff816ab6d9>] schedule+0x29/0x70
[284165.323307] [<ffffffffc0385d36>] _xfs_log_force+0x1c6/0x2c0 [xfs]
[284165.323339] [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
[284165.323503] [<ffffffffc0391bbc>] ? xfsaild+0x16c/0x6f0 [xfs]
[284165.323685] [<ffffffffc0385e5c>] xfs_log_force+0x2c/0x70 [xfs]
[284165.323868] [<ffffffffc0391a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[284165.324048] [<ffffffffc0391bbc>] xfsaild+0x16c/0x6f0 [xfs]
[284165.324235] [<ffffffffc0391a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[284165.324264] [<ffffffff810b252f>] kthread+0xcf/0xe0
[284165.324278] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284165.324294] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[284165.324310] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284285.326264] INFO: task xfsaild/md127:525 blocked for more than 120 seconds.
[284285.326594] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[284285.326912] xfsaild/md127 D ffff881039386eb0 0 525 2 0x00000000
[284285.326923] Call Trace:
[284285.326942] [<ffffffff8109ae1e>] ? try_to_del_timer_sync+0x5e/0x90
[284285.326956] [<ffffffff816ab6d9>] schedule+0x29/0x70
[284285.327139] [<ffffffffc0385d36>] _xfs_log_force+0x1c6/0x2c0 [xfs]
[284285.327169] [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
[284285.327322] [<ffffffffc0391bbc>] ? xfsaild+0x16c/0x6f0 [xfs]
[284285.327479] [<ffffffffc0385e5c>] xfs_log_force+0x2c/0x70 [xfs]
[284285.327652] [<ffffffffc0391a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[284285.327820] [<ffffffffc0391bbc>] xfsaild+0x16c/0x6f0 [xfs]
[284285.327981] [<ffffffffc0391a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[284285.327999] [<ffffffff810b252f>] kthread+0xcf/0xe0
[284285.328013] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284285.328027] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[284285.328044] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284405.330013] INFO: task xfsaild/md127:525 blocked for more than 120 seconds.
[284405.330297] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[284405.330669] xfsaild/md127 D ffff881039386eb0 0 525 2 0x00000000
[284405.330679] Call Trace:
[284405.330697] [<ffffffff8109ae1e>] ? try_to_del_timer_sync+0x5e/0x90
[284405.330710] [<ffffffff816ab6d9>] schedule+0x29/0x70
[284405.330927] [<ffffffffc0385d36>] _xfs_log_force+0x1c6/0x2c0 [xfs]
[284405.330941] [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
[284405.331106] [<ffffffffc0391bbc>] ? xfsaild+0x16c/0x6f0 [xfs]
[284405.331846] [<ffffffffc0385e5c>] xfs_log_force+0x2c/0x70 [xfs]
[284405.332104] [<ffffffffc0391a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[284405.332412] [<ffffffffc0391bbc>] xfsaild+0x16c/0x6f0 [xfs]
[284405.332642] [<ffffffffc0391a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[284405.332742] [<ffffffff810b252f>] kthread+0xcf/0xe0
[284405.332755] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284405.332770] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[284405.332782] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284405.332896] INFO: task kworker/12:2:18723 blocked for more than 120 seconds.
[284405.333359] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[284405.333880] kworker/12:2 D ffff881029beaf70 0 18723 2 0x00000000
[284405.334039] Workqueue: xfs-cil/md127 xlog_cil_push_work [xfs]
[284405.334127] Call Trace:
[284405.334143] [<ffffffff816ab6d9>] schedule+0x29/0x70
[284405.334302] [<ffffffffc0384ffd>] xlog_state_get_iclog_space+0x10d/0x320 [xfs]
[284405.334462] [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
[284405.334623] [<ffffffffc0385509>] xlog_write+0x1a9/0x720 [xfs]
[284405.335075] [<ffffffffc0382477>] ? kmem_zone_alloc+0x97/0x130 [xfs]
[284405.335479] [<ffffffffc0387218>] xlog_cil_push+0x2a8/0x430 [xfs]
[284405.335655] [<ffffffffc03873b5>] xlog_cil_push_work+0x15/0x20 [xfs]
[284405.335668] [<ffffffff810aa3ba>] process_one_work+0x17a/0x440
[284405.335682] [<ffffffff810ab1d8>] worker_thread+0x278/0x3c0
[284405.335695] [<ffffffff810aaf60>] ? manage_workers.isra.24+0x2a0/0x2a0
[284405.335707] [<ffffffff810b252f>] kthread+0xcf/0xe0
[284405.335720] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284405.335732] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[284405.335744] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284405.335755] INFO: task kworker/12:1:20525 blocked for more than 120 seconds.
[284405.336031] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[284405.336324] kworker/12:1 D ffff881029beb774 0 20525 2 0x00000000
[284405.336487] Workqueue: xfs-sync/md127 xfs_log_worker [xfs]
[284405.336493] Call Trace:
[284405.336511] [<ffffffff8150c3a4>] ? md_make_request+0x104/0x290
[284405.336524] [<ffffffff816ab6d9>] schedule+0x29/0x70
[284405.336538] [<ffffffff816a90e9>] schedule_timeout+0x239/0x2c0
[284405.336568] [<ffffffffc000d6d0>] ? ahci_qc_prep+0x150/0x1c0 [libahci]
[284405.336644] [<ffffffffc00ad957>] ? ata_qc_issue+0x177/0x3c0 [libata]
[284405.336661] [<ffffffff8109a5cb>] ? __internal_add_timer+0xab/0x130
[284405.336673] [<ffffffff816aba8d>] wait_for_completion+0xfd/0x140
[284405.336686] [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
[284405.336702] [<ffffffff810ab41d>] flush_work+0xfd/0x190
[284405.336714] [<ffffffff810a7990>] ? move_linked_works+0x90/0x90
[284405.336883] [<ffffffffc0387b3a>] xlog_cil_force_lsn+0x8a/0x210 [xfs]
[284405.336901] [<ffffffff811a3206>] ? next_online_pgdat+0x26/0x60
[284405.337069] [<ffffffffc0385bf5>] _xfs_log_force+0x85/0x2c0 [xfs]
[284405.337092] [<ffffffff810298b1>] ? __switch_to+0x431/0x500
[284405.337255] [<ffffffffc0385ec4>] ? xfs_log_worker+0x24/0x50 [xfs]
[284405.337426] [<ffffffffc0385e5c>] xfs_log_force+0x2c/0x70 [xfs]
[284405.337587] [<ffffffffc0385ec4>] xfs_log_worker+0x24/0x50 [xfs]
[284405.337604] [<ffffffff810aa3ba>] process_one_work+0x17a/0x440
[284405.337617] [<ffffffff810ab086>] worker_thread+0x126/0x3c0
[284405.337631] [<ffffffff810aaf60>] ? manage_workers.isra.24+0x2a0/0x2a0
[284405.337645] [<ffffffff810b252f>] kthread+0xcf/0xe0
[284405.337659] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284405.337675] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[284405.337687] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284525.338947] INFO: task xfsaild/md127:525 blocked for more than 120 seconds.
[284525.339306] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[284525.339709] xfsaild/md127 D ffff881039386eb0 0 525 2 0x00000000
[284525.339720] Call Trace:
[284525.339739] [<ffffffff8109ae1e>] ? try_to_del_timer_sync+0x5e/0x90
[284525.339753] [<ffffffff816ab6d9>] schedule+0x29/0x70
[284525.339961] [<ffffffffc0385d36>] _xfs_log_force+0x1c6/0x2c0 [xfs]
[284525.339993] [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
[284525.340159] [<ffffffffc0391bbc>] ? xfsaild+0x16c/0x6f0 [xfs]
[284525.340332] [<ffffffffc0385e5c>] xfs_log_force+0x2c/0x70 [xfs]
[284525.340508] [<ffffffffc0391a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[284525.340689] [<ffffffffc0391bbc>] xfsaild+0x16c/0x6f0 [xfs]
[284525.340858] [<ffffffffc0391a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[284525.340886] [<ffffffff810b252f>] kthread+0xcf/0xe0
[284525.340904] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284525.340920] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[284525.340937] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284525.340984] INFO: task kworker/12:2:18723 blocked for more than 120 seconds.
[284525.341276] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[284525.341594] kworker/12:2 D ffff881029beaf70 0 18723 2 0x00000000
[284525.341770] Workqueue: xfs-cil/md127 xlog_cil_push_work [xfs]
[284525.341776] Call Trace:
[284525.341791] [<ffffffff816ab6d9>] schedule+0x29/0x70
[284525.341959] [<ffffffffc0384ffd>] xlog_state_get_iclog_space+0x10d/0x320 [xfs]
[284525.341988] [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
[284525.342163] [<ffffffffc0385509>] xlog_write+0x1a9/0x720 [xfs]
[284525.342350] [<ffffffffc0382477>] ? kmem_zone_alloc+0x97/0x130 [xfs]
[284525.342537] [<ffffffffc0387218>] xlog_cil_push+0x2a8/0x430 [xfs]
[284525.342726] [<ffffffffc03873b5>] xlog_cil_push_work+0x15/0x20 [xfs]
[284525.342744] [<ffffffff810aa3ba>] process_one_work+0x17a/0x440
[284525.342759] [<ffffffff810ab1d8>] worker_thread+0x278/0x3c0
[284525.342776] [<ffffffff810aaf60>] ? manage_workers.isra.24+0x2a0/0x2a0
[284525.342789] [<ffffffff810b252f>] kthread+0xcf/0xe0
[284525.342805] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284525.342818] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[284525.342832] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284525.342861] INFO: task kworker/12:1:20525 blocked for more than 120 seconds.
[284525.343223] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[284525.343546] kworker/12:1 D ffff881029beb774 0 20525 2 0x00000000
[284525.343717] Workqueue: xfs-sync/md127 xfs_log_worker [xfs]
[284525.343722] Call Trace:
[284525.343738] [<ffffffff8150c3a4>] ? md_make_request+0x104/0x290
[284525.343751] [<ffffffff816ab6d9>] schedule+0x29/0x70
[284525.343763] [<ffffffff816a90e9>] schedule_timeout+0x239/0x2c0
[284525.343795] [<ffffffffc000d6d0>] ? ahci_qc_prep+0x150/0x1c0 [libahci]
[284525.343874] [<ffffffffc00ad957>] ? ata_qc_issue+0x177/0x3c0 [libata]
[284525.343901] [<ffffffff8109a5cb>] ? __internal_add_timer+0xab/0x130
[284525.343918] [<ffffffff816aba8d>] wait_for_completion+0xfd/0x140
[284525.343931] [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
[284525.343946] [<ffffffff810ab41d>] flush_work+0xfd/0x190
[284525.343962] [<ffffffff810a7990>] ? move_linked_works+0x90/0x90
[284525.344131] [<ffffffffc0387b3a>] xlog_cil_force_lsn+0x8a/0x210 [xfs]
[284525.344150] [<ffffffff811a3206>] ? next_online_pgdat+0x26/0x60
[284525.344320] [<ffffffffc0385bf5>] _xfs_log_force+0x85/0x2c0 [xfs]
[284525.344335] [<ffffffff810298b1>] ? __switch_to+0x431/0x500
[284525.344505] [<ffffffffc0385ec4>] ? xfs_log_worker+0x24/0x50 [xfs]
[284525.344694] [<ffffffffc0385e5c>] xfs_log_force+0x2c/0x70 [xfs]
[284525.344879] [<ffffffffc0385ec4>] xfs_log_worker+0x24/0x50 [xfs]
[284525.344902] [<ffffffff810aa3ba>] process_one_work+0x17a/0x440
[284525.344920] [<ffffffff810ab086>] worker_thread+0x126/0x3c0
[284525.344935] [<ffffffff810aaf60>] ? manage_workers.isra.24+0x2a0/0x2a0
[284525.344951] [<ffffffff810b252f>] kthread+0xcf/0xe0
[284525.344966] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284525.344979] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[284525.344996] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284645.346805] INFO: task xfsaild/md127:525 blocked for more than 120 seconds.
[284645.347170] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[284645.347569] xfsaild/md127 D ffff881039386eb0 0 525 2 0x00000000
[284645.347581] Call Trace:
[284645.347599] [<ffffffff8109ae1e>] ? try_to_del_timer_sync+0x5e/0x90
[284645.347613] [<ffffffff816ab6d9>] schedule+0x29/0x70
[284645.347821] [<ffffffffc0385d36>] _xfs_log_force+0x1c6/0x2c0 [xfs]
[284645.347852] [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
[284645.348018] [<ffffffffc0391bbc>] ? xfsaild+0x16c/0x6f0 [xfs]
[284645.348189] [<ffffffffc0385e5c>] xfs_log_force+0x2c/0x70 [xfs]
[284645.348370] [<ffffffffc0391a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[284645.348550] [<ffffffffc0391bbc>] xfsaild+0x16c/0x6f0 [xfs]
[284645.348734] [<ffffffffc0391a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[284645.348761] [<ffffffff810b252f>] kthread+0xcf/0xe0
[284645.348778] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284645.348794] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[284645.348807] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284645.348855] INFO: task kworker/12:2:18723 blocked for more than 120 seconds.
[284645.349145] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[284645.349541] kworker/12:2 D ffff881029beaf70 0 18723 2 0x00000000
[284645.349711] Workqueue: xfs-cil/md127 xlog_cil_push_work [xfs]
[284645.349731] Call Trace:
[284645.349751] [<ffffffff816ab6d9>] schedule+0x29/0x70
[284645.349923] [<ffffffffc0384ffd>] xlog_state_get_iclog_space+0x10d/0x320 [xfs]
[284645.349940] [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
[284645.350118] [<ffffffffc0385509>] xlog_write+0x1a9/0x720 [xfs]
[284645.350294] [<ffffffffc0382477>] ? kmem_zone_alloc+0x97/0x130 [xfs]
[284645.350482] [<ffffffffc0387218>] xlog_cil_push+0x2a8/0x430 [xfs]
[284645.350654] [<ffffffffc03873b5>] xlog_cil_push_work+0x15/0x20 [xfs]
[284645.350673] [<ffffffff810aa3ba>] process_one_work+0x17a/0x440
[284645.350687] [<ffffffff810ab1d8>] worker_thread+0x278/0x3c0
[284645.350703] [<ffffffff810aaf60>] ? manage_workers.isra.24+0x2a0/0x2a0
[284645.350730] [<ffffffff810b252f>] kthread+0xcf/0xe0
[284645.350744] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
[284645.350761] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[284645.350774] [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40

~0031076

thompsop (reporter)

created
https://bugzilla.redhat.com/show_bug.cgi?id=1539394

~0031077

TrevorH (developer)

Please note that there are possible fixes for this in the latest CentOS 7 kernel. Running rpm -q --changelog kernel-3.10.0-693.17.1.el7.x86_64 shows several possible xfs fixes that might be the solution to this. Please test and provide feedback.
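
For example (a sketch, assuming the new kernel package is already installed):

rpm -q --changelog kernel-3.10.0-693.17.1.el7.x86_64 | grep -i xfs | head -40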

~0031084

thompsop (reporter)

Copying in a comment from the Red Hat bug. I believe he is specifically referencing gbeeley's steps to reproduce and also joerg's call trace.

--- Comment #2 from Brian Foster <bfoster@redhat.com> ---
The xfsaild traces in the bug referenced above show us waiting on a log force.
That tends to imply we're waiting on I/O. The stacks also show md appears to be
involved. Since you have a reproducer, could you rule out md from the test if
that hasn't been done already?

FWIW, I've been running an fsstress/cpu [on|off]line loop against a dummy
raid1+XFS device on 4.15.0-rc9 for a few minutes and haven't seen any issues so
far. Is 4.14.6 definitely required or has anything newer than that simply not
been checked for the problem?
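
For anyone who wants to run a similar loop locally, here is a rough sketch of that kind of test (device names, sizes, and cpu numbers are illustrative; fsstress comes from the xfstests suite and is not in the path by default):

# build a throwaway md raid1 + XFS out of loop devices
truncate -s 1G /var/tmp/d0.img /var/tmp/d1.img
L0=$(losetup -f --show /var/tmp/d0.img)
L1=$(losetup -f --show /var/tmp/d1.img)
mdadm --create /dev/md9 --level=1 --raid-devices=2 "$L0" "$L1"   # answer 'y' to the metadata prompt
mkfs.xfs /dev/md9 && mkdir -p /mnt/scratch && mount /dev/md9 /mnt/scratch

# hammer the filesystem while toggling a cpu off and on
fsstress -d /mnt/scratch -p 4 -n 100000 &
while true; do
    echo 0 > /sys/devices/system/cpu/cpu1/online
    sleep 5
    echo 1 > /sys/devices/system/cpu/cpu1/online
    sleep 5
done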

~0031087

joerg (reporter)

I am still on 3.10.0-693.11.6.el7, but we restored our system (a KVM hypervisor) with ext4 on the same MD RAID1 and are now running without issues.

I have a twin system (same hardware, ...) on the same network, still running MD RAID1 with xfs as /, and it is currently not showing any problems.

~0031091

diito (reporter)

I seem to have this problem with a CentOS 7.4 VM running on a Fedora 27 KVM host. I've switched the VM to the 3.10.0-514.26.2.el7.x86_64 kernel and that DID NOT resolve this; it only seems to have delayed it by a few days at best. The VM itself sits on an xfs partition on an NVMe drive (Samsung 960 PRO M.2), so it's plenty fast enough to keep up with anything. Previously the VM was running on an ext4 partition on an SSD in another Fedora 27 system without issues. Besides that, the previous system was Intel and the new one is AMD. Another CentOS-based VM I migrated, which is basically idle all day, does NOT have this issue.

Has anyone seen the host impact this in any way? I'm trying to figure out why xfsaild would fail in this manner within the VM on my new host and not my old one, as the VM itself has not been altered in any way.

~0031100

thompsop (reporter)

[94443.457717] [<ffffffff81235765>] do_fsync+0x65/0xa0
[94443.457718] [<ffffffff81235a40>] SyS_fsync+0x10/0x20
[94443.457720] [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[94563.461909] INFO: task xfsaild/dm-4:502 blocked for more than 120 seconds.
[94563.463173] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94563.464414] xfsaild/dm-4 D ffff88020f610fd0 0 502 2 0x00000000
[94563.464416] Call Trace:
[94563.464422] [<ffffffff8109affe>] ? try_to_del_timer_sync+0x5e/0x90
[94563.464425] [<ffffffff816ab8a9>] schedule+0x29/0x70
[94563.464460] [<ffffffffc0464d46>] _xfs_log_force+0x1c6/0x2c0 [xfs]
[94563.464464] [<ffffffff810c6620>] ? wake_up_state+0x20/0x20
[94563.464480] [<ffffffffc0470bcc>] ? xfsaild+0x16c/0x6f0 [xfs]
[94563.464495] [<ffffffffc0464e6c>] xfs_log_force+0x2c/0x70 [xfs]
[94563.464509] [<ffffffffc0470a60>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[94563.464522] [<ffffffffc0470bcc>] xfsaild+0x16c/0x6f0 [xfs]
[94563.464535] [<ffffffffc0470a60>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
[94563.464538] [<ffffffff810b270f>] kthread+0xcf/0xe0
[94563.464540] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[94563.464543] [<ffffffff816b8798>] ret_from_fork+0x58/0x90
[94563.464544] [<ffffffff810b2640>] ? insert_kthread_work+0x40/0x40
[94563.464550] INFO: task auditd:1002 blocked for more than 120 seconds.
[94563.466275] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94563.468135] auditd D ffff8802123e6eb0 0 1002 1 0x00000000
[94563.468138] Call Trace:
[94563.468161] [<ffffffffc0461487>] ? kmem_zone_alloc+0x97/0x130 [xfs]
[94563.468165] [<ffffffff816ab8a9>] schedule+0x29/0x70
[94563.468167] [<ffffffff816ad195>] rwsem_down_write_failed+0x225/0x3a0
[94563.468189] [<ffffffffc0461bc4>] ? xlog_grant_head_check+0x54/0x100 [xfs]
[94563.468210] [<ffffffffc0450c9d>] ? xfs_vn_update_time+0xcd/0x150 [xfs]
[94563.468213] [<ffffffff81333fd7>] call_rwsem_down_write_failed+0x17/0x30
[94563.468216] [<ffffffff816aaa8d>] down_write+0x2d/0x3d
[94563.468236] [<ffffffffc04531e1>] xfs_ilock+0xc1/0x120 [xfs]
[94563.468254] [<ffffffffc0450c9d>] xfs_vn_update_time+0xcd/0x150 [xfs]
[94563.468258] [<ffffffff8121efa5>] update_time+0x25/0xd0
[94563.468261] [<ffffffff8121f0f0>] file_update_time+0xa0/0xf0
[94563.468280] [<ffffffffc04473b5>] xfs_file_aio_write_checks+0x185/0x1f0 [xfs]
[94563.468298] [<ffffffffc0447a5a>] xfs_file_buffered_aio_write+0xca/0x2c0 [xfs]
[94563.468316] [<ffffffffc0447ddd>] xfs_file_aio_write+0x18d/0x1b0 [xfs]
[94563.468319] [<ffffffff8120222d>] do_sync_write+0x8d/0xd0
[94563.468323] [<ffffffff81202ced>] vfs_write+0xbd/0x1e0
[94563.468325] [<ffffffff81203aff>] SyS_write+0x7f/0xe0
[94563.468328] [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
[94563.468330] INFO: task auditd:1003 blocked for more than 120 seconds.
[94563.470184] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[94563.472067] auditd D ffff8802135f0000 0 1003 1 0x00000000
[94563.472086] Call Trace:
[94563.472089] [<ffffffff816ab8a9>] schedule+0x29/0x70

~0031101

thompsop (reporter)

After one day of uptime on 3.10.0-693.17.1.el7.x86_64, the problem recurred. No md involved. I was unable to reproduce using the CPU offline recipe, unless it took effect some hours after the experiment.

~0031109

thompsop (reporter)

From the Red Hat bug (a request for a sosreport, blocked task output, and a kdump):

Could you please describe the storage (devices, I/O schedulers, etc.), filesystem (xfs_info) and workload involved? Note that much of the former would be included in a sosreport.

Also, could you provide the complete blocked task output while in this state ('echo w > /proc/sysrq-trigger')? FWIW, a kdump wouldn't hurt as well if kdump is configured.
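
Concretely, while the machine is hung, something like this captures what's being asked for (a sketch; the commented-out 'echo c' deliberately panics the box and should only be used if kdump is set up and a vmcore is actually wanted):

echo w > /proc/sysrq-trigger     # dump stacks of all blocked (D-state) tasks to the kernel log
dmesg > blocked-tasks.txt        # save the output to attach to the ticket
systemctl is-active kdump        # check whether kdump is configured and running
# echo c > /proc/sysrq-trigger   # optional: force a crash so kdump writes a vmcore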

~0031110

thompsop (reporter)

Bam, bug closed:
--- Comment #13 from Brian Foster <bfoster@redhat.com> ---
All signs point to this as a duplicate of bug 1493585. Feel free to reopen if a
hang variant is reproduced that does not involve md.

*** This bug has been marked as a duplicate of bug 1493585 ***

~0031111

TrevorH (developer)

Which is yet another invisible RHEL bug :-(

Does anyone have the ability to read that and add email addresses to the cc list?

~0031113

thompsop (reporter)

I asked for exactly that - no reply to date

~0031119

diito (reporter)

> Feel free to reopen if a hang variant is reproduced that does not involve md.

My issue does not involve md. It also happens on SSDs.

~0031121

thompsop (reporter)

"Bug 1493585 is in the verified state and has already been fixed in
MD in kernel-3.10.0-820.el7."

~0031122

joerg (reporter)

I had this left over from my January hang-up:
# ps aux | grep D
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 124 0.0 0.0 0 0 ? DN Jan09 0:02 [khugepaged]
root 538 0.0 0.0 0 0 ? D Jan09 2:11 [xfsaild/md127]
root 723 0.0 0.0 128428 1492 ? Ds Jan13 0:00 /bin/systemctl status snmpd.service
root 842 0.0 0.0 55452 1024 ? D<sl Jan09 0:08 /sbin/auditd
root 936 0.0 0.0 201248 1112 ? Ssl Jan09 0:00 /usr/sbin/gssproxy -D
root 1358 0.0 0.0 150220 3412 ? Dsl Jan09 3:57 /usr/bin/monit -I
root 1375 0.0 0.0 105996 3728 ? Ss Jan09 0:05 /usr/sbin/sshd -D
root 1852 0.0 0.0 178116 2508 ? S Jan16 0:00 /usr/sbin/CROND -n
root 1864 0.0 0.0 89616 3912 ? S Jan16 0:00 /usr/sbin/sendmail -FCronDaemon -i -odi -oem -oi -t -f root
root 1879 0.0 0.0 89584 3908 ? D Jan16 0:00 /usr/sbin/postdrop -r
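
Side note: ps aux | grep D also matches unrelated lines (e.g. daemons run with a -D flag); a tighter way to list only uninterruptible tasks, along with the kernel wait channel they are stuck in, is something like:

ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'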

~0031123

thompsop (reporter)

Regarding diito's items, I would suggest appending a sosreport, blocked task output, and kdump to this ticket, and once they are here I will reopen the Red Hat ticket. Otherwise you could jump through the hoops, get a Red Hat Bugzilla login, and append those items there. Or wait until -820.

I was not completely certain I was correct when I told Red Hat that MD was involved in my hangs. I am not explicitly using the MD RAID tools, but in the blocked task report mdX_raid1 showed up with a surprisingly high wait time for a module I did not believe I was using. If the -820 kernel does not fix things for me, I will reopen the ticket.
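
If it helps anyone else make the same determination, here is a quick way to check whether md sits anywhere in the stack under the affected filesystem (device names below are examples, not taken from my setup):

cat /proc/mdstat       # any md arrays assembled at all?
findmnt -t xfs         # which devices back the xfs mounts
lsblk -s /dev/dm-4     # walk downwards from a dm/LVM device to the underlying md or disks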

~0031129

slayerduck (reporter)

I'm experiencing the same problems; my entire array got stuck and I had to reset the server. I was on 3.10.0-693.11.6.el7.x86_64 and have now upgraded to 3.10.0-693.17.1.el7.x86_64, but thompsop said it recurred for him, so I don't doubt it's going to happen again for me.

Using an array of SSDs with mdadm.

Jan 31 06:36:01 vita kernel: INFO: task xfsaild/md0:1190 blocked for more than 120 seconds.
Jan 31 06:36:01 vita kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 31 06:36:01 vita kernel: xfsaild/md0 D ffff88203d901fa0 0 1190 2 0x00000000
Jan 31 06:36:01 vita kernel: Call Trace:
Jan 31 06:36:01 vita kernel: [<ffffffff816ab6d9>] schedule+0x29/0x70
Jan 31 06:36:01 vita kernel: [<ffffffffc02c7d36>] _xfs_log_force+0x1c6/0x2c0 [xfs]
Jan 31 06:36:01 vita kernel: [<ffffffff810c6440>] ? wake_up_state+0x20/0x20
Jan 31 06:36:01 vita kernel: [<ffffffffc02d3bbc>] ? xfsaild+0x16c/0x6f0 [xfs]
Jan 31 06:36:01 vita kernel: [<ffffffffc02c7e5c>] xfs_log_force+0x2c/0x70 [xfs]
Jan 31 06:36:01 vita kernel: [<ffffffffc02d3a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Jan 31 06:36:01 vita kernel: [<ffffffffc02d3bbc>] xfsaild+0x16c/0x6f0 [xfs]
Jan 31 06:36:01 vita kernel: [<ffffffffc02d3a50>] ? xfs_trans_ail_cursor_first+0x90/0x90 [xfs]
Jan 31 06:36:01 vita kernel: [<ffffffff810b252f>] kthread+0xcf/0xe0
Jan 31 06:36:01 vita kernel: [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40
Jan 31 06:36:01 vita kernel: [<ffffffff816b8798>] ret_from_fork+0x58/0x90
Jan 31 06:36:01 vita kernel: [<ffffffff810b2460>] ? insert_kthread_work+0x40/0x40

~0031130

slayerduck (reporter)

Sorry for the double post, but is there an ETA for the release of 3.10.0-820.el7, or somewhere I can grab it?

Issue History
Date Modified Username Field Change
2017-09-18 14:19 thompsop New Issue
2017-10-18 11:34 thompsop Note Added: 0030400
2017-10-25 21:44 thompsop Note Added: 0030468
2017-12-03 10:21 Neil Mukerji Note Added: 0030693
2017-12-03 18:29 thompsop Note Added: 0030694
2017-12-15 21:49 gregb Note Added: 0030761
2018-01-05 14:14 smuku@salesforce.com Note Added: 0030855
2018-01-05 15:13 gbeeley Note Added: 0030856
2018-01-05 16:08 thompsop Tag Attached: xfs
2018-01-05 16:09 thompsop Tag Attached: upstream
2018-01-05 16:10 thompsop Tag Attached: 7.4
2018-01-05 16:10 thompsop Tag Attached: file system
2018-01-22 12:11 joerg Note Added: 0030981
2018-01-28 16:05 thompsop Note Added: 0031076
2018-01-28 17:00 TrevorH Note Added: 0031077
2018-01-29 14:54 thompsop Note Added: 0031084
2018-01-29 15:20 joerg Note Added: 0031087
2018-01-29 18:08 diito Note Added: 0031091
2018-01-30 02:43 thompsop Note Added: 0031100
2018-01-30 02:44 thompsop Note Added: 0031101
2018-01-30 13:54 thompsop Note Added: 0031109
2018-01-30 15:42 thompsop Note Added: 0031110
2018-01-30 16:15 TrevorH Note Added: 0031111
2018-01-30 16:36 thompsop Note Added: 0031113
2018-01-30 18:06 diito Note Added: 0031119
2018-01-30 19:55 thompsop Note Added: 0031121
2018-01-30 19:55 joerg Note Added: 0031122
2018-01-30 20:07 thompsop Note Added: 0031123
2018-01-31 12:05 slayerduck Note Added: 0031129
2018-01-31 12:26 slayerduck Note Added: 0031130