View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0006217 | CentOS-5 | kernel | public | 2013-01-25 06:47 | 2013-06-11 18:28 |

| Field | Value |
|---|---|
| Reporter | storm9c1 |
| Priority | normal |
| Severity | crash |
| Reproducibility | always |
| Status | new |
| Resolution | open |
| Platform | x86_64 |
| OS | CentOS-5 |
| OS Version | 5.9 |
| Product Version | 5.9 |
| Target Version | |
| Fixed in Version | |
| Summary | 0006217: XFS hang on reboot |
Description

Special configuration: the md raid1 root filesystem is on XFS. I know this isn't formally supported, but I've been running RHEL and CentOS this way since the Red Hat 7.2 days and have many machines deployed and tested with this configuration.

As of CentOS 5.9, kernel 2.6.18-348.el5, a reboot of the system results in a hang immediately after printing:

"md: md1 switched to read-only mode"

After 120 seconds, a traceback is produced. I can provide the traceback later if it would be helpful.

It's important to note that (same hardware, kickstarted with the same configuration in every case):

* CentOS 5.8 does not hang on reboot.
* CentOS 5.9 with the kernel downgraded to 2.6.18-308.el5 does NOT hang on reboot.
* CentOS 5.6/5.7/5.8 with the kernel upgraded to 2.6.18-348.el5 hangs on reboot.
* CentOS 5.9 using ext3 for the root fs does not hang on reboot.
* CentOS 5.9 using ext3 for the root fs and an XFS data partition (md and non-md) does not hang on reboot.
* CentOS 5.9 with NO md raid1 does NOT hang on reboot.

I suspect something changed between kernel levels 308 and 348 that interferes with XFS and md raid1 on the root filesystem. Looking at the changelog for the 348 kernel, I do see mention of a FREEZE/THAW change that could have irritated XFS, and perhaps a few other suspicious changes that could also impact XFS, but I'm not sure.

Any insight as to the cause is appreciated. I can stay at 5.8 for now, but this seems like a potentially serious regression (albeit an edge case) that shouldn't go unreported, either here or upstream.
Steps To Reproduce

* Kickstart a system with CentOS 5.9 (either with a patched Anaconda, or migrate the root fs to XFS in %post or manually).
* The system stages OK and reboots OK from Anaconda. The system runs normally after being staged.
* Log in and reboot (e.g. with shutdown -r, reboot, or exec init 6).
* The system hangs after performing all shutdown operations and never completes the reboot. Sometimes a traceback is produced.
Additional Information

Also tried the newest 348.1.1 kernel, as well as the Oracle EL and Scientific Linux kernels at the same level (348); all have the same issue. Therefore I am convinced it's an upstream issue, but I have no mechanism to report that except here.

In these cases, md raid1 is being used, / is on XFS, /boot is on XFS, and swap is also raid1. It's a simple config with local SATA disks on generic x86_64 hardware. No hardware raid.
Tags: No tags attached.

Attached Files: CentOS 5.9 XFS sysrq
Notes
storm9c1 (reporter) 2013-01-25 16:57
Same with 32-bit (2.6.18-348) using the XFS kmod. Same kickstart, same config, same hardware. The reboot hangs in the same place when XFS and md raid1 are in use.
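
For context, the note above mentions the "XFS kmod": on 32-bit CentOS 5 the XFS driver is typically provided by a separate kernel module package rather than by the stock kernel. A minimal sketch of that setup follows; the kmod-xfs package name (from the CentOS extras repository) is an assumption, not something stated in this note, and the reporter's actual installation method may have differed:

```
# Hedged sketch: enable XFS on 32-bit CentOS 5 via the out-of-tree module package.
# Package names are assumed; adjust to whatever repository actually provides them.
yum install kmod-xfs xfsprogs

# Load the module and confirm the kernel now recognizes XFS.
modprobe xfs
grep xfs /proc/filesystems
```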
|
storm9c1 (reporter) 2013-01-30 02:59
More detail about the hang, including the traceback.

Using the 5.9 kernel (348) without md raid:

    Please stand by while rebooting the system...
    md: stopping all md devices.
    Synchronizing SCSI cache for disk sda:
    Restarting system.
    .
    machine restart

(reboots normally)

With md raid1:

    Unmounting pipe file systems:
    Unmounting file systems:
    Please stand by while rebooting the system...
    md: stopping all md devices.
    md: md2 switched to read-only mode.
    md: md1 switched to read-only mode.

(hang)

Traceback after 120 seconds:

    INFO: task reboot:2063 blocked for more than 120 seconds.
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    reboot        D ffff810037df37e0     0  2063      1            19 (NOTLB)
     ffff81005890ba08 0000000000000082 ffff81005890ba58 ffff81005beb1ea0
     0000000000000001 0000000000000007 ffff810058d67040 ffff810037df37e0
     000000596dd8a1e6 0000000000003df4 ffff810058d67228 000000008008d76f
    Call Trace:
     [<ffffffff8002e4bc>] __wake_up+0x38/0x4f
     [<ffffffff80223bce>] md_write_start+0xf2/0x108
     [<ffffffff800a3bc2>] autoremove_wake_function+0x0/0x2e
     [<ffffffff8000ab62>] get_page_from_freelist+0x380/0x442
     [<ffffffff880b102c>] :raid1:make_request+0x38/0x5d8
     [<ffffffff8001c839>] generic_make_request+0x211/0x228
     [<ffffffff8002389f>] mempool_alloc+0x31/0xe7
     [<ffffffff8001a98f>] vsnprintf+0x5d7/0xb54
     [<ffffffff80033695>] submit_bio+0xe6/0xed
     [<ffffffff8807f801>] :xfs:_xfs_buf_ioapply+0x1f2/0x254
     [<ffffffff8807f89c>] :xfs:xfs_buf_iorequest+0x39/0x64
     [<ffffffff8808386c>] :xfs:xfs_bdstrat_cb+0x36/0x3a
     [<ffffffff8807c0a8>] :xfs:xfs_bwrite+0x5e/0xba
     [<ffffffff88077669>] :xfs:xfs_syncsub+0x119/0x226
     [<ffffffff88084ce2>] :xfs:xfs_fs_sync_super+0x33/0xdd
     [<ffffffff8010aa44>] quota_sync_sb+0x2e/0xf0
     [<ffffffff800e55bd>] __fsync_super+0x1b/0x9e
     [<ffffffff800e578a>] fsync_super+0x9/0x16
     [<ffffffff800e57c1>] fsync_bdev+0x2a/0x3b
     [<ffffffff8014ea59>] invalidate_partition+0x28/0x40
     [<ffffffff802225a8>] do_md_stop+0xa0/0x2ec
     [<ffffffff80224d41>] md_notify_reboot+0x5f/0x120
     [<ffffffff80067565>] notifier_call_chain+0x20/0x32
     [<ffffffff8009de98>] blocking_notifier_call_chain+0x22/0x36
     [<ffffffff8009e220>] kernel_restart_prepare+0x18/0x29
     [<ffffffff8009e280>] kernel_restart+0x9/0x46
     [<ffffffff8009e40a>] sys_reboot+0x146/0x1c7
     [<ffffffff8003b291>] hrtimer_try_to_cancel+0x4a/0x53
     [<ffffffff8005a753>] hrtimer_cancel+0xc/0x16
     [<ffffffff80063cf9>] do_nanosleep+0x47/0x70
     [<ffffffff8005a640>] hrtimer_nanosleep+0x58/0x118
     [<ffffffff800a5b84>] hrtimer_wakeup+0x0/0x22
     [<ffffffff8001e2f2>] sigprocmask+0xb7/0xdb
     [<ffffffff80054fe6>] sys_nanosleep+0x4c/0x62
     [<ffffffff8005d116>] system_call+0x7e/0x83

Basic fs info:

    Filesystem            Size  Used Avail Use% Mounted on
    /dev/md3              4.9G  784M  4.2G  16% /
    /dev/md2              108M   11M   97M  11% /boot
    tmpfs                 689M     0  689M   0% /dev/shm

    [root@test9][/root]# swapon -s
    Filename                        Type            Size    Used    Priority
    /dev/md1                        partition       2947832 0       -1

    [root@test9][/root]# cat /proc/mdstat
    Personalities : [raid1]
    md2 : active raid1 sdb1[1] sda1[0]
          128384 blocks [2/2] [UU]

    md1 : active raid1 sdb2[1] sda2[0]
          2947840 blocks [2/2] [UU]

    md3 : active raid1 sdb3[1] sda3[0]
          5116608 blocks [2/2] [UU]

    unused devices: <none>
|
storm9c1 (reporter) 2013-02-02 21:18
SysRq-t output file attached.
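
How the attached dump was captured is not stated in this report; for readers who want to collect the same kind of data, a rough sketch of producing a SysRq-t task dump on CentOS 5 follows. On a machine that is already hung you would instead press Alt+SysRq+T on a local or serial console, assuming the magic SysRq key is enabled:

```
# Hedged sketch: capture a SysRq-t dump of every task's state on a running system.
echo 1 > /proc/sys/kernel/sysrq     # enable the magic SysRq interface
echo t > /proc/sysrq-trigger        # write the task dump to the kernel ring buffer
dmesg > sysrq-t.txt                 # save the output, e.g. for attaching to a bug report
```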
|
storm9c1 (reporter) 2013-02-07 19:02
Got some answers from the XFS mailing list:

"I can tell you the exact commit in the RHEL 5.9 tree that caused this regression:

11ff4073: [md] Fix reboot stall with raid on megaraid_sas controller

The result is that the final shutdown of md devices now uses a 'force readonly' method, which means it ignores the fact that a filesystem may still be active on top of it and rips the device out from under the filesystem. This really only affects root devices, and given that XFS is not supported as a root device on RHEL, it isn't in the QE test matrix and so the problem was never noticed."

Next step: I'm going to look into filing a bug with the upstream vendor now that we know the cause.
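
One way to check whether a given installed kernel carries the change blamed above is to search its RPM changelog. A rough sketch follows; the grep pattern is only a guess at the changelog wording, and the package versions shown are simply the two kernels compared earlier in this report:

```
# Hedged sketch: look for the megaraid_sas reboot-stall entry in the changelogs of the
# affected (348) and unaffected (308) kernels, assuming both packages are installed.
rpm -q --changelog kernel-2.6.18-348.el5 | grep -i -B1 -A1 megaraid
rpm -q --changelog kernel-2.6.18-308.el5 | grep -i -B1 -A1 megaraid
```

If only the 348 changelog reports a matching entry, that is consistent with the regression window described in this report.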
|
storm9c1 (reporter) 2013-02-14 00:39
Filed upstream: RH Bugzilla bug 910635
https://bugzilla.redhat.com/show_bug.cgi?id=910635
|
centos128 (reporter) 2013-04-09 12:44
I can confirm this issue; it also happens on one of my machines with a similar setup. There is no problem shutting down or rebooting CentOS 5.9 with kernel vmlinuz-2.6.18-308.24.1.el5 (the latest 5.8 kernel), but from the first 5.9 kernel up to the current kernel, vmlinuz-2.6.18-348.3.1.el5, the problem occurs.
|
storm9c1 (reporter) 2013-04-09 14:52
Hi centos128, you may want to report this upstream as well; the RH bug ID is 910635. The more of us on board, the quicker this may get resolved. It's been months now since I noticed this, and unfortunately for me and my customers it has forced me to begin evaluating Ubuntu after using RH and CentOS products for over 15 years. Ubuntu supports XFS on the root FS natively. The biggest challenge with Ubuntu is that its "Kickstart" compatibility is weak: you need extra preseed work to get it to work flawlessly, and it's full of quirks. I also like RPM packaging better than Debian packaging. Please, please, please fix this XFS problem in RH/CentOS.
|
centos128 (reporter) 2013-04-09 15:26
Although I have a Red Hat Bugzilla account and am logged in with it, https://bugzilla.redhat.com/show_bug.cgi?id=910635 gives me the error message "You are not authorized to access bug #910635."
|
storm9c1 (reporter) 2013-04-09 17:25
I think they made the bug private. Nice. Well then, perhaps file a new one and reference 910635. You aren't missing much in that bug; it's pretty much a cut/paste of this one, plus a few discussion items. They haven't made much progress yet, if any.
|
centos128 (reporter) 2013-04-10 09:28
Here you are: https://bugzilla.redhat.com/show_bug.cgi?id=950460 (RH bug ID 950460).
|
storm9c1 (reporter) 2013-04-10 15:11
Heh, well, your bug is private too and I can't read it. I wonder whether adding each other to the "CC List" of the bugs we each opened would allow us to see them? It's supposed to, but I'm not sure it will let me add you; it looks like I can only add an email address, not a registered user, which is annoying.
|
toracat (manager) 2013-04-10 15:16
Yes, if you use the email address registered with BZ, you can add other people to the CC list.
|
storm9c1 (reporter) 2013-05-03 18:13
Patch provided upstream. Testing it now.
|
centos128 (reporter) 2013-05-04 11:58
How do you test?
|
storm9c1 (reporter) 2013-05-06 15:15
I rebuilt the kernel RPM from the SRPM and added their patch to the spec file. It didn't solve my problem. It might need more patching, or perhaps this patch is needed in conjunction with another kernel version; none of this is clear in the BZ report as of this time.
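
For reference, a rough sketch of the kind of rebuild described above. The patch file name and patch number are hypothetical, the kernel version shown is only illustrative, and the exact spec-file mechanics can vary between kernel spec revisions:

```
# Hedged sketch: rebuild the CentOS 5 kernel RPM with an extra patch added to the spec.
rpm -ivh kernel-2.6.18-348.1.1.el5.src.rpm         # unpacks under /usr/src/redhat on el5 (when run as root)
cp md-reboot-fix.patch /usr/src/redhat/SOURCES/    # hypothetical patch file taken from the BZ report

# In /usr/src/redhat/SPECS/kernel-2.6.spec, declare and apply the patch, e.g.:
#   Patch99999: md-reboot-fix.patch
#   %patch99999 -p1
# (patch numbering and application macros depend on the spec revision)

rpmbuild -bb --target=x86_64 /usr/src/redhat/SPECS/kernel-2.6.spec
```

The resulting binary RPMs land under /usr/src/redhat/RPMS/x86_64/ and can be installed alongside the stock kernel for testing.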
|
storm9c1 (reporter) 2013-06-11 18:05
RH removed the private flag for this bug in Bugzilla (for now).
|
toracat (manager) 2013-06-11 18:28
@storm9c1 Thanks for the note.
Issue History

| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2013-01-25 06:47 | storm9c1 | New Issue | |
| 2013-01-25 16:57 | storm9c1 | Note Added: 0016345 | |
| 2013-01-30 02:59 | storm9c1 | Note Added: 0016361 | |
| 2013-02-02 21:17 | storm9c1 | File Added: CentOS 5.9 XFS sysrq | |
| 2013-02-02 21:18 | storm9c1 | Note Added: 0016408 | |
| 2013-02-07 19:02 | storm9c1 | Note Added: 0016458 | |
| 2013-02-14 00:39 | storm9c1 | Note Added: 0016484 | |
| 2013-04-09 12:44 | centos128 | Note Added: 0017180 | |
| 2013-04-09 14:52 | storm9c1 | Note Added: 0017181 | |
| 2013-04-09 15:26 | centos128 | Note Added: 0017182 | |
| 2013-04-09 17:25 | storm9c1 | Note Added: 0017183 | |
| 2013-04-10 09:28 | centos128 | Note Added: 0017188 | |
| 2013-04-10 15:11 | storm9c1 | Note Added: 0017192 | |
| 2013-04-10 15:16 | toracat | Note Added: 0017193 | |
| 2013-05-03 18:13 | storm9c1 | Note Added: 0017370 | |
| 2013-05-04 11:58 | centos128 | Note Added: 0017371 | |
| 2013-05-06 15:15 | storm9c1 | Note Added: 0017373 | |
| 2013-06-11 18:05 | storm9c1 | Note Added: 0017555 | |
| 2013-06-11 18:28 | toracat | Note Added: 0017556 | |