CentOS Bug Tracker

View Issue Details
ID:               0006217
Project:          CentOS-5
Category:         kernel
View Status:      public
Date Submitted:   2013-01-25 06:47
Last Update:      2013-06-11 18:28
Reporter:         storm9c1
Priority:         normal
Severity:         crash
Reproducibility:  always
Status:           new
Resolution:       open
Platform:         x86_64
OS:               CentOS-5
OS Version:       5.9
Product Version:  5.9
Target Version:
Fixed in Version:
Summary: 0006217: XFS hang on reboot
Description:
Special configuration -- the md raid1 root filesystem is on XFS. I know this isn't formally supported, but I've been running RHEL and CentOS this way since the Red Hat 7.2 days, and I have many machines deployed and tested with this configuration.

As of CentOS 5.9, kernel 2.6.18-348.el5, a reboot of the system results in a hang immediately after printing:

"md: md1 switched to read-only mode"

After 120 seconds, a traceback is produced. I can provide the traceback later if it would be helpful.

It's important to note that:
* Same hardware, kickstarted with same configuration -- CentOS 5.8 does not hang on reboot.
* Same hardware, kickstarted with same configuration -- CentOS 5.9 and downgraded kernel to 2.6.18-308.el5 does NOT hang on reboot.
* Same hardware, kickstarted with same configuration -- CentOS 5.[678] and upgraded kernel to 2.6.18-348.el5 hangs on reboot.
* Same hardware, kickstarted with same configuration -- CentOS 5.9 using ext3 for root fs does not hang on reboot.
* Same hardware, kickstarted with same configuration -- CentOS 5.9 using ext3 for root fs, and a XFS data partition (md and non-md) does not hang on reboot.
* Same hardware, kickstarted with same configuration -- CentOS 5.9 and NO md raid1 does NOT hang on reboot.
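
For anyone repeating the downgrade tests above, a minimal sketch of swapping in the older kernel (assuming the 308 package is still reachable in your repos or the vault; paths are CentOS 5 defaults):

  # Install the older, known-good kernel alongside the current one;
  # CentOS 5 keeps multiple kernel versions installed in parallel.
  yum install kernel-2.6.18-308.el5

  # Select it at boot: set "default=" in /boot/grub/grub.conf to the
  # 2.6.18-308 entry (entries are numbered from 0, top to bottom).
  vi /boot/grub/grub.conf
  reboot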

I suspect something changed between kernel levels 308 and 348 that interferes with XFS on an md raid1 root filesystem.

Looking at the changelog for the 348 kernel, I do see mention of a FREEZE/THAW change that could have affected XFS, and perhaps a few other suspicious changes that could also impact XFS, but I'm not sure.
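
One quick way to scan that changelog is rpm's built-in changelog query; a minimal sketch (the keyword filter is just a guess at what's relevant):

  # Print the packaged changelog for the 348 kernel and filter for
  # freeze/thaw, raid, and xfs related entries.
  rpm -q --changelog kernel-2.6.18-348.el5 | grep -iE 'freeze|thaw|raid|xfs'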

Any insight as to the cause is appreciated. I can stay at 5.8 for now, but this seems like a potentially serious regression (albeit an edge case) that shouldn't go unreported, either here or upstream.
Steps To Reproduce:
* Kickstart a system with CentOS 5.9 (either with a patched Anaconda, or by migrating the root fs to XFS in %post or manually).
* System stages OK, and reboots OK from Anaconda. System runs normally after being staged.
* Log in and reboot (e.g. with shutdown -r, reboot, or exec init 6).
* System hangs after performing all shutdown operations (never completes reboot). Sometimes a traceback is produced.
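
Before the reboot step, it may help to confirm the preconditions with standard tools; a small sketch:

  uname -r              # expect the suspect 2.6.18-348.el5 kernel
  mount | grep ' / '    # root filesystem should show type xfs
  cat /proc/mdstat      # raid1 arrays should be active and [UU]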
Additional Information:
Also tried the newest 348.1.1 kernel, as well as the Oracle EL and Scientific Linux kernels at the same level (348); all have the same issue. Therefore I am convinced it's an upstream issue, but I have no mechanism to report it except here.

In these cases, md raid1 is being used, / is on XFS, /boot is on XFS, and swap is also raid1. It's a simple config with local SATA disks on generic x86_64 hardware. No hardware raid.
Tags: No tags attached.
Attached Files: CentOS 5.9 XFS sysrq (36,126 bytes) 2013-02-02 21:17


- Notes
(0016345)
storm9c1 (reporter)
2013-01-25 16:57

Same with 32-bit (2.6.18-348) using the XFS kmod. Same kickstart, same config, same hardware. Reboot hangs in the same place when XFS and md raid1 are in use.
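
For the 32-bit case, a quick check that XFS really comes from the separate kmod package (package name as shipped for CentOS 5; verify locally):

  rpm -q kmod-xfs          # the XFS kernel module package on 32-bit
  modinfo xfs | head -5    # confirms which xfs.ko is actually loaded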
(0016361)
storm9c1 (reporter)
2013-01-30 02:59

-- More detail about the hang including traceback --

Using 5.9 kernel (348) without md raid:

Please stand by while rebooting the system...
md: stopping all md devices.
Synchronizing SCSI cache for disk sda:
Restarting system.
.
machine restart
(reboots normally)


With md raid1:

Unmounting pipe file systems:
Unmounting file systems:
Please stand by while rebooting the system...
md: stopping all md devices.
md: md2 switched to read-only mode.
md: md1 switched to read-only mode.
(hang)


Traceback after 120 seconds:

INFO: task reboot:2063 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
reboot D ffff810037df37e0 0 2063 1 19 (NOTLB)
 ffff81005890ba08 0000000000000082 ffff81005890ba58 ffff81005beb1ea0
 0000000000000001 0000000000000007 ffff810058d67040 ffff810037df37e0
 000000596dd8a1e6 0000000000003df4 ffff810058d67228 000000008008d76f
Call Trace:
 [<ffffffff8002e4bc>] __wake_up+0x38/0x4f
 [<ffffffff80223bce>] md_write_start+0xf2/0x108
 [<ffffffff800a3bc2>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8000ab62>] get_page_from_freelist+0x380/0x442
 [<ffffffff880b102c>] :raid1:make_request+0x38/0x5d8
 [<ffffffff8001c839>] generic_make_request+0x211/0x228
 [<ffffffff8002389f>] mempool_alloc+0x31/0xe7
 [<ffffffff8001a98f>] vsnprintf+0x5d7/0xb54
 [<ffffffff80033695>] submit_bio+0xe6/0xed
 [<ffffffff8807f801>] :xfs:_xfs_buf_ioapply+0x1f2/0x254
 [<ffffffff8807f89c>] :xfs:xfs_buf_iorequest+0x39/0x64
 [<ffffffff8808386c>] :xfs:xfs_bdstrat_cb+0x36/0x3a
 [<ffffffff8807c0a8>] :xfs:xfs_bwrite+0x5e/0xba
 [<ffffffff88077669>] :xfs:xfs_syncsub+0x119/0x226
 [<ffffffff88084ce2>] :xfs:xfs_fs_sync_super+0x33/0xdd
 [<ffffffff8010aa44>] quota_sync_sb+0x2e/0xf0
 [<ffffffff800e55bd>] __fsync_super+0x1b/0x9e
 [<ffffffff800e578a>] fsync_super+0x9/0x16
 [<ffffffff800e57c1>] fsync_bdev+0x2a/0x3b
 [<ffffffff8014ea59>] invalidate_partition+0x28/0x40
 [<ffffffff802225a8>] do_md_stop+0xa0/0x2ec
 [<ffffffff80224d41>] md_notify_reboot+0x5f/0x120
 [<ffffffff80067565>] notifier_call_chain+0x20/0x32
 [<ffffffff8009de98>] blocking_notifier_call_chain+0x22/0x36
 [<ffffffff8009e220>] kernel_restart_prepare+0x18/0x29
 [<ffffffff8009e280>] kernel_restart+0x9/0x46
 [<ffffffff8009e40a>] sys_reboot+0x146/0x1c7
 [<ffffffff8003b291>] hrtimer_try_to_cancel+0x4a/0x53
 [<ffffffff8005a753>] hrtimer_cancel+0xc/0x16
 [<ffffffff80063cf9>] do_nanosleep+0x47/0x70
 [<ffffffff8005a640>] hrtimer_nanosleep+0x58/0x118
 [<ffffffff800a5b84>] hrtimer_wakeup+0x0/0x22
 [<ffffffff8001e2f2>] sigprocmask+0xb7/0xdb
 [<ffffffff80054fe6>] sys_nanosleep+0x4c/0x62
 [<ffffffff8005d116>] system_call+0x7e/0x83


Basic fs info:

Filesystem Size Used Avail Use% Mounted on
/dev/md3 4.9G 784M 4.2G 16% /
/dev/md2 108M 11M 97M 11% /boot
tmpfs 689M 0 689M 0% /dev/shm

[root@test9][/root]# swapon -s
Filename Type Size Used Priority
/dev/md1 partition 2947832 0 -1


[root@test9][/root]# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 sdb1[1] sda1[0]
      128384 blocks [2/2] [UU]
      
md1 : active raid1 sdb2[1] sda2[0]
      2947840 blocks [2/2] [UU]
      
md3 : active raid1 sdb3[1] sda3[0]
      5116608 blocks [2/2] [UU]
      
unused devices: <none>
(0016408)
storm9c1 (reporter)
2013-02-02 21:18

SysRq-t output file attached.
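
For reference, a minimal sketch of how SysRq-t output can be captured on a still-responsive system (on a hung console you would press Alt+SysRq+t and capture via a serial console instead):

  echo 1 > /proc/sys/kernel/sysrq    # enable the SysRq interface
  echo t > /proc/sysrq-trigger       # dump all task states to the kernel log
  dmesg > sysrq-t.txt                # save the dump for attaching here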
(0016458)
storm9c1 (reporter)
2013-02-07 19:02

Got some answers from the XFS mailing list:

"I can tell you the exact commit in the RHEL 5.9 tree that
caused this regression:

11ff4073: [md] Fix reboot stall with raid on megaraid_sas controller

The result is that the final shutdown of md devices now uses a
"force readonly" method, which means it ignores the fact that a
filesystem may still be active on top of it and rips the device out
from under the filesystem. This really only affects root devices,
and given that XFS is not supported as a root device on RHEL, it
isn't in the QE test matrix and so the problem was never noticed."


Next step, I'm going to look into filing a bug with the upstream vendor now that we know the cause.
(0016484)
storm9c1 (reporter)
2013-02-14 00:39

Filed upstream: RH Bugzilla bug 910635

https://bugzilla.redhat.com/show_bug.cgi?id=910635
(0017180)
centos128 (reporter)
2013-04-09 12:44

I can confirm this issue. It also happens on one of my machines with a similar setup. There is no problem shutting down/rebooting CentOS 5.9 with kernel vmlinuz-2.6.18-308.24.1.el5 (the latest 5.8 kernel), but from the first 5.9 kernel up to the current vmlinuz-2.6.18-348.3.1.el5 the problem occurs.
(0017181)
storm9c1 (reporter)
2013-04-09 14:52

Hi centos128, you may want to report this upstream as well. The RH bug ID is 910635. The more of us on board, the quicker this may get resolved. It's been months since I noticed this, and unfortunately for me and my customers it has forced me to start evaluating Ubuntu after using RH and CentOS products for over 15 years. Ubuntu supports XFS on the root FS natively. The biggest challenge with Ubuntu is that its "Kickstart" compatibility is weak: you need extra preseed work to get it to work flawlessly, and it's full of quirks. I also like RPM packaging better than Debian packaging.

Please please please fix this XFS problem in RH/CentOS.
(0017182)
centos128 (reporter)
2013-04-09 15:26

Although I have a Red Hat Bugzilla account and am logged in with it, at https://bugzilla.redhat.com/show_bug.cgi?id=910635 I get the error message
"You are not authorized to access bug #910635."
(0017183)
storm9c1 (reporter)
2013-04-09 17:25

I think they made the bug private. Nice. Well then, perhaps file a new one and reference 910635. You aren't missing much in that bug; it's pretty much a cut/paste of this one, plus a few discussion items. They haven't made much progress, if any, yet.
(0017188)
centos128 (reporter)
2013-04-10 09:28

Here you are: https://bugzilla.redhat.com/show_bug.cgi?id=950460
RH bug ID is 950460
(0017192)
storm9c1 (reporter)
2013-04-10 15:11

Hehe, well, your bug is private too and I can't read it. I wonder if adding each other to the CC list of the bug each of us opened would let us see them? It's supposed to, but I'm not sure it will let me add you. It looks like I can only add an email address, not a registered user, which is annoying.
(0017193)
toracat (developer)
2013-04-10 15:16

Yes, if you use the email address registered with BZ, you can add other people to the CC list.
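
If the python-bugzilla command-line tool is available, the CC change can reportedly be scripted as well; a hedged sketch (bug ID from this thread, email address purely hypothetical):

  bugzilla login                                     # authenticate against bugzilla.redhat.com
  bugzilla modify 910635 --cc colleague@example.com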
(0017370)
storm9c1 (reporter)
2013-05-03 18:13

Patch provided upstream. Testing it now.
(0017371)
centos128 (reporter)
2013-05-04 11:58

How do you test?
(0017373)
storm9c1 (reporter)
2013-05-06 15:15

I rebuilt the kernel RPM from the SRPM with their patch added in the spec file. It didn't solve my problem. It might need more patching, or perhaps the patch is needed in conjunction with another kernel version; none of this is clear in the BZ report at this time.
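
For anyone else wanting to try the patch, a rough sketch of the rebuild (the patch filename is hypothetical, and the exact spec conventions should be checked against the RHEL 5 kernel spec itself):

  # Unpack the source RPM; on CentOS 5 it lands under /usr/src/redhat.
  rpm -ivh kernel-2.6.18-348.el5.src.rpm
  cp xfs-md-reboot-fix.patch /usr/src/redhat/SOURCES/   # hypothetical name
  cd /usr/src/redhat/SPECS
  # Edit kernel-2.6.spec: add a PatchNNNN: line and the matching %patch
  # (or ApplyPatch) call in %prep, following the spec's existing pattern.
  rpmbuild -bb --target x86_64 kernel-2.6.spec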
(0017555)
storm9c1 (reporter)
2013-06-11 18:05

RH removed the private flag for this bug in Bugzilla (for now).
(0017556)
toracat (developer)
2013-06-11 18:28

@storm9c1

Thanks for the note.

- Issue History
Date Modified Username Field Change
2013-01-25 06:47 storm9c1 New Issue
2013-01-25 16:57 storm9c1 Note Added: 0016345
2013-01-30 02:59 storm9c1 Note Added: 0016361
2013-02-02 21:17 storm9c1 File Added: CentOS 5.9 XFS sysrq
2013-02-02 21:18 storm9c1 Note Added: 0016408
2013-02-07 19:02 storm9c1 Note Added: 0016458
2013-02-14 00:39 storm9c1 Note Added: 0016484
2013-04-09 12:44 centos128 Note Added: 0017180
2013-04-09 14:52 storm9c1 Note Added: 0017181
2013-04-09 15:26 centos128 Note Added: 0017182
2013-04-09 17:25 storm9c1 Note Added: 0017183
2013-04-10 09:28 centos128 Note Added: 0017188
2013-04-10 15:11 storm9c1 Note Added: 0017192
2013-04-10 15:16 toracat Note Added: 0017193
2013-05-03 18:13 storm9c1 Note Added: 0017370
2013-05-04 11:58 centos128 Note Added: 0017371
2013-05-06 15:15 storm9c1 Note Added: 0017373
2013-06-11 18:05 storm9c1 Note Added: 0017555
2013-06-11 18:28 toracat Note Added: 0017556

