View Issue Details

ID:               0015226
Project:          CentOS-7
Category:         kernel
View Status:      public
Last Update:      2018-09-16 09:25
Reporter:         tiredpixel
Priority:         normal
Severity:         block
Reproducibility:  always
Status:           new
Resolution:       open
Platform:         x86_64
OS:               CentOS
OS Version:       7
Product Version:  7.5.1804
Target Version:
Fixed in Version:
Summary:          0015226: 3.10.0-862.11.6 kernel panic

Description

On three dedicated servers hosted at Hetzner, all with almost-identical configurations (KVM hosts, deployed from scripts), the kernels were upgraded via the package manager and the servers rebooted. 1 server rebooted fine, but 2 servers didn't come up. Remote shutdown signals were sent, but failed. Automatic power resets were sent, but also failed to start the systems. After remote consoles were attached, it was seen that the 2 servers had kernel panics, reproducible on every reboot.

Selecting the previous kernel resulted in successful booting. Unfortunately, the systems were not configured to persist boot logs (the CentOS default, I believe), and they are production systems, so they were not rebooted again just to collect the logs. A staging system with an almost identical system and software configuration did not experience the issue, however (though note it runs a different BIOS version, as below).

Steps To Reproduce

Upgrade to the latest CentOS 7 kernel. Reboot. On some systems at Hetzner, it panics; on others, it works.

I discovered that kernel

  3.10.0-862.11.6

causes a panic on 2 servers, but works on a 3rd server. However, kernel

  3.10.0-862.9.1

works on all 3 servers.
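
For reference, the upgrade/reboot cycle was nothing unusual; from memory it amounted to the following (the exact invocation is an assumption, not a transcript):

  # from memory; exact flags are an assumption
  yum -y update kernel
  reboot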

Additional Information

Note that it happened on

  PX61-NVMe #xxxxx1
  EX41 #xxxxx2

but not on

  PX61-NVMe #xxxxx3

all of them running exactly the same versions of CentOS 7, same stacks, and same packages. The 3 systems above were configured into 2 VLANs through vSwitches, but the panic happened even with the vSwitches detached.

p.s. Interestingly, I see they're running different BIOS versions:

Note that it happened on

  PX61-NVMe #xxxxx1

        Vendor: FUJITSU // American Megatrends Inc.
        Version: V5.0.0.12 R1.18.0.SR.1 for D3417-B2x
        Release Date: 07/02/2018

  EX41 #xxxxx2

        Vendor: FUJITSU // American Megatrends Inc.
        Version: V5.0.0.12 R1.18.0.SR.1 for D3401-H2x
        Release Date: 07/02/2018

but not on

  PX61-NVMe #xxxxx3

        Vendor: FUJITSU // American Megatrends Inc.
        Version: V5.0.0.11 R1.26.0.SR.2 for D3417-B1x
        Release Date: 07/04/2018

Note that the kernel panic occurred for

  V5.0.0.12 R1.18.0.SR.1

Tags: No tags attached.

Activities

TrevorH (manager)   2018-08-29 13:50   ~0032620

And the kernel panic looks like...

tiredpixel (reporter)   2018-08-29 14:29   ~0032621

Unfortunately, I was not able to capture it at the time. However, I've since enabled persistent boot messages, and attempted upgrading to the affected kernel once again. This time, a reboot worked fine without any panic. I rebooted a few times, but was unsuccessful in recreating the problem, whereas yesterday, I rebooted multiple times and got the panic. Nothing has changed in the hardware since, however.

Nevertheless, I realise that without an actual dump of the kernel panic messages, there's not really anything you can do. Thus, as I've now been unsuccessful in recreating it, I understand if you would prefer to close the ticket. I do think there's a valid issue here, though, as it happened on 2 separate servers with multiple reboots, but all I can hope is that this ticket, even if closed, serves as useful reference if somebody else reports the same thing.

TrevorH (manager)   2018-08-29 15:21   ~0032622

So, if it worked the second time after either a `yum reinstall kernel-3.10.0-862.11.6` or a yum remove / yum update cycle, then I would suspect that this is going to be an old bug that no-one has yet managed to track down. There are some circumstances under which a `yum update kernel` will add a new stanza to grub.cfg containing a kernel entry but missing the matching initrd line for that version of the kernel. That means it boots and starts up, but then tries to mount the initramfs ramdisk on / and fails because it is not there. This leads to a relatively short kernel panic backtrace that basically says "unable to find the root filesystem", but with lots of hex mixed in there too.

So far as I have been able to trace this through, a `yum update kernel` actually updates grub.cfg twice. The first time, during the update itself, is when it adds the kernel line to grub.cfg; then later, in the "cleanup" phase, it goes back round and adds the initrd line to it. From the sounds of it, I suspect that when this bug occurs, something has stopped yum from entering its cleanup phase, and thus the grub.cfg stanza is incomplete and unbootable.
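
As a rough illustration only (not an existing tool), a check along these lines would flag a stanza that ended up without its initrd line:

  # illustrative sketch: report any menuentry stanza in grub.cfg that has
  # no initrd/initrd16/initrdefi line before its closing brace
  awk '/^menuentry /{title=$0; seen=0}
       /^[[:space:]]*initrd/{seen=1}
       /^}/{if (title != "" && !seen) print "incomplete stanza: " title; title=""}' \
      /boot/grub2/grub.cfg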

tiredpixel (reporter)   2018-09-05 06:54   ~0032647

Thank you for your suggestion. This does actually sound like an interesting idea, certainly one I wasn't aware of. Unfortunately, I don't think I can find a way to verify that that is indeed what happened here.

Are you aware of any effort to ensure that `grub.cfg` does not enter such a state? Or is that hard to avoid, given how the updates work?
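
In the meantime, the best I can think of on my side is a pre-reboot sanity check along these lines (just a sketch of the idea, using the standard CentOS 7 paths):

  # sketch only: before rebooting, check the newest installed kernel has
  # both its vmlinuz and its initramfs present in /boot
  KVER=$(rpm -q kernel --qf '%{VERSION}-%{RELEASE}.%{ARCH}\n' | sort -V | tail -n1)
  if [ -e "/boot/vmlinuz-$KVER" ] && [ -e "/boot/initramfs-$KVER.img" ]; then
      echo "ok to reboot into $KVER"
  else
      echo "missing /boot files for $KVER; do not reboot yet"
  fi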

TrevorH (manager)   2018-09-05 07:28   ~0032648

No one has yet discovered how or why it happens so there is no fix.

toracat (manager)   2018-09-05 07:44   ~0032649

Many reports can be found in https://bugs.centos.org/view.php?id=6310 (mostly c6, some c7).

tiredpixel (reporter)   2018-09-16 09:25   ~0032730

I have just experienced it again, but this time, on some test VM guests I have. This time, I know exactly what had just happened, since I was working on the boxes at the time, and I was able to retrieve more information.

A package upgrade was running for 5 VM guests in parallel, via Ansible. I forgot that was running ;) and issued a manual `reboot` across all boxes in parallel. 3 failed to boot, 2 came up fine.

Attaching a console, I saw the attached (apologies for the screenshot, I was unable to copy the text out at such an early stage of the boot process, even with persistent logging enabled). The screen first pauses at the `Press any key to continue`; if a key is pressed, or after a few seconds automatically, the system continues and attempts to boot, causing the depicted kernel panic.

Booting via the previous kernel worked fine, as before, and I was able to inspect the state of the `/boot` partition.

The relevant section of `/boot/grub2/grub.cfg` shows:

menuentry 'CentOS Linux (3.10.0-862.11.6.el7.x86_64) 7 (Core)' --class centos --class gnu-linux --class gnu --class os --unrestricted $menuentry_id_option 'gnulinux-3.10.0-862.el7.x86_64-advanced-91a2b306-c0f5-495a-89c9-1d68c0da1b4e' {
        load_video
        set gfxpayload=keep
        insmod gzio
        insmod part_msdos
        insmod ext2
        set root='hd0,msdos1'
        if [ x$feature_platform_search_hint = xy ]; then
          search --no-floppy --fs-uuid --set=root --hint='hd0,msdos1' 8a5950eb-b91b-4efa-b2e7-1d5fbf20ec81
        else
          search --no-floppy --fs-uuid --set=root 8a5950eb-b91b-4efa-b2e7-1d5fbf20ec81
        fi
        linux16 /vmlinuz-3.10.0-862.11.6.el7.x86_64 root=/dev/mapper/centos_0c4i1lnh-root ro crashkernel=auto rd.lvm.lv=centos_0c4i1lnh/root rhgb quiet LANG=en_GB.UTF-8
        initrd16 /initramfs-3.10.0-862.11.6.el7.x86_64.img

However, there is no such img:

# ls -l /boot/
total 131116
-rw-r--r--. 1 root root 147859 Aug 14 22:02 config-3.10.0-862.11.6.el7.x86_64
-rw-r--r--. 1 root root 147837 Jul 16 16:43 config-3.10.0-862.9.1.el7.x86_64
-rw-r--r--. 1 root root 147819 Apr 20 16:57 config-3.10.0-862.el7.x86_64
drwxr-xr-x. 3 root root 4096 Sep 8 14:13 efi
drwxr-xr-x. 2 root root 4096 Sep 8 14:14 grub
drwx------. 5 root root 4096 Sep 16 08:58 grub2
-rw-------. 1 root root 55390527 Sep 8 14:17 initramfs-0-rescue-c5f759d8a8b84ff383e3f6bfae772a8a.img
-rw-------. 1 root root 21144479 Sep 9 11:15 initramfs-3.10.0-862.9.1.el7.x86_64.img
-rw-------. 1 root root 21143526 Sep 9 11:15 initramfs-3.10.0-862.el7.x86_64.img
drwx------. 2 root root 16384 Sep 8 14:13 lost+found
-rw-r--r--. 1 root root 305158 Aug 14 22:05 symvers-3.10.0-862.11.6.el7.x86_64.gz
-rw-r--r--. 1 root root 305117 Jul 16 16:45 symvers-3.10.0-862.9.1.el7.x86_64.gz
-rw-r--r--. 1 root root 304926 Apr 20 17:00 symvers-3.10.0-862.el7.x86_64.gz
-rw-------. 1 root root 3414344 Aug 14 22:02 System.map-3.10.0-862.11.6.el7.x86_64
-rw-------. 1 root root 3412056 Jul 16 16:43 System.map-3.10.0-862.9.1.el7.x86_64
-rw-------. 1 root root 3409143 Apr 20 16:57 System.map-3.10.0-862.el7.x86_64
-rwxr-xr-x. 1 root root 6224704 Sep 8 14:17 vmlinuz-0-rescue-c5f759d8a8b84ff383e3f6bfae772a8a
-rwxr-xr-x. 1 root root 6242208 Aug 14 22:02 vmlinuz-3.10.0-862.11.6.el7.x86_64
-rwxr-xr-x. 1 root root 6234048 Jul 16 16:43 vmlinuz-3.10.0-862.9.1.el7.x86_64
-rwxr-xr-x. 1 root root 6224704 Apr 20 16:57 vmlinuz-3.10.0-862.el7.x86_64

@TrevorH, this supports the theory you describe; in fact, I'm convinced of it. It actually seems pretty easy to reproduce (I've experienced it 5 times now: 2 on bare-metal servers, 3 on VMs): start a kernel upgrade via the package manager and, whilst that is running, issue a `reboot`. Of course the reboot has to land within some window of time for things to break, but that window seems to be quite wide. Perhaps it could be recreated even more easily when upgrading lots of packages at the same time, if the cleanup phase you describe only happens at the end?
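
To make the race explicit, what I did boils down to something like this (illustrative only; the sleep is arbitrary and it won't trip every time):

  # illustrative only: reboot whilst the kernel update transaction is still
  # running, before yum's cleanup phase has completed; timing-dependent
  yum -y update kernel &
  sleep 20    # arbitrary; long enough for the install phase to have started
  reboot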

Issue History

Date Modified Username Field Change
2018-08-29 13:46 tiredpixel New Issue
2018-08-29 13:50 TrevorH Note Added: 0032620
2018-08-29 14:29 tiredpixel Note Added: 0032621
2018-08-29 15:21 TrevorH Note Added: 0032622
2018-09-05 06:54 tiredpixel Note Added: 0032647
2018-09-05 07:28 TrevorH Note Added: 0032648
2018-09-05 07:44 toracat Note Added: 0032649
2018-09-16 09:25 tiredpixel File Added: Screenshot from 2018-09-16 11-04-23.png
2018-09-16 09:25 tiredpixel Note Added: 0032730