View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0013107||CentOS-7||mdadm||public||2017-04-10 20:58||2018-12-17 14:18|
|Target Version||Fixed in Version|
|Summary||0013107: CentOS 7 with encrypted RAID 1 partitions won't boot up in degraded mode.|
|Description||As the summary says, the machine won't boot if one of the disks is removed. As I understand it, the system should still boot in degraded mode when one of the disks fails or is removed. It looks like the md devices just won't start.|
Here's a screenshot from a failed boot after I removed the second disk: https://i.imgur.com/k7jjxTj.png
This might be related or even the same bug I saw in Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=859691
Installed package versions that might be interesting:
|Steps To Reproduce||1. Begin installing a new CentOS virtual machine with 2 hard disks from CentOS-7-x86_64-Minimal-1611.iso|
2. Manually create a RAID 1 partition for /boot and encrypted RAID 1 partitions for / and swap. See the details in the attached anaconda-ks.cfg.
3. Finish the installation, update all packages (yum update) and verify that the machine can be rebooted normally again.
4. Shut down the machine, remove the second hard disk, try to boot again, and notice how the boot fails.
|Tags||No tags attached.|
anaconda-ks.cfg (1,773 bytes)
I could reproduce this.
Since the systemd-cryptsetup device is activated by udev
via the dev-disk-by\x2duuid-<UUID>.device systemd unit,
and that device never appears while the degraded md array stays INACTIVE,
systemd gives up waiting for it by default and drops to the emergency shell.
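The unit name above is systemd's escaped form of the path /dev/disk/by-uuid/<UUID>. On a real system, `systemd-escape --path --suffix=device /dev/disk/by-uuid/<UUID>` prints it. The POSIX sh sketch below reproduces only the two escaping rules that matter for such paths, assuming the path contains nothing but letters, digits, "/" and "-"; `device_unit_for_path` is a hypothetical helper name.

```shell
# Hypothetical helper: map a device path to its systemd .device unit name.
# Only handles the rules relevant to /dev/disk/by-uuid paths:
#   1. drop the leading "/"
#   2. escape every "-" as "\x2d" (must happen before step 3)
#   3. turn every remaining "/" into "-"
# Real systemd escaping covers more characters; use systemd-escape for those.
device_unit_for_path() {
    escaped=$(printf '%s\n' "${1#/}" | sed -e 's/-/\\x2d/g' -e 's|/|-|g')
    printf '%s.device\n' "$escaped"
}
```

For example, `device_unit_for_path /dev/disk/by-uuid/1234-abcd` prints `dev-disk-by\x2duuid-1234\x2dabcd.device`.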
Fortunately, RHEL's dracut has a built-in fail-safe mechanism
which forces INACTIVE md devices to run with "mdadm -R" after a timeout.
But this timeout defaults to rd.retry=180 seconds, so it kicks in too late:
systemd has already given up and dropped to the emergency shell.
The workaround is to add the "rd.retry=20" kernel option at boot
(adjust it if you're using devices that need longer to probe).
After about 20*(2/3) ≈ 13 seconds, the fail-safe logic kicks in,
you'll be prompted for the LUKS device passphrase,
and you can continue booting in degraded RAID mode.
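The timing described above can be sketched as a small POSIX sh function. It is a simplification under stated assumptions: rd.retry defaults to 180 (the systemd-case value discussed in this thread), the fail-safe fires after 2/3 of rd.retry, and the last rd.retry= argument on the command line wins. `failsafe_delay` is a hypothetical name, not part of dracut.

```shell
# Hypothetical helper: estimate after how many seconds dracut's fail-safe
# ("mdadm -R" on INACTIVE arrays) fires for a given kernel command line.
# Assumptions: 180 s default, 2/3 factor, last rd.retry= wins.
# Shell integer arithmetic rounds down (20 * 2 / 3 -> 13).
failsafe_delay() {
    retry=180
    for arg in $1; do
        case $arg in
            rd.retry=*) retry=${arg#rd.retry=} ;;
        esac
    done
    echo $((retry * 2 / 3))
}
```

Run it as `failsafe_delay "$(cat /proc/cmdline)"`. Without the option it prints 120, later than systemd's default 90-second start timeout; with rd.retry=20 it prints 13. To make the option persistent on CentOS 7, `grubby --update-kernel=ALL --args="rd.retry=20"` appends it to every installed kernel entry.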
Is there a reason why the default is set to 180 seconds in /usr/lib/dracut/modules.d/98systemd/dracut-initqueue.sh?
30 seconds should be the default according to the manual (man dracut.cmdline) and that's what's used in /usr/lib/dracut/modules.d/99base/init.sh.
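One way to check which defaults a given dracut installation actually ships is to read them out of the two scripts named above. This sketch assumes the default is assigned on a line of the form `RDRETRY=${RDRETRY:-N}`; that pattern is an assumption about the dracut sources, and `rdretry_default` is a hypothetical helper.

```shell
# Hypothetical helper: print the rd.retry default hard-coded in a dracut script.
# Assumes a line like:  RDRETRY=${RDRETRY:-180}
# Prints nothing if no such line is found.
rdretry_default() {
    sed -n 's/^[[:space:]]*RDRETRY=\${RDRETRY:-\([0-9][0-9]*\)}.*/\1/p' "$1"
}
```

For example: `for f in /usr/lib/dracut/modules.d/98systemd/dracut-initqueue.sh /usr/lib/dracut/modules.d/99base/init.sh; do echo "$f: $(rdretry_default "$f")"; done`. If the assumed pattern holds, this should print 180 and 30 respectively on the dracut version discussed here; an empty value means the pattern didn't match.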
Dunno. The documentation is stale, so this may be worth reporting as a bug to upstream dracut.
The upstream commit that changes the timeout from 30 to 180 doesn't say anything in particular
about the change.
Maybe they wanted it longer than systemd's default timeout of 90 seconds,
or some disk subsystem needed more than 30 seconds to come up.
I found that in newer dracut from git, modules.d/99base/init.sh has also been changed:
99base: Increase initqueue timeout in non systemd case
In case of systemd is used the timeout already is set to 180s, compare
with file: modules.d/98systemd/dracut-initqueue.sh
Do the same if systemd is not used, e.g. in kdump case.
So it seems the timeout was lengthened for the non-systemd case as well.
Still, the documentation (dracut.cmdline.7.asc) is stale.
Same problem here with a degraded mdadm array (without encrypted partitions).
In emergency mode, we need to run:
mdadm --run /dev/md0
mdadm --run /dev/md1
mdadm --run /dev/md2
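Instead of typing one `mdadm --run` per array, the emergency-shell step above can be sketched as a loop over every array that /proc/mdstat reports as inactive. The helper names are hypothetical, the parsing assumes status lines of the form `md0 : inactive sdb1[1](S)`, and only shell builtins plus `mdadm` are used, so it should not depend on extra tools being present in the initramfs.

```shell
# Hypothetical helper: print the names of inactive md arrays
# listed in an mdstat-format file (default: /proc/mdstat).
inactive_md_arrays() {
    while read -r name sep state rest; do
        if [ "$sep" = ":" ] && [ "$state" = "inactive" ]; then
            case $name in
                md*) echo "$name" ;;
            esac
        fi
    done < "${1:-/proc/mdstat}"
}

# Hypothetical helper: force each inactive array to start, even if degraded,
# exactly like the manual "mdadm --run /dev/mdN" commands above.
run_inactive_md_arrays() {
    for md in $(inactive_md_arrays "$1"); do
        echo "Starting /dev/$md"
        mdadm --run "/dev/$md"
    done
}
```

From the emergency shell, define the functions and run `run_inactive_md_arrays`.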
Issue History
|Date Modified||Username||Change|
|2017-04-10 20:58||Vuojolahti||New Issue|
|2017-04-10 20:58||Vuojolahti||File Added: anaconda-ks.cfg|
|2017-04-11 06:56||kabe||Note Added: 0029045|
|2017-04-11 18:28||Vuojolahti||Note Added: 0029054|
|2017-04-12 00:32||kabe||Note Added: 0029056|
|2017-04-12 01:03||kabe||Note Added: 0029057|
|2018-12-17 14:18||bibianthony||Note Added: 0033368|