2017-09-22 07:51 UTC

View Issue Details
ID: 0013763    Project: CentOS-7    Category: kernel    View Status: public    Last Update: 2017-09-22 03:21
Reporter: kstange
Priority: high    Severity: crash    Reproducibility: always
Status: acknowledged    Resolution: open
Platform: x86_64    OS: CentOS    OS Version: 7.4
Product Version:
Target Version:    Fixed in Version:
Summary: 0013763: CentOS 7.4 kernel (3.10.0-693*) fails to boot as Xen PV guest
Description: After a mailing list post indicated that Xen PV guests fail to boot with the CR kernel for 7.4:

https://lists.centos.org/pipermail/centos-virt/2017-August/005618.html

I began to investigate and determined that something appears to trigger a page allocation failure during systemd-udevd initialization, which prevents the system from booting properly.

https://lists.centos.org/pipermail/centos-virt/2017-September/005623.html

Failure is:

[ 1.970630] ------------[ cut here ]------------
[ 1.970651] WARNING: CPU: 2 PID: 225 at mm/vmalloc.c:131
vmap_page_range_noflush+0x2c1/0x350
[ 1.970660] Modules linked in:
[ 1.970668] CPU: 2 PID: 225 Comm: systemd-udevd Not tainted
3.10.0-693.1.1.el7.x86_64 #1
[ 1.970677] 0000000000000000 000000008cddc75d ffff8803e8587bd8
ffffffff816a3d91
[ 1.970688] ffff8803e8587c18 ffffffff810879c8 00000083811c14e8
ffff8800066eb000
[ 1.970698] 0000000000000001 ffff8803e86d6940 ffffffffc0000000
0000000000000000
[ 1.970708] Call Trace:
[ 1.970725] [<ffffffff816a3d91>] dump_stack+0x19/0x1b
[ 1.970736] [<ffffffff810879c8>] __warn+0xd8/0x100
[ 1.970742] [<ffffffff81087b0d>] warn_slowpath_null+0x1d/0x20
[ 1.970748] [<ffffffff811c0781>] vmap_page_range_noflush+0x2c1/0x350
[ 1.970758] [<ffffffff811c083e>] map_vm_area+0x2e/0x40
[ 1.970765] [<ffffffff811c1590>] __vmalloc_node_range+0x170/0x270
[ 1.970774] [<ffffffff810fe754>] ? module_alloc_update_bounds+0x14/0x70
[ 1.970781] [<ffffffff810fe754>] ? module_alloc_update_bounds+0x14/0x70
[ 1.970792] [<ffffffff8105f143>] module_alloc+0x73/0xd0
[ 1.970798] [<ffffffff810fe754>] ? module_alloc_update_bounds+0x14/0x70
[ 1.970804] [<ffffffff810fe754>] module_alloc_update_bounds+0x14/0x70
[ 1.970811] [<ffffffff810ff2d2>] load_module+0xb02/0x29e0
[ 1.970817] [<ffffffff811c0717>] ? vmap_page_range_noflush+0x257/0x350
[ 1.970823] [<ffffffff811c083e>] ? map_vm_area+0x2e/0x40
[ 1.970829] [<ffffffff811c1590>] ? __vmalloc_node_range+0x170/0x270
[ 1.970838] [<ffffffff81101249>] ? SyS_init_module+0x99/0x110
[ 1.970846] [<ffffffff81101275>] SyS_init_module+0xc5/0x110
[ 1.970856] [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
[ 1.970862] ---[ end trace 2117480876ed90d2 ]---
[ 1.970869] vmalloc: allocation failure, allocated 24576 of 28672 bytes
[ 1.970874] systemd-udevd: page allocation failure: order:0, mode:0xd2
[ 1.970883] CPU: 2 PID: 225 Comm: systemd-udevd Tainted: G W
   ------------ 3.10.0-693.1.1.el7.x86_64 #1
[ 1.970894] 00000000000000d2 000000008cddc75d ffff8803e8587c48
ffffffff816a3d91
[ 1.970910] ffff8803e8587cd8 ffffffff81188810 ffffffff8190ea38
ffff8803e8587c68
[ 1.970923] ffffffff00000018 ffff8803e8587ce8 ffff8803e8587c88
000000008cddc75d
[ 1.970939] Call Trace:
[ 1.970946] [<ffffffff816a3d91>] dump_stack+0x19/0x1b
[ 1.970961] [<ffffffff81188810>] warn_alloc_failed+0x110/0x180
[ 1.970971] [<ffffffff811c1654>] __vmalloc_node_range+0x234/0x270
[ 1.970981] [<ffffffff810fe754>] ? module_alloc_update_bounds+0x14/0x70
[ 1.970989] [<ffffffff810fe754>] ? module_alloc_update_bounds+0x14/0x70
[ 1.970999] [<ffffffff8105f143>] module_alloc+0x73/0xd0
[ 1.971031] [<ffffffff810fe754>] ? module_alloc_update_bounds+0x14/0x70
[ 1.971038] [<ffffffff810fe754>] module_alloc_update_bounds+0x14/0x70
[ 1.971046] [<ffffffff810ff2d2>] load_module+0xb02/0x29e0
[ 1.971052] [<ffffffff811c0717>] ? vmap_page_range_noflush+0x257/0x350
[ 1.971061] [<ffffffff811c083e>] ? map_vm_area+0x2e/0x40
[ 1.971067] [<ffffffff811c1590>] ? __vmalloc_node_range+0x170/0x270
[ 1.971075] [<ffffffff81101249>] ? SyS_init_module+0x99/0x110
[ 1.971081] [<ffffffff81101275>] SyS_init_module+0xc5/0x110
[ 1.971088] [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
[ 1.971094] Mem-Info:
[ 1.971103] active_anon:875 inactive_anon:2049 isolated_anon:0
[ 1.971103] active_file:791 inactive_file:8841 isolated_file:0
[ 1.971103] unevictable:0 dirty:0 writeback:0 unstable:0
[ 1.971103] slab_reclaimable:1732 slab_unreclaimable:1629
[ 1.971103] mapped:1464 shmem:2053 pagetables:480 bounce:0
[ 1.971103] free:4065966 free_pcp:763 free_cma:0
[ 1.971131] Node 0 DMA free:15912kB min:12kB low:12kB high:16kB
active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB
unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB
managed:15912kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB
slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB
pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB
free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 1.971217] lowmem_reserve[]: 0 4063 16028 16028
[ 1.971226] Node 0 DMA32 free:4156584kB min:4104kB low:5128kB
high:6156kB active_anon:952kB inactive_anon:1924kB active_file:0kB
inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:4177920kB managed:4162956kB mlocked:0kB dirty:0kB writeback:0kB
mapped:4kB shmem:1928kB slab_reclaimable:240kB slab_unreclaimable:504kB
kernel_stack:32kB pagetables:592kB unstable:0kB bounce:0kB
free_pcp:1760kB local_pcp:288kB free_cma:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? no
[ 1.971264] lowmem_reserve[]: 0 0 11964 11964
[ 1.971273] Node 0 Normal free:12091564kB min:12088kB low:15108kB
high:18132kB active_anon:2352kB inactive_anon:6272kB active_file:3164kB
inactive_file:35364kB unevictable:0kB isolated(anon):0kB
isolated(file):0kB present:12591104kB managed:12251788kB mlocked:0kB
dirty:0kB writeback:0kB mapped:5852kB shmem:6284kB
slab_reclaimable:6688kB slab_unreclaimable:6012kB kernel_stack:880kB
pagetables:1328kB unstable:0kB bounce:0kB free_pcp:1196kB
local_pcp:152kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
[ 1.971309] lowmem_reserve[]: 0 0 0 0
[ 1.971316] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 1*32kB (U) 2*64kB (U)
1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) =
15912kB
[ 1.971343] Node 0 DMA32: 7*4kB (M) 18*8kB (UM) 7*16kB (EM) 3*32kB
(EM) 1*64kB (E) 2*128kB (UM) 1*256kB (E) 4*512kB (UM) 4*1024kB (UEM)
4*2048kB (EM) 1011*4096kB (M) = 4156348kB
[ 1.971377] Node 0 Normal: 64*4kB (UEM) 10*8kB (UEM) 6*16kB (EM)
3*32kB (EM) 3*64kB (UE) 3*128kB (UEM) 1*256kB (E) 2*512kB (UE) 0*1024kB
1*2048kB (M) 2951*4096kB (M) = 12091728kB
[ 1.971413] Node 0 hugepages_total=0 hugepages_free=0
hugepages_surp=0 hugepages_size=2048kB
[ 1.971425] 11685 total pagecache pages
[ 1.971430] 0 pages in swap cache
[ 1.971437] Swap cache stats: add 0, delete 0, find 0/0
[ 1.971444] Free swap = 0kB
[ 1.971451] Total swap = 0kB
[ 1.971456] 4196255 pages RAM
[ 1.971462] 0 pages HighMem/MovableOnly
[ 1.971467] 88591 pages reserved
Steps To Reproduce: Upgrade to 3.10.0-693 or later and reboot.
Additional Information: The kernel boots fine if you switch back to 514.26.2. I'm putting this here and filing a bug upstream with Red Hat as well.
Tags: No tags attached.
Attached Files
  • kernel-3.10.0-693.1.1-boot.log (107,196 bytes) 2017-09-01 23:09
  • xen-dont-copy-bogus-duplicate-entries-into-kernel-page-tables.patch (5,502 bytes) 2017-09-06 19:06
    From 0b5a50635fc916cf46e3de0b819a61fc3f17e7ee Mon Sep 17 00:00:00 2001
    From: Stefan Bader <stefan.bader@canonical.com>
    Date: Tue, 2 Sep 2014 11:16:01 +0100
    Subject: x86/xen: don't copy bogus duplicate entries into kernel page tables
    
    When RANDOMIZE_BASE (KASLR) is enabled; or the sum of all loaded
    modules exceeds 512 MiB, then loading modules fails with a warning
    (and hence a vmalloc allocation failure) because the PTEs for the
    newly-allocated vmalloc address space are not zero.
    
      WARNING: CPU: 0 PID: 494 at linux/mm/vmalloc.c:128
               vmap_page_range_noflush+0x2a1/0x360()
    
    This is caused by xen_setup_kernel_pagetables() copying
    level2_kernel_pgt into level2_fixmap_pgt, overwriting many non-present
    entries.
    
    Without KASLR, the normal kernel image size only covers the first half
    of level2_kernel_pgt and module space starts after that.
    
    L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[  0..255]->kernel
                                                      [256..511]->module
                              [511]->level2_fixmap_pgt[  0..505]->module
    
    This allows 512 MiB of module vmalloc space to be used before
    having to use the corrupted level2_fixmap_pgt entries.
    
    With KASLR enabled, the kernel image uses the full PUD range of 1G and
    module space starts in the level2_fixmap_pgt. So basically:
    
    L4[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..511]->kernel
                              [511]->level2_fixmap_pgt[0..505]->module
    
    And now no module vmalloc space can be used without using the corrupt
    level2_fixmap_pgt entries.
    
    Fix this by properly converting the level2_fixmap_pgt entries to MFNs,
    and setting level1_fixmap_pgt as read-only.
    
    A number of comments were also using the wrong L3 offset for
    level2_kernel_pgt.  These have been corrected.
    
    Signed-off-by: Stefan Bader <stefan.bader@canonical.com>
    Signed-off-by: David Vrabel <david.vrabel@citrix.com>
    Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
    Cc: stable@vger.kernel.org
    ---
     arch/x86/include/asm/pgtable_64.h |  1 +
     arch/x86/xen/mmu.c                | 27 ++++++++++++---------------
     2 files changed, 13 insertions(+), 15 deletions(-)
    
    diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
    index 5be9063..3874693 100644
    --- a/arch/x86/include/asm/pgtable_64.h
    +++ b/arch/x86/include/asm/pgtable_64.h
    @@ -19,6 +19,7 @@ extern pud_t level3_ident_pgt[512];
     extern pmd_t level2_kernel_pgt[512];
     extern pmd_t level2_fixmap_pgt[512];
     extern pmd_t level2_ident_pgt[512];
    +extern pte_t level1_fixmap_pgt[512];
     extern pgd_t init_level4_pgt[];
     
     #define swapper_pg_dir init_level4_pgt
    diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
    index e8a1201..16fb009 100644
    --- a/arch/x86/xen/mmu.c
    +++ b/arch/x86/xen/mmu.c
    @@ -1858,11 +1858,10 @@
      *
      * We can construct this by grafting the Xen provided pagetable into
      * head_64.S's preconstructed pagetables.  We copy the Xen L2's into
    - * level2_ident_pgt, level2_kernel_pgt and level2_fixmap_pgt.  This
    - * means that only the kernel has a physical mapping to start with -
    - * but that's enough to get __va working.  We need to fill in the rest
    - * of the physical mapping once some sort of allocator has been set
    - * up.
    + * level2_ident_pgt, and level2_kernel_pgt.  This means that only the
    + * kernel has a physical mapping to start with - but that's enough to
    + * get __va working.  We need to fill in the rest of the physical
    + * mapping once some sort of allocator has been set up.
      */
     void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
     {
    @@ -1892,9 +1891,12 @@
     	/* L3_i[0] -> level2_ident_pgt */
     	convert_pfn_mfn(level3_ident_pgt);
     	/* L3_k[510] -> level2_kernel_pgt
    -	 * L3_i[511] -> level2_fixmap_pgt */
    +	 * L3_k[511] -> level2_fixmap_pgt */
     	convert_pfn_mfn(level3_kernel_pgt);
     
    +	/* L3_k[511][506] -> level1_fixmap_pgt */
    +	convert_pfn_mfn(level2_fixmap_pgt);
    +
     	/* We get [511][511] and have Xen's version of level2_kernel_pgt */
     	l3 = m2v(pgd[pgd_index(__START_KERNEL_map)].pgd);
     	l2 = m2v(l3[pud_index(__START_KERNEL_map)].pud);
    @@ -1903,22 +1905,15 @@
     	addr[1] = (unsigned long)l3;
     	addr[2] = (unsigned long)l2;
     	/* Graft it onto L4[272][0]. Note that we creating an aliasing problem:
    -	 * Both L4[272][0] and L4[511][511] have entries that point to the same
    +	 * Both L4[272][0] and L4[511][510] have entries that point to the same
     	 * L2 (PMD) tables. Meaning that if you modify it in __va space
     	 * it will be also modified in the __ka space! (But if you just
     	 * modify the PMD table to point to other PTE's or none, then you
     	 * are OK - which is what cleanup_highmap does) */
     	copy_page(level2_ident_pgt, l2);
    -	/* Graft it onto L4[511][511] */
    +	/* Graft it onto L4[511][510] */
     	copy_page(level2_kernel_pgt, l2);
     
    -	/* Get [511][510] and graft that in level2_fixmap_pgt */
    -	l3 = m2v(pgd[pgd_index(__START_KERNEL_map + PMD_SIZE)].pgd);
    -	l2 = m2v(l3[pud_index(__START_KERNEL_map + PMD_SIZE)].pud);
    -	copy_page(level2_fixmap_pgt, l2);
    -	/* Note that we don't do anything with level1_fixmap_pgt which
    -	 * we don't need. */
    -
     	/* Make pagetable pieces RO */
     	set_page_prot(init_level4_pgt, PAGE_KERNEL_RO);
     	set_page_prot(level3_ident_pgt, PAGE_KERNEL_RO);
    @@ -1927,6 +1922,7 @@
     	set_page_prot(level2_ident_pgt, PAGE_KERNEL_RO);
     	set_page_prot(level2_kernel_pgt, PAGE_KERNEL_RO);
     	set_page_prot(level2_fixmap_pgt, PAGE_KERNEL_RO);
    +	set_page_prot(level1_fixmap_pgt, PAGE_KERNEL_RO);
     
     	/* Pin down new L4 */
     	pin_pagetable_pfn(MMUEXT_PIN_L4_TABLE,
    
  • run_lorax (280 bytes) 2017-09-22 03:21
    #sudo -i
    #setenforce 0
    DEFAULTKERNEL=kernel-plus lorax -p Centos-Minimal -v 7 -r 7.4 \
    -s http://mirrors.us.kernel.org/centos/7/os/x86_64/ \
    -s http://mirrors.us.kernel.org/centos/7/updates/x86_64/ \
    -s http://mirrors.us.kernel.org/centos/7/centosplus/x86_64/ \
    ./results/
    exit 0
    

Relationships

Notes

~0029977

kstange (reporter)

Xen versions running on the host:

xen-4.4.4-26
kernel-4.9.44-29

~0029978

kstange (reporter)

Created a bug at RH bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=1487754

Anyone who wants to be CC'd on the bug to comment or view it, please let me know the email address to add... you can ping me on IRC (kstange) if you'd rather.

~0029979

arrfab (administrator)

Confirmed through CentOS QA jobs: the kernel panics, so it is not possible to set up a new PV domU guest with the 7.4.1708 tree.

~0030014

kstange (reporter)

Things I have done since filing this bug:

- I tested booting my CentOS virt kernel 4.9.44 and it boots correctly.

- I tested an official RHEL 7.4 PV guest and it does not boot.

- From the Red Hat bug, it was suggested that I rebuild with CONFIG_RANDOMIZE_BASE=n and this allowed the kernel to boot.

- I did some digging based on this change and found this thread:

https://lists.xen.org/archives/html/xen-devel/2014-08/msg01024.html

The patch proposed is here in its final form:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v4.13&id=0b5a50635fc916cf46e3de0b819a61fc3f17e7ee

It doesn't apply cleanly to 3.10.0, but I backported it to the 693.1.1 kernel and applied it, and I can now boot my PV guest properly with CONFIG_RANDOMIZE_BASE=y.

- I have asked, but I am doubtful that Red Hat will accept this patch. We could potentially apply it to CentOS Plus kernels, but that doesn't solve the installer problem unless we have an alternate installer for PV guests.

~0030021

kstange (reporter)

With many thanks to hughesjr and toracat, the patch indicated in my previous comment is now in CentOS Plus kernel 3.10.0-693.2.1. I've asked Red Hat to apply it to some future kernel update, but that is only a dream for now.

In the meantime, if anyone who has been experiencing the issue with PV domains can try out the CentOS Plus kernel here and provide feedback, I'd appreciate it!

https://buildlogs.centos.org/c7-plus/kernel-plus/20170907163005/3.10.0-693.2.1.el7.centos.plus.x86_64/

~0030077

avij (manager)

For the record, those 3.10.0-693.* CentOS Plus kernels are now available via the regular centosplus repository.

https://wiki.centos.org/AdditionalResources/Repositories/CentOSPlus
http://mirror.centos.org/centos/7.4.1708/centosplus/x86_64/Packages/
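For anyone following along, a minimal sketch of pulling that kernel onto an affected guest (assuming the stock CentOS-Base.repo layout, where centosplus ships disabled; both commands need root):

```shell
# Enable the centosplus repo for this transaction only and install
# the patched kernel; kernel-plus installs alongside the stock kernel,
# so the old one remains available as a fallback boot entry.
yum --enablerepo=centosplus install kernel-plus

# After rebooting into it, the release string should carry ".plus":
uname -r    # e.g. 3.10.0-693.2.1.el7.centos.plus.x86_64
```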

~0030095

bill_mcgonigle (reporter)

Kevin, thanks for managing this bug for us - saved my bacon. I ran into it when rebooting a database server and have proactively switched to kernel-plus on a dozen other VMs, both in-house and cloud-hosted (on Xen PV). They have all been fully updated and rebooted successfully. I've started a blog entry/howto on this issue, which is really major IMO <https://www.bfccomputing.com/2017/09/15/centos-plus-kernel-for-xen.html> - if you can add me to the RHBZ, I'll verify that my claims are accurate. RHBZ id is bill-bugzilla.redhat.com@bfccomputing.com .

~0030169

PryMar56 (reporter)

In order to launch any Xen PV install, we need another pxeboot pair, vmlinuz/initrd.img, that has the desired patch.
The tool of choice to make these is `lorax`, or pylorax.

It turns out that kernel-plus can be used in lorax by changing the first kernel -> kernel-plus in:
/usr/share/lorax/runtime-install.tmpl

The resulting boot.iso and pxeboot/{vmlinuz,initrd.img} can be used to launch a Xen PV netinstall. I've done it.
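For reference, the template edit described above can be sketched like this — shown against a stand-in file, since the exact contents of /usr/share/lorax/runtime-install.tmpl vary between lorax versions (back up the real template before editing it):

```shell
# Stand-in for /usr/share/lorax/runtime-install.tmpl
tmpl=$(mktemp)
printf 'installpkg kernel\ninstallpkg grub2 grubby\n' > "$tmpl"

# GNU sed: the 0,/regex/ address range ends at the first matching line,
# so only the first standalone "kernel" package name becomes "kernel-plus".
sed -i '0,/\bkernel\b/s//kernel-plus/' "$tmpl"

cat "$tmpl"
```

The `0,/re/` range keeps later lines (e.g. kernel-tools, kernel-headers) untouched even though they also contain the word "kernel".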

Issue History
Date Modified Username Field Change
2017-09-01 20:11 kstange New Issue
2017-09-01 20:12 kstange Note Added: 0029977
2017-09-01 22:15 kstange Note Added: 0029978
2017-09-01 23:09 kstange File Added: kernel-3.10.0-693.1.1-boot.log
2017-09-02 09:49 arrfab Status new => acknowledged
2017-09-02 09:49 arrfab Note Added: 0029979
2017-09-06 19:06 kstange File Added: xen-dont-copy-bogus-duplicate-entries-into-kernel-page-tables.patch
2017-09-06 19:06 kstange Note Added: 0030014
2017-09-07 20:21 kstange Note Added: 0030021
2017-09-14 14:39 avij Note Added: 0030077
2017-09-16 15:38 bill_mcgonigle Note Added: 0030095
2017-09-22 03:21 PryMar56 File Added: run_lorax
2017-09-22 03:21 PryMar56 Note Added: 0030169