View Issue Details

ID: 0010729
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2016-11-21 18:12
Reporter: bcran
Priority: normal
Severity: crash
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: x86_64
OS: CentOS
OS Version: 7.1.
Product Version: 7.2.1511
Target Version:
Fixed in Version:
Summary: 0010729: Boot enters infinite loop if EFI app/driver allocates 2GB above 4GB
Description: The kernel boot process enters an infinite loop (i.e. hangs) if a UEFI application or driver has allocated 2GB above 4GB before CentOS boots, for example by allocating 2GB at 0x180000000.

Steps To Reproduce: Edit AppPkg/Applications/Hello.c from the TianoCore edk2 and add the lines:

EFI_PHYSICAL_ADDRESS addr = 0x180000000;
gBS->AllocatePages(AllocateAddress, EfiBootServicesData, 0x80000, &addr);

Then, run GRUB and boot CentOS with kernel 3.10.0-327. If you enable debug output ("debug earlyprintk=efi,keep efi=debug ignore_loglevel uefi_debug") then the boot process stops after displaying

"[ 0.005412] pid_max: default: 32768 minimum: 301"

Windows 10 (build 1511) and OpenSUSE with kernel 4.5.0 boot without a problem, so it looks like it's either specific to CentOS, or a bug in the upstream 3.10 kernel.
Additional Information: Laszlo Ersek on the edk2-devel mailing list debugged the problem further than I did, and reported the following:

I enabled DEBUG_GCD (0x00100000) in PcdDebugPrintErrorLevel. I also modified one of the DXE drivers in OvmfPkg (PlatformDxe to be exact, but it's irrelevant) to allocate 2GB memory at 6GB, in its entry point function:

+  {
+    EFI_PHYSICAL_ADDRESS Address;
+    EFI_STATUS           Status2;
+
+    Address = 0x180000000;
+    Status2 = gBS->AllocatePages(AllocateAddress, EfiBootServicesData, 0x80000,
+                    &Address);
+    ASSERT_EFI_ERROR (Status2);
+  }
+

Finally I booted this in a 16GB virtual machine. Here's what I see:

(1) Since "OvmfPkg/PlatformPei/MemDetect.c" adds the memory above 4GB as untested memory (*), with

    if (UpperMemorySize != 0) {
      AddUntestedMemoryBaseSizeHob (BASE_4GB, UpperMemorySize);
    }

the DXE core initializes this range in the GCD memory space map as Reserved. (See the CoreInitializeGcdServices() function -- search it for the comment "Walk the HOB list and add all resource descriptors to the GCD" and the macro TESTED_MEMORY_ATTRIBUTES.)

Because this range is Reserved when the DXE driver in question tries to allocate 2GB at 6GB, the AllocatePages() boot service fails. That's when you see an error message like

  ConvertPages: failed to find range ...

In other words, if the AllocatePages() call in your driver causes this message to appear, then the allocation fails, and the driver should not try to use the address. Are you checking the return value from AllocatePages()? For me, it is EFI_NOT_FOUND.

(*) This is a side point. I guess you might want to know why PlatformPei adds the RAM above 4GB as untested. It is because we want the DXE IPL to load the DXE Core into the permanent PEI RAM that our PEI phase installs. There was a thread in the past where we discussed this at length; I've forgotten most of the details by now, but the point is, such allocations won't succeed in the entry points of DXE_DRIVER modules.

However, this per se should never cause the problem you are seeing (as long as you are obeying the return status of AllocatePages(), of course).

(2) Now, memory testing in BDS promotes this memory range to usable system memory. Then it becomes available to UEFI_DRIVER modules, and code in DXE_DRIVER modules that allocates memory this late (for example, protocol installation callbacks). See PlatformBdsDiagnostics() in "OvmfPkg/Library/PlatformBdsLib/BdsPlatform.c".

Moving the AllocatePages() hunk above to just after PlatformBdsDiagnostics(), the call succeeds. The kernel is booted (I tested with grub, using a RHEL-7.2 guest), but it does fall into an infinite loop, exactly where you described. IOW, I can reproduce the issue.

(3) At this point I paused the guest, and dumped its memory contents to disk, with the following command:

# virsh dump ovmf.rhel7 ovmf.rhel7.dump --memory-only --format kdump-lzo

(Note that you can do the same using just QEMU: see the "dump-guest-memory" monitor command. You will want to *disable* paging, and pick either the kdump-lzo or kdump-snappy formats.)

After this, the guest can be forced off; we'll work with the memory dump only.

(4) I installed the debug symbols for the guest kernel onto the host (kernel-debuginfo + kernel-debuginfo-common packages), and also the "crash" utility. (The same should be doable on CentOS 7 too.) The "crash" utility had been extended to handle dumps of guest kernels running on top of OVMF (see <https://bugzilla.redhat.com/show_bug.cgi?id=1080698>.)

(5) The dump can be opened like this:

  crash \
    /usr/lib/debug/usr/lib/modules/3.10.0-327.el7.x86_64/vmlinux \
    ovmf.rhel7.dump

The stack dump I get with "bt -l" is:

PID: 0 TASK: ffffffff81951440 CPU: 0 COMMAND: "swapper/0"
    [exception RIP: native_set_pmd+1]
    RIP: ffffffff810592b1 RSP: ffffffff8193fc68 RFLAGS: 00000282
    RAX: 007f05b2156000e3 RBX: 007f05af10e00000 RCX: ffff880000000000
    RDX: ffff88042f065438 RSI: 007f05b2156000e3 RDI: ffff88042f065438
    RBP: ffffffff8193fcb8 R8: 0000000000000063 R9: 0000000000000063
    R10: ffff88043ffc7000 R11: 0000000000000001 R12: ff80fa4eeaa00000
    R13: ffffffff8193fe28 R14: 0000000000000063 R15: ffff88000294dfc8
    CS: 0010 SS: 0000
 #0 [ffffffff8193fc70] populate_pmd at ffffffff81060f16
    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/include/asm/paravirt.h: 545
 #1 [ffffffff8193fcc0] __cpa_process_fault at ffffffff810613ab
    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/mm/pageattr.c: 974
 #2 [ffffffff8193fd28] __change_page_attr_set_clr at ffffffff810619a4
    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/include/linux/spinlock.h: 333
 #3 [ffffffff8193fe18] kernel_map_pages_in_pgd at ffffffff8106320e
    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/mm/pageattr.c: 1869
 #4 [ffffffff8193fe80] __map_region at ffffffff81aab2be
    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/platform/efi/efi_64.c: 182
 #5 [ffffffff8193fea0] efi_map_region at ffffffff81aab50a
    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/platform/efi/efi_64.c: 225
 #6 [ffffffff8193fec8] efi_enter_virtual_mode at ffffffff81aaaf1a
    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/platform/efi/efi.c: 975
 #7 [ffffffff8193ff40] start_kernel at ffffffff81a8cfc8
    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/init/main.c: 612
 #8 [ffffffff8193ff88] x86_64_start_reservations at ffffffff81a8c5ee
    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/kernel/head64.c: 194
 #9 [ffffffff8193ff98] x86_64_start_kernel at ffffffff81a8c742
    /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/kernel/head64.c: 183

Since the exception RIP is "native_set_pmd+1", I asked "crash" to disassemble that function:

crash> disassemble native_set_pmd
Dump of assembler code for function native_set_pmd:
   0xffffffff810592b0 <+0>: push %rbp
   0xffffffff810592b1 <+1>: mov %rsp,%rbp
   0xffffffff810592b4 <+4>: push %r12
   0xffffffff810592b6 <+6>: mov %rsi,%r12
   0xffffffff810592b9 <+9>: push %rbx
   0xffffffff810592ba <+10>: mov %rdi,%rbx
   0xffffffff810592bd <+13>: data32 data32 data32 xchg %ax,%ax
   0xffffffff810592c2 <+18>: mov %r12,(%rbx)
   0xffffffff810592c5 <+21>: pop %rbx
   0xffffffff810592c6 <+22>: pop %r12
   0xffffffff810592c8 <+24>: pop %rbp
   0xffffffff810592c9 <+25>: retq
   0xffffffff810592ca <+26>: nopw 0x0(%rax,%rax,1)
   0xffffffff810592d0 <+32>: callq 0xffffffff81066c90 <do_mm_track_pmd>
   0xffffffff810592d5 <+37>: mov %r12,(%rbx)
   0xffffffff810592d8 <+40>: pop %rbx
   0xffffffff810592d9 <+41>: pop %r12
   0xffffffff810592db <+43>: pop %rbp
   0xffffffff810592dc <+44>: retq
End of assembler dump.

"native_set_pmd" has several versions; "addr2line" resolves the address ffffffff810592b1 to:

native_set_pmd at /usr/src/debug/kernel-3.10.0-327.el7/linux-3.10.0-327.el7.x86_64/arch/x86/include/asm/pgtable_64.h:66

However, I think that those "deep" stack frames are just a symptom, not the cause. I'd look for the cause somewhere in efi_enter_virtual_mode() / efi_map_region().

I think at this point I'll copy Matt :) , and ask you to reproduce the issue with a fresh upstream kernel (most recent Linux release, or even fresh git). If it reproduces, then it's an upstream kernel bug I think; if it doesn't reproduce, then please report an RHBZ about it.

... I think commits 700870119f490 and 916f676f8 are interesting. Normally efi_map_regions() does not map a region if its EFI_MEMORY_RUNTIME attribute is clear, *unless* you are on x86_64 and the region is either boot services code or boot services data. In the latter cases, the region is mapped, temporarily. And, I think that's when efi_map_region() is passed the EFI_MEMORY_DESCRIPTOR for the 2GB@6GB allocation, and it blows up. Not sure why.
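
The mapping rule described in the paragraph above can be sketched as a small standalone C predicate. This is an illustration, not the kernel's code: struct mem_desc and region_gets_mapped() are hypothetical names, the is_x86_64 flag stands in for the kernel's compile-time configuration, and the numeric constants are the values assigned by the UEFI specification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Memory type and attribute values as assigned by the UEFI specification. */
#define EFI_BOOT_SERVICES_CODE 3
#define EFI_BOOT_SERVICES_DATA 4
#define EFI_MEMORY_RUNTIME     0x8000000000000000ULL

/* Hypothetical, trimmed-down descriptor; the real one has more fields. */
struct mem_desc {
    uint32_t type;
    uint64_t attribute;
};

/* Hypothetical helper: returns true if the region would be mapped. */
bool region_gets_mapped(const struct mem_desc *md, bool is_x86_64)
{
    /* Runtime regions are always mapped. */
    if (md->attribute & EFI_MEMORY_RUNTIME)
        return true;

    /* Without the runtime attribute, only x86_64 makes an exception... */
    if (!is_x86_64)
        return false;

    /* ...and only for boot services code and boot services data. */
    return md->type == EFI_BOOT_SERVICES_CODE ||
           md->type == EFI_BOOT_SERVICES_DATA;
}
```

Under this sketch, an EfiBootServicesData region such as the 2GB allocation at 6GB is mapped on x86_64 even though its EFI_MEMORY_RUNTIME attribute is clear, which is consistent with efi_map_region() appearing in the backtrace.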

I'm going to hang on to the dump file I saved, in case Matt or someone else wants me to issue some more "crash" commands against it.
Tags: No tags attached.

Activities

bcran

2016-04-20 01:07

reporter   ~0026329

Laszlo Ersek, Matt Fleming and I have subsequently found and verified that Linux commit 742563777e8da62197d6cb4b99f4027f59454735 (https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=742563777e8da62197d6cb4b99f4027f59454735) from 2016-01-29 fixed the problem.

Laszlo has said he'll file an RHBZ to backport the changeset to the RHEL kernel. Does anything need to be done on the CentOS side, or will CentOS automatically pick up the RHEL kernel once it is released?
toracat

2016-04-20 01:43

manager   ~0026331

Once the RHEL kernel gets fixed, the fix will be picked up by the CentOS kernel automatically. In fact, that is the only way to get any patches into the CentOS distro kernel.

That said, CentOS offers a custom kernel called the 'centosplus' kernel. The patch can be included in the plus kernel.
toracat

2016-04-20 01:58

manager  

centos-linux-3.10-fix-truncation-bug-EFI-bug10720.patch (3,356 bytes)
centosplus patch [bug#10729]

commit	742563777e8da62197d6cb4b99f4027f59454735

x86/mm/pat: Avoid truncation when converting cpa->numpages to address

There are a couple of nasty truncation bugs lurking in the pageattr
code that can be triggered when mapping EFI regions, e.g. when we pass
a cpa->pgd pointer. Because cpa->numpages is a 32-bit value, shifting
left by PAGE_SHIFT will truncate the resultant address to 32-bits.

Viorel-Cătălin managed to trigger this bug on his Dell machine that
provides a ~5GB EFI region which requires 1236992 pages to be mapped.
When calling populate_pud() the end of the region gets calculated
incorrectly in the following buggy expression,

  end = start + (cpa->numpages << PAGE_SHIFT);

And only 188416 pages are mapped. Next, populate_pud() gets invoked
for a second time because of the loop in __change_page_attr_set_clr(),
only this time no pages get mapped because shifting the remaining
number of pages (1048576) by PAGE_SHIFT is zero. At which point the
loop in __change_page_attr_set_clr() spins forever because we fail to
map progress.

Hitting this bug depends very much on the virtual address we pick to
map the large region at and how many pages we map on the initial run
through the loop. This explains why this issue was only recently hit
with the introduction of commit

  a5caa209ba9c ("x86/efi: Fix boot crash by mapping EFI memmap
   entries bottom-up at runtime, instead of top-down")

It's interesting to note that safe uses of cpa->numpages do exist in
the pageattr code. If instead of shifting ->numpages we multiply by
PAGE_SIZE, no truncation occurs because PAGE_SIZE is a UL value, and
so the result is unsigned long.

To avoid surprises when users try to convert very large cpa->numpages
values to addresses, change the data type from 'int' to 'unsigned
long', thereby making it suitable for shifting by PAGE_SHIFT without
any type casting.

The alternative would be to make liberal use of casting, but that is
far more likely to cause problems in the future when someone adds more
code and fails to cast properly; this bug was difficult enough to
track down in the first place.

Reported-and-tested-by: Viorel-Cătălin Răpițeanu <rapiteanu.catalin@gmail.com> 
Acked-by: Borislav Petkov <bp@alien8.de> 
Cc: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com> 
Cc: <stable@vger.kernel.org> 
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk> 
Link: https://bugzilla.kernel.org/show_bug.cgi?id=110131 
Link: http://lkml.kernel.org/r/1454067370-10374-1-git-send-email-matt@codeblueprint.co.uk 
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Applied-by: Akemi Yagi <toracat@centos.org>

--- a/arch/x86/mm/pageattr.c	2016-02-29 09:35:49.000000000 -0800
+++ b/arch/x86/mm/pageattr.c	2016-04-19 18:50:00.712944601 -0700
@@ -33,7 +33,7 @@ struct cpa_data {
 	pgd_t		*pgd;
 	pgprot_t	mask_set;
 	pgprot_t	mask_clr;
-	int		numpages;
+	unsigned long	numpages;
 	int		flags;
 	unsigned long	pfn;
 	unsigned	force_split : 1;
@@ -1289,7 +1289,7 @@ static int __change_page_attr_set_clr(st
 		 * CPA operation. Either a large page has been
 		 * preserved or a single page update happened.
 		 */
-		BUG_ON(cpa->numpages > numpages);
+		BUG_ON(cpa->numpages > numpages || !cpa->numpages);
 		numpages -= cpa->numpages;
 		if (cpa->flags & (CPA_PAGES_ARRAY | CPA_ARRAY))
 			cpa->curpage++;
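
The arithmetic in the commit message above can be checked with a small standalone C sketch. It is not kernel code: the function names are hypothetical, PAGE_SHIFT is assumed to be 12 (4 KiB pages), and the 32-bit case uses unsigned arithmetic so that the wraparound is well-defined C, whereas the original field was a signed int.

```c
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/*
 * Byte count as a 32-bit computation sees it. The result wraps modulo
 * 2^32, which models the truncation described in the commit message.
 */
uint32_t bytes_32bit(uint32_t numpages)
{
    return numpages << PAGE_SHIFT;
}

/* Byte count with a 64-bit page counter, as after the patch. */
uint64_t bytes_64bit(uint64_t numpages)
{
    return numpages << PAGE_SHIFT;
}

/* Number of whole pages covered by a (possibly truncated) byte count. */
uint64_t pages_covered(uint64_t bytes)
{
    return bytes >> PAGE_SHIFT;
}
```

With these helpers, bytes_32bit(1236992) gives 0x2E000000, which covers 188416 pages, and bytes_32bit(1048576) gives 0, matching the numbers in the commit message. For the 0x80000-page allocation in this report, the 32-bit result is 0x80000000, which no longer fits in a positive signed 32-bit int; that is presumably why a 2GB region is enough to trigger the bug, though the report does not spell this out.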
toracat

2016-04-20 02:01

manager   ~0026333

$ git describe 742563777e8da62197d6cb4b99f4027f59454735
v4.4-553-g7425637

The patch will be added to the centosplus kernel (kernel-plus) in the next update.
toracat

2016-11-21 18:12

manager   ~0027959

The patch is now in the 7.3 distro kernel. Therefore it has been removed from the plus kernel.

Closing as 'resolved'. If you find any issue, please submit a new ticket.

Issue History

Date Modified     Username  Field                 Change
2016-04-18 21:06  bcran     New Issue
2016-04-20 01:07  bcran     Note Added: 0026329
2016-04-20 01:43  toracat   Note Added: 0026331
2016-04-20 01:45  toracat   Status                new => assigned
2016-04-20 01:58  toracat   File Added: centos-linux-3.10-fix-truncation-bug-EFI-bug10720.patch
2016-04-20 02:01  toracat   Note Added: 0026333
2016-11-21 18:12  toracat   Note Added: 0027959
2016-11-21 18:12  toracat   Status                assigned => resolved
2016-11-21 18:12  toracat   Resolution            open => fixed