View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0017324||CentOS-7||centos-release||public||2020-05-05 00:04||2020-05-08 09:23|
|Platform||Centos 7.X||OS||Centos||OS Version||Centos 7.7|
|Summary||0017324: CPU hot add feature in ESXi causing Centos 7.X VM to crash due to race condition when free memory in guest VM is quite low.|
|Description||What problem/issue/behavior are you having trouble with? What do you expect to see?|
When free memory in a CentOS 7.7 guest VM (tested kernel version: 3.10.0-1062.12.1.el7.x86_64) running on VMware (tested on ESXi 6.7) is below roughly 110 MB to 120 MB, a CPU hot add operation can cause the VM to panic unexpectedly. The issue is not limited to 7.7; it occurs in every CentOS 7.X version. Exactly the same issue has been found in Red Hat 7.7, and Red Hat has confirmed that it is a bug that needs to be fixed. Please refer to https://bugzilla.redhat.com/show_bug.cgi?id=1819807 for the same issue on Red Hat.
Following is the log "core.txt" from CentOS 7.7 (kernel version: 3.10.0-1062.12.1.el7.x86_64) when the panic happens during the CPU hot add.
The system crashes because an invalid (NULL) pointer is dereferenced:
Vmcore.txt shows the following panic signature. All of the panics report similar symptoms.
[ 92.164060] CPU8 has been hot-added
[ 92.166979] CPU9 has been hot-added
[ 92.169032] CPU10 has been hot-added
[ 92.170138] CPU11 has been hot-added
[ 93.841222] smpboot: Booting Node 0 Processor 11 APIC 0x16
[ 93.841809] Disabled fast string operations
[ 93.842964] smpboot: CPU 11 Converting physical 22 to logical package 8
[ 93.843003] Skipped synchronization checks as TSC is reliable.
[ 93.915347] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 93.915353] IP: [<ffffffffb21a11bb>] __list_add+0x1b/0xc0
[ 93.915361] PGD 0
[ 93.915364] Oops: 0000 [#1] SMP
[ 93.915367] Modules linked in: tcp_lp fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun devlink ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmw_vsock_vmci_transport vsock sunrpc ppdev sb_edac iosf_mbi crc32_pclmul vmw_balloon ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd joydev pcspkr sg vmw_vmci i2c_piix4 parport_pc parport ip_tables xfs libcrc32c sr_mod cdrom ata_generic pata_acpi
[ 93.915401] sd_mod crc_t10dif vmwgfx crct10dif_generic drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci crct10dif_pclmul crct10dif_common crc32c_intel libahci drm ata_piix nfit serio_raw libata libnvdimm vmxnet3 vmw_pvscsi drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
[ 93.915417] CPU: 11 PID: 3568 Comm: systemd-udevd Kdump: loaded Not tainted 3.10.0-1062.12.1.el7.x86_64 #1
[ 93.915419] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
[ 93.915421] task: ffff8eaca7675230 ti: ffff8ead8c378000 task.ti: ffff8ead8c378000
[ 93.915423] RIP: 0010:[<ffffffffb21a11bb>] [<ffffffffb21a11bb>] __list_add+0x1b/0xc0
[ 93.915426] RSP: 0018:ffff8ead8c37b508 EFLAGS: 00010246
[ 93.915427] RAX: 00000000ffffffff RBX: ffff8ead8c37b530 RCX: 0000000000000000
[ 93.915429] RDX: ffff8eae2a6d80b0 RSI: 0000000000000000 RDI: ffff8ead8c37b530
[ 93.915431] RBP: ffff8ead8c37b520 R08: 0000000000000000 R09: 0000000000000002
[ 93.915433] R10: ffffffffb2b5b2c0 R11: ffffffffffffffff R12: ffff8eae2a6d80b0
[ 93.915434] R13: 0000000000000000 R14: 00000000ffffffff R15: ffff8eae2a6d80b0
[ 93.915437] FS: 00007f9d423788c0(0000) GS:ffff8eae2a6c0000(0000) knlGS:0000000000000000
[ 93.915439] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 93.915440] CR2: 0000000000000000 CR3: 000000018e32c000 CR4: 00000000001607e0
[ 93.915472] Call Trace:
[ 93.915479] [<ffffffffb257f8b6>] __mutex_lock_slowpath+0xa6/0x1d0
[ 93.915485] [<ffffffffb257ecaf>] mutex_lock+0x1f/0x2f
[ 93.915490] [<ffffffffb1fe6bab>] get_swap_page+0x9b/0x1b0
[ 93.915494] [<ffffffffb20075c9>] add_to_swap+0x19/0x80
[ 93.915499] [<ffffffffb1fd26cb>] shrink_page_list+0x69b/0xc30
[ 93.915503] [<ffffffffb1fd3286>] shrink_inactive_list+0x1c6/0x5d0
[ 93.915506] [<ffffffffb1fd3d85>] shrink_lruvec+0x385/0x740
[ 93.915509] [<ffffffffb1fd41b6>] shrink_zone+0x76/0x1a0
[ 93.915512] [<ffffffffb1fd46a0>] do_try_to_free_pages+0xf0/0x520
[ 93.915516] [<ffffffffb2024b5e>] ? ___slab_alloc+0x24e/0x4f0
[ 93.915519] [<ffffffffb1fd4bcc>] try_to_free_pages+0xfc/0x180
[ 93.915522] [<ffffffffb1fc87f1>] __alloc_pages_nodemask+0x831/0xbe0
[ 93.915527] [<ffffffffb2109700>] ? selinux_mmap_addr+0x50/0x60
[ 93.915531] [<ffffffffb2016ba8>] alloc_pages_current+0x98/0x110
[ 93.915533] [<ffffffffb20247c3>] new_slab+0x393/0x4e0
[ 93.915536] [<ffffffffb2024cbc>] ___slab_alloc+0x3ac/0x4f0
[ 93.915539] [<ffffffffb1ffa71c>] ? mmap_region+0x38c/0x670
[ 93.915542] [<ffffffffb210a3db>] ? cred_has_capability+0x6b/0x120
[ 93.915545] [<ffffffffb1ffa71c>] ? mmap_region+0x38c/0x670
[ 93.915548] [<ffffffffb257760f>] __slab_alloc+0x40/0x5c
[ 93.915550] [<ffffffffb20250db>] kmem_cache_alloc+0x19b/0x1f0
[ 93.915553] [<ffffffffb1ffa71c>] ? mmap_region+0x38c/0x670
[ 93.915555] [<ffffffffb1ffa71c>] mmap_region+0x38c/0x670
[ 93.915558] [<ffffffffb1ffad78>] do_mmap+0x378/0x530
[ 93.915560] [<ffffffffb210a9b0>] ? file_map_prot_check+0xd0/0xd0
[ 93.915563] [<ffffffffb1fddfe0>] vm_mmap_pgoff+0xd0/0x120
[ 93.915566] [<ffffffffb1ff8c26>] SyS_mmap_pgoff+0x116/0x270
[ 93.915572] [<ffffffffb1e31f12>] SyS_mmap+0x22/0x30
[ 93.915575] [<ffffffffb258dede>] system_call_fastpath+0x25/0x2a
[ 93.915577] Code: ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 89 e5 41 55 49 89 f5 41 54 49 89 d4 53 4c 8b 42 08 48 89 fb 49 39 f0 75 2a <4d> 8b 45 00 4d 39 c4 75 68 4c 39 e3 74 3e 4c 39 eb 74 39 49 89
[ 93.915597] RIP [<ffffffffb21a11bb>] __list_add+0x1b/0xc0
[ 93.915600] RSP <ffff8ead8c37b508>
[ 93.915601] CR2: 0000000000000000
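To check that several panics really share the signature shown in the log above, the faulting symbol and the call chain can be extracted from each dmesg text and compared. The following is a minimal sketch, assuming oops lines formatted like the 3.10 kernel output in this report; the function name `panic_signature` and both regular expressions are illustrative and not part of any existing tool.

```python
import re

# Matches e.g. "IP: [<ffffffffb21a11bb>] __list_add+0x1b/0xc0".
# The \b keeps "RIP: ..." lines from matching.
IP_RE = re.compile(r"\bIP: \[<[0-9a-f]+>\] ([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Matches one call-trace frame. Group 1 is "? " when the kernel marks the
# frame as unreliable; group 2 is the symbol name.
FRAME_RE = re.compile(
    r"\] \[<[0-9a-f]+>\] (\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def panic_signature(log_text):
    """Return (faulting_symbol, reliable_frames) for the first oops in log_text."""
    faulting = None
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if faulting is None:
            match = IP_RE.search(line)
            if match:
                faulting = match.group(1)
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            match = FRAME_RE.search(line)
            if match is None:
                # First non-frame line (e.g. "Code: ...") ends the trace.
                in_trace = False
            elif not match.group(1):
                frames.append(match.group(2))
    return faulting, frames
```

Two logs whose return values are equal hit the same code path; frames prefixed with `?` are skipped because the kernel itself marks them as unreliable.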
Where are you experiencing the behavior? What environment?
Encountered in production and reproducible in a lab setup.
When does the behavior occur? Frequency? Repeatedly? At certain times?
Consistent failure. It appears to be a race condition that can occur whenever a CPU hot add is performed. When memory pressure is created (refer to the document for steps to reproduce), the panic occurs consistently.
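Because the panic depends on low free memory, a guest-side check before triggering the hot add can at least flag the risky condition. This is a minimal sketch, not a fix: the 120 MB threshold is simply the upper end of the range observed in this report (an assumption, not a vendor-published value), and the helper names are hypothetical.

```python
# Assumption: 120 MB is the upper end of the free-memory range in which the
# panic was observed in this report; it is not an official threshold.
RISKY_FREE_KB = 120 * 1024


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def hot_add_looks_risky(meminfo_text, threshold_kb=RISKY_FREE_KB):
    """Return True when MemFree is at or below the threshold."""
    return parse_meminfo(meminfo_text)["MemFree"] <= threshold_kb


def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory statistics from inside the guest."""
    with open(path) as handle:
        return handle.read()
```

Inside the guest, `hot_add_looks_risky(read_meminfo())` returning `True` would be a reason to free memory or postpone the hot add. Since the report describes a race, passing the check reduces exposure but does not guarantee safety.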
What information can you provide around timeframes and the business impact?
This has a significant business impact, as it prevents the safe use of the CPU hot add feature for guests running CentOS.
|Tags||No tags attached.|
https://bugzilla.redhat.com/show_bug.cgi?id=1819807 is invisible to us. All RH kernel bugs are marked as private and are readable only by the reporter and Red Hat employees.
CentOS is a rebuild of the public sources for RHEL. Did Red Hat say in what kernel version this fix was made? Or give an RHSA-yyyy:nnnn number for the fix? Has it been released publicly?
The current kernel for CentOS 7 is 3.10.0-1127.el7.x86_64 - did you try to recreate the problem on the newer version?
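When retesting, it helps to confirm which kernel a guest is actually running relative to the one named above. The sketch below compares release strings numerically; it assumes the `N.N.N-N[.N...].elN.arch` layout seen in this thread, the helper names are hypothetical, and a newer kernel is not by itself evidence that the fix is present.

```python
import re


def kernel_key(release):
    """'3.10.0-1062.12.1.el7.x86_64' -> (3, 10, 0, 1062, 12, 1)."""
    # Keep only the numeric version-release part in front of ".elN".
    numeric_part = release.split(".el", 1)[0]
    return tuple(int(piece) for piece in re.split(r"[.-]", numeric_part))


def is_older(release, reference):
    """Return True when `release` sorts strictly before `reference`."""
    return kernel_key(release) < kernel_key(reference)
```

On a running guest the first argument could come from `platform.release()`; for example, `is_older("3.10.0-1062.12.1.el7.x86_64", "3.10.0-1127.el7.x86_64")` evaluates to `True`.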
Here is what Red Hat recently commented. Please check it and apply the fix to all CentOS 7.X versions.
>>> -Do you mean there is an official patch that I would need to download, to fix this in RHEL 8.0 ? Is that published or would I need to request that patch?
Starting from RHEL8.0 this patch is already present in it. Because RHEL8 was built on top of upstream 4.18 kernel
the patch which fixes this issue already got backported in RHEL8.
>>> -Could an official fix be available within the next couple of months?
Currently its planned for RHEL7.9, and I can also see a request been raised to backport
this in RHEL7.8.z [Z-stream fix] which could come early than RHEL7.9
>>> --Please consider a fix in RHEL 7.x.
Yes fix will be for RHEL7.x itself
>>> -I suggest the issue itself is important to fix soon - because it results in a crash, during an operation that is expected to be safe and avoid downtime (the hot CPU add).
Yes, I do see this reproduces on every cpu hot-add, however the memory usage has to be very tight for this to reproduce.
Hence its very unlikely that this happens in production, since before hot-adding cpu one definitely observes the memory
usage and other resource utilization. However this is an important patch.
I will keep you posted with further updates.
Can I expect CentOS to do the same thing as Red Hat? Will the fix be included in CentOS 7.9, as in RHEL 7.9? What about versions prior to 7.9? Again, the bug is present throughout the CentOS 7.X versions; I have already tested CentOS 7.5, 7.6, and 7.7, and all of them show the problem.
It seems that you do not know a few things about CentOS so let me clarify them for you.
First of all, CentOS Linux is built from the sources published by RedHat and used by RedHat to build RHEL. One of the aims of CentOS is to reproduce RHEL as well as possible, therefore in stock CentOS you will find all the bugs that exist in the equivalent RHEL version. OTOH, any fix that RH introduces in RHEL lands automatically in CentOS as well. So, to answer your first question, the fix will be included in CentOS if and when RedHat includes it. If they do it as an update in the cycle of the current RHEL 7.8, it will land immediately in CentOS 7.8.2003. If they postpone to 7.9, it will be included in CentOS 7.9.
Second, CentOS does not have the resources to support anything but the current major.minor release (and, as a consequence, does not support any of the older minor releases). And, to cut short further questions, CentOS cannot do anything because the sources for the Z-stream (that is, updates for older RHEL minor releases) are not publicly available. If you need support for something older than what is (at any given moment) current in CentOS Linux, we strongly encourage you to purchase a subscription for RHEL EUS (Extended Update Support), which does exactly that: it offers some updates for selected parts of older minor releases.
In your case you have the following ways forward:
- switch to CentOS 8 which already includes the fix
- wait for 7.9 where the fix will be included, according to what you relayed to us
- persuade RH to include the fix NOW in whatever version you have in use
The other alternative is to isolate the fix yourself from the kernel used in CentOS 8 and backport it as a patch for the kernel used in CentOS 7.8.2003 (which is probably what RH will do for the kernel that will be included in RHEL 7.9). In this case CentOS might (note the difference between "might" and "will") be able to publish a modified centosplus kernel even before 7.9 lands.