View Issue Details

ID: 0015216
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2019-09-22 15:28
Reporter: wyukawa
Priority: normal
Severity: crash
Reproducibility: random
Status: assigned
Resolution: open
Platform: Linux
OS: CentOS
OS Version: 7.2.1511
Product Version: 7.2.1511
Target Version:
Fixed in Version:
Summary: 0015216: 3.10.0-862.9.1.el7.x86_64 kernel panic and crash under hadoop environment
Description: We have about 30 Hadoop nodes (DataNode, NodeManager, Presto worker) running kernel 3.10.0-862.9.1.el7.x86_64, and these nodes randomly panic and crash.
We recently set up kdump and can now capture a vmcore.

Here is the output of the crash command:
```
crash> sys
      KERNEL: /usr/lib/debug/lib/modules/3.10.0-862.9.1.el7.x86_64/vmlinux
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 40
        DATE: Fri Aug 24 03:27:27 2018
      UPTIME: 13 days, 19:27:57
LOAD AVERAGE: 21.81, 18.43, 15.18
       TASKS: 2640
    NODENAME: SOMEHOST
     RELEASE: 3.10.0-862.9.1.el7.x86_64
     VERSION: #1 SMP Mon Jul 16 16:29:36 UTC 2018
     MACHINE: x86_64 (2199 Mhz)
      MEMORY: 255.9 GB
       PANIC: "BUG: unable to handle kernel paging request at 0000000025fc6f8b"
crash> bt
PID: 139574 TASK: ffff8e84016c8fd0 CPU: 23 COMMAND: "java"
 #0 [ffff8e842c88f618] machine_kexec at ffffffffb186178a
 #1 [ffff8e842c88f678] __crash_kexec at ffffffffb1913bf2
 #2 [ffff8e842c88f748] crash_kexec at ffffffffb1913ce0
 #3 [ffff8e842c88f760] oops_end at ffffffffb1f18738
 #4 [ffff8e842c88f788] no_context at ffffffffb1f0807e
 #5 [ffff8e842c88f7d8] __bad_area_nosemaphore at ffffffffb1f08115
 #6 [ffff8e842c88f828] bad_area_nosemaphore at ffffffffb1f08286
 #7 [ffff8e842c88f838] __do_page_fault at ffffffffb1f1b6f0
 #8 [ffff8e842c88f8a0] do_page_fault at ffffffffb1f1b8e5
 #9 [ffff8e842c88f8d0] page_fault at ffffffffb1f17758
    [exception RIP: find_busiest_group+869]
    RIP: ffffffffb18dc645 RSP: ffff8e842c88f980 RFLAGS: 00010807
    RAX: 0000000025fc7000 RBX: 0000000000000001 RCX: 0000000000000498
    RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000001
    RBP: ffff8e842c88fae8 R8: 00000000000006f7 R9: 0000000000000002
    R10: 000000000000049a R11: 0000000000000000 R12: ffff8e842c88fb48
    R13: ffff8e842c88f9b8 R14: 0000000000000000 R15: 0000000000000800
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff8e842c88faf0] load_balance at ffffffffb18dcde8
#11 [ffff8e842c88fbd8] idle_balance at ffffffffb18ddc71
#12 [ffff8e842c88fc30] __schedule at ffffffffb1f13f5f
#13 [ffff8e842c88fcb0] schedule at ffffffffb1f14029
#14 [ffff8e842c88fcc0] futex_wait_queue_me at ffffffffb1903d46
#15 [ffff8e842c88fd00] futex_wait at ffffffffb1904a2b
#16 [ffff8e842c88fe48] do_futex at ffffffffb1906786
#17 [ffff8e842c88fed8] sys_futex at ffffffffb1906ca0
#18 [ffff8e842c88ff50] system_call_fastpath at ffffffffb1f20795
    RIP: 00007fe50b5a1a82 RSP: 00007fe4cfaef5a0 RFLAGS: 00010206
    RAX: 00000000000000ca RBX: 0000000000000005 RCX: ffffffffff9882ec
    RDX: 00000000000005ad RSI: 0000000000000089 RDI: 00007fe504134954
    RBP: 00007fe4cfaf2ba0 R8: 00007fe504134928 R9: 00000000ffffffff
    R10: 00007fe4cfaf2b60 R11: 0000000000000202 R12: 00000000000005ad
    R13: 00007fe4cfaf2b60 R14: ffffffffffffff92 R15: 00007fe504134900
    ORIG_RAX: 00000000000000ca CS: 0033 SS: 002b
```

PID 139574 is an ordinary Hive job.

Any suggestions?
Tags: kernel panic
abrt_hash:
URL:

Activities

toracat

2018-08-27 03:28

manager   ~0032594

What was the last kernel version that did not have this problem?
wyukawa

2018-08-27 03:54

reporter   ~0032595

Thank you for the comment.

I'm not sure which version did not have the problem, because we only started hitting this issue recently.

We ran 3.10.0-514.2.2.el7.x86_64 for about 1.5 years after we created the Hadoop clusters,
but we have been hitting kernel panics frequently since about June this year.
I guess the problem is related to the increase in data and load.

We recently upgraded the kernel to 3.10.0-862.9.1.el7.x86_64, but unfortunately that did not resolve it.
wyukawa

2018-08-27 06:57

reporter   ~0032596

Adding the crash log:

```
crash> log
[1193296.151357] BUG: unable to handle kernel paging request at 0000000025fc6f8b
[1193296.151412] IP: [<ffffffffb18dc645>] find_busiest_group+0x365/0x990
[1193296.151450] PGD 80000020bc703067 PUD 2212f7a067 PMD 0
[1193296.151477] Oops: 0002 [#1] SMP
[1193296.151495] Modules linked in: iptable_mangle iptable_filter mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase vfat fat uas usb_storage dell_rbu binfmt_misc bonding dsa_filter(POE) dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt iTCO_vendor_support mxm_wmi dcdbas intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg mei_me lpc_ich mei ipmi_si shpchp ipmi_devintf ipmi_msghandler wmi acpi_power_meter ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ixgbe ahci libahci crct10dif_pclmul crct10dif_common igb crc32c_intel libata megaraid_sas mdio i2c_algo_bit ptp i2c_core pps_core dca
[1193296.151873] CPU: 23 PID: 139574 Comm: java Kdump: loaded Tainted: P OE ------------ 3.10.0-862.9.1.el7.x86_64 #1
[1193296.151917] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.1.7 06/16/2016
[1193296.151948] task: ffff8e84016c8fd0 ti: ffff8e842c88c000 task.ti: ffff8e842c88c000
[1193296.151978] RIP: 0010:[<ffffffffb18dc645>] [<ffffffffb18dc645>] find_busiest_group+0x365/0x990
[1193296.152016] RSP: 0018:ffff8e842c88f980 EFLAGS: 00010807
[1193296.152038] RAX: 0000000025fc7000 RBX: 0000000000000001 RCX: 0000000000000498
[1193296.152067] RDX: 0000000000000002 RSI: 0000000000000002 RDI: 0000000000000001
[1193296.152096] RBP: ffff8e842c88fae8 R08: 00000000000006f7 R09: 0000000000000002
[1193296.152125] R10: 000000000000049a R11: 0000000000000000 R12: ffff8e842c88fb48
[1193296.152153] R13: ffff8e842c88f9b8 R14: 0000000000000000 R15: 0000000000000800
[1193296.152183] FS: 00007fe4cfaf3700(0000) GS:ffff8ec1be0c0000(0000) knlGS:0000000000000000
[1193296.152232] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1193296.152256] CR2: 0000000025fc6f8b CR3: 000000275d54a000 CR4: 00000000003607e0
[1193296.152285] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[1193296.152314] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[1193296.152342] Call Trace:
[1193296.152360] [<ffffffffb18dcde8>] load_balance+0x178/0x9a0
[1193296.152391] [<ffffffffb18ddc71>] idle_balance+0x1d1/0x250
[1193296.152419] [<ffffffffb1f13f5f>] __schedule+0x97f/0xa20
[1193296.152444] [<ffffffffb18bf500>] ? hrtimer_start_range_ns+0x1d0/0x3c0
[1193296.152472] [<ffffffffb1f14029>] schedule+0x29/0x70
[1193296.152496] [<ffffffffb1903d46>] futex_wait_queue_me+0xc6/0x130
[1193296.152523] [<ffffffffb1904a2b>] futex_wait+0x17b/0x280
[1193296.152547] [<ffffffffb18bf060>] ? hrtimer_get_res+0x50/0x50
[1193296.152571] [<ffffffffb1903d24>] ? futex_wait_queue_me+0xa4/0x130
[1193296.152598] [<ffffffffb1906786>] do_futex+0x106/0x5a0
[1193296.152621] [<ffffffffb1906ca0>] SyS_futex+0x80/0x180
[1193296.152646] [<ffffffffb1f20795>] system_call_fastpath+0x1c/0x21
[1193296.152671] Code: 00 44 0f 44 cf 41 39 f1 45 89 4d 2c 72 17 48 8b bd a0 fe ff ff 48 8b 57 10 44 8b 4a 18 31 d2 45 85 c9 0f 95 c2 41 89 55 38 39 95 <78> ff ff ff 0f 82 21 03 00 00 77 1f 4c 3b 85 40 ff ff ff 76 16
[1193296.152824] RIP [<ffffffffb18dc645>] find_busiest_group+0x365/0x990
[1193296.152853] RSP <ffff8e842c88f980>
[1193296.152870] CR2: 0000000025fc6f8b
```
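A note for readers decoding the log above: on x86, the hexadecimal value printed after `Oops:` is the page-fault error code, and its low bits say what kind of access faulted. The sketch below only illustrates that decoding; the helper name `decode_oops_code` is made up for this note and is not part of crash or the kernel.

```python
# Decode the x86 page-fault error code that the kernel prints as "Oops: 0002".
# Each entry: (bit mask, meaning when set, meaning when clear or None to skip).
PF_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set in page table", None),
    (0x10, "instruction fetch", None),
]


def decode_oops_code(code_hex):
    """Return a list of human-readable descriptions for the error code."""
    code = int(code_hex, 16)
    described = []
    for mask, if_set, if_clear in PF_BITS:
        if code & mask:
            described.append(if_set)
        elif if_clear is not None:
            described.append(if_clear)
    return described


# Value taken from the "Oops: 0002 [#1] SMP" line in this report.
print(decode_oops_code("0002"))
```

For `0002` this yields a not-present page, a write access, and kernel mode. That describes the faulting access only; it does not explain why the address was bad.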
pgreco

2018-08-27 09:33

developer   ~0032597

Can you check with kernel 3.10.0-862.11.6 instead of 862.9.1? There were a couple of changes to CPU handling, and I'm interested in seeing whether they help.
If it still crashes, please test with our centos-plus kernel.
wyukawa

2018-08-28 02:54

reporter   ~0032604

Thank you for the comment.

I guess this issue is related to the Completely Fair Scheduler (CFS), given [exception RIP: find_busiest_group+869].
Is that right?
However, there seems to be no difference in sched/fair.c between kernel 3.10.0-862.9.1 and kernel 3.10.0-862.11.6.
Could you point me to the changes to CPU handling in kernel 3.10.0-862.11.6?
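As a side note, the two outputs in this report name the same instruction in different notations: the `bt` output shows `find_busiest_group+869` (decimal offset), while the kernel log shows `find_busiest_group+0x365/0x990` (hexadecimal offset, then function size). A small sketch to confirm they agree; the `parse_symbol_offset` helper is hypothetical and written only for this illustration.

```python
import re


def parse_symbol_offset(text):
    """Parse the kernel-log form 'name+0xOFF/0xSIZE' into (name, offset, size)."""
    m = re.fullmatch(r"(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)", text)
    if m is None:
        raise ValueError(f"unrecognised symbol reference: {text!r}")
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


# String copied from the "IP:" line of the kernel log in this report.
name, offset, size = parse_symbol_offset("find_busiest_group+0x365/0x990")

# The decimal offset printed by "bt" for the same crash.
BT_OFFSET = 869

print(name, offset, size, offset == BT_OFFSET)
```

Both notations resolve to the same RIP, ffffffffb18dc645, so the backtrace and the log describe one fault, not two.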
pgreco

2018-08-28 09:54

developer   ~0032610

Last edited: 2018-08-28 09:54


@wyukawa
All the reports I could find online associate your bug (yes, that find_busiest_group crash) with cpu online/hotplug, so what I had in mind was bug https://bugs.centos.org/view.php?id=15108, which was fixed between 862.9.1 and 862.11.6

wyukawa

2018-08-30 04:16

reporter   ~0032624

OK, I will upgrade the kernel to 3.10-862.11.6.
Thanks
wyukawa

2018-09-05 02:49

reporter   ~0032646

@pgreco

Unfortunately, the kernel panic occurred again even after we upgraded the kernel to 3.10.0-862.11.6.el7.x86_64.

Do you have any suggestions, even though the error message is different this time?

Crash log:
```
crash> sys
      KERNEL: /usr/lib/debug/lib/modules/3.10.0-862.11.6.el7.x86_64/vmlinux
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 40
        DATE: Tue Sep 4 17:41:42 2018
      UPTIME: 5 days, 03:18:19
LOAD AVERAGE: 18.55, 18.31, 11.82
       TASKS: 3226
    NODENAME: SOMEHOST
     RELEASE: 3.10.0-862.11.6.el7.x86_64
     VERSION: #1 SMP Tue Aug 14 21:49:04 UTC 2018
     MACHINE: x86_64 (2199 Mhz)
      MEMORY: 255.9 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 000000000000001b"
crash> bt
PID: 83135 TASK: ffff9136cbf9bf40 CPU: 14 COMMAND: "java"
 #0 [ffff9121fe7c3c08] machine_kexec at ffffffffa2c629da
 #1 [ffff9121fe7c3c68] __crash_kexec at ffffffffa2d16692
 #2 [ffff9121fe7c3d38] crash_kexec at ffffffffa2d16780
 #3 [ffff9121fe7c3d50] oops_end at ffffffffa331d738
 #4 [ffff9121fe7c3d78] no_context at ffffffffa330c6cd
 #5 [ffff9121fe7c3dc8] __bad_area_nosemaphore at ffffffffa330c764
 #6 [ffff9121fe7c3e18] bad_area_nosemaphore at ffffffffa330c8d5
 #7 [ffff9121fe7c3e28] __do_page_fault at ffffffffa33206f0
 #8 [ffff9121fe7c3e90] do_page_fault at ffffffffa33208e5
 #9 [ffff9121fe7c3ec0] page_fault at ffffffffa331c758
    [exception RIP: __hrtimer_get_next_event+52]
    RIP: ffffffffa2cc1b74 RSP: ffff9121fe7c3f70 RFLAGS: 00010002
    RAX: 0000000000000001 RBX: 0000000000000003 RCX: 000193b96f76cc40
    RDX: 000193b9833e644d RSI: ffff9121fe7d3a30 RDI: ffff9121fe7d3b20
    RBP: ffff9121fe7c3f70 R8: 000193b96f678a00 R9: ffff9121fe7c3de0
    R10: 0000000000000000 R11: ffff9121fe7c3de8 R12: ffff9121fe7d39e0
    R13: ffff9121fe7d3a98 R14: 0000000300000001 R15: ffff9121fe7d3b18
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#10 [ffff9121fe7c3f78] hrtimer_interrupt at ffffffffa2cc26c7
#11 [ffff9121fe7c3fc0] local_apic_timer_interrupt at ffffffffa2c5967b
#12 [ffff9121fe7c3fd8] smp_apic_timer_interrupt at ffffffffa332a083
#13 [ffff9121fe7c3ff0] apic_timer_interrupt at ffffffffa33267b2
--- <IRQ stack> ---
#14 [ffff912748ed3f58] apic_timer_interrupt at ffffffffa33267b2
    RIP: 00007fa287b24139 RSP: 00007fa284ed3638 RFLAGS: 00000206
    RAX: 0000000000000001 RBX: 00007fd43de4b000 RCX: 000000000000eb0f
    RDX: 0000000000001000 RSI: 000000000000001c RDI: 00007fa28005fb68
    RBP: 00007fa284ed3680 R8: 000000000000001c R9: 000000000006c5fb
    R10: 0000000000000019 R11: 0000000000000246 R12: ffff9121fe7c61e8
    R13: 00000000000000c0 R14: 00007fa254bd32f0 R15: 0000000000000000
    ORIG_RAX: ffffffffffffff10 CS: 0033 SS: 002b
```
pgreco

2018-09-05 10:20

developer   ~0032650

@toracat, this looks like a possible fix: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/patch/?id=34a938cd3ad45b5c6d67d83329ce1dcf239863a8
Can you build a plus kernel with it so that @wyukawa can test?
The patch applies with a 7-line offset to 3.10.0-862.11.6.
toracat

2018-09-05 14:35

manager   ~0032652

@wyukawa

A set of kernel-plus packages with the patch suggested by @pgreco is now available from:

https://people.centos.org/toracat/kernel/7/plus/bug15216/

Can you test and let us know the result?
wyukawa

2018-09-06 08:46

reporter   ~0032659

OK, I will install 3.10.0-862.11.6.el7.centos.plus.2.x86_64 next week.
Thanks
wyukawa

2018-09-10 06:13

reporter   ~0032690

Today we have just upgraded the kernel to 3.10.0-862.11.6.el7.centos.plus.2.x86_64.
We are going to wait and see.
pgreco

2018-09-10 10:35

developer   ~0032691

Thanks for the update!
How long does it usually take to trigger the problem?
wyukawa

2018-09-11 01:33

reporter   ~0032693

We need to wait at least a week or so to see.
toracat

2018-09-25 22:55

manager   ~0032791

@wyukawa

What is the verdict?
wyukawa

2018-09-26 07:25

reporter   ~0032799

@toracat

Unfortunately, the error occurred again today, but we can't analyze the vmcore.
Do we need the kernel-plus debuginfo package, such as kernel-plus-debuginfo-3.10.0-862.11.6.el7.centos.plus.2.x86_64.rpm?

```
# crash /usr/lib/debug/lib/modules/3.10.0-862.11.6.el7.x86_64/vmlinux vmcore

crash 7.2.0-6.el7
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [888MB]: patching 82660 gdb minimal_symbol values

crash: page excluded: kernel virtual address: ffffffffffffffff type: "possible"
WARNING: cannot read cpu_possible_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "present"
WARNING: cannot read cpu_present_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "online"
WARNING: cannot read cpu_online_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "active"
WARNING: cannot read cpu_active_map
crash: /usr/lib/debug/lib/modules/3.10.0-862.11.6.el7.x86_64/vmlinux and vmcore do not match!

Usage:

  crash [OPTION]... NAMELIST MEMORY-IMAGE[@ADDRESS] (dumpfile form)
  crash [OPTION]... [NAMELIST] (live system form)

Enter "crash -h" for details.
```
toracat

2018-09-26 07:40

manager   ~0032800

@wyukawa

You would need a matching kernel-debuginfo package. But since the patched plus kernel did not help, it's best to do the analysis with the official distro kernel.
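To make "matching" concrete: the vmlinux handed to crash has to come from exactly the release that produced the vmcore. The sketch below is an illustration, not an official tool. It assumes that kernel-plus builds can be recognised by `.centos.plus` in the release string and that debuginfo follows the package and path naming seen elsewhere in this report; `debuginfo_for` and `matches` are hypothetical helper names.

```python
def debuginfo_for(release):
    """Guess the debuginfo package name and vmlinux path for a kernel release.

    Assumption: kernel-plus builds contain '.centos.plus' in the release string.
    """
    if ".centos.plus" in release:
        prefix = "kernel-plus-debuginfo"
    else:
        prefix = "kernel-debuginfo"
    package = f"{prefix}-{release}"
    vmlinux = f"/usr/lib/debug/lib/modules/{release}/vmlinux"
    return package, vmlinux


def matches(vmcore_release, vmlinux_path):
    """True only when the vmlinux path belongs to the release that crashed."""
    return vmlinux_path == debuginfo_for(vmcore_release)[1]


# Release the crashed node was running, and the vmlinux that was tried.
crashed = "3.10.0-862.11.6.el7.centos.plus.2.x86_64"
tried = "/usr/lib/debug/lib/modules/3.10.0-862.11.6.el7.x86_64/vmlinux"

print(debuginfo_for(crashed))
print(matches(crashed, tried))
```

A vmcore from the `.centos.plus.2` kernel paired with the stock `862.11.6.el7` vmlinux is reported as a mismatch here, which is in line with the "vmlinux and vmcore do not match!" message above.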
wyukawa

2018-09-26 07:56

reporter   ~0032801

@toracat

What should I install?
Here is the test machine where I analyze the vmcore:

```
# uname -r
3.10.0-862.11.6.el7.centos.plus.2.x86_64
# yum list installed | grep kernel
abrt-addon-kerneloops.x86_64 2.1.11-50.el7.centos @anaconda
kernel.x86_64 3.10.0-862.el7 @anaconda
kernel.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel.x86_64 3.10.0-862.11.6.el7 @update
kernel-debuginfo.x86_64 3.10.0-862.11.6.el7 @debuginfo
kernel-debuginfo-common-x86_64.x86_64
kernel-devel.x86_64 3.10.0-862.el7 @anaconda
kernel-devel.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel-headers.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel-plus.x86_64 3.10.0-862.11.6.el7.centos.plus.1
kernel-plus.x86_64 3.10.0-862.11.6.el7.centos.plus.2
kernel-tools.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel-tools-libs.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
```
toracat

2018-09-27 15:41

manager   ~0032813

@wyukawa

kernel-plus-3.10.0-862.14.4.el7.centos.plus will be released soon. Once it and its debuginfo package are available, you may want to give it a try.
toracat

2018-09-28 17:39

manager   ~0032821

@wyukawa

kernel-plus-3.10.0-862.14.4.el7.centos.plus has been released. Please give it a try.
wyukawa

2018-10-03 01:13

reporter   ~0032843

@toracat

I moved the vmcore file to the test machine, where I installed kernel-plus-3.10.0-862.14.4.el7.centos.plus including debuginfo, but the crash command still did not work.

Do I need to install kernel-plus-3.10.0-862.14.4.el7.centos.plus on the real machine and wait for the kernel panic to reproduce?

```
# uname -r
3.10.0-862.14.4.el7.centos.plus.x86_64
# yum list installed | grep kernel
abrt-addon-kerneloops.x86_64 2.1.11-50.el7.centos @anaconda
kernel.x86_64 3.10.0-862.el7 @anaconda
kernel.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel.x86_64 3.10.0-862.11.6.el7 @update
kernel-debuginfo.x86_64 3.10.0-862.14.4.el7 @debuginfo
kernel-debuginfo-common-x86_64.x86_64
kernel-devel.x86_64 3.10.0-862.el7 @anaconda
kernel-devel.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel-headers.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel-plus.x86_64 3.10.0-862.11.6.el7.centos.plus.1
kernel-plus.x86_64 3.10.0-862.11.6.el7.centos.plus.2
kernel-plus.x86_64 3.10.0-862.14.4.el7.centos.plus
kernel-tools.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel-tools-libs.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
# crash /usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/vmlinux vmcore

crash 7.2.0-6.el7
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [888MB]: patching 82671 gdb minimal_symbol values

crash: page excluded: kernel virtual address: ffffffffffffffff type: "possible"
WARNING: cannot read cpu_possible_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "present"
WARNING: cannot read cpu_present_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "online"
WARNING: cannot read cpu_online_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "active"
WARNING: cannot read cpu_active_map
crash: /usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.x86_64/vmlinux and vmcore do not match!

Usage:

  crash [OPTION]... NAMELIST MEMORY-IMAGE[@ADDRESS] (dumpfile form)
  crash [OPTION]... [NAMELIST] (live system form)

Enter "crash -h" for details.
```
toracat

2018-10-03 02:31

manager   ~0032844

@wyukawa

You need a debuginfo package for your running kernel, 3.10.0-862.14.4.el7.centos.plus. It is:

http://debuginfo.centos.org/7/x86_64/kernel-plus-debuginfo-3.10.0-862.14.4.el7.centos.plus.x86_64.rpm
wyukawa

2018-10-03 07:39

reporter   ~0032845

@toracat

I installed kernel-plus-debuginfo 3.10.0-862.14.4.el7.centos.plus on the test machine, but the crash command still did not work.
Any suggestions?

Here is the test machine information:

```
# yum list installed | grep kernel
abrt-addon-kerneloops.x86_64 2.1.11-50.el7.centos @anaconda
kernel.x86_64 3.10.0-862.el7 @anaconda
kernel.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel.x86_64 3.10.0-862.11.6.el7 @update
kernel-devel.x86_64 3.10.0-862.el7 @anaconda
kernel-devel.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel-headers.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel-plus.x86_64 3.10.0-862.11.6.el7.centos.plus.1
kernel-plus.x86_64 3.10.0-862.11.6.el7.centos.plus.2
kernel-plus.x86_64 3.10.0-862.14.4.el7.centos.plus
kernel-plus-debuginfo.x86_64 3.10.0-862.14.4.el7.centos.plus
kernel-plus-debuginfo-common-x86_64.x86_64
kernel-tools.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804
kernel-tools-libs.x86_64 3.10.0-862.9.1.el7 @update/7.5.1804

# crash /usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.centos.plus.x86_64/vmlinux vmcore

crash 7.2.0-6.el7
Copyright (C) 2002-2017 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006, 2010 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006, 2011, 2012 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005, 2011 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for details.

GNU gdb (GDB) 7.6
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

WARNING: kernel relocated [888MB]: patching 83024 gdb minimal_symbol values

crash: page excluded: kernel virtual address: ffffffffffffffff type: "possible"
WARNING: cannot read cpu_possible_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "present"
WARNING: cannot read cpu_present_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "online"
WARNING: cannot read cpu_online_map
crash: page excluded: kernel virtual address: ffffffffffffffff type: "active"
WARNING: cannot read cpu_active_map
crash: /usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.centos.plus.x86_64/vmlinux and vmcore do not match!

Usage:

  crash [OPTION]... NAMELIST MEMORY-IMAGE[@ADDRESS] (dumpfile form)
  crash [OPTION]... [NAMELIST] (live system form)

Enter "crash -h" for details.
```


By the way, here is the real machine information. I copied the vmcore file from the real machine to the test machine and ran the crash command there.

```
$ uname -r
3.10.0-862.11.6.el7.centos.plus.2.x86_64
$ yum list installed | grep kernel
Repodata is over 2 weeks old. Install yum-cron? Or run: yum makecache fast
abrt-addon-kerneloops.x86_64 2.1.11-36.el7.centos @anaconda
kernel.x86_64 3.10.0-327.el7 @anaconda
kernel.x86_64 3.10.0-327.36.3.el7 @update
kernel.x86_64 3.10.0-514.2.2.el7 @update
kernel.x86_64 3.10.0-862.9.1.el7 @update
kernel.x86_64 3.10.0-862.11.6.el7 @update
kernel-debuginfo.x86_64 3.10.0-862.9.1.el7 @debuginfo
kernel-debuginfo-common-x86_64.x86_64 3.10.0-862.9.1.el7 @debuginfo
kernel-devel.x86_64 3.10.0-327.el7 @anaconda
kernel-devel.x86_64 3.10.0-327.36.3.el7 @update
kernel-devel.x86_64 3.10.0-514.2.2.el7 @update
kernel-headers.x86_64 3.10.0-514.2.2.el7 @update
kernel-plus.x86_64 3.10.0-862.11.6.el7.centos.plus.2
                                                                      @/kernel-plus-3.10.0-862.11.6.el7.centos.plus.2.x86_64
kernel-tools.x86_64 3.10.0-327.36.3.el7 @update
kernel-tools-libs.x86_64 3.10.0-327.36.3.el7 @update
```
DaveLaneCA

2018-10-04 12:05

reporter   ~0032858

If it helps, I just installed CentOS 7 from a few-month-old ISO on a KVM virtual machine (CentOS 6 host). After the updates were done, including kernel 3.10.0-862.14.4.el7.x86_64, the machine kernel panics when starting the kernel at boot. If I boot from the original kernel (3.10.0-693.el7.x86_64), it's fine.

In the KVM config, I tried changing the CPU from the native SandyBridge-IBRS to the generic kvm64, and the updated kernel boots fine.

Dave
dijuremo@gmail.com

2018-10-06 15:30

reporter   ~0032869

@DaveLaneCA, I have the same issue with the latest kernel 3.10.0-862.14.4.el7.x86_64, per:

https://bugs.centos.org/view.php?id=15358

I cannot update past 3.10.0-693.21.1.el7.x86_64.
pgreco

2018-10-15 21:31

developer   ~0032923

@wyukawa have you managed to replicate the crash with the right debuginfo?
wyukawa

2018-10-16 03:17

reporter   ~0032925

@pgreco

No.
The kernel panic has occurred three times since we upgraded the kernel to 3.10.0-862.11.6.el7.centos.plus.2.

We copied the vmcore to the test machine where we installed kernel-plus-3.10.0-862.14.4.el7.centos.plus, but we can't run the crash command on it there.

Do you have any suggestions?

For example, do we need to install kernel-plus-3.10.0-862.14.4.el7.centos.plus on the real machine and wait for the kernel panic to reproduce?

If so, I'm worried about kernel-plus-3.10.0-862.14.4.el7.centos.plus because of the issue @DaveLaneCA mentioned above.
pgreco

2018-10-17 09:32

developer   ~0032932

@wyukawa, yes, please try to replicate the crash (anywhere you can) with kernel-plus-3.10.0-862.14.4.el7.centos.plus.
The problem @DaveLaneCA mentions (IIUC) applies to all 862* kernels, which fail to boot, so it is not the same problem as yours.
wyukawa

2018-10-22 08:15

reporter   ~0032960

Today we have just upgraded the kernel to 3.10.0-862.14.4.el7.centos.plus.x86_64.
We are going to wait and see.
wyukawa

2018-10-25 03:52

reporter   ~0032985

@toracat

Unfortunately, the kernel panic has occurred on three nodes since we upgraded to 3.10.0-862.14.4.el7.centos.plus.x86_64.

Here is the crash log.

```
crash> sys
      KERNEL: /usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.centos.plus.x86_64/vmlinux
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 40
        DATE: Thu Oct 25 10:51:57 2018
      UPTIME: 2 days, 22:12:39
LOAD AVERAGE: 6.98, 7.47, 12.47
       TASKS: 2397
    NODENAME: SOMEHOST
     RELEASE: 3.10.0-862.14.4.el7.centos.plus.x86_64
     VERSION: #1 SMP Fri Sep 28 05:34:05 UTC 2018
     MACHINE: x86_64 (2200 Mhz)
      MEMORY: 255.9 GB
       PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008"
crash> bt
PID: 60464 TASK: ffff987fdd6faf70 CPU: 33 COMMAND: "java"
 #0 [ffff98a73e203bd8] machine_kexec at ffffffff9a262e6a
 #1 [ffff98a73e203c38] __crash_kexec at ffffffff9a3166c2
 #2 [ffff98a73e203d08] crash_kexec at ffffffff9a3167b0
 #3 [ffff98a73e203d20] oops_end at ffffffff9a92e728
 #4 [ffff98a73e203d48] no_context at ffffffff9a91d84d
 #5 [ffff98a73e203d98] __bad_area_nosemaphore at ffffffff9a91d8e4
 #6 [ffff98a73e203de8] bad_area_nosemaphore at ffffffff9a91da55
 #7 [ffff98a73e203df8] __do_page_fault at ffffffff9a9316e0
 #8 [ffff98a73e203e60] do_page_fault at ffffffff9a9318d5
 #9 [ffff98a73e203e90] page_fault at ffffffff9a92d758
    [exception RIP: clockevents_program_event+57]
    RIP: ffffffff9a302339 RSP: ffff98a73e203f48 RFLAGS: 00010006
    RAX: 0000e5e20ce99582 RBX: ffff98a73e2111c0 RCX: 0000000000000018
    RDX: 0000000225c17d03 RSI: 0000e5e20cf8c6c0 RDI: ffffffff9ae2b440
    RBP: ffff98a73e203f60 R8: 0000e5e20ce98480 R9: ffff98a73e203de0
    R10: 0000000000000000 R11: ffff98a73e203de8 R12: 0000e5e20cf8c6c0
    R13: 0000000000000000 R14: ffff98a73e213ad8 R15: ffff98a73e213b18
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#10 [ffff98a73e203f68] tick_program_event at ffffffff9a304073
#11 [ffff98a73e203f78] hrtimer_interrupt at ffffffff9a2c2732
#12 [ffff98a73e203fc0] local_apic_timer_interrupt at ffffffff9a2596ab
#13 [ffff98a73e203fd8] smp_apic_timer_interrupt at ffffffff9a93b083
#14 [ffff98a73e203ff0] apic_timer_interrupt at ffffffff9a9377b2
--- <IRQ stack> ---
#15 [ffff98786af13f58] apic_timer_interrupt at ffffffff9a9377b2
    RIP: 00007f7f1b96b139 RSP: 00007f7f04bfa988 RFLAGS: 00000206
    RAX: 0000000000000001 RBX: 0000000000004b64 RCX: 0000000000000202
    RDX: 0000000000001000 RSI: 000000000000001c RDI: 00007f7f140c4698
    RBP: 00007f7f04bfa9d0 R8: 000000000000001c R9: 0000000004000001
    R10: 0000000000000001 R11: 0000000000000246 R12: ffff98a73e2061e8
    R13: 000000000000003c R14: 00007f7ef81ab850 R15: 0000000000000000
    ORIG_RAX: ffffffffffffff10 CS: 0033 SS: 002b
```
pgreco

2018-10-25 16:21

developer   ~0032995

@wyukawa, I'd like to wait until 7.6 to continue testing; there will be many changes, and we may be fighting something that is already solved.
As soon as I have something to test, I'll let you know.
wyukawa

2018-10-30 05:51

reporter   ~0033014

OK. By the way, we rolled back to kernel 3.10.0-862.11.6.el7.centos.plus.2 because of the many errors.
toracat

2018-10-30 07:58

manager   ~0033016

RHEL 7.6 is out.
pgreco

2018-10-30 10:45

developer   ~0033017

@wyukawa, does 3.10.0-862.11.6.el7.centos.plus.2 work better than 3.10.0-862.14.4.el7.centos.plus?
wyukawa

2018-10-30 11:56

reporter   ~0033018

@pgreco
Yes

The kernel panic occurred 3 times in about 42 days on 3.10.0-862.11.6.el7.centos.plus.2, but
9 times in about 8 days on 3.10.0-862.14.4.el7.centos.plus.
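To put those two figures on the same scale, here is a rough calculation of panics per day. It uses only the approximate counts and durations quoted above, so it is an indication rather than a measurement; the function name `panics_per_day` is just for this sketch.

```python
def panics_per_day(panics, days):
    """Average number of kernel panics per day over the observation window."""
    return panics / days


# Figures as reported above (both durations are approximate).
old = panics_per_day(3, 42)  # 3.10.0-862.11.6.el7.centos.plus.2
new = panics_per_day(9, 8)   # 3.10.0-862.14.4.el7.centos.plus

print(f"{old:.3f} vs {new:.3f} panics/day, ratio {new / old:.2f}x")
```

On these numbers the newer kernel panicked roughly 16 times as often, although with so few events and only approximate windows the exact ratio should not be read too closely.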
pgreco

2018-12-15 15:18

developer   ~0033353

@wyukawa, now that we have CentOS 7.6 out, can you check our latest centos-plus kernel?
wyukawa

wyukawa

2018-12-16 03:56

reporter   ~0033360

@pgreco
OK, I may be able to check the latest centos-plus kernel (currently kernel-plus-3.10.0-957.1.3.el7.centos.plus.x86_64.rpm) next year.
Thanks.
Snorch

2019-01-24 10:16

reporter   ~0033666

> [exception RIP: clockevents_program_event+57]

We had a similar crash on a kernel based on 862.11.6.el7. @wyukawa, can you please show me the output of "crash> dis -l ffffffff9a302339" for the last crash, to help with my investigation?

In our case, RIP points to "sub %rax,%r12", which is a very strange place for a page fault on address 0x8.
wyukawa

wyukawa

2019-01-24 10:30

reporter   ~0033667

```
crash> dis -l ffffffff9a302339
dis: page excluded: kernel virtual address: ffffffff9a302339 type: "gdb_readmem_callback"
0xffffffff9a302339 <__per_cpu_end+-1708226271>: Cannot access memory at address 0xffffffff9a302339
```
Snorch

2019-01-24 12:34

reporter   ~0033670

Strange. If you loaded the proper vmcore and vmlinux, there should be no "dis: page excluded: kernel virtual address:" message.

Anyway, I've managed to get the information I need from kernel-plus-debuginfo-3.10.0-862.14.4.el7.centos.plus.x86_64.rpm:

```
$ objdump -D -S --start-address=0xffffffff81102300 --stop-address=0xffffffff8110233c ./usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.centos.plus.x86_64/vmlinux

./usr/lib/debug/lib/modules/3.10.0-862.14.4.el7.centos.plus.x86_64/vmlinux: file format elf64-x86-64


Disassembly of section .text:

ffffffff81102300 <clockevents_program_event>:
ffffffff81102300: e8 eb 7d 63 00 callq ffffffff8173a0f0 <__fentry__>
ffffffff81102305: 55 push %rbp
ffffffff81102306: 48 85 f6 test %rsi,%rsi
ffffffff81102309: 48 89 e5 mov %rsp,%rbp
ffffffff8110230c: 41 55 push %r13
ffffffff8110230e: 41 54 push %r12
ffffffff81102310: 49 89 f4 mov %rsi,%r12
ffffffff81102313: 53 push %rbx
ffffffff81102314: 0f 88 b7 00 00 00 js ffffffff811023d1 <clockevents_program_event+0xd1>
ffffffff8110231a: 48 89 fb mov %rdi,%rbx
ffffffff8110231d: 48 89 73 18 mov %rsi,0x18(%rbx)
ffffffff81102321: 83 7f 38 01 cmpl $0x1,0x38(%rdi)
ffffffff81102325: 0f 84 9d 00 00 00 je ffffffff811023c8 <clockevents_program_event+0xc8>
ffffffff8110232b: f6 47 3c 04 testb $0x4,0x3c(%rdi)
ffffffff8110232f: 75 5f jne ffffffff81102390 <clockevents_program_event+0x90>
ffffffff81102331: 41 89 d5 mov %edx,%r13d
ffffffff81102334: e8 97 98 ff ff callq ffffffff810fbbd0 <ktime_get>
ffffffff81102339: 49 29 c4 sub %rax,%r12 <- RIP
```
Actually, you had the same problem as we had in our reproduction. It looks like the stack trace is somehow broken, as the instruction at the instruction pointer shouldn't cause a page fault.
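For anyone repeating this without a loadable vmcore: the objdump addresses are the link-time addresses in vmlinux, while the RIP in the backtrace is the run-time address, so the two differ by a constant relocation (presumably KASLR, which is an assumption here). The arithmetic below is a sketch using only values quoted in this report; the variable names are made up for illustration.

```python
# Values copied from this report.
RUNTIME_RIP = 0xffffffff9a302339    # exception RIP in the vmcore backtrace
STATIC_SYMBOL = 0xffffffff81102300  # clockevents_program_event in vmlinux (objdump)
OFFSET_IN_SYMBOL = 57               # from "clockevents_program_event+57"

# Link-time address of the faulting instruction: symbol start plus offset.
static_rip = STATIC_SYMBOL + OFFSET_IN_SYMBOL

# Constant difference between run-time and link-time addresses.
relocation = RUNTIME_RIP - static_rip

print(hex(static_rip))
print(hex(relocation), relocation // 2**20, "MiB")
```

The computed link-time address, 0xffffffff81102339, is the line marked `<- RIP` in the disassembly, and the relocation comes out as a whole number of MiB, as one would expect for a relocated kernel image.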
pgreco

2019-09-22 15:28

developer   ~0035193

Post 7.7 cleanup, is this still valid?

Issue History

Date Modified Username Field Change
2018-08-27 01:49 wyukawa New Issue
2018-08-27 01:49 wyukawa Tag Attached: 3.10.0-862.9.1.el7.x86_64
2018-08-27 01:56 wyukawa Tag Attached: kernel panic
2018-08-27 01:56 wyukawa Tag Detached: 3.10.0-862.9.1.el7.x86_64
2018-08-27 03:28 toracat Note Added: 0032594
2018-08-27 03:54 wyukawa Note Added: 0032595
2018-08-27 06:57 wyukawa Note Added: 0032596
2018-08-27 09:33 pgreco Note Added: 0032597
2018-08-28 02:54 wyukawa Note Added: 0032604
2018-08-28 09:54 pgreco Note Added: 0032610
2018-08-28 09:54 pgreco Note Edited: 0032610
2018-08-30 04:16 wyukawa Note Added: 0032624
2018-09-05 02:49 wyukawa Note Added: 0032646
2018-09-05 10:20 pgreco Note Added: 0032650
2018-09-05 14:35 toracat Note Added: 0032652
2018-09-05 14:36 toracat Status new => feedback
2018-09-06 08:46 wyukawa Note Added: 0032659
2018-09-06 08:46 wyukawa Status feedback => assigned
2018-09-10 06:13 wyukawa Note Added: 0032690
2018-09-10 10:35 pgreco Note Added: 0032691
2018-09-11 01:33 wyukawa Note Added: 0032693
2018-09-25 22:55 toracat Note Added: 0032791
2018-09-26 07:25 wyukawa Note Added: 0032799
2018-09-26 07:40 toracat Note Added: 0032800
2018-09-26 07:56 wyukawa Note Added: 0032801
2018-09-27 15:41 toracat Note Added: 0032813
2018-09-28 17:39 toracat Note Added: 0032821
2018-10-03 01:13 wyukawa Note Added: 0032843
2018-10-03 02:31 toracat Note Added: 0032844
2018-10-03 07:39 wyukawa Note Added: 0032845
2018-10-04 12:05 DaveLaneCA Note Added: 0032858
2018-10-06 15:30 dijuremo@gmail.com Note Added: 0032869
2018-10-15 21:31 pgreco Note Added: 0032923
2018-10-16 03:17 wyukawa Note Added: 0032925
2018-10-17 09:32 pgreco Note Added: 0032932
2018-10-22 08:15 wyukawa Note Added: 0032960
2018-10-25 03:52 wyukawa Note Added: 0032985
2018-10-25 16:21 pgreco Note Added: 0032995
2018-10-30 05:51 wyukawa Note Added: 0033014
2018-10-30 07:58 toracat Note Added: 0033016
2018-10-30 10:45 pgreco Note Added: 0033017
2018-10-30 11:56 wyukawa Note Added: 0033018
2018-12-15 15:18 pgreco Note Added: 0033353
2018-12-16 03:56 wyukawa Note Added: 0033360
2019-01-24 10:16 Snorch Note Added: 0033666
2019-01-24 10:30 wyukawa Note Added: 0033667
2019-01-24 12:34 Snorch Note Added: 0033670
2019-09-22 15:28 pgreco Note Added: 0035193