View Issue Details

IDProjectCategoryView StatusLast Update
0018504CentOS-8generalpublic2022-09-06 01:19
ReporterQuantaMarkKang Assigned To 
PriorityhighSeveritycrashReproducibilityalways
Status newResolutionopen 
Product Version8.4.2105 
Summary0018504: kernel panic when doing uncorrected error injection but new kernel never crash
DescriptionWhen doing uncorrected error injection, we found the kernel crashed in

Red Hat Enterprise Linux 8.5 (Ootpa)
Kernel 4.18.0-348.el8.x86_64 on an x86_64

If we update the kernel to upstream 5.15, it will not crash.

Test tool:
https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

Test command:
./einj_mem_uc -f 'single'

Result:
```
[root@localhost GreenTea]# ./einj_mem_uc -f 'single'

0: single vaddr = xxxxxxxx paddr = xxxxx[ 242.248140] core: Uncorrected hardware memory error in user-access at xxxxxxxx
[ 242.248410] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 242.257296] BUG: scheduling while atomic: einj_mem_uc/9237/0x00110000
a400
[ 242.258700] Memory failure: xxxx: Killing einj_mem_uc:9237 due to hardware memory corruption
[ 242.267021] {1}[Hardware Error]: event severity: recoverable
[ 242.267022] {1}[Hardware Error]: Error 0, type: recoverable
[ 242.267023] {1}[Hardware Error]: fru_text: Card01, ChnG, DIMM0
[ 242.267023] {1}[Hardware Error]: section_type: memory error
[ 242.267024] {1}[Hardware Error]: error_status: 0x0000000000000400
[ 242.267024] {1}[Hardware Error]: physical_address: 0x00000004805da400
[ 242.267026] {1}[Hardware Error]: node: 0 card: 6 module: 0 rank: 0 bank: 16 device: 0 row: 8835 column: 16
[ 242.267026] {1}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 242.267027] {1}[Hardware Error]: DIMM location: _Node0_Channel6_Dimm0 CPU0_G0
[ 242.267053] Memory failure: xxxxx: already hardware poisoned
[ 242.274392] Memory failure: xxxxx: recovery action for dirty LRU page: Recovered
[ 242.285519] EDAC skx MC3: HANDLING MCE MEMORY ERROR
[ 242.318662] ------------[ cut here ]------------
[ 242.326171] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
[ 242.337362] kernel BUG at arch/x86/kernel/cpu/mce/core.c:1364!
[ 242.337366] invalid opcode: 0000 [#1] SMP NOPTI
[ 242.337367] CPU: 139 PID: 9237 Comm: einj_mem_uc Kdump: loaded Tainted: G M W --------- - - 4.18.0-348.el8.x86_64 #1
[ 242.337368] Hardware name: Foo Inc. Foo BIOS 4C012 01/21/2022
[ 242.345383] EDAC skx MC3: TSC 0x0
[ 242.345383] EDAC skx MC3: ADDR 0x4805da400
[ 242.353765] RIP: 0010:do_machine_check+0xb10/0xc70
[ 242.353766] Code: 42 bf f4 01 00 00 e8 df 92 92 00 8b 05 b9 cb e2 01 41 39 c7 7e 2d 4c 89 ee 4c 89 e7 e8 09 ec ff ff 85 c0 74 dc e9 17 fe ff ff <0f> 0b 0f 0b 8b 35 1a 6c 7e 01 e9 05 fb ff ff c7 05 87 cb e2 01 01
[ 242.353766] RSP: 0018:ff2f652d53383e58 EFLAGS: 00010046
[ 242.353767] RAX: 0000000080000000 RBX: 00000000004805da RCX: 3ffffffffffffffe
[ 242.353768] RDX: ff121aa3ffbeaf40 RSI: 0000000000000001 RDI: ff121a69005db000
[ 242.360497] EDAC skx MC3: MISC 0x0
[ 242.360498] EDAC skx MC3: PROCESSOR 0:0x806f6 TIME 1529665988 SOCKET 0 APIC 0x0
[ 242.369175] RBP: ff121a65a70f3c80 R08: ff121a6480000010 R09: 0000000000000000
[ 242.369176] R10: 0000000000000002 R11: 0000000000000003 R12: 0000000000000000
[ 242.369176] R13: 0000000000000000 R14: ff121aa3ffb95ce0 R15: 0000000000000014
[ 242.369177] FS: 00007fe1bde23640(0000) GS:ff121aa3ffbc0000(0000) knlGS:0000000000000000
[ 242.369177] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 242.369177] CR2: 00007f50808270cc CR3: 00000001d0118001 CR4: 0000000000771ee0
[ 242.369178] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 242.369178] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 242.369179] PKRU: 55555554
[ 242.369179] Call Trace:
[ 242.369182] ? machine_check+0x25/0x40
[ 242.374751] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x4805da offset:0x400 grain:32 - err_code:0x0000:0x009f SystemAddress:0x4805da400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0x800bb400 ChannelId:0x0 RankAddress:0x2002ed00 PhysicalRankId:0x0 DimmSlotId:0x0 DimmRankId:0x0 Row:0x2283 Column:0x10 Bank:0x0 BankGroup:0x4 ChipSelect:0x0)
[ 242.380013] machine_check+0x2f/0x40
[ 242.380015] RIP: 0033:0x403f5b
[ 242.380015] Code: 89 05 cd 37 20 00 8b 05 c7 37 20 00 c3 53 48 8b 1d 92 37 20 00 e8 2b d5 ff ff 48 8d 84 1b 76 14 40 00 48 f7 db 48 21 d8 5b c3 <0f> be 07 c3 0f be 07 0f be 57 01 01 d0 c3 48 8b 47 ff c3 c6 07 61
[ 242.380016] RSP: 002b:00007ffcfb9e6098 EFLAGS: 00010206
[ 242.380016] RAX: 0000000000607280 RBX: 0000000000607280 RCX: 0000000001b7b010
[ 242.380017] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007fe1bde21400
[ 242.380017] RBP: 00007fe1bde21400 R08: 0000000001b7b04a R09: 0000000000000000
[ 242.380017] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000001
[ 242.380018] R13: 00007ffcfb9e6410 R14: 0000000000000000 R15: 0000000000000000
[ 242.380018] Modules linked in: einj xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter tun bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc vfat fat sd_mod sg intel_rapl_msr intel_rapl_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt intel_pmt_telemetry intel_pmt_crashlog iTCO_vendor_support intel_pmt_class kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr rapl intel_th_gth ipmi_ssif uas intel_th_pci isst_if_mbox_pci isst_if_mmio idxd usb_storage joydev intel_pmt intel_th i2c_i801 isst_if_common i2c_ismt wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect
[ 242.380033] sysimgblt fb_sys_fops drm_ttm_helper ttm crc32c_intel nvme ahci drm nvme_core libahci libata t10_pi pinctrl_emmitsburg dm_mirror dm_region_hash dm_log dm_mod fuse
[ 0.000000] Linux version 4.18.0-348.el8.x86_64 (mockbuild@x86-vm-09.build.eng.bos.redhat.com) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-3) (GCC)) #1 SMP Mon Oct 4 12:17:22 EDT 2021
As logs show:
[ 242.337362] kernel BUG at arch/x86/kernel/cpu/mce/core.c:1364!
[ 242.337366] invalid opcode: 0000 [#1] SMP NOPTI
```

The core.c source code:

1363 out_ist:
1364 nmi_exit();
Steps To Reproduce./einj_mem_uc -f 'single'
Additional InformationTest tool:
https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

Red Hat Enterprise Linux 8.5 (Ootpa)
Kernel 4.18.0-348.el8.x86_64 on an x86_64
Tagsmce, panic

Activities

QuantaMarkKang

QuantaMarkKang

2022-09-06 01:19

reporter  

log_1.txt (5,865 bytes)   
[root@localhost GreenTea]# ./einj_mem_uc -f 'single'

0: single   vaddr = xxxxxxxx paddr = xxxxx[   242.248140] core: Uncorrected hardware memory error in user-access at xxxxxxxx
[   242.248410] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   242.257296] BUG: scheduling while atomic: einj_mem_uc/9237/0x00110000
a400
[   242.258700] Memory failure: xxxx: Killing einj_mem_uc:9237 due to hardware memory corruption
[   242.267021] {1}[Hardware Error]: event severity: recoverable
[   242.267022] {1}[Hardware Error]:  Error 0, type: recoverable
[   242.267023] {1}[Hardware Error]:  fru_text: Card01, ChnG, DIMM0
[   242.267023] {1}[Hardware Error]:   section_type: memory error
[   242.267024] {1}[Hardware Error]:   error_status: 0x0000000000000400
[   242.267024] {1}[Hardware Error]:   physical_address: 0x00000004805da400
[   242.267026] {1}[Hardware Error]:   node: 0 card: 6 module: 0 rank: 0 bank: 16 device: 0 row: 8835 column: 16 
[   242.267026] {1}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
[   242.267027] {1}[Hardware Error]:   DIMM location: _Node0_Channel6_Dimm0 CPU0_G0 
[   242.267053] Memory failure: xxxxx: already hardware poisoned
[   242.274392] Memory failure: xxxxx: recovery action for dirty LRU page: Recovered
[   242.285519] EDAC skx MC3: HANDLING MCE MEMORY ERROR
[   242.318662] ------------[ cut here ]------------
[   242.326171] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
[   242.337362] kernel BUG at arch/x86/kernel/cpu/mce/core.c:1364!
[   242.337366] invalid opcode: 0000 [#1] SMP NOPTI
[   242.337367] CPU: 139 PID: 9237 Comm: einj_mem_uc Kdump: loaded Tainted: G   M    W        --------- -  - 4.18.0-348.el8.x86_64 #1
[   242.337368] Hardware name: Foo Inc. Foo  BIOS 4C012 01/21/2022
[   242.345383] EDAC skx MC3: TSC 0x0 
[   242.345383] EDAC skx MC3: ADDR 0x4805da400 
[   242.353765] RIP: 0010:do_machine_check+0xb10/0xc70
[   242.353766] Code: 42 bf f4 01 00 00 e8 df 92 92 00 8b 05 b9 cb e2 01 41 39 c7 7e 2d 4c 89 ee 4c 89 e7 e8 09 ec ff ff 85 c0 74 dc e9 17 fe ff ff <0f> 0b 0f 0b 8b 35 1a 6c 7e 01 e9 05 fb ff ff c7 05 87 cb e2 01 01
[   242.353766] RSP: 0018:ff2f652d53383e58 EFLAGS: 00010046
[   242.353767] RAX: 0000000080000000 RBX: 00000000004805da RCX: 3ffffffffffffffe
[   242.353768] RDX: ff121aa3ffbeaf40 RSI: 0000000000000001 RDI: ff121a69005db000
[   242.360497] EDAC skx MC3: MISC 0x0 
[   242.360498] EDAC skx MC3: PROCESSOR 0:0x806f6 TIME 1529665988 SOCKET 0 APIC 0x0
[   242.369175] RBP: ff121a65a70f3c80 R08: ff121a6480000010 R09: 0000000000000000
[   242.369176] R10: 0000000000000002 R11: 0000000000000003 R12: 0000000000000000
[   242.369176] R13: 0000000000000000 R14: ff121aa3ffb95ce0 R15: 0000000000000014
[   242.369177] FS:  00007fe1bde23640(0000) GS:ff121aa3ffbc0000(0000) knlGS:0000000000000000
[   242.369177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   242.369177] CR2: 00007f50808270cc CR3: 00000001d0118001 CR4: 0000000000771ee0
[   242.369178] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   242.369178] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   242.369179] PKRU: 55555554
[   242.369179] Call Trace:
[   242.369182]  ? machine_check+0x25/0x40
[   242.374751] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x4805da offset:0x400 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x4805da400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0x800bb400 ChannelId:0x0 RankAddress:0x2002ed00 PhysicalRankId:0x0 DimmSlotId:0x0 DimmRankId:0x0 Row:0x2283 Column:0x10 Bank:0x0 BankGroup:0x4 ChipSelect:0x0)
[   242.380013]  machine_check+0x2f/0x40
[   242.380015] RIP: 0033:0x403f5b
[   242.380015] Code: 89 05 cd 37 20 00 8b 05 c7 37 20 00 c3 53 48 8b 1d 92 37 20 00 e8 2b d5 ff ff 48 8d 84 1b 76 14 40 00 48 f7 db 48 21 d8 5b c3 <0f> be 07 c3 0f be 07 0f be 57 01 01 d0 c3 48 8b 47 ff c3 c6 07 61
[   242.380016] RSP: 002b:00007ffcfb9e6098 EFLAGS: 00010206
[   242.380016] RAX: 0000000000607280 RBX: 0000000000607280 RCX: 0000000001b7b010
[   242.380017] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007fe1bde21400
[   242.380017] RBP: 00007fe1bde21400 R08: 0000000001b7b04a R09: 0000000000000000
[   242.380017] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000001
[   242.380018] R13: 00007ffcfb9e6410 R14: 0000000000000000 R15: 0000000000000000
[   242.380018] Modules linked in: einj xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nft_compat nf_nat_tftp nft_objref nf_conntrack_tftp nft_counter tun bridge stp llc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nf_tables_set nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink sunrpc vfat fat sd_mod sg intel_rapl_msr intel_rapl_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel iTCO_wdt intel_pmt_telemetry intel_pmt_crashlog iTCO_vendor_support intel_pmt_class kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr rapl intel_th_gth ipmi_ssif uas intel_th_pci isst_if_mbox_pci isst_if_mmio idxd usb_storage joydev intel_pmt intel_th i2c_i801 isst_if_common i2c_ismt wmi acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter xfs libcrc32c ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect
[   242.380033]  sysimgblt fb_sys_fops drm_ttm_helper ttm crc32c_intel nvme ahci drm nvme_core libahci libata t10_pi pinctrl_emmitsburg dm_mirror dm_region_hash dm_log dm_mod fuse
[    0.000000] Linux version 4.18.0-348.el8.x86_64 (mockbuild@x86-vm-09.build.eng.bos.redhat.com) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-3) (GCC)) #1 SMP Mon Oct 4 12:17:22 EDT 2021
log_1.txt (5,865 bytes)   

Issue History

Date Modified Username Field Change
2022-09-06 01:19 QuantaMarkKang New Issue
2022-09-06 01:19 QuantaMarkKang Tag Attached: mce
2022-09-06 01:19 QuantaMarkKang Tag Attached: panic
2022-09-06 01:19 QuantaMarkKang File Added: log_1.txt