View Issue Details

ID: 0016315
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2020-10-25 16:22
Reporter: Jcztery
Priority: normal
Severity: crash
Reproducibility: always
Status: resolved
Resolution: fixed
Product Version: 7.6.1810
Target Version:
Fixed in Version:
Summary: 0016315: APEI causes kernel panic on PCIe Fatal error instead of sending them to AER (and to the driver) for recovery
Description: When a network card reports a fatal PCIe error (I do not yet know the exact cause; it should become more apparent once this problem is fixed), the kernel crashes with the following error:
[101567.970807] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[101567.980304] {1}[Hardware Error]: event severity: fatal
[101567.986209] {1}[Hardware Error]: Error 0, type: fatal
[101567.992102] {1}[Hardware Error]: section_type: PCIe error
[101567.998496] {1}[Hardware Error]: port_type: 4, root port
[101568.004789] {1}[Hardware Error]: version: 1.16
[101568.010105] {1}[Hardware Error]: command: 0x4010, status: 0x0547
[101568.017183] {1}[Hardware Error]: device_id: 0000:00:01.0
[101568.023476] {1}[Hardware Error]: slot: 0
[101568.028199] {1}[Hardware Error]: secondary_bus: 0x01
[101568.034097] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x780a
[101568.041666] {1}[Hardware Error]: class_code: 060400
[101568.047469] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[101568.056312] Kernel panic - not syncing: Fatal hardware error!

There is already a fix for it upstream:
https://lore.kernel.org/patchwork/patch/842033/
Additional Information: I think this commit: https://github.com/torvalds/linux/commit/9852ce9ae213d39a98f161db84b90b047fbdc436 (and a previous commit modifying the same file) is needed to fix it.
Tags: No tags attached.
abrt_hash:
URL:

Activities

pgreco

2019-08-04 13:16

developer   ~0034912

@Jcztery
Thanks for the report. The patch you linked does look like a logical solution to this error, but the code in CentOS' kernel is different, so it doesn't apply.
I made a completely untested version of the patch that uses the same logic but applies to our code (attached).

The most important thing you can do now is report it to Red Hat Bugzilla (https://bugzilla.redhat.com) referencing this bug, so they are aware of it and can keep track.
Once you've done that, please report the bug number here, for reference.
Also, kernel bugs are automatically made private there, so please add toracat@elrepo.org and pablo@fliagreco.com.ar so we can also keep track of it.

Pablo.

bug16315.patch (700 bytes)
diff -Naurp a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
--- a/drivers/acpi/apei/ghes.c	2019-07-09 13:13:02.000000000 -0300
+++ b/drivers/acpi/apei/ghes.c	2019-08-04 10:03:00.974164494 -0300
@@ -461,9 +461,7 @@ static void ghes_do_proc(struct ghes *gh
 				      CPER_SEC_PCIE)) {
 			struct cper_sec_pcie *pcie_err;
 			pcie_err = (struct cper_sec_pcie *)(gdata+1);
-			if (sev == GHES_SEV_RECOVERABLE &&
-			    sec_sev == GHES_SEV_RECOVERABLE &&
-			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+			if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
 			    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
 				unsigned int devfn;
 				int aer_severity;
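The effect of the attached diff can be sketched as a pair of predicates: before the change, a PCIe record entered the AER-handling branch only when both the overall and the section severity were recoverable; after it, only the two validation bits are checked. The enum, macro names, and numeric values below are stand-ins chosen for this illustration (they are assumptions, not copied from the CentOS kernel headers), and what happens inside the branch is outside the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-ins for the GHES severities and the CPER PCIe
 * validation bits; names and values are assumptions for this sketch. */
enum sev { SEV_NO, SEV_CORRECTED, SEV_RECOVERABLE, SEV_PANIC };
#define VALID_DEVICE_ID 0x0008u
#define VALID_AER_INFO  0x0080u

/* Gate as it reads on the '-' lines of the diff: severity is checked. */
bool gate_before(enum sev sev, enum sev sec_sev, uint64_t validation_bits)
{
    return sev == SEV_RECOVERABLE &&
           sec_sev == SEV_RECOVERABLE &&
           (validation_bits & VALID_DEVICE_ID) != 0 &&
           (validation_bits & VALID_AER_INFO) != 0;
}

/* Gate as it reads on the '+' line: only the validation bits matter. */
bool gate_after(uint64_t validation_bits)
{
    return (validation_bits & VALID_DEVICE_ID) != 0 &&
           (validation_bits & VALID_AER_INFO) != 0;
}

/* A fatal record with both fields valid: skipped before, accepted after. */
void gate_example(void)
{
    uint64_t bits = VALID_DEVICE_ID | VALID_AER_INFO;

    assert(!gate_before(SEV_PANIC, SEV_PANIC, bits));
    assert(gate_after(bits));
}
```

A record that lacks either validation bit is rejected by both versions of the gate, so the patch only changes the outcome for records whose severity is not recoverable.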
Jcztery

2019-08-04 13:37

reporter   ~0034913

@pgreco,
Thank you for the quick response.
There was another commit, which extracts a function, that was originally part of the solution. It might be cleaner to just cherry-pick both of them:
https://github.com/torvalds/linux/commit/3c5b977f06b754b00a49ee7bf1595491afab7de6
https://github.com/torvalds/linux/commit/9852ce9ae213d39a98f161db84b90b047fbdc436

Thanks again! I will log a bug at RedHat now.
Regards.
Jacek Tomaka
pgreco

2019-08-04 13:46

developer   ~0034914

Yes, they are both part of the same series, but the code had diverged before that, so when I said it doesn't apply, I was actually talking about the first one.
Jcztery

2019-08-04 13:59

reporter   ~0034915

Ok! https://bugzilla.redhat.com/show_bug.cgi?id=1737246
toracat

2019-08-04 20:56

manager   ~0034916

@Jcztery

There is a test version of the plus kernel that has the patch provided by @pgreco at:

https://people.centos.org/toracat/kernel/7/plus/bug16315/

Please give it a try and see if that fixes the issue.
Jcztery

2019-08-05 16:38

reporter   ~0034922

Thanks! BTW, how can I figure out the differences between 3.10.0-957.21.3 and 3.10.0-957.27.2?
toracat

2019-08-05 18:28

manager   ~0034923

Checking the changelog is one way:

$ rpm -q --changelog kernel-3.10.0-957.27.2.el7

Other than that, there is no easy way. Running 'diff' on the whole kernel source tree is not practical unless you are looking for a particular piece of code.

RH provides the "Code Browser" feature, but it is only available to subscribers and usually lags behind the latest version.
Jcztery

2020-01-05 15:49

reporter   ~0035931

Hello,
We have finally gotten around to deploying this patched kernel at scale. Unfortunately, it does not fix our issue.
Looking into it a little more, it seems that this is because the panic happens on the NMI path, which is not addressed by that patch:
crash> bt -l
PID: 0 TASK: ffffffff819f9480 CPU: 0 COMMAND: "swapper/0"
 #0 [ffff882f7b605c68] machine_kexec at ffffffff8105c4cb
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/kernel/machine_kexec_64.c: 320
 #1 [ffff882f7b605cc8] __crash_kexec at ffffffff81104a42
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/kernel/kexec_core.c: 875
 #2 [ffff882f7b605d98] panic at ffffffff8169dd1f
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/include/asm/smp.h: 96
 #3 [ffff882f7b605e18] ghes_notify_nmi at ffffffff813db3a1
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/drivers/acpi/apei/ghes.c: 838
 #4 [ffff882f7b605e58] nmi_handle at ffffffff816ad527
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/kernel/nmi.c: 115
 #5 [ffff882f7b605eb0] do_nmi at ffffffff816ad75d
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/kernel/nmi.c: 322
 #6 [ffff882f7b605ef0] end_repeat_nmi at ffffffff816ac9d3
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/kernel/entry_64.S: 1800
    [exception RIP: intel_idle+213]
    RIP: ffffffff816ab7a5 RSP: ffffffff819e7e28 RFLAGS: 00000046
    RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000001
    RDX: 0000000000000000 RSI: ffffffff819e7fd8 RDI: 0000000000000000
    RBP: ffffffff819e7e58 R8: 00000000000003e8 R9: 0000000000000018
    R10: 0000000000000377 R11: 0000000000000000 R12: ffffffff819e7fd8
    R13: 0000000000000001 R14: 0000000000000000 R15: ffffffff81aaf990
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/include/asm/thread_info.h: 211
--- <NMI exception stack> ---
 #7 [ffffffff819e7e28] intel_idle at ffffffff816ab7a5
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/include/asm/thread_info.h: 211
 #8 [ffffffff819e7e60] cpuidle_enter_state at ffffffff81527a90
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/drivers/cpuidle/cpuidle.c: 86
 #9 [ffffffff819e7e98] cpuidle_idle_call at ffffffff81527be8
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/drivers/cpuidle/cpuidle.c: 163
#10 [ffffffff819e7ed8] arch_cpu_idle at ffffffff81034fee
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/kernel/process.c: 303
#11 [ffffffff819e7ee8] cpu_startup_entry at ffffffff810e7bda
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/include/asm/paravirt.h: 802
#12 [ffffffff819e7f30] rest_init at ffffffff81692d17
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/init/main.c: 401
#13 [ffffffff819e7f40] start_kernel at ffffffff81b45060
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/init/main.c: 656
#14 [ffffffff819e7f88] x86_64_start_reservations at ffffffff81b445ef
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/kernel/head64.c: 194
#15 [ffffffff819e7f98] x86_64_start_kernel at ffffffff81b44740
    /usr/src/debug/kernel-3.10.0-693.5.2.el7/linux-3.10.0-693.5.2.el7.gitfd9e07f.x86_64/arch/x86/kernel/head64.c: 183
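Frames #2 and #3 of the backtrace show panic() being called directly from ghes_notify_nmi, while the attached patch changes a check inside ghes_do_proc. A minimal model of why the patch would have no effect on this path, assuming the NMI handler panics on a panic-level status before any per-section processing runs (an assumption consistent with the backtrace, not verified against the CentOS source here). The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the GHES severity levels (an assumption). */
enum sev { SEV_NO, SEV_CORRECTED, SEV_RECOVERABLE, SEV_PANIC };

/* Hypothetical model of the NMI path: returns true if per-section
 * processing (where the patched check lives) would be reached, and
 * false if the handler would panic first. */
bool nmi_reaches_section_processing(enum sev status_sev)
{
    return status_sev < SEV_PANIC;
}

/* Under this model, a fatal record never reaches the patched check. */
void nmi_example(void)
{
    assert(!nmi_reaches_section_processing(SEV_PANIC));
    assert(nmi_reaches_section_processing(SEV_RECOVERABLE));
}
```

If that assumption holds, relaxing the severity check in the per-section code cannot change the outcome for a fatal error delivered via NMI, which matches the behaviour reported above.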
TrevorH

2020-01-05 16:01

manager   ~0035932

According to that backtrace, you are not even running the fixed kernel. Yours seems to be kernel-3.10.0-693.5.2.el7, which is not 3.10.0-957.27.2. In any case, only the latest kernel is supported, and that is currently 3.10.0-1062.9.1.el7. The -693 kernels are from 7.4.1708, so you are more than 2 years out of date.
Jcztery

2020-01-05 16:23

reporter   ~0035933

I am running the stock 7.4 kernel plus the patch prepared by pgreco. I thought you would be interested in the feedback that it did not solve my particular issue.
I am not expecting you to fix the problem (but it would be nice).
TrevorH

2020-01-05 17:33

manager   ~0035934

The test kernel linked above at https://people.centos.org/toracat/kernel/7/plus/bug16315/ is kernel-plus-3.10.0-957.27.2.el7.centos.plus.bug16315.x86_64.rpm, but according to that backtrace you appear to be running 3.10.0-693.5.2.el7.
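One way to rule out this kind of mismatch is to check the running kernel's release string for the test build's tag. The sketch below uses the POSIX uname() call. The tag "bug16315" and the example release strings are assumptions derived from the RPM file name and the debug paths in the backtrace, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/utsname.h>

/* True if the release string contains the given tag. */
bool release_has_tag(const char *release, const char *tag)
{
    return strstr(release, tag) != NULL;
}

/* Returns 1 if the running kernel's release contains the tag,
 * 0 if it does not, and -1 if uname() fails. */
int running_kernel_has_tag(const char *tag)
{
    struct utsname u;

    if (uname(&u) != 0)
        return -1;
    return release_has_tag(u.release, tag) ? 1 : 0;
}

/* Release strings assumed from the package names in this thread. */
void tag_example(void)
{
    assert(release_has_tag(
        "3.10.0-957.27.2.el7.centos.plus.bug16315.x86_64", "bug16315"));
    assert(!release_has_tag("3.10.0-693.5.2.el7.x86_64", "bug16315"));
}
```

A locally rebuilt kernel, such as a stock 7.4 kernel with the patch applied by hand, would not carry this tag, so a negative result only shows that the packaged test build is not the one running.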
pgreco

2020-10-25 14:04

developer   ~0037823

Fixed in 3.10.0-1160 (7.9).

Issue History

Date Modified Username Field Change
2019-08-04 06:31 Jcztery New Issue
2019-08-04 13:16 pgreco File Added: bug16315.patch
2019-08-04 13:16 pgreco Note Added: 0034912
2019-08-04 13:37 Jcztery Note Added: 0034913
2019-08-04 13:46 pgreco Note Added: 0034914
2019-08-04 13:59 Jcztery Note Added: 0034915
2019-08-04 16:16 toracat Status new => assigned
2019-08-04 20:56 toracat Note Added: 0034916
2019-08-05 16:38 Jcztery Note Added: 0034922
2019-08-05 18:28 toracat Note Added: 0034923
2020-01-05 15:49 Jcztery Note Added: 0035931
2020-01-05 16:01 TrevorH Note Added: 0035932
2020-01-05 16:23 Jcztery Note Added: 0035933
2020-01-05 17:33 TrevorH Note Added: 0035934
2020-10-25 14:04 pgreco Status assigned => closed
2020-10-25 14:04 pgreco Resolution open => fixed
2020-10-25 14:04 pgreco Note Added: 0037823
2020-10-25 16:22 toracat Status closed => resolved