View Issue Details

ID: 0016315
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2019-08-27 13:27
Reporter: Jcztery
Priority: normal
Severity: crash
Reproducibility: always
Status: assigned
Resolution: open
Product Version: 7.6.1810
Target Version:
Fixed in Version:
Summary: 0016315: APEI causes kernel panic on PCIe fatal error instead of passing it to AER (and to the driver) for recovery
Description: When a network card reports a fatal PCIe error (I do not yet know the exact cause; it should become more apparent once this problem is fixed), the kernel crashes with the following error:
[101567.970807] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[101567.980304] {1}[Hardware Error]: event severity: fatal
[101567.986209] {1}[Hardware Error]: Error 0, type: fatal
[101567.992102] {1}[Hardware Error]: section_type: PCIe error
[101567.998496] {1}[Hardware Error]: port_type: 4, root port
[101568.004789] {1}[Hardware Error]: version: 1.16
[101568.010105] {1}[Hardware Error]: command: 0x4010, status: 0x0547
[101568.017183] {1}[Hardware Error]: device_id: 0000:00:01.0
[101568.023476] {1}[Hardware Error]: slot: 0
[101568.028199] {1}[Hardware Error]: secondary_bus: 0x01
[101568.034097] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x780a
[101568.041666] {1}[Hardware Error]: class_code: 060400
[101568.047469] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[101568.056312] Kernel panic - not syncing: Fatal hardware error!

There is already a fix for it upstream:
https://lore.kernel.org/patchwork/patch/842033/
Additional Information: I think this commit and an earlier commit modifying the same file are needed to fix it: https://github.com/torvalds/linux/commit/9852ce9ae213d39a98f161db84b90b047fbdc436
Tags: No tags attached.

Activities


pgreco

2019-08-04 13:16

developer   ~0034912

@Jcztery
Thanks for the report. The patch you linked does look like a logical fix for this error, but the code in the CentOS kernel is different, so it doesn't apply.
I made a completely untested version of the patch that uses the same logic but applies to our code (attached).

The most important thing you can do now is report it to the Red Hat Bugzilla (https://bugzilla.redhat.com), referencing this bug, so they are aware of it and can keep track.
Once you've done that, please report the bug number here, for reference.
Also, kernel bugs are automatically made private there, so please add toracat@elrepo.org and pablo@fliagreco.com.ar so we can also keep track of it.

Pablo.

bug16315.patch (700 bytes)
diff -Naurp a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
--- a/drivers/acpi/apei/ghes.c	2019-07-09 13:13:02.000000000 -0300
+++ b/drivers/acpi/apei/ghes.c	2019-08-04 10:03:00.974164494 -0300
@@ -461,9 +461,7 @@ static void ghes_do_proc(struct ghes *gh
 				      CPER_SEC_PCIE)) {
 			struct cper_sec_pcie *pcie_err;
 			pcie_err = (struct cper_sec_pcie *)(gdata+1);
-			if (sev == GHES_SEV_RECOVERABLE &&
-			    sec_sev == GHES_SEV_RECOVERABLE &&
-			    pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+			if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
 			    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO) {
 				unsigned int devfn;
 				int aer_severity;
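
To make the effect of the attached patch concrete, here is a minimal, self-contained C sketch of the condition it changes. It is not the kernel code: the function names `has_aer_fields`, `forwards_to_aer_before`, and `forwards_to_aer_after` are hypothetical, and the enum and bit-mask values are stand-ins meant to mirror the kernel's `GHES_SEV_*` and `CPER_PCIE_VALID_*` definitions. It models only the `if` test shown in the diff, not the rest of `ghes_do_proc()` or any panic decision made elsewhere in the driver.

```c
#include <stdbool.h>

/* Stand-ins for the kernel definitions (assumed values, for illustration). */
enum ghes_sev {
    GHES_SEV_NO,
    GHES_SEV_CORRECTED,
    GHES_SEV_RECOVERABLE,
    GHES_SEV_PANIC
};

#define CPER_PCIE_VALID_DEVICE_ID 0x0008ULL
#define CPER_PCIE_VALID_AER_INFO  0x0080ULL

/* True when the record carries both fields the AER hand-off needs. */
bool has_aer_fields(unsigned long long validation_bits)
{
    return (validation_bits & CPER_PCIE_VALID_DEVICE_ID) &&
           (validation_bits & CPER_PCIE_VALID_AER_INFO);
}

/* Condition before the patch: only recoverable errors are handed to AER. */
bool forwards_to_aer_before(enum ghes_sev sev, enum ghes_sev sec_sev,
                            unsigned long long validation_bits)
{
    return sev == GHES_SEV_RECOVERABLE &&
           sec_sev == GHES_SEV_RECOVERABLE &&
           has_aer_fields(validation_bits);
}

/* Condition after the patch: severity is no longer part of the test. */
bool forwards_to_aer_after(enum ghes_sev sev, enum ghes_sev sec_sev,
                           unsigned long long validation_bits)
{
    (void)sev;      /* intentionally ignored after the patch */
    (void)sec_sev;
    return has_aer_fields(validation_bits);
}
```

Under these assumptions, a record with both validation bits set and a panic-level severity is rejected by the old check and accepted by the new one, while a record missing the AER info bit is rejected by both.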

Jcztery

2019-08-04 13:37

reporter   ~0034913

@pgreco,
Thank you for the quick response.
There was another commit, extracting a function, that was originally part of the solution. It might be cleaner to just cherry-pick both of them:
https://github.com/torvalds/linux/commit/3c5b977f06b754b00a49ee7bf1595491afab7de6
https://github.com/torvalds/linux/commit/9852ce9ae213d39a98f161db84b90b047fbdc436

Thanks again! I will log a bug at RedHat now.
Regards.
Jacek Tomaka

pgreco

2019-08-04 13:46

developer   ~0034914

Yes, they are both part of the same series, but the code had diverged before that, so when I said it doesn't apply, I was actually talking about the first one.

Jcztery

2019-08-04 13:59

reporter   ~0034915

Ok! https://bugzilla.redhat.com/show_bug.cgi?id=1737246

toracat

2019-08-04 20:56

manager   ~0034916

@Jcztery

There is a test version of the plus kernel that has the patch provided by @pgreco at:

https://people.centos.org/toracat/kernel/7/plus/bug16315/

Please give it a try and see if that fixes the issue.

Jcztery

2019-08-05 16:38

reporter   ~0034922

Thanks! BTW, how can I figure out the differences between 3.10.0-957.21.3 and 3.10.0-957.27.2?

toracat

2019-08-05 18:28

manager   ~0034923

Checking the changelog is one way:

$ rpm -q --changelog kernel-3.10.0-957.27.2.el7

Other than that, there is no easy way. Running 'diff' on the whole kernel source tree is not practical unless you are looking for a particular piece of code.

RH provides the "Code Browser" feature, but it is only available to subscribers and usually lags behind, so it may not include the latest version.

Issue History

Date Modified     Username  Field                       Change
2019-08-04 06:31  Jcztery   New Issue
2019-08-04 13:16  pgreco    File Added: bug16315.patch
2019-08-04 13:16  pgreco    Note Added: 0034912
2019-08-04 13:37  Jcztery   Note Added: 0034913
2019-08-04 13:46  pgreco    Note Added: 0034914
2019-08-04 13:59  Jcztery   Note Added: 0034915
2019-08-04 16:16  toracat   Status                      new => assigned
2019-08-04 20:56  toracat   Note Added: 0034916
2019-08-05 16:38  Jcztery   Note Added: 0034922
2019-08-05 18:28  toracat   Note Added: 0034923