View Issue Details

ID: 0012277
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2017-03-02 18:37
Reporter: TO
Priority: normal
Severity: crash
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: IBM NeXtScale nx360 M5
OS: CentOS
OS Version: 7.3.1611
Product Version: 7.3.1611
Target Version:
Fixed in Version:
Summary: 0012277: Kernel 3.10.0-514 does not boot with mlx4_core module on some IBM systems
Description: As a test, I updated some of our nodes with the CR repository. On many systems, this works without any problems, but on "IBM NeXtScale nx360 M5" nodes, it does not boot.

I think this is related to what I found here:
https://patchwork.kernel.org/patch/9397341/

When I boot the previous kernel (3.10.0-327.36.3.el7.x86_64), the same system works fine.

As we have Mellanox InfiniBand cards, mlx4_core is in use. If I blacklist this module, the system also boots with the new kernel.

I do not have this problem on a "System x3550 M4". There everything works as intended.
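
The blacklist workaround described above can be sketched as follows. This is a minimal sketch and not taken from the report: the helper `write_mlx4_blacklist` and the file name `mlx4-blacklist.conf` are made up for illustration, and the target directory is a parameter so the snippet can be tried in a scratch directory before touching `/etc/modprobe.d`. If the driver is pulled in from the initramfs, the initramfs probably has to be regenerated as well.

```shell
#!/bin/sh
# Hypothetical helper: write a modprobe.d snippet that stops mlx4_core
# from being auto-loaded. $1 is the target directory (on a real system
# that would be /etc/modprobe.d; a scratch directory works for a dry run).
write_mlx4_blacklist() {
    conf_dir=$1
    mkdir -p "$conf_dir" || return 1
    # "blacklist" only prevents alias-based auto-loading; an explicit
    # "modprobe mlx4_core" still works, which is handy for later testing.
    printf '%s\n' 'blacklist mlx4_core' > "$conf_dir/mlx4-blacklist.conf"
}

# Dry run in a temporary directory.
scratch=$(mktemp -d)
write_mlx4_blacklist "$scratch"
cat "$scratch/mlx4-blacklist.conf"
```

Since the blacklist leaves manual loading possible, the module can still be loaded by hand once the system is up, to see whether the reset message from the boot log comes back.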
Additional Information:
[ 7.312283] ERST: Can not request iomem region <0x 3b3dd000-0x 3b3dec00> for ERST.
[ 8.490015] i8042: No controller found
[ 10.587931] Uhhuh. NMI received for unknown reason 3d on CPU 0.
[ 10.594539] Do you have a strange power saving mode enabled?
[ 10.600854] Dazed and confused, but trying to continue
[ OK ] Started Show Plymouth Boot Screen.
[ OK ] Reached target Paths.
[ OK ] Reached target Basic System.
[ OK ] Found device ST9500620NS_81Y9715_81Y3856IBM 3.
         Starting File System Check on /dev/...e-595a-4388-9b4d-0ee553589aad...
[ OK ] Started File System Check on /dev/d...9be-595a-4388-9b4d-0ee553589aad.
[ 70.028613] mlx4_core 0000:81:00.0: device is going to be reset
Tags: InfiniBand, kernel
abrt_hash
URL

Activities

toracat

2016-11-28 16:23

manager   ~0028020

If I build a kernel with the referenced patch, will you be able to test?
TO

2016-11-28 16:31

reporter   ~0028021

Yes, this is our test system, I can play around with it.

The question is how difficult it is to install and boot your kernel. Please give me a very brief overview of what I would have to do (installing an rpm, copying files to /boot, or whatever). It must be ages since I last built a custom kernel on a Linux system.
toracat

2016-11-28 16:37

manager   ~0028022

It will be the centosplus kernel (kernel-plus). I will let you know when it's ready for testing with the details on how to install.
TO

2016-11-28 16:48

reporter   ~0028023

Ok, shouldn't be a problem using the centosplus repo. Please remember that the current plus kernel in the plus repo is a 3.10.0-327.36.3 kernel but I need a 3.10.0-514 kernel to reproduce the problem.
toracat

2016-11-28 17:31

manager   ~0028024

Sure, it is -514 that I'm preparing. The official version will be published with the GA release of CentOS 7.3.1611.
toracat

2016-11-28 18:28

manager   ~0028025

Unfortunately, it turns out that the patch cannot be applied to the current CentOS kernel (3.10.0-514). I get an 'implicit declaration of function' error. Fixing this is beyond what can be done within the scope of the plus kernel.
TO

2016-11-29 06:12

reporter   ~0028029

Many thanks for your efforts. This made me curious how to compile the kernel myself. Here is what I did:

----------
# Let rpmbuild use 12 CPUs
export RPM_BUILD_NCPUS=12
# Fetch the CentOS helper scripts and the kernel package sources
git clone https://git.centos.org/git/centos-git-common.git
git clone https://git.centos.org/git/rpms/kernel.git
cd kernel
git checkout c7
../centos-git-common/get_sources.sh
git checkout -b my-kernel
# Unpack the kernel tarball and apply the patch from patchwork
cd SOURCES
tar xf linux-3.10.0-514.el7.tar.xz
cd linux-3.10.0-514.el7
# wget stores the raw patch as index.html
wget https://patchwork.kernel.org/patch/9397341/raw/
patch -p1 < index.html
rm index.html
cd ..
# Repack the patched tree under the original tarball name
mv linux-3.10.0-514.el7.tar.xz linux-3.10.0-514.el7.tar.xz.old
tar cfJ linux-3.10.0-514.el7.tar.xz linux-3.10.0-514.el7
rm -rf linux-3.10.0-514.el7
cd ..
# Build the source RPM first, then the binary RPMs
rpmbuild --nodeps --define "%_topdir $(pwd)" -bs SPECS/kernel.spec
rpmbuild --define "%_topdir $(pwd)" -ba SPECS/kernel.spec
----------

I don't know whether all these steps are necessary, but I can boot this kernel. Can you check whether these steps make sense and maybe try to build the plus kernel again? I think it makes no sense for us to maintain the kernel ourselves after each new release. I guess this should be committed upstream.
TO

2016-11-29 06:49

reporter   ~0028030

By the way: Your error message sounds as if one of the hunks in the patch file failed. If it helps, I can provide the patched file that I used.
toracat

2016-11-29 07:39

manager  

centos-linux-3.10-pci-fix-regression-mlx4-bug12277.patch (1,348 bytes)
centosplus patch (bug#12277)

Ref: https://patchwork.kernel.org/patch/9397341/

--- a/drivers/pci/probe.c	2016-10-19 07:16:25.000000000 -0700
+++ b/drivers/pci/probe.c	2016-11-28 09:30:30.621332097 -0800
@@ -1426,6 +1426,16 @@ static void program_hpp_type1(struct pci
 		dev_warn(&dev->dev, "PCI-X settings not supported\n");
 }
 
++static bool pcie_get_upstream_rcb(struct pci_dev *dev)
+{
+	struct pci_dev *bridge = pci_upstream_bridge(dev);
+	u16 lnkctl;
+
+	pcie_capability_read_word(bridge, PCI_EXP_LNKCTL, &lnkctl);
+
+	return lnkctl & PCI_EXP_LNKCTL_RCB;
+}
+
 static void program_hpp_type2(struct pci_dev *dev, struct hpp_type2 *hpp)
 {
 	int pos;
@@ -1455,9 +1465,21 @@ static void program_hpp_type2(struct pci
 			~hpp->pci_exp_devctl_and, hpp->pci_exp_devctl_or);
 
 	/* Initialize Link Control Register */
-	if (pcie_cap_has_lnkctl(dev))
+	if (pcie_cap_has_lnkctl(dev)) {
+		bool us_rcb;
+		u16 clear;
+		u16 set;
+
+		us_rcb = pcie_get_upstream_rcb(dev);
+
+		clear = ~hpp->pci_exp_lnkctl_and;
+		set = hpp->pci_exp_lnkctl_or;
+		if (!us_rcb)
+			set &= ~PCI_EXP_LNKCTL_RCB;
+
 		pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
-			~hpp->pci_exp_lnkctl_and, hpp->pci_exp_lnkctl_or);
+						  clear, set);
+	}
 
 	/* Find Advanced Error Reporting Enhanced Capability */
 	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
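
Read as a whole, the patch lets program_hpp_type2() set the RCB bit in a device's Link Control register only when the upstream bridge already has that bit set. The arithmetic can be sketched in shell as below. This is an illustration, not kernel code: the function `lnkctl_after_hpp` and its argument order are invented here, the value 0x0008 for `PCI_EXP_LNKCTL_RCB` is taken from the Linux PCI register definitions, and the clear-then-set order is an assumed reading of `pcie_capability_clear_and_set_word`.

```shell
#!/bin/sh
# Illustration only, not kernel code.
PCI_EXP_LNKCTL_RCB=0x0008   # Read Completion Boundary bit in Link Control

# lnkctl_after_hpp OLD HPP_AND HPP_OR UPSTREAM_LNKCTL
# Prints the 16-bit Link Control value that results from applying the
# hpp and/or masks, with RCB filtered as in the patch.
lnkctl_after_hpp() {
    old=$1 hpp_and=$2 hpp_or=$3 upstream=$4
    clear_bits=$(( ~$hpp_and & 0xffff ))
    set_bits=$(( $hpp_or & 0xffff ))
    # Keep RCB in the "set" mask only if the upstream bridge has RCB set.
    if [ $(( $upstream & $PCI_EXP_LNKCTL_RCB )) -eq 0 ]; then
        set_bits=$(( $set_bits & ~$PCI_EXP_LNKCTL_RCB ))
    fi
    printf '0x%04x\n' $(( (($old & ~$clear_bits) | $set_bits) & 0xffff ))
}

# Same hpp masks, upstream bridge without and with RCB set.
lnkctl_after_hpp 0x0040 0xffff 0x0008 0x0000
lnkctl_after_hpp 0x0040 0xffff 0x0008 0x0008
```

In the first call the requested RCB bit is dropped because the upstream value does not have it; in the second call it is passed through.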
toracat

2016-11-29 07:47

manager   ~0028031

I am uploading the patch file I used. It is derived from the patch you referenced after adjustment of line numbers. According to my build log, it was successfully applied. The actual error during the build was:

drivers/pci/probe.c:1429:1: error: expected identifier or '(' before '+' token
 +static bool pcie_get_upstream_rcb(struct pci_dev *dev)
 ^
drivers/pci/probe.c: In function 'program_hpp_type2':
drivers/pci/probe.c:1473:3: error: implicit declaration of function 'pcie_get_upstream_rcb' [-Werror=implicit-function-declaration]
   us_rcb = pcie_get_upstream_rcb(dev);
   ^

By the way, when you ran 'patch -p1 < index.html', did it go without errors?
toracat

2016-11-29 07:56

manager   ~0028032

Just after I copied the actual error in my previous note, I saw the real error. I will do the build again after the correction.
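
The compiler output quoted in the previous note points at a literal `+` in front of `static bool` on line 1429 of probe.c, which matches the `++static bool pcie_get_upstream_rcb` line in the attached patch: a doubled marker leaves one `+` behind in the patched source. A quick check for this kind of slip could look like the sketch below. The helper name `find_doubled_plus` and the sample file are invented for illustration, and the check is only a heuristic, since an added line that genuinely starts with `+` is flagged too.

```shell
#!/bin/sh
# Hypothetical helper: list lines in a unified diff that start with
# exactly two '+' characters. The '+++ b/file' header has three, so it
# is not reported.
find_doubled_plus() {
    grep -nE '^\+\+([^+]|$)' "$1"
}

# Small sample patch with the same slip as in this ticket.
sample=$(mktemp)
cat > "$sample" <<'EOF'
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1,1 +1,3 @@
 }
++static bool pcie_get_upstream_rcb(struct pci_dev *dev)
+{
EOF

find_doubled_plus "$sample"
```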
TO

2016-11-29 08:46

reporter   ~0028033

Ok, great. Just for your information: I did not have to adjust the line numbers. "patch" detected the offset automatically.
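
The offset handling mentioned here can be reproduced on a toy file. A minimal sketch, assuming `patch` and `diff` are installed; all file names are invented for the demonstration and have nothing to do with the kernel sources.

```shell
#!/bin/sh
# Toy demonstration: a hunk recorded against one line position still
# applies after the target has gained extra lines above it.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' a b c d e f g h i j k l > old.txt
sed 's/^f$/F/' old.txt > new.txt
# diff exits with status 1 when the files differ, which is expected here.
diff -u old.txt new.txt > change.patch || true

# The target has three extra lines at the top, so the hunk is 3 lines off.
{ printf '%s\n' x y z; cat old.txt; } > target.txt

# patch searches around the recorded position for matching context.
patch target.txt < change.patch

# Line 9 of the target held "f" before patching; print it to see the change.
sed -n '9p' target.txt
```

This only covers shifted line numbers. Changed context lines are a different case (fuzz) and are worth reviewing by hand when patch reports them.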
toracat

2016-11-29 08:51

manager   ~0028034

I have uploaded the test version of the plus kernel here:

https://people.centos.org/toracat/kernel/7/plus/bug12277/

After installing the kernel-plus package, please reboot the system and select this plus kernel from the grub menu. It does not automatically become the default kernel (unless it is so defined in /etc/sysconfig/kernel).
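
Which kernel package type is treated as the default can be looked up in /etc/sysconfig/kernel. The sketch below only inspects such a file. The helper name `default_kernel_pkg` and the sample contents are illustrative, and the `UPDATEDEFAULT` and `DEFAULTKERNEL` keys are quoted from a typical CentOS 7 file rather than from this report.

```shell
#!/bin/sh
# Hypothetical helper: print the DEFAULTKERNEL value from a
# sysconfig-style file ($1). The last assignment wins.
default_kernel_pkg() {
    sed -n 's/^DEFAULTKERNEL=//p' "$1" | tail -n 1
}

# Sample file modelled on a stock /etc/sysconfig/kernel (assumed contents).
sample=$(mktemp)
cat > "$sample" <<'EOF'
# UPDATEDEFAULT specifies if new-kernel-pkg should make
# new kernels the default
UPDATEDEFAULT=yes

# DEFAULTKERNEL specifies the default kernel package type
DEFAULTKERNEL=kernel
EOF

default_kernel_pkg "$sample"
# On a live system: default_kernel_pkg /etc/sysconfig/kernel
```

Changing the value to `kernel-plus` is the usual way to prefer plus kernels. Either way, `uname -r` after the reboot shows which kernel actually booted.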
TO

2016-11-29 09:05

reporter   ~0028035

Yes, this seems to work. I only installed kernel-plus-3.10.0-514.bug12277.el7.centos.plus.x86_64.rpm and kernel-plus-devel-3.10.0-514.bug12277.el7.centos.plus.x86_64.rpm. I guess this is sufficient. The new kernel was chosen by default in grub. Here is the result:

[root@txm0001 ~]# uname -r
3.10.0-514.bug12277.el7.centos.plus.x86_64
[root@txm0001 ~]#

InfiniBand also seems to work as intended.

Thank you very much for your help. I hope this kernel will be released in the centosplus repo after the 7.3 release. Or even better: Can you pass this information upstream?
toracat

2016-11-29 09:18

manager   ~0028036

Glad to hear it worked. Yes, those 2 packages should be enough.

The patch will be applied to the first update to kernel-plus in CentOS 7.3 (a little too late to include in the GA kernel).

Regarding getting the fix into the upstream kernel, I'd like to ask you to file a bug report (http://bugzilla.redhat.com) because only you can test it.
TO

2016-11-29 09:42

reporter   ~0028038

Ok, thank you again. I filed a bug report at Red Hat. In fact, I did not know that this is recommended without a support contract.
TO

2016-12-09 10:11

reporter   ~0028115

I just installed 3.10.0-514.2.2.el7.centos.plus.x86_64 and it seems to be working. Thank you again. I think we can close this and wait until there is an upstream fix for the regular kernel.
toracat

2016-12-09 15:21

manager   ~0028118

Thanks for the report. I will close this ticket as 'resolved' for now. When the upstream (therefore CentOS) kernel gets fixed, I will add a note here.
toracat

2017-03-02 18:37

manager   ~0028739

RHEL/CentOS kernel-3.10.0-514.10.2.el7 has this fix. Therefore the patch has been removed from the plus kernel.
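
A rough way to tell whether a node already runs a stock kernel at or past the fixed build (3.10.0-514.10.2.el7, per the note above) is a version-sort comparison. A sketch only: `kernel_at_least` is a made-up helper, it relies on `sort -V` being available, and it ignores the centosplus builds from this ticket, which carried the patch at lower version numbers.

```shell
#!/bin/sh
# Hypothetical helper: succeed if kernel release $1 is at least $2.
# Everything from ".el7" onwards (dist tag, architecture) is ignored.
kernel_at_least() {
    have=${1%%.el7*}
    want=${2%%.el7*}
    lowest=$(printf '%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

fixed=3.10.0-514.10.2.el7
for k in 3.10.0-514.el7.x86_64 3.10.0-514.2.2.el7.x86_64 3.10.0-514.10.2.el7.x86_64; do
    if kernel_at_least "$k" "$fixed"; then
        echo "$k: at or past $fixed"
    else
        echo "$k: older than $fixed"
    fi
done
# On a live system: kernel_at_least "$(uname -r)" "$fixed"
```

"Older than the fixed build" does not by itself mean affected: according to this report, 3.10.0-327.36.3 boots fine on the same hardware.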

Issue History

Date Modified Username Field Change
2016-11-28 13:29 TO New Issue
2016-11-28 13:29 TO Tag Attached: InfiniBand
2016-11-28 13:29 TO Tag Attached: kernel
2016-11-28 16:23 toracat Status new => feedback
2016-11-28 16:23 toracat Note Added: 0028020
2016-11-28 16:31 TO Note Added: 0028021
2016-11-28 16:31 TO Status feedback => assigned
2016-11-28 16:37 toracat Note Added: 0028022
2016-11-28 16:48 TO Note Added: 0028023
2016-11-28 17:31 toracat Note Added: 0028024
2016-11-28 18:28 toracat Note Added: 0028025
2016-11-29 06:12 TO Note Added: 0028029
2016-11-29 06:49 TO Note Added: 0028030
2016-11-29 07:39 toracat File Added: centos-linux-3.10-pci-fix-regression-mlx4-bug12277.patch
2016-11-29 07:47 toracat File Added: centos-linux-3.10-pci-fix-regression-mlx4-bug12277-2.patch
2016-11-29 07:47 toracat Note Added: 0028031
2016-11-29 07:48 toracat File Deleted: centos-linux-3.10-pci-fix-regression-mlx4-bug12277-2.patch
2016-11-29 07:56 toracat Note Added: 0028032
2016-11-29 08:46 TO Note Added: 0028033
2016-11-29 08:51 toracat Note Added: 0028034
2016-11-29 09:05 TO Note Added: 0028035
2016-11-29 09:18 toracat Note Added: 0028036
2016-11-29 09:42 TO Note Added: 0028038
2016-12-09 10:11 TO Note Added: 0028115
2016-12-09 15:21 toracat Note Added: 0028118
2016-12-09 15:21 toracat Status assigned => resolved
2016-12-09 15:21 toracat Resolution open => fixed
2017-03-02 18:37 toracat Note Added: 0028739