2018-01-21 01:09 UTC

ID: 0012277
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2017-03-02 18:37
Platform: IBM NeXtScale nx360 M5
OS: CentOS
OS Version: 7.3.1611
Product Version: 7.3.1611
Target Version:
Fixed in Version:
Summary: 0012277: Kernel 3.10.0-514 does not boot with mlx4_core module on some IBM systems
Description: As a test, I updated some of our nodes from the CR repository. On many systems this works without any problems, but on "IBM NeXtScale nx360 M5" nodes the system does not boot.

I think this is related to what I found here:

When I boot the previous kernel (3.10.0-327.36.3.el7.x86_64), the same system works fine.

As we have Mellanox InfiniBand cards, mlx4_core is in use. If I blacklist this module, the system also boots with the new kernel.
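
For reference, blacklisting a module as described above could look roughly like the following. This is only a sketch, not necessarily what was done on the affected nodes: the helper name blacklist_module and the file name blacklist-mlx4_core.conf are assumptions, and the demo writes to a scratch directory instead of /etc/modprobe.d. If the module is loaded from the initramfs, a boot parameter such as rd.driver.blacklist=mlx4_core may be needed as well.

```shell
# Hypothetical helper: write a "blacklist" entry for one module into a
# modprobe.d-style directory (second argument, default /etc/modprobe.d).
blacklist_module() {
    module="$1"
    conf_dir="${2:-/etc/modprobe.d}"
    printf 'blacklist %s\n' "$module" > "$conf_dir/blacklist-$module.conf"
}

# Demo against a scratch directory so nothing on the system is changed.
demo_dir=$(mktemp -d)
blacklist_module mlx4_core "$demo_dir"
cat "$demo_dir/blacklist-mlx4_core.conf"
```

On a real node the second argument would be omitted (and root privileges would be needed), followed by a reboot.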

I do not have this problem on a "System x3550 M4". There everything works as intended.
Additional Information:
[ 7.312283] ERST: Can not request iomem region <0x 3b3dd000-0x 3b3dec00> for ERST.
[ 8.490015] i8042: No controller found
[ 10.587931] Uhhuh. NMI received for unknown reason 3d on CPU 0.
[ 10.594539] Do you have a strange power saving mode enabled?
[ 10.600854] Dazed and confused, but trying to continue
[ OK ] Started Show Plymouth Boot Screen.
[ OK ] Reached target Paths.
[ OK ] Reached target Basic System.
[ OK ] Found device ST9500620NS_81Y9715_81Y3856IBM 3.
         Starting File System Check on /dev/...e-595a-4388-9b4d-0ee553589aad...
[ OK ] Started File System Check on /dev/d...9be-595a-4388-9b4d-0ee553589aad.
[ 70.028613] mlx4_core 0000:81:00.0: device is going to be reset
Tags: InfiniBand, kernel
Attached Files
  • centos-linux-3.10-pci-fix-regression-mlx4-bug12277.patch (1,348 bytes) 2016-11-29 07:39
    centosplus patch (bug#12277)
    Ref: https://patchwork.kernel.org/patch/9397341/
    --- a/drivers/pci/probe.c	2016-10-19 07:16:25.000000000 -0700
    +++ b/drivers/pci/probe.c	2016-11-28 09:30:30.621332097 -0800
    @@ -1426,6 +1426,16 @@ static void program_hpp_type1(struct pci
     		dev_warn(&dev->dev, "PCI-X settings not supported\n");
    ++static bool pcie_get_upstream_rcb(struct pci_dev *dev)
    +	struct pci_dev *bridge = pci_upstream_bridge(dev);
    +	u16 lnkctl;
    +	pcie_capability_read_word(bridge, PCI_EXP_LNKCTL, &lnkctl);
    +	return lnkctl & PCI_EXP_LNKCTL_RCB;
     static void program_hpp_type2(struct pci_dev *dev, struct hpp_type2 *hpp)
     	int pos;
    @@ -1455,9 +1465,21 @@ static void program_hpp_type2(struct pci
     			~hpp->pci_exp_devctl_and, hpp->pci_exp_devctl_or);
     	/* Initialize Link Control Register */
    -	if (pcie_cap_has_lnkctl(dev))
    +	if (pcie_cap_has_lnkctl(dev)) {
    +		bool us_rcb;
    +		u16 clear;
    +		u16 set;
    +		us_rcb = pcie_get_upstream_rcb(dev);
    +		clear = ~hpp->pci_exp_lnkctl_and;
    +		set = hpp->pci_exp_lnkctl_or;
    +		if (!us_rcb)
    +			set &= ~PCI_EXP_LNKCTL_RCB;
     		pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
    -			~hpp->pci_exp_lnkctl_and, hpp->pci_exp_lnkctl_or);
    +						  clear, set);
    +	}
     	/* Find Advanced Error Reporting Enhanced Capability */
     	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);




toracat (manager)

If I build a kernel with the referenced patch, will you be able to test?


TO (reporter)

Yes, this is our test system, I can play around with it.

The question is how difficult it is to install and boot your kernel. Please give me a very brief overview of what I would have to do (installing an rpm, copying files to /boot, or whatever). The last time I built a custom kernel on a Linux system must have been ages ago.


toracat (manager)

It will be the centosplus kernel (kernel-plus). I will let you know when it's ready for testing with the details on how to install.


TO (reporter)

Ok, shouldn't be a problem using the centosplus repo. Please remember that the current plus kernel in the plus repo is a 3.10.0-327.36.3 kernel but I need a 3.10.0-514 kernel to reproduce the problem.


toracat (manager)

Sure, it is -514 that I'm preparing. The official version will be published with the GA release of CentOS 7.3.1611.


toracat (manager)

Unfortunately, it turns out that the patch cannot be applied to the current CentOS kernel (3.10.0-514). I get an 'implicit declaration of function' error. Fixing this is beyond what can be done within the scope of the plus kernel.


TO (reporter)

Many thanks for your efforts. This made me curious how to compile the kernel myself. Here is what I did:

git clone https://git.centos.org/git/centos-git-common.git
git clone https://git.centos.org/git/rpms/kernel.git
cd kernel
git checkout c7
git checkout -b my-kernel
tar xf linux-3.10.0-514.el7.tar.xz
cd linux-3.10.0-514.el7
wget https://patchwork.kernel.org/patch/9397341/raw/
patch -p1 < index.html
rm index.html
cd ..
mv linux-3.10.0-514.el7.tar.xz linux-3.10.0-514.el7.tar.xz.old
tar cfJ linux-3.10.0-514.el7.tar.xz linux-3.10.0-514.el7
rm -rf linux-3.10.0-514.el7
cd ..
rpmbuild --nodeps --define "%_topdir `pwd`" -bs SPECS/kernel.spec
rpmbuild --define "%_topdir `pwd`" -ba SPECS/kernel.spec

I don't know whether all these steps are necessary, but I can boot this kernel. Can you check whether these steps make sense and maybe try to build the plus kernel again? I don't think it makes sense for us to maintain the kernel ourselves after each new release. I guess this should be committed upstream.


TO (reporter)

By the way: your error message sounds as if one of the hunks in the patch file failed to apply. If it helps, I can provide the patched file that I used.


toracat (manager)

I am uploading the patch file I used. It is derived from the patch you referenced after adjustment of line numbers. According to my build log, it was successfully applied. The actual error during the build was:

drivers/pci/probe.c:1429:1: error: expected identifier or '(' before '+' token
 +static bool pcie_get_upstream_rcb(struct pci_dev *dev)
drivers/pci/probe.c: In function 'program_hpp_type2':
drivers/pci/probe.c:1473:3: error: implicit declaration of function 'pcie_get_upstream_rcb' [-Werror=implicit-function-declaration]
   us_rcb = pcie_get_upstream_rcb(dev);

By the way, when you ran 'patch -p1 < index.html', did it go without errors?


toracat (manager)

Just after I copied the actual error in my previous note, I saw the real error. I will do the build again after the correction.
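
The first compiler error quoted above points at a source line that begins with '+', which is consistent with the "++static bool" line in the attached patch: a doubled plus sign leaves a stray '+' in the patched file. A minimal sketch of a check for this, assuming a unified diff; the function name find_stray_plus and the demo patch content are made up for illustration:

```shell
# Hypothetical check: list added lines of a unified diff that start with "++".
# The "+++ b/file" header has three plus signs, so '^++[^+]' skips it.
find_stray_plus() {
    grep -n '^++[^+]' "$1"
}

# Demo patch file (illustrative content only).
demo_patch=$(mktemp)
cat > "$demo_patch" <<'EOF'
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1,2 +1,4 @@
 context line
++static bool pcie_get_upstream_rcb(struct pci_dev *dev)
+u16 lnkctl;
EOF

stray=$(find_stray_plus "$demo_patch")
echo "$stray"
```

An added source line that really starts with '+' would also be reported, so a hit is only a hint to look at that line.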


TO (reporter)

Ok, great. Just for your information: I did not have to adjust the line numbers. "patch" detected the offset automatically.


toracat (manager)

I have uploaded the test version of the plus kernel here:


After installing the kernel-plus package, please reboot the system and select this plus kernel from the grub menu. It does not automatically become the default kernel (unless it is so defined in /etc/sysconfig/kernel).
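
As a rough illustration of the /etc/sysconfig/kernel remark: on CentOS 7 this file normally carries a DEFAULTKERNEL= line naming the package whose newly installed kernels become the boot default. The sketch below assumes that layout; the function name default_kernel_pkg is hypothetical, and the demo reads a scratch copy rather than the real file.

```shell
# Hypothetical helper: print the package named by DEFAULTKERNEL= in a
# sysconfig-style file (argument, default /etc/sysconfig/kernel).
default_kernel_pkg() {
    conf="${1:-/etc/sysconfig/kernel}"
    sed -n 's/^DEFAULTKERNEL=//p' "$conf"
}

# Demo against a scratch file with the assumed layout.
demo_conf=$(mktemp)
printf 'UPDATEDEFAULT=yes\nDEFAULTKERNEL=kernel-plus\n' > "$demo_conf"
default_pkg=$(default_kernel_pkg "$demo_conf")
echo "$default_pkg"
```

If this prints kernel rather than kernel-plus, a newly installed plus kernel would be expected to need manual selection in the grub menu.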


TO (reporter)

Yes, this seems to work. I only installed kernel-plus-3.10.0-514.bug12277.el7.centos.plus.x86_64.rpm and kernel-plus-devel-3.10.0-514.bug12277.el7.centos.plus.x86_64.rpm . I guess this is sufficient. The new kernel was chosen by default in grub. Here is the result:

[root@txm0001 ~]# uname -r
[root@txm0001 ~]#

InfiniBand also seems to work as intended.

Thank you very much for your help. I hope this kernel will be released in the centosplus repo after the 7.3 release. Or even better: can you pass this information upstream?


toracat (manager)

Glad to hear it worked. Yes, those 2 packages should be enough.

The patch will be applied to the first update to kernel-plus in CentOS 7.3 (a little too late to include in the GA kernel).

Regarding getting the fix into the upstream kernel, I'd like to ask you to file a bug report (http://bugzilla.redhat.com) because only you can test it.


TO (reporter)

Ok, thank you again. I filed a bug report at Red Hat. In fact, I did not know that this is recommended without a support contract.


TO (reporter)

I just installed 3.10.0-514.2.2.el7.centos.plus.x86_64 and it seems to be working. Thank you again. I think we can close this and wait until there is an upstream fix for the regular kernel.


toracat (manager)

Thanks for the report. I will close this ticket as 'resolved' for now. When the upstream (therefore CentOS) kernel gets fixed, I will add a note here.


toracat (manager)

RHEL/CentOS kernel-3.10.0-514.10.2.el7 has this fix. Therefore the patch has been removed from the plus kernel.
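
For anyone checking whether a regular (non-plus) kernel is at or above the fixed release, a version comparison along these lines may help. This is a sketch under assumptions: it relies on GNU sort -V, the function name kernel_at_least is hypothetical, and sort -V ordering only approximates RPM version comparison, so the ".el7..." suffix is stripped before comparing. It does not apply to the plus kernels, which carried the patch earlier (3.10.0-514.2.2.el7.centos.plus is reported working above).

```shell
# Hypothetical helper: succeed if kernel release $1 is at or above $2.
# Everything from ".el7" onward is dropped so only the numeric part is compared.
kernel_at_least() {
    have="${1%%.el7*}"
    want="${2%%.el7*}"
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

fixed="3.10.0-514.10.2.el7"
for k in 3.10.0-327.36.3.el7.x86_64 3.10.0-514.2.2.el7.x86_64 3.10.0-514.10.2.el7.x86_64; do
    if kernel_at_least "$k" "$fixed"; then
        echo "$k: at or above $fixed"
    else
        echo "$k: below $fixed"
    fi
done
```

On a live system the first argument would typically be "$(uname -r)".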

Issue History
Date Modified Username Field Change
2016-11-28 13:29 TO New Issue
2016-11-28 13:29 TO Tag Attached: InfiniBand
2016-11-28 13:29 TO Tag Attached: kernel
2016-11-28 16:23 toracat Status new => feedback
2016-11-28 16:23 toracat Note Added: 0028020
2016-11-28 16:31 TO Note Added: 0028021
2016-11-28 16:31 TO Status feedback => assigned
2016-11-28 16:37 toracat Note Added: 0028022
2016-11-28 16:48 TO Note Added: 0028023
2016-11-28 17:31 toracat Note Added: 0028024
2016-11-28 18:28 toracat Note Added: 0028025
2016-11-29 06:12 TO Note Added: 0028029
2016-11-29 06:49 TO Note Added: 0028030
2016-11-29 07:39 toracat File Added: centos-linux-3.10-pci-fix-regression-mlx4-bug12277.patch
2016-11-29 07:47 toracat File Added: centos-linux-3.10-pci-fix-regression-mlx4-bug12277-2.patch
2016-11-29 07:47 toracat Note Added: 0028031
2016-11-29 07:48 toracat File Deleted: centos-linux-3.10-pci-fix-regression-mlx4-bug12277-2.patch
2016-11-29 07:56 toracat Note Added: 0028032
2016-11-29 08:46 TO Note Added: 0028033
2016-11-29 08:51 toracat Note Added: 0028034
2016-11-29 09:05 TO Note Added: 0028035
2016-11-29 09:18 toracat Note Added: 0028036
2016-11-29 09:42 TO Note Added: 0028038
2016-12-09 10:11 TO Note Added: 0028115
2016-12-09 15:21 toracat Note Added: 0028118
2016-12-09 15:21 toracat Status assigned => resolved
2016-12-09 15:21 toracat Resolution open => fixed
2017-03-02 18:37 toracat Note Added: 0028739