2017-10-23 18:50 UTC

ID: 0012277
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2017-03-02 18:37
Reporter: TO
Priority: normal
Severity: crash
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: IBM NeXtScale nx360 M5
OS: CentOS
OS Version: 7.3.1611
Product Version: 7.3.1611
Summary: 0012277: Kernel 3.10.0-514 does not boot with mlx4_core module on some IBM systems
Description: As a test, I updated some of our nodes from the CR repository. On many systems this works without any problems, but "IBM NeXtScale nx360 M5" nodes do not boot with the new kernel.

I think this is related to what I found here:
https://patchwork.kernel.org/patch/9397341/

When I boot the previous kernel (3.10.0-327.36.3.el7.x86_64), the same system works fine.

As we have Mellanox InfiniBand cards, mlx4_core is in use. If I blacklist this module, the system also boots with the new kernel.
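For readers who want to reproduce the workaround, blacklisting a module is usually done with a file under /etc/modprobe.d. The following is a minimal sketch, not the reporter's exact procedure: the function name `blacklist_module` and the file name `blacklist-<module>.conf` are illustrative assumptions.

```shell
# Sketch: write a modprobe blacklist entry for a kernel module.
# blacklist_module and the blacklist-<module>.conf name are hypothetical.
# Usage: blacklist_module <module> [conf-dir]
blacklist_module() {
    module="$1"
    conf_dir="${2:-/etc/modprobe.d}"   # default location read by modprobe
    printf 'blacklist %s\n' "$module" > "$conf_dir/blacklist-$module.conf"
}

# Example (as root): blacklist_module mlx4_core
```

If the module is loaded from the initramfs, the modprobe.d entry may not be enough on its own. Adding `rd.driver.blacklist=mlx4_core` to the kernel command line is one option in that case.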

I do not have this problem on a "System x3550 M4". There everything works as intended.
Additional Information:
[ 7.312283] ERST: Can not request iomem region <0x 3b3dd000-0x 3b3dec00> for ERST.
[ 8.490015] i8042: No controller found
[ 10.587931] Uhhuh. NMI received for unknown reason 3d on CPU 0.
[ 10.594539] Do you have a strange power saving mode enabled?
[ 10.600854] Dazed and confused, but trying to continue
[ OK ] Started Show Plymouth Boot Screen.
[ OK ] Reached target Paths.
[ OK ] Reached target Basic System.
[ OK ] Found device ST9500620NS_81Y9715_81Y3856IBM 3.
         Starting File System Check on /dev/...e-595a-4388-9b4d-0ee553589aad...
[ OK ] Started File System Check on /dev/d...9be-595a-4388-9b4d-0ee553589aad.
[ 70.028613] mlx4_core 0000:81:00.0: device is going to be reset
Tags: InfiniBand, kernel
Attached Files:
  • centos-linux-3.10-pci-fix-regression-mlx4-bug12277.patch (1,348 bytes) 2016-11-29 07:39
    centosplus patch (bug#12277)
    
    Ref: https://patchwork.kernel.org/patch/9397341/
    
    --- a/drivers/pci/probe.c	2016-10-19 07:16:25.000000000 -0700
    +++ b/drivers/pci/probe.c	2016-11-28 09:30:30.621332097 -0800
    @@ -1426,6 +1426,16 @@ static void program_hpp_type1(struct pci
     		dev_warn(&dev->dev, "PCI-X settings not supported\n");
     }
     
    ++static bool pcie_get_upstream_rcb(struct pci_dev *dev)
    +{
    +	struct pci_dev *bridge = pci_upstream_bridge(dev);
    +	u16 lnkctl;
    +
    +	pcie_capability_read_word(bridge, PCI_EXP_LNKCTL, &lnkctl);
    +
    +	return lnkctl & PCI_EXP_LNKCTL_RCB;
    +}
    +
     static void program_hpp_type2(struct pci_dev *dev, struct hpp_type2 *hpp)
     {
     	int pos;
    @@ -1455,9 +1465,21 @@ static void program_hpp_type2(struct pci
     			~hpp->pci_exp_devctl_and, hpp->pci_exp_devctl_or);
     
     	/* Initialize Link Control Register */
    -	if (pcie_cap_has_lnkctl(dev))
    +	if (pcie_cap_has_lnkctl(dev)) {
    +		bool us_rcb;
    +		u16 clear;
    +		u16 set;
    +
    +		us_rcb = pcie_get_upstream_rcb(dev);
    +
    +		clear = ~hpp->pci_exp_lnkctl_and;
    +		set = hpp->pci_exp_lnkctl_or;
    +		if (!us_rcb)
    +			set &= ~PCI_EXP_LNKCTL_RCB;
    +
     		pcie_capability_clear_and_set_word(dev, PCI_EXP_LNKCTL,
    -			~hpp->pci_exp_lnkctl_and, hpp->pci_exp_lnkctl_or);
    +						  clear, set);
    +	}
     
     	/* Find Advanced Error Reporting Enhanced Capability */
     	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
    


Notes

~0028020

toracat (manager)

If I build a kernel with the referenced patch, will you be able to test?

~0028021

TO (reporter)

Yes, this is our test system, I can play around with it.

The question is how difficult it is to install and boot your kernel. Please give me a very brief overview of what I would have to do (installing an rpm, copying files to /boot, or whatever). The last time I built a custom kernel on a Linux system must have been ages ago.

~0028022

toracat (manager)

It will be the centosplus kernel (kernel-plus). I will let you know when it's ready for testing with the details on how to install.

~0028023

TO (reporter)

Ok, shouldn't be a problem using the centosplus repo. Please remember that the current plus kernel in the plus repo is a 3.10.0-327.36.3 kernel but I need a 3.10.0-514 kernel to reproduce the problem.

~0028024

toracat (manager)

Sure, it is -514 that I'm preparing. The official version will be published with the GA release of CentOS 7.3.1611.

~0028025

toracat (manager)

Unfortunately, it turns out that the patch cannot be applied to the current CentOS kernel (3.10.0-514). I get an 'implicit declaration of function' error. Fixing this is beyond what can be done within the scope of the plus kernel.

~0028029

TO (reporter)

Many thanks for your efforts. This made me curious how to compile the kernel myself. Here is what I did:

----------
export RPM_BUILD_NCPUS=12
git clone https://git.centos.org/git/centos-git-common.git
git clone https://git.centos.org/git/rpms/kernel.git
cd kernel
git checkout c7
../centos-git-common/get_sources.sh
git checkout -b my-kernel
cd SOURCES
tar xf linux-3.10.0-514.el7.tar.xz
cd linux-3.10.0-514.el7
wget https://patchwork.kernel.org/patch/9397341/raw/
patch -p1 < index.html
rm index.html
cd ..
mv linux-3.10.0-514.el7.tar.xz linux-3.10.0-514.el7.tar.xz.old
tar cfJ linux-3.10.0-514.el7.tar.xz linux-3.10.0-514.el7
rm -rf linux-3.10.0-514.el7
cd ..
rpmbuild --nodeps --define "%_topdir `pwd`" -bs SPECS/kernel.spec
rpmbuild --define "%_topdir `pwd`" -ba SPECS/kernel.spec
----------

I don't know whether all these steps are necessary, but I can boot this kernel. Can you check whether these steps make sense and maybe try to build the plus kernel again? I think it makes no sense for us to rebuild the kernel ourselves after each new release. I guess this should be committed upstream.

~0028030

TO (reporter)

By the way: your error message sounds as if one of the hunks in the patch file failed. If it helps, I can provide the patched file that I used.

~0028031

toracat (manager)

I am uploading the patch file I used. It is derived from the patch you referenced after adjustment of line numbers. According to my build log, it was successfully applied. The actual error during the build was:

drivers/pci/probe.c:1429:1: error: expected identifier or '(' before '+' token
 +static bool pcie_get_upstream_rcb(struct pci_dev *dev)
 ^
drivers/pci/probe.c: In function 'program_hpp_type2':
drivers/pci/probe.c:1473:3: error: implicit declaration of function 'pcie_get_upstream_rcb' [-Werror=implicit-function-declaration]
   us_rcb = pcie_get_upstream_rcb(dev);
   ^

By the way, when you ran 'patch -p1 < index.html', did it go without errors?

~0028032

toracat (manager)

Just after I copied the actual error in my previous note, I saw the real error. I will do the build again after the correction.
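The note does not say what the real error was. The attached patch, however, contains an added line that starts with `++static`, which would leave a literal `+` in front of `static` in the patched source and would match the compiler message quoted above. A rough heuristic for spotting such doubled markers in a unified diff is sketched below; the function name `find_doubled_plus` is hypothetical.

```shell
# Sketch: list lines in a unified diff that begin with "++",
# excluding file header lines of the form "+++ b/path".
# Heuristic only: an added line whose content really starts with '+'
# is reported as well.
# Usage: find_doubled_plus <patch-file>
find_doubled_plus() {
    grep -n '^++' "$1" | grep -v '^[0-9]*:+++ '
}
```

Each hit is printed with its line number in the patch file, so it can be checked by hand before the build is started.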

~0028033

TO (reporter)

Ok, great. Just for information: I did not have to adjust the line numbers. "patch" detected the offset automatically.
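Whether a patch applies cleanly, with or without an offset, can be checked before touching the source tree by using the `--dry-run` option of GNU patch. A minimal sketch follows, assuming GNU patch and an absolute path to the patch file; the function name `apply_checked` is hypothetical.

```shell
# Sketch: apply a patch only if a dry run succeeds first.
# Assumes GNU patch; <patch-file> must be an absolute path because
# the function changes into the source directory.
# Usage: apply_checked <source-dir> <patch-file>
apply_checked() {
    src_dir="$1"
    patch_file="$2"
    # The dry run reports offsets and rejected hunks without modifying files.
    ( cd "$src_dir" && patch -p1 --dry-run < "$patch_file" ) || return 1
    ( cd "$src_dir" && patch -p1 < "$patch_file" )
}
```

A message such as "Hunk #1 succeeded at ... (offset ... lines)" in the dry-run output indicates that patch located the context at a different line than the diff header states, which is the behaviour described in the note.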

~0028034

toracat (manager)

I have uploaded the test version of the plus kernel here:

https://people.centos.org/toracat/kernel/7/plus/bug12277/

After installing the kernel-plus package, please reboot the system and select this plus kernel from the grub menu. It does not automatically become the default kernel (unless it is so defined in /etc/sysconfig/kernel).
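The setting referred to here is, as far as I know, the `DEFAULTKERNEL` line in /etc/sysconfig/kernel, which names the kernel package type that is made the default on installation. A small sketch for reading it; the function name `default_kernel` is hypothetical, and the optional file argument exists only to make the function easy to try on a copy.

```shell
# Sketch: print the DEFAULTKERNEL value from a sysconfig-style file.
# Usage: default_kernel [file]   (defaults to /etc/sysconfig/kernel)
default_kernel() {
    sed -n 's/^DEFAULTKERNEL=//p' "${1:-/etc/sysconfig/kernel}"
}
```

If this prints `kernel-plus`, a newly installed plus kernel would be expected to become the default boot entry; with a value of `kernel` it would have to be selected in the grub menu instead.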

~0028035

TO (reporter)

Yes, this seems to work. I only installed kernel-plus-3.10.0-514.bug12277.el7.centos.plus.x86_64.rpm and kernel-plus-devel-3.10.0-514.bug12277.el7.centos.plus.x86_64.rpm . I guess this is sufficient. The new kernel was chosen by default in grub. Here is the result:

[root@txm0001 ~]# uname -r
3.10.0-514.bug12277.el7.centos.plus.x86_64
[root@txm0001 ~]#

InfiniBand also seems to work as intended.

Thank you very much for your help. I hope this kernel will be released in the centosplus repo after the 7.3 release. Or even better: can you pass this information upstream?

~0028036

toracat (manager)

Glad to hear it worked. Yes, those 2 packages should be enough.

The patch will be applied to the first update to kernel-plus in CentOS 7.3 (a little too late to include in the GA kernel).

Regarding getting the fix into the upstream kernel, I'd like to ask you to file a bug report (http://bugzilla.redhat.com) because only you can test it.

~0028038

TO (reporter)

Ok, thank you again. I filed a bug report at Red Hat. In fact, I did not know that this is recommended without a support contract.

~0028115

TO (reporter)

I just installed 3.10.0-514.2.2.el7.centos.plus.x86_64 and it seems to be working. Thank you again. I think we can close this and wait until there is an upstream fix for the regular kernel.

~0028118

toracat (manager)

Thanks for the report. I will close this ticket as 'resolved' for now. When the upstream (therefore CentOS) kernel gets fixed, I will add a note here.

~0028739

toracat (manager)

RHEL/CentOS kernel-3.10.0-514.10.2.el7 has this fix. Therefore the patch has been removed from the plus kernel.

Issue History
Date Modified Username Field Change
2016-11-28 13:29 TO New Issue
2016-11-28 13:29 TO Tag Attached: InfiniBand
2016-11-28 13:29 TO Tag Attached: kernel
2016-11-28 16:23 toracat Status new => feedback
2016-11-28 16:23 toracat Note Added: 0028020
2016-11-28 16:31 TO Note Added: 0028021
2016-11-28 16:31 TO Status feedback => assigned
2016-11-28 16:37 toracat Note Added: 0028022
2016-11-28 16:48 TO Note Added: 0028023
2016-11-28 17:31 toracat Note Added: 0028024
2016-11-28 18:28 toracat Note Added: 0028025
2016-11-29 06:12 TO Note Added: 0028029
2016-11-29 06:49 TO Note Added: 0028030
2016-11-29 07:39 toracat File Added: centos-linux-3.10-pci-fix-regression-mlx4-bug12277.patch
2016-11-29 07:47 toracat File Added: centos-linux-3.10-pci-fix-regression-mlx4-bug12277-2.patch
2016-11-29 07:47 toracat Note Added: 0028031
2016-11-29 07:48 toracat File Deleted: centos-linux-3.10-pci-fix-regression-mlx4-bug12277-2.patch
2016-11-29 07:56 toracat Note Added: 0028032
2016-11-29 08:46 TO Note Added: 0028033
2016-11-29 08:51 toracat Note Added: 0028034
2016-11-29 09:05 TO Note Added: 0028035
2016-11-29 09:18 toracat Note Added: 0028036
2016-11-29 09:42 TO Note Added: 0028038
2016-12-09 10:11 TO Note Added: 0028115
2016-12-09 15:21 toracat Note Added: 0028118
2016-12-09 15:21 toracat Status assigned => resolved
2016-12-09 15:21 toracat Resolution open => fixed
2017-03-02 18:37 toracat Note Added: 0028739