View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0014419 | CentOS-7 | kernel | public | 2018-01-27 00:54 | 2019-03-06 09:56 |
Reporter | justinclift | ||||
Priority | high | Severity | major | Reproducibility | always |
Status | closed | Resolution | not fixable | ||
Product Version | 7.4.1708 | ||||
Target Version | Fixed in Version | ||||
Summary | 0014419: Mellanox ConnectX cards refuse to work in Ethernet mode with kernel-3.10.0-693.11.6 onwards | ||||
Description | Rebooted my CentOS 7 x64 (1708) desktop this evening after a yum update, and the Mellanox ConnectX-2 card in it (set to run in Ethernet mode) refused to come up correctly. That was with the latest kernel (installed this evening), kernel-3.10.0-693.17.1. Instead, it came up in InfiniBand mode, with "ip addr" complaining about a potential bad address. Had a feeling it might be kernel related (something unforeseen from the recent Meltdown/Spectre patches maybe?), so tried the previous kernels to see if that's the cause. Short answer: Yep. ;) My desktop has these kernels installed at the moment: * kernel-3.10.0-693.17.1.el7.x86_64 | ConnectX-2 card not working * kernel-3.10.0-693.11.6.el7.x86_64 | ConnectX-2 card not working * kernel-3.10.0-693.11.1.el7.x86_64 | ConnectX-2 card works * kernel-3.10.0-693.el7.x86_64 | ConnectX-2 card works So, kernel-3.10.0-693.11.6 and onwards are "busted" from this point of view. After reverting to either of the older two kernels, the card comes up fine, working as 10GbE as expected. For reference, this is the address of the card in my desktop: $ lspci | grep Mellanox 06:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0) And this is the entry for it in /etc/rdma/mlx4.conf which tells it to operate in Ethernet mode (it's a single port card): $ tail -2 /etc/rdma/mlx4.conf # 0000:06:00.0 eth | ||||
Steps To Reproduce | 1. With a Mellanox card (probably any of the ConnectX series), set it to operate in Ethernet mode via /etc/rdma/mlx4.conf. 2. Boot into kernel-3.10.0-693.11.1 (known good) and confirm the card is recognised by the system as Ethernet, with an Ethernet address given. 3. Reboot into kernel-3.10.0-693.11.6 or later. The card will now not be recognised as Ethernet, instead coming up as InfiniBand. | ||||
Tags | No tags attached. | ||||
abrt_hash | |||||
URL | |||||
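A quick way to confirm which mode the driver picked after booting each kernel is to read the per-port type that mlx4_core exposes in sysfs. This is only a sketch: it assumes the driver provides `mlx4_port<N>` files under the card's PCI device directory, and the function name and the optional second argument (an alternate sysfs root, handy for trying it against a fake tree) are made up for illustration.

```shell
#!/bin/sh
# Sketch: print the link type reported for each port of an mlx4 card.
# Assumption: mlx4_core exposes mlx4_port1, mlx4_port2, ... under the PCI
# device directory. The second argument is a hypothetical sysfs root override.
mlx4_port_types() {
    pci_addr=$1                      # full address, e.g. 0000:06:00.0
    sysfs_root=${2:-/sys}
    dev_dir="$sysfs_root/bus/pci/devices/$pci_addr"
    found=0
    for port_file in "$dev_dir"/mlx4_port*; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$port_file" ] || continue
        found=1
        printf '%s: %s\n' "${port_file##*/}" "$(cat "$port_file")"
    done
    # Non-zero exit status when no port files were found for this device.
    [ "$found" -eq 1 ]
}
```

For the card in this report the call would be `mlx4_port_types 0000:06:00.0`. Note that lspci prints the short form (06:00.0), while sysfs and /etc/rdma/mlx4.conf use the address with the 0000: domain prefix.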
Upstream bug : https://bugzilla.redhat.com/show_bug.cgi?id=1539875 | |
Keep us posted (private bug entry upstream) :) | |
It's being looked into by RH kernel staff. Someone else opened a RH BZ about this too (also a private bug entry now), however for them the issue was apparently resolved by updating the BIOS on their motherboard/system. No such luck for my system though. The firmware on the Mellanox card (2.9.1000) is the latest on the Mellanox website, and updating the motherboard BIOS (it was out of date) didn't help. So... still "in progress". ;) |
|
Found a workaround for now. Setting the port type manually in /etc/modprobe.d/mlx4.conf, then regenerating the initramfs image so it's read on boot, works. E.g.: `# echo options mlx4_core port_type_array=2 >> /etc/modprobe.d/mlx4.conf` `# dracut --force` `# grub2-mkconfig -o /boot/grub2/grub.cfg` The "port_type_array=2" option is the important bit. `modinfo mlx4_core` says this about port_type_array: `port_type_array:Array of port types: HW_DEFAULT (0) is default 1 for IB, 2 for Ethernet (array of int)` So for my single port card, the single 2 value means "set it to Ethernet mode". I'm not sure whether the grub2-mkconfig command was needed, but it didn't seem to hurt things. :) After this, the card is coming up fine in Ethernet mode and working as per normal. :) |
|
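The `echo ... >>` step in the workaround appends a new line every time it is run. A minimal sketch of a repeatable version follows; the function name and the optional file argument are invented for illustration, and it only covers the modprobe.d edit, not the dracut step.

```shell
#!/bin/sh
# Sketch: add "options mlx4_core port_type_array=<types>" to a modprobe.d
# file unless an mlx4_core port_type_array option is already there.
# The function name and arguments are illustrative, not from any package.
set_mlx4_port_type() {
    port_types=$1                                 # e.g. 2 for a single-port card
    conf_file=${2:-/etc/modprobe.d/mlx4.conf}
    if [ -f "$conf_file" ] &&
       grep -Eq '^[[:space:]]*options[[:space:]]+mlx4_core[[:space:]].*port_type_array=' "$conf_file"; then
        echo "port_type_array already set in $conf_file" >&2
        return 0
    fi
    printf 'options mlx4_core port_type_array=%s\n' "$port_types" >> "$conf_file"
}
```

Run as root, `set_mlx4_port_type 2` followed by `dracut --force` mirrors the steps in the note above. For a dual-port card the value would presumably be a comma-separated list such as `2,2`, since the parameter is an array of int, but that case was not tested in this report.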
On one of my other systems, the above workaround didn't err... work. Instead, I had to directly modify /usr/lib/modprobe.d/libmlx4.conf, changing it from: `install mlx4_core /sbin/modprobe --ignore-install mlx4_core $CMDLINE_OPTS && (if [ -f /usr/libexec/mlx4-setup.sh -a -f /etc/rdma/mlx4.conf ]; then /usr/libexec/mlx4-setup.sh < /etc/rdma/mlx4.conf; fi; /sbin/modprobe mlx4_en; if /sbin/modinfo mlx4_ib > /dev/null 2>&1; then /sbin/modprobe mlx4_ib; fi)` to: `install mlx4_core /sbin/modprobe --ignore-install mlx4_core port_type_array=2 $CMDLINE_OPTS && (if [ -f /usr/libexec/mlx4-setup.sh -a -f /etc/rdma/mlx4.conf ]; then /usr/libexec/mlx4-setup.sh < /etc/rdma/mlx4.conf; fi; /sbin/modprobe mlx4_en; if /sbin/modinfo mlx4_ib > /dev/null 2>&1; then /sbin/modprobe mlx4_ib; fi)` i.e. forcibly add "port_type_array=2" to the mlx4_core module loading line. With that in place, and after running dracut + grub2-mkconfig to regenerate things, the card comes up correctly: `# vi /usr/lib/modprobe.d/libmlx4.conf` `# dracut --force` `# grub2-mkconfig -o /boot/grub2/grub.cfg` The libmlx4.conf file already contains a variable called CMDLINE_OPTS, in what looks like the right place to specify options. Haven't found where that variable is set or read from though. The comment at the top of the file says it should be settable in /etc/rdma/mlx4.conf, but reading through the process used... it doesn't seem to actually be used. The same CMDLINE_OPTS variable is used by all of the /usr/lib/modprobe.d/*conf files in the rdma-core package too, which also makes this a bit less clear. Anyway, the above workaround gets the job done. It may need to be re-applied whenever the rdma-core package is updated though. |
|
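Since this edit may have to be redone after each rdma-core update, it can be scripted instead of done by hand in vi. The following is a rough sketch that assumes the install line still contains the exact text `--ignore-install mlx4_core $CMDLINE_OPTS` as quoted above; the function name is made up, and it keeps a `.orig` copy of the file next to the edited one.

```shell
#!/bin/sh
# Sketch: insert port_type_array=<types> into the mlx4_core install line.
# Assumes the line contains "--ignore-install mlx4_core $CMDLINE_OPTS"
# verbatim. The function name and the .orig backup suffix are illustrative.
add_port_type_to_install_line() {
    conf_file=${1:-/usr/lib/modprobe.d/libmlx4.conf}
    port_types=${2:-2}
    # Nothing to do if the option is already present.
    if grep -q 'port_type_array=' "$conf_file"; then
        return 0
    fi
    cp -p "$conf_file" "$conf_file.orig" || return 1
    # \$ in the pattern matches a literal dollar sign; the replacement text
    # keeps $CMDLINE_OPTS in place after the new option.
    sed 's|--ignore-install mlx4_core \$CMDLINE_OPTS|--ignore-install mlx4_core port_type_array='"$port_types"' $CMDLINE_OPTS|' \
        "$conf_file.orig" > "$conf_file"
    # Report failure if the expected text was not found and nothing changed.
    grep -q 'port_type_array=' "$conf_file"
}
```

As with the manual edit, `dracut --force` would still need to be run afterwards so the changed file ends up in the initramfs.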
Latest CentOS 7.5: Linux hv01 3.10.0-862.3.2.el7.x86_64 #1 SMP Mon May 21 23:36:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux Mellanox ConnectX 2: 0000:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] [15b3:6750] (rev b0) The same bug. The above workarounds didn't work... |
|
If you unload the Mellanox kernel modules, then manually modprobe mlx4_core with an appropriate port type... does the adapter then show up correctly? E.g.: `modprobe mlx4_core port_type_array=2` And just to check, what was the behaviour with previous kernels, or is this the first one you've tried it with? Asking because your adapter says it's a "ConnectX EN", which means it's Ethernet only. I thought that could **only** show up as an Ethernet adapter and nothing else. I don't think I've ever owned one though, so am not sure. :) |
|
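The unload-and-reload test suggested above could look roughly like the sketch below. The function name and the "runner" argument are invented for illustration: by default the function only prints the commands, and passing `command` as the second argument makes it actually run them (as root). The module names are the ones mentioned in this report.

```shell
#!/bin/sh
# Sketch: unload the mlx4 modules, then load mlx4_core with an explicit
# port type. Defaults to a dry run that only prints the commands.
reload_mlx4() {
    port_types=${1:-2}
    runner=${2:-echo}     # "echo" prints the commands; "command" runs them
    for mod in mlx4_ib mlx4_en mlx4_core; do
        # Ignore failures here, e.g. when a module is not loaded.
        "$runner" modprobe -r "$mod" || true
    done
    "$runner" modprobe mlx4_core port_type_array="$port_types"
}
```

`reload_mlx4 2` shows what would be run, and `reload_mlx4 2 command` runs it. If mlx4_core cannot be unloaded because something is still using it, the final modprobe will not change anything, so it is worth checking `lsmod` between the two steps.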
My apologies. The card is working properly under CentOS 7.5. The problem was in switch configuration. | |
No worries Aleksey. :) | |
This bug has been closed upstream (no real resolution), so can probably be closed here too. | |
Closed per reporter's request. | |
Date Modified | Username | Field | Change |
---|---|---|---|
2018-01-27 00:54 | justinclift | New Issue | |
2018-01-29 21:38 | arrfab | Note Added: 0031095 | |
2018-01-30 09:04 | tru | Note Added: 0031106 | |
2018-01-30 23:40 | justinclift | Note Added: 0031124 | |
2018-02-06 10:51 | justinclift | Note Added: 0031159 | |
2018-02-18 12:49 | justinclift | Note Added: 0031247 | |
2018-05-29 09:57 | Aleksey | Note Added: 0031938 | |
2018-05-29 10:16 | justinclift | Note Added: 0031939 | |
2018-05-31 07:18 | Aleksey | Note Added: 0031971 | |
2018-05-31 12:12 | justinclift | Note Added: 0031974 | |
2019-03-06 09:46 | justinclift | Note Added: 0033947 | |
2019-03-06 09:56 | jrd | Status | new => closed |
2019-03-06 09:56 | jrd | Resolution | open => not fixable |
2019-03-06 09:56 | jrd | Note Added: 0033948 |