View Issue Details

ID: 0017888
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2021-05-12 17:16
Reporter: anish
Assigned To: (none)
Priority: normal
Severity: block
Reproducibility: always
Status: new
Resolution: open
Platform: PowerEdge R640
OS: CentOS
OS Version: 7.9
Product Version: 7.9.2009
Summary: 0017888: Mellanox MT27710 [ConnectX-4 Lx] NICs unusable after upgrade to 3.10.0-1160.6.1.el7
Description: No traffic is detected on the NIC. It works consistently after rolling back to 3.10.0-1160.2.1.el7. The link seems to go down briefly and then come up again. All other dmesg output matches exactly between the two kernels. The NIC uses the mlx5_core driver.
Additional Information: I did notice these changes in 1160.3.1:

- [netdrv] net/mlx5e: Modify uplink state on interface up/down (Alaa Hleihel) [1733181]
- [netdrv] net/mlx5: E-Switch, Disable esw manager vport correctly (Alaa Hleihel) [1733181]
- [netdrv] net/mlx5: E-Switch, Properly refer to host PF vport as other vport (Alaa Hleihel) [1733181]
Tags: No tags attached.

Activities

ManuelWolfshant

2020-11-25 21:53

manager   ~0037996

CentOS is a rebuild of the sources used to create RHEL and aims to reproduce RHEL bug for bug and feature for feature. Please file a ticket against the kernel package at bugzilla.redhat.com and let them know about the regression. If/when RH fixes it and releases a patched version, CentOS will pick it up automatically.
For easier tracking, please crosslink this bug with the one opened at bugzilla.redhat.com.
anish

2020-11-29 21:56

reporter   ~0038010

@manuelwolfshant I'm not sure how to crosslink, but the Red Hat Bugzilla ID is 1902516.
anish

2020-11-30 20:35

reporter   ~0038014

Addendum: This seems to affect only Mellanox cards with firmware older than 14.20. Version 14.20 and higher work fine.
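
For anyone checking a fleet against this threshold, here is a minimal Python sketch that reads the `firmware-version:` line from `ethtool -i <interface>` output and compares it with 14.20. It assumes the threshold only applies to the 14.x firmware series used by ConnectX-4 Lx (a later note in this thread reports 16.x firmware on ConnectX-5 as affected), so it returns `None` for any other series. The function names and the sample text are hypothetical, not taken from a real host.

```python
import re

# Threshold from the note above: 14.x firmware below 14.20 appeared affected.
AFFECTED_SERIES = 14
FIRST_GOOD_MINOR = 20


def firmware_version(ethtool_output):
    """Return (major, minor) from the firmware-version line, or None if absent."""
    match = re.search(r"^firmware-version:\s*(\d+)\.(\d+)", ethtool_output, re.MULTILINE)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def looks_affected(ethtool_output):
    """True/False for 14.x firmware; None when the threshold does not apply."""
    version = firmware_version(ethtool_output)
    if version is None or version[0] != AFFECTED_SERIES:
        return None
    return version[1] < FIRST_GOOD_MINOR


# Hypothetical `ethtool -i` output, for illustration only.
sample = """driver: mlx5_core
firmware-version: 14.17.2020 (DEL2810000034)
bus-info: 0000:3b:00.0
"""

print(firmware_version(sample), looks_affected(sample))
```

A result of `None` means only that this check cannot say anything about the card, not that the card is safe.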
aletchet

2021-05-12 17:16

reporter   ~0038439

We are seeing exactly the same issue with Mellanox Technologies MT27800 Family [ConnectX-5] cards (Mellanox ConnectX-5 Dual Port 25 GbE SFP OCP3.0 Network Adapter), combined with Dell firmware 16.28.4512 (DEL0000000016) or earlier. Dell does not have a later GA firmware.

Links are up and layer 2 traffic is received (ARP requests, etc.).
In tcpdump captures on the affected host I can see the kernel sending layer 2 responses (ARP replies, etc.), but captures on the other end show that those responses never leave the source server's card.
Packet captures on the destination server do show LLDP packets from this host, but these are sent directly by the card's firmware rather than by the kernel, which confirms that the card itself is functioning.

Workarounds:
Rolling the kernel back to 3.10.0-1160.2.2.el7.x86_64 works without issue.
Setting the firmware to always keep the links up across power cycles also seems to mitigate the issue: "KEEP_ETH_LINK_UP_P1=TRUE" / "KEEP_ETH_LINK_UP_P2=TRUE" (applied via mlxconfig).
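
As a rough illustration of the second workaround, the Python sketch below only builds the mlxconfig command line for both ports; it does not run it. The device identifier `3b:00.0` is a placeholder, the helper name is hypothetical, and the value spelling `TRUE` is copied from the note above (the accepted spelling may depend on the mlxconfig version). Firmware configuration changes made with mlxconfig normally take effect only after a reboot or firmware reset.

```python
def keep_link_up_command(device, ports=(1, 2), value="TRUE"):
    """Build the mlxconfig argv that sets KEEP_ETH_LINK_UP_P<n> for each port."""
    settings = ["KEEP_ETH_LINK_UP_P{}={}".format(port, value) for port in ports]
    return ["mlxconfig", "-d", device, "set"] + settings


# Placeholder device; substitute the PCI address or MST device of the real card.
argv = keep_link_up_command("3b:00.0")
print(" ".join(argv))
```

Printing the command first makes it easy to review before passing the list to something like `subprocess.run` on the affected host.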

Debug:
[host.mellanox:/root]# tcpdump -i em3 -e
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em3, link-type EN10MB (Ethernet), capture size 262144 bytes
08:22:08.293748 04:3f:72:ac:cc:ef (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 60: Request who-has 192.168.100.1 tell 192.168.100.2, length 46
08:22:08.293756 04:3f:72:ac:d3:07 (oui Unknown) > 04:3f:72:ac:cc:ef (oui Unknown), ethertype ARP (0x0806), length 42: Reply 192.168.100.1 is-at 04:3f:72:ac:d3:07 (oui Unknown), length 28
08:22:08.819888 04:3f:72:ac:cc:ef (oui Unknown) > 01:80:c2:00:00:0e (oui Unknown), ethertype LLDP (0x88cc), length 136: LLDP, length 122
08:22:09.295727 04:3f:72:ac:cc:ef (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 60: Request who-has 192.168.100.1 tell 192.168.100.2, length 46
08:22:09.295734 04:3f:72:ac:d3:07 (oui Unknown) > 04:3f:72:ac:cc:ef (oui Unknown), ethertype ARP (0x0806), length 42: Reply 192.168.100.1 is-at 04:3f:72:ac:d3:07 (oui Unknown), length 28
08:22:10.297727 04:3f:72:ac:cc:ef (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 60: Request who-has 192.168.100.1 tell 192.168.100.2, length 46
08:22:10.297731 04:3f:72:ac:d3:07 (oui Unknown) > 04:3f:72:ac:cc:ef (oui Unknown), ethertype ARP (0x0806), length 42: Reply 192.168.100.1 is-at 04:3f:72:ac:d3:07 (oui Unknown), length 28
08:22:12.294749 04:3f:72:ac:cc:ef (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 60: Request who-has 192.168.100.1 tell 192.168.100.2, length 46
08:22:12.294759 04:3f:72:ac:d3:07 (oui Unknown) > 04:3f:72:ac:cc:ef (oui Unknown), ethertype ARP (0x0806), length 42: Reply 192.168.100.1 is-at 04:3f:72:ac:d3:07 (oui Unknown), length 28
08:22:13.295736 04:3f:72:ac:cc:ef (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 60: Request who-has 192.168.100.1 tell 192.168.100.2, length 46
08:22:13.295740 04:3f:72:ac:d3:07 (oui Unknown) > 04:3f:72:ac:cc:ef (oui Unknown), ethertype ARP (0x0806), length 42: Reply 192.168.100.1 is-at 04:3f:72:ac:d3:07 (oui Unknown), length 28

Issue History

Date Modified Username Field Change
2020-11-25 20:13 anish New Issue
2020-11-25 21:53 ManuelWolfshant Note Added: 0037996
2020-11-29 21:56 anish Note Added: 0038010
2020-11-30 20:35 anish Note Added: 0038014
2021-05-12 17:16 aletchet Note Added: 0038439