View Issue Details

ID: 0015193
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2018-09-21 06:25
Reporter: arrfab
Priority: high
Severity: crash
Reproducibility: always
Status: feedback
Resolution: open
Product Version: 7.5.1804
Target Version: (not set)
Fixed in Version: (not set)
Summary: 0015193: IPoIB broken after update to kernel-3.10.0-862.11.6.el7
Description: Some nodes in a cluster with storage exposed over IPoIB (NFS, though that is not relevant to the issue itself) can't reach other nodes after updating the kernel to kernel-3.10.0-862.11.6.el7.
Worth noting that the HBAs are configured as network devices, in connected mode (so not datagram).
It was working fine until the new kernel.

Version-Release number of selected component (if applicable):
kernel-3.10.0-862.11.6.el7



Steps To Reproduce:
How reproducible:
yum update && systemctl reboot

Steps to Reproduce:
1. install new kernel
2. reboot
3. network stack is broken
Additional Information:
Actual results:

No network on the IB device (so IPoIB not working)
Following messages (from dmesg) :
[ 17.292139] ib1: enabling connected mode will cause multicast packet drops
[ 17.292192] ib1: mtu > 4092 will cause multicast packet drops.
[ 17.316611] IPv6: ADDRCONF(NETDEV_UP): ib1: link is not ready
[ 17.318685] IPv6: ADDRCONF(NETDEV_CHANGE): ib1: link becomes ready
[ 118.239565] ib1: failed to modify QP to RTR: -22

Rebooting into kernel-3.10.0-862.9.1.el7.x86_64 without any config change brings the network back up in a functional state.

Expected results:
A functioning network stack on those Mellanox HBAs with the newer kernel (the reboot into that kernel was for the L1TF security issue).

Additional info:

Details about the HBAs:
81:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0)

Tags: No tags attached.

Relationships

has duplicate 0015194 (closed, pgreco): 09:00.0 InfiniBand: (rev 02) fails on 3.10.0-862.11.6
has duplicate 0015198 (closed, TrevorH): Typo bug in linux-3.10.0-862.11.6.el7.x86_64 drivers/infiniband/core/verbs.c
has duplicate 0015270 (closed, TrevorH): NFS over RDM not working after system update.

Activities

arrfab (administrator), 2018-08-18 21:01, ~0032522

Upstream bug report : https://bugzilla.redhat.com/show_bug.cgi?id=1618956

arrfab (administrator), 2018-08-18 21:02, ~0032523

Other people are reporting the same issue (with the SL kernel, same version): https://listserv.fnal.gov/scripts/wa.exe?A2=ind1808&L=scientific-linux-users&F=&S=&P=14187

toracat (manager), 2018-08-19 00:23, ~0032526

A test set of kernel-plus has been built using two patch candidates submitted by @pgreco and uploaded to:

https://people.centos.org/toracat/kernel/7/plus/bug15193/

Please test if you are able.

toracat (manager), 2018-08-19 00:28

pg-test3.patch (1,157 bytes)
From c0a72d7ddcec079cc94493ce8de536246ce65388 Mon Sep 17 00:00:00 2001
From: Pablo Greco <psgreco@gmail.com>
Date: Sat, 18 Aug 2018 19:14:13 -0300
Subject: [PATCH] test 3

---
 include/rdma/ib_verbs.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index fbb654b..1120b8e 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2542,7 +2542,7 @@ static inline int rdma_is_port_valid_nospec(const struct ib_device *device,
 		return 0;
 
 	*port = array_index_nospec(*port - rdma_start_port(device),
-				   rdma_end_port(device) - rdma_start_port(device) + 1);
+				   (unsigned int)rdma_end_port(device) - rdma_start_port(device) + 1);
 	*port += rdma_start_port(device);
 
 	return 1;
@@ -2556,7 +2556,7 @@ static inline int rdma_is_port_valid_nospec_uint(const struct ib_device *device,
 		return 0;
 
 	*port = array_index_nospec(*port - rdma_start_port(device),
-				   rdma_end_port(device) - rdma_start_port(device) + 1);
+				   (unsigned int)rdma_end_port(device) - rdma_start_port(device) + 1);
 	*port += rdma_start_port(device);
 
 	return 1;
-- 
1.8.3.1
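
For readers unfamiliar with array_index_nospec(), which both hunks above touch, the following is a minimal userspace sketch of the clamp it performs. It is modelled on the generic C fallback in the upstream kernel's include/linux/nospec.h (x86 uses an assembly variant, and the RHEL backport may differ). The names index_mask(), clamp_index() and clamp_port() are hypothetical and exist only for this illustration.

```c
#include <limits.h>

#define BITS_PER_LONG (sizeof(long) * CHAR_BIT)

/*
 * All-ones when index < size, zero otherwise, computed without a branch.
 * Relies on arithmetic right shift of a negative long, which is
 * implementation-defined in ISO C but is what GCC and Clang do.
 */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	return ~(long)(index | (size - 1UL - index)) >> (BITS_PER_LONG - 1);
}

/* In-range indexes pass through unchanged; out-of-range ones become 0. */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
	return index & index_mask(index, size);
}

/*
 * Userspace model of the port clamp in the hunks above: rebase the port
 * to a zero-based index, clamp it, then add the start port back.
 */
static unsigned int clamp_port(unsigned int port, unsigned int start,
			       unsigned int end)
{
	unsigned int idx = clamp_index(port - start, end - start + 1);

	return idx + start;
}
```

With start = 1 and end = 2, clamp_port(2, 1, 2) gives 2, while an out-of-range clamp_port(7, 1, 2) collapses to the start port, 1. In the kernel helper the explicit range check runs first, so the clamp only matters under speculative execution.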

pg-infiniband-u8.patch (1,147 bytes)
From c01ef40001f59e7872c9920c35a8942a11d2ef5b Mon Sep 17 00:00:00 2001
From: Pablo Greco <psgreco@gmail.com>
Date: Sat, 18 Aug 2018 18:28:57 -0300
Subject: [PATCH] Infiniband u8

---
 drivers/infiniband/core/nldev.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/nldev.c b/drivers/infiniband/core/nldev.c
index aaa1f43..3dc40c5 100644
--- a/drivers/infiniband/core/nldev.c
+++ b/drivers/infiniband/core/nldev.c
@@ -207,7 +207,8 @@ static int nldev_port_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh)
 	struct ib_device *device;
 	struct sk_buff *msg;
 	u32 index;
-	u8 port;
+	u32 port;
+	unsigned int portaux;
 	int err;
 
 	err = nlmsg_parse(nlh, 0, tb, RDMA_NLDEV_ATTR_MAX - 1,
@@ -223,8 +224,10 @@ static int nldev_port_get_doit(struct sk_buff *skb, struct nlmsghdr *nlh)
 		return -EINVAL;
 
 	port = nla_get_u32(tb[RDMA_NLDEV_ATTR_PORT_INDEX]);
-	if (!rdma_is_port_valid_nospec(device, &port))
+	portaux = port;
+	if (!rdma_is_port_valid_nospec_uint(device, &portaux))
 		return -EINVAL;
+	port = portaux;
 
 	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
 	if (!msg)
-- 
1.8.3.1
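
As a small illustration of the narrowing that this patch removes: nla_get_u32() returns a 32-bit value, and storing it in a u8 keeps only the low 8 bits before the port is validated. The sketch below is userspace C with hypothetical helper names and an assumed two-port device; it is not kernel code and makes no claim about the root cause of this bug.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device with valid ports 1..2 (an assumption for this sketch). */
static bool port_in_range(unsigned int port)
{
	return port >= 1 && port <= 2;
}

/* Models "u8 port = nla_get_u32(...)": the value is truncated to 8 bits
 * before the range check sees it. */
static bool check_via_u8(uint32_t requested)
{
	uint8_t port = (uint8_t)requested;

	return port_in_range(port);
}

/* Models the patched flow: the full 32-bit value is range-checked. */
static bool check_via_u32(uint32_t requested)
{
	return port_in_range(requested);
}
```

A request for port 257 is accepted by check_via_u8() because 257 truncates to 1, whereas check_via_u32() rejects it. Both accept port 1.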

arrfab (administrator), 2018-08-19 06:30, ~0032531

Just to mention that it's now acknowledged by RH : https://access.redhat.com/solutions/3568891

arrfab (administrator), 2018-08-19 07:07, ~0032532

Tested 3.10.0-862.11.6.el7.centos.plus.bug15193.x86_64 but same issue :
ib1: failed to modify QP to RTR: -22 (and so no network)

toracat (manager), 2018-08-19 23:03, ~0032533

A new set of test kernel-plus builds using @pgreco's "test 4" patch is now in:

https://people.centos.org/toracat/kernel/7/plus/bug15193.1/

pg-test4.patch (1,724 bytes)
From bc5015f6b946b6a86856820ad718908bb9d5f947 Mon Sep 17 00:00:00 2001
From: Pablo Greco <psgreco@gmail.com>
Date: Sun, 19 Aug 2018 11:06:19 -0300
Subject: [PATCH] Test 4

---
 include/rdma/ib_verbs.h | 27 +++++++++------------------
 1 file changed, 9 insertions(+), 18 deletions(-)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index fbb654b..5bfaf50 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -2534,32 +2534,23 @@ static inline u8 rdma_end_port(const struct ib_device *device)
 	return rdma_cap_ib_switch(device) ? 0 : device->phys_port_cnt;
 }
 
+static inline int rdma_is_port_valid(const struct ib_device *device,
+				     unsigned int port)
+{
+	return (port >= rdma_start_port(device) &&
+		port <= rdma_end_port(device));
+}
+
 static inline int rdma_is_port_valid_nospec(const struct ib_device *device,
 					    u8 *port)
 {
-	if (*port < rdma_start_port(device) ||
-	    *port > rdma_end_port(device))
-		return 0;
-
-	*port = array_index_nospec(*port - rdma_start_port(device),
-				   rdma_end_port(device) - rdma_start_port(device) + 1);
-	*port += rdma_start_port(device);
-
-	return 1;
+	return rdma_is_port_valid(device, *port);
 }
 
 static inline int rdma_is_port_valid_nospec_uint(const struct ib_device *device,
 					    unsigned int *port)
 {
-	if (*port < rdma_start_port(device) ||
-	    *port > rdma_end_port(device))
-		return 0;
-
-	*port = array_index_nospec(*port - rdma_start_port(device),
-				   rdma_end_port(device) - rdma_start_port(device) + 1);
-	*port += rdma_start_port(device);
-
-	return 1;
+	return rdma_is_port_valid(device, *port);
 }
 
 static inline bool rdma_protocol_ib(const struct ib_device *device, u8 port_num)
-- 
1.8.3.1

arrfab (administrator), 2018-08-20 07:37, ~0032544

Tested and same issue

arrfab (administrator), 2018-08-20 11:53, ~0032548

The following patch, courtesy of @pgreco, works fine:

omg.patch (786 bytes)
From 6353587a7efa488a4064f3661cf64bd4d74eaa73 Mon Sep 17 00:00:00 2001
From: Pablo Greco <psgreco@gmail.com>
Date: Mon, 20 Aug 2018 06:39:55 -0300
Subject: [PATCH] OMG!!!!

---
 drivers/infiniband/core/verbs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index debe718..c080eb2 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1232,7 +1232,7 @@ int ib_resolve_eth_dmac(struct ib_device *device,
 	int           ret = 0;
 	struct ib_global_route *grh;
 
-	if (!rdma_is_port_valid_nospec(device, &ah_attr->port_num));
+	if (!rdma_is_port_valid_nospec(device, &ah_attr->port_num))
 		return -EINVAL;
 
 	if (ah_attr->type != RDMA_AH_ATTR_TYPE_ROCE)
-- 
1.8.3.1
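
The one-character change above removes a stray semicolon after the if condition. In C that semicolon is an empty statement that forms the entire body of the if, so the indented return -EINVAL runs unconditionally. A minimal userspace sketch of the effect follows; port_is_valid(), resolve_buggy() and resolve_fixed() are hypothetical stand-ins, and a two-port device is assumed.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for rdma_is_port_valid_nospec(): ports 1..2 are valid. */
static bool port_is_valid(unsigned int port)
{
	return port >= 1 && port <= 2;
}

/* Mirrors the regressed line: the ';' terminates the if statement, so the
 * return on the next line is executed for every port. */
static int resolve_buggy(unsigned int port)
{
	if (!port_is_valid(port));
		return -EINVAL;

	return 0; /* never reached */
}

/* Mirrors the patched line: the return is guarded by the if. */
static int resolve_fixed(unsigned int port)
{
	if (!port_is_valid(port))
		return -EINVAL;

	return 0;
}
```

Calling resolve_buggy(1) returns -EINVAL even though port 1 is valid, while resolve_fixed(1) returns 0. On Linux EINVAL is 22, which is consistent with the -22 in the dmesg line "failed to modify QP to RTR: -22" from the report. GCC's -Wempty-body (enabled by -Wextra) and -Wmisleading-indentation (part of -Wall since GCC 6) are intended to catch this kind of mistake.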

arrfab (administrator), 2018-08-20 12:43, ~0032550

The following build has that patch rolled in:
https://buildlogs.centos.org/c7.1804.u.x86_64/kernel/20180820114938/3.10.0-862.11.6.el7.bug15193.x86_64/

Reminder : that kernel (https://buildlogs.centos.org/c7.1804.u.x86_64/kernel/20180820114938/3.10.0-862.11.6.el7.bug15193.x86_64/kernel-3.10.0-862.11.6.el7.bug15193.x86_64.rpm) :
- isn't rpm-signed (so no gpg)
- isn't secureboot ready (not signed)

toracat (manager), 2018-08-21 04:23, ~0032559

Also, the latest version of the centosplus kernel (kernel-plus-3.10.0-862.11.6.el7.centos.plus.1.x86_64.rpm) has the patch rolled in.

arrfab (administrator), 2018-08-28 08:14, ~0032609

Asking for feedback (positive or negative); I will close this after some time (it is working on my side).

gvas (reporter), 2018-08-28 12:51, ~0032615

seems to be ok !

Irek (reporter), 2018-09-19 05:30, ~0032757

it works for me too

Issue History

Date Modified Username Field Change
2018-08-18 21:01 arrfab New Issue
2018-08-18 21:01 arrfab Note Added: 0032522
2018-08-18 21:02 arrfab Note Added: 0032523
2018-08-19 00:23 toracat Note Added: 0032526
2018-08-19 00:28 toracat File Added: pg-test3.patch
2018-08-19 00:28 toracat File Added: pg-infiniband-u8.patch
2018-08-19 06:30 arrfab Note Added: 0032531
2018-08-19 07:07 arrfab Note Added: 0032532
2018-08-19 07:07 arrfab Status new => confirmed
2018-08-19 23:03 toracat File Added: pg-test4.patch
2018-08-19 23:03 toracat Note Added: 0032533
2018-08-20 07:37 arrfab Note Added: 0032544
2018-08-20 08:24 pgreco Relationship added has duplicate 0015194
2018-08-20 11:53 arrfab File Added: omg.patch
2018-08-20 11:53 arrfab Note Added: 0032548
2018-08-20 12:43 arrfab Note Added: 0032550
2018-08-21 04:23 toracat Note Added: 0032559
2018-08-21 12:05 TrevorH Relationship added has duplicate 0015198
2018-08-28 08:14 arrfab Status confirmed => feedback
2018-08-28 08:14 arrfab Note Added: 0032609
2018-08-28 12:51 gvas Note Added: 0032615
2018-09-13 05:43 TrevorH Relationship added has duplicate 0015270
2018-09-19 05:30 Irek Note Added: 0032757