View Issue Details

ID: 0010767
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2016-08-04 14:47
Reporter: isuzdal
Priority: normal
Severity: minor
Reproducibility: random
Status: resolved
Resolution: fixed
Platform: x86_64
OS: CentOS
OS Version: 7.2
Product Version: 7.2.1511
Target Version:
Fixed in Version:
Summary: 0010767: e1000: Tx Unit Hang
Description: The e1000 NIC sporadically hangs, which makes the network temporarily unavailable.

I have seen this issue both in a virtual environment and on real hardware.
It is already fixed in the upstream kernel [0] [1].
I have patches against the current CentOS kernel, if they are needed.
Could you apply these fixes, please?

[0] https://github.com/torvalds/linux/commit/a4605fef7132f19afded76ee025c957558271a7d
[1] https://github.com/torvalds/linux/commit/847a1d6796c767f8b697ead60997b847a84b897b
Additional Information:
# ethtool -i enp0s3
driver: e1000
version: 7.3.21-k8-NAPI
firmware-version:
bus-info: 0000:00:03.0

# uname -a
Linux host.domain.tld 3.10.0-327.13.1.el7.x86_64 #1 SMP Thu Mar 31 16:04:38 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
Tags: No tags attached.
abrt_hash:
URL:

Activities

isuzdal

2016-04-26 09:36

reporter  

e1000.dmesg (3,476 bytes)
toracat

2016-04-26 11:29

manager   ~0026371

While we cannot modify the distro kernel, we can apply the patches to the centosplus kernel.

In the meantime, could you file a bug report upstream at http://bugzilla.redhat.com so that it gets fixed in the RHEL kernel? CentOS kernels will then inherit the patches.
isuzdal

2016-04-26 16:16

reporter   ~0026374

JFYI: https://bugzilla.redhat.com/show_bug.cgi?id=1330516
toracat

2016-04-26 16:30

manager  

centos-linux-3.10-e1000-Tx-fix-1-bug10767.patch (2,482 bytes)
centosplus patch [bug#10767-1]

commit 847a1d6796c767f8b697ead60997b847a84b897b
Author: Alexander Duyck <aduyck@mirantis.com>
Date:   Wed Mar 2 16:16:01 2016 -0500

    e1000: Do not overestimate descriptor counts in Tx pre-check

    The current code path is capable of grossly overestimating the number of
    descriptors needed to transmit a new frame.  This specifically occurs if
    the skb contains a number of 4K pages.  The issue is that the logic for
    determining the descriptors needed is ((S) >> (X)) + 1.  When X is 12 it
    means that we were indicating that we required 2 descriptors for each 4K
    page when we only needed one.

    This change corrects this by instead adding (1 << (X)) - 1 to the S value
    instead of adding 1 after the fact.  This way we get an accurate descriptor
    needed count as we are essentially doing a DIV_ROUNDUP().

    Reported-by: Ivan Suzdal <isuzdal@mirantis.com>
    Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
    Tested-by: Aaron Brown <aaron.f.brown@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

    Applied-by: Akemi Yagi <toracat@centos.org>

--- a/drivers/net/ethernet/intel/e1000/e1000_main.c	2016-02-29 09:35:49.000000000 -0800
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c	2016-04-26 08:54:24.879464164 -0700
@@ -3246,12 +3246,29 @@ static netdev_tx_t e1000_xmit_frame(stru
 			     nr_frags, mss);
 
 	if (count) {
+		/* The descriptors needed is higher than other Intel drivers
+		 * due to a number of workarounds.  The breakdown is below:
+		 * Data descriptors: MAX_SKB_FRAGS + 1
+		 * Context Descriptor: 1
+		 * Keep head from touching tail: 2
+		 * Workarounds: 3
+		 */
+		int desc_needed = MAX_SKB_FRAGS + 7;
+
 		netdev_sent_queue(netdev, skb->len);
 		skb_tx_timestamp(skb);
 
 		e1000_tx_queue(adapter, tx_ring, tx_flags, count);
+
+		/* 82544 potentially requires twice as many data descriptors
+		 * in order to guarantee buffers don't end on evenly-aligned
+		 * dwords
+		 */
+		if (adapter->pcix_82544)
+			desc_needed += MAX_SKB_FRAGS + 1;
+
 		/* Make sure there is space in the ring for the next send. */
-		e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
+		e1000_maybe_stop_tx(netdev, tx_ring, desc_needed);
 
 		if (!skb->xmit_more ||
 		    netif_xmit_stopped(netdev_get_tx_queue(netdev, 0))) {
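
The descriptor budget listed in the patch comment can be worked through with a small sketch. This is only an illustration, not driver code: `EXAMPLE_MAX_SKB_FRAGS` and `e1000_desc_needed_sketch()` are hypothetical names, and the value 17 assumes a 4K-page build where MAX_SKB_FRAGS evaluates to 65536 / PAGE_SIZE + 1.

```c
#include <stdbool.h>

/* Assumption: MAX_SKB_FRAGS is 17 on a 4K-page build (65536 / 4096 + 1). */
#define EXAMPLE_MAX_SKB_FRAGS 17

/*
 * Hypothetical helper mirroring the arithmetic in the patch:
 *   data descriptors (MAX_SKB_FRAGS + 1) + context (1)
 *   + head/tail gap (2) + workarounds (3) = MAX_SKB_FRAGS + 7,
 * plus another MAX_SKB_FRAGS + 1 data descriptors when the
 * 82544 PCI-X workaround flag is set.
 */
static inline int e1000_desc_needed_sketch(bool pcix_82544)
{
	int desc_needed = EXAMPLE_MAX_SKB_FRAGS + 7;

	if (pcix_82544)
		desc_needed += EXAMPLE_MAX_SKB_FRAGS + 1;

	return desc_needed;
}
```

Under that assumption, the reserve checked before the next send grows from MAX_SKB_FRAGS + 2 = 19 descriptors to 24, or to 42 when `pcix_82544` is set.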
toracat

2016-04-26 16:31

manager  

centos-linux-3.10-e1000-Tx-fix-2-bug10767.patch (1,785 bytes)
centosplus patch [bug#10767-2]

commit 847a1d6796c767f8b697ead60997b847a84b897b
Author: Alexander Duyck <aduyck@mirantis.com>
Date:   Wed Mar 2 16:16:01 2016 -0500

    e1000: Do not overestimate descriptor counts in Tx pre-check

    The current code path is capable of grossly overestimating the number of
    descriptors needed to transmit a new frame.  This specifically occurs if
    the skb contains a number of 4K pages.  The issue is that the logic for
    determining the descriptors needed is ((S) >> (X)) + 1.  When X is 12 it
    means that we were indicating that we required 2 descriptors for each 4K
    page when we only needed one.

    This change corrects this by instead adding (1 << (X)) - 1 to the S value
    instead of adding 1 after the fact.  This way we get an accurate descriptor
    needed count as we are essentially doing a DIV_ROUNDUP().

    Reported-by: Ivan Suzdal <isuzdal@mirantis.com>
    Signed-off-by: Alexander Duyck <aduyck@mirantis.com>
    Tested-by: Aaron Brown <aaron.f.brown@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

    Applied-by: Akemi Yagi <toracat@centos.org>

--- a/drivers/net/ethernet/intel/e1000/e1000_main.c	2016-04-26 08:54:24.879464164 -0700
+++ b/drivers/net/ethernet/intel/e1000/e1000_main.c	2016-04-26 08:57:51.432765137 -0700
@@ -3088,7 +3088,7 @@ static int e1000_maybe_stop_tx(struct ne
 	return __e1000_maybe_stop_tx(netdev, size);
 }
 
-#define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
+#define TXD_USE_COUNT(S, X) (((S) + ((1 << (X)) - 1)) >> (X))
 static netdev_tx_t e1000_xmit_frame(struct sk_buff *skb,
 				    struct net_device *netdev)
 {
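
The effect of the TXD_USE_COUNT change can be checked with a standalone sketch. The two functions below restate the removed and added macro bodies as ordinary C; the names `txd_use_count_old()` and `txd_use_count_new()` are hypothetical and do not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Removed macro body: ((S) >> (X)) + 1. Always adds one extra descriptor. */
static inline uint32_t txd_use_count_old(uint32_t size, unsigned int shift)
{
	assert(shift < 32);
	return (size >> shift) + 1;
}

/* Added macro body: (S + (1 << X) - 1) >> X, i.e. size / 2^X rounded up. */
static inline uint32_t txd_use_count_new(uint32_t size, unsigned int shift)
{
	assert(shift < 32);
	return (size + ((1u << shift) - 1)) >> shift;
}
```

With X = 12, a 4096-byte buffer yields 2 descriptors from the old expression and 1 from the new one, which is the overestimate the commit message describes. For a 4097-byte buffer both expressions yield 2.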
toracat

2016-04-26 16:34

manager   ~0026375

Thanks. Because the upstream bug report is not open to the public, please update the info here when there is any progress. Also please add this CentOS bug number to the external link.

Two patch files uploaded. Will be added to the next kernel-plus update.
toracat

2016-08-03 15:41

manager   ~0027193

The patch is now in kernel-3.10.0-327.28.2.el7. Therefore it has been removed from the plus kernel.

Closing as resolved.

Issue History

Date Modified | Username | Field | Change
2016-04-26 09:36 | isuzdal | New Issue |
2016-04-26 09:36 | isuzdal | File Added: e1000.dmesg |
2016-04-26 11:29 | toracat | Note Added: 0026371 |
2016-04-26 11:38 | toracat | Status | new => assigned
2016-04-26 16:16 | isuzdal | Note Added: 0026374 |
2016-04-26 16:30 | toracat | File Added: centos-linux-3.10-e1000-Tx-fix-1-bug10767.patch |
2016-04-26 16:31 | toracat | File Added: centos-linux-3.10-e1000-Tx-fix-2-bug10767.patch |
2016-04-26 16:34 | toracat | Note Added: 0026375 |
2016-08-03 15:41 | toracat | Note Added: 0027193 |
2016-08-04 14:47 | toracat | Status | assigned => resolved
2016-08-04 14:47 | toracat | Resolution | open => fixed