View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0003832||CentOS-5||kernel||public||2009-09-11 10:17||2010-05-07 09:45|
|Summary||0003832: bnx2 stops transmitting data|
|Description||We have a large cluster of IBM HS22 blades with Broadcom NetXtreme II BCM5709S Gigabit Ethernet cards running CentOS 5.3. After some period of time some systems will stop transmitting data.|
We see no particular errors reported in /var/log/messages or on the console, other than messages similar to:
Aug 24 21:12:19 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 24 21:12:19 kernel: bnx2: eth0 NIC SerDes Link is Down
and we get messages such as "NFS server not responding" relating to the lack of network connectivity.
We can get the network back by rebooting the system or by unloading and reloading the bnx2 driver (rmmod bnx2 / modprobe bnx2); the system then apparently resumes normally.
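As a stop-gap, the driver reload described above could be scripted; a minimal sketch, assuming the interface is eth0 and matching the watchdog message quoted earlier (both are assumptions for illustration, not a tested procedure):

```shell
#!/bin/sh
# Stop-gap sketch: if the transmit watchdog has fired, bounce the
# bnx2 driver rather than rebooting the whole node.
# (eth0 and the log pattern are assumptions based on this report.)
if dmesg | grep -q 'NETDEV WATCHDOG: .*: transmit timed out'; then
    rmmod bnx2 && modprobe bnx2
    sleep 5                     # give the SerDes link time to renegotiate
    ip link show eth0           # confirm the interface came back
fi
```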
|Additional Information||The NICs have the latest firmware available from IBM installed (4.6.7 NCSI 1.0.6)|
We do not believe that there is anything wrong with the network infrastructure (cabling, switches), as this has been exhaustively checked.
We are not using jumbo frames or similar, so we are not seeing any of the bugs with bnx2 related to this.
We have now tried the stock CentOS 5.3 kernel (bnx2 driver version 1.7.9-1), kernel 2.6.18-128.4.1.el5 (bnx2 driver version 1.7.9-2), plus the 1.8.2b bnx2 driver downloaded from IBM's website (with both of the above kernels).
We note that the 5.4 kernel updates the bnx2 driver to version 1.9.3; as CentOS 5.4 is not out yet we haven't tested this, though we may install RHEL 5.4 if necessary.
Both IBM and Broadcom updated their web sites last week with the 1.9.20b version of the driver. We will investigate these if we get the chance; unfortunately, as we have not been able to reliably reproduce this problem, the operations manager is not keen on making yet more changes to the production systems without some assurance that it will actually fix the problem.
We have forced crash dumps of various systems whilst they are hung, but these are rather large (minimum of 2GB in size), so I won't attempt to upload them.
All the above is rather fuzzy. What additional information is required to correlate this problem with a known fix or to indicate it is a new bug?
|Tags||No tags attached.|
There seem to be some problems with the bnx2 driver and jumbo frames.
What is the MTU on your cards?
|See 0003672 and https://bugzilla.redhat.com/show_bug.cgi?id=482747|
|We're using the standard MTU (1500) - so far as I'm aware we haven't been doing any significant network tweaking for this cluster.|
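For anyone wanting to double-check, the MTU in use is visible directly in sysfs (interface names vary per host; the bnx2 NICs should show 1500 if no jumbo-frame tuning has been done):

```shell
# Print the MTU of every network interface on the box.
for dev in /sys/class/net/*; do
    printf '%s mtu=%s\n' "$(basename "$dev")" "$(cat "$dev/mtu")"
done
```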
I have seen a similar issue on Dell servers; however, there was no error message and the network returned to an operational state after some time (a few seconds up to half an hour).
The solution was to load the bnx2 module with the following line in /etc/modprobe.conf:
options bnx2 disable_msi=1,1,1,1
If you still have this issue, please let me know if it helped.
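For what it's worth, whether the workaround took effect can be checked in /proc/interrupts after reloading the driver: with disable_msi set, the bnx2 NICs should appear on IO-APIC lines rather than PCI-MSI lines. A small sketch of that check (the sample data and interface names below are illustrative, not from a real host):

```shell
# After adding "options bnx2 disable_msi=1,1,1,1" to /etc/modprobe.conf
# and reloading the driver (rmmod bnx2 / modprobe bnx2), no bnx2
# interface should appear on a PCI-MSI line in /proc/interrupts.
# Illustrated here on sample data; on a live host, read the real file.
sample='
 66:  1234567   IO-APIC-level  eth0
 74:   890123   PCI-MSI  eth1
'
msi_ifaces=$(echo "$sample" | awk '/PCI-MSI/ {print $NF}')
echo "interfaces still using MSI: ${msi_ifaces:-none}"
```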
Thanks for the update. I'll see if the admins on the cluster concerned can apply this.
I have seen a similar report in an upstream bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=520888).
We've now escalated this to upstream (we've installed RHEL 5.4 on a subset of nodes to replicate the problem). I believe there is a bugzilla specific to our call but it's not open to general viewing :-(
Running on a Dell R610 here.
Kernel 2.6.18-164.10.1.el5.centos.plus #1 SMP Fri Jan 8 16:47:55 EST 2010 x86_64 x86_64 x86_64 GNU/Linux
description: Broadcom NetXtreme II BCM5706/5708/5709/5716 Driver
author: Michael Chan <firstname.lastname@example.org>
vermagic: 2.6.18-164.10.1.el5.centos.plus SMP mod_unload gcc-4.1
parm: disable_msi:Disable Message Signaled Interrupt (MSI) (int)
parm: enable_entropy:Allow bnx2 to populate the /dev/random entropy pool (int)
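Once the module is loaded, the parameter values actually in effect (including disable_msi) can be read back from sysfs; a sketch (the /sys paths exist only while bnx2 is loaded):

```shell
#!/bin/sh
# Read back the bnx2 module parameters currently in effect.
if [ -d /sys/module/bnx2/parameters ]; then
    for p in /sys/module/bnx2/parameters/*; do
        printf '%s=%s\n' "$(basename "$p")" "$(cat "$p")"
    done
else
    echo "bnx2 not loaded"
fi
```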
Experiencing the same result as others have reported. However, the interesting parts are:
1. It happens to all NICs at the same time.
2. Some protocols keep working, some stop completely, and others keep only existing sessions.
3. On occasion the box will return to serviceability on its own.
4. Workload intensity doesn't seem to affect it, but changes in load level could. We have identical systems with the same install base being used to push all four NICs to maximum throughput for the stress tests we run on our product; those keep running. But a lower-throughput box (a build server) will fail randomly, most frequently after a long idle period followed by a sudden burst of activity.
Key services running (eth0 and eth1):
DHCP, TFTP, SSHD, Webmin, NFS (server and client), NTP. I've now upped the kernel logging level and will try the MSI workaround suggested above.
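For reference, the kernel console log level mentioned above can be raised so the next hang captures as much detail as possible (8 logs everything, including debug messages):

```shell
# Raise the console log level to maximum (requires root; 8 = debug).
dmesg -n 8 2>/dev/null || true
# Confirm the current settings; the first field is the console loglevel.
cat /proc/sys/kernel/printk
```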
|I've had the same issue on over 50 Dell R410s; we've disabled MSI at one datacenter and those servers are stable. We've had one instance where three of our databases went offline simultaneously, but usually it's sporadic. All of our NICs are bonded, which causes 30 to 50 percent packet loss.|
With thanks to the Dell Linux Engineering team, this has been root caused, and a workaround (patch to the bnx2 driver) has been posted to the upstream netdev mailing list and is queued for inclusion in a future upstream kernel.
|upstream patch http://bit.ly/info/axXbpa|
This now appears to have been fixed upstream: https://rhn.redhat.com/errata/RHSA-2010-0398.html.
I haven't done any testing to verify whether this works, but I can confirm that the patch reported by mdomsch is included.
|2009-09-11 10:17||markl||New Issue|
||Note Added: 0009903|
||Note Added: 0009904|
||Status||new => feedback|
|2009-09-11 10:45||markl||Note Added: 0009905|
|2009-12-09 16:24||davidk||Note Added: 0010485|
|2009-12-10 10:49||markl||Note Added: 0010494|
|2010-02-03 09:50||linuxrebel||Note Added: 0010927|
|2010-02-20 21:11||jawbrkr||Note Added: 0011023|
|2010-04-28 16:36||mdomsch||Note Added: 0011179|
|2010-04-28 16:38||mdomsch||Note Added: 0011180|
|2010-04-28 16:38||mdomsch||Status||feedback => acknowledged|
|2010-05-07 09:45||markl||Note Added: 0011232|