View Issue Details

ID: 0003832
Project: CentOS-5
Category: kernel
View Status: public
Last Update: 2010-05-07 09:45
Reporter: markl
Assigned To:
Priority: normal
Severity: major
Reproducibility: random
Status: acknowledged
Resolution: open
Product Version: 5.3
Summary: 0003832: bnx2 stops transmitting data
Description: We have a large cluster of IBM HS22 blades with Broadcom NetXtreme II BCM5709S Gigabit Ethernet cards running CentOS 5.3. After some period of time, some systems will stop transmitting data.

We see no particular errors reported in /var/log/messages or on the console, other than messages similar to:

Aug 24 21:12:19 kernel: NETDEV WATCHDOG: eth0: transmit timed out
Aug 24 21:12:19 kernel: bnx2: eth0 NIC SerDes Link is Down

and we get messages such as "NFS server not responding" relating to the lack of network connectivity.

We can get the network back by rebooting the system or by unloading and reloading the bnx2 driver (rmmod bnx2; modprobe bnx2); the system then apparently resumes normal operation.
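
For reference, the reload boils down to something like the following (eth0 is just an example interface name; the ifdown/ifup steps may or may not be needed depending on the setup):

# take the interface down, reload the driver, bring it back up
ifdown eth0
rmmod bnx2
modprobe bnx2
ifup eth0
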
Additional Information: The NICs have the latest firmware available from IBM installed (4.6.7, NCSI 1.0.6).

We do not believe there is anything wrong with the network infrastructure (cabling, switches), as it has been exhaustively checked.

We are not using jumbo frames or anything similar, so we are not hitting any of the known bnx2 bugs related to them.

We have now tried the stock CentOS 5.3 kernel (bnx2 driver version 1.7.9-1), kernel 2.6.18-128.4.1.el5 (bnx2 driver version 1.7.9-2), plus the 1.8.2b bnx2 driver downloaded from IBM's website (with both of the above kernels).

We note that the 5.4 kernel updates the bnx2 driver to version 1.9.3; as CentOS 5.4 is not out yet we haven't tested this, though we may install RHEL 5.4 if necessary.

Both IBM and Broadcom updated their web sites last week with the 1.9.20b version of the driver. We will investigate this if we get the chance; unfortunately, as we have not been able to reliably reproduce the problem, the operations manager is not keen on making yet more changes to the production systems without some confidence that it will actually fix the problem.

We have forced crash dumps of various systems whilst they were hung, but these are rather large (a minimum of 2GB each), so I won't attempt to upload them.
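
For reference, one way to force such a dump (a sketch, assuming kdump is already configured and a console session is still responsive while the NIC is hung) is:

echo 1 > /proc/sys/kernel/sysrq    # make sure SysRq is enabled
echo c > /proc/sysrq-trigger       # force a crash so kdump writes a vmcore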

All the above is rather fuzzy. What additional information is required to correlate this problem with a known fix or to indicate it is a new bug?
Tags: No tags attached.

Activities

user430

2009-09-11 10:32

  ~0009903

There seem to be some problems with the bnx2 driver and jumbo frames.

What is the MTU on your cards?
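
For example (assuming the interface is eth0), either of the following will show the current MTU:

ip link show eth0
cat /sys/class/net/eth0/mtu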

user430

2009-09-11 10:33

  ~0009904

See 0003672 and https://bugzilla.redhat.com/show_bug.cgi?id=482747

markl

2009-09-11 10:45

reporter   ~0009905

We're using the standard MTU (1500); as far as I'm aware, we haven't done any significant network tweaking for this cluster.

davidk

2009-12-09 16:24

reporter   ~0010485

I have seen a similar issue on Dell servers; however, there was no error message and the network returned to an operational state after some time (a few seconds up to half an hour).

The solution was to load the bnx2 module with the following line in /etc/modprobe.conf:
options bnx2 disable_msi=1,1,1,1
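
To apply it without a full reboot, something like the following should work (a sketch; eth0 stands in for whichever interfaces use bnx2, and they need to be taken down briefly):

echo "options bnx2 disable_msi=1,1,1,1" >> /etc/modprobe.conf
ifdown eth0; rmmod bnx2; modprobe bnx2; ifup eth0

If the option took effect, the bnx2 interrupts in /proc/interrupts should no longer show up as PCI-MSI.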

If you still have this issue, please let me know if it helped.

markl

2009-12-10 10:49

reporter   ~0010494

Thanks for the update. I'll see if the admins on the cluster concerned can apply this.

I have seen a similar issue reported in an upstream bugzilla (https://bugzilla.redhat.com/show_bug.cgi?id=520888).

We've now escalated this upstream (we've installed RHEL 5.4 on a subset of nodes to replicate the problem). I believe there is a bugzilla specific to our call, but it's not open to general viewing :-(

linuxrebel

2010-02-03 09:50

reporter   ~0010927

Running on a Dell R610 here.

CentOS 5.4

Kernel 2.6.18-164.10.1.el5.centos.plus #1 SMP Fri Jan 8 16:47:55 EST 2010 x86_64 x86_64 x86_64 GNU/Linux

modinfo:

filename: /lib/modules/2.6.18-164.10.1.el5.centos.plus/kernel/drivers/net/bnx2.ko
version: 1.9.3
license: GPL
description: Broadcom NetXtreme II BCM5706/5708/5709/5716 Driver
author: Michael Chan <mchan@broadcom.com>
srcversion: 1040A42F87B8BE8A019736C
alias: pci:v000014E4d0000163Csv*sd*bc*sc*i*
alias: pci:v000014E4d0000163Bsv*sd*bc*sc*i*
alias: pci:v000014E4d0000163Asv*sd*bc*sc*i*
alias: pci:v000014E4d00001639sv*sd*bc*sc*i*
alias: pci:v000014E4d000016ACsv*sd*bc*sc*i*
alias: pci:v000014E4d000016AAsv*sd*bc*sc*i*
alias: pci:v000014E4d000016AAsv0000103Csd00003102bc*sc*i*
alias: pci:v000014E4d0000164Csv*sd*bc*sc*i*
alias: pci:v000014E4d0000164Asv*sd*bc*sc*i*
alias: pci:v000014E4d0000164Asv0000103Csd00003106bc*sc*i*
alias: pci:v000014E4d0000164Asv0000103Csd00003101bc*sc*i*
depends:
vermagic: 2.6.18-164.10.1.el5.centos.plus SMP mod_unload gcc-4.1
parm: disable_msi:Disable Message Signaled Interrupt (MSI) (int)
parm: enable_entropy:Allow bnx2 to populate the /dev/random entropy pool (int)
module_sig: 883f3504b47af9bd3b84a368dd51f2112b6b90a0ed1bac15e1b94720602336594dc65775db83c460991575cc8694cf9c03aca6e623e0950281e5094

Experiencing the same result as others have reported. However, the interesting parts are:

1. It happens to all NICs at the same time.
2. Some protocols work; some completely stop; others keep existing sessions alive.
3. On occasion the box will return to serviceability on its own.
4. Workload intensity doesn't affect it, although it seems that changes in load level could. We have identical systems and install bases being used to push all 4 NICs to maximum throughput for the stress tests we run on our product; those keep running. But a lower-throughput box (a build server) will fail randomly, most frequently after a long idle period followed by a sudden call to action.

Key services running (eth0 and eth1): DHCP, TFTP, SSHD, Webmin, NFS (server and client), NTP.

I've now upped the logging level for the kernel and will try the MSI workaround suggested above.
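
For reference, the console log level can be raised so that driver messages are not filtered out, e.g.:

dmesg -n 7
# or equivalently:
echo 7 > /proc/sys/kernel/printk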

jawbrkr

2010-02-20 21:11

reporter   ~0011023

I've had the same issue on over 50 Dell R410s. We've disabled MSI at one datacenter and the servers are stable. We've had one instance where three of our databases went offline simultaneously, but usually it's sporadic. All of our NICs are bonded, which causes 30 to 50 percent packet loss.

mdomsch

2010-04-28 16:36

reporter   ~0011179

With thanks to the Dell Linux Engineering team, this has been root-caused, and a workaround (a patch to the bnx2 driver) has been posted to the upstream netdev mailing list and is queued for inclusion in a future upstream kernel.

http://bit.ly/info/axXbpa

mdomsch

2010-04-28 16:38

reporter   ~0011180

Upstream patch: http://bit.ly/info/axXbpa

markl

2010-05-07 09:45

reporter   ~0011232

This now appears to have been fixed upstream: https://rhn.redhat.com/errata/RHSA-2010-0398.html.

I haven't done any testing to verify whether this works, but I can confirm that the patch reported by mdomsch is included.

Issue History

Date Modified Username Field Change
2009-09-11 10:17 markl New Issue
2009-09-11 10:32 user430 Note Added: 0009903
2009-09-11 10:33 user430 Note Added: 0009904
2009-09-11 10:33 user430 Status new => feedback
2009-09-11 10:45 markl Note Added: 0009905
2009-12-09 16:24 davidk Note Added: 0010485
2009-12-10 10:49 markl Note Added: 0010494
2010-02-03 09:50 linuxrebel Note Added: 0010927
2010-02-20 21:11 jawbrkr Note Added: 0011023
2010-04-28 16:36 mdomsch Note Added: 0011179
2010-04-28 16:38 mdomsch Note Added: 0011180
2010-04-28 16:38 mdomsch Status feedback => acknowledged
2010-05-07 09:45 markl Note Added: 0011232