View Issue Details

ID: 0005614
Project: CentOS-6
Category: kernel
View Status: public
Last Update: 2013-02-06 22:28
Reporter: rsandu
Priority: normal
Severity: minor
Reproducibility: have not tried
Status: new
Resolution: open
Platform: x86_64
OS: CentOS
OS Version: 6.2
Product Version: 6.2
Target Version:
Fixed in Version:
Summary: 0005614: Networking: Intel NIC locks machine after a while, requiring cold reboot
Description:

Hello,

One of my routers runs (stock) CentOS 6.2 x86_64 and has an Intel NIC inside (Intel Corporation 82574L Gigabit Network Connection [8086:10d3]), managed by the e1000e kernel module.

After a while, the system locks completely, requiring a cold reboot.


Looking in /var/log/messages, I get (excerpt):
Mar 21 11:24:37 example kernel: e1000e: Intel(R) PRO/1000 Network Driver - 1.4.4-k
Mar 21 11:24:37 example kernel: e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
Mar 21 11:24:37 example kernel: e1000e 0000:01:00.0: Disabling ASPM L0s
Mar 21 11:24:37 example kernel: e1000e 0000:01:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Mar 21 11:24:37 example kernel: e1000e 0000:01:00.0: eth0: (PCI Express:2.5GT/s:Width x1) 00:1b:21:d4:7b:84
Mar 21 11:24:37 example kernel: e1000e 0000:01:00.0: eth0: Intel(R) PRO/1000 Network Connection
Mar 21 11:24:37 example kernel: e1000e 0000:01:00.0: eth0: MAC: 3, PHY: 8, PBA No: E46981-006
Mar 21 11:24:37 example kernel: e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
Mar 23 13:31:43 example kernel: NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Mar 23 13:31:43 example kernel: xt_CLASSIFY xt_AUDIT ipt_LOG xt_state iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack iptable_mangle nfnetlink iptable_filter ip_tables autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipv6 uinput ppdev parport_pc parport r8169 mii atl1e(U) e1000e sg microcode i2c_i801 iTCO_wdt iTCO_vendor_support shpchp snd_hda_codec_via snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore snd_page_alloc ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod cdrom ahci i915 drm_kms_helper drm i2c_algo_bit i2c_core video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Mar 23 13:31:43 example kernel: e1000e 0000:01:00.0: eth0: Reset adapter
Mar 23 13:31:50 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:31:50 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:31:50 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:31:50 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:31:50 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:31:50 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:31:50 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:31:50 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:32:05 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:32:05 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:32:05 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:32:05 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:32:05 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:32:05 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:32:05 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:32:05 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:32:20 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register
Mar 23 13:32:20 example kernel: e1000e 0000:01:00.0: eth0: Error reading PHY register

[...]


Instead, the system should work normally.
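For anyone who wants to check whether a machine is showing the same symptoms, here is a minimal Python sketch that counts the two messages from the excerpt above (the NETDEV WATCHDOG timeout and "Error reading PHY register") per interface. The regular expressions are derived only from the log lines quoted in this report; the function names are illustrative, and other driver versions may word these messages differently.

```python
import re
from collections import Counter

# Patterns derived from the log excerpt in this report (assumption: other
# e1000e versions may format these messages differently).
WATCHDOG = re.compile(
    r"NETDEV WATCHDOG: (\S+) \(e1000e\): transmit queue \d+ timed out"
)
PHY_ERROR = re.compile(r"e1000e \S+: (\S+): Error reading PHY register")


def summarize_e1000e_errors(lines):
    """Count watchdog timeouts and PHY read errors per interface.

    `lines` is any iterable of syslog lines; the result maps
    (interface, kind) to the number of matching lines.
    """
    counts = Counter()
    for line in lines:
        match = WATCHDOG.search(line)
        if match:
            counts[(match.group(1), "watchdog_timeout")] += 1
            continue
        match = PHY_ERROR.search(line)
        if match:
            counts[(match.group(1), "phy_read_error")] += 1
    return counts


def summarize_log_file(path="/var/log/messages"):
    """Hypothetical helper: run the summary over a log file on disk.

    Reading /var/log/messages normally requires root.
    """
    with open(path, errors="replace") as handle:
        return summarize_e1000e_errors(handle)
```

Calling summarize_log_file() as root on an affected machine should give one counter entry per interface and message type; an empty result only means these two exact messages were not found.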


Best regards,
Răzvan
Steps To Reproduce: Cannot reproduce on other machines (for the moment, I don't own similar hardware).

Tags: No tags attached.

Activities

toracat

2012-03-23 15:19

manager   ~0014722

To find out if this is due to (a) bug(s) in the kernel e1000e driver, I suggest you try the kmod-e1000e package [1] from ELRepo. Their driver is 1.9.5 while the one in the kernel is 1.4.4-k.

[1] http://elrepo.org/tiki/kmod-e1000e
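After swapping drivers it is worth confirming which version actually loaded. The following is a rough Python sketch that pulls the version out of the driver banner as it appears in the reporter's log ("Intel(R) PRO/1000 Network Driver - 1.4.4-k") and compares two version strings. The function names are illustrative, and it is an assumption that the ELRepo build prints its banner in the same format.

```python
import re

# Banner format taken from the log excerpt in the issue description
# (assumption: newer driver builds keep the same wording).
BANNER = re.compile(r"e1000e: Intel\(R\) PRO/1000 Network Driver - (\S+)")


def loaded_driver_version(lines):
    """Return the last e1000e version announced in the log lines, or None."""
    version = None
    for line in lines:
        match = BANNER.search(line)
        if match:
            version = match.group(1)
    return version


def version_key(version):
    """Turn '1.4.4-k' into (1, 4, 4); anything after the first '-' is ignored."""
    numeric_part = version.split("-", 1)[0]
    return tuple(int(piece) for piece in numeric_part.split("."))


def is_newer(candidate, baseline):
    """True if `candidate` is a higher version number than `baseline`."""
    return version_key(candidate) > version_key(baseline)
```

For example, is_newer("1.9.5", "1.4.4-k") compares (1, 9, 5) with (1, 4, 4), so plain string comparison pitfalls such as "1.11.3" sorting before "1.9.5" are avoided.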
dsieme01

2012-03-25 02:23

reporter   ~0014728

I'm having the same issue. One change I made was to add the hardware (MAC) address of the NIC as HWADDR in the ifcfg-eth0 file. If it locks up again, I will try the different kernel module.
rsandu

2012-03-26 08:56

reporter   ~0014730

Hello,

I've followed toracat's advice in comment 0014722 and I can confirm that the bug is still present even with the newer module from ELRepo. The router locked up again.

Best regards,
Răzvan
dsieme01

2012-03-30 01:52

reporter   ~0014770

Confirmed that I still have the issue.

Hardware is a Supermicro H8SGL
32GB ECC RAM (Micro, I believe)
LSL SAS controller with 8 1gb SATA disks on it.
A few more drives in a RAID 1 array for boot.
dsieme01

2012-03-30 02:03

reporter   ~0014771

The next setting I'm going to try is adding
pcie_aspm=off to the kernel command line.
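To verify after a reboot that the parameter really reached the kernel, the running command line can be read from /proc/cmdline. A minimal Python sketch follows; the function names are illustrative, and it only checks for the exact pcie_aspm=off token mentioned in this note.

```python
def aspm_off_on_cmdline(cmdline):
    """Return True if the exact token 'pcie_aspm=off' is on the command line."""
    return "pcie_aspm=off" in cmdline.split()


def running_kernel_has_aspm_off(path="/proc/cmdline"):
    """Check the command line the running kernel was booted with."""
    with open(path) as handle:
        return aspm_off_on_cmdline(handle.read())
```

Splitting on whitespace instead of doing a substring search avoids false positives from tokens that merely contain the text, such as pcie_aspm=official_looking_typo.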
dsieme01

2012-04-03 22:34

reporter   ~0014787

I don't believe that this is a bug in the network driver so much as a problem with the configuration of the PCIe bus.

Problem appears to have gone away.
tigalch

2012-04-24 17:33

manager   ~0014926

dsieme01: can we then close the issue?
rsandu

2012-04-24 17:47

reporter   ~0014927

@tigalch

IMHO, no, since the bug still occurs from time to time (it happened again in the last few days). Personally, I will consider it closed when it no longer shows up in the *default* configuration of CentOS (default kernel parameters at boot, default PCIe bus configuration, etc.).

Other users, please confirm if it is still the case...

Regards,
Răzvan
toracat

2012-04-27 17:52

manager   ~0014964

Intel released the e1000e driver version 1.11.3 on 2012-04-24.

Changelog for e1000e-1.11.3
===========================

* Enabled DMA Burst Mode on 82574 by default for performance gain with small packets.
* Fixed issue with 82574/82583 sometimes not auto-negotiating gigabit link.
* Disabled IPv6 extension header parsing because some malformed IPv6 headers can hang the Rx.
* Disabled IBIST slave mode (far-end loopback) on 80003ES2LAN during reset to ensure the mode does not accidentally become enabled.

=============================================================

"Fixed issue with 82574" looks hopeful. If you wish to give the new version a try, wait for ELRepo to update the kmod (will be announced on the ELRepo mailing list).
bkamen

2012-05-12 17:25

reporter   ~0015068

I just had an issue with a client server using a SuperMicro motherboard with the e1000 driver.

The system didn't lock up; the driver just glitched, causing the kernel to kick it... and then the system disabled the eth0 interface.

It's on a CentOS 6.2 (i686) system recently built.

Will keep looking (and doing a kernel update since the system is a rev or two behind.)
toracat

2012-05-12 17:41

manager   ~0015069

I should have updated my note 14964. ELRepo did release the kmod-e1000e package, driver version 1.11.3. Those who are affected are encouraged to try this latest version.
enterco

2012-07-02 17:02

reporter   ~0015359

I have the same issue with a SuperMicro X9SCM and e1000e driver version 1.4.4-k. The onboard Ethernet device, built on the Intel 82574L, reports "Link down", while the /var/log/messages file fills up with
"e1000e 0000:05:00.0: eth2: Error reading PHY register"
Can the "pcie_aspm=off" parameter on the kernel command line be considered a 'workaround' for 1.4.4-k?
Can anyone confirm that the driver from ELRepo, version 1.11.3, works as expected?
toracat

2013-02-06 20:14

manager   ~0016445

To people who are affected: a newer version of kmod-e1000e (v. 2.2.14) has just been released from ELRepo. Can you try and see if this version fixes the issue?
enterco

2013-02-06 21:31

reporter   ~0016447

I have a few things to add. I now have two Supermicro X9SCM servers: the first runs firmware version 2.x (updated by me), the ELRepo e1000e driver, and pcie_aspm=off on the kernel command line; the second runs the factory-installed firmware, the stock CentOS 6.2 drivers, and an unaltered command line.

On the first server the lockup still appears, at longer intervals (more than 2 months); on the second I haven't encountered any issue after more than half a year of uptime.
In the meantime, I've contacted Supermicro regarding this issue, and they provided a supplemental firmware update tool for the NIC ROM/flash, but I wasn't able to apply this 'fix'. So, avoid updating firmware unless the situation is critical.
tru

2013-02-06 22:28

administrator   ~0016448

http://blog.krisk.org/2013/02/packets-of-death.html and the POC at http://www.kriskinc.com/intel-pod

Issue History

Date Modified Username Field Change
2012-03-23 12:44 rsandu New Issue
2012-03-23 15:19 toracat Note Added: 0014722
2012-03-25 02:23 dsieme01 Note Added: 0014728
2012-03-26 08:56 rsandu Note Added: 0014730
2012-03-30 01:52 dsieme01 Note Added: 0014770
2012-03-30 02:03 dsieme01 Note Added: 0014771
2012-04-03 22:34 dsieme01 Note Added: 0014787
2012-04-24 17:33 tigalch Note Added: 0014926
2012-04-24 17:47 rsandu Note Added: 0014927
2012-04-27 17:52 toracat Note Added: 0014964
2012-05-12 17:25 bkamen Note Added: 0015068
2012-05-12 17:41 toracat Note Added: 0015069
2012-07-02 17:02 enterco Note Added: 0015359
2013-02-06 20:14 toracat Note Added: 0016445
2013-02-06 21:31 enterco Note Added: 0016447
2013-02-06 22:28 tru Note Added: 0016448