View Issue Details

ID: 0006187
Project: CentOS-6
Category: kernel
View Status: public
Last Update: 2013-03-02 02:57
Reporter: dlmiles
Priority: normal
Severity: crash
Reproducibility: sometimes
Status: resolved
Resolution: fixed
Product Version: 6.3
Target Version:
Fixed in Version: 6.4
Summary: 0006187: 82541GI e1000 linux 2.6.32 7.3.21-k8-NAPI kernel panics
Description: CentOS 6.3 on a system with two e1000 ports on two distinct controllers. One Ethernet controller is on the motherboard; the other is on an add-in card.
NOTE: udev swaps the device names eth0<>eth1 during bootup.
After bootup completes, eth0 is the only port in use; eth1 is not connected.
eth0 is HWaddr 00:14:22:75:D7:1E at PCI bus address 03:07.0
eth1 is HWaddr 00:03:47:6B:45:6D at PCI bus address 02:05.0
The problem port is eth0; eth1 has not been used in a while.
PROBLEM
Many kernel panics/oopses in just one month of operation. I am unable to make this system's network stable.
The easiest way to make it panic is to issue:
ethtool -K eth0 gso off sg off rx off tx off

This causes:
e1000 0000:03:07.0: eth0: TSO is Disabled
e1000 0000:03:07.0: eth0: TSO is Disabled
e1000 0000:03:07.0: eth0: Reset adapter

After that the port no longer works and the system starts to oops and panic. The stack trace is included in a follow-up comment.

Steps To Reproduce: I think issuing this command is enough: ethtool -K eth0 gso off sg off rx off tx off

It causes an oops and a NIC lockup, and every process that touches the NIC hangs or dies.
Otherwise I just use the system across the network (for example as an NFS file server) and within a short time it produces some kind of oops.

Additional Information: SYSTEM DATA
Ask if you need more data.
# ethtool -i eth0
driver: e1000
version: 7.3.21-k8-NAPI
firmware-version:
bus-info: 0000:03:07.0

# IGNORE THIS ETHERNET PORT; THIS DATA IS HERE FOR COMPLETENESS
# ethtool -i eth1
driver: e1000
version: 7.3.21-k8-NAPI
firmware-version:
bus-info: 0000:02:05.0

# KERNEL DATA
# uname -r
2.6.32-279.19.1.el6.i686

# BOOTUP MESSAGES (BEWARE of device name swapping eth0<>eth1)
# dmesg | egrep -i "eth|e1000"
e1000: Intel(R) PRO/1000 Network Driver - version 7.3.21-k8-NAPI
e1000: Copyright (c) 1999-2006 Intel Corporation.
e1000 0000:02:05.0: PCI->APIC IRQ transform: INT A -> IRQ 29
e1000 0000:02:05.0: eth0: (PCI:66MHz:64-bit) 00:03:47:6b:45:6d
e1000 0000:02:05.0: eth0: Intel(R) PRO/1000 Network Connection
e1000 0000:03:07.0: PCI->APIC IRQ transform: INT A -> IRQ 53
e1000 0000:03:07.0: eth1: (PCI:66MHz:32-bit) 00:14:22:75:d7:1e
e1000 0000:03:07.0: eth1: Intel(R) PRO/1000 Network Connection
udev: renamed network interface eth0 to rename2
udev: renamed network interface eth1 to eth0
udev: renamed network interface rename2 to eth1
ADDRCONF(NETDEV_UP): eth0: link is not ready
e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
ADDRCONF(NETDEV_UP): eth1: link is not ready
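
The rename sequence in the boot messages above can be replayed mechanically to check which PCI device ends up behind each interface name. This is only a sketch: the function name `final_ifnames` is made up for illustration, and it assumes dmesg lines without timestamps, in the format shown in this report.

```shell
#!/bin/sh
# final_ifnames (hypothetical helper): read dmesg lines on stdin and print
# "NAME PCI-ADDRESS" for each e1000 port after applying every udev rename
# in the order it was logged.
final_ifnames() {
    awk '
    # e.g. "e1000 0000:02:05.0: eth0: (PCI:66MHz:64-bit) 00:03:47:6b:45:6d"
    $1 == "e1000" && $4 ~ /^\(PCI:/ {
        pci = $2;  sub(/:$/, "", pci)     # strip trailing colon
        name = $3; sub(/:$/, "", name)
        addr[name] = pci
    }
    # e.g. "udev: renamed network interface eth0 to rename2"
    $1 == "udev:" && $2 == "renamed" {
        addr[$7] = addr[$5]               # new name takes over the device
        delete addr[$5]
    }
    END { for (n in addr) print n, addr[n] }
    ' | sort
}

# Sample copied from the dmesg output in this report.
boot_log='e1000 0000:02:05.0: eth0: (PCI:66MHz:64-bit) 00:03:47:6b:45:6d
e1000 0000:03:07.0: eth1: (PCI:66MHz:32-bit) 00:14:22:75:d7:1e
udev: renamed network interface eth0 to rename2
udev: renamed network interface eth1 to eth0
udev: renamed network interface rename2 to eth1'

printf '%s\n' "$boot_log" | final_ifnames
```

For the sample this prints eth0 with 0000:03:07.0 and eth1 with 0000:02:05.0, which agrees with the bus addresses stated in the description.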

# lspci -s 03:07.0 -vv
03:07.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05)
    Subsystem: Dell Device 0183
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 32 (63750ns min), Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 53
    Region 0: Memory at fe7e0000 (32-bit, non-prefetchable) [size=128K]
    Region 2: I/O ports at dcc0 [size=64]
    Capabilities: [dc] Power Management version 2
            Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
            Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [e4] PCI-X non-bridge device
            Command: DPERE- ERO+ RBC=512 OST=1
            Status: Dev=00:00.0 64bit- 133MHz- SCD- USC- DC=simple DMMRBC=2048 DMOST=1 DMCRS=8 RSCEM- 266MHz- 533MHz-
    Kernel driver in use: e1000
    Kernel modules: e1000

# IGNORE THIS ETHERNET PORT; THIS DATA IS HERE FOR COMPLETENESS
# lspci -s 02:05.0 -vv
02:05.0 Ethernet controller: Intel Corporation 82543GC Gigabit Ethernet Controller (Copper) (rev 02)
    Subsystem: Intel Corporation PRO/1000 T Server Adapter
    Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz+ UDF- FastB2B- ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 32 (63750ns min), Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 29
    Region 0: Memory at fe9c0000 (32-bit, non-prefetchable) [size=128K]
    Region 1: Memory at fe9b0000 (32-bit, non-prefetchable) [size=64K]
    Expansion ROM at fe900000 [disabled] [size=64K]
    Capabilities: [dc] Power Management version 2
            Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
            Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Kernel driver in use: e1000
    Kernel modules: e1000

# ethtool eth0
Settings for eth0:
   Supported ports: [ TP ]
   Supported link modes: 10baseT/Half 10baseT/Full
                           100baseT/Half 100baseT/Full
                           1000baseT/Full
   Supports auto-negotiation: Yes
   Advertised link modes: 10baseT/Half 10baseT/Full
                           100baseT/Half 100baseT/Full
                           1000baseT/Full
   Advertised pause frame use: No
   Advertised auto-negotiation: Yes
   Speed: 1000Mb/s
   Duplex: Full
   Port: Twisted Pair
   PHYAD: 0
   Transceiver: internal
   Auto-negotiation: on
   MDI-X: Unknown
   Supports Wake-on: umbg
   Wake-on: d
   Current message level: 0x00000007 (7)
   Link detected: yes

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off
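
Before experimenting with the offload settings it can help to record which ones are currently enabled, so they can be restored afterwards. The sketch below only parses text in the format shown above (in ethtool, lowercase `-k` shows the offload settings and uppercase `-K` changes them). The helper name `offloads_on` is made up for illustration, and newer ethtool versions print more features and indentation than this parser expects.

```shell
#!/bin/sh
# offloads_on (hypothetical helper): read `ethtool -k DEV` output on stdin
# and print the name of every offload that is reported as "on".
offloads_on() {
    # Lines look like "scatter-gather: on"; the header line has no value
    # after its colon, so it never matches.
    awk -F': ' '$2 ~ /^on/ { print $1 }'
}

# Sample copied from the `ethtool -k eth0` output in this report.
offload_sample='Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off'

printf '%s\n' "$offload_sample" | offloads_on
```

On a live system the same function could be fed directly, for example `ethtool -k eth0 | offloads_on > offloads.before`.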

# I HAVE ALWAYS PUT THE 100baseTx-FD REPORTED BELOW down to a bug in mii-tool and/or the user/kernel interface:
# the Ethernet switch it is plugged into reports 1000 Mbit, and throughput well above
# 100 Mbit is easily achieved. Maybe this is a separate driver bug, or a mii-tool bug?

# mii-tool eth0
eth0: negotiated 100baseTx-FD flow-control, link ok
# mii-tool eth1
eth1: no link
Tags: No tags attached.

Activities

dlmiles

2013-01-14 07:27

reporter   ~0016278

# THE FOLLOWING IS FROM ISSUING THE 'ethtool -K eth0 gso off sg off rx off tx off'
# COMMAND AT Jan 14 06:03:03. THIS OUTPUT, WHICH IS EXPLAINED IN THE BUG SUMMARY,
# APPEARED 1 MIN 40 SECONDS LATER.
Jan 14 06:04:44 tyr kernel: ------------[ cut here ]------------
Jan 14 06:04:44 tyr kernel: WARNING: at drivers/net/e1000/e1000_main.c:1394 e1000_close+0xa7/0xb0 [e1000]() (Not tainted)
Jan 14 06:04:44 tyr kernel: Hardware name: PowerEdge 1800
Jan 14 06:04:44 tyr kernel: Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs sunrpc ipt_REJECT ipt_LOG nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6t_rt ip6table_filter ip6_tables ipv6 ppdev parport_pc parport e1000 snd_cmipci snd_seq snd_pcm snd_page_alloc snd_opl3_lib snd_timer snd_hwdep snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore dcdbas iTCO_wdt iTCO_vendor_support sg e752x_edac edac_core ext4 mbcache jbd2 sd_mod crc_t10dif video output 3w_9xxx mptspi mptscsih mptbase scsi_transport_spi sr_mod cdrom ata_generic ata_piix radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
Jan 14 06:04:44 tyr kernel: Pid: 3261, comm: ip Not tainted 2.6.32-279.19.1.el6.i686 #1
Jan 14 06:04:44 tyr kernel: Call Trace:
Jan 14 06:04:44 tyr kernel: [<c04550a1>] ? warn_slowpath_common+0x81/0xc0
Jan 14 06:04:44 tyr kernel: [<f8f96d57>] ? e1000_close+0xa7/0xb0 [e1000]
Jan 14 06:04:44 tyr kernel: [<f8f96d57>] ? e1000_close+0xa7/0xb0 [e1000]
Jan 14 06:04:44 tyr kernel: [<c04550fb>] ? warn_slowpath_null+0x1b/0x20
Jan 14 06:04:44 tyr kernel: [<f8f96d57>] ? e1000_close+0xa7/0xb0 [e1000]
Jan 14 06:04:44 tyr kernel: [<c07869cb>] ? dev_close+0x5b/0xb0
Jan 14 06:04:44 tyr kernel: [<c0784920>] ? dev_set_rx_mode+0x20/0x40
Jan 14 06:04:44 tyr kernel: [<c0786307>] ? dev_change_flags+0x87/0x1a0
Jan 14 06:04:44 tyr kernel: [<c0522ee3>] ? __mem_cgroup_commit_charge.clone.3+0x33/0x80
Jan 14 06:04:44 tyr kernel: [<c0790a18>] ? do_setlink+0x188/0x720
Jan 14 06:04:44 tyr kernel: [<c06060b1>] ? nla_parse+0x21/0xd0
Jan 14 06:04:44 tyr kernel: [<c0791e14>] ? rtnl_newlink+0x424/0x4f0
Jan 14 06:04:44 tyr kernel: [<c07919f0>] ? rtnl_newlink+0x0/0x4f0
Jan 14 06:04:44 tyr kernel: [<c0791706>] ? rtnetlink_rcv_msg+0x146/0x230
Jan 14 06:04:44 tyr kernel: [<c07915c0>] ? rtnetlink_rcv_msg+0x0/0x230
Jan 14 06:04:44 tyr kernel: [<c07a6a4e>] ? netlink_rcv_skb+0x7e/0xa0
Jan 14 06:04:44 tyr kernel: [<c07915a0>] ? rtnetlink_rcv+0x0/0x20
Jan 14 06:04:44 tyr kernel: [<c07915b4>] ? rtnetlink_rcv+0x14/0x20
Jan 14 06:04:44 tyr kernel: [<c07a6740>] ? netlink_unicast+0x250/0x280
Jan 14 06:04:44 tyr kernel: [<c07a6f1c>] ? netlink_sendmsg+0x1bc/0x2a0
Jan 14 06:04:44 tyr kernel: [<c07759d5>] ? sock_sendmsg+0xe5/0x120
Jan 14 06:04:44 tyr kernel: [<c0475d20>] ? autoremove_wake_function+0x0/0x40
Jan 14 06:04:44 tyr kernel: [<c04eea94>] ? __alloc_pages_nodemask+0xf4/0x870
Jan 14 06:04:44 tyr kernel: [<c0475d20>] ? autoremove_wake_function+0x0/0x40
Jan 14 06:04:44 tyr kernel: [<c04dd97d>] ? find_get_page+0x1d/0x90
Jan 14 06:04:44 tyr kernel: [<c05fd885>] ? copy_from_user+0x35/0x120
Jan 14 06:04:44 tyr kernel: [<c077f8f2>] ? verify_iovec+0x62/0xb0
Jan 14 06:04:44 tyr kernel: [<c07771fd>] ? __sys_sendmsg+0x2ad/0x2c0
Jan 14 06:04:44 tyr kernel: [<c0439b90>] ? kmap_atomic_prot+0x120/0x150
Jan 14 06:04:44 tyr kernel: [<c0503c81>] ? handle_mm_fault+0x131/0x1d0
Jan 14 06:04:44 tyr kernel: [<c0433a5a>] ? __do_page_fault+0x1aa/0x430
Jan 14 06:04:44 tyr kernel: [<c0777379>] ? sys_sendmsg+0x39/0x70
Jan 14 06:04:44 tyr kernel: [<c07774aa>] ? sys_socketcall+0xfa/0x2e0
Jan 14 06:04:44 tyr kernel: [<c04af32e>] ? audit_syscall_entry+0x1be/0x1e0
Jan 14 06:04:44 tyr kernel: [<c08302fa>] ? do_page_fault+0x2a/0x90
Jan 14 06:04:44 tyr kernel: [<c04099bf>] ? sysenter_do_call+0x12/0x28
Jan 14 06:04:44 tyr kernel: ---[ end trace a47fd97d66ac12c3 ]---

# THIS APPEARS TO BE FROM THE "events/2" PROCESS
Jan 14 06:05:10 tyr kernel: INFO: task events/2:21 blocked for more than 120 seconds.
Jan 14 06:05:10 tyr kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 14 06:05:10 tyr kernel: events/2 D f7109e38 0 21 2 0x00000000
Jan 14 06:05:10 tyr kernel: f70df000 00000046 00000002 f7109e38 c1f04024 00000000 00000400 00000499
Jan 14 06:05:10 tyr kernel: 00000000 f08f3d00 00000036 40a88682 00000036 c0b14680 c0b14680 f70df2a8
Jan 14 06:05:10 tyr kernel: c0b14680 c0b10024 c0b14680 f70df2a8 fffefa53 f708a000 c04082c7 f70df000
Jan 14 06:05:10 tyr kernel: Call Trace:
Jan 14 06:05:10 tyr kernel: [<c04082c7>] ? __switch_to+0xd7/0x1a0
Jan 14 06:05:10 tyr kernel: [<c082b0c0>] ? schedule+0x3c0/0xad0
Jan 14 06:05:10 tyr kernel: [<c082bda5>] ? schedule_timeout+0x195/0x250
Jan 14 06:05:10 tyr kernel: [<c0691980>] ? vt_console_print+0x0/0x300
Jan 14 06:05:10 tyr kernel: [<c045520b>] ? __call_console_drivers+0x5b/0x70
Jan 14 06:05:10 tyr kernel: [<c082bb09>] ? wait_for_common+0xe9/0x150
Jan 14 06:05:10 tyr kernel: [<c044de30>] ? default_wake_function+0x0/0x10
Jan 14 06:05:10 tyr kernel: [<c0471fab>] ? __cancel_work_timer+0x15b/0x180
Jan 14 06:05:10 tyr kernel: [<c0471a90>] ? wq_barrier_func+0x0/0x10
Jan 14 06:05:10 tyr kernel: [<f8f90bb6>] ? e1000_down_and_stop+0x16/0x40 [e1000]
Jan 14 06:05:10 tyr kernel: [<f8f95d8f>] ? e1000_down+0x12f/0x1b0 [e1000]
Jan 14 06:05:10 tyr kernel: [<f8f962d0>] ? e1000_reset_task+0x0/0xc0 [e1000]
Jan 14 06:05:10 tyr kernel: [<f8f96331>] ? e1000_reset_task+0x61/0xc0 [e1000]
Jan 14 06:05:10 tyr kernel: [<c047168b>] ? worker_thread+0x11b/0x230
Jan 14 06:05:10 tyr kernel: [<c0475d20>] ? autoremove_wake_function+0x0/0x40
Jan 14 06:05:10 tyr kernel: [<c0471570>] ? worker_thread+0x0/0x230
Jan 14 06:05:10 tyr kernel: [<c0475ae4>] ? kthread+0x74/0x80
Jan 14 06:05:10 tyr kernel: [<c0475a70>] ? kthread+0x0/0x80
Jan 14 06:05:10 tyr kernel: [<c0409f1f>] ? kernel_thread_helper+0x7/0x10

# THIS APPEARS TO BE FROM THE 'ntpd' PROCESS
Jan 14 06:07:10 tyr kernel: INFO: task ntpd:1542 blocked for more than 120 seconds.
Jan 14 06:07:10 tyr kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 14 06:07:10 tyr kernel: ntpd D ebd8de40 0 1542 1 0x00000080
Jan 14 06:07:10 tyr kernel: f5250000 00200082 00000002 ebd8de40 c1e04024 00000000 00100100 00200200
Jan 14 06:07:10 tyr kernel: f4a43090 f525e580 0000004e 372de5b7 0000004e c0b14680 c0b14680 f52502a8
Jan 14 06:07:10 tyr kernel: c0b14680 c0b10024 c0b14680 f52502a8 00008b36 c053bb30 00100100 f5250000
Jan 14 06:07:10 tyr kernel: Call Trace:
Jan 14 06:07:10 tyr kernel: [<c053bb30>] ? pollwake+0x0/0x60
Jan 14 06:07:10 tyr kernel: [<c05a7d04>] ? avc_has_perm+0x64/0x80
Jan 14 06:07:10 tyr kernel: [<c082c428>] ? __mutex_lock_slowpath+0xd8/0x140
Jan 14 06:07:10 tyr kernel: [<c082c32d>] ? mutex_lock+0x1d/0x40
Jan 14 06:07:10 tyr kernel: [<c0788375>] ? dev_ioctl+0xe5/0x6a0
Jan 14 06:07:10 tyr kernel: [<c05a89ed>] ? selinux_sk_alloc_security+0x3d/0x50
Jan 14 06:07:10 tyr kernel: [<c051bec7>] ? kmem_cache_alloc_trace+0x107/0x110
Jan 14 06:07:10 tyr kernel: [<c051b71d>] ? kmem_cache_alloc+0xfd/0x110
Jan 14 06:07:10 tyr kernel: [<c07d5c50>] ? udp_ioctl+0x0/0x70
Jan 14 06:07:10 tyr kernel: [<c07dceee>] ? inet_ioctl+0x2e/0xb0
Jan 14 06:07:10 tyr kernel: [<c0774abf>] ? sock_ioctl+0x6f/0x260
Jan 14 06:07:10 tyr kernel: [<c0774a50>] ? sock_ioctl+0x0/0x260
Jan 14 06:07:10 tyr kernel: [<c0539c8b>] ? vfs_ioctl+0x1b/0xa0
Jan 14 06:07:10 tyr kernel: [<c0539e6c>] ? do_vfs_ioctl+0x6c/0x5c0
Jan 14 06:07:10 tyr kernel: [<c053a436>] ? sys_ioctl+0x76/0x90
Jan 14 06:07:10 tyr kernel: [<c04af0a0>] ? __audit_syscall_exit+0x220/0x250
Jan 14 06:07:10 tyr kernel: [<c04099bf>] ? sysenter_do_call+0x12/0x28

# WHAT FOLLOWS ARE MORE RANDOM OOPSES FROM THE SAME SYSTEM THAT JUST
# OCCURRED 'NATURALLY' FROM USING THE SYSTEM.

# THE SYSTEM HAS 6GB RAM AND NEVER USES SWAP, AND THE WORKING SET FOR PROCESSES
# IS ABOUT 1.2GB, SO ALTHOUGH IT SAYS 'page allocation failure' THE SYSTEM IS NOT SHORT OF MEMORY OVERALL.
# THE SYSTEM ONLY HAD 30 HOURS UPTIME AT THE TIME OF THE CRASH

Jan 14 05:50:46 tyr kernel: kswapd0: page allocation failure. order:5, mode:0x20
Jan 14 05:50:46 tyr kernel: Pid: 58, comm: kswapd0 Not tainted 2.6.32-279.19.1.el6.i686 #1
Jan 14 05:50:46 tyr kernel: Call Trace:
Jan 14 05:50:46 tyr kernel: [<c04ef05c>] ? __alloc_pages_nodemask+0x6bc/0x870
Jan 14 05:50:46 tyr kernel: [<f946c702>] ? nf_conntrack_find_get+0x22/0x110 [nf_conntrack]
Jan 14 05:50:46 tyr kernel: [<c051b9ec>] ? cache_alloc_refill+0x2bc/0x510
Jan 14 05:50:46 tyr kernel: [<c051bd82>] ? __kmalloc+0x142/0x180
Jan 14 05:50:46 tyr kernel: [<c077da83>] ? pskb_expand_head+0x53/0x200
Jan 14 05:50:46 tyr kernel: [<c077da83>] ? pskb_expand_head+0x53/0x200
Jan 14 05:50:46 tyr kernel: [<c077e05c>] ? __pskb_pull_tail+0x4c/0x2b0
Jan 14 05:50:46 tyr kernel: [<c07a8b16>] ? nf_iterate+0x66/0x80
Jan 14 05:50:46 tyr kernel: [<c07892bd>] ? dev_queue_xmit+0x1ed/0x6f0
Jan 14 05:50:46 tyr kernel: [<c07b65a0>] ? ip_finish_output+0x0/0x280
Jan 14 05:50:46 tyr kernel: [<c07a8c82>] ? nf_hook_slow+0x62/0xf0
Jan 14 05:50:46 tyr kernel: [<c07b65a0>] ? ip_finish_output+0x0/0x280
Jan 14 05:50:46 tyr kernel: [<c07b66a5>] ? ip_finish_output+0x105/0x280
Jan 14 05:50:46 tyr kernel: [<c07b68aa>] ? ip_output+0x8a/0xb0
Jan 14 05:50:46 tyr kernel: [<c07b5d65>] ? ip_local_out+0x15/0x20
Jan 14 05:50:46 tyr kernel: [<c07b61a5>] ? ip_queue_xmit+0x145/0x3b0
Jan 14 05:50:46 tyr kernel: [<c07c3a06>] ? tcp_data_snd_check+0xc6/0xe0
Jan 14 05:50:46 tyr kernel: [<c051c282>] ? slab_destroy+0x22/0x70
Jan 14 05:50:46 tyr kernel: [<c07c8b63>] ? tcp_transmit_skb+0x3a3/0x710
Jan 14 05:50:46 tyr kernel: [<c07cab3a>] ? tcp_write_xmit+0x1ea/0x9c0
Jan 14 05:50:46 tyr kernel: [<c07cb441>] ? __tcp_push_pending_frames+0x31/0xe0
Jan 14 05:50:46 tyr kernel: [<c07c3963>] ? tcp_data_snd_check+0x23/0xe0
Jan 14 05:50:46 tyr kernel: [<c07c71fa>] ? tcp_rcv_established+0x37a/0x760
Jan 14 05:50:46 tyr kernel: [<c07ce59f>] ? tcp_v4_do_rcv+0x27f/0x3c0
Jan 14 05:50:46 tyr kernel: [<f94f94c8>] ? ipv4_confirm+0x68/0x190 [nf_conntrack_ipv4]
Jan 14 05:50:46 tyr kernel: [<c07cfb9e>] ? tcp_v4_rcv+0x48e/0x7c0
Jan 14 05:50:46 tyr kernel: [<c07b1710>] ? ip_local_deliver_finish+0x0/0x260
Jan 14 05:50:46 tyr kernel: [<c07a8c82>] ? nf_hook_slow+0x62/0xf0
Jan 14 05:50:46 tyr kernel: [<c07b17af>] ? ip_local_deliver_finish+0x9f/0x260
Jan 14 05:50:46 tyr kernel: [<c07b19bf>] ? ip_local_deliver+0x4f/0x90
Jan 14 05:50:46 tyr kernel: [<c07b1043>] ? ip_rcv_finish+0xf3/0x390
Jan 14 05:50:46 tyr kernel: [<c07b0f50>] ? ip_rcv_finish+0x0/0x390
Jan 14 05:50:46 tyr kernel: [<c0785361>] ? __netif_receive_skb+0x401/0x5f0
Jan 14 05:50:46 tyr kernel: [<c078700f>] ? netif_receive_skb+0x3f/0x50
Jan 14 05:50:46 tyr kernel: [<c079c9ed>] ? eth_type_trans+0x2d/0x120
Jan 14 05:50:46 tyr kernel: [<c07870df>] ? napi_skb_finish+0x2f/0x40
Jan 14 05:50:46 tyr kernel: [<c0788e35>] ? napi_gro_receive+0x25/0x40
Jan 14 05:50:46 tyr kernel: [<f8f8bf31>] ? e1000_clean_rx_irq+0x241/0x4a0 [e1000]
Jan 14 05:50:46 tyr kernel: [<f8f89e38>] ? e1000_clean+0x198/0x8e0 [e1000]
Jan 14 05:50:46 tyr kernel: [<c04f3ef9>] ? shrink_page_list.clone.0+0x3e9/0x520
Jan 14 05:50:46 tyr kernel: [<c0788f2e>] ? net_rx_action+0xde/0x280
Jan 14 05:50:46 tyr kernel: [<c045c3da>] ? __do_softirq+0x8a/0x1a0
Jan 14 05:50:46 tyr kernel: [<c042a65f>] ? ack_apic_level+0x5f/0x1f0
Jan 14 05:50:46 tyr kernel: [<c04b5675>] ? handle_fasteoi_irq+0x85/0xc0
Jan 14 05:50:46 tyr kernel: [<c045c52d>] ? do_softirq+0x3d/0x50
Jan 14 05:50:46 tyr kernel: [<c045c685>] ? irq_exit+0x65/0x70
Jan 14 05:50:46 tyr kernel: [<c040b030>] ? do_IRQ+0x50/0xc0
Jan 14 05:50:46 tyr kernel: [<c0409f10>] ? common_interrupt+0x30/0x38
Jan 14 05:50:46 tyr kernel: [<c053dba0>] ? d_callback+0x0/0x10
Jan 14 05:50:46 tyr kernel: [<c04b7732>] ? __call_rcu+0x22/0x110
Jan 14 05:50:46 tyr kernel: [<c053d4f6>] ? d_kill+0x36/0x50
Jan 14 05:50:46 tyr kernel: [<c053d7a3>] ? __shrink_dcache_sb+0x293/0x2e0
Jan 14 05:50:46 tyr kernel: [<c053d8f2>] ? shrink_dcache_memory+0x102/0x1a0
Jan 14 05:50:46 tyr kernel: [<c04f379b>] ? shrink_slab+0x11b/0x180
Jan 14 05:50:46 tyr kernel: [<c04f5beb>] ? kswapd+0x57b/0x920
Jan 14 05:50:46 tyr kernel: [<c04f5f90>] ? isolate_pages_global+0x0/0x2b0
Jan 14 05:50:46 tyr kernel: [<c0475d20>] ? autoremove_wake_function+0x0/0x40
Jan 14 05:50:46 tyr kernel: [<c04f5670>] ? kswapd+0x0/0x920
Jan 14 05:50:46 tyr kernel: [<c0475ae4>] ? kthread+0x74/0x80
Jan 14 05:50:46 tyr kernel: [<c0475a70>] ? kthread+0x0/0x80
Jan 14 05:50:46 tyr kernel: [<c0409f1f>] ? kernel_thread_helper+0x7/0x10
Jan 14 05:50:46 tyr kernel: Mem-Info:
Jan 14 05:50:46 tyr kernel: DMA per-cpu:
Jan 14 05:50:46 tyr kernel: CPU 0: hi: 0, btch: 1 usd: 0
Jan 14 05:50:46 tyr kernel: CPU 1: hi: 0, btch: 1 usd: 0
Jan 14 05:50:46 tyr kernel: CPU 2: hi: 0, btch: 1 usd: 0
Jan 14 05:50:46 tyr kernel: CPU 3: hi: 0, btch: 1 usd: 0
Jan 14 05:50:46 tyr kernel: Normal per-cpu:
Jan 14 05:50:46 tyr kernel: CPU 0: hi: 186, btch: 31 usd: 94
Jan 14 05:50:46 tyr kernel: CPU 1: hi: 186, btch: 31 usd: 170
Jan 14 05:50:46 tyr kernel: CPU 2: hi: 186, btch: 31 usd: 164
Jan 14 05:50:46 tyr kernel: CPU 3: hi: 186, btch: 31 usd: 183
Jan 14 05:50:46 tyr kernel: HighMem per-cpu:
Jan 14 05:50:46 tyr kernel: CPU 0: hi: 186, btch: 31 usd: 26
Jan 14 05:50:46 tyr kernel: CPU 1: hi: 186, btch: 31 usd: 24
Jan 14 05:50:46 tyr kernel: CPU 2: hi: 186, btch: 31 usd: 152
Jan 14 05:50:46 tyr kernel: CPU 3: hi: 186, btch: 31 usd: 63
Jan 14 05:50:46 tyr kernel: active_anon:119292 inactive_anon:17142 isolated_anon:0
Jan 14 05:50:46 tyr kernel: active_file:77328 inactive_file:460484 isolated_file:0
Jan 14 05:50:46 tyr kernel: unevictable:0 dirty:15650 writeback:0 unstable:0
Jan 14 05:50:46 tyr kernel: free:801387 slab_reclaimable:21875 slab_unreclaimable:12758
Jan 14 05:50:46 tyr kernel: mapped:14393 shmem:1011 pagetables:1385 bounce:0
Jan 14 05:50:46 tyr kernel: DMA free:3528kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:3588kB inactive_file:372kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15868kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:412kB slab_unreclaimable:60kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 14 05:50:46 tyr kernel: lowmem_reserve[]: 0 863 6075 6075
Jan 14 05:50:46 tyr kernel: Normal free:25264kB min:3724kB low:4652kB high:5584kB active_anon:0kB inactive_anon:0kB active_file:248336kB inactive_file:248380kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:883912kB mlocked:0kB dirty:440kB writeback:0kB mapped:4kB shmem:0kB slab_reclaimable:87088kB slab_unreclaimable:50972kB kernel_stack:4944kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 14 05:50:46 tyr kernel: lowmem_reserve[]: 0 0 41701 41701
Jan 14 05:50:46 tyr kernel: HighMem free:3176756kB min:512kB low:6132kB high:11756kB active_anon:477168kB inactive_anon:68568kB active_file:57388kB inactive_file:1593184kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:5337780kB mlocked:0kB dirty:62160kB writeback:0kB mapped:57568kB shmem:4044kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:5540kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jan 14 05:50:46 tyr kernel: lowmem_reserve[]: 0 0 0 0
Jan 14 05:50:46 tyr kernel: DMA: 25*4kB 11*8kB 5*16kB 2*32kB 4*64kB 1*128kB 1*256kB 1*512kB 0*1024kB 1*2048kB 0*4096kB = 3532kB
Jan 14 05:50:46 tyr kernel: Normal: 5868*4kB 150*8kB 19*16kB 2*32kB 0*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 25296kB
Jan 14 05:50:46 tyr kernel: HighMem: 15*4kB 23*8kB 4*16kB 0*32kB 2*64kB 1*128kB 1*256kB 7*512kB 4*1024kB 3*2048kB 772*4096kB = 3176756kB
Jan 14 05:50:46 tyr kernel: 538831 total pagecache pages
Jan 14 05:50:46 tyr kernel: 0 pages in swap cache
Jan 14 05:50:46 tyr kernel: Swap cache stats: add 0, delete 0, find 0/0
Jan 14 05:50:46 tyr kernel: Free swap = 1557496kB
Jan 14 05:50:46 tyr kernel: Total swap = 1557496kB
Jan 14 05:50:46 tyr kernel: 1572863 pages RAM
Jan 14 05:50:46 tyr kernel: 1346050 pages HighMem
Jan 14 05:50:46 tyr kernel: 47775 pages reserved
Jan 14 05:50:46 tyr kernel: 521984 pages shared
Jan 14 05:50:46 tyr kernel: 220111 pages non-shared
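
A note on reading the "page allocation failure" above: order:5 means the request needed 2^5 = 32 physically contiguous pages, which is 128 kB with 4 kB pages. The trace shows the request coming from __kmalloc in the network receive path, and kmalloc memory on a 32-bit kernel comes from the low zones, so the large HighMem free figure does not help. The sketch below adds up how much of a zone's free memory sits in blocks big enough for a given order. The helper name `free_kb_at_or_above` is made up for illustration, and the 4 kB page size is an assumption that matches the free-list lines in this log.

```shell
#!/bin/sh
# free_kb_at_or_above ORDER (hypothetical helper): read one buddy free-list
# line on stdin ("Zone: N*4kB N*8kB ... = TOTALkB") and print the kB held
# in free blocks of at least 2^ORDER pages, assuming 4 kB pages.
free_kb_at_or_above() {
    awk -v order="$1" '
    {
        min_kb = 4 * 2 ^ order          # smallest block that fits the request
        sum = 0
        for (i = 1; i <= NF; i++) {
            # Only fields such as "5868*4kB"; skips zone name, "=" and total.
            if ($i ~ /^[0-9]+[*][0-9]+kB$/) {
                split($i, part, "[*]")
                size = part[2] + 0      # "4kB" converts to the number 4
                if (size >= min_kb) sum += part[1] * size
            }
        }
        print sum
    }'
}

# Normal-zone free list copied from the log in this report.
normal_zone='Normal: 5868*4kB 150*8kB 19*16kB 2*32kB 0*64kB 0*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 25296kB'

printf '%s\n' "$normal_zone" | free_kb_at_or_above 5
```

For the Normal zone line this prints 256: of roughly 25 MB free, only a single 256 kB block was large enough for an order-5 request. That points toward low-memory fragmentation rather than exhaustion, although this is an interpretation of the log and not something confirmed in this report.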
dlmiles

2013-01-14 07:32

reporter   ~0016279

I'm sure I can find plenty of oopses from the past month. I had previously thought it was a hardware matter, but this is the second system, of a different generation, with a similar Ethernet card to have e1000 network issues on CentOS 6.


I have previous kernel development experience (many years ago), so if you can supply a command sequence that obtains the kernel plus patches and builds it, I can then examine, compare, and try driver modifications.

Also, please supply the command sequence to map the kernel Call Trace and debug offsets to source code line numbers.
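
One common way to map a trace entry such as `e1000_close+0xa7/0xb0 [e1000]` to a source line is gdb's `list *(symbol+offset)` command, run against a module or vmlinux that was built with debug info. The sketch below only turns Call Trace lines into such gdb commands. The helper name `trace_to_gdb` is made up for illustration, and where the debug-info copy of e1000.ko lives is not stated in this report; it depends on the installed debuginfo package.

```shell
#!/bin/sh
# trace_to_gdb (hypothetical helper): read kernel Call Trace lines on stdin
# and print one gdb command per entry, e.g. "list *(e1000_close+0xa7)".
trace_to_gdb() {
    # Keep "symbol+0xoffset" and drop the "/0xsize" part and module name.
    sed -n 's|.*] ? \([A-Za-z0-9_.]*+0x[0-9a-f]*\)/0x[0-9a-f]*.*|list *(\1)|p'
}

# Sample lines copied from the first trace in this report.
trace_sample='Jan 14 06:04:44 tyr kernel: Call Trace:
Jan 14 06:04:44 tyr kernel: [<f8f96d57>] ? e1000_close+0xa7/0xb0 [e1000]
Jan 14 06:04:44 tyr kernel: [<c07869cb>] ? dev_close+0x5b/0xb0'

printf '%s\n' "$trace_sample" | trace_to_gdb
```

The generated lines could be saved to a file and replayed with something like `gdb -batch -x cmds.gdb /path/to/e1000.ko.debug`, where both the file name and the path are placeholders.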



The official Intel driver for e1000 has many versions:
http://sourceforge.net/projects/e1000/files/e1000+stable/

The version in RHEL6/CentOS6 looks to be from 2007. I understand it may be heavily patched to include bug fixes, but 5 years is a long time; maybe Intel already fixed the issue a long time ago?


This issue is also reported here, maybe with better formatting:
http://sourceforge.net/p/e1000/bugs/370/
dlmiles

2013-01-14 08:15

reporter   ~0016280

I am now trying http://elrepo.org/tiki/kmod-e1000, which is the current version from Intel (it seems 5+ years newer than the stock RHEL6 version 7.3.21-k8-NAPI):

# ethtool -i eth0
driver: e1000
version: 8.0.35-NAPI
firmware-version: N/A
bus-info: 0000:03:07.0
toracat

2013-01-14 15:46

manager   ~0016283

Please do report back with the result after trying ELRepo's package. If that fixes the issue, I suggest you file a bug report upstream at http://bugzilla.redhat.com . They seem to work closely with IBM people and will update the driver as far as it is in the mainline kernel (from kernel.org). CentOS gets the update (only) if it is done upstream.
dlmiles

2013-01-15 04:12

reporter   ~0016284

I also reported it at Intel's SourceForge Ethernet driver project: http://sourceforge.net/p/e1000/bugs/370/

A request has been made:

    Sorry to hear this.
    Would you make sure that your kernel is running with following patch?

    commit 8ce6909f77ba1b7bcdea65cc2388fd1742b6d669
    Author: Tushar Dave tushar.n.dave@intel.com
    Date: Thu May 17 01:04:50 2012 +0000


http://git.kernel.org/?p=linux/kernel/git/stable/linux-stable.git;a=commit;h=3a58107e4ed76e3a314002233a600234e0785aa1

This made it into 3.2.18, so it would need an explicit backport by RH. Can we confirm whether this patch is in the CentOS kernel at this time?
toracat

2013-01-15 08:17

manager   ~0016285

As I responded at sourceforge.net, the current CentOS 6 kernel (2.6.32-279.19.1.el6) does not have the referenced patch. CentOS provides a custom kernel called the centosplus kernel. I can try building it with the patch applied.
toracat

2013-01-15 08:40

manager   ~0016286

The kmod-e1000 package most likely does not have the referenced patch, judging from its date. So, I recommend you try kernel-ml from ELRepo:

http://elrepo.org/tiki/kernel-ml

Its version is 3.7.2 at the moment. kernel-ml installs in parallel with the distro kernel. When you are done with testing, uninstalling it is as easy as running a 'yum remove'.
dlmiles

2013-01-16 15:48

reporter   ~0016290

I have examined the 8.0.35-NAPI driver code and it looks nothing like the 7.3.20 code (I don't have the exact RH/CentOS source to hand, only Intel's 7.3.20 release). In particular, the offending line, which cancels the internal kernel task/thread that performs slow driver operations such as resetting the card, does not exist in the reset code path. It appears only in the e1000_remove() code path, which I believe is used only for module unload. I never unload the module, but a card reset can be performed for many reasons: a change of certain settings, iface up/down, bus errors, too many media errors, overflow/underrun of interrupts/data, etc. In short, the 8.0.35 driver does not need this patch.

So far the system has been solid since I started using the elrepo.org 8.0.35 driver.

I am happy to try a patched/replacement e1000.ko module based on the RH version of 7.3.21 that includes this recent patch, but only if you want feedback on something specific.

It was not clear which centosplus kernel version includes this patch (what is the full CentOS version designation?). Also, are you saying this centosplus kernel is 3.7.2, not 2.6.32 based? If so, it will already include the patch: my quick look saw it hit mainline in 3.2.18, so any mainline kernel newer than that is safe?


If you wish me to test something specific please be specific about the combinations.

Until then I shall continue to test/enjoy stability with CentOS6 and my hardware.

Once this matter is resolved I am happy to stay with the 2.6.32-based kernel using the elrepo 8.0.35 driver. I would prefer to stay with that kernel (if I can).
toracat

2013-01-16 17:58

manager   ~0016291

Glad to learn that the e1000 driver from ELRepo is working fine.

I have just uploaded a test version of centosplus kernel that has the referenced patch applied.

http://people.centos.org/toracat/kernel/6/plus/bug6187/

Note that the code for the e1000 module in the cplus kernel is unchanged from the distro kernel (except for the patch, of course).

Please test to see if this kernel fixes the issue you are seeing. If that turns out to be the case, I will include the patch in the official version of the plus kernel.

Whatever the outcome, the driver needs to be updated upstream (at RH). Until that happens, your best solution will be, as you said, to continue using the ELRepo kmod package.
toracat

2013-02-05 22:25

manager  

centos-linux-2.6-e1000-properly-kill-reset-task-bug6187.patch (1,472 bytes)
centos-linux-2.6-e1000-properly-kill-reset-task-bug6187.patch
http://bugs.centos.org/view.php?id=6187

commit 8ce6909f77ba1b7bcdea65cc2388fd1742b6d669
Author: Tushar Dave <tushar.n.dave@intel.com>
Date:   Thu May 17 01:04:50 2012 +0000

    e1000: Prevent reset task killing itself.
    
    Killing reset task while adapter is resetting causes deadlock.
    Only kill reset task if adapter is not resetting.
    Ref bug #43132 on bugzilla.kernel.org
    
    CC: stable@vger.kernel.org
    Signed-off-by: Tushar Dave <tushar.n.dave@intel.com>
    Tested-by: Aaron Brown <aaron.f.brown@intel.com>
    Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

    Applied by: Akemi Yagi <toracat@centos.org>

diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c b/drivers/net/ethernet/intel/e1000/e1000_main.c
index 37caa88..8d8908d 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -483,7 +483,11 @@ out:
 static void e1000_down_and_stop(struct e1000_adapter *adapter)
 {
 	set_bit(__E1000_DOWN, &adapter->flags);
-	cancel_work_sync(&adapter->reset_task);
+
+	/* Only kill reset task if adapter is not resetting */
+	if (!test_bit(__E1000_RESETTING, &adapter->flags))
+		cancel_work_sync(&adapter->reset_task);
+
 	cancel_delayed_work_sync(&adapter->watchdog_task);
 	cancel_delayed_work_sync(&adapter->phy_info_task);
 	cancel_delayed_work_sync(&adapter->fifo_stall_task);
toracat

2013-02-05 22:30

manager   ~0016439

centosplus kernel 2.6.32-279.22.1.el6.centos.plus includes the patch referenced in comment 16284.

http://bugs.centos.org/file_download.php?file_id=1433&type=bug
toracat

2013-02-07 16:14

manager   ~0016456

I am closing this report as 'resolved'. Please feel free to reopen if you still see the problem.
toracat

2013-02-27 01:17

manager   ~0016549

The patch is in the 6.4 kernel ( 2.6.32-358.el6 ).

Issue History

Date Modified Username Field Change
2013-01-14 07:25 dlmiles New Issue
2013-01-14 07:27 dlmiles Note Added: 0016278
2013-01-14 07:32 dlmiles Note Added: 0016279
2013-01-14 08:15 dlmiles Note Added: 0016280
2013-01-14 15:46 toracat Note Added: 0016283
2013-01-15 04:12 dlmiles Note Added: 0016284
2013-01-15 08:17 toracat Note Added: 0016285
2013-01-15 08:40 toracat Note Added: 0016286
2013-01-15 08:41 toracat Status new => assigned
2013-01-16 15:48 dlmiles Note Added: 0016290
2013-01-16 17:58 toracat Note Added: 0016291
2013-02-05 22:25 toracat File Added: centos-linux-2.6-e1000-properly-kill-reset-task-bug6187.patch
2013-02-05 22:30 toracat Note Added: 0016439
2013-02-07 16:14 toracat Note Added: 0016456
2013-02-07 16:14 toracat Status assigned => resolved
2013-02-07 16:14 toracat Resolution open => fixed
2013-02-27 01:17 toracat Note Added: 0016549
2013-02-27 01:17 toracat Status resolved => feedback
2013-02-27 01:17 toracat Resolution fixed => reopened
2013-02-27 01:18 toracat Status feedback => resolved
2013-02-27 01:18 toracat Resolution reopened => fixed
2013-02-27 01:18 toracat Fixed in Version => 6.4