View Issue Details

ID: 0006688
Project: CentOS-6
Category: kernel
View Status: public
Last Update: 2013-10-24 20:02
Reporter: TomKong
Priority: high
Severity: minor
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: x86_64
OS: Centos
OS Version: Centos 6.4
Product Version: 6.4
Target Version:
Fixed in Version:
Summary: 0006688: CPU#XX stuck for 67s! about every 5 times service network restart
Description: Restarting the network often produces the dmesg output below. At that point the server cannot reboot and has to be power cycled.

BUG: soft lockup - CPU#19 stuck for 67s! [ip:6972]
Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables mpt2sas raid_class mptctl mptbase nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun kvm_intel kvm microcode sb_edac edac_core iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core sg ioatdma igb dca ptp pps_core ext4 jbd2 mbcache usb_storage sd_mod crc_t10dif ahci isci libsas scsi_transport_sas megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
CPU 19
Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables mpt2sas raid_class mptctl mptbase nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf bridge stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 vhost_net macvtap macvlan tun kvm_intel kvm microcode sb_edac edac_core iTCO_wdt iTCO_vendor_support i2c_i801 i2c_core sg ioatdma igb dca ptp pps_core ext4 jbd2 mbcache usb_storage sd_mod crc_t10dif ahci isci libsas scsi_transport_sas megaraid_sas dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 6972, comm: ip Not tainted 2.6.32-358.18.1.el6.x86_64 #1 Supermicro X9DR3-F/X9DR3-F
RIP: 0010:[<ffffffff81510b3e>] [<ffffffff81510b3e>] _spin_lock_bh+0x2e/0x40
RSP: 0018:ffff8810689e1588 EFLAGS: 00000297
RAX: 000000000000ffff RBX: ffff8810689e1598 RCX: 0000000000000246
RDX: 0000000000008808 RSI: 0000000000000000 RDI: ffff8808728f242c
RBP: ffffffff8100bb8e R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000030 R11: 0000000000000000 R12: 0000000000000003
R13: 0000000000000286 R14: ffffffff81b19430 R15: ffffffff81471c38
FS: 00007fab5e3a8700(0000) GS:ffff88089c4e0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000038c24ff2f0 CR3: 00000010697b1000 CR4: 00000000000407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ip (pid: 6972, threadinfo ffff8810689e0000, task ffff881067c50080)
Stack:
 ffff8810689e1598 ffff8808728f2400 ffff8810689e15e8 ffffffff814b4cda
<d> 00000000000000d0 010000e06a6859c0 0000000000000000 ffff8808728f2400
<d> ffff8808728f2400 0000000000000001 0000000000000000 ffffffff81b1c720
Call Trace:
 [<ffffffff814b4cda>] ? ip_mc_inc_group+0x15a/0x280
 [<ffffffff814b4f57>] ? ip_mc_up+0x27/0x70
 [<ffffffff814ae70e>] ? inetdev_event+0x9e/0x4d0
 [<ffffffff81457830>] ? rtnl_notify+0x30/0x40
 [<ffffffff81459b7b>] ? rtmsg_ifinfo+0x16b/0x200
 [<ffffffff81513c25>] ? notifier_call_chain+0x55/0x80
 [<ffffffff8109cce6>] ? raw_notifier_call_chain+0x16/0x20
 [<ffffffff814498fb>] ? call_netdevice_notifiers+0x1b/0x20
 [<ffffffff8144a77e>] ? dev_open+0xce/0x100
 [<ffffffff81449cd1>] ? dev_change_flags+0xa1/0x1d0
 [<ffffffff81456cd8>] ? do_setlink+0x208/0x870
 [<ffffffff8128edc3>] ? __nla_reserve+0x53/0x70
 [<ffffffff8128f194>] ? nla_parse+0x34/0x110
 [<ffffffff8145856a>] ? rtnl_newlink+0x42a/0x550
 [<ffffffff81226fcd>] ? selinux_netlink_recv+0x6d/0x90
 [<ffffffff81457d77>] ? rtnetlink_rcv_msg+0x2d7/0x340
 [<ffffffff81224704>] ? socket_has_perm+0x74/0x90
 [<ffffffff81457aa0>] ? rtnetlink_rcv_msg+0x0/0x340
 [<ffffffff814726b9>] ? netlink_rcv_skb+0xa9/0xd0
 [<ffffffff81457a85>] ? rtnetlink_rcv+0x25/0x40
 [<ffffffff81472316>] ? netlink_unicast+0x2e6/0x300
 [<ffffffff81472ca0>] ? netlink_sendmsg+0x200/0x2e0
 [<ffffffff81435ec3>] ? sock_sendmsg+0x123/0x150
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81143867>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff81435d14>] ? move_addr_to_kernel+0x64/0x70
 [<ffffffff814376b6>] ? __sys_sendmsg+0x406/0x420
 [<ffffffff8104759c>] ? __do_page_fault+0x1ec/0x480
 [<ffffffff81147dab>] ? vma_link+0x9b/0xf0
 [<ffffffff81149fac>] ? do_brk+0x26c/0x350
 [<ffffffff814378d9>] ? sys_sendmsg+0x49/0x90
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Code: e5 53 48 83 ec 08 0f 1f 44 00 00 48 89 fb e8 2a 64 b6 ff b8 00 00 01 00 f0 0f c1 03 0f b7 d0 c1 e8 10 39 c2 74 0e f3 90 0f b7 13 <eb> f5 83 3b 00 75 f4 eb df 48 83 c4 08 5b c9 c3 66 90 55 48 89
Call Trace:
 [<ffffffff81510b26>] ? _spin_lock_bh+0x16/0x40
 [<ffffffff814b4cda>] ? ip_mc_inc_group+0x15a/0x280
 [<ffffffff814b4f57>] ? ip_mc_up+0x27/0x70
 [<ffffffff814ae70e>] ? inetdev_event+0x9e/0x4d0
 [<ffffffff81457830>] ? rtnl_notify+0x30/0x40
 [<ffffffff81459b7b>] ? rtmsg_ifinfo+0x16b/0x200
 [<ffffffff81513c25>] ? notifier_call_chain+0x55/0x80
 [<ffffffff8109cce6>] ? raw_notifier_call_chain+0x16/0x20
 [<ffffffff814498fb>] ? call_netdevice_notifiers+0x1b/0x20
 [<ffffffff8144a77e>] ? dev_open+0xce/0x100
 [<ffffffff81449cd1>] ? dev_change_flags+0xa1/0x1d0
 [<ffffffff81456cd8>] ? do_setlink+0x208/0x870
 [<ffffffff8128edc3>] ? __nla_reserve+0x53/0x70
 [<ffffffff8128f194>] ? nla_parse+0x34/0x110
 [<ffffffff8145856a>] ? rtnl_newlink+0x42a/0x550
 [<ffffffff81226fcd>] ? selinux_netlink_recv+0x6d/0x90
 [<ffffffff81457d77>] ? rtnetlink_rcv_msg+0x2d7/0x340
 [<ffffffff81224704>] ? socket_has_perm+0x74/0x90
 [<ffffffff81457aa0>] ? rtnetlink_rcv_msg+0x0/0x340
 [<ffffffff814726b9>] ? netlink_rcv_skb+0xa9/0xd0
 [<ffffffff81457a85>] ? rtnetlink_rcv+0x25/0x40
 [<ffffffff81472316>] ? netlink_unicast+0x2e6/0x300
 [<ffffffff81472ca0>] ? netlink_sendmsg+0x200/0x2e0
 [<ffffffff81435ec3>] ? sock_sendmsg+0x123/0x150
 [<ffffffff81096da0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff81143867>] ? handle_pte_fault+0xf7/0xb50
 [<ffffffff81435d14>] ? move_addr_to_kernel+0x64/0x70
 [<ffffffff814376b6>] ? __sys_sendmsg+0x406/0x420
 [<ffffffff8104759c>] ? __do_page_fault+0x1ec/0x480
 [<ffffffff81147dab>] ? vma_link+0x9b/0xf0
 [<ffffffff81149fac>] ? do_brk+0x26c/0x350
 [<ffffffff814378d9>] ? sys_sendmsg+0x49/0x90
 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Steps To Reproduce: Run "service network restart" 20 times; at least one run leads to a stuck CPU.
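The reproduction steps could be scripted roughly as follows. This is only a sketch: the function names `count_soft_lockups` and `repro_loop` are made up for illustration, and only `service network restart` and the "BUG: soft lockup" message text come from the report. Note that when the lockup hits, the restart itself may never return, so the loop may hang instead of reporting.

```shell
# count_soft_lockups: read kernel log text on stdin and print the number
# of soft-lockup reports it contains (prints 0 when there are none).
count_soft_lockups() {
    grep -c 'BUG: soft lockup' || true
}

# repro_loop N: restart networking up to N times (needs root) and stop at
# the first restart after which dmesg shows an additional soft-lockup report.
# Hypothetical helper; assumes the dmesg ring buffer has not wrapped.
repro_loop() {
    runs=${1:-20}
    before=$(dmesg | count_soft_lockups)
    i=1
    while [ "$i" -le "$runs" ]; do
        service network restart
        after=$(dmesg | count_soft_lockups)
        if [ "$after" -gt "$before" ]; then
            echo "soft lockup reported after restart $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no soft lockup in $runs restarts"
}
```

Running `repro_loop 20` as root would mirror the 20 restarts mentioned above; a larger count can be passed to check a candidate fix.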
Additional Information:
CPU: Intel(R) Xeon(R) E5-2620
Board: Supermicro X9DR3-F
Chipset: Intel Patsburg
Network card: Intel Ethernet I350
HD: LSI MR9271-8i
Tags: No tags attached.

Activities

TomKong

2013-10-23 16:57

reporter   ~0018220

For me the problem is solved. Even after the latest yum update, the CPU still got stuck. After installing the kernel sources and gcc, I was able to update the igb driver manually from 4.0.1 to 5.0.6.
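To confirm which igb version is installed after such an update, something like the sketch below may help. The helper names `version_ge` and `igb_is_updated` are hypothetical, and the 5.0.0 threshold is an assumption drawn from the 4.0.1 to 5.0.6 upgrade described in this note, not a documented cut-off. It relies on `sort -V` from GNU coreutils and on `modinfo -F version`.

```shell
# version_ge A B: succeed if dotted version A is at least B.
# Uses GNU sort's version ordering (-V): B must sort first or tie.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# igb_is_updated: hypothetical wrapper around modinfo. The 5.0.0 threshold
# is an assumption, not a value confirmed by the driver maintainers.
igb_is_updated() {
    version_ge "$(modinfo -F version igb)" 5.0.0
}
```

`modinfo` reports the module file on disk; `ethtool -i <interface>` shows the version of the driver actually bound to an interface, which can differ until the module is reloaded.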
toracat

2013-10-23 17:38

manager   ~0018223

If updating the igb driver to 5.0.x is the solution, I suggest you use the kmod-igb package from ELRepo [1]. It's version 5.0.6 at this moment. The advantage is that it is kABI-tracking [2], meaning it survives kernel updates transparently.

[1] http://elrepo.org/tiki/kmod-igb
[2] http://elrepo.org/tiki/FAQ
TomKong

2013-10-24 15:47

reporter   ~0018239

Thanks for your suggestion. I installed the ELRepo package. The CPU no longer gets stuck, even after more than 100 "service network restart" operations.
toracat

2013-10-24 16:07

manager   ~0018240

Glad to hear the ELRepo package worked. As mentioned, there is no need to reinstall it upon kernel updates. If/when a new version of the driver becomes available, 'yum update' will find it (provided elrepo is enabled).

Could you let me know the device ID pairing of your network device (for information purposes)? The following command will show it in brackets:

 lspci -nn | grep net
TomKong

2013-10-24 19:35

reporter   ~0018242

[root@vmserver1 ~]# lspci -nn | grep net
02:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
02:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
82:00.0 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
82:00.1 Ethernet controller [0200]: Intel Corporation I350 Gigabit Network Connection [8086:1521] (rev 01)
toracat

2013-10-24 19:58

manager   ~0018243

Thanks. Just wanted to make sure it is listed correctly in the device ID page at:

http://elrepo.org/tiki/DeviceIDs

and indeed 8086:1521 is there in the igb.ko section.
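For comparing against the device ID page, the vendor:device pairs can be pulled out of the `lspci -nn` output with a small filter. This is a sketch only; `extract_ids` is a hypothetical helper name, and it assumes a `grep` that supports `-o` (GNU grep does).

```shell
# extract_ids: read `lspci -nn` output on stdin and print each distinct
# vendor:device pair once. The class code (e.g. [0200]) contains no colon,
# so the pattern does not match it.
extract_ids() {
    grep -Eo '\[[0-9a-f]{4}:[0-9a-f]{4}]' | sed -e 's/^\[//' -e 's/]$//' | sort -u
}
```

With the four I350 lines posted above, `lspci -nn | grep net | extract_ids` would print the single pair `8086:1521`.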
toracat

2013-10-24 20:02

manager   ~0018244

Strictly speaking, the issue should not be regarded as fixed until the updated driver appears in the distro kernel. However, I'm going to mark this ticket as 'resolved' for now. If/when the kernel gets fixed in the future, a note can be added here.

Issue History

Date Modified Username Field Change
2013-10-01 14:17 TomKong New Issue
2013-10-23 16:57 TomKong Note Added: 0018220
2013-10-23 17:38 toracat Note Added: 0018223
2013-10-24 15:47 TomKong Note Added: 0018239
2013-10-24 16:07 toracat Note Added: 0018240
2013-10-24 19:35 TomKong Note Added: 0018242
2013-10-24 19:58 toracat Note Added: 0018243
2013-10-24 20:02 toracat Note Added: 0018244
2013-10-24 20:02 toracat Status new => resolved
2013-10-24 20:02 toracat Resolution open => fixed