2017-11-17 21:06 UTC

ID: 0013776
Project: CentOS-7
Category: -OTHER
View Status: public
Last Update: 2017-11-03 19:21
Reporter: jittu_mohan
Priority: high
Severity: crash
Reproducibility: always
Status: closed
Resolution: fixed
Platform: CentOS Linux 7
OS: CentOS Linux release 7.2.1511
OS Version: 3.10.0-327.28.3
Product Version: 7.2.1511
Target Version:
Fixed in Version:
Summary: 0013776: system panicked and rebooted very often on CentOS 7.2.1511
Description: We are running big data software and see this issue very often. The system reboots
and leaves a system core file.

[root@noderhel73 vmcore-127.0.0.1-2017-08-17-22:52:53]# crash --osrelease ./vmcore
3.10.0-327.28.3.el7.x86_64

>>> Error message from the kernel log, obtained by replaying the system core file with crash <<<
[221905.562244] NETDEV WATCHDOG: eno1 (ixgbe): transmit queue 11 timed out
Steps To Reproduce: Random in nature.
Additional Information:
crash> sys
KERNEL: /selfhost/bugs/jmohan/comscore/vmlinux
DUMPFILE: ./vmcore [PARTIAL DUMP]
CPUS: 88
DATE: Tue Aug 15 22:39:56 2017
UPTIME: 2 days, 13:42:12
LOAD AVERAGE: 93.86, 53.47, 42.44
TASKS: 3392
NODENAME: csia1hdw585.office.comscore.com
RELEASE: 3.10.0-327.28.3.el7.x86_64
VERSION: #1 SMP Thu Aug 18 19:05:49 UTC 2016
MACHINE: x86_64 (2200 Mhz)
MEMORY: 255.9 GB
PANIC: "Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 68"

crash> log
[221897.410281] ------------[ cut here ]------------
[221905.562235] ------------[ cut here ]------------
[221905.562243] WARNING: at net/sched/sch_generic.c:297 dev_watchdog+0x270/0x280()
[221905.562244] NETDEV WATCHDOG: eno1 (ixgbe): transmit queue 11 timed out
[221905.562290] Modules linked in: cdc_ether usbnet mii 8021q garp stp mrp llc ipmi_si binfmt_misc vfat fat usb_storage mpt3sas mpt2sas raid_class scsi_transport_sas mptctl mptbase dell_rbu intel_powerclamp coretemp intel_rapl kvm_intel kvm crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg mxm_wmi iTCO_wdt pcspkr iTCO_vendor_support dcdbas ipmi_devintf sb_edac mei_me mei ipmi_msghandler lpc_ich edac_core mfd_core shpchp wmi acpi_power_meter ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 syscopyarea sysfillrect sysimgblt drm_kms_helper crct10dif_pclmul crct10dif_common ttm crc32c_intel drm ahci ixgbe(OE) igb(OE) mdio libahci ptp i2c_algo_bit pps_core libata i2c_core megaraid_sas dca dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ipmi_si]
[221905.562291]
[221905.562293] CPU: 2 PID: 175682 Comm: java Tainted: G OE ------------ 3.10.0-327.28.3.el7.x86_64 #1
[221905.562294] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[221905.562306] ffff881ffe643d88 00000000dc09304b ffff881ffe643d40 ffffffff81636453
[221905.562312] ffff881ffe643d78 ffffffff8107b200 000000000000000b ffff881fcb0c0000
[221905.562317] ffff881fcb0bcf40 0000000000000040 0000000000000002 ffff881ffe643de0
[221905.562318] Call Trace:
[221905.562327] [] dump_stack+0x19/0x1b
[221905.562332] [] warn_slowpath_common+0x70/0xb0
[221905.562335] [] warn_slowpath_fmt+0x5c/0x80
[221905.562340] [] dev_watchdog+0x270/0x280
[221905.562343] [] ? dev_graft_qdisc+0x80/0x80
[221905.562348] [] call_timer_fn+0x36/0x110
[221905.562351] [] ? dev_graft_qdisc+0x80/0x80
[221905.562354] [] run_timer_softirq+0x237/0x340
[221905.562359] [] ? leave_mm+0x70/0x70
[221905.562362] [] __do_softirq+0xef/0x280
[221905.562364] [] ? leave_mm+0x70/0x70
[221905.562368] [] call_softirq+0x1c/0x30
[221905.562373] [] do_softirq+0x65/0xa0
[221905.562376] [] irq_exit+0x115/0x120
[221905.562379] [] smp_apic_timer_interrupt+0x45/0x60
[221905.562384] [] apic_timer_interrupt+0x6d/0x80
[221905.562387] [] ? leave_mm+0x70/0x70
[221905.562392] [] ? free_cpumask_var+0x9/0x10
[221905.562398] [] ? smp_call_function_many+0x202/0x260
[221905.562401] [] native_flush_tlb_others+0xb8/0xc0
[221905.562404] [] flush_tlb_mm_range+0x66/0x140
[221905.562408] [] change_protection_range+0x720/0x810
[221905.562411] [] change_protection+0x65/0xa0
[221905.562416] [] change_prot_numa+0x1b/0x40
[221905.562421] [] task_numa_work+0x1f6/0x320
[221905.562426] [] task_work_run+0xa7/0xe0
[221905.562429] [] do_notify_resume+0x92/0xb0
[221905.562434] [] retint_signal+0x48/0x8c
[221905.562436] ---[ end trace ea0b35d951e02210 ]---
[221905.562439] ixgbe 0000:01:00.0 eno1: initiating reset due to tx timeout
[221905.742181] ixgbe 0000:01:00.0 eno1: Reset adapter
[221906.422952] ixgbe 0000:01:00.0 eno1: initiating reset to clear Tx work after link loss
[221906.423497] ixgbe 0000:01:00.0 eno1: Reset adapter
[221913.901444] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 68
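
The timestamps in the log above make it possible to measure how long the system ran between the transmit timeout and the panic. The following is a minimal Python sketch, not part of the original report; the helper names `parse_line` and `seconds_between` are hypothetical, and the sample lines are copied from the log above.

```python
import re

# Kernel log lines start with "[seconds.microseconds]" since boot.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")


def parse_line(line):
    """Return (seconds_since_boot, message), or None if there is no timestamp."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    return float(match.group(1)), match.group(2)


def seconds_between(lines, first_marker, second_marker):
    """Seconds between the first line containing each marker."""
    first = second = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        stamp, message = parsed
        if first is None and first_marker in message:
            first = stamp
        if second is None and second_marker in message:
            second = stamp
    if first is None or second is None:
        raise ValueError("marker not found in log")
    return second - first


# Sample lines copied from the crash log in this report.
log = [
    "[221905.562244] NETDEV WATCHDOG: eno1 (ixgbe): transmit queue 11 timed out",
    "[221905.562439] ixgbe 0000:01:00.0 eno1: initiating reset due to tx timeout",
    "[221913.901444] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 68",
]

delta = seconds_between(log, "NETDEV WATCHDOG", "Kernel panic")
print(f"{delta:.3f} s between tx timeout and panic")
```

For this log the gap is a little over eight seconds, so the adapter resets and the hard lockup are close together in time, though that alone does not establish which one caused the other.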

Look at this error message:
>>> [221905.562244] NETDEV WATCHDOG: eno1 (ixgbe): transmit queue 11 timed out.

It seems a tainted kernel is running, which may be causing the issue:

crash> mod -t
NAME TAINTS
igb OE
ixgbe OE
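
The same taint information can be read from the "Modules linked in:" line of the log, where tainting modules carry their flags in parentheses. Below is a minimal Python sketch under that assumption; the function name `tainted_modules` is hypothetical, and the sample line is an excerpt of the module list in this report. In the kernel's taint flags, `O` marks an out-of-tree module and `E` an unsigned module.

```python
import re

# A module entry such as "ixgbe(OE)": name followed by taint flags in parentheses.
MODULE_WITH_FLAGS = re.compile(r"([A-Za-z0-9_]+)\(([A-Z]+)\)")


def tainted_modules(modules_line):
    """Map module name to taint flags for entries that carry flags."""
    return {name: flags for name, flags in MODULE_WITH_FLAGS.findall(modules_line)}


# Excerpt of the "Modules linked in:" line from this report.
line = "Modules linked in: crc32c_intel drm ahci ixgbe(OE) igb(OE) mdio libahci ptp"

for name, flags in sorted(tainted_modules(line).items()):
    print(name, flags)
```

Here both NIC drivers, ixgbe and igb, are reported with `OE`, which matches the `mod -t` output above.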

We see some NUMA calls in the trace. Would disabling NUMA help resolve this, or is a fix needed for the NIC drivers? I can provide the vmcore and debug vmlinux, but I need an FTP server to upload them to.
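
To see which frames the NUMA question refers to, the call trace can be filtered by symbol name. This is a minimal Python sketch, not taken from the report; the names `FRAME` and `frames` are hypothetical, and the sample lines are copied from the trace above (the addresses inside `[]` are already missing in the pasted log). Frames prefixed with `?` are ones the kernel considers unreliable, so the sketch skips them.

```python
import re

# Matches "symbol+0xoffset/0xsize", optionally preceded by "? " for unreliable frames.
FRAME = re.compile(r"\]\s+(\?\s+)?([A-Za-z_][A-Za-z0-9_.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def frames(trace_lines):
    """Yield (symbol, reliable) for every call-trace line that parses."""
    for line in trace_lines:
        match = FRAME.search(line)
        if match:
            yield match.group(2), match.group(1) is None


# Sample lines copied from the call trace in this report.
trace = [
    "[221905.562343] [] ? dev_graft_qdisc+0x80/0x80",
    "[221905.562404] [] flush_tlb_mm_range+0x66/0x140",
    "[221905.562408] [] change_protection_range+0x720/0x810",
    "[221905.562411] [] change_protection+0x65/0xa0",
    "[221905.562416] [] change_prot_numa+0x1b/0x40",
    "[221905.562421] [] task_numa_work+0x1f6/0x320",
    "[221905.562426] [] task_work_run+0xa7/0xe0",
]

numa_frames = [symbol for symbol, reliable in frames(trace) if reliable and "numa" in symbol]
print(numa_frames)
```

Note that this trace was printed for the warning on CPU 2, while the panic reports the hard lockup on CPU 68, so the NUMA frames describe what CPU 2 was doing when the watchdog warning fired rather than the locked-up CPU itself.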
Tags: No tags attached.

Notes

~0030029

tigalch (manager)

Please run 'yum update' to get your installation up to date. You can also try and enable the CR-repo which will bring your installation to 7.4(1708) state.

~0030030

jittu_mohan (reporter)

Hello

Does any specific patch in the 7.4 upgrade resolve this issue? I would appreciate any information.

Thanks
Jitendra

~0030031

TrevorH (developer)

Your 3.10.0-327.28.3.el7.x86_64 dates from August 2016. The changelog for the current 7.4 kernel, kernel-3.10.0-693.1.1.el7.x86_64, has over 30,000 lines in it since then.

Your first step is to update, since your current kernel has major security vulnerabilities that have been fixed in later ones. You'll also get the added benefit of around 30k other bugs that have been fixed.
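
For anyone checking many hosts, the two release strings mentioned in this note can be compared numerically. The following Python sketch is only an illustration: `release_key` and `is_older` are hypothetical helpers, and the parsing is a simplification that assumes the `X.Y.Z-A.B.C.elN.arch` shape seen in this report. It is not RPM's own version comparison algorithm.

```python
def release_key(release):
    """Turn '3.10.0-327.28.3.el7.x86_64' into a tuple of ints for comparison."""
    if release.startswith("kernel-"):
        release = release[len("kernel-"):]
    version, _, rest = release.partition("-")
    # Keep only the numeric build part in front of the ".elN" distribution tag.
    build = rest.split(".el", 1)[0]
    parts = version.split(".") + build.split(".")
    return tuple(int(part) for part in parts)


def is_older(running, reference):
    """True if the running kernel release sorts before the reference release."""
    return release_key(running) < release_key(reference)


# Release strings taken from this report.
running = "3.10.0-327.28.3.el7.x86_64"
current = "kernel-3.10.0-693.1.1.el7.x86_64"

print(release_key(running))
print(is_older(running, current))
```

On a live host the running release would typically come from `uname -r`; here the strings are hard-coded so the sketch stays self-contained.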

~0030051

jittu_mohan (reporter)

Hello Folks

We have a huge number of clusters. It's tough to upgrade the kernel without knowing the root cause (RCA) of this issue. I would appreciate it if someone were willing to look at the system dump to determine the RCA.

-Jitendra

~0030068

tigalch (manager)

As your kernel is no longer supported (as stated by TrevorH), that is very unlikely to happen.
We only support the current versions, which are 6.9 and (since yesterday) 7.4(1708).

~0030522

jittu_mohan (reporter)

This issue was solved by upgrading to the 7.4 kernel (ixgbe driver update). No more panics.

~0030523

jittu_mohan (reporter)

Please close this case.

~0030524

TrevorH (developer)

Closed as per above.

Issue History
Date Modified     Username     Field                  Change
2017-09-07 17:56  jittu_mohan  New Issue
2017-09-08 20:42  tigalch      Note Added: 0030029
2017-09-08 21:17  jittu_mohan  Note Added: 0030030
2017-09-08 21:22  TrevorH      Note Added: 0030031
2017-09-13 20:31  jittu_mohan  Note Added: 0030051
2017-09-14 09:04  tigalch      Note Added: 0030068
2017-11-03 19:18  jittu_mohan  Note Added: 0030522
2017-11-03 19:20  jittu_mohan  Note Added: 0030523
2017-11-03 19:21  TrevorH      Status                 new => closed
2017-11-03 19:21  TrevorH      Resolution             open => fixed
2017-11-03 19:21  TrevorH      Note Added: 0030524