View Issue Details

ID: 0015318
Project: administration
Category: mailing lists
View Status: public
Last Update: 2018-09-26 01:33
Reporter: fzuwwl89
Priority: immediate
Severity: block
Reproducibility: sometimes
Status: new
Resolution: open
Platform: CentOS Linux release 7.4.1708 (C
OS: Linux
OS Version: 3.10.0-693.el7.x
Product Version:
Target Version:
Fixed in Version:
Summary: 0015318: distribute_cfs_runtime kernel hard lockup
Description: The kernel keeps crashing and leaves the stack dump shown below. It happens in our Hadoop cluster with cgroup CPU resource isolation turned on, and the frequency ranges from 1-2 crashes per day to more than 10 crashes per day. The situation looks similar to the bug fixed by the patch at https://lore.kernel.org/patchwork/patch/479983/ , but that bug is supposed to be fixed already in kernel versions 3.10.0-693 and 3.10.0-862.

[5561883.978551] NMI watchdog: Watchdog detected hard LOCKUP on cpu 15
[5561883.978593] Modules linked in:
[5561883.978598] dm_mod sctp_diag sctp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag iptable_filter binfmt_misc bonding skx_edac edac_core intel_powerclamp coretemp intel_rapl iTCO_wdt iosf_mbi kvm_intel iTCO_vendor_support kvm dcdbas irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd sg ipmi_ssif pcspkr ipmi_si shpchp ipmi_devintf mei_me ipmi_msghandler mei lpc_ich nfit i2c_i801 libnvdimm acpi_power_meter acpi_pad ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_generic mgag200 i2c_algo_bit drm_kms_helper crct10dif_pclmul syscopyarea crct10dif_common sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm i40e drm ahci libahci tg3 libata megaraid_sas ptp i2c_core pps_core
[5561883.978652] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 3.10.0-693.el7.x86_64 #1
[5561883.978654] Hardware name: Dell Inc. PowerEdge R540/0NJK2F, BIOS 1.3.7 02/09/2018
[5561883.978657] task: ffff88289bb1bf40 ti: ffff88289bb90000 task.ti: ffff88289bb90000
[5561883.978659] RIP: 0010:[<ffffffff810bf107>] [<ffffffff810bf107>] finish_task_switch+0x57/0x160
[5561883.978671] RSP: 0000:ffff88289bb93e48 EFLAGS: 00000286
[5561883.978672] RAX: ffff882899e0dee0 RBX: ffffffff810b4155 RCX: 0000000000000000
[5561883.978673] RDX: ffff88289bb91fd8 RSI: ffff88289bb1bf40 RDI: ffff883fdbfd6cc0
[5561883.978675] RBP: ffff88289bb93e68 R08: ffff88289bb90000 R09: 0000000000000002
[5561883.978676] R10: 000000000000000f R11: 0000000000000000 R12: ffff883fdbfcfe80
[5561883.978677] R13: ffff883fdbfcf960 R14: ffffffff8132bfe0 R15: ffff88289bb93db8
[5561883.978679] FS: 0000000000000000(0000) GS:ffff883fdbfc0000(0000) knlGS:0000000000000000
[5561883.978680] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[5561883.978682] CR2: 00007f3fe8b4aff8 CR3: 00000004981d8000 CR4: 00000000003407e0
[5561883.978683] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[5561883.978684] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[5561883.978685] Stack:
[5561883.978686] 0000000000000000 ffff883fdbfd6cc0 ffff88026b7ce400 0000000000000000
[5561883.978689] ffff88289bb93ec8 ffffffff816a8f8d ffff88289bb1bf40 ffff88289bb93fd8
[5561883.978691] ffff88289bb93fd8 ffff88289bb93fd8 ffff88289bb1bf40 ffffffff81b1c820
[5561883.978694] Call Trace:
[5561883.978700] [<ffffffff816a8f8d>] __schedule+0x39d/0x8b0
[5561883.978703] [<ffffffff816aa3e9>] schedule_preempt_disabled+0x29/0x70
[5561883.978710] [<ffffffff810e7c0a>] cpu_startup_entry+0x18a/0x1c0
[5561883.978715] [<ffffffff81051af6>] start_secondary+0x1b6/0x230
[5561883.978717] Code: 1f 44 00 00 65 48 8b 34 25 00 ce 00 00 0f 1f 44 00 00 41 c7 45 28 00 00 00 00 48 89 df c6 07 00 0f 1f 40 00 fb 66 0f 1f 44 00 00 <65> 48 8b 04 25 00 ce 00 00 48 8b 98 78 01 00 00 48 85 db 74 1c
[5561883.978741] Kernel panic - not syncing: Hard LOCKUP
[5561883.978767] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 3.10.0-693.el7.x86_64 #1
[5561883.978802] Hardware name: Dell Inc. PowerEdge R540/0NJK2F, BIOS 1.3.7 02/09/2018
[5561883.978837] ffff88289bb93d00 d5511a25db39e950 ffff883fdbfc5b18 ffffffff816a3d91
[5561883.978879] ffff883fdbfc5b98 ffffffff8169dc54 0000000000000010 ffff883fdbfc5ba8
[5561883.978923] ffff883fdbfc5b48 d5511a25db39e950 0000000000000000 ffffffff8190ac0f
[5561883.978966] Call Trace:
[5561883.978980] <NMI> [<ffffffff816a3d91>] dump_stack+0x19/0x1b
[5561883.979019] [<ffffffff8169dc54>] panic+0xe8/0x20d
[5561883.979046] [<ffffffff8108771f>] nmi_panic+0x3f/0x40
[5561883.979073] [<ffffffff8112fa75>] watchdog_overflow_callback+0xf5/0x100
[5561883.979108] [<ffffffff8116e561>] __perf_event_overflow+0x51/0xf0
[5561883.979139] [<ffffffff811770b4>] perf_event_overflow+0x14/0x20
[5561883.979170] [<ffffffff81009f78>] intel_pmu_handle_irq+0x218/0x4f0
[5561883.979204] [<ffffffff81324abc>] ? ioremap_page_range+0x26c/0x3d0
[5561883.979236] [<ffffffff811c0a04>] ? vunmap_page_range+0x1b4/0x300
[5561883.979266] [<ffffffff811c0b61>] ? unmap_kernel_range_noflush+0x11/0x20
[5561883.979300] [<ffffffff813da15e>] ? ghes_copy_tofrom_phys+0x10e/0x210
[5561883.979332] [<ffffffff813da300>] ? ghes_read_estatus+0xa0/0x190
[5561883.979363] [<ffffffff816ac06b>] perf_event_nmi_handler+0x2b/0x50
[5561883.979394] [<ffffffff816ad427>] nmi_handle.isra.0+0x87/0x160
[5561883.979424] [<ffffffff816ad710>] do_nmi+0x210/0x450
[5561883.979451] [<ffffffff810c89b0>] ? task_scan_max+0x40/0x40
[5561883.979480] [<ffffffff816ac8d3>] end_repeat_nmi+0x1e/0x2e
[5561883.979508] [<ffffffff810c89b0>] ? task_scan_max+0x40/0x40
[5561883.979536] [<ffffffff810c89ce>] ? tg_unthrottle_up+0x1e/0x50
[5561883.979566] [<ffffffff810c89ce>] ? tg_unthrottle_up+0x1e/0x50
[5561883.979595] [<ffffffff810c89ce>] ? tg_unthrottle_up+0x1e/0x50
[5561883.979624] <<EOE>> <IRQ> [<ffffffff810c0bcb>] walk_tg_tree_from+0x7b/0x110
[5561883.979666] [<ffffffff810ba190>] ? __smp_mb__after_atomic+0x10/0x10
[5561883.979698] [<ffffffff810d0977>] unthrottle_cfs_rq+0xb7/0x170
[5561883.979726] [<ffffffff810d0bfa>] distribute_cfs_runtime+0x10a/0x130
[5561883.979759] [<ffffffff810d0da7>] sched_cfs_period_timer+0xb7/0x150
[5561883.979790] [<ffffffff810d0cf0>] ? sched_cfs_slack_timer+0xd0/0xd0
[5561883.979822] [<ffffffff810b4ae4>] __hrtimer_run_queues+0xd4/0x260
[5561883.979853] [<ffffffff810b507f>] hrtimer_interrupt+0xaf/0x1d0
[5561883.979883] [<ffffffff81053895>] local_apic_timer_interrupt+0x35/0x60
[5561883.979917] [<ffffffff816b76bd>] smp_apic_timer_interrupt+0x3d/0x50
[5561883.979949] [<ffffffff816b5c1d>] apic_timer_interrupt+0x6d/0x80
[5561883.979977] <EOI> [<ffffffff810b4155>] ? enqueue_hrtimer+0x25/0x80
[5561883.980013] [<ffffffff810bf107>] ? finish_task_switch+0x57/0x160
[5561883.980044] [<ffffffff816a8f8d>] __schedule+0x39d/0x8b0
[5561883.980071] [<ffffffff816aa3e9>] schedule_preempt_disabled+0x29/0x70
[5561883.981014] [<ffffffff810e7c0a>] cpu_startup_entry+0x18a/0x1c0
[5561883.981889] [<ffffffff81051af6>] start_secondary+0x1b6/0x230
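
The interrupt path in the trace above runs through sched_cfs_period_timer, distribute_cfs_runtime and unthrottle_cfs_rq, i.e. the CFS bandwidth period timer that unthrottles cgroups. To see how heavily the cgroups are being throttled ahead of a crash, the throttling counters could be sampled with something like the sketch below. It is only a sketch under assumptions that are not stated in this report: the cgroup v1 cpu controller is assumed to be mounted at /sys/fs/cgroup/cpu, and each cpu.stat file is assumed to expose the nr_periods, nr_throttled and throttled_time counters. The helper names are made up for illustration.

```python
from pathlib import Path


def parse_cpu_stat(text):
    """Parse 'key value' lines from a cgroup v1 cpu.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or malformed lines instead of failing on them.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def throttled_ratio(stats):
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0


def scan(root):
    """Collect cpu.stat counters for every cgroup directory below root."""
    results = {}
    for stat_file in Path(root).rglob("cpu.stat"):
        try:
            results[str(stat_file.parent)] = parse_cpu_stat(stat_file.read_text())
        except OSError:
            # A cgroup can disappear between listing and reading; ignore it.
            continue
    return results


if __name__ == "__main__":
    # Assumed mount point of the cgroup v1 cpu controller; adjust as needed.
    root = "/sys/fs/cgroup/cpu"
    ranked = sorted(scan(root).items(),
                    key=lambda item: throttled_ratio(item[1]),
                    reverse=True)
    for path, stats in ranked[:20]:
        print(f"{throttled_ratio(stats):6.1%}  "
              f"throttled={stats.get('nr_throttled', 0)}  "
              f"periods={stats.get('nr_periods', 0)}  {path}")
```

Running this periodically (for example from cron) and keeping the output would show whether the crashes coincide with a large number of throttled cgroups, which is the amount of work the period timer has to do in interrupt context.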
Steps To Reproduce: Cannot be reproduced on demand, but the frequency ranges from 1-2 crashes per day to more than 10 crashes per day.
Tags: 3.10.0-693.el7.x86_64, 3.10.0-862, centos7

Activities

There are no notes attached to this issue.

Issue History

Date Modified Username Field Change
2018-09-25 11:56 fzuwwl89 New Issue
2018-09-25 11:56 fzuwwl89 Tag Attached: 3.10.0-693.el7.x86_64
2018-09-25 11:56 fzuwwl89 Tag Attached: 3.10.0-862
2018-09-25 11:56 fzuwwl89 Tag Attached: centos7