View Issue Details

ID: 0013228
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2021-07-06 02:47
Reporter: lnykww
Assigned To:
Priority: immediate
Severity: block
Reproducibility: always
Status: new
Resolution: open
Product Version: 7.3.1611
Summary: 0013228: running a nested VM with kernel 3.10.0-514.16.1.el7.x86_64 causes a soft lockup
Description: kernel version: 3.10.0-514.16.1.el7.x86_64

I was running a nested VM on this kernel while also running perf top on the hypervisor. After a little while, the hypervisor hit a soft lockup.
----------------------------------- dmesg -----------------------------------------
 kernel: [ 115.395738] kvm: zapping shadow pages for mmio generation wraparound
 kernel: [ 115.396388] kvm: zapping shadow pages for mmio generation wraparound
 kernel: [ 120.692540] kvm [6584]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xabcd
 kernel: [ 168.538824] perf: interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
 kernel: [ 168.547203] perf: interrupt took too long (3140 > 3128), lowering kernel.perf_event_max_sample_rate to 63000
 kernel: [ 168.561457] perf: interrupt took too long (3927 > 3925), lowering kernel.perf_event_max_sample_rate to 50000
 kernel: [ 168.601371] perf: interrupt took too long (4909 > 4908), lowering kernel.perf_event_max_sample_rate to 40000
 kernel: [ 268.914387] ------------[ cut here ]------------
 kernel: [ 268.914400] WARNING: at arch/x86/kvm/vmx.c:8095 vmx_handle_exit+0xaa6/0xbe0 [kvm_intel]()
 kernel: [ 268.914401] vmx: unexpected exit reason 0x3
 kernel: [ 268.914402] Modules linked in: kvm_intel kvm irqbypass dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag iptable_mangle nbd(OE) vhost_net vhost macvtap macvlan ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio loop dm_mod tun ipmi_si ipmi_devintf ipmi_msghandler openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack bonding ext4 mbcache jbd2 intel_powerclamp coretemp crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ses enclosure iTCO_wdt iTCO_vendor_support sg pcspkr wmi i2c_i801 ioatdma shpchp i2c_core sb_edac edac_core lpc_ich mei_me mei ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_pclmul crct10dif_common crc32c_intel isci serio_raw libsas ahci ixgbe scsi_transport_sas libahci libata mdio ptp megaraid_sas pps_core dca fjes [last unloaded: irqbypass]
 kernel: [ 268.914455] CPU: 1 PID: 6608 Comm: qemu-system-x86 Tainted: G OE ------------ 3.10.0-514.16.1.el7.x86_64 #1
 kernel: [ 268.914456] Hardware name: Huawei Technologies Co., Ltd. RH2288H V2-12L/BC11SRSG1, BIOS RMIBV512 08/27/2015
 kernel: [ 268.914457] ffff880fccf07c68 000000007695eff0 ffff880fccf07c20 ffffffff81686ac3
 kernel: [ 268.914459] ffff880fccf07c58 ffffffff81085cb0 ffff882005368000 0000000000000003
 kernel: [ 268.914460] 0000000000000000 0000000000000000 0000000000000001 ffff880fccf07cc0
 kernel: [ 268.914462] Call Trace:
 kernel: [ 268.914467] [<ffffffff81686ac3>] dump_stack+0x19/0x1b
 kernel: [ 268.914470] [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0
 kernel: [ 268.914472] [<ffffffff81085d4c>] warn_slowpath_fmt+0x5c/0x80
 kernel: [ 268.914475] [<ffffffffa06949ef>] ? handle_pause+0x2f/0xe0 [kvm_intel]
 kernel: [ 268.914477] [<ffffffffa068f390>] ? vmx_invpcid_supported+0x20/0x20 [kvm_intel]
 kernel: [ 268.914480] [<ffffffffa069c926>] vmx_handle_exit+0xaa6/0xbe0 [kvm_intel]
 kernel: [ 268.914482] [<ffffffffa0699ce4>] ? vmx_vcpu_run+0x5c4/0x760 [kvm_intel]
 kernel: [ 268.914490] [<ffffffff81697d70>] ? uv_bau_message_intr1+0x80/0x80
 kernel: [ 268.914507] [<ffffffffa055c95b>] vcpu_enter_guest+0x3bb/0x1100 [kvm]
 kernel: [ 268.914516] [<ffffffffa0564a3d>] kvm_arch_vcpu_ioctl_run+0xcd/0x450 [kvm]
 kernel: [ 268.914522] [<ffffffffa0549a31>] kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
 kernel: [ 268.914525] [<ffffffff81212555>] do_vfs_ioctl+0x2d5/0x4b0
 kernel: [ 268.914527] [<ffffffff81692661>] ? __do_page_fault+0x171/0x450
 kernel: [ 268.914534] [<ffffffffa0553154>] ? kvm_on_user_return+0x74/0x80 [kvm]
 kernel: [ 268.914536] [<ffffffff812127d1>] SyS_ioctl+0xa1/0xc0
 kernel: [ 268.914538] [<ffffffff81697189>] system_call_fastpath+0x16/0x1b
 kernel: [ 268.914539] ---[ end trace 396bc961d74cd860 ]---
 kernel: [ 320.255594] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [perf:6822]
 kernel: [ 320.255682] Modules linked in: kvm_intel kvm irqbypass dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag iptable_mangle nbd(OE) vhost_net vhost macvtap macvlan ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat xt_addrtype iptable_filter xt_conntrack br_netfilter bridge stp llc dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio loop dm_mod tun ipmi_si ipmi_devintf ipmi_msghandler openvswitch nf_conntrack_ipv6 nf_nat_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_defrag_ipv6 nf_nat nf_conntrack bonding ext4 mbcache jbd2 intel_powerclamp coretemp crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ses enclosure iTCO_wdt iTCO_vendor_support sg pcspkr wmi i2c_i801 ioatdma shpchp i2c_core sb_edac edac_core lpc_ich mei_me mei ip_tables xfs libcrc32c sd_mod crc_t10dif crct10dif_pclmul crct10dif_common crc32c_intel isci serio_raw libsas ahci ixgbe scsi_transport_sas libahci libata mdio ptp megaraid_sas pps_core dca fjes [last unloaded: irqbypass]
 kernel: [ 320.255731] CPU: 7 PID: 6822 Comm: perf Tainted: G W OE ------------ 3.10.0-514.16.1.el7.x86_64 #1
 kernel: [ 320.255732] Hardware name: Huawei Technologies Co., Ltd. RH2288H V2-12L/BC11SRSG1, BIOS RMIBV512 08/27/2015
 kernel: [ 320.255734] task: ffff882004b96dd0 ti: ffff882002338000 task.ti: ffff882002338000
 kernel: [ 320.255735] RIP: 0010:[<ffffffff810f989e>] [<ffffffff810f989e>] generic_exec_single+0xfe/0x1a0
 kernel: [ 320.255741] RSP: 0018:ffff88200233bd30 EFLAGS: 00000202
 kernel: [ 320.255742] RAX: 00000000000000f0 RBX: 0000000000000282 RCX: 0000000000000010
 kernel: [ 320.255743] RDX: 0000ffffffffffff RSI: 0000000000000030 RDI: 0000000000000282
 kernel: [ 320.255744] RBP: ffff88200233bd80 R08: ffff88016906ec08 R09: 0000000000000004
 kernel: [ 320.255745] R10: ffff88017fa8fa90 R11: ffff882007635b10 R12: ffff88200233bd00
 kernel: [ 320.255746] R13: ffff88016906ec00 R14: 0000000000020040 R15: ffffffff813180c5
 kernel: [ 320.255747] FS: 00007fd28db12780(0000) GS:ffff88103e1c0000(0000) knlGS:0000000000000000
 kernel: [ 320.255748] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 kernel: [ 320.255749] CR2: 000000000041346d CR3: 000000202d6a9000 CR4: 00000000001427e0
 kernel: [ 320.255750] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 kernel: [ 320.255751] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 kernel: [ 320.255752] Stack:
 kernel: [ 320.255752] 0000000000000000 0000000000000000 ffffffff8116ca60 ffff88200233bdc0
 kernel: [ 320.255754] 0000000000000003 00000000c8f235d6 000000010001d580 000000000000000b
 kernel: [ 320.255756] ffffffff8116ca60 ffff882039740000 ffff88200233bdb0 ffffffff810f999f
 kernel: [ 320.255757] Call Trace:
 kernel: [ 320.255761] [<ffffffff8116ca60>] ? perf_cgroup_attach+0x60/0x60
 kernel: [ 320.255763] [<ffffffff8116ca60>] ? perf_cgroup_attach+0x60/0x60
 kernel: [ 320.255765] [<ffffffff810f999f>] smp_call_function_single+0x5f/0xa0
 kernel: [ 320.255766] [<ffffffff8116c293>] cpu_function_call+0x43/0x60
 kernel: [ 320.255768] [<ffffffff8116bf30>] ? perf_unpin_context+0x30/0x30
 kernel: [ 320.255770] [<ffffffff8116fc2a>] event_function_call+0x14a/0x160
 kernel: [ 320.255774] [<ffffffff811dcc8b>] ? kmem_cache_free+0x1bb/0x1f0
 kernel: [ 320.255776] [<ffffffff81170c90>] ? __perf_event_disable+0xe0/0xe0
 kernel: [ 320.255778] [<ffffffff811734c3>] perf_event_release_kernel+0xd3/0x2b0
 kernel: [ 320.255780] [<ffffffff811736b0>] perf_release+0x10/0x20
 kernel: [ 320.255783] [<ffffffff81200589>] __fput+0xe9/0x260
 kernel: [ 320.255785] [<ffffffff8120083e>] ____fput+0xe/0x10
 kernel: [ 320.255788] [<ffffffff810ad1e7>] task_work_run+0xa7/0xe0
 kernel: [ 320.255792] [<ffffffff8102ab22>] do_notify_resume+0x92/0xb0
 kernel: [ 320.255795] [<ffffffff8169743d>] int_signal+0x12/0x17
 kernel: [ 320.255796] Code: 48 89 de 48 03 14 c5 80 58 ad 81 48 89 df e8 ca 29 23 00 84 c0 75 46 45 85 ed 74 11 f6 43 20 01 74 0b 0f 1f 00 8 8b 7c 24 28 65 48 33 3c 25 28 00 00 00 0f 85 80
 kernel: [ 320.255817] sending NMI to other CPUs:
 kernel: [ 328.913957] INFO: rcu_sched detected stalls on CPUs/tasks: { 11} (detected by 15, t=60002 jiffies, g=13018, c=13017, q=17552)
 kernel: [ 328.914378] Task dump for CPU 11:
 kernel: [ 328.914380] qemu-system-x86 R running task 0 6476 1 0x0000088a
 kernel: [ 328.914382] 0000000000000801 ffff882005b50000 000000001be75000 ffff882005b50000
 kernel: [ 328.914384] 0000000000000004 ffff88103de07d50 0000000044783776 ffff882005b50000
 kernel: [ 328.914386] 0000000000000000 0000000000000000 ffffffffa0699660 0000000000000001
 kernel: [ 328.914388] Call Trace:
 kernel: [ 328.914394] [<ffffffffa0699660>] ? vmx_inject_irq+0xf0/0xf0 [kvm_intel]
 kernel: [ 328.914408] [<ffffffffa055c81b>] ? vcpu_enter_guest+0x27b/0x1100 [kvm]
 kernel: [ 328.914417] [<ffffffffa0581c13>] ? kvm_apic_update_irr+0x23/0x30 [kvm]
 kernel: [ 328.914420] [<ffffffffa0690bb9>] ? vmx_sync_pir_to_irr+0x29/0x30 [kvm_intel]
 kernel: [ 328.914434] [<ffffffffa05843c0>] ? kvm_apic_has_interrupt+0x40/0xe0 [kvm]
 kernel: [ 328.914442] [<ffffffffa0564a3d>] ? kvm_arch_vcpu_ioctl_run+0xcd/0x450 [kvm]
 kernel: [ 328.914448] [<ffffffffa0549a31>] ? kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
 kernel: [ 328.914453] [<ffffffff811b9015>] ? change_protection+0x65/0xa0
 kernel: [ 328.914456] [<ffffffff81212555>] ? do_vfs_ioctl+0x2d5/0x4b0
 kernel: [ 328.914459] [<ffffffff81692661>] ? __do_page_fault+0x171/0x450
 kernel: [ 328.914466] [<ffffffffa0553154>] ? kvm_on_user_return+0x74/0x80 [kvm]
 kernel: [ 328.914468] [<ffffffff812127d1>] ? SyS_ioctl+0xa1/0xc0
 kernel: [ 328.914470] [<ffffffff81697189>] ? system_call_fastpath+0x16/0x1b
 kernel: [ 330.256678] NMI backtrace for cpu 0
 kernel: [ 330.256680] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W OE ------------ 3.10.0-514.16.1.el7.x86_64 #1
 
Tags: No tags attached.

Activities

lnykww

2017-05-09 06:13

reporter   ~0029247

It seems that one of the CPUs did not receive any interrupts, not even the NMI. I also tested with linux-4.9.25, which has the same problem; linux-4.10.1 does not.
lnykww

2017-05-09 06:53

reporter   ~0029249

I added a timeout-detection patch to the csd_lock_wait function. According to the log, CPU#12 did not handle the IPI. I then called trigger_all_cpu_backtrace (which sends an NMI to all CPUs to dump their stacks), and no backtrace came back from CPU 12. There is, however, an RCU stall backtrace for CPU 12. The dmesg output:
-------------------------- dmesg ----------------------
May 9 14:45:12 kernel: [ 777.476680] csd: Detected non-responsive CSD lock (#5) on CPU#09, waiting 10.014 secs for CPU#12
May 9 14:45:12 kernel: [ 777.476681] csd: Re-sending CSD lock (#5) IPI from CPU#09 to CPU#12
May 9 14:45:12 kernel: [ 777.476683] sending NMI to all CPUs:
May 9 14:45:16 kernel: [ 781.340908] INFO: rcu_sched detected stalls on CPUs/tasks: { 12} (detected by 27, t=240007 jiffies, g=20597, c=20596, q=82532)
May 9 14:45:16 kernel: [ 781.341328] Task dump for CPU 12:
May 9 14:45:16 kernel: [ 781.341330] qemu-system-x86 R running task 0 13599 1 0x0000088a
May 9 14:45:16 kernel: [ 781.341332] 0000000000000801 ffff8810398c0000 00000000a057d3c0 ffff8810398c0000
May 9 14:45:16 kernel: [ 781.341334] ffff8820282f3d28 ffff8820282f3d30 000000006a96e71d ffff8810398c0000
May 9 14:45:16 kernel: [ 781.341335] 0000000000000000 0000000000000000 0000000000000000 0000000000000001
May 9 14:45:16 kernel: [ 781.341337] Call Trace:
May 9 14:45:16 kernel: [ 781.341353] [<ffffffffa055581b>] ? vcpu_enter_guest+0x27b/0x1100 [kvm]
May 9 14:45:16 kernel: [ 781.341364] [<ffffffffa057d265>] ? kvm_apic_local_deliver+0x65/0x70 [kvm]
May 9 14:45:16 kernel: [ 781.341373] [<ffffffffa055da3d>] ? kvm_arch_vcpu_ioctl_run+0xcd/0x450 [kvm]
May 9 14:45:16 kernel: [ 781.341379] [<ffffffffa0542a31>] ? kvm_vcpu_ioctl+0x2b1/0x640 [kvm]
May 9 14:45:16 kernel: [ 781.341382] [<ffffffff811b8d55>] ? change_protection+0x65/0xa0
May 9 14:45:16 kernel: [ 781.341385] [<ffffffff81212285>] ? do_vfs_ioctl+0x2d5/0x4b0
May 9 14:45:16 kernel: [ 781.341387] [<ffffffff81692131>] ? __do_page_fault+0x171/0x450
May 9 14:45:16 kernel: [ 781.341394] [<ffffffffa054c154>] ? kvm_on_user_return+0x74/0x80 [kvm]
May 9 14:45:16 kernel: [ 781.341396] [<ffffffff81212501>] ? SyS_ioctl+0xa1/0xc0
May 9 14:45:16 kernel: [ 781.341399] [<ffffffff81696c49>] ? system_call_fastpath+0x16/0x1b
May 9 14:45:18 kernel: [ 783.039833] csd: Detected non-responsive CSD lock (#6) on CPU#19, waiting 10.000 secs for CPU#12
May 9 14:45:18 kernel: [ 783.039835] csd: Re-sending CSD lock (#6) IPI from CPU#19 to CPU#12
-----------------------------------------------------------------------------
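For readers who want to experiment with the same idea, the timeout detection described in this note can be approximated in user space. The following is only a sketch under stated assumptions, not the patch used above: demo_csd, demo_csd_lock_wait, the give-up limit max_reports, and the message text are hypothetical names invented for illustration, and the real csd_lock_wait spins inside the kernel rather than polling CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for the per-call "locked" flag that the
 * target CPU clears once it has run the requested function. */
typedef struct {
    atomic_int locked;
} demo_csd;

/* Monotonic clock in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Spin until csd->locked clears. Each time timeout_ns passes without
 * the flag clearing, print a report. Unlike a real kernel wait loop,
 * this sketch gives up after max_reports so that it always terminates.
 * Returns the number of reports printed (0 if the flag cleared in time).
 */
int demo_csd_lock_wait(demo_csd *csd, int src_cpu, int dst_cpu,
                       uint64_t timeout_ns, int max_reports)
{
    assert(csd != NULL);

    int reports = 0;
    uint64_t start = now_ns();
    uint64_t deadline = start + timeout_ns;

    while (atomic_load_explicit(&csd->locked, memory_order_acquire)) {
        uint64_t t = now_ns();
        if (t < deadline)
            continue;               /* still within the current window */

        reports++;
        fprintf(stderr,
                "demo csd: non-responsive lock (#%d) on CPU#%02d, "
                "waiting %.3f secs for CPU#%02d\n",
                reports, src_cpu, (double)(t - start) / 1e9, dst_cpu);

        if (reports >= max_reports)
            break;                  /* sketch only: stop instead of hanging */
        deadline = t + timeout_ns;  /* arm the next reporting window */
    }
    return reports;
}
```

Calling demo_csd_lock_wait(&csd, 9, 12, 10000000000ull, 3) on a flag that never clears would print up to three reports about ten seconds apart. A kernel version would more likely re-send the IPI or trigger a backtrace at the point where this sketch only prints.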
fuqiang.wang

2021-07-05 10:03

reporter   ~0038522

Hi lnykww,
I also ran into this issue recently, on a 3.10.x kernel. We are now investigating its cause. You seem to be able to reproduce it; could you please share the steps to reproduce it?
Thanks :)
tru

2021-07-05 10:19

administrator   ~0038523

Last edited: 2021-07-05 10:21

3.10.0-514.16.1.el7.x86_64 is way too old; the current version is kernel-3.10.0-1160.31.1.el7.x86_64.
Please upgrade and re-open on an up-to-date CentOS-7 version.
fuqiang.wang

2021-07-05 10:50

reporter   ~0038524

Hi tru,
My kernel version is 3.10.0-693, which also seems too old. However, we want to find the cause of this problem, not just make it go away, and we can't reproduce it, so I would like to ask lnykww how to reproduce it.
TrevorH

2021-07-05 11:01

manager   ~0038525

Start by updating to 7.9.2009 with its 3.10.0-1160.31.1.el7 kernel. Nothing will ever get fixed in that ancient kernel, so you will have to recreate the problem on the current one anyway; you might as well start from there.

Only the _current_ version of CentOS gets any support at all. Update to the current version first, then fix any remaining problems.
ManuelWolfshant

2021-07-05 11:01

manager   ~0038526

@fuqiang.wang : you are on your own. CentOS does not provide support or help for anything but the most current version / release.
fuqiang.wang

2021-07-06 02:47

reporter   ~0038527

Hi TrevorH, ManuelWolfshant, thank you very much for your replies :)
I now understand how CentOS maintenance works.
I'm a kernel enthusiast and want to investigate this issue. I would still like to ask lnykww for the details, in order to reproduce the issue on my own machine.
Thank you again for your patience.

Issue History

Date Modified Username Field Change
2017-05-09 06:08 lnykww New Issue
2017-05-09 06:13 lnykww Note Added: 0029247
2017-05-09 06:53 lnykww Note Added: 0029249
2021-07-05 10:03 fuqiang.wang Note Added: 0038522
2021-07-05 10:19 tru Note Added: 0038523
2021-07-05 10:21 tru Note Edited: 0038523
2021-07-05 10:50 fuqiang.wang Note Added: 0038524
2021-07-05 11:01 TrevorH Note Added: 0038525
2021-07-05 11:01 ManuelWolfshant Note Added: 0038526
2021-07-06 02:47 fuqiang.wang Note Added: 0038527