View Issue Details

ID: 0017722
Project: CentOS-7
Category: OTHER
View Status: public
Last Update: 2020-09-11 09:19
Reporter: centosuser42
Assigned To:
Priority: normal
Severity: minor
Reproducibility: random
Status: new
Resolution: open
Product Version: 7.8-2003
Summary: 0017722: Error message: "kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 40s! [snmpd:1047]"
Description:
Hi there,

# cat /etc/centos-release
CentOS Linux Release 7.8.2003 (Core)

Hypervisor information:

VMware hypervisor version:
VMware ESXi, 6.5.0, 14874964
CPU: Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz

System log: please see attachments.
Tags: No tags attached.

Relationships

duplicate of 0017720 (closed, Issue Tracker): Error message: "kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 40s! [snmpd:1047]"

Activities

centosuser42

2020-09-09 08:47

reporter  

jounral.tgz (505,236 bytes)
messages.txt (1,104,005 bytes)
centosuser42

2020-09-09 08:50

reporter   ~0037674

journal.txt: line 6

Linux version 3.10.0-1127.18.2.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Sun Jul 26 15:27:06 UTC 2020
TrevorH

2020-09-09 09:34

manager   ~0037675

This almost always ends up being a VMware problem. Check how loaded your host is: it looks like it's starving the guest of CPU time, so the guest never gets to do any work.

Did you actually read the 1.1 MB messages.txt file before you uploaded it? It's filled with snmpd diagnostic messages because you have its log level set too high, and there is nothing else of value in it at all. As for the gzipped journal attachment, I'm not even going to bother downloading that, as it's too big and too much effort.
centosuser42

2020-09-09 13:30

reporter   ~0037676

Update: I have attached a filtered journal log, with repeated messages such as snmpd diagnostics, login sessions, and postfix info removed.

Relevant error messages:

593:Aug 20 17:08:25 vipdepot.vip.intern kernel: NMI watchdog: disabled (cpu0): hardware events not enabled
594:Aug 20 17:08:25 vipdepot.vip.intern kernel: NMI watchdog: Shutting down hard lockup detector on all cpus
3301:Sep 07 23:30:51 vipdepot.vip.intern kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 40s! [snmpd:1047]

What happened before the error? I can see a series of UDP connection diagnostic messages spread over half an hour, but there is, for example, no burst of them directly before the soft lockup message.

I consider CPU starvation a plausible explanation, but I need to double-check that.
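
For anyone repeating this kind of log triage, here is a minimal sketch of pulling the soft lockup events out of a text export of the journal. It assumes the message layout seen in the excerpt above; the function name `find_soft_lockups` and the `sample` list are made up for illustration and are not part of any tool mentioned in this report.

```python
import re

# Matches the kernel soft-lockup message in the form quoted in this report,
# e.g. "BUG: soft lockup - CPU#0 stuck for 40s! [snmpd:1047]".
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(lines):
    """Return (line_number, cpu, seconds, command, pid) for each matching line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append((
                number,
                int(match.group("cpu")),
                int(match.group("secs")),
                match.group("comm"),
                int(match.group("pid")),
            ))
    return hits


# Two lines copied from the excerpt above (without the leading line numbers).
sample = [
    "Aug 20 17:08:25 vipdepot.vip.intern kernel: NMI watchdog: Shutting down hard lockup detector on all cpus",
    "Sep 07 23:30:51 vipdepot.vip.intern kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 40s! [snmpd:1047]",
]

print(find_soft_lockups(sample))  # → [(2, 0, 40, 'snmpd', 1047)]
```

To run it against a real export, the lines could be read with `open(path).read().splitlines()` and passed to the same function.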
centosuser42

2020-09-09 13:30

reporter   ~0037677

attachment
journal_filtered1.txt (348,967 bytes)
centosuser42

2020-09-09 15:09

reporter   ~0037678

Update: I've reviewed the monitors of both the VM and its host: no alarms were triggered.

vCenter host CPU usage is below 50%.
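
A side note on reading such a figure: an averaged usage value can stay low even when the window contains a short period of full saturation. The sketch below uses purely synthetic numbers (not measurements from this host) to show the arithmetic; whether anything like this happened here is not known from the report.

```python
def window_average(samples):
    """Arithmetic mean of the usage samples in one reporting window."""
    return sum(samples) / len(samples)


def longest_run_at_or_above(samples, threshold):
    """Length of the longest consecutive run of samples >= threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value >= threshold else 0
        longest = max(longest, current)
    return longest


# Synthetic example: 300 one-second samples of host CPU usage in percent.
# 260 seconds at 30% plus a single 40-second stretch at 100%.
samples = [30.0] * 130 + [100.0] * 40 + [30.0] * 130

average = window_average(samples)
burst = longest_run_at_or_above(samples, 100.0)

print(f"average usage: {average:.1f}%")
print(f"longest saturated stretch: {burst} s")
```

In this made-up window the average stays under 50% although the CPU is fully busy for 40 consecutive seconds, so a threshold on the average alone would not fire.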
ManuelWolfshant

2020-09-09 15:41

manager   ~0037679

Last edited: 2020-09-11 09:19

Some time ago I worked for someone who proudly showed me that the physical processors were mostly unused and who was firmly convinced that he could host 200 VMs on 4 processors.
That lasted until I started a "yum update" simultaneously on 50 machines and there was a 5 minute (!) delay between the moment the first machine started the update and the moment the last one did.

The conclusion is that the error you see is triggered when the processors are starved.

centosuser42

2020-09-11 09:15

reporter   ~0037691

Hi there @ManuelWolfshant,

We do not overprovision anywhere near as heavily as in your example: currently this host, with its 32 logical processors, serves 27 VMs.
But I get the argument.

Assuming there really was a peak that led to high system load and CPU starvation: wouldn't there be a vCenter alert?
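
For reference, the VM count alone does not give the overcommit level; that depends on how many vCPUs each VM has, which is not stated in this thread. A minimal sketch of the ratio, using a hypothetical 4 vCPUs per VM purely for illustration:

```python
def overcommit_ratio(vcpus_per_vm, logical_cpus):
    """Total configured vCPUs divided by the host's logical processors."""
    if logical_cpus <= 0:
        raise ValueError("logical_cpus must be positive")
    return sum(vcpus_per_vm) / logical_cpus


# 27 VMs and 32 logical processors come from the comment above.
# The 4 vCPUs per VM is an assumption, not a figure from this report.
hypothetical_vms = [4] * 27
ratio = overcommit_ratio(hypothetical_vms, 32)

print(f"vCPU to logical CPU ratio: {ratio}")  # → 3.375
```

With 1 vCPU per VM the same host would be below 1:1, so the real ratio has to be taken from the actual VM configurations.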
ManuelWolfshant

2020-09-11 09:19

manager   ~0037692

I would not know and I do not care. I quit using VMware 10 years ago. Your question should be addressed to a VMware support venue.

Issue History

Date Modified Username Field Change
2020-09-09 08:47 centosuser42 New Issue
2020-09-09 08:47 centosuser42 File Added: jounral.tgz
2020-09-09 08:47 centosuser42 File Added: messages.txt
2020-09-09 08:47 centosuser42 Issue generated from: 0017720
2020-09-09 08:50 centosuser42 Note Added: 0037674
2020-09-09 09:29 ManuelWolfshant Relationship added duplicate of 0017720
2020-09-09 09:34 TrevorH Note Added: 0037675
2020-09-09 13:30 centosuser42 Note Added: 0037676
2020-09-09 13:30 centosuser42 File Added: journal_filtered1.txt
2020-09-09 13:30 centosuser42 Note Added: 0037677
2020-09-09 15:09 centosuser42 Note Added: 0037678
2020-09-09 15:41 ManuelWolfshant Note Added: 0037679
2020-09-09 15:42 ManuelWolfshant Note Edited: 0037679
2020-09-11 09:15 centosuser42 Note Added: 0037691
2020-09-11 09:19 ManuelWolfshant Note Added: 0037692
2020-09-11 09:19 ManuelWolfshant Note Edited: 0037679