View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0017722 | CentOS-7 | -OTHER | public | 2020-09-09 08:47 | 2020-09-11 09:19 |

| Target Version | Fixed in Version |
|---|---|
| | |

Summary: 0017722: Error message: "kernel:NMI watchdog: BUG: soft lockup - CPU#0 stuck for 40s! [snmpd:1047]"
# cat /etc/centos-release
CentOS Linux release 7.8.2003 (Core)
VMware hypervisor version:
VMware ESXi, 6.5.0, 14874964
CPU: Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
System log: please see attachments.
Tags: No tags attached.
jounral.tgz (505,236 bytes)
messages.txt (1,104,005 bytes)
journal.txt: line 6
Linux version 3.10.0-1127.18.2.el7.x86_64 (email@example.com) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC) ) #1 SMP Sun Jul 26 15:27:06 UTC 2020
This almost always ends up being a VMware problem. Check how loaded your host is: it looks like it is starving the guest of CPU time, so the guest never gets to do any work.
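One way to check this suspicion from inside the guest, rather than from the host, is to watch the "steal" time the hypervisor reports. This is a generic sketch, not something from the report:

```shell
# Steal time is the 9th field of the "cpu" line in /proc/stat (USER_HZ
# ticks the hypervisor withheld from this guest). A value that keeps
# growing between samples points at host-side CPU contention of the
# kind described above.
awk '/^cpu /{printf "steal ticks: %s\n", $9}' /proc/stat
# "vmstat 5" shows the same figure live, as a percentage, in its "st" column.
```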
Did you actually read the 1.1 MB messages.txt file before you uploaded it? It's filled with snmpd diagnostic messages because you have its log level set too high, and there is nothing else of value in it at all. As for the gzipped journal attachment, I'm not even going to bother downloading that; it's too big and too much effort.
Update: attached a filtered journal log, with repeated messages such as snmpd diagnostics, login sessions, and postfix info removed.
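The filtering step can be sketched with standard tools; the exact patterns below are assumptions for illustration, not necessarily the ones actually used:

```shell
# Hypothetical reconstruction of the filtering above: export the journal
# and drop the repetitive snmpd, login-session and postfix lines.
# Adjust the patterns to whatever actually repeats in your log.
journalctl --no-pager \
  | grep -vE 'snmpd|Session [0-9]+ of user|postfix' \
  > journal_filtered1.txt
```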
Relevant error messages:
593:Aug 20 17:08:25 vipdepot.vip.intern kernel: NMI watchdog: disabled (cpu0): hardware events not enabled
594:Aug 20 17:08:25 vipdepot.vip.intern kernel: NMI watchdog: Shutting down hard lockup detector on all cpus
3301:Sep 07 23:30:51 vipdepot.vip.intern kernel: NMI watchdog: BUG: soft lockup - CPU#0 stuck for 40s! [snmpd:1047]
What happened before the error? I can see a series of UDP connection diagnostic messages spread over half an hour, but there is, for example, no burst of them directly before the message.
I consider CPU starvation a valid hypothesis; I need to double-check that.
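For context on the "stuck for 40s" figure: on a stock el7 kernel the soft-lockup detector trips after roughly 2 × kernel.watchdog_thresh seconds in which a CPU never reschedules (default 10, i.e. about 20 s), and the message reports how long the CPU had actually been stuck. A tuning sketch only; raising the threshold hides warnings from a starved guest, it does not fix the contention:

```shell
# Inspect the current soft-lockup threshold (seconds; detector fires
# at roughly twice this value on el7 kernels).
sysctl kernel.watchdog_thresh
# As root, a hypothetical bump to reduce false positives on a
# contended hypervisor host:
# sysctl -w kernel.watchdog_thresh=30
```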
journal_filtered1.txt (348,967 bytes)
Update: I've reviewed the monitors of both the VM and its host: no alarms were triggered.
vCenter host CPU usage is below 50%.
Some time ago I worked for someone who proudly showed me that the physical processors were mostly unused and was firmly convinced that he could host 200 VMs on 4 processors.
That lasted until I started a "yum update" simultaneously on 50 machines and there was a 5-minute (!) delay between the moment the first machine started the update and the moment the last one did.
The conclusion is that the error you see is triggered when the processors are starved.
Hi there @ManuelWolfshant,
we do not follow such heavy overprovisioning as in your example: currently this host, with its 32 logical processors, serves 27 VMs.
But I get the argument.
Given there had indeed been a peak leading to high system load and CPU starvation: wouldn't there be a vCenter alert?
I would not know and I do not care. I quit using VMware 10 years ago. Your question should be addressed in a VMware support venue.
| Date | User | Action |
|---|---|---|
| 2020-09-09 08:47 | centosuser42 | New Issue |
| 2020-09-09 08:47 | centosuser42 | File Added: jounral.tgz |
| 2020-09-09 08:47 | centosuser42 | File Added: messages.txt |
| 2020-09-09 08:47 | centosuser42 | Issue generated from: 0017720 |
| 2020-09-09 08:50 | centosuser42 | Note Added: 0037674 |
| 2020-09-09 09:29 | ManuelWolfshant | Relationship added: duplicate of 0017720 |
| 2020-09-09 09:34 | TrevorH | Note Added: 0037675 |
| 2020-09-09 13:30 | centosuser42 | Note Added: 0037676 |
| 2020-09-09 13:30 | centosuser42 | File Added: journal_filtered1.txt |
| 2020-09-09 13:30 | centosuser42 | Note Added: 0037677 |
| 2020-09-09 15:09 | centosuser42 | Note Added: 0037678 |
| 2020-09-09 15:41 | ManuelWolfshant | Note Added: 0037679 |
| 2020-09-09 15:42 | ManuelWolfshant | Note Edited: 0037679 |
| 2020-09-11 09:15 | centosuser42 | Note Added: 0037691 |
| 2020-09-11 09:19 | ManuelWolfshant | Note Added: 0037692 |
| 2020-09-11 09:19 | ManuelWolfshant | Note Edited: 0037679 |