View Issue Details

ID: 0018075
Project: CentOS-7
Category: libvirt
View Status: public
Last Update: 2021-02-18 16:52
Reporter: ThomasLef
Assigned To:
Priority: high
Severity: major
Reproducibility: random
Status: new
Resolution: open
OS: CentOS 7 RT
OS Version: 7.6
Product Version: 7.6.1810
Summary: 0018075: when starting VMs, sometimes one VM will get to 100% CPU usage immediately and the libvirtd service will freeze.
Description: I have 9 VMs that I start with some sleep time in between.
Most of the time they all start and work just fine, but sometimes, at start, one VM's CPU usage ramps up to 100% in virt-manager and stays stuck, and subsequent VM starts fail with the following errors:

Domain VM1 started

Domain VM2 started

error: Failed to start domain VM3
error: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

error: failed to connect to the hypervisor
error: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

error: failed to connect to the hypervisor
error: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

error: failed to connect to the hypervisor
error: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

error: failed to connect to the hypervisor
error: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

error: failed to connect to the hypervisor
error: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

error: failed to connect to the hypervisor
error: error from service: CheckAuthorization: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
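
For context, CheckAuthorization is the polkit call that libvirt makes over the system D-Bus, so the hang could be in that path rather than in libvirtd itself. A rough way to check whether polkit and the bus still answer at all (org.libvirt.unix.manage is the stock libvirt action; adjust if your policy differs):

    # Are polkitd and the system dbus-daemon still running?
    pgrep -a polkitd
    pgrep -a dbus-daemon

    # Ask polkit directly for a libvirt authorization; if this returns at all
    # (authorized or not), the bus and polkitd are responsive; if it hangs,
    # the D-Bus/polkit path is stuck as well
    pkcheck --action-id org.libvirt.unix.manage --process $$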
 
The script I use to reproduce this does the following:

    for i in $(seq 1 100)
    do
        virsh start vm0
        sleep 1
        virsh start vm1
        sleep 20
        virsh start vm2
        sleep 1
        virsh start vm3
        sleep 1
        virsh start vm4
        sleep 1
        virsh start vm5
        sleep 1
        virsh start vm6
        sleep 1
        virsh start vm7
        sleep 1
        virsh start vm8
        vm_count=$(pgrep qemu-kvm |wc -l);
        echo "AFTER START : VM COUNT = $vm_count";
        if [ "$vm_count" -ne "9" ]; then
            echo "failed start at iteration $i"
            echo "only $vm_count/9 vm started"
            exit -1;
        fi;
        sleep 60 #waiting for all vm to be fully started
        virsh shutdown vm0
        sleep 1
        virsh shutdown vm2
        virsh shutdown vm3
        virsh shutdown vm4
        virsh shutdown vm5
        virsh shutdown vm6
        virsh shutdown vm7
        virsh shutdown vm8
        sleep 15

        virsh shutdown vm1
        sleep 15;
        vm_count=$(pgrep qemu-kvm |wc -l);
        echo "AFTER STOP : VM COUNT = $vm_count";
        if [ "$vm_count" -ne "0" ]; then
            echo "failed stop at iteration $i"
            echo "$vm_count/9 not stopped"
            exit -1;
        fi;
    done
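
For what it's worth, one way to capture timestamps and the matching libvirtd journal around a failing run (assuming the loop above is saved as repro.sh):

    # repro.sh is an assumed file name for the loop above
    START=$(date '+%F %T')
    ./repro.sh 2>&1 | while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%F %T')" "$line"
    done | tee repro.log
    # keep the libvirtd journal covering this run for later inspection
    journalctl -u libvirtd --since "$START" > libvirtd-journal.log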

=> The most recent run hit the failure on the 23rd iteration of this loop.

The CPU-maxed VM as seen in top:

# top
top - 16:49:05 up 2 days, 4:34, 2 users, load average: 79.24, 78.90, 76.41
Tasks: 2605 total, 68 running, 2527 sleeping, 0 stopped, 10 zombie
%Cpu(s): 9.9 us, 0.0 sy, 0.0 ni, 90.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 79204140+total, 76074483+free, 23655132 used, 7641428 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 76729036+avail Mem

   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 76846 qemu 20 0 32.1g 1.0g 11436 R 1824 0.1 1076:48 qemu-kvm

I've tried attaching gdb to this PID to run "thread apply all bt", to no avail: gdb also hangs once attached.
Same result when trying to attach gdb to the libvirtd process.
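
Since ptrace-based attaching hangs, the kernel-side stacks can still be read from /proc as root without attaching; a rough sketch, using the qemu-kvm PID 76846 from the top output above:

    # Kernel-side stack of every thread of the stuck qemu-kvm process
    for t in /proc/76846/task/*; do
        echo "== TID ${t##*/} =="
        cat "$t/stack"    # root only; shows where a blocked thread sits in the kernel
    done

    # Same for libvirtd
    for t in /proc/$(pgrep -x libvirtd | head -n1)/task/*; do
        echo "== TID ${t##*/} =="
        cat "$t/stack"
    done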

Switching to root and trying to check the status of libvirtd.service also hangs without displaying any info:
# su -
# systemctl status libvirtd.service
...
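
Thread states can still be inspected without going through D-Bus (which is what systemctl does), for example:

    # Per-thread state, wait channel and current CPU, no D-Bus involved
    ps -eLo pid,tid,stat,psr,wchan:32,comm | grep -E 'libvirtd|qemu-kvm'

Threads stuck in D state (uninterruptible sleep) with a meaningful wchan would point at a kernel or I/O wait rather than a userspace spin.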



System details:

# rpm -qa |grep qemu
libvirt-daemon-driver-qemu-4.5.0-33.el7_8.1.x86_64
qemu-system-moxie-2.0.0-1.el7.6.x86_64
qemu-system-m68k-2.0.0-1.el7.6.x86_64
ipxe-roms-qemu-20180825-2.git133f4c.el7.noarch
qemu-system-alpha-2.0.0-1.el7.6.x86_64
qemu-system-arm-2.0.0-1.el7.6.x86_64
qemu-system-microblaze-2.0.0-1.el7.6.x86_64
qemu-system-x86-2.0.0-1.el7.6.x86_64
qemu-system-s390x-2.0.0-1.el7.6.x86_64
qemu-system-xtensa-2.0.0-1.el7.6.x86_64
qemu-2.0.0-1.el7.6.x86_64
qemu-img-1.5.3-173.el7_8.3.x86_64
qemu-kvm-1.5.3-173.el7_8.3.x86_64
qemu-common-2.0.0-1.el7.6.x86_64
qemu-system-unicore32-2.0.0-1.el7.6.x86_64
qemu-system-cris-2.0.0-1.el7.6.x86_64
qemu-system-lm32-2.0.0-1.el7.6.x86_64
qemu-user-2.0.0-1.el7.6.x86_64
qemu-guest-agent-2.12.0-3.el7.x86_64
qemu-system-sh4-2.0.0-1.el7.6.x86_64
qemu-kvm-tools-1.5.3-173.el7_8.3.x86_64
qemu-kvm-common-1.5.3-173.el7_8.3.x86_64
qemu-system-or32-2.0.0-1.el7.6.x86_64
qemu-system-mips-2.0.0-1.el7.6.x86_64

# rpm -qa |grep libvirt
libvirt-client-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-qemu-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-storage-4.5.0-33.el7_8.1.x86_64
libvirt-glib-1.0.0-1.el7.x86_64
libvirt-daemon-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-storage-disk-4.5.0-33.el7_8.1.x86_64
libvirt-python-4.5.0-1.el7.x86_64
libvirt-daemon-config-nwfilter-4.5.0-33.el7_8.1.x86_64
libvirt-4.5.0-33.el7_8.1.x86_64
libvirt-gobject-1.0.0-1.el7.x86_64
libvirt-daemon-driver-storage-core-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-storage-logical-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-storage-gluster-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-nwfilter-4.5.0-33.el7_8.1.x86_64
libvirt-bash-completion-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-lxc-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-network-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-storage-iscsi-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-storage-mpath-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-secret-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-kvm-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-storage-scsi-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-interface-4.5.0-33.el7_8.1.x86_64
libvirt-gconfig-1.0.0-1.el7.x86_64
libvirt-daemon-config-network-4.5.0-33.el7_8.1.x86_64
libvirt-libs-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-storage-rbd-4.5.0-33.el7_8.1.x86_64
libvirt-daemon-driver-nodedev-4.5.0-33.el7_8.1.x86_64

# uname -r
3.10.0-1127.rt56.1093.el7.x86_64

# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 4
NUMA node(s): 4
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz
Stepping: 7
CPU MHz: 1572.052
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4200.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-23,96-119
NUMA node1 CPU(s): 24-47,120-143
NUMA node2 CPU(s): 48-71,144-167
NUMA node3 CPU(s): 72-95,168-191

Any idea on this issue and/or how to pin down the root cause?
Tags: No tags attached.

Activities

ThomasLef (reporter)   ~0038254

2021-02-18 16:52

Additional details:
- I'm using the Real-Time kernel,
- I've pinned the vCPUs so that each VM has all of its cores mapped to a single NUMA node,
- the scheduler is set to FIFO mode (see the check sketched below).
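
A quick way to double-check the pinning and the scheduling class of a running guest's threads (vm1 is just an example domain name):

    # Pinning as libvirt reports it (vm1 is an assumed domain name)
    virsh vcpupin vm1
    virsh vcpuinfo vm1

    # Scheduling class (cls), RT priority and current CPU of each qemu thread
    ps -L -o tid,cls,rtprio,psr,comm -p "$(pgrep -f 'qemu-kvm.*vm1' | head -n1)"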

Issue History

Date Modified     Username   Field        Change
2021-02-18 16:49  ThomasLef  New Issue
2021-02-18 16:52  ThomasLef  Note Added: 0038254