View Issue Details

ID: 0014284
Project: Xen4
Category: [CentOS-6] kernel
View Status: public
Last Update: 2018-04-24 16:49
Reporter: jazaman
Priority: high
Severity: major
Reproducibility: have not tried
Status: new
Resolution: open
Platform: CentOS 6.7 x64
OS: CentOS 6.7 x64
OS Version: 4.9.63-29.el7.x
Product Version:
Target Version:
Fixed in Version:
Summary: 0014284: Linux (CentOS/Fedora) domU hosts freeze
Description: I installed Xen on CentOS 7.4.1708 following the Xen4CentOS guide.

I then installed 4 domU guests (2 CentOS 7, 1 Fedora 27, and 1 Windows Server) with full virtualization. After initial testing, when I put the systems into production, only the Linux guests periodically hang; the Windows Server guest keeps running fine.

The Xen dom0 kernel is 4.9.63-29.el7.x86_64. The Linux domU guests run CentOS 7 (3.10.0-693.5.2.el7.x86_64), and the Windows guest is Windows Server 2012 R2. The Linux domU kernels hang with the following message:

> [ 3746.780097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3746.780223] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.
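
For reference, the watchdog that prints this message is controlled from inside the guest via standard sysctls; a minimal sketch (values are illustrative only, not a recommendation):

# run inside the affected domU; standard kernel sysctls
sysctl kernel.hung_task_timeout_secs        # default is 120, matching the message above
sysctl -w kernel.hung_task_timeout_secs=60  # report sooner while debugging
sysctl -w kernel.hung_task_panic=0          # 1 would panic the guest on a hung task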

Steps To Reproduce: The system freezes randomly, sometimes within a couple of hours and sometimes after a couple of days.
Additional Information: The logs simply end at some point and resume only after the next reboot. Sometimes it is still possible to log on to the system, but nothing really works; it is as if all I/O to the virtual block devices were suspended indefinitely. Until this happens, the system seems to work without issues.

Something like 'ls' on a directory that was listed before still returns a result, but anything 'new', e.g. 'vim somefile', causes the shell to stall. sar -u reveals high I/O wait.

A similar problem has been reported for Xen with other kernels (Debian/SUSE) [https://www.novell.com/support/kb/doc.php?id=7018590], and following their suggestion I raised gnttab_max_frames to 256. The system was stable for one week, and then one of the domUs hung.
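
For anyone wanting to apply the same change, a sketch for a grub2-based CentOS 7 dom0 (the variable and file names here are the usual ones, but may differ per setup):

# on dom0: add the option to the Xen command line in /etc/default/grub
# (the exact variable, GRUB_CMDLINE_XEN here, may vary with your grub setup)
#   GRUB_CMDLINE_XEN="... gnttab_max_frames=256"
grub2-mkconfig -o /boot/grub2/grub.cfg   # regenerate grub.cfg
reboot
xl info | grep xen_commandline           # confirm the option is active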

The following is the output of xl info:

release : 4.9.63-29.el7.x86_64
version : #1 SMP Mon Nov 20 14:39:22 UTC 2017
machine : x86_64
nr_cpus : 32
max_cpu_id : 191
nr_nodes : 2
cores_per_socket : 8
threads_per_core : 2
cpu_mhz : 2100
hw_caps : bfebfbff:2c100800:00000000:00007f00:77fefbff:00000000:00000121:021cbfbb
virt_caps : hvm hvm_directio
total_memory : 130978
free_memory : 68109
sharing_freed_memory : 0
sharing_used_memory : 0
outstanding_claims : 0
free_cpus : 0
xen_major : 4
xen_minor : 6
xen_extra : .6-6.el7
xen_version : 4.6.6-6.el7
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset : Fri Nov 17 18:32:23 2017 +0000 git:a559dc3-dirty
xen_commandline : placeholder dom0_mem=2048M,max:2048M cpuinfo com1=115200,8n1 console=com1,tty loglvl=all guest_loglvl=all gnttab_max_frames=256
cc_compiler : gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
cc_compile_by : mockbuild
cc_compile_domain : centos.org
cc_compile_date : Mon Nov 20 12:28:41 UTC 2017
xend_config_format : 4
Tags: No tags attached.

Activities

deltadarren

2018-04-13 10:19

reporter   ~0031613

I can confirm the same issue. We have been migrating hypervisors to CentOS 7 with Xen and have been experiencing domUs locking up. Across different environments we have different numbers of VMs doing a variety of jobs: some are fairly light (such as a Salt master) and don't do much most of the time, while others are backup database hosts that consistently use a lot of CPU and I/O. The domUs are all Linux, of differing CentOS versions.

We have tried increasing ''gnttab_max_frames'' to 256 as per the original poster's change (Gentoo and Novell both advise this too). All was fine for around a week, and then we started seeing the domUs lock up. We are unable to log in at all; sometimes we can get as far as typing a username, but no password prompt appears, and other times we can't even do that. We've tried changing the vm.dirty settings, but to no avail. I've tried increasing debug levels, but nothing is shown prior to the lockup and there is no unusual behaviour; the domU just stops.
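
For completeness, the vm.dirty changes we tried were along these lines (a sketch; the exact values we used varied and none of them helped):

# inside the domU; standard writeback sysctls, values illustrative only
sysctl -w vm.dirty_background_ratio=5     # start background writeback earlier
sysctl -w vm.dirty_ratio=10               # block writers sooner, keep less dirty data
sysctl -w vm.dirty_expire_centisecs=1500  # age out dirty pages faster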

Hypervisor is running kernel 4.9.75-29.el7.x86_64 and xl info is as follows:
release : 4.9.75-29.el7.x86_64
version : #1 SMP Fri Jan 5 19:42:28 UTC 2018
machine : x86_64
nr_cpus : 40
max_cpu_id : 191
nr_nodes : 2
cores_per_socket : 10
threads_per_core : 2
cpu_mhz : 2197
hw_caps : bfebfbff:2c100800:00000000:00007f00:77fefbff:00000000:00000121:021cbfbb
virt_caps : hvm hvm_directio
total_memory : 81826
free_memory : 13777
sharing_freed_memory : 0
sharing_used_memory : 0
outstanding_claims : 0
free_cpus : 0
xen_major : 4
xen_minor : 6
xen_extra : .6-10.el7
xen_version : 4.6.6-10.el7
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset : Thu Mar 1 17:24:01 2018 -0600 git:2a1e1e0-dirty
xen_commandline : placeholder dom0_mem=4096M,max:4096M cpuinfo com1=115200,8n1 console=com1,tty loglvl=all guest_loglvl=all gnttab_max_frames=256
cc_compiler : gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
cc_compile_by : mockbuild
cc_compile_domain : centos.org
cc_compile_date : Mon Mar 5 18:00:43 UTC 2018
xend_config_format : 4

I'm struggling massively to find anyone else still having issues after making the ''gnttab_max_frames'' change and can't believe there are only two of us still seeing this error. If any additional debug output is required, please let me know and I'll be happy to provide it.
peak

2018-04-17 15:11

reporter   ~0031630

Have you tried to save & restore a locked-up domU?
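
For reference, either toolstack can drive it; a minimal sketch ("mydomu" and the state-file path are examples):

# on dom0; "mydomu" is an example domain name
xl save mydomu /var/lib/xen/save/mydomu.state   # suspend the guest to a state file
xl restore /var/lib/xen/save/mydomu.state       # bring it back
# or via libvirt:
virsh save mydomu /var/lib/xen/save/mydomu.state
virsh restore /var/lib/xen/save/mydomu.state
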
deltadarren

2018-04-17 16:44

reporter   ~0031631

Hi peak, I have not tried that yet, but will certainly give it a go next time I see it happen. I'll report back with any findings, most likely tomorrow given how frequently this is happening.
deltadarren

2018-04-20 14:12

reporter   ~0031647

I've tried the save & restore method on a few recent freezes we've had. In most cases it works (which is a great workaround, given that some of these VMs are databases and they get in a bit of a mess when they start back up). However, some of them don't respond to the save and kick up an error:

virsh save t02dns02 t02dns02.state
error: Failed to save domain t02dns02 to t02dns02.state
error: internal error: Failed to save domain '44' with libxenlight

Looking in the libvirtd logs, there are these messages:
2018-04-19 15:56:03.638+0000: libxl: libxl_dom_suspend.c:318:suspend_common_wait_guest_timeout: guest did not suspend, timed out
2018-04-19 15:56:03.654+0000: xc: save callback suspend() failed: 0: Internal error
2018-04-19 15:56:03.654+0000: xc: Save failed (0 = Success): Internal error
2018-04-19 15:56:03.659+0000: libxl: libxl_stream_write.c:329:libxl__xc_domain_save_done: saving domain: domain responded to suspend request: Success
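
In case it helps anyone else hitting this: when the libvirt save fails like the above, it may be worth trying xl directly to take libvirt out of the picture (a sketch; whether xl runs into the same libxl suspend timeout is an open question):

# on dom0; drives the same libxl suspend path, without libvirt in between
xl list                          # find the domain name/id
xl save t02dns02 t02dns02.state  # attempt the save directly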

Unfortunately, that's all I've got to go on at the moment. I've also hit up the centos-devel mailing list, but haven't heard anything yet.
peak

2018-04-24 16:49

reporter   ~0031671

1. Are any multi-queue devices in use on your system? Try running "xenstore-ls -f | grep multi-queue" on dom0. (See <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880554>.)

2. Did any of the revived domUs print "INFO: task ... blocked for more than 120 seconds"? Such messages should be followed (in log files/dmesg) by stack traces of the hung tasks.

3. Can you use ftrace and/or SystemTap on any of the domUs that are prone to lock up? (A sketch of all three checks follows below.)
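
A sketch of the three checks (standard tools; output and availability will vary by kernel):

# 1. on dom0: look for multi-queue frontends (each queue consumes grant frames)
xenstore-ls -f | grep multi-queue

# 2. inside a revived domU: pull the hung-task stack traces out of the logs
dmesg | grep -B1 -A25 'blocked for more than'
journalctl -k | grep -B1 -A25 'blocked for more than'

# 3. inside a stuck-but-reachable domU: dump blocked tasks on demand via sysrq,
#    a lighter-weight first step before full ftrace/SystemTap sessions
echo w > /proc/sysrq-trigger   # writes the blocked-task list to the kernel log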

Issue History

Date Modified Username Field Change
2017-12-17 17:03 jazaman New Issue
2018-04-13 10:19 deltadarren Note Added: 0031613
2018-04-17 15:11 peak Note Added: 0031630
2018-04-17 16:44 deltadarren Note Added: 0031631
2018-04-20 14:12 deltadarren Note Added: 0031647
2018-04-24 16:49 peak Note Added: 0031671