View Issue Details

ID: 0014284
Project: Xen4
Category: [CentOS-6] kernel
View Status: public
Last Update: 2018-06-19 11:16
Reporter: jazaman
Priority: high
Severity: major
Reproducibility: have not tried
Status: new
Resolution: open
Platform: CentOS 6.7 x64
OS: CentOS 6.7 x64
OS Version: 4.9.63-29.el7.x
Product Version:
Target Version:
Fixed in Version:
Summary: 0014284: Linux (CentOS/Fedora) domU hosts freeze
Description: I installed Xen on CentOS 7.4.1708 following the Xen4CentOS guide.

I then installed 4 domU guests (2 CentOS 7, 1 Fedora 27 and 1 Windows Server) with full virtualization. After the initial testing, once I made the system available for production, only the Linux guests periodically hang; the Windows Server guest runs fine.

The Xen dom0 kernel is 4.9.63-29.el7.x86_64. The Linux domU guests run CentOS 7 (3.10.0-693.5.2.el7.x86_64), and the Windows guest is Windows Server 2012 R2. The Linux domU kernels hang with the following message:

> [ 3746.780097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 3746.780223] INFO: task jbd2/xvdb6-8:8173 blocked for more than 120 seconds.

Steps To Reproduce: The system freezes randomly, sometimes within a couple of hours and sometimes after a couple of days.
Additional Information: The logs simply stop at some point and only resume after the next reboot. Sometimes it's still possible to log on to the system, but nothing really works; it is as if all I/O to the virtual block devices were suspended indefinitely. Until this happens, the system seems to work without issues.

Something like 'ls' on a directory that was listed before still returns a result, but anything 'new', e.g. 'vim somefile', causes the shell to stall. sar -u shows high I/O wait.

A similar problem has been reported for Xen on other distributions (Debian/SUSE) [https://www.novell.com/support/kb/doc.php?id=7018590], and following their suggestion I raised gnttab_max_frames to 256. It was stable for about a week, and then one of the domUs hung again.
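For reference, a minimal sketch of how that hypervisor option is set on a grub2-based dom0 (file and variable names assume the stock 20_linux_xen grub helper and a BIOS boot; adjust to your setup):

# /etc/default/grub: append the option to the Xen command line
GRUB_CMDLINE_XEN_DEFAULT="dom0_mem=2048M,max:2048M gnttab_max_frames=256"
# then regenerate the grub config and reboot dom0
grub2-mkconfig -o /boot/grub2/grub.cfg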

Following is the output from xl info:

release : 4.9.63-29.el7.x86_64
version : #1 SMP Mon Nov 20 14:39:22 UTC 2017
machine : x86_64
nr_cpus : 32
max_cpu_id : 191
nr_nodes : 2
cores_per_socket : 8
threads_per_core : 2
cpu_mhz : 2100
hw_caps : bfebfbff:2c100800:00000000:00007f00:77fefbff:00000000:00000121:021cbfbb
virt_caps : hvm hvm_directio
total_memory : 130978
free_memory : 68109
sharing_freed_memory : 0
sharing_used_memory : 0
outstanding_claims : 0
free_cpus : 0
xen_major : 4
xen_minor : 6
xen_extra : .6-6.el7
xen_version : 4.6.6-6.el7
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset : Fri Nov 17 18:32:23 2017 +0000 git:a559dc3-dirty
xen_commandline : placeholder dom0_mem=2048M,max:2048M cpuinfo com1=115200,8n1 console=com1,tty loglvl=all guest_loglvl=all gnttab_max_frames=256
cc_compiler : gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
cc_compile_by : mockbuild
cc_compile_domain : centos.org
cc_compile_date : Mon Nov 20 12:28:41 UTC 2017
xend_config_format : 4
Tags: No tags attached.

Activities

deltadarren

2018-04-13 10:19

reporter   ~0031613

I can confirm the same issue. We have been migrating hypervisors to CentOS 7 with Xen and have been experiencing domUs locking up. Across different environments we have different numbers of VMs doing a variety of jobs: some are fairly light (such as a Salt master) and don't do much most of the time, while others are backup database hosts that consistently use a lot of CPU and I/O. The domUs are all Linux, on differing CentOS versions.

We have tried increasing ''gnttab_max_frames'' to 256 as per the original poster's change (Gentoo and Novell both advise this too); all was fine for around a week, and then we started seeing the domUs lock up. We're unable to log in at all. Sometimes we can get as far as typing a username, but no password prompt appears; other times we can't even do that. We've tried changing the vm.dirty_* settings, but to no avail. I've tried increasing debug levels, but there's nothing shown prior to the lockup and no unusual behaviour; the domU just stops.

Hypervisor is running kernel 4.9.75-29.el7.x86_64 and xl info is as follows:
release : 4.9.75-29.el7.x86_64
version : #1 SMP Fri Jan 5 19:42:28 UTC 2018
machine : x86_64
nr_cpus : 40
max_cpu_id : 191
nr_nodes : 2
cores_per_socket : 10
threads_per_core : 2
cpu_mhz : 2197
hw_caps : bfebfbff:2c100800:00000000:00007f00:77fefbff:00000000:00000121:021cbfbb
virt_caps : hvm hvm_directio
total_memory : 81826
free_memory : 13777
sharing_freed_memory : 0
sharing_used_memory : 0
outstanding_claims : 0
free_cpus : 0
xen_major : 4
xen_minor : 6
xen_extra : .6-10.el7
xen_version : 4.6.6-10.el7
xen_caps : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64
xen_scheduler : credit
xen_pagesize : 4096
platform_params : virt_start=0xffff800000000000
xen_changeset : Thu Mar 1 17:24:01 2018 -0600 git:2a1e1e0-dirty
xen_commandline : placeholder dom0_mem=4096M,max:4096M cpuinfo com1=115200,8n1 console=com1,tty loglvl=all guest_loglvl=all gnttab_max_frames=256
cc_compiler : gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
cc_compile_by : mockbuild
cc_compile_domain : centos.org
cc_compile_date : Mon Mar 5 18:00:43 UTC 2018
xend_config_format : 4

I'm struggling massively to find anyone else still having issues after making the ''gnttab_max_frames'' change and can't believe there are only two of us still seeing this error. If any additional debug output is required, please let me know and I'll be happy to provide it.
peak

2018-04-17 15:11

reporter   ~0031630

Have you tried to save & restore a locked-up domU?
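For example, with the xl toolstack that would be something like the sketch below (domain name and file path are placeholders); with libvirt the equivalent is virsh save / virsh restore:

xl save <domU-name> /var/tmp/<domU-name>.save
xl restore /var/tmp/<domU-name>.save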
deltadarren

2018-04-17 16:44

reporter   ~0031631

Hi peak, I have not tried that yet, but will certainly give it a go next time I see it happen. I'll report back with any findings - most likely tomorrow given how frequently this is happening
deltadarren

2018-04-20 14:12

reporter   ~0031647

I've tried the save & restore method on a few recent freezes we've had. In most cases it works (which is a great workaround, given some of these VMs are databases and they get in a bit of a mess when they start back up); however, some of them don't respond to the save and throw the following error:

virsh save t02dns02 t02dns02.state
error: Failed to save domain t02dns02 to t02dns02.state
error: internal error: Failed to save domain '44' with libxenlight

Looking in the libvirtd logs, there are these messages:
2018-04-19 15:56:03.638+0000: libxl: libxl_dom_suspend.c:318:suspend_common_wait_guest_timeout: guest did not suspend, timed out
2018-04-19 15:56:03.654+0000: xc: save callback suspend() failed: 0: Internal error
2018-04-19 15:56:03.654+0000: xc: Save failed (0 = Success): Internal error
2018-04-19 15:56:03.659+0000: libxl: libxl_stream_write.c:329:libxl__xc_domain_save_done: saving domain: domain responded to suspend request: Success

Unfortunately, that's all I've got to go on at the moment. I've also hit up the centos-devel mailing list, but haven't heard anything yet.
peak

2018-04-24 16:49

reporter   ~0031671

1. Are there any multiqueues on your system? Try running "xenstore-ls -f | grep multi-queue" on dom0. (See <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=880554>.)

2. Did any of the revived domUs print "INFO: task ... blocked for more than 120 seconds"? Such messages should be followed (in the log files/dmesg) by stack traces of the hung tasks; see the sketch after question 3 for one way to dig them out.

3. Can you use ftrace and/or SystemTap on any of the domUs that are prone to locking up?
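For example, a minimal sketch of pulling those traces out of a revived domU (the 30-line context is arbitrary):

# inside the affected domU, after it has been revived
dmesg -T | grep -A 30 'blocked for more than'
# or search the kernel messages kept in the journal
journalctl -k | grep -A 30 'blocked for more than'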
deltadarren

2018-04-25 09:01

reporter   ~0031675

1. We do have multiqueues, yes. Here's what the list looks like:
 ~ # xenstore-ls -f | grep multi-queue
/local/domain/0/backend/vif/1/0/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/2/0/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/2/1/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/2/2/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/2/3/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/2/4/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/2/5/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/2/6/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/2/7/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/3/0/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/3/1/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/3/2/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/3/3/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/3/4/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/3/5/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/3/6/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/3/7/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/4/0/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/5/0/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/6/0/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/7/0/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/8/0/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/9/0/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/9/1/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/9/2/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/9/3/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/9/4/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/9/5/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/9/6/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/9/7/multi-queue-max-queues = "8"
/local/domain/0/backend/vif/10/0/multi-queue-max-queues = "8"
/local/domain/1/device/vif/0/multi-queue-num-queues = "2"
/local/domain/4/device/vif/0/multi-queue-num-queues = "2"
/local/domain/5/device/vif/0/multi-queue-num-queues = "2"
/local/domain/8/device/vif/0/multi-queue-num-queues = "2"
/local/domain/10/device/vif/0/multi-queue-num-queues = "2"

The linked Debian bug is quite a lengthy one, but it seems the issue was resolved either by a newer kernel or by a newer version of Xen (unless I'm misreading it?). I could try the modprobe settings discussed towards the end of that report. At the moment our settings are:
/sys/module/xen_blkback/parameters/max_ring_page_order:4
/sys/module/xen_blkback/parameters/max_queues:12

The Debian bug suggests changing these to (a sketch of how to apply them via modprobe.d follows at the end of this note):
options xen_blkback max_ring_page_order=0
options xen_blkback max_queues=1

2. The domUs that were successfully revived sometimes show the "blocked for more than 120 seconds" message in their logs, but not always. We had one go bad yesterday, and upon being revived its load hit 280; I'm thinking this is a case of the kernel believing it has been waiting that long for I/O, right?

3. I can look into getting some ftrace going on one of the likely candidates.

We're also looking at upgrading to Xen 4.8, given it's available in the repo.
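If we do try the blkback settings from the Debian bug, the sketch below is roughly how they would be applied on dom0 (assuming xen_blkback is a loadable module; the .conf file name is illustrative):

# /etc/modprobe.d/xen-blkback.conf
options xen_blkback max_ring_page_order=0
options xen_blkback max_queues=1

# verify after the module has been reloaded or dom0 rebooted
cat /sys/module/xen_blkback/parameters/max_ring_page_order
cat /sys/module/xen_blkback/parameters/max_queues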
deltadarren

2018-04-25 14:16

reporter   ~0031678

I've not used ftrace before, and having had a little play with it, it looks like it could end up generating a lot of data. Is there anything in particular I should be looking for, or any way of narrowing down what gets traced, so we're only capturing the important stuff and not just noise?
peak

2018-05-01 20:33

reporter   ~0031697

1. Hmm... you've got some multiqueue vifs but no multiqueue vbds. Entries similar to the following would appear in xenstore if your domUs were using multiqueue vbds:

/local/domain/22/device/vbd/768/multi-queue-num-queues = "2"

A multiqueue vbd frontend consumes many more grant entries, and that might cause problems (as discussed in that Debian bug), but you don't seem to have any.

2. In those cases where a domU logged "blocked for more than 120 seconds", were those messages followed by a stack trace in the log? If yes, can you share an example? It might explain where and how I/O got stuck (assuming the message was generated while the domU was stuck, not after its resuscitation).

Load 280 was probably the result of too many processes waiting for I/O while the domU was stuck. Alternatively, it might have been the result of too many processes eager to finish their I/O once the domU was revived.

3. I think it would be a good idea to check whether the domU lock-ups you experience are related to exhaustion of grant-table entries (as suggested by the similar cases mentioned above) or not. The actual selection of tracepoints depends on the kernel in the domU.

Let us assume it's 3.10, as in CentOS 7. I guess it would make sense to trace calls to gnttab_alloc_grant_references (to see who allocates grant entries and when), its return values (to see whether the calls succeed; this needs a kprobe return event, defined below), and calls to, say, blkfront_resume (to see when the domU has been resuscitated by save/restore):

# cd /sys/kernel/debug/tracing
# echo gnttab_alloc_grant_references >> set_ftrace_filter
# echo blkfront_resume >> set_ftrace_filter
# echo function > current_tracer
# echo 'r gnttab_alloc_grant_references ret=$retval nr=@nr_grant_frames' >> kprobe_events
# echo 1 > events/kprobes/r_gnttab_alloc_grant_references_0/enable

Now you can print events as they occur:
# cat trace_pipe

or dump the recorded events (this suspends tracing temporarily):
# cat trace

A domU should allocate a bunch of grant entries after it connects (or reconnects) to the backend; it might allocate a few more when its load increases, but sooner or later it should reach a steady state in which it has already allocated all the entries it needs.

(See Documentation/trace/ftrace.txt and Documentation/trace/kprobetrace.txt for more information about ftrace and kprobe tracepoints.)
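When you are done, tracing can be switched off again. A sketch (the kprobe event has to be disabled before kprobe_events can be cleared):

# echo 0 > events/kprobes/r_gnttab_alloc_grant_references_0/enable
# echo > kprobe_events
# echo nop > current_tracer
# echo > set_ftrace_filter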

You can also try running "xentrace -D -e 0x0010f000 <output-file>" to watch things in dom0, but you will need to provide your own format file to xentrace_format to read the trace, because a suitable format file is not included in the package.

(There is also "xl debug-keys g" that makes the hypervisor dump grant tables (all of them!) and "xl dmesg" to retrieve the dump but I can't recommend doing it on a production system.)
peak

2018-05-10 17:49

reporter   ~0031752

I have found https://lists.xenproject.org/archives/html/xen-users/2018-04/msg00058.html ...
jbd2/xvda2-8:137 seems to be waiting for an I/O operation to complete (__wait_on_buffer), dockerd:807 is waiting for jbd2 to finish a transaction (jbd2_log_wait_commit) and everyone else is waiting for them to release locked buffers (do_get_write_access).
deltadarren

2018-06-19 11:16

reporter   ~0032113

Just an update on this problem. We eventually upgraded Xen to 4.8 and changed the kernel that the VMs use to the LTS kernel from ELRepo. Since doing this, we haven't had any lock-ups or freezes in over a month. It might be that the 4.8 upgrade alone is enough, but we tried a newer kernel first because the initial attempt to update to 4.8 failed.
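For anyone wanting to try the same route, a rough sketch (package and repository names are assumptions based on the CentOS Virt SIG and ELRepo; check your own repos):

# on dom0: switch to the Virt SIG's Xen 4.8 packages and update
yum install centos-release-xen-48
yum update
# in each Linux domU: install the ELRepo long-term kernel
# (after enabling the elrepo-kernel repository; see elrepo.org)
yum --enablerepo=elrepo-kernel install kernel-lt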

Issue History

Date Modified Username Field Change
2017-12-17 17:03 jazaman New Issue
2018-04-13 10:19 deltadarren Note Added: 0031613
2018-04-17 15:11 peak Note Added: 0031630
2018-04-17 16:44 deltadarren Note Added: 0031631
2018-04-20 14:12 deltadarren Note Added: 0031647
2018-04-24 16:49 peak Note Added: 0031671
2018-04-25 09:01 deltadarren Note Added: 0031675
2018-04-25 14:16 deltadarren Note Added: 0031678
2018-05-01 20:33 peak Note Added: 0031697
2018-05-10 17:49 peak Note Added: 0031752
2018-06-19 11:16 deltadarren Note Added: 0032113