CentOS Bug Tracker

View Issue Details
ID: 0003996
Project: CentOS-5
Category: kvm
View Status: public
Date Submitted: 2009-11-10 03:00
Last Update: 2010-01-31 17:15
Reporter: roflcopter69
Priority: normal
Severity: major
Reproducibility: always
Status: resolved
Resolution: fixed
Platform / OS / OS Version: (not specified)
Product Version: 5.4
Target Version: (none)
Fixed in Version: 5.4
Summary: 0003996: automount uses 100% cpu, never finishes
Description: I run a network of KVM guests and two of them share a common issue. When I log in as a user (users are LDAP-authenticated) and then log out, after a while automount uses 100% of one CPU (never more than one) and never exits. strace shows these lines repeating:

[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1885] clock_gettime(CLOCK_REALTIME, {1257821783, 343731000}) = 0
[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1885] clock_gettime(CLOCK_REALTIME, {1257821783, 343906000}) = 0
[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1885] clock_gettime(CLOCK_REALTIME, {1257821783, 344150000}) = 0
[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1885] clock_gettime(CLOCK_REALTIME, {1257821783, 344388000}) = 0
[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1885] clock_gettime(CLOCK_REALTIME, {1257821783, 344633000}) = 0
[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1885] clock_gettime(CLOCK_REALTIME, {1257821783, 344905000}) = 0
[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1885] clock_gettime(CLOCK_REALTIME, {1257821783, 345225000}) = 0
[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1885] clock_gettime(CLOCK_REALTIME, {1257821783, 345496000}) = 0
[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1885] clock_gettime(CLOCK_REALTIME, {1257821783, 345766000}) = 0
[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 1885] clock_gettime(CLOCK_REALTIME, {1257821783, 346030000}) = 0
[pid 1885] futex(0x2ae83b52f5e0, FUTEX_WAKE_PRIVATE, 1) = 0


and so on and so forth into infinity. Here is my /etc/auto.master:
[root@web1 ~]# cat /etc/auto.master | grep -vP "^#.*"
/misc /etc/auto.misc
/net -hosts
+auto.master
/home /etc/auto.home

and my /etc/auto.home:
[root@web1 ~]# cat /etc/auto.home | grep -vP "^#.*"
* -soft,intr homedirs.MYDOMAIN.com:/home/&

MYDOMAIN is just a placeholder to hide my real domain name; it's a legit address in real life.

This is on CentOS release 5.4 (Final) and autofs-5.0.1-0.rc2.131.el5_4.1

Let me know if you need anything else. Thanks,
--Benjamin Rose
Additional Information: This may be a duplicate of this bug, not sure: https://bugzilla.redhat.com/show_bug.cgi?id=247711
Tags: No tags attached.
Attached Files

- Relationships
related to 0004058 (resolved, toracat): Time drift in KVM guest

- Notes
(0010435)
gilboa (reporter)
2009-12-01 10:59

I'm seeing the same.
5.4 x86_64 VM running under F11 x86_64 host.

The weird part is that I'm getting nailed by this bug only on one guest, even though I've got a number of identical guests running on identical hosts...

Go figure.
(0010437)
roflcopter69 (reporter)
2009-12-02 05:56

Exact same software specs for me, including the Fedora 11 host.

One interesting thing: I noticed that it was only occurring on the KVMs with more than one processor allocated. Configuring the guests with only one processor solved the 100% CPU problem, but (obviously) has its own issues. I guess I can use this as a *very* temporary workaround.

When the package was installed, the guest had only one processor; then I upped it to 4, and this is when the problem started. I wonder if somehow the package might need to be reinstalled after more than one processor exists on the server... wouldn't make much sense to me, but hey, such is life.
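(For anyone applying the same temporary workaround: with libvirt-managed guests, dropping a guest to a single vCPU is a one-element change in the domain XML, edited via "virsh edit <guest>" and picked up on the next guest boot. This fragment is a hedged sketch, not taken from the reporter's actual configuration:)

```
<vcpu>1</vcpu>
```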
(0010439)
gilboa (reporter)
2009-12-02 12:11

My CentOS image was originally installed with 4 vCPUs.
However, you seem to be right - I'm not seeing the same problem on guests running 1 and 2 vCPUs.

I wonder if this is not a host clock problem that causes automount's high-resolution timers to go wild.

- Gilboa
(0010448)
roflcopter69 (reporter)
2009-12-03 20:20

Another quick thing I'm seeing: on the one- and two-processor systems, it doesn't look like automount is EVER unmounting the directories. I have it NFS-mounting home directories, and hours upon hours later, "ls /home" still shows every user who has ever logged in as mounted... even though at least one of them has been logged out for a very long time.
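(Side note for anyone debugging the unmount behavior: how quickly automount expires idle mounts is controlled by the map timeout, settable per map in /etc/auto.master via the real autofs --timeout option. A hedged sketch with an illustrative 60-second value, much shorter than the usual default of several minutes, to make expiry easy to observe:)

```
/home /etc/auto.home --timeout=60
```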

Are you also seeing this behavior? I may be upgrading the host to Fedora 12 soon, maybe that will fix it, but I feel like this is more of a client issue...
(0010472)
gilboa (reporter)
2009-12-08 09:41

At least as far as I could see, autofs seems to unmount just fine. However, I've yet to see whether it unmounts during an I-eat-100%-CPU episode.
I'll report back when I know more.

- Gilboa
(0010473)
gilboa (reporter)
2009-12-08 12:06

OK. This is weird.
Automount started chewing away at the CPU; I checked, and it had no active mounts.
I accessed one of the mounts under /autofs (effectively mounting a remote NFS share) and automount stopped eating CPU.

Something is seriously wrong with automount's idle function. I'll try to free some time and look at its code.

- Gilboa
(0010518)
toracat (developer)
2009-12-12 00:04

@roflcopter69 and gilboa,

How is the time keeping on the troubled guests? Is the clock running accurately?

Which kernel version is the guest running? Does the problem occur with 5.4's original kernel (2.6.18-164)?

Do you have any 32-bit guest on the same host, and if so, do you see the same problem there?
(0010519)
gilboa (reporter)
2009-12-13 14:24

Time keeping looks OK.
Running the latest 164-series kernel.
I've got a 32-bit guest on the same host working just fine.

This problem seems to be limited to 64-bit guests running the 5.4 kernel.

- Gilboa
(0010520)
gilboa (reporter)
2009-12-13 14:25

P.S. I'll try to reduce the core count to 1/2 and see if I can reproduce this problem.

- Gilboa
(0010521)
roflcopter69 (reporter)
2009-12-13 19:22

The time is running very accurately here; I have all hosts running NTP, so the clock difference is about 4x10^-6 seconds. The guest is running the latest kernel and has had this problem since I first installed and configured it. I do not have any 32-bit guests or hosts to test this on, sorry.
(0010523)
toracat (developer)
2009-12-13 19:53

Thanks, both, for the feedback. I am seeing a similar issue on a system I log onto, which is a CentOS-5 x86_64 guest on a CentOS-5 host. The clock runs irregularly (ntp running). A 32-bit guest is fine. The issue seems to go away if we go back to kernel 2.6.18-164.

I expect more details from the person who manages the system. :)
(0010532)
stindall (reporter)
2009-12-15 01:33

...from the person who manages the system: :-)

This is on an AMD Phenom II 920 quad core system with 8GB memory.

All guests referenced below are running on a C5.4/KVM host using the 2.6.18-164.6.1.el5 kernel and are using virtio block and net drivers.

Healthy guests running light loads on this system typically keep accurate time to within a second per day with the help of ntpd, but that does degrade under high loads and ntpdate is used every 8 hr to force time resynchronization.

First, the good news:

* The 32-bit C5.4 guest using the 2.6.18-164.6.1.el5 kernel runs fine. No automount or other issues. (I really like 32-bit guests. :-)

Mixed news:

* The 64-bit C5.4 guest using the 2.6.18-164.el5 kernel (the installation kernel) runs fine also. No automount or other issues.

* The same 64-bit C5.4 guest using the 2.6.18-164.6.1.el5 kernel has lots of issues.

** About 10min after boot, 1 vcpu pegs at 100% with top showing automount to be the culprit. The automount issue can be controlled by commenting out the +auto.master line in /etc/auto.master.

** As seen using date and sampled every 5 sec or so, time jumps around by several minutes, sometimes correct and sometimes many minutes fast or slow.

** Shutdown is painfully slow, taking 5 sec or more to stop each service as viewed via virt-viewer. virsh shutdown <guest> may take 5-10 min before the guest stops (if you have not already issued the virsh destroy command), whereas normal guests are down in maybe 30 sec.

* Thinking the 64-bit guest may be trashed, I built a new one from scratch and as soon as the 2.6.18-164.6.1.el5 kernel was installed and booted, all the above problems appeared.


Other than the automount issue, there are no smoking guns in the guest syslog.
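(The date-sampling check described above can be scripted; this is a minimal sketch, with an arbitrary interval and sample count, that prints the wall-clock delta between consecutive samples. On a healthy guest each delta should match the sleep interval; on an affected guest the deltas diverge by minutes:)

```shell
#!/bin/sh
# Sample the system clock and print the delta between consecutive samples.
# On a healthy guest each delta is ~1 second; on an affected guest it can
# jump by minutes in either direction.
prev=$(date +%s)
for i in 1 2 3; do
    sleep 1
    now=$(date +%s)
    echo "delta=$((now - prev))s"
    prev=$now
done
```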
(0010544)
gilboa (reporter)
2009-12-17 09:03

I just checked, but my 32-bit and 64-bit guests do not show any sign of time drift (at least not at the level of seconds).

As for load, I managed to trigger the automount bug even with both the guest and the host more or less idle.

- Gilboa
(0010622)
roflcopter69 (reporter)
2009-12-28 19:02

This has probably been said already, but I can confirm massive clock drift. Under heavy load, nagios went wild, reporting that the clock drift on some machines got as high as 600+ seconds over about a 60-second period. This bug has always hit me in idle situations, but I never noticed the clock skew was this bad until I put the server under extremely heavy number-crunching load.
(0010624)
toracat (developer)
2009-12-28 19:10

@roflcopter69,

The clock drift problem in KVM is actually being tracked in bug #4058. Could you try the centosplus test kernel offered there? You can find it here:

http://centos.toracat.org/kernel/centos5/centosplus-testing/x86_64/

With this patched kernel, we no longer see the clock problem, even after stressing the CPUs.
(0010666)
toracat (developer)
2010-01-02 22:01

Changing category to (now available) kvm.
(0010669)
gilboa (reporter)
2010-01-03 09:25

@toracat,

This is a host kernel fix, right?
(0010671)
toracat (developer)
2010-01-03 12:39

The fix is for the guest (CentOS) kernel.
(0010740)
roflcopter69 (reporter)
2010-01-10 01:03

Posted this on the wrong ticket, my bad:

Updated my kernel today, and right now I'm looking at htop reporting that one CPU is pegged on both of my KVMs with >1 processor. It's saying that automount is the reason the CPU is pegged, and automount still never unmounts user home directories as it should (and does on my Fedora box).
(0010744)
toracat (developer)
2010-01-10 08:19

The following notes have been copied from bug #4058.

=================================================
stindall 2010-01-10 01:28

@roflcopter69,

So far, I have the 2.6.18-164.10.1.el5.centos.plus kernel installed on three 64-bit C5.4 guests and the automount problem has not occurred (so far).

My past experience was that there was no automount issue under the 2.6.18-164.el5 kernel, but it occurred under the 2.6.18-164.2.1.el5 and 2.6.18-164.6.1.el5 kernels.

I never tried the later kernels and have been running under the 2.6.18-164.el5 kernel and later under the 2.6.18-164.9.1.kvmmd.el5.ayplus kernel, both without automount issues.

All guests are run under C5.4/KVM hosts and the hosts are running the 2.6.18-164.10.1.el5 kernel.

=================================================

roflcopter69 2010-01-10 02:39

Uploading a picture of htop running on all the KVMs, the middle one showing the vCPU being pegged by automount. The hypervisor is being pegged by 16 "dd if=/dev/urandom of=/dev/null" processes. The kernel on all guests is 2.6.18-164.10.1.el5.

=================================================

toracat 2010-01-10 03:27

@roflcopter69

You wrote, "Kernel on all guests is 2.6.18-164.10.1.el5."

You are not running the centosplus kernel? The standard kernel does not (yet) have the patches.

=================================================
(0010885)
toracat (developer)
2010-01-28 18:16

The distro kernel 2.6.18-164.11.1 now has the patches that were added to the cplus kernel. The automount problem has not occurred on this kernel as far as I can tell.

If there is no more comment on this issue, the ticket will be closed as "resolved".
(0010905)
toracat (developer)
2010-01-31 17:15

Closing as "resolved" - fixed as of kernel-2.6.18-164.11.1.

- Issue History
Date Modified Username Field Change
2009-11-10 03:00 roflcopter69 New Issue
2009-12-01 10:59 gilboa Note Added: 0010435
2009-12-02 05:56 roflcopter69 Note Added: 0010437
2009-12-02 12:11 gilboa Note Added: 0010439
2009-12-03 20:20 roflcopter69 Note Added: 0010448
2009-12-08 09:41 gilboa Note Added: 0010472
2009-12-08 12:06 gilboa Note Added: 0010473
2009-12-12 00:04 toracat Note Added: 0010518
2009-12-13 14:04 toracat Status new => feedback
2009-12-13 14:24 gilboa Note Added: 0010519
2009-12-13 14:25 gilboa Note Added: 0010520
2009-12-13 19:22 roflcopter69 Note Added: 0010521
2009-12-13 19:53 toracat Note Added: 0010523
2009-12-13 19:53 toracat Status feedback => acknowledged
2009-12-15 01:33 stindall Note Added: 0010532
2009-12-17 09:03 gilboa Note Added: 0010544
2009-12-28 16:46 toracat Relationship added related to 0004058
2009-12-28 19:02 roflcopter69 Note Added: 0010622
2009-12-28 19:10 toracat Note Added: 0010624
2010-01-02 22:01 toracat Note Added: 0010666
2010-01-02 22:01 toracat Category autofs => kvm
2010-01-03 09:25 gilboa Note Added: 0010669
2010-01-03 12:39 toracat Note Added: 0010671
2010-01-10 01:03 roflcopter69 Note Added: 0010740
2010-01-10 08:19 toracat Note Added: 0010744
2010-01-28 18:16 toracat Note Added: 0010885
2010-01-31 17:15 toracat Note Added: 0010905
2010-01-31 17:15 toracat Status acknowledged => resolved
2010-01-31 17:15 toracat Resolution open => fixed
2010-01-31 17:15 toracat Fixed in Version => 5.4


Copyright © 2000 - 2014 MantisBT Team
Powered by Mantis Bugtracker