CentOS Bug Tracker

ID: 0003373
Project: CentOS-4
Category: nscd
View Status: public
Date Submitted: 2009-01-31 20:33
Last Update: 2013-03-23 21:47
Reporter: jweage
Priority: normal
Severity: major
Reproducibility: random
Status: resolved
Resolution: fixed
Platform:
OS:
OS Version:
Product Version: 4.7
Target Version:
Fixed in Version:
Summary: 0003373: nscd uses 100% CPU and stops responding
Description: I have a 48-node engineering compute cluster running CentOS 4.7. I've noticed nscd processes using 100% CPU on several machines. 'service nscd restart' is not able to stop the process, so I have to kill the offending nscd manually.

I've enabled debugging and the log file, then restarted nscd, but I don't see any error messages. I've disabled nscd for now, as this issue causes significant performance problems on the cluster.
Tags: No tags attached.
Attached Files

- Relationships

-  Notes
(0008921)
james-p (reporter)
2009-03-16 11:00

I also have this issue with CentOS 4.7 on a number of machines.

nscd will only respond to a 'kill -9'.

lsof shows that nscd has /var/run/nscd/socket opened twice - machines running nscd 'normally' have /var/run/nscd/socket opened once.
(0008946)
Malinfro (reporter)
2009-03-26 12:29

Hi,

We recently started using CentOS 4.7 on all our clusters, approximately 300 machines. We have experienced the same problem: nscd processes are using 100% CPU on several of the machines. 'service nscd restart' is not able to stop the process, and nscd will only respond to a 'kill -9'. We are currently restarting nscd in a daily cronjob as a workaround.
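
A rough sketch of what such a cron-driven workaround could look like, narrowed to only the nscd processes that are burning CPU. This is an illustration, not the job actually used here: the function names and the 90% threshold are made up, and it assumes a Python 3.7+ interpreter, which a stock CentOS 4 install does not ship. Note also that the `pcpu` value from `ps` is CPU time averaged over the life of the process, so a freshly stuck daemon may take a while to cross the threshold.

```python
import os
import signal
import subprocess


def runaway_pids(ps_output, cpu_threshold=90.0):
    """Return PIDs whose %CPU value meets or exceeds cpu_threshold.

    Expects lines of the form "<pid> <pcpu>", as printed by
    `ps -C nscd -o pid=,pcpu=` (the trailing '=' suppresses headers).
    """
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        pid, pcpu = int(fields[0]), float(fields[1])
        if pcpu >= cpu_threshold:
            pids.append(pid)
    return pids


def restart_runaway_nscd():
    """Kill nscd processes above the threshold, then start the service again."""
    # ps exits non-zero when no nscd process exists, so check=True is not used.
    result = subprocess.run(
        ["ps", "-C", "nscd", "-o", "pid=,pcpu="],
        capture_output=True, text=True,
    )
    stuck = runaway_pids(result.stdout)
    for pid in stuck:
        # The hung daemon only responds to SIGKILL (kill -9).
        os.kill(pid, signal.SIGKILL)
    if stuck:
        subprocess.run(["service", "nscd", "start"], check=False)
    return stuck
```

Calling `restart_runaway_nscd()` from a script run by cron (as root) would leave healthy daemons alone and only replace the stuck ones.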

We have also noticed that on the machines where the nscd processes are using 100% CPU, 'lsof' shows /var/run/nscd/socket open twice, whereas on the machines with a normal nscd 'lsof' shows it open once.
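
The double-open symptom could be checked across many machines with something like the sketch below. It only parses text that has already been captured from `lsof`; the function names are hypothetical, and it assumes the default lsof column layout (command in the first column, PID in the second, the socket path appearing as its own whitespace-separated field).

```python
from collections import Counter

NSCD_SOCKET = "/var/run/nscd/socket"


def socket_open_counts(lsof_output, socket_path=NSCD_SOCKET):
    """Count, per PID, how many lsof rows mention socket_path."""
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip short lines and the header row (its PID column is not numeric).
        if len(fields) < 3 or not fields[1].isdigit():
            continue
        if socket_path in fields[2:]:
            counts[int(fields[1])] += 1
    return dict(counts)


def suspicious_pids(lsof_output, socket_path=NSCD_SOCKET):
    """Return the PIDs that hold socket_path open more than once."""
    counts = socket_open_counts(lsof_output, socket_path)
    return sorted(pid for pid, n in counts.items() if n > 1)
```

Feeding it the output of `lsof -c nscd` from each node would flag the daemons showing the symptom described above; whether every flagged daemon is actually spinning is not something this check can tell.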
(0008947)
Malinfro (reporter)
2009-03-26 14:46

Initially we thought it was Red Hat Bugzilla Bug 428837 (leaking file descriptors). We tried a rebuilt nscd with that patch applied, but it didn't solve our problem.
(0008948)
james-p (reporter)
2009-03-26 15:22

I can't find anything in the Red Hat Bugzilla that matches this problem - other Bugzilla entries about nscd using 100% CPU seem to be related to issues with LDAP - which we are not using.

Is it worth opening a Red Hat Bug about this?
(0008954)
Malinfro (reporter)
2009-03-27 09:47

Very interesting that you experience it even though you are not using LDAP.

Found the following information in Debian's bug report on the leaking file handles (Bug report #401758):

the do_drop_connection portion of this patch which is not technically
required to fix the leak -- it fixes another bug: libnss-ldap is totally
broken in multithreaded programs (such as nscd) because you can't do
"close(10); dup2(14,10);" and guarantee another thread didn't re-open fd
10 in the meanwhile. the patch as included fixes this problem but only
when non-ssl connections are in use... in the case ssl connections are in
use it's just totally broken and can't be fixed. yay. (however thanks to
fixing the do_get_our_socket code the drop code is rarely called in the
dangerous manner.)
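
The race described in that quote can be illustrated with a single-threaded simulation, where the "other thread" is just an open() placed between the close and the dup2. This is only a sketch of the general close-then-dup2 hazard, not the libnss-ldap code; all names are made up, and it relies on the POSIX rule that open() returns the lowest free descriptor number.

```python
import os
import tempfile


def demonstrate_fd_reuse():
    """Replay 'close(fd); <another thread opens a file>; dup2(src, fd)'."""
    with tempfile.TemporaryDirectory() as tmp:
        def open_new(name):
            return os.open(os.path.join(tmp, name), os.O_CREAT | os.O_RDWR)

        target = open_new("target")  # plays the role of fd 10 in the quote
        source = open_new("source")  # plays the role of fd 14 in the quote

        os.close(target)             # first half of the unsafe sequence

        # "Another thread" opens a file in the gap and is handed the
        # descriptor number that was just freed.
        intruder = open_new("intruder")
        reused = (intruder == target)

        os.dup2(source, target)      # second half: silently replaces that file

        # The other thread's descriptor now refers to the "source" file.
        clobbered = os.path.samestat(os.fstat(intruder), os.fstat(source))

        os.close(source)
        os.close(target)             # same number as intruder, so one close
        return reused, clobbered
```

Calling dup2 on its own, without the preceding close, avoids the gap because dup2 closes and replaces the target descriptor in one step.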

So it can't be that problem, since you experience it without any LDAP connection.

We should definitely report it to Red Hat; we have found the same problem on some of our Red Hat servers as well. Will you report it or should I?
(0008955)
james-p (reporter)
2009-03-27 10:26

I don't have any machines running RHEL 4.7 - so it would be 'difficult' for me to log it as a RHEL 4.7 issue - however, if you've seen it on RHEL 4.7 boxes, then it is probably best if you log it - if that is OK?
(0008966)
Malinfro (reporter)
2009-03-27 14:53

I've reported it in Red Hat Bugzilla – Bug 492581.
(0009285)
james-p (reporter)
2009-05-01 08:30

Three more Bugzilla reports have appeared on the same subject:

495082
495083
496201

496201 includes a possible explanation and a suggested patch to fix the issue
(0009348)
james-p (reporter)
2009-05-12 14:38

I've been running nscd with the patch at <https://bugzilla.redhat.com/attachment.cgi?id=339968> on all my 4.7 systems for a week now and have not seen any running at 100% CPU.
(0009377)
james-p (reporter)
2009-05-20 19:10

It appears that this is actually a kernel bug. The glibc/nscd patch just 'papers over' this.

Bugzilla #496201 (and now #501800) has been bumped up to high/urgent priority - but not sure if it will make it into a 4.8 kernel update ...
(0009379)
james-p (reporter)
2009-05-21 19:00

Patch will be in kernel 89.0.1.EL
(0009536)
james-p (reporter)
2009-06-30 11:42

Fix will be in the errata kernel 89.0.3.EL - see:

<http://rhn.redhat.com/errata/RHSA-2009-1132.html>
(0016976)
tigalch (developer)
2013-03-23 21:47

upstream marked this as solved.

- Issue History
Date Modified Username Field Change
2009-01-31 20:33 jweage New Issue
2009-03-16 11:00 james-p Note Added: 0008921
2009-03-26 12:29 Malinfro Note Added: 0008946
2009-03-26 14:46 Malinfro Note Added: 0008947
2009-03-26 15:22 james-p Note Added: 0008948
2009-03-27 09:47 Malinfro Note Added: 0008954
2009-03-27 10:26 james-p Note Added: 0008955
2009-03-27 14:53 Malinfro Note Added: 0008966
2009-05-01 08:30 james-p Note Added: 0009285
2009-05-12 14:38 james-p Note Added: 0009348
2009-05-20 19:10 james-p Note Added: 0009377
2009-05-21 19:00 james-p Note Added: 0009379
2009-06-30 11:42 james-p Note Added: 0009536
2013-03-23 21:47 tigalch Note Added: 0016976
2013-03-23 21:47 tigalch Status new => resolved
2013-03-23 21:47 tigalch Resolution open => fixed

