2016-12-11 13:59 UTC

View Issue Details
ID: 0003373
Project: CentOS-4
Category: nscd
View Status: public
Last Update: 2013-03-23 21:47
Reporter: jweage
Priority: normal
Severity: major
Reproducibility: random
Status: resolved
Resolution: fixed
Product Version: 4.7
Target Version:
Fixed in Version:
Summary: 0003373: nscd uses 100% CPU and stops responding
Description: I have a 48-node engineering compute cluster running CentOS 4.7. I've noticed nscd processes using 100% CPU on several machines. 'service nscd restart' is not able to stop the process, so I have to kill the offending nscd manually.

I've enabled debugging and the log file and restarted nscd, but I don't see any error messages. I've disabled nscd for now, as this issue significantly degrades performance on the cluster.
Tags: No tags attached.
Attached Files:

Notes

~0008921

james-p (reporter)

I also have this issue with CentOS 4.7 on a number of machines

nscd will only respond to a 'kill -9'

lsof shows that nscd has /var/run/nscd/socket opened twice - machines running nscd 'normally' have /var/run/nscd/socket opened once.
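
For anyone who wants to check many machines without reading lsof output by hand, the duplicate-socket symptom could probably be detected from /proc. This is only a sketch under assumptions: it assumes the usual Linux layout, where /proc/net/unix lists the inode and bound path of each Unix socket and /proc/<pid>/fd links sockets as socket:[inode]. The function names are made up for illustration, and it needs a current Python 3 rather than the Python shipped with CentOS 4.

```python
import os
import re

NSCD_SOCKET = "/var/run/nscd/socket"


def inodes_for_path(proc_net_unix_text, path=NSCD_SOCKET):
    """Return the socket inodes that a /proc/net/unix dump lists for `path`."""
    inodes = set()
    for line in proc_net_unix_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Columns: Num RefCount Protocol Flags Type St Inode [Path]
        if len(fields) >= 8 and fields[7] == path:
            inodes.add(fields[6])
    return inodes


def count_open_sockets(fd_targets, inodes):
    """Count fd link targets ("socket:[123]") whose inode is in `inodes`."""
    count = 0
    for target in fd_targets:
        match = re.fullmatch(r"socket:\[(\d+)\]", target)
        if match and match.group(1) in inodes:
            count += 1
    return count


def fd_targets_for_pid(pid):
    """Read the link target of every descriptor in /proc/<pid>/fd."""
    fd_dir = "/proc/%d/fd" % pid
    targets = []
    for name in os.listdir(fd_dir):
        try:
            targets.append(os.readlink(os.path.join(fd_dir, name)))
        except OSError:
            pass  # descriptor was closed while we were listing
    return targets


def nscd_socket_count(pid):
    """How many descriptors of `pid` refer to the nscd socket path."""
    with open("/proc/net/unix") as f:
        inodes = inodes_for_path(f.read())
    return count_open_sockets(fd_targets_for_pid(pid), inodes)
```

Run as root with the pid of nscd, a result above 1 would correspond to the "opened twice" state described above.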

~0008946

Malinfro (reporter)

Hi,

We recently started using CentOS 4.7 on all our clusters, approximately 300 machines. We have experienced the same problem: nscd processes are using 100% CPU on several machines. 'service nscd restart' is not able to stop the process, and nscd will only respond to a 'kill -9'. We are currently restarting nscd in a daily cron job as a workaround.

We have also noticed that on the machines where the nscd processes are using 100% CPU, 'lsof' shows /var/run/nscd/socket open twice, whereas on the machines with a normal nscd, 'lsof' shows it open once.
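
A blanket daily restart could perhaps be narrowed to restarting only when nscd is actually spinning. The following is a rough sketch, not something tested on these clusters: it assumes the standard /proc/<pid>/stat layout (utime and stime as fields 14 and 15, in clock ticks), the helper names and the 95% threshold are invented for illustration, and it needs a current Python 3.

```python
import os
import signal
import subprocess
import time


def cpu_ticks(stat_line):
    """Return utime + stime (clock ticks) from a /proc/<pid>/stat line."""
    # The command name sits in parentheses and may contain spaces,
    # so split on the last ')' before indexing the remaining fields.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before, ticks_after, seconds, hz):
    """CPU usage over the interval, as a percentage of one CPU."""
    return 100.0 * (ticks_after - ticks_before) / (hz * seconds)


def read_ticks(pid):
    with open("/proc/%d/stat" % pid) as f:
        return cpu_ticks(f.read())


def is_spinning(pid, seconds=5.0, threshold=95.0):
    """Sample the process twice and report whether it used >= threshold % CPU."""
    hz = os.sysconf("SC_CLK_TCK")
    before = read_ticks(pid)
    time.sleep(seconds)
    after = read_ticks(pid)
    return cpu_percent(before, after, seconds, hz) >= threshold


def restart_if_spinning(pid):
    """Kill a spinning nscd with SIGKILL (the only signal it answers) and start it again."""
    if not is_spinning(pid):
        return False
    os.kill(pid, signal.SIGKILL)
    subprocess.call(["service", "nscd", "start"])
    return True
```

Called from cron with the pid of nscd, this would leave healthy daemons and their caches alone.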

~0008947

Malinfro (reporter)

Initially we thought it was Red Hat Bugzilla bug 428837 (leaking file descriptors). We tried a rebuilt nscd with the patch from that bug, but it didn't solve our problem.

~0008948

james-p (reporter)

I can't find anything in the Red Hat Bugzilla that matches this problem - other Bugzilla entries about nscd using 100% CPU seem to be related to issues with LDAP - which we are not using.

Is it worth opening a Red Hat Bug about this?

~0008954

Malinfro (reporter)

Very interesting that you experience it even though you are not using LDAP.

I found the following information in Debian's bug report on the leaking file handles (bug report #401758):

the do_drop_connection portion of this patch which is not technically
required to fix the leak -- it fixes another bug: libnss-ldap is totally
broken in multithreaded programs (such as nscd) because you can't do
"close(10); dup2(14,10);" and guarantee another thread didn't re-open fd
10 in the meanwhile. the patch as included fixes this problem but only
when non-ssl connections are in use... in the case ssl connections are in
use it's just totally broken and can't be fixed. yay. (however thanks to
fixing the do_get_our_socket code the drop code is rarely called in the
dangerous manner.)
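
To illustrate the race the quoted text describes (this is not the Debian patch, just a small sketch with made-up function names): once a descriptor is closed, the very next open gets the same number back, so in a threaded program another thread can grab it between close() and dup2(). On Linux, calling dup2() on its own closes and replaces the target in one step, which avoids that window.

```python
import os


def closed_number_is_reused():
    """After close(fd), the next open returns the same descriptor number."""
    victim = os.open(os.devnull, os.O_RDONLY)
    os.close(victim)
    # Window: in a multithreaded program another thread may open a file
    # right here and receive the number we just released.
    intruder = os.open(os.devnull, os.O_RDONLY)
    reused = intruder == victim
    os.close(intruder)
    return reused


def replace_with_dup2():
    """Repoint an open descriptor at a pipe using dup2 alone, no close()."""
    read_end, write_end = os.pipe()
    target = os.open(os.devnull, os.O_WRONLY)
    # dup2 closes `target` and makes it a copy of `write_end` in one call.
    os.dup2(write_end, target)
    os.write(target, b"ok")  # goes into the pipe, not /dev/null
    data = os.read(read_end, 2)
    for fd in (read_end, write_end, target):
        os.close(fd)
    return data
```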

So it can't be that problem since you experience it and you have no LDAP connection.

We should definitely report it to Red Hat; we have found the same problem on some of our Red Hat servers as well. Will you report it or should I?

~0008955

james-p (reporter)

I don't have any machines running RHEL4.7 - so it would be 'difficult' for me to log it as RHEL4.7 issue - however, if you've seen it on RHEL4.7 boxes, then it is probably best if you log it - if that is OK?

~0008966

Malinfro (reporter)

I've reported it in Red Hat Bugzilla – Bug 492581.

~0009285

james-p (reporter)

Three more Bugzilla reports have appeared about the same subject:

495082
495083
496201

496201 includes a possible explanation and a suggested patch to fix the issue

~0009348

james-p (reporter)

I've been running nscd with the patch at <https://bugzilla.redhat.com/attachment.cgi?id=339968> on all my 4.7 systems for a week now and have not seen any running at 100% CPU.

~0009377

james-p (reporter)

It appears that this is actually a kernel bug. The glibc/nscd patch just 'papers over' this.

Bugzilla #496201 (and now #501800) has been bumped up to high/urgent priority - but not sure if it will make it into a 4.8 kernel update ...

~0009379

james-p (reporter)

Patch will be in kernel 89.0.1.EL

~0009536

james-p (reporter)

Fix will be in the errata kernel 89.0.3.EL - see:

<http://rhn.redhat.com/errata/RHSA-2009-1132.html>
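
To find machines still running a kernel older than the errata build, something like the following could be run against the output of `uname -r`. It is only a sketch: it assumes release strings of the form "2.6.9-89.0.3.EL" (optionally with a suffix such as "smp"), which is an assumption about the format and not taken from this report, and the function names are invented.

```python
import re

FIXED_RELEASE = (89, 0, 3)  # errata kernel 89.0.3.EL named in the note above


def release_tuple(uname_r):
    """Extract the numeric release, e.g. '2.6.9-89.0.3.ELsmp' -> (89, 0, 3)."""
    match = re.search(r"-(\d+(?:\.\d+)*)\.EL", uname_r)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % uname_r)
    return tuple(int(part) for part in match.group(1).split("."))


def has_fix(uname_r):
    """True if the running kernel is at or beyond the errata release."""
    return release_tuple(uname_r) >= FIXED_RELEASE
```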

~0016976

tigalch (manager)

upstream marked this as solved.

Issue History
Date Modified     Username  Field                Change
2009-01-31 20:33  jweage    New Issue
2009-03-16 11:00  james-p   Note Added: 0008921
2009-03-26 12:29  Malinfro  Note Added: 0008946
2009-03-26 14:46  Malinfro  Note Added: 0008947
2009-03-26 15:22  james-p   Note Added: 0008948
2009-03-27 09:47  Malinfro  Note Added: 0008954
2009-03-27 10:26  james-p   Note Added: 0008955
2009-03-27 14:53  Malinfro  Note Added: 0008966
2009-05-01 08:30  james-p   Note Added: 0009285
2009-05-12 14:38  james-p   Note Added: 0009348
2009-05-20 19:10  james-p   Note Added: 0009377
2009-05-21 19:00  james-p   Note Added: 0009379
2009-06-30 11:42  james-p   Note Added: 0009536
2013-03-23 21:47  tigalch   Note Added: 0016976
2013-03-23 21:47  tigalch   Status               new => resolved
2013-03-23 21:47  tigalch   Resolution           open => fixed