CentOS Bug Tracker
CentOS Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0005634CentOS-5kernelpublic2012-03-28 20:052012-06-03 12:29
Reporterdjlarsu 
PriorityhighSeveritymajorReproducibilityalways
StatusresolvedResolutionfixed 
Platformx86_64OSCentOS Version5.8
Product Version5.8 
Target VersionFixed in Version 
Summary0005634: Apache hang on kernel 2.6.18-308.1.1.el5 with large StartServers and low MaxSpareServers
DescriptionI am able to reliably reproduce an Apache hang (clients connect and wait forever for response) using these preform mpm settings and kernel 2.6.18-308.1.1.el5. Booting off an earlier kernel (kernel-2.6.18-274.12.1.el5) allows this to work as expected.

Steps To Reproducehttpd.conf

<IfModule prefork.c>
StartServers 1024
MinSpareServers 5
MaxSpareServers 20
ServerLimit 1024
MaxClients 1024
MaxRequestsPerChild 4000
</IfModule>

apachectl restart
while [ 1 == 1 ]; do curl -sk http://localhost [^] -o /dev/null; done

After several minutes, apache stops responding to requests. (Unresponsive state occurs naturally after about 10 minutes, but above curl loop speeds up the process to just a few)
TagsNo tags attached.
Attached Files

- Relationships

-  Notes
(0014768)
jotazuz (reporter)
2012-03-29 19:43

I'm having the same issue since I upgraded from CentOS 5.7 to CentOS 5.8, but I cannot reproduce it. My server has hanged (not chashed) 4 times since then.

I run a strace on several of the httpd procceses and it seams they were waiting for some kind of lock:

[~]# strace -p22681

Process 22681 attached - interrupt to quit
semop(950276, 0x2b7b27801da0, 1

[~]# ipcs -s -i 950276

Semaphore Array semid=950276
uid=48 gid=48 cuid=0 cgid=0
mode=0600, access_perms=0600
nsems = 1
otime = Thu Mar 29 14:57:36 2012
ctime = Sun Mar 25 04:02:23 2012
semnum value ncount zcount pid
0 0 919 0 22670
(0014769)
djlarsu (reporter)
2012-03-29 20:26

I believe the children waiting in semop (when there's no work to be done) is normal. I'm see the same thing in strace for both afflicted and immune configurations.

When my httpd is in the hung state, I see `netstat -s` increment "times the listen queue of a socket overflowed" with each new connection. That's consistent with httpd workers not becoming ready for new requests. My /server-status appears normal, however.

My strace -f -p <parent> shows:
connect(30, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=30, events=POLLOUT}], 1, 3000) = 1 ([{fd=30, revents=POLLOUT}])
getsockopt(30, SOL_SOCKET, SO_ERROR, [-9092130858629857280], [4]) = 0
write(30, "OPTIONS * HTTP/1.0\r\nUser-Agent: "..., 70) = 70
close(30) = 0
wait4(-1, 0x7fff8077789c, WNOHANG|WSTOPPED, NULL) = 0
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
write(8, "!", 1) = 1
socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 30
fcntl(30, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(30, F_SETFL, O_RDWR|O_NONBLOCK) = 0
(0014827)
pbennes (reporter)
2012-04-12 19:56

I can reproduce this issue on a fresh CentOS 5.8 i386 installation. The contributing factors appear to be multiple listen directives, prefork mpm, and MaxClients > 1000. If either only a single listen directive, or MaxClients <= 1000 is set, the issue does not manifest. Starting from a fresh installation:

yum -y install mod_ssl
patch -p0 << EOF
--- /etc/httpd/conf/httpd.conf 2012-04-12 12:34:41.000000000 -0700
+++ /etc/httpd/conf/httpd.conf.new 2012-04-12 12:35:21.000000000 -0700
@@ -98,11 +98,11 @@
 # MaxClients: maximum number of server processes allowed to start
 # MaxRequestsPerChild: maximum number of requests a server process serves
 <IfModule prefork.c>
-StartServers 8
+StartServers 1001
 MinSpareServers 5
 MaxSpareServers 20
-ServerLimit 256
-MaxClients 256
+ServerLimit 1001
+MaxClients 1001
 MaxRequestsPerChild 4000
 </IfModule>
EOF
service httpd start
ab -n 2000 -c 1 http://127.0.0.1/icons/blank.gif [^]

Apache bench will die around the 1000th request:
[root@localhost ~]# ab -n 2000 -c 1 http://127.0.0.1/icons/blank.gif [^]
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/ [^]
Copyright 2006 The Apache Software Foundation, http://www.apache.org/ [^]

Benchmarking 127.0.0.1 (be patient)
Completed 200 requests
Completed 400 requests
Completed 600 requests
Completed 800 requests
apr_poll: The timeout specified has expired (70007)
Total of 997 requests completed
[root@localhost ~]#

The apache processes remain running but no longer serve requests.

pstack and strace of the parent root owned httpd process:

[root@localhost ~]# pstack $(pgrep -u root httpd)
#0 0x00eb1402 in __kernel_vsyscall ()
#1 0x002f12cd in ___newselect_nocancel () from /lib/libc.so.6
#2 0x00744f49 in apr_sleep () from /usr/lib/libapr-1.so.0
#3 0x00204616 in ap_wait_or_timeout ()
#4 0x0020f223 in ap_mpm_run ()
#5 0x001e5277 in main ()
[root@localhost ~]# strace -ttfp $(pgrep -u root httpd)
Process 5551 attached - interrupt to quit
12:55:02.534681 select(0, NULL, NULL, NULL, {0, 530000}) = 0 (Timeout)
12:55:03.065286 write(8, "!", 1) = 1
12:55:03.065791 socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 15
12:55:03.066107 fcntl64(15, F_GETFL) = 0x2 (flags O_RDWR)
12:55:03.066350 fcntl64(15, F_SETFL, O_RDWR|O_NONBLOCK) = 0
12:55:03.066626 connect(15, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
12:55:03.067122 poll([{fd=15, events=POLLOUT}], 1, 3000) = 1 ([{fd=15, revents=POLLOUT}])
12:55:03.067373 getsockopt(15, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
12:55:03.067707 write(15, "OPTIONS * HTTP/1.0\r\nUser-Agent: "..., 85) = 85
12:55:03.068183 close(15) = 0
12:55:03.068474 waitpid(-1, 0xbfc11e68, WNOHANG|WSTOPPED) = 0
12:55:03.068812 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
12:55:04.069254 write(8, "!", 1) = 1
12:55:04.069780 socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 15
12:55:04.070166 fcntl64(15, F_GETFL) = 0x2 (flags O_RDWR)
12:55:04.070487 fcntl64(15, F_SETFL, O_RDWR|O_NONBLOCK) = 0
12:55:04.070781 connect(15, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
12:55:04.071154 poll([{fd=15, events=POLLOUT}], 1, 3000) = 1 ([{fd=15, revents=POLLOUT}])
12:55:04.071405 getsockopt(15, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
12:55:04.071700 write(15, "OPTIONS * HTTP/1.0\r\nUser-Agent: "..., 85) = 85
12:55:04.072137 close(15) = 0
12:55:04.072386 waitpid(-1, 0xbfc11e68, WNOHANG|WSTOPPED) = 0
12:55:04.072731 select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
12:55:05.073214 write(8, "!", 1) = 1
12:55:05.073783 socket(PF_INET6, SOCK_STREAM, IPPROTO_IP) = 15
12:55:05.074052 fcntl64(15, F_GETFL) = 0x2 (flags O_RDWR)
12:55:05.074284 fcntl64(15, F_SETFL, O_RDWR|O_NONBLOCK) = 0
12:55:05.074516 connect(15, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "::", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EINPROGRESS (Operation now in progress)
12:55:05.074798 poll([{fd=15, events=POLLOUT}], 1, 3000) = 1 ([{fd=15, revents=POLLOUT}])
12:55:05.075036 getsockopt(15, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
12:55:05.075331 write(15, "OPTIONS * HTTP/1.0\r\nUser-Agent: "..., 85) = 85
12:55:05.075695 close(15) = 0
12:55:05.076027 waitpid(-1, 0xbfc11e68, WNOHANG|WSTOPPED) = 0
12:55:05.076479 select(0, NULL, NULL, NULL, {1, 0} <unfinished ...>
Process 5551 detached

And of a child process:
[root@localhost ~]# pstack $(pgrep -u apache httpd | head -1)
#0 0x00eb1402 in __kernel_vsyscall ()
#1 0x002fa10b in semop () from /lib/libc.so.6
#2 0x007383ea in ?? () from /usr/lib/libapr-1.so.0
#3 0x007379c2 in apr_proc_mutex_lock () from /usr/lib/libapr-1.so.0
#4 0x0020e8b5 in ?? ()
#5 0x0020ed31 in ?? ()
#6 0x0020ee0a in ?? ()
#7 0x0020f96b in ap_mpm_run ()
#8 0x001e5277 in main ()
[root@localhost ~]# strace -ttfp $(pgrep -u apache httpd | head -1)
Process 5558 attached - interrupt to quit
12:55:38.531264 semop(393219, 0x74c6dc, 1 <unfinished ...>
Process 5558 detached
[root@localhost ~]#
(0014868)
pbennes (reporter)
2012-04-19 20:18

The issue may be related to epoll path limits and upstream bug 681578 https://bugzilla.redhat.com/show_bug.cgi?id=681578. [^] I recompiled a CentOS kernel from kernel-2.6.18-308.el5.src.rpm and made the following change:

--- kernel-2.6.18-redhat.patch 2012-04-19 13:10:47.000000000 -0700
+++ kernel-2.6.18-redhat.patch.NEW 2012-04-19 13:10:34.000000000 -0700
@@ -2162830,7 +2162830,7 @@
 + * is connected to n file sources. In this case each file source has 1 path
 + * of length 1. Thus, the numbers below should be more than sufficient.
 + */
-+int path_limits[PATH_ARR_SIZE] = { 1000, 500, 100, 50, 10 };
++int path_limits[PATH_ARR_SIZE] = { 2000, 500, 100, 50, 10 };
 +int path_count[PATH_ARR_SIZE];
 +
 +static int path_count_inc(int nests)

This change allows for apache bench to complete successfully (apache remains stable) at a max clients setting of 1001. Using the same patched kernel I tested settings of 2000 and 2001 and experienced the same types of failure results as with the CentOS kernel limits at 1000/1001.
(0014870)
toracat (developer)
2012-04-19 21:19

For the record, the patch reported in the upstream BZ 681578 is:

--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -986,6 +986,9 @@ int path_count[PATH_ARR_SIZE];
 
 static int path_count_inc(int nests)
 {
+ if (nests == 0)
+ return 0;
+
     if (++path_count[nests] > path_limits[nests])
         return -1;
     return 0;
(0014871)
pbennes (reporter)
2012-04-20 00:35

This patch appears to resolve my issue.
(0015038)
jcats (reporter)
2012-05-07 16:07

I am having the same issue but my config looks like:

<IfModule mpm_prefork_module>
    StartServers 5
    MinSpareServers 5
    MaxSpareServers 10
    ServerLimit 450
    MaxClients 450
    MaxRequestsPerChild 10000
</IfModule>

Its completely random, the server is serving tons of images constantly.
The apache processes remain running but no longer serve requests.

A simple restart of Apache and its back up online, is this non related since my startsevers is below 1000
(0015056)
pbennes (reporter)
2012-05-09 19:39

jcats,

You can try rolling a custom kernel with the patch applied and see if that resolves your issue, until the patch is applied by the vendor. If it does, you know that this issue is the cause of your problem.
(0015057)
toracat (developer)
2012-05-09 20:18

I can apply the patch in the next update of centosplus kernel.
(0015060)
toracat (developer)
2012-05-10 16:19

A test version of the centosplus kernel that has the patch ( from note 14870 ) applied is now available from:

http://centos.toracat.org/kernel/centos5/centosplus-testing/ [^]

kernel version:

kernel-2.6.18-308.4.1.el5.centos.bug5634.ayplus

Please note that they are not signed and are provided for testing purposes only.
(0015204)
JohnnyHughes (administrator)
2012-06-02 15:08

The epoll issues are supposedly fixed in the latest released kernel ... *kernel-2.6.18-308.8.1.el5*. Can we verify this issue can be closed.
(0015205)
toracat (developer)
2012-06-02 15:35

I confirm that the patch noted in 14870 is in kernel-2.6.18-308.8.1.el5.
(0015210)
JohnnyHughes (administrator)
2012-06-03 12:29

fixed in 2.6.18-308.8.1

- Issue History
Date Modified Username Field Change
2012-03-28 20:05 djlarsu New Issue
2012-03-29 19:43 jotazuz Note Added: 0014768
2012-03-29 20:26 djlarsu Note Added: 0014769
2012-04-12 19:56 pbennes Note Added: 0014827
2012-04-19 20:18 pbennes Note Added: 0014868
2012-04-19 21:19 toracat Note Added: 0014870
2012-04-20 00:35 pbennes Note Added: 0014871
2012-05-07 16:07 jcats Note Added: 0015038
2012-05-09 19:39 pbennes Note Added: 0015056
2012-05-09 20:18 toracat Note Added: 0015057
2012-05-10 16:19 toracat Note Added: 0015060
2012-06-02 15:08 JohnnyHughes Note Added: 0015204
2012-06-02 15:35 toracat Note Added: 0015205
2012-06-03 12:29 JohnnyHughes Note Added: 0015210
2012-06-03 12:29 JohnnyHughes Status new => resolved
2012-06-03 12:29 JohnnyHughes Resolution open => fixed


Copyright © 2000 - 2014 MantisBT Team
Powered by Mantis Bugtracker