CentOS Bug Tracker

View Issue Details
ID: 0005971
Project: CentOS-6
Category: autofs
View Status: public
Date Submitted: 2012-09-22 00:14
Last Update: 2014-12-02 17:32
Reporter: DavidWarme
Priority: normal
Severity: major
Reproducibility: always
Status: new
Resolution: open
Platform: x86_64
OS: CentOS
OS Version: 6.3
Product Version: 6.3
Target Version:
Fixed in Version:
Summary: 0005971: autofs fails during first boot after install
Description
We install CentOS 6.3 using a kickstart file (attached). During the first
reboot after a successful install, autofs does not work correctly,
e.g., failing to mount home directories (or any others) in our
LDAP-resident maps. We get the following message in the system log:

Sep 19 16:51:19 automount[6554]: bind_ldap_simple: lookup(ldap): Unable
to bind to the LDAP server: (default), error Can't contact LDAP server

Rebooting "seems" to make the problem go away -- although I have
personally witnessed the following sequence of events in one log file:

    <First boot after install>
    ...
    ... Can't contact LDAP server
    ...
    <Second reboot>
    ...
    <Third reboot>
    ...
    ... Can't contact LDAP server

My concern is that this can potentially be a repeatable problem, not
just during the "first boot" after installation. If I reboot a dozen
times after first boot, it does not seem to fail -- but it always fails
during first boot. The subsequent failure listed in the log sketch
above probably resulted from some sort of manual action I took,
e.g., bringing the interface down and then up again, restarting
NetworkManager, etc. Unfortunately, I do not know what I did that made
it occur during that subsequent boot. I can say unequivocally, however,
that it *has* happened during at least one later boot.

Looking at the autofs source code, I observed that this error message is
produced within lookup_ldap.c when a single call into the OpenLDAP
library fails:

        rv = ldap_simple_bind_s (ldap, ...);

My first approach to tracking this down further was to prepare a special
version of the autofs RPM that enabled the debugging trace output
features of the OpenLDAP library. All this showed was that the failure
was happening at the point where the ldap_connect_to_host() routine (in
openldap-2.4.23/libraries/libldap/os-ip.c) performs a call to
getaddrinfo(), with the following message in the debugging output:

    ldap_connect_to_host: getaddrinfo failed: Name or service not known

Since the service is specified as a numeric port number, that leaves the
host name that is failing to resolve. The source code for getaddrinfo()
is quite gnarly and there was no way to easily enable debugging of it
(during first boot, that is). My hunch was that the /etc/resolv.conf
file was messed up, preventing the resolver from looking up the IP
address for our LDAP server "ldap-tls.group-w-inc.com". This turned out
not to be the case.

My second approach was more successful. I modified my kickstart file
to alter the /etc/rc.d/init.d/autofs file so that the
/usr/sbin/automount program is invoked using strace. The fragments of
strace output that proved enlightening are as follows:

...

6554 16:51:19.063975 open("/etc/resolv.conf", O_RDONLY) = 7
6554 16:51:19.064149 fstat(7, {st_dev=makedev(253, 1), st_ino=9830415, st_mode=S_IFREG|0644, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=8, st_size=80, st_atime=2012/09/19-16:50:54, st_mtime=2012/09/19-16:50:53, st_ctime=2012/09/19-16:50:53}) = 0
6554 16:51:19.064239 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0f6c483000
6554 16:51:19.064291 read(7, "; generated by /sbin/dhclient-script\nsearch group-w-inc.com\nnameserver 10.1.1.1\n", 4096) = 80

...

6554 16:51:19.069263 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 7
6554 16:51:19.069480 connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.1")}, 16) = -1 ENETUNREACH (Network is unreachable)
6554 16:51:19.069552 close(7) = 0
6554 16:51:19.069606 socket(PF_INET, SOCK_DGRAM|SOCK_NONBLOCK, IPPROTO_IP) = 7
6554 16:51:19.069656 connect(7, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.1.1.1")}, 16) = -1 ENETUNREACH (Network is unreachable)

...

So the /etc/resolv.conf file is actually reasonable:

    ; generated by /sbin/dhclient-script
    search group-w-inc.com
    nameserver 10.1.1.1

The "eth0" interface is supposed to be on the "10.1.1.x" network (i.e.,
our DNS server is on the directly connected LAN), so if the *network* is
unreachable, it must be that the network interface itself is "down", has
had its address wiped, or is otherwise being mucked with in an unsavory
manner.
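As an aside, the init-script modification described above (invoking
automount under strace) can be sketched as a sed edit. It is demonstrated
here against a stand-in copy, since the real target is
/etc/rc.d/init.d/autofs and its exact invocation line may differ:

```shell
# Demonstrate the edit on a stand-in copy of the init script; the sample
# line below is only an assumed stand-in for the real automount invocation.
INITSCRIPT=./autofs.initscript.demo
printf 'daemon /usr/sbin/automount $OPTIONS\n' > "$INITSCRIPT"

# Prefix the automount invocation with strace:
#   -f  follow forked children
#   -tt timestamp each syscall with microsecond resolution
#   -o  write the trace to a file instead of stderr
sed -i 's|/usr/sbin/automount|strace -f -tt -o /var/log/automount.trace /usr/sbin/automount|' \
    "$INITSCRIPT"

cat "$INITSCRIPT"
```

In a kickstart, the same sed line can be run from the %post section
against the real init script.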

Note that this appears to be a transient (though perhaps not short in
duration) state during "first boot." Once the system has finished
booting, I am indeed able, e.g., to "ssh" into this system as root, so
by this time the network interface is definitely back in a reasonable
state. (Users other than root don't fare as well, since they don't have
home directories.)

This is weird. The "autofs" service is being started about 15 *seconds*
after the "bringing up eth0" message, and perhaps 10 seconds after
NetworkManager starts. I have no idea what could be making the
interface invalid this long after these system init events. I know that
our DHCP server is not that slow.

I'm not too sure how to track this down any further. For example, I
could start a process early on during first boot that repeatedly does
the equivalent of an "ifconfig eth0" and "netstat -rn", together with
high-resolution time stamps, to observe the approximate moments in time
when the interface is getting altered. I could even bracket the
ldap_simple_bind_s() call with similar timestamps in the debugging
output to assure me that autofs is indeed attempting its LDAP binding
during a time window where the interface is out to lunch. But this
would give me no clue as to *which* process is actually modifying the
interface.
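The monitoring idea above could be sketched roughly like this (the
interface, snapshot count, and log path are my own illustrative defaults;
the script falls back to iproute2 where the old net-tools commands are
absent):

```shell
# net-watch.sh -- repeatedly snapshot interface and routing state with
# high-resolution timestamps. In the scenario above it would be started
# early in the boot sequence against eth0 with a large COUNT.
IFACE="${IFACE:-eth0}"
COUNT="${COUNT:-3}"              # number of snapshots to take
LOG="${LOG:-./net-watch.log}"

i=0
while [ "$i" -lt "$COUNT" ]; do
    {
        date '+%s.%N'                                        # high-resolution timestamp
        ifconfig "$IFACE" 2>&1 || ip addr show "$IFACE" 2>&1 # interface state
        netstat -rn 2>&1 || ip route 2>&1                    # routing table
        echo '---'
    } >> "$LOG"
    i=$((i + 1))
    sleep 0.2    # pause between snapshots
done
```

Correlating the log's timestamps with the bind failure time in the
automount strace would confirm the bad window, though not its cause.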

Perhaps there is some sort of kernel cmdline flag that turns on
debugging info within the network stack that would provide visibility
into which processes are doing what alterations to interface
configurations, routing tables, etc.? If I knew such flags, I could
easily make my kickstart file modify the grub.conf file so that first
boot was invoked with the necessary additional kernel incantations.
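I am not aware of such a kernel flag offhand, but as a userspace
alternative (my suggestion, not something from this report), iproute2's
`ip monitor` prints link, address, and route changes as they occur. It
shows *when* the interface is altered, though still not by which process:

```shell
# Watch for link/address/route changes; -timestamp prefixes each event
# with the time it was received. Writing to ./ip-monitor.log here;
# at boot, a path under /var/log would be more natural.
ip -timestamp monitor link addr route > ./ip-monitor.log 2>&1 &
MONPID=$!

# ... let it run across the suspect time window, then stop it:
sleep 1
kill "$MONPID" 2>/dev/null
```

Attributing the changes to a specific process would need something more
like the audit subsystem.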

I have not had too much luck finding documentation regarding the
architecture of NetworkManager -- how it works, what you can expect it
to do, when and why. NetworkManager is at the top of my list of
suspects, but the apparent lack of documentation is making it hard for
me to proceed here. Is it really the case that 15 seconds after
"bringing up eth0" and 10 seconds after NetworkManager is started that
the interface is still in a bogus state (assuming essentially
instantaneous response from the DHCP server, which is what we really do
get here)? Does NetworkManager have some massive amount of extra work
that it does during first boot that it doesn't have to repeat during
subsequent boots?

My fear is that this is not really just a "first boot" problem -- that
literally any boot can have this race condition between various startup
processes causing the automounter to fail. I do not want to roll out
many replications of this system (I'm using kickstart for a reason)
without having a much clearer understanding of what is happening here
and why.

This is, of course, made much worse by the fact that automount
apparently never retries binding with the LDAP server if the connection
ever fails. Once it fails, the automounter seems to be permanently dead
(until somebody comes along and manually either restarts autofs or
reboots the system, in the hope that the race condition will not recur).
This problem also needs to be fixed quite urgently.
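Until automount retries on its own, one stopgap (a workaround sketch of my
own, not from this report) is a periodic health check that restarts autofs
when a known automounted path stops resolving; it could run from cron
every minute or so. The path and restart commands are assumptions:

```shell
# check_autofs: succeed if DIR is listable; otherwise try to restart autofs.
# DIR should be a path the automounter is expected to serve (e.g., a test
# user's home directory) -- a hypothetical choice to be filled in locally.
check_autofs() {
    dir="$1"
    if ls "$dir" >/dev/null 2>&1; then
        return 0
    fi
    # On CentOS 6 either form works; both are allowed to fail quietly here.
    service autofs restart 2>/dev/null || /etc/init.d/autofs restart 2>/dev/null
}

# Demonstrate against a directory that certainly exists:
check_autofs /tmp && echo "mount check OK"
```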
Steps To Reproduce: See the attached kickstart file.
Tags: No tags attached.
Attached Files:
    autofs.bug_report (7,390 bytes) 2012-09-22 00:14
    ks.cfg (5,779 bytes) 2012-09-22 00:15


- Notes
(0015866)
DavidWarme (reporter)
2012-10-01 17:47

The main issue reported here has been resolved.

The "network" init script brings up eth0, but when NetworkManager starts up,
NM brings eth0 back down again. NetworkManager very quickly determines the
IPv4 settings via DHCP, but then begins the IPv6 auto-configuration process.
Because there is nothing on our LAN to respond to any of this, it takes
quite a long time for all of the attempts to time out before NetworkManager
throws in the towel and brings up the interface with only IPv4 settings
present. By that time, the automounter has already failed.

Specifying --noipv6 in our kickstart file and forcing NETWORKWAIT=yes in
/etc/sysconfig/network resolves this problem with the first boot after
installation of CentOS.
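For anyone hitting the same thing, the two changes amount to fragments
like these (the kickstart network line is abbreviated; the other options
in the real ks.cfg are omitted here):

```
# ks.cfg: disable IPv6 on the interface (other options elided)
network --device=eth0 --bootproto=dhcp --noipv6

# /etc/sysconfig/network: make the network init script wait for the link
NETWORKING=yes
NETWORKWAIT=yes
```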

The remaining issue -- that the automount program never retries after any
failure to contact the LDAP server -- is still a very major reliability /
robustness issue.

- Issue History
Date Modified Username Field Change
2012-09-22 00:14 DavidWarme New Issue
2012-09-22 00:14 DavidWarme File Added: autofs.bug_report
2012-09-22 00:15 DavidWarme File Added: ks.cfg
2012-10-01 17:47 DavidWarme Note Added: 0015866

