Notes |
(0005768)
herrold (administrator)
2007-07-31 15:58
|
Per discussion on the IRC channel, the reporter's hardware seems to be the issue. |
|
(0005769)
cap_ (updater)
2007-07-31 15:59
|
Talked to the reporter and ran this exact script:
dd if=/dev/urandom bs=1024 count=1 of=junk ; md5sum junk ; echo starting ; export host="l1"; for i in $(seq 1 1000) ; do echo -n "$i " ; ssh $host rm /tmp/junk; scp junk $host:/tmp/ >/dev/null || break ; ssh $host sha1sum /tmp/junk; done
I was not able to reproduce any errors on my fully updated c5 x86_64.
I suspect that he has a bad combination of NIC hardware and tg3 driver. |
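For anyone repeating the test: the script above prints an md5sum locally and a sha1sum remotely, so the two values cannot be compared directly. The following is only a sketch of a loop that uses one digest on both ends and counts mismatches. The function names copy_file, digest_of and verify_transfers are made up for illustration, and the defaults work on local files so the loop can be tried without a remote host; swap in scp/ssh as noted in the comments.

```shell
#!/bin/sh
# Sketch only: copy a file repeatedly and count digest mismatches.
# copy_file and digest_of are hypothetical hooks, not part of the original test.

# Default: local copy. For the scp case use: scp "$1" "$host:$2" >/dev/null
copy_file() { cp "$1" "$2"; }

# Default: local digest. For the scp case use:
#   ssh "$host" sha1sum "$1" | cut -d' ' -f1
digest_of() { sha1sum "$1" | cut -d' ' -f1; }

# verify_transfers SRC DST COUNT
# Prints the number of iterations whose copied digest differed from SRC's.
verify_transfers() {
    src=$1; dst=$2; count=$3
    want=$(sha1sum "$src" | cut -d' ' -f1)   # reference digest of the source
    bad=0
    i=1
    while [ "$i" -le "$count" ]; do
        rm -f "$dst"
        copy_file "$src" "$dst" || { echo "copy failed on iteration $i" >&2; return 2; }
        got=$(digest_of "$dst")
        [ "$got" = "$want" ] || bad=$((bad + 1))
        i=$((i + 1))
    done
    echo "$bad"
}
```

Usage would be something like `verify_transfers junk /tmp/junk 1000`, which prints the mismatch count at the end instead of relying on eyeballing the digests.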
|
(0005770)
wadester (reporter)
2007-07-31 17:22
|
Ran additional tests.
- I am able to reproduce this on multiple systems on multiple networks
-- all are on identical hardware.
- File size does not matter (tried with 1K file and 2M file).
- Updated tg3 driver to 3.71b from Broadcom, problem was not fixed.
- Installed openssh 4.6p1 from openssh site:
-- fewer errors were detected (1 in a few thousand transfers).
-- no core dumps over about 3 hours.
- Same hardware and test software on Fedora Core 4
-- 33.5K transfers of 1024 bytes, 0 errors. |
|
(0005796)
wadester (reporter)
2007-08-02 14:44
|
Update: Problem seems related to having soft RAID running. Boxes have dual 80G SATA drives which I typically put in RAID1 using Linux soft RAID (/ is on /dev/md0, /home on /dev/md1, etc.).
Loaded 2 new servers with CentOS 5 with default disk partitioning (LVM, 100M /dev/sda1 /boot, rest LVM with 2G swap). Tested with default kernel, default xen kernel, and updated 2.6.18-8.1.1.1 kernel (non-xen). This configuration did not fail in over 10K transfers.
Reloaded the SAME two servers with CentOS 5 with soft RAID (100M /dev/sda1 /boot, / on /dev/md0 10G mirrored on /dev/sda2 and /dev/sdb2, 1G swap on both disks). This configuration failed 167 times in 139K transfers, or about 0.1%. This error rate is about the same as my baseline install (multiple soft RAIDs, a few additional packages and services, etc.).
Loaded 2 more servers with CentOS 5, default LVM partitions, and started testing, with 0 errors in at least 7000 transfers (still running).
Note the pair of identical hardware running my baseline (FC4 with updates) is up to 407K transfers with 0 errors. It also uses 3 soft RAID partitions. |
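For comparing runs of different lengths, the failure rate can be worked out with a one-liner. This is just a sketch; error_rate is a made-up helper name, and 139000 is assumed for the "139K" figure quoted above, which is rounded.

```shell
#!/bin/sh
# error_rate FAILURES TRANSFERS
# Prints the failure rate as a percentage with three decimals.
error_rate() {
    awk -v f="$1" -v n="$2" \
        'BEGIN { if (n > 0) printf "%.3f\n", 100 * f / n; else print "n/a" }'
}

# Soft RAID run from this note (139K assumed to mean 139000):
error_rate 167 139000    # prints 0.120
```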
|
(0005822)
wadester (reporter)
2007-08-06 19:17
|
Soft RAID seems to make the problem worse, but removing soft RAID did not solve the problem. I am having to work around the problem for now. |
|
(0005848)
wadester (reporter)
2007-08-10 12:24
edited on: 2007-08-10 15:12
|
Ran the same test on the exact same hardware using RHEL 5, stock (no updates). This system has transferred almost 200K files without error. Note, this was a default install using LVM.
|
|
(0005852)
wadester (reporter)
2007-08-10 15:21
|
I can now prove that the problem is with CentOS. I took two pairs of CentOS boxes that demonstrated this problem and replaced a block of RPMs with those from RHEL 5. One set of boxes has transferred over 15K files and the second set has transferred over 2K files, both without error.
The packages I obtained from RHEL 5 were:
glibc-2.5-12.i686.rpm
glibc-common-2.5-12.i386.rpm
glibc-devel-2.5-12.i386.rpm
glibc-headers-2.5-12.i386.rpm
glibc-utils-2.5-12.i386.rpm
krb5-auth-dialog-0.7-1.i386.rpm
krb5-devel-1.5-17.i386.rpm
krb5-libs-1.5-17.i386.rpm
krb5-server-1.5-17.i386.rpm
krb5-workstation-1.5-17.i386.rpm
libgssapi-0.10-2.i386.rpm
libgssapi-devel-0.10-2.i386.rpm
openssh-4.3p2-16.el5.i386.rpm
openssh-askpass-4.3p2-16.el5.i386.rpm
openssh-clients-4.3p2-16.el5.i386.rpm
openssh-server-4.3p2-16.el5.i386.rpm
openssl097a-0.9.7a-9.i386.rpm
openssl-0.9.8b-8.3.el5.i686.rpm
openssl-devel-0.9.8b-8.3.el5.i386.rpm
openssl-perl-0.9.8b-8.3.el5.i386.rpm
These were installed using:
rpm -Uvh --force --nodeps --replacefiles |
|
(0005853)
kbsingh@karan.org (administrator)
2007-08-10 16:36
|
I'll check the rpm-diff for these rpms and see whether there is any real difference. |
|
(0005855)
wadester (reporter)
2007-08-10 20:38
|
On a pair of CentOS systems, I replaced the rpms listed in my previous post with those from RHEL -- the problem was fixed. I then replaced THOSE with the rpms from CentOS5. The problem REMAINS FIXED.
Also note that SELinux is turned off on these systems by default. |
|
(0005856)
Evolution (developer)
2007-08-10 22:39
|
Did you reboot after the package replacements, or restart ssh? |
|
(0005859)
wadester (reporter)
2007-08-13 17:44
|
I restarted SSH and tested, then rebooted and tested again. I'm doing a more granular test now (replace a package, reboot, retest). |
|
(0005953)
awood (reporter)
2007-08-31 15:45
edited on: 2007-09-04 13:26
|
Apologies for jumping in, but I found this bug while searching for information about what appears to be an identical bug with RHEL 5.
On various RHEL 5 systems, all with similar (but not quite identical) hardware, I am having a problem with scp segfaulting about 0.2%-0.5% of the time. It's not just scp, though; I have also seen ntpstat die occasionally. (edit: 0.02% of the time).
My lspci output looks like this (repeat entries snipped for space):
00:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
00:01.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
00:03.0 USB Controller: NEC Corporation USB (rev 43)
00:03.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
00:0f.0 Host bridge: Broadcom CSB6 South Bridge (rev a0)
00:0f.1 IDE interface: Broadcom CSB6 RAID/IDE Controller (rev a0)
00:0f.3 ISA bridge: Broadcom GCLE-2 Host Bridge
01:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
01:02.0 RAID bus controller: Adaptec AAC-RAID (rev 02)
It appears to be pretty similar on all other affected servers. Versions (this is RHEL 5 with "yum update" run recently):
openssh-4.3p2-16.el5
glibc-2.5-12
kernel-PAE-2.6.18-8.1.8.el5
This is happening on 8 (update: 12) servers under my control. It does not happen on RHEL 4. The servers are IBM System x3850 and IBM System x3550 and some xSeries variants, with between 8 and 16GB of RAM.
[test script snipped - redundant]
Hope this helps - as I say, it's not just a CentOS problem. It happens for me on RHEL 5 systems, both on ones installed from scratch and on ones upgraded from RHEL 4, but I haven't found anyone else reporting similar bugs yet. The systems that were upgraded had been running on RHEL 4 with no problems.
SELinux is disabled on all my systems at the moment, in case it makes a difference.
|
|
(0005957)
awood (reporter)
2007-09-03 12:55
|
I believe I've fixed the problem on my systems; the same fix may work for you if it's the same underlying cause.
As root run:
prelink -au
And then turn prelinking off (in /etc/sysconfig/prelink on RHEL 5). There was no need to reboot; the tests I was running just started working immediately after undoing all prelinking.
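A sketch of the "turn prelinking off" step, assuming /etc/sysconfig/prelink uses the PRELINKING=yes/no variable as the stock file does. disable_prelinking is a made-up helper name, and it takes the file path as an argument so it can be tried on a copy first.

```shell
#!/bin/sh
# disable_prelinking [CONFIG_FILE]
# Sets PRELINKING=no in the given file (default: /etc/sysconfig/prelink).
# Assumes the PRELINKING= variable convention; uses GNU sed's -i option.
disable_prelinking() {
    conf=${1:-/etc/sysconfig/prelink}
    if grep -q '^PRELINKING=' "$conf"; then
        # Rewrite the existing setting in place.
        sed -i 's/^PRELINKING=.*/PRELINKING=no/' "$conf"
    else
        # No setting present: append one.
        echo 'PRELINKING=no' >> "$conf"
    fi
}

# As root, on a real system, the two steps would be:
#   prelink -au
#   disable_prelinking
```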
Presumably installing the other glibc etc. packages and then putting the originals back again acted to restore the non-prelinked versions. If doing that RPM shuffle fixed it, then I bet if you wait a few days for prelink to run again, the problem will come back.
FWIW, I had previously tried booting an affected system with "maxcpus=1" (all affected systems are multiprocessor), and tried running /usr/sbin/glibc_post_upgrade.i686; neither worked. Running "strace" on thousands of instances of "ntpstat" finally showed that the segfaults were actually happening as the process was being executed - execve() was giving EINVAL. Hence the idea to try prelink.
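For reference, a sketch of how a command can be run many times to catch a rare crash. This is not the exact test that was run; count_crashes is a made-up name. It relies on the shell reporting death by SIGSEGV as exit status 139 (128 + 11). To see what happened at exec time for a failing run, the command can be wrapped in strace with its -o option to write the trace to a file.

```shell
#!/bin/sh
# count_crashes N CMD [ARGS...]
# Runs CMD N times and prints how many runs ended with exit status 139,
# which is how the shell reports a process killed by SIGSEGV (128 + 11).
count_crashes() {
    n=$1; shift
    crashes=0
    i=0
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1
        rc=$?
        if [ "$rc" -eq 139 ]; then
            crashes=$((crashes + 1))
        fi
        i=$((i + 1))
    done
    echo "$crashes"
}

# Example: count_crashes 10000 ntpstat
```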
Hope this helps. Again, sorry for jumping in, but I saw this being reported nowhere else. |
|
(0006587)
awood (reporter)
2007-12-19 17:11
|
FYI this issue seems to be resolved by kernel 2.6.18-53.1.4.el5 on RHEL 5. |
|
(0007790)
phsuatabacadotcom (reporter)
2008-08-01 22:04
|
awood: Were there any release notes or a patch that specifically addressed this issue? I couldn't find it referenced in the kernel release notes.
I'm seeing similar issues on a different Supermicro motherboard (PDMSi+) with RAID1. The problem is with a custom kernel, so upgrading kernels is not a good solution for me. |
|
(0007804)
awood (reporter)
2008-08-05 22:54
|
I didn't see any specific notes that addressed the issue; I just noticed that the problem went away. If you're having similar problems and don't want to change kernels, just run "prelink -au" as root as I described. If that doesn't fix it, your problem presumably doesn't have the same cause as mine. |
|
(0007926)
gswoods (reporter)
2008-09-04 16:40
|
Just want to add to this. We are experiencing what appears to be the same issue. About once a week, one of our daemons (such as sshd) will segfault for no apparent reason. This gets logged in /var/log/messages:
Sep 3 21:48:18 auth2 kernel: sshd[19959]: segfault at 0000000000000000 rip 00002aaaaea3f6e5 rsp 00007fff5b10c9d0 error 4
At first I thought this was a bug in one of our locally installed daemons until it started showing up for a system daemon like sshd.
This is also an x86_64 machine running CentOS 5, and it happens on two different systems with identical (but not shared) hardware, so it is unlikely to be a hardware issue.
We do have software RAID enabled.
I haven't tried turning off prelinking yet but will give that a shot. This is a royal pain in the patootie for us: when the one-time password daemon is the thing that segfaults, all our authentication stops working )-: I just love those 3AM calls. For now, I have implemented a monitoring script that checks that all the necessary authentication-related daemons are running and starts any that are not, but that's a hideous kludge.
BTW, we were running the latest kernel (kernel-2.6.18-92.1.10.el5) when this started happening, so I thought it might be related to having recently updated the kernel, but I backed off to 2.6.18-53.1.21.el5 and we're still having the problem. So if you are experiencing this, it is unlikely to be fixed by a kernel update as of now. |
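A minimal sketch of the kind of watchdog described above, in case it is useful as a stopgap. This is not the script actually in use; the function names are made up, and it assumes each daemon's process name matches its init script name so that "service NAME start" works.

```shell
#!/bin/sh
# is_running NAME: succeeds if a process with exactly this name exists.
is_running() { pgrep -x "$1" >/dev/null 2>&1; }

# start_daemon NAME: assumption: the init script has the same name.
start_daemon() { service "$1" start; }

# ensure_running NAME...
# Starts every listed daemon that is not running and prints the name
# of each one that had to be started.
ensure_running() {
    for d in "$@"; do
        if ! is_running "$d"; then
            echo "$d"
            start_daemon "$d"
        fi
    done
}

# Example, e.g. run from cron every minute:
#   ensure_running sshd otpd
```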
|
(0007932)
gswoods (reporter)
2008-09-08 21:26
|
I tried turning off prelinking as described earlier in this thread, and still got a segfault yesterday, this time in otpd:
Sep 7 15:37:16 auth2 kernel: otpd[26529]: segfault at 0000000000000010 rip 0000000000409c8c rsp 0000000041dfefd0 error 6 |
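For reading these kernel lines: the "error N" value is the x86 page-fault error code, where bit 0 distinguishes a protection violation from a not-present page, bit 1 is set for a write, and bit 2 is set for a user-mode access. A small sketch for decoding it follows; decode_fault is a made-up helper name.

```shell
#!/bin/sh
# decode_fault CODE
# Decodes the low three bits of an x86 page-fault error code as printed
# in "segfault at ... error N" kernel messages.
decode_fault() {
    code=$1
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}

decode_fault 4   # sshd line:  user-mode read, page not present
decode_fault 6   # otpd line:  user-mode write, page not present
```

Combined with the fault addresses (0x0 for sshd, 0x10 for otpd), both lines look like accesses through a NULL or near-NULL pointer, which by itself does not say whether the cause is the same as the prelink problem.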
|