View Issue Details

ID: 0002241
Project: CentOS-5
Category: openssh
View Status: public
Last Update: 2008-09-08 21:26
Reporter: wadester
Assigned To:
Priority: normal
Severity: crash
Reproducibility: have not tried
Status: acknowledged
Resolution: open
Product Version: 5.0 - i386
Summary: 0002241: scp dumps core, signal 11, segmentation fault
Description: SCP of a file between two CentOS 5 boxes and between a CentOS 5 box and Fedora Core 4 sometimes dumps core (about 1% of the time). Problem appears when I transfer a 1K file and when I transfer a 2M file.
Additional Information:
Kernel 2.6.18-8.1.1.el5
openssh 4.3p2-16.el5
hardware: SuperMicro P8SCT, dual TG3 NICs, 3COM 100T switch
Using pre-generated 1024-bit dsa keys, copied to /root/authorized_keys2
Command: scp -o BatchMode=yes junk1 10.1.1.2:/root/test
Tags: No tags attached.

Activities

herrold

2007-07-31 15:58

reporter   ~0005768

per IRC channel, the hardware of the reporter seems to be the issue
cap_

2007-07-31 15:59

updater   ~0005769

talked to the reporter and ran this exact script:
dd if=/dev/urandom bs=1024 count=1 of=junk ; md5sum junk ; echo starting ; export host="l1"; for i in $(seq 1 1000) ; do echo -n "$i " ; ssh $host rm /tmp/junk; scp junk $host:/tmp/ >/dev/null || break ; ssh $host sha1sum /tmp/junk; done

I was not able to reproduce any errors on my fully updated c5 x86_64.

I suspect that he has a bad combo of nic hardware and tg3 driver.
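
For anyone repeating this kind of test, a small helper that tallies exit statuses may make a rare crash easier to quantify than stopping at the first failure. The sketch below is not part of the script above: `count_failures` is a hypothetical name, and it assumes a POSIX shell in which a child killed by SIGSEGV is reported as exit status 139 (128 + 11).

```shell
# Hypothetical helper: run a command repeatedly and summarise how it exited.
# usage: count_failures <runs> <command> [args...]
count_failures() {
    runs=$1; shift
    failed=0; segv=0; i=0
    while [ "$i" -lt "$runs" ]; do
        i=$((i + 1))
        # Capture the exit status without aborting if the caller uses "set -e".
        if "$@" >/dev/null 2>&1; then status=0; else status=$?; fi
        if [ "$status" -ne 0 ]; then failed=$((failed + 1)); fi
        # 139 = 128 + 11, the status the shell reports for a SIGSEGV'd child.
        if [ "$status" -eq 139 ]; then segv=$((segv + 1)); fi
    done
    echo "runs=$runs failed=$failed segv=$segv"
}
```

For example, `count_failures 1000 scp -o BatchMode=yes junk1 10.1.1.2:/root/test` would repeat the reporter's command 1000 times and print one summary line.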
wadester

2007-07-31 17:22

reporter   ~0005770

Ran additional tests.
- I am able to reproduce this on multiple systems on multiple networks
  -- all are on identical hardware.
- File size does not matter (tried with 1K file and 2M file).
- Updated tg3 driver to 3.71b from Broadcom, problem was not fixed.
- Installed openssh 4.6p1 from openssh site:
  -- fewer errors were detected (1 in a few thousand transfers).
  -- no core dumps over about 3 hours.
- Same hardware and test software on Fedora Core 4
  -- 33.5K transfers of 1024 bytes, 0 errors.
wadester

2007-08-02 14:44

reporter   ~0005796

Update: Problem seems related to having soft RAID running. Boxes have dual 80G SATA drives which I typically put in RAID1 using Linux soft RAID (/ is on /dev/md0, /home on /dev/md1, etc.).

Loaded 2 new servers with CentOS 5 with default disk partitioning (LVM, 100M /dev/sda1 /boot, rest LVM with 2G swap). Tested with default kernel, default xen kernel, and updated 2.6.18-8.1.1.1 kernel (non-xen). This configuration did not fail in over 10K transfers.

Reloaded SAME two servers with CentOS 5 with soft RAID (100M /dev/sda1 /boot, / on /dev/md0 10G mirrored on /dev/sda2 and /dev/sdb2, 1G swap on both disks). This configuration failed 167 times in 139K transfers, or about 0.1%. This error rate is about the same as my baseline install (multiple soft RAIDs, a few additional packages and services, etc.).

Loaded 2 more servers with CentOS 5, default LVM partitions, and started testing, with 0 errors in at least 7000 transfers (still running).

Note the pair of identical hardware running my baseline (FC4 with updates) is up to 407K transfers with 0 errors. It also uses 3 soft RAID partitions.
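
The failure rates quoted in this thread are simple ratios of failed to total transfers. As a rough check, the arithmetic could be scripted as below; `failure_rate` is a hypothetical helper name, the use of awk is an assumption rather than anything from the test harness, and "139K" is taken to mean 139000.

```shell
# Hypothetical helper: percentage of failed transfers.
# usage: failure_rate <failures> <total transfers>
failure_rate() {
    awk -v f="$1" -v n="$2" 'BEGIN { printf "%.2f%%\n", 100 * f / n }'
}

failure_rate 167 139000   # prints 0.12%
```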
wadester

2007-08-06 19:17

reporter   ~0005822

Soft RAID seems to make the problem worse, but removing soft RAID did not solve the problem. I am having to work around the problem for now.
wadester

2007-08-10 12:24

reporter   ~0005848

Last edited: 2007-08-10 15:12

Ran same test on exact same hardware using RHEL 5, stock (no updates). This system has transferred almost 200K files without error. Note, this was a default install using LVM.

wadester

2007-08-10 15:21

reporter   ~0005852

I can now prove that the problem is with CentOS. I took two pairs of CentOS boxes that demonstrated this problem and I replaced a block of RPMS with those from RHEL 5. One set of boxes has transferred over 15K files and the second set has transferred over 2K files, both without error.

The packages I obtained from RHEL 5 were:

  glibc-2.5-12.i686.rpm
  glibc-common-2.5-12.i386.rpm
  glibc-devel-2.5-12.i386.rpm
  glibc-headers-2.5-12.i386.rpm
  glibc-utils-2.5-12.i386.rpm
  krb5-auth-dialog-0.7-1.i386.rpm
  krb5-devel-1.5-17.i386.rpm
  krb5-libs-1.5-17.i386.rpm
  krb5-server-1.5-17.i386.rpm
  krb5-workstation-1.5-17.i386.rpm
  libgssapi-0.10-2.i386.rpm
  libgssapi-devel-0.10-2.i386.rpm
  openssh-4.3p2-16.el5.i386.rpm
  openssh-askpass-4.3p2-16.el5.i386.rpm
  openssh-clients-4.3p2-16.el5.i386.rpm
  openssh-server-4.3p2-16.el5.i386.rpm
  openssl097a-0.9.7a-9.i386.rpm
  openssl-0.9.8b-8.3.el5.i686.rpm
  openssl-devel-0.9.8b-8.3.el5.i386.rpm
  openssl-perl-0.9.8b-8.3.el5.i386.rpm

These were installed using:
  rpm -Uvh --force --nodeps --replacefiles
kbsingh@karan.org

2007-08-10 16:36

administrator   ~0005853

I'll check the rpm-diff for these rpms and check if there is any real difference.
wadester

2007-08-10 20:38

reporter   ~0005855

On a pair of CentOS systems, I replaced the rpms listed in my previous post with those from RHEL -- the problem was fixed. I then replaced THOSE with the rpms from CentOS5. The problem REMAINS FIXED.

Also note that SELinux is turned off on these systems by default.
Evolution

2007-08-10 22:39

administrator   ~0005856

did you reboot after the package replacements, or restart ssh?
wadester

2007-08-13 17:44

reporter   ~0005859

I restarted SSH and tested. Then I rebooted then tested. I'm doing a more granular test now (replace a package, reboot, retest).
awood

2007-08-31 15:45

reporter   ~0005953

Last edited: 2007-09-04 13:26

Apologies for jumping in, but I found this bug while searching for information about what appears to be an identical bug with RHEL 5.

On various RHEL 5 systems, all with similar (but not quite identical) hardware, I am having a problem with scp segfaulting about 0.2%-0.5% of the time. It's not just scp, though; I have also seen ntpstat die occasionally (edit: 0.02% of the time).

My lspci output looks like this (repeat entries snipped for space):

00:00.0 Host bridge: IBM Calgary PCI-X Host Bridge (rev 02)
00:01.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
00:03.0 USB Controller: NEC Corporation USB (rev 43)
00:03.2 USB Controller: NEC Corporation USB 2.0 (rev 04)
00:0f.0 Host bridge: Broadcom CSB6 South Bridge (rev a0)
00:0f.1 IDE interface: Broadcom CSB6 RAID/IDE Controller (rev a0)
00:0f.3 ISA bridge: Broadcom GCLE-2 Host Bridge
01:01.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
01:02.0 RAID bus controller: Adaptec AAC-RAID (rev 02)

It appears to be pretty similar on all other affected servers. Versions (this is RHEL 5 with "yum update" run recently):

openssh-4.3p2-16.el5
glibc-2.5-12
kernel-PAE-2.6.18-8.1.8.el5

This is happening on 8 (update: 12) servers under my control. It does not happen on RHEL 4. The servers are IBM System x3850 and IBM System x3550 and some xSeries variants, with between 8 and 16GB of RAM.

[test script snipped - redundant]

Hope this helps - as I say, it's not just a CentOS problem. It happens for me on RHEL 5 systems, both on ones installed from scratch and on ones upgraded from RHEL 4, but I haven't found anyone else reporting similar bugs yet. The systems that were upgraded had been running on RHEL 4 with no problems.

SELinux is disabled on all my systems at the moment, in case it makes a difference.

awood

2007-09-03 12:55

reporter   ~0005957

I believe I've fixed the problem on my systems; the same fix may work for you if it's the same underlying cause.

As root run:

  prelink -au

And then turn prelinking off (in /etc/sysconfig/prelink on RHEL 5). There was no need to reboot; the tests I was running just started working immediately after undoing all prelinking.
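
As a rough sketch of the "turn prelinking off" step: this assumes the file uses a `PRELINKING=yes` line (the usual layout on RHEL 5, but check your own copy) and that this variable is what the periodic prelink job reads. `disable_prelinking` is a hypothetical helper name, and the path is a parameter so it can be tried on a copy first.

```shell
# Hypothetical helper: set PRELINKING=no in a prelink config file.
# The path defaults to /etc/sysconfig/prelink; pass a copy to try it safely.
disable_prelinking() {
    conf=${1:-/etc/sysconfig/prelink}
    tmp="$conf.tmp.$$"
    # Rewrite only the PRELINKING= line; every other setting is left alone.
    if sed 's/^PRELINKING=.*/PRELINKING=no/' "$conf" > "$tmp"; then
        # Copy the contents back so the original file keeps its permissions.
        cat "$tmp" > "$conf"
    fi
    rm -f "$tmp"
}

# Typical sequence, as root (commented out so sourcing this changes nothing):
#   prelink -ua            # undo prelinking on all binaries and libraries
#   disable_prelinking     # flip PRELINKING to "no" in /etc/sysconfig/prelink
```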

Presumably installing the other glibc (etc.) packages and then putting the originals back again restored the non-prelinked versions. If doing that RPM shuffle fixed it, then I bet if you wait a few days for prelink to run again, the problem will come back.

FWIW, I had previously tried booting an affected system with "maxcpus=1" (all affected systems are multiprocessor), and tried running /usr/sbin/glibc_post_upgrade.i686; neither worked. Running "strace" on thousands of instances of "ntpstat" finally showed that the segfaults were actually happening as the process was being executed - execve() was giving EINVAL. Hence the idea to try prelink.

Hope this helps, again sorry for jumping in but I saw this being reported nowhere else.
awood

2007-12-19 17:11

reporter   ~0006587

FYI this issue seems to be resolved by kernel 2.6.18-53.1.4.el5 on RHEL 5.
phsuatabacadotcom

2008-08-01 22:04

reporter   ~0007790

awood: Were there any release notes or a patch that specifically addressed this issue? I couldn't find it referenced in the kernel release notes.

I'm seeing similar issues on a different supermicro motherboard (PDMSi+) with RAID1. The problem is with a custom kernel, so upgrading kernels is not a good solution for me.
awood

2008-08-05 22:54

reporter   ~0007804

I didn't see any specific notes that addressed the issue; I just noticed that the problem went away. If you're having similar problems and don't want to change kernels, just run "prelink -au" as root as I described. If that doesn't fix it, your problem presumably doesn't have the same cause as mine.
gswoods

2008-09-04 16:40

reporter   ~0007926

Just want to add to this. We are experiencing what appears to be the same issue. About once a week, one of our daemons (such as sshd) will segfault for no apparent reason. This gets logged in /var/log/messages:

Sep 3 21:48:18 auth2 kernel: sshd[19959]: segfault at 0000000000000000 rip 00002aaaaea3f6e5 rsp 00007fff5b10c9d0 error 4

At first I thought this was a bug in one of our locally installed daemons until it started showing up for a system daemon like sshd.

This is also an x86_64 machine running CentOS 5, and it happens on two different systems with identical (but not shared) hardware, so it is unlikely to be a hardware issue.

We do have software RAID enabled.

I haven't tried turning off prelinking yet but will give that a shot, as this is a royal pain in the patootie for us since when the one-time password daemon is the thing that segfaults, all our authentication stops working )-: I just love those 3AM calls. For now, I have implemented a monitoring script that checks to make sure all the necessary authentication-related daemons are running and starts any that are not, but that's a hideous kludge.

BTW, we were running the latest kernel (kernel-2.6.18-92.1.10.el5) when this started happening, so I thought it might be related to having recently updated the kernel, but I backed off to 2.6.18-53.1.21.el5 and we're still having the problem. So if you are experiencing this, it is unlikely to be fixed by a kernel update as of now.
gswoods

2008-09-08 21:26

reporter   ~0007932

I tried turning off prelinking as described earlier in this thread, and still got a segfault yesterday:

Sep 7 15:37:16 auth2 kernel: otpd[26529]: segfault at 0000000000000010 rip 0000000000409c8c rsp 0000000041dfefd0 error 6
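
The trailing "error N" field in these kernel messages is the x86 page-fault error code, a small bit mask. A minimal sketch of decoding it follows, assuming only the usual low three bits matter here (bit 0: protection violation vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode); `decode_pf_error` is a hypothetical name.

```shell
# Hypothetical helper: describe the low three bits of an x86 page-fault error code.
# usage: decode_pf_error <code>
decode_pf_error() {
    code=$1
    if [ $((code & 1)) -ne 0 ]; then cause="protection violation"; else cause="page not present"; fi
    if [ $((code & 2)) -ne 0 ]; then access="write"; else access="read"; fi
    if [ $((code & 4)) -ne 0 ]; then mode="user"; else mode="kernel"; fi
    echo "$mode-mode $access, $cause"
}

decode_pf_error 4   # prints: user-mode read, page not present
decode_pf_error 6   # prints: user-mode write, page not present
```

Read this way, the sshd line is a user-mode read at address 0 and the otpd line a user-mode write at address 0x10. Both look like NULL-pointer dereferences, though that alone does not say what produced the bad pointer.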

Issue History

Date Modified Username Field Change
2007-07-31 12:54 wadester New Issue
2007-07-31 12:54 wadester Status new => assigned
2007-07-31 15:58 herrold Note Added: 0005768
2007-07-31 15:59 cap_ Note Added: 0005769
2007-07-31 16:09 kbsingh@karan.org Status assigned => acknowledged
2007-07-31 17:22 wadester Note Added: 0005770
2007-08-02 14:44 wadester Note Added: 0005796
2007-08-06 19:17 wadester Note Added: 0005822
2007-08-10 12:24 wadester Note Added: 0005848
2007-08-10 15:12 wadester Note Edited: 0005848
2007-08-10 15:21 wadester Note Added: 0005852
2007-08-10 16:36 kbsingh@karan.org Note Added: 0005853
2007-08-10 20:38 wadester Note Added: 0005855
2007-08-10 22:39 Evolution Note Added: 0005856
2007-08-13 17:44 wadester Note Added: 0005859
2007-08-31 15:45 awood Note Added: 0005953
2007-09-03 12:55 awood Note Added: 0005957
2007-09-04 13:26 awood Note Edited: 0005953
2007-12-19 17:11 awood Note Added: 0006587
2008-08-01 22:04 phsuatabacadotcom Note Added: 0007790
2008-08-05 22:54 awood Note Added: 0007804
2008-09-04 16:40 gswoods Note Added: 0007926
2008-09-08 21:26 gswoods Note Added: 0007932