View Issue Details

ID: 0002186
Project: CentOS-5
Category: -OTHER
View Status: public
Last Update: 2008-05-22 13:20
Reporter: scherar1
Priority: normal
Severity: major
Reproducibility: random
Status: closed
Resolution: no change required
Product Version: 5.0 - x86_64
Target Version:
Fixed in Version:
Summary: 0002186: i/o reject to offline device
Description:
Hi,

I have a major problem with my server running CentOS 5.
Hardware is
ASUS A8V Deluxe
AMD64 X2 3800+
2 GB Memory (Corsair Value RAM)
RAID 0 Controller is 3Ware 8006-2LP
2 X Harddisk Seagate 500 GB SATA2 new ones.

I had the same problem with 2 WD RAID Edition 500 GB disks, so I decided to change the hard disks.
The system worked fine for 5 days.
Then the old state was reached again.
Unfortunately, the bug now recurs at intervals of 10 minutes to 2 hours.
Same shit as before when running the WD disks.

The system worked fine with 2 WD RAID Edition 160 GB disks under SUSE 10.1 with the standard kernel.
Is it a hardware or a kernel problem? Has anybody else had the same problem? Has anybody solved it?
I will upload the necessary log-files soon.

Thanks for your help
Tags: No tags attached.

Activities

andyfowler

2007-09-13 13:51

reporter   ~0005999

I can confirm that this problem exists -- I have talked to two other admins who have had similar serious problems using 3ware 7xxx-8xxx RAIDs on CentOS 5. I am running the 32-bit CentOS 5, on an Athlon 64 dual-processor 4200.

Have tried replacing the controller, drives and cables to no avail.

See http://lists.centos.org/pipermail/centos/2007-June/083100.html for another user with my issue.

When the failure occurs, dmesg is flooded with:
3w-xxxx: scsi0: Character ioctl (0x1f) timed out, resetting card.
sd 0:0:0:0: rejecting I/O to offline device

The entire filesystem becomes read-only, and eventually the kernel panics.
roos

2007-10-19 09:27

reporter   ~0006152

Last edited: 2007-10-19 09:31

Same problem here.
CentOS 5 x86_64, 3Ware 800x Raid-1 (250GB).
System is an Athlon-64 X2 4200+
Kernel is stock CentOS-5 x86_64 SMP kernel
Linux localhost.localdomain 2.6.18-8.1.14.el5 #1 SMP Thu Sep 27 19:05:32 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Every 2 days the system crashes with the same error you mentioned.

Please raise the severity.

timchipman

2007-10-19 14:10

reporter   ~0006154

I want to confirm having this same problem. Fresh install of CentOS 5.0 32-bit on a dual-Xeon 2.4 GHz system with a 3ware 8506 controller. The initial install was just fine. I applied all yum updates available as of 16 Oct, then installed and booted the OpenVZ production stable kernel, which is my main deviation from a "vanilla" install. The machine ran fine for a few days; then I started a large-ish rsync job (1 TB) into the RAID 5 volume. Performance was sluggish, and dmesg made clear why: I had the same errors logged as others in this thread.

On reboot, the LUNs defined on the 3ware were set to re-initialize themselves after the next boot. I tried to boot in rescue mode, but the HDD slices were corrupt and couldn't be changed from read-only, so I've given up on this install instance, rebuilt the LUNs, and intend to proceed with an install of CentOS 4.5 32-bit. Possibly the bug is not a problem with this hardware on that OS? (Some Google hits suggest others experienced this problem only after upgrading from CentOS 4.5 to 5.0.) I will report back next week with an update.

I also re-flashed the 3ware to the latest firmware, on the off chance that would help. It didn't "rescue" the degraded arrays, for sure.

Tim Chipman
andyfowler

2007-10-19 14:22

reporter   ~0006155

I've been running the 32-bit version of CentOS the entire time, with the same problem, so it seems to affect both 32-bit and 64-bit versions.

Linux coltrane.paradigm-reborn.com 2.6.18-8.el5 #1 SMP Thu Mar 15 19:57:35 EDT 2007 i686 athlon i386 GNU/Linux

Interesting that I've got the exact same hardware as roos. And timchipman, from all of my Googling, I've only seen the problem in CentOS 5; CentOS 4 should be fine.

I was finally able to capture the output of dmesg when the problem began (usually the error messages fill up the buffer).

3w-xxxx: scsi0: Character ioctl (0x1f) timed out, resetting card.
sd 0:0:0:0: WARNING: Command (0x2a) timed out, resetting card.
3w-xxxx: scsi0: AEN drain failed, retrying.
 [ previous message repeated 4 times ]
3w-xxxx: scsi0: Controller errors, card not responding, check all cabling.
3w-xxxx: scsi0: Reset sequence failed.
3w-xxxx: scsi0: Reset failed.
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
 [ previous message repeated 5 times ]
sd 0:0:0:0: rejecting I/O to offline device
Buffer I/O error on device sda3, logical block 26730314
lost page write due to I/O error on sda3
sd 0:0:0:0: rejecting I/O to offline device
  [ previous message repeated ad infinitum, until kernel panics ]
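For anyone who wants to check a saved log for the same failure sequence, here is a minimal Python sketch. The message fragments are copied from the capture above; the stage names (`timeout`, `reset_failed`, and so on) and the `summarize` helper are just illustrative labels, not anything the 3w-xxxx driver or CentOS provides.

```python
import re

# Message fragments taken from the dmesg capture in this thread.
# The stage names on the left are hypothetical labels chosen for this sketch.
PATTERNS = [
    ("timeout", re.compile(r"timed out, resetting card")),
    ("reset_failed", re.compile(r"Reset (sequence )?failed")),
    ("offlined", re.compile(r"Device offlined")),
    ("io_rejected", re.compile(r"rejecting I/O to offline device")),
]


def summarize(log_text):
    """Count how many log lines match each failure stage."""
    counts = {name: 0 for name, _ in PATTERNS}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS:
            if pattern.search(line):
                counts[name] += 1
                break  # each line is counted under one stage only
    return counts


SAMPLE = """\
3w-xxxx: scsi0: Character ioctl (0x1f) timed out, resetting card.
sd 0:0:0:0: WARNING: Command (0x2a) timed out, resetting card.
3w-xxxx: scsi0: Reset sequence failed.
3w-xxxx: scsi0: Reset failed.
sd 0:0:0:0: scsi: Device offlined - not ready after error recovery
sd 0:0:0:0: rejecting I/O to offline device
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

To use it on a real machine, pass in the text of a saved `dmesg` dump instead of `SAMPLE`. A non-zero `offlined` count would suggest the same sequence described here, though it does not say anything about the cause.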
roos

2007-10-19 14:47

reporter   ~0006156

Last edited: 2007-10-19 14:50

That's exactly the problem many people report for CentOS 5:
http://layer0.layeredtech.com/archive/index.php?t-6593.html
http://www.centos.org/modules/newbb/viewtopic.php?topic_id=9234

The 3ware KB has a statement regarding this:
http://www.3ware.com/KB/article.aspx?id=14224

But we are on 7.7.1 already.

We have 5 other identical systems with CentOS 4 x86_64, which run flawlessly.
So the problem seems to be in the CentOS 5 kernels only.

roos

2007-10-19 14:50

reporter   ~0006157

I have now set "noapic" for the kernel and am stress-testing the system.
I'll keep you up to date.
roos

2007-10-22 06:12

reporter   ~0006165

We've run a bonnie++ loop during the last 3 days.
Before, we had a crash every day.
With the "noapic" kernel option, the system has been stable so far.
andyfowler

2007-10-22 15:55

reporter   ~0006166

Excellent suggestion, roos -- is noapic the only option that you added? We'll try that here, as well. You didn't tweak anything related to acpi?
roos

2007-10-22 18:18

reporter   ~0006168

noapic was the only option we set in grub.
Folks, please try that workaround, first.
Our system seems to be stable now.
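When trying the workaround, it helps to confirm that the running kernel was really booted with the option. A minimal Python sketch, assuming a Linux system that exposes the boot parameters in `/proc/cmdline`; the helper names and the sample command line are made up for illustration.

```python
def has_boot_flag(cmdline, flag):
    """Return True if `flag` appears as a whole word on the kernel command line."""
    return flag in cmdline.split()


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()


if __name__ == "__main__":
    # Hypothetical command line, used only to demonstrate the check.
    example = "ro root=LABEL=/ noapic"
    print(has_boot_flag(example, "noapic"))
```

On the affected machine, `has_boot_flag(read_cmdline(), "noapic")` should return True after rebooting with the changed grub entry. Splitting on whitespace avoids false matches on longer options that merely start with the same letters.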
timchipman

2007-10-25 19:25

reporter   ~0006193

Hi Folks, I've done more testing on my machine, and still have problems. Given that "noapic" fails to resolve things, it makes me wonder if something else is up?

(1) Installed CentOS 4.5 32-bit, using hardware RAID on my 8506, with the same layout as tried with CentOS 5.0. The system still dies. Tried with the normal kernel and with the OpenVZ kernel; no change in behaviour.

(2) Tried with no hardware RAID: just presented 8 x 400 GB disks to the OS and installed using software RAID. Now when it dies, I see different errors on the crashed console, and no traces of the SCSI resets in messages.

Console errors read approximately:

tw_poll_status
tw_aen_drain_queue
tw_reset_sequence
scsi_eh_abort
scsi_try_to_abort
scsi_unjam_host
scsi_error_handler
kernel_thread_helper

Console Shuts Up ....

To get back from this I do a power cycle; the machine boots fine and the software RAIDs rebuild themselves. After that they work until I start loading data into my RAID 5 (5-disk) set; then after 10-30 minutes it crashes again.

Interesting to me, it was able to build the RAID 5 successfully (3-4 hours of sustained disk activity) but then chokes when doing file access. Seems counterintuitive.

I've tried adding the noapic and acpi=off flags to the kernel at boot; this appears not to change the behaviour at all.

Note that this machine is really an old "appliance" from Overland Storage, a "REO" thing (it ran a custom Linux on firmware, didn't use hardware RAID at all, and presented disks as iSCSI targets to clients, I believe). However, in theory it worked in that role (I think?). It is just generic hardware (Tyan server-grade motherboard, dual Xeon, etc., in a 2U rackmount case).

If anyone has thoughts or comments on this, they certainly would be appreciated.

--Tim
andyfowler

2007-10-25 19:55

reporter   ~0006194

Last edited: 2007-10-25 19:55

Actually, we're going on 72 hours now with no problems, after setting noapic, including several hours of intensive compiling. Tim, it sounds like your problem is similar, but not identical to the one that we were experiencing.

Reiterate: setting noapic worked for us. I'll give it a few more days and report back.

timchipman

2007-10-29 14:01

reporter   ~0006203

Hi,

Just a brief update in this thread. I left the server running over the weekend with the following setup:

--software RAID data volume / OS disks, CentOS 4.5 32-bit
--booted with the stock RHEL/CentOS kernel, but with the noapic and acpi=off kernel flags set in /boot/grub/grub.conf
--left it running a big rsync job which in the past generated enough traffic to panic the kernel after 30 minutes.

Got in Monday morning and the rsync had completed moving the last 300 GB of data, with no crash. I've just started a sustained cycle of bonnie disk tests to keep disk activity loaded for a while.

If it completes this smoothly then I'm willing to believe that this is a stable config.

I suspect the SW vs HW raid is a non-issue here.

It appears that the OpenVZ kernel (vmlinuz-2.6.9-023stab044.11-smp) is unstable on this 3ware hardware, with or without the acpi=off / noapic flags. So it seems I'll likely have to abandon OpenVZ on this hardware.

I would suspect that whatever problem is present in the CentOS 5 kernel may also be related to what the OpenVZ/Virtuozzo folks are compiling into their OVZ kernel.

If I'm feeling crazy, I might try installing the OVZ-CentOS kernel (i.e., the stock RHEL kernel with only the OVZ patches applied, which I believe differs somewhat from the default production OVZ kernel) and see if that runs solid. I'm not sure if I feel up to more testing on this, given it appears to be "solved" with the present workaround. Guess I'll see how things go today :-)

Hope this info is of slight use to others,


Tim
luressl

2008-04-01 08:54

reporter   ~0007078

We had the same problem with XenServer 4.0 (CentOS)
on one of three IBM x3550 machines.

(rejecting I/O on offline device, kernel panic)

DSA Log-Output showed a HW Error in the Disk-Backplane (ServeRAID 8kl),
IBM replaced the part.
krazybob

2008-05-14 21:33

reporter   ~0007263

We are having the exact same issue using CE 4.5 and Virtuozzo enterprise kernel on a dual Opteron 248 system. In fact, all of our servers are identical and only this server has an issue. This is a fresh and updated install of Virtuozzo.

I called 3Ware and they recommend adding noapic to the line ending in ro in grub.conf.

Odd that not all installations are affected. 3Ware reports that some users experience an improvement in performance while others experience a decrease. Flip a coin?
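As a rough illustration of that change, the Python sketch below appends boot flags to every `kernel` line of a legacy grub.conf held as text. It only transforms a string and never writes to disk. The `add_kernel_flags` helper and the sample config (paths, root label) are assumptions for illustration; the kernel version string is the one quoted earlier in this thread.

```python
def add_kernel_flags(grub_conf_text, flags):
    """Append each missing flag to every 'kernel' line of a legacy GRUB config."""
    out = []
    for line in grub_conf_text.splitlines():
        words = line.split()
        if words and words[0] == "kernel":
            # Only add flags that are not already present on this line.
            missing = [flag for flag in flags if flag not in words]
            if missing:
                line = line.rstrip() + " " + " ".join(missing)
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical grub.conf fragment; the real file on a given system will differ.
SAMPLE = """\
default=0
title CentOS (2.6.18-8.el5)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-8.el5 ro root=LABEL=/
        initrd /initrd-2.6.18-8.el5.img
"""

if __name__ == "__main__":
    print(add_kernel_flags(SAMPLE, ["noapic"]))
```

Running the function twice gives the same result as running it once, so it is safe to re-apply. Xen-style entries, where the Linux kernel is loaded through a `module` line, are not handled by this sketch. Keep a backup of the original file before making the equivalent edit by hand.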
luressl

2008-05-16 06:28

reporter   ~0007275

Problem solved: after IBM Support replaced the mainboard (under warranty), the problem did not occur any more!
timverhoeven

2008-05-22 13:20

developer   ~0007299

Most likely hardware issues.

Issue History

Date Modified Username Field Change
2007-07-02 16:51 scherar1 New Issue
2007-09-13 13:51 andyfowler Note Added: 0005999
2007-10-19 09:27 roos Note Added: 0006152
2007-10-19 09:31 roos Note Edited: 0006152
2007-10-19 14:10 timchipman Note Added: 0006154
2007-10-19 14:22 andyfowler Note Added: 0006155
2007-10-19 14:47 roos Note Added: 0006156
2007-10-19 14:50 roos Note Added: 0006157
2007-10-19 14:50 roos Note Edited: 0006156
2007-10-22 06:12 roos Note Added: 0006165
2007-10-22 15:55 andyfowler Note Added: 0006166
2007-10-22 18:18 roos Note Added: 0006168
2007-10-25 19:25 timchipman Note Added: 0006193
2007-10-25 19:55 andyfowler Note Added: 0006194
2007-10-25 19:55 andyfowler Note Edited: 0006194
2007-10-29 14:01 timchipman Note Added: 0006203
2008-04-01 08:54 luressl Note Added: 0007078
2008-05-14 21:33 krazybob Note Added: 0007263
2008-05-16 06:28 luressl Note Added: 0007275
2008-05-22 13:20 timverhoeven Status new => closed
2008-05-22 13:20 timverhoeven Note Added: 0007299
2008-05-22 13:20 timverhoeven Resolution open => no change required