View Issue Details

ID: 0006209
Project: CentOS-6
Category: kernel
View Status: public
Last Update: 2013-01-22 19:06
Reporter: syshackmin
Priority: normal
Severity: crash
Reproducibility: have not tried
Status: new
Resolution: open
Product Version: 6.3
Target Version:
Fixed in Version:
Summary: 0006209: RAID10 disk fail triggering kernel BUG at drivers/scsi/scsi_lib.c:1156!
Description: This crash occurred on a large RAID10 array with a failing disk. The system was in production, so I had to resolve it by replacing the disk, but I still have the crash dumps. Instead of dropping the failing disk from the array, the kernel would crash.

I found a report on the Debian bug list that someone thought might be the same issue. I'm not sure whether the fix has made it into the CentOS kernel.

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=682233
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi.git;a=commit;h=98dc81b0d6c483a3eb256764ae10f156ccefdbbb

The drives are SATA WD RAID Edition disks (WDC WD5003ABYX-01WERA1) on an LSI 9211-8i through an LSI SAS2X36 expander.

I was originally running older LSI firmware and an older driver; I am now running the latest of both, and it is still crashing.

RAID info (it is currently rebuilding onto the spare):

/dev/md4:
        Version : 1.1
  Creation Time : Mon Sep 17 11:42:08 2012
     Raid Level : raid10
     Array Size : 5372224000 (5123.35 GiB 5501.16 GB)
  Used Dev Size : 488384000 (465.76 GiB 500.11 GB)
   Raid Devices : 22
  Total Devices : 23
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon Jan 21 13:23:50 2013
          State : active, degraded, recovering
 Active Devices : 21
Working Devices : 23
 Failed Devices : 0
  Spare Devices : 2

         Layout : near=2
     Chunk Size : 512K

 Rebuild Status : 37% complete

           Name : ???.a2hosting.com:4 (local to host ???.a2hosting.com)
           UUID : 248488c8:93b3e4bc:971a6676:3d77fb4d
         Events : 447295

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8      161        1      active sync   /dev/sdk1
       2       8       17        2      active sync   /dev/sdb1
       3       8      177        3      active sync   /dev/sdl1
       4       8       33        4      active sync   /dev/sdc1
       5       8      193        5      active sync   /dev/sdm1
       6       8       49        6      active sync   /dev/sdd1
       7       8      209        7      active sync   /dev/sdn1
       8       8       65        8      active sync   /dev/sde1
       9       8      225        9      active sync   /dev/sdo1
      10       8       81       10      active sync   /dev/sdf1
      11       8      241       11      active sync   /dev/sdp1
      12       8       97       12      active sync   /dev/sdg1
      13      65        1       13      active sync   /dev/sdq1
      14       8      113       14      active sync   /dev/sdh1
      15      65       17       15      active sync   /dev/sdr1
      16       8      129       16      active sync   /dev/sdi1
      17      65       33       17      active sync   /dev/sds1
      18       8      145       18      active sync   /dev/sdj1
      22      65       97       19      spare rebuilding   /dev/sdw1
      20      65       65       20      active sync   /dev/sdu1
      21      65       81       21      active sync   /dev/sdv1

      23      65      113        -      spare   /dev/sdx1
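
As a sanity check on the mdadm figures above, the reported array size can be reproduced from the device count, the near=2 layout, and the per-device size. This is only a small arithmetic sketch; the variable names are made up for illustration, and the sizes are taken to be in KiB, which is consistent with the GiB and GB values mdadm prints next to them.

```python
# Values copied from the mdadm output above (sizes in KiB).
raid_devices = 22
near_copies = 2                        # Layout : near=2
used_dev_size_kib = 488384000          # Used Dev Size
reported_array_size_kib = 5372224000   # Array Size

# With two copies of every chunk, usable capacity is
# (devices / copies) * per-device size.
expected_kib = raid_devices * used_dev_size_kib // near_copies

array_size_gib = expected_kib / 2**20          # KiB -> GiB
array_size_gb = expected_kib * 1024 / 10**9    # KiB -> bytes -> GB

print(expected_kib == reported_array_size_kib)
print(round(array_size_gib, 2), round(array_size_gb, 2))
```

The computed value matches the reported Array Size, so the array is laid out as 11 mirrored pairs of roughly 500 GB members.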

LSI Info:

mpt2sas version 15.00.00.00 loaded
scsi0 : Fusion MPT SAS Host
  alloc irq_desc for 30 on node 0
  alloc kstat_irqs on node 0
alloc irq_2_iommu on node 0
mpt2sas 0000:03:00.0: PCI INT A -> GSI 30 (level, low) -> IRQ 30
mpt2sas 0000:03:00.0: setting latency timer to 64
mpt2sas0: 64 BIT PCI BUS DMA ADDRESSING SUPPORTED, total mem (49416756 kB)
  alloc irq_desc for 52 on node 0
  alloc kstat_irqs on node 0
alloc irq_2_iommu on node 0
mpt2sas 0000:03:00.0: irq 52 for MSI/MSI-X
mpt2sas0-msix0: PCI-MSI-X enabled: IRQ 52
mpt2sas0: iomem(0x00000000fbb3c000), mapped(0xffffc90017168000), size(16384)
mpt2sas0: ioport(0x000000000000c000), size(256)
mpt2sas0: sending diag reset !!
mpt2sas0: diag reset: SUCCESS
mpt2sas0: Allocated physical memory: size(3392 kB)
mpt2sas0: Current Controller Queue Depth(1483), Max Controller Queue Depth(1720)
mpt2sas0: Scatter Gather Elements per IO(128)
mpt2sas0: LSISAS2008: FWVersion(15.00.00.00), ChipRevision(0x03), BiosVersion(07.29.00.00)
mpt2sas0: Protocol=(Initiator), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
mpt2sas0: sending port enable !!


Crash info (from the crashdump kernel log):

sd 0:0:19:0: [sdt] Unhandled sense code
sd 0:0:19:0: [sdt] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:19:0: [sdt] Sense Key : Medium Error [current]
Info fld=0x39e30f68
sd 0:0:19:0: [sdt] Add. Sense: Unrecovered read error
sd 0:0:19:0: [sdt] CDB: Read(10): 28 00 39 e3 0f 40 00 00 68 00
sd 0:0:19:0: [sdt] Unhandled sense code
sd 0:0:19:0: [sdt] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:19:0: [sdt] Sense Key : Medium Error [current]
Info fld=0x39e30f68
sd 0:0:19:0: [sdt] Add. Sense: Unrecovered read error
sd 0:0:19:0: [sdt] CDB: Read(10): 28 00 39 e3 0f 68 00 00 08 00
------------[ cut here ]------------
kernel BUG at drivers/scsi/scsi_lib.c:1156!
invalid opcode: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index2/shared_cpu_map
CPU 4
Modules linked in: nfsd lockd nfs_acl auth_rpcgss exportfs autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables ipv6 raid10 ses enclosure microcode serio_raw i2c_i801 i2c_core sg iTCO_wdt iTCO_vendor_support e1000e ioatdma dca i7core_edac edac_core shpchp ext4 mbcache jbd2 raid1 sd_mod crc_t10dif ahci mpt2sas(U) scsi_transport_sas raid_class dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 2008, comm: md4_raid10 Not tainted 2.6.32-279.19.1.el6.x86_64 #1 Supermicro X8DTL/X8DTL
RIP: 0010:[<ffffffff8135dbfe>] [<ffffffff8135dbfe>] scsi_setup_fs_cmnd+0x9e/0xe0
RSP: 0018:ffff88062ee27870 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffff880c14fe6e20 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffff880c14fe6e20 RDI: ffff88062c649800
RBP: ffff88062ee27880 R08: 0000000000000086 R09: 0000000000000001
R10: 0000000039e30768 R11: 0000000000000000 R12: ffff88062c649800
R13: ffff88062c652838 R14: ffff88062c649800 R15: ffff88062c732800
FS: 0000000000000000(0000) GS:ffff880655400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000002d77e68 CR3: 0000000c18027000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process md4_raid10 (pid: 2008, threadinfo ffff88062ee26000, task ffff88062bfdaaa0)
Stack:
 ffff880c14fe6e20 ffff880c14fe6e20 ffff88062ee27910 ffffffffa0099d17
<d> ffff880c14fe6e20 ffff88062df3c000 ffff88062ee27910 ffffffff8126476f
<d> ffff880600000000 0000000039e30768 0000000000000000 0000000004100031
Call Trace:

 [<ffffffffa0099d17>] sd_prep_fn+0x157/0xf30 [sd_mod]
 [<ffffffff8126476f>] ? cfq_dispatch_requests+0x2cf/0xa70
 [<ffffffff81261c47>] ? cfq_prio_tree_add+0xc7/0xd0
 [<ffffffff8124f527>] blk_peek_request+0xc7/0x210
 [<ffffffff8135cd33>] scsi_request_fn+0x63/0x790
 [<ffffffff8107caed>] ? del_timer+0x7d/0xe0
 [<ffffffff81247271>] ? elv_insert+0xd1/0x1a0
 [<ffffffff8124cf02>] __generic_unplug_device+0x32/0x40
 [<ffffffff81250088>] __make_request+0x168/0x5a0
 [<ffffffff8124e65e>] generic_make_request+0x25e/0x530
 [<ffffffff811124c5>] ? mempool_alloc_slab+0x15/0x20
 [<ffffffff81112663>] ? mempool_alloc+0x63/0x140
 [<ffffffff8124e65e>] ? generic_make_request+0x25e/0x530
 [<ffffffff811124c5>] ? mempool_alloc_slab+0x15/0x20
 [<ffffffff81112663>] ? mempool_alloc+0x63/0x140
 [<ffffffff8124e9bd>] submit_bio+0x8d/0x120
 [<ffffffff813e90e6>] sync_page_io+0xb6/0x110
 [<ffffffffa01f2de6>] r10_sync_page_io+0x56/0x110 [raid10]
 [<ffffffffa01f3216>] fix_read_error+0x376/0x6f0 [raid10]
 [<ffffffffa01f4563>] raid10d+0xfd3/0x1130 [raid10]
 [<ffffffff8107d4eb>] ? try_to_del_timer_sync+0x7b/0xe0
 [<ffffffff8107d572>] ? del_timer_sync+0x22/0x30
 [<ffffffff814eaa4a>] ? schedule_timeout+0x19a/0x2e0
 [<ffffffff8105fa40>] ? default_wake_function+0x0/0x20
 [<ffffffff813e8046>] md_thread+0x116/0x150
 [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff813e7f30>] ? md_thread+0x0/0x150
 [<ffffffff81090626>] kthread+0x96/0xa0
 [<ffffffff8100c0ca>] child_rip+0xa/0x20
 [<ffffffff81090590>] ? kthread+0x0/0xa0
 [<ffffffff8100c0c0>] ? child_rip+0x0/0x20
Code: 00 e8 17 fe ff ff 5b 41 5c c9 c3 66 90 4c 89 e7 be 20 00 00 00 e8 23 85 ff ff 48 85 c0 48 89 c7 74 38 48 89 83 d8 00 00 00 eb a0 <0f> 0b eb fe 48 8b 00 48 85 c0 0f 84 7a ff ff ff 48 8b 40 48 48
RIP [<ffffffff8135dbfe>] scsi_setup_fs_cmnd+0x9e/0xe0
 RSP <ffff88062ee27870>
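
To relate the two failed reads to the sector reported in the sense data, the Read(10) CDBs from the log can be decoded by hand. The sketch below assumes the standard READ(10) layout (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8); the helper name `decode_read10` is hypothetical and not part of any tool mentioned in this report.

```python
def decode_read10(cdb_hex):
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7-8: transfer length in blocks
    return lba, blocks


failing_lba = 0x39e30f68  # "Info fld" value from the sense data in the log

first = decode_read10("28 00 39 e3 0f 40 00 00 68 00")
second = decode_read10("28 00 39 e3 0f 68 00 00 08 00")

for lba, blocks in (first, second):
    # Offset of the reported bad block within each request.
    print(hex(lba), blocks, failing_lba - lba)
```

Read this way, the first request covers 104 blocks with the reported bad block 40 blocks in, and the second is an 8-block read that starts exactly at the reported bad block.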

Tags: No tags attached.

Activities

toracat

2013-01-22 19:06

manager   ~0016320

The patch from that git reference is:

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index b583277..9377ed2 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -759,7 +759,6 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
     }
 
     if (req->cmd_type == REQ_TYPE_BLOCK_PC) { /* SG_IO ioctl from block level */
-        req->errors = result;
         if (result) {
             if (sense_valid && req->sense) {
                 /*
@@ -775,6 +774,10 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int good_bytes)
             if (!sense_deferred)
                 error = __scsi_error_from_host_byte(cmd, result);
         }
+        /*
+         * __scsi_error_from_host_byte may have reset the host_byte
+         */
+        req->errors = cmd->result;
 
         req->resid_len = scsi_get_resid(cmd);


As far as I can see, it is already in the current c6 kernel (2.6.32-279.19.1.el6).
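
A rough way to check a claim like this against a copy of drivers/scsi/scsi_lib.c is to look for the line the patch adds and the line it removes. The sketch below is a textual heuristic only: the function name `looks_patched` is hypothetical, and it assumes neither assignment appears elsewhere in the file, which has not been verified for the CentOS kernel source.

```python
def looks_patched(scsi_lib_source):
    """Heuristic: True if the text has the line added by the patch
    and no longer has the line the patch removed."""
    lines = [line.strip() for line in scsi_lib_source.splitlines()]
    has_new = "req->errors = cmd->result;" in lines
    has_old = "req->errors = result;" in lines
    return has_new and not has_old


# Two tiny excerpts modelled on the diff above, for illustration only.
before = """
    if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
        req->errors = result;
        if (result) {
"""
after = """
        }
        /*
         * __scsi_error_from_host_byte may have reset the host_byte
         */
        req->errors = cmd->result;
"""

print(looks_patched(before), looks_patched(after))
```

In practice the input would be the full text of scsi_lib.c from the kernel source package being checked, read with `open(path).read()`.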

Issue History

Date Modified      Username     Field                 Change
2013-01-21 18:37   syshackmin   New Issue
2013-01-22 19:06   toracat      Note Added: 0016320