View Issue Details

ID: 0010073
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2016-03-22 17:08
Reporter: mrcuongnv
Priority: high
Severity: crash
Reproducibility: sometimes
Status: closed
Resolution: duplicate
Product Version: 7.2.1511
Target Version:
Fixed in Version:
Summary: 0010073: Crash on I/O over 3ware RAID Volume
Description: After upgrading my server from 7.1.1503 to 7.2.1511 (3.10.0-327.3.1.el7.x86_64), it sometimes crashes under heavy I/O on the 3ware RAID volume.

My card: 3ware Inc 9750 SAS2/SATA-II RAID PCIe (rev 05)

The server ran for more than a year on 7.0 and 7.1 without a crash.
Steps To Reproduce: My system has one RAID volume of 4 x 2TB disks. It holds cloned repositories of CentOS 5, 6, and 7, which are shared over HTTP with Nginx.

To reproduce: install CentOS 6.7 over the network using the repository above. While packages are being downloaded and installed, the server crashes (reboots).
Additional Information: From vmcore-dmesg.txt:

[307733.929900] BUG: unable to handle kernel NULL pointer dereference at 0000000000000448
[307733.938513] IP: [<ffffffff81314218>] swiotlb_unmap_sg_attrs+0x28/0x60
[307733.947067] PGD 0
[307733.955498] Oops: 0000 [#1] SMP
[307733.963852] Modules linked in: 8021q garp mrp xt_CHECKSUM tun ib_isert pax(OE) target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod iscsi_tcp libiscsi_tcp ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ip6t_REJECT ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter ib_iser libiscsi scsi_transport_iscsi ib_srp scsi_transport_srp scsi_tgt ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm dm_mirror dm_region_hash dm_log dm_mod xfs libcrc32c iTCO_wdt
[307733.990755] iTCO_vendor_support lpc_ich i7core_edac intel_powerclamp mfd_core edac_core ioatdma coretemp kvm_intel kvm i2c_i801 pcspkr sg shpchp acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif sr_mod crct10dif_generic cdrom crct10dif_common mlx4_ib ib_sa ib_mad mlx4_en vxlan ib_core ip6_udp_tunnel udp_tunnel ib_addr ata_generic pata_acpi radeon drm_kms_helper ttm crc32c_intel drm serio_raw ahci igb libahci usb_storage pata_jmicron e1000e mlx4_core libata dca ptp pps_core i2c_algo_bit 3w_sas i2c_core zfs(POE) zunicode(POE) zavl(POE) zcommon(POE) znvpair(POE) spl(OE) zlib_deflate
[307734.027786] CPU: 9 PID: 0 Comm: swapper/9 Tainted: P OE ------------ 3.10.0-327.3.1.el7.x86_64 #1
[307734.037206] Hardware name: Supermicro X8DTN/X8DTN, BIOS 2.1c 10/28/2011
[307734.046577] task: ffff88061e7b0b80 ti: ffff88061e7b8000 task.ti: ffff88061e7b8000
[307734.055916] RIP: 0010:[<ffffffff81314218>] [<ffffffff81314218>] swiotlb_unmap_sg_attrs+0x28/0x60
[307734.065257] RSP: 0018:ffff880c3fc63e18 EFLAGS: 00010097
[307734.074525] RAX: 0000000000000430 RBX: 0000000000000430 RCX: 0000000000000001
[307734.083769] RDX: 0000000000000431 RSI: ffff880c11ab89c0 RDI: ffff880532923600
[307734.092950] RBP: ffff880c3fc63e40 R08: 0000000000000000 R09: ffffffff813141f0
[307734.102077] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[307734.111140] R13: 0000000000000008 R14: 0000000000000001 R15: ffff880c1de26098
[307734.120114] FS: 0000000000000000(0000) GS:ffff880c3fc60000(0000) knlGS:0000000000000000
[307734.129054] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[307734.137922] CR2: 0000000000000448 CR3: 000000000194a000 CR4: 00000000000007e0
[307734.146765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[307734.155512] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[307734.164161] Stack:
[307734.172695] ffff880035a28740 ffff880035a2cc98 0000000000000000 00000000000000a9
[307734.181288] 00000000000000a9 ffff880c3fc63e50 ffffffff81421be6 ffff880c3fc63ea8
[307734.189834] ffffffffa03b41b5 0000000000000040 0000000000000086 287fbf74373df8f2
[307734.198341] Call Trace:
[307734.206726] <IRQ>
[307734.206757]
[307734.215028] [<ffffffff81421be6>] scsi_dma_unmap+0x56/0x70
[307734.223267] [<ffffffffa03b41b5>] twl_interrupt+0x5b5/0x850 [3w_sas]
[307734.231458] [<ffffffff8111c2be>] handle_irq_event_percpu+0x3e/0x1e0
[307734.239576] [<ffffffff8111c49d>] handle_irq_event+0x3d/0x60
[307734.247594] [<ffffffff8111f90a>] handle_fasteoi_irq+0x5a/0x100
[307734.255538] [<ffffffff81016ecf>] handle_irq+0xbf/0x150
[307734.263392] [<ffffffff810e131a>] ? tick_check_idle+0x8a/0xd0
[307734.271173] [<ffffffff816412da>] ? atomic_notifier_call_chain+0x1a/0x20
[307734.278900] [<ffffffff81647d6f>] do_IRQ+0x4f/0xf0
[307734.286530] [<ffffffff8163d06d>] common_interrupt+0x6d/0x6d
[307734.294085] <EOI>
[307734.294116]
[307734.301553] [<ffffffff814d450f>] ? cpuidle_enter_state+0x4f/0xc0
[307734.308982] [<ffffffff814d4659>] cpuidle_idle_call+0xd9/0x210
[307734.316344] [<ffffffff8101e4be>] arch_cpu_idle+0xe/0x30
[307734.323618] [<ffffffff810d6305>] cpu_startup_entry+0x245/0x290
[307734.330820] [<ffffffff810475fa>] start_secondary+0x1ba/0x230
[307734.337944] Code: 44 00 00 55 83 f9 03 48 89 e5 41 57 41 56 41 89 ce 41 55 41 54 53 74 44 45 31 e4 85 d2 49 89 ff 48 89 f3 41 89 d5 7e 29 0f 1f 00 <8b> 53 18 48 8b 73 10 44 89 f1 4c 89 ff 41 83 c4 01 e8 82 ff ff
[307734.352990] RIP [<ffffffff81314218>] swiotlb_unmap_sg_attrs+0x28/0x60
[307734.360378] RSP <ffff880c3fc63e18>
[307734.367665] CR2: 0000000000000448
Tags: No tags attached.
abrt_hash:
URL:

Relationships

duplicate of 0010033 (resolved, toracat): Frequent Crash: BUG: unable to handle kernel NULL Pointer dereference at 0000000000000018

Activities

host45

2016-01-21 18:51

reporter   ~0025455

We have encountered the same kernel bug using a 3Ware 9740-4i SAS controller. This appears to be the same bug which was patched here: https://lkml.org/lkml/2015/4/19/92

It looks like it hasn't made it into the CentOS kernels yet. Hopefully soon?
host45

2016-01-21 18:52

reporter  

3ware_kernel_bug.jpg (660,749 bytes)
toracat

2016-01-21 19:38

manager   ~0025456

Upstream (kernel.org) commit: 579d69bc1fd56d5af5761969aa529d1d1c188300

CentOS can provide a centosplus kernel with the referenced patch. But to get the patch into the distro kernel, you need to file a bug report at Red Hat ( http://bugzilla.redhat.com ).
toracat

2016-01-21 20:03

manager  

10073.patch (4,601 bytes)
commit 579d69bc1fd56d5af5761969aa529d1d1c188300
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Apr 23 09:48:49 2015 +0200

    3w-sas: fix command completion race
    
    The 3w-sas driver needs to tear down the dma mappings before returning
    the command to the midlayer, as there is no guarantee the sglist and
    count are valid after that point.  Also remove the dma mapping helpers
    which have another inherent race due to the request_id index.
    
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Reported-by: Torsten Luettgert <ml-lkml@enda.eu>
    Tested-by: Bernd Kardatzki <Bernd.Kardatzki@med.uni-tuebingen.de>
    Cc: stable@vger.kernel.org
    Acked-by: Adam Radford <aradford@gmail.com>
    Signed-off-by: James Bottomley <JBottomley@Odin.com>

diff --git a/drivers/scsi/3w-sas.c b/drivers/scsi/3w-sas.c
index 2361772..f837485 100644
--- a/drivers/scsi/3w-sas.c
+++ b/drivers/scsi/3w-sas.c
@@ -290,26 +290,6 @@ static int twl_post_command_packet(TW_Device_Extension *tw_dev, int request_id)
 	return 0;
 } /* End twl_post_command_packet() */
 
-/* This function will perform a pci-dma mapping for a scatter gather list */
-static int twl_map_scsi_sg_data(TW_Device_Extension *tw_dev, int request_id)
-{
-	int use_sg;
-	struct scsi_cmnd *cmd = tw_dev->srb[request_id];
-
-	use_sg = scsi_dma_map(cmd);
-	if (!use_sg)
-		return 0;
-	else if (use_sg < 0) {
-		TW_PRINTK(tw_dev->host, TW_DRIVER, 0x1, "Failed to map scatter gather list");
-		return 0;
-	}
-
-	cmd->SCp.phase = TW_PHASE_SGLIST;
-	cmd->SCp.have_data_in = use_sg;
-
-	return use_sg;
-} /* End twl_map_scsi_sg_data() */
-
 /* This function hands scsi cdb's to the firmware */
 static int twl_scsiop_execute_scsi(TW_Device_Extension *tw_dev, int request_id, char *cdb, int use_sg, TW_SG_Entry_ISO *sglistarg)
 {
@@ -357,8 +337,8 @@ static int twl_scsiop_execute_scsi(TW_Device_Extension *tw_dev, int request_id,
 	if (!sglistarg) {
 		/* Map sglist from scsi layer to cmd packet */
 		if (scsi_sg_count(srb)) {
-			sg_count = twl_map_scsi_sg_data(tw_dev, request_id);
-			if (sg_count == 0)
+			sg_count = scsi_dma_map(srb);
+			if (sg_count <= 0)
 				goto out;
 
 			scsi_for_each_sg(srb, sg, sg_count, i) {
@@ -1102,15 +1082,6 @@ out:
 	return retval;
 } /* End twl_initialize_device_extension() */
 
-/* This function will perform a pci-dma unmap */
-static void twl_unmap_scsi_data(TW_Device_Extension *tw_dev, int request_id)
-{
-	struct scsi_cmnd *cmd = tw_dev->srb[request_id];
-
-	if (cmd->SCp.phase == TW_PHASE_SGLIST)
-		scsi_dma_unmap(cmd);
-} /* End twl_unmap_scsi_data() */
-
 /* This function will handle attention interrupts */
 static int twl_handle_attention_interrupt(TW_Device_Extension *tw_dev)
 {
@@ -1251,11 +1222,11 @@ static irqreturn_t twl_interrupt(int irq, void *dev_instance)
 			}
 
 			/* Now complete the io */
+			scsi_dma_unmap(cmd);
+			cmd->scsi_done(cmd);
 			tw_dev->state[request_id] = TW_S_COMPLETED;
 			twl_free_request_id(tw_dev, request_id);
 			tw_dev->posted_request_count--;
-			tw_dev->srb[request_id]->scsi_done(tw_dev->srb[request_id]);
-			twl_unmap_scsi_data(tw_dev, request_id);
 		}
 
 		/* Check for another response interrupt */
@@ -1400,10 +1371,12 @@ static int twl_reset_device_extension(TW_Device_Extension *tw_dev, int ioctl_res
 		if ((tw_dev->state[i] != TW_S_FINISHED) &&
 		    (tw_dev->state[i] != TW_S_INITIAL) &&
 		    (tw_dev->state[i] != TW_S_COMPLETED)) {
-			if (tw_dev->srb[i]) {
-				tw_dev->srb[i]->result = (DID_RESET << 16);
-				tw_dev->srb[i]->scsi_done(tw_dev->srb[i]);
-				twl_unmap_scsi_data(tw_dev, i);
+			struct scsi_cmnd *cmd = tw_dev->srb[i];
+
+			if (cmd) {
+				cmd->result = (DID_RESET << 16);
+				scsi_dma_unmap(cmd);
+				cmd->scsi_done(cmd);
 			}
 		}
 	}
@@ -1507,9 +1480,6 @@ static int twl_scsi_queue_lck(struct scsi_cmnd *SCpnt, void (*done)(struct scsi_
 	/* Save the scsi command for use by the ISR */
 	tw_dev->srb[request_id] = SCpnt;
 
-	/* Initialize phase to zero */
-	SCpnt->SCp.phase = TW_PHASE_INITIAL;
-
 	retval = twl_scsiop_execute_scsi(tw_dev, request_id, NULL, 0, NULL);
 	if (retval) {
 		tw_dev->state[request_id] = TW_S_COMPLETED;
diff --git a/drivers/scsi/3w-sas.h b/drivers/scsi/3w-sas.h
index d474892..fec6449 100644
--- a/drivers/scsi/3w-sas.h
+++ b/drivers/scsi/3w-sas.h
@@ -103,10 +103,6 @@ static char *twl_aen_severity_table[] =
 #define TW_CURRENT_DRIVER_BUILD 0
 #define TW_CURRENT_DRIVER_BRANCH 0
 
-/* Phase defines */
-#define TW_PHASE_INITIAL 0
-#define TW_PHASE_SGLIST  2
-
 /* Misc defines */
 #define TW_SECTOR_SIZE                        512
 #define TW_MAX_UNITS			      32
toracat

2016-01-21 21:04

manager   ~0025459

The following info was provided by tru_tru:

C6 kernel has the same issue. The same patch applies cleanly there.

C5 kernel already has the fix ( https://bugzilla.redhat.com/show_bug.cgi?id=572011 )
JohnnyHughes

2016-01-22 14:35

administrator   ~0025470

upstream bug: https://bugzilla.redhat.com/show_bug.cgi?id=1301080
toracat

2016-01-22 18:34

manager   ~0025484

A centosplus kernel set with the patch applied is now available from:

http://people.centos.org/toracat/kernel/7/plus/bug10073_10191/

Please test. Note that the packages are unsigned and are provided for testing purposes.
toracat

2016-01-26 17:03

manager   ~0025521

kernel-plus-3.10.0-327.4.5.el7 is out. The patch is now in this update.
toracat

2016-01-26 17:05

manager   ~0025523

We will keep this ticket open until the upstream (RHEL) kernel gets fixed.
host45

2016-01-27 18:20

reporter   ~0025540

I'll be testing a machine with the same controller and centos-plus kernel. Installing it now. Thanks!
toracat

2016-01-27 19:07

manager   ~0025542

Looking forward to hearing the result.
cmack

2016-02-01 07:35

reporter   ~0025559

I have a 3ware 9650SE-LP and I'm having the same issue. I had the latest kernel, 3.10.0-327.4.5, and that did not fix the bug, at least for my controller. I went back and matched the crash files against yum updates, and the problem appears to be limited to 327 kernels. My server is stable again running Linux 3.10.0-229.20.1.
mrcuongnv

2016-02-12 07:45

reporter   ~0025702

Hi, thanks for the patch. I have just come back from vacation. I will test it next week on a new server with the same card.
toracat

2016-03-17 16:58

manager   ~0026066

@host45 @mrcuongnv

Could you let us know the result of your test? Did the patch fix the issue? Please note that the "official" centosplus kernel (kernel-plus) has the patch applied.

Also, are you willing to do the same testing for the upstream (RHEL) kernel if/when they offer a patched version?
toracat

2016-03-22 17:08

manager   ~0026093

This is a dup of https://bugs.centos.org/view.php?id=10033 .

Issue History

Date Modified Username Field Change
2016-01-04 04:02 mrcuongnv New Issue
2016-01-21 18:51 host45 Note Added: 0025455
2016-01-21 18:52 host45 File Added: 3ware_kernel_bug.jpg
2016-01-21 19:38 toracat Note Added: 0025456
2016-01-21 19:38 toracat Status new => acknowledged
2016-01-21 20:03 toracat File Added: 10073.patch
2016-01-21 21:04 toracat Note Added: 0025459
2016-01-22 14:35 JohnnyHughes Note Added: 0025470
2016-01-22 18:34 toracat Note Added: 0025484
2016-01-26 17:03 toracat Note Added: 0025521
2016-01-26 17:05 toracat Note Added: 0025523
2016-01-26 17:05 toracat Status acknowledged => assigned
2016-01-27 18:20 host45 Note Added: 0025540
2016-01-27 19:07 toracat Note Added: 0025542
2016-02-01 07:35 cmack Note Added: 0025559
2016-02-12 07:45 mrcuongnv Note Added: 0025702
2016-03-17 16:58 toracat Note Added: 0026066
2016-03-17 16:59 toracat Status assigned => feedback
2016-03-22 17:07 toracat Relationship added duplicate of 0010033
2016-03-22 17:08 toracat Note Added: 0026093
2016-03-22 17:08 toracat Status feedback => closed
2016-03-22 17:08 toracat Resolution open => duplicate