View Issue Details

ID: 0017339
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2020-05-11 05:54
Reporter: harrison
Priority: normal
Severity: crash
Reproducibility: always
Status: new
Resolution: open
Product Version: 7.8-2003
Target Version:
Fixed in Version:
Summary: 0017339: kernel panic on the second load of the mlx4_core kernel module
Description: On boot, the mlx4_core kernel module properly finds a Mellanox ConnectX-2 VPD device. The device does not bind to the kernel module on the first load, as shown by these syslog messages:
[ 62.792147] mlx4_core 0000:05:00.0: command 0x4 timed out (go bit not cleared)
[ 62.792150] mlx4_core 0000:05:00.0: device is going to be reset
[ 62.798085] mlx4_core 0000:05:00.0: crdump: FW doesn't support health buffer access, skipping
[ 63.799505] mlx4_core 0000:05:00.0: device was reset successfully
[ 63.805637] mlx4_core 0000:05:00.0: QUERY_FW command failed, aborting
[ 63.812112] mlx4_core 0000:05:00.0: Failed to init fw, aborting.
[ 64.819389] mlx4_core: probe of 0000:05:00.0 failed with error -5
[ 64.833322] pps_core: LinuxPPS API ver. 1 registered

Removing the mlx4_core module and then loading it again with modprobe causes a kernel panic.
Steps To Reproduce: Boot a Dell R510 or R610 with a Mellanox MT26448 device running the vanilla RPM kernel:
6:17pm kuber-2/harrison [~] 1002$uname -a
Linux kuber-2.biostat.wisc.edu 3.10.0-1062.12.1.el7.x86_64 #1 SMP Wed Feb 5 09:10:55 CST 2020 x86_64 x86_64 x86_64 GNU/Linux

The message "Failed to init fw, aborting." appears.
Clear the semaphore lock:
mstflint -clear_semaphore -d 05:00.0
Remove the mlx4 kernel modules:
rmmod mlx4_en mlx4_ib mlx4_core
Probe the Mellanox module again:
modprobe mlx4_core
PANIC (see Additional Information for the state of the system at panic)
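
The reload sequence above can be written down as a small script so that others can repeat it the same way. This is only a sketch under assumptions: the firmware tool is assumed to be `mstflint`, the flags and PCI address are copied from the report, and the helper names `reload_commands` and `run` are made up for illustration. It defaults to a dry run, because the real sequence needs root and is expected to panic an affected host.

```python
import subprocess

# PCI address taken from the report; adjust for other hosts.
PCI_ADDR = "05:00.0"


def reload_commands(pci_addr=PCI_ADDR):
    """Return the command sequence from the report, in order."""
    return [
        # Firmware tool assumed to be mstflint; flags copied from the report.
        ["mstflint", "-clear_semaphore", "-d", pci_addr],
        ["rmmod", "mlx4_en", "mlx4_ib", "mlx4_core"],
        ["modprobe", "mlx4_core"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False (needs root)."""
    for cmd in commands:
        print("+", " ".join(cmd))
        if not dry_run:
            # check=False: rmmod returns non-zero if a module is not loaded,
            # which should not stop the remaining steps.
            result = subprocess.run(cmd, check=False)
            print("  exit status:", result.returncode)


if __name__ == "__main__":
    run(reload_commands())  # dry run: prints the steps without touching the system
```

Calling `run(reload_commands(), dry_run=False)` would actually execute the steps; on an affected machine the last command is the one expected to trigger the panic.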

Additional Information:
[ 1394.725722] mlx4_core: Mellanox ConnectX core driver v4.0-0
[ 1394.725760] mlx4_core: Initializing 0000:05:00.0
[ 1395.727916] mlx4_core 0000:05:00.0: mlx4_cmd_post:cmd_pending failed
[ 1395.734316] mlx4_core 0000:05:00.0: Could not post command 0x4: ret=-5, in_param=0x0, in_mod=0x0, op_mod=0x0
[ 1395.734320] mlx4_core 0000:05:00.0: device is going to be reset
[ 1405.739851] mlx4_core 0000:05:00.0: Failed to obtain HW semaphore, aborting
[ 1405.746865] mlx4_core 0000:05:00.0: Fail to reset HCA
[ 1405.751981] ------------[ cut here ]------------
[ 1405.756616] kernel BUG at drivers/net/ethernet/mellanox/mlx4/catas.c:195!
[ 1405.763483] invalid opcode: 0000 [#1] SMP
[ 1405.767616] Modules linked in: mlx4_core(+) rpcsec_gss_krb5 auth_rpcgss nfsv]
[ 1405.878422] CPU: 9 PID: 3057 Comm: modprobe Kdump: loaded Tainted: G 1
[ 1405.890029] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.14.0 05/38
[ 1405.897625] task: ffff9ea3c5e69070 ti: ffff9ea3caab4000 task.ti: ffff9ea3caa0
[ 1405.905134] RIP: 0010:[<ffffffffc0a61966>] [<ffffffffc0a61966>] mlx4_enter_]
[ 1405.915290] RSP: 0018:ffff9ea3caab7940 EFLAGS: 00010246
[ 1405.920620] RAX: ffff9eabbcf35400 RBX: ffff9ea3aced00c0 RCX: 0000000000000000
[ 1405.927781] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9eabcf2b1000
[ 1405.934942] RBP: ffff9ea3caab7968 R08: 0000000000000001 R09: ffff9ea3ca35c800
[ 1405.942103] R10: 000000000000044b R11: 0000000000000001 R12: ffff9eabbcf35460
[ 1405.949263] R13: 0000000000000000 R14: ffff9ea3aced0508 R15: 0000000000000000
[ 1405.956423] FS: 00007faeb2c1d740(0000) GS:ffff9ea3cfb00000(0000) knlGS:00000
[ 1405.964543] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1405.970309] CR2: 00007f775b0fa000 CR3: 000000080cab0000 CR4: 00000000000207e0
[ 1405.977469] Call Trace:
[ 1405.979941] [<ffffffffc0a6329c>] __mlx4_cmd+0x78c/0x920 [mlx4_core]
[ 1405.986334] [<ffffffffc0a6e5b1>] mlx4_QUERY_FW+0x71/0x420 [mlx4_core]
[ 1405.992901] [<ffffffffc0a7820c>] mlx4_load_one+0x2ac/0x1110 [mlx4_core]
[ 1405.999637] [<ffffffffc015b75d>] ? devlink_region_create+0xcd/0xf0 [devlink]
[ 1406.006813] [<ffffffffc0a799cf>] mlx4_init_one+0x6bf/0x7d0 [mlx4_core]
[ 1406.013455] [<ffffffffaa9d1b8a>] local_pci_probe+0x4a/0xb0
[ 1406.019048] [<ffffffffaa9d32d9>] pci_device_probe+0x109/0x160
[ 1406.024906] [<ffffffffaaab6425>] driver_probe_device+0xc5/0x3e0
[ 1406.030938] [<ffffffffaaab6823>] __driver_attach+0x93/0xa0
[ 1406.036531] [<ffffffffaaab6790>] ? __device_attach+0x50/0x50
[ 1406.042300] [<ffffffffaaab3fc5>] bus_for_each_dev+0x75/0xc0
[ 1406.047980] [<ffffffffaaab5d9e>] driver_attach+0x1e/0x20
[ 1406.055211] [<ffffffffaaab5840>] bus_add_driver+0x200/0x2d0
[ 1406.062697] [<ffffffffaaab6eb4>] driver_register+0x64/0xf0
[ 1406.070094] [<ffffffffaa9d2b15>] __pci_register_driver+0xa5/0xc0
[ 1406.078015] [<ffffffffc0399000>] ? 0xffffffffc0398fff
[ 1406.084964] [<ffffffffc0399138>] mlx4_init+0x138/0x1000 [mlx4_core]
[ 1406.093102] [<ffffffffaa60210a>] do_one_initcall+0xba/0x240
[ 1406.100547] [<ffffffffaa71e2ba>] load_module+0x271a/0x2bb0
[ 1406.107902] [<ffffffffaa9af950>] ? ddebug_proc_write+0x100/0x100
[ 1406.115774] [<ffffffffaa71e83f>] SyS_init_module+0xef/0x140
[ 1406.123202] [<ffffffffaad8dede>] system_call_fastpath+0x25/0x2a
[ 1406.130962] [<ffffffffaad8de21>] ? system_call_after_swapgs+0xae/0x146
[ 1406.139299] Code: 48 c7 c6 30 59 a9 c0 48 8b 38 31 c0 48 81 c7 98 00 00 00 e
[ 1406.162266] RIP [<ffffffffc0a61966>] mlx4_enter_error_state+0x296/0x380 [ml]
[ 1406.171724] RSP <ffff9ea3caab7940>
[ 0.000000] do_IRQ: 0.181 No irq handler for vector (irq -1)
[ 0.149775] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d)
[ 0.498781] mce: Unable to init device /dev/mcelog (rc: -5)
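
For comparing this panic with other reports, it can help to reduce the trace above to the BUG location and the list of reliable call-trace frames. The following is a rough sketch, not a general oops parser: the regular expressions are written against the line shapes seen in this log only, and `parse_oops` is a hypothetical helper name. Frames marked with `?` are skipped, and anything after the `Code:` line (such as the truncated trailing `RIP` line) is ignored.

```python
import re

# Matches e.g. "[<ffffffffc0a6329c>] __mlx4_cmd+0x78c/0x920 [mlx4_core]"
# and the "? symbol+0x.." form the kernel uses for unreliable frames.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)
BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")


def parse_oops(lines):
    """Return ((file, line) or None, [(symbol, module or None), ...])."""
    bug = None
    frames = []
    in_trace = False
    for line in lines:
        m = BUG_RE.search(line)
        if m:
            bug = (m.group("file"), int(m.group("line")))
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if "Code:" in line:
            in_trace = False  # the trace ends where the code dump starts
            continue
        if in_trace:
            m = FRAME_RE.search(line)
            if m and not m.group("unreliable"):
                frames.append((m.group("symbol"), m.group("module")))
    return bug, frames
```

Feeding it the lines of a saved dmesg capture, for example `parse_oops(open(path).read().splitlines())` with `path` pointing at that file, yields the BUG source location and the frames leading from `SyS_init_module` down to `__mlx4_cmd`.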


Tags: No tags attached.

Activities

harrison

2020-05-06 23:28

reporter   ~0036884

The driver loads properly with kernel 3.10.0-693:
6:26pm kuber-2/root [~] 1002$uname -a
Linux kuber-2.biostat.wisc.edu 3.10.0-693.17.1.el7.x86_64 #1 SMP Thu Jan 25 04:11:40 CST 2018 x86_64 x86_64 x86_64 GNU/Linux


From dmesg:
[ 2.005400] mlx4_core: Mellanox ConnectX core driver v2.2-1 (Feb, 2014)
[ 2.005433] mlx4_core: Initializing 0000:05:00.0
[ 4.414250] mlx4_core 0000:05:00.0: PCIe BW is different than device's capability
[ 4.414255] mlx4_core 0000:05:00.0: PCIe link speed is 5.0GT/s, device supports 5.0GT/s
[ 4.414257] mlx4_core 0000:05:00.0: PCIe link width is x4, device supports x8
[ 4.414350] mlx4_core 0000:05:00.0: irq 35 for MSI/MSI-X
[ 4.414357] mlx4_core 0000:05:00.0: irq 36 for MSI/MSI-X
[ 4.414363] mlx4_core 0000:05:00.0: irq 37 for MSI/MSI-X
[ 4.414369] mlx4_core 0000:05:00.0: irq 38 for MSI/MSI-X
[ 4.414375] mlx4_core 0000:05:00.0: irq 39 for MSI/MSI-X
[ 4.414387] mlx4_core 0000:05:00.0: irq 40 for MSI/MSI-X
[ 4.414396] mlx4_core 0000:05:00.0: irq 41 for MSI/MSI-X
[ 4.414403] mlx4_core 0000:05:00.0: irq 42 for MSI/MSI-X
[ 4.414409] mlx4_core 0000:05:00.0: irq 43 for MSI/MSI-X
[ 4.414415] mlx4_core 0000:05:00.0: irq 44 for MSI/MSI-X
[ 4.414421] mlx4_core 0000:05:00.0: irq 45 for MSI/MSI-X
[ 4.414427] mlx4_core 0000:05:00.0: irq 46 for MSI/MSI-X
[ 4.414432] mlx4_core 0000:05:00.0: irq 47 for MSI/MSI-X
[ 4.414441] mlx4_core 0000:05:00.0: irq 48 for MSI/MSI-X
[ 4.414446] mlx4_core 0000:05:00.0: irq 49 for MSI/MSI-X
[ 4.414452] mlx4_core 0000:05:00.0: irq 50 for MSI/MSI-X
[ 4.414458] mlx4_core 0000:05:00.0: irq 51 for MSI/MSI-X
[ 4.414463] mlx4_core 0000:05:00.0: irq 52 for MSI/MSI-X
[ 4.414469] mlx4_core 0000:05:00.0: irq 53 for MSI/MSI-X
[ 4.414474] mlx4_core 0000:05:00.0: irq 54 for MSI/MSI-X
[ 4.414480] mlx4_core 0000:05:00.0: irq 55 for MSI/MSI-X
[ 4.414488] mlx4_core 0000:05:00.0: irq 56 for MSI/MSI-X
[ 4.414494] mlx4_core 0000:05:00.0: irq 57 for MSI/MSI-X
[ 4.414501] mlx4_core 0000:05:00.0: irq 58 for MSI/MSI-X
[ 4.414507] mlx4_core 0000:05:00.0: irq 59 for MSI/MSI-X
[ 4.564391] mlx4_en: Mellanox ConnectX HCA Ethernet driver v2.2-1 (Feb 2014)
[ 4.564740] mlx4_en 0000:05:00.0: Activating port:1
[ 4.565371] mlx4_en: 0000:05:00.0: Port 1: enabling only PFC DCB ops
[ 4.573455] mlx4_en: 0000:05:00.0: Port 1: Using 192 TX rings
[ 4.573461] mlx4_en: 0000:05:00.0: Port 1: Using 8 RX rings
[ 4.573641] mlx4_en: 0000:05:00.0: Port 1: Initializing port
[ 6.973055] mlx4_en: p1p1: Link Up
[ 12.730110] <mlx4_ib> mlx4_ib_add: mlx4_ib: Mellanox ConnectX InfiniBand driver v2.2-1 (Feb 2014)
[ 12.730635] <mlx4_ib> mlx4_ib_add: counter index 1 for port 1 allocated 1
ManuelWolfshant

2020-05-07 00:20

manager   ~0036885

3.10.0-1062.12.1.el7.x86_64 is a kernel for CentOS 7.7 and is no longer supported. Please update your system to CentOS 7.8 (which ships 3.10.0-1127.el7.x86_64) and retry. If the issue persists, please open a bug at bugzilla.redhat.com (and cross-link it with this bug). Because the driver works with the older kernel, this looks like a regression in the kernel shipped by Red Hat, and only they can fix it. Once they do, CentOS will inherit the fix automatically, since CentOS is built from the sources they provide.
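
A quick way to check whether a host is still on a kernel older than the CentOS 7.8 one named above is to compare the numeric parts of the release strings. This is a simplified sketch: it is not RPM's own version comparison, it looks only at the leading numeric fields (ignoring `.el7.x86_64`), and the names `kernel_release_key`, `is_older_than` and `CENTOS_7_8_KERNEL` are invented for illustration.

```python
import re

# Kernel release named in the note above as the one shipped with CentOS 7.8.
CENTOS_7_8_KERNEL = "3.10.0-1127.el7.x86_64"


def kernel_release_key(release):
    """Turn '3.10.0-1062.12.1.el7.x86_64' into ((3, 10, 0), (1062, 12, 1))."""
    m = re.match(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)", release)
    if not m:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version = tuple(int(part) for part in m.group(1).split("."))
    build = tuple(int(part) for part in m.group(2).split("."))
    return version, build


def is_older_than(release, baseline=CENTOS_7_8_KERNEL):
    """True if `release` sorts before `baseline` by its numeric fields."""
    return kernel_release_key(release) < kernel_release_key(baseline)
```

On a live system the running release can be passed in with `is_older_than(platform.release())` after `import platform`. Both kernels quoted in this report (3.10.0-693.17.1 and 3.10.0-1062.12.1) compare as older than the 7.8 kernel.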

Issue History

Date Modified | Username | Field | Change
2020-05-06 23:21 | harrison | New Issue |
2020-05-06 23:28 | harrison | Note Added: 0036884 |
2020-05-07 00:20 | ManuelWolfshant | Note Added: 0036885 |