View Issue Details

ID: 0017104
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2020-03-13 17:24
Reporter: guzzijason
Priority: normal
Severity: major
Reproducibility: always
Status: new
Resolution: open
Platform: HP ProLiant DL325 Gen10
OS: CentOS
OS Version: 7
Product Version: 7.7-1908
Target Version:
Fixed in Version:
Summary: 0017104: Enabling AMD IOMMU in BIOS causes serious performance problems after upgrade from 7.6.1810 to 7.7.1908
Description:
So far, the problem appears isolated to HP ProLiant DL325 Gen10 (1 x AMD EPYC 7702P 64-Core).
Possibly related to Mellanox Technologies MT27800 Family [ConnectX-5] (mlx5_core).

Servers were running OK under 7.6.1810. After upgrading to 7.7.1908, our application (Apache Traffic Server) would, after several minutes, start consuming massive amounts of CPU, the network interface would start flapping, and several ksoftirqd processes would each start consuming 100% CPU.

The attached screenshot of top output shows the problem occurring.

dmesg output:

[71431.245857] NMI watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [[ET_NET 18]:36480]
[71431.246149] Modules linked in: bonding ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_filter ip6_tables xt_DSCP xt_multiport iptable_mangle ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_dscp xt_set iptable_filter ip_set_hash_net ip_set nfnetlink vfat fat amd64_edac_mod edac_mce_amd kvm_amd kvm ttm irqbypass crc32_pclmul drm_kms_helper ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops aesni_intel lrw gf128mul glue_helper ablk_helper sg ipmi_si drm cryptd pcspkr ipmi_devintf hpilo drm_panel_orientation_quirks hpwdt i2c_piix4 k10temp ipmi_msghandler pcc_cpufreq wmi acpi_cpufreq acpi_power_meter ip_tables ext4 mbcache jbd2 sd_mod crc_t10dif crct10dif_generic mlx5_core igb ahci mlxfw devlink libahci crct10dif_pclmul ptp i2c_algo_bit crct10dif_common
[71431.248950] libata crc32c_intel dca pps_core dm_mirror dm_region_hash dm_log dm_mod brd
[71431.249234] CPU: 53 PID: 36480 Comm: [ET_NET 18] Kdump: loaded Not tainted 3.10.0-1062.9.1.el7.x86_64 #1
[71431.249585] Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 09/17/2019
[71431.249905] task: ffff92d9053841c0 ti: ffff92d109a10000 task.ti: ffff92d109a10000
[71431.250263] RIP: 0010:[<ffffffffb7383025>] [<ffffffffb7383025>] _raw_spin_unlock_irqrestore+0x15/0x20
[71431.250617] RSP: 0018:ffff92d97d343cd8 EFLAGS: 00000257
[71431.250813] RAX: ffff92d971e1f140 RBX: ffffffffb71f4343 RCX: ffff92d156539180
[71431.251076] RDX: ffff92d1a4642100 RSI: 0000000000000257 RDI: 0000000000000257
[71431.251346] RBP: ffff92d97d343cd8 R08: ffff92d1a4642100 R09: ffff92d97d2db880
[71431.251609] R10: 000ffffffff9687f R11: fffffb5cb9f28a00 R12: ffff92d97d343c48
[71431.251873] R13: ffffffffb738eefa R14: ffff92d97d343cd8 R15: ffff92d156539180
[71431.252136] FS: 00002af301d2e700(0000) GS:ffff92d97d340000(0000) knlGS:0000000000000000
[71431.252444] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[71431.252674] CR2: 00007ff0fe8a1080 CR3: 000001005df62000 CR4: 0000000000340fe0
[71431.252939] Call Trace:
[71431.253026] <IRQ>
[71431.253106] [<ffffffffb71f4413>] alloc_iova+0x103/0x180
[71431.264170] [<ffffffffb71f514b>] alloc_iova_fast+0x4b/0xb0
[71431.275717] [<ffffffffb71f690a>] dma_ops_alloc_iova.isra.23+0x7a/0x90
[71431.286947] [<ffffffffb71f7b91>] __map_single.isra.27+0x51/0x1c0
[71431.298106] [<ffffffffb71f9f24>] map_page+0x64/0x90
[71431.309226] [<ffffffffc0584313>] mlx5e_post_rx_mpwqes+0x263/0x820 [mlx5_core]
[71431.320501] [<ffffffffc0586b19>] mlx5e_napi_poll+0xe9/0xd40 [mlx5_core]
[71431.331647] [<ffffffffb6d0ff40>] ? tick_sched_do_timer+0x50/0x50
[71431.342506] [<ffffffffb6caf55d>] ? update_process_times+0x6d/0x80
[71431.353611] [<ffffffffb725057f>] net_rx_action+0x26f/0x390
[71431.364618] [<ffffffffb6ca5305>] __do_softirq+0xf5/0x280
[71431.376027] [<ffffffffb739142c>] call_softirq+0x1c/0x30
[71431.386966] <EOI>
[71431.387042] [<ffffffffb6c2f715>] do_softirq+0x65/0xa0
[71431.410158] [<ffffffffb6ca475b>] __local_bh_enable_ip+0x9b/0xb0
[71431.422450] [<ffffffffb6ca4787>] local_bh_enable+0x17/0x20
[71431.433221] [<ffffffffb7381825>] __cond_resched_softirq+0x45/0x60
[71431.444052] [<ffffffffb723499b>] release_sock+0xab/0x170
[71431.454731] [<ffffffffb72ac340>] tcp_sendmsg+0xe0/0xc60
[71431.465411] [<ffffffffb72d89b9>] inet_sendmsg+0x69/0xb0
[71431.475659] [<ffffffffb6f07bb3>] ? selinux_socket_sendmsg+0x23/0x30
[71431.486154] [<ffffffffb722e57d>] sock_aio_write+0x15d/0x180
[71431.496625] [<ffffffffb6e49dcb>] do_sync_readv_writev+0x7b/0xd0
[71431.507257] [<ffffffffb6e4ba0e>] do_readv_writev+0xce/0x260
[71431.517383] [<ffffffffb6e4bc35>] vfs_writev+0x35/0x60
[71431.527325] [<ffffffffb6e4bdef>] SyS_writev+0x7f/0x110
[71431.536865] [<ffffffffb738dede>] system_call_fastpath+0x25/0x2a
[71431.546251] Code: 07 00 66 66 66 90 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 c6 07 00 66 66 66 90 48 89 f7 57 9d <66> 66 90 66 90 5d c3 0f 1f 40 00 66 66 66 66 90 55 48 89 e5 48
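
For readers who want to see at a glance which code path a trace like the one above goes through, the following is a minimal sketch that pulls the symbol names out of call-trace lines. It is not part of the original report. The names `Frame`, `parse_frames` and `SAMPLE` are hypothetical, and the regular expression is an assumption based only on the line formats that appear in this issue. Frames prefixed with `?` are ones the kernel's unwinder found on the stack but could not confirm as part of the call chain, so they are flagged as unreliable.

```python
import re
from typing import List, NamedTuple, Optional


class Frame(NamedTuple):
    symbol: str            # e.g. "alloc_iova"
    module: Optional[str]  # e.g. "mlx5_core", or None for built-in code
    reliable: bool         # False when the frame is prefixed with "?"


# symbol+0xOFFSET/0xSIZE, optionally preceded by "? " and followed by " [module]".
FRAME_RE = re.compile(r"(\? )?([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?")


def parse_frames(trace_text: str) -> List[Frame]:
    """Extract one Frame per call-trace line that contains a symbol."""
    frames = []
    for line in trace_text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append(Frame(m.group(2), m.group(3), m.group(1) is None))
    return frames


# A few call-trace lines copied from the dmesg output in this report.
SAMPLE = """\
[71431.253106] [<ffffffffb71f4413>] alloc_iova+0x103/0x180
[71431.264170] [<ffffffffb71f514b>] alloc_iova_fast+0x4b/0xb0
[71431.275717] [<ffffffffb71f690a>] dma_ops_alloc_iova.isra.23+0x7a/0x90
[71431.298106] [<ffffffffb71f9f24>] map_page+0x64/0x90
[71431.309226] [<ffffffffc0584313>] mlx5e_post_rx_mpwqes+0x263/0x820 [mlx5_core]
[71431.331647] [<ffffffffb6d0ff40>] ? tick_sched_do_timer+0x50/0x50
"""

if __name__ == "__main__":
    frames = parse_frames(SAMPLE)
    # True when a confirmed frame is inside the IOVA allocator.
    print(any(f.reliable and "alloc_iova" in f.symbol for f in frames))
    # Modules other than the core kernel that appear in the trace.
    print(sorted({f.module for f in frames if f.module}))
```

The same function also accepts the newer trace format without the `[<address>]` column, since it only looks for the `symbol+0x.../0x...` part of each line.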
Steps To Reproduce:
1. Boot HP ProLiant DL325 Gen10 with AMD IOMMU enabled in HP RBSU (BIOS, default setting)
2. Start Apache Traffic Server and start sending load
3. Error occurs anywhere from within seconds of traffic starting to 10 minutes or so later

Note that disabling IOMMU in the HP RBSU processor options appears to eliminate the problem entirely.
Additional Information:
Every released 7.7.1908 kernel variant was tried.
kernel-ml-5.5.6-2.el7.elrepo.x86_64 was also tried.
Same results with all kernels.
No IOMMU-specific boot parameters are used.

Boot options:
linuxefi /vmlinuz-3.10.0-1062.12.1.el7.x86_64 root=/dev/mapper/vg01-lv_root ro rd_NO_LUKS LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us printk.time=1 rd.driver.blacklist=usb-storage audit=1 rd_NO_MD rd_NO_DM biosdevname=1 net.ifnames=1 nomodeset transparent_hugepage=never
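
The statement that no IOMMU-specific boot parameters are in use can be checked mechanically. The sketch below is an illustration and was not part of the original report. `iommu_params`, `IOMMU_PARAM_PREFIXES` and `REPORTED_CMDLINE` are hypothetical names, and the prefix list covers only the IOMMU-related kernel parameters the editor is aware of (`iommu=`, `iommu.*`, `amd_iommu=`, `intel_iommu=`), so it should be treated as an assumption, not an exhaustive list.

```python
from typing import List

# Prefixes of kernel parameters that configure the IOMMU (assumed list,
# based on the kernel's kernel-parameters documentation).
IOMMU_PARAM_PREFIXES = ("iommu=", "iommu.", "amd_iommu=", "intel_iommu=")


def iommu_params(cmdline: str) -> List[str]:
    """Return the tokens of a kernel command line that configure the IOMMU."""
    return [tok for tok in cmdline.split() if tok.startswith(IOMMU_PARAM_PREFIXES)]


# Boot options copied from this report.
REPORTED_CMDLINE = (
    "linuxefi /vmlinuz-3.10.0-1062.12.1.el7.x86_64 root=/dev/mapper/vg01-lv_root ro "
    "rd_NO_LUKS LANG=en_US.UTF-8 SYSFONT=latarcyrheb-sun16 crashkernel=auto "
    "KEYBOARDTYPE=pc KEYTABLE=us printk.time=1 rd.driver.blacklist=usb-storage "
    "audit=1 rd_NO_MD rd_NO_DM biosdevname=1 net.ifnames=1 nomodeset "
    "transparent_hugepage=never"
)

if __name__ == "__main__":
    found = iommu_params(REPORTED_CMDLINE)
    print(found if found else "no IOMMU parameters set")
```

On a running system the same function could be applied to the contents of `/proc/cmdline` instead of a copied string.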
Tags: No tags attached.
abrt_hash:
URL:

Activities

guzzijason

2020-03-02 17:06

reporter  

image (6).png (362,426 bytes)
toracat

2020-03-02 18:52

manager   ~0036421

I see that you have tested kernel-ml-5.5.6-2.el7.elrepo.x86_64. Can you try the latest test kernel 5.6.0? You can get it from:

http://elrepo.org/people/ajb/devel/kernel-ml/el7/x86_64/RPMS/
guzzijason

2020-03-02 18:56

reporter   ~0036422

Update: I've replicated the same problem on a Dell PowerEdge R7515 with a similar configuration (1 x AMD EPYC 7702P 64-Core Processor, Mellanox Technologies MT28800 Family [ConnectX-5 Ex]). So, this is not HW vendor-specific.

I see the note above about 5.6.0. Will look into that.
guzzijason

2020-03-02 22:15

reporter   ~0036424

toracat, I don't see any improvement with 5.6.0-0.rc4.el7.elrepo.x86_64.
I didn't get the dump in dmesg this time, but the performance problem looks the same as before: CPU usage shoots through the roof.

top - 22:13:12 up 12 min, 2 users, load average: 3.55, 7.17, 4.44
Tasks: 1162 total, 119 running, 360 sleeping, 0 stopped, 1 zombie
%Cpu(s): 0.1 us, 10.5 sy, 0.0 ni, 83.9 id, 0.3 wa, 0.0 hi, 5.2 si, 0.0 st
KiB Mem : 10439571+total, 10183158+free, 23599036 used, 2042288 buff/cache
KiB Swap: 8388604 total, 8388604 free, 0 used. 10150681+avail Mem

   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 35605 ats 20 0 8038184 1.6g 16000 R 2388 0.2 26:45.43 [TS_MAIN]
   111 root 20 0 0 0 0 R 100.0 0.0 0:57.69 ksoftirqd/20
   141 root 20 0 0 0 0 R 100.0 0.0 0:47.79 ksoftirqd/26
   176 root 20 0 0 0 0 R 100.0 0.0 0:47.86 ksoftirqd/33
     9 root 20 0 0 0 0 R 99.3 0.0 0:56.88 ksoftirqd/0
   271 root 20 0 0 0 0 R 99.3 0.0 0:36.31 ksoftirqd/52
   161 root 20 0 0 0 0 R 99.0 0.0 0:43.54 ksoftirqd/30
   306 root 20 0 0 0 0 R 98.7 0.0 0:38.04 ksoftirqd/59
    56 root 20 0 0 0 0 R 97.7 0.0 0:45.06 ksoftirqd/9
   296 root 20 0 0 0 0 R 97.7 0.0 0:28.09 ksoftirqd/57
    71 root 20 0 0 0 0 R 96.7 0.0 1:01.24 ksoftirqd/12
    16 root 20 0 0 0 0 R 96.4 0.0 0:58.85 ksoftirqd/1
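
To quantify the symptom shown in the top output above, the saturated ksoftirqd threads can be counted from a saved copy of it. This is a minimal sketch that was not part of the original comment. `busy_ksoftirqd` and `SAMPLE` are hypothetical names, the 90% threshold is an arbitrary choice, and the parser assumes top's default column order (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND) as seen in the paste.

```python
from typing import List, Tuple


def busy_ksoftirqd(top_text: str, threshold: float = 90.0) -> List[Tuple[int, float]]:
    """Return (cpu_index, %CPU) for each ksoftirqd thread at or above threshold."""
    busy = []
    for line in top_text.splitlines():
        fields = line.split()
        # Column 9 is %CPU and column 12 is COMMAND in top's default layout.
        if len(fields) < 12 or not fields[11].startswith("ksoftirqd/"):
            continue
        cpu_pct = float(fields[8])
        if cpu_pct >= threshold:
            # The number after the slash is the CPU the thread is bound to.
            busy.append((int(fields[11].split("/")[1]), cpu_pct))
    return busy


# A few rows copied from the top output in this comment.
SAMPLE = """\
   PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 35605 ats 20 0 8038184 1.6g 16000 R 2388 0.2 26:45.43 [TS_MAIN]
   111 root 20 0 0 0 0 R 100.0 0.0 0:57.69 ksoftirqd/20
     9 root 20 0 0 0 0 R 99.3 0.0 0:56.88 ksoftirqd/0
    16 root 20 0 0 0 0 R 96.4 0.0 0:58.85 ksoftirqd/1
"""

if __name__ == "__main__":
    for cpu, pct in busy_ksoftirqd(SAMPLE):
        print(f"CPU {cpu}: ksoftirqd at {pct}%")
```

Comparing this list between a run with IOMMU enabled and a run with it disabled would give a simple before/after measure of the softirq load.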
guzzijason

2020-03-02 22:22

reporter   ~0036425

OK, I do see the same output in dmesg now with the new kernel:


[ 1060.321549] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [[ET_NET 14]:35724]
[ 1060.330752] Modules linked in: bonding ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_filter ip6_tables xt_DSCP xt_multiport iptable_mangle ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_dscp xt_set iptable_filter ip_set_hash_net ip_set nfnetlink vfat fat edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul drm_vram_helper ghash_clmulni_intel drm_ttm_helper ttm drm_kms_helper aesni_intel drm syscopyarea sysfillrect crypto_simd sysimgblt ipmi_si cryptd joydev input_leds pcspkr fb_sys_fops sg ipmi_devintf glue_helper hpilo sp5100_tco hpwdt ccp i2c_piix4 k10temp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ip_tables ext4 mbcache jbd2 sd_mod t10_pi ahci igb libahci crc32c_intel i2c_algo_bit dca libata mlx5_core mlxfw pci_hyperv_intf ptp pps_core dm_mirror dm_region_hash dm_log dm_mod brd
[ 1060.387118] CPU: 117 PID: 35724 Comm: [ET_NET 14] Tainted: G L 5.6.0-0.rc4.el7.elrepo.x86_64 #1
[ 1060.396807] Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 09/17/2019
[ 1060.406729] RIP: 0010:native_queued_spin_lock_slowpath+0x64/0x1f0
[ 1060.416435] Code: ff ff 75 3f f0 0f ba 2f 08 0f 82 29 01 00 00 31 d2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1c 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 f6 c4 01 75 04 c6
[ 1060.435939] RSP: 0018:ffffc90001af0e58 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 1060.446036] RAX: 0000000000000101 RBX: ffff8980223c1040 RCX: ffff89807eb5e8a0
[ 1060.456146] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8980223c10c0
[ 1060.466701] RBP: ffffc90001af0e58 R08: ffffc90001af0f10 R09: 0000000000000000
[ 1060.476612] R10: 0000000000000201 R11: 0000000000000000 R12: ffff898021d00480
[ 1060.486236] R13: 000000000000000d R14: ffff8980223c10c0 R15: 0000000000000075
[ 1060.496761] FS: 00007facbe3d4700(0000) GS:ffff89807eb40000(0000) knlGS:0000000000000000
[ 1060.506841] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1060.516858] CR2: 00007f64c7e10000 CR3: 000000f879f9a000 CR4: 0000000000340ee0
[ 1060.527271] Call Trace:
[ 1060.537684] <IRQ>
[ 1060.548532] queued_spin_lock_slowpath+0xb/0x13
[ 1060.559922] _raw_spin_lock+0x23/0x30
[ 1060.570337] dev_watchdog+0x69/0x280
[ 1060.580771] ? pfifo_fast_enqueue+0x130/0x130
[ 1060.591296] call_timer_fn+0x34/0x140
[ 1060.601429] run_timer_softirq+0x20a/0x480
[ 1060.611456] ? lapic_next_event+0x20/0x30
[ 1060.621408] ? clockevents_program_event+0x7e/0x100
[ 1060.631603] __do_softirq+0xd9/0x29e
[ 1060.641980] do_softirq_own_stack+0x2a/0x40
[ 1060.652607] </IRQ>
[ 1060.663297] do_softirq+0x55/0x60
[ 1060.673969] __local_bh_enable_ip+0x57/0x60
[ 1060.684532] ip_finish_output2+0x195/0x520
[ 1060.695076] __ip_finish_output+0x10d/0x1f0
[ 1060.705287] ip_finish_output+0x2e/0xc0
[ 1060.715462] ip_output+0x76/0xf0
[ 1060.725300] ? __ip_finish_output+0x1f0/0x1f0
[ 1060.734947] ip_local_out+0x3b/0x50
[ 1060.744357] __ip_queue_xmit+0x155/0x3e0
[ 1060.753749] ? __kmalloc_node_track_caller+0x5e/0x2d0
[ 1060.763757] ? __wake_up_common+0x8f/0x160
[ 1060.773950] ip_queue_xmit+0x10/0x20
[ 1060.783680] __tcp_transmit_skb+0x5b0/0xab0
[ 1060.793424] __tcp_send_ack.part.56+0xa5/0x100
[ 1060.802844] tcp_send_ack+0x1c/0x20
[ 1060.811899] __tcp_ack_snd_check+0x42/0x1d0
[ 1060.820588] tcp_rcv_state_process+0xa56/0xe28
[ 1060.829095] ? __schedule+0x2d2/0x6e0
[ 1060.837287] ? tcp_sendmsg_locked+0x94b/0xdf0
[ 1060.845108] tcp_v4_do_rcv+0x77/0x1f0
[ 1060.852628] __release_sock+0x8d/0xe0
[ 1060.859931] release_sock+0x30/0xa0
[ 1060.866876] tcp_sendmsg+0x37/0x50
[ 1060.873765] inet_sendmsg+0x42/0x80
[ 1060.879894] sock_sendmsg+0x5f/0x80
[ 1060.886284] sock_write_iter+0x8c/0xf0
[ 1060.892725] do_iter_readv_writev+0x1b4/0x1e0
[ 1060.898845] do_iter_write+0x83/0x1a0
[ 1060.905096] vfs_writev+0x81/0x100
[ 1060.911025] ? __audit_syscall_entry+0xdd/0x130
[ 1060.916973] ? __fget_light+0x31/0x80
[ 1060.922986] do_writev+0xf4/0x110
[ 1060.928737] __x64_sys_writev+0x1c/0x20
[ 1060.934363] do_syscall_64+0x60/0x1e0
[ 1060.940014] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1060.945615] RIP: 0033:0x7facc64793e0
[ 1060.951187] Code: 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 e9 71 01 00 48 63 54 24 1c 41 89 c0 48 8b 74 24 10 48 63 7c 24 08 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 17 44 89 c7 48 89 44 24 08 e8 1b 72 01 00 48
[ 1060.963427] RSP: 002b:00007facbe3cfbb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
[ 1060.969817] RAX: ffffffffffffffda RBX: 000000000000287c RCX: 00007facc64793e0
[ 1060.976214] RDX: 0000000000000002 RSI: 00007facbe3cfc80 RDI: 000000000000287c
[ 1060.982424] RBP: 0000000000000230 R08: 0000000000000000 R09: 0000000000000000
[ 1060.988898] R10: 00007facbe3d3db0 R11: 0000000000000293 R12: 00000000000000dc
[ 1060.996178] R13: 00007fac74dc8cc8 R14: 00007fac74dc8cb0 R15: 0000000000000000
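
When dmesg contains many of these reports, a short script can summarise which CPUs and tasks were affected. The sketch below is illustrative only and was not part of the original comment. `Lockup`, `parse_lockups` and `SAMPLE` are hypothetical names, and the regular expression is an assumption derived from the header lines seen in this issue, covering both the 3.10 form ("NMI watchdog: BUG: soft lockup") and the 5.6 form ("watchdog: BUG: soft lockup").

```python
import re
from collections import Counter
from typing import List, NamedTuple


class Lockup(NamedTuple):
    cpu: int      # CPU number reported as stuck
    seconds: int  # how long the watchdog says it was stuck
    task: str     # task name, e.g. "[ET_NET 18]" or "swapper/117"
    pid: int


# The task name may itself contain brackets, so match greedily up to the
# last ":<pid>]" on the line.
LOCKUP_RE = re.compile(r"BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]")


def parse_lockups(dmesg_text: str) -> List[Lockup]:
    """Return one Lockup per soft-lockup header line found in dmesg output."""
    found = []
    for line in dmesg_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            found.append(Lockup(int(m.group(1)), int(m.group(2)), m.group(3), int(m.group(4))))
    return found


# Header lines copied from the dmesg output posted in this issue.
SAMPLE = """\
[71431.245857] NMI watchdog: BUG: soft lockup - CPU#53 stuck for 22s! [[ET_NET 18]:36480]
[ 1060.321549] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [[ET_NET 14]:35724]
[ 1000.320194] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [swapper/117:0]
"""

if __name__ == "__main__":
    lockups = parse_lockups(SAMPLE)
    # Number of reports per CPU, most frequently affected first.
    print(Counter(l.cpu for l in lockups).most_common())
```

Because the script only reads the header line of each report, it still works when the bodies of concurrent reports from different CPUs are interleaved in the log.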
guzzijason

2020-03-02 22:28

reporter   ~0036426

Here are all the dumps that dmesg logged this time:

[ 1000.320194] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [swapper/117:0]
[ 1000.320466] Modules linked in: bonding ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_filter ip6_tables xt_DSCP xt_multiport iptable_mangle ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_dscp xt_set iptable_filter ip_set_hash_net ip_set nfnetlink vfat fat edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul drm_vram_helper ghash_clmulni_intel drm_ttm_helper ttm drm_kms_helper aesni_intel drm syscopyarea sysfillrect crypto_simd sysimgblt ipmi_si cryptd joydev input_leds pcspkr fb_sys_fops sg ipmi_devintf glue_helper hpilo sp5100_tco hpwdt ccp i2c_piix4 k10temp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ip_tables ext4 mbcache jbd2 sd_mod t10_pi ahci igb libahci crc32c_intel i2c_algo_bit dca libata mlx5_core mlxfw pci_hyperv_intf ptp pps_core dm_mirror dm_region_hash dm_log dm_mod brd
[ 1000.323199] CPU: 117 PID: 0 Comm: swapper/117 Not tainted 5.6.0-0.rc4.el7.elrepo.x86_64 #1
[ 1000.323504] Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 09/17/2019
[ 1000.323821] RIP: 0010:native_queued_spin_lock_slowpath+0x64/0x1f0
[ 1000.324042] Code: ff ff 75 3f f0 0f ba 2f 08 0f 82 29 01 00 00 31 d2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1c 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 f6 c4 01 75 04 c6
[ 1000.324723] RSP: 0018:ffffc90001af0e20 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 1000.325004] RAX: 0000000000000101 RBX: ffff8980223c0f00 RCX: 0000000000000100
[ 1000.325266] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8980223c0f80
[ 1000.325527] RBP: ffffc90001af0e20 R08: ffff89807eb5e4f0 R09: 00f0000000000000
[ 1000.325789] R10: 0000000000000001 R11: 0000000000000000 R12: ffff898021d00480
[ 1000.326047] R13: 000000000000000c R14: ffff8980223c0f80 R15: 0000000000000075
[ 1000.326309] FS: 0000000000000000(0000) GS:ffff89807eb40000(0000) knlGS:0000000000000000
[ 1000.326606] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1000.326818] CR2: 00007f64c7e10000 CR3: 000000f879f9a000 CR4: 0000000000340ee0
[ 1000.327080] Call Trace:
[ 1000.327171] <IRQ>
[ 1000.338384] queued_spin_lock_slowpath+0xb/0x13
[ 1000.349096] _raw_spin_lock+0x23/0x30

[ 1000.357404] watchdog: BUG: soft lockup - CPU#72 stuck for 21s! [[ET_NET 1]:35711]
[ 1000.360664] dev_watchdog+0x69/0x280
[ 1000.360668] ? pfifo_fast_enqueue+0x130/0x130
[ 1000.371936] Modules linked in: bonding ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_filter ip6_tables xt_DSCP xt_multiport iptable_mangle ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_dscp xt_set iptable_filter ip_set_hash_net ip_set nfnetlink vfat fat edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul drm_vram_helper ghash_clmulni_intel drm_ttm_helper ttm drm_kms_helper aesni_intel drm syscopyarea sysfillrect crypto_simd sysimgblt ipmi_si cryptd joydev input_leds pcspkr fb_sys_fops sg ipmi_devintf glue_helper hpilo sp5100_tco hpwdt ccp i2c_piix4 k10temp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ip_tables ext4 mbcache jbd2 sd_mod t10_pi ahci igb libahci crc32c_intel i2c_algo_bit dca libata mlx5_core mlxfw pci_hyperv_intf ptp pps_core dm_mirror dm_region_hash dm_log dm_mod brd
[ 1000.383144] call_timer_fn+0x34/0x140
[ 1000.383147] run_timer_softirq+0x20a/0x480
[ 1000.394786] CPU: 72 PID: 35711 Comm: [ET_NET 1] Not tainted 5.6.0-0.rc4.el7.elrepo.x86_64 #1
[ 1000.394789] Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 09/17/2019
[ 1000.482983] ? enqueue_hrtimer+0x3e/0xa0
[ 1000.495994] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
[ 1000.509320] ? ktime_get+0x3e/0xa0
[ 1000.509324] __do_softirq+0xd9/0x29e
[ 1000.523107] Code: 41 9c 75 ff 31 c0 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 31 c0 ba 01 00 00
[ 1000.536841] irq_exit+0xcc/0xe0
[ 1000.550133] RSP: 0018:ffffc90001334728 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff13
[ 1000.564027] smp_apic_timer_interrupt+0x79/0x140
[ 1000.564030] apic_timer_interrupt+0xf/0x20
[ 1000.577462] RAX: ffff89805090a008 RBX: ffff89787e702200 RCX: ffff8978c325ef00
[ 1000.577465] RDX: ffff89787e702200 RSI: 0000000000000257 RDI: 0000000000000257
[ 1000.590800] </IRQ>
[ 1000.617339] RBP: ffffc90001334728 R08: 0000000000000000 R09: ffff8980648088c0
[ 1000.617341] R10: ffff89805b4b2110 R11: 0000000000001035 R12: 0000000000000001
[ 1000.630571] RIP: 0010:cpuidle_enter_state+0xf1/0x3b0
[ 1000.630574] Code: e8 04 29 9b ff 44 8b 63 04 48 89 45 c0 0f 1f 44 00 00 31 ff e8 70 40 9b ff 80 7d cf 00 0f 85 ba 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 37 02 00 00 49 63 d6 4c 8b 7d c0 4c 2b 7d d0 48 8d
[ 1000.643909] R13: ffffffffffffffff R14: ffff8976afec1700 R15: ffff89805090a008
[ 1000.643911] FS: 00007facbf0ee700(0000) GS:ffff89807e000000(0000) knlGS:0000000000000000
[ 1000.656972] RSP: 0018:ffffc90000687e38 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 1000.669728] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1000.669730] CR2: 00007f5613435000 CR3: 000000f879f9a000 CR4: 0000000000340ee0
[ 1000.682360] RAX: ffff89807eb6cc80 RBX: ffff898045252400 RCX: 000000000000001f
[ 1000.682363] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[ 1000.695092] Call Trace:
[ 1000.707208] RBP: ffffc90000687e78 R08: 0000000000000002 R09: ffffffd92416bb8a
[ 1000.707211] R10: 0000000000000018 R11: 071c71c71c71c71c R12: 0000000000000075
[ 1000.718943] <IRQ>
[ 1000.730583] R13: ffffffff8254f300 R14: 0000000000000002 R15: ffffffff8254f3e8
[ 1000.730587] ? cpuidle_enter_state+0xe0/0x3b0
[ 1000.742051] alloc_iova+0x124/0x1b0
[ 1000.764969] cpuidle_enter+0x2e/0x40
[ 1000.776520] alloc_iova_fast+0x4f/0x210
[ 1000.788032] call_cpuidle+0x23/0x40
[ 1000.799708] iommu_dma_alloc_iova.isra.25+0xc6/0xf0
[ 1000.812522] do_idle+0x1e8/0x280
[ 1000.825388] __iommu_dma_map+0x86/0xe0
[ 1000.837143] cpu_startup_entry+0x1d/0x30
[ 1000.848684] iommu_dma_map_page+0x69/0x80
[ 1000.860289] start_secondary+0x169/0x1c0
[ 1000.871975] mlx5e_sq_xmit+0x6f5/0xc80 [mlx5_core]
[ 1000.883320] secondary_startup_64+0xa4/0xb0
[ 1000.890242] perf: interrupt took too long (2603 > 2500), lowering kernel.perf_event_max_sample_rate to 76000
[ 1000.894728] ? netif_skb_features+0x132/0x260
[ 1001.101339] mlx5e_xmit+0xd9/0xe0 [mlx5_core]
[ 1001.112683] dev_hard_start_xmit+0x96/0x210
[ 1001.125078] sch_direct_xmit+0x10c/0x2f0
[ 1001.136855] __qdisc_run+0x14c/0x4f0
[ 1001.148612] __dev_queue_xmit+0x587/0x910
[ 1001.159683] dev_queue_xmit+0x10/0x20
[ 1001.171667] bond_dev_queue_xmit+0x2f/0x80 [bonding]
[ 1001.183921] bond_start_xmit+0x1c2/0x470 [bonding]
[ 1001.194950] dev_hard_start_xmit+0x96/0x210
[ 1001.204535] __dev_queue_xmit+0x71d/0x910
[ 1001.214743] dev_queue_xmit+0x10/0x20
[ 1001.224864] ip_finish_output2+0x287/0x520
[ 1001.233610] __ip_finish_output+0x10d/0x1f0
[ 1001.244727] ip_finish_output+0x2e/0xc0
[ 1001.255407] ip_output+0x76/0xf0
[ 1001.265415] ? __ip_finish_output+0x1f0/0x1f0
[ 1001.276046] ip_local_out+0x3b/0x50
[ 1001.285725] __ip_queue_xmit+0x155/0x3e0
[ 1001.296404] ? __kmalloc_node_track_caller+0x5e/0x2d0
[ 1001.306946] ip_queue_xmit+0x10/0x20
[ 1001.316484] __tcp_transmit_skb+0x5b0/0xab0
[ 1001.326377] __tcp_send_ack.part.56+0xa5/0x100
[ 1001.336340] tcp_send_ack+0x1c/0x20
[ 1001.345935] tcp_delack_timer_handler+0x12a/0x180
[ 1001.355006] tcp_delack_timer+0xe5/0x120
[ 1001.363802] ? tcp_delack_timer_handler+0x180/0x180
[ 1001.372934] call_timer_fn+0x34/0x140
[ 1001.382222] run_timer_softirq+0x20a/0x480
[ 1001.391369] ? lapic_next_event+0x20/0x30
[ 1001.400414] ? clockevents_program_event+0x7e/0x100
[ 1001.408961] __do_softirq+0xd9/0x29e
[ 1001.416818] do_softirq_own_stack+0x2a/0x40
[ 1001.425209] </IRQ>
[ 1001.433319] do_softirq+0x55/0x60
[ 1001.441530] __local_bh_enable_ip+0x57/0x60
[ 1001.450382] ip_finish_output2+0x195/0x520
[ 1001.458248] __ip_finish_output+0x10d/0x1f0
[ 1001.465260] ip_finish_output+0x2e/0xc0
[ 1001.473238] ip_output+0x76/0xf0
[ 1001.480688] ? __ip_finish_output+0x1f0/0x1f0
[ 1001.487887] ip_local_out+0x3b/0x50
[ 1001.496366] __ip_queue_xmit+0x155/0x3e0
[ 1001.504660] ip_queue_xmit+0x10/0x20
[ 1001.511153] __tcp_transmit_skb+0x5b0/0xab0
[ 1001.517789] tcp_connect+0xb74/0xe60
[ 1001.524879] ? ktime_get_with_offset+0x4f/0xc0
[ 1001.532735] tcp_v4_connect+0x44c/0x4d0
[ 1001.539607] __inet_stream_connect+0xcf/0x360
[ 1001.546019] ? release_sock+0x8f/0xa0
[ 1001.553200] ? selinux_netlbl_socket_connect+0x37/0x60
[ 1001.561213] inet_stream_connect+0x3b/0x60
[ 1001.569387] __sys_connect_file+0x61/0x70
[ 1001.577171] __sys_connect+0x8f/0xd0
[ 1001.585193] ? syscall_trace_enter+0x1f8/0x2d0
[ 1001.592850] ? __audit_syscall_exit+0x1e3/0x290
[ 1001.600531] __x64_sys_connect+0x1a/0x20
[ 1001.608172] do_syscall_64+0x60/0x1e0
[ 1001.615796] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1001.623913] RIP: 0033:0x7facc71839fd
[ 1001.631642] Code: ca 20 00 00 75 10 b8 2a 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 0e fa ff ff 48 89 04 24 b8 2a 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 57 fa ff ff 48 89 d0 48 83 c4 08 48 3d 01
[ 1001.648464] RSP: 002b:00007facbf0e7490 EFLAGS: 00000293 ORIG_RAX: 000000000000002a
[ 1001.657033] RAX: ffffffffffffffda RBX: 00007fab040131b8 RCX: 00007facc71839fd
[ 1001.665801] RDX: 0000000000000010 RSI: 00007fab040131c4 RDI: 0000000000000de1
[ 1001.674924] RBP: 00007fab04012f78 R08: 0000000000000004 R09: 000000005eb2300a
[ 1001.683776] R10: 00007facbf0e74c0 R11: 0000000000000293 R12: 00007facbf0e74b8
[ 1001.692805] R13: 0000000000000000 R14: 00007facbf0e7780 R15: 00007fac9426e070

[ 1004.310023] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [[ET_NET 22]:35732]
[ 1004.316943] Modules linked in: bonding ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_filter ip6_tables xt_DSCP xt_multiport iptable_mangle ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_dscp xt_set iptable_filter ip_set_hash_net ip_set nfnetlink vfat fat edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul drm_vram_helper ghash_clmulni_intel drm_ttm_helper ttm drm_kms_helper aesni_intel drm syscopyarea sysfillrect crypto_simd sysimgblt ipmi_si cryptd joydev input_leds pcspkr fb_sys_fops sg ipmi_devintf glue_helper hpilo sp5100_tco hpwdt ccp i2c_piix4 k10temp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ip_tables ext4 mbcache jbd2 sd_mod t10_pi ahci igb libahci crc32c_intel i2c_algo_bit dca libata mlx5_core mlxfw pci_hyperv_intf ptp pps_core dm_mirror dm_region_hash dm_log dm_mod brd
[ 1004.380572] CPU: 105 PID: 35732 Comm: [ET_NET 22] Tainted: G L 5.6.0-0.rc4.el7.elrepo.x86_64 #1
[ 1004.389938] Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 09/17/2019
[ 1004.399242] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
[ 1004.409464] Code: 41 9c 75 ff 31 c0 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 31 c0 ba 01 00 00
[ 1004.429140] RSP: 0018:ffffc9000320f3e8 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff13
[ 1004.439358] RAX: ffff89805090a008 RBX: ffff89730610da40 RCX: ffff898033150bc0
[ 1004.449572] RDX: ffff89730610da40 RSI: 0000000000000257 RDI: 0000000000000257
[ 1004.459909] RBP: ffffc9000320f3e8 R08: 0000000000000000 R09: ffff8980648088c0
[ 1004.470258] R10: ffff89805b4b2110 R11: 0000000000001049 R12: 0000000000000001
[ 1004.480022] R13: ffffffffffffffff R14: ffff898058dc6280 R15: ffff89805090a008
[ 1004.490104] FS: 00007facbdbc4700(0000) GS:ffff89807e840000(0000) knlGS:0000000000000000
[ 1004.500343] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1004.511028] CR2: 00007f34ab2f0000 CR3: 000000f879f9a000 CR4: 0000000000340ee0
[ 1004.521504] Call Trace:
[ 1004.532008] alloc_iova+0x124/0x1b0
[ 1004.542293] alloc_iova_fast+0x4f/0x210
[ 1004.553365] iommu_dma_alloc_iova.isra.25+0xc6/0xf0
[ 1004.564609] __iommu_dma_map+0x86/0xe0
[ 1004.575558] iommu_dma_map_page+0x69/0x80
[ 1004.590064] mlx5e_sq_xmit+0x6f5/0xc80 [mlx5_core]
[ 1004.601294] ? netif_skb_features+0x132/0x260
[ 1004.612108] mlx5e_xmit+0xd9/0xe0 [mlx5_core]
[ 1004.622920] dev_hard_start_xmit+0x96/0x210
[ 1004.634757] sch_direct_xmit+0x10c/0x2f0
[ 1004.646561] __qdisc_run+0x14c/0x4f0
[ 1004.658055] __dev_queue_xmit+0x587/0x910
[ 1004.669445] dev_queue_xmit+0x10/0x20
[ 1004.680688] bond_dev_queue_xmit+0x2f/0x80 [bonding]
[ 1004.691826] bond_start_xmit+0x1c2/0x470 [bonding]
[ 1004.703505] dev_hard_start_xmit+0x96/0x210
[ 1004.714534] __dev_queue_xmit+0x71d/0x910
[ 1004.725426] ? selinux_ip_postroute+0x1d0/0x430
[ 1004.735941] dev_queue_xmit+0x10/0x20
[ 1004.746025] ip_finish_output2+0x287/0x520
[ 1004.756014] __ip_finish_output+0x10d/0x1f0
[ 1004.766425] ip_finish_output+0x2e/0xc0
[ 1004.776746] ip_output+0x76/0xf0
[ 1004.787281] ? __ip_finish_output+0x1f0/0x1f0
[ 1004.799811] ip_local_out+0x3b/0x50
[ 1004.809930] __ip_queue_xmit+0x155/0x3e0
[ 1004.819806] ? __kmalloc_node_track_caller+0x5e/0x2d0
[ 1004.829249] ip_queue_xmit+0x10/0x20
[ 1004.838582] __tcp_transmit_skb+0x5b0/0xab0
[ 1004.847560] __tcp_send_ack.part.56+0xa5/0x100
[ 1004.856637] tcp_send_ack+0x1c/0x20
[ 1004.865895] tcp_send_challenge_ack.isra.73+0xd7/0xe0
[ 1004.874689] tcp_validate_incoming+0x2d1/0x3b0
[ 1004.882808] tcp_rcv_established+0x23d/0x690
[ 1004.890888] ? tcp_sendmsg_locked+0x94b/0xdf0
[ 1004.898645] tcp_v4_do_rcv+0x103/0x1f0
[ 1004.905859] __release_sock+0x8d/0xe0
[ 1004.912930] release_sock+0x30/0xa0
[ 1004.919739] tcp_sendmsg+0x37/0x50
[ 1004.926345] inet_sendmsg+0x42/0x80
[ 1004.932910] sock_sendmsg+0x5f/0x80
[ 1004.939335] sock_write_iter+0x8c/0xf0
[ 1004.946041] do_iter_readv_writev+0x1b4/0x1e0
[ 1004.952390] do_iter_write+0x83/0x1a0
[ 1004.958686] vfs_writev+0x81/0x100
[ 1004.964829] ? __audit_syscall_entry+0xdd/0x130
[ 1004.970897] ? __fget_light+0x31/0x80
[ 1004.976714] do_writev+0xf4/0x110
[ 1004.982731] __x64_sys_writev+0x1c/0x20
[ 1004.988712] do_syscall_64+0x60/0x1e0
[ 1004.994535] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1005.000560] RIP: 0033:0x7facc64793e0
[ 1005.006355] Code: 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 e9 71 01 00 48 63 54 24 1c 41 89 c0 48 8b 74 24 10 48 63 7c 24 08 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 17 44 89 c7 48 89 44 24 08 e8 1b 72 01 00 48
[ 1005.018978] RSP: 002b:00007facbdbbfbb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
[ 1005.025772] RAX: ffffffffffffffda RBX: 000000000000112d RCX: 00007facc64793e0
[ 1005.032711] RDX: 0000000000000001 RSI: 00007facbdbbfc80 RDI: 000000000000112d
[ 1005.039483] RBP: 0000000000000299 R08: 0000000000000000 R09: 0000000000000000
[ 1005.047425] R10: 0000000000000013 R11: 0000000000000293 R12: 0000000000000299
[ 1005.054936] R13: 00007fac3c9c0110 R14: 00007fac3c9c00d0 R15: 0000000000000000

[ 1012.306329] watchdog: BUG: soft lockup - CPU#107 stuck for 22s! [[ET_NET 29]:35739]
[ 1012.312820] Modules linked in: bonding ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_filter ip6_tables xt_DSCP xt_multiport iptable_mangle ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_dscp xt_set iptable_filter ip_set_hash_net ip_set nfnetlink vfat fat edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul drm_vram_helper ghash_clmulni_intel drm_ttm_helper ttm drm_kms_helper aesni_intel drm syscopyarea sysfillrect crypto_simd sysimgblt ipmi_si cryptd joydev input_leds pcspkr fb_sys_fops sg ipmi_devintf glue_helper hpilo sp5100_tco hpwdt ccp i2c_piix4 k10temp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ip_tables ext4 mbcache jbd2 sd_mod t10_pi ahci igb libahci crc32c_intel i2c_algo_bit dca libata mlx5_core mlxfw pci_hyperv_intf ptp pps_core dm_mirror dm_region_hash dm_log dm_mod brd
[ 1012.367986] CPU: 107 PID: 35739 Comm: [ET_NET 29] Tainted: G L 5.6.0-0.rc4.el7.elrepo.x86_64 #1
[ 1012.377100] Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 09/17/2019
[ 1012.386355] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
[ 1012.395368] Code: 41 9c 75 ff 31 c0 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 31 c0 ba 01 00 00
[ 1012.414614] RSP: 0018:ffffc90001938c50 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff13
[ 1012.424371] RAX: ffff89805090a008 RBX: ffff8978dd463500 RCX: ffff8978dd463500
[ 1012.434164] RDX: ffff897341412140 RSI: 0000000000000257 RDI: 0000000000000257
[ 1012.444376] RBP: ffffc90001938c50 R08: ffff897341412140 R09: ffff8980648088c0
[ 1012.454705] R10: ffff89805b4b2110 R11: 000000000000104d R12: 0000000000000001
[ 1012.464847] R13: ffffffffffffffff R14: ffff897341412140 R15: ffff89805090a008
[ 1012.475011] FS: 00007facbd4b6700(0000) GS:ffff89807e8c0000(0000) knlGS:0000000000000000
[ 1012.484580] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1012.494875] CR2: 00007f3494df4000 CR3: 000000f879f9a000 CR4: 0000000000340ee0
[ 1012.505097] Call Trace:
[ 1012.515550] <IRQ>
[ 1012.525827] alloc_iova+0x124/0x1b0
[ 1012.536462] alloc_iova_fast+0x4f/0x210
[ 1012.547048] iommu_dma_alloc_iova.isra.25+0xc6/0xf0
[ 1012.557717] __iommu_dma_map+0x86/0xe0
[ 1012.568690] iommu_dma_map_page+0x69/0x80
[ 1012.579466] mlx5e_sq_xmit+0x6f5/0xc80 [mlx5_core]
[ 1012.589968] ? netif_skb_features+0x132/0x260
[ 1012.600485] mlx5e_xmit+0xd9/0xe0 [mlx5_core]
[ 1012.610855] dev_hard_start_xmit+0x96/0x210
[ 1012.620994] sch_direct_xmit+0x10c/0x2f0
[ 1012.631534] __qdisc_run+0x14c/0x4f0
[ 1012.642060] ? run_timer_softirq+0x290/0x480
[ 1012.652804] net_tx_action+0x147/0x240
[ 1012.663727] __do_softirq+0xd9/0x29e
[ 1012.674279] do_softirq_own_stack+0x2a/0x40
[ 1012.684882] </IRQ>
[ 1012.695262] do_softirq+0x55/0x60
[ 1012.705301] __local_bh_enable_ip+0x57/0x60
[ 1012.715159] ip_finish_output2+0x195/0x520
[ 1012.725546] __ip_finish_output+0x10d/0x1f0
[ 1012.735214] ip_finish_output+0x2e/0xc0
[ 1012.744780] ip_output+0x76/0xf0
[ 1012.754586] ? __ip_finish_output+0x1f0/0x1f0
[ 1012.765574] ip_local_out+0x3b/0x50
[ 1012.775590] __ip_queue_xmit+0x155/0x3e0
[ 1012.785023] ip_queue_xmit+0x10/0x20
[ 1012.794460] __tcp_transmit_skb+0x5b0/0xab0
[ 1012.803132] tcp_connect+0xb74/0xe60
[ 1012.811624] ? ktime_get_with_offset+0x4f/0xc0
[ 1012.819910] tcp_v4_connect+0x44c/0x4d0
[ 1012.827841] __inet_stream_connect+0xcf/0x360
[ 1012.835420] ? release_sock+0x8f/0xa0
[ 1012.842711] ? selinux_netlbl_socket_connect+0x37/0x60
[ 1012.849904] inet_stream_connect+0x3b/0x60
[ 1012.857022] __sys_connect_file+0x61/0x70
[ 1012.863900] __sys_connect+0x8f/0xd0
[ 1012.870552] ? syscall_trace_enter+0x1f8/0x2d0
[ 1012.877034] ? __audit_syscall_exit+0x1e3/0x290
[ 1012.883018] __x64_sys_connect+0x1a/0x20
[ 1012.889441] do_syscall_64+0x60/0x1e0
[ 1012.895910] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1012.902006] RIP: 0033:0x7facc71839fd
[ 1012.908295] Code: ca 20 00 00 75 10 b8 2a 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 0e fa ff ff 48 89 04 24 b8 2a 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 57 fa ff ff 48 89 d0 48 83 c4 08 48 3d 01
[ 1012.921383] RSP: 002b:00007facbd4b5730 EFLAGS: 00000293 ORIG_RAX: 000000000000002a
[ 1012.928015] RAX: ffffffffffffffda RBX: 00007facac0532b8 RCX: 00007facc71839fd
[ 1012.935143] RDX: 0000000000000010 RSI: 00007facac0532c4 RDI: 0000000000000763
[ 1012.942463] RBP: 00007facac053078 R08: 0000000000000004 R09: 000000001f6c2b0a
[ 1012.949773] R10: 00007facbd4b5760 R11: 0000000000000293 R12: 00007facbd4b5758
[ 1012.956752] R13: 0000000000000000 R14: 00007facbd4b5a20 R15: 00007fac849f8d40

[ 1016.271741] watchdog: BUG: soft lockup - CPU#82 stuck for 22s! [[ET_NET 27]:35737]
[ 1016.278493] Modules linked in: bonding ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_filter ip6_tables xt_DSCP xt_multiport iptable_mangle ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_dscp xt_set iptable_filter ip_set_hash_net ip_set nfnetlink vfat fat edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul drm_vram_helper ghash_clmulni_intel drm_ttm_helper ttm drm_kms_helper aesni_intel drm syscopyarea sysfillrect crypto_simd sysimgblt ipmi_si cryptd joydev input_leds pcspkr fb_sys_fops sg ipmi_devintf glue_helper hpilo sp5100_tco hpwdt ccp i2c_piix4 k10temp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ip_tables ext4 mbcache jbd2 sd_mod t10_pi ahci igb libahci crc32c_intel i2c_algo_bit dca libata mlx5_core mlxfw pci_hyperv_intf ptp pps_core dm_mirror dm_region_hash dm_log dm_mod brd
[ 1016.335753] CPU: 82 PID: 35737 Comm: [ET_NET 27] Tainted: G L 5.6.0-0.rc4.el7.elrepo.x86_64 #1
[ 1016.345081] Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 09/17/2019
[ 1016.354137] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
[ 1016.363450] Code: 41 9c 75 ff 31 c0 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 31 c0 ba 01 00 00
[ 1016.383172] RSP: 0018:ffffc900014ec728 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff13
[ 1016.393207] RAX: ffff89805090a008 RBX: ffff8972dbce8d00 RCX: ffff8972dbce8d00
[ 1016.404690] RDX: ffff897897ffb100 RSI: 0000000000000257 RDI: 0000000000000257
[ 1016.416788] RBP: ffffc900014ec728 R08: ffff897897ffb100 R09: ffff8980648088c0
[ 1016.429080] R10: ffff89805b4b2110 R11: 000000000000104d R12: 0000000000000001
[ 1016.441666] R13: ffffffffffffffff R14: ffff897897ffb100 R15: ffff89805090a008
[ 1016.454597] FS: 00007facbd6ba700(0000) GS:ffff89807e280000(0000) knlGS:0000000000000000
[ 1016.468217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1016.480161] CR2: 00007f34a1b23000 CR3: 000000f879f9a000 CR4: 0000000000340ee0
[ 1016.491497] Call Trace:
[ 1016.503379] <IRQ>
[ 1016.514761] alloc_iova+0x124/0x1b0
[ 1016.526335] alloc_iova_fast+0x4f/0x210
[ 1016.539882] iommu_dma_alloc_iova.isra.25+0xc6/0xf0
[ 1016.553904] __iommu_dma_map+0x86/0xe0
[ 1016.566590] iommu_dma_map_page+0x69/0x80
[ 1016.579606] mlx5e_sq_xmit+0x6f5/0xc80 [mlx5_core]
[ 1016.593952] ? netif_skb_features+0x132/0x260
[ 1016.607197] mlx5e_xmit+0xd9/0xe0 [mlx5_core]
[ 1016.619726] dev_hard_start_xmit+0x96/0x210
[ 1016.633732] sch_direct_xmit+0x10c/0x2f0
[ 1016.648721] __qdisc_run+0x14c/0x4f0
[ 1016.665088] __dev_queue_xmit+0x587/0x910
[ 1016.676721] dev_queue_xmit+0x10/0x20
[ 1016.687766] bond_dev_queue_xmit+0x2f/0x80 [bonding]
[ 1016.699954] bond_start_xmit+0x1c2/0x470 [bonding]
[ 1016.712988] dev_hard_start_xmit+0x96/0x210
[ 1016.726463] __dev_queue_xmit+0x71d/0x910
[ 1016.739348] dev_queue_xmit+0x10/0x20
[ 1016.749813] ip_finish_output2+0x287/0x520
[ 1016.761954] __ip_finish_output+0x10d/0x1f0
[ 1016.774205] ip_finish_output+0x2e/0xc0
[ 1016.785795] ip_output+0x76/0xf0
[ 1016.798820] ? __ip_finish_output+0x1f0/0x1f0
[ 1016.811848] ip_local_out+0x3b/0x50
[ 1016.824517] __ip_queue_xmit+0x155/0x3e0
[ 1016.836698] ? __kmalloc_node_track_caller+0x5e/0x2d0
[ 1016.848989] ip_queue_xmit+0x10/0x20
[ 1016.860623] __tcp_transmit_skb+0x5b0/0xab0
[ 1016.872300] __tcp_send_ack.part.56+0xa5/0x100
[ 1016.882555] tcp_send_ack+0x1c/0x20
[ 1016.893500] tcp_delack_timer_handler+0x12a/0x180
[ 1016.904054] tcp_delack_timer+0xe5/0x120
[ 1016.913832] ? tcp_delack_timer_handler+0x180/0x180
[ 1016.923383] call_timer_fn+0x34/0x140
[ 1016.933086] run_timer_softirq+0x20a/0x480
[ 1016.942443] ? lapic_next_event+0x20/0x30
[ 1016.951671] ? clockevents_program_event+0x7e/0x100
[ 1016.960364] __do_softirq+0xd9/0x29e
[ 1016.969149] do_softirq_own_stack+0x2a/0x40
[ 1016.977439] </IRQ>
[ 1016.985973] do_softirq+0x55/0x60
[ 1016.994284] __local_bh_enable_ip+0x57/0x60
[ 1017.002611] ip_finish_output2+0x195/0x520
[ 1017.010779] __ip_finish_output+0x10d/0x1f0
[ 1017.018533] ip_finish_output+0x2e/0xc0
[ 1017.026011] ip_output+0x76/0xf0
[ 1017.032538] ? __ip_finish_output+0x1f0/0x1f0
[ 1017.040154] ip_local_out+0x3b/0x50
[ 1017.047081] __ip_queue_xmit+0x155/0x3e0
[ 1017.053752] ip_queue_xmit+0x10/0x20
[ 1017.061073] __tcp_transmit_skb+0x5b0/0xab0
[ 1017.069195] tcp_write_xmit+0x257/0x1020
[ 1017.075350] ? __alloc_skb+0xa1/0x280
[ 1017.081750] __tcp_push_pending_frames+0x33/0xf0
[ 1017.088351] tcp_push+0xdc/0x110
[ 1017.095570] tcp_sendmsg_locked+0x942/0xdf0
[ 1017.102597] tcp_sendmsg+0x2c/0x50
[ 1017.108931] inet_sendmsg+0x42/0x80
[ 1017.116413] sock_sendmsg+0x5f/0x80
[ 1017.123570] sock_write_iter+0x8c/0xf0
[ 1017.131099] do_iter_readv_writev+0x1b4/0x1e0
[ 1017.138550] do_iter_write+0x83/0x1a0
[ 1017.145905] vfs_writev+0x81/0x100
[ 1017.153432] ? __audit_syscall_entry+0xdd/0x130
[ 1017.160352] ? __fget_light+0x31/0x80
[ 1017.167798] do_writev+0xf4/0x110
[ 1017.175101] __x64_sys_writev+0x1c/0x20
[ 1017.182903] do_syscall_64+0x60/0x1e0
[ 1017.190297] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1017.197633] RIP: 0033:0x7facc64793e0
[ 1017.204991] Code: 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 e9 71 01 00 48 63 54 24 1c 41 89 c0 48 8b 74 24 10 48 63 7c 24 08 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 17 44 89 c7 48 89 44 24 08 e8 1b 72 01 00 48
[ 1017.221422] RSP: 002b:00007facbd6b5bb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
[ 1017.229665] RAX: ffffffffffffffda RBX: 000000000000263b RCX: 00007facc64793e0
[ 1017.237919] RDX: 0000000000000002 RSI: 00007facbd6b5c80 RDI: 000000000000263b
[ 1017.246048] RBP: 0000000000000230 R08: 0000000000000000 R09: 0000000000000000
[ 1017.254922] R10: 0000000000000000 R11: 0000000000000293 R12: 00000000000000dc
[ 1017.263164] R13: 00007faca89fd868 R14: 00007faca89fd850 R15: 0000000000000000

[ 1056.286905] watchdog: BUG: soft lockup - CPU#85 stuck for 22s! [[ET_NET 17]:35727]
[ 1056.293398] Modules linked in: bonding ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_filter ip6_tables xt_DSCP xt_multiport iptable_mangle ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_dscp xt_set iptable_filter ip_set_hash_net ip_set nfnetlink vfat fat edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul drm_vram_helper ghash_clmulni_intel drm_ttm_helper ttm drm_kms_helper aesni_intel drm syscopyarea sysfillrect crypto_simd sysimgblt ipmi_si cryptd joydev input_leds pcspkr fb_sys_fops sg ipmi_devintf glue_helper hpilo sp5100_tco hpwdt ccp i2c_piix4 k10temp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ip_tables ext4 mbcache jbd2 sd_mod t10_pi ahci igb libahci crc32c_intel i2c_algo_bit dca libata mlx5_core mlxfw pci_hyperv_intf ptp pps_core dm_mirror dm_region_hash dm_log dm_mod brd
[ 1056.348933] CPU: 85 PID: 35727 Comm: [ET_NET 17] Tainted: G L 5.6.0-0.rc4.el7.elrepo.x86_64 #1
[ 1056.357537] Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 09/17/2019
[ 1056.366831] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
[ 1056.375787] Code: 41 9c 75 ff 31 c0 5d c3 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 c6 07 00 0f 1f 40 00 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 31 c0 ba 01 00 00
[ 1056.394845] RSP: 0018:ffffc90003157568 EFLAGS: 00000257 ORIG_RAX: ffffffffffffff13
[ 1056.404494] RAX: ffff89805090a008 RBX: ffff898010b94140 RCX: ffff898010b94140
[ 1056.414419] RDX: ffff8978e6180f00 RSI: 0000000000000257 RDI: 0000000000000257
[ 1056.424357] RBP: ffffc90003157568 R08: ffff8978e6180f00 R09: ffff8980648088c0
[ 1056.434363] R10: ffff89805b4b2110 R11: 0000000000001041 R12: 0000000000000001
[ 1056.444662] R13: ffffffffffffffff R14: ffff8978e6180f00 R15: ffff89805090a008
[ 1056.455209] FS: 00007facbe0ce700(0000) GS:ffff89807e340000(0000) knlGS:0000000000000000
[ 1056.465894] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1056.476213] CR2: 00007f649817fd48 CR3: 000000f879f9a000 CR4: 0000000000340ee0
[ 1056.486674] Call Trace:
[ 1056.497627] alloc_iova+0x124/0x1b0
[ 1056.508214] alloc_iova_fast+0x4f/0x210
[ 1056.518499] iommu_dma_alloc_iova.isra.25+0xc6/0xf0
[ 1056.529769] __iommu_dma_map+0x86/0xe0
[ 1056.540612] iommu_dma_map_page+0x69/0x80
[ 1056.551636] mlx5e_sq_xmit+0x6f5/0xc80 [mlx5_core]
[ 1056.562383] ? netif_skb_features+0x132/0x260
[ 1056.573372] mlx5e_xmit+0xd9/0xe0 [mlx5_core]
[ 1056.584192] dev_hard_start_xmit+0x96/0x210
[ 1056.594767] sch_direct_xmit+0x10c/0x2f0
[ 1056.605338] __qdisc_run+0x14c/0x4f0
[ 1056.615348] __dev_queue_xmit+0x587/0x910
[ 1056.625804] dev_queue_xmit+0x10/0x20
[ 1056.636444] bond_dev_queue_xmit+0x2f/0x80 [bonding]
[ 1056.647380] bond_start_xmit+0x1c2/0x470 [bonding]
[ 1056.658290] dev_hard_start_xmit+0x96/0x210
[ 1056.669079] __dev_queue_xmit+0x71d/0x910
[ 1056.679733] dev_queue_xmit+0x10/0x20
[ 1056.689957] ip_finish_output2+0x287/0x520
[ 1056.699963] __ip_finish_output+0x10d/0x1f0
[ 1056.709946] ip_finish_output+0x2e/0xc0
[ 1056.719677] ip_output+0x76/0xf0
[ 1056.729469] ? __ip_finish_output+0x1f0/0x1f0
[ 1056.739305] ip_local_out+0x3b/0x50
[ 1056.748645] __ip_queue_xmit+0x155/0x3e0
[ 1056.758634] ? __kmalloc_node_track_caller+0x5e/0x2d0
[ 1056.768387] ip_queue_xmit+0x10/0x20
[ 1056.777937] __tcp_transmit_skb+0x5b0/0xab0
[ 1056.786982] __tcp_send_ack.part.56+0xa5/0x100
[ 1056.795841] tcp_send_ack+0x1c/0x20
[ 1056.804104] tcp_send_challenge_ack.isra.73+0xd7/0xe0
[ 1056.812257] tcp_validate_incoming+0x2d1/0x3b0
[ 1056.820010] tcp_rcv_state_process+0x2cc/0xe28
[ 1056.827548] ? tcp_write_xmit+0x2e1/0x1020
[ 1056.834811] ? __kmalloc_reserve.isra.52+0x31/0x90
[ 1056.842045] tcp_v4_do_rcv+0x77/0x1f0
[ 1056.849053] __release_sock+0x8d/0xe0
[ 1056.855614] tcp_close+0xd3/0x490
[ 1056.862393] inet_release+0x39/0x70
[ 1056.868667] __sock_release+0x42/0xc0
[ 1056.875086] sock_close+0x15/0x20
[ 1056.881000] __fput+0xc6/0x260
[ 1056.887256] ____fput+0xe/0x10
[ 1056.893334] task_work_run+0x8c/0xb0
[ 1056.899490] exit_to_usermode_loop+0x74/0xf6
[ 1056.905264] do_syscall_64+0x1ad/0x1e0
[ 1056.911143] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1056.916879] RIP: 0033:0x7facc718377d
[ 1056.922631] Code: cc 20 00 00 75 10 b8 03 00 00 00 0f 05 48 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 8e fc ff ff 48 89 04 24 b8 03 00 00 00 0f 05 <48> 8b 3c 24 48 89 c2 e8 d7 fc ff ff 48 89 d0 48 83 c4 08 48 3d 01
[ 1056.935379] RSP: 002b:00007facbe0cd7a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000003
[ 1056.941940] RAX: 0000000000000000 RBX: 000000000000038e RCX: 00007facc718377d
[ 1056.948587] RDX: 00007facbe0cea30 RSI: 000000000000038e RDI: 000000000000038e
[ 1056.955105] RBP: 00007facc4055010 R08: 0000000000000000 R09: 0000000000000000
[ 1056.961692] R10: 00007facbe0cd7e4 R11: 0000000000000293 R12: 00007facc4055010
[ 1056.968014] R13: 00007facc4055010 R14: 00007fac8ca459f8 R15: 00007fab1406da60

[ 1060.321549] watchdog: BUG: soft lockup - CPU#117 stuck for 22s! [[ET_NET 14]:35724]
[ 1060.330752] Modules linked in: bonding ip6table_mangle ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 ip6table_filter ip6_tables xt_DSCP xt_multiport iptable_mangle ipt_REJECT nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_dscp xt_set iptable_filter ip_set_hash_net ip_set nfnetlink vfat fat edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul drm_vram_helper ghash_clmulni_intel drm_ttm_helper ttm drm_kms_helper aesni_intel drm syscopyarea sysfillrect crypto_simd sysimgblt ipmi_si cryptd joydev input_leds pcspkr fb_sys_fops sg ipmi_devintf glue_helper hpilo sp5100_tco hpwdt ccp i2c_piix4 k10temp wmi ipmi_msghandler acpi_power_meter acpi_cpufreq ip_tables ext4 mbcache jbd2 sd_mod t10_pi ahci igb libahci crc32c_intel i2c_algo_bit dca libata mlx5_core mlxfw pci_hyperv_intf ptp pps_core dm_mirror dm_region_hash dm_log dm_mod brd
[ 1060.387118] CPU: 117 PID: 35724 Comm: [ET_NET 14] Tainted: G L 5.6.0-0.rc4.el7.elrepo.x86_64 #1
[ 1060.396807] Hardware name: HPE ProLiant DL325 Gen10/ProLiant DL325 Gen10, BIOS A41 09/17/2019
[ 1060.406729] RIP: 0010:native_queued_spin_lock_slowpath+0x64/0x1f0
[ 1060.416435] Code: ff ff 75 3f f0 0f ba 2f 08 0f 82 29 01 00 00 31 d2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1c 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 f6 c4 01 75 04 c6
[ 1060.435939] RSP: 0018:ffffc90001af0e58 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 1060.446036] RAX: 0000000000000101 RBX: ffff8980223c1040 RCX: ffff89807eb5e8a0
[ 1060.456146] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8980223c10c0
[ 1060.466701] RBP: ffffc90001af0e58 R08: ffffc90001af0f10 R09: 0000000000000000
[ 1060.476612] R10: 0000000000000201 R11: 0000000000000000 R12: ffff898021d00480
[ 1060.486236] R13: 000000000000000d R14: ffff8980223c10c0 R15: 0000000000000075
[ 1060.496761] FS: 00007facbe3d4700(0000) GS:ffff89807eb40000(0000) knlGS:0000000000000000
[ 1060.506841] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1060.516858] CR2: 00007f64c7e10000 CR3: 000000f879f9a000 CR4: 0000000000340ee0
[ 1060.527271] Call Trace:
[ 1060.537684] <IRQ>
[ 1060.548532] queued_spin_lock_slowpath+0xb/0x13
[ 1060.559922] _raw_spin_lock+0x23/0x30
[ 1060.570337] dev_watchdog+0x69/0x280
[ 1060.580771] ? pfifo_fast_enqueue+0x130/0x130
[ 1060.591296] call_timer_fn+0x34/0x140
[ 1060.601429] run_timer_softirq+0x20a/0x480
[ 1060.611456] ? lapic_next_event+0x20/0x30
[ 1060.621408] ? clockevents_program_event+0x7e/0x100
[ 1060.631603] __do_softirq+0xd9/0x29e
[ 1060.641980] do_softirq_own_stack+0x2a/0x40
[ 1060.652607] </IRQ>
[ 1060.663297] do_softirq+0x55/0x60
[ 1060.673969] __local_bh_enable_ip+0x57/0x60
[ 1060.684532] ip_finish_output2+0x195/0x520
[ 1060.695076] __ip_finish_output+0x10d/0x1f0
[ 1060.705287] ip_finish_output+0x2e/0xc0
[ 1060.715462] ip_output+0x76/0xf0
[ 1060.725300] ? __ip_finish_output+0x1f0/0x1f0
[ 1060.734947] ip_local_out+0x3b/0x50
[ 1060.744357] __ip_queue_xmit+0x155/0x3e0
[ 1060.753749] ? __kmalloc_node_track_caller+0x5e/0x2d0
[ 1060.763757] ? __wake_up_common+0x8f/0x160
[ 1060.773950] ip_queue_xmit+0x10/0x20
[ 1060.783680] __tcp_transmit_skb+0x5b0/0xab0
[ 1060.793424] __tcp_send_ack.part.56+0xa5/0x100
[ 1060.802844] tcp_send_ack+0x1c/0x20
[ 1060.811899] __tcp_ack_snd_check+0x42/0x1d0
[ 1060.820588] tcp_rcv_state_process+0xa56/0xe28
[ 1060.829095] ? __schedule+0x2d2/0x6e0
[ 1060.837287] ? tcp_sendmsg_locked+0x94b/0xdf0
[ 1060.845108] tcp_v4_do_rcv+0x77/0x1f0
[ 1060.852628] __release_sock+0x8d/0xe0
[ 1060.859931] release_sock+0x30/0xa0
[ 1060.866876] tcp_sendmsg+0x37/0x50
[ 1060.873765] inet_sendmsg+0x42/0x80
[ 1060.879894] sock_sendmsg+0x5f/0x80
[ 1060.886284] sock_write_iter+0x8c/0xf0
[ 1060.892725] do_iter_readv_writev+0x1b4/0x1e0
[ 1060.898845] do_iter_write+0x83/0x1a0
[ 1060.905096] vfs_writev+0x81/0x100
[ 1060.911025] ? __audit_syscall_entry+0xdd/0x130
[ 1060.916973] ? __fget_light+0x31/0x80
[ 1060.922986] do_writev+0xf4/0x110
[ 1060.928737] __x64_sys_writev+0x1c/0x20
[ 1060.934363] do_syscall_64+0x60/0x1e0
[ 1060.940014] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1060.945615] RIP: 0033:0x7facc64793e0
[ 1060.951187] Code: 54 24 1c 48 89 74 24 10 89 7c 24 08 e8 e9 71 01 00 48 63 54 24 1c 41 89 c0 48 8b 74 24 10 48 63 7c 24 08 b8 14 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 17 44 89 c7 48 89 44 24 08 e8 1b 72 01 00 48
[ 1060.963427] RSP: 002b:00007facbe3cfbb0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
[ 1060.969817] RAX: ffffffffffffffda RBX: 000000000000287c RCX: 00007facc64793e0
[ 1060.976214] RDX: 0000000000000002 RSI: 00007facbe3cfc80 RDI: 000000000000287c
[ 1060.982424] RBP: 0000000000000230 R08: 0000000000000000 R09: 0000000000000000
[ 1060.988898] R10: 00007facbe3d3db0 R11: 0000000000000293 R12: 00000000000000dc
[ 1060.996178] R13: 00007fac74dc8cc8 R14: 00007fac74dc8cb0 R15: 0000000000000000
toracat

2020-03-02 22:58

manager   ~0036427

Hmm, 5.6.0-0.rc4 is the latest kernel (released yesterday from kernel.org). Not sure where to look at this moment. So, if you boot the 7.6 kernel, the problem is gone?
guzzijason

2020-03-02 23:18

reporter   ~0036429

We've never seen these symptoms on any of our 7.6 servers, no. It only popped up after we started upgrading to 7.7.
So far, the only way to eliminate the problem in 7.7 is to disable IOMMU entirely via BIOS. Very strange.
guzzijason

2020-03-02 23:21

reporter   ~0036430

Would you suggest booting the previously working 7.6 kernel on a system upgraded to 7.7, to see whether the problem goes away?
TrevorH

2020-03-02 23:22

manager   ~0036431

kernel-3.10.0-957.27.2.el7.x86_64 was the last 7.6 kernel released for CentOS. Have you tested whether the problem starts between that one and the first 1062.el7.x86_64 kernel?
TrevorH

2020-03-02 23:23

manager   ~0036432

yum --enablerepo=C7.6.1810-updates install kernel-3.10.0-957.27.2.el7.x86_64 (maybe needs --noplugins)

The original 1062.el7 kernel should be in the current base repo.
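
For anyone repeating this test, here is a minimal sketch for confirming which kernel series is actually running after the reboot. The helper name kernel_series is hypothetical; the 957/1062 mapping is taken from the kernel versions named in this thread, and the grubby lines in the comments assume the stock CentOS 7 boot tooling is in use.

```shell
# Hypothetical helper: map a kernel release string to the CentOS 7 minor
# release it belongs to, using only the versions mentioned in this thread.
kernel_series() {
    case "$1" in
        3.10.0-957.*)  echo "7.6" ;;
        3.10.0-1062.*) echo "7.7" ;;
        *)             echo "other" ;;
    esac
}

# To boot the older kernel once it is installed (assumes grubby is present):
#   grubby --set-default /boot/vmlinuz-3.10.0-957.27.2.el7.x86_64
#   reboot

# After the reboot, report what is actually running.
echo "running $(uname -r) (CentOS $(kernel_series "$(uname -r)") series)"
```

Checking uname -r before starting a test run avoids attributing a result to the wrong kernel if the default boot entry was not changed.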
guzzijason

2020-03-02 23:30

reporter   ~0036433

OK, will experiment further and let you know.
guzzijason

2020-03-03 17:56

reporter   ~0036443

Update: on the 7.7 server I've been testing with, I did roll back to the latest 7.6 kernel (3.10.0-957.27.2.el7.x86_64), and so far my test is running stable. Typically, when it goes bad, the failure happens in <10 minutes, usually much sooner. My current test has been running for 2 hours so far with no issues.
guzzijason

2020-03-03 21:27

reporter   ~0036444

Test ran for 4 hours with no problems at all. It seems something may have been introduced in the 7.7 kernel(s) that is causing problems. I can replicate pretty easily, so if there's any other diagnostic info I can provide, just let me know.
guzzijason

2020-03-03 21:59

reporter   ~0036445

We are also suspicious of these Mellanox options, which seem to have been introduced in the 7.7 kernel configs:
> CONFIG_MLX5_EN_ARFS=y
> CONFIG_MLX5_EN_RXNFC=y
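
A comparison like this can be reproduced with a small sketch along the following lines. The function name mlx5_config_diff is hypothetical, and the /boot/config-* paths in the usage comment assume both kernels are installed in the usual CentOS location; adjust them for a local build.

```shell
# Hypothetical helper: show which CONFIG_MLX5* lines differ between two
# kernel config files. Returns diff's exit status (1 when they differ).
mlx5_config_diff() {
    old_tmp=$(mktemp)
    new_tmp=$(mktemp)
    # Extract and sort the Mellanox options so ordering changes are ignored.
    grep 'CONFIG_MLX5' "$1" | sort > "$old_tmp"
    grep 'CONFIG_MLX5' "$2" | sort > "$new_tmp"
    diff "$old_tmp" "$new_tmp" && status=0 || status=$?
    rm -f "$old_tmp" "$new_tmp"
    return "$status"
}

# Example usage (paths are an assumption, not taken from the report):
#   mlx5_config_diff /boot/config-3.10.0-957.27.2.el7.x86_64 \
#                    /boot/config-3.10.0-1062.12.1.el7.x86_64
```

Lines prefixed with ">" in the output exist only in the second (newer) config.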
TrevorH

2020-03-03 22:07

manager   ~0036446

If you haven't already done so, you need to report this on bugzilla.redhat.com. CentOS only rebuilds what Red Hat release for RHEL. To get things fixed you need to get Red Hat to fix it upstream in RHEL. Please crosslink the bug reports in both systems.
guzzijason

2020-03-03 22:58

reporter   ~0036447

RHEL bugzilla issue link: https://bugzilla.redhat.com/show_bug.cgi?id=1809819
guzzijason

2020-03-04 19:52

reporter   ~0036458

RE: my earlier comment about the kernel configs
I just created a local kernel build based on 3.10.0-1062.12.1.el7. The only difference in the config is:

# CONFIG_MLX5_EN_ARFS is not set
# CONFIG_MLX5_EN_RXNFC is not set

It seems to have had no impact - I can still replicate the problem.
guzzijason

2020-03-11 17:10

reporter   ~0036492

Preliminary testing seems to indicate that there may be a similar problem on CentOS 8 as well.
guzzijason

2020-03-13 16:07

reporter   ~0036501

For CentOS 7.7.1908, adding the boot option 'iommu=pt' appears to work for us. So far, we've not been able to replicate the failure in that mode. We've not yet confirmed the same for CentOS 8.
TrevorH

2020-03-13 16:28

manager   ~0036502

That's interesting, but that same parameter is also required on some EPYC machines in order to even install CentOS 7 (maybe 8 too). If you look up Dell R6515 EPYC-based machines in the RH hardware support database, it lists that as required. I've seen reports that without it the machine will boot but then there is no mouse or keyboard support, so it's impossible to go any further.
guzzijason

2020-03-13 17:24

reporter   ~0036503

I actually tried testing for this earlier, but mistakenly set the option as 'amd_iommu=pt' instead of the correct 'iommu=pt'. Just noticed that discrepancy and was able to correct it and try again.
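
To catch this kind of mix-up, a minimal sketch that checks the running kernel's command line for the exact token 'iommu=pt'. The function name has_iommu_pt is hypothetical, and the grubby command in the comment assumes the standard CentOS 7 way of editing boot arguments.

```shell
# Hypothetical helper: succeed only if the given kernel command line holds
# the exact token iommu=pt. A token such as amd_iommu=pt does not match.
has_iommu_pt() {
    case " $1 " in
        *" iommu=pt "*) return 0 ;;
        *)              return 1 ;;
    esac
}

# To add the option to all installed kernels (assumes grubby is present):
#   grubby --update-kernel=ALL --args="iommu=pt"
# A reboot is needed before the change shows up in /proc/cmdline.

if has_iommu_pt "$(cat /proc/cmdline)"; then
    echo "iommu=pt is active"
else
    echo "iommu=pt is NOT active"
fi
```

Padding the string with spaces makes the match token-exact, so a command line containing only 'amd_iommu=pt' is reported as not active.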

Issue History

Date Modified Username Field Change
2020-03-02 17:06 guzzijason New Issue
2020-03-02 17:06 guzzijason File Added: image (6).png
2020-03-02 18:52 toracat Note Added: 0036421
2020-03-02 18:56 guzzijason Note Added: 0036422
2020-03-02 22:15 guzzijason Note Added: 0036424
2020-03-02 22:22 guzzijason Note Added: 0036425
2020-03-02 22:28 guzzijason Note Added: 0036426
2020-03-02 22:58 toracat Note Added: 0036427
2020-03-02 23:18 guzzijason Note Added: 0036429
2020-03-02 23:21 guzzijason Note Added: 0036430
2020-03-02 23:22 TrevorH Note Added: 0036431
2020-03-02 23:23 TrevorH Note Added: 0036432
2020-03-02 23:30 guzzijason Note Added: 0036433
2020-03-03 17:56 guzzijason Note Added: 0036443
2020-03-03 21:27 guzzijason Note Added: 0036444
2020-03-03 21:59 guzzijason Note Added: 0036445
2020-03-03 22:07 TrevorH Note Added: 0036446
2020-03-03 22:58 guzzijason Note Added: 0036447
2020-03-04 19:52 guzzijason Note Added: 0036458
2020-03-11 17:10 guzzijason Note Added: 0036492
2020-03-13 16:07 guzzijason Note Added: 0036501
2020-03-13 16:28 TrevorH Note Added: 0036502
2020-03-13 17:24 guzzijason Note Added: 0036503