View Issue Details

ID: 0016815
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2019-12-25 23:06
Reporter: ossgeek
Priority: normal
Severity: minor
Reproducibility: random
Status: new
Resolution: open
Product Version: 7.7-1908
Target Version:
Fixed in Version:
Summary: 0016815: r8169 stops working after receiving `NETDEV WATCHDOG: enp2s0 (r8169): transmit queue 0 timed out`
Description: I have a Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller built into the motherboard of a GIGABYTE MZGLKAP-00/MZGLKAP-00, BIOS F1 12/21/2017. The network interface works without issue, but at random intervals (sometimes after days, sometimes after only a few hours, possibly after going `idle`) the kernel emits a warning and the interface then stops transmitting data. A reboot is required to restore network operation.

I'm running the stock kernel with no external modules.
Steps To Reproduce: Have a Realtek network card using the r8169 module.
Let the system run 24/7 until the kernel logs the warning.
Additional Information: I found this bug and kernel patch (which the 3.10.0 kernel does not have) that addresses this exact issue. Can we apply this patch?

https://bugzilla.kernel.org/show_bug.cgi?id=199549

This is my kernel output:

Dec 10 11:03:57 mogweb kernel: WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:356 dev_watchdog+0x248/0x260
Dec 10 11:03:57 mogweb kernel: NETDEV WATCHDOG: enp2s0 (r8169): transmit queue 0 timed out
Dec 10 11:03:57 mogweb kernel: Modules linked in: xt_set xt_multiport ip_set_hash_ip xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun devlink rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter sunrpc ext4 vfat fat mbcache jbd2 intel_powerclamp coretemp intel_rapl kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel snd_soc_rt298 snd_soc_rt286 snd_soc_rl6347a snd_soc_core
Dec 10 11:03:57 mogweb kernel: snd_compress snd_seq snd_seq_device aesni_intel sg lrw gf128mul snd_pcm glue_helper ablk_helper cryptd pcspkr wdat_wdt i2c_i801 snd_timer snd soundcore pcc_cpufreq tpm_crb ip_tables xfs libcrc32c i915 sd_mod crc_t10dif crct10dif_generic i2c_algo_bit iosf_mbi drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ahci drm libahci sdhci_pci nvme cqhci sdhci libata mmc_core crct10dif_pclmul crct10dif_common crc32c_intel nvme_core r8169 drm_panel_orientation_quirks uas i2c_hid video usb_storage dm_mirror dm_region_hash dm_log dm_mod
Dec 10 11:03:57 mogweb kernel: CPU: 0 PID: 0 Comm: swapper/0 Kdump: loaded Not tainted 3.10.0-1062.4.3.el7.x86_64 #1
Dec 10 11:03:57 mogweb kernel: Hardware name: GIGABYTE MZGLKAP-00/MZGLKAP-00, BIOS F1 12/21/2017
Dec 10 11:03:57 mogweb kernel: Call Trace:
Dec 10 11:03:57 mogweb kernel: <IRQ> [<ffffffffa0779ba4>] dump_stack+0x19/0x1b
Dec 10 11:03:57 mogweb kernel: [<ffffffffa009b958>] __warn+0xd8/0x100
Dec 10 11:03:57 mogweb kernel: [<ffffffffa009b9df>] warn_slowpath_fmt+0x5f/0x80
Dec 10 11:03:57 mogweb kernel: [<ffffffffa067bf88>] dev_watchdog+0x248/0x260
Dec 10 11:03:57 mogweb kernel: [<ffffffffa067bd40>] ? dev_deactivate_queue.constprop.27+0x60/0x60
Dec 10 11:03:57 mogweb kernel: [<ffffffffa00ac358>] call_timer_fn+0x38/0x110
Dec 10 11:03:57 mogweb kernel: [<ffffffffa067bd40>] ? dev_deactivate_queue.constprop.27+0x60/0x60
Dec 10 11:03:57 mogweb kernel: [<ffffffffa00ae7bd>] run_timer_softirq+0x24d/0x300
Dec 10 11:03:57 mogweb kernel: [<ffffffffa00a5305>] __do_softirq+0xf5/0x280
Dec 10 11:03:57 mogweb kernel: [<ffffffffa079042c>] call_softirq+0x1c/0x30
Dec 10 11:03:57 mogweb kernel: [<ffffffffa002f715>] do_softirq+0x65/0xa0
Dec 10 11:03:57 mogweb kernel: [<ffffffffa00a5685>] irq_exit+0x105/0x110
Dec 10 11:03:57 mogweb kernel: [<ffffffffa07919d8>] smp_apic_timer_interrupt+0x48/0x60
Dec 10 11:03:57 mogweb kernel: [<ffffffffa078defa>] apic_timer_interrupt+0x16a/0x170
Dec 10 11:03:57 mogweb kernel: <EOI> [<ffffffffa05c10f7>] ? cpuidle_enter_state+0x57/0xd0
Dec 10 11:03:57 mogweb kernel: [<ffffffffa05c10ed>] ? cpuidle_enter_state+0x4d/0xd0
Dec 10 11:03:57 mogweb kernel: [<ffffffffa05c124e>] cpuidle_idle_call+0xde/0x230
Dec 10 11:03:57 mogweb kernel: [<ffffffffa0037c6e>] arch_cpu_idle+0xe/0xc0
Dec 10 11:03:57 mogweb kernel: [<ffffffffa0100d3a>] cpu_startup_entry+0x14a/0x1e0
Dec 10 11:03:57 mogweb kernel: [<ffffffffa0768b57>] rest_init+0x77/0x80
Dec 10 11:03:57 mogweb kernel: [<ffffffffa0d881cb>] start_kernel+0x450/0x471
Dec 10 11:03:57 mogweb kernel: [<ffffffffa0d87b7b>] ? repair_env_string+0x5c/0x5c
Dec 10 11:03:57 mogweb kernel: [<ffffffffa0d87120>] ? early_idt_handler_array+0x120/0x120
Dec 10 11:03:57 mogweb kernel: [<ffffffffa0d8772f>] x86_64_start_reservations+0x24/0x26
Dec 10 11:03:57 mogweb kernel: [<ffffffffa0d87885>] x86_64_start_kernel+0x154/0x177
Dec 10 11:03:57 mogweb kernel: [<ffffffffa00000d5>] start_cpu+0x5/0x14
Dec 10 11:03:57 mogweb kernel: ---[ end trace 48f07bcd9213d5ea ]---
Tags: "3.10.0-1062.9.1.el7.x85_64", "Network", realtek
abrt_hash:
URL:

Activities

NeK

2019-12-25 03:14

reporter   ~0035886

The aforementioned kernel issue does not seem to be the same issue. It may appear relevant, but I have verified that it is not the same.

I have the same issue with the same network chip: after running for a while (hours to days), the 'NETDEV WATCHDOG: enp2s0 (r8169): transmit queue 0 timed out' message is logged, and after some time* the network card just stops working.

With kernel 3.10.0-957.27.2 this issue is not reproducible, but with 3.10.0-1062 up to the current 3.10.0-1062.9.1 it is. Therefore some patch committed between 3.10.0-957.27.2 and 3.10.0-1062 is responsible. I checked the source code and the diffs to see whether the patch from the kernel issue had been applied, and it turns out that this specific patch is already present in 3.10.0-957.27.2, so it cannot be the fix. Also, the kernel issue discussion reports that Runtime Power Management is responsible and that the problem occurs when the network card's power/control is set to 'auto' instead of 'on'. On my system the issue is reproducible with power/control set to 'on', so it is not related to that sysfs setting.
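
For anyone who wants to check the same setting on their own machine, the runtime PM policy can be read from sysfs. This is only a minimal Python sketch, assuming the interface's device is exposed at /sys/class/net/<iface>/device/power/control; the function name and the sysfs_root parameter are made up for illustration, and enp2s0 is simply the interface name from this report.

```python
from pathlib import Path


def runtime_pm_control(iface: str, sysfs_root: str = "/sys") -> str:
    """Return the runtime PM policy (e.g. 'on' or 'auto') for a network interface.

    Assumption: the interface's underlying device is reachable through
    <sysfs_root>/class/net/<iface>/device, which holds power/control.
    """
    control = (
        Path(sysfs_root) / "class" / "net" / iface / "device" / "power" / "control"
    )
    return control.read_text().strip()


if __name__ == "__main__":
    import sys

    # Interface name taken from the report; pass another one as the first argument.
    iface = sys.argv[1] if len(sys.argv) > 1 else "enp2s0"
    try:
        print(f"{iface}: power/control = {runtime_pm_control(iface)}")
    except OSError as exc:
        print(f"could not read runtime PM setting for {iface}: {exc}")
```

A value of 'on' means runtime power management is disabled for the device, which is the state in which the issue still reproduces here.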

Furthermore, the kernel issue describes the exact opposite of this CentOS issue: there, the system has no network activity *until* the kernel warning appears, after which the network card works fine. This issue is the other way around: the network card works fine until the kernel message is logged, and then after some time the card completely stops sending or receiving packets.

So, I don't have a solution, but at least I have ruled out a wrong direction that could waste time and effort.

* Minutes to hours. I don't know why and can't figure it out, but the issue always happens *after* this log message.
toracat

2019-12-25 07:15

manager   ~0035887

Can you test-install kernel-ml from ELRepo? As of today, the current kernel version is 5.4.6.el7. This test will show if the issue you are seeing has been fixed in the latest upstream (kernel.org) kernel.
NeK

2019-12-25 22:45

reporter   ~0035893

I just found a way to reproduce the issue at will:

Set up another system (HOST2), install iperf3 on both systems, and execute the following on HOST1:

iperf3 -s

and then execute on HOST2:

iperf3 -b0 -c HOST1 -u

After 10-15 seconds, HOST1 exhibits the issue and all of its network traffic stops completely. So now I can do quick tests and eventually find what causes this issue. I will also test with the new 5.4.6 kernel. I'll write up my findings ASAP.
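
While the iperf3 test runs, the kernel log on HOST1 can be scanned for the watchdog message so the moment of failure is easy to spot. The following is a rough Python sketch, assuming the message keeps the exact format of the line in the original report; WatchdogEvent, parse_watchdog and find_watchdog_events are hypothetical names chosen for this example.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional


class WatchdogEvent(NamedTuple):
    iface: str   # network interface, e.g. "enp2s0"
    driver: str  # driver name, e.g. "r8169"
    queue: int   # transmit queue that timed out


# Assumed format, copied from the kernel output in this report.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def parse_watchdog(line: str) -> Optional[WatchdogEvent]:
    """Return the watchdog details found in one log line, or None."""
    match = WATCHDOG_RE.search(line)
    if match is None:
        return None
    return WatchdogEvent(match["iface"], match["driver"], int(match["queue"]))


def find_watchdog_events(lines: Iterable[str]) -> Iterator[WatchdogEvent]:
    """Yield an event for every log line that carries the watchdog message."""
    for line in lines:
        event = parse_watchdog(line)
        if event is not None:
            yield event


if __name__ == "__main__":
    sample = (
        "Dec 10 11:03:57 mogweb kernel: NETDEV WATCHDOG: enp2s0 (r8169): "
        "transmit queue 0 timed out"
    )
    print(parse_watchdog(sample))
```

find_watchdog_events accepts any iterable of lines, so it could be given sys.stdin with the output of something like `journalctl -k -f` piped in on HOST1.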
ossgeek

2019-12-25 23:01

reporter   ~0035894

I haven't been able to try the v5.4.6 kernel yet, but ...

I can verify that the iperf3 setup does cause the issue as described in the initial report.
NeK

2019-12-25 23:06

reporter   ~0035895

And here are the results:

- 1062.9.1.el7: ISSUE EXISTS
- 957.27.2.el7: NO ISSUE
- 5.4.6-1.el7: NO ISSUE

So in the 5.4.6 kernel the issue has been fixed. That's good! Now, how do we find out which patch fixed it, so that the fix can be backported to 1062.9.1.el7? I don't want to run the latest kernel on CentOS 7 for various reasons (I don't trust running the latest kernel in an old OS environment, and I also use modules like ZFS that are compiled against 1062.9.1.el7, etc.).

Issue History

Date Modified Username Field Change
2019-12-10 18:47 ossgeek New Issue
2019-12-10 18:47 ossgeek Tag Attached: "Network"
2019-12-10 18:47 ossgeek Tag Attached: "3.10.0-1062.9.1.el7.x85_64"
2019-12-10 18:47 ossgeek Tag Attached: realtek
2019-12-25 03:14 NeK Note Added: 0035886
2019-12-25 07:15 toracat Note Added: 0035887
2019-12-25 22:45 NeK Note Added: 0035893
2019-12-25 23:01 ossgeek Note Added: 0035894
2019-12-25 23:06 NeK Note Added: 0035895