View Issue Details

ID: 0014073
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2017-11-01 03:52
Reporter: lf1029698952
Assigned To:
Priority: immediate
Severity: crash
Reproducibility: always
Status: resolved
Resolution: fixed
Product Version: 7.3.1611
Summary: 0014073: kernel BUG at fs/xfs/xfs_aops.c:1062!
Description: RHEL7: kernel crash in xfs_vm_writepage - kernel BUG at fs/xfs/xfs_aops.c:1062!

What happened:
I run many physical machines as Kubernetes nodes, each with 64 cores and 384 GB of memory, and more than 100 containers run on each node.
CPU utilization: 20%
Memory utilization: 60%
Everything looks normal in Zabbix, but the physical hosts sometimes reboot suddenly.
Virtual machines with the same configuration (16 cores, 64 GB of memory, 20 containers) have never shown this problem.

I believe I have already identified the cause: it is a Linux kernel bug.
kernel BUG at fs/xfs/xfs_aops.c:1062!
CentOS 7 uses XFS as its default file system, and this is an XFS bug.
Red Hat already knows about the problem, but no formal patch appears to have been released.
RHEL7: kernel crash in xfs_vm_writepage - kernel BUG at fs/xfs/xfs_aops.c:1062!
https://access.redhat.com/solutions/2779111

Issue:
The systems (Docker hosts) are crashing in xfs_vm_writepage()
Spontaneous restarts while running Docker/Kubernetes
Kernel panicked due to BUG at fs/xfs/xfs_aops.c:1062!

The bug appears to be triggered when the containers have too many data logging volumes.
Let's follow up on the problem.

The specific log is as follows:

[555494.493271] XFS (dm-193): Mounting V5 Filesystem
[555494.539262] XFS (dm-193): Ending clean mount
[555494.541191] XFS (dm-193): Unmounting Filesystem
[555494.629717] XFS (dm-193): Mounting V5 Filesystem
[555494.657255] XFS (dm-193): Ending clean mount
[568864.225042] ------------[ cut here ]------------
[568864.225094] kernel BUG at fs/xfs/xfs_aops.c:1062!
[568864.225129] invalid opcode: 0000 [#1] SMP
[568864.225164] Modules linked in: veth xt_set ip_set_hash_net ip_set xt_mac ip6t_rpfilter ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw nf_conntrack_netlink nfnetlink ip6table_filter ip6_tables xt_conntrack br_netfilter bridge stp llc xt_statistic xt_nat ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_recent xt_mark xt_comment dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio bonding iptable_filter xt_addrtype iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack dcdbas ipmi_devintf iTCO_wdt iTCO_vendor_support mxm_wmi intel_powerclamp coretemp intel_rapl iosf_mbi kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ipmi_ssif pcspkr sb_edac edac_core sg mei_me lpc_ich mei shpchp ipmi_si ipmi_msghandler
[568864.225747] wmi acpi_power_meter ip_tables xfs sr_mod cdrom sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul mgag200 crct10dif_common crc32c_intel i2c_algo_bit drm_kms_helper syscopyarea sysfillrect bnx2x sysimgblt fb_sys_fops ttm drm mdio libcrc32c ahci libahci tg3 i2c_core libata ptp megaraid_sas pps_core fjes dm_mirror dm_region_hash dm_log dm_mod
[568864.226019] CPU: 40 PID: 378138 Comm: kworker/u385:5 Tainted: G W ------------ 3.10.0-514.el7.x86_64 #1
[568864.226119] Hardware name: Dell Inc. PowerEdge R630/02C2CP, BIOS 2.3.4 11/08/2016
[568864.226173] Workqueue: writeback bdi_writeback_workfn (flush-253:110)
[568864.226218] task: ffff88177db49f60 ti: ffff8812f3200000 task.ti: ffff8812f3200000
[568864.226266] RIP: 0010:[<ffffffffa04b72fb>] [<ffffffffa04b72fb>] xfs_vm_writepage+0x58b/0x5d0 [xfs]
[568864.226383] RSP: 0018:ffff8812f3203948 EFLAGS: 00010246
[568864.226423] RAX: 00203c0b0002006d RBX: ffff88204447d808 RCX: 000000000000000c
[568864.226483] RDX: 0000000000000008 RSI: ffff8812f3203c40 RDI: ffffea008ef636c0
[568864.226528] RBP: ffff8812f32039f0 R08: 0000000000000000 R09: 000000000001a098
[568864.226573] R10: ffff88247ffda000 R11: 0000000000000000 R12: ffff88204447d808
[568864.226618] R13: ffff8812f3203c40 R14: ffff88204447d6b8 R15: ffffea008ef636c0
[568864.226670] FS: 0000000000000000(0000) GS:ffff8823eeb00000(0000) knlGS:0000000000000000
[568864.226737] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[568864.226775] CR2: 000000c8256c7d14 CR3: 00000000019ba000 CR4: 00000000003407e0
[568864.226820] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[568864.226866] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[568864.226910] Stack:
[568864.226926] 0000000000008000 ffff8812f3203988 ffff8812f3203c40 ffff88206a7f9368
[568864.226981] ffff88206a7f9368 ffffea008ef636c0 0000000000001000 00007f61eb712000
[568864.227047] 0000000000001000 ffffffff811ba911 0000000000000000 0000000000000000
[568864.227100] Call Trace:
[568864.227125] [<ffffffff811ba911>] ? page_mkclean+0x1b1/0x1f0
[568864.228974] [<ffffffff8118b3b3>] __writepage+0x13/0x50
[568864.230810] [<ffffffff8118bed1>] write_cache_pages+0x251/0x4d0
[568864.232604] [<ffffffff8118b3a0>] ? global_dirtyable_memory+0x70/0x70
[568864.234390] [<ffffffff8118c19d>] generic_writepages+0x4d/0x80
[568864.236183] [<ffffffffa04b6063>] xfs_vm_writepages+0x53/0x90 [xfs]
[568864.237918] [<ffffffff8118d24e>] do_writepages+0x1e/0x40
[568864.239634] [<ffffffff81228730>] __writeback_single_inode+0x40/0x210
[568864.241332] [<ffffffff8122941e>] writeback_sb_inodes+0x25e/0x420
[568864.242936] [<ffffffff8122967f>] __writeback_inodes_wb+0x9f/0xd0
[568864.244536] [<ffffffff81229ec3>] wb_writeback+0x263/0x2f0
[568864.246143] [<ffffffff8121878c>] ? get_nr_inodes+0x4c/0x70
[568864.247765] [<ffffffff8122bebb>] bdi_writeback_workfn+0x2cb/0x460
[568864.249403] [<ffffffff810a7f3b>] process_one_work+0x17b/0x470
[568864.251019] [<ffffffff810a8d76>] worker_thread+0x126/0x410
[568864.252548] [<ffffffff810a8c50>] ? rescuer_thread+0x460/0x460
[568864.254053] [<ffffffff810b052f>] kthread+0xcf/0xe0
[568864.255507] [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
[568864.256942] [<ffffffff81696518>] ret_from_fork+0x58/0x90
[568864.258356] [<ffffffff810b0460>] ? kthread_create_on_node+0x140/0x140
[568864.259711] Code: e0 80 3d 4d b4 06 00 00 0f 85 a4 fe ff ff be d7 03 00 00 48 c7 c7 4a 60 50 a0 e8 61 e6 bc e0 c6 05 2f b4 06 00 01 e9 87 fe ff ff <0f> 0b 8b 4d a4 e9 e8 fb ff ff 41 b9 01 00 00 00 e9 69 fd ff ff
[568864.263096] RIP [<ffffffffa04b72fb>] xfs_vm_writepage+0x58b/0x5d0 [xfs]
[568864.264787] RSP <ffff8812f3203948>
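
When the same crash has to be recognised on many nodes, the signature lines of a trace like the one above (the "kernel BUG at" line and the "RIP" line) can be pulled out automatically. The following is only a minimal Python sketch and is not part of the original report; the function name summarize_oops and both regular expressions are illustrative assumptions that were written against the format of this particular log.

```python
import re

# Matches e.g. "kernel BUG at fs/xfs/xfs_aops.c:1062!"
BUG_RE = re.compile(r"kernel BUG at (?P<file>\S+):(?P<line>\d+)!")
# Matches e.g. "xfs_vm_writepage+0x58b/0x5d0 [xfs]" (module part is optional)
SYM_RE = re.compile(
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?"
)


def summarize_oops(log: str) -> dict:
    """Return the first BUG location and the first RIP symbol found in a kernel log.

    Hypothetical helper: keys are only present if the matching line was found.
    """
    summary = {}
    for line in log.splitlines():
        bug = BUG_RE.search(line)
        if bug and "file" not in summary:
            summary["file"] = bug.group("file")
            summary["line"] = int(bug.group("line"))
        if "RIP" in line and "symbol" not in summary:
            sym = SYM_RE.search(line)
            if sym:
                summary["symbol"] = sym.group("symbol")
                summary["module"] = sym.group("module")  # None if no [module] tag
    return summary
```

Fed the two relevant lines from the log above, the sketch reports fs/xfs/xfs_aops.c, line 1062, and the symbol xfs_vm_writepage in module xfs, which is enough to group identical crashes from different hosts.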


Although Red Hat has fixed this in its code, it seems that no corresponding patch has been released yet.
The problem still exists in the XFS file system on CentOS 7.
This is a very serious problem with a large impact. Please look into it; I hope a new CentOS 7 kernel version will fix it.

Until the bug is fixed, we are considering replacing XFS with ext4, but changing file systems is a big hassle.

Thanks very much!
Steps To Reproduce: Use physical machines as Kubernetes nodes, run many Docker containers, and mount many volumes, as I do.
The problem occurs while log output is being written.
Tags: bug, hang at restarting, kernel, xfs
abrt_hash:
URL:

Activities

toracat

2017-10-31 04:25

manager   ~0030496

According to https://access.redhat.com/solutions/2779111, which you quoted, the issue has been resolved with the errata RHSA-2017:1842 for kernel-3.10.0-693.el7 or later. This means it is also resolved in CentOS kernels 3.10.0-693.el7 or later. You are running kernel-3.10.0-514.el7.x86_64. Please update your kernel to the latest version; that should resolve the issue.
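
To check many nodes against the fixed version named in this note (3.10.0-693.el7), the comparison can be scripted. This is a rough Python sketch under stated assumptions: parse_kernel_release and is_at_least are hypothetical helper names, and the comparison simply compares the numeric fields before the ".el7" tag, which is a simplification and not the full RPM version-comparison algorithm.

```python
import platform
import re


def parse_kernel_release(release: str) -> tuple:
    """Turn '3.10.0-514.16.1.el7.x86_64' into (3, 10, 0, 514, 16, 1)."""
    # Keep only the part before the distribution tag (".el7...").
    head = release.split(".el", 1)[0]
    return tuple(int(n) for n in re.findall(r"\d+", head))


def is_at_least(release: str, minimum: str) -> bool:
    """True if `release` is the same as or newer than `minimum` (numeric fields only)."""
    return parse_kernel_release(release) >= parse_kernel_release(minimum)


if __name__ == "__main__":
    running = platform.release()  # kernel release string of the current host
    fixed_in = "3.10.0-693.el7"   # version quoted in the note above
    state = "has" if is_at_least(running, fixed_in) else "does NOT have"
    print(f"{running} {state} at least {fixed_in}")
```

With the versions mentioned in this thread, 3.10.0-514.el7.x86_64 compares as older than 3.10.0-693.el7, while 3.10.0-693.5.2.el7 compares as new enough.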
lf1029698952

2017-10-31 07:19

reporter   ~0030498

Thanks!
I will update my kernel to kernel-3.10.0-693.5.2.el7.
I don't really understand the cause of this bug.
I thought it had not been fixed because I found no mention of it in the changelog.

Is it this entry in the kernel-3.10.0-693.2.1.el7 changelog?
[fs] xfs: use ->b_state to fix buffer I/O accounting release race (Brian Foster) [1478254 1452228]

I also didn't find an explanation in https://access.redhat.com/errata/RHSA-2017:1842
Can you help me confirm which change contains the fix?
Thanks.
lf1029698952

2017-10-31 08:21

reporter   ~0030499

Do you mean this bug?
https://bugzilla.redhat.com/show_bug.cgi?id=1396941
toracat

2017-10-31 18:22

manager   ~0030500

To find the patch(es), it is easier to look at the 7.3 kernel changelog. The solution article states:

"This issue has been resolved with the errata RHSA-2017:0933 for the package(s) kernel-3.10.0-514.16.1.el7 or later.
Originally tracked in private Red Hat bug 1421203."

If you search for "1421203" in the changelog, you will find:

* Thu Feb 16 2017 Frantisek Hrbata <fhrbata@hrbata.com> [3.10.0-514.13.1.el7]
- [fs] gfs2: Reduce contention on gfs2_log_lock (Robert S Peterson) [1422380 1406850]
- [fs] gfs2: Inline function meta_lo_add (Robert S Peterson) [1422380 1406850]
- [fs] gfs2: Switch tr_touched to flag in transaction (Robert S Peterson) [1422380 1406850]
- [fs] xfs: ioends require logically contiguous file offsets (Brian Foster) [1421203 1398005]
- [fs] xfs: don't chain ioends during writepage submission (Brian Foster) [1421203 1398005]
- [fs] xfs: factor mapping out of xfs_do_writepage (Brian Foster) [1421203 1398005]
- [fs] xfs: xfs_cluster_write is redundant (Brian Foster) [1421203 1398005]
- [fs] xfs: Introduce writeback context for writepages (Brian Foster) [1421203 1398005]
- [fs] xfs: remove xfs_cancel_ioend (Brian Foster) [1421203 1398005]
- [fs] xfs: remove nonblocking mode from xfs_vm_writepage (Brian Foster) [1421203 1398005]
- [fs] mm/filemap.c: make global sync not clear error status of individual inodes (Brian Foster) [1421203 1398005]

Looks like the issue was fixed in kernel-3.10.0-514.13.1.el7.

Then if you go to the changelog of the kernel from 7.4, you will find the same patches in

* Thu Feb 09 2017 Rafael Aquini <aquini@redhat.com> [3.10.0-561.el7]
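
The changelog search described in this note can be sketched in Python. The sketch assumes the changelog is available as plain text in the layout quoted above (for example, the output of rpm -q --changelog kernel): header lines start with "* " and end with the build version in square brackets, and entry lines start with "- " and end with a bracketed list of bug IDs. The function name find_bug_entries is a hypothetical helper, not an existing tool.

```python
import re

# Last "[...]" group on a line: the build version on headers, bug IDs on entries.
TRAILING_BRACKET_RE = re.compile(r"\[([^\]]+)\]\s*$")


def find_bug_entries(changelog: str, bug_id: str) -> list:
    """Return (build_version, entry_text) pairs whose bug-ID list contains bug_id."""
    results = []
    build = None
    for raw in changelog.splitlines():
        line = raw.strip()
        if line.startswith("* "):
            # Header, e.g. "* Thu Feb 16 2017 Name <mail> [3.10.0-514.13.1.el7]"
            m = TRAILING_BRACKET_RE.search(line)
            build = m.group(1) if m else line
        elif line.startswith("- "):
            # Entry, e.g. "- [fs] xfs: ... (Author) [1421203 1398005]"
            m = TRAILING_BRACKET_RE.search(line)
            if m and bug_id in m.group(1).split():
                results.append((build, line[2:]))
    return results
```

Calling find_bug_entries(text, "1421203") on the 7.3 changelog excerpt quoted above would list the Brian Foster XFS entries under build 3.10.0-514.13.1.el7, while the gfs2 entries are skipped because their bug IDs differ.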
lf1029698952

2017-11-01 02:15

reporter   ~0030503

"I'm not authorized to access bug #1421203"

Following your guidance, I found this entry in the 3.10.0-561.el7 changelog:
- [fs] xfs: remove nonblocking mode from xfs_vm_writepage (Brian Foster) [1398005]

The cause of the bug has been found and the fix has been confirmed. Thank you very much!
toracat

2017-11-01 03:52

manager   ~0030504

You're welcome.

Issue History

Date Modified Username Field Change
2017-10-31 02:31 lf1029698952 New Issue
2017-10-31 02:31 lf1029698952 Tag Attached: bug
2017-10-31 02:31 lf1029698952 Tag Attached: hang at restarting
2017-10-31 02:31 lf1029698952 Tag Attached: kernel
2017-10-31 02:31 lf1029698952 Tag Attached: xfs
2017-10-31 04:25 toracat Note Added: 0030496
2017-10-31 07:19 lf1029698952 Note Added: 0030498
2017-10-31 08:21 lf1029698952 Note Added: 0030499
2017-10-31 18:22 toracat Note Added: 0030500
2017-11-01 02:15 lf1029698952 Note Added: 0030503
2017-11-01 03:52 toracat Status new => resolved
2017-11-01 03:52 toracat Resolution open => fixed
2017-11-01 03:52 toracat Note Added: 0030504