View Issue Details

ID: 0017369
Project: CentOS-7
Category: lvm2
View Status: public
Last Update: 2020-05-17 17:18
Reporter: denibuhe
Priority: urgent
Severity: crash
Reproducibility: always
Status: new
Resolution: open
Product Version: 7.8-2003
Target Version:
Fixed in Version:
Summary: 0017369: Crash when buffer becomes full
Description: I am trying to transfer an LV (a KVM VPS image) from another server (the source server) to this server (the target server). When I use dd over ssh to transfer the LV (300 GB) to this server, the target server crashes as soon as the buffer is full. This happens every time.

The dd command is executed on the source server.

Screenshot shows top output on crash.


If I run the same command but write to a regular file (/home/test.img) on the target instead of the LV, the server does not crash, so it must have something to do with LVM:

dd if=/dev/lvmkvm/kvm278_img | ssh root@node-fsn-1.domain.de dd of=/home/test.img
Steps To Reproduce:
[Target Server]
pvcreate /dev/md2
vgcreate lvmkvm /dev/md2
lvcreate -L 400G -n test lvmkvm

[Source Server]
dd if=/dev/lvmkvm/kvm278_img | ssh root@node-fsn-1.domain.de dd of=/dev/lvmkvm/test
Additional Information:
Target Server (Crash Server):

uname -r
3.10.0-1127.8.2.el7.x86_64

Hardware:
CPU: AMD EPYC 7502P
NVMe SSDs (Samsung)
192 GB ECC-RAM
Mainboard: Manufacturer: ASUSTeK COMPUTER INC. Product Name: KRPA-U16 Series Version: Rev 1.xx

Crash Debug:

[47737.193929] ------------[ cut here ]------------
[47737.194016] kernel BUG at mm/page_alloc.c:1656!
[47737.194101] invalid opcode: 0000 [#1] SMP
[47737.194312] Modules linked in: xt_socket nf_defrag_ipv6 nf_defrag_ipv4 xt_mark
iptable_mangle kcare(OE) devlink ebtable_filter ebtables ip6table_filter
ip6_tables iptable_filter rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd
grace fscache sunrpc joydev amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass
ipmi_ssif ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
drm_panel_orientation_quirks k10temp i2c_piix4 ipmi_si ipmi_devintf
ipmi_msghandler pinctrl_amd i2c_designware_platform i2c_designware_core
acpi_cpufreq cdc_ether usbnet mii ip_tables ext4 mbcache jbd2 raid1 raid10
crct10dif_pclmul crct10dif_common crc32_pclmul crc32c_intel ghash_clmulni_intel
aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ixgbe ahci igb libahci
libata i2c_algo_bit mdio dca ptp nvme pps_core
[47737.199607] nvme_core nfit libnvdimm dm_mirror dm_region_hash dm_log dm_mod
[47737.200152] CPU: 6 PID: 37004 Comm: dd Kdump: loaded Tainted: G OE
------------ 3.10.0-1127.el7.x86_64 #1
[47737.200255] Hardware name: ASUSTeK COMPUTER INC. KRPA-U16 Series/KRPA-U16
Series, BIOS 0601 03/26/2020
[47737.200518] task: ffff8c95896e0000 ti: ffff8c95afd68000 task.ti:
ffff8c95afd68000
[47737.200937] RIP: 0010:[<ffffffffba1c458e>] [<ffffffffba1c458e>]
move_freepages+0x15e/0x160
[47737.201429] RSP: 0018:ffff8c95afd6b8a0 EFLAGS: 00010006
[47737.201676] RAX: ffff8c968f359000 RBX: ffffeadc613f8000 RCX: 0000000000000000
[47737.201929] RDX: ffff8c968f35a000 RSI: 0000000000000000 RDI: ffff8c968f35a000
[47737.202183] RBP: ffff8c95afd6b8f0 R08: 000000000304f380 R09: 000000000184ffff
[47737.202437] R10: ffffeadc613fffc0 R11: 000000000000

====

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-1127.el7.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2020-05-16-23:48:45/vmcore [PARTIAL DUMP]
        CPUS: 64
        DATE: Sat May 16 23:47:59 2020
      UPTIME: 00:30:36
LOAD AVERAGE: 1.41, 1.37, 1.17
       TASKS: 693
    NODENAME: node-fsn-1
     RELEASE: 3.10.0-1127.el7.x86_64
     VERSION: #1 SMP Tue Mar 31 23:36:51 UTC 2020
     MACHINE: x86_64 (2495 Mhz)
      MEMORY: 191.9 GB
       PANIC: "kernel BUG at mm/page_alloc.c:1656!"
         PID: 3974
     COMMAND: "sshd"
        TASK: ffff8c5ec5480000 [THREAD_INFO: ffff8c5ec9570000]
         CPU: 10
       STATE: TASK_RUNNING (PANIC)



====


crash> sys
      KERNEL: /usr/lib/debug/lib/modules/3.10.0-1127.el7.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2020-05-16-23:48:45/vmcore [PARTIAL DUMP]
        CPUS: 64
        DATE: Sat May 16 23:47:59 2020
      UPTIME: 00:30:36
LOAD AVERAGE: 1.41, 1.37, 1.17
       TASKS: 693
    NODENAME: node-fsn-1.
     RELEASE: 3.10.0-1127.el7.x86_64
     VERSION: #1 SMP Tue Mar 31 23:36:51 UTC 2020
     MACHINE: x86_64 (2495 Mhz)
      MEMORY: 191.9 GB
       PANIC: "kernel BUG at mm/page_alloc.c:1656!"

====


crash> bt
PID: 3974 TASK: ffff8c5ec5480000 CPU: 10 COMMAND: "sshd"
 #0 [ffff8c5ec9573730] machine_kexec at ffffffffbce66044
 #1 [ffff8c5ec9573790] __crash_kexec at ffffffffbcf22ee2
 #2 [ffff8c5ec9573860] crash_kexec at ffffffffbcf22fd0
 #3 [ffff8c5ec9573878] oops_end at ffffffffbd58a798
 #4 [ffff8c5ec95738a0] die at ffffffffbce30a7b
 #5 [ffff8c5ec95738d0] do_trap at ffffffffbd589ee0
 #6 [ffff8c5ec9573920] do_invalid_op at ffffffffbce2d2a4
 #7 [ffff8c5ec95739d0] invalid_op at ffffffffbd59622e
    [exception RIP: move_freepages+350]
    RIP: ffffffffbcfc458e RSP: ffff8c5ec9573a88 RFLAGS: 00010006
    RAX: ffff8c5f8f359000 RBX: fffffe68213f8000 RCX: 0000000000000000
    RDX: ffff8c5f8f35a000 RSI: 0000000000000000 RDI: ffff8c5f8f35a000
    RBP: ffff8c5ec9573ad8 R8: 000000000304f380 R9: 000000000184ffff
    R10: fffffe68213fffc0 R11: 0000000000001000 R12: 0000000000000000
    R13: 0000000000000006 R14: ffff8c5f8f35a300 R15: fffffe68213fffc0
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #8 [ffff8c5ec9573ae0] move_freepages_block at ffffffffbcfc4603
 #9 [ffff8c5ec9573af0] __rmqueue at ffffffffbcfc6024
#10 [ffff8c5ec9573b60] get_page_from_freelist at ffffffffbcfc874c
#11 [ffff8c5ec9573c78] __alloc_pages_nodemask at ffffffffbcfc8e76
#12 [ffff8c5ec9573d20] alloc_pages_current at ffffffffbd018e18
#13 [ffff8c5ec9573d68] pipe_write at ffffffffbd056aec
#14 [ffff8c5ec9573df0] do_sync_write at ffffffffbd04c663
#15 [ffff8c5ec9573ec8] vfs_write at ffffffffbd04d150
#16 [ffff8c5ec9573f08] sys_write at ffffffffbd04df1f
#17 [ffff8c5ec9573f50] system_call_fastpath at ffffffffbd592ed2
    RIP: 00007f8e6276ca00 RSP: 00007fffeb670220 RFLAGS: 00000212
    RAX: 0000000000000001 RBX: 000055cf09c7b4b0 RCX: 00000000078c1d61
    RDX: 0000000000004000 RSI: 00007f8e5b1e7610 RDI: 000000000000000b
    RBP: 000055cf09c79ae0 R8: 0000000000000000 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 000055cf09c62a60
    R13: 000055cf09c7b540 R14: 0000000000004000 R15: 00007f8e5b1e7610
    ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
Tags: No tags attached.
abrt_hash:
URL:

Activities

denibuhe, 2020-05-17 08:24, reporter

Bildschirmfoto 2020-05-16 um 22.18.01.png (1,157,157 bytes)

denibuhe, 2020-05-17 11:46, reporter, ~0036949

The crash can also be induced in this way:

pvcreate /dev/md2
vgcreate lvmkvm /dev/md2
lvcreate -L 400G -n test lvmkvm

dd if=/dev/zero of=/dev/lvmkvm/test bs=300M count=1024

The buffer now fills up. As soon as the buffer in RAM is full, the server crashes (tested at least 10 times).

The problem does not occur with CentOS 7.6, only with CentOS 7.7 and 7.8.

To narrow down the problem, I ran the same test on another server. There the problem does not occur at all (CentOS 7.6, 7.7 and 7.8), so the problem appears to be related to the hardware.
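
To see how far the buffer has grown when the crash hits, its size can be logged from a second terminal while dd runs. A minimal sketch, assuming the "buffer" corresponds to the Buffers and Cached fields of /proc/meminfo; the function name buffer_cache_kib is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: print Buffers + Cached in KiB.
# Reads meminfo-format text from the file in $1 (default: /proc/meminfo).
buffer_cache_kib() {
    awk '$1 == "Buffers:" || $1 == "Cached:" { sum += $2 }
         END { printf "%.0f\n", sum }' "${1:-/proc/meminfo}"
}

# Example: log one timestamped sample per second until interrupted.
# while sleep 1; do echo "$(date +%T) $(buffer_cache_kib)"; done
```

Running the commented loop over an ssh session from another machine keeps the last samples visible after the target server has gone down.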

TrevorH, 2020-05-17 12:22, manager, ~0036950

What is "kcare(OE)" and why is it loaded on a CentOS system?

denibuhe, 2020-05-17 12:36, reporter, ~0036951

This is KernelCare ( https://www.kernelcare.com ) from CloudLinux.

But even without kcare the problem occurs: I reinstalled CentOS 7.7-minimal and ran the commands without any changes to the system.

The crash occurs: CentOS 7.7 and 7.8

No crash here: CentOS 7.6 and CentOS 8.1

denibuhe, 2020-05-17 14:54, reporter, ~0036953

It is due to the RAM size. Two RAM modules were removed, reducing memory from 192 GB to 128 GB. Now there is no crash (CentOS 7.7 and 7.8).

After that I added two new RAM modules (not the same ones), going from 128 GB back to 192 GB: the crash is back.

Guess: CentOS 7.7 and 7.8 (lvm/kernel) have a bug with this RAM size.

denibuhe, 2020-05-17 16:19, reporter, ~0036954

I have now installed kernel 4.4.223-1.el7.elrepo.x86_64 (from ELRepo). The crash no longer occurs with this kernel (CentOS 7.7 and 7.8).

So the problem lies in kernel 3.10.0-1127 in combination with the RAM size.

denibuhe, 2020-05-17 17:18, reporter, ~0036955

I have now installed two additional RAM modules: 192 GB -> 256 GB. Now the crash no longer occurs (default kernel).

Summary: CentOS 7.7 and 7.8 with 192 GB RAM cause the crash. With 128 GB or 256 GB there is no crash.
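
To compare machines, it may help to record the kernel release and the memory size side by side. A rough sketch, assuming MemTotal in /proc/meminfo is close enough to the installed RAM (it is usually slightly lower, because the kernel reserves some memory); mem_total_gib is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: print MemTotal in GiB with one decimal place.
# Reads meminfo-format text from the file in $1 (default: /proc/meminfo).
mem_total_gib() {
    awk '$1 == "MemTotal:" { printf "%.1f\n", $2 / 1048576 }' "${1:-/proc/meminfo}"
}

# Example: one line per machine, e.g. for pasting into a report.
# echo "$(uname -r) $(mem_total_gib) GiB"
```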

Issue History

Date Modified Username Field Change
2020-05-17 08:24 denibuhe New Issue
2020-05-17 08:24 denibuhe File Added: Bildschirmfoto 2020-05-16 um 22.18.01.png
2020-05-17 11:46 denibuhe Note Added: 0036949
2020-05-17 12:22 TrevorH Note Added: 0036950
2020-05-17 12:36 denibuhe Note Added: 0036951
2020-05-17 14:54 denibuhe Note Added: 0036953
2020-05-17 16:19 denibuhe Note Added: 0036954
2020-05-17 17:18 denibuhe Note Added: 0036955