View Issue Details

ID: 0010451
Project: CentOS-7
Category: [All Projects] general
View Status: public
Last Update: 2020-03-18 21:17
Reporter: KramNotlimah
Priority: normal
Severity: crash
Reproducibility: random
Status: new
Resolution: open
Product Version: 7.2.1511
Target Version: (none)
Fixed in Version: (none)
Summary: 0010451: Crash/Hang after task blocked for more than 120 seconds.
Description:
Under heavy disk load with high memory usage, the server will start blocking tasks and crash. This has happened on 5 different servers so far, and multiple times on some of them. The problem seemed to start after updating from the 3.10.0-229 (7.1.1503) kernel to the 3.10.0-327 (7.2.1511) kernel. Prior to the update I was not experiencing any stability issues on the same servers, even under extreme load.

Some servers have multiple processors and some are single processor. Some of the servers are mail servers and some are web servers.

I am using xfs mounted as a standard partition. The servers are running in a virtual environment: some of them are on XenServer 6.5 and some are on VMware 6. The hosts are Dell 2950s and Dell R710s using either PERC 6 or PERC H700 RAID cards.

I have dropped 3 of the servers back to the 3.10.0-229 kernel to see if the issues go away. I have not been able to trigger the problem on them, even when using something like bonnie++ to generate high load (a sample invocation is sketched below).
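As a rough sketch, this is the style of bonnie++ run I mean; the directory, RAM size and user are illustrative, and -s should be roughly twice the machine's RAM so the test is not served from the page cache:

    # sequential I/O plus small-file churn on an 8 GB RAM box
    bonnie++ -d /srv/test -s 16384 -r 8192 -n 128 -u root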
Steps To Reproduce:
Update from 7.1.1503 to 7.2.1511 and run under very heavy load for a week.
Additional Information:
From the messages log:

Feb 25 08:06:02 ns1 kernel: INFO: task kworker/2:1:21610 blocked for more than 120 seconds.
Feb 25 08:06:02 ns1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 25 08:06:02 ns1 kernel: kworker/2:1 D ffff88011d97bdd8 0 21610 2 0x00000080
Feb 25 08:06:02 ns1 kernel: ffff88011d97bbf0 0000000000000046 ffff880122bc8b80 ffff88011d97bfd8
Feb 25 08:06:02 ns1 kernel: ffff88011d97bfd8 ffff88011d97bfd8 ffff880122bc8b80 ffff88011d97bd58
Feb 25 08:06:02 ns1 kernel: ffff88011d97bd60 7fffffffffffffff ffff880122bc8b80 ffff88011d97bdd8
Feb 25 08:06:02 ns1 kernel: Call Trace:
Feb 25 08:06:02 ns1 kernel: [<ffffffff8163a889>] schedule+0x29/0x70
Feb 25 08:06:02 ns1 kernel: [<ffffffff81638579>] schedule_timeout+0x209/0x2d0
Feb 25 08:06:02 ns1 kernel: [<ffffffff810bdf62>] ? select_task_rq_fair+0x552/0x6f0
Feb 25 08:06:02 ns1 kernel: [<ffffffff8163ac56>] wait_for_completion+0x116/0x170
Feb 25 08:06:02 ns1 kernel: [<ffffffff810b8c10>] ? wake_up_state+0x20/0x20
Feb 25 08:06:02 ns1 kernel: [<ffffffff810a5988>] kthread_create_on_node+0xa8/0x140
Feb 25 08:06:02 ns1 kernel: [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
Feb 25 08:06:02 ns1 kernel: [<ffffffff8109d9da>] create_worker+0xea/0x250
Feb 25 08:06:02 ns1 kernel: [<ffffffff8109dcd6>] manage_workers.isra.24+0xf6/0x2d0
Feb 25 08:06:02 ns1 kernel: [<ffffffff8109e5e9>] worker_thread+0x339/0x400
Feb 25 08:06:02 ns1 kernel: [<ffffffff8109e2b0>] ? rescuer_thread+0x400/0x400
Feb 25 08:06:02 ns1 kernel: [<ffffffff810a5aef>] kthread+0xcf/0xe0
Feb 25 08:06:02 ns1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Feb 25 08:06:02 ns1 kernel: [<ffffffff81645818>] ret_from_fork+0x58/0x90
Feb 25 08:06:02 ns1 kernel: [<ffffffff810a5a20>] ? kthread_create_on_node+0x140/0x140
Feb 25 08:06:02 ns1 kernel: php-cgi invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=0
Feb 25 08:06:02 ns1 kernel: php-cgi cpuset=/ mems_allowed=0
Feb 25 08:06:02 ns1 kernel: CPU: 0 PID: 7049 Comm: php-cgi Not tainted 3.10.0-327.3.1.el7.x86_64 #1
Feb 25 08:06:02 ns1 kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2011
Feb 25 08:06:02 ns1 kernel: ffff88008daf4500 000000008f7c51ed ffff88013940bb68 ffffffff8163516c
Feb 25 08:06:02 ns1 kernel: ffff88013940bbf8 ffffffff8163010c ffff88013940bc00 ffff88012f90f3e8
Feb 25 08:06:02 ns1 kernel: ffff88013940bc28 0000000000001eaf 0000000000000000 0000000000001eaf
Feb 25 08:06:02 ns1 kernel: Call Trace:
Feb 25 08:06:02 ns1 kernel: [<ffffffff8163516c>] dump_stack+0x19/0x1b
Feb 25 08:06:02 ns1 kernel: [<ffffffff8163010c>] dump_header+0x8e/0x214
Feb 25 08:06:02 ns1 kernel: [<ffffffff8116cdee>] oom_kill_process+0x24e/0x3b0
Feb 25 08:06:02 ns1 kernel: [<ffffffff8116c956>] ? find_lock_task_mm+0x56/0xc0
Feb 25 08:06:02 ns1 kernel: [<ffffffff81088dae>] ? has_capability_noaudit+0x1e/0x30
Feb 25 08:06:02 ns1 kernel: [<ffffffff8116d616>] out_of_memory+0x4b6/0x4f0
Feb 25 08:06:02 ns1 kernel: [<ffffffff811737f5>] __alloc_pages_nodemask+0xa95/0xb90
Feb 25 08:06:02 ns1 kernel: [<ffffffff81078d73>] copy_process.part.25+0x163/0x1610
Feb 25 08:06:02 ns1 kernel: [<ffffffff8107a401>] do_fork+0xe1/0x320
Feb 25 08:06:02 ns1 kernel: [<ffffffff8107a6c6>] SyS_clone+0x16/0x20
Feb 25 08:06:02 ns1 kernel: [<ffffffff81645c19>] stub_clone+0x69/0x90
Feb 25 08:06:02 ns1 kernel: [<ffffffff816458c9>] ? system_call_fastpath+0x16/0x1b
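For anyone tuning the watchdog that emits the "blocked for more than 120 seconds" message above, these are the relevant sysctls; a minimal sketch (enabling the panic is optional and assumes you want a hung host to reboot itself rather than sit wedged):

    # show the current timeout (120 s is the default)
    sysctl kernel.hung_task_timeout_secs
    # panic instead of only logging; combined with a reboot-on-panic
    # setting (e.g. kernel.panic=10) this recovers a hung host automatically
    sysctl -w kernel.hung_task_panic=1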
Tags: No tags attached.

Activities

KramNotlimah (reporter)   2016-03-04 11:01   ~0025919

I just dropped another server back to a 3.10.0-229 kernel because of this lockup. None of the servers that I have dropped back have had this issue, although I admit it has only been one week. Within this limited time frame it does look like something went wrong in the new kernel.
dekellum (reporter)   2016-03-11 16:50   ~0026015

I'm able to consistently reproduce what sounds like the same issue in a matter of seconds under the following setup. I reproduced this 8 times before isolating the cause to the kernel update.

* Amazon EC2, HVM
* Centos 7.2.1511 with all updates as of 2016-02-24
* 3.10.0-327.10.1 (and I think a prior patch release of -327.x)
* lvmcache using local SSD storage over remote EBS storage (doubles writes); a setup sketch is below
* PostgreSQL remote pg_basebackup of a 30+GB database (very high write load)

The hosts completely hang and do not recover. I enabled the persistent journal but was unable to see any useful error messages on the subset of hosts I was able to recover with a hard reboot. It's as if all disk writes halt.

I found that both kernels 3.10.0-229 and elrepo kernel-ml 4.4.2 avoid this issue.
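For context, a minimal sketch of the lvmcache layering described above; the volume group, pool and device names are illustrative, not my actual configuration:

    # cache pool on the local instance-store SSD (already a PV in the VG)
    lvcreate --type cache-pool -L 100G -n fastpool vg_data /dev/xvdb
    # attach the pool so I/O to the EBS-backed LV also hits the SSD
    lvconvert --type cache --cachepool vg_data/fastpool vg_data/lv_pgdata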
KramNotlimah (reporter)   2016-03-15 18:03   ~0026039

Thank you dekellum for posting your experience with this.

I am wondering if this has something to do with xfs and the kernel. I know there were a ton of changes with xfs in the 3.10.0-327 kernels.

I am very surprised there isn't more interest in this issue. It seems like a game-ending issue for this operating system.
Nimafin (reporter)   2016-06-19 11:29   ~0026917

I had this issue with kernel 3.10.0-327.13.1.el7. It was a VM running on top of ESXi 6.0U2, and the bug hit only after three days of production usage. I updated the machine on 2016-05-17 with all available updates, which brought the kernel to version 3.10.0-327.18.2.el7. Since then it has been running without problems.

xfs filesystem in use
disk sda = 30 GB
disk sdb = 500 GB
Mail server usage
Crash happened during backup to NFS-mounted storage
KramNotlimah (reporter)   2016-06-20 16:37   ~0026934

I have a number of servers that are very similar to what you show above. I also upgraded them to the latest kernel to see if that fixed things. Unfortunately, shortly afterwards the file system corrupted. I am not sure whether the corruption started before the upgrade or not. Since this corruption happened on a couple of servers, I decided not to take any chances and have been migrating to servers using EXT4. I have not seen any issues with these kernels and EXT4, and there don't seem to be as many changes happening to EXT4 in these kernels. I have also noticed that major cloud providers like AWS, Digital Ocean, etc. all use EXT4 in their environments. Maybe they know something we don't about XFS and the newer kernels.
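For anyone checking for the same corruption before deciding to migrate, xfs_repair has a non-destructive mode; a sketch (device name illustrative; the filesystem must be unmounted first):

    umount /dev/vg_data/lv_srv
    # -n = no-modify: report problems without writing any changes
    xfs_repair -n /dev/vg_data/lv_srv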
kranos (reporter)   2020-03-18 21:17   ~0036533

Just do the following:
* Reinstall CentOS 7 with manual partitioning: LVM + ext4 for every partition except /boot (standard partition + ext4).
This solved my problem.
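If you automate installs, the same layout can be expressed in a kickstart file; a minimal sketch where the volume group name and sizes are illustrative:

    part /boot --fstype=ext4 --size=1024
    part pv.01 --size=1 --grow
    volgroup vg_sys pv.01
    logvol swap --vgname=vg_sys --size=4096 --name=lv_swap
    logvol /    --fstype=ext4 --vgname=vg_sys --size=1 --grow --name=lv_root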

Issue History

Date Modified Username Field Change
2016-02-25 17:51 KramNotlimah New Issue
2016-03-04 11:01 KramNotlimah Note Added: 0025919
2016-03-11 16:50 dekellum Note Added: 0026015
2016-03-15 18:03 KramNotlimah Note Added: 0026039
2016-06-19 11:29 Nimafin Note Added: 0026917
2016-06-20 16:37 KramNotlimah Note Added: 0026934
2020-03-18 21:17 kranos Note Added: 0036533