View Issue Details

ID: 0006254
Project: CentOS-6
Category: kernel
View Status: public
Last Update: 2013-11-12 13:50
Reporter: nadergan
Priority: urgent
Severity: crash
Reproducibility: random
Status: new
Resolution: open
Platform: x86_64
OS: CentOS
OS Version: 6.3
Product Version: 6.3
Target Version:
Fixed in Version:
Summary: 0006254: Filesystems become unwritable and applications doing I/O are blocked
Description: At random times, on several servers running CentOS 6.3, the filesystem becomes unwritable and applications hang and produce a stack dump. Only a reboot resolves the problem.

I suspect a spinlock in NFS but I'm not sure.

We use EXT4 and NFS (version 3) on these systems (VMware ESX 4.1 VMs), and it happens at random times on random servers.
The system is:

Linux wcliwb108 2.6.32-279.2.1.el6.x86_64 #1 SMP Fri Jul 20 01:55:29 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

 INFO: task tail:28122 blocked for more than 120 seconds.
Feb 14 17:15:13 wcliwb108 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 14 17:15:13 wcliwb108 kernel: tail D 0000000000000002 0 28122 28094 0x00000084
Feb 14 17:15:13 wcliwb108 kernel: ffff88013ba6fe38 0000000000000082 0000000000000000 0000000000000024
Feb 14 17:15:13 wcliwb108 kernel: ffff88013ba6fe28 ffffffffffffffe9 ffff88013ba6fdc8 ffffffff81178d24
Feb 14 17:15:13 wcliwb108 kernel: ffff8801139a5af8 ffff88013ba6ffd8 000000000000fb88 ffff8801139a5af8
Feb 14 17:15:13 wcliwb108 kernel: Call Trace:
Feb 14 17:15:13 wcliwb108 kernel: [<ffffffff81178d24>] ? nameidata_to_filp+0x54/0x70
Feb 14 17:15:13 wcliwb108 kernel: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180
Feb 14 17:15:13 wcliwb108 kernel: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50
Feb 14 17:15:13 wcliwb108 kernel: [<ffffffffa00e6f60>] ext4_llseek+0x60/0x110 [ext4]
Feb 14 17:15:13 wcliwb108 kernel: [<ffffffff81179d5a>] vfs_llseek+0x3a/0x40
Feb 14 17:15:13 wcliwb108 kernel: [<ffffffff8117b516>] sys_lseek+0x66/0x80
Feb 14 17:15:13 wcliwb108 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Feb 14 17:17:13 wcliwb108 kernel: INFO: task tail:28122 blocked for more than 120 seconds.
Feb 14 17:17:13 wcliwb108 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 14 17:17:13 wcliwb108 kernel: tail D 0000000000000002 0 28122 28094 0x00000084
Feb 14 17:17:13 wcliwb108 kernel: ffff88013ba6fe38 0000000000000082 0000000000000000 0000000000000024
Feb 14 17:17:13 wcliwb108 kernel: ffff88013ba6fe28 ffffffffffffffe9 ffff88013ba6fdc8 ffffffff81178d24
Feb 14 17:17:13 wcliwb108 kernel: ffff8801139a5af8 ffff88013ba6ffd8 000000000000fb88 ffff8801139a5af8
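With about 600 servers affected, it may help to summarize these reports automatically instead of reading each log by hand. The following is a minimal Python sketch, assuming the log lines look like the excerpt above. The function name `parse_hung_tasks`, the field names, and both regular expressions are illustrative choices and not part of any existing tool.

```python
import re

# Header line written by the hung-task detector, e.g.
# "INFO: task tail:28122 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# One call-trace frame, e.g.
# "[<ffffffffa00e6f60>] ext4_llseek+0x60/0x110 [ext4]"
# A leading "? " marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\] (?P<guess>\? )?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)

def parse_hung_tasks(lines):
    """Return one dict per hung-task report found in the given log lines."""
    reports = []
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header.group("comm"),
                "pid": int(header.group("pid")),
                "timeout": int(header.group("secs")),
                "frames": [],
            }
            reports.append(current)
            continue
        frame = FRAME_RE.search(line)
        if frame and current is not None:
            current["frames"].append({
                "symbol": frame.group("symbol"),
                "module": frame.group("module"),  # None for core-kernel symbols
                "reliable": frame.group("guess") is None,
            })
    return reports

# Lines copied from the log excerpt in this report.
SAMPLE = [
    "INFO: task tail:28122 blocked for more than 120 seconds.",
    "Feb 14 17:15:13 wcliwb108 kernel: Call Trace:",
    "Feb 14 17:15:13 wcliwb108 kernel: [<ffffffff81178d24>] ? nameidata_to_filp+0x54/0x70",
    "Feb 14 17:15:13 wcliwb108 kernel: [<ffffffff814fefbe>] __mutex_lock_slowpath+0x13e/0x180",
    "Feb 14 17:15:13 wcliwb108 kernel: [<ffffffff814fee5b>] mutex_lock+0x2b/0x50",
    "Feb 14 17:15:13 wcliwb108 kernel: [<ffffffffa00e6f60>] ext4_llseek+0x60/0x110 [ext4]",
]

for report in parse_hung_tasks(SAMPLE):
    symbols = [f["symbol"] for f in report["frames"] if f["reliable"]]
    print(report["comm"], report["pid"], symbols)
```

Grouping the results by the innermost reliable frame (here `__mutex_lock_slowpath`, called from `ext4_llseek`) would show whether every server is stuck on the same lock.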
Steps To Reproduce: It happens randomly.
Tags: No tags attached.

Activities

tru

2013-02-14 10:19

administrator   ~0016487

Please update to the latest kernel version.
It is not clear to me what your issue is, or what role the CentOS server plays in your description:
is it the NFS server for your ESXi host, a VM guest on ESXi, or an NFS client on ESXi?
nadergan

2013-02-14 10:41

reporter   ~0016488

We greatly appreciate your help and are willing to upgrade to the latest kernel. However, we have about 600 servers with the same problem, so the upgrade is a major effort, and we would first like to know whether updating to the latest kernel actually fixes this problem.

To clarify: our VMs run a Java application server (Apache Tomcat) and are also NFS clients.

When this happens, the server can no longer perform I/O: every process that attempts to write to a file (any file) blocks and cannot be killed, and only a reboot resolves the situation.
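Processes that block on I/O and ignore kill signals are normally in uninterruptible sleep, shown as state `D` (as for `tail` in the stack dump above). The following Python sketch lists such processes by reading `/proc/<pid>/stat`. The names `parse_stat` and `blocked_tasks` are hypothetical, and the sketch only inspects processes, not the individual threads under `/proc/<pid>/task`.

```python
import os

def parse_stat(text):
    """Split the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left].strip())
    comm = text[left + 1:right]
    state = text[right + 1:].split()[0]
    return pid, comm, state

def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or the file was unreadable
        if state == "D":
            found.append((pid, comm))
    return sorted(found)
```

Calling `blocked_tasks()` periodically on an affected server and logging the result could show which process enters state `D` first, which may help narrow down whether the NFS client or EXT4 is involved.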

Please see the stack dump above and help us identify the source of the problem.

I'm not sure whether it is related to the NFS client or to EXT4 (we use the writeback mount option).
claudinei

2013-11-12 13:47

reporter   ~0018334

I have a similar problem.
I think it is the same bug, and it matches what is described in: "http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2018879"
It occurs when I run with the disk write cache disabled (/sbin/hdparm -W 0 /dev/hda1).
Has this problem been solved?
claudinei

2013-11-12 13:50

reporter   ~0018335

The correct link: http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&docTypeID=DT_KB_1_1&externalId=51306

Issue History

Date Modified Username Field Change
2013-02-14 09:57 nadergan New Issue
2013-02-14 10:19 tru Note Added: 0016487
2013-02-14 10:41 nadergan Note Added: 0016488
2013-11-12 13:47 claudinei Note Added: 0018334
2013-11-12 13:50 claudinei Note Added: 0018335