View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0017112 | CentOS-7 | cachefilesd | public | 2020-03-05 09:10 | 2020-03-12 15:57 |
Reporter | johnd | Assigned To | |||
Priority | normal | Severity | minor | Reproducibility | always |
Status | new | Resolution | open | ||
Product Version | 7.7-1908 | ||||
Summary | 0017112: Processes get stuck in __fscache_read_or_alloc_pages | ||||
Description | On systems with a large memory configuration (1.5TB), processes become blocked with a stack trace like this:

[<ffffffffc037263f>] fscache_wait_for_deferred_lookup+0x5f/0x90 [fscache]
[<ffffffffc0372add>] __fscache_read_or_alloc_pages+0x6d/0x320 [fscache]
[<ffffffffc041f294>] __nfs_readpages_from_fscache+0x74/0x1b0 [nfs]
[<ffffffffc0415583>] nfs_readpages+0xc3/0x1f0 [nfs]
[<ffffffff84bc975f>] __do_page_cache_readahead+0x1cf/0x260
[<ffffffff84bc990f>] ondemand_readahead+0x11f/0x240
[<ffffffff84bc9d34>] page_cache_sync_readahead+0x44/0xb0
[<ffffffff84bbd9e2>] generic_file_aio_read+0x2c2/0x790
[<ffffffffc0408161>] nfs_file_read+0x71/0xf0 [nfs]
[<ffffffff84c48353>] do_sync_read+0x93/0xe0
[<ffffffff84c48d8f>] vfs_read+0x9f/0x170
[<ffffffff84c49c4f>] SyS_read+0x7f/0xf0
[<ffffffff8518bede>] system_call_fastpath+0x25/0x2a
[<ffffffffffffffff>] 0xffffffffffffffff

We have seen this stack trace both from the user application and from the bash process that runs the script with the user application. The bash script is started by the Torque pbs_mom process. Although we have seen the problem once on nodes with 192GB of memory, it almost always happens on nodes with 1.5TB of memory.

Our cachefs file system is mounted like this:

/dev/sda4 on /cachefs type ext4 (rw,noatime,nodiratime,data=ordered)

lsscsi shows:

[0:0:0:0] disk ATA MK001920GWSSE HPG0 /dev/sda

The NFS mount is from an EMC Isilon:

data:/data/projects on /home/projects type nfs (rw,nosuid,nodev,noatime,vers=3,rsize=131072,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=10.45.15.136,mountvers=3,mountport=300,mountproto=tcp,fsc,local_lock=none,addr=10.45.15.136)
Steps To Reproduce | The scenario we see is a process that does lots of reads and mallocs being started in 30 instances at nearly the same time. Running strace on the stuck process makes the process run again, and you can then CTRL-C out of strace. Here are the first lines we see from strace:

886 brk(NULL) = 0x1cdf000
886 brk(0x1d00000) = 0x1d00000
886 read(3, "DRB1_0101 GPMAGGSGNSRH\nAGHQTSAES"..., 8192) = 8192
886 read(3, "PTAC_DRB1_0101 PSAAPASVRFMG\nAPDA"..., 8192) = 8192
886 read(3, "RELLSLPATSLADQDI 1 Abelin__MAPTA"..., 8192) = 8192
886 brk(NULL) = 0x1d00000
886 brk(0x1d21000) = 0x1d21000
886 read(3, "EQKRGAGM\nDERGFVWEKAVEGDFFR 1 Abe"..., 8192) = 8192
886 read(3, "VAALQPPVVQLH 1 Abelin__MAPTAC_DR"..., 8192) = 8192
886 read(3, "_MAPTAC_DRB1_0101 SAGDTWQPEHID\nD"..., 8192) = 8192
886 brk(NULL) = 0x1d21000
886 brk(0x1d42000) = 0x1d42000
886 read(3, "AALDV\nEGGYFHVLLAPGVHN 1 Abelin__"..., 8192) = 8192
886 read(3, " Abelin__MAPTAC_DRB1_0101 ILDEQR"..., 8192) = 8192
886 read(3, "VAKT 1 Abelin__MAPTAC_DRB1_0101 "..., 8192) = 8192
886 brk(NULL) = 0x1d42000
886 brk(0x1d63000) = 0x1d63000
886 read(3, "MAPTAC_DRB1_0101 VLKGDNSAGLTE\nGD"..., 8192) = 8192
886 read(3, "0101 DQMGNKRPQFHQ\nGNKASYIHLQGSDL"..., 8192) = 8192
886 brk(NULL) = 0x1d63000
886 brk(0x1d84000) = 0x1d84000
886 read(3, "HPFGLAV\nGSHTLQRMYGCDLGPDG 1 Abel"..., 8192) = 8192
886 read(3, " ANTHNTVEPSAG\nHNTWKAMEGIFIKPSVEP"..., 8192) = 8192
886 read(3, "DRB1_0101 KVYIIQGGDEIV\nIKAAQYQVN"..., 8192) = 8192
886 brk(NULL) = 0x1d84000
886 brk(0x1da5000) = 0x1da5000
886 read(3, "AEKVAQVAEIT 1 Abelin__MAPTAC_DRB"..., 8192) = 8192
886 read(3, "Abelin__MAPTAC_DRB1_0101 VKKKKEV"..., 8192) = 8192
886 read(3, "LNQKSNQDNYCV\nKSNVVSSVVHPLLQLVPHL"..., 8192) = 8192
886 brk(NULL) = 0x1da5000
886 brk(0x1dc6000) = 0x1dc6000
886 read(3, "GVHP\nLDSARFRYLMGERLGVHPLS 1 Abel"..., 8192) = 8192
886 read(3, "belin__MAPTAC_DRB1_0101 TSTLNRPT"..., 8192) = 8192
886 brk(NULL) = 0x1dc6000
886 brk(0x1de7000) = 0x1de7000
886 read(3, " Abelin__MAPTAC_DRB1_0101 QEFLTA"..., 8192) = 8192
886 read(3, "HPQEAS\nNIDLVAQRGERLE 1 Abelin__M"..., 8192) = 8192
886 read(3, "in__MAPTAC_DRB1_0101 APNPKIGTMLT"..., 8192) = 8192
886 brk(NULL) = 0x1de7000
886 brk(0x1e08000) = 0x1e08000
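The scenario in Steps To Reproduce (about 30 processes started together, each reading a file in 8192-byte chunks while its heap grows) could be approximated with a small driver like the one below. This is only a sketch, not the reporter's application: the function name `run_readers`, the embedded reader script, and the idea of keeping every chunk in memory to mimic the `brk()` growth are all assumptions made for illustration. The file path passed in is up to the caller and would need to point at a file on the `fsc`-mounted NFS share.

```python
import subprocess
import sys

CHUNK = 8192  # read size seen in the strace output

# Hypothetical stand-in for the user application: reads the file in
# fixed-size chunks and keeps every chunk so the heap keeps growing.
READER = """
import sys
chunks = []
with open(sys.argv[1], "rb", buffering=0) as f:
    while True:
        data = f.read(int(sys.argv[2]))
        if not data:
            break
        chunks.append(data)
print(sum(len(c) for c in chunks))
"""


def run_readers(path, instances=30, chunk=CHUNK, timeout=None):
    """Start `instances` reader processes at nearly the same time.

    Returns the number of bytes each reader consumed. If `timeout` is set
    and a reader does not finish in time, subprocess.TimeoutExpired is raised.
    """
    procs = [
        subprocess.Popen(
            [sys.executable, "-c", READER, path, str(chunk)],
            stdout=subprocess.PIPE,
            text=True,
        )
        for _ in range(instances)
    ]
    results = []
    for proc in procs:
        out, _ = proc.communicate(timeout=timeout)
        if proc.returncode != 0:
            raise RuntimeError(f"reader exited with status {proc.returncode}")
        results.append(int(out))
    return results
```

Calling `run_readers(path, timeout=600)` on an affected node would raise `subprocess.TimeoutExpired` if a reader stalls, which gives a hint about when to inspect the stuck process. Whether this synthetic load actually triggers the hang is untested.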
Tags | hang | ||||
abrt_hash | |||||
URL | |||||
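To check many processes for the blocked state shown in the Description, the stack text can be parsed mechanically. The sketch below assumes the trace format shown in this report (`[<address>] symbol+offset/length [module]`) and assumes the trace is read from `/proc/<pid>/stack`, which normally requires root; the report does not say how the trace was collected. The helper names `parse_frames`, `is_stuck_in_fscache`, and `read_stack` are hypothetical.

```python
import re

# Matches frames such as:
# [<ffffffffc037263f>] fscache_wait_for_deferred_lookup+0x5f/0x90 [fscache]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_frames(stack_text):
    """Return (symbol, module) pairs; module is None for core-kernel frames."""
    return [(m.group(1), m.group(2)) for m in FRAME_RE.finditer(stack_text)]


def is_stuck_in_fscache(stack_text):
    """True if the innermost frame is the fscache deferred-lookup wait."""
    frames = parse_frames(stack_text)
    return bool(frames) and frames[0][0] == "fscache_wait_for_deferred_lookup"


def read_stack(pid):
    """Read the kernel stack of a process (assumption: run as root)."""
    with open(f"/proc/{pid}/stack") as f:
        return f.read()
```

For example, `is_stuck_in_fscache(read_stack(pid))` could be run over the PIDs of the job's processes to see which of them are waiting in `fscache_wait_for_deferred_lookup`.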
Hello, is anybody looking at this? |
|
CentOS is a rebuild of the sources used to create RHEL. We do not modify anything except to remove branding and logos. You will need to submit your request to Red Hat via bugzilla.redhat.com. If and when Red Hat accepts it, incorporates it into RHEL, and releases a patched version, CentOS will pick it up and rebuild it. So, no, probably not. |
|
Nice to know. I filed this as bug 1812979: https://bugzilla.redhat.com/show_bug.cgi?id=1812979 |
|