View Issue Details

ID: 0017611
Project: CentOS-7
Category: nfs-utils
View Status: public
Last Update: 2020-07-22 14:14
Reporter: jalbasan
Priority: normal
Severity: major
Reproducibility: sometimes
Status: new
Resolution: open
Product Version: 7.5.1804
Target Version:
Fixed in Version:
Summary: 0017611: NFS clients hang/wait when removing some files from NFS-mounted shares
Description:
Hi All,

We are facing a very bizarre issue with our NFS server shares. During an ordinary file removal, the process sometimes takes minutes or even hours to complete for a handful of simple files on NFS mounts. The files do not have to be numerous either: a directory typically contains about 10 files, at other times 100 to 1000, and the behavior appears randomly.

This has been going on for more than two years with no solution. The issue reappears every six months to a year and persists until we restart nfsd.

Restarting nfsd fixes the issue instantly, but since our production relies heavily on the NFS service being up 99% of the time, a restart is too costly and unacceptable from other departments' perspective.

Basically, when a user runs rm, the process goes into the D (uninterruptible sleep) state for about two minutes, finishes removing one file, and moves on to the next. "ls" also hangs during the removal.

On both server and client, "lsof" shows no file held open while rm is hanging, and the server sits almost idle in terms of both CPU and I/O.

There are no error logs anywhere. The process starts, waits, and completes, but takes a very long time.
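
To make the "D state" observation easier to capture while a removal is hanging, here is a minimal Python sketch that scans /proc for processes in uninterruptible sleep. It is an illustration only, not something taken from this report: the function names are hypothetical, and it assumes a Linux client where /proc/<pid>/stat is readable.

```python
import os

def parse_stat_line(line):
    """Return (pid, comm, state) from one /proc/<pid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the LAST ')' rather than on whitespace.
    left, right = line.rsplit(")", 1)
    pid_text, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_text), comm, state

def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat_line(handle.read())
        except (OSError, ValueError):
            continue  # process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

Running uninterruptible_processes() in a loop on the client during a slow rm should list the stuck rm and ls processes. For each reported pid, /proc/<pid>/wchan may then hint at the kernel function it is waiting in.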


- NFS server uses ext4, with no disk/RAID problems
- plenty of RAM and CPU on the server
- nfs-utils-1.3.0-0.54.el7.x86_64
- we have a mixed bag of NFS clients using NFSv4 and NFSv3
- export arguments: /exports/data <world>(rw,sync,wdelay,hide,no_subtree_check,sec=sys:krb5:krb5p,secure,root_squash,no_all_squash)
- automount options: -rw,sec=sys,soft nfsserver:/exports/data
- rpcdebug indicates no errors on either the server or the clients
- both server and clients use FreeIPA via sssd

Any help is appreciated. Thanks
Steps To Reproduce: rm -rf /home/data/random_directory/random_batch_of_files
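
Since the stall is described as happening once per file, it may help to record how long each individual removal takes rather than timing the whole rm -rf. The following Python sketch does that. The helper names and the one-second threshold are assumptions made for illustration and are not part of the original reproduction steps.

```python
import os
import time

def timed_unlink(paths):
    """Remove each file and return a list of (path, seconds) pairs."""
    timings = []
    for path in paths:
        start = time.monotonic()
        os.unlink(path)  # blocks for as long as the NFS REMOVE call takes
        timings.append((path, time.monotonic() - start))
    return timings

def slow_removals(timings, threshold_seconds=1.0):
    """Keep only the removals that took longer than the threshold."""
    return [(path, secs) for path, secs in timings if secs > threshold_seconds]
```

Feeding the files of an affected directory to timed_unlink() and then filtering with slow_removals() would show whether every file stalls for roughly the same two minutes or only some do. That detail is worth attaching to a report like this one.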
Additional Information:
Here is nfsstat output:

Server rpc stats:
calls badcalls badclnt badauth xdrcall
3395563255 7633 1569 6064 0

Server nfs v3:
null getattr setattr lookup access readlink
223446 0% 6898249 16% 48474 0% 620911 1% 981674 2% 54630 0%
read write create mkdir symlink mknod
26673249 64% 3667284 8% 52261 0% 1265 0% 11046 0% 0 0%
remove rmdir rename link readdir readdirplus
70057 0% 666 0% 10625 0% 66 0% 233454 0% 298629 0%
fsstat fsinfo pathconf commit
308618 0% 42668 0% 12042 0% 1036474 2%

Server nfs v4:
null compound
114273 0% 3353561129 99%

Server nfs v4 operations:
op0-unused op1-unused op2-future access close commit
0 0% 0 0% 0 0% 134810320 1% 15865620 0% 16197583 0%
create delegpurge delegreturn getattr getfh link
191063 0% 0 0% 5892149 0% 1244888906 13% 11431137 0% 3876 0%
lock lockt locku lookup lookup_root nverify
1791965 0% 1 0% 207651 0% 16489097 0% 0 0% 0 0%
open openattr open_conf open_dgrd putfh putpubfh
46048599 0% 0 0% 111057 0% 273893 0% 2377609592 25% 0 0%
putrootfh read readdir readlink remove rename
134065 0% 171401705 1% 4467974 0% 388563 0% 2183956 0% 554828 0%
renew restorefh savefh secinfo setattr setcltid
778463 0% 787274 0% 1402506 0% 0 0% 2566687 0% 18930212 0%
setcltidconf verify write rellockowner bc_ctl bind_conn
18930212 0% 0 0% 996962644 10% 404 0% 0 0% 272 0%
exchange_id create_ses destroy_ses free_stateid getdirdeleg getdevinfo
3174 0% 444514 0% 2854 0% 153256 0% 0 0% 0 0%
getdevlist layoutcommit layoutget layoutreturn secinfononam sequence
0 0% 0 0% 0 0% 0 0% 1 0% 3267785527 35%
set_ssv test_stateid want_deleg destroy_clid reclaim_comp
0 0% 931047730 10% 0 0% 2857 0% 2864 0%

Client rpc stats:
calls retrans authrefrsh
5753917 8 5754604

Client nfs v3:
null getattr setattr lookup access readlink
0 0% 1171522 24% 0 0% 15890 0% 693811 14% 5624 0%
read write create mkdir symlink mknod
2042055 42% 0 0% 0 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
0 0% 0 0% 0 0% 0 0% 0 0% 456648 9%
fsstat fsinfo pathconf commit
446495 9% 32 0% 16 0% 0 0%

Client nfs v4:
null read write commit open open_conf
0 0% 0 0% 0 0% 0 0% 2 0% 0 0%
open_noat open_dgrd close setattr fsinfo renew
0 0% 0 0% 2 0% 10 0% 3642 0% 0 0%
setclntid confirm lock lockt locku access
1 0% 1 0% 0 0% 0 0% 0 0% 3683 0%
getattr lookup lookup_root remove rename link
2908 0% 4852 0% 1219 0% 3 0% 0 0% 0 0%
symlink create pathconf statfs readlink readdir
0 0% 3 0% 2428 0% 569409 62% 0 0% 160 0%
server_caps delegreturn getacl setacl fs_locations rel_lkowner
6070 0% 0 0% 0 0% 0 0% 0 0% 0 0%
secinfo exchange_id create_ses destroy_ses sequence get_lease_t
0 0% 0 0% 1222 0% 1221 0% 1218 0% 318506 34%
reclaim_comp layoutget getdevinfo layoutcommit layoutreturn getdevlist
3 0% 1218 0% 0 0% 0 0% 0 0% 0 0%
(null)
1 0%
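
The counters above are easier to compare once each header line is paired with its value line. The Python sketch below shows one way to do that for the layout pasted here, where names are on one line and counts on the next, with optional percentage columns. The function name is hypothetical and the parsing only assumes the format visible in this report.

```python
def parse_counters(header_line, value_line):
    """Map operation names to integer counts for one nfsstat header/value pair."""
    names = header_line.split()
    # Drop the percentage columns ("16%"); the rpc stats lines have none.
    counts = [int(tok) for tok in value_line.split() if not tok.endswith("%")]
    if len(names) != len(counts):
        raise ValueError("header and value lines do not line up")
    return dict(zip(names, counts))

# Example using the "Server rpc stats" lines from this report.
rpc = parse_counters("calls badcalls badclnt badauth xdrcall",
                     "3395563255 7633 1569 6064 0")
bad_fraction = rpc["badcalls"] / rpc["calls"]
```

With the server RPC line from this report, badclnt plus badauth accounts for all 7633 bad calls, which is a very small fraction of the roughly 3.4 billion calls handled.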


NFS config file:

#
#
# To set lockd kernel module parameters please see
# /etc/modprobe.d/lockd.conf
#

# Optional arguments passed to rpc.nfsd. See rpc.nfsd(8)
RPCNFSDARGS=""
# Number of nfs server processes to be started.
# The default is 8.
RPCNFSDCOUNT=24
#
# Set V4 grace period in seconds
#NFSD_V4_GRACE=90
#
# Set V4 lease period in seconds
#NFSD_V4_LEASE=90
#
# Optional arguments passed to rpc.mountd. See rpc.mountd(8)
RPCMOUNTDOPTS="-g"
# Port rpc.mountd should listen on.
#MOUNTD_PORT=892
#
# Optional arguments passed to rpc.statd. See rpc.statd(8)
STATDARG=""
# Port rpc.statd should listen on.
#STATD_PORT=662
# Outgoing port statd should used. The default is port
# is random
#STATD_OUTGOING_PORT=2020
# Specify callout program
#STATD_HA_CALLOUT="/usr/local/bin/foo"
#
#
# Optional arguments passed to sm-notify. See sm-notify(8)
SMNOTIFYARGS=""
#
# Optional arguments passed to rpc.idmapd. See rpc.idmapd(8)
RPCIDMAPDARGS=""
#
# Optional arguments passed to rpc.gssd. See rpc.gssd(8)
# Note: The rpc-gssd service will not start unless the
# file /etc/krb5.keytab exists. If an alternate
# keytab is needed, that separate keytab file
# location may be defined in the rpc-gssd.service's
# systemd unit file under the ConditionPathExists
# parameter
RPCGSSDARGS=""
#
# Enable usage of gssproxy. See gssproxy-mech(8).
GSS_USE_PROXY="yes"
#
# Optional arguments passed to blkmapd. See blkmapd(8)
BLKMAPDARGS=""
SECURE_NFS=yes
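
Most of the file above is comments, so the settings actually in effect are easy to miss. As a rough aid, this Python sketch extracts only the uncommented NAME=value lines. It assumes simple shell-style assignments like the ones shown here, and the function name is made up for illustration.

```python
import shlex

def active_settings(text):
    """Return {NAME: value} for uncommented NAME=value lines."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # blank line, comment, or not an assignment
        name, _, value = line.partition("=")
        # shlex.split removes shell-style quotes around the value.
        parts = shlex.split(value)
        settings[name] = parts[0] if parts else ""
    return settings
```

Applied to the file above, this leaves a short list in which the non-empty values are RPCNFSDCOUNT=24, RPCMOUNTDOPTS="-g", GSS_USE_PROXY="yes" and SECURE_NFS=yes. Everything else is either commented out or set to an empty string.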
Tags: nfs, NFSv4, nfsv4.1
abrt_hash:
URL:

Activities

ManuelWolfshant

2020-07-21 22:56

manager   ~0037389

First, thank you for taking the time to gather so much data and file such an extensive report.
Unfortunately, based on the data you have provided, I am afraid that we cannot help. You may not know this, but CentOS has never supported anything but the latest [and updated] minor release of any of its major OS releases.
As far as I can see, you rely on the version of nfs-utils from CentOS 7.5, which ceased being supported in Nov 2018. Based on the rest of your message, I infer that your whole OS is also old and uses package versions (the kernel being a rather important one) that are no longer supported.
In short, since you are missing two years' worth of updates and use a minor release of the OS which we do not and cannot support, you are kindly asked to fully update your system to the _current_ version (which is CentOS 7.8, kernel 3.10.0-1127.13.1.el7.x86_64, nfs-utils-1.3.0-0.66.el7.x86_64 and so on). If you can still reproduce the issue after the update, please contact us again.

If you absolutely must rely on older versions of the OS, you are advised to contact Red Hat and purchase an Extended Update Support license, since only they can provide support for this situation. As a side note, support for RHEL 7.5 ended on April 30th, 2020, so they, too, will ask you to update your system.
jalbasan

2020-07-22 00:58

reporter   ~0037390

I understand, and I do appreciate you taking the time on this. We would prefer to avoid the upgrade right now, which is why I'm a little desperate to solve the issue. However, given the situation, an upgrade may be inevitable.

Thanks again.

Regards
ManuelWolfshant

2020-07-22 08:16

manager   ~0037391

If it helps with your future decisions, keep in mind that between the kernel version you run and the current kernel, Red Hat introduced 101 (!!!) NFS-related patches (which are mentioned in the changelog of the kernel's rpm) and, still according to the changelog, there are 12 revisions between the current version of the nfs-utils package and the one you use.
Keep in mind also that under normal circumstances all you need is a yum update followed by a reboot (with proper checking of all relevant release notes, verification that the hardware you run is not deprecated, and so on).
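
As a rough cross-check of the nfs-utils gap, the two package names quoted in this thread can be compared directly. The Python sketch below pulls the build counter out of each name. Treating the difference between those counters as the number of package revisions is an assumption, since the authoritative count is the rpm changelog, and the function name is hypothetical.

```python
import re

def release_number(package):
    """Extract the build counter from names like nfs-utils-1.3.0-0.54.el7.x86_64."""
    match = re.search(r"-\d+\.(\d+)\.el\d+", package)
    if match is None:
        raise ValueError(f"unrecognised package name: {package}")
    return int(match.group(1))

installed = release_number("nfs-utils-1.3.0-0.54.el7.x86_64")
current = release_number("nfs-utils-1.3.0-0.66.el7.x86_64")
builds_behind = current - installed
```

For the two versions named here, builds_behind comes out as 12, which is consistent with the changelog figure quoted above.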

And last but not least... how do you actually expect to fix a bug in the software without changing the software?
jalbasan

2020-07-22 14:14

reporter   ~0037395

That is a valid point; however, our reluctance comes from other services being live on the same server, so the scope of the outage would be greater than just NFS. My original hope was that this might be the result of a misconfiguration that I could rectify without much noise; unfortunately, that seems unlikely now. So my current effort is to tie this issue to a specific bug, which would provide good leverage to press for a system update.

My aim is to avoid repeated outages on a mission-critical system by finding the root cause of this issue.

Issue History

Date Modified Username Field Change
2020-07-21 21:46 jalbasan New Issue
2020-07-21 21:46 jalbasan Tag Attached: nfs
2020-07-21 21:46 jalbasan Tag Attached: NFSv4
2020-07-21 21:46 jalbasan Tag Attached: nfsv4.1
2020-07-21 22:56 ManuelWolfshant Note Added: 0037389
2020-07-22 00:58 jalbasan Note Added: 0037390
2020-07-22 08:16 ManuelWolfshant Note Added: 0037391
2020-07-22 14:14 jalbasan Note Added: 0037395