View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0013639 | CentOS-7 | kernel | public | 2017-08-04 15:06 | 2020-11-20 12:30 |
Reporter | sobrique | Assigned To | |||
Priority | normal | Severity | minor | Reproducibility | sometimes |
Status | new | Resolution | open | ||
Platform | x86_64 | OS | Centos | OS Version | 7.3 |
Product Version | 7.3.1611 | ||||
Summary | 0013639: lockd SO_REUSEPORT causes 'lockd: nfs server X not responding, still trying' if the remote host closes the connection with a FIN | ||||
Description | The problem doesn't occur in the 3.10.0-327 kernels (installed with CentOS 7.2) but does on 3.10.0-514.*; we have confirmed it applies to 514.10.2, 514.21.1, 514.21.2 and 514.26.2.

This is for NFS v3 over TCP. Specifically, we have a network trace of our clients receiving a FIN from the remote server (an Isilon NFS server, BSD based) and tearing down the connection. The client then attempts to reuse the same source port to talk to the same IP/port tuple on the remote. This doesn't get any response (and causes TCP retransmits).

    23 93.895312607 10.0.77.7 -> 10.0.72.93 TCP 66 304 > 739 [FIN, ACK] Seq=361 Ack=2365 Win=28688 Len=0 TSval=3324674743 TSecr=3023764526
    24 93.895365165 10.0.72.93 -> 10.0.77.7 TCP 66 739 > 304 [FIN, ACK] Seq=2365 Ack=362 Win=58 Len=0 TSval=3023858350 TSecr=3324674743
    25 93.895537170 10.0.77.7 -> 10.0.72.93 TCP 66 304 > 739 [ACK] Seq=362 Ack=2366 Win=28688 Len=0 TSval=3324674743 TSecr=3023858350
    26 119.999069392 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Port numbers reused] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023884454 TSecr=0 WS=512
    27 121.001045228 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023885456 TSecr=0 WS=512
    28 123.005029586 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023887460 TSecr=0 WS=512
    29 127.017070461 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023891472 TSecr=0 WS=512
    30 135.033082431 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023899488 TSecr=0 WS=512
    31 151.065047976 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023915520 TSecr=0 WS=512
    32 183.129124247 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023947584 TSecr=0 WS=512
    33 183.129296560 10.0.77.7 -> 10.0.72.93 TCP 74 304 > 739 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=1460 WS=64 SACK_PERM=1 TSval=1289537099 TSecr=3023947584
    34 183.129406460 10.0.72.93 -> 10.0.77.7 TCP 66 739 > 304 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=3023947584 TSecr=1289537099

This would _appear_ to be down to the SO_REUSEPORT option being added in net/sunrpc/xprtsock.c in the 514 kernel. This behaviour doesn't appear to occur with the 327 kernel, which, when a FIN occurs, uses a new source port number.

The remote server seems to correctly go into TIME_WAIT state, but then discards incoming traffic when the port is reused (until the timeout hits). This causes client stalls with the error:

    lockd: nfs server X not responding, still trying

The client recovers normally shortly after the successful retransmit and 3-way handshake. | ||||
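As a rough illustration of how long the client stalls, the capture times in the trace above can be turned into retransmit gaps. This Python sketch only reuses the timestamps of frames 25 to 33; the variable and function names are made up for the example and are not part of any tool.

```python
# Relative capture times (seconds) copied from frames 25-33 of the trace.
close_ack_time = 93.895537170   # frame 25: last ACK of the FIN exchange
syn_times = [
    119.999069392,  # frame 26: first SYN reusing source port 739
    121.001045228,  # frame 27
    123.005029586,  # frame 28
    127.017070461,  # frame 29
    135.033082431,  # frame 30
    151.065047976,  # frame 31
    183.129124247,  # frame 32: the SYN that finally gets an answer
]
syn_ack_time = 183.129296560    # frame 33: SYN, ACK from the server


def retransmit_gaps(times):
    """Return the delay between each SYN and the next one, in seconds."""
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]


gaps = retransmit_gaps(syn_times)
stall = syn_ack_time - syn_times[0]            # first SYN until SYN, ACK
since_close = syn_ack_time - close_ack_time    # connection close until SYN, ACK

print("gaps between SYNs:", gaps)
print(f"stall after first SYN: {stall:.3f} s")
print(f"time since the close: {since_close:.3f} s")
```

The gaps come out at roughly 1, 2, 4, 8, 16 and 32 seconds, i.e. the SYN retry interval doubles each time, and the client waits about 63 seconds between its first SYN and the SYN, ACK.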
Steps To Reproduce | This is somewhat difficult to reproduce, because it relies on a remote NFS server sending a FIN packet to the CentOS client. My Isilon does this periodically, a few times a day. At that point, _if_ a new lock request is made within the TIME_WAIT period on the remote (CLOSE_WAIT on the client), the client will retransmit its SYN and stall until the timeout is reached, as the remote is discarding the packets (as it should, because of TIME_WAIT). | ||||
Additional Information | It seems likely that patch 5798061 is related, as this introduces the SO_REUSEPORT flag into net/sunrpc/xprtsock.c. My best guess would be that the handling of client CLOSE_WAIT isn't correctly taking account of whether SO_REUSEPORT is set, and is therefore attempting to reuse the port when it shouldn't. | ||||
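For readers unfamiliar with the option, here is a minimal user-space sketch of what "reuse the same source port" means. It is not the kernel code from net/sunrpc/xprtsock.c; the helper name `open_client_socket` is hypothetical, the loopback address is only for the demonstration, and it assumes a Linux host where Python exposes `socket.SO_REUSEPORT`.

```python
import socket


def open_client_socket(local_port=0):
    """Create a TCP socket bound to a chosen source port with SO_REUSEPORT set.

    local_port=0 lets the kernel pick a free port; any other value pins the
    source port, which is the behaviour the report describes for 514 kernels.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Assumption: SO_REUSEPORT is available (Linux 3.9 or later).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", local_port))
    return sock


# First "connection": let the kernel choose the source port, then close it.
first = open_client_socket()
port = first.getsockname()[1]
first.close()

# Second "connection": deliberately bind the same source port again.
second = open_client_socket(port)
reused = second.getsockname()[1] == port
second.close()

print("source port", port, "reused:", reused)
```

Passing `local_port=0` on every reconnect would instead give a fresh source port each time, which matches what the report observes on the 327 kernel.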
Tags | nfs | ||||
abrt_hash | |||||
URL | |||||
This might be better tagged as 'kernel' rather than 'nfs-utils' - it seems to be kernel level nfs-client behaviour. | |
I can reproduce this on a CentOS 7.3 'clean' build desktop, running 3.10.0-514.26.2, mounting an NFS directory from an Isilon NFS server running OneFS 8.0.0.4 (it may be reproducible under other storage systems).

The test case is:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Fcntl qw ( :flock );

    while (1) {
        open( my $testfile, '>', "test_lock.out" ) or die $!;
        my $start = time();
        print $start, "\n";
        if ( flock( $testfile, LOCK_EX ) ) {
            #print "$$ got lock\n";
        }
        else {
            print "$$ didn't lock?\n";
        }
        if ( time() > $start + 10 ) {
            print "WARNING: LOCK TOOK TOO LONG\n";
        }
        sleep 100;
        $start = time();
        if ( flock( $testfile, LOCK_UN ) ) {
            #print "$$ unlocked\n";
        }
        else {
            print "$$ error unlocking\n";
        }
        if ( time() > $start + 10 ) {
            print "WARNING: UNLOCK TOOK TOO LONG\n";
        }
        close($testfile);
        sleep 100;
    }

The script runs alongside two packet captures: a 'general' one (focusing on the lockd port) and one _just_ watching for FIN packets:

    tshark -i em1 -f "port 304 and tcp[tcpflags] & tcp-fin != 0"

When the FIN packet is received, there's a high chance of seeing a TCP retransmit as outlined above whilst the server transitions through TIME_WAIT. I would therefore suggest the problem is within the client side of the lockd RPC mechanism, around TCP port reuse. |
https://access.redhat.com/solutions/3018371 | |
Is it known if this bug can occur in kernel 3.10.0-1127.19.1.el7.x86_64? We are setting these ones every now and then:

    [Nov16 04:24] lockd: server storage.nfs not responding, still trying
    [ +3.001808] lockd: server storage.nfs OK

    $ cat /etc/centos-release
    CentOS Linux release 7.8.2003 (Core)
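To compare stalls like this one with the original report, the length of each stall can be read out of the log. The Python sketch below assumes the bracketed `+3.001808` is the time in seconds since the previous log line (as in a relative-time dmesg listing); the function name `lockd_stalls` is hypothetical.

```python
import re

# The two log lines quoted in the note above.
LOG = """\
[Nov16 04:24] lockd: server storage.nfs not responding, still trying
[ +3.001808] lockd: server storage.nfs OK
"""

NOT_RESPONDING = re.compile(r"lockd: server (\S+) not responding")
RECOVERED = re.compile(r"\[\s*\+(\d+\.\d+)\]\s+lockd: server (\S+) OK")


def lockd_stalls(text):
    """Return (server, seconds) for each 'OK' line directly after a stall line.

    Assumption: the '+N' value is the delta to the immediately preceding line,
    so only adjacent 'not responding' / 'OK' pairs are measured.
    """
    stalls = []
    stalled_server = None
    for line in text.splitlines():
        recovered = RECOVERED.search(line)
        if recovered and recovered.group(2) == stalled_server:
            stalls.append((stalled_server, float(recovered.group(1))))
        started = NOT_RESPONDING.search(line)
        stalled_server = started.group(1) if started else None
    return stalls


for server, seconds in lockd_stalls(LOG):
    print(f"{server}: stalled for {seconds:.3f} s")
```

Under that assumption, the quoted stall lasted about 3 seconds.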
@oscarf: please update your system and if you can reproduce the issue when running kernel-3.10.0-1160.6.1.el7.x86_64 and the current versions of the other relevant packages ( rpcbind, nfs etc ) let us know. | |
s/setting/seeing/ I will try to reproduce it. It occurs sporadically on some workloads so I doubt I will be successful. |
I confirmed that this bug is present on kernel 3.10.0-1127.el7.x86_64. | |
@oscarf: unless you plan to purchase from RedHat a subscription for their Extended Update Support, please update to CentOS 7.9 and retry. 7.8.2003 ceased to receive any form of support after the release of 7.9.2009 | |
@ManuelWolfshant: thanks, we will probably do that. | |
Date Modified | Username | Field | Change |
---|---|---|---|
2017-08-04 15:06 | sobrique | New Issue | |
2017-08-04 15:06 | sobrique | Tag Attached: nfs | |
2017-08-04 15:20 | sobrique | Note Added: 0029791 | |
2017-08-04 15:55 | toracat | Category | nfs-utils => kernel |
2017-08-09 12:50 | sobrique | Note Added: 0029820 | |
2017-08-14 09:18 | sobrique | Note Added: 0029858 | |
2020-11-19 13:50 | oscarf | Note Added: 0037939 | |
2020-11-19 16:59 | ManuelWolfshant | Note Added: 0037944 | |
2020-11-20 08:09 | oscarf | Note Added: 0037964 | |
2020-11-20 12:18 | oscarf | Note Added: 0037965 | |
2020-11-20 12:27 | ManuelWolfshant | Note Added: 0037966 | |
2020-11-20 12:30 | oscarf | Note Added: 0037967 |