View Issue Details

ID: 0013639
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2020-11-20 12:30
Reporter: sobrique
Priority: normal
Severity: minor
Reproducibility: sometimes
Status: new
Resolution: open
Platform: x86_64
OS: Centos
OS Version: 7.3
Product Version: 7.3.1611
Target Version:
Fixed in Version:
Summary: 0013639: lockd SO_REUSEADDR causes 'lockd: nfs server X not responding, still trying' if the remote host closes the connection with a FIN
Description: The problem doesn't occur in the 3.10.0-327 kernels (installed with CentOS 7.2) but does on 3.10.0-514.*; we have confirmed it applies to 514.10.2, 514.21.1, 514.21.2 and 514.26.2.

This is for NFS v3 over TCP.

Specifically we have a network trace of our clients receiving a FIN from the remote server (an Isilon NFS server/BSD based) and tearing down the connection.

The client then attempts to reuse the same source port to talk to the same IP/port tuple on the remote.

This doesn't get any response (and causes TCP retransmits).


23 93.895312607 10.0.77.7 -> 10.0.72.93 TCP 66 304 > 739 [FIN, ACK] Seq=361 Ack=2365 Win=28688 Len=0 TSval=3324674743 TSecr=3023764526
 24 93.895365165 10.0.72.93 -> 10.0.77.7 TCP 66 739 > 304 [FIN, ACK] Seq=2365 Ack=362 Win=58 Len=0 TSval=3023858350 TSecr=3324674743
 25 93.895537170 10.0.77.7 -> 10.0.72.93 TCP 66 304 > 739 [ACK] Seq=362 Ack=2366 Win=28688 Len=0 TSval=3324674743 TSecr=3023858350
 26 119.999069392 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Port numbers reused] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023884454 TSecr=0 WS=512
 27 121.001045228 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023885456 TSecr=0 WS=512
 28 123.005029586 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023887460 TSecr=0 WS=512
 29 127.017070461 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023891472 TSecr=0 WS=512
 30 135.033082431 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023899488 TSecr=0 WS=512
 31 151.065047976 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023915520 TSecr=0 WS=512
 32 183.129124247 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023947584 TSecr=0 WS=512
 33 183.129296560 10.0.77.7 -> 10.0.72.93 TCP 74 304 > 739 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=1460 WS=64 SACK_PERM=1 TSval=1289537099 TSecr=3023947584
 34 183.129406460 10.0.72.93 -> 10.0.77.7 TCP 66 739 > 304 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=3023947584 TSecr=1289537099


This would _appear_ to be down to the SO_REUSEPORT option being added in net/sunrpc/xprtsock.c in the 514 kernel.

This behaviour doesn't appear to occur with the 327 kernel, which uses a new source port number after a FIN.
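
As an illustration of the two client behaviours described above, here is a userspace Python sketch. It is not the kernel code in net/sunrpc/xprtsock.c; the helper names are hypothetical, and it assumes a Linux host where Python exposes socket.SO_REUSEPORT. One helper lets the kernel pick a fresh source port for every socket (what the 327 kernel appears to do after a FIN); the other pins the source port and sets SO_REUSEPORT so the same port can be bound again (what the 514 kernel appears to do).

```python
import socket


def new_ephemeral_socket():
    """Client socket with a fresh, kernel-chosen source port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))  # port 0: let the kernel choose
    return s


def pinned_port_socket(port):
    """Client socket bound to a given source port with SO_REUSEPORT set.

    Passing port 0 lets the kernel pick the port the first time.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    return s


# Pinned: the second socket binds the very same source port.
a = pinned_port_socket(0)
port = a.getsockname()[1]
b = pinned_port_socket(port)
print("pinned source ports:", port, b.getsockname()[1])

# Ephemeral: each socket gets its own source port.
x, y = new_ephemeral_socket(), new_ephemeral_socket()
print("ephemeral source ports:", x.getsockname()[1], y.getsockname()[1])

for s in (a, b, x, y):
    s.close()
```

With a pinned port, a reconnect to the same server address and port produces exactly the same 4-tuple as the connection that was just closed, which is the situation shown in frame 26 of the trace.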

The remote server seems to correctly go into TIME_WAIT state, but then discards incoming traffic when the port is reused, until the timeout hits.

This causes client stalls with the error:

lockd: nfs server X not responding, still trying

This recovers normally shortly after the successful retransmit and 3-way handshake.
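
The length of the stall can be read directly off the trace above. A minimal Python sketch, using the SYN timestamps copied from frames 26 to 32, the SYN-ACK time from frame 33 and the server FIN time from frame 23; the function name syn_backoff is only illustrative:

```python
# Timestamps (seconds) of the initial SYN and its retransmissions,
# copied from frames 26-32 of the trace above.
SYN_TIMES = [119.999069392, 121.001045228, 123.005029586, 127.017070461,
             135.033082431, 151.065047976, 183.129124247]
SYN_ACK_TIME = 183.129296560    # frame 33
SERVER_FIN_TIME = 93.895312607  # frame 23


def syn_backoff(times):
    """Return the gaps between consecutive SYNs, rounded to whole seconds."""
    return [round(b - a) for a, b in zip(times, times[1:])]


gaps = syn_backoff(SYN_TIMES)
stall = SYN_ACK_TIME - SYN_TIMES[0]
since_fin = SYN_ACK_TIME - SERVER_FIN_TIME

print("retransmit gaps (s):", gaps)
print("first SYN to SYN-ACK: %.1f s" % stall)
print("server FIN to SYN-ACK: %.1f s" % since_fin)
```

In this trace the gaps double from 1 s to 32 s, so the client was blocked for about 63 s, and the server only answered a SYN that arrived roughly 89 s after it had sent its FIN.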


Steps To Reproduce: This is somewhat difficult, because it relies on a remote NFS server sending a FIN packet to the CentOS client.

My Isilon does this periodically, a few times a day. _If_ a new lock request is then made within the TIME_WAIT period on the remote (CLOSE_WAIT on the client), the system will TCP-retransmit and stall until the timeout is reached, as the remote is discarding the packets (as it should, because of TIME_WAIT).
Additional Information: It seems likely that patch 5798061 is related, as this introduces the SO_REUSEPORT flag into net/sunrpc/xprtsock.c.

My best guess would be that the handling of client CLOSE_WAIT isn't correctly taking account of whether SO_REUSEPORT is set, and is therefore attempting to reuse the port when it shouldn't.
Tags: nfs

Activities

sobrique

2017-08-04 15:20

reporter   ~0029791

This might be better tagged as 'kernel' rather than 'nfs-utils' - it seems to be kernel level nfs-client behaviour.
sobrique

2017-08-09 12:50

reporter   ~0029820

I can reproduce this on a CentOS 7.3 'clean' build desktop running 3.10.0-514.26.2, mounting an NFS directory from an Isilon NFS server running OneFS 8.0.0.4 (it may be reproducible with other storage systems).

The test case is:

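#!/usr/bin/env perl
# Repeatedly take and release an exclusive lock on a file in the NFS mount,
# warning whenever a lock or unlock call takes more than 10 seconds.
use strict;
use warnings;

use Fcntl qw ( :flock );

while (1) {
    open( my $testfile, '>', "test_lock.out" ) or die $!;

    # Time the lock request.
    my $start = time();
    print $start,"\n";
    if ( flock( $testfile, LOCK_EX ) ) {
        #print "$$ got lock\n";
    }
    else {
        print "$$ didn't lock?\n";
    }

    if ( time() > $start + 10 ) {
        print "WARNING: LOCK TOOK TOO LONG\n";
    }

    # Hold the lock for 100 seconds before releasing it.
    sleep 100;

    # Time the unlock request in the same way.
    $start = time();
    if ( flock( $testfile, LOCK_UN ) ) {
        #print "$$ unlocked\n";
    }
    else {
        print "$$ error unlocking\n";
    }
    if ( time() > $start + 10 ) {
        print "WARNING: UNLOCK TOOK TOO LONG\n";
    }

    close($testfile);

    # Wait 100 seconds before the next lock attempt.
    sleep 100;
}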


Run this whilst running two packet captures: a 'general' one (focussing on the lockd port) and one _just_ watching for FIN packets:

tshark -i em1 -f "port 304 and tcp[tcpflags] & tcp-fin != 0"


When the FIN packet is received, there's a high chance of seeing a TCP retransmit as outlined above whilst the server transitions through TIME_WAIT.
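
To spot these stalls in a saved capture, the text output of the general capture can be scanned for a SYN that follows a FIN between the same two endpoints. A rough Python sketch, assuming tshark's one-line summary format as shown in the trace earlier in this report; the regular expression and the function name find_stalls are my own and may need adjusting for other tshark versions. The sample lines are frames 23, 26, 27 and 33 from that trace.

```python
import re

# frame no, time, src, dst, "TCP", length, optional [note], sport > dport, [flags]
LINE = re.compile(
    r"^\s*(\d+)\s+([\d.]+)\s+(\S+)\s+->\s+(\S+)\s+TCP\s+\d+\s+"
    r"(?:\[[^\]]*\]\s+)?(\d+)\s+>\s+(\d+)\s+\[([^\]]*)\]"
)

SAMPLE = """\
23 93.895312607 10.0.77.7 -> 10.0.72.93 TCP 66 304 > 739 [FIN, ACK] Seq=361 Ack=2365 Win=28688 Len=0 TSval=3324674743 TSecr=3023764526
 26 119.999069392 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Port numbers reused] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023884454 TSecr=0 WS=512
 27 121.001045228 10.0.72.93 -> 10.0.77.7 TCP 74 [TCP Retransmission] 739 > 304 [SYN, ECN, CWR] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=3023885456 TSecr=0 WS=512
 33 183.129296560 10.0.77.7 -> 10.0.72.93 TCP 74 304 > 739 [SYN, ACK] Seq=0 Ack=1 Win=65535 Len=0 MSS=1460 WS=64 SACK_PERM=1 TSval=1289537099 TSecr=3023947584
"""


def find_stalls(lines):
    """Yield (fin_time, first_syn_time, syn_ack_time) per reconnect."""
    last_fin = {}    # endpoint pair -> time of the most recent FIN
    first_syn = {}   # endpoint pair -> time of the first SYN after it
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        t = float(m.group(2))
        # Direction-independent key: both (ip, port) endpoints.
        pair = frozenset([(m.group(3), m.group(5)), (m.group(4), m.group(6))])
        flags = [f.strip() for f in m.group(7).split(",")]
        if "FIN" in flags:
            last_fin[pair] = t
            first_syn.pop(pair, None)
        elif "SYN" in flags and "ACK" not in flags:
            first_syn.setdefault(pair, t)   # ignore retransmitted SYNs
        elif "SYN" in flags and pair in first_syn:
            yield last_fin.get(pair), first_syn.pop(pair), t


stalls = list(find_stalls(SAMPLE.splitlines()))
for fin, syn, syn_ack in stalls:
    print("FIN at %s, SYN at %.3f, answered after %.1f s" % (fin, syn, syn_ack - syn))
```

A long gap between the first SYN and the SYN-ACK on a reused port pair is the signature to look for; a healthy reconnect is answered almost immediately.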

I therefore would suggest the problem is within the client side of the lockd RPC mechanism, around TCP port reuse.
sobrique

2017-08-14 09:18

reporter   ~0029858

https://access.redhat.com/solutions/3018371
oscarf

2020-11-19 13:50

reporter   ~0037939

Is it known if this bug can occur in kernel 3.10.0-1127.19.1.el7.x86_64?

We are setting these ones every now and then:

[Nov16 04:24] lockd: server storage.nfs not responding, still trying
[ +3.001808] lockd: server storage.nfs OK

$ cat /etc/centos-release
CentOS Linux release 7.8.2003 (Core)
ManuelWolfshant

2020-11-19 16:59

manager   ~0037944

@oscarf: please update your system and if you can reproduce the issue when running kernel-3.10.0-1160.6.1.el7.x86_64 and the current versions of the other relevant packages ( rpcbind, nfs etc ) let us know.
oscarf

2020-11-20 08:09

reporter   ~0037964

s/setting/seeing/

I will try to reproduce it. It occurs sporadically on some workloads so I doubt I will be successful.
oscarf

2020-11-20 12:18

reporter   ~0037965

I confirmed that this bug is present on kernel 3.10.0-1127.el7.x86_64.
ManuelWolfshant

2020-11-20 12:27

manager   ~0037966

@oscarf: unless you plan to purchase from RedHat a subscription for their Extended Update Support, please update to CentOS 7.9 and retry. 7.8.2003 ceased to receive any form of support after the release of 7.9.2009
oscarf

2020-11-20 12:30

reporter   ~0037967

@ManuelWolfshant: thanks, we will probably do that.

Issue History

Date Modified     Username         Field                Change
2017-08-04 15:06  sobrique         New Issue
2017-08-04 15:06  sobrique         Tag Attached: nfs
2017-08-04 15:20  sobrique         Note Added: 0029791
2017-08-04 15:55  toracat          Category             nfs-utils => kernel
2017-08-09 12:50  sobrique         Note Added: 0029820
2017-08-14 09:18  sobrique         Note Added: 0029858
2020-11-19 13:50  oscarf           Note Added: 0037939
2020-11-19 16:59  ManuelWolfshant  Note Added: 0037944
2020-11-20 08:09  oscarf           Note Added: 0037964
2020-11-20 12:18  oscarf           Note Added: 0037965
2020-11-20 12:27  ManuelWolfshant  Note Added: 0037966
2020-11-20 12:30  oscarf           Note Added: 0037967