View Issue Details

ID: 0006021
Project: CentOS-5
Category: kernel
View Status: public
Last Update: 2012-10-30 08:52
Reporter: villettejp
Priority: normal
Severity: minor
Reproducibility: always
Status: new
Resolution: open
Platform: X86_64
OS: CentOS
OS Version: 5.8
Product Version: 5.8
Target Version:
Fixed in Version:
Summary: 0006021: initcwnd does not work properly
Description: I tried to increase the initial TCP congestion window used during the slow start of TCP connections.

I used ip route commands such as:
ip route change 192.168.1.0/24 dev eth0 proto static scope link src 192.168.1.27 initcwnd 35
ip route change default via 192.168.1.1 dev eth0 proto static initcwnd 35

"ip route show" gives the following after these settings:
192.168.1.0/24 dev eth0 proto static scope link src 192.168.1.27 initcwnd 35
169.254.0.0/16 dev eth0 scope link
default via 192.168.1.1 dev eth0 proto static initcwnd 35

The problem was first seen with HTTP (Apache), but I reproduced it with a simple program I wrote to read the cwnd.

The first time I run this server, snd_cwnd stays at 2.
Without changing anything else, the second time I run it, snd_cwnd is 35, as I set it with ip route.

This seems to be linked to the Linux routing cache: if I run "ip route flush cache" between every run of my application, snd_cwnd stays at 2.

Any idea?

This seems to be solved in CentOS 6. I am currently using this kernel:

Linux localhost.localdomain 2.6.18-308.16.1.el5 #1 SMP Tue Oct 2 22:01:43 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
Tags: No tags attached.

Activities

villettejp

2012-10-30 08:52

reporter   ~0015990

Detailed explanation:
 
The initcwnd (snd_cwnd) parameter does not work properly in RHEL 5 Update 8 (kernel 2.6.18-308.16.1.el5).
 
I am talking about the initial congestion window used at the very beginning of a TCP connection. This window controls the amount of data a server sends before it has to wait for an ACK from the client. It sets the starting point of slow start, the period during which the server limits itself to a part of the client's TCP receive window as the amount of data it can safely send.
 
With this kernel version, the default size is 2 or 3 MSS. The MSS is the maximum segment size; on Ethernet networks it is 1460 bytes.
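
As a worked example of these units, here is a minimal sketch that converts an initial window in segments into bytes, assuming the 1460-byte Ethernet MSS mentioned above. The function names are made up for this illustration.

```c
#include <stdio.h>

/* Bytes a sender may have in flight before the first ACK arrives:
 * initial congestion window (in segments) times the segment size. */
unsigned long first_flight_bytes(unsigned cwnd_segments, unsigned mss_bytes)
{
    return (unsigned long)cwnd_segments * mss_bytes;
}

/* Print the first-flight size for the window values used in this report. */
void print_first_flight_table(void)
{
    const unsigned mss = 1460;                 /* assumed Ethernet MSS */
    const unsigned windows[] = { 2, 3, 35, 40 };
    size_t i;

    for (i = 0; i < sizeof windows / sizeof windows[0]; i++)
        printf("initcwnd %2u -> %lu bytes\n",
               windows[i], first_flight_bytes(windows[i], mss));
}
```

With these assumptions, a window of 2 allows 2920 bytes before the first ACK, while a window of 40 allows 58400 bytes.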
 
On large networks, as the RTT (round-trip time between clients and servers) increases, it can be worthwhile to increase this size so that servers send more data before waiting for the clients' replies. See Google's work on this:
http://www.cdnplanet.com/blog/tune-tcp-initcwnd-for-optimum-performance/
https://developers.google.com/speed/articles/tcp_initcwnd_paper.pdf
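
To illustrate why the initial window matters more as the RTT grows, the sketch below counts the round trips needed to deliver a response under an idealized slow start. It assumes the window doubles every round trip, with no loss, no delayed ACKs and no receive-window limit; real connections will behave less neatly. The function name is hypothetical.

```c
/* Round trips needed to deliver `bytes` under idealized slow start
 * (window doubles each round trip; no loss, no receiver limit). */
unsigned slow_start_round_trips(unsigned long bytes, unsigned initcwnd,
                                unsigned mss)
{
    unsigned long window = (unsigned long)initcwnd * mss; /* bytes this round */
    unsigned long sent = 0;
    unsigned rounds = 0;

    if (window == 0)
        return 0;           /* degenerate input: nothing can be sent */

    while (sent < bytes) {
        sent += window;     /* send one full window */
        window *= 2;        /* a fully ACKed round doubles the window */
        rounds++;
    }
    return rounds;
}
```

Under these assumptions, a 100000-byte response with a 1460-byte MSS takes 5 round trips with an initial window of 3 and 2 round trips with an initial window of 40. Each round trip saved is worth one full RTT of latency.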
 
In RHEL 5 Update 8, as the Google documentation says, it is possible to set a value other than the default 2 or 3 MSS, using the "ip route" command.
 
For example:

[root@localhost villette]# ip route show
192.168.1.0/24 dev eth0 proto static scope link src 192.168.1.30
169.254.0.0/16 dev eth0 scope link
default via 192.168.1.1 dev eth0
 
To change this for the first line, that is for local traffic, I use the command:
 
ip route change 192.168.1.0/24 dev eth0 proto static scope link src 192.168.1.30 initcwnd 40
 
The result is:
 
[root@localhost villette]# ip route show
192.168.1.0/24 dev eth0 proto static scope link src 192.168.1.30 initcwnd 40
169.254.0.0/16 dev eth0 scope link
default via 192.168.1.1 dev eth0
 
Now the slow start initial congestion window is set to 40 MSS.
 
To check this, I used a modified program. The modification only adds a function that reads the snd_cwnd value from the socket structure. This value is the slow start congestion window, in MSS.
 
The modification to my application consists of these lines:
 
  tcp_info_length = sizeof(tcp_info);
  if ( getsockopt( fd,
                   SOL_TCP,
                   TCP_INFO,
                   (void *) &tcp_info,
                   (socklen_t *) &tcp_info_length ) != 0 ) {
    fprintf(stderr, "Can not get the socket options!\n");
    exit(EXIT_FAILURE);
  }
   
  fprintf(stderr, " snd_cwnd %u ssthresh snd %u rcv %u\n",
          tcp_info.tcpi_snd_cwnd,
          tcp_info.tcpi_snd_ssthresh,
          tcp_info.tcpi_rcv_ssthresh);
  fprintf(stderr, " rtt %u rttvar %u retrans %u\n",
          tcp_info.tcpi_rtt,
          tcp_info.tcpi_rttvar,
          tcp_info.tcpi_retrans);
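
For anyone who wants to repeat the measurement, here is a self-contained sketch of the same TCP_INFO query on Linux. It is not the original test program: loopback_snd_cwnd() is a hypothetical helper that connects to itself over 127.0.0.1 so the example needs no second machine. IPPROTO_TCP is used as the option level (it has the same value as SOL_TCP), and _GNU_SOURCE is defined so that glibc exposes struct tcp_info.

```c
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Read snd_cwnd (in segments) of a connected TCP socket.
 * Returns 0 on success, -1 on failure (*cwnd is then left untouched). */
int read_snd_cwnd(int fd, unsigned *cwnd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    memset(&info, 0, sizeof(info));
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) != 0)
        return -1;
    *cwnd = info.tcpi_snd_cwnd;
    return 0;
}

/* Hypothetical demo helper: open a listener on loopback, connect to it,
 * and report the window of the accepted (server-side) socket. */
int loopback_snd_cwnd(unsigned *cwnd)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);
    int lfd, cfd = -1, sfd = -1, rc = -1;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a free port */

    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) != 0 ||
        listen(lfd, 1) != 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) != 0)
        goto out;

    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) != 0)
        goto out;

    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0)
        goto out;

    rc = read_snd_cwnd(sfd, cwnd);
out:
    if (sfd >= 0)
        close(sfd);
    if (cfd >= 0)
        close(cfd);
    close(lfd);
    return rc;
}
```

Calling loopback_snd_cwnd(&cwnd) from main() and printing cwnd shows the window for a loopback connection only. To reproduce this report, read_snd_cwnd() would have to be called on a socket accepted from a remote client whose route carries the initcwnd setting.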
 
The problem is that this initcwnd setting works in RHEL 6 Update 3 and in Fedora 17, but not in RHEL 5 Update 8.
 
On RHEL 5 Update 8, I used kernel 2.6.18-308.16.1.el5. My first test always used a snd_cwnd of 2, even though I had set a different value in the route table with "ip route" (as described at the beginning of this document). The strange part comes with the second test, a second connection from the same client: it used the correct snd_cwnd of 40.
 
I ran a large number of tests and found that this behavior depends on the contents of the route cache. If the entries describing the connection between my server and my client already exist in the route cache, it works. If not, it does not.
 
For example, before my first test, there is no entry for my connection:
 
[root@localhost TCPSERV-4STATS]# ip route show | grep 86
[root@localhost TCPSERV-4STATS]#
 
My first test used a snd_cwnd of 2, not 40 as requested.
 
[root@localhost TCPSERV-4STATS]# ip route show
...
local 192.168.1.30 from 192.168.1.86 dev lo src 192.168.1.30
    cache <local,src-direct> iif eth0
192.168.1.86 from 192.168.1.30 dev eth0
...
 
Without doing anything else, my second test with the same program used a snd_cwnd of 40, the value I requested.
 
This is confirmed by the following test: if I flush the cache before every run of my program, it always uses a snd_cwnd of 2. To flush the cache I used the command "ip route flush cache".
 
 
I looked at the kernel source code to see how this is handled and to find out why I get this behavior. First, snd_cwnd (the slow start initial congestion window) is set by the function tcp_init_cwnd() in tcp_input.c:
 
/* Numbers are taken from RFC2414. */
__u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst)
{
        __u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) : 0);
 
        if (!cwnd) {
                if (tp->mss_cache > 1460)
                        cwnd = 2;
                else
                        cwnd = (tp->mss_cache > 1095) ? 3 : 4;
        }
        return min_t(__u32, cwnd, tp->snd_cwnd_clamp);
}
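
The selection logic quoted above can be tried outside the kernel with a small user-space model. This is only a sketch: model_init_cwnd() is a made-up name, and plain integers stand in for the dst metric, tp->mss_cache and tp->snd_cwnd_clamp.

```c
#include <stdint.h>

/* User-space model of the quoted tcp_init_cwnd() logic (not kernel code).
 * route_initcwnd: the route's initcwnd metric, 0 when unset.
 * mss:            the connection's cached MSS in bytes.
 * clamp:          upper bound on the window (snd_cwnd_clamp). */
uint32_t model_init_cwnd(uint32_t route_initcwnd, uint32_t mss, uint32_t clamp)
{
    uint32_t cwnd = route_initcwnd;

    if (cwnd == 0) {                    /* no metric: fall back to defaults */
        if (mss > 1460)
            cwnd = 2;
        else
            cwnd = (mss > 1095) ? 3 : 4;
    }
    return cwnd < clamp ? cwnd : clamp; /* never exceed the clamp */
}
```

For a 1460-byte MSS with no metric this model returns 3, and with initcwnd 40 it returns 40. It never returns 2 for an Ethernet-sized MSS, so the snd_cwnd of 2 observed on the first connection suggests that tcp_init_cwnd() was not reached at all and the socket kept its starting value.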
 
 
This function is first called from tcp_init_metrics() in the same file:
 
/* Initialize metrics on socket. */
 
static void tcp_init_metrics(struct sock *sk)
{
        struct tcp_sock *tp = tcp_sk(sk);
        struct dst_entry *dst = __sk_dst_get(sk);
 
        if (dst == NULL)
                goto reset;
 
        dst_confirm(dst);
 
        if (dst_metric_locked(dst, RTAX_CWND))
                tp->snd_cwnd_clamp = dst_metric(dst, RTAX_CWND);
        if (dst_metric(dst, RTAX_SSTHRESH)) {
                tp->snd_ssthresh = dst_metric(dst, RTAX_SSTHRESH);
                if (tp->snd_ssthresh > tp->snd_cwnd_clamp)
                        tp->snd_ssthresh = tp->snd_cwnd_clamp;
        }
        if (dst_metric(dst, RTAX_REORDERING) &&
            tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
                tp->rx_opt.sack_ok &= ~2;
                tp->reordering = dst_metric(dst, RTAX_REORDERING);
        }
 
        if (dst_metric(dst, RTAX_RTT) == 0)
                goto reset;
 
        if (!tp->srtt && dst_metric(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
                goto reset;
 
        /* Initial rtt is determined from SYN,SYN-ACK.
         * The segment is small and rtt may appear much
         * less than real one. Use per-dst memory
         * to make it more realistic.
         *
         * A bit of theory. RTT is time passed after "normal" sized packet
         * is sent until it is ACKed. In normal circumstances sending small
         * packets force peer to delay ACKs and calculation is correct too.
         * The algorithm is adaptive and, provided we follow specs, it
         * NEVER underestimate RTT. BUT! If peer tries to make some clever
         * tricks sort of "quick acks" for time long enough to decrease RTT
         * to low value, and then abruptly stops to do it and starts to delay
         * ACKs, wait for troubles.
         */
        if (dst_metric(dst, RTAX_RTT) > tp->srtt) {
                tp->srtt = dst_metric(dst, RTAX_RTT);
                tp->rtt_seq = tp->snd_nxt;
        }
        if (dst_metric(dst, RTAX_RTTVAR) > tp->mdev) {
                tp->mdev = dst_metric(dst, RTAX_RTTVAR);
                tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
        }
        tcp_set_rto(sk);
        tcp_bound_rto(sk);
        if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp)
                goto reset;
        tp->snd_cwnd = tcp_init_cwnd(tp, dst);
        tp->snd_cwnd_stamp = tcp_time_stamp;
        return;
 
reset:
        /* Play conservative. If timestamps are not
         * supported, TCP will fail to recalculate correct
         * rtt, if initial rto is too small. FORGET ALL AND RESET!
         */
        if (!tp->rx_opt.saw_tstamp && tp->srtt) {
                tp->srtt = 0;
                tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
                inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
        }
}
 
As you can see, the snd_cwnd* assignments are in the last lines before the reset label. To reach them, execution must first pass this test earlier in the function:
 
        if (dst_metric(dst, RTAX_RTT) == 0)
                goto reset;
 
The RTT metric must be set in order to reach the cwnd lines, which is never the case for a first connection.
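
The observed sequence (2 on the first connection, 40 on the second, 2 again after a cache flush) can be reproduced with a toy model of one route cache entry. This is a sketch, not kernel code: the structure and function names are hypothetical, only the RTT test is modelled, and it assumes the kernel records a measured RTT in the cache entry once a connection has finished, which would be consistent with the tests above.

```c
/* Hypothetical stand-in for one route cache entry (not a kernel structure). */
struct cache_entry {
    unsigned initcwnd;  /* copied from the route, e.g. 40 */
    unsigned rtt;       /* 0 until an RTT has been recorded */
};

enum { SOCKET_DEFAULT_CWND = 2 };   /* value observed when initcwnd is ignored */

/* Model one connection: return the snd_cwnd it starts with, then record
 * an RTT in the entry (assumption: the kernel saves a measured RTT when
 * the connection ends). */
unsigned model_connection(struct cache_entry *e, unsigned measured_rtt)
{
    unsigned cwnd = SOCKET_DEFAULT_CWND;

    if (e->rtt != 0)            /* the "RTAX_RTT == 0 -> goto reset" test */
        cwnd = e->initcwnd;
    e->rtt = measured_rtt;      /* later connections will see this RTT */
    return cwnd;
}

/* Model "ip route flush cache": a fresh entry starts without an RTT. */
void model_flush(struct cache_entry *e)
{
    e->rtt = 0;
}
```

Starting from an entry { 40, 0 }, the first call to model_connection() returns 2 and the second returns 40. Calling model_flush() before the next connection brings the result back to 2.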
 
The solutions:
 
1/ Modify the kernel to move the two cwnd lines before the RTT test, so that the assignment happens in every case:
 
        tp->snd_cwnd = tcp_init_cwnd(tp, dst);
        tp->snd_cwnd_stamp = tcp_time_stamp;
 
        if (dst_metric(dst, RTAX_RTT) == 0)
                goto reset;
 
Of course, you will have to rebuild the kernel if you choose this solution.
 
2/ Set an rtt value as well when you change initcwnd with the "ip route" command:
 
# ip route change 192.168.1.0/24 dev eth0 proto static scope link src 192.168.1.30 initcwnd 40 rtt 100
 
In this case, the new initcwnd will be used in every case.
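
The effect of both solutions on a first connection can be compared with a simplified model. This is a sketch under stated assumptions: only the "RTT metric == 0" test is modelled, the other conditions in tcp_init_metrics() are ignored, route_initcwnd is assumed to be non-zero, and all names are hypothetical.

```c
/* Which statement order is being modelled. */
enum ordering { ORIGINAL_ORDER, PATCHED_ORDER };

/* Window a first connection starts with.
 * route_rtt is the rtt value set on the route, 0 when none is set. */
unsigned first_connection_cwnd(enum ordering order,
                               unsigned route_initcwnd,
                               unsigned route_rtt)
{
    unsigned cwnd = 2;              /* window observed when initcwnd is ignored */

    if (order == PATCHED_ORDER)
        cwnd = route_initcwnd;      /* solution 1: assign before the RTT test */

    if (route_rtt == 0)
        return cwnd;                /* "goto reset": nothing further is applied */

    return route_initcwnd;          /* RTT present (solution 2): normal path */
}
```

In this model, the original order with no rtt gives 2, while either the patched order or an rtt set on the route gives 40.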

Issue History

Date Modified Username Field Change
2012-10-16 12:50 villettejp New Issue
2012-10-30 08:52 villettejp Note Added: 0015990