View Issue Details

ID: 0006021    Project: CentOS-5    Category: kernel    View Status: public    Last Update: 2012-10-30 08:52
Status: new    Resolution: open
Platform: X86_64    OS: CentOS    OS Version: 5.8
Product Version: 5.8
Target Version:    Fixed in Version:
Summary: 0006021: initcwnd does not work properly
Description: I tried to extend the initial TCP congestion window used during the slow start of TCP connections.

I used ip route commands like:
ip route change dev eth0 proto static scope link src initcwnd 35
ip route change default via dev eth0 proto static initcwnd 35

"ip route show" output after these settings: dev eth0 proto static scope link src initcwnd 35 dev eth0 scope link
default via dev eth0 proto static initcwnd 35

This problem was initially seen using the HTTP protocol (Apache), but I reproduced it using a simple program I created to measure this cwnd.

The first time I run this server, the snd_cwnd stays at 2.
Without changing anything else, the second time I run this server, the snd_cwnd is 35, as I set it using ip route.

This seems to be linked to the Linux routing cache, because if I run an "ip route flush cache" between each run of my application, the snd_cwnd stays at 2.

Any ideas?

This seems to be fixed in CentOS 6. I am currently using this kernel:

Linux localhost.localdomain 2.6.18-308.16.1.el5 #1 SMP Tue Oct 2 22:01:43 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
Tags: No tags attached.




2012-10-30 08:52   reporter   ~0015990

Detailed explanation:
The initcwnd (snd_cwnd) parameter does not work properly in RHEL 5 Update 8 (kernel 2.6.18-308.16.1.el5).
I am talking about the initial congestion window used at the very beginning of a TCP connection. This window controls the amount of data a server sends before it waits for an ACK from the client. It applies only during the slow-start period, during which the server uses part of the client's TCP receive window as a limit on how much data it can safely send.
With this kernel version, the window defaults to 2 or 3 MSS. The MSS is the maximum segment size; on Ethernet networks the MSS is 1460 bytes.
On large networks, when the RTT (round-trip time between clients and servers) increases, it can be worthwhile to increase this window so servers send more data before waiting for the clients' replies. See Google's work on this subject.
In RHEL 5 Update 8, as the Google documentation says, it is possible to set a value other than the default 2 or 3 MSS. We use the "ip route" command for that.
For example :

[root@localhost villette]# ip route show dev eth0 proto static scope link src dev eth0 scope link
default via dev eth0
To change this for the first line, i.e. the local traffic, I use the command:
ip route change dev eth0 proto static scope link src initcwnd 40
The result is:
[root@localhost villette]# ip route show dev eth0 proto static scope link src initcwnd 40 dev eth0 scope link
default via dev eth0
Now the slow-start initial congestion window is set to 40 MSS.
To verify this, I used a customized program. The only modification is the addition of a function that reads the snd_cwnd value from the socket structure. This value is the slow-start initial congestion window, in MSS units.
The relevant lines of my application are:
  tcp_info_length = sizeof(tcp_info);
  if ( getsockopt( fd, IPPROTO_TCP, TCP_INFO,
                   (void *) &tcp_info,
                   (socklen_t *) &tcp_info_length ) != 0 ) {
    fprintf(stderr, "Can not get the socket options!\n");
  } else {
    fprintf(stderr, " snd_cwnd %u ssthresh snd %u rcv %u\n",
            tcp_info.tcpi_snd_cwnd, tcp_info.tcpi_snd_ssthresh,
            tcp_info.tcpi_rcv_ssthresh);
    fprintf(stderr, " rtt %u rttvar %u retrans %u\n",
            tcp_info.tcpi_rtt, tcp_info.tcpi_rttvar,
            tcp_info.tcpi_retrans);
  }
The problem is: this initcwnd modification really works in RHEL 6 Update 3 and in Fedora 17, but not in RHEL 5 Update 8.
On that last version, I used kernel 2.6.18-308.16.1.el5. My first test always used a snd_cwnd of 2, even though I set a different value in the routing table with "ip route" (as described at the beginning of this document). But the strange part starts with the second test, using a second connection from the same client: this second test used a snd_cwnd set to the correct value, 40.
I ran a large number of tests and found that this behavior depends on the route cache contents. If the entries describing the connection between my server and my client already exist in the route cache, it works. If not, it does not.
For example, before my first test, there is no entry for my connection:
[root@localhost TCPSERV-4STATS]# ip route show grep 86
[root@localhost TCPSERV-4STATS]#
My first test used a snd_cwnd of 2, not 40 as requested.
[root@localhost TCPSERV-4STATS]# ip route show
local from dev lo src
    cache <local,src-direct> iif eth0 from dev eth0
Without doing anything else, my second test with the same program used a snd_cwnd of 40, the value I requested.
This is confirmed by the following test: if I flush the cache before every run of my program, it always uses a snd_cwnd of 2. To flush the cache I used the command "ip route flush cache".
I went through the kernel source code to see how this is handled and why I get this behavior. First, snd_cwnd (the slow-start initial congestion window) is set by the function tcp_init_cwnd() in tcp_input.c:
/* Numbers are taken from RFC2414. */
__u32 tcp_init_cwnd(struct tcp_sock *tp, struct dst_entry *dst)
{
        __u32 cwnd = (dst ? dst_metric(dst, RTAX_INITCWND) : 0);

        if (!cwnd) {
                if (tp->mss_cache > 1460)
                        cwnd = 2;
                else
                        cwnd = (tp->mss_cache > 1095) ? 3 : 4;
        }
        return min_t(__u32, cwnd, tp->snd_cwnd_clamp);
}
This function is first called from tcp_init_metrics(), in the same file:
/* Initialize metrics on socket. */
static void tcp_init_metrics(struct sock *sk)
{
        struct tcp_sock *tp = tcp_sk(sk);
        struct dst_entry *dst = __sk_dst_get(sk);

        if (dst == NULL)
                goto reset;

        if (dst_metric_locked(dst, RTAX_CWND))
                tp->snd_cwnd_clamp = dst_metric(dst, RTAX_CWND);
        if (dst_metric(dst, RTAX_SSTHRESH)) {
                tp->snd_ssthresh = dst_metric(dst, RTAX_SSTHRESH);
                if (tp->snd_ssthresh > tp->snd_cwnd_clamp)
                        tp->snd_ssthresh = tp->snd_cwnd_clamp;
        }
        if (dst_metric(dst, RTAX_REORDERING) &&
            tp->reordering != dst_metric(dst, RTAX_REORDERING)) {
                tp->rx_opt.sack_ok &= ~2;
                tp->reordering = dst_metric(dst, RTAX_REORDERING);
        }

        if (dst_metric(dst, RTAX_RTT) == 0)
                goto reset;

        if (!tp->srtt && dst_metric(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
                goto reset;

        /* Initial rtt is determined from SYN,SYN-ACK.
         * The segment is small and rtt may appear much
         * less than real one. Use per-dst memory
         * to make it more realistic.
         *
         * A bit of theory. RTT is time passed after "normal" sized packet
         * is sent until it is ACKed. In normal circumstances sending small
         * packets force peer to delay ACKs and calculation is correct too.
         * The algorithm is adaptive and, provided we follow specs, it
         * NEVER underestimate RTT. BUT! If peer tries to make some clever
         * tricks sort of "quick acks" for time long enough to decrease RTT
         * to low value, and then abruptly stops to do it and starts to delay
         * ACKs, wait for troubles.
         */
        if (dst_metric(dst, RTAX_RTT) > tp->srtt) {
                tp->srtt = dst_metric(dst, RTAX_RTT);
                tp->rtt_seq = tp->snd_nxt;
        }
        if (dst_metric(dst, RTAX_RTTVAR) > tp->mdev) {
                tp->mdev = dst_metric(dst, RTAX_RTTVAR);
                tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
        }
        if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp)
                goto reset;
        tp->snd_cwnd = tcp_init_cwnd(tp, dst);
        tp->snd_cwnd_stamp = tcp_time_stamp;
        return;

reset:
        /* Play conservative. If timestamps are not
         * supported, TCP will fail to recalculate correct
         * rtt, if initial rto is too small. FORGET ALL AND RESET!
         */
        if (!tp->rx_opt.saw_tstamp && tp->srtt) {
                tp->srtt = 0;
                tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
                inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
        }
}
As you can see in this last function, the snd_cwnd* assignments come in the last lines before the reset path. To reach those lines, the following condition near the beginning of the function must not trigger:
        if (dst_metric(dst, RTAX_RTT) == 0)
                goto reset;
The RTT metric must be defined in the route cache in order to reach the cwnd lines, which is never the case for a first connection.
Possible solutions:
1/ Modify the kernel to move the two cwnd lines before the test on the RTT, so the assignment happens in every case:
        tp->snd_cwnd = tcp_init_cwnd(tp, dst);
        tp->snd_cwnd_stamp = tcp_time_stamp;
        if (dst_metric(dst, RTAX_RTT) == 0)
                goto reset;
Of course, this solution requires rebuilding the kernel.
2/ Define the rtt metric when you change the initcwnd with the "ip route" command:
# ip route change dev eth0 proto static scope link src initcwnd 40 rtt 100
In this case, the new initcwnd will be used in every case.

Issue History

Date Modified Username Field Change
2012-10-16 12:50 villettejp New Issue
2012-10-30 08:52 villettejp Note Added: 0015990