View Issue Details

ID: 0002635 | Project: CentOS-5 | Category: kernel | View Status: public | Last Update: 2008-07-03 08:06
Reporter: greno2
Priority: normal | Severity: major | Reproducibility: always
Status: resolved | Resolution: fixed
Product Version: 5.1
Target Version: (not set) | Fixed in Version: 5.2
Summary: 0002635: Bad performance running kernel 2.6.18-53.1.6.el5
Description: After upgrading several boxes from 2.6.18-53.1.4.el5 to 2.6.18-53.1.6.el5, we are experiencing poor performance, at least when accessing NFS shares.
After downgrading back to 2.6.18-53.1.4.el5, everything works smoothly.
Additional Information: We identified NFS as the source because network traffic almost doubled (it looks like caching is broken or missing, or something else is wrong).

Mount options:
rw,rsize=8192,wsize=8192,soft,nfsvers=3,udp,nocto,actimeo=1,noatime
Tags: No tags attached.

Relationships

has duplicate 0002682 (closed, kbsingh@karan.org): Huge increase in NFS client traffic and ops after kernel upgrade

Activities

toracat

2008-01-28 15:51

manager   ~0006768

Is this possibly related to the problem reported on the mailing list?

http://lists.centos.org/pipermail/centos/2008-January/093336.html
greno2

2008-01-28 17:52

reporter   ~0006769

Seems to match.
How can I validate the following theory from Bent?
"The patches that gives us problems, results in a kernel which makes
something like 2000 times more "NFS V3 LOOKUP Call" and "NFS V3 LOOKUP
Reply" than without."

What information can I provide? We can run further tests on one of the systems without degrading service, because it's a cluster :-)
range

2008-01-28 18:12

administrator   ~0006770

You can check that with nfsstat, which will give you statistics about NFS.

So run it with the new kernel for a while, then run it with the old kernel and compare the outputs.
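A minimal sketch of that comparison, assuming the text layout nfsstat printed in this report: a section title ending in a colon, followed by alternating lines of counter names and values, where values may carry percentage columns. It parses two snapshots (for example, `nfsstat -c` output captured before and after a workload) and subtracts them per counter. The function names `parse_nfsstat` and `delta` are illustrative, not part of nfs-utils.

```python
import re


def parse_nfsstat(text):
    """Parse nfsstat-style text into {section: {counter: count}}.

    Assumes the layout seen in this report: a section title ending in ':',
    then alternating lines of counter names and counter values. Value lines
    may mix raw counts with percentage columns such as '93%'.
    """
    sections = {}
    current = None
    names = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):
            # New section, e.g. "Client rpc stats:" or "Client nfs v3:".
            current = sections.setdefault(line[:-1], {})
            names = None
            continue
        tokens = line.split()
        if all(re.fullmatch(r"\d+%?", tok) for tok in tokens):
            # Value line: keep raw counts, drop the percentage columns.
            counts = [int(tok) for tok in tokens if not tok.endswith("%")]
            if current is not None and names is not None:
                current.update(zip(names, counts))
            names = None
        else:
            # Header line listing counter names.
            names = tokens
    return sections


def delta(before, after):
    """Per-counter difference (after - before) for every counter in 'after'."""
    return {
        section: {name: value - before.get(section, {}).get(name, 0)
                  for name, value in counters.items()}
        for section, counters in after.items()
    }
```

Subtracting a "before" snapshot from an "after" snapshot keeps counts accumulated since boot from skewing the comparison between kernels.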
greno2

2008-01-28 18:34

reporter   ~0006772

Each sample covers 180 seconds with nearly the same amount of production traffic (the load on .4.el5 may even have been slightly higher):

running 2.6.18-53.1.6.el5
Client rpc stats:
calls retrans authrefrsh
3624091 9 0

Client nfs v3:
null getattr setattr lookup access readlink
0 0% 56392 1% 57 0% 3373622 93% 191045 5% 5 0%
read write create mkdir symlink mknod
1768 0% 1130 0% 3 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
6 0% 0 0% 0 0% 0 0% 0 0% 1 0%
fsstat fsinfo pathconf commit
0 0% 4 0% 0 0% 57 0%

running 2.6.18-53.1.4.el5
calls retrans authrefrsh
80315 22 0

Client nfs v3:
null getattr setattr lookup access readlink
0 0% 36510 45% 70 0% 6775 8% 31778 39% 7 0%
read write create mkdir symlink mknod
3676 4% 1359 1% 22 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
36 0% 0 0% 0 0% 0 0% 0 0% 1 0%
fsstat fsinfo pathconf commit
0 0% 4 0% 0 0% 76 0%
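To put the two 180-second samples side by side, the sketch below divides the 53.1.6 counts by the 53.1.4 counts for the counters that changed most. The numbers are copied from the output above; the variable and function names are illustrative. With these figures the LOOKUP count is roughly 500 times higher and the total RPC call count roughly 45 times higher, with the caveat that the two windows carried only approximately the same load.

```python
# Counters copied from the two 180-second samples above.
new_kernel = {"calls": 3624091, "getattr": 56392, "lookup": 3373622, "access": 191045}  # 2.6.18-53.1.6.el5
old_kernel = {"calls": 80315, "getattr": 36510, "lookup": 6775, "access": 31778}        # 2.6.18-53.1.4.el5


def ratios(new, old):
    """Return new/old for every counter that is present and non-zero in 'old'."""
    return {name: new[name] / old[name] for name in new if old.get(name)}


if __name__ == "__main__":
    for name, ratio in sorted(ratios(new_kernel, old_kernel).items()):
        print(f"{name:8s} {ratio:8.1f}x")
```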
range

2008-01-28 18:56

administrator   ~0006774

Can someone with access to an RHEL machine confirm these findings? If so, this issue should be reported upstream.
toracat

2008-01-30 23:52

manager   ~0006789

At least one person has reported the same NFS issue on the Scientific Linux mailing list:

http://listserv.fnal.gov/scripts/wa.exe?A2=ind0801&L=scientific-linux-devel&T=0&P=5427
JohnnyHughes

2008-01-31 15:42

administrator   ~0006792

I tried and cannot reproduce this on i686 (with CentOS kernels); we are testing on x86_64 now.
toracat

2008-01-31 16:37

manager   ~0006793

I could confirm the problem by:

- using an x86_64 *client*
- copying many small files rather than copying a single huge file
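The small-file workload is what exposes the difference: in the excerpt that follows, LOOKUP dominates the client calls on 53.1.6, whereas a copy of one huge file is presumably dominated by READ traffic. A minimal sketch of such a workload, assuming Python 3; the note does not say which files were copied or how many, so the file count, file size, and directory layout here are hypothetical. It creates many small files and times a recursive copy of them.

```python
import os
import shutil
import tempfile
import time


def make_small_files(directory, count, size=1024):
    """Create 'count' files of 'size' bytes each in 'directory'."""
    os.makedirs(directory, exist_ok=True)
    payload = b"x" * size
    for i in range(count):
        with open(os.path.join(directory, f"file{i:05d}.dat"), "wb") as fh:
            fh.write(payload)


def timed_copy(src, dst):
    """Copy a directory tree and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    shutil.copytree(src, dst)
    return time.perf_counter() - start


if __name__ == "__main__":
    # For a real test, replace the temporary directory with a path on the
    # NFS mount under test (hypothetical; the report does not give one) and
    # run once per kernel, snapshotting 'nfsstat -c' before and after.
    with tempfile.TemporaryDirectory() as base:
        src = os.path.join(base, "many_small")
        make_small_files(src, 2000)
        elapsed = timed_copy(src, os.path.join(base, "copy"))
        print(f"copied 2000 small files in {elapsed:.2f}s")
```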

nfsstat output (excerpt):

<client = 2.6.18-53.1.6.el5>

Server rpc stats:
calls badcalls badauth badclnt xdrcall
247729 0 0 0 0

Server nfs v3:
null getattr setattr lookup access readlink
4 0% 11 0% 0 0% 48 0% 11 0% 0 0%
read write create mkdir symlink mknod
247641 99% 0 0% 0 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
0 0% 0 0% 0 0% 0 0% 0 0% 4 0%
fsstat fsinfo pathconf commit
3 0% 5 0% 0 0% 0 0%

Client rpc stats:
calls retrans authrefrsh
1429815 0 0

Client nfs v3:
null getattr setattr lookup access readlink
0 0% 80 0% 0 0% 969151 67% 25971 1% 0 0%
read write create mkdir symlink mknod
431023 30% 0 0% 0 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
0 0% 0 0% 0 0% 0 0% 0 0% 3551 0%
fsstat fsinfo pathconf commit
16 0% 18 0% 0 0% 0 0%

<client = 2.6.18-53.1.4.el5>

Client rpc stats:
calls retrans authrefrsh
151500 0 0

Client nfs v3:
null getattr setattr lookup access readlink
0 0% 51445 33% 0 0% 5 0% 25712 16% 0 0%
read write create mkdir symlink mknod
70805 46% 0 0% 0 0% 0 0% 0 0% 0 0%
remove rmdir rename link readdir readdirplus
0 0% 0 0% 0 0% 0 0% 0 0% 3530 2%
fsstat fsinfo pathconf commit
0 0% 2 0% 0 0% 0 0%
JohnnyHughes

2008-01-31 18:35

administrator   ~0006794

I have now reproduced this issue on i686 as well, using many small files, and an upstream bug has been filed:

https://bugzilla.redhat.com/show_bug.cgi?id=431092
toracat

2008-01-31 18:36

manager   ~0006795

The output I posted is somewhat skewed. I ran the 53.1.4 test on a freshly booted kernel (so the nfsstat call count was nearly zero before the run), but I did not do the same for the 53.1.6 run. Thanks to cap_ for pointing this out. Below are results that should be more accurate.

<client uname -rm = 2.6.18-53.1.6.el5 x86_64>

Client rpc stats: calls 1066478 ('after' minus 'before' the test run)

real 3m1.672s
user 0m0.412s
sys 0m15.829s

<client uname -rm = 2.6.18-53.1.4.el5 x86_64>

Client rpc stats: calls 147258 ('after' minus 'before' the test run)

real 0m58.729s
user 0m0.361s
sys 0m7.578s
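The before/after figures above can be reduced to ratios. A minimal sketch, assuming the `real` and `sys` strings use the `NmS.SSSs` format that bash's `time` keyword prints; the helper name `bash_time_to_seconds` is illustrative. With these numbers the 53.1.6 client made about 7.2 times as many RPC calls and took about 3.1 times as long in wall-clock time.

```python
import re


def bash_time_to_seconds(value):
    """Convert a bash 'time' field such as '3m1.672s' to seconds."""
    m = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value)
    if m is None:
        raise ValueError(f"unrecognised time value: {value!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


# Figures copied from the x86_64 client test above.
runs = {
    "2.6.18-53.1.6.el5": {"calls": 1066478, "real": "3m1.672s", "sys": "0m15.829s"},
    "2.6.18-53.1.4.el5": {"calls": 147258, "real": "0m58.729s", "sys": "0m7.578s"},
}

if __name__ == "__main__":
    new, old = runs["2.6.18-53.1.6.el5"], runs["2.6.18-53.1.4.el5"]
    print("RPC calls ratio:  ", round(new["calls"] / old["calls"], 2))
    for field in ("real", "sys"):
        ratio = bash_time_to_seconds(new[field]) / bash_time_to_seconds(old[field])
        print(f"{field} time ratio: ", round(ratio, 2))
```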
toracat

2008-02-02 20:02

manager   ~0006801

When going from 53.1.4 to 53.1.6, five NFS patches were added. I rebuilt the 53.1.6 kernel five times, omitting a different one of those patches each time. The performance issue went away when ANY ONE of them was removed.

Akemi
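The procedure described above is a leave-one-out test: one build per patch, each applying every patch except that one. A minimal sketch of the build matrix follows; the patch names are hypothetical placeholders, since the note does not list the five patches. Because removing any single patch made the slowdown disappear, the result is consistent with the regression arising from the patches in combination rather than from one patch alone.

```python
# Hypothetical placeholders: the note does not name the five NFS patches.
PATCHES = ["nfs-patch-1", "nfs-patch-2", "nfs-patch-3", "nfs-patch-4", "nfs-patch-5"]


def leave_one_out(patches):
    """Yield (omitted_patch, patches_to_apply) for each single-patch exclusion."""
    for i, omitted in enumerate(patches):
        yield omitted, patches[:i] + patches[i + 1:]


if __name__ == "__main__":
    for omitted, applied in leave_one_out(PATCHES):
        print(f"build without {omitted}: apply {', '.join(applied)}")
```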
atinivelli

2008-02-05 11:02

reporter   ~0006816

We are experiencing the same problem on RHEL 5 and CentOS 5.1 with a NetApp filer.
toracat

2008-02-06 18:13

manager   ~0006825

There is a patch posted in the upstream bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=431092

which appears to fix the issue (at least in the one test I ran). However, the patch cannot be applied to the current (-53.x.x) kernel without being backported. It may be included in 5.2, but it has not been committed yet.

Akemi
toracat

2008-02-06 22:14

manager   ~0006826

OK, a test kernel with the aforementioned patch has been made available upstream. My test shows it performs like the 53.1.4 kernel. Can someone else test this kernel?

========================================================================
Comment #30 of the upstream bugzilla

 in 2.6.18-78.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5
JohnnyHughes

2008-02-13 15:12

administrator   ~0006882

This bug has been moved to the following upstream report:

https://bugzilla.redhat.com/show_bug.cgi?id=321111

CentOS test kernels that fix this issue are also available here:

http://people.centos.org/~hughesjr/kernel/5/
greno2

2008-02-17 09:55

reporter   ~0006903

I can confirm that the test kernel 2.6.18-53.1.13.el5.bz321111 works for us (x86_64) and fixes this issue.
voog

2008-02-20 18:09

reporter   ~0006913

I have tried the 2.6.18-53.1.13 kernel and found that the times were better but still double what they were in the 2.6.18-53.1.4 kernel. nfsstat shows the following:
 2.6.18-53.1.13.el5
Client rpc stats:
calls retrans authrefrsh
1505353 0 0

 2.6.18-53.1.4.el5
Client rpc stats:
calls retrans authrefrsh
150085 0 0


It would seem that the .13 kernel is still sending significantly more nfs traffic.
toracat

2008-02-20 18:22

manager   ~0006914

Please note that the distro kernel 2.6.18-53.1.13 does NOT include any fix for the NFS problem discussed here. It is the centosplus kernel that has the patch compiled in.
stevefalco

2008-03-01 02:19

reporter   ~0006964

I can confirm that kernel-2.6.18-53.1.13.el5.bz321111 fixes this bug for me. I have an x86_64 Dell 1850 server and had noticed that a compile job that normally took 16 minutes had gone up to 37 minutes. The job makes heavy use of NFS. Once I installed the bz321111 kernel, the run time of the job went back to normal.
catselbow

2008-03-05 16:20

reporter   ~0006982

Just to clarify:
What's the difference between the "hughesjr" kernels here:

http://people.centos.org/~hughesjr/kernel/5/i386/

and the centosplus kernels here:

http://altruistic.lbl.gov/mirrors/centos/5.1/centosplus/i386/RPMS/

Do they both contain the NFS fix?
toracat

2008-05-11 03:55

manager   ~0007247

A new kernel (2.6.18-53.1.19) has come out, and once again hughesjr has made patched kernels available at:

http://people.centos.org/hughesjr/kernel/5/

I have tested the x86_64 kernel (kernel-2.6.18-53.1.19.el5.bz321111.x86_64.rpm) and confirmed that it does not have the NFS problem reported in this bug tracker.

Akemi
toracat

2008-05-11 11:43

manager   ~0007248

Tested and confirmed the new centosplus kernel (kernel-2.6.18-53.1.19.el5.centos.plus.x86_64.rpm) also works fine.

Akemi
JohnnyHughes

2008-05-11 11:49

administrator   ~0007249

@catselbow:

With respect to the NFS fixes, there is no difference.

However, the centosplus kernel has a lot of other options turned on (support for extra hardware and file system types, etc.), while the bz321111 kernel is the main kernel with ONLY the NFS fix.
catselbow

2008-05-21 20:34

reporter   ~0007289

Does the just-released CentOS 2.6.18-53.1.21 kernel contain
a fix for this problem? I don't see it in the list of fixed
bugs here:

http://lwn.net/Articles/283354/
toracat

2008-05-21 20:56

manager   ~0007290

2.6.18-53.1.21 does not appear to have the fix. However, 2.6.18-92 apparently includes the NFS patch: I just ran some tests and it did not show the performance issue reported here. This is expected to be the kernel used in 5.2, which was released upstream today. As soon as I get hold of the SRPM, I will run a test.

Akemi
toracat

2008-05-22 04:50

manager   ~0007291

Just confirmed the nfs bug has been fixed in the upstream 5.2 kernel (2.6.18-92). When CentOS 5.2 is out, this bug tracker can be closed (finally!).
catselbow

2008-05-22 14:54

reporter   ~0007312

Any chance of a patched hughesjr kernel for 2.6.18-53.1.21 until 5.2 comes out? Thanks a million for the previous patched kernels.
tru

2008-05-25 20:19

administrator   ~0007336

Courtesy builds available at http://people.centos.org/tru/kernel+bz321111.

They were not built on the CentOS build hosts (which are busy building 5.2) but on my chrooted build hosts; that should make no difference.
toracat

2008-05-25 20:23

manager   ~0007337

I just installed the kernel offered by tru, and my test run confirmed it is free of the NFS issue.
toracat

2008-05-27 15:57

manager   ~0007359

Supposedly this is the last test I should be running for this nfs issue. I've got the *CentOS* 5.2 kernel 2.6.18-92.el5 and confirmed this has fixed the problem. Time to finally close the bug report. :-D
toracat

2008-07-03 00:19

manager   ~0007556

Now marked "fixed". Please change the status to "Resolved".

Issue History

Date Modified Username Field Change
2008-01-27 14:38 greno2 New Issue
2008-01-28 15:51 toracat Note Added: 0006768
2008-01-28 17:52 greno2 Note Added: 0006769
2008-01-28 18:12 range Note Added: 0006770
2008-01-28 18:12 range Status new => acknowledged
2008-01-28 18:34 greno2 Note Added: 0006772
2008-01-28 18:56 range Note Added: 0006774
2008-01-30 23:52 toracat Note Added: 0006789
2008-01-31 15:42 JohnnyHughes Note Added: 0006792
2008-01-31 16:37 toracat Note Added: 0006793
2008-01-31 18:35 JohnnyHughes Note Added: 0006794
2008-01-31 18:36 toracat Note Added: 0006795
2008-02-02 20:02 toracat Note Added: 0006801
2008-02-05 11:02 atinivelli Note Added: 0006816
2008-02-06 18:13 toracat Note Added: 0006825
2008-02-06 22:14 toracat Note Added: 0006826
2008-02-10 16:45 kbsingh@karan.org Category -OTHER => kernel
2008-02-10 16:45 kbsingh@karan.org Product Version 5.0 - x86_64 => 5.1
2008-02-13 15:12 JohnnyHughes Note Added: 0006882
2008-02-15 11:30 range Relationship added has duplicate 0002682
2008-02-17 09:55 greno2 Note Added: 0006903
2008-02-20 18:09 voog Note Added: 0006913
2008-02-20 18:22 toracat Note Added: 0006914
2008-03-01 02:19 stevefalco Note Added: 0006964
2008-03-05 16:20 catselbow Note Added: 0006982
2008-05-11 03:55 toracat Note Added: 0007247
2008-05-11 11:43 toracat Note Added: 0007248
2008-05-11 11:49 JohnnyHughes Note Added: 0007249
2008-05-21 20:34 catselbow Note Added: 0007289
2008-05-21 20:56 toracat Note Added: 0007290
2008-05-22 04:50 toracat Note Added: 0007291
2008-05-22 14:54 catselbow Note Added: 0007312
2008-05-25 20:19 tru Note Added: 0007336
2008-05-25 20:23 toracat Note Added: 0007337
2008-05-27 15:57 toracat Note Added: 0007359
2008-07-03 00:19 toracat Note Added: 0007556
2008-07-03 00:19 toracat Resolution open => fixed
2008-07-03 00:19 toracat Fixed in Version => 5.2
2008-07-03 08:06 timverhoeven Status acknowledged => resolved