View Issue Details

ID: 0006275
Project: CentOS-6
Category: kernel
View Status: public
Last Update: 2013-03-05 23:53
Reporter: zazichi
Priority: urgent
Severity: crash
Reproducibility: sometimes
Status: new
Resolution: open
Platform: x86_64
OS: CentOS
OS Version: 6.3
Product Version: 6.3
Target Version:
Fixed in Version:
Summary: 0006275: Kernel Panic on NFSD
Description:
      KERNEL: /usr/lib/debug/lib/modules/2.6.32-279.22.1.el6.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2013-02-28-16:54:21/vmcore [PARTIAL DUMP]
        CPUS: 8
        DATE: Thu Feb 28 16:53:16 2013
      UPTIME: 13:49:09
LOAD AVERAGE: 12.67, 12.24, 12.19
       TASKS: 637
    NODENAME: storagex02.ethz.ch
     RELEASE: 2.6.32-279.22.1.el6.x86_64
     VERSION: #1 SMP Wed Feb 6 03:10:46 UTC 2013
     MACHINE: x86_64 (2399 Mhz)
      MEMORY: 32 GB
       PANIC: "Oops: 0000 [#1] SMP " (check log for details)
         PID: 2595
     COMMAND: "nfsd"
        TASK: ffff880469186080 [THREAD_INFO: ffff88046939e000]
         CPU: 1
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 2595 TASK: ffff880469186080 CPU: 1 COMMAND: "nfsd"
#0 [ffff88046939f780] machine_kexec at ffffffff81031f7b
#1 [ffff88046939f7e0] crash_kexec at ffffffff810b8e72
#2 [ffff88046939f8b0] oops_end at ffffffff814eda70
#3 [ffff88046939f8e0] no_context at ffffffff81042a0b
#4 [ffff88046939f930] __bad_area_nosemaphore at ffffffff81042c95
#5 [ffff88046939f980] bad_area_nosemaphore at ffffffff81042d63
#6 [ffff88046939f990] __do_page_fault at ffffffff810434c1
#7 [ffff88046939fab0] do_page_fault at ffffffff814efa4e
#8 [ffff88046939fae0] page_fault at ffffffff814ece05
    [exception RIP: get_page+14]
    RIP: ffffffff81126c7e RSP: ffff88046939fb90 RFLAGS: 00010286
    RAX: ffffffff81b15020 RBX: 0000000000000000 RCX: 0000000000000001
    RDX: 0000000000001000 RSI: 0000000000010430 RDI: 0000000000000000
    RBP: ffff88046939fba0 R8: 000000000000000e R9: 0000000000000000
    R10: 0000000000000000 R11: 000000000000000e R12: ffff880872290700
    R13: ffff88046b5031c0 R14: 0000000000000000 R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff88046939fba8] tcp_sendpage at ffffffff8146d8a5
#10 [ffff88046939fc58] kernel_sendpage at ffffffff81415388
#11 [ffff88046939fc98] svc_send_common at ffffffffa02e109d [sunrpc]
#12 [ffff88046939fd18] svc_sendto at ffffffffa02e1172 [sunrpc]
#13 [ffff88046939fe28] svc_tcp_sendto at ffffffffa02e1389 [sunrpc]
#14 [ffff88046939fe58] svc_send at ffffffffa02ec80b [sunrpc]
#15 [ffff88046939fe98] svc_process at ffffffffa02dee00 [sunrpc]
#16 [ffff88046939feb8] nfsd at ffffffffa0372b62 [nfsd]
#17 [ffff88046939fee8] kthread at ffffffff81090876
#18 [ffff88046939ff48] kernel_thread at ffffffff8100c0ca
crash>
Steps To Reproduce: Run NFS over GPFS with AFM active. When activity becomes high, the load jumps to its limit (around 100%), then the trace shown above appears and the server reboots.
Tags: No tags attached.

Activities

tru

2013-02-28 20:24

administrator   ~0016563

I don't think that we could even try to replicate your issue: we don't have access to IBM GPFS (with/without Active File Management).
I would suggest that you contact your software provider for support on this case.
zazichi

2013-03-04 09:41

reporter   ~0016590

I opened a case with IBM, but they told me that this is an NFS bug. I can try to bring them into the loop. Could this be a problem related to the NFS server and client running on the same server?
toracat

2013-03-04 16:11

manager   ~0016591

Possibly related to:

https://access.redhat.com/knowledge/solutions/109263
toracat

2013-03-04 16:20

manager   ~0016592

A similar issue was reported for EL5.8:

https://bugzilla.redhat.com/show_bug.cgi?id=814626

A workaround in comment #49 may be worth a try.
rmsppu

2013-03-05 23:10

reporter   ~0016604

I'm seeing what appears to be the same issue: kernel panics on nfsd when
using GPFS after system activity produces a high load.

This occurs on NFS servers running CentOS 5.9 with kernels 308.16.1.el5.x86_64
and 348.1.1.el5, with GPFS 3.5.0-7 and 3.5.0-8.

Here's a backtrace from the latest dump:
------------------------------------------------------------------------------
crash> bt
PID: 23649 TASK: ffff8107d9850860 CPU: 8 COMMAND: "nfsd"
 #0 [ffff8107dfb5b140] crash_kexec at ffffffff800b09a0
 #1 [ffff8107dfb5b200] __die at ffffffff80065137
 #2 [ffff8107dfb5b240] do_page_fault at ffffffff80067484
 #3 [ffff8107dfb5b330] error_exit at ffffffff8005dde9
    [exception RIP: vfs_getattr+23]
    RIP: ffffffff8000e4ba RSP: ffff8107dfb5b3e0 RFLAGS: 00010282
    RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff81077b8111c0
    RDX: ffff8107dfb5b730 RSI: ffff810765b88228 RDI: ffff8107ffc15680
    RBP: ffff810765b88228 R8: ffff8107dfb5b834 R9: ffff810a138a01a8
    R10: 0000000000000000 R11: ffffffff8012dd30 R12: ffff8107dfb5b730
    R13: ffff8107ffc15680 R14: 000000000000000d R15: 0000000000800000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #4 [ffff8107dfb5b408] nfsd4_encode_fattr at ffffffff887b8126 [nfsd]
 #5 [ffff8107dfb5b7f8] nfsd4_encode_dirent at ffffffff887ba6d7 [nfsd]
 #6 [ffff8107dfb5b858] cxiFillDir at ffffffff888b37da [mmfslinux]
 #7 [ffff8107dfb5b8a8] gpfs_f_readdir at ffffffff888d50a3 [mmfslinux]
 #8 [ffff8107dfb5b978] vfs_readdir at ffffffff80035178
 #9 [ffff8107dfb5b9b8] nfsd_readdir at ffffffff887a8fa2 [nfsd]
#10 [ffff8107dfb5b9f8] nfsd4_encode_operation at ffffffff887b9faa [nfsd]
#11 [ffff8107dfb5ba48] nfsd4_proc_compound at ffffffff887b4fc0 [nfsd]
#12 [ffff8107dfb5bea8] __down_read at ffffffff80064624
#13 [ffff8107dfb5bee8] nfsd at ffffffff887a5770 [nfsd]
#14 [ffff8107dfb5bf48] kernel_thread at ffffffff8005dfb1
crash>
------------------------------------------------------------------------------
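
To compare backtraces such as the one above, the frame lines of crash "bt" output can be reduced to function and module names. The following is a minimal Python sketch, not part of the original report: the regular expression and the helper names parse_bt and modules_in_trace are assumptions made for illustration, and frames without a bracketed module name are labeled "kernel" here.

```python
import re

# Matches crash "bt" frame lines such as:
#   #4 [ffff8107dfb5b408] nfsd4_encode_fattr at ffffffff887b8126 [nfsd]
# The trailing "[module]" part is optional (absent for core kernel symbols).
FRAME_RE = re.compile(
    r"^\s*#(?P<idx>\d+)\s+\[(?P<sp>[0-9a-f]+)\]\s+(?P<func>\S+)"
    r"\s+at\s+(?P<addr>[0-9a-f]+)(?:\s+\[(?P<module>[^\]]+)\])?\s*$"
)


def parse_bt(text):
    """Return (frame index, function, module) tuples for each frame line."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:  # register dumps and "[exception RIP: ...]" lines are skipped
            module = m.group("module") or "kernel"
            frames.append((int(m.group("idx")), m.group("func"), module))
    return frames


def modules_in_trace(frames):
    """Return the distinct modules in the order they first appear."""
    seen = []
    for _, _, module in frames:
        if module not in seen:
            seen.append(module)
    return seen


# A few lines copied from the backtrace above.
SAMPLE = """\
 #3 [ffff8107dfb5b330] error_exit at ffffffff8005dde9
    [exception RIP: vfs_getattr+23]
 #4 [ffff8107dfb5b408] nfsd4_encode_fattr at ffffffff887b8126 [nfsd]
 #6 [ffff8107dfb5b858] cxiFillDir at ffffffff888b37da [mmfslinux]
"""

frames = parse_bt(SAMPLE)
for idx, func, module in frames:
    print(idx, func, module)
print(modules_in_trace(frames))
```

Run against both dumps in this report, a listing like this shows at a glance which modules (sunrpc, nfsd, mmfslinux) appear in each trace.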

We are using a max_block_size of 1MB (which I will try reducing).

We are using TCP (I will try changing clients to UDP).

Four clients are running CentOS5.9 (2.6.18-348.1.1.el5.x86_64), one is
running RHEL4.9 (2.6.9-34.ELsmp x86_64) -- soon to be retired.

The CentOS clients mount via NFSv4 with the options:
    timeo=11,retrans=4,hard,intr,rw,bg,noatime,nodiratime,rsize=32768,wsize=32768

The antique^H^H^H^H^H^H^H RHEL4 client mounts using NFSv3 with:
    hard,intr,rw,bg,nordirplus,noatime,nodiratime,proto=tcp,rsize=32768,wsize=32768
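
To see where the two sets of mount options differ, the option strings can be split and compared. This is a small illustrative Python sketch, not something from the original setup: the helper name parse_opts is hypothetical, and flag-style options with no value are stored as True.

```python
def parse_opts(option_string):
    """Split a comma-separated mount option string into a dict.

    Options of the form key=value keep their value as a string;
    bare flags such as "hard" are stored as True.
    """
    opts = {}
    for item in option_string.split(","):
        key, sep, value = item.partition("=")
        opts[key] = value if sep else True
    return opts


# Option strings copied from the two client configurations above.
v4_opts = parse_opts(
    "timeo=11,retrans=4,hard,intr,rw,bg,noatime,nodiratime,"
    "rsize=32768,wsize=32768"
)
v3_opts = parse_opts(
    "hard,intr,rw,bg,nordirplus,noatime,nodiratime,proto=tcp,"
    "rsize=32768,wsize=32768"
)

# Options present on only one side.
only_v4 = sorted(set(v4_opts) - set(v3_opts))
only_v3 = sorted(set(v3_opts) - set(v4_opts))
print("only on the NFSv4 clients:", only_v4)
print("only on the NFSv3 client:", only_v3)
```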

I haven't yet reported it to IBM.

[As a side note for GPFS users, 3.5.0-8 has a bug in the calculation
of netmasks after a CNFS failover, which effectively takes the active
server off the network. According to IBM, this will be fixed in the GPFS
3.5.0.9 PTF, which will be released on 3/28/2013. Contact me directly
for details and a work-around.]
toracat

2013-03-05 23:35

manager   ~0016605

@rmsppu

Do you have an email address that you can share publicly? Or can you offer some other way to contact you directly?
rmsppu

2013-03-05 23:53

reporter   ~0016606

Yes:
     centos -@- merctech.com

(after posting my note I tried to change my preferences to make my address visible...that doesn't seem to be possible).

Issue History

Date Modified     Username  Field                Change
2013-02-28 16:50  zazichi   New Issue
2013-02-28 20:24  tru       Note Added: 0016563
2013-03-04 09:41  zazichi   Note Added: 0016590
2013-03-04 16:11  toracat   Note Added: 0016591
2013-03-04 16:20  toracat   Note Added: 0016592
2013-03-05 23:10  rmsppu    Note Added: 0016604
2013-03-05 23:35  toracat   Note Added: 0016605
2013-03-05 23:53  rmsppu    Note Added: 0016606