CentOS Bug Tracker

ID: 0005988
Project: CentOS-6
Category: kernel
View Status: public
Date Submitted: 2012-10-01 16:03
Last Update: 2014-11-18 09:32
Reporter: matt0023
Priority: normal
Severity: major
Reproducibility: sometimes
Status: assigned
Resolution: no change required
Platform: x86_64
OS: Centos
OS Version: Centos 6.2
Product Version: 6.2
Target Version:
Fixed in Version:
Summary: 0005988: kernel divide by zero error in find_busiest_group
Description: We have several blade servers on an HP BladeSystem. Over the past 2 months we have experienced about 4 spontaneous reboots. Recently, while checking the vmcore files, we found that in each case the cause appears to be a divide-by-zero error in the scheduler, specifically in find_busiest_group. Searching via Google, we found this report on the Red Hat site: https://bugzilla.redhat.com/show_bug.cgi?id=644903

That report makes it look like this problem was resolved in 2.6.32-85, but it is clearly still affecting us.
Steps To Reproduce: We are running qemu-kvm hypervisors on these blades. They seem to crash spontaneously; system load was not abnormal at the time.
Additional Information: Initial output from crash (the domain name of the host has been changed to 'example.com'):


GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.el6.x86_64/vmlinux
    DUMPFILE: /var/crash/2012-09-28-23:58/vmcore [PARTIAL DUMP]
        CPUS: 24
        DATE: Fri Sep 28 23:25:05 2012
      UPTIME: 245 days, 10:59:38
LOAD AVERAGE: 0.10, 0.19, 0.24
       TASKS: 678
    NODENAME: blade01-04.las.example.com
     RELEASE: 2.6.32-220.el6.x86_64
     VERSION: #1 SMP Tue Dec 6 19:48:22 GMT 2011
     MACHINE: x86_64 (2399 Mhz)
      MEMORY: 48 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper"
        TASK: ffff8805fc6beb00 (1 of 24) [THREAD_INFO: ffff880bfc33c000]
         CPU: 15
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0 TASK: ffff8805fc6beb00 CPU: 15 COMMAND: "swapper"
 #0 [ffff8800282e3900] machine_kexec at ffffffff81031fcb
 #1 [ffff8800282e3960] crash_kexec at ffffffff810b8f72
 #2 [ffff8800282e3a30] oops_end at ffffffff814f0490
 #3 [ffff8800282e3a60] die at ffffffff8100f26b
 #4 [ffff8800282e3a90] do_trap at ffffffff814efd84
 #5 [ffff8800282e3af0] do_divide_error at ffffffff8100cfff
 #6 [ffff8800282e3b90] divide_error at ffffffff8100be7b
    [exception RIP: find_busiest_group+1477]
    RIP: ffffffff81054ad5 RSP: ffff8800282e3c40 RFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff8800282e3e64 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff8800282cf540 RDI: ffff8800282d5fc0
    RBP: ffff8800282e3dd0 R8: ffff8800282cf860 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffff01
    R13: 0000000000015fc0 R14: ffffffffffffffff R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #7 [ffff8800282e3dd8] rebalance_domains at ffffffff8105fc52
 #8 [ffff8800282e3ea8] run_rebalance_domains at ffffffff81060153
 #9 [ffff8800282e3ee8] __do_softirq at ffffffff81072161
#10 [ffff8800282e3f58] call_softirq at ffffffff8100c24c
#11 [ffff8800282e3f70] do_softirq at ffffffff8100de85
#12 [ffff8800282e3f90] irq_exit at ffffffff81071f45
#13 [ffff8800282e3fa0] smp_call_function_single_interrupt at ffffffff8102a255
#14 [ffff8800282e3fb0] call_function_single_interrupt at ffffffff8100bdb3
--- <IRQ stack> ---
#15 [ffff880bfc33ddb8] call_function_single_interrupt at ffffffff8100bdb3
    [exception RIP: intel_idle+222]
    RIP: ffffffff812c4a5e RSP: ffff880bfc33de68 RFLAGS: 00000202
    RAX: 0000000000000000 RBX: ffff880bfc33ded8 RCX: 0000000000000000
    RDX: 0000000000003697 RSI: 0000000000000000 RDI: 0000000000d54147
    RBP: ffffffff8100bdae R8: 0000000000000004 R9: 0000000000000898
    R10: 004b582e31f40a1b R11: ffff880bfc33de78 R12: ffff8800282f1040
    R13: ffff880bfc33de28 R14: ffffffff81094c32 R15: ffff880bfc33dde8
    ORIG_RAX: ffffffffffffff04 CS: 0010 SS: 0018
#16 [ffff880bfc33dee0] cpuidle_idle_call at ffffffff813f9f47
#17 [ffff880bfc33df00] cpu_idle at ffffffff81009e06


the 'log' output from crash also contains: divide error: 0000 [#1] SMP

The attached file is a tar archive containing the 'log' and 'dis' output from crash, as well as lspci -v.

Tags: No tags attached.
Attached Files: crash.files.tar.gz (21,156 bytes) 2012-10-01 16:03

- Relationships

-  Notes
(0015942)
kbsingh@karan.org (administrator)
2012-10-16 07:57

This is actually reported fixed upstream in 2.6.32-131.0.15 and above. Do you still see this problem?
(0015943)
toracat (developer)
2012-10-16 08:40

> UPTIME: 245 days

Does this happen to be related to the >208.5-day crash?

https://access.redhat.com/knowledge/solutions/68466
https://www.redhat.com/archives/rhelv6-list/2012-January/msg00010.html

"unnecessary overflow in sched_clock"
"kernel will crash after 209~250 days"
(0015946)
matt0023 (reporter)
2012-10-16 16:04

actually we are running 2.6.32-220.el6.x86_64 and definitely fell victim to this.

and I have to say, digging through mailing lists, release notes, and even some info on pastebin (!) it is not easy to divine in what kernel version this was fixed:

This shows up as a fix in 2.6.32-248.el6. Search for the Bug ID 785959 on these pages:

http://rpmfind.net/linux/RPM/centos/updates/6.3/x86_64/Packages/kernel-firmware-2.6.32-279.2.1.el6.noarch.html
https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/6.3_Technical_Notes/kernel.html#RHSA-2012-1156

with regard to CentOS it seems like it made it in as of 2.6.32-220.17.1.el6 in CentOSPlus tree:

http://zid-lux1.uibk.ac.at/linux/rpm2html/centos/6/centosplus/i386/Packages/kernel-doc-2.6.32-220.17.1.el6.centos.plus.noarch.html

Fri Mar 09 2012 Frantisek Hrbata <fhrbata(at)redhat.com> [2.6.32-220.10.1.el6]
[sched] Fix Kernel divide by zero panic in find_busiest_group() (Larry Woodman) [801718 785959]
So it was apparently in the Red Hat 2.6.32-220.10.1.el6 kernel, which explains why it was not in the 2.6.32-220.el6 kernel

@toracat...yes I meant to come back and update this report after we figured out we are being hit by the 208 day uptime bug. isn't there an old saying...you can't fit 64 bits of sugar in a 54-bit bag (?) ;-)

Thanks for the followup; we are working on upgrading our kernel to the latest rev. Someone may close out this ticket or I can, whichever is more appropriate.
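
The release-string comparison done by hand in the note above can be sketched in Python. This is only an illustration: the helper names are hypothetical, the threshold is the 2.6.32-220.10.1.el6 changelog entry quoted in the note, and the comparison keeps only the leading numeric fields rather than implementing RPM's real version ordering. Sorting at or after the changelog entry does not by itself prove that a given build is free of the crash.

```python
def kernel_release_key(release):
    """Turn '2.6.32-220.10.1.el6' into (2, 6, 32, 220, 10, 1).

    Simplification: only the leading numeric fields of the version and
    release parts are kept; this is not RPM's full comparison algorithm.
    """
    version, _, rest = release.partition("-")
    key = [int(part) for part in version.split(".")]
    for part in rest.split("."):
        if not part.isdigit():
            break  # stop at 'el6', 'centos', 'x86_64', ...
        key.append(int(part))
    return tuple(key)


# Changelog entry quoted in the note; used here as an assumed threshold.
FIX_CHANGELOG_ENTRY = kernel_release_key("2.6.32-220.10.1.el6")


def sorts_at_or_after_fix(release):
    """True if the release string sorts at or after the changelog entry."""
    return kernel_release_key(release) >= FIX_CHANGELOG_ENTRY


if __name__ == "__main__":
    for release in (
        "2.6.32-220.el6.x86_64",
        "2.6.32-220.4.1.el6.x86_64",
        "2.6.32-220.17.1.el6.centos.plus.x86_64",
        "2.6.32-279.2.1.el6",
    ):
        print(release, sorts_at_or_after_fix(release))
```

Python compares tuples element by element and treats a shorter prefix as smaller, which is why 2.6.32-220.el6 sorts before 2.6.32-220.10.1.el6 here.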
(0015947)
toracat (developer)
2012-10-16 16:30

Actually, the "208.5 day" patch was applied to the centosplus kernel 2.6.32-220.4.1.el6. The patch was then removed from 2.6.32-220.4.2.el6 because it was in the distro kernel.

Thanks for reporting back. I'll close this bug. Feel free to reopen if you need to add more comments.
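
As a quick check of where the "208.5 day" figure comes from: 2^54 nanoseconds works out to about 208.5 days, which fits the "54-bit bag" remark earlier in the thread. The sketch below only does that arithmetic; the 54-bit limit is taken from the thread and is an assumption here, not something read from the kernel source, and the function names are hypothetical.

```python
NS_PER_SECOND = 10**9
SECONDS_PER_DAY = 86400

# Assumed overflow point: a nanosecond counter limited to 54 bits.
OVERFLOW_NS = 2**54


def ns_to_days(ns):
    """Convert a nanosecond count to days."""
    return ns / NS_PER_SECOND / SECONDS_PER_DAY


def past_overflow(uptime_days):
    """True if an uptime in days exceeds the assumed 2**54 ns limit."""
    return uptime_days > ns_to_days(OVERFLOW_NS)


if __name__ == "__main__":
    print("%.2f days" % ns_to_days(OVERFLOW_NS))  # prints "208.50 days"
    # Uptimes (whole days) from the crash reports in this issue.
    for uptime in (245, 155, 390, 212):
        print(uptime, past_overflow(uptime))
```

Of the uptimes reported in this issue, 245, 390, and 212 days are past that point, while 155 days is not.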
(0015948)
kbsingh@karan.org (administrator)
2012-10-16 18:39

Re-opening as I investigate a potential corner case in the distro kernel.
(0016004)
fdisk (reporter)
2012-11-05 15:58

We have the same issue: several servers are rebooting unexpectedly.

All affected servers are using the same kernel:
2.6.32-220.17.1.el6.centos.plus.x86_64

Uptime before rebooting: 153 to 160 days

Our crash dump output follows:

GNU gdb (GDB) 7.0
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu"...

      KERNEL: usr/lib/debug/lib/modules/2.6.32-220.17.1.el6.centos.plus.x86_64/vmlinux
    DUMPFILE: vmcore [PARTIAL DUMP]
        CPUS: 12
        DATE: Tue Oct 30 22:16:35 2012
      UPTIME: 155 days, 10:46:00
LOAD AVERAGE: 0.38, 0.36, 0.29
       TASKS: 284
    NODENAME: Application-srv1
     RELEASE: 2.6.32-220.17.1.el6.centos.plus.x86_64
     VERSION: #1 SMP Wed May 16 05:20:13 BST 2012
     MACHINE: x86_64 (2933 Mhz)
      MEMORY: 24 GB
       PANIC: ""
         PID: 18699
     COMMAND: "Application"
        TASK: ffff88066ac33540 [THREAD_INFO: ffff880662822000]
         CPU: 5
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 18699 TASK: ffff88066ac33540 CPU: 5 COMMAND: "Application"
 #0 [ffff880662823ae0] machine_kexec at ffffffff8103214b
 #1 [ffff880662823b40] crash_kexec at ffffffff810b91c2
 #2 [ffff880662823c10] oops_end at ffffffff814f79a0
 #3 [ffff880662823c40] die at ffffffff8100f26b
 #4 [ffff880662823c70] do_trap at ffffffff814f7294
 #5 [ffff880662823cd0] do_divide_error at ffffffff8100cfff
 #6 [ffff880662823d70] divide_error at ffffffff8100be7b
    [exception RIP: thread_group_times+86]
    RIP: ffffffff81056a16 RSP: ffff880662823e28 RFLAGS: 00010046
    RAX: 7548b448fa87ea3c RBX: ffff88066b79f400 RCX: 00000000f306f704
    RDX: 0000000000000000 RSI: ffff880662823e28 RDI: 000e7c4d06af94cb
    RBP: ffff880662823e68 R8: 000000007b8b6c4f R9: ffff88066b79f400
    R10: 00000000b467dec8 R11: 0000000000000246 R12: ffff880662823f20
    R13: ffff880662823f28 R14: 0000000000000000 R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #7 [ffff880662823e70] getrusage at ffffffff8108778b
 #8 [ffff880662823f70] sys_getrusage at ffffffff81087966
 #9 [ffff880662823f80] system_call_fastpath at ffffffff8100b0f2
    RIP: 00007f7375dc53a7 RSP: 00007fff82cf93f0 RFLAGS: 00000297
    RAX: 0000000000000062 RBX: ffffffff8100b0f2 RCX: 00000000063f2d98
    RDX: 00000000000004bd RSI: 00007fff82cf9fd0 RDI: 0000000000000000
    RBP: 0000000000000000 R8: 0000000000060cb3 R9: 000000000000490b
    R10: 00000000b467dec8 R11: 0000000000000246 R12: ffffffff81087966
    R13: ffff880662823f78 R14: 00000000042ee0b8 R15: 00000000042ed208
    ORIG_RAX: 0000000000000062 CS: 0033 SS: 002b
(0016536)
lozzd (reporter)
2013-02-26 10:17

We have seen this issue twice since upgrading to 6.2:

  SYSTEM MAP: /boot/System.map-2.6.32-220.4.1.el6.x86_64
DEBUG KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.4.1.el6.x86_64/vmlinux (2.6.32-220.4.1.el6.x86_64)
    DUMPFILE: /var/crash/127.0.0.1-2013-02-26-06:22:59/vmcore [PARTIAL DUMP]
        CPUS: 16
        DATE: Tue Feb 26 06:22:07 2013
      UPTIME: 390 days, 15:00:43
LOAD AVERAGE: 5.28, 7.03, 6.85
       TASKS: 505
    NODENAME: <removed>
     RELEASE: 2.6.32-220.4.1.el6.x86_64
     VERSION: #1 SMP Tue Jan 24 02:13:44 GMT 2012
     MACHINE: x86_64 (2532 Mhz)
      MEMORY: 24 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper"
        TASK: ffff88061787cb00 (1 of 16) [THREAD_INFO: ffff88031784a000]
         CPU: 14
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0 TASK: ffff88061787cb00 CPU: 14 COMMAND: "swapper"
 #0 [ffff8800282e38e0] machine_kexec at ffffffff81031fcb
 #1 [ffff8800282e3940] crash_kexec at ffffffff810b8e12
 #2 [ffff8800282e3a10] oops_end at ffffffff814f0420
 #3 [ffff8800282e3a40] die at ffffffff8100f26b
 #4 [ffff8800282e3a70] do_trap at ffffffff814efd14
 #5 [ffff8800282e3ad0] do_divide_error at ffffffff8100cfff
 #6 [ffff8800282e3b70] divide_error at ffffffff8100be7b
    [exception RIP: find_busiest_group+1477]
    RIP: ffffffff81055435 RSP: ffff8800282e3c20 RFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff8800282e3e44 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff8800282ef500 RDI: ffff8800282f5f80
    RBP: ffff8800282e3db0 R8: ffff8800282ef820 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffff01
    R13: 0000000000015f80 R14: ffffffffffffffff R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #7 [ffff8800282e3db8] rebalance_domains at ffffffff8105e9ea
 #8 [ffff8800282e3e88] run_rebalance_domains at ffffffff8105ee4c
 #9 [ffff8800282e3ed8] __do_softirq at ffffffff81072001
#10 [ffff8800282e3f48] call_softirq at ffffffff8100c24c
#11 [ffff8800282e3f60] do_softirq at ffffffff8100de85
#12 [ffff8800282e3f80] irq_exit at ffffffff81071de5
#13 [ffff8800282e3f90] smp_apic_timer_interrupt at ffffffff814f4d70
#14 [ffff8800282e3fb0] apic_timer_interrupt at ffffffff8100bc13
--- <IRQ stack> ---
#15 [ffff88031784bdb8] apic_timer_interrupt at ffffffff8100bc13
    [exception RIP: intel_idle+222]
    RIP: ffffffff812c49de RSP: ffff88031784be68 RFLAGS: 00000206
    RAX: 0000000000000000 RBX: ffff88031784bed8 RCX: 0000000000000000
    RDX: 00000000000003e2 RSI: 0000000000000000 RDI: 00000000000f2c3c
    RBP: ffffffff8100bc0e R8: 0000000000000004 R9: 00000000000000c8
    R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff810096f0
    R13: ffff88031784be28 R14: ffff8800282f5fe8 R15: ffff8801646a4af8
    ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#16 [ffff88031784bee0] cpuidle_idle_call at ffffffff813f9ef7
#17 [ffff88031784bf00] cpu_idle at ffffffff81009e06
(0017131)
lozzd (reporter)
2013-04-08 09:34

Another instance today, also on 6.2 but with a different (older) kernel than the one above:

      KERNEL: /usr/lib/debug/lib/modules/2.6.32-220.el6.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2013-04-08-05:21:05/vmcore [PARTIAL DUMP]
        CPUS: 24
        DATE: Mon Apr 8 05:20:04 2013
      UPTIME: 212 days, 18:44:48
LOAD AVERAGE: 7.37, 9.01, 10.55
       TASKS: 902
    NODENAME: <removed>
     RELEASE: 2.6.32-220.el6.x86_64
     VERSION: #1 SMP Tue Dec 6 19:48:22 GMT 2011
     MACHINE: x86_64 (3065 Mhz)
      MEMORY: 96 GB
       PANIC: ""
         PID: 0
     COMMAND: "swapper"
        TASK: ffff88180111ab40 (1 of 24) [THREAD_INFO: ffff880c01138000]
         CPU: 20
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 0 TASK: ffff88180111ab40 CPU: 20 COMMAND: "swapper"
 #0 [ffff8800283438f0] machine_kexec at ffffffff81031fcb
 #1 [ffff880028343950] crash_kexec at ffffffff810b8f72
 #2 [ffff880028343a20] oops_end at ffffffff814f0490
 #3 [ffff880028343a50] die at ffffffff8100f26b
 #4 [ffff880028343a80] do_trap at ffffffff814efd84
 #5 [ffff880028343ae0] do_divide_error at ffffffff8100cfff
 #6 [ffff880028343b80] divide_error at ffffffff8100be7b
    [exception RIP: find_busiest_group+1477]
    RIP: ffffffff81054ad5 RSP: ffff880028343c30 RFLAGS: 00010246
    RAX: 0000000000000000 RBX: ffff880028343e54 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff88002834f540 RDI: ffff880028355fc0
    RBP: ffff880028343dc0 R8: ffff88002834f860 R9: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffff01
    R13: 0000000000015fc0 R14: ffffffffffffffff R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
 #7 [ffff880028343dc8] rebalance_domains at ffffffff8105fc52
 #8 [ffff880028343e98] run_rebalance_domains at ffffffff810600ac
 #9 [ffff880028343ed8] __do_softirq at ffffffff81072161
#10 [ffff880028343f48] call_softirq at ffffffff8100c24c
#11 [ffff880028343f60] do_softirq at ffffffff8100de85
#12 [ffff880028343f80] irq_exit at ffffffff81071f45
#13 [ffff880028343f90] smp_apic_timer_interrupt at ffffffff814f4de0
#14 [ffff880028343fb0] apic_timer_interrupt at ffffffff8100bc13
--- <IRQ stack> ---
#15 [ffff880c01139db8] apic_timer_interrupt at ffffffff8100bc13
    [exception RIP: intel_idle+222]
    RIP: ffffffff812c4a5e RSP: ffff880c01139e68 RFLAGS: 00000202
    RAX: 0000000000000000 RBX: ffff880c01139ed8 RCX: 0000000000000000
    RDX: 0000000000000032 RSI: 0000000000000000 RDI: 000000000000c51c
    RBP: ffffffff8100bc0e R8: 0000000000000002 R9: 0000000000000050
    R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff8100975d
    R13: ffff880c01139e28 R14: ffff880028356028 R15: ffff8817febc80b8
    ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#16 [ffff880c01139ee0] cpuidle_idle_call at ffffffff813f9f47
#17 [ffff880c01139f00] cpu_idle at ffffffff81009e06
(0017132)
tru (administrator)
2013-04-08 10:13

6.2 (kernel 2.6.32-220 series) is no longer supported, and neither is 6.3 (kernel 2.6.32-279 series).

According to the 6.3/6.4 kernel changelog, this has been fixed since 2.6.32-248.el6.
(0021728)
Lambry (reporter)
2014-11-18 08:46

Hello,

We are aware that the CentOS distribution of Linux does not have the resources to maintain the Linux kernel.
The problem with the find_busiest_group() function described here was reported and solved several years ago by Red Hat in kernel 2.6.32-220.13.1.
However, the latest kernel in the 2.6.32-220 maintenance level (46.1) is recommended because of some other issues with 2.6.32-220.13.1.

- Issue History
Date Modified     Username           Field                           Change
2012-10-01 16:03  matt0023           New Issue
2012-10-01 16:03  matt0023           File Added: crash.files.tar.gz
2012-10-16 07:57  kbsingh@karan.org  Note Added: 0015942
2012-10-16 08:40  toracat            Note Added: 0015943
2012-10-16 16:04  matt0023           Note Added: 0015946
2012-10-16 16:30  toracat            Note Added: 0015947
2012-10-16 16:30  toracat            Status                          new => resolved
2012-10-16 16:30  toracat            Resolution                      open => no change required
2012-10-16 18:39  kbsingh@karan.org  Note Added: 0015948
2012-10-16 18:39  kbsingh@karan.org  Status                          resolved => assigned
2012-11-05 15:58  fdisk              Note Added: 0016004
2013-02-07 16:46  tru                Relationship added              has duplicate 0006244
2013-02-26 10:17  lozzd              Note Added: 0016536
2013-04-08 09:34  lozzd              Note Added: 0017131
2013-04-08 10:13  tru                Note Added: 0017132
2014-11-18 08:46  Lambry             Note Added: 0021728

