View Issue Details

ID: 0006421
Project: CentOS-6
Category: OTHER
View Status: public
Last Update: 2013-06-12 16:01
Reporter: jhickson
Priority: normal
Severity: major
Reproducibility: have not tried
Status: resolved
Resolution: no change required
Platform:
OS: CentOS
OS Version: 6.4
Product Version:
Target Version:
Fixed in Version: 6.4
Summary: 0006421: Upgrade from CentOS 6.3 to 6.4 causes insane CPU time values, increasing load
Description: During routine upgrades we started seeing some hosts run into trouble. After a yum update to bring them up to 6.4 and a reboot, CPU load starts out at a normal level and, over the course of a few minutes, ramps up to several hundred. The CPU time for some processes also climbs into hundreds of thousands or even millions of hours, almost immediately after the host comes back up.

The confusing part is that if we do a fresh install on the same host, it works completely as expected, with none of these issues.

We compared packages between a host freshly installed with CentOS 6.4 and an upgraded host with the issues. The only package missing on the upgraded host was libitm, but installing it and rebooting did not solve anything. The only other differences were that the upgraded machine kept some old kernels installed and had some extra perl packages.

Here is the output of top:

Tasks: 483 total, 5 running, 478 sleeping, 0 stopped, 0 zombie
Cpu(s): 11.5%us, 0.5%sy, 0.0%ni, 87.7%id, 0.2%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32798944k total, 28187760k used, 4611184k free, 105968k buffers
Swap: 10256376k total, 0k used, 10256376k free, 1424572k cached

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
 3481 jhickson 20 0 13392 1404 812 R 3.8 0.0 0:00.03 top
    1 root 20 0 21444 1552 1240 S 0.0 0.0 8596343h init
    2 root 20 0 0 0 0 S 0.0 0.0 1290835h kthreadd
    3 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
    4 root 20 0 0 0 0 S 0.0 0.0 10222396h ksoftirqd/0
    5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
    6 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
    7 root RT 0 0 0 0 R 0.0 0.0 300194:20 migration/1
    8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
    9 root 20 0 0 0 0 S 0.0 0.0 900583:01 ksoftirqd/1
   10 root RT 0 0 0 0 R 0.0 0.0 0:00.00 watchdog/1
   11 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
   12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
   13 root 20 0 0 0 0 S 0.0 0.0 20012,57 ksoftirqd/2
   14 root RT 0 0 0 0 S 0.0 0.0 300194:20 watchdog/2
   15 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
   16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
   17 root 20 0 0 0 0 S 0.0 0.0 300194:20 ksoftirqd/3
   18 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
   19 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
   20 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/4
   21 root 20 0 0 0 0 S 0.0 0.0 300194:20 ksoftirqd/4
   22 root RT 0 0 0 0 S 0.0 0.0 300194:20 watchdog/4
   23 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
   24 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/5
   25 root 20 0 0 0 0 S 0.0 0.0 0:00.11 ksoftirqd/5
   26 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/5
   27 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/6
   28 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/6
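As a cross-check on TIME+ values like these, the per-process counters can be read straight from /proc and compared against uptime: total CPU time for one process can never legitimately exceed uptime times the CPU count, so anything past that limit is the corrupted accounting this bug produces. A minimal sketch, assuming Linux /proc (stat field numbers per proc(5)); `cpu_seconds` is just a local helper, not part of any CentOS tooling:

```shell
#!/bin/sh
# Flag processes whose accumulated CPU time is impossibly large.

# utime+stime (in jiffies) -> whole seconds of CPU time.
cpu_seconds() {
    echo $(( ($1 + $2) / $3 ))
}

clk_tck=$(getconf CLK_TCK)
uptime_s=$(cut -d. -f1 /proc/uptime)
ncpu=$(getconf _NPROCESSORS_ONLN)
limit=$(( uptime_s * ncpu ))

for stat in /proc/[0-9]*/stat; do
    line=$(cat "$stat" 2>/dev/null) || continue
    # Field 2 (comm) can contain spaces; drop everything through the
    # closing paren so the remaining fields split cleanly.
    set -- ${line#*) }
    # utime is stat field 14, stime is field 15 -> positions 12 and 13
    # after the comm prefix is stripped.
    secs=$(cpu_seconds "${12}" "${13}" "$clk_tck")
    if [ "$secs" -gt "$limit" ]; then
        echo "suspect: $stat cpu=${secs}s limit=${limit}s"
    fi
done
```

On a healthy host this prints nothing; on an affected host it flags entries like the init and ksoftirqd lines above.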

uname -a:

Linux brokenhost.keek.com 2.6.32-358.6.1.el6.x86_64 #1 SMP Tue Apr 23 19:29:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

On a physical console on the machine, we have been seeing lines like:

INFO: task sh:3426 blocked for more than 120 seconds

where "sh" can be anything from a shell to kernel components; it seems more or less random.
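For anyone gathering data on this, the hung-task warnings also land in the kernel ring buffer and in syslog, so they can be collected without a physical console. A small sketch (the 120 s figure is the kernel's default hung_task_timeout_secs; /var/log/messages is the usual CentOS syslog location):

```shell
#!/bin/sh
# Collect hung-task warnings from the kernel log and from syslog.
dmesg 2>/dev/null | grep -F 'blocked for more than 120 seconds' || true
grep -F 'blocked for more than 120 seconds' /var/log/messages 2>/dev/null || true
```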

Has anyone seen anything like this before? Is there any other information I can provide?
Tags: No tags attached.

Activities

foton1981

2013-05-02 19:02

reporter   ~0017358

The issue was introduced somewhere between 2.6.32-220 and 2.6.32-358. Symptoms appear immediately after reboot on our systems running 2.6.32-358. My best guess is that it has something to do with the CPU type/model, as this bug is only triggered on a small number of our fairly recent systems, which have:
CPU0: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz stepping 07
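A quick way to check whether a given host matches this combination. The kernel range and CPU models here are only what has been observed in this thread, not a confirmed list of affected hardware:

```shell
#!/bin/sh
# Report this host's kernel and CPU model, and note whether the kernel
# falls in the range commenters have reported as affected.
kernel=$(uname -r)
model=$(awk -F': ' '/^model name/ {print $2; exit}' /proc/cpuinfo)
echo "kernel: $kernel"
echo "cpu: $model"
case $kernel in
    2.6.32-358*) echo "kernel is in the reported range (2.6.32-358.*)" ;;
    *)           echo "kernel is outside the reported range" ;;
esac
```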
offline

2013-05-08 14:23

reporter   ~0017388

We had the same issue with 7 of our Red Hat servers last night. We were able to solve the problem by doing a cold boot. We don't know why a cold boot worked, but we could not escape the problem with a warm reboot. I'll provide more notes.

Linux 88906lpweb001.mhsl01.mhsl.local 2.6.32-358.6.1.el6.x86_64 #1 SMP Fri Mar 29 16:51:51 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

6 Servers...
CPU0: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz stepping 07

1 server...
CPU0: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz stepping 07
blokecom

2013-06-12 12:41

reporter   ~0017557

We are seeing the same with

CPU0: Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz stepping 07

Warm boots didn't fix it; we are asking SoftLayer to cold boot it.

The problem started on the reboot after we updated from kernel-2.6.32-358.6.1.el6.x86_64 to
kernel-2.6.32-358.6.2.el6.x86_64. Not sure if that's related.

I disabled most of the services at startup to stabilize the machine a little, but the load average is still slowly creeping up, and many processes show the same crazy hours of usage.
 9 root 20 0 0 0 0 S 0.0 0.0 600388:40 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 300194:20 watchdog/1

hope that helps,

Cameron
blokecom

2013-06-12 13:32

reporter   ~0017558

Cold boot did the trick! Thanks.

Cameron
jhickson

2013-06-12 13:46

reporter   ~0017559

Cold booting also worked on some other servers affected by this. We would upgrade them into the broken state, then cold boot them, and they all came back up fine, with no further issues.
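For remote hosts (as in the SoftLayer case above) a true cold boot can be done out of band through the BMC; a warm OS-level reboot keeps power applied and, per the reports here, is not enough. A sketch using ipmitool; the BMC hostname and credentials are placeholders:

```shell
#!/bin/sh
# Cold boot a remote host via IPMI. "chassis power cycle" drops and
# restores power, unlike "reboot" issued from the running OS.
cold_boot() {
    bmc=$1
    # Placeholder credentials; substitute your BMC's actual user/password.
    ipmitool -I lanplus -H "$bmc" -U ADMIN -P changeme chassis power cycle
}

# usage (hypothetical BMC hostname):
# cold_boot bmc-brokenhost.example.com
```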
tigalch

2013-06-12 16:01

manager   ~0017561

Marking as SOLVED per the reporters' feedback.

Issue History

Date Modified Username Field Change
2013-04-26 16:39 jhickson New Issue
2013-05-02 19:02 foton1981 Note Added: 0017358
2013-05-08 14:23 offline Note Added: 0017388
2013-06-12 12:41 blokecom Note Added: 0017557
2013-06-12 13:32 blokecom Note Added: 0017558
2013-06-12 13:46 jhickson Note Added: 0017559
2013-06-12 16:01 tigalch Note Added: 0017561
2013-06-12 16:01 tigalch Status new => resolved
2013-06-12 16:01 tigalch Fixed in Version => 6.4
2013-06-12 16:01 tigalch Resolution open => no change required