CentOS Bug Tracker

View Issue Details
ID:              0006949
Project:         CentOS-6
Category:        kernel
View Status:     public
Date Submitted:  2014-01-29 11:45
Last Update:     2014-06-19 17:50
Reporter:        paran
Priority:        normal
Severity:        major
Reproducibility: always
Status:          resolved
Resolution:      fixed
OS Version:      6.5
Summary: 0006949: Memory bandwidth regression in CentOS 6.5
Description:
Memory bandwidth measured with the STREAM benchmark shows a significant regression in CentOS 6.5 (kernel 2.6.32-431.3.1.el6.x86_64) compared to CentOS 6.4 (kernel 2.6.32-358.23.2.el6.x86_64).

Affected systems:
HP ProLiant DL380p Gen8, dual socket, Intel Sandy Bridge E5-2660, 128G RAM.
HP ProLiant SL230s Gen8, dual socket, Intel Sandy Bridge E5-2660, 32G RAM.

Not affected systems:
HP ProLiant DL140G3, dual socket, Intel Clovertown E5345, 16G RAM. (non-NUMA)
Steps To Reproduce:
Obtain and compile the STREAM benchmark:

[paran@trio sonc]$ wget \
http://www.cs.virginia.edu/stream/FTP/Code/stream.c
[paran@trio sonc]$ gcc -mtune=native -march=native -O3 -mcmodel=medium \
-fopenmp -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=2 -o stream.1e8 stream.c
[paran@trio sonc]$ gcc -mtune=native -march=native -O3 -mcmodel=medium \
-fopenmp -DSTREAM_ARRAY_SIZE=1000000000 -DNTIMES=2 -o stream.1e9 stream.c
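
STREAM is OpenMP-threaded, so the two kernels should be compared with the same thread count. A minimal sketch of pinning it explicitly (not part of the original recipe; 16 matches the dual E5-2660 machines above):

    # Fix the OpenMP thread count so both kernels run an identical workload.
    export OMP_NUM_THREADS=16
    ./stream.1e8 | grep -A4 '^Function'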

Results below are from an HP DL380p:

CentOS 6.4, kernel 2.6.32-358.23.2.el6.x86_64, 1e8 array:
[paran@triolith3 sonc]$ while `sleep 1`;do ./stream.1e8|grep -A4 ^Function|awk '{print $2}'|xargs;done
Best 49209.8 49276.3 55422.2 55209.1
Best 49482.3 48876.1 55419.7 55590.2
Best 49374.9 49431.6 55559.5 55119.0
Best 49240.1 49157.9 55400.5 55145.9

CentOS 6.5, kernel 2.6.32-431.3.1.el6.x86_64, 1e8 array:
[paran@triolith2 sonc]$ while `sleep 1`;do ./stream.1e8|grep -A4 ^Function|awk '{print $2}'|xargs;done
Best 29667.5 31001.2 36717.9 36167.5
Best 27742.2 26565.2 34453.6 34515.5
Best 31306.3 26674.2 33756.7 36809.8
Best 32532.9 28816.3 35443.2 38226.3

CentOS 6.4, 1e9 array:
[paran@triolith3 sonc]$ while `sleep 1`;do ./stream.1e9|grep -A4 ^Function|awk '{print $2}'|xargs;done
Best 49209.4 49366.7 55500.4 55626.6
Best 49254.7 49335.9 55540.9 55230.1
Best 49295.1 49466.4 55568.0 55646.5
Best 49317.4 49478.5 55593.5 55599.7

CentOS 6.5, 1e9 array:
[paran@triolith2 sonc]$ while `sleep 1`;do ./stream.1e9|grep -A4 ^Function|awk '{print $2}'|xargs;done
Best 43195.9 43217.6 48784.4 48924.0
Best 41904.8 42042.0 47489.0 47560.8
Best 42481.3 42614.3 48069.5 48197.8
Best 42188.4 42281.4 47809.3 44260.0
Additional Information:
I have verified that this is caused by the kernel. Booting a 6.4 kernel on a fully updated 6.5 system gives normal results (one way to do this is sketched below).

STREAM is a synthetic benchmark. Our initial tests using normal user applications have not shown any performance regression in 6.5. However, we have not done extensive testing yet.
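
One way to repeat the kernel-swap test on EL6 (a sketch; grubby ships with the distro, and the version string is the 6.4 kernel named above):

    # Make the 6.4 kernel the default boot entry, then reboot into it.
    grubby --set-default=/boot/vmlinuz-2.6.32-358.23.2.el6.x86_64
    reboot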
Tags: No tags attached.
Attached Files:
numasched_cpu_power_bug.patch (787 bytes) 2014-02-27 11:05
centos-linux-2.6-numasched_cpu_power_BZ870669_bug-bug6949.patch (2,515 bytes) 2014-03-14 17:51

- Relationships
has duplicate 0006999 (resolved, Issue Tracker): New kernel 2.6.32-431.5.1.el6.x86_64 can not use 100% cpu core resource.

- Notes
(0019173)
TrevorH (reporter)
2014-01-30 14:48

I tested only stream.1e8, as stream.1e9 was killed immediately on startup for me. Tests were run on an Intel DH67CF with a Core i3-3220T and 16GB RAM; results were broadly similar for both kernels.

2.6.32-358.23.1
Best 11791.4 11859.5 13111.9 13043.9
Best 12033.0 11850.1 13116.0 13084.7
Best 11923.5 11854.3 13018.2 13060.1
Best 11963.4 11822.5 13001.5 13072.8

2.6.32-431.3.1
Best 11934.1 11848.4 13038.5 13030.4
Best 11956.7 11826.2 13037.2 13038.3
Best 11981.1 11863.9 13055.2 13080.0
Best 12008.1 11827.9 13094.2 13125.1
(0019174)
tru (administrator)
2014-01-30 16:18

2x Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz (16 threads, no HT) with 128GB of RAM; everything is the same except the kernel:

2.6.32-358.23.2.el6.x86_64:stream.1e8
Best 50338.2 52650.9 57302.5 57427.4
Best 50452.9 52321.7 57023.3 57551.2
Best 50480.9 52503.9 56826.3 57591.3
Best 50219.9 52488.3 57197.3 57467.0
Best 50634.4 52508.8 57284.9 57390.1
Best 50613.8 52410.8 57349.2 57441.1
Best 50433.1 52419.8 57124.9 57151.2
....
2.6.32-358.23.2.el6.x86_64:stream.1e9
Best 52208.9 51861.0 57477.2 57180.6
Best 52069.9 51888.6 57059.9 57147.0
Best 52184.6 51739.3 57340.8 57039.4
Best 52114.5 51702.3 57183.6 57079.0
Best 52101.1 51821.9 57196.1 57173.9
Best 52225.4 51960.7 57501.8 57184.9
Best 52095.6 51869.9 57202.4 57173.5
Best 52059.6 51793.8 57281.6 57129.1
...

2.6.32-431.3.1.el6.x86_64: stream.1e8
Best 24479.1 29096.2 34317.5 38074.7
Best 34189.5 26762.1 35299.4 38363.7
Best 35100.1 32563.4 37836.9 40027.4
Best 29332.6 31497.2 37743.0 40127.8
Best 27099.4 27165.6 37432.2 33869.1
Best 32844.0 32267.1 36952.7 36089.1
Best 29332.0 31076.4 36032.8 35932.4
Best 33388.3 32234.0 37693.8 40141.2
Best 33377.2 32257.4 34822.0 36541.3
Best 30114.2 29757.5 36054.4 35016.0
Best 33319.5 32301.6 36374.2 37604.6


2.6.32-431.3.1.el6.x86_64: stream.1e9
Best 46580.2 45964.2 51412.3 51232.9
Best 48569.9 48339.3 53649.6 53755.8
Best 45386.6 45195.0 50304.6 50246.1
Best 47343.9 47052.0 52348.2 52174.9
Best 46038.4 45804.6 50929.3 50787.5
Best 46130.7 45669.5 50888.5 50554.4
Best 46926.9 46533.9 51914.8 51821.0
Best 44937.0 44737.5 49690.7 49786.0
Best 47363.9 47036.1 52347.8 52161.9
Best 48562.9 48115.1 53925.3 53896.2
Best 45852.0 45562.1 50734.7 50632.5
Best 46008.2 45769.6 50964.8 50824.6
(0019176)
mlampe (reporter)
2014-01-30 17:57

libgomp doesn't bind threads to cores by default. It's possible that the older kernel did a better job in this case. You might want to rerun after

   export GOMP_CPU_AFFINITY="0-15"

and see if there are still differences between the two kernels.
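
One way to confirm where the OpenMP threads actually land (a sketch, not from the note above; taskset is part of util-linux):

    # Start the benchmark with affinity requested, then print the CPU
    # mask of every thread in its team.
    export GOMP_CPU_AFFINITY="0-15"
    ./stream.1e8 &
    pid=$!
    sleep 2                       # give libgomp time to spawn the team
    for tid in /proc/$pid/task/*; do
        taskset -cp "${tid##*/}"
    done
    wait $pid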
(0019177)
tru (administrator)
2014-01-30 18:21

Much better. :)
2.6.32-431.3.1.el6.x86_64/stream.1e8/GOMP_CPU_AFFINITY="0-15"
Best 50363.5 52706.3 57533.1 57918.0
Best 50931.1 52685.6 57547.2 57793.7
Best 50932.7 52727.0 57573.2 57485.1
Best 51012.4 52765.2 57310.7 57871.8
Best 50880.9 52824.6 57744.9 57728.4
Best 50747.0 52843.3 57616.4 57989.4
Best 50739.0 52866.2 57750.6 57820.2
Best 50973.3 52563.9 57735.3 58007.5
Best 50977.9 52581.6 57526.5 57888.4
Best 50657.0 52857.4 57602.2 57947.4
Best 50309.5 52626.6 57471.3 57649.4
Best 50624.9 52817.5 57439.5 58043.9
Best 50822.7 52871.6 57661.9 57943.0
Best 50095.4 52533.0 57239.6 57607.8
Best 50607.3 52689.0 57676.8 57864.8
Best 50679.2 52800.1 57516.6 57996.5
Best 51113.4 52784.3 57570.6 57986.1
Best 50419.1 52576.3 57567.9 57805.0
Best 51265.3 52626.6 57578.8 57706.2
Best 50663.1 52505.5 57517.0 57878.7
Best 50745.1 52701.0 57563.7 57821.6
Best 51157.5 52838.3 57617.7 57931.7
Best 50289.2 52843.7 57599.6 57856.5
Best 50892.1 52676.5 57556.7 57944.4
Best 50698.7 52500.6 57464.4 57725.4
Best 50182.0 52653.8 57591.3 57955.7
Best 50884.4 52775.6 57799.3 57661.6

2.6.32-431.3.1.el6.x86_64/stream.1e9/GOMP_CPU_AFFINITY="0-15"
Best 52174.6 51859.8 57366.3 57237.7
Best 52160.7 51866.1 57212.1 57202.2
Best 52249.3 52014.6 57425.6 57084.8
Best 52058.3 51783.3 57321.6 56974.0
Best 52145.5 51923.1 57333.0 57136.9
Best 52145.5 51923.1 57333.0 57136.9
Best 52094.3 51751.7 57292.2 57091.7
Best 52098.6 51720.5 57358.4 57204.4
Best 52039.3 51793.2 57257.9 57024.9
(0019258)
paran (reporter)
2014-02-11 18:20

In hindsight I should of course have warned about the memory
requirements. In particular, stream.1e9 requires 22.4 GiB, which some
may not have. :-)

stream.1e8:
  Array size = 100000000 (elements), Offset = 0 (elements)
  Memory per array = 762.9 MiB (= 0.7 GiB).
  Total memory required = 2288.8 MiB (= 2.2 GiB).

stream.1e9:
  Array size = 1000000000 (elements), Offset = 0 (elements)
  Memory per array = 7629.4 MiB (= 7.5 GiB).
  Total memory required = 22888.2 MiB (= 22.4 GiB).
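
These totals follow directly from the three double-precision arrays STREAM allocates, as this one-liner shows:

    # 3 arrays x 8 bytes/element: 1e8 elements -> ~2.2 GiB, 1e9 -> ~22.4 GiB.
    awk 'BEGIN { for (n = 1e8; n <= 1e9; n *= 10)
                     printf "%.0f elements: %.1f GiB\n", n, 3 * 8 * n / 2^30 }'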

Anyway, the problem is worse with a smaller array size.

That the Core i3 system did not show any difference is good, as
that is also a non-NUMA system, and would not be affected by
binding issues.

I am rather sure that this is an upstream bug. Does anybody have
a RHEL machine to verify? I could then open a bug report on
bugzilla.redhat.com, unless somebody else wants to do it.
(0019282)
paran (reporter)
2014-02-14 10:14
edited on: 2014-02-20 15:02

edit: [incorrect link removed]

(0019292)
mlampe (reporter)
2014-02-14 23:42

This RH bugzilla entry is about libcgroup and pam. Originally you said it's just about kernels and switching them.

Would you please enlighten lesser mortals about the relationship here?
(0019293)
paran (reporter)
2014-02-15 10:26

Dear lesser mortals,

Hopefully the only relationship between those bugs is that I
reported them. What probably happened is that I copied the URL
from the wrong tab in my browser. :-)

Hopefully I get it correct this time:
https://bugzilla.redhat.com/show_bug.cgi?id=1065304

Unfortunately it seems that the bug got marked private for some
reason.
(0019311)
cap_ (updater)
2014-02-20 14:37

The underlying problem is that the CPU scheduler puts several threads on the same core(s) while idling others. Adding a relation to another ticket.
(0019313)
cap_ (updater)
2014-02-20 15:07

Simple reproducer (starts one md5sum /dev/zero per processor on the system):

for i in $(seq 1 $(egrep "^processor" /proc/cpuinfo | wc -l)) ; do md5sum /dev/zero & done

Run top, hit "1" to get per-cpu view.

Observed on bad system: one or more processors fully or partially idle (despite enough md5sum processes to keep everyone busy).

Observed on good system: all processors have zero or near zero idle numbers.
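
A non-interactive alternative to top for spotting the idle cores (a sketch; mpstat comes from the sysstat package):

    # One 1-second sample of per-CPU utilization; on a bad system some
    # CPUs show large %idle even though x md5sums are running.
    mpstat -P ALL 1 1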
(0019314)
TrevorH (reporter)
2014-02-20 15:11

Also tested an E3-1245 v3 with 4 cores and HT. Ran 8 md5sum processes; all 8 "CPUs" were used. Unaffected.
(0019316)
cap_ (updater)
2014-02-20 17:31

List of bad and good combinations pulled together from several sources:
* Definitions
SNB = Sandy Bridge
WSM = Westmere
NEH = Nehalem
IVB = Ivy Bridge
HSW = Haswell

* ok
** dual socket WSM on RHEL-6.5 (did he get it right?)...
** single socket HSW HT on, C6.5
** dual socket IVB on f19 3.12
** dual socket WSM on RHEL-7b 3.10
** X5670 @ 2.93GHz 2.6.18-371.1.2.el5...
** dual socket 2x6 cores HT on, X5660 C6.4 (2.6.32-358.23.2.el6.x86_64)
** dual socket NEH HT on C6.2 (HP DL1000)

* bad
** dual socket SNB on C6.5 (HP SL230)
** dual socket IVB on f19 3.8
** dual socket IVB on C6.5 (HT and noHT)...
** m5a97 le r2.0, 6 core amd 6300 funtoo 3.10...
** E5-4620, el6.5, 2.6.32-431.el6.x86_64...
** rhel 6.5, xeon E5530, 2.6.32-431.5.1.el6.x86_64...
** dual socket 2x6 cores HT on X5670 C6.5 ...
** dual socket 2x6 cores HT on X5660 C6.5 (2.6.32-431.3.1.el6.x86_64)...
** dual socket NEH HT on C6.5 (HP DL1000)
(0019350)
cap_ (updater)
2014-02-25 10:47
edited on: 2014-02-27 15:07

It's possible to reproduce this in a VM using the -numa option to qemu. I did virsh edit xxxx and added:
  <qemu:commandline>
    <qemu:arg value='-numa'/>
    <qemu:arg value='node,mem=1G,cpus=0-7'/>
    <qemu:arg value='-numa'/>
    <qemu:arg value='node,mem=1G,cpus=8-15'/>
  </qemu:commandline>

[Edit: also needed] Add this to the main <domain> tag:
 xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'

This was with 16 cores and 2 sockets configured in the VM on my 2-core non-NUMA laptop. Booting C6.5 on this reproduced the problem.
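
To check from inside the guest that the two fake nodes took effect (a sketch; numactl is in the numactl package):

    # Should report two nodes with 8 CPUs and ~1G of memory each.
    numactl --hardware
    cat /sys/devices/system/node/node*/cpulist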

(0019386)
cap_ (updater)
2014-02-27 11:04

Wrote a systemtap script that dumps all relevant information (sched_domain, sched_groups, ...). It seems the problem is that one NUMA zone gets an incorrect cpu_power. On 6.4 (output from my stap script on a 20-core IVB server):
 sdlevel:5 sdflags:1071 sdspan:11111111111111111111 sdname:"NODE"
 grpcpupow: 10238 cpupoworig: 0 mask:11111111110000000000
 grpcpupow: 10240 cpupoworig: 0 mask:00000000001111111111

On 6.5:
 sdlevel:5 sdflags:1071 sdspan:11111111111111111111 sdname:"NODE"
 grpcpupow: 10238 cpupoworig: 0 mask:11111111110000000000
 grpcpupow: 20470 cpupoworig: 0 mask:00000000001111111111

Note how the 2nd sched_group in the "NODE" sched_domain has about 2x the expected value: each group is supposed to be ~1024 * numcores in group (here 10 * 1024 = 10240) and roughly equal to the first group's 10238, yet it reads 20470.

I've successfully tried to update the value on a running kernel with systemtap and this fixes the problem.

I've also reverted a part of sched.c and rebuilt. This also fixes the problem.

I suspect that this is what caused it (a fix for a boot problem on an exotic machine that breaks all normal machines...):

* Tue Jul 02 2013 Jarod Wilson <jarod@redhat.com> [2.6.32-395.el6]
- [kernel] sched: make weird topologies bootable (Radim Krcmar) [892677]
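
Part of the same topology can also be inspected without systemtap (a sketch, assuming the kernel is built with CONFIG_SCHED_DEBUG, which exposes the tree under /proc/sys/kernel/sched_domain; cpu_power itself is not shown there):

    # Walk cpu0's scheduler-domain tree; the NODE level is the one whose
    # sched_groups carried the bad cpu_power value above.
    for d in /proc/sys/kernel/sched_domain/cpu0/domain*/; do
        printf '%s name=%s flags=%s\n' "$d" "$(cat $d/name)" "$(cat $d/flags)"
    done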
(0019387)
toracat (developer)
2014-02-27 13:52

We will consider applying the patch to the centosplus kernel.
(0019389)
cap_ (updater)
2014-02-27 15:10

The patch has now also been tested with HT/SMT. It works and fixes the problem (hw: 2-socket 10-core Intel Xeon-E5v2 (IVB) => 40 processors seen in Linux).
(0019392)
toracat (developer)
2014-02-27 15:44

Thanks, cap_, for your analysis and the patch.
(0019402)
mlampe (reporter)
2014-03-05 00:36

Thank you too for your efforts!

I have rebuilt the current kernel with your patch only and have performed two tests on our shiny new cluster with 2x E5-2660 v2 per node.

1) md5sum like in your example above (one node only).

It's better than with stock, but I still see CPUs with very low usage (about 3-4).

2) mp_linpack without explicit binding.

One MPI process per real core. Also better than before, but still not optimal. Processes are not strictly mapped 1:1 to real cores; about 1/4 of them per node share a real core. I'm quite sure that this was automatically done right before RH fixed something.
(0019403)
cap_ (updater)
2014-03-05 12:32

Regarding "better but not perfect": I didn't mention it in my initial
posts since:

1) The various fixed 6.5 setups behave no better or worse than 6.4
2) I couldn't quite describe the imperfections still left

To be clear, I think this is another bug, but here's the data:

* C6.5 with either of (systemtap fix, my patch, upstream patch)
  behaves the same as 6.4 in all my tests
* The remaining problem when using my reference load of x md5sums for
  x cores differs from machine to machine
 - 16 core Xeon-E5(SNB), no HT: 0 misplacements out of 100+ cycles
 - 20 core Xeon-E5v2(IVB), no HT: about every 2nd cycle has one
   misplaced process (9 md5sums on one socket, 11 on the other). The
   situation does not (typically) sort itself out.
 - 20 core Xeon-E5v2(IVB), HT: ~1 in 25 cycles has one misplaced
   md5sum (19 md5sums on one socket, 21 on the other).
* If you start the x copies with a sleep 0.1 between each, the
  problem stays away (see the sketch below)

Since the Linux kernel doesn't do page migration, it makes sense to be
very reluctant to move processes across NUMA zones after they've
been "initially misplaced".
(0019404)
cap_ (updater)
2014-03-05 12:35

Regarding the upstream bz, it's now public (and updated with a patch
that fixes the problem while keeping the 6.5 functionality I reverted
in my patch).

toracat: the upstream patch seems a better choice for you to pick up
(not that I know of any systems that break with my patch, but
presumably some "exotic" system would fail to boot).
(0019405)
toracat (developer)
2014-03-05 13:19

@cap_

Got it. Will use the patch from the upstream BZ. Thanks.
(0019497)
toracat (developer)
2014-03-13 17:52
edited on: 2014-03-13 17:53

err, wrong place. (removed)

(0019503)
toracat (developer)
2014-03-14 17:53

An updated patch has been offered in the upstream BZ. Uploaded to this bug report as centos-linux-2.6-numasched_cpu_power_BZ870669_bug-bug6949.patch.
(0019510)
cap_ (updater)
2014-03-17 08:59

I guess the patch file name was a typo too; upstream is bz1065304.
(0019511)
toracat (developer)
2014-03-17 12:16

Indeed. Thanks for catching it. :)
(0019561)
toracat (developer)
2014-03-26 14:57

kernel 2.6.32-431.11.2.el6 is out. The plus kernel now has the patch applied.
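
For anyone who wants to pick it up (a sketch; the centosplus kernel package is kernel-plus):

    # Enable the centosplus repo for this transaction and install the
    # patched kernel.
    yum --enablerepo=centosplus install kernel-plus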
(0019743)
cap_ (updater)
2014-05-08 13:55

According to upstream bz:
 - Fixed In Version: kernel-2.6.32-431.19.1.el6
 - Fixed In Version: kernel-2.6.32-461.el6
(0019744)
toracat (developer)
2014-05-08 15:47

We are at 2.6.32-431.17.1.el6, so maybe the next update? Thanks, cap_, for posting the info.
(0019966)
toracat (developer)
2014-06-19 17:48

This has been fixed in the distro kernel-2.6.32-431.20.3.el6.

- Issue History
Date Modified Username Field Change
2014-01-29 11:45 paran New Issue
2014-01-30 14:48 TrevorH Note Added: 0019173
2014-01-30 16:18 tru Note Added: 0019174
2014-01-30 17:57 mlampe Note Added: 0019176
2014-01-30 18:21 tru Note Added: 0019177
2014-02-11 18:20 paran Note Added: 0019258
2014-02-14 10:14 paran Note Added: 0019282
2014-02-14 23:42 mlampe Note Added: 0019292
2014-02-15 10:26 paran Note Added: 0019293
2014-02-20 14:37 cap_ Note Added: 0019311
2014-02-20 14:38 cap_ Relationship added has duplicate 0006999
2014-02-20 15:02 cap_ Note Edited: 0019282
2014-02-20 15:07 cap_ Note Added: 0019313
2014-02-20 15:11 TrevorH Note Added: 0019314
2014-02-20 17:31 cap_ Note Added: 0019316
2014-02-25 10:47 cap_ Note Added: 0019350
2014-02-27 11:04 cap_ Note Added: 0019386
2014-02-27 11:05 cap_ File Added: numasched_cpu_power_bug.patch
2014-02-27 13:52 toracat Note Added: 0019387
2014-02-27 15:07 cap_ Note Edited: 0019350
2014-02-27 15:10 cap_ Note Added: 0019389
2014-02-27 15:44 toracat Note Added: 0019392
2014-02-27 18:01 toracat Status new => assigned
2014-03-05 00:36 mlampe Note Added: 0019402
2014-03-05 12:32 cap_ Note Added: 0019403
2014-03-05 12:35 cap_ Note Added: 0019404
2014-03-05 13:19 toracat Note Added: 0019405
2014-03-13 17:52 toracat Note Added: 0019497
2014-03-13 17:53 toracat Note Edited: 0019497
2014-03-14 17:51 toracat File Added: centos-linux-2.6-numasched_cpu_power_BZ870669_bug-bug6949.patch
2014-03-14 17:53 toracat Note Added: 0019503
2014-03-17 08:59 cap_ Note Added: 0019510
2014-03-17 12:16 toracat Note Added: 0019511
2014-03-26 14:57 toracat Note Added: 0019561
2014-05-08 13:55 cap_ Note Added: 0019743
2014-05-08 15:47 toracat Note Added: 0019744
2014-06-19 17:48 toracat Note Added: 0019966
2014-06-19 17:50 toracat Status assigned => resolved
2014-06-19 17:50 toracat Resolution open => fixed

