View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0016728 | CentOS-8 | kernel | public | 2019-11-13 18:06 | 2019-11-13 19:01 |
Reporter | mnelson | ||||
Priority | normal | Severity | minor | Reproducibility | always |
Status | new | Resolution | open | ||
Product Version | 8.0.1905 | ||||
Target Version | Fixed in Version | ||||
Summary | 0016728: Transparent Huge Pages set to [always] is sub-optimal for many applications | ||||
Description | Transparent Huge Pages provides real benefit to certain applications by potentially reducing TLB misses and improving performance. For other applications, it can bloat memory usage and cause performance regressions. By default, the kernel enables THP for applications that explicitly ask for it via MADV_HUGEPAGE: > "madvise" will enter direct reclaim like "always" but only for regions > that are have used madvise(MADV_HUGEPAGE). This is the default behaviour. https://www.kernel.org/doc/Documentation/vm/transhuge.txt RHEL, CentOS, and CoreOS (but not Fedora) all appear to override this behavior and set THP to [always]. This unfortunately causes issues with a large variety of software including, but not limited to: splunk: https://docs.splunk.com/Documentation/Splunk/7.3.2/ReleaseNotes/SplunkandTHP mongodb: https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/ couchbase: https://docs.couchbase.com/server/current/install/thp-disable.html oracle: https://blogs.oracle.com/linux/performance-issues-with-transparent-huge-pages-thp nuodb: http://doc.nuodb.com/4.0/Content/OpenShift-disable-THP.htm Go runtime: https://github.com/golang/go/issues/8832 jemalloc: https://blog.digitalocean.com/transparent-huge-pages-and-alternative-memory-allocators/ node.js: https://github.com/nodejs/node/issues/11077 tcmalloc: https://github.com/gperftools/gperftools/issues/1073 More recently, we've also seen memory usage bloat in Ceph (using tcmalloc) when THP is set to [always], potentially resulting in OOM when running inside containers. There are various ways to work around this at the application level, including MADV_NOHUGEPAGE or a prctl flag. Requiring these workarounds to disable THP for a given application is counter-intuitive for several reasons: 1) It deviates from the default kernel behavior without a strong justification as to why. 2) It puts the onus on developers to explicitly stop the kernel from engaging in sub-optimal behavior. 
3) It's incredibly confusing to have a system-wide default that claims to "always" enable a setting that many applications may or may not silently disable through workarounds. Finally, when another prominent distribution was faced with a similar choice, they ran stream and malloc tests showing improvement at various allocation sizes when THP was disabled. Ultimately, that led them to switch back to the kernel default (i.e., madvise) with no apparent performance regressions: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1703742 | ||||
Steps To Reproduce | This is a well-known issue that can be reproduced with a variety of software. Steps to reproduce in Ceph are listed below. Steps to Reproduce: 1. Install a single-OSD Ceph cluster. 2. Run a background write workload using hsbench or fio sufficient to fill the ceph-osd caches. 3. Compare memory usage of the OSD process when THP is set to [always] vs [madvise]. Results: https://docs.google.com/spreadsheets/d/1Xl3nWapi7ZKEmpnsSHHWO96iopEG0hK6GeDWhWKSfDo/edit?usp=sharing | ||||
Additional Information | https://unix.stackexchange.com/questions/495816/which-distributions-enable-transparent-huge-pages-for-all-applications https://www.percona.com/blog/2019/03/06/settling-the-myth-of-transparent-hugepages-for-databases/ https://blog.nelhage.com/post/transparent-hugepages/ https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/ https://dl.acm.org/citation.cfm?id=3359640 | ||||
Tags | No tags attached. | ||||
Update: While the kernel documentation claims that madvise is the default, the actual code in mm/Kconfig shows that "always" is the default choice, so I retract the statement about differing from the kernel. See: https://github.com/torvalds/linux/blob/master/mm/Kconfig#L385-L407 |
|
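
For anyone repeating the comparison in the reproduction steps, here is a minimal Python sketch of one way to record the active THP mode alongside a process's memory figures. It is not how the linked results were gathered. It assumes the usual Linux sysfs/procfs locations (`/sys/kernel/mm/transparent_hugepage/enabled`, `/proc/<pid>/smaps`, `/proc/<pid>/status`), and the function names `parse_thp_mode`, `parse_kb_field`, and `report` are hypothetical helpers made up for this illustration.

```python
import re
from pathlib import Path

# Assumed standard sysfs location of the system-wide THP setting.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active mode from a string such as 'always [madvise] never'."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no bracketed mode in {text!r}")
    return match.group(1)


def parse_kb_field(text, field):
    """Sum every '<field>: N kB' line (smaps has one per mapping)."""
    total = 0
    for line in text.splitlines():
        if line.startswith(field + ":"):
            total += int(line.split()[1])
    return total


def report(pid):
    """Collect the THP mode plus huge-page-backed and resident memory for pid."""
    smaps = Path(f"/proc/{pid}/smaps").read_text()
    status = Path(f"/proc/{pid}/status").read_text()
    return {
        "thp_mode": parse_thp_mode(THP_ENABLED.read_text()),
        "anon_huge_kb": parse_kb_field(smaps, "AnonHugePages"),
        "rss_kb": parse_kb_field(status, "VmRSS"),
    }
```

Calling `report(pid)` with the ceph-osd PID once under [always] and once under [madvise], after the write workload has filled the caches, should give two dictionaries that can be compared directly. Reading another process's `smaps` typically requires running as the same user or as root.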