
View Issue Details

ID: 0007091
Project: CentOS-6
Category: kernel
View Status: public
Last Update: 2014-08-05 11:34
Reporter: StephenSykes
Priority: normal
Severity: crash
Reproducibility: have not tried
Status: new
Resolution: open
Platform: Linux
OS: CentOS
OS Version: 6.5
Product Version: 6.5
Target Version:
Fixed in Version:

Summary: 0007091: oom killer kills process rather than freeing cache or swapping processes under swappiness 0
Description: Running with a swappiness of 1, the system shows plenty of free memory: of its 6 GiB the kernel keeps roughly 1.5 GiB free, about 1.5 GiB for cache, and the remainder for processes, with roughly 750 MiB of stale process memory pushed out to swap.

sar statistics from the operational period of the system (under load; I can provide detailed sar output, but let's start with the averages - they are representative of the situation during the day):
Linux 2.6.32-358.2.1.el6.x86_64 () 04/22/2014 _x86_64_ (3 CPU)

08:00:02 AM CPU %user %nice %system %iowait %steal %idle
Average: all 19.09 0.00 36.17 15.55 0.13 29.06

08:00:02 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
Average: 1254309 4714303 78.98 20126 2175944 20541612 124.85

08:00:02 AM frmpg/s bufpg/s campg/s
Average: 5.20 -0.10 -11.65

08:00:02 AM kbswpfree kbswpused %swpused kbswpcad %swpcad

Average: 9433352 1051688 10.03 257077 24.44

Running with a swappiness of 0 (the same system, same load, the next day), the kernel OOM-killed our database server (on a live production system), taking a business temporarily offline.

sar statistics for the time period:

 sar -s 15:00:00 -e 16:00:00
Linux 2.6.32-358.2.1.el6.x86_64 04/23/2014 _x86_64_ (3 CPU)

03:00:02 PM CPU %user %nice %system %iowait %steal %idle
03:10:01 PM all 17.90 0.00 35.27 3.29 0.12 43.42
03:20:02 PM all 19.94 0.00 36.40 8.73 0.12 34.82
03:30:02 PM all 20.18 0.00 42.10 20.37 0.16 17.19
03:40:01 PM all 5.75 0.00 19.93 5.05 0.06 69.20
03:50:01 PM all 22.21 0.00 34.31 8.72 0.16 34.59
Average: all 17.17 0.00 33.58 9.23 0.12 39.90

 sar -r -s 15:00:00 -e 16:00:00

03:00:02 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit
03:10:01 PM 125396 5843216 97.90 22160 1989352 23282984 141.51
03:20:02 PM 44788 5923824 99.25 19092 1759124 23759172 144.40
03:30:02 PM 34400 5934212 99.42 11864 1649488 23856192 144.99
03:40:01 PM 4139748 1828864 30.64 22972 341264 9558784 58.10
03:50:01 PM 1364632 4603980 77.14 30784 2576044 11827612 71.88
Average: 1141793 4826819 80.87 21374 1663054 18456949 112.18

 sar -R -s 15:00:00 -e 16:00:00

03:00:02 PM frmpg/s bufpg/s campg/s
03:10:01 PM -24.49 -0.54 26.00
03:20:02 PM -34.14 -1.30 -97.51
03:30:02 PM -4.43 -3.08 -46.70
03:40:01 PM 1732.71 4.69 -552.15
03:50:01 PM -1188.87 3.35 957.39
Average: 100.36 0.62 55.05

sudo ssh BMS-AppDBServer-4 sar -S -s 15:00:00 -e 16:00:00

03:00:02 PM kbswpfree kbswpused %swpused kbswpcad %swpcad
03:10:01 PM 10471148 13892 0.13 4300 30.95
03:20:02 PM 10471148 13892 0.13 4300 30.95
03:30:02 PM 10471148 13892 0.13 4300 30.95
03:40:01 PM 10475820 9220 0.09 3484 37.79
03:50:01 PM 10475820 9220 0.09 3484 37.79
Average: 10473017 12023 0.11 3973 33.04

And the kernel log (abbreviated to remove the large list of processes - it's an application/database server, so there are tons of httpd and other app server processes, some db server processes, and the usual remainder):

Apr 23 15:31:01 kernel: proc_count invoked oom-killer: gfp_mask=0xd0, order=1, oom_adj=0, oom_score_adj=0
Apr 23 15:31:01 kernel: proc_count cpuset=/ mems_allowed=0
Apr 23 15:31:01 kernel: Pid: 1713, comm: proc_count Not tainted 2.6.32-358.2.1.el6.x86_64 #1
Apr 23 15:31:01 kernel: Call Trace:
Apr 23 15:31:01 kernel: [<ffffffff810cb5f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Apr 23 15:31:01 kernel: [<ffffffff8111cdf0>] ? dump_header+0x90/0x1b0
Apr 23 15:31:01 kernel: [<ffffffff8111d272>] ? oom_kill_process+0x82/0x2a0
Apr 23 15:31:01 kernel: [<ffffffff8111d1b1>] ? select_bad_process+0xe1/0x120
Apr 23 15:31:01 kernel: [<ffffffff8111d6b0>] ? out_of_memory+0x220/0x3c0
Apr 23 15:31:01 kernel: [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
Apr 23 15:31:01 kernel: [<ffffffff810074fd>] ? xen_force_evtchn_callback+0xd/0x10
Apr 23 15:31:01 kernel: [<ffffffff8116095a>] ? alloc_pages_current+0xaa/0x110
Apr 23 15:31:01 kernel: [<ffffffff81129d3e>] ? __get_free_pages+0xe/0x50
Apr 23 15:31:01 kernel: [<ffffffff8106bef4>] ? copy_process+0xe4/0x1450
Apr 23 15:31:01 kernel: [<ffffffff81168432>] ? kmem_cache_alloc+0x182/0x190
Apr 23 15:31:01 kernel: [<ffffffff8106d2f4>] ? do_fork+0x94/0x460
Apr 23 15:31:01 kernel: [<ffffffff8119f642>] ? alloc_fd+0x92/0x160
Apr 23 15:31:01 kernel: [<ffffffff8117dd17>] ? fd_install+0x47/0x90
Apr 23 15:31:01 kernel: [<ffffffff81510160>] ? _spin_lock_irq+0x10/0x40
Apr 23 15:31:01 kernel: [<ffffffff81009598>] ? sys_clone+0x28/0x30
Apr 23 15:31:01 kernel: [<ffffffff8100b393>] ? stub_clone+0x13/0x20
Apr 23 15:31:01 kernel: [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
Apr 23 15:31:01 kernel: Mem-Info:
Apr 23 15:31:01 kernel: Node 0 DMA per-cpu:
Apr 23 15:31:01 kernel: CPU 0: hi: 0, btch: 1 usd: 0
Apr 23 15:31:01 kernel: CPU 1: hi: 0, btch: 1 usd: 0
Apr 23 15:31:01 kernel: CPU 2: hi: 0, btch: 1 usd: 0
Apr 23 15:31:01 kernel: Node 0 DMA32 per-cpu:
Apr 23 15:31:01 kernel: CPU 0: hi: 186, btch: 31 usd: 29
Apr 23 15:31:01 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Apr 23 15:31:01 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Apr 23 15:31:01 kernel: Node 0 Normal per-cpu:
Apr 23 15:31:01 kernel: CPU 0: hi: 186, btch: 31 usd: 58
Apr 23 15:31:01 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Apr 23 15:31:01 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Apr 23 15:31:01 kernel: active_anon:657282 inactive_anon:234243 isolated_anon:0
Apr 23 15:31:01 kernel: active_file:1582 inactive_file:1577 isolated_file:0
Apr 23 15:31:01 kernel: unevictable:415617 dirty:0 writeback:4 unstable:0
Apr 23 15:31:01 kernel: free:7393 slab_reclaimable:13322 slab_unreclaimable:36105
Apr 23 15:31:01 kernel: mapped:410321 shmem:409696 pagetables:86995 bounce:0
Apr 23 15:31:01 kernel: Node 0 DMA free:7760kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:7756kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 23 15:31:01 kernel: lowmem_reserve[]: 0 4024 6044 6044
Apr 23 15:31:01 kernel: Node 0 DMA32 free:17736kB min:6620kB low:8272kB high:9928kB active_anon:2256636kB inactive_anon:564276kB active_file:2088kB inactive_file:2180kB unevictable:624700kB isolated(anon):0kB isolated(file):0kB present:4120800kB mlocked:0kB dirty:0kB writeback:16kB mapped:624832kB shmem:624832kB slab_reclaimable:32124kB slab_unreclaimable:90036kB kernel_stack:32696kB pagetables:206100kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 23 15:31:01 kernel: lowmem_reserve[]: 0 0 2020 2020
Apr 23 15:31:01 kernel: Node 0 Normal free:4076kB min:3320kB low:4148kB high:4980kB active_anon:372492kB inactive_anon:372696kB active_file:4240kB inactive_file:4128kB unevictable:1037768kB isolated(anon):0kB isolated(file):0kB present:2068480kB mlocked:24776kB dirty:0kB writeback:0kB mapped:1016452kB shmem:1013952kB slab_reclaimable:21164kB slab_unreclaimable:54384kB kernel_stack:4328kB pagetables:141880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Apr 23 15:31:01 kernel: lowmem_reserve[]: 0 0 0 0
Apr 23 15:31:01 kernel: Node 0 DMA: 2*4kB 1*8kB 2*16kB 1*32kB 0*64kB 2*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB = 7760kB
Apr 23 15:31:01 kernel: Node 0 DMA32: 4083*4kB 12*8kB 4*16kB 3*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 17740kB
Apr 23 15:31:01 kernel: Node 0 Normal: 615*4kB 11*8kB 1*16kB 1*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 4132kB
Apr 23 15:31:01 kernel: 414562 total pagecache pages
Apr 23 15:31:01 kernel: 1075 pages in swap cache
Apr 23 15:31:01 kernel: Swap cache stats: add 22900328, delete 22899253, find 492200942/494634739
Apr 23 15:31:01 kernel: Free swap = 10471148kB
Apr 23 15:31:01 kernel: Total swap = 10485040kB
Apr 23 15:31:01 kernel: 1572863 pages RAM
Apr 23 15:31:01 kernel: 80710 pages reserved
Apr 23 15:31:01 kernel: 2926057 pages shared
Apr 23 15:31:01 kernel: 1058807 pages non-shared
Apr 23 15:31:01 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name


Apr 23 15:31:01 kernel: Out of memory: Kill process 1357 (oninit) score 99 or sacrifice child
Apr 23 15:31:01 kernel: Killed process 1359, UID 501, (oninit) total-vm:2056000kB, anon-rss:116kB, file-rss:10132kB
Apr 23 15:31:20 kernel: awk invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
Apr 23 15:31:20 kernel: awk cpuset=/ mems_allowed=0
Apr 23 15:31:20 kernel: Pid: 22899, comm: awk Not tainted 2.6.32-358.2.1.el6.x86_64 #1
Apr 23 15:31:20 kernel: Call Trace:
Apr 23 15:31:20 kernel: [<ffffffff810cb5f1>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Apr 23 15:31:20 kernel: [<ffffffff8111cdf0>] ? dump_header+0x90/0x1b0
Apr 23 15:31:20 kernel: [<ffffffff8111d272>] ? oom_kill_process+0x82/0x2a0
Apr 23 15:31:20 kernel: [<ffffffff8111d1b1>] ? select_bad_process+0xe1/0x120
Apr 23 15:31:20 kernel: [<ffffffff8111d6b0>] ? out_of_memory+0x220/0x3c0
Apr 23 15:31:20 kernel: [<ffffffff8112c35c>] ? __alloc_pages_nodemask+0x8ac/0x8d0
Apr 23 15:31:20 kernel: [<ffffffff8116095a>] ? alloc_pages_current+0xaa/0x110
Apr 23 15:31:20 kernel: [<ffffffff8111a1d7>] ? __page_cache_alloc+0x87/0x90
Apr 23 15:31:20 kernel: [<ffffffff81119bbe>] ? find_get_page+0x1e/0xa0
Apr 23 15:31:20 kernel: [<ffffffff8111b197>] ? filemap_fault+0x1a7/0x500
Apr 23 15:31:20 kernel: [<ffffffff8100628f>] ? xen_set_pte_at+0xaf/0x170
Apr 23 15:31:20 kernel: [<ffffffff81143194>] ? __do_fault+0x54/0x530
Apr 23 15:31:20 kernel: [<ffffffff81143767>] ? handle_pte_fault+0xf7/0xb50
Apr 23 15:31:20 kernel: [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
Apr 23 15:31:20 kernel: [<ffffffff8100c1fb>] ? xen_hypervisor_callback+0x1b/0x20
Apr 23 15:31:20 kernel: [<ffffffff8151098a>] ? error_exit+0x2a/0x60
Apr 23 15:31:20 kernel: [<ffffffff8100ba1d>] ? retint_restore_args+0x5/0x6
Apr 23 15:31:20 kernel: [<ffffffff8100c1fb>] ? xen_hypervisor_callback+0x1b/0x20
Apr 23 15:31:20 kernel: [<ffffffff81004a49>] ? __raw_callee_save_xen_pmd_val+0x11/0x1e
Apr 23 15:31:20 kernel: [<ffffffff811443fa>] ? handle_mm_fault+0x23a/0x310
Apr 23 15:31:20 kernel: [<ffffffff810474c9>] ? __do_page_fault+0x139/0x480
Apr 23 15:31:20 kernel: [<ffffffff810074fd>] ? xen_force_evtchn_callback+0xd/0x10
Apr 23 15:31:20 kernel: [<ffffffff81007ca2>] ? check_events+0x12/0x20
Apr 23 15:31:20 kernel: [<ffffffff810074fd>] ? xen_force_evtchn_callback+0xd/0x10
Apr 23 15:31:20 kernel: [<ffffffff81007ca2>] ? check_events+0x12/0x20
Apr 23 15:31:20 kernel: [<ffffffff8151339e>] ? do_page_fault+0x3e/0xa0
Apr 23 15:31:20 kernel: [<ffffffff81510755>] ? page_fault+0x25/0x30
Apr 23 15:31:20 kernel: Mem-Info:
Apr 23 15:31:20 kernel: Node 0 DMA per-cpu:
Apr 23 15:31:20 kernel: CPU 0: hi: 0, btch: 1 usd: 0
Apr 23 15:31:20 kernel: CPU 1: hi: 0, btch: 1 usd: 0
Apr 23 15:31:20 kernel: CPU 2: hi: 0, btch: 1 usd: 0
Apr 23 15:31:20 kernel: Node 0 DMA32 per-cpu:
Apr 23 15:31:20 kernel: CPU 0: hi: 186, btch: 31 usd: 32
Apr 23 15:31:20 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Apr 23 15:31:20 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Apr 23 15:31:20 kernel: Node 0 Normal per-cpu:
Apr 23 15:31:20 kernel: CPU 0: hi: 186, btch: 31 usd: 90
Apr 23 15:31:20 kernel: CPU 1: hi: 186, btch: 31 usd: 0
Apr 23 15:31:20 kernel: CPU 2: hi: 186, btch: 31 usd: 0
Apr 23 15:31:20 kernel: active_anon:659392 inactive_anon:234908 isolated_anon:0
Apr 23 15:31:20 kernel: active_file:1093 inactive_file:1129 isolated_file:0
Apr 23 15:31:20 kernel: unevictable:415617 dirty:0 writeback:0 unstable:0
Apr 23 15:31:20 kernel: free:6348 slab_reclaimable:11215 slab_unreclaimable:36129
Apr 23 15:31:20 kernel: mapped:410358 shmem:409696 pagetables:86756 bounce:0
Apr 23 15:31:20 kernel: Node 0 DMA free:7760kB min:12kB low:12kB high:16kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:7756kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Apr 23 15:31:20 kernel: lowmem_reserve[]: 0 4024 6044 6044
Apr 23 15:31:20 kernel: Node 0 DMA32 free:14596kB min:6620kB low:8272kB high:9928kB active_anon:2264272kB inactive_anon:566132kB active_file:1476kB inactive_file:1444kB unevictable:624700kB isolated(anon):0kB isolated(file):0kB present:4120800kB mlocked:0kB dirty:0kB writeback:0kB mapped:624944kB shmem:624832kB slab_reclaimable:24184kB slab_unreclaimable:89940kB kernel_stack:32688kB pagetables:206172kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1620 all_unreclaimable? no
Apr 23 15:31:20 kernel: lowmem_reserve[]: 0 0 2020 2020
Apr 23 15:31:20 kernel: Node 0 Normal free:3036kB min:3320kB low:4148kB high:4980kB active_anon:373296kB inactive_anon:373500kB active_file:2896kB inactive_file:3072kB unevictable:1037768kB isolated(anon):0kB isolated(file):2568kB present:2068480kB mlocked:24776kB dirty:0kB writeback:0kB mapped:1016488kB shmem:1013952kB slab_reclaimable:20676kB slab_unreclaimable:54576kB kernel_stack:4368kB pagetables:140852kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:9666 all_unreclaimable? no
Apr 23 15:31:20 kernel: lowmem_reserve[]: 0 0 0 0
Apr 23 15:31:20 kernel: Node 0 DMA: 2*4kB 1*8kB 2*16kB 1*32kB 0*64kB 2*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 1*4096kB = 7760kB
Apr 23 15:31:20 kernel: Node 0 DMA32: 3297*4kB 12*8kB 4*16kB 3*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 0*2048kB 0*4096kB = 14596kB
Apr 23 15:31:20 kernel: Node 0 Normal: 356*4kB 1*8kB 0*16kB 1*32kB 0*64kB 0*128kB 0*256kB 1*512kB 1*1024kB 0*2048kB 0*4096kB = 3000kB
Apr 23 15:31:20 kernel: 414714 total pagecache pages
Apr 23 15:31:20 kernel: 1097 pages in swap cache
Apr 23 15:31:20 kernel: Swap cache stats: add 22900397, delete 22899300, find 492200999/494634809
Apr 23 15:31:20 kernel: Free swap = 10471368kB
Apr 23 15:31:20 kernel: Total swap = 10485040kB
Apr 23 15:31:20 kernel: 1572863 pages RAM
Apr 23 15:31:20 kernel: 80710 pages reserved
Apr 23 15:31:20 kernel: 2926158 pages shared
Apr 23 15:31:20 kernel: 1059316 pages non-shared
Apr 23 15:31:20 kernel: [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name

Apr 23 15:31:20 kernel: Out of memory: Kill process 1357 (oninit) score 99 or sacrifice child
Apr 23 15:31:20 kernel: Killed process 1357, UID 501, (oninit) total-vm:2056000kB, anon-rss:648kB, file-rss:1637852kB
Apr 23 15:31:23 kernel: Pid 22322(fglrun-bin) over core_pipe_limit
Apr 23 15:31:23 kernel: Skipping core dump
Apr 23 15:31:23 kernel: Pid 21487(fglrun-bin) over core_pipe_limit
Apr 23 15:31:23 kernel: Skipping core dump

My expectation (and it would seem a reasonable one) is that the oom killer would free cache or use swap before killing processes.

At a swappiness of 1 we have lots of wait I/O and the system is slightly sluggish due to swap; at a swappiness of 0 the wait I/O disappears (except for occasional processes which read large files), and we prefer that.

There seems to be plenty of free RAM on the system; the problem is that the kernel is not managing it well. In particular, my understanding is that cache is supposed to use memory only while it is otherwise available, and is to be freed immediately when memory is required.

Additionally, my understanding is that a swappiness of 0 does not turn swap off - it just says to swap only when needed. Hence the oom killer should not kill; it should swap.
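
For reference, the tunables in question can be inspected directly under /proc/sys/vm - a minimal sketch (the file paths are the standard sysctl locations; this is just for illustration, not part of our test setup):

#!/usr/bin/env python
# Minimal sketch: print the VM tunables discussed above. Reading the files
# under /proc/sys/vm is equivalent to `sysctl vm.swappiness` etc.
for name in ('swappiness', 'overcommit_memory', 'overcommit_ratio'):
    with open('/proc/sys/vm/' + name) as f:
        print('vm.%s = %s' % (name, f.read().strip()))

# Changing a value at runtime requires root, e.g. to set swappiness back to 1:
# with open('/proc/sys/vm/swappiness', 'w') as f:
#     f.write('1')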


Steps To Reproduce: We have not tried to reproduce this, but believe it happened during memory testing and that we could reproduce it. It involves building a test system with our standard config, setting swappiness to 0, and running a memory-hog program.

We are willing to provide a system with a reproduction like this if appropriate.
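
For illustration, a minimal memory-hog sketch of the kind mentioned above (the 64 MiB chunk size and the default target are arbitrary choices, not our actual test program):

#!/usr/bin/env python
# Minimal memory-hog sketch: allocate zero-filled 64 MiB chunks until the
# requested total (in MiB, default 8192) is reached, allocation fails, or
# the OOM killer steps in. bytearray() zero-fills, so the pages are touched.
import sys

CHUNK = 64 * 1024 * 1024                      # 64 MiB per step (arbitrary)
target_mb = int(sys.argv[1]) if len(sys.argv) > 1 else 8192

hog = []
allocated = 0
while allocated < target_mb * 1024 * 1024:
    hog.append(bytearray(CHUNK))
    allocated += CHUNK
    print('allocated %d MiB' % (allocated // (1024 * 1024)))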
Tags: No tags attached.
Attached Files


Notes

~0019671

StephenSykes (reporter)

OK, apologies from the newbie - my application db server crashing is not a "crash" with respect to this bug report; no, the OS did not crash, so the severity should be major, not crash. I don't seem to be able to make edits, so I don't appear to be able to adjust the severity.

~0020579

infranetworking (reporter)

Hello,

Our server is crashing sometimes and we always see these errors in /var/log/messages:

Aug 2 17:43:23 www kernel: Out of memory: Kill process 17707 (httpd) score 2 or sacrifice child
Aug 2 17:43:23 www kernel: Killed process 17707, UID 48, (httpd) total-vm:580180kB, anon-rss:33632kB, file-rss:40kB
Aug 2 17:43:26 www kernel: httpd invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
Aug 2 17:43:27 www kernel: httpd cpuset=/ mems_allowed=0-1
Aug 2 17:43:27 www kernel: Pid: 18054, comm: httpd Not tainted 2.6.32-431.17.1.el6.x86_64 #1
Aug 2 17:43:27 www kernel: Call Trace:
Aug 2 17:43:27 www kernel: [<ffffffff810d0211>] ? cpuset_print_task_mems_allowed+0x91/0xb0
Aug 2 17:43:27 www kernel: [<ffffffff811225c0>] ? dump_header+0x90/0x1b0
Aug 2 17:43:27 www kernel: [<ffffffff8122761c>] ? security_real_capable_noaudit+0x3c/0x70
Aug 2 17:43:27 www kernel: [<ffffffff81122a42>] ? oom_kill_process+0x82/0x2a0
Aug 2 17:43:27 www kernel: [<ffffffff81122981>] ? select_bad_process+0xe1/0x120
Aug 2 17:43:27 www kernel: [<ffffffff81122e80>] ? out_of_memory+0x220/0x3c0
Aug 2 17:43:27 www kernel: [<ffffffff8112f79f>] ? __alloc_pages_nodemask+0x89f/0x8d0
Aug 2 17:43:27 www kernel: [<ffffffff8116779a>] ? alloc_pages_vma+0x9a/0x150
Aug 2 17:43:27 www kernel: [<ffffffff8114a89d>] ? handle_pte_fault+0x73d/0xb00
Aug 2 17:43:27 www kernel: [<ffffffffa0075409>] ? ext3_check_acl+0x29/0x90 [ext3]
Aug 2 17:43:27 www kernel: [<ffffffff81078f37>] ? current_fs_time+0x27/0x30
Aug 2 17:43:27 www kernel: [<ffffffff8114ae8a>] ? handle_mm_fault+0x22a/0x300
Aug 2 17:43:27 www kernel: [<ffffffff8114ef26>] ? vma_adjust+0x556/0x5e0
Aug 2 17:43:27 www kernel: [<ffffffff8104a8d8>] ? __do_page_fault+0x138/0x480
Aug 2 17:43:27 www kernel: [<ffffffff8114f4fa>] ? vma_merge+0x29a/0x3e0
Aug 2 17:43:27 www kernel: [<ffffffff8114f90b>] ? __vm_enough_memory+0x3b/0x190
Aug 2 17:43:27 www kernel: [<ffffffff8115094c>] ? do_brk+0x26c/0x350
Aug 2 17:43:27 www kernel: [<ffffffff8152d73e>] ? do_page_fault+0x3e/0xa0
Aug 2 17:43:27 www kernel: [<ffffffff8152aaf5>] ? page_fault+0x25/0x30


Do you think our server is affected by this bug?

Thanks.

~0020591

StephenSykes (reporter)

There's not sufficient information here. The oom-killer is a "normal" part of the kernel and sacrifices processes under low-memory conditions to save the kernel itself. Its heuristics are arguably awful under some circumstances - I'd argue buggy - but as you can see, no one has taken ownership of this bug, and in fact the OOM code saw major rework for the 3.x kernels, so dealing with the behavior is part of life.

This conversation really belongs elsewhere, but I'll give you the Cliff's Notes:

You need to dig into exactly how memory is being used on your system. When overcommit levels are reached (see CommitLimit and Committed_AS in /proc/meminfo - google these), or RAM is full (even if it is used primarily as cache, and even if there's plenty of swap), the heuristics may well kill stuff. The approach is to get stuff out of cache and to get swap used sooner (keeping memory free). I recommend vmtouch - google it - to see what's in cache, tracking overcommit (you may need to write a script that walks /proc/<pid>/smaps and adds up each pid's component sizes, which is a rough analog to commit), and learning and understanding where your memory is going. You may simply need more RAM, you may need to tune so you use swap more aggressively, or so you don't use cache so heavily.
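
A rough sketch of the kind of smaps-summing script described above (assuming readable /proc/<pid>/smaps; it sums the standard Size field and compares the total against CommitLimit / Committed_AS from /proc/meminfo - only an approximation of the kernel's own accounting):

#!/usr/bin/env python
# Rough commit estimate: sum the Size field from every /proc/<pid>/smaps and
# compare the total with CommitLimit / Committed_AS from /proc/meminfo.
import glob

def sum_field_kb(path, field):
    total = 0
    try:
        with open(path) as f:
            for line in f:
                if line.startswith(field + ':'):
                    total += int(line.split()[1])     # values are in kB
    except (IOError, OSError):
        pass                                          # process exited or access denied
    return total

grand_total = 0
for smaps in glob.glob('/proc/[0-9]*/smaps'):
    pid = smaps.split('/')[2]
    kb = sum_field_kb(smaps, 'Size')
    grand_total += kb
    print('%8s %12d kB' % (pid, kb))

meminfo = {}
with open('/proc/meminfo') as f:
    for line in f:
        parts = line.split()
        if len(parts) >= 2:
            meminfo[parts[0].rstrip(':')] = parts[1]

print('summed mapping Size : %12d kB' % grand_total)
print('Committed_AS        : %12s kB' % meminfo.get('Committed_AS', '?'))
print('CommitLimit         : %12s kB' % meminfo.get('CommitLimit', '?'))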

This is a complex and messy subject, and something Linux does quite poorly under some circumstances.

Issue History
Date Modified Username Field Change
2014-04-24 17:32 StephenSykes New Issue
2014-04-24 17:35 StephenSykes Note Added: 0019671
2014-08-04 00:40 infranetworking Note Added: 0020579
2014-08-05 11:34 StephenSykes Note Added: 0020591