View Issue Details

ID: 0015715
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2019-08-19 02:25
Reporter: wangyumu
Priority: normal
Severity: major
Reproducibility: sometimes
Status: new
Resolution: open
Product Version: 7.4.1708
Target Version:
Fixed in Version:
Summary: 0015715: process hangs in shrink_inactive_list when the cgroup memory limit is reached
Description: We found that some filebeat processes stop working with the following stack:
[<ffffffff810f5126>] futex_wait_queue_me+0xc6/0x130
[<ffffffff810f5deb>] futex_wait+0x17b/0x280
[<ffffffff810f7b26>] do_futex+0x106/0x5a0
[<ffffffff810f8040>] SyS_futex+0x80/0x180
[<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

If you kill the process, it only goes into the <defunct> state.
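
For reference, the stack above is what procfs reports for the stuck thread; a minimal way to collect the same data, assuming PID 34257 from the log below:

# state and kernel-side stack of the suspect process (PID taken from this report)
pid=34257
grep -E '^(Name|State)' /proc/$pid/status
cat /proc/$pid/stack          # requires root; shows where the task is blocked in the kernel
cat /proc/$pid/task/*/stack   # per-thread stacks, useful for multi-threaded processes such as filebeat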

In dmesg there is the following info:
Jan 15 14:30:57 kernel: INFO: task filebeat:34257 blocked for more than 120 seconds.
Jan 15 14:30:57 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 15 14:30:57 kernel: filebeat D ffffffff816a75f0 0 34257 30603 0x00000080
Jan 15 14:30:57 kernel: ffff882b012aafe0 0000000000000086 ffff883b93e4eeb0 ffff882b012abfd8
Jan 15 14:30:57 kernel: ffff882b012abfd8 ffff882b012abfd8 ffff883b93e4eeb0 ffff881fbfc16cc0
Jan 15 14:30:57 kernel: 0000000000000000 7fffffffffffffff ffff88203ff958e8 ffffffff816a75f0
Jan 15 14:30:57 kernel: Call Trace:
Jan 15 14:30:57 kernel: [<ffffffff816a75f0>] ? bit_wait+0x50/0x50
Jan 15 14:30:57 kernel: [<ffffffff816a94c9>] schedule+0x29/0x70
Jan 15 14:30:57 kernel: [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
Jan 15 14:30:57 kernel: [<ffffffff811de581>] ? __slab_free+0x81/0x2f0
Jan 15 14:30:57 kernel: [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
Jan 15 14:30:57 kernel: [<ffffffff810e939c>] ? ktime_get_ts64+0x4c/0xf0
Jan 15 14:30:57 kernel: [<ffffffff816a75f0>] ? bit_wait+0x50/0x50
Jan 15 14:30:57 kernel: [<ffffffff816a8b4d>] io_schedule_timeout+0xad/0x130
Jan 15 14:30:57 kernel: [<ffffffff816a8be8>] io_schedule+0x18/0x20
Jan 15 14:30:57 kernel: [<ffffffff816a7601>] bit_wait_io+0x11/0x50
Jan 15 14:30:57 kernel: [<ffffffff816a7125>] __wait_on_bit+0x65/0x90
Jan 15 14:30:57 kernel: [<ffffffff812361f2>] ? free_buffer_head+0x22/0x50
Jan 15 14:30:57 kernel: [<ffffffff81181cb1>] wait_on_page_bit+0x81/0xa0
Jan 15 14:30:57 kernel: [<ffffffff810b19d0>] ? wake_bit_function+0x40/0x40
Jan 15 14:30:57 kernel: [<ffffffff811965c2>] shrink_page_list+0x842/0xad0
Jan 15 14:30:57 kernel: [<ffffffff81196e7a>] shrink_inactive_list+0x1ca/0x5d0
Jan 15 14:30:57 kernel: [<ffffffff81197975>] shrink_lruvec+0x385/0x730
Jan 15 14:30:57 kernel: [<ffffffffc0280400>] ? kmem_zone_alloc+0x60/0x130 [xfs]
Jan 15 14:30:57 kernel: [<ffffffff81197d96>] shrink_zone+0x76/0x1a0
Jan 15 14:30:57 kernel: [<ffffffff811982a0>] do_try_to_free_pages+0xf0/0x4e0
Jan 15 14:30:57 kernel: [<ffffffff8133d9dd>] ? list_del+0xd/0x30
Jan 15 14:30:57 kernel: [<ffffffff811988ea>] try_to_free_mem_cgroup_pages+0xda/0x170
Jan 15 14:30:57 kernel: [<ffffffff811f386e>] mem_cgroup_reclaim+0x4e/0x120
Jan 15 14:30:57 kernel: [<ffffffff811f3e2c>] __mem_cgroup_try_charge+0x4ec/0x670
Jan 15 14:30:57 kernel: [<ffffffff811f6126>] __memcg_kmem_newpage_charge+0xf6/0x180
Jan 15 14:30:57 kernel: [<ffffffff8118cbad>] __alloc_pages_nodemask+0x22d/0x420
Jan 15 14:30:57 kernel: [<ffffffff811d1108>] ? alloc_pages_current+0x98/0x110
Jan 15 14:30:57 kernel: [<ffffffff811d1108>] alloc_pages_current+0x98/0x110
Jan 15 14:30:57 kernel: [<ffffffff811dbe7c>] new_slab+0x2fc/0x310
Jan 15 14:30:57 kernel: [<ffffffff811dd70c>] ___slab_alloc+0x3ac/0x4f0
Jan 15 14:30:57 kernel: [<ffffffffc0280437>] ? kmem_zone_alloc+0x97/0x130 [xfs]
Jan 15 14:30:57 kernel: [<ffffffff811df6a3>] ? kmem_cache_alloc+0x193/0x1e0
Jan 15 14:30:57 kernel: [<ffffffff81185455>] ? mempool_alloc_slab+0x15/0x20
Jan 15 14:30:57 kernel: [<ffffffffc0280437>] ? kmem_zone_alloc+0x97/0x130 [xfs]
Jan 15 14:30:57 kernel: [<ffffffff816a10ce>] __slab_alloc+0x40/0x5c
Jan 15 14:30:57 kernel: [<ffffffff811df6a3>] kmem_cache_alloc+0x193/0x1e0
Jan 15 14:30:57 kernel: [<ffffffffc0280437>] ? kmem_zone_alloc+0x97/0x130 [xfs]
Jan 15 14:30:57 kernel: [<ffffffffc0280437>] kmem_zone_alloc+0x97/0x130 [xfs]
Jan 15 14:30:57 kernel: [<ffffffffc027f99d>] xfs_trans_alloc+0x6d/0x140 [xfs]
Jan 15 14:30:57 kernel: [<ffffffffc02593de>] xfs_setfilesize_trans_alloc.isra.14+0x3e/0x90 [xfs]
Jan 15 14:30:57 kernel: [<ffffffffc0259638>] xfs_submit_ioend.isra.16+0xb8/0xe0 [xfs]
Jan 15 14:30:57 kernel: [<ffffffffc0259e44>] xfs_do_writepage+0x4c4/0x530 [xfs]
Jan 15 14:30:57 kernel: [<ffffffff8118dc91>] write_cache_pages+0x251/0x4d0
Jan 15 14:30:57 kernel: [<ffffffffc0259980>] ? xfs_aops_discard_page+0x150/0x150 [xfs]
Jan 15 14:30:57 kernel: [<ffffffffc025978d>] xfs_vm_writepages+0xbd/0xe0 [xfs]
Jan 15 14:30:57 kernel: [<ffffffff8118effe>] do_writepages+0x1e/0x40
Jan 15 14:30:57 kernel: [<ffffffff81183c55>] __filemap_fdatawrite_range+0x65/0x80
Jan 15 14:30:57 kernel: [<ffffffff81183da1>] filemap_write_and_wait_range+0x41/0x90
Jan 15 14:30:57 kernel: [<ffffffffc0264b16>] xfs_file_fsync+0x66/0x1e0 [xfs]
Jan 15 14:30:57 kernel: [<ffffffff8123331d>] generic_write_sync+0x4d/0x60
Jan 15 14:30:57 kernel: [<ffffffffc0266d55>] xfs_file_aio_write+0x155/0x1b0 [xfs]
Jan 15 14:30:57 kernel: [<ffffffff8120026d>] do_sync_write+0x8d/0xd0
Jan 15 14:30:57 kernel: [<ffffffff81200d2d>] vfs_write+0xbd/0x1e0
Jan 15 14:30:57 kernel: [<ffffffff81201b3f>] SyS_write+0x7f/0xe0
Jan 15 14:30:57 kernel: [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
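
The same kind of trace can be dumped on demand for all blocked tasks via sysrq-w (a sketch; requires root and sysrq enabled):

# dump the stacks of all uninterruptible (D state) tasks into the kernel log
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger
dmesg | tail -n 300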

The filebeat process runs inside a Docker container with a 2 GB memory limit, and our monitoring system shows that when this happened the container's memory usage was around 1.95 GB.
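
A rough sketch of how to check the usage against the limit from the cgroup v1 memory controller (the cgroup path is an assumption; the real path depends on the Docker cgroup parent and container ID):

cg=/sys/fs/cgroup/memory/docker/<container-id>   # hypothetical path
cat $cg/memory.limit_in_bytes    # should show the 2G limit
cat $cg/memory.usage_in_bytes    # was around 1.95G when the hang happened
head -n 20 $cg/memory.stat       # cache/rss breakdown charged to the container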

The filebeat process is under heavy load, with about 5000 log files open for collection, and it uses a lot of filesystem cache. If you run sync, the sync process also enters D state with this stack:

[<ffffffff81181cb1>] wait_on_page_bit+0x81/0xa0
[<ffffffff81181de1>] __filemap_fdatawait_range+0x111/0x190
[<ffffffff81184bd7>] filemap_fdatawait_keep_errors+0x27/0x30
[<ffffffff8122dd1d>] sync_inodes_sb+0x16d/0x1f0
[<ffffffff81233349>] sync_inodes_one_sb+0x19/0x20
[<ffffffff81204ea1>] iterate_supers+0xc1/0x120
[<ffffffff81233614>] sys_sync+0x44/0xb0
[<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
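
To see which other tasks are stuck the same way, D-state processes and their kernel stacks can be listed like this (a sketch, plain procps plus procfs):

# list uninterruptible (D state) tasks with their wait channel
ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# dump the kernel stack of each of them (requires root)
for p in $(ps -eo pid,stat | awk '$2 ~ /^D/ {print $1}'); do
    echo "== pid $p =="; cat /proc/$p/stack
done
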
Steps To Reproduce: This occurred on five nodes in our cluster (more than 20 nodes in total), but we cannot reproduce it on demand.
Tags: cgroup, defunct, hang, sync

Activities

wangyumu   2019-01-16 21:02   reporter   ~0033609

OS:
Linux tb-ops01 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

sysctl:
vm.admin_reserve_kbytes = 8192
vm.block_dump = 0
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 50
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 50
vm.dirty_writeback_centisecs = 360000
vm.drop_caches = 1
vm.extfrag_threshold = 500
vm.hugepages_treat_as_movable = 0
vm.hugetlb_shm_group = 0
vm.laptop_mode = 0
vm.legacy_va_layout = 0
vm.lowmem_reserve_ratio = 256 256 32
vm.max_map_count = 655360
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
vm.min_free_kbytes = 4096000
vm.min_slab_ratio = 5
vm.min_unmapped_ratio = 1
vm.mmap_min_addr = 4096
vm.nr_hugepages = 0
vm.nr_hugepages_mempolicy = 0
vm.nr_overcommit_hugepages = 0
vm.nr_pdflush_threads = 0
vm.numa_zonelist_order = default
vm.oom_dump_tasks = 1
vm.oom_kill_allocating_task = 0
vm.overcommit_kbytes = 0
vm.overcommit_memory = 1
vm.overcommit_ratio = 50
vm.page-cluster = 3
vm.panic_on_oom = 0
vm.percpu_pagelist_fraction = 0
vm.stat_interval = 1
vm.swappiness = 0
vm.user_reserve_kbytes = 131072
vm.vfs_cache_pressure = 100
vm.zone_reclaim_mode = 0
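
For context, the writeback-related values above can be captured together with the current global dirty-page counters like this (a sketch, nothing container specific):

sysctl vm.dirty_ratio vm.dirty_background_ratio vm.dirty_writeback_centisecs vm.dirty_expire_centisecs
grep -E '^(Dirty|Writeback|NFS_Unstable):' /proc/meminfo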

slabtop:
 Active / Total Objects (% used) : 46752646 / 59346525 (78.8%)
 Active / Total Slabs (% used) : 1356757 / 1356757 (100.0%)
 Active / Total Caches (% used) : 99 / 129 (76.7%)
 Active / Total Size (% used) : 15687900.45K / 18818904.98K (83.4%)
 Minimum / Average / Maximum Object : 0.01K / 0.32K / 8.00K

  OBJS ACTIVE USE OBJ SIZE SLABS OBJ/SLAB CACHE SIZE NAME
17319666 16053457 92% 0.19K 412377 42 3299016K dentry
14000427 13995290 99% 0.64K 285723 49 9143136K proc_inode_cache
7833345 6638647 84% 0.10K 200855 39 803420K buffer_head
2655680 1201683 45% 0.06K 41495 64 165980K kmalloc-64
2582912 311329 12% 0.03K 20179 128 80716K kmalloc-32
1936816 993571 51% 0.57K 69295 28 1108720K radix_tree_node
1687663 751879 44% 0.16K 33095 51 264760K sigqueue
1551780 1551780 100% 0.11K 43105 36 172420K kernfs_node_cache
1343206 615376 45% 0.94K 39589 34 1266848K xfs_inode
966420 211045 21% 0.19K 23010 42 184080K kmalloc-192
858176 280943 32% 0.25K 26818 32 214544K kmalloc-256
855204 329172 38% 0.09K 20362 42 81448K kmalloc-96
833520 74554 8% 0.38K 19848 42 317568K mnt_cache
609984 110922 18% 0.50K 19062 32 304992K kmalloc-512
554260 549404 99% 0.21K 14980 37 119840K vm_area_struct
490467 483529 98% 0.08K 9617 51 38468K anon_vma
486944 231524 47% 0.12K 15217 32 60868K kmalloc-128
423680 422682 99% 0.02K 1655 256 6620K kmalloc-16
349544 147647 42% 1.00K 10925 32 349600K kmalloc-1024
295625 161844 54% 0.58K 5375 55 172000K inode_cache
179200 179200 100% 0.01K 350 512 1400K kmalloc-8
136850 136850 100% 0.02K 805 170 3220K scsi_data_buffer
131327 130745 99% 0.05K 1799 73 7196K fanotify_event_info
 76992 76921 99% 0.06K 1203 64 4812K kmem_cache_node
 74001 73793 99% 0.62K 1451 51 46432K sock_inode_cache
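
The snapshot above can be reproduced non-interactively if more samples are needed (a sketch, standard procps slabtop):

slabtop -o -s c | head -n 30     # one-shot dump, sorted by cache size
head -n 5 /proc/slabinfo         # raw counters behind slabtop (root only)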

xiaobao   2019-08-19 02:25   reporter   ~0034997

Hi,
Did you manage to fix this issue?
We have had a similar issue for about two weeks. We run a MySQL backup program in docker-ce-18.09 containers on CentOS 7.4 (kernel-3.10.0-693.el7 x86_64). In some containers the backup program runs for a very long time and hangs, while in other containers it runs normally and finishes. If we kill the hanging program, it becomes defunct! In the OS messages of some hosts we have seen the same call trace logs.

calltrace-1.png (419,901 bytes)

Issue History

Date Modified Username Field Change
2019-01-16 20:56 wangyumu New Issue
2019-01-16 20:56 wangyumu Tag Attached: hang
2019-01-16 20:56 wangyumu Tag Attached: cgroup
2019-01-16 20:56 wangyumu Tag Attached: defunct
2019-01-16 20:56 wangyumu Tag Attached: sync
2019-01-16 21:02 wangyumu Note Added: 0033609
2019-08-19 02:25 xiaobao File Added: calltrace-1.png
2019-08-19 02:25 xiaobao Note Added: 0034997