View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0017982 | CentOS-7 | kernel | public | 2020-12-29 04:19 | 2020-12-29 09:43 |
Reporter | Jiewei Ke | Assigned To | |||
Priority | urgent | Severity | crash | Reproducibility | always |
Status | new | Resolution | open | ||
Product Version | 7.7-1908 | ||||
Summary | 0017982: ext4/jbd2 hang forever due to deadlock when OOM occurs | ||||
Description | When the memory usage of a cgroup exceeds the limit, OOM killer is triggered and ext4/jbd2 will hang forever due to deadlock. All tasks doing I/O on the file system will be impacted, and we have to reboot the host to recover. This issue can be reproduced for CentOS7 with kernel version after 3.10.0-693.el7. From analysis on the vmcore, I found the kthread jbd2/md127-8 which blocked all other transactions was waiting for the t_updates to decrease to zero (will be done by ext4_journal_stop() ), PID: 587 TASK: ffff9f53f8c31070 CPU: 5 COMMAND: "jbd2/md127-8" #0 [ffff9f53ff35bbf0] __schedule at ffffffffb139e420 #1 [ffff9f53ff35bc88] schedule at ffffffffb139eb09 #2 [ffff9f53ff35bc98] jbd2_journal_commit_transaction at ffffffffc01fe8b0 [jbd2] #3 [ffff9f53ff35be40] kjournald2 at ffffffffc0205c5e [jbd2] #4 [ffff9f53ff35bec8] kthread at ffffffffb0cc6bd1 while the IO task "fxck-with-write" was looping on the __getblk() waiting for memory, but had no luck to get any since the memory usage of the cgroup had exceeded the limit; ext4_journal_stop() would be called after __getblk(); crash> bt 27225 PID: 27225 TASK: ffff9f533a6d3150 CPU: 4 COMMAND: "fxck-with-write" #0 [ffff9f53c234f900] __schedule at ffffffffb139e420 #1 [ffff9f53c234f990] sys_sched_yield at ffffffffb0cdafdc #2 [ffff9f53c234f9b0] yield at ffffffffb139fca2 #3 [ffff9f53c234f9c8] free_more_memory at ffffffffb0e8ee2c #4 [ffff9f53c234fa08] __getblk at ffffffffb0e90013 #5 [ffff9f53c234fa70] __ext4_get_inode_loc at ffffffffc03d358f [ext4] #6 [ffff9f53c234fad8] ext4_reserve_inode_write at ffffffffc03d8e62 [ext4] #7 [ffff9f53c234fb08] ext4_mark_inode_dirty at ffffffffc03d8f23 [ext4] #8 [ffff9f53c234fb60] ext4_ext_tree_init at ffffffffc04038ba [ext4] #9 [ffff9f53c234fb70] __ext4_new_inode at ffffffffc03d271b [ext4] #10 [ffff9f53c234fc18] ext4_create at ffffffffc03e3fb3 [ext4] #11 [ffff9f53c234fc90] vfs_create at ffffffffb0e62aa8 #12 [ffff9f53c234fcc8] do_last at ffffffffb0e67e5f #13 [ffff9f53c234fd70] path_openat at ffffffffb0e681e0 #14 [ffff9f53c234fe08] do_filp_open at ffffffffb0e69d6d #15 [ffff9f53c234fee0] do_sys_open at ffffffffb0e5442e #16 [ffff9f53c234ff40] sys_openat at ffffffffb0e54504 #17 [ffff9f53c234ff50] system_call_fastpath at ffffffffb13ac81e OOM-killer was attempting to kill the other "fxck-with-write" to release memory, but the task was in uninterruptable state waiting for jbd2 transaction commit. crash> bt 27153 PID: 27153 TASK: ffff9f53fc35a0e0 CPU: 5 COMMAND: "fxck-with-write" #0 [ffff9f526e04fd70] __schedule at ffffffffb139e420 #1 [ffff9f526e04fe08] schedule at ffffffffb139eb09 #2 [ffff9f526e04fe18] jbd2_log_wait_commit at ffffffffc02054a5 [jbd2] #3 [ffff9f526e04fe98] jbd2_complete_transaction at ffffffffc0206e21 [jbd2] #4 [ffff9f526e04feb8] ext4_sync_file at ffffffffc03cff60 [ext4] #5 [ffff9f526e04ff00] do_fsync at ffffffffb0e8ccf7 #6 [ffff9f526e04ff40] sys_fsync at ffffffffb0e8cff0 #7 [ffff9f526e04ff50] system_call_fastpath at ffffffffb13ac81e So the deadlock happened: task A was waiting for memory before transaction commit, and task B was waiting for A's transaction commit before freeing more memory by OOM-killer. By code diff between 3.10.0-693.el7.centos and 3.10.0-862.el7.centos, I found the culprit, NOFAIL flag was unexpectedly removed after mapping_gfp_constraint() is called in __add_to_page_cache_locked(): static int __add_to_page_cache_locked(struct page *page, struct address_space *mapping, pgoff_t offset, gfp_t gfp_mask, void **shadowp) { int error; VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); gfp_mask = mapping_gfp_constraint(mapping, gfp_mask); <<<< this line is added by 3.10.0-862.el7.centos error = mem_cgroup_cache_charge(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); if (error) goto out; By investigating the change history of upstream code, I tend to regard it as a mistaken backport issue for the following patches, mm: allow GFP_{FS,IO} for page_cache_read page cache allocation mm, fs: introduce mapping_gfp_constraint() mm: do not ignore mapping_gfp_mask in page cache allocation paths We proposed two patches to fix this issue (attached below, the 0002 should be must-have), feel free to let me know your comments and views. Thanks, Jiewei | ||||
Steps To Reproduce | Steps to reproduce it: Download the attached ext4-repro-with-multiple-cgroup.sh and ext4-repro-with-write, and run "bash ext4-repro-with-multiple-cgroup.sh". It will do the following things, 1. Prepare a test binary, named 'ext4-repro-with-write', which will open a file(located on an ext4 fs) and keep writing to it; see the attached ext4-repro-with-write.go; 2. Create 10 cgroups, each with memory limit size 100 MB; 3. For each cgroup, create 30 services to run the ext4-repro-with-write binary; see the; 4. We can reproduce it in 10 mins, the ext4 will hang; other tasks doing I/O on the same ext4 fs will hang also. 5. Run "systemctl stop ext4-repro-cgroup*-*" to stop the script. | ||||
Tags | No tags attached. | ||||
abrt_hash | |||||
URL | |||||
has duplicate | 0017981 | closed | ext4/jbd2 hang forever due to deadlock when OOM occurs |
0001-buffer-crossport-patches-to-ensure-grow_dev_page-wil.patch (6,471 bytes)
From 4b64f5a2fdeee9589572de9ddd44b8f8134777e4 Mon Sep 17 00:00:00 2001 From: Jiewei Ke <jiewei@smartx.com> Date: Mon, 21 Dec 2020 20:57:27 -0500 Subject: [PATCH 1/2] buffer: crossport patches to ensure grow_dev_page() will always use GFP_NOFAIL Following patches are crossported to 3.10.0, bc48f001de12 buffer: eliminate the need to call free_more_memory() in getblk_slow() 94dc24c0c59a buffer: grow_dev_page() should use GFP_NOFAIL for all cases 640ab98fb362 buffer: have alloc_page_buffers() use __GFP_NOFAIL --- drivers/md/md-bitmap.c | 2 +- fs/buffer.c | 62 ++++++++------------------------------------- fs/ntfs/aops.c | 2 +- fs/ntfs/mft.c | 2 +- include/linux/buffer_head.h | 2 +- 5 files changed, 15 insertions(+), 55 deletions(-) diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c index f0b5deb..aee6b00 100644 --- a/drivers/md/md-bitmap.c +++ b/drivers/md/md-bitmap.c @@ -351,7 +351,7 @@ static int read_page(struct file *file, unsigned long index, pr_debug("read bitmap file (%dB @ %llu)\n", (int)PAGE_SIZE, (unsigned long long)index << PAGE_SHIFT); - bh = alloc_page_buffers(page, 1<<inode->i_blkbits, 0); + bh = alloc_page_buffers(page, 1<<inode->i_blkbits, false); if (!bh) { ret = -ENOMEM; goto out; diff --git a/fs/buffer.c b/fs/buffer.c index ccb6790..0cda816 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -254,27 +254,6 @@ out: } /* - * Kick the writeback threads then try to free up some ZONE_NORMAL memory. - */ -static void free_more_memory(void) -{ - struct zone *zone; - int nid; - - wakeup_flusher_threads(1024, WB_REASON_FREE_MORE_MEM); - yield(); - - for_each_online_node(nid) { - (void)first_zones_zonelist(node_zonelist(nid, GFP_NOFS), - gfp_zone(GFP_NOFS), NULL, - &zone); - if (zone) - try_to_free_pages(node_zonelist(nid, GFP_NOFS), 0, - GFP_NOFS, NULL); - } -} - -/* * I/O completion handler for block_read_full_page() - pages * which come unlocked at the end of I/O. */ @@ -829,16 +808,19 @@ int remove_inode_buffers(struct inode *inode) * which may not fail from ordinary buffer allocations. */ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, - int retry) + bool retry) { struct buffer_head *bh, *head; + gfp_t gfp = GFP_NOFS; long offset; -try_again: + if (retry) + gfp |= __GFP_NOFAIL; + head = NULL; offset = PAGE_SIZE; while ((offset -= size) >= 0) { - bh = alloc_buffer_head(GFP_NOFS); + bh = alloc_buffer_head(gfp); if (!bh) goto no_grow; @@ -864,23 +846,7 @@ no_grow: } while (head); } - /* - * Return failure for non-async IO requests. Async IO requests - * are not allowed to fail, so we have to wait until buffer heads - * become available. But we don't want tasks sleeping with - * partially complete buffers, so all were released above. - */ - if (!retry) - return NULL; - - /* We're _really_ low on memory. Now we just - * wait for old buffer heads to become free due to - * finishing IO. Since this is an async request and - * the reserve list is empty, we're sure there are - * async buffer heads in use. - */ - free_more_memory(); - goto try_again; + return NULL; } EXPORT_SYMBOL_GPL(alloc_page_buffers); @@ -969,8 +935,6 @@ grow_dev_page(struct block_device *bdev, sector_t block, gfp_mask |= __GFP_NOFAIL; page = find_or_create_page(inode->i_mapping, index, gfp_mask); - if (!page) - return ret; BUG_ON(!PageLocked(page)); @@ -988,9 +952,7 @@ grow_dev_page(struct block_device *bdev, sector_t block, /* * Allocate some buffers for this page */ - bh = alloc_page_buffers(page, size, 0); - if (!bh) - goto failed; + bh = alloc_page_buffers(page, size, true); /* * Link the page to the buffers and initialise them. Take the @@ -1070,8 +1032,6 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size) ret = grow_buffers(bdev, block, size); if (ret < 0) return NULL; - if (ret == 0) - free_more_memory(); } } @@ -1537,7 +1497,7 @@ void create_empty_buffers(struct page *page, { struct buffer_head *bh, *head, *tail; - head = alloc_page_buffers(page, blocksize, 1); + head = alloc_page_buffers(page, blocksize, true); bh = head; do { bh->b_state |= b_state; @@ -2586,7 +2546,7 @@ int nobh_write_begin(struct address_space *mapping, * Be careful: the buffer linked list is a NULL terminated one, rather * than the circular one we're used to. */ - head = alloc_page_buffers(page, blocksize, 0); + head = alloc_page_buffers(page, blocksize, false); if (!head) { ret = -ENOMEM; goto out_release; @@ -2866,7 +2826,7 @@ int block_truncate_page(struct address_space *mapping, length = blocksize - length; iblock = (sector_t)index << (PAGE_CACHE_SHIFT - inode->i_blkbits); - + page = grab_cache_page(mapping, index); err = -ENOMEM; if (!page) diff --git a/fs/ntfs/aops.c b/fs/ntfs/aops.c index fa9c05f..9cb0166 100644 --- a/fs/ntfs/aops.c +++ b/fs/ntfs/aops.c @@ -1599,7 +1599,7 @@ void mark_ntfs_record_dirty(struct page *page, const unsigned int ofs) { spin_lock(&mapping->private_lock); if (unlikely(!page_has_buffers(page))) { spin_unlock(&mapping->private_lock); - bh = head = alloc_page_buffers(page, bh_size, 1); + bh = head = alloc_page_buffers(page, bh_size, true); spin_lock(&mapping->private_lock); if (likely(!page_has_buffers(page))) { struct buffer_head *tail; diff --git a/fs/ntfs/mft.c b/fs/ntfs/mft.c index 3014a36..9db0e31 100644 --- a/fs/ntfs/mft.c +++ b/fs/ntfs/mft.c @@ -506,7 +506,7 @@ int ntfs_sync_mft_mirror(ntfs_volume *vol, const unsigned long mft_no, if (unlikely(!page_has_buffers(page))) { struct buffer_head *tail; - bh = head = alloc_page_buffers(page, blocksize, 1); + bh = head = alloc_page_buffers(page, blocksize, true); do { set_buffer_uptodate(bh); tail = bh; diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index ab09b3c..d470df0 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -157,7 +157,7 @@ void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset); int try_to_free_buffers(struct page *); struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size, - int retry); + bool retry); void create_empty_buffers(struct page *, unsigned long, unsigned long b_state); void end_buffer_read_sync(struct buffer_head *bh, int uptodate); -- 2.11.0 0002-fs-mm-avoid-unexpected-NOFAIL-removal-in-__add_to_pa.patch (2,440 bytes)
From 7dc5fb5fd28afbc09736025ca42f0e15aefe719f Mon Sep 17 00:00:00 2001 From: Jiewei Ke <jiewei@smartx.com> Date: Mon, 28 Dec 2020 01:56:41 -0500 Subject: [PATCH 2/2] fs, mm: avoid unexpected NOFAIL removal in __add_to_page_cache_locked NOFAIL flag may be unexpectedly removed after mapping_gfp_constraint() is called in __add_to_page_cache_locked(), which may cause critical issues since upper layer cannot handle it. For example, a deadlock in ext4 may happen like this: when the memory usage is above the the cgroup limit, ext4 fs may loop forever in __getblk_slow() waiting for free memory during transaction commit and block all other transactions; while in order to release some memory usage, OOM-killer may want to kill the task which is waiting for transaction and becomes uninterruptable. This patch fixes it by adding back the NOFAIL flag after mapping_gfp_constraint() if it has been specified by upper layer. For the ext4 case, there are no failure cases in grow_dev_page() that occur because of a failed memory allocation. Also add return value check for find_or_create_page() and alloc_page_buffers() for safety, even though NOFAIL flag has been specified. --- fs/buffer.c | 4 ++++ mm/filemap.c | 5 +++++ 2 files changed, 9 insertions(+) diff --git a/fs/buffer.c b/fs/buffer.c index 0cda816..af3bbcb 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -935,6 +935,8 @@ grow_dev_page(struct block_device *bdev, sector_t block, gfp_mask |= __GFP_NOFAIL; page = find_or_create_page(inode->i_mapping, index, gfp_mask); + if (!page) + return ret; BUG_ON(!PageLocked(page)); @@ -953,6 +955,8 @@ grow_dev_page(struct block_device *bdev, sector_t block, * Allocate some buffers for this page */ bh = alloc_page_buffers(page, size, true); + if (!bh) + goto failed; /* * Link the page to the buffers and initialise them. Take the diff --git a/mm/filemap.c b/mm/filemap.c index 4ee01a4..191c0fc 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -587,11 +587,16 @@ static int __add_to_page_cache_locked(struct page *page, void **shadowp) { int error; + bool has_nofail; VM_BUG_ON(!PageLocked(page)); VM_BUG_ON(PageSwapBacked(page)); + has_nofail = gfp_mask & __GFP_NOFAIL; gfp_mask = mapping_gfp_constraint(mapping, gfp_mask); + if (has_nofail) { + gfp_mask |= __GFP_NOFAIL; + } error = mem_cgroup_cache_charge(page, current->mm, gfp_mask & GFP_RECLAIM_MASK); -- 2.11.0 ext4-repro-with-multiple-cgroup.sh (1,334 bytes)
#!/usr/bin/bash echo "clean up" systemctl stop ext4-repro-cgroup*-* mkdir /usr/share/ext4-repro/ cp ext4-repro-with-write /tmp/ext4-repro-with-write chmod +x /tmp/ext4-repro-with-write for i in `seq 1 10`;do echo "prepare cgroup" cat << EOF > /etc/systemd/system/system-fio${i}.slice [Unit] Description=zbs Slice Before=slices.target Wants=system.slice After=system.slice EOF mkdir /etc/systemd/system/system-fio${i}.slice.d cat << EOF > /etc/systemd/system/system-fio${i}.slice.d/50-MemoryLimit.conf [Slice] MemoryLimit=104857600 EOF systemctl daemon-reload systemctl restart system-fio${i}.slice systemctl status system-fio${i}.slice echo "prepare fio data path" TEST_PATH="/usr/share/ext4-repro" mkdir $TEST_PATH for j in `seq 1 30`;do mkdir /etc/systemd/system/ext4-repro-cgroup${i}-${j}.service.d/ cat << EOF > /etc/systemd/system/ext4-repro-cgroup${i}-${j}.service.d/cgroup.conf [Service] Slice=system-fio${i}.slice EOF cat << EOF > /usr/lib/systemd/system/ext4-repro-cgroup${i}-${j}.service [Unit] After=network.target [Service] ExecStart=/tmp/ext4-repro-with-write Restart=always RestartSec=3s [Install] WantedBy=multi-user.target EOF done pushd /usr/lib/systemd/system systemctl restart ext4-repro-cgroup${i}*.service popd done ext4-repro-with-write.go (966 bytes)
package main import ( "flag" "fmt" "math/rand" "os" "strconv" "time" ) const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" const path = "/usr/share/ext4-repo/" func RandStringBytes(n int) string { b := make([]byte, n) for i := range b { b[i] = letterBytes[rand.Intn(len(letterBytes))] } return string(b) } func createFileName(path string) string { name, _ := os.Hostname() fn := path + name + strconv.Itoa(int(time.Now().Unix())) + RandStringBytes(8) return fn } func w(path string) { fn := createFileName(path) fmt.Println(fn) f, _ := os.OpenFile(fn, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644) fmt.Println(f) defer os.Remove(fn) defer f.Close() for { s := RandStringBytes(10) _, _ = f.WriteString(s) f.Sync() } } func main() { path := flag.String("path", "/usr/share/ext4-repo/", "a string") flag.Parse() fmt.Println("path:", *path) go w(*path) for { time.Sleep(time.Duration(1) * time.Second) } } |
|
Please replace the 'fxck-with-write' in the bug description into 'ext4-repro'. | |
Date Modified | Username | Field | Change |
---|---|---|---|
2020-12-29 04:19 | Jiewei Ke | New Issue | |
2020-12-29 04:19 | Jiewei Ke | File Added: 0001-buffer-crossport-patches-to-ensure-grow_dev_page-wil.patch | |
2020-12-29 04:19 | Jiewei Ke | File Added: 0002-fs-mm-avoid-unexpected-NOFAIL-removal-in-__add_to_pa.patch | |
2020-12-29 04:19 | Jiewei Ke | File Added: ext4-repro-with-multiple-cgroup.sh | |
2020-12-29 04:19 | Jiewei Ke | File Added: ext4-repro-with-write | |
2020-12-29 04:19 | Jiewei Ke | File Added: ext4-repro-with-write.go | |
2020-12-29 04:28 | Jiewei Ke | Note Added: 0038141 | |
2020-12-29 09:43 | toracat | Relationship added | has duplicate 0017981 |