View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0017584||CentOS-7||kernel||public||2020-07-09 22:36||2020-07-09 22:36|
|OS||CentOS Linux 7 (Core)||OS Version||7.8.2003|
|Summary||0017584: squashfs stuck after OOM|
|Description||After an OOM event triggered by a cgroup memory limit, processes running in a Singularity container can become unrecoverably stuck in the squashfs portion of the kernel. The issue is intermittent, but we have enough nodes and users that several become broken in this way every day.|
An example stack from procfs for a process that is stuck:
# cat /proc/21627/stack
[<ffffffffc07ce725>] squashfs_cache_get+0x105/0x3c0 [squashfs]
[<ffffffffc07ceff1>] squashfs_get_datablock+0x21/0x30 [squashfs]
[<ffffffffc07d0272>] squashfs_readpage+0x8a2/0xc30 [squashfs]
This process never recovers from this state, even when left for several days, and it cannot be killed.
Additionally, reads of /proc/21627/cmdline hang in uninterruptible sleep. I believe this is because squashfs is holding a lock on that process's memory map.
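For context, a sketch of how affected tasks can be found: scan for processes in uninterruptible (D) sleep and dump their kernel stacks. The 2-second timeout guards against procfs reads that themselves block, as described above for /proc/&lt;pid&gt;/cmdline. (This is an illustrative snippet, not the exact tooling we run.)

```shell
# List tasks in uninterruptible (D) sleep and dump their kernel stacks.
# ps reads /proc/<pid>/stat, which does not hang on the affected tasks;
# /proc/<pid>/stack can still block, hence the timeout.
for pid in $(ps -eo pid=,stat= | awk '$2 ~ /^D/ {print $1}'); do
    echo "=== PID $pid ==="
    timeout 2 cat /proc/"$pid"/stack
done
```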
|Steps To Reproduce||I can reproduce somewhat consistently with a script that reads a file out of a Singularity image with high multi-process concurrency. The script also leaks memory to trigger the OOM killer. I run this in a memory-constrained cgroup and, after the OOM, one or more of the reader processes may be stuck as above, though it is not uncommon for several OOM events to be needed first.|
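A minimal sketch of that reproducer, assuming the cgroup v1 layout used on CentOS 7. The image path, cgroup name, memory limit, and reader count are all placeholders; the real script reads out of a mounted Singularity image.

```shell
#!/bin/bash
# Sketch of the reproducer described above (requires root).
# FILE would be a path inside a mounted Singularity (squashfs) image.
FILE=/path/inside/singularity/image/data.bin   # placeholder
NPROC=32                                       # illustrative concurrency

# Create a memory-constrained cgroup (cgroup v1, as on CentOS 7)
# and move this shell (and its children) into it.
mkdir -p /sys/fs/cgroup/memory/squashfs-repro
echo $((256 * 1024 * 1024)) > /sys/fs/cgroup/memory/squashfs-repro/memory.limit_in_bytes
echo $$ > /sys/fs/cgroup/memory/squashfs-repro/tasks

# Many concurrent readers of the same file out of the squashfs image...
for i in $(seq "$NPROC"); do
    while true; do cat "$FILE" > /dev/null; done &
done

# ...plus a deliberate memory leak to push the cgroup into OOM.
python -c '
buf = []
while True:
    buf.append(bytearray(1024 * 1024))  # leak ~1 MiB per iteration
'
```

After one or more OOM kills inside the cgroup, check the surviving readers for D-state tasks with squashfs frames in their stacks.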
|Additional Information||Singularity uses squashfs and namespaces under the hood. I do not believe this is a Singularity issue: it merely sets up the environment, and that environment normally runs fine.|
|Tags||No tags attached.|