View Issue Details

ID: 0017721
Project: CentOS-7
Category: docker
View Status: public
Last Update: 2020-09-11 09:36
Reporter: sergey.kryazhev
Priority: high
Severity: block
Reproducibility: sometimes
Status: new
Resolution: open
Product Version: 7.4.1708
Target Version:
Fixed in Version:
Summary: 0017721: docker daemon stuck (any docker cli command) sometimes
Description: We have an OKD (OpenShift) 1.5 cluster with:
Kernel Version: 3.10.0-693.el7.x86_64
Operating System: CentOS Linux 7 (Core)
Docker 1.13.1
Build: 108.git4ef4b30.el7.centos

From time to time the docker daemon on one of the OpenShift nodes gets stuck:
- the docker CLI hangs (docker ps)
- many docker delete operations hang. Example:
 root 308 0.0 0.0 143644 9572 ? Sl ---- 0:00 /usr/libexec/docker/docker-runc-current --systemd-cgroup=true delete 04000383565c90756d676447560b85293086d315a9b14e345e889303e492b117
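
To count how many of these delete operations are outstanding, the process list can be filtered for the pattern shown in the example line. The following is only a minimal sketch: the function name `stuck_delete_ids` is hypothetical, and it assumes the input is `ps aux`-style text and that container IDs are 64 hexadecimal characters, as in the line above.

```python
import re
from typing import List

# Assumption: the runc helper is invoked as "docker-runc-current ... delete <64-hex id>",
# matching the ps line quoted in this report.
_DELETE_RE = re.compile(r"docker-runc-current\b.*\bdelete\s+([0-9a-f]{64})\b")


def stuck_delete_ids(ps_output: str) -> List[str]:
    """Return the container IDs found on `docker-runc-current ... delete` lines."""
    ids = []
    for line in ps_output.splitlines():
        match = _DELETE_RE.search(line)
        if match:
            ids.append(match.group(1))
    return ids
```

Feeding it the output of `ps aux` at intervals would show whether the same IDs stay in the list, which is what distinguishes a hung delete from a merely slow one.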

We observe high and growing memory usage by the docker daemon, which precedes the hang.
As a result, the whole OpenShift node stops operating.
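
One way to put numbers on the memory growth is to sample the daemon's resident set size over time. This is a sketch only, assuming a Linux `/proc` filesystem; the helper names `parse_vmrss_kib` and `rss_kib` are hypothetical, and the daemon's PID has to be supplied by the caller.

```python
def parse_vmrss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    123456 kB".
            return int(line.split()[1])
    raise ValueError("VmRSS field not found")


def rss_kib(pid: int) -> int:
    """Read the current resident set size of a process, in KiB."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vmrss_kib(status_file.read())
```

Logging `rss_kib(pid)` once a minute would give a timeline that can be compared against the moment the CLI stops responding.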

Attaching a Go stack trace captured during one occurrence.
Steps To Reproduce: Nothing special. However, we think it might be related to memory usage on this node. This node has less free memory than the others, but memory is not completely exhausted.
Additional Information: Is it possible that this was fixed in the latest CentOS docker packages? I did not find any issue with similar symptoms in the changelog.
Any comments or recommendations are very welcome, since we are observing this issue in production and the impact is high.
The only workaround (WA) we know of is to reboot the whole node.

Judging by the stack trace, https://github.com/moby/moby/issues/36419 looks similar, but we are not sure.
Tags: No tags attached.
abrt_hash:
URL:

Activities

sergey.kryazhev

2020-09-09 07:04

reporter  

goroutine-stacks-2020-09-03T143546+0900.zip (20,842 bytes)
sergey.kryazhev

2020-09-10 14:29

reporter   ~0037682

New observation: the VM running the docker daemon has low I/O throughput on the root partition (/var/lib/docker is on a separate partition with sufficient I/O).
Test command, run against the root partition:
dd bs=1M count=1024 if=/dev/urandom of=/tmp/test oflag=dsync

/: 7.7 MB/s

/var: 63 MB/s
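
The same kind of measurement can be repeated from a script so that several mount points are compared under identical conditions. This is a rough sketch, not a replacement for the dd run above: the function name `dsync_write_mb_per_s` is hypothetical, it assumes Linux (for `os.O_DSYNC`), and it generates the random block once so that the cost of producing random bytes is not included in the timing.

```python
import os
import time


def dsync_write_mb_per_s(path: str, block_mib: int = 1, count: int = 64) -> float:
    """Write `count` blocks of `block_mib` MiB to `path` with O_DSYNC; return MB/s."""
    block = os.urandom(block_mib * 1024 * 1024)  # created before timing starts
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            view = memoryview(block)
            while len(view) > 0:  # handle short writes
                written = os.write(fd, view)
                view = view[written:]
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)  # the test file is deleted afterwards
    total_bytes = block_mib * 1024 * 1024 * count
    return total_bytes / elapsed / 1e6  # MB = 10**6 bytes, as dd reports it
```

For example, calling it once with a path under `/tmp` and once with a path under `/var` would reproduce the comparison between the two partitions.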
sergey.kryazhev

2020-09-10 15:11

reporter   ~0037683

We also tried restarting docker as a workaround. systemctl restart docker hung, so we killed the docker process.
After that, docker did not work at all. Every attempt to run a container returned:
Error response from daemon: unknown service types.API.
Only restarting the VM resolved the issue.
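
Because the CLI itself hangs when the daemon is stuck, any automated check has to be bounded by a timeout. The following is a minimal sketch of such a probe; the function name `daemon_responds` and the 10-second default are assumptions chosen for illustration, and `docker ps -q` is used only as a cheap command that needs a reply from the daemon.

```python
import subprocess
from typing import Sequence


def daemon_responds(cmd: Sequence[str] = ("docker", "ps", "-q"),
                    timeout_s: float = 10.0) -> bool:
    """Return True if `cmd` exits with status 0 within `timeout_s` seconds."""
    try:
        result = subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # The command did not finish in time; subprocess.run kills the child for us.
        return False
    except FileNotFoundError:
        # The binary is not installed on this host.
        return False
    return result.returncode == 0
```

A monitoring job could call this periodically and record the first failure time, which helps correlate the hang with memory and I/O measurements.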
ManuelWolfshant

2020-09-10 15:35

manager   ~0037684

Both your kernel and your docker versions are old. The current kernel is 3.10.0-1127.19.1.el7 and the current docker is 2:1.13.1-162.git64e9980.el7.centos. I suggest updating your whole OS and seeing what happens.
TrevorH

2020-09-10 15:38

manager   ~0037685

7.4 has been out of support since the release of 7.5 in April 2018 so you are nearly 2.5 years out of date. Only the current version of CentOS gets any sort of support at all.
sergey.kryazhev

2020-09-10 15:58

reporter   ~0037686

Thanks guys. I realize we use dinosaurs. However, our customer requires proof that the latest 3rd-party components (kernel/docker) will resolve this issue. Is it possible to point to any fixes in CentOS/docker/kernel that might resolve this, even in theory?
Thanks
ManuelWolfshant

2020-09-10 16:04

manager   ~0037687

Last edited: 2020-09-10 16:05

We can provide proof that we offer no support for anything but the latest minor release (updated to the most current packages). Does this help? :)
Seriously speaking now, you could start digging in the changelogs. And while doing that, also take note of the thousands of fixes, including security-related ones, that have been implemented.

And BTW, your client is extremely wrong if he considers the kernel and docker to be "3rd party". They are the core, not 3rd party.

ManuelWolfshant

2020-09-10 16:11

manager   ~0037688

It was just pointed out to me that your OpenShift version is also ancient and unsupported, so you'd better look into updating that, too. The current stable version is 4.5.
sergey.kryazhev

2020-09-11 09:36

reporter   ~0037693

Shame, shame on me ))). OK, I will start looking into the changelog. At the least, is it possible to say that the very low I/O on the root partition (which we observed) might cause this docker issue?

Issue History

Date Modified Username Field Change
2020-09-09 07:04 sergey.kryazhev New Issue
2020-09-09 07:04 sergey.kryazhev File Added: goroutine-stacks-2020-09-03T143546+0900.zip
2020-09-10 14:29 sergey.kryazhev Note Added: 0037682
2020-09-10 15:11 sergey.kryazhev Note Added: 0037683
2020-09-10 15:35 ManuelWolfshant Note Added: 0037684
2020-09-10 15:38 TrevorH Note Added: 0037685
2020-09-10 15:58 sergey.kryazhev Note Added: 0037686
2020-09-10 16:04 ManuelWolfshant Note Added: 0037687
2020-09-10 16:05 ManuelWolfshant Note Edited: 0037687
2020-09-10 16:05 ManuelWolfshant Note Edited: 0037687
2020-09-10 16:11 ManuelWolfshant Note Added: 0037688
2020-09-11 09:36 sergey.kryazhev Note Added: 0037693