View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0016306||Buildsys||Ci.centos.org Ecosystem Testing||public||2019-07-31 09:26||2019-07-31 13:52|
|Summary||0016306: n22.kempty.ci.centos.org instances fail qemu-img|
|Description||We are currently seeing unexplained failures in some of our tests. The root cause is visible inside `oc rsh centosci-tasks-6lx8h`:|
$ ls -l fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2
-rw----r--. 1 nobody nobody 1954721792 Jul 19 02:33 fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2
$ sha256sum fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2
$ qemu-img info fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2
qemu-img: Could not open '/build/images/fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2': Could not read qcow2 header: Input/output error
In other words, reading the file works (the SHA-256 sum matches the one computed in other pods), but `qemu-img` cannot read the qcow2 header.
However, the image lives on a shared volume, and all other pods can open it without problems. I validated this with:
for pod in $(oc get -o name -l infra=cockpit-tasks pods); do echo "===== $pod ===="; oc describe $pod | grep Node; oc rsh $pod qemu-img info /build/images/fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2; done
The broken pod is the only one that runs on the n22.kempty node, the others seem fine. As `strace` does not work in docker containers, I can't think of a further way to examine this. Does that node have something funky in its journal, or could you try to run `strace` on the qemu-img on the node?
Perhaps it just needs a reboot?
Note, I will kill this pod now until I catch one that runs somewhere else.
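One further check that does not need `strace` would be to read the first bytes of the image directly inside the affected pod and compare them against the qcow2 magic (`QFI` followed by byte `0xfb`). The following is only a minimal sketch, assuming `head`, `od` and `tr` are available in the container; `qcow2_magic` and `is_qcow2` are hypothetical helper names, not part of `qemu-img` or `oc`:

```shell
# Hypothetical helper: print the first 4 bytes of a file as lowercase hex.
# A qcow2 image starts with the magic bytes 51 46 49 fb ("QFI\xfb").
qcow2_magic() {
    head -c 4 "$1" | od -An -tx1 | tr -d ' \n'
}

# Hypothetical check: succeeds only if the file starts with the qcow2 magic.
is_qcow2() {
    [ "$(qcow2_magic "$1")" = "514649fb" ]
}
```

Running `is_qcow2 /build/images/fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2 && echo "magic OK"` in both a working pod and the n22 pod would show whether a plain buffered read of the header succeeds there. If it does, the problem would more likely lie in how `qemu-img` opens the file than in the file contents.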
|Tags||No tags attached.|
|FTR, I started a new pod; it landed on n22 again and shows the same error. So it's not some weird state inside the pod.|
It seems to be getting worse: creating new pods on n22 (https://console.apps.ci.centos.org:8443/console/project/cockpit/browse/pods/release-job-cockpit-podman-7-13tlk?tab=events) now reports "Error syncing pod (8 times in the last minute)".