View Issue Details

ID: 0016306
Project: Buildsys
Category: Ci.centos.org Ecosystem Testing
View Status: public
Last Update: 2019-07-31 13:52
Reporter: Martin.Pitt
Priority: normal
Severity: major
Reproducibility: always
Status: new
Resolution: open
Summary: 0016306: n22.kempty.ci.centos.org instances fail qemu-img
Description: We are currently getting weird failures [1] in some of our tests. The root cause shows up in `oc rsh centosci-tasks-6lx8h`:

```
$ ls -l fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2
-rw----r--. 1 nobody nobody 1954721792 Jul 19 02:33 fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2

$ sha256sum fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2
caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496 fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2

$ qemu-img info fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2
qemu-img: Could not open '/build/images/fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2': Could not read qcow2 header: Input/output error
```

In other words, reading the file is fine (the SHA sum is correct when comparing it to other pods), but qemu-img is upset.
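
To separate "the header bytes are readable" from "qemu-img can open the file", a minimal sketch along these lines could be run inside the affected pod. It only checks that the first four bytes match the qcow2 magic (`QFI\xfb`). The function name `qcow2_magic_ok` is hypothetical and not part of any existing tooling, and the check does not reproduce how qemu-img itself opens or locks the file.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not from the original report):
# succeed if the first four bytes of the file are the qcow2 magic "QFI\xfb".
qcow2_magic_ok() {
    # od prints the bytes as space-separated hex; tr strips the whitespace
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "514649fb" ]
}

# Usage: qcow2_magic_ok image.qcow2 && echo "header magic is readable and valid"
```

If this succeeds on n22 while `qemu-img info` still fails, the problem is more likely in how qemu-img accesses the file than in the file contents.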

However, the image lives on a shared volume, and all other pods are fine with it. I validated this with:
```
for pod in $(oc get -o name -l infra=cockpit-tasks pods); do
    echo "===== $pod ===="
    oc describe "$pod" | grep Node
    oc rsh "$pod" qemu-img info /build/images/fedora-30-caf27ed0e34e2019cbf8db0b02d97a006f9df21fbe32b3c90a95c5833acf1496.qcow2
done
```

The broken pod is the only one that runs on the n22.kempty node; the others seem fine. As `strace` does not work in docker containers, I can't think of another way to examine this. Does that node have something funky in its journal, or could you try to run `strace` on qemu-img on the node?

Perhaps it just needs a reboot?

Note: I will kill this pod now, and repeat until I get one that runs on a different node.

[1] https://logs-https-cockpit.apps.ci.centos.org/logs/pull-12152-20190731-085959-208f9b6d-cockpit-project-cockpit-fedora-30-selenium-firefox/log.html
Tags: No tags attached.

Activities

Martin.Pitt

2019-07-31 09:28

reporter   ~0034890

FTR, I started a new pod; it landed on n22 again and has the same error. So it's not some weird state inside the pod.

Martin.Pitt

2019-07-31 13:52

reporter   ~0034891

It seems to be getting worse -- creating new pods on n22 (https://console.apps.ci.centos.org:8443/console/project/cockpit/browse/pods/release-job-cockpit-podman-7-13tlk?tab=events) now says "Error syncing pod (8 times in the last minute)".

Issue History

Date Modified Username Field Change
2019-07-31 09:26 Martin.Pitt New Issue
2019-07-31 09:28 Martin.Pitt Note Added: 0034890
2019-07-31 13:52 Martin.Pitt Note Added: 0034891