View Issue Details

ID: 0014920
Project: Buildsys
Category: Ci.centos.org Ecosystem Testing
View Status: public
Last Update: 2018-09-24 18:45
Reporter: jmelis
Priority: normal
Severity: minor
Reproducibility: random
Status: acknowledged
Resolution: open
Summary: 0014920: Duffy is handing out the same node twice
Description: It looks like duffy is handing out the same node in two different API requests.

These two jobs were executed at the same time:

- https://ci.centos.org/job/devtools-test-e2e-openshift.io-logintest-us-east-2a-released/1401/consoleFull
  Success Build #1401 (Jun 7, 2018 1:58:25 PM)

- https://ci.centos.org/job/devtools-che-functional-tests-prcheck-prod-preview.openshift.io-2a/83/
  Build #83 (Jun 7, 2018 1:58:25 PM)

And they both obtained the same duffy node: 172.19.2.16

The code that requests the node is the same for both:

- https://github.com/openshiftio/openshiftio-cico-jobs/blob/30b06ffda1403ba0d59697ecea05f0d0bc0167a8/devtools-ci-index.yaml#L1735-L1755
- https://github.com/openshiftio/openshiftio-cico-jobs/blob/30b06ffda1403ba0d59697ecea05f0d0bc0167a8/devtools-ci-index.yaml#L282-L302

Some additional questions:

- Is there something wrong with that snippet? Maybe there's a problem in the way we are requesting the duffy node.
- Has this issue been reported before?
- Is there a workaround we can use to prevent this from happening?

Tags: No tags attached.

Activities

jmelis

2018-06-21 13:52

reporter   ~0032123

Today we had an issue that **may** be related, so throwing it in here in case it helps.

The problem was with this build: https://ci.centos.org/view/Devtools/job/devtools-fabric8-wit/3011/consoleFull

There are three things wrong with that build:

First we got:

> Existing lock /var/run/yum.pid: another copy is running as pid 12649.

This indicates that some other process is using yum. Furthermore, a few lines down it says:

> Package rsync-3.1.2-4.el7.x86_64 already installed and latest version

How can it already be installed?

A little further down, the lock problem appears again:

> Existing lock /var/run/yum.pid: another copy is running as pid 12749.

Now the pid is different. This suggests that there is another script on the machine running yum commands.

What's even worse is that those packages are reported as already installed:

> Package 2:docker-1.13.1-63.git94f4240.el7.centos.x86_64 already installed and latest version
> Package 1:make-3.82-23.el7.x86_64 already installed and latest version
> Package git-1.8.3.1-13.el7.x86_64 already installed and latest version
> Package curl-7.29.0-46.el7.x86_64 already installed and latest version

After the lock is released, the script fails because:

> /usr/bin/docker-current: Error response from daemon: Conflict. The container name "/fabric8-wit-local-build" is already in use by container 9538a456620e424219cbe267e5c3be28780cde4d3f202701ced77201c6ee365d. You have to remove (or rename) that container to be able to reuse that name..

The fact that there is already a container with that name indicates that the node has been reused.

The conclusion is that this node had already been used. Not sure if the underlying cause is the same as the original problem, but symptoms are quite similar.
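
The symptoms listed above can be checked for mechanically. What follows is only a minimal sketch: the function name `find_reuse_symptoms` and the pattern labels are made up for illustration, and the regular expressions are based solely on the three log messages quoted in this note. An "already installed" hit is only suspicious if the base image does not ship that package, so treat the result as a hint, not proof.

```python
import re

# Patterns derived from the log lines quoted in this note (labels are hypothetical).
REUSE_PATTERNS = {
    "yum_lock": re.compile(
        r"Existing lock /var/run/yum\.pid: another copy is running as pid \d+"
    ),
    "already_installed": re.compile(
        r"Package \S+ already installed and latest version"
    ),
    "container_conflict": re.compile(
        r'The container name "[^"]+" is already in use by container'
    ),
}


def find_reuse_symptoms(console_log):
    """Return {label: [matching lines]} for every symptom seen in the log text."""
    hits = {label: [] for label in REUSE_PATTERNS}
    for line in console_log.splitlines():
        for label, pattern in REUSE_PATTERNS.items():
            if pattern.search(line):
                hits[label].append(line.strip())
    # Drop labels that never matched so an empty dict means "no symptoms".
    return {label: lines for label, lines in hits.items() if lines}
```

Running it over the saved console output of a build returns an empty dict when none of the three messages appear.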
jmelis

2018-06-21 13:54

reporter   ~0032125

Also note that these issues are not reproducible: the previous and next builds don't have this problem, so it is not happening all the time.
kwk

2018-08-14 12:43

reporter   ~0032480

Is there somebody working on this?

Please also see this issue https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2216
riuvshyn

2018-08-22 08:24

reporter   ~0032567

Another similar report: https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2244
jmelis

2018-08-27 11:09

reporter   ~0032598

One new case:

- https://ci.centos.org/job/devtools-fabric8-wit-coverage/3621/consoleFull - Build #3621 (Aug 14, 2018 11:29:19 AM)
- https://ci.centos.org/job/devtools-fabric8-wit/3630/consoleFull - Build #3630 (Aug 14, 2018 11:29:19 AM)
ppitonak

2018-09-11 10:26

reporter   ~0032695

Another case:

https://ci.centos.org/job/devtools-test-e2e-openshift.io-smoketest-us-east-2-released/664/console
https://ci.centos.org/job/devtools-test-e2e-cleanup/148/console

These two jobs were started at almost the same time, and it's clear from the logs that both of them received the node with IP 172.19.2.13.
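
Spotting this kind of collision by hand means comparing logs pairwise. A minimal sketch of automating it follows, assuming the (job, node IP) pairs have already been extracted from the console logs; the function name `find_shared_nodes` and the input shape are assumptions for illustration, not part of any existing tooling.

```python
from collections import defaultdict


def find_shared_nodes(assignments):
    """Given (job, node_ip) pairs for concurrently running jobs,
    return {node_ip: [jobs]} for every node held by more than one job."""
    jobs_by_node = defaultdict(list)
    for job, node_ip in assignments:
        jobs_by_node[node_ip].append(job)
    return {ip: jobs for ip, jobs in jobs_by_node.items() if len(jobs) > 1}
```

For the two builds above, passing both jobs with 172.19.2.13 would return a single entry keyed by that IP listing both jobs.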
bstinson

2018-09-13 15:14

administrator   ~0032715

I have a patch in testing to mitigate this.

There are a few problems with it, but when I fix those we'll deploy a new version of duffy.
jmelis

2018-09-18 10:40

reporter   ~0032751

Thanks for the update @bstinson. We have recently filled up the devtools slave node (slave04) because of this problem, as explained here: https://bugs.centos.org/view.php?id=15288. It is important to us that this be fixed as soon as possible.

Thanks a lot!
bowlofeggs

2018-09-18 21:39

reporter   ~0032756

This issue has been happening quite frequently to me lately, since I've been using more Duffy nodes per pull request. It's been difficult to get enough of the jobs to pass to merge pull requests.
bstinson

2018-09-24 16:08

administrator   ~0032771

So here's what I've done so far:

- Changed duffy to use a single thread for incoming requests (this was actually done a while ago), AND
- Removed a chassis that was malfunctioning during the PXE boot process; this was a secondary cause of the same behavior.

Permanent fixes in testing:
- Handling multithreading properly
- Adding an install-id to each install that stamps the machine on each reinstall
- Migrating manual checks -> automatic notifications if a dirty machine makes it into the ready pool
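
To illustrate the general idea behind the first two permanent fixes, here is a toy sketch. It is not Duffy's code: the `NodePool` class, its methods, and the `hammer` helper are all hypothetical. It assumes an in-memory ready pool, uses a lock so that taking a node out of the pool is atomic, and stamps each (re)install with a fresh install-id so that a node handed back without a reinstall can be detected.

```python
import threading
import uuid


class NodePool:
    """Toy in-memory ready pool (hypothetical; not Duffy's implementation)."""

    def __init__(self, hosts):
        self._lock = threading.Lock()
        # host -> install-id stamped at its most recent (re)install
        self._ready = {host: uuid.uuid4().hex for host in hosts}
        self._used_install_ids = set()

    def add_ready(self, host, install_id):
        """Put a node into the ready pool with the given install-id."""
        with self._lock:
            self._ready[host] = install_id

    def reinstall(self, host):
        """Simulate a reinstall: the node gets a fresh install-id."""
        self.add_ready(host, uuid.uuid4().hex)

    def checkout(self):
        """Atomically take one node out of the pool, or return None if empty."""
        with self._lock:
            if not self._ready:
                return None
            host, install_id = self._ready.popitem()
            if install_id in self._used_install_ids:
                # Same install-id seen twice: the node was never reinstalled.
                raise RuntimeError(f"dirty node {host} reached the ready pool")
            self._used_install_ids.add(install_id)
            return host, install_id


def hammer(pool, n_requests):
    """Issue n_requests concurrent checkouts and return the hosts handed out."""
    handed_out = []
    handed_out_lock = threading.Lock()

    def worker():
        node = pool.checkout()
        if node is not None:
            with handed_out_lock:
                handed_out.append(node[0])

    threads = [threading.Thread(target=worker) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handed_out
```

Because the lookup and the removal happen under the same lock, two concurrent requests cannot both receive the same host, and a node that comes back with an already-used install-id raises instead of being handed out silently.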
riuvshyn

2018-09-24 18:45

reporter   ~0032778

That is awesome, bstinson! Thanks for the update.

Issue History

Date Modified Username Field Change
2018-06-07 15:29 jmelis New Issue
2018-06-21 13:52 jmelis Note Added: 0032123
2018-06-21 13:54 jmelis Note Added: 0032125
2018-08-14 12:43 kwk Note Added: 0032480
2018-08-22 08:24 riuvshyn Note Added: 0032567
2018-08-27 11:09 jmelis Note Added: 0032598
2018-09-11 10:26 ppitonak Note Added: 0032695
2018-09-13 15:14 bstinson Status new => acknowledged
2018-09-13 15:14 bstinson Note Added: 0032715
2018-09-18 10:40 jmelis Note Added: 0032751
2018-09-18 21:39 bowlofeggs Note Added: 0032756
2018-09-24 16:08 bstinson Note Added: 0032771
2018-09-24 18:45 riuvshyn Note Added: 0032778