View Issue Details

ID: 0014920
Project/Category: Ecosystem Testing
View Status: public
Last Update: 2019-04-26 03:37
Status: resolved
Resolution: fixed

Summary: 0014920: Duffy is handing out the same node twice

Description: It looks like Duffy is handing out the same node in two different API requests.

These two jobs were executed at the same time:

- Build #1401 (Jun 7, 2018 1:58:25 PM) (Success)
- Build #83 (Jun 7, 2018 1:58:25 PM)

And they both obtained the same Duffy node:

The code that requests the node is the same for both:
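The snippet itself did not make it into this report. For context, a Duffy allocation from a CI script looks roughly like the sketch below; the endpoint, API key, and query parameters here are placeholders and common conventions, not taken from the failing jobs.

```python
# Sketch of a Duffy node request. DUFFY_URL and API_KEY are placeholders,
# and the ver/arch/count parameters are the commonly used ones, not
# necessarily those of the failing jobs.
import json
from urllib.request import urlopen

DUFFY_URL = "http://duffy.example.org"  # placeholder, not the real endpoint
API_KEY = "REDACTED"                    # per-project Duffy API key

def parse_allocation(raw):
    """Duffy answers /Node/get with a JSON body listing hosts and a session id."""
    data = json.loads(raw)
    return data["hosts"], data["ssid"]

def request_nodes(count=1, ver="7", arch="x86_64"):
    url = (f"{DUFFY_URL}/Node/get?key={API_KEY}"
           f"&ver={ver}&arch={arch}&count={count}")
    with urlopen(url) as resp:
        return parse_allocation(resp.read().decode())
```

If two concurrent calls like this return the same host, the race is on the server side; nothing in a snippet of this shape should be able to cause it.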


Some additional questions:

- Is there something wrong with that snippet? Maybe there's a problem in the way we are requesting the Duffy node.
- Has this issue been reported before?
- Is there a workaround to prevent this from happening?
Tags: No tags attached.




2018-06-21 13:52

reporter   ~0032123

Today we had an issue that **may** be related, so I'm throwing it in here in case it helps.

The problem was with this build:

There are three things wrong with that build:

First we got:

> Existing lock /var/run/ another copy is running as pid 12649.

This indicates that some other process is using yum. Furthermore, a few lines down it says:

> Package rsync-3.1.2-4.el7.x86_64 already installed and latest version

How can it already be installed on a freshly provisioned node?

A little further down we hit the lock problem again:

> Existing lock /var/run/ another copy is running as pid 12749.

Now the pid is different, which suggests that another script on the machine is running yum commands.

What's even worse is that those packages are reported as already installed:

> Package 2:docker-1.13.1-63.git94f4240.el7.centos.x86_64 already installed and latest version
> Package 1:make-3.82-23.el7.x86_64 already installed and latest version
> Package git- already installed and latest version
> Package curl-7.29.0-46.el7.x86_64 already installed and latest version

After the lock is released, the script fails because:

> /usr/bin/docker-current: Error response from daemon: Conflict. The container name "/fabric8-wit-local-build" is already in use by container 9538a456620e424219cbe267e5c3be28780cde4d3f202701ced77201c6ee365d. You have to remove (or rename) that container to be able to reuse that name..

The fact that there is already a container with that name indicates that the node has been reused.

The conclusion is that this node had already been used. Not sure if the underlying cause is the same as the original problem, but the symptoms are quite similar.
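The symptoms above suggest a cheap guard a job could run before doing any work: check for dirtiness markers and bail out early if the node looks reused. A minimal sketch (the markers come from the log excerpts in this note; the helper itself is hypothetical, with its inputs injected so the logic is testable without a real node):

```python
# Sketch of a pre-flight "is this node fresh?" check. The inputs are
# injected; on a real node they would come from checking the yum lock file
# and listing containers.
def node_looks_reused(yum_lock_exists, leftover_container_ids):
    """Return the list of dirtiness markers found (empty list == looks fresh)."""
    markers = []
    if yum_lock_exists:                # a yum lock held by another pid
        markers.append("yum lock present")
    if leftover_container_ids:         # e.g. a stale /fabric8-wit-local-build
        markers.append("leftover container")
    return markers
```

On a real node the inputs would be gathered with something like `os.path.exists` on the yum lock path and the output of `docker ps -aq --filter name=...`; any non-empty result means the job should release the node and request another rather than proceed.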


2018-06-21 13:54

reporter   ~0032125

Also note that these issues are not reproducible: the previous and next builds don't have this problem, so this is not happening all the time.


2018-08-14 12:43

reporter   ~0032480

Is there somebody working on this?

Please also see this issue


2018-08-22 08:24

reporter   ~0032567

Another similar report:


2018-08-27 11:09

reporter   ~0032598

One new case:

- Build #3621 (Aug 14, 2018 11:29:19 AM)
- Build #3630 (Aug 14, 2018 11:29:19 AM)


2018-09-11 10:26

reporter   ~0032695

Another case:

These two jobs were started almost at the same time, and it's clear from the logs that both of them received the node with IP


2018-09-13 15:14

administrator   ~0032715

I have a patch in testing to mitigate this.

There are a few problems with it, but when I fix those we'll deploy a new version of duffy.


2018-09-18 10:40

reporter   ~0032751

Thanks for the update @bstinson. We have recently filled up the devtools slave node (slave04) because of this problem, as explained here:

It would be very important to us if this could be fixed as soon as possible.

Thanks a lot!


2018-09-18 21:39

reporter   ~0032756

This issue has been happening quite frequently to me lately, since I've been using more Duffy nodes per pull request. It's been difficult to get enough of the jobs to pass to merge pull requests.


2018-09-24 16:08

administrator   ~0032771

So here's what I've done so far:

- Changed Duffy to use a single thread for incoming requests (this was actually done a while ago), and
- Removed a chassis that was malfunctioning during the PXE boot process; this was a secondary cause of the same behavior.

Permanent fixes in testing:
- Handling multithreading properly
- Adding an install-id that stamps the machine on each reinstall
- Migrating manual checks to automatic notifications when a dirty machine makes it into the ready pool
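
The install-id idea can be sketched as follows (hypothetical names; the ticket does not show Duffy's actual implementation): each reinstall stamps the machine with a fresh id, and the ready-pool gate rejects any node whose id has been seen before, since a repeated id means the machine skipped reinstallation.

```python
# Sketch of install-id stamping; these helpers are illustrative, not
# Duffy's real code.
import uuid

def new_install_id():
    """Generated once per reinstall and stamped onto the machine."""
    return uuid.uuid4().hex

def admit_to_ready_pool(install_id, seen_ids):
    """Only admit a node whose stamp has never been seen before; a repeat
    means the machine was not reinstalled, i.e. it is being reused."""
    if install_id in seen_ids:
        return False
    seen_ids.add(install_id)
    return True
```

This turns "a dirty machine made it into the ready pool" from something found by manual checks into a condition the allocator can detect and alert on automatically.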


2018-09-24 18:45

reporter   ~0032778

That is awesome, bstinson! Thanks for the update.


2018-11-05 10:57

reporter   ~0033080

Is there any progress on this issue?


2019-04-10 22:24

reporter   ~0034178

This happened again to me today:


2019-04-11 21:22

reporter   ~0034182

This has happened to me several times this week. Was the fix perhaps reverted or something along those lines?

Issue History

Date Modified Username Field Change
2018-06-07 15:29 jmelis New Issue
2018-06-21 13:52 jmelis Note Added: 0032123
2018-06-21 13:54 jmelis Note Added: 0032125
2018-08-14 12:43 kwk Note Added: 0032480
2018-08-22 08:24 riuvshyn Note Added: 0032567
2018-08-27 11:09 jmelis Note Added: 0032598
2018-09-11 10:26 ppitonak Note Added: 0032695
2018-09-13 15:14 bstinson Status new => acknowledged
2018-09-13 15:14 bstinson Note Added: 0032715
2018-09-18 10:40 jmelis Note Added: 0032751
2018-09-18 21:39 bowlofeggs Note Added: 0032756
2018-09-24 16:08 bstinson Note Added: 0032771
2018-09-24 18:45 riuvshyn Note Added: 0032778
2018-11-05 10:57 ppitonak Note Added: 0033080
2019-04-10 22:24 bowlofeggs Note Added: 0034178
2019-04-11 21:22 bowlofeggs Note Added: 0034182
2019-04-26 03:37 bstinson Status acknowledged => resolved
2019-04-26 03:37 bstinson Resolution open => fixed