View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0014920||Buildsys||Ci.centos.org Ecosystem Testing||public||2018-06-07 15:29||2018-11-05 10:57|
|Summary||0014920: Duffy is handing out the same node twice|
|Description||It looks like duffy is handing out the same node in two different API requests.|
These two jobs were executed at the same time:
Success Build #1401 (Jun 7, 2018 1:58:25 PM)
Build #83 (Jun 7, 2018 1:58:25 PM)
And they both obtained the same duffy node: 172.19.2.16
The code that requests the node is the same for both:
Some additional questions:
- Is there something wrong with that snippet? Maybe there's a problem in the way we are requesting the duffy node.
- Has this issue been reported before?
- Is there a workaround we can use to prevent this from happening?
|Tags||No tags attached.|
Today we had an issue that **may** be related, so throwing it in here in case it helps.
The problem was with this build: https://ci.centos.org/view/Devtools/job/devtools-fabric8-wit/3011/consoleFull
There are three things wrong with that build:
First we got:
> Existing lock /var/run/yum.pid: another copy is running as pid 12649.
This indicates some other process is using yum. Furthermore, a few lines down it says:
> Package rsync-3.1.2-4.el7.x86_64 already installed and latest version
How can it be installed?
Again, a little bit further down we get the lock problem again:
> Existing lock /var/run/yum.pid: another copy is running as pid 12749.
Now the pid is different. This suggests that there is another script on the machine running yum commands.
What's even worse is that those packages are reported as already installed:
> Package 2:docker-1.13.1-63.git94f4240.el7.centos.x86_64 already installed and latest version
> Package 1:make-3.82-23.el7.x86_64 already installed and latest version
> Package git-1.8.3.1-13.el7.x86_64 already installed and latest version
> Package curl-7.29.0-46.el7.x86_64 already installed and latest version
After the lock is released, the script fails because:
> /usr/bin/docker-current: Error response from daemon: Conflict. The container name "/fabric8-wit-local-build" is already in use by container 9538a456620e424219cbe267e5c3be28780cde4d3f202701ced77201c6ee365d. You have to remove (or rename) that container to be able to reuse that name..
The fact that there is already a container with that name indicates that the node has been reused.
The conclusion is that this node had already been used. Not sure if the underlying cause is the same as the original problem, but symptoms are quite similar.
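The symptoms described above can be checked for mechanically. The following is only a rough sketch, not anything duffy or Jenkins provides: it scans a console log for the two messages quoted in this note. The function name `find_reuse_symptoms` and the pattern names are made up for illustration.

```python
import re
from typing import Dict, List

# Hypothetical symptom patterns, built from the log lines quoted in this note.
REUSE_SYMPTOMS = {
    # yum is already running on a node that should be freshly installed
    "yum_lock": re.compile(
        r"Existing lock /var/run/yum\.pid: another copy is running as pid (\d+)"
    ),
    # a container from an earlier (or concurrent) build is still present
    "container_conflict": re.compile(
        r'Conflict\. The container name "([^"]+)" is already in use'
    ),
}


def find_reuse_symptoms(console_log: str) -> Dict[str, List[str]]:
    """Return the symptoms found in the log, keyed by symptom name.

    The values are the captured details (pids or container names).
    An empty dict means none of the known symptoms were seen.
    """
    found: Dict[str, List[str]] = {}
    for name, pattern in REUSE_SYMPTOMS.items():
        matches = pattern.findall(console_log)
        if matches:
            found[name] = matches
    return found
```

A job could run such a check over its own output and fail early with a clear "node appears to be reused" message instead of a confusing docker error. It only detects the problem after the fact and does not prevent the double allocation.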
|Also note that these issues are not reproducible: the previous and next builds don't have this problem, so it is not happening all the time.|
Is there somebody working on this?
Please also see this issue https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2216
|another similar report: https://gitlab.cee.redhat.com/dtsd/housekeeping/issues/2244|
One new case:
- https://ci.centos.org/job/devtools-fabric8-wit-coverage/3621/consoleFull - Build #3621 (Aug 14, 2018 11:29:19 AM)
- https://ci.centos.org/job/devtools-fabric8-wit/3630/consoleFull - Build #3630 (Aug 14, 2018 11:29:19 AM)
These two jobs were started at almost the same time, and it's clear from the logs that both of them received the node with IP 172.19.2.13.
I have a patch in testing to mitigate this.
There are a few problems with it, but when I fix those we'll deploy a new version of duffy.
Thanks for the update @bstinson. We have recently filled up the devtools slave node (slave04) because of this problem, as explained here: https://bugs.centos.org/view.php?id=15288. It would be very convenient and important to us if this could be fixed as soon as possible.
Thanks a lot!
|This issue has been happening quite frequently to me lately, since I've been using more Duffy nodes per pull request. It's been difficult to get enough of the jobs to pass to merge pull requests.|
So here's what I've done so far:
- Changed duffy to use a single thread for incoming requests (this was actually done a while ago), AND
- Removed a chassis that was malfunctioning during the PXE boot process, this was a secondary cause of the same behavior.
Permanent fixes in testing:
- Handle multithreading properly
- Adding an install-id to each install that stamps the machine on each reinstall
- Migrating manual checks -> automatic notifications if a dirty machine makes it into the ready pool
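For readers wondering what the first two permanent fixes could look like, here is a minimal sketch. It is not duffy's actual code, and every name in it (`NodePool`, `acquire`, `reinstall`, `is_clean`) is hypothetical. It assumes an in-memory pool where a lock makes "pick a ready node and mark it leased" a single atomic step, and where each reinstall stamps the node with a fresh install-id.

```python
import threading
import uuid
from typing import Dict, Iterable, Optional, Tuple


class NodePool:
    """Hypothetical in-memory pool; illustrates the idea only."""

    def __init__(self, hosts: Iterable[str]) -> None:
        self._lock = threading.Lock()
        # host -> install-id recorded at the last (re)install
        self._ready: Dict[str, str] = {h: uuid.uuid4().hex for h in hosts}
        self._leased: Dict[str, str] = {}

    def acquire(self) -> Optional[Tuple[str, str]]:
        """Hand out one ready node, or None if the pool is empty.

        Removing the node from the ready set and recording the lease
        happen under the same lock, so two concurrent requests cannot
        receive the same host.
        """
        with self._lock:
            if not self._ready:
                return None
            host, install_id = self._ready.popitem()
            self._leased[host] = install_id
            return host, install_id

    def reinstall(self, host: str) -> str:
        """Return a node to the ready set with a fresh install-id."""
        with self._lock:
            self._leased.pop(host, None)
            new_id = uuid.uuid4().hex
            self._ready[host] = new_id
            return new_id

    def is_clean(self, host: str, install_id_on_machine: str) -> bool:
        """True if the stamp found on the machine matches the pool's record.

        A mismatch for a node in the ready set would suggest a dirty
        machine, which is where an automatic notification could fire.
        """
        with self._lock:
            return self._ready.get(host) == install_id_on_machine
```

In this sketch a dirty machine shows up as an install-id mismatch, which is one way the manual checks could be turned into automatic notifications. A real service would also need persistent state and would have to cover the PXE failure case mentioned above.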
|That is awesome, bstinson! Thanks for the update.|
|Is there any progress on this issue?|
|2018-06-07 15:29||jmelis||New Issue|
|2018-06-21 13:52||jmelis||Note Added: 0032123|
|2018-06-21 13:54||jmelis||Note Added: 0032125|
|2018-08-14 12:43||kwk||Note Added: 0032480|
|2018-08-22 08:24||riuvshyn||Note Added: 0032567|
|2018-08-27 11:09||jmelis||Note Added: 0032598|
|2018-09-11 10:26||ppitonak||Note Added: 0032695|
|2018-09-13 15:14||bstinson||Status||new => acknowledged|
|2018-09-13 15:14||bstinson||Note Added: 0032715|
|2018-09-18 10:40||jmelis||Note Added: 0032751|
|2018-09-18 21:39||bowlofeggs||Note Added: 0032756|
|2018-09-24 16:08||bstinson||Note Added: 0032771|
|2018-09-24 18:45||riuvshyn||Note Added: 0032778|
|2018-11-05 10:57||ppitonak||Note Added: 0033080|