View Issue Details

ID: 0015393
Project: CentOS-7
Category: kernel
View Status: public
Last Update: 2020-06-15 13:25
Status: closed
Resolution: no change required
Product Version: 7.5.1804
Target Version:
Fixed in Version:
Summary: 0015393: Unshare of network namespaces slows down
Description: Running unshare with CLONE_NEWNET does not cause issues when run only occasionally. However, running unshare with CLONE_NEWNET frequently causes subsequent calls to take very long, up to several seconds or even a minute.

As far as I can tell this was fixed in more recent kernels, but I don't know which commit or version introduced the fix. This can be especially bad on systems with many cores, as there seems to be one lock held for freeing up the allocated namespaces/interfaces which blocks creation of subsequent namespaces.
Steps To Reproduce: An easy way to observe this problem is to call unshare -n in a tight loop. While a single unshare -n might take 2-50 ms to complete in normal operation, when called repeatedly in a single thread this quickly jumps to about 500 ms per call.

A simple script such as
time sudo bash -c 'for i in $(seq 100); do unshare -n echo -n ; done'
takes about 45 seconds to complete on my machine.

To verify the slowdown is related to the kernel unshare syscall, one can run strace on unshare while the above is running:
$ sudo strace -Te trace=unshare unshare -n echo -n
unshare(CLONE_NEWNET) = 0 <1.214871>
+++ exited with 0 +++
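The per-call duration that strace -T appends in angle brackets can be pulled out with a small filter, which is handy when collecting many samples. This is a sketch assuming strace's standard `<seconds>` suffix format:

```shell
# Extract the syscall duration (the <...> suffix added by strace -T)
# from a captured strace output line.
line='unshare(CLONE_NEWNET) = 0 <1.214871>'
duration=$(printf '%s\n' "$line" | sed -n 's/.*<\([0-9.]*\)>.*/\1/p')
echo "$duration"   # 1.214871
```

Piping the full strace log through the same sed expression gives one duration per line, ready for sorting or averaging.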
Additional Information: As best I can tell, this is an issue that was fixed in more recent kernels. Another machine running Debian on identical hardware but with kernel 4.9.0 takes about 4 seconds rather than 45 to run the above shell script.

On machines with many cores each running unshare, the average time jumps from a few hundred milliseconds to several seconds; when network namespaces are under heavy load, it climbs to tens of seconds and beyond.
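For scale, the loop timings quoted here work out to roughly these per-call figures (a back-of-envelope sketch using the 45 s and 4 s totals above over 100 iterations):

```shell
# Back-of-envelope per-call latency from the 100-iteration loop timings.
calls=100
centos_ms=$(( 45000 / calls ))   # 45 s total on CentOS 7.5 -> ms per call
debian_ms=$(( 4000 / calls ))    # 4 s total on Debian 4.9  -> ms per call
echo "CentOS 7.5: ${centos_ms} ms/call, Debian 4.9: ${debian_ms} ms/call"
```

Note this ignores the overhead of spawning the unshare binary itself, so the true syscall-level gap is somewhat larger.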
Tags: No tags attached.




2018-10-19 22:43

manager   ~0032947


You can test more recent kernels on CentOS by using ELRepo's kernel-lt (4.4) or kernel-ml (4.18).

Mind giving them a try?


2018-10-19 22:56

manager   ~0032948

And if you want to get it fixed in the CentOS/RHEL kernels then you'll also need to report this upstream. CentOS only rebuilds what is used to build RHEL, so fixes have to come from upstream.


2018-10-19 23:09

reporter   ~0032949

Sounds good. I'll try to test with those kernels and get back to you.

Is it worth keeping this report open and also filing a report upstream? Given that I only have CentOS to test, I wasn't sure if there were things I wouldn't be able to test that I would be asked about if I filed there.


2018-10-19 23:16

manager   ~0032950


If we can identify the patch(es)/commits that fix the issue and if they can be backported to the current kernel, CentOS will offer the centosplus kernel that includes the patch(es). Then you can provide the info upstream (RH) for them to consider.


2018-10-20 00:13

reporter   ~0032951

Using the ELRepo kernel-lt (specifically 4.4.161-1.el7.elrepo.x86_64), the issue seems to be gone.

$ time sudo bash -c 'for i in $(seq 100); do unshare -n echo -n ; done'
real 0m5.456s

Not sure what the best way to track down the patch(es)/commit(s) is, but I guess that's a start.


2018-10-20 14:09

developer   ~0032952

@tylerni7, I've been trying to follow the patches, but there are too many changes related to netfilter.
If you could test with older versions of 4.4, it would help narrow down the changes.
You can get older elrepo 4.4.x kernels from here, so you don't have to rebuild.



2018-10-22 10:56

developer   ~0032963

Also, I've been trying to replicate this bug locally and I couldn't; 3.10.0-862.14.4 is even faster than 4.4.161-1.el7.elrepo.x86_64 here.
Please tell us more about your system to help us replicate the problem.



2018-10-22 18:40

reporter   ~0032964

Hm, gross. I'll try to see if I can find a good way to reproduce the issue in a standard environment. I can verify that what I was doing does not reproduce on a fresh machine. I don't know if it only manifests after the machine gets into a particular state, or something else. The machines I was testing on are in use, which may be triggering the issue.

While researching the issue, I came across a number of previous(?) kernel issues related to this which might be relevant.

One is a race condition with clone(CLONE_NEWNET) which, once triggered, prevents clone(CLONE_NEWNET) from succeeding until the machine reboots (I know that's the Ubuntu issue tracker, but it seems like just a kernel issue).

The other seems likely to be the same issue; it has some notes about reproducing it and more, so I'll try to look further into that and report back.


2018-10-26 20:19

reporter   ~0033008

Just a quick update: still having some issues reproducing this nicely. When we hit issues locally the machine is running Kubernetes which has set up a bunch of things that might somehow affect the performance.

On newer kernels (under slightly different circumstances; I still need to figure out what "newer" means, but at least 4.15) the issue can still happen, with unshare calls taking several seconds to complete, but by setting RCU_EXPEDITED=1 the calls go back to taking milliseconds.
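For reference, the RCU_EXPEDITED=1 setting presumably refers to the kernel's expedited-RCU knob; network-namespace teardown waits on RCU grace periods, which is why expediting them helps here. A minimal sketch of toggling it, assuming the standard sysfs path:

```shell
# Enable expedited RCU grace periods at runtime (root required).
echo 1 | sudo tee /sys/kernel/rcu_expedited
# To apply at boot instead, add the kernel parameter:
#   rcupdate.rcu_expedited=1
```

Expedited grace periods trade some system-wide overhead (IPIs to other CPUs) for much shorter waits, so this is a workaround rather than a fix.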

I will try to come up with a better way to reproduce the issue and post an update when that happens.


2019-09-22 15:24

developer   ~0035192

@tylerni7 I'm doing some cleanup here.
Are you still seeing this, even without a specific way to replicate it?


2019-09-23 17:35

reporter   ~0035206

We were still seeing this. Eventually we moved to using a kernel >= 4.17 from ELRepo, which seems to help quite a bit.

It seems difficult to reproduce easily because something needs to start causing contention before everything starts slowing down, so a fresh system may not hit it.

Anyway, since we've worked around this, feel free to do whatever you think is appropriate for the ticket.


Issue History

Date Modified Username Field Change
2018-10-19 22:34 tylerni7 New Issue
2018-10-19 22:43 toracat Note Added: 0032947
2018-10-19 22:56 TrevorH Note Added: 0032948
2018-10-19 23:09 tylerni7 Note Added: 0032949
2018-10-19 23:16 toracat Note Added: 0032950
2018-10-20 00:13 tylerni7 Note Added: 0032951
2018-10-20 13:58 toracat Status new => acknowledged
2018-10-20 14:09 pgreco Note Added: 0032952
2018-10-22 10:56 pgreco Note Added: 0032963
2018-10-22 18:40 tylerni7 Note Added: 0032964
2018-10-26 20:19 tylerni7 Note Added: 0033008
2019-09-22 15:24 pgreco Note Added: 0035192
2019-09-23 17:35 tylerni7 Note Added: 0035206
2020-06-15 13:25 pgreco Status acknowledged => closed
2020-06-15 13:25 pgreco Resolution open => no change required