View Issue Details

IDProjectCategoryView StatusLast Update
0016441CentOS-7systemdpublic2019-11-22 04:14
Reporterbboozzoo 
PrioritynormalSeveritycrashReproducibilityrandom
Status newResolutionopen 
Product Version7.7-1908 
Target VersionFixed in Version 
Summary0016441: ABRT when enumerating cgroups
DescriptionCaught during integration tests of snapd on CentOS 7.

Last couple of lines from the journal:
Sep 23 09:04:32 sep230834-687428 systemd[1]: Stopping Snappy daemon...
Sep 23 09:04:32 sep230834-687428 snapd[7261]: main.go:149: Exiting on terminated signal.
Sep 23 09:04:32 sep230834-687428 systemd[1]: Stopped Snappy daemon.
Sep 23 09:04:33 sep230834-687428 systemd[1]: Reloading.
Sep 23 09:04:33 sep230834-687428 systemd[1]: Caught <ABRT>, dumped core as pid 7363.
Sep 23 09:04:33 sep230834-687428 systemd[1]: Freezing execution.

Backtrace:

Core was generated by `/usr/lib/systemd/systemd --system --deserialize 21'.
Program terminated with signal 6, Aborted.
#0 0x00007f13a1dd8607 in kill () at ../sysdeps/unix/syscall-template.S:81
81 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
Missing separate debuginfos, use: debuginfo-install libattr-2.4.46-13.el7.x86_64 libcap-ng-0.7.5-4.el7.x86_64 pcre-8.32-17.el7.x86_64
(gdb) bt
#0 0x00007f13a1dd8607 in kill () at ../sysdeps/unix/syscall-template.S:81
#1 0x000055f5851fd95d in crash (sig=6) at src/core/main.c:172
#2 <signal handler called>
#3 0x00007f13a1dd8337 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#4 0x00007f13a1dd9a28 in __GI_abort () at abort.c:90
#5 0x00007f13a1e1ae87 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f13a1f2d3b8 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
#6 0x00007f13a1e23679 in malloc_printerr (ar_ptr=0x7f13a2169760 <main_arena>, ptr=<optimized out>, str=0x7f13a1f2d530 "free(): invalid next size (fast)", action=3)
    at malloc.c:4967
#7 _int_free (av=0x7f13a2169760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3843
#8 0x000055f58525a5ff in freep (p=0x7ffe259cb140) at src/shared/util.h:688
#9 cg_enumerate_subgroups (controller=controller@entry=0x55f5852ec55a "name=systemd", path=path@entry=0x55f5871b4ad0 "/sys/fs/cgroup/systemd/sys/fs/cgroup/systemd",
    _d=_d@entry=0x7ffe259cb168) at src/shared/cgroup-util.c:89
#10 0x000055f58525a6c8 in cg_is_empty_recursive (controller=controller@entry=0x55f5852ec55a "name=systemd",
    path=0x55f5871b4ad0 "/sys/fs/cgroup/systemd/sys/fs/cgroup/systemd", ignore_self=ignore_self@entry=true) at src/shared/cgroup-util.c:923
#11 0x000055f585215b88 in cgroup_good (s=<optimized out>) at src/core/service.c:1329
#12 0x000055f585215bd5 in service_may_gc (u=0x55f58711d6c0) at src/core/service.c:2457
#13 0x000055f5852a9465 in unit_add_to_gc_queue (u=0x55f58711d6c0) at src/core/unit.c:355
#14 0x000055f5852ade67 in unit_ref_unset (ref=0x55f587187618) at src/core/unit.c:3155
#15 0x000055f5852af892 in unit_free (u=0x55f58711d6c0) at src/core/unit.c:525
#16 0x000055f5852008e8 in manager_clear_jobs_and_units (m=m@entry=0x55f58710f4b0) at src/core/manager.c:970
#17 0x000055f5852049b0 in manager_reload (m=m@entry=0x55f58710f4b0) at src/core/manager.c:2804
#18 0x000055f5851fac4f in main (argc=4, argv=0x7ffe259cbc78) at src/core/main.c:1799
(gdb) bt full
#0 0x00007f13a1dd8607 in kill () at ../sysdeps/unix/syscall-template.S:81
No locals.
#1 0x000055f5851fd95d in crash (sig=6) at src/core/main.c:172
        rl = {rlim_cur = 18446744073709551615, rlim_max = 18446744073709551615}
        sa = {__sigaction_handler = {sa_handler = 0x0, sa_sigaction = 0x0}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
        pid = <optimized out>
        __func__ = "crash"
        __PRETTY_FUNCTION__ = "crash"
#2 <signal handler called>
No locals.
#3 0x00007f13a1dd8337 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
        resultvar = 0
        pid = 1
        selftid = 1
#4 0x00007f13a1dd9a28 in __GI_abort () at abort.c:90
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x7f13a1f2ac31, sa_sigaction = 0x7f13a1f2ac31}, sa_mask = {__val = {3, 140729529444043, 5, 139722298138087, 1,
              139722298146801, 3, 140729529444020, 12, 139722298146805, 2, 139722298146805, 2, 140729529444832, 140729529444832, 140729529446592}}, sa_flags = 14,
          sa_restorer = 0x7ffe259cabe0}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#5 0x00007f13a1e1ae87 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f13a1f2d3b8 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
        ap = {{gp_offset = 40, fp_offset = 48, overflow_arg_area = 0x7ffe259cb0d0, reg_save_area = 0x7ffe259cafe0}}
        ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7ffe259cb0d0, reg_save_area = 0x7ffe259cafe0}}
        fd = 2
        on_2 = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
        written = <optimized out>
#6 0x00007f13a1e23679 in malloc_printerr (ar_ptr=0x7f13a2169760 <main_arena>, ptr=<optimized out>, str=0x7f13a1f2d530 "free(): invalid next size (fast)", action=3)
    at malloc.c:4967
        buf = "000055f5871b4ad0"
        cp = <optimized out>
        ar_ptr = 0x7f13a2169760 <main_arena>
        ptr = <optimized out>
        str = 0x7f13a1f2d530 "free(): invalid next size (fast)"
        action = 3
#7 _int_free (av=0x7f13a2169760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:3843
        size = <optimized out>
        fb = <optimized out>
        nextchunk = <optimized out>
        nextsize = <optimized out>
        nextinuse = <optimized out>
        prevsize = <optimized out>
---Type <return> to continue, or q <return> to quit---
        bck = <optimized out>
        fwd = <optimized out>
        errstr = 0x7f13a1f2d530 "free(): invalid next size (fast)"
        locked = <optimized out>
#8 0x000055f58525a5ff in freep (p=0x7ffe259cb140) at src/shared/util.h:688
No locals.
#9 cg_enumerate_subgroups (controller=controller@entry=0x55f5852ec55a "name=systemd", path=path@entry=0x55f5871b4ad0 "/sys/fs/cgroup/systemd/sys/fs/cgroup/systemd",
    _d=_d@entry=0x7ffe259cb168) at src/shared/cgroup-util.c:89
        fs = 0x55f5871b4ad0 "/sys/fs/cgroup/systemd/sys/fs/cgroup/systemd"
        r = <optimized out>
        d = <optimized out>
        __PRETTY_FUNCTION__ = "cg_enumerate_subgroups"
#10 0x000055f58525a6c8 in cg_is_empty_recursive (controller=controller@entry=0x55f5852ec55a "name=systemd",
    path=0x55f5871b4ad0 "/sys/fs/cgroup/systemd/sys/fs/cgroup/systemd", ignore_self=ignore_self@entry=true) at src/shared/cgroup-util.c:923
        d = 0x0
        fn = 0x55f58711ddf0 "`q \207\365U"
        r = <optimized out>
        __PRETTY_FUNCTION__ = "cg_is_empty_recursive"
#11 0x000055f585215b88 in cgroup_good (s=<optimized out>) at src/core/service.c:1329
        r = <optimized out>
#12 0x000055f585215bd5 in service_may_gc (u=0x55f58711d6c0) at src/core/service.c:2457
        s = 0x55f58711d6c0
#13 0x000055f5852a9465 in unit_add_to_gc_queue (u=0x55f58711d6c0) at src/core/unit.c:355
No locals.
#14 0x000055f5852ade67 in unit_ref_unset (ref=0x55f587187618) at src/core/unit.c:3155
        __PRETTY_FUNCTION__ = "unit_ref_unset"
#15 0x000055f5852af892 in unit_free (u=0x55f58711d6c0) at src/core/unit.c:525
        d = <optimized out>
        i = {idx = 4294967295, next_key = 0x0}
        t = <optimized out>
        __PRETTY_FUNCTION__ = "unit_free"
#16 0x000055f5852008e8 in manager_clear_jobs_and_units (m=m@entry=0x55f58710f4b0) at src/core/manager.c:970
        u = <optimized out>
        __PRETTY_FUNCTION__ = "manager_clear_jobs_and_units"
#17 0x000055f5852049b0 in manager_reload (m=m@entry=0x55f58710f4b0) at src/core/manager.c:2804
        r = 0
        q = <optimized out>
        f = 0x55f5871b6df0
        fds = 0x55f587115bc0
        __PRETTY_FUNCTION__ = "manager_reload"
#18 0x000055f5851fac4f in main (argc=4, argv=0x7ffe259cbc78) at src/core/main.c:1799
        m = 0x55f58710f4b0
        r = <optimized out>
        retval = 1
        before_startup = 886751997
---Type <return> to continue, or q <return> to quit---
        after_startup = <optimized out>
        timespan = '\000' <repeats 63 times>
        fds = 0x0
        reexecute = false
        shutdown_verb = 0x0
        initrd_timestamp = <optimized out>
        userspace_timestamp = {realtime = 1569228578778635, monotonic = 886742994}
        kernel_timestamp = {realtime = 1569227692035642, monotonic = 0}
        security_start_timestamp = {realtime = 0, monotonic = 0}
        security_finish_timestamp = {realtime = 0, monotonic = 0}
        systemd = "systemd"
        skip_setup = <optimized out>
        j = <optimized out>
        loaded_policy = false
        arm_reboot_watchdog = false
        queue_default_job = <optimized out>
        empty_etc = false
        switch_root_dir = 0x0
        switch_root_init = 0x0
        saved_rlimit_nofile = {rlim_cur = 1024, rlim_max = 4096}
        error_message = 0x0
        __func__ = "main"
        __PRETTY_FUNCTION__ = "main"
(gdb)
Steps To ReproduceCould not reproduce reliably outside of snapd integration suite.
Additional InformationPackage versions:

glibc-2.17-292.el7.x86_64
glibc-common-2.17-292.el7.x86_64
glibc-debuginfo-2.17-292.el7.x86_64
glibc-debuginfo-common-2.17-292.el7.x86_64
glibc-devel-2.17-292.el7.x86_64
glibc-headers-2.17-292.el7.x86_64
glibc-static-2.17-292.el7.x86_64
systemd-219-67.el7_7.1.x86_64
systemd-debuginfo-219-67.el7_7.1.x86_64
systemd-devel-219-67.el7_7.1.x86_64
systemd-libs-219-67.el7_7.1.x86_64
systemd-sysv-219-67.el7_7.1.x86_64
TagsNo tags attached.
abrt_hash
URL

Activities

mihkela

mihkela

2019-09-23 14:51

reporter   ~0035202

Same core dump and backtrace after Kickstart installation of 7.7.1908 on bare metal and VMs. We use Puppet to install and configure many different services in %post.

Seems to be some race condition because rearranging the packages/services can make the problem disappear.

Did not reproduce after forcing older version of systemd in Kickstart file (systemd-219-67.el7_7.1 -> systemd-219-67.el7)
pgreco

pgreco

2019-09-26 17:03

developer   ~0035241

we've just uploaded a fasttrack version of systemd which I suspect will fix this issue, can you "yum update systemd --enablerepo=fasttrack" and check?
bboozzoo

bboozzoo

2019-09-27 10:23

reporter   ~0035252

@pgreco thanks for pushing out the builds. I can confirm that with updated systemd I was not able to reproduce the problem across a number of runs of the snapd integration tests suite.
mchapman

mchapman

2019-11-20 23:59

reporter   ~0035719

@pgreco

Do you have a link to the fix, or at least a description of the bug? I'm hitting this exact backtrack on systemd-219-67.el7_7.2. The only fasttrack RPM right now is for systemd-219-67.el7_7.1.0.1 and it's fixing a completely different problem, and that fix is already in the package I'm using.

Thanks!
pgreco

pgreco

2019-11-21 10:12

developer   ~0035721

@mchapman, what we had in fasttrack for el7_0.1.0.1 was this https://git.centos.org/rpms/systemd/c/8bc8b495f87464799ae978ee6058fa542a69b2a5?branch=c7-sig-altarch-fasttrack and that was included in el7_7.2 (hence no fasttrack for that version).
Do you have any idea when this started to fail? that should help me narrow down the offending patch.
mchapman

mchapman

2019-11-21 11:08

reporter   ~0035724

@pgreco, I saw it soon after an upgrade from 219-67.el7_7.1 to 219-67.el7_7.2 on some machines -- specifically on the first daemon-reload after the daemon-reexec during the systemd package's update. It's possible that the problem could be from an earlier version than this though... the initial version there was just what these machines were initially provisioned with.

I've been trying to reproduce the exact conditions that cause the crash. No luck yet.

The patch you linked there doesn't seem like it would have anything to do with this bug, which is why I was wondering whether there was some other fasttrack package that I'd missed. :-)
pgreco

pgreco

2019-11-21 12:02

developer   ~0035726

if you saw it updating from .1 (no fasttrack) to .2, then .1 was still active, explaining the crash, and why you haven't been able to replicate it since.
The patch fixes a problem of processes being killed, which is why I thought it could have helped in the first place.
mchapman

mchapman

2019-11-21 19:14

reporter   ~0035729

It was very definitely after the upgrade: the new systemd process had logged its standard "systemd 219 running in system mode" stuff, etc. Actually, looking at the logs more closely it seemed to happen on the second daemon-reload after the upgrade, not the first, which is even weirder.
pgreco

pgreco

2019-11-21 19:58

developer   ~0035730

Right, but afaik, until you reboot, or run systemctl daemon-reexec, you're still running the old systemd, even after upgrade
mchapman

mchapman

2019-11-21 20:49

reporter   ~0035731

The systemd package's %post scriptlet runs daemon-reexec.
mchapman

mchapman

2019-11-22 02:52

reporter   ~0035733

I have a way to reproduce this now.

The general idea is:

1. We create and start a service called foobar-a.service. Its config references a foobar-b.socket. The kind of reference (Wants, Requires, After, Before, etc.) probably does not matter.
2. The socket's associated service, foobar-b.service, does *not* exist. The socket has not been started, so this is not a problem.
3. We invoke systemctl daemon-reexec.
4. We next create foobar-b.service by linking it to an existing, active service. In the steps below, this happens to be foobar-a.service, but I could have just chosen any other active service.
5. We invoke systemctl daemon-reload twice.

###################

[root@test ~]# rpm -q systemd
systemd-219-67.el7_7.2.x86_64
[root@test ~]# cat <<EOF >/etc/systemd/system/foobar-a.service
> [Unit]
> After=foobar-b.socket
>
> [Service]
> ExecStart=/bin/sleep 600
> Restart=always
> EOF
[root@test ~]# cat <<EOF >/etc/systemd/system/foobar-b.socket
> [Socket]
> ListenDatagram=/tmp/socket
> EOF
[root@test ~]# systemctl daemon-reload
[root@test ~]# systemctl start foobar-a.service
[root@test ~]# systemctl daemon-reexec
[root@test ~]# ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl daemon-reload
[root@test ~]# rm -f /etc/systemd/system/foobar-b.service; systemctl daemon-reload
[root@test ~]# systemctl daemon-reexec
[root@test ~]# ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl daemon-reload
[root@test ~]# rm -f /etc/systemd/system/foobar-b.service; systemctl daemon-reload
[root@test ~]# systemctl daemon-reexec
[root@test ~]# ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl daemon-reload
[root@test ~]# rm -f /etc/systemd/system/foobar-b.service; systemctl daemon-reload
[root@test ~]# systemctl daemon-reexec
[root@test ~]# ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl daemon-reload

Broadcast message from systemd-journald@test (Fri 2019-11-22 02:47:58 UTC):

systemd[1]: Caught <ABRT>, dumped core as pid 1479.


Broadcast message from systemd-journald@test (Fri 2019-11-22 02:47:58 UTC):

systemd[1]: Freezing execution.


Message from syslogd@test at Nov 22 02:47:58 ...
 systemd:Caught <ABRT>, dumped core as pid 1479.

Message from syslogd@test at Nov 22 02:47:58 ...
 systemd:Freezing execution.
Failed to execute operation: Connection reset by peer

###################

As you can see, sometimes the last few steps sometimes need to be repeated. I suspect it's dependent on some hash ordering within systemd: this hash ordering is perturbed each time daemon-reexec is used, so if it doesn't crash in one iteration it might in the next.

My hunch here is that the bug is related to the way foobar-b.service needs to be turned from a stub service into an additional name on an _existing_ service. It seems like this is not happening correctly. systemd is deciding at one point to garbage collect foobar-b.service (since foobar-b.socket is stopped, this is OK), but that then is corrupting foobar-a.service's state.

Some further notes: I am not totally sure if a socket unit is necessary, but I couldn't reliably crash systemd without it. The socket doesn't need to be active: it's only use here seems to be to keep foobar-b.service around as a stub unit. This bug seems to exhibit itself in other weird ways if a crash does not occur. For instance, in many cases I ended up in a situation where systemd thought foobar-a.service was inactive, even though it still had a process -- the Main PID even! -- in that service's cgroup.
mchapman

mchapman

2019-11-22 03:09

reporter   ~0035734

With systemd-219-67.el7.x86_64 I am unable to crash systemd using these steps, but it does show the bug I described in my last paragraph:

###################

...
[root@test ~]# systemctl start foobar-a.service
[root@test ~]# systemctl daemon-reexec
[root@test ~]# systemctl status foobar-a.service
* foobar-a.service
   Loaded: loaded (/etc/systemd/system/foobar-a.service; static; vendor preset: disabled)
   Active: active (running) since Fri 2019-11-22 03:05:32 UTC; 18s ago
 Main PID: 2107 (sleep)
   CGroup: /system.slice/foobar-a.service
           `-2107 /bin/sleep 600

Nov 22 03:05:32 test systemd[1]: Started foobar-a.service.
[root@test ~]# ln -sf /etc/systemd/system/foobar-{a,b}.service; systemctl daemon-reload; systemctl daemon-reload
[root@test ~]# systemctl status foobar-a.service
* foobar-a.service
   Loaded: loaded (/etc/systemd/system/foobar-a.service; enabled; vendor preset: disabled)
   Active: inactive (dead)
 Main PID: 2107
   CGroup: /system.slice/foobar-a.service
           `-2107 /bin/sleep 600

Nov 22 03:05:32 test systemd[1]: Started foobar-a.service.

###################

That is, systemd thinks foobar-a.service is now stopped, even though the process it started in it is still running.
mchapman

mchapman

2019-11-22 04:14

reporter   ~0035735

This is also being tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1760149 .

Issue History

Date Modified Username Field Change
2019-09-23 09:28 bboozzoo New Issue
2019-09-23 14:51 mihkela Note Added: 0035202
2019-09-26 17:03 pgreco Note Added: 0035241
2019-09-27 10:23 bboozzoo Note Added: 0035252
2019-11-20 23:59 mchapman Note Added: 0035719
2019-11-21 10:12 pgreco Note Added: 0035721
2019-11-21 11:08 mchapman Note Added: 0035724
2019-11-21 12:02 pgreco Note Added: 0035726
2019-11-21 19:14 mchapman Note Added: 0035729
2019-11-21 19:58 pgreco Note Added: 0035730
2019-11-21 20:49 mchapman Note Added: 0035731
2019-11-22 02:52 mchapman Note Added: 0035733
2019-11-22 03:09 mchapman Note Added: 0035734
2019-11-22 04:14 mchapman Note Added: 0035735