View Issue Details

ID: 0015079 | Project: CentOS-7 | Category: kernel | View Status: public | Last Update: 2018-07-24 02:40
Reporter: kira
Priority: high | Severity: major | Reproducibility: random
Status: resolved | Resolution: fixed
Product Version: 7.4.1708
Fixed in Version: 7.5.1804
Summary: 0015079: kernel deadlock in kmem_cache_destroy_memcg_children
Description: We installed a kubernetes cluster on a set of CentOS 7.4 nodes. Sometimes, some nodes hang after we restart the docker service during installation.

dmesg logs:
```
Jul 18 18:12:13 c321v72 kernel: INFO: task systemd:1 blocked for more than 120 seconds.
Jul 18 18:12:13 c321v72 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 18 18:12:13 c321v72 kernel: systemd D
Jul 18 18:12:13 c321v72 kernel: ffff88017cd10000 0 1 0 0x00000000
Jul 18 18:12:13 c321v72 kernel: ffff88017cd1bd70 0000000000000086 ffff88017cd10000 ffff88017cd1bfd8
Jul 18 18:12:13 c321v72 kernel: ffff88017cd1bfd8 ffff88017cd1bfd8 ffff88017cd10000 ffffffff81a93c20
Jul 18 18:12:13 c321v72 kernel: ffffffff81a93c24 ffff88017cd10000 00000000ffffffff ffffffff81a93c28
Jul 18 18:12:13 c321v72 kernel: Call Trace:
Jul 18 18:12:13 c321v72 kernel: [<ffffffff816c38c9>] schedule_preempt_disabled+0x29/0x70
Jul 18 18:12:13 c321v72 kernel: [<ffffffff816c1347>] __mutex_lock_slowpath+0xc7/0x1e0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff816c05af>] mutex_lock+0x1f/0x30
Jul 18 18:12:13 c321v72 kernel: [<ffffffff812019e3>] mem_cgroup_write+0x313/0x400
Jul 18 18:12:13 c321v72 kernel: [<ffffffff812016d0>] ? memcg_update_cache_sizes+0xc0/0xc0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff8110ee5e>] cgroup_file_write+0x1be/0x320
Jul 18 18:12:13 c321v72 kernel: [<ffffffff812c1890>] ? security_file_permission+0x20/0xa0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff8120cf6d>] vfs_write+0xbd/0x1e0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff8120dddf>] SyS_write+0x7f/0xe0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff816ceb49>] system_call_fastpath+0x16/0x1b
Jul 18 18:12:13 c321v72 kernel: INFO: task kworker/u8:1:52 blocked for more than 120 seconds.
Jul 18 18:12:13 c321v72 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 18 18:12:13 c321v72 kernel: kworker/u8:1 D ffff880234883f40 0 52 2 0x00000000
Jul 18 18:12:13 c321v72 kernel: Workqueue: netns cleanup_net
Jul 18 18:12:13 c321v72 kernel: ffff8802348b3be0 0000000000000046 ffff880234883f40 ffff8802348b3fd8
Jul 18 18:12:13 c321v72 kernel: ffff8802348b3fd8 ffff8802348b3fd8 ffff880234883f40 ffffffff81a93c20
Jul 18 18:12:13 c321v72 kernel: ffffffff81a93c24 ffff880234883f40 00000000ffffffff ffffffff81a93c28
Jul 18 18:12:13 c321v72 kernel: Call Trace:
Jul 18 18:12:13 c321v72 kernel: [<ffffffff816c38c9>] schedule_preempt_disabled+0x29/0x70
Jul 18 18:12:13 c321v72 kernel: [<ffffffff816c1347>] __mutex_lock_slowpath+0xc7/0x1e0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff816c05af>] mutex_lock+0x1f/0x30
Jul 18 18:12:13 c321v72 kernel: [<ffffffff8120202c>] kmem_cache_destroy_memcg_children+0x4c/0xf0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff811af992>] kmem_cache_destroy+0x52/0x140
Jul 18 18:12:13 c321v72 kernel: [<ffffffff812020ca>] kmem_cache_destroy_memcg_children+0xea/0xf0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff811af992>] kmem_cache_destroy+0x52/0x140
Jul 18 18:12:13 c321v72 kernel: [<ffffffffc02965cb>] nf_conntrack_cleanup_net_list+0x17b/0x1d0 [nf_conntrack]
Jul 18 18:12:13 c321v72 kernel: [<ffffffffc029744d>] nf_conntrack_pernet_exit+0x6d/0x80 [nf_conntrack]
Jul 18 18:12:13 c321v72 kernel: [<ffffffff815969c3>] ops_exit_list.isra.5+0x53/0x60
Jul 18 18:12:13 c321v72 kernel: [<ffffffff81597aa0>] cleanup_net+0x1d0/0x2c0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff810aa43d>] process_one_work+0x1ad/0x470
Jul 18 18:12:13 c321v72 kernel: [<ffffffff810ab4be>] worker_thread+0x11e/0x3e0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff810ab3a0>] ? manage_workers.isra.24+0x2b0/0x2b0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff810b278f>] kthread+0xcf/0xe0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff810b26c0>] ? flush_kthread_worker+0xc0/0xc0
Jul 18 18:12:13 c321v72 kernel: [<ffffffff816cea98>] ret_from_fork+0x58/0x90
Jul 18 18:12:13 c321v72 kernel: [<ffffffff810b26c0>] ? flush_kthread_worker+0xc0/0xc0
```

Then I downloaded `http://vault.centos.org/7.4.1708/os/x86_64/Packages/kernel-3.10.0-693.el7.x86_64.rpm` and added `printk()` calls to the kernel source.

`mm/slab_common.c`:
```c
struct kmem_cache *
kmem_cache_create_memcg(struct mem_cgroup *memcg, const char *name, size_t size,
            size_t align, unsigned long flags, void (*ctor)(void *),
            struct kmem_cache *parent_cache)
{
    printk(KERN_DEBUG
            "kdbg: creating cache %s with cgroup %p and parent cache %p",
            name, memcg, parent_cache);
    struct kmem_cache *s = NULL;
    int err = 0;

    get_online_cpus();
    mutex_lock(&slab_mutex);

    if (!kmem_cache_sanity_check(memcg, name, size) == 0)
        goto out_locked;

    /*
     * Some allocators will constraint the set of valid flags to a subset
     * of all flags. We expect them to define CACHE_CREATE_MASK in this
     * case, and we'll just provide them with a sanitized version of the
     * passed flags.
     */
    flags &= CACHE_CREATE_MASK;

    s = __kmem_cache_alias(memcg, name, size, align, flags, ctor);
    if (s) {
        printk(KERN_DEBUG "kdbg: cache %s(%p) shares %s with refcount %d and %s",
                name, s, s->name, s->refcount, is_root_cache(s)?"root":"nonroot");
        goto out_locked;
    }

    s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
    if (s) {
        s->object_size = s->size = size;
        s->align = calculate_alignment(flags, align, size);
        s->ctor = ctor;

        if (memcg_register_cache(memcg, s, parent_cache)) {
            kmem_cache_free(kmem_cache, s);
            err = -ENOMEM;
            goto out_locked;
        }

        s->name = kstrdup(name, GFP_KERNEL);
        if (!s->name) {
            kmem_cache_free(kmem_cache, s);
            err = -ENOMEM;
            goto out_locked;
        }

        err = __kmem_cache_create(s, flags);
        if (!err) {
            s->refcount = 1;
            list_add(&s->list, &slab_caches);
            memcg_cache_list_add(memcg, s);
            printk(KERN_DEBUG "kdbg: allocated %s(%p) with refcount %d and %s",
                    name, s, s->refcount, is_root_cache(s)?"root":"nonroot");
        } else {
            kfree(s->name);
            kmem_cache_free(kmem_cache, s);
        }
    } else
        err = -ENOMEM;

out_locked:
    mutex_unlock(&slab_mutex);
    put_online_cpus();

    if (err) {

        if (flags & SLAB_PANIC)
            panic("kmem_cache_create: Failed to create slab '%s'. Error %d\n",
                name, err);
        else {
            printk(KERN_WARNING "kmem_cache_create(%s) failed with error %d",
                name, err);
            dump_stack();
        }

        return NULL;
    }

    return s;
}

struct kmem_cache *
kmem_cache_create(const char *name, size_t size, size_t align,
          unsigned long flags, void (*ctor)(void *))
{
    return kmem_cache_create_memcg(NULL, name, size, align, flags, ctor, NULL);
}
EXPORT_SYMBOL(kmem_cache_create);

void kmem_cache_destroy(struct kmem_cache *s)
{
    if (unlikely(!s))
        return;

    printk(KERN_DEBUG "kdbg: recycle cache %p which is %s", s,
            is_root_cache(s)?"root":"nonroot");

    /* Destroy all the children caches if we aren't a memcg cache */
    kmem_cache_destroy_memcg_children(s);

    get_online_cpus();
    mutex_lock(&slab_mutex);
    s->refcount--;
    if (!s->refcount) {
        list_del(&s->list);

        if (!__kmem_cache_shutdown(s)) {
            mutex_unlock(&slab_mutex);
            if (s->flags & SLAB_DESTROY_BY_RCU)
                rcu_barrier();

            memcg_release_cache(s);
            kfree(s->name);
            kmem_cache_free(kmem_cache, s);
            printk(KERN_DEBUG "kdbg: recycled cache %p", s);
        } else {
            list_add(&s->list, &slab_caches);
            mutex_unlock(&slab_mutex);
            printk(KERN_ERR "kmem_cache_destroy %s: Slab cache still has objects\n",
                s->name);
            dump_stack();
        }
    } else {
        mutex_unlock(&slab_mutex);
    }
    put_online_cpus();
}
```

`mm/memcontrol.c`:
```c
void memcg_release_cache(struct kmem_cache *s)
{
    struct kmem_cache *root;
    struct mem_cgroup *memcg;
    int id;

    /*
     * This happens, for instance, when a root cache goes away before we
     * add any memcg.
     */
    if (!s->memcg_params)
        return;

    if (s->memcg_params->is_root_cache)
        goto out;

    memcg = s->memcg_params->memcg;
    id = memcg_cache_id(memcg);

    root = s->memcg_params->root_cache;
    root->memcg_params->memcg_caches[id] = NULL;
    printk(KERN_DEBUG "kdbg: remove cache %p from parent %p with index %d",
            s, root, id);

    mutex_lock(&memcg->slab_caches_mutex);
    list_del(&s->memcg_params->list);
    mutex_unlock(&memcg->slab_caches_mutex);

    mem_cgroup_put(memcg);
out:
    kfree(s->memcg_params);
}

static void kmem_cache_destroy_work_func(struct work_struct *w)
{
    struct kmem_cache *cachep;
    struct memcg_cache_params *p;

    p = container_of(w, struct memcg_cache_params, destroy);

    cachep = memcg_params_to_cache(p);

    printk(KERN_DEBUG "kdbg: destroy worker for %p is called", cachep);

    /*
     * If we get down to 0 after shrink, we could delete right away.
     * However, memcg_release_pages() already puts us back in the workqueue
     * in that case. If we proceed deleting, we'll get a dangling
     * reference, and removing the object from the workqueue in that case
     * is unnecessary complication. We are not a fast path.
     *
     * Note that this case is fundamentally different from racing with
     * shrink_slab(): if memcg_cgroup_destroy_cache() is called in
     * kmem_cache_shrink, not only we would be reinserting a dead cache
     * into the queue, but doing so from inside the worker racing to
     * destroy it.
     *
     * So if we aren't down to zero, we'll just schedule a worker and try
     * again
     */
    if (atomic_read(&cachep->memcg_params->nr_pages) != 0) {
        kmem_cache_shrink(cachep);
        if (atomic_read(&cachep->memcg_params->nr_pages) == 0)
            return;
    } else {
        printk(KERN_DEBUG "kdbg: destroy %p", cachep);
        kmem_cache_destroy(cachep);
    }
}

void kmem_cache_destroy_memcg_children(struct kmem_cache *s)
{
    struct kmem_cache *c;
    int i;
    char task_comm[TASK_COMM_LEN + 1];
    task_comm[TASK_COMM_LEN] = 0;

    if (!s->memcg_params)
        return;
    if (!s->memcg_params->is_root_cache)
        return;

    if (memcg_limit_mutex.owner == current) {
        /* Get task command name. */
        get_task_comm(task_comm, memcg_limit_mutex.owner);
        printk(KERN_EMERG "kdbg: %s dead lock with cache %p", task_comm, s);
    }

    /*
     * If the cache is being destroyed, we trust that there is no one else
     * requesting objects from it. Even if there are, the sanity checks in
     * kmem_cache_destroy should caught this ill-case.
     *
     * Still, we don't want anyone else freeing memcg_caches under our
     * noses, which can happen if a new memcg comes to life. As usual,
     * we'll take the memcg_limit_mutex to protect ourselves against this.
     */
    mutex_lock(&memcg_limit_mutex);
    for (i = 0; i < memcg_limited_groups_array_size; i++) {
        c = s->memcg_params->memcg_caches[i];
        if (!c)
            continue;

        /*
         * We will now manually delete the caches, so to avoid races
         * we need to cancel all pending destruction workers and
         * proceed with destruction ourselves.
         *
         * kmem_cache_destroy() will call kmem_cache_shrink internally,
         * and that could spawn the workers again: it is likely that
         * the cache still have active pages until this very moment.
         * This would lead us back to mem_cgroup_destroy_cache.
         *
         * But that will not execute at all if the "dead" flag is not
         * set, so flip it down to guarantee we are in control.
         */
        c->memcg_params->dead = false;
        cancel_work_sync(&c->memcg_params->destroy);

        printk(KERN_DEBUG "kdbg: parent %p cache %p index %d is %s",
                s, c, i, is_root_cache(s)?"root":"nonroot");
        kmem_cache_destroy(c);
    }
    mutex_unlock(&memcg_limit_mutex);
}
```

Then we reproduced the issue and captured these logs:
```
Jul 18 18:08:59 c321v72 kernel: kdbg: destroy worker for ffff880094a78c00 is called
Jul 18 18:08:59 c321v72 kernel: kdbg: destroy ffff880094a78c00
Jul 18 18:08:59 c321v72 kernel: kdbg: recycle cache ffff880094a78c00 which is nonroot
...
Jul 18 18:08:59 c321v72 kernel: kdbg: remove cache ffff880094a78c00 from parent ffff880204373100 with index 39
Jul 18 18:08:59 c321v72 kernel: kdbg: recycled cache ffff880094a78c00
...
Jul 18 18:08:59 c321v72 kernel: kdbg: parent ffff880204373100 cache ffff880094a78c00 index 39 is root
Jul 18 18:08:59 c321v72 kernel: kdbg: recycle cache ffff880094a78c00 which is root
(hanging here)
...
```

We deduced that the problem is in `kmem_cache_destroy_memcg_children()`, right after the call to `cancel_work_sync()`.

If a destroy work item is running, `cancel_work_sync()` waits for it to finish. But once that work has finished, `s->memcg_params->memcg_caches[i]` is NULL and `c` is left as a dangling pointer.

If that memory is then reallocated and the new object happens to look like a root cache, `kmem_cache_destroy_memcg_children()` deadlocks trying to take `memcg_limit_mutex` again.
Steps To Reproduce: Install a kubernetes cluster and restart docker.
Tags: 3.10.0-693.el7.x86_64

Activities

kira   2018-07-19 04:18   reporter   ~0032304

Changes to the debug code:
```
        cancel_work_sync(&c->memcg_params->destroy);

        printk(KERN_DEBUG "kdbg: parent %p cache %p index %d is %s existing %p",
                s, c, i, is_root_cache(s)?"root":"nonroot",
                s->memcg_params->memcg_caches[i]);
        kmem_cache_destroy(c);
```

Logs:
```
Jul 19 12:13:14 c321v70 kernel: kdbg: destroy worker for ffff88021aa03a00 is called
Jul 19 12:13:14 c321v70 kernel: kdbg: destroy ffff88021aa03a00
Jul 19 12:13:14 c321v70 kernel: kdbg: recycle cache ffff88021aa03a00 which is nonroot
...
Jul 19 12:13:14 c321v70 kernel: kdbg: remove cache ffff88021aa03a00 from parent ffff8801f8b04b00 with index 775
Jul 19 12:13:14 c321v70 kernel: kdbg: recycled cache ffff88021aa03a00
Jul 19 12:13:14 c321v70 kernel: kdbg: parent ffff8801f8b04b00 cache ffff88021aa03a00 index 775 is root existing (null)
Jul 19 12:13:14 c321v70 kernel: kdbg: recycle cache ffff88021aa03a00 which is root
Jul 19 12:13:14 c321v70 kernel: kdbg: kworker/u8:2 dead lock with cache ffff88021aa03a00
...
```
toracat   2018-07-19 17:24   manager   ~0032307

CentOS 7.4 is no longer supported. Do you see the same issue with the current kernel?
kira   2018-07-20 02:04   reporter   ~0032323

I'm going to test it on 7.5. I think 7.5 has the same issue because the code is the same.
kira   2018-07-23 09:35   reporter   ~0032335

We could not reproduce this issue on CentOS 7.5. The bug may have been fixed, but I can't find any information about it.
toracat   2018-07-23 11:34   manager   ~0032337

It's good to know that the bug has been fixed in 7.5.

Are you sure you do not see any code change? The changelog of the 7.5 kernel (in section 3.10.0-852.el7) lists the following patch:

[mm] memcg, slab: do not destroy children caches if parent has aliases (Aristeu Rozanski) [1502818]

This is commit b8529907ba35d625fa4b85d3e4dc8021be97c1f3 upstream. Could this be the patch involved?
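For context, the title of that commit suggests the problem it addresses: children caches were destroyed at the very beginning of `kmem_cache_destroy()`, even when the root cache still had aliases (`refcount > 0`) and was not actually going away. The fix moves the children destruction after the alias check. Roughly, as a paraphrase of the idea and not the actual diff:

```
void kmem_cache_destroy(struct kmem_cache *s)
{
        ...
        s->refcount--;
        if (s->refcount)
                goto out;       /* still aliased: keep the cache */

        /* only now, when the root cache is really going away,
           destroy the per-memcg children */
        kmem_cache_destroy_memcg_children(s);
        ...
}
```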
toracat   2018-07-23 11:37   manager   ~0032338

I'm closing this as 'resolved'. But feel free to add any comment.

Issue History

Date Modified Username Field Change
2018-07-19 03:58 kira New Issue
2018-07-19 03:58 kira Tag Attached: 3.10.0-693.el7.x86_64
2018-07-19 04:18 kira Note Added: 0032304
2018-07-19 17:24 toracat Note Added: 0032307
2018-07-19 17:24 toracat Status new => feedback
2018-07-20 02:04 kira Note Added: 0032323
2018-07-20 02:04 kira Status feedback => assigned
2018-07-23 09:35 kira Note Added: 0032335
2018-07-23 11:34 toracat Note Added: 0032337
2018-07-23 11:37 toracat Status assigned => resolved
2018-07-23 11:37 toracat Resolution open => fixed
2018-07-23 11:37 toracat Note Added: 0032338
2018-07-23 11:38 toracat Fixed in Version => 7.5.1804