Sunday, June 21, 2015

signal delivery vs threads in the kernel

The following blogpost is the first step to explain why unkillable processes exist.

Processes can install handlers (custom functions) for signals. When a thread is executing in userspace, it can be interrupted at any time (with few exceptions). Upon such interruption, the kernel can force it to run installed signal handler before it continues executing the original code.

If such a thread is in the kernel and a signal is received, it can only be acted upon in places which explicitly check for pending signals. When a pending signal is seen, the function cleans up whatever it did and returns an error to the caller, all the way up to kernel-userspace boundary.

Let's see why.

Just interrupting code executing in the kernel and making it switch to userspace of course sounds absolutley terrible. Situation is better but still quite bad if we don't need to execute a handler. Still, it is beneficial to see actual technical obstacles. Note that this blogpost is about monolithic kernels, like Linux or FreeBSD.

Consider the following syscall written in pseudo-c (error and range checking omitted for brevity):


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
int meh_modify(int id, data_t *udata, size_t size)
{       
        data_t *d;
        meh_t *m;
        
        d = kmalloc(size);
        copy_from_user(d, udata, size);
        
        m = meh_find(id);
        meh_lock(m);
        for (size_t i = 0; i < size; i++)
                m->data[i] = d[i];
        meh_unlock(m);
        meh_put(m);
        kfree(d);
        return 0;
}

meh_modify is callable from userspace just like any other syscall. Callers pass an id of an object and data to be saved.

Sufficient amount of memory is allocated so that there is a place to store data copied from userspace.

When meh_find finds the object, it bumps reference counter by one and returns a pointer. We have to decrement the counter later by a call to meh_put.

Finally, meh_lock/meh_unlock pair provides us with a way to ensure no other threads are modifying the object we found at the same time.

With the code in place let's see how things can go wrong as we try to implement signal delivery for kernel threads.

Interruption at any point

Let's say the thread is interrupted after the line 10, just after it did meh_lock(m), as it is about to execute the loop.

So it got the lock. Any thread trying to do the same will spin waiting for our thread to release it. If our thread tries to take the lock again it will also spin.

Now consider a code in userspace which does meh_modify(0, some_data, BIGNUM) and installed a signal handler doing meh_modify(0, some_other_data, BIGNUM).

If we interrupt the thread in the kernel after the line 10 and make it execute such a handler, the very same thread re-enters the kernel, finds the same meh object and tries to lock it. But the lock is already taken, so it spins waiting for it to be released.... except it is in fact waiting for itself. As you can see, no progress can be made in this scenario.

One may suggest we recognise it is the very same thread which took the lock previously and just let it through. That's bogus as well. Consider interruption on line 12 (m->data[i] = d[i]), and let's say i == 42. The invocation from the signal handler succeeds, the thread goes back to where it left off previously - line 12. At this time i == 42, so it continues populating the data from this point only. So after it finishes, the end result is a mix of whatever was put there by current invocation and the one we let through. Or to state differently, there is no way to ensure consistency since operations are no longer atomic, defeating the purpose of locks.

You may note the problem here is fundamentally the same as with non-async-signal safe functions called from signal handlers.

Interruption in areas not protected by locks

What if we execute the handler either prior to meh_lock or after meh_unlock? Let's say we interrupt before the lock is taken, but after meh_find finds our object and bumps its reference counter.

Reference counting is a way of tracking the number of given object users.  For the purpose of this blogpost the following is true enough: when a function like meh_put is called, it decrements it by 1; If the count reaches 0, there are no other users and the object can be freed.

What would happen if someone erroneously kept incrementing the counter? Assuming it is of unsigned type, it would eventually wrap and be equal to 0. If then someone increments it once more and calls meh_put, the object would be freed, even though it still has valid users.

Say the kernel delivers the same signal multiple times.

Now let's take our userspace process calling meh_modify from its signal handler.  Well timed signalling can cause the thread to enter our syscall, grab the reference and get interrupted over and over, eventually overflowing the counter. Note this example is rather hypothetical, as other resulting problems would prevent the kernel from reaching this stage (see below).

But the reference is not the only thing modified over and over. We also allocate a buffer with kmalloc, so this could dos the kernel due to memory exhaustion. Except...

When a function is called, return address is saved on the stack. Or in other words a chain of function calls increases stack usage. Kernel stacks are way smaller than userspace ones (recently 16KB on amd64 on Linux), so if our repeated kernel entry was to use the same stack, it would be very quickly exhausted, which must not happen.

What if our signal handler does not play like this and instead, say, kills itself? The kernel would have to unwind each such interrupted function and let it finish. While strictly speaking should be doable, it is a bad idea. By having all functions properly finish before the thread leaves the kernel, its internal state is well defined. Interruptions "respecting" lock boundaries don't compromise them directly, but make it hard/impossible to limit what kind of resources (references, memory) are allocated and complicate the code.