your code is bad and you should feel bad (I certainly do): November 2015

Saturday, November 21, 2015

kernel game #1

Syscalls accepting file descriptors fit the following scheme:

int
meh_syscall(int fd)
{
        struct file *fp;
        int error;

        fp = getfile(fd);
        if (fp == NULL)
                return (EBADF);
        error = do_meh(fp);
        putfile(fp);
        return (error);
}

That is, the passed fd is used to obtain a pointer to a struct file, which is then used to perform the actual operation.

getfile will increase the reference counter on the struct file, while putfile will decrease it. If the new value is 0, there is nobody else using the file and it can be freed.

This is important if there are multiple threads in the process. If the counter were not maintained, and one thread closed the file while another one had just obtained the pointer, the second thread would be left using freed memory.

However, if there is only one thread, what's the point of maintaining the counter? There is nobody to close the file from under us.

This optimisation is in fact in use on Linux. But their equivalent of getfile also reports whether a reference was obtained, and their equivalent of putfile uses that information to decide whether it has to drop one.
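To make the description concrete, here is a minimal userspace sketch of that scheme. Everything in it is made up for illustration (struct fileref, getfile_light, putfile_light, the toy struct proc) and it follows the simplified "thread count" framing of this post. It is not the Linux code: the counter is a plain int rather than an atomic, and freeing the file when the count hits 0 is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are real kernel APIs. */
struct file {
        int refcount;
};

struct proc {
        int threads;
        struct file *files[4];          /* toy file descriptor table */
};

struct fileref {
        struct file *fp;
        bool need_put;                  /* true only if a reference was taken */
};

static struct fileref
getfile_light(struct proc *p, int fd)
{
        struct fileref fr = { NULL, false };

        if (fd < 0 || fd >= 4 || p->files[fd] == NULL)
                return (fr);
        fr.fp = p->files[fd];
        if (p->threads > 1) {
                /* Somebody else could close the file, so pin it. */
                fr.fp->refcount++;
                fr.need_put = true;
        }
        return (fr);
}

static void
putfile_light(struct fileref fr)
{
        /* Act on what getfile_light recorded, not on the current state. */
        if (fr.need_put) {
                assert(fr.fp->refcount > 0);
                fr.fp->refcount--;
        }
}
```

A caller would wrap do_meh between getfile_light and putfile_light, exactly where getfile and putfile sit in the listing above.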

Why do they bother with that?

What would be wrong with the following approach: check if there is only 1 thread. If so, nobody can close the file from under us, therefore there is no need to take the reference. Also, this syscall does not create new threads. Then, after do_meh, we check the thread count again to see if we have to putfile. In other words:

int
meh_syscall(int fd)
{
        struct file *fp;
        int error;

        if (curproc->threads > 1)
                fp = getfile(fd);
        else
                /*
                 * just get the pointer,
                 * do not modify the reference
                 * counter
                 */
                fp = getfile_noref(fd);
        if (fp == NULL)
                return (EBADF);
        error = do_meh(fp);
        if (curproc->threads > 1)
                putfile(fp);
        return (error);
}

So, what's the bug?

Tuesday, November 3, 2015

a primitive to read data from userspace

As was outlined previously, a special primitive is needed to access userspace data safely. There are several highly specialized variants in both the Linux and FreeBSD kernels, but they all work on the same principle. The example below is taken from FreeBSD since the Linux equivalent is way more convoluted.

Let's reiterate, consider:

int val;

val = *some_userspace_pointer;
printf("%d\n", val);

If some_userspace_pointer contains garbage, for example, a page fault is going to occur. The page fault handler will conclude the fault cannot be satisfied. But there is no way to tell this code about the problem: it only reads the value and assumes the read succeeded.

What's needed is a function which will be able to actually detect the condition and return an error to the caller. With such a primitive in place the code becomes:

int val, error;

error = copyin(some_userspace_pointer, &val, sizeof(val));
if (error != 0)
        return (error);
printf("%d\n", val);

A super slow variant would lock the address space, ensure the relevant mappings are fine and only then do the read. That's a lot of work which is completely unnecessary in the common case.

Instead, the standard approach is to have a way to tell the page fault handler where to jump if the page fault cannot be serviced. The code at that place is supposed to clean up after the failed copy and return an error to the original caller.

In pseudo-code it would look like this:

int
copyin(void *from, void *to, size_t len)
{

        set_fault_handler(copyin_fault);
        if (len == 0)
                goto done_copyin;
        if (!fits_userspace(from, len))
                goto copyin_fault;
        memcpy(to, from, len);
done_copyin:
        set_fault_handler(0);
        return 0;
copyin_fault:
        set_fault_handler(0);
        return EFAULT;
}
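The same idea can be tried out from userspace, which may help with getting a feel for it. The sketch below is only an analogy and makes several assumptions: a POSIX system, SIGSEGV/SIGBUS standing in for the page fault which cannot be serviced, sigsetjmp/siglongjmp standing in for the jump to the fault label, and a made-up name copyin_model. It is not how any kernel implements copyin, and jumping out of a signal handler in the middle of memcpy is tolerable only because the sketch is single-threaded and simply abandons the copy.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

static sigjmp_buf onfault;                      /* where to go on a fault */
static volatile sig_atomic_t onfault_armed;     /* is a copy in progress? */

static void
fault_handler(int sig)
{
        if (onfault_armed)
                siglongjmp(onfault, 1);         /* "goto copyin_fault" */
        /*
         * Not our fault: restore the default action. The faulting
         * instruction re-runs and the process is killed as usual.
         */
        signal(sig, SIG_DFL);
}

int
copyin_model(const void *from, void *to, size_t len)
{
        struct sigaction sa, old_segv, old_bus;
        int error = 0;

        /* The destination plays the role of trusted kernel memory. */
        assert(to != NULL || len == 0);

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = fault_handler;
        sigemptyset(&sa.sa_mask);
        sigaction(SIGSEGV, &sa, &old_segv);
        sigaction(SIGBUS, &sa, &old_bus);

        if (sigsetjmp(onfault, 1) == 0) {
                onfault_armed = 1;              /* set_fault_handler(...) */
                memcpy(to, from, len);
        } else {
                error = EFAULT;                 /* arrived via the handler */
        }

        onfault_armed = 0;                      /* set_fault_handler(0) */
        sigaction(SIGSEGV, &old_segv, NULL);
        sigaction(SIGBUS, &old_bus, NULL);
        return (error);
}
```

Copying from a valid buffer returns 0, while copying from an unmapped address returns EFAULT instead of killing the process. As with the real thing, the destination may be partially filled in when an error is returned.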

Let's take a look at an actual implementation with straightforward assembly (copyin(9) from the FreeBSD tree):

/*
* copyin(from_user, to_kernel, len) - MP SAFE
*        %rdi,      %rsi,      %rdx
*/
ENTRY(copyin)
        PUSH_FRAME_POINTER
        movq    PCPU(CURPCB),%rax
        movq    $copyin_fault,PCB_ONFAULT(%rax)

The handler is first set...

        testq   %rdx,%rdx                       /* anything to do? */
        jz      done_copyin

        /*
         * make sure address is valid
         */
        movq    %rdi,%rax
        addq    %rdx,%rax
        jc      copyin_fault
        movq    $VM_MAXUSER_ADDRESS,%rcx
        cmpq    %rcx,%rax
        ja      copyin_fault

... the range is then validated ...

        xchgq   %rdi,%rsi
        movq    %rdx,%rcx
        movb    %cl,%al
        shrq    $3,%rcx                         /* copy longword-wise */
        cld
        rep
        movsq
        movb    %al,%cl
        andb    $7,%cl                          /* copy remaining bytes */
        rep
        movsb

... and finally the copy is actually done. In the event of a page fault which cannot be satisfied, the kernel will jump to the copyin_fault label, which unsets the handler and returns an error, effectively cleaning up after the function. The target buffer may now contain partially copied data, but that's an acceptable state: if the syscall failed, the buffer content is not specified. Finally, if a page fault could be serviced without an issue (e.g. a page was swapped in) or there were no page faults, copying finishes and the code falls through to the block below, which unsets the handler and returns 0.

done_copyin:
        xorl    %eax,%eax
        movq    PCPU(CURPCB),%rdx
        movq    %rax,PCB_ONFAULT(%rdx)
        POP_FRAME_POINTER
        ret

        ALIGN_TEXT
copyin_fault:
        movq    PCPU(CURPCB),%rdx
        movq    $0,PCB_ONFAULT(%rdx)
        movq    $EFAULT,%rax
        POP_FRAME_POINTER
        ret
END(copyin)
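For readers who do not speak assembly, the two arithmetic steps of the listing above can be restated in C. This is only a sketch: MODEL_MAXUSER_ADDRESS is an assumed stand-in for VM_MAXUSER_ADDRESS (the value is the userspace ceiling commonly seen on amd64, check your tree), and fits_userspace and split_copy are illustrative names, the former borrowed from the pseudo-code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for illustration, not taken from any header. */
#define MODEL_MAXUSER_ADDRESS   UINT64_C(0x0000800000000000)

static bool
fits_userspace(uint64_t from, uint64_t len)
{
        uint64_t end = from + len;              /* addq %rdx,%rax */

        if (end < from)                         /* jc: the sum wrapped around */
                return (false);
        return (end <= MODEL_MAXUSER_ADDRESS);  /* ja copyin_fault otherwise */
}

static void
split_copy(size_t len, size_t *qwords, size_t *tail)
{
        *qwords = len >> 3;     /* shrq $3,%rcx; rep movsq: 8 bytes a step */
        *tail = len & 7;        /* andb $7,%cl; rep movsb: the leftovers */
        assert(*qwords * 8 + *tail == len);
}
```

The carry check matters: without it, a huge len could wrap the sum around to a small value and sneak a kernel address past the comparison.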

Monday, November 2, 2015

the kernel vs userspace arguments

Plenty of syscalls (e.g. open(2)) write to or read from userspace memory using dedicated primitives and maintain a local copy. Why not just deal with it like with regular kernel memory? As outlined in one of the previous posts, mere access should work.

The passed address may belong to kernelspace, so it has to be validated. But let's say we already did that.

Consider a toy syscall:

int
sys_meh(const char *name, int value)
{
        if (!is_root()) {
                if (strcmp(name, "special") == 0)
                        return -EPERM;
        }

        spin_lock(&meh_lock);
        meh_modify(name, value);
        spin_unlock(&meh_lock);
        return 0;
}

Here we accept a name and a value, but only root is allowed to modify the object identified as special.

The first access is at line 5 (the strcmp call). What if the passed address is garbage? The read will trigger a page fault and with no way to communicate the problem to strcmp, the kernel is forced to oops/panic.

So let's say the address is not garbage.

The name is read twice: by sys_meh itself and later by meh_modify. Or in other words, the code relies on the value not changing. Is the expectation met? No. For instance there can be a second thread which will try to modify the string after strcmp is done, but before meh_modify is called. This would in effect circumvent the protection we had in place.

Here the situation is even worse. By the time the code reaches meh_modify, the kernel could have decided to evict the page backing the string. On access a page fault will occur and the kernel will try to bring it in. But it took a spinlock, which means it is illegal to service a page fault due to deadlock potential.

In situations like this the standard way is to store relevant data in a temporary buffer.
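A minimal userspace sketch of that approach follows. All names in it (fetch_name, sys_meh_model, MEH_NAME_MAX, the meh_last_* variables) are hypothetical, and fetch_name only stands in for a real copyin-style primitive: an actual kernel version would also have to survive faults while reading. The point being illustrated is merely the ordering: fetch once, check the private copy, act on the private copy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MEH_NAME_MAX    32      /* hypothetical limit on the name length */

/* Stand-ins for the object meh_modify would change. */
static char meh_last_name[MEH_NAME_MAX];
static int meh_last_value;

/*
 * Stand-in for a copyin-style string fetch: reads each source byte
 * exactly once and fails instead of truncating.
 */
static int
fetch_name(const char *uname, char *kbuf, size_t kbufsize)
{
        size_t i;

        assert(kbufsize > 0);
        for (i = 0; i < kbufsize; i++) {
                kbuf[i] = uname[i];
                if (kbuf[i] == '\0')
                        return (0);
        }
        return (-ENAMETOOLONG);
}

static int
sys_meh_model(const char *uname, int value, int is_root)
{
        char kname[MEH_NAME_MAX];
        int error;

        /* One fetch, done before any lock is taken. */
        error = fetch_name(uname, kname, sizeof(kname));
        if (error != 0)
                return (error);
        if (!is_root && strcmp(kname, "special") == 0)
                return (-EPERM);

        /* The spinlock would be held here; only kname is touched. */
        strcpy(meh_last_name, kname);
        meh_last_value = value;
        return (0);
}
```

Since the check and the modification both look at kname, another thread scribbling over the original string can no longer change the outcome, and nothing under the lock can page fault on user memory.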

This caused serious trouble when various security-oriented syscall wrappers were implemented. For instance, code trying to restrict file access by monitoring filenames had the exact same bug as sys_meh above (but it could also be circumvented in a myriad of other ways, including symlinks). Interested parties are invited to read Exploiting Concurrency Vulnerabilities in System Call Wrappers.

the kernel vs NULL pointer dereference

FreeBSD, Linux and plenty of other kernels deny userspace requests to mmap pages at address 0 as a rudimentary hardening measure. This guarantees the kernel catches its own "NULL pointer dereferences" and in turn lets it panic/oops. This turns a guaranteed privilege escalation vector into a local denial of service.

Let's see what exactly is going on here.

We will focus on amd64, but conceptually this is also true for i386 and likely several other architectures which have the address space shared between the kernel and userspace. If said spaces are disjoint, the description below does not apply.

The address space looks roughly like this:

+---------------+ 0xffffffffffffefff
| the kernel    |
|(in some areas)|
+---------------+ 0xffff800000000000
| address space |
|    hole       |
+---------------+ 0x800000000000
| userspace     |
|               |
+---------------+ 0x0

The hole covers addresses which cannot be accessed on this architecture. The outlined userspace and kernel placement is the de facto standard. Note that both are mapped in the same address space. The spaces can be split in principle, but are not, for performance reasons.
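The boundaries of the hole follow a simple rule, sketched below under the assumption of 48-bit virtual addresses (which is what the diagram shows): an address is usable only if bits 63 through 47 are all zeroes or all ones. is_canonical is an illustrative name, not something taken from either kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumes 48-bit virtual addresses. Bits 63..47 (17 bits in total)
 * must all be equal, otherwise the address falls into the hole.
 */
static bool
is_canonical(uint64_t addr)
{
        uint64_t top = addr >> 47;

        return (top == 0 || top == 0x1ffff);
}
```

With this rule the last usable address in the lower half is 0x00007fffffffffff and the first usable one in the upper half is 0xffff800000000000, matching the two edges of the hole.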

These addresses are virtual. Actual physical memory pages may or may not be backing them up. The size of a page varies: it can be 4KB, 2MB or 1GB.

Let's say an address 0xc0ffee belongs to an area mmapped with read and write permissions, and backed by a physical page at this very moment. When a thread enters the kernel (to e.g. execute a system call), the in-kernel code will be able to read and write said memory without any special measures.

Userspace can request arbitrary addresses with calls to mmap(2). Normally the kernel will provide whatever address it wants, but this can be changed by passing the MAP_FIXED flag. As such, userspace can request to map a page at address 0.

Pedantry aside, void *p = NULL; means that p consists of all zero bits, i.e. it holds address 0.

To sum this up:

p = mmap(NULL, 4096, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
p->val = 8;

Provided the kernel grants the request, this will effectively dereference a NULL pointer.
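Whether a given system grants such a request can be checked with a few lines of C. The sketch below assumes a POSIX-like system which provides MAP_ANONYMOUS; try_map_zero is a made-up helper name. It only asks for the mapping and gives it back, it does not write through the pointer.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Returns 0 if the kernel granted a mapping at address 0,
 * otherwise the errno value reported by mmap.
 */
static int
try_map_zero(void)
{
        void *p;

        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
        if (p == MAP_FAILED)
                return (errno);
        /* MAP_FIXED either fails or returns exactly what was asked for. */
        assert(p == NULL);
        munmap(p, 4096);
        return (0);
}
```

On a system with the hardening described here, an unprivileged caller should get a non-zero error code back; the exact value differs between kernels.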

But most importantly, should the kernel try to access such an address itself, it will now succeed. Why would it do that? Of course due to a bug. NULL is often the default value of pointers in various structures, so e.g. code which forgets to NULL check a field which can legitimately be NULL would be susceptible. There are plenty of real-world bugs which manifest themselves like this.

How to use this to escalate privileges? That depends on the bug. Let's take the most blatant issue: a pointer to a function is NULL, but the code calls it. Userspace could mmap the page (with execute permissions) at 0 and fill it with whatever code it wants. When the bug is encountered, the kernel unknowingly starts executing the code planted by userspace.
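The shape of such a bug, and of the check which was forgotten, can be shown with a toy operations table. All names below (struct meh_ops, meh_sendpage, double_it) are hypothetical and the code is not taken from any kernel; it merely illustrates an optional function pointer being tested before the call.

```c
#include <errno.h>
#include <stddef.h>

/* A table of optional operations; unset entries are NULL. */
struct meh_ops {
        int (*sendpage)(int arg);
};

/* Example implementation of the optional operation. */
static int
double_it(int arg)
{
        return (arg * 2);
}

static int
meh_sendpage(const struct meh_ops *ops, int arg)
{
        /*
         * The buggy variant skips this test and jumps to address 0,
         * which is exactly where userspace would plant its code.
         */
        if (ops->sendpage == NULL)
                return (-EOPNOTSUPP);
        return (ops->sendpage(arg));
}
```

With the test in place a table without the operation yields an error, while a table pointing at double_it behaves as usual.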

Here is an example: CVE-2009-2692, a Linux NULL pointer dereference due to incorrect proto_ops initializations.

While mappings at 0 are denied, there are other, less frequent bugs which can result in the kernel unknowingly accessing userspace memory. A general solution with dedicated CPU support consists of SMAP (Supervisor Mode Access Prevention) and SMEP (Supervisor Mode Execution Prevention), but note these technologies are relatively new (read: your machines likely don't have them). Finally, a software-based implementation was provided with grsec.