Monday, December 12, 2016

bad kernel tutorials for beginners

There is an abundance of materials for beginners, which supposedly explain how to add your own syscall, hijack an existing one or maybe add a device driver to the linux kernel. Unfortunately presented code is extremely wrong and shows fundamental lack of understanding of the problem domain and basic programming practices. In this post I explain how incorrect approach resulted in missing out on critical factors, in effect giving wrong code, why the code may appear to work even though it is wrong and how to not make these mistakes. To restate, the code is not just buggy or maybe in the need of some polishing. It is fundamentally wrong. In particular complete disregard (or maybe lack of awareness) of concurrency makes the code harmful to beginners.

One common toy syscall boils down to walking the list of processes and printing things like their names, pids of their parents etc. A variation would dereference a pointer stored in a process.

First we will outline several important concepts. Then we will examine a simpler example which popped up, where the target process is found based on pid and then has few fields accessed. Finally, we will look at the more popular and more complicated case.

Big problem with these examples is that they "work" for the most part.

Let's start with the supposedly obvious: just because something appears to work, does not mean it is correct. Maybe the result you are getting is not the correct one, even if you think it is. Maybe the result is correct, but it depends on circumstances which hold by accident and will stop holding after a reboot or on a different machine. Maybe they will stop holding in a newer version. Finally, perhaps you are just not testing it well enough.

In a singlethreaded program you can for the most part access objects as you please. If you open a file, it stays open until you close. Any variable you read has nobody else to suddenly modify it. And there is nobody to free the buffer you are reading. While this is oversimplified, it is true enough for the purpose of this blogpost.

In a multithreaded program, like the linux kernel, things are much different. Threads share the address space. They all see the same memory. There will be things modifiable by everyone under appropriate circumstances and things which no thread other than the currently executing should change. All code must constantly follow the rules or a risk of bad things happening will be introduced.

One common method of synchronising access is known as locking. There will be special code, say lock(lock_t *) and unlock(lock_t *) such that when multiple threads execute lock(foo), only one gets to proceed and the rest waits for it to call unlock(foo). Then some other thread gets to proceed and so on.

If the code always modifies a buffer by doing lock(meh_buffer_lock); strcpy(meh_buffer, src); unlock(meh_buffer_lock); and reads while also holding the lock, everyone gets a stable state. But if new code is implemented which writes to the buffer without taking the lock, it can be writing along with another thread, corrupting the state.

In short, not following the rules (here: only accessing meh_buffer while holding meh_buffer_lock), results in code which can generate random failures.

Another thing to look at is how to implement things in the first place. Someone may be tempted to look at a struct definition, find fields they are interested in and just use them. This has a high chance of disregarding the rules (see above), but even if said rules are followed the result can be just wrong. It may be there are additional conditions which decide how to interpret the field, or even to ignore the value as present in the object and return something else. The thing to do is to find a code already accessing given field. Chances are high the field must not be accessed directly and instead there is a dedicated helper function which needs to be executed.

Problem descriptions may seem exaggerated, but they are not. Said examples are really extremely buggy on nearly each line.

With this in mind, let's take a look at the first sample adapted from a question from a site I'm not going to link to:
asmlinkage long procinfo(pid_t pid, struct info *pi)
    struct task_struct *task = pid_task(find_vpid(pid), PIDTYPE_PID);
    pi->pid = task->pid; 
    strcpy(pi->comm, task->comm);
    return 0; 

So what's wrong with it?

asmlinkage long procinfo(pid_t pid, struct info *pi)
    struct task_struct *task = pid_task(find_vpid(pid), PIDTYPE_PID);

We take a pid from userspace and use kernel primitives to translate that to a pointer to task_struct, so that we can inspect the task. But threads can be created and destroyed (and freed) at any time. What prevents the found thread from getting freed as we access it later in the function? What makes the process of finding the thread safe in the first place? Inspecting already existing code doing the lookup (e.g. the kill(2) syscall) will reveal that rules were not followed here and indeed the found thread can be freed at any time.

A very unfortunate property is that this code will not crash under light load as it will be extremely unlikely to get a pointer and have the thread get freed before the function finishes accessing it. Worse, even a freed thread does not guarantee a crash as the memory itself can remain mapped and have sane content. It is almost impossible this will crash when run on a laptop or in a vm doing nothing.

    pi->pid = task->pid;

First note is that it may be there is no thread found and task is NULL and the code fails to check for it.

A more important note is that pi is an address provided by userspace. Is not it odd to just access it without any validation and without any special measures?  Such an access is a security and reliability threat. Consider what happens when the user passes an address of something within the kernel or which is not mapped. Also it will straight up not work in  several cases (e.g. SMAP or architectures which have the kernel address space split from userspace). Instead, a dedicated primitive like put_user or copy_to_user has to be used.

Testing performed on x86 not passing bogus addresses is not going to run into any trouble here either, unless the cpu happens to support SMAP.

    strcpy(pi->comm, task->comm);

Grepping around will reveal that accessing the comm field requires the task to be locked. Also there is a dedicated primitive copying the name to the passed buffer. But note just as pi is provided by userspace, pi->comm also is by extension so a dedicated kernel-friendly buffer should be created first and that copy_to_user'ed.

Now the more common example:

asmlinkage long procdump(void)
    struct task_struct *p;
    for_each_process(p) {
        printk("pid %d name %s mm mapcount %d parent pid %d\n", p->pid,
            p->comm, p->mm ? p->mm->map_count : 0, p->parent->pid); 

I'll comment on each printk component separately.

asmlinkage long procdump(void)
    struct task_struct *p;
    for_each_process(p) {

So we are going to iterate over all processes. But is it safe to just use the macro? Processes can exit any time (and get freed). Indeed, looking at already existing code using the macro we will find that the rules of executing it are not followed here.

The rest ignores the use-after-free problem of the found pointer.

    "pid %d", p->pid

There is a minor issue that the printk invocation is wrong.  Looking around quickly reveals that the format string must be prefixed with a log level, e.g. KERN_INFO.

The pid read is fine - the pid is constant for the life duration of the object.

    "name %s", p->comm;

Access to the comm field was covered earlier.

    "mm mapcount %d\n", p->mm ? p->mm->map_count : 0

Whoa. p->mm can be NULL. If it is not, read map_count. Otherwise use 0.

Back to basics: checking how mm is accessed reveals a helper - get_task_mm. From the helper we see that not only the task has to be locked, it may be mm will be set for a kernel thread and for such a case it has to be ignored. The object is returned with a reference counter increased to prevent it from being freed. None of these considerations were taken into account here. Further looking around will reveal that an exiting thread will NULLify the mm field while holding the lock.

That is, since no lock is taken here, we can first see that mm is set and dereference the now-NULL pointer with mm->map_count. And even if the pointer was not NULL, chances are the target object just got freed.

    "parent pid %d", p->parent->pid

All processes have a parent. If the parent exits, the process is reparented to init. But if the parent exits, chances are its task_struct is freed. No steps were taken to ensure the access is safe. Interestingly, this is not even the field which has the information required here. Inspecting code used by the kernel to get the parent pid will reveal that real_parent is accessed instead. parent can be modified by a special syscall ptrace(2), used by e.g. strace and gdb. That is, this code can obtain pid of the strace (or gdb, or whatever) process as opposed to the parent it is expecting.

And this is yet another example where things appear to work because of insufficient testing. Not only you are extremely unlikely to trigger use-after-free with the parent, since most of the time you don't ptrace, the parent pointer swap will likely not be seen by this code.

In all cases following the standard programming practice of checking how existing code deals with a certain scenario (e.g. accessing certain field) would prevent most of the caveats from being a factor.

Someone interested in writing a version of the syscall which does work correctly can start with stracing ps and going from there. Note though that correctly implemented has only educational value and no practical use.

Let's sum up and emphasize crucial points.

1. A reasonable programmer always checks how existing code accomplishes things.

Consider the parent vs real_parent problem once more.

Let's say they want to get the parent pid in the kernel would, but have absolutely no idea where the an example code is. They can start with a userspace program obtaining said information -- ps. Strace on ps will reveal how the information is obtained and from the kernel and there it is easy to dive into kernel-side implementation of the feature. And there they see real_parent, not parent. The problem automatically avoided by checking existing code. What about safe access to the field? Again, the code in question already has to do it, so conditions should be seen in the vicinity.

Consider the issue of putting data to userspace, the pi->pid assignment from earlier.  I said it should have used put_user or copy_to_user.

A programmer needing the feature would find any syscall putting stuff to userspace. Trivial example: stat(2) takes a pointer as the second argument and fills the target buffer. Checking out the implementation reveals the buffer is written out using the copy_to_user primitive.

None of this was applied when writing said tutorials.

2. Just because you found something on the internet, does not mean it is true. In fact the more low-level given subject is, the more likely it is that a random website is flat out wrong. And yes, this is a random website.

3. If you are playing with the kernel out of your own curiosity, that's great. But if you were not able to spot why the code as presented had to be wrong, I would argue you are not ready yet. Take a step back, stick to userspace for the time being. The code in there is easier to develop and debug and you don't have to deal with threading until you are comfortable with it.

Thursday, December 1, 2016

will the operating system clean up after process exit

From time to time there are people asking "if I allocate memory and exit without freeing it, will the kernel clean up?". I recently encountered a variant with file descriptors and getting rid of the file after exit without explicit close.

While the answer for most cases on systems like Linux or *BSDs is a firm yes, nobody elaborates on the issue.

In short this has to be done for reliability reasons - without it malicious users and buggy programs would crash the system constantly due to resource exhaustion.

A hard requirement for this to be doable is that the kernel needs a way to iterate over the resources allocated for given process. In this post we will see why the ability is forced by user-visible functionality regardless of reliability concerns.

We will take file descriptors as a simple example.

Files when open are identifiable by so-called "file descriptor" numbers. That is, the kernel returns an integer which then can be used to do things with the file. Said integeres are private to the process.

An example program can:

fd = open("crap", O_RDONLY);
read(fd, buf, sizeof(buf));

Somewhere in memory there is an object which represents the file and there must be a way to go from the fd number to the address of the object for this work in the first place.

There is a reliability concern here as well: what if the process passes an incorrect value? Say it has file descriptors 0, 1, 2 and passes 7563. The kernel must not crash nor return junk.

There is also an important function: fork(2). It will create a new process which has certain things copied over from the caller. This includes file descriptors. If our process has 0, 1, 2 open, the child will also have 0, 1, 2 open and they will "point" to the same files.

But if the new process has to have the same file descriptors, the kernel already needs a way to iterate over descriptors installed for the old one.

And there you go - all open file descriptors are explicitly known, so there is no problem iterating over them on process exit.

Memory handling has similar reasoning.

As a final note, most real-world programs already exit without closing some file descriptors or freeing up all of the memory allocated.