Monday, December 12, 2016

bad kernel tutorials for beginners

There is an abundance of material for beginners which supposedly explains how to add your own syscall, hijack an existing one or maybe add a device driver to the linux kernel. Unfortunately, the presented code is extremely wrong and shows a fundamental lack of understanding of the problem domain and of basic programming practices. In this post I explain how an incorrect approach resulted in missing out on critical factors, in effect producing wrong code, why the code may appear to work even though it is wrong, and how not to make these mistakes. To restate, the code is not just buggy or maybe in need of some polishing. It is fundamentally wrong. In particular, the complete disregard (or maybe lack of awareness) of concurrency makes the code harmful to beginners.

One common toy syscall boils down to walking the list of processes and printing things like their names, pids of their parents etc. A variation would dereference a pointer stored in a process.

First we will outline several important concepts. Then we will examine a simpler example which popped up, where the target process is found based on its pid and then has a few fields accessed. Finally, we will look at the more popular and more complicated case.

A big problem with these examples is that they "work" for the most part.

Let's start with the supposedly obvious: just because something appears to work, does not mean it is correct. Maybe the result you are getting is not the correct one, even if you think it is. Maybe the result is correct, but it depends on circumstances which hold by accident and will stop holding after a reboot or on a different machine. Maybe they will stop holding in a newer version. Finally, perhaps you are just not testing it well enough.

In a singlethreaded program you can, for the most part, access objects as you please. If you open a file, it stays open until you close it. Any variable you read has nobody else to suddenly modify it. And there is nobody to free the buffer you are reading. While this is oversimplified, it is true enough for the purpose of this blogpost.

In a multithreaded program, like the linux kernel, things are much different. Threads share the address space. They all see the same memory. There will be things modifiable by everyone under appropriate circumstances and things which no thread other than the currently executing should change. All code must constantly follow the rules or a risk of bad things happening will be introduced.

One common method of synchronising access is known as locking. There will be special code, say lock(lock_t *) and unlock(lock_t *), such that when multiple threads execute lock(foo), only one gets to proceed and the rest wait for it to call unlock(foo). Then some other thread gets to proceed, and so on.

If the code always modifies the buffer by doing lock(meh_buffer_lock); strcpy(meh_buffer, src); unlock(meh_buffer_lock); and reads it while also holding the lock, everyone sees a stable state. But if new code is introduced which writes to the buffer without taking the lock, it can end up writing at the same time as another thread, corrupting the state.

In short, not following the rules (here: only accessing meh_buffer while holding meh_buffer_lock) results in code which can generate random failures.
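To make the rule concrete, here is a minimal sketch. lock_t, lock and unlock are the hypothetical primitives from above, standing in for whatever the real locking API is:

lock_t *meh_buffer_lock;
char meh_buffer[128];

void set_meh_buffer(const char *src)
{
        lock(meh_buffer_lock);
        strcpy(meh_buffer, src);        /* no other thread can touch the buffer now */
        unlock(meh_buffer_lock);
}

void get_meh_buffer(char *dst)
{
        lock(meh_buffer_lock);          /* readers take the lock as well */
        strcpy(dst, meh_buffer);
        unlock(meh_buffer_lock);
}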

Another thing to look at is how to implement things in the first place. Someone may be tempted to look at a struct definition, find the fields they are interested in and just use them. This has a high chance of disregarding the rules (see above), but even if said rules are followed the result can still be wrong. It may be that there are additional conditions which decide how to interpret the field, or even whether to ignore the value as present in the object and return something else. The thing to do is to find code already accessing the given field. Chances are high the field must not be accessed directly and instead there is a dedicated helper function which needs to be used.

The problem descriptions may seem exaggerated, but they are not. Said examples are really extremely buggy on nearly every line.

With this in mind, let's take a look at the first sample adapted from a question from a site I'm not going to link to:
 
asmlinkage long procinfo(pid_t pid, struct info *pi)
{
    struct task_struct *task = pid_task(find_vpid(pid), PIDTYPE_PID);
    pi->pid = task->pid; 
    strcpy(pi->comm, task->comm);
    return 0; 
}

So what's wrong with it?

asmlinkage long procinfo(pid_t pid, struct info *pi)
{
    struct task_struct *task = pid_task(find_vpid(pid), PIDTYPE_PID);

We take a pid from userspace and use kernel primitives to translate that to a pointer to task_struct, so that we can inspect the task. But threads can be created and destroyed (and freed) at any time. What prevents the found thread from getting freed as we access it later in the function? What makes the process of finding the thread safe in the first place? Inspecting already existing code doing the lookup (e.g. the kill(2) syscall) will reveal that rules were not followed here and indeed the found thread can be freed at any time.

A very unfortunate property is that this code will not crash under light load, as it is extremely unlikely to obtain the pointer and have the thread freed before the function finishes accessing it. Worse, even a freed thread does not guarantee a crash, as the memory itself can remain mapped and contain sane content. It is almost impossible that this will crash when run on a laptop or in a vm doing nothing.
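For comparison, here is a rough sketch of what a safer lookup could look like, following the pattern used by existing callers such as kill(2). Details vary across kernel versions, so treat it as an illustration rather than a reference:

    struct task_struct *task;

    rcu_read_lock();
    task = pid_task(find_vpid(pid), PIDTYPE_PID);
    if (task)
        get_task_struct(task);  /* take a reference so the task cannot be freed */
    rcu_read_unlock();

    if (task == NULL)
        return -ESRCH;

    /* ... inspect the task here ... */

    put_task_struct(task);      /* drop the reference once done */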

    pi->pid = task->pid;

The first note is that there may be no thread found, in which case task is NULL, and the code fails to check for that.

A more important note is that pi is an address provided by userspace. Isn't it odd to just access it without any validation and without any special measures? Such an access is a security and reliability threat. Consider what happens when the user passes an address of something within the kernel, or one which is not mapped at all. It will also straight up not work in several cases (e.g. with SMAP, or on architectures which have the kernel address space split from userspace). Instead, a dedicated primitive like put_user or copy_to_user has to be used.
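For instance, the pid store could be written like this (a sketch only; it assumes the NULL check mentioned above was already done):

    if (put_user(task->pid, &pi->pid))
        return -EFAULT;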

Testing performed on x86, without passing bogus addresses, is not going to run into any trouble here either, unless the cpu happens to support SMAP.

    strcpy(pi->comm, task->comm);

Grepping around will reveal that accessing the comm field requires the task to be locked. There is also a dedicated primitive copying the name to a passed buffer. But note that just as pi is provided by userspace, so is pi->comm by extension, so a dedicated kernel-side buffer should be filled first and then copy_to_user'ed.
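A sketch of what this part could look like, using get_task_comm (which takes the task lock internally) and then copying the kernel-side buffer out:

    char comm[TASK_COMM_LEN];

    get_task_comm(comm, task);
    if (copy_to_user(pi->comm, comm, sizeof(comm)))
        return -EFAULT;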

Now the more common example:

asmlinkage long procdump(void)
{
    struct task_struct *p;
    for_each_process(p) {
        printk("pid %d name %s mm mapcount %d parent pid %d\n", p->pid,
            p->comm, p->mm ? p->mm->map_count : 0, p->parent->pid); 
    }
}

I'll comment on each printk component separately.

asmlinkage long procdump(void)
{
    struct task_struct *p;
    for_each_process(p) {

So we are going to iterate over all processes. But is it safe to just use the macro? Processes can exit any time (and get freed). Indeed, looking at already existing code using the macro we will find that the rules of executing it are not followed here.
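For reference, existing users wrap the walk with either tasklist_lock or the RCU read lock; a minimal sketch of the latter (which only makes it safe to look at the task inside the section, not to keep the pointer afterwards):

    rcu_read_lock();
    for_each_process(p) {
        /* p may be inspected here, but must not be used after the unlock
           and the code must not sleep inside this section */
    }
    rcu_read_unlock();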

The rest ignores the use-after-free problem of the found pointer.

    "pid %d", p->pid


There is a minor issue that the printk invocation is wrong.  Looking around quickly reveals that the format string must be prefixed with a log level, e.g. KERN_INFO.

The pid read is fine - the pid is constant for the life duration of the object.

    "name %s", p->comm;


Access to the comm field was covered earlier.

    "mm mapcount %d\n", p->mm ? p->mm->map_count : 0

Whoa. p->mm can be NULL. If it is not, read map_count. Otherwise use 0.

Back to basics: checking how mm is accessed reveals a helper - get_task_mm. From the helper we see that not only does the task have to be locked, but mm may also be set for a kernel thread, in which case it has to be ignored. The object is returned with its reference counter increased, to prevent it from being freed. None of these considerations were taken into account here. Further looking around will reveal that an exiting thread NULLifies the mm field while holding the lock.

That is, since no lock is taken here, we can first see that mm is set and dereference the now-NULL pointer with mm->map_count. And even if the pointer was not NULL, chances are the target object just got freed.
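Putting the above together, a hedged sketch of how the field could be read via the helpers (locking details such as mmap_sem differ between kernel versions):

    struct mm_struct *mm;
    int map_count = 0;

    mm = get_task_mm(p);        /* NULL for kernel threads; takes a reference */
    if (mm) {
        down_read(&mm->mmap_sem);   /* map_count is protected by mmap_sem */
        map_count = mm->map_count;
        up_read(&mm->mmap_sem);
        mmput(mm);              /* drop the reference */
    }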

    "parent pid %d", p->parent->pid

All processes have a parent. If the parent exits, the process is reparented to init. But if the parent exits, chances are its task_struct is freed. No steps were taken to ensure the access is safe. Interestingly, this is not even the field which holds the information required here. Inspecting the code used by the kernel to get the parent pid will reveal that real_parent is accessed instead. parent can be modified by a special syscall, ptrace(2), used by e.g. strace and gdb. That is, this code can obtain the pid of the strace (or gdb, or whatever) process as opposed to the parent it is expecting.
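For comparison, the in-kernel getppid implementation boils down to roughly the following (a sketch from memory; consult the actual source):

    pid_t ppid;

    rcu_read_lock();
    ppid = task_tgid_vnr(rcu_dereference(p->real_parent));
    rcu_read_unlock();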

And this is yet another example where things appear to work because of insufficient testing. Not only are you extremely unlikely to trigger a use-after-free with the parent, but since most of the time you don't ptrace, the parent pointer swap will likely never be seen by this code.

In all cases following the standard programming practice of checking how existing code deals with a certain scenario (e.g. accessing certain field) would prevent most of the caveats from being a factor.

Someone interested in writing a version of the syscall which does work correctly can start by stracing ps and going from there. Note though that a correct implementation has only educational value and no practical use.

Let's sum up and emphasize crucial points.

1. A reasonable programmer always checks how existing code accomplishes things.

Consider the parent vs real_parent problem once more.

Let's say they want to get the parent pid in the kernel, but have absolutely no idea where an example of such code is. They can start with a userspace program obtaining said information -- ps. Stracing ps will reveal how the information is obtained from the kernel, and from there it is easy to dive into the kernel-side implementation of the feature. And there they see real_parent, not parent. The problem is automatically avoided by checking existing code. What about safe access to the field? Again, the code in question already has to deal with it, so the required conditions should be visible in its vicinity.

Consider the issue of putting data to userspace, the pi->pid assignment from earlier.  I said it should have used put_user or copy_to_user.

A programmer needing the feature would find any syscall putting stuff to userspace. Trivial example: stat(2) takes a pointer as the second argument and fills the target buffer. Checking out the implementation reveals the buffer is written out using the copy_to_user primitive.
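The usual pattern found there is to fill a kernel-side struct and copy it out in one go; applied to the struct info example from earlier, it could look roughly like this:

    struct info ki;

    memset(&ki, 0, sizeof(ki));
    ki.pid = task->pid;
    /* ... fill in the remaining fields from kernel-side copies ... */
    if (copy_to_user(pi, &ki, sizeof(ki)))
        return -EFAULT;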

None of this was applied when writing said tutorials.

2. Just because you found something on the internet, does not mean it is true. In fact the more low-level given subject is, the more likely it is that a random website is flat out wrong. And yes, this is a random website.

3. If you are playing with the kernel out of your own curiosity, that's great. But if you were not able to spot why the code as presented had to be wrong, I would argue you are not ready yet. Take a step back, stick to userspace for the time being. The code in there is easier to develop and debug and you don't have to deal with threading until you are comfortable with it.

Thursday, December 1, 2016

will the operating system clean up after process exit

From time to time there are people asking "if I allocate memory and exit without freeing it, will the kernel clean up?". I recently encountered a variant with file descriptors and getting rid of the file after exit without explicit close.

While the answer for most cases on systems like Linux or *BSDs is a firm yes, nobody elaborates on the issue.

In short this has to be done for reliability reasons - without it malicious users and buggy programs would crash the system constantly due to resource exhaustion.

A hard requirement for this to be doable is that the kernel needs a way to iterate over the resources allocated for given process. In this post we will see why the ability is forced by user-visible functionality regardless of reliability concerns.

We will take file descriptors as a simple example.

Files, when open, are identified by so-called "file descriptor" numbers. That is, the kernel returns an integer which can then be used to do things with the file. Said integers are private to the process.

An example program can:

int fd;
char buf[1024];

fd = open("crap", O_RDONLY);
read(fd, buf, sizeof(buf));
close(fd);

Somewhere in memory there is an object which represents the file, and there must be a way to go from the fd number to the address of that object for this to work in the first place.

There is a reliability concern here as well: what if the process passes an incorrect value? Say it has file descriptors 0, 1, 2 and passes 7563. The kernel must not crash nor return junk.
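Conceptually (this is just an illustration, not actual kernel code), the per-process descriptor table can be thought of as an array indexed by the fd number, with bogus values rejected before use:

struct file;                    /* the kernel object representing an open file */

struct fdtable {
        struct file **files;    /* files[fd] points to the object, or is NULL */
        int nfiles;             /* number of valid slots */
};

static struct file *
fd_to_file(struct fdtable *t, int fd)
{
        if (fd < 0 || fd >= t->nfiles || t->files[fd] == NULL)
                return NULL;    /* bogus values like 7563 end up here */
        return t->files[fd];
}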

There is also an important function: fork(2). It will create a new process which has certain things copied over from the caller. This includes file descriptors. If our process has 0, 1, 2 open, the child will also have 0, 1, 2 open and they will "point" to the same files.

But if the new process has to have the same file descriptors, the kernel already needs a way to iterate over descriptors installed for the old one.

And there you go - all open file descriptors are explicitly known, so there is no problem iterating over them on process exit.

Memory handling has similar reasoning.

As a final note, most real-world programs already exit without closing some file descriptors or freeing up all of the memory allocated.

Sunday, July 24, 2016

malloc+memset vs calloc

There is various advice out there suggesting the use of calloc, or of malloc followed by memset. Claims are made about a supposed performance difference between the two, followed by problematic explanations of what's going on, in particular involving the so-called "zero page".

The first part of the post contains advice on reasonable usage of these functions, while the second part explains what's going on, hopefully clearing up why calloc can appear to be faster than malloc + memset in some cases.

While the post is Linux and x86-64-centric, in principle it covers other systems and architectures. The main difference is that they may be lacking the "zero page".

A separate post will cover tooling which can be used to investigate claims below.

When to use what?

First of all, do you really need to allocate anything? If your buffer is up to a few kilobytes in size and you always free it on return from the function, then use the stack.

malloc returns a suitable buffer with an indeterminate content. calloc returns a suitable buffer filled with zeroes.

If you need to allocate something, use malloc.

If you need to zero out part of the buffer, memset that part.

If you need to have the buffer zeroed out entirely or in the most part, use calloc(1, size).

In particular:
1. DO NOT p = malloc(size); memset(p, 0, size);, use calloc instead. It is more concise and in certain cases will be faster.
2. DO NOT default to zeroing everything. Tools like valgrind or scan-build can detect use of uninitialized variables, but if you just zero out the entire struct you initialize everything to 0, which hides the problem.

What's going on?

There is code out there which boils down to allocating several megabytes of memory with either calloc or malloc + memset and comparing the time needed to complete. Benchmarks of this sort not only don't test the actual expense in a real-world program, where the allocated memory is later used, but don't necessarily find the actual cost of calloc itself in such a program either. calloc will sometimes be faster than malloc + memset. An example benchmark is included at the end.

Let's see what's up.

Processes have a so-called virtual address space, divided into pages, typically 4096 bytes in size. These areas can be backed by physical pages of the same size. A process can request memory with system calls like mmap or (s)brk.

malloc is a part of glibc and is running as a part of a process. It will request memory from the kernel as it sees fit. Once it obtains a page, it can e.g. divide it into chunks of size 16 bytes and use that to satisfy calls like malloc(16). That's a vastly simplified version, but sufficient as an illustration.

Once a new page is mapped, it is guaranteed to be zeroed. The reason is that the kernel has to get rid of any residual data from allocations made by different processes and zero is the nicest value to put in there.

For special cases, the kernel keeps a page which always contains zeroes and cannot be written to. We will call it ZERO_PAGE.

Asking the kernel for more memory does not mean it will allocate any pages. Instead, it will take a note of what was requested.

Later, if the process accesses the area, a page fault will occur. Then the kernel will see that the affected area was promised and will map a page in there. If the reason for the fault is a write request, it will get a standard zeroed page. If this is a read, it may decide to use ZERO_PAGE instead. In either case, the state visible to the process is a zeroed area and the entire business is transparent.

Let's walk through a malloc example. K is the kernel, P is the process, M and C are respectively malloc and calloc running within P:
  1. P does malloc(BIGNUM);
  2. M decides it has enough memory to satisfy the request and returns an appropriate buffer
Here the kernel was not involved. This is what typically happens in a long-running program.

How about a more convoluted example, typically happening at the start of the program:
  1. P does malloc(BIGNUM);
  2. M decides to grab more memory from the kernel and calls mmap
  3. K takes note and returns address ADDR. No physical memory is allocated at this stage.
  4. M picks an area from ADDR and returns BUFADDR
  5. P writes to BUFADDR which triggers a page fault
  6. K sees that the area fits what was returned from mmap and satisfies the request by mapping a zeroed page
What if calloc is used instead? Let's take a simplified example:
  1. P does calloc(1, BIGNUM);
  2. C decides it has enough memory to satisfy the request, zeroes a buffer and returns it
So the difference would be zeroing. The more complicated example with requesting more memory would differ in the same way.

But here is an observation: memory we got from the kernel is guaranteed to be zeroed, so if C knows for a fact that the particular buffer it wants to return was not used previously, it knows there is no need for zeroing. But this means the buffer is not touched, which may translate into memory page(s) not being touched, which in turn reduces the number of page faults.


And this is where the performance difference in toy benchmarks comes from. These toy benchmarks don't touch the memory.

To elaborate, in a toy benchmark:
  1. P does calloc(1, BIGNUM);
  2. C decides to grab more memory from the kernel and calls mmap
  3. K takes note and returns address ADDR. No physical memory is allocated at this stage.
  4. C picks an area from ADDR and picks BUFADDR. Seeing this is a new allocation, it skips zeroing
  5. ... memory is not touched, so we end here. The most expensive part is not executed.
In comparison, a real-world program will typically use the calloced memory and the buffer returned from calloc in a long-running program will most of the time have to be cleared as it is being reused.

To reiterate, calloc will give you savings in the following circumstances:
1. sometimes page(s) are not touched and both page fault and explicit zeroing by C are avoided. If memory happens to be accessed later, page fault will be taken, but we still get away without explicit zeroing.
2. sometimes, even if the page was allocated, we know the buffer was never used and can avoid zeroing

What about ZERO_PAGE? It is NOT used as a result of mmap. It can be used later if there is a page fault from an attempt to read. In particular, ZERO_PAGE is NOT USED in the trivial benchmark just callocing a lot of memory.

The following should give a rough idea what happens:
# malloc + memset
$ time ./mtest m
./mtest m  0.18s user 1.94s system 95% cpu 2.227 total

# calloc
$ time ./mtest c
./mtest c  0.02s user 0.19s system 83% cpu 0.260 total
# calloc + read a byte from each page

$ time ./mtest C
./mtest C  0.11s user 0.41s system 95% cpu 0.540 total
# calloc + write a byte to each page

$ time ./mtest V
./mtest V  0.12s user 1.95s system 92% cpu 2.224 total

The reader will forgive the lack of statistical validity of these measurements. They are fine enough for this blogpost.

In particular, we see that malloc + memset resulting in touching all the pages is roughly equivalent to calloc + touching all the pages. Just doing calloc is definitely the fastest and that's because it maps almost nothing. Doing a read from each page triggers a page fault and maps ZERO_PAGE for us, which is definitely faster than having to come up with a new page.

Also, let's see what happens if we repeatedly alloc and free with either malloc + memset or calloc. The allocator will reuse the buffer. But then calloc has to zero it out. And indeed:

$ time ./use-malloc
./use-malloc  6.21s user 0.00s system 99% cpu 6.222 total

$ time ./use-calloc
./use-calloc  6.17s user 0.00s system 99% cpu 6.176 total


As expected, the performance is about the same.

To sum up:
- stick to the advice given at the beginning
- claims about calloc being faster without further statements about conditions where it is true are misleading at best
- if calloc is faster, it is typically due to the fact it sees it can get away without zeroing the buffer, which can give additional speed up if pages used for the allocation end up not being accessed
- benchmarks of memory allocation without accessing the memory later don't show the actual cost

mtest.c:
#include <err.h>
#include <stdlib.h>
#include <string.h>

#define ALLOC   (256 * 1024 * 1024)
#define NBUFS   10

static void
handle_malloc(char **bufs, int nbufs, size_t size)
{
        char *p;
        int i;

        for (i = 0; i < nbufs; i++) {
                p = malloc(size);
                if (p == NULL)
                        err(1, "malloc");
                memset(p, 0, size);
                bufs[i] = p;
        }
}

static void
handle_calloc(char **bufs, int nbufs, size_t size)
{
        char *p;
        int i;

        for (i = 0; i < nbufs; i++) {
                p = calloc(1, size);
                if (p == NULL)
                        err(1, "calloc");
                bufs[i] = p;
        }
}

static void
touch_read(char **bufs, int nbufs, size_t size)
{
        int i;
        size_t j;
        unsigned char crap = 0; /* a meh way to force a read */

        for (i = 0; i < nbufs; i++) {
                for (j = 0; j < size; j += 4096)
                        crap += bufs[i][j];
        }
}

static void
touch_write(char **bufs, int nbufs, size_t size)
{
        int i;
        size_t j;

        for (i = 0; i < nbufs; i++) {
                for (j = 0; j < size; j += 4096)
                        bufs[i][j] = 1;
        }
}

int
main(int argc, char **argv)
{
        char *bufs[NBUFS];

        if (argc != 2 || argv[1][1] != 0)
                return 1;

        switch (argv[1][0]) {
        case 'm':
                handle_malloc(bufs, NBUFS, ALLOC);
                break;
        case 'M':
                handle_malloc(bufs, NBUFS, ALLOC);
                break;
        case 'c':
                handle_calloc(bufs, NBUFS, ALLOC);
                break;
        case 'C':
                handle_calloc(bufs, NBUFS, ALLOC);
                touch_read(bufs, NBUFS, ALLOC);
                break;
        case 'V':
                handle_calloc(bufs, NBUFS, ALLOC);
                touch_write(bufs, NBUFS, ALLOC);
                break;
        default:
                errx(1, "wrong opt");
        }      
       
        return 0;
}


malloc-calloc.c:
#include <stdlib.h>
#include <string.h>

#define    ALLOC    (1024 * 1024)

int
main(void)
{
    int i;
    void *p;

    for (i = 0; i < 100000; i++) {
#ifdef USE_MALLOC
        p = malloc(ALLOC);

        memset(p, 0, ALLOC);
#else
        p = calloc(1, ALLOC);
#endif
        free(p);
    }

}


Saturday, July 9, 2016

linux sysadmin interview questions

Most of these questions are known, for which I somewhat apologize. The purpose of this post is to have a collection of questions I consider to be of reasonable to high quality.

Questions are provided in a shortened form in order to reduce repetition. The interviewer is supposed to press the answer with "why", "does it have to be true/false", "what does this depend on" and the like as applicable. The candidate must effectively be told to explain their understanding of the issue at hand. These are not "yes or no" questions.

I'll start off with my own question (I realize that most likely someone else came up with an equivalent):

1. A classic one-liner is of the form: ps auxw | grep foo. Can grep foo appear in the result?

An excellent opportunity to discuss pipes + fork + exec + races.

And now stuff you likely heard at least once:

2. Is cd a shell built-in?

3. A particular server has 20 zombie processes. What do you do?

4. A junior sysadmin says they ran ps and it appears to have hanged and cannot be killed with ctrl+c. You have a ssh session on the affected machine. Diagnose.

ps can really hang due to a multitude of reasons. The (real world!) scenario I like is hang on /proc/<pid>/cmdline read. Note that candidate's ps is also expected to hang.

5. You see a process using up 100% CPU in top. Diagnose.

It can be e.g. an in-house program which stats the same non-existing file for some reason.

6. load average is 500. Diagnose.

Suggested classic is a hung nfs server.

7. Running any binary appears to hang. You got one last operational ssh session. Diagnose.

Only shell built-ins can be used.

8. You are given a binary. How do you find out what it does?

9. Why is suid not honoured when executing scripts?

10. ping crap.org

The classic. If you don't want network talk, I suggest uname.

11. You run strace -p <valid pid> and it does not give you any output. What now?

Monday, June 27, 2016

when the kernel can kill a process

Common cases of the kernel killing something include the OOM killer, things like SIGSEGV/SIGBUS due to incorrect memory access or a prosaic signal sent by someone.

Let's take a look at less popular ones.

1. OOPS

If the kernel detects a problem with its state, it will print information about the problem. Depending on the particular inconsistency it may also decide to OOPS, which, depending on the state of kernel.panic_on_oops, will either panic the machine or kill the thread which happened to be executing the code at the time.

Either way, an OOPS is typically an indication of a bug in the kernel itself.

2. failed execve

execve(2) is used to execute a new binary in the current process. Setting everything up is a complicated process with several failure points, some of which are present after the original address space has been destroyed. If any failure happens at that point, there is no address space to return to and the kernel has no choice but to kill the process.

This is not a big issue - if the process was doing execve, it was destined to either succeed and get the new image or exit indicating an error anyway.

3. failed copy on write of a non-transparent huge page

When a process forks, its memory pages are marked as copy-on-write. Then, when either the child or the parent writes something, the target page gets unshared, which means another page has to be allocated.

hugepages are a special case and a much more limited resource. Interested parties can force their use through hugetlbfs.

If there are no free hugepages to use on copy on write, the kernel kills the child.


There are many more situations when such a kill can happen. People interested in the subject are welcome to grep the source for send_sig and investigate from there.

Sunday, May 8, 2016

symlinks vs hardlinks

Given /foo/bar, one could say "the file bar is in the directory foo".  This is fine for everyday purposes when hardlinks are understood, but in general is incorrect.

Files exist on filesystems (not in directories) and are represented by inodes. An inode contains various information, like the owner, mode, access times and the link count. It does not contain any names.

A name for a given inode number can then be placed in a directory and the link count of the target inode incremented to note this action. Removing a name understandably decrements the link count and possibly sets the inode up for deletion. The name as described here is known as a hardlink.

Symlinks are slightly more convoluted and typical explanations are somewhat misleading. They boil down to symlinks "pointing to a name". Say we are given an inode 42 with the name crap placed in /. Say a symlink meh is created with the following content: /crap. For everyday purposes one would say that "meh points to /crap" and expect to reach inode 42.

The problem with explanations involving "pointing to a name" is they suggest a relationship between the symlink and the "target" name, while there is none.

The key here lies in the fact that processes can have different view of the "file namespace" (for lack of a better term). In particular thanks to the chroot(2) system call they can have their own idea what / is.

Consider the following tree:
/
├── crap // ino 100
└── foo
    └── bar
        ├── crap // ino 200
        └── craplink -> /crap

A regular process entering /foo/bar and doing cat /craplink will reach the file with ino 100. But another one which chrooted to /foo/bar will reach the inode with ino 200.

A symlink is just some text which can be used during path lookup. Conceptually the symlink is read and the part of the path which was not traversed yet is appended.

Consider a file /a/b/c/dir/file and a symlink /meh/dirlink to /a/b/c/dir. Lookup of /meh/dirlink/file will:
  1. grab / as set for the process
  2. find meh and conclude it's a directory
  3. find dirlink and conclude it's a symlink; the path to look up now becomes /a/b/c/dir/file
This leads us to a caveat with ".."'s in symlinks.

Consider lookup of /meh/dirlink/../crap.
  1.  grab / as set for the process
  2.  find meh and conclude it's a directory
  3. find dirlink and conclude it's a symlink; the path to look up now becomes /a/b/c/dir/../crap
So this leads us to /a/b/c/crap as opposed to /meh/crap.

But there is a caveat in a caveat! Shells like bash or zsh try to be clever and internally detect this situation when you cd. So when you cd /meh/dirlink/../crap, you actually end up in /meh/crap. But if you try to do something else, the hack is gone.

$ echo 'surprise, mo^H^H^H^H' > /a/b/c/crap 
$ mkdir /meh/crap
$ cd /meh/dirlink/../crap
$ pwd
/meh/crap

$ cat /meh/dirlink/../crap
surprise, mo^H^H^H^H

Wednesday, April 6, 2016

linux process number shenanigans

How do you check how many processes are present on your Linux system? Will ps aux do the trick? Is this really the thing you would want to check? Let's see.

A process is a program being executed. It has several properties, like security credentials (uid, gid, labels etc.), address space and of course a PID.
Each process has at least one thread. Threads are what executes the code, and so threads assigned to one process share a lot of properties (the program being executed, the address space etc.).

On FreeBSD there is a struct proc with all relevant properties. Then there is struct thread and the process has the list of threads. Pretty straightforward.

Historically threads were implemented as processes just sharing address space and the like. Basics of this model survived in Linux to this very day. It does not have a separate 'process object', everything is stuffed into the thread.

This means that Linux threads belonging to one process have separate PIDs, just as if they were completely separate processes. They do share a 'thread group' which is how their relationship is maintained. There is also a designated thread acting as the 'thread group leader'.

As such, things are blurred a little bit and any mention of 'threads' or 'processes' has to be examined closely. Let's examine a few things which happen in practice.

The kernel provides a special sysctl, kernel.threads-max. Unsurprisingly, it's an upper limit on the number of threads we can have in the system. We also get kernel.pid_max, which is not an upper limit on processes, although it may act like one in some cases. There is no explicit limit on the number of processes.

Since each process has to have at least one thread, creation of a new process adds a thread. So we are not going to have more processes than threads-max. Further, each thread has to have a PID, so we are not going to have more than pid_max threads either. But if we are already limited, what's the use of pid_max? It is used to provide the range of valid PIDs, so that it takes more time to reuse them (e.g. with threads limited to 32k and pid_max bumped from 32k to 64k, the kernel has another set of pids to go through before it wraps).

How about RLIMIT_NPROC (ulimit -u)? It limits threads, which has a side effect of limiting processes.

With this in mind, let's go back to the initial question.

How many processes are present? Well, it turns out you typically don't want to ask this question as it has little relevance. Instead you want to check how many threads are present. The information is provided in the /proc/loadavg file as the divisor of the second to last field. That is, if the file contains "4.37 3.98 3.55 9/762 23384", the number of threads is 762.

But let's say we really want to know how many processes are present.

The primary source of information is the proc filesystem, mounted on /proc. Listing its contents typically reveals plenty of directories whose names are numbers. Each such entry is either a thread group leader (so it can serve as a representation of a process) or a kernel thread. How do you spot kernel threads? This one is fortunately quite easy - everything with a parent pid of 2, plus pid 2 itself.
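For illustration, a small userspace program counting processes this way could look as follows (it assumes the standard /proc layout and keeps error handling minimal):

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Return the parent pid as reported in /proc/<pid>/status, or -1 on failure. */
static int
parent_pid(const char *pid)
{
        char path[64], line[256];
        FILE *f;
        int ppid = -1;

        snprintf(path, sizeof(path), "/proc/%s/status", pid);
        f = fopen(path, "r");
        if (f == NULL)
                return -1;
        while (fgets(line, sizeof(line), f) != NULL) {
                if (sscanf(line, "PPid: %d", &ppid) == 1)
                        break;
        }
        fclose(f);
        return ppid;
}

int
main(void)
{
        DIR *d;
        struct dirent *de;
        int count = 0;

        d = opendir("/proc");
        if (d == NULL)
                return 1;
        while ((de = readdir(d)) != NULL) {
                if (!isdigit((unsigned char)de->d_name[0]))
                        continue;       /* not a pid entry */
                if (strcmp(de->d_name, "2") == 0)
                        continue;       /* pid 2 is kthreadd */
                if (parent_pid(de->d_name) == 2)
                        continue;       /* kernel threads are children of pid 2 */
                count++;
        }
        closedir(d);
        printf("%d\n", count);
        return 0;
}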

Sunday, February 14, 2016

Fun fact: transient failure to read process name on Linux

Process names are obtained by tools like ps by reading /proc/<pid>/cmdline. The content of the file is obtained by accessing target process's address space. But the information is temporarily unavailable during execve.

In particular, a new structure describing the address space is allocated. It is being assigned to the process late in execve stage, but before it is fully populated. The code generating cmdline detects the condition and returns value of 0, meaning no data was generated.

Consider execve-loop, doing execs:
#include <unistd.h>          

int
main(int argc, char **argv)
{

    execv(argv[0], argv);
}


And execve-read, doing reads:
#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    char buf[100];
    char *path;
    int fd;

    if (argc != 2)
        return (1);

    path = argv[1];

    for (;;) {
        fd = open(path, O_RDONLY);
        if (fd == -1)
            err(1, "open");
        if (read(fd, buf, sizeof(buf)) == 0)
            printf("failure!\n");
        else
            printf("success: [%s]\n", buf);
        close(fd);
    }
}


Let's run them:

shell1$ ./execve-loop
shell2$ ./execve-read /proc/$(pgrep execve-loop)/cmdline
success: [./execve-loop]
failure!
failure!
failure!
success: [./execve-loop]
success: [./execve-loop]

[snip]


Could the kernel be modified to e.g. provide the old name or in worst case wait until the new name becomes available? Yes, but this does not seem to be worth it.

Tuesday, February 9, 2016

kernel game #3

Assume we have an extremely buggy driver. Multiple threads can call into meh_ioctl shown below at the same time with the same device and there is no locking provided. The routine is supposed to either store a pointer to a referenced struct file object in m->fp or just clear the entry (and of course get rid of the reference).

What can go wrong here? Consider both a singlethreaded and a multithreaded execution.

int meh_ioctl(dev_t dev, ioctl_t ioct, int data)
{
        meh_t *m = to_meh(dev);
        struct file *fp;

        switch (ioct) {
        case MEH_ATTACH:
                /* data is the fd we are going to borrow the file from */

                /* check if we already have a reference to a file */
                if (m->fp != NULL)
                        frele(m->fp);
                /* fget return the file with a reference or NULL on error */
                fp = fget(data);
                if (fp == NULL)
                        return EBADF;
                m->fp = fp;
                break;
        case MEH_DETACH:
                if (m->fp == NULL)
                        return EINVAL;
                frele(m->fp);
                m->fp = NULL;
                break;
        }

        return 0;
}


Monday, February 8, 2016

kernel game #2

Consider a kernel where processes are represented with struct proc objects. The kernel implements unix-like interfaces and works with multiple CPUs.

The following syscall is provided:

int sys_fork(void)
{
        struct proc *p;

        int error;

        error = kern_fork(&p);
        if (error == 0)
                curthread->retval = p->pid;
        return error;
}


That is, if error is 0 we know forking succeeded, in which case the function stores the pid found in the object. Otherwise a non-zero error value is returned and the retval field is not inspected.

Why would this code be incorrect?

Tuesday, January 19, 2016

Fun fact: transient ETXTBSY when trying to open a script on Linux

When you run a given binary, the kernel marks it as being executed, which is then used to disallow opening it for writing. Similarly, when a file is opened for writing, it is marked as such and execve fails.

The error is ETXTBSY or Text file busy.

Situation with scripts is a little bit more involved. Doing "sh script.sh" will typically result in execve of /bin/sh, i.e. the kernel does not really know nor care what script.sh is.

Let's consider a file with the following content being passed to execve:
#!/bin/sh

echo meh
exit 0


A special handler recognizes #! and proceeds to change the executed binary to /bin/sh.

However, once execution gets going, you can open the file for writing no problem.

This poses two questions:
- will the execution fail if the script is opened for writing?
- will opening the file for writing ever fail because the script is being executed?

Let's experiment. The following program will help:

#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
        int fd;

        if (argc != 2)
                return 1;

        for (;;) {
                fd = open(argv[1], O_WRONLY);
                if (fd != -1) {
                        close(fd);
                        continue;
                }
                perror("open");
        }
       
        return 1;
}


As you can see the program just repeatedly tries to open the file for writing.
We will run the script ("script.sh") in one terminal, while running the program in another. That is:

shell1$ ./write script.sh

shell2$ while true; do ./script.sh; done

And this gives....

shell1$ ./write script.sh
open: Text file busy
open: Text file busy
open: Text file busy
open: Text file busy
open: Text file busy
open: Text file busy
[snip]


shell2$ while true; do ./script.sh; done
zsh: text file busy: ./script.sh
zsh: text file busy: ./script.sh
meh
meh
zsh: text file busy: ./script.sh
meh

[snip]

So we see 2 failure modes:
- sometimes we fail to execve because the file is opened for writing
- sometimes we fail to open for writing because the file is being executed

The second condition is transient - the file is unmarked as the kernel proceeds to look up /bin/sh instead of the script.

A side observation is that if you have a file which is executable by others, they may interfere with your attempts to write to it by repeatedly calling execve.

Wednesday, January 13, 2016

can strace affect traced processes?

Please read "when strace fails to obtain syscall" first if you don't know how strace works.

Can strace/truss/whatever usage change the result of running a program?

Using ptrace introduces quite a lot of overhead (i.e. it slows things down). But besides that, can strace usage result in changing what the process ends up doing?

Yes it can, and the key lies in waking up threads which are in an interruptible sleep. Note that you can count yourself unlucky if you encounter any of the things below. strace is perfectly fine to use most of the time.

Depending on the code which put the thread to sleep, the thread will:
- do something without going back to the syscall boundary
- go back to the boundary and restart the syscall
- exit the kernel with EINTR

All of these are invasive in principle, but most of the time harmless.

One funny example of EINTR interfering with troubleshooting was a few years back, when nginx would just hang around doing nothing when it was supposed to shut down. Attaching to it revealed it was blocked in epoll_wait, which now returned with EINTR and that prompted nginx to exit.

Let's see a demo for the second case - automagic restart.

We are going to create a named pipe. The thread opening it for reading will block until it gets opened for writing (writers are blocked in the same manner).

shell1$ mkfifo fifo
shell1$ cat > fifo

shell2$ cat fifo

Stuff typed into the first cat should be read by the second cat.

Now consider:

shell1$ mkfifo fifo
shell1$ cat > fifo

shell2$ rm fifo 
shell2$ mkfifo fifo
shell2$ cat fifo

"cat > fifo" results in the shell trying to open the file, before it can proceed to execve cat. It got blocked:
[<ffffffff81226480>] pipe_wait+0x70/0xc0
[<ffffffff81226501>] wait_for_partner+0x31/0x70
[<ffffffff81226f60>] fifo_open+0x1b0/0x310
[<ffffffff8121bbff>] do_dentry_open+0x1ff/0x2f0
[<ffffffff8121d096>] vfs_open+0x56/0x60
[<ffffffff8122c0c4>] path_openat+0x1e4/0x12a0
[<ffffffff8122e34a>] do_filp_open+0x8a/0x100
[<ffffffff8121d45a>] do_sys_open+0x13a/0x230
[<ffffffff8121d56e>] SyS_open+0x1e/0x20
[<ffffffff817795ae>] entry_SYSCALL_64_fastpath+0x12/0x71
[<ffffffffffffffff>] 0xffffffffffffffff


But the name got unlinked and the other cat is waiting on a different pipe.

What if we strace? The sleep is interruptible and open is going to be restarted. And indeed:

open("fifo", O_WRONLY|O_CREAT|O_NOCTTY|O_TRUNC, 0666) = 3

Since it went all the way back to the boundary, the name "fifo" got looked up again and the thread proceeded to open the new pipe. And indeed, messages get relayed:

shell1$ cat > fifo
test

shell2$ cat fifo 
test

So, stracing blindly can sometimes give interesting results.

There are other methods which are not invasive in this manner, including systemtap, dtrace or ktrace (BSD-specific).