Monday, June 27, 2016

when the kernel can kill a process

Common cases of the kernel killing a process include the OOM killer, signals like SIGSEGV/SIGBUS delivered due to invalid memory access, or simply a signal sent by someone else.

Let's take a look at less popular ones.

1. OOPS

If the kernel detects a problem with its state, it will print information about the problem. Depending on the particular inconsistency it may also decide to OOPS, which depending on the state of kernel.panic_on_oops will either crash or kill the thread which happened to be executing the code at the time.

Either way, an OOPS is typically an indication of a bug in the kernel itself.

2. failed execve

execve(2) is used to execute a new binary in the current process. Setting everything up is a complicated process with several failure points, some of which come after the original address space has already been destroyed. If a failure happens at that point, there is no address space to return to and the kernel has no choice but to kill the process.

This is not a big issue - if the process was doing execve, it was bound to either succeed and get the new image or exit indicating an error anyway.

3. failed copy on write of a non-transparent huge page

When a process forks, its memory pages are marked copy on write. When either the child or the parent later writes to a shared page, the page is unshared: a new page is allocated and the write goes to the private copy.

Hugepages are a special case and a much more limited resource. Interested parties can force their use through hugetlbfs.

If there are no free hugepages to use on copy on write, the kernel kills the child.


There are many more situations when such a kill can happen. People interested in the subject are welcome to grep the source for send_sig and investigate from there.

Sunday, May 8, 2016

symlinks vs hardlinks

Given /foo/bar, one could say "the file bar is in the directory foo".  This is fine for everyday purposes when hardlinks are understood, but in general is incorrect.

Files exist on filesystems (not in directories) and are represented by inodes. An inode contains various information, like the owner, mode, access times and the link count. It does not contain any names.

A name for a given inode number can then be placed in a directory, and the link count of the target inode incremented to note this action. Removing a name understandably decrements the link count and possibly sets the inode up for deletion. The name as described here is known as a hardlink.

Symlinks are slightly more convoluted and typical explanations are somewhat misleading. They boil down to symlinks "pointing to a name". Say we are given an inode 42 with the name crap placed in /. Say a symlink meh is created with the following content: /crap. For everyday purposes one would say that "meh points to /crap" and expect to reach inode 42.

The problem with explanations involving "pointing to a name" is they suggest a relationship between the symlink and the "target" name, while there is none.

The key here lies in the fact that processes can have different views of the "file namespace" (for lack of a better term). In particular, thanks to the chroot(2) system call they can have their own idea of what / is.

Consider the following tree:
/
├── crap // ino 100
└── foo
    └── bar
        ├── crap // ino 200
        └── craplink -> /crap

A regular process entering /foo/bar and doing cat /craplink will reach the file with ino 100. But another one which chrooted to /foo/bar will reach the file with ino 200.

A symlink is just some text which can be used during path lookup. Conceptually the symlink is read and the part of the path which was not traversed yet is appended.

Consider a file /a/b/c/dir/file and a symlink /meh/dirlink to /a/b/c/dir. Lookup of /meh/dirlink/file will:
  1. grab / as set for the process
  2. find meh and conclude it's a directory
  3. find dirlink and conclude it's a symlink; the path to look up now becomes /a/b/c/dir/file
This leads us to a caveat with ".."'s in symlinks.

Consider lookup of /meh/dirlink/../crap.
  1.  grab / as set for the process
  2.  find meh and conclude it's a directory
  3.  find dirlink and conclude it's a symlink; the path to look up now becomes /a/b/c/dir/../crap
So this leads us to /a/b/c/crap as opposed to /meh/crap.

But there is a caveat in a caveat! Shells like bash or zsh try to be clever and internally detect this situation when you cd. So when you cd /meh/dirlink/../crap, you actually end up in /meh/crap. But if you try to do something else, the hack is gone.

$ echo 'surprise, mo^H^H^H^H' > /a/b/c/crap 
$ mkdir /meh/crap
$ cd /meh/dirlink/../crap
$ pwd
/meh/crap

$ cat /meh/dirlink/../crap
surprise, mo^H^H^H^H

Wednesday, April 6, 2016

linux process number shenanigans

How do you check how many processes are present on your Linux system? Will ps aux do the trick? Is this really the thing you would want to check? Let's see.

A process is a program being executed. It has several properties, like security credentials (uid, gid, labels etc.), address space and of course a PID.
Each process has at least one thread. Threads are what is executing the code, and so threads assigned to one process share a lot of properties (the program being executed, the address space etc.).

On FreeBSD there is a struct proc with all relevant properties. Then there is struct thread and the process has the list of threads. Pretty straightforward.

Historically threads were implemented as processes merely sharing an address space and the like. The basics of this model survive in Linux to this very day: there is no separate 'process object', everything is stuffed into the thread.

This means that Linux threads belonging to one process have separate PIDs, just as if they were completely separate processes. They do share a 'thread group' which is how their relationship is maintained. There is also a designated thread acting as the 'thread group leader'.

As such, things are blurred a little bit and any mention of 'threads' or 'processes' has to be examined closely. Let's examine a few things which happen in practice.

The kernel provides a special sysctl kernel.threads-max. Unsurprisingly it's an upper limit for the number of threads we can have in the system. We also get kernel.pid_max, which is not the upper limit of processes, although it may act like one in some cases. There is no explicit limit of processes.

Since each process has to have at least one thread, creating a new process adds a thread. So we are not going to have more processes than threads-max. Further, each thread has to have a PID, so we are not going to have more than pid_max threads either. But if we are already limited, what's the use for pid_max? It is used to provide the range of valid PIDs, so that it will take more time to reuse them (e.g. with threads limited to 32k and pid_max bumped from 32k to 64k the kernel has another set of pids before it wraps).

How about RLIMIT_NPROC (ulimit -u)? It limits threads, which has a side effect of limiting processes.

With this in mind, let's go back to the initial question.

How many processes are present? Well, turns out you typically don't want to ask this question as it has little relevance. Instead you want to check how many threads are present. The information is provided in the /proc/loadavg file, as the number after the slash in the second to last field. That is, if the file contains "4.37 3.98 3.55 9/762 23384", the number of threads is 762.

But let's say we really want to know how many processes are present.

The primary source of information is the proc filesystem mounted on /proc.
Listing the content typically reveals plenty of directories with the name being a number. Each entry is either a thread group leader (so it can serve as a representation of a process) or a kernel thread. How to spot kernel threads? This one is fortunately quite easy - everything with a parent pid of 2 and the pid 2 itself.

Sunday, February 14, 2016

Fun fact: transient failure to read process name on Linux

Process names are obtained by tools like ps by reading /proc/<pid>/cmdline. The content of the file is obtained by accessing the target process's address space. But the information is temporarily unavailable during execve.

In particular, a new structure describing the address space is allocated. It is assigned to the process late in the execve stage, but before it is fully populated. The code generating cmdline detects the condition and returns 0, meaning no data was generated.

Consider execve-loop, doing execs:
#include <err.h>
#include <unistd.h>

int
main(int argc, char **argv)
{

    execv(argv[0], argv);
    err(1, "execv");
}


And execve-read, doing reads:
#include <sys/types.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    char buf[100];
    char *path;
    int fd;

    if (argc != 2)
        return (1);

    path = argv[1];

    for (;;) {
        fd = open(path, O_RDONLY);
        if (fd == -1)
            err(1, "open");
        if (read(fd, buf, sizeof(buf)) == 0)
            printf("failure!\n");
        else
            printf("success: [%s]\n", buf);
        close(fd);
    }
}


Let's run them:

shell1$ ./execve-loop
shell2$ ./execve-read /proc/$(pgrep execve-loop)/cmdline
success: [./execve-loop]
failure!
failure!
failure!
success: [./execve-loop]
success: [./execve-loop]

[snip]


Could the kernel be modified to e.g. provide the old name or in worst case wait until the new name becomes available? Yes, but this does not seem to be worth it.

Tuesday, February 9, 2016

kernel game #3

Assume we have an extremely buggy driver. Multiple threads can call into meh_ioctl shown below at the same time with the same device and there is no locking provided. The routine is supposed to either store a pointer to a referenced struct file object in m->fp or just clear the entry (and of course get rid of the reference).

What can go wrong here? Consider both a singlethreaded and a multithreaded execution.

int meh_ioctl(dev_t dev, ioctl_t ioct, int data)
{
        meh_t *m = to_meh(dev);
        struct file *fp;

        switch (ioct) {
        case MEH_ATTACH:
                /* data is the fd we are going to borrow the file from */

                /* check if we already have a reference to a file */
                if (m->fp != NULL)
                        frele(m->fp);
                /* fget returns the file with a reference or NULL on error */
                fp = fget(data);
                if (fp == NULL)
                        return EBADF;
                m->fp = fp;
                break;
        case MEH_DETACH:
                if (m->fp == NULL)
                        return EINVAL;
                frele(m->fp);
                m->fp = NULL;
                break;
        }

        return 0;
}


Monday, February 8, 2016

kernel game #2

Consider a kernel where processes are represented with struct proc objects. The kernel implements unix-like interfaces and works with multiple CPUs.

The following syscall is provided:

int sys_fork(void)
{
        struct proc *p;

        int error;

        error = kern_fork(&p);
        if (error == 0)
                curthread->retval = p->pid;
        return error;
}


That is, if error is 0 we know forking succeeded, in which case the function stores the pid found in the object. Otherwise a non-zero error value is returned and the retval field is not inspected.

Why would this code be incorrect?

Tuesday, January 19, 2016

Fun fact: transient ETXTBSY when trying to open a script on Linux

When you run given binary, the kernel marks it as being executed which is then used to disallow opening it for writing. Similarly, when a file is opened for writing, it is marked as such and execve fails.

The error is ETXTBSY or Text file busy.

The situation with scripts is a little bit more involved. Doing "sh script.sh" will typically result in execve of /bin/sh, i.e. the kernel does not really know nor care what script.sh is.

Let's consider a file with the following content being passed to execve:
#!/bin/sh

echo meh
exit 0


A special handler recognizes #! and proceeds to change the executed binary to /bin/sh.

However, once execution gets going, you can open the file for writing no problem.

This poses two questions:
- will the execution fail if the script is opened for writing?
- will opening the file for writing ever fail because the script is being executed?

Let's experiment. The following program will help:

#include <sys/types.h>
#include <sys/stat.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
        int fd;

        if (argc != 2)
                return 1;

        for (;;) {
                fd = open(argv[1], O_WRONLY);
                if (fd != -1) {
                        close(fd);
                        continue;
                }
                perror("open");
        }
       
        return 1;
}


As you can see the program just repeatedly tries to open the file for writing.
We will run the script ("script.sh") in one terminal, while running the program in another. That is:

shell1$ ./write script.sh

shell2$ while true; do ./script.sh; done

And this gives....

shell1$ ./write script.sh
open: Text file busy
open: Text file busy
open: Text file busy
open: Text file busy
open: Text file busy
open: Text file busy
[snip]


shell2$ while true; do ./script.sh; done
zsh: text file busy: ./script.sh
zsh: text file busy: ./script.sh
meh
meh
zsh: text file busy: ./script.sh
meh

[snip]

So we see 2 failure modes:
- sometimes we fail to execve because the file is opened for writing
- sometimes we fail to open for writing because the file is being executed

The second condition is transient - the file is unmarked as the kernel proceeds to look up /bin/sh instead of the script.

A side observation is that if you have a file which is executable by others, they may interfere with your attempts to write to it by repeatedly calling execve.