Thursday, April 30, 2015

crap c code samples

There are various properties of a programming language. The amount of effort needed to shoot yourself in the foot is definitely among them. There are situations where the language does not invite the programmer to do it, but they do so anyway.

And then there is writing stuff in a non-standard way for no reason.

When a normal person reads code, they make huge jumps. Reading any non-trivial codebase line-by-line is a non-starter. If they eye-grep something deviating from the standard way of doing it, they try to understand what the difference in behaviour is, and when they can't spot one they have to look again, only to remember the code is crap, so unjustified deviations are to be expected.

This post is not about random casts to silence the compiler, missing -Wall and the like in compilation flags, missing headers and other junk of the sort.

Let's take a look at weird ass code samples. The list is by no means complete.

do {
} while (1);

When you want an infinite loop, you use either while (1) or for (;;). Having a do {} while loop in this scenario only serves to confuse the reader for a brief moment.

while (foo());
bar();

Is the author trying to hide something? It is easy to miss the semicolon on the while line and think someone without a real editor just did not indent bar properly. If you notice the semicolon, you start wondering if it ended up there by mistake.

There are legitimate reasons for having loops with empty bodies. If you need to use one, do the following instead:

while (foo())
        continue;


Next goodie:

if (foo || (!foo && bar))

What? This is equivalent to a mere (foo || bar). Inserting !foo only serves to make the reader read the condition several times while wondering if they had enough coffee this morning.
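If you don't trust your coffee levels, the equivalence is easy to brute-force. A minimal sketch (conditions_agree is a name made up for this example) which walks all four input combinations and compares both forms:

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when both forms of the condition agree for every input. */
bool conditions_agree(void)
{
        for (int foo = 0; foo <= 1; foo++) {
                for (int bar = 0; bar <= 1; bar++) {
                        bool verbose = foo || (!foo && bar);
                        bool simple = foo || bar;

                        if (verbose != simple)
                                return false;
                }
        }
        return true;
}
```

A caller can simply assert(conditions_agree()). There are only four rows in the truth table, so there is nowhere for a difference to hide.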

char *p = '\0';

I have not encountered this animal in the wild myself, but got reports of its existence. The reasoning behind abusing the equivalence of '\0' and 0 (and the resulting NULL) eludes me. Just use NULL. Thank you.

if (FOO == bar)

Famous yoda-style comparisons. Not only do they look terrible, there is no legitimate justification that I could see.

There are people who claim it prevents mistakes of the form if (bar = FOO) where a comparison was intended. The good news is that your compiler more than likely will tell you that such an expression is fishy and that, if you really mean it, you should use additional parentheses. Which is a good thing, since you may need to compare two variables, in which case the "trick" would be useless. AVOID.

if (foo) {
} else {
        bar();
}


Or even worse:

if (!foo) {
} else {
        bar();
}


What's the point of the else clause? Do this instead:

if (foo) {
        bar();
}


Wednesday, April 29, 2015

why binaries from one OS don't work on another

The idea of taking calc.exe and just running it on Linux/whatever is obviously absurd. However, taking a binary from one unix-like system (say, Linux) and running it on another (say, FreeBSD) raises a legitimate question: why would that not work without special support?

The list below is by no means complete and I'm too lazy to educate myself more on the subject.

So, let's take a look at what's needed to get a binary running.

Of course there is functionality specific to given system (e.g. epoll vs kqueue), extensions to common functionality or minor differences in semantics of common functionality, but let's ignore that.

binary loading

The kernel has to parse the binary. Both systems use ELF format, so headers are readable, but it does not mean either system can make sense of everything found inside.

ELF supports passing additional information to the process (apart from argument vector and the environment), but the information passed is os-specific.

A dynamically linked binary contains a hardcoded path to the linker which is supposed to be used and typically requires some libraries (like libc), but e.g. sizes of various structures can be different or macros can expand to different symbols.

As such even if you loaded glibc along with FreeBSD binary it would not work, and the linker does not like the binary anyway.

So one would have to provide a complete enough environment with all necessary binary files.

system calls

Programs open files, talk over the network etc. by asking the kernel to perform a specific action. This is achieved by the use of syscalls.

The kernel has a table of system calls. Userspace processes tell the kernel which syscall they want and what arguments should be used.

Even common syscalls can (and often do) have different numbers (FreeBSD, Linux). So our binary would end up calling wrong syscalls, which obviously cannot work.
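To make the "which syscall they want" part concrete, here is a minimal sketch which asks for getpid by number instead of going through the usual libc wrapper. It assumes Linux with glibc (syscall(2) and the SYS_* constants are provided there, the BSDs have similar facilities); raw_getpid is a name made up for this example. On Linux/amd64 SYS_getpid is 39, while FreeBSD uses 20 for the same call, which is exactly the kind of mismatch described above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel for our pid by syscall number, bypassing the libc wrapper. */
long raw_getpid(void)
{
        return syscall(SYS_getpid);
}
```

The result should match what getpid() returns. A binary built for another system carries that system's numbers baked in, so the same request lands on whatever happens to sit in that slot of the local table.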

Do you at least do the same thing to call a syscall with given arguments? Well...

On i386 systems FreeBSD expects syscall arguments to be on the stack, while Linux expects them in registers. The way of invoking a syscall is the same though.

On amd64 both systems do the same thing, but one could make them differ for the sake of it.


Supporting trivial binaries from other systems, which only use functionality provided by your system, is not very hard. You can provide a dedicated system call table and make sure your signal delivery works.

Running non-trivial binaries requires a significant effort.

Such work was performed in FreeBSD (and other BSDs) and allows it to run Linux binaries. Although there are a lot of missing syscalls (e.g. inotify) and there are some terrible hacks, the layer is quite usable.

Thursday, April 23, 2015

kernel security vs selinux

I'm sorry for a marketing title, but I'm trying to make a point.

There seems to be some confusion as to what selinux (and other LSM modules) can do for you in terms of preventing kernel exploitation, especially in environments where untrusted code is expected to be executed. The answer is: not enough.

This is not an attack on selinux, but on an idea that it can secure your kernels. Especially on an idea that a container created specifically to run untrusted code is less of a threat thanks to selinux.

Note that selinux can be used to make it harder to execute arbitrary code in userspace, providing some degree of protection (an attacker needs a vulnerability in userspace first). However, in a lot of environments arbitrary userspace execution is a given and that's the setup we are going to focus on moving forward.

Normally you want to reduce the attack surface by making as much code as possible unreachable for attackers. Remaining code can have vulnerabilities as well and for that you want various techniques to make the exploitation impossible or at least way harder.

selinux, apparmor and the like are implemented on top of Linux Security Modules framework. LSM has a lot of points in the kernel where it can run your module's hooks, allowing them to deny an operation. But all of them are deep within syscalls.

A typical syscall looks like this:
a lot of code
some more code
likely a LSM hook somewhere
and even more code
Or, to put it differently, there is possibly vulnerable kernel code which does not need to be available and whose execution cannot be blocked by LSM simply because the hook is not executed early enough.

Most LSM hooks are placed deep within the code for a good reason, but I don't know why there are no simple hooks provided to just deny syscall execution.

Look at seccomp(2) if you want the ability to restrict access better and at grsecurity if you want to decrease likelihood of successful exploitation of code which had to be reachable.

On FreeBSD the MAC framework is the equivalent of LSM. You can restrict syscall access and the like with capsicum(4).

These technologies are not equivalent and a longer blogpost is in order. Another one will elaborate on a relationship between a container of any sort (e.g. FreeBSD's jail) and the host kernel.

The sole purpose of this post is to make it clear: just plopping selinux in an environment where your users can run their own code does not protect you from kernel exploitation in a sufficient manner.

As a side note there is a fun fact that one of the most basic exploitation prevention measures (disabled mapping at address 0) can be circumvented "thanks to" LSM.

Tuesday, April 21, 2015

unix: insecure by default

Unix-like systems accumulated a lot of stuff which tries to screw you over by default.

I may sound like Captain Obvious here, but I feel like ranting a little.

All of this can be worked around with some effort, the point is it should be the other way around. Even if you knew all the ways you can be screwed over (which you don't) and could always adjust stuff accordingly (and rest assured you would slip up at some point), you can't trust your co-workers will do the right thing.

There is no benefit to having any of this as a default that I could see.

hardlinks to files owned by others

The Linux kernel allows this by default, although apparently distributions disable it on their own (see sysctl fs.protected_hardlinks). On FreeBSD this can be altered with security.bsd.hardlink_check_uid and security.bsd.hardlink_check_gid.


/tmp

Ah, the bottomless source of vulnerabilities even in 2015.

If you just try to create a file in /tmp, you already screwed up. You have to be careful to not follow symlinks planted by jokers.

But wait, you opened a file and you checked with fstat(2) it's yours and it's not a symlink, so it should be fine. Except if your /tmp is not a separate partition it could be a hardlink planted by a joker.

This would be fixed for the most part if /tmp/$USER was provided instead.

Ideally just don't use /tmp.
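If you have to create a file in a shared directory anyway, the bare minimum is O_CREAT combined with O_EXCL: POSIX requires such an open to fail with EEXIST when the name is a symlink, no matter where the link points. The sketch below plays both roles: it plants a dangling symlink in a private scratch directory and then tries to create a file under the same name. The function name and paths are made up for this example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Returns the errno of the failed open(), 0 if the open unexpectedly
 * succeeded, or -1 on setup failure.
 */
int create_over_symlink(void)
{
        char dir[] = "/tmp/excl-demo-XXXXXX";
        char path[PATH_MAX];
        int fd, err;

        if (mkdtemp(dir) == NULL)
                return -1;
        snprintf(path, sizeof(path), "%s/victim", dir);
        if (symlink("/nonexistent/target", path) != 0) {
                rmdir(dir);
                return -1;
        }

        /* O_EXCL together with O_CREAT refuses to follow the symlink. */
        fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
        err = (fd == -1) ? errno : 0;
        if (fd != -1)
                close(fd);

        unlink(path);
        rmdir(dir);
        return err;
}
```

In practice mkstemp(3) and mkdtemp(3) do this for you and pick an unpredictable name on top of it, so there is rarely a reason to hand-roll it.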

mount options (suid, exec, dev)

Feeling like mounting something over nfs? You better always remember to put the magic three nosuid, noexec, nodev or chances are you will be screwed over.

A side note is that FreeBSD ditched support for accessing devices through files created on regular filesystems. If you want to put devices somewhere and use them, you need devfs(5).

file descriptors survive execution of a new binary

That is unless you explicitly set O_CLOEXEC on them. I'm sure no file descriptor ever leaked to an unprivileged process even though it was not supposed to.
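A minimal sketch showing the default: it opens /dev/null with and without O_CLOEXEC and asks fcntl(2) whether the descriptor is marked close-on-exec. is_cloexec is a name made up for this example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Opens /dev/null with the given extra flags. Returns 1 if the resulting
 * descriptor is marked close-on-exec, 0 if it would leak into an exec'ed
 * binary, -1 on error.
 */
int is_cloexec(int extra_flags)
{
        int fd, fdflags;

        fd = open("/dev/null", O_RDONLY | extra_flags);
        if (fd == -1)
                return -1;
        fdflags = fcntl(fd, F_GETFD);
        close(fd);
        if (fdflags == -1)
                return -1;
        return (fdflags & FD_CLOEXEC) != 0;
}
```

is_cloexec(0) reports 0 and is_cloexec(O_CLOEXEC) reports 1. Setting FD_CLOEXEC later with fcntl(F_SETFD) also works, but in a multithreaded program it leaves a window between open and fcntl in which another thread can fork and exec.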

shell scripting

People who have some knowledge wrote quite a lot about the subject (e.g. see BashPitfalls), so let me just point out my favorites.

Unknown variables expand to an empty string -- an excellent source providing us with a constant stream of weird problems, including everyone's favourite rm -rf /*. Try 'set -u' to work around this, but beware of some caveats.

The pattern is returned if no matches were found, e.g. foobar* results in foobar* if the only existing entries are foo, bar and baz. This is unknowingly abused by a lot of people who run tools like find (find . -name *pattern*) or ssh (ssh host cat pattern*). Countless scripts are waiting to break as soon as someone creates a file which happens to match the pattern. Work around it with 'shopt -s nullglob'.

process titles in Linux

FreeBSD provides a dedicated sysctl which updates an in-kernel buffer.

On Linux tools like ps read /proc/<pid>/cmdline in order to provide process titles/names + their args. This content is read from pages mapped into target process memory. People started abusing this situation by overwriting stored arguments in order to provide informative titles for their processes.

Updates have good performance since processes just write to their own memory.

As usual this leads to some user-visible caveats, which fortunately are fixable.

Title consistency

A cosmetic issue here is that there are no consistency guarantees - what happens if the kernel reads the content as it's being written? The window is extremely small, and reads + updates are rare enough for this to likely never be a problem in practice.

As a side note, the kernel recognises the hack. People can even move the environment (which is normally stored after the argument vector) to make more space for the title and the kernel supports that.

One example approach would tell the kernel where to look for this data and would provide a marker to know whether there is an update in progress, so that the kernel can re-read a few times if needed.

Hanging processes

Accessing memory area storing cmdline's content requires locking target process address space for reading. Unfortunately it's possible that something will lock it for writing and block for an unspecified amount of time, preventing any read accesses. Then if you run a tool which reads the file (e.g. ps(1)), it blocks in an uninterruptible manner (i.e. cannot be killed) waiting for the lock.

So if you happen to have periodically executed scripts which run ps you can accumulate a lot of unkillable processes. This is especially confusing when you try to debug such a problem since mere ps run by hand will also block.

This can for example happen if an nfs share dies while a process tries to mmap(2) a file backed by it and actual requests need to be issued.

I would say the best course of action here would be to provide a bounded and killable sleep for cmdline reads. If no data could be read, the process name taken from task_struct could be used.

Monday, April 13, 2015

file descriptors: multithreaded process vs the kernel

Userspace processes have several ways in which they can manipulate their file descriptor table. Doing so concurrently opens the kernel to some races which need to be handled.

Let's start with a note that files/pipes/sockets/whatever you may have opened are represented by a dedicated structure, which on FreeBSD, Linux and Solaris happens to be named file.

A file can be in use by multiple fds (and even without fds (e.g. with mmap + close, or just internally by the kernel)), so it has a reference counter. If the counter drops to 0, the object is freed.

File descriptors (fd from now on) have some properties (e.g. a close-on-exec flag), but first and foremost are there to obtain the pointer to relevant struct file (fp from now on). This is achieved by having a table indexed with fds.

With this relationship established let's examine relevant ways userspace processes can modify their fd table:
  1. obtain the lowest possible fd [1] (e.g. open(2), socket(2))
  2. close an arbitrary fd (close(2))
  3. obtain an arbitrary fd (dup2(2))
File descriptors are one of the most common arguments to syscalls and as such translation fd -> fp needs to be fast.
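The three operations listed above can be observed from userspace. A minimal sketch, assuming a single-threaded caller (otherwise someone else could grab the slot in the meantime); both function names are made up for this example:

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 when open() hands out the slot which was just freed. */
int lowest_fd_is_reused(void)
{
        int a, b, c;

        a = open("/dev/null", O_RDONLY);
        if (a == -1)
                return -1;
        b = open("/dev/null", O_RDONLY);
        if (b == -1) {
                close(a);
                return -1;
        }
        close(a);                               /* 2. close an arbitrary fd */
        c = open("/dev/null", O_RDONLY);        /* 1. lowest possible fd */
        close(b);
        if (c == -1)
                return -1;
        close(c);
        return c == a;
}

/* Returns 1 when dup2() places the duplicate exactly where it was told to. */
int dup2_picks_slot(void)
{
        int fd, target, res;

        fd = open("/dev/null", O_RDONLY);
        if (fd == -1)
                return -1;
        /* Find a slot that is not in use, so that dup2() closes nothing. */
        for (target = fd + 10; fcntl(target, F_GETFD) != -1; target++)
                continue;
        res = dup2(fd, target);                 /* 3. obtain an arbitrary fd */
        close(fd);
        if (res == -1)
                return -1;
        close(res);
        return res == target;
}
```

Both are expected to return 1. The lowest-fd guarantee is why the old "close(0); open(...)" redirection trick works, and also why it is racy as soon as a second thread exists.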

An example syscall resulting in a new fd with a unique fp needs to do the following steps:
  • obtain a new fp (duh)
  • obtain a new fd
  • do whatever it needs to do so that it can fill fp with relevant data
In what order should this be done? Now that depends. Let's assume that this syscall-specific action is very hard to revert, so we need a guarantee of successful fd + fp actions by the time we get to it.

The following fictional C functions will be helpful later:
fp_alloc - obtain a new fp
fd_alloc(fp) - obtain a new fd with given fp set
fp_init(fp, data) - fill fp with stuff specific to given syscall
fd_close(fd) - closes relevant fd; this releases a refcount on fp, possibly freeing it
fd_get(fd) - obtains a reference to fp and returns it
fp_drop(fp) - drops a reference, possibly freeing fp

Now let's consider a syscall doing (error handling trimmed for brevity):
fp = fp_alloc();
fd = fd_alloc(fp);
data = stuff();
fp_init(fp, data);
But what if someone fd_gets this fd before fp_init? We cannot return garbage. So we have to introduce some sort of larval state - fp is there, but is not ready for use and fd_get is careful to check for this condition and returns EBADF claiming nothing was there after all.

How about fd_close? If one was to call it before fp_init is completed, it could result in a use-after-free condition. Clearly this cannot be allowed. The solution here is to use an initial refcount of 2 and unconditionally drop the extra reference in the syscall. With this in place, in the worst case the code will fp_init something which is about to be destroyed, which is fine.

Now alter function names a little bit and introduce 'fp->f_ops == &badfileops' as a criterion for larval state and you get how it's done in FreeBSD.
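Here is a toy, single-threaded model of the scheme just described, replaying the racy interleaving by hand. Everything in it is an illustration and not FreeBSD code: the 'ready' flag stands in for the f_ops check, there is no locking and no bounds checking, and larval_race_demo is a name made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NFD 16

struct file {
        int refcount;
        int ready;      /* 0 while larval */
        int data;
};

static struct file *fdtable[NFD];
static int files_freed;

/* One reference for the fd table, one for the syscall still working on fp. */
static struct file *fp_alloc(void)
{
        struct file *fp = calloc(1, sizeof(*fp));

        if (fp != NULL)
                fp->refcount = 2;
        return fp;
}

static void fp_drop(struct file *fp)
{
        if (--fp->refcount == 0) {
                free(fp);
                files_freed++;
        }
}

static int fd_alloc(struct file *fp)
{
        for (int fd = 0; fd < NFD; fd++) {
                if (fdtable[fd] == NULL) {
                        fdtable[fd] = fp;
                        return fd;
                }
        }
        return -1;
}

/* Larval files are reported as if nothing was there (EBADF in the kernel). */
static struct file *fd_get(int fd)
{
        struct file *fp = fdtable[fd];

        if (fp == NULL || !fp->ready)
                return NULL;
        fp->refcount++;
        return fp;
}

static void fd_close(int fd)
{
        struct file *fp = fdtable[fd];

        fdtable[fd] = NULL;
        if (fp != NULL)
                fp_drop(fp);
}

static void fp_init(struct file *fp, int data)
{
        fp->data = data;
        fp->ready = 1;
}

/* Returns the number of files freed by the whole sequence. */
int larval_race_demo(void)
{
        int fd, before = files_freed;
        struct file *fp = fp_alloc();

        if (fp == NULL)
                return -1;
        fd = fd_alloc(fp);
        assert(fd_get(fd) == NULL);             /* larval: lookup fails */
        fd_close(fd);                           /* drops the table's reference */
        assert(files_freed == before);          /* fp is still valid for us */
        fp_init(fp, 42);                        /* initialises a doomed file */
        fp_drop(fp);                            /* the unconditional extra drop */
        return files_freed - before;
}
```

The file is freed exactly once, and only after the syscall is done with it.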

Don't like it? How about some additional functions:
fd_reserve - obtain a new slot in fd table
fd_install(fd, fp) - fill the slot with fp

And a syscall:
fp = fp_alloc();
fd = fd_reserve();
data = stuff();
fp_init(fp, data);
fd_install(fd, fp);
Concurrent fd_get? No problem. Slot reservation can be marked in a bitmap, the table still has NULL set as fp so no special handling is needed.

Concurrent fd_close? It's still NULL, so we can return EBADF as fd in question is not (yet) in use. Such a call from userspace was inherently racy, so there is no correctness issue. Again, no special cases needed.

Once more ignore function names and you roughly get what's done in Linux.

So does this just work? Of course not. Let's take a look at concurrent dup2 execution. dup2(x, y) is documented to close y if it happened to be in use.

What if the syscall in question got fd 8 from fd_reserve while some other thread does dup2(0, 8)? In the FreeBSD case there is no problem - fp is there, you can just close it. With reservations the kernel has to special-case this, and Linux resorts to returning EBUSY.
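The reservation scheme and its dup2 corner case fit in a similar toy model. Again this is a single-threaded illustration with made-up names, not Linux code; refcounts, locking and bounds checks are left out to keep the interesting part visible.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NFD 16

struct file {
        int data;
};

static struct file *fdtable[NFD];       /* NULL until fd_install() */
static unsigned int reserved;           /* bitmap of slots handed out */

static int fd_reserve(void)
{
        for (int fd = 0; fd < NFD; fd++) {
                if ((reserved & (1u << fd)) == 0) {
                        reserved |= 1u << fd;
                        return fd;
                }
        }
        return -EMFILE;
}

static void fd_install(int fd, struct file *fp)
{
        fdtable[fd] = fp;
}

/* No special casing: a reserved but empty slot just looks unused. */
static struct file *fd_get(int fd)
{
        return fdtable[fd];
}

static int fd_close(int fd)
{
        if (fdtable[fd] == NULL)
                return -EBADF;
        fdtable[fd] = NULL;
        reserved &= ~(1u << fd);
        return 0;
}

/* dup2-like: the one place which has to notice a reservation in flight. */
static int fd_dup2(int oldfd, int newfd)
{
        if (fdtable[oldfd] == NULL)
                return -EBADF;
        if (fdtable[newfd] == NULL && (reserved & (1u << newfd)) != 0)
                return -EBUSY;
        fdtable[newfd] = fdtable[oldfd];
        reserved |= 1u << newfd;
        return newfd;
}

/* Returns what a dup2 onto a reserved-but-not-installed slot gets. */
int reserve_race_demo(void)
{
        static struct file first_file, new_file;
        int fd, res;

        for (fd = 0; fd < NFD; fd++)            /* start from an empty table */
                fdtable[fd] = NULL;
        reserved = 0;

        fd_install(fd_reserve(), &first_file);  /* fd 0: fully set up */
        fd = fd_reserve();                      /* fd 1: syscall in progress */

        assert(fd_get(fd) == NULL);             /* concurrent lookup */
        assert(fd_close(fd) == -EBADF);         /* concurrent close */
        res = fd_dup2(0, fd);                   /* concurrent dup2 */

        new_file.data = 42;
        fd_install(fd, &new_file);              /* the syscall finishes */
        assert(fd_get(fd) == &new_file);
        return res;
}
```

The lookup and close paths stay trivial; only the dup2-like path needs to know that a slot can be taken without having a file in it yet.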

Bonus question for the reader: what about concurrent execution of fork and a syscall which installs a fd?

Which solution is better? Well, there are more considerations (including locking) which I may tackle in upcoming posts.

files removed on last close?

To quote unlink(2):
When the file's link count becomes 0 and no process has the file open, the space occupied by the file shall be freed and the file shall no longer be accessible. If one or more processes have the file open when the last link is removed, the link shall be removed before unlink() returns, but the removal of the file contents shall be postponed until all references to the file are closed.
If having the file open would be interpreted as having a file descriptor, this would match the most common understanding of the issue.

As usual, this is not entirely correct. As far as userspace holding off file removal for an extended period of time goes, the other thing to look at is memory mappings.

So in an effort to not make this post a three-liner + a stolen quote, let's have a look at relevant mappings in a simple program which just sleeps:
00400000-00401000 r-xp 00000000 fd:03 2100640                           /tmp/a.out
00600000-00601000 r--p 00000000 fd:03 2100640                            /tmp/a.out
00601000-00602000 rw-p 00001000 fd:03 2100640                            /tmp/a.out
3685e00000-3685e21000 r-xp 00000000 fd:02 133994                         /usr/lib64/
3686020000-3686021000 r--p 00020000 fd:02 133994                         /usr/lib64/
3686021000-3686022000 rw-p 00021000 fd:02 133994                         /usr/lib64/
Both of these files are in use, but surely the process does not have a file descriptor to either one:
lrwx------. 1 meh meh 64 Apr 13 20:58 0 -> /dev/pts/11
lrwx------. 1 meh meh 64 Apr 13 20:58 1 -> /dev/pts/11
lrwx------. 1 meh meh 64 Apr 13 20:58 2 -> /dev/pts/11
The library under /usr/lib64/ does not look like a good candidate for an unlink test, so we will focus on /tmp/a.out.
$ stat -L -c '%i' /proc/$(pgrep a.out)/exe
$ rm /tmp/a.out
$ stat -L -c '%i' /proc/$(pgrep a.out)/exe
Which I hope is sufficient for you to believe that the file was not removed just yet because of it being mapped.
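The same thing can be shown without /proc. A minimal sketch, with a made-up function name and scratch path: it writes a file, maps it, closes the only file descriptor, unlinks the name and then reads the content back through the mapping.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if the content of an unlinked, closed file is still readable
 * through a mapping, 0 if it is not, -1 on setup failure.
 */
int mapping_outlives_unlink(void)
{
        char path[] = "/tmp/unlink-demo-XXXXXX";
        const char msg[] = "still here";
        char *map;
        int fd, ok;

        fd = mkstemp(path);
        if (fd == -1)
                return -1;
        if (write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg)) {
                close(fd);
                unlink(path);
                return -1;
        }
        map = mmap(NULL, sizeof(msg), PROT_READ, MAP_SHARED, fd, 0);
        close(fd);              /* no file descriptor left... */
        unlink(path);           /* ...and no name either */
        if (map == MAP_FAILED)
                return -1;

        ok = (access(path, F_OK) == -1 &&
            memcmp(map, msg, sizeof(msg)) == 0);
        munmap(map, sizeof(msg));       /* only now can the space be freed */
        return ok;
}
```

The name is gone and there is no descriptor, yet the data is still there until the munmap.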

Saturday, April 11, 2015

Weird stuff: thread credentials in Linux

For quite some time now credentials have been more complicated than just a bunch of ids and as such are stored in a dedicated structure.

Credentials very rarely change (compared to how often they are read), so an approach which optimizes for reads is definitely in order.

First let's have a look at what happens in FreeBSD.

struct ucred contains a reference counter. In short it's a method of counting other structures which use this one [1]. New creds start with refcount of 1 and are freed when the counter reaches 0. The structure is copy-on-write - once initialised, it's never changed. This means that calls which change credentials (e.g. setuid(2)) always allocate new ones [2].

Processes are represented with struct proc, which has one or more struct thread linked in. Both structures contain their own pointer to creds and keep their own reference.

So when credentials are changed, new cred struct is allocated and proc's credential pointer is modified to point to it. Threads check they got current cred pointer as they cross kernel<->userspace boundary. If needed they reference new credentials and drop the reference on old ones.

In effect cred access for the executing thread (the common case) is very cheap -- the kernel can just read them without any kind of locking.

This leaves a window where a thread can enter the kernel while another thread has just changed credentials and, as a result, operate using stale creds until it leaves the kernel, which is fine (I may elaborate in another post). Actions concerning process<->process interaction are authorized against proc credentials.
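A toy model of the scheme, with the "stale creds" window replayed by hand. The field names p_ucred and td_ucred follow FreeBSD, but the helper functions are made up for this example and all locking is left out, so treat it as a sketch of the idea rather than the actual implementation.

```c
#include <assert.h>
#include <stdlib.h>

struct ucred {
        int refcount;
        int uid;        /* never modified once the struct is in use */
};

struct proc {
        struct ucred *p_ucred;
};

struct thread {
        struct proc *td_proc;
        struct ucred *td_ucred;
};

static struct ucred *cred_alloc(int uid)
{
        struct ucred *cr = malloc(sizeof(*cr));

        if (cr == NULL)
                abort();
        cr->refcount = 1;
        cr->uid = uid;
        return cr;
}

static struct ucred *cred_hold(struct ucred *cr)
{
        cr->refcount++;
        return cr;
}

static void cred_free(struct ucred *cr)
{
        if (--cr->refcount == 0)
                free(cr);
}

/* setuid-like: allocate new creds, swap the proc pointer, leave threads alone. */
static void proc_setuid(struct proc *p, int uid)
{
        struct ucred *old = p->p_ucred;

        p->p_ucred = cred_alloc(uid);
        cred_free(old);
}

/* What a thread does when crossing the kernel<->userspace boundary. */
static void thread_cred_update(struct thread *td)
{
        if (td->td_ucred != td->td_proc->p_ucred) {
                struct ucred *old = td->td_ucred;

                td->td_ucred = cred_hold(td->td_proc->p_ucred);
                cred_free(old);
        }
}

/* Returns 1 if the thread saw stale creds first and fresh ones afterwards. */
int stale_cred_demo(void)
{
        struct proc p = { cred_alloc(1000) };
        struct thread td = { &p, cred_hold(p.p_ucred) };
        int stale, fresh;

        proc_setuid(&p, 0);
        stale = td.td_ucred->uid;       /* old creds, kept alive by own ref */
        thread_cred_update(&td);
        fresh = td.td_ucred->uid;

        cred_free(td.td_ucred);
        cred_free(p.p_ucred);
        return stale == 1000 && fresh == 0;
}
```

Since the thread holds its own reference, reading td_ucred needs no lock: the structure it points to can neither change nor disappear under it.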

So, what happens in Linux?

Linux has a dedicated structure (struct cred) which also uses a copy-on-write scheme.

Big differences start with processes. These are represented with task_struct. Threads composing a process are just more task_struct's linked together into a thread group.

When credentials are changed in Linux, the kernel only deals with the calling thread. Other threads are not updated and are "unaware" of the event. That's the first weird part.

So how come a multithreaded process calling setuid ends up with consistent creds across its threads?

And the second: glibc makes all threads call setuid on their own (I did not check other libc implementations available for Linux, I presume they do the same.)

Now, it may be there are valid reasons to support per-thread credentials (serving files over the network?). But I would still expect dedicated syscalls (thread_setuid?) which just deal with a given thread and setuid etc. to deal with the entire process.

[1] strictly speaking not every structure storing a pointer to a refcounted structure must have its own reference. Dependencies between various structs can implicitly keep things stable.

[2] while typically true in practice, strictly speaking one could hack it up to e.g. lookup appropriate credentials and grab a reference on them

Friday, April 10, 2015

what is really shown by /proc/pid/environ

Have you ever inspected /proc/<pid>/environ and concluded it contains something which could not possibly be an environment? Or maybe it looked like environment, but could not possibly match what was used by the process?

Let's start with making it clear what the process environment is.

It's just a table with key=value strings passed around during execve(2).

Then the kernel puts it on the stack of the new process and stores the address for possible later use.

There is absolutely no magic involved. When you execute a process you can pass any environment you want.

When someone reads from /proc/<pid>/environ, the kernel grabs environment address it stored during execve and reads from target process' address space.

But is the environment really there? Well, possibly.

Userspace is free to move it wherever it wants, and sometimes it has to if the process adds more variables.

As such, if the content looks sane as an environment, you can be reasonably sure this is the environment the process started with. But based on this you cannot know what modifications (if any) were made.

If you really need to know the environment state, your situation is not that bad. POSIX defines 'environ' symbol which is supposed to always point to current environment, so interested parties can easily inspect it by e.g. attaching to the process with gdb.
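The divergence is easy to trigger on purpose. A minimal sketch, assuming Linux with /proc mounted and a libc whose setenv(3) stores new variables outside of the initial environment area (glibc does); the function and variable names are made up for this example.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns 1 if "name=" starts any NUL-separated entry of
 * /proc/self/environ, 0 if it does not, -1 if the file cannot be read.
 * Only the first 64KB are inspected, which is plenty for a demo.
 */
int proc_environ_has(const char *name)
{
        static char buf[1 << 16];
        size_t len, namelen = strlen(name);
        FILE *f = fopen("/proc/self/environ", "r");

        if (f == NULL)
                return -1;
        len = fread(buf, 1, sizeof(buf) - 1, f);
        fclose(f);
        buf[len] = '\0';

        for (size_t off = 0; off < len; off += strlen(buf + off) + 1) {
                if (strncmp(buf + off, name, namelen) == 0 &&
                    buf[off + namelen] == '=')
                        return 1;
        }
        return 0;
}

/* Returns 1 when the process sees the variable but /proc does not. */
int environ_diverges(void)
{
        const char *name = "PROC_ENVIRON_DEMO_VAR";

        if (setenv(name, "1", 1) != 0)
                return -1;
        return getenv(name) != NULL && proc_environ_has(name) == 0;
}
```

getenv finds the variable through 'environ', while the kernel keeps reading the area it set up during execve.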

zombie processes

From time to time I encounter people who spot accumulating zombie processes and proceed to kill -9 them, which of course does not change anything.

Each process, apart from init (pid 1) has a parent process. Children (if any) of an exiting process are reparented to init. Once a process exits it becomes a zombie, waiting for its parent to wait(2) for it [1]. Init reaps reparented processes automatically.
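A minimal sketch of the life cycle described above, with a made-up function name. It uses waitid(2) with WNOWAIT to learn that the child has exited without reaping it, so the zombie stays around for inspection. It assumes SIGCHLD is not being ignored by the caller.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a zombie survives SIGKILL and only goes away once the
 * parent waits for it, 0 otherwise, -1 on setup failure.
 */
int zombie_survives_kill(void)
{
        siginfo_t info;
        pid_t pid = fork();
        int survived, gone;

        if (pid == -1)
                return -1;
        if (pid == 0)
                _exit(0);

        /* Wait for the exit, but leave the child unreaped (WNOWAIT). */
        if (waitid(P_PID, pid, &info, WEXITED | WNOWAIT) != 0)
                return -1;

        kill(pid, SIGKILL);                     /* nothing left to kill */
        survived = (kill(pid, 0) == 0);         /* the zombie still exists */

        if (waitpid(pid, NULL, 0) != pid)       /* the actual cure */
                return -1;
        gone = (kill(pid, 0) == -1 && errno == ESRCH);
        return survived && gone;
}
```

The signal is accepted without an error, it just has nothing to act on. Only the wait makes the entry disappear.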

This means that persisting zombie processes are an indication of an actual problem.

What you can do from sysadmin perspective: first and foremost inspect the parent. If the process is happily executing code in userspace, there is very likely a bug in the application. Nothing you can do about it, contact your developers. If you just kill the process, zombies will be gone, along with debug info your developers would want to see. Obtaining a coredump still may not be sufficient, it's best to let them investigate live system if possible.

I was told about a "solution" which actually was reaping unwanted zombies even if the parent did not do it. As suspected it works as follows: it attaches to parent process with ptrace(2) and injects a wait4(2) call. This risks further damaging the state of problematic process -- for instance it can have a table with relevant children information and now one of these children is gone. What happens if the child pid gets reused afterwards?

But let's say the process in question is not running circles in userspace. It can be blocked in the kernel in a relatively self-explanatory place.

If the backtrace does not point out an obvious culprit, C language experience will be required. That paired with no fear of the kernel can bring you a long way, so it's worth giving it a shot if only for fun.

Let's evaluate some kernel stacktraces (/proc/<pid>/stack).
[<ffffffffa01b2dc9>] rpc_wait_bit_killable+0x39/0xa0 [sunrpc]
[<ffffffffa01b70b2>] __rpc_execute+0x202/0x750 [sunrpc]
[<ffffffffa01b7bc9>] rpc_execute+0x89/0x240 [sunrpc]
[<ffffffffa01a8d00>] rpc_run_task+0x70/0x90 [sunrpc]
[<ffffffffa01a8d70>] rpc_call_sync+0x50/0xc0 [sunrpc]
[<ffffffffa0485616>] nfs3_rpc_wrapper.constprop.11+0x86/0xd0 [nfsv3]
[<ffffffffa04859d4>] nfs3_proc_access+0xc4/0x1a0 [nfsv3]
[<ffffffffa0429229>] nfs_do_access+0x3c9/0x850 [nfs]
[<ffffffffa04298a1>] nfs_permission+0x1c1/0x2b0 [nfs]
[<ffffffff8126c642>] __inode_permission+0x72/0xd0
[<ffffffff8126c6b8>] inode_permission+0x18/0x50
[<ffffffff8126f306>] link_path_walk+0x266/0x860
[<ffffffff8126f9bc>] path_init+0xbc/0x840
[<ffffffff81272325>] path_openat+0x75/0x620
[<ffffffff81273ec9>] do_filp_open+0x49/0xc0
[<ffffffff8125fa5d>] do_sys_open+0x13d/0x230
[<ffffffff8125fb6e>] SyS_open+0x1e/0x20
[<ffffffff817ec22e>] system_call_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff
Here we see that the process in question tried to open a file and the path lookup led it to an nfs share, at which point it blocked. While this does not tell you which share is causing trouble [2], you can narrow it down no problem.

[<ffffffff81263405>] __sb_start_write+0x195/0x1f0
[<ffffffff81286a24>] mnt_want_write+0x24/0x50
[<ffffffff81271adb>] do_last+0xbeb/0x13c0
[<ffffffff81272346>] path_openat+0x96/0x620
[<ffffffff81273ec9>] do_filp_open+0x49/0xc0
[<ffffffff8125fa5d>] do_sys_open+0x13d/0x230
[<ffffffff8125fb6e>] SyS_open+0x1e/0x20
[<ffffffff817ec22e>] system_call_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff 
Again a file lookup, but this time we got no indication where [2].
Just like with nfs you can just spawn a new shell and probe stuff from that level. But what is it? For those who don't want to do a little digging: sfserrmr be fbzrguvat ryfr hfvat guvf shapvbanyvgl

Bottom line is, if you see accumulating zombie processes, don't try to kill them. Inspect your system instead and possibly report a bug to your developers. Chances are you will solve the actual problem on your own.

[1] There are ways in which process can make its children reaped automatically without waiting, also these days situation on Linux and FreeBSD is less trivial -- there are ways to reap foreign children, e.g. with process descriptors or prctl(2) (see PR_SET_CHILD_SUBREAPER)

[2] Unfortunately Linux does not provide any nice way to obtain such information at the moment. I have some ideas how to change it which I may describe in future posts. Interested parties can obtain this information in a way which may seem scary at first, but is quite trivial and largely safe & accurate - you can run the debugger on live kernel (named 'crash'), dump mount list (mount), dump the stack of problematic thread (bt -f) and eye-grep. Of course a proper way would require disassembling some code to be sure what is what on the stack.

Wednesday, April 8, 2015

What is really included in load average on Linux?

Everyone "knows" that load average = number of runnable processes + processes blocked on I/O. While this may be true enough for a lot of use cases, it is incorrect.

The purpose of this article is to note briefly what is really counted, not to enumerate all possibilities.

First a short note that the kernel counts threads, not processes.

With this out of the way let's take a look at a relevant comment (source):
 * Once every LOAD_FREQ:
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *      nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;

An alert reader may note that there are lies, damned lies, statistics and comments in the code. I have to agree, thus this requires validation.

A quick eye-grep reveals:

long calc_load_fold_active(struct rq *this_rq)
{
        long nr_active, delta = 0;

        nr_active = this_rq->nr_running;
        nr_active += (long) this_rq->nr_uninterruptible;

        if (nr_active != this_rq->calc_load_active) {
                delta = nr_active - this_rq->calc_load_active;
                this_rq->calc_load_active = nr_active;
        }

        return delta;
}
While not strictly sufficient, it's fine enough for this article.

So we know "threads blocked on I/O" is not the criterion here, but threads which contribute to nr_uninterruptible counter.

nr_uninterruptible represents threads in TASK_UNINTERRUPTIBLE state (which are not frozen, but what it means is beyond the scope of this article).

When can this happen?
  • while waiting for event completion (also used when dealing with I/O)
  • while trying to acquire a sleepable locking primitive such as a semaphore
The significance of this is that when a server with abnormally high load (say > 1k on a 64-way machine) is encountered, people tend to think I/O is at fault (e.g. a dead nfs server), which may very easily be false. For instance one thread could take a semaphore for writing and block for some reason, with a lot of other threads then tripping over it while trying to take it for reading.

Tuesday, April 7, 2015

nofile, ulimit -n, RLIMIT_NOFILE -- the most misunderstood resource limit

Have you ever seen "VFS: file-max limit XXX reached" and proceeded to hunt for file descriptor consumers? Does this prompt you to lsof -u joe | wc -l to find out how many file descriptors joe uses? If so, this post is for you.

The aforementioned message is not about file descriptors, and lsof -u joe shows way more than just file descriptors anyway.

So what is limited by RLIMIT_NOFILE?

One greater than the biggest number that can be assigned to a file descriptor in the given process.

I repeat: it bounds the number that can be assigned to a file descriptor, not how many of them there are. Of course only from the moment the limit is applied. This has a side effect of limiting the number of file descriptors a process can open from this point on.

1. A process sets RLIMIT_NOFILE to 20 on its own. How many file descriptors can it have opened?

Impossible to tell. No new file descriptor will be bigger than 19, but there may be a huge number of already opened file descriptors with higher numbers.

2. There is only one process owned by joe. It has the following file descriptors opened: 0, 1, 2. It sets its own RLIMIT_NOFILE to 20 and creates a new process. How many file descriptors can be opened in each of them?

The nofile limit is per process, thus the fact that one of these processes created the other one is irrelevant. Either can open 17 more file descriptors (3 through 19).
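Both answers can be checked with a short experiment. The sketch below (function names made up for this example) runs in a forked child so that the caller's limits and descriptors are left alone. It assumes the initial soft limit is above 30, which is the case on typical systems. getrlimit(2) documents the limit as one greater than the maximum file descriptor number, hence fd 20 being out of range with a limit of 20.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Result bits: 1 = fd 30 outlived lowering the limit to 20,
 *              2 = dup2() to fd 20 failed with EBADF,
 *              4 = dup2() to fd 19 worked.
 * 64 signals a setup failure.
 */
static int child_body(void)
{
        struct rlimit lim;
        int src, result = 0;

        src = open("/dev/null", O_RDONLY);
        if (src == -1 || getrlimit(RLIMIT_NOFILE, &lim) != 0)
                return 64;
        if (dup2(src, 30) != 30)        /* taken while the limit is still high */
                return 64;
        lim.rlim_cur = 20;
        if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
                return 64;

        if (fcntl(30, F_GETFD) != -1)
                result |= 1;
        if (dup2(src, 20) == -1 && errno == EBADF)
                result |= 2;
        if (dup2(src, 19) == 19)
                result |= 4;
        return result;
}

/* Returns the child's result bits, or -1 if the child could not be run. */
int rlimit_nofile_demo(void)
{
        int status;
        pid_t pid = fork();

        if (pid == -1)
                return -1;
        if (pid == 0)
                _exit(child_body());
        if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
                return -1;
        return WEXITSTATUS(status);
}
```

A result of 7 means all three observations held: the high descriptor survived, the limit itself is not a usable number, and the number right below it is.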

You may have encountered the following:
VFS: file-max limit $BIGNUM reached

What's the relationship between file descriptors and 'file' (struct file) limit?

So what is struct file? It is an object which contains some state related to an open entity like an on-disk file, pipe etc.

struct file may be used internally by the kernel and not be associated with any file descriptor.
Each file descriptor has to be associated with exactly one struct file.
Each struct file can have any number of associated file descriptors.
On clone() file descriptors are copied, i.e. they reference the same 'struct file' their counterparts in the parent process do.

Opening a file typically boils down to the following:
lookup the file
allocate new file descriptor
allocate struct file
tie up an inode with struct file
set file descriptor to 'point' to struct file

3. Process has the following file descriptors open: 0, 1, 2. Now it calls clone(). How many new 'struct file' are allocated in order to satisfy this request?

None. File descriptors 0, 1, 2 in the new process use the same struct file objects as 0, 1, 2 in the parent.
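One way to observe the sharing from userspace is the file offset, which is kept in struct file rather than in the descriptor. The sketch below is an illustration under assumptions: the function name is hypothetical and fork() stands in for a bare clone() for brevity. The child moves the offset and the parent, which never touched the file, sees the new value.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Opens path, forks, lets the child move the file offset to 42 and
 * returns the offset the parent sees afterwards (-1 on error).
 */
off_t offset_seen_by_parent(const char *path)
{
    int fd, status;
    pid_t pid;
    off_t pos;

    fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    pid = fork();
    if (pid == -1) {
        close(fd);
        return -1;
    }
    if (pid == 0) {
        /* Child: same descriptor number, same struct file. */
        lseek(fd, 42, SEEK_SET);
        _exit(0);
    }

    if (waitpid(pid, &status, 0) == -1) {
        close(fd);
        return -1;
    }
    /* The offset lives in struct file, so the change is visible here. */
    pos = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return pos;
}
```

Called on any regular file, e.g. offset_seen_by_parent("/etc/passwd"), it should report the offset the child set.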

4. Process has the following file descriptors open: 0, 1, 2. Now it exits. How many 'struct file' will be freed as a result?

Impossible to tell. First, it is possible that all file descriptors were associated with the same 'struct file'. Not only that, these file descriptors could be inherited from parent process which is still alive and didn't modify its descriptors. As such, it is possible that struct file(s) in question are still in use.

5. No process has /etc/passwd open. Now one process opens it 3 times. How many 'struct file' were allocated as a result?

Three, one for each open request.
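The difference between opening a file again and duplicating a descriptor can be sketched the same way, using the file offset as a marker for which struct file a descriptor points to. The function name is made up for illustration and the path is assumed to be a regular, readable file.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Returns 1 if two separate open() calls keep independent offsets while
 * a dup()ed descriptor shares its offset with the original, 0 if not,
 * -1 on error.
 */
int separate_opens_are_independent(const char *path)
{
    int a, b, c, result = -1;

    a = open(path, O_RDONLY);       /* allocates one struct file */
    b = open(path, O_RDONLY);       /* allocates another one */
    c = (a == -1) ? -1 : dup(a);    /* new descriptor, same struct file as a */

    if (a != -1 && b != -1 && c != -1 &&
        lseek(a, 10, SEEK_SET) == 10) {
        /* b has its own offset, c shares the one just moved through a. */
        result = lseek(b, 0, SEEK_CUR) == 0 &&
                 lseek(c, 0, SEEK_CUR) == 10;
    }

    if (a != -1)
        close(a);
    if (b != -1)
        close(b);
    if (c != -1)
        close(c);
    return result;
}
```

Three descriptors end up open here, but only two struct file objects are behind them.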

With that established let's take a look at related errors (man errno):

ENFILE          Too many open files in system (POSIX.1)
EMFILE          Too many open files (POSIX.1)

The first one signals that the kernel ran out of 'struct file', the other one that the given process cannot have more file descriptors.
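EMFILE is easy to provoke on purpose. A rough sketch, assuming Linux and a hard limit of at least 16 (the function name and SMALL_LIMIT are hypothetical): lower the soft nofile limit, keep opening /dev/null until open() fails, and report the errno. ENFILE cannot be reproduced this conveniently since it depends on the state of the whole system.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/resource.h>
#include <unistd.h>

#define SMALL_LIMIT 16

/*
 * Lowers the soft RLIMIT_NOFILE to SMALL_LIMIT, opens /dev/null until
 * open() fails and returns the errno of that failure (-1 on setup error).
 */
int errno_at_fd_limit(void)
{
    struct rlimit old, lim;
    int fds[SMALL_LIMIT + 1];
    int n = 0, err = 0, fd;

    if (getrlimit(RLIMIT_NOFILE, &old) == -1)
        return -1;
    lim = old;
    lim.rlim_cur = SMALL_LIMIT;
    if (setrlimit(RLIMIT_NOFILE, &lim) == -1)
        return -1;

    /* At most SMALL_LIMIT opens can succeed, so one more must fail. */
    while (n < SMALL_LIMIT + 1) {
        fd = open("/dev/null", O_RDONLY);
        if (fd == -1) {
            err = errno;
            break;
        }
        fds[n++] = fd;
    }

    while (n > 0)
        close(fds[--n]);
    setrlimit(RLIMIT_NOFILE, &old);
    return err;
}
```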

When the kernel prints "VFS: file-max limit XXX reached" it says it won't allocate any new struct file.

6. Let's assume the kernel reached the limit of 'struct file'. Now joe's process tried to obtain a new file descriptor. Can this operation succeed? Which error is returned on failure?

If the new file descriptor would have new 'struct file', the error would be ENFILE.
But it may be that this file descriptor is going to reuse an already existing 'struct file', in which case it does not matter that the kernel hit the limit. It can fail with EMFILE, or it can succeed, depending on rlimits.

lsof | wc -l vs number of open descriptors

Apart from file descriptors, lsof shows other stuff (e.g. in-memory file mappings, current working directory). As such, output from mere 'lsof' invocations cannot be used to check file descriptors.

7. An administrator does `lsof -p 8888 | wc -l` and receives 10000. How many file descriptors are in use by this process?

As noted earlier, impossible to tell due to other fields printed by lsof.

The current number of file descriptors open in a given process can be obtained by counting symlinks in /proc/<pid>/fd.
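The same count can be taken from a program. A minimal sketch, assuming Linux with /proc mounted; count_open_fds is a hypothetical helper, not an existing API. When a process inspects itself, the descriptor opendir() uses for the listing shows up in the directory as well, so it is skipped.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Counts descriptors open in process pid by listing /proc/<pid>/fd.
 * Returns -1 if the directory cannot be read.
 */
int count_open_fds(pid_t pid)
{
    char path[64];
    DIR *dir;
    struct dirent *ent;
    int count = 0;

    snprintf(path, sizeof(path), "/proc/%ld/fd", (long)pid);
    dir = opendir(path);
    if (dir == NULL)
        return -1;

    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;   /* skip "." and ".." */
        /* When inspecting ourselves, skip the descriptor opendir() holds. */
        if (pid == getpid() && atoi(ent->d_name) == dirfd(dir))
            continue;
        count++;
    }
    closedir(dir);
    return count;
}
```

Reading /proc/<pid>/fd of a process owned by somebody else requires appropriate privileges, in which case the helper returns -1.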

8. `ls /proc/<pid>/fd | wc -l`  returns 9000. How many 'struct file' are in use by this process?

Anything between 1 and 9000 (including both).

9. We get the same result as in the previous question. What can you say about the 'nofile' rlimit set on this process?

Nothing. Not only do we not know the biggest open fd; even if we did, that fd could have been opened before the limit was applied.

How traditional resource limits are handled

For the purpose of this document we will define the following:
- resource limits as mentioned earlier will be referred to as rlimits
- we will have an unprivileged user joe

You most likely have set rlimits at some point, either by editing
/etc/security/limits.conf (or some other file), playing with ulimit etc.

But how and when are such limits meaningful?

You may have heard about RLIMIT_NPROC, which is supposed to limit the number of processes a given user is allowed to have. Yet, it is possible you will configure this limit and the user in question will have twice as many processes. Not only that, he may still be able to spawn more. What is going on here?

The statements which follow are true enough and sufficient to understand the points I'm trying to make.

Process actions are subject to rlimits.

A resource limit is a property of a process, not a user.

Processes are owned by <uid>:<gid>, which are just some numerical values.
If there is a resource limit covering more than one process, it does so by looking at the uid.

Creation of a new process is accomplished with the clone() system call. Provided
with appropriate arguments it will create a copy of the executing process.
There are some differences (pid, parent pid) but pretty much all the rest is
identical. This includes rlimits.
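That rlimits travel with the copy can be checked directly. A small sketch under assumptions: the function name is hypothetical, fork() is used in place of a bare clone(), and RLIMIT_NOFILE is picked only because an unprivileged process can lower its soft value freely. The child never sets any limit, yet it sees the value the parent chose.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Sets the soft RLIMIT_NOFILE to soft, forks and returns 1 if the child
 * sees the very same soft limit, 0 if it does not, -1 on error.
 */
int child_inherits_nofile(rlim_t soft)
{
    struct rlimit old, lim, seen;
    int status, result = -1;
    pid_t pid;

    if (getrlimit(RLIMIT_NOFILE, &old) == -1)
        return -1;
    lim = old;
    lim.rlim_cur = soft;
    if (setrlimit(RLIMIT_NOFILE, &lim) == -1)
        return -1;

    pid = fork();
    if (pid == 0) {
        /* Child: nobody set a limit here, it came along with the copy. */
        if (getrlimit(RLIMIT_NOFILE, &seen) == -1)
            _exit(2);
        _exit(seen.rlim_cur == soft ? 0 : 1);
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
        if (WEXITSTATUS(status) == 0)
            result = 1;
        else if (WEXITSTATUS(status) == 1)
            result = 0;
    }

    /* Restore the parent's original limit. */
    setrlimit(RLIMIT_NOFILE, &old);
    return result;
}
```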

Now let's say we have our custom program running as root and we want to extend it so that it runs stuff as an unprivileged user.

In order to have a process running with a given uid:gid, we need to use the setuid() and setgid() system calls, whose names are self-explanatory (strictly speaking one can also run a suid/sgid binary, but that's irrelevant to the subject in question).

Let me reiterate: creating a process owned by a given user looks as follows:

clone()
setgid()
setuid()

So... which one of these applies security limits as defined in /etc/security/limits.conf?

That's right, NONE.

This new process owned by joe has the same rlimits its parent does.

Rlimits from limits.conf etc. can be applied by additional code (typically a PAM module).
The point is, this has to be done separately and prior to {u,g}id change.
sshd, su and so on do this stuff.
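What such additional code boils down to can be sketched as follows. This is an illustration under assumptions, not how sshd, su or any PAM module is actually written: run_as is a hypothetical helper, the limit value is passed in by the caller instead of being parsed from limits.conf, and a real implementation would also reset supplementary groups (setgroups() or initgroups()), which is left out here.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Runs argv as uid:gid with RLIMIT_NPROC (soft and hard) set to nproc.
 * Returns the exit status of the command, or -1 on error.
 */
int run_as(uid_t uid, gid_t gid, rlim_t nproc, char *const argv[])
{
    struct rlimit lim;
    int status;
    pid_t pid;

    pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        /* 1. Apply the limit while we still have the privileges to do so. */
        lim.rlim_cur = nproc;
        lim.rlim_max = nproc;
        if (setrlimit(RLIMIT_NPROC, &lim) == -1)
            _exit(126);
        /* 2. Change the group first; after setuid() it may not be allowed. */
        if (setgid(gid) == -1 || setuid(uid) == -1)
            _exit(126);
        /* 3. Become the target program; the limit stays in place. */
        execvp(argv[0], argv);
        _exit(127);
    }

    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Leave step 1 out and the command still runs as the requested user, just with whatever limits the parent happened to have.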

With that in mind let's consider some problems.

Let's assume joe has 80 running processes and all of them have the nproc limit set to 100.

1. An administrator modified limits.conf:
joe hard nproc  50

1.1 Is it possible for joe to spawn a new process?

Yes, currently running processes have 100 as their limit and only 80 are running.
Thus rlimits are not in the way here.

1.2 Is it possible for joe to log in over ssh?

No. The would-be process of joe would have the limit set to 50, but there are
already 80 processes running, thus it would error out.

1.3 A custom daemon is running, it spawns 200 processes owned by joe, but does not apply his resource limits. Consider processes from 1.1. Can any of them clone()?

No. There are 280 processes owned by joe running, but the ones from 1.1 have the limit set to 100.