Saturday, November 21, 2015

kernel game #1

Syscalls accepting file descriptors fit the following scheme:

int
meh_syscall(int fd)
{
        struct file *fp;
        int error;

        fp = getfile(fd);
        if (fp == NULL)
                return (EBADF);
        error = do_meh(fp);
        putfile(fp);
        return (error);
}


That is, the passed fd is used to obtain a pointer to a struct file, which is then used to perform the actual operation.

getfile increments the reference counter on the struct file, while putfile decrements it. If the new value is 0, nobody else is using the file and it can be freed.

This is important if there are multiple threads in the process. If the counter were not maintained and one thread closed the file while another had just obtained the pointer, the file could be freed while still in use.

However, if there is only one thread, what's the point of maintaining the counter? There is nobody to close the file from under us.

This optimisation is in fact in use on Linux. But their equivalent of getfile also returns information on whether a reference was obtained, which putfile then uses to check whether it has to drop it.

Why do they bother with that?

What would be wrong with the following approach: check if there is only 1 thread. If so, nobody can close the file from under us, therefore there is no need to take the reference (also, this syscall does not create new threads). Then, after do_meh, we check the thread count again to see if we have to putfile. In other words:


int
meh_syscall(int fd)
{
        struct file *fp;
        int error;

        if (curproc->threads > 1)
                fp = getfile(fd);
        else
                /*
                 * just get the pointer,
                 * do not modify the reference
                 * counter
                 */
                fp = getfile_noref(fd);
        if (fp == NULL)
                return (EBADF);
        error = do_meh(fp);
        if (curproc->threads > 1)
                putfile(fp);
        return (error);
}

So, what's the bug?

Tuesday, November 3, 2015

a primitive to read data from userspace

As was outlined previously, a special primitive is needed to access userspace data safely. There are several highly specialized variants in both the Linux and FreeBSD kernels, but they all work on the same principle. The example below is taken from FreeBSD, since the Linux equivalent is way more convoluted.

Let's reiterate, consider:
int val;

val = *some_userspace_pointer;
printf("%d\n", val);

If some_userspace_pointer e.g. contains garbage, a page fault is going to occur. The page fault handler will conclude the fault cannot be satisfied. But there is no way to tell this code about the issue - it only reads the value and assumes it succeeded.

What's needed is a function which will be able to actually detect the condition and return an error to the caller. With such a primitive in place the code becomes:
int val, error;
error = copyin(some_userspace_pointer, &val, sizeof(val));
if (error != 0)
        return error;
printf("%d\n", val);
A super slow variant would lock the address space, ensure the relevant mappings are fine and only then do the read. That's a lot of work, completely unnecessary in the common case.

Instead, the standard approach is to have a way to tell the page fault handler where to jump if the fault cannot be serviced. That place is supposed to clean up after the failed copy and return to the original caller.

In pseudo-code it would look like this:
int
copyin(void *from, void *to, size_t len)
{
       
        set_fault_handler(copyin_fault);
        if (len == 0)
                goto done_copyin;
        if (!fits_userspace(from, len))
                goto copyin_fault;
        memcpy(to, from, len);
done_copyin:
        set_fault_handler(0);
        return 0;
copyin_fault:
        set_fault_handler(0);
        return EFAULT;
}    

Let's take a look at an actual implementation with straightforward assembly (copyin(9) from the FreeBSD tree):
/*
 * copyin(from_user, to_kernel, len) - MP SAFE
 *        %rdi,      %rsi,      %rdx
 */
ENTRY(copyin)
        PUSH_FRAME_POINTER
        movq    PCPU(CURPCB),%rax
        movq    $copyin_fault,PCB_ONFAULT(%rax)

The handler is first set...
        testq   %rdx,%rdx                       /* anything to do? */
        jz      done_copyin

        /*
         * make sure address is valid
         */
        movq    %rdi,%rax
        addq    %rdx,%rax
        jc      copyin_fault
        movq    $VM_MAXUSER_ADDRESS,%rcx
        cmpq    %rcx,%rax
        ja      copyin_fault

... the range is then validated ...

        xchgq   %rdi,%rsi
        movq    %rdx,%rcx
        movb    %cl,%al
        shrq    $3,%rcx                         /* copy longword-wise */
        cld
        rep
        movsq
        movb    %al,%cl
        andb    $7,%cl                          /* copy remaining bytes */
        rep
        movsb
...and finally the copy is actually done. In the event of a page fault which cannot be satisfied, the kernel will jump to the copyin_fault label, which unsets the handler and returns an error, effectively cleaning up after the function. The target buffer may contain partially copied data at that point, but that's an acceptable state - if the syscall failed, the buffer content is unspecified. Finally, if every page fault could be serviced without an issue (e.g. a page was swapped in) or there were no page faults at all, the copy finishes and the code falls through to unset the handler and return 0.

done_copyin:
        xorl    %eax,%eax
        movq    PCPU(CURPCB),%rdx
        movq    %rax,PCB_ONFAULT(%rdx)
        POP_FRAME_POINTER
        ret

        ALIGN_TEXT
copyin_fault:
        movq    PCPU(CURPCB),%rdx
        movq    $0,PCB_ONFAULT(%rdx)
        movq    $EFAULT,%rax
        POP_FRAME_POINTER
        ret
END(copyin)

Monday, November 2, 2015

the kernel vs userspace arguments

Plenty of syscalls (e.g. open(2)) write to or read from userspace memory using dedicated primitives and maintain a local copy. Why not just deal with it like with regular kernel memory? As outlined in one of the previous posts, mere access should work.

The passed address may belong to kernelspace, so it has to be validated. But let's say we already did that.

Consider a toy syscall:

int
sys_meh(const char *name, int value)
{
        if (!is_root()) {
                if (strcmp(name, "special") == 0)
                        return -EPERM;
        }

        spin_lock(&meh_lock);
        meh_modify(name, value);
        spin_unlock(&meh_lock);
        return 0;
}

Here we accept a name and a value, but only root is allowed to modify the object identified as special.

The first access is the strcmp call. What if the passed address is garbage? The read will trigger a page fault and, with no way to communicate the problem to strcmp, the kernel is forced to oops/panic.

So let's say the address is not garbage.

The name is read twice: by sys_meh itself and later by meh_modify. In other words, the code relies on the value not changing. Is that expectation met? No. For instance, a second thread can modify the string after strcmp is done, but before meh_modify is called. This would in effect circumvent the protection we had in place.

Here the situation is even worse. By the time the code reaches meh_modify, the kernel could have decided to evict the page backing the string. On access, a page fault will occur and the kernel will try to bring the page back in. But the code took a spinlock, which means servicing a page fault is illegal due to deadlock potential.

In situations like this the standard way is to store relevant data in a temporary buffer.
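A minimal sketch of sys_meh reworked this way, assuming a Linux-style copy_from_user (which returns the number of bytes it failed to copy) and a made-up NAME_MAX_MEH bound; real code would likely use strncpy_from_user for strings, since unconditionally copying a fixed size can fault if the string ends close to a mapping boundary:

#define NAME_MAX_MEH 32         /* hypothetical limit on the name length */

int
sys_meh(const char *uname, int value)
{
        char name[NAME_MAX_MEH];

        /* snapshot the string; userspace can no longer change it under us */
        if (copy_from_user(name, uname, sizeof(name)) != 0)
                return -EFAULT;
        name[sizeof(name) - 1] = '\0';

        if (!is_root()) {
                if (strcmp(name, "special") == 0)
                        return -EPERM;
        }

        spin_lock(&meh_lock);
        meh_modify(name, value);        /* operates on the kernel copy */
        spin_unlock(&meh_lock);
        return 0;
}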

The double fetch seen in sys_meh caused serious trouble when various security-oriented syscall wrappers were implemented. For instance, code trying to restrict file access by monitoring filenames had the exact same bug (but it could also be circumvented in a myriad of other ways, including symlinks). Interested parties are invited to read Exploiting Concurrency Vulnerabilities in System Call Wrappers.

the kernel vs NULL pointer dereference

FreeBSD, Linux and plenty of other kernels deny userspace requests to mmap pages at address 0 as a rudimentary hardening measure. This guarantees the kernel catches its own "NULL pointer dereferences" and in turn lets it panic/oops. This turns a guaranteed privilege escalation vector into a local denial of service.

Let's see what exactly is going on here.

We will focus on amd64, but conceptually this is also true for i386 and likely several other architectures which share the address space between the kernel and userspace. If the spaces are disjoint, the description below does not apply.

The address space looks roughly like this:

+---------------+ 0xffffffffffffffff
| the kernel    |
|(in some areas)|
+---------------+ 0xffff800000000000
|address space  |
|    hole       |
+---------------+ 0x0000800000000000
| userspace     |
|               |
+---------------+
0x0000000000000000

The hole covers addresses which cannot be accessed on this architecture. The outlined userspace and kernel placement is the de facto standard. Note that both are mapped in the same address space. The two could be put in separate address spaces in principle, but are not, for performance reasons.

These addresses are virtual. Actual physical memory pages may or may not be backing them. The size of a page varies; it can be 4KB, 2MB or 1GB.

Let's say an address 0xc0ffee belongs to an area mmapped with read and write permissions, and backed by a physical page at this very moment. When a thread enters the kernel (to e.g. execute a system call), the in-kernel code will be able to read and write said memory without any special measures.

Userspace can request arbitrary addresses with calls to mmap(2). Normally the kernel will pick whatever address it wants, but this can be changed by passing the MAP_FIXED flag. As such, userspace can request to map a page at address 0.

Without going into pedantry, void *p = NULL; means that p consists of zeroes - NULL is address 0.

To sum this up:

struct meh *p;  /* any struct with a val member */

p = mmap(NULL, 4096, PROT_WRITE|PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
p->val = 8;

Provided the kernel grants the request, this will effectively dereference a NULL pointer.

But most importantly, should the kernel try to access such an address itself, it will now succeed. Why would it do that? Of course due to a bug. NULL is often the default value of pointers in various structures, so e.g. code which forgets to NULL-check a field which can legitimately be NULL would be susceptible. There are plenty of real-world bugs which manifest themselves like this.

How can this be used to escalate privileges? It depends on the bug; let's take the most blatant issue: a function pointer is NULL, but the code calls it. Userspace could mmap the page (with execute permissions) at 0 and fill it with whatever code it wants. When the bug is hit, the kernel unknowingly starts executing the code planted by userspace.
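A hypothetical sketch of the userspace side (shellcode, shellcode_len and trigger_kernel_bug are stand-ins for bug-specific details):

#include <string.h>
#include <sys/mman.h>

void
plant_and_trigger(void)
{
        void *p;

        /* map the zero page executable and plant our code there */
        p = mmap(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC,
            MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
        if (p == MAP_FAILED)
                return;
        memcpy(p, shellcode, shellcode_len);    /* bug-specific payload */
        trigger_kernel_bug();   /* the kernel calls the NULL function pointer */
}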

A real-world instance: CVE-2009-2692, a Linux NULL pointer dereference due to incorrect proto_ops initializations.

While mappings at 0 are denied, there are other, less frequent bugs which can result in the kernel unknowingly accessing userspace memory. A general solution with dedicated CPU support consists of SMAP (Supervisor Mode Access Prevention) and SMEP (Supervisor Mode Execution Prevention), but note these technologies are relatively new (read: your machines likely don't have them). A software-based implementation was provided earlier by grsecurity.

Friday, October 30, 2015

when strace fails to obtain syscall information

strace(1) (or truss(1) on BSDs) is a system call tracer. You may have seen it successfully report threads waiting in the kernel for various operations (e.g. open). Yet sometimes you attach to the target process and don't get any output. The boring answer is that no threads in the process are executing any syscalls and as a result there is nothing to report. But what if we can tell for sure that at least one thread is executing a syscall, or at least called one and is now blocked?

Let's see how strace works in the first place. The kernel provides a special interface: ptrace(2). It can be used to observe various actions of the target process and interact with it. In particular, it can be told to stop the target process on syscall entry and exit. Once it is stopped, the tracer can read the state and determine what syscall is being called and what arguments were provided. The key here is that the target process has to reach this code.

So how does strace manage to properly report a thread waiting for open? [1] A thread in such a state is in an interruptible sleep. It is woken up, goes all the way back to the kernel<->userspace boundary where it executes the ptrace bits, and proceeds to re-execute the syscall.

For which threads will strace fail to obtain syscall information? Definitely ones blocked in an uninterruptible sleep, as they cannot be woken up like that and in effect can't go back to let the tracer do its thing. The other possibility is a thread actively executing code in the kernel - it does not sleep and there is no mechanism to tell it to go back to the boundary.

What to do for such threads? In most (not all!) cases it is possible to read kernel backtrace (/proc/<tid>/stack) and try to work out stuff from there.

As a final remark, not all threads entering the kernel are executing syscalls. Typical examples are a page fault or a floating point exception, neither of which is reported by strace.

[1] Of course there is no guarantee that all open operations will be interruptible, but a popular example of waiting for the writer when opening a fifo is.

Sunday, June 21, 2015

signal delivery vs threads in the kernel

The following blogpost is the first step in explaining why unkillable processes exist.

Processes can install handlers (custom functions) for signals. When a thread is executing in userspace, it can be interrupted at any time (with few exceptions). Upon such interruption, the kernel can force it to run installed signal handler before it continues executing the original code.

If such a thread is in the kernel when a signal is received, the signal can only be acted upon in places which explicitly check for pending signals. When a pending signal is seen, the function cleans up whatever it did and returns an error to the caller, all the way up to the kernel-userspace boundary.
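In pseudo-C the pattern looks roughly like this (sleep_interruptible and undo_partial_work are made up for illustration):

        /* somewhere deep in a syscall: wait for an event, but honor signals */
        error = sleep_interruptible(&meh_waitchannel);
        if (error != 0) {
                /*
                 * a signal is pending: undo whatever we did and bail out;
                 * the boundary code will arrange for EINTR or a restart
                 */
                undo_partial_work();
                return (error);
        }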

Let's see why.

Just interrupting code executing in the kernel and making it switch to userspace of course sounds absolutely terrible. The situation is better, but still quite bad, if we don't need to execute a handler. Still, it is beneficial to see the actual technical obstacles. Note that this blogpost is about monolithic kernels like Linux or FreeBSD.

Consider the following syscall written in pseudo-c (error and range checking omitted for brevity):


int meh_modify(int id, data_t *udata, size_t size)
{       
        data_t *d;
        meh_t *m;
        
        d = kmalloc(size);
        copy_from_user(d, udata, size);
        
        m = meh_find(id);
        meh_lock(m);
        for (size_t i = 0; i < size; i++)
                m->data[i] = d[i];
        meh_unlock(m);
        meh_put(m);
        kfree(d);
        return 0;
}

meh_modify is callable from userspace just like any other syscall. Callers pass an id of an object and data to be saved.

A sufficient amount of memory is allocated so that there is a place to store the data copied from userspace.

When meh_find finds the object, it bumps its reference counter by one and returns a pointer. We have to decrement the counter later with a call to meh_put.

Finally, the meh_lock/meh_unlock pair provides us with a way to ensure no other threads are modifying the object we found at the same time.

With the code in place let's see how things can go wrong as we try to implement signal delivery for kernel threads.

Interruption at any point

Let's say the thread is interrupted just after it did meh_lock(m), as it is about to execute the copy loop.

So it got the lock. Any thread trying to do the same will spin waiting for our thread to release it. If our thread tries to take the lock again it will also spin.

Now consider userspace code which calls meh_modify(0, some_data, BIGNUM) and installed a signal handler doing meh_modify(0, some_other_data, BIGNUM).

If we interrupt the thread in the kernel right after it took the lock and make it execute such a handler, the very same thread re-enters the kernel, finds the same meh object and tries to lock it. But the lock is already taken, so it spins waiting for the lock to be released... except it is in fact waiting for itself. As you can see, no progress can be made in this scenario.

One may suggest we recognise it is the very same thread which took the lock previously and just let it through. That's bogus as well. Consider an interruption in the middle of the copy loop (m->data[i] = d[i]), and let's say i == 42. The invocation from the signal handler succeeds, and the thread goes back to where it left off previously - the loop body. At this time i == 42, so it continues populating the data from that point only. So after it finishes, the end result is a mix of whatever was put there by the current invocation and the one we let through. Or to state it differently, there is no way to ensure consistency since operations are no longer atomic, defeating the purpose of locks.

You may note the problem here is fundamentally the same as with non-async-signal safe functions called from signal handlers.

Interruption in areas not protected by locks

What if we execute the handler either prior to meh_lock or after meh_unlock? Let's say we interrupt before the lock is taken, but after meh_find finds our object and bumps its reference counter.

Reference counting is a way of tracking the number of users of a given object. For the purpose of this blogpost the following is true enough: when a function like meh_put is called, it decrements the counter by 1; if the count reaches 0, there are no other users and the object can be freed.

What would happen if someone erroneously kept incrementing the counter? Assuming it is of an unsigned type, it would eventually wrap around to 0. If someone then incremented it once more and called meh_put, the object would be freed, even though it still has valid users.
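In C terms (UINT_MAX comes from limits.h):

unsigned int refcount = UINT_MAX;       /* counter erroneously maxed out */
refcount++;                             /* wraps around to 0 */
refcount++;                             /* 1: the kernel now thinks there is a single user */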

Say the kernel delivers the same signal multiple times.

Now let's take our userspace process calling meh_modify from its signal handler.  Well timed signalling can cause the thread to enter our syscall, grab the reference and get interrupted over and over, eventually overflowing the counter. Note this example is rather hypothetical, as other resulting problems would prevent the kernel from reaching this stage (see below).

But the reference is not the only thing modified over and over. We also allocate a buffer with kmalloc, so this could DoS the kernel through memory exhaustion. Except...

When a function is called, the return address is saved on the stack. In other words, a chain of function calls increases stack usage. Kernel stacks are way smaller than userspace ones (recently 16KB on amd64 Linux), so if our repeated kernel entries were to use the same stack, it would be exhausted very quickly, which must not happen.

What if our signal handler does not play along and instead, say, kills itself? The kernel would have to unwind each such interrupted function and let it finish. While strictly speaking this should be doable, it is a bad idea. By having all functions properly finish before the thread leaves the kernel, its internal state is well defined. Interruptions "respecting" lock boundaries don't compromise locks directly, but they make it hard or impossible to limit what kind of resources (references, memory) are allocated and complicate the code.

Tuesday, May 12, 2015

Performance of the same code under different operating systems

Nobody needs convincing that running program foo under different operating systems can influence its performance. Different kernels, libc implementations and compiler versions obviously have a huge impact.

But what about the following: operating system-specific code sets everything up (allocates memory, reads data etc.) and then a common, single-threaded piece of assembly doing only memory and register accesses performs the computation. What impact can an operating system have on the performance of the common binary code?

Some of the things that may affect this are listed below.

I assume no games are played with altering CPU clock.

Impact of issues mentioned below can vary greatly and I'm too lazy to come up with any specific numbers. Point is, there are non-obvious factors which can impact the performance.

Interrupts

Random devices can generate interrupts which pause the execution of the code. There is also the scheduling-clock interrupt, firing many times per second (the frequency depends on the system, typically 1000 Hz). An operating system may allow binding a given process to a CPU which does not receive additional interrupts. It could also use a tickless approach to get rid of the clock interrupt as well. All this has some impact on performance. See Paul E. McKenney - Bare-Metal Multicore Performance in a General-Purpose Operating System (youtube) for more details.

TLB coverage

Both physical and virtual memory consist of pages. Page sizes vary between architectures and a given architecture can support more than one. For instance, amd64 supports 4KB, 2MB and 1GB pages. All addresses in our process' address space are virtual. An attempt to access such an address means the associated physical page needs to be looked up. This information is cached in the TLB, which obviously has a limited number of entries. So if the code accesses a sufficiently wide range of virtual addresses, it may force a lot of lookups. If an operating system supports providing bigger pages, TLB coverage can be greatly increased, reducing the need for lookups and in effect improving performance. See Superpages in FreeBSD (youtube) and Practical, transparent operating system support for superpages (in the Linux world known as hugepages).
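On Linux, one way to explicitly ask for bigger pages is the MAP_HUGETLB mmap flag (a sketch; it requires hugepages reserved by the administrator, while FreeBSD instead promotes mappings to superpages transparently):

#include <sys/mman.h>

/* try to back a 2MB buffer with a single huge page */
void *p = mmap(NULL, 2 * 1024 * 1024, PROT_READ|PROT_WRITE,
    MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0);
if (p == MAP_FAILED)
        p = mmap(NULL, 2 * 1024 * 1024, PROT_READ|PROT_WRITE,
            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); /* fall back to 4KB pages */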

NUMA

Modern amd64 machines use NUMA, which means the cost of accessing various parts of memory varies between CPUs; in particular, each CPU has its own local memory which is the cheapest for it to access. Thus, if an operating system does not know how to allocate physical memory "close" to a CPU and then bind the process to it, memory access costs can grow.

Obtained virtual addresses vs CPU cache

It turns out virtual addresses returned by malloc(3) can also affect performance. Let me quote the 4K Aliasing section from the Intel optimisation guide:

When an earlier load issued after a later store (in program order), a potential WAR (write-after-read) hazard exists. To detect such hazards, the memory order buffer (MOB) compares the low-order 12 bits of the load and store in every potential WAR hazard. If they match, the load is reissued, penalizing performance. However, as only 12 bits are compared, a WAR hazard may be detected falsely on loads and stores whose addresses are separated by a multiple of 4096 (2^12). This metric estimates the performance penalty of handling such falsely aliasing loads and stores. 
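A hypothetical demonstration: with the pointers below, the store from iteration i and the load in iteration i+1 are exactly 4096 bytes apart, so their low 12 address bits match and the load can be falsely flagged as conflicting:

#include <stddef.h>

static char buf[3 * 4096];

void
copy_aliased(void)
{
        char *src = buf;
        char *dst = buf + 4096 + 1;     /* dst - src == 4097 */
        size_t i;

        for (i = 0; i < 4096; i++)
                dst[i] = src[i];        /* store to dst + i, then load src + i + 1 */
}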

Thursday, April 30, 2015

crap c code samples

There are various properties of a programming language. The amount of effort needed to shoot yourself in the foot is definitely among them. There are situations where the language does not invite the programmer to do it, but they do so anyway.

And then there is writing stuff in a non-standard way for no reason.

When a normal person reads code, they make huge jumps. Reading any non-trivial codebase line-by-line is a non-starter. If they eye-grep something deviating from the standard way of doing things, they try to understand the difference in behaviour, and when they can't spot one they have to look again, only to remember the code is crap, so unjustified deviations are to be expected.

This post is not about random casts to silence the compiler, missing -Wall and the like in compilation flags, missing headers and other junk of the sort.

Let's take a look at weird ass code samples. The list is by no means complete.

do {
        .....
} while (1);

When you want an infinite loop, you write either while (1) or for (;;). Using a do {} while loop in this scenario only serves to confuse the reader for a brief moment.

while (foo());
bar();

Is the author trying to hide something? It is easy to miss the semicolon on the while line and think someone without a real editor just did not indent bar properly. If you do notice the semicolon, you start wondering if it ended up there by mistake.

There are legitimate reasons for having loops with empty bodies. If you need to use one, do the following instead:

while (foo())
     continue;

bar();

Next goodie:

if (foo || (!foo && bar))
        .......

What? This is equivalent to a mere (foo || bar). Inserting !foo only serves to make the reader read the condition several times while wondering if they had enough coffee this morning.

char *p = '\0';

I have not encountered this animal in the wild myself, but got reports of its existence. The reasoning behind abusing the equivalence of '\0' and 0 (and the resulting NULL) eludes me. Just use NULL. Thank you.

if (FOO == bar)
        .......

Famous yoda-style comparisons. Not only do they look terrible, there is no legitimate justification for them that I could see.

There are people who claim it prevents mistakes of the form if (bar = FOO) where a comparison was intended. The good news is that your compiler will more than likely tell you such an expression is fishy, and if you really mean it, you can add an extra pair of parentheses. Which is a good thing, since you may need to compare two variables, in which case the "trick" would be useless anyway. AVOID.
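For the record, a sketch of what the compiler does here:

int bar;

void
check(int foo)
{
        if (bar = foo)          /* gcc/clang with -Wall: "suggest parentheses
                                   around assignment used as truth value" */
                return;
        if ((bar = foo))        /* the extra parentheses mark it as intentional */
                return;
}

On to the next specimen: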


if (foo) {
        bar();
        return;
} else {
        baz();
}

quux();

Or even worse:

if (!foo) {
        baz();
} else {
        bar();
        return;
}

quux();

What's the point of the else clause? Do this instead:


if (foo) {
        bar();
        return;
}

baz();
quux();



Wednesday, April 29, 2015

why binaries from one OS don't work on another

The idea of taking calc.exe and just running it on Linux/whatever is obviously absurd. However, taking a binary from a unix-like system (say, Linux) and running it on a different one (say, FreeBSD) poses a legitimate question: why would that not work without special support?

The list below is by no means complete and I'm too lazy to educate myself more on the subject.

So, let's take a look what's needed to get a binary running.

Of course there is functionality specific to given system (e.g. epoll vs kqueue), extensions to common functionality or minor differences in semantics of common functionality, but let's ignore that.

binary loading

The kernel has to parse the binary. Both systems use ELF format, so headers are readable, but it does not mean either system can make sense of everything found inside.

ELF supports passing additional information to the process (apart from argument vector and the environment), but the information passed is os-specific.

A dynamically linked binary contains a hardcoded path to the linker which is supposed to be used and typically requires some libraries (like libc), but e.g. sizes of various structures can differ or macros can expand to different symbols.

As such even if you loaded glibc along with FreeBSD binary it would not work, and the linker does not like the binary anyway.

So one would have to provide a complete enough environment with all necessary binary files.

system calls

Programs open files, talk over the network etc. by asking the kernel to perform a specific action. This is achieved by the use of syscalls.

The kernel has a table of system calls. Userspace processes tell the kernel which syscall they want and what arguments should be used.

Even common syscalls can (and often do) have different numbers (FreeBSD, Linux). So our binary would end up calling the wrong syscalls, which obviously cannot work.

Do you at least do the same thing to call a syscall with given arguments? Well...

On i386 systems FreeBSD expects syscall arguments to be on the stack, while Linux expects them in registers. A way of invoking a syscall is the same though.

On amd64 both systems do the same thing, but one could make them differ for the sake of it.
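For illustration, a raw write(2) invocation on Linux/amd64 (FreeBSD/amd64 invokes syscalls the same way, but there write is number 4, not 1):

/* raw syscall: number in %rax, arguments in %rdi, %rsi, %rdx;
 * the syscall instruction clobbers %rcx and %r11 */
long
raw_write(int fd, const void *buf, unsigned long len)
{
        long ret;

        __asm__ volatile ("syscall"
            : "=a" (ret)
            : "a" (1), "D" (fd), "S" (buf), "d" (len) /* 1 == write on Linux/amd64 */
            : "rcx", "r11", "memory");
        return ret;
}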

conclusion

Supporting trivial binaries from other systems, which only use functionality provided by your system is not very hard. You can provide a dedicated system call table and make sure your signal delivery works.

Running non-trivial binaries requires a significant effort.

Such work was performed in FreeBSD (and other BSDs) and allows it to run Linux binaries. Although there are a lot of missing syscalls (e.g. inotify) and there are some terrible hacks, the layer is quite usable.

Thursday, April 23, 2015

kernel security vs selinux

I'm sorry for a marketing title, but I'm trying to make a point.

There seems to be a confusion as to what selinux (and other LSM modules) can do for you in terms of preventing kernel exploitation, especially in environments where untrusted code is expected to be executed. The answer is: not enough.

This is not an attack on selinux, but on an idea that it can secure your kernels. Especially on an idea that a container created specifically to run untrusted code is less of a threat thanks to selinux.

Note that selinux can be used to make it harder to execute arbitrary code in userspace, providing some degree of protection (an attacker needs a vulnerability in userspace first). However, in a lot of environments arbitrary userspace execution is a given and that's the setup we are going to focus on moving forward.

Normally you want to reduce the attack surface by making as much code as possible unreachable for attackers. Remaining code can have vulnerabilities as well and for that you want various techniques to make the exploitation impossible or at least way harder.

selinux, apparmor and the like are implemented on top of Linux Security Modules framework. LSM has a lot of points in the kernel where it can run your module's hooks, allowing them to deny an operation. But all of them are deep within syscalls.

A typical syscall looks like this:

        a lot of code
        some more code
        likely an LSM hook somewhere
        and even more code

Or to state it differently, there is possibly vulnerable kernel code which does not need to be available and whose execution cannot be blocked by LSM, simply because the hook is not executed early enough.

Most LSM hooks are placed deep within the code for a good reason, I don't know why there are no simple hooks provided to just deny syscall execution.

Look at seccomp(2) if you want the ability to restrict access better and at grsecurity if you want to decrease likelihood of successful exploitation of code which had to be reachable.

On FreeBSD MAC framework is an equivalent of LSM. You can restrict syscall access and the like with capsicum(4).

These technologies are not equivalent and a longer blogpost is in order. Another one will elaborate on a relationship between a container of any sort (e.g. FreeBSD's jail) and the host kernel.

The sole purpose of this post is to make it clear: just plopping selinux in an environment where your users can run their own code does not protect you from kernel exploitation in a sufficient manner.

As a side note there is a fun fact that one of the most basic exploitation prevention measures (disabled mapping at address 0) can be circumvented "thanks to" LSM.

Tuesday, April 21, 2015

unix: insecure by default

Unix-like systems accumulated a lot of stuff which tries to screw you over by default.

I may sound like Captain Obvious here, but I feel like ranting a little.

All of this can be worked around with some effort, the point is it should be the other way around. Even if you knew all the ways you can be screwed over (which you don't) and could always adjust stuff accordingly (and rest assured you would slip up at some point), you can't trust your co-workers will do the right thing.

There is no benefit to having any of this as a default that I could see.

hardlinks to files owned by others

The Linux kernel allows this by default, although apparently distributions disable it on their own (see the fs.protected_hardlinks sysctl). On FreeBSD this can be altered with security.bsd.hardlink_check_uid/gid.

/tmp

Ah, the bottomless source of vulnerabilities even in 2015.

If you just try to create a file in /tmp, you already screwed up. You have to be careful to not follow symlinks planted by jokers.

But wait, you opened a file and checked with fstat(2) that it's yours and it's not a symlink, so it should be fine. Except, if your /tmp is not a separate partition, it could be a hardlink planted by a joker.

Would be fixed for the most part if /tmp/$USER was provided instead.

Ideally just don't use /tmp.
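If you must use /tmp, mkstemp(3) at least creates the file atomically with O_EXCL, so it never follows a planted symlink:

#include <err.h>
#include <stdlib.h>

int
make_temp_file(void)
{
        char path[] = "/tmp/myprog.XXXXXX";     /* mkstemp fills in the suffix */
        int fd;

        fd = mkstemp(path);
        if (fd == -1)
                err(1, "mkstemp");
        return fd;
}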

mount options (suid, exec, dev)

Feeling like mounting something over nfs? You better always remember to put the magic three nosuid, noexec, nodev or chances are you will be screwed over.

A side note is that FreeBSD ditched support for accessing devices through files created on regular filesystems. If you want to put devices somewhere and use them, you need devfs(5).

file descriptors survive execution of a new binary

That is unless you explicitly set O_CLOEXEC on them. I'm sure no file descriptor ever leaked to an unprivileged process even though it was not supposed to.
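The fix is a single flag at open time:

#include <fcntl.h>

int
open_private(void)
{
        /*
         * the descriptor is atomically marked close-on-exec: there is no
         * window in which another thread can fork+exec and leak it
         */
        return open("/etc/secret", O_RDONLY | O_CLOEXEC);
}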

shell scripting

People who have some knowledge wrote quite a lot about the subject (e.g. see BashPitfalls), so let me just point out my favorites.

Unknown variables expand to an empty string -- an excellent source providing us with a constant stream of weird problems, including everyone's favourite rm -rf /*. Try 'set -u' to workaround, but beware of some caveats.

Pattern is returned if no matches were found, e.g. foobar* results in foobar* if the only existing entries are foo, bar and baz. This is unknowingly abused by a lot of people who run tools like find (find . -name *pattern*) or ssh (ssh host cat pattern*). Countless scripts are waiting to break as soon as someone creates a file which happens to match the pattern. Workaround with 'shopt -s nullglob'.

process titles in Linux

FreeBSD provides a dedicated sysctl which updates an in-kernel buffer.

On Linux tools like ps read /proc/<pid>/cmdline in order to provide process titles/names + their args. This content is read from pages mapped into target process memory. People started abusing this situation by overwriting stored arguments in order to provide informative titles for their processes.

Updates have good performance since processes just write to their own memory.
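A sketch of the hack itself, assuming the space once occupied by the argument vector is known; real implementations (e.g. the various setproctitle ports) are more careful:

#include <string.h>

void
set_title(char *argv0, size_t avail, const char *title)
{
        /*
         * 'avail' is the contiguous space originally holding the argument
         * vector (and possibly the relocated environment)
         */
        memset(argv0, 0, avail);
        strncpy(argv0, title, avail - 1);
}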

As usual this leads to some user-visible caveats, which fortunately are fixable.

Title consistency

A cosmetic issue here is that there are no consistency guarantees - what happens if the kernel reads the content as it's being written? The window is extremely small, and reads + updates are rare enough for this to likely never be a problem in practice.

As a side note, the kernel recognises the hack. People can even move the environment (which is normally stored after the argument vector) to make more space for the title and the kernel supports that.

One example approach would tell the kernel where to look for this data and would provide a marker indicating whether an update is in progress, so that the kernel can re-read a few times if needed.

Hanging processes

Accessing memory area storing cmdline's content requires locking target process address space for reading. Unfortunately it's possible that something will lock it for writing and block for an unspecified amount of time, preventing any read accesses. Then if you run a tool which reads the file (e.g. ps(1)), it blocks in an uninterruptible manner (i.e. cannot be killed) waiting for the lock.

So if you happen to have periodically executed scripts which run ps you can accumulate a lot of unkillable processes. This is especially confusing when you try to debug such a problem since mere ps run by hand will also block.

This can for example happen if nfs share dies while a process tries to mmap(2) a file backed by it and actual requests need to be issued.

I would say the best course of action here would be to provide a bounded and killable sleep for cmdline reads. If no data could be read, the process name taken from task_struct could be used instead.

Monday, April 13, 2015

file descriptors: multithreaded process vs the kernel

Userspace processes have several ways in which they can manipulate their file descriptor table. Doing so concurrently opens the kernel to some races which need to be handled.

Let's start with a note that files/pipes/sockets/whatever you may have opened are represented by a dedicated structure, which on FreeBSD, Linux and Solaris happens to be named file.

A file can be in use by multiple fds (and even without fds (e.g. with mmap + close, or just internally by the kernel)), so it has a reference counter. If the counter drops to 0, the object is freed.

File descriptors (fd from now on) have some properties (e.g. a close-on-exec flag), but first and foremost are there to obtain the pointer to relevant struct file (fp from now on). This is achieved by having a table indexed with fds.

With this relationship established let's examine relevant ways userspace processes can modify their fd table:
  1. obtain the lowest possible fd [1] (e.g. open(2), socket(2))
  2. close an arbitrary fd (close(2))
  3. obtain an arbitrary fd (dup2(2))
File descriptors are one of the most common arguments to syscalls and as such translation fd -> fp needs to be fast.

An example syscall resulting in a new fd with a unique fp needs to do the following steps:
  • obtain a new fp (duh)
  • obtain a new fd
  • do whatever it needs to do so that it can fill fp with relevant data
In what order should this be done? Now that depends. Let's assume that the syscall-specific action is very hard to revert, so we need a guarantee of successful fd + fp actions by the time we get to it.

The following fictional C functions will be helpful later:
fp_alloc - obtain a new fp
fd_alloc(fp) - obtain a new fd with given fp set
fp_init(fp, data) - fill fp with stuff specific to given syscall
fd_close(fd) - closes the relevant fd; this releases a reference on fp, possibly freeing it
fd_get(fd) - obtains a reference to fp and returns it
fp_drop(fp) - drops a reference, possibly freeing fp

Now let's consider a syscall doing (error handling trimmed for brevity):
fp = fp_alloc();
fd = fd_alloc(fp);
data = stuff();
fp_init(fp, data);
But what if someone fd_gets this fd before fp_init? We cannot return garbage. So we have to introduce some sort of larval state - fp is there, but is not ready for use and fd_get is careful to check for this condition and returns EBADF claiming nothing was there after all.

How about fd_close? If one were to call it before fp_init completed, it could result in a use-after-free condition. Clearly this cannot be allowed. The solution here is to use an initial refcount of 2 and unconditionally drop the extra reference at the end of the syscall. With this in place, in the worst case the code will fp_init something which is about to be destroyed, which is fine.

Now alter function names a little bit and introduce 'fp->f_ops == &badfileops' as a criterion for larval state and you got how it's done in FreeBSD.

Don't like it? How about some additional functions:
fd_reserve - obtain a new slot in fd table
fd_install(fd, fp) - fill the slot with fp

And a syscall:
fp = fp_alloc();
fd = fd_reserve();
data = stuff();
fp_init(fp, data);
fd_install(fd, fp);
Concurrent fd_get? No problem. Slot reservation can be marked in a bitmap, the table still has NULL set as fp so no special handling is needed.

Concurrent fd_close? It's still NULL, so we can return EBADF as fd in question is not (yet) in use. Such a call from userspace was inherently racy, so there is no correctness issue. Again, no special cases needed.

Once more ignore function names and you roughly got what's done in Linux.

So does this just work? Of course not. Let's take a look at concurrent dup2 execution. dup2(x, y) is documented to close y if it happened to be in use.

What if the syscall in question got fd 8 from fd_reserve while some other thread does dup2(0, 8)? In the FreeBSD case there is no problem - an fp is there, so it can just be closed. Here, however, the kernel has to special-case the situation, and Linux resorts to returning EBUSY.

Bonus question for the reader: what about concurrent execution of fork and a syscall which installs a fd?

Which solution is better? Well, there are more considerations (including locking) which I may tackle in upcoming posts.

files removed on last close?

To quote unlink(2):
When the file's link count becomes 0 and no process has the file open, the space occupied by the file shall be freed and the file shall no longer be accessible. If one or more processes have the file open when the last link is removed, the link shall be removed before unlink() returns, but the removal of the file contents shall be postponed until all references to the file are closed.
If having the file open would be interpreted as having a file descriptor, this would match the most common understanding of the issue.

As usual, this is not entirely correct. As far as userspace holding off file removal for an extended period of time goes, the other thing to look at is memory mappings.

So, in an effort to not make this post a three-liner + a stolen quote, let's have a look at the relevant mappings in a simple program which just sleeps:
00400000-00401000 r-xp 00000000 fd:03 2100640                           /tmp/a.out
00600000-00601000 r--p 00000000 fd:03 2100640                            /tmp/a.out
00601000-00602000 rw-p 00001000 fd:03 2100640                            /tmp/a.out
3685e00000-3685e21000 r-xp 00000000 fd:02 133994                         /usr/lib64/ld-2.17.so
3686020000-3686021000 r--p 00020000 fd:02 133994                         /usr/lib64/ld-2.17.so
3686021000-3686022000 rw-p 00021000 fd:02 133994                         /usr/lib64/ld-2.17.so
Both of these files are in use, but surely it does not have a file descriptor to either one:
lrwx------. 1 meh meh 64 Apr 13 20:58 0 -> /dev/pts/11
lrwx------. 1 meh meh 64 Apr 13 20:58 1 -> /dev/pts/11
lrwx------. 1 meh meh 64 Apr 13 20:58 2 -> /dev/pts/11
ld-2.17.so does not look like a good candidate for an unlink test, so we will focus on /tmp/a.out.
$ stat -L -c '%i' /proc/$(pgrep a.out)/exe
3337221
$ rm /tmp/a.out
$ stat -L -c '%i' /proc/$(pgrep a.out)/exe
3337221
Which I hope is sufficient for you to believe that the file was not removed just yet because it is mapped.

Saturday, April 11, 2015

Weird stuff: thread credentials in Linux

For quite some time now credentials have been more complicated than just a bunch of ids and as such are stored in a dedicated structure.

Credentials very rarely change (compared to how often they are read), so an approach which optimizes for reads is definitely in order.

First let's have a look at what happens in FreeBSD.

struct ucred contains a reference counter. In short it's a method of counting other structures which use this one [1]. New creds start with refcount of 1 and are freed when the counter reaches 0. The structure is copy-on-write - once initialised, it's never changed. This means that calls which change credentials (e.g. setuid(2)) always allocate new ones [2].

Processes are represented with struct proc which have one or more struct threads linked in. Both structures contain their own pointer to creds and keep their own reference.

So when credentials are changed, a new cred struct is allocated and proc's credential pointer is modified to point to it. Threads check whether they have the current cred pointer as they cross the kernel<->userspace boundary. If needed, they take a reference on the new credentials and drop the reference on the old ones.
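In pseudo-C modeled on FreeBSD's scheme (crhold/crfree manipulate the reference counter; locking omitted):

        /* executed as the thread crosses the kernel<->userspace boundary */
        if (td->td_ucred != td->td_proc->p_ucred) {
                struct ucred *oldcred = td->td_ucred;

                td->td_ucred = crhold(td->td_proc->p_ucred);
                crfree(oldcred);
        }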

In effect cred access for the executing thread (the common case) is very cheap -- the kernel can just read them without any kind of locking.

This leaves a window where a thread can enter the kernel while another thread has just changed credentials, and as a result operate using stale creds until it leaves the kernel, which is fine (I may elaborate in another post). Actions concerning process<->process interaction are authorized against proc credentials.

So, what happens in Linux?

Linux has a dedicated structure (struct cred) which also uses a copy-on-write scheme.

Big differences start with processes. These are represented with task_struct. Threads composing a process are just more task_struct's linked together into a thread group.

When credentials are changed in Linux, the kernel only deals with the calling thread. Other threads are not updated and are "unaware" of the event. That's the first weird part.

So how come a multithreaded process calling setuid ends up with consistent creds across its threads?

And here is the second weird part: glibc makes all threads call setuid on their own (I did not check other libc implementations available for Linux; I presume they do the same).

Now, there may be valid reasons to support per-thread credentials (serving files over the network?). But I would still expect dedicated syscalls (thread_setuid?) which just deal with a given thread, and setuid etc. to deal with the entire process.

[1] strictly speaking not every structure storing a pointer to a refcounted structure must have its own reference. Dependencies between various structs can implicitly keep things stable.

[2] while typically true in practice, strictly speaking one could hack it up to e.g. lookup appropriate credentials and grab a reference on them

Friday, April 10, 2015

what is really shown by /proc/pid/environ

Have you ever inspected /proc/<pid>/environ and concluded it contains something which could not possibly be an environment? Or maybe it looked like environment, but could not possibly match what was used by the process?

Let's start with making it clear what the process environment is.

It's just a table with key=value strings passed around during execve(2).

Then the kernel puts it on the stack of the new process and stores the address for possible later use.

There is absolutely no magic involved. When you execute a process you can pass any environment you want.

When someone reads from /proc/<pid>/environ, the kernel grabs environment address it stored during execve and reads from target process' address space.

But is the environment really there? Well, possibly.

Userspace is free to move it wherever it wants, and sometimes it has to, e.g. if the process adds more variables.

As such, if the content looks sane as an environment, you can be reasonably sure this is the environment the process started with. But based on this you cannot know what modifications (if any) were made.

If you really need to know the environment state, your situation is not that bad. POSIX defines 'environ' symbol which is supposed to always point to current environment, so interested parties can easily inspect it by e.g. attaching to the process with gdb.
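For instance, the process itself can walk it directly:

#include <stdio.h>

extern char **environ;          /* the POSIX-defined symbol */

int
main(void)
{
        for (char **ep = environ; *ep != NULL; ep++)
                printf("%s\n", *ep);
        return 0;
}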


zombie processes

From time to time I encounter people spotting accumulating zombie processes; they proceed to kill -9 them, which of course does not change anything.

Each process, apart from init (pid 1) has a parent process. Children (if any) of an exiting process are reparented to init. Once a process exits it becomes a zombie, waiting for its parent to wait(2) for it [1]. Init reaps reparented processes automatically.

This means that persisting zombie processes are an indication of an actual problem.
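For reference, the pattern a correct parent follows:

#include <sys/wait.h>
#include <unistd.h>

void
spawn_and_reap(void)
{
        pid_t pid = fork();

        if (pid == 0) {
                /* child: do the work and exit */
                _exit(0);
        }
        /* parent: reap the child; until this happens it stays a zombie */
        waitpid(pid, NULL, 0);
}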

What you can do from a sysadmin perspective: first and foremost, inspect the parent. If the process is happily executing code in userspace, there is very likely a bug in the application. Nothing you can do about it; contact your developers. If you just kill the process, the zombies will be gone, along with the debug info your developers would want to see. Obtaining a coredump still may not be sufficient; it's best to let them investigate the live system if possible.

I was told about a "solution" which reaps unwanted zombies even if the parent does not do it. As suspected, it works as follows: it attaches to the parent process with ptrace(2) and injects a wait4(2) call. This risks further damaging the state of the problematic process -- for instance, it can have a table with relevant children information and now one of these children is gone. What happens if the child pid gets reused afterwards?

But let's say the process in question is not running circles in userspace. It can be blocked in the kernel in a relatively self-explanatory place.

If the backtrace does not point out an obvious culprit, C language experience will be required. That paired with no fear of the kernel can bring you a long way, so it's worth giving it a shot if only for fun.

Let's evaluate some kernel stacktraces (/proc/<pid>/stack).
[<ffffffffa01b2dc9>] rpc_wait_bit_killable+0x39/0xa0 [sunrpc]
[<ffffffffa01b70b2>] __rpc_execute+0x202/0x750 [sunrpc]
[<ffffffffa01b7bc9>] rpc_execute+0x89/0x240 [sunrpc]
[<ffffffffa01a8d00>] rpc_run_task+0x70/0x90 [sunrpc]
[<ffffffffa01a8d70>] rpc_call_sync+0x50/0xc0 [sunrpc]
[<ffffffffa0485616>] nfs3_rpc_wrapper.constprop.11+0x86/0xd0 [nfsv3]
[<ffffffffa04859d4>] nfs3_proc_access+0xc4/0x1a0 [nfsv3]
[<ffffffffa0429229>] nfs_do_access+0x3c9/0x850 [nfs]
[<ffffffffa04298a1>] nfs_permission+0x1c1/0x2b0 [nfs]
[<ffffffff8126c642>] __inode_permission+0x72/0xd0
[<ffffffff8126c6b8>] inode_permission+0x18/0x50
[<ffffffff8126f306>] link_path_walk+0x266/0x860
[<ffffffff8126f9bc>] path_init+0xbc/0x840
[<ffffffff81272325>] path_openat+0x75/0x620
[<ffffffff81273ec9>] do_filp_open+0x49/0xc0
[<ffffffff8125fa5d>] do_sys_open+0x13d/0x230
[<ffffffff8125fb6e>] SyS_open+0x1e/0x20
[<ffffffff817ec22e>] system_call_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff
Here we see that the process in question tried to open a file and the path lookup led it to an nfs share, at which point it blocked. While this does not tell you which share is causing trouble [2], you can narrow it down no problem.

[<ffffffff81263405>] __sb_start_write+0x195/0x1f0
[<ffffffff81286a24>] mnt_want_write+0x24/0x50
[<ffffffff81271adb>] do_last+0xbeb/0x13c0
[<ffffffff81272346>] path_openat+0x96/0x620
[<ffffffff81273ec9>] do_filp_open+0x49/0xc0
[<ffffffff8125fa5d>] do_sys_open+0x13d/0x230
[<ffffffff8125fb6e>] SyS_open+0x1e/0x20
[<ffffffff817ec22e>] system_call_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff 
Again a file lookup, but this time we got no indication where [2].
Just like with nfs, you can spawn a new shell and probe stuff from that level. But what is it? For those who don't want to do a little digging (rot13): sfserrmr be fbzrguvat ryfr hfvat guvf shapgvbanyvgl

Bottom line is, if you see accumulating zombie processes, don't try to kill them. Inspect your system instead and possibly report a bug to your developers. Chances are you will solve the actual problem on your own.

[1] There are ways in which process can make its children reaped automatically without waiting, also these days situation on Linux and FreeBSD is less trivial -- there are ways to reap foreign children, e.g. with process descriptors or prctl(2) (see PR_SET_CHILD_SUBREAPER)

[2] Unfortunately Linux does not provide any nice way to obtain such information at the moment. I have some ideas how to change it which I may describe in future posts. Interested parties can obtain this information in a way which may seem scary at first, but is quite trivial and largely safe & accurate - you can run the debugger on live kernel (named 'crash'), dump mount list (mount), dump the stack of problematic thread (bt -f) and eye-grep. Of course a proper way would require disassembling some code to be sure what is what on the stack.

Wednesday, April 8, 2015

What is really included in load average on Linux?

Everyone "knows" that load average = the number of runnable processes + the number of processes blocked on I/O. While this may be true enough for a lot of use cases, it is incorrect.

The purpose of this article is to note briefly what is really counted, not to enumerate all possibilities.

First a short note that the kernel counts threads, not processes.

With this out of the way, let's take a look at a relevant comment (source):
 * Once every LOAD_FREQ:
 *
 *   nr_active = 0;
 *   for_each_possible_cpu(cpu)
 *      nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;

An alert reader may note that there are lies, damned lies, statistics and comments in the code. I have to agree, thus this requires validation.

A quick eye-grep reveals:

long calc_load_fold_active(struct rq *this_rq)
{
        long nr_active, delta = 0;

        nr_active = this_rq->nr_running;
        nr_active += (long) this_rq->nr_uninterruptible;

        if (nr_active != this_rq->calc_load_active) {
                delta = nr_active - this_rq->calc_load_active;
                this_rq->calc_load_active = nr_active;
        }

        return delta;
}
While not strictly sufficient, it's fine enough for this article.

So we know "threads blocked on I/O" is not the criterion here; what counts is threads which contribute to the nr_uninterruptible counter.

nr_uninterruptible represents threads in TASK_UNINTERRUPTIBLE state (which are not frozen, but what it means is beyond the scope of this article).

When can this happen?
  • while waiting for event completion (also used when dealing with I/O)
  • while trying to acquire a sleepable locking primitive such as a semaphore
The significance of this information is that when a server with an abnormally high load (say > 1k on a 64-way machine) is encountered, people tend to think I/O is at fault (e.g. a dead nfs server), which may very easily be false. For instance, one thread could take a semaphore for writing and block for some reason, and a lot of other threads then start tripping over it while trying to take it for reading.

Tuesday, April 7, 2015

nofile, ulimit -n, RLIMIT_NOFILE -- the most misunderstood resource limit

Have you ever seen "VFS: file-max limit XXX reached" and proceeded to hunt for file descriptor consumers? Does this prompt you to lsof -u joe | wc -l to find out how many file descriptors joe uses? If so, this post is for you.

The aforementioned message is not about file descriptors, and lsof -u joe shows way more than just file descriptors anyway.
So what is limited by RLIMIT_NOFILE?

The biggest number that can be assigned to a file descriptor in a given process.

I repeat: the biggest number that can be assigned to a file descriptor - of course from the moment the limit is applied. This has a side effect of limiting the number of file descriptors a process can open from that point on.
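A quick demonstration of that side effect (error handling trimmed):

#include <sys/resource.h>
#include <fcntl.h>
#include <stdio.h>

int
main(void)
{
        struct rlimit rl = { .rlim_cur = 20, .rlim_max = 20 };
        int fd;

        setrlimit(RLIMIT_NOFILE, &rl);
        for (;;) {
                fd = open("/dev/null", O_RDONLY);
                if (fd == -1) {
                        perror("open");         /* eventually fails with EMFILE */
                        return 1;
                }
                printf("got fd %d\n", fd);      /* never above the limit */
        }
}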

1. A process sets RLIMIT_NOFILE to 20 on its own. How many file descriptors can it have open?

Impossible to tell. No new file descriptor will be bigger than 20, but there may be a huge number of already open file descriptors with higher numbers.

2. There is only one process owned by joe. It has the following file descriptors open: 0, 1, 2. It sets its own RLIMIT_NOFILE to 20 and creates a new process. How many file descriptors can be opened in each of them?

The nofile limit is per process, thus the fact that one of these processes created the other is irrelevant. Either can open 18 more file descriptors.

You may have encountered the following:
VFS: file-max limit $BIGNUM reached

What's the relationship between file descriptors and 'file' (struct file) limit?

So what is struct file? It is an object which contains some state related to an open entity like an on-disk file, pipe etc.

A struct file may be used internally by the kernel and not be associated with any file descriptor.
Each file descriptor has to be associated with exactly one struct file.
Each struct file can have any number of associated file descriptors.
On clone(), file descriptors are copied, i.e. they reference the same 'struct file' their counterparts in the parent process do.

Opening a file typically boils down to the following:
lookup the file
allocate new file descriptor
allocate struct file
tie up an inode with struct file
set file descriptor to 'point' to struct file

3. Process has the following file descriptors open: 0, 1, 2. Now it calls clone(). How many new 'struct file' are allocated in order to satisfy this request?

None. 0, 1, 2 in the new process use struct file from 0, 1, 2 from the parent.

4. A process has the following file descriptors open: 0, 1, 2. Now it exits. How many 'struct file' will be freed as a result?

Impossible to tell. First, it is possible that all the file descriptors were associated with the same 'struct file'. Moreover, these file descriptors could have been inherited from a parent process which is still alive and didn't modify its own descriptors. As such, it is possible that the struct file(s) in question are still in use.

5. No process has /etc/passwd open. Now one process opens it 3 times. How many 'struct file' were allocated as a result?

Three, one for each open request.

With that established let's take a look at related errors (man errno):

ENFILE          Too many open files in system (POSIX.1)
EMFILE          Too many open files (POSIX.1)

The first one signals that the kernel ran out of 'struct file', the other that the given process cannot have more file descriptors.

When the kernel prints "VFS: file-max limit XXX reached" it says it won't allocate any new struct file.

6. Let's assume the kernel reached the limit of 'struct file'. Now joe's process tried to obtain a new file descriptor. Can this operation succeed? Which error is returned on failure?

If the new file descriptor would need a new 'struct file', the error would be ENFILE.
But it may be that this file descriptor is going to reuse an already existing 'struct file', in which case it does not matter that the kernel hit the limit. It can then fail with EMFILE, or it can succeed, depending on rlimits.

lsof | wc -l vs number of open descriptors

Apart from file descriptors, lsof shows other stuff (e.g. in-memory file mappings, current working directory). As such, output from mere 'lsof' invocations cannot be used to check file descriptors.

7. An administrator does `lsof -p 8888 | wc -l` and receives 10000. How many file descriptors are in use by this process?

As noted earlier, impossible to tell due to other fields printed by lsof.

Current amount of open file descriptors by given process can be obtained by counting symlinks in /proc/<pid>/fd.

8. `ls /proc/<pid>/fd | wc -l`  returns 9000. How many 'struct file' are in use by this process?

Anything between 1 and 9000 (including both).

9. We get the same result as in the previous question. What can you say about the 'nofile' rlimit set on this process?

Nothing. Not only do we not know the biggest open fd; even if we did, the fd could have been opened before the limit was applied.

How traditional resource limits are handled

 For the purpose of this document we will define the following:
- resource limits as mentioned earlier will be referred to as rlimits
- we will have an unprivileged user joe

You most likely have set rlimits at some point, either by editing
/etc/security/limits.conf (or some other file), playing with ulimit etc.

But how and when are such limits meaningful?

You may have heard about RLIMIT_NPROC, which is supposed to limit the number of processes a given user is allowed to have. Yet it is possible that you will configure this limit and the user in question will end up with twice as many processes. Not only that, he may still be able to spawn more. What is going on here?

Statements which follow are true enough and sufficient to understand points I'm trying to make.

Process actions are subject to rlimits.

Resource limit is a property of a process, not a user.

Processes are owned by <uid>:<gid>, which are just some numerical values.
If there is a resource limit covering more than one process, it does so by looking at the uid.

Creation of a new process is accomplished with clone() systemcall. Provided
with appropriate arguments it will create a copy of the executing process.
There are some differences (pid, parent pid) but pretty much all the rest is
identical. This includes rlimits.

Now let's say we have our custom program running as root and we want to extend it so that it runs stuff as an unprivileged user.

In order to have a process running with given uid:gid, we need to use setuid() and setgid() systemcalls whose names are self explanatory (strictly speaking one can also run a suid/sgid binary, but that's irrelevant to the subject in question).

Let me reiterate: creating a process owned by given user looks as follows:

clone();
setgid();
setuid();

So... which one of these applies security limits as defined in /etc/security/limits.conf?

That's right, NONE.

This new process owned by joe has the same rlimits its parent does.

Rlimits from limits.conf etc. can be applied by additional code (typically a PAM module).
The point is, this has to be done separately and prior to {u,g}id change.
sshd, su and so on do this stuff.
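A sketch of what such code has to do by hand; note the order - the limits and groups are set while still root, the uid change comes last:

#include <sys/resource.h>
#include <err.h>
#include <grp.h>
#include <unistd.h>

static void
drop_to_user(const char *user, uid_t uid, gid_t gid, const struct rlimit *rl)
{
        if (setrlimit(RLIMIT_NPROC, rl) == -1)  /* apply limits first */
                err(1, "setrlimit");
        if (initgroups(user, gid) == -1)        /* supplementary groups too */
                err(1, "initgroups");
        if (setgid(gid) == -1)
                err(1, "setgid");
        if (setuid(uid) == -1)                  /* after this there is no going back */
                err(1, "setuid");
}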

With that in mind let's consider some problems.

Let's assume joe has 80 running processes and all of them have the nproc limit applied at 100.

1. An administrator modified limits.conf:
joe hard nproc  50

1.1 Is it possible for joe to spawn a new process?

Yes, currently running processes have 100 as their limit and only 80 are running.
Thus rlimits are not in the way here.

1.2 Is it possible for joe to log in over ssh?

No. The would-be joe process would have its limit set to 50, but there are
already 80 processes running, thus it would error out.

1.3. A custom daemon is running, it spawns 200 processes owned by joe, but does not apply his resource limits. Consider processes from 1.1. Can any of them clone()?

No. There are 280 joe's processes running, but ones from 1.1 have the limit set to 100.