Let's start with a note that files/pipes/sockets/whatever you may have opened are represented by a dedicated structure, which on FreeBSD, Linux and Solaris happens to be named file.
A file can be in use by multiple fds (and even without fds (e.g. with mmap + close, or just internally by the kernel)), so it has a reference counter. If the counter drops to 0, the object is freed.
File descriptors (fd from now on) have some properties (e.g. a close-on-exec flag), but first and foremost are there to obtain the pointer to relevant struct file (fp from now on). This is achieved by having a table indexed with fds.
With this relationship established let's examine relevant ways userspace processes can modify their fd table:
- obtain the lowest possible fd [1] (e.g. open(2), socket(2))
- close an arbitrary fd (close(2))
- obtain an arbitrary fd (dup2(2))
An example syscall resulting in a new fd with a unique fp needs to do the following steps:
- obtain a new fp (duh)
- obtain a new fd
- do whatever it needs to do so that it can fill fp with relevant data
The following fictional C functions will be helpful later:
fp_alloc - obtain a new fp
fd_alloc(fp) - obtain a new fd with given fp set
fp_init(fp, data) - fill fp with stuff specific to given syscall
fd_close(fd)- closes relevant fd, this releases a refcount on fp, possibly freeing it
fd_get(fd) - obtains a reference to fp and returns it
fp_drop(fp) - drops a reference, possibly freeing fp
Now let's consider a syscall doing (error handling trimmed for brevity):
fp = fp_alloc();But what if someone fd_gets this fd before fp_init? We cannot return garbage. So we have to introduce some sort of larval state - fp is there, but is not ready for use and fd_get is careful to check for this condition and returns EBADF claiming nothing was there after all.
fd = fd_alloc(fp);
data = stuff();
fp_init(fp, data);
How about fd_close? If one was to call it before fp_init is completed, it could result in a use-after-free condition. Clearly this cannot be allowed. The solution here is to use an initial refcount of 2 and unconditionally drop the extra reference in the syscall. With this in place in worst case the code will fp_init something which is about to be destroyed, which is fine.
Now alter function names a little bit and introduce 'fp->f_ops == &badfileops' as a criterion for larval state and you got how it's done in FreeBSD.
Don't like it? How about some additional functions:
fd_reserve - obtain a new slot in fd table
fd_install(fd, fp) - fill the slot with fp
And a syscall:
fp = fd_alloc();Concurrent fd_get? No problem. Slot reservation can be marked in a bitmap, the table still has NULL set as fp so no special handling is needed.
fd = fd_reserve();
data = stuff();
fp_init(fp, data);
fd_install(fd, fp);
Concurrent fd_close? It's still NULL, so we can return EBADF as fd in question is not (yet) in use. Such a call from userspace was inherently racy, so there is no correctness issue. Again, no special cases needed.
Once more ignore function names and you roughly got what's done in Linux.
So does this just work? Of course not. Let's take a look at concurrent dup2 execution. dup2(x, y) is documented to close y if it happened to be in use.
What if the syscall in question got fd 8 from fd_reserve while some other thread does dup2(0, 8)? In FreeBSD case there is no problem - fp is there, you can just close it. Here the kernel has to special case and Linux resorts to returning EBUSY.
Bonus question for the reader: what about concurrent execution of fork and a syscall which installs a fd?
Which solution is better? Well, there are more considerations (including locking) which I may tackle in upcoming posts.
No comments:
Post a Comment