Sunday, May 8, 2016

symlinks vs hardlinks

Given /foo/bar, one could say "the file bar is in the directory foo". This is fine for everyday purposes as long as hardlinks are understood, but strictly speaking it is incorrect.

Files exist on filesystems (not in directories) and are represented by inodes. An inode contains various information, like the owner, mode, access times and the link count. It does not contain any names.

A name for a given inode number can then be placed in a directory, and the link count of the target inode is incremented to record this. Removing a name decrements the link count and, if it drops to zero, sets the inode up for deletion (which happens once no process has the file open anymore). The name as described here is known as a hardlink.
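
The bookkeeping described above can be watched from a shell. This is a minimal sketch rather than a definitive recipe: the scratch directory and the names orig and alias are made up for the illustration, and the inode number and link count are scraped from ls output, so it assumes a POSIX-like ls and a filesystem that supports hardlinks.

```shell
# Work in a throwaway directory so nothing real is touched.
dir=$(mktemp -d)

echo data > "$dir/orig"                              # new inode with one name
count_one=$(ls -l "$dir/orig" | awk '{print $2}')    # link count is column 2 of ls -l

ln "$dir/orig" "$dir/alias"                          # second name for the same inode
ino_orig=$(ls -i "$dir/orig" | awk '{print $1}')     # inode number is column 1 of ls -i
ino_alias=$(ls -i "$dir/alias" | awk '{print $1}')
count_two=$(ls -l "$dir/orig" | awk '{print $2}')

rm "$dir/orig"                                       # removes a name, not the file
count_after_rm=$(ls -l "$dir/alias" | awk '{print $2}')
content=$(cat "$dir/alias")                          # data still reachable via the other name

echo "link count: $count_one, then $count_two, then $count_after_rm"
echo "inode numbers: $ino_orig and $ino_alias"
echo "content via alias: $content"
```

Neither name is "the real one": after the rm, alias is simply the only name left for the inode.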

Symlinks are slightly convoluted, and typical explanations are somewhat misleading: they boil down to symlinks "pointing to a name". Say we are given an inode 42 with a name crap placed in /. Say a symlink meh is created with the following content: /crap. For everyday purposes one would say that "meh points to /crap" and expect to reach the inode 42.

The problem with explanations involving "pointing to a name" is they suggest a relationship between the symlink and the "target" name, while there is none.
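
To see that a symlink only stores text and has no tie to any particular inode, here is a small sketch. The scratch directory and the names target, old and link are hypothetical, and readlink is assumed to be available (it is on Linux, the BSDs and macOS).

```shell
dir=$(mktemp -d)

# The symlink can be created before anything exists at the path it names.
ln -s "$dir/target" "$dir/link"
stored=$(readlink "$dir/link")           # just the text that was stored
if [ -e "$dir/link" ]; then dangling=no; else dangling=yes; fi

# Put a file at that path: the unchanged symlink now leads somewhere.
echo first > "$dir/target"
seen_first=$(cat "$dir/link")

# Move the old file away and put a different one under the same name.
# The old inode still exists (as "old"), yet the symlink follows the name.
mv "$dir/target" "$dir/old"
echo second > "$dir/target"
seen_second=$(cat "$dir/link")
still_stored=$(readlink "$dir/link")     # the stored text never changed

echo "stored text: $stored"
echo "dangling at first: $dangling; later read: $seen_first, then $seen_second"
```

Nothing was updated in the symlink at any point; only the outcome of looking up the stored text changed.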

The key here lies in the fact that processes can have different views of the "file namespace" (for lack of a better term). In particular, thanks to the chroot(2) system call, they can have their own idea of what / is.

Consider the following tree:
/
├── crap // ino 100
└── foo
    └── bar
        ├── crap // ino 200
        └── craplink -> /crap

A regular process entering /foo/bar and doing cat craplink will reach the file with ino 100. But another one which chrooted to /foo/bar will reach the file with ino 200.

A symlink is just some text which can be used during path lookup. Conceptually, the symlink's content is read and the part of the path which was not traversed yet is appended to it.

Consider a file /a/b/c/dir/file and a symlink /meh/dirlink to /a/b/c/dir. Lookup of /meh/dirlink/file will:
  1. grab / as set for the process
  2. find meh and conclude it's a directory
  3. find dirlink and conclude it's a symlink; the path to look up now becomes /a/b/c/dir/file
This leads us to a caveat with ".." components that follow a symlink.

Consider a lookup of /meh/dirlink/../crap.
  1. grab / as set for the process
  2. find meh and conclude it's a directory
  3. find dirlink and conclude it's a symlink; the path to look up now becomes /a/b/c/dir/../crap
So this leads us to /a/b/c/crap as opposed to /meh/crap.
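
Both lookups can be reproduced without touching the real /. This is a sketch under assumptions: the tree is rebuilt below a scratch directory (so the symlink text is $root/a/b/c/dir rather than /a/b/c/dir), and the file contents are made up so that it is obvious which file was reached.

```shell
# Resolve the scratch directory physically so that no stray symlinks in the
# temporary path muddy the picture.
root=$(cd "$(mktemp -d)" && pwd -P)

mkdir -p "$root/a/b/c/dir" "$root/meh"
echo "file in dir" > "$root/a/b/c/dir/file"
echo "crap next to dir" > "$root/a/b/c/crap"
echo "crap next to dirlink" > "$root/meh/crap"
ln -s "$root/a/b/c/dir" "$root/meh/dirlink"

# cat hands the path to the kernel untouched, so the kernel does the lookup.
via_link=$(cat "$root/meh/dirlink/file")
via_dotdot=$(cat "$root/meh/dirlink/../crap")

echo "dirlink/file    -> $via_link"
echo "dirlink/../crap -> $via_dotdot"
```

The second cat should print the content of a/b/c/crap, because ".." is looked up in the directory the symlink led to, not in the directory holding the symlink.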

But there is a caveat within the caveat! Shells like bash or zsh try to be clever: they keep track of the path you typed and cancel out the ".." themselves when you cd. So when you cd /meh/dirlink/../crap, you actually end up in /meh/crap. But if you pass the same path to anything else, the kernel resolves it and the hack is gone.

$ echo 'surprise, mo^H^H^H^H' > /a/b/c/crap 
$ mkdir /meh/crap
$ cd /meh/dirlink/../crap
$ pwd
/meh/crap

$ cat /meh/dirlink/../crap
surprise, mo^H^H^H^H
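
The shell behaviour shown above is the logical mode of cd, which POSIX spells -L and makes the default; -P asks for the physical resolution the kernel would do. Here is a sketch comparing the two, again with a scratch directory standing in for / (an assumption, as are the subshells used to leave the current directory alone):

```shell
root=$(cd "$(mktemp -d)" && pwd -P)
mkdir -p "$root/a/b/c/dir" "$root/meh"
ln -s "$root/a/b/c/dir" "$root/meh/dirlink"

# Logical: the shell cancels "dirlink/.." textually before asking the kernel.
logical=$(cd -L "$root/meh/dirlink/.." && pwd -L)

# Physical: the path goes to the kernel as is, so ".." is the parent of dir.
physical=$(cd -P "$root/meh/dirlink/.." && pwd -P)

echo "cd -L lands in $logical"
echo "cd -P lands in $physical"
```

With -L the result should be the meh directory, with -P it should be a/b/c, matching what cat and every other non-shell lookup sees.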