Tuesday, May 12, 2015

Performance of the same code under different operating systems

Nobody needs convincing that running program foo under different operating systems can influence its performance. Different kernels, libc implementations and compiler versions obviously have a huge impact.

But what about the following scenario: operating system-specific code sets everything up (allocates memory, reads data etc.) and then common single-threaded assembly performing only memory and register accesses does the computation. What impact can an operating system have on the performance of that common binary code?

Some of the things that may affect this are listed below.

I assume no games are played with altering the CPU clock.

The impact of the issues mentioned below can vary greatly and I'm too lazy to come up with any specific numbers. The point is that there are non-obvious factors which can affect performance.

Interrupts

Various devices can generate interrupts which pause the execution of the code. There is also the scheduling-clock interrupt firing multiple times per second (the frequency depends on the system; 1000 Hz is typical). An operating system may allow binding a given process to a CPU which does not receive interrupts from devices. It could also use a tickless approach to get rid of the clock interrupt as well. All this has some impact on performance. See Paul E. McKenney - Bare-Metal Multicore Performance in a General-Purpose Operating System (youtube) for more details.
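For instance, on Linux a process can pin itself to a chosen CPU with sched_setaffinity(2). A minimal sketch follows; the CPU number is an arbitrary example, and actually steering device interrupts away from that CPU (e.g. with isolcpus or IRQ affinity settings) is a separate, system-specific step:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void) {
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(2, &set);	/* arbitrary example CPU */
	/* pid 0 means the calling process */
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}
	/* ... run the measured code here ... */
	return 0;
}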

TLB coverage

Both physical and virtual memory consist of pages. Page sizes vary between architectures and a given architecture can support more than one. For instance, amd64 supports 4KB, 2MB and 1GB pages. All addresses in our process' address space are virtual. An attempt to access such an address means the associated physical page needs to be looked up. This information is cached in the TLB, which obviously has a limited number of entries. So if the code accesses a sufficiently wide range of virtual addresses, it may force a lot of page-table walks. If an operating system supports providing bigger pages, TLB coverage can be greatly increased, reducing the need for lookups and in effect improving performance. See Superpages in FreeBSD (youtube) and Practical, transparent operating system support for superpages (in the Linux world superpages are known as hugepages).
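On Linux, bigger pages can be requested explicitly with mmap(2) and the MAP_HUGETLB flag (transparent hugepages may also be applied automatically). A minimal sketch, assuming huge pages were reserved beforehand (e.g. via the vm.nr_hugepages sysctl), otherwise the call fails:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN	(2UL * 1024 * 1024)	/* one 2MB huge page */

int main(void) {
	void *p;

	p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* touch the whole area; a single TLB entry now covers all 2MB */
	memset(p, 0, LEN);
	munmap(p, LEN);
	return 0;
}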

NUMA

Modern amd64 machines use NUMA, which means the cost of accessing a given part of memory varies between CPUs; in particular, each CPU has its local memory, which is the cheapest for it to access. Thus, if an operating system does not know how to allocate physical memory "close" to a CPU and then bind the process to it, memory access cost can grow.
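On Linux, explicit control is available through libnuma (link with -lnuma). A minimal sketch, assuming the machine and kernel actually support NUMA:

#include <numa.h>
#include <stdio.h>

int main(void) {
	void *buf;

	if (numa_available() < 0) {
		fprintf(stderr, "no NUMA support\n");
		return 1;
	}
	/* run on node 0 (arbitrary example) ... */
	numa_run_on_node(0);
	/* ... and allocate memory local to the node we run on */
	buf = numa_alloc_local(1 << 20);
	if (buf == NULL)
		return 1;
	/* ... computation ... */
	numa_free(buf, 1 << 20);
	return 0;
}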

Obtained virtual addresses vs CPU cache

It turns out that the virtual addresses returned by malloc(3) can also affect performance. Let me quote the description of 4K Aliasing from the Intel optimisation guide:

When an earlier load issued after a later store (in program order), a potential WAR (write-after-read) hazard exists. To detect such hazards, the memory order buffer (MOB) compares the low-order 12 bits of the load and store in every potential WAR hazard. If they match, the load is reissued, penalizing performance. However, as only 12 bits are compared, a WAR hazard may be detected falsely on loads and stores whose addresses are separated by a multiple of 4096 (2^12). This metric estimates the performance penalty of handling such falsely aliasing loads and stores.
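A toy sketch of the effect (whether and how much it hurts depends on the microarchitecture): in each iteration the load from b[i] follows the store to a[i] in program order, and when b sits exactly 4096 bytes after a, the low 12 bits of the two addresses match on every iteration, so the loads may be falsely flagged as conflicting and reissued. Shifting b by an extra cache line avoids the collision:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void run(volatile float *a, volatile float *b, long n) {
	struct timespec t0, t1;
	double sum = 0.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (long r = 0; r < 100000; r++)
		for (long i = 0; i < n; i++) {
			a[i] = 1.0f;	/* store */
			sum += b[i];	/* load issued after the store */
		}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("%.3f s (sum=%f)\n",
	    (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9, sum);
}

int main(void) {
	float *mem = aligned_alloc(4096, 3 * 4096);

	if (mem == NULL)
		return 1;
	run(mem, mem + 1024, 1024);		/* b = a + 4096 bytes: low bits collide */
	run(mem, mem + 1024 + 16, 1024);	/* b = a + 4096 + 64 bytes: no collision */
	free(mem);
	return 0;
}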