Sunday, July 24, 2016

malloc+memset vs calloc

There is various advice out there suggesting the use of either calloc or malloc followed by memset. Claims are made about a supposed performance difference between the two, typically accompanied by a problematic explanation of what's going on, in particular one involving the so-called "zero page".

The first part of this post gives advice on reasonable usage of these functions, while the second part explains what's going on, hopefully clearing up why calloc can appear to be faster than malloc + memset in some cases.

While the post is Linux- and x86-64-centric, the principles apply to other systems and architectures as well. The main difference is that they may lack the "zero page".

A separate post will cover tooling which can be used to investigate the claims below.

When to use what?

First of all, do you really need to allocate anything? If your buffer is up to a few kilobytes in size and you always free it on return from the function, then use the stack.

malloc returns a suitable buffer with indeterminate content. calloc returns a suitable buffer filled with zeroes.

If you need to allocate something, use malloc.

If you need to zero out part of the buffer, memset that part.

If you need the buffer zeroed out entirely or for the most part, use calloc(1, size).

In particular:
1. DO NOT do p = malloc(size); memset(p, 0, size); use calloc instead. It is more concise and in certain cases will be faster (a sketch follows this list).
2. DO NOT default to zeroing everything. Tools like valgrind or scan-build can detect the use of uninitialized variables, but if you just zero out the entire struct, everything is initialized to 0 whether or not that is a meaningful value, which hides the problem.
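
To make point 1 concrete, here is a minimal sketch; the struct and function names are made up for illustration:
#include <stdlib.h>
#include <string.h>

struct conn {
        int fd;
        char buf[4096];
};

/* Works, but spells the zeroing out by hand. */
struct conn *
conn_new_verbose(void)
{
        struct conn *c;

        c = malloc(sizeof(*c));
        if (c == NULL)
                return NULL;
        memset(c, 0, sizeof(*c));
        return c;
}

/* Preferred: shorter and in certain cases faster. */
struct conn *
conn_new(void)
{
        return calloc(1, sizeof(struct conn));
}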

What's going on?

There are benchmarks out there which boil down to allocating several megabytes of memory with either calloc or malloc + memset and comparing the time needed to complete. Benchmarks of this sort neither measure the real expense in a program which later uses the allocated memory, nor do they necessarily reflect the actual cost of calloc itself in such a program. calloc will sometimes be faster than malloc + memset. An example benchmark is included at the end.

Let's see what's up.

Processes have a so-called virtual address space, divided into pages, typically 4096 bytes in size. These pages can be backed by physical pages of the same size. A process can request memory with system calls like mmap or (s)brk.

malloc is part of glibc and runs as part of the process. It requests memory from the kernel as it sees fit. Once it obtains a page, it can e.g. divide it into 16-byte chunks and use those to satisfy calls like malloc(16). That's a vastly simplified picture, but sufficient as an illustration.
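
As a hedged illustration of that simplification, here is a toy allocator which carves 16-byte chunks out of a single page obtained with mmap; it is not how glibc actually implements malloc and the names are invented:
#include <stddef.h>
#include <sys/mman.h>

#define TOY_PAGE_SIZE   4096
#define TOY_CHUNK_SIZE  16

static char *toy_page;          /* one page obtained from the kernel */
static size_t toy_next;         /* index of the next free chunk */

/* Hand out fixed 16-byte chunks, asking the kernel for a page when needed. */
static void *
toy_malloc16(void)
{
        if (toy_page == NULL) {
                toy_page = mmap(NULL, TOY_PAGE_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (toy_page == MAP_FAILED) {
                        toy_page = NULL;
                        return NULL;
                }
        }
        if (toy_next == TOY_PAGE_SIZE / TOY_CHUNK_SIZE)
                return NULL;    /* the page is used up */
        return toy_page + TOY_CHUNK_SIZE * toy_next++;
}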

Once a new page is mapped, it is guaranteed to be zeroed. The reason is that the kernel has to get rid of any residual data from allocations made by other processes, and zero is the nicest value to put in there.
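
This guarantee can be observed directly; a minimal sketch, with error handling kept to asserts:
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
        size_t i, len = 4 * (size_t)sysconf(_SC_PAGESIZE);
        unsigned char *p;

        /* Ask the kernel for fresh anonymous memory... */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(p != MAP_FAILED);

        /* ...and every byte reads as zero without any userspace memset. */
        for (i = 0; i < len; i++)
                assert(p[i] == 0);

        munmap(p, len);
        return 0;
}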

For special cases, the kernel keeps a page which always contains zeroes and cannot be written to. We will call it ZERO_PAGE.

Asking the kernel for more memory does not mean it will allocate any physical pages. Instead, it will take note of what was requested.

Later, if the process accesses the area, a page fault will occur. The kernel will see that the affected area was promised and will map a page there. If the fault was caused by a write, the process gets a standard zeroed page. If it was a read, the kernel may decide to map ZERO_PAGE instead. In either case, the state visible to the process is a zeroed area and the entire business is transparent.

Let's walk through a malloc example. K is the kernel, P is the process, M and C are respectively malloc and calloc running within P.
  1. P does malloc(BIGNUM);
  2. M decides it has enough memory to satisfy the request and returns an appropriate buffer
Here the kernel was not involved. This is what typically happens in a long-running program.

How about a more convoluted example, typically happening at the start of the program:
  1. P does malloc(BIGNUM);
  2. M decides to grab more memory from the kernel and calls mmap
  3. K takes note and returns address ADDR. No physical memory is allocated at this stage.
  4. M picks an area from ADDR and returns BUFADDR
  5. P writes to BUFADDR which triggers a page fault
  6. K sees that the area fits what was returned from mmap and satisfies the request by mapping a zeroed page
What if calloc is used instead? Let's take a simplified example:
  1. P does calloc(1, BIGNUM);
  2. C decides it has enough memory to satisfy the request, zeroes a buffer and returns it
So the difference would be zeroing. The more complicated example with requesting more memory would differ in the same way.

But here is an observation: memory we get from the kernel is guaranteed to be zeroed, so if C knows for a fact that the particular buffer it is about to return has not been used before, there is no need for zeroing. This also means the buffer is not touched, which may translate into the underlying memory page(s) not being touched, which in turn reduces the number of page faults.
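
A sketch of the resulting logic inside C follows; toy_malloc and is_fresh_from_kernel are hypothetical names standing in for the allocator's internal bookkeeping, and this is not glibc's actual code:
#include <stdint.h>
#include <string.h>

/* Hypothetical internals standing in for the allocator's bookkeeping. */
extern void *toy_malloc(size_t size);
extern int is_fresh_from_kernel(const void *p);

void *
toy_calloc(size_t nmemb, size_t size)
{
        void *p;

        if (nmemb != 0 && size > SIZE_MAX / nmemb)
                return NULL;            /* the multiplication would overflow */

        p = toy_malloc(nmemb * size);
        if (p == NULL)
                return NULL;

        /*
         * A buffer carved out of a fresh kernel mapping is already zeroed;
         * skipping the memset also avoids faulting its pages in right here.
         */
        if (!is_fresh_from_kernel(p))
                memset(p, 0, nmemb * size);
        return p;
}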


And this is where the performance difference in toy benchmarks comes from: these toy benchmarks don't touch the memory.

To elaborate, in a toy benchmark:
  1. P does calloc(1, BIGNUM);
  2. C decides to grab more memory from the kernel and calls mmap
  3. K takes note and returns address ADDR. No physical memory is allocated at this stage.
  4. C picks an area from ADDR and settles on BUFADDR. Seeing this is a fresh allocation, it skips zeroing
  5. ... memory is not touched, so we end here. The most expensive part is not executed.
In comparison, a real-world program will typically use the calloced memory, and in a long-running program the buffer returned from calloc will most of the time have to be explicitly cleared because it is being reused.

To reiterate, calloc will give you savings in the following circumstances:
1. sometimes the page(s) are not touched at all, and both the page fault and explicit zeroing by C are avoided. If the memory happens to be accessed later, a page fault will be taken, but we still get away without explicit zeroing (see the sketch after this list).
2. sometimes, even if the page was already mapped, C knows the buffer was never used and can avoid zeroing it
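
One way to watch this from inside a program is to look at the resident set size around the allocation. A rough sketch (on Linux ru_maxrss is in kilobytes, and the exact numbers will vary):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

static long
maxrss_kb(void)
{
        struct rusage ru;

        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_maxrss;
}

int
main(void)
{
        size_t size = 256 * 1024 * 1024;
        char *p;

        printf("before calloc:  %ld kB\n", maxrss_kb());

        p = calloc(1, size);            /* pages promised, not yet mapped */
        if (p == NULL)
                return 1;
        printf("after calloc:   %ld kB\n", maxrss_kb());

        memset(p, 1, size);             /* every page gets faulted in */
        printf("after touching: %ld kB\n", maxrss_kb());

        free(p);
        return 0;
}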

What about ZERO_PAGE? It is NOT used as a result of mmap. It can be used later if there is a page fault from an attempt to read. In particular, ZERO_PAGE is NOT USED in the trivial benchmark just callocing a lot of memory.

The following should give a rough idea what happens:
# malloc + memset
$ time ./mtest m
./mtest m  0.18s user 1.94s system 95% cpu 2.227 total

# calloc
$ time ./mtest c
./mtest c  0.02s user 0.19s system 83% cpu 0.260 total

# calloc + read a byte from each page
$ time ./mtest C
./mtest C  0.11s user 0.41s system 95% cpu 0.540 total

# calloc + write a byte to each page
$ time ./mtest V
./mtest V  0.12s user 1.95s system 92% cpu 2.224 total

The reader will forgive the lack of statistical validity of these measurements. They are good enough for this blog post.

In particular, we see that malloc + memset, which ends up touching all the pages, is roughly equivalent to calloc + writing a byte to each page. Just doing calloc is by far the fastest, and that's because it maps almost nothing. Doing a read from each page takes a page fault and maps ZERO_PAGE for us, which is still noticeably cheaper than having to supply a freshly zeroed page for each write.

Also, let's see what happens if we repeatedly alloc and free with either malloc + memset or calloc. The allocator will reuse the buffer. But then calloc has to zero it out. And indeed:

$ time ./use-malloc
./use-malloc  6.21s user 0.00s system 99% cpu 6.222 total

$ time ./use-calloc
./use-calloc  6.17s user 0.00s system 99% cpu 6.176 total


As expected, the performance is about the same.

To sum up:
- stick to the advice given at the beginning
- claims that calloc is faster, without stating the conditions under which that is true, are misleading at best
- if calloc is faster, it is typically because it sees it can get away without zeroing the buffer, which gives an additional speedup if the pages backing the allocation end up never being accessed
- benchmarks of memory allocation without accessing the memory later don't show the actual cost

mtest.c:
#include <err.h>
#include <stdlib.h>
#include <string.h>

#define ALLOC   (256 * 1024 * 1024)
#define NBUFS   10

static void
handle_malloc(char **bufs, int nbufs, size_t size)
{
        char *p;
        int i;

        for (i = 0; i < nbufs; i++) {
                p = malloc(size);
                if (p == NULL)
                        err(1, "malloc");
                memset(p, 0, size);
                bufs[i] = p;
        }
}

static void
handle_calloc(char **bufs, int nbufs, size_t size)
{
        char *p;
        int i;

        for (i = 0; i < nbufs; i++) {
                p = calloc(1, size);
                if (p == NULL)
                        err(1, "calloc");
                bufs[i] = p;
        }
}

static void
touch_read(char **bufs, int nbufs, size_t size)
{
        int i;
        size_t j;
        unsigned char crap = 0; /* a meh way to force a read */

        for (i = 0; i < nbufs; i++) {
                for (j = 0; j < size; j += 4096)
                        crap += bufs[i][j];
        }
}

static void
touch_write(char **bufs, int nbufs, size_t size)
{
        int i;
        size_t j;

        for (i = 0; i < nbufs; i++) {
                for (j = 0; j < size; j += 4096)
                        bufs[i][j] = 1;
        }
}
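
/*
 * Modes: m and M both do malloc + memset, c does calloc only,
 * C does calloc and then reads a byte from each page,
 * V does calloc and then writes a byte to each page.
 */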

int
main(int argc, char **argv)
{
        char *bufs[NBUFS];

        if (argc != 2 || argv[1][1] != 0)
                return 1;

        switch (argv[1][0]) {
        case 'm':
                handle_malloc(bufs, NBUFS, ALLOC);
                break;
        case 'M':
                handle_malloc(bufs, NBUFS, ALLOC);
                break;
        case 'c':
                handle_calloc(bufs, NBUFS, ALLOC);
                break;
        case 'C':
                handle_calloc(bufs, NBUFS, ALLOC);
                touch_read(bufs, NBUFS, ALLOC);
                break;
        case 'V':
                handle_calloc(bufs, NBUFS, ALLOC);
                touch_write(bufs, NBUFS, ALLOC);
                break;
        default:
                errx(1, "wrong opt");
        }

        return 0;
}


malloc-calloc.c:
#include <stdlib.h>
#include <string.h>

#define    ALLOC    (1024 * 1024)
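
/*
 * Presumably built twice: with -DUSE_MALLOC as use-malloc and without it
 * as use-calloc; the exact build commands are not shown in the post.
 */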

int
main(void)
{
    int i;
    void *p;

    for (i = 0; i < 100000; i++) {
#ifdef USE_MALLOC
        p = malloc(ALLOC);

        memset(p, 0, ALLOC);
#else
        p = calloc(1, ALLOC);
#endif
        free(p);
    }

    return 0;
}


Saturday, July 9, 2016

linux sysadmin interview questions

Most of these questions are already well known, for which I somewhat apologize. The purpose of this post is to collect questions I consider to be of reasonable to high quality.

Questions are provided in a shortened form in order to reduce repetition. The interviewer is supposed to press for details with "why", "does it have to be true/false", "what does this depend on" and the like, as applicable. The candidate should effectively be made to explain their understanding of the issue at hand. These are not "yes or no" questions.

I'll start off with my own question (I realize that most likely someone else came up with an equivalent):

1. A classic one-liner is of the form: ps auxw | grep foo. Can grep foo appear in the result?

An excellent opportunity to discuss pipes + fork + exec + races.

And now stuff you likely heard at least once:

2. Is cd a shell built-in?

3. A particular server has 20 zombie processes. What do you do?

4. A junior sysadmin says they ran ps and it appears to have hung and cannot be killed with ctrl+c. You have an ssh session on the affected machine. Diagnose.

ps can really hang for a multitude of reasons. The (real world!) scenario I like is a hang on a /proc/<pid>/cmdline read. Note that the candidate's own ps is also expected to hang.

5. You see a process using up 100% CPU in top. Diagnose.

It can be e.g. an in-house program which keeps stat'ing the same nonexistent file for some reason.

6. load average is 500. Diagnose.

A suggested classic is a hung NFS server.

7. Running any binary appears to hang. You got one last operational ssh session. Diagnose.

Only shell built-ins can be used.

8. You are given a binary. How do you find out what it does?

9. Why is suid not honoured when executing scripts?

10. ping crap.org

The classic. If you don't want network talk, I suggest uname.

11. You run strace -p <valid pid> and it does not give you any output. What now?