Friday, April 10, 2015

zombie processes

From time to time I encounter people spotting accumulating zombie processes, they proceed to kill -9 them, which of course does not change anything.

Each process, apart from init (pid 1) has a parent process. Children (if any) of an exiting process are reparented to init. Once a process exits it becomes a zombie, waiting for its parent to wait(2) for it [1]. Init reaps reparented processes automatically.

This means that persisting zombie processes are an indication of an actual problem.

What you can do from sysadmin perspective: first and foremost inspect the parent. If the process is happily executing code in userspace, there is very likely a bug in the application. Nothing you can do about it, contact your developers. If you just kill the process, zombies will be gone, along with debug info your developers would want to see. Obtaining a coredump still may not be sufficient, it's best to let them investigate live system if possible.

I was told about a "solution" which actually was reaping unwanted zombies even if the parent did not do it. As suspected it works as follows: it attaches to parent process with ptrace(2) and injects a wait4(2) call. This risks further damaging the state of problematic process -- for instance it can have a table with relevant children information and now one of these children is gone. What happens if the child pid gets reused afterwards?

But let's say the process in question is not running circles in userspace. It can be blocked in the kernel in a relatively self-explanatory place.

If the backtrace does not point out an obvious culprit, C language experience will be required. That paired with no fear of the kernel can bring you a long way, so it's worth giving it a shot if only for fun.

Let's evaluate some kernel stacktraces (/proc/<pid>/stack).
[<ffffffffa01b2dc9>] rpc_wait_bit_killable+0x39/0xa0 [sunrpc]
[<ffffffffa01b70b2>] __rpc_execute+0x202/0x750 [sunrpc]
[<ffffffffa01b7bc9>] rpc_execute+0x89/0x240 [sunrpc]
[<ffffffffa01a8d00>] rpc_run_task+0x70/0x90 [sunrpc]
[<ffffffffa01a8d70>] rpc_call_sync+0x50/0xc0 [sunrpc]
[<ffffffffa0485616>] nfs3_rpc_wrapper.constprop.11+0x86/0xd0 [nfsv3]
[<ffffffffa04859d4>] nfs3_proc_access+0xc4/0x1a0 [nfsv3]
[<ffffffffa0429229>] nfs_do_access+0x3c9/0x850 [nfs]
[<ffffffffa04298a1>] nfs_permission+0x1c1/0x2b0 [nfs]
[<ffffffff8126c642>] __inode_permission+0x72/0xd0
[<ffffffff8126c6b8>] inode_permission+0x18/0x50
[<ffffffff8126f306>] link_path_walk+0x266/0x860
[<ffffffff8126f9bc>] path_init+0xbc/0x840
[<ffffffff81272325>] path_openat+0x75/0x620
[<ffffffff81273ec9>] do_filp_open+0x49/0xc0
[<ffffffff8125fa5d>] do_sys_open+0x13d/0x230
[<ffffffff8125fb6e>] SyS_open+0x1e/0x20
[<ffffffff817ec22e>] system_call_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff
Here we see that the process in question tried to open a file and path lookup lead it to a nfs share, at which point it blocked. While this does not tell you which share is causing trouble [2], you can narrow it down no problem.

[<ffffffff81263405>] __sb_start_write+0x195/0x1f0
[<ffffffff81286a24>] mnt_want_write+0x24/0x50
[<ffffffff81271adb>] do_last+0xbeb/0x13c0
[<ffffffff81272346>] path_openat+0x96/0x620
[<ffffffff81273ec9>] do_filp_open+0x49/0xc0
[<ffffffff8125fa5d>] do_sys_open+0x13d/0x230
[<ffffffff8125fb6e>] SyS_open+0x1e/0x20
[<ffffffff817ec22e>] system_call_fastpath+0x12/0x76
[<ffffffffffffffff>] 0xffffffffffffffff 
Again a file lookup, but this time we got no indication where [2].
Just like with nfs you can just spawn a new shell and probe stuff from that level. But what is it? For those who don't want to do a little digging: sfserrmr be fbzrguvat ryfr hfvat guvf shapvbanyvgl

Bottom line is, if you see accumulating zombie processes, don't try to kill them. Inspect your system instead and possibly report a bug to your developers. Chances are you will solve the actual problem on your own.

[1] There are ways in which process can make its children reaped automatically without waiting, also these days situation on Linux and FreeBSD is less trivial -- there are ways to reap foreign children, e.g. with process descriptors or prctl(2) (see PR_SET_CHILD_SUBREAPER)

[2] Unfortunately Linux does not provide any nice way to obtain such information at the moment. I have some ideas how to change it which I may describe in future posts. Interested parties can obtain this information in a way which may seem scary at first, but is quite trivial and largely safe & accurate - you can run the debugger on live kernel (named 'crash'), dump mount list (mount), dump the stack of problematic thread (bt -f) and eye-grep. Of course a proper way would require disassembling some code to be sure what is what on the stack.

No comments:

Post a Comment