Adventures in Hooking Process and Thread Spawn Events
20 May 2021

Introduction
A while ago, I was working on a project that maintained some state about every process running on a system. This required me to collect metadata whenever a process or thread was spawned. I thought it would be pretty straightforward, but it turned out to be trickier than I expected. Through trial and error (plus a little help from internet friends!) I learned a bunch, so I took notes in the hope that maybe one day someone else will find this useful too – especially because some of what I ended up doing wasn't really well documented in one place.
The post is a bit long, so if you'd like to skip ahead, the tl;dr is: if the data you're looking for is available in a tracepoint, save yourself some trouble and just use that.
Attempt #1: kprobe on execve
Kprobes are neat; they're basically breakpoints that can be set on any instruction in the kernel. In practice, they're very similar to userland breakpoints in that the target address is replaced with a breakpoint instruction (e.g. int3 on x86) and control is handed off to the kprobe handler. Two great resources on kprobes are the kernel docs and this blog post by Julia Evans.
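To make attempt #1 concrete, a bcc-style kprobe program for this would look roughly like the following. This is an illustrative sketch rather than the project's actual code, and the execve symbol to attach to (e.g. __x64_sys_execve) varies by kernel version and architecture.

#include <uapi/linux/ptrace.h>

// Runs whenever execve is entered. Attach it from bcc's Python side with
// attach_kprobe() against whatever execve symbol your kernel exposes
// (get_syscall_fnname("execve") will figure that out for you).
int on_execve(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    // The upper 32 bits hold the tgid (the "process ID" in userland terms);
    // the lower 32 bits hold the individual task's pid.
    bpf_trace_printk("execve by pid %d\n", (u32)(id >> 32));
    return 0;
}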
Anyway, placing a breakpoint on the execve syscall isn't enough; as you may know, execve only covers processes, not threads. For threads we'll have to start looking at clone(2). For a quick primer on the difference between clone and execve, check out this StackOverflow question.
Verdict: Fail. By design, intercepting execve calls misses out on threads.
Attempt #2: kretprobe on clone
Kretprobes are like kprobes in that they use the same underlying mechanism, with one major difference: as the name might imply, kretprobes let you break whenever a function returns. They work by installing a kprobe at the function's entrypoint that sneakily replaces the function's return address on the stack with a pointer to the kretprobe handler.
We need a kretprobe because, unlike with execve, a clone-spawned thread's information is only available when the function returns. For example, to get the PID of a process spawned via execve, you can just look at the calling process's PID even before the syscall executes; with clone, you have to wait for it to return. We can confirm this by reading _do_fork (the function clone relies on) and noticing that the new task's information only becomes available part way through.
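In bcc-style C, the return side has roughly this shape. It's a sketch with a made-up handler name; you'd attach it with attach_kretprobe() to your kernel's clone symbol.

#include <uapi/linux/ptrace.h>

// By the time a kretprobe handler runs, the new task exists and
// PT_REGS_RC(ctx) holds clone(2)'s return value.
int on_clone_return(struct pt_regs *ctx) {
    long child = PT_REGS_RC(ctx);
    if (child <= 0)
        return 0;  // clone failed (or there's nothing useful to report)
    // Careful: this ID is relative to the caller's PID namespace (see below).
    bpf_trace_printk("clone returned %ld\n", child);
    return 0;
}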
This works, with one major caveat: the value clone returns is relative to the caller's PID namespace. If you're monitoring thread activity from outside that namespace (e.g. a kubernetes daemonset spying on pods), the ID only means something inside that specific namespace, so you'll get a nonexistent process at best and the wrong process at worst. Without some additional leg work (such as this), this value is unfit for our use.
Verdict: Fail. clone(2)'s return values are only valid in the PID namespace from which it was called.
Attempt #3: kretprobe on sched_fork
Unlike execve and clone, sched_fork isn't a syscall. It's the internal kernel function that clone (and fork) go through to help populate the task_struct that's used to keep track of every process and thread on the system.
This one was pretty interesting because we're looking at an internal kernel function, which typically has no documentation beyond whatever comments are in the source. It's also a good example of using kprobes on functions that aren't syscalls.
int sched_fork(unsigned long clone_flags, struct task_struct *p)
The function takes a pointer to a task_struct and initializes some of its values. Once that's done, you should be able to read p->pid and p->tgid. Except for some reason, when I tried this, I got zero every time. I don't know what I did wrong, since I was able to adapt my code to other internal kernel functions and inspect their parameters just fine.
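For reference, reading fields off a struct argument in a bcc-style probe looks roughly like this (an illustrative kprobe sketch rather than my exact kretprobe program; either way, the values I got back were always zero).

#include <uapi/linux/ptrace.h>
#include <linux/sched.h>

// bcc lets you declare the probed function's parameters after ctx and
// rewrites the accesses for you; p is sched_fork's task_struct argument.
int on_sched_fork(struct pt_regs *ctx, unsigned long clone_flags,
                  struct task_struct *p) {
    int pid = 0, tgid = 0;
    bpf_probe_read(&pid, sizeof(pid), &p->pid);
    bpf_probe_read(&tgid, sizeof(tgid), &p->tgid);
    bpf_trace_printk("sched_fork: pid=%d tgid=%d\n", pid, tgid);
    return 0;
}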
I’m kind of glad this didn’t work. As I mentioned earlier, this is an internal function. This means unlike syscalls, there’s little to no documentation and no stable interface. There’s no reason a kprobe/kretprobe wouldn’t work on a function like this (they are just breakpoints, after all!) but there’s also no guarantee the code I’m breaking on will look or behave the same way between kernel releases.
At this point I considered giving up, but someone in a chat room suggested I look into tracepoints so that’s what I did next.
Verdict: Fail. It didn't work, and I don't really know why. :(
Attempt #4: tracepoint on sched:sched_process_fork
Tracepoints are like kprobes in that they're pretty much just breakpoints, except they've already been set for you. Strictly speaking, they're not actually breakpoints, but rather specific locations that Linux kernel developers have decided are useful areas for introspection. To mark these spots, there's a TRACE_EVENT macro developers can place wherever they think users might want to hook in. This ~10 year old LWN article is a pretty good primer on tracepoints.
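For a sense of what these look like, here's the definition behind the tracepoint I ended up using, lightly abridged from include/trace/events/sched.h (paraphrased, so check your kernel's source for the exact version):

TRACE_EVENT(sched_process_fork,

    // Arguments the kernel passes in when the tracepoint fires.
    TP_PROTO(struct task_struct *parent, struct task_struct *child),
    TP_ARGS(parent, child),

    // The fields recorded for each event -- this is where the "format"
    // file we'll look at below comes from.
    TP_STRUCT__entry(
        __array(char,  parent_comm, TASK_COMM_LEN)
        __field(pid_t, parent_pid)
        __array(char,  child_comm,  TASK_COMM_LEN)
        __field(pid_t, child_pid)
    ),

    // How to populate those fields when the event fires.
    TP_fast_assign(
        memcpy(__entry->parent_comm, parent->comm, TASK_COMM_LEN);
        __entry->parent_pid = parent->pid;
        memcpy(__entry->child_comm, child->comm, TASK_COMM_LEN);
        __entry->child_pid = child->pid;
    ),

    TP_printk("comm=%s pid=%d child_comm=%s child_pid=%d",
              __entry->parent_comm, __entry->parent_pid,
              __entry->child_comm, __entry->child_pid)
);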
There are entire categories of tracepoints, for example:
- sched: task scheduler events such as processes context switching, hanging, exiting, etc.
- tcp: tcp/networking events such as retransmits and resets
- ext4/btrfs/etc: filesystem events like inode allocations/frees
- random: RNG related events such as requests for random bytes and the status of the entropy pool
For this, I'll be focusing on the sched event: sched_process_fork.
There are a few things that make tracepoints really useful, and often better than kprobes:
- tracepoints are self-documenting, unlike internal kernel functions and their kprobes
- tracepoints are stable; the code around them may change but the data returned by them will not
- tracepoints are designed to give you all the context you need whenever they’re executed, so you don’t have to cobble it together yourself
With kprobes, you're generally limited to the arguments and return values of the functions you're looking at; this is usually fine, but it can get tricky when the data you need is split between a function's entrypoint (i.e. its parameters) and its return value. You end up with a couple of options here: find another function that has all the context you need in one call, or track state across function entry and return using BPF maps (sketched below).
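The second option usually has this shape (a sketch of the pattern only; the probed function and its argument are placeholders):

#include <uapi/linux/ptrace.h>

// Sketch of the "track state across entry and return" pattern with a BPF
// map: stash an entry argument keyed by the calling task, then pick it up
// again in the return probe.
BPF_HASH(inflight, u64, unsigned long);

int on_entry(struct pt_regs *ctx, unsigned long some_arg) {
    u64 id = bpf_get_current_pid_tgid();
    inflight.update(&id, &some_arg);   // stash the argument, keyed by caller
    return 0;
}

int on_return(struct pt_regs *ctx) {
    u64 id = bpf_get_current_pid_tgid();
    unsigned long *arg = inflight.lookup(&id);
    if (arg == 0)
        return 0;                      // we missed the entry; nothing to do
    // Here we finally have both the entry argument and the return value.
    bpf_trace_printk("arg=%lu ret=%ld\n", *arg, PT_REGS_RC(ctx));
    inflight.delete(&id);
    return 0;
}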
I started by looking through the sched category to see what events were available:
$ grep TRACE_EVENT include/trace/events/sched.h
TRACE_EVENT(sched_kthread_stop,
TRACE_EVENT(sched_kthread_stop_ret,
TRACE_EVENT(sched_switch,
TRACE_EVENT(sched_migrate_task,
TRACE_EVENT(sched_process_wait,
TRACE_EVENT(sched_process_fork,
TRACE_EVENT(sched_process_exec,
TRACE_EVENT(sched_pi_setprio,
TRACE_EVENT(sched_process_hang,
TRACE_EVENT(sched_swap_numa,
TRACE_EVENT(sched_wake_idle_without_ipi,
...
sched_process_fork looks like what I need. Note that I grepped through a header file in the kernel source only because I already had the source code open; you can navigate the same information using just your shell. We can see what the categories are, what tracepoints are available for each category, the data each tracepoint provides, and even a copy/pastable format string for easy debugging:
$ ls /sys/kernel/debug/tracing/events/
alarmtimer fib irq_matrix net rpm timer
block fib6 irq_vectors nmi sched tlb
...
exceptions iommu msr regmap thermal
ext4 irq napi regulator thermal_power_allocator
$ ls /sys/kernel/debug/tracing/events/sched/
enable sched_process_exec sched_stat_iowait sched_wait_task
filter sched_process_exit sched_stat_runtime sched_wake_idle_without_ipi
sched_kthread_stop sched_process_fork sched_stat_sleep sched_wakeup
sched_kthread_stop_ret sched_process_free sched_stat_wait sched_wakeup_new
sched_migrate_task sched_process_hang sched_stick_numa sched_waking
sched_move_numa sched_process_wait sched_swap_numa
sched_pi_setprio sched_stat_blocked sched_switch
$ ls /sys/kernel/debug/tracing/events/sched/sched_process_fork
enable filter format hist id trigger
$ cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/format
name: sched_process_fork
ID: 299
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:char parent_comm[16]; offset:8; size:16; signed:1;
field:pid_t parent_pid; offset:24; size:4; signed:1;
field:char child_comm[16]; offset:28; size:16; signed:1;
field:pid_t child_pid; offset:44; size:4; signed:1;
print fmt: "comm=%s pid=%d child_comm=%s child_pid=%d", REC->parent_comm, REC->parent_pid, REC->child_comm, REC->child_pid
To begin processing that data, I'll need a struct matching the format "file" (it's not actually a file on disk!). This format is generally the most stable source of this information, so hardcoding a struct is fine, but the neat thing is you could also generate the required structs dynamically by consulting the format files at runtime, with no kernel header parsing needed. That shouldn't be necessary, but it's always an option!
Using the fields from the format file, I defined a struct and created a little program to test it:
// Mirrors the field layout from the tracepoint's format file above.
struct sched_process_fork {
    unsigned short common_type;
    unsigned char common_flags;
    unsigned char common_preempt_count;
    int common_pid;
    char parent_comm[16];
    u32 parent_pid;
    char child_comm[16];
    u32 child_pid;
};

// Handler invoked every time sched:sched_process_fork fires.
int sched_process_fork(struct sched_process_fork *ctx) {
    bpf_trace_printk("pid = %ld\n", ctx->child_pid);
    return 0;
}
I ran this program and monitored /sys/kernel/debug/tracing/trace_pipe to confirm ctx->child_pid contained what I was looking for. Independent of this program, I double-checked the validity of this output using bpftrace:
$ sudo bpftrace -e \
'tracepoint:sched:sched_process_fork { printf("%s %ld\n", args->child_comm, args->child_pid) }'
Using bpftrace first can save you some time up front: it lets you quickly verify that the event you're looking for exists and contains the data you need, all before you've written any code.
You might notice that I originally wanted to capture processes and threads, but I'm only looking at pid here. This is because in the kernel, pid refers to an individual task, which may be either a process or a thread. And it turns out this works great! The thread ID/process ID is no longer specific to an individual PID namespace; it's a valid value in the root namespace. It's actually safe to use!
Verdict: Pass. It works! The tracepoint returns a thread ID that's valid in the root namespace!
Conclusion
My takeaway for folks looking to do this type of introspection is to skip the kprobes and go straight to tracepoints. Kprobes might be trendy and they definitely have their uses, but you’ll be more productive trying out tracepoints first. They’re easier to use, require less boilerplate, and have a more stable interface. Realistically, you probably won’t even need to read any kernel source code to get a working proof of concept going.