Adventures in Hooking Process and Thread Spawn Events

Introduction

A while ago, I was working on a project that maintained some state about every process running on a system. This required me to collect metadata whenever a process or thread was spawned. I thought it would be pretty straightforward, but it turned out to be trickier than I expected. Through trial and error (plus a little help from internet friends!) I learned a bunch, so I took notes in the hope that one day someone else will find them useful too, especially because some of what I ended up doing wasn't well documented in any one place.

The post is a bit long, so if you'd like to skip ahead, the tl;dr is: if the data you're looking for is available in a tracepoint, save yourself some trouble and use that.

Attempt #1: kprobe on execve

Kprobes are neat; they're basically breakpoints that can be set on nearly any instruction in the kernel. In practice, they're very similar to userland breakpoints in that the target address is replaced with a breakpoint instruction (e.g. int3 on x86) and control is handed off to the kprobe handler. Two great resources on kprobes are the kernel docs and this blog post by Julia Evans.
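To get a feel for how little machinery this needs, you can create a kprobe straight from the shell through the kprobe_events tracefs interface (requires root; the probe name here is made up, and the exact syscall symbol varies by kernel and architecture):

```shell
# Register a kprobe named "my_exec" on the execve syscall entry.
# The symbol is __x64_sys_execve on modern x86-64 kernels; check
# /proc/kallsyms if this doesn't match your system.
echo 'p:my_exec __x64_sys_execve' >> /sys/kernel/debug/tracing/kprobe_events

# Enable it and watch it fire
echo 1 > /sys/kernel/debug/tracing/events/kprobes/my_exec/enable
cat /sys/kernel/debug/tracing/trace_pipe

# Clean up when done
echo 0 > /sys/kernel/debug/tracing/events/kprobes/my_exec/enable
echo '-:my_exec' >> /sys/kernel/debug/tracing/kprobe_events
```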

Anyway, placing a breakpoint on the execve syscall isn’t enough; as you may know, this is just for processes, not threads. For threads we’ll have to start looking at clone(2). For a quick primer on the difference between clone and execve, check out this StackOverflow question.

Verdict: Fail. By design, intercepting execve calls misses threads.

Attempt #2: kretprobe on clone

Kretprobes are like kprobes in that they use the same underlying mechanism, with one major difference: as the name implies, kretprobes let you break whenever a function returns. They work by installing a kprobe at the function's entrypoint that sneakily replaces the function's return address on the stack with a pointer to the kretprobe handler.

We need a kretprobe because, unlike with execve, a clone-spawned task's information is only available when the function returns. With execve the calling process's PID doesn't change, so you can read it even before the syscall executes; with clone, the child's PID doesn't exist until the call returns. We can confirm this by reading _do_fork (the function clone relies on) and noticing that the child's process information only becomes available partway through.
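To make this concrete, here's roughly the kind of kretprobe check this attempt boils down to, as a bpftrace one-liner (a sketch; it assumes a kernel where clone is implemented via _do_fork — newer kernels renamed it kernel_clone):

```
sudo bpftrace -e 'kretprobe:_do_fork { printf("%s -> child pid %d\n", comm, retval); }'
```

The return value of _do_fork is the child's PID, which is exactly where the namespace caveat below comes in.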

This works, with one major caveat… clone's return value is the child's PID as seen from the caller's PID namespace. If the target process lives in its own PID namespace and you're monitoring thread activity from outside it (e.g. a kubernetes daemonset spying on pods), the value you capture is only meaningful inside that namespace: you'll get an invalid PID at best and the wrong process at worst. Without some additional legwork (such as this), this value is unfit for our use.

Verdict: Fail. clone(2)'s return value is only valid in the PID namespace from which it was called.

Attempt #3: kretprobe on sched_fork

Unlike execve and clone, sched_fork isn't a syscall. It's an internal kernel function called during task creation to help populate the task_struct that keeps track of every process and thread on the system.

This one was pretty interesting because we're looking at an internal kernel function, which typically has no documentation beyond the comments in the source. It's also a good example of using kprobes on functions that aren't syscalls.

int sched_fork(unsigned long clone_flags, struct task_struct *p)

The function takes a pointer to a task_struct and initializes some of its values. Once that's done, you should be able to read p->pid and p->tgid. Except, for some reason, when I tried this I got zero every time. I don't know what I did wrong, since I was able to adapt my code to other internal kernel functions and inspect their parameters just fine.

I’m kind of glad this didn’t work. As I mentioned earlier, this is an internal function. This means unlike syscalls, there’s little to no documentation and no stable interface. There’s no reason a kprobe/kretprobe wouldn’t work on a function like this (they are just breakpoints, after all!) but there’s also no guarantee the code I’m breaking on will look or behave the same way between kernel releases.

At this point I considered giving up, but someone in a chat room suggested I look into tracepoints so that’s what I did next.

Verdict: Fail. It didn't work, and I don't really know why. :(

Attempt #4: tracepoint on sched:sched_process_fork

Tracepoints are like kprobes in that they're pretty much just breakpoints that have already been set for you. Strictly speaking, they're not actually breakpoints, but rather specific locations that Linux kernel developers have decided are useful points for introspection. To support this, there's a TRACE_EVENT macro developers can place wherever they think users might want to hook in. This ~10 year old LWN article is a pretty good primer on tracepoints.
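As an example, here's an abridged version of the tracepoint we'll end up using, from include/trace/events/sched.h (the TP_fast_assign and TP_printk portions are elided; check your kernel's source for the full macro):

```c
TRACE_EVENT(sched_process_fork,

        TP_PROTO(struct task_struct *parent, struct task_struct *child),

        TP_ARGS(parent, child),

        TP_STRUCT__entry(
                __array(char,  parent_comm, TASK_COMM_LEN)
                __field(pid_t, parent_pid)
                __array(char,  child_comm,  TASK_COMM_LEN)
                __field(pid_t, child_pid)
        ),
        ...
);
```

Note how the TP_STRUCT__entry fields line up exactly with the format file we'll look at shortly.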

There are entire categories of tracepoints: scheduling, block I/O, networking, timers, and more. For this, I'll be focusing on one event from the sched category: sched_process_fork.

There are a few things that make tracepoints really useful, and often better than kprobes:

  1. tracepoints are self-documenting, unlike internal kernel functions and their kprobes
  2. tracepoints are stable; the code around them may change but the data returned by them will not
  3. tracepoints are designed to give you all the context you need whenever they’re executed, so you don’t have to cobble it together yourself

With kprobes, you're generally limited to the arguments and return values of the functions you're looking at. This is usually fine, but it gets tricky when the data you need is split between a function's entrypoint (i.e. its parameters) and its return value. In that case you generally have two options: find another function that has all the context you need in one call, or keep state between function entry and return using BPF maps.
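The second option can be sketched in BCC-style BPF C. The function name here is hypothetical; the pattern is just to stash entry-time data in a map keyed by the current thread ID and pick it up in the return probe:

```c
// Correlate a function's argument with its return value via a map.
BPF_HASH(inflight, u64, u64);

int kprobe__some_func(struct pt_regs *ctx, u64 interesting_arg) {
    u64 tid = bpf_get_current_pid_tgid();
    inflight.update(&tid, &interesting_arg);  // remember the argument
    return 0;
}

int kretprobe__some_func(struct pt_regs *ctx) {
    u64 tid = bpf_get_current_pid_tgid();
    u64 *arg = inflight.lookup(&tid);
    if (arg == 0)
        return 0;                             // entry wasn't seen
    bpf_trace_printk("arg=%llu ret=%ld\n", *arg, PT_REGS_RC(ctx));
    inflight.delete(&tid);                    // clean up
    return 0;
}
```

It works, but it's boilerplate you have to get right yourself, which is exactly what tracepoints spare you from.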

I started by looking through the sched category to see what events were available:

$ grep TRACE_EVENT include/trace/events/sched.h
TRACE_EVENT(sched_kthread_stop,
TRACE_EVENT(sched_kthread_stop_ret,
TRACE_EVENT(sched_switch,
TRACE_EVENT(sched_migrate_task,
TRACE_EVENT(sched_process_wait,
TRACE_EVENT(sched_process_fork,
TRACE_EVENT(sched_process_exec,
TRACE_EVENT(sched_pi_setprio,
TRACE_EVENT(sched_process_hang,
TRACE_EVENT(sched_swap_numa,
TRACE_EVENT(sched_wake_idle_without_ipi,
...

sched_process_fork looks like what I need. Note, though, that I grepped through a header file in the kernel source only because I already had it open; you can easily browse the same information using just your shell. Through tracefs, we can see what the categories are, what tracepoints are available for each category, the data each tracepoint provides, and even a copy/pastable format string for easy debugging:

$ ls /sys/kernel/debug/tracing/events/
alarmtimer  fib           irq_matrix   net             rpm                      timer
block       fib6          irq_vectors  nmi             sched                    tlb
...
exceptions  iommu         msr          regmap          thermal
ext4        irq           napi         regulator       thermal_power_allocator

$ ls /sys/kernel/debug/tracing/events/sched/
enable                  sched_process_exec  sched_stat_iowait   sched_wait_task
filter                  sched_process_exit  sched_stat_runtime  sched_wake_idle_without_ipi
sched_kthread_stop      sched_process_fork  sched_stat_sleep    sched_wakeup
sched_kthread_stop_ret  sched_process_free  sched_stat_wait     sched_wakeup_new
sched_migrate_task      sched_process_hang  sched_stick_numa    sched_waking
sched_move_numa         sched_process_wait  sched_swap_numa
sched_pi_setprio        sched_stat_blocked  sched_switch

$ ls /sys/kernel/debug/tracing/events/sched/sched_process_fork
enable  filter  format  hist  id  trigger

$ cat /sys/kernel/debug/tracing/events/sched/sched_process_fork/format
name: sched_process_fork
ID: 299
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:char parent_comm[16];     offset:8;       size:16;        signed:1;
        field:pid_t parent_pid; offset:24;      size:4; signed:1;
        field:char child_comm[16];      offset:28;      size:16;        signed:1;
        field:pid_t child_pid;  offset:44;      size:4; signed:1;

print fmt: "comm=%s pid=%d child_comm=%s child_pid=%d", REC->parent_comm, REC->parent_pid, REC->child_comm, REC->child_pid

To begin processing that data, I'll need to define a struct matching the layout described in the format "file" (it's not actually a file on disk!). A neat side effect of this being the authoritative description of the layout is that you could even generate the required structs dynamically by consulting the format files at runtime, with no kernel header parsing needed. That shouldn't usually be necessary, but it's always an option!
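For instance, here's a rough sketch of pulling field names, offsets, and sizes out of format-file lines with nothing but awk (fed a couple of lines inline here so the example is self-contained; in practice you'd point it at the real format file, e.g. /sys/kernel/debug/tracing/events/sched/sched_process_fork/format):

```shell
cat <<'EOF' | awk -F';' '/field:/ {
    split($1, f, " "); name = f[length(f)]   # last word of "field:..."
    split($2, o, ":"); split($3, s, ":")     # "offset:N" and "size:N"
    printf "%s offset=%d size=%d\n", name, o[2], s[2]
}'
        field:pid_t parent_pid; offset:24;      size:4; signed:1;
        field:pid_t child_pid;  offset:44;      size:4; signed:1;
EOF
```

This prints "parent_pid offset=24 size=4" and "child_pid offset=44 size=4", which is everything you need to lay out a matching struct.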

With the format: data, I defined a struct and created a little program to test it:

struct sched_process_fork {
    unsigned short common_type;
    unsigned char common_flags;
    unsigned char common_preempt_count;
    int common_pid;
    char parent_comm[16];
    pid_t parent_pid;  /* signed, per the format file */
    char child_comm[16];
    pid_t child_pid;   /* signed, per the format file */
};

int sched_process_fork(struct sched_process_fork *ctx) {
    bpf_trace_printk("pid = %d\n", ctx->child_pid);
    return 0;
}

I ran this program and monitored /sys/kernel/debug/tracing/trace_pipe to confirm ctx->child_pid contained what I was looking for. Independent of this program, I double checked the validity of this output using bpftrace:

$ sudo bpftrace -e \
     'tracepoint:sched:sched_process_fork { printf("%s %d\n", args->child_comm, args->child_pid) }'

Using bpftrace first can save you some time up front, since it lets you quickly verify that the event you're looking for exists and contains the data you need, all before you've written any code.

You might notice that I originally wanted to capture processes and threads, but I'm looking at just a pid here. This is because in the kernel, a pid refers to an individual task, which may be either a process or a thread (what userland calls a process ID, the kernel calls a tgid). And it turns out this works great: the thread ID/process ID is no longer specific to an individual PID namespace, but is valid in the root namespace. It's actually safe to use!

Verdict: Pass. It works! The tracepoint reports a thread ID that's valid in the root namespace.

Conclusion

My takeaway for folks looking to do this type of introspection is to skip the kprobes and go straight to tracepoints. Kprobes might be trendy and they definitely have their uses, but you’ll be more productive trying out tracepoints first. They’re easier to use, require less boilerplate, and have a more stable interface. Realistically, you probably won’t even need to read any kernel source code to get a working proof of concept going.