Building Container Images Securely on Kubernetes

Tuesday, March 20, 2018

A lot of people seem to want to be able to build container images in Kubernetes without mounting in the docker socket or doing anything to compromise the security of their cluster.

This was all brought to my attention when my awesome coworker Gabe Monroy and I were chatting with Michelle Noorali over pizza at Kubecon in Austin last December.

Here is pretty much how it went down:

Gabe: I’d love to switch our clusters to a lightweight runtime like 
containerd, but we need those docker build apis right now. I wish someone 
would come up with an unprivileged container image builder..

Me: Oh that’s easy

Gabe: Bullshit, if it was easy someone would have done it already. I’ve wanted 
this for years. Please pass the ranch dressing.

Me: I’m telling you you’re wrong. I’ll prove it to you. It’s easy.

Judgy Four Seasons Staff: Excuse me, can I help you?

Me: Nah we’re good. Actually if you could grab me a slice of that Papa John's 
jalapeno & pineapple that would be great.

.. next morning ..

100 lines of bash shaming in Gabe's inbox proving it could be done.

Prior Art

A few years ago when I worked at Docker, Stephen Day and Michael Crosby did a POC demo of a standalone image builder.

It actually still exists today in a fork of docker/distribution on Stephen’s github. It consisted of a dist command line tool for interacting with the registry, plus runc, combined with the awesome powers of bash like so (nsinit was runc before runc was A Thing):

#!/bin/bash

function FROM () {
    mkdir rootfs
    dist pull "$1" rootfs
}

function USERNS() {
    export nsinituserns="$1"
}

function CWD() {
    export nsinitcwd="$1"
}

function MEM() {
    export nsinitmem="$1"
}

function EXEC() {
    nsinit exec \
        --tty \
        --rootfs "$(pwd)/rootfs" \
        --create \
        --cwd="$nsinitcwd" \
        --memory-limit="$nsinitmem" \
        --memory-swap -1 \
        --userns-root-uid="$nsinituserns" \
        -- $@
}

function RUN() {
    t="\"$@\""
    EXEC sh -c "$t"
}

So in their demo, you would source the above bash script and then execute your Dockerfile like it was also a bash script. Pretty cool, right?
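To make the trick concrete, here is a minimal sketch of a Dockerfile being run as a shell script. The FROM and RUN functions below are logging stand-ins made up for illustration (they only echo what they would do), not the dist/nsinit versions above, and the two-line Dockerfile is hypothetical.

```shell
#!/bin/bash
# Stand-in instruction functions: they only log, so this runs anywhere.
# The real demo would call dist and nsinit here instead.
FROM() { echo "pull $1 into rootfs"; }
RUN()  { echo "exec in rootfs: $*"; }

# A tiny, hypothetical Dockerfile. Every line is also a valid shell command
# once FROM and RUN exist as functions.
dockerfile="$(mktemp)"
cat > "$dockerfile" <<'EOF'
FROM alpine
RUN apk add --no-cache git
EOF

# "Build" by sourcing the Dockerfile as if it were a shell script.
# (". file" is the portable spelling of "source file".)
build_output="$(. "$dockerfile")"
rm -f "$dockerfile"
echo "$build_output"
```

Each Dockerfile instruction simply resolves to a shell function of the same name, which is why the whole thing fits in so little bash.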

So that is what I sent to Gabe’s inbox to prove it was possible but also: “Look, I will make you something nice.”

Designing Something Nice

So I went out on my mission to make them something nice, which led me through a sea of existing tools. I collected all my findings in a design doc if you are curious as to what I think about the other existing tools.

I didn’t want to reinvent the world. I just wanted to make it unprivileged and a single binary with a simple user interface that could easily be switched out with docker.

Not all of my ideas are good. I first started on a FUSE snapshotter. Turns out FUSE kinda sucks…

I started playing with buildkit. It’s an awesome project. Tõnis Tiigi did a really stellar job on it and I thought to myself, “I definitely want to use this as the backend.”

Buildkit is more cache-efficient than Docker because it can execute multiple build stages concurrently with its internal DAG.

Then I stumbled upon Akihiro Suda’s patches for an unprivileged Buildkit. This was perfect for my use case.

I owe all these fine folks so much for the great work I got to build on top of. :)

And thus came img.

So that was all fine and dandy and it works great as unprivileged… on my host. Now I’m a huge fan of desktop tools and this actually filled a large void in my tooling: now I can build as unprivileged on my host without Docker.

But I still have to make this work in Kubernetes so I can make Gabe happy and fulfill my dreams of eating more pineapple and jalapeno pizzas at Kubecons.

Why is this problem so hard?

Let me go over in detail some of the patches needed to even make this work as unprivileged on my host.

For one, we need subuid and subgid maps. See @AkihiroSuda’s patch. We also need to setgroups. See @AkihiroSuda’s patch for that as well. Those allow us to use apt in unprivileged user namespaces.
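For context, subordinate ID ranges live in /etc/subuid and /etc/subgid as name:start:count lines. The sketch below only illustrates reading that format. The subid_range helper and the sample entries are made up, and it reads a temp file rather than your real /etc/subuid.

```shell
#!/bin/bash
# subid_range USER FILE: print "start count" for USER's first entry in a
# subuid/subgid-style file (lines look like name:start:count).
# Hypothetical helper, for illustration only.
subid_range() {
    awk -F: -v user="$1" '$1 == user { print $2, $3; exit }' "$2"
}

# Sample data standing in for /etc/subuid (these entries are invented).
subuid_file="$(mktemp)"
printf '%s\n' 'jess:100000:65536' 'gabe:165536:65536' > "$subuid_file"

range="$(subid_range jess "$subuid_file")"
rm -f "$subuid_file"
echo "jess may map: $range"
```

Those ranges are what newuidmap and newgidmap check before writing a process's uid_map and gid_map.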

Then if we want to use the containerd snapshotter backends and actually mount the filesystems as we diff them, we need unprivileged mounting, which can only be done from inside a user and mount namespace. So we need to create those namespaces at the start of our binary before we do anything else.
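You can see the same constraint from a shell with util-linux's unshare. This is a rough sketch, not what img does internally: it tries a tmpfs mount inside a fresh user and mount namespace, and reports "unsupported" if the kernel or sandbox refuses to create the namespaces. The try_userns_mount name is mine.

```shell
#!/bin/bash
# try_userns_mount: attempt a tmpfs mount as "root" inside a new user + mount
# namespace. Prints "mounted" on success, "unsupported" otherwise (for example
# when unprivileged user namespaces are disabled or unshare is missing).
# The mount lands on /tmp inside the new mount namespace only, so the host's
# /tmp is untouched.
try_userns_mount() {
    if unshare --user --map-root-user --mount \
        sh -c 'mount -t tmpfs tmpfs /tmp' >/dev/null 2>&1; then
        echo "mounted"
    else
        echo "unsupported"
    fi
}

result="$(try_userns_mount)"
echo "$result"
```

Running the same mount command without the unshare wrapper as a regular user fails, which is the whole reason the namespaces have to come first.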

Granted, mounting is not a requirement of building docker images. You can always go the route of orca-build and umoci and not mount at all. umoci is also an unprivileged image builder and was made long before I even made mine by the talented Aleksa Sarai, who is also responsible for a lot of the rootless containers work upstream in runc.

Getting this to work in containers…

img works on my host which is all fine and dandy but I gotta help my k8s pals do their builds…

Enter the next problem. For the record, all these problems apply to any builder that is using runc to launch containers as an unprivileged user.

The next issue involved not being able to mount proc inside a Docker container.

My first thought was “well it must be something Docker is doing”. So I isolated the problem, put it in a container and ten minutes after I dove into the rabbit hole I realized it was the fact that Docker sets paths inside /proc to be masked and readonly by default, preventing me from mounting.

Duh I thought to myself. Remember that thing we never thought we’d need… well we need it.

You can find all the fun details on opencontainers/runc#1658.
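If you want to see the masked and read-only paths for yourself, they show up as extra mounts sitting below /proc. Here is a small sketch that lists them from a /proc/mounts-style file. The proc_submounts helper is hypothetical and the sample lines are only illustrative of what a Docker container tends to show, not captured output.

```shell
#!/bin/bash
# proc_submounts FILE: print "mountpoint options" for every mount that sits
# below /proc in a /proc/mounts-style file (fields: dev mountpoint type opts).
# Hypothetical helper, for illustration only.
proc_submounts() {
    awk '$2 ~ /^\/proc\// { print $2, $4 }' "$1"
}

# Sample data standing in for /proc/mounts inside a container (illustrative).
mounts_file="$(mktemp)"
cat > "$mounts_file" <<'EOF'
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
proc /proc/sys proc ro,relatime 0 0
tmpfs /proc/kcore tmpfs rw,nosuid,size=65536k,mode=755 0 0
EOF

submounts="$(proc_submounts "$mounts_file")"
rm -f "$mounts_file"
echo "$submounts"
```

Inside a real container you would run proc_submounts /proc/mounts instead of feeding it the sample file.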

Well this blows. I could obviously just run the container as --privileged but that’s really stupid and defeats the whole point of this exercise. I did not want to add any extra capabilities or any host devices which is exactly what privileged does… gross.

So I opened an issue on Docker and made a patch.

Okay so problem solved. Wait… no… now I gotta pull that option through to kubernetes…

So I opened a proposal there: kubernetes/community#1934.

And I made a patch just for playing with it on my fork: jessfraz/kubernetes#rawproc.

Okay now I want to try it in a cluster… enter acs-engine. I made a branch there as well for easily combining together all my patches for testing: jessfraz/acs-engine#rawaccess.

Here is a yaml file you can use to deploy and try it:

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: img
  name: img
  annotations:
    container.apparmor.security.beta.kubernetes.io/img: unconfined
spec:
  securityContext:
    runAsUser: 1000
  initContainers:
    # This container clones the desired git repo to the EmptyDir volume.
    - name: git-clone
      image: r.j3ss.co/jq
      args:
        - git
        - clone
        - --single-branch
        - --
        - https://github.com/jessfraz/dockerfiles
        - /repo # Put it in the volume
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: git-repo
          mountPath: /repo
  containers:
  - image: r.j3ss.co/img
    imagePullPolicy: Always
    name: img
    resources: {}
    workingDir: /repo
    command:
    - img
    - build
    - -t
    - irssi
    - irssi/
    securityContext:
      rawProc: true
    volumeMounts:
    - name: cache-volume
      mountPath: /tmp
    - name: git-repo
      mountPath: /repo
  volumes:
  - name: cache-volume
    emptyDir: {}
  - name: git-repo
    emptyDir: {}
  restartPolicy: Never

So is this secure?

Well I am running that pod as user 1000. Granted it does have access to a raw proc without masks… the nested containers do not. The nested containers have /proc set as read-only and masked paths. The nested containers also use a default seccomp profile denying privileged operations that should not be allowed.

Your main concern here is my code and the code in buildkit and runc. Personally I think that’s fine because I obviously trust myself, but you are more than welcome to audit it and open bugs and/or patches.

If you randomly generate different users for all your pod builds to run under, then you are relying on the user isolation of Linux itself.

If you are running a cluster inside your organization, it’s unlikely someone is going to waste a kernel 0day popping your cluster from within your org.

This is much better than the current situation where people are mounting the docker socket into containers and everything is running as root.

You can even use a Pod Security Policy and set MustRunAs to make sure all your pods are being run as users within a certain range of uids.

You are effectively as safe as any other non-root user running on a shared machine.

If you are running random builds from users off the internet I would suggest using VMs. You can use my patches to acs-engine to run all your pods in Intel’s Clear Containers and you would then have hardware isolation for your little builders :) You just need to use this config.

And that ends the most epic yak shave ever, minus the patches all being merged upstream. Thanks for playing. Feel free to try it out on Azure with my branch to acs-engine. That was a lot of patching and I’m tired. Peace.