Building Container Images Securely on Kubernetes
A lot of people seem to want to be able to build container images in Kubernetes without mounting in the docker socket or doing anything to compromise the security of their cluster.
This all was brought to my attention when my awesome coworker Gabe Monroy and I were chatting with Michelle Noorali over pizza at Kubecon in Austin last December.
Here is pretty much how it went down:
Gabe: I’d love to switch our clusters to a lightweight runtime like containerd, but we need those docker build apis right now. I wish someone would come up with an unprivileged container image builder..
Me: Oh that’s easy
Gabe: Bullshit, if it was easy someone would have done it already. I’ve wanted this for years. Please pass the ranch dressing.
Me: I’m telling you you’re wrong. I’ll prove it to you. It’s easy.
Judgy Four Seasons Staff: Excuse me, can I help you?
Me: Nah we’re good. Actually if you could grab me a slice of that Papa John’s jalapeno & pineapple that would be great.
.. next morning ..
100 lines of bash shaming in Gabe's inbox proving it could be done.
Prior Art
A few years ago when I worked at Docker, Stephen Day and Michael Crosby did a POC demo of a standalone image builder.
It still actually exists today in a fork of docker/distribution on Stephen’s github. It consisted of a dist command line tool for interacting with the registry, and runc. Combined together with the awesome powers of bash like so (nsinit was runc before runc was A Thing):
#!/bin/bash

# FROM pulls the named image's root filesystem into ./rootfs.
function FROM () {
  mkdir rootfs
  dist pull "$1" rootfs
}

# USERNS sets the root uid for the user namespace mapping.
function USERNS() {
  export nsinituserns="$1"
}

# CWD sets the working directory for subsequent steps.
function CWD() {
  export nsinitcwd="$1"
}

# MEM sets the memory limit for subsequent steps.
function MEM() {
  export nsinitmem="$1"
}

# EXEC launches a command in a container rooted at ./rootfs.
function EXEC() {
  nsinit exec \
    --tty \
    --rootfs "$(pwd)/rootfs" \
    --create \
    --cwd="$nsinitcwd" \
    --memory-limit="$nsinitmem" \
    --memory-swap -1 \
    --userns-root-uid="$nsinituserns" \
    -- "$@"
}

# RUN joins its arguments into a single shell command, like a Dockerfile RUN line.
function RUN() {
  EXEC sh -c "$*"
}
So in their demo, you would source the above bash script and then execute your Dockerfile like it was also a bash script. Pretty cool, right?
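A build, then, looked something like this (the file name, image ref, and commands here are all made up for illustration; dist's exact ref format may differ):

# Source the functions above, then write your "Dockerfile" as plain bash.
source ./builder.sh

FROM docker.io/library/debian:latest
USERNS 1000
CWD /
MEM 536870912   # presumably bytes, per --memory-limit
RUN apt-get update
RUN apt-get install -y curl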
So that is what I sent to Gabe’s inbox to prove it was possible but also: “Look, I will make you something nice.”
Designing Something Nice
So I went out on my mission to make them something nice, which led me through a sea of existing tools. I collected all my findings in a design doc if you are curious as to what I think about the other existing tools.
I didn’t want to reinvent the world; I just wanted to make it unprivileged, and a single binary with a simple user interface that could easily be swapped in for docker.
Not all of my ideas are good. I first started on a FUSE snapshotter. Turns out FUSE kinda sucks…
so fuse calls getxattr 2x the amount it calls lookup even if the damn inodes have no xattrs…. and it has to go back and forth from kernel to userspace to do it… I need a drink.
— jessie frazelle (@jessfraz) February 8, 2018
I started playing with buildkit. It’s an awesome project. Tõnis Tiigi did a really stellar job on it and I thought to myself, “I definitely want to use this as the backend.”
Buildkit is more cache-efficient than Docker because it can execute multiple build stages concurrently with its internal DAG.
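To make that concrete, here is a hypothetical multi-stage Dockerfile (written via a heredoc; the package paths are made up) where the two builder stages share no edges in the build graph, so buildkit can run them at the same time while a classic docker build would plod through them top to bottom:

cat > Dockerfile <<'EOF'
# These two stages are independent nodes in the DAG...
FROM golang:alpine AS server
RUN go get github.com/example/server

FROM golang:alpine AS client
RUN go get github.com/example/client

# ...and only the final stage has to wait on both.
FROM alpine
COPY --from=server /go/bin/server /usr/bin/server
COPY --from=client /go/bin/client /usr/bin/client
EOF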
Then I stumbled upon Akihiro Suda’s patches for an unprivileged Buildkit. This was perfect for my use case.
I owe all these fine folks so much for the great work I got to build on top of. :)
And thus came img.
So that was all fine and dandy and it works great as unprivileged… on my host. Now I’m a huge fan of desktop tools and this actually filled a large void in my tooling: now I can build unprivileged on my host without Docker.
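On the host that looks something like this (image name made up):

# No daemon, no root: just a regular user building an image.
id -u                    # -> 1000
img build -t jess/demo .
img ls                   # list the images img knows about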
But I still have to make this work in Kubernetes so I can make Gabe happy and fulfill my dreams of eating more pineapple and jalapeno pizzas at Kubecons.
Why is this problem so hard?
Let me go over in detail some of the patches needed to even make this work as unprivileged on my host.
For one, we need subuid and subgid maps; see @AkihiroSuda’s patch. We also need setgroups; see @AkihiroSuda’s patch for that as well. Those allow us to use apt in unprivileged user namespaces.
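On the host side that boils down to entries like these (the user name here is hypothetical; the standard /etc/subuid format is user:start:count):

$ grep builder /etc/subuid /etc/subgid
/etc/subuid:builder:100000:65536
/etc/subgid:builder:100000:65536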
Then, if we want to use the containerd snapshotter backends and actually mount the filesystems as we diff them, we need unprivileged mounting. That can only be done from inside a user and mount namespace, so we need to do it at the start of our binary before we do anything else.
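You can poke at the same trick from a shell with unshare(1): a new user namespace that maps you to root inside, plus a new mount namespace, is enough for mount(2) to succeed without real privileges, which is roughly what the binary has to set up for itself first:

# Map the current user to root in a fresh user+mount namespace, then mount.
unshare --user --map-root-user --mount \
  sh -c 'mount -t tmpfs none /mnt && findmnt /mnt'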
Granted, mounting is not a requirement of building docker images. You can always go the route of orca-build and umoci and not mount at all. umoci is also an unprivileged image builder, and was made long before I even made mine by the talented Aleksa Sarai, who is also responsible for a lot of the rootless containers work upstream in runc.
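The mount-free flow looks roughly like this with umoci (paths and tags made up; check umoci's docs for the real workflow):

# Unpack an OCI image to a plain directory tree: no mount(2) involved,
# and --rootless remaps ownership so an unprivileged user can do it.
umoci unpack --rootless --image ./oci:alpine bundle
echo hi > bundle/rootfs/hello   # stand-in for a real build step
# Pack the changed rootfs back up as a new layer under a new tag.
umoci repack --image ./oci:alpine-built bundle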
Getting this to work in containers…
img works on my host, which is all fine and dandy, but I gotta help my k8s pals do their builds…
Enter the next problem. For the record, all these problems apply to any builder that is using runc to launch containers as an unprivileged user.
The next issue involved not being able to mount proc inside a Docker container. My first thought was “well it must be something Docker is doing”. So I isolated the problem, put it in a container, and ten minutes after I dove into the rabbit hole I realized it was the fact that Docker sets paths inside /proc to be masked and readonly by default, preventing me from mounting.
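You can see the masking from inside any stock container:

# List the /proc-related mounts a default Docker container gets.
docker run --rm alpine sh -c 'mount | grep /proc'
# (output trimmed: besides proc itself you'll see tmpfs/device and read-only
# mounts over paths like /proc/kcore and /proc/sys, depending on version)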
Duh I thought to myself. Remember that thing we never thought we’d need… well we need it.
"We'll never need this"
— julia ferraioli (@juliaferraioli) March 4, 2018
"Fuck, we need that"
You can find all the fun details on opencontainers/runc#1658.
Well this blows. I could obviously just run the container as --privileged, but that’s really stupid and defeats the whole point of this exercise. I did not want to add any extra capabilities or any host devices, which is exactly what privileged does… gross.
So I opened an issue on Docker and made a patch.
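For what it’s worth, the knob from that patch eventually shipped in later Docker releases as a security-opt; a quick way to see it in action (assumes Docker 19.03+):

# Turn off the default /proc masking for just this container, without any
# of --privileged's extra capabilities or host devices:
docker run --rm --security-opt systempaths=unconfined \
  alpine sh -c 'mount | grep /proc'
# -> just the one plain rw proc mount this time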
Okay so problem solved. Wait… no… now I gotta pull that option through to kubernetes…
So I opened a proposal there: kubernetes/community#1934.
And I made a patch just for playing with it on my fork: jessfraz/kubernetes#rawproc.
Okay now I want to try it in a cluster… enter acs-engine. I made a branch there as well for easily combining together all my patches for testing: jessfraz/acs-engine#rawaccess.
Here is a yaml file you can use to deploy and try it:
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: img
  name: img
  annotations:
    container.apparmor.security.beta.kubernetes.io/img: unconfined
spec:
  securityContext:
    runAsUser: 1000
  initContainers:
    # This container clones the desired git repo to the EmptyDir volume.
    - name: git-clone
      image: r.j3ss.co/jq
      args:
        - git
        - clone
        - --single-branch
        - --
        - https://github.com/jessfraz/dockerfiles
        - /repo # Put it in the volume
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: git-repo
          mountPath: /repo
  containers:
    - image: r.j3ss.co/img
      imagePullPolicy: Always
      name: img
      resources: {}
      workingDir: /repo
      command:
        - img
        - build
        - -t
        - irssi
        - irssi/
      securityContext:
        rawProc: true
      volumeMounts:
        - name: cache-volume
          mountPath: /tmp
        - name: git-repo
          mountPath: /repo
  volumes:
    - name: cache-volume
      emptyDir: {}
    - name: git-repo
      emptyDir: {}
  restartPolicy: Never
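Deploying it is the usual dance (file name made up; the pod and container names come from the spec above):

kubectl apply -f img-pod.yaml
# Follow the unprivileged build as it runs.
kubectl logs -f img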
So is this secure?
Well, I am running that pod as user 1000. Granted it does have access to a raw proc without masks… the nested containers do not. The nested containers have /proc set up with read-only and masked paths, and they also use a default seccomp profile denying privileged operations that should not be allowed.
Your main concern here is my code and the code in buildkit and runc. Personally I think that’s fine because I obviously trust myself, but you are more than welcome to audit it and open bugs and/or patches.
If you randomly generate different users for all your pod builds to run under then you are relying on the user isolation of linux itself.
If you are running a cluster inside your organization, it’s unlikely someone is going to waste a kernel 0day popping your cluster from within your org.
This is much better than the current situation where people are mounting the docker socket into containers and everything is running as root.
You can even use a Pod Security Policy and set MustRunAs to make sure all your pods are being run as users within a certain range of uids.
You are effectively as safe as any other non-root user running on a shared machine.
If you are running random builds from users off the internet I would suggest using VMs. You can use my patches to acs-engine to run all your pods in Intel’s Clear Containers and you would then have hardware isolation for your little builders :) You just need to use this config.
And that ends the most epic yak shave ever, minus the patches all being merged upstream. Thanks for playing. Feel free to try it out on Azure with my branch to acs-engine. That was a lot of patching and I’m tired. Peace.