How to use the new Docker Seccomp profiles
In case you missed it, we recently merged a default seccomp profile for Docker containers. I urge you to try out the default seccomp profile, mostly so we can rest easy knowing the defaults are sane and your containers work as before. You can download the master version of Docker Engine from master.dockerproject.org or experimental.docker.com.
We even have a doc describing the syscalls we purposely block and security vulnerabilities the profile blocked.
But that’s not what this blog post is about. This post is about how you can create your own custom seccomp profiles for your containers. And how to debug when your profile is missing a syscall.
So this is not the most sane thing in the world, I even tried in the process
to create a bash script that takes the output from strace, collects the
syscalls, and generates a profile. But like all tools of this sort (eg.
aa-genprof
) it missed some, well to be exact it missed 6. Which is no
small feat to debug, so this post is in the format: learn by example. I am
going to take you step by step through what I did.
- Wake up go to starbucks… just kidding… not that specific.
I wanted to make a custom profile for my chrome
container.
I decided to get the syscalls it used by changing the entrypoint for my
chrome/Dockerfile
to ENTRYPOINT [ "strace", "-ff", "google-chrome" ]
. So the only things that
changed was wrapping the command in strace
and of course installing strace
in the container. The -ff
option makes sure strace
follows forks. Which is
essential for chrome because they fork a bunch of processes (fun fact: each tab
is a process with it’s own PID namespace).
Cool beans, moving on.
So I used chrome the entire day like this to create the most verbose
strace
output so I wouldn’t miss any syscalls.
At the end of the day I saved this output into a file by running
docker logs chrome > $HOME/chrome-strace.log 2>&1
.
Then I used the world’s most janky bash script to generate a profile:
#!/bin/bash
set -e
set -o pipefail
main(){
local file=$1
local name=$(basename "$0")
if [[ -z "$file" ]]; then
cat >&2 <<-EOF
${name} [strace-output-filename]
You must pass a filename that has the strace output.
EOF
fi
# get just the syscalls
local IFS=$'\n'
raw=( $(perl -lne 'print $1 if /([a-zA-Z_]+\()/' "$file" | sort -u) )
unset IFS
syscalls=( )
tmpfile=$(mktemp /tmp/seccomp-strace.XXXXXX)
curl -sSL -o "$tmpfile" https://raw.githubusercontent.com/torvalds/linux/master/arch/x86/entry/syscalls/syscall_64.tbl
for syscall in "${raw[@]}"; do
# clean the trailing (
syscall=${syscall%(}
if grep -R -q -w $syscall "$tmpfile"; then
syscalls+=( $syscall )
fi
done
# start the seccomp profile
cat <<-EOF > "$tmpfile"
{
"defaultAction": "SCMP_ACT_ERRNO",
"syscalls": [
EOF
for syscall in "${syscalls[@]}"; do
cat <<-EOF
{
"name": "${syscall}",
"action": "SCMP_ACT_ALLOW",
"args": null
},
EOF
done >> "$tmpfile"
# remove trailing comma
sed -i '$s/,$//' "$tmpfile"
cat <<-EOF >> "$tmpfile"
]
}
EOF
cat "$tmpfile"
rm "$tmpfile"
}
main $@
You use this script like so:
$ ./shitty-seccomp-profile-generator.sh chrome-strace.log
Now you have a whitelist generated from your strace output. But it’s super bad
and when you try to run your container with it you get a vague error and
Operation not permitted
.
Just for this example the error was:
[1:1:0104/214046:ERROR:nacl_fork_delegate_linux.cc(314)] Bad NaCl helper startup ack (0 bytes)
.
So now we have to use our brains. WHAT!? NOOOOO!
So I opened the generated profile and took a look at what it was allowing.
Now I know a little bit about how chrome uses namespaces/seccomp to create a
sandbox, so my first thought was let’s make sure we allow unshare
, clone
,
seccomp
and setns
. Sure enough, unshare
and setns
were missing… thanks strace
you really sucked that one up, even I know chrome calls those.
After further thought I realized it was also missing setgid
and
exit
/exit_group
.
This all took a super long time of guessing and checking but I ended up with this profile.
Obviously noone else is going to do this, debug for hours the syscalls that are missing. This is why the default profile is so important, we wanted to create sane defaults that would protect people but also not cause all this pain.
So please, please, please try it out and open an issue if you find your
container that used to run perfectly is now giving Operation not permitted
.
If you are curious about syscalls or are trying to track down what you are missing, this is a great syscall table: filippo.io/linux-syscall-table.
Also, things are going to get better. We are working on sane security profiles for containers that don’t make you want to pull your hair out. You can read up on the proposal at docker/docker#17142.