Thoughts on Conway's Law and the software stack

Monday, March 25, 2019

I’ve been talking to a lot of people in different layers of the stack during my funemployment. I wanted to share one of the problems I’ve been thinking about; maybe you can think of some clever solutions.

Conway’s Law states “organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations.”

If you were to apply Conway’s Law to all the layers of the software stack and open source software you’d see a problem: There is not sufficient communication between the various layers of software.

Let’s dive in a bit to make the problem super clear.

I’ve met a bunch of hardware engineers and I’ve made a point of asking each of them how they feel about using a single chip for multiple users. This is, of course, the use case of the cloud. All of the hardware engineers either laugh or are horrified, and the resounding reaction is “you’d be crazy to think hardware was ever intended to be used for isolating multiple users safely.” Spectre and Meltdown proved this was true as well. Speculative execution was a feature intended to make processors faster, but it was never considered as an attack vector against multi-tenant compute, like a cloud provider. Seems like the software and hardware layers should communicate better…

That’s just one example; let’s reverse the interaction. I’ve talked to a bunch of firmware and kernel engineers and they’d all love it if the firmware from chip vendors were less complex. For instance, it seems like a unanimous vote among firmware and kernel engineers that CPU vendors should not include runtime services or SMM (System Management Mode) with their firmware. Open source firmware and kernel developers would rather handle those problems at their layer of the stack. All the complexity in the firmware leads to overlooked bugs and odd behavior that can’t be controlled or debugged from the kernel developer’s layer or from user space. Not to mention, a lot of CPU vendors’ firmware is proprietary, so it’s really hard to know if a bug is truly a firmware bug.

Another example would be the hack of SoftLayer. Hackers modified the firmware on the BMC of a bare metal host the cloud provider was offering. This shows another mistake: having blinders on and not being conscious of the other layers of the stack and the entire system.

Let’s move up the stack a bit to something I have personally experienced. I worked a lot on container runtimes. I have also worked on Kubernetes. I was horrified to find people are running multi-tenant Kubernetes clusters with multiple customers’ processes, that is, relying on Kubernetes to isolate untrusted processes. The architecture of Kubernetes is just not designed for this.

A common miscommunication is the “window dressing.” For example, there is a feature in Kubernetes that prevents exec-ing into containers. This is implemented by merely preventing the API call in Kubernetes. If a person has access to a cluster, there are about four dozen different ways I can think of to exec into a container and bypass this “feature” and Kubernetes entirely. Using said “security feature” in Kubernetes alone is not sufficient for security in any respect. This is a common pattern.
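
To make the “window dressing” pattern concrete, here is a toy sketch in Python. It is not Kubernetes code and none of these names are real Kubernetes APIs: ApiServer, NodeRuntime and their methods are made up purely for illustration. It only assumes what is described above, that the check lives in the API layer while the layer underneath will happily do the work for anyone who reaches it directly.

```python
# Toy model only: none of these classes correspond to real Kubernetes APIs.
from dataclasses import dataclass, field


@dataclass
class NodeRuntime:
    """Stands in for the container runtime on a node (hypothetical)."""
    log: list = field(default_factory=list)

    def exec_in_container(self, container: str, command: str) -> str:
        # The runtime performs no policy check of its own in this model.
        self.log.append((container, command))
        return f"ran {command!r} in {container}"


@dataclass
class ApiServer:
    """Stands in for the orchestrator's API front door (hypothetical)."""
    runtime: NodeRuntime
    exec_allowed: bool = False

    def exec(self, container: str, command: str) -> str:
        # The "security feature": refuse the call at the API layer only.
        if not self.exec_allowed:
            raise PermissionError("exec is disabled by policy")
        return self.runtime.exec_in_container(container, command)


runtime = NodeRuntime()
api = ApiServer(runtime=runtime)

# Going through the front door, the request is refused.
try:
    api.exec("web", "sh")
    blocked = False
except PermissionError:
    blocked = True

# Anyone who can reach the layer below never meets the API check.
bypass_result = runtime.exec_in_container("web", "sh")
```

In this model the policy holds only for callers who go through ApiServer. Enforcing it for real would mean putting the check in, or in front of, every path that reaches the runtime, which is exactly the cross-layer conversation that tends not to happen.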

None of these problems are small by any means. They are miscommunications at various layers of the stack. They are people thinking an interface or feature is secure when it is merely window dressing that can be bypassed with just a bit more knowledge about the stack. I really like the advice Lea Kissner gave: “take the long view, not just the broad view.” We should do this more often when building systems.

The thought I’ve been noodling on is: how do we solve this? Is this something a code hosting provider like GitHub should fix? But that excludes all the projects that are not on that platform. How do we promote better communication between layers of the stack? How can we automate some of this away? Or is the answer simply to own all the layers of the stack yourself?