<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrew Howden</title>
    <description>The latest articles on DEV Community by Andrew Howden (@andrewhowdencom).</description>
    <link>https://dev.to/andrewhowdencom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F68484%2F28dc1f63-5041-4548-b984-2fb1324668e6.jpeg</url>
      <title>DEV Community: Andrew Howden</title>
      <link>https://dev.to/andrewhowdencom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andrewhowdencom"/>
    <language>en</language>
    <item>
      <title>What is a container?</title>
      <dc:creator>Andrew Howden</dc:creator>
      <pubDate>Sun, 31 Mar 2019 11:58:48 +0000</pubDate>
      <link>https://dev.to/andrewhowdencom/what-is-a-container-464k</link>
      <guid>https://dev.to/andrewhowdencom/what-is-a-container-464k</guid>
      <description>

&lt;p&gt;Containers have recently become a common way of packaging, deploying and running software across a wide range of machines in all sorts of environments. Since the initial release of Docker in March 2013&lt;sup&gt;[1]&lt;/sup&gt;, containers have become ubiquitous in modern software deployment, with 71% of Fortune 100 companies running them in some capacity&lt;sup&gt;[2]&lt;/sup&gt;. Containers can be used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running user facing, production software&lt;/li&gt;
&lt;li&gt;Running a software development environment&lt;/li&gt;
&lt;li&gt;Compiling software with its dependencies in a sandbox&lt;/li&gt;
&lt;li&gt;Analysing the behaviour of software within a sandbox&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Like their namesake in the shipping industry, containers are designed to make it easy to "lift and shift" software to different environments and have that software execute in the same way across those environments.&lt;/p&gt;

&lt;p&gt;Containers have thus earned their place in the modern software development toolkit. However, to understand how container technology fits into modern software architecture, it's worth understanding how we arrived at containers, as well as how they work.&lt;/p&gt;

&lt;h1&gt;
  
  
  History
&lt;/h1&gt;

&lt;p&gt;The "birth" of containers was denoted by Bryan Cantrill as March 18th, 1982&lt;sup&gt;[3]&lt;/sup&gt; with the addition of the &lt;code&gt;chroot&lt;/code&gt; syscall in BSD. From the FreeBSD website&lt;sup&gt;[4]&lt;/sup&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;According to the SCCS logs, the chroot call was added by Bill Joy on March 18, 1982 approximately 1.5 years before 4.2BSD was released. That was well before we had ftp servers of any sort (ftp did not show up in the source tree until January 1983). My best guess as to its purpose was to allow Bill to chroot into the /4.2BSD build directory and build a system using only the files, include files, etc contained in that tree. That was the only use of chroot that I remember from the early days.&lt;/p&gt;

&lt;p&gt;—  Dr. Marshall Kirk McKusick &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;chroot&lt;/code&gt; is used to put a process into a "changed root"; a new root filesystem that has limited or no access to the parent root filesystem. An extremely minimal &lt;code&gt;chroot&lt;/code&gt; can be created on Linux as follows&lt;sup&gt;[5]&lt;/sup&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get a shell&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;bin
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;which sh&lt;span class="k"&gt;)&lt;/span&gt; bin/bash

&lt;span class="c"&gt;# Find shared libraries required for shell&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ldd bin/sh
    linux-vdso.so.1 &lt;span class="o"&gt;(&lt;/span&gt;0x00007ffe69784000&lt;span class="o"&gt;)&lt;/span&gt;
    /lib/x86_64-linux-gnu/libsnoopy.so &lt;span class="o"&gt;(&lt;/span&gt;0x00007f6cc4c33000&lt;span class="o"&gt;)&lt;/span&gt;
    libc.so.6 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libc.so.6 &lt;span class="o"&gt;(&lt;/span&gt;0x00007f6cc4a42000&lt;span class="o"&gt;)&lt;/span&gt;
    libpthread.so.0 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libpthread.so.0 &lt;span class="o"&gt;(&lt;/span&gt;0x00007f6cc4a21000&lt;span class="o"&gt;)&lt;/span&gt;
    libdl.so.2 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libdl.so.2 &lt;span class="o"&gt;(&lt;/span&gt;0x00007f6cc4a1c000&lt;span class="o"&gt;)&lt;/span&gt;
    /lib64/ld-linux-x86-64.so.2 &lt;span class="o"&gt;(&lt;/span&gt;0x00007f6cc4c66000&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Duplicate libraries into root&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; lib64 lib/x86_64-linux-gnu
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; /lib/x86_64-linux-gnu/libsnoopy.so &lt;span class="se"&gt;\&lt;/span&gt;
    /lib/x86_64-linux-gnu/libc.so.6 &lt;span class="se"&gt;\&lt;/span&gt;
    /lib/x86_64-linux-gnu/libpthread.so.0 &lt;span class="se"&gt;\&lt;/span&gt;
    /lib/x86_64-linux-gnu/libdl.so.2 &lt;span class="se"&gt;\&lt;/span&gt;
    lib/x86_64-linux-gnu/

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; /lib64/ld-linux-x86-64.so.2 lib64/

&lt;span class="c"&gt;# Change into that root&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo chroot&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Test the chroot&lt;/span&gt;
&lt;span class="c"&gt;# ls&lt;/span&gt;
/bin/bash: 1: &lt;span class="nb"&gt;ls&lt;/span&gt;: not found
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There were problems with this early implementation of &lt;code&gt;chroot&lt;/code&gt;, such as being able to escape the &lt;code&gt;chroot&lt;/code&gt; by running &lt;code&gt;cd ..&lt;/code&gt;&lt;sup&gt;[3]&lt;/sup&gt;, but these were resolved in short order. Seeking to provide better security, FreeBSD extended the &lt;code&gt;chroot&lt;/code&gt; into the &lt;code&gt;jail&lt;/code&gt;&lt;sup&gt;[3][4]&lt;/sup&gt;, which allowed software that expected to run as &lt;code&gt;root&lt;/code&gt; to do so within a confined environment: &lt;code&gt;root&lt;/code&gt; within that environment, but not &lt;code&gt;root&lt;/code&gt; elsewhere on the system.&lt;/p&gt;

&lt;p&gt;This work was further built upon in the Solaris operating system to provide fuller isolation from the host&lt;sup&gt;[3][6]&lt;/sup&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User separation (similar to &lt;code&gt;jail&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Filesystem separation (similar to &lt;code&gt;chroot&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;A separate process space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This provided something close to the modern concept of containers: isolated processes running on a shared kernel. Later, similar work took place in the Linux kernel to isolate kernel structures on a per-process basis under "namespaces"&lt;sup&gt;[7]&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;However, in parallel, Amazon Web Services (AWS) launched its Elastic Compute Cloud (EC2) product, which took a different approach to separating workloads: virtualising &lt;em&gt;the entire hardware&lt;/em&gt;&lt;sup&gt;[3]&lt;/sup&gt;. This comes with different tradeoffs: it limits exploitation of the host kernel or of the isolation implementation, but running an additional operating system and hypervisor means a far less efficient use of resources.&lt;/p&gt;

&lt;p&gt;Virtualisation continued to dominate workload isolation until the company dotCloud (now Docker), then operating a "platform as a service" (PaaS) offering, open sourced the software it used to run that PaaS. With that software and a large amount of luck, containers proliferated rapidly and Docker became the powerhouse it is now.&lt;/p&gt;

&lt;p&gt;Shortly after Docker released their container runtime they started expanding their product offerings into build, orchestration and server management tooling&lt;sup&gt;[8]&lt;/sup&gt;. Unhappy with this, CoreOS created their own container runtime, &lt;code&gt;rkt&lt;/code&gt;, which had the stated goal of interoperating with existing services such as &lt;code&gt;systemd&lt;/code&gt;, following &lt;a href="https://en.wikipedia.org/wiki/Unix_philosophy"&gt;the Unix philosophy&lt;/a&gt; of "Write programs that do one thing and do it well&lt;sup&gt;[9]&lt;/sup&gt;."&lt;/p&gt;

&lt;p&gt;To reconcile these disparate definitions of a container, the Open Container Initiative was established&lt;sup&gt;[10]&lt;/sup&gt;, after which Docker donated its image schema and its runtime as what amounted to a de facto container standard.&lt;/p&gt;

&lt;p&gt;There are now a number of container implementations, as well as a number of standards to define their behaviour.&lt;/p&gt;

&lt;h1&gt;
  
  
  Definition
&lt;/h1&gt;

&lt;p&gt;It might be surprising to learn that a "container" is not a real thing — rather, it is a specification. At the time of writing this specification has implementations on&lt;sup&gt;[11]&lt;/sup&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux&lt;/li&gt;
&lt;li&gt;Windows&lt;/li&gt;
&lt;li&gt;Solaris&lt;/li&gt;
&lt;li&gt;Virtual Machines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In turn, containers are expected to be&lt;sup&gt;[12]&lt;/sup&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Consumable with a set of standard, interoperable tools&lt;/li&gt;
&lt;li&gt; Consistent regardless of what type of software is being run&lt;/li&gt;
&lt;li&gt; Agnostic to the underlying infrastructure the container is being run on&lt;/li&gt;
&lt;li&gt; Designed in a way that makes automation easy&lt;/li&gt;
&lt;li&gt; Of excellent quality&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are specifications that dictate how containers should reach these principles by defining how they should be executed (the runtime specification&lt;sup&gt;[11]&lt;/sup&gt;), what a container should contain (the image specification&lt;sup&gt;[13]&lt;/sup&gt;) and how to distribute container "images" (the distribution specification&lt;sup&gt;[14]&lt;/sup&gt;).&lt;/p&gt;
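&lt;p&gt;As a flavour of what these specifications describe, an OCI image manifest (from the image specification) is a small JSON document tying a configuration blob to a list of filesystem layers. The sketch below follows the published manifest format; the digests and sizes are placeholders rather than a real image:&lt;/p&gt;

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:0000000000000000000000000000000000000000000000000000000000000000",
    "size": 1469
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:1111111111111111111111111111111111111111111111111111111111111111",
      "size": 27092274
    }
  ]
}
```

&lt;p&gt;A registry serves this manifest first; a runtime then fetches each layer blob by its digest, which is what makes layers content-addressable and shareable between images.&lt;/p&gt;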

&lt;p&gt;These specifications mean that a wide variety of tools can be used to interact with containers. The canonical and most commonly used tool is Docker, which in addition to running containers provides container build tooling and some limited orchestration. However, there are a number of container runtimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rkt/rkt"&gt;Rkt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cri-o.io/"&gt;cri-o&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discuss.linuxcontainers.org/t/lxc-3-0-0-has-been-released/1449"&gt;LXC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/clearcontainers/runtime"&gt;"Clear Containers"&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As well as other tools that help with building or distributing images.&lt;/p&gt;

&lt;p&gt;Lastly, there are extensions to the existing standards, such as the &lt;a href="https://github.com/containernetworking/cni"&gt;container networking interface&lt;/a&gt;, which define additional behaviour where the standards are not yet clear enough.&lt;/p&gt;

&lt;h1&gt;
  
  
  Implementation
&lt;/h1&gt;

&lt;p&gt;While the standards give us some idea of what a container is and how it should work, it’s useful to understand how a container implementation actually works. Not all container runtimes are implemented in this way; notably, Kata Containers implements hardware virtualisation, as alluded to earlier with EC2.&lt;/p&gt;

&lt;p&gt;The problems being solved by containers are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Isolating one or more processes&lt;/li&gt;
&lt;li&gt; Distributing those processes&lt;/li&gt;
&lt;li&gt; Connecting those processes to other machines&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With that said, let’s dive into the Docker implementation&lt;sup&gt;[15]&lt;/sup&gt;. It uses a series of technologies exposed by the underlying kernel:&lt;/p&gt;

&lt;h2&gt;
  
  
  Kernel feature isolation: namespaces
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;namespaces(7)&lt;/code&gt; man page (viewable via &lt;code&gt;man namespaces&lt;/code&gt;) defines namespaces as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Paraphrased: a namespace is a slice of the system from within which a process cannot see the rest of the system.&lt;/p&gt;

&lt;p&gt;A process must make a system call to the Linux kernel to change its namespace. There are several relevant system calls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;clone&lt;/code&gt;: Create a new process. When used in conjunction with &lt;code&gt;CLONE_NEW*&lt;/code&gt; it creates a namespace of the kind specified. For example, if used with &lt;code&gt;CLONE_NEWPID&lt;/code&gt; the process will enter a new &lt;code&gt;pid&lt;/code&gt; namespace and become &lt;code&gt;pid 1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;setns&lt;/code&gt;: Allows the calling process to join an existing namespace, specified under &lt;code&gt;/proc/[pid]/ns&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;unshare&lt;/code&gt;: Moves the calling process into a new namespace&lt;/li&gt;
&lt;/ul&gt;
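&lt;p&gt;The namespaces a process currently belongs to can be inspected without any special tooling: each one is exposed as a symlink under the &lt;code&gt;/proc/[pid]/ns&lt;/code&gt; directory mentioned above, and two processes in the same namespace resolve to the same inode number there. A small sketch:&lt;/p&gt;

```shell
# List the namespaces the current shell belongs to; each entry is a
# symlink whose target encodes the namespace type and inode number.
ls -l /proc/$$/ns

# For example, the network namespace resolves to something like
# "net:[4026531992]"; any process printing the same inode shares
# this shell's network namespace.
readlink /proc/$$/ns/net
```

&lt;p&gt;This is exactly the interface &lt;code&gt;setns&lt;/code&gt; consumes: a process opens one of these files and passes the file descriptor to the kernel to join that namespace.&lt;/p&gt;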

&lt;p&gt;There is also a userland command called &lt;code&gt;unshare&lt;/code&gt; which allows us to experiment with namespaces. We can put ourselves into separate process and network namespaces with the following command:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scratch space&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Fork is required to spawn new processes, and proc is mounted to give accurate process information&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;unshare &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--fork&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--pid&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--mount-proc&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--net&lt;/span&gt;

&lt;span class="c"&gt;# Here we see that we only have access to the loopback interface&lt;/span&gt;
root@sw-20160616-01:/tmp/tmp.XBESuNMJJS# ip addr
1: lo: &amp;lt;LOOPBACK&amp;gt; mtu 65536 qdisc noop state DOWN group default qlen 1000
    &lt;span class="nb"&gt;link&lt;/span&gt;/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

&lt;span class="c"&gt;# Here we see that we can only see the first process (bash) and our `ps aux` invocation&lt;/span&gt;
root@sw-20160616-01:/tmp/tmp.XBESuNMJJS# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.3  0.0   8304  5092 pts/7    S    05:48   0:00 &lt;span class="nt"&gt;-bash&lt;/span&gt;
root         5  0.0  0.0  10888  3248 pts/7    R+   05:49   0:00 ps aux
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Docker uses the following namespaces to limit the ability for a process running in the container to see resources outside that container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;pid&lt;/code&gt; namespace: Process isolation (PID: Process ID).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;net&lt;/code&gt; namespace: Managing network interfaces (NET: Networking).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;ipc&lt;/code&gt; namespace: Managing access to IPC resources (IPC: InterProcess Communication).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;mnt&lt;/code&gt; namespace: Managing filesystem mount points (MNT: Mount)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;uts&lt;/code&gt; namespace: Isolating kernel and version identifiers. (UTS: Unix Timesharing System).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These provide reasonable separation between processes, such that workloads should not be able to interfere with each other. However, there is a notable caveat: &lt;strong&gt;we can disable some of this isolation&lt;/strong&gt;&lt;sup&gt;[16]&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This is an extremely useful property. One example would be system daemons that need access to the host network to bind ports on the host&lt;sup&gt;[17]&lt;/sup&gt;, such as a DNS service or service proxy running in a container.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Process #1, or the &lt;code&gt;init&lt;/code&gt; process, has some additional responsibilities on Linux systems. When a process terminates it is not automatically cleaned up, but rather enters a terminated ("zombie") state. It is the responsibility of the init process to "reap" those processes, deleting them so that their process IDs can be reused&lt;sup&gt;[18]&lt;/sup&gt;. This is known as the &lt;em&gt;zombie reaping problem&lt;/em&gt;. Accordingly, the first process run in a Linux PID namespace should be an &lt;code&gt;init&lt;/code&gt; process, and not a user facing process like &lt;code&gt;mysql&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another place namespaces are used is the Chromium browser&lt;sup&gt;[19]&lt;/sup&gt;. Chromium uses at least the &lt;code&gt;setuid&lt;/code&gt; and &lt;code&gt;user&lt;/code&gt; namespaces.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Resource isolation: control groups
&lt;/h2&gt;

&lt;p&gt;The kernel documentation for &lt;code&gt;cgroups&lt;/code&gt; defines the cgroup as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That doesn’t really tell us much though. Luckily it expands:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can access. For example, cpusets (see Documentation/cgroup-v1/cpusets.txt) allow you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, &lt;code&gt;cgroups&lt;/code&gt; are groups of "jobs" (tasks) that other kernel subsystems can assign meaning to. Subsystems that currently hook into &lt;code&gt;cgroups&lt;/code&gt; include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt"&gt;CPU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt"&gt;Memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt"&gt;PIDs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kernel.org/doc/Documentation/cgroup-v1/net_prio.txt"&gt;Network Priority&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As well as various others.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cgroups&lt;/code&gt; are manipulated by reading and writing files in the cgroup filesystem, conventionally mounted at &lt;code&gt;/sys/fs/cgroup&lt;/code&gt;. For example:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a cgroup called "me"&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt; &lt;span class="nb"&gt;mkdir&lt;/span&gt; /sys/fs/cgroup/memory/me

&lt;span class="c"&gt;# Allocate the cgroup a max of 100Mb memory&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'100000000'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/fs/cgroup/memory/me/memory.limit_in_bytes

&lt;span class="c"&gt;# Move this proess into the cgroup&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/fs/cgroup/memory/me/cgroup.procs
5924
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That’s it! This process should now be limited to 100MB of total memory usage.&lt;/p&gt;

&lt;p&gt;Docker uses the same functionality in its &lt;code&gt;--memory&lt;/code&gt; and &lt;code&gt;--cpus&lt;/code&gt; arguments, and it is employed by the orchestration systems Kubernetes and Apache Mesos to determine where to schedule workloads.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Although &lt;code&gt;cgroups&lt;/code&gt; are most commonly associated with containers they’re already used for other workloads. The best example is perhaps &lt;code&gt;systemd&lt;/code&gt;, which automatically puts all services into a &lt;code&gt;cgroup&lt;/code&gt; if the CPU scheduler is enabled in the kernel&lt;sup&gt;[20]&lt;/sup&gt;. &lt;code&gt;systemd&lt;/code&gt; services are …​ kind of containers!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Userland isolation: seccomp
&lt;/h2&gt;

&lt;p&gt;While both namespaces and &lt;code&gt;cgroups&lt;/code&gt; go a significant way towards isolating processes into their own containers, Docker goes further, restricting what access a process has to the Linux kernel itself. This is enforced on supported operating systems via "SECure COMPuting with filters", also known as &lt;code&gt;seccomp-bpf&lt;/code&gt; or simply &lt;code&gt;seccomp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The Linux kernel user space API guide defines &lt;code&gt;seccomp&lt;/code&gt; as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Seccomp filtering provides a means for a process to specify a filter for incoming system calls. The filter is expressed as a Berkeley Packet Filter (BPF) program, as with socket filters, except that the data operated on is related to the system call being made: system call number and the system call arguments.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;BPF, in turn, is a small in-kernel virtual machine language used for kernel tracing, networking and a number of other tasks&lt;sup&gt;[21]&lt;/sup&gt;. Whether a system supports seccomp can be determined by running the following command&lt;sup&gt;[22]&lt;/sup&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;grep &lt;/span&gt;&lt;span class="nv"&gt;CONFIG_SECCOMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; /boot/config-&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Our system supports seccomp&lt;/span&gt;
&lt;span class="nv"&gt;CONFIG_SECCOMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;y
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Practically, this limits a process’s ability to ask the kernel to do certain things. Any system call can be restricted, and Docker allows the use of arbitrary seccomp "profiles" via its &lt;code&gt;--security-opt&lt;/code&gt; argument&lt;sup&gt;[22]&lt;/sup&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--security-opt&lt;/span&gt; &lt;span class="nv"&gt;seccomp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/seccomp/profile.json &lt;span class="se"&gt;\&lt;/span&gt;
  hello-world
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;However, most usefully, Docker provides a default security profile that blocks some of the more dangerous system calls that processes run in a container should never need to make, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;clone&lt;/code&gt;: The ability to clone new namespaces&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bpf&lt;/code&gt;: The ability to load and run &lt;code&gt;bpf&lt;/code&gt; programs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;add_key&lt;/code&gt;: The ability to access the kernel keyring&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kexec_load&lt;/code&gt;: The ability to load a new Linux kernel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As well as many others. The full list of syscalls blocked by default is &lt;a href="https://docs.docker.com/engine/security/seccomp/"&gt;available on the Docker website&lt;/a&gt;.&lt;/p&gt;
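&lt;p&gt;For a flavour of what such a profile looks like, the following sketch follows the profile format described in that documentation. It allows everything except &lt;code&gt;kexec_load&lt;/code&gt;; a real hardening profile (like Docker’s default) instead denies by default and allow-lists the calls a workload needs:&lt;/p&gt;

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["kexec_load"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

&lt;p&gt;Saved to a file, this could be passed to the &lt;code&gt;--security-opt seccomp=…&lt;/code&gt; argument shown earlier; the blocked call then fails with an error rather than reaching the kernel.&lt;/p&gt;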

&lt;p&gt;In addition to &lt;code&gt;seccomp&lt;/code&gt; there are other ways to ensure containers are behaving as expected, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux Capabilities&lt;sup&gt;[23]&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;SELinux&lt;/li&gt;
&lt;li&gt;AppArmor&lt;/li&gt;
&lt;li&gt;AuditD&lt;/li&gt;
&lt;li&gt;Falco&lt;sup&gt;[24]&lt;/sup&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these takes a slightly different approach to ensuring a process only behaves as expected. It’s worth spending time investigating the tradeoffs of each of these security tools, or simply delegating the choice to a competent third party provider.&lt;/p&gt;

&lt;p&gt;Additionally, it’s worth noting that even though Docker enables its &lt;code&gt;seccomp&lt;/code&gt; policy by default, orchestration systems such as &lt;code&gt;kubernetes&lt;/code&gt; may disable it&lt;sup&gt;[25]&lt;/sup&gt;.&lt;/p&gt;
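&lt;p&gt;In modern Kubernetes this is configurable per pod or per container through the security context, which can opt a workload back in to the runtime’s default profile. A sketch of a pod spec fragment; the &lt;code&gt;example&lt;/code&gt;, &lt;code&gt;app&lt;/code&gt; and &lt;code&gt;hello-world&lt;/code&gt; names are placeholders:&lt;/p&gt;

```yaml
# Pod spec fragment: opt the pod back in to the container runtime's
# default seccomp profile rather than running unconfined.
apiVersion: v1
kind: Pod
metadata:
  name: example        # placeholder name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app        # placeholder container
      image: hello-world
```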

&lt;h2&gt;
  
  
  Distribution: the union file system
&lt;/h2&gt;

&lt;p&gt;To generate a container image, Docker requires a set of "build instructions": a Dockerfile. A trivial image could be built as follows:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scrath space&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Create a docker file&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; Dockerfile
FROM debian:buster

# Create a test directory
RUN mkdir /test

# Create a bunch of spam files
RUN echo &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; /test/a
RUN echo &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; /test/b
RUN echo &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; /test/c
&lt;/span&gt;&lt;span class="no"&gt;
EOF

&lt;/span&gt;&lt;span class="c"&gt;# Build the image&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker build &lt;span class="nb"&gt;.&lt;/span&gt;
Sending build context to Docker daemon  4.096kB
Step 1/5 : FROM debian:buster
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ebdc13caae1e
Step 2/5 : RUN &lt;span class="nb"&gt;mkdir&lt;/span&gt; /test
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Running &lt;span class="k"&gt;in &lt;/span&gt;a9c0fa1a56c7
Removing intermediate container a9c0fa1a56c7
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 6837541a46a5
Step 3/5 : RUN &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /test/a
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Running &lt;span class="k"&gt;in &lt;/span&gt;8b61ca022296
Removing intermediate container 8b61ca022296
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 3ea076dcea98
Step 4/5 : RUN &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /test/b
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Running &lt;span class="k"&gt;in &lt;/span&gt;940d5bcaa715
Removing intermediate container 940d5bcaa715
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 07b2f7a4dff8
Step 5/5 : RUN &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /test/c
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Running &lt;span class="k"&gt;in &lt;/span&gt;251f5d00b55f
Removing intermediate container 251f5d00b55f
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 0122a70ad0a3
Successfully built 0122a70ad0a3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This creates a Docker image with the id &lt;code&gt;0122a70ad0a3&lt;/code&gt;, containing the output of &lt;code&gt;date&lt;/code&gt; in the files &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt;. We can verify this by starting the container and examining its contents:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  0122a70ad0a3 &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/bash

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /test
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls
&lt;/span&gt;a  b  c
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;

Sat 30 Mar 18:05:24 CET 2019
Sat 30 Mar 18:05:24 CET 2019
Sat 30 Mar 18:05:24 CET 2019
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;However, the &lt;code&gt;docker build&lt;/code&gt; command earlier created several images, not just one. If we run the image from the point where only &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; have been created, we will not see &lt;code&gt;c&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  07b2f7a4dff8 &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/bash
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls test
&lt;/span&gt;a  b
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Docker is not creating a whole new filesystem for each of these images. Instead, each image is layered on top of the previous one. If we query Docker we can see each of the layers that go into a given image:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker &lt;span class="nb"&gt;history &lt;/span&gt;0122a70ad0a3
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
0122a70ad0a3        5 minutes ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019…   29B
07b2f7a4dff8        5 minutes ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019…   29B
3ea076dcea98        5 minutes ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019…   29B
6837541a46a5        5 minutes ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;mkdir&lt;/span&gt; /test                          0B
ebdc13caae1e        12 months ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="c"&gt;#(nop)  CMD ["bash"]                 0B&lt;/span&gt;
&amp;lt;missing&amp;gt;           12 months ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="c"&gt;#(nop) ADD file:2219cecc89ed69975…   106MB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This allows Docker to reuse vast chunks of what it downloads. For example, given the image we built earlier we can see that it uses:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; A layer called &lt;code&gt;ADD file:…​&lt;/code&gt; — this is the Debian Buster root filesystem at 106MB&lt;/li&gt;
&lt;li&gt; A layer for &lt;code&gt;a&lt;/code&gt; that renders the date to disk at 29B&lt;/li&gt;
&lt;li&gt; A layer for &lt;code&gt;b&lt;/code&gt; that renders the date to disk at 29B&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And so on. Docker will reuse the &lt;code&gt;ADD file:…​&lt;/code&gt; Debian Buster root for all images that start with &lt;code&gt;FROM debian:buster&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This allows Docker to be extremely space efficient, reusing the same operating system image across many different containers.&lt;/p&gt;
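&lt;p&gt;To get a feel for the saving, we can do some back-of-the-envelope arithmetic using the layer sizes reported by &lt;code&gt;docker history&lt;/code&gt; above (the numbers are illustrative only):&lt;/p&gt;

```python
# Illustrative arithmetic only: ten hypothetical images built on the same
# 106MB Debian Buster base, each adding three 29-byte `echo` layers.
BASE = 106 * 10**6  # base root filesystem, ~106MB
LAYER = 29          # each `echo` layer, 29 bytes
IMAGES = 10

# Without layer sharing, every image stores its own copy of the base.
without_sharing = IMAGES * (BASE + 3 * LAYER)

# With layer sharing, the base is stored exactly once.
with_sharing = BASE + IMAGES * 3 * LAYER

print(f"naive:  {without_sharing / 10**6:.0f} MB")  # naive:  1060 MB
print(f"shared: {with_sharing / 10**6:.0f} MB")     # shared: 106 MB
```

&lt;p&gt;The base layer dominates: sharing it reduces the on-disk cost of ten images to little more than the cost of one.&lt;/p&gt;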

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even though Docker is space efficient, its image store on disk can still grow extremely large, and transferring large images over the network can become expensive. Therefore, try to reuse image layers where possible and prefer smaller base operating systems or the &lt;code&gt;scratch&lt;/code&gt; (empty) image.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These layers are implemented via a Union Filesystem, or UnionFS. There are various "backends" or filesystems that can implement this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;overlay2&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;devicemapper&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;aufs&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
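&lt;p&gt;The lookup rule such a filesystem applies can be sketched as a toy model. Note this is purely illustrative: the real &lt;code&gt;overlay2&lt;/code&gt; implementation lives in the kernel and differs substantially:&lt;/p&gt;

```python
# Toy model of union-filesystem lookup (illustrative only). Layers are
# stacked topmost-first; the first layer containing a path wins.

WHITEOUT = object()  # marker for a file deleted in an upper layer

def lookup(layers, path):
    """Search layers from topmost to bottommost for `path`."""
    for layer in layers:
        if path in layer:
            content = layer[path]
            if content is WHITEOUT:
                return None  # deleted in an upper layer
            return content
    return None  # not present in any layer

# Topmost layer first, base image last — mirrors the layers above.
layers = [
    {"/test/c": "Sat 30 Mar 18:05:24 CET 2019"},  # layer for `c`
    {"/test/b": "Sat 30 Mar 18:05:24 CET 2019"},  # layer for `b`
    {"/test/a": "Sat 30 Mar 18:05:24 CET 2019"},  # layer for `a`
    {"/bin/bash": "<binary>"},                    # Debian Buster root
]

print(lookup(layers, "/test/a"))    # found in a lower layer
print(lookup(layers, "/bin/bash"))  # falls through to the base image
```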

&lt;p&gt;Generally speaking the kernel on our machine will include the appropriate underlying filesystem driver; we can check which one Docker is using:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker info | &lt;span class="nb"&gt;grep &lt;/span&gt;Storage
Storage Driver: overlay2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We can replicate this layering ourselves with an overlay mount fairly easily&lt;sup&gt;[26]&lt;/sup&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# scratch&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Create some layers&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  lower &lt;span class="se"&gt;\&lt;/span&gt;
  upper &lt;span class="se"&gt;\&lt;/span&gt;
  workdir &lt;span class="se"&gt;\&lt;/span&gt;
  overlay

&lt;span class="c"&gt;# Create some files that represent the layers&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;lower/i-am-the-lower
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;higher/i-am-the-higher

&lt;span class="c"&gt;# Create the layered filesystem at overlay with lower, upper and workdir&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;mount &lt;span class="nt"&gt;-t&lt;/span&gt; overlay &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;lowerdir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lower,upperdir&lt;span class="o"&gt;=&lt;/span&gt;upper,workdir&lt;span class="o"&gt;=&lt;/span&gt;workdir &lt;span class="se"&gt;\&lt;/span&gt;
    overlay &lt;span class="se"&gt;\&lt;/span&gt;
    ./overlay

&lt;span class="c"&gt;# List the directory&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls &lt;/span&gt;overlay/
i-am-the-lower  i-am-the-upper
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Docker nests these mounts, stacking one layer upon another until the full multi-layered image filesystem has been assembled.&lt;/p&gt;

&lt;p&gt;In the case of &lt;code&gt;overlay2&lt;/code&gt;, files written at runtime land in the &lt;code&gt;upper&lt;/code&gt; directory. Docker will generally dispose of this writable layer when the container is removed.&lt;/p&gt;
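&lt;p&gt;The write path can be sketched the same way (again a toy model, not the kernel implementation): writes only ever touch the upper layer, which is why discarding it (as &lt;code&gt;docker run --rm&lt;/code&gt; does) cannot damage the image:&lt;/p&gt;

```python
# Toy model of overlay-style copy-on-write (illustrative only).
# Reads prefer the writable upper layer; writes always land in it,
# leaving the read-only lower layers untouched.

def read(upper, lowers, path):
    if path in upper:
        return upper[path]
    for layer in lowers:
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

def write(upper, path, content):
    # Only the upper layer is ever modified.
    upper[path] = content

lowers = [{"/etc/motd": "base image contents"}]
upper = {}

write(upper, "/etc/motd", "modified in the container")
print(read(upper, lowers, "/etc/motd"))  # the upper layer shadows the lower
print(lowers[0]["/etc/motd"])            # the base layer is unchanged
```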

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generally speaking all software needs access to shared libraries found at static paths in Linux operating systems. Accordingly it is the convention to simply ship a stripped down version of an operating system’s root file system, such that applications can find the libraries they expect and users can install additional tooling. However, it is possible to use an empty filesystem and a statically compiled binary with the &lt;code&gt;scratch&lt;/code&gt; image type.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Connectivity: networking
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, containers make use of Linux namespaces. Of particular interest when understanding container networking is the network namespace. This namespace gives the process separate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;(virtual) ethernet devices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;routing tables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;iptables&lt;/code&gt; rules&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example,&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a new network namespace&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;unshare &lt;span class="nt"&gt;--fork&lt;/span&gt; &lt;span class="nt"&gt;--net&lt;/span&gt;

&lt;span class="c"&gt;# List the ethernet devices with associated ip addresses&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip addr
1: lo: &amp;lt;LOOPBACK&amp;gt; mtu 65536 qdisc noop state DOWN group default qlen 1000
    &lt;span class="nb"&gt;link&lt;/span&gt;/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

&lt;span class="c"&gt;# List all iptables rules&lt;/span&gt;
root@sw-20160616-01:/home/andrewhowden# iptables &lt;span class="nt"&gt;-L&lt;/span&gt;
Chain INPUT &lt;span class="o"&gt;(&lt;/span&gt;policy ACCEPT&lt;span class="o"&gt;)&lt;/span&gt;
target     prot opt &lt;span class="nb"&gt;source               &lt;/span&gt;destination

Chain FORWARD &lt;span class="o"&gt;(&lt;/span&gt;policy ACCEPT&lt;span class="o"&gt;)&lt;/span&gt;
target     prot opt &lt;span class="nb"&gt;source               &lt;/span&gt;destination

Chain OUTPUT &lt;span class="o"&gt;(&lt;/span&gt;policy ACCEPT&lt;span class="o"&gt;)&lt;/span&gt;
target     prot opt &lt;span class="nb"&gt;source               &lt;/span&gt;destination

&lt;span class="c"&gt;# List all network routes&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip route show
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;By default, the container has no network connectivity — not even the &lt;code&gt;loopback&lt;/code&gt; adapter is up. We cannot even ping ourselves!&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ping 127.0.0.1
PING 127.0.0.1 &lt;span class="o"&gt;(&lt;/span&gt;127.0.0.1&lt;span class="o"&gt;)&lt;/span&gt;: 56 data bytes
ping: sending packet: Network is unreachable
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We can start setting up the expected network environment by bringing up the &lt;code&gt;loopback&lt;/code&gt; adapter:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ip &lt;span class="nb"&gt;link set &lt;/span&gt;lo up
root@sw-20160616-01:/home/andrewhowden# ip addr
1: lo: &amp;lt;LOOPBACK,UP,LOWER_UP&amp;gt; mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    &lt;span class="nb"&gt;link&lt;/span&gt;/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

&lt;span class="c"&gt;# Test the loopback adapter&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ping 127.0.0.1
PING 127.0.0.1 &lt;span class="o"&gt;(&lt;/span&gt;127.0.0.1&lt;span class="o"&gt;)&lt;/span&gt;: 56 data bytes
64 bytes from 127.0.0.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.092 ms
64 bytes from 127.0.0.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.068 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;However, we cannot access the outside world. In most environments our host machine will be connected via ethernet to a given network and either have an IP assigned to it via the cloud provider or, in the case of a development or office machine, request an IP via DHCP. However our container is in a network namespace of its own and has no knowledge of the ethernet connected to the host. To connect the container to the host we need to employ a &lt;code&gt;veth&lt;/code&gt; device.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;veth&lt;/code&gt;, or "Virtual Ethernet Device", is defined by &lt;code&gt;man veth&lt;/code&gt; as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The veth devices are virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to a physical network device in another namespace, but can also be used as standalone network devices.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is exactly what we need! Because &lt;code&gt;unshare&lt;/code&gt; creates an anonymous network namespace we need to determine what the &lt;code&gt;pid&lt;/code&gt; of the process started in that namespace is&lt;sup&gt;[27][28]&lt;/sup&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;
18171
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
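&lt;p&gt;Namespaces can also be identified directly: each process exposes symbolic links under &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/ns/&lt;/code&gt;, and two processes share a namespace exactly when those links point at the same inode. A small Linux-only sketch in Python:&lt;/p&gt;

```python
import os

# Synchronise parent and child with a pipe so the child's /proc entry
# is still live while we inspect it.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    os.read(r, 1)  # child waits until the parent is done inspecting
    os._exit(0)

# Each entry under /proc/<pid>/ns/ names a namespace by type and inode,
# e.g. "net:[4026531992]".
me = os.readlink("/proc/self/ns/net")
child_ns = os.readlink(f"/proc/{pid}/ns/net")

os.write(w, b"x")   # release the child
os.waitpid(pid, 0)  # reap it

print(me)
print(me == child_ns)  # True: a fork inherits its parent's namespaces
```

&lt;p&gt;A process that calls &lt;code&gt;unshare&lt;/code&gt; would instead show a different inode for the unshared namespace type.&lt;/p&gt;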



&lt;p&gt;We can then create the &lt;code&gt;veth&lt;/code&gt; device:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ip &lt;span class="nb"&gt;link &lt;/span&gt;add veth0 &lt;span class="nb"&gt;type &lt;/span&gt;veth peer name veth0 netns 18171
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;On both the host and the guest we can see these virtual ethernet devices appear. However, neither has an IP attached nor any routes defined:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Container&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;ip addr
1: lo: &amp;lt;LOOPBACK&amp;gt; mtu 65536 qdisc noop state DOWN group default qlen 1000
    &lt;span class="nb"&gt;link&lt;/span&gt;/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth0@if7: &amp;lt;BROADCAST,MULTICAST&amp;gt; mtu 1500 qdisc noop state DOWN group default qlen 1000
    &lt;span class="nb"&gt;link&lt;/span&gt;/ether 16:34:52:54:a2:a1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
&lt;span class="nv"&gt;$ &lt;/span&gt;ip route show

&lt;span class="c"&gt;# No output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To address that we simply add an IP and define the default route:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the host&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip addr add 192.168.24.1 dev veth0

&lt;span class="c"&gt;# Within the container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip address add 192.168.24.10 dev veth0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;From there, bring the devices up:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Both host and container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip &lt;span class="nb"&gt;link set &lt;/span&gt;veth0 up
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Add a route such that &lt;code&gt;192.168.24.0/24&lt;/code&gt; goes out via &lt;code&gt;veth0&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Both host and guest&lt;/span&gt;
ip route add 192.168.24.0/24 dev veth0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And voilà! We have connectivity to the host namespace and back:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Within container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ping 192.168.24.1
PING 192.168.24.1 &lt;span class="o"&gt;(&lt;/span&gt;192.168.24.1&lt;span class="o"&gt;)&lt;/span&gt;: 56 data bytes
64 bytes from 192.168.24.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.149 ms
64 bytes from 192.168.24.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.096 ms
64 bytes from 192.168.24.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.104 ms
64 bytes from 192.168.24.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.100 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;However, that does not give us access to the wider internet. While the &lt;code&gt;veth&lt;/code&gt; adapter functions as a virtual cable between our container and our host, there is currently no path from our container to the internet:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Within container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ping google.com
ping: unknown host
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To create such a path we need to modify our host such that it functions as a "router" between its own, separated network namespaces and its internet facing adapter.&lt;/p&gt;

&lt;p&gt;Luckily, Linux is set up well for this purpose. First, we need to change the kernel’s default behaviour of dropping packets that are not destined for one of its own IP addresses, and instead allow it to forward packets from one adapter to another:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Within container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;1 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /proc/sys/net/ipv4/ip_forward
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That means that when packets destined for public IPs arrive from our container’s &lt;code&gt;veth&lt;/code&gt; adapter at the host’s &lt;code&gt;veth&lt;/code&gt; adapter, the host won’t simply drop them.&lt;/p&gt;

&lt;p&gt;From there we employ &lt;code&gt;iptables&lt;/code&gt; rules on the host to forward traffic from the host &lt;code&gt;veth&lt;/code&gt; adapter to the internet facing adapter — in this case &lt;code&gt;wlp2s0&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the host&lt;/span&gt;
&lt;span class="c"&gt;# Forward packets from the container to the host adapter&lt;/span&gt;
iptables &lt;span class="nt"&gt;-A&lt;/span&gt; FORWARD &lt;span class="nt"&gt;-i&lt;/span&gt; veth0 &lt;span class="nt"&gt;-o&lt;/span&gt; wlp2s0 &lt;span class="nt"&gt;-j&lt;/span&gt; ACCEPT

&lt;span class="c"&gt;# Forward packets that have been established via egress from the host adapater back to the contianer&lt;/span&gt;
iptables &lt;span class="nt"&gt;-A&lt;/span&gt; FORWARD &lt;span class="nt"&gt;-i&lt;/span&gt; wlp2s0 &lt;span class="nt"&gt;-o&lt;/span&gt; veth0 &lt;span class="nt"&gt;-m&lt;/span&gt; state &lt;span class="nt"&gt;--state&lt;/span&gt; ESTABLISHED,RELATED &lt;span class="nt"&gt;-j&lt;/span&gt; ACCEPT

&lt;span class="c"&gt;# Relabel the IPs for the container so return traffic will be routed correctly&lt;/span&gt;
iptables &lt;span class="nt"&gt;-t&lt;/span&gt; nat &lt;span class="nt"&gt;-A&lt;/span&gt; POSTROUTING &lt;span class="nt"&gt;-o&lt;/span&gt; wlp2s0 &lt;span class="nt"&gt;-j&lt;/span&gt; MASQUERADE
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We then tell our container to send traffic it doesn’t otherwise know how to route down the &lt;code&gt;veth&lt;/code&gt; adapter:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Within the container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip route add default via 192.168.24.1 dev veth0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And the internet works!&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="c"&gt;# ping google.com&lt;/span&gt;
PING google.com &lt;span class="o"&gt;(&lt;/span&gt;172.217.22.14&lt;span class="o"&gt;)&lt;/span&gt;: 56 data bytes
64 bytes from 172.217.22.14: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;55 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;16.456 ms
64 bytes from 172.217.22.14: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;55 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;15.102 ms
64 bytes from 172.217.22.14: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;55 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;34.369 ms
64 bytes from 172.217.22.14: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;55 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;15.319 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As mentioned, each container implementation can implement networking differently. There are implementations that use the aforementioned &lt;code&gt;veth&lt;/code&gt; pair, &lt;code&gt;vxlan&lt;/code&gt;, &lt;code&gt;BPF&lt;/code&gt; or other cloud specific implementations. However, when designing containers we need some way to reason about what behaviour we should expect.&lt;/p&gt;

&lt;p&gt;To help address this the &lt;a href="https://github.com/containernetworking/cni"&gt;"Container Network Interface"&lt;/a&gt; tooling has been designed. This allows defining consistent network behaviour across network implementations, as well as models such as Kubernetes’ shared &lt;code&gt;lo&lt;/code&gt; adapter between several containers.&lt;/p&gt;

&lt;p&gt;The networking side of containers is an area undergoing rapid innovation. However, relying on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; A &lt;code&gt;lo&lt;/code&gt; interface&lt;/li&gt;
&lt;li&gt; A public facing &lt;code&gt;eth0&lt;/code&gt; (or similar) interface&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;being present seems a fairly stable guarantee.&lt;/p&gt;

&lt;h1&gt;
  
  
  Landscape review
&lt;/h1&gt;

&lt;p&gt;Given our understanding of the implementation of containers we can now take a look at some of the classic Docker discussions.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Updates
&lt;/h2&gt;

&lt;p&gt;One of the oft-overlooked aspects of containers is the need to keep both them and the host system up to date.&lt;/p&gt;

&lt;p&gt;In modern systems it is quite common to simply enable automatic updates on host systems and, so long as we stick to the system package manager and ensure updates stay successful, the system will keep itself both up to date and stable.&lt;/p&gt;

&lt;p&gt;However, containers take a very different approach. They’re effectively giant static binaries deployed into a production system, and in this capacity they can do no self-maintenance.&lt;/p&gt;

&lt;p&gt;Accordingly, even if there are no updates to the software the container runs, containers should be periodically rebuilt and redeployed to the production system — lest they accumulate vulnerabilities over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Init within container
&lt;/h2&gt;

&lt;p&gt;Given our understanding of containers it’s reasonable to consider the "1 process per container" advice an oversimplification of how containers work; in some cases it makes sense to do service management within a container with a system like &lt;code&gt;runit&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This allows multiple processes to be executed within a single container including things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;syslog&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;logrotate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cron&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so forth.&lt;/p&gt;

&lt;p&gt;In the case where Docker is the only system in use it is indeed reasonable to think about doing service management within Docker — particularly when hitting the constraints of shared filesystem or network state. However, systems such as Kubernetes, Swarm or Mesos have obviated much of the need for these init systems; tasks such as log aggregation, restarting services or colocating services are taken care of by these tools.&lt;/p&gt;

&lt;p&gt;Accordingly it’s best to keep containers simple such that they are maximally composable and easy to debug, delegating the more complex behaviour out.&lt;/p&gt;

&lt;h1&gt;
  
  
  In Conclusion
&lt;/h1&gt;

&lt;p&gt;Containers are an excellent way to ship software to production systems. They solve a swathe of interesting problems and cost very little as a result. However, their rapid growth has caused some confusion in the industry as to exactly how they work, whether they’re stable and so forth. Containers are a combination of both old and new Linux kernel technology such as namespaces, cgroups, seccomp and other Linux networking tooling, but are as stable as any other kernel technology (so, very) and well suited to production systems.&lt;/p&gt;

&lt;p&gt;&amp;lt;3 for making it this far.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt; “Docker.” &lt;a href="https://en.wikipedia.org/wiki/Docker_(software)"&gt;https://en.wikipedia.org/wiki/Docker_(software)&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt; “Cloud Native Technologies in the Fortune 100.” &lt;a href="https://redmonk.com/fryan/2017/09/10/cloud-native-technologies-in-the-fortune-100/"&gt;https://redmonk.com/fryan/2017/09/10/cloud-native-technologies-in-the-fortune-100/&lt;/a&gt; , Sep-2017.&lt;/li&gt;
&lt;li&gt; B. Cantrill, “The Container Revolution: Reflections After the First Decade.” &lt;a href="https://www.youtube.com/watch?v=xXWaECk9XqM"&gt;https://www.youtube.com/watch?v=xXWaECk9XqM&lt;/a&gt; , Sep-2018.&lt;/li&gt;
&lt;li&gt; “Papers (Jail).” &lt;a href="https://docs.freebsd.org/44doc/papers/jail/jail.html"&gt;https://docs.freebsd.org/44doc/papers/jail/jail.html&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt; “An absolutely minimal chroot.” &lt;a href="https://sagar.se/an-absolutely-minimal-chroot.html"&gt;https://sagar.se/an-absolutely-minimal-chroot.html&lt;/a&gt; , Jan-2011.&lt;/li&gt;
&lt;li&gt; J. Beck &lt;em&gt;et al.&lt;/em&gt;, “Virtualization and Namespace Isolation in the Solaris Operating System (PSARC/2002/174).” &lt;a href="https://us-east.manta.joyent.com/jmc/public/opensolaris/ARChive/PSARC/2002/174/zones-design.spec.opensolaris.pdf"&gt;https://us-east.manta.joyent.com/jmc/public/opensolaris/ARChive/PSARC/2002/174/zones-design.spec.opensolaris.pdf&lt;/a&gt; , Sep-2006.&lt;/li&gt;
&lt;li&gt; M. Kerrisk, “Namespaces in operation, part 1: namespaces overview.” &lt;a href="https://lwn.net/Articles/531114/"&gt;https://lwn.net/Articles/531114/&lt;/a&gt; , Jan-2013.&lt;/li&gt;
&lt;li&gt; A. Polvi, “CoreOS is building a container runtime, rkt.” &lt;a href="https://coreos.com/blog/rocket.html"&gt;https://coreos.com/blog/rocket.html&lt;/a&gt; , Jan-2014.&lt;/li&gt;
&lt;li&gt; “Basics of the Unix Philosophy.” &lt;a href="http://www.catb.org/~esr/writings/taoup/html/ch01s06.html"&gt;http://www.catb.org/~esr/writings/taoup/html/ch01s06.html&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;P. Estes and M. Brown, “OCI Image Support Comes to Open Source Docker Registry.” &lt;a href="https://www.opencontainers.org/blog/2018/10/11/oci-image-support-comes-to-open-source-docker-registry"&gt;https://www.opencontainers.org/blog/2018/10/11/oci-image-support-comes-to-open-source-docker-registry&lt;/a&gt; , Oct-2018.&lt;/li&gt;
&lt;li&gt;“Open Container Initiative Runtime Specification.” &lt;a href="https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/spec.md"&gt;https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/spec.md&lt;/a&gt; , Mar-2018.&lt;/li&gt;
&lt;li&gt;“The 5 principles of Standard Containers.” &lt;a href="https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/principles.md"&gt;https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/principles.md&lt;/a&gt; , Dec-2016.&lt;/li&gt;
&lt;li&gt;“Open Container Initiative Image Specification.” &lt;a href="https://github.com/opencontainers/image-spec/blob/db4d6de99a2adf83a672147d5f05a2e039e68ab6/spec.md"&gt;https://github.com/opencontainers/image-spec/blob/db4d6de99a2adf83a672147d5f05a2e039e68ab6/spec.md&lt;/a&gt; , Jun-2017.&lt;/li&gt;
&lt;li&gt;“Open Container Initiative Distribution Specification.” &lt;a href="https://github.com/opencontainers/distribution-spec/blob/d93cfa52800990932d24f86fd233070ad9adc5e0/spec.md"&gt;https://github.com/opencontainers/distribution-spec/blob/d93cfa52800990932d24f86fd233070ad9adc5e0/spec.md&lt;/a&gt; , Mar-2019.&lt;/li&gt;
&lt;li&gt;“Docker Overview.” &lt;a href="https://docs.docker.com/engine/docker-overview/"&gt;https://docs.docker.com/engine/docker-overview/&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;J. Frazelle, “Containers aka crazy user space fun.” &lt;a href="https://www.youtube.com/watch?v=7mzbIOtcIaQ"&gt;https://www.youtube.com/watch?v=7mzbIOtcIaQ&lt;/a&gt; , Jan-2018.&lt;/li&gt;
&lt;li&gt;“Use Host Networking.” &lt;a href="https://docs.docker.com/network/host/"&gt;https://docs.docker.com/network/host/&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;Krallin, “Tini: A tini but valid init for containers.” &lt;a href="https://github.com/krallin/tini"&gt;https://github.com/krallin/tini&lt;/a&gt; , Nov-2018.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://chromium.googlesource.com/chromium/src.git/+/HEAD/docs/linux_sandboxing.md"&gt;https://chromium.googlesource.com/chromium/src.git/+/HEAD/docs/linux_sandboxing.md&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;L. Poettering, “systemd for Administrators, Part XVIII.” &lt;a href="http://0pointer.de/blog/projects/resources.html"&gt;http://0pointer.de/blog/projects/resources.html&lt;/a&gt; , Oct-2012.&lt;/li&gt;
&lt;li&gt;A. Howden, “Coming to grips with eBPF.” &lt;a href="https://www.littleman.co/articles/coming-to-grips-with-ebpf/"&gt;https://www.littleman.co/articles/coming-to-grips-with-ebpf/&lt;/a&gt; , Mar-2019.&lt;/li&gt;
&lt;li&gt;“Seccomp security profiles for docker.” &lt;a href="https://docs.docker.com/engine/security/seccomp/"&gt;https://docs.docker.com/engine/security/seccomp/&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;“Linux kernel capabilities.” &lt;a href="https://docs.docker.com/engine/security/security/#linux-kernel-capabilities"&gt;https://docs.docker.com/engine/security/security/#linux-kernel-capabilities&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;M. Stemm, “SELinux, Seccomp, Sysdig Falco, and you: A technical discussion.” &lt;a href="https://sysdig.com/blog/selinux-seccomp-falco-technical-discussion/"&gt;https://sysdig.com/blog/selinux-seccomp-falco-technical-discussion/&lt;/a&gt; , Dec-2016.&lt;/li&gt;
&lt;li&gt;“Pod Security Policies.” &lt;a href="https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp"&gt;https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;Programster, “Example OverlayFS Usage.” &lt;a href="https://askubuntu.com/a/704358"&gt;https://askubuntu.com/a/704358&lt;/a&gt; , Nov-2015.&lt;/li&gt;
&lt;li&gt;“How do I connect a veth device inside an ’anonymous’ network namespace to one outside?” &lt;a href="https://unix.stackexchange.com/a/396210"&gt;https://unix.stackexchange.com/a/396210&lt;/a&gt; , Oct-2017.&lt;/li&gt;
&lt;li&gt;D. P. García, “Network namespaces.” &lt;a href="https://blogs.igalia.com/dpino/2016/04/10/network-namespaces/"&gt;https://blogs.igalia.com/dpino/2016/04/10/network-namespaces/&lt;/a&gt; , Apr-2016.&lt;/li&gt;
&lt;/ol&gt;


</description>
      <category>container</category>
      <category>docker</category>
      <category>namespace</category>
      <category>deepdive</category>
    </item>
    <item>
      <title>Laying out a git repository</title>
      <dc:creator>Andrew Howden</dc:creator>
      <pubDate>Tue, 26 Mar 2019 06:34:10 +0000</pubDate>
      <link>https://dev.to/andrewhowdencom/laying-out-a-git-repository-ii1</link>
      <guid>https://dev.to/andrewhowdencom/laying-out-a-git-repository-ii1</guid>
      <description>

&lt;p&gt;Version control is one of the more fundamental pieces of software development. It allows developers to navigate through a project’s history to understand who implemented each change, as well as why they did so. It is an invaluable tool for understanding any given issue.&lt;/p&gt;

&lt;p&gt;littleman.co uses git as its version control tool of choice. &lt;code&gt;git&lt;/code&gt; is the de facto standard of the software industry, having replaced Mercurial, Subversion and CVS. The majority of our development tools and our workflow build on top of &lt;code&gt;git&lt;/code&gt; primitives such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;patches&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;branches&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;tags&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so forth. That said &lt;code&gt;git&lt;/code&gt;, for all its opinions, is remarkably silent about how to lay out a project.&lt;/p&gt;
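&lt;p&gt;As a quick illustration of those primitives in action (the repository name and file used here are purely illustrative):&lt;/p&gt;

```shell
# Create a throwaway repository; "demo" is an illustrative name.
git init --quiet demo
git -C demo config user.email "demo@example.com"
git -C demo config user.name "Demo"

# A commit is the unit from which patches are generated
printf 'hello\n' | tee demo/README.txt
git -C demo add README.txt
git -C demo commit --quiet -m "Add README"

# A branch is a movable pointer to a line of development
git -C demo branch feature/example

# A tag is a fixed pointer, typically marking a release
git -C demo tag v0.1.0

git -C demo branch --list
git -C demo tag --list
```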

&lt;p&gt;This is a good thing for the tool but not necessarily for the developer. When first reading a project to understand and debug it a developer needs to build a model of that project as quickly as possible. They can then use that model to make predictions about how the software should behave, as well as spot things that violate those predictions. If we keep projects consistent we reduce the number of odd things developers need to investigate before finding the issue they are looking for.&lt;/p&gt;

&lt;p&gt;Accordingly it’s a good idea to structure all projects in the same way, so that developers can easily understand and search through them.&lt;/p&gt;

&lt;h1&gt;
  
  
  Existing Standards
&lt;/h1&gt;

&lt;p&gt;Defining a standard for how a project should be laid out is hardly a new endeavour. There are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="http://refspecs.linuxfoundation.org/FHS_3.0/fhs-3.0.html"&gt;Linux Filesystem Hierarchy Standard&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://github.com/golang-standards/project-layout"&gt;standard go project layout&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://maven.apache.org/guides/introduction/introduction-to-the-standard-directory-layout.html"&gt;Maven standard directory layout&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://docs.python-guide.org/writing/structure/"&gt;Python standard project layout&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If one of these standards is in wide use in your organisation it’s best to continue with that, rather than &lt;a href="https://xkcd.com/927/"&gt;adopt yet another standard&lt;/a&gt;. However, each of these standards has the limitation that it is only used in the context of the language or build tooling in which it is defined. In an environment such as littleman.co that includes many different languages, applications and other types of development, these standards either do not define enough behaviour to be useful or define things that do not translate well between languages.&lt;/p&gt;

&lt;h1&gt;
  
  
  Determining the boundaries of a repository
&lt;/h1&gt;

&lt;p&gt;There are usually many different components of a project that need to come together before that project is user facing and doing useful work. Things like the:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Application&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Infrastructure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Artifacts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Documentation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These must all be coordinated in a way that allows developers to make changes to a project predictably, and with predictable timing, and have those changes pushed to users.&lt;/p&gt;

&lt;p&gt;Traditionally each of these components would be kept separate, handled by different teams. However with the advent of continuous delivery developers can push code to production in a "self service" manner, and have a robot take care of tasks such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ensuring the application works as expected before it hits users&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replacing the existing application in production with the new application&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rolling back the application to its previous version in the case of failure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creating testing environments&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so forth.&lt;/p&gt;

&lt;p&gt;Deployments are the boundary that seems most useful for determining what should be in a single repository. For example, if the application is the only thing that should change in a single deployment, it can be the only thing in that repository. However, if an application change requires an underlying infrastructure change, that infrastructure should also be in the repository. If the application requires a new set of tests and those tests live in the CI/CD configuration, that configuration also belongs in the repository.&lt;/p&gt;

&lt;p&gt;However, this also provides good boundaries as to what does not belong in the repository. The application should never require Kubernetes to be in a specific configuration, and Kubernetes configuration and life cycle should thus be managed in a separate repository. If the application requires new TLS certificates but those certificates are handled in a process outside the normal application development process they should also not be stored in the repository.&lt;/p&gt;

&lt;p&gt;By using the deployment as our boundary to determine what goes in and out of the project we see a number of benefits:&lt;/p&gt;

&lt;h2&gt;
  
  
  Democratised project tooling
&lt;/h2&gt;

&lt;p&gt;Even though tooling such as Docker or CI/CD may require specialised knowledge that application developers have no other reason to learn, seeing those changes in the same place, and subject to the same standards, as the rest of the application gives developers a better understanding of their own project lifecycle. They can use that knowledge to reduce the time required to understand and resolve issues associated with changes in that process, such as a CI/CD configuration change breaking asset compilation in the application.&lt;/p&gt;

&lt;p&gt;Additionally those developers can contribute application specific insights to the CI/CD process, such as the best place to store configuration or environment specific application configuration that must be applied.&lt;/p&gt;

&lt;h2&gt;
  
  
  Single view of changes
&lt;/h2&gt;

&lt;p&gt;When working out how and when a bug was introduced into a service, the fewer places we must look to correlate changes, the faster we can find and resolve the issue.&lt;/p&gt;

&lt;p&gt;By having all changes associated with the project, down to the next "deployment layer", in one place we can quickly see whether an application code change, configuration change or environment change was introduced at the same time an issue started hitting users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coordinated Changes
&lt;/h2&gt;

&lt;p&gt;There are times in which an application change and a configuration or environment change must happen at the same time. Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The addition of a new data store (Redis)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Newly exposed application configuration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A new application feature that requires a system library&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By having both the application and the infrastructure in a single repository we can review both the application changes and the infrastructure changes in a single pull request and ensure they’re released and tested in a coordinated way.&lt;/p&gt;

&lt;p&gt;Additionally any deployment artifacts generated can be directly traced back to a change in the &lt;code&gt;git&lt;/code&gt; repository allowing operations team members to know exactly what code is running in production at any given time.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Standard
&lt;/h1&gt;

&lt;p&gt;The littleman.co standard is derived from the requirements as above. The directory layout is as follows:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;tree &lt;span class="nb"&gt;.&lt;/span&gt;

├── bin
├── build
│   ├── ci
│   └── container
│       ├── Dockerfile
│       └── etc
├── deploy
│   ├── ansible
│   │   └── playbook.yml
│   ├── docker-compose
│   │   ├── docker-compose.yml
│   │   └── mnt
│   │       └── app
│   └── helm
├── docs
├── LICENSE.txt
├── README.adoc
├── src
└── web

14 directories, 5 files
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
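&lt;p&gt;As a sketch, the skeleton above can be recreated by hand; the project name &lt;code&gt;myproject&lt;/code&gt; is an assumption for illustration only:&lt;/p&gt;

```shell
# Recreate the standard layout; "myproject" is an illustrative name,
# not part of the standard itself.
mkdir -p myproject/bin
mkdir -p myproject/build/ci
mkdir -p myproject/build/container/etc
mkdir -p myproject/deploy/ansible
mkdir -p myproject/deploy/docker-compose/mnt/app
mkdir -p myproject/deploy/helm
mkdir -p myproject/docs
mkdir -p myproject/src
mkdir -p myproject/web
touch myproject/LICENSE.txt
touch myproject/README.adoc
touch myproject/build/container/Dockerfile
touch myproject/deploy/ansible/playbook.yml
touch myproject/deploy/docker-compose/docker-compose.yml

# Verify the skeleton
find myproject -type d | sort
```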



&lt;p&gt;A &lt;a href="https://github.com/littlemanco/boilr-gitrepo"&gt;new project was published on GitHub&lt;/a&gt; alongside this post that describes these standards, formatted as &lt;a href="https://github.com/tmrts/boilr"&gt;a &lt;code&gt;boilr&lt;/code&gt; template&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  /
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;├── LICENSE.txt
├── README.adoc
├── .drone.yml
├── .arclint
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There are various files that are either required by convention or by project tooling to be in the root of the project.&lt;/p&gt;

&lt;p&gt;These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LICENSE.txt&lt;/strong&gt;: The project license&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;README.adoc&lt;/strong&gt;: Some basic description about the project&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;.drone.yml&lt;/strong&gt;: The task runner / CI configuration for the project&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;.arclint&lt;/strong&gt;: Configuration for the Arcanist lint runner&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
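&lt;p&gt;A minimal &lt;code&gt;.drone.yml&lt;/code&gt; might look something like the following sketch; the step name and image are illustrative rather than prescribed by the standard:&lt;/p&gt;

```shell
# Write a minimal, hypothetical Drone pipeline configuration.
# The step name and image are illustrative only.
printf '%s\n' \
  'kind: pipeline' \
  'name: default' \
  '' \
  'steps:' \
  '- name: lint' \
  '  image: alpine:3.9' \
  '  commands:' \
  '  - echo "linting"' \
  | tee .drone.yml
```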

&lt;h1&gt;
  
  
  Build
&lt;/h1&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── build
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Build configuration is expected to produce some sort of artifact, either consumed later in the build or deployed to an environment.&lt;/p&gt;

&lt;p&gt;These include:&lt;/p&gt;

&lt;h2&gt;
  
  
  CI
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── build
    └── ci
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Sometimes there are limitations with the build system that require additional procedural scripts to do some &lt;code&gt;$THING&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These are somewhat of an anti-pattern though; where possible, prefer build tools that address the problem in a more abstract way, or reusable plugins in &lt;a href="http://plugins.drone.io/"&gt;the style of &lt;code&gt;drone&lt;/code&gt; plugins&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Containers
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── build
    └── container
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Containers are the canonical deployment artifact used by littleman.co. They’re built from the &lt;code&gt;Dockerfile&lt;/code&gt; definition.&lt;/p&gt;

&lt;p&gt;Generally there is only one production container per project, though other containers may be used to assist with bespoke application build tasks.&lt;/p&gt;

&lt;h1&gt;
  
  
  Deploy
&lt;/h1&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── deploy
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The deployment folder contains any "infrastructure as code" configuration. There are various kinds that are in common use, including:&lt;/p&gt;

&lt;h2&gt;
  
  
  Helm
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── deploy
    └── helm
        ├── Chart.yaml
        ├── templates
        └── ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Helm is a project for managing the definitions and lifecycle of Kubernetes objects. It is an opinionated way of packaging and vendoring software and there are &lt;a href="https://github.com/helm/charts/tree/master/stable"&gt;a number of pre-packaged bits of software&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each bit of software is packaged into a "chart". This chart includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Some metadata describing the software&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The deployment definitions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The deployment definition configuration&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
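&lt;p&gt;A minimal sketch of that chart metadata, with an illustrative name and version (note that Helm expects the metadata file to be named &lt;code&gt;Chart.yaml&lt;/code&gt;):&lt;/p&gt;

```shell
# Sketch a chart's metadata file; the name, version and description
# are illustrative, not part of the littleman.co standard.
mkdir -p deploy/helm/templates
printf '%s\n' \
  'apiVersion: v1' \
  'name: example-service' \
  'version: 0.1.0' \
  'description: An illustrative chart' \
  | tee deploy/helm/Chart.yaml
```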

&lt;p&gt;Usually a project has only a single chart. However, where multiple charts are required to launch the project, each chart is nested in its own subdirectory:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── deploy
    └── helm
        └── service-a
            ├── Chart.yaml
            ├── templates
            └── ...
        └── service-b
            ├── Chart.yaml
            └── ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Generally speaking however, it is an anti-pattern to need multiple services for a single project. The project should be deployed as a single, &lt;a href="https://en.wikipedia.org/wiki/Atomic_commit"&gt;atomic change&lt;/a&gt;. These services are better organised &lt;a href="https://helm.sh/docs/chart_template_guide/"&gt;in the subchart pattern&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ansible
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── deploy
    └── ansible
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Ansible is a tool for defining machine specifications and having them enforced. The layout within this folder should be &lt;a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html#directory-layout"&gt;the layout defined by Ansible upstream&lt;/a&gt;, with the exception that each project is expected to only define one role.&lt;/p&gt;
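&lt;p&gt;A sketch of that upstream layout with a single role; the role name &lt;code&gt;myrole&lt;/code&gt; is illustrative:&lt;/p&gt;

```shell
# Sketch the upstream Ansible directory layout with a single role;
# "myrole" is an illustrative role name.
mkdir -p deploy/ansible/group_vars
mkdir -p deploy/ansible/host_vars
mkdir -p deploy/ansible/roles/myrole/tasks
mkdir -p deploy/ansible/roles/myrole/handlers
mkdir -p deploy/ansible/roles/myrole/templates
mkdir -p deploy/ansible/roles/myrole/defaults
touch deploy/ansible/playbook.yml

find deploy/ansible -type d | sort
```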

&lt;h2&gt;
  
  
  Docker Compose
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── deploy
    └── docker-compose
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;code&gt;docker-compose&lt;/code&gt; is a tool that is useful for spinning up a "production like" environment in a limited way in the local development environment.&lt;/p&gt;

&lt;p&gt;Its scope is limited to local development by design.&lt;/p&gt;
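&lt;p&gt;A minimal, illustrative &lt;code&gt;docker-compose.yml&lt;/code&gt; for such a local environment might look like the following; the service and image names are assumptions, not part of the standard:&lt;/p&gt;

```shell
# Write a minimal, hypothetical docker-compose.yml for local
# development; the service name, image and paths are illustrative.
mkdir -p deploy/docker-compose/mnt/app
printf '%s\n' \
  'version: "3"' \
  'services:' \
  '  app:' \
  '    image: nginx:1.15' \
  '    ports:' \
  '      - "8080:80"' \
  '    volumes:' \
  '      - ./mnt/app:/var/www' \
  | tee deploy/docker-compose/docker-compose.yml
```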

&lt;h2&gt;
  
  
  Docs
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── docs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Project-specific documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Src
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── src
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;All files associated with the application.&lt;/p&gt;

&lt;p&gt;If the application is interpreted, this directory should instead be called "app".&lt;/p&gt;

&lt;h2&gt;
  
  
  Web
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── web
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The generated web application.&lt;/p&gt;

&lt;h1&gt;
  
  
  In Conclusion
&lt;/h1&gt;

&lt;p&gt;Our tools shape our conceptual model of a project. Keeping things consistent reduces the amount we need to investigate in each different project before we can start diagnosing issues or adding features, and adopting a single project layout keeps things as consistent as possible. The things included in a &lt;code&gt;git&lt;/code&gt; repository in littleman.co projects are all the things needed to deploy a project to users, or subsequently change that project’s behaviour, given consistent underlying infrastructure. The layout is fairly straightforward but subject to iteration, and has thus been &lt;a href="https://github.com/littlemanco/boilr-gitrepo"&gt;pushed to GitHub&lt;/a&gt;. Hopefully understanding how we structure projects will give you some guidance on how to structure your own, or invite questions as to whether your projects are currently structured to maximise clarity and consistency in your team.&lt;/p&gt;


</description>
      <category>git</category>
      <category>repository</category>
      <category>layout</category>
      <category>technicalreview</category>
    </item>
    <item>
      <title>Coming to grips with eBPF </title>
      <dc:creator>Andrew Howden</dc:creator>
      <pubDate>Sat, 23 Mar 2019 02:41:39 +0000</pubDate>
      <link>https://dev.to/andrewhowdencom/coming-to-grips-with-ebpf--1760</link>
      <guid>https://dev.to/andrewhowdencom/coming-to-grips-with-ebpf--1760</guid>
      <description>

&lt;p&gt;I have a fairly long history of using Linux for a number of purposes. After being assigned Linux as a development machine while working with the team at &lt;a href="https://www.fontis.com.au/"&gt;Fontis&lt;/a&gt;, a combination of curiosity and the need to urgently repair that machine as a result of curiosity-driven stick pokery meant that I learned a large amount of Linux trivia fairly quickly. I built further on this while helping set up &lt;a href="https://www.sitewards.com/"&gt;Sitewards&lt;/a&gt; infrastructure tooling; a much more heterogeneous set of computers and providers, but with a standard approach emerging built on Docker and Kubernetes.&lt;/p&gt;

&lt;p&gt;The sum total of this experience means I’ve been heavily motivated to invest more in the technologies associated with Linux. One of the more interesting technologies I’ve become peripherally aware of during this tuition is the "Extended Berkeley Packet Filter", or "eBPF".&lt;/p&gt;

&lt;p&gt;I was introduced to this technology by Brendan Gregg’s excellent videos on &lt;a href="https://www.youtube.com/watch?v=bj3qdEDbCD4"&gt;performance analysis with eBPF&lt;/a&gt;. This was somewhat experimentally useful, but required recent kernels and various other oddities that weren’t consistent across our infrastructure. However, in parallel there was some interesting discussion &lt;a href="https://twitter.com/jessfraz/status/897819764915142656"&gt;about another eBPF project — Cilium&lt;/a&gt;. This project provides the underlying networking for Kubernetes, but does so in a way that appears to provide additional security and visibility that other network plugins do not; naively &lt;a href="https://istio.io/"&gt;similar to Istio&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Very recently I’ve had the opportunity to help another team with some scaling issues on a large, bespoke Kubernetes cluster. This cluster had a large number of services, and these services were being updated slowly due to performance issues with their Calico &amp;amp; kube-proxy &lt;code&gt;iptables&lt;/code&gt; implementations. That particular issue was addressed another way, but it led to an investigation into Calico and subsequently eBPF network tooling.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is BPF?
&lt;/h1&gt;

&lt;p&gt;The original "Berkeley Packet Filter" was derived from a paper written by Steve McCanne and Van Jacobson in 1992 for the Berkeley Software Distribution. Its purpose was to allow efficient capture of packets from the kernel to userland by compiling a program that filtered out packets that should not be copied across. This was subsequently employed in utilities such as &lt;code&gt;tcpdump&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In 2011 Eric Dumazet considerably improved the performance of this BPF filter by adding a Just In Time (JIT) compiler that compiled the BPF bytecode into an optimized instruction sequence. Later, in 2014, Alexei Starovoitov capitalised on this performant virtual machine to expose kernel tracing information more efficiently than would otherwise be possible, extending BPF beyond its initial packet filtering purpose. Jonathan Corbet noted and published this on LWN, hinting that eventually BPF programs may not only be used internally in the kernel but compiled in userland and loaded into the kernel. Later that same year Alexei started work on the &lt;code&gt;bpf()&lt;/code&gt; syscall, and the current notion of eBPF was kicked off.&lt;/p&gt;

&lt;p&gt;eBPF is now an extension of the BPF tooling, converted into a more general purpose virtual machine and used for roles well beyond its initial packet filtering purpose. It is a quirk of history that it is still referred to as the Berkeley Packet Filter, but the name has now stuck.&lt;/p&gt;

&lt;p&gt;Because eBPF is an extension of the original specification it is generally simply referred to as BPF. The older language is transpiled in the kernel to the newer eBPF before it is compiled, so the only BPF that runs in the kernel is the newer eBPF.&lt;/p&gt;

&lt;h1&gt;
  
  
  How does BPF work?
&lt;/h1&gt;

&lt;p&gt;A BPF program is a sequence of 64-bit instructions. These instructions are generally generated by an intermediary such as &lt;code&gt;tcpdump&lt;/code&gt; (libpcap):&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# See https://blog.cloudflare.com/bpf-the-forgotten-bytecode/&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpdump &lt;span class="nt"&gt;-i&lt;/span&gt; wlp2s0 &lt;span class="s1"&gt;'ip and tcp'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;000&lt;span class="o"&gt;)&lt;/span&gt; ldh      &lt;span class="o"&gt;[&lt;/span&gt;12]                           &lt;span class="c"&gt;# Load a half-word (2 bytes) from the packet at offset 12.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;001&lt;span class="o"&gt;)&lt;/span&gt; jeq      &lt;span class="c"&gt;#0x800           jt 2    jf 5  # Check if the value is 0x0800, otherwise fail.&lt;/span&gt;
                                              &lt;span class="c"&gt;# This checks for the IP packet on top of an Ethernet frame.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;002&lt;span class="o"&gt;)&lt;/span&gt; ldb      &lt;span class="o"&gt;[&lt;/span&gt;23]                           &lt;span class="c"&gt;# Load byte from a packet at offset 23.&lt;/span&gt;
                                              &lt;span class="c"&gt;# That's the "protocol" field 9 bytes within an IP frame.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;003&lt;span class="o"&gt;)&lt;/span&gt; jeq      &lt;span class="c"&gt;#0x6             jt 4    jf 5  # Check if the value is 0x6, which is the TCP protocol number,&lt;/span&gt;
                                              &lt;span class="c"&gt;# otherwise fail.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;004&lt;span class="o"&gt;)&lt;/span&gt; ret      &lt;span class="c"&gt;#262144                        # Return fail&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;005&lt;span class="o"&gt;)&lt;/span&gt; ret      &lt;span class="c"&gt;#0                             # Return success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;But can also be written in a &lt;a href="https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#bpf-c"&gt;limited subset of C&lt;/a&gt; and compiled.&lt;/p&gt;

&lt;p&gt;BPF programs have a certain set of guarantees enforced by a kernel verifier that make BPF programs safe to run in kernel land without risk of locking up or otherwise breaking the kernel. The verifier ensures that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The program does not loop&lt;/li&gt;
&lt;li&gt;There are no unreachable instructions&lt;/li&gt;
&lt;li&gt;Every register and stack state are valid&lt;/li&gt;
&lt;li&gt;Registers with uninitialized content are not read&lt;/li&gt;
&lt;li&gt;The program only accesses structures appropriate for its BPF program type&lt;/li&gt;
&lt;li&gt;(Optionally) pointer arithmetic is prevented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The BCC tools repository contains a set of tools wrapping BPF programs that can do useful things. We can use one of those programs (&lt;code&gt;dns_matching.py&lt;/code&gt;) to demonstrate how &lt;code&gt;BPF&lt;/code&gt; is able to instrument the network:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;git clone https://github.com/iovisor/bcc.git
Cloning into &lt;span class="s1"&gt;'bcc'&lt;/span&gt;...
Receiving objects: 100% &lt;span class="o"&gt;(&lt;/span&gt;17648/17648&lt;span class="o"&gt;)&lt;/span&gt;, 8.42 MiB | 1.21 MiB/s, &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
Resolving deltas: 100% &lt;span class="o"&gt;(&lt;/span&gt;11460/11460&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Pick the DNS matching&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;bcc/examples/networking/dns_matching

&lt;span class="c"&gt;# Run it!&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./dns_matching.py  &lt;span class="nt"&gt;--domains&lt;/span&gt; fishfingers.io

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./dns_matching.py  &lt;span class="nt"&gt;--domains&lt;/span&gt; fishfingers.io
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; Adding map entry:  fishfingers.io

Try to lookup some domain names using nslookup from another terminal.
For example:  nslookup foo.bar

BPF program will filter-in DNS packets which match with map entries.
Packets received by user space program will be printed here

Hit Ctrl+C to end...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In another window we can run:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;dig fishfingers.io
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Which will show in our first window:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;Hit Ctrl+C to end...

&lt;span class="o"&gt;[&lt;/span&gt;&amp;lt;DNS Question: &lt;span class="s1"&gt;'fishfingers.io.'&lt;/span&gt; &lt;span class="nv"&gt;qtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;A &lt;span class="nv"&gt;qclass&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;IN&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The domain is nonsense, but the question is still posed. Looking at the &lt;a href="https://github.com/iovisor/bcc/blob/master/examples/networking/dns_matching/dns_matching.c"&gt;source file&lt;/a&gt; we can see the eBPF program written in C that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Checks the type of Ethernet frame&lt;/li&gt;
&lt;li&gt; Checks to see if it’s UDP&lt;/li&gt;
&lt;li&gt; Checks to see if it’s on port 53&lt;/li&gt;
&lt;li&gt; Checks if the DNS name supplied is within the payload&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it! Our eBPF program has successfully run in the kernel, and matching packets have been copied out to the userland Python program where they’re subsequently printed.&lt;/p&gt;

&lt;p&gt;While this example was associated with the network kernel subsystem (BPF_PROG_TYPE_SOCKET_FILTER), there are a whole series of kernel entry points that can execute these eBPF programs. At the time of writing there are a total of 22 program types; unfortunately, they are currently poorly documented.&lt;/p&gt;

&lt;h1&gt;
  
  
  eBPF in the wild
&lt;/h1&gt;

&lt;p&gt;To understand where eBPF sits in the infrastructure ecosystem it’s worth looking at where other companies have chosen to use it over other, more conventional ways of solving the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Firewall
&lt;/h2&gt;

&lt;p&gt;The de facto implementation for a Linux firewall uses &lt;code&gt;iptables&lt;/code&gt; as its underlying enforcement mechanism. &lt;code&gt;iptables&lt;/code&gt; allows configuring a set of netfilter tables that manipulates packets in a number of ways. For example, the following rule drops all connections from the IP address 10.10.10.10:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;iptables &lt;span class="nt"&gt;-A&lt;/span&gt; INPUT &lt;span class="nt"&gt;-s&lt;/span&gt; 10.10.10.10/32 &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;code&gt;iptables&lt;/code&gt; can be used for a number of packet manipulation tasks such as Network Address Translation (NAT) or packet forwarding. However &lt;code&gt;iptables&lt;/code&gt; runs into a couple of significant problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;code&gt;iptables&lt;/code&gt; &lt;a href="https://cilium.io/blog/2018/11/20/fb-bpf-firewall/"&gt;rules are matched sequentially&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;code&gt;iptables&lt;/code&gt; updates must be made by &lt;a href="https://www.youtube.com/watch?v=4-pawkiazEg"&gt;recreating and updating all rules in a single transaction&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These two properties mean that under large, diverse traffic conditions (such as those experienced by any sufficiently large service — Facebook) or in a system that has a large number of changes to &lt;code&gt;iptables&lt;/code&gt; rules there will be an unacceptable performance overhead to running &lt;code&gt;iptables&lt;/code&gt; which can either degrade or take offline an entire service.&lt;/p&gt;

&lt;p&gt;There are already improvements to this subsystem in the Linux kernel by way of &lt;code&gt;nftables&lt;/code&gt;. This system is designed to improve &lt;code&gt;iptables&lt;/code&gt; and is architecturally similar to BPF in that it implements a virtual machine in the kernel. &lt;code&gt;nftables&lt;/code&gt; is a little older and better supported in existing Linux distributions, and in the testing distributions has even begun to entirely replace &lt;code&gt;iptables&lt;/code&gt;. However with the advent and optimizations of BPF &lt;code&gt;nftables&lt;/code&gt; is perhaps a technology less worth investing in.&lt;/p&gt;

&lt;p&gt;That leaves us with BPF. BPF has a couple of unique advantages over &lt;code&gt;iptables&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s implemented as an instruction set in a virtual machine, and can be heavily optimized&lt;/li&gt;
&lt;li&gt;It &lt;a href="https://cilium.io/blog/2018/11/20/fb-bpf-firewall/"&gt;is matched against the "closest" rule&lt;/a&gt;, rather than by iterating over the entire rule set.&lt;/li&gt;
&lt;li&gt;It can introspect specific packet data when making decisions as to whether to drop&lt;/li&gt;
&lt;li&gt;It can be compiled and run in the Linux "Express Data Path" (or XDP); the earliest possible point to interact with network traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These advantages can yield some staggering performance benefits. In CloudFlare’s (artificial) tests BPF with XDP was approximately &lt;a href="https://blog.cloudflare.com/how-to-drop-10-million-packets/"&gt;5x better at dropping packets&lt;/a&gt; than the next best solution (tc). Facebook saw &lt;a href="https://cilium.io/blog/2018/11/20/fb-bpf-firewall/"&gt;a much more predictable CPU usage&lt;/a&gt; with the use of BPF filtering.&lt;/p&gt;

&lt;p&gt;In addition to the performance benefits some applications use BPF in combination with userland proxies (such as Envoy) to &lt;a href="http://docs.cilium.io/en/stable/policy/language/#layer-7-examples"&gt;allow or deny the application protocols HTTP, gRPC, DNS or Kafka&lt;/a&gt;. This sort of application specific filtering is only otherwise seen in service meshes, such as Istio or Linkerd which incur more of a performance penalty than the BPF based solution.&lt;/p&gt;

&lt;p&gt;So, packet filtering based on BPF is both more flexible and (with XDP) more efficient than the existing &lt;code&gt;iptables&lt;/code&gt; solution. While &lt;code&gt;tc&lt;/code&gt; and &lt;code&gt;nftables&lt;/code&gt; may provide similar performance now or in the future, BPF’s combination of a large set of use cases and efficiency means it’s perhaps a better place to invest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kernel tracing &amp;amp; instrumentation
&lt;/h2&gt;

&lt;p&gt;After running Linux in production for some period of time we invariably run into issues. In the past I’ve had to debug:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;iptables&lt;/code&gt; performance problems&lt;/li&gt;
&lt;li&gt;Workload CPU performance&lt;/li&gt;
&lt;li&gt;Software not loading configuration&lt;/li&gt;
&lt;li&gt;Software becoming stalled&lt;/li&gt;
&lt;li&gt;Systems being "slow" for no apparent reason&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In those cases we need to dig further into what’s happening between kernel land and userland, and poke at why the system is behaving the way it is.&lt;/p&gt;

&lt;p&gt;There are an abundance of tools for this task. Brendan Gregg has an &lt;a href="http://www.brendangregg.com/Perf/linux_perf_tools_full.svg"&gt;excellent image showing the many tools and what they’re useful for&lt;/a&gt;. Of those, I’m familiar with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;strace&lt;/code&gt; / &lt;code&gt;ltrace&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;top&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sysdig&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;iotop&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;df&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;perf&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools each have their own tradeoffs, and an in-depth analysis of them is beyond the scope of this article. However, the most useful tool is perhaps &lt;code&gt;strace&lt;/code&gt;. &lt;code&gt;strace&lt;/code&gt; provides visibility into which system calls (calls to the Linux kernel) a process is using. The following example shows what file-related system calls the command &lt;code&gt;cat /tmp/foo&lt;/code&gt; makes:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;strace &lt;span class="nt"&gt;-e&lt;/span&gt; file &lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/foo
execve&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"/bin/cat"&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"cat"&lt;/span&gt;, &lt;span class="s2"&gt;"/tmp/foo"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, 0x7fffc2c8c308 /&lt;span class="k"&gt;*&lt;/span&gt; 56 vars &lt;span class="k"&gt;*&lt;/span&gt;/&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 0
access&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"/etc/ld.so.preload"&lt;/span&gt;, R_OK&lt;span class="o"&gt;)&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; 0
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/etc/ld.so.preload"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/lib/x86_64-linux-gnu/libsnoopy.so"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/etc/ld.so.cache"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/lib/x86_64-linux-gnu/libc.so.6"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/lib/x86_64-linux-gnu/libpthread.so.0"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/lib/x86_64-linux-gnu/libdl.so.2"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/usr/lib/locale/locale-archive"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/tmp/foo"&lt;/span&gt;, O_RDONLY&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; 3
hi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This allows us to debug a range of issues, including configuration not being read, what a process is sending or receiving over a network, and what processes a given process spawns. However, it comes at a cost: &lt;code&gt;strace&lt;/code&gt; will &lt;a href="http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html"&gt;significantly slow down that process&lt;/a&gt;. Suddenly introducing large latency into a system will annoy users, and can cause requests to block and stack up, eventually breaking the service. Accordingly, it needs to be used with caution.&lt;/p&gt;

&lt;p&gt;However, a much more efficient way to trace these system calls is with BPF. This is made easy with &lt;a href="https://github.com/iovisor/bcc"&gt;the &lt;code&gt;bcc&lt;/code&gt; tools git repository&lt;/a&gt;; specifically, the &lt;code&gt;trace.py&lt;/code&gt; tool. The tool has a slightly different interface from &lt;code&gt;strace&lt;/code&gt;, perhaps because BPF programs are compiled and executed based on events in the kernel rather than interrupting a process at the kernel interface. However, the behaviour above can be replicated as follows:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt; &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./trace.py &lt;span class="s1"&gt;'do_sys_open "%s", arg2'&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'cat'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And then in another window:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/foo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The first window will then yield:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /etc/ld.so.preload
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /lib/x86_64-linux-gnu/libsnoopy.so
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /etc/ld.so.cache
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /lib/x86_64-linux-gnu/libc.so.6
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /lib/x86_64-linux-gnu/libpthread.so.0
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /lib/x86_64-linux-gnu/libdl.so.2
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /usr/lib/locale/locale-archive
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /tmp/foo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This fairly accurately replicates the functionality of &lt;code&gt;strace&lt;/code&gt;: each of the files listed earlier appears in the &lt;code&gt;trace.py&lt;/code&gt; output just as it did in the &lt;code&gt;strace&lt;/code&gt; output.&lt;/p&gt;

&lt;p&gt;BPF is not limited to &lt;code&gt;strace&lt;/code&gt;-like tools. It can be used to introspect a whole series of both user and kernel level problems, and has been packaged into user friendly tools &lt;a href="https://github.com/iovisor/bcc"&gt;in the BCC repository&lt;/a&gt;. Additionally, BPF &lt;a href="https://sysdig.com/blog/sysdig-and-falco-now-powered-by-ebpf/"&gt;now powers Sysdig&lt;/a&gt;, the tool used for spelunking into a machine to determine its behaviour by analysing system calls. There is even some work to export the result of BPF programs &lt;a href="https://github.com/cloudflare/ebpf_exporter"&gt;in the Prometheus format&lt;/a&gt; for aggregation as time series data.&lt;/p&gt;

&lt;p&gt;Because of its high performance, flexibility and good support in recent Linux kernels, BPF forms the foundation of a new set of more flexible, performant systems introspection tools. BPF is also simpler than the kernel hacking that would otherwise be required for this sort of introspection, which may democratize the design of such tools and lead to more innovation in this area.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network visibility
&lt;/h2&gt;

&lt;p&gt;Given the history of BPF in packet filtering, a logical next step is collecting statistics from the network for later analysis.&lt;/p&gt;

&lt;p&gt;There are already a number of network statistics exposed via the &lt;code&gt;/proc&lt;/code&gt; and &lt;code&gt;/sys&lt;/code&gt; filesystems that can be read with little overhead. The &lt;a href="https://github.com/prometheus/node_exporter"&gt;Prometheus "node exporter"&lt;/a&gt; reads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/proc/sys/net/netfilter/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/ip_vs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/ip_vs_stats&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/sys/class/net/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/netstat&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/sockstat&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/tcp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/tcp6&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
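
&lt;p&gt;These files can be read ad hoc with standard tools and negligible overhead. A minimal sketch, assuming a Linux host:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Socket counts as tracked by the kernel
grep '^sockets:' /proc/net/sockstat

# Number of TCP sockets the kernel currently tracks
# (the first line of /proc/net/tcp is a header)
tail -n +2 /proc/net/tcp | wc -l
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;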

&lt;p&gt;However, as much as this exposes, there are still properties of connections that can’t be read directly from &lt;code&gt;/proc&lt;/code&gt; or via the set of CLI tools that also read from it (&lt;code&gt;ss&lt;/code&gt;, &lt;code&gt;netstat&lt;/code&gt; etc.). One such case was discussed by &lt;a href="https://twitter.com/b0rk/status/765666624968003584"&gt;Julia Evans and Brendan Gregg on Twitter&lt;/a&gt;: statistics on the lengths of TCP connections on a given port.&lt;/p&gt;

&lt;p&gt;This is useful for debugging what a system is connected to, and how long it spends in that connection. We can in turn use this to determine who our machine is talking to, and whether it’s getting stuck on any given connection.&lt;/p&gt;

&lt;p&gt;Brendan Gregg has a post &lt;a href="http://www.brendangregg.com/blog/2016-11-30/linux-bcc-tcplife.html"&gt;that describes how this is implemented in detail&lt;/a&gt;, but to summarise: it listens for &lt;code&gt;tcp_set_state()&lt;/code&gt; and queries the properties of the connection from &lt;code&gt;struct tcp_info&lt;/code&gt;. There are various limitations to this approach, but it seems to work pretty well.&lt;/p&gt;

&lt;p&gt;The result has been committed to the bcc repository and looks like:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Trace remote port 443&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./tcplife.py &lt;span class="nt"&gt;-D&lt;/span&gt; 443
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then, in another window:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl https://www.andrewhowden.com/ &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The first window then shows:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;PID   COMM       LADDR           LPORT RADDR           RPORT TX_KB RX_KB MS
7362  curl       10.1.1.247      43074 34.76.108.124   443       0    16 3369.32
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This indicates that a process with ID 7362 connected to 34.76.108.124 over port 443, and took 3369.32ms to complete its transfer (Australian internet is a bit slow in some areas).&lt;/p&gt;

&lt;p&gt;These kinds of ad-hoc debugging statistics are essentially impossible to gather any other way. Additionally, it should be possible (if desired) to express these statistics in such a way that the Prometheus exporter can load and export them for collection, making the network essentially arbitrarily introspectable.&lt;/p&gt;

&lt;h1&gt;
  
  
  Using BPF
&lt;/h1&gt;

&lt;p&gt;Given the above, BPF seems like a compelling technology worth investing time in learning. However, there are some difficulties in getting BPF to work properly:&lt;/p&gt;

&lt;h2&gt;
  
  
  BPF is only in "recent" kernels
&lt;/h2&gt;

&lt;p&gt;BPF is an area undergoing rapid development in the Linux kernel. Features may not be complete, or may not be present at all; tools may not work as expected, and their failure conditions are not well documented. If the kernels used in production are fairly modern then BPF may provide considerable utility. If not, it’s perhaps worth waiting until development in this area slows down and an LTS kernel with good BPF compatibility is released.&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s hard to debug
&lt;/h2&gt;

&lt;p&gt;BPF is fairly opaque at the moment. While there are bits of documentation here and there and one can go and read the kernel source, it’s not as easy to debug as (for example) &lt;code&gt;iptables&lt;/code&gt; or other system tools. It may be difficult to debug network issues created by improperly constructed &lt;code&gt;bpf&lt;/code&gt; programs. The advice here is the same as for other new or bespoke technologies: ensure that multiple team members understand and can debug it, and if they can’t, or those people are not available, pick another technology.&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s an implementation detail
&lt;/h2&gt;

&lt;p&gt;My suspicion is that the vast majority of our interaction with BPF will not be with programs of our own design. BPF is useful in the design of analysis tools, but writing it directly is perhaps too large a burden to place on the shoulders of systems administrators. Accordingly, to start reaping the benefits of BPF it’s worth instead investing in tools that use this technology. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cilium&lt;/li&gt;
&lt;li&gt;BCC Tools&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bpftrace&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Sysdig&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More tools will arrive in the future, though those are the only ones I would currently invest in.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;BPF is an old technology that has had new life breathed into it with the extended instruction set, the implementation of a JIT and the ability to execute BPF at various points in the Linux kernel. It provides a way to export information about, or modify, Linux kernel behaviour at runtime without needing to reboot or reload the kernel, including for transient systems introspection. BPF probably has its most immediate ramifications in network performance, as networks need to handle a truly bizarre level of both traffic and complexity, and BPF provides some concrete solutions to these problems. Accordingly, networking is a good place to start understanding BPF, particularly instead of investing in &lt;code&gt;nftables&lt;/code&gt; or &lt;code&gt;iptables&lt;/code&gt;. BPF additionally provides some compelling insights into both system and network visibility that are otherwise difficult or impossible to achieve, though this area is somewhat more nascent than the network implementations.&lt;/p&gt;

&lt;p&gt;TL;DR: BPF is pretty damned cool.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.iovisor.org/"&gt;IOVisor project: a bunch of good eBPF and XDP reading and tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cilium.io/"&gt;API aware networking and security, powered by eBPF and XDP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cilium.io/blog/2018/04/17/why-is-the-kernel-community-replacing-iptables/"&gt;Why is the kernel community replacing IPTables with eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.redhat.com/blog/2018/12/06/achieving-high-performance-low-latency-networking-with-xdp-part-1/"&gt;Achieving high performance low latency networking with XDP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cilium.io/blog/2018/11/20/fb-bpf-firewall/"&gt;Inside Facebook’s eBPF Firewall&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cilium.io/en/v1.4/architecture/"&gt;Cilium Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blogs.igalia.com/dpino/2019/01/07/a-brief-introduction-to-xdp-and-ebpf/"&gt;A brief introduction to XDP and eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ferrisellis.com/posts/ebpf_past_present_future/"&gt;eBPF: Past, present and future&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/708087/"&gt;Debating the value of XDP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.redhat.com/blog/2018/12/03/network-debugging-with-ebpf/"&gt;Network debugging with eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/437981/"&gt;A JIT for packet filters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/324989/"&gt;NFTables: A new packet filtering engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cilium.io/en/v1.4/bpf/"&gt;eBPF and XDP: A reference guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blogs.igalia.com/dpino/2019/01/02/build-a-kernel/"&gt;How to build a kernel with XDP support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cilium.io/blog/2018/04/24/cilium-security-for-age-of-microservices/"&gt;Cilium: rethinking Linux networking and security in the age of Microservices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cilium.io/blog/2019/02/12/cilium-14/"&gt;Cilium 1.4 release notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/719850/"&gt;New approaches to network fast paths&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/740157/"&gt;A thorough introduction to eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sysdig.com/blog/sysdig-and-falco-now-powered-by-ebpf/"&gt;Sysdig: Now powered by eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html"&gt;Linux eBPF Superpowers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/437981/"&gt;BPF: the universal in-kernel virtual machine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.tcpdump.org/papers/bpf-usenix93.pdf"&gt;The BSD packet filter: A new architecture for User Level packet capture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/iovisor/bpf-docs/blob/master/eBPF.md"&gt;Unofficial eBPF spec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kernel.org/doc/Documentation/networking/filter.txt"&gt;Linux Socket Filtering aka Berkeley Packet Filter (BPF)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/bpf-the-forgotten-bytecode/"&gt;BPF: The forgotten bytecode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://man7.org/linux/man-pages/man8/tc-bpf.8.html"&gt;TC BPF man page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/l4drop-xdp-ebpf-based-ddos-mitigations/"&gt;L4Drop: XDP DDoS Mitigation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.howtogeek.com/177621/the-beginners-guide-to-iptables-the-linux-firewall/"&gt;The beginners guide to iptables and the Linux firewall&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Nftables"&gt;Wikipedia: NFTables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/how-to-drop-10-million-packets/"&gt;How to drop 10 million packets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/introducing-the-p0f-bpf-compiler/"&gt;Introducing the p0f compiler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/l4drop-xdp-ebpf-based-ddos-mitigations/"&gt;L4Drop: XDP DDoS mitigations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>ebpf</category>
      <category>linux</category>
      <category>networking</category>
      <category>deepdive</category>
    </item>
    <item>
      <title>Architecting a software system for malleability</title>
      <dc:creator>Andrew Howden</dc:creator>
      <pubDate>Wed, 20 Mar 2019 05:16:06 +0000</pubDate>
      <link>https://dev.to/andrewhowdencom/architecting-a-software-system-for-malleability-2p7</link>
      <guid>https://dev.to/andrewhowdencom/architecting-a-software-system-for-malleability-2p7</guid>
      <description>

&lt;p&gt;The past few years of software development have given me this one beautiful insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I can’t predict the future&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To illustrate this point on a personal level, it wasn’t even my plan to be a software developer. My undergraduate studies were in sports physiology, and the intention was to follow that up with sports medicine. However, through the various twists of fate inherent in life that was not to be, and I wound up helping build and ship eCommerce stores.&lt;/p&gt;

&lt;p&gt;The vagaries of life do not extend only to me, however. They’re an inherent part of life. The psychologist Dan Gilbert says in his talk “The psychology of your future self”:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We asked people how much they expected to change over the next 10 years, and also how much they had changed over the last 10 years, and what we found, well, … people underestimate how much their personalities will change in the next decade.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, I didn’t know I would be here 6 years ago, and based on Gilbert’s assessment we don’t know who we’ll be in another 10 years. It follows that it’s extremely difficult to see where our technology should develop over the next 10 years. We can get a sense of the rate of change by reviewing the last 10 years:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Apple App Store, Chrome, Android and Bitcoin were released in 2008&lt;/li&gt;
&lt;li&gt;Maps with GPS navigation reached Android in 2009&lt;/li&gt;
&lt;li&gt;Both the iPad and car2go (short term, instant car rentals) were released in 2010&lt;/li&gt;
&lt;li&gt;Google+ was launched and Adobe begun to sunset Flash in 2011&lt;/li&gt;
&lt;li&gt;Windows 8, 4k TV, Windows phone, Curiosity on Mars, Google Glass, 802.11ac and Space X flying to the ISS were all in 2012&lt;/li&gt;
&lt;li&gt;Oculus Rift, the Smart Watch and Touch ID landed in 2013&lt;/li&gt;
&lt;li&gt;Self driving cars began to emerge in 2014&lt;/li&gt;
&lt;li&gt;Apple Pay and Project Loon were released in 2015&lt;/li&gt;
&lt;li&gt;IOT began to appear in earnest in 2016&lt;/li&gt;
&lt;li&gt;Self driving trucks, reinforcement learning (AlphaGo) and the smart speaker made great strides in 2017&lt;/li&gt;
&lt;li&gt;Cheap neural networks (Tensor Flow) executing on phones, bluetooth headphones that automatically translate and the GDPR were part of 2018&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which brings us to our current year of 2019. Each of those technologies had an impact on the market, shifting the balance of power in various industries dramatically and providing new opportunities for those lucky enough to find the talent, capital and drive to take advantage of them.&lt;/p&gt;

&lt;p&gt;The lesson to draw from these changes is that the world changes at a far higher rate than one would naively imagine. When designing our software systems we should factor in this high rate of change, so that we do not drive ourselves or our project partners into financial ruin attempting to innovate around the next business offering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Platforms that have successfully adapted
&lt;/h2&gt;

&lt;p&gt;It stands to reason that if we want to design our own software to be maximally adaptable, we should look at others that have successfully adapted to industry changes in the past.&lt;/p&gt;

&lt;p&gt;Apple, Cisco and Intel are all hardware (in addition to software) companies, so for our purposes we’ll dismiss them as targets. Google, Microsoft, Facebook and Adobe are all primarily software companies, however, so they can serve as good lessons in how to build systems that are well structured over time. Google and Facebook are famously “internet” heavy companies, but both Adobe and Microsoft have pivoted in recent years to be much more internet driven. Microsoft have famously stated Windows 10 will be their last version of Windows, and Adobe is making significant moves into internet driven business with “experience cloud”.&lt;/p&gt;

&lt;p&gt;So, these companies are moving towards software that is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delivered primarily via the internet&lt;/li&gt;
&lt;li&gt;Developed and delivered to users in increments, and adapted based on user feedback&lt;/li&gt;
&lt;li&gt;Sold via an “ongoing revenue” model, be that subscriptions or advertising&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing that these companies have in common is that their products are all designed around software that embraces continual change, in any arbitrary direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software design requirements
&lt;/h2&gt;

&lt;p&gt;To understand how to design software it’s first worth unpacking why we’re building software in the first place. Generally speaking, I build software to make a computer solve a problem in a reliable way, and to derive some sort of useful work out of it. Programs can be as simple as:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Get a list of unique commands run on this machine
$ cat /var/log/auth.log | cut -d':' -f11 | sort | uniq
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To Magento 1’s behemoth 1.7 million lines of code:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sloccount clean-magento-ee
Total Physical Source Lines of Code (SLOC) = 1,730,997
Development Effort Estimate 502.62 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Regardless, software programs exist for some human purpose; to take some human input and return some human output (at some point or other).&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing for reasonability
&lt;/h3&gt;

&lt;p&gt;Software’s utility is a function of its predictability; of our understanding of how we can use it to accomplish work. Perhaps the best example of this is the Unix utility cat:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat foo
bar
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This program takes the contents of the file “foo” and prints them to screen, showing “bar”. The particularly remarkable part about cat is not this behaviour, but rather:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That it was initially designed in 1971&lt;/li&gt;
&lt;li&gt;That it hasn’t changed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is the very essence of a predictable program. There is a whole swathe of Unix programs that follow this trend, enunciated by Peter H. Salus as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Write programs that do one thing and do it well.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The wisdom of this minimalistic approach is difficult to overstate. Programs that are easily predictable and follow a “standard” approach have some distinct advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The time to understand and fit them in to our architecture is minimal&lt;/li&gt;
&lt;li&gt;Their potential use cases are large&lt;/li&gt;
&lt;li&gt;Their interoperability with other systems is large&lt;/li&gt;
&lt;/ul&gt;
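
&lt;p&gt;That interoperability is easy to demonstrate: small, predictable programs compose into new tools with a single pipeline. As a minimal sketch (the input here is hypothetical), counting how often each line appears:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# sort groups duplicate lines together, uniq -c counts each group
# and sort -rn orders the groups by frequency
printf 'foo\nbar\nfoo\n' | sort | uniq -c | sort -rn
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;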

&lt;p&gt;Additionally, keeping the feature set limited makes the software much simpler to maintain, especially while retaining knowledge of the use cases it serves: both those initially designed for and those accrued over time.&lt;/p&gt;

&lt;p&gt;This dramatically reduces both the cost of maintaining one piece of software, and the likelihood that this particular piece of software will change over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing for interrogability
&lt;/h3&gt;

&lt;p&gt;Generally speaking we do not design software just for ourselves, but additionally to solve problems on behalf of others (usually for some monetary compensation). This creates a disconnect between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How we understand the problem, and design the software to be used&lt;/li&gt;
&lt;li&gt;How the software is actually used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;John Allspaw refers to this as “above/below the line”, in which each user, developer and other stakeholder has a different conceptual model of how the software “works”. That model is only grounded in “reality” by interrogating the software to ensure it’s actually functioning as initially designed. To make design decisions as to how the software should be further reduced, restructured or replaced, we need to know how the software is being used.&lt;/p&gt;

&lt;p&gt;We can start this process by interrogating cat. cat is written in C and runs on Unix. Unix (particularly Linux, in this example) exposes a whole set of tools for inspecting both cat and other applications, such as strace, ltrace and perf, with additional tools like sysdig. However, while these tools give us an extremely good idea of what the application is doing in specific invocations, they are cost prohibitive to run the entire time. Instead, we need to move to less granular tools. Unfortunately, this comes with a tradeoff: we need to guess ahead of time what we need to instrument.&lt;/p&gt;

&lt;p&gt;There are three broad ways of doing so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs&lt;/li&gt;
&lt;li&gt;Metrics&lt;/li&gt;
&lt;li&gt;Traces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without going too far into the detail, an application should be designed such that it exposes the detail required to understand how it’s working. This is useful both for understanding when the application is not working correctly and for understanding how it’s used under normal conditions.&lt;/p&gt;

&lt;p&gt;When choosing how to instrument an application, the most useful property is perhaps being able to ask questions of the software: to interrogate it. Logs are perhaps the simplest way to do this, allowing us to check internal program state at a later point when an issue is reported. But time series data is a very close second, and allows querying for application behaviour over time. This allows making judgements about how people are using the app, rather than just snapshotting application internal state. The Prometheus documentation explains how to instrument an application to maximise its interrogability.&lt;/p&gt;
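
&lt;p&gt;As a trivial sketch of that “ask questions later” property (the log format here is hypothetical), a question such as “how many errors has the application seen?” can be answered from logs long after the fact:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Count log lines recording an error; prints 2 for this input
printf 'level=info msg=ok\nlevel=error msg=boom\nlevel=error msg=bust\n' | grep -c 'level=error'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;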

&lt;p&gt;By understanding how it’s used we can modify our program to make those use cases easier or more efficient. We can additionally drop functionality that goes unused over time, maintaining program simplicity and reducing the cost of maintenance and risk.&lt;/p&gt;

&lt;p&gt;As software is used more frequently it will be better understood by its users. That is also where software engineers should invest the most time: ensuring the software is designed in such a way that it is easy for users to understand and reason about. Designing for simplicity will further increase uptake, forming a virtuous cycle until an “optimal simplicity” level is reached.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design with a focus on solving the user’s problem
&lt;/h3&gt;

&lt;p&gt;The process of shipping software is a complex one, involving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business process modelling&lt;/li&gt;
&lt;li&gt;UX Design&lt;/li&gt;
&lt;li&gt;General architecture design&lt;/li&gt;
&lt;li&gt;Software component design&lt;/li&gt;
&lt;li&gt;Software infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these disciplines is a complex one that involves a staggering amount of research, discipline and effort over time. Accordingly it’s more likely than not that each component will have specialists, each of whom seek to do the best&lt;br&gt;
job they possibly can.&lt;/p&gt;

&lt;p&gt;It’s important while designing and implementing this system that the goal remains solving a user’s problem. One can get lost in the minutiae of one’s own discipline, creating a work of art in its own right — at the expense of the system as a whole, and the user with their problem.&lt;/p&gt;

&lt;p&gt;To solve the user’s problem each stakeholder needs to subordinate their own ideal solution in favour of one that favours the customer’s happiness. To retain this focus, the design needs to put the customer at the forefront of all decisions, with each decision justified by how it helps the customer solve their problem.&lt;/p&gt;

&lt;p&gt;By doing so, while each component of the system may be even more complex or less elegant for those who have built it, the vast majority of users will experience a simpler, easier to understand system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing unsurprising software
&lt;/h3&gt;

&lt;p&gt;Software that is “surprising” is software that is unpredictable. Unpredictable software is harder for users to make use of, in turn driving usage of the application in unpredictable ways. This unpredictable usage means either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A high amount of refactoring to make the unusual mechanism the standard use case&lt;/li&gt;
&lt;li&gt;A high amount of refactoring to shift users to the standard model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Regardless, quite a bit needs to be changed. Accordingly, the goal while developing software should be for it to be the “least surprising”. This is captured as the “principle of least astonishment”:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“People are part of the system. The design should match the user’s experience, expectations, and mental models.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unfortunately, what users find surprising is context specific. While users of an alarm clock might expect that once they turn off an alarm it goes away until the next occurrence, they might expect hospital monitors to switch alarms back on by themselves after a period of time. Accordingly, designing software that does “what the user expects” requires an in-depth understanding of that user, and the context in which they’re using the software.&lt;/p&gt;

&lt;p&gt;That understanding is surprisingly hard to come by; the study of software development is complex enough that it precludes deep knowledge of other fields. However, one can take two strategies to help design software in an unsurprising way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design software after an already established pattern. Design hospital software like other hospital software, and alarm clocks like other alarm clocks.&lt;/li&gt;
&lt;li&gt;Work closely with users, soliciting and integrating their feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even the most intractable problems can be made simpler and easier for users to understand with a deliberate design of software to match their conceptual models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing software on balance
&lt;/h3&gt;

&lt;p&gt;Given the above requirements perhaps the hardest thing to do is to strike a balance across them, and design the software for simplicity relative to each designer or consumer of that project.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cat&lt;/code&gt;, for example, may be simple to me as a developer but it is likely not simple for my grandmother.&lt;/p&gt;

&lt;p&gt;Each stakeholder has a different model of the software:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users model it in terms of the problems they’re trying to solve&lt;/li&gt;
&lt;li&gt;The UX team model and optimize for users’ usage of the application&lt;/li&gt;
&lt;li&gt;The business logic team attempt to model the user in the software&lt;/li&gt;
&lt;li&gt;The business owners model it in terms of a return on investment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it hard for the software architect to be able to make the software simple relative to all users. However, there are ways in which it’s possible to determine how to evolve the software to suit the stakeholders over time.&lt;/p&gt;

&lt;p&gt;As the software evolves and the stakeholders learn more about each other it will become clear that there are commonalities in how those users see the software. For example, in the case of an eCommerce store the user, UX, business&lt;br&gt;
logic and business owners all have approximately the same notion of what an “order” or “shipment” needs, though with varying degrees of detail.&lt;/p&gt;
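
&lt;p&gt;One way to keep those models aligned is to define a single shared domain object rather than one “view” per team. A minimal sketch in Python, where the fields are illustrative assumptions for an eCommerce store rather than a prescribed schema:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# One shared "Order" model that every stakeholder reasons about,
# rather than a separate representation per team. The fields are
# illustrative assumptions, not a prescribed schema.
@dataclass
class Order:
    order_id: str
    items: list = field(default_factory=list)  # what the user bought
    total_cents: int = 0                       # what the owner tracks (ROI)
    status: str = "pending"                    # what UX surfaces to the user

    def ship(self):
        # Business logic operates on the same model, keeping the
        # number of "views", and hence complexity, low.
        self.status = "shipped"

order = Order("A-1001", items=["16 ounce gloves"], total_cents=8999)
order.ship()
```

&lt;p&gt;Because each team reads and writes the same model, with varying degrees of detail, the shared vocabulary of “order” and “shipment” stays consistent across the business.&lt;/p&gt;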

&lt;p&gt;By writing the software to deliberately communicate its own nature to all stakeholders, writing supporting documentation to clearly explain the software where it is incapable of explaining itself, and minimising the number of “views” the software has, the software itself can remain simple, and all stakeholders share a similar mental model of it.&lt;/p&gt;

&lt;p&gt;Once these patterns are established, continue reusing them, reinforcing a consistent way of reasoning about the software.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding what we’re designing
&lt;/h2&gt;

&lt;p&gt;To understand what we’re designing, we first need to think in terms of the problem we’re solving.&lt;/p&gt;

&lt;h3&gt;
  
  
  Boxing
&lt;/h3&gt;

&lt;p&gt;In a past life I spent considerable time training to be a boxer (more specifically, a Thai boxer). Though it was only a hobby, it was an activity that I fundamentally enjoyed. It additionally necessitated the purchase of some equipment. To participate, I would need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1x 16 ounce boxing gloves&lt;/li&gt;
&lt;li&gt;2x mouth guards&lt;/li&gt;
&lt;li&gt;4x singlets, shorts &amp;amp; wraps&lt;/li&gt;
&lt;li&gt;1x groin guard&lt;/li&gt;
&lt;li&gt;1x shin guards (heavy)&lt;/li&gt;
&lt;li&gt;1x shin guards (light)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The software journey we’ll consider, then, is the one that hopes to connect me with the equipment I need to continue boxing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modelling the buying and usage journey
&lt;/h3&gt;

&lt;p&gt;For the above equipment there is little value in it being particularly well styled or otherwise distinctive — there is little fashion in the world of “boxing equipment”; they’re essentially commodity goods. Above all I would prize equipment that is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Functional&lt;/li&gt;
&lt;li&gt;Comfortable&lt;/li&gt;
&lt;li&gt;Long-lasting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a buyer of this equipment, I’m likely to undertake the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discover the need for this equipment as I join (or rejoin) a boxing gymnasium&lt;/li&gt;
&lt;li&gt;Discuss with my peers what a set of reliable equipment would be. If it’s available on site, I would likely simply purchase it there.&lt;/li&gt;
&lt;li&gt;Further research what equipment might be available, and look for reviews that help me determine what brand of equipment I would like&lt;/li&gt;
&lt;li&gt;Make the purchase of this equipment, and use it for a period while training&lt;/li&gt;
&lt;li&gt;Purchase either the same or new equipment once it had been worn beyond its utility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of those steps has some reflection in software, from joining the boxing club to evaluating the equipment for replacement after a period of use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing the software itself
&lt;/h2&gt;

&lt;p&gt;Given our understanding of the principles required to design resilient software, let’s try and help our boxer find the equipment they need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Launch and Iterate
&lt;/h3&gt;

&lt;p&gt;As we’ve established, we’re poor predictors of the future. So to understand our problem we need to start solving it.&lt;/p&gt;

&lt;p&gt;The simplest way the user’s buying journey can be modelled is as a cash transaction for equipment at the boxing gymnasium. This is a solution completely without software, but as a process it is a reasonably elegant one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s simple, and reuses existing primitives (cash, equipment)&lt;/li&gt;
&lt;li&gt;It’s extremely low cost and easy to implement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows us to start filling out our business process. Things like “where do we purchase our goods from” or “where do we store our goods” or “what do users want to know about our goods” all start to come up and need solving.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resolving solved problems
&lt;/h3&gt;

&lt;p&gt;In our scenario the boxing gym has been holding equipment, but is struggling to understand what equipment sells well, what sells badly and how much stock remains. In terms of our previously defined principles, the process is not interrogable.&lt;/p&gt;

&lt;p&gt;Here the use cases are fairly common, and there are already solutions that have largely solved these problems.&lt;/p&gt;

&lt;p&gt;Dropping in a solution that solves “enough” of the problem is usually a good next step. Things like VendHQ, Square, Xero can solve the vast majority of these needs, and where they’re not yet solved a human process can make up the difference.&lt;/p&gt;

&lt;p&gt;These solutions are perhaps not the most technically elegant. However, they’re already shaped by user demand and are thus the most conceptually simple to our user — they solve the user’s problem better than we’d be able to ourselves.&lt;/p&gt;

&lt;p&gt;Be careful about solutions that solve more than the problems that need to be solved now. It is harder to remove process than to add it, and unless there is demand for a feature it is likely redundant, increasing complexity for no discernible gain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building additional services
&lt;/h3&gt;

&lt;p&gt;Our boxing gymnasium is now successfully selling equipment to its members; however, the gym has only limited staff and does not have the time to explain the tradeoffs between the various pieces of equipment prior to the start of class.&lt;/p&gt;

&lt;p&gt;To address this, they need software that will allow them to list their services in some consumable format — the de facto implementation being on the internet.&lt;/p&gt;

&lt;p&gt;Depending on the software chosen previously it’s possible that our boxing gym can simply “switch on” an integration with Shopify or Magento that allows them to reuse their existing data. If so, this is the best solution: the gymnasium can continue to use their existing services, with limited additional learning required to list their services online.&lt;/p&gt;

&lt;p&gt;However, if such an integration is not available it is worth beginning to reevaluate the entire business stack so that a single solution can solve all problems. While this means a higher initial investment, it will mean a significantly lower investment in learning, diagnostics and any further development over essentially any timescale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing a unique service
&lt;/h3&gt;

&lt;p&gt;Our boxing gym has now grown and sells equipment both in its gymnasium and online. However, it would like to develop a new feature that doesn’t exist on the market — the ability to sell equipment directly from other gymnasiums.&lt;/p&gt;

&lt;p&gt;This requirement is so unique that no existing software can be used to model this particular requirement. Either existing software will have to be repurposed, or new software designed.&lt;/p&gt;

&lt;p&gt;Whether to repurpose existing software or design new software essentially depends on the total feature set required. If the business is well understood and the requirements are limited, designing new software offers some compelling benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The software can be designed to take advantage of business efficiencies&lt;/li&gt;
&lt;li&gt;The software is well known by the implementing team&lt;/li&gt;
&lt;li&gt;The software in absolute terms is not as complex&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, this comes with the significant risk of losing the implementing team. If that team disappears, a new team will need to relearn the entirety of the business. Accordingly, if the software is being contracted out, using a “standard” solution with minimal customisation buys insurance against relations with that contractor going sideways.&lt;/p&gt;

&lt;p&gt;For the purpose of this article we’ll assume that the development team is in house and has a vested interest in the success of the project.&lt;/p&gt;

&lt;p&gt;Perhaps the best thing to do is to rebuild the business logic entirely. This means losing many features that are inherent in commercial or open source software, but it also dramatically reduces the absolute complexity of the system, allowing much faster development targeted directly at the needs of the business.&lt;/p&gt;

&lt;p&gt;The result is software that is simpler, more targeted and more firmly under the control of the business — presuming the development team is capable of such software design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Downsides of malleable software
&lt;/h2&gt;

&lt;p&gt;Malleable software is exceedingly hard to design. There are some significant downsides to it:&lt;/p&gt;

&lt;h3&gt;
  
  
  Expensive
&lt;/h3&gt;

&lt;p&gt;As described in the example of the boxing gym owner, it was not economical to design software from scratch until the business requirement was such that no existing software could be easily ported to the business’s needs.&lt;/p&gt;

&lt;p&gt;Designing software from scratch is an extremely expensive exercise. Developers are a scarce resource, and developers who are driven by the results of the business are rarer still.&lt;/p&gt;

&lt;p&gt;It’s often a better balance to reuse existing primitives for services rather than take the leap for fully customized, malleable software. The more customized software is, the more expensive it is to maintain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difficult
&lt;/h3&gt;

&lt;p&gt;The process of understanding, designing and implementing software is an exceedingly difficult task. It requires an in-depth knowledge of the problem, the patience to put forward designs and rework them, and the ability to implement those designs in software.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long Term
&lt;/h3&gt;

&lt;p&gt;Software that is malleable does pay off, but only over a long period of time. The upfront investment is significant, and is better offset by incrementalism, shifting to a self-hosted solution only when there are no other options available.&lt;/p&gt;

&lt;p&gt;However, once the initial design of the solution has been completed and presuming upkeep is not cost prohibitive, a solution that is more malleable will open more business opportunity.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;p&gt;Designing software is a complex process, needing to balance the needs of all stakeholders while keeping true to the problem it intends to solve, over a long period of time and with many different hands.&lt;/p&gt;

&lt;p&gt;However, hopefully this article has provided some general background as to how software can be designed in such a way that it is more malleable, reducing the costs over the long term.&lt;/p&gt;


</description>
      <category>architecture</category>
      <category>deepdive</category>
    </item>
  </channel>
</rss>
