<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Andrew Howden</title>
    <description>The latest articles on DEV Community by Andrew Howden (@andrewhowdencom).</description>
    <link>https://dev.to/andrewhowdencom</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F68484%2F28dc1f63-5041-4548-b984-2fb1324668e6.jpeg</url>
      <title>DEV Community: Andrew Howden</title>
      <link>https://dev.to/andrewhowdencom</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/andrewhowdencom"/>
    <language>en</language>
    <item>
      <title>What is a container?</title>
      <dc:creator>Andrew Howden</dc:creator>
      <pubDate>Sun, 31 Mar 2019 11:58:48 +0000</pubDate>
      <link>https://dev.to/andrewhowdencom/what-is-a-container-464k</link>
      <guid>https://dev.to/andrewhowdencom/what-is-a-container-464k</guid>
      <description>

&lt;p&gt;Containers have recently become a common way of packaging, deploying and running software across a wide range of machines in all sorts of environments. Since the initial release of Docker in March 2013&lt;sup&gt;[1]&lt;/sup&gt;, containers have become ubiquitous in modern software deployment, with 71% of Fortune 100 companies running them in some capacity&lt;sup&gt;[2]&lt;/sup&gt;. Containers can be used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running user facing, production software&lt;/li&gt;
&lt;li&gt;Running a software development environment&lt;/li&gt;
&lt;li&gt;Compiling software with its dependencies in a sandbox&lt;/li&gt;
&lt;li&gt;Analysing the behaviour of software within a sandbox&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Like their namesake in the shipping industry, containers are designed to make it easy to "lift and shift" software to different environments and have that software execute in the same way across those environments.&lt;/p&gt;

&lt;p&gt;Containers have thus earned their place in the modern software development toolkit. However, to understand how container technology fits into modern software architecture, it's worth understanding how we arrived at containers, as well as how they work.&lt;/p&gt;

&lt;h1&gt;
  
  
  History
&lt;/h1&gt;

&lt;p&gt;The "birth" of containers was denoted by Bryan Cantrill as March 18th, 1982&lt;sup&gt;[3]&lt;/sup&gt; with the addition of the &lt;code&gt;chroot&lt;/code&gt; syscall in BSD. From the FreeBSD website&lt;sup&gt;[4]&lt;/sup&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;According to the SCCS logs, the chroot call was added by Bill Joy on March 18, 1982 approximately 1.5 years before 4.2BSD was released. That was well before we had ftp servers of any sort (ftp did not show up in the source tree until January 1983). My best guess as to its purpose was to allow Bill to chroot into the /4.2BSD build directory and build a system using only the files, include files, etc contained in that tree. That was the only use of chroot that I remember from the early days.&lt;/p&gt;

&lt;p&gt;—  Dr. Marshall Kirk McKusick &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;code&gt;chroot&lt;/code&gt; is used to put a process into a "changed root"; a new root filesystem that has limited or no access to the parent root filesystem. An extremely minimal &lt;code&gt;chroot&lt;/code&gt; can be created on Linux as follows&lt;sup&gt;[5]&lt;/sup&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Get a shell&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir &lt;/span&gt;bin
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;which sh&lt;span class="k"&gt;)&lt;/span&gt; bin/bash

&lt;span class="c"&gt;# Find shared libraries required for shell&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ldd bin/sh
    linux-vdso.so.1 &lt;span class="o"&gt;(&lt;/span&gt;0x00007ffe69784000&lt;span class="o"&gt;)&lt;/span&gt;
    /lib/x86_64-linux-gnu/libsnoopy.so &lt;span class="o"&gt;(&lt;/span&gt;0x00007f6cc4c33000&lt;span class="o"&gt;)&lt;/span&gt;
    libc.so.6 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libc.so.6 &lt;span class="o"&gt;(&lt;/span&gt;0x00007f6cc4a42000&lt;span class="o"&gt;)&lt;/span&gt;
    libpthread.so.0 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libpthread.so.0 &lt;span class="o"&gt;(&lt;/span&gt;0x00007f6cc4a21000&lt;span class="o"&gt;)&lt;/span&gt;
    libdl.so.2 &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; /lib/x86_64-linux-gnu/libdl.so.2 &lt;span class="o"&gt;(&lt;/span&gt;0x00007f6cc4a1c000&lt;span class="o"&gt;)&lt;/span&gt;
    /lib64/ld-linux-x86-64.so.2 &lt;span class="o"&gt;(&lt;/span&gt;0x00007f6cc4c66000&lt;span class="o"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Duplicate libraries into root&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; lib64 lib/x86_64-linux-gnu
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; /lib/x86_64-linux-gnu/libsnoopy.so &lt;span class="se"&gt;\&lt;/span&gt;
    /lib/x86_64-linux-gnu/libc.so.6 &lt;span class="se"&gt;\&lt;/span&gt;
    /lib/x86_64-linux-gnu/libpthread.so.0 &lt;span class="se"&gt;\&lt;/span&gt;
    /lib/x86_64-linux-gnu/libdl.so.2 &lt;span class="se"&gt;\&lt;/span&gt;
    lib/x86_64-linux-gnu/

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; /lib64/ld-linux-x86-64.so.2 lib64/

&lt;span class="c"&gt;# Change into that root&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo chroot&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Test the chroot&lt;/span&gt;
&lt;span class="c"&gt;# ls&lt;/span&gt;
/bin/bash: 1: &lt;span class="nb"&gt;ls&lt;/span&gt;: not found
&lt;span class="c"&gt;#&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There were problems with this early implementation of &lt;code&gt;chroot&lt;/code&gt;, such as being able to escape the &lt;code&gt;chroot&lt;/code&gt; by running &lt;code&gt;cd ..&lt;/code&gt;&lt;sup&gt;[3]&lt;/sup&gt;, but these were resolved in short order. Seeking to provide better security, FreeBSD extended the &lt;code&gt;chroot&lt;/code&gt; into the &lt;code&gt;jail&lt;/code&gt;&lt;sup&gt;[3][4]&lt;/sup&gt;, which allowed software that expected to run as &lt;code&gt;root&lt;/code&gt; to do so within a confined environment: &lt;code&gt;root&lt;/code&gt; within that environment, but not &lt;code&gt;root&lt;/code&gt; elsewhere on the system.&lt;/p&gt;

&lt;p&gt;This work was further built upon in the Solaris operating system to provide fuller isolation from the host&lt;sup&gt;[3][6]&lt;/sup&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User separation (similar to &lt;code&gt;jail&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Filesystem separation (similar to &lt;code&gt;chroot&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;A separate process space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This provided something close to the modern concept of containers: isolated processes running on a shared kernel. Later, similar work took place in the Linux kernel to isolate kernel structures on a per-process basis under "namespaces"&lt;sup&gt;[7]&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;However, in parallel, Amazon Web Services (AWS) launched its Elastic Compute Cloud (EC2) product, which took a different approach to separating workloads: virtualising &lt;em&gt;the entire hardware&lt;/em&gt;&lt;sup&gt;[3]&lt;/sup&gt;. This comes with different tradeoffs: it limits exploitation of the host kernel or of the isolation implementation, but running an additional operating system and hypervisor means a far less efficient use of resources.&lt;/p&gt;

&lt;p&gt;Virtualisation continued to dominate workload isolation until the company dotCloud (now Docker), then operating a "platform as a service" (PaaS) offering, open sourced the software it used to run that PaaS. With that software and a large amount of luck, containers proliferated rapidly and Docker became the powerhouse it is now.&lt;/p&gt;

&lt;p&gt;Shortly after Docker released their container runtime they started expanding their product offerings into build, orchestration and server management tooling&lt;sup&gt;[8]&lt;/sup&gt;. Unhappy with this, CoreOS created their own container runtime, &lt;code&gt;rkt&lt;/code&gt;, which had the stated goal of interoperating with existing services such as &lt;code&gt;systemd&lt;/code&gt;, following &lt;a href="https://en.wikipedia.org/wiki/Unix_philosophy"&gt;the Unix philosophy&lt;/a&gt; of "Write programs that do one thing and do it well&lt;sup&gt;[9]&lt;/sup&gt;."&lt;/p&gt;

&lt;p&gt;To reconcile these disparate definitions of a container, the Open Container Initiative was established&lt;sup&gt;[10]&lt;/sup&gt;, after which Docker donated its image schema and its runtime as what amounted to a de facto container standard.&lt;/p&gt;

&lt;p&gt;There are now a number of container implementations, as well as a number of standards to define their behaviour.&lt;/p&gt;

&lt;h1&gt;
  
  
  Definition
&lt;/h1&gt;

&lt;p&gt;It might be surprising to learn that a "container" is not a real thing — rather, it is a specification. At the time of writing this specification has implementations on&lt;sup&gt;[11]&lt;/sup&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux&lt;/li&gt;
&lt;li&gt;Windows&lt;/li&gt;
&lt;li&gt;Solaris&lt;/li&gt;
&lt;li&gt;Virtual Machines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In turn, containers are expected to be&lt;sup&gt;[12]&lt;/sup&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Consumable with a set of standard, interoperable tools&lt;/li&gt;
&lt;li&gt; Consistent regardless of what type of software is being run&lt;/li&gt;
&lt;li&gt; Agnostic to the underlying infrastructure the container is being run on&lt;/li&gt;
&lt;li&gt; Designed in a way that makes automation easy&lt;/li&gt;
&lt;li&gt; Of excellent quality&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There are specifications that dictate how containers should reach these principles by defining how they should be executed (the runtime specification&lt;sup&gt;[11]&lt;/sup&gt;), what a container should contain (the image specification&lt;sup&gt;[13]&lt;/sup&gt;) and how to distribute container "images" (the distribution specification&lt;sup&gt;[14]&lt;/sup&gt;).&lt;/p&gt;
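&lt;p&gt;As a flavour of what these specifications describe, an OCI image manifest (from the image specification) is a small JSON document tying a configuration blob to a list of filesystem layers. The sketch below follows the published manifest format; the digests and sizes are placeholders rather than a real image:&lt;/p&gt;

```json
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.manifest.v1+json",
  "config": {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:0000000000000000000000000000000000000000000000000000000000000000",
    "size": 1469
  },
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:1111111111111111111111111111111111111111111111111111111111111111",
      "size": 27092274
    }
  ]
}
```

&lt;p&gt;A registry serves this manifest first; a runtime then fetches each layer blob by its digest, which is what makes layers content-addressable and shareable between images.&lt;/p&gt;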

&lt;p&gt;These specifications mean that a wide variety of tools can be used to interact with containers. The canonical and most commonly used tool is Docker, which in addition to running containers provides container build tooling and some limited orchestration. However, there are a number of container runtimes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rkt/rkt"&gt;Rkt&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cri-o.io/"&gt;cri-o&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://discuss.linuxcontainers.org/t/lxc-3-0-0-has-been-released/1449"&gt;LXC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/clearcontainers/runtime"&gt;"Clear Containers"&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As well as other tools that help with building or distributing images.&lt;/p&gt;

&lt;p&gt;Lastly, there are extensions to the existing standards, such as the &lt;a href="https://github.com/containernetworking/cni"&gt;container networking interface&lt;/a&gt;, which define additional behaviour where the standards are not yet clear enough.&lt;/p&gt;

&lt;h1&gt;
  
  
  Implementation
&lt;/h1&gt;

&lt;p&gt;While the standards give us some idea of what a container is and how it should work, it’s useful to understand how a container implementation actually works. Not all container runtimes are implemented in this way; notably, Kata Containers implements hardware virtualisation, as alluded to earlier with EC2.&lt;/p&gt;

&lt;p&gt;The problems being solved by containers are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Isolating one or more processes&lt;/li&gt;
&lt;li&gt; Distributing those processes&lt;/li&gt;
&lt;li&gt; Connecting those processes to other machines&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With that said, let’s dive into the Docker implementation&lt;sup&gt;[15]&lt;/sup&gt;. It uses a series of technologies exposed by the underlying kernel:&lt;/p&gt;

&lt;h2&gt;
  
  
  Kernel feature isolation: namespaces
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;namespaces(7)&lt;/code&gt; man page (viewable via &lt;code&gt;man namespaces&lt;/code&gt;) defines namespaces as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Paraphrased: a namespace is a slice of the system from within which a process cannot see the rest of the system.&lt;/p&gt;

&lt;p&gt;A process must make a system call to the Linux kernel to change its namespace. There are several relevant system calls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;clone&lt;/code&gt;: Create a new process. When used in conjunction with &lt;code&gt;CLONE_NEW*&lt;/code&gt; it creates a namespace of the kind specified. For example, if used with &lt;code&gt;CLONE_NEWPID&lt;/code&gt; the process will enter a new &lt;code&gt;pid&lt;/code&gt; namespace and become &lt;code&gt;pid 1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;setns&lt;/code&gt;: Allows the calling process to join an existing namespace, specified under &lt;code&gt;/proc/[pid]/ns&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;unshare&lt;/code&gt;: Moves the calling process into a new namespace&lt;/li&gt;
&lt;/ul&gt;
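&lt;p&gt;The namespaces a process currently belongs to can be inspected without any special tooling: each one is exposed as a symlink under the &lt;code&gt;/proc/[pid]/ns&lt;/code&gt; directory mentioned above, and two processes in the same namespace resolve to the same inode number there. A small sketch:&lt;/p&gt;

```shell
# List the namespaces the current shell belongs to; each entry is a
# symlink whose target encodes the namespace type and inode number.
ls -l /proc/$$/ns

# For example, the network namespace resolves to something like
# "net:[4026531992]"; any process printing the same inode shares
# this shell's network namespace.
readlink /proc/$$/ns/net
```

&lt;p&gt;This is exactly the interface &lt;code&gt;setns&lt;/code&gt; consumes: a process opens one of these files and passes the file descriptor to the kernel to join that namespace.&lt;/p&gt;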

&lt;p&gt;There is also a userland command called &lt;code&gt;unshare&lt;/code&gt; which allows us to experiment with namespaces. We can put ourselves into separate process and network namespaces with the following command:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scratch space&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Fork is required to spawn new processes, and proc is mounted to give accurate process information&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;unshare &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--fork&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--pid&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--mount-proc&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--net&lt;/span&gt;

&lt;span class="c"&gt;# Here we see that we only have access to the loopback interface&lt;/span&gt;
root@sw-20160616-01:/tmp/tmp.XBESuNMJJS# ip addr
1: lo: &amp;lt;LOOPBACK&amp;gt; mtu 65536 qdisc noop state DOWN group default qlen 1000
    &lt;span class="nb"&gt;link&lt;/span&gt;/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

&lt;span class="c"&gt;# Here we see that we can only see the first process (bash) and our `ps aux` invocation&lt;/span&gt;
root@sw-20160616-01:/tmp/tmp.XBESuNMJJS# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.3  0.0   8304  5092 pts/7    S    05:48   0:00 &lt;span class="nt"&gt;-bash&lt;/span&gt;
root         5  0.0  0.0  10888  3248 pts/7    R+   05:49   0:00 ps aux
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Docker uses the following namespaces to limit the ability for a process running in the container to see resources outside that container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;pid&lt;/code&gt; namespace: Process isolation (PID: Process ID).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;net&lt;/code&gt; namespace: Managing network interfaces (NET: Networking).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;ipc&lt;/code&gt; namespace: Managing access to IPC resources (IPC: InterProcess Communication).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;mnt&lt;/code&gt; namespace: Managing filesystem mount points (MNT: Mount)&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;uts&lt;/code&gt; namespace: Isolating kernel and version identifiers. (UTS: Unix Timesharing System).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These provide reasonable separation between processes, such that workloads should not be able to interfere with each other. However, there is a notable caveat: &lt;strong&gt;we can disable some of this isolation&lt;/strong&gt;&lt;sup&gt;[16]&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;This is an extremely useful property. One example would be system daemons that need access to the host network to bind ports on the host&lt;sup&gt;[17]&lt;/sup&gt;, such as a DNS service or service proxy running in a container.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Process #1, or the &lt;code&gt;init&lt;/code&gt; process, has some additional responsibilities on Linux systems. When a process terminates it is not automatically cleaned up, but rather enters a terminated ("zombie") state. It is the responsibility of the init process to "reap" those processes, deleting them so that their process IDs can be reused&lt;sup&gt;[18]&lt;/sup&gt;. This is known as the &lt;em&gt;zombie reaping problem&lt;/em&gt;. Accordingly, the first process run in a Linux PID namespace should be an &lt;code&gt;init&lt;/code&gt; process, and not a user facing process like &lt;code&gt;mysql&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another place namespaces are used is the Chromium browser&lt;sup&gt;[19]&lt;/sup&gt;. Chromium uses at least the &lt;code&gt;setuid&lt;/code&gt; and &lt;code&gt;user&lt;/code&gt; namespaces.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Resource isolation: control groups
&lt;/h2&gt;

&lt;p&gt;The kernel documentation for &lt;code&gt;cgroups&lt;/code&gt; defines the cgroup as follows:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Control Groups provide a mechanism for aggregating/partitioning sets of tasks, and all their future children, into hierarchical groups with specialized behaviour.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That doesn’t really tell us much though. Luckily it expands:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On their own, the only use for cgroups is for simple job tracking. The intention is that other subsystems hook into the generic cgroup support to provide new attributes for cgroups, such as accounting/limiting the resources which processes in a cgroup can access. For example, cpusets (see Documentation/cgroup-v1/cpusets.txt) allow you to associate a set of CPUs and a set of memory nodes with the tasks in each cgroup.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, &lt;code&gt;cgroups&lt;/code&gt; are groups of "jobs" (tasks) that other kernel subsystems can assign meaning to. Subsystems that currently hook into &lt;code&gt;cgroups&lt;/code&gt; include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt"&gt;CPU&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt"&gt;Memory&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt"&gt;PIDs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kernel.org/doc/Documentation/cgroup-v1/net_prio.txt"&gt;Network Priority&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As well as various others.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cgroups&lt;/code&gt; are manipulated by reading and writing files in the cgroup filesystem, conventionally mounted at &lt;code&gt;/sys/fs/cgroup&lt;/code&gt;. For example:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a cgroup called "me"&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt; &lt;span class="nb"&gt;mkdir&lt;/span&gt; /sys/fs/cgroup/memory/me

&lt;span class="c"&gt;# Allocate the cgroup a max of 100Mb memory&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'100000000'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/fs/cgroup/memory/me/memory.limit_in_bytes

&lt;span class="c"&gt;# Move this proess into the cgroup&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/fs/cgroup/memory/me/cgroup.procs
5924
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That’s it! This process should now be limited to 100MB of total memory usage.&lt;/p&gt;

&lt;p&gt;Docker uses the same functionality in its &lt;code&gt;--memory&lt;/code&gt; and &lt;code&gt;--cpus&lt;/code&gt; arguments, and it is employed by the orchestration systems Kubernetes and Apache Mesos to determine where to schedule workloads.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Although &lt;code&gt;cgroups&lt;/code&gt; are most commonly associated with containers they’re already used for other workloads. The best example is perhaps &lt;code&gt;systemd&lt;/code&gt;, which automatically puts all services into a &lt;code&gt;cgroup&lt;/code&gt; if the CPU scheduler is enabled in the kernel&lt;sup&gt;[20]&lt;/sup&gt;. &lt;code&gt;systemd&lt;/code&gt; services are …​ kind of containers!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Userland isolation: seccomp
&lt;/h2&gt;

&lt;p&gt;While both namespaces and &lt;code&gt;cgroups&lt;/code&gt; go a significant way towards isolating processes into their own containers, Docker goes further, restricting what access a process has to the Linux kernel itself. This is enforced on supported operating systems via "SECure COMPuting with filters", also known as &lt;code&gt;seccomp-bpf&lt;/code&gt; or simply &lt;code&gt;seccomp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The Linux kernel user space API guide defines &lt;code&gt;seccomp&lt;/code&gt; as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Seccomp filtering provides a means for a process to specify a filter for incoming system calls. The filter is expressed as a Berkeley Packet Filter (BPF) program, as with socket filters, except that the data operated on is related to the system call being made: system call number and the system call arguments.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;BPF, in turn, is a small in-kernel virtual machine language used for kernel tracing, networking and a number of other tasks&lt;sup&gt;[21]&lt;/sup&gt;. Whether a system supports seccomp can be determined by running the following command&lt;sup&gt;[22]&lt;/sup&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;grep &lt;/span&gt;&lt;span class="nv"&gt;CONFIG_SECCOMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; /boot/config-&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;uname&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Our system supports seccomp&lt;/span&gt;
&lt;span class="nv"&gt;CONFIG_SECCOMP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;y
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Practically, this limits a process’s ability to ask the kernel to do certain things. Any system call can be restricted, and Docker allows the use of arbitrary seccomp "profiles" via its &lt;code&gt;--security-opt&lt;/code&gt; argument&lt;sup&gt;[22]&lt;/sup&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--security-opt&lt;/span&gt; &lt;span class="nv"&gt;seccomp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/seccomp/profile.json &lt;span class="se"&gt;\&lt;/span&gt;
  hello-world
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;However, most usefully, Docker provides a default security profile that blocks some of the more dangerous system calls that processes run in a container should never need to make, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;clone&lt;/code&gt;: The ability to clone new namespaces&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;bpf&lt;/code&gt;: The ability to load and run &lt;code&gt;bpf&lt;/code&gt; programs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;add_key&lt;/code&gt;: The ability to access the kernel keyring&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kexec_load&lt;/code&gt;: The ability to load a new Linux kernel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As well as many others. The full list of syscalls blocked by default is &lt;a href="https://docs.docker.com/engine/security/seccomp/"&gt;available on the Docker website&lt;/a&gt;.&lt;/p&gt;
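&lt;p&gt;For a flavour of what such a profile looks like, the following sketch follows the profile format described in that documentation. It allows everything except &lt;code&gt;kexec_load&lt;/code&gt;; a real hardening profile (like Docker’s default) instead denies by default and allow-lists the calls a workload needs:&lt;/p&gt;

```json
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["kexec_load"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
```

&lt;p&gt;Saved to a file, this could be passed to the &lt;code&gt;--security-opt seccomp=…&lt;/code&gt; argument shown earlier; the blocked call then fails with an error rather than reaching the kernel.&lt;/p&gt;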

&lt;p&gt;In addition to &lt;code&gt;seccomp&lt;/code&gt; there are other ways to ensure containers are behaving as expected, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux Capabilities&lt;sup&gt;[23]&lt;/sup&gt;
&lt;/li&gt;
&lt;li&gt;SELinux&lt;/li&gt;
&lt;li&gt;AppArmor&lt;/li&gt;
&lt;li&gt;AuditD&lt;/li&gt;
&lt;li&gt;Falco&lt;sup&gt;[24]&lt;/sup&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these takes a slightly different approach to ensuring a process only behaves as expected. It’s worth spending time investigating the tradeoffs of each of these security tools, or simply delegating the choice to a competent third party provider.&lt;/p&gt;

&lt;p&gt;Additionally, it’s worth noting that even though Docker enables its &lt;code&gt;seccomp&lt;/code&gt; policy by default, orchestration systems such as &lt;code&gt;kubernetes&lt;/code&gt; may disable it&lt;sup&gt;[25]&lt;/sup&gt;.&lt;/p&gt;
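&lt;p&gt;In modern Kubernetes this is configurable per pod or per container through the security context, which can opt a workload back in to the runtime’s default profile. A sketch of a pod spec fragment; the &lt;code&gt;example&lt;/code&gt;, &lt;code&gt;app&lt;/code&gt; and &lt;code&gt;hello-world&lt;/code&gt; names are placeholders:&lt;/p&gt;

```yaml
# Pod spec fragment: opt the pod back in to the container runtime's
# default seccomp profile rather than running unconfined.
apiVersion: v1
kind: Pod
metadata:
  name: example        # placeholder name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app        # placeholder container
      image: hello-world
```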

&lt;h2&gt;
  
  
  Distribution: the union file system
&lt;/h2&gt;

&lt;p&gt;To generate a container image, Docker requires a set of "build instructions": a Dockerfile. A trivial image could be built as follows:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scrath space&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Create a docker file&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; Dockerfile
FROM debian:buster

# Create a test directory
RUN mkdir /test

# Create a bunch of spam files
RUN echo &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; /test/a
RUN echo &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; /test/b
RUN echo &lt;/span&gt;&lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; /test/c
&lt;/span&gt;&lt;span class="no"&gt;
EOF

&lt;/span&gt;&lt;span class="c"&gt;# Build the image&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;docker build &lt;span class="nb"&gt;.&lt;/span&gt;
Sending build context to Docker daemon  4.096kB
Step 1/5 : FROM debian:buster
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; ebdc13caae1e
Step 2/5 : RUN &lt;span class="nb"&gt;mkdir&lt;/span&gt; /test
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Running &lt;span class="k"&gt;in &lt;/span&gt;a9c0fa1a56c7
Removing intermediate container a9c0fa1a56c7
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 6837541a46a5
Step 3/5 : RUN &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /test/a
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Running &lt;span class="k"&gt;in &lt;/span&gt;8b61ca022296
Removing intermediate container 8b61ca022296
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 3ea076dcea98
Step 4/5 : RUN &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /test/b
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Running &lt;span class="k"&gt;in &lt;/span&gt;940d5bcaa715
Removing intermediate container 940d5bcaa715
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 07b2f7a4dff8
Step 5/5 : RUN &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /test/c
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Running &lt;span class="k"&gt;in &lt;/span&gt;251f5d00b55f
Removing intermediate container 251f5d00b55f
 &lt;span class="nt"&gt;---&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; 0122a70ad0a3
Successfully built 0122a70ad0a3
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This creates a Docker image with the id &lt;code&gt;0122a70ad0a3&lt;/code&gt;, containing the output of &lt;code&gt;date&lt;/code&gt; in the files &lt;code&gt;a&lt;/code&gt;, &lt;code&gt;b&lt;/code&gt; and &lt;code&gt;c&lt;/code&gt;. We can verify this by starting the container and examining its contents:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  0122a70ad0a3 &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/bash

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /test
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls
&lt;/span&gt;a  b  c
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt;

Sat 30 Mar 18:05:24 CET 2019
Sat 30 Mar 18:05:24 CET 2019
Sat 30 Mar 18:05:24 CET 2019
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;However, the &lt;code&gt;docker build&lt;/code&gt; command earlier created several images, not just one. If we run the image from the point where only &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; have been created, we will not see &lt;code&gt;c&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  07b2f7a4dff8 &lt;span class="se"&gt;\&lt;/span&gt;
  /bin/bash
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls test
&lt;/span&gt;a  b
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Docker is not creating a whole new filesystem for each of these images. Instead, each image is layered on top of the previous one. If we query Docker we can see each of the layers that go into a given image:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker &lt;span class="nb"&gt;history &lt;/span&gt;0122a70ad0a3
IMAGE               CREATED             CREATED BY                                      SIZE                COMMENT
0122a70ad0a3        5 minutes ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019…   29B
07b2f7a4dff8        5 minutes ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019…   29B
3ea076dcea98        5 minutes ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;echo &lt;/span&gt;Sat 30 Mar 18:05:24 CET 2019…   29B
6837541a46a5        5 minutes ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="nb"&gt;mkdir&lt;/span&gt; /test                          0B
ebdc13caae1e        12 months ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="c"&gt;#(nop)  CMD ["bash"]                 0B&lt;/span&gt;
&amp;lt;missing&amp;gt;           12 months ago       /bin/sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="c"&gt;#(nop) ADD file:2219cecc89ed69975…   106MB&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This allows Docker to reuse vast chunks of what it downloads. For example, given the image we built earlier we can see that it uses:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; A layer called &lt;code&gt;ADD file:…​&lt;/code&gt; — this is the Debian Buster root filesystem at 106MB&lt;/li&gt;
&lt;li&gt; A layer for &lt;code&gt;a&lt;/code&gt; that renders the date to disk at 29B&lt;/li&gt;
&lt;li&gt; A layer for &lt;code&gt;b&lt;/code&gt; that renders the date to disk at 29B&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;And so on. Docker will reuse the &lt;code&gt;ADD file:…​&lt;/code&gt; Debian Buster root for all images that start with &lt;code&gt;FROM debian:buster&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This allows Docker to be extremely space efficient, reusing the same operating system image across many different containers.&lt;/p&gt;
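&lt;p&gt;To get a feel for the saving, we can do some back-of-the-envelope arithmetic using the layer sizes reported by &lt;code&gt;docker history&lt;/code&gt; above (the numbers are illustrative only):&lt;/p&gt;

```python
# Illustrative arithmetic only: ten hypothetical images built on the same
# 106MB Debian Buster base, each adding three 29-byte `echo` layers.
BASE = 106 * 10**6  # base root filesystem, ~106MB
LAYER = 29          # each `echo` layer, 29 bytes
IMAGES = 10

# Without layer sharing, every image stores its own copy of the base.
without_sharing = IMAGES * (BASE + 3 * LAYER)

# With layer sharing, the base is stored exactly once.
with_sharing = BASE + IMAGES * 3 * LAYER

print(f"naive:  {without_sharing / 10**6:.0f} MB")  # naive:  1060 MB
print(f"shared: {with_sharing / 10**6:.0f} MB")     # shared: 106 MB
```

&lt;p&gt;The base layer dominates: sharing it reduces the on-disk cost of ten images to little more than the cost of one.&lt;/p&gt;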

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even though Docker is space efficient, its image store on disk can still grow extremely large, and transferring large images over the network can become expensive. Therefore, try to reuse image layers where possible and prefer smaller base operating systems or the &lt;code&gt;scratch&lt;/code&gt; (empty) image.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These layers are implemented via a Union Filesystem, or UnionFS. There are various "backends" or filesystems that can implement this approach:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;overlay2&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;devicemapper&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;aufs&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
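&lt;p&gt;The lookup rule such a filesystem applies can be sketched as a toy model. Note this is purely illustrative: the real &lt;code&gt;overlay2&lt;/code&gt; implementation lives in the kernel and differs substantially:&lt;/p&gt;

```python
# Toy model of union-filesystem lookup (illustrative only). Layers are
# stacked topmost-first; the first layer containing a path wins.

WHITEOUT = object()  # marker for a file deleted in an upper layer

def lookup(layers, path):
    """Search layers from topmost to bottommost for `path`."""
    for layer in layers:
        if path in layer:
            content = layer[path]
            if content is WHITEOUT:
                return None  # deleted in an upper layer
            return content
    return None  # not present in any layer

# Topmost layer first, base image last — mirrors the layers above.
layers = [
    {"/test/c": "Sat 30 Mar 18:05:24 CET 2019"},  # layer for `c`
    {"/test/b": "Sat 30 Mar 18:05:24 CET 2019"},  # layer for `b`
    {"/test/a": "Sat 30 Mar 18:05:24 CET 2019"},  # layer for `a`
    {"/bin/bash": "<binary>"},                    # Debian Buster root
]

print(lookup(layers, "/test/a"))    # found in a lower layer
print(lookup(layers, "/bin/bash"))  # falls through to the base image
```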

&lt;p&gt;Generally speaking the kernel on our machine will include the appropriate underlying filesystem driver; we can check which one Docker is using:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker info | &lt;span class="nb"&gt;grep &lt;/span&gt;Storage
Storage Driver: overlay2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We can replicate this layering ourselves with an overlay mount fairly easily&lt;sup&gt;[26]&lt;/sup&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# scratch&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="k"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;mktemp&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="k"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;# Create some layers&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  lower &lt;span class="se"&gt;\&lt;/span&gt;
  upper &lt;span class="se"&gt;\&lt;/span&gt;
  workdir &lt;span class="se"&gt;\&lt;/span&gt;
  overlay

&lt;span class="c"&gt;# Create some files that represent the layers&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;lower/i-am-the-lower
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;higher/i-am-the-higher

&lt;span class="c"&gt;# Create the layered filesystem at overlay with lower, upper and workdir&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;mount &lt;span class="nt"&gt;-t&lt;/span&gt; overlay &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;lowerdir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;lower,upperdir&lt;span class="o"&gt;=&lt;/span&gt;upper,workdir&lt;span class="o"&gt;=&lt;/span&gt;workdir &lt;span class="se"&gt;\&lt;/span&gt;
    overlay &lt;span class="se"&gt;\&lt;/span&gt;
    ./overlay

&lt;span class="c"&gt;# List the directory&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;ls &lt;/span&gt;overlay/
i-am-the-lower  i-am-the-upper
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Docker nests these mounts, stacking one layer upon another until the full multi-layered image filesystem has been assembled.&lt;/p&gt;

&lt;p&gt;In the case of &lt;code&gt;overlay2&lt;/code&gt;, files written at runtime land in the &lt;code&gt;upper&lt;/code&gt; directory. Docker will generally dispose of this writable layer when the container is removed.&lt;/p&gt;
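&lt;p&gt;The write path can be sketched the same way (again a toy model, not the kernel implementation): writes only ever touch the upper layer, which is why discarding it (as &lt;code&gt;docker run --rm&lt;/code&gt; does) cannot damage the image:&lt;/p&gt;

```python
# Toy model of overlay-style copy-on-write (illustrative only).
# Reads prefer the writable upper layer; writes always land in it,
# leaving the read-only lower layers untouched.

def read(upper, lowers, path):
    if path in upper:
        return upper[path]
    for layer in lowers:
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

def write(upper, path, content):
    # Only the upper layer is ever modified.
    upper[path] = content

lowers = [{"/etc/motd": "base image contents"}]
upper = {}

write(upper, "/etc/motd", "modified in the container")
print(read(upper, lowers, "/etc/motd"))  # the upper layer shadows the lower
print(lowers[0]["/etc/motd"])            # the base layer is unchanged
```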

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generally speaking all software needs access to shared libraries found at static paths in Linux operating systems. Accordingly it is the convention to simply ship a stripped down version of an operating system’s root file system, such that applications can find the libraries they expect and users can install additional tooling. However, it is possible to use an empty filesystem and a statically compiled binary with the &lt;code&gt;scratch&lt;/code&gt; image type.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Connectivity: networking
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, containers make use of Linux namespaces. Of particular interest when understanding container networking is the network namespace. This namespace gives the process separate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;(virtual) ethernet devices&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;routing tables&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;iptables&lt;/code&gt; rules&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example,&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a new network namespace&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;unshare &lt;span class="nt"&gt;--fork&lt;/span&gt; &lt;span class="nt"&gt;--net&lt;/span&gt;

&lt;span class="c"&gt;# List the ethernet devices with associated ip addresses&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip addr
1: lo: &amp;lt;LOOPBACK&amp;gt; mtu 65536 qdisc noop state DOWN group default qlen 1000
    &lt;span class="nb"&gt;link&lt;/span&gt;/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

&lt;span class="c"&gt;# List all iptables rules&lt;/span&gt;
root@sw-20160616-01:/home/andrewhowden# iptables &lt;span class="nt"&gt;-L&lt;/span&gt;
Chain INPUT &lt;span class="o"&gt;(&lt;/span&gt;policy ACCEPT&lt;span class="o"&gt;)&lt;/span&gt;
target     prot opt &lt;span class="nb"&gt;source               &lt;/span&gt;destination

Chain FORWARD &lt;span class="o"&gt;(&lt;/span&gt;policy ACCEPT&lt;span class="o"&gt;)&lt;/span&gt;
target     prot opt &lt;span class="nb"&gt;source               &lt;/span&gt;destination

Chain OUTPUT &lt;span class="o"&gt;(&lt;/span&gt;policy ACCEPT&lt;span class="o"&gt;)&lt;/span&gt;
target     prot opt &lt;span class="nb"&gt;source               &lt;/span&gt;destination

&lt;span class="c"&gt;# List all network routes&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip route show
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;By default, the container has no network connectivity — not even the &lt;code&gt;loopback&lt;/code&gt; adapter is up. We cannot even ping ourselves!&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ping 127.0.0.1
PING 127.0.0.1 &lt;span class="o"&gt;(&lt;/span&gt;127.0.0.1&lt;span class="o"&gt;)&lt;/span&gt;: 56 data bytes
ping: sending packet: Network is unreachable
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We can start setting up the expected network environment by bringing up the &lt;code&gt;loopback&lt;/code&gt; adapter:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;ip &lt;span class="nb"&gt;link set &lt;/span&gt;lo up
root@sw-20160616-01:/home/andrewhowden# ip addr
1: lo: &amp;lt;LOOPBACK,UP,LOWER_UP&amp;gt; mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    &lt;span class="nb"&gt;link&lt;/span&gt;/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever

&lt;span class="c"&gt;# Test the loopback adapter&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ping 127.0.0.1
PING 127.0.0.1 &lt;span class="o"&gt;(&lt;/span&gt;127.0.0.1&lt;span class="o"&gt;)&lt;/span&gt;: 56 data bytes
64 bytes from 127.0.0.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.092 ms
64 bytes from 127.0.0.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.068 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;However, we cannot access the outside world. In most environments our host machine will be connected via ethernet to a given network and either have an IP assigned to it via the cloud provider or, in the case of a development or office machine, request an IP via DHCP. However our container is in a network namespace of its own and has no knowledge of the ethernet connected to the host. To connect the container to the host we need to employ a &lt;code&gt;veth&lt;/code&gt; device.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;veth&lt;/code&gt;, or "Virtual Ethernet Device", is defined by &lt;code&gt;man veth&lt;/code&gt; as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The veth devices are virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to a physical network device in another namespace, but can also be used as standalone network devices.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is exactly what we need! Because &lt;code&gt;unshare&lt;/code&gt; creates an anonymous network namespace we need to determine what the &lt;code&gt;pid&lt;/code&gt; of the process started in that namespace is&lt;sup&gt;[27][28]&lt;/sup&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$$&lt;/span&gt;
18171
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
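&lt;p&gt;Namespaces can also be identified directly: each process exposes symbolic links under &lt;code&gt;/proc/&amp;lt;pid&amp;gt;/ns/&lt;/code&gt;, and two processes share a namespace exactly when those links point at the same inode. A small Linux-only sketch in Python:&lt;/p&gt;

```python
import os

# Synchronise parent and child with a pipe so the child's /proc entry
# is still live while we inspect it.
r, w = os.pipe()
pid = os.fork()
if pid == 0:
    os.read(r, 1)  # child waits until the parent is done inspecting
    os._exit(0)

# Each entry under /proc/<pid>/ns/ names a namespace by type and inode,
# e.g. "net:[4026531992]".
me = os.readlink("/proc/self/ns/net")
child_ns = os.readlink(f"/proc/{pid}/ns/net")

os.write(w, b"x")   # release the child
os.waitpid(pid, 0)  # reap it

print(me)
print(me == child_ns)  # True: a fork inherits its parent's namespaces
```

&lt;p&gt;A process that calls &lt;code&gt;unshare&lt;/code&gt; would instead show a different inode for the unshared namespace type.&lt;/p&gt;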



&lt;p&gt;We can then create the &lt;code&gt;veth&lt;/code&gt; device:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ip &lt;span class="nb"&gt;link &lt;/span&gt;add veth0 &lt;span class="nb"&gt;type &lt;/span&gt;veth peer name veth0 netns 18171
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;On both the host and the guest we can see these virtual ethernet devices appear. However, neither has an IP attached nor any routes defined:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Container&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;ip addr
1: lo: &amp;lt;LOOPBACK&amp;gt; mtu 65536 qdisc noop state DOWN group default qlen 1000
    &lt;span class="nb"&gt;link&lt;/span&gt;/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth0@if7: &amp;lt;BROADCAST,MULTICAST&amp;gt; mtu 1500 qdisc noop state DOWN group default qlen 1000
    &lt;span class="nb"&gt;link&lt;/span&gt;/ether 16:34:52:54:a2:a1 brd ff:ff:ff:ff:ff:ff link-netnsid 0
&lt;span class="nv"&gt;$ &lt;/span&gt;ip route show

&lt;span class="c"&gt;# No output&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To address that we simply add an IP and define the default route:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the host&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip addr add 192.168.24.1 dev veth0

&lt;span class="c"&gt;# Within the container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip address add 192.168.24.10 dev veth0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;From there, bring the devices up:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Both host and container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip &lt;span class="nb"&gt;link set &lt;/span&gt;veth0 up
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Add a route such that &lt;code&gt;192.168.24.0/24&lt;/code&gt; goes out via &lt;code&gt;veth0&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Both host and guest&lt;/span&gt;
ip route add 192.168.24.0/24 dev veth0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And voilà! We have connectivity to the host namespace and back:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Within container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ping 192.168.24.1
PING 192.168.24.1 &lt;span class="o"&gt;(&lt;/span&gt;192.168.24.1&lt;span class="o"&gt;)&lt;/span&gt;: 56 data bytes
64 bytes from 192.168.24.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.149 ms
64 bytes from 192.168.24.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.096 ms
64 bytes from 192.168.24.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.104 ms
64 bytes from 192.168.24.1: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;64 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.100 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;However, that does not give us access to the wider internet. While the &lt;code&gt;veth&lt;/code&gt; adapter functions as a virtual cable between our container and our host, there is currently no path from our container to the internet:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Within container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ping google.com
ping: unknown host
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To create such a path we need to modify our host such that it functions as a "router" between its own, separated network namespaces and its internet facing adapter.&lt;/p&gt;

&lt;p&gt;Luckily, Linux is set up well for this purpose. First, we need to change the kernel’s default behaviour of dropping packets that are not destined for one of its own IP addresses, and instead allow it to forward packets from one adapter to another:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Within container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo &lt;/span&gt;1 &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /proc/sys/net/ipv4/ip_forward
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That means that when packets destined for public IPs arrive from our container’s &lt;code&gt;veth&lt;/code&gt; adapter at the host’s &lt;code&gt;veth&lt;/code&gt; adapter, the host won’t simply drop them.&lt;/p&gt;

&lt;p&gt;From there we employ &lt;code&gt;iptables&lt;/code&gt; rules on the host to forward traffic from the host &lt;code&gt;veth&lt;/code&gt; adapter to the internet facing adapter — in this case &lt;code&gt;wlp2s0&lt;/code&gt;:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the host&lt;/span&gt;
&lt;span class="c"&gt;# Forward packets from the container to the host adapter&lt;/span&gt;
iptables &lt;span class="nt"&gt;-A&lt;/span&gt; FORWARD &lt;span class="nt"&gt;-i&lt;/span&gt; veth0 &lt;span class="nt"&gt;-o&lt;/span&gt; wlp2s0 &lt;span class="nt"&gt;-j&lt;/span&gt; ACCEPT

&lt;span class="c"&gt;# Forward packets that have been established via egress from the host adapater back to the contianer&lt;/span&gt;
iptables &lt;span class="nt"&gt;-A&lt;/span&gt; FORWARD &lt;span class="nt"&gt;-i&lt;/span&gt; wlp2s0 &lt;span class="nt"&gt;-o&lt;/span&gt; veth0 &lt;span class="nt"&gt;-m&lt;/span&gt; state &lt;span class="nt"&gt;--state&lt;/span&gt; ESTABLISHED,RELATED &lt;span class="nt"&gt;-j&lt;/span&gt; ACCEPT

&lt;span class="c"&gt;# Relabel the IPs for the container so return traffic will be routed correctly&lt;/span&gt;
iptables &lt;span class="nt"&gt;-t&lt;/span&gt; nat &lt;span class="nt"&gt;-A&lt;/span&gt; POSTROUTING &lt;span class="nt"&gt;-o&lt;/span&gt; wlp2s0 &lt;span class="nt"&gt;-j&lt;/span&gt; MASQUERADE
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;We then tell our container to send traffic it doesn’t otherwise know how to route down the &lt;code&gt;veth&lt;/code&gt; adapter:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Within the container&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;ip route add default via 192.168.24.1 dev veth0
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And the internet works!&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="c"&gt;# ping google.com&lt;/span&gt;
PING google.com &lt;span class="o"&gt;(&lt;/span&gt;172.217.22.14&lt;span class="o"&gt;)&lt;/span&gt;: 56 data bytes
64 bytes from 172.217.22.14: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;55 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;16.456 ms
64 bytes from 172.217.22.14: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;55 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;15.102 ms
64 bytes from 172.217.22.14: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;55 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;34.369 ms
64 bytes from 172.217.22.14: &lt;span class="nv"&gt;icmp_seq&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3 &lt;span class="nv"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;55 &lt;span class="nb"&gt;time&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;15.319 ms
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;As mentioned, each container implementation can implement networking differently. There are implementations that use the aforementioned &lt;code&gt;veth&lt;/code&gt; pair, &lt;code&gt;vxlan&lt;/code&gt;, &lt;code&gt;BPF&lt;/code&gt; or other cloud specific implementations. However, when designing containers we need some way to reason about what behaviour we should expect.&lt;/p&gt;

&lt;p&gt;To help address this the &lt;a href="https://github.com/containernetworking/cni"&gt;"Container Network Interface"&lt;/a&gt; tooling has been designed. This allows defining consistent network behaviour across network implementations, as well as models such as Kubernetes’ shared &lt;code&gt;lo&lt;/code&gt; adapter between several containers.&lt;/p&gt;

&lt;p&gt;The networking side of containers is an area undergoing rapid innovation. However, relying on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; A &lt;code&gt;lo&lt;/code&gt; interface&lt;/li&gt;
&lt;li&gt; A public facing &lt;code&gt;eth0&lt;/code&gt; (or similar) interface&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;being present seems a fairly stable guarantee.&lt;/p&gt;

&lt;h1&gt;
  
  
  Landscape review
&lt;/h1&gt;

&lt;p&gt;Given our understanding of the implementation of containers we can now take a look at some of the classic Docker discussions.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Updates
&lt;/h2&gt;

&lt;p&gt;One of the oft-overlooked aspects of containers is the need to keep both them and the host system up to date.&lt;/p&gt;

&lt;p&gt;In modern systems it is quite common to simply enable automatic updates on host systems and, so long as we stick to the system package manager and ensure updates stay successful, the system will keep itself both up to date and stable.&lt;/p&gt;

&lt;p&gt;However, containers take a very different approach. They’re effectively giant static binaries deployed into a production system, and in this capacity they can do no self-maintenance.&lt;/p&gt;

&lt;p&gt;Accordingly, even if there are no updates to the software the container runs, containers should be periodically rebuilt and redeployed to the production system — lest they accumulate vulnerabilities over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Init within container
&lt;/h2&gt;

&lt;p&gt;Given our understanding of containers it’s reasonable to consider the "1 process per container" advice an oversimplification of how containers work; in some cases it makes sense to do service management within a container with a system like &lt;code&gt;runit&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This allows multiple processes to be executed within a single container including things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;syslog&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;logrotate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;cron&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so forth.&lt;/p&gt;

&lt;p&gt;In the case where Docker is the only system in use it is indeed reasonable to think about doing service management within Docker — particularly when hitting the constraints of shared filesystem or network state. However, systems such as Kubernetes, Swarm or Mesos have obviated much of the need for these init systems; tasks such as log aggregation, restarting services or colocating services are taken care of by these tools.&lt;/p&gt;

&lt;p&gt;Accordingly it’s best to keep containers simple such that they are maximally composable and easy to debug, delegating the more complex behaviour out.&lt;/p&gt;

&lt;h1&gt;
  
  
  In Conclusion
&lt;/h1&gt;

&lt;p&gt;Containers are an excellent way to ship software to production systems. They solve a swathe of interesting problems and cost very little as a result. However, their rapid growth has caused some confusion in the industry as to exactly how they work, whether they’re stable and so forth. Containers are a combination of both old and new Linux kernel technology such as namespaces, cgroups, seccomp and other Linux networking tooling, but are as stable as any other kernel technology (so, very) and well suited to production systems.&lt;/p&gt;

&lt;p&gt;&amp;lt;3 for making it this far.&lt;/p&gt;

&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt; “Docker.” &lt;a href="https://en.wikipedia.org/wiki/Docker_(software)"&gt;https://en.wikipedia.org/wiki/Docker_(software)&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt; “Cloud Native Technologies in the Fortune 100.” &lt;a href="https://redmonk.com/fryan/2017/09/10/cloud-native-technologies-in-the-fortune-100/"&gt;https://redmonk.com/fryan/2017/09/10/cloud-native-technologies-in-the-fortune-100/&lt;/a&gt; , Sep-2017.&lt;/li&gt;
&lt;li&gt; B. Cantrill, “The Container Revolution: Reflections After the First Decade.” &lt;a href="https://www.youtube.com/watch?v=xXWaECk9XqM"&gt;https://www.youtube.com/watch?v=xXWaECk9XqM&lt;/a&gt; , Sep-2018.&lt;/li&gt;
&lt;li&gt; “Papers (Jail).” &lt;a href="https://docs.freebsd.org/44doc/papers/jail/jail.html"&gt;https://docs.freebsd.org/44doc/papers/jail/jail.html&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt; “An absolutely minimal chroot.” &lt;a href="https://sagar.se/an-absolutely-minimal-chroot.html"&gt;https://sagar.se/an-absolutely-minimal-chroot.html&lt;/a&gt; , Jan-2011.&lt;/li&gt;
&lt;li&gt; J. Beck &lt;em&gt;et al.&lt;/em&gt;, “Virtualization and Namespace Isolation in the Solaris Operating System (PSARC/2002/174).” &lt;a href="https://us-east.manta.joyent.com/jmc/public/opensolaris/ARChive/PSARC/2002/174/zones-design.spec.opensolaris.pdf"&gt;https://us-east.manta.joyent.com/jmc/public/opensolaris/ARChive/PSARC/2002/174/zones-design.spec.opensolaris.pdf&lt;/a&gt; , Sep-2006.&lt;/li&gt;
&lt;li&gt; M. Kerrisk, “Namespaces in operation, part 1: namespaces overview.” &lt;a href="https://lwn.net/Articles/531114/"&gt;https://lwn.net/Articles/531114/&lt;/a&gt; , Jan-2013.&lt;/li&gt;
&lt;li&gt; A. Polvi, “CoreOS is building a container runtime, rkt.” &lt;a href="https://coreos.com/blog/rocket.html"&gt;https://coreos.com/blog/rocket.html&lt;/a&gt; , Jan-2014.&lt;/li&gt;
&lt;li&gt; “Basics of the Unix Philosophy.” &lt;a href="http://www.catb.org/~esr/writings/taoup/html/ch01s06.html"&gt;http://www.catb.org/~esr/writings/taoup/html/ch01s06.html&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;P. Estes and M. Brown, “OCI Image Support Comes to Open Source Docker Registry.” &lt;a href="https://www.opencontainers.org/blog/2018/10/11/oci-image-support-comes-to-open-source-docker-registry"&gt;https://www.opencontainers.org/blog/2018/10/11/oci-image-support-comes-to-open-source-docker-registry&lt;/a&gt; , Oct-2018.&lt;/li&gt;
&lt;li&gt;“Open Container Initiative Runtime Specification.” &lt;a href="https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/spec.md"&gt;https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/spec.md&lt;/a&gt; , Mar-2018.&lt;/li&gt;
&lt;li&gt;“The 5 principles of Standard Containers.” &lt;a href="https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/principles.md"&gt;https://github.com/opencontainers/runtime-spec/blob/74b670efb921f9008dcdfc96145133e5b66cca5c/principles.md&lt;/a&gt; , Dec-2016.&lt;/li&gt;
&lt;li&gt;“Open Container Initiative Image Specification.” &lt;a href="https://github.com/opencontainers/image-spec/blob/db4d6de99a2adf83a672147d5f05a2e039e68ab6/spec.md"&gt;https://github.com/opencontainers/image-spec/blob/db4d6de99a2adf83a672147d5f05a2e039e68ab6/spec.md&lt;/a&gt; , Jun-2017.&lt;/li&gt;
&lt;li&gt;“Open Container Initiative Distribution Specification.” &lt;a href="https://github.com/opencontainers/distribution-spec/blob/d93cfa52800990932d24f86fd233070ad9adc5e0/spec.md"&gt;https://github.com/opencontainers/distribution-spec/blob/d93cfa52800990932d24f86fd233070ad9adc5e0/spec.md&lt;/a&gt; , Mar-2019.&lt;/li&gt;
&lt;li&gt;“Docker Overview.” &lt;a href="https://docs.docker.com/engine/docker-overview/"&gt;https://docs.docker.com/engine/docker-overview/&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;J. Frazelle, “Containers aka crazy user space fun.” &lt;a href="https://www.youtube.com/watch?v=7mzbIOtcIaQ"&gt;https://www.youtube.com/watch?v=7mzbIOtcIaQ&lt;/a&gt; , Jan-2018.&lt;/li&gt;
&lt;li&gt;“Use Host Networking.” &lt;a href="https://docs.docker.com/network/host/"&gt;https://docs.docker.com/network/host/&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;Krallin, “Tini: A tini but valid init for containers.” &lt;a href="https://github.com/krallin/tini"&gt;https://github.com/krallin/tini&lt;/a&gt; , Nov-2018.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://chromium.googlesource.com/chromium/src.git/+/HEAD/docs/linux_sandboxing.md"&gt;https://chromium.googlesource.com/chromium/src.git/+/HEAD/docs/linux_sandboxing.md&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;L. Poettering, “systemd for Administrators, Part XVIII.” &lt;a href="http://0pointer.de/blog/projects/resources.html"&gt;http://0pointer.de/blog/projects/resources.html&lt;/a&gt; , Oct-2012.&lt;/li&gt;
&lt;li&gt;A. Howden, “Coming to grips with eBPF.” &lt;a href="https://www.littleman.co/articles/coming-to-grips-with-ebpf/"&gt;https://www.littleman.co/articles/coming-to-grips-with-ebpf/&lt;/a&gt; , Mar-2019.&lt;/li&gt;
&lt;li&gt;“Seccomp security profiles for docker.” &lt;a href="https://docs.docker.com/engine/security/seccomp/"&gt;https://docs.docker.com/engine/security/seccomp/&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;“Linux kernel capabilities.” &lt;a href="https://docs.docker.com/engine/security/security/#linux-kernel-capabilities"&gt;https://docs.docker.com/engine/security/security/#linux-kernel-capabilities&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;M. Stemm, “SELinux, Seccomp, Sysdig Falco, and you: A technical discussion.” &lt;a href="https://sysdig.com/blog/selinux-seccomp-falco-technical-discussion/"&gt;https://sysdig.com/blog/selinux-seccomp-falco-technical-discussion/&lt;/a&gt; , Dec-2016.&lt;/li&gt;
&lt;li&gt;“Pod Security Policies.” &lt;a href="https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp"&gt;https://kubernetes.io/docs/concepts/policy/pod-security-policy/#seccomp&lt;/a&gt; .&lt;/li&gt;
&lt;li&gt;Programster, “Example OverlayFS Usage.” &lt;a href="https://askubuntu.com/a/704358"&gt;https://askubuntu.com/a/704358&lt;/a&gt; , Nov-2015.&lt;/li&gt;
&lt;li&gt;“How do I connect a veth device inside an ’anonymous’ network namespace to one outside?” &lt;a href="https://unix.stackexchange.com/a/396210"&gt;https://unix.stackexchange.com/a/396210&lt;/a&gt; , Oct-2017.&lt;/li&gt;
&lt;li&gt;D. P. García, “Network namespaces.” &lt;a href="https://blogs.igalia.com/dpino/2016/04/10/network-namespaces/"&gt;https://blogs.igalia.com/dpino/2016/04/10/network-namespaces/&lt;/a&gt; , Apr-2016.&lt;/li&gt;
&lt;/ol&gt;


</description>
      <category>container</category>
      <category>docker</category>
      <category>namespace</category>
      <category>deepdive</category>
    </item>
    <item>
      <title>Laying out a git repository</title>
      <dc:creator>Andrew Howden</dc:creator>
      <pubDate>Tue, 26 Mar 2019 06:34:10 +0000</pubDate>
      <link>https://dev.to/andrewhowdencom/laying-out-a-git-repository-ii1</link>
      <guid>https://dev.to/andrewhowdencom/laying-out-a-git-repository-ii1</guid>
      <description>

&lt;p&gt;Version control is one of the more fundamental pieces of software development. It allows developers to navigate through a project’s history to understand who implemented each change, as well as why they did so. It is an invaluable tool for understanding any given issue.&lt;/p&gt;

&lt;p&gt;littleman.co uses git as its version control tool of choice. &lt;code&gt;git&lt;/code&gt; is the de facto standard of the software industry, having replaced Mercurial, Subversion and CVS. The majority of our development tools and our workflow build on top of &lt;code&gt;git&lt;/code&gt; primitives such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;patches&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;branches&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;tags&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so forth. That said &lt;code&gt;git&lt;/code&gt;, for all its opinions, is remarkably silent about how to lay out a project.&lt;/p&gt;
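&lt;p&gt;As a quick illustration of those primitives in action (the repository name and file used here are purely illustrative):&lt;/p&gt;

```shell
# Create a throwaway repository; "demo" is an illustrative name.
git init --quiet demo
git -C demo config user.email "demo@example.com"
git -C demo config user.name "Demo"

# A commit is the unit from which patches are generated
printf 'hello\n' | tee demo/README.txt
git -C demo add README.txt
git -C demo commit --quiet -m "Add README"

# A branch is a movable pointer to a line of development
git -C demo branch feature/example

# A tag is a fixed pointer, typically marking a release
git -C demo tag v0.1.0

git -C demo branch --list
git -C demo tag --list
```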

&lt;p&gt;This is a good thing for the tool but not necessarily for the developer. When first reading a project to understand and debug it a developer needs to build a model of that project as quickly as possible. They can then use that model to make predictions about how the software should behave, as well as spot things that violate those predictions. If we keep projects consistent we reduce the number of odd things developers need to investigate before finding the issue they are looking for.&lt;/p&gt;

&lt;p&gt;Accordingly it’s a good idea to structure all projects in the same way, so that developers can easily understand and search through them.&lt;/p&gt;

&lt;h1&gt;
  
  
  Existing Standards
&lt;/h1&gt;

&lt;p&gt;Defining a standard for how a project should be laid out is hardly a new endeavour. There are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="http://refspecs.linuxfoundation.org/FHS_3.0/fhs-3.0.html"&gt;Linux Filesystem Hierarchy Standard&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://github.com/golang-standards/project-layout"&gt;standard go project layout&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://maven.apache.org/guides/introduction/introduction-to-the-standard-directory-layout.html"&gt;Maven standard directory layout&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://docs.python-guide.org/writing/structure/"&gt;Python standard project layout&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If one of these standards is in wide use in your organisation it’s best to continue with that, rather than &lt;a href="https://xkcd.com/927/"&gt;adopt yet another standard&lt;/a&gt;. However, each of these standards has the limitation that it is only used in the context of the language or build tooling in which it is defined. In an environment such as littleman.co that includes many different languages, applications and other types of development, these standards either do not define enough behaviour to be useful or define things that do not translate well between languages.&lt;/p&gt;

&lt;h1&gt;
  
  
  Determining the boundaries of a repository
&lt;/h1&gt;

&lt;p&gt;There are usually many different components of a project that need to come together before that project is user facing and doing useful work. Things like the:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Application&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Infrastructure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;CI/CD&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Artifacts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Documentation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These must all be coordinated in a way that allows developers to make changes to a project predictably, and with predictable timing, and have those changes pushed to users.&lt;/p&gt;

&lt;p&gt;Traditionally each of these components would be kept separate, handled by different teams. However with the advent of continuous delivery developers can push code to production in a "self service" manner, and have a robot take care of tasks such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Ensuring the application works as expected before it hits users&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replacing the existing application in production with the new application&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rolling back the application to its previous version in the case of failure&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creating testing environments&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And so forth.&lt;/p&gt;

&lt;p&gt;Deployments are the boundary that seems most useful for determining what should be in a single repository. For example, if the application is the only thing that should change in a single deployment, it can be the only thing in that repository. However, if an application change requires an underlying infrastructure change, that infrastructure should also be in the repository. If the application requires a new set of tests and those tests live in the CI/CD configuration, that configuration also belongs in the repository.&lt;/p&gt;

&lt;p&gt;However, this also provides good boundaries as to what does not belong in the repository. The application should never require Kubernetes to be in a specific configuration, and Kubernetes configuration and life cycle should thus be managed in a separate repository. If the application requires new TLS certificates but those certificates are handled in a process outside the normal application development process they should also not be stored in the repository.&lt;/p&gt;

&lt;p&gt;By using the deployment as our boundary to determine what goes in and out of the project we see a number of benefits:&lt;/p&gt;

&lt;h2&gt;
  
  
  Democratised project tooling
&lt;/h2&gt;

&lt;p&gt;Even though tooling such as Docker or CI/CD may require specialised knowledge that application developers have no other reason to learn, seeing those changes in the same place, and subject to the same standards, as the rest of the application gives developers a better understanding of their own project lifecycle. They can use that knowledge to reduce the time required to understand and resolve issues associated with changes in that process, such as a CI/CD configuration change breaking asset compilation in the application.&lt;/p&gt;

&lt;p&gt;Additionally those developers can contribute application specific insights to the CI/CD process, such as the best place to store configuration or environment specific application configuration that must be applied.&lt;/p&gt;

&lt;h2&gt;
  
  
  Single view of changes
&lt;/h2&gt;

&lt;p&gt;When working out how and when a bug was introduced into a service, the fewer places we must look to correlate changes, the faster we can find and resolve the issue.&lt;/p&gt;

&lt;p&gt;By having all changes associated with the project, down to the next "deployment layer", in one place we can quickly see whether an application code change, configuration change or environment change was introduced at the same time an issue started hitting users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coordinated Changes
&lt;/h2&gt;

&lt;p&gt;There are times in which an application change and a configuration or environment change must happen at the same time. Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The addition of a new data store (Redis)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Newly exposed application configuration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A new application feature that requires a system library&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By having both the application and the infrastructure in a single repository we can review both the application changes and the infrastructure changes in a single pull request and ensure they’re released and tested in a coordinated way.&lt;/p&gt;

&lt;p&gt;Additionally any deployment artifacts generated can be directly traced back to a change in the &lt;code&gt;git&lt;/code&gt; repository allowing operations team members to know exactly what code is running in production at any given time.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Standard
&lt;/h1&gt;

&lt;p&gt;The littleman.co standard is derived from the requirements as above. The directory layout is as follows:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;tree &lt;span class="nb"&gt;.&lt;/span&gt;

├── bin
├── build
│   ├── ci
│   └── container
│       ├── Dockerfile
│       └── etc
├── deploy
│   ├── ansible
│   │   └── playbook.yml
│   ├── docker-compose
│   │   ├── docker-compose.yml
│   │   └── mnt
│   │       └── app
│   └── helm
├── docs
├── LICENSE.txt
├── README.adoc
├── src
└── web

14 directories, 5 files
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
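&lt;p&gt;As a sketch, the skeleton above can be recreated by hand; the project name &lt;code&gt;myproject&lt;/code&gt; is an assumption for illustration only:&lt;/p&gt;

```shell
# Recreate the standard layout; "myproject" is an illustrative name,
# not part of the standard itself.
mkdir -p myproject/bin
mkdir -p myproject/build/ci
mkdir -p myproject/build/container/etc
mkdir -p myproject/deploy/ansible
mkdir -p myproject/deploy/docker-compose/mnt/app
mkdir -p myproject/deploy/helm
mkdir -p myproject/docs
mkdir -p myproject/src
mkdir -p myproject/web
touch myproject/LICENSE.txt
touch myproject/README.adoc
touch myproject/build/container/Dockerfile
touch myproject/deploy/ansible/playbook.yml
touch myproject/deploy/docker-compose/docker-compose.yml

# Verify the skeleton
find myproject -type d | sort
```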



&lt;p&gt;A &lt;a href="https://github.com/littlemanco/boilr-gitrepo"&gt;new project was published on GitHub&lt;/a&gt; alongside this post that describes these standards, formatted as &lt;a href="https://github.com/tmrts/boilr"&gt;a &lt;code&gt;boilr&lt;/code&gt; template&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  /
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;├── LICENSE.txt
├── README.adoc
├── .drone.yml
├── .arclint
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;There are various files that are either required by convention or by project tooling to be in the root of the project.&lt;/p&gt;

&lt;p&gt;These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LICENSE.txt&lt;/strong&gt;: The project license&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;README.adoc&lt;/strong&gt;: Some basic description about the project&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;.drone.yml&lt;/strong&gt;: The task runner / CI configuration for the project&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;.arclint&lt;/strong&gt;: Configuration for the Arcanist lint runner&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
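&lt;p&gt;A minimal &lt;code&gt;.drone.yml&lt;/code&gt; might look something like the following sketch; the step name and image are illustrative rather than prescribed by the standard:&lt;/p&gt;

```shell
# Write a minimal, hypothetical Drone pipeline configuration.
# The step name and image are illustrative only.
printf '%s\n' \
  'kind: pipeline' \
  'name: default' \
  '' \
  'steps:' \
  '- name: lint' \
  '  image: alpine:3.9' \
  '  commands:' \
  '  - echo "linting"' \
  | tee .drone.yml
```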

&lt;h1&gt;
  
  
  Build
&lt;/h1&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── build
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Build configuration is expected to produce some sort of artifact, either consumed later in the build or deployed to an environment.&lt;/p&gt;

&lt;p&gt;These include:&lt;/p&gt;

&lt;h2&gt;
  
  
  CI
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── build
    └── ci
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Sometimes there are limitations with the build system that require additional procedural scripts to do some &lt;code&gt;$THING&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These are somewhat of an anti-pattern though; where possible, prefer build tools that address the problem in a more abstract way, or reusable plugins in &lt;a href="http://plugins.drone.io/"&gt;the style of &lt;code&gt;drone&lt;/code&gt; plugins&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Containers
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── build
    └── container
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Containers are the canonical deployment artifact used by littleman.co. They’re built from the &lt;code&gt;Dockerfile&lt;/code&gt; definition.&lt;/p&gt;

&lt;p&gt;Generally there is only one production container per project, though other containers may be used to assist with bespoke application build tasks.&lt;/p&gt;

&lt;h1&gt;
  
  
  Deploy
&lt;/h1&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── deploy
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The deployment folder contains any "infrastructure as code" configuration. There are various kinds that are in common use, including:&lt;/p&gt;

&lt;h2&gt;
  
  
  Helm
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── deploy
    └── helm
        ├── Chart.yaml
        ├── templates
        └── ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Helm is a project for managing the definitions and lifecycle of Kubernetes objects. It is an opinionated way of packaging and vendoring software and there are &lt;a href="https://github.com/helm/charts/tree/master/stable"&gt;a number of pre-packaged bits of software&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Each bit of software is packaged into a "chart". This chart includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Some metadata describing the software&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The deployment definitions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The deployment definition configuration&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
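&lt;p&gt;A minimal sketch of that chart metadata, with an illustrative name and version (note that Helm expects the metadata file to be named &lt;code&gt;Chart.yaml&lt;/code&gt;):&lt;/p&gt;

```shell
# Sketch a chart's metadata file; the name, version and description
# are illustrative, not part of the littleman.co standard.
mkdir -p deploy/helm/templates
printf '%s\n' \
  'apiVersion: v1' \
  'name: example-service' \
  'version: 0.1.0' \
  'description: An illustrative chart' \
  | tee deploy/helm/Chart.yaml
```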

&lt;p&gt;Usually a project has only a single chart. However, where multiple charts are required to launch the project, each chart is nested in its own subdirectory:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── deploy
    └── helm
        └── service-a
            ├── Chart.yaml
            ├── templates
            └── ...
        └── service-b
            ├── Chart.yaml
            └── ...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Generally speaking however, it is an anti-pattern to need multiple services for a single project. The project should be deployed as a single, &lt;a href="https://en.wikipedia.org/wiki/Atomic_commit"&gt;atomic change&lt;/a&gt;. These services are better organised &lt;a href="https://helm.sh/docs/chart_template_guide/"&gt;in the subchart pattern&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ansible
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── deploy
    └── ansible
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Ansible is a tool for defining machine specifications and having them enforced. The layout within this folder should be &lt;a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_best_practices.html#directory-layout"&gt;the layout defined by Ansible upstream&lt;/a&gt;, with the exception that each project is expected to only define one role.&lt;/p&gt;
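&lt;p&gt;A sketch of that upstream layout with a single role; the role name &lt;code&gt;myrole&lt;/code&gt; is illustrative:&lt;/p&gt;

```shell
# Sketch the upstream Ansible directory layout with a single role;
# "myrole" is an illustrative role name.
mkdir -p deploy/ansible/group_vars
mkdir -p deploy/ansible/host_vars
mkdir -p deploy/ansible/roles/myrole/tasks
mkdir -p deploy/ansible/roles/myrole/handlers
mkdir -p deploy/ansible/roles/myrole/templates
mkdir -p deploy/ansible/roles/myrole/defaults
touch deploy/ansible/playbook.yml

find deploy/ansible -type d | sort
```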

&lt;h2&gt;
  
  
  Docker Compose
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── deploy
    └── docker-compose
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;code&gt;docker-compose&lt;/code&gt; is a tool that is useful for spinning up a "production like" environment in a limited way in the local development environment.&lt;/p&gt;

&lt;p&gt;Its scope is limited to local development by design.&lt;/p&gt;
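&lt;p&gt;A minimal, illustrative &lt;code&gt;docker-compose.yml&lt;/code&gt; for such a local environment might look like the following; the service and image names are assumptions, not part of the standard:&lt;/p&gt;

```shell
# Write a minimal, hypothetical docker-compose.yml for local
# development; the service name, image and paths are illustrative.
mkdir -p deploy/docker-compose/mnt/app
printf '%s\n' \
  'version: "3"' \
  'services:' \
  '  app:' \
  '    image: nginx:1.15' \
  '    ports:' \
  '      - "8080:80"' \
  '    volumes:' \
  '      - ./mnt/app:/var/www' \
  | tee deploy/docker-compose/docker-compose.yml
```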

&lt;h2&gt;
  
  
  Docs
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── docs
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Project-specific documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Src
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── src
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;All files associated with the application.&lt;/p&gt;

&lt;p&gt;If the application is interpreted, this directory should instead be called "app".&lt;/p&gt;

&lt;h2&gt;
  
  
  Web
&lt;/h2&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;└── web
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The generated web application.&lt;/p&gt;

&lt;h1&gt;
  
  
  In Conclusion
&lt;/h1&gt;

&lt;p&gt;Our tools shape our conceptual model of a project. Keeping things consistent reduces the amount we need to investigate in each different project before we can start diagnosing issues or adding features, and adopting a single project layout keeps things as consistent as possible. The things included in a &lt;code&gt;git&lt;/code&gt; repository in littleman.co projects are all the things needed to deploy a project to users, or subsequently change that project’s behaviour, given consistent underlying infrastructure. The layout is fairly straightforward but subject to iteration, and has thus been &lt;a href="https://github.com/littlemanco/boilr-gitrepo"&gt;pushed to GitHub&lt;/a&gt;. Hopefully understanding how we structure projects will give you some guidance on how to structure your own, or invite questions as to whether your projects are currently structured to maximise clarity and consistency in your team.&lt;/p&gt;


</description>
      <category>git</category>
      <category>repository</category>
      <category>layout</category>
      <category>technicalreview</category>
    </item>
    <item>
      <title>Coming to grips with eBPF </title>
      <dc:creator>Andrew Howden</dc:creator>
      <pubDate>Sat, 23 Mar 2019 02:41:39 +0000</pubDate>
      <link>https://dev.to/andrewhowdencom/coming-to-grips-with-ebpf--1760</link>
      <guid>https://dev.to/andrewhowdencom/coming-to-grips-with-ebpf--1760</guid>
      <description>

&lt;p&gt;I have a fairly long history of using Linux for a number of purposes. After being assigned Linux as a development machine while working with the team at &lt;a href="https://www.fontis.com.au/"&gt;Fontis&lt;/a&gt;, a combination of curiosity and the need to urgently repair that machine as a result of curiosity-driven stick pokery meant that I learned a large amount of Linux trivia fairly quickly. I built further on this while helping set up &lt;a href="https://www.sitewards.com/"&gt;Sitewards&lt;/a&gt; infrastructure tooling; a much more heterogeneous set of computers and providers, but with a standard approach emerging built on Docker and Kubernetes.&lt;/p&gt;

&lt;p&gt;The sum total of this experience means I’ve been heavily motivated to invest more in the technologies associated with Linux. One of the more interesting technologies I’ve become peripherally aware of during this tuition is the "Extended Berkeley Packet Filter", or "eBPF".&lt;/p&gt;

&lt;p&gt;I was introduced to this technology by Brendan Gregg’s excellent videos on &lt;a href="https://www.youtube.com/watch?v=bj3qdEDbCD4"&gt;performance analysis with eBPF&lt;/a&gt;. This was somewhat experimentally useful, but required recent kernels and various other oddities that weren’t consistent across our infrastructure. However, in parallel there was some interesting discussion &lt;a href="https://twitter.com/jessfraz/status/897819764915142656"&gt;about another eBPF project — Cilium&lt;/a&gt;. This project provides the underlying networking for Kubernetes, but does so in a way that appears to provide additional security and visibility that other network plugins do not; naively &lt;a href="https://istio.io/"&gt;similar to Istio&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Very recently I’ve had the opportunity to help another team with some scaling issues on a large, bespoke Kubernetes cluster. This cluster had a large number of services, and these services were being updated slowly due to performance issues with their Calico &amp;amp; kube-proxy &lt;code&gt;iptables&lt;/code&gt; implementations. That particular issue was addressed another way, but it led to an investigation into Calico and subsequently eBPF network tooling.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is BPF?
&lt;/h1&gt;

&lt;p&gt;The original "Berkeley Packet Filter" was derived from a paper written by Steve McCanne and Van Jacobson in 1992 for the Berkeley Software Distribution. Its purpose was to allow efficient capture of packets from the kernel to userland by compiling a program that filtered out packets that should not be copied across. This was subsequently employed in utilities such as &lt;code&gt;tcpdump&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In 2011 Eric Dumazet considerably improved the performance of this BPF filter by adding a Just In Time (JIT) compiler that compiled the BPF bytecode into an optimized instruction sequence. Later, in 2014, Alexei Starovoitov capitalised on this performant virtual machine to expose kernel tracing information more efficiently than would otherwise be possible, extending BPF beyond its initial packet filtering purpose. Jonathan Corbet noted and published this on LWN, hinting that eventually BPF programs may not only be used internally in the kernel but compiled in userland and loaded into the kernel. Later that same year Alexei started work on the &lt;code&gt;bpf()&lt;/code&gt; syscall, and the current notion of eBPF was kicked off.&lt;/p&gt;

&lt;p&gt;eBPF is now an extension of the BPF tooling, converted into a more general purpose virtual machine and used for roles well beyond its initial packet filtering purpose. It is a quirk of history that it is still referred to as the Berkeley Packet Filter, but the name has now stuck.&lt;/p&gt;

&lt;p&gt;Because eBPF is an extension of the original specification it is generally simply referred to as BPF. The older language is transpiled in the kernel to the newer eBPF before it is compiled, so the only BPF that runs in the kernel is the newer eBPF.&lt;/p&gt;

&lt;h1&gt;
  
  
  How does BPF work?
&lt;/h1&gt;

&lt;p&gt;A BPF program is a sequence of 64-bit instructions. These instructions are generally generated by an intermediary such as &lt;code&gt;tcpdump&lt;/code&gt; (libpcap):&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# See https://blog.cloudflare.com/bpf-the-forgotten-bytecode/&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;tcpdump &lt;span class="nt"&gt;-i&lt;/span&gt; wlp2s0 &lt;span class="s1"&gt;'ip and tcp'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;000&lt;span class="o"&gt;)&lt;/span&gt; ldh      &lt;span class="o"&gt;[&lt;/span&gt;12]                           &lt;span class="c"&gt;# Load a half-word (2 bytes) from the packet at offset 12.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;001&lt;span class="o"&gt;)&lt;/span&gt; jeq      &lt;span class="c"&gt;#0x800           jt 2    jf 5  # Check if the value is 0x0800, otherwise fail.&lt;/span&gt;
                                              &lt;span class="c"&gt;# This checks for the IP packet on top of an Ethernet frame.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;002&lt;span class="o"&gt;)&lt;/span&gt; ldb      &lt;span class="o"&gt;[&lt;/span&gt;23]                           &lt;span class="c"&gt;# Load byte from a packet at offset 23.&lt;/span&gt;
                                              &lt;span class="c"&gt;# That's the "protocol" field 9 bytes within an IP frame.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;003&lt;span class="o"&gt;)&lt;/span&gt; jeq      &lt;span class="c"&gt;#0x6             jt 4    jf 5  # Check if the value is 0x6, which is the TCP protocol number,&lt;/span&gt;
                                              &lt;span class="c"&gt;# otherwise fail.&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;004&lt;span class="o"&gt;)&lt;/span&gt; ret      &lt;span class="c"&gt;#262144                        # Return fail&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;005&lt;span class="o"&gt;)&lt;/span&gt; ret      &lt;span class="c"&gt;#0                             # Return success&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;But can also be written in a &lt;a href="https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md#bpf-c"&gt;limited subset of C&lt;/a&gt; and compiled.&lt;/p&gt;

&lt;p&gt;BPF programs have a certain set of guarantees enforced by a kernel verifier that make BPF programs safe to run in kernel land without risk of locking up or otherwise breaking the kernel. The verifier ensures that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The program does not loop&lt;/li&gt;
&lt;li&gt;There are no unreachable instructions&lt;/li&gt;
&lt;li&gt;Every register and stack state are valid&lt;/li&gt;
&lt;li&gt;Registers with uninitialized content are not read&lt;/li&gt;
&lt;li&gt;The program only accesses structures appropriate for its BPF program type&lt;/li&gt;
&lt;li&gt;(Optionally) pointer arithmetic is prevented&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The BCC tools repository contains a set of tools wrapping BPF programs that can do useful things. We can use one of those programs (&lt;code&gt;dns_matching.py&lt;/code&gt;) to demonstrate how &lt;code&gt;BPF&lt;/code&gt; is able to instrument the network:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the repository&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;git clone https://github.com/iovisor/bcc.git
Cloning into &lt;span class="s1"&gt;'bcc'&lt;/span&gt;...
Receiving objects: 100% &lt;span class="o"&gt;(&lt;/span&gt;17648/17648&lt;span class="o"&gt;)&lt;/span&gt;, 8.42 MiB | 1.21 MiB/s, &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
Resolving deltas: 100% &lt;span class="o"&gt;(&lt;/span&gt;11460/11460&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="k"&gt;done&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Pick the DNS matching&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;bcc/examples/networking/dns_matching

&lt;span class="c"&gt;# Run it!&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./dns_matching.py  &lt;span class="nt"&gt;--domains&lt;/span&gt; fishfingers.io

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./dns_matching.py  &lt;span class="nt"&gt;--domains&lt;/span&gt; fishfingers.io
&lt;span class="o"&gt;&amp;gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; Adding map entry:  fishfingers.io

Try to lookup some domain names using nslookup from another terminal.
For example:  nslookup foo.bar

BPF program will filter-in DNS packets which match with map entries.
Packets received by user space program will be printed here

Hit Ctrl+C to end...
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;In another window we can run:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;dig fishfingers.io
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Which will show in our first window:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;Hit Ctrl+C to end...

&lt;span class="o"&gt;[&lt;/span&gt;&amp;lt;DNS Question: &lt;span class="s1"&gt;'fishfingers.io.'&lt;/span&gt; &lt;span class="nv"&gt;qtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;A &lt;span class="nv"&gt;qclass&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;IN&amp;gt;]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The domain is nonsense, but the question is still posed. Looking at the &lt;a href="https://github.com/iovisor/bcc/blob/master/examples/networking/dns_matching/dns_matching.c"&gt;source file&lt;/a&gt; we can see the eBPF program written in C that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Checks the type of Ethernet frame&lt;/li&gt;
&lt;li&gt; Checks to see if it’s UDP&lt;/li&gt;
&lt;li&gt; Checks to see if it’s on port 53&lt;/li&gt;
&lt;li&gt; Checks if the DNS name supplied is within the payload&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That’s it! Our eBPF program has successfully run in the kernel, and matching packets have been copied out to the userland Python program where they’re subsequently printed.&lt;/p&gt;

&lt;p&gt;While this example was associated with the network kernel subsystem (BPF_PROG_TYPE_SOCKET_FILTER), there are a whole series of kernel entry points that can execute these eBPF programs. At the time of writing there are a total of 22 program types; unfortunately, they are currently poorly documented.&lt;/p&gt;

&lt;h1&gt;
  
  
  eBPF in the wild
&lt;/h1&gt;

&lt;p&gt;To understand where eBPF sits in the infrastructure ecosystem it’s worth looking at where other companies have chosen to use it over other, more conventional ways of solving the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Firewall
&lt;/h2&gt;

&lt;p&gt;The de facto implementation for a Linux firewall uses &lt;code&gt;iptables&lt;/code&gt; as its underlying enforcement mechanism. &lt;code&gt;iptables&lt;/code&gt; allows configuring a set of netfilter tables that manipulates packets in a number of ways. For example, the following rule drops all connections from the IP address 10.10.10.10:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;iptables &lt;span class="nt"&gt;-A&lt;/span&gt; INPUT &lt;span class="nt"&gt;-s&lt;/span&gt; 10.10.10.10/32 &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;code&gt;iptables&lt;/code&gt; can be used for a number of packet manipulation tasks such as Network Address Translation (NAT) or packet forwarding. However &lt;code&gt;iptables&lt;/code&gt; runs into a couple of significant problems:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;code&gt;iptables&lt;/code&gt; &lt;a href="https://cilium.io/blog/2018/11/20/fb-bpf-firewall/"&gt;rules are matched sequentially&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;code&gt;iptables&lt;/code&gt; updates must be made by &lt;a href="https://www.youtube.com/watch?v=4-pawkiazEg"&gt;recreating and updating all rules in a single transaction&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These two properties mean that under large, diverse traffic conditions (such as those experienced by any sufficiently large service — Facebook) or in a system that has a large number of changes to &lt;code&gt;iptables&lt;/code&gt; rules there will be an unacceptable performance overhead to running &lt;code&gt;iptables&lt;/code&gt; which can either degrade or take offline an entire service.&lt;/p&gt;

&lt;p&gt;There are already improvements to this subsystem in the Linux kernel by way of &lt;code&gt;nftables&lt;/code&gt;. This system is designed to improve &lt;code&gt;iptables&lt;/code&gt; and is architecturally similar to BPF in that it implements a virtual machine in the kernel. &lt;code&gt;nftables&lt;/code&gt; is a little older and better supported in existing Linux distributions, and in the testing distributions has even begun to entirely replace &lt;code&gt;iptables&lt;/code&gt;. However with the advent and optimizations of BPF &lt;code&gt;nftables&lt;/code&gt; is perhaps a technology less worth investing in.&lt;/p&gt;

&lt;p&gt;That leaves us with BPF. BPF has a couple of unique advantages over &lt;code&gt;iptables&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s implemented as an instruction set in a virtual machine, and can be heavily optimized&lt;/li&gt;
&lt;li&gt;It &lt;a href="https://cilium.io/blog/2018/11/20/fb-bpf-firewall/"&gt;is matched against the "closest" rule&lt;/a&gt;, rather than by iterating over the entire rule set.&lt;/li&gt;
&lt;li&gt;It can introspect specific packet data when making decisions as to whether to drop&lt;/li&gt;
&lt;li&gt;It can be compiled and run in the Linux "Express Data Path" (or XDP); the earliest possible point to interact with network traffic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These advantages can yield some staggering performance benefits. In CloudFlare’s (artificial) tests BPF with XDP was approximately &lt;a href="https://blog.cloudflare.com/how-to-drop-10-million-packets/"&gt;5x better at dropping packets&lt;/a&gt; than the next best solution (tc). Facebook saw &lt;a href="https://cilium.io/blog/2018/11/20/fb-bpf-firewall/"&gt;a much more predictable CPU usage&lt;/a&gt; with the use of BPF filtering.&lt;/p&gt;

&lt;p&gt;In addition to the performance benefits some applications use BPF in combination with userland proxies (such as Envoy) to &lt;a href="http://docs.cilium.io/en/stable/policy/language/#layer-7-examples"&gt;allow or deny the application protocols HTTP, gRPC, DNS or Kafka&lt;/a&gt;. This sort of application specific filtering is only otherwise seen in service meshes, such as Istio or Linkerd which incur more of a performance penalty than the BPF based solution.&lt;/p&gt;

&lt;p&gt;So, packet filtering based on BPF is both more flexible and (with XDP) more efficient than the existing &lt;code&gt;iptables&lt;/code&gt; solution. While &lt;code&gt;tc&lt;/code&gt; and &lt;code&gt;nftables&lt;/code&gt; may provide similar performance now or in the future, BPF’s combination of a large set of use cases and efficiency means it’s perhaps a better place to invest.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kernel tracing &amp;amp; instrumentation
&lt;/h2&gt;

&lt;p&gt;After running Linux in production for some period of time we invariably run into issues. In the past I’ve had to debug:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;iptables&lt;/code&gt; performance problems&lt;/li&gt;
&lt;li&gt;Workload CPU performance&lt;/li&gt;
&lt;li&gt;Software not loading configuration&lt;/li&gt;
&lt;li&gt;Software becoming stalled&lt;/li&gt;
&lt;li&gt;Systems being "slow" for no apparent reason&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In those cases we need to dig further into what’s happening between kernel land and userland, and poke at why the system is behaving the way it is.&lt;/p&gt;

&lt;p&gt;There are an abundance of tools for this task. Brendan Gregg has an &lt;a href="http://www.brendangregg.com/Perf/linux_perf_tools_full.svg"&gt;excellent image showing the many tools and what they’re useful for&lt;/a&gt;. Of those, I’m familiar with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;strace&lt;/code&gt; / &lt;code&gt;ltrace&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;&lt;code&gt;top&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;sysdig&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;iotop&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;df&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;perf&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tools each have their own tradeoffs, and an in-depth analysis of them is beyond the scope of this article. However, the most useful tool is perhaps &lt;code&gt;strace&lt;/code&gt;. &lt;code&gt;strace&lt;/code&gt; provides visibility into which system calls (calls to the Linux kernel) a process is using. The following example shows what file-related system calls the command &lt;code&gt;cat /tmp/foo&lt;/code&gt; makes:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;strace &lt;span class="nt"&gt;-e&lt;/span&gt; file &lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/foo
execve&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"/bin/cat"&lt;/span&gt;, &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"cat"&lt;/span&gt;, &lt;span class="s2"&gt;"/tmp/foo"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, 0x7fffc2c8c308 /&lt;span class="k"&gt;*&lt;/span&gt; 56 vars &lt;span class="k"&gt;*&lt;/span&gt;/&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 0
access&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"/etc/ld.so.preload"&lt;/span&gt;, R_OK&lt;span class="o"&gt;)&lt;/span&gt;      &lt;span class="o"&gt;=&lt;/span&gt; 0
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/etc/ld.so.preload"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/lib/x86_64-linux-gnu/libsnoopy.so"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/etc/ld.so.cache"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/lib/x86_64-linux-gnu/libc.so.6"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/lib/x86_64-linux-gnu/libpthread.so.0"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/lib/x86_64-linux-gnu/libdl.so.2"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/usr/lib/locale/locale-archive"&lt;/span&gt;, O_RDONLY|O_CLOEXEC&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 3
openat&lt;span class="o"&gt;(&lt;/span&gt;AT_FDCWD, &lt;span class="s2"&gt;"/tmp/foo"&lt;/span&gt;, O_RDONLY&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; 3
hi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This allows us to debug a range of issues, including configuration not being read, what a process is sending or receiving over a network, and what processes a given process spawns. However, it comes at a cost: &lt;code&gt;strace&lt;/code&gt; will &lt;a href="http://www.brendangregg.com/blog/2014-05-11/strace-wow-much-syscall.html"&gt;significantly slow down that process&lt;/a&gt;. Suddenly introducing large latency into a system will annoy users, and can cause requests to block and stack up, eventually breaking the service. Accordingly, it needs to be used with caution.&lt;/p&gt;

&lt;p&gt;However, a much more efficient way to trace these system calls is with BPF. This is made easy with &lt;a href="https://github.com/iovisor/bcc"&gt;the &lt;code&gt;bcc&lt;/code&gt; tools git repository&lt;/a&gt;; specifically, the &lt;code&gt;trace.py&lt;/code&gt; tool. The tool has a slightly different interface from &lt;code&gt;strace&lt;/code&gt;, perhaps because BPF programs are compiled and executed based on events in the kernel rather than interrupting a process at the kernel interface. However, the behaviour above can be replicated as follows:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt; &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./trace.py &lt;span class="s1"&gt;'do_sys_open "%s", arg2'&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'cat'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And then in another window:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; /tmp/foo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The first window will then yield:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /etc/ld.so.preload
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /lib/x86_64-linux-gnu/libsnoopy.so
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /etc/ld.so.cache
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /lib/x86_64-linux-gnu/libc.so.6
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /lib/x86_64-linux-gnu/libpthread.so.0
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /lib/x86_64-linux-gnu/libdl.so.2
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /usr/lib/locale/locale-archive
13785   13785   &lt;span class="nb"&gt;cat             &lt;/span&gt;do_sys_open      /tmp/foo
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This fairly accurately replicates the functionality of &lt;code&gt;strace&lt;/code&gt;: each of the files listed earlier appears in the &lt;code&gt;trace.py&lt;/code&gt; output just as it did in the &lt;code&gt;strace&lt;/code&gt; output.&lt;/p&gt;

&lt;p&gt;BPF is not limited to &lt;code&gt;strace&lt;/code&gt;-like tools. It can be used to introspect a whole series of both user and kernel level problems, and has been packaged into user friendly tools &lt;a href="https://github.com/iovisor/bcc"&gt;in the BCC repository&lt;/a&gt;. Additionally, BPF &lt;a href="https://sysdig.com/blog/sysdig-and-falco-now-powered-by-ebpf/"&gt;now powers Sysdig&lt;/a&gt;, the tool used for spelunking into a machine to determine its behaviour by analysing system calls. There is even some work to export the result of BPF programs &lt;a href="https://github.com/cloudflare/ebpf_exporter"&gt;in the Prometheus format&lt;/a&gt; for aggregation as time series data.&lt;/p&gt;

&lt;p&gt;Because of its high performance, flexibility and good support in recent Linux kernels, BPF forms the foundation of a new set of more flexible, performant systems introspection tools. BPF is also simpler than the kernel hacking that would otherwise be required for this sort of introspection, which may democratize the design of such tools and lead to more innovation in this area.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network visibility
&lt;/h2&gt;

&lt;p&gt;Given the history of BPF in packet filtering, a logical next step is collecting statistics from the network for later analysis.&lt;/p&gt;

&lt;p&gt;There are already a number of network statistics exposed via the &lt;code&gt;/proc&lt;/code&gt; and &lt;code&gt;/sys&lt;/code&gt; filesystems that can be read with little overhead. The &lt;a href="https://github.com/prometheus/node_exporter"&gt;Prometheus "node exporter"&lt;/a&gt; reads:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/proc/sys/net/netfilter/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/ip_vs&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/ip_vs_stats&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/sys/class/net/&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/netstat&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/sockstat&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/tcp&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/proc/net/tcp6&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
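
&lt;p&gt;These files can be read ad hoc with standard tools and negligible overhead. A minimal sketch, assuming a Linux host:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Socket counts as tracked by the kernel
grep '^sockets:' /proc/net/sockstat

# Number of TCP sockets the kernel currently tracks
# (the first line of /proc/net/tcp is a header)
tail -n +2 /proc/net/tcp | wc -l
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;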

&lt;p&gt;However, as much as this exposes, there are still properties of connections that can’t be read directly from &lt;code&gt;/proc&lt;/code&gt; or via the set of CLI tools that also read from it (&lt;code&gt;ss&lt;/code&gt;, &lt;code&gt;netstat&lt;/code&gt; etc.). One such case was discussed by &lt;a href="https://twitter.com/b0rk/status/765666624968003584"&gt;Julia Evans and Brendan Gregg on Twitter&lt;/a&gt;: statistics on the lengths of TCP connections on a given port.&lt;/p&gt;

&lt;p&gt;This is useful for debugging what a system is connected to, and how long it spends in that connection. We can in turn use this to determine who our machine is talking to, and whether it’s getting stuck on any given connection.&lt;/p&gt;

&lt;p&gt;Brendan Gregg has a post &lt;a href="http://www.brendangregg.com/blog/2016-11-30/linux-bcc-tcplife.html"&gt;that describes how this is implemented in detail&lt;/a&gt;, but to summarise: it listens for &lt;code&gt;tcp_set_state()&lt;/code&gt; and queries the properties of the connection from &lt;code&gt;struct tcp_info&lt;/code&gt;. There are various limitations to this approach, but it seems to work pretty well.&lt;/p&gt;

&lt;p&gt;The result has been committed to the bcc repository and looks like:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Trace remote port 443&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;sudo&lt;/span&gt; ./tcplife.py &lt;span class="nt"&gt;-D&lt;/span&gt; 443
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then, in another window:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;curl https://www.andrewhowden.com/ &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; /dev/null
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The first window then shows:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;PID   COMM       LADDR           LPORT RADDR           RPORT TX_KB RX_KB MS
7362  curl       10.1.1.247      43074 34.76.108.124   443       0    16 3369.32
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This indicates that a process with ID 7362 connected to 34.76.108.124 over port 443, and took 3369.32ms to complete its transfer (Australian internet is a bit slow in some areas).&lt;/p&gt;

&lt;p&gt;These kinds of ad-hoc debugging statistics are essentially impossible to gather any other way. Additionally, it should be possible (if desired) to express these statistics in such a way that the Prometheus exporter can load and export them for collection, making the network essentially arbitrarily introspectable.&lt;/p&gt;

&lt;h1&gt;
  
  
  Using BPF
&lt;/h1&gt;

&lt;p&gt;Given the above, BPF seems like a compelling technology worth investing time in learning. However, there are some difficulties in getting BPF to work properly:&lt;/p&gt;

&lt;h2&gt;
  
  
  BPF is only in "recent" kernels
&lt;/h2&gt;

&lt;p&gt;BPF is an area undergoing rapid development in the Linux kernel. Features may not be complete, or may not be present at all; tools may not work as expected, and their failure conditions are not well documented. If the kernels used in production are fairly modern then BPF may provide considerable utility. If not, it’s perhaps worth waiting until development in this area slows down and an LTS kernel with good BPF compatibility is released.&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s hard to debug
&lt;/h2&gt;

&lt;p&gt;BPF is fairly opaque at the moment. While there are bits of documentation here and there and one can go and read the kernel source, it’s not as easy to debug as (for example) &lt;code&gt;iptables&lt;/code&gt; or other system tools. It may be difficult to debug network issues created by improperly constructed &lt;code&gt;bpf&lt;/code&gt; programs. The advice here is the same as for other new or bespoke technologies: ensure that multiple team members understand and can debug it, and if they can’t, or those people are not available, pick another technology.&lt;/p&gt;

&lt;h2&gt;
  
  
  It’s an implementation detail
&lt;/h2&gt;

&lt;p&gt;My suspicion is that the vast majority of our interaction with BPF will not be with programs of our own design. BPF is useful in the design of analysis tools, but writing it directly is perhaps too large a burden to place on the shoulders of systems administrators. Accordingly, to start reaping the benefits of BPF it’s worth instead investing in tools that use this technology. These include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cilium&lt;/li&gt;
&lt;li&gt;BCC Tools&lt;/li&gt;
&lt;li&gt;&lt;code&gt;bpftrace&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Sysdig&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More tools will arrive in the future, though those are the only ones I would currently invest in.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;BPF is an old technology that has had new life breathed into it with the extended instruction set, the implementation of a JIT and the ability to execute BPF at various points in the Linux kernel. It provides a way to export information about, or modify, Linux kernel behaviour at runtime without needing to reboot or reload the kernel, including for transient systems introspection. BPF probably has its most immediate ramifications in network performance, as networks need to handle a truly bizarre level of both traffic and complexity, and BPF provides some concrete solutions to these problems. Accordingly, networking is a good place to start understanding BPF, particularly instead of investing in &lt;code&gt;nftables&lt;/code&gt; or &lt;code&gt;iptables&lt;/code&gt;. BPF additionally provides some compelling insights into both system and network visibility that are otherwise difficult or impossible to achieve, though this area is somewhat more nascent than the network implementations.&lt;/p&gt;

&lt;p&gt;TL;DR: BPF is pretty damned cool.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.iovisor.org/"&gt;IOVisor project: a bunch of good eBPF and XDP reading and tools&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cilium.io/"&gt;API aware networking and security, powered by eBPF and XDP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cilium.io/blog/2018/04/17/why-is-the-kernel-community-replacing-iptables/"&gt;Why is the kernel community replacing IPTables with eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.redhat.com/blog/2018/12/06/achieving-high-performance-low-latency-networking-with-xdp-part-1/"&gt;Achieving high performance low latency networking with XDP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cilium.io/blog/2018/11/20/fb-bpf-firewall/"&gt;Inside Facebook’s eBPF Firewall&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cilium.io/en/v1.4/architecture/"&gt;Cilium Architecture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blogs.igalia.com/dpino/2019/01/07/a-brief-introduction-to-xdp-and-ebpf/"&gt;A brief introduction to XDP and eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ferrisellis.com/posts/ebpf_past_present_future/"&gt;eBPF: Past, present and future&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/708087/"&gt;Debating the value of XDP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.redhat.com/blog/2018/12/03/network-debugging-with-ebpf/"&gt;Network debugging with eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/437981/"&gt;A JIT for packet filters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/324989/"&gt;NFTables: A new packet filtering engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.cilium.io/en/v1.4/bpf/"&gt;eBPF and XDP: A reference guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blogs.igalia.com/dpino/2019/01/02/build-a-kernel/"&gt;How to build a kernel with XDP support&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cilium.io/blog/2018/04/24/cilium-security-for-age-of-microservices/"&gt;Cilium: rethinking Linux networking and security in the age of Microservices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cilium.io/blog/2019/02/12/cilium-14/"&gt;Cilium 1.4 release notes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/719850/"&gt;New approaches to network fast paths&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/740157/"&gt;A thorough introduction to eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sysdig.com/blog/sysdig-and-falco-now-powered-by-ebpf/"&gt;Sysdig: Now powered by eBPF&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html"&gt;Linux eBPF Superpowers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/437981/"&gt;BPF: the universal in-kernel virtual machine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.tcpdump.org/papers/bpf-usenix93.pdf"&gt;The BSD packet filter: A new architecture for User Level packet capture&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/iovisor/bpf-docs/blob/master/eBPF.md"&gt;Unofficial eBPF spec&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kernel.org/doc/Documentation/networking/filter.txt"&gt;Linux Socket Filtering aka Berkeley Packet Filter (BPF)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/bpf-the-forgotten-bytecode/"&gt;BPF: The forgotten bytecode&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://man7.org/linux/man-pages/man8/tc-bpf.8.html"&gt;TC BPF man page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/l4drop-xdp-ebpf-based-ddos-mitigations/"&gt;L4Drop: XDP DDoS Mitigation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.howtogeek.com/177621/the-beginners-guide-to-iptables-the-linux-firewall/"&gt;The beginners guide to iptables and the Linux firewall&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://en.wikipedia.org/wiki/Nftables"&gt;Wikipedia: NFTables&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/how-to-drop-10-million-packets/"&gt;How to drop 10 million packets&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/introducing-the-p0f-bpf-compiler/"&gt;Introducing the p0f compiler&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.cloudflare.com/l4drop-xdp-ebpf-based-ddos-mitigations/"&gt;L4Drop: XDP DDoS mitigations&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


</description>
      <category>ebpf</category>
      <category>linux</category>
      <category>networking</category>
      <category>deepdive</category>
    </item>
    <item>
      <title>Architecting a software system for malleability</title>
      <dc:creator>Andrew Howden</dc:creator>
      <pubDate>Wed, 20 Mar 2019 05:16:06 +0000</pubDate>
      <link>https://dev.to/andrewhowdencom/architecting-a-software-system-for-malleability-2p7</link>
      <guid>https://dev.to/andrewhowdencom/architecting-a-software-system-for-malleability-2p7</guid>
      <description>

&lt;p&gt;The past few years of software development have given me this one beautiful insight:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I can’t predict the future&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To illustrate this point on a personal level, it wasn’t even my plan to be a software developer. My undergraduate studies were in sports physiology, and the intention was to follow that up with sports medicine. However, through the various twists of fate inherent in life that was not to be, and I wound up helping build and ship eCommerce stores.&lt;/p&gt;

&lt;p&gt;The vagaries of life do not extend only to me, however. They’re an inherent part of life. The psychologist Dan Gilbert says in his talk “The psychology of your future self”:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We asked people how much they expected to change over the next 10 years, and also how much they had changed over the last 10 years, and what we found, well, … people underestimate how much their personalities will change in the next decade.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, I didn’t know I would be here 6 years ago, and based on Gilbert’s assessment we don’t know who we’ll be in another 10 years. It follows that it’s extremely difficult to see where our technology should develop over the next 10 years. We can get a sense of the rate of change by reviewing the last 10 years:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Apple App Store, Chrome, Android and Bitcoin were released in 2008&lt;/li&gt;
&lt;li&gt;Maps with GPS navigation reached Android in 2009&lt;/li&gt;
&lt;li&gt;Both the iPad and car2go (short term, instant car rentals) were released in 2010&lt;/li&gt;
&lt;li&gt;Google+ was launched and Adobe begun to sunset Flash in 2011&lt;/li&gt;
&lt;li&gt;Windows 8, 4k TV, Windows phone, Curiosity on Mars, Google Glass, 802.11ac and Space X flying to the ISS were all in 2012&lt;/li&gt;
&lt;li&gt;Oculus Rift, the Smart Watch and Touch ID landed in 2013&lt;/li&gt;
&lt;li&gt;Self driving cars began to emerge in 2014&lt;/li&gt;
&lt;li&gt;Apple Pay and Project Loon were released in 2015&lt;/li&gt;
&lt;li&gt;IOT began to appear in earnest in 2016&lt;/li&gt;
&lt;li&gt;Self driving trucks, reinforcement learning (AlphaGo) and the smart speaker made great strides in 2017&lt;/li&gt;
&lt;li&gt;Cheap neural networks (Tensor Flow) executing on phones, bluetooth headphones that automatically translate and the GDPR were part of 2018&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Which brings us to our current year of 2019. Each of those technologies had an impact on the market, shifting the balance of power in various industries dramatically and providing new opportunities for those lucky enough to find the talent, capital and drive to take advantage of them.&lt;/p&gt;

&lt;p&gt;The lesson to draw from these changes is that the world changes at a far higher rate than one would naively imagine. When designing our software systems we should factor in this high rate of change, so that we do not drive ourselves or our project partners into financial ruin attempting to innovate around the next business offering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Platforms that have successfully adapted
&lt;/h2&gt;

&lt;p&gt;It stands to reason that if we want to design our own software to be maximally adaptable, we should look at others that have successfully adapted to industry changes in the past.&lt;/p&gt;

&lt;p&gt;Apple, Cisco and Intel are all hardware (in addition to software) companies, so for our purposes we’ll dismiss them as targets. Google, Microsoft, Facebook and Adobe are all primarily software companies, however, so they can serve as good lessons in how to build systems that are well structured over time. Google and Facebook are famously “internet” heavy companies, but both Adobe and Microsoft have pivoted in recent years to be much more internet driven. Microsoft have famously stated Windows 10 will be their last version of Windows, and Adobe is making significant moves into internet driven business with “experience cloud”.&lt;/p&gt;

&lt;p&gt;So, these companies are moving towards software that is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delivered primarily via the internet&lt;/li&gt;
&lt;li&gt;Developed and delivered to users in increments, and adapted based on user feedback&lt;/li&gt;
&lt;li&gt;Sold via an “ongoing revenue” model, be that subscriptions or advertising&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The thing that these companies have in common is that their products are all designed around software that embraces continual change, in any arbitrary direction.&lt;/p&gt;

&lt;h2&gt;
  
  
  Software design requirements
&lt;/h2&gt;

&lt;p&gt;To understand how to design software it’s first worth unpacking why we’re building software in the first place. Generally speaking, I build software to make a computer solve a problem in a reliable way, and to derive some sort of useful work out of it. Programs can be as simple as:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Get a list of unique commands run on this machine
$ cat /var/log/auth.log | cut -d':' -f11 | sort | uniq
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To Magento 1’s behemoth 1.7 million lines of code:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ sloccount clean-magento-ee
Total Physical Source Lines of Code (SLOC) = 1,730,997
Development Effort Estimate 502.62 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Regardless, software programs exist for some human purpose; to take some human input and return some human output (at some point or other).&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing for reasonability
&lt;/h3&gt;

&lt;p&gt;Software’s utility is a function of its predictability; of our understanding of how we can use it to accomplish work. Perhaps the best example of this is the Unix utility cat:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cat foo
bar
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This program takes the contents of the file “foo” and prints them to screen, showing “bar”. The particularly remarkable part about cat is not this behaviour, but rather:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That it was initially designed in 1971&lt;/li&gt;
&lt;li&gt;That it hasn’t changed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is the very essence of a predictable program. There is a whole swathe of Unix programs that follow this trend, enunciated by Peter H. Salus as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Write programs that do one thing and do it well.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The wisdom of this minimalistic approach is difficult to overstate. Programs that are easily predictable and follow a “standard” approach have some distinct advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The time to understand and fit them in to our architecture is minimal&lt;/li&gt;
&lt;li&gt;Their potential use cases are large&lt;/li&gt;
&lt;li&gt;Their interoperability with other systems is large&lt;/li&gt;
&lt;/ul&gt;
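
&lt;p&gt;That interoperability is easy to demonstrate: small, predictable programs compose into new tools with a single pipeline. As a minimal sketch (the input here is hypothetical), counting how often each line appears:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# sort groups duplicate lines together, uniq -c counts each group
# and sort -rn orders the groups by frequency
printf 'foo\nbar\nfoo\n' | sort | uniq -c | sort -rn
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;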

&lt;p&gt;Additionally, keeping the feature set limited makes the software much simpler to maintain, especially while retaining knowledge of the use cases it serves: both those initially designed for and those accrued over time.&lt;/p&gt;

&lt;p&gt;This dramatically reduces both the cost of maintaining one piece of software, and the likelihood that this particular piece of software will change over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing for interrogability
&lt;/h3&gt;

&lt;p&gt;Generally speaking we do not design software just for ourselves, but additionally to solve problems on behalf of others (usually for some monetary compensation). This creates a disconnect between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How we understand the problem, and design the software to be used&lt;/li&gt;
&lt;li&gt;How the software is actually used&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;John Allspaw refers to this as “above/below the line”, in which each user, developer and other stakeholder has a different conceptual model of how the software “works”. That model is only grounded in “reality” by interrogating the software to ensure it’s actually functioning as initially designed. To make design decisions as to how the software should be further reduced, restructured or replaced, we need to know how the software is being used.&lt;/p&gt;

&lt;p&gt;We can start this process by interrogating cat. cat is written in C and runs on Unix. Unix (particularly Linux, in this example) exposes a whole set of tools for inspecting both cat and other applications, such as strace, ltrace and perf, with additional tools like sysdig. However, while these tools give us an extremely good idea of what the application is doing in specific invocations, they are cost prohibitive to run the entire time. Instead, we need to move to less granular tools. Unfortunately, this comes with a tradeoff: we need to guess ahead of time what we need to instrument.&lt;/p&gt;

&lt;p&gt;There are three broad ways of doing so:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logs&lt;/li&gt;
&lt;li&gt;Metrics&lt;/li&gt;
&lt;li&gt;Traces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without going too far into the detail, an application should be designed such that it exposes the detail required to understand how it’s working. This is useful both for understanding when the application is not working correctly and for understanding how it’s used under normal conditions.&lt;/p&gt;

&lt;p&gt;When choosing how to instrument an application, the most useful property is perhaps being able to ask questions of the software: to interrogate it. Logs are perhaps the simplest way to do this, allowing us to check internal program state at a later point when an issue is reported. But time series data is a very close second, and allows querying for application behaviour over time. This allows making judgements about how people are using the app, rather than just snapshotting application internal state. The Prometheus documentation explains how to instrument an application to maximise its interrogability.&lt;/p&gt;
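
&lt;p&gt;As a trivial sketch of that “ask questions later” property (the log format here is hypothetical), a question such as “how many errors has the application seen?” can be answered from logs long after the fact:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;# Count log lines recording an error; prints 2 for this input
printf 'level=info msg=ok\nlevel=error msg=boom\nlevel=error msg=bust\n' | grep -c 'level=error'
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;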

&lt;p&gt;By understanding how it’s used we can modify our program to make those use cases easier or more efficient. We can additionally drop functionality that goes unused over time, maintaining program simplicity and reducing the cost of maintenance and risk.&lt;/p&gt;

&lt;p&gt;As software is used more frequently it will be better understood by its users. That is also where software engineers should invest the most time: ensuring the software is designed in such a way that it is easy for users to understand and reason about. Designing for simplicity will further increase uptake, forming a virtuous cycle until an “optimal simplicity” level is reached.&lt;/p&gt;

&lt;h3&gt;
  
  
  Design with a focus on solving the user’s problem
&lt;/h3&gt;

&lt;p&gt;The process of shipping software is a complex one, involving:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Business process modelling&lt;/li&gt;
&lt;li&gt;UX Design&lt;/li&gt;
&lt;li&gt;General architecture design&lt;/li&gt;
&lt;li&gt;Software component design&lt;/li&gt;
&lt;li&gt;Software infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these disciplines is a complex one that involves a staggering amount of research, discipline and effort over time. Accordingly it’s more likely than not that each component will have specialists, each of whom seek to do the best&lt;br&gt;
job they possibly can.&lt;/p&gt;

&lt;p&gt;It’s important while designing and implementing this system that the goal remains solving a user’s problem. One can get lost in the minutiae of one’s own discipline, creating a work of art in its own right — at the expense of the system as a whole, and the user with their problem.&lt;/p&gt;

&lt;p&gt;To solve the user’s problem each stakeholder needs to subordinate their own ideal solution in favour of one that favours the customer’s happiness. To retain this focus, the design needs to put the customer at the forefront of all decisions, with each decision justified by how it helps the customer solve their problem.&lt;/p&gt;

&lt;p&gt;By doing so, while each component of the system may be even more complex or less elegant for those who have built it, the vast majority of users will experience a simpler, easier to understand system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing unsurprising software
&lt;/h3&gt;

&lt;p&gt;Software that is “surprising” is software that is unpredictable. Unpredictable software is harder for users to make use of, in turn driving usage of the application in unpredictable ways. This unpredictable usage means either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A high amount of refactoring to make the unusual mechanism the standard use case&lt;/li&gt;
&lt;li&gt;A high amount of refactoring to shift users to the standard model&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Regardless, quite a bit needs to be changed. Accordingly, the goal while developing software should be for it to be the “least surprising”. This is captured as the “principle of least astonishment”:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“People are part of the system. The design should match the user’s experience, expectations, and mental models.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Unfortunately, what users find surprising is context specific. While users of an alarm clock might expect that once they turn off an alarm it goes away until the next occurrence, they might expect hospital monitors to switch alarms back on by themselves after a period of time. Accordingly, designing software that does “what the user expects” requires an in-depth understanding of that user, and the context in which they’re using the software.&lt;/p&gt;

&lt;p&gt;That understanding is surprisingly hard to come by; the study of software development is complex enough that it precludes deep knowledge of other fields. However, one can take two strategies to help design software in an unsurprising way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Design software after an already established pattern. Design hospital software like other hospital software, and alarm clocks like other alarm clocks.&lt;/li&gt;
&lt;li&gt;Work closely with users, soliciting and integrating their feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even the most intractable problems can be made simpler and easier for users to understand with a deliberate design of software to match their conceptual models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing software on balance
&lt;/h3&gt;

&lt;p&gt;Given the above requirements perhaps the hardest thing to do is to strike a balance across them, and design the software for simplicity relative to each designer or consumer of that project.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;cat&lt;/code&gt;, for example, may be simple to me as a developer but it is likely not simple for my grandmother.&lt;/p&gt;

&lt;p&gt;Each stakeholder has a different model of the software:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users model it in terms of the problems they’re trying to solve&lt;/li&gt;
&lt;li&gt;The UX team model and optimize for users’ usage of the application&lt;/li&gt;
&lt;li&gt;The business logic team attempt to model the user in the software&lt;/li&gt;
&lt;li&gt;The business owners model it in terms of a return on investment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it hard for the software architect to be able to make the software simple relative to all users. However, there are ways in which it’s possible to determine how to evolve the software to suit the stakeholders over time.&lt;/p&gt;

&lt;p&gt;As the software evolves and the stakeholders learn more about each other it will become clear that there are commonalities in how those users see the software. For example, in the case of an eCommerce store the user, UX, business&lt;br&gt;
logic and business owners all have approximately the same notion of what an “order” or “shipment” needs, though with varying degrees of detail.&lt;/p&gt;
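
&lt;p&gt;One way to keep those models aligned is to define a single shared domain object rather than one “view” per team. A minimal sketch in Python, where the fields are illustrative assumptions for an eCommerce store rather than a prescribed schema:&lt;/p&gt;

```python
from dataclasses import dataclass, field

# One shared "Order" model that every stakeholder reasons about,
# rather than a separate representation per team. The fields are
# illustrative assumptions, not a prescribed schema.
@dataclass
class Order:
    order_id: str
    items: list = field(default_factory=list)  # what the user bought
    total_cents: int = 0                       # what the owner tracks (ROI)
    status: str = "pending"                    # what UX surfaces to the user

    def ship(self):
        # Business logic operates on the same model, keeping the
        # number of "views", and hence complexity, low.
        self.status = "shipped"

order = Order("A-1001", items=["16 ounce gloves"], total_cents=8999)
order.ship()
```

&lt;p&gt;Because each team reads and writes the same model, with varying degrees of detail, the shared vocabulary of “order” and “shipment” stays consistent across the business.&lt;/p&gt;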

&lt;p&gt;By writing the software to deliberately communicate its own nature to all stakeholders, writing supporting documentation to clearly explain the software where it is incapable of explaining itself, and minimising the number of “views” the software has, the software itself can remain simple, and all stakeholders share a similar mental model of it.&lt;/p&gt;

&lt;p&gt;Once these patterns are established, continue reusing them, reinforcing a consistent way of reasoning about the software.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding what we’re designing
&lt;/h2&gt;

&lt;p&gt;To understand what we’re designing, we first need to think in terms of the problem we’re solving.&lt;/p&gt;

&lt;h3&gt;
  
  
  Boxing
&lt;/h3&gt;

&lt;p&gt;In a past life I spent considerable time training to be a boxer (more specifically, a Thai boxer). Though it was only a hobby, it was an activity that I fundamentally enjoyed. It additionally necessitated the purchase of some equipment. To participate, I would need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1x 16 ounce boxing gloves&lt;/li&gt;
&lt;li&gt;2x mouth guards&lt;/li&gt;
&lt;li&gt;4x singlets, shorts &amp;amp; wraps&lt;/li&gt;
&lt;li&gt;1x groin guard&lt;/li&gt;
&lt;li&gt;1x shin guards (heavy)&lt;/li&gt;
&lt;li&gt;1x shin guards (light)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The software journey we’ll consider, then, is the one that hopes to connect me with the equipment I need to continue boxing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modelling the buying and usage journey
&lt;/h3&gt;

&lt;p&gt;For the above equipment there is little value in it being particularly well styled or otherwise distinctive — there is little fashion in the world of “boxing equipment”; they’re essentially commodity goods. Above all I would prize equipment that is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Functional&lt;/li&gt;
&lt;li&gt;Comfortable&lt;/li&gt;
&lt;li&gt;Long-lasting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a buyer of this equipment, I’m likely to undertake the following steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Discover the need for this equipment as I join (or rejoin) a boxing gymnasium&lt;/li&gt;
&lt;li&gt;Discuss with my peers what a set of reliable equipment would be. If it’s available on site, I would likely simply purchase it there.&lt;/li&gt;
&lt;li&gt;Further research what equipment might be available, and look for reviews that help me determine what brand of equipment I would like&lt;/li&gt;
&lt;li&gt;Make the purchase of this equipment, and use it for a period while training&lt;/li&gt;
&lt;li&gt;Purchase either the same or new equipment once it had been worn beyond its utility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of those steps has some reflection in software, from joining the boxing club to evaluating the equipment for replacement after a period of use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Designing the software itself
&lt;/h2&gt;

&lt;p&gt;Given our understanding of the principles required to design resilient software, let’s try and help our boxer find the equipment they need.&lt;/p&gt;

&lt;h3&gt;
  
  
  Launch and Iterate
&lt;/h3&gt;

&lt;p&gt;As we’ve established, we’re poor predictors of the future. So to understand our problem we need to start solving it.&lt;/p&gt;

&lt;p&gt;The simplest way the user’s buying journey can be modelled is as a cash transaction for equipment at the boxing gymnasium. This is a solution completely without software, but as a process it is a reasonably elegant one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It’s simple, and reuses existing primitives (cash, equipment)&lt;/li&gt;
&lt;li&gt;It’s extremely low cost and easy to implement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows us to start filling out our business process. Things like “where do we purchase our goods from” or “where do we store our goods” or “what do users want to know about our goods” all start to come up and need solving.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resolving solved problems
&lt;/h3&gt;

&lt;p&gt;In our scenario the boxing gym has been holding equipment, but is struggling to understand what equipment sells well, what sells badly and how much stock remains. In terms of our previously defined principles, the process is not interrogable.&lt;/p&gt;

&lt;p&gt;Here the use cases are fairly common, and there are already solutions that have largely solved these problems.&lt;/p&gt;

&lt;p&gt;Dropping in a solution that solves “enough” of the problem is usually a good next step. Things like VendHQ, Square, Xero can solve the vast majority of these needs, and where they’re not yet solved a human process can make up the difference.&lt;/p&gt;

&lt;p&gt;These solutions are perhaps not the most technically elegant. However, they’re already shaped by user demand and are thus the most conceptually simple to our user — they solve the user’s problem better than we’d be able to ourselves.&lt;/p&gt;

&lt;p&gt;Be careful about solutions that solve more than the problems that need to be solved now. It is harder to remove process than to add it, and unless there is demand for a feature it is likely redundant, increasing complexity for no discernible gain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Building additional services
&lt;/h3&gt;

&lt;p&gt;Our boxing gymnasium is now successfully selling equipment to its members; however, the gym has only limited staff and does not have the time to explain the tradeoffs between the various pieces of equipment prior to the start of class.&lt;/p&gt;

&lt;p&gt;To address this, they need software that will allow them to list their services in some consumable format — the de facto implementation being on the internet.&lt;/p&gt;

&lt;p&gt;Depending on the software chosen previously it’s possible that our boxing gym can simply “switch on” an integration with Shopify or Magento that allows them to reuse their existing data. If so, this is the best solution: the gymnasium can continue to use their existing services, with limited additional learning required to list their services online.&lt;/p&gt;

&lt;p&gt;However, if such an integration is not available it is worth beginning to reevaluate the entire business stack so that a single solution can solve all problems. While this means a higher initial investment, it will mean a significantly lower investment in learning, diagnostics and any further development over essentially any timescale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Designing a unique service
&lt;/h3&gt;

&lt;p&gt;Our boxing gym has now grown and sells equipment both in its gymnasium and online. However, it would like to develop a new feature that doesn’t exist on the market — the ability to sell equipment directly from other gymnasiums.&lt;/p&gt;

&lt;p&gt;This requirement is so unique that no existing software can be used to model this particular requirement. Either existing software will have to be repurposed, or new software designed.&lt;/p&gt;

&lt;p&gt;Whether to repurpose existing software or design new software essentially depends on the total feature set required. If the business is well understood and the requirements are limited, designing new software offers some compelling benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The software can be designed to take advantage of business efficiencies&lt;/li&gt;
&lt;li&gt;The software is well known by the implementing team&lt;/li&gt;
&lt;li&gt;The software in absolute terms is not as complex&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, this comes with the significant risk of losing the implementing team. If that team disappears, a new team will need to relearn the entirety of the business. Accordingly, if the software is being contracted out, using a “standard” solution with minimal customisation buys insurance against relations with that contractor going sideways.&lt;/p&gt;

&lt;p&gt;For the purpose of this article we’ll assume that the development team is in house and has a vested interest in the success of the project.&lt;/p&gt;

&lt;p&gt;Perhaps the best thing to do is to rebuild the business logic entirely. This means losing many features that are inherent in commercial or open source software, but it also dramatically reduces the absolute complexity of the system, allowing much faster development targeted directly at the needs of the business.&lt;/p&gt;

&lt;p&gt;The result is software that is simpler, more targeted and more firmly under the control of the business — presuming the development team is capable of such software design.&lt;/p&gt;

&lt;h2&gt;
  
  
  Downsides of malleable software
&lt;/h2&gt;

&lt;p&gt;Malleable software is exceedingly hard to design. There are some significant downsides to it:&lt;/p&gt;

&lt;h3&gt;
  
  
  Expensive
&lt;/h3&gt;

&lt;p&gt;As described in the example of the boxing gym owner, it was not economical to design software from scratch until the business requirement was such that no existing software could be easily ported to the business’s needs.&lt;/p&gt;

&lt;p&gt;Designing software from scratch is an extremely expensive exercise. Developers are a scarce resource, and developers who are driven by the results of the business are rarer still.&lt;/p&gt;

&lt;p&gt;It’s often a better balance to reuse existing primitives for services rather than take the leap for fully customized, malleable software. The more customized software is, the more expensive it is to maintain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Difficult
&lt;/h3&gt;

&lt;p&gt;The process of understanding, designing and implementing software is an exceedingly difficult task. It requires an in-depth knowledge of the problem, the patience to put forward designs and rework them, and the ability to implement those designs in software.&lt;/p&gt;

&lt;h3&gt;
  
  
  Long Term
&lt;/h3&gt;

&lt;p&gt;Software that is malleable does pay off, but only over a long period of time. The upfront investment is significant, and is better offset by incrementalism, shifting to a self-hosted solution only when there are no other options available.&lt;/p&gt;

&lt;p&gt;However, once the initial design of the solution has been completed and presuming upkeep is not cost prohibitive, a solution that is more malleable will open more business opportunity.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Conclusion
&lt;/h2&gt;

&lt;p&gt;Designing software is a complex process, needing to balance the needs of all stakeholders while keeping true to the problem it intends to solve, over a long period of time and with many different hands.&lt;/p&gt;

&lt;p&gt;However, hopefully this article has provided some general background as to how software can be designed in such a way that it is more malleable, reducing the costs over the long term.&lt;/p&gt;


</description>
      <category>architecture</category>
      <category>deepdive</category>
    </item>
  </channel>
</rss>
