<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: penumbra23</title>
    <description>The latest articles on DEV Community by penumbra23 (@penumbra23).</description>
    <link>https://dev.to/penumbra23</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F258965%2F686d9263-fa90-4c93-959f-a6295866726c.png</url>
      <title>DEV Community: penumbra23</title>
      <link>https://dev.to/penumbra23</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/penumbra23"/>
    <language>en</language>
    <item>
      <title>Container Runtime in Rust - Part II</title>
      <dc:creator>penumbra23</dc:creator>
      <pubDate>Wed, 06 Oct 2021 16:13:20 +0000</pubDate>
      <link>https://dev.to/penumbra23/container-runtime-in-rust-part-ii-34em</link>
      <guid>https://dev.to/penumbra23/container-runtime-in-rust-part-ii-34em</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/penumbra23/container-jailing-how-it-s-done-2ikc"&gt;Part I&lt;/a&gt; of the series described the filesystem layout and how the runtime jails the container process inside the root filesystem of the container.&lt;br&gt;
Part II dives a bit deeper into the implementation and shows how the runtime creates the child process and how they communicate until the point where the user-defined process kicks in. It will also describe how a pseudo-terminal is set up and show the importance of Unix sockets.&lt;br&gt;
By the end of this part, we should have a basic runtime that's interoperable with Docker.&lt;/p&gt;
&lt;h2&gt;
  
  
  Clone
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/penumbra23/container-runtime-in-rust-part-0-1ecm"&gt;Part 0&lt;/a&gt; briefly explained the clone syscall. It's like fork/vfork, but with more options to control the child process. Actually, some implementations of fork propagate the call to &lt;code&gt;clone()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Besides controlling which parts of the execution context get shared from the parent process, the clone call offers the possibility for us to create a separate memory block for the child's stack. The nix implementation of clone has the following signature:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
 &lt;span class="n"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CloneCb&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nv"&gt;'_&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; 
 &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CloneFlags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
 &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;c_int&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Pid&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;signal&lt;/code&gt; argument, if specified, is sent back to the parent when the child process terminates.&lt;/p&gt;

&lt;p&gt;Here is the code snippet depicting the create command and the parent-child relation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Create&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;container_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="py"&gt;.id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;root&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="py"&gt;.root&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;bundle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;create&lt;/span&gt;&lt;span class="py"&gt;.bundle&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="c1"&gt;// Load config.json specification file&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="nn"&gt;Spec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;try_from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;bundle&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"config.json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.as_path&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nd"&gt;error!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;STACK_SIZE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 4 MB&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;ref&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;STACK_SIZE&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;STACK_SIZE&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

    &lt;span class="c1"&gt;// Take namespaces from config.json&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;spec_namespaces&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="py"&gt;.linux.namespaces&lt;/span&gt;&lt;span class="nf"&gt;.into_iter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;.map&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nf"&gt;to_flags&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="nf"&gt;.reduce&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;clone_flags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;spec_namespaces&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;CloneFlags&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;empty&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;clone&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Box&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child_fun&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;stack&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clone_flags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Parent process&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One problem that arises from viewing the above is the communication between parent and child. In the case of spawning a thread, one could say that creating an in-memory channel would solve the issue (Rust has excellent support for multithreading and thread-safe &lt;a href="https://doc.rust-lang.org/std/sync/mpsc/index.html" rel="noopener noreferrer"&gt;multi-producer, single-consumer queue&lt;/a&gt;). In our example, this is not the case because it's creating (or better say cloning) a new process with its separate memory space.&lt;/p&gt;

&lt;p&gt;Inter-process communication (IPC) is a set of techniques that allows processes to communicate with each other. The two most widely used are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;shared memory&lt;/li&gt;
&lt;li&gt;sockets&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In our example, we'll use Unix sockets (AF_UNIX) to establish a "client-server" channel between the parent and the child processes. The container process is going to bind to that Unix socket and listen for incoming connections from the parent process. Both processes will use the socket connection to inform each other when different parts of the execution pass or fail. The socket connection also comes in handy when the start command is being invoked, to inform the container process to start the user-defined program. The following diagram describes the "protocol" better:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2arzqf13fm0np7fetg85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2arzqf13fm0np7fetg85.png" alt="container-runtime-diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Unix Sockets
&lt;/h2&gt;

&lt;p&gt;For those unfamiliar with &lt;a href="https://man7.org/linux/man-pages/man7/unix.7.html" rel="noopener noreferrer"&gt;Unix (domain) Sockets&lt;/a&gt;, this Linux feature will hopefully be mind-blowing (at least for me it was). Unix sockets are an inter-process communication mechanism that establishes a two-way data exchange channel between processes running on the same machine. One can think of them as TCP/IP sockets that don't use the network stack to send and receive data, but a file on the filesystem.&lt;/p&gt;

&lt;p&gt;In the case of the container runtime, Unix sockets offer a bi-directional data exchange for the runtime parent and child process. That exchange channel is essential for the container runtime! What if something breaks down in the child process? How can the parent process continue? Or how does the child know when the start command gets invoked?&lt;/p&gt;

&lt;p&gt;For these purposes, the container runtime implements IPC channels. Those are bidirectional channels using Unix domain sockets. One process acts as a "server" and the other processes (known as "clients") connect to the server process.&lt;/p&gt;

&lt;p&gt;To shorten up the story, here's a rough idea of how that Rust code might look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;IpcChannel&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sock_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Option&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;IpcChannel&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;IpcChannel&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;socket_raw_fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nn"&gt;AddressFamily&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Unix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nn"&gt;SockType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SeqPacket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nn"&gt;SockFlag&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SOCK_CLOEXEC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"unable to create IPC socket"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sockaddr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;SockAddr&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new_unix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"unable to create unix socket"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket_raw_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sockaddr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"unable to bind IPC socket"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nf"&gt;listen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket_raw_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"unable to listen IPC socket"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IpcChannel&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;socket_raw_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;sock_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;IpcChannel&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;socket_raw_fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="nn"&gt;AddressFamily&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Unix&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nn"&gt;SockType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SeqPacket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nn"&gt;SockFlag&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;SOCK_CLOEXEC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"unable to create IPC socket"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sockaddr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;SockAddr&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new_unix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"unable to create unix socket"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;socket_raw_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sockaddr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"unable to connect to unix socket"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IpcChannel&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;socket_raw_fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;sock_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;child_socket_fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;nix&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;accept&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"unable to accept incoming socket"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;._client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;child_socket_fd&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;._client&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;

        &lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="nf"&gt;.as_bytes&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nd"&gt;format!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"unable to write to unix socket {}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;._client&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;None&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;num&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

        &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;str&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_utf8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="n"&gt;num&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="nf"&gt;.trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt;
            &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"error while converting byte to string {}"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.fd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"error closing socket"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;remove_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.sock_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.map_err&lt;/span&gt;&lt;span class="p"&gt;(|&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"error removing socket"&lt;/span&gt;&lt;span class="nf"&gt;.to_string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;err_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;ErrorType&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server calls the &lt;strong&gt;new&lt;/strong&gt; method and binds to the &lt;em&gt;.sock&lt;/em&gt; file. Then it calls &lt;strong&gt;accept&lt;/strong&gt; and waits for incoming connections. On the other hand, the client just calls &lt;strong&gt;connect&lt;/strong&gt; with the same &lt;em&gt;.sock&lt;/em&gt; file and after that, the server and client can exchange messages. In the end, both processes call &lt;strong&gt;close&lt;/strong&gt; and the communication is finished.&lt;/p&gt;

&lt;p&gt;Note that I've used &lt;code&gt;SOCK_SEQPACKET&lt;/code&gt; sockets, because the messages come in-order, it's connection-based and the message gets flushed all at once (opposite to &lt;code&gt;SOCK_STREAM&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Terminal
&lt;/h2&gt;

&lt;p&gt;To have a nice interaction with the container once it starts, the runtime should be able to provide a terminal interface if the user requested a terminal.&lt;/p&gt;

&lt;p&gt;When running a Docker command like this one:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run alpine ping 8.8.8.8 &lt;/code&gt;&lt;/p&gt;

&lt;p&gt;you will see the output of the ping command sending ICMP requests to Google's DNS. The output of the ping command is piped through Docker, but when we want to stop the command (by using Ctrl+C) nothing happens. That's because when pressing the SIGINT key combination, the signal is sent to Docker which isn't passing the command to the actual container process.&lt;/p&gt;

&lt;p&gt;On the other side, when running:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run -it alpine ping 8.8.8.8&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;and pressing Ctrl+C, the command terminates immediately, as if it runs on the host machine. Why is that?&lt;br&gt;
That's because in the first example the container process doesn't have a terminal instantiated, therefore the user nor Docker can't forward the signal to the container via tty.&lt;/p&gt;

&lt;p&gt;Luckily for us, the &lt;code&gt;-t&lt;/code&gt; option sets the &lt;code&gt;terminal: true&lt;/code&gt; flag inside the config.json file. After that, it's the container runtime's responsibility to create a so-called &lt;a href="https://linux.die.net/man/7/pty" rel="noopener noreferrer"&gt;"pseudo-terminal"&lt;/a&gt; (pty).&lt;/p&gt;

&lt;p&gt;To simplify things, a PTY is a (master-slave) pair of communication devices that act like a real terminal. Any command sent to the master gets forwarded to the slave end, from text input to process signals. PTYs are a very important and used feature of the Linux kernel (&lt;em&gt;ssh&lt;/em&gt; uses it!).&lt;/p&gt;

&lt;p&gt;Now it's simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;if &lt;code&gt;terminal: true&lt;/code&gt; the container runtime creates a PTY&lt;/li&gt;
&lt;li&gt;the slave descriptor goes to the child process&lt;/li&gt;
&lt;li&gt;the master descriptor goes to the calling process (in this case Docker)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;But how does the child process send the master descriptor to Docker?&lt;/p&gt;

&lt;p&gt;Sigh… This was a real &lt;a href="https://www.allacronyms.com/PITA/Pain_In_The_Ass" rel="noopener noreferrer"&gt;PITA&lt;/a&gt; to find out and the solution was outside the scope of the OCI runtime spec. &lt;code&gt;runc&lt;/code&gt; developed a solution for which the steps are described &lt;a href="https://github.com/opencontainers/runc/blob/master/docs/terminals.md#detached-new-terminal" rel="noopener noreferrer"&gt;here&lt;/a&gt;. TLDR; our friend Unix sockets came to help. Docker creates a Unix domain socket and passes it to the container runtime as &lt;code&gt;console-socket&lt;/code&gt; argument. After the container runtime creates the PTY, it sends the master end to that same Unix socket with &lt;a href="https://man7.org/linux/man-pages/man3/cmsg.3.html" rel="noopener noreferrer"&gt;SCM_RIGHTS&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Finally, we have a ready-to-test OCI container runtime! &lt;/p&gt;

&lt;p&gt;This part explained the clone syscall and how it detaches the execution context from the parent process. It also has a flexible API so that we can specify the new stack for the process.&lt;br&gt;
Unix domain sockets play a big role here because they sync the whole parent-child communication and handle potential scenes when errors show up, on both sides.&lt;/p&gt;

&lt;p&gt;Part II rounds up the Container Runtime in Rust series. The whole source code for the experimental container runtime can be found on this &lt;a href="https://github.com/penumbra23/pura" rel="noopener noreferrer"&gt;Github repo&lt;/a&gt;. Feel free to ask questions or point interesting things out in the implementation.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>docker</category>
      <category>containers</category>
      <category>rust</category>
    </item>
    <item>
      <title>Container Runtime in Rust — Part I</title>
      <dc:creator>penumbra23</dc:creator>
      <pubDate>Sat, 18 Sep 2021 17:46:38 +0000</pubDate>
      <link>https://dev.to/penumbra23/container-jailing-how-it-s-done-2ikc</link>
      <guid>https://dev.to/penumbra23/container-jailing-how-it-s-done-2ikc</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://dev.to/penumbra23/container-runtime-in-rust-part-0-1ecm"&gt;Part 0&lt;/a&gt; of the series, we’ve seen how processes can get a restricted view of the resources they see. This part will explain how a container runtime prepares and creates an isolated environment for the container process.&lt;/p&gt;

&lt;p&gt;A prerequisite for this part is a basic knowledge of how the Linux filesystem works, what are inodes, symbolic links, and mount points.&lt;/p&gt;

&lt;p&gt;The full source code for this post can be found &lt;a href="https://github.com/penumbra23/pura"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First, let’s start with the OCI spec.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operations
&lt;/h2&gt;

&lt;p&gt;At the time of writing, the OCI spec defines a minimum of five standard operations: create, start, state, delete and kill. Having that in mind, using the &lt;a href="https://crates.io/crates/clap"&gt;clap&lt;/a&gt; library we can generate a nice CLI interface in no time. It should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;App&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Container Runtime"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.subcommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;SubCommand&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"create"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Arg&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"bundle"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Arg&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.subcommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;SubCommand&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"start"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Arg&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="nf"&gt;.subcommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;SubCommand&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"kill"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Arg&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Arg&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"signal"&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;.subcommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;SubCommand&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"delete"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Arg&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="nf"&gt;.subcommand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;SubCommand&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"state"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.arg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Arg&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_name&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.required&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
    &lt;span class="nf"&gt;.get_matches&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We’ll give the most attention to the &lt;strong&gt;create&lt;/strong&gt; and &lt;strong&gt;start&lt;/strong&gt; command because those are the two most important commands when running the docker run command.&lt;/p&gt;

&lt;p&gt;The bundle directory contains the config.json file that holds all the metadata for creating the container:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ociVersion - version of the OCI spec&lt;/li&gt;
&lt;li&gt;process - the user-defined process that the container executes (shell, database, web app, gRPC service, etc.) with the necessary args and environment variables&lt;/li&gt;
&lt;li&gt;root - path to the subdirectory for the container root&lt;/li&gt;
&lt;li&gt;hostname of the container&lt;/li&gt;
&lt;li&gt;mounts - list of mount points inside the container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition, the OCI spec contains a platform-specific section enabling custom settings based on the platform where the container is being run. Because we are looking only at Linux containers, the linux section will be of use to us.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;create&lt;/strong&gt; command is supplied with the container ID and the bundle path. Its purpose is to initialize the container process, mount all necessary subdirectories, “jail” the container inside the &lt;strong&gt;root.path&lt;/strong&gt; folder, update all system variables inside the container (env, hostname, user, group), execute a couple of hooks (we’ll look into that later), assign the unique ID to the container itself, and wait until the &lt;strong&gt;start&lt;/strong&gt; command gets fired. After the &lt;strong&gt;create&lt;/strong&gt; command finishes, the container is in status Created and the user process MUST wait for the &lt;strong&gt;start&lt;/strong&gt; command to start the actual container process.&lt;br&gt;
Everything seems straightforward regarding the implementation, but the “jail” -ing part can be a bit confusing. How is it done?&lt;/p&gt;
&lt;h2&gt;
  
  
  chroot
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://man7.org/linux/man-pages/man2/chroot.2.html"&gt;Chroot&lt;/a&gt; is a syscall that changes the root directory of the calling process. It takes the new root path as an argument, which can be an absolute or relative path. The chroot command from the terminal does the same thing, except it takes an additional argument, namely the process that’s going to be executed inside the changed root.&lt;br&gt;
Before we take a look at an example, first we need to prepare the new rootfs. Unfortunately, the binaries used in the jail must reside inside the chroot-ed directory (obviously), so we need to have a premade rootfs. Luckily, we can use our host OS binaries and mount bind the already existing files and end up with a structure like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uRCS8t8z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nvlz4ll3ewtfnz9atpq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uRCS8t8z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nvlz4ll3ewtfnz9atpq6.png" alt="mount-binds" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Don’t worry if your listing differs, just be sure to have bash and ls inside the bin directory.&lt;/p&gt;

&lt;p&gt;Let’s take a look at the chroot command (run with sudo):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fv0MhdSm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lb7cbotxt4m2vz32x3vi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fv0MhdSm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lb7cbotxt4m2vz32x3vi.png" alt="chroot" width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we can see, listing directories outside our root (ls ..) lists the jailed root and it seems that we can’t see anything outside. Also, listing the bin and lib directories gives the same result as with the above example.&lt;br&gt;
One can say “that’s how containers are jailed” and jump right onto building a container from scratch. But, things aren’t that easy… Chroot doesn’t change the filesystem nor the mount points that the process sees. It just changes the view for the process’ root, but everything remains the same. And, breaking this jail is fairly easy as described &lt;a href="https://deepsec.net/docs/Slides/2015/Chw00t_How_To_Break%20Out_from_Various_Chroot_Solutions_-_Bucsay_Balazs.pdf"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  pivot_root
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://man7.org/linux/man-pages/man2/pivot_root.2.html"&gt;pivot_root&lt;/a&gt; on the other hand does exactly what we need. Given a new root and subdirectory of the current root, it moves the current root to the subdirectory and mounts the new root as the root mount point. In this way, it changes the physically mounted folder of the root directory. Later on, we can unmount the “old” root and just leave the newly created root mount point.&lt;br&gt;
We’re going to take a look at an example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE&lt;/strong&gt;: &lt;strong&gt;pivot_root changes the root mount point and can potentially make a mess out of your filesystem, so be sure to follow the steps below.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we need a real rootfs filesystem. We can’t use the above example because we mount bind the host binaries. We need an independent directory that can live on its own. For that, we are going to use Docker to export a fresh rootfs from an Alpine container. Then we’re going to use unshare (remember our friend from Part 0) to create a new mount namespace. Then we’re going to pivot the roots inside our container. It should look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--apYq0Lwp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1riboulwljibgehp7ai7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--apYq0Lwp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1riboulwljibgehp7ai7.png" alt="export-alpine-docker" width="800" height="247"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker export&lt;/code&gt; simply copies files from the container to the host system into a tar archive. After exporting the rootfs from the Alpine image, we bind mount the directory to itself, why? Because by the specification of the &lt;strong&gt;pivot_root&lt;/strong&gt; syscall, &lt;em&gt;new_root&lt;/em&gt; must be a path to a mount point that’s different than “/”.&lt;/p&gt;

&lt;p&gt;After preparing the container root, we need to create a new mount namespace to have it differently than our host environment, so that pivot_root doesn’t change anything on the host mount namespace. We create a temporary folder to hold the old root, pivot the root, unmount (or unlink with umount -l) the oldroot and, to finalize the swap, remove the oldroot folder.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Voila!&lt;/em&gt; Now we have a bash process running inside the jailed container folder.&lt;/p&gt;

&lt;p&gt;In Rust code, mounting the rootfs folder with the &lt;a href="https://docs.rs/nix/0.22.1/nix/"&gt;nix crate&lt;/a&gt; would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;mount_rootfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rootfs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;mount&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;None&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nn"&gt;None&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nn"&gt;MsFlags&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MS_PRIVATE&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nn"&gt;MsFlags&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MS_REC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nn"&gt;None&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nn"&gt;mount&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;rootfs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;rootfs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nn"&gt;None&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nn"&gt;MsFlags&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MS_BIND&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="nn"&gt;MsFlags&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MS_REC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nn"&gt;None&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first mount changes the mount propagation of the root mount point to private (shared mount propagation isn’t allowed with &lt;strong&gt;pivot_root&lt;/strong&gt;, for obvious reasons).&lt;/p&gt;

&lt;p&gt;The code for the whole pivoting procedure should look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;pub&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;pivot_rootfs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rootfs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nb"&gt;Box&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;dyn&lt;/span&gt; &lt;span class="n"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;chdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rootfs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;create_dir_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rootfs&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"oldroot"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;pivot_root&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rootfs&lt;/span&gt;&lt;span class="nf"&gt;.as_os_str&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;rootfs&lt;/span&gt;&lt;span class="nf"&gt;.join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"oldroot"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.as_os_str&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;umount2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./oldroot"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;MntFlags&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;MNT_DETACH&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;remove_dir_all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"./oldroot"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nf"&gt;chdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"/"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that both &lt;code&gt;mount_rootfs&lt;/code&gt; and &lt;code&gt;pivot_rootfs&lt;/code&gt; are called in the newly created mount namespace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Special links &amp;amp; mounts
&lt;/h2&gt;

&lt;p&gt;The OCI runtime spec defines a set of &lt;a href="https://github.com/opencontainers/runtime-spec/blob/master/runtime-linux.md"&gt;special symlinks&lt;/a&gt;. These symbolic links are used to pass the stdin, stdout, and stderr streams from the container engine (Docker, containerd) to the runtime and vice versa. It simply binds the container’s standard streams to the outside file descriptors of the container process. The container runtime needs to establish these symlinks before pivot_root.&lt;/p&gt;

&lt;p&gt;The OCI runtime spec defines a set of &lt;a href="https://github.com/opencontainers/runtime-spec/blob/master/config-linux.md#default-filesystems"&gt;filesystems&lt;/a&gt; that need to be mounted inside the container. While extracting some &lt;strong&gt;config.json&lt;/strong&gt; files from images like Alpine, Ubuntu, Debian, the &lt;em&gt;/dev/pts&lt;/em&gt; and &lt;em&gt;/dev/shm&lt;/em&gt; mount points are present inside the mounts section of the runtime config spec.&lt;/p&gt;

&lt;p&gt;Two important filesystems that need more attention are proc and sysfs.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;proc&lt;/em&gt; filesystem mounts to the &lt;em&gt;/proc&lt;/em&gt; directory and acts as an interface for the internal structures of the kernel. For each process it holds a &lt;em&gt;/proc/[PID]&lt;/em&gt; subdirectory that keeps the file descriptors, cpu and memory usage, mount information, page tables and many other things. For example, one can’t inspect the current mount points (with the mount command) until this fs isn’t created.&lt;/p&gt;

&lt;p&gt;The exact command for mounting the proc fs is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mount -t proc proc /proc&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;sysfs&lt;/em&gt; filesystem is a pseudo-fs-like proc that provides an interface to the internal kernel objects. Opposite to the proc filesystem, it holds system-wide information like metadata of block and char devices, bus info, drivers, control groups, kernel info, and other global variables. Mounting the &lt;em&gt;sysfs&lt;/em&gt; is the same as with &lt;em&gt;proc&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;mount -t sysfs sysfs /sys&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Both &lt;em&gt;proc&lt;/em&gt; and &lt;em&gt;sysfs&lt;/em&gt; need to be mounted after &lt;strong&gt;pivot_root&lt;/strong&gt;, when the new root mount point is created.&lt;/p&gt;

&lt;h2&gt;
  
  
  Devices
&lt;/h2&gt;

&lt;p&gt;In Linux, everything is considered as a file. Hard drives, peripherals, even processes are fully described through file descriptors. Devices aren’t an exception either. Diskettes, CDROMs, serial ports, any device you attach should appear inside the &lt;em&gt;/dev&lt;/em&gt; subdirectory under the root directory. Devices have types and the majority of devices are either &lt;strong&gt;block&lt;/strong&gt; (store data of some kind) or &lt;strong&gt;character&lt;/strong&gt; (stream or transfer data to/from) devices. Terminals, pseudo-random number generators and even the &lt;em&gt;/dev/null&lt;/em&gt; file is considered a device too.&lt;/p&gt;

&lt;p&gt;The OCI specification defines required devices for each container and the &lt;strong&gt;config.json&lt;/strong&gt; contains a devices list under the linux section. The container runtime is responsible for creating these devices inside the container root directory. The syscall for creating devices is &lt;a href="https://man7.org/linux/man-pages/man2/mknod.2.html"&gt;mknod&lt;/a&gt;. This syscall (also command inside the terminal) accepts 4 required parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pathname - full path to the file location&lt;/li&gt;
&lt;li&gt;type - block, character or other device types&lt;/li&gt;
&lt;li&gt;major &amp;amp; minor - unique identifiers for the device&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, char device with major minor numbers 1, 8 is the random device representing the pseudo-random number generator. Whenever your app requests a random number, this device gets a request.&lt;br&gt;
We can easily generate the special devices with nix’s &lt;strong&gt;mknod&lt;/strong&gt; function or, in case of binding to the host’s device (which is covered by the OCI spec) use the mount bind option.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We’ve seen how &lt;strong&gt;chroot&lt;/strong&gt; changes the view of the root directory for the current process and how &lt;strong&gt;pivot_root&lt;/strong&gt; swaps the root mount points, creating a logical isolation of the filesystem. We’ve also seen how to create standard container devices and that different containers can request special devices inside the &lt;em&gt;mounts&lt;/em&gt; section of the &lt;strong&gt;config.json&lt;/strong&gt; file.&lt;/p&gt;

&lt;p&gt;Knowing how &lt;strong&gt;unshare&lt;/strong&gt; and &lt;strong&gt;pivot_root&lt;/strong&gt; work gives us the ability to manually create Linux containers in our terminal. In the next parts, we’ll dive a bit deeper into the implementation. Specifically on cloning the child process and preparations for starting the container command.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>rust</category>
      <category>containers</category>
      <category>linux</category>
    </item>
    <item>
      <title>Container Runtime in Rust - Part 0</title>
      <dc:creator>penumbra23</dc:creator>
      <pubDate>Thu, 09 Sep 2021 17:42:03 +0000</pubDate>
      <link>https://dev.to/penumbra23/container-runtime-in-rust-part-0-1ecm</link>
      <guid>https://dev.to/penumbra23/container-runtime-in-rust-part-0-1ecm</guid>
      <description>&lt;h2&gt;
  
  
  Intro
&lt;/h2&gt;

&lt;p&gt;From a DevOps perspective, I can only say what a time to be alive! With the rise of containers all the pain we had with setting up VMs, backup/restore procedures, dynamic provisioning because of scaling because of increased usage volume because of unoptimized architectural setup because of, etc. Everything is gone (well, almost everything from the above list) with a single docker-compose file or a convenient one-liner like helm install my-app.&lt;br&gt;
But, what are containers essentially, and how does Docker work under the hood?&lt;/p&gt;

&lt;p&gt;This will be a series describing my journey writing a small, experimental container runtime in what other language than the one perfect for the job — Rust!&lt;/p&gt;

&lt;p&gt;Part 0 describes the system features of the Linux OS that Docker and other tools utilize to start a process that is isolated and protected or, in other words, a &lt;strong&gt;container&lt;/strong&gt;. The whole series is based on Linux containers, so any occurrence of the word &lt;strong&gt;container&lt;/strong&gt; is interchangeable with &lt;strong&gt;“Linux container”&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The full source code for this post can be found &lt;a href="https://github.com/penumbra23/pura" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Container
&lt;/h2&gt;

&lt;p&gt;When being asked to describe, with as few words as possible, what a container is, most developers feel disappointed. Described with one tech word, it’s a forked or cloned process. It has a dedicated PID, it’s owned by a user and a group, you can list it with the ps command and send signals to it (yeah, signal #9 too). It’s nothing more than that, nothing super fancy bleeding edge around it, just an old-fashioned process.&lt;/p&gt;

&lt;p&gt;But how is it isolated from the rest of the system?&lt;/p&gt;

&lt;p&gt;The answer is &lt;a href="https://en.wikipedia.org/wiki/Linux_namespaces" rel="noopener noreferrer"&gt;namespaces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Namespaces provide the logical isolation of resources for processes running in different sets of namespaces. There are different types of namespaces, for example, MOUNT namespace for all mount points that the current process can see, NETWORK namespaces for the network interfaces and traffic rules, PID for the process tree, and so on. Two processes running in different PID namespaces don’t see the same process tree. A separate NETWORK namespace gets its own network stack, routing table, firewalls and a loopback interface. Two processes with different network namespaces that bind to their respective loopback devices are bound to a separate logical interface so that traffic doesn’t interfere between them.&lt;/p&gt;

&lt;p&gt;The MOUNT namespace contains the list of mount points a process can see. When first cloning from a mount namespace (the CLONE_NEWNS flag) all mount points are copied from the parent to the child namespace. Any additional mount point created in the child isn’t propagated to the parent mount namespace. Also, when the child process unmounts any mount point, it’s only being affected inside his mount namespace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpg90fhw6gffi04wp645.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flpg90fhw6gffi04wp645.png" alt="linux-namespaces"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Root (“/”) isn’t an exception either. The root mount point doesn’t have to be (mostly isn’t) the same directory for different mount namespaces. When running a container, Docker (specifically containerd) creates a directory for each container’s root mount point. Before the actual container runs the user-defined process, it’s the container runtime’s responsibility to mount that directory as the container’s root. In that way, each container has its own root mount point that’s separated from the rest of the filesystem and all other containers running on the host.&lt;/p&gt;

&lt;p&gt;Each process has a &lt;em&gt;/proc/PID/ns&lt;/em&gt; subdirectory on the host containing symlinks for each namespace they belong to (pid, net, user, cgroup, …). If two processes belong to the same namespace, their symlinks will be the same.&lt;/p&gt;

&lt;p&gt;Let’s see what namespaces a simple sleep command has (on my machine of course):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0p4kkfoe3bfj1y6pdg1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0p4kkfoe3bfj1y6pdg1.png" alt="console-1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running another process in the same shell and inspecting it’s namespaces gives the same symlinks as the above:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wo3zkfiw2gyl4jol0gg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5wo3zkfiw2gyl4jol0gg.png" alt="console-2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two processes are run without any extra commands or setup, they have the same namespace symlinks, therefore they belong to the same namespaces.&lt;/p&gt;

&lt;p&gt;Linux provides the &lt;a href="https://man7.org/linux/man-pages/man2/unshare.2.html" rel="noopener noreferrer"&gt;&lt;strong&gt;UNSHARE&lt;/strong&gt;&lt;/a&gt; syscall from the &lt;em&gt;sched.h&lt;/em&gt; library to change the process’ execution context and allow it to create and enter new namespaces. After &lt;strong&gt;UNSHARE&lt;/strong&gt; gets called with the specific bit mask (flags combining which namespaces to “unshare” from the execution context), the running process gets detached from the root namespaces into it’s own set of namespaces. Unfortunately, it’s not enough to call UNSHARE and expect to have an isolated container (for example, a parent process that “unshares” the PID namespace, still runs in the root PID namespace, but any child process created afterwards will enter the newly created PID namespace). Usually after the container runtime calls UNSHARE it gets followed by a fork/vfork call to create the actual container process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://man7.org/linux/man-pages/man2/clone.2.html" rel="noopener noreferrer"&gt;&lt;strong&gt;CLONE&lt;/strong&gt;&lt;/a&gt; syscall is a far better option and the implementation in &lt;a href="https://docs.rs/nix/0.22.1/nix/sched/fn.clone.html" rel="noopener noreferrer"&gt;Rust’s nix package&lt;/a&gt; feels more robust and fine-grained. It gives the ability to specify the namespace flags as in &lt;strong&gt;UNSHARE&lt;/strong&gt;, forks a child process and creates the stack for the child.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SETNS&lt;/strong&gt; syscall (&lt;strong&gt;NSENTER&lt;/strong&gt; command) gives the option to change the given namespace to an already existing namespace by it’s file descriptor. For example, to fork a shell process that enters the mount namespace of process with PID 15, enter in a new shell (with root permissions):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;nsenter --mount=/proc/15/ns/mnt /bin/sh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Enough theory! To confirm everything said above, let’s jump onto a Docker-based example. We’ll run an alpine container, list the process on the host system, inspect it’s namespaces and docker exec into it with &lt;strong&gt;SETNS&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Let’s run a long running sleep container:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnz1rqk8a5ra69nx6lfm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhnz1rqk8a5ra69nx6lfm.png" alt="console-3"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After inspecting the container’s PID, let’s see the namespace symlinks under &lt;em&gt;/proc/PID/ns&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dwa1aquolgp0t52skgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9dwa1aquolgp0t52skgs.png" alt="console-4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interesting! Some namespaces are the same as with the above sleep command that we’ve run in our terminal (specifically the &lt;strong&gt;user&lt;/strong&gt; and &lt;strong&gt;cgroup&lt;/strong&gt; namespaces), but most of them are different, which proves that Docker separates containers in different namespaces.&lt;/p&gt;

&lt;p&gt;Now, let’s try to exec a shell inside the container to inspect the filesystem:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn639jz367q81lrkokl0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn639jz367q81lrkokl0w.png" alt="console-5"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Starting to make sense? Well, almost…&lt;/p&gt;

&lt;p&gt;What’s going on here is that we’ve run &lt;strong&gt;NSENTER&lt;/strong&gt; to start a shell process inside the mount namespace of the container process. The container’s mount namespace is different from the root one because the root directory lists a bit different tree structure than on my WSL instance. Additionally, on my Debian instance, printing the Linux distro info (&lt;em&gt;/etc/os-release&lt;/em&gt;) inside the container shows an Alpine Linux distro. So we conclude:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker exec -it &amp;lt;CONTAINER_ID&amp;gt; &amp;lt;CMD&amp;gt;&lt;/code&gt;&lt;br&gt;
equals to:&lt;br&gt;
&lt;code&gt;nsenter -a -t &amp;lt;CONTAINER_PID&amp;gt; &amp;lt;CMD&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;nsenter -a&lt;/strong&gt; enters all namespace of process with PID specified in the -t arg&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Docker
&lt;/h2&gt;

&lt;p&gt;Most readers are already familiar with the Docker client-server model, so it’s unnecessary to explain how the CLI invokes commands on the dockerd service running in the background. Let’s open the front hood and see what’s going underneath.&lt;/p&gt;

&lt;p&gt;In the last example, we’ve seen that the docker run command forks a process in a separate namespace set. What’s actually happening is that Docker (again, more specifically containerd, but let’s keep it simple for now) calls the underlying container runtime to create the specified namespaces, prepare the container environment and execute some special commands needed before the actual user-defined command starts.&lt;/p&gt;

&lt;p&gt;But if that’s the responsibility of the runtime, what do we use Docker for?&lt;br&gt;
Docker prepares everything before container creation, with the 2 most important parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;config.json&lt;/li&gt;
&lt;li&gt;container root directory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These two parts (together with other things that we will discover in this series) are called the container bundle.&lt;/p&gt;

&lt;p&gt;The config.json file has the complete layout for the whole container lifecycle, from container start to container deletion. It holds the path for the container root directory, a list of namespaces that need to be unshared, resource limits for the container process, hooks that need to be executed at a specific point in time, and many other settings.&lt;/p&gt;

&lt;p&gt;The container root directory is the directory mentioned in the mount namespace section. That’s the subdirectory somewhere on the host system that will be the root directory for the container. The user-defined process MUST NOT know that there’s a whole different world outside the container root directory, which basically “cages” (most literature refers to it as “jails”) the user process inside the container root directory.&lt;/p&gt;

&lt;p&gt;Besides these two most important things, Docker does other preparations too (for example, pulls the image layers from the remote repository, sets up the networking interfaces if the container has networking enabled, and so on).&lt;/p&gt;

&lt;h2&gt;
  
  
  OCI Specification
&lt;/h2&gt;

&lt;p&gt;Standardization is a crucial part of any software integration. From writing REST APIs or gRPC services to designing low-level network protocols, everything starts with a well-defined document describing what needs to be (un)expected behavior and the minimal contract that an implementation needs to fulfill.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://opencontainers.org/" rel="noopener noreferrer"&gt;Open Container Initiative (OCI)&lt;/a&gt; created and still maintains the OCI runtime specification that can be found &lt;a href="https://github.com/opencontainers/runtime-spec/blob/master/spec.md" rel="noopener noreferrer"&gt;here&lt;/a&gt;. It’s a spec that still evolves and adds new features that a container runtime can or may execute while launching a container process.&lt;/p&gt;

&lt;p&gt;I won’t go into the details of the spec, as it’s described very well in the document, but at least here is what it is in a nutshell.&lt;/p&gt;

&lt;p&gt;An OCI-compliant &lt;strong&gt;container runtime&lt;/strong&gt; is a CLI binary that implements the following commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;create &amp;lt;id&amp;gt; &amp;lt;bundle_path&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;start &amp;lt;id&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;state &amp;lt;id&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;kill &amp;lt;id&amp;gt; &amp;lt;signal&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;delete &amp;lt;id&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As with any other emerging spec, the OCI runtime spec describes the minimum features for creating a container. Popular container runtime implementations like &lt;a href="https://github.com/opencontainers/runc" rel="noopener noreferrer"&gt;runc&lt;/a&gt; or &lt;a href="https://github.com/containers/crun" rel="noopener noreferrer"&gt;crun&lt;/a&gt; have additional arguments that help setting up the PID file of the process, root folder for the container state, or the socket file for the container terminal.&lt;/p&gt;

&lt;p&gt;But don’t worry about these parts. In the upcoming articles in this series, we’ll dive deeper into it with some Rust code.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://man7.org/linux/man-pages/man7/namespaces.7.html" rel="noopener noreferrer"&gt;Linux namespaces man page&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/opencontainers/runtime-spec" rel="noopener noreferrer"&gt;OCI runtime spec repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/opencontainers/runc/blob/master/man/runc.8.md" rel="noopener noreferrer"&gt;runc repository&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>docker</category>
      <category>rust</category>
      <category>containers</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
