Container Runtime in Rust - Part II

Part I of the series described the filesystem layout and how the runtime jails the container process inside the root filesystem of the container.
Part II dives a bit deeper into the implementation and shows how the runtime creates the child process and how the two processes communicate up to the point where the user-defined process kicks in. It also describes how a pseudo-terminal is set up and shows why Unix sockets are so important.
By the end of this part, we should have a basic runtime that's interoperable with Docker.

Clone

Part 0 briefly explained the clone syscall. It's like fork/vfork, but with more options for controlling the child process. In fact, some fork implementations simply delegate to clone().

Besides controlling which parts of the execution context get shared from the parent process, the clone call also lets us supply a separate memory block for the child's stack. The nix implementation of clone has the following signature:

pub fn clone(
 cb: CloneCb<'_>, 
 stack: &mut [u8], 
 flags: CloneFlags, 
 signal: Option<c_int>
) -> Result<Pid>

The signal argument, if specified, is the signal delivered to the parent when the child process terminates (typically SIGCHLD, so the parent can wait for the child just like with fork).
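
To make that concrete, here's a minimal, standalone sketch (not the runtime's code) that clones a child with no extra namespaces, asks the kernel to deliver SIGCHLD on exit, and reaps the child with waitpid. It assumes the same (older) nix API whose signature is shown above.

use nix::sched::{clone, CloneFlags};
use nix::sys::signal::Signal;
use nix::sys::wait::waitpid;

fn main() {
    const STACK_SIZE: usize = 1024 * 1024;
    let stack: &mut [u8] = &mut [0; STACK_SIZE];

    // The child just prints a message and returns its exit code.
    let child = clone(
        Box::new(|| -> isize {
            println!("hello from the cloned child");
            0
        }),
        stack,
        CloneFlags::empty(),
        Some(Signal::SIGCHLD as i32),
    )
    .expect("clone failed");

    // Because we asked for SIGCHLD, the parent can reap the child just like
    // a regular fork()ed process.
    waitpid(child, None).expect("waitpid failed");
}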

Here is the code snippet depicting the create command and the parent-child relation:

pub fn create(create: Create) {
    let container_id = create.id;
    let root = create.root;
    let bundle = create.bundle;

    // Load config.json specification file
    let spec = match Spec::try_from(Path::new(&bundle).join("config.json").as_path()) {
        Ok(spec) => spec,
        Err(err) => {
            error!("{}", err);
            exit(1);
        }
    };

    const STACK_SIZE: usize = 4 * 1024 * 1024; // 4 MB
    let stack: &mut [u8] = &mut [0; STACK_SIZE];

    // Take namespaces from config.json
    let spec_namespaces = spec.linux.namespaces.into_iter()
        .map(|ns| to_flags(ns))
        .reduce(|a, b| a | b);

    let clone_flags = match spec_namespaces {
        Some(flags) => flags,
        None => CloneFlags::empty(),
    };

    let child = clone(Box::new(child_fun), stack, clone_flags, None);

    // Parent process
}
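
The snippet references a to_flags helper that isn't shown. A minimal sketch could look like the following; the Namespace type and its ns_type string field are hypothetical stand-ins for however the runtime's Spec models the namespaces array in config.json:

use nix::sched::CloneFlags;

// Hypothetical spec type: in the real runtime this comes from the parsed
// config.json specification.
struct Namespace {
    ns_type: String,
}

fn to_flags(ns: Namespace) -> CloneFlags {
    match ns.ns_type.as_str() {
        "pid" => CloneFlags::CLONE_NEWPID,
        "network" => CloneFlags::CLONE_NEWNET,
        "mount" => CloneFlags::CLONE_NEWNS,
        "ipc" => CloneFlags::CLONE_NEWIPC,
        "uts" => CloneFlags::CLONE_NEWUTS,
        "user" => CloneFlags::CLONE_NEWUSER,
        "cgroup" => CloneFlags::CLONE_NEWCGROUP,
        _ => CloneFlags::empty(),
    }
}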

One problem that arises from the snippet above is the communication between parent and child. If we were spawning a thread, an in-memory channel would solve the issue (Rust has excellent support for multithreading and ships a thread-safe multi-producer, single-consumer channel). That's not the case here, because clone creates a new process with its own, separate memory space.

Inter-process communication (IPC) is a set of techniques that allows processes to communicate with each other. The two most widely used are:

  1. shared memory
  2. sockets

In our example, we'll use Unix sockets (AF_UNIX) to establish a "client-server" channel between the parent and the child processes. The container process binds to the Unix socket and listens for incoming connections from the parent process. Both processes use the socket connection to inform each other when different parts of the execution pass or fail. The socket connection also comes in handy when the start command is invoked, to tell the container process to launch the user-defined program. The following diagram describes the "protocol" better:

(Diagram: the parent-child communication "protocol" over the Unix socket.)

Unix Sockets

For those unfamiliar with Unix (domain) sockets, this Linux feature will hopefully be mind-blowing (at least it was for me). Unix sockets are an inter-process communication mechanism that establishes a two-way data exchange channel between processes running on the same machine. One can think of them as TCP/IP sockets that exchange data through a file on the filesystem instead of going through the network stack.
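
As a quick standalone illustration (using only the standard library, not the runtime's code below), a Unix socket round-trip looks just like TCP, except the "address" is a path; the /tmp/demo.sock path here is purely illustrative:

use std::io::{Read, Write};
use std::os::unix::net::{UnixListener, UnixStream};

fn server() -> std::io::Result<()> {
    // Binding creates the socket file on the filesystem.
    let listener = UnixListener::bind("/tmp/demo.sock")?;
    let (mut conn, _addr) = listener.accept()?;
    let mut msg = String::new();
    conn.read_to_string(&mut msg)?;
    println!("server received: {}", msg);
    Ok(())
}

fn client() -> std::io::Result<()> {
    let mut conn = UnixStream::connect("/tmp/demo.sock")?;
    conn.write_all(b"hello over a unix socket")?;
    Ok(())
}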

In the case of the container runtime, Unix sockets offer a bidirectional data exchange channel between the runtime's parent and child processes. That exchange channel is essential for the container runtime! What if something breaks down in the child process? How can the parent process continue? Or how does the child know when the start command gets invoked?

For these purposes, the container runtime implements IPC channels: bidirectional channels built on Unix domain sockets. One process acts as a "server" and the other process (the "client") connects to it.

To keep the story short, here's a rough idea of how that Rust code might look:

use nix::sys::socket::{
    bind, connect, listen, socket, AddressFamily, SockAddr, SockFlag, SockType,
};
use nix::unistd::{close, read, write};
use std::path::Path;

// Result, Error and ErrorType are the runtime's own error types.
pub struct IpcChannel {
    fd: i32,
    sock_path: String,
    _client: Option<i32>,
}

impl IpcChannel {
    pub fn new(path: &String) -> Result<IpcChannel> {
        let socket_raw_fd = socket(
            AddressFamily::Unix,
            SockType::SeqPacket,
            SockFlag::SOCK_CLOEXEC,
            None,
        )
        .map_err(|_| Error {
            msg: "unable to create IPC socket".to_string(),
            err_type: ErrorType::Runtime,
        })?;

        let sockaddr = SockAddr::new_unix(Path::new(path)).map_err(|_| Error {
            msg: "unable to create unix socket".to_string(),
            err_type: ErrorType::Runtime,
        })?;

        bind(socket_raw_fd, &sockaddr).map_err(|_| Error {
            msg: "unable to bind IPC socket".to_string(),
            err_type: ErrorType::Runtime,
        })?;

        listen(socket_raw_fd, 10).map_err(|_| Error {
            msg: "unable to listen IPC socket".to_string(),
            err_type: ErrorType::Runtime,
        })?;
        Ok(IpcChannel {
            fd: socket_raw_fd,
            sock_path: path.clone(),
            _client: None,
        })
    }

    pub fn connect(path: &String) -> Result<IpcChannel> {
        let socket_raw_fd = socket(
            AddressFamily::Unix,
            SockType::SeqPacket,
            SockFlag::SOCK_CLOEXEC,
            None,
        )
        .map_err(|_| Error {
            msg: "unable to create IPC socket".to_string(),
            err_type: ErrorType::Runtime,
        })?;

        let sockaddr = SockAddr::new_unix(Path::new(path)).map_err(|_| Error {
            msg: "unable to create unix socket".to_string(),
            err_type: ErrorType::Runtime,
        })?;

        connect(socket_raw_fd, &sockaddr).map_err(|_| Error {
            msg: "unable to connect to unix socket".to_string(),
            err_type: ErrorType::Runtime,
        })?;

        Ok(IpcChannel {
            fd: socket_raw_fd,
            sock_path: path.clone(),
            _client: None,
        })
    }

    pub fn accept(&mut self) -> Result<()> {
        let child_socket_fd = nix::sys::socket::accept(self.fd).map_err(|_| Error {
            msg: "unable to accept incoming socket".to_string(),
            err_type: ErrorType::Runtime,
        })?;

        self._client = Some(child_socket_fd);
        Ok(())
    }

    pub fn send(&self, msg: &str) -> Result<()> {
        let fd = match self._client {
            Some(fd) => fd,
            None => self.fd,
        };

        write(fd, msg.as_bytes()).map_err(|err| Error {
            msg: format!("unable to write to unix socket {}", err),
            err_type: ErrorType::Runtime,
        })?;

        Ok(())
    }

    pub fn recv(&self) -> Result<String> {
        let fd = match self._client {
            Some(fd) => fd,
            None => self.fd,
        };
        let mut buf = [0; 1024];
        let num = read(fd, &mut buf).map_err(|err| Error {
            msg: format!("unable to read from unix socket {}", err),
            err_type: ErrorType::Runtime,
        })?;

        match std::str::from_utf8(&buf[0..num]) {
            Ok(str) => Ok(str.trim().to_string()),
            Err(err) => Err(Error {
                msg: format!("error while converting bytes to string: {}", err),
                err_type: ErrorType::Runtime,
            }),
        }
    }

    pub fn close(&self) -> Result<()> {
        close(self.fd).map_err(|_| Error {
            msg: "error closing socket".to_string(),
            err_type: ErrorType::Runtime,
        })?;

        std::fs::remove_file(&self.sock_path).map_err(|_| Error {
            msg: "error removing socket".to_string(),
            err_type: ErrorType::Runtime,
        })?;

        Ok(())
    }
}

The server calls the new method and binds to the .sock file. Then it calls accept and waits for incoming connections. The client, on the other hand, just calls connect with the same .sock file, and from then on the server and client can exchange messages. In the end, both processes call close and the communication is finished.
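
Put together, the exchange between the two processes might look roughly like this; the function names, socket path argument and message strings are illustrative, not the runtime's exact protocol:

// Child / container process side: acts as the server.
fn container_side(sock_path: &String) -> Result<()> {
    let mut channel = IpcChannel::new(sock_path)?; // binds and listens on the .sock file
    channel.accept()?;                             // blocks until the parent connects
    channel.send("namespaces ready")?;             // report progress to the parent
    let _start = channel.recv()?;                  // wait for e.g. the "start" message
    channel.close()
}

// Parent / runtime process side: acts as the client.
fn runtime_side(sock_path: &String) -> Result<()> {
    let channel = IpcChannel::connect(sock_path)?;
    let _status = channel.recv()?;
    channel.send("start")
}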

Note that I've used SOCK_SEQPACKET sockets because they are connection-oriented, preserve message boundaries, and deliver messages in order, so each message arrives as a whole (as opposed to SOCK_STREAM, which is just a byte stream).

Terminal

To have a nice interaction with the container once it starts, the runtime should be able to provide a terminal interface if the user requested one.

When running a Docker command like this one:

docker run alpine ping 8.8.8.8 

you will see the output of the ping command sending ICMP requests to Google's DNS server. The output of the ping command is piped through Docker, but when we try to stop it with Ctrl+C, nothing happens. That's because the SIGINT triggered by that key combination is delivered to the Docker CLI, which doesn't forward it to the actual container process.

On the other side, when running:

docker run -it alpine ping 8.8.8.8

and pressing Ctrl+C, the command terminates immediately, as if it were running on the host machine. Why is that?
That's because in the first example the container process doesn't have a terminal attached, so neither the user nor Docker can forward the signal to the container via a tty.

Luckily for us, the -t option sets the terminal: true flag inside the config.json file. After that, it's the container runtime's responsibility to create a so-called "pseudo-terminal" (pty).

To simplify things, a PTY is a (master-slave) pair of communication devices that acts like a real terminal. Anything sent to the master gets forwarded to the slave end, from text input to terminal signals. PTYs are a very important and widely used feature of the Linux kernel (ssh relies on them!).

Now it's simple:

  1. if terminal: true the container runtime creates a PTY
  2. the slave descriptor goes to the child process (see the sketch after this list)
  3. the master descriptor goes to the calling process (in this case Docker)
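
Here's a rough sketch of the first two steps as they might run inside the child process, using the nix crate's openpty (the nix 0.2x API used throughout this series; newer releases wrap the descriptors in owned types). The master end is kept around for the next step:

use nix::pty::openpty;
use nix::unistd::{dup2, setsid};
use std::os::unix::io::RawFd;

// Runs inside the child process: create the PTY, make the slave end our
// stdio, and hand the master end back so it can be shipped to the caller.
fn setup_child_terminal() -> nix::Result<RawFd> {
    let pty = openpty(None, None)?;

    // Start a new session and wire the slave end to stdin/stdout/stderr so
    // the user-defined process talks to the terminal.
    setsid()?;
    dup2(pty.slave, 0)?;
    dup2(pty.slave, 1)?;
    dup2(pty.slave, 2)?;

    Ok(pty.master)
}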

But how does the child process send the master descriptor to Docker?

Sigh… This was a real PITA to figure out, and the solution lies outside the scope of the OCI runtime spec. runc developed a solution for which the steps are described here. TL;DR: our friend the Unix socket comes to the rescue. Docker creates a Unix domain socket and passes its path to the container runtime as the console-socket argument. After the container runtime creates the PTY, it sends the master end over that same Unix socket using SCM_RIGHTS.
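
A hedged sketch of that last hop, again with the older nix API: connect to the console socket whose path was passed in, then ship the PTY master as ancillary data. The SOCK_STREAM type for the console socket and the b"pty" payload are assumptions for illustration:

use nix::sys::socket::{
    connect, sendmsg, socket, AddressFamily, ControlMessage, MsgFlags, SockAddr, SockFlag,
    SockType,
};
use nix::sys::uio::IoVec;
use std::os::unix::io::RawFd;
use std::path::Path;

// Connect to the socket passed via the console-socket argument and send the
// PTY master fd as SCM_RIGHTS ancillary data.
fn send_pty_master(console_socket_path: &str, master: RawFd) -> nix::Result<()> {
    let console_fd = socket(
        AddressFamily::Unix,
        SockType::Stream,
        SockFlag::empty(),
        None,
    )?;
    let addr = SockAddr::new_unix(Path::new(console_socket_path))?;
    connect(console_fd, &addr)?;

    // The fd travels as ancillary data; it must be accompanied by at least
    // one byte of regular data.
    let fds = [master];
    let iov = [IoVec::from_slice(b"pty")];
    sendmsg(console_fd, &iov, &[ControlMessage::ScmRights(&fds)], MsgFlags::empty(), None)?;

    Ok(())
}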

Conclusion

Finally, we have a ready-to-test OCI container runtime! 

This part explained the clone syscall, how it detaches the execution context from the parent process, and how its flexible API lets us supply the new process's stack.
Unix domain sockets play a big role here: they synchronize the whole parent-child communication and let both sides report errors when something goes wrong.

Part II rounds out the Container Runtime in Rust series. The whole source code for the experimental container runtime can be found in this GitHub repo. Feel free to ask questions or point out interesting things in the implementation.
