
Akira Moroo

Posted on • Updated on • Originally published at retrage.github.io

Linux Kernel Library Nabla Containers Internals

This post describes the design and implementation of Linux Kernel Library Nabla Containers (LKL Nabla), Linux-based unikernels that run as processes. The previous post introduced LKL Nabla and explained how to build and run it. Since most of the unikernel work is done by frankenlibc LKL/musl, this post focuses mainly on the frankenlibc Solo5 port.

You can find LKL Nabla code at:

Modifications to runnc

Before diving into frankenlibc code, let’s take a look at the modifications to runnc.

When runnc is executed, it initializes devices that will be used by the container. Then, the runtime builds arguments and launches a container as a process.

Which devices are provided? The current runnc implementation can provide only one network device and one block device. The same is true of LKL Nabla.

A container manager like Docker pulls a container image and extracts it into a rootfs directory. runnc then creates a disk image from that directory. With Rumprun the disk image format is ISO, but the default file system in LKL is ext4, so LKL Nabla switches the image format to ext4.
For the implementation, see CreateExt4().

Rumprun accepts a JSON config passed as an argument at run time, and the original runnc builds this config during container initialization. LKL also accepts a JSON config at run time, but its format is quite different from Rumprun's. LKL Nabla's runnc therefore builds a config in the LKL format instead.
llruntimes/nabla/runnc-cont/lkl.go is the config builder for LKL.

After initialization, runnc launches the unikernel process through the Solo5 tender, like this:

    var args []string
    args = []string{r.NablaRunBin,
        "--mem=" + strconv.FormatInt(r.Memory, 10),
        "--net:tap=" + r.Tap,
        "--block:rootfs=" + disk,
        r.UniKernelBin}
    args = append(args, "__RUMP_FDINFO_NET_tap=4")
    args = append(args, r.Env...)
    args = append(args, "--config")
    args = append(args, unikernelArgs)
    args = append(args, "--")
    args = append(args, r.NablaRunArgs...)
    // snip
    err = syscall.Exec(r.NablaRunBin, args, newenv)

frankenlibc

Now it is time to dive into the Solo5 port of frankenlibc. frankenlibc was originally a set of tools for running Rumprun unikernels in userspace. It was later forked to add LKL/musl support. LKL Nabla uses this fork to run LKL on Solo5.

The architecture of frankenlibc is shown below.

frankenlibc Layers
Application
musl libc
LKL
librumpuser
franken
platform
Host

The application sits at the top of the seven layers and the host at the bottom. The host-dependent layer is platform, and its code lives in the platform directory. To port frankenlibc to a new host, you add code to this layer.

The interfaces that platform code should provide are the same as Linux system calls. Here is the list.

int __libc_start_main(int (*main)(int,char **,char **), int argc, char **argv);
void _exit(int status);
int clock_getres(clockid_t clk_id, struct timespec *tp);
int clock_gettime(clockid_t clk_id, struct timespec *tp);
int clock_nanosleep(clockid_t clk_id, int flags, const struct timespec *request, struct timespec *remain);
int fcntl(int fd, int cmd, ...);
int fstat(int fd, struct stat *st);
int fsync(int fd);
int getpagesize(void);
int getrandom(void *buf, size_t size, unsigned int flags);
int kill(pid_t pid, int sig);
off_t lseek(int fd, off_t offset, int whence);
void *mmap(void *addr, size_t length, int prot, int nflags, int fd, off_t offset);
int munmap(void *addr, size_t length);
int mprotect(void *addr, size_t length, int prot);
int poll(struct pollfd *fds, nfds_t n, int timeout);
ssize_t pread(int fd, void *buf, size_t count, off_t offset);
ssize_t preadv(int fd, const struct iovec *iov, int iovcnt, off_t off);
ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);
ssize_t pwritev(int fd, const struct iovec *iov, int iovcnt, off_t off);
ssize_t read(int fd, void *buf, size_t count);
ssize_t readv(int fd, const struct iovec *iov, int iovcnt);
ssize_t write(int fd, const void *buf, size_t count);
ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

This list looks much larger than Solo5's interface, which exposes only 13 hypercalls to the guest OS, but some of the functions in the list are optional. Porting LKL/musl to Solo5 means implementing the platform code on top of those hypercalls.

Entry Point

solo5_start_main() is the entry point of a Solo5 guest. The Solo5 tender starts the guest from this function. Its argument is a pointer to struct solo5_start_info, which contains cmdline, heap_start and heap_size.

struct solo5_start_info {
    const char *cmdline;
    uintptr_t heap_start;
    size_t heap_size;
};

cmdline is the argument string passed when the unikernel process is launched. Because frankenlibc expects envp and argv to be passed from the host, cmdline is parsed into envp and argv during initialization.
rexec, a launch tool for frankenlibc, can pass a JSON config through a file descriptor. The FD value is shared through an environment variable (e.g. __RUMP_FDINFO_CONFIGJSON). However, this method cannot be applied to the Solo5 port because no environment variables can be shared with Solo5 guests. Therefore, the Solo5 port passes the JSON config as a string in cmdline.
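
As a rough illustration of splitting cmdline into envp and argv, here is a minimal sketch. It is not the code from the port: the struct, the function name, and the splitting rule (leading NAME=value tokens become envp, everything from the first token without '=' becomes argv) are assumptions made for this example, and it does no quoting, so values containing spaces would need extra handling.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TOKENS 32

/* Hypothetical container for the split result; not a frankenlibc type. */
struct parsed_cmdline {
    char *envp[MAX_TOKENS + 1]; /* NULL-terminated */
    char *argv[MAX_TOKENS + 1]; /* NULL-terminated */
    int envc;
    int argc;
};

/*
 * Split buf in place on spaces. Leading NAME=value tokens go to envp;
 * the first token without '=' and every token after it go to argv.
 * Assumption: no quoting, so a value must not contain spaces.
 */
void parse_cmdline(char *buf, struct parsed_cmdline *out)
{
    int in_env = 1;
    char *p = buf;

    out->envc = 0;
    out->argc = 0;
    while (*p != '\0') {
        while (*p == ' ')
            p++;                    /* skip separators */
        if (*p == '\0')
            break;
        char *tok = p;
        while (*p != '\0' && *p != ' ')
            p++;
        if (*p == ' ')
            *p++ = '\0';            /* terminate the token in place */

        if (in_env && strchr(tok, '=') == NULL)
            in_env = 0;             /* first non-assignment ends envp */
        if (in_env) {
            if (out->envc < MAX_TOKENS)
                out->envp[out->envc++] = tok;
        } else {
            if (out->argc < MAX_TOKENS)
                out->argv[out->argc++] = tok;
        }
    }
    out->envp[out->envc] = NULL;
    out->argv[out->argc] = NULL;
}
```

The buffer is modified in place, so the caller has to pass a writable copy of cmdline rather than the const pointer from struct solo5_start_info.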

The other fields, heap_start and heap_size, describe the heap provided by the tender. They are used to initialize the memory manager. In this Solo5 port, the memory manager is a simple buddy allocator taken from mini-os, and it backs the mmap()/munmap() platform code.
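
To make the role of heap_start and heap_size concrete, the sketch below serves page-aligned allocations out of a tender-provided region. Note that it deliberately swaps in a bump allocator, which is much simpler than the buddy allocator the port actually uses and cannot free memory. The names heap_init and heap_mmap and the 4096-byte page size are assumptions for this example only.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096u /* assumed page size */

static uintptr_t heap_cur; /* next free byte */
static uintptr_t heap_end; /* one past the last usable byte */

/* Round v up to the next multiple of PAGE_SIZE. */
static uintptr_t page_align_up(uintptr_t v)
{
    return (v + PAGE_SIZE - 1) & ~(uintptr_t)(PAGE_SIZE - 1);
}

/* Record the region reported by the tender (heap_start, heap_size). */
void heap_init(uintptr_t heap_start, size_t heap_size)
{
    heap_cur = page_align_up(heap_start);
    heap_end = heap_start + heap_size;
}

/*
 * Anonymous-mmap stand-in: returns page-aligned memory, or NULL when the
 * region is exhausted. A bump allocator never reuses memory, so a matching
 * munmap() would be a no-op here; the buddy allocator in the port can
 * actually release pages.
 */
void *heap_mmap(size_t length)
{
    uintptr_t len = page_align_up(length);

    if (length == 0 || len < length)        /* empty request or overflow */
        return NULL;
    if (heap_cur > heap_end || len > heap_end - heap_cur)
        return NULL;                        /* not enough room left */

    void *p = (void *)heap_cur;
    heap_cur += len;
    return p;
}
```

A buddy allocator keeps the same entry points but tracks free blocks by power-of-two size, which is what makes munmap() meaningful.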

Devices

On *nix systems, most devices are represented as files, and operations on them are reads and writes on a file descriptor. frankenlibc follows the same convention in its platform code.
rexec opens the devices and passes their FD numbers through environment variables (e.g. __RUMP_FDINFO_NET_tap). This is the same mechanism used for the JSON config. The franken layer registers devices from this FD info in fdinit().
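
The sketch below shows one way such an FD info entry could be decoded. The pattern `__RUMP_FDINFO_<KIND>[_<name>]=<fd>` is inferred only from the two variable names mentioned in this post, and struct fdinfo and parse_fdinfo() are hypothetical names, so treat this as an illustration rather than what fdinit() actually does.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define FDINFO_PREFIX "__RUMP_FDINFO_"

/* Hypothetical record for one parsed entry; not a frankenlibc type. */
struct fdinfo {
    char kind[16]; /* e.g. "NET" */
    char name[32]; /* e.g. "tap"; empty when the entry has no name part */
    int fd;
};

/*
 * Parse one "__RUMP_FDINFO_<KIND>[_<name>]=<fd>" entry (assumed pattern).
 * Returns 0 on success, -1 if the entry does not match.
 */
int parse_fdinfo(const char *entry, struct fdinfo *out)
{
    size_t plen = strlen(FDINFO_PREFIX);

    if (strncmp(entry, FDINFO_PREFIX, plen) != 0)
        return -1;

    const char *key = entry + plen;
    const char *eq = strchr(key, '=');
    if (eq == NULL || eq == key)
        return -1;

    /* The first '_' before '=' separates the kind from the device name. */
    const char *us = memchr(key, '_', (size_t)(eq - key));
    size_t klen = (size_t)((us ? us : eq) - key);
    size_t nlen = us ? (size_t)(eq - us - 1) : 0;
    if (klen == 0 || klen >= sizeof out->kind || nlen >= sizeof out->name)
        return -1;

    memcpy(out->kind, key, klen);
    out->kind[klen] = '\0';
    if (nlen > 0)
        memcpy(out->name, us + 1, nlen);
    out->name[nlen] = '\0';

    /* The value must be a plain non-negative decimal FD number. */
    char *end;
    long v = strtol(eq + 1, &end, 10);
    if (end == eq + 1 || *end != '\0' || v < 0 || v > 1024)
        return -1;
    out->fd = (int)v;
    return 0;
}
```

With the entry that runnc passes, `__RUMP_FDINFO_NET_tap=4`, this yields kind "NET", name "tap" and FD 4.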

In Solo5, the devices a guest can use at run time must be declared at build time. Building a guest requires a JSON application manifest, manifest.json, which declares the user-specified devices. frankenlibc's rexec, by contrast, can specify devices at run time. As described above, the current runnc handles one block device and one network device. Therefore, the Solo5 port uses a fixed manifest.json that declares one block device, rootfs, and one network device, tap. The manifest is shown below.

{
  "type": "solo5.manifest",
  "version": 1,
  "devices":
  [
    { "name": "rootfs", "type": "BLOCK_BASIC" },
    { "name": "tap", "type": "NET_BASIC" }
  ]
}

A device in Solo5 is represented by a solo5_handle_t, not by a file descriptor. Because the devices are fixed, the frankenlibc Solo5 port assigns a virtual FD number to each Solo5 device handle.

Solo5 provides interfaces for reading and writing devices and for writing to the console. The read()/write() platform code inspects the FD number and calls the appropriate hypercall.
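
The dispatch can be pictured with a small table from virtual FD to device. In the sketch below, the stub_* functions are stand-ins for the Solo5 hypercalls so that the example builds without the Solo5 headers. The table layout is an assumption: only FD 4 for tap appears in the runnc arguments above, while FD 1 for the console, FD 3 for rootfs and the handle values are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t handle_t; /* stand-in for solo5_handle_t */

enum dev_kind { DEV_NONE, DEV_CONSOLE, DEV_BLOCK, DEV_NET };

struct vfd {
    enum dev_kind kind;
    handle_t handle;
};

/* Assumed virtual FD layout; unlisted slots default to DEV_NONE. */
static const struct vfd vfd_table[] = {
    [1] = { DEV_CONSOLE, 0 },
    [3] = { DEV_BLOCK, 0 },
    [4] = { DEV_NET, 1 },
};

/* Records which stub ran last, so the dispatch can be observed. */
enum dev_kind last_hypercall = DEV_NONE;

/* Stand-in for the console write hypercall. */
static void stub_console_write(const char *buf, size_t size)
{
    (void)buf; (void)size;
    last_hypercall = DEV_CONSOLE;
}

/* Stand-in for the network write hypercall; 0 means success. */
static int stub_net_write(handle_t h, const uint8_t *buf, size_t size)
{
    (void)h; (void)buf; (void)size;
    last_hypercall = DEV_NET;
    return 0;
}

/* write() stand-in: choose the hypercall from the virtual FD. */
long platform_write(int fd, const void *buf, size_t count)
{
    if (fd < 0 || (size_t)fd >= sizeof vfd_table / sizeof vfd_table[0])
        return -1;

    const struct vfd *v = &vfd_table[fd];
    switch (v->kind) {
    case DEV_CONSOLE:
        stub_console_write(buf, count);
        return (long)count;
    case DEV_NET:
        return stub_net_write(v->handle, buf, count) == 0 ? (long)count : -1;
    default:
        /* This sketch leaves block devices to offset-based calls
         * such as pwrite(), and rejects unknown FDs. */
        return -1;
    }
}
```

A fixed table is enough here because the manifest fixes the set of devices at build time.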

poll() and clock_nanosleep() are used to wait for network packets. In frankenlibc, each network device has a pollfd that stores its polling state. In the Solo5 port, solo5_yield() is used to implement poll() and clock_nanosleep(). The behavior is almost the same as in the Linux port.

clock_nanosleep() calls solo5_yield(); if the network handle is set in ready_set, it updates pollfd.revents and wakes the associated thread. poll() sleeps until the timeout and sets the FD's revents if pollfd.revents was updated.
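
The interplay between the two functions can be sketched as follows. stub_yield() is a stand-in shaped like a yield that reports ready handles as a bitmask; it is not the Solo5 hypercall itself. The function names, the handle index and the FD number are assumptions for this example, and thread wake-up is left out.

```c
#include <poll.h>
#include <stdint.h>

typedef uint64_t handle_set_t; /* stand-in for the ready-set bitmask */
typedef uint64_t time_ns_t;    /* stand-in for a deadline in nanoseconds */

#define NET_HANDLE 1u /* hypothetical handle index of the tap device */

/* Polling state kept for the network device (virtual FD 4 assumed). */
struct pollfd net_pollfd = { .fd = 4, .events = POLLIN, .revents = 0 };

/* Test hook: the ready set that the stubbed yield will report. */
handle_set_t stub_ready = 0;

/* Stand-in for the yield hypercall: wait until deadline or readiness. */
static void stub_yield(time_ns_t deadline, handle_set_t *ready_set)
{
    (void)deadline;
    *ready_set = stub_ready;
}

/* clock_nanosleep() stand-in: yield, then record readiness in revents. */
void platform_sleep_until(time_ns_t deadline)
{
    handle_set_t ready = 0;

    stub_yield(deadline, &ready);
    if (ready & ((handle_set_t)1 << NET_HANDLE))
        net_pollfd.revents |= POLLIN; /* the port also wakes the waiter */
}

/* poll() stand-in: sleep, then hand any recorded event to the caller.
 * Returns the number of descriptors with events. */
int platform_poll(struct pollfd *fds, int nfds, time_ns_t deadline)
{
    int n = 0;

    platform_sleep_until(deadline);
    for (int i = 0; i < nfds; i++) {
        fds[i].revents = 0;
        if (fds[i].fd == net_pollfd.fd &&
            (net_pollfd.revents & fds[i].events)) {
            fds[i].revents = net_pollfd.revents & fds[i].events;
            n++;
        }
    }
    net_pollfd.revents = 0; /* event consumed */
    return n;
}
```

Keeping the readiness flag in a per-device pollfd lets the sleep path and the poll path share state without the caller knowing about Solo5 handles.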

Conclusion

This post summarized the internals of LKL Nabla. Most of the implementation is straightforward thanks to frankenlibc's platform-independent interfaces and Solo5's simple hypercalls. However, since LKL's interfaces differ from Rumprun's, the patches to runnc for the LKL port are quite large. It would be better for runnc to have an option for switching between Rumprun and LKL.
