DEV Community: thesystemsprogrammer

The LANd of Computer Networking

thesystemsprogrammer — Wed, 26 May 2021 04:40:47 +0000

To understand how your laptop at home talks to Google’s servers in a galaxy far far away, you first need to understand how your laptop at home (or in a data center) talks to another computer located in the same physical location. To do that, all the computers must be connected in some way so that they can transfer signals to each other. Let’s assume that we’re only interested in wired connections at the moment.

The LANd

One way to connect computers to each other is to give every computer a direct connection to every other computer. Let’s say we have 3 computers: computer A would need 2 ports so that it can talk to computers B and C.

Doesn’t seem so bad, right? Well, let’s now think about a network with 100 computers. Each computer would need 99 ports. In total, that’s 9,900 ports (100 × 99) for this network! Clearly, this model would only work at a small scale.

Another way to handle this is to have a centralized hub. All data sent by a computer is received by the hub. The hub then forwards all data to all the other computers connected to the hub. The intended computer receives the data and all other computers drop the data. Now how many connections do we need? For 100 computers on this network, we’d only need 100 connections! Each computer just needs to be wired to the hub and it will be able to communicate with any other computer on the network. This arrangement, in which all computers in a network are connected to one central hub, is known as the “star” network topology.
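The port arithmetic above is easy to check with a few lines of JavaScript. This is just a sketch; the function names fullMeshPorts, fullMeshCables and starCables are made up for illustration.

```javascript
// Hypothetical helpers for comparing the two wiring schemes.

// Full mesh: every computer needs one port per other computer.
function fullMeshPorts(n) {
    return n * (n - 1);
}

// Each cable joins two ports, so a full mesh needs half as many cables as ports.
function fullMeshCables(n) {
    return fullMeshPorts(n) / 2;
}

// Star: one cable from each computer to the hub.
function starCables(n) {
    return n;
}

console.log(fullMeshPorts(100));  // 9900
console.log(fullMeshCables(100)); // 4950
console.log(starCables(100));     // 100
```

The full mesh grows with the square of the number of computers, while the star grows linearly.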

One downside of this approach, however, is that each computer receives data that is not intended for it. The hub doesn’t know anything about where its incoming data should be routed to; it only knows to forward all data it receives to all other computers on the network. This may be a security issue, and it also wastes bandwidth: while computer A is trying to talk to computer B, the network could be busy delivering copies of a message from computer C to computer D to everyone else. The problem gets even worse as more computers are connected to the hub. For every additional computer connected to the hub, every message sent on the network needs to be delivered to one additional computer.

How can we solve this problem? Let’s break our 100-computer hub network into two 50-computer hub networks and connect them to each other using a bridge. Let’s call the first one Network 1 and the second one Network 2. The goal is to still allow all 100 computers to talk to each other, while messages sent within Network 1 don’t need to be broadcast to the computers on Network 2, and vice versa. The hubs on Network 1 and Network 2 are now going to be connected to each other using this bridge.

The bridge is slightly smarter than a hub. It doesn’t just blindly forward packets to computers that aren’t the destination. Instead, it keeps track of which computers are on Network 1 and which are on Network 2 in a forwarding table. The table is built through some added support from the computers on the network and a learning mechanism. To make this possible, each computer on the network is assigned an address. In practice, these addresses are known as MAC addresses. When a computer wants to send a message, it includes the address of the sender and the address of the receiver in the message.

Initially, say computer A on Network 1 wants to send data to computer B on Network 2. Computer A will send the data to the hub which will broadcast the message to all computers on Network 1. Additionally, the hub will forward the message to the bridge. The bridge will then note that computer A sent it information on port 1 which is associated with Network 1. The bridge needs to know whether it should drop or forward the message on to Network 2. Because the bridge doesn’t yet have any information of which network computer B is on, it forwards the information to the hub on Network 2 even though computer B may not be in Network 2. In turn, the hub on Network 2 broadcasts the message to all computers on the network and computer B is one of those computers that will receive the message.

Now, say computer C, a computer on Network 1, wants to send a message to computer A (which is also on Network 1). The hub in Network 1 receives the message and sends it to all computers on Network 1. Additionally, it sends the message to the bridge. The bridge recognizes that the message is intended for computer A. Previously, it had noted that computer A sent data from Network 1, so the bridge knows it can safely drop the message instead of forwarding it to Network 2.

Why is this more efficient? Once a computer in Network 1 has sent its first message, any later message addressed to it from another computer in Network 1 does not need to be broadcast to Network 2, which saves bandwidth.
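Here is a rough sketch of that learning behaviour in JavaScript. It is a toy model, not how any real bridge is implemented: the LearningBridge class, its handle method and the frame objects with src and dst fields are all invented for this example.

```javascript
// A toy two-port learning bridge. Port 1 faces Network 1, port 2 faces Network 2.
class LearningBridge {
    constructor() {
        this.table = new Map(); // address -> port it was last seen on
    }

    // Returns "forward" or "drop" for a frame arriving on inPort.
    handle(frame, inPort) {
        // Learn: the sender must be reachable through the port the frame came in on.
        this.table.set(frame.src, inPort);

        const knownPort = this.table.get(frame.dst);
        if (knownPort === inPort) {
            return "drop"; // destination is on the side the frame came from
        }
        return "forward"; // unknown destination, or destination on the other side
    }
}

const bridge = new LearningBridge();
// A (Network 1) -> B: the bridge has not seen B yet, so it forwards.
console.log(bridge.handle({ src: "A", dst: "B" }, 1)); // forward
// C (Network 1) -> A: A was learned on port 1, so the frame is dropped.
console.log(bridge.handle({ src: "C", dst: "A" }, 1)); // drop
// B (Network 2) -> A: A is on the other side, so the frame is forwarded.
console.log(bridge.handle({ src: "B", dst: "A" }, 2)); // forward
```

The three calls mirror the walkthrough above: the first message is forwarded because nothing is known yet, and the second is dropped because computer A has already revealed which side it is on.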

Switching it Up

Now, I hate to switch things up on you, but everything we just discussed, that’s old tech! But, it’s important to understand that so you can appreciate what is used in modern networking. Introducing the packet switch!

The packet switch has mostly made the physical use of bridges and hubs obsolete, although the concepts they were built upon are still used today. One thing you may have wondered is: why don’t we put the intelligence of a bridge, which knows whether to forward or drop a message, into a hub? A packet switch has that functionality! It acts as a hub but is able to make smarter forwarding decisions, which further reduces bandwidth consumption. When computer Z on Network 3, which is connected by a packet switch, wants to send a message to computer Y on Network 3, the packet switch forwards the message directly to computer Y instead of also sending it to computer X. It can do this because it keeps forwarding information and learns it in a similar way to how bridges learn theirs.
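To make the difference from a hub concrete, here is a small JavaScript sketch of a switch with one computer per port. The LearningSwitch class and the port numbering are assumptions made for this example, and a real switch does this in hardware.

```javascript
// A toy learning switch: one port per computer.
class LearningSwitch {
    constructor(portCount) {
        this.portCount = portCount;
        this.table = new Map(); // address -> port
    }

    // Returns the list of ports the frame is sent out of.
    handle(frame, inPort) {
        this.table.set(frame.src, inPort); // learn where the sender lives
        if (this.table.has(frame.dst)) {
            return [this.table.get(frame.dst)]; // known destination: one port only
        }
        // Unknown destination: behave like a hub and send to every other port.
        const out = [];
        for (let p = 0; p < this.portCount; p++) {
            if (p !== inPort) out.push(p);
        }
        return out;
    }
}

const sw = new LearningSwitch(3); // ports 0, 1, 2 are computers X, Y, Z
// Y -> Z: Z is unknown, so the frame goes out on ports 0 and 2.
console.log(sw.handle({ src: "Y", dst: "Z" }, 1));
// Z -> Y: Y was learned on port 1, so only port 1 gets the frame.
console.log(sw.handle({ src: "Z", dst: "Y" }, 2));
```

Once every computer has sent at least one message, each frame goes to exactly one port, and computer X never sees traffic between Y and Z.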

Packet switches take advantage of dedicated hardware that makes the forwarding calculation extremely fast. Because of this, they are generally more expensive than hubs and bridges.

Conclusion

Now that we know how computers in the same physical location can talk to each other, let’s explore how computers in different physical locations talk to each other. This is what the next post will be about.

Virtualizing Memory

thesystemsprogrammer — Mon, 24 May 2021 00:05:40 +0000

In the last article, we asked ourselves how the operating system gives each process the illusion that it has its own address space even though there is only one physical RAM. This is one of the most important and most complicated virtualization techniques that the operating system performs. Because of that, we will discuss memory virtualization in three separate articles. The goals of the operating system with respect to memory are as follows:

Give processes a contiguous address space
Give processes memory isolation from each other
Do both of the above efficiently with respect to memory usage and processing speed

Let’s first talk about what memory is. Memory can be thought of as a series of slots. On a 32-bit machine it is convenient to think of 32-bit slots, and on a 64-bit machine of 64-bit slots. The size of a computer’s memory can be quite large. For example, on my Mac, the size of the memory is 8 gigabytes. Each memory slot is numbered, and that number is its address. Viewed as 64-bit slots, 8 gigabytes of memory has about 1 billion slots, whose addresses run from 0 to roughly 8,000,000,000 in steps of 8 bytes (64 bits).

Diving a bit deeper, what does a specific process’s memory address space look like? There are a few important sections in a program’s memory: the stack, the heap, the code, and the data. At the bottom is the text segment; this is where the code lives. Above that is the data segment; variables that are global or static live here. Above that is the heap segment. This is where data allocated at runtime (using the malloc call in C, for example) is located. The heap grows upward: memory is allocated at the bottom first and grows toward the top. Conversely, the stack sits above the heap and grows downward. The stack is used for local variables, which exist only for the duration of their scope and are automatically freed when the scope is exited.

Contiguous Address Space

It’s important for a process to have the illusion that it has a contiguous address space. If it doesn’t, then pointer arithmetic cannot work and we wouldn’t be able to allocate memory larger than a word at a time. A programmer needs a consistent view of memory to be able to write code that behaves deterministically. For example, the program needs to know that the calling function’s stack frame is above the callee function’s stack frame. Otherwise, when a function returns, the program wouldn’t know where to find the next instruction to execute.

One way to give programs a contiguous address space is by dividing the computer’s total available memory (or RAM, random access memory) into a fixed number of chunks and giving each process one of these chunks.

The obvious downside of this approach is that we may only be able to have a fixed number of processes running at a particular time. For example, if I divided up my computer’s 8 gigabytes of RAM into 1 gigabyte per process, then I would only be able to have 8 processes running simultaneously. The 9th process wouldn’t have a chunk of memory readily available to it! Additionally, there will be a large amount of wasted memory since all these 8 processes would likely not use up the entire address space.

How can we minimize the amount of unused memory and make it so that any process can access a free memory address? One way to do this is to add a layer of indirection that gives each process the illusion that it has a contiguous address space while its memory may be fragmented in the physical address space. This layer of indirection would be a translation table. The operating system could have a table that translates addresses as viewed by the process (virtual addresses) into addresses as viewed by the actual hardware (physical addresses). Every time the process reads or writes memory, it needs to ask the operating system to look at the translation table to determine which physical address corresponds to the virtual address the process is reading or writing.

One huge downside of this approach is that every memory access has to go through the operating system! Since most programs out there need to access memory frequently, having the operating system maintain this data structure and check it on every memory access would take up a lot of CPU cycles. How can we make this lookup faster? One way to make it faster is by having this translation table live in hardware and have the hardware do the translation instead of the operating system doing it. This would be much faster!

A common implementation of this is to have a translation lookaside buffer (TLB) in hardware. The TLB holds virtual-to-physical address mappings and, in the design described here, is updated by the operating system. When the hardware looks up a virtual address in the TLB and the translation doesn’t exist, the hardware executes a handler in the operating system. The operating system fills the TLB with the correct address and the instruction re-executes. Now, the user process does not need to request an address translation from the operating system every time it reads or writes memory. Instead, the hardware does the memory address translation under the hood.

Let’s see how this might work in practice. Let’s say a user process wants to access memory at address 0x00 for the first time. The CPU attempts to access the address 0x00 by first looking for a translation in the TLB. Because this is the first time the program is accessing 0x00, it doesn’t find any entry for 0x00 in the TLB and it jumps to execute instructions in the operating system. The operating system then checks whether the process can still allocate memory (that is, it is not at its memory limit). If it can, the operating system finds a physical address that is unused by any other process. To do this, the operating system needs a record of which physical addresses are available. Once the operating system finds an available physical address, let’s say 0x10, it fills the TLB with the virtual-to-physical mapping 0x00 -> 0x10. Then it resumes execution of the CPU instruction. The next time the CPU wants to read the data at virtual address 0x00, it looks it up in the TLB, finds that the physical address is 0x10, and is able to fetch the value correctly.
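The miss-then-fill flow can be sketched in JavaScript. Everything here is a toy model under my own assumptions: the tlb map, the freePhysical list and the function names are hypothetical, and real hardware obviously does not run JavaScript.

```javascript
// Toy model of a TLB that the operating system fills on a miss.
const tlb = new Map();                    // virtual address -> physical address
const freePhysical = [0x10, 0x11, 0x12];  // pretend these physical addresses are free
let misses = 0;

// Stand-in for the operating system's TLB-miss handler.
function handleTlbMiss(virtualAddr) {
    misses++;
    if (freePhysical.length === 0) {
        throw new Error("out of memory");
    }
    // Pick an unused physical address and record the mapping.
    tlb.set(virtualAddr, freePhysical.shift());
}

// Stand-in for what the hardware does on every memory access.
function translate(virtualAddr) {
    if (!tlb.has(virtualAddr)) {
        handleTlbMiss(virtualAddr); // trap into the OS, then retry the lookup
    }
    return tlb.get(virtualAddr);
}

console.log(translate(0x00).toString(16)); // 10 (miss, filled by the OS)
console.log(translate(0x00).toString(16)); // 10 (hit, no OS involvement)
console.log(misses);                       // 1
```

Only the first access pays the cost of entering the operating system; the second one is served straight from the table.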

To make all this possible, the operating system needs a data structure that maps virtual memory addresses to physical memory addresses. Wait a minute...this means that for every memory address, we need another memory address to tell us what physical address a virtual address maps to (if it maps to one at all). Whoa, this means that we lose half the memory addresses available to us.

Is there a way we can use memory a little bit more efficiently so we don’t lose half of it to operating system accounting overhead? Yes, there is actually! What if, instead of keeping one mapping entry per address, we map things in larger chunks? For example, if we have a chunk size of 256 addresses, then we would have one entry covering addresses 0 - 255, another entry covering addresses 256 - 511, etc. What is the space cost of this scheme? Instead of having to reserve half our memory for the virtual address to physical address mapping, we spend roughly 1/256 as much memory on it. This is a huge win and the chunk size parameter can be tuned! I call this a chunk but it is more commonly referred to as a page, and the technique is referred to as paging. Let’s see an example of how all this works.

Let’s say that our program wants to access memory address 0x0110 (address 272 in decimal). If our page size is 256, then the TLB will use the last 8 bits (the last two hex digits) as an offset into the page, and the remaining bits to find the virtual to physical page mapping.

The TLB will first identify the offset which is 16 in decimal (0x10). Next, it will identify the remaining bits used to find the virtual to physical page mapping. The remaining bits are 0x01 in hex which corresponds to the virtual to physical page table mapping at index 1.

The TLB will take the physical page frame and append the offset to generate the physical memory address where the data the CPU is requesting can be found.
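The same split can be written out with bit operations in JavaScript. The page table contents below (virtual page 1 mapping to physical frame 0x03) are an assumption made up for this example, as is the function name translatePaged.

```javascript
const PAGE_SIZE = 256;  // addresses per page, as in the example above
const OFFSET_BITS = 8;  // log2(256)

// Hypothetical page table: index = virtual page number, value = physical frame number.
const pageTable = [0x07, 0x03];

function translatePaged(virtualAddr) {
    const pageNumber = virtualAddr >> OFFSET_BITS;  // the upper bits
    const offset = virtualAddr & (PAGE_SIZE - 1);   // the lower 8 bits
    const frame = pageTable[pageNumber];
    if (frame === undefined) {
        throw new Error("no mapping for page " + pageNumber);
    }
    // Put the offset back underneath the physical frame number.
    return (frame << OFFSET_BITS) | offset;
}

console.log((0x0110 >> OFFSET_BITS).toString(16));     // 1   (page number)
console.log((0x0110 & (PAGE_SIZE - 1)).toString(16));  // 10  (offset)
console.log(translatePaged(0x0110).toString(16));      // 310 (physical address)
```

Because the page size is a power of two, splitting and rejoining the address is just a shift and a mask, which is part of why hardware can do it so quickly.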

Great, we now have an efficient way of giving each process the illusion that it has a contiguous address space and exclusive access to the physical hardware when it really doesn’t! The next thing we need to worry about is memory isolation between processes.

Memory Isolation

What’s stopping a malicious process (or a buggy process) from accessing an area of memory that wasn’t allocated to it? Easy! We can reuse the address translation mechanism we built for creating a contiguous address space. In the TLB, we can include data about which process each entry belongs to. If the active process running an instruction doesn’t match the TLB entry, then the CPU will trap into the operating system, which can then deliver a segmentation fault (SEGFAULT) to the offending process.
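As a sketch of the idea, here is a JavaScript lookup where each entry remembers its owner. The entries table, the process IDs and the lookup function are hypothetical. A real system would treat a missing entry as an ordinary TLB miss first; this toy folds both cases into one error to stay short.

```javascript
// Toy translation entries tagged with the process that owns them.
const entries = new Map([
    [0x01, { pid: 1, frame: 0x03 }],
    [0x02, { pid: 2, frame: 0x09 }],
]);

// Returns the physical frame, or throws if the active process does not own the page.
function lookup(activePid, virtualPage) {
    const entry = entries.get(virtualPage);
    if (entry === undefined || entry.pid !== activePid) {
        throw new Error("segmentation fault");
    }
    return entry.frame;
}

console.log(lookup(1, 0x01)); // 3
try {
    lookup(1, 0x02); // page 0x02 belongs to process 2
} catch (err) {
    console.log(err.message); // segmentation fault
}
```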

Demand Paging

So...are we done? Is that what all modern operating systems do? Nope, one more question! Our computers are not limited to a fixed number of processes, so how do they do it while still providing memory isolation between processes? The answer to this is demand paging. Demand paging was invented during the multiprogramming era of operating systems and first appeared in the Atlas operating system. The trick is: we can move the memory of a process that isn’t running out to disk. When the process accesses that memory again, the operating system swaps it from disk back into memory. We’re effectively swapping pages of memory out to disk when they are not needed and bringing them back when they are needed. We are now able to run many processes on the operating system without worrying about a specific number of allotted chunks.
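Here is a tiny JavaScript model of swapping pages between RAM and disk. It assumes room for only two pages in RAM and evicts the oldest page first; that eviction policy, along with the names ram, disk and access, is my own choice for the sketch rather than anything a particular operating system does.

```javascript
// Toy demand pager with room for only two pages in RAM.
const RAM_SLOTS = 2;
const ram = new Map();  // page id -> contents, oldest arrival first
const disk = new Map(); // page id -> contents of pages that were swapped out
let pageFaults = 0;

function access(pageId) {
    if (ram.has(pageId)) {
        return ram.get(pageId); // already resident, nothing to do
    }
    pageFaults++;
    if (ram.size >= RAM_SLOTS) {
        // Evict the oldest resident page to disk (a simple first-in first-out policy).
        const [victimId, victimData] = ram.entries().next().value;
        ram.delete(victimId);
        disk.set(victimId, victimData);
    }
    // Bring the page back from disk, or create it on first use.
    const data = disk.has(pageId) ? disk.get(pageId) : "data for " + pageId;
    disk.delete(pageId);
    ram.set(pageId, data);
    return data;
}

access("A");
access("B");
access("C");                 // RAM is full, so "A" is swapped out to disk
console.log(disk.has("A"));  // true
console.log(access("A"));    // data for A (swapped back in, "B" is evicted)
console.log(pageFaults);     // 4
```

The program using page "A" never notices that its data spent some time on disk, which is the whole point.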

Conclusion

These are the general principles behind how physical memory is virtualized by the operating system. If there is something you are curious to learn more about, feel free to reach out to me on Twitter or via email.

Virtualizing the CPU

thesystemsprogrammer — Sat, 22 May 2021 17:40:51 +0000

Have you ever wondered how your 6-core MacBook is able to run more than 6 programs at once? Each core executes one instruction at a time, so it’s bizarre to me that it’s able to run more than 6 applications. That’s the magic of the operating system.

The most fundamental piece of hardware in a computer is its CPU. The CPU (central processing unit) executes instructions given to it and performs logic computations very quickly. In this article, we’ll talk about how the operating system makes it seem like all the programs on your computer have exclusive access to the CPU resource, when in reality, it’s shared across all of them.

To do this, the operating system uses an abstraction called a process. A process is an executing instance of a program. Each process is given some time to use the CPU and then that access is revoked and another process is given some time to use the CPU. This happens so quickly, humans don’t notice a pause when using these programs.

Process Initialization

So how does a program get run in the first place? A program is just a file with data in it. The data is stored on a computer’s disk. To run a program, the operating system must first take the data on disk and move it to memory since a CPU cannot directly access the disk. For a CPU to interact with the disk, it must initiate an I/O request which, at a high level, involves sending messages back and forth between itself and the disk’s controller. The I/O is complete once the data is moved from disk into memory. Additionally, the operating system will allocate some stack and heap memory for the process to use during run-time.

Once a process has been created, it is put into the READY state. The operating system groups processes into different states which can be generically categorized as READY, RUNNING, or BLOCKED. READY implies that the operating system can schedule the process to run on a CPU. Once the operating system has scheduled a process to run, it is put in the RUNNING state as its instructions are executed on the CPU. Now, say the program wants to interact with disk. Instead of waiting for the disk I/O to complete, which can take a while, the process is put in the BLOCKED state, in which it is descheduled from the CPU. During this time, another READY process can run on the CPU. Once the disk I/O is completed, the process is put back into the READY state, where the operating system can choose to schedule it again.
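These state changes form a small state machine, which can be sketched in JavaScript. The TRANSITIONS table and the moveTo function are names I made up for illustration.

```javascript
// Allowed transitions between the three states described above.
const TRANSITIONS = {
    READY:   ["RUNNING"],          // scheduled onto the CPU
    RUNNING: ["READY", "BLOCKED"], // descheduled, or waiting on I/O
    BLOCKED: ["READY"],            // I/O finished
};

function moveTo(proc, nextState) {
    if (!TRANSITIONS[proc.state].includes(nextState)) {
        throw new Error(`illegal transition ${proc.state} -> ${nextState}`);
    }
    proc.state = nextState;
}

const p1 = { pid: 1, state: "READY" }; // a freshly created process
moveTo(p1, "RUNNING"); // the scheduler picks it
moveTo(p1, "BLOCKED"); // it issues a disk read
moveTo(p1, "READY");   // the read completes
console.log(p1.state); // READY
```

Note that a BLOCKED process cannot jump straight to RUNNING: it has to go back to READY and wait for the scheduler to pick it again.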

Resource Sharing

Wait a minute...if the operating system schedules a process to run on the CPU and the CPU is running the instructions of that process, how is the operating system able to tell the CPU to start executing the instructions of a different process? What if that process runs in an infinite loop? Does that mean no other process will be able to run? After all, the operating system is also just software running on the CPU. The answer to this question is a hardware-based timer interrupt. The timer interrupt fires at a predetermined interval, and when it fires, the CPU jumps to a specific location in memory and starts executing instructions at that memory address. This memory address is the operating system’s timer interrupt handler. At a high level, the interrupt handler determines whether the currently running process has been running too long and needs to be descheduled in favor of another READY process. If that’s the case, then the operating system will make the switch.

But what does “make the switch” even mean? The operating system has to take the state of the currently running process, save it into the operating system’s memory, take the state of the to-be-scheduled process from the operating system’s memory, load it, and then run it on the CPU. The state that we’re referring to here is the CPU’s registers. When a CPU does logical operations, it operates on values stored in registers on the CPU. Things like the memory address of the currently executing instruction are stored in a register.
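A context switch can be modeled in a few lines of JavaScript. This toy CPU has only two registers (a program counter and a stack pointer), and the contextSwitch function and process objects are hypothetical.

```javascript
// Toy CPU with two registers.
const cpu = { pc: 0, sp: 0 };

function contextSwitch(from, to) {
    from.saved = { ...cpu };      // save the outgoing process's registers
    Object.assign(cpu, to.saved); // load the incoming process's registers
}

const procA = { name: "A", saved: { pc: 100, sp: 9000 } };
const procB = { name: "B", saved: { pc: 500, sp: 8000 } };

Object.assign(cpu, procA.saved); // A starts running
cpu.pc += 4;                     // A executes an instruction
contextSwitch(procA, procB);     // timer interrupt: switch to B
console.log(cpu.pc);             // 500
contextSwitch(procB, procA);     // later: switch back to A
console.log(cpu.pc);             // 104 (A resumes where it left off)
```

Process A never finds out it was paused, because every register holds exactly the value it had when A stopped running.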

Scheduling

One thing we didn’t talk about in detail is how the operating system decides which program should run next. Scheduling is a widely studied discipline, but we will only touch on a small part of it. The goals of an operating system scheduler are generally to minimize the amount of idle time (the amount of time a process isn’t running) while also ensuring fairness (all processes get an equal chunk of time to run). Let’s look at a couple of scheduling policies and understand their pros and cons.

First-In First-Out

This implies that a process that is scheduled will run to completion before another process is scheduled on the operating system. The benefit is that there is no overhead of context switching (saving registers into memory and then loading them again later) so it may minimize the amount of time necessary for all programs to reach completion. However, it’s not necessarily fair. What happens if program A is constantly running and never stops? Program B which was added after program A will never run!

Round Robin

Round robin works by always switching the scheduled process during a timer interrupt. It gives each process a time slice and once that time slice has expired, the next process in line is scheduled. This scheduling policy is the most fair but the constant context switches could drastically degrade performance.
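To compare the two policies, here is a rough JavaScript simulation. Each job is an object with a name and the CPU time it still needs; the fifo and roundRobin functions are invented for this sketch and ignore the cost of context switching.

```javascript
// Both functions return the order in which jobs finish.

function fifo(jobs) {
    // Run each job to completion in arrival order.
    return jobs.map(job => job.name);
}

function roundRobin(jobs, timeSlice) {
    const queue = jobs.map(job => ({ ...job })); // copy so the input is untouched
    const finished = [];
    while (queue.length > 0) {
        const job = queue.shift();
        job.remaining -= timeSlice; // run for one time slice
        if (job.remaining > 0) {
            queue.push(job);        // not done yet: back of the line
        } else {
            finished.push(job.name);
        }
    }
    return finished;
}

const jobs = [
    { name: "A", remaining: 5 },
    { name: "B", remaining: 1 },
    { name: "C", remaining: 2 },
];
console.log(fifo(jobs).join(", "));          // A, B, C
console.log(roundRobin(jobs, 1).join(", ")); // B, C, A
```

With first-in first-out, the short jobs B and C are stuck behind the long job A. With round robin they finish first, at the price of many more switches.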

Conclusion

Now that we understand how the physical resource of the CPU is virtualized by the operating system, it should make more sense how the operating system is able to give the illusion that all processes have exclusive access to the CPU. One burning question you may have is: how does the operating system give each process the illusion that it has its own address space?

If you have any questions, don’t hesitate to email me at contact@thesystemsprogrammer.com or DM me on Twitter @asystemshacker

History of Operating Systems

thesystemsprogrammer — Sun, 09 May 2021 15:36:27 +0000

The primary goal of an operating system is to allow a programmer to make use of the hardware in a fast and secure manner. However, it took many iterations to get to this definition. Per Brinch Hansen describes seven phases of the operating system which helped us get to where we are today.

I Open Shop
II Batch Processing
III Multiprogramming
IV Time Sharing
V Concurrent Programming
VI Personal Computing
VII Distributed Systems

Open Shop

The first operating system was almost like a computer at a modern-day library. Nobody owns it, and if you want to use it, you may need to book some time with a librarian. You may have an hour to use the computer, but you’ll spend at least 10 - 15 minutes getting set up: downloading applications you need, loading files, logging in to websites. It may not seem like much but the overhead definitely adds up. Operating systems in the 1950s were very similar: you would have to book some time with a human operator and they would schedule a slot for you to run your job. It would take you a while to set up your program, but once you had, you were free to run it for the remainder of your slot.

Batch Processing

The next phase of operating systems came with the realization that a human wasn’t needed to schedule these jobs; it could be done with software! Users would load their program onto a tape and the computer would have to process the jobs on a first-come, first-served basis because it was efficient to go forward with tape but extremely slow to go backwards. Even if you had a 10-minute job scheduled to run, if it was behind another user’s 5-hour job, you would have to wait 5 hours for your job to begin running.

Multiprogramming

As computer hardware support began to improve, the idea of a hardware interrupt came about. This enabled multiprogramming - allowing an operating system to execute multiple programs at once. Additionally, memory hardware began to support random-access which enabled new scheduling policies outside of first-come first-serve. Atlas was the first operating system in this era and it introduced demand-paging (paging memory in and out of RAM to disk) and supervisor calls (the early version of system calls) both of which are still used today. Scientists began to see problems like starvation with certain process scheduling policies which gave birth to the next phase of operating system innovation.

Timesharing

Timesharing meant what it sounds like - users would be able to share time on a computer simultaneously. To each user, it would seem like they had the whole computer to themselves. John McCarthy pioneered this idea and helped build a small example of it. Multics, although not used much outside of MIT, was a larger-scale version of this timesharing idea and also implemented the first hierarchical file system. This allowed for private files and folders, which catalyzed the idea of timesharing and multi-user systems. Then finally came Unix, which is still the basis of many operating systems used today.

Concurrent Programming

Concurrency became a big problem as more and more people were using multiprogrammed operating systems. Multiprogramming made it much more difficult for people to reason about how programs behaved. To make operating system concepts easier to grasp, scientists began developing abstractions for how things worked, including the idea of semaphores. These abstractions helped reduce the number of crashes in operating systems and enabled more scientists and engineers to contribute to building and improving them.

Personal Computing

The increase in availability for engineers to contribute to operating systems and the reduction in crashes vastly improved the usability of computers. This coupled with reductions in hardware costs made it much more feasible for consumers to own computers. GUIs began to gain more popularity and single-user operating systems gained more investment specifically from a company called Xerox. Xerox had built Xerox Star which was an operating system designed to mimic things found in an office: files, trash, calculators, etc… Xerox Star would soon inspire the birth of Macintosh.

Distributed Systems

Distributed systems gave users the ability to use several computers through one interface. Instead of the inter-process communication mechanisms used in prior operating systems, they could use remote procedure calls (RPCs), which would communicate with other machines over a local area network. However, this was non-trivial to implement since remote procedure calls, specifically transferring data over a network, were more often subject to hardware failures. This meant that remote procedure calls to a different machine could fail and there had to be mechanisms to account for that. RPCs made it possible for users to access, from their local machine, data that didn’t necessarily exist on their local machine.

There you have it, that’s an introduction to the history of operating systems. I think it is important to understand this before diving into modern day operating systems since it gives some context behind why certain design decisions were made. If you have any questions, don’t hesitate to reach me at contact@thesystemsprogrammer.com or DM me on Twitter @asystemshacker.

References
http://brinch-hansen.net/papers/2001b.pdf

How does event-driven programming even work?

thesystemsprogrammer — Sat, 08 May 2021 17:13:34 +0000

I’ve always wondered how event-driven programming worked – it is very different from the programming paradigms I was taught in school. I was confused by the asynchronous nature of callbacks and promises. It was also interesting to me how something like setTimeout or setInterval was implemented! It seemed non-trivial for this to be implemented in another language like C/C++ without constantly checking a timer in several areas of your code.

In Node.js, there is a runtime and a JIT compiler that execute the JavaScript that a programmer has written. The runtime doesn’t execute operations in the traditional line-after-line blocking manner that synchronous C/C++ does. Instead, it has an event loop, and operations are added to and executed on the event loop throughout the lifetime of a program. If an event involves I/O that would block, instead of the CPU halting, context switching, and waiting for the I/O to complete, the Node.js runtime continues to process the next event on the loop. Here is an example:

const fs = require('fs');

function hello_world(x) {
    console.log(`Hello World ${x}!`);
    fs.writeFile(`${x}.txt`, "hi", err => {
        if (err) {
            console.error(err);
        } else {
            console.log(`Finished writing to file ${x}`);
        }
    });
}

hello_world(1);
hello_world(2);

A synchronous version of this written in C/C++ would have a guaranteed output order of:

Hello World 1!
Finished writing to file 1
Hello World 2!
Finished writing to file 2

But in Node.js, the output would likely be something closer to:

Hello World 1!
Hello World 2!
Finished writing to file 1
Finished writing to file 2

It almost looks like the Node.js runtime was smart enough to do other work on the CPU while an I/O operation was happening! Under the hood, Node.js starts running hello_world(1). When it reaches fs.writeFile, it notices that some I/O needs to be done, so it does some magic to be discussed later, returns without waiting for the write, and moves on to the next piece of work, which is hello_world(2). Eventually, the Node.js runtime gets an event added to its task queue notifying it that writing to 1.txt has completed, and it finishes up hello_world(1) by running the callback.
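To see why the output comes out in that order, here is a toy event loop in plain JavaScript. Nothing in it is Node’s real implementation: taskQueue, fakeWriteFile and the output array are made up, and the “I/O” completes instantly.

```javascript
// A toy event loop: a queue of callbacks that runs after the main script.
const taskQueue = [];
const output = [];

// Pretend I/O: instead of blocking, queue the callback to run later.
function fakeWriteFile(name, callback) {
    taskQueue.push(() => callback(name));
}

function hello_world(x) {
    output.push(`Hello World ${x}!`);
    fakeWriteFile(`${x}.txt`, () => output.push(`Finished writing to file ${x}`));
}

// Run the "main script" first...
hello_world(1);
hello_world(2);

// ...then drain the queue, one task at a time.
while (taskQueue.length > 0) {
    const task = taskQueue.shift();
    task();
}

console.log(output.join("\n"));
```

Both greetings are recorded before either callback runs, because the callbacks only sit in the queue until the main script has finished.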

The most interesting part here is the mechanism by which Node.js skips blocking on I/O and moves on to other work instead of waiting for hello_world(1) to finish its write. And then, somehow, the runtime gets a notification that the file has been written and executes the callback passed to fs.writeFile. To do all this and more, Node.js uses an asynchronous I/O library called libuv.

Node.js uses libuv as a wrapper to do I/O that would otherwise block the CPU for several cycles. When fs.writeFile is called, a request is sent to libuv telling it to write some content to a file. Eventually, once the content is written, libuv will send a notification back to Node.js telling it the write operation has been completed and it should run the callback for fs.writeFile. Here is an example of how libuv works when handling file I/O:

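#include <uv.h>
#include <fcntl.h>     // O_WRONLY, O_CREAT
#include <sys/stat.h>  // S_IRUSR, S_IWUSR
#include <cstdlib>     // malloc, free
#include <iostream>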

uv_loop_t* loop;

void close_callback(uv_fs_t *close_request) {
    std::cout << "Finished closing file" << std::endl;
    int result = close_request->result;

    // Release libuv's internal state, then the request we malloc'ed ourselves
    uv_fs_req_cleanup(close_request);
    free(close_request);

    if (result < 0) {
        std::cout << "There was an error closing the file" << std::endl;
        return;
    }
    std::cout << "Successfully wrote to the file" << std::endl;
}

void write_callback(uv_fs_t *write_request) {
    std::cout << "Wrote to file" << std::endl;
    int result = write_request->result;
    int data = *(int*) write_request->data;

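    // Release the file descriptor holder, libuv's internal state, and the
    // request we malloc'ed ourselves
    free(write_request->data);
    uv_fs_req_cleanup(write_request);
    free(write_request);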

    if (result < 0) {
        std::cout << "There was an error writing to the file" << std::endl;
        return;
    }

    // Make sure to allocate on the heap since the stack will disappear with
    // an event loop model
    uv_fs_t* close_req = (uv_fs_t*) malloc(sizeof(uv_fs_t));
    uv_fs_close(loop, close_req, data, close_callback);
}

void open_callback(uv_fs_t *open_request) {
    std::cout << "Opened file" << std::endl;
    int result = open_request->result;

    // Free libuv's internal state, then the request we allocated with malloc
    uv_fs_req_cleanup(open_request);
    free(open_request);

    if (result < 0) {
        std::cout << "There was an error opening the file" << std::endl;
        return;
    }

    // Make sure to allocate on the heap since the stack will disappear with
    // an event loop model
    uv_fs_t* write_request = (uv_fs_t*) malloc(sizeof(uv_fs_t));
    write_request->data = (void*) malloc(sizeof(int));
    *((int*) write_request->data) = result;

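    // The bytes must stay valid until write_callback runs, so they cannot live
    // on this function's stack; a static array outlives the call
    static char str[] = "Hello World!\n";
    // sizeof(str) - 1 skips the trailing '\0' so it is not written to the file
    uv_buf_t buf = uv_buf_init(str, sizeof(str) - 1);

    uv_buf_t bufs[] = {buf};
    uv_fs_write(loop, write_request, result, bufs, 1, -1, write_callback);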
}

int main() {
    loop = uv_default_loop();

    uv_fs_t* open_request = (uv_fs_t*) malloc(sizeof(uv_fs_t));
    uv_fs_open(loop, open_request, "hello_world.txt", O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR, open_callback);

    uv_fs_t* open_request2 = (uv_fs_t*) malloc(sizeof(uv_fs_t));
    uv_fs_open(loop, open_request2, "hello_world2.txt", O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR, open_callback);

    // Run event loop
    return uv_run(loop, UV_RUN_DEFAULT);
}

In this example, we add two open requests to our event loop, and uv_run starts processing them. In a traditional synchronous C/C++ program, we’d expect these to execute sequentially, with each file’s operations waiting for the previous I/O operation to finish. However, using libuv as an async I/O library with an event loop, blocking on I/O becomes less of an issue because we can execute other pending events while one operation is waiting on I/O. To illustrate, one possible output of running the above program is:

Opened file
Opened file
Wrote to file
Wrote to file
Finished closing file
Successfully wrote to the file
Finished closing file
Successfully wrote to the file

As you can see, the program doesn’t open, write, and then close each file sequentially. Instead, it opens both files, then writes to them and closes them in batches. This is because while the program is waiting on I/O for one file, it executes the operations for another event. For example, while it is waiting for file #1 to open, it issues the request to open file #2.

But...how does it work under the hood?

An initial guess as to how this is implemented in libuv is to spawn a separate thread for every I/O operation and have that thread block on it. Once the I/O operation has completed, the thread reports the result to the main libuv thread and exits. The main libuv thread then notifies Node.js that the I/O operation has completed. However, this is likely to be slow at scale: creating and tearing down a thread for every I/O request adds a lot of CPU and memory overhead. Can we do better?
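
To make the thread-per-request idea concrete, here is a minimal sketch using std::thread. It is not how libuv is written; the names blocking_write and write_all_with_threads are made up for illustration, and joining the threads stands in for the "notify the main thread" step.

```cpp
#include <cstdio>
#include <string>
#include <thread>
#include <vector>

// Hypothetical helper: a plain blocking write of `text` to `path`.
// Returns true on success.
bool blocking_write(const std::string& path, const std::string& text) {
    std::FILE* file = std::fopen(path.c_str(), "w");
    if (file == nullptr) return false;
    bool ok = std::fputs(text.c_str(), file) >= 0;
    return std::fclose(file) == 0 && ok;
}

// Spawn one thread per file; each thread blocks on its own write.
// Returns the number of writes that succeeded.
int write_all_with_threads(const std::vector<std::string>& paths,
                           const std::string& text) {
    std::vector<char> ok(paths.size(), 0);  // one result slot per thread
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < paths.size(); ++i) {
        workers.emplace_back([&ok, &paths, &text, i] {
            ok[i] = blocking_write(paths[i], text) ? 1 : 0;
        });
    }
    int succeeded = 0;
    for (std::size_t i = 0; i < workers.size(); ++i) {
        workers[i].join();  // wait for the thread, then read its result
        succeeded += ok[i];
    }
    return succeeded;
}
```

Every call to write_all_with_threads pays for one thread creation and teardown per file, which is the overhead the paragraph above is worried about.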

Another idea I have is to repeatedly run the poll syscall on all the file descriptors of interest, waiting for the event of interest to occur. In this design, we would only need one libuv thread, and that thread would run a loop that keeps polling all the file descriptors of interest to check whether any of them is ready. The cost of each call scales linearly, O(n), with the number of file descriptors. Unfortunately, this method also isn’t fast enough. You can imagine a Node.js web server having to loop through 5000 file descriptors on every iteration to check for a read or write event.
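
A rough sketch of the poll approach, assuming Linux or another POSIX system. The function names readable_fds and poll_demo are hypothetical; the point is that the whole array is handed to the kernel and then scanned again in user space on every call.

```cpp
#include <poll.h>
#include <unistd.h>
#include <vector>

// Ask the kernel which of `fds` are readable, waiting up to `timeout_ms`.
// The full list is passed in and walked on every call: O(n) each time.
std::vector<int> readable_fds(const std::vector<int>& fds, int timeout_ms) {
    std::vector<pollfd> watched;
    for (int fd : fds) {
        pollfd entry{};
        entry.fd = fd;
        entry.events = POLLIN;
        watched.push_back(entry);
    }
    std::vector<int> ready;
    if (poll(watched.data(), watched.size(), timeout_ms) <= 0) {
        return ready;  // timeout or error: nothing to report
    }
    for (const pollfd& entry : watched) {
        if (entry.revents & POLLIN) ready.push_back(entry.fd);
    }
    return ready;
}

// Usage sketch: two pipes, only one of which has data waiting.
// Returns how many descriptors poll reported as readable, or -1 on setup failure.
int poll_demo() {
    int quiet[2], busy[2];
    if (pipe(quiet) != 0 || pipe(busy) != 0) return -1;
    if (write(busy[1], "x", 1) != 1) return -1;
    std::vector<int> ready = readable_fds({quiet[0], busy[0]}, 100);
    int count = static_cast<int>(ready.size());
    close(quiet[0]); close(quiet[1]);
    close(busy[0]); close(busy[1]);
    return count;
}
```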

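After a bit more digging into how high-performance web servers like NGINX handle this problem (known as the C10K problem), I came across epoll. The benefit of epoll over poll is that epoll only returns the file descriptors that actually have an event pending, so there’s no need to scan all of the watched file descriptors. This seems much better than poll, and it is indeed how libuv waits on sockets and pipes on Linux. One caveat: epoll does not support regular files, so libuv runs file system requests like the uv_fs_* calls above on a small pool of worker threads and reports their completion back to the event loop.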

On Linux, epoll works by having the kernel maintain a ready list inside each epoll instance and update it whenever an event occurs on a monitored file descriptor. When a user space program asks for the file descriptors that have updates, the kernel already has this list and simply has to copy it to user space. This contrasts with poll, where the kernel has to iterate through every file descriptor passed in each time poll is called.
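
Here is a small Linux-only sketch of that register-once, ask-for-ready pattern. The helper names watch_for_reads, wait_for_ready, and epoll_demo are made up for this example, and the batch size of 64 events is an arbitrary choice.

```cpp
#include <sys/epoll.h>
#include <unistd.h>
#include <vector>

// Register interest in `fd` becoming readable. Done once per descriptor.
bool watch_for_reads(int epfd, int fd) {
    epoll_event interest{};
    interest.events = EPOLLIN;
    interest.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &interest) == 0;
}

// Ask the kernel for descriptors that are ready. Only ready ones come back,
// so nothing here loops over the full set of watched descriptors.
std::vector<int> wait_for_ready(int epfd, int timeout_ms) {
    epoll_event events[64];
    int count = epoll_wait(epfd, events, 64, timeout_ms);
    std::vector<int> ready;
    for (int i = 0; i < count; ++i) {  // count is -1 on error: loop is skipped
        ready.push_back(events[i].data.fd);
    }
    return ready;
}

// Usage sketch: two pipes, only one of which has data waiting.
// Returns how many descriptors epoll reported, or -1 on setup failure.
int epoll_demo() {
    int quiet[2], busy[2];
    if (pipe(quiet) != 0 || pipe(busy) != 0) return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) return -1;
    if (!watch_for_reads(epfd, quiet[0]) || !watch_for_reads(epfd, busy[0])) return -1;
    if (write(busy[1], "x", 1) != 1) return -1;
    std::vector<int> ready = wait_for_ready(epfd, 100);
    int count = static_cast<int>(ready.size());
    close(epfd);
    close(quiet[0]); close(quiet[1]);
    close(busy[0]); close(busy[1]);
    return count;
}
```

In a real event loop, watch_for_reads would run when a connection is accepted and wait_for_ready would be called once per loop iteration.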

What about setTimeout and setInterval, how are those implemented?

Now that we have a rough understanding of how I/O is implemented in single-threaded Node.js, how do features like setTimeout and setInterval work? These don’t wait on file descriptors (in Node.js they are scheduled through libuv’s timer support in the same event loop), but it is pretty easy to guess how they might work. Because we now know that Node.js is an event-driven runtime that constantly pulls events off a task queue, it is easy to imagine the runtime checking its timers and intervals on every event loop iteration to see whether any have expired. If one has, the runtime runs the callback for that timer or interval. If not, it moves on to the next phase in the event loop. It is important to note that not all timers and intervals are necessarily processed in one loop iteration, because the runtime often has a maximum number of events that it will process in each phase.
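
The guess above can be sketched in a few lines. This is only an illustration of the idea, not Node.js or libuv source: the Timer struct and run_expired_timers are hypothetical names, the current time is passed in so the sketch does not depend on a real clock, and a plain scan is used where a real runtime would likely keep timers sorted.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

struct Timer {
    std::uint64_t due_ms;     // absolute time at which the callback should run
    std::uint64_t repeat_ms;  // 0 for a one-shot timer, otherwise the interval
    std::function<void()> callback;
};

// One "timers phase": run every callback whose due time has passed.
// One-shot timers are dropped after running; intervals are re-armed.
// Returns the number of callbacks that ran.
int run_expired_timers(std::vector<Timer>& timers, std::uint64_t now_ms) {
    int ran = 0;
    std::vector<Timer> remaining;
    for (Timer& timer : timers) {
        if (timer.due_ms > now_ms) {
            remaining.push_back(std::move(timer));  // not expired yet
            continue;
        }
        timer.callback();
        ++ran;
        if (timer.repeat_ms > 0) {
            timer.due_ms = now_ms + timer.repeat_ms;  // schedule the next tick
            remaining.push_back(std::move(timer));
        }
    }
    timers = std::move(remaining);
    return ran;
}
```

An event loop would call run_expired_timers once per iteration with the current time, then move on to its I/O phase.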

Curious for more?

If you're interested in learning more, feel free to contact me at contact@thesystemsprogrammer.com or DM me on Twitter @asystemshacker. Check out my blog.

Other Resources

https://nikhilm.github.io/uvbook/basics.html