
Meet Gandhi

Finally architected my local cloud!

If you are a new reader, here's the context: two friends and I are planning to create CodeArena, a platform built with students and professors in mind, aimed at increasing students' learning while decreasing the workload on professors who manage them.

Yeah, huge expectations, but we are still in the planning phase. Can't wait to actually see it being used by students and professors!


Architecture

So, as a student with an interest in OS kernels and compilers, I came up with a HUGE design. And believe me, I am in love with this design 😂

So let's get to the fun part. The system is a client-server architecture, which is trivial (yeah, see the server architecture and you'll find this one trivial 😂). The server internally consists of another architecture made up of three layers, or three types of nodes.

Some clarifications before we actually dive in. First, we're using the lab PCs of our university, so the system we install there should be considerate of how much load the machine is currently under and how much load it should add. Plus, the machines can shut down anytime, so the system needs to be elastic enough to handle that.

So the first type of node is the Gateway node. It will be a personal computer, and it will be the only singular node in the system: there can only ever be a single Gateway node, strictly.

The second type is the Orchestrator. This node itself comes in 2 flavors: a monitoring orchestrator (Monitor for short) and an assigning orchestrator (Assigner for short), each of which handles a different kind of work.

The third type of node is the Worker node, which basically runs the code given by the client in a Linux container (surprise! I am not using Docker 😂).

The last kind of node is the Empty node, which does nothing by itself; it sits outside the three layers as a spare in case the load increases or some machines go down.

So that was the overview; here is a description of each...


Gateway

This node is a personal computer (mostly my PC) on which some parameters are set. These parameters include (but are not limited to) the number of usable PCs in the lab with their IP addresses, and the bare minimum number of each type of node for the system to function.

The client sends a request to this Gateway, which handles requests using a high-level language like JS or Go and then communicates with the internal system using TCP.

Now, if you pay close attention, the Gateway is a Single Point of Failure. To tackle this, an Orchestrator node will make use of Gratuitous ARP to turn this into an active/passive High Availability setup instead of a Single Point of Failure!
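Here's a rough sketch of what sending that gratuitous ARP could look like with a raw socket in C. The interface name, IP, and MAC below are placeholders, not the final design, and the socket needs CAP_NET_RAW, so the standby node would need that capability granted:

    /* Gratuitous-ARP sketch: the standby node broadcasts an ARP reply
       claiming the Gateway's IP, so hosts and switches on the LAN repoint
       that IP to the new MAC. Placeholder interface, IP and MAC. */
    #include <arpa/inet.h>
    #include <linux/if_ether.h>
    #include <net/if.h>
    #include <netinet/if_ether.h>
    #include <netpacket/packet.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ARP));
        if (fd < 0) { perror("socket (needs CAP_NET_RAW)"); return 1; }

        unsigned char my_mac[6] = {0x02, 0, 0, 0, 0, 0x01}; /* placeholder MAC */
        struct in_addr gw_ip;
        inet_pton(AF_INET, "192.168.1.50", &gw_ip);         /* Gateway's IP */

        struct __attribute__((packed)) {
            struct ether_header eth;
            struct ether_arp    arp;
        } pkt;
        memset(&pkt, 0, sizeof pkt);
        memset(pkt.eth.ether_dhost, 0xff, 6);               /* broadcast frame */
        memcpy(pkt.eth.ether_shost, my_mac, 6);
        pkt.eth.ether_type = htons(ETH_P_ARP);

        pkt.arp.ea_hdr.ar_hrd = htons(ARPHRD_ETHER);
        pkt.arp.ea_hdr.ar_pro = htons(ETH_P_IP);
        pkt.arp.ea_hdr.ar_hln = 6;
        pkt.arp.ea_hdr.ar_pln = 4;
        pkt.arp.ea_hdr.ar_op  = htons(ARPOP_REPLY);
        memcpy(pkt.arp.arp_sha, my_mac, 6);
        memcpy(pkt.arp.arp_spa, &gw_ip, 4);   /* sender IP = the Gateway's IP */
        memset(pkt.arp.arp_tha, 0xff, 6);
        memcpy(pkt.arp.arp_tpa, &gw_ip, 4);   /* target IP = same IP: gratuitous */

        struct sockaddr_ll to = {0};
        to.sll_family  = AF_PACKET;
        to.sll_ifindex = if_nametoindex("eth0");            /* placeholder NIC */
        to.sll_halen   = 6;
        memset(to.sll_addr, 0xff, 6);

        if (sendto(fd, &pkt, sizeof pkt, 0, (struct sockaddr *)&to, sizeof to) < 0)
            perror("sendto");
        close(fd);
        return 0;
    }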


Orchestrator

This node is just another lab computer. There are two types of orchestrators, as I mentioned previously; let's look at both one by one.

Monitoring Orchestrator:

This orchestrator will collect heartbeats from all the other nodes in the network (yes, even other monitoring orchestrators) and keep track of which nodes are alive and how much work they have.

And this node will also periodically send out heartbeats which contain all the information it has about the worker nodes and their workloads.
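Here's roughly what such a heartbeat could look like in C. The field names and the 2-second interval are my placeholder assumptions, not a fixed wire format:

    /* A heartbeat payload and its periodic UDP send (sketch). */
    #include <arpa/inet.h>
    #include <stdint.h>
    #include <sys/socket.h>
    #include <unistd.h>

    struct heartbeat {
        uint32_t node_id;       /* who is reporting */
        uint8_t  node_type;     /* 0 monitor, 1 assigner, 2 worker, 3 empty */
        uint8_t  alive;         /* 1 while the node considers itself healthy */
        uint16_t load_pct;      /* coarse load estimate, 0..100 */
        uint64_t seq;           /* increases every beat, so gaps are visible */
    };

    /* Send one heartbeat every couple of seconds to a Monitor over UDP. */
    void heartbeat_loop(uint32_t id, const char *monitor_ip, uint16_t port) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in to = { .sin_family = AF_INET, .sin_port = htons(port) };
        inet_pton(AF_INET, monitor_ip, &to.sin_addr);

        struct heartbeat hb = { .node_id = id, .node_type = 2 };
        for (;;) {
            hb.seq++;
            hb.alive = 1;
            sendto(fd, &hb, sizeof hb, 0, (struct sockaddr *)&to, sizeof to);
            sleep(2); /* assumed interval, tune against machine load */
        }
    }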

The monitoring orchestrators in the network also use the Gossip protocol to keep each other updated about what is going on in the system. One node broadcasting everything it did would simply make the system inefficient, so instead the monitoring nodes periodically sync their knowledge among themselves, making sure even new monitoring orchestrators have the same information.
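The heart of that sync can be a tiny anti-entropy merge like the one below. The record layout is my assumption, the two tables are assumed slot-aligned across Monitors, and "newer" simply means a higher heartbeat sequence number:

    /* Anti-entropy merge: keep whichever record about each node is fresher. */
    #include <stdint.h>

    #define MAX_NODES 64

    struct node_view {
        uint32_t node_id;
        uint64_t last_seq;   /* newest heartbeat seq we've seen from it */
        uint16_t load_pct;
        uint8_t  alive;
    };

    /* Merge a peer Monitor's table into ours, record by record. */
    void gossip_merge(struct node_view *mine, const struct node_view *theirs, int n) {
        for (int i = 0; i < n; i++)
            if (theirs[i].last_seq > mine[i].last_seq)
                mine[i] = theirs[i]; /* their info is fresher, adopt it */
    }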

When the Gateway gets a new client, it asks one of these Monitors to lead it to the least loaded Assigner in the network, which can then carry out the remaining flow.

Assigning Orchestrator:

This orchestrator will get new client requests from the Gateway (led by a Monitor), and it will use its knowledge of the load on Worker nodes, given by the periodic heartbeats of Monitors, to assign the task to a Worker node.

The Worker node can either accept the task or reject it; in the latter case the Assigner simply tries a different Worker node.
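In code, that accept/reject loop could be as simple as this sketch. Here try_assign() is just a stand-in I made up for the real TCP round-trip to the Worker:

    #include <stdbool.h>
    #include <stdint.h>

    struct worker { uint32_t id; uint16_t load_pct; };

    /* Stand-in for the real TCP round-trip to the Worker's communicator
       thread; the real call returns the Worker's accept/reject answer. */
    static bool try_assign(const struct worker *w, const char *task) {
        (void)task;
        return w->load_pct < 80; /* pretend heavily loaded Workers reject */
    }

    /* Offer the task to Workers in ascending load order until one accepts. */
    int assign_task(const struct worker *ws, int n, const char *task) {
        for (int i = 0; i < n; i++)     /* ws sorted by load_pct already */
            if (try_assign(&ws[i], task))
                return (int)ws[i].id;   /* accepted */
        return -1;                      /* everyone rejected, report back */
    }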

The Worker node will also stream the output it gets from the container running the code in 1500-byte chunks; this orchestrator will stream those chunks to the Gateway, and over WebSockets they will be streamed to the client. As for inputs, the client will send data packets to the Gateway (not exactly streaming, since data is only sent once the client presses Enter), which will then pass each packet to the Assigner and on to the container.

The Assigner is currently a crucial failure point in this system, so to make failure less probable I introduced a Buddy system, wherein the number of Assigners in the system will always be even: every Assigner has a Buddy Assigner. These two "buddies" will gossip their work to each other, and if one fails the other takes over until a Monitor assigns a new Buddy Assigner to it.

This orchestrator will also watch for the Monitor's heartbeats, and if it misses 5 consecutive heartbeats, the Assigners elect a new Monitor. Along with that, the elected node's Buddy will also become a Monitor, so in place of one Monitor two new Monitors are created (a true Hydra system 😂).
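A sketch of that failure detector, assuming UDP heartbeats and a receive timeout set to one heartbeat period (the election itself is left as a stub):

    #include <stdio.h>
    #include <sys/socket.h>

    #define MAX_MISSES 5

    /* udp_fd must have SO_RCVTIMEO set to one heartbeat period, so a failed
       recv() means "no heartbeat arrived this period". */
    void watch_monitor(int udp_fd) {
        char buf[64];
        int misses = 0;
        for (;;) {
            if (recv(udp_fd, buf, sizeof buf, 0) > 0) {
                misses = 0;                     /* Monitor is alive */
            } else if (++misses >= MAX_MISSES) {
                printf("monitor presumed dead, electing a new one\n");
                /* start_election();  promote a new Monitor and its Buddy */
                misses = 0;
            }
        }
    }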


Worker

This node will also be a process on a lab computer, but the special thing here is that multiple Workers will co-exist on one machine. The number of Workers on a machine is equal to the number of cores that machine has, which lets the system take advantage of parallelism at the hardware level.
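A minimal sketch of that per-core spawn, using sysconf to count online cores; fork() here is just a stand-in for however the Workers actually launch:

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        long cores = sysconf(_SC_NPROCESSORS_ONLN); /* online cores right now */
        for (long i = 0; i < cores; i++) {
            pid_t pid = fork();
            if (pid == 0) {                         /* child = one Worker */
                printf("worker %ld up, pid %d\n", i, (int)getpid());
                /* worker_main(); -- communicator/runner/heartbeat threads */
                _exit(0);
            }
        }
        return 0;                                   /* real code would waitpid */
    }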

The Worker node has 3 threads: a communicator thread, a runner thread and a heartbeat thread. The details of each are below.

Communicator Thread:

The communicator thread facilitates communication between the Orchestrators and the Worker. It gets a request from the Orchestrator, and if the machine's PSI (Pressure Stall Information) is below the defined boundary, it creates a new Runner thread and hands over the code to that thread.
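Here's a rough sketch of that PSI check, reading the "some avg10" figure from /proc/pressure/cpu. The 20% boundary is an arbitrary placeholder, not a tuned value:

    #include <stdbool.h>
    #include <stdio.h>

    bool machine_has_headroom(void) {
        FILE *f = fopen("/proc/pressure/cpu", "r");
        if (!f) return true;              /* PSI not available, be permissive */
        float avg10 = 0.0f;
        /* first line: some avg10=0.12 avg60=0.05 avg300=0.01 total=... */
        int ok = fscanf(f, "some avg10=%f", &avg10);
        fclose(f);
        return ok == 1 && avg10 < 20.0f;  /* placeholder boundary */
    }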

It also kills a Runner thread in case of container failure, an abrupt exit of the thread, or once execution of the code is complete.

Runner Thread:

The runner thread creates a new rootless container using Linux namespaces. This allows our system to run without needing sudo access on the machine.
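A bare-bones sketch of that rootless container creation with clone(). A real one would also write uid_map/gid_map, pivot_root into a minimal rootfs, and attach cgroup limits; this only shows the unprivileged namespace creation:

    #define _GNU_SOURCE
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static int child(void *arg) {
        (void)arg;
        /* we are PID 1 inside the new PID namespace */
        printf("in container, pid=%d\n", (int)getpid());
        execlp("/bin/sh", "sh", "-c", "echo hello from inside", (char *)NULL);
        return 1; /* only reached if exec fails */
    }

    int main(void) {
        static char stack[1024 * 1024];  /* child stack; top is passed to clone */
        int flags = CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNS | SIGCHLD;
        pid_t pid = clone(child, stack + sizeof stack, flags, NULL);
        if (pid < 0) { perror("clone"); return 1; }
        waitpid(pid, NULL, 0);
        return 0;
    }

CLONE_NEWUSER is what makes this work without sudo: an unprivileged process may create a user namespace and gets full capabilities inside it.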

This thread also streams output received from the program running in the container to the Orchestrator in blocks of 1500 bytes. It also takes input from the Orchestrator (the input travels from the client to the Gateway, then to the Orchestrator) and provides it to the program running inside the container.
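The output relay itself is a small read/write loop; here's a sketch assuming the container's stdout is a pipe and the Orchestrator connection is an already-connected TCP socket:

    #include <unistd.h>

    #define BLOCK 1500

    /* Read the container's stdout pipe and forward each block upstream.
       Partial reads are forwarded as-is so output stays live. */
    void relay_output(int pipe_fd, int orch_fd) {
        char buf[BLOCK];
        ssize_t n;
        while ((n = read(pipe_fd, buf, sizeof buf)) > 0) {
            ssize_t off = 0;
            while (off < n) {               /* handle short writes */
                ssize_t w = write(orch_fd, buf + off, n - off);
                if (w <= 0) return;
                off += w;
            }
        }
    }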

Heartbeat Thread:

The heartbeat thread sends periodic pings to the communicator thread of the process to confirm that thread is not dead yet. It does this using Unix Domain Sockets, or UDS for short.

This thread then sends periodic heartbeats to a Monitor making sure the Worker process is actually alive.
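Here's a rough sketch of that UDS liveness ping. The socket path is a placeholder, and SOCK_DGRAM is my choice here because it keeps each ping atomic:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    /* One liveness ping to the communicator thread's datagram socket.
       Returns 0 if the ping was handed to the kernel, -1 otherwise. */
    int uds_ping(const char *path) {
        int fd = socket(AF_UNIX, SOCK_DGRAM, 0);
        if (fd < 0) return -1;
        struct sockaddr_un to = { .sun_family = AF_UNIX };
        strncpy(to.sun_path, path, sizeof to.sun_path - 1);
        int rc = (int)sendto(fd, "ping", 4, 0,
                             (struct sockaddr *)&to, sizeof to);
        close(fd);
        return rc == 4 ? 0 : -1;
    }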


Empty Node

This node is a lab computer that is not any of the above-mentioned node types. It is used when a node in the system fails or when the load on the system increases.

Its only job is to ping a Monitor with heartbeats at specific time intervals.


New things I am planning to learn

So some of the things I am excited for are:

  1. Linux namespaces, cgroups and a lot more container-level stuff. Basically everything that Docker abstracted away

  2. Implementing distributed algorithms like Raft for leader election

  3. Implementing Gratuitous ARP so that a failing gateway can be quickly replaced.

And I am planning to start writing the C code for each of these nodes just after finishing this blog. Yeah, and just like always, without AI writing a single line of code, but with AI helping me find the Linux docs.


What else I did

So this is not the only thing I did this week since my last blog. I also practiced a bit of System Design. My peers and I were trying to design a system for a blogging platform (HLD only, i.e. high-level design, since this is the first time we're doing something like this; next time we're planning to do the LLD, low-level design, as well).

So this was the problem statement:

Design a simple multi-user publishing/blogging platform, allowing writers to publish and manage the blogs under their personal publication and readers to read them.

And the requirements:

Core Requirements:

  • writers should be able to publish blogs under their personal publication

  • readers should be able to read the blogs

  • a user can be both - a reader as well as a writer

  • the author of a blog should be able to delete it

  • a blog may contain images, but will not contain any video

  • time to access a blog should be as low as possible

  • we have to render the "number of blogs" written by every user on his/her profile

  • users should be able to search for a particular blog

  • the platform should scale to 5 million daily active readers

  • the platform should scale to 10,000 daily active writers

And my system was, well, a bit too huge. It was decentralized 😉

So what I came up with was a childish Excalidraw diagram of the system.

An Excalidraw diagram consisting of a box with 7 pairs of concentric circles; each inner circle is green and each outer one red. There are three on each side and the 7th at the bottom center. The 7th circle also has a rectangle around it containing 2 smaller concentric circles labelled DB 7.1 and DB 7.2. Above this rectangle is a necklace-like structure of seven circles connected by a loop-like line. One of these circles, specifically the bottom one, has multiple smaller circles connected to it, with arrows drawn from the big circle to the smaller ones.

So don't judge me, I made it in 4 hours while half asleep 😅

So the 7 concentric circles here are the servers for each continent (yeah, I included Antarctica, don't judge me now!) and the red circle around each one is the independent database for that server. My system is decentralized and the database is distributed.

So this system has two levels of servers. Level 1 servers are the continent-level servers: very powerful servers that can operate at high speeds. Level 2 servers are the servers within countries; a country could have more than one if needed.

So when a writer wants to create their blog, they ping (wrong word, but you have no option 😏) their nearest level 2 server, probably in their country. This level 2 server then updates the database in the level 1 server of the continent. The level 1 server acknowledges, and the writer's blog is published. Eventually, the level 1 server updates the level 2 servers of its own continent only. The level 1 server also propagates this information in batches to the servers of the other continents (there are timeouts too, so if a writer publishes a blog and no new writer publishes for, say, 2 hours, the data travels across these servers even if there's just one blog entry) so that data stays eventually fresh in the system.

Now suppose a reader requests a blog from their nearest level 2 server. If the blog is in the database of the level 2 server (it also has a cache, but that is well understood, I suppose; an independent cache to be more precise, not a shared one), it will be served as quickly as possible. But if the blog is not there, the level 2 server asks the level 1 server for it, and the response of the level 1 server is what reaches the client and also gets stored in the level 2 server.

The level 1 servers do sync their data (which is timestamped to avoid things like lost updates) occasionally, say every 24 hours after a set checkpoint, to keep the system from becoming a bandwidth-hungry monster. This syncing is different from the blog-level syncing I described a second ago with the assumed 2-hour delay. This data exchange makes sure any new data since the previous exchange is aggregated, maintaining eventual consistency.
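That timestamped sync boils down to a last-write-wins merge per blog. Here's a sketch; the record layout, including the delete tombstone, is my assumption:

    #include <stdint.h>

    struct blog_meta {
        uint64_t blog_id;
        uint64_t updated_at; /* wall-clock millis of the last write */
        uint8_t  deleted;    /* tombstone, so deletes also propagate */
    };

    /* Apply one record from a peer continent during the 24-hour sync:
       the newer timestamp wins, even when the newer copy is a delete. */
    void apply_remote(struct blog_meta *local, const struct blog_meta *remote) {
        if (remote->updated_at > local->updated_at)
            *local = *remote;
    }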

This system prioritizes Availability over Consistency, so even if a person from Europe published their blog a minute ago, their American friend might or might not see it for another 2 hours (that's what we assumed before). And that is okay, since a blogging platform should be available even if the blogs are only eventually updated. After all, as a great person once told me,

Systems are always a trade-off, you decide which one you want!

But in the case of a partition among the level 2 servers, the system chooses consistency. So if a level 2 server crashes and reboots, it may have missed some data, so it will fetch new data from the level 1 server after the last synced timestamp. The data transferred here is specific to the region the level 2 server is in: if it is a level 2 server in China, it won't get data from Russian writers until it requests it later on. It might also get data for widely read blogs, but not everything at once.

So yeah, this was my design for this system. It is complex, it is incomplete, and it is full of flaws, but it is my original system. Hence, whatever flaw you point out, I will simply add a new patch to my system without having to switch to something else (except when the problem is part of the architecture; in that case I'll enjoy learning something from you!).


What's Next?

Next, I'm gonna write the code for all the types of nodes of my local cloud, and then I'll also create a new design for the next system design question, this time with LLD too.

So, not committing fully, but by the next blog I will be ready with the node code and probably a better design for the next problem.

Thanks a lot for reading, see you around Amigos!

BYEEEEEEEEE


I also write on two other platforms, so if you have a bit more time I would appreciate a comment or a view on either or both of them!

Hashnode: https://keep-alives.hashnode.dev/finally-architected-my-local-cloud

Medium: https://medium.com/@meetgandhi13062005/finally-architected-my-local-cloud-45926607b336
