Meet Gandhi

Posted on Jun 12

From Slack Clones to Data Mules: My Descent into System Design Madness

#devjournal #systemdesign

According to the CAP theorem, I sacrificed Consistency for Availablilty! I have some good content to share today and much better than last time. This is what caught me up for a long long time.

So last time I came up with a system design which was decentralized but this time I have 4 more designs for 4 different use cases. One of them combines the benefits of the centralized world while being decentralized in nature. I know it sounds weird, but it kinda resembles an architecture of that kind, and I did not take help of any external sources except my own brain (I wish I had a chip installed, I could just lie this way 😂)

And today I am also planning to share a very twisty problem which might seem like another tuesday for someone building at scale, but for someone like me who is still learning the art, I was totally lost in the ideas... which were totally wrong at first but then slowly started clicking together 🤒. That problem involves me introducing HLC's and BitTorrent just for helping WebRTC. Sounds weird? It is! 😂

And finally I have some micellaneous stuff, sort of life updates but technical. I want to share some repositories I found and were shared with me, plus something about my freelance work.

Also I have a confession, I am kinda losing interest after completing half the local cloud project. I just don't know why, but it feels like this project just held up too long but... I think I'm not gonna let myself lose interest in this project. But there's nothing I can do except just continue the implementation and hope for the best.

So let's get started!

The Local Cloud

The last time I wrote a blog the empty nodes elected a monitor and the monitor just sent and received heartbeats, nothing else. The monitor did not store the information of the heartbeats neither did it have a sync protocol for multiple monitors.

This new version has a defined limit on the number of nodes a monitor can hold. So all the monitors listen on port 8000 for broadcast messages. Now suppose there is an empty node trying to find the nearest monitor, that nodes sends the message and the first monitor to reply is the monitor it will talk to. But that was the happy path...

Now comes the failure path, suppose this happens and I have a 100 nodes of which just one is a monitor, now that monitor will get overloaded as more and more empty nodes join in right? So how do we solve this?

The easiest solution I could think of in a mess I created was for the monitor to simply not reply if it's capacity is full. So a monitor can only hold 5 nodes suppose and it has 5 nodes sending it heartbeats via unicast. All is working well right now. But then a 6th node joins in, now that node sends a broadcast signal but the only monitor does not reply because it is already at max capacity. Hence the 6th node becomes the new monitor by electing itself. Now any new node joining in will get a reply from this new monitor until a node from the old monitor fails and stops sending it heartbeats.

But apart from this feature, the monitors can also sync their data giving each monitor a good view of the entire system. They sync only 5 numbers, the number of gateways, assigners, monitors, workers and empty nodes. Now that each node has a bird view of the system, if any of these numbers fall below their required minimum (for example gateway becomes 0) then the monitors elect one among them to do the promotion process. The promotion process just means promoting nodes in some hierarchy to fill up the minimum requirement of each node type.

Currently the monitors do not have the promotion logic yet, but they can elect one among them for starting the process. That elected monitor just says "I will promote nodes" and gets to its usual work again 😂

So next I'm planning to add this promotion logic and then I'll start writing these other nodes, starting with the gateway then the assigner and last but not the least the worker node. This way I am planning to connect with new bugs and headaches which will help me in the future! (High hopes 🤞)

System Design

So this time I have done 4 system designs since I had a backlog 😂

Airline Check-in System

The first was this. The problem statement in short is this:

Make a ticket booking system that is highly available, consistent and should handle a load of 100 passengers per flight.

The problem statement also said a lot more things than I have written here, but this is mostly what it was.

Hence I came up with this design which, as I told was a combination of centralized and decentralized worlds. So this architecture is not complete, neither is it robust but its just me trying to explore something out of the box and also learn. I would be so glad if someone could actually point out a pretty huge bug hiding in plain sight (without the use of LLMs I guess? 😂).

So this looks simple but the explaination is a bit complicated. So on the right if you see there's this client server architecture where in a client is redirected to a server using a load balancer. There can be multiple such load balancers since they are kinda independent. This is a decentralized system this way. The load balancers only need to know one thing — which server handles which flight.

Yeah, this is the thing that makes my system centralized. So suppose there is a flight from place A to place B, then there will be a server allocated explicitly to that flight. And that server will be pysically present somewhere near the midpoint of place A and place B. So any load balancer getting a request for the flight from A to B will always redirect to the dedicated server. And due to this flight-server mapping this system becomes centralized around the allocated server. So latency could spike if you're at place C trying to book a flight for place A to B but not if you're staying near A and B or maybe closer to the mid-point.

SQL Backed Key Value Store

Actually this was a problem similar to Redis but I thought let's not look into how Redis does it, so first I designed my system then looked at Redis. And surprisingly (or by pure luck 😂) one of my design decision was actually implemented by Redis.

This was the problem statement:

Design a KV Store built on top of a SQL (relational) database. The store exposes APIs to GET, PUT, DEL keys. Along with these core functiona, there should be an API to set TTL to an existing key, post which the key is auto-deleted from the store. Scale this KV store 1 million concurrent API calls and a total storage of 5000 TB.

And this was my design:

So on the left is a normal distributed architecture as you'd expect. Nothing fancy here, clients connect to the server via a load balancer. But the new thing I introduced here are dynamic buckets.

So let's talk about databases first, the database isn't a single machine, it can be split up into how many ever different machines you want (according to how much you want to spend).

Next layer is called the Dynamic Bucket layer and it consists of not so resource hungry servers. These servers get a request from the client via a load balancer and do the GET, PUT or DEL requests on the databases. These servers have a map or dictionary if you're more familiar with python. this map has all the keys attended by the server and which key is stored on which database. It also stores which keys are TTL keys. Now when a server has a lot of keys in it, say 10,000 keys, then the server will do something amazing, it will delegate half of its task to an idle server. This gives a nice splitting effect based on the actual work instead of just ranges.

The dynamic bucket layer initally consists of just 1 active server and the rest will be idle servers. The idle servers do literally nothing. One important thing to know is that these servers will store only the hash of the key and not the key entirely since a key could ideally be very very long. So when an active server is full, it will slice its map into half (actually half, it will be maintaining a sorted array of key hashes side by side and when the time to split comes, the mid point of the array is taken and any hash less than that will be kept and the rest will be sent to an idle server). But now you may have a doubt, won't this be too slow? No, it won't be slow, firstly the constraint on the max keys will be small around 10,000 or maybe 5,000 if you want and then these servers themselves are pretty small and not so resource hungry. Hence once in a while a spike of 2ms is my tradeoff in this system.

Now that I have the splitting mechanism explained let me confront one flaw my friend helped me realize. What about scaling down? When the keys decrease, then the active nodes should ideally merge together helping optimize server costs right? And that was the part I never thought while thinking about this design 😅. And its still hard for me to consider this since I'd need an orchestrator do this in an efficient way. And I'm guessing an orchestrator might not be a good solution if I want to scale this geographically, basically I'm not sure 😂

Now comes the TTL part, so the problem statement also wants TTL and my design has a solution for that. I have some fraction of database machines dedicated just for TTL keys, they have an internal delete thread which deletes a key when their TTL hits. I could not have this on all databases since it would make them slower for a higher load due to concurrency issues. So when you put a key and then send a ttl request the dynamic bucket server moves your data from the normal database to on of the TTL database. And all operations for that key happen on the TTL server from that point of time onwards.

But there's still a flaw here, if you're really understanding this then you might ask how does the load balancer know which dynamic bucket to pick? And my answer is health checks, so load balancers occasionally check for nodes healths and the total traffic on the node; so here a node's health is the range of key hashes it holds. The load balancer computes the hash of the key and then sends it to the dynamic bucket that holds the range. And if the key is in range of no bucket, it sends the key to an idle server. That server then becomes active and processes the request.

There's a ton of holes (and I'm freaking out because of that 😂) but it's hard for me to cover all of them with such a low knowledge of system design, so we move on to the next one!

Design Slack's Realtime Communication

Yeah, it is a new problem statement added to the list. And it has no description... 🫣

So one of my friends found slack's architecture on some online blog and we understood the requirements.

The user can send real-time messages to another user and within a channel (group messaging)

The chat messages should be persisted and displayed in chronological order to keep the total order of messages

The user can send media files such as pictures or short audio files

The chat service should display the online presence status of the users

A chat message must be received by a user at most once to preserve message integrity

Yeah... too many requirements. And I came up to fill in some totally different requirements (along with some of these) 😇

So first things first, that Nginx logo on the left, it is a Media Server for WebRTC (couldn't find a better symbol 🫣). I got confused while explaining it before and I don't want the same to happen again 😂

So the system is a pretty standard one, the clients connect to a load balancer and that load balancer redirects to any server it wants. Simple. But that is the for the normal server cluster. There is a second cluster called the socket server cluster and as its name suggests it handles web socket connections. A client connects to the socket server and establishes a web socket server directly with the socket server. Now both of them can talk to each other mostly without needing the load balancer (pardon the opposite happening in the design 😂).

A socket server is used for showing real time events like a user is typing or a user is online/offline, or even for receiving a message, etc. A normal server is used for database updates like a user has sent a message (which then triggers the socket servers to send that message to the assigned user). An optimization here would be to use SSE instead of websockets since it would use lesser resources on the server side.

Another thing this design covers is reactions to messages. Those reactions are written to a write-behind cache (basically it bulk writes into the database), mostly redis which is then updated later in the database.

The last thing I covered is calls. The calls will happen using WebRTC. If it is a personal call then the traditional WebRTC works but in case of a group call I will have a separate Media Server which has just one job - receive frames from all the members of the group call and every tick send all the received frames to everyone. To understand why this is a better approach there's a ton of explainations on the internet, but in short it is the difference between a complete graph and a star graph (a graph seeming like a star, a central node connected to every other node, giving it n nodes and n - 1 edges).

Design a Load Balancer

The final system design question I did this week was to design a load balancer.

These were the core requirements:

ability to accept incoming TCP connection and forward it to one of the configured backend server

ability to add and remove backend servers at will

ability to monitor healthy backend servers

ability to have a configurable load balancing strategy

ability to measure and monitor load balancer metrics

should scale to millions of concurrent TCP connections

And this was my design, a mix of HA and concurrency issues thrown at you 😂

So the clients connect to the public IP of the system at which a load balancer sits. Now when I started designing a load balancer, I thought "Wait, isn't a load balancer a SPOF?" and I googled a bit and learned about the HA pair (High Availability pair), which is similar to the buddy system I thought of for my local cloud project. So as the name suggests, a HA pair has two machines both preferrably of equal specs. Now one is the active load balancer while the second is the passive one. When the DNS asks for the first time that who's IP is this the active load balancer responds while the passive one remains quiet. This is because both of them have a continuous exchange of heartbeats and if the passive load balancer keeps receiving the heartbeats from the active load balancer, it just does not do anything. But if the passive load balancer detects that the active load balancer has failed, it immediately starts to advertise its own ip address to the dns for the new coming requests. It also have a virtual IP in a lot of cases which helps with this, but the way how the switch happens is totally dependent on the implementation.

Now coming to the servers, the servers will have an internal thread which will periodically ping the load balancer conveying its parameter. Say you want to use least request algorithm then the number will be the number of requests the server is currently handling and if you want to use lowest response time then it could be the average response time of that server, and for round robin you could send 1 bit dummy data or empty packets just for the load balancer to register you as "still living".

Now we have some parts clear, let's dive into the load balancer itself. The load balancer has 2 threads in it. the first is the allocator thread which simply redirects the request to a server. Now since this is a thread, it will have a separate file of its own and hence if you want to change the algorithm from maybe LRU to round robin, just go ahead and compile the new binary right in that same directory. The only delay that comes is the OS's disk write time, which should not be more than 50ms even for the world's largest binary on such a powerful machine. Hence it is a one time delay but it allows dynamic configuration updates. Or maybe you could just write a new update to the passive HA machine and just switch roles, smooth transition 😂

The second thread in the load balancer is the listener thread which simply listens for server's heartbeats and updates the shared memory. Oh I guess I introduced concurrency... 😁. This shared memory has a map and an array. The map's keys are the server IDs or the server MAC IDs. the values are the indices of that id in the array. The array contains the number that the servers give to the load balancer. Now comes the twist, the array has element wise locking while the map has a full lock. Why? Element wise locks are too expensive, but in this case they allow the listener thread and allocator thread to modify the shared memory at the same time. the map is only read and rarely modified hence keeping an element wise lock there would be mostly a waste of resources.

Also, a small optimization we could have in this system is that the listener thread also stores a "next server" variable which it updates whenever a new server heartbeat is received. That is basically the server to next send the request to. If the algorithm is least requests, then "next server" is the server ID of the server which has the least requests currently, making it a constant time operation since it is happening at runtime. I could have used a better data structure but since I want the load balancer to be as fast as possible, it is better to use stale minimums for speed.

So my load balancer's trade off is "not the ideal next server" for "high speed even at peak load".

Important stuff...

I know this system has a ton of problems which I cannot see it yet. But if you like my blog and don't think of it as just another boring blog, you are free to destroy my design in the comments, I'd highly appreciate it!

The twisty idea

Okay so I have got a freelance work from a company which had me make an app using flutter. And the app was very simple, I cannot share the details but think of it as a simple Inventory Mangement App.

I just started building blindly this time since I knew Flutter, the requirements were clear, the app had to be totally offline and serverless.

But as I built I faced a question, a question so good I wish I did not think of it 😂

Problem

The app was working perfectly! But now think of this, if a user has a phone and a desktop and wants to open the app on both, then every time he/she wants to add a new item or delete an item he/she will have to do it on both the devices right? So to solve that issue I proposed a new idea... to allow offline sync.

I wish I hadn't proposed the idea 😓... but then I would have missed on a lot of new things I learnt in the 2 weeks I explored this 😄

Answer?

So if you're going to answer WebRTC, that is not the complete answer my friend. I also thought of the exact same thing, "let's use WebRTC for message transfer!" but that soon ran into a problem...

To understand the why, let's look at how WebRTC works. So basically WebRTC has 3 types of servers (techincally 4 if we consider a fallback server but mostly 3). The first is the signaling server. To even establish a connection, the devices need to exchange SDP Offer and SDP Answer packets. After that's done, both the devices ping a STUN server (basically an echo server which returns the public ip of the sender) and then via the signaling server they exchange their public ips. So now device A knows device B's public ip and B knows A's public ip.

Now comes the amazing part, normal devices are under NATs and a lot of other networking stuff making them totally hidden to external devices. Hence communication can be established only if a device from within a NAT requests to communicate to a device outside. And WebRTC does that exactly, device A or B whichever is faster at this point tries to communicate to the other device. So now say A arrived at this step first and it wants to communicate to B's public ip. A's NAT has no problem and registers that A talked to B via this public ip and this port even though B's NAT does not allow A's request in the NAT of A has a "hole" in it. Now the other device, device B does that same thing, it sends a request to A's public ip. Again B's NAT registers that B talked to A via this public ip and this port but this time B's request is able to reach A since A's NAT was expecting B to talk to A. This is called "punching a hole" in the NAT of both devices. And this is what allows A and B to have a direct P2P connection.

Now let's understand why we have a problem, firstly the app should ideally be working offline hence the signaling server is.... just not an option 😅

So let's look at some good options:

Use the internet!
Scan the LAN...
Use a more time consuming method...

So option 1 is the default one, yeah even though the app is an oflline first app we will use the internet just for this bit. So in this option, the app will be doing WebRTC but using BitTorrent. Yep, something totally unrelated is coming in. So the SDP Offer and SDP Answer will be on BitTorrent. Both the devices will be scanning the BitTorrent network for updates on a shared agreed key and the SDP Offer will also have a timestamp, that is when the devices will connect this ensures no polling and direct connection when possible. Now you should ideally ask, if we're using BitTorrent why use WebRTC? Because BitTorrent only allows some small amounts of data to be hosted on the network for free, and if our sync data is small enough we could optimize but it generally should not be the case hence WebRTC 😁

But suppose the devices are in a basement, we certainly can't get internet connectivity there right? So let's check if the devices are in the same LAN and if we're lucky we don't have to fallback to the last manual method. We could ideally use something like WifiDirect or Bluetooth for this but I'll figure that out during implementation.

The last option is the weirdest, I couldn't think of a better way 😂 so I came up to use the mule method 🫠 the oldest method in computer science to transport data across a distance.

Now picture this, device A is in a basement and device B is also in a basement. Neither of them can connect to any router or any other electronic device which could help us connect with the internet since even WebRTC needs some kind of path from A to B via routers and switches.

So what I came up with is (after a huuuuge chat with Gemini 😂) the mule method. When the internet did not exist and even today at some places, this method is the best for data transfer across disconnected devices.

This is the plan, device A is in a basement and there will be workers which will update it. So now if those workers have the same app in their phones (which needs to be opened atleast once since the last reboot) and if I setup a geofence trigger in those phones and then whenever that phone comes near device A it will automatically run the merge script and store it in itself. Let's call this device A1 which is the device with the worker. Now device A1 has the updated data of A similarly there will be a device B1 which will have the updated data of device B now when these workers leave the basement and get internet connectivity their devices will use WebRTC to establish the connection and complete the exchange. This can be done by sending the user a notification which will open the app for some seconds and the app on both devices A1 and B1 will complete their data merges and both will have the updated tables (those tables are deterministic because of HLCs).

And then when A1 goes back to the basement with A it will sync data with A using geofencing and similarly B1 will sync data with B using geofencing trigger. This will help us use the concept of data mule in an offline environment. I am planning to create a C script for this and then using flutter, I'd execute this c script and do the magic.

There are a ton of pitfalls waiting for me, they'll pop up one by one as I implement this 😂

The "Micellaneous Stuff"

So I am a development oriented guy, but for placement I give contests on leetcode, codechef and codeforces. And recently my leetcode and codechef ratings have started to go up slowly, a bit exciting and shocking for me 😂

Placements require me to reach 1800 in leetcode and 2 star on codechef 😓. That's pretty... hard 🙂. But somehow I am a 2 star on codechef and around 1742 on leetcode after a long long time of just wishing I also have good ratings it is finally happening 😂 But I'm still afraid of DP 🫣

Also, I found a github repository I think is worth sharing, it is open source and is actually a great thing for someone who is interested in low level stuff. They are all from the same person 😂

Pragmatic DB
InfraNotify
Oxygen (Yeah that's really its name 😂)

These repositories have a good goal set but some things are still waiting for a fix, give it a try I bet you'll want to contribute atleast once 😅

Thanks

That's it for today I guess... I mean I hope to stay consistent and more focused, but stay tuned! Thanks a lot for reading!

Credits: Tenor

If you have a bit more of time and like to support me, I also write on other two platforms and would appreciate if you could give just a view to those blogs too 😅

Medium: https://medium.com/@meetgandhi13062005/from-slack-clones-to-data-mules-my-descent-into-system-design-madness-84cb04252ffc

Hashnode: https://keep-alives.hashnode.dev/from-slack-clones-to-data-mules-my-descent-into-system-design-madness