Meet Gandhi

Posted on Jul 4

Monoliths, Mutexes, and Mistakes: 3 System Design Problems I Tried to Solve from Scratch

#systemdesign #distributedsystems

This weekend me and my two other friends sat down again for discussing system design.

If you're a new reader, me and my two other friends have made a weekly routine of meeting online every weekend and proposing ideas for one system design question. And possibly implementation which is a real rarity 😂

I have 3 problems to share with you today and all the three did make me think deep into this domain. So let's get started!

Design Synchronized Queue Consumers

This was the first one, and its description was quite confusing at first. In fact one of us misinterpreted the entire problem and built something the problem never asked... and yet it had a lot of new things to learn and a lot of places to criticize 😌

Core Requirements:

Only one consumer consumes an element from the queue at any moment

If two consumers trying to fetch the head of the queue, one has to wait while other finishes

Queue is a black box and cannot be altered

Throughput of the system does not matter

And this seems simple enough but it was difficult deep inside. Because here mutexes or locks would spike up the latency a ton.

The real solution which only one of us managed to implement was Remote Locking, this comes from a distributed context and it is an in built feature of Redis. This mechanism allows multiple workers to work simultaneously on a resource which means it allows horizontal scaling which in turn makes it perfect for flexible systems which can control the number of simultaneous workers working on the resource (similar to my CodeArena project 😁).

And the good part is that remote locking has a negligible effect of horizontal scaling since it was built for it.

But anyways, let's just focus on what I did, especially what I did wrong 😂

Instead of using a distributed architecture, I thought of going with a monolith this time, and it turns out I was totally wrong 🫠.

So my idea was simple, I have a single queue inside the server (in-memory queue) which is shared by two processes/threads whatever you find is better according to the number of incoming requests.

The red colored box is the process/thread which simply takes the first element from the queue and gives to the blackbox queue. In practice, this element should be a user connection say a HTTP request or something

The green colored box is another process/thread which could be multiple in number and they accept the connection request of the user and add it to the in-memory queue.

The blue box is the in-memory queue, I thought of implementing it as a bounded buffer but for that I need to know beforehand how many requests will come at peak time. So that was the first problem with this design, if the traffic spikes more than we expect, the queue will totally fail and old data gets over written which leaves a client hanging and later facing a timeout.

The second issue are the mutexes on the queue. If I put the queue behind mutexes then the speed decreases due to the number of system calls made to the OS. I then had this idea of a per-element mutex but that would add a huge overhead of memory but it would surely increase the speed since it will rare that the green and red processes are on par with each other, one would naturally fall behind hence every mutex call won't actually wake up the kernel (the newer futexes (it is "futexes") in linux do that, and it blew my mind for the first time 😂) hence the time that is taken reduces drastically.

So the conclusion? My system works at a smaller scale when you know the peak traffic numbers, otherwise it totally fails ✌️

TLDR;

This design is simply a middle man design which uses an internal queue (with mutex insecurities) to pass a client to the black box queue 😂

Design an Image Service

This was the second system design problem we solved. And it was quite a struggle 😁

Core Requirements:

Upload 5 million images every hour from various clients and devices

Serving images efficiently to the rendering devices

Provide analytics around how images are requested from the systems

Bandwith consumption should be near-optimal

This design was a struggle because I initially chose to use a monolith (again, I know! 😂) and hit a road block and then shifted to a distributed system. In fact I even forgot CDNs both the times and did not use an object storage. I stored the images in the same servers which were storing the metadata and the list just goes on. But hey it had a high throughput 😅 until the other two people analyzed it...

It seems to have a lot of components we could get rid of and a lot of "alien" components. This is because I wanted to DIY it 😇

Firstly those black boxes are just the bottom box duplicated across the globe totally independent of each other. Now that's clear I can move to the explaination 😂

So inside every such black box the first thing the request hits is obviously a load balancer. Now there is a sneaky problem hidden in here and I solved it without having to change anything in the system, but let's see the original first. The load balancer routes the request to the server with the lowest current task and you ask how it will know the lowest task server? That small server in the corner does exactly that, it listens to messages from the servers about the number of task they have and it unicasts the address of the lowest loaded server directly to the load balancer (more specifically another process running on the same machine with a shared memory to the actual load balancer). Now the load balancer always sends the request to the lowest loaded server without having to calculate anything.

Now let's come to the bugs, first the easy ones.

Do normal load balancers like NGinx allow that kind of granularity? I am not so sure but probably yes. I was thinking that the organization would have their custom load balancer sitting there 😌 which my friend ruled out because it increases development cost as well as maintenance cost 😵

The second problem is what if someone from some other region wants to access an image stored in a region server very very far away?

My solution - embed that directly in the image URL. So the image url is a lower case alphabetic string of length 8. It is made up like so "{RegionID}-{ImageID}". Here the RegionID is the id of the region and it can be kept as a map key pointing to the actual region's load balancer's address. And the ImageID is the ID of the image in the region's internal object database. So how much can this short string support? After calculating a split of 3 characters for RegionID and 5 characters for ImageID can support 1,000 regions each with 100,000 images in them. And this is an approximation, there can be vastly more regions and images per region.

If you also use upper case letters, this length 8 string can support about 140 thousand regions and 380 million images per region, HUGE!! Just try that for alphanumerics 😉

So now a load balancer just checks the first 3 characters of the image and knows if it forwards the request to one of its internal servers or some other regional server.

But this solution had a ton of flaws which my other two friends covered up, but their designs were monoliths up to what I remember hence they too had their own share of problems 😁

TLDR;

This is a "monolith-became-distributed" design which almost gave me a headache 😂. Every component in it is DIY and hence it has its own set of problems. Internally it is a region wise distributed system with each node having an inner distributed server design governed by a load balancer and health checked by a small server. The load balancer also uses the image url to know which region the image originally belongs to.

Design the HashTag Service

This was the next system design question that stared me in the face. And this time I chose the distributed path directly (I am a "Dynamic Program-mer" after all 😁).

Core Requirements

Extract and manage HashTags from all the uploaded photos

5 million photos uploaded every hour

Efficiently drive the HashTag page that shows:

1. the hashtag

2. the number of photos with that hashtags

3. top 50 photos for that hashtag

Honestly this one felt easy on the surface and was not so aggressive when I dug in, so I enjoyed this problem (or maybe the problem enjoyed me 👀, goes both ways you know 😂).

Till now, I have always tried to do everything and every component DIY but this time I used external services and already trusted databases because everytime introducing something new is just not worth it 😉

So firstly this is also a distributed system and each of the servers are deployed across the globe. Each region node has an internal Apache Cassandra instance running in it. Why Apache Cassandra? Because it prioritizes high availability, eventual consistency and can handle huge write spikes easily.

Internally Apache Cassandra is a P2P distributed network, each piece of data is stored in three nodes. So how it works is like this, you contact any node in the system, it uses a concept of sharding in which you split the data into buckets to make it totally deterministic. So the node you communicate with is called the coordinator node just for that read/write operation. The coordinator node communicates with the nodes which need to store the data and also coordinates with those who have the data to be read. It merges the outputs and then gives it back to you. Simple, elegant yet difficult 😅

The other component except Apache Cassandra is a server called Master which is responsible for the display of top 50 images. Now each master queries its own Apache Cassandra database using a CQL query and gets their own version of top 50 images, then they sync across the globe (it is as bad as it sounds 😅) and merge everything and cache it for sometime.

Note on CQL: Cassandra is a wide column database with its own language - Cassandra Query Language, think of it as SQL level 1, limited to basic queries without any kind of joins or complex aggregations

Now I know there are loopholes here, the first and foremost is stale data. The top 50 data is going to be stale since there are 1400 images coming in every second from everywhere in the system the monitors can at max cache the result for about 1 second or lower depending on what you want to prioritize Consistency or Availability. If you do prioritize Consistency then the latency spikes up and otherwise its stale data.

The second problem is how to efficiently merge the data. If you have read my previous blogs, you'd know I mentioned about a very twisty idea and while explaining how it would work I wrote "LSM Tree". So that is the key here, we could use LSM Trees for syncing between the nodes, every master node creates an LSM tree of their data sends it to the other masters and merges its LSM tree with all the other ones it receives. This guarantees deterministic output even when the system is under huge load.

The other two of my friends built a monolith and I saw a problem in one of the monoliths that surprisingly was not much of a problem in Cassandra. So they used a normal Postgres or SQL database and the problem with those databases is that they use locks and locks tend to slow down reading.

This use case is a write heavy use case, but it also needs frequent reading just not as frequent as writing (maybe once per second?). So now what does a database do when 1400 entries are written down every second using a write lock passed around and almost never actually released? It would starve the reader! And what happens when this happens at a scale in a monolith? The read latency spikes up fast hence the advantage a monolith gives of consistency and availability together against the distributed architecture is simply not so interesting now that we consider the network latency, read latency and physical constraints of the machine itself.

Now why does this not happen in Apache Cassandra? Because Apache Cassandra is internally using a distributed architecture you can read and write to it at the same time under huge load without it actually breaking down! I mean of course the reads do slow down the writes by consuming nodes but that happens once every 1400 writes which I think is a good tradeoff here.

But the better approach according to me should be a monolith using Apache Cassandra as its database, this would allow fast reads, fast writes and we only have the network latency rising as a problem. No syncs, no LSMs, no database lock latency, pure speed and some better tradeoffs 😌

TLDR;

This is a good system which turned difficult to build and handle a bit too fast 😅. It uses a Apache Cassandra database and a master node for each geographic node. Apache Cassandra allows almost parallel reads and writes while the master node allows global deterministic sync to top 50 images. This one uses CDNs and external Object Storages though 😂

What's next?

Next we are planning to tackle this problem "Design OnePic". And I think I am going decentralized on this one. Haven't had a decentralized architecture in a long time 😂

If you have made it till here, I would like to inform you that I have changed my blogging style from an Monolith style to a Distributed style, all my blogs have exactly one thing they discuss about, no messy blog discussing 10 topics all at once, one blog for exactly one topic so it is easier for you read what you like and easy for me to know what you like 😁

See you in the next blog, Byee!

DEV Community

Monoliths, Mutexes, and Mistakes: 3 System Design Problems I Tried to Solve from Scratch

Design Synchronized Queue Consumers

Core Requirements:

TLDR;

Design an Image Service

Core Requirements:

TLDR;

Design the HashTag Service

Core Requirements

TLDR;

What's next?

Top comments (0)