DEV Community: Manthan Gupta

Introducing CricLang 🏏: A programming language for cricket enthusiasts

Manthan Gupta — Sun, 17 Mar 2024 12:14:43 +0000

CricLang is a fun programming language created for cricket enthusiasts. The initial commit on the repository is dated June 9, 2023, but the idea of building my own programming language has been lingering at the back of my head since my college days. After procrastinating for a long time, I finally started working on CricLang on Feb 17, 2024, and it is now ready for a public beta release.

Why the name CricLang?

CricLang is an amalgamation of 'Cricket' and 'GoLang' as it is built on top of Go. It represents my love for cricket, software engineering, and the idea that code should be as simple and readable as the game. It is an evolving programming language and isn't perfect by any means.

Documentation

CricLang is a dynamically typed language written in Go.

Data Types

CricLang currently supports four data types: int, string, boolean, and array. Boolean values are denoted by notout (true) and out (false).

>>> 1                   # int
>>> -10                 # int
>>> "Hello World!"      # string
>>> notout              # boolean true
>>> out                 # boolean false
>>> [1, "Hello World!"] # array

Declaring Variables

Variables are declared using the reserved keyword player.


>>> player x = 5;                       # integer variable
>>> player y = "Hello World!";          # string variable
>>> player z = notout;                  # boolean variable
>>> player a = [-5, "Hello World!", z];  # array of variables

Conditionals

CricLang supports if-else statements. The if branch is declared using the reserved keyword appeal, and the else branch is declared using the reserved keyword appealrejected.


>>> appeal(5 < 10) { "if condition" } appealrejected { "else condition" };
>>> appeal(notout) { "true" };
>>> appeal(!notout) { "if condition" } appealrejected { "else condition" };

Return Statements

Return statements are declared using the keyword signaldecision.


>>> signaldecision 10;
>>> appeal(5 < 10) { signaldecision "if condition"; } appealrejected { signaldecision "else condition"; };

Functions

Functions are declared using the reserved keyword field. Functions can also be bound to variables in CricLang.


>>> player x = -5;
>>> player y = 25;
>>> player z = field(x, y){ appeal(x * x == y){ signaldecision "yes"; } appealrejected {signaldecision "no";}; };
>>> z(x, y);            # output: yes

Operators

CricLang supports arithmetic operators (+, -, *, /), logical operator (!), and comparison operators (==, <, >, !=) out of the box.


>>> 1 + 1                       # output: 2
>>> "Hello" + " " + "World!"    # output: "Hello World!"
>>> 1 - 1                       # output: 0
>>> 5 * -5                      # output: -25
>>> 5 / 5                       # output: 1
>>> notout == out               # output: false
>>> !out                        # output: true

Built-in Functions

CricLang is incomplete without its built-in functions, which are meme references. Only the OG cricket fans can spot the references in the error messages. Here is a list of all the built-in functions:

Thala

Thala is inspired by the meme "Thala for a reason". It takes a string or an array as input and returns the length of the string or array but with a twist.

>>> thala()                             # 0 params
MISFIELD: girlfriend se raat mei baat kar lena, pehle 0 ki jagah 1 argument daal de

>>> thala(1)                            # wrong type of param
MISFIELD: girlfriend se raat mei baat kar lena, pehle sahi type ka argument toh daal de

>>> thala("Hello World")                # string as a parameter
Captain Cool: 11

>>> thala("Is this sixteen?")           # string as a parameter
Thala for a reason: 16

Gambhir

Gambhir is inspired by the famous meme where he says some 3rd option when given two options. It takes two parameters and returns something random.


>>> gambhir()                   # no arguments passed
MISFIELD: *Gautam Gambhir Stares Angrily* got=0 arguments, want=2 arguments

>>> gambhir("virat", "sachin")  # 2 arguments passed
Interviewer: virat or sachin
Gautam Gambhir: baingan

Kohli

Kohli is inspired by one of his famous stump mic recordings. It takes zero or more arguments and exits with a fatal error, printing an error message.


>>> kohli()                     # no arguments passed
shaam tak khelenge, inki G phatt jaayegi lekin abhi tera code phatt gaya
exit status 1

>>> kohli("wrong data type")    # with an error message
shaam tak khelenge, inki G phatt jaayegi lekin abhi tera code phatt gaya wrong data type
exit status 1

Rohit

Rohit is inspired by one of his after-match interviews in 2019. It takes one string argument and returns the output.


>>> rohit()             # no arguments passed
MISFIELD: mera gale ka vaat lag gaya chilla chilla ke ki 1 argument chahiye! tunne 0 de diye

>>> rohit(42)           # argument with wrong data type
MISFIELD: mera gale ka vaat lag gaya chilla chilla ke ki sahi type ka argument daal de

>>> rohit("Manthan")    # 1 string argument
Reporter: Manthan ke birthday ke baare mei kuch boliye.
Rohit: Abhi birthday mei kya bola jata hai? Happy Birthday? Yahi bola jata hai.

I will keep adding more fun built-in functions to CricLang and improve it overall. If you think you have some great ideas for adding a built-in function or improving CricLang, don't shy away from sending a PR.

Parting Words

CricLang is my attempt at building a programming language, and it's by no means a complete language. It has its share of flaws, and they will be improved upon over time. If you love CricLang, then don't forget to share it on Twitter, LinkedIn, and Peerlist, and drop CricLang a star on GitHub.

Note for Folks who are Hiring

If you are hiring software engineers, I am actively looking for opportunities. I am looking for companies working in distributed systems, databases, cloud computing, audio/video infrastructure, OTT platforms, fintech, or AI. If your team is pushing boundaries and working on something truly unique, I'd love to hear from you too. Reach out to me at guptaamanthan01@gmail.com or on Twitter or LinkedIn.

                        Made with love for 🏏 in 🇮🇳

Replication in Distributed Systems - Part 3

Manthan Gupta — Tue, 13 Feb 2024 15:37:08 +0000

Welcome to the third and final part of my series on replication in distributed systems. If you missed the first two parts, I would advise reading them first, as the series follows a single storyline. Click here and here for the previous blogs. In this blog, we will discuss... well, let's dive right in without formalities!

Leaderless Replication

In a leaderless implementation, the client directly sends the writes to several replicas and waits for the majority of them to acknowledge, compared to a setup with one or more leaders where a leader exists to coordinate the writes.

What happens when a node goes down?

In a leader-based configuration with one leader and two replicas, we need to perform failover if the leader goes down and we want to continue processing writes. In a leaderless configuration, failover doesn't exist: the client simply keeps writing to the replicas that are still available. So when the unavailable node comes back online, it has missed those writes, meaning it has stale data.

This problem can easily be solved by sending the read request from the client to not just one replica but to several nodes in parallel. The client may receive different versions of the data that may be up-to-date or stale. Version numbers are used to determine which value is newer.

A node that comes back online with stale data also needs to catch up on the writes it missed. There are two mechanisms for this:

Read Repair - When a client reads from several nodes in parallel, it can detect which nodes returned stale data and write the newer value back to them. This approach works well for values that are read frequently.
Anti-Entropy - A background process constantly looks for differences in the data between replicas and copies any missing data from one replica to another. Unlike a replication log, this process doesn't copy writes in any particular order.
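Read repair can be sketched in a few lines of Python. This is only an illustration under assumptions of mine: each replica is a plain in-memory dictionary, every value carries an integer version number, and the names Versioned and read_with_repair are hypothetical, not taken from any real database.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Versioned:
    version: int  # higher version number means newer value
    value: str


def read_with_repair(replicas: list[dict[str, Versioned]], key: str) -> str:
    """Read `key` from every replica, return the newest value, repair stale replicas."""
    responses = [replica.get(key) for replica in replicas]
    known = [response for response in responses if response is not None]
    if not known:
        raise KeyError(key)
    newest = max(known, key=lambda response: response.version)
    for replica, response in zip(replicas, responses):
        # A replica is stale if it is missing the key or holds an older version.
        if response is None or response.version < newest.version:
            replica[key] = Versioned(newest.version, newest.value)
    return newest.value


# One up-to-date replica, one stale replica, and one that missed the write entirely.
node_a = {"name": Versioned(2, "new")}
node_b = {"name": Versioned(1, "old")}
node_c = {}
latest = read_with_repair([node_a, node_b, node_c], "name")
```

After the call, all three dictionaries hold version 2. Values that are never read are never repaired this way, which is why a background anti-entropy process is still useful.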

Quorum Consistency

Quorums aren't necessarily the majority -- the only thing that matters is that the sets of the nodes used by the read and write operations overlap in at least one node. If there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. As long as w + r > n, we expect to get up-to-date values.

Reads and writes that obey these r and w values are called quorum reads and writes. The quorum condition w + r > n allows the system to tolerate unavailable nodes:

If w < n, we can still process writes if a node is unavailable.
If r < n, we can still process reads if a node is unavailable.
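To make the condition w + r > n concrete, here is a small Python sketch. The function names are my own; the brute-force check simply tries every possible write set against every possible read set to confirm that they always share at least one node.

```python
from itertools import combinations


def is_strict_quorum(n: int, w: int, r: int) -> bool:
    """The quorum condition from the text: w + r > n."""
    return w + r > n


def always_overlaps(n: int, w: int, r: int) -> bool:
    """Brute force: does every set of w nodes share a node with every set of r nodes?"""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


def tolerated_failures(n: int, w: int, r: int) -> int:
    """How many nodes can be down while both reads and writes still reach enough nodes."""
    return n - max(w, r)


for n, w, r in [(3, 2, 2), (5, 3, 3), (3, 1, 1)]:
    print(n, w, r, is_strict_quorum(n, w, r), always_overlaps(n, w, r))
```

With n = 3, w = 2, r = 2 every read set overlaps every write set and one node may be down. With n = 3, w = 1, r = 1 a read can land on a node that never saw the write.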

But even with quorum consistency, there are edge cases where stale values are returned:

If a sloppy quorum is used, the w writes can end up on different nodes than the r reads, so there is no longer a guaranteed overlap between the r and w nodes.
If two writes occur concurrently, it's not clear which one happened first. The only safe solution is to merge the concurrent writes. If a winner is picked based on the timestamp (LWW - last write wins), writes can be lost due to clock skew.
If a write happens concurrently with a read, the write may be reflected on only some of the replicas. It's undetermined whether the reads return the old or new value.
If the write succeeds on fewer than w nodes and fails on the remaining ones, it is not rolled back on the replicas where it succeeded. It means that if a write is reported as failed, subsequent reads may or may not return the value from that write.

Sloppy Quorums & Hinted Handoff

In a large cluster with significantly more than n nodes, the client can likely still connect to some database nodes during a network interruption, just not to the nodes it needs to assemble a quorum for a particular write or read. Is it better to return errors to all requests when we can't reach a quorum, or should we accept the writes anyway and write them to some nodes that are reachable, but aren't among the n nodes on which the value lives?

Accepting the writes anyway is called a sloppy quorum. Writes and reads still require w and r successful responses, but those may include nodes that aren't among the designated n "home" nodes for a value. Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate "home" nodes. This is called hinted handoff.

Sloppy quorums are particularly useful for increasing write availability. As long as any w nodes are available, the database can accept the writes. However, this means that even when w + r > n, you cannot be sure to read the latest value for a key because the latest value may have been temporarily written to some nodes outside of n.

Last Write Wins (LWW)

Leaderless replication is designed to tolerate conflicting concurrent writes. One approach to achieving eventual convergence is to declare that each replica needs only to store the most "recent" value and allow "older" values to be overwritten. We can attach a timestamp to each write, pick the biggest timestamp as the most "recent" value, and discard any writes with an earlier timestamp. This conflict resolution algorithm is called last write wins.

LWW achieves the goal of eventual convergence but at the cost of durability. If there are several concurrent writes to the same key, even if they were all reported to the client as successful by the quorum, only one of the writes will survive, and the others will be silently discarded. Because of clock skew, LWW may even drop writes that aren't concurrent. Caching is one use case where lost writes are acceptable, and LWW is a good choice. But if losing data isn't acceptable, then LWW is a poor choice for conflict resolution.
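The following Python sketch shows how LWW can silently drop a write. It assumes that each write is stamped with the wall-clock time of the node that accepted it; the Write class and lww_merge function are names I made up for the illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    timestamp: float  # wall-clock time assigned by the node that accepted the write
    value: str


def lww_merge(current: Write | None, incoming: Write) -> Write:
    """Keep whichever write carries the larger timestamp.

    Equal timestamps would need a deterministic tiebreaker, which is omitted here.
    """
    if current is None or incoming.timestamp > current.timestamp:
        return incoming
    return current


# Node A accepts a write at t=100. Node B accepts a write afterwards in real time,
# but its clock runs 5 seconds behind, so it stamps the write with t=97.
from_a = Write(timestamp=100.0, value="from A")
from_b = Write(timestamp=97.0, value="from B")

state = lww_merge(None, from_a)
state = lww_merge(state, from_b)  # B's later write loses and is discarded
```

Both replicas converge on "from A" regardless of the order in which they see the two writes, but the write from B is gone even though it was reported as successful.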

Defining Concurrency & "Happens-Before" Relationship

An operation A happens before another operation B if B knows about A, B depends on A, or B builds upon A. The key to defining concurrency is to know whether one operation happens before another operation. In concurrency, the timestamp doesn't matter. We call two operations concurrent if they are unaware of each other, regardless of the physical time at which they occurred.

Capturing the Happens-Before Relationship

The server can determine whether two operations are concurrent by looking at the version numbers.

The server maintains the version number for every key and increments the version number every time that key is written. The server stores the new version along with the value written.
When a client reads a key, the server returns all the values that haven't been overwritten, along with the latest version number. A client must read the key before writing.
When a client writes a key, it must include the version number from the prior read and must merge all the values it received in the prior read.
When the server receives a write with a particular version number, it can overwrite all values with that version number or below, but it must keep all values with a higher version number.
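The four rules above can be sketched for a single key on a single replica. This is a simplified Python illustration, not how any particular database implements it: the class name VersionedKey is hypothetical, and I assume that a client that has never read the key writes with version 0.

```python
class VersionedKey:
    """Tracks concurrent values (siblings) of one key on one replica."""

    def __init__(self):
        self.version = 0
        self.siblings = []  # list of (version, value) pairs not yet overwritten

    def read(self):
        """Return the latest version number and every value not yet overwritten."""
        return self.version, [value for _, value in self.siblings]

    def write(self, value, based_on_version):
        """Store `value`, which the client merged from its read at `based_on_version`."""
        self.version += 1
        # Values at or below the version the client saw are overwritten;
        # values with a higher version are concurrent and must be kept.
        self.siblings = [
            (version, kept) for version, kept in self.siblings if version > based_on_version
        ]
        self.siblings.append((self.version, value))
        return self.version


cart = VersionedKey()
v1 = cart.write(["milk"], based_on_version=0)           # client 1, never read before
v2 = cart.write(["eggs"], based_on_version=0)           # client 2, concurrent with client 1
v3 = cart.write(["milk", "flour"], based_on_version=v1)  # client 1 only knew about v1
```

After the third write, the server holds two siblings: ["eggs"] (version 2, which client 1 had not seen) and ["milk", "flour"] (version 3). The original ["milk"] was overwritten because client 1's write was based on it.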

Version Vectors

Using a single version number to capture dependencies between operations isn't sufficient when multiple replicas are accepting writes concurrently. Instead, we need to use a version number per replica per key. Each replica increments its own version number when processing a write and keeps track of the version numbers it has seen from each of the other replicas. This information indicates which values to overwrite and which values to keep as siblings (concurrent values are called siblings). The collection of version numbers from all replicas is called a version vector.

Version vectors are sent from the database replicas to clients when values are read and need to be sent back to the database when a value is subsequently written. It helps the database to distinguish between overwrites and concurrent writes.
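Here is a minimal Python sketch of how two version vectors can be compared. I represent a version vector as a dictionary from replica ID to counter; the function names and the string results are my own choices for the illustration.

```python
def compare(a: dict, b: dict) -> str:
    """Describe how version vector `a` relates to version vector `b`."""
    replicas = set(a) | set(b)
    a_ahead = any(a.get(replica, 0) > b.get(replica, 0) for replica in replicas)
    b_ahead = any(b.get(replica, 0) > a.get(replica, 0) for replica in replicas)
    if a_ahead and b_ahead:
        return "concurrent"  # neither has seen the other: keep both values as siblings
    if a_ahead:
        return "after"       # a has seen everything in b: a overwrites b
    if b_ahead:
        return "before"      # b has seen everything in a: b overwrites a
    return "equal"


def merge(a: dict, b: dict) -> dict:
    """Combine two vectors by taking the highest counter seen for each replica."""
    return {replica: max(a.get(replica, 0), b.get(replica, 0)) for replica in set(a) | set(b)}


print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))  # after
print(compare({"r1": 2}, {"r2": 1}))                    # concurrent
```

A write whose vector is "after" the stored one is an overwrite, while a "concurrent" result means the database has to keep both values as siblings.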

Parting Words

The third and final part of the blog series introduced quorum consistency for leaderless replication, LWW conflict resolution, and version vectors. This rounds off my blog series on replication in distributed systems.

If you liked the blog, don't forget to share it and follow me on Twitter, LinkedIn, Peerlist, and other social platforms, and tag me. You may also write to me at guptaamanthan01@gmail.com to share your thoughts about the blog.

References

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann

Replication in Distributed Systems - Part 2

Manthan Gupta — Thu, 01 Feb 2024 13:36:17 +0000

In the first part of the series, we laid the foundation of replication in distributed systems. We will take this forward in the 2nd part, where we introduce the different consistency guarantees, multi-leader configuration and its comparison with single-leader configuration, conflict resolution, and more.

Reading your own writes

Many applications allow the user to submit some data and then view what they have submitted, for example, commenting in a discussion, buying a stock, etc. The submitted data is sent to the leader, but the read may be served by a follower, and with asynchronous replication that is a problem. The new data may not have reached the follower yet, so to the user it looks as if the data was lost.

In this situation, we need read-after-write consistency, also known as read-your-writes consistency. It guarantees that if the user reloads the page, they will always see the updates that they have submitted. It makes no promise about other users' updates, which may take some time to be reflected.

But how do we implement it? The first idea is to read from the leader any data that the user may have modified. For example, user info can only be edited by the user themselves, so a simple rule is that a user's own profile is read from the leader and any other user's profile from a follower. But if most things are editable by the user, then the benefit of read scaling will be negated. In this case, track the time of the last update and, for one minute after the last update, serve that user's reads from the leader.
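The time-based rule can be sketched as a small read router in Python. The ReadRouter class, the 60-second window, and the "leader"/"follower" return values are assumptions made for this illustration; a real system would also have to account for replication lag on the followers.

```python
import time

LEADER_READ_WINDOW_SECONDS = 60.0  # assumed window, matching the "one minute" rule


class ReadRouter:
    """Decides whether a user's read should go to the leader or to a follower."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock        # injectable so the behaviour can be tested
        self._last_write_at = {}   # user_id -> time of that user's last write

    def record_write(self, user_id: str) -> None:
        self._last_write_at[user_id] = self._clock()

    def choose_replica(self, user_id: str) -> str:
        last_write = self._last_write_at.get(user_id)
        if last_write is not None and self._clock() - last_write < LEADER_READ_WINDOW_SECONDS:
            return "leader"    # the user wrote recently: followers may still be stale
        return "follower"      # no recent write: any follower is good enough


router = ReadRouter()
router.record_write("alice")
print(router.choose_replica("alice"))  # leader
print(router.choose_replica("bob"))    # follower
```

Only users who wrote recently are sent to the leader, so most reads can still be spread across the followers.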

Another complication arises if the user is using multiple devices to access the service. Let's say the user comments under an Instagram photo and checks the comment from another device. In this case, we need cross-device read-after-write consistency. Approaches that require remembering the timestamp of the last update need to be centralized. Another issue is that the request might not be routed to the same data center for multiple devices, so there might be read inconsistencies. If your approach requires reading from the leader, we may first need to route requests from all of a user's devices to the same data center.

Monotonic Reads

If the user makes several reads while refreshing the web page, each request may be served by a different, randomly chosen replica. If a later read hits a replica that lags further behind, the user sees time moving backward, because the data hasn't yet been replicated to all the replicas.

Monotonic reads is a guarantee that this kind of anomaly doesn't happen. It's a weaker guarantee than strong consistency but a stronger one than eventual consistency. One way of achieving it is to serve all of a user's reads from the same replica, chosen by a hash of the user ID.
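Picking a replica from a hash of the user ID could look like the Python sketch below. The function name and the replica names are hypothetical. I use SHA-256 instead of Python's built-in hash() because the built-in hash of a string changes between processes.

```python
import hashlib


def replica_for_user(user_id: str, replicas: list) -> str:
    """Map a user to one replica so that all of their reads hit the same node."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["replica-a", "replica-b", "replica-c"]
chosen = replica_for_user("user-42", replicas)
```

The same user always lands on the same replica as long as the replica list doesn't change. If that replica fails, the user's reads have to be rerouted to another one, and the guarantee needs extra care at that point.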

Consistent Prefix Reads

Let's suppose you are in a WhatsApp group of 3 people, and all 3 of you are exchanging messages. Now you see the sequence of messages not making any sense. Your friend is replying to messages that haven't arrived, and it feels like he can see the future.

Consistent prefix reads guarantees that if a sequence of writes happens in a particular order, anyone reading those writes will see them appear in the same order. One solution to this problem is to make sure that any writes that are causally related to each other are stored in the same partition.

Multi-Leader Replication

Leader-based replication has one major downside: there is only one leader, and all writes must go through it. If we aren't able to connect to the leader for any reason, we can't write to the database. The obvious answer to this problem is to have more than one node that accepts writes. This is called a multi-leader configuration, also known as master-master or active/active replication.

Single-Leader v/s Multi-Leader Configuration in Multi-Data Center Deployment

Performance

In a single-leader configuration, all writes go to the data center with the leader, which adds significant latency to writes, compared to a multi-leader configuration where every write is processed in the local data center and is replicated asynchronously to the other data center. The inter-data center network delay is hidden from the user, which means the perceived performance may be better.

Tolerance of Data Center Outages

In a single-leader configuration, if the data center with the leader fails, failover can promote a follower in another data center to be the leader. In a multi-leader configuration, each data center continues to operate independently of the others, and replication catches up when the failed data center comes back online.

Tolerance of Network Problems

A single-leader configuration is very sensitive to problems in the inter-data center link because writes are made synchronously over this link. A multi-leader configuration with asynchronous replication can tolerate network problems better, as a temporary network interruption doesn't prevent writes from being processed.

A big downside of multi-leader configuration is that the same data can be modified in 2 different places, causing write conflicts. Auto-incrementing keys, triggers, and integrity constraints can be problematic.

Handling Write-Conflicts

Synchronous v/s Asynchronous Conflict Detection

In a single-leader database, the second writer is either blocked, waiting for the first write to complete, or aborted, forcing the user to retry the write. In a multi-leader setup, both writes are accepted, and the conflict is only detected asynchronously at some later point in time. We could make the conflict detection synchronous in a multi-leader setup, i.e. wait for the write to be replicated to all the replicas before telling the user it was successful. But this would lose the main benefit of multi-leader replication, which is that each replica accepts writes independently. Use single-leader replication if you want synchronous conflict detection.

Converging Towards a Consistent State

A single-leader database applies the writes in sequential order, meaning that if there are multiple updates to the same field, the last write determines the final value of the field. In multi-leader configuration, there is no defined ordering of writes, so it's not clear what the final value of the field will be.

Various ways of achieving convergent conflict resolution

Give each write a unique ID, pick the write with the highest ID as the winner, and throw away all the other writes. If a timestamp is used, the technique is called last write wins (LWW). But this is prone to data loss.
Give each replica a unique ID, and let the writes that originated at a higher-numbered replica always take precedence over writes that originated at a lower-numbered replica. But this is also prone to data loss.
Record the conflict in an explicit data structure that preserves all information, and write application code that resolves the conflict.

Custom Conflict Resolution Logic

Most multi-leader replication tools let us write conflict resolution logic using application code that may execute on read or on write.

on write - As soon as the system detects a conflict in the log of replicated changes, it calls the conflict handler.
on read - When a conflict is detected, all the conflicting writes are stored. The next time data is read, multiple versions of the data are sent back to the application. The application may prompt the user to manually resolve the conflict.

Conflict resolution usually applies on a row or document level and not to a complete transaction.

Automatic Conflict Resolution

Conflict-free replicated datatypes (CRDTs) are a family of data structures for sets, maps, ordered lists, counters, etc. that can be concurrently edited by multiple users and that automatically resolve conflicts in sensible ways. They use a two-way merge function.
Mergeable persistent data structures track history explicitly, similar to the Git version control system, and use a three-way merge function.
Operational transformation is the conflict resolution algorithm behind collaborative editing applications such as Google Docs. It is designed particularly for concurrent editing of an ordered list of items.
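To give a feel for automatic conflict resolution, here is a sketch of one of the simplest CRDTs, a grow-only counter (G-Counter), in Python. It is my own minimal illustration and isn't taken from any particular CRDT library: each replica only increments its own slot, and the two-way merge takes the per-replica maximum.

```python
class GCounter:
    """Grow-only counter: each replica increments its own slot only."""

    def __init__(self, replica_id: str):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> number of increments seen from that replica

    def increment(self, amount: int = 1) -> None:
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Two-way merge: take the per-replica maximum, so merging twice is harmless."""
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)


# Two leaders accept increments independently and exchange state afterwards.
dc_east = GCounter("east")
dc_west = GCounter("west")
dc_east.increment(2)
dc_west.increment(3)
dc_east.merge(dc_west)
dc_west.merge(dc_east)
```

Both replicas end up with the value 5, no matter in which order or how many times the merges are applied, so no increment is lost and no manual conflict handling is needed.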

Parting Words

The second part of the blog series introduced the different consistency guarantees other than strong consistency and eventual consistency. It also gave an in-depth introduction to multi-leader configuration, where we compared it to single-leader configuration.

References

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann

Replication in Distributed Systems - Part 1

Manthan Gupta — Mon, 29 Jan 2024 16:27:47 +0000

Welcome, fellow nerds, to the 1st part of a blog series on replication. We will be discussing why we even need to distribute a database across multiple machines, what leaders and followers are, how to handle the failure of leaders and followers, etc. This will set things up nicely for the later blogs in this series.

Why distribute a database across multiple machines?

Scalability - If your data volume, read load, or write load grows larger than what a single machine can handle, you can spread the load across multiple machines
High availability - If your application needs to continue to serve requests even if one or more machines go down, you can use multiple machines to give you redundancy/fault tolerance.
Latency - If you have users over the globe, you will want to have servers at various locations worldwide so that each user is served from a data center that is geographically closer to them.

There are 2 common ways to distribute data across multiple machines (or nodes):

Replication - Keeping a copy of the same data on several different nodes, potentially in different locations. Replication provides redundancy.
Partitioning - Splitting the database into smaller subsets called partitions so that different partitions can be assigned to different nodes (also known as sharding).

Leaders & Followers

Every node that has a copy of the database is called a replica. Every write needs to be processed by every replica; otherwise, the replicas would no longer have the same data. One of the replicas is designated as the leader. When a client wants to write to the database, it must send the request to the leader, which first writes the new data to its local storage.

The replicas other than the leader are known as followers. Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of the replication log or change stream in the same order they were processed by the leader. All the followers are read-only, while only the leader accepts the writes. So, when the client wants to read from the database, the followers or the leader can serve the read request.

Async vs Sync Replication

In synchronous replication, the leader waits until the follower confirms that it received the write before reporting the success to the user and before making the write visible to other clients. In asynchronous replication, the leader sends the message and doesn't wait for the acknowledgment from the follower.

The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader. If the leader suddenly fails, we can be sure the follower has the data available. The disadvantage is that if the follower doesn't respond (because it has crashed, there is a network fault, etc.), the write cannot be processed. The leader has to block all writes and wait until the synchronous follower is available again.

It is definitely impractical to have all the followers as synchronous. Any one node outage would cause the whole system to grind to a halt. If we have a synchronous replication setting enabled on a database, it usually means that one of the followers is synchronous, and the others are asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous followers is promoted to synchronous. It guarantees we have an up-to-date copy of the data on at least two nodes: the leader and one synchronous follower.

In asynchronous replication, if the leader fails and isn't recoverable, any writes that haven't yet been replicated to the followers are lost. It means the writes aren't guaranteed to be durable, even if they have been confirmed to the client. However, the advantage of a fully asynchronous configuration is that the leader can continue processing writes even if all the followers have fallen behind. Weak durability may sound like a bad trade-off, but asynchronous replication is widely used, especially if there are many followers or if they are geographically distributed.

How to add new Followers?

Let's say we want to add a new replica because we want to scale or replace a failed node. Simply copying data files from one node to another is not sufficient. Clients are constantly writing to the database, and the data is always in flux, so a standard file copy would see different parts of the database at different points in time. We can lock the database, making it unavailable for any writes, but that would go against our goal of high availability.

Here is how it's done:

Take a consistent snapshot of the leader's database at some point in time, if possible without taking a lock on the database.
Copy the snapshot to the new follower node.
The follower connects to the leader and requests all the data changes that have happened since the snapshot was taken.
When the follower has processed the backlog of data changes since the snapshot, we say it has caught up. It can now continue to process data changes from the leader as they happen.
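The steps above can be sketched with a toy in-memory leader and follower in Python. Everything here is a simplification I chose for the illustration: the Leader and Follower classes are hypothetical, the replication log is a Python list, and the snapshot is tied to a position in that log so that the follower knows where to resume.

```python
class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # append-only replication log of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        """Return a copy of the data and the log position the copy corresponds to."""
        return dict(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class Follower:
    def __init__(self, snapshot, position):
        self.data = dict(snapshot)  # step 2: the snapshot copied to the new node
        self.position = position    # log position associated with the snapshot

    def catch_up(self, leader):
        """Steps 3 and 4: request and apply every change made since the snapshot."""
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = Leader()
leader.write("a", 1)
snapshot, position = leader.snapshot()  # step 1
leader.write("b", 2)                    # writes keep arriving while the copy happens
leader.write("a", 3)
follower = Follower(snapshot, position)
follower.catch_up(leader)
```

After catch_up, the follower holds the same data as the leader, and calling catch_up again later applies only the changes made since the last call.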

Handling Node Outages

Follower Failure

Each follower stores a log of the data changes it has received from the leader on its local disk. If a follower crashes and restarts, or the network between the leader and the follower is temporarily interrupted, the follower can recover from its log: it knows the last transaction that was processed before the fault occurred. Thus, the follower can connect to the leader and request all the data changes that occurred while the follower was disconnected.

Leader Failure

What if the leader itself goes down? TL;DR: one of the followers is promoted to be the leader. Clients are reconfigured to send all the writes to the new leader, and the followers need to start consuming data from the new leader. This process is called failover.

There is no foolproof way of detecting what has gone wrong, so most systems use a timeout. Nodes frequently bounce messages back and forth between each other, and if a node doesn't respond for some time - say 30 seconds, it is assumed to be dead. (This doesn't apply if the leader is taken down for planned maintenance)
Choosing a new leader is done through an election where the leader is chosen by a majority of the remaining replicas, or a new replica could be appointed by the previous leader. The best candidate is usually the replica with the most up-to-date data changes from the previous leader.
Clients are now reconfigured to send their write requests to the new leader. If the old leader comes back, it might still believe it is the leader, not realizing that the other replicas have forced it to step down. The system has to ensure that the older leader becomes a follower and recognizes the new leader.
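The first two failover steps can be sketched in Python as follows. This is a deliberately naive illustration with names I invented: a real system needs a consensus protocol to agree on the new leader, whereas this sketch only shows the timeout check and the "most up-to-date replica" rule.

```python
def detect_dead(last_heartbeat: dict, now: float, timeout: float = 30.0) -> set:
    """Nodes that haven't been heard from for longer than `timeout` seconds."""
    return {node for node, seen_at in last_heartbeat.items() if now - seen_at > timeout}


def pick_new_leader(log_positions: dict, dead: set) -> str:
    """Choose the live replica that has applied the most changes from the old leader."""
    candidates = {node: pos for node, pos in log_positions.items() if node not in dead}
    if not candidates:
        raise RuntimeError("no live replica available to promote")
    # Sorting first makes ties deterministic (the smallest node name wins).
    return max(sorted(candidates), key=candidates.get)


heartbeats = {"leader": 0.0, "follower-1": 95.0, "follower-2": 98.0}
positions = {"leader": 120, "follower-1": 118, "follower-2": 110}

dead = detect_dead(heartbeats, now=100.0)
new_leader = pick_new_leader(positions, dead)
```

Here the old leader was last heard from 100 seconds ago, so it is declared dead, and follower-1 is promoted because it is only two changes behind. Those two changes are exactly the kind of unreplicated writes that can be lost during failover.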

During the period between the leader going down and the next leader being elected, writes are blocked or queued. The replicas can serve consistent reads if synchronous replication is enabled; otherwise, stale or inconsistent data may be served because of eventual consistency.

What could go wrong with Failover?

If asynchronous replication is used, the new leader may not have received all the writes from the old leader before it failed. If the former leader rejoins the cluster after a new leader has been chosen, what should happen to the writes that the new leader hasn't received? The new leader may have received conflicting writes in the meantime. The most common solution is to discard the unreplicated changes of the old leader. But discarding the writes is dangerous business, as it violates clients' durability expectations.

In certain fault situations, two nodes might believe that they are the leader. This situation is called split brain, and it is dangerous. If both the leaders accept writes, and there is no process for resolving conflicts, data is likely to be lost or corrupted. As a safety catch, some systems have mechanisms to shut down one node if two leaders are detected.

And finally, what should be the ideal timeout before the leader is declared dead? A longer timeout means a longer time to recover in the case where the leader fails. However, if the timeout is too short, there could be unnecessary failovers.

Implementation of Replication Logs

Statement Based Replication

The leader logs every write request that it executes and sends that statement log to its followers. For a relational database, every INSERT, UPDATE, or DELETE statement is forwarded to followers, and each follower parses and executes that SQL statement as if it had been received from a client. But this approach can break down on statements that call a non-deterministic function such as NOW() or RAND(), which is likely to generate a different value on each replica.

Write Ahead Log (WAL)

This log is an append-only sequence of bytes containing all writes to the database. We can use the same log to build a replica on another node. The main disadvantage is that the log describes the data on a very low level, which makes it tightly coupled to the storage engine. If the database changes its storage format from one version to another, it is typically not possible to run different versions of the database software on the leader and the followers.

Logical Log Replication (row-based)

A logical log is decoupled from the storage engine and uses different log formats for the storage engine and replication. For a relational database, it is usually a sequence of records describing writes to database tables at the granularity of a row. As it is decoupled from a storage engine, it can be made backward compatible, allowing the leader and the follower to run different versions of the database software or even different storage engines.
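To make the idea concrete, here is a small sketch of a follower applying row-level log records. The record layout (op, key, values) is an assumption made for illustration and is not the format of any particular database.

```python
def apply_logical_log(table, log):
    """Apply row-level change records to a table (dict of primary key -> row)."""
    for record in log:
        op, key = record["op"], record["key"]
        if op in ("insert", "update"):
            # The record carries the new column values, not the SQL statement.
            table[key] = {**table.get(key, {}), **record["values"]}
        elif op == "delete":
            table.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return table


log = [
    {"op": "insert", "key": 1, "values": {"name": "Asha", "runs": 40}},
    {"op": "update", "key": 1, "values": {"runs": 52}},
    {"op": "insert", "key": 2, "values": {"name": "Ravi", "runs": 7}},
    {"op": "delete", "key": 2},
]
follower_table = apply_logical_log({}, log)
```

Because the records describe rows rather than bytes on disk, the follower only needs to understand this record format, not the leader's storage layout.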

Parting Words

This first part of the blog series aims to lay the foundation for future posts on replication in distributed systems. There is still a lot to discuss on this topic, which we will cover in the next part.

If you liked the blog, don't forget to share it on Twitter, LinkedIn, Peerlist, and other social platforms and tag me. You may also write to me at guptaamanthan01@gmail.com to share your thoughts about the blog.

References

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann

Data Structures That Power Your Databases

Manthan Gupta — Mon, 29 Jan 2024 15:40:22 +0000

Have you ever thought about why databases are so complex internally and why we can't use a text file to store the data? This blog post aims to go from the most basic database that writes data to text files to a more complex setting where we use LSM-tree and B-tree. We will understand why we need these data structures and more.

World's Simplest Key-Value Store

What does the world's simplest key-value store look like? Two bash functions: one writes data to a text file and the other gets data back from it.

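# Append the key-value pair as a new line at the end of the file.
db_set(){
    echo "$1,$2" >> database
}

# Print the most recent value stored for the key.
db_get(){
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}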

Let me explain both the functions.

The db_set() function appends to the end of the file without any other logic. That means if you update the value of a key, it will not overwrite the old value but append a new key-value pair at the end of the file.
The db_get() function gets the last occurrence of the key.

What's the problem with this implementation?

It doesn't deal with concurrency control, reclaiming disk space, error handling, or partially written records.
The db_get() function has terrible performance: a lookup takes O(n) time because it scans the entire file.

To efficiently find the value of a key, we need a different data structure: an index. The general idea behind indexes is to keep some additional metadata on the side, which acts as a signpost and helps you locate the data you want. This comes with a tradeoff: an index speeds up reads but slows down writes, because the index has to be updated on every write.

Hash Indexes

I am not planning to dive deep into the workings of hash indexes in this particular blog. I plan on covering them in a future blog, but till then, you can refer to this blog to understand more about them.

Let's go back to our key-value store. The strategy to improve the lookup speed is to keep an in-memory hash map where every key is mapped to a byte offset in the file. Whenever a new key-value pair is appended to the file, the hash map is also updated with the offset of the record just written.
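A minimal sketch of this idea in Python, assuming the same "key,value" line format as the bash functions. The HashIndexedLog class and its method names are made up for illustration, and keys are assumed not to contain commas or newlines.

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only key-value file with an in-memory index of byte offsets."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key
        open(path, "ab").close()  # create the file if it does not exist

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # the record starts at the end of file
            f.write(f"{key},{value}\n".encode("utf-8"))
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # jump straight to the record, no scan
            line = f.readline().decode("utf-8").rstrip("\n")
        return line.split(",", 1)[1]


path = os.path.join(tempfile.mkdtemp(), "database")
db = HashIndexedLog(path)
db.set("42", "San Francisco")
db.set("10", "London")
db.set("42", "Mumbai")  # the index now points at this newer record
```

A lookup is now one hash map access plus one disk seek, instead of a scan over the whole file.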

How do we avoid running out of disk space as we are only appending to the file?

We break the log into segments of a certain size and perform compaction on each segment. Compaction means throwing away duplicate keys and keeping only the most recent update for each key. We can also merge several segments together while compacting them. Segments are never modified after they are written, so the merged segment is written to a new file. This is done in the background while we keep serving read and write requests from the old segment files, which are deleted once the merge is complete.
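Compaction and merging can be sketched in a few lines. Here segments are plain Python lists of (key, value) records rather than files, and the compact function name is an assumption for illustration.

```python
def compact(segments):
    """Merge segments (oldest first), keeping only the latest value per key."""
    latest = {}
    for segment in segments:          # oldest segment first
        for key, value in segment:    # records in the order they were appended
            latest[key] = value       # a later record overwrites an earlier one
    return list(latest.items())


old = [("mew", 1078), ("purr", 2103), ("mew", 1079)]
new = [("purr", 2104), ("yawn", 511)]
merged = compact([old, new])
```

Five records across two segments shrink to three, one per key, each holding its most recent value.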

But there are some things to take care of with the implementation and limitations of the hash table index:

The hash table must fit in memory because it is hard to get the same performance from an on-disk hash map
We can't do range queries efficiently
If the database is restarted, the in-memory hash maps are lost. We could rebuild the index by reading every segment file from beginning to end, but that takes a lot of time. Keeping a snapshot of each segment's hash map on disk solves this issue
The database may crash halfway through appending a record. Including checksums lets us detect and ignore such partially written records
Writes are appended to the file in a strictly sequential order. A common implementation is to have only one writer thread and multiple reader threads
To delete a key, we append a special deletion record called a tombstone. When merging segments, the tombstone tells the merging process to discard any previous values for the deleted key

SSTables

A simple change to our segment files can free us from maintaining a hash index of all keys. We store the key-value pairs sorted by key. We call this format a Sorted String Table, or SSTable for short.

What advantages do we gain over the hash index?

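Merging segment files is simple and efficient even if the files are larger than the available memory. An approach like the merge step of the mergesort algorithm produces a new merged segment file that is also sorted by key. We also no longer need to keep an index of all keys in memory. A sparse index covering only some of the keys is enough: we find the two indexed keys between which our key lies and scan only that range. Because a read scans a range of records anyway, those records can be grouped into a block and compressed before being written to disk, which saves disk space and I/O bandwidth.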
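The merge step can be sketched with Python's heapq.merge, which lazily merges already-sorted inputs. This is an illustration under simplifying assumptions: segments are in-memory lists instead of files, each key appears at most once per segment, and the merge_sstables name is hypothetical.

```python
import heapq


def merge_sstables(segments):
    """Merge sorted segments (newest first) into one sorted, deduplicated list."""
    # Tag each record with its segment's age so that, for equal keys,
    # the record from the newest segment (age 0) sorts first.
    tagged = (
        [(key, age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    )
    merged, last_key = [], object()
    for key, _age, value in heapq.merge(*tagged):
        if key != last_key:           # first occurrence is the newest one
            merged.append((key, value))
            last_key = key
    return merged


newest = [("handbag", 8786), ("handful", 40308)]
older = [("handbag", 1), ("handcuffs", 2729), ("handiwork", 16912)]
result = merge_sstables([newest, older])
```

The merge only ever looks at the head of each input, which is why it also works when the segments are far larger than memory and are streamed from disk.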

How is this SSTable constructed and maintained?

When a write comes in, we add it to an in-memory balanced tree called a memtable. When the memtable grows bigger than a threshold, we write all of its data to disk as an SSTable file. The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out, writes continue to a new memtable instance.
When a read request comes in, we first search for the key in the memtable, then in the most recent on-disk segment, then in the next older segment, and so on. From time to time we run merging and compaction in the background. We can also keep a separate log on disk to which every write is immediately appended, so that the memtable can be restored after a crash.
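The write and read paths described above can be sketched as a toy class. To keep it short, a dict stands in for the balanced tree (it is sorted only at flush time), segments live in memory instead of on disk, and the TinyLSM name and memtable_limit parameter are assumptions for illustration.

```python
import bisect


class TinyLSM:
    """Toy LSM store: a dict as the memtable and sorted lists as SSTables."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []            # newest segment first, each sorted by key
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out in key order; it becomes the newest segment.
        self.segments.insert(0, sorted(self.memtable.items()))
        self.memtable = {}            # writes continue in a fresh memtable

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for segment in self.segments:  # newest to oldest
            i = bisect.bisect_left(segment, (key,))  # binary search by key
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None


store = TinyLSM(memtable_limit=2)
store.put("a", 1)
store.put("b", 2)   # memtable is full, flushed as the first segment
store.put("a", 3)
store.put("c", 4)   # flushed as a second, newer segment
store.put("d", 5)   # stays in the memtable
```

Reading "a" returns 3 from the newer segment even though the older segment still holds the stale value 1, which is the garbage that background compaction later removes.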

LSM Tree

The log-structured merge-tree, or LSM tree, is a clever algorithm based on the above principle of merging and compacting sorted files. It is append-only: existing SSTables are never modified in place, and every modification ends up in a newly written SSTable.

The LSM tree algorithm can be slow when looking up keys that do not exist in the database. We first check the memtable and then all the segments. Storage engines use Bloom Filters to optimize the algorithm.

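A Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you for certain that a key does not exist in the database, which saves many unnecessary disk reads for missing keys. When it says a key may exist, that answer can occasionally be a false positive, so the segments still have to be checked.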
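A small sketch of a Bloom filter, to show where the "definitely absent" guarantee comes from. The sizes, the salted SHA-256 hashing scheme, and the method names are arbitrary choices for illustration; real storage engines use tuned parameters and faster hash functions.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive several positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for key in ("mew", "purr", "yawn"):
    bf.add(key)
```

Adding a key only ever sets bits, so a key that was added can never be reported as absent. Only the "possibly present" answer can be wrong.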

B Tree

A B tree keeps key-value pairs sorted by key like an SSTable, which allows efficient key lookups and range queries. It breaks the database down into fixed-size blocks or pages, traditionally 4 KB in size, and reads or writes one page at a time. This design corresponds closely to the underlying hardware, as disks are also arranged in fixed-size blocks.

To make the database resilient to crashes, B tree implementations commonly include an append-only file in which every B tree modification is recorded before it is applied to the pages of the tree itself. This file is the write-ahead log (WAL, also known as a redo log). We use latches (lightweight locks) to protect the B tree when multiple threads try to access it at the same time.
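The log-first ordering can be sketched as follows. This is a toy under loud assumptions: a Python list stands in for the append-only file, a dict stands in for the tree pages, and the WalStore class is hypothetical.

```python
import json


class WalStore:
    """Toy store that logs each change before applying it."""

    def __init__(self, log=None):
        self.log = log if log is not None else []  # stands in for the WAL file
        self.pages = {}                            # stands in for the tree pages
        for entry in self.log:                     # recovery: replay the log
            self._apply(json.loads(entry))

    def _apply(self, change):
        self.pages[change["key"]] = change["value"]

    def put(self, key, value):
        change = {"key": key, "value": value}
        self.log.append(json.dumps(change))  # 1. record the change first
        self._apply(change)                  # 2. then modify the pages


store = WalStore()
store.put("a", 1)
store.put("b", 2)
store.put("a", 3)

# Simulate a crash that loses the pages but not the log, then recover.
recovered = WalStore(log=list(store.log))
```

Because the log entry is written before the pages change, replaying the log after a crash brings the pages back to a consistent state.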

Which one is better? LSM Tree vs B Tree

A B tree writes every piece of data at least twice: once to the write-ahead log and once to the tree page itself (and sometimes again when pages split). A log-structured index also rewrites data multiple times due to the repeated compaction and merging of SSTables. This effect, where one write to the database results in multiple writes to the disk over the course of the database's lifetime, is known as write amplification. It is a particular concern on SSDs, which can only overwrite blocks a limited number of times before wearing out.
LSM trees can be compressed better, resulting in smaller files on disk than B trees.
A B tree storage engine leaves some disk space unused due to fragmentation when a page is split or when a row cannot fit into an existing page. LSM trees aren't page-oriented and periodically rewrite SSTables to remove fragmentation.
The compaction process can sometimes interfere with the performance of ongoing reads and writes. At high throughput, the disk's finite bandwidth is shared between the initial write and the compaction threads running in the background. The bigger the database gets, the more disk bandwidth is required for compaction.
In B trees, each key only exists in exactly one place in the index, whereas a log-structured storage engine may have multiple copies of the same key in different segments. It makes B trees attractive in databases that want to offer strong transactional semantics.
LSM trees are typically able to sustain a higher write throughput than B trees. But B trees are thought to offer faster read throughput.

Parting Words

There are definitely more data structures that power databases, like the skip list, inverted index, R tree, suffix tree, etc. But this blog introduces the more famous ones and aims to help you understand how they work under the hood.

If you liked the blog, don't forget to share it on Twitter, LinkedIn, Peerlist, and other social platforms and tag me. You may also write to me at guptaamanthan01@gmail.com to share your thoughts about the blog.

References

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann