<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ali Malik</title>
    <description>The latest articles on DEV Community by Ali Malik (@amalik18).</description>
    <link>https://dev.to/amalik18</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3730633%2F742e3780-c064-4746-a023-565820c3e189.png</url>
      <title>DEV Community: Ali Malik</title>
      <link>https://dev.to/amalik18</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/amalik18"/>
    <language>en</language>
    <item>
      <title>The Rise of Edge Computing</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:15:49 +0000</pubDate>
      <link>https://dev.to/amalik18/the-rise-of-edge-computing-co2</link>
      <guid>https://dev.to/amalik18/the-rise-of-edge-computing-co2</guid>
      <description>&lt;p&gt;Hey folks I’m back with another post. Today, we’ll be talking about Edge Computing, what it is, why its important, etc.&lt;/p&gt;

&lt;p&gt;Before starting, once again, I’d like to thank all of you for tuning in. If you’re new here, please consider subscribing; it lets me know you want more of this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/subscribe?" rel="noopener noreferrer"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’d also like to address the absence. I started this blog as a way to teach myself and others, and I had initially committed to posting weekly, but with work and school it’s proven challenging to put out high-quality articles at a weekly cadence. I will continue to aim for a weekly article, but long-form articles and technical deep-dives will probably land every other week, if not longer.&lt;/p&gt;

&lt;p&gt;Without further delay, let’s dive in…&lt;/p&gt;




&lt;h1&gt;
  
  
  What is Edge Computing?
&lt;/h1&gt;

&lt;p&gt;Before we can dive into edge computing as a whole we need to understand how we got here. Since the introduction of the computing paradigm, its evolution has been marked by a continuous shift from centralized systems to more distributed architectures. We’ve gone from mainframe computers to desktops to now laptops, in each iteration computing power was brought closer and closer to the end user. As the internet became more robust computing was decentralized even further, enabling networked communication and data sharing at a global scale.&lt;/p&gt;

&lt;p&gt;A major milestone in the computing paradigm was the emergence and domination of cloud computing, which offers scalable and on-demand resources virtually all over the world. Cloud computing has completely revolutionized how individuals and businesses utilize computing power, storage, and offered services. Cloud computing has been revolutionary but, as the number of connected devices and the Internet of Things (IoT) increases, new challenges arise that aren’t addressable with traditional cloud computing.&lt;/p&gt;

&lt;p&gt;This is where &lt;em&gt;edge computing&lt;/em&gt; comes into play. Edge computing is a computing paradigm that brings computation and data storage closer to the source of the data. Instead of sending all data to a central location for processing (think cloud computing and storing everything in the “cloud”), edge computing processes data at or near the point of generation. By doing so, we’re able to reduce latency, conserve network bandwidth, increase responsiveness, and improve the user experience for real-time applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21w1Vf%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F659f7712-6f04-49e9-8ce0-467509e0a2fa_1440x810.avif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21w1Vf%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F659f7712-6f04-49e9-8ce0-467509e0a2fa_1440x810.avif" width="1440" height="810"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Taken from Akamai’s article regarding Edge Computing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is especially powerful today: billions of interconnected devices generating huge amounts of data really highlight the limitations of centralized computing. Applications like autonomous vehicles, industrial automation, and smart cities require near-instantaneous data processing. Edge computing resolves this problem by bringing computational power closer to these data sources.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why is Edge Computing Important?
&lt;/h1&gt;

&lt;p&gt;As mentioned above, edge computing provides benefits in areas such as latency, bandwidth, and responsiveness. In addition to those, edge computing can enhance the security posture of a system, increase reliability, and, in some cases, even increase the scalability of a system. We’ll take a look at some of these below:&lt;/p&gt;

&lt;h3&gt;
  
  
  Reduced Latency
&lt;/h3&gt;

&lt;p&gt;One of the biggest attractions of edge computing is that it can reduce latency by orders of magnitude. By processing data closer to the point of generation, applications are able to respond much faster.&lt;/p&gt;

&lt;p&gt;The increased responsiveness is critical for use cases like autonomous vehicles, industrial automation, and healthcare.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Autonomous Vehicles:&lt;/strong&gt; processing sensor data immediately is essential in guaranteeing safety.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Industrial Automation:&lt;/strong&gt; real-time control systems rely on instant feedback so quick processing is necessary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Healthcare:&lt;/strong&gt; critical monitoring devices need to be able to process and analyze patient data rapidly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Network Bandwidth Optimization
&lt;/h3&gt;

&lt;p&gt;In the typical centralized computing model, data must be transferred to a central server for processing, which, for large amounts of data, can strain the network and increase costs. With edge computing it’s possible to filter the data to send only what is required and to aggregate data to reduce redundancy, which ultimately lowers operational expenses.&lt;/p&gt;
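&lt;p&gt;As a rough illustration (not tied to any particular platform), here is how an edge node might filter and aggregate raw sensor readings before sending only a small summary upstream. The threshold and field names are hypothetical:&lt;/p&gt;

```python
# Hypothetical edge-node preprocessing: drop readings inside the normal
# range, then aggregate what's left into a compact summary so only a
# fraction of the raw data ever crosses the network.

def summarize_readings(readings, threshold=75.0):
    """Keep only out-of-range readings and return a small summary dict."""
    anomalies = [r for r in readings if r["temp_c"] > threshold]
    if not anomalies:
        return None  # nothing worth sending upstream
    temps = [r["temp_c"] for r in anomalies]
    return {
        "count": len(anomalies),
        "max_temp_c": max(temps),
        "avg_temp_c": sum(temps) / len(temps),
    }

raw = [{"temp_c": t} for t in (70.1, 71.5, 90.2, 88.7, 69.9)]
summary = summarize_readings(raw)
print(summary)  # a three-field summary instead of the full stream
```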

&lt;h3&gt;
  
  
  Increased Security
&lt;/h3&gt;

&lt;p&gt;Processing data locally can improve the security posture of a system by keeping sensitive data close to where it is generated, reducing its exposure in transit. Localizing data also makes regulatory compliance (for example, data-residency requirements) easier to achieve.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;One can imagine that as the number of connected devices grows, as it has been, centralized computing becomes less and less optimal. Edge computing, on the other hand, can provide a distributed, load-balanced processing system: by offloading processing tasks from central servers to multiple edge nodes, one achieves a processing framework that is both distributed and load balanced.&lt;/p&gt;




&lt;h1&gt;
  
  
  Edge Computing Today
&lt;/h1&gt;

&lt;p&gt;Edge computing is being adopted very rapidly across various industries like manufacturing, retail, healthcare, and energy and utilities.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Manufacturing:&lt;/strong&gt; for predictive maintenance and real-time quality control.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Retail:&lt;/strong&gt; enhancing the customer experience by leveraging real-time analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Healthcare:&lt;/strong&gt; in medical devices and for telemedicine applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Energy and Utilities:&lt;/strong&gt; monitoring and controlling distributed resources.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of these use cases are aided by the various edge computing platforms that exist today. Service offerings such as &lt;a href="https://aws.amazon.com/greengrass/" rel="noopener noreferrer"&gt;AWS IoT Greengrass&lt;/a&gt;, &lt;a href="https://azure.microsoft.com/en-us/products/iot-edge" rel="noopener noreferrer"&gt;Microsoft Azure’s IoT Edge&lt;/a&gt;, and Google Cloud’s IoT Edge have pushed the envelope in providing cloud services at the edge.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/greengrass/" rel="noopener noreferrer"&gt;AWS IoT Greengrass&lt;/a&gt;: extends the AWS service portfolio to edge devices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://azure.microsoft.com/en-us/products/iot-edge" rel="noopener noreferrer"&gt;Microsoft Azure’s IoT Edge&lt;/a&gt;: deploys cloud workloads to run on edge devices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Google’s Cloud IoT Edge: extends Google’s data processing and machine learning capabilities to edge devices.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In addition to edge computing platforms, there is a variety of edge hardware out there including, but not limited to, &lt;a href="https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/" rel="noopener noreferrer"&gt;NVIDIA’s Jetson devices&lt;/a&gt; (I’ve personally used these and they’re fricking awesome) and &lt;a href="https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html" rel="noopener noreferrer"&gt;Intel’s OpenVINO Toolkit&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Together with a dedicated platform and the hardware to match, it’s possible to get edge applications up and running faster than ever. Having this number of devices does come with some drawbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Security:&lt;/strong&gt; as the number of edge devices increases, so does the overall attack surface. This means more robust security measures are required.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Deployment Complexity:&lt;/strong&gt; deploying and managing a large number of distributed devices is non-trivial.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interoperability:&lt;/strong&gt; heterogeneous hardware and software platforms can lead to compatibility issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All in all, the edge compute space currently looks very healthy. It has an established list of hardware and software platforms, support from major companies, and is only expected to get better as the need for edge processing increases. Plus, it doesn’t hurt that the global edge computing market was valued at $16.45 billion in 2023 and is expected to grow at a CAGR of 36.9% through 2030.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Future?
&lt;/h1&gt;

&lt;p&gt;In the last two years artificial intelligence and machine learning (AI/ML) have taken center stage, as highlighted by the fact that every major company is either incorporating or releasing Generative AI (GenAI) products. With this being the case, edge compute for AI (Edge AI) will enable devices to perform tasks like image recognition, classification, natural language processing, and predictive analytics without having to rely on cloud-based models.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;On-Device AI&lt;/strong&gt; : smartphones and IoT devices will begin to run AI/ML models locally (this can be observed in the new iOS updates with Apple Intelligence).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Edge AI Processors:&lt;/strong&gt; specialized processors geared towards AI workloads will start to show up (NVIDIA is already doing this).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Edge computing will prove to be very useful in areas like autonomous vehicles, smart cities, and augmented/virtual reality (AR/VR) by reducing latency and allowing for quick data processing. It will do the same for real-time systems like the ones present in industrial solutions.&lt;/p&gt;

&lt;p&gt;While the upside to edge computing is enormous, there are real challenges, namely security and privacy, interoperability, and complexity. As the market matures, these challenges should begin to subside. This new computing paradigm is poised to create new business and employment opportunities.&lt;/p&gt;




&lt;p&gt;That’s all I have for you today. Hopefully you enjoyed the read. If you liked the content and would like to continue to receive my content please consider subscribing, it goes a long way.&lt;/p&gt;

&lt;p&gt;Lastly, please comment &amp;amp; share; I’d love to hear everyone’s thoughts.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Ali&lt;/p&gt;

&lt;p&gt;Professional Imposter Syndrome is a reader-supported publication. To receive new posts and support my work, consider becoming a subscriber.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/the-rise-of-edge-computing?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/the-rise-of-edge-computing/comments" rel="noopener noreferrer"&gt;Leave a comment&lt;/a&gt;&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Cryptography for Software Developers</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:15:03 +0000</pubDate>
      <link>https://dev.to/amalik18/cryptography-for-software-developers-243h</link>
      <guid>https://dev.to/amalik18/cryptography-for-software-developers-243h</guid>
      <description>&lt;p&gt;Hey folks, I am back, officially this time. Thank you for bearing with me during the absence. But I am back and better than ever.&lt;/p&gt;

&lt;p&gt;Today we’ll be taking a look at core cryptography knowledge that I consider essential for software developers.&lt;/p&gt;

&lt;p&gt;Before starting, once again, I’d like to thank all of you for tuning in. If you’re new here, please consider subscribing, it allows me to really feel like you want more of this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/subscribe?" rel="noopener noreferrer"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  What is Cryptography?
&lt;/h1&gt;

&lt;p&gt;Cryptography, plain and simple, is the study and practice of hiding information so that only the intended recipient can read the message. Cryptography revolves around four pillars:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Confidentiality:&lt;/strong&gt; Ensuring only the intended recipient can read the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Integrity:&lt;/strong&gt; Ensuring data has not been tampered with.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Authentication:&lt;/strong&gt; Verifying the identity of parties involved.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Non-Repudiation:&lt;/strong&gt; Preventing someone from denying they performed an action (e.g., signed a document).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cryptography is everywhere in the digital world, from encrypting WhatsApp messages to ensuring your credit card numbers remain confidential and secure.&lt;/p&gt;

&lt;h1&gt;
  
  
  Terms
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plaintext (PT)&lt;/strong&gt;: the unencrypted text. Just regular text, like what you’re &lt;strong&gt;reading.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ciphertext (CT):&lt;/strong&gt; the encrypted counterpart to PT. For example, &lt;em&gt;1238B45E5AF8E88324699CC6BD9B6DB2&lt;/em&gt; is one possible AES-256 encryption of “Hello, World” (the exact ciphertext depends on the key and mode used).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AES: &lt;a href="https://en.wikipedia.org/wiki/Advanced_Encryption_Standard" rel="noopener noreferrer"&gt;Advanced Encryption Standard&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DES: &lt;a href="https://en.wikipedia.org/wiki/Data_Encryption_Standard" rel="noopener noreferrer"&gt;Data Encryption Standard&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hash/Digest: a fixed-size fingerprint of data (e.g., SHA-256 maps “Hello World” followed by a newline to &lt;em&gt;d2a84f4b8b650937ec8f73cd8be2c74add5a911ba64df27458ed8229da804a26&lt;/em&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
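&lt;p&gt;You can reproduce that digest with Python’s built-in &lt;code&gt;hashlib&lt;/code&gt;; note that the digest depends on the exact bytes hashed, including any trailing newline:&lt;/p&gt;

```python
import hashlib

# SHA-256 produces a fixed 256-bit (64 hex character) digest of its input.
digest = hashlib.sha256(b"Hello World\n").hexdigest()
print(digest)

# Same algorithm, same input bytes, same digest -- every time.
assert digest == hashlib.sha256(b"Hello World\n").hexdigest()
```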

&lt;h1&gt;
  
  
  The Basics
&lt;/h1&gt;

&lt;p&gt;Let’s start with the basics. In this section we’ll discuss symmetric vs asymmetric cryptography.&lt;/p&gt;

&lt;h2&gt;
  
  
  Symmetric Cryptography
&lt;/h2&gt;

&lt;p&gt;Symmetric cryptography is cryptography that uses the &lt;em&gt;same&lt;/em&gt; key for encryption and decryption, hence the classification &lt;em&gt;symmetric&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24rqjzyow1f59ttg7ejo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24rqjzyow1f59ttg7ejo.png" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For &lt;em&gt;symmetric cryptography&lt;/em&gt;, two or more communicating parties know the shared secret (the key). An example of a symmetric-key algorithm is AES (Advanced Encryption Standard). Symmetric cryptography offers fast encryption, efficiency over large amounts of data, and confidentiality.&lt;/p&gt;

&lt;p&gt;Symmetric cryptographic algorithms can be further split into two categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Block Ciphers&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stream Ciphers&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Block Ciphers
&lt;/h3&gt;

&lt;p&gt;As the name suggests, block ciphers encrypt &lt;em&gt;blocks of bits&lt;/em&gt; at a time. For example, AES has a block size of 128 bits, meaning the AES algorithm breaks an input into chunks of 128 bits when encrypting. Block ciphers are very prominent on the internet; TLS/SSL (HTTPS), VPNs, and BitLocker are some examples.&lt;/p&gt;

&lt;p&gt;Other popular block ciphers include Triple DES (3DES) and Blowfish, both with a block size of 64 bits.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Block size is entirely separate from key size. Block size is a fixed-size chunk of data that a cipher processes in every encryption / decryption operation.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Stream Ciphers
&lt;/h3&gt;

&lt;p&gt;Stream ciphers operate on a stream of bits; these ciphers encrypt bits or bytes one at a time, individually. Stream ciphers are used heavily in real-time scenarios such as VoIP (Voice-over-IP) and mobile communications. Since these ciphers are lightweight, they’re well suited to resource-constrained environments.&lt;/p&gt;

&lt;p&gt;An example of a stream cipher, and a very important one, is the One-Time Pad (OTP). Its importance lies in the fact that an OTP is mathematically unbreakable when used correctly: the ciphertext provides absolutely no information about the plaintext if the key is unknown.&lt;/p&gt;
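&lt;p&gt;A one-time pad is simple enough to sketch in a few lines of Python: XOR the message with a truly random key of the same length, and XOR with the same key again to decrypt. This is a toy for illustration; in practice OTPs are rarely used because the key must be as long as the message and can never be reused:&lt;/p&gt;

```python
import secrets

def xor_bytes(data, key):
    """XOR two equal-length byte strings together."""
    return bytes(d ^ k for d, k in zip(data, key))

message = b"attack at dawn"
key = secrets.token_bytes(len(message))  # random key, same length as message

ciphertext = xor_bytes(message, key)     # encrypt
recovered = xor_bytes(ciphertext, key)   # decrypt: XOR with the same key again

assert recovered == message
```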




&lt;h2&gt;
  
  
  Asymmetric Cryptography
&lt;/h2&gt;

&lt;p&gt;Asymmetric cryptography is also known by its much more popular pseudonym, public-key cryptography, which is the term I’ll use moving forward. Compared to symmetric cryptography, public-key cryptography offers better security guarantees and robustness (confidentiality, authenticity, and non-repudiation). However, these algorithms are much more resource intensive, making them slower than symmetric cryptography. Public-key cryptography also greatly simplifies the key-exchange process, removing the need to share a single secret key.&lt;/p&gt;

&lt;p&gt;In public-key cryptography, a &lt;em&gt;pair of keys&lt;/em&gt; is used. One of the keys, the &lt;em&gt;public key&lt;/em&gt;, is not a secret; it can be shared with anyone. The other key, the &lt;em&gt;private key&lt;/em&gt;, is the user’s secret that &lt;strong&gt;does not&lt;/strong&gt; get shared with anyone. There is a mathematical relationship between the two keys; however, given one key it is mathematically infeasible to derive the other. A very common example of a public-key algorithm is RSA (Rivest-Shamir-Adleman).&lt;/p&gt;
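&lt;p&gt;To make the “pair of keys” idea concrete, here is a toy RSA walkthrough using the classic textbook primes. Real RSA uses primes hundreds of digits long; with numbers this small the private key is trivially recoverable, so this is purely illustrative:&lt;/p&gt;

```python
# Toy RSA with tiny textbook primes -- illustrative only, never use in practice.
p, q = 61, 53
n = p * q                 # 3233, the modulus, part of both keys
phi = (p - 1) * (q - 1)   # 3120
e = 17                    # public exponent (coprime with phi)
d = pow(e, -1, phi)       # private exponent: modular inverse of e mod phi

message = 65
ciphertext = pow(message, e, n)    # encrypt with the public key (e, n)
recovered = pow(ciphertext, d, n)  # decrypt with the private key (d, n)

assert recovered == message
```

&lt;p&gt;Notice the asymmetry: anyone who knows (e, n) can encrypt, but only the holder of d can decrypt.&lt;/p&gt;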

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgjlhod5s8iofxqst9jc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsgjlhod5s8iofxqst9jc.png" width="800" height="540"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Alice’s (recipient) public key is used to encrypt the message and the private key is used to decrypt the message.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One of the most important aspects of Public Key Cryptography is the fact that it greatly simplifies key distribution. No longer does one need to share secrets via a secure channel. The presence of two keys (one public and one private) builds the basis for digital signatures.&lt;/p&gt;

&lt;p&gt;As mentioned earlier, public-key cryptography is slower than its counterpart (symmetric cryptography), and another drawback is that it requires much larger key sizes to achieve comparable security. For example, RSA needs a key size of 15,360 bits to achieve the same level of security as a 256-bit AES key.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0t23que9xbuepjny72h.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0t23que9xbuepjny72h.jpeg" width="711" height="283"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Comparing Symmetric and Asymmetric Algorithm Key Sizes (Credit)&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Hashes and Digests
&lt;/h2&gt;

&lt;p&gt;Sometimes, rather than encrypting data, we just need to verify its integrity; a quick and simple way of doing so is using hashes. “Hash” and “digest” are used pretty interchangeably: both refer to the result of applying a cryptographic hash function to a piece of data (a file, text, etc.). The point of the hash/digest is to create a “fingerprint” of sorts for the input data, which lets us track the integrity of a message and make sure it has not been tampered with.&lt;/p&gt;

&lt;p&gt;So we mentioned a cryptographic hash function above, but what is that? Basically, a cryptographic hash function, referred to simply as a hash function, is a &lt;strong&gt;one-way function that produces a fixed-size output&lt;/strong&gt;. Hash functions have two really important properties: they’re one-way and they’re collision resistant.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One Way&lt;/strong&gt; : it should be computationally infeasible (basically impossible) to get the original data from the hash value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Collision-Resistant&lt;/strong&gt; : it should be computationally infeasible (basically impossible) to find two pieces of data that map to the same hash value.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
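&lt;p&gt;A quick way to see the “fingerprint” behavior is the avalanche effect: changing even a single character of the input produces a completely different digest:&lt;/p&gt;

```python
import hashlib

a = hashlib.sha256(b"Hello World").hexdigest()
b = hashlib.sha256(b"Hello World!").hexdigest()  # one character added

print(a)
print(b)

# Both digests are the same fixed size, but share no obvious relationship.
assert len(a) == len(b) == 64
assert a != b
```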

&lt;p&gt;I mentioned above that hashes are great at verifying data integrity; let’s go through a few examples to better understand the role of hashes/message digests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Software Distribution&lt;/strong&gt;: you may have noticed that when downloading a new piece of software you’re often provided with something called the “SHA” of that distribution, which lets you verify the download wasn’t corrupted or tampered with.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data Backups&lt;/strong&gt; : verifying that backups haven’t been tampered with.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Storing Passwords:&lt;/strong&gt; passwords aren’t stored in plaintext, for obvious reasons, instead they are hashed and the hashed value is stored. This ensures that even if the password database is hacked/compromised the passwords themselves are not.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Commit Identification:&lt;/strong&gt; if you use Git (a version-control system) you’ll notice random alphanumeric characters representing the commits. Git uses a hash to identify the commit.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many hash functions available, some more popular than others. Examples include: MD5, SHA-1, SHA-256, BLAKE-2, etc.&lt;/p&gt;

&lt;p&gt;Hashes and digests also form the foundation of cryptocurrencies, where they’re crucial in ensuring integrity, security, and trust. In an earlier &lt;a href="https://open.substack.com/pub/ahmazin/p/merkle-trees?r=1tezps&amp;amp;utm_campaign=post&amp;amp;utm_medium=web&amp;amp;showWelcomeOnShare=true" rel="noopener noreferrer"&gt;post&lt;/a&gt;, we talked about Merkle trees and their applications; remember, Merkle trees rely on hashes to verify transactions. This extends beyond proof of transactions: with hashes/digests, cryptocurrencies add functionality like proof of stake, data integrity, and digital signatures.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Please don’t use MD5 or SHA-1; both are considered cryptographically broken.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Message Authentication Codes and Signatures
&lt;/h2&gt;

&lt;p&gt;I mentioned above that if we wanted to simply check the integrity of a message we could use a hash function. But what if we wanted to make sure that the message was authentic, i.e., that the sending party is who they say they are? This is where message authentication codes (MACs) come into play. Think of a MAC as the digital fingerprint of a message: it’s unique to that message (and the shared secret key). Formally, a &lt;em&gt;message authentication code (MAC)&lt;/em&gt; is a cryptographic technique used to verify &lt;strong&gt;both&lt;/strong&gt; the &lt;em&gt;integrity&lt;/em&gt; and &lt;em&gt;authenticity&lt;/em&gt; of a message. Remember, authenticity relates to the sending and receiving parties, so a MAC not only makes sure that the message has not been tampered with but also provides verification of the two parties involved.&lt;/p&gt;

&lt;p&gt;So how do MACs work? Well before anything, step 0 if you will, the sender and receiver need to share a secret key beforehand. This key is crucial for verification. Once that’s taken care of we can generate the MAC.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdaisg4n8ol1hb9g2lihy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdaisg4n8ol1hb9g2lihy.png" width="800" height="716"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;If Alice wants to send a message to Bob she would include the MAC of the message with the original message so Bob can verify its authenticity.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once the secret key is shared, the sending party generates a MAC of the message and includes it in the transmission. When the receiving party receives the message, they also generate a MAC of the message and compare it to the accompanying MAC. If the MACs match, everything proceeds as normal; otherwise, the message is rejected.&lt;/p&gt;
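&lt;p&gt;Python’s standard library ships an HMAC implementation, so the generate-and-compare flow above takes only a few lines (using &lt;code&gt;compare_digest&lt;/code&gt; for the comparison to avoid timing attacks). The key and messages here are made up for illustration:&lt;/p&gt;

```python
import hmac
import hashlib

secret_key = b"shared-secret-key"  # exchanged beforehand (step 0)
message = b"transfer $100 to Bob"

# Sender: compute the MAC and transmit it alongside the message.
tag = hmac.new(secret_key, message, hashlib.sha256).digest()

# Receiver: recompute the MAC and compare in constant time.
expected = hmac.new(secret_key, message, hashlib.sha256).digest()
assert hmac.compare_digest(tag, expected)  # intact and authentic

# A tampered message produces a different MAC, so verification fails.
tampered = hmac.new(secret_key, b"transfer $999 to Eve", hashlib.sha256).digest()
assert not hmac.compare_digest(tag, tampered)
```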

&lt;p&gt;Digital signatures are basically the same thing, except that instead of a shared secret key, the sender’s private key is used for signing; a signature made with the sender’s private key can be verified with the sender’s public key. In short, digital signatures belong to asymmetric cryptography, whereas MACs belong to symmetric cryptography.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tips, Reminders &amp;amp; Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Please &lt;strong&gt;DO NOT&lt;/strong&gt; roll your own crypto algorithm, except for fun. If you’re using crypto algorithms in production applications, please follow current best practices. Check out the &lt;a href="https://csrc.nist.gov/projects/cryptographic-standards-and-guidelines" rel="noopener noreferrer"&gt;NIST Standards&lt;/a&gt; as well as &lt;a href="https://www.latacora.com/blog/2024/07/29/crypto-right-answers-pq/" rel="noopener noreferrer"&gt;Cryptographic Right Answers&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Please &lt;strong&gt;DO NOT&lt;/strong&gt; use outdated or weak algorithms. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Utilize OWASP Top 10s and &lt;a href="https://cheatsheetseries.owasp.org/cheatsheets/Cryptographic_Storage_Cheat_Sheet.html" rel="noopener noreferrer"&gt;OWASP Cheat Sheets&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;





&lt;p&gt;That’s all I have today. Thank you for reading through the entire post. Hopefully I was able to teach you something new about the ever-evolving and extremely important field of cryptography.&lt;/p&gt;

&lt;p&gt;If you liked the post please like, share, and subscribe. Please share your thoughts on the post.&lt;/p&gt;

&lt;p&gt;My next post will start the Distributed Systems series. I’ll see you there.&lt;/p&gt;

&lt;p&gt;Ali&lt;/p&gt;

&lt;p&gt;Thanks for reading Professional Imposter Syndrome! This post is public so feel free to share it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/cryptography-for-software-developers?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/cryptography-for-software-developers?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/cryptography-for-software-developers/comments" rel="noopener noreferrer"&gt;Leave a comment&lt;/a&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>softwaredevelopment</category>
      <category>software</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Merkle Trees</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:10:50 +0000</pubDate>
      <link>https://dev.to/amalik18/merkle-trees-j37</link>
      <guid>https://dev.to/amalik18/merkle-trees-j37</guid>
      <description>&lt;p&gt;Hey folks, I’m back for another post. I’m doing decently well in keeping up with my promise of weekly posts. I do, however, apologize for the posts not coming out on set dates and time as I originally promised.&lt;/p&gt;

&lt;p&gt;Before we delve into the weekly topic, I’d like to welcome the 36 (at the time of writing) new subscribers, and to say thank you so much for supporting my work here. We went from 12 subscribers at my last post all the way to 48 at the time of writing; the growth is beyond what I initially imagined. Thank you to all my subscribers, and if you’re new here I hope you’ll enjoy my posts.&lt;/p&gt;

&lt;p&gt;Some of the feedback I’ve received from readers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Also I'm not sure if I said this before but it's a great thing you're doing and I've read all of your articles thus far. I hope you keep doing it and I wish you success!&lt;/p&gt;

&lt;p&gt;Thanks for sharing this post, Ali. I have been learning this more deeply during my studies, but I never saw a simple summary and overview like this one. It was a nice journey!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/subscribe?" rel="noopener noreferrer"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This may seem like a random topic, but I’m currently taking a cryptography class and my project idea is related to homomorphic encryption. While doing some preliminary research into the topic I stumbled across Merkle trees and distributed filesystems. So, naturally, I decided to make it the topic of my post in hopes of learning more about them and imparting some knowledge.&lt;/p&gt;

&lt;p&gt;So, without further ado…enjoy.&lt;/p&gt;




&lt;p&gt;Before we can dive into Merkle Trees we first have to talk about plain old trees.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a Tree?
&lt;/h1&gt;

&lt;p&gt;A tree is a data structure in computer science that represents a hierarchy composed of nodes. Each node can be connected to &lt;strong&gt;many&lt;/strong&gt; children but can have only &lt;strong&gt;one&lt;/strong&gt; parent; the exception is the &lt;strong&gt;root node&lt;/strong&gt;, which has no parent and is the top-most node in the tree. Nodes with no children are called &lt;strong&gt;leaf nodes&lt;/strong&gt;. Below is a figure that explains this visually:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ssxyknt2zltfziubzdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ssxyknt2zltfziubzdd.png" width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the visual above, let’s classify the different types of nodes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Root Node&lt;/strong&gt; : the root node in the above tree is &lt;strong&gt;Parent&lt;/strong&gt;. The parent node has no…parent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Children Nodes&lt;/strong&gt; : James, Thomas, Nick, and all of the Grandchildren are children nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Leaf:&lt;/strong&gt; the leaf nodes of a tree are the children nodes with no children of their own. So in this case, all of the &lt;strong&gt;Grandchildren&lt;/strong&gt; nodes are leaf nodes as they do &lt;strong&gt;not&lt;/strong&gt; have any children of their own.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The constraints put on the nodes are actually quite useful. First, they mean there are no loops in the relations: a node &lt;strong&gt;cannot&lt;/strong&gt; be its own ancestor. Secondly, they mean we can consider each child node the root node of its own subtree. That might sound confusing, so let’s use some visuals:&lt;/p&gt;

&lt;p&gt;Let’s show the entire tree with leaf nodes and subtrees marked:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb75pb40kh42kn3s9gbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnb75pb40kh42kn3s9gbo.png" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown, James, the child node of Parent, is the root node of its own subtree, Subtree 1. This is true for all non-leaf child nodes of a tree.&lt;/p&gt;

&lt;p&gt;With that in mind, how can we visit / access each node? This is called “walking the tree” or tree traversal.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tree Traversal
&lt;/h2&gt;

&lt;p&gt;We won’t be looking at tree traversal in depth in this post; I’ll just briefly describe the two most popular types at a high level.&lt;/p&gt;

&lt;p&gt;Unlike linear data structures, which are traversed in a linear fashion (looping over the data structure itself and computing at each element), trees can be traversed in multiple ways. Two of the most common traversal types are breadth-first and depth-first.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Depth-First&lt;/strong&gt; : go as deep into the tree as possible before moving to the next sibling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Breadth-First&lt;/strong&gt; : explore all nodes at the current depth before moving on; also known as level-order traversal.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We now have enough background info to tackle this week’s post…&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a Merkle Tree?
&lt;/h1&gt;

&lt;p&gt;A Merkle tree is a binary tree (a tree in which each node can have at most two children) in which every leaf node contains the cryptographic hash of a block of data and every non-leaf node contains the cryptographic hash of its children’s hashes (if that sounds confusing, I’ll show a visual in a second).&lt;/p&gt;

&lt;p&gt;Now, what this specific tree allows us to do is securely verify that arbitrary content exists within the tree.&lt;/p&gt;

&lt;p&gt;Let’s get a visualization of a Merkle Tree:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkaezakcydgqliq1cf58m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkaezakcydgqliq1cf58m.png" width="800" height="903"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As shown, the leaf nodes of the Merkle Tree are hashes of the underlying data block. Now, at the very top we have the root node, known as the Merkle Root. So why is this tree important? It allows us to verify data efficiently and securely without having to check each piece of data individually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inclusivity
&lt;/h2&gt;

&lt;p&gt;So how can we verify that a piece of data is included in the merkle tree? By providing a Merkle Proof, aka an &lt;em&gt;inclusion proof&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;In an inclusion proof all we need is the leaf node’s (the data block being verified) sibling and all other intermediate hashes needed to calculate the merkle root hash. I’ll explain visually below:&lt;/p&gt;

&lt;p&gt;Let’s say we want to check whether data block &lt;strong&gt;L1&lt;/strong&gt; is in the Merkle Tree. The way we would do that is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Calculate &lt;code&gt;hash(L1)&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the Merkle Tree’s &lt;code&gt;Hash 0-1&lt;/code&gt; and &lt;code&gt;Hash 1&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rlnwzhmxf7h8duaqr33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rlnwzhmxf7h8duaqr33.png" width="800" height="841"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have &lt;code&gt;Hash 1&lt;/code&gt; and &lt;code&gt;Hash 0-1&lt;/code&gt; we can calculate the Merkle root and compare it against the known root to verify the exact content of &lt;code&gt;L1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This method of proof is actually very advantageous when compared to linear data structures because it scales &lt;em&gt;logarithmically&lt;/em&gt;. This means we only need to compute &lt;code&gt;log(N)&lt;/code&gt; hashes to prove the inclusion of a data element, where &lt;code&gt;N&lt;/code&gt; is the number of leaf nodes. As the data grows, this logarithmic scaling proves much more manageable.&lt;/p&gt;

&lt;p&gt;So in our example above, we have four leaf nodes, meaning we only need log(4) = 2 hashes to verify any data block. Let’s see how that scales:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;8 leaf nodes → log(8) → 3 hashes needed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;16 leaf nodes → log(16) → 4 hashes needed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;128 leaf nodes → log(128) → 7 hashes needed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;1,000,000 leaf nodes → log(1000000) → 20 hashes needed&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As the leaf nodes grow, the number of hashes to verify any individual data block grows at a sustainable pace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Uses
&lt;/h2&gt;

&lt;p&gt;So where do we see Merkle Trees today? Basically everywhere. The most common and prevalent usage of Merkle Trees is within the blockchain and cryptocurrency space: they’re a fundamental part of blockchain technology, used to verify and organize transactions efficiently within data blocks.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Blockchain Technology&lt;/strong&gt; : verifying and organizing the transactions inside each block (Bitcoin block headers, for example, carry a Merkle root).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Distributed File Systems&lt;/strong&gt; : addressing and verifying chunks of files, as systems like IPFS do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Version Control Systems&lt;/strong&gt; : Git and Mercurial store content in hash-linked structures, so any change to a file propagates up into a new commit hash.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Peer-to-Peer Networks&lt;/strong&gt; : verifying individual pieces of a download against a trusted root hash.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trusted Platform Modules (TPMs)&lt;/strong&gt;: verifying the integrity of software and configuration state.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So basically, Merkle Trees are everywhere from the file system on your computer to your version control system (Git, Mercurial) to distributed databases like Amazon DynamoDB and Google’s BigTable.&lt;/p&gt;




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Merkle Trees play a crucial role in ensuring data integrity and efficient verification in distributed systems. Their hierarchical hashing mechanism allows for secure and scalable data verification, making them indispensable in modern technology.&lt;/p&gt;

&lt;p&gt;In next week’s post, we’ll be implementing Merkle Trees.&lt;/p&gt;

&lt;p&gt;If you found this post helpful or have questions, please leave a comment below. Don't forget to subscribe to stay updated on future posts exploring fascinating topics in computer science and cryptography.&lt;/p&gt;

&lt;p&gt;Thanks for tuning in,&lt;/p&gt;

&lt;p&gt;Ali.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/merkle-trees/comments" rel="noopener noreferrer"&gt;Leave a comment&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/merkle-trees?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading Professional Imposter Syndrome! Subscribe for free to receive new posts and support my work.&lt;/p&gt;

</description>
      <category>cryptocurrency</category>
      <category>distributedsystems</category>
      <category>datastructures</category>
    </item>
    <item>
      <title>Merkle Trees....continued (#4)</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:10:18 +0000</pubDate>
      <link>https://dev.to/amalik18/merkle-treescontinued-4-3kgi</link>
      <guid>https://dev.to/amalik18/merkle-treescontinued-4-3kgi</guid>
      <description>&lt;p&gt;Hey folks, we’re back for another post. Today, as promised in &lt;a href="https://dev.to/amalik18/merkle-trees-3-5733-temp-slug-6946772"&gt;the previous post&lt;/a&gt;, we’ll be implementing Merkle Trees. I will do my best to provide the implementation in multiple languages.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I’m currently trying to get better at Golang so I’ll be implementing the tree in that.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/subscribe?" rel="noopener noreferrer"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Shout-outs
&lt;/h3&gt;

&lt;p&gt;We’ve gone from 48 subscribers last week to 57 this week. I’m beyond grateful for all the subscribers; hopefully you all are enjoying the content. I’ve gotten a few requests from subscribers, and I’d like to hear what folks want to see more of.&lt;/p&gt;

&lt;h1&gt;
  
  
  Recap
&lt;/h1&gt;

&lt;p&gt;Let’s quickly recap what we discussed last time. We talked about the tree data structure and how it represents a hierarchy consisting of nodes classified as root, children, and leaf. We then went on to define and discuss a special type of tree, a Merkle tree.&lt;/p&gt;

&lt;p&gt;We defined a Merkle tree as a binary tree in which every leaf node contains the cryptographic hash of a data element, while every non-leaf node contains the cryptographic hash of the concatenation of its children’s hashes. Below is our definition in a visual format:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kexa181rcuvw2sasgl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2kexa181rcuvw2sasgl5.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  &lt;strong&gt;Implementation&lt;/strong&gt;
&lt;/h1&gt;

&lt;p&gt;Let’s get started implementing the Merkle Tree. We’ll start by defining our high level data structures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;MerkleNode: this will represent a singular node. As we’ve outlined, a node can have a left child and a right child. Secondly, a node contains either the hash of a data element or the hash of the concatenation of its children’s hashes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tree: the tree itself is represented by its root node, the MerkleRoot&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// Let's define a Node N&lt;/span&gt;
&lt;span class="c"&gt;// A node can have a left child and a right child&lt;/span&gt;
&lt;span class="c"&gt;// A node also contains the cryptographic hash&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;MerkleNode&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;LeftChild&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;MerkleNode&lt;/span&gt;
 &lt;span class="n"&gt;RightChild&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;MerkleNode&lt;/span&gt;
 &lt;span class="n"&gt;Hash&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Tree&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;MerkleRoot&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;MerkleNode&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Above we've defined the &lt;strong&gt;MerkleNode&lt;/strong&gt;, which contains the members &lt;strong&gt;LeftChild, RightChild, and Hash&lt;/strong&gt;. We've also defined the &lt;strong&gt;Tree&lt;/strong&gt; itself. The &lt;strong&gt;Tree&lt;/strong&gt; struct only has one member...the MerkleRoot.&lt;/p&gt;

&lt;p&gt;With our data structures set up let’s now tackle actually creating a node. As stated above, a node can have children and that determines what is hashed and stored.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewMerkleNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;leftChild&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rightChild&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;MerkleNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;MerkleNode&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;MerkleNode&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;

 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;leftChild&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;rightChild&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;oldHash&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;leftChild&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hash&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rightChild&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hash&lt;/span&gt;
  &lt;span class="n"&gt;newHash&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum256&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;oldHash&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EncodeToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newHash&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EncodeToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;LeftChild&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;leftChild&lt;/span&gt;
 &lt;span class="n"&gt;node&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RightChild&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rightChild&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;node&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our function &lt;strong&gt;NewMerkleNode&lt;/strong&gt; takes three inputs: the left child, the right child, and the data itself. We then create an empty &lt;strong&gt;MerkleNode&lt;/strong&gt; and explicitly check whether we passed in valid values for the left and right children. If we did, the node’s hash is computed from the children’s hashes; otherwise it’s computed from the data. We then return the memory address of the node.&lt;/p&gt;

&lt;p&gt;To highlight how this works, imagine we want to create a brand-new leaf node. The way we would do that is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;leafNode&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;NewMerkleNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, if we wanted to create a regular node with left and right children it would look something like this (assuming the left and right children have already been created):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;node&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;NewMerkleNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;leftChild&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rightChild&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we’ve got a way to create nodes, we need a way to construct the tree itself. The gist of constructing a tree is you climb up the tree creating new levels until you reach the top, the root node.&lt;/p&gt;

&lt;p&gt;As a thought exercise, let’s imagine we’re given a bunch of data blocks and asked to construct the Merkle Tree. It would look something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Create leaf node corresponding to each data block&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go up one level creating nodes corresponding to the lower level leaf nodes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Go up one level creating nodes corresponding to the lower level nodes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And so on….. until we’ve reached a singular node.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One thing we need to watch out for is an odd number of nodes at any level other than the root. For example, let’s visualize having 5 data blocks to hash:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femgqaucxp7pgfm9y6tst.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femgqaucxp7pgfm9y6tst.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can tell, there are only 5 nodes, an odd number, so we append the last node again to make it look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcg7br3pih5f5ewa5gabl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcg7br3pih5f5ewa5gabl.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we have an even number of nodes; let’s start moving up the tree.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3t4lt1tesbsuqlp7kdeo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3t4lt1tesbsuqlp7kdeo.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once again we find ourselves in a situation where we need to replicate the last node at this level (in this case &lt;strong&gt;Hash(“Data Block 5 + Data Block 5”)&lt;/strong&gt;). As we replicate and move up this is what we end up with:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feu3l7aksfgqcweejsakx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feu3l7aksfgqcweejsakx.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, we had to replicate two nodes (the ones in red). We have to take this into account when coming up with the logic to create the tree.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;NewMerkleTree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataBlocks&lt;/span&gt; &lt;span class="p"&gt;[][]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Tree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataBlocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"must have more than 0 data blocks"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;MerkleNode&lt;/span&gt;

 &lt;span class="c"&gt;// Handle odd number of data blocks by replicating the last one.&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataBlocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;dataBlocks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataBlocks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataBlocks&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataBlocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="c"&gt;// Create leaf Nodes&lt;/span&gt;
 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;dataBlocks&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;NewMerkleNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="c"&gt;// Time to construct the tree level by level&lt;/span&gt;
 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataBlocks&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;newLevel&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;MerkleNode&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="c"&gt;// If the number of nodes at the level are odd, once again append the last one at the end&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
   &lt;span class="n"&gt;newLevel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newLevel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;NewMerkleNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;nodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;newLevel&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="c"&gt;// return the tree&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;Tree&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MerkleRoot&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To handle the possibility of an odd number of nodes at any given level, there’s a check: if the node count is odd, we append the last node to the array again and carry on.&lt;/p&gt;

&lt;p&gt;What this function does is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create leaf nodes from the input data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build parent nodes level by level, checking each time whether the level has an even number of nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Return the &lt;strong&gt;Tree&lt;/strong&gt; struct with the &lt;strong&gt;MerkleRoot&lt;/strong&gt; set to the only element in the nodes array.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Using this function would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt; &lt;span class="n"&gt;testData&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="p"&gt;[][]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transaction 1"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transaction 2"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transaction 3"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transaction 4"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transaction 5"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;NewMerkleTree&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;testData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Okay, we’ve implemented it, but how can we check that it’s correct? How can we verify that the hashes line up? With a simple breadth-first search (BFS) of the tree.&lt;/p&gt;
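Here’s roughly what that BFS looks like, as a hedged sketch rather than the exact code from the post: it assumes the MerkleNode shape from earlier (Left/Right pointers plus a hex-encoded Hash), and it collects levels from the root down, while the printouts below number levels from the leaves up.

```go
package main

import "fmt"

// Assumed node shape, matching the implementation above.
type MerkleNode struct {
	Left, Right *MerkleNode
	Hash        string
}

// Levels walks the tree breadth-first, collecting the hashes at each depth
// (root first). Printing these slices produces level dumps like the ones below.
func Levels(root *MerkleNode) [][]string {
	var levels [][]string
	queue := []*MerkleNode{root}
	for len(queue) > 0 {
		var hashes []string
		var next []*MerkleNode
		for _, n := range queue {
			hashes = append(hashes, n.Hash)
			if n.Left != nil {
				next = append(next, n.Left)
			}
			if n.Right != nil {
				next = append(next, n.Right)
			}
		}
		levels = append(levels, hashes)
		queue = next
	}
	return levels
}

func main() {
	// Tiny hand-built tree just to exercise the walk.
	left := &MerkleNode{Hash: "hashA"}
	right := &MerkleNode{Hash: "hashB"}
	root := &MerkleNode{Left: left, Right: right, Hash: "hashAB"}
	for depth, hashes := range Levels(root) {
		fmt.Println("depth", depth, hashes)
	}
}
```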

&lt;p&gt;Let’s verify along the way:&lt;/p&gt;

&lt;h4&gt;
  
  
  Data Blocks
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data: Transaction 1 Hash:dff3b30655dc240deca00ed22fae68fdf8cf465bbe99bb2b2e24259cc1daac3a
Data: Transaction 2 Hash:4ae0e48b754a046b0f08e50e91708ddff4bac4daee30b786dbd67c30d8e00df8
Data: Transaction 3 Hash:2b8fd91deadf550d81682717104df059adc0addd006a0c7b99297e88769b30e5
Data: Transaction 4 Hash:b99ca09efe93055ad86acb5bfc964e16393d8e4672c3a4c5fa08ffabc85065b3
Data: Transaction 5 Hash:40d1474d042b66b26df83eae197368b93d84d8c960d39aec68573796078114a4
Data: Transaction 5 Hash:40d1474d042b66b26df83eae197368b93d84d8c960d39aec68573796078114a4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see that &lt;strong&gt;Transaction 5&lt;/strong&gt; is repeated.&lt;/p&gt;

&lt;h4&gt;
  
  
  Level 0 (Immediately above Data Blocks)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Printing levels...
Level 0...

Hash Left: dff3b30655dc240deca00ed22fae68fdf8cf465bbe99bb2b2e24259cc1daac3a Hash Right: 4ae0e48b754a046b0f08e50e91708ddff4bac4daee30b786dbd67c30d8e00df8 New Hash: a31d3187c179a847bc0fbe729e06a0770147ad58b45995ac945032df15ba38e3

Hash Left: 2b8fd91deadf550d81682717104df059adc0addd006a0c7b99297e88769b30e5 Hash Right: b99ca09efe93055ad86acb5bfc964e16393d8e4672c3a4c5fa08ffabc85065b3 New Hash: c6dadc3fe8fde36887f07ed12e0bb073b4165d0921749b98b2ae237f3aed3e07

Hash Left: 40d1474d042b66b26df83eae197368b93d84d8c960d39aec68573796078114a4 &amp;lt; Hash Right: 40d1474d042b66b26df83eae197368b93d84d8c960d39aec68573796078114a4 &amp;lt; New Hash: 745ebfe140c7e62a45766934adff116dcd736890477b37ccd44550284d38e7c2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can see each of the data block hashes being used. And the &lt;em&gt;&amp;lt;&lt;/em&gt; marks the replicated node.&lt;/p&gt;

&lt;h4&gt;
  
  
  Level 1
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 1...
Hash Left: a31d3187c179a847bc0fbe729e06a0770147ad58b45995ac945032df15ba38e3 Hash Right: c6dadc3fe8fde36887f07ed12e0bb073b4165d0921749b98b2ae237f3aed3e07 New Hash:7e1182c5c00e379261503e757487cfa16cda7010f1ca3ae0115d5b78cfc07509

Hash Left: 745ebfe140c7e62a45766934adff116dcd736890477b37ccd44550284d38e7c2 &amp;lt; Hash Right: 745ebfe140c7e62a45766934adff116dcd736890477b37ccd44550284d38e7c2 &amp;lt; New Hash:c4a338ab87fe2e56055a48e160f007ebb6a8a90303659fd8b97dde9d99a9a164
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same as before, you can see where the node was replicated, and where the hashes from Level 0 are used.&lt;/p&gt;

&lt;h4&gt;
  
  
  Level 2 (Merkle Root)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Level 2...

Hash Left: 7e1182c5c00e379261503e757487cfa16cda7010f1ca3ae0115d5b78cfc07509 Hash Right: c4a338ab87fe2e56055a48e160f007ebb6a8a90303659fd8b97dde9d99a9a164 
New Hash: 2c2c4cdf817ca1233db4784bb8752eddca8428c5c88ad7fad7e7235532e33c3c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there we go, we’ve reached the top, the Merkle Root.&lt;/p&gt;

&lt;h1&gt;
  
  
  Verification
&lt;/h1&gt;

&lt;p&gt;Now that we have a Merkle Tree, how can we verify an element of the original data blocks?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario&lt;/strong&gt;: you’re hosting a file on Dropbox and need to retrieve it. Dropbox partitions the file into &lt;em&gt;chunks&lt;/em&gt; and stores them across multiple nodes. When retrieving your file, Dropbox will use a Merkle Proof to verify that each chunk belongs to the requested file.&lt;/p&gt;

&lt;p&gt;So, given a set of data blocks and a specific data block to verify, how can we ensure that the specific data block is part of the Merkle Tree? This is where a Merkle Proof comes into play.&lt;/p&gt;

&lt;p&gt;A Merkle Proof (defined as a Merkle Audit Path in &lt;a href="https://datatracker.ietf.org/doc/html/rfc6962#section-2.1" rel="noopener noreferrer"&gt;RFC6962&lt;/a&gt;) is basically the list of sibling hashes in the Merkle Tree required to verify that the input is contained in the tree.&lt;/p&gt;

&lt;p&gt;Let’s visualize this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2112ykwfb9n4akywsls.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj2112ykwfb9n4akywsls.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s say we want to verify that &lt;strong&gt;Data Block 1&lt;/strong&gt; (blue) is included in the Merkle Root Node for the tree, what nodes do we need to verify that?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We can generate the hash for &lt;strong&gt;Data Block 1&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We need the hash for &lt;strong&gt;Data Block 2 (green)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We need the hash for &lt;strong&gt;Hash (Hash 3 + Hash 4) (green)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We need the hash for &lt;strong&gt;Hash (Hash 5&amp;amp;5 + Hash 5&amp;amp;5) (green)&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means for &lt;strong&gt;Data Block 1&lt;/strong&gt; the Merkle Proof is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Hash(Block 2), Hash(Hash 3 + Hash 4), Hash(Hash 5&amp;amp;5 + Hash 5&amp;amp;5)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Let’s implement the verification process. This will be two parts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Build the proof&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Verify the proof&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Build the Proof
&lt;/h3&gt;

&lt;p&gt;Let’s say we’re given an array of data blocks and a specific data block we want to verify. How do we approach this? It’s almost identical to building the actual Merkle tree; the only difference is that, along the way, we append the sibling hashes we need to an array and return them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tree&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Tree&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;MerkleProof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dataBlocks&lt;/span&gt; &lt;span class="p"&gt;[][]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="c"&gt;// Find index of the data block we're looking for&lt;/span&gt;
 &lt;span class="c"&gt;// Determines whether we have a left or right sibling&lt;/span&gt;
 &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;dataBlocks&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;bytes&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;block&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;
   &lt;span class="k"&gt;break&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;New&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"data not found in datablocks"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="c"&gt;// Get the leaf hashes&lt;/span&gt;
 &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;proof&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;
 &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;hashedNodes&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;MerkleNode&lt;/span&gt;
 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;dataBlocks&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;hashedNodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NewMerkleNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;

 &lt;span class="c"&gt;// We will now construct the tree and traverse up to verify&lt;/span&gt;
 &lt;span class="c"&gt;// Basically build the tree and add the necessary nodes to our proof&lt;/span&gt;
 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;nextLevel&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;MerkleNode&lt;/span&gt;

  &lt;span class="c"&gt;// If even we have a right sibling, if odd we have a left sibling&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;proof&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proof&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="n"&gt;proof&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;proof&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;nextLevel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nextLevel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NewMerkleNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;nextLevel&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nextLevel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NewMerkleNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;hashedNodes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="n"&gt;hashedNodes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nextLevel&lt;/span&gt;
  &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;/=&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;proof&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code gives us an array of sibling hashes that, when hashed together with our data block’s hash step by step, re-constructs our MerkleRoot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verify the Proof
&lt;/h3&gt;

&lt;p&gt;This part is the easiest: we quite literally just hash each member of the &lt;em&gt;proof&lt;/em&gt; array into a running hash until we reach the MerkleRoot. One simplification to be aware of: the verifier below always treats the running hash as the left operand, which works for our example; a production verifier would also record whether each sibling sits to the left or the right.&lt;/p&gt;

&lt;p&gt;To do this we’ll write two helper functions to make our lives easier:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;leafHash&lt;/code&gt;, which hashes the contents as if it were a leaf.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;nodeHash&lt;/code&gt;, which hashes the concatenation of two node hashes.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;nodeHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;leftHash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rightHash&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;newHash&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum256&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;leftHash&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;rightHash&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EncodeToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;newHash&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;leafHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;hash&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sha256&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sum256&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hex&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EncodeToString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the helper functions in place, let’s implement the verifier. It will return &lt;em&gt;true&lt;/em&gt; if the data block is part of the Merkle Tree and &lt;em&gt;false&lt;/em&gt; otherwise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;VerifyMerkleProof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;proof&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rootHash&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
 &lt;span class="n"&gt;leaf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;leafHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;siblingHash&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;proof&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;leaf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nodeHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;leaf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;siblingHash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
 &lt;span class="p"&gt;}&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;leaf&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;rootHash&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s it, we’re done. To actually run through it, we’ll reuse the data from above; let’s check whether &lt;strong&gt;Transaction 1&lt;/strong&gt; is in the Merkle Tree:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MerkleProof&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transaction 1"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;testData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When executed, &lt;em&gt;output&lt;/em&gt; is a list of hashes that, when hashed together with the original data block, &lt;em&gt;should&lt;/em&gt; give us the MerkleRoot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[4ae0e48b754a046b0f08e50e91708ddff4bac4daee30b786dbd67c30d8e00df8 c6dadc3fe8fde36887f07ed12e0bb073b4165d0921749b98b2ae237f3aed3e07 c4a338ab87fe2e56055a48e160f007ebb6a8a90303659fd8b97dde9d99a9a164]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now to verify the proof:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;proof&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;VerifyMerkleProof&lt;/span&gt;&lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Transaction 1"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;MerkleRoot&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This returns &lt;em&gt;true&lt;/em&gt;. So, there you have it: a fully functional Merkle Tree implementation with proof generation and verification.&lt;/p&gt;
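If you want to double-check the arithmetic yourself, the whole chain can be replayed in a standalone snippet. The leafHash and nodeHash helpers are re-declared here so it runs on its own, and the proof hashes are copied from the output above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Same helpers as above, repeated so this snippet is self-contained.
func leafHash(data []byte) string {
	h := sha256.Sum256(data)
	return hex.EncodeToString(h[:])
}

func nodeHash(left, right string) string {
	h := sha256.Sum256([]byte(left + right))
	return hex.EncodeToString(h[:])
}

func main() {
	// Proof for "Transaction 1", copied from the output above.
	proof := []string{
		"4ae0e48b754a046b0f08e50e91708ddff4bac4daee30b786dbd67c30d8e00df8",
		"c6dadc3fe8fde36887f07ed12e0bb073b4165d0921749b98b2ae237f3aed3e07",
		"c4a338ab87fe2e56055a48e160f007ebb6a8a90303659fd8b97dde9d99a9a164",
	}
	// Start from the leaf and fold each sibling hash in.
	current := leafHash([]byte("Transaction 1"))
	for _, sibling := range proof {
		current = nodeHash(current, sibling)
	}
	fmt.Println(current) // should match the Merkle Root from Level 2
}
```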

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;So today we implemented what we talked about last week: Merkle Trees. In our implementation we referred to RFC6962 for guidance, added a Merkle Proof method that returns the list of hashes needed to reconstruct the Merkle Root, and added a verification method that checks whether the proof is correct.&lt;/p&gt;

&lt;p&gt;If you liked what you read please like, comment, and share. I’m always looking for feedback and topic suggestions.&lt;/p&gt;

&lt;p&gt;Once again, thanks all for subscribing and tuning in this week. I will see y’all next week.&lt;/p&gt;

&lt;p&gt;Thanks,&lt;/p&gt;

&lt;p&gt;Ali&lt;/p&gt;

&lt;p&gt;Professional Imposter Syndrome is a reader-supported publication. To receive new posts and support my work, consider becoming a subscriber.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/merkle-treescontinued/comments" rel="noopener noreferrer"&gt;Leave a comment&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/merkle-treescontinued?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share&lt;/a&gt;&lt;/p&gt;

</description>
      <category>cryptocurrency</category>
      <category>distributedsystems</category>
      <category>datastructures</category>
    </item>
    <item>
      <title>How OpenAI Serves 800 Million Users Without Sharding Postgres</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/amalik18/how-openai-serves-800-million-users-without-sharding-postgres-1bo4</link>
      <guid>https://dev.to/amalik18/how-openai-serves-800-million-users-without-sharding-postgres-1bo4</guid>
      <description>&lt;p&gt;OpenAI recently published a blog post titled &lt;a href="https://openai.com/index/scaling-postgresql/" rel="noopener noreferrer"&gt;Scaling PostgreSQL to Power 800 Million ChatGPT Users&lt;/a&gt;, and it’s one of those posts that I think anyone working with databases or distributed systems should read. Not because the techniques are new or groundbreaking, but because of what OpenAI &lt;em&gt;chose not to do&lt;/em&gt;. No sharding. No fancy distributed databases. No custom storage engine. Just a single PostgreSQL primary, roughly 50 read replicas, and a lot of operational discipline.&lt;/p&gt;

&lt;p&gt;When I first read this, I was shocked by how much restraint OpenAI showed in every decision. Oftentimes as engineers we want the fancy solution that’ll look cool and be awesome to talk about and show off, but that doesn’t mean it’s the best solution. The overarching goal of this post is to walk through what OpenAI did, why it works, and what we can take away from it as engineers building systems at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Definitions and Background
&lt;/h2&gt;

&lt;p&gt;Before getting into the specifics of OpenAI’s setup, let’s define a few things that’ll come up throughout this post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sharding:&lt;/strong&gt; splitting your massive database across multiple independent machines (shards), each responsible for a subset of the original data. This allows you to scale writes horizontally, but it introduces significant complexity: cross-shard queries, distributed transactions, rebalancing, and application-layer routing logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write-Ahead Log (WAL):&lt;/strong&gt; the mechanism PostgreSQL uses to ensure durability. Every change is first written to a sequential log before being applied to the actual data files. This same log is streamed to read replicas to keep them in sync with the primary. Think of it like a shared notebook: the primary writes every change into the notebook first, and replicas read from that notebook to stay up to date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MVCC (Multi-Version Concurrency Control):&lt;/strong&gt; how PostgreSQL handles concurrent access to data. Rather than locking rows during updates, PostgreSQL creates a new version of the row and lets readers continue accessing the old version. This is great for read concurrency, but it has some well-documented downsides for write-heavy workloads (more on this later).&lt;/p&gt;

&lt;p&gt;Now that we have some necessary background, let’s look at what OpenAI actually built.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;OpenAI’s core PostgreSQL deployment is surprisingly straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One primary instance&lt;/strong&gt; on Azure PostgreSQL Flexible Server, handling all writes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;~50 read replicas&lt;/strong&gt; distributed across multiple geographic regions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;PgBouncer&lt;/strong&gt; (a connection pooling proxy) in front of every replica&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A caching layer&lt;/strong&gt; absorbing the majority of read traffic&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write-heavy workloads migrated to &lt;strong&gt;Azure Cosmos DB&lt;/strong&gt; (a sharded NoSQL system)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s it. A single writer node handling writes for 800 million monthly active users, with reads fanned out to replicas. Their database load has grown by more than 10x in the past year, and they’re handling millions of queries per second (QPS) across the cluster.&lt;/p&gt;

&lt;p&gt;So how does a single primary not fall over? The answer comes down to the way it’s accessed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmsv4t8s7wcblbx5mrhb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcmsv4t8s7wcblbx5mrhb.png" alt="OpenAI's PostgreSQL architecture. A single primary handles all writes and streams WAL to ~50 read replicas distributed across regions. PgBouncer pools connections in front of each replica (dropping connection time from 50ms to 5ms), and a caching layer absorbs the majority of read traffic before it ever hits a replica. Write-heavy workloads are migrated to Cosmos DB to reduce pressure on the primary." width="800" height="428"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;OpenAI’s PostgreSQL architecture. A single primary handles all writes and streams WAL to ~50 read replicas distributed across regions. PgBouncer pools connections in front of each replica (dropping connection time from 50ms to 5ms), and a caching layer absorbs the majority of read traffic before it ever hits a replica. Write-heavy workloads are migrated to Cosmos DB to reduce pressure on the primary.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Works
&lt;/h2&gt;

&lt;p&gt;The key here is that ChatGPT’s workload is overwhelmingly read-heavy. Think about what the typical ChatGPT interaction looks like from a database perspective: fetch user data, fetch conversation history, fetch preferences. Read after read after read. The actual writes (creating a new message, updating a session) are a small fraction of total operations.&lt;/p&gt;

&lt;p&gt;This matters because PostgreSQL’s replication model scales reads almost linearly. Each read replica is a full copy of the database that can independently serve queries. Add another replica, absorb another chunk of read traffic. The primary doesn’t care, it just keeps streaming WAL to the replicas.&lt;/p&gt;

&lt;p&gt;To put it even more simply: if your read-to-write ratio is high enough, horizontal read scaling through replication can get you very far. OpenAI is proof of that.&lt;/p&gt;
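&lt;p&gt;To make the read-scaling idea concrete, here’s a toy Python sketch (my own illustration, not OpenAI’s code) of a router that sends every write to the single primary and fans reads out across the replicas:&lt;/p&gt;

```python
import itertools

class ReadReplicaRouter:
    """Toy router: all writes go to the primary, reads round-robin across replicas."""
    def __init__(self, primary, replicas):
        self.primary = primary
        self._cycle = itertools.cycle(replicas)

    def route(self, query):
        # Crude read/write classification for the sketch; real routers
        # inspect the statement or use separate read/write sessions.
        if query.lstrip().upper().startswith("SELECT"):
            return next(self._cycle)
        return self.primary

router = ReadReplicaRouter("primary", [f"replica-{i}" for i in range(50)])
targets = [router.route("SELECT * FROM conversations WHERE user_id = 1") for _ in range(100)]

assert router.route("UPDATE users SET last_login = now()") == "primary"
assert len(set(targets)) == 50   # reads spread across all 50 replicas
```

&lt;p&gt;The primary’s write load never grows as you add replicas; only the read capacity does.&lt;/p&gt;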

&lt;p&gt;If you enjoy breakdowns like this, consider subscribing. I write about distributed systems, cryptography, and much, much more!&lt;/p&gt;

&lt;h2&gt;
  
  
  Where PostgreSQL Struggles
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flz8z5gg9j8cq6j385qzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flz8z5gg9j8cq6j385qzc.png" alt="How PostgreSQL's MVCC creates dead tuples. Updating a single field doesn't modify the row in place. Instead, PostgreSQL copies the entire row, marks the old version as dead, and leaves it for autovacuum to clean up. At OpenAI's scale, each write cascades: one logical change produces a full row copy, a WAL record, and ~50x network amplification as that WAL streams to every replica." width="800" height="537"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;How PostgreSQL’s MVCC creates dead tuples. Updating a single field doesn’t modify the row in place. Instead, PostgreSQL copies the entire row, marks the old version as dead, and leaves it for autovacuum to clean up. At OpenAI’s scale, each write cascades: one logical change produces a full row copy, a WAL record, and ~50x network amplification as that WAL streams to every replica.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;OpenAI doesn’t shy away from discussing PostgreSQL’s weaknesses, and the biggest one at their scale is the cost of writes under MVCC. The post directly references a blog that Bohan Zhang (the OpenAI engineer behind this effort) co-authored with Prof. Andy Pavlo at Carnegie Mellon University called &lt;a href="https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html" rel="noopener noreferrer"&gt;The Part of PostgreSQL We Hate the Most&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So what exactly is the problem? When PostgreSQL updates a row, even a single field, it doesn’t modify the row in place. Instead, it copies the entire row to create a new version. The old version becomes a “dead tuple” that sticks around until the autovacuum process cleans it up.&lt;/p&gt;

&lt;p&gt;For example: imagine you have a user row with 20 columns, and you update just the &lt;code&gt;last_login&lt;/code&gt; timestamp. PostgreSQL doesn’t touch the existing row. It writes a brand new copy of all 20 columns with the updated timestamp, and marks the old row as dead. That dead row still takes up space in the table and in every index that references it, and queries have to skip over it until autovacuum eventually reclaims it.&lt;/p&gt;
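&lt;p&gt;Here’s a tiny Python model of that behavior (a deliberately simplified illustration, not PostgreSQL’s actual storage code):&lt;/p&gt;

```python
# Toy model of MVCC's update-as-copy: an UPDATE writes a full new row
# version and leaves the old one behind as a dead tuple for vacuum.
class MvccTable:
    def __init__(self):
        self.versions = []          # every row version ever written

    def insert(self, row):
        self.versions.append({"row": dict(row), "dead": False})

    def update(self, match, changes):
        for v in list(self.versions):
            if not v["dead"] and all(v["row"].get(k) == x for k, x in match.items()):
                v["dead"] = True                   # old version becomes a dead tuple
                new_row = {**v["row"], **changes}  # entire row is copied, not patched
                self.versions.append({"row": new_row, "dead": False})

    def dead_tuples(self):
        return sum(1 for v in self.versions if v["dead"])

    def vacuum(self):
        self.versions = [v for v in self.versions if not v["dead"]]

t = MvccTable()
t.insert({"id": 1, "name": "ali", "last_login": "2026-01-01"})
t.update({"id": 1}, {"last_login": "2026-04-07"})  # one-field change, full-row copy
assert t.dead_tuples() == 1
t.vacuum()                                         # autovacuum's job, done by hand here
assert t.dead_tuples() == 0
```

&lt;p&gt;One logical single-field update produced a full new row version plus a dead tuple that hangs around until vacuum runs.&lt;/p&gt;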

&lt;p&gt;This design leads to several compounding issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write amplification&lt;/strong&gt;: A small logical change produces a disproportionately large physical write.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read amplification&lt;/strong&gt;: Queries have to scan past dead tuples to find the current version of a row.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Table and index bloat&lt;/strong&gt;: Dead tuples accumulate in both tables and indexes, consuming storage and degrading query performance over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Autovacuum complexity&lt;/strong&gt;: The garbage collector (autovacuum) that reclaims dead tuples requires careful tuning, and long-running transactions can block it entirely.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The practical consequence at OpenAI’s scale is that writes are expensive and they ripple through the entire system. More writes mean more WAL, which means more data streaming to ~50 replicas, which means more network bandwidth consumed and potentially more replication lag. One logical field change on the primary can cascade into 50x the network traffic.&lt;/p&gt;

&lt;p&gt;OpenAI’s solution wasn’t to fix PostgreSQL’s storage engine. It was to &lt;strong&gt;reduce the write surface area&lt;/strong&gt; as aggressively as possible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;They migrated shardable write-heavy workloads to Cosmos DB&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They fixed application bugs that were causing redundant writes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They introduced lazy writes to smooth out traffic spikes&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They rate-limited backfills (even if the process takes over a week)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They banned new tables from being added to the PostgreSQL cluster entirely&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point is worth sitting with. They’re effectively treating the PostgreSQL cluster as a closed system for writes. Existing workloads stay, but all new workloads go to sharded systems. To keep the primary healthy, they’ve drawn a line in the sand.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Patterns Worth Studying
&lt;/h2&gt;

&lt;p&gt;Beyond the high-level architecture, there are several specific patterns in the post that are worth understanding in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cache Stampede Protection
&lt;/h3&gt;

&lt;p&gt;When their caching layer has misses, a naive implementation would send every missed request straight to PostgreSQL, potentially overwhelming it. OpenAI implements a &lt;strong&gt;cache locking (and leasing) mechanism&lt;/strong&gt;: on a cache miss for a given key, only one request acquires a lock and fetches from the database. All other requests for the same key wait for the cache to be repopulated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhfw4t3b3sva3ttp31zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqhfw4t3b3sva3ttp31zr.png" alt="Cache stampede protection via lock-based request coalescing. Without protection (left), a cache miss for a given key sends N duplicate queries to PostgreSQL simultaneously. With protection (right), only one request acquires a lock and fetches from the database, while all other requests for the same key wait for the cache to be repopulated. The database gets hit once regardless of how many concurrent requests arrive." width="800" height="533"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Cache stampede protection via lock-based request coalescing. Without protection (left), a cache miss for a given key sends N duplicate queries to PostgreSQL simultaneously. With protection (right), only one request acquires a lock and fetches from the database, while all other requests for the same key wait for the cache to be repopulated. The database gets hit once regardless of how many concurrent requests arrive.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s an analogy. Imagine 1,000 people all try to look up the same book in a library at the same time, and the book isn’t on the shelf. Without protection, all 1,000 people walk to the back room and ask the librarian for it simultaneously. With stampede protection, one person goes to the back room, and the other 999 wait at the shelf until the book is returned and reshelved. The librarian (PostgreSQL) only gets asked once.&lt;/p&gt;

&lt;p&gt;This pattern is sometimes called “stampede protection” or “request coalescing.” It’s one of those things that’s easy to skip during initial implementation and then regret during your first major cache failure. If you’re building any system with a caching layer in front of a database, this is worth implementing early.&lt;/p&gt;
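&lt;p&gt;Here’s a minimal Python sketch of the locking idea (assumptions mine: an in-process cache and a per-key &lt;code&gt;threading.Lock&lt;/code&gt;; a real system would typically use a distributed lock or lease in the cache tier):&lt;/p&gt;

```python
import threading

class StampedeProtectedCache:
    """On a miss, one thread per key fetches; the rest wait for the filled value."""
    def __init__(self, fetch):
        self.fetch = fetch
        self.cache = {}
        self.locks = {}
        self.guard = threading.Lock()
        self.db_hits = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        with self.guard:                      # find-or-create the per-key lock
            lock = self.locks.setdefault(key, threading.Lock())
        with lock:                            # only one thread enters per key
            if key not in self.cache:         # re-check: a peer may have filled it
                self.db_hits += 1             # this is the lone trip to the database
                self.cache[key] = self.fetch(key)
        return self.cache[key]

cache = StampedeProtectedCache(fetch=lambda k: f"value-for-{k}")
threads = [threading.Thread(target=cache.get, args=("user:42",)) for _ in range(1000)]
for th in threads:
    th.start()
for th in threads:
    th.join()
assert cache.db_hits == 1   # 1,000 concurrent misses, one database query
```

&lt;p&gt;A production version would also add a lease timeout so a crashed fetcher doesn’t strand the waiters forever.&lt;/p&gt;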

&lt;h3&gt;
  
  
  Workload Isolation
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt4jtos8z7rc9dvz8kf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmt4jtos8z7rc9dvz8kf0.png" alt="Workload isolation via dedicated replica pools. Without isolation (left), all traffic shares the same replicas, and a heavy analytics query saturating CPU degrades ChatGPT's user-facing latency. With isolation (right), traffic is classified and routed to separate pools: high-priority for user-facing queries, product-isolated for other services, and low-priority for analytics and backfills. A CPU spike in the low-priority pool stays contained and never touches the critical path." width="800" height="497"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Workload isolation via dedicated replica pools. Without isolation (left), all traffic shares the same replicas, and a heavy analytics query saturating CPU degrades ChatGPT’s user-facing latency. With isolation (right), traffic is classified and routed to separate pools: high-priority for user-facing queries, product-isolated for other services, and low-priority for analytics and backfills. A CPU spike in the low-priority pool stays contained and never touches the critical path.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;They split traffic into low-priority and high-priority tiers and route them to separate replica pools. This prevents the “noisy neighbor” problem, where an expensive analytical query or a poorly optimized new feature can saturate CPU and degrade latency for critical-path requests. They apply the same isolation across different products and services as well, so that activity from one product doesn’t affect the performance of another.&lt;/p&gt;

&lt;p&gt;Basically, this is the same principle as Quality of Service (QoS) in networking: classify traffic, isolate resource pools, and make sure lower-priority work can’t starve higher-priority work.&lt;/p&gt;
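&lt;p&gt;A toy version of that classification step might look like this (tier names and pool layout are mine, purely for illustration):&lt;/p&gt;

```python
# Toy QoS-style router: classify each query, send it to the pool for its tier,
# so low-priority work can never touch the high-priority replicas.
POOLS = {
    "high":    ["replica-hi-1", "replica-hi-2"],   # user-facing critical path
    "product": ["replica-prod-1"],                 # other products/services
    "low":     ["replica-lo-1"],                   # analytics, backfills
}

def route(query_tags):
    if "user_facing" in query_tags:
        tier = "high"
    elif "analytics" in query_tags or "backfill" in query_tags:
        tier = "low"
    else:
        tier = "product"
    return tier, POOLS[tier]

tier, pool = route({"analytics"})
assert tier == "low" and "replica-lo-1" in pool
# A runaway analytics query can saturate replica-lo-1 without ever
# touching the high-priority pool serving user-facing traffic.
assert route({"user_facing"})[0] == "high"
```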

&lt;h3&gt;
  
  
  Multi-Layer Rate Limiting
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijehecnuo4zo9y4vc4uh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fijehecnuo4zo9y4vc4uh.png" alt="OpenAI's four layers of rate limiting between a request and the database. Each layer (application, proxy, PgBouncer, query) can independently drop or block traffic. During an incident, operators can identify a specific problematic query digest and block it at the ORM or query layer, turning a full-service Sev0 into a targeted load shed while all other traffic continues flowing normally." width="800" height="475"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;OpenAI’s four layers of rate limiting between a request and the database. Each layer (application, proxy, PgBouncer, query) can independently drop or block traffic. During an incident, operators can identify a specific problematic query digest and block it at the ORM or query layer, turning a full-service Sev0 into a targeted load shed while all other traffic continues flowing normally.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Rate limiting at the application layer alone isn’t sufficient. OpenAI implements rate limiting at four layers: application, connection pooler, proxy, and query. They even enhanced their ORM (Object-Relational Mapping) layer to support blocking specific query digests for targeted load shedding.&lt;/p&gt;

&lt;p&gt;The ability to identify and kill a specific problematic query pattern in real-time during an incident is a powerful operational tool. It turns what would be a full-service degradation into a targeted response.&lt;/p&gt;
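&lt;p&gt;Here’s a rough Python sketch of what digest-based blocking could look like (the normalization regex and API are my own simplification; real query fingerprinting, à la &lt;code&gt;pg_stat_statements&lt;/code&gt;, is far more thorough):&lt;/p&gt;

```python
import hashlib
import re

# Toy query-digest blocker: normalize literals out of the SQL, hash the shape,
# and refuse shapes on the blocklist -- targeted load shedding during an incident.
def digest(sql):
    shape = re.sub(r"\b\d+\b|'[^']*'", "?", sql.strip().lower())  # strip literals
    return hashlib.sha256(shape.encode()).hexdigest()[:12]

blocked = set()

def execute(sql):
    if digest(sql) in blocked:
        raise RuntimeError("query shape blocked by operator")
    return "ok"

bad = "SELECT a.*, b.* FROM a JOIN b ON a.id = b.a_id WHERE a.id = 123"
blocked.add(digest(bad))   # operator blocks the offending digest mid-incident
try:
    # Same shape, different literal: still caught by the digest.
    execute("SELECT a.*, b.* FROM a JOIN b ON a.id = b.a_id WHERE a.id = 456")
    shed = False
except RuntimeError:
    shed = True
assert shed
assert execute("SELECT now()") == "ok"   # everything else flows normally
```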

&lt;h3&gt;
  
  
  The 12-Table Join That Caused Sev0s
&lt;/h3&gt;

&lt;p&gt;One of the more memorable details: they identified an extremely costly query that joined 12 tables, and spikes in this query were directly responsible for past Sev0 incidents. Their takeaway is to avoid complex multi-table joins in OLTP (Online Transaction Processing) workloads, and to break them into multiple simpler queries with the join logic handled in the application layer.&lt;/p&gt;

&lt;p&gt;This is a cautionary tale about ORMs. ORM frameworks make it easy to define relationships between models and then traverse them in application code. Under the hood, many ORMs eagerly join across those relationships, producing SQL that no human DBA would write. If you’re using an ORM in a high-traffic system, it’s important to carefully review the actual SQL being generated, not just the ORM code.&lt;/p&gt;
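&lt;p&gt;To illustrate the “break it up” approach, here’s a toy sketch (hypothetical tables and fields, nothing from OpenAI’s schema) that replaces one wide join with a few simple lookups stitched together in application code:&lt;/p&gt;

```python
# Stand-ins for three tables, keyed the way indexed lookups would be.
users =    {1: {"name": "ali"}}
sessions = {1: [{"id": 10, "user_id": 1}]}
messages = {10: [{"session_id": 10, "text": "hi"}]}

def conversation_view(user_id):
    # Three cheap, independently cacheable queries instead of one wide join.
    user = users[user_id]                                         # query 1: PK lookup
    sess = sessions.get(user_id, [])                              # query 2: indexed lookup
    msgs = [m for s in sess for m in messages.get(s["id"], [])]   # query 3: batched fetch
    return {"user": user, "sessions": sess, "messages": msgs}

view = conversation_view(1)
assert view["user"]["name"] == "ali"
assert len(view["messages"]) == 1
```

&lt;p&gt;Each individual query is a cheap indexed lookup that’s easy to cache, rate-limit, and reason about, at the cost of a couple of extra round trips.&lt;/p&gt;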

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/openai-postgresql-scaling-800m-users?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Connection Pooling
&lt;/h2&gt;

&lt;p&gt;Each Azure PostgreSQL instance has a maximum connection limit of 5,000. At OpenAI’s scale, it’s easy to run out of connections or accumulate too many idle ones. They’ve had incidents caused by connection storms that exhausted all available connections.&lt;/p&gt;

&lt;p&gt;To address this, they deployed PgBouncer as a proxy layer to pool database connections, running it in statement or transaction pooling mode. This allows efficient reuse of connections and greatly reduces the number of active client connections. In their benchmarks, average connection time dropped from 50 milliseconds to 5 milliseconds. That’s a 10x improvement just from connection pooling.&lt;/p&gt;

&lt;p&gt;Each read replica gets its own Kubernetes deployment running multiple PgBouncer pods, with a Kubernetes Service load-balancing traffic across them. They also co-locate the proxy, clients, and replicas in the same region to minimize network overhead.&lt;/p&gt;
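&lt;p&gt;The core idea behind PgBouncer can be sketched in a few lines of Python (a toy in-process pool, not PgBouncer’s actual design):&lt;/p&gt;

```python
import queue

class ConnectionPool:
    """Minimal pooling sketch: a fixed set of server connections is
    reused across many short-lived client requests."""
    def __init__(self, size, connect):
        self.conns = queue.Queue()
        self.opened = 0
        for _ in range(size):
            self.conns.put(connect())   # pay the connection cost once, up front
            self.opened += 1

    def run(self, fn):
        conn = self.conns.get()         # borrow (blocks if the pool is exhausted)
        try:
            return fn(conn)
        finally:
            self.conns.put(conn)        # return for reuse instead of closing

pool = ConnectionPool(size=20, connect=lambda: object())
for _ in range(10_000):                 # 10k requests, never more than 20 server conns
    pool.run(lambda conn: "row")
assert pool.opened == 20
```

&lt;p&gt;Requests skip the expensive connection handshake entirely, which is where the 50 ms to 5 ms drop comes from.&lt;/p&gt;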

&lt;h2&gt;
  
  
  The Decision Not to Shard
&lt;/h2&gt;

&lt;p&gt;Perhaps the most interesting engineering decision in the entire post is the one they didn’t make: sharding PostgreSQL.&lt;/p&gt;

&lt;p&gt;They explicitly state that sharding existing workloads would require changes to hundreds of application endpoints and could take months or even years. Instead, they chose to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Optimize the single-primary architecture as far as possible&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Migrate &lt;em&gt;new&lt;/em&gt; write-heavy workloads to already-sharded systems (Cosmos DB)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gradually migrate &lt;em&gt;existing&lt;/em&gt; write-heavy workloads off PostgreSQL over time&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a more rational decision. Sharding introduces enormous complexity: cross-shard queries, distributed transactions, rebalancing, application-layer routing, and schema migration coordination across shards. For a read-heavy workload where the single primary still has headroom, the operational cost of sharding doesn’t justify the benefit.&lt;/p&gt;

&lt;p&gt;OpenAI isn’t saying sharding is bad. They’re saying sharding isn’t worth it &lt;em&gt;for them, right now&lt;/em&gt;, given their access patterns and the engineering cost of migration. They’re not ruling it out for the future, but it’s not a near-term priority.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Scaling Ceiling
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foizat538wnrt452gwroa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foizat538wnrt452gwroa.png" alt="Star vs. cascading replication topologies for WAL fan-out. In the current star topology (left), the primary streams WAL directly to every replica, creating an O(N) network bottleneck. In the planned cascading topology (right), intermediate replicas relay WAL to downstream nodes, reducing the primary's fan-out to O(2) and enabling 100+ replicas. The trade-off is added latency hops and more complex failover if an intermediate node fails." width="800" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Star vs. cascading replication topologies for WAL fan-out. In the current star topology (left), the primary streams WAL directly to every replica, creating an O(N) network bottleneck. In the planned cascading topology (right), intermediate replicas relay WAL to downstream nodes, reducing the primary’s fan-out to O(2) and enabling 100+ replicas. The trade-off is added latency hops and more complex failover if an intermediate node fails.&lt;/p&gt;

&lt;p&gt;The most interesting forward-looking challenge they discuss is WAL fan-out. As mentioned earlier, the primary streams Write-Ahead Log data to every single replica. At ~50 replicas, this works with large instance types and high-bandwidth networking, but it doesn’t scale indefinitely. Each additional replica adds more network and CPU pressure on the primary.&lt;/p&gt;

&lt;p&gt;Their planned solution is &lt;strong&gt;cascading replication&lt;/strong&gt;: intermediate replicas receive WAL from the primary and relay it to downstream replicas, forming a tree structure instead of a star topology. This would allow them to scale to potentially over 100 replicas without the primary becoming a network bottleneck.&lt;/p&gt;

&lt;p&gt;There’s a trade-off here though. A star topology (primary → all replicas) is simple and has low latency, but the fan-out creates a bottleneck at the center. A tree topology distributes the fan-out load but adds latency hops and complicates failover. If an intermediate replica goes down, its downstream replicas lose their WAL source. OpenAI notes they’re still testing this with the Azure PostgreSQL team and won’t deploy it until failover is reliable.&lt;/p&gt;
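&lt;p&gt;The trade-off is easy to quantify with a back-of-the-envelope sketch (my numbers, not OpenAI’s):&lt;/p&gt;

```python
import math

# Primary's outbound WAL streams under each topology (toy model).
def star_fanout(replicas):
    return replicas                 # primary feeds every replica directly

def cascade_fanout(replicas, relays=2):
    return min(relays, replicas)    # primary feeds only the intermediate relays

# Replication hops a leaf replica sits behind in a cascading tree with the
# given branching factor, vs. exactly 1 hop in the star topology.
def tree_depth(replicas, branching=2):
    return max(1, math.ceil(math.log(replicas, branching)))

assert star_fanout(50) == 50        # O(N) streams out of the primary today
assert cascade_fanout(120) == 2     # constant fan-out, room for 100+ replicas
assert tree_depth(50) == 6          # ...paid for with extra latency hops
```

&lt;p&gt;Constant fan-out at the primary, but each hop down the tree adds replication delay and another node whose failure must be handled.&lt;/p&gt;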

&lt;h2&gt;
  
  
  The Results
&lt;/h2&gt;

&lt;p&gt;To put some concrete numbers on the outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Millions of QPS&lt;/strong&gt; across the cluster (combined reads and writes)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;~50 replicas&lt;/strong&gt; with near-zero replication lag&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Low double-digit millisecond&lt;/strong&gt; p99 client-side latency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Five-nines availability&lt;/strong&gt; (99.999%)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;One Sev0 in 12 months&lt;/strong&gt;, which occurred during the viral launch of ChatGPT ImageGen when write traffic surged by more than 10x as over 100 million new users signed up within a week&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last data point is revealing. Even with all of these optimizations, a 10x write spike during a viral launch still overwhelmed the system. Writes remain the weak point, and OpenAI knows it.&lt;/p&gt;

&lt;h2&gt;
  
  
  All in All
&lt;/h2&gt;

&lt;p&gt;A few things stand out to me from this post.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Boring technology works.&lt;/strong&gt; PostgreSQL is 30+ years old. PgBouncer has been around since 2007. Read replicas are as old as relational databases themselves. None of this is new, and that’s the point. OpenAI isn’t succeeding because they have exotic infrastructure. They’re succeeding because they execute the fundamentals with discipline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access patterns determine architecture.&lt;/strong&gt; The entire architecture is predicated on the workload being read-heavy. If ChatGPT’s workload were write-heavy, this whole approach would fall apart. Understanding your actual read/write ratio, your query distribution, and your hot paths isn’t just an optimization exercise. It determines your entire system design.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational discipline matters more than architectural cleverness.&lt;/strong&gt; The post reads like a list of things they &lt;em&gt;stopped doing&lt;/em&gt;: no new tables on PostgreSQL, no complex joins, no unthrottled backfills, no write-heavy workloads without migration plans. Scaling is as much about what you refuse to let into the system as it is about what you build.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sharding is a last resort, not a first instinct.&lt;/strong&gt; The distributed systems community sometimes treats sharding as the default answer to scaling problems. OpenAI’s experience shows that for many workloads, the complexity cost of sharding exceeds the benefit, and that you can go remarkably far with a single primary, read replicas, and careful write management.&lt;/p&gt;

&lt;p&gt;In closing, I think the most valuable lesson here isn’t any single optimization technique. It’s the reminder that understanding your workload deeply, and making deliberate trade-offs based on that understanding, will take you further than reaching for complex distributed systems before you need them.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Have you dealt with similar scaling decisions at work? Did you shard too early, or wish you had sooner? I’d love to hear about it in the comments.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you enjoyed this post, subscribe to Professional Imposter Syndrome for more breakdowns of real-world systems engineering.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Source: &lt;a href="https://openai.com/index/scaling-postgresql/" rel="noopener noreferrer"&gt;Scaling PostgreSQL to Power 800 Million ChatGPT Users&lt;/a&gt; — OpenAI Engineering Blog&lt;/em&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>database</category>
      <category>postgres</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Beyond "Pick Two": Real-World Trade Offs</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Tue, 07 Apr 2026 05:00:00 +0000</pubDate>
      <link>https://dev.to/amalik18/beyond-pick-two-real-world-trade-offs-3o2p</link>
      <guid>https://dev.to/amalik18/beyond-pick-two-real-world-trade-offs-3o2p</guid>
      <description>&lt;p&gt;This is Part 4 of a 4-part series on the CAP Theorem and distributed systems trade-offs. Read &lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-explained?r=1tezps" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, &lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-cp-systems?r=1tezps" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;, or &lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-ap-systems?r=1tezps" rel="noopener noreferrer"&gt;Part 3&lt;/a&gt; if you haven’t already.&lt;/p&gt;

&lt;p&gt;We’ve dedicated three posts to treating CP and AP as binary choices. etcd is CP. DynamoDB is AP. Pick a side.&lt;/p&gt;

&lt;p&gt;But that’s not how real systems work.&lt;/p&gt;

&lt;p&gt;MongoDB can be CP or AP depending on your configuration. Cassandra lets you choose &lt;em&gt;per query&lt;/em&gt;. Even PostgreSQL, which we covered as a CP system in Part 2, flips to AP behavior with async replication. The line between strong consistency and high availability isn’t a binary switch; it’s a &lt;em&gt;spectrum&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And then there are the systems that seem to ignore CAP entirely. Google Spanner claims to be both consistent &lt;em&gt;and&lt;/em&gt; available. CockroachDB markets itself as “ACID at scale.” What’s going on?&lt;/p&gt;

&lt;p&gt;The answer comes down to understanding what CAP actually constrains, what happens when networks &lt;em&gt;don’t&lt;/em&gt; partition, and how some very smart engineers found ways to work around the edges.&lt;/p&gt;

&lt;p&gt;FINAL PART OF THE SERIES. SUBSCRIBE TO SEE HOW IT ALL COMES TOGETHER.&lt;/p&gt;




&lt;h1&gt;
  
  
  PACELC: The Other Half of the Story
&lt;/h1&gt;

&lt;p&gt;The CAP Theorem is narrowly scoped. It only tells you what happens &lt;em&gt;during&lt;/em&gt; a network partition. But what about all of the other times? The times the network is working just fine? The other 99% of the time?&lt;/p&gt;

&lt;p&gt;That’s where PACELC comes in (pronounced “pass-elk,” though I refuse to pronounce it any way other than “pace-l-c”). Introduced by Daniel Abadi in 2010, it extends the CAP Theorem by also telling you what to choose when the network is fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PACELC&lt;/strong&gt; stands for Partition, Availability, Consistency, Else, Latency, and Consistency.&lt;/p&gt;

&lt;p&gt;The way it’s read is: “&lt;strong&gt;If&lt;/strong&gt; there’s a network &lt;strong&gt;(P)&lt;/strong&gt;artition, choose between &lt;strong&gt;(A)&lt;/strong&gt;vailability and &lt;strong&gt;(C)&lt;/strong&gt;onsistency. &lt;strong&gt;(E)&lt;/strong&gt;lse, choose between &lt;strong&gt;(L)&lt;/strong&gt;atency and &lt;strong&gt;(C)&lt;/strong&gt;onsistency.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m3prsmr2vwgbcuyuj6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m3prsmr2vwgbcuyuj6k.png" width="800" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That “Else” is doing a &lt;em&gt;lot&lt;/em&gt; of heavy lifting. Even when your network is perfectly healthy, distributed systems still face a fundamental trade-off: do you want fast responses or strong consistency?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Latency vs Consistency Trade-off
&lt;/h2&gt;

&lt;p&gt;Let’s think about it for a second. To guarantee strong consistency, we showed that a system needs to coordinate between multiple nodes, and we established that coordination takes time (a cross-region round trip is about 50-200 ms). That’s the price of every coordinated read/write.&lt;/p&gt;

&lt;p&gt;You want the most up-to-date value? You need to check with multiple replicas or go to the leader. Those are extra round trips. You want an instant response instead? You can read from the nearest replica, but also come to terms with the fact that it could be slightly behind.&lt;/p&gt;

&lt;p&gt;This trade-off exists even in perfect networks. It takes time to send data and coordination isn’t free.&lt;/p&gt;
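&lt;p&gt;A quick numerical sketch makes the trade-off obvious (the RTTs are made up, but in the realistic 50-200 ms cross-region range):&lt;/p&gt;

```python
# Toy latency model: a quorum read must wait for the slowest of the fastest
# majority of replicas; a nearest read pays only one (possibly stale) hop.
rtts_ms = {"replica-local": 2, "replica-east": 80, "replica-eu": 150}

def nearest_read():
    return min(rtts_ms.values())                   # fast, maybe stale

def quorum_read():
    majority = len(rtts_ms) // 2 + 1               # 2 of 3 here
    return sorted(rtts_ms.values())[majority - 1]  # wait for the 2nd-fastest ack

assert nearest_read() == 2    # latency-optimized (the L in PACELC's E+L)
assert quorum_read() == 80    # consistency costs coordination even on a healthy network
```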

&lt;h2&gt;
  
  
  PACELC in Practice
&lt;/h2&gt;

&lt;p&gt;Let’s map the systems we’ve covered to both sides of PACELC:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DynamoDB:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;P+A:&lt;/strong&gt; During a network partition, choose availability via &lt;em&gt;eventually consistent&lt;/em&gt; reads, which stay available because any replica can serve them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;E+L:&lt;/strong&gt; When the network is healthy, choose latency. The default eventually consistent reads optimize for latency by hitting the nearest replica. Flip to strongly consistent reads and you’re trading latency for consistency, even with no partition in sight.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
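&lt;p&gt;To make the two read modes concrete, here’s a toy model in plain Python. This is &lt;em&gt;not&lt;/em&gt; the DynamoDB client API, just the idea behind its &lt;code&gt;ConsistentRead&lt;/code&gt; flag: eventual reads take whatever the local replica happens to have, strong reads pay the trip to the authoritative copy.&lt;/p&gt;

```python
# Toy model of DynamoDB's two read modes (not the real client API):
# eventually consistent reads hit any replica, strongly consistent
# reads go through the authoritative copy.

class Table:
    def __init__(self):
        self.leader = {}   # authoritative copy
        self.replica = {}  # lags behind until replication catches up

    def put(self, key, value):
        self.leader[key] = value  # replica is stale until replicate()

    def replicate(self):
        self.replica.update(self.leader)

    def get(self, key, consistent_read=False):
        # Mirrors the idea of the ConsistentRead flag: strong reads pay
        # the trip to the leader, eventual reads take whatever is local.
        store = self.leader if consistent_read else self.replica
        return store.get(key)

t = Table()
t.put("user#1", "v2")
print(t.get("user#1"))                        # None: replica hasn't caught up
print(t.get("user#1", consistent_read=True))  # "v2": always current
```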

&lt;p&gt;&lt;strong&gt;Cassandra:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;P+A:&lt;/strong&gt; During a network partition, all nodes remain available for reads and writes. Remember, Cassandra is leaderless, so there’s no leader to lose.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;E+L or E+C:&lt;/strong&gt; During a normal working network, you get to choose what you want: latency or consistency. Remember that Cassandra allows you to choose per query. &lt;code&gt;ONE&lt;/code&gt; optimizes for latency (E+L). &lt;code&gt;QUORUM&lt;/code&gt; optimizes for consistency (E+C). Same cluster, same data, different trade-offs.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
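&lt;p&gt;Here’s that per-query dial as a toy sketch (plain Python, not the actual driver; replica contents and timestamps are made up). &lt;code&gt;ONE&lt;/code&gt; takes the first replica’s answer; &lt;code&gt;QUORUM&lt;/code&gt; collects a majority of replies and keeps the newest write.&lt;/p&gt;

```python
# Toy sketch of Cassandra-style per-query consistency (not the driver
# API). Replica contents and timestamps are made up.

replicas = [
    {"value": "old", "ts": 1},  # this replica missed the latest write
    {"value": "new", "ts": 2},
    {"value": "new", "ts": 2},
]

def read(level):
    if level == "ONE":
        return replicas[0]["value"]      # one hop: fast, possibly stale
    if level == "QUORUM":
        majority = len(replicas) // 2 + 1
        answers = replicas[:majority]    # wait for a majority of replies
        # Keep the value with the newest timestamp among the answers.
        return max(answers, key=lambda r: r["ts"])["value"]

print(read("ONE"))     # "old": E+L, latency wins
print(read("QUORUM"))  # "new": E+C, consistency wins
```

&lt;p&gt;Same cluster, same data, different trade-offs, exactly as described above.&lt;/p&gt;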

&lt;p&gt;&lt;strong&gt;MongoDB:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;P+A/C:&lt;/strong&gt; A little different from the others, MongoDB switches between availability and consistency depending on the operation. During partitions, the minority side remains available to serve reads from the secondaries (P+A). But writes still require the primary, which lives on the majority side (P+C). So reads lean towards P+A and writes fall to P+C.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;E+L/C:&lt;/strong&gt; During normal operation, MongoDB defaults to reading from the primary, which means you’re getting strongly consistent reads. You can set &lt;code&gt;readPreference&lt;/code&gt; to &lt;code&gt;nearest&lt;/code&gt;, which gives you the fastest response. There’s also &lt;code&gt;readConcern&lt;/code&gt;, which defaults to &lt;code&gt;local&lt;/code&gt;: return the most recent data the node has, regardless of whether it’s been replicated.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
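&lt;p&gt;A toy sketch of that routing decision (made-up latencies, not the real driver API): &lt;code&gt;primary&lt;/code&gt; always answers with current data, &lt;code&gt;nearest&lt;/code&gt; answers fastest but may lag.&lt;/p&gt;

```python
# Toy routing sketch of the readPreference idea (not the real driver
# API). Latencies and data are made up for illustration.

nodes = {
    "primary":    {"latency_ms": 120, "data": {"x": 2}},  # current
    "secondary1": {"latency_ms": 5,   "data": {"x": 1}},  # lagging
    "secondary2": {"latency_ms": 40,  "data": {"x": 2}},
}

def route(read_preference):
    if read_preference == "primary":
        return nodes["primary"]           # strongly consistent, slower
    if read_preference == "nearest":
        # Lowest latency wins, freshness be damned.
        return min(nodes.values(), key=lambda n: n["latency_ms"])

print(route("primary")["data"]["x"])  # 2: consistent, 120 ms away
print(route("nearest")["data"]["x"])  # 1: stale, but only 5 ms away
```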

&lt;p&gt;&lt;strong&gt;Traditional RDBMS (PostgreSQL with sync replication):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;P+C:&lt;/strong&gt; During network partitions, if the primary can’t reach its synchronous replica, writes block. They don’t immediately fail and they don’t get dropped… they just hang. The system would rather stall than be inconsistent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;E+C:&lt;/strong&gt; During normal operation, every write still waits for the synchronous replica(s) to acknowledge before confirming. That’s added latency for &lt;em&gt;every single write&lt;/em&gt;, irrespective of partitions. Postgres pays the consistency tax every single time, not just during partitions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
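&lt;p&gt;That tax is easy to model. A back-of-the-envelope sketch with made-up latencies (an illustration of the idea, not Postgres internals): with synchronous replication on, every commit’s latency includes the standby’s round trip.&lt;/p&gt;

```python
# Toy model of synchronous replication's "consistency tax". Numbers
# are made up for illustration, not PostgreSQL internals.

LOCAL_COMMIT_MS = 1
STANDBY_RTT_MS = 60  # cross-region round trip to the sync standby

def commit(synchronous=True):
    """Return the latency a client observes for one write."""
    if synchronous:
        # Flush locally, then wait for the standby's acknowledgement.
        return LOCAL_COMMIT_MS + STANDBY_RTT_MS
    return LOCAL_COMMIT_MS  # async replication: ack immediately

print(commit(synchronous=True))   # 61 ms, on every single write
print(commit(synchronous=False))  # 1 ms, but the standby may lag
```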

&lt;p&gt;Notice a pattern? PACELC shows why “eventually consistent” systems like DynamoDB feel fast and reliable in normal operation, they’re optimizing for the common case (no partition, low latency) while accepting trade-offs during the rare case (partition, stale reads). Meanwhile, CP systems pay the latency tax &lt;em&gt;all the time&lt;/em&gt;, partition or not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkogwyzyclhf209jwdpzd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkogwyzyclhf209jwdpzd.png" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is why your strongly consistent database feels slower even when nothing is broken. It’s not a bug. It’s the ELC trade-off in action.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Consistency Spectrum
&lt;/h1&gt;

&lt;p&gt;CP vs AP suggests two positions on the consistency scale: strong or eventual. Reality shows us that there is way more to this spectrum.&lt;/p&gt;

&lt;p&gt;We covered some of these in Part 3: read-your-writes, monotonic reads, causal consistency. But here’s the full picture as a spectrum, from strongest to weakest:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Linearizability:&lt;/strong&gt; The gold standard. Every operation appears to take effect instantly at a single point in time. Every read returns the most recent write. We established before that this is what CAP calls “consistency.” It is also the &lt;em&gt;most expensive&lt;/em&gt; form of consistency. It requires coordination on every operation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sequential Consistency:&lt;/strong&gt; A little more lax. All operations still appear in &lt;em&gt;some&lt;/em&gt; total order, but that order may or may not line up with wall-clock time. For example, if User A writes at 12:00:01 and User B writes at 12:00:02, the system might order User A first or second. It really doesn’t matter; what matters is that &lt;em&gt;everyone sees the same order&lt;/em&gt;. Imagine it as a globally agreed-upon order.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Causal Consistency:&lt;/strong&gt; Even more lax than above. Only cares about cause and effect. If two operations are independent of each other, it doesn’t matter what order they’re seen in. However, if one operation depends on another (B on A), everyone &lt;em&gt;must&lt;/em&gt; see A before B.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Read Your Writes:&lt;/strong&gt; Scoped to a single client. If Client A writes something, its subsequent reads are &lt;em&gt;guaranteed&lt;/em&gt; to reflect that write. Other clients may or may not see it yet. This is the bare minimum for user experience: “I just posted a comment, why can’t I see it?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monotonic Reads:&lt;/strong&gt; As the name suggests, time only moves in one direction, forward. Once you’ve seen a value, you are &lt;em&gt;guaranteed&lt;/em&gt; to not see an older version of that value. If this wasn’t guaranteed you could refresh your page and see different data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Eventual Consistency:&lt;/strong&gt; The weakest model on this list. Basically, all replicas will &lt;em&gt;eventually&lt;/em&gt; converge to the same value. Eventually. Nobody knows how long that eventually is. It’s not bounded by anything.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
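&lt;p&gt;The session-level models in the middle (read-your-writes, monotonic reads) can be enforced client-side with nothing fancier than a version token. Here’s a minimal sketch, assuming each replica can report the version it has applied; names and data shapes are illustrative, not any real client library.&lt;/p&gt;

```python
# Minimal session-token sketch for read-your-writes and monotonic
# reads. Assumes replicas report (version, value); purely illustrative.

class Session:
    """Tracks the highest version this client has observed or produced."""

    def __init__(self):
        self.last_seen = 0

    def observe_write(self, version):
        # After our own write, remember its version so later reads
        # must reflect it: read-your-writes.
        self.last_seen = max(self.last_seen, version)

    def read(self, replicas):
        # Accept only replicas at least as fresh as anything we've
        # seen: monotonic reads (time never moves backwards for us).
        for version, value in replicas:
            if version >= self.last_seen:
                self.last_seen = version
                return value
        raise RuntimeError("no replica fresh enough; retry or go to leader")

s = Session()
s.observe_write(3)                             # we just published post v3
print(s.read([(2, "stale"), (3, "post-v3")]))  # skips v2: "post-v3"
print(s.read([(4, "newer")]))                  # "newer", token moves to 4
```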

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sg96xsolzhz15incqqd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sg96xsolzhz15incqqd.png" width="800" height="2170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a useful mental model to have, not a perfect taxonomy. Some of these consistency models apply at the global level and others at the session level. But the information is useful to illustrate that there is plenty of room between &lt;em&gt;linearizability&lt;/em&gt; and &lt;em&gt;eventual consistency&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Middle Matters
&lt;/h2&gt;

&lt;p&gt;Most applications don’t need pure linearizability everywhere and many can’t tolerate bare eventual consistency everywhere either. So what do we do? The answer is…a mix of both.&lt;/p&gt;

&lt;p&gt;Let’s do a thought experiment. Let’s assume we have a social media feed. We don’t really care about having likes and comments ordered perfectly, right? But there are areas where we need guarantees stronger than eventual consistency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;When you publish a post, you should be able to see it immediately (read-your-writes).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once you’ve seen a comment, it shouldn’t disappear just because the backing store is stale (monotonic reads).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A reply should appear &lt;em&gt;after&lt;/em&gt; the post it’s responding to (causal consistency).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those guarantees are definitely weaker than linearizability, but they map much better to the user experience real applications need. Best part? All three can coexist with high availability.&lt;/p&gt;

&lt;p&gt;This is the insight that drives modern database design. Instead of forcing you into “consistent” or “available,” systems give you a dial. Cassandra’s tunable consistency. DynamoDB’s per-request read mode. MongoDB’s read concern and write concern settings. They’re all implementations of this spectrum.&lt;/p&gt;

&lt;p&gt;The question isn’t “CP or AP?” It’s “how much consistency do I actually need for &lt;em&gt;this specific operation&lt;/em&gt;?”&lt;/p&gt;




&lt;h1&gt;
  
  
  Systems That Claim to Beat CAP
&lt;/h1&gt;

&lt;p&gt;Some distributed systems seem to ignore CAP entirely. Strong consistency &lt;em&gt;and&lt;/em&gt; high availability. Globally distributed &lt;em&gt;and&lt;/em&gt; fast. What’s the catch?&lt;/p&gt;

&lt;p&gt;There’s always a catch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Google Spanner: Throw Hardware at the Problem
&lt;/h2&gt;

&lt;p&gt;Google Spanner, a globally distributed SQL database advertised as “always-on,” is always brought up when discussing the CAP theorem. Spanner provides globally distributed, strongly consistent ACID transactions. Sounds too good to be true, no?&lt;/p&gt;

&lt;p&gt;Well, it is. Spanner relies on a global notion of time (its secret weapon) to order events. TrueTime is a clock API with bounded uncertainty, powered by GPS receivers and atomic clocks. Yes, you read that right, &lt;em&gt;actual GPS receivers&lt;/em&gt; and atomic clocks are physically bolted to Google server racks. By knowing that an event happened between time T and T+ε (where ε is typically less than 10ms), Spanner can order events globally without the extensive coordination that other CP systems require.&lt;/p&gt;

&lt;p&gt;That solves our latency issue during regular operations. But what about when there’s a network partition?&lt;/p&gt;

&lt;p&gt;During a network partition, it chooses consistency over availability. &lt;strong&gt;Spanner is still CP&lt;/strong&gt;. Full stop. The reason it &lt;em&gt;appears&lt;/em&gt; to also be highly available is that Google’s network almost never partitions, because Google owns the fiber, the switches, the data centers, and the custom hardware. Spanner’s availability story isn’t an architectural breakthrough. It’s an infrastructure advantage that many companies simply can’t replicate.&lt;/p&gt;

&lt;p&gt;Eric Brewer (the guy who coined the CAP Theorem) admits that Spanner is technically CP; it just happens to operate in an environment where the P is heavily minimized.&lt;/p&gt;

&lt;p&gt;If you’re not Google, you don’t have TrueTime. And if your network actually partitions, Spanner makes the same choice as etcd: consistency wins, availability loses.&lt;/p&gt;

&lt;h2&gt;
  
  
  CockroachDB: Make the Pain Invisible
&lt;/h2&gt;

&lt;p&gt;CockroachDB takes a completely different approach. Instead of trying to eliminate or minimize partitions like Google does, it aims to &lt;em&gt;hide&lt;/em&gt; partitions from the user.&lt;/p&gt;

&lt;p&gt;Under the hood, CockroachDB uses Raft consensus, the same protocol we covered in Part 2 with etcd, for replication. But where etcd uses a &lt;em&gt;single Raft group&lt;/em&gt; for the whole cluster, CockroachDB splits the data into &lt;em&gt;ranges&lt;/em&gt;, each running its own independent Raft group. Each range replicates across multiple nodes and has its own leader.&lt;/p&gt;

&lt;p&gt;So how does it stay “available” if it’s running consensus?&lt;/p&gt;

&lt;p&gt;The trick here is scope. During a network partition, CockroachDB &lt;em&gt;does not&lt;/em&gt; go down in its entirety, only the specific ranges that lost their majority do. Let’s say that again: during a network partition, only ranges that lose their majority go down. For example, if a partition cuts off two nodes in a five-node cluster, only the ranges with a majority of their replicas on those two nodes are affected. Everything else keeps working as if nothing happened.&lt;/p&gt;

&lt;p&gt;It’s technically CP behavior. But by distributing ranges across many nodes and rebalancing automatically, the &lt;em&gt;practical&lt;/em&gt; impact of any single partition is small. Most users won’t notice.&lt;/p&gt;

&lt;p&gt;CockroachDB didn’t hack the CAP Theorem, it just made the decision &lt;em&gt;extremely granular&lt;/em&gt;. You’re still compromising availability during partitions, just for a &lt;em&gt;small subset of the data.&lt;/em&gt;&lt;/p&gt;
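&lt;p&gt;The range-scoped quorum idea fits in a few lines of toy Python (an illustration of the concept, not CockroachDB internals): a range keeps serving only if a majority of its replicas sit on the reachable side of the partition.&lt;/p&gt;

```python
# Toy illustration of range-scoped quorum (not CockroachDB internals):
# a range stays up only if a majority of its replicas are reachable.

def available_ranges(ranges, reachable_nodes):
    up = []
    for name, replica_nodes in ranges.items():
        alive = sum(1 for n in replica_nodes if n in reachable_nodes)
        if alive > len(replica_nodes) // 2:  # Raft majority, per range
            up.append(name)
    return up

ranges = {
    "r1": ["n1", "n2", "n3"],
    "r2": ["n3", "n4", "n5"],
    "r3": ["n1", "n4", "n5"],
}

# A partition cuts off n4 and n5: only ranges with a majority of
# replicas on {n1, n2, n3} keep serving.
print(available_ranges(ranges, {"n1", "n2", "n3"}))  # ['r1']
```

&lt;p&gt;The cluster as a whole never “goes down”; the blast radius is whichever ranges happened to lose their majority.&lt;/p&gt;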

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbr1vxditb8yrhavds0a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbr1vxditb8yrhavds0a.png" width="800" height="1218"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern
&lt;/h2&gt;

&lt;p&gt;As we’ve shown, neither of these systems actually &lt;em&gt;violate&lt;/em&gt; the CAP Theorem. They just engineered around it from opposite directions. Google’s Spanner reduces the &lt;em&gt;probability&lt;/em&gt; of partitions by owning the entirety of the infrastructure. CockroachDB reduces the &lt;em&gt;impact&lt;/em&gt; of partitions by splitting the data into numerous independent ranges.&lt;/p&gt;

&lt;p&gt;The constraints are still there. Networks are still bound to fail and when that inevitably happens, both systems default to choosing consistency over availability. The difference, however, is that both systems have found ways to make the consistency vs. availability choice less painful. Spanner by making partitions almost nonexistent. CockroachDB by making them almost invisible.&lt;/p&gt;

&lt;p&gt;The lesson? CAP describes fundamental constraints. Creative engineering minimizes their impact. But the constraints are still there, waiting for the network to have a bad day.&lt;/p&gt;




&lt;h1&gt;
  
  
  How to Actually Choose
&lt;/h1&gt;

&lt;p&gt;Alrighty, we’re on our fourth post. Four posts of theory, papers, and systems deep-dives. Let’s get a bit practical. How do you &lt;em&gt;actually&lt;/em&gt; decide between consistency and availability for &lt;strong&gt;your&lt;/strong&gt; system?&lt;/p&gt;

&lt;h2&gt;
  
  
  Start with Three Questions
&lt;/h2&gt;

&lt;p&gt;Step 0, before even picking technologies, assess the system you’re building. Answer these questions honestly.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. What happens if users see stale data?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Obviously stale data is not ideal. But this question is really about the &lt;em&gt;user experience&lt;/em&gt; when stale data is presented. For a bank, a stale balance has real financial implications for the user. For a social media app, a stale like counter doesn’t really have that big of an effect. An example to think about: DNS has tolerated stale records for &lt;em&gt;decades&lt;/em&gt;. Decades. That’s not an argument for picking a certain path, but it drives home the point that not everything needs to be strongly consistent.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. What happens if the system is unavailable?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We would all love 100% system uptime. AWS spends billions of dollars to achieve four nines of availability (still not 100%). Where stale data affects user experience, system availability is usually tied to business performance.&lt;/p&gt;

&lt;p&gt;A payment processor being down is lost revenue. An analytics dashboard being down leads to delayed decisions. A configuration service being down can lead to failures across every component that config controls (which might be worse than stale data). The answer isn’t always obvious, but you need to think through what happens when your system is down.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. How much consistency do you actually need?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s re-read that question: how much consistency do you actually &lt;strong&gt;need&lt;/strong&gt;? Notice &lt;em&gt;need&lt;/em&gt;, not &lt;em&gt;want&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Bank balances need linearizability, there’s no way around it. Shopping carts can get away with something weaker (causal consistency). Social features just need read-your-writes level of consistency and they’re good. You need to be very honest here. Every step up the consistency ladder is a step down the latency or throughput ladder.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh3heybq3y4u86t9zob7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh3heybq3y4u86t9zob7.png" width="800" height="999"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Approach That Actually Works: Hybrid
&lt;/h2&gt;

&lt;p&gt;What I’ve learned from working on, operating, and researching distributed systems is that almost nobody should go pure CP or pure AP. The right answer is almost always somewhere in the middle. Different data has different requirements and it should be treated that way.&lt;/p&gt;

&lt;p&gt;Here’s what that looks like for an e-commerce system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User sessions → AP.&lt;/strong&gt; Throw this in DynamoDB with eventually consistent reads. Fast logins, and who cares if a session is slightly stale? If it breaks, the user logs in again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product catalog → AP.&lt;/strong&gt; Fast browsing matters more than showing the absolute latest product description. Cache aggressively. Invalidate when you can.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inventory → Tunable.&lt;/strong&gt; This is the interesting one. You can be optimistic (AP) when browsing — show approximate stock levels. But when the user actually clicks “Buy,” switch to a strongly consistent check before charging their card. Cassandra with &lt;code&gt;ONE&lt;/code&gt; for reads and &lt;code&gt;QUORUM&lt;/code&gt; for the purchase path, or DynamoDB with eventually consistent reads for browsing and strongly consistent reads at checkout.&lt;/p&gt;
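&lt;p&gt;Here’s the browse-fast, check-out-strong pattern as a toy sketch (illustrative names and data, not a real client): browsing reads a possibly stale cache, checkout verifies against the authoritative count before charging.&lt;/p&gt;

```python
# Toy sketch of the browse-fast / check-out-strong pattern.
# Names and stock numbers are illustrative, not a real client.

class Inventory:
    def __init__(self, stock):
        self.true_stock = dict(stock)    # authoritative count
        self.cached_stock = dict(stock)  # replicated copy, may lag

    def browse(self, item):
        # Optimistic (AP-style): fast, approximate stock level.
        return self.cached_stock.get(item, 0)

    def checkout(self, item):
        # Strongly consistent check right before charging the card.
        if self.true_stock.get(item, 0) > 0:
            self.true_stock[item] -= 1
            return "charged"
        return "out of stock"

inv = Inventory({"widget": 1})
inv.cached_stock["widget"] = 3   # the cache lags behind reality
print(inv.browse("widget"))      # 3: optimistic, possibly stale
print(inv.checkout("widget"))    # "charged": verified against truth
print(inv.checkout("widget"))    # "out of stock": no overselling
```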

&lt;p&gt;&lt;strong&gt;Orders and payments → CP.&lt;/strong&gt; No debate. Money is involved. PostgreSQL with synchronous replication, or a distributed SQL database like CockroachDB. If the system can’t confirm consistency, it should refuse the write.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reviews and social features → AP.&lt;/strong&gt; Eventual consistency is totally fine here. A review showing up 2 seconds late doesn’t hurt anyone.&lt;/p&gt;

&lt;p&gt;The pattern: &lt;strong&gt;use the weakest consistency model that still meets your requirements for each piece of data.&lt;/strong&gt; Stronger consistency isn’t free, it costs latency, throughput, and operational complexity. Don’t pay for it where you don’t need it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72nsbpwa1r46ag5flq5u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F72nsbpwa1r46ag5flq5u.png" width="800" height="1716"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  What CAP Actually Taught Us
&lt;/h1&gt;

&lt;p&gt;Four posts. Thousands of words. Multiple papers. Here’s what it all boils down to:&lt;/p&gt;

&lt;h2&gt;
  
  
  CAP Is a Design Tool, Not a Law
&lt;/h2&gt;

&lt;p&gt;CAP theorem doesn’t tell you what to build. It tells you what questions to ask. What should your system do when the network partitions? When nodes fail? When clocks skew? If you haven’t explicitly decided, your system will decide for you. If you’re unlucky, that looks like 3AM debugging sessions on the weekend.&lt;/p&gt;

&lt;h2&gt;
  
  
  Nothing is Free
&lt;/h2&gt;

&lt;p&gt;CAP’s lasting contribution is killing the fantasy that distributed systems can provide everything for free. Every guarantee has a cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Strong consistency costs latency and availability.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;High availability costs consistency guarantees.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Partition tolerance isn’t optional, networks fail.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these costs turns you from someone who hopes their system handles edge cases into someone who &lt;em&gt;knows&lt;/em&gt; what their system will do when things go wrong.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Spectrum Is Your Friend
&lt;/h2&gt;

&lt;p&gt;The real world isn’t CP or AP. It’s a spectrum, and modern systems give you a dial. Cassandra’s tunable consistency, DynamoDB’s read modes, MongoDB’s read/write concerns, all of these exist because the engineers who built them understood that different data deserves different trade-offs.&lt;/p&gt;

&lt;p&gt;The best distributed system for your use case isn’t the most consistent one or the most available one. It’s the one that makes the right trade-offs for your specific requirements. And now you know how to figure out what those trade-offs are.&lt;/p&gt;




&lt;h1&gt;
  
  
  Series Recap
&lt;/h1&gt;

&lt;p&gt;We started this series with a simple question: what does “you can only pick two” actually mean?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-explained?r=1tezps" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;&lt;/strong&gt; busted the myth. CAP isn’t about permanently picking two properties. It’s about what your system does &lt;em&gt;during a network partition&lt;/em&gt; — and that choice is binary: consistency or availability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-cp-systems?r=1tezps" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;&lt;/strong&gt; explored the cost of being right. etcd, Zookeeper, PostgreSQL, and MongoDB showed us what strong consistency demands: quorum writes, leader elections, blocked operations during partitions, and latency on every write even when the network is healthy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/aliqaiser/ap-systems-explained-stale-data-beats-dead-servers-23ib-temp-slug-319092"&gt;Part 3&lt;/a&gt;&lt;/strong&gt; explored the cost of staying up. DynamoDB, Cassandra, and DNS showed us what high availability demands: stale reads, conflict resolution strategies, and the operational complexity of reconciling data after the network heals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 4, this part,&lt;/strong&gt; showed that the binary choice was never the full story. PACELC revealed that you’re making trade-offs even when the network is fine. The consistency spectrum gave you six stops between linearizability and eventual consistency. And the hybrid approach showed that real systems don’t pick one side — they pick different trade-offs for different data.&lt;/p&gt;

&lt;p&gt;Next time someone draws a triangle on a whiteboard and asks you to “pick two,” you’ll know exactly why that’s the wrong question — and what to ask instead.&lt;/p&gt;

&lt;p&gt;THANKS FOR FOLLOWING THE SERIES! SUBSCRIBE FOR MORE DEEP DIVES INTO DISTRIBUTED SYSTEMS.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-beyond-pick-two?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Ali&lt;/p&gt;




&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf" rel="noopener noreferrer"&gt;Consistency Tradeoffs in Modern Distributed Database System Design&lt;/a&gt; — Daniel Abadi (2012), the PACELC paper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://research.google/pubs/pub39966/" rel="noopener noreferrer"&gt;Spanner: Google’s Globally Distributed Database&lt;/a&gt; — Corbett et al. (2012)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://research.google/pubs/pub45855/" rel="noopener noreferrer"&gt;Spanner, TrueTime and the CAP Theorem&lt;/a&gt; — Eric Brewer (2017), where Brewer himself explains why Spanner is technically CP.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cockroachlabs.com/docs/stable/architecture/" rel="noopener noreferrer"&gt;CockroachDB Architecture Documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dataintensive.net/" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications&lt;/a&gt; — Martin Kleppmann&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://jepsen.io/" rel="noopener noreferrer"&gt;Jepsen Analysis&lt;/a&gt; — Real-world consistency testing of distributed systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Series Index
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-explained?r=1tezps" rel="noopener noreferrer"&gt;CAP Theorem: Beyond the “Pick Two” Myth&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-cp-systems?r=1tezps" rel="noopener noreferrer"&gt;CP Systems Explained: The Hidden Cost of Strong Consistency&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/aliqaiser/ap-systems-explained-stale-data-beats-dead-servers-23ib-temp-slug-319092"&gt;AP Systems Explained: Why Stale Data Beats Dead Servers&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Beyond “Pick Two”: Real-World Trade-offs ← You are here&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>distributedsystems</category>
      <category>database</category>
      <category>systemdesign</category>
      <category>softwaredevelopment</category>
    </item>
    <item>
      <title>AP Systems Explained: Stale Data Beats Dead Servers</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Tue, 24 Feb 2026 15:18:13 +0000</pubDate>
      <link>https://dev.to/amalik18/ap-systems-explained-stale-data-beats-dead-servers-4l99</link>
      <guid>https://dev.to/amalik18/ap-systems-explained-stale-data-beats-dead-servers-4l99</guid>
      <description>&lt;p&gt;This is Part 3 of a 4-part series on the CAP Theorem and distributed systems trade-offs. &lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-explained?r=1tezps" rel="noopener noreferrer"&gt;Read Part 1 here&lt;/a&gt; or &lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-cp-systems?r=1tezps" rel="noopener noreferrer"&gt;Part 2 here&lt;/a&gt; if you haven’t already done so.&lt;/p&gt;

&lt;p&gt;In Part 2, we looked at CP systems, datastores that would rather outright refuse requests than give the wrong answer. We covered &lt;em&gt;strong consistency&lt;/em&gt; (otherwise known as &lt;em&gt;linearizability&lt;/em&gt;), quorum-based writes, and leader elections, and how, put together, they’re expensive but right.&lt;/p&gt;

&lt;p&gt;AP Systems make the opposite decision: “I’d rather give you something than nothing.”&lt;/p&gt;

&lt;p&gt;Sounds pretty crazy until you realize that much of our daily internet traffic works this way. From DNS, to the CDNs powering your Netflix habit, to your social media feed and shopping cart, the list keeps going. All of these systems prioritize availability over consistency. And they do just fine, heck, they power our daily habits.&lt;/p&gt;

&lt;p&gt;So what do we do with the inconsistency? We can’t avoid it, right? We’ve got to manage it the best we can.&lt;/p&gt;

&lt;p&gt;Following along? Subscribe to get the rest of the series!&lt;/p&gt;




&lt;h1&gt;
  
  
  The “Always On” Philosophy
&lt;/h1&gt;

&lt;p&gt;During network partitions, an AP system prioritizes &lt;em&gt;availability&lt;/em&gt; over &lt;em&gt;consistency&lt;/em&gt;, the reverse of a CP system. According to the CAP theorem, every non-failing node &lt;strong&gt;must&lt;/strong&gt; respond to requests, even if it can’t guarantee data freshness.&lt;/p&gt;

&lt;p&gt;Okay that’s fine and all but let’s see what that actually looks like.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multiple Nodes Accept Writes
&lt;/h2&gt;

&lt;p&gt;Unlike CP systems, where all writes are funneled through a single leader, AP systems can accept writes on multiple nodes. During a network partition, both sides of the split continue to accept writes independently. This is what makes the system available: there’s no single point of failure for writes.&lt;/p&gt;

&lt;p&gt;Naturally, you’d ask: “what if two clients write different values to the same key?” Both writes succeed, but you now have a conflict that needs to be resolved once the partition heals. We’ll cover conflict resolution strategies a little later in the article.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reads Return Local Values
&lt;/h2&gt;

&lt;p&gt;There is absolutely &lt;em&gt;zero&lt;/em&gt; coordination when it comes to reads. There’s no double-checking with other nodes, no waiting for consensus that this is the latest value. Nodes in an AP system return whatever they have, even if that data is stale.&lt;/p&gt;

&lt;p&gt;This is absolutely okay for social media feeds (nobody cares if a “like” shows up late). But absolutely horrendous for a banking system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Partitioned Nodes Stay Alive
&lt;/h2&gt;

&lt;p&gt;This is one of the main differences between AP and CP systems. Recall, in a CP system, the minority side of a partition stops completely. No reads or writes are served from that side. In an AP system, every node continues to operate. Both sides of the partition are alive and accepting reads/writes, operating as if nothing happened.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21-ZrR%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F1eb2013d-cd7c-41a3-8f76-e1c6152bcedb_4706x2400.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21-ZrR%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F1eb2013d-cd7c-41a3-8f76-e1c6152bcedb_4706x2400.png" title="The fundamental choice during network partitions. CP systems sacrifice availability for consistency (minority nodes stop serving). AP systems sacrifice consistency for availability (all nodes keep serving, creating potential conflicts)." alt="The fundamental choice during network partitions. CP systems sacrifice availability for consistency (minority nodes stop serving). AP systems sacrifice consistency for availability (all nodes keep serving, creating potential conflicts)." width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The fundamental choice during network partitions. CP systems sacrifice availability for consistency (minority nodes stop serving). AP systems sacrifice consistency for availability (all nodes keep serving, creating potential conflicts).&lt;/em&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Cleaning Up the Mess Later
&lt;/h1&gt;

&lt;p&gt;CP systems avoid conflicts by running every write through a leader and requiring a quorum on writes. AP systems say YOLO and let conflicts happen, then deal with them later. Let’s discuss some of these conflict resolution strategies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Last-Write-Wins (LWW)
&lt;/h2&gt;

&lt;p&gt;This is the simplest approach. Each write has an associated timestamp. When two writes conflict, the one with the later timestamp wins. The other is discarded.&lt;/p&gt;

&lt;p&gt;This is awesome for data systems where the most recent value matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;User preferences (theme choice, language, notification settings, etc).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Status / state data (online/offline, connection state, etc).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cache / Metadata (expiration times, modification timestamps, etc).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Social Media updates (profile bio contents, pictures, profile/cover photos).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What’s the problem with this? Timestamps in distributed systems are not reliable. Node A’s clock might be &lt;em&gt;slightly&lt;/em&gt; ahead of Node B’s clock. Two writes that occurred at the “same time” get differing timestamps, and one of them gets tossed. Data can be lost silently, with no error or warning.&lt;/p&gt;
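&lt;p&gt;To make the failure mode concrete, here’s a minimal Python sketch of an LWW merge (illustrative only; &lt;code&gt;VersionedValue&lt;/code&gt; and &lt;code&gt;lww_merge&lt;/code&gt; are made-up names, not any real database’s API):&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class VersionedValue:
    value: str
    timestamp: float  # wall-clock time of the write, in seconds

def lww_merge(a: VersionedValue, b: VersionedValue) -> VersionedValue:
    # Keep the write with the later timestamp; the loser is silently discarded.
    return a if a.timestamp >= b.timestamp else b

# Node A's clock runs 50 ms ahead, so its write wins even though
# both writes happened at essentially the same moment.
write_on_a = VersionedValue("dark-theme", timestamp=1000.050)
write_on_b = VersionedValue("light-theme", timestamp=1000.020)
print(lww_merge(write_on_a, write_on_b).value)
```

&lt;p&gt;Notice that nothing signals the loss: the losing write just never surfaces again.&lt;/p&gt;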

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21aXdZ%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F958da278-5b57-4a52-95f4-32744abfcd0c_2328x1390.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21aXdZ%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F958da278-5b57-4a52-95f4-32744abfcd0c_2328x1390.png" title="Last-Write-Wins conflict resolution appears simple but can silently lose data. Client 1's write succeeds initially but disappears when nodes reconcile, with no error or notification. Clock skew makes this unpredictable." alt="Last-Write-Wins conflict resolution appears simple but can silently lose data. Client 1's write succeeds initially but disappears when nodes reconcile, with no error or notification. Clock skew makes this unpredictable." width="800" height="477"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Last-Write-Wins conflict resolution appears simple but can silently lose data. Client 1’s write succeeds initially but disappears when nodes reconcile, with no error or notification. Clock skew makes this unpredictable.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the examples above, however, this is perfectly fine. Nobody cares about losing an intermediate GPS coordinate or a like on a post; these don’t break anything. For something as simple as LWW, its risk profile on this kind of data is tiny.&lt;/p&gt;

&lt;p&gt;Apache Cassandra relies on LWW for conflict resolution by default. Amazon’s DynamoDB avoids this problem at the single-region level almost entirely by means of leader-based quorum writes. However, Global Tables (cross-region replication) rely on LWW to resolve conflicts between regions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Vector Clocks
&lt;/h2&gt;

&lt;p&gt;A vector clock, as the name suggests, is a clock….sike. It is &lt;strong&gt;NOT&lt;/strong&gt; a clock reading. At all. A vector clock is a data structure that captures the &lt;em&gt;causal ordering of events&lt;/em&gt;. Simply put, it helps you determine whether one write happened before another, or whether they happened concurrently.&lt;/p&gt;

&lt;p&gt;Each node maintains a vector of counters (one per node). The counters represent the node’s knowledge of how many events have occurred at every node in the system. On a local event, a node increments its own counter. When sending a message, the node attaches its vector. When receiving a message, it merges the received vector into its own by taking the element-wise maximum.&lt;/p&gt;
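&lt;p&gt;Those rules (increment locally, attach on send, merge on receive) are small enough to sketch in Python. This is an illustrative toy, not any particular library’s API:&lt;/p&gt;

```python
def vc_increment(vc: dict, node: str) -> dict:
    # Local event: bump this node's own counter.
    new = dict(vc)
    new[node] = new.get(node, 0) + 1
    return new

def vc_merge(a: dict, b: dict) -> dict:
    # On receive: take the element-wise maximum of the two vectors.
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def happened_before(a: dict, b: dict) -> bool:
    # a -> b iff every counter in a is <= its counterpart in b, and they differ.
    keys = set(a) | set(b)
    return all(a.get(n, 0) <= b.get(n, 0) for n in keys) and a != b

def concurrent(a: dict, b: dict) -> bool:
    return not happened_before(a, b) and not happened_before(b, a)

a = {"A": 2, "B": 0}   # two events on node A
b = {"A": 2, "B": 1}   # node B received A's state, then wrote locally
c = {"A": 1, "B": 3}   # neither vector dominates the other
print(happened_before(a, b))  # True: b is causally newer, keep b
print(concurrent(a, c))       # True: a genuine conflict, needs resolution
```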

&lt;p&gt;Let’s imagine a couple of situations. If write A happened before write B, you know B is newer, so you keep B and discard A. If A and B happened concurrently (neither of them caused the other), the conflict can’t resolve itself. This is where conflict resolution techniques come into play.&lt;/p&gt;

&lt;p&gt;In those situations, the system needs to decide whether it will pick one and risk losing data, or keep both and hand off the resolution to the application. An example of the latter is Amazon’s Dynamo: in &lt;a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf" rel="noopener noreferrer"&gt;the original paper&lt;/a&gt;, Dynamo was designed to return multiple versions of the data and let application-level logic merge them. The most famous example is Amazon’s shopping cart: if two conflicting cart states exist, merge them and keep all of the items. It’s to be noted that modern DynamoDB dropped this entirely and opted for leader-based writes to prevent conflicts at the storage layer.&lt;/p&gt;

&lt;p&gt;In practice, vector clocks add a level of complexity that not every team wants to deal with. Not every team wants to write conflict resolution strategies.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conflict-free Replicated Data Types (CRDTs)
&lt;/h2&gt;

&lt;p&gt;CRDTs are data structures designed to make concurrent updates mergeable without conflicts. No coordination, no conflict resolution logic. The math behind CRDTs guarantees eventual convergence.&lt;/p&gt;

&lt;p&gt;I’ll do my best to explain some CRDTs in the following sections.&lt;/p&gt;

&lt;h3&gt;
  
  
  Grow-Only Counter (G-Counter)
&lt;/h3&gt;

&lt;p&gt;The simplest of all CRDTs, a G-Counter is, as its name suggests, a counter that &lt;em&gt;only grows&lt;/em&gt;. It only supports increments.&lt;/p&gt;

&lt;p&gt;Each node maintains its own counter. The total is the sum of all node counters. If two nodes both increment, there’s no problem; we just add them up. This is how &lt;strong&gt;likes&lt;/strong&gt; work in social media.&lt;/p&gt;
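&lt;p&gt;Here’s a minimal G-Counter sketch in Python. The merge rule is just a per-node maximum, which is what makes it conflict-free (illustrative code, not a production CRDT library):&lt;/p&gt;

```python
class GCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self) -> None:
        # A node only ever bumps its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

    def merge(self, other: "GCounter") -> None:
        # Per-node maximum: commutative, associative, idempotent.
        for node, n in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("A"), GCounter("B")
a.increment(); a.increment()  # two likes land on node A during a partition
b.increment()                 # one like lands on node B
a.merge(b); b.merge(a)        # partition heals, nodes exchange state
print(a.value(), b.value())   # both nodes converge; no likes lost
```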

&lt;p&gt;&lt;em&gt;Real World Example:&lt;/em&gt; Social media likes, video view counts, Reddit’s upvoting systems all use G-Counters. If two users like a post at the same time during a network partition, both increments are saved when the partition heals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21LIXo%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F9e514249-c9e6-4b93-addd-fc0ab7a87a7b_4119x1310.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21LIXo%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F9e514249-c9e6-4b93-addd-fc0ab7a87a7b_4119x1310.png" title="G-Counter magic in two steps: (1) Each node only increments its own counter during partition, (2) Merge by taking the maximum of each position. No conflicts possible—the math guarantees all nodes converge to the same total." alt="G-Counter magic in two steps: (1) Each node only increments its own counter during partition, (2) Merge by taking the maximum of each position. No conflicts possible—the math guarantees all nodes converge to the same total." width="800" height="254"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;G-Counter magic in two steps: (1) Each node only increments its own counter during partition, (2) Merge by taking the maximum of each position. No conflicts possible—the math guarantees all nodes converge to the same total.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Observed-Remove Set (OR-Set)
&lt;/h3&gt;

&lt;p&gt;A set that supports both addition and removal concurrently. The way this works is, every add operation gets tagged with a unique identifier (UID). A remove operation only removes UIDs &lt;strong&gt;it has observed&lt;/strong&gt;. What this means is that if a node has not &lt;em&gt;seen&lt;/em&gt; an add, it can’t remove it. In the simplest of terms, if an add and a remove occur on different nodes, for the remove to be successful the node performing it &lt;strong&gt;must&lt;/strong&gt; have synced with the node that performed the add, to &lt;em&gt;see&lt;/em&gt; the addition.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Real-World Example:&lt;/em&gt; We’ll rely on the classic shopping cart example again. Imagine two different users/sessions update the same cart. User A &lt;strong&gt;adds&lt;/strong&gt; a “Laptop” and User B &lt;strong&gt;removes&lt;/strong&gt; a “Laptop”. What do you think would happen? A laptop ends up in the cart. Why? User B can only remove the “versions” of a laptop it saw in its cart. Since the &lt;strong&gt;add&lt;/strong&gt; didn’t get sync’d yet, User B didn’t &lt;em&gt;see&lt;/em&gt; the add, so it can’t remove it.&lt;/p&gt;
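&lt;p&gt;The “remove only what you’ve observed” rule can be sketched like this (a toy OR-Set for illustration; real implementations also garbage-collect tombstones):&lt;/p&gt;

```python
import uuid

class ORSet:
    def __init__(self):
        self.adds: dict[str, set] = {}  # element -> unique tags from add ops
        self.tombstones: set = set()    # tags this replica has observed and removed

    def add(self, element: str) -> None:
        # Every add gets a fresh unique tag.
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element: str) -> None:
        # Only tombstone the tags this replica has actually seen.
        self.tombstones |= self.adds.get(element, set())

    def contains(self, element: str) -> bool:
        return bool(self.adds.get(element, set()) - self.tombstones)

    def merge(self, other: "ORSet") -> None:
        for elem, tags in other.adds.items():
            self.adds.setdefault(elem, set()).update(tags)
        self.tombstones |= other.tombstones

# User A adds a laptop; user B, who never saw that add, tries to remove it.
a, b = ORSet(), ORSet()
a.add("Laptop")
b.remove("Laptop")      # no observed tags on replica b, so this is a no-op
a.merge(b); b.merge(a)
print(a.contains("Laptop"), b.contains("Laptop"))  # the add wins on both replicas
```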

&lt;h3&gt;
  
  
  Last-Write-Wins Register (LWW-Register)
&lt;/h3&gt;

&lt;p&gt;This is basically LWW as a CRDT. This CRDT only stores a single value and resolves concurrent writes by keeping the latest write, by timestamp.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Real-World Example:&lt;/em&gt; The easiest one is GPS tracking apps displaying a “last known location” field. A more concrete example is caches. If two nodes update a cache simultaneously, the entry with the freshest timestamp is preserved. Naturally, for a cache only one value is necessary.&lt;/p&gt;

&lt;p&gt;Okay, that’s enough CRDT explanations. Now, CRDTs sound awesome (they are), but there are constraints. Not every data structure has a CRDT equivalent. They can also grow unboundedly if not properly maintained, taking up &lt;em&gt;a lot&lt;/em&gt; of space. Plus, they don’t work for arbitrary application logic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/subscribe?" rel="noopener noreferrer"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  “Eventually”….maybe?
&lt;/h1&gt;

&lt;p&gt;Everyone has heard the term and it usually gets hand-waved to “all nodes eventually see the same data”. While that is true, it’s a bit incomplete.&lt;/p&gt;

&lt;p&gt;What it actually means is, “if no new writes occur, all replicas will &lt;em&gt;eventually&lt;/em&gt; converge to the same value.” Did you notice it? There’s no limit to what “eventually” could be. It could be milliseconds (ms) or minutes.&lt;/p&gt;

&lt;p&gt;In the real world, most AP systems converge fairly quickly. Amazon’s DynamoDB replication typically completes within a second across all storage locations within a region. DNS propagation can take hours. The variance is huge.&lt;/p&gt;

&lt;p&gt;What matters is whether your application can handle that time, the inconsistency window. Some can, and some absolutely can’t.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stronger? Eventual Consistency
&lt;/h2&gt;

&lt;p&gt;Not all forms of eventual consistency are the same. Some are stronger than others. These stronger models give you more guarantees while still being weaker than linearizability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Read-Your-Writes
&lt;/h3&gt;

&lt;p&gt;If Client A performs a write, future reads from the same client are guaranteed to see that write. Other clients may not see the updated value just yet, but Client A is guaranteed to. This is what users typically expect from most applications (e.g. “I just posted a comment, I should be able to see it”).&lt;/p&gt;
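&lt;p&gt;One common way to implement this is session tracking: the client remembers the version of its last write and only accepts reads from replicas that have caught up. A simplified sketch (the &lt;code&gt;Replica&lt;/code&gt; and &lt;code&gt;Session&lt;/code&gt; classes here are hypothetical, purely for illustration):&lt;/p&gt;

```python
class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, value, version: int) -> None:
        if version > self.version:
            self.value, self.version = value, version

class Session:
    """Tracks the client's last write so its own reads never go back in time."""
    def __init__(self):
        self.last_written = 0

    def write(self, replica: Replica, value, version: int) -> None:
        replica.apply(value, version)
        self.last_written = version

    def read(self, replicas: list):
        # Serve the read from any replica that has seen our own writes.
        for r in replicas:
            if r.version >= self.last_written:
                return r.value
        raise TimeoutError("no replica has caught up yet; retry")

fresh, lagging = Replica(), Replica()
session = Session()
session.write(fresh, "my new comment", version=7)
print(session.read([lagging, fresh]))  # skips the lagging replica
```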

&lt;h3&gt;
  
  
  Monotonic Reads
&lt;/h3&gt;

&lt;p&gt;This consistency model guarantees that once a client has read a value, it will never see an older version of that value on subsequent reads. Time does &lt;strong&gt;not&lt;/strong&gt; go backwards. Without this guarantee, a client could refresh the page and watch new data disappear.&lt;/p&gt;

&lt;h3&gt;
  
  
  Causal Consistency
&lt;/h3&gt;

&lt;p&gt;In this consistency model, if Operation A caused Operation B (A happens before B &lt;strong&gt;and&lt;/strong&gt; B depends on A), then everyone must see Operation A before Operation B. You’ll always see them in the right order. The cause comes before the effect, everywhere (hence the name….&lt;em&gt;causal&lt;/em&gt;). Imagine seeing a seething reply to a comment &lt;em&gt;without&lt;/em&gt; seeing the comment first. That would be a bit confusing as the end user.&lt;/p&gt;

&lt;p&gt;These consistency models are &lt;strong&gt;weaker&lt;/strong&gt; than linearizability, which means they don’t take on the CAP theorem trade-offs we discussed in the &lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-cp-systems?r=1tezps" rel="noopener noreferrer"&gt;CP Systems Overview&lt;/a&gt;. Naturally, this means that you can have these guarantees and still be &lt;em&gt;highly available&lt;/em&gt; during network partitions.&lt;/p&gt;

&lt;p&gt;This very clearly highlights the problem with the “pick-two” framing of the CAP theorem. Most applications don’t need &lt;em&gt;true&lt;/em&gt; linearizability, they’re absolutely okay with high availability coupled with these &lt;em&gt;stronger&lt;/em&gt; eventually consistent models.&lt;/p&gt;




&lt;h1&gt;
  
  
  Real-World AP Systems
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Amazon’s DynamoDB: The &lt;em&gt;Always On&lt;/em&gt; DB
&lt;/h2&gt;

&lt;p&gt;The history of DynamoDB goes back to 2004 when Amazon’s e-commerce business suffered a few too many outages. The e-commerce business was pushing their relational databases to their limits even though they had fairly simple usage patterns. This culminated in the release of the famous &lt;a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf" rel="noopener noreferrer"&gt;Dynamo paper&lt;/a&gt;, which laid out the design and implementation of Dynamo, a highly available, leaderless, eventually consistent key-value store. This would later become the foundation for DynamoDB.&lt;/p&gt;

&lt;h3&gt;
  
  
  How DynamoDB Stays Available
&lt;/h3&gt;

&lt;p&gt;Here’s something a lot of people confuse: DynamoDB the service &lt;strong&gt;is not&lt;/strong&gt; the Dynamo paper.&lt;/p&gt;

&lt;p&gt;The original Dynamo paper laid out a completely leaderless system with sloppy quorum, vector clocks, and application-side conflict resolution. DynamoDB the service threw a lot of that out. DynamoDB now does Multi-Paxos based leader replica election and quorum-based writes (writes require 2 of 3 storage nodes to acknowledge before committing). Does that sound familiar? It should, because that’s a CP pattern.&lt;/p&gt;

&lt;p&gt;So why is DynamoDB here, in the AP section?&lt;/p&gt;

&lt;p&gt;Because &lt;strong&gt;reads are AP by default&lt;/strong&gt;. With the default setting of eventually consistent reads, your request can be routed to any of the three replicas. There’s no coordination and no leader dependency. If the replica happens to be behind, you’ll get stale data, but the system remains up.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Trade-Offs in Reality
&lt;/h3&gt;

&lt;p&gt;By default, DynamoDB offers &lt;em&gt;eventually consistent&lt;/em&gt; reads. Which means that you could get stale data. But DynamoDB also offers &lt;em&gt;strongly consistent&lt;/em&gt; reads which get routed to the partition’s leader-node. Now we have the downsides of a CP system: dependence on the leader node being reachable.&lt;/p&gt;

&lt;p&gt;DynamoDB doesn’t make a binary choice between AP and CP. It gives you a &lt;strong&gt;per-request option&lt;/strong&gt; between AP and CP behavior on the read path, while keeping writes CP for durability. This is the kind of nuance that “pick two” completely misses.&lt;/p&gt;

&lt;p&gt;For a vast majority of use cases (e.g. shopping carts, user sessions, game state, IoT sensor data) the default eventually consistent reads are more than good enough. DynamoDB’s SLA promises 99.99% availability for standard tables and 99.999% for global tables. That’s under an hour of downtime per year for standard tables, and less than 5 minutes for global tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21zxKo%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F0690a497-b58c-4ad5-9e1a-0f793d834c17_6295x4305.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21zxKo%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F0690a497-b58c-4ad5-9e1a-0f793d834c17_6295x4305.png" title="DynamoDB achieves availability through AWS infrastructure speed, not by avoiding coordination. Writes still require 2/3 quorum, but cross-AZ replication completes in milliseconds. The AP trade-off becomes clear with reads: eventually consistent reads are fast but risky, strongly consistent reads sacrifice speed for guaranteed accuracy." alt="DynamoDB achieves availability through AWS infrastructure speed, not by avoiding coordination. Writes still require 2/3 quorum, but cross-AZ replication completes in milliseconds. The AP trade-off becomes clear with reads: eventually consistent reads are fast but risky, strongly consistent reads sacrifice speed for guaranteed accuracy." width="800" height="547"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;DynamoDB achieves availability through AWS infrastructure speed, not by avoiding coordination. Writes still require 2/3 quorum, but cross-AZ replication completes in milliseconds. The AP trade-off becomes clear with reads: eventually consistent reads are fast but risky, strongly consistent reads sacrifice speed for guaranteed accuracy.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Apache Cassandra: The &lt;em&gt;Choose Your Own Consistency&lt;/em&gt; DB
&lt;/h2&gt;

&lt;p&gt;Born from the need to let users instantly search through their inbox message history, Cassandra was Facebook’s answer to a scalable, reliable, and highly available storage system. It took learnings and design decisions from its predecessors: Amazon’s Dynamo (partitioning and replication) and Google’s BigTable (data and storage engine model). Open-sourced in 2008, Cassandra was built to handle ginormous write throughput across multiple data centers.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Cassandra Stays Available
&lt;/h3&gt;

&lt;p&gt;Cassandra has no single point of failure. There is no leader. Each and every node in the cluster can serve reads and accept writes. Data is distributed similarly to DynamoDB, with consistent hashing, and replicated to &lt;em&gt;N&lt;/em&gt; nodes (the recommended replication factor is 3 per datacenter).&lt;/p&gt;

&lt;p&gt;This is a proper leaderless design, unlike DynamoDB. Each and every node is an equal. Any node can serve any request.&lt;/p&gt;

&lt;p&gt;Cassandra also lets you &lt;em&gt;control&lt;/em&gt; the consistency &lt;em&gt;per query&lt;/em&gt; via its &lt;strong&gt;tunable consistency&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ONE&lt;/code&gt;: read/write succeeds after &lt;strong&gt;one replica&lt;/strong&gt; responds. This is the fastest and &lt;em&gt;least&lt;/em&gt; consistent option.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;QUORUM&lt;/code&gt;: a &lt;strong&gt;majority of replicas (n/2 + 1)&lt;/strong&gt; must respond. Slower than &lt;code&gt;ONE&lt;/code&gt; but more consistent.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ALL&lt;/code&gt;: as the name suggests, &lt;strong&gt;all replicas&lt;/strong&gt; must respond. This is the slowest and most strongly consistent option.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;LOCAL_QUORUM&lt;/code&gt;: a &lt;strong&gt;majority of the local replicas&lt;/strong&gt; (local to the datacenter of the coordinator dispatching the request) must respond.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you set the consistency level to &lt;code&gt;QUORUM&lt;/code&gt; for both reads and writes, you’ve got yourself &lt;strong&gt;strong consistency&lt;/strong&gt;. You’ve basically turned Cassandra into a CP system &lt;em&gt;for that query&lt;/em&gt;. Set both read and write consistency to &lt;code&gt;ONE&lt;/code&gt; and you’ve got a purely AP system. You pick the consistency, per query. Every single time.&lt;/p&gt;
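&lt;p&gt;Why does &lt;code&gt;QUORUM&lt;/code&gt; reads plus &lt;code&gt;QUORUM&lt;/code&gt; writes give you strong consistency? Because any read set and any write set are then guaranteed to overlap in at least one replica: R + W &amp;gt; N. A tiny sanity check of that arithmetic:&lt;/p&gt;

```python
def overlap_guaranteed(n: int, r: int, w: int) -> bool:
    # A read quorum and a write quorum must intersect when R + W > N,
    # so every read touches at least one replica holding the latest write.
    return r + w > n

N = 3  # replication factor
print(overlap_guaranteed(N, r=2, w=2))  # QUORUM/QUORUM: strongly consistent
print(overlap_guaranteed(N, r=1, w=1))  # ONE/ONE: purely AP, stale reads possible
print(overlap_guaranteed(N, r=1, w=3))  # ONE reads after ALL writes also overlap
```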

&lt;h3&gt;
  
  
  The Trade-Offs in Reality
&lt;/h3&gt;

&lt;p&gt;Cassandra relies on LWW as its default conflict resolution strategy (based on timestamps). This also opens the door for clock skew, bad time-stamping, silently allowing the &lt;em&gt;wrong&lt;/em&gt; write to win.&lt;/p&gt;

&lt;p&gt;Tombstones (deletion markers) can start to pile up and impact read performance if time-to-lives (TTLs), deletes, and compaction behavior aren’t managed carefully.&lt;/p&gt;

&lt;p&gt;Tunable consistency, as awesome as it is, can also be detrimental to your system. Before using tunable consistency, you &lt;em&gt;must&lt;/em&gt; understand each consistency level you choose. If misconfigured, you’ll pay the price in prod.&lt;/p&gt;

&lt;p&gt;As scary as that sounded, Cassandra’s per query tunable consistency is truly where it stands out. Number of views? Use &lt;code&gt;ONE&lt;/code&gt; , it’s fast and slightly stale data is perfectly fine. Inventory reservations? Use &lt;code&gt;QUORUM&lt;/code&gt; , slower but it’s correct.&lt;/p&gt;

&lt;p&gt;For time-series data, messaging systems, and really anything requiring heavy write throughput across data centers, Cassandra’s flexibility is hard to beat.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21zfBh%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252Fe648442e-9bdc-44fd-b484-28b1ec0c13c3_4764x3160.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21zfBh%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252Fe648442e-9bdc-44fd-b484-28b1ec0c13c3_4764x3160.png" title="Cassandra's superpower: you choose your trade-off per query. Need fast social media feeds? Use ONE (5ms, might be stale). Need accurate account balances? Use QUORUM (15ms, strong consistency). Need audit-level certainty? Use ALL (50ms, maximum safety). Same cluster, different guarantees." alt="Cassandra's superpower: you choose your trade-off per query. Need fast social media feeds? Use ONE (5ms, might be stale). Need accurate account balances? Use QUORUM (15ms, strong consistency). Need audit-level certainty? Use ALL (50ms, maximum safety). Same cluster, different guarantees." width="800" height="530"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Cassandra’s superpower: you choose your trade-off per query. Need fast social media feeds? Use ONE (5ms, might be stale). Need accurate account balances? Use QUORUM (15ms, strong consistency). Need audit-level certainty? Use ALL (50ms, maximum safety). Same cluster, different guarantees.&lt;/em&gt;&lt;/p&gt;





&lt;h2&gt;
  
  
  DNS: The &lt;em&gt;Most Important&lt;/em&gt; Stale Data on Earth
&lt;/h2&gt;

&lt;p&gt;We mentioned DNS way back in &lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-explained?r=1tezps" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, but it deserves a deeper look here. After all, it is the &lt;em&gt;purest&lt;/em&gt; AP system in existence.&lt;/p&gt;

&lt;p&gt;DNS is a globally distributed, hierarchically organized database. When you enter a URL in your browser, you kickstart a chain of reactions across multiple layers of DNS, from your recursive resolver (ISP provided, or company DNS, or public resolver), to Root, TLD, and Authoritative servers (each with the possibility of holding cached records).&lt;/p&gt;

&lt;h3&gt;
  
  
  How DNS Works
&lt;/h3&gt;

&lt;p&gt;DNS handles billions of queries per second across millions of servers worldwide. Imagine requiring linearizability for each and every query (i.e. making every query check with the authoritative server before responding). It would be insanely slow….like &lt;em&gt;extremely&lt;/em&gt; slow. Not only would it be painfully slow, it would introduce a single point of failure for the entire internet. &lt;strong&gt;THE ENTIRE INTERNET&lt;/strong&gt;. Nobody, and I mean absolutely nobody, would want that.&lt;/p&gt;

&lt;p&gt;Instead, DNS relies on caching controlled by TTLs (Time-to-Live). Each DNS record has a TTL value that lets resolvers know how long to cache it before checking with the authoritative server again. Naturally, when you update a DNS record, the old record doesn’t just magically disappear. It sits cached on DNS servers around the world until the TTLs expire. This is exactly why DNS changes “take time to propagate” and most likely why your DevOps friend told you to wait after you updated an A record.&lt;/p&gt;
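&lt;p&gt;The caching behavior is easy to model. Here’s a toy resolver sketch showing why an updated record keeps resolving to the old IP until its TTL expires (hypothetical names and a callable standing in for the authoritative server, not a real DNS library):&lt;/p&gt;

```python
import time

class CachingResolver:
    def __init__(self, authoritative):
        self.authoritative = authoritative  # callable: name -> (ip, ttl_seconds)
        self.cache = {}                     # name -> (ip, expiry_time)

    def resolve(self, name: str, now: float = None) -> str:
        now = time.time() if now is None else now
        hit = self.cache.get(name)
        if hit and hit[1] > now:
            return hit[0]  # serve from cache, possibly stale
        ip, ttl = self.authoritative(name)
        self.cache[name] = (ip, now + ttl)
        return ip

records = {"example.com": ("93.184.216.34", 3600)}  # 1-hour TTL
resolver = CachingResolver(lambda name: records[name])
print(resolver.resolve("example.com", now=0))     # fetched and cached
records["example.com"] = ("198.51.100.7", 3600)   # record updated upstream
print(resolver.resolve("example.com", now=100))   # cache hit: still the OLD ip
print(resolver.resolve("example.com", now=4000))  # TTL expired: new ip at last
```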

&lt;h3&gt;
  
  
  The Trade-Offs in Reality
&lt;/h3&gt;

&lt;p&gt;With caching, and depending on the TTL values, different users around the world can resolve the same domain to different IP addresses, sometimes for minutes and sometimes for &lt;em&gt;hours&lt;/em&gt;. That’s a completely tolerable consequence compared to the entirety of DNS going down. A small window of staleness is better than making name resolution slow and failure-prone.&lt;/p&gt;

&lt;p&gt;That’s the most important thing that DNS teaches us: most applications can tolerate a few seconds of stale data. It’s, arguably, the most critical system on the internet and runs on eventual consistency with TTLs measured in hours (&lt;em&gt;hours&lt;/em&gt;). If the backbone of the internet can tolerate that level of staleness, your application can most certainly handle a few seconds.&lt;/p&gt;

&lt;p&gt;Want to know how to pick the right trade-off? Part 4 covers the escape hatches.&lt;/p&gt;




&lt;h1&gt;
  
  
  When Should You Choose AP?
&lt;/h1&gt;

&lt;p&gt;We’ve seen how DynamoDB, Cassandra, and DNS each make different architectural decisions while staying highly available and remaining on the AP side of the spectrum. Here’s how to tell whether your system should be one of them:&lt;/p&gt;

&lt;p&gt;Go AP when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Availability is non-negotiable.&lt;/strong&gt; User-facing applications where downtime directly affects revenue, users, or both.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Stale reads are harmless&lt;/strong&gt;. The meaning of the data doesn’t change even if it’s a few seconds or minutes behind.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Being global is an end goal&lt;/strong&gt;. AP systems handle multi-region deployments very well, letting you avoid cross-region consensus overhead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write throughput matters&lt;/strong&gt;. In &lt;em&gt;leaderless&lt;/em&gt; / &lt;em&gt;multi-leader&lt;/em&gt; designs, multiple nodes can accept writes concurrently. Think event logging, IoT telemetry, time-series data: basically anything where a lot of writes are expected.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be cautious of AP systems when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Correctness is non-negotiable&lt;/strong&gt;. Financial transactions, payments/ledgers, inventory management, etc. Anything where conflicting writes cause real-world damage.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Conflict resolution isn’t simple.&lt;/strong&gt; If your data can’t be merged cleanly (e.g. with CRDTs or well-defined rules), you’ll end up writing and maintaining custom resolution logic, something no team wants to do.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Users expect instant consistency.&lt;/strong&gt; “I just saved my document and now it’s gone” is a horrible user experience, even if it re-appears two seconds later.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
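&lt;p&gt;To make “merged cleanly” concrete, here’s a minimal sketch of a G-Counter, one of the simplest CRDTs. The class and method names are illustrative, not from any particular library:&lt;/p&gt;

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot,
    and merging two replicas is an element-wise max, so concurrent
    writes converge with no custom conflict-resolution logic."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> highest count observed from that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas can merge in any order, any number of times.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())
```

&lt;p&gt;Two replicas can accept increments concurrently during a partition; when it heals, merging in either order yields the same total.&lt;/p&gt;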

&lt;p&gt;Same question I posed in Part 2, just from the opposite side of the lens: “What’s worse, being briefly wrong or briefly unavailable?”&lt;/p&gt;

&lt;p&gt;If stale data is harmless or easily correctable, go AP. If stale data can cause real damage, go CP.&lt;/p&gt;




&lt;h1&gt;
  
  
  Up Next: Beyond the “Pick Two” Framing
&lt;/h1&gt;

&lt;p&gt;We’ve dedicated two long posts to treating CP and AP as binary choices. They’re not.&lt;/p&gt;

&lt;p&gt;In Part 4, the grand finale, we’ll look at how modern systems blur the line between the two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Tunable Consistency: you saw this with Cassandra and DynamoDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PACELC Theorem: If &lt;em&gt;during a partition&lt;/em&gt; you’re trading between availability and consistency, what are you trading &lt;em&gt;when there is no partition&lt;/em&gt;?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How systems like Spanner and CockroachDB aim for both C-like and A-like behavior, and at what cost.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Practical guidance for choosing trade-offs in your own systems.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CAP theorem provides you with a mental model. Part 4 of this series will give you the escape hatch.&lt;/p&gt;

&lt;p&gt;— Ali&lt;/p&gt;

&lt;p&gt;Subscribe to get Part 4 when it drops!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-ap-systems?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  References
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf" rel="noopener noreferrer"&gt;Dynamo: Amazon’s Highly Available Key-Value Store&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://cassandra.apache.org/doc/latest/" rel="noopener noreferrer"&gt;Apache Cassandra Documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://crdt.tech/" rel="noopener noreferrer"&gt;CRDTs: Conflict-free Replicated Data Types&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://datatracker.ietf.org/doc/html/rfc1034" rel="noopener noreferrer"&gt;DNS RFC 1034 - Domain Concepts and Facilities&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dataintensive.net/" rel="noopener noreferrer"&gt;Designing Data-Intensive Applications&lt;/a&gt; by Martin Kleppmann&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/dynamodb/" rel="noopener noreferrer"&gt;Amazon DynamoDB Documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf" rel="noopener noreferrer"&gt;Cassandra - A Decentralized Structured Storage System&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>database</category>
      <category>distributedsystems</category>
      <category>systemdesign</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>CP Systems Explained: The Hidden Cost of Strong Consistency</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Tue, 03 Feb 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/amalik18/cp-systems-explained-the-hidden-cost-of-strong-consistency-1kgp</link>
      <guid>https://dev.to/amalik18/cp-systems-explained-the-hidden-cost-of-strong-consistency-1kgp</guid>
      <description>&lt;p&gt;Let's look at what CP systems actually do under the hood when they choose consistency over availability.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is Part 2 of a 4-part series covering the CAP theorem and distributed systems trade-offs. If you haven’t done so already, &lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-explained" rel="noopener noreferrer"&gt;read Part 1 here&lt;/a&gt;!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In &lt;a href="https://www.blog.ahmazin.dev/p/cap-theorem-explained" rel="noopener noreferrer"&gt;Part 1&lt;/a&gt;, we examined the CAP theorem holistically and its “pick two” myth. We narrowed the CAP theorem down to this singular question: &lt;strong&gt;what should your system do &lt;em&gt;when&lt;/em&gt; the network fails?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CP (Consistency, Partition Tolerant) systems answer this question with: “&lt;strong&gt;I’d rather be unavailable than wrong&lt;/strong&gt;.”&lt;/p&gt;

&lt;p&gt;That’s fine and all, but choices have consequences. Consequences don’t always show up right away; sometimes it’s 2 AM on a weekend when you realize your system is taking entirely too long to write to a database.&lt;/p&gt;

&lt;p&gt;Following along? Subscribe to get the rest of the series!&lt;/p&gt;




&lt;h2&gt;
  
  
  Requisites of a “CP” System
&lt;/h2&gt;

&lt;p&gt;As we defined earlier, a CP system is one that prioritizes “consistency” over “availability” during network partitions. What does that look like in the real world?&lt;/p&gt;

&lt;h3&gt;
  
  
  Blocked Writes
&lt;/h3&gt;

&lt;p&gt;What this means is, if you attempt to &lt;strong&gt;&lt;em&gt;write&lt;/em&gt;&lt;/strong&gt; to your system during a network partition, it will not be acknowledged / confirmed until it’s been replicated to enough nodes. “Enough” is subjective but typically means a majority (e.g., 2 of 3 or 3 of 5); this is also referred to as the &lt;strong&gt;&lt;em&gt;quorum&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Always use an odd number of nodes. A 4-node cluster has the same fault tolerance as a 3-node cluster, so you’re paying for a node that doesn’t improve availability.&lt;/p&gt;
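&lt;p&gt;The quorum math behind this is simple enough to sketch in a few lines (illustrative helpers, not tied to any particular database):&lt;/p&gt;

```python
def quorum(n: int) -> int:
    """A majority quorum needs floor(n / 2) + 1 nodes."""
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    """Nodes you can lose while still reaching quorum."""
    return n - quorum(n)
```

&lt;p&gt;A 3-node cluster has quorum 2 and tolerates 1 failure; a 4-node cluster has quorum 3 and still tolerates only 1 failure. The fourth node buys you nothing.&lt;/p&gt;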

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0tjh241jzlsv47qj556.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff0tjh241jzlsv47qj556.png" width="800" height="885"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going back to &lt;strong&gt;etcd&lt;/strong&gt;, a &lt;code&gt;kubectl apply&lt;/code&gt; will not return until at least half of the cluster confirms the write. Similarly, in a Postgres setup with &lt;em&gt;synchronous replication&lt;/em&gt; enabled, an &lt;code&gt;INSERT&lt;/code&gt; waits for the replicas to confirm that the data was received and written.&lt;/p&gt;

&lt;p&gt;And this is exactly why &lt;strong&gt;writes&lt;/strong&gt; are slower in CP systems. You are no longer just writing to a single node and moving on, you have to wait for network round trips &lt;strong&gt;and&lt;/strong&gt; the slowest node in the quorum.&lt;/p&gt;

&lt;h3&gt;
  
  
  Most Up-to-Date Reads
&lt;/h3&gt;

&lt;p&gt;CP systems &lt;em&gt;can&lt;/em&gt; guarantee fresh reads when using linearizable or quorum-backed read modes. In those modes you will &lt;strong&gt;never&lt;/strong&gt; read stale data: if you just wrote a value, you’ll read that very value back, or get an error (as shown above, if quorum isn’t met). In MongoDB, if you enable the linearizable read concern, your reads go to the primary and wait to confirm that there aren’t any other writes in progress. Back to our etcd example: you can read from follower nodes, but the default logic guarantees that you never read anything that hasn’t been committed to quorum.&lt;/p&gt;

&lt;p&gt;As before, what’s the cost of strongly consistent reads? It’s &lt;strong&gt;read latency&lt;/strong&gt;. You can’t simply fetch data from the fastest / nearest replica; there’s a robust level of coordination involved that ensures you’re always getting the freshest data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Partitioned Nodes Stop
&lt;/h3&gt;

&lt;p&gt;In CP systems, if a node can’t reach its peers it’ll refuse requests outright rather than risk returning stale data. There’s no guessing involved, it simply &lt;strong&gt;will not&lt;/strong&gt; serve stale data.&lt;/p&gt;

&lt;p&gt;What does this mean in practice? The node will return an error or simply timeout. Going back to MongoDB, if the primary node can’t reach its secondaries, it will &lt;em&gt;step down&lt;/em&gt; as the primary and refuse writes altogether. In the case of etcd, if the node gets partitioned away, it’ll stop accepting requests until it rejoins the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Happens During a Partition?
&lt;/h3&gt;

&lt;p&gt;Let’s walk through a concrete example of a network partition that’ll help visualize all of the above.&lt;/p&gt;

&lt;p&gt;Let’s assume we have a 5-node &lt;strong&gt;etcd&lt;/strong&gt; cluster with a quorum of 3. Everything is running perfectly: writes go to the leader, get replicated to followers, quorum is reached, and clients are happy.&lt;/p&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhnwi9ebxbkky06msmu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhnwi9ebxbkky06msmu9.png" width="800" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then a network partition splits the cluster in a 3-2 configuration. Three nodes can still communicate with each other. Two nodes are isolated.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjo1qe7uqyg9yyppclr7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqjo1qe7uqyg9yyppclr7.png" width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Majority Side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;There is still quorum (3 of 5).&lt;/li&gt;
&lt;li&gt;If necessary, a new leader is elected and operations continue.&lt;/li&gt;
&lt;li&gt;Both writes and reads succeed.&lt;/li&gt;
&lt;li&gt;Clients connected to any node on the majority side experience absolutely nothing out of the ordinary.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Minority Side
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Can’t reach quorum (2 &amp;lt; 3)&lt;/li&gt;
&lt;li&gt;Writes are refused.&lt;/li&gt;
&lt;li&gt;Depending on how the system is configured, reads are also refused.&lt;/li&gt;
&lt;li&gt;Clients connected to this side see timeouts / errors.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clients connected to the majority side don't notice anything. Writes still work because quorum (3) is reachable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl9yif1ih9hrjz4yeblz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl9yif1ih9hrjz4yeblz.png" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Clients connected to the minority side see timeouts and errors. The nodes refuse to serve requests because they can't reach quorum.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7z7nodnvjehv0a6u2vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh7z7nodnvjehv0a6u2vg.png" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what CP systems look like in action. The minority side (the nodes that have been partitioned away) doesn’t return stale data. There’s no guesswork; the nodes simply don’t respond. Eventually, when the partition heals, all nodes sync up with the majority side and behave normally again.&lt;/p&gt;

&lt;p&gt;Sure, this is cumbersome when you’re a client connected to the minority side, but it’s what guarantees data “consistency” (remember, what we mean here by consistency is actually linearizability).&lt;/p&gt;

&lt;h3&gt;
  
  
  The Costs of Linearizability
&lt;/h3&gt;

&lt;p&gt;As we’ve shown thus far, nothing is free in systems. CP systems pay for linearizability with latency, availability, and operational complexity. Why? Because &lt;strong&gt;&lt;em&gt;consensus&lt;/em&gt;&lt;/strong&gt; is expensive.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Quick Primer on Consensus Protocols
&lt;/h2&gt;

&lt;p&gt;What problem have we identified thus far? CP systems need nodes to &lt;em&gt;agree&lt;/em&gt; on the state of data. Easy when everything is going well, but given the unreliable nature of networks this is harder than it sounds.&lt;/p&gt;

&lt;p&gt;There are two main protocols that allow multiple machines to achieve consensus: Paxos and Raft.&lt;/p&gt;

&lt;h3&gt;
  
  
  Paxos
&lt;/h3&gt;

&lt;p&gt;Paxos is the original consensus protocol. Created by &lt;a href="https://lamport.azurewebsites.net/pubs/paxos-simple.pdf" rel="noopener noreferrer"&gt;Leslie Lamport in 1989&lt;/a&gt;, Paxos is notorious for being difficult to understand (the original paper’s title should be enough proof of that).&lt;/p&gt;

&lt;p&gt;Paxos &lt;strong&gt;guarantees&lt;/strong&gt; three properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Validity:&lt;/strong&gt; Only values that were actually proposed can be chosen. There’s no randomness and no garbage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agreement (Consensus):&lt;/strong&gt; All nodes that reach a decision will reach the &lt;em&gt;same&lt;/em&gt; decision. No two nodes can come to different decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrity:&lt;/strong&gt; Once a value is chosen, it’s permanent. There’s no un-choosing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basically, nodes go through multiple rounds of voting to agree on a specific value. Once a value is chosen it can &lt;strong&gt;&lt;em&gt;never&lt;/em&gt;&lt;/strong&gt; be un-chosen, regardless of what happens.&lt;/p&gt;

&lt;p&gt;These guarantees hold even through node crashes, message loss, and packet delays. Paxos never compromises on safety. The downside? Paxos can’t guarantee &lt;strong&gt;&lt;em&gt;liveness.&lt;/em&gt;&lt;/strong&gt; Under some conditions (competing proposers, network issues), Paxos can stall indefinitely. It won’t return a wrong answer, but it might not answer at all (do you want to be right all the time, or respond all the time?). In production systems this is very rare; techniques like random backoffs and leader election mitigate much of it, but it’s important to note that this is a trade-off in the protocol itself.&lt;/p&gt;

&lt;p&gt;In practice, Paxos is very difficult to implement properly, which led to alternatives. Many systems you’ll encounter in the real world use &lt;strong&gt;Raft&lt;/strong&gt; instead of Paxos.&lt;/p&gt;

&lt;h3&gt;
  
  
  Raft
&lt;/h3&gt;

&lt;p&gt;Raft was created in 2014 with one goal in mind: comprehensibility. The authors of the Raft paper (&lt;a href="https://raft.github.io/raft.pdf" rel="noopener noreferrer"&gt;In Search of an Understandable Consensus Algorithm&lt;/a&gt;) quite literally ran user studies that showed that Raft was easier to understand and comprehend when compared to Paxos.&lt;/p&gt;

&lt;p&gt;Raft, like Paxos, provides the same three guarantees: validity, quorum-based agreement, and integrity. But, to make the protocol more understandable, Raft breaks the problem into three subproblems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leader Election:&lt;/strong&gt; Only one leader per term (time period). All writes go through this leader.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log Replication:&lt;/strong&gt; The leader replicates log entries to followers; if an entry is committed, it &lt;strong&gt;will&lt;/strong&gt; appear in all future leaders’ logs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety:&lt;/strong&gt; A committed write will never be lost, even if the leader goes down.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similar to Paxos, Raft requires a &lt;strong&gt;majority quorum&lt;/strong&gt; to make decisions. Like our previous example, in a 5-node cluster, you’d need 3 nodes to agree. If you can only reach 2 nodes, the system would stop accepting writes.&lt;/p&gt;

&lt;p&gt;The liveness problem is addressed with randomized election timeouts. When a leader dies, followers wait a random amount of time before conducting an election. This makes split votes (where no candidate is able to reach quorum) rare in practice.&lt;/p&gt;

&lt;p&gt;Let’s see the liveness problem visualized.&lt;/p&gt;

&lt;p&gt;Two nodes timeout simultaneously and both become candidates. This time A squeaks by with 3 votes, but it was close. If one more node had voted for B, we'd have a split.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbik8r7vev0j9o1njgidh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbik8r7vev0j9o1njgidh.png" width="800" height="662"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The worst case. Votes split evenly, no one reaches quorum. The cluster has to wait for timeouts to expire and try again. If this keeps happening, you're stuck without a leader.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb9lzrvrx8cbc2794by8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhb9lzrvrx8cbc2794by8.png" width="800" height="788"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Randomized timeouts fix this. Each node picks a random timeout (e.g., 150-300ms). Node A wakes up first at 150ms and campaigns unopposed. By the time B wakes up at 243ms, A is already leader. No split, no drama.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsaftqo6klm4enm9g8nn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsaftqo6klm4enm9g8nn.png" width="800" height="742"&gt;&lt;/a&gt;&lt;/p&gt;
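&lt;p&gt;You can convince yourself of this with a tiny simulation (the numbers and function are illustrative, not real Raft code): with a fixed timeout every node campaigns at once and the vote splits; with randomized timeouts, a shared earliest wake-up is rare.&lt;/p&gt;

```python
import random

def split_rate(nodes=5, trials=10_000, randomized=True, seed=42):
    """Fraction of elections where more than one node shares the
    earliest timeout, i.e. multiple candidates campaign at once."""
    rng = random.Random(seed)
    splits = 0
    for _ in range(trials):
        if randomized:
            # Each node picks a random timeout in 150-300 ms.
            timeouts = [rng.randint(150, 300) for _ in range(nodes)]
        else:
            # Fixed timeout: everyone campaigns simultaneously.
            timeouts = [150] * nodes
        if timeouts.count(min(timeouts)) > 1:
            splits += 1
    return splits / trials
```

&lt;p&gt;With fixed timeouts, every single election collides. With randomized timeouts, collisions drop to a few percent, and real Raft simply retries with fresh random timeouts on the rare tie.&lt;/p&gt;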

&lt;p&gt;One problem both Raft and Paxos share is that neither handles Byzantine faults. Both protocols assume nodes are honest: they can crash or be slow, but they never lie.&lt;/p&gt;

&lt;p&gt;I’ll save the deep-dive into Raft / Paxos internals for another post. For now all you need to understand is: consensus requires coordination, and coordination is very costly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.notion.so/%%checkout_url%%" rel="noopener noreferrer"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Costs of CP Systems
&lt;/h2&gt;

&lt;p&gt;Here’s something that the CAP Theorem fails to mention: even when there is &lt;em&gt;no partition&lt;/em&gt;, CP systems pay a performance price.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency
&lt;/h3&gt;

&lt;p&gt;Every write requires confirmation from multiple nodes. In our example of a 5-node cluster, a single write needs acknowledgements from 3 nodes (the quorum). The leader sends to the replicas in parallel, so that’s at least one network round trip, waiting on the slowest of the acknowledging nodes, before a write is confirmed.&lt;/p&gt;

&lt;p&gt;If the nodes are in the same data center, that latency is a few milliseconds. If the nodes are geographically separated, it shoots up to anywhere from 5 to 200ms, per write.&lt;/p&gt;

&lt;p&gt;That’s why, when designing CP systems, the geographic placement of nodes is important. Global distribution and strong consistency don’t mix well.&lt;/p&gt;

&lt;p&gt;Consensus hates distance. For CP systems, latency is determined by the &lt;em&gt;slowest node you must wait for&lt;/em&gt;, not the average node latency. In our 5-node cluster, a write requires 3 nodes to agree, so the latency of that write is the latency of the slowest member of the agreeing nodes.&lt;/p&gt;
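&lt;p&gt;A back-of-the-envelope sketch of that rule (the function and latency figures are made up for illustration): the leader counts itself toward quorum, so with a quorum of 3 it only needs the 2 fastest follower acknowledgements.&lt;/p&gt;

```python
def quorum_write_latency_ms(follower_rtts_ms, cluster_size):
    """Approximate commit latency for one quorum write.

    The leader replicates to all followers in parallel and commits once
    quorum acks arrive, counting its own local write as one ack. Latency
    is therefore the RTT of the slowest follower it must wait for.
    """
    quorum = cluster_size // 2 + 1
    acks_needed = quorum - 1  # the leader's own ack is local
    return sorted(follower_rtts_ms)[acks_needed - 1]
```

&lt;p&gt;With two nearby followers at 2ms and 3ms and two cross-region followers at 90ms and 180ms, a 5-node quorum write commits in about 3ms; move one of the fast followers far away and the write latency jumps with it.&lt;/p&gt;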

&lt;h3&gt;
  
  
  Throughput
&lt;/h3&gt;

&lt;p&gt;Since all writes flow through a single leader, you can only write as fast as that leader can process and replicate. The leader is responsible for serialization, ordering, and coordinating replication. With multiple writes, CPU, network, and disk resource contention accumulates.&lt;/p&gt;

&lt;p&gt;How do we increase throughput? If you said &lt;em&gt;scaling&lt;/em&gt;, you’d be right… partially. Horizontal scaling doesn’t help; in fact, it can hurt. Adding more nodes increases coordination overhead, and past a certain point horizontal scaling actually &lt;em&gt;makes you slower&lt;/em&gt;. So we scale vertically. Even then, vertical scaling has its limits.&lt;/p&gt;

&lt;p&gt;Scaling vertically makes CPU, memory, and disk I/O faster. But it &lt;strong&gt;&lt;em&gt;does not&lt;/em&gt;&lt;/strong&gt; lower network round-trip times.&lt;/p&gt;

&lt;p&gt;There is a throughput ceiling for CP systems. Throughput is directly related to the round-trip time for quorum &lt;strong&gt;&lt;em&gt;and&lt;/em&gt;&lt;/strong&gt; the ordering capacity of the leader.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leader Election Storms
&lt;/h3&gt;

&lt;p&gt;When the leader node dies, the cluster elects a new one. Each election causes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Coordination overhead as nodes ask for votes and reply.&lt;/li&gt;
&lt;li&gt;Service disruption while the cluster, effectively, pauses to elect a leader.&lt;/li&gt;
&lt;li&gt;Instability as nodes switch roles and reset timers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Elections are annoying. If two nodes start campaigning at the same time, votes can split and no one reaches quorum. Raft uses randomized timeouts to avoid this, but under network jitter or heavy load, you can still get stuck in multiple rounds of failed elections.&lt;/p&gt;

&lt;p&gt;If you’ve used Kubernetes before, you may have run into issues where &lt;strong&gt;etcd&lt;/strong&gt; completely stops responding and it isn’t your doing.&lt;/p&gt;

&lt;p&gt;Let’s go back to our 5-node etcd cluster (Nodes A-E) example, let's say your cloud provider's network hiccups for 500ms, just long enough for the leader to lose contact with 2 followers. It steps down. Election happens, Node B wins. 200ms later, another blip. Node B can't reach quorum now. Steps down. Another election. This keeps going while &lt;code&gt;kubectl&lt;/code&gt; commands pile up in the background. Network stabilizes, you check the logs, and realize you just burned 30 seconds on 3 elections.&lt;/p&gt;

&lt;h3&gt;
  
  
  Operational Complexity
&lt;/h3&gt;

&lt;p&gt;CP systems are very hard to implement. An added layer of complexity is capacity planning.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Odd Number of Nodes:&lt;/strong&gt; a 4-node cluster and 3-node cluster have the same fault tolerance (losing two nodes will render the system unavailable since quorum can’t be met). So adding a single node doesn’t really help in this case.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quorum / Fault Tolerance Math Matters:&lt;/strong&gt; A 5-node cluster needs 3 nodes for quorum, meaning it can withstand 2 nodes going down. A 3-node cluster needs 2 nodes for quorum, meaning it can handle only one node going down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Pains:&lt;/strong&gt; CP systems behave &lt;em&gt;correctly&lt;/em&gt; under failure, but that doesn’t mean it’s pretty. Partial failures become an operational pain; CP systems react rather defensively to them, since the system is designed around &lt;strong&gt;&lt;em&gt;safety first.&lt;/em&gt;&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Real-World CP Systems
&lt;/h2&gt;

&lt;p&gt;Let’s look at a few CP systems you’ll actually encounter or have already encountered.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;etcd&lt;/code&gt; : The Backbone of Kubernetes
&lt;/h3&gt;

&lt;p&gt;If you’ve used or played around with Kubernetes at all, you’ve used &lt;code&gt;etcd&lt;/code&gt;, whether you knew it or not.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;etcd&lt;/code&gt; is a distributed key-value store designed for reliability over speed. Its purpose is not to serve high-throughput workloads, but to hold data that absolutely cannot be wrong: configuration, coordination state, service discovery, etc. A small amount of priceless data that everything else depends on.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;etcd&lt;/code&gt; is responsible for storing the cluster state: pod definitions, service configs, secrets, all of it. When you run &lt;code&gt;kubectl apply&lt;/code&gt;, the manifest you provide ends up in &lt;code&gt;etcd&lt;/code&gt;. When the scheduler needs to decide where to put a pod, it reads from &lt;code&gt;etcd&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why CP?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv03xxsr1gh7do83c6nt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv03xxsr1gh7do83c6nt.png" alt=" raw `etcd` endraw  is the single source of truth. Every component asks the same question: " width="800" height="1079"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;etcd&lt;/code&gt; is the single source of truth. Every component asks the same question: "What does &lt;code&gt;etcd&lt;/code&gt; say?”&lt;/p&gt;

&lt;p&gt;Kubernetes needs a single source of truth. Every component of Kubernetes (the scheduler, controller manager, kubelets, etc) makes decisions based on what’s stored in &lt;code&gt;etcd&lt;/code&gt;. If two nodes disagree about, say, which pods are running where, you get absolute chaos. Double scheduled pods, orphaned containers, screwed up service routing, and the failures snowball into a nasty ugly mess.&lt;/p&gt;

&lt;p&gt;In these situations, would it make sense for the system to continue to serve inconsistent / wrong data? Or is it more beneficial for the system to stop for a little bit and straighten things out? &lt;code&gt;etcd&lt;/code&gt; handles this by electing a new leader, that brief pause in availability is annoying as an end user, but it is infinitely better than an inconsistent system.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Trade-Offs In Reality
&lt;/h3&gt;

&lt;p&gt;Ever experienced a situation where &lt;code&gt;kubectl&lt;/code&gt; commands hang or fail during cluster instability? That’s &lt;code&gt;etcd&lt;/code&gt; behaving like a CP system, as it was designed to. If &lt;code&gt;etcd&lt;/code&gt; can’t reach quorum, it stops accepting writes; your &lt;code&gt;kubectl apply&lt;/code&gt; commands time out because it’s safer to deny the write.&lt;/p&gt;

&lt;p&gt;I’ve faced this multiple times, whether it’s slow disk I/O, network latency, or resource exhaustion. Each of these can lead to failed heartbeat checks, which in turn leads to leader re-election, and suddenly the cluster becomes unresponsive. Writes hang, &lt;code&gt;kubectl&lt;/code&gt; commands just hang. It’s annoying, but it’s the correct behavior as &lt;code&gt;etcd&lt;/code&gt; is designed: &lt;code&gt;etcd&lt;/code&gt; is protecting data integrity.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;etcd&lt;/code&gt; uses Raft for consensus. This means that all writes go through a single leader node and require acknowledgement from a majority before being committed. By default, &lt;code&gt;etcd&lt;/code&gt; reads are &lt;em&gt;linearizable&lt;/em&gt;: they’re coordinated through the leader, since it’s the only node that knows what’s been committed. &lt;code&gt;etcd&lt;/code&gt; also allows faster serializable reads (which may be stale), but that must be configured explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Zookeeper: The OG
&lt;/h3&gt;

&lt;p&gt;Before &lt;code&gt;etcd&lt;/code&gt;, there was Zookeeper. It’s much older than &lt;code&gt;etcd&lt;/code&gt; and is the coordination backbone for systems like Hadoop, Kafka, and Spark. If you’ve used any of those systems, Zookeeper was running in the background keeping things in order.&lt;/p&gt;

&lt;p&gt;Like &lt;code&gt;etcd&lt;/code&gt;, Zookeeper is a distributed key-value store. It’s a very particular kind of KV store (not worth getting into in this post) that holds only the mission-critical data needed to set up and maintain coordination: who the current leader is, what the current config looks like, and so on.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why CP?
&lt;/h3&gt;

&lt;p&gt;Much like &lt;code&gt;etcd&lt;/code&gt;, Zookeeper maintains the coordination for a system. No two nodes should think they’re the leader, no two processes should both hold a resource lock, no node should be serving stale data once a write has been committed.&lt;/p&gt;

&lt;p&gt;If those guarantees aren’t upheld, you can run into split-brain failures, data corruption, or cascading failures. These aren’t failure modes that can just be ignored. Zookeeper, like &lt;code&gt;etcd&lt;/code&gt;, was designed to address these very issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Trade-Offs in Reality
&lt;/h3&gt;

&lt;p&gt;Zookeeper has the same quorum-related failure mode as &lt;code&gt;etcd&lt;/code&gt;: lose 3 nodes of a 5-node cluster and it stops entirely. No writes, no reads, nothing at all.&lt;/p&gt;

&lt;p&gt;The problem with Zookeeper is the way it’s integrated. Zookeeper is often a dependency for other systems: Kafka broker coordination, Flink high availability, Hadoop failover coordination.&lt;/p&gt;

&lt;p&gt;So if Zookeeper goes down, Kafka’s broker coordination goes down; suddenly your event stream is down and your entire system stalls. All because one coordination system couldn’t maintain quorum.&lt;/p&gt;

&lt;h3&gt;
  
  
  PostgreSQL: Everybody’s Favorite DB
&lt;/h3&gt;

&lt;p&gt;Here’s a shocking one: PostgreSQL &lt;em&gt;can&lt;/em&gt; be a CP system. Not by default, you’ve got to make a few config changes, but do that and it can behave like a CP system.&lt;/p&gt;

&lt;p&gt;Out of the box, Postgres with async replication doesn’t fit neatly into CP or AP. Writes go to a single primary. Standbys replay the logs but don’t accept writes of their own. If the primary dies before replication completes, those writes are gone. But you never get wrong data. For many, this is fine: you accept a small risk of data loss for better performance.&lt;/p&gt;

&lt;p&gt;But financial systems and booking systems can’t lose a single transaction. Those are situations where yolo-ing isn’t an option. &lt;/p&gt;

&lt;h3&gt;
  
  
  Making Postgres CP
&lt;/h3&gt;

&lt;p&gt;To make Postgres into a CP-compliant system, we have to enable synchronous replication. Once enabled, every write will wait for &lt;em&gt;at least&lt;/em&gt; one replica to confirm it before the primary acknowledges the commit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="n"&gt;postgresql&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;
&lt;span class="n"&gt;synchronous_commit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt;
&lt;span class="n"&gt;synchronous_standby_named&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="k"&gt;server_name&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s all it takes to make Postgres CP compliant. Your acknowledged writes are now redundant across multiple nodes. If your primary goes down, a standby has everything needed to serve the data. Zero data loss on failover.&lt;/p&gt;
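&lt;p&gt;If you go this route, it’s worth keeping an eye on replication state. On the primary, Postgres’s built-in &lt;code&gt;pg_stat_replication&lt;/code&gt; view shows which standbys are currently synchronous:&lt;/p&gt;

```sql
-- Run on the primary. sync_state is 'sync' for the standby that
-- synchronous commits wait on, and 'async' for the rest.
SELECT application_name, state, sync_state
FROM pg_stat_replication;
```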

&lt;h3&gt;
  
  
  The Trade-Offs in Reality
&lt;/h3&gt;

&lt;p&gt;Now that you’ve turned on synchronous commit, you’re waiting on the network for every write. If your standby is slow, writes are slow. If your standby is unreachable, your writes hang. You’ve successfully traded availability for consistency. &lt;/p&gt;

&lt;p&gt;Many don’t really need this. You can get by with async replication with proper failover. But for those instances where losing data is just not an option, synchronous replication is how you address it.&lt;/p&gt;

&lt;h3&gt;
  
  
  MongoDB: CP If Configured Properly
&lt;/h3&gt;

&lt;p&gt;MongoDB, a NoSQL DBMS that uses a document-oriented data model, can behave as both AP and CP depending on its configuration. &lt;/p&gt;

&lt;p&gt;By default, MongoDB isn’t CP. Writes are acknowledged after hitting only the primary, and reads are local, which means they can return data that hasn’t been replicated yet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;writeConcern:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;w:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="err"&gt;readConcern:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"local"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fast, but it also gives you weaker consistency guarantees than a pure CP system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Making MongoDB CP
&lt;/h3&gt;

&lt;p&gt;It only takes two settings to configure MongoDB as a CP system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight jsx"&gt;&lt;code&gt;&lt;span class="c1"&gt;// WRITE: WAIT FOR MAJORITY OF REPLICAS&lt;/span&gt;
&lt;span class="c1"&gt;// per operation&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insertOne&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;item&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;widget&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;writeConcern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;w&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;majority&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// per collection&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createCollection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;orders&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;writeConcernt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;w&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;majority&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// per client&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;writeConcern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;w&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;majority&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="o"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;// READS: ONLY RETURN DATA COMMITTED TO MAJORITY&lt;/span&gt;
&lt;span class="c1"&gt;// per operation&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;readConcern&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;linearizable&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;readPref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;primary&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// per collection&lt;/span&gt;
&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;withOptions&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;readConcern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;linearizable&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;readPreference&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;primary&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;// per client&lt;/span&gt;
&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;readConcern&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;level&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;linearizable&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;readPreference&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;primary&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;w: "majority"&lt;/code&gt; , writes aren’t acknowledged until they’ve been replicated to a &lt;em&gt;majority&lt;/em&gt; of the replica sets. With the &lt;code&gt;linearizable&lt;/code&gt; read option, reads only return data that’s been committed by the majority. With these two options we have &lt;em&gt;strong consistency&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Pitfalls
&lt;/h3&gt;

&lt;p&gt;This has to be done explicitly. MongoDB doesn’t default to a strong consistency model. The cost of these changes is pretty significant: majority writes are slower, and linearizable reads can only go to the primary.&lt;/p&gt;

&lt;p&gt;When configuring MongoDB don’t assume a strongly consistent data model. If you need CP, configure it.&lt;/p&gt;





&lt;h3&gt;
  
  
  When Should You Choose CP?
&lt;/h3&gt;

&lt;p&gt;CP systems aren’t &lt;em&gt;better or worse&lt;/em&gt; than AP systems. They’re suited to different use cases.&lt;/p&gt;

&lt;p&gt;Choose CP when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time consistency for shared state is a necessity. Financial transactions, inventory, etc. Anything where &lt;em&gt;eventually being right&lt;/em&gt; is just not good enough.&lt;/li&gt;
&lt;li&gt;Coordination is the primary driver. Like we saw, leader election, distributed locks, and config management are all driven by coordination.&lt;/li&gt;
&lt;li&gt;The blast radius of an inconsistency is large. Disagreeing nodes causing cascading failures? Choose CP.&lt;/li&gt;
&lt;li&gt;Brief moments of unavailability are tolerable. Leader elections or network issues can cause slight outages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Be cautious of CP systems when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Availability is paramount. User-facing app where &lt;em&gt;something&lt;/em&gt; is better than &lt;strong&gt;nothing&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Global write distribution is required. Strong consistency across regions increases write latency and can reduce availability.&lt;/li&gt;
&lt;li&gt;Consistency requirements are weaker than &lt;code&gt;linearizability&lt;/code&gt;. Many systems are fine with causal consistency or read-your-writes; these systems don’t require &lt;em&gt;linearizability&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The question you should ask is this: “What’s worse? Being briefly wrong or briefly unavailable?”&lt;/p&gt;

&lt;p&gt;If stale data can cause harm, lean CP. If refusing requests causes more harm, lean AP. &lt;/p&gt;




&lt;h2&gt;
  
  
  Up Next: AP Systems
&lt;/h2&gt;

&lt;p&gt;We’ve discussed CP systems and how they sacrifice latency, throughput, and availability for consistency. What about systems that choose availability over consistency?&lt;/p&gt;

&lt;p&gt;In Part 3 of this 4-part series, we’ll take a look at AP Systems, datastores that prioritize availability over consistency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How DynamoDB and Cassandra remain available during network partitions&lt;/li&gt;
&lt;li&gt;Conflict resolution strategies&lt;/li&gt;
&lt;li&gt;Why eventual consistency is actually okay (sometimes)&lt;/li&gt;
&lt;li&gt;and more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The “pick two” framing tricks engineers into thinking it’s a binary decision between CP and AP systems. Reality is a lot more complicated, and the trade-off decisions are extremely interesting.&lt;/p&gt;

&lt;p&gt;Thanks again for sticking around. If you enjoyed this post, please subscribe, it gives me the confidence to keep going.&lt;/p&gt;

&lt;p&gt;Ali&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.redhat.com/en/blog/a-guide-to-etcd" rel="noopener noreferrer"&gt;A Guide to etcd&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/docs/manual/reference/read-concern/" rel="noopener noreferrer"&gt;MongoDB’s Read Concern&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lamport.azurewebsites.net/pubs/paxos-simple.pdf" rel="noopener noreferrer"&gt;Paxos Made Simple (The Original Paper)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://raft.github.io/raft.pdf" rel="noopener noreferrer"&gt;In Search of an Understandable Consensus Algorithm&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://zookeeper.apache.org/doc/r3.5.1-alpha/zookeeperOver.html" rel="noopener noreferrer"&gt;Zookeeper Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.postgresql.org/docs/current/runtime-config-replication.html" rel="noopener noreferrer"&gt;PostgreSQL Replication&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/" rel="noopener noreferrer"&gt;Subscribe&lt;/a&gt; to get Part 3 as soon as it drops!&lt;/p&gt;

</description>
      <category>distributedsystems</category>
      <category>database</category>
      <category>postgres</category>
      <category>backend</category>
    </item>
    <item>
      <title>CAP Theorem Explained: Beyond the "Pick Two" Myth</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Tue, 03 Feb 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/amalik18/cap-theorem-explained-beyond-the-pick-two-myth-k1n</link>
      <guid>https://dev.to/amalik18/cap-theorem-explained-beyond-the-pick-two-myth-k1n</guid>
      <description>&lt;p&gt;Let's get to the bottom of what Consistency, Availability, and Partition Tolerance actually mean in production distributed systems.&lt;/p&gt;

&lt;p&gt;Hey folks, I’m back. I haven’t posted in a while, I know, I know. Between work, life, and the usual "is anyone even reading this?" doubt spiral, I let this blog collect dust. But I'm back with a new approach.&lt;/p&gt;

&lt;p&gt;Instead of sporadic posts on random topics, I’m committing to deep-dive series on fundamental concepts that I wish I understood better earlier in my career. Technical content that goes beyond the interview answer and into the real production trade-offs.&lt;/p&gt;

&lt;p&gt;This four-part series on CAP theorem is the first of many deep dives I have planned. If you’ve ever felt like you “get” distributed systems in theory but struggle to apply that knowledge in practice, this series is for you.&lt;/p&gt;

&lt;p&gt;Thanks for sticking around (or for just discovering this blog). Let’s learn together.&lt;/p&gt;

&lt;p&gt;Get the full 4-part series straight to your inbox.&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;This is Part 1 of a 4-part series on CAP theorem and distributed systems trade-offs. Part 2 drops next week.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you’ve interviewed for mid or senior-level engineering roles within the last decade, you’re sure to have been asked a systems design question that ultimately leads to some version of the CAP theorem. And you probably answered something along the lines of “you can only pick two of three”. You were right…kinda.&lt;/p&gt;

&lt;p&gt;While it’s technically correct, the thought behind it is grotesquely oversimplified. That oversimplification comes back to bite you when you’re debugging why your distributed data store went down.&lt;/p&gt;

&lt;p&gt;My goal is to demystify what the CAP theorem tells us and explain how production systems get around this “pick two” crutch.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is The CAP Theorem?
&lt;/h2&gt;

&lt;p&gt;Let’s go back to the turn of the century. In 2000, &lt;a href="https://www2.eecs.berkeley.edu/Faculty/Homepages/brewer.html" rel="noopener noreferrer"&gt;Eric Brewer&lt;/a&gt;, then running Inktomi Corporation, stood at the podium of ACM’s PODC symposium and made a statement that would trouble developers for decades: a distributed data store can guarantee at most two of the following three properties: Consistency, Availability, and Partition Tolerance.&lt;/p&gt;

&lt;p&gt;A couple of years later, Brewer’s theorem was formalized into a proof by Seth Gilbert and Nancy Lynch. The theorem, the idea behind it, everything is true. But somewhere in the last two decades, between all of the whiteboard sessions, the nuance got lost.&lt;/p&gt;

&lt;p&gt;What’s the nuance you ask? Well everyone remembers the “pick two” portion of the conjecture but they almost always leave out the most important context: &lt;strong&gt;you only have to pick two &lt;em&gt;during&lt;/em&gt; a network partition&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When the network is healthy, you can have all three. The theorem provides guidance only when things go awry—when you start losing network packets, or switches fail, or when a rodent chews through a fiber line (this has actually happened at Amazon).&lt;/p&gt;




&lt;h2&gt;
  
  
  CAP Theorem: Consistency, Availability, and Partition Tolerance Defined
&lt;/h2&gt;

&lt;p&gt;Let’s be super precise about what C, A, and P actually are and what they mean, because the casual definitions lack a bit of nuance.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Is Consistency in CAP Theorem? (Linearizability)
&lt;/h3&gt;

&lt;p&gt;When the CAP theorem talks about consistency, it’s talking about the highest form of consistency, &lt;em&gt;atomic consistency&lt;/em&gt;. If you’re coming from a database background, you’ve heard of weak and strong consistency. Atomic consistency is just a step above strong consistency. Atomic consistency is also known as &lt;strong&gt;&lt;em&gt;linearizability&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Linearizability is when all operations execute &lt;em&gt;atomically&lt;/em&gt; in some order with respect to real-time. For example, if operation A completes before operation B begins, then operation B should logically take effect after operation A.&lt;/p&gt;

&lt;p&gt;Let’s use a real world example to help illustrate this idea. Imagine you’re checking your bank account balance and you see $5000. Let’s say you transfer $1000 out of the account. Now you refresh your balance again. Linearizability &lt;strong&gt;&lt;em&gt;guarantees&lt;/em&gt;&lt;/strong&gt; $4000 (or an error), but never $5000 again. The transfer either happened or it didn’t. There’s no situation in which you will see the old value after the new value has been written.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev" rel="noopener noreferrer"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0a6alazp6isbld22oi2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0a6alazp6isbld22oi2.png" width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;a lot&lt;/strong&gt; stronger than “all nodes eventually see the same data” — this is called &lt;em&gt;eventual consistency&lt;/em&gt; and it &lt;strong&gt;is NOT&lt;/strong&gt; what the CAP theorem is talking about.&lt;/p&gt;
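&lt;p&gt;The difference is easy to demonstrate. Here’s a toy model of a lazily replicated (eventually consistent) store, using the bank-balance scenario above (purely illustrative; the class and method names are made up):&lt;/p&gt;

```python
# An eventually consistent account: the primary acknowledges the
# write immediately, while a replica lags until a sync runs.
# A linearizable system would never let a read observe the old
# balance after the transfer completed.

class Replicated:
    def __init__(self, value):
        self.primary = value
        self.replica = value        # lags behind until sync()
    def write(self, value):
        self.primary = value        # acknowledged before replicating
    def sync(self):
        self.replica = self.primary
    def read_replica(self):
        return self.replica

acct = Replicated(5000)
acct.write(5000 - 1000)             # transfer "completes"
stale = acct.read_replica()         # replica hasn't synced yet
assert stale == 5000                # old balance: NOT linearizable
acct.sync()
assert acct.read_replica() == 4000  # eventually consistent
```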

&lt;p&gt;Why is this important? Because plenty of systems provide weaker consistency models, such as monotonic reads, read-your-writes, and causal consistency, and are extremely useful while doing so. All of those models are weaker than linearizability, which means systems can remain &lt;em&gt;highly available&lt;/em&gt; while providing them. The CAP theorem doesn’t constrain systems using these models, only linearizability.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Is Availability in CAP Theorem?
&lt;/h3&gt;

&lt;p&gt;CAP’s definition of availability is also very precise: “every request to a non-failing node &lt;em&gt;must&lt;/em&gt; receive a response”.&lt;/p&gt;

&lt;p&gt;Notice that this doesn’t say any of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;That the response has to be fast.&lt;/li&gt;
&lt;li&gt;That the response has to be correct.&lt;/li&gt;
&lt;li&gt;That &lt;em&gt;most&lt;/em&gt; nodes have to respond.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, if you have a 10-node cluster and 4 of those nodes are partitioned away, those 4 nodes &lt;strong&gt;must&lt;/strong&gt; still respond to requests to be considered “available” under CAP’s definition. In this case a 4xx error or a stale read still counts as “available”.&lt;/p&gt;

&lt;p&gt;Here’s a real world example: during a network partition, a MongoDB primary that can’t reach a majority of its replica set will “step down” and refuse writes altogether. These write requests will time out. Under the CAP theorem that is &lt;strong&gt;unavailability&lt;/strong&gt;, even though the MongoDB process is running as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82g52iou9amboj4edgd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82g52iou9amboj4edgd3.png" width="800" height="68"&gt;&lt;/a&gt;&lt;br&gt;Even the isolated nodes (shown in red) must keep responding to requests to be "available" under CAP's definition. If they timeout or refuse to answer because they can't reach the majority, that's unavailability—even though they're still running and the cluster is "mostly up."
    &lt;/p&gt;

&lt;h3&gt;
  
  
  What Is Partition Tolerance? (And Why You Can't Avoid It)
&lt;/h3&gt;

&lt;p&gt;This is where the “pick two” framing stems from.&lt;/p&gt;

&lt;p&gt;Partition tolerance means that your system continues to work as advertised even when the network splits your nodes into groups that can’t communicate. Guess what? This isn’t up to you.&lt;/p&gt;

&lt;p&gt;Network partitions are a fact of life in distributed systems. The internet we rely on daily and the data centers that serve our applications all have this property. Switches fail, fibers get cut, cloud providers have availability zone outages, packets get dropped, and even within a data center there are transient network issues. It’s unavoidable.&lt;/p&gt;

&lt;p&gt;The reality is, &lt;strong&gt;partition tolerance is absolutely mandatory&lt;/strong&gt;. You’re not picking between C, A, and P. You’re really picking between C and A when P inevitably occurs.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Trade-off
&lt;/h2&gt;

&lt;p&gt;Okay, let’s observe what happens when the network eventually fails.&lt;/p&gt;

&lt;p&gt;Before doing so, let’s rephrase the CAP theorem and its choices given that P (Partition Tolerance) is absolutely mandatory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt; refuse requests that might return stale data (sacrificing availability)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Availability&lt;/strong&gt;: answer all requests even if data is stale (sacrificing consistency)&lt;/li&gt;
&lt;/ul&gt;
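&lt;p&gt;The two stances can be boiled down to a few lines. Here’s a toy request handler showing what each choice does during a partition (purely illustrative; the function and parameter names are made up):&lt;/p&gt;

```python
# During a partition, a node either refuses requests it can't
# guarantee are fresh (CP) or answers anyway, possibly with
# stale data (AP). When the network is healthy, both behave
# identically.

def handle_read(value, partition, mode):
    if mode == "CP" and partition:
        raise TimeoutError("refusing: cannot guarantee freshness")
    return value  # AP: always answer, even if it might be stale

assert handle_read(42, partition=False, mode="CP") == 42  # healthy: both fine
assert handle_read(42, partition=True, mode="AP") == 42   # AP still answers
try:
    handle_read(42, partition=True, mode="CP")            # CP refuses
except TimeoutError:
    pass
```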

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fii8wsim96qevmz0kws01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fii8wsim96qevmz0kws01.png" width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

This diagram helps to visualize the choices available at a network partition.



&lt;p&gt;Once again, take note of the precondition: “&lt;em&gt;when&lt;/em&gt; the network fails”. When there’s no partition, you can have both consistency and availability. Your bank’s mobile app shows you accurate, up-to-date balances while staying highly available, because most of the time the network works fine. You check your balance, transfer money, and see the updated amount immediately.&lt;/p&gt;

&lt;p&gt;The question the CAP theorem forces you to address is: &lt;strong&gt;What should your system do when the network eventually fails?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vosmhbkjgv3pig8i6vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vosmhbkjgv3pig8i6vg.png" alt="The core trade-off: During a partition, CP systems protect you from seeing wrong data by becoming unavailable. AP systems protect you from downtime by potentially serving stale data. Neither is good or bad, it all depends on what is right for your system." width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;
The core trade-off: During a partition, CP systems protect you from seeing wrong data by becoming unavailable. AP systems protect you from downtime by potentially serving stale data. Neither is good or bad, it all depends on what is right for your system.



&lt;h3&gt;
  
  
  CP Systems: Choosing Consistency Over Availability
&lt;/h3&gt;

&lt;p&gt;CP systems decide that during a network partition, they’ll refuse requests rather than risk returning outdated data.&lt;/p&gt;

&lt;p&gt;So when does this make sense?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Banks&lt;/strong&gt;: If a user makes a transaction, is it preferable to be down for a little bit or to risk the transaction occurring twice? Probably the former, right? We absolutely do not want to tell the customer that the transaction succeeded when it hasn’t. We’d rather be down for like 30 seconds than risk that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inventory Management:&lt;/strong&gt; If we’re not sure whether an item is in stock, is it better to make the service unavailable or to allow the possibility of overselling? We’d probably be okay with unavailability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let’s analyze a real-world system, &lt;strong&gt;etcd&lt;/strong&gt;. From their own website: “etcd is a strongly consistent, distributed key-value store…It gracefully handles leader elections during network partitions and can tolerate machine failure, even in the leader node.”&lt;/p&gt;

&lt;p&gt;Kubernetes uses etcd as its backing store for all cluster configurations. If you’ve used Kubernetes before, you’ve seen &lt;code&gt;kubectl&lt;/code&gt; command failures. One reason: when etcd can’t reach a majority of its nodes, it sacrifices availability (those &lt;code&gt;kubectl&lt;/code&gt; failures) for linearizability.&lt;/p&gt;

&lt;h3&gt;
  
  
  AP Systems: Choosing Availability Over Consistency
&lt;/h3&gt;

&lt;p&gt;AP systems decide that during a network partition, they’ll risk serving stale data rather than deny requests.&lt;/p&gt;

&lt;p&gt;A few systems this applies to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Social Media Feeds:&lt;/strong&gt; Users would much rather the app serve old or stale content than be completely unavailable. Besides, does it really matter if a user doesn’t see a reaction to their post for a couple of seconds? Not really. The user experience of the app being down is far worse than reaction tallies being off.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shopping Carts:&lt;/strong&gt; A very famous example of this is Amazon’s Dynamo (the ancestor of DynamoDB), which started off addressing their shopping cart problem; it’s an “always on” experience. You can always add to cart. It doesn’t matter if someone else is fiddling with your cart (Amazon will resolve the changes), you will always be able to “add to cart”.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a real-world AP system you don’t have to think very hard: the purest AP system is none other than DNS itself. DNS, the Domain Name System, is the perfect example of an AP system, and we all use it hundreds of times a day without even realizing it.&lt;/p&gt;

&lt;p&gt;When a DNS record is updated, the update doesn’t propagate to every DNS server around the world immediately. Different DNS servers around the world may give an outdated DNS entry, &lt;strong&gt;however&lt;/strong&gt;, DNS is never “unavailable”. It will always give you a response, even if that response is stale. This is exactly why DNS updates take time and why some sites appear unavailable to a few people when going through updates. In the case of DNS, eventual consistency is absolutely better than downtime, that would kill the internet.&lt;/p&gt;
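&lt;p&gt;DNS’s stance can be sketched as a cache that prefers a stale answer over no answer at all. Here’s a simplified resolver model (a sketch only; real resolvers also honor TTLs, retries, and negative caching):&lt;/p&gt;

```python
# A resolver cache in the spirit of DNS: if the authoritative
# source is unreachable, serve the last known (possibly stale)
# record rather than fail the lookup.

class Resolver:
    def __init__(self):
        self.cache = {}

    def resolve(self, name, authoritative, reachable=True):
        if reachable:
            self.cache[name] = authoritative[name]
        # Partition: fall back to whatever we last saw.
        return self.cache[name]

auth = {"blog.example.com": "192.0.2.10"}
r = Resolver()
assert r.resolve("blog.example.com", auth) == "192.0.2.10"

auth["blog.example.com"] = "192.0.2.99"          # record updated upstream
answer = r.resolve("blog.example.com", auth, reachable=False)
assert answer == "192.0.2.10"                    # stale, but available
```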




&lt;h2&gt;
  
  
  Next Up: The Hidden Costs of Choosing Consistency
&lt;/h2&gt;

&lt;p&gt;In the next part of this series, we'll dive deep into CP systems—databases and services that prioritize consistency over availability during network failures. We’ll take a look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How these CP systems implement strong consistency.&lt;/li&gt;
&lt;li&gt;The cost of coordination.&lt;/li&gt;
&lt;li&gt;When choosing consistency over availability makes sense.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The “pick two” framing gives the illusion of being stuck with your choice for the entire lifetime of the system, when in fact the reality is a lot more flexible.&lt;/p&gt;

&lt;p&gt;Subscribe to get Part 2 when it drops.&lt;/p&gt;

&lt;p&gt;Thanks again for sticking around. If you enjoyed this post, please &lt;a href="https://www.blog.ahmazin.dev/" rel="noopener noreferrer"&gt;subscribe&lt;/a&gt;, it gives me the confidence to keep going.&lt;/p&gt;

&lt;p&gt;Ali&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.researchgate.net/publication/221343719_Towards_robust_distributed_systems" rel="noopener noreferrer"&gt;Eric Brewer’s 2000 PODC Presentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf" rel="noopener noreferrer"&gt;Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services by Nancy Lynch and Seth Gilbert&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://etcd.io/" rel="noopener noreferrer"&gt;etcd&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>distributedsystems</category>
      <category>systemdesign</category>
      <category>database</category>
      <category>backend</category>
    </item>
    <item>
      <title>Computer Networks 101 (#2)</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Tue, 17 Sep 2024 16:01:52 +0000</pubDate>
      <link>https://dev.to/amalik18/computer-networks-101-2-4ojf</link>
      <guid>https://dev.to/amalik18/computer-networks-101-2-4ojf</guid>
      <description>&lt;p&gt;Hey folks, I’m back with this weeks post. This week we’ll be taking a deep dive in to Computer Networks. More specifically, I’ll do a deep dive in to the &lt;a href="https://en.wikipedia.org/wiki/OSI_model" rel="noopener noreferrer"&gt;OSI model&lt;/a&gt; and comparing it to the &lt;a href="https://en.wikipedia.org/wiki/Internet_protocol_suite" rel="noopener noreferrer"&gt;TCP/IP model&lt;/a&gt;. My goal is to provide a comprehensive explanation for each layer of the model while also providing concrete examples to help visualize the entire process.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I am not an artist so I’ll be leveraging graphics from various sources. The references will be in each image’s caption.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Before we get started, I'd like to thank everyone for tuning in. A special shout-out to all of our subscribers—in just one short week, we're up to 11 subscribers! I appreciate each and every one of you for your support. Now, on to the regularly scheduled program…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/subscribe?" rel="noopener noreferrer"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Computer networks are crucial to understand if you’re a software developer or work in the technology space at all. They are the backbone of everything we do online, whether it’s sending an email, posting on social media, or watching Netflix. At the heart of it, a computer network allows devices to communicate and share resources, enabling the seamless exchange of information across vast distances.&lt;/p&gt;

&lt;p&gt;To understand computer networks, there are two popular models that provide insight into how networks operate: the &lt;strong&gt;OSI Model&lt;/strong&gt; and the &lt;strong&gt;TCP/IP Model&lt;/strong&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is the OSI Model?
&lt;/h1&gt;

&lt;p&gt;The Open Systems Interconnection (OSI) model is a conceptual framework created by the International Organization for Standardization (ISO) that details how data travels over a network. This is done by outlining seven distinct layers. It’s become the primary mode of teaching and discussing the various networking processes. By the way, OSI being created by ISO…the irony.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is the TCP/IP Model?
&lt;/h1&gt;

&lt;p&gt;In contrast to the OSI model, the &lt;strong&gt;TCP/IP Model&lt;/strong&gt; is a more practical representation of how devices connect and communicate across networks. It's the foundation of the internet and is a bit simpler than the OSI model, with fewer layers to remember.&lt;/p&gt;

&lt;p&gt;Below is a graphic (courtesy of ByteByteGo) that showcases the differences between the layers of the OSI model and the TCP/IP model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ku2zps9eh12vz5er9ib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ku2zps9eh12vz5er9ib.png" width="800" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fig1: Taken from ByteByteGo&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Layers
&lt;/h1&gt;

&lt;p&gt;Let's discuss each of the layers of the two models. As shown in Fig1, the TCP/IP model condenses the OSI model's Application, Presentation, and Session layers into one layer labeled &lt;strong&gt;Application&lt;/strong&gt;. Keep this in mind as we delve into each layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Application Layer (L7)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkl7u4mvasjzy8fgmjlj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffkl7u4mvasjzy8fgmjlj.png" width="800" height="777"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The application layer is the closest layer to the user. It’s responsible for delivering data to applications and providing network services directly to them. This layer dictates how users interact with the network itself.&lt;/p&gt;

&lt;p&gt;Common protocols that operate at this layer are HTTP/S, FTP (File Transfer Protocol), SMTP (Simple Mail Transfer Protocol), and DNS (Domain Name System).&lt;/p&gt;

&lt;p&gt;An example interaction with this layer could look something like typing in the URL of your favorite website. The HTTP/S protocol (which runs at the application layer) is responsible for fetching the web page from the web server.&lt;/p&gt;
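&lt;p&gt;To make that HTTP example concrete, here’s a minimal, self-contained Python sketch: a tiny HTTP server and client, both on localhost, exchange an application-layer request and response. The handler class and response body are made up purely for illustration.&lt;/p&gt;

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical handler for illustration: replies to every GET with a fixed body.
class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello from L7"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example quiet
        pass

# Bind to an ephemeral port on loopback and serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), Hello)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The "browser" side: an application-layer HTTP GET for a URL.
url = "http://127.0.0.1:%d/" % server.server_port
page = urllib.request.urlopen(url).read()

server.shutdown()
server.server_close()
print(page)
```

&lt;p&gt;Everything below this exchange (connections, routing, framing, bits on the wire) is handled by the lower layers without the application having to think about it.&lt;/p&gt;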

&lt;p&gt;In the TCP/IP model, the Application Layer encompasses not only the OSI's Application Layer but also the Presentation and Session layers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Presentation Layer (L6)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2m1xux2ab0vh3iy6g4d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2m1xux2ab0vh3iy6g4d.png" width="800" height="752"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The presentation layer is responsible for ensuring that data sent from the application layer of one system is readable by the application layer of another system. Simply, this layer handles the serialization of data. This layer is also responsible for handling data translation, encryption, and compression. Typically, it is here that data encryption/encoding and decryption/decoding take place.&lt;/p&gt;

&lt;p&gt;Common protocols that operate at this layer are Secure Sockets Layer (SSL) and its successor, Transport Layer Security (TLS).&lt;/p&gt;

&lt;p&gt;An example of this layer at work: when you visit a secure website (HTTPS), the presentation layer encrypts and decrypts the data with SSL/TLS.&lt;/p&gt;
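&lt;p&gt;The presentation layer’s serialization and compression duties can be sketched in a few lines of Python. This uses JSON and zlib as stand-ins (encryption via TLS would sit at this same layer); the function names are mine, not any standard API:&lt;/p&gt;

```python
import json
import zlib

def encode(obj):
    # Sending side: translate an in-memory object into a wire format
    # (serialization), then compress it.
    serialized = json.dumps(obj).encode("utf-8")
    return zlib.compress(serialized)

def decode(payload):
    # Receiving side: reverse both steps so the peer's application layer
    # sees the same object.
    return json.loads(zlib.decompress(payload).decode("utf-8"))

message = {"user": "ali", "action": "subscribe"}
wire = encode(message)
print(decode(wire))  # round-trips back to the original object
```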

&lt;p&gt;&lt;em&gt;Note: In the TCP/IP model, the functionality of the Presentation Layer is included within the Application Layer.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Session Layer (L5)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjyg9emlk3hyuw9ydflm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpjyg9emlk3hyuw9ydflm.png" width="800" height="562"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The session layer is responsible for establishing, managing, and terminating sessions between two devices. A session is defined as a time-limited, two-way link between two or more communicating devices. This layer also manages data streams for each application/service, ensuring that they do not interfere with each other. The session layer can also handle dialog control (essentially, whether communication is one-way or two-way) and synchronization between two communicating systems. To picture synchronization, imagine a video stream where the audio and video are lagging; the session layer may issue a re-synchronization request to sync the two up.&lt;/p&gt;

&lt;p&gt;Common protocols at this layer are Remote Procedure Call (RPC), Session Control Protocol (SCP), and Password Authentication Protocol (PAP).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: In the TCP/IP model, session management is generally handled by the Transport Layer.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Transport Layer (L4)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn275ow2ltgre9ojt50xr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn275ow2ltgre9ojt50xr.png" width="800" height="739"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The transport layer is responsible for… the transportation of data between the appropriate applications on different devices. This layer supports reliable data delivery (TCP) or fast, best-effort delivery (UDP). It can break chunks of data into smaller protocol data units called &lt;strong&gt;segments&lt;/strong&gt; and then reassemble them at the destination host. This layer also deals with reliability, flow control, and error handling.&lt;/p&gt;

&lt;p&gt;The most famous protocols at this layer are the Transmission Control Protocol (TCP) and User Datagram Protocol (UDP). TCP is a reliable, ordered, and connection-oriented protocol that ensures a connection is established prior to delivery. UDP is a less reliable, faster, connection-less protocol that favors speed over reliability.&lt;/p&gt;

&lt;p&gt;An example of this layer at work can be anything from sending an email with a large attachment to video streaming. When sending an email with a large attachment, TCP ensures that the entirety of the email and its content arrives at the destination in an ordered fashion. Whereas when you’re gaming, you might be using UDP, which favors speed of transmission over reliability of data delivery.&lt;/p&gt;
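&lt;p&gt;To contrast the two, here’s a small Python sketch that performs a loopback round trip over TCP (connection established first, ordered byte stream) and over UDP (independent datagrams, no handshake). The helper names are invented for this example, and loopback just happens to make UDP look reliable:&lt;/p&gt;

```python
import socket
import threading

def tcp_round_trip(payload):
    # TCP: a connection is established before any data flows, and the
    # byte stream arrives reliably and in order.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()
        conn.sendall(conn.recv(1024))  # echo the segment back
        conn.close()

    t = threading.Thread(target=serve)
    t.start()
    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))  # the handshake happens here
    client.sendall(payload)
    echoed = client.recv(1024)
    client.close()
    t.join()
    server.close()
    return echoed

def udp_round_trip(payload):
    # UDP: no handshake; each sendto() is an independent datagram with
    # no delivery or ordering guarantee.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    port = server.getsockname()[1]
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client.sendto(payload, ("127.0.0.1", port))
    data, _ = server.recvfrom(1024)
    client.close()
    server.close()
    return data

print(tcp_round_trip(b"hello over TCP"))
print(udp_round_trip(b"hello over UDP"))
```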

&lt;p&gt;This layer is almost identical to the TCP/IP model’s Transport Layer. However, the TCP/IP version of this layer has additional responsibilities which are covered by the Session Layer in the OSI model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Network Layer (L3)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda75rg573nf5vzkvhcs3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fda75rg573nf5vzkvhcs3.png" width="800" height="528"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The network layer is responsible for routing data between different networks. If the Transport Layer is the vehicle, the Network Layer is the GPS. It’s responsible for packet forwarding, using a host’s IP address to determine the best path for data delivery from source to destination. This layer supports connection-less communication, host addressing, and message forwarding. The protocol data unit at this layer is referred to as a &lt;strong&gt;packet&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The most well-known protocol that operates at this layer is the Internet Protocol (IP); for example, IPv4 and IPv6 are implementations of this protocol. Additionally, Internet Protocol Security (IPsec), Routing Information Protocol (RIP), and Internet Control Message Protocol (ICMP) also operate at this layer.&lt;/p&gt;

&lt;p&gt;An example of this layer at work: when you visit a website, the network layer uses the destination’s IP address to route your request to the web server.&lt;/p&gt;
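&lt;p&gt;The path-selection idea can be illustrated with a toy routing table in Python using longest-prefix match, which is how routers pick the most specific route for a destination IP. The routes and interface names here are invented for illustration; real routers do this lookup in hardware:&lt;/p&gt;

```python
import ipaddress

# Hypothetical routing table: network prefix -> outgoing interface.
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): "eth1",
    ipaddress.ip_network("10.1.0.0/16"): "eth2",
    ipaddress.ip_network("0.0.0.0/0"): "eth0",  # default route
}

def next_hop(dst):
    # Longest-prefix match: of all networks containing the destination,
    # the most specific (longest prefix) wins.
    addr = ipaddress.ip_address(dst)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]

print(next_hop("10.1.2.3"))   # matched by the more specific 10.1.0.0/16
print(next_hop("8.8.8.8"))    # falls through to the default route
```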

&lt;p&gt;In the TCP/IP model this layer is referred to as the Internet Layer. The two are &lt;strong&gt;not&lt;/strong&gt; synonymous: the TCP/IP model’s Internet Layer is only a subset of the broader Network Layer, since it describes only one type of network architecture… the Internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Link Layer (L2)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwaoif6qv3kajwu1dqcht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwaoif6qv3kajwu1dqcht.png" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data link layer is responsible for node-to-node communication on the same network. It provides the functional and procedural means to transfer data between network entities. It also provides error detection and, possibly, correction. This layer is only concerned with local delivery of &lt;strong&gt;frames&lt;/strong&gt; (the protocol data unit at this layer), meaning delivery within the same network segment. In other words, frames do not cross the boundaries of the local area network.&lt;/p&gt;

&lt;p&gt;This layer is commonly divided into two sub-layers: Logical Link Control (LLC) and Media Access Control (MAC). The LLC sub-layer multiplexes the protocols running at the Data Link layer and can optionally provide flow control, error notifications, and acknowledgements. The MAC sub-layer controls who can access the media link at any one time.&lt;/p&gt;

&lt;p&gt;Ethernet and WiFi are the most popular protocols that operate at this layer. At the LLC sub-layer we have mechanisms like Forward Error Correction (FEC) and Automatic Repeat reQuest (ARQ); common ARQ schemes include Go-Back-N, Stop-and-Wait, and Selective Repeat. At the MAC sub-layer we have MAC addressing, access methods like Carrier-Sense Multiple Access with Collision Detection/Avoidance (CSMA/CD and CSMA/CA), Spanning Tree Protocol (STP), and Virtual LANs (VLANs).&lt;/p&gt;
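&lt;p&gt;Frame-level error detection can be sketched with a CRC-32 checksum playing the role of an Ethernet frame’s Frame Check Sequence. This is a simplified illustration with made-up helper names; real Ethernet hardware computes and verifies the FCS for you:&lt;/p&gt;

```python
import zlib

def make_frame(payload):
    # Append a 4-byte CRC-32 checksum, standing in for the FCS.
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check_frame(frame):
    # Recompute the checksum over the payload and compare it to the
    # trailer; a mismatch means the frame was corrupted in transit.
    payload, fcs = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == fcs

frame = make_frame(b"some frame payload")
print(check_frame(frame))        # an intact frame passes the check
corrupted = b"X" + frame[1:]     # flip the first byte in transit
print(check_frame(corrupted))    # the corruption is detected
```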

&lt;p&gt;In the TCP/IP model the functionality of the Data Link layer is contained in the bottom most layer, the Link Layer. In addition to the responsibilities of the Data Link Layer, the TCP/IP model’s Link layer also takes on additional responsibilities similar to the Physical Layer of the OSI model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Physical Layer (L1)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr215hw2pgc126gec9oc3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr215hw2pgc126gec9oc3.png" width="800" height="668"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The last (or first) layer of the OSI model is the Physical Layer. As the lowest layer of the model, it’s most closely associated with the physical connection between devices. This layer covers all things electrical and mechanical regarding the connection medium.&lt;/p&gt;

&lt;p&gt;This layer defines how we transfer a stream of bits over the physical link connecting two or more nodes. You’ll find Ethernet transceivers at this layer, USB, and the physical part of WiFi (the actual radio frequency (RF) portion of things).&lt;/p&gt;

&lt;p&gt;As mentioned above, the physical layer and data link layer are combined in the TCP/IP model. The TCP/IP model’s Link Layer handles the responsibilities of both the OSI model’s Data Link and Physical Layers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Let’s see what the above looks like altogether.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2npa0a19bqxe15r2q3b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2npa0a19bqxe15r2q3b.png" width="800" height="993"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Subscribe to Professional Imposter Syndrome
&lt;/h4&gt;

&lt;p&gt;This publication is supported entirely by you, the readers. Please subscribe to support my work and to receive my next publication.&lt;/p&gt;

&lt;p&gt;I hope this deep dive into the OSI and TCP/IP models has provided a clearer understanding of how data travels across networks. Whether you're sending an email, streaming a movie, or browsing the web, these layers work together to make it happen seamlessly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stay tuned for next week's post, where we'll explore more exciting topics in the world of technology!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Catch you all next week.&lt;/p&gt;

&lt;p&gt;Ali&lt;/p&gt;

&lt;p&gt;Thanks for reading Professional Imposter Syndrome! This post is public so feel free to share it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/computer-networks-101?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/computer-networks-101/comments" rel="noopener noreferrer"&gt;Leave a comment&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/survey/1034779?token=" rel="noopener noreferrer"&gt;Suggest future topics!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>computerscience</category>
      <category>networking</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Life At Amazon (#1)</title>
      <dc:creator>Ali Malik</dc:creator>
      <pubDate>Mon, 09 Sep 2024 16:00:56 +0000</pubDate>
      <link>https://dev.to/amalik18/life-at-amazon-1-37ci</link>
      <guid>https://dev.to/amalik18/life-at-amazon-1-37ci</guid>
      <description>&lt;p&gt;Hey folks I’m back with this weeks post. Apologies that it’s a bit late. Lets just dive in. We’ll take a look at how I got to Amazon and my career thus far at Amazon.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/subscribe?" rel="noopener noreferrer"&gt;Subscribe now&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21aExL%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F74640d65-d51b-47d8-a49d-1349d427101b_474x266.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21aExL%21%2Cw_1456%2Cc_limit%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F74640d65-d51b-47d8-a49d-1349d427101b_474x266.jpeg" title="Amazon Web Services Wallpapers - Top Free Amazon Web Services ..." alt="Amazon Web Services Wallpapers - Top Free Amazon Web Services ..." width="474" height="266"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;I graduated college with a BS in Computer Engineering with a focus in Computer Networks. I joined Amazon straight out of college in 2019. I joined as a Systems Analyst (a role that no longer exists) on a team tasked with staffing enough engineers to help launch the AWS Top Secret Cloud.&lt;/p&gt;

&lt;p&gt;One of the perks of this team was that it allowed me to rotate across various AWS teams to explore different roles and find what I truly wanted to do. This gave me exposure to teams ranging from Networking to S3, and even the Groundstation team.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Early Struggles: Burnout and Overwork
&lt;/h2&gt;

&lt;p&gt;The first few months were challenging; I really struggled with the delineation of work life and personal life. I was working from 9am to 6pm during the week and putting in hours over the weekend. In my mind this was how I was going to get ahead: by putting in more hours than anyone else. As you can imagine, this started to slowly burn me out. I was finding myself less motivated in my day-to-day because all I could focus on was what I still didn’t know. I wanted to learn everything I could as fast as I could learn it.&lt;/p&gt;

&lt;p&gt;This approach was completely unsustainable. As ridiculous as it sounds, in retrospect it was worth it in the end. This hustle in the beginning allowed me to really find my footing and be useful to the team(s) that I was working with. A colleague gave me the best advice possible at that time: “There’s enough work here to drown. There will always be work.” This seemingly obvious piece of advice really helped me level-set my goals. It allowed me to focus more on the work I was tasked with and less on the many things I didn’t know. It taught me the importance of pacing yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Finding My Footing: Building Disconnected Edge Solutions
&lt;/h2&gt;

&lt;p&gt;Remember I said the team I was on allowed mobility until I found something I liked? Well, I found what I liked. I joined a team building the next generation of Disconnected Edge Solutions. At the time AWS already had an Edge Solution offering, the Snow Family of products, but the disconnected edge space was largely untapped by AWS; that’s where this team came in. This team is where I built most of my technical expertise. I was exposed to everything from generating customer interest all the way to physically building the solution, on top of helping create the software backend to support customer needs.&lt;/p&gt;

&lt;p&gt;The work I did on this team built the very foundation of my technical know-how. We solved some very challenging problems while trying to get this product out the door. We had on-device WiFi, computing power at the edge, storage, etc. The work I did on this product introduced me to embedded programming, low-level kernel modifications, and the intersection of software and hardware. I stayed on this team until its inevitable liquidation (this was last year).&lt;/p&gt;




&lt;h2&gt;
  
  
  Where Am I Now?
&lt;/h2&gt;

&lt;p&gt;I am now on a team that provides network connectivity to customers. Think AWS Direct Connect (DX); I work on a product very similar to that. That’s what I’ve been doing for the last year or so: helping productionize the system and gradually deliver it to customers.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I’ve Learned
&lt;/h2&gt;

&lt;p&gt;Reflecting on my journey, I’ve realized a few key lessons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pace Yourself&lt;/strong&gt; : Early on, I fell into the trap of thinking more hours meant faster progress. It’s important to work hard but equally important to avoid burnout. Find a sustainable work-life balance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Find Your Niche&lt;/strong&gt; : Rotating through different teams gave me the chance to discover what I’m passionate about. Take advantage of similar opportunities to explore your interests and strengths.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Learn Continuously&lt;/strong&gt; : The technical challenges I faced—whether embedded programming or building edge solutions—were opportunities to grow. Never shy away from learning new things, even if they seem daunting at first.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Let’s Connect
&lt;/h2&gt;

&lt;p&gt;That’s a quick overview of my career at AWS so far. If you have any questions about Amazon, the interview process, or my experience, feel free to reach out. Also, let me know if you enjoyed this post and if there are any topics you’d like me to cover in future posts.&lt;/p&gt;

&lt;p&gt;Thanks for reading, and I’ll see you next week!&lt;/p&gt;

&lt;p&gt;Ali&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ahmazin.substack.com/survey/724227?token=" rel="noopener noreferrer"&gt;Which Topics Should I Cover?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Professional Imposter Syndrome is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.blog.ahmazin.dev/p/life-at-amazon?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://ahmazin.substack.com/?utm_source=substack&amp;amp;utm_medium=email&amp;amp;utm_content=share&amp;amp;action=share" rel="noopener noreferrer"&gt;Share Professional Imposter Syndrome&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>career</category>
      <category>devjournal</category>
      <category>interview</category>
    </item>
  </channel>
</rss>
