DEV Community: Lukas Niessen

Idempotence in System Design: Full example

Lukas Niessen — Mon, 20 Oct 2025 14:55:00 +0000

Idempotency is a concept frequently mentioned in system design. I will explain what it means in simple terms, briefly address common misunderstandings, and finish with a full example.

What Is It?

Something is idempotent if doing it once or multiple times gives the same outcome.

In other words, if I do that something once, I get the same result as when I do it 2 times or 3 times or 10 times. Let's look at the standard example: we have an on button and an off button. Pressing either of them is an idempotent operation. If you press on once, the machine is on. If you then press it again, and again and again, nothing changes. The machine stays on. Same for the off button.

Here's an example from programming:

def hide_my_button(self):
  self.show_my_button = False

This is clearly idempotent.

def toggle_my_button_visibility(self):
  self.show_my_button = not self.show_my_button

This is of course not idempotent.

It's not about the return value!

This is a common misunderstanding. One could implement the hide function from above like this as well:

def hide_my_button(self):
  has_something_changed = self.show_my_button
  self.show_my_button = False
  return has_something_changed

So we return whether something was changed or not. If we call this multiple times, the returned value might differ! But it's still idempotent, because idempotency is about the effect on the state, not about the value returned to the caller (or, for an API, the response status code received by the client).

Idempotent vs Pure

Although pure is not a topic of this article, I still want to address this quickly because it's a common source of confusion.

A function or operation is pure if, given the same input, it always produces the same output.

def square(my_number):
  return my_number ** 2

This is a pure function. square(3) will always be the same number.

def square_with_randomness(my_number):
  return (my_number ** 2) * random.uniform(0, 1)

This is not a pure function. square_with_randomness(3) will almost always give a different outcome since we multiply the result by a random number between 0 and 1. Likewise, if we multiplied it by some global variable or some class variable, it would no longer be pure. The global variable can change and then our outcome would be different.

Okay, let's look at def square(my_number) again. It's pure. But is it idempotent? Of course not. Apply it once to 2 and we get 4. Apply it again and we get 16. So a different number!

It's also easy to find an example of an idempotent operation that is not pure. So the two concepts are totally different things!
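To make the contrast concrete, here is a minimal sketch. The Button class is a hypothetical wrapper around the hide_my_button example from earlier, and abs is used as an illustration of a function that happens to be both pure and idempotent.

```python
class Button:
    def __init__(self):
        self.show_my_button = True

    def hide_my_button(self):
        # Mutates object state, so it is not pure.
        # Calling it again leaves the state unchanged, so it is idempotent.
        self.show_my_button = False


def square(my_number):
    # Pure (same input, same output), but not idempotent under repeated application.
    return my_number ** 2


button = Button()
button.hide_my_button()
state_after_one_call = button.show_my_button
button.hide_my_button()
state_after_two_calls = button.show_my_button

squared_once = square(2)
squared_twice = square(square(2))

# abs is both pure and idempotent: applying it twice equals applying it once.
abs_once = abs(-3)
abs_twice = abs(abs(-3))
```

Here the button state is the same after one or two calls, while squaring 2 once and twice gives different numbers.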

Idempotence in System Design

So why is it such an important concept in system design? There are many reasons and we will discuss the most common ones.

Message Processing

Suppose we use event-driven design in our system. Concretely, we have a message queue and a service consuming its messages.

The problem is this. When Service B consumes a message, let's say the message containing Event 3, it processes it, and then writes to our DB. Let's keep it super simple, suppose Service B calculates some complex formula for each event and writes the result to our DB. Now it's very important that nothing here gets lost ever. We have very important data!

But if Service B crashes during the calculation, or there is a network partition between Service B and the DB, or something else happens, then the message and the event are lost forever. Terrible.

The solution is simple: instead of removing the message immediately from the queue, we wait for Service B to be finished, which includes writing to the DB, and then remove the message.

But this introduces a new problem. It's possible that the same message is read twice. For example, Service B performs the calculation and writes to the DB, but then something happens. It crashes for example. So before the message is removed from the queue, the service has crashed. What happens? The service restarts, and once it's up again and running, it will continue consuming messages. And it starts with exactly that last message. So that message gets consumed twice!

How do we solve this? We can't really directly. System design is always about trade-offs: Either we might lose messages or we might consume the message more than once.

But that's not so problematic! If we design the operation of Service B to be idempotent, then nothing happens. The service will consume the message a second time, but it doesn't matter because the operation is idempotent. So the outcome is still the same.

The only downside is a little bit of extra complexity (you need to come up with a way to make the operation idempotent) and a little bit of compute resources (potentially doing the same thing more than once unnecessarily). But usually, and definitely in our case, it's better than losing messages!
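The redelivery scenario can be sketched in a few lines. This is a simplified simulation, not a real queue client: the in-memory deque, the results_db dictionary, the complex_formula placeholder and the crash_before_ack flag are all assumptions made for illustration.

```python
from collections import deque

queue = deque([{"event_id": 3, "value": 7}])
results_db = {}  # event_id -> result; stands in for the real DB


def complex_formula(value):
    # Placeholder for the real calculation.
    return value * value + 1


def process(message):
    # Keyed write: processing the same event again overwrites the
    # entry with the same value, so the operation is idempotent.
    results_db[message["event_id"]] = complex_formula(message["value"])


def consume_once(crash_before_ack=False):
    message = queue[0]       # read the message without removing it
    process(message)
    if crash_before_ack:
        return               # simulated crash: the message stays on the queue
    queue.popleft()          # remove the message only after the DB write


consume_once(crash_before_ack=True)  # first delivery, crash before removal
consume_once()                       # redelivery after the restart
```

The message is processed twice, but the DB ends up with exactly one result for Event 3 and the queue is empty.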

Pitfalls

There are several things to be careful with here. One thing that can happen is an infinite crash loop. If you have an "ill" message (often called a poison message), for example one with an invalid format, that makes your consuming service (Service B) crash, it will stay on the queue. This means Service B restarts just to consume the same ill message and crash again, over and over. Even if your system doesn't have such issues today, they will arise sooner or later, so you really should make use of a so-called dead-letter queue (or message hospital).
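A dead-letter queue can be sketched as an attempt counter plus a second queue. The names below (MAX_ATTEMPTS, handle, drain) and the limit of 3 attempts are assumptions for illustration; managed queues usually offer this as a configuration option rather than code you write yourself.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed limit before a message is parked


def handle(body):
    # Hypothetical handler: rejects anything that is not a dict with an order_id.
    if not isinstance(body, dict) or "order_id" not in body:
        raise ValueError("invalid message format")
    return body["order_id"]


def drain(queue, dead_letters, processed):
    while queue:
        message = queue.popleft()
        try:
            processed.append(handle(message["body"]))
        except Exception:
            message["attempts"] += 1
            if message["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(message)  # park it for manual inspection
            else:
                queue.append(message)         # put it back for another attempt


main_queue = deque([
    {"body": "not-a-valid-order", "attempts": 0},
    {"body": {"order_id": 42}, "attempts": 0},
])
dead_letter_queue = []
processed_orders = []
drain(main_queue, dead_letter_queue, processed_orders)
```

The ill message ends up in the dead-letter queue after three failed attempts, and the healthy message behind it still gets processed.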

Other Uses of Idempotency

We've talked about message processing, but idempotency shows up everywhere in system design. Let me walk you through the other most common ones.

APIs

If you're building REST APIs, you're already dealing with idempotency whether you realize it or not. The HTTP protocol actually defines which methods should be idempotent:

GET requests don't change anything on the server, so they're naturally idempotent. Call them 100 times, same result every time. You can refresh a webpage as many times as you want without worrying.

PUT requests should completely replace a resource. If you PUT the same data twice, you get the same outcome. Think of it like overwriting a file - doing it twice doesn't change anything.

DELETE requests should delete a resource. Delete something that's already gone? It's still gone. No problem.

POST requests are usually not idempotent by design. Each POST typically creates something. But you can make them idempotent with idempotency keys. Here's how it works: you send a unique ID with your request (often in a header), and the server remembers "I already processed this ID, so I'll just return the same result instead of doing the work again".

def create_user(request):
    idempotency_key = request.headers.get('Idempotency-Key')

    # Did we already process this exact request?
    if idempotency_key and already_processed(idempotency_key):
        return get_cached_response(idempotency_key)

    # Nope, create the user
    user = User.create(request.data)

    # Cache the response for next time
    if idempotency_key:
        cache_response(idempotency_key, user)

    return user

Databases

Database operations love being idempotent too. Here is the most common pattern:

UPSERT operations (INSERT or UPDATE if exists) are naturally idempotent. Run an upsert 10 times with the same data, and you get the same result every time. The record either gets created once or updated to the same values multiple times.
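As a small illustration, here is an upsert run ten times against an in-memory SQLite database. The users table and its columns are made up for this sketch, and the ON CONFLICT syntax assumes SQLite 3.24 or newer (other databases spell upserts differently).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

# Insert the row, or update it if a row with this id already exists.
UPSERT = """
INSERT INTO users (id, email) VALUES (?, ?)
ON CONFLICT(id) DO UPDATE SET email = excluded.email
"""

for _ in range(10):
    conn.execute(UPSERT, (1, "alice@example.com"))
conn.commit()

rows = conn.execute("SELECT id, email FROM users").fetchall()
```

After ten runs the table still contains a single row with the same values.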

Distributed Systems

In distributed systems, things fail constantly. Networks partition, services crash, hard drives die, and yes, occasionally cats do walk over keyboards. So we retry operations all the time. But retries only work safely if your operations are idempotent.
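The following sketch shows why. FlakyServer is a hypothetical server that applies every request but loses its first response, which is exactly the situation where a client cannot know whether its request went through and retries.

```python
class FlakyServer:
    """Applies every request, but the first response is lost in transit."""

    def __init__(self):
        self.balance = 100
        self._drop_next_response = True

    def _respond(self):
        if self._drop_next_response:
            self._drop_next_response = False
            raise TimeoutError("response lost in transit")
        return self.balance

    def add(self, amount):
        # Not idempotent: every call changes the state again.
        self.balance += amount
        return self._respond()

    def set_balance(self, value):
        # Idempotent: repeating the call leaves the same state.
        self.balance = value
        return self._respond()


def with_retries(operation, attempts=3):
    for _ in range(attempts):
        try:
            return operation()
        except TimeoutError:
            continue  # we do not know whether the request was applied, so retry
    raise RuntimeError("gave up after retries")


server_a = FlakyServer()
after_add = with_retries(lambda: server_a.add(50))

server_b = FlakyServer()
after_set = with_retries(lambda: server_b.set_balance(150))
```

The retried add is applied twice and leaves the balance at 200 instead of 150, while the retried set_balance ends at 150 as intended.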

Full Example: Order Processing System

Alright, let's put this all together with a concrete example. We have a simple order processing pipeline: orders come from a web app, get validated by an order service, go into a queue, and then get processed by an order processor service that writes to a database.

What We're Building

The system is straightforward:

Web App sends HTTP POST requests with order data
API Gateway handles routing, authentication, and rate limiting
Order Service validates the order and publishes it to the queue
Amazon SQS holds the order messages
Order Processor Service consumes messages and writes to the database
Orders DB stores all our order data
Dead-Letter Queue catches any poison messages
Notification Service sends confirmations to customers

The key here is that the Order Processor Service needs to be idempotent.

Making the Order Processor Service Idempotent

The Order Processor Service consumes messages from SQS and does the actual business logic.

When we process a message, so an order event, we want to:

Check if it was processed already
If not, insert it into our OrdersDB
If not, tell NotificationService to send a notification

This is idempotent because we check if it was processed already. That could be done, for example, with a SELECT in the OrdersDB, inserting only if the order is not there yet. Something similar can be done for the Notification Service, or inside the Notification Service with its own DB.

However, note that we need to deal with concurrency issues. What if two different instances of the Order Processor Service process the same message, and both execute the SELECT at the same time? We would process the message twice, which is not good. So we need to wrap this logic in a transaction, ideally backed by a unique constraint on the order ID so that the second insert fails.

We would end up with something like this:
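The processing steps can be sketched as follows, using an in-memory SQLite database in place of the Orders DB and a plain list in place of the Notification Service. The table layout, the function name and the message fields are assumptions. Note that this variant swaps the SELECT-then-INSERT check for a primary key on the order ID, so the database itself rejects the duplicate even when two instances race.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, payload TEXT NOT NULL)")
sent_notifications = []  # stand-in for the Notification Service


def process_order_message(conn, message):
    """Return True if the order was newly processed, False for a duplicate."""
    try:
        with conn:  # transaction: commits on success, rolls back on an exception
            conn.execute(
                "INSERT INTO orders (order_id, payload) VALUES (?, ?)",
                (message["order_id"], message["payload"]),
            )
    except sqlite3.IntegrityError:
        return False  # the order is already in the DB, so skip it
    sent_notifications.append(message["order_id"])
    return True


message = {"order_id": "order-123", "payload": "2x coffee"}
first = process_order_message(conn, message)
second = process_order_message(conn, message)  # simulated redelivery
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

The redelivered message is recognised as a duplicate, so there is one order row and one notification. In this sketch a crash between the commit and the notification would still skip the notification, so that step needs its own safeguard.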

Another note: to make the system fully resilient, we should also put a queue between the Order Processor Service and the Notification Service and apply the same approach there.

Technical Sales & Presales 101: The very basics

Lukas Niessen — Thu, 21 Aug 2025 18:53:09 +0000

This article is mainly aimed at developers looking to switch into technical sales. So I cover the very basics of this topic.

Lead

A lead is just a potential customer. This can be someone that signed up for a demo, someone in your contacts who you think might be interested in your product, someone who signed up for a free trial etc. There are however different types of leads and I will introduce them now.

To avoid confusion: a lead typically refers to a person. That person is of course usually associated with a company, and that company will hopefully become a customer one day.

Marketing Qualified Lead (MQL)

A lead that meets certain marketing criteria (right job title, company size, industry, engagement with marketing content). Marketing might say: "This person looks like our ICP (ideal customer profile)". This means that person is an MQL.

Sales Accepted Lead (SAL)

This is a lead that sales agrees is worth working on. So the marketing team has a lead and the sales team "agrees it's a good lead". The lead is then considered an SAL.

Sales Qualified Lead (SQL)

This is the stage we are aiming for, because it is where we actually start trying to turn the lead into a customer. A SQL is really just a lead that has shown interest in becoming a customer and that marketing and sales agree is a good fit.

Often, BANT is used to see whether a lead is a good fit. BANT = Budget, Authority, Need, and Timeline. A SQL is often also called a prospect. BANT is just one framework, though; MEDDICC is another important one. Some teams also use no framework at all.

SQLs are so important because they are the group of leads that are most likely to be converted into customers.

For example, imagine you're selling cloud infrastructure services. A SQL might be a CTO from a growing startup who has:

Budget: $50k+ annual cloud budget
Authority: Decision-making power for technical purchases
Need: Their current hosting can't handle traffic growth
Timeline: They need to migrate within 6 months due to a major product launch

We generally want to know as much as possible about our leads. This helps us identify why they need our product or service, which in turn allows us to tailor our pitch and our language.

The Process

What we want is to find many potential customers (also called generating leads) and at some point convert them into customers. So we want: lead ➜ SQL ➜ customer. This is the process. Here are some more details.

1. Generating leads: there are many ways to generate leads and this is a big topic on its own. Some are networking, asking for referrals, or internet marketing.

2. BANT: So now that we have leads, we want to see whether they are qualified, so we would invest more time in them. Again: Budget, Authority, Need, and Timeline. So we determine if the lead has the financial resources, decision-making authority, genuine requirement for the product or service, and a specific timeframe for purchase. If yes, then we consider this lead a strong candidate for further qualification. This assessment is typically done by sales representatives.

3. Lead Scoring: As the next step, we score our leads. That is, we assign them numerical values that represent how likely it is to convert them into customers. This can be based on many things, such as company size, engagement level and job title.

4. Lead Nurturing: By lead nurturing we mean taking care of a lead. The goal is to build trust and spark interest. This means providing info about the product or service and addressing their pain points. This may be achieved through personalised email campaigns, case studies, content marketing, and webinars. Importantly, we use our lead scoring to decide how much time we invest in nurturing each lead.

5. Scheduling a Meeting or Call: Once a lead has shown strong interest in the product or service, we schedule a meeting or a call. Here we will dive deep into the lead's requirements, understand their challenges and pain points, and present tailored solutions. This is often called a discovery call and might involve both the AE and a solution engineer if technical questions are expected.

6. Closing the Deal: This is the final step of this process. When the lead shows a strong interest in the service or product and you've had a meeting or multiple already, it might be time for closing the deal. This includes making a deal proposal, which includes a summary of the customer's needs, a detailed explanation of the proposed solution, why and how that's good for the customer, pricing, terms and conditions and more. But it also includes negotiating aspects of the proposal. For complex technical sales, this might also include technical proofs of concept or pilot implementations.

This process is called the sales pipeline. If we outlined the same process from the customer's perspective instead, and noted that the number of leads decreases with every step, it would be called the sales funnel. However, these two terms are sometimes used interchangeably.

Note that we also often say "the pipeline" and mean the above process together with all existing leads, so the state your entire sales cycle is in right now.

People involved

Let's clarify who is involved in this process.

Account Executive (AE)

The account executive, also called sales representative or just sales rep, is the person who acts as the primary point of contact and owns a particular lead or potential deal. This means they are the main face of your company to this lead. They are responsible for building the relationship and understanding the customer's needs.

For example, if you're selling enterprise software, the AE might be responsible for 20-30 active opportunities, each representing potential deals worth $50k-$500k. They spend their time on calls understanding business requirements, presenting value propositions, and navigating the customer's procurement process.

However, they are not working alone, of course.

The entire process can only start if we have leads. The lead generation is typically done by the marketing team (doing ads, social media marketing, online content etc).

Next, we distinguish between presales and post-sales. Not hard to guess, but presales covers all the activities and support that occur before a sale closes. That includes customer research, prospecting, discovery (including technical discovery) and more. Post-sales is everything after the deal is closed, for example a good onboarding and general customer success.

Just as a note, developers, software architects, designers and so on are not part of this process, at least not in standard business lingo. Of course, the developers start working after the deal is closed, but when we say "post-sales", we're not talking about that. We're talking about things like customer support or account management.

Now when we talk about someone in presales, we often mean someone with technical expertise. That is often a solution engineer or a solution architect, and their role typically is to help with technical discovery, answer technical questions that might come up (AEs normally don't know the answers), help architect and design the proposed solution, and take part in presenting it in the pitch.

Not everyone is a Prospect

I will not dive deep here, but I want to mention that it's important to understand that not every lead is a prospect. It's very important to narrow down in the sales cycle. There is a reason we have different terms (Lead, MQL, SQL). If you don't narrow it down and do good lead scoring, you will waste resources massively. Don't treat every lead like a SQL.

Solution Engineers & Architects in Presales

For developers considering a transition into sales, the solution engineer or solution architect role is often the natural entry point. These roles bridge the gap between technical expertise and business value. Let me break down what this actually looks like in practice.

What Solution Engineers Do

A solution engineer (SE) is essentially the technical wing of the sales team. While the AE focuses on relationship building and understanding business needs, the SE handles all technical aspects of the sale.

Technical Discovery: This means understanding the customer's current technical environment. For example, if you're selling a cloud platform, you'd need to understand their current infrastructure, what databases they use, their security requirements, and their deployment processes. You're not just asking "what technology do you use?" but rather "how does your current system handle peak traffic?" or "what's your disaster recovery setup?"

Solution Design: Based on the discovery, you design a solution that fits their specific needs. This isn't about presenting a generic demo, but rather showing exactly how your product would integrate into their environment. For instance, if they're a retail company with seasonal traffic spikes, you'd design a solution that shows auto-scaling capabilities specifically for their Black Friday traffic patterns.

Technical Demos: You'll give live demonstrations of the product, often customized to their use case. This might mean setting up a demo environment that mirrors their data structure or showing how your API would integrate with their existing systems.

Proof of Concepts (POCs): Sometimes customers want to test your solution with their actual data or use cases. You'd help set up and run these technical evaluations.

Examples of Solution Engineer Work

Let's say you're selling a data analytics platform to a logistics company:

Discovery: You'd learn they track 50,000 shipments daily, use Oracle databases, have compliance requirements, and their current reporting takes 3 hours to generate.
Solution Design: You'd design a solution showing real-time dashboards, automated compliance reporting, and integration with their Oracle systems.
Demo: You'd use their actual shipping data structure to show how reports that currently take 3 hours could be generated in real-time.
POC: They might want to test with their actual data for 30 days to see the performance improvements.

Or if you're selling cybersecurity software to a financial services company:

Discovery: Understanding their current security stack, compliance requirements (like PCI DSS), incident response procedures, and integration needs.
Solution Design: Showing how your solution fits into their existing security infrastructure without disrupting operations.
Demo: Demonstrating threat detection using scenarios relevant to financial services, like detecting suspicious transaction patterns.

Solution Architect vs Solution Engineer

The terms are often used interchangeably, but there can be subtle differences:

Solution Architect typically implies more strategic, high-level design work. They might work on larger, more complex deals and focus on architectural patterns and long-term technical strategy.

Solution Engineer often handles more hands-on technical work, demos, POCs, and day-to-day technical customer interactions.

In smaller companies, one person might do both roles. In larger companies, you might have senior solution architects who design complex solutions and junior solution engineers who execute demos and handle technical questions.

Why This Role Works for Developers

This role is attractive for developers because:

You use your technical skills daily - understanding APIs, databases, cloud architecture, security patterns, etc.
You learn business context - seeing how technology solves real business problems, not just technical challenges.
Direct customer interaction - you get immediate feedback on how your technical solutions impact real users.
Higher compensation - presales roles typically pay more than pure development roles.
Career progression - you can move into sales leadership, product management, or customer success roles.

The key difference from development is that instead of building solutions, you're designing and demonstrating them to solve specific customer problems. Your success is measured not by code quality or features shipped, but by whether customers understand and buy the technical solution you've presented.

But ultimately, it's a matter of whether you like to sell or not. Personally, I think sales is a super interesting field to work in as, when you think about it, everything in life is basically sales. So becoming good at it is not just a career win :)

Event Sourcing, CQRS and Micro Services: Real FinTech Example from my Consulting Career

Lukas Niessen — Mon, 30 Jun 2025 22:18:14 +0000

This is a detailed breakdown of a FinTech project from my consulting career. I'm writing this because I'm convinced that this was a great architecture choice and there aren't many examples of event sourcing and CQRS on the internet where it actually makes sense. You are very welcome to share your thoughts and whether you agree with this design choice or not :)

Project Description

The client was a medium-sized fintech company that had developed a real-time trading platform in-house and launched it as a beta version. The functionality included:

Real time stock info
Portfolio management
Real time transaction tracking
Report generation
Account with a little social media functionality (making posts, liking and commenting)
Mobile device notifications
and more

Their app was an MVP. It had a monolithic Spring Boot backend and a simple React-based web UI, everything hosted on Azure.

They hired us for two main reasons: their MVP was not auditable, and thus not compliant with financial regulations, and it was not scalable (in terms of high usage and fault tolerance).

Our Team

We worked with the customer, not alone. Our team was about 10 people: experienced back-end or full-stack developers, plus me as a software architect. The client had about 20 developers, ranging from front end and back end to database experts and more. My role was to lead the architecture.

Our Design Decision

As said, the main issues to solve were auditability (compliance) and scalability (including performance and fault tolerance). I will start with an overview of the design, including a super short recap of what each technology is, and dive into detail in a later section, including the trade-offs and alternative solutions.

Auditability

Our customer must by law always know past states. For example, customer A had exactly $901 on their account 2 months ago at 1:30 pm. This was not possible with the existing system so we needed to tackle it. I proposed to use event sourcing. Here is a very brief explanation of event sourcing.

Event sourcing = Save events, not state

So instead of having a state we update, we save events. We use these events to create the state when we need it. Consider this simple example:

Old Approach

+--------------------------------------+
| Table: Account_Balance               |
+--------------------------------------+
| Account_ID | Balance | Last_Updated  |
+--------------------------------------+
| Customer_A | $0      | 2025-04-29    | <- Initial state
+--------------------------------------+
| Customer_A | $5      | 2025-04-29    | <- After receiving $5 (overwrites $0)
+--------------------------------------+
| Customer_A | $12     | 2025-04-29    | <- After receiving $7 (overwrites $5)
+--------------------------------------+

Problem: Past states (e.g., $5 at 1:30 PM) are lost unless separately logged.

Event Sourcing

+-------------------------------------------------------------------+
| Table: Account_Events                                             |
+-------------------------------------------------------------------+
| Event_ID | Account_ID | Event_Type | Amount | Timestamp           |
+-------------------------------------------------------------------+
| 1        | Customer_A | Deposit    | $5     | 2025-04-29 13:30:00 | <- Event: Received $5
+-------------------------------------------------------------------+
| 2        | Customer_A | Deposit    | $7     | 2025-04-29 13:31:00 | <- Event: Received $7
+-------------------------------------------------------------------+

Reconstructing Balance at 2025-04-29 13:30:00:

Sum events up to timestamp: $5 = $5

Reconstructing Balance at 2025-04-29 13:31:00:

Sum events up to timestamp: $5 + $7 = $12

Reconstructing Balance 2 months ago:

Sum all relevant events <= timestamp
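The reconstruction can be sketched in a few lines. The dictionary layout mirrors the Account_Events table above; the balance_at function name and the Withdrawal event type are assumptions added for illustration.

```python
from datetime import datetime
from decimal import Decimal

# The two events from the Account_Events table above.
events = [
    {"event_id": 1, "account_id": "Customer_A", "type": "Deposit",
     "amount": Decimal("5"), "timestamp": datetime(2025, 4, 29, 13, 30)},
    {"event_id": 2, "account_id": "Customer_A", "type": "Deposit",
     "amount": Decimal("7"), "timestamp": datetime(2025, 4, 29, 13, 31)},
]


def balance_at(events, account_id, as_of):
    """Rebuild the balance by summing all events up to and including as_of."""
    balance = Decimal("0")
    for event in events:
        if event["account_id"] != account_id or event["timestamp"] > as_of:
            continue
        if event["type"] == "Deposit":
            balance += event["amount"]
        elif event["type"] == "Withdrawal":  # assumed event type, not in the table
            balance -= event["amount"]
    return balance


balance_1330 = balance_at(events, "Customer_A", datetime(2025, 4, 29, 13, 30))
balance_1331 = balance_at(events, "Customer_A", datetime(2025, 4, 29, 13, 31))
balance_before = balance_at(events, "Customer_A", datetime(2025, 4, 29, 13, 29))
```

The three calls give $5, $12 and $0, matching the reconstruction steps above.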

This is event sourcing in a nutshell. For a more comprehensive explanation, please have a look here for example.

This is the right choice here because this gives us total control and transparency. When we want to know how much money a particular user had 2 months ago at 1:42 pm, we can just query the needed transactions and sum them up. We know everything with this approach. And this is required to be compliant. As a side note, accounting does the same thing but, of course, they don't call it event sourcing :)

But event sourcing comes with more advantages, including:

Rebuild state: you can always just discard the app state completely and rebuild it. You have all info you need, all events that ever took place.
Event replay: if we want to adjust a past event, for example because it was incorrect, we can correct it (typically by appending a compensating event rather than editing history) and rebuild the app state.
Event replay again: if we have received events in the wrong sequence, which is a common problem with systems that communicate with asynchronous messaging, we can just replay them and get the correct state.

Alternatives to Event Sourcing

Event sourcing definitely solves the auditability/compliance problem. But there are alternatives:

1. Audit Log Pattern: Keep the current state tables but add comprehensive audit logs that track all changes. This is simpler to implement but doesn't provide the same level of detail as event sourcing. You track what changed, but not necessarily the business intent behind the change.

2. Change Data Capture (CDC): Use database-level tools to capture all changes automatically. Tools like Debezium can stream database changes, but this is more technical and less business-focused than event sourcing.

3. Temporal Tables: Use database features (like SQL Server's temporal tables) to automatically version data. This provides history but lacks the rich business context that events provide.

4. Transaction Log Mining: Extract historical data from database transaction logs. This is complex and database-specific, making it harder to maintain.

CQRS (Command Query Responsibility Segregation)

The second major architectural decision was implementing CQRS, though we didn't start with it immediately due to complexity. We kept it in mind during the initial design and tested it later through a proof of concept, then implemented it in production.

Here's what CQRS means:

CQRS = Separate your reads from your writes

That is all. CQRS is often presented as (among other things) having two separate DBs, one for writing and one for reading. But that is not required: you are already doing CQRS when you just separate read and write code, for example by putting them into separate classes.

However, the benefits we needed do indeed require separate DBs.

Traditional Approach:
┌─────────────┐    ┌──────────────┐    ┌──────────────┐
│   Client    │────│   Service    │────│   Database   │
│             │    │              │    │              │
│ Read/Write  │    │ Read/Write   │    │ Read/Write   │
└─────────────┘    └──────────────┘    └──────────────┘

CQRS Approach:
┌─────────────┐    ┌──────────────┐    ┌──────────────┐
│   Client    │────│ Command Side │────│ Write Store  │
│             │    │   (Writes)   │    │ (Event Store)│
│             │    └──────────────┘    └──────────────┘
│             │           │                    │
│             │           │ Events             │ Events
│             │           ▼                    ▼
│             │    ┌──────────────┐    ┌──────────────┐
│             │────│  Query Side  │────│  Read Store  │
│             │    │   (Reads)    │    │ (Projections)│
└─────────────┘    └──────────────┘    └──────────────┘

More about CQRS here.

The benefits of doing this are the following:

Scale read and write resources differently
- By having two separate DBs, you can choose different technologies and scale them independently
- If performance is critical in your app, this can definitely help, especially when reads and writes are not of a similar amount
- In our case, we have a read heavy app
You can have different models for reading and writing

As hinted already, this was crucial for our trading platform because:

Complex reports and dashboards need denormalized, optimized read models,
Read and write loads are completely different in trading systems, so we need independent scalability,
We can use different databases optimized for each purpose.

However, CQRS with separate DBs comes at great cost again, for example, you need to deal with eventual consistency.
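Here is a minimal in-process sketch of the command side, the query side and the eventual consistency between them. The lists and the deque stand in for the event store, the read store and the event transport; all names are assumptions, and a real system would run the projection asynchronously in a separate process.

```python
from collections import deque

event_store = []          # write store: append-only list of events
pending = deque()         # events published but not yet projected
balances_read_model = {}  # read store: denormalized projection


def handle_deposit(account_id, amount):
    # Command side: validate, append the event, publish it.
    if amount <= 0:
        raise ValueError("deposit must be positive")
    event = {"account_id": account_id, "type": "Deposit", "amount": amount}
    event_store.append(event)
    pending.append(event)


def run_projection():
    # Updates the read model from the published events.
    while pending:
        event = pending.popleft()
        current = balances_read_model.get(event["account_id"], 0)
        balances_read_model[event["account_id"]] = current + event["amount"]


def get_balance(account_id):
    # Query side: reads only from the projection.
    return balances_read_model.get(account_id, 0)


handle_deposit("Customer_A", 5)
handle_deposit("Customer_A", 7)
stale_read = get_balance("Customer_A")  # projection has not caught up yet
run_projection()
fresh_read = get_balance("Customer_A")
```

The first read returns 0 even though both deposits are already in the event store; only after the projection has run does the query side return 12. That gap is the eventual consistency you have to design for.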

Important note: We do NOT use CQRS on every service but only where it justifies the complexity.

Alternatives to CQRS

You can try to get the benefits of CQRS in other ways, for example by using caching strategies and read replicas. I'll dive into the tradeoffs of these approaches in the detailed discussion section.

Microservices

We also decided to break the monolith into microservices. The main reason for this decision was again independent scalability and higher fault tolerance. The existing monolith was often running on very high CPU usage due to report generation and real-time market data processing consuming most resources.

By separating these concerns into different services, even if our report generation service crashes due to heavy usage, other critical services like transaction processing are not impacted at all. This improves overall system availability by increasing the mean time between failures (MTBF) of the critical services and reducing the mean time to recovery (MTTR).

An interesting part here was the migration from monolith to microservices using the strangler fig pattern, gradually replacing parts of the monolith.

Asynchronous Messaging

Another decision was to use asynchronous messaging for inter-service communication instead of request-response communication.

Synchronous (Traditional):
Service A ──HTTP Request──► Service B
          ◄──Response─────

Asynchronous (Our Approach):
Service A ──Event──► Message Queue ──Event──► Service B

This event-driven approach has many benefits such as high decoupling. However, we were primarily interested in better fault tolerance:

Suppose Service A tells Service B to save data to its DB. If we use a traditional HTTP request and Service B is down, then the request is lost. Of course there are ways to combat this, but if we use asynchronous messaging instead, then Service A just pushes that event to the message queue, and if Service B is down, nothing is lost. The event just stays on the queue. As soon as Service B is up again, the event gets processed.

So using this approach gives us better fault tolerance in the case of network partitions.
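
To make the difference concrete, here is a minimal in-memory sketch in Python. The names `MessageQueue` and `ServiceB` are made up for illustration, and a real broker adds durability, acknowledgements and retries that this toy version leaves out.

```python
from collections import deque


class MessageQueue:
    """Toy stand-in for a message broker: events wait until a consumer takes them."""

    def __init__(self):
        self._events = deque()

    def publish(self, event):
        self._events.append(event)

    def deliver_to(self, consumer):
        """Hand over pending events in order; stop as soon as the consumer is down."""
        delivered = 0
        while self._events and consumer.is_up:
            consumer.handle(self._events.popleft())
            delivered += 1
        return delivered

    def pending(self):
        return len(self._events)


class ServiceB:
    def __init__(self):
        self.is_up = True
        self.saved = []

    def handle(self, event):
        self.saved.append(event)


queue = MessageQueue()
service_b = ServiceB()

# Service B is down while Service A publishes two events.
service_b.is_up = False
queue.publish({"type": "save_data", "id": 1})
queue.publish({"type": "save_data", "id": 2})
queue.deliver_to(service_b)   # nothing is delivered, but nothing is lost either
print(queue.pending())        # 2

# Service B comes back and catches up.
service_b.is_up = True
queue.deliver_to(service_b)
print(len(service_b.saved))   # 2
```

The important bit is that Service A never needs to know whether Service B is up. It only talks to the queue.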

Now asynchronous messaging has clear downsides too, mainly complexity, particularly when it comes to debugging and testing.

Detailed Discussion: Tradeoffs and Alternatives

Microservices Deep Dive

We identified services based on business capabilities as follows:

Transaction-Portfolio Service:
├── Owns: Account balances, transaction history, stock holdings
├── Responsibilities: Money transfers, buy/sell orders, balance queries
└── Database: PostgreSQL (ACID compliance critical)

Notification Service:
├── Owns: User preferences, notification history
├── Responsibilities: Email, SMS, push notifications
├── Database: MongoDB (flexible schema for different notification types)
└── Event Sourcing: NOT used (simple CRUD operations)

Social Service:
├── Owns: Posts, likes, comments
├── Responsibilities: Social feed, user interactions
├── Database: MongoDB
└── Event Sourcing: NOT used (not critical for compliance)

Report Service:
├── Owns: Aggregated data, report templates
├── Responsibilities: Generate complex reports
├── Database: ClickHouse (optimized for analytics)
└── CQRS: Read-only projections from other services

User Service:
├── Owns: User profiles, authentication
├── Responsibilities: Registration, login, profile management
├── Database: PostgreSQL
└── Event Sourcing: NOT used (user profiles change infrequently)

Service Boundary Evolution: Initially, we considered separating Transaction Service and Portfolio Service. However, we discovered early in the design phase that this would be wrong. Due to very frequent boundary crossings and the need for distributed transactions when a trade affects both account balance and portfolio holdings, we decided to keep these as a single service. This eliminated the complexity of distributed transactions while maintaining other benefits.

In my opinion, the need for distributed transactions or sagas is always a signal to check whether your service boundaries are right. Maybe you want to merge services instead. To quote Sam Newman in Building Microservices (2nd edition):

Distributed Transactions: Just Say No. For all the reasons outlined so far, I strongly suggest you avoid the use of distributed transactions like the two-phase commit to coordinate changes in state across your microservices. So what else can you do? Well, the first option could be to just not split the data apart in the first place. If you have pieces of state that you want to manage in a truly atomic and consistent way, and you cannot work out how to sensibly get these characteristics without an ACID-style transaction, then leave that state in a single database, and leave the functionality that manages that state in a single service (or in your monolith). If you're in the process of working out where to split your monolith and what decompositions might be easy (or hard), then you could well decide that splitting apart data that is currently managed in a transaction is just too difficult to handle right now. Work on some other area of the system, and come back to this later. But what happens if you really do need to break this data apart, but you don't want all the pain of managing distributed transactions? In cases like this, you may consider an alternative approach: sagas.

So he also recommends either merging the services or, if really needed, using sagas. In our case we decided that this service boundary would be wrong, since the scalability needs of the Transaction Service and the Portfolio Service are actually not that different.

Where We Use Event Sourcing

We used event sourcing only in the Transaction-Portfolio Service due to the strict compliance requirements for financial data. The other services used traditional CRUD patterns since they didn't require the same level of auditability.

Event Sourcing Deep Dive

Benefits of Event Sourcing

Event sourcing has better performance when it comes to writing. Consider this example:

Traditional Update Pattern:

1. SELECT current_balance FROM accounts WHERE id = 123
2. UPDATE accounts SET balance = balance + 100 WHERE id = 123
   (Requires row locking, potential contention)

Event Sourcing Pattern:

INSERT INTO events (account_id, type, amount, timestamp)
   VALUES (123, 'deposit', 100, NOW())

And it has even more advantages on the writing part:

Write Performance: Append-only writes are much faster than updates
No Lock Contention: Multiple transactions can write simultaneously
Better Concurrency: No need to lock rows for balance updates
Optimized for SSD: Sequential writes perform excellently on modern storage

Although we talked about this already, here is a sample of how you could answer a point-in-time question by summing up the events:

-- Regulatory Question: "What was Account X's balance on Date Y at Time Z?"
-- Event Sourcing Answer:
SELECT SUM(amount) FROM events
WHERE account_id = X AND timestamp <= 'Y Z'
GROUP BY account_id
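
The same replay idea, as a small Python sketch. The event list and the `balance_at` helper are hypothetical, and I assume withdrawals are stored as negative amounts so that a plain sum works, just like in the SQL above.

```python
from datetime import datetime

# Hypothetical append-only event log for one account (amounts are signed).
events = [
    {"account_id": 123, "type": "deposit", "amount": 100,
     "timestamp": datetime(2025, 1, 1, 9, 0)},
    {"account_id": 123, "type": "withdrawal", "amount": -30,
     "timestamp": datetime(2025, 1, 2, 12, 0)},
    {"account_id": 123, "type": "deposit", "amount": 50,
     "timestamp": datetime(2025, 1, 3, 15, 0)},
]


def balance_at(events, account_id, as_of):
    """Replay: sum every event for the account up to and including `as_of`."""
    return sum(
        e["amount"]
        for e in events
        if e["account_id"] == account_id and e["timestamp"] <= as_of
    )


print(balance_at(events, 123, datetime(2025, 1, 2, 23, 59)))  # 70
print(balance_at(events, 123, datetime(2025, 1, 3, 15, 0)))   # 120
```

Nothing is ever updated in place, so any historical balance can be recomputed from the log.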

Challenges and Solutions

Performance Issue - Event Replay

The main challenge we faced was performance degradation when reconstructing current state from thousands of events. For active trading accounts, we had up to 50,000 events per day.

Our solution was a hybrid approach:

Event-Based Snapshots: Create snapshots after every 1,000 events per account
Delta Replay: Only replay events since the last snapshot

This approach ensures that we never need to replay more than 1,000 events for any account, keeping reconstruction time predictable and fast.

-- Sample code
SELECT snapshot.balance + COALESCE(SUM(events.amount), 0) as current_balance
FROM account_snapshots snapshot
LEFT JOIN events ON events.account_id = snapshot.account_id
    AND events.sequence_number > snapshot.last_event_sequence
WHERE snapshot.account_id = 123
    AND snapshot.last_event_sequence = (
        SELECT MAX(last_event_sequence) FROM account_snapshots
        WHERE account_id = 123
    )

This reduced our balance calculation time from 2-5 seconds to 50-200ms for active accounts.
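
Here is a rough Python sketch of the snapshot plus delta replay idea. The `Account` class is a hypothetical in-memory model, not our production code. It only shows why the number of events to replay stays bounded.

```python
SNAPSHOT_EVERY = 1000  # same threshold as described above


class Account:
    """Hypothetical in-memory model: an event list plus the latest snapshot."""

    def __init__(self):
        self.events = []              # append-only list of signed amounts
        self.snapshot_balance = 0     # balance as of the last snapshot
        self.snapshot_sequence = 0    # number of events covered by the snapshot

    def append(self, amount):
        self.events.append(amount)
        # Take a snapshot once 1,000 new events have accumulated.
        if len(self.events) - self.snapshot_sequence >= SNAPSHOT_EVERY:
            self.snapshot_balance = self.current_balance()
            self.snapshot_sequence = len(self.events)

    def events_to_replay(self):
        return self.events[self.snapshot_sequence:]

    def current_balance(self):
        # Delta replay: snapshot plus only the events after it.
        return self.snapshot_balance + sum(self.events_to_replay())


account = Account()
for _ in range(2500):
    account.append(1)

print(account.current_balance())        # 2500
print(len(account.events_to_replay()))  # 500
```

Even after 2,500 events, only the 500 events since the last snapshot are replayed.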

Storage Growth

Events accumulate rapidly. We implemented a tiered storage strategy:

Hot storage (Azure Premium SSD): Last 3 months ~ 2TB
Warm storage (Azure Standard SSD): 3-12 months ~ 5TB
Cold storage (Azure Archive): 1+ years ~ 50TB

Total storage costs: $800/month vs $15,000/month if everything was on premium storage.

Tradeoffs

Event sourcing adds significant complexity. Part of the team needed training.

Query Complexity:
Getting current state requires aggregation:

-- Current Balance Query:
SELECT account_id, SUM(amount) as current_balance
FROM events
WHERE account_id = 123
GROUP BY account_id

-- vs Traditional:
SELECT balance FROM accounts WHERE id = 123

Storage Growth:
Events accumulate over time and require storage management strategies.

Why We Rejected Alternatives

Yes, we decided to use event sourcing even though it comes with read performance issues, and even though performance was a main concern of our customer.

The reason is that event sourcing is far superior when it comes to audits. This was much more important to the customer than performance. Plus we managed to solve the performance issue.

CQRS Deep Dive

Very important note: CQRS in the sense of having multiple DBs adds complexity and eventual consistency. This is why we decided against using it immediately but just kept it in mind. We later created a proof of concept to compare the performance benefits we would get in the Portfolio Service.

Our POC results showed for a test user account:

Report generation time: 30 seconds → 10 seconds
Dashboard load time: 1 second → 400ms
Complex query performance: about 2x improvement

The result convinced us to implement it. We later added it to the Transaction Service for high-volume trading operations, but not to all services. Adding CQRS to all our services would have brought little benefit (most services don't need the performance gains or separate read/write models) but a lot of complexity.

Implementation Details

We implemented CQRS for the Transaction-Portfolio Service as follows. We had a Postgres DB for the write side (command side) and a MongoDB for the read side (query side). We chose a document store because we did not want a fixed schema plus we wanted very high read throughput.

When the service received a request that required a write, it wrote to the Postgres DB and also emitted an event to our message broker (Azure Service Bus). This event was then processed by a different instance of the Transaction-Portfolio Service, which wrote to the MongoDB. Here we don't write the same data but a denormalized form, so that querying the data we need is faster.

Note that we give up ACID guarantees across the two stores by doing this. The write side stays transactional, but the read side is only eventually consistent with it, typically within 100-500ms.
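
A minimal Python sketch of this flow, with plain in-memory objects standing in for Postgres, MongoDB and the broker. All names (`handle_buy_command`, `project_pending_events`, the `TradeExecuted` event) are made up for illustration.

```python
from collections import deque

write_db = []     # stand-in for the Postgres write side (normalized rows)
read_db = {}      # stand-in for the MongoDB read side (one document per user)
broker = deque()  # stand-in for the message broker


def handle_buy_command(user_id, symbol, quantity, price):
    """Command side: store the trade, then emit an event for the read side."""
    trade = {"user_id": user_id, "symbol": symbol,
             "quantity": quantity, "price": price}
    write_db.append(trade)
    broker.append({"type": "TradeExecuted", **trade})


def project_pending_events():
    """Query side: fold events into a denormalized, pre-aggregated summary."""
    while broker:
        event = broker.popleft()
        summary = read_db.setdefault(event["user_id"], {"positions": {}})
        position = summary["positions"].setdefault(
            event["symbol"], {"total_shares": 0, "total_cost": 0.0}
        )
        position["total_shares"] += event["quantity"]
        position["total_cost"] += event["quantity"] * event["price"]


handle_buy_command(123, "ACME", 10, 5.0)
handle_buy_command(123, "ACME", 5, 8.0)

# Until the projector runs, the read side lags behind (eventual consistency).
print(read_db.get(123))  # None

project_pending_events()
print(read_db[123]["positions"]["ACME"])
# {'total_shares': 15, 'total_cost': 90.0}
```

The gap between the two `print` calls is exactly the window in which a reader can see stale data.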

Why This Was Faster

The performance improvements came from:

Denormalized Read Models: Instead of complex JOINs across normalized tables, we had pre-computed aggregations
Optimized Indexes: Each MongoDB collection had indexes tailored for specific query patterns
Separate Scaling: We could scale read replicas independently of the write database

Consider this example of generating a user's portfolio performance report:

Traditional approach:

-- Complex query with multiple JOINs and aggregations
SELECT u.name, p.symbol,
       SUM(t.quantity) as total_shares,
       AVG(t.price) as avg_price,
       -- ... more complex calculations
FROM users u
JOIN portfolios p ON u.id = p.user_id
JOIN transactions t ON p.id = t.portfolio_id
WHERE u.id = 123
GROUP BY u.name, p.symbol
-- This took up to 30 seconds for active users

CQRS approach:

// Simple document lookup from pre-computed projection
db.portfolio_summaries.findOne({ user_id: 123 });
// fast

Major Challenges and How We Solved Them

1. Debugging Distributed Systems

This was our biggest pain point initially. When a transaction failed, tracing the issue across multiple services and async message queues was a nightmare.

We solved this by implementing distributed tracing with correlation IDs that flow through every service call and message. Every log entry includes the correlation ID, making it possible to reconstruct the entire flow. We used Jaeger for distributed tracing and structured logging with consistent fields across all services.
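
The core idea fits in a few lines of Python. This is only a sketch with hypothetical service functions and an in-memory log list. It does not use Jaeger or any real tracing library.

```python
import json
import uuid

log_lines = []  # stand-in for a central log store


def log(service, message, correlation_id):
    # Structured log entry with the same fields in every service.
    log_lines.append(json.dumps(
        {"service": service, "correlation_id": correlation_id, "message": message}
    ))


def api_gateway(order):
    correlation_id = str(uuid.uuid4())  # created once, at the edge
    log("gateway", "order received", correlation_id)
    # The ID travels with the message, e.g. as a header or metadata field.
    return {"correlation_id": correlation_id, "payload": order}


def transaction_service(message):
    log("transaction", "order booked", message["correlation_id"])
    return {"correlation_id": message["correlation_id"], "payload": "notify user"}


def notification_service(message):
    log("notification", "notification sent", message["correlation_id"])


first = api_gateway({"symbol": "ACME", "quantity": 10})
notification_service(transaction_service(first))
api_gateway({"symbol": "OTHER", "quantity": 1})  # unrelated request

# Reconstruct one flow by filtering on its correlation ID.
entries = [json.loads(line) for line in log_lines]
flow = [e for e in entries if e["correlation_id"] == first["correlation_id"]]
print([e["service"] for e in flow])
# ['gateway', 'transaction', 'notification']
```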

2. Testing Complexity

Testing event sourcing and CQRS systems is fundamentally different. You can't just mock database calls - you need to verify that events are produced correctly and that projections are updated properly.

We created integration test environments that could replay production events against test instances. This allowed us to validate that code changes wouldn't break existing event processing. We also invested heavily in property-based testing to verify that event sequences always produce valid states.

What Would I Change?

I'm convinced this was the right architecture for our specific requirements. However, there are definitely things I would approach differently:

We didn't have a clear strategy for evolving event schemas initially. When we needed to add fields to events or change event structure, it created compatibility issues with existing events.
Our monitoring and logging were also weak in the beginning, which made everything even harder at the start.
I would consider using EventStore instead of Postgres for the Transaction-Portfolio Service. EventStore is purpose-built for event sourcing and provides features like built-in projections, event versioning, and optimized append-only storage. This would eliminate much of the custom event sourcing infrastructure we had to build on top of Postgres.

Consistent Hashing Explained

Lukas Niessen — Sun, 25 May 2025 10:26:24 +0000

This contains an ELI5 and a deeper explanation of consistent hashing. I have added a lot of ASCII art, hehe :) At the end, I even added simplified example code showing how you could implement consistent hashing.

ELI5: Consistent Pizza Hashing 🍕

Suppose you're at a pizza party with friends. Now you need to decide who gets which pizza slices.

The Bad Way (Simple Hash)

You have 3 friends: Alice, Bob, and Charlie
For each pizza slice, you count: "1-Alice, 2-Bob, 3-Charlie, 1-Alice, 2-Bob..."
Slice #7 → 7 ÷ 3 = remainder 1 → Alice gets it
Slice #8 → 8 ÷ 3 = remainder 2 → Bob gets it

With 3 friends:
Slice 7 → Alice
Slice 8 → Bob
Slice 9 → Charlie

The Problem: Your friend Dave shows up. Now you have 4 friends. So we need to do the distribution again.

Slice #7 → 7 ÷ 4 = remainder 3 → Charlie gets it (was Alice's!)
Slice #8 → 8 ÷ 4 = remainder 0 → Dave gets it (was Bob's!)

With 4 friends:
Slice 7 → Charlie (moved from Alice!)
Slice 8 → Dave (moved from Bob!)
Slice 9 → Alice (moved from Charlie!)

Almost EVERYONE'S pizza has moved around...! 😫
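
If you want to count it yourself, here is a tiny Python sketch of the counting rule ("1-Alice, 2-Bob, 3-Charlie, 1-Alice, ..."). The `assign` helper and the choice of 12 slices are my own assumptions for the demo.

```python
def assign(slice_number, friends):
    # Count around the list of friends: slice 1 goes to the first friend,
    # slice 2 to the second, and so on, starting over after the last one.
    return friends[(slice_number - 1) % len(friends)]


three = ["Alice", "Bob", "Charlie"]
four = ["Alice", "Bob", "Charlie", "Dave"]

slices = range(1, 13)
moved = [s for s in slices if assign(s, three) != assign(s, four)]

print(len(moved), "of", len(slices), "slices moved")  # 9 of 12 slices moved
```

Only the first three slices keep their owner. Everything else has to be handed to someone else.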

The Good Way (Consistent Hashing)

Draw a big circle and put your friends around it
Each pizza slice gets a number that points to a spot on the circle
Walk clockwise from that spot until you find a friend - they get the slice.

           Alice
      🍕7       .
      .            .
     .               .
    .        ○       Bob
     .              🍕8
      .             .
       .           .
          Charlie

🍕7 walks clockwise and hits Alice
🍕8 walks clockwise and hits Charlie

When Dave joins:

Dave sits between Bob and Charlie
Only slices that were "between Bob and Dave" move from Charlie to Dave
Everyone else keeps their pizza! 🎉

           Alice
      🍕7       .
      .            .
     .               .
    .        ○       Bob
     .              🍕8
      .             .
       .          Dave
          Charlie

🍕7 walks clockwise and hits Alice (nothing changed)
🍕8 walks clockwise and hits Dave (change)

Back to the real world

This was an ELI5 but the reality is not much harder.

Instead of pizza slices, we have data (like user photos, messages, etc)
Instead of friends, we have servers (computers that store data)

With the "circle strategy" from above we distribute the data evenly across our servers and when we add new servers, not much of the data needs to relocate. This is exactly the goal of consistent hashing.

In a "Simplified Nutshell"

Make a circle (hash ring)
Put servers around the circle (like friends around pizza)
Put data around the circle (like pizza slices)
Walk clockwise to find which server stores each piece of data
When servers join/leave → only nearby data moves

That's it! Consistent hashing keeps your data organized, even when your system grows or shrinks.

So as we saw, consistent hashing solves problems of database partitioning:

Distribute equally across nodes,
When adding or removing servers, keep the "relocating-efforts" low.

Why It's Called Consistent?

Because it's consistent in the sense that adding or removing one server doesn't mess up where everything else is stored.

Non-ELI5 Explanation

Here is the explanation again, briefly, but non-ELI5 and with some more details.

Step 1: Create the Hash Ring

Think of a circle with points from 0 to some large number. For simplicity, let's use 0 to 100 - in reality it's more like 0 to 2^32 - 1!

                    0/100
                      │
               95 ────┼──── 5
                     ╱│╲
                90 ╱  │  ╲ 10
                  ╱   │   ╲
              85 ╱    │    ╲ 15
                ╱     │     ╲
           80 ─┤      │      ├─ 20
              ╱       │       ╲
          75 ╱        │        ╲ 25
            ╱         │         ╲
       70 ─┤          │          ├─ 30
          ╱           │           ╲
      65 ╱            │            ╲ 35
        ╱             │             ╲
   60 ─┤              │              ├─ 40
      ╱               │               ╲
  55 ╱                │                ╲ 45
    ╱                 │                 ╲
50 ─┤                 │                 ├─ 50

Step 2: Place Databases on the Ring

We distribute our databases evenly around the ring. With 4 databases, we might place them at positions 0, 25, 50, and 75:

                    0/100
                   [DB1]
               95 ────┼──── 5
                     ╱│╲
                90 ╱  │  ╲ 10
                  ╱   │   ╲
              85 ╱    │    ╲ 15
                ╱     │     ╲
           80 ─┤      │      ├─ 20
              ╱       │       ╲
    [DB4] 75 ╱        │        ╲ 25 [DB2]
            ╱         │         ╲
       70 ─┤          │          ├─ 30
          ╱           │           ╲
      65 ╱            │            ╲ 35
        ╱             │             ╲
   60 ─┤              │              ├─ 40
      ╱               │               ╲
  55 ╱                │                ╲ 45
    ╱                 │                 ╲
50 ─┤               [DB3]               ├─ 50

Step 3: Find Events on the Ring

To determine which database stores an event:

Hash the event ID to get a position on the ring
Walk clockwise from that position until you hit a database
That's your database

Example Event Placements:

Event 1001: hash(1001) % 100 = 8
8 → walk clockwise → hits DB2 at position 25

Event 2002: hash(2002) % 100 = 33
33 → walk clockwise → hits DB3 at position 50

Event 3003: hash(3003) % 100 = 67
67 → walk clockwise → hits DB4 at position 75

Event 4004: hash(4004) % 100 = 88
88 → walk clockwise → hits DB1 at position 0/100
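
The lookup from these steps can be sketched in a few lines of Python using the standard `bisect` module. I plug in the ring positions from the example directly, since the hash values above are only illustrative. A real system would compute them with a hash function.

```python
from bisect import bisect_left

# Ring positions from the example above (0 and 100 are the same point).
ring = {0: "DB1", 25: "DB2", 50: "DB3", 75: "DB4"}
positions = sorted(ring)


def find_database(position):
    """Walk clockwise: first database at or after `position`, wrapping around."""
    index = bisect_left(positions, position)
    if index == len(positions):
        index = 0  # past the last database, wrap around to position 0/100
    return ring[positions[index]]


for event_id, position in [(1001, 8), (2002, 33), (3003, 67), (4004, 88)]:
    print(event_id, "->", find_database(position))
# 1001 -> DB2
# 2002 -> DB3
# 3003 -> DB4
# 4004 -> DB1
```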

Minimal Redistribution

Now here's where consistent hashing shines. When you add a fifth database at position 90:

Before Adding DB5:
Range 75-100: All events go to DB1

After Adding DB5 at position 90:
Range 75-90:  Events now go to DB5 ← Only these move!
Range 90-100: Events still go to DB1

Events affected: Only those with hash values 75-90

Only events that hash to the range between 75 and 90 need to move. Everything else stays exactly where it was. No mass redistribution.

The same principle applies when removing databases. Remove DB2 at position 25, and only events in the range 0-25 need to move to the next database clockwise (DB3).
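
You can check the "only these move" claim with a short Python sketch. The `owner` helper is hypothetical, and I treat the ring positions 0 to 99 as the possible hash values. A database at position p owns everything after the previous database up to and including p.

```python
from bisect import bisect_left


def owner(ring, position):
    positions = sorted(ring)
    index = bisect_left(positions, position) % len(positions)  # wrap around
    return ring[positions[index]]


before = {0: "DB1", 25: "DB2", 50: "DB3", 75: "DB4"}
after = {**before, 90: "DB5"}

moved = [p for p in range(100) if owner(before, p) != owner(after, p)]
print(moved[0], moved[-1], len(moved))  # 76 90 15
```

Only 15 of the 100 positions change their owner, and all of them move from DB1 to DB5.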

Virtual Nodes: Better Load Distribution

There's still one problem with this basic approach. When we remove a database, all its data goes to the next database clockwise. This creates uneven load distribution.

The solution is virtual nodes. Instead of placing each database at one position, we place it at multiple positions:

Each database gets 5 virtual nodes (positions):

DB1: positions 0, 20, 40, 60, 80
DB2: positions 5, 25, 45, 65, 85
DB3: positions 10, 30, 50, 70, 90
DB4: positions 15, 35, 55, 75, 95

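Now when DB2 is removed, its load gets distributed across multiple databases instead of dumping everything on one database. Note that this only works if the virtual nodes are scattered around the ring, which is what you get when their positions come from a hash function. With the perfectly regular positions in the simplified list above, the next node clockwise after every DB2 position would always belong to DB3.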
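
Here is a small Python sketch of virtual nodes. Unlike the hand-picked positions in the list, it derives each virtual node's position from an MD5 hash, so the positions are scattered pseudo-randomly. The helper names and the choice of 100 virtual nodes per database are assumptions for the demo.

```python
import hashlib
from bisect import bisect_left
from collections import Counter


def ring_hash(key):
    # First 32 bits of MD5 as the position on the ring.
    return int(hashlib.md5(key.encode()).hexdigest()[:8], 16)


def build_ring(servers, virtual_nodes=100):
    # Each server appears at many pseudo-random positions on the ring.
    return sorted(
        (ring_hash(f"{server}:{i}"), server)
        for server in servers
        for i in range(virtual_nodes)
    )


def owner(ring, key):
    # First virtual node at or after the key's position, wrapping around.
    index = bisect_left(ring, (ring_hash(key), "")) % len(ring)
    return ring[index][1]


keys = [f"event_{i}" for i in range(10_000)]
with_db2 = build_ring(["DB1", "DB2", "DB3", "DB4"])
without_db2 = build_ring(["DB1", "DB3", "DB4"])

# Where do the keys that used to live on DB2 end up?
receivers = Counter(
    owner(without_db2, key) for key in keys if owner(with_db2, key) == "DB2"
)
print(receivers)  # counts per remaining database
```

The keys that lived on DB2 end up spread over the remaining databases, and keys on the other databases do not move at all.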

When Will You Need This?

Usually, you will not want to implement this yourself unless you're designing a custom, large-scale backend component such as a distributed cache, a distributed database or a distributed message queue.

Many popular systems already use consistent hashing under the hood for you, for example Cassandra, DynamoDB, and many CDNs. Redis Cluster uses a related but different scheme based on fixed hash slots.

Implementation in JavaScript

Here's a complete implementation of consistent hashing. Please note that this is of course simplified.

const crypto = require("crypto");

class ConsistentHash {
  constructor(virtualNodes = 150) {
    this.virtualNodes = virtualNodes;
    this.ring = new Map(); // position -> server
    this.servers = new Set();
    this.sortedPositions = []; // sorted array of positions for binary search
  }

  // Hash function using MD5
  hash(key) {
    return parseInt(
      crypto.createHash("md5").update(key).digest("hex").substring(0, 8),
      16
    );
  }

  // Add a server to the ring
  addServer(server) {
    if (this.servers.has(server)) {
      console.log(`Server ${server} already exists`);
      return;
    }

    this.servers.add(server);

    // Add virtual nodes for this server
    for (let i = 0; i < this.virtualNodes; i++) {
      const virtualKey = `${server}:${i}`;
      const position = this.hash(virtualKey);
      this.ring.set(position, server);
    }

    this.updateSortedPositions();
    console.log(
      `Added server ${server} with ${this.virtualNodes} virtual nodes`
    );
  }

  // Remove a server from the ring
  removeServer(server) {
    if (!this.servers.has(server)) {
      console.log(`Server ${server} doesn't exist`);
      return;
    }

    this.servers.delete(server);

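    // Remove all virtual nodes for this server. Only delete a position that
    // still maps to this server, in case of a hash collision with another one.
    for (let i = 0; i < this.virtualNodes; i++) {
      const virtualKey = `${server}:${i}`;
      const position = this.hash(virtualKey);
      if (this.ring.get(position) === server) {
        this.ring.delete(position);
      }
    }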

    this.updateSortedPositions();
    console.log(`Removed server ${server}`);
  }

  // Update sorted positions array for efficient lookups
  updateSortedPositions() {
    this.sortedPositions = Array.from(this.ring.keys()).sort((a, b) => a - b);
  }

  // Find which server should handle this key
  getServer(key) {
    if (this.sortedPositions.length === 0) {
      throw new Error("No servers available");
    }

    const position = this.hash(key);

    // Binary search for the first position >= our hash
    let left = 0;
    let right = this.sortedPositions.length - 1;

    while (left < right) {
      const mid = Math.floor((left + right) / 2);
      if (this.sortedPositions[mid] < position) {
        left = mid + 1;
      } else {
        right = mid;
      }
    }

    // If we're past the last position, wrap around to the first
    const serverPosition =
      this.sortedPositions[left] >= position
        ? this.sortedPositions[left]
        : this.sortedPositions[0];

    return this.ring.get(serverPosition);
  }

  // Get distribution statistics
  getDistribution() {
    const distribution = {};
    this.servers.forEach((server) => {
      distribution[server] = 0;
    });

    // Test with 10000 sample keys
    for (let i = 0; i < 10000; i++) {
      const key = `key_${i}`;
      const server = this.getServer(key);
      distribution[server]++;
    }

    return distribution;
  }

  // Show ring state (useful for debugging)
  showRing() {
    console.log("\nRing state:");
    this.sortedPositions.forEach((pos) => {
      console.log(`Position ${pos}: ${this.ring.get(pos)}`);
    });
  }
}

// Example usage and testing
function demonstrateConsistentHashing() {
  console.log("=== Consistent Hashing Demo ===\n");

  const hashRing = new ConsistentHash(3); // 3 virtual nodes per server for clearer demo

  // Add initial servers
  console.log("1. Adding initial servers...");
  hashRing.addServer("server1");
  hashRing.addServer("server2");
  hashRing.addServer("server3");

  // Test key distribution
  console.log("\n2. Testing key distribution with 3 servers:");
  const events = [
    "event_1234",
    "event_5678",
    "event_9999",
    "event_4567",
    "event_8888",
  ];

  const before = {}; // remember the assignments so we can compare later
  events.forEach((event) => {
    const server = hashRing.getServer(event);
    const hash = hashRing.hash(event);
    before[event] = server;
    console.log(`${event} (hash: ${hash}) -> ${server}`);
  });

  // Show distribution statistics
  console.log("\n3. Distribution across 10,000 keys:");
  let distribution = hashRing.getDistribution();
  Object.entries(distribution).forEach(([server, count]) => {
    const percentage = ((count / 10000) * 100).toFixed(1);
    console.log(`${server}: ${count} keys (${percentage}%)`);
  });

  // Add a new server and see minimal redistribution
  console.log("\n4. Adding server4...");
  hashRing.addServer("server4");

  console.log("\n5. Same events after adding server4:");
  let moved = 0;

  events.forEach((event) => {
    const newServer = hashRing.getServer(event);
    const hash = hashRing.hash(event);
    // Compare against the assignment recorded before server4 joined
    const hasMoved = before[event] !== newServer;
    if (hasMoved) moved++;
    const status = hasMoved ? `MOVED from ${before[event]}` : "stayed";
    console.log(`${event} (hash: ${hash}) -> ${newServer} (${status})`);
  });
  console.log(`Result: ${moved}/${events.length} events moved`);

  console.log("\n6. New distribution with 4 servers:");
  distribution = hashRing.getDistribution();
  Object.entries(distribution).forEach(([server, count]) => {
    const percentage = ((count / 10000) * 100).toFixed(1);
    console.log(`${server}: ${count} keys (${percentage}%)`);
  });

  // Remove a server
  console.log("\n7. Removing server2...");
  hashRing.removeServer("server2");

  console.log("\n8. Distribution after removing server2:");
  distribution = hashRing.getDistribution();
  Object.entries(distribution).forEach(([server, count]) => {
    const percentage = ((count / 10000) * 100).toFixed(1);
    console.log(`${server}: ${count} keys (${percentage}%)`);
  });
}

// Demonstrate the redistribution problem with simple modulo
function demonstrateSimpleHashing() {
  console.log("\n=== Simple Hash + Modulo (for comparison) ===\n");

  function simpleHash(key) {
    return parseInt(
      crypto.createHash("md5").update(key).digest("hex").substring(0, 8),
      16
    );
  }

  function getServerSimple(key, numServers) {
    return `server${(simpleHash(key) % numServers) + 1}`;
  }

  const events = [
    "event_1234",
    "event_5678",
    "event_9999",
    "event_4567",
    "event_8888",
  ];

  console.log("With 3 servers:");
  const assignments3 = {};
  events.forEach((event) => {
    const server = getServerSimple(event, 3);
    assignments3[event] = server;
    console.log(`${event} -> ${server}`);
  });

  console.log("\nWith 4 servers:");
  let moved = 0;
  events.forEach((event) => {
    const server = getServerSimple(event, 4);
    if (assignments3[event] !== server) {
      console.log(`${event} -> ${server} (MOVED from ${assignments3[event]})`);
      moved++;
    } else {
      console.log(`${event} -> ${server} (stayed)`);
    }
  });

  console.log(
    `\nResult: ${moved}/${events.length} events moved (${(
      (moved / events.length) *
      100
    ).toFixed(1)}%)`
  );
}

// Run the demonstrations
demonstrateConsistentHashing();
demonstrateSimpleHashing();

Code Notes

The implementation has several key components:

Hash Function: Uses MD5 to convert keys into positions on the ring. In production, you might use faster hashes like Murmur3.

Virtual Nodes: Each server gets multiple positions on the ring (150 by default) to ensure better load distribution.

Binary Search: Finding the right server uses binary search on sorted positions for O(log n) lookup time.

Ring Management: Adding/removing servers updates the ring and maintains the sorted position array.

Do not use this code in production, it's just sample code. A few things that you should do differently in a real implementation:

Hash Function: Use faster hashes like Murmur3 or xxHash instead of MD5
Virtual Nodes: More virtual nodes (100-200) provide better distribution
Persistence: Store ring state in a distributed configuration system
Replication: Combine with replication strategies for fault tolerance

Single Computers vs Distributed Systems: Why Everything Gets Complicated

Lukas Niessen — Sun, 25 May 2025 06:54:57 +0000

The Predictable World of Single Computers

A single machine, for example your laptop, is pretty straightforward. When you run a program, it either works or it doesn't, in the sense that if something goes wrong, the whole system usually crashes rather than giving you wrong answers. And by going wrong I don't mean software bugs, because from the point of view of the OS and the hardware a buggy program is simply doing what it was told. I mean faults at the hardware level.

And this is actually by design. Computers are built to fail rather than produce false results.

This deterministic behavior hides all the messy details of physical hardware. Your CPU might have tiny manufacturing defects, your RAM might have occasional bit flips, but the system handles these gracefully by either working correctly or failing completely.

Distributed Systems

The moment you connect multiple computers over a network, everything changes. Now you're dealing with the messy reality of the physical world, and things get unpredictable fast.

Of course every node itself is still deterministic, but there are many other issues: network cables fail, power goes out in one data center but not another, and that innocent-looking Ethernet switch might just decide to drop packets for no good reason. Suddenly you have partial failures - some parts of your system work while others don't, and you might not even know which is which.

This creates nondeterministic behavior. Did your database write succeed? Maybe. Is the user's payment processed? Who knows - the network timeout doesn't tell you if the transaction went through or not.
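
Here is a toy Python simulation of exactly this situation. The `PaymentServer` class is made up, and the idempotency key is just one possible way to make a retry safe, not something this article prescribes.

```python
class PaymentServer:
    """Toy server that always applies the payment but may lose the response."""

    def __init__(self):
        self.processed = {}  # idempotency key -> amount
        self.drop_next_response = False

    def pay(self, idempotency_key, amount):
        # Applying the same key twice has no additional effect.
        self.processed.setdefault(idempotency_key, amount)
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("no response received")
        return "ok"


server = PaymentServer()
server.drop_next_response = True

try:
    server.pay("order-42", 100)
except TimeoutError:
    # The client cannot tell whether the payment went through...
    print("timeout: outcome unknown to the client")

# ...but the server did process it.
print(server.processed)  # {'order-42': 100}

# Retrying with the same key is safe because the operation is idempotent.
print(server.pay("order-42", 100))  # ok
print(len(server.processed))        # 1
```

From the client's side, a lost request and a lost response look exactly the same. That is the whole problem.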

Two Different Approaches to Handling This Mess

When you're building large-scale systems, you basically have two philosophical approaches to dealing with these inevitable failures.

The HPC Approach: If Anything Breaks, Start Over

High-performance computing systems like supercomputers take the single-computer approach and scale it up. They use specialized, expensive, reliable hardware with fancy interconnects like shared memory and RDMA for lightning-fast communication.

When something goes wrong, they don't mess around with partial failures. Instead, they:

Stop everything
Roll back to the last checkpoint
Restart the entire computation

It's like treating a 10,000-node supercomputer as one giant computer that either works or doesn't. This makes sense when you're running a weather simulation that can afford to restart, but it doesn't work so well for systems that need to stay online.

This is the same mindset as vertical scaling: treat the whole system as one bigger, more reliable machine.

The Cloud Approach: Keep Going No Matter What

Cloud computing takes the opposite approach. Instead of expensive, reliable hardware, cloud systems use cheap commodity machines that fail all the time. The assumption is that something is always breaking somewhere.

Rather than stopping everything when a node fails, cloud systems are designed to:

Keep running even when parts fail
Route around problems automatically
Replace broken components without downtime
Handle rolling updates and gradual replacements

This approach powers most of the internet services you use. When you're browsing Netflix, some server somewhere is probably crashing, but you never notice because the system just routes around the problem.

This approach goes hand in hand with horizontal scaling: you add more cheap machines instead of buying a bigger one.

Traditional Enterprise: The Middle Ground

Most corporate data centers fall somewhere between these approaches, trying to balance reliability with cost. They're more reliable than cloud commodity hardware but less specialized than supercomputers.

Why Cloud Systems Are So Much Harder to Build

Here's the thing about building fault-tolerant distributed systems: they're ridiculously complex.

Scale makes everything worse. A system with 1,000 nodes has way more failure modes than a system with 10 nodes. Components fail more often, network partitions happen regularly, and debugging becomes a nightmare.

Cheap hardware fails constantly. Cloud providers use commodity hardware because it's cost-effective, but that means higher failure rates. The system has to be smart enough to work around constant small failures.

Network complexity explodes. Instead of the specialized network topologies supercomputers use (like fancy meshes and toruses), cloud systems rely on IP/Ethernet with Clos network topologies to handle massive bandwidth needs.

But here's the counterintuitive part: you can actually build incredibly reliable systems from unreliable components. Think about it - TCP gives you reliable data transmission over the unreliable internet, and error-correcting codes let you store data reliably on imperfect storage devices.
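
To get a feeling for how that can work, here is about the simplest error-correcting code there is, a three-times repetition code, sketched in Python. Real storage devices use far more efficient codes. This is only meant to show the principle.

```python
def encode(bits):
    # Repetition code: store every bit three times.
    return [b for bit in bits for b in (bit, bit, bit)]


def decode(stored):
    # Majority vote over each group of three corrects one flipped bit per group.
    return [
        1 if sum(stored[i:i + 3]) >= 2 else 0
        for i in range(0, len(stored), 3)
    ]


message = [1, 0, 1, 1]
stored = encode(message)

stored[4] ^= 1  # simulate a bit flip on the imperfect storage device

print(decode(stored) == message)  # True
```

Every single copy is unreliable, but the combination survives a bit flip.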

The Bottom Line for System Design

If you're building any kind of distributed system, you need to accept that partial failures will happen. There's no avoiding it.

Even small systems need to plan for faults because they will eventually occur. You can't just hope your two-server setup will never have problems - it will, and probably sooner than you think.

The key is building fault tolerance into your software design from the beginning. Test for weird failure scenarios, not just the obvious ones. Plan for network splits, slow nodes, and all the other fun surprises distributed systems throw at you.

An interesting read on this topic is Netflix's Chaos Monkey.

ELI5: CAP Theorem in System Design

Lukas Niessen — Sat, 24 May 2025 16:31:46 +0000

This is a super simple ELI5 explanation of the CAP Theorem. After that, I explain a common misunderstanding that you should watch out for, and lastly, I give two system design examples where the CAP Theorem is used to make design decisions.

Super simple explanation

C = Consistency = Every user gets the same data
A = Availability = Users can always retrieve the data
P = Partition tolerance = Even if there are network issues, the system still works

Now the CAP Theorem states that in a distributed system, you need to decide whether you want consistency or availability. You cannot have both.

Questions

And in non-distributed systems? CAP Theorem only applies to distributed systems. If you only have one database, you can totally have both. (Unless that DB server is down, obviously; then you have neither.)

Is this always the case? No, if everything is green, we have both consistency and availability. However, if a server loses internet access, for example, or any other fault occurs, THEN we can have only one of the two: either consistency or availability.

Example

As I said already, the problem only arises when we have some sort of fault. Let's look at this example.

    US (Master)                    Europe (Replica)
   ┌─────────────┐                ┌─────────────┐
   │             │                │             │
   │  Database   │◄──────────────►│  Database   │
   │   Master    │    Network     │   Replica   │
   │             │  Replication   │             │
   └─────────────┘                └─────────────┘
        │                              │
        │                              │
        ▼                              ▼
   [US Users]                     [EU Users]

Normal operation: Everything works fine. US users write to master, changes replicate to Europe, EU users read consistent data.

Network partition happens: The connection between US and Europe breaks.

    US (Master)                    Europe (Replica)
   ┌─────────────┐                ┌─────────────┐
   │             │    ╳╳╳╳╳╳╳     │             │
   │  Database   │◄────╳╳╳╳╳─────►│  Database   │
   │   Master    │    ╳╳╳╳╳╳╳     │   Replica   │
   │             │    Network     │             │
   └─────────────┘     Fault      └─────────────┘
        │                              │
        │                              │
        ▼                              ▼
   [US Users]                     [EU Users]

Now we have two choices:

Choice 1: Prioritize Consistency (CP)

EU users get error messages: "Database unavailable"
Only US users can access the system
Data stays consistent but availability is lost for EU users

Choice 2: Prioritize Availability (AP)

EU users can still read/write to the EU replica
US users continue using the US master
Both regions work, but data becomes inconsistent (EU might have old data)
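The two choices can be sketched in a few lines of Python. This is only a toy model, not how a real database implements it: the `Replica` class, its `mode` setting, and the sample key are made up for illustration.

```python
class Replica:
    """Toy EU replica that may lose contact with the US master."""

    def __init__(self, mode, data):
        self.mode = mode            # "CP" or "AP" (hypothetical setting)
        self.data = dict(data)      # last state replicated before the fault
        self.partitioned = False

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # Refuse to answer rather than risk returning stale data.
            raise ConnectionError("Database unavailable")
        # AP (or no partition): answer from local state, possibly stale.
        return self.data.get(key)


cp = Replica("CP", {"seat_12A": "free"})
ap = Replica("AP", {"seat_12A": "free"})
cp.partitioned = ap.partitioned = True

print(ap.read("seat_12A"))   # served locally, may be outdated
try:
    cp.read("seat_12A")
except ConnectionError as err:
    print("CP replica:", err)
```

The AP replica keeps answering with whatever it last saw, while the CP replica gives up availability until the connection is back.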

What are Network Partitions?

Network partitions are when parts of your distributed system can't talk to each other. Think of it like this:

Your servers are like people in different rooms
Network partitions are like the doors between rooms getting stuck
People in each room can still talk to each other, but can't communicate with other rooms

Common causes:

Internet connection failures
Router crashes
Cable cuts
Data center outages
Firewall issues

The key thing is: partitions WILL happen. It's not a matter of if, but when.

The "2 out of 3" Misunderstanding

CAP Theorem is often presented as "pick 2 out of 3." This is wrong.

Partition tolerance is not optional. In distributed systems, network partitions will happen. You can't choose to "not have" partitions - they're a fact of life, like rain or traffic jams... :-)

So our choice is: When a partition happens, do you want Consistency OR Availability?

CP Systems: When a partition occurs → node stops responding to maintain consistency
AP Systems: When a partition occurs → node keeps responding but users may get inconsistent data

In other words, it's not "pick 2 out of 3," it's "partitions will happen, so pick C or A."

System Design Example 1: Video Streaming Platform

Scenario: Building Netflix

Decision: Prioritize Availability (AP)

Why? If some users see slightly outdated movie names for a few seconds, it's not a big deal. But if the users cannot watch movies at all, they will be very unhappy.

System Design Example 2: Flight Booking System

Here, we will not apply the CAP Theorem to the entire system but to parts of it. So we have two different parts with different priorities:

Part 1: Flight Search

Scenario: Users browsing and searching for flights

Decision: Prioritize Availability

Why? Users want to browse flights even if prices/availability might be slightly outdated. Better to show approximate results than no results.

Part 2: Flight Booking

Scenario: User actually purchasing a ticket

Decision: Prioritize Consistency

Why? If we prioritized availability here, we might sell the same seat to two different users. Very bad. We need strong consistency here.

PS: Architectural Quantum

What I just described, having two different scopes, is the concept of having more than one architecture quantum. There is a lot of interesting stuff online to read about the concept of architecture quanta :-)

System Design: Choosing the Right Dataflow

Lukas Niessen — Mon, 19 May 2025 19:17:35 +0000

When building a system, a very important decision is how data moves between components. I will walk through

Database dataflow,
Service calls (REST, SOAP, GraphQL, RPC),
Messaging (message queues).

Dataflow Through Databases

First off: The DB is almost always deployed separately, that is, on a different machine or in a different container. There are many reasons for this, including the following:

Scalability: You can scale your app and database independently. For example, spin up more app servers without touching the DB.
Fault Tolerance: You can restart your app without risking the DB, or upgrade the DB without downtime.
Security: You can isolate the DB behind a firewall or private subnet for tighter access control.

(An exception to this is obviously SQLite or other embedded DBs.)

So we communicate with our DB over the network.

Protocols

Which protocols are used? Most databases don't use HTTP. They rely on custom binary protocols for performance. At the transport layer, TCP is used due to its reliability.

Other Considerations

When storing data in a database, it is encoded into a format suitable for the database (e.g. UTF-8 for text in PostgreSQL, BSON for documents in MongoDB). When retrieved, the data is decoded for use by the application.

Something that makes databases unique is that they often store data for decades, such as user profiles or posts from 10 years ago, and they store it in the original encoding or schema. This means the data must remain accessible even as applications evolve, which is often summarized as "data outlives code". So we need to ensure backward compatibility (new code can read old data) and forward compatibility (old code can work with new data formats or schemas). This is achieved through strategies like:

Dataflow Through Services

While there are many more, I will cover mainly REST vs SOAP vs RPC and I will mention GraphQL without going deep there. Let's first get the terms straight.

Service

Service = An API exposed by a server

The most common type of communication is between clients and servers. The server exposes an API, and clients use that API to make requests to the server over the network.

Web Service

Web Service = Service that uses HTTP

Although web services are not used exclusively on the web, this is still the name they received. A reminder of how the web works: clients (web browsers) send requests to web servers, for example HTTP GET requests to download HTML, CSS, JavaScript or images, or HTTP POST requests to submit data, such as a registration form.

REST (Representational State Transfer)

REST is not a protocol. It's merely a way to design your service-client or service-service communication. The core ideas are:

Use simple data formats
Use URLs to identify resources
- So strictly speaking, /profiles/lukas is fine, but /profiles/get-profile is not. The latter is an action, not a resource.
Use HTTP features, for example for cache control, authentication, or content type negotiation.

REST responses are typically in JSON. There is much more to REST, and common interpretations have moved pretty far from what it was originally meant to be, but that's a topic for a separate article.

REST has been gaining popularity compared to SOAP, especially for cross-organizational service integration. The main reason is its simplicity, support and interoperability.

SOAP (Simple Object Access Protocol)

SOAP is an XML-based protocol for making network API requests. Although most commonly used over HTTP, it aims to be independent from HTTP and avoids using most HTTP features. Instead, it comes with a complex set of related standards (the WS-* framework) that add various features.

The API of a SOAP web service is described using an XML-based language called the Web Services Description Language (WSDL). WSDL is not designed to be human-readable. Also, SOAP messages are often too complex to construct manually. So developers rely heavily on tool support, code generation, and IDEs. And that's the biggest problem: for programming languages not supported by SOAP vendors, integration is difficult. Although SOAP is standardized, interoperability between different vendors' implementations often causes problems.

So in a nutshell, SOAP is much more complex and depends on tooling. This often causes problems and is the main reason SOAP lost popularity (however, it's still used in many places).

RPC (Remote Procedure Call)

Now some clients are not a browser or native app or the like. In fact, applications or services can be clients as well. For example, when your service makes a call to another service, your service is a client in the context of this call. So how do we approach such communication?

RPC is intended for such remote calls, and the idea is to make them look like local function calls. However, it's important to clarify that this never fully works, for a multitude of reasons. Here are some:

Predictability: Local function calls are predictable, succeeding or failing based on controlled parameters. Network requests are unpredictable due to potential network issues, requiring retries.
Outcomes: Local calls return a result or throw an exception. Network requests may timeout without a result, leaving uncertainty about request success.
Retries: Retrying network requests risks multiple executions if responses are lost (there are workarounds).
Response times: Local calls have consistent execution times. Network requests are slower with variable latency, ranging from milliseconds to seconds based on network conditions.
Parameter Passing: Local calls efficiently pass object references. Network requests require encoding parameters into bytes, which is problematic for large objects.
Language differences: RPC frameworks must translate datatypes across languages (e.g. from Java to TypeScript), which can be complex.
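To make the "Retries" point concrete, here is a minimal sketch of one common workaround: the client attaches a request ID and the server deduplicates on it. Everything here is made up for illustration (`PaymentServer`, `call_with_retries`, the simulated lost responses); no real RPC framework is involved.

```python
import uuid


class PaymentServer:
    """Toy server that remembers request IDs it has already processed."""

    def __init__(self):
        self.processed = {}   # request_id -> stored result
        self.charges = 0

    def charge(self, request_id, amount):
        if request_id in self.processed:
            # Duplicate caused by a retry: return the stored result, do nothing else.
            return self.processed[request_id]
        self.charges += 1
        result = f"charged {amount}"
        self.processed[request_id] = result
        return result


def call_with_retries(server, amount, lost_responses, max_attempts=5):
    """Retry until a response arrives; the first `lost_responses` replies are dropped."""
    request_id = str(uuid.uuid4())   # the same ID is reused for every retry
    for attempt in range(max_attempts):
        result = server.charge(request_id, amount)   # the request reaches the server
        if attempt < lost_responses:
            continue   # simulate a response lost on the way back (client times out)
        return result
    raise TimeoutError("no response after retries")


server = PaymentServer()
print(call_with_retries(server, 100, lost_responses=2))
print(server.charges)   # the server executed the charge only once
```

Without the request ID, the same three attempts would have charged the customer three times.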

The most common RPC framework is gRPC (by Google). It uses Protocol Buffers for encoding the transmitted data.

The main use of RPC is for requests between services owned by the same organization, usually within the same datacenter, for example between microservices. When done right, e.g. by using gRPC in the right way, the requests can be very fast (often noticeably faster than a JSON-based RESTful API). More on this later.

GraphQL

GraphQL is a query language for APIs and a runtime for executing those queries. Unlike REST, where each endpoint returns a fixed data structure, GraphQL lets clients specify exactly what data they need.

In simple terms, GraphQL works like this:

The client sends a single query describing the data it needs
The server returns exactly that data, nothing more, nothing less
Everything happens over a single endpoint (typically /graphql)
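The idea of "the client describes the shape, the server returns exactly that" can be imitated with plain dictionaries. This is a toy sketch and not a GraphQL implementation; `select_fields` and the sample user record are invented for the example.

```python
def select_fields(record, selection):
    """Return only the requested fields; nested dicts in `selection` select sub-fields."""
    result = {}
    for field, sub in selection.items():
        value = record[field]
        result[field] = select_fields(value, sub) if isinstance(sub, dict) else value
    return result


# Made-up sample data standing in for what the server knows about a user.
user = {
    "name": "Lukas",
    "email": "lukas@example.com",
    "address": {"city": "Berlin", "zip": "10115"},
}

# Roughly corresponds to the GraphQL query: { user { name address { city } } }
query = {"name": True, "address": {"city": True}}
print(select_fields(user, query))
```

The email and zip code never leave the server because the client did not ask for them.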

GraphQL was developed by Facebook in 2012 and released as open source in 2015. It arose from the need to efficiently fetch data for mobile applications with varying requirements and limited bandwidth.

Key advantages of GraphQL include:

Precise data fetching: Clients get exactly what they ask for, reducing over-fetching or under-fetching of data
Single request: Clients can retrieve multiple resources in a single request
Strong typing: The schema defines what queries are possible, enabling better tooling and validation
Versioning: Fields can be deprecated without breaking existing queries

The main trade-offs include:

More complex server implementation than simple REST
Potential performance issues with deeply nested queries
Caching is more challenging than with REST
Learning curve for teams new to the technology

GraphQL is particularly well-suited for complex UIs with changing data requirements and diverse clients with different needs, like Facebook or Instagram for example.

Serialization and De-serialization

Serialization = converting data structures into a transmittable format

De-serialization = reconstructing data structures from the received format

We clearly need serialization/de-serialization when communicating over the network. So it's important to get it right. There are studies, for example this one, which show that often serialization/de-serialization accounts for 80% or more of the communication time in microservice architectures.

This shows the importance of using gRPC (or other frameworks). There we still have serialization/de-serialization but it's very efficient - much faster than typical JSON serialization/de-serialization with RESTful APIs for example.
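As a rough illustration of the difference between a text format and a compact binary format, here is a small sketch using only Python's standard library. Note the assumptions: `struct` with a fixed layout is just a stand-in for a schema-based binary encoding, it is not Protocol Buffers, and the sensor reading is made-up data. The sketch compares message size only and says nothing about speed.

```python
import json
import struct

reading = {"sensor_id": 42, "temperature": 21.5, "ok": True}

# Text encoding: self-describing, the field names travel with every message.
as_json = json.dumps(reading).encode("utf-8")

# Binary encoding with a layout agreed on in advance (a stand-in for a schema):
# unsigned 32-bit int, 64-bit float, bool; little-endian, no padding.
LAYOUT = "<Id?"
as_binary = struct.pack(LAYOUT, reading["sensor_id"], reading["temperature"], reading["ok"])

print(len(as_json), len(as_binary))

# De-serialization reverses each step.
assert json.loads(as_json) == reading
sensor_id, temperature, ok = struct.unpack(LAYOUT, as_binary)
assert (sensor_id, temperature, ok) == (42, 21.5, True)
```

The binary message is smaller because both sides already know the field names and types, so only the values are sent.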

Dataflow Through Messaging

At its simplest, messaging is like leaving a note for someone when they're not immediately available. Rather than connecting directly, you drop a message in a queue and trust it will be delivered.

Super Brief Explanation

Messaging systems operate with a few key components:

Producers: Systems that generate messages
Consumers: Systems that process messages
Brokers: The middleware that stores and routes messages
Queues/Topics: Named destinations for messages
- Queues: Where each message is consumed by a single recipient (point-to-point)
- Topics/Exchanges: Where messages can be broadcast to multiple subscribers (publish-subscribe)

When using a messaging system, the sender doesn't need to know about the recipient. It simply publishes a message and moves on. This is also called fire-and-forget.
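The difference between queues and topics is easy to see in a tiny in-memory sketch. The `Broker` class below is a toy made up for this article; real brokers such as RabbitMQ or Kafka add persistence, acknowledgements, and networking on top.

```python
from collections import defaultdict, deque


class Broker:
    """Toy in-memory broker with point-to-point queues and pub-sub topics."""

    def __init__(self):
        self.queues = defaultdict(deque)       # queue name -> pending messages
        self.subscribers = defaultdict(list)   # topic name -> subscriber inboxes

    # Point-to-point: each message is taken by exactly one consumer.
    def send(self, queue, message):
        self.queues[queue].append(message)

    def receive(self, queue):
        return self.queues[queue].popleft() if self.queues[queue] else None

    # Publish-subscribe: every subscriber gets its own copy.
    def subscribe(self, topic):
        inbox = deque()
        self.subscribers[topic].append(inbox)
        return inbox

    def publish(self, topic, message):
        for inbox in self.subscribers[topic]:
            inbox.append(message)


broker = Broker()
broker.send("orders", "order-1")     # the producer does not know the consumer
first = broker.receive("orders")     # one consumer takes the message
second = broker.receive("orders")    # nothing is left for anyone else

email_inbox = broker.subscribe("signups")
stats_inbox = broker.subscribe("signups")
broker.publish("signups", "user-7")
print(first, second, list(email_inbox), list(stats_inbox))
```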

Strengths and Trade-offs

Strengths:

Decoupling: Services don't need to know about each other's location or implementation
Resilience: System can continue functioning even if some components are down
Buffering: Great for elasticity, we can handle traffic spikes by queuing messages before consuming them

Trade-offs:

Increased complexity: Additional infrastructure to maintain and monitor
Eventual consistency: Messages are processed asynchronously, so data may be temporarily inconsistent (there are workarounds, though)
Debugging challenges: Message flow can be harder to trace than direct calls, testing is difficult too
Ordering guarantees: Most systems provide only limited message ordering guarantees

Some common messaging systems are RabbitMQ, Apache Kafka, ActiveMQ, and cloud offerings like AWS SQS/SNS, Google Pub/Sub, and Azure Service Bus.

ELI5: What exactly are ACID and BASE Transactions?

Lukas Niessen — Sun, 18 May 2025 23:41:52 +0000

In this article, I will cover ACID and BASE transactions. First I give an easy ELI5 explanation and then a deeper dive. At the end, I show code examples.

What is ACID, what is BASE?

When we say a database supports ACID or BASE, we mean it supports ACID transactions or BASE transactions.

ACID

An ACID transaction is simply writing to the DB, but with these guarantees:

1. Write it all or nothing; writing A but not B cannot happen.
2. If someone else writes at the same time, make sure it still works properly.
3. Make sure the write stays.

Concretely, ACID stands for:

A = Atomicity = all or nothing (point 1)

C = Consistency

I = Isolation = parallel writes work fine (point 2)

D = Durability = write should stay (point 3)

BASE

A BASE transaction is again simply writing to the DB, but with weaker guarantees. BASE lacks a clear definition. However, it stands for:

BA = Basically available

S = Soft state

E = Eventual consistency.

What these terms usually mean is:

Basically available just means the system prioritizes availability (see CAP theorem later).
Soft state means the system's state might not be immediately consistent and may change over time without explicit updates. (Particularly across multiple nodes, that is, when we have partitioning or multiple DBs)
Eventual consistency means the system becomes consistent over time, that is, at least if we stop writing. Eventual consistency is the only clearly defined part of BASE.

Notes

You surely noticed I didn't address the C in ACID: consistency. It means that data follows the application's rules (invariants). In other words, if a transaction starts with valid data and preserves these rules, the data stays valid. But this is not the database's responsibility, it's the application's. Atomicity, isolation, and durability are database properties, but consistency depends on the application. So the C doesn't really belong in ACID. Some argue the C was added to ACID to make the acronym work.

The name ACID was coined in 1983 by Theo Härder and Andreas Reuter. The intent was to establish clear terminology for fault-tolerance in databases. However, how we get ACID, that is ACID transactions, is up to each DB. For example, PostgreSQL implements ACID differently than MySQL, and certainly differently than MongoDB (which also supports ACID). Unfortunately, when a system claims to support ACID, it is therefore not fully clear which guarantees it actually provides, because ACID has become a marketing term to a degree.

And, as you saw, BASE has a very imprecise definition. One could say BASE means not-ACID.

Simple Examples

Here are a few standard examples of why ACID is important.

Atomicity

Imagine you're transferring $100 from your checking account to your savings account. This involves two operations:

1. Subtract $100 from checking
2. Add $100 to savings

Without transactions, if your bank's system crashes after step 1 but before step 2, you'd lose $100! With transactions, either both steps happen or neither happens. All or nothing - atomicity.

Isolation

Suppose two people are booking the last available seat on a flight at the same time.

Alice sees the seat is available and starts booking.
Bob also sees the seat is available and starts booking at the same time.

Without proper isolation, both transactions might think the seat is available and both might be allowed to book it—resulting in overbooking. With isolation, only one transaction can proceed at a time, ensuring data consistency and avoiding conflicts.

Durability

Imagine you've just completed a large online purchase and the system confirms your order.

Right after confirmation, the server crashes.

Without durability, the system might "forget" your order when it restarts. With durability, once a transaction is committed (your order is confirmed), the result is permanent—even in the event of a crash or power loss.

Code Snippet

A transaction might look like the following. Everything between BEGIN TRANSACTION and COMMIT is considered part of the transaction.

BEGIN TRANSACTION;

-- Subtract $100 from checking account
UPDATE accounts
SET balance = balance - 100
WHERE account_type = 'checking' AND account_id = 1;

-- Add $100 to savings account
UPDATE accounts
SET balance = balance + 100
WHERE account_type = 'savings' AND account_id = 1;

-- Ensure the account balances remain valid (Consistency)
-- Check if checking account balance is non-negative
DO $$
BEGIN
    IF (SELECT balance FROM accounts WHERE account_type = 'checking' AND account_id = 1) < 0 THEN
        RAISE EXCEPTION 'Insufficient funds in checking account';
    END IF;
END $$;

COMMIT;

COMMIT and ROLLBACK

Two essential commands that make ACID transactions possible are COMMIT and ROLLBACK:

COMMIT

When you issue a COMMIT command, it tells the database that all operations in the current transaction should be made permanent. Once committed:

Changes become visible to other transactions
The transaction cannot be undone
The database guarantees durability of these changes

A COMMIT represents the successful completion of a transaction.

ROLLBACK

When you issue a ROLLBACK command, it tells the database to discard all operations performed in the current transaction. This is useful when:

An error occurs during the transaction
Application logic determines the transaction should not complete
You want to test operations without making permanent changes

ROLLBACK ensures atomicity by preventing partial changes from being applied when something goes wrong.

Example with ROLLBACK:

-- PostgreSQL 11+: run this at top level (not inside an outer BEGIN ... COMMIT),
-- so that the DO block is allowed to issue ROLLBACK and COMMIT itself.
DO $$
BEGIN
    UPDATE accounts
    SET balance = balance - 100
    WHERE account_type = 'checking' AND account_id = 1;

    -- Check if balance is now negative
    IF (SELECT balance FROM accounts WHERE account_type = 'checking' AND account_id = 1) < 0 THEN
        -- Insufficient funds, cancel the transaction
        ROLLBACK;
        -- Transaction is aborted, no changes are made
    ELSE
        -- Add the amount to savings
        UPDATE accounts
        SET balance = balance + 100
        WHERE account_type = 'savings' AND account_id = 1;

        -- Complete the transaction
        COMMIT;
    END IF;
END $$;
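The same all-or-nothing behavior can be tried out locally with Python's built-in sqlite3 module. This is a minimal sketch: the simplified `accounts` table and the `transfer` helper are made up for the example and are not the schema used in the SQL above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_type TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("checking", 50), ("savings", 0)])
conn.commit()


def transfer(conn, amount):
    """Move `amount` from checking to savings, or change nothing at all."""
    try:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE account_type = 'checking'",
            (amount,),
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE account_type = 'checking'"
        ).fetchone()
        if balance < 0:
            raise ValueError("Insufficient funds in checking account")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE account_type = 'savings'",
            (amount,),
        )
        conn.commit()
        return True
    except ValueError:
        conn.rollback()   # also undoes the subtraction that already ran
        return False


print(transfer(conn, 100))   # more than the 50 available
print(dict(conn.execute("SELECT account_type, balance FROM accounts")))
```

After the failed transfer, both balances are exactly what they were before, even though the first UPDATE had already been executed.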

Why BASE?

BASE used to be important because many DBs, for example document-oriented DBs, did not support ACID. They had other advantages. Nowadays however, most document-oriented DBs support ACID.

So why even have BASE?

ACID can get really difficult with distributed DBs, for example when you have partitioning or a microservice architecture where each service has its own DB. If your transaction only writes to one partition (or DB), then there's no problem. But what if you have a transaction that spans across multiple partitions or DBs, a so-called distributed transaction?

The short answer is: we either work around it or we loosen our guarantees from ACID to ... BASE.

ACID in Distributed Databases

Let's address ACID one by one. Let's only consider partitioned DBs for now.

Atomicity

Difficult. If we do a write on partition A and it works but one on B fails, we're in trouble.

Isolation

Difficult. If we have multiple transactions concurrently access data across different partitions, it's hard to ensure isolation.

Durability

No problem since each node has durable storage.

What about Microservice Architectures?

Pretty much the same issues as with partitioned DBs. However, it gets even more difficult because microservices are independently developed and deployed.

Solutions

There are two primary approaches to handling transactions in distributed systems:

Two-Phase Commit (2PC)

Two-Phase Commit is a protocol designed to achieve atomicity in distributed transactions. It works as follows:

Prepare Phase: A coordinator node asks all participant nodes if they're ready to commit
- Each node prepares the transaction but doesn't commit
- Nodes respond with "ready" or "abort"

Commit Phase: The coordinator decides based on the responses
- If any node responded with "abort," all nodes are told to roll back
- If all nodes responded with "ready," all nodes are told to commit

2PC guarantees atomicity but has significant drawbacks:

It's blocking (participants must wait for coordinator decisions)
Performance overhead due to multiple round trips
Vulnerable to coordinator failures
Can lead to extended resource locking

Example of 2PC in pseudo-code:

// Coordinator
function twoPhaseCommit(transaction, participants) {
    // Phase 1: Prepare
    for each participant in participants {
        response = participant.prepare(transaction)
        if response != "ready" {
            for each participant in participants {
                participant.abort(transaction)
            }
            return "Transaction aborted"
        }
    }

    // Phase 2: Commit
    for each participant in participants {
        participant.commit(transaction)
    }
    return "Transaction committed"
}
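For readers who want to run it, here is a small Python version of the same idea. It is a sketch under simplifying assumptions: the `Participant` class is invented, there is no networking, no timeouts, and no coordinator failure handling. Unlike the pseudo-code, it collects all votes first and only then decides.

```python
class Participant:
    """Toy participant; `can_commit` controls how it votes in the prepare phase."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self, tx):
        # A real node would write the changes durably and hold locks here.
        self.state = "prepared" if self.can_commit else "aborted"
        return "ready" if self.can_commit else "abort"

    def commit(self, tx):
        self.state = "committed"

    def abort(self, tx):
        self.state = "aborted"


def two_phase_commit(tx, participants):
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare(tx) for p in participants]
    if all(vote == "ready" for vote in votes):
        # Phase 2, unanimous "ready": everyone commits.
        for p in participants:
            p.commit(tx)
        return "Transaction committed"
    # Phase 2, at least one "abort": everyone rolls back.
    for p in participants:
        p.abort(tx)
    return "Transaction aborted"


nodes = [Participant("A"), Participant("B", can_commit=False)]
print(two_phase_commit("tx-1", nodes))
print([p.state for p in nodes])
```

A single "abort" vote is enough to roll back node A as well, which is exactly the atomicity guarantee 2PC is after.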

Saga Pattern

The Saga pattern is a sequence of local transactions where each transaction updates a single node. After each local transaction, it publishes an event that triggers the next transaction. If a transaction fails, compensating transactions are executed to undo previous changes.

Forward transactions: T1, T2, ..., Tn
Compensating transactions: C1, C2, ..., Cn-1 (executed if something fails)

For example, an order processing flow might have these steps:

Create order
Reserve inventory
Process payment
Ship order

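If the payment fails, compensating transactions undo the steps that already completed, in reverse order (shipping was never scheduled, so there is nothing to cancel there):

Release inventory reservation
Cancel order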

Sagas can be implemented in two ways:

Choreography: Services communicate through events
Orchestration: A central coordinator manages the workflow

Example of a Saga in pseudo-code:

// Orchestration approach
function orderSaga(orderData) {
    try {
        orderId = orderService.createOrder(orderData)
        inventoryId = inventoryService.reserveItems(orderData.items)
        paymentId = paymentService.processPayment(orderData.payment)
        shippingId = shippingService.scheduleDelivery(orderId)
        return "Order completed successfully"
    } catch (error) {
        if (shippingId) shippingService.cancelDelivery(shippingId)
        if (paymentId) paymentService.refundPayment(paymentId)
        if (inventoryId) inventoryService.releaseItems(inventoryId)
        if (orderId) orderService.cancelOrder(orderId)
        return "Order failed: " + error.message
    }
}
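A runnable Python variant of the orchestration approach could look like the sketch below. The `run_saga` helper and the step names are made up for illustration, and the actions are empty placeholders where real service calls would go. It keeps a list of completed steps so that only those are compensated, in reverse order.

```python
def run_saga(steps):
    """Run (name, action, compensation) steps; undo completed ones in reverse on failure."""
    completed = []
    log = []
    for name, action, compensation in steps:
        try:
            action()
            log.append(f"done: {name}")
            completed.append((name, compensation))
        except Exception as error:
            log.append(f"failed: {name} ({error})")
            for done_name, compensate in reversed(completed):
                compensate()
                log.append(f"compensated: {done_name}")
            return False, log
    return True, log


def fail_payment():
    raise RuntimeError("card declined")


# Placeholder actions; in a real system these would call the respective services.
steps = [
    ("create order", lambda: None, lambda: None),
    ("reserve inventory", lambda: None, lambda: None),
    ("process payment", fail_payment, lambda: None),
    ("ship order", lambda: None, lambda: None),
]
ok, log = run_saga(steps)
print(ok)
print("\n".join(log))
```

When the payment step fails, the inventory reservation and the order are compensated, and the shipping step is never reached.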

What about Replication?

There are mainly three ways of replicating your DB: single-leader, multi-leader and leaderless. I will not address multi-leader.

Single-leader

ACID is not a concern here. If the DB supports ACID, replicating it won't change anything. You write to the leader via an ACID transaction and the DB will make sure the followers are updated. Of course, when we have asynchronous replication, followers can lag behind and serve stale reads. But this is not an ACID problem, it's an asynchronous replication problem.

Leaderless Replication

In leaderless replication systems (like Amazon's Dynamo or Apache Cassandra), ACID properties become more challenging to implement:

Atomicity: Usually limited to single-key operations
Consistency: Often relaxed to eventual consistency (BASE)
Isolation: Typically provides limited isolation guarantees
Durability: Achieved through replication to multiple nodes

This approach prioritizes availability and partition tolerance over consistency, aligning with the BASE model rather than strict ACID.

Conclusion

ACID provides strong guarantees but can be challenging to implement across distributed systems
BASE offers more flexibility but requires careful application design to handle eventual consistency

It's important to understand ACID vs BASE and the whys.

The right choice depends on your specific requirements:

Financial applications may need ACID guarantees
Social media applications might work fine with BASE semantics (at least most parts of it).

ELI5: Database Partitioning

Lukas Niessen — Sun, 18 May 2025 15:30:43 +0000

This article explains database partitioning in ELI5 style. On top of that, I also give a more thorough explanation of each topic. I will cover:

What is partitioning?
Why partition your database?
Partitioning vs. Replication
Key-Value partitioning strategies
Handling skewed workloads and hot spots
Secondary indexes with partitioning
Request routing and service discovery
Parallel query execution

What is Partitioning? ELI5

Say we run Facebook. There are way too many posts to store them on one computer. So we split them: some posts here, some posts there. Each split is a partition.

Partitioning = Splitting a database into smaller chunks across multiple machines

The first question is, of course, how do we partition? But there are other questions that need an answer. I will address them later.

Note: There are other names for partitions. For example, they are called shards (sharding) in MongoDB and Elasticsearch, and tablets in Bigtable. However, partitioning is the most established term.

Why Partition? ELI5

As said, the main reason for partitioning is scalability. When your data or query load gets too big for a single machine to handle, you need to break it up.

With partitioning:

You can store more data than fits on one machine
You can distribute query load across many processors
Different partitions can be placed on different nodes in a shared-nothing cluster

We call a DB holding a partition a node.

This has other advantages as well, for example parallelizing queries. For queries that only need data from a single partition, each node can independently handle its part, so you can scale query throughput by adding more nodes. Complex queries that span partitions are harder but can potentially be parallelized.

Partitioning vs. Replication

Partitioning is usually combined with replication. That means:

Partitioning splits the data into smaller subsets
Each partition is then replicated on multiple nodes

Even though each record belongs to only one partition, it may be stored on several different nodes for fault tolerance. A node may store more than one partition. In a leader-follower model, a node might be the leader for some partitions and a follower for others.

Key-Value Partitioning Strategies

Alright, so let's address the first question. How do we decide which records go on which nodes?

The goal is to spread data and query load evenly. If every node takes a fair share, then 5 nodes should theoretically handle 5 times as much data and throughput as one node.

If the partitioning is uneven, so some partitions have more data or queries than others, we call it skewed. An extreme case of skew is a hot spot, so in other words, a partition with disproportionately high load.

1. Partitioning by Key Range

How it works: Assign each partition a continuous range of keys (from some minimum to some maximum).

Example: With movies, you could use the first letter of the title as the key. So movies starting with A are stored on one node, those starting with B on another, and so on.

Partition boundaries can be chosen:

Manually by an administrator
Automatically by the database

This approach is used by:

Bigtable and HBase
RethinkDB
MongoDB (before version 2.4)

Within each partition, keys are kept in sorted order. This makes range scans efficient and enables treating the key as a concatenated index for fetching related records in one query.

Example: Sensor network data where the key is a timestamp. You can easily fetch all readings from a particular month.

Problem: Certain access patterns create hot spots. If the key is a timestamp, all writes go to the partition for "today" while other partitions sit idle.

Solution: Use something other than a timestamp as the first element of the key. For example, prefix each timestamp with the sensor name so partitioning happens first by sensor, then by time. This spreads write load across partitions.
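Here is a minimal sketch of key-range partitioning and of the sensor-prefix trick. The boundaries, the key format and the names `BOUNDARIES` and `partition_for` are assumptions made up for illustration, not taken from any particular database.

```python
from bisect import bisect_right

# Hypothetical exclusive upper bounds of the first three partitions;
# the fourth partition takes every key from "s" onwards.
BOUNDARIES = ["g", "m", "s"]

def partition_for(key: str) -> int:
    """Return the index of the partition whose key range contains `key`."""
    return bisect_right(BOUNDARIES, key.lower())

# Timestamp-only keys written on the same day all land in one partition ...
timestamp_keys = ["2025-10-20T10:00", "2025-10-20T10:01", "2025-10-20T10:02"]

# ... while prefixing the sensor name spreads the same writes out.
compound_keys = [
    "boiler:2025-10-20T10:00",
    "kitchen:2025-10-20T10:01",
    "roof:2025-10-20T10:02",
]
```

With these made-up boundaries, all three timestamp keys map to partition 0, while the compound keys map to partitions 0, 1 and 2. The price is that fetching one time range across all sensors now needs one range query per sensor.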

2. Partitioning by Hash of Key

How it works: Apply a hash function to keys and assign each partition a range of hash values.

A good hash function takes skewed data and makes it uniformly distributed. For example, Cassandra and MongoDB use this (with MD5).

This approach distributes keys fairly among partitions. The partition boundaries can be:

Evenly spaced
Chosen pseudorandomly (sometimes called consistent hashing)

The Big Tradeoff: By using a hash of the key, we lose the ability to do efficient range queries. Keys that were once adjacent are now scattered across partitions, and their sort order is lost.

In MongoDB with hash-based sharding, range queries must be sent to all partitions
Range queries on the primary key are not supported in Riak, Couchbase, or Voldemort
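As a rough sketch of hash partitioning with evenly spaced boundaries: the partition count and the function names below are assumptions for illustration. Real systems differ in the hash function and in how they assign hash ranges to partitions.

```python
import hashlib

NUM_PARTITIONS = 4  # assumed cluster size, for illustration only

def hash_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition by cutting the 128-bit MD5 space into equal ranges."""
    digest = int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")
    range_size = 2**128 // num_partitions
    # min() guards the last range when 2**128 is not divisible by num_partitions.
    return min(digest // range_size, num_partitions - 1)

def partitions_for_range_query(num_partitions: int = NUM_PARTITIONS) -> list[int]:
    """Adjacent keys are scattered, so a key-range query has to ask every partition."""
    return list(range(num_partitions))
```

The same key always maps to the same partition, but two keys that are neighbors in sort order usually do not. That is exactly why range queries get expensive.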

Handling Skewed Workloads and Hot Spots

Hashing helps reduce hot spots but can't eliminate them entirely. If all reads and writes target the same key, all requests still go to the same partition.

This happens in real life: a celebrity on social media doing something noteworthy can create a storm of activity on a single key (the celebrity's user ID or the ID of the action people are commenting on). Hashing doesn't help because identical IDs hash to the same value.

There are solutions for this too though, with tradeoffs of course.
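One commonly described technique is to split a hot key into several sub-keys by appending a small random number. The sketch below illustrates that idea only. `SALT_BUCKETS` and both function names are hypothetical, and the application has to know which keys are hot.

```python
import random

SALT_BUCKETS = 10  # assumed number of sub-keys per hot key

def salted_write_key(hot_key: str, rng: random.Random) -> str:
    """Pick one of SALT_BUCKETS sub-keys at random for a write to a hot key."""
    return f"{hot_key}#{rng.randrange(SALT_BUCKETS)}"

def salted_read_keys(hot_key: str) -> list[str]:
    """A read has to fetch every sub-key and merge the results."""
    return [f"{hot_key}#{i}" for i in range(SALT_BUCKETS)]
```

The tradeoff is visible in the two functions: writes are spread over up to ten keys (and therefore possibly ten partitions), but every read now costs ten lookups plus a merge.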

Request Routing and Service Discovery

We've now partitioned our dataset across multiple nodes. Great.

But there's an issue we haven't addressed yet. When a client wants to make a request, how does it know which node to connect to? For example, which IP address should it use?

This problem is generally (also outside of DBs) known as service discovery. There are three main approaches:

1. Routing Tier

All client requests go through a routing layer first
This layer determines the right node for each request and forwards accordingly
The routing tier is essentially a partition-aware load balancer
It doesn't handle requests itself

2. Any-Node Routing

Clients can contact any node (for example via a round-robin load balancer)
If that node has the partition for the request, it handles it directly
Otherwise, it forwards the request to the appropriate node and passes the reply back

3. Client-Aware Routing

Clients know about the partitioning scheme and partition-to-node mapping
They connect directly to the appropriate node without intermediaries

Many systems rely on a separate coordination service like ZooKeeper to track cluster metadata:

Each node registers in ZooKeeper
ZooKeeper maintains the authoritative partition-to-node mapping
Routing tiers or clients subscribe to this information
When partitions change ownership, ZooKeeper notifies subscribers

Examples:

LinkedIn's Espresso uses Helix (built on ZooKeeper)
HBase, SolrCloud, and Kafka use ZooKeeper directly
MongoDB uses its own config server implementation with mongos daemons as the routing tier

How Request Routing Works with ZooKeeper

So, you've partitioned your dataset across multiple nodes, and you're using a coordination service like ZooKeeper to manage cluster metadata. But how does a client actually get its request to the right node? Let's break down the flow, focusing on a system with a routing tier and ZooKeeper.

The Request Routing Flow

Here's how it typically works when a client makes a request in a distributed system with a routing tier and ZooKeeper:

Client Sends Request to Routing Tier

The client doesn't know which node holds the data it needs, so it sends its request (e.g., a database query) to a routing tier. This is a partition-aware load balancer, like MongoDB's mongos daemon or a custom proxy. The routing tier's job is to figure out where to send the request.

Routing Tier Consults ZooKeeper

The routing tier needs to know which node owns the partition for the requested data. It queries ZooKeeper, which maintains the authoritative partition-to-node mapping. ZooKeeper stores this metadata in a hierarchical structure (like a file system), updated whenever nodes join, leave, or partitions are reassigned. The routing tier either:

Caches this mapping and subscribes to ZooKeeper for updates (to stay current), or
Queries ZooKeeper on-demand for each request (less common due to latency).

ZooKeeper Provides Metadata

ZooKeeper responds with the current partition-to-node mapping. For example, it might say, "Partition P1 is on Node A (IP: 192.168.1.10), Partition P2 is on Node B (IP: 192.168.1.11)." This tells the routing tier exactly where to send the request.

Routing Tier Forwards the Request

Armed with the mapping, the routing tier forwards the client's request to the correct node (e.g., Node A). The node processes the request, interacts with the database, and returns the response to the routing tier.

Routing Tier Returns Response to Client

The routing tier passes the response back to the client, completing the request. From the client's perspective, it just sent a request and got a response, unaware of the coordination happening behind the scenes.
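The flow above can be sketched with an in-memory stand-in for the coordination service. Note that `CoordinationService` is a toy class invented for this example. It is not the ZooKeeper API, and `RoutingTier` only returns the target address instead of forwarding a real request.

```python
from typing import Callable, Dict, List

class CoordinationService:
    """In-memory stand-in for a service like ZooKeeper (not its real API)."""

    def __init__(self, mapping: Dict[str, str]) -> None:
        self._mapping = dict(mapping)
        self._subscribers: List[Callable[[Dict[str, str]], None]] = []

    def subscribe(self, callback: Callable[[Dict[str, str]], None]) -> None:
        self._subscribers.append(callback)
        callback(dict(self._mapping))  # deliver the current state right away

    def reassign(self, partition: str, node: str) -> None:
        """Change ownership of a partition and notify all subscribers."""
        self._mapping[partition] = node
        for notify in self._subscribers:
            notify(dict(self._mapping))


class RoutingTier:
    """Caches the partition-to-node mapping and picks the node for each request."""

    def __init__(self, coordination: CoordinationService) -> None:
        self._cache: Dict[str, str] = {}
        coordination.subscribe(self._on_mapping_changed)

    def _on_mapping_changed(self, mapping: Dict[str, str]) -> None:
        self._cache = mapping

    def route(self, partition: str) -> str:
        return self._cache[partition]


coordination = CoordinationService({"P1": "192.168.1.10", "P2": "192.168.1.11"})
router = RoutingTier(coordination)
```

This is the "cache and subscribe" variant: `route` never talks to the coordination service on the request path, it only reads the cached mapping that gets replaced whenever a partition changes ownership.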

ELI5: Database Replication

Lukas Niessen — Sun, 18 May 2025 13:50:51 +0000

This article explains database replication in ELI5 style. But not only that: I also cover each topic with a more thorough summary. I will cover:

What is replication?
Why replication?
Leader-based, multi-leader, and leaderless
Synchronous vs. asynchronous replication
How to handle node failures
Problems with replication lag
Setting up new replicas

What is Replication? ELI5

Replication = Keeping copies of the same data on multiple machines

Why Replication? ELI5

Three key reasons:

Latency: Keep data geographically close to users
- Let's say you're in China. Your app will use a nearby database server, ideally also in China. This is much faster than using one in the US for example.
Availability: Keep the system running even when some parts fail
- What if a DB server goes down? We will just connect to another one. More on this later.
Read throughput: Scale out machines serving read queries
- If you have many users, it's better to have more than one server. Imagine YouTube serving all content from a single computer - not possible. So they put the content on multiple machines and distribute serving content among them.

Where's the Challenge?

So this was super easy, but of course real life is not that easy. There are many difficulties in replicating your data. Too many to cover in this article so I will just cover the most important ones, the ones you should know about.

First off, if you have 5,000 terabytes of data, replicating all of it on every machine is probably too much. So you would also want to split the data between DBs. This is called partitioning and is ignored here on purpose. See my other article for that.

So, the main challenge with replication is handling changes. It's easy to copy data once to 10 different computers. But what do we do when new changes come in? There are 3 main approaches.

Leader-Based Replication

The most common approach is the leader-based model (also called master-slave). Here's how it works:

One node (computer with DB) is the leader (or master)
The other nodes are called followers (or slaves)
Clients send write requests only to the leader
- Followers are read only!
The leader writes to its local storage and sends changes to the followers
Followers apply these changes.

This is a high level overview skipping over details. There are still some key questions left, even on this high level.
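The write path described above can be sketched in a few lines. This is a toy in-memory model with invented class names. It ignores networking, failures and ordering, and simply forwards every change immediately.

```python
class Follower:
    """Read-only replica that applies changes sent by the leader."""

    def __init__(self) -> None:
        self.data = {}

    def apply(self, key, value) -> None:
        self.data[key] = value

    def read(self, key):
        return self.data.get(key)


class Leader:
    """Accepts writes, stores them locally, then forwards them to all followers."""

    def __init__(self, followers) -> None:
        self.data = {}
        self.followers = followers

    def write(self, key, value) -> None:
        self.data[key] = value            # 1. write to local storage
        for follower in self.followers:   # 2. send the change to every follower
            follower.apply(key, value)


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "Alice")
```

Clients only ever call `leader.write`, while reads can go to any `Follower`. That is what makes the followers useful for scaling read throughput.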

Synchronous vs. Asynchronous Replication

A critical decision in replication design is: should changes be applied synchronously or asynchronously?

With synchronous replication, the leader waits for the follower to confirm it received the write before confirming success to the client. This guarantees the follower's data is up-to-date with the leader.

With asynchronous replication, the leader doesn't wait for acknowledgment from followers. It processes the write locally and moves on; followers catch up when they can.

The Trade-offs:

Synchronous:
- Guarantees up-to-date copies
- Guarantees durability (that is, once a write is confirmed, we know it is actually persisted on more than one node)
- But it also means we are slower
- For example, just one slow follower means the whole system needs to wait. Other writes are blocked.
- As communication over the network is unreliable, this is a big tradeoff.
Asynchronous:
- Better performance
- Writes are not blocked, regardless of network conditions or other factors
- However, followers might lag behind (can be up to minutes)
- So we do not have consistency anymore, that is, you get result A from one DB but result B from another.
- Furthermore, what if the leader fails before a write has reached any follower? Then the write is lost, even though it was already confirmed to the client.
- So durability is not guaranteed

So both have big tradeoffs. Some systems use semi-synchronous replication: one follower is synchronous, the rest are asynchronous. This guarantees at least two nodes have the latest data without sacrificing too much performance.

However, most distributed systems use fully asynchronous replication. This is a conscious trade-off of durability for availability and performance.
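To make the difference concrete, here is a small simulation of semi-synchronous replication. All names are invented for this sketch, and the "network" is just a queue that gets drained when `deliver_pending` is called.

```python
from collections import deque

class Replica:
    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.data = {}
        self.pending = deque()  # changes sent but not yet applied (async only)

    def apply(self, key, value) -> None:
        self.data[key] = value

    def deliver_pending(self) -> None:
        """Simulate the follower catching up some time later."""
        while self.pending:
            self.apply(*self.pending.popleft())


class SemiSyncLeader:
    def __init__(self, replicas) -> None:
        self.data = {}
        self.replicas = replicas

    def write(self, key, value) -> str:
        self.data[key] = value
        for replica in self.replicas:
            if replica.synchronous:
                replica.apply(key, value)             # wait for this follower
            else:
                replica.pending.append((key, value))  # send and move on
        return "ok"  # confirmed once every synchronous follower has the write


sync_follower = Replica(synchronous=True)
async_follower = Replica(synchronous=False)
semi_sync_leader = SemiSyncLeader([sync_follower, async_follower])
semi_sync_leader.write("balance", 100)
```

Right after the write returns, `sync_follower` already has the value while `async_follower` is still empty. A read from the asynchronous follower at that moment returns stale data, and if the leader and the synchronous follower were both lost, the write would be gone.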

Setting Up New Followers

Sometimes you need new followers - maybe to increase read capacity or replace failed nodes. How do you set this up without downtime?

The conceptual process works like this:

Take a consistent snapshot of the leader's database
Copy the snapshot to the new follower
The follower connects to the leader and requests all changes since the snapshot
When the follower processes the backlog, it has "caught up" and can continue applying changes in real-time

This process varies significantly between database systems. Some automate it fully, while others require manual administrator intervention.
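The conceptual process can be sketched like this, assuming the leader keeps an append-only log and a snapshot remembers the log position it was taken at. The class and method names are made up for illustration.

```python
import copy

class LogLeader:
    """Leader that keeps an append-only log of (key, value) changes."""

    def __init__(self) -> None:
        self.data = {}
        self.log = []

    def write(self, key, value) -> None:
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        """Return a copy of the data plus the log position it corresponds to."""
        return copy.deepcopy(self.data), len(self.log)

    def changes_since(self, position: int):
        return self.log[position:]


class NewFollower:
    def __init__(self) -> None:
        self.data = {}
        self.position = 0

    def bootstrap(self, leader: LogLeader) -> None:
        # Take the snapshot and copy it over.
        self.data, self.position = leader.snapshot()

    def catch_up(self, leader: LogLeader) -> None:
        # Request and apply everything that happened since the snapshot.
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


log_leader = LogLeader()
log_leader.write("a", 1)
new_follower = NewFollower()
new_follower.bootstrap(log_leader)
log_leader.write("b", 2)   # writes keep arriving while the snapshot is copied
log_leader.write("a", 3)
new_follower.catch_up(log_leader)
```

The key detail is that the snapshot is tied to an exact position in the log. Without that position, the follower would not know which changes it still has to request.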

Handling Node Outages

Nodes fail. It's inevitable. Good replication systems should handle these failures. Fault tolerance is one of the main reasons for replication.

When a Follower Fails: Catch-up Recovery

This one is straightforward. When a follower recovers from a crash, it:

Checks its local log to find the last transaction it processed
Connects to the leader and requests all changes since that point
Applies these changes to catch up
Resumes normal operation

When a Leader Fails: Failover

This is a bit trickier. When a leader fails, we need to:

Detect the failure (usually via timeout)
Choose a new leader (usually the follower with the most up-to-date data)
Reconfigure the system to use the new leader
Handle client redirects to the new leader

This process is called failover and can be automatic or manual. But failover has issues:

With asynchronous replication, the new leader might be missing writes the old leader confirmed
Split-brain scenario: two nodes both think they're the leader
Setting the right timeout is hard - too short causes unnecessary failovers during temporary slowdowns

These aren't just theoretical concerns. A prominent example was GitHub: they had an incident where an out-of-date MySQL follower was promoted to leader. The database used auto-incrementing IDs, and the new leader reused primary keys that were previously assigned, causing inconsistency with their Redis store and exposing private data to the wrong users. GitHub described the incident in a post on their blog.

For these reasons, some teams prefer to manually trigger failovers, accepting a brief outage instead of risking data corruption.
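The first two failover steps (detect the failure, choose a new leader) can be sketched as follows. The timeout value, the `FollowerState` class and both function names are assumptions for this example. Real systems use consensus protocols for the election, which this sketch leaves out.

```python
FAILURE_TIMEOUT_SECONDS = 30.0  # assumed value; choosing it well is the hard part

class FollowerState:
    def __init__(self, name: str, log_position: int, seconds_since_heartbeat: float) -> None:
        self.name = name
        self.log_position = log_position  # how many changes it has applied
        self.seconds_since_heartbeat = seconds_since_heartbeat


def leader_is_down(seconds_since_leader_heartbeat: float) -> bool:
    """Declare the leader dead once it has been silent longer than the timeout."""
    return seconds_since_leader_heartbeat > FAILURE_TIMEOUT_SECONDS


def choose_new_leader(followers):
    """Pick the reachable follower with the most up-to-date log."""
    alive = [f for f in followers if f.seconds_since_heartbeat <= FAILURE_TIMEOUT_SECONDS]
    if not alive:
        raise RuntimeError("no healthy follower available for promotion")
    return max(alive, key=lambda f: f.log_position)


candidates = [
    FollowerState("a", log_position=120, seconds_since_heartbeat=1.0),
    FollowerState("b", log_position=150, seconds_since_heartbeat=2.0),
    FollowerState("c", log_position=200, seconds_since_heartbeat=90.0),
]
```

Here follower "b" gets promoted: "c" has the most data but is unreachable. If the old leader had confirmed writes beyond position 150, those writes are missing on the new leader, which is exactly the first issue listed above.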

Replication Lag

Replication Lag = replicas (for example followers) are lagging behind the most recent data

In normal operation, this lag might be milliseconds, but during heavy load or network issues, it can grow to seconds or even minutes. This introduces inconsistencies - the leader has newer data than followers.

This isn't just theoretical. Some real-world issues caused by replication lag include:

Read-after-write inconsistency: A user writes something, then immediately tries to read it but gets directed to a follower that hasn't received the update yet
Monotonic reads violations: A user sees newer data, then older data in subsequent reads
Consistent prefix issues: Related updates appear in a confusing order

Solutions for Replication Lag

There are several approaches to address these issues:

Read-your-writes consistency: After writing, ensure subsequent reads go to the leader or only to up-to-date followers
Monotonic reads: Make sure each user always reads from the same replica
Consistent prefix reads: Make sure causally related writes are seen in the correct order

Implementing these in application code is complex and error-prone. Ideally, developers shouldn't have to worry about these issues - that's why transactions exist. Transactions are not covered in this article though.
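To give an idea of what such application-level code looks like, here is a sketch of a read router that implements the first two ideas. The 60 second window, the node names and the `ReadRouter` class are assumptions for illustration.

```python
import zlib

READ_YOUR_WRITES_WINDOW = 60.0  # seconds; assumed upper bound on replication lag

class ReadRouter:
    def __init__(self, replicas) -> None:
        self.replicas = replicas   # e.g. ["follower-1", "follower-2"]
        self.last_write_at = {}    # user id -> time of that user's last write

    def record_write(self, user_id: str, now: float) -> None:
        self.last_write_at[user_id] = now

    def pick_node(self, user_id: str, now: float) -> str:
        # Read-your-writes: shortly after a write, read from the leader.
        last_write = self.last_write_at.get(user_id)
        if last_write is not None and now - last_write < READ_YOUR_WRITES_WINDOW:
            return "leader"
        # Monotonic reads: a stable hash pins each user to one replica.
        index = zlib.crc32(user_id.encode("utf-8")) % len(self.replicas)
        return self.replicas[index]
```

A stable hash like `zlib.crc32` is used instead of Python's built-in `hash`, because the latter is randomized per process for strings and would send the same user to different replicas after a restart. Note that the pinning also breaks when the replica list changes, which is one of the reasons this is error-prone to get right.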

Multi-Leader Replication

The single-leader model has a critical weakness: if you can't reach the leader, you can't write to the database.

Multi-leader replication addresses this by allowing multiple nodes to accept writes. Each write is still forwarded to all nodes. This approach is especially useful in scenarios like:

Multi-datacenter operation (a leader in each datacenter)
Clients with offline operation (like calendar apps)
Collaborative editing systems

The main challenge is handling write conflicts when different leaders accept conflicting changes to the same data.

This is less common than single-leader replication and I will not go into detail here.

Leaderless Replication

This is a totally different approach. In leaderless systems (sometimes called Dynamo-style after Amazon's system), any replica can directly accept writes from clients. There are no leaders.

The typical approach works like this:

The client sends writes to multiple replicas
If enough replicas acknowledge the write, it's considered successful
During reads, the client queries multiple replicas in parallel
Version numbers identify the most recent value
"Read repair" or anti-entropy processes fix stale data

This design eliminates the need for failover, making the system more resilient to node failures. Cassandra, Riak, and Voldemort use variations of this approach, inspired by Amazon's internal Dynamo system (not to be confused with the hosted DynamoDB product).
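A minimal sketch of the steps above, with the "enough replicas" thresholds made explicit as parameters `w` and `r`. Those parameter names, the classes and the simple integer versions are assumptions for this example. Real systems use more elaborate versioning to detect concurrent writes.

```python
class VersionedReplica:
    def __init__(self) -> None:
        self.store = {}        # key -> (version, value)
        self.available = True

    def put(self, key, version, value) -> bool:
        if not self.available:
            return False
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)
        return True

    def get(self, key):
        return self.store.get(key) if self.available else None


def quorum_write(replicas, key, version, value, w: int) -> bool:
    """Send the write to all replicas; succeed if at least `w` acknowledge."""
    acks = sum(replica.put(key, version, value) for replica in replicas)
    return acks >= w


def quorum_read(replicas, key, r: int):
    """Query all reachable replicas, return the newest value, repair stale copies."""
    responders = [rep for rep in replicas if rep.available]
    if len(responders) < r:
        raise RuntimeError("not enough replicas reachable for a quorum read")
    found = [answer for answer in (rep.get(key) for rep in responders) if answer is not None]
    if not found:
        return None
    version, value = max(found, key=lambda answer: answer[0])
    for rep in responders:  # read repair: push the newest version to everyone
        rep.put(key, version, value)
    return value
```

With three replicas and `w=2`, a write still succeeds while one replica is down. When that replica comes back it is stale, and the next `quorum_read` both returns the newest value and writes it back to the stale replica.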

I will also not go into detail here.

Choosing the Right Replication Model

Each replication approach has its place:

Single-leader: Simple, well-understood, works for most applications
Multi-leader: Good for multi-datacenter operation and offline clients
Leaderless: Highly available for write-intensive workloads with weaker consistency needs

Relational vs Document-Oriented Database for Software Architecture

Lukas Niessen — Sun, 18 May 2025 08:36:52 +0000

What I go through in here is:

Super quick refresher of what these two are
Key differences
Strengths and weaknesses
System design examples (+ Spring Java code)
Brief history

In the examples, I choose a relational DB in the first, and a document-oriented DB in the second. The focus is on why I made that choice. I also provide some example code for both.

In the strengths and weaknesses part, I discuss both what used to be a strength/weakness and how it looks nowadays.

Super short summary

The two most common types of DBs are:

Relational database (RDB): PostgreSQL, MySQL, MSSQL, Oracle DB, ...
Document-oriented database (document store): MongoDB, DynamoDB, Couchbase, CouchDB...

RDB

The key idea is: fit the data into a big table. The columns are properties and the rows are the values. By doing this, we have our data in a very structured way. So we have a lot of power for querying the data (using SQL). That is, we can do all sorts of filters, joins etc. The way we arrange the data into the table is called the database schema.

Example table

+----+---------+---------------------+-----+
| ID | Name    | Email               | Age |
+----+---------+---------------------+-----+
| 1  | Alice   | alice@example.com   | 30  |
| 2  | Bob     | bob@example.com     | 25  |
| 3  | Charlie | charlie@example.com | 28  |
+----+---------+---------------------+-----+

A database can have many tables.

Document stores

The key idea is: just store the data as it is. Suppose we have an object. We just convert it to a JSON and store it as it is. We call this data a document. It's not limited to JSON though, it can also be BSON (binary JSON) or XML for example.

Example document

{
  "user_id": 123,
  "name": "Alice",
  "email": "alice@example.com",
  "orders": [
    {"id": 1, "item": "Book", "price": 12.99},
    {"id": 2, "item": "Pen", "price": 1.50}
  ]
}

Each document is saved under a unique ID. This ID can be a path, for example in Google Cloud Firestore, but doesn't have to be.

A set of documents 'in the same bucket' is called a collection. We can have many collections.

Differences

Schema

RDBs have a fixed schema. Every row 'has the same schema'.
Document stores don't have schemas. Each document can 'have a different schema'.

Data Structure

RDBs break data into normalized tables with relationships through foreign keys
Document stores nest related data directly within documents as embedded objects or arrays
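The difference is easy to see side by side. The sketch below uses Python's built-in `sqlite3` module for the relational part and a plain dictionary as a stand-in for a document. The table and field names are made up for this example.

```python
import sqlite3

# Relational: users and orders live in separate tables linked by a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id),
        item TEXT
    );
    INSERT INTO users VALUES (123, 'Alice');
    INSERT INTO orders VALUES (1, 123, 'Book'), (2, 123, 'Pen');
""")
rows = conn.execute(
    "SELECT u.name, o.item FROM users u JOIN orders o ON o.user_id = u.id ORDER BY o.id"
).fetchall()

# Document store: the same data nested in one document, no join needed.
document = {
    "user_id": 123,
    "name": "Alice",
    "orders": [{"id": 1, "item": "Book"}, {"id": 2, "item": "Pen"}],
}
items_from_document = [order["item"] for order in document["orders"]]
```

Both variants give you Alice's items. The relational one needs a join across two tables, while the document one reads a single nested structure.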

Query Language

RDBs use SQL, a standardized declarative language
Document stores typically have their own query APIs
- Nowadays, the common document stores support SQL-like queries too

Scaling Approach

RDBs traditionally scale vertically (bigger/better machines)
- Nowadays, the most common RDBs offer horizontal scaling as well (e.g. PostgreSQL)
Document stores are great for horizontal scaling (more machines)

Transaction Support

ACID = atomicity, consistency, isolation, durability

RDBs have mature ACID transaction support
Document stores traditionally sacrificed ACID guarantees in favor of performance and availability
- The most common document stores nowadays support ACID though (eg. MongoDB)
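As a small illustration of the all-or-nothing behavior that transactions give you, here is a sketch using Python's built-in `sqlite3` module. The `accounts` table, the `CHECK` constraint and the helper functions are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, from_id: int, to_id: int, amount: int) -> bool:
    """Move `amount` between accounts; either both updates apply or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?", (amount, to_id)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?", (amount, from_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))
```

Transferring 500 from account 1 fails on the second update because the balance would become negative. The first update, which already credited account 2, is rolled back as well, so no money appears out of nowhere.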

Strengths, weaknesses

Relational Databases

I want to repeat a few things here again that have changed. As noted, nowadays, most document stores support SQL and ACID. Likewise, most RDBs nowadays support horizontal scaling.

However, let's look at ACID for example. While document stores support it, it's much more mature in RDBs. So if your app puts very high importance on ACID, then RDBs are probably better. But if your app just needs basic ACID, both work well and this shouldn't be the deciding factor.

For this reason, I have put these points, that are supported in both, in parentheses.

Strengths:

Data Integrity: Strong schema enforcement ensures data consistency
(Complex Querying: Great for complex joins and aggregations across multiple tables)
(ACID)

Weaknesses:

Schema: While the schema was listed as a strength, it also is a weakness. Changing the schema requires migrations which can be painful
Object-Relational Impedance Mismatch: Translating between application objects and relational tables adds complexity. Hibernate and other Object-relational mapping (ORM) frameworks help though.
(Horizontal Scaling: Supported but sharding is more complex as compared to document stores)
Initial Dev Speed: Setting up schemas etc takes some time

Document-Oriented Databases

Strengths:

Schema Flexibility: Better for heterogeneous data structures
Throughput: Supports high throughput, especially write throughput
(Horizontal Scaling: Horizontal scaling is easier, you can shard document-wise (documents 1-1000 on computer A and 1001-2000 on computer B))
Performance for Document-Based Access: Retrieving or updating an entire document is very efficient
One-to-Many Relationships: Superior in this regard. You don't need joins or other operations.
Locality: See below
Initial Dev Speed: Getting started is quicker due to the flexibility

Weaknesses:

Complex Relationships: Many-to-one and many-to-many relationships are difficult and often require denormalization or application-level joins
Data Consistency: More responsibility falls on application code to maintain data integrity
Query Optimization: Less mature optimization engines compared to relational systems
Storage Efficiency: Potential data duplication increases storage requirements
Locality: See below

Locality

I have listed locality as a strength and a weakness of document stores. Here is what I mean by this.

In document stores, documents are typically stored as a single, continuous string, encoded in formats like JSON, XML, or binary variants such as MongoDB's BSON. This structure provides a locality advantage when applications need to access entire documents. Storing related data together minimizes disk seeks, unlike relational databases (RDBs) where data is split across multiple tables. Reading it back requires multiple index lookups, which increases retrieval time.

However, it's only a benefit when we need (almost) the entire document at once. Document stores typically load the entire document, even if only a small part is accessed. This is inefficient for large documents. Similarly, updates often require rewriting the entire document. So to keep these downsides small, make sure your documents are small.

Last note: Locality isn't exclusive to document stores. For example Google Spanner or Oracle achieve a similar locality in a relational model.

System Design Examples

Note that I limit the examples to the minimum so the article is not totally bloated. The code is incomplete on purpose. You can find the complete code in the examples folder of the repo.

The examples folder contains two complete applications:

financial-transaction-system - A Spring Boot and React application using a relational database (H2)
content-management-system - A Spring Boot and React application using a document-oriented database (MongoDB)

Each example has its own README file with instructions for running the applications.

Example 1: Financial Transaction System

Requirements

Functional requirements

Process payments and transfers
Maintain accurate account balances
Store audit trails for all operations

Non-functional requirements

Reliability (!!)
Data consistency (!!)

Why Relational is Better Here

We want reliability and data consistency. Though document stores support this too (ACID for example), they are less mature in this regard. The benefits of document stores are not interesting for us, so we go with an RDB.

Note: If we expanded this example and added things like profiles of sellers, ratings and more, we might want to add a separate DB where we have different priorities such as availability and high throughput. With two separate DBs we can support different requirements and scale them independently.

Data Model

Accounts:
- account_id (PK = Primary Key)
- customer_id (FK = Foreign Key)
- account_type
- balance
- created_at
- status

Transactions:
- transaction_id (PK)
- from_account_id (FK)
- to_account_id (FK)
- amount
- type
- status
- created_at
- reference_number

Spring Boot Implementation

// Entity classes
@Entity
@Table(name = "accounts")
public class Account {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long accountId;

    @Column(nullable = false)
    private Long customerId;

    @Column(nullable = false)
    private String accountType;

    @Column(nullable = false)
    private BigDecimal balance;

    @Column(nullable = false)
    private LocalDateTime createdAt;

    @Column(nullable = false)
    private String status;

    // Getters and setters
}

@Entity
@Table(name = "transactions")
public class Transaction {
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long transactionId;

    @ManyToOne
    @JoinColumn(name = "from_account_id")
    private Account fromAccount;

    @ManyToOne
    @JoinColumn(name = "to_account_id")
    private Account toAccount;

    @Column(nullable = false)
    private BigDecimal amount;

    @Column(nullable = false)
    private String type;

    @Column(nullable = false)
    private String status;

    @Column(nullable = false)
    private LocalDateTime createdAt;

    @Column(nullable = false)
    private String referenceNumber;

    // Getters and setters
}

// Repository
public interface TransactionRepository extends JpaRepository<Transaction, Long> {
    List<Transaction> findByFromAccountAccountIdOrToAccountAccountId(Long accountId, Long sameAccountId);
    List<Transaction> findByCreatedAtBetween(LocalDateTime start, LocalDateTime end);
}

// Service with transaction support
@Service
public class TransferService {
    private final AccountRepository accountRepository;
    private final TransactionRepository transactionRepository;

    @Autowired
    public TransferService(AccountRepository accountRepository, TransactionRepository transactionRepository) {
        this.accountRepository = accountRepository;
        this.transactionRepository = transactionRepository;
    }

    @Transactional
    public Transaction transferFunds(Long fromAccountId, Long toAccountId, BigDecimal amount) {
        Account fromAccount = accountRepository.findById(fromAccountId)
                .orElseThrow(() -> new AccountNotFoundException("Source account not found"));

        Account toAccount = accountRepository.findById(toAccountId)
                .orElseThrow(() -> new AccountNotFoundException("Destination account not found"));

        if (fromAccount.getBalance().compareTo(amount) < 0) {
            throw new InsufficientFundsException("Insufficient funds in source account");
        }

        // Update balances
        fromAccount.setBalance(fromAccount.getBalance().subtract(amount));
        toAccount.setBalance(toAccount.getBalance().add(amount));

        accountRepository.save(fromAccount);
        accountRepository.save(toAccount);

        // Create transaction record
        Transaction transaction = new Transaction();
        transaction.setFromAccount(fromAccount);
        transaction.setToAccount(toAccount);
        transaction.setAmount(amount);
        transaction.setType("TRANSFER");
        transaction.setStatus("COMPLETED");
        transaction.setCreatedAt(LocalDateTime.now());
        transaction.setReferenceNumber(generateReferenceNumber());

        return transactionRepository.save(transaction);
    }

    private String generateReferenceNumber() {
        return "TXN" + System.currentTimeMillis();
    }
}

System Design Example 2: Content Management System

A content management system.

Requirements

Store various content types, including articles and products
Allow adding new content types
Support comments

Non-functional requirements

Performance
Availability
Elasticity

Why Document Store is Better Here

As we have no critical transactions like in the previous example but are only interested in performance, availability and elasticity, document stores are a great choice. Since supporting various content types is a requirement, our life is easier with document stores as they are schema-less.

Data Model

// Article document
{
  "id": "article123",
  "type": "article",
  "title": "Understanding NoSQL",
  "author": {
    "id": "user456",
    "name": "Jane Smith",
    "email": "jane@example.com"
  },
  "content": "Lorem ipsum dolor sit amet...",
  "tags": ["database", "nosql", "tutorial"],
  "published": true,
  "publishedDate": "2025-05-01T10:30:00Z",
  "comments": [
    {
      "id": "comment789",
      "userId": "user101",
      "userName": "Bob Johnson",
      "text": "Great article!",
      "timestamp": "2025-05-02T14:20:00Z",
      "replies": [
        {
          "id": "reply456",
          "userId": "user456",
          "userName": "Jane Smith",
          "text": "Thanks Bob!",
          "timestamp": "2025-05-02T15:45:00Z"
        }
      ]
    }
  ],
  "metadata": {
    "viewCount": 1250,
    "likeCount": 42,
    "featuredImage": "/images/nosql-header.jpg",
    "estimatedReadTime": 8
  }
}

// Product document (completely different structure)
{
  "id": "product789",
  "type": "product",
  "name": "Premium Ergonomic Chair",
  "price": 299.99,
  "categories": ["furniture", "office", "ergonomic"],
  "variants": [
    {
      "color": "black",
      "sku": "EC-BLK-001",
      "inStock": 23
    },
    {
      "color": "gray",
      "sku": "EC-GRY-001",
      "inStock": 14
    }
  ],
  "specifications": {
    "weight": "15kg",
    "dimensions": "65x70x120cm",
    "material": "Mesh and aluminum"
  }
}

Spring Boot Implementation with MongoDB

@Document(collection = "content")
public class ContentItem {
    @Id
    private String id;
    private String type;
    private Map<String, Object> data;

    // Common fields can be explicit
    private boolean published;
    private Date createdAt;
    private Date updatedAt;

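    // Type-specific fields go into the dynamic `data` map above;
    // the author is a reference to a separate collection, loaded lazily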
    @DBRef(lazy = true)
    private User author;

    private List<Comment> comments;

    // Basic getters and setters
}

// MongoDB Repository
public interface ContentRepository extends MongoRepository<ContentItem, String> {
    List<ContentItem> findByType(String type);
    List<ContentItem> findByTypeAndPublishedTrue(String type);
    List<ContentItem> findByData_TagsContaining(String tag);
}

// Service for content management
@Service
public class ContentService {
    private final ContentRepository contentRepository;

    @Autowired
    public ContentService(ContentRepository contentRepository) {
        this.contentRepository = contentRepository;
    }

    public ContentItem createContent(String type, Map<String, Object> data, User author) {
        ContentItem content = new ContentItem();
        content.setType(type);
        content.setData(data);
        content.setAuthor(author);
        content.setCreatedAt(new Date());
        content.setUpdatedAt(new Date());
        content.setPublished(false);

        return contentRepository.save(content);
    }

    public ContentItem addComment(String contentId, Comment comment) {
        ContentItem content = contentRepository.findById(contentId)
                .orElseThrow(() -> new ContentNotFoundException("Content not found"));

        if (content.getComments() == null) {
            content.setComments(new ArrayList<>());
        }

        content.getComments().add(comment);
        content.setUpdatedAt(new Date());

        return contentRepository.save(content);
    }

    // Easily add new fields without migrations
    public ContentItem addMetadata(String contentId, String key, Object value) {
        ContentItem content = contentRepository.findById(contentId)
                .orElseThrow(() -> new ContentNotFoundException("Content not found"));

        Map<String, Object> data = content.getData();
        if (data == null) {
            data = new HashMap<>();
        }

        // Just update the field, no schema changes needed
        data.put(key, value);
        content.setData(data);

        return contentRepository.save(content);
    }
}

Brief History of RDBs vs NoSQL

RDBs originated from a groundbreaking paper by Edgar Codd in 1970. After a few years, RDBs dominated the world of DBs, mainly for their reliability and consistent structure.

NoSQL emerged around 2009 (the term actually came from a Twitter hashtag for a meetup about non-relational databases) as companies like Google, Amazon, and Facebook developed custom solutions to handle their unprecedented scale. They published papers on their internal database systems, inspiring open-source alternatives like MongoDB, Cassandra, and Couchbase.

The driving forces behind NoSQL adoption were:

Need for horizontal scalability across many machines
More flexible data models for rapidly evolving applications
Performance optimization for specific query patterns
Lower operational costs for massive datasets

As mentioned already, most of these driving forces are now supported by RDBs as well, so the hard distinctions between RDBs and document stores are blurring.

Most modern databases incorporate features from both.

Data related Non-Functional Requirements

Lukas Niessen — Fri, 16 May 2025 18:19:50 +0000

Brief reminder:

Functional requirements = what the system should do
Non-functional requirements = how the system should behave

Usually applications handle complex datasets and need to:

Store data efficiently in databases
Cache expensive operation results
Provide search and filtering capabilities
Process messages asynchronously
and more.

Suppose you have a banking website. Would you rather show wrong numbers to your customers or show no data at all? Rather show no data. However, if you're Facebook, you would of course rather show wrong data (not-updated likes count for example) than no data.
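As a minimal sketch of this choice, the read path below takes a flag that decides between "possibly outdated data" and "no data" during an outage. The names `read_value`, `PrimaryUnavailable` and `allow_stale` are made up for illustration, and the in-memory dict stands in for a real cache.

```python
class PrimaryUnavailable(Exception):
    """Raised when the source of truth cannot be reached."""


def read_value(key, primary, cache, allow_stale):
    """Read from the primary store; on an outage, optionally fall back to the cache."""
    try:
        value = primary(key)
        cache[key] = value  # remember the last known good value
        return value
    except PrimaryUnavailable:
        if allow_stale and key in cache:
            return cache[key]  # possibly outdated, but better than nothing
        raise  # strict mode: surface the outage instead of showing old data


def down(key):
    """Hypothetical primary store that is currently unreachable."""
    raise PrimaryUnavailable("database unreachable")


cache = {"likes:post-1": 41}
print(read_value("likes:post-1", down, cache, allow_stale=True))  # 41
```

A like counter would call this with `allow_stale=True`, while an account balance would use `allow_stale=False` and let the error reach the user.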

So non-functional requirements matter. Here I want to walk through the most important ones. However, I will only discuss the data-related ones.

Reliability

Reliability simply means "continuing to work correctly, even when things go wrong."

A reliable system:

Performs its function as expected,
Tolerates user mistakes or other mistakes.

Important distinction here: faults vs failures. A fault is when one component doesn't work or works incorrectly, while a failure is when the entire system stops working. We clearly want to focus on dealing with faults and prevent faults from causing failures.

Hardware Faults

Hardware faults will always occur. A disk will die after about 10-50 years (this is the Mean Time To Failure, or MTTF), someone can trip over a cable, the power can go out, and so on. In fact, in large data centers, disk failures happen daily.

So we use redundancy. For example, instead of one disk, we use 4 disks (you can use RAID to combine them). Or instead of one computer at one location, we have 5 computers at 5 locations. When one is down, a different one takes over.
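Here is a small sketch of "a different one takes over". It assumes each replica is just a callable that either returns a value or raises. `ReplicaDown` and `read_with_failover` are illustrative names, not part of any library.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve the request (a fault)."""


def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful read."""
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaDown:
            continue  # this replica is faulty, try the next one
    # Only when every replica is down does the fault become a failure.
    raise RuntimeError(f"all {len(replicas)} replicas failed for key {key!r}")


def broken_replica(key):
    """Hypothetical replica whose disk has died."""
    raise ReplicaDown("disk died")


def healthy_replica(key):
    """Hypothetical replica that still holds the data."""
    return {"user:1": "Alice"}[key]


result = read_with_failover([broken_replica, healthy_replica], "user:1")
print(result)  # Alice
```

The caller never notices the dead replica. A single fault only turns into a failure once no healthy replica is left.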

This, by the way, has another advantage: rolling updates. With only one machine your website is down when you make updates, like a security patch at 4 am. With multiple machines, you update them one after another, and your website stays online the entire time.
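A rolling update can be simulated in a few lines. This is only a sketch: the servers are plain dicts, and "taking a server out of the load balancer" is modeled by a hypothetical `in_service` flag.

```python
def rolling_update(servers, new_version):
    """Update servers one at a time.

    Returns the lowest number of servers that were in service
    at any moment during the rollout.
    """
    min_in_service = len(servers)
    for server in servers:
        server["in_service"] = False  # take it out of rotation
        in_service = sum(1 for s in servers if s["in_service"])
        min_in_service = min(min_in_service, in_service)
        server["version"] = new_version  # apply the patch
        server["in_service"] = True  # put it back into rotation
    return min_in_service


servers = [
    {"name": f"web-{i}", "version": "1.0", "in_service": True}
    for i in range(3)
]
lowest = rolling_update(servers, "1.1")
print(lowest)  # 2
```

With three machines, at least two keep serving traffic during the whole update. Run the same function with a single machine and the result is 0, which is exactly the downtime described above.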

While it used to be enough to have redundancy on one machine, for example by combining disks using RAID, that is no longer the case for most applications. Cloud platforms like AWS, Azure or GCP also play into this trend: it's common for the machines your code is running on to become unavailable, because these platforms are designed to prioritize flexibility and elasticity. You will get a different machine instead.

Software Errors

Also impossible to prevent. Developers make mistakes (ChatGPT & co even more). So we need good Quality Assurance (QA), including unit tests, integration tests, perhaps E2E tests, we need good monitoring, and more. But these errors will happen anyway.

Human Errors

Humans make mistakes. There are studies showing that human error actually causes more outages than hardware failures. While this again is impossible to prevent, we generally want to:

Design interfaces that make the right action obvious and wrong actions hard
Set up detailed monitoring
Enable quick recovery (rollbacks, gradual deployment, data recomputation tools).

So we really want our system to be reliable.

Scalability

Scalability is how well your system handles increased load. To understand it, you need to:

Define your load parameters (requests per second, read/write ratio, etc.)
Measure performance under different loads
Implement solutions that maintain performance as load increases

Performance

This is another obvious one. Let's look at some metrics.

Throughput: The number of operations your system can handle per unit of time.

Latency and Response Time: Note, they are not the same:

Latency: The time it takes a request to reach its destination
Response Time: The total time for a request to be processed and returned (includes latency plus processing time)

Response times are important. But when measuring them, it's important to note that they are never constant. They vary for each request. We could take the average response time, but that's not enough. We also need to look at the higher percentiles (that is, the unusually slow requests, for example p95 or p99).
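To see why the average alone is misleading, here is a small sketch with made-up response times and a simple nearest-rank percentile. The `percentile` helper and the sample numbers are illustrative only. Real monitoring tools may use a different percentile definition.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that
    at least p percent of all values are less than or equal to it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Nine fast requests and one very slow one (milliseconds, made up).
response_times_ms = [20, 22, 25, 21, 23, 24, 20, 26, 22, 900]

mean = sum(response_times_ms) / len(response_times_ms)
p50 = percentile(response_times_ms, 50)
p99 = percentile(response_times_ms, 99)

print(p50)  # 22
print(p99)  # 900
```

The mean here is about 110 ms, a value no request actually experienced. The median (22 ms) describes the typical request, and p99 (900 ms) exposes the slow outlier that the average hides.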

An example from Amazon makes very clear why. They found that a 100ms increase in response time reduces sales by 1%. They obviously don't want that, but there is another reason why high percentiles are very important to them: the users with the worst response times were usually their best customers. That's because these customers have a long purchase history, which slows down internal operations related to their account. But these users are the ones who bring in the most money, so Amazon really wants to keep them happy.

Maintainability

In many projects, it's more costly to keep things running than to initially develop them. Think of big legacy code systems for example. These systems are difficult to maintain.

Maintainability has many aspects. Here are three important ones.

Operable: Good operability means it's easy to keep the system running, which makes the ops team's life easy. Some of their tasks are the following:

Managing deployments and configurations,
Monitoring system health and fixing issues,
Debugging problems,
Keeping software and platforms patched and up-to-date,
Capacity planning,
Complex maintenance tasks (like platform migrations),
and more.

Simple: Complex systems are harder to understand, modify, and maintain.

Evolvable: Systems should be easy to modify as requirements change. This requires good abstractions, clean interfaces, and proper separation of concerns.

Others

ACID

Often you want ACID (Atomicity, Consistency, Isolation and Durability). This usually comes at a tradeoff though and is a topic for a separate article.

Elasticity

The ability to handle bursts of requests. For example, during the Jake Paul vs Mike Tyson fight, the Netflix servers couldn't handle the traffic. While Netflix is very scalable, they couldn't handle this particular burst of requests. They had 'bad elasticity' in this case.

Speed to Market

In technology, speed to market often determines success. Some data teams deliberate on technology choices for months without making decisions, which can be fatal for success.

Best practices include:

Delivering value early and often
Using tools your team already knows when possible
Avoiding undifferentiated heavy lifting that adds little value
Selecting tools that enable quick, reliable, safe, and secure development

Interoperability

It's rare to use only one technology or system. Interoperability describes how various technologies connect, exchange information, and interact. When evaluating technologies, consider:

How easily does technology A integrate with technology B?
Is seamless integration already built into each product?
How much manual configuration is needed?

And many more...

Balancing Trade-offs

You will always have tradeoffs. For example:

Higher reliability typically means higher costs
Faster time to market might compromise long-term maintainability
Better interoperability might require more standardized (but less optimized) approaches
Lower initial costs might lead to higher long-term costs

Remember the first law of software architecture:

Everything in software architecture is a trade-off.