DEV Community: Venkatesan Ramar

Building Reliable Event-Driven Systems: Event Schemas, Versioning, Contract Testing and Events vs Commands (part-4)

Venkatesan Ramar — Mon, 20 Jul 2026 09:00:00 +0000

In this article, we're going to explore contract testing.

Contract testing ensures that event contracts remain stable as systems evolve. Producers receive immediate feedback when changes affect consumers, and consumers gain confidence in the data they process.

16. Why Contract Testing Matters

Events act as public contracts, and those contracts evolve as systems grow. The key question is how a producer can verify that a schema change does not break its consumers.

Many teams rely on integration testing for this purpose. While useful, integration tests only confirm that systems can communicate. They do not guarantee that all consumers still understand the event contract.

Consider an Order Service publishing an OrderConfirmed event consumed by multiple services:

                  OrderConfirmed
                         |
      +------------------+-------------------+
      |                  |                   |
      v                  v                   v
 Inventory          Billing          Notification

If the producer renames a field:
{ "customerId": "CUS-501" }
to:
{ "accountId": "CUS-501" }

everything may still appear correct. The code compiles, tests pass, and deployment succeeds. However, the Billing Service may fail at runtime because it still expects customerId. The producer had no visibility into this dependency.

The issue is not deployment failure but contract violation. Contract testing exists to detect such issues before they reach production.

Integration Tests Cannot Protect Unknown Consumers

Integration tests typically validate interactions between known systems:

Order Service
      |
      v
Inventory Service

In this setup, both systems are controlled and predictable. Event-driven systems differ because consumers are loosely coupled and may not be known to the producer.

A new Analytics Service might start consuming OrderConfirmed months later. The producer cannot manually validate every consumer. Contract testing addresses this by validating expectations rather than communication.

17. Consumer-Driven Contracts

Traditional API testing starts with the producer defining the contract. Consumers adapt to it. Consumer-driven contract testing reverses this approach.

Consumers define their expectations, and producers verify that they continue to meet them. This model works well in event-driven systems where consumers evolve independently.

Thinking From the Consumer's Perspective

Different consumers often require different subsets of data. For example, the Billing Service may need:

{
    "orderId": "ORD-1001",
    "customerId": "CUS-501",
    "totalAmount": 249.99
}

The Notification Service may require:

{
    "orderId": "ORD-1001",
    "customerId": "CUS-501",
    "customerEmail": "alice@example.com"
}

The producer does not need to understand each consumer’s logic. It only needs to satisfy their declared contracts. This encourages careful evolution and explicit documentation of expectations.

A Contract Is More Than Field Names

Contracts define more than structure. They also capture meaning and constraints.

For example:
{ "orderId": "ORD-1001", "totalAmount": 249.99 }

A robust contract may specify:
orderId must exist and not be empty
totalAmount must be numeric
totalAmount must be greater than zero

These rules ensure data integrity and provide stronger guarantees for both producers and consumers.
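The three rules above can be written as an executable check. The following is a minimal sketch in plain Java, assuming the payload has already been parsed into a Map; the class and method names (OrderConfirmedContract, violations) are hypothetical and do not come from any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical contract check for the OrderConfirmed payload (illustration only).
final class OrderConfirmedContract {

    // Returns human-readable violations; an empty list means the payload satisfies the contract.
    static List<String> violations(Map<String, Object> payload) {
        List<String> problems = new ArrayList<>();

        // Rule 1: orderId must exist and not be empty (whitespace-only counts as empty here).
        Object orderId = payload.get("orderId");
        if (!(orderId instanceof String s) || s.isBlank()) {
            problems.add("orderId must exist and not be empty");
        }

        // Rules 2 and 3: totalAmount must be numeric and greater than zero.
        Object totalAmount = payload.get("totalAmount");
        if (!(totalAmount instanceof Number n)) {
            problems.add("totalAmount must be numeric");
        } else if (n.doubleValue() <= 0) {
            problems.add("totalAmount must be greater than zero");
        }
        return problems;
    }
}
```

Returning a list of violations rather than throwing on the first one lets a test report every broken rule at once.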

18. Contract Testing in Practice

Consider an Order Service publishing:

{
    "eventType": "OrderConfirmed",
    "eventVersion": "1.0",
    "payload": {
        "orderId": "ORD-1001",
        "customerId": "CUS-501",
        "currency": "USD",
        "totalAmount": 249.99
    }
}

Consumers depend on different fields:
Billing: orderId, customerId, totalAmount
Inventory: orderId
Notification: customerId

If the producer changes the schema:

{
    "payload": {
        "orderId": "ORD-1001",
        "accountId": "CUS-501",
        "currency": "USD",
        "totalAmount": 249.99
    }
}

the system may still compile and deploy. However, contract validation fails because consumer expectations are no longer met. The issue is detected before release, preventing runtime failures.
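The scenario above can be sketched as a producer-side check. This is a simplified illustration, assuming each consumer's contract is reduced to the list of payload fields it reads; ConsumerContracts and brokenConsumers are hypothetical names, and a real setup would typically load these expectations from contract files supplied by the consumers.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical producer-side check: each consumer declares the payload fields it reads.
final class ConsumerContracts {

    // Field expectations taken from the consumer list above.
    static final Map<String, List<String>> REQUIRED_FIELDS = Map.of(
            "Billing", List.of("orderId", "customerId", "totalAmount"),
            "Inventory", List.of("orderId"),
            "Notification", List.of("customerId"));

    // Returns the consumers whose expectations the payload no longer meets, in sorted order.
    static Set<String> brokenConsumers(Map<String, Object> payload) {
        Set<String> broken = new TreeSet<>();
        REQUIRED_FIELDS.forEach((consumer, fields) -> {
            if (!payload.keySet().containsAll(fields)) {
                broken.add(consumer);
            }
        });
        return broken;
    }
}
```

With the renamed payload, this check reports Billing and Notification as broken while Inventory still passes. A producer build could fail whenever the result is non-empty.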

The Development Workflow

Contract testing integrates into the delivery pipeline:

Consumer Defines Contract
          |
          v
Producer Validates Contract
          |
          v
   Build Succeeds
          |
          v
       Deploy

Every schema change is validated during the build process, ensuring compatibility before deployment.

19. Using JSON Schema to Describe Events

JSON Schema provides a structured way to define event contracts. It acts as an executable specification shared by producers and consumers.

Consider a sample schema:

{
  "type": "object",
  "required": [
    "orderId",
    "customerId",
    "totalAmount"
  ],
  "properties": {
    "orderId": {
      "type": "string"
    },
    "customerId": {
      "type": "string"
    },
    "totalAmount": {
      "type": "number"
    }
  }
}

This schema defines required fields, types, and structure. Additional constraints can include string lengths, numeric ranges, formats, and enumerations. Unlike static documentation, schemas can be validated automatically.
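To make the idea of automatic validation concrete, here is a hand-rolled sketch that mirrors only the "required" and "type" parts of the sample schema. It is not a JSON Schema implementation; SimpleSchemaCheck is a hypothetical name, and a real project would more likely use an existing JSON Schema validator library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hand-rolled stand-in for a schema validator, covering only "required" and "type".
final class SimpleSchemaCheck {

    // Mirrors the sample schema: every listed field is required and must have the given type.
    static final Map<String, Class<?>> ORDER_CONFIRMED = Map.of(
            "orderId", String.class,
            "customerId", String.class,
            "totalAmount", Number.class);

    // Returns one error per missing or mistyped field; an empty list means the event is valid.
    static List<String> validate(Map<String, Class<?>> schema, Map<String, Object> event) {
        List<String> errors = new ArrayList<>();
        schema.forEach((field, type) -> {
            Object value = event.get(field);
            if (value == null) {
                errors.add("missing required field: " + field);
            } else if (!type.isInstance(value)) {
                errors.add("wrong type for field: " + field);
            }
        });
        return errors;
    }
}
```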

Validation During Publishing

Producers can validate events before publishing:

OrderConfirmedEvent event = ...

validator.validate(event);

eventPublisher.publish(event);

Invalid events are rejected early, preventing faulty data from entering the system.

Validation During Consumption

Consumers can also validate incoming events:

OrderConfirmedEvent event = ...

validator.validate(event);

process(event);

This ensures that only valid data reaches business logic, improving reliability.
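Both validation points follow the same shape: check first, then hand the event to the next step. The sketch below shows one way to express that, assuming the validator is a simple predicate; ValidatingHandler is a hypothetical name rather than part of any framework.

```java
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical wrapper that validates an event before handing it to the next step.
final class ValidatingHandler<E> {
    private final Predicate<E> isValid;
    private final Consumer<E> next;

    ValidatingHandler(Predicate<E> isValid, Consumer<E> next) {
        this.isValid = isValid;
        this.next = next;
    }

    // Throws instead of forwarding when the event fails validation.
    void handle(E event) {
        if (!isValid.test(event)) {
            throw new IllegalArgumentException("Event rejected by contract validation: " + event);
        }
        next.accept(event);
    }
}
```

On the producer side, the next step would be the publish call. On the consumer side, it would be the business logic.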

20. Schema Registries and Centralized Contracts

As systems scale, managing schemas across multiple repositories becomes difficult. Teams may define similar events differently, leading to inconsistencies.

A centralized schema repository addresses this:

              Event Schemas
                    |
       +------------+------------+
       |                         |
       v                         v
 Producers                  Consumers

This repository becomes the single source of truth. Producers publish against registered schemas, and consumers validate against them.

Why Centralized Schemas Help

Centralized schema management improves consistency and visibility. Contracts become discoverable, version history is preserved, and compatibility rules can be enforced automatically.

Teams spend less time interpreting payloads and more time building features. The contract becomes a shared organizational asset rather than an isolated implementation detail.

21. Common Contract Testing Mistakes

Several common issues arise when adopting contract testing.

Treating Documentation as the Contract

Documentation often becomes outdated as implementations change. Consumers may rely on incorrect assumptions. Executable contracts remain synchronized with the system and provide reliable validation.

Testing Only Happy Paths

Contracts should cover more than valid scenarios. They should include:

missing required fields
invalid data types
unexpected values
optional fields
deprecated fields

This ensures consumers handle both valid and invalid inputs correctly.

Ignoring Backward Compatibility

Passing current contracts does not guarantee future compatibility. Schema evolution must be validated alongside contract correctness to avoid breaking existing consumers.

Assuming Producers Own the Contract

Although producers publish events, contracts are shared responsibilities. Both producers and consumers must maintain and validate them.

Reliable event-driven systems depend on well-defined, continuously validated contracts.

In the next part, we will explore production practices like idempotency, event ordering, duplicate handling, correlation and observability.

Building Reliable Event-Driven Systems: Event Schemas, Versioning, Contract Testing and Events vs Commands (part-3)

Venkatesan Ramar — Fri, 17 Jul 2026 09:27:53 +0000

In this article, we're going to explore event schema evolution and event versioning.

10. Event Schemas Will Eventually Change

No event schema stays the same forever. As businesses grow, regulations shift, products gain new features, and processes become more complex, the data shared between services must evolve as well. This evolution is not optional—it is a natural consequence of a system adapting to changing requirements.

Many teams initially assume they can simply update an event whenever needed. This assumption may hold when there is only one producer and one consumer, but real-world systems rarely remain that simple. Over time, multiple consumers emerge, each with its own responsibilities and release cycles.

A typical system often looks like this:

                  OrderConfirmed
                         |
      +------------------+-------------------+
      |                  |                   |
      v                  v                   v
 Inventory          Billing          Notification
      |
      v
 Analytics
      |
      v
 Customer Insights

Each consumer evolves independently. Some services may deploy updates weekly, while others might release changes quarterly. In some cases, consumers may even belong to external teams with entirely different priorities and timelines. Because of this, producers cannot assume that all consumers will upgrade simultaneously.

Schema evolution, therefore, is not just about modifying data structures. It is fundamentally about maintaining compatibility across independently evolving systems.

Compatibility Is More Important Than Version Numbers

When discussing schema evolution, teams often focus immediately on versioning. While versioning is useful, compatibility is far more critical. Without compatibility, versioning alone cannot prevent system breakage.

Consider the following event:

{
  "orderId": "ORD-1001",
  "customerId": "CUS-501",
  "totalAmount": 249.99
}

Now imagine a new requirement introduces currency. One approach might replace the existing field entirely:

{
  "orderId": "ORD-1001",
  "customerId": "CUS-501",
  "amount": {
      "value": 249.99,
      "currency": "USD"
  }
}

Although the data model has improved, this change breaks every existing consumer that depends on the original structure. The producer has evolved, but the contract has not been preserved.

The goal of schema evolution is to allow both producers and consumers to evolve independently without causing disruptions.

11. Backward and Forward Compatibility

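Compatibility is often described using formal definitions, but a practical understanding is more useful when designing real systems. The definitions below are the ones used throughout this series. Some schema registry tools define the same terms from the point of view of the reader's schema, which reverses the labels, so confirm which convention your tooling uses.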

Backward Compatibility

Backward compatibility ensures that existing consumers continue to function when producers introduce newer versions of events. This is one of the most important principles in event-driven systems.

Consider Version 1 of an event:

{
  "orderId": "ORD-1001",
  "customerId": "CUS-501"
}

Now Version 2 introduces an additional field:

{
  "orderId": "ORD-1001",
  "customerId": "CUS-501",
  "currency": "USD"
}

Older consumers can safely ignore the new field, allowing them to continue operating without modification. This makes adding optional fields one of the safest ways to evolve a schema. In contrast, removing fields is far more risky because it can break existing consumers.

Forward Compatibility

Forward compatibility addresses the opposite scenario, where newer consumers must handle older events produced by systems that have not yet been upgraded.

For example, a new consumer might expect:

{
  "orderId": "...",
  "customerId": "...",
  "currency": "USD"
}

However, older producers may still emit:

{
  "orderId": "...",
  "customerId": "..."
}

In this case, consumers must be designed to handle missing fields gracefully. They might use default values, leave fields empty, or apply fallback logic. This approach ensures that consumers remain resilient even when the system evolves asynchronously.

Consumers should never assume that every field will always be present, as distributed systems rarely evolve in perfect synchronization.
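A consumer can make the "missing field" case explicit in code. The sketch below assumes the event has been parsed into a Map; CurrencyReader and the "USD" fallback are illustrative assumptions, since the right default is a business decision.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical reader that tolerates events published before "currency" existed.
final class CurrencyReader {

    // Assumed fallback for illustration only; choose the real default with the business.
    static final String FALLBACK_CURRENCY = "USD";

    // Empty when the producer did not send the field or sent a non-string value.
    static Optional<String> currencyOf(Map<String, Object> event) {
        Object value = event.get("currency");
        return value instanceof String s ? Optional.of(s) : Optional.empty();
    }

    // Applies the fallback so downstream logic never sees a missing currency.
    static String currencyOrFallback(Map<String, Object> event) {
        return currencyOf(event).orElse(FALLBACK_CURRENCY);
    }
}
```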

Compatibility Is a Team Discipline

Most compatibility issues arise not from technical limitations but from incorrect assumptions. A common example is the belief that all consumers have already upgraded to the latest version.

In practice, this assumption is rarely valid. Independent deployment is one of the key advantages of microservices, and compatibility is what preserves that independence. Without it, teams become tightly coupled, and deployments require coordination, defeating the purpose of a distributed architecture.

12. Safe Schema Evolution

Schema evolution is ultimately about maintaining trust between producers and consumers. Producers must continue evolving to meet business needs, while consumers must retain the freedom to upgrade on their own timelines.

Not all schema changes carry the same level of risk. Some changes are generally safe and can be introduced with minimal impact, while others are inherently breaking and require careful planning.

Understanding the difference between these types of changes is essential for preventing production failures.

Adding Optional Fields

Adding optional fields is usually a safe way to evolve a schema. It allows new functionality to be introduced without disrupting existing consumers.

For example:

{
  "orderId": "ORD-1001",
  "customerId": "CUS-501"
}

can evolve into:

{
  "orderId": "ORD-1001",
  "customerId": "CUS-501",
  "currency": "USD"
}

Older consumers will ignore the new field, while newer consumers can take advantage of it. This approach supports gradual adoption and minimizes risk.
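Ignoring unknown fields only works if the consumer is written to read just what it needs. A minimal sketch of such a consumer-side view follows, assuming the event is available as a Map; OrderRef is a hypothetical name.

```java
import java.util.Map;

// Hypothetical consumer-side view: holds only the fields this consumer actually uses.
record OrderRef(String orderId, String customerId) {

    // Reads the two needed fields; any other field in the event is ignored.
    static OrderRef from(Map<String, Object> event) {
        return new OrderRef(
                (String) event.get("orderId"),
                (String) event.get("customerId"));
    }
}
```

Because the extra currency field is never read, the old and new versions of the event produce the same OrderRef.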

Removing Fields

Removing fields is significantly more dangerous. If any consumer depends on a field, its removal will cause failures.

For instance, if a Billing service relies on:
{ "totalAmount": 249.99 }
and that field is removed, the service will break. Because it is often difficult to know all consumers of an event, it is safest to assume that every published field is in use somewhere.

Renaming Fields

Renaming fields may appear harmless, but it effectively behaves like removing one field and adding another. This makes it a breaking change.

For example:
{ "customerId": "CUS-501" }
changing to:
{ "accountId": "CUS-501" }
will cause existing consumers to fail because they no longer recognize the expected field. Renaming should therefore be treated with the same caution as removing fields.

Changing Data Types

Changing the data type of a field can also introduce subtle but serious issues. Even if serialization succeeds, consumers may fail when processing the data.

For example:
{ "quantity": 5 }
becoming:
{ "quantity": "5" }
may not immediately cause errors during transmission, but it can break downstream logic that expects a numeric value. Type changes should be treated as breaking changes and handled carefully.
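The following sketch shows how a consumer that expects a number might surface this kind of change explicitly instead of failing deep inside business logic. It assumes the event has been parsed into a Map; QuantityReader is a hypothetical name.

```java
import java.util.Map;

// Hypothetical reader that expects "quantity" to be numeric.
final class QuantityReader {

    // Returns the quantity, or fails with a clear message when the type has changed.
    static int quantityOf(Map<String, Object> event) {
        Object quantity = event.get("quantity");
        if (quantity instanceof Number n) {
            return n.intValue();
        }
        throw new IllegalArgumentException("quantity must be numeric but was: " + quantity);
    }
}
```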

13. Versioning Strategies

There are situations where compatibility alone is not sufficient, and the contract must change in a way that cannot be made backward-compatible. In such cases, versioning becomes necessary.

Version Inside the Event

One approach is to include version information directly within the event metadata:

{
   "eventType": "OrderConfirmed",
   "eventVersion": "2.0",
   "payload": {
      ...
   }
}

This allows consumers to adjust their behavior based on the version while keeping the event name consistent. It provides flexibility but requires consumers to handle multiple versions within their logic.
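A consumer that supports several versions typically branches on the version field. The sketch below assumes, purely for illustration, that version 2.0 renamed customerId to accountId; the article does not specify what changed between versions, and VersionedOrderHandler is a hypothetical name.

```java
import java.util.Map;

// Hypothetical consumer logic that branches on eventVersion.
final class VersionedOrderHandler {

    // Returns the customer identifier regardless of which supported version carried it.
    @SuppressWarnings("unchecked")
    static String customerIdOf(Map<String, Object> event) {
        String version = String.valueOf(event.get("eventVersion"));
        Map<String, Object> payload = (Map<String, Object>) event.get("payload");
        switch (version) {
            case "1.0":
                return (String) payload.get("customerId");
            case "2.0":
                // Assumption for this sketch: 2.0 renamed customerId to accountId.
                return (String) payload.get("accountId");
            default:
                throw new IllegalArgumentException("Unsupported eventVersion: " + version);
        }
    }
}
```

Failing loudly on an unknown version is a deliberate choice here: silently guessing at the structure of a version the consumer has never seen is usually riskier than rejecting the event.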

Different Event Types

Another approach is to define separate event types for each version:
OrderConfirmedV1
OrderConfirmedV2

In this model, consumers subscribe only to the versions they support. The producer may need to maintain multiple versions during the transition period, but this approach simplifies consumer logic by avoiding conditional handling within a single event type.

Which Strategy Is Better?

There is no universally correct choice between these strategies. Embedding versions keeps naming simpler, while separate event types make changes more explicit.

The most important factor is consistency. Teams should adopt a standard approach across services to reduce confusion and operational complexity.

14. Avoid Breaking Changes Whenever Possible

Breaking changes introduce tight coupling between services by forcing coordinated deployments. Instead of allowing independent evolution, they create dependencies that can slow down development and increase risk.

Consider the following structure:

Producer
    |
    +-----------------------------+
    |      |      |      |        |
    v      v      v      v        v
Service Service Service Service Service
   A       B       C       D       E

If the producer removes a field, every consumer must update before the change can be safely deployed. Deployment order becomes critical, and the independence of services is lost.

Even a single breaking change can undermine the benefits of an event-driven architecture.

Deprecation Is Usually Better Than Removal

When a field becomes obsolete, it is better to deprecate it rather than remove it immediately.
{ "legacyField": "..." }

Keeping the field while marking it as deprecated allows consumers time to migrate at their own pace. Once all consumers have transitioned away from the field, it can be safely removed.

This gradual approach reduces disruption and maintains system stability.

15. Common Versioning Mistakes

Certain mistakes appear frequently when teams manage schema evolution.

Treating Events Like Internal DTOs

Internal data transfer objects (DTOs) often change rapidly as implementation details evolve. Public event contracts, however, should be treated with much greater care.

They represent agreements between services and should not be modified casually.

Releasing Breaking Changes Without Visibility

Producers often lack visibility into who consumes their events. Making breaking changes without understanding downstream dependencies introduces significant risk.

Contract testing can help address this issue by providing insight into how events are used.

Versioning Every Small Change

Some teams create new versions for every minor change:

OrderConfirmedV2
OrderConfirmedV3
OrderConfirmedV4

In many cases, this is unnecessary. Adding optional fields often preserves compatibility without requiring a new version. Excessive versioning can make systems harder to maintain and understand.

Forgetting That Old Events Continue to Exist

Events are often stored for long periods and may be replayed for analytics, auditing, or recovery purposes. Schema evolution must account for both new and historical events.

Even if the schema changes, historical data remains unchanged. Systems must be able to handle both.

In the next part, we will explore contract testing and how it helps validate these assumptions before changes reach production.

Building Reliable Event-Driven Systems: Event Schemas, Versioning, Contract Testing and Events vs Commands (part-2)

Venkatesan Ramar — Wed, 15 Jul 2026 05:41:00 +0000

In this article, we're going to explore Event Schemas.

5. Designing Event Schemas That Age Well

Once a team adopts event-driven architecture, the event schema quickly becomes one of the most critical design artifacts in the system. Unlike an internal Java class that is used within a single codebase, an event schema is consumed by multiple independent services. These services may be developed, deployed, and maintained by different teams, often evolving at different speeds. As a result, every field included in an event effectively becomes part of a contract that consumers may rely on.

Changing that contract later is rarely as simple as modifying a Java object. While internal models can evolve freely, event schemas must remain stable over time. Good schemas are designed to evolve gracefully, allowing systems to grow without breaking existing consumers. Poorly designed schemas, on the other hand, tend to accumulate compatibility issues that become increasingly difficult to manage.

At the core of this challenge is a simple but powerful question:

Is the event describing a business fact or exposing the producer's implementation?

Events Should Represent Business Concepts

Consider an Order Service that has just confirmed an order. One way to represent this event is by focusing on the business outcome:

{
  "orderId": "ORD-1001",
  "customerId": "CUS-501",
  "status": "CONFIRMED",
  "totalAmount": 249.99
}

Another approach might expose internal structures:

{
  "id": "ORD-1001",
  "customerEntity": {
      ...
  },
  "orderAggregate": {
      ...
  },
  "hibernateVersion": 12
}

Although both events may contain similar information, only the first represents a stable business contract. The second leaks internal implementation details that consumers should not depend on. Over time, such exposure creates tight coupling and makes evolution difficult.

A useful guideline is:

Design events for consumers, not for producers.

The producer already understands its internal model. Consumers only need clear, meaningful business information.

Don't Serialize Your Domain Model

A common mistake is publishing JPA entities directly as events. For example:

eventPublisher.publish(orderEntity);

This approach may seem convenient because it avoids creating additional classes. However, it tightly couples consumers to the producer’s internal structure. Any change in the domain model can unintentionally break consumers.

Consider an initial model:

public class Order {
    private Customer customer;
}

Later, the model evolves:

public class Order {
    private CustomerAccount customerAccount;
}

From a business perspective, nothing has changed. However, the event payload has changed, potentially breaking consumers. This happens because the event contract was tied directly to the internal model.

A better approach is to define dedicated event models:

import java.math.BigDecimal;

// Dedicated event model, independent of the internal Order entity.
public record OrderConfirmedEvent(
    String orderId,
    String customerId,
    BigDecimal totalAmount
) {}

This separation ensures that the event contract remains stable even as the internal model evolves. Both can change independently without affecting each other.
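The separation becomes visible in the mapping step between the two models. The sketch below uses simplified stand-ins for the internal model (plain records rather than JPA entities); OrderEventMapper and the shape of Order and Customer are assumptions made for illustration.

```java
import java.math.BigDecimal;

// Simplified stand-ins for the internal domain model; free to change at any time.
record Customer(String id) {}
record Order(String id, Customer customer, BigDecimal total) {}

// Public event contract; changes only through deliberate schema evolution.
record OrderConfirmedEvent(String orderId, String customerId, BigDecimal totalAmount) {}

// Hypothetical mapper: the only place that knows about both models.
final class OrderEventMapper {

    static OrderConfirmedEvent toEvent(Order order) {
        return new OrderConfirmedEvent(
                order.id(),
                order.customer().id(),
                order.total());
    }
}
```

If the internal model later replaces Customer with CustomerAccount, only the mapper needs to change, and the published payload stays the same.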

6. Designing Event Payloads

Every event answers a specific business question. The schema should provide enough information for consumers to understand that answer clearly, without exposing unnecessary implementation details. Striking this balance is one of the most important design decisions in event-driven systems.

Include Business Information

Consider a minimal event:

{
    "orderId": "ORD-1001"
}

While technically valid, it is not very useful. Consumers will likely need additional information, forcing them to make extra API calls:

OrderConfirmed Event
         |
         v
Inventory Service
         |
         v
GET /orders/ORD-1001

If multiple consumers follow this pattern, a single event can trigger multiple network requests. This increases system load and introduces unnecessary coupling between services.

Avoid Including Everything

At the other extreme, some events include too much information:

{
    "order": {
        ...
    },
    "customer": {
        ...
    },
    "payment": {
        ...
    },
    "inventory": {
        ...
    },
    "shipment": {
        ...
    }
}

Large payloads can lead to higher network usage, increased serialization costs, and tighter coupling between services. They also make schema evolution more difficult, as changes in one part of the payload may affect multiple consumers.

Design Around Business Needs

A practical approach is to ask:

What information should every consumer reasonably expect to receive?

For an OrderConfirmed event, a balanced payload might look like this:

{
    "orderId": "ORD-1001",
    "customerId": "CUS-501",
    "orderDate": "2026-07-01T10:15:30Z",
    "currency": "USD",
    "totalAmount": 249.99
}

This provides essential business context without overwhelming consumers. Those who need additional details can fetch them independently, while others remain unaffected.

7. Event Metadata Matters

While developers often focus on the payload, metadata plays an equally important role in production systems. Metadata provides critical context that helps systems understand how to process and trace events.

It enables systems to determine when an event occurred, where it originated, how it should be tracked, and whether it has already been processed. Without this information, operating and debugging event-driven systems becomes significantly more challenging.

Business Data vs Technical Metadata

Business data should reside in the payload, while technical details belong in metadata. For example:

{
  "eventId": "8c1e6d12",
  "eventType": "OrderConfirmed",
  "eventVersion": "1.0",
  "occurredAt": "2026-07-01T10:15:30Z",
  "correlationId": "REQ-98451",
  "payload": {
      "orderId": "ORD-1001",
      "customerId": "CUS-501",
      "totalAmount": 249.99
  }
}

This separation improves clarity and makes it easier to evolve both business data and technical metadata independently.
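In Java, this split can be modelled as a generic envelope around a payload type. The following is a sketch, assuming a random UUID is an acceptable event identifier; EventEnvelope and its factory method are hypothetical names.

```java
import java.time.Instant;
import java.util.UUID;

// Hypothetical envelope: technical metadata on the outside, business data in the payload.
record EventEnvelope<T>(
        String eventId,
        String eventType,
        String eventVersion,
        Instant occurredAt,
        String correlationId,
        T payload) {

    // Creates an envelope with a freshly generated eventId (a random UUID in this sketch).
    static <P> EventEnvelope<P> of(String eventType, String eventVersion,
                                   Instant occurredAt, String correlationId, P payload) {
        return new EventEnvelope<>(UUID.randomUUID().toString(),
                eventType, eventVersion, occurredAt, correlationId, payload);
    }
}
```

Because the payload is a type parameter, the same envelope can wrap any event model without the metadata fields leaking into it.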

Event Identifier

Every event should include a unique identifier: eventId

This identifier is essential for de-duplication, ensuring idempotent processing, enabling tracing, and supporting auditing. It also simplifies handling scenarios where events may be delivered multiple times.
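De-duplication by eventId can be sketched as follows. This version keeps processed identifiers in memory only, which is an assumption made for brevity; a production consumer would need a durable store and a decision about what happens when the action fails after the identifier has been recorded. IdempotentConsumer is a hypothetical name.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory de-duplication keyed by eventId.
final class IdempotentConsumer {

    // Identifiers of events that have already been handled (lost on restart in this sketch).
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Runs the action only the first time an eventId is seen; returns false for duplicates.
    boolean handleOnce(String eventId, Runnable action) {
        if (!processed.add(eventId)) {
            return false;
        }
        action.run();
        return true;
    }
}
```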

Correlation Identifier

In distributed systems, workflows often span multiple services:

Create Order
      |
Reserve Inventory
      |
Process Payment
      |
Create Shipment

Each step may produce additional events. A correlation identifier links these events together:

Correlation ID REQ-98451

This makes it much easier to trace workflows and debug issues in production environments.

Event Timestamp

Events should record when the business action occurred, not when the event was received. These timestamps can differ due to network delays, retries, or temporary failures. Keeping business time separate from delivery time ensures accurate interpretation of events.

8. Naming Events Consistently

Naming may seem like a minor detail, but it becomes increasingly important as systems grow. Large organizations may produce hundreds of event types, and consistency helps maintain clarity and usability across teams.

Prefer Past-Tense Business Events

Effective event names describe completed business actions:

CustomerRegistered
OrderConfirmed
InventoryReserved
PaymentCompleted
ShipmentCreated

These names clearly communicate what has happened, making them easy for consumers to understand.

Avoid CRUD-Oriented Events

Generic names such as:

OrderUpdated
CustomerModified
ProductChanged

lack clarity. They do not explain what changed or why, forcing consumers to inspect the payload for meaning. Event names should convey intent directly.

Keep Names Business-Oriented

Avoid technical or implementation-focused names:

DatabaseUpdated
RowInserted
EntitySaved
JpaEntityUpdated

These describe internal processes rather than business outcomes. Consumers care about what happened in the business domain, not how it was implemented.

9. Common Schema Design Mistakes

Despite differences in technology, many event-driven systems encounter similar design issues. Recognizing these common mistakes can help teams avoid long-term problems.

Publishing Internal Objects

Internal models change frequently, while public contracts should remain stable. Mixing the two leads to fragile systems. Keeping them separate ensures better maintainability.

Making Events Too Generic

An event like OrderUpdated can represent many different actions, making it difficult for consumers to interpret. More specific events provide clearer intent:
OrderConfirmed
OrderCancelled
OrderRefunded

This clarity simplifies consumer logic and improves overall system understanding.

Missing Metadata

Without essential metadata such as event identifiers, timestamps, and correlation IDs, troubleshooting becomes significantly harder. Operational concerns should be considered from the beginning of event design.

Designing for Today's Consumers

Many producers design events based only on current consumers, overlooking future needs. A better approach is:

Design events as though the next consumer has not been written yet.

This mindset encourages more flexible and future-proof designs.

Practical Rule of Thumb

Well-designed schemas remain clear and understandable long after they are introduced. They focus on business facts rather than implementation details, separate metadata from payload data, and provide sufficient context without unnecessary complexity.

In the next part, we will explore schema evolution, event versioning, backward compatibility, and strategies that allow producers and consumers to evolve independently without disrupting production systems.

AI assistance was used to paraphrase the content.

Building Reliable Event-Driven Systems: Event Schemas, Versioning, Contract Testing and Events vs Commands (part-1)

Venkatesan Ramar — Tue, 14 Jul 2026 05:00:00 +0000

This article is part 1 of a multi-part series. It explores why and where event-driven systems fail and introduces the foundational concepts.

Distributed systems have become considerably easier to build than they were a decade ago. Modern frameworks allow us to publish events with only a few lines of code, cloud platforms provide fully managed messaging services, and Spring Boot and similar frameworks make asynchronous communication feel almost effortless. Because of this, many teams successfully adopt event-driven architectures. Unfortunately, publishing events is usually the easiest part of the journey.

Designing events that remain reliable for years is considerably more difficult. Production problems in event-driven systems are rarely caused by the messaging infrastructure itself. Instead, they originate from architectural questions such as:

What information should an event contain?
Can an event schema evolve without breaking existing consumers?
How can new services safely consume old events?
When should a service publish an event instead of sending a command?
How can producers know they haven't broken downstream consumers?

These questions determine whether an event-driven system remains maintainable as more services, teams, and business capabilities are added.

This article explores the practices that make event-driven systems resilient over time. We focus on the contracts that services exchange with each other. In event-driven architecture, events become public APIs, and unlike REST APIs, those APIs are usually consumed by systems the producer does not directly control. This makes compatibility one of the most important design concerns.

1. Event-Driven Architecture Is Really About Contracts

Many developers describe event-driven architecture as services communicating through events. That description is correct, but it is also incomplete. Events are more than messages moving between services—every published event represents a contract.

Suppose an Order Service publishes the following event.

{
  "orderId": "ORD-1001",
  "customerId": "CUS-501",
  "status": "CONFIRMED",
  "totalAmount": 249.99
}

Several services subscribe to it.

                 OrderConfirmed
                        |
        +---------------+---------------+
        |               |               |
        v               v               v
 Inventory Service  Billing Service  Notification Service

The producer may know about three consumers today. Six months later, another team builds additional services such as Analytics, Recommendation, and Customer Loyalty. The producer does not need to change; the consumers simply subscribe. This loose coupling is one of the biggest strengths of event-driven systems, but it is also one of their biggest challenges.

The producer no longer knows who depends on the event, which makes changing the event significantly more complicated than changing an internal Java object.

Events Are Public APIs

Most teams treat REST APIs very carefully. Before removing a field, they consider existing clients, API versions, backward compatibility, and migration plans. Events deserve exactly the same level of discipline.

Once an event is published, it becomes part of the public interface of the service. Removing a field from an event can break downstream systems just as easily as removing a field from a REST response. The difference is that consumers are often invisible. A REST API usually has documented clients, while an event may have consumers owned by completely different teams, some of which may not even exist when the producer is originally developed.

Thinking of events as contracts fundamentally changes how they should be designed.

2. Why Event Design Matters More Than Event Publishing

Publishing an event is a technical task, but designing an event is an architectural task. Many event-driven projects begin with something like this:

public class Order {

    private Long id;
    private Customer customer;
    private List<OrderItem> items;
    private Address shippingAddress;

}

The easiest approach is to serialize the entire object and publish it.

eventPublisher.publish(order);

It works—until the domain model changes. A field gets renamed, an object is restructured, or a new relationship is introduced. Every consumer now receives a different payload. The producer evolved, but the contract changed accidentally.

This is one of the most common mistakes in event-driven systems. Events should not expose internal domain models; they should communicate business facts. Those are two very different things.
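One way to keep the contract separate from the domain model is to map the entity to a dedicated event type before publishing. The following is a minimal sketch under stated assumptions: the OrderConfirmed record, its field names (borrowed from the JSON example earlier), and the simplified Order class are all hypothetical and only illustrate the idea.

```java
import java.math.BigDecimal;

final class OrderEventMapping {

    // Simplified stand-in for the internal domain model (hypothetical).
    static final class Order {
        final long id;
        final String customerRef;   // internal name, free to change
        final BigDecimal total;

        Order(long id, String customerRef, BigDecimal total) {
            this.id = id;
            this.customerRef = customerRef;
            this.total = total;
        }
    }

    // The published contract: only the business fact, with stable field names.
    record OrderConfirmed(String orderId, String customerId, BigDecimal totalAmount) {}

    // The mapping is the single place where internal names meet the contract.
    static OrderConfirmed toEvent(Order order) {
        return new OrderConfirmed("ORD-" + order.id, order.customerRef, order.total);
    }

    public static void main(String[] args) {
        Order order = new Order(1001, "CUS-501", new BigDecimal("249.99"));
        System.out.println(toEvent(order));
    }
}
```

With this shape, renaming customerRef inside Order changes only the mapping method, while the event consumers see stays the same.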

An Event Describes Something That Already Happened

A useful mental model is that a command asks for something to happen, while an event states that something already happened. For example:

OrderConfirmed

This is a business fact. It cannot be rejected because it has already occurred. Similarly:

PaymentCompleted
InventoryReserved
ShipmentCreated
CustomerRegistered

All describe completed business actions.

Consumers should be able to trust that these events represent facts. This makes event names extremely important. Good names communicate completed business outcomes, while poor names often expose implementation details. Compare the following examples:

Good: OrderConfirmed
Poor: UpdateOrderStatus

The first describes something that happened, while the second sounds like an internal method call. This distinction becomes increasingly important as systems grow.

3. Events Are Immutable

One characteristic separates events from many other forms of communication: events are immutable. Once published, an event represents history, and history cannot be rewritten.

Imagine the following event:

{
  "orderId": "ORD-1001",
  "status": "CONFIRMED"
}

Tomorrow, the customer cancels the order. The producer should not update the previous event. Instead, it publishes a new one:

{
  "orderId": "ORD-1001",
  "status": "CANCELLED"
}

The event stream now tells a complete story:

OrderCreated
      |
OrderConfirmed
      |
OrderCancelled

Consumers joining later can reconstruct what happened. This is one of the reasons event-driven systems are valuable—events become an immutable business history. Changing previously published events destroys that history.
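The replay idea can be sketched with an append-only log. This is an illustrative in-memory model, not a real broker client: the OrderEvent record, the OrderHistory class, and the status strings are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class OrderHistory {

    // Hypothetical event shape: an event name plus the order it refers to.
    record OrderEvent(String type, String orderId) {}

    private final List<OrderEvent> log = new ArrayList<>();

    // Events are only ever appended; nothing is updated or removed.
    void append(OrderEvent event) {
        log.add(event);
    }

    // Read-only view, so callers cannot rewrite history.
    List<OrderEvent> events() {
        return Collections.unmodifiableList(log);
    }

    // A late-joining consumer derives the current status by replaying the log.
    String currentStatus(String orderId) {
        String status = "UNKNOWN";
        for (OrderEvent event : log) {
            if (!event.orderId().equals(orderId)) {
                continue;
            }
            switch (event.type()) {
                case "OrderCreated" -> status = "CREATED";
                case "OrderConfirmed" -> status = "CONFIRMED";
                case "OrderCancelled" -> status = "CANCELLED";
                default -> { /* unknown event types are ignored */ }
            }
        }
        return status;
    }

    public static void main(String[] args) {
        OrderHistory history = new OrderHistory();
        history.append(new OrderEvent("OrderCreated", "ORD-1001"));
        history.append(new OrderEvent("OrderConfirmed", "ORD-1001"));
        history.append(new OrderEvent("OrderCancelled", "ORD-1001"));
        System.out.println(history.currentStatus("ORD-1001"));
    }
}
```

The cancellation is recorded as a third entry rather than an edit to the second one, so the earlier confirmation remains visible to anyone who replays the stream.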

Events Represent Facts, Not State

Another common misunderstanding is treating events as snapshots of current state. Consider this event:

{
  "orderId": "ORD-1001",
  "status": "PROCESSING"
}

Does it mean the order is currently processing, or that the order entered processing? These are different meanings.

A better event would be: OrderProcessingStarted

The name clearly communicates a business fact. Consumers no longer need to interpret the payload because the event itself explains what happened. As event catalogs grow, this principle becomes increasingly important. Well-designed event names reduce ambiguity, while poorly named events force consumers to infer business meaning.

4. Events and Commands Are Not the Same

This is one of the most misunderstood topics in event-driven architecture. Many teams use commands and events interchangeably, which creates tightly coupled systems. Understanding the difference changes how services communicate.

Commands Express Intent

A command represents a request where the sender expects another service to perform an action.

For example:
ReserveInventory

The sender is asking, “Please reserve inventory.” The receiving service can accept it, reject it, validate it, or return an error.

Commands imply responsibility—someone is expected to do something.

Events Express Facts

An event communicates that something has already happened.

For example:
InventoryReserved

The reservation has already completed. Consumers cannot reject it; they simply react. This difference may appear subtle, but architecturally it is enormous. Commands influence behavior, while events communicate history.

Visualizing the Difference

Command

Order Service
      |
Reserve Inventory
      |
      v
Inventory Service

The Order Service knows exactly who should process the request, making the communication directed.

Event

InventoryReserved
        |
   +----+----+---------+
   |         |         |
   v         v         v
Billing   Shipping   Analytics

The Inventory Service simply publishes a business fact. It does not know who reacts, and it does not need to. That is the essence of loose coupling.

A Practical Rule of Thumb

One guideline that helps during system design reviews is simple: use a command when one specific service is responsible for performing an action, and use an event when informing any interested service that a business fact has already occurred.

This distinction keeps responsibilities clear and prevents event-driven systems from gradually turning into distributed RPC systems disguised as messaging.
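The rule of thumb can be illustrated with a small in-memory sketch. Everything here is hypothetical (the InventoryService class, its handle and subscribe methods, and the stock counter); a real system would use a broker, but the shape of the two interactions is the same: a command goes to one handler that may reject it, and an event fans out to whoever subscribed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class CommandVersusEvent {

    // Command: addressed to one handler, which may accept or reject it.
    record ReserveInventory(String orderId, int quantity) {}

    // Event: a fact that has already happened.
    record InventoryReserved(String orderId, int quantity) {}

    static final class InventoryService {
        private int available;
        private final List<Consumer<InventoryReserved>> subscribers = new ArrayList<>();

        InventoryService(int available) {
            this.available = available;
        }

        // Any number of consumers can subscribe; the service does not track who they are.
        void subscribe(Consumer<InventoryReserved> subscriber) {
            subscribers.add(subscriber);
        }

        // Returns false when the command is rejected; an event is published only on success.
        boolean handle(ReserveInventory command) {
            if (command.quantity() > available) {
                return false;
            }
            available -= command.quantity();
            InventoryReserved event = new InventoryReserved(command.orderId(), command.quantity());
            subscribers.forEach(s -> s.accept(event));
            return true;
        }
    }

    public static void main(String[] args) {
        InventoryService inventory = new InventoryService(5);
        inventory.subscribe(e -> System.out.println("Billing saw " + e));
        inventory.subscribe(e -> System.out.println("Shipping saw " + e));
        System.out.println(inventory.handle(new ReserveInventory("ORD-1001", 3)));
        System.out.println(inventory.handle(new ReserveInventory("ORD-1002", 9)));
    }
}
```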

In the next part, we will build on these foundations by designing event schemas that survive years of system evolution.

Assisted AI to paraphrase.

JVM Internals for Microservices: Classloading, Memory, and GC in Containers

Venkatesan Ramar — Wed, 08 Jul 2026 23:34:52 +0000

A few years ago, an iPaaS platform I worked on started experiencing intermittent pod restarts in Kubernetes. The issue initially appeared unrelated to the integration workloads themselves. Customer integrations were processing messages successfully, API response times remained stable, and the platform showed no obvious signs of distress. Yet several integration runtime pods were being restarted multiple times a day.

At first, the investigation focused on application-level concerns. Since the platform handled large volumes of transformation logic, message routing, and connector execution, the assumption was that some integration flow was creating excessive object allocations or causing a memory leak. Heap utilization, however, remained well below the configured limits. Garbage collection logs looked healthy, and profiling revealed no significant retention issues.

The actual problem was hidden outside the heap. The JVM was consuming memory from multiple sources that were not visible in the standard application dashboards. Metaspace grew as connectors, SDKs, and framework components loaded thousands of classes. Thread stacks accumulated because integration runtimes maintained pools for message processing, scheduling, and external system communication. Direct buffers allocated by networking libraries consumed native memory. Garbage collector metadata and JVM internal structures added additional overhead.

Individually, none of these memory consumers appeared problematic. Together, they pushed the process beyond the container's memory limit.

The JVM monitoring tools showed healthy heap usage. Kubernetes, however, cared only about the total memory consumed by the process. Once the runtime exceeded its container limit, the Linux kernel terminated it, and Kubernetes restarted the pod.

That incident highlighted an important reality of modern B2B integration platforms. Many production issues are not caused by transformation logic, connector implementations, or external system latency. Instead, they originate from misunderstandings about how the JVM behaves inside containers.

Developers building integration services often focus on workflows, mappings, APIs, and connectivity while treating the JVM as a black box. In traditional deployment environments, that approach was often sufficient. Modern cloud-native integration platforms operate under very different constraints. Memory limits are tighter, startup times affect autoscaling behavior, workloads are highly dynamic, and infrastructure efficiency directly impacts operating costs.

Modern Java runtimes have improved dramatically over the last decade. Java 21 running inside Kubernetes behaves very differently from Java 8 running on dedicated virtual machines. Nevertheless, understanding a few key JVM internals remains essential for building efficient and reliable integration services.

This article focuses on the JVM concepts that matter most in containerized environments: memory layout, classloading, and the relationship between JVM behavior and container resource limits.

1. Why JVM Internals Matter in Microservices

Many JVM concepts were easy to ignore in traditional monolithic deployments because infrastructure resources were abundant. A typical enterprise application might run as a single JVM process with access to large amounts of memory and CPU resources.

1 JVM
16 GB RAM
32 CPUs

In such environments, inefficiencies often remained hidden. An application consuming an extra few hundred megabytes of memory rarely caused operational problems. Startup times were less important because deployments happened infrequently. Thread counts could grow significantly without immediately impacting system stability.

The cloud-native world operates under a completely different set of assumptions.

50 Services
512 MB each

Instead of one large application, organizations often deploy dozens or hundreds of smaller services. Each service operates within strict resource boundaries. Every deployment has memory limits, CPU quotas, startup requirements, autoscaling behavior, and infrastructure costs associated with it.

As a result, inefficiencies that were previously insignificant become highly visible.

A service wasting 100 MB of memory may not seem problematic in isolation. However, when that same inefficiency exists across one hundred services, the organization is effectively allocating an additional 10 GB of memory simply to support overhead.

10 GB of unnecessary memory allocation

At scale, these inefficiencies translate directly into infrastructure costs, operational complexity, and reduced platform efficiency.

Cloud Cost Becomes a JVM Problem

Organizations often invest heavily in optimizing databases, networking infrastructure, and cloud architecture. Surprisingly, the JVM itself is frequently overlooked despite being one of the largest consumers of compute resources in many enterprise environments.

Consider two services that deliver identical throughput and latency. One requires 2 GB of memory while the other requires only 1 GB.

2 GB memory

versus

1 GB memory

From a business perspective, the second service is significantly more efficient. Across dozens of deployments, the difference can represent substantial infrastructure savings.

This means JVM tuning is no longer merely a performance concern. It becomes an architectural and financial concern as well.

Memory efficiency directly influences cloud spending.

Scaling Magnifies JVM Decisions

Many JVM-related decisions appear harmless during development because developers typically run a single service on a powerful workstation.

1 Service

Production environments tell a different story.

50+ Services

Every JVM incurs overhead. Every service loads classes. Every service allocates memory for threads. Every service performs garbage collection.

When multiplied across an entire platform, these costs become significant.

Classloading overhead scales.
Memory overhead scales.
Thread overhead scales.
Garbage collection overhead scales.

Understanding JVM internals becomes increasingly important as systems grow.

2. The JVM Memory Model Most Developers Never See

When developers think about JVM memory, they usually think about one thing: Heap

The heap is certainly important because it stores application objects and is the primary target of garbage collection. However, the heap represents only one portion of the JVM's memory footprint.

This distinction becomes critical in containerized environments because Kubernetes and operating systems measure total process memory consumption rather than heap usage alone.

A service can have perfectly healthy heap utilization and still be terminated due to memory pressure.

Understanding where memory is allocated inside the JVM helps explain why this happens.

Heap Memory

The heap stores the majority of application objects created during execution.

Examples include:

Order order = new Order();
List<Customer> customers = new ArrayList<>();
Map<String, Order> cache = new HashMap<>();

Whenever objects are instantiated, memory is typically allocated within the heap.

Modern garbage collectors organize the heap into multiple regions or generations. Although implementation details vary between collectors, the general concepts remain familiar.

Common concepts are:
Young Generation and Old Generation

Most objects are short-lived. They are created, used briefly, and then discarded. These objects typically remain in the young generation.

Objects that survive multiple garbage collection cycles are eventually promoted into the old generation, where they remain for longer periods.

Heap memory is usually the easiest JVM memory area to monitor because most observability tools expose heap metrics by default.

Unfortunately, this visibility often creates the misconception that heap memory represents total JVM memory consumption.
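For reference, the heap figures most dashboards show come from the standard java.lang.management API. The sketch below reads them directly; the HeapSnapshot class and its helper names are made up for the example. It also prints the JVM's non-heap figure, which still leaves out thread stacks and other native allocations.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

final class HeapSnapshot {

    static MemoryUsage heap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    static MemoryUsage nonHeap() {
        return ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();
    }

    static long toMb(long bytes) {
        return bytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        MemoryUsage heap = heap();
        System.out.println("heap used MB      = " + toMb(heap.getUsed()));
        System.out.println("heap committed MB = " + toMb(heap.getCommitted()));

        // getMax() returns -1 when no maximum is defined, so check before converting.
        long max = heap.getMax();
        System.out.println("heap max MB       = " + (max < 0 ? "undefined" : toMb(max)));

        // JVM-managed non-heap pools only (for example Metaspace and the code cache).
        System.out.println("non-heap used MB  = " + toMb(nonHeap().getUsed()));
    }
}
```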

Metaspace

Starting with Java 8, Metaspace replaced the older Permanent Generation (PermGen).

Metaspace stores information about loaded classes, methods, fields, and other runtime metadata required by the JVM.

For example:
class metadata
method metadata
runtime class information

Modern Spring applications load thousands of classes during startup.

For example:
Spring Framework
Spring Boot
Hibernate
Jackson
Logging libraries
Database drivers

In addition to framework classes, many frameworks generate classes dynamically at runtime.

For instance:
Spring proxies
AOP proxies
Hibernate proxies
Bytecode-enhanced entities

Every loaded class consumes Metaspace.

Large enterprise applications can easily load several thousand classes before processing their first request. As a result, Metaspace can become a meaningful contributor to overall memory consumption.

Unlike heap memory, Metaspace often receives little attention until it becomes a problem.
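A quick way to give Metaspace some attention is to read it from the JVM's memory pools. This is a minimal sketch that assumes a HotSpot-based JVM, where the pool is named "Metaspace"; the MetaspaceReport class is hypothetical, and the lookup returns an Optional because other JVMs may name the pool differently.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.util.Optional;

final class MetaspaceReport {

    // Assumption: the pool is called "Metaspace", as on HotSpot.
    static Optional<MemoryPoolMXBean> metaspacePool() {
        return ManagementFactory.getMemoryPoolMXBeans().stream()
                .filter(pool -> pool.getName().equals("Metaspace"))
                .findFirst();
    }

    // Number of classes currently loaded in this JVM.
    static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    public static void main(String[] args) {
        System.out.println("loaded classes = " + loadedClassCount());
        metaspacePool().ifPresentOrElse(
                pool -> System.out.println("metaspace used KB = " + pool.getUsage().getUsed() / 1024),
                () -> System.out.println("no pool named Metaspace on this JVM"));
    }
}
```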

Thread Stacks

Every platform thread created by the JVM receives its own stack.

Typical stack sizes range from 256 KB to 1 MB, depending on the operating system and JVM configuration.

This memory is allocated independently of the heap.

The impact becomes significant when applications create large numbers of threads.

Consider a service configured with 500 threads and a stack size of 1 MB. Thread stacks alone may account for up to 500 MB before application objects, caches, or framework overhead are counted.

This is one reason thread-heavy applications often consume substantially more memory than expected.
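The arithmetic is simple enough to sketch. The ThreadStackEstimate class below is hypothetical, and the 1 MB stack size is an assumed value rather than one read from the running JVM. The result is a worst-case figure, since stack pages are typically committed lazily and resident usage is usually lower.

```java
import java.lang.management.ManagementFactory;

final class ThreadStackEstimate {

    // Upper bound on stack memory: thread count multiplied by the configured stack size.
    static long worstCaseStackBytes(int threadCount, long stackSizeBytes) {
        return (long) threadCount * stackSizeBytes;
    }

    public static void main(String[] args) {
        long oneMb = 1024L * 1024L;
        System.out.println("500 threads x 1 MB = "
                + worstCaseStackBytes(500, oneMb) / oneMb + " MB");

        // Live threads reported by this JVM, priced at the assumed 1 MB stack size.
        int live = ManagementFactory.getThreadMXBean().getThreadCount();
        System.out.println(live + " live threads, worst case "
                + worstCaseStackBytes(live, oneMb) / oneMb + " MB");
    }
}
```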

Native Memory

Native memory is one of the least understood areas of JVM memory management.

The JVM frequently allocates memory outside the heap for performance reasons.

For example:
NIO buffers
JNI libraries
Compression libraries
GC metadata
JVM internal structures

These allocations do not appear in heap metrics.

From the perspective of many monitoring dashboards, this memory is effectively invisible.

However, the operating system and Kubernetes still count it toward the process memory limit.

As a result, applications can experience memory-related failures even when heap utilization appears completely normal.
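Direct NIO buffers are one native consumer the JVM does report, through BufferPoolMXBean. The sketch below is illustrative: the DirectBufferUsage class is hypothetical, and it assumes the pool is named "direct", as on HotSpot. It allocates a buffer outside the heap and shows the pool growing.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

final class DirectBufferUsage {

    // Bytes currently held by direct (off-heap) NIO buffers, as reported by the JVM.
    static long directBytes() {
        return ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class).stream()
                .filter(pool -> pool.getName().equals("direct"))
                .mapToLong(BufferPoolMXBean::getMemoryUsed)
                .sum();
    }

    public static void main(String[] args) {
        long before = directBytes();
        // 16 MB allocated outside the heap; heap metrics will not reflect it.
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        long after = directBytes();
        System.out.println("direct buffer growth MB = " + (after - before) / (1024 * 1024));
        System.out.println("buffer capacity bytes   = " + buffer.capacity());
    }
}
```

Other native consumers, such as JNI allocations, are not covered by this bean and need process-level measurement.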

The Memory Layout That Kubernetes Sees

Kubernetes does not distinguish between heap memory, Metaspace, thread stacks, or native allocations.

It sees a single process consuming memory.

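A simplified representation looks like this:

Container Memory Limit
          |
+------+-----------+---------------+--------+
| Heap | Metaspace | Thread Stacks | Native |
+------+-----------+---------------+--------+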

The container limit applies to the entire process, not just the heap.

This distinction explains many seemingly mysterious OOMKill incidents.

3. Classloading: The Invisible Startup Cost

Classloading is one of the most fundamental JVM mechanisms, yet most developers rarely think about it.

Every Java application depends on classloading. Every framework depends on classloading. Every object created by the application ultimately relies on classes being loaded into memory.

Despite its importance, classloading remains largely invisible during day-to-day development. Its effects, however, are highly visible.

Startup time, memory consumption, deployment speed, and auto-scaling responsiveness are all influenced by classloading behavior.

What Happens During Startup

When a Spring Boot application starts, the JVM performs a series of operations before the application becomes available.

A simplified view looks like this:

Class Loading
     |
Verification
     |
Initialization
     |
Bean Creation
     |
Application Ready

Thousands of classes may be loaded and initialized during this process. Each class must be located, verified, linked, and prepared for execution. The larger the application and dependency graph, the more work the JVM must perform before serving traffic.

Why Spring Applications Load So Many Classes

Modern Spring applications rely heavily on framework capabilities such as dependency injection, reflection, annotations, auto-configuration, and proxy generation.

These features provide tremendous developer productivity but introduce startup overhead.

Common contributors include:

dependency injection
reflection
annotations
auto-configuration
proxy generation

Even a seemingly simple dependency can trigger substantial framework activity.

For example, spring-boot-starter-web does far more than provide an embedded web server.

Spring performs extensive classpath scanning, conditional configuration evaluation, bean registration, and framework initialization.

As applications grow, startup complexity grows as well.

The Classloader Hierarchy

The JVM organizes classloading through a hierarchy of classloaders.

Most applications rely on three primary classloaders:

Bootstrap ClassLoader
          |
Platform ClassLoader
          |
Application ClassLoader

The Bootstrap ClassLoader loads core JDK classes such as java.lang.String and java.util.List.

The Platform ClassLoader loads the remaining JDK modules, such as java.sql.

The Application ClassLoader loads application-specific dependencies and business code like:

Spring
Hibernate
Business code

Understanding this hierarchy becomes important when diagnosing dependency conflicts, startup failures, or class visibility issues. Many difficult startup problems ultimately originate from classloading behavior.
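When diagnosing such issues, it can help to print which loader defined a class and to walk the parent chain. The sketch below uses only standard ClassLoader methods (Java 9 or later for getName); the ClassLoaderChain class and its helpers are made up for the example. The bootstrap loader is represented as null in the API, so the code handles that case explicitly.

```java
final class ClassLoaderChain {

    // Human-readable label for a loader; null stands for the bootstrap loader.
    static String describe(ClassLoader loader) {
        if (loader == null) {
            return "bootstrap";
        }
        String name = loader.getName();
        return name != null ? name : loader.getClass().getName();
    }

    static String loaderOf(Class<?> type) {
        return describe(type.getClassLoader());
    }

    public static void main(String[] args) {
        System.out.println("java.lang.String -> " + loaderOf(String.class));
        System.out.println("this class       -> " + loaderOf(ClassLoaderChain.class));

        // Walk from the application loader up through its parents.
        ClassLoader loader = ClassLoader.getSystemClassLoader();
        while (loader != null) {
            System.out.println(describe(loader));
            loader = loader.getParent();
        }
        System.out.println(describe(null));
    }
}
```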

Why Classloading Matters in Containers

Classloading directly affects startup performance.

In traditional environments, startup time might not matter significantly because applications remained running for weeks or months.

Containerized environments are different.

Auto-scaling events, rolling deployments, and node failures can trigger frequent application startups.

Consider a deployment scaling from 5 pods to 50 pods.

Every new pod must complete startup before it can serve traffic.
Classloading delays become deployment delays.

Large applications may spend a surprising amount of time loading classes and initializing frameworks before becoming operational. For highly dynamic environments, startup efficiency becomes an important operational characteristic.

4. Memory in Containers: Where Reality Gets Interesting

Many JVM misconceptions become apparent only after applications move into containers.

Historically, the JVM was designed for physical servers and virtual machines where resource boundaries were relatively straightforward.

Containers introduced a new abstraction layer that changed how resources are allocated and enforced.

These changes exposed assumptions that were previously hidden.

The JVM Was Not Originally Container-Aware

Older JVM versions viewed the environment primarily through the perspective of the host machine.

For example:

64 GB Host

The JVM assumed those resources were available.

Containers introduced a different reality:

64 GB Host
|
512 MB Container

The application could access only a small fraction of the host's resources.

Early JVM versions frequently sized memory pools based on host capacity rather than container limits.

This behavior caused numerous production issues in Kubernetes environments.

Modern JVMs have become significantly more container-aware, but understanding the historical context helps explain many configuration recommendations.
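A cheap way to confirm what the JVM believes about its environment is to log its view of memory and CPU at startup. The sketch below uses only standard Runtime calls; the ContainerView class is hypothetical. On a container-aware JVM with no explicit heap setting, these values would be expected to reflect the container's limits rather than the host's capacity.

```java
final class ContainerView {

    // Maximum heap the JVM will attempt to use, in megabytes.
    static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    // Number of processors the JVM considers available to it.
    static int cpus() {
        return Runtime.getRuntime().availableProcessors();
    }

    public static void main(String[] args) {
        System.out.println("max heap MB          = " + maxHeapMb());
        System.out.println("available processors = " + cpus());
    }
}
```

If a 512 MB container reports a maximum heap of several gigabytes, the JVM is most likely sizing itself from the host.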

Heap Is Not Total Memory

One of the most common mistakes in containerized Java deployments is allocating the entire container budget to the heap.

For example:

Container Limit = 512 MB
Heap = 512 MB

This configuration leaves no room for any other JVM memory consumers.

The JVM still requires memory for:

Metaspace
Thread stacks
Direct memory
Native allocations

Eventually, total process memory exceeds the container limit.

A healthier configuration might look like:

Container Limit = 512 MB

Heap = 300 MB
Remaining Memory = JVM Overhead

The exact allocation depends on workload characteristics, but the principle remains universal.

The heap cannot consume the entire memory budget.

Why Pods Get OOMKilled

Many production incidents follow a similar pattern.

Consider the following memory breakdown:

Container Limit = 512 MB

Heap = 320 MB
Metaspace = 80 MB
Native = 90 MB
Thread Stacks = 50 MB

Total = 540 MB

From the JVM's perspective, heap utilization may appear healthy. Garbage collection may appear healthy. Application latency may appear healthy.

Yet, Kubernetes observes only one fact:

540 MB > 512 MB

The process exceeds its memory limit. The operating system terminates it.

Kubernetes restarts the pod.

This behavior often surprises teams because traditional JVM monitoring focuses heavily on heap metrics while ignoring other memory consumers.

In many real-world incidents, understanding total JVM memory consumption provides far more value than tuning garbage collection parameters.

Before optimizing GC, it is often worth ensuring that the JVM's complete memory footprint actually fits within the container budget.
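The check itself is plain arithmetic, as this small sketch shows using the numbers from the breakdown above. The MemoryBudget class and the component names are illustrative. In practice the inputs would come from measurements, not constants.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MemoryBudget {

    // Sum of all memory consumers, in megabytes.
    static long totalMb(Map<String, Long> componentsMb) {
        return componentsMb.values().stream().mapToLong(Long::longValue).sum();
    }

    // True when the whole footprint, not just the heap, fits within the limit.
    static boolean fits(Map<String, Long> componentsMb, long limitMb) {
        return totalMb(componentsMb) <= limitMb;
    }

    public static void main(String[] args) {
        Map<String, Long> footprint = new LinkedHashMap<>();
        footprint.put("heap", 320L);
        footprint.put("metaspace", 80L);
        footprint.put("native", 90L);
        footprint.put("threadStacks", 50L);

        long limitMb = 512;
        System.out.println("total MB = " + totalMb(footprint));
        System.out.println("fits in " + limitMb + " MB: " + fits(footprint, limitMb));
    }
}
```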

5. Garbage Collection Choices That Matter

Garbage Collection discussions often become overly theoretical. Many articles spend significant time explaining concepts such as mark-and-sweep algorithms, generational memory management, compaction strategies, and collector internals. While these topics are important for understanding how the JVM works, most backend teams are usually trying to answer more practical questions. They want to know which garbage collector they should use, what problems it solves, when it makes sense to move away from the default configuration, and how to determine whether garbage collection is actually responsible for a performance issue.

Modern Java has already made many sensible decisions on behalf of developers. For most microservices, the default collector is an excellent starting point. The real challenge is understanding when the default stops being sufficient and what trade-offs alternative collectors introduce.

What Garbage Collection Is Optimizing

Every garbage collector attempts to balance three competing goals: throughput, latency, and memory efficiency. Improving one of these dimensions often comes at the expense of another.

A collector optimized for throughput may allow longer stop-the-world pauses because it prioritizes maximizing the amount of useful application work completed over time. This approach can be highly efficient for batch workloads but may introduce noticeable pauses that affect user-facing applications.

A latency-focused collector takes a different approach. Instead of maximizing throughput, it attempts to minimize pause times by performing more work concurrently with application threads. This keeps applications responsive but consumes additional CPU resources because the collector remains active while the application is running.

Memory efficiency introduces yet another dimension. Some collectors require additional metadata structures, forwarding information, or concurrent processing overhead to achieve lower pause times. As a result, reducing latency may increase memory consumption or CPU utilization.

Conceptually, the trade-off looks like this:

                 Throughput
                      ▲
                      │
Memory Efficiency ◄───┼───► Latency

Moving closer to one corner generally means moving farther away from another. There is no universally optimal collector because every workload has different priorities. The best choice depends on the characteristics of the application being deployed.

Understanding Generational Collection

Before discussing specific collectors, it is useful to understand why modern JVMs organize memory into generations. This design is based on a simple observation: most objects in Java die young.

Objects such as HTTP request wrappers, DTOs, temporary collections, serialization buffers, and JSON parsing structures are often created and discarded within milliseconds. A single request may allocate thousands of short-lived objects that become unreachable immediately after the response is returned.

Because of this behavior, the JVM separates memory into regions optimized for different object lifetimes.

Young Generation
      |
Most Objects Die
      |
      v
Old Generation
      |
Long-Lived Objects

The young generation is collected frequently because reclaiming memory there is usually inexpensive. Objects that survive multiple collection cycles are promoted into the old generation, where they are assumed to have longer lifetimes.

Many memory-related problems begin when allocation rates become excessive, objects survive longer than expected, or old-generation growth becomes continuous. In practice, understanding object lifetime patterns is often more valuable than switching collectors because many performance issues originate from application behavior rather than collector choice.

Class Data Sharing (CDS)

Class Data Sharing is one of the least discussed JVM optimizations despite its practical value.

Normally, every JVM process loads and processes core JDK classes independently. CDS allows pre-processed class metadata to be stored in a shared archive.

JDK Classes
      |
Create Archive
      |
Shared By JVMs

This reduces startup time, lowers memory consumption, and improves class-loading efficiency. Multiple JVM processes can reuse the same archive, making CDS particularly valuable in containerized environments where many identical workloads run simultaneously.

Application CDS

Application CDS extends the same concept beyond JDK classes.

Instead of sharing only core platform classes, applications can include framework classes, third-party libraries, and application-specific code within the archive.

Application Classes
        |
Generate Archive
        |
Reuse During Startup

As application size grows, the benefits become increasingly noticeable. Large Spring Boot applications can often achieve meaningful startup improvements through Application CDS.

AOT and Native Images

Spring and GraalVM introduced another strategy: moving work from runtime to build time.

Instead of performing extensive runtime analysis, applications can be compiled ahead of time.

Build Time Analysis
        |
Generate Native Binary
        |
Fast Startup

This approach delivers faster startup times, lower memory footprints, and reduced runtime initialization overhead.

The trade-offs include longer build times, reduced JVM dynamism, reflection constraints, and additional operational complexity. Native images solve a different set of problems than traditional JVM tuning, and many organizations achieve acceptable startup performance without leaving the JVM ecosystem.

6. Observability: What To Monitor

Many JVM tuning efforts fail because teams focus on the wrong metrics. Heap utilization alone rarely tells the complete story.

Effective JVM observability requires visibility into memory behavior, allocation patterns, garbage collection activity, and runtime characteristics.

Heap Metrics

Teams should monitor heap used, heap committed, and heap maximum values. These metrics help answer important questions.

Is heap usage growing continuously? Are objects surviving longer than expected? Are caches oversized? Does memory return to normal levels after traffic decreases?

A healthy pattern often resembles:

Heap Usage
    /\
   /  \
  /    \
 /      \
----------

Memory grows and shrinks as collections occur.

A problematic pattern looks more like:

Heap Usage
    /
   /
  /
 /
/

Continuous growth may indicate memory leaks, retained references, or unbounded caches. Heap metrics are often the fastest way to identify memory-related issues.

GC Metrics

Garbage collection metrics deserve equal attention. Teams should monitor pause duration, collection frequency, allocation rate, and promotion rate.

Questions worth asking include whether pauses are affecting latency, whether allocation rates are unusually high, whether old-generation occupancy is growing continuously, and whether promotion rates are increasing unexpectedly.

Allocation rate is particularly important because a service allocating several gigabytes per second may experience GC pressure even when heap utilization appears healthy. Tuning decisions should be based on these metrics rather than assumptions.
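Collection counts and accumulated collection time are available from the standard GarbageCollectorMXBean. The sketch below is a rough starting point: the GcOverhead class and the "share of uptime" ratio are illustrative choices, not an established metric. For concurrent collectors, accumulated collection time is not the same as pause time, so pause analysis still needs GC logs or a dedicated metrics library.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcOverhead {

    // Total time the collectors report having spent, in milliseconds.
    // getCollectionTime() returns -1 when undefined, so negatives are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    // Share of JVM uptime attributed to collection.
    static double gcShare(long gcMillis, long uptimeMillis) {
        return uptimeMillis <= 0 ? 0.0 : (double) gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName() + ": count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        System.out.printf("GC share of uptime = %.4f%n", gcShare(totalGcMillis(), uptime));
    }
}
```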

Container Metrics

Many JVM dashboards stop at heap metrics, but container platforms don't. Teams should monitor RSS memory, container memory usage, memory limits, and OOMKill events.

Container Memory
       |
+------+--------+----+
| Heap | Native | OS |
+------+--------+----+

The JVM controls only part of total process memory consumption. Container-level metrics frequently expose issues that remain invisible when observing heap metrics alone.
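To compare heap usage with what the operating system sees, a process can read its own resident set size. The sketch below is Linux-only and assumes /proc/self/status contains a VmRSS line in kB; the ResidentMemory class is hypothetical, and it returns an empty result on platforms without /proc.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.OptionalLong;

final class ResidentMemory {

    // Extracts the VmRSS value (in kB) from /proc/<pid>/status style lines.
    static OptionalLong parseRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                String digits = line.replaceAll("[^0-9]", "");
                if (!digits.isEmpty()) {
                    return OptionalLong.of(Long.parseLong(digits));
                }
            }
        }
        return OptionalLong.empty();
    }

    // Linux only: returns empty where /proc is not available.
    static OptionalLong currentRssKb() {
        Path status = Path.of("/proc/self/status");
        if (!Files.isReadable(status)) {
            return OptionalLong.empty();
        }
        try {
            return parseRssKb(Files.readAllLines(status));
        } catch (IOException e) {
            return OptionalLong.empty();
        }
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long heapUsedKb = (rt.totalMemory() - rt.freeMemory()) / 1024;
        System.out.println("heap used kB   = " + heapUsedKb);
        System.out.println("process RSS kB = " + currentRssKb());
    }
}
```

The gap between the two numbers is a first approximation of the non-heap footprint discussed throughout this article.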

Startup Metrics

Startup metrics are increasingly important in Kubernetes, serverless environments, and autoscaling workloads.

Useful measurements include startup duration, loaded class count, and bean initialization time.

A startup breakdown often looks like:

Startup Time
      |
      +-- Class Loading
      |
      +-- Spring Initialization
      |
      +-- Bean Creation
      |
      +-- Application Ready

Startup performance should be treated as production performance because it directly affects deployment and recovery behavior.
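Two of these measurements are available from the JVM without any framework support. The sketch below is a minimal example with a hypothetical StartupProbe class. It assumes the application calls it from whatever hook signals readiness, and it treats JVM start time as the beginning of startup.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

final class StartupProbe {

    // Milliseconds between JVM start and the moment the application declares itself ready.
    static long millisSinceJvmStart(long readyAtEpochMillis) {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        return readyAtEpochMillis - runtime.getStartTime();
    }

    // Classes loaded so far, a rough proxy for classloading work done during startup.
    static int loadedClasses() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    public static void main(String[] args) {
        // In a real service this would run once the application is ready to serve traffic.
        long readyAt = System.currentTimeMillis();
        System.out.println("startup ms              = " + millisSinceJvmStart(readyAt));
        System.out.println("classes loaded at ready = " + loadedClasses());
    }
}
```

Recording both values on every deployment makes startup regressions visible as a trend.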

7. Common JVM Mistakes in Microservices

Some JVM-related mistakes appear repeatedly across engineering teams.

Setting Heap Equal To Container Memory

A common mistake is configuring heap size equal to the container memory limit.

Container Limit = 1 GB
Heap = 1 GB

This leaves no room for Metaspace, thread stacks, native memory, direct buffers, or JIT compiler structures.

Actual process memory consumption includes:

Heap + Metaspace + Native Memory + Thread Stacks

The result is often container OOMKills. A safer approach reserves sufficient memory for non-heap consumers.

Ignoring Native Memory

Many teams monitor heap usage while ignoring total process memory.

Native memory includes thread stacks, direct byte buffers, JNI allocations, GC metadata, and JIT compiler structures. These components can consume substantial memory and frequently explain crashes where heap utilization appears normal.

Premature GC Tuning

A common troubleshooting pattern looks like this:

The application slows down.
Garbage collection becomes the primary suspect.
The collector is changed.

In many cases, the actual root cause is inefficient queries, memory leaks, oversized caches, or excessive serialization overhead.

GC tuning should always be evidence-driven. Teams should validate a clear correlation between latency issues and GC pauses before changing collectors.

Latency
     |
GC Pause Correlation
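One way to gather evidence before changing collectors is to sample the JVM's collector beans and compare the deltas against latency graphs. This is a minimal sketch with a hypothetical class name. Note that the reported collection time is accumulated elapsed time per collector, which for concurrent collectors is not the same as application pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper (hypothetical name): totals GC activity across all collectors.
final class GcSnapshot {

    /** Total number of collections; collectors that report -1 (undefined) count as 0. */
    static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds. */
    static long totalCollectionMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime());
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.printf("collections=%d, timeMs=%d%n",
                totalCollections(), totalCollectionMillis());
    }
}
```

Sampling these values periodically and plotting the difference between samples next to request latency makes the correlation, or the lack of one, visible.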

Treating Every Service The Same

Different workloads have different requirements.

An API gateway is typically latency-sensitive:

API Gateway
     |
Latency Sensitive

A batch processor is usually throughput-sensitive:

Batch Job
   |
Throughput Sensitive

The gateway may benefit from lower pause times, while the batch processor may prioritize throughput and memory efficiency. JVM decisions should reflect workload characteristics rather than organizational defaults.

8. Practical Recommendations

A few practical guidelines consistently work well across many environments.

Small Spring Boot Services (< 2 GB)

For smaller services, Java 21, G1GC, and container-aware JVM defaults are usually sufficient.

Teams should monitor heap usage, RSS memory, and GC pauses while avoiding aggressive tuning. Modern JVM defaults are already highly optimized.

High-Traffic APIs

For high-traffic APIs, focus on allocation rate, latency distribution, GC pause behavior, and tail latency.

A useful investigation flow looks like:

Latency Increase
       |
Check GC Pauses
       |
Check Allocation Rate
       |
Evaluate Collector

If latency requirements justify it, ZGC may be worth evaluating.

Startup-Sensitive Services

Services that scale frequently should focus on class loading, startup configuration, CDS archives, and bean initialization performance.

Startup should be measured continuously, and regressions should be treated with the same seriousness as runtime performance regressions.

Memory-Constrained Containers

When operating in memory-constrained environments, reserve capacity for Metaspace, thread stacks, direct buffers, and other native allocations.

Container Limit
       |
       +-- Heap
       |
       +-- Metaspace
       |
       +-- Native Memory

Avoid allocating the entire container budget to the heap. Sufficient headroom is essential for stable operation.

9. Final Thoughts

Understanding JVM internals is not about memorizing garbage collector algorithms or collecting obscure JVM flags.

Microservices introduce operational constraints that make certain JVM concepts impossible to ignore. Class loading influences startup behavior. Memory extends far beyond the heap. Garbage collection affects latency, throughput, and resource utilization. Container limits apply to the entire JVM process, not just the heap.

Most production incidents involving Java services are not caused by obscure JVM bugs. They are usually the result of incorrect assumptions about how the JVM behaves inside modern cloud environments.

The JVM has become remarkably efficient. Modern collectors such as G1GC and ZGC solve problems that previously required extensive manual tuning. The responsibility of engineering teams is understanding the relatively small set of JVM concepts that directly affect production systems.

These concepts influence cloud costs, deployment speed, application stability, and operational reliability far more than any collection of JVM tuning flags ever will.

Data Consistency Under Contention: Optimistic vs Pessimistic Locking

Venkatesan Ramar — Wed, 24 Jun 2026 06:35:00 +0000

A few years ago, I investigated a production issue where customers occasionally reported incorrect inventory counts. The application was healthy. The database was healthy. No errors appeared in the logs.

The problem turned out to be concurrent updates. Multiple requests were modifying the same inventory record at nearly the same time, and one update silently overwrote another. The database did exactly what it was asked to do. The application failed to co-ordinate concurrent modifications to shared data.

This is a common consistency problem. Whenever multiple users, services, or processes attempt to modify the same data simultaneously, contention appears.

To manage that contention, systems typically rely on two approaches: optimistic locking and pessimistic locking. Both aim to preserve data consistency, but they make very different assumptions about how conflicts occur. Those assumptions directly affect performance, scalability, and user experience.

1. Why Locking Exists

Databases are excellent at storing and retrieving data, but they do not inherently understand business intent. They execute operations exactly as instructed. This becomes problematic when multiple users interact with the same piece of data at the same time.

Consider an inventory record:

Product A
Inventory = 10

Now imagine two users accessing the system simultaneously. Both users read the same inventory value:

Inventory = 10

User A purchases one item.
User B purchases two items.

The timeline looks like this:

User A reads 10
User B reads 10

User A writes 9
User B writes 8

Both transactions succeed from the database's perspective. No errors occur, and both updates are accepted. However, one update effectively overwrites the other.

This scenario is known as a lost update. Both users started with the same information, but because their updates were not co-ordinated, one user's changes disappeared. Locking mechanisms exist primarily to prevent such situations.
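The timeline can be replayed deterministically in a few lines. This is an in-memory sketch rather than database code; the class LostUpdateDemo and its methods are hypothetical stand-ins for a row that is read and then written back.

```java
// In-memory stand-in for the inventory row (hypothetical class, not database code).
final class LostUpdateDemo {
    private int inventory = 10;

    int read() { return inventory; }

    void write(int value) { inventory = value; }

    /** Replays the timeline: both users read before either one writes. */
    static int replay() {
        LostUpdateDemo row = new LostUpdateDemo();
        int seenByA = row.read();      // User A reads 10
        int seenByB = row.read();      // User B reads 10
        row.write(seenByA - 1);        // User A buys one item and writes 9
        row.write(seenByB - 2);        // User B buys two items and writes 8, overwriting A
        return row.read();
    }

    public static void main(String[] args) {
        // Three items were sold, so the correct value would be 7.
        System.out.println("final inventory = " + replay());
    }
}
```

The final value is 8 even though three items were sold, because User B's write was computed from a stale read.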

Concurrency Is Usually A Business Problem

Concurrency issues rarely present themselves as obvious technical failures. Systems continue running, databases remain available, and monitoring dashboards look healthy. The real impact appears in business outcomes.

Customers do not care whether the root cause involves MVCC, transaction isolation levels, or a particular locking strategy. They only see incorrect results. For that reason, concurrency control is not merely a database concern—it is a business requirement that directly affects customer trust and operational correctness.

Contention Changes Everything

Many applications operate flawlessly until contention increases. A user profile system may rarely experience concurrent updates because different users modify different records. In contrast, a payment platform may process thousands of updates against the same accounts every second. Similarly, a seat reservation system may have thousands of users competing for a very small number of records.

The frequency of contention is one of the most important factors when choosing a concurrency strategy. Systems with frequent conflicts require a different approach than systems where conflicts are rare. This distinction forms the foundation of the optimistic versus pessimistic locking debate.

2. Pessimistic Locking: Assume Conflict Will Happen

Pessimistic locking starts with a conservative assumption:

Someone else will probably try to modify this data.

Because conflicts are expected, the system prevents them from occurring by restricting access immediately. The first transaction acquires a lock on the data, and any subsequent transaction attempting to modify the same data must wait until the lock is released.

This approach prioritizes correctness by ensuring that only one transaction can modify a resource at a time.

The Bank Account Example

Imagine two transactions attempting to modify the same account balance.

Transaction A begins and acquires a lock:

SELECT * FROM account WHERE id = 100 FOR UPDATE;

The row becomes locked, preventing other transactions from modifying it.

Now Transaction B attempts the same operation:

SELECT * FROM account WHERE id = 100 FOR UPDATE;

Because Transaction A already holds the lock, Transaction B cannot proceed. It must wait until Transaction A completes and releases the lock. This guarantees that updates occur sequentially rather than concurrently.

What Happens Under Contention

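The flow looks like this:

Transaction A acquires lock
       |
Transaction B requests lock and waits
       |
Transaction A commits and releases lock
       |
Transaction B acquires lock and proceeds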

Notice that the second transaction does not fail. Instead, it pauses until the lock becomes available. This behavior makes correctness easier to reason about because the database itself enforces exclusive access to the data. Developers do not need to detect conflicts later because the database prevents them from occurring in the first place.
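The same wait-then-proceed behavior can be illustrated inside a single JVM. This is only an analogy, assuming a ReentrantLock plays the role of the row lock; the class LockedAccount is hypothetical and no database is involved.

```java
import java.util.concurrent.locks.ReentrantLock;

// In-process analogy for a row lock (hypothetical class, not database code).
final class LockedAccount {
    private final ReentrantLock lock = new ReentrantLock();
    private long balance;

    LockedAccount(long openingBalance) { this.balance = openingBalance; }

    /** Blocks until the lock is free, then performs the read-modify-write as one unit. */
    void withdraw(long amount) {
        lock.lock();                 // comparable to SELECT ... FOR UPDATE: later callers wait here
        try {
            long current = balance;
            balance = current - amount;
        } finally {
            lock.unlock();           // comparable to COMMIT: the next waiter may proceed
        }
    }

    long balance() {
        lock.lock();
        try { return balance; } finally { lock.unlock(); }
    }

    /** Two concurrent withdrawals are serialized, so neither one is lost. */
    static long runConcurrently() {
        LockedAccount account = new LockedAccount(1000);
        Thread a = new Thread(() -> account.withdraw(100));
        Thread b = new Thread(() -> account.withdraw(200));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return account.balance();
    }

    public static void main(String[] args) {
        System.out.println("final balance = " + runConcurrently());
    }
}
```

Whichever thread arrives second simply waits, and both withdrawals are applied.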

Why Financial Systems Like Pessimistic Locking

Certain domains prioritize correctness above all else. Examples include payment processing systems, banking platforms, trading applications, and inventory reservation systems.

In these environments, waiting is preferable to risking inconsistent data. Consider two users attempting to reserve the last available airline seat. Allowing both requests to proceed simultaneously could result in overselling the seat, creating operational and customer-service problems. A short delay is usually a much smaller cost than correcting inconsistent business data later.

The Cost Of Waiting

While pessimistic locking provides strong protection against conflicting updates, it introduces a different challenge: reduced concurrency.

As contention increases:

response times increase
throughput decreases
blocked transactions accumulate

Under heavy load, lock contention can become a significant bottleneck. Instead of processing business operations, the database spends more time coordinating access to shared resources. This trade-off becomes increasingly visible in high-traffic systems where many users compete for the same records.

3. Optimistic Locking: Assume Conflict Is Rare

Optimistic locking takes the opposite approach.

Most transactions will not conflict.

Instead of preventing concurrent access, the system allows multiple users to work with the same data simultaneously. Rather than blocking access upfront, conflicts are detected later when an update is attempted.

This approach assumes that contention is relatively uncommon and that most operations can proceed without interference.

The Core Idea

Optimistic locking typically relies on a version number stored alongside each record.

Example:

Account
--------
Id = 100
Balance = 1000
Version = 5

Suppose two users read the same row. Both receive:

Version = 5

User A updates the record first:

UPDATE account SET balance = 900, version = 6
               WHERE id = 100 AND version = 5;

The update succeeds because the version matches the expected value. The record now becomes:

Version = 6

Later, User B attempts an update:

UPDATE account SET balance = 800, version = 6
               WHERE id = 100 AND version = 5;

This update affects zero rows because the version is no longer 5. The statement itself does not raise an error; the application detects the conflict by checking the affected row count and treats zero rows as a failed update.
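The version check can be modeled in memory to make the row-count behavior concrete. This is a sketch under the assumption that a synchronized method stands in for the atomic UPDATE statement; VersionedAccount is a hypothetical class.

```java
// In-memory model of a versioned row (hypothetical class, not database code).
final class VersionedAccount {
    private long balance = 1000;
    private long version = 5;

    /**
     * Mirrors UPDATE ... SET balance = ?, version = version + 1 WHERE version = ?.
     * Returns the number of affected rows: 1 on success, 0 on a version mismatch.
     */
    synchronized int update(long newBalance, long expectedVersion) {
        if (version != expectedVersion) {
            return 0;                     // someone else updated first
        }
        balance = newBalance;
        version = expectedVersion + 1;
        return 1;
    }

    synchronized long balance() { return balance; }

    synchronized long version() { return version; }

    public static void main(String[] args) {
        VersionedAccount account = new VersionedAccount();
        System.out.println("User A rows affected: " + account.update(900, 5));
        System.out.println("User B rows affected: " + account.update(800, 5));
        System.out.println("balance=" + account.balance() + ", version=" + account.version());
    }
}
```

User B's stale write is rejected, so the balance written by User A survives.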

Conflict Becomes Explicit

Unlike pessimistic locking, optimistic locking does not force transactions to wait. Instead, conflicting updates fail immediately.

The application must then decide how to respond. Common options include:

retry
refresh data
reject the operation
ask the user to resolve the conflict

This approach makes conflicts visible rather than hiding them behind waiting transactions. The responsibility for handling those conflicts shifts from the database to the application.

Why Modern Applications Prefer Optimistic Locking

Many business applications experience relatively low contention. Examples include customer profiles, employee records, product catalogs, and content management systems. Most users interact with different records, making simultaneous updates uncommon.

In these environments, blocking every update would introduce unnecessary overhead. Optimistic locking allows the system to maximize concurrency while still detecting the occasional conflict. As a result, applications achieve better scalability and responsiveness.

The Cost Of Retrying

Optimistic locking reduces database contention but introduces complexity elsewhere. Because conflicts are detected after they occur, applications must implement strategies for handling failures.

Retries may sound straightforward, but production systems require additional considerations such as exponential back-off, user experience, duplicate submissions, and retry storms.

As a result, conflict resolution becomes an important part of application design rather than a purely database-level concern.
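A retry loop with exponential back-off and random jitter might look like the sketch below. The class BackoffRetry, its parameters, and the choice of full jitter are illustrative assumptions; the attempt is modeled as a BooleanSupplier that returns true when the update succeeds.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.BooleanSupplier;

// Illustrative retry helper (hypothetical name and parameters).
final class BackoffRetry {

    /** Upper bound for the delay before the next try: base * 2^attempt, capped at maxMillis. */
    static long ceilingMillis(int attempt, long baseMillis, long maxMillis) {
        long shifted = baseMillis << Math.min(attempt, 20);   // clamp the shift to limit growth
        return Math.min(maxMillis, shifted);
    }

    /** Runs the attempt until it reports success or maxAttempts is exhausted. */
    static boolean retry(BooleanSupplier attempt, int maxAttempts, long baseMillis, long maxMillis) {
        for (int i = 0; i < maxAttempts; i++) {
            if (attempt.getAsBoolean()) {
                return true;
            }
            if (i < maxAttempts - 1) {
                // Full jitter: sleep a random time in [0, ceiling] so clients do not retry in lockstep.
                long ceiling = ceilingMillis(i, baseMillis, maxMillis);
                long sleep = ThreadLocalRandom.current().nextLong(ceiling + 1);
                try {
                    Thread.sleep(sleep);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        boolean ok = retry(() -> ++calls[0] == 3, 5, 10, 200);
        System.out.println("succeeded=" + ok + " after " + calls[0] + " attempts");
    }
}
```

The attempt cap and the jitter are what keep many clients from retrying at the same instant and turning a brief conflict into a retry storm.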

4. How Modern RDBMS Actually Handle Concurrency

Many engineers imagine databases constantly locking rows and blocking transactions. Modern relational databases are far more sophisticated.

Systems such as PostgreSQL and MySQL (with the InnoDB storage engine) rely heavily on a technique called Multi-Version Concurrency Control (MVCC). Understanding MVCC helps explain why modern databases can support high levels of concurrency without excessive blocking.

Multiple Versions Of Data

Instead of immediately replacing existing data, MVCC creates new versions of rows whenever updates occur.

Conceptually:

┌───────────────┐
│ Row Version 1 │
└───────────────┘
         ↓
┌───────────────┐
│ Row Version 2 │
└───────────────┘
         ↓
┌───────────────┐
│ Row Version 3 │
└───────────────┘

Older versions remain available for active transactions that still need them. This allows readers to continue accessing a consistent view of the data while updates occur in parallel.

The result is significantly less blocking and much higher concurrency.
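The idea can be sketched with a toy structure that keeps every committed version and lets a reader pick the newest one visible to its snapshot. This is a deliberately simplified model, assuming plain numeric commit timestamps; real engines use transaction IDs, more involved visibility rules, and cleanup of old versions. The class VersionedRow is hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of a multi-versioned row (hypothetical class, greatly simplified).
final class VersionedRow {
    // commit timestamp -> value that became visible at that timestamp
    private final TreeMap<Long, String> versions = new TreeMap<>();

    /** Writers append a new version instead of overwriting the previous one. */
    void commit(long commitTimestamp, String value) {
        versions.put(commitTimestamp, value);
    }

    /** A reader sees the newest version committed at or before its snapshot, or null if none. */
    String readAt(long snapshotTimestamp) {
        Map.Entry<Long, String> entry = versions.floorEntry(snapshotTimestamp);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        VersionedRow row = new VersionedRow();
        row.commit(1, "Row Version 1");
        row.commit(5, "Row Version 2");
        System.out.println("snapshot 3 sees: " + row.readAt(3));
        System.out.println("snapshot 5 sees: " + row.readAt(5));
    }
}
```

A reader holding snapshot 3 keeps seeing the first version even after the second is committed, which is why it never needs to wait for the writer.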

Why Reads Usually Don't Block Writes

One of the most common misconceptions about databases is:

Every update blocks every read.

In MVCC-based databases, this is not true. Readers can access a consistent snapshot of the data while writers create newer versions in the background.

This capability allows databases to support large numbers of concurrent users without forcing readers and writers to constantly wait for one another. It is one of the primary reasons modern relational databases scale far better than many developers initially expect.

Isolation Levels Matter

Locking strategies are only one part of the consistency story. Isolation levels determine what data a transaction can see while other transactions are running.

Common isolation levels include:

Read Committed
Repeatable Read
Serializable

Each level provides different guarantees and trade-offs. Higher isolation levels generally offer stronger consistency but require additional coordination and overhead.

Choosing a locking strategy without understanding transaction isolation can lead to incorrect assumptions about application behavior. In practice, consistency emerges from the combination of locking mechanisms, MVCC behavior, and transaction isolation working together.

5. Deadlocks: The Hidden Cost of Pessimistic Locking

Pessimistic locking guarantees exclusive access to data by preventing multiple transactions from modifying the same resource simultaneously. While this approach is highly effective at preserving consistency, it introduces a different class of concurrency problems: deadlocks.

Deadlocks typically do not appear during initial development or testing because contention levels are low and transaction flows are relatively simple. As systems grow, however, more users, background processes, and business workflows begin interacting with the same data concurrently. Under these conditions, transactions may start waiting on each other in ways that create circular dependencies.

When that happens, transactions that previously completed successfully begin failing unexpectedly, without any changes to the underlying business logic.

A Classic Deadlock Scenario

Consider a money transfer workflow involving two accounts.

Transaction A                     Transaction B
─────────────                     ─────────────
Lock Account A                    Lock Account B
       │                                 │
       ▼                                 ▼
Update Account A                  Update Account B
       │                                 │
       ▼                                 ▼
Lock Account B ◄──────────────► Lock Account A
                  DEADLOCK

Transaction A holds a lock on Account A and waits for Account B. Meanwhile, Transaction B holds a lock on Account B and waits for Account A.

Neither transaction can proceed because each is waiting for a resource currently held by the other. Neither transaction can release its lock because it has not yet completed.

The database detects this circular wait condition and identifies it as a deadlock.

Deadlocks are not limited to two rows or two transactions. In complex systems, deadlocks may involve multiple tables, indexes, and transactions, making them difficult to diagnose without proper monitoring and logging.

How Databases Resolve Deadlocks

Modern relational databases continuously analyze lock dependencies between active transactions. When a deadlock is detected, the database must break the cycle to allow progress.

A simplified flow looks like this:

┌───────────────┐
│ Transaction A │
└───────────────┘
         ↓
┌───────────────┐
│   Deadlock    │
└───────────────┘
         ↓
┌──────────────────────────┐
│ Database Chooses Victim  │
└──────────────────────────┘
         ↓
┌───────────────┐
│   Rollback    │
└───────────────┘

The database selects one transaction as the deadlock victim and rolls it back. The other transaction is allowed to continue and eventually commit.

The victim selection process varies by database implementation. Factors such as transaction age, resource consumption, and rollback cost may influence which transaction is terminated.

From the application's perspective, this usually appears as an exception indicating that the transaction failed due to a deadlock. The application must be prepared to retry the operation because deadlocks are considered transient failures rather than permanent errors.

Importantly, deadlocks are not database bugs. They are an expected consequence of concurrent transactions acquiring locks in different orders.

Deadlocks Become Operational Problems

Deadlocks are difficult to reproduce in development environments because concurrency levels are significantly lower than in production.

Real-world systems contain many independent actors operating simultaneously, such as concurrent users, background jobs, asynchronous consumers, and scheduled tasks.

Each of these components may access shared resources using different execution paths.

A deadlock occurring once every few weeks may have little operational impact. However, when contention increases and deadlocks begin occurring hundreds or thousands of times per hour, they can significantly affect throughput, latency, and user experience.

For this reason, high-scale systems attempt to minimize lock durations, enforce consistent lock acquisition ordering, or adopt optimistic concurrency strategies when contention remains relatively low.
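Consistent lock ordering can be illustrated in-process. In this sketch, which uses ReentrantLock as a stand-in for row locks, every transfer locks the account with the lower id first, so two opposing transfers can never wait on each other in a cycle. The class and field names are hypothetical.

```java
import java.util.concurrent.locks.ReentrantLock;

// In-process illustration of ordered lock acquisition (hypothetical classes).
final class OrderedTransfer {

    static final class Account {
        final long id;
        final ReentrantLock lock = new ReentrantLock();
        long balance;

        Account(long id, long balance) {
            this.id = id;
            this.balance = balance;
        }
    }

    /** Always locks the lower id first, regardless of the transfer direction. */
    static void transfer(Account from, Account to, long amount) {
        Account first = from.id < to.id ? from : to;
        Account second = first == from ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }

    /** Runs many transfers in opposite directions; completes because lock order is consistent. */
    static long[] runOpposingTransfers() {
        Account a = new Account(1, 1000);
        Account b = new Account(2, 1000);
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1000; i++) transfer(a, b, 1); });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1000; i++) transfer(b, a, 2); });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] {a.balance, b.balance};
    }

    public static void main(String[] args) {
        long[] balances = runOpposingTransfers();
        System.out.println("A=" + balances[0] + ", B=" + balances[1]);
    }
}
```

In SQL the equivalent discipline is to issue the locking statements for the affected rows in a fixed order, for example ascending by primary key.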

6. Optimistic Locking in Spring and JPA

Optimistic locking is one of the most commonly used concurrency control mechanisms in enterprise Java applications. Frameworks such as JPA and Hibernate provide built-in support, making implementation straightforward while still offering strong protection against lost updates.

Unlike pessimistic locking, optimistic locking does not prevent concurrent access. Instead, it detects whether another transaction modified the data between the time it was read and the time it was updated.

The @Version Annotation

A typical entity might look like:

@Entity
public class Account {

    @Id
    private Long id;

    private BigDecimal balance;

    @Version
    private Long version;
}

The @Version field acts as a concurrency token. Every successful update increments the version number automatically.

When Hibernate generates update statements, it includes the current version value in the WHERE clause. This ensures that updates only succeed if the record has not been modified since it was originally read.

This mechanism allows multiple users to read the same data concurrently while still preventing silent overwrites.

What Actually Happens

Suppose two users load the same entity.

Both receive:
Version = 10

User A updates first.

The version becomes:
Version = 11

User B attempts an update.

Hibernate generates an update statement similar to:

UPDATE account SET balance = ?, version = 11
               WHERE id = ? AND version = 10

Because the row now contains version 11 instead of version 10, the WHERE condition no longer matches.

As a result, no rows are updated.

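Hibernate detects this condition and throws OptimisticLockException, which Spring applications typically see translated to ObjectOptimisticLockingFailureException. In either form, the exception indicates that another transaction modified the entity after it was originally loaded.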

Rather than silently overwriting data, the application is forced to acknowledge and handle the conflict.

Handling Optimistic Lock Failures

Adding the @Version annotation is only the first step.

The more important challenge is deciding how the application should respond when conflicts occur.

Possible strategies include:

retry automatically
reject the operation
reload and merge
notify the user

The appropriate choice depends heavily on business requirements.

For example, inventory and reservation systems often retry automatically because conflicts are expected and transient. Collaborative editing systems may present users with merge options. Financial applications frequently reload the latest state and re-validate business rules before attempting another update.

Optimistic locking provides conflict detection. It does not provide conflict resolution. Designing an effective resolution strategy is a critical part of building reliable systems.

7. Locking in NoSQL Databases

A common misconception is that NoSQL databases eliminate concurrency concerns.

In reality, concurrent modification problems still exist. The difference lies in how databases expose consistency guarantees and concurrency control mechanisms.

Most NoSQL platforms provide some form of optimistic concurrency control rather than traditional row-level locking.

MongoDB

MongoDB provides atomic operations at the document level. Updates to a single document are isolated and executed atomically.

For concurrency control, many applications implement version-based optimistic locking.

Example:

db.orders.updateOne(
  {
     _id: 1,
     version: 5
  },
  {
     $set: {
        status: "SHIPPED"
     },
     $inc: {
        version: 1
     }
  }
)

The update succeeds only if the document still contains version 5.

If another process updates the document first, the query condition no longer matches:

Matched Documents = 0

The application can then detect the conflict and decide whether to retry or reject the operation.

Conceptually, this is very similar to optimistic locking in relational databases.

Redis

Redis is often viewed as a simple in-memory cache, but it is also frequently used as a primary data store, coordination mechanism, and distributed locking platform.

Because Redis executes commands sequentially within a single-threaded event loop, individual commands are atomic. However, concurrency challenges still arise when multiple clients perform read-modify-write operations.

One approach is to use optimistic concurrency control through the WATCH command.

Example:

WATCH account:100
GET account:100

The client reads the value and prepares an update.

When the transaction executes:

MULTI
SET account:100 900
EXEC

Redis verifies that the watched key has not changed since it was read.

If another client modifies the key before EXEC, the transaction is aborted: EXEC returns a null reply and none of the queued commands are applied.

The application can then retry using the latest value.

Redis is also widely used for distributed locking through commands such as:

SET resource-lock unique-id NX PX 30000

This creates a lock only if the key does not already exist and automatically expires it after a specified timeout.

While distributed locks can co-ordinate access across multiple application instances, they should be used carefully. Improper lock expiration settings, network partitions, and process failures can introduce subtle consistency issues.

For this reason, many Redis-based systems prefer optimistic concurrency patterns or idempotent operations whenever possible, reserving distributed locks for workflows that truly require exclusive access.

DynamoDB

DynamoDB provides optimistic concurrency control through conditional writes.

A write operation can specify a condition that must evaluate to true before the update is applied.

The following example performs an UpdateItem operation. It tries to reduce the Price of a product by 75—but the condition expression prevents the update if the current Price is less than or equal to 500.

aws dynamodb update-item \
    --table-name ProductCatalog \
    --key '{"Id": {"N": "456"}}' \
    --update-expression "SET Price = Price - 75" \
    --condition-expression "Price > 500"

If the starting Price is 650, the UpdateItem operation reduces the Price to 575. If you run the UpdateItem operation again, the Price is reduced to 500. If you run it a third time, the condition expression evaluates to false, and the update fails.

This approach allows DynamoDB to maintain high scalability while still preventing lost updates. Because conditional writes are implemented directly by the storage engine, applications can enforce concurrency guarantees without introducing explicit locking mechanisms.

Many large-scale AWS systems rely heavily on this pattern.

8. Distributed Systems Change Everything

Many engineers discover an uncomfortable reality when transitioning from monolithic applications to microservices:

Database locking does not extend beyond a single database.

Traditional locking mechanisms work extremely well within a single transactional boundary. Once data and business processes span multiple services, those guarantees disappear.

Locks Cannot Cross Services

Consider:

Order Service
      |
Database A

and

Inventory Service
      |
Database B

A lock acquired in Database A has no effect on Database B.

Even if both services participate in the same business workflow, neither database has visibility into the other's locks or transactions.

As a result, traditional database locking cannot guarantee consistency across service boundaries.

This limitation fundamentally changes how distributed systems are designed.

Why SAGAs Exist

Microservices frequently execute workflows that span multiple services and databases.

Example:

Create Order
      │
      ▼
Reserve Inventory
      │
      ▼
Process Payment
      │
      ▼
Create Shipment

No single ACID transaction can encompass the entire workflow.

Instead, systems rely on:

compensating transactions
retries
eventual consistency

This is the problem Saga patterns address.

Rather than locking resources across services, Sagas coordinate a sequence of local transactions and define recovery actions when failures occur.

The goal is not immediate consistency but reliable business outcomes despite partial failures.
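The coordination logic can be sketched as a list of steps, each paired with a compensating action. This is a minimal in-memory illustration, assuming synchronous steps that report success or failure. The Step interface and the step helper are hypothetical, and a real Saga would also need durable state and retries.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal in-memory Saga coordinator (hypothetical types, no persistence or messaging).
final class SagaSketch {

    interface Step {
        boolean execute();    // a local transaction; false means it failed
        void compensate();    // undoes a step that previously succeeded
    }

    /** Runs steps in order; on failure, compensates the completed steps in reverse order. */
    static boolean run(List<Step> steps) {
        Deque<Step> completed = new ArrayDeque<>();
        for (Step step : steps) {
            if (step.execute()) {
                completed.push(step);
            } else {
                while (!completed.isEmpty()) {
                    completed.pop().compensate();
                }
                return false;
            }
        }
        return true;
    }

    /** Hypothetical helper: a step that records what happened in a log. */
    static Step step(String name, boolean succeeds, List<String> log) {
        return new Step() {
            @Override public boolean execute() { log.add("run " + name); return succeeds; }
            @Override public void compensate() { log.add("undo " + name); }
        };
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        boolean ok = run(List.of(
                step("Create Order", true, log),
                step("Reserve Inventory", true, log),
                step("Process Payment", false, log),
                step("Create Shipment", true, log)));
        System.out.println("completed=" + ok);
        log.forEach(System.out::println);
    }
}
```

When payment fails, the inventory reservation and the order are undone in reverse order, and the shipment step never runs.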

Why Outbox Doesn't Require Locks

The Transactional Outbox pattern solves a different challenge.

It guarantees Database Commit + Event Publication without requiring distributed transactions.

The application writes both business data and an outbound event record within the same local transaction. A separate process later publishes the event.

This approach relies on transactional guarantees within a single database, not on pessimistic locking.

Understanding this distinction is important because many distributed systems problems are fundamentally reliability and coordination problems rather than concurrency-control problems.

Idempotency Beats Locking

Many distributed systems avoid locking altogether.

Instead, they make operations idempotent, meaning the same operation can be executed multiple times without changing the final outcome.

Example:

Process Payment Event

The consumer records Payment Already Processed and ignores duplicates.

This strategy allows systems to safely retry operations without introducing global locks or distributed coordination.
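A duplicate-ignoring consumer can be sketched as follows. This is an in-memory illustration with a hypothetical class and event id format. In a real system the record of processed event ids would need to be stored durably, ideally in the same transaction as the side effect.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// In-memory idempotent consumer (hypothetical class; processed ids are not persisted).
final class IdempotentPaymentConsumer {
    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();
    private final AtomicLong totalCharged = new AtomicLong();

    /** Returns true if the event was applied, false if it was a duplicate delivery. */
    boolean handle(String eventId, long amount) {
        // add() is atomic: only the first delivery of a given id gets past this check.
        if (!processedEventIds.add(eventId)) {
            return false;
        }
        totalCharged.addAndGet(amount);
        return true;
    }

    long totalCharged() { return totalCharged.get(); }

    public static void main(String[] args) {
        IdempotentPaymentConsumer consumer = new IdempotentPaymentConsumer();
        System.out.println("first delivery applied: " + consumer.handle("payment-1", 100));
        System.out.println("redelivery applied: " + consumer.handle("payment-1", 100));
        System.out.println("total charged: " + consumer.totalCharged());
    }
}
```

Because a redelivered event has no further effect, the producer and the broker are free to retry without any cross-service lock.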

Modern event-driven architectures frequently prefer:

retries
idempotency
eventual consistency

over distributed locking because these approaches scale more effectively and remain resilient during failures.

9. Choosing Between Optimistic and Pessimistic Locking

Neither optimistic nor pessimistic locking is universally superior.

The correct choice depends on workload characteristics, contention frequency, consistency requirements, and performance goals.

Understanding how often conflicts occur is usually more important than understanding the locking mechanism itself.

Choose Pessimistic Locking When

Pessimistic locking is most appropriate when conflicts are common and the cost of inconsistency is high.

Scenarios like:

seat reservation systems
inventory allocation
financial transactions
account balance updates

In these scenarios, allowing concurrent modifications may create unacceptable business outcomes. Waiting for access is preferable to resolving conflicts after they occur.

Correctness takes priority over throughput.

Choose Optimistic Locking When

Optimistic locking works best when conflicts are relatively rare.

Scenarios like:

customer profiles
product catalogs
employee records
content management systems

Most transactions complete successfully without interference from other users. Because contention is low, avoiding locks improves concurrency and reduces database overhead.

The occasional conflict can be handled through retries or user intervention.

Measure Contention First

Many teams choose a locking strategy based on assumptions rather than evidence.

A better approach is to measure:

lock wait time
retry rates
update conflicts
transaction latency

Production metrics reveal surprising patterns.

A workflow that appears highly contentious may rarely experience conflicts, while seemingly independent operations may compete heavily for shared resources.

Data should drive concurrency decisions whenever possible.

10. Common Mistakes Teams Make

Concurrency control is often misunderstood because systems behave correctly under low load and fail only when contention increases.

Several mistakes appear repeatedly across production systems.

Using Pessimistic Locking Everywhere

Applying pessimistic locking indiscriminately can severely limit scalability.

The application remains correct, but:

throughput decreases
latency increases
lock contention grows

As traffic increases, the database spends more time coordinating access than executing business logic.

Correctness is essential, but excessive locking can become a significant performance bottleneck.

Ignoring Retry Logic

Optimistic locking assumes conflicts will occasionally occur. Without retry mechanisms, users may experience unnecessary failures even when a simple retry would succeed immediately.

Applications should treat optimistic lock exceptions as expected outcomes rather than exceptional situations. Proper retry policies are as important as the locking strategy itself.

Long Transactions

Locks held for extended periods dramatically increase contention. Transactions should perform only the work necessary to maintain consistency.

External API calls, file processing, and lengthy computations should generally occur outside transactional boundaries whenever possible.

Short transactions reduce lock duration and improve overall system throughput.

Confusing Isolation Levels with Locking

Many developers assume Serializable automatically solves every concurrency problem.

In reality, isolation levels define visibility rules between transactions, while locking strategies define how concurrent modifications are co-ordinated.

Both influence consistency.
Neither replaces the other.

Understanding the distinction is critical when diagnosing concurrency issues.

11. Final Thoughts

Concurrency control is fundamentally the discipline of managing contention while preserving correctness.

Optimistic and pessimistic locking approach this challenge from different perspectives.

The correct choice depends on:

contention patterns
consistency requirements
throughput goals
operational behavior

Many production systems use both approaches simultaneously. Critical workflows may require strict exclusivity, while less contentious operations benefit from maximum concurrency.

The most effective engineers understand the trade-offs behind each strategy and apply them deliberately based on business requirements and real-world traffic patterns. Concurrency problems rarely appear when systems are idle; they appear when traffic grows, users increase, and contention finally arrives.

AI assistance was used to generate the charts and diagrams.

Build vs Buy: The Expensive Engineering Decision Less Talked About

Venkatesan Ramar — Mon, 15 Jun 2026 08:59:00 +0000

Back in 2015, I joined a product company whose platform had been evolving since the late '90s. Coming from a startup background, I was overwhelmed by the number of in-house tools and platforms that existed alongside the core product.

Over time, and after leaving the organization in 2023, I began to appreciate the trade-offs behind those build decisions. Some became strategic assets, while others introduced years of ownership and maintenance overhead.

This article shares some of the lessons I learned about one of the most important engineering decisions teams make: build or buy.

Over the years, I've come to believe that some of the most expensive engineering mistakes have very little to do with technology itself.

They start with a much simpler question:

Should we build this ourselves?
Or should we buy it?

At first glance, the answer often feels obvious. A team identifies a need.
Maybe it's:

authentication,
workflow orchestration,
internal developer portals,
database migration tools, or
some internal framework.

Someone says:

"We can build this in a few weeks."

Often, they're right. The first version usually isn't that difficult. The real challenge comes later. Because every build decision eventually becomes an ownership decision.

And ownership tends to last much longer than implementation.

Over the years, I've seen teams successfully build internal platforms that became strategic assets. I've also seen teams accidentally become software vendors to themselves.

The interesting question isn't whether we can build something. Modern engineering teams can build almost anything.

The more important question is:

Do we want to own it for the next five years?

1. Why This Decision Matters

A decade ago, many engineering teams had fewer choices. But today, the situation is completely different.

Almost every technical capability has mature products available.

Say, you need:

authentication?
observability?
workflow orchestration?
developer portals?

There is probably a vendor already solving that problem. That's what makes the decision difficult.

Modern engineering teams are no longer choosing between having a capability and not having one.

They're choosing between:

building it,
extending it,
buying it, or
integrating it.

The number of options has increased. At the same time, engineering capacity remains limited. Teams hit decision paralysis.

Every sprint spent building internal tooling is a sprint not spent building customer-facing capabilities.

This trade-off becomes increasingly important as organizations grow. Especially when platform investments start competing with product investments.

2. The Hidden Cost Teams Ignore

One pattern I've noticed is that teams are usually good at estimating development effort. They're much less effective at estimating ownership effort.

A discussion might sound like this:

This looks straightforward. We can probably build it in three weeks.

They're probably right. The problem is that the three-week estimate usually covers only Version 1.

It rarely includes:

upgrades,
support,
operational maintenance,
bug fixes,
security reviews,
documentation,
onboarding,
scalability improvements, and
future requirements.

Those costs appear gradually, which makes them easy to underestimate.

Building Is Easy. Owning Is Hard

Many internal systems begin life as small engineering utilities. Gradually, adoption grows. Soon other teams depend on them.

Now expectations change.

The platform suddenly needs:

uptime guarantees,
backward compatibility,
support processes, and
clear ownership.

What started as an engineering project slowly becomes a product, except now the customers are internal teams.

I've seen this happen with:

internal frameworks,
workflow engines,
authentication services, and
developer portals.

The implementation wasn't the difficult part; the long-term ownership was.

The Internal SaaS Trap

One of the most interesting things about platform engineering is that organizations sometimes become software vendors without realizing it.

Imagine a team builds an internal feature flag platform.

Version 1.0 supports simple enable/disable toggles. Seems pretty straightforward.

Then the adopting teams raise feature requests such as percentage rollouts, audit logs, experimentation, and approval workflows.

Now the platform team is effectively running a software product. Except instead of external customers, they're supporting internal engineering teams.

The complexity didn't disappear. It simply became your responsibility.
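
To make that growth in scope concrete, here is a minimal sketch of how a Version 1.0 toggle and a single follow-up request (percentage rollouts) might look side by side. `FeatureFlags` and its methods are hypothetical names invented for this example, and the hash-based bucketing is just one simple way to get a stable per-user answer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class FeatureFlags {
    private final Map<String, Boolean> toggles = new ConcurrentHashMap<>();
    private final Map<String, Integer> rolloutPercent = new ConcurrentHashMap<>();

    // Version 1.0: a plain on/off switch.
    void setEnabled(String flag, boolean enabled) {
        toggles.put(flag, enabled);
    }

    boolean isEnabled(String flag) {
        return toggles.getOrDefault(flag, false);
    }

    // A later request: enable the flag for a percentage of users.
    void setRollout(String flag, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be between 0 and 100");
        }
        rolloutPercent.put(flag, percent);
    }

    boolean isEnabledFor(String flag, String userId) {
        int percent = rolloutPercent.getOrDefault(flag, 0);
        // Stable bucket in [0, 100), so the same user always gets the same answer.
        int bucket = Math.floorMod((flag + ":" + userId).hashCode(), 100);
        return bucket < percent;
    }
}
```

Even this small step adds a second API, input validation, and a bucketing rule that callers will come to depend on. Audit logs, experimentation, and approval workflows each add more of the same.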

Opportunity Cost Is Real

This is probably the most overlooked factor in build-versus-buy discussions.

Suppose five engineers spend six months building an internal platform.

The direct cost is obvious, but another question is often ignored:

What didn't get built during those six months?

Perhaps:

product features were delayed,
customer requests remained unresolved,
roadmap commitments slipped.

These costs rarely appear in dashboards. Yet they often have a larger business impact than infrastructure costs.

3. Why Engineering Teams Choose To Build

Despite the risks, engineering teams continue to build internal solutions. There are good reasons for that. Not every build decision is a mistake.

Some become enormous competitive advantages.

The challenge is understanding why we are choosing to build in the first place.

Engineers Like Solving Problems

This shouldn't surprise anyone.

Most engineers enjoy creating systems, and building software feels productive and empowering. It provides a level of control that third-party products cannot. When requirements are unique, building can absolutely make sense.

The problem appears when technical enthusiasm replaces strategic evaluation.

Just because something is technically interesting doesn't automatically mean it should be owned long-term.

Vendor Skepticism

Many teams have legitimate concerns about vendors.

Questions such as:

What if pricing changes?
What if the company gets acquired?
What if we're locked in?
What if customization becomes difficult?

These concerns are real. Sometimes they justify building.

But I've also seen teams dramatically overestimate vendor risks while underestimating ownership risks.

Both sides deserve equal scrutiny.

The "It Looks Simple" Fallacy

Some capabilities appear deceptively simple.

Authentication is a classic example.

At first glance:

users log in,
users log out,
passwords are stored.

Simple, until requirements expand:

OAuth
SAML
MFA
SSO
compliance
account recovery
security reviews

Suddenly the original problem looks very different.

Version 1 is usually easy. Version 10 is where the complexity appears.

4. Where Companies Successfully Build

It might sound like buying is always the safer option. It isn't. Many of the most successful technology companies built substantial internal platforms.

The difference is that they usually built capabilities closely tied to their competitive advantage.

Build What Makes You Different

One repeated pattern among successful engineering organizations is that they build things that are core to their business. Not things that are merely useful.

This distinction matters.

Netflix Didn't Win By Building Authentication

Netflix became successful because of:

streaming infrastructure,
recommendation systems,
content delivery,
personalization.

Those capabilities directly influenced the business.

Investing heavily in them made strategic sense. Authentication was necessary. Recommendation systems were differentiating.

The difference is important.

Uber Didn't Buy Dispatching

Dispatching is central to Uber's business.

The way drivers and riders are matched directly affects customer experience, efficiency, and profitability.

That capability is core business logic.

Owning it provides competitive advantage. Buying it would have limited differentiation.

LinkedIn Built Kafka

Kafka began as an internal project at LinkedIn to solve large-scale event streaming challenges. At the time, existing messaging systems struggled to handle the volume, durability, and scalability requirements of LinkedIn's growing platform.

Building Kafka made sense because reliable event streaming was becoming a foundational capability for the business. What started as an internal solution eventually evolved into one of the most widely adopted distributed systems in the industry.

The lesson isn't that every company should build its own messaging platform. The lesson is that LinkedIn built a capability that directly addressed a strategic problem at its scale.

Google Built Kubernetes

Kubernetes originated from Google's experience running massive distributed systems over many years. Google had already developed internal container orchestration platforms and operational practices long before containers became mainstream.

Rather than adapting existing solutions, Google built Kubernetes based on lessons learned from operating infrastructure at enormous scale.

For most organizations, building a container orchestration platform would be a terrible investment. For Google, infrastructure management was a core competency and strategic advantage.

The takeaway is simple:

Build when the capability is closely tied to your unique scale, business model, or competitive advantage.

The Common Pattern

The most successful build decisions usually share a characteristic:

The capability directly contributes to competitive advantage.

When that's true, ownership often makes sense. When it doesn't, the equation changes dramatically.

5. The Platform Engineering Perspective

Over the last few years, platform engineering has become a major focus area for many organizations. The goal is to make developers more productive.

Platform teams typically provide:

self-service capabilities,
deployment automation,
observability,
infrastructure provisioning, and
standardized workflows.

The challenge is deciding how much of that platform should be built internally.

This is where build-versus-buy decisions become particularly interesting.

The Internal Developer Platform Dilemma

Imagine a growing engineering organization.

Developers complain about:

inconsistent environments,
deployment complexity,
onboarding difficulties.

The organization decides to build an Internal Developer Platform. The initial vision sounds reasonable: a central portal where developers can create services, access documentation, provision resources, and monitor deployments.

But soon new requirements emerge:

RBAC
audit logs
integrations
workflow automation
plugin ecosystems

Before long, the platform itself becomes a product.

The Backstage Lesson

Many organizations faced this exact challenge.

Instead of building an entire developer portal from scratch, they adopted existing platforms and customized them.

This approach is interesting because it reflects a broader engineering principle:

Buy the foundation. Build the differentiation.

The organization still owns the developer experience. It avoids spending years recreating foundational capabilities.

Build The Last 20%

One of the most useful heuristics I've encountered is this:

Buy the first 80%. Build the last 20%.

The first 80% usually consists of commodity functionality. The last 20% often contains:

business-specific workflows,
domain integrations,
unique operational requirements.

That final layer is often where competitive advantage exists. It's usually a better place to invest engineering effort.

6. A Practical Decision Framework

Over time, I've found that build-versus-buy discussions become much easier when evaluated through a consistent framework.

Rather than debating technologies, the conversation shifts toward business and ownership.

Here are some of the questions that help make the decision.

Question 1: Does This Differentiate The Business?

This is one of the most important questions.

If the capability disappeared tomorrow, would customers notice?
Would it impact the company's competitive position?

For example:

Building recommendation engines, pricing algorithms, matching engines, or domain-specific workflows often creates differentiation.

The closer a capability is to competitive advantage, the stronger the case for building.

Question 2: Do We Want To Own This In Three Years?

Most build decisions focus on implementation.
Few focus on ownership.

A better question is:

Will we still want to maintain this three years from now?

Ownership includes:

upgrades,
security,
operational support,
bug fixes,
documentation,
compliance,
training.

If the answer feels uncomfortable, that is valuable information.

Question 3: Can We Support It Operationally?

Every system eventually enters production and production changes everything.

A build decision also means committing to on-call support, incident response, monitoring, maintenance, and disaster recovery.

The engineering effort doesn't end when the code is deployed. In many cases, that's where the real work begins.

Question 4: Is The Market Mature?

Sometimes buying is difficult because the market is immature. The available products may not solve the problem adequately.

But in mature categories such as observability, authentication, feature management, and workflow orchestration, vendors have often spent years refining their solutions.

Ignoring that accumulated expertise can be expensive.

Question 5: What Is The Opportunity Cost?

This question is frequently overlooked.

Suppose six engineers spend six months building an internal capability.

The direct cost is obvious, but the opportunity cost is harder to measure.

What customer-facing work was delayed?
What revenue-generating features were postponed?
What strategic initiatives slowed down?

Sometimes the most expensive cost is the one that never appears in a budget report.

7. The Hybrid Model Usually Wins

One thing I've noticed is that engineering organizations rarely choose a pure build or pure buy strategy.

Instead, they combine both.

Buy The Foundation

Commodity capabilities are often purchased or adopted. These tools solve common problems that many organizations face.

Rebuilding them rarely creates differentiation.

Build Business-Specific Layers

The organization's engineering effort is then focused on:

business workflows,
domain models,
operational processes,
customer-facing capabilities,
proprietary integrations.

This is where engineering investment usually generates the highest return.

Why This Works

The hybrid model captures the advantages of both approaches.

Organizations avoid:

rebuilding mature capabilities,
unnecessary ownership burden,
platform reinvention.

At the same time, they retain flexibility where it matters most.

8. Common Mistakes Teams Make

Most build-versus-buy failures follow surprisingly similar patterns. The technology might change, but the mistakes rarely do.

Underestimating Maintenance

Teams usually estimate the initial build cost while forgetting support, upgrades, security, and operations. Over a multi-year horizon, the cost of ownership often exceeds the cost of implementation.

Rebuilding Commodity Software

This is probably the most common mistake.

Engineering teams are highly capable. Given enough time, they can rebuild almost anything. The question is whether they should.

Optimizing For Engineering Preference

This is a subtle trap.

Engineers naturally enjoy building systems. But engineering satisfaction and business value are not always aligned.

A technically elegant solution can still be a poor investment. The best technical decision is not always the best business decision.

Assuming Vendors Never Improve

Many build decisions are based on current vendor limitations, but products evolve. Markets mature gradually and capabilities improve. A solution that looked inadequate two years ago may look very different today.

Periodic re-evaluation is important.

9. AI Is Changing The Economics

It's impossible to discuss build-versus-buy decisions today without mentioning AI. AI-assisted development has significantly reduced implementation effort. Many teams can now prototype internal tools faster than ever.

Capabilities that once required months of development can sometimes be assembled in days.

Building Is Cheaper Than Before

AI helps with:

scaffolding
code generation
testing
documentation
integration

The barrier to building has unquestionably decreased.

Ownership Has Not Become Cheaper

This is the important distinction.

AI can help create software. It does not eliminate:

operational support
on-call responsibility
compliance
security
upgrades
platform ownership

The cost of creation is decreasing. The cost of ownership remains surprisingly stable. That means ownership becomes even more important in future build-versus-buy discussions.

10. Final Thoughts

One of the biggest lessons I've learned is that build-versus-buy decisions are rarely technology decisions.

They're ownership decisions.

Modern engineering teams can build almost anything. Open-source ecosystems are thriving. Cloud platforms provide powerful building blocks.

AI accelerates development even further.

The question is no longer: Can we build it?
The more important question is: Do we want to own it?

Because every build decision creates a long-term commitment. A commitment to maintenance, operations, support, upgrades, and continuous evolution.

Sometimes that commitment is absolutely worth making, especially when the capability creates competitive advantage. Other times, the smarter decision is to leverage what already exists and focus engineering effort where it matters most.

In the end, the most successful engineering organizations aren't the ones that build everything. They're the ones that understand what is truly worth owning.

ChatGPT assisted with rephrasing.

Project Loom and Reactive Programming: Competing or Complementary?

Venkatesan Ramar — Mon, 08 Jun 2026 10:43:28 +0000

For almost a decade, Reactive Programming was one of the primary answers to a common scalability problem in Java applications:

How do we handle thousands of concurrent requests without creating thousands of threads?

Frameworks like Spring WebFlux, Reactor, and Netty gained popularity because they offered a way to build highly scalable applications using non-blocking I/O and event-driven execution models.

Then Project Loom arrived. Suddenly Java developers could create millions of lightweight virtual threads while continuing to write familiar synchronous code.

A new debate started.

Is Reactive Programming dead?
Do Virtual Threads make WebFlux obsolete?
Should every Spring application move back to blocking code?

Depending on who you ask, the answer ranges from "absolutely" to "not even close."

As with many engineering debates, the reality is more nuanced than the headlines suggest, and it sits somewhere in the middle.

Project Loom and Reactive Programming solve similar scalability challenges, but they do so using fundamentally different concurrency models.

1. Why This Comparison Matters

To understand why Loom generated so much excitement, we need to revisit a problem Java developers have been dealing with for years.

Traditionally, backend applications followed a simple model:

One request.
One thread.
One execution flow.

This model is easy to understand.

It maps naturally to how developers think. The problem appears when systems scale.

The Cost of Waiting

Most backend applications are not CPU-bound; they're I/O-bound. A request spends most of its lifetime waiting for something like:

database queries
HTTP calls
cache lookups
message brokers
file systems

Consider a service that processes an order.

Order order = repository.findById(id);

Customer customer = customerService.fetch(order.getCustomerId());

Inventory inventory = inventoryService.check(order.getProductId());

The CPU does very little work. Most of the time, the thread simply waits. While waiting, that thread still consumes memory and scheduling resources.

Multiply this by thousands of concurrent requests and the traditional model begins to show its limitations.

This is the problem both Reactive Programming and Virtual Threads attempt to solve.

2. Reactive Programming: Solving Scalability Through Non-Blocking I/O

Reactive Programming emerged as a response to thread inefficiency. Instead of allocating one thread per request, applications could use a small number of threads and process requests asynchronously.

The Core Idea

Instead of blocking:

Order order = repository.findById(id);

The operation returns immediately. Processing continues once data becomes available.

In Reactor/WebFlux, the same flow may look like:

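// Assumes each call returns a Mono; the nested flatMap keeps `order` in scope.
Mono<Inventory> inventory = repository.findById(id)
    .flatMap(order -> customerService.fetch(order.getCustomerId())
        .flatMap(customer -> inventoryService.check(order.getProductId())));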

Rather than waiting, execution becomes event-driven. The framework orchestrates continuations behind the scenes.
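
The continuation idea can be tried without any framework. The sketch below deliberately uses the JDK's `CompletableFuture` rather than Reactor, so it runs with no extra dependencies; `thenCompose` plays a role similar to `flatMap`. The record types, lookup methods, and sample values are hypothetical stand-ins for real non-blocking calls.

```java
import java.util.concurrent.CompletableFuture;

class ContinuationSketch {
    record Order(String id, String customerId, String productId) {}
    record Customer(String id) {}
    record Inventory(String productId, int available) {}

    // Hypothetical async lookups; real ones would perform non-blocking I/O.
    static CompletableFuture<Order> findOrder(String id) {
        return CompletableFuture.supplyAsync(() -> new Order(id, "CUS-501", "PRD-7"));
    }

    static CompletableFuture<Customer> fetchCustomer(String customerId) {
        return CompletableFuture.supplyAsync(() -> new Customer(customerId));
    }

    static CompletableFuture<Inventory> checkInventory(String productId) {
        return CompletableFuture.supplyAsync(() -> new Inventory(productId, 3));
    }

    // Each step is registered as a continuation of the previous one;
    // no thread is parked waiting between the steps.
    static CompletableFuture<Inventory> process(String orderId) {
        return findOrder(orderId)
            .thenCompose(order -> fetchCustomer(order.customerId())
                .thenCompose(customer -> checkInventory(order.productId())));
    }
}
```

Calling `ContinuationSketch.process("ORD-1")` returns immediately with a future; the caller decides whether to attach further continuations or to wait with `join()`.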

Why Reactive Became Popular

Reactive systems offered significant advantages.

A relatively small thread pool could handle thousands of requests, WebSocket connections, streaming workloads, or event processing pipelines. This made Reactive particularly attractive for API gateways, streaming platforms, notification systems, and real-time event processing.

At a time when traditional thread-per-request models struggled under high concurrency, Reactive felt revolutionary.

The Trade-off

The scalability gains came with a cost.

The programming model changed.
Developers needed to think differently.

Simple sequential logic became:

Mono<Order>
    .flatMap(...)
    .flatMap(...)
    .map(...)

Error handling changed.
Debugging changed.
Context propagation changed.

The application became more scalable, but it also became more complex.

For many teams, this complexity was a worthwhile trade-off. For others, it became a significant source of maintenance overhead.

3. Project Loom: Solving Scalability Through Lightweight Threads

Project Loom takes a very different approach.
Instead of changing the programming model, it changes the threading model.

The Core Idea

With Virtual Threads, developers can continue writing familiar blocking code:

Order order = repository.findById(id);

Customer customer = customerService.fetch(order.getCustomerId());

Inventory inventory = inventoryService.check(order.getProductId());

The code looks synchronous. The difference is what happens underneath.

When a Virtual Thread encounters a blocking operation, the JVM can suspend it and release the underlying carrier thread to do other work.

Once the operation completes, execution resumes. The developer sees blocking code. The JVM sees efficient scheduling.
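
A minimal sketch of this behavior, assuming Java 21 or later: each simulated request runs on its own virtual thread and blocks in plain sequential style. `VirtualThreadSketch`, `blockingLookup`, and `handleAll` are hypothetical names, and `Thread.sleep` stands in for a real database or HTTP call.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class VirtualThreadSketch {
    // Stand-in for a blocking call such as a JDBC query or an HTTP request.
    static String blockingLookup(int id) throws InterruptedException {
        Thread.sleep(Duration.ofMillis(20));
        return "order-" + id;
    }

    // Runs each "request" on its own virtual thread, written as plain blocking code.
    static List<String> handleAll(int requests) {
        List<String> results = new ArrayList<>();
        try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                final int id = i;
                Callable<String> task = () -> blockingLookup(id);
                futures.add(executor.submit(task));
            }
            for (Future<String> future : futures) {
                results.add(future.get()); // wait for that request's result
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("lookup failed", e);
        }
        return results;
    }
}
```

Calling `VirtualThreadSketch.handleAll(1000)` starts a thousand virtual threads, each of which sleeps independently. The same loop on a small fixed pool of platform threads would process the sleeps in batches.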

Why This Feels Different

For many Java developers, Virtual Threads feel almost too good to be true. The application remains readable, debuggable, and familiar.

The mental model barely changes.

Developers don't need to learn reactive chains, event loops, or callback orchestration.

They simply write code as they always have.
This dramatically lowers adoption barriers.

What Virtual Threads Optimize For

Reactive Programming primarily optimizes for:

resource efficiency

Virtual Threads optimize for:

simplicity
readability
developer productivity

That distinction becomes important when evaluating trade-offs.

4. Concurrency Models: The Real Difference

The most important difference between Reactive and Loom is not performance; it's the concurrency model.

Reactive Model

Reactive systems typically follow an event-driven approach.

A small number of threads handle many requests. Execution is coordinated through events and continuations.

Developers explicitly model asynchronous behavior.

Virtual Thread Model

Virtual Threads retain the traditional request-processing model.

The application behaves synchronously. The JVM manages scalability behind the scenes.

This is arguably Loom's biggest innovation.

Why This Matters

One insightful way to think about the difference is:

Reactive changes the programming model. Virtual Threads preserve the programming model.

That's why Loom generated so much excitement. It promises scalability improvements without forcing developers to fundamentally rethink application flow.

5. Performance: The Nuanced Reality

Performance discussions around Loom and Reactive often become oversimplified. The reality is much more nuanced.

Throughput

Both approaches can support extremely high concurrency.

For many business applications, the difference is unlikely to be the primary bottleneck. Databases, external APIs, and network latency often dominate system performance.

This means many applications will see similar throughput characteristics regardless of whether they choose Virtual Threads or Reactive.

Latency

Latency depends heavily on workload characteristics.
In some scenarios:

Reactive systems may exhibit lower overhead.
Virtual Threads may provide simpler execution paths.

In practice, the differences are often small.

Memory Consumption

Traditional platform threads are expensive. Reactive applications gained popularity partly because they avoided creating large numbers of threads. Virtual Threads significantly reduce thread costs.

This narrows one of the biggest historical advantages Reactive enjoyed.

However, "lighter than platform threads" does not mean "free." Millions of Virtual Threads still require memory and scheduling resources.

Architectural decisions should remain grounded in actual workload measurements.

CPU-Bound Workloads

This is a misconception worth addressing. Neither Virtual Threads nor Reactive Programming magically improve CPU-bound workloads.

If your bottleneck is CPU-intensive computation, such as image processing, encryption, machine learning, or large aggregations, switching concurrency models won't suddenly create more CPU capacity.

Both approaches primarily help systems spend less time wasting resources while waiting. Most backend systems spend far more time waiting than computing.

6. Operational Complexity: Where The Real Costs Appear

One thing I've learned over the years is that architecture decisions are rarely won or lost in benchmarks.

They're usually won or lost during:

debugging,
production incidents,
onboarding,
maintenance, and
operational support.

This is where the discussion becomes interesting.

Reactive Complexity

Reactive systems introduce a different way of thinking.

Developers don't simply write code. They compose asynchronous execution flows. A simple business workflow may involve:

Mono.just(order)
    .flatMap(this::validate)
    .flatMap(this::reserveInventory)
    .flatMap(this::processPayment)
    .flatMap(this::createShipment);

Once teams become comfortable with Reactive, this style can be extremely powerful, but the learning curve is real.

New engineers often struggle with:

asynchronous flow composition,
reactive operators,
scheduler behavior,
error propagation,
context management.

Some teams adopted Reactive primarily because it was considered the "modern" approach, only to discover that most developers spent more time understanding Reactor operators than solving business problems.

That's not necessarily a flaw in Reactive.

It's simply part of the cost.

Debugging Reactive Systems

Debugging is another area where opinions often diverge.

Traditional stack traces tell a story: you can follow the execution path from top to bottom. Reactive systems are different.

Execution may jump across:

operators,
schedulers,
asynchronous boundaries,
event loops.

Modern tooling has improved dramatically, but debugging reactive flows can still be more challenging than debugging traditional synchronous code.

This is especially noticeable during production incidents.

Virtual Thread Complexity

Virtual Threads simplify application code considerably. But they are not entirely free from operational considerations.

One concept that frequently appears in Loom discussions is:

Thread pinning.

Pinning occurs when a Virtual Thread cannot be unmounted from its carrier thread during a blocking operation, for example inside certain synchronized blocks, during native calls, or in some legacy libraries. When this happens, scalability benefits can diminish.

Most applications won't encounter severe issues immediately. But teams should understand that Virtual Threads are not magic. They're still subject to JVM and application-level constraints.
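
One commonly suggested mitigation on JDK 21 is to guard sections that may block with `java.util.concurrent.locks.ReentrantLock` instead of `synchronized`, as described in JEP 444. The sketch below shows the shape of that change; `InventoryReservations` is a hypothetical class, and later JDK releases have relaxed the `synchronized` restriction, so check the behavior of the JDK you actually run.

```java
import java.util.concurrent.locks.ReentrantLock;

class InventoryReservations {
    private final ReentrantLock lock = new ReentrantLock();
    private int reserved;

    // Guards the critical section with a ReentrantLock instead of synchronized,
    // so a virtual thread that has to wait here can be unmounted from its carrier.
    void reserve(int quantity) {
        lock.lock();
        try {
            reserved += quantity; // a real implementation might block on I/O here
        } finally {
            lock.unlock();
        }
    }

    int reserved() {
        lock.lock();
        try {
            return reserved;
        } finally {
            lock.unlock();
        }
    }
}
```

The locking semantics are the same as with `synchronized`; only the mechanism changes. Short, purely in-memory critical sections are usually not worth converting.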

Observability Still Matters

Whether using Reactive, Virtual Threads, or traditional threads, observability remains critical. You still need visibility into request latency, thread utilization, blocking operations, queue buildup, and resource contention.

Concurrency models change implementation details. They don't eliminate the need for operational discipline.

7. Database and I/O Reality

This is where the conversation often becomes practical. Because eventually every backend service talks to something, usually a database.

The JDBC Question

For years, one of the strongest arguments for Reactive was that traditional blocking JDBC connections limited scalability.

A typical request looked like:

Order order = repository.findById(id);

The thread blocks.
The database responds.
Execution continues.

Reactive systems addressed this by introducing non-blocking database drivers. This led to technologies like R2DBC, Reactive MongoDB drivers, and Reactive Redis clients.

The entire stack became asynchronous.

What Loom Changes

With Virtual Threads, blocking becomes much less expensive.

The code remains:

Order order = repository.findById(id);

But the JVM can suspend the Virtual Thread while waiting.

For many applications, this removes a major motivation for adopting Reactive purely for scalability reasons. In particular, existing Spring MVC applications, JDBC repositories, and synchronous libraries can often scale significantly better with minimal code changes.

That's a compelling proposition.

Does Loom Eliminate The Need For Reactive Drivers?

In short, not entirely. This is where discussions often become overly simplistic.

Virtual Threads make blocking I/O more efficient.

But Reactive drivers still provide advantages in scenarios like streaming workloads, large-scale event processing, explicit backpressure management, and high-throughput data pipelines.

The answer isn't:

Reactive is obsolete.

The answer is:

The justification for Reactive has become more workload-dependent.

That's a healthy evolution.

8. Where Reactive Still Shines

The rise of Loom has led some people to predict the end of Reactive Programming, but that's not what we're going to see.

Reactive still solves certain problems extremely well.

Streaming Systems

Reactive was built around streams. That suits use cases such as:

live event feeds,
telemetry pipelines,
log aggregation,
market data feeds.

A stream of events maps naturally to: Flux<Event>
This remains one of Reactive's strongest use cases.

Backpressure-Sensitive Workloads

Backpressure is a first-class concept in Reactive systems.

It allows consumers to signal: Slow down. I can't keep up.

This matters when producers generate events rapidly, consumers process them more slowly, and resource exhaustion becomes a concern.

Virtual Threads don't inherently solve backpressure.

Reactive systems still have an advantage here.
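
To show what "signalling demand" means in code, here is a minimal sketch using the JDK's built-in `java.util.concurrent.Flow` API (JDK 9+) rather than Reactor. The class names are hypothetical. The subscriber asks for one item at a time, and the publisher's small buffer makes `submit` block when the subscriber falls behind.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

class OneAtATimeSubscriber implements Flow.Subscriber<Integer> {
    final List<Integer> received = new CopyOnWriteArrayList<>();
    final CountDownLatch done = new CountDownLatch(1);
    private Flow.Subscription subscription;

    @Override
    public void onSubscribe(Flow.Subscription subscription) {
        this.subscription = subscription;
        subscription.request(1); // initial demand: exactly one item
    }

    @Override
    public void onNext(Integer item) {
        received.add(item);
        subscription.request(1); // ask for the next item only when ready
    }

    @Override
    public void onError(Throwable error) {
        done.countDown();
    }

    @Override
    public void onComplete() {
        done.countDown();
    }
}

class BackpressureSketch {

    // Publishes `count` integers and returns what the subscriber received.
    static List<Integer> run(int count) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        OneAtATimeSubscriber subscriber = new OneAtATimeSubscriber();
        // Small buffer: submit() blocks the producer while the buffer is full.
        try (SubmissionPublisher<Integer> publisher = new SubmissionPublisher<>(executor, 4)) {
            publisher.subscribe(subscriber);
            for (int i = 0; i < count; i++) {
                publisher.submit(i);
            }
        }
        try {
            if (!subscriber.done.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for completion");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return subscriber.received;
    }

    public static void main(String[] args) {
        System.out.println(run(100).size() + " items received");
    }
}
```

With virtual threads alone, an equivalent limit has to be built by hand, for example with a bounded queue or a semaphore.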

WebSockets and Real-Time Systems

Applications maintaining thousands of WebSocket connections,
continuous event streams or real-time notifications often fit naturally into Reactive architectures.

The programming model aligns well with the workload.

Event Processing Platforms

Systems built around Kafka consumers, event pipelines, and/or stream processing may continue benefiting from Reactive approaches because events are already flowing through asynchronous streams.

The architecture and programming model are naturally aligned.

9. Where Virtual Threads Shine

If Reactive excels in streaming systems, Virtual Threads shine in traditional business applications and that's a very large category.

REST APIs

Consider a typical Spring Boot service.

A request arrives. The service:

validates input,
queries a database,
calls another service,
returns a response.

This model maps perfectly to Virtual Threads. The code remains simple.

The architecture remains familiar. The scalability characteristics improve significantly.

CRUD Applications

Many enterprise applications are still fundamentally CRUD systems.
They're business applications, not event streams or real-time data pipelines.

For these workloads, Virtual Threads often provide a compelling balance between simplicity, maintainability, and scalability.

Existing Spring MVC Systems

This may be Loom's biggest practical advantage.

Many organizations have years of Spring MVC code, JDBC repositories, and/or synchronous service layers. Moving to Reactive often requires significant architectural change. Moving to Virtual Threads may require surprisingly little.

That dramatically lowers adoption friction.

10. Common Misconceptions

Let's address a few common misconceptions:

"Virtual Threads Remove Scalability Limits"

No concurrency model removes scalability limits.

Databases still have limits.
Networks still have limits.
External services still have limits.

Virtual Threads improve resource utilization. They don't create infinite capacity.

"Reactive Solves CPU Bottlenecks"

Reactive primarily helps I/O-bound systems.

CPU-bound workloads require different optimization strategies. Changing concurrency models rarely fixes CPU shortages.

11. A Practical Decision Framework

When evaluating Loom versus Reactive, I find it useful to focus on workload characteristics rather than technology preferences.

Choose Virtual Threads When

Your application is primarily:

request-response driven
REST-based
JDBC-centric
business workflow oriented

And when:

simplicity matters,
maintainability matters,
developer productivity matters.

This describes a surprisingly large percentage of backend systems.

Choose Reactive When

Your application is heavily focused on:

event streams
WebSockets
real-time messaging
backpressure-sensitive pipelines
continuous data processing

These workloads naturally align with Reactive concepts.

Remember Team Expertise

Technology decisions are not purely technical. Team capability also matters.

A highly experienced Reactive team may be more productive with Reactive than with Loom.
A team unfamiliar with Reactive may benefit greatly from Virtual Threads.

12. So, Competing or Complementary?

After all the discussion, we arrive at the original question.

Are Project Loom and Reactive Programming competing? Or complementary?

The answer is probably both.

They compete because they address similar scalability challenges. Loom allows developers to write familiar synchronous code while benefiting from much of the scalability traditionally associated with asynchronous architectures.

Many applications that previously adopted Reactive primarily for concurrency may now find Virtual Threads to be a simpler alternative.

But they're also complementary because they excel in different domains.

Virtual Threads simplify traditional service architectures.
Reactive continues to excel in stream-oriented and event-driven workloads.

Ultimately, the most important question is no longer:

"Reactive or Virtual Threads?"

A better question is:

"What concurrency model best fits the workload we're trying to solve?"

The future is probably a mix of both and I find it perfectly reasonable.

ChatGPT assisted with generating the diagrams and with rephrasing.

Outbox Pattern Solves Publishing. Inbox Pattern Solves Processing.

Venkatesan Ramar — Sat, 30 May 2026 14:42:34 +0000

While covering the Outbox Pattern, I realized there's another side of event reliability to discuss — and that led me to write this article.

In event-driven systems, a lot of engineering discussions focus on publishing events reliably. That’s usually where the Transactional Outbox Pattern enters the conversation.

Reliable event publishing is hard.

But over time, I've noticed something in backend systems:

publishing events reliably is only half the problem.

The other half is much harder.

Processing them reliably.

Even if:

Kafka delivers the event,
RabbitMQ retries correctly, and
the Outbox Pattern guarantees publication,

real systems still face another uncomfortable reality:

duplicate processing is inevitable.

Consumers crash.
Retries happen.
Brokers re-deliver events.
Deployments interrupt processing.
Offsets commit at the wrong time.
Network failures create uncertain states.

And suddenly engineers are staring at production, wondering why:

a payment was processed twice,
inventory was deducted twice,
customers received three confirmation emails,
some workflow executed multiple times.

That's where the Inbox Pattern enters the conversation.

The Outbox Pattern solves reliable event publishing.
The Inbox Pattern solves reliable event processing.

And if you're building serious event-driven systems, you usually need both.

1. The Problem Starts With At-Least-Once Delivery

Most messaging systems don't promise exactly-once delivery; they promise at-least-once delivery. This includes Apache Kafka, RabbitMQ, and many cloud messaging platforms.

Note:
Some might think I've overlooked Kafka's Exactly-Once Semantics (EOS). By default, Kafka operates on an at-least-once model, but it is also well known for introducing Exactly-Once Semantics.

It achieves EOS using idempotent producers (where each producer is assigned a producer ID and attaches sequence numbers to the batches it sends, so the broker can detect and discard duplicates caused by retries) and a transactional API (which allows atomic writes across multiple partitions).

The Catch: It requires explicit configuration and only applies within the Kafka ecosystem (from Kafka topic to Kafka topic). Once you move data out of Kafka to an external database, you are back to managing delivery guarantees yourself.

At-least-once delivery is usually the correct trade-off.

Because systems prefer duplicate delivery over silent message loss.

That sounds reasonable until duplicate processing starts creating business problems.

A Failure Scenario

Let's say we have a payment consumer.

It receives a PaymentCompleted event.

The consumer does 3 things:

updates the database
sends confirmation email
acknowledges the message

Now imagine this sequence:

DB transaction succeeds
Service crashes before acknowledgment
Broker re-delivers event
Consumer processes again

Now:

duplicate emails get sent,
workflows execute twice,
business state becomes inconsistent.

This is one of the most common distributed systems problems in production.

And retries make it unavoidable eventually.

2. Why Idempotency Alone Is Often Not Enough

Whenever duplicate processing comes up, the usual advice is:

“Make consumers idempotent.”

It is good advice, but incomplete. In real systems, idempotency is often harder than it sounds.

Simple Idempotency Works for Simple Cases

Some operations are naturally safe.

Example:

user.setStatus(ACTIVE);

Running it twice or ten times causes no harm. But not many workflows are that simple.

Real Systems Have Side Effects

Now let's talk about flows that hurt. Consider a flow that involves:

payment processing,
inventory deduction,
shipment creation,
sending emails,
calling external APIs.

Suddenly duplicate execution becomes dangerous.

For example:

PaymentCompleted Event -> Inventory Reduced -> Email Sent

If the event processes twice:

inventory may reduce twice,
duplicate emails may send,
downstream workflows may trigger repeatedly.

Now business correctness becomes difficult.
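
The difference between a naturally safe operation and a dangerous one can be sketched in a few lines. The classes below are hypothetical and only illustrate the contrast: setting a status twice is harmless, while deducting stock twice changes the result.

```java
class IdempotencyDemo {

    enum Status { PENDING, ACTIVE }

    static class User {
        Status status = Status.PENDING;
    }

    static class Inventory {
        int stock = 10;
    }

    // Idempotent: the end state is the same however many times this runs.
    static void activate(User user) {
        user.status = Status.ACTIVE;
    }

    // Not idempotent: every extra run changes the state again.
    static void deduct(Inventory inventory, int quantity) {
        inventory.stock -= quantity;
    }

    public static void main(String[] args) {
        User user = new User();
        activate(user);
        activate(user); // duplicate delivery: no harm done

        Inventory inventory = new Inventory();
        deduct(inventory, 3);
        deduct(inventory, 3); // duplicate delivery: stock is now wrong

        System.out.println(user.status + ", stock=" + inventory.stock);
    }
}
```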

This is the problem the Inbox Pattern solves.

3. What the Inbox Pattern Actually Does

The Inbox Pattern is simple. The basic idea is:

Before processing an event, record that you've seen it.

That sounds simple, but it changes reliability significantly.

Core Flow

The flow usually looks like this:

Receive event
Check inbox table
Already processed? Ignore it
Not processed?
- process event
- store event ID in inbox table
- commit transaction

It creates de-duplication on the consumer side. Now retries become much more manageable.

Typical Inbox Flow

The detail to note here is that the business update and inbox record usually commit in the same database transaction.

Without that consistency boundary, things get weird again.

4. Why the Inbox Pattern Works

It works because it shifts duplicate handling into transactional state. Instead of relying on broker guarantees, perfect retries, or exactly-once infrastructure semantics, the application explicitly tracks processed events.

It makes processing behavior deterministic.

Example Consumer Flow

A simplified example:

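@Transactional
public void process(OrderCreatedEvent event) {

    // Skip events that were already recorded in the inbox.
    if (inboxRepository.exists(event.getEventId())) {
        return;
    }

    // Business update; runs in the same transaction as the inbox write.
    inventoryService.reserve(event);

    // Record the event ID so a re-delivery is ignored next time.
    inboxRepository.save(
        new InboxRecord(event.getEventId())
    );
}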

Now even if Kafka re-delivers, retries happen, or consumers restart, the duplicate event gets ignored safely.

This pattern becomes extremely useful in financial systems, inventory systems, Saga (choreography) workflows, CQRS projections, and external integrations.
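
One caveat with a check-then-save flow is concurrency: if the same event is delivered to two consumers at once, both could pass the existence check before either saves. A common variation is to insert the inbox record first and rely on a unique constraint on the event ID. The sketch below is an in-memory illustration of that idea, not a production implementation. `InboxConsumer` is a hypothetical name, a concurrent set stands in for the inbox table, and removing the ID on failure mimics a transaction rollback.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class InboxConsumer {

    // Stand-in for an inbox table with a unique constraint on event_id.
    private final Set<String> inbox = ConcurrentHashMap.newKeySet();

    // Stand-in for the business effect (for example, reserving inventory).
    private final AtomicInteger reservations = new AtomicInteger();

    /** Returns true if the event was processed, false if it was a duplicate. */
    boolean process(String eventId) {
        // add() is atomic: only the first caller for a given ID gets true.
        if (!inbox.add(eventId)) {
            return false;
        }
        try {
            reservations.incrementAndGet(); // the business update
            return true;
        } catch (RuntimeException e) {
            inbox.remove(eventId); // mimic a rollback so a retry can succeed
            throw e;
        }
    }

    int reservations() {
        return reservations.get();
    }

    public static void main(String[] args) {
        InboxConsumer consumer = new InboxConsumer();
        consumer.process("evt-1");
        consumer.process("evt-1"); // re-delivery is ignored
        System.out.println("reservations=" + consumer.reservations());
    }
}
```

In a real database the same effect usually comes from a primary key or unique index on the event ID, with the insert and the business update in one transaction.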

5. Inbox Pattern and Exactly-Once Myths

One misunderstood phrase in event-driven systems is "Exactly-once". You might even have come across the phrase:

“Kafka provides exactly-once processing.”

It is tempting to then assume duplicates are gone forever. Not really. Kafka can help reduce duplicate delivery scenarios. But once business workflows involve databases, external APIs, side effects, or distributed services, the problem becomes much larger.

Exactly-once delivery does not automatically become exactly-once business execution.

The Inbox Pattern acknowledges this reality. Instead of trying to eliminate duplicates globally, it focuses on:

making duplicates harmless locally.

That's usually a much more practical engineering approach.

6. Inbox + Outbox Together

Outbox and Inbox are really two halves of the same reliability story.

Outbox Solves Producer Reliability

The Outbox Pattern answers:

Did we publish the event?

If the business transaction commits, the event eventually gets published. Producer-side consistency solved.

Inbox Solves Consumer Reliability

The Inbox Pattern answers:

Did we already process this event?

If yes, ignore it. Consumer-side consistency solved.

Together They Create End-to-End Reliability

A typical flow looks like this:

This combination shows up in:

CQRS systems,
Saga workflows,
payment systems,
inventory pipelines, and
event-driven microservices.

Because reliable publishing alone is not enough. Reliable processing matters equally.

7. Inbox Pattern in Saga Workflows

The Inbox Pattern becomes important in Saga choreography systems.

In choreography-based Sagas:

services communicate entirely through events,
retries are common,
duplicate delivery eventually happens.

Example:

OrderCreated -> PaymentCompleted -> InventoryReserved -> ShippingStarted

Now imagine:

PaymentCompleted processes twice.

Without Inbox protection:

inventory may reserve twice,
shipping may trigger twice,
workflows become inconsistent.

This is why Inbox patterns are extremely valuable in distributed workflows. They reduce the risk of duplicate state transitions.

8. CQRS Projection Safety

CQRS systems also benefit heavily from Inbox-style processing.

Projection consumers often consume domain events, update read models, and rebuild de-normalized views.

Without de-duplication:

counters may inflate,
projections drift,
analytics become inaccurate.

Inbox tracking helps projections remain consistent even during replays, retries, consumer restarts, and broker re-delivery scenarios.

9. Operational Complexity

Like most distributed systems patterns, the Inbox Pattern is not free.

It comes with the cost of:

inbox tables,
de-duplication logic,
cleanup policies,
replay considerations, and
ongoing operational maintenance.

Large systems eventually need:

inbox archival,
retention strategies,
indexing optimizations, and
replay-safe workflows.

This taught me another important distributed systems lesson:

reliability patterns usually exchange simplicity for controlled consistency.

That trade-off is often worth it.

10. Common Mistakes Teams Make

I've observed a few mistakes repeatedly.

Assuming Brokers Eliminate Duplicates

Brokers don't eliminate duplicates. Retries and re-delivery still happen. Applications must still protect business correctness.

Forgetting Side Effects

Database updates are usually easier to de-duplicate. External side effects like emails, payments, webhooks, and notifications are harder.

These require careful, replay-aware design.

Treating Exactly-Once as a Business Guarantee

Infrastructure guarantees don't mean guaranteed business correctness, side-effect safety, or distributed consistency.

Ignoring Inbox Cleanup

Inbox tables grow continuously. Without cleanup, indexes become slower, queries degrade, and replay becomes expensive.

Operational maintenance is crucial.
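
A cleanup policy can be as simple as deleting inbox records older than a retention window. The sketch below shows the idea in memory. `InboxRetention` is a hypothetical name and the map stands in for an inbox table with a processed-at timestamp column.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InboxRetention {

    // Stand-in for an inbox table: event ID -> time it was processed.
    private final Map<String, Instant> processedAt = new ConcurrentHashMap<>();

    void record(String eventId, Instant when) {
        processedAt.put(eventId, when);
    }

    boolean contains(String eventId) {
        return processedAt.containsKey(eventId);
    }

    /** Deletes records older than the retention window; returns how many were removed. */
    int purgeOlderThan(Duration retention, Instant now) {
        Instant cutoff = now.minus(retention);
        int before = processedAt.size();
        processedAt.values().removeIf(time -> time.isBefore(cutoff));
        return before - processedAt.size();
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        InboxRetention inbox = new InboxRetention();
        inbox.record("old", now.minus(Duration.ofDays(10)));
        inbox.record("recent", now.minus(Duration.ofDays(1)));
        System.out.println("purged=" + inbox.purgeOlderThan(Duration.ofDays(7), now));
    }
}
```

The retention window is a design decision: it should be longer than the longest period over which the broker could still re-deliver or replay an event, otherwise a late duplicate would be treated as new.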

11. When Inbox Pattern Helps

The Inbox Pattern becomes valuable when:

duplicate processing is dangerous,
retries are common,
workflows contain side effects,
systems use at-least-once delivery, or
distributed workflows span multiple services.

Especially in:

payments,
inventory systems,
CQRS projections,
Saga choreography, and
event-driven microservices.

12. When It Might Be Overkill

Not every system needs Inbox tracking.

For simpler systems like:

internal tooling,
low-scale applications,
naturally idempotent workflows,
tightly coupled monoliths,

the added complexity may not be justified.

Like most architecture patterns, the goal is not maximum sophistication. The goal is controlled operational reliability.

13. Conclusion

One thing event-driven distributed systems teach is that:

Reliable event publishing is difficult.
Reliable event processing is even harder.

The Outbox Pattern solves:

“Did the event get published reliably?”

The Inbox Pattern solves:

“Did the event process safely despite retries and duplicates?”

Together, they form one of the most practical reliability foundations for event-driven systems. Not because they eliminate distributed systems complexity.

But because they acknowledge it honestly.

ChatGPT assisted with generating the diagram and with paraphrasing.

Why Distributed Transactions Fail and How the Outbox Pattern Helps

Venkatesan Ramar — Thu, 28 May 2026 19:34:02 +0000

While covering the Outbox Pattern in my earlier article on CQRS, I realized there was much more depth to it than I initially planned to discuss — and that led me to write this article.

Let's start with a very common example of an order management system in e-commerce:

An order gets created.
An event gets published.
Inventory updates.
Notifications get triggered.
Analytics pipelines consume events.
Downstream services react asynchronously.

At first glance, this all sounds straightforward, until systems start failing in production.

That’s usually when teams discover one of the hardest problems in distributed systems:

keeping database transactions and asynchronous events consistent.

This problem appears everywhere in microservices:

order management systems,
payment platforms,
inventory workflows,
CQRS architectures, and
event-driven systems.

And unfortunately, there is no magical distributed transaction that solves everything cleanly.

Over the years, many teams tried solving this using:

two-phase commit (2PC),
distributed XA transactions, or
tightly coupled coordination protocols.

Many large-scale systems eventually moved away from those approaches, not because they were theoretically wrong, but because they became operationally painful under real production conditions.

This is where the Transactional Outbox Pattern became extremely popular, not because it eliminates distributed systems complexity, but because it introduces a more reliable and operationally manageable consistency model.

1. The Distributed Consistency Problem

Imagine an order service where a customer places an order.

The service needs to:

save the order into the database
publish an OrderCreated event to Kafka

Simple enough.

A typical implementation might look like this:

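@Transactional
public void createOrder(Order order) {

    // Participates in the database transaction.
    orderRepository.save(order);

    // Not part of the database transaction: the broker and the
    // database can succeed or fail independently of each other.
    kafkaTemplate.send("order-events",
            new OrderCreatedEvent(order.getId()));
}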

Looks harmless.

But there’s a serious problem hidden inside this flow.

What happens if the database transaction succeeds, but Kafka publish fails?

Now the order exists, but downstream systems never receive the event.

Inventory never updates.
Notifications never send.
Analytics pipelines never see the order.

The system becomes inconsistent.

Now consider the opposite scenario.

What if the Kafka publish succeeds, but the database transaction rolls back?

Now downstream services react to an order that never actually existed.

This is the classic distributed consistency problem.

And it becomes extremely common in event-driven architectures.

2. Why Dual Writes Fail

This problem is commonly called the dual-write problem, because the application is trying to write to both the database and the message broker at the same time.

The issue is:

the database and Kafka are two different distributed systems,
with separate transaction boundaries,
separate failure modes, and
separate availability guarantees.

There is no shared atomic transaction between them.

That creates dangerous timing windows.

A Typical Failure Sequence

Consider this flow:

Database commit succeeds
Application crashes immediately
Kafka publish never happens

The event is now permanently lost.

Or this one:

Kafka publish succeeds
Database transaction rolls back

Now downstream consumers process invalid business state.
These failures are subtle.

And they usually appear only under production traffic, partial outages or broker instability.

This is why distributed consistency becomes operationally difficult very quickly.

Why Distributed Transactions Usually Fail

The natural question becomes:

“Why not use distributed transactions?”

Technically, systems like XA transactions and two-phase commit try to solve this.

But large-scale distributed systems rarely use them heavily anymore, because they introduce:

tight coupling,
coordination overhead,
blocking behavior,
availability trade-offs, and
operational fragility.

In practice, distributed locks become bottlenecks, failures become difficult to recover from, and debugging becomes extremely painful.

Many modern product engineering systems eventually favor:

retries,
idempotency, and
eventual consistency models

instead of globally coordinated distributed transactions.

This is where the Outbox Pattern becomes useful.

3. What the Outbox Pattern Actually Solves

The Outbox Pattern solves a very specific problem:

How do we guarantee that if a database transaction commits, the event will eventually be published?

That wording matters.
The pattern does not guarantee:

instant consistency,
exactly-once business processing, or
perfectly synchronized systems.

What it guarantees is:

reliable event publication after transactional success.

That’s a much more realistic distributed systems goal.

Core Idea

Instead of publishing events directly to Kafka or RabbitMQ during business processing, the application:

writes business data
writes an outbox event
commits both in the same DB transaction

Later:

a background publisher reads the outbox table
publishes events asynchronously

Now the database transaction becomes the single source of truth.

If the transaction commits, both the business state and the event record exist.

Even if the broker is temporarily unavailable, the event is not lost.

That is the core strength of the pattern.

4. Core Architecture Flow

A typical Outbox architecture looks like this:

An important detail is:

The application never directly depends on the broker during transactional writes.

That decoupling improves reliability significantly.

Example Flow

Imagine an e-commerce order service.

Inside a single transaction:

order gets stored,
outbox event gets inserted.

Example:

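@Transactional
public void createOrder(Order order) {

    orderRepository.save(order);

    // serialize(...) is a hypothetical helper that builds the event body.
    String payload = serialize(order);

    // Stored in the same transaction as the order itself.
    outboxRepository.save(
        new OutboxEvent(
            "OrderCreated",
            order.getId(),
            payload
        )
    );
}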

Now even if Kafka is unavailable:

the order still exists, and
the event is safely persisted.

A background worker can publish the event later.

This dramatically reduces synchronization failure risk.
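
The guarantee that matters here is atomicity: either both rows are written or neither is. The sketch below mimics that in memory, without a real database. `OutboxWriteSketch` and `inTransaction` are hypothetical names, and the snapshot-and-restore logic only imitates a rollback.

```java
import java.util.ArrayList;
import java.util.List;

class OutboxWriteSketch {

    record OutboxEvent(String type, String aggregateId, String payload) {}

    // Stand-ins for the orders table and the outbox table in one database.
    final List<String> orders = new ArrayList<>();
    final List<OutboxEvent> outbox = new ArrayList<>();

    // Imitates a transaction: on failure, both tables return to their prior state.
    private void inTransaction(Runnable work) {
        List<String> ordersBefore = new ArrayList<>(orders);
        List<OutboxEvent> outboxBefore = new ArrayList<>(outbox);
        try {
            work.run();
        } catch (RuntimeException e) {
            orders.clear();
            orders.addAll(ordersBefore);
            outbox.clear();
            outbox.addAll(outboxBefore);
            throw e;
        }
    }

    void createOrder(String orderId, boolean failAfterOrderInsert) {
        inTransaction(() -> {
            orders.add(orderId);
            if (failAfterOrderInsert) {
                throw new IllegalStateException("simulated failure between the two writes");
            }
            String payload = "{\"orderId\":\"" + orderId + "\"}";
            outbox.add(new OutboxEvent("OrderCreated", orderId, payload));
        });
    }

    public static void main(String[] args) {
        OutboxWriteSketch db = new OutboxWriteSketch();
        db.createOrder("o-1", false);
        try {
            db.createOrder("o-2", true);
        } catch (IllegalStateException e) {
            System.out.println("rolled back: " + e.getMessage());
        }
        System.out.println("orders=" + db.orders.size() + ", outbox=" + db.outbox.size());
    }
}
```

No broker appears anywhere in this code path, which is the point: the write succeeds or fails on the database alone.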

5. Polling Publisher vs CDC-Based Outbox

There are two common ways to publish outbox events.

Polling Publisher Model

This is the simplest approach.

A scheduled worker periodically:

queries unpublished outbox events
publishes them
marks them as processed

Typical flow:

Benefits:

simple implementation
application-controlled logic
easy to understand

But there are trade-offs:

polling latency
database pressure
scaling concerns
duplicate publish handling

Still, many production systems use this successfully.

Especially moderate-scale systems.
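
One polling cycle can be sketched as follows. This is an in-memory illustration under stated assumptions, not a production publisher: `PollingPublisherSketch` is a hypothetical name, the list stands in for the outbox table, and the `Consumer<String>` stands in for a broker client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class PollingPublisherSketch {

    static final class OutboxRow {
        final long id;
        final String payload;
        boolean published; // stand-in for a status or published_at column

        OutboxRow(long id, String payload) {
            this.id = id;
            this.payload = payload;
        }
    }

    private final List<OutboxRow> outbox = new ArrayList<>();
    private long nextId = 1;

    void append(String payload) {
        outbox.add(new OutboxRow(nextId++, payload));
    }

    /** One polling cycle: publish up to batchSize unpublished rows, oldest first. */
    int pollOnce(int batchSize, Consumer<String> broker) {
        int sent = 0;
        for (OutboxRow row : outbox) {
            if (sent == batchSize) {
                break;
            }
            if (row.published) {
                continue;
            }
            broker.accept(row.payload); // may throw if the broker is unavailable
            row.published = true;       // marked only after a successful send
            sent++;
        }
        return sent;
    }

    long backlog() {
        return outbox.stream().filter(row -> !row.published).count();
    }

    public static void main(String[] args) {
        PollingPublisherSketch publisher = new PollingPublisherSketch();
        publisher.append("OrderCreated:1");
        publisher.append("OrderCreated:2");
        int sent = publisher.pollOnce(10, System.out::println);
        System.out.println(sent + " sent, backlog " + publisher.backlog());
    }
}
```

Because a row is marked only after the send succeeds, a broker failure leaves it in the backlog for the next cycle. The flip side is that a crash between the send and the mark publishes the same event twice, which is the "duplicate publish handling" trade-off listed above.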

CDC-Based Outbox Model

Larger systems often evolve toward CDC-based (Change Data Capture) publishing.

Instead of polling manually, database transaction logs are monitored directly.

Tools like Debezium, which runs on Kafka Connect, read the MySQL binlog or the PostgreSQL WAL and stream outbox changes automatically into Kafka.

Typical flow:

This approach reduces polling overhead, application complexity, and publisher coordination logic.

Many large product engineering organizations use this architecture heavily for:

event-driven microservices,
CQRS projections,
audit pipelines, and
analytics synchronization.

But CDC introduces its own operational complexity:

infrastructure management,
schema evolution,
connector monitoring, and
replay coordination.

Like most distributed systems patterns:

complexity moves — it rarely disappears.

6. Ordering, Retries and Exactly-Once Realities

One common misconception about the Outbox Pattern is:

“It guarantees exactly-once processing.”

No, the pattern guarantees eventual event publication.

But duplicates can still happen.

For example:

publisher crashes after sending event
retry publishes again
consumers receive duplicates

This is why idempotent consumers remain critical.

Idempotency Still Matters

Consumers should always assume:

duplicate delivery is possible,
retries will happen, and
replay scenarios will eventually occur.

Typical strategies include:

event IDs,
de-duplication tables,
idempotency keys,
replay-aware consumers.

Exactly-once business processing across distributed systems is still extremely difficult.

The Outbox Pattern improves reliability. It does not magically eliminate distributed systems realities.

7. Common Failure Scenarios in Production

Things get really interesting here.

Most Outbox Pattern complexity appears operationally, not during implementation.

Publisher Crashes Mid-Batch

Imagine:

publisher sends 50 events,
crashes before marking them processed.

Now some events may publish again after restart.

Consumers must tolerate duplicates safely.

Broker Outage

If Kafka or RabbitMQ becomes unavailable:

outbox events accumulate,
publisher lag grows,
downstream systems fall behind.

Now operational visibility becomes critical.

Teams need monitoring for:

outbox backlog,
publish failures,
retry rates, and
synchronization lag.

Outbox Table Growth

This becomes a real operational issue surprisingly fast.

Large systems can generate millions of outbox rows daily.

Without cleanup strategies:

tables grow aggressively,
indexes become slower,
polling performance degrades.

Production systems usually need:

archival policies,
cleanup jobs,
retention strategies, and
partitioned tables.

This part is often underestimated.

Replay Scenarios

Eventually:

consumers fail,
projections become corrupted,
downstream systems require rebuilding.

Now replay becomes necessary.

Replay safety becomes difficult once:

side effects exist,
notifications were already sent,
external APIs were triggered.

This is why early adoption of replay-aware design matters.

8. Operational Complexity

The Outbox Pattern improves reliability by introducing controlled complexity.

That trade-off is important.

Operationally, teams now manage:

outbox tables,
publisher workers,
retry logic,
lag monitoring,
cleanup jobs,
replay tooling, and
observability pipelines.

Most problems eventually become operational systems problems, not coding problems.

This is a recurring pattern in distributed architectures.

9. Integration Architectures/Patterns

The Outbox Pattern fits naturally into several modern architectures.

Outbox + Kafka

Very common in:

event-driven microservices,
analytics pipelines,
CQRS systems, and
distributed event platforms.

Kafka provides:

scalable event streaming,
retention,
replayability, and
partition-based ordering.

The Outbox Pattern ensures events reach Kafka reliably.

Outbox + RabbitMQ

Very common in:

workflow orchestration,
transactional async processing, and
background job systems.

RabbitMQ works especially well when:

retries,
DLQs, and
delivery workflows

matter more than event retention.

Outbox + CQRS

CQRS systems frequently use Outbox patterns for:

projection synchronization,
event propagation,
read model updates, and
asynchronous consistency.

Without reliable event publication, CQRS projections become inconsistent.

The Outbox Pattern helps reduce that risk significantly.

Outbox + Saga Pattern (Choreography)

This is one of the most common real-world combinations.

In choreography-based Saga architectures, services communicate entirely through events.

There is no central orchestrator controlling the workflow.

Instead:

one service publishes an event,
another service reacts to it,
publishes another event, and
the workflow continues asynchronously.

For example:

This architecture heavily depends on reliable event propagation.

If even one event gets lost:

the Saga flow breaks,
downstream services stop reacting, and
the business workflow becomes inconsistent.

Imagine this scenario:

Order service commits the order
OrderCreated event fails to publish
Payment service never starts

Now the Saga is stuck halfway.

This is exactly why the Outbox Pattern becomes extremely important in choreography-based Sagas.

Each service can:

update its local database
store the outgoing Saga event in the outbox
publish it asynchronously and reliably

This ensures Saga state transitions are not silently lost during failures.

In practice, many event-driven microservice systems combine:

Saga choreography,
Kafka or RabbitMQ,
Outbox Pattern,
retries, and
idempotent consumers

to build resilient distributed workflows.

Without reliable event publishing, choreography-based Sagas become fragile very quickly.

10. When the Outbox Pattern Helps

The pattern works especially well in:

microservices,
event-driven systems,
CQRS architectures, and
Saga choreography workflows.

It becomes valuable whenever:

business consistency depends on reliable asynchronous event propagation.

11. When the Outbox Pattern Hurts

The pattern is not free.

It introduces:

operational overhead,
eventual consistency,
duplicate handling,
replay complexity, and
infrastructure management.

For simpler systems:

tightly coupled monoliths,
internal tools,
low-scale applications

the additional complexity may not be worth it.

Not every application needs distributed event reliability.

12. Conclusion

The hardest part of event-driven systems is rarely publishing events.

It is guaranteeing that systems remain consistent once:

failures happen,
retries occur,
brokers become unavailable, and
distributed timing problems appear in production.

The Outbox Pattern became popular because it accepts an important reality:

distributed consistency is fundamentally a failure-handling problem.

Instead of trying to eliminate failures entirely, the pattern focuses on:

reliable recovery,
eventual synchronization, and
operational resilience.

That is usually a far more practical approach in modern distributed systems.

Like most architecture patterns, the Outbox Pattern is ultimately a trade-off.

It exchanges immediate simplicity for long-term reliability and recoverability.

And in many event-driven production systems, that trade-off is absolutely worth it.

ChatGPT assisted with creating the diagrams.

In this article, I've covered one half of event reliability: the publisher side. The other half, the consumer side, will come soon.

CQRS: Where It Helps and Where It Hurts in Backend Systems

Venkatesan Ramar — Tue, 26 May 2026 08:44:28 +0000

CQRS has been one of the most talked-about architectural patterns in modern backend systems. Over the last decade, its popularity has grown alongside microservices, event-driven systems, domain-driven design, and distributed architectures in general.

And honestly, there’s a good reason for that.

As systems scale, reads and writes often start behaving very differently. Some systems become heavily read-oriented, while others require strict transactional guarantees on writes. Performance expectations also change over time. A single data model that worked perfectly in the beginning slowly starts becoming harder to optimize for every use case.

But there’s another side to the story that often gets ignored.

In production systems, CQRS also introduces:

operational complexity,
eventual consistency challenges,
synchronization issues,
debugging overhead, and
distributed failure scenarios.

This is where many architectural discussions become less theoretical and much more practical.

A lot of CQRS content online focuses heavily on command handlers, query handlers, or framework abstractions. But most of the real complexity appears later:

when systems scale,
teams grow,
failures happen, and
distributed state becomes difficult to reason about.

CQRS is not automatically a “better architecture”. It’s a tradeoff. Like most distributed systems patterns, it solves very specific problems while introducing entirely new ones.

1. Why CQRS became popular

Traditional CRUD architectures work perfectly fine for many systems. But as systems grow, read and write workloads often evolve very differently.

For example:

e-commerce platforms may receive millions of catalog reads but relatively few inventory updates
analytics dashboards may execute heavy aggregations while writes remain transactional
financial systems may require strict write validation while supporting highly optimized reporting queries

Over time, many teams realize something important:

the same data model rarely optimizes both reads and writes equally well.

This is where CQRS became attractive.

Instead of forcing a single model to solve everything, CQRS separates command responsibilities from query responsibilities. That separation allows independent scaling, optimized read models, de-normalized projections, and clearer domain boundaries.

Large-scale product engineering organizations gradually adopted similar patterns in:

recommendation systems
reporting platforms
inventory services
analytics pipelines
event-driven architectures

But many teams also copied CQRS simply because “modern architectures use it” or because it became associated with microservices and DDD trends.

That is usually where problems begin.

2. What CQRS Actually Is

CQRS stands for Command Query Responsibility Segregation. At its core, CQRS separates write operations (commands) from read operations (queries).

But the important thing is: CQRS is not simply about separate classes, APIs, or folders.

Real CQRS usually means separate models, separate optimization strategies, separate consistency concerns, and sometimes even separate storage systems.

Command Side

The command side focuses on enforcing business rules, validating state transitions, maintaining consistency, and processing writes safely.

Typical examples include:

placing orders
processing payments
updating inventory
approving workflows

This side usually prioritizes correctness, transactional integrity, and domain behavior.

Query Side

The query side focuses on fetching data efficiently, supporting high-volume reads, optimizing projections, and minimizing query complexity.

Typical examples include:

dashboards
search results
analytics views
reporting systems
product catalogs

This side usually prioritizes speed, scalability, and denormalized access patterns.
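To make the two sides concrete, here is a minimal plain-Java sketch. The names (`PlaceOrder`, `OrderWriteModel`, `OrderQueryModel`, `OrderSummaryView`) are hypothetical, and in-memory maps stand in for real storage. The point is only that the command side owns the rules while the query side owns the shape of the data.

```java
import java.util.HashMap;
import java.util.Map;

// Command: expresses the intent to change state (hypothetical shape).
record PlaceOrder(String orderId, String sku, int quantity) {}

// Read model: a flat, denormalized row shaped for one screen.
record OrderSummaryView(String orderId, String description) {}

// Command side: owns the business rules and the transactional state.
class OrderWriteModel {
    private final Map<String, Integer> stockBySku = new HashMap<>();
    private final Map<String, PlaceOrder> orders = new HashMap<>();

    void addStock(String sku, int quantity) {
        stockBySku.merge(sku, quantity, Integer::sum);
    }

    // Rejects the command if an invariant would be violated.
    void handle(PlaceOrder cmd) {
        if (cmd.quantity() <= 0) {
            throw new IllegalArgumentException("quantity must be positive");
        }
        int available = stockBySku.getOrDefault(cmd.sku(), 0);
        if (available < cmd.quantity()) {
            throw new IllegalStateException("insufficient stock for " + cmd.sku());
        }
        stockBySku.put(cmd.sku(), available - cmd.quantity());
        orders.put(cmd.orderId(), cmd);
    }

    int stockOf(String sku) {
        return stockBySku.getOrDefault(sku, 0);
    }
}

// Query side: no business rules, only lookups against precomputed views.
class OrderQueryModel {
    private final Map<String, OrderSummaryView> views = new HashMap<>();

    void upsert(OrderSummaryView view) {
        views.put(view.orderId(), view);
    }

    OrderSummaryView findById(String orderId) {
        return views.get(orderId);
    }
}
```

Notice that nothing in `OrderQueryModel` can reject a request. All validation lives on the command side, which is what lets the read side be reshaped or rebuilt freely.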

The Architectural Shift

The important shift in CQRS is not technical. It is conceptual.

CQRS separates:

consistency models,
scaling concerns, and
operational responsibilities.

That changes system behavior significantly.

And once distributed messaging enters the architecture, CQRS naturally introduces asynchronous synchronization, eventual consistency, projection rebuilding, replay mechanisms, and distributed failure scenarios.

That’s where the real engineering trade-offs begin.

3. Where CQRS Helps

CQRS becomes valuable when read and write concerns evolve differently enough that a shared model becomes a bottleneck. This happens more often in large-scale systems than in small applications.

Read-Heavy Systems

One of the strongest CQRS use cases is read-heavy workloads.

Common examples are:

e-commerce product catalogs
recommendation systems
analytics dashboards
search platforms
customer reporting systems

In many product engineering systems, writes remain relatively controlled while reads scale aggressively.

A product catalog may receive millions of search queries, filtering operations, recommendation lookups, and aggregation requests, while inventory updates happen far less frequently.

Using a single normalized transactional model for both concerns eventually becomes inefficient.

CQRS allows teams to build optimized read projections, denormalized query models, caching strategies, and independently scalable read infrastructure. This pattern appears heavily in large marketplace and streaming platforms.

Complex Domain Workflows

CQRS also helps in systems with complicated business workflows.

Examples include:

payment processing
subscription lifecycle management
insurance claim processing

These systems often contain complex validations, business invariants, state transitions, and transactional rules.

Separating command handling allows teams to isolate domain logic more clearly, while read models remain lightweight and query-optimized.

This separation becomes increasingly valuable as business complexity grows.

Event-Driven Architectures

CQRS naturally fits event-driven systems.

A typical production flow looks something like this:

A command updates transactional state
A domain event gets published
Consumers update read projections
Queries read from optimized projections

This pattern appears heavily in:

order management systems
recommendation systems
analytics architectures

Messaging systems like Apache Kafka and RabbitMQ are commonly used to synchronize projections asynchronously.

This architecture enables scalable reads, independent consumers, and flexible downstream integrations. But it also introduces distributed consistency challenges that teams eventually need to manage carefully.

Performance Isolation

Another underrated benefit of CQRS is workload isolation.

Read workloads and write workloads often behave very differently. Reporting queries may be CPU-heavy, while writes remain latency-sensitive and transactional.

CQRS allows teams to:

scale reads independently
optimize storage differently
isolate expensive queries

Some systems even use relational databases for writes and search or document stores for reads.

This flexibility becomes valuable at scale, although it also increases operational complexity.

4. Synchronization Strategies that Work

One of the most important production concerns in CQRS architectures is synchronization.

Once reads and writes become separated, teams must decide how read models stay updated and how consistency propagates across the system.

The hardest problem in CQRS is often not projection design — it is guaranteeing reliable synchronization between transactional writes and asynchronous event propagation.

Different synchronization strategies introduce different trade-offs involving:

latency,
consistency,
operational complexity,
scalability, and
failure handling.

There is no universally correct approach.

The right strategy depends heavily on:

business requirements,
consistency expectations,
traffic patterns, and
operational maturity.

Synchronous Projection Updates

In this approach, the write operation updates both:

the transactional model, and
the read model

within the same request flow.

This strategy provides:

stronger consistency,
simpler debugging, and
immediate read visibility.

It is commonly used in:

smaller CQRS systems,
modular monoliths, or
systems where stale reads are unacceptable.

However, synchronous updates reduce one of CQRS’s biggest advantages: independent scaling.

They also increase coupling between command processing, projection logic, and query infrastructure.

As systems scale, synchronous projections can become latency bottlenecks.
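A minimal sketch of the synchronous approach, assuming two in-memory maps in place of the real write store and read store. `SyncOrderService` is a hypothetical name. In a real system both updates would normally share one database transaction.

```java
import java.util.HashMap;
import java.util.Map;

class SyncOrderService {
    private final Map<String, String> writeStore = new HashMap<>(); // orderId -> status
    private final Map<String, String> readView = new HashMap<>();   // orderId -> display text

    // Both models are updated before the call returns, so a read issued
    // right after the write always sees the new state.
    void confirm(String orderId) {
        writeStore.put(orderId, "CONFIRMED");
        readView.put(orderId, "Order " + orderId + " is confirmed");
    }

    String display(String orderId) {
        return readView.get(orderId);
    }
}
```

The coupling is visible in `confirm`: any slowness or failure in the projection update is now part of the write path.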

Asynchronous Event-Driven Synchronization

This is the most common CQRS synchronization strategy in production systems.

The flow typically looks like this:

Command succeeds
Domain event gets published
Consumers process events asynchronously
Read projections update independently

This model is heavily used in e-commerce platforms, streaming systems, recommendation engines, and analytics architectures.

Benefits include:

scalability,
loose coupling,
independent consumers, and
resilient downstream integrations

But this strategy also introduces:

eventual consistency,
projection lag,
replay complexity, and
distributed failure handling.

Most large-scale CQRS systems eventually evolve toward this model because it scales operationally better than tightly coupled synchronous updates.
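The four-step flow above can be sketched in plain Java. This is only an illustration: an in-memory queue stands in for the broker, and `AsyncOrderSystem` and `OrderConfirmed` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

record OrderConfirmed(String orderId) {}

class AsyncOrderSystem {
    private final Map<String, String> writeStore = new HashMap<>();
    private final Queue<OrderConfirmed> broker = new ArrayDeque<>(); // stand-in for Kafka or RabbitMQ
    private final Map<String, String> readView = new HashMap<>();

    // Steps 1-2: the command succeeds and a domain event is published.
    void confirm(String orderId) {
        writeStore.put(orderId, "CONFIRMED");
        broker.add(new OrderConfirmed(orderId));
    }

    // Steps 3-4: a consumer applies pending events to the projection later.
    int drainEvents() {
        int applied = 0;
        OrderConfirmed event;
        while ((event = broker.poll()) != null) {
            readView.put(event.orderId(), "CONFIRMED");
            applied++;
        }
        return applied;
    }

    // Returns null while the projection has not caught up yet.
    String readStatus(String orderId) {
        return readView.get(orderId);
    }
}
```

Between `confirm` and `drainEvents`, `readStatus` returns `null` even though the write already succeeded. That window is the projection lag the list above refers to.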

Transactional Outbox Pattern

In asynchronous CQRS systems, one of the hardest reliability problems is guaranteeing that transactional writes and domain event publishing remain consistent.

A common failure scenario looks like this:

Database transaction commits successfully
Event publishing fails
Read projections never update
System state becomes inconsistent

This is where the Transactional Outbox Pattern becomes extremely valuable.

Instead of publishing events directly to the broker during command processing, the application:

stores business changes, and
persists domain events into an outbox table

inside the same database transaction.

A background publisher later reads the outbox table and safely publishes events to Kafka, RabbitMQ, or other messaging systems.

This approach significantly improves synchronization reliability because:

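if the transaction commits, the event is durably recorded with the business change and can still be published later, even if the broker or the application fails in between.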

Many large-scale product engineering systems use variations of this pattern to:

synchronize CQRS projections,
maintain audit pipelines,
support event-driven integrations, and
improve recovery guarantees.

However, the pattern also introduces additional operational concerns:

outbox cleanup,
duplicate publishing,
replay handling,
publisher lag, and
idempotent consumers.

Like most distributed systems patterns, the Outbox Pattern improves reliability by introducing controlled complexity.
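Here is a rough sketch of the idea in plain Java, with lists and maps standing in for database tables and the broker. All names (`OrderDatabase`, `OutboxRow`, `OutboxRelay`, the `crashBeforeMark` flag) are hypothetical and exist only to simulate one failure case.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// One row in the hypothetical outbox table.
class OutboxRow {
    final long id;
    final String payload;
    boolean published;

    OutboxRow(long id, String payload) {
        this.id = id;
        this.payload = payload;
    }
}

class OrderDatabase {
    final Map<String, String> orders = new HashMap<>();
    final List<OutboxRow> outbox = new ArrayList<>();
    private long nextId = 1;

    // Stand-in for a single database transaction: the business row and the
    // outbox row are written together.
    void confirmOrder(String orderId) {
        orders.put(orderId, "CONFIRMED");
        outbox.add(new OutboxRow(nextId++, "OrderConfirmed:" + orderId));
    }
}

class OutboxRelay {
    private final OrderDatabase db;
    private final List<String> broker; // stand-in for a Kafka topic or RabbitMQ exchange

    OutboxRelay(OrderDatabase db, List<String> broker) {
        this.db = db;
        this.broker = broker;
    }

    // Publishes every unpublished row. If crashBeforeMark is true, the relay
    // stops after sending but before recording that fact, so the next run
    // sends the same row again: delivery is at-least-once.
    int publishPending(boolean crashBeforeMark) {
        int sent = 0;
        for (OutboxRow row : db.outbox) {
            if (row.published) {
                continue;
            }
            broker.add(row.payload);
            sent++;
            if (crashBeforeMark) {
                return sent;
            }
            row.published = true;
        }
        return sent;
    }
}
```

Calling `publishPending(true)` and then `publishPending(false)` puts the same payload on the broker twice. That is why duplicate publishing and idempotent consumers show up in the list of operational concerns.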

Change Data Capture (CDC)

Some organizations synchronize read models using database-level change streams instead of explicit domain events.

This pattern is commonly called Change Data Capture (CDC).

Tools such as Debezium, typically deployed on Kafka Connect, read the database's replication or transaction log and stream transactional database changes into messaging systems or projection pipelines.

Uber uses Kafka for event streaming between write and read models, while Netflix combines CDC for database changes with Kafka for business events.

This approach is attractive because:

application services remain simpler,
transactional writes stay centralized, and
synchronization becomes infrastructure-driven.

Several large engineering organizations use CDC pipelines for:

analytics synchronization,
search indexing,
audit systems, and
reporting architectures.

However, CDC introduces its own trade-offs:

weaker domain semantics,
infrastructure complexity,
schema coupling, and
operational dependency on database internals.

CDC works well for integration-heavy systems but may become difficult when business workflows require explicit domain intent.

Polling-Based Synchronization

Some systems use scheduled polling jobs to synchronize projections periodically.

For example:

reporting databases refreshing every few minutes,
analytics snapshots rebuilding hourly,
search indexes syncing in batches.

This strategy is operationally simple and often surprisingly effective for:

internal systems,
low-frequency reporting, or
non-real-time workloads.

Benefits include:

simpler infrastructure,
easier debugging, and
reduced messaging complexity.

But polling introduces:

synchronization delays,
inefficient querying, and
stale data windows.

For systems requiring near real-time consistency, polling usually becomes insufficient.
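A small sketch of one polling run, assuming the source rows carry an `updatedAt` value and that a "last seen" watermark is enough to find changes. `SourceRow` and `PollingSynchronizer` are hypothetical names, and a list stands in for the source table.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

record SourceRow(String id, String value, long updatedAt) {}

class PollingSynchronizer {
    private final Map<String, String> readModel = new HashMap<>();
    private long watermark = Long.MIN_VALUE; // highest updatedAt already copied

    // One scheduled run: copy the rows that changed since the previous run.
    int sync(List<SourceRow> sourceTable) {
        int copied = 0;
        long newWatermark = watermark;
        for (SourceRow row : sourceTable) {
            if (row.updatedAt() > watermark) {
                readModel.put(row.id(), row.value());
                newWatermark = Math.max(newWatermark, row.updatedAt());
                copied++;
            }
        }
        watermark = newWatermark;
        return copied;
    }

    String get(String id) {
        return readModel.get(id);
    }
}
```

Everything written between two runs stays invisible to readers until the next run, which is the stale data window mentioned above. A strict `>` comparison can also miss rows that share the watermark timestamp, so real jobs usually need a tie-breaker.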

Hybrid Synchronization Models

Many production systems eventually adopt hybrid approaches.

For example:

transactional projections for critical workflows,
asynchronous projections for analytics,
CDC pipelines for integrations, and
polling for low-priority reporting.

This is extremely common in large organizations because different workloads often require different consistency guarantees.

For example:

payment confirmation views may require immediate consistency,
while recommendation systems tolerate several seconds of lag.

The important insight is this:

CQRS synchronization is rarely a single architectural decision.

It usually evolves into multiple consistency models optimized for different business requirements.

Choosing the Right Strategy

The synchronization strategy should match the actual business problem.

Questions teams should ask include:

How stale can reads safely become?
What happens if projections lag?
Can users tolerate temporary inconsistency?
How expensive are replay operations?
What operational tooling exists for monitoring synchronization health?
How difficult will debugging become during failures?

Many CQRS failures happen because teams optimize for architectural purity instead of operational reality.

Synchronization strategy is one of the most important architectural decisions in any CQRS system because it directly affects:

consistency,
scalability,
observability, and
operational complexity.

5. Where CQRS Hurts

This is the part most CQRS articles under-discuss.

The implementation itself is usually not the hardest part.

The operational consequences are.

Eventual Consistency Becomes Real

Once reads and writes separate, consistency becomes asynchronous.

That means writes may succeed while read projections remain temporarily stale.

This sounds manageable in theory. But in production systems, eventual consistency creates subtle problems:

users refreshing dashboards and seeing old state
inventory counts temporarily incorrect
recently updated data not immediately searchable
stale projections causing business confusion

Many teams underestimate how difficult eventual consistency becomes operationally, especially once traffic increases, retries happen, projections lag, or events fail partially.

Distributed consistency sounds simple in architecture diagrams. It becomes much harder during production incidents.

Projection Failures Create New Failure Modes

CQRS systems introduce entirely new operational risks.

For example:

event consumers crash
projections stop updating
replay logic becomes corrupted
messages process out of order
stale read models accumulate silently

Now the system may appear partially healthy while still serving inconsistent data.

These failures are often difficult to debug because the write side succeeded, but downstream projections failed asynchronously later. That separation increases debugging complexity significantly.
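One common defence against out-of-order and duplicate events is a version check in the projection. The sketch below assumes each event carries the full new state plus a per-key version that only increases. `InventoryChanged` and `InventoryProjection` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

record InventoryChanged(String sku, long version, int quantity) {}

class InventoryProjection {
    private record Row(long version, int quantity) {}

    private final Map<String, Row> rows = new HashMap<>();

    // Applies the event only if it is newer than what the projection holds.
    // Returns false for duplicates and late (out-of-order) events.
    boolean apply(InventoryChanged event) {
        Row current = rows.get(event.sku());
        if (current != null && event.version() <= current.version()) {
            return false;
        }
        rows.put(event.sku(), new Row(event.version(), event.quantity()));
        return true;
    }

    int quantityOf(String sku) {
        Row row = rows.get(sku);
        return row == null ? 0 : row.quantity();
    }
}
```

This only works for events that carry complete state. Events that carry deltas (for example "quantity decreased by 2") cannot simply be skipped when they arrive late and need a different strategy.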

Operational Complexity Grows Quickly

CQRS rarely stays “simple.”

As systems evolve, teams eventually manage multiple models, projection pipelines, messaging infrastructure, replay mechanisms, synchronization logic, and consistency monitoring.

Operational maturity becomes critical.

Teams need visibility into:

projection lag
failed consumers
replay failures
dead-letter queues
synchronization health

Many CQRS problems are not coding problems.

They are operational systems problems.

Cognitive Load Increases

CQRS also increases mental overhead for engineers.

Developers now need to reason about asynchronous synchronization, stale reads, distributed consistency, projection rebuilding, replay safety, and eventual consistency behavior.

Onboarding becomes harder. Debugging becomes harder. Distributed state becomes harder to reason about.

This complexity compounds over time, especially for smaller teams.

Simple Systems Become Overengineered

One of the biggest mistakes teams make is introducing CQRS too early.

Many business systems are still fundamentally:

CRUD applications
admin platforms
internal tools
transactional APIs

Adding asynchronous projections, event synchronization, and separate consistency models often introduces far more complexity than value.

A simple monolithic relational model is frequently easier to maintain and evolve.

CQRS solves scaling and domain complexity problems. If those problems do not exist yet, CQRS may simply become architectural overhead.

6. CQRS and Event Sourcing Are Not the Same Thing

These two patterns are commonly confused, but they solve different problems.

CQRS separates read responsibilities from write responsibilities.

Event sourcing stores immutable domain events instead of current state snapshots.

They are often used together because event streams naturally feed read projections. But they are not dependent on each other.

You can have:

CQRS without event sourcing,
event sourcing without CQRS, or
neither.

This distinction matters because event sourcing introduces another layer of operational complexity involving replay behavior, schema evolution, event versioning, and long-term event retention.

Many systems benefit from CQRS without needing full event sourcing.
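To show what "storing events instead of current state" means, here is a tiny sketch with hypothetical account events. It says nothing about separate read and write models, which is exactly the point: event sourcing is about how state is stored, not about who reads it.

```java
import java.util.List;

// Hypothetical account events; the event list is the source of truth.
interface AccountEvent {
    long delta();
}

record Deposited(long amount) implements AccountEvent {
    public long delta() {
        return amount;
    }
}

record Withdrawn(long amount) implements AccountEvent {
    public long delta() {
        return -amount;
    }
}

class Account {
    // Current state is never stored; it is derived by folding over history.
    static long balance(List<AccountEvent> history) {
        long balance = 0;
        for (AccountEvent event : history) {
            balance += event.delta();
        }
        return balance;
    }
}
```

A CQRS system without event sourcing would simply store the balance as a column and keep a separate read view of it.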

7. Production Trade-offs

This is where CQRS becomes less theoretical.

In production systems, the hardest problems are rarely command handlers, DTOs, or API design.

The hardest problems are usually operational.

Projection Rebuilds

Eventually, projections fail, schemas evolve, consumers change, or read models become corrupted.

Now teams need replay capabilities.

Rebuilding projections for millions of events under production traffic can become operationally expensive. This is where event retention strategies suddenly matter a lot.
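A simplified sketch of a rebuild, assuming the retained events fit in a list and the projection is a single map. `OrderStatusChanged` and `OrderStatusProjection` are hypothetical names. The idea is to build the new projection on the side and swap it in only when it is complete.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

record OrderStatusChanged(String orderId, String status) {}

class OrderStatusProjection {
    // Queries always read through this reference.
    private Map<String, String> live = new HashMap<>();

    void apply(OrderStatusChanged event) {
        live.put(event.orderId(), event.status());
    }

    String statusOf(String orderId) {
        return live.get(orderId);
    }

    // Replays the retained events into a fresh map, then swaps it in, so
    // readers never observe a half-built projection.
    void rebuild(List<OrderStatusChanged> retainedEvents) {
        Map<String, String> fresh = new HashMap<>();
        for (OrderStatusChanged event : retainedEvents) {
            fresh.put(event.orderId(), event.status());
        }
        live = fresh;
    }
}
```

This single-threaded sketch ignores events that arrive while the rebuild is running. In production that catch-up step, and the fact that a rebuild is only possible if the events were retained, is usually where the cost shows up.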

Replay Safety

Replay sounds easy until external integrations exist, side effects occur, or duplicate events become dangerous.

For example:

replaying payment events
resending notifications
retriggering workflows

Safe replay requires idempotency, side-effect isolation, and careful event handling design.

Many teams discover this too late.
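A minimal sketch of side-effect isolation, assuming the handler is told whether it is running in replay mode. `PaymentCaptured`, `PaymentEventHandler`, and the `replay` flag are hypothetical, and a list stands in for the email gateway.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

record PaymentCaptured(String eventId, String orderId) {}

class PaymentEventHandler {
    private final Set<String> processedEventIds = new HashSet<>();
    final Set<String> paidOrders = new HashSet<>();    // projection state
    final List<String> emailsSent = new ArrayList<>(); // stand-in for an external side effect

    // replay=true rebuilds state only; external side effects are skipped.
    void handle(PaymentCaptured event, boolean replay) {
        paidOrders.add(event.orderId()); // adding to a set is naturally idempotent
        if (replay) {
            return;
        }
        if (!processedEventIds.add(event.eventId())) {
            return; // duplicate live delivery
        }
        emailsSent.add("receipt:" + event.orderId());
    }
}
```

The state update and the side effect are deliberately separate steps, so a replay can rebuild `paidOrders` without sending a single receipt again.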

Observability Becomes Critical

CQRS systems require much deeper operational visibility.

Teams usually need monitoring for:

projection lag
replay progress
failed event handlers
synchronization latency
stale projections
consumer health

Without strong observability, distributed inconsistencies become extremely difficult to diagnose.
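Projection lag is the easiest of these signals to sketch. The example below assumes every event has a monotonically increasing sequence number. `ProjectionLagMonitor` and its method names are hypothetical, and a real setup would export the value to a metrics system rather than hold it in fields.

```java
class ProjectionLagMonitor {
    private long lastWrittenSequence; // highest sequence the write side has emitted
    private long lastAppliedSequence; // highest sequence the projection has applied

    void onEventWritten(long sequence) {
        lastWrittenSequence = Math.max(lastWrittenSequence, sequence);
    }

    void onEventApplied(long sequence) {
        lastAppliedSequence = Math.max(lastAppliedSequence, sequence);
    }

    // Number of events the projection still has to catch up on.
    long lag() {
        return lastWrittenSequence - lastAppliedSequence;
    }

    boolean isStale(long maxAllowedLag) {
        return lag() > maxAllowedLag;
    }
}
```

An alert on `isStale` catches the "consumer silently stopped" case that otherwise shows up only as confused users.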

8. When to Use CQRS

CQRS becomes valuable when systems genuinely need:

independent read/write scaling
optimized query models
complex domain workflows
asynchronous event-driven integration
large-scale reporting architectures

Typical examples include:

e-commerce platforms
recommendation systems
analytics pipelines
financial processing systems
inventory-heavy domains
audit-heavy architectures

In these systems, the architectural benefits can outweigh the complexity cost.

9. When to Avoid CQRS

It's best to avoid CQRS for:

simple CRUD systems
small internal tools
low-scale APIs
small engineering teams
tightly consistent transactional systems
domains without meaningful read/write asymmetry

In many systems, the biggest bottleneck is not database scalability.

It is shipping features reliably, maintaining operational simplicity, and keeping systems maintainable.

Introducing distributed consistency models too early can slow teams down significantly.

When to Abandon CQRS: Netflix’s Case Study

Netflix’s Tudum platform provides a fascinating case study in CQRS limitations. Initially built with CQRS using Kafka and Cassandra, the team concluded that, for the use-case at hand, the CQRS design pattern wasn’t the optimal approach, and using a distributed, in-memory object store suited the situation better.

The problems they encountered:

Kafka consumer logic became overly complex
Different services duplicated logic to rebuild current state
Events arrived out of order, causing state inconsistencies
Schema evolution became difficult as the system matured

Their solution: Replace Kafka and Cassandra with RAW Hollow, an in-memory object store, which eliminated cache invalidation problems as the entire dataset could fit into application memory. The result was dramatically reduced data propagation times and simpler code.

The lesson: Sometimes the latest state is all that matters. If you don’t need event history, event replay, or complex event processing, CQRS might be over-engineering.

10. A Practical Rule of Thumb

A simple rule usually works well.

If your biggest problem is still:

feature delivery
developer productivity
operational simplicity
basic scalability

CQRS is probably not the first optimization you need.

CQRS becomes valuable when domain complexity, scaling asymmetry, and architectural evolution genuinely justify the additional operational burden.

Until then, simpler architectures are often the better engineering decision.

Conclusion

CQRS is a powerful architectural pattern. But it is not free.

It introduces distributed consistency, operational overhead, replay complexity, synchronization challenges, and entirely new failure modes.

The hardest part of CQRS is rarely implementation.

It is operating distributed consistency models reliably once systems evolve under production pressure.

Good architecture is not about using the most advanced patterns. It is about understanding the trade-offs, the operational consequences, and the real problems the system actually needs to solve.

RabbitMQ vs Kafka: Choosing the Right Messaging System for Real Backend Architectures (3/3)

Venkatesan Ramar — Thu, 21 May 2026 22:48:16 +0000

In this article, I'll walk through sample code snippets for RabbitMQ and Kafka with Spring Boot.

9. Spring Boot Integration Examples

Messaging systems make a lot more sense once you see how they actually behave inside applications.

This section is not about building a full production-ready setup.

The goal here is simpler:
show how RabbitMQ and Kafka integrations usually feel different inside Spring Boot apps.

RabbitMQ Integration Example

RabbitMQ integration in Spring Boot is usually pretty simple and workflow-focused.

A typical flow looks something like this:

order gets created,
app publishes a processing task,
consumer picks it up and runs business logic.

Producer Example

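import org.springframework.amqp.rabbit.core.RabbitTemplate;
import org.springframework.stereotype.Service;

@Service
public class OrderPublisher {

    private final RabbitTemplate rabbitTemplate;

    // Constructor injection keeps the dependency explicit and final.
    public OrderPublisher(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    // OrderCreatedEvent is the application's own event type.
    // The payload is serialized by the template's MessageConverter
    // (for JSON, configure e.g. Jackson2JsonMessageConverter).
    public void publish(OrderCreatedEvent event) {
        rabbitTemplate.convertAndSend(
                "order.exchange",   // exchange
                "order.created",    // routing key
                event
        );
    }
}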

Here:

the exchange handles routing,
routing keys decide where messages go, and
RabbitMQ distributes messages to queues.

This routing flexibility is one of RabbitMQ’s biggest strengths.

Consumer Example

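import org.springframework.amqp.rabbit.annotation.RabbitListener;
import org.springframework.stereotype.Component;

@Component
public class OrderConsumer {

    // Invoked once for each message delivered from the queue.
    @RabbitListener(queues = "order.processing.queue")
    public void process(OrderCreatedEvent event) {

        System.out.println("Processing order: " + event.orderId());

        // Business logic
    }
}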

This style works really well for things like:

background jobs,
workflow execution,
notifications, and
transactional async tasks.

The queue basically acts like a work dispatcher.

Retry & DLQ Configuration

One reason RabbitMQ is popular in backend systems is its retry handling.

A common production setup usually includes:

main queue,
retry queue,
dead-letter queue (DLQ).

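import org.springframework.amqp.core.Queue;
import org.springframework.amqp.core.QueueBuilder;
import org.springframework.context.annotation.Bean;

// Declared inside a @Configuration class.
@Bean
public Queue orderQueue() {
    return QueueBuilder.durable("order.processing.queue")
            // Rejected or expired messages are routed to this exchange.
            .deadLetterExchange("order.dlx")
            .build();
}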

In real systems:

temporary failures go through retry flows,
poison messages move into DLQs, and
teams get visibility into failed processing.

You’ll see this pattern everywhere in enterprise systems.
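The flow itself can be simulated in plain Java. To be clear, this is not Spring AMQP configuration, just a sketch of the behaviour: `RetryingDispatcher` is a hypothetical class, an in-memory queue stands in for the main and retry queues, and the delay that a real retry queue would add between attempts is left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Predicate;

class RetryingDispatcher {
    private record Attempt(String message, int count) {}

    private final Queue<Attempt> mainQueue = new ArrayDeque<>();
    final List<String> deadLetters = new ArrayList<>(); // stand-in for the DLQ
    private final int maxAttempts;

    RetryingDispatcher(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    void publish(String message) {
        mainQueue.add(new Attempt(message, 1));
    }

    // The handler returns true on success. Failed messages are re-queued
    // until maxAttempts is reached, then parked in the dead-letter list.
    int run(Predicate<String> handler) {
        int succeeded = 0;
        Attempt next;
        while ((next = mainQueue.poll()) != null) {
            if (handler.test(next.message())) {
                succeeded++;
            } else if (next.count() < maxAttempts) {
                mainQueue.add(new Attempt(next.message(), next.count() + 1));
            } else {
                deadLetters.add(next.message());
            }
        }
        return succeeded;
    }
}
```

A message that always fails is attempted `maxAttempts` times and then lands in `deadLetters`, where someone can actually look at it instead of letting it block the queue.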

Kafka Integration Example

Kafka integration feels different because Kafka itself works differently.

Instead of queue-based task distribution, Kafka is built around event streams and partitioned logs.

Producer Example

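import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class OrderEventPublisher {

    private final KafkaTemplate<String, OrderCreatedEvent> kafkaTemplate;

    // Constructor injection keeps the dependency explicit and final.
    public OrderEventPublisher(KafkaTemplate<String, OrderCreatedEvent> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void publish(OrderCreatedEvent event) {

        // Arguments: topic, record key, payload.
        kafkaTemplate.send(
                "order-events",
                event.orderId(),
                event
        );
    }
}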

Notice this part:

event.orderId()

That’s the partition key.

And it matters a lot.

Kafka guarantees ordering only inside a partition.

Using the order ID as the partition key ensures that all events for the same order land in the same partition and therefore stay ordered, as long as the topic's partition count does not change.

Partition strategy becomes a huge design topic in Kafka systems.
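The key-to-partition mapping is easy to picture with a small sketch. Note the assumption: Kafka's default partitioner hashes the serialized key with murmur2, while this hypothetical `KeyPartitioner` uses `String.hashCode()` purely to illustrate the principle.

```java
class KeyPartitioner {
    // Simplified stand-in for Kafka's default partitioner (which uses a
    // murmur2 hash of the serialized key, not String.hashCode()).
    static int partitionFor(String key, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hashes.
        return Math.floorMod(key.hashCode(), numPartitions);
    }
}
```

The same key always maps to the same partition for a given partition count, which is what keeps one order's events in sequence. Change the partition count and the same key may map somewhere else, which is one reason partition planning is hard to fix later.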

Consumer Example

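import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class OrderEventConsumer {

    // Consumers sharing this groupId split the topic's partitions between them.
    @KafkaListener(
            topics = "order-events",
            groupId = "order-processing-group"
    )
    public void consume(OrderCreatedEvent event) {

        System.out.println("Processing order event: "
                + event.orderId());

        // Business logic
    }
}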

Unlike RabbitMQ:

Kafka consumers track offsets,
messages stay in the log, and
multiple consumer groups can process the same events independently.

That means:

analytics services,
audit systems,
notification services,
reporting pipelines

can all consume the same event stream separately.

This is one reason Kafka works so well for event-driven architectures.
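Here is a toy model of that behaviour in plain Java: a single-partition, in-memory log with one committed offset per consumer group. `EventLog` and its methods are hypothetical and far simpler than a real Kafka client, but the mechanics are the same.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-partition, in-memory stand-in for a Kafka topic.
class EventLog {
    private final List<String> records = new ArrayList<>();
    private final Map<String, Integer> committedOffsets = new HashMap<>();

    void append(String record) {
        records.add(record);
    }

    // Returns the records this group has not seen yet and advances only
    // that group's offset. Records are never removed from the log.
    List<String> poll(String groupId) {
        int from = committedOffsets.getOrDefault(groupId, 0);
        List<String> batch = new ArrayList<>(records.subList(from, records.size()));
        committedOffsets.put(groupId, records.size());
        return batch;
    }

    // Consumer lag: how far the group is behind the end of the log.
    int lag(String groupId) {
        return records.size() - committedOffsets.getOrDefault(groupId, 0);
    }
}
```

A group that starts late, for example a new analytics service, still receives every record from the beginning, because consuming does not delete anything.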

Kafka Retry Handling

Retries in Kafka are usually handled using:

retry topics,
delayed retry topics, or
custom consumer retry logic.

A common pattern looks like this:

failed events move into retry topics,
consumers retry later,
poison messages eventually move into DLQs or parking-lot topics.

This setup is powerful, but definitely more operationally complex than RabbitMQ retry routing.

Kafka gives you more flexibility.

But it also expects more architectural discipline from the team.

The Bigger Architectural Difference

Even from the code examples, the difference becomes pretty obvious.

RabbitMQ apps usually feel:

workflow-oriented,
routing-focused, and
delivery-centric.

Kafka apps usually feel:

stream-oriented,
event-centric, and
partition-aware.

Neither one is universally better.

They’re just optimized for different kinds of problems.

And that difference becomes much more important once systems start scaling and production complexity kicks in.

10. Common Mistakes Teams Make

Most production messaging issues are not really caused by RabbitMQ or Kafka.

They usually happen because of:

bad assumptions,
over-engineering, or
missing operational visibility.

And honestly, the same mistakes show up again and again across teams.

Using Kafka as a Task Queue

This one happens a lot.

Kafka is amazing for:

event streaming,
analytics,
replayability, and
handling huge event volumes.

But teams sometimes use it for very simple things like:

background jobs,
workflow execution, or
async task processing.

That usually brings in:

partition management,
retry complexity,
consumer coordination, and
extra operational overhead.

If the actual requirement is just:

“Run tasks reliably in the background”

RabbitMQ is often the cleaner and simpler solution.

Not every async workflow needs a distributed event streaming platform.

Sometimes a queue is just a queue.

Choosing Kafka Just Because It “Scales Better”

Yes, Kafka scales extremely well.

But scalability only matters when you actually need it.

A lot of systems never reach the scale where Kafka’s architecture becomes necessary.

Meanwhile, the team still has to deal with:

partitions,
retention policies,
lag monitoring,
broker management, and
cluster operations.

That’s a lot of complexity to carry around for no real reason.

Good architecture solves real problems — not imaginary future scale problems.

Ignoring Idempotency

Retries eventually create duplicates.

Always assume that.

This applies to both RabbitMQ and Kafka.

If consumers are not idempotent:

payments may run twice,
emails may send twice,
inventory may break,
workflows may repeat unexpectedly.

Messaging guarantees alone won’t save you here.

Applications still need:

deduplication logic,
safe retry handling, and
idempotent consumers.

Experienced engineers usually assume duplicate delivery will happen eventually.

Because in distributed systems, it eventually does.
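A minimal sketch of an idempotent consumer, assuming every message carries a unique ID assigned by the producer. `ChargeRequested` and `IdempotentChargeConsumer` are hypothetical names, and the in-memory set stands in for a persistent deduplication table.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

record ChargeRequested(String messageId, String accountId, long amountCents) {}

class IdempotentChargeConsumer {
    private final Set<String> seenMessageIds = new HashSet<>();
    private final Map<String, Long> chargedByAccount = new HashMap<>();

    // Returns true if the charge was applied, false if the message was a duplicate.
    boolean onMessage(ChargeRequested msg) {
        if (!seenMessageIds.add(msg.messageId())) {
            return false;
        }
        chargedByAccount.merge(msg.accountId(), msg.amountCents(), Long::sum);
        return true;
    }

    long totalCharged(String accountId) {
        return chargedByAccount.getOrDefault(accountId, 0L);
    }
}
```

In a real service the "seen" record and the business change should be committed together. Otherwise a crash between the two brings the duplicate problem right back.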

Treating RabbitMQ Like Event Storage

RabbitMQ is built for message delivery.

Not long-term event retention.

Trying to build:

replayable event history,
event sourcing systems, or
analytics pipelines

on top of RabbitMQ usually becomes painful later.

Kafka is naturally better for those workloads.

Using the wrong abstraction eventually creates operational headaches.

Over-Partitioning Kafka

Partitions help with parallelism.

But too many partitions create their own problems:

rebalance overhead,
broker pressure,
operational complexity, and
consumer coordination costs.

More partitions do not automatically mean better performance.

Partition strategy should match:

throughput requirements,
scaling needs, and
ordering guarantees.

Bad partition planning becomes very hard to fix later.

Ignoring Observability

Many teams monitor only broker uptime and stop there.

But healthy messaging systems need much deeper visibility.

You usually want to monitor:

queue depth,
consumer lag,
retry rates,
DLQ growth,
processing latency, and
throughput trends.

Distributed systems rarely fail instantly.

Problems usually build slowly over time.

Without observability, teams often discover issues only after customers complain.

11. Decision Matrix

At this point, the pattern becomes pretty obvious:

RabbitMQ and Kafka solve different kinds of problems.

They are not direct replacements for each other in every scenario.

Here’s a simple decision guide.

Scenario	Better Fit	Why
Background job processing	RabbitMQ	Simpler retries and task distribution
Workflow orchestration	RabbitMQ	Flexible routing and operational simplicity
Notification systems	RabbitMQ	Easy fanout and retry handling
Payment workflows	RabbitMQ	Better delivery-focused control
Event streaming	Kafka	High-throughput distributed event log
Real-time analytics	Kafka	Replayability and scalable consumers
Audit systems	Kafka	Durable event retention
Event sourcing	Kafka	Immutable event history
CDC pipelines	Kafka	Stream-first architecture
Simple async microservice communication	RabbitMQ	Lower operational overhead
Large-scale event platforms	Kafka	Built for distributed streaming

A Practical Rule of Thumb

A simple rule usually works well:

Choose RabbitMQ when the main concern is:

task execution,
workflow coordination,
retries, and
operational simplicity.

Choose Kafka when the main concern is:

event streaming,
replayability,
analytics, and
long-term event retention.

That distinction alone clears up a lot of confusion early in system design.

Final Thoughts

RabbitMQ and Kafka are both excellent technologies, but they were designed with very different goals.

Good engineering is not about picking the most impressive or cutting-edge technology.

It’s about choosing the technology that fits naturally, stays maintainable, and behaves predictably under real production pressure.

Many mature systems eventually use both RabbitMQ and Kafka together.

The important part is knowing where each one actually fits best.

Appreciate your support and suggestions.