<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nejc Korasa</title>
    <description>The latest articles on DEV Community by Nejc Korasa (@nejckorasa).</description>
    <link>https://dev.to/nejckorasa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F21834%2F441c6680-8e6d-4b54-b6a4-606db581d956.jpeg</url>
      <title>DEV Community: Nejc Korasa</title>
      <link>https://dev.to/nejckorasa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nejckorasa"/>
    <language>en</language>
    <item>
      <title>Kafka Backfill Patterns: A Guide to Accessing Historical Data</title>
      <dc:creator>Nejc Korasa</dc:creator>
      <pubDate>Tue, 04 Nov 2025 10:00:00 +0000</pubDate>
      <link>https://dev.to/nejckorasa/kafka-backfill-patterns-a-guide-to-accessing-historical-data-53ep</link>
      <guid>https://dev.to/nejckorasa/kafka-backfill-patterns-a-guide-to-accessing-historical-data-53ep</guid>
      <description>&lt;p&gt;Event-driven architectures with Kafka have become a standard way of building modern microservices. At first, everything works smoothly - services communicate via events, state is rebuilt from event streams, and the system scales well. But as your data grows, you face an inevitable challenge: what happens when you need to access historical events that are no longer in Kafka?&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The Problem: Finite Retention &amp;amp; The Need for Backfills
&lt;/h2&gt;

&lt;p&gt;In a perfect world, we would keep the full event log in Kafka forever. In the real world, however, storing an ever-growing history on high-performance broker disks is prohibitively expensive.&lt;/p&gt;

&lt;p&gt;This leads to the inevitable compromise: &lt;strong&gt;data retention policies&lt;/strong&gt;. We keep a few weeks or months of events in Kafka for real-time processing and offload the rest to cheaper, long-term cold storage like Amazon S3. This process becomes part of a general Data Lake sink strategy.&lt;/p&gt;

&lt;p&gt;This works well until a scenario arises that demands access to the full historical record, for example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bootstrapping a New Service:&lt;/strong&gt; A new microservice needs to build its own materialized view of the world by processing the entire history of events.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovering from a Bug:&lt;/strong&gt; A subtle bug is discovered in a service, and you need to rebuild its state from a point in time months ago.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enriching Data for New Features:&lt;/strong&gt; A new feature requires historical context, forcing a re-process of old events.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core problem is the same: how do we gracefully rehydrate our services with data that now lives in cold storage?&lt;/p&gt;

&lt;h2&gt;
  
  
  2. The Backfill Blueprint: A Two-Phase Process
&lt;/h2&gt;

&lt;p&gt;Backfill can be broken down into two distinct phases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Phase 1: Sourcing the Data:&lt;/strong&gt; First, we must establish a reliable way to get the stream of historical events from cold storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Phase 2: Consuming the Data:&lt;/strong&gt; Second, we need a robust strategy for our service to process this historical stream safely, without disrupting live traffic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3. Phase 1: Sourcing Historical Data
&lt;/h2&gt;

&lt;p&gt;There are three primary architectural patterns for sourcing historical data that is no longer in Kafka.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: Kafka Tiered Storage
&lt;/h3&gt;

&lt;p&gt;The most elegant solution is one that eliminates the need for a separate ETL process: using a Kafka distribution that supports &lt;a href="https://docs.confluent.io/platform/current/kafka/tiered-storage.html" rel="noopener noreferrer"&gt;Tiered Storage&lt;/a&gt;. This feature allows Kafka to automatically move older event segments to object storage like S3, while the topic's log remains logically intact and infinitely queryable. The data is physically in two places, but Kafka presents it as a single, seamless stream.&lt;/p&gt;
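As a sketch, with Apache Kafka's open-source tiered storage (KIP-405) this split is expressed purely in topic configuration; the property names below assume Kafka 3.6+ and may differ for other distributions:

```shell
# Hypothetical topic setup: ~7 days on fast broker disks, ~2 years of total
# retention, with older segments offloaded transparently to object storage.
# Requires tiered storage enabled on the brokers.
kafka-topics.sh --bootstrap-server localhost:9092 --create --topic events \
  --config remote.storage.enable=true \
  --config local.retention.ms=604800000 \
  --config retention.ms=63072000000
```

Consumers that need deep history simply read from the earliest offset; the broker fetches old segments from object storage behind the scenes.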

&lt;h3&gt;
  
  
  Pattern 2: ETL Bridge
&lt;/h3&gt;

&lt;p&gt;If you don’t have Tiered Storage, you need a safe, reliable bridge between your S3 data lake and Kafka. The core of this pattern is a generic, on-demand ETL job (AWS Glue or Spark is a perfect fit) that reads events from S3 and produces them onto a &lt;strong&gt;dedicated, temporary backfill topic&lt;/strong&gt; (e.g., &lt;code&gt;events.backfill&lt;/code&gt;). This isolates the historical load from the live stream, preventing disruption to real-time consumers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling Schema Evolution:&lt;/strong&gt; Using a schema registry, the ETL job can perform a "schema-on-read" transformation. It reads multiple historical Avro schema versions from S3, evolves each record to the latest schema version, and writes the clean data to the backfill topic. This means the service consumer only needs to be aware of the latest schema.&lt;/p&gt;
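The evolve-to-latest step can be sketched in plain Java. The record types below are illustrative stand-ins (none of these names come from the article): each historical shape carries an implicit schema version, and one function lifts every version to the latest shape before it is produced to the backfill topic:

```java
// Sketch (illustrative types): lift historical records of any schema version
// to the latest shape before producing them to the backfill topic.
public class SchemaEvolver {
    // Latest schema: v3 stores amounts in minor units and adds a currency field.
    public record EventV3(String id, long amountMinor, String currency) {}

    // Historical shapes as they might sit in S3.
    public sealed interface Historical permits V1, V2 {}
    public record V1(String id, double amount) implements Historical {}    // v1: fractional amount
    public record V2(String id, long amountMinor) implements Historical {} // v2: minor units, no currency

    // Evolve any historical version to the latest schema ("schema-on-read").
    public static EventV3 evolve(Historical h) {
        return switch (h) {
            case V1(String id, double amount) -> new EventV3(id, Math.round(amount * 100), "GBP");
            case V2(String id, long amountMinor) -> new EventV3(id, amountMinor, "GBP");
        };
    }
}
```

In a real Glue or Spark job the same role is played by Avro schema resolution against the registry's latest schema; the point is that only the ETL job knows about old versions.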

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0tdsm8qzk6tfvephcrk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr0tdsm8qzk6tfvephcrk.png" alt="Glue Backfill Job" width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: Pull-Based Backfill (Bypassing Kafka)
&lt;/h3&gt;

&lt;p&gt;In some scenarios, re-populating a Kafka topic is unnecessary overhead. Instead, the service needing the data can fetch it directly from its long-term storage location. &lt;/p&gt;

&lt;p&gt;This pattern simplifies the data platform but shifts complexity to the consuming service, which must now contain logic to read from two different sources, merge the streams, and handle potential event ordering conflicts. The exception is when you can isolate the backfill and run it to completion before the service goes live.&lt;/p&gt;

&lt;h4&gt;
  
  
  Alternative A: Direct Lake Query
&lt;/h4&gt;

&lt;p&gt;If you have query engines like &lt;a href="https://trino.io/" rel="noopener noreferrer"&gt;Trino&lt;/a&gt; set up, your service can bypass Kafka for historical data. It can implement a job that directly queries S3 via Trino, fetching and processing data in controlled chunks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Alternative B: Service-to-Service Backfill
&lt;/h4&gt;

&lt;p&gt;This alternative applies when the historical data still resides in the source service's live database. The source service provides a paginated API, allowing the consuming service to pull the history in manageable batches.&lt;/p&gt;

&lt;p&gt;While often faster to set up, this approach puts a direct and heavy read load on a live production service. This can degrade the source service's performance, so mitigation is essential:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Control the Load:&lt;/strong&gt; Throttle and rate-limit the backfill requests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedule Wisely:&lt;/strong&gt; Run the backfill during off-peak hours if possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Isolate the Impact:&lt;/strong&gt; Scale resources accordingly and use a database read replica if possible.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor:&lt;/strong&gt; Watch the source service's latency and error rates for the duration of the backfill.&lt;/li&gt;
&lt;/ul&gt;
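A minimal sketch of such a pull loop, with an in-memory list standing in for the source service's paginated API (the `fetchPage` shape, the page size, and the fixed pause are all assumptions you would tune against the source's capacity):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: pull history from the source service in throttled, fixed-size pages.
public class PaginatedBackfill {

    // Stand-in for a call like GET /events?offset=..&limit=..
    public static List<Integer> fetchPage(List<Integer> source, int offset, int pageSize) {
        if (offset >= source.size()) return List.of();
        return source.subList(offset, Math.min(offset + pageSize, source.size()));
    }

    // Pull the full history page by page, pausing between requests to limit load.
    public static List<Integer> backfill(List<Integer> source, int pageSize, long pauseMillis) {
        List<Integer> all = new ArrayList<>();
        int offset = 0;
        while (true) {
            List<Integer> page = fetchPage(source, offset, pageSize);
            if (page.isEmpty()) break;     // source exhausted: backfill complete
            all.addAll(page);
            offset += page.size();
            try {
                Thread.sleep(pauseMillis); // crude throttle protecting the source
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return all;
    }
}
```

The offset here doubles as a checkpoint: persisting it after each page makes the pull resumable after a failure.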

&lt;h2&gt;
  
  
  4. Phase 2: Consuming Historical Data
&lt;/h2&gt;

&lt;p&gt;Getting the data is only half the challenge. The consuming service must be architected to handle rehydration safely.&lt;/p&gt;

&lt;h3&gt;
  
  
  Idempotent Processing is Non-Negotiable
&lt;/h3&gt;

&lt;p&gt;When a service re-processes historical events, it will inevitably encounter data it has already seen. The consumer logic must be &lt;strong&gt;idempotent&lt;/strong&gt;, meaning that processing the same event multiple times produces the same result as processing it once. This is the foundational prerequisite for any safe backfill strategy.&lt;/p&gt;
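A minimal illustration of the idea, tracking processed event IDs in memory (a real service would typically rely on a unique constraint or upsert in its database; the event shape here is made up):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: an idempotent consumer that applies each event at most once.
public class IdempotentConsumer {
    public record Event(String eventId, String accountId, long amount) {}

    private final Set<String> seen = new HashSet<>();           // processed event IDs
    private final Map<String, Long> balances = new HashMap<>(); // materialized state

    // Re-delivering the same event is a no-op, so replays are safe.
    public void process(Event e) {
        if (!seen.add(e.eventId())) return; // already applied: skip duplicate
        balances.merge(e.accountId(), e.amount(), Long::sum);
    }

    public long balance(String accountId) {
        return balances.getOrDefault(accountId, 0L);
    }
}
```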

&lt;h3&gt;
  
  
  Choose Your Consumption Strategy
&lt;/h3&gt;

&lt;h4&gt;
  
  
  A. The Simple Replay
&lt;/h4&gt;

&lt;p&gt;For many use cases, like enriching data for an analytics model or rebuilding a non-critical cache, the strategy is simple. A dedicated consumer reads from the backfill source until it is empty. The job is complete when all historical data has been processed. This approach is perfect for stateless tasks or systems that can afford a brief maintenance window to switch over.&lt;/p&gt;

&lt;h4&gt;
  
  
  B. The Zero-Downtime Migration (The Shadow Pattern)
&lt;/h4&gt;

&lt;p&gt;For critical, stateful services that cannot have downtime, a more sophisticated strategy is required. This strategy rebuilds a system using the &lt;strong&gt;Shadow Migration&lt;/strong&gt; pattern. It's a specific implementation of &lt;strong&gt;&lt;a href="https://martinfowler.com/bliki/ParallelChange.html" rel="noopener noreferrer"&gt;Parallel Change&lt;/a&gt;&lt;/strong&gt;, sometimes called the &lt;strong&gt;&lt;a href="https://www.infoq.com/articles/shadow-table-strategy-data-migration/" rel="noopener noreferrer"&gt;Shadow Table Strategy&lt;/a&gt;&lt;/strong&gt;, where a "shadow" process runs alongside the live service before a final, coordinated cutover.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqmfd8opvg7tkndd3w01.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqmfd8opvg7tkndd3w01.png" alt="Shadow Migration" width="800" height="846"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Run in Parallel&lt;/strong&gt;: A &lt;strong&gt;"shadow" consumer&lt;/strong&gt; reads the entire event history, writing to the new table (&lt;code&gt;v2&lt;/code&gt;). Simultaneously, the existing "live" consumer continues its normal operation, writing only to the old table (&lt;code&gt;v1&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Catch Up&lt;/strong&gt;: The shadow consumer runs until it has processed all historical data and is keeping up with the live topic in near real-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verify Consistency&lt;/strong&gt;: Run validation jobs to ensure data in &lt;code&gt;v2&lt;/code&gt; is consistent with &lt;code&gt;v1&lt;/code&gt;. This critical go/no-go step confirms that the migration is safe to complete.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Execute the Cutover&lt;/strong&gt;: The final switch can be handled in two ways, depending on the system's downtime tolerance.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A. Hard Cutover (Simpler/Faster)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For systems that can tolerate a brief service pause, you can skip dual writes. This involves stopping the live consumer, reconfiguring it to write &lt;strong&gt;only to &lt;code&gt;v2&lt;/code&gt;&lt;/strong&gt;, and restarting it at the same time you repoint the application's reads to &lt;code&gt;v2&lt;/code&gt;. This must be a single, atomic action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B. Dual-Write Cutover (Safer/Zero-Downtime)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For critical systems, reconfigure the live consumer to &lt;strong&gt;write to both &lt;code&gt;v1&lt;/code&gt; and &lt;code&gt;v2&lt;/code&gt;&lt;/strong&gt;. This keeps both tables perfectly in sync, creating a safe, indefinite window to verify &lt;code&gt;v2&lt;/code&gt; under a live load before repointing the application reads at your leisure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Decommission&lt;/strong&gt;: After a period of monitoring the new table, the process is complete. If you used the dual-write method, reconfigure the consumer one last time to write only to &lt;code&gt;v2&lt;/code&gt;. Finally, remove the old &lt;code&gt;v1&lt;/code&gt; table and any legacy code.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To prevent any missed events during the handoff, the &lt;strong&gt;live consumer should rewind its offset&lt;/strong&gt; to slightly before where the shadow consumer finished. This creates a small, intentional overlap of events. For this reason, &lt;strong&gt;idempotent processing&lt;/strong&gt; is absolutely essential, as it allows the system to handle these duplicates gracefully without corrupting data.&lt;/p&gt;
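The handoff can be sketched concretely. Below, the live consumer resumes a few offsets before the point where the shadow consumer stopped, and the intentional overlap is absorbed by offset-level deduplication (the numbers and the dedupe scheme are illustrative):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch: the live consumer rewinds to before the shadow consumer's last
// offset, creating an intentional overlap that idempotent processing absorbs.
public class CutoverRewind {
    private final Set<Integer> appliedOffsets = new HashSet<>();
    private long total = 0;

    public void apply(int offset, long value) {
        if (!appliedOffsets.add(offset)) return; // duplicate from the overlap: skip
        total += value;
    }

    public long total() { return total; }

    public static long handoff(List<Long> values, int shadowStoppedAt, int overlap) {
        CutoverRewind state = new CutoverRewind();
        // Shadow consumer processed offsets [0, shadowStoppedAt).
        for (int o = 0; o < shadowStoppedAt; o++) state.apply(o, values.get(o));
        // Live consumer rewinds and resumes slightly earlier, re-reading a few events.
        for (int o = Math.max(0, shadowStoppedAt - overlap); o < values.size(); o++) {
            state.apply(o, values.get(o));
        }
        return state.total();
    }
}
```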

&lt;h2&gt;
  
  
  5. Optimizing the Backfill with Snapshots
&lt;/h2&gt;

&lt;p&gt;Replaying every event from the beginning of time can be slow. For many use cases, you can accelerate the process by using a &lt;strong&gt;snapshot&lt;/strong&gt;—a precomputed, materialized state of your data.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;State Snapshots:&lt;/strong&gt; A periodically generated full snapshot of an entity's state. Rehydration then involves loading this snapshot and replaying only the events from Kafka that have occurred &lt;em&gt;since&lt;/em&gt; the snapshot was created.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Kafka-Native Snapshots (Log Compaction):&lt;/strong&gt; For services that only need the &lt;em&gt;current state&lt;/em&gt; of an entity, Kafka's &lt;a href="https://docs.confluent.io/kafka/design/log_compaction.html" rel="noopener noreferrer"&gt;log compaction&lt;/a&gt; provides a powerful, built-in solution. A compacted topic retains at least the last known value for each message key. Reading this topic from the beginning provides a consumer with a full, live snapshot of the current state.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In short: use State Snapshots when you need a point-in-time view plus the full event history that followed; use Log Compaction when you only need the latest value for every entity, not their history.&lt;/p&gt;
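The snapshot-plus-delta idea in miniature: `snapshotAt` marks the snapshot's position in the log, and only events after it are replayed (all names here are illustrative):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: rehydrate state from a snapshot, then replay only the events
// that occurred after the snapshot was taken.
public class SnapshotRehydrate {
    public record Event(long offset, String key, long delta) {}

    public static Map<String, Long> rehydrate(Map<String, Long> snapshot,
                                              long snapshotAt,
                                              List<Event> log) {
        Map<String, Long> state = new HashMap<>(snapshot); // start from the snapshot
        for (Event e : log) {
            if (e.offset() > snapshotAt) {                 // replay only the delta
                state.merge(e.key(), e.delta(), Long::sum);
            }
        }
        return state;
    }
}
```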

&lt;h2&gt;
  
  
  6. Execution
&lt;/h2&gt;

&lt;p&gt;A successful backfill requires more than a solid architectural blueprint; it demands disciplined execution. The following operational best practices mitigate risk and help ensure a predictable outcome:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring and Observability:&lt;/strong&gt; A backfill should never be a "black box." Track key metrics like consumer lag, processing throughput, and resource utilization in real-time. This is the only way to detect bottlenecks or failures before they cascade.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience and Failure Handling:&lt;/strong&gt; The process must be resumable. Large backfills can take hours or days, and failures are inevitable. By checkpointing its progress, the job can resume from where it left off, saving significant time and resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Awareness:&lt;/strong&gt; A large-scale data replay can incur significant costs from compute resources (ETL jobs, consumer pods) and cloud data egress. Model these costs beforehand to avoid budget surprises.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Testing:&lt;/strong&gt; Never run a full-scale backfill for the first time in production. Validate the entire process with a small, representative slice of data in a staging environment to catch logical errors and performance issues early.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, a historical data backfill is a planned, two-phase process: source the data, then consume it. When you combine the right architectural patterns with operational best practices, it can be done in a controlled and repeatable manner.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>eventdriven</category>
      <category>distributedsystems</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Data Oriented Programming in Java</title>
      <dc:creator>Nejc Korasa</dc:creator>
      <pubDate>Sun, 20 Apr 2025 14:22:20 +0000</pubDate>
      <link>https://dev.to/nejckorasa/data-oriented-programming-in-java-mkb</link>
      <guid>https://dev.to/nejckorasa/data-oriented-programming-in-java-mkb</guid>
      <description>&lt;ul&gt;
&lt;li&gt;What is Data Oriented Programming?&lt;/li&gt;
&lt;li&gt;Why Consider DOP? The Benefits&lt;/li&gt;
&lt;li&gt;Java's Embrace of Data&lt;/li&gt;
&lt;li&gt;
Textbook Example

&lt;ul&gt;
&lt;li&gt;Introducing New Behavior&lt;/li&gt;
&lt;li&gt;Introducing New Data&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Handling Outcomes with Clarity&lt;/li&gt;

&lt;li&gt;In Conclusion: Clear Benefits of DOP and Modern Java&lt;/li&gt;

&lt;li&gt;References and Further Reading&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Data Oriented Programming?
&lt;/h2&gt;

&lt;p&gt;Data Oriented Programming (DOP) is gaining momentum in the Java ecosystem due to recent language features streamlining its adoption. While conceptually straightforward, DOP offers significant advantages. But what is it?&lt;/p&gt;

&lt;p&gt;How do we build our objects? Where does the state go? Where does the behavior go? OOP encourages us to bundle state and behavior together. But what if we separated them? What if data became the primary focus, with logic completely separated? That simple shift is the central idea of Data Oriented Programming (DOP).&lt;/p&gt;

&lt;p&gt;So instead of emphasizing objects with bundled state and methods, &lt;strong&gt;DOP centers around simple data structures&lt;/strong&gt;. The application's logic and behavior are implemented as independent functions that operate on this data. The data itself is passive; the intelligence lies in the functions. &lt;a href="https://inside.java/2024/05/23/dop-v1-1-introduction/" rel="noopener noreferrer"&gt;Inside Java&lt;/a&gt; defines DOP with the following principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model data immutably and transparently.&lt;/li&gt;
&lt;li&gt;Model the data, the whole data, and nothing but the data.&lt;/li&gt;
&lt;li&gt;Make illegal states unrepresentable.&lt;/li&gt;
&lt;li&gt;Separate operations from data.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Consider DOP? The Benefits
&lt;/h2&gt;

&lt;p&gt;Why might you choose this approach? Here are a few compelling reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simpler and More Readable Code:&lt;/strong&gt; Separating data from behavior leads to clearer data structures and focused functions, making the code easier to understand and follow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Maintainability:&lt;/strong&gt; With simple data structures and distinct logic, modifications are less likely to create ripple effects across the codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Code Optionality and Reduced Coupling:&lt;/strong&gt; Adding new functionality often involves creating new functions rather than modifying existing data structures, leading to less invasive changes and reduced coupling between different parts of the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier Testing:&lt;/strong&gt; Functions operating on plain data with simple inputs and outputs are often easier to test in isolation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Java's Embrace of Data
&lt;/h2&gt;

&lt;p&gt;Modern Java provides excellent tools that make DOP a viable option:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Records:&lt;/strong&gt; Simple, immutable data carriers. Less boilerplate, letting you focus on the data itself.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Point&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Sealed Classes:&lt;/strong&gt; These allow you to restrict the possible subtypes of a class or interface. This is crucial for ensuring you can have exhaustive knowledge of the data you're dealing with.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;sealed&lt;/span&gt; &lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;Shape&lt;/span&gt; &lt;span class="n"&gt;permits&lt;/span&gt; &lt;span class="nc"&gt;Circle&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Rectangle&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Circle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;center&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;radius&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;Shape&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Rectangle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;topLeft&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;bottomRight&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;Shape&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nf"&gt;Triangle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;p3&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;Shape&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Switch and Pattern Matching with Exhaustiveness Checks:&lt;/strong&gt; This is the key piece, the one that closes the loop and delivers the main advantage. The enhanced &lt;code&gt;switch&lt;/code&gt; statement in Java, with its support for pattern matching, works hand in hand with sealed classes. The compiler helps you with exhaustiveness checks, shifting runtime errors to compile time. This is not limited to sealed classes; pure enums work too.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;numOfEdges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nc"&gt;Circle&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nc"&gt;Rectangle&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nc"&gt;Triangle&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Textbook Example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Introducing New Behavior
&lt;/h3&gt;

&lt;p&gt;Consider the &lt;code&gt;Shape&lt;/code&gt; example. In a traditional OOP approach, you might add a &lt;code&gt;getCenter()&lt;/code&gt; method to the &lt;code&gt;Shape&lt;/code&gt; interface and implement it in each concrete shape class. If you later needed to perform a new operation or modify an existing one, you'd likely need to update the &lt;code&gt;Shape&lt;/code&gt; interface and all its implementations, which can lead to tightly coupled code.&lt;/p&gt;

&lt;p&gt;With DOP, we define the data structures and then create separate functions to operate on them. This separation of concerns makes adding new functionality cleaner and less coupled. Here's how the &lt;code&gt;getCenter&lt;/code&gt; function looks in a DOP style:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="nf"&gt;getCenter&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Shape&lt;/span&gt; &lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;switch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nf"&gt;Circle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;center&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;double&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;center&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;center&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;y&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nf"&gt;Rectangle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;topLeft&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;bottomRight&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Point&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topLeft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bottomRight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topLeft&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;bottomRight&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;y&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
            &lt;span class="o"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nf"&gt;Triangle&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Point&lt;/span&gt; &lt;span class="n"&gt;p3&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
            &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Point&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;p3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;x&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
                &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;p2&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;y&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;p3&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;y&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
            &lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;};&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The enhanced &lt;code&gt;switch&lt;/code&gt; statement, combined with sealed classes, ensures that all possible cases are handled at compile time. For those familiar with design patterns, this largely makes the &lt;a href="https://refactoring.guru/design-patterns/visitor" rel="noopener noreferrer"&gt;Visitor Pattern&lt;/a&gt; redundant: the new features let you handle all cases directly in a type-safe and concise manner.&lt;/p&gt;

&lt;h3&gt;
  
  
  Introducing New Data
&lt;/h3&gt;

&lt;p&gt;While DOP simplifies introducing new behavior, it also ensures consistency when introducing new data types. The compiler enforces the implementation of all missing operations, ensuring your code remains consistent and complete. This is one of the most powerful advantages of these new Java features.&lt;/p&gt;

&lt;p&gt;For instance, if you add a new &lt;code&gt;Pentagon&lt;/code&gt; shape, the compiler will flag the switch statement in the &lt;code&gt;getCenter()&lt;/code&gt; method as incomplete, requiring you to implement the logic for the new shape. This compile-time enforcement not only prevents runtime errors but also ensures that your codebase evolves safely and predictably as new data types are added.&lt;/p&gt;

&lt;p&gt;However, it's important to avoid using a default branch in your switch statements. A default branch bypasses the exhaustiveness checks provided by the compiler, which can lead to missed cases and potential bugs.&lt;/p&gt;
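To make this concrete, here is a small self-contained sketch in the same spirit (using an edge-count function rather than `getCenter` to keep it short); deleting the `Pentagon` case from the switch makes the code fail to compile, which is exactly the safety net described above:

```java
// Sketch: adding a new data type forces every exhaustive switch to handle it.
public class Edges {
    public sealed interface Shape permits Circle, Rectangle, Pentagon {}
    public record Circle(double radius) implements Shape {}
    public record Rectangle(double w, double h) implements Shape {}
    public record Pentagon(double side) implements Shape {} // newly added data type

    public static int numOfEdges(Shape shape) {
        return switch (shape) { // no default: the compiler demands the Pentagon case
            case Circle c -> 0;
            case Rectangle r -> 4;
            case Pentagon p -> 5;
        };
    }
}
```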

&lt;h2&gt;
  
  
  Handling Outcomes with Clarity
&lt;/h2&gt;

&lt;p&gt;Data Oriented Programming also lends itself well to scenarios where clear and explicit handling of outcomes is required, such as processing different types of results or managing errors and failures. Consider this example for handling the result of a &lt;code&gt;process&lt;/code&gt; function:&lt;br&gt;
&lt;/p&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="n"&gt;sealed&lt;/span&gt; &lt;span class="kd"&gt;interface&lt;/span&gt; &lt;span class="nc"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nc"&gt;Ok&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
    &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="kd"&gt;implements&lt;/span&gt; &lt;span class="nc"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="nc"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Does some actual processing…&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;operationSuccessful&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Success return value"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nf"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Processing error"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;process&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="nf"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;IllegalStateException&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Processing error: "&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach allows callers to easily handle all possible results using a &lt;code&gt;switch&lt;/code&gt; expression, promoting explicit and type-safe result processing. It eliminates ambiguity about potential return values and encourages explicit error handling. &lt;/p&gt;

&lt;p&gt;See &lt;a href="https://www.infoq.com/articles/data-oriented-programming-java/" rel="noopener noreferrer"&gt;this InfoQ article&lt;/a&gt; for more examples of complex return types and how they can be implemented in DOP style.&lt;/p&gt;

&lt;h2&gt;
  
  
  In Conclusion: Clear Benefits of DOP and Modern Java
&lt;/h2&gt;

&lt;p&gt;By focusing on data and keeping it separate from business logic and processing, Data Oriented Programming combined with modern Java offers some great advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simpler and More Readable Code:&lt;/strong&gt; Easier to understand and follow due to the separation of concerns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved Maintainability:&lt;/strong&gt; Modifications are less likely to have widespread impact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Code Optionality and Reduced Coupling:&lt;/strong&gt; Adding new features is less invasive and reduces dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Easier Testing:&lt;/strong&gt; Functions operating on plain data are more straightforward to test.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Keeps Data Clean and Decoupled from Business Logic.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safer (and Cheaper) to Refactor and Change:&lt;/strong&gt; Minimizing coupling reduces the cost of future changes; as explained in &lt;a href="https://www.oreilly.com/library/view/tidy-first/9781098151232/" rel="noopener noreferrer"&gt;Tidy First? by Kent Beck&lt;/a&gt;, the cost of software is approximately the cost of changing it.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  References and Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://inside.java/2024/05/23/dop-v1-1-introduction/" rel="noopener noreferrer"&gt;Inside Java DOP v1.1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infoq.com/articles/data-oriented-programming-java/" rel="noopener noreferrer"&gt;Data Oriented Programming in Java InfoQ Article&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=UQAw3pvZPCY" rel="noopener noreferrer"&gt;Data-Oriented Programming in Java on YouTube&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=8FRU_aGY4mY" rel="noopener noreferrer"&gt;Data Oriented Programming in Java 21 by Nicolai Parlog on YouTube&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Do You Think?
&lt;/h2&gt;

&lt;p&gt;Have you tried Data Oriented Programming in your Java projects? What challenges or benefits have you experienced? Share your thoughts in the comments or reach out to discuss.&lt;/p&gt;

</description>
      <category>java</category>
      <category>designpatterns</category>
      <category>oop</category>
      <category>dop</category>
    </item>
    <item>
      <title>Idempotent Processing with Kafka</title>
      <dc:creator>Nejc Korasa</dc:creator>
      <pubDate>Sun, 12 Feb 2023 12:00:00 +0000</pubDate>
      <link>https://dev.to/nejckorasa/idempotent-processing-with-kafka-11l5</link>
      <guid>https://dev.to/nejckorasa/idempotent-processing-with-kafka-11l5</guid>
      <description>&lt;ul&gt;
&lt;li&gt;Duplicate Messages are Inevitable&lt;/li&gt;
&lt;li&gt;Understanding the Intricacies of exactly-once semantics in Kafka&lt;/li&gt;
&lt;li&gt;
Achieving Idempotent Processing with Kafka

&lt;ul&gt;
&lt;li&gt;Idempotent Consumer Pattern&lt;/li&gt;
&lt;li&gt;Ordering of Messages&lt;/li&gt;
&lt;li&gt;Retry Handling&lt;/li&gt;
&lt;li&gt;Idempotent Processing and External Side Effects&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

Publishing Output Messages to Kafka and Maintaining Data Consistency

&lt;ul&gt;
&lt;li&gt;The Simplest Solution&lt;/li&gt;
&lt;li&gt;Transactional Outbox Pattern&lt;/li&gt;
&lt;li&gt;Without Transactional Outbox&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;How it compares to Synchronous REST APIs&lt;/li&gt;

&lt;li&gt;Final Thoughts&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Duplicate Messages are Inevitable
&lt;/h2&gt;

&lt;p&gt;Duplicate messages are an inherent aspect of message-based systems and can occur for various reasons. In the context of Kafka, it is essential to ensure that your application is able to handle these duplicates effectively. As a Kafka consumer, there are several scenarios that can lead to the consumption of duplicate messages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There can be an actual duplicate message in the Kafka topic you are consuming from - the consumer reads two distinct messages that should be treated as duplicates.&lt;/li&gt;
&lt;li&gt;You consume the same message more than once due to error scenarios, either in your application or in the communication with a Kafka broker.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To ensure idempotent processing and handle these scenarios, it's important to have a proper strategy to detect and handle duplicate messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Intricacies of exactly-once semantics in Kafka
&lt;/h2&gt;

&lt;p&gt;Kafka offers different message delivery guarantees, or &lt;a href="https://kafka.apache.org/documentation/#semantics" rel="noopener noreferrer"&gt;delivery semantics&lt;/a&gt;, between producers and consumers, namely &lt;em&gt;at-least-once&lt;/em&gt;, &lt;em&gt;at-most-once&lt;/em&gt; and &lt;em&gt;exactly-once&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Exactly-once would seem like an obvious choice to guard against duplicate messages, but it is not that simple and the devil is in the details. Confluent has spent a lot of resources to deliver an exactly-once delivery guarantee, and you can read &lt;a href="https://www.confluent.io/en-gb/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/" rel="noopener noreferrer"&gt;here&lt;/a&gt; about how it works in detail. It requires enabling specific Kafka features (i.e. Idempotent Producer and Kafka Transactions). &lt;/p&gt;

&lt;p&gt;First of all, it is only applicable to an application that consumes a Kafka message, does some processing, and writes a resulting message to a Kafka topic. Exactly-once messaging semantics ensures the &lt;strong&gt;combined&lt;/strong&gt; outcome of multiple steps will happen exactly once. The key word here is combined: a message will be consumed, processed, and resulting messages produced, exactly once. &lt;/p&gt;

&lt;p&gt;Critical points to understand about exactly-once delivery are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;All other actions occurring as part of the processing can still happen multiple times, if the original message is re-consumed&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The guarantee only covers resulting messages from the processing being written exactly once, so downstream transaction-aware consumers will not have to handle duplicates. Hence, each individual action (internal or external) still needs to be processed in an idempotent fashion to ensure true end-to-end exactly-once processing. The application may, for example, need to perform REST calls to other applications or write to the database.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;All participating consumers and producers need to be configured correctly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kafka exactly-once semantics is achieved by enabling &lt;a href="https://www.conduktor.io/kafka/idempotent-kafka-producer" rel="noopener noreferrer"&gt;Kafka Idempotent Producers&lt;/a&gt; and &lt;a href="https://www.confluent.io/en-gb/blog/transactions-apache-kafka/" rel="noopener noreferrer"&gt;Kafka Transactions&lt;/a&gt; in &lt;strong&gt;all&lt;/strong&gt; consumers and producers involved. That includes the upstream producer and downstream consumers from the perspective of your application. If you are using event-driven architecture to implement inter-service communication in your system, it is likely that you will consume messages you don't control or own - a Kafka topic is simply an asynchronous API you consume. The topic and the producer can be owned by another team or a 3rd party. Similarly, you may not control downstream consumers. To add to the first point, outbound messages can still be written to the topic multiple times before being successfully committed; it is the responsibility of any downstream consumers to only read committed messages (i.e. be transaction aware) in order to meet the exactly-once guarantee.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;It comes with a performance impact&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exactly-once delivery comes with a performance overhead. There are simply more steps involved in processing a single Kafka message (e.g. Kafka performs a two-phase commit to support transactions), and that results in lower throughput and increased latency.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, it's often much simpler, and more common, to settle for at-least-once semantics and de-duplicate messages on the consumer side - especially in cases where application processing is either expensive or more involved and consists of other actions (e.g. REST calls and DB writes). It's important to remember there is a transaction boundary gap between a DB transaction and a Kafka transaction; more on that later.&lt;/p&gt;
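&lt;p&gt;For reference, the features mentioned above map to standard kafka-clients configuration keys, sketched below (the &lt;code&gt;transactional.id&lt;/code&gt; value is a hypothetical placeholder):&lt;/p&gt;

```java
import java.util.Properties;

// Sketch of the client configuration behind Kafka's exactly-once semantics.
// Keys are standard kafka-clients settings; values marked as placeholders
// are hypothetical.
class ExactlyOnceConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("enable.idempotence", "true");           // Kafka Idempotent Producer
        props.put("transactional.id", "order-service-tx"); // enables Kafka Transactions (placeholder id)
        props.put("acks", "all");                          // required with idempotence
        return props;
    }

    static Properties consumerProps() {
        Properties props = new Properties();
        props.put("isolation.level", "read_committed");    // transaction-aware: skip uncommitted messages
        props.put("enable.auto.commit", "false");          // offsets are committed within the transaction
        return props;
    }
}
```

&lt;p&gt;Note that every producer and consumer in the chain needs this configuration for the guarantee to hold end to end.&lt;/p&gt;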

&lt;h2&gt;
  
  
  Achieving Idempotent Processing with Kafka
&lt;/h2&gt;

&lt;p&gt;This will depend on the nature of processing, and on the shape of the output. To enable idempotent processing, the trigger for the processing - whether it be a Kafka message or an HTTP request - must carry a unique identifier (i.e. an idempotency key).&lt;/p&gt;

&lt;h3&gt;
  
  
  Idempotent Consumer Pattern
&lt;/h3&gt;

&lt;p&gt;An &lt;a href="https://microservices.io/patterns/communication-style/idempotent-consumer.html" rel="noopener noreferrer"&gt;Idempotent Consumer Pattern&lt;/a&gt; ensures that a Kafka consumer can handle duplicate messages correctly. A consumer can be made idempotent by recording in the database the IDs of the messages it has processed successfully. When processing a message, the consumer can detect and discard duplicates by querying the database. To illustrate that with pseudocode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;kafkaMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isDuplicate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processMessageIdempotently&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;updateAndRecordProcessed&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;kafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;commitOffset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Ordering of Messages
&lt;/h3&gt;

&lt;p&gt;Choosing an appropriate topic key can help to ensure ordering guarantees within the same Kafka partition. For example, if messages are being processed in the context of a customer, using a customer ID as the topic key will ensure that messages for any individual customer will always be processed in the correct order.&lt;/p&gt;
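&lt;p&gt;The idea can be illustrated with a hash-based sketch - the same key always maps to the same partition, and ordering is guaranteed within a partition. Kafka's real default partitioner uses murmur2 hashing; plain &lt;code&gt;hashCode&lt;/code&gt; is used here only to keep the example self-contained:&lt;/p&gt;

```java
// Illustrative key-to-partition mapping: a stable customer ID key means all
// of that customer's messages land on one partition, preserving their order.
// Kafka's actual default partitioner uses murmur2, not String.hashCode.
class KeyPartitioner {
    static int partitionFor(String customerId, int numPartitions) {
        // floorMod keeps the result non-negative even for negative hash codes
        return Math.floorMod(customerId.hashCode(), numPartitions);
    }
}
```

&lt;p&gt;Note that repartitioning a topic changes this mapping, which is one reason partition counts are usually chosen up front.&lt;/p&gt;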

&lt;h3&gt;
  
  
  Retry Handling
&lt;/h3&gt;

&lt;p&gt;Kafka's offset commits can be used to create a "transaction boundary" (not to be confused with Kafka transactions mentioned before) for retrying message processing in case of failure. The same message can then be consumed again until the consumer offset is committed. Retry handling is a complex topic and various strategies can be employed depending on the specific requirements of the application. Confluent has written about &lt;a href="https://www.confluent.io/en-gb/blog/error-handling-patterns-in-kafka/" rel="noopener noreferrer"&gt;Kafka Error Handling Patterns&lt;/a&gt; that can be used to handle retries in a Kafka-based application.&lt;/p&gt;
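&lt;p&gt;An in-process retry boundary can be sketched as follows - a simplified illustration, with retry topics and dead-letter queues (covered in the Confluent article above) as alternatives for non-transient failures:&lt;/p&gt;

```java
import java.util.function.Supplier;

// Hedged sketch of the retry boundary: processing is retried in-process, and
// the consumer offset would only be committed once the action succeeds.
class RetryingHandler {
    static <T> T withRetries(Supplier<T> action, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                last = e; // transient failure: offset stays uncommitted, so try again
            }
        }
        throw last; // attempts exhausted: message is re-delivered or dead-lettered
    }
}
```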

&lt;h3&gt;
  
  
  Idempotent Processing and External Side Effects
&lt;/h3&gt;

&lt;p&gt;As mentioned before, there is no exactly-once guarantee for application processing. All actions occurring as part of the processing, and all external side effects, can still happen multiple times. For example, in case of REST calls to other services, calls themselves need to be idempotent, and the same idempotency key needs to be relayed over to those calls. Similarly, all database writes need to be idempotent.&lt;/p&gt;
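&lt;p&gt;Relaying the idempotency key to a downstream call can be sketched like this (the &lt;code&gt;Idempotency-Key&lt;/code&gt; header name is a common convention, and the endpoint is hypothetical):&lt;/p&gt;

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch of relaying the consumed message's idempotency key to a downstream
// REST call: every retry of the same Kafka message sends the same key, so the
// downstream service can de-duplicate. Endpoint and header name are
// illustrative conventions, not something the article's services prescribe.
class IdempotentCalls {
    static HttpRequest transferRequest(String idempotencyKey) {
        return HttpRequest.newBuilder(URI.create("https://payments.example.com/transfers"))
                .header("Idempotency-Key", idempotencyKey)
                .POST(HttpRequest.BodyPublishers.ofString("{}"))
                .build();
    }
}
```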

&lt;h2&gt;
  
  
  Publishing Output Messages to Kafka and Maintaining Data Consistency
&lt;/h2&gt;

&lt;p&gt;When it comes to publishing messages back to Kafka after processing is complete, the complexity increases. In a microservices architecture, services, along with updating their own local data store, often need to notify other services within the organization of changes that have occurred. This is where event-driven architecture shines, allowing individual services to publish changes as events to a Kafka topic that can be consumed by other services. But how can this be achieved in a way that ensures data consistency and enables idempotent processing?&lt;/p&gt;

&lt;h3&gt;
  
  
  The Simplest Solution
&lt;/h3&gt;

&lt;p&gt;Consuming from Kafka has a built-in retry mechanism. If the processing is naturally idempotent, deterministic, and does not interact with other services (i.e. all its state resides in Kafka), then the solution can be relatively simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;kafkaMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processMessageIdempotently&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;kafkaOutputMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toKafkaOutputMessage&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="n"&gt;kafkaProducer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;produceAndFlush&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaOutputMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;kafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;commitOffset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Consume the message from a Kafka topic.&lt;/li&gt;
&lt;li&gt;Process the message.&lt;/li&gt;
&lt;li&gt;Publish the resulting message to a Kafka topic.&lt;/li&gt;
&lt;li&gt;Commit the consumer offset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach ensures data consistency and enables idempotent processing. It guarantees that at least one published message is produced for every consumed message. &lt;/p&gt;

&lt;p&gt;To ensure at-least-once delivery of published messages, it's also necessary to ensure that the message is &lt;em&gt;actually&lt;/em&gt; sent to the Kafka broker and that the Kafka producer has flushed its outgoing message queue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transactional Outbox Pattern
&lt;/h3&gt;

&lt;p&gt;Another approach is to utilize the &lt;a href="https://microservices.io/patterns/data/transactional-outbox.html" rel="noopener noreferrer"&gt;Transactional Outbox Pattern&lt;/a&gt;, which bridges the gap between the database and Kafka transaction boundaries by recording state changes and outgoing messages atomically within a single database transaction. The reason is that it is not possible to have a single transaction that spans both the application's database and Kafka.&lt;/p&gt;

&lt;p&gt;One possible implementation of this pattern is to have an “&lt;em&gt;outbox&lt;/em&gt;” table and instead of publishing resulting messages directly to Kafka, the messages are written to the outbox table in a compatible format (e.g. &lt;a href="https://www.confluent.io/en-gb/blog/avro-kafka-data/" rel="noopener noreferrer"&gt;Avro&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;kafkaMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;consume&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isDuplicate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processMessageIdempotently&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;transaction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;startTransaction&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;updateAndRecordProcessed&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;writeOutbox&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;transaction&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;commit&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;kafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;commitOffset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, this pattern comes with additional complexity: the message must not only be written to the database but also published to Kafka. This can be implemented with a separate message relay service that continuously polls the database for new outbox messages, publishes them to Kafka, and marks them as processed. This approach has several drawbacks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased load on the database: Frequently polling the database can cause a high level of read traffic, which can lead to increased load on the database and potentially slow down other processes that are trying to access it.&lt;/li&gt;
&lt;li&gt;Latency: Depending on the interval at which the database is polled, there may be a significant delay between when a message is added to the outbox and when it is published to Kafka.&lt;/li&gt;
&lt;li&gt;Scalability: If the number of messages to be published to Kafka increases, the rate of polling will need to be increased, which can further increase the load on the database and make the system less scalable.&lt;/li&gt;
&lt;li&gt;Schema incompatibility issues: If the message schema is incompatible with a destination topic, application processing will succeed, but the poller could be unable to publish a message to Kafka. The risk of this can be minimized by verifying Avro schema with a schema registry before writing to the outbox table.&lt;/li&gt;
&lt;li&gt;Ordering of messages: The poller needs to ensure the order of messages written to the outbox table is retained when publishing to Kafka.&lt;/li&gt;
&lt;li&gt;Missed messages: There is a chance that a message is not picked up by the poller and not published to Kafka.&lt;/li&gt;
&lt;li&gt;Lack of real-time delivery: Messages are not published to Kafka in real time, as publishing depends on the polling interval.&lt;/li&gt;
&lt;/ul&gt;
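&lt;p&gt;The relay loop described above can be sketched in-memory - the queue and list below are stand-ins for the outbox table and the Kafka topic; a real relay would poll the database, publish each row, and mark it processed:&lt;/p&gt;

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal in-memory sketch of an outbox relay. The FIFO queue preserves the
// order in which messages were written to the outbox, addressing the ordering
// concern above; everything Kafka- or database-specific is abstracted away.
class OutboxRelay {
    final Deque<String> outbox = new ArrayDeque<>();  // stands in for the outbox table
    final List<String> published = new ArrayList<>(); // stands in for the Kafka topic

    int relayOnce() {
        int relayed = 0;
        String message;
        while ((message = outbox.poll()) != null) { // FIFO: outbox order retained
            published.add(message);                 // stands in for produceAndFlush(...)
            relayed++;                              // a real relay would now mark the row processed
        }
        return relayed;
    }
}
```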

&lt;p&gt;A better approach is to utilize &lt;a href="https://learn.microsoft.com/en-us/sql/relational-databases/track-changes/about-change-data-capture-sql-server?view=sql-server-ver16" rel="noopener noreferrer"&gt;CDC (change data capture)&lt;/a&gt; if your database supports it. You can use &lt;a href="https://debezium.io" rel="noopener noreferrer"&gt;Debezium&lt;/a&gt; and &lt;a href="https://docs.confluent.io/platform/current/connect/index.html" rel="noopener noreferrer"&gt;Kafka Connect&lt;/a&gt; to integrate CDC with a PostgreSQL database, for example. That way, the database and Kafka stay in sync, and you don't have to deal with the drawbacks of database polling.&lt;/p&gt;

&lt;h3&gt;
  
  
  Without Transactional Outbox
&lt;/h3&gt;

&lt;p&gt;However, even with CDC, this still introduces another component that needs to be managed and monitored - another possible point of failure. In certain situations it is easier to avoid the Transactional Outbox Pattern and handle writes to Kafka within the application. That can be achieved by combining the first simple solution explained above with the Idempotent Consumer Pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;kafkaMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consumeKafkaMessage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaClient&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;(!&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;isDuplicate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;processMessageIdempotently&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;updateAndRecordProcessed&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;readResult&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;kafkaOutputMessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;toKafkaOutputMessage&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;kafkaProducer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;produceAndFlush&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaOutputMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;kafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;commitOffset&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kafkaMessage&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Consume the message from a Kafka topic.&lt;/li&gt;
&lt;li&gt;Consult the database to confirm the message has not been previously processed. 
If it has, read the stored result and proceed to step 5.&lt;/li&gt;
&lt;li&gt;Process the message, taking care to handle any external actions in an idempotent manner.&lt;/li&gt;
&lt;li&gt;Write results to the database and mark the message as successfully processed.&lt;/li&gt;
&lt;li&gt;Publish the resulting message to a Kafka topic.&lt;/li&gt;
&lt;li&gt;Commit the consumer offset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The approach outlined above combines the use of the Idempotent Consumer pattern with direct publishing to Kafka, resulting in a streamlined solution for handling duplicate messages. &lt;/p&gt;

&lt;p&gt;Additionally, by eliminating the need for an intermediate "&lt;em&gt;outbox&lt;/em&gt;" table, this approach reduces the number of components that need to be managed and monitored, resulting in a simpler overall architecture. &lt;/p&gt;

&lt;p&gt;Furthermore, it benefits from reduced latency in message publishing, as it avoids the added step of writing messages to an outbox table before publishing them to Kafka.&lt;/p&gt;

&lt;p&gt;This approach has some downsides to consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It might simplify the overall architecture, but it increases the complexity of processing within the application.&lt;/li&gt;
&lt;li&gt;The addition of a Kafka publish step can cause a performance overhead and prolong overall processing time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How it compares to Synchronous REST APIs
&lt;/h2&gt;

&lt;p&gt;Similarly to the Idempotent Consumer Pattern, in the case of a REST API, received message IDs could also be tracked in a database to handle idempotency. However, there are drawbacks to using a REST call as a trigger for processing, namely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The retry strategy is out of the control of the application, and the caller is responsible for retrying the operation. That makes it more susceptible to failure scenarios and inconsistent states.&lt;/li&gt;
&lt;li&gt;There is no ordering guarantee when responding to HTTP calls, and additional care must be taken to avoid certain race conditions during processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Publishing output messages to Kafka in a way that maintains data consistency can be achieved by using the &lt;a href="https://microservices.io/patterns/data/transactional-outbox.html" rel="noopener noreferrer"&gt;Transactional Outbox Pattern&lt;/a&gt; to atomically update the database and publish a message to Kafka.&lt;/p&gt;
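The core idea of the outbox pattern can be sketched with in-memory lists standing in for the database tables and the Kafka topic. The names are illustrative: the point is that the business write and the outbox write happen in the same transaction, and a separate relay publishes from the outbox.

```java
import java.util.ArrayList;
import java.util.List;

public class OutboxSketch {
    // stand-ins for two tables updated in the same database transaction
    public final List<String> accounts = new ArrayList<>();
    public final List<String> outbox = new ArrayList<>();
    // stand-in for the Kafka topic the relay publishes to
    public final List<String> topic = new ArrayList<>();

    // the business write and the outbox write commit (or roll back) together,
    // so an event is never lost between the database and Kafka
    public void createAccount(String account) {
        accounts.add(account);
        outbox.add("AccountCreated:" + account);
    }

    // a separate relay process polls the outbox table and publishes to Kafka
    public void relay() {
        topic.addAll(outbox);
        outbox.clear();
    }
}
```

The relay may republish an event if it crashes between publishing and clearing the outbox, which is why downstream consumers still need to handle duplicates.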

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Kafka is an ideal platform for implementing idempotent processing in your application, and it offers several key advantages over traditional synchronous processing methods such as REST APIs. Its built-in retry mechanism and ordering guarantees are essential for ensuring idempotence and maintaining data consistency in the presence of failures.&lt;/p&gt;

&lt;p&gt;When it comes to message delivery guarantees, the exactly-once semantics offered by Kafka can be a powerful tool to guard against duplicate messages. However, it's important to understand the intricacies of this feature, the requirements for its implementation, and its limitations. Additionally, the performance impact and complexity of exactly-once semantics should be taken into consideration.&lt;/p&gt;

&lt;p&gt;Achieving idempotent processing requires a thorough understanding of the triggers, actions, and outputs of the processing. Different approaches such as &lt;a href="https://microservices.io/patterns/communication-style/idempotent-consumer.html" rel="noopener noreferrer"&gt;Idempotent Consumer Pattern&lt;/a&gt; and &lt;a href="https://microservices.io/patterns/data/transactional-outbox.html" rel="noopener noreferrer"&gt;Transactional Outbox Pattern&lt;/a&gt; can be used to ensure that messages are processed correctly and that data consistency is maintained. It's important to weigh the complexity and potential drawbacks of each approach before deciding on the best solution for your application. As we have seen, Transactional Outbox is not always necessary.&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>architecture</category>
      <category>eventdriven</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Avoid Tight Coupling of Tests to Implementation Details</title>
      <dc:creator>Nejc Korasa</dc:creator>
      <pubDate>Tue, 10 Jan 2023 13:17:54 +0000</pubDate>
      <link>https://dev.to/nejckorasa/avoid-tight-coupling-of-tests-to-implementation-details-a7</link>
      <guid>https://dev.to/nejckorasa/avoid-tight-coupling-of-tests-to-implementation-details-a7</guid>
      <description>&lt;p&gt;Building backend systems today will likely involve building many small, independent services that communicate and coordinate with one another to form a distributed system. While there are many resources available discussing the pros and cons of microservices, the architecture, and when it is appropriate to use, I want to focus on the functional testing of microservices and how it differs from traditional approaches.&lt;/p&gt;

&lt;p&gt;In my experience, "best testing practices" have evolved with the introduction of microservices, and traditional &lt;em&gt;testing pyramids&lt;/em&gt; may not be the most effective approach in this context - they can even be harmful. Working across various projects and companies, including the development of new digital banks and the migration of older systems to microservices as they scale, I have often encountered disagreements about the most appropriate testing strategies for microservices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do we have tests?
&lt;/h3&gt;

&lt;p&gt;As software engineers, we rely on testing to verify that our code functions as expected. Testing should support refactoring, but it can sometimes make it more difficult. The purpose of testing is to define the intended behavior of the code, rather than the details of its implementation. In summary, tests should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confirm that the code does what it should.&lt;/li&gt;
&lt;li&gt;Provide fast, accurate, reliable, and predictable feedback.&lt;/li&gt;
&lt;li&gt;Make maintenance easier, which is often overlooked when writing tests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Effective testing is crucial for building reliable software, and it is important to keep these goals in mind when writing tests. By focusing on the intended behavior of the code and the needs of maintenance, we can write tests that give us confidence in our code and make the development process more efficient.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common Mistakes
&lt;/h4&gt;

&lt;p&gt;It is not uncommon to come across codebases with a large number of tests and high test coverage percentages, only to find that the code is not truly tested and that refactoring or adding new features is difficult. In my experience, this is often due to the following pitfalls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overreliance on unit tests&lt;/li&gt;
&lt;li&gt;Tight coupling of tests to implementation details&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Avoiding these mistakes is key to writing effective tests that support the development process and ensure the reliability of the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do you need Unit Tests?
&lt;/h3&gt;

&lt;p&gt;One common approach to testing is the belief that all classes, functions, and methods must be tested. This can lead to a large number of unit tests and a high test coverage percentage. However, an excess of unit tests can make it difficult to change the code without also having to modify the tests. This can undermine the confidence in the code and negate the benefits of testing if the tests must be rewritten every time the code is changed.&lt;/p&gt;

&lt;p&gt;In the case of microservices, which are small and independent by definition, it could be argued that the microservice itself is a unit and should be tested as an isolated component through its contracts, or in a black-box fashion. In this sense, the term "unit tests" for microservices can be thought of as implementation detail tests. Instead of focusing on unit tests, it may be more effective to consider the testing of microservices at a higher level, such as through integration tests.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don't couple your tests to Implementation Details
&lt;/h3&gt;


&lt;blockquote class="ltag__twitter-tweet"&gt;

  &lt;div class="ltag__twitter-tweet__main"&gt;
    &lt;div class="ltag__twitter-tweet__header"&gt;
      &lt;img class="ltag__twitter-tweet__profile-image" src="https://res.cloudinary.com/practicaldev/image/fetch/s--LZhImt4T--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/1448561325668454401/u11Xlu8j_normal.jpg" alt="Nejc Korasa profile image"&gt;
      &lt;div class="ltag__twitter-tweet__full-name"&gt;
        Nejc Korasa
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__username"&gt;
        &lt;a class="mentioned-user" href="https://dev.to/nejckorasa"&gt;@nejckorasa&lt;/a&gt;
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__twitter-logo"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ir1kO05j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-f95605061196010f91e64806688390eb1a4dbc9e913682e043eb8b1e06ca484f.svg" alt="twitter logo"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__body"&gt;
      Please don't couple your tests to implementation details. Tests should support refactoring, not make it harder. &lt;a href="https://twitter.com/hashtag/SoftwareEngineering"&gt;#SoftwareEngineering&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/testing"&gt;#testing&lt;/a&gt; &lt;a href="https://t.co/oeHhbXD2Pc"&gt;twitter.com/KentBeck/statu…&lt;/a&gt;
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__date"&gt;
      13:31 - 04 Dec 2022
    &lt;/div&gt;

      &lt;div class="ltag__twitter-tweet__quote"&gt;
        &lt;div class="ltag__twitter-tweet__quote__header"&gt;
          &lt;span class="ltag__twitter-tweet__quote__header__name"&gt;
            Kent Beck 🌻
          &lt;/span&gt;
          @KentBeck
        &lt;/div&gt;
        Tests should be coupled to the behavior of code and decoupled from the structure of code. Seeing tests that fail on both counts.
      &lt;/div&gt;

    &lt;div class="ltag__twitter-tweet__actions"&gt;
      &lt;a href="https://twitter.com/intent/tweet?in_reply_to=1599395920281743361" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fFnoeFxk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-reply-action-238fe0a37991706a6880ed13941c3efd6b371e4aefe288fe8e0db85250708bc4.svg" alt="Twitter reply action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/retweet?tweet_id=1599395920281743361" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k6dcrOn8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-retweet-action-632c83532a4e7de573c5c08dbb090ee18b348b13e2793175fea914827bc42046.svg" alt="Twitter retweet action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/like?tweet_id=1599395920281743361" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SRQc9lOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-like-action-1ea89f4b87c7d37465b0eb78d51fcb7fe6c03a089805d7ea014ba71365be5171.svg" alt="Twitter like action"&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/blockquote&gt;
 

&lt;p&gt;When writing tests, it is important to avoid coupling them to implementation details. This ensures that tests serve as a reliable safety net, allowing you to refactor the internals of your microservice without having to modify the tests. &lt;/p&gt;

&lt;h3&gt;
  
  
  Focus on Integration Tests
&lt;/h3&gt;

&lt;p&gt;To avoid testing implementation details, we should test from the edges of the microservice: examine the inputs and outputs of the service and verify their correctness in an isolated manner, focusing on the interaction points and making them explicit.&lt;/p&gt;

&lt;h4&gt;
  
  
  Define Inputs and Outputs
&lt;/h4&gt;

&lt;p&gt;Look at the entrypoint of the service (e.g. a REST API, Kafka consumer) to define the inputs for your tests and find the corresponding outputs (e.g. HTTP response, published Kafka message). It may be necessary to assert multiple outputs for a single input, as processing an HTTP request could result in a database update, a new Kafka message, and an HTTP response.&lt;/p&gt;
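As an illustration of asserting several outputs for one input, here is a minimal sketch with a hypothetical service and in-memory stand-ins for the database and the output topic. A real integration test would exercise a running service through its actual edges, but the shape of the assertions is the same.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PaymentService {
    // stand-ins for the service's observable edges: database state and output topic
    public final Map<String, Integer> balances = new HashMap<>();
    public final List<String> events = new ArrayList<>();

    // one input (an HTTP-style request) produces three observable outputs:
    // a response, a database update, and a published event
    public String deposit(String account, int amount) {
        balances.merge(account, amount, Integer::sum);
        events.add("Deposited:" + account + ":" + amount);
        return "OK";
    }
}
```

A test for one input then asserts all three outputs - the response, the stored balance, and the emitted event - without ever inspecting how `deposit` is implemented internally.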

&lt;h4&gt;
  
  
  Test the Microservice as an Isolated Component (Unit)
&lt;/h4&gt;

&lt;p&gt;Spin up the microservice and all necessary infrastructure components, such as web servers and databases, and send inputs to verify the outputs. Tools like &lt;a href="https://www.testcontainers.org"&gt;Testcontainers for Java&lt;/a&gt; can help by running the application in a short-lived test mode with dependencies, such as databases and message queues, running in Docker containers.&lt;/p&gt;

&lt;p&gt;By setting up specific infrastructure components in a separate test setup stage, you can isolate them from the actual tests, allowing you to change the underlying infrastructure without modifying the test methods themselves (e.g. migrating the database from PostgreSQL to a NoSQL store).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This approach is similar to &lt;a href="https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)"&gt;hexagonal architecture&lt;/a&gt;, which decouples infrastructure and domain logic, but the testing strategy will differ.&lt;/p&gt;

&lt;p&gt;There is a cost to it as it adds some complexity, but I have seen codebases where the benefits were worth it. Ultimately, the decision of how much complexity to add through isolation should be based on how often you anticipate changing the infrastructure of the service and whether the added complexity is justified.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Clear Definition of Microservice Behavior through Testing
&lt;/h4&gt;

&lt;p&gt;A test suite with a focus on integration tests will likely have fewer tests overall, but they will clearly define the expected behavior of the microservice. When examining the test suite, you should be able to get a clear understanding of what the microservice is intended to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  There is still a place for Implementation Detail Tests
&lt;/h3&gt;

&lt;p&gt;There will be parts of the code that are domain specific and contain only business logic. These naturally isolated parts have an internal complexity of their own, and this is where implementation detail tests should be used: testing all their variations and edge cases through integration tests would be cumbersome and too heavy.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://engineering.atspotify.com/2018/01/testing-of-microservices/"&gt;Spotify: Testing of Microservices&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.testcontainers.org"&gt;Testcontainers for Java&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>testing</category>
      <category>integrationtests</category>
      <category>microservices</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Unzip files in S3 with Java</title>
      <dc:creator>Nejc Korasa</dc:creator>
      <pubDate>Thu, 03 Nov 2022 14:43:37 +0000</pubDate>
      <link>https://dev.to/nejckorasa/stream-unzip-files-in-s3-with-java-16b4</link>
      <guid>https://dev.to/nejckorasa/stream-unzip-files-in-s3-with-java-16b4</guid>
      <description>&lt;p&gt;I've been spending a lot of time with AWS S3 recently building data pipelines and have encountered a surprisingly non-trivial challenge of unzipping files in an S3 bucket. &lt;br&gt;
A few minutes with Google and StackOverflow made it clear many others have faced the same issue.&lt;/p&gt;

&lt;p&gt;I'll explain a few options to handle the unzipping as well as the end solution which has led me to build &lt;a href="https://github.com/nejckorasa/s3-stream-unzip"&gt;nejckorasa/s3-stream-unzip&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To sum up: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there is no built-in support to unzip files in S3 in place,&lt;/li&gt;
&lt;li&gt;there is also no unzip API available in the AWS SDK.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To unzip, you therefore need to download the files from S3, decompress them, and upload the decompressed files back.&lt;/p&gt;

&lt;p&gt;This solution is simple to implement with the &lt;a href="https://aws.amazon.com/sdk-for-java/"&gt;Java AWS SDK&lt;/a&gt;, and it is probably good enough if you are dealing with smaller files - if files are small enough, you can simply keep the decompressed files in memory and upload them back.&lt;/p&gt;

&lt;p&gt;Alternatively, in case of memory constraints, files can be persisted to disk storage. Great, that works.&lt;/p&gt;

&lt;p&gt;Problems arise with larger files. AWS Lambda, for example, &lt;a href="https://aws.amazon.com/lambda/faqs/"&gt;has a 1024MB memory and disk space limit&lt;/a&gt;. A dedicated EC2 instance would solve the disk space issue, but it requires more maintenance. I'd also argue that writing 500MB+ files to disk is not an optimal approach. &lt;br&gt;
That will, of course, depend on how many files need to be unzipped and how frequently the operation runs - it's fine as a one-off, but maybe not if it needs to run daily. In any case, we can do better.&lt;/p&gt;
&lt;h3&gt;
  
  
  Streaming solution
&lt;/h3&gt;

&lt;p&gt;A better approach is to stream the file from S3, downloading it in chunks, unzip those chunks, and upload them back to S3 using multipart upload. That way you completely avoid the need for disk storage, and you can minimize the memory footprint by tuning the download and upload chunk sizes.&lt;/p&gt;

&lt;p&gt;There are 2 parts of this solution that need to be integrated:&lt;/p&gt;
&lt;h4&gt;
  
  
  1) Download and unzip
&lt;/h4&gt;

&lt;p&gt;Streaming S3 objects is natively supported by the AWS SDK: the &lt;code&gt;getObjectContent()&lt;/code&gt; method returns an input stream containing the contents of the S3 object.&lt;/p&gt;

&lt;p&gt;Java provides &lt;a href="https://docs.oracle.com/javase/7/docs/api/java/util/zip/ZipInputStream.html"&gt;ZipInputStream&lt;/a&gt; as an input stream filter for reading files in the ZIP file format. It reads ZIP content entry-by-entry and thus allows custom handling for each entry.&lt;/p&gt;

&lt;p&gt;Streaming object content from S3 and feeding that into &lt;code&gt;ZipInputStream&lt;/code&gt; will give us decompressed chunks of object content we can buffer in memory.&lt;/p&gt;
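A minimal sketch of this idea in plain Java, using an in-memory zip as a stand-in for the S3 object stream. The chunk handling is illustrative - in the real pipeline each decompressed chunk would be buffered for the multipart upload rather than just counted.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class StreamUnzip {
    // read zip entries in fixed-size chunks, as you would before handing
    // each decompressed chunk to a multipart upload
    public static int unzipInChunks(InputStream zipped, int chunkSize) {
        int chunks = 0;
        try (ZipInputStream zis = new ZipInputStream(zipped)) {
            while (zis.getNextEntry() != null) {
                byte[] buf = new byte[chunkSize];
                while (zis.read(buf) != -1) {
                    chunks++; // real pipeline: buffer this chunk for upload
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return chunks;
    }

    // helper building a small in-memory zip, standing in for the stream
    // returned by s3Object.getObjectContent()
    public static InputStream zip(String name, byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.write(data);
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new ByteArrayInputStream(bos.toByteArray());
    }
}
```

Note that only one chunk buffer is live at a time, which is what keeps the memory footprint bounded regardless of the object's size.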
&lt;h4&gt;
  
  
  2) Upload unzipped chunks to S3
&lt;/h4&gt;

&lt;p&gt;Uploading files to S3 is a common task, and the SDK supports several options to choose from, including &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuoverview.html"&gt;multipart upload&lt;/a&gt;. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What is multipart upload?&lt;/p&gt;

&lt;p&gt;Multipart upload allows you to upload a single object as a set of parts. &lt;br&gt;
Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. &lt;br&gt;
If transmission of any part fails, you can retransmit that part without affecting other parts. &lt;/p&gt;

&lt;p&gt;After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. &lt;/p&gt;

&lt;p&gt;In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.&lt;/p&gt;
&lt;/blockquote&gt;
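The buffering side of this can be sketched in plain Java: decompressed chunks accumulate in a part buffer, and a part is "uploaded" whenever the buffer reaches the configured limit. The class is illustrative; real code would issue upload-part requests to S3 via the SDK instead of recording part sizes.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

public class MultipartBuffer {
    private final int partSizeLimit;
    private final ByteArrayOutputStream current = new ByteArrayOutputStream();
    // stand-in for the parts uploaded to S3
    public final List<Integer> uploadedPartSizes = new ArrayList<>();

    public MultipartBuffer(int partSizeLimit) {
        this.partSizeLimit = partSizeLimit;
    }

    // feed decompressed chunks; flush a part whenever the buffer fills up
    public void write(byte[] chunk) {
        current.write(chunk, 0, chunk.length);
        if (current.size() >= partSizeLimit) {
            uploadedPartSizes.add(current.size()); // real code: upload this part
            current.reset();
        }
    }

    // on stream end, upload the remainder and complete the multipart upload
    public void complete() {
        if (current.size() > 0) {
            uploadedPartSizes.add(current.size());
            current.reset();
        }
    }
}
```

Tuning `partSizeLimit` is the trade-off mentioned above: larger parts mean fewer upload requests but a bigger in-memory buffer.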
&lt;h3&gt;
  
  
  nejckorasa/s3-stream-unzip
&lt;/h3&gt;

&lt;p&gt;All that is left to do now is to integrate stream download, unzip, and multipart upload. &lt;br&gt;
I've done all the hard work and built &lt;a href="https://github.com/nejckorasa/s3-stream-unzip"&gt;nejckorasa/s3-stream-unzip&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Java library to manage unzipping of large files and data in AWS S3 without knowing the size beforehand and without keeping it all in memory or writing to disk.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because the size does not need to be known beforehand and nothing is kept fully in memory or written to disk, the library is suitable for large data files - it has been used to unzip files of 100GB+.&lt;/p&gt;

&lt;p&gt;It supports different unzip strategies including an option to split zipped files (suitable for larger files, e.g. csv files). It's lightweight and only requires an AmazonS3 client to run.&lt;/p&gt;

&lt;p&gt;It has a simple API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// initialize AmazonS3 client&lt;/span&gt;
&lt;span class="nc"&gt;AmazonS3&lt;/span&gt; &lt;span class="n"&gt;s3CLient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AmazonS3ClientBuilder&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;standard&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="c1"&gt;// customize the client&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// create UnzipStrategy&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NoSplitUnzipStrategy&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SplitTextUnzipStrategy&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withHeader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withFileBytesLimit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="no"&gt;MB&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// or create UnzipStrategy with additional config&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;S3MultipartUpload&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;Config&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withThreadCount&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withQueueSize&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withAwaitTerminationTimeSeconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCannedAcl&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;CannedAccessControlList&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;BucketOwnerFullControl&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withUploadPartBytesLimit&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="no"&gt;MB&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withCustomizeInitiateUploadRequest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="c1"&gt;// customize request&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
        &lt;span class="o"&gt;});&lt;/span&gt;

&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;strategy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;NoSplitUnzipStrategy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// create S3UnzipManager&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;um&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;S3UnzipManager&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3Client&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="kt"&gt;var&lt;/span&gt; &lt;span class="n"&gt;um&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;S3UnzipManager&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3Client&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;strategy&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;withContentTypes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;of&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"application/zip"&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// unzip options&lt;/span&gt;
&lt;span class="n"&gt;um&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;unzipObjects&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"bucket-name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"input-path"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"output-path"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;um&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;unzipObjectsKeyMatching&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"bucket-name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"input-path"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"output-path"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;".*\\.zip"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;um&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;unzipObjectsKeyContaining&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"bucket-name"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"input-path"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"output-path"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"-part-of-object-"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;um&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;unzipObject&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3Object&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"output-path"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Library is available on &lt;a href="https://search.maven.org/artifact/io.github.nejckorasa/s3-stream-unzip/1.0.1/jar"&gt;Maven Central&lt;/a&gt; and on &lt;a href="https://github.com/nejckorasa/s3-stream-unzip"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can see the original blog post here: &lt;a href="https://nejckorasa.github.io/posts/s3-unzip/"&gt;https://nejckorasa.github.io/posts/s3-unzip/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>java</category>
      <category>aws</category>
      <category>s3</category>
      <category>unzip</category>
    </item>
    <item>
      <title>Open source: Instagram Analyzer</title>
      <dc:creator>Nejc Korasa</dc:creator>
      <pubDate>Tue, 17 Jul 2018 11:51:11 +0000</pubDate>
      <link>https://dev.to/nejckorasa/open-source-instagram-analyzer-1obd</link>
      <guid>https://dev.to/nejckorasa/open-source-instagram-analyzer-1obd</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/nejckorasa/instagram-analyzer"&gt;instagram-analyzer&lt;/a&gt; is an application written in Python that analyzes geotags using reverse geocoding in user's Instagram photos and videos. &lt;/p&gt;

&lt;p&gt;It shows which specific locations, countries and cities you've visited so far, how many times, and which Instagram posts match each location.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I want to hear feedback, good or bad, so please go check it out!&lt;/strong&gt;&lt;br&gt;
Thanks&lt;/p&gt;
&lt;h2&gt;
  
  
  What it does
&lt;/h2&gt;
&lt;h3&gt;
  
  
  📍Store all instagram media data 📷
&lt;/h3&gt;

&lt;p&gt;The application loads all of the user's Instagram media and saves it in JSON format. This data includes all media metadata: likes, location, tagged users, comments, image URLs ...&lt;/p&gt;
&lt;h3&gt;
  
  
  📍Store all instagram location data 📊
&lt;/h3&gt;

&lt;p&gt;The application analyzes geotags and saves locations in JSON format. This data includes the occurrence count for each location as well as image and Instagram media URLs ...&lt;/p&gt;
&lt;h3&gt;
  
  
  📍Store all instagram countries and cities location data
&lt;/h3&gt;

&lt;p&gt;Countries and cities are additionally analyzed using reverse geocoding with &lt;a href="https://locationiq.com"&gt;LocationIQ API&lt;/a&gt;. Data is saved in JSON files.&lt;/p&gt;
&lt;h3&gt;
  
  
  📍Print occurrence counts for locations, countries and cities ✈️
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You have visited 99 different locations
You have visited 7  different countries
You have visited 32 different cities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Print table view of most visited locations, countries and cities 🌍
&lt;/h3&gt;

&lt;p&gt;For example, when executed for &lt;a href="https://www.instagram.com/nejckorasa"&gt;nejckorasa&lt;/a&gt; print for countries looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Countries: 

+------+-----------------+-------------+
| rank | country         | occurrences |
+------+-----------------+-------------+
|  1   | Slovenia        |     51      |
+------+-----------------+-------------+
|  2   | The Netherlands |     12      |
+------+-----------------+-------------+
|  3   | Spain           |      8      |
+------+-----------------+-------------+
|  4   | Poland          |      8      |
+------+-----------------+-------------+
|  5   | Russia          |      7      |
+------+-----------------+-------------+
|  6   | Croatia         |      7      |
+------+-----------------+-------------+
|  7   | Hungary         |      6      |
+------+-----------------+-------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar tables are printed for specific locations and cities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Install
&lt;/h2&gt;

&lt;p&gt;To install instagram-analyzer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;instagram-analyzer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To update instagram-analyzer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ pip install instagram-analyzer --upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;p&gt;Once installed, import it, configure it and run it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;instagram_analyzer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InstaAnalyzer&lt;/span&gt;

&lt;span class="n"&gt;InstaAnalyzer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;insta_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;INSTAGRAM_TOKEN_HERE&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;location_iq_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;LOCATION_IQ_TOKEN_HERE&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before you run it, see &lt;a href="https://github.com/nejckorasa/instagram-analyzer/blob/master/README.md#configuration--options"&gt;Configuration &amp;amp; Options&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration &amp;amp; Options
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Acquire Tokens
&lt;/h3&gt;

&lt;h5&gt;
  
  
  Acquire Instagram Access Token
&lt;/h5&gt;

&lt;p&gt;Go to &lt;a href="http://instagram.pixelunion.net/"&gt;Pixelunion&lt;/a&gt;, generate token, don't forget the token!&lt;/p&gt;

&lt;h5&gt;
  
  
  Acquire Location IQ Access Token
&lt;/h5&gt;

&lt;p&gt;Go to &lt;a href="https://locationiq.com/"&gt;Location IQ&lt;/a&gt;, sign up, get the token, don't forget the token!&lt;/p&gt;

&lt;h3&gt;
  
  
  Configure and run
&lt;/h3&gt;

&lt;p&gt;Create an &lt;code&gt;InstaAnalyzer&lt;/code&gt; instance with your token values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InstaAnalyzer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;insta_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;INSTAGRAM_TOKEN_HERE&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;location_iq_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;LOCATION_IQ_TOKEN_HERE&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_media_from_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Once Instagram media data is stored in JSON, you can read it from there instead of loading it again via the Instagram API (the API is limited to 200 requests per hour). Set &lt;code&gt;analyzer.read_media_from_file = True&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
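&lt;p&gt;The caching pattern behind &lt;code&gt;read_media_from_file&lt;/code&gt; can be sketched in plain Python. This is an illustration, not the library's actual code; &lt;code&gt;fetch_from_api&lt;/code&gt; and &lt;code&gt;CACHE_FILE&lt;/code&gt; are hypothetical stand-ins:&lt;/p&gt;

```python
import json
import os

CACHE_FILE = "insta_media.json"  # hypothetical cache file name

def fetch_from_api():
    # Stand-in for the real Instagram API call (limited to 200 requests/hour)
    return [{"id": "<post_id>", "link": "https://www.instagram.com/p/example/"}]

def load_media(read_media_from_file=True):
    # Reuse the cached JSON when available, to avoid hitting the API again
    if read_media_from_file and os.path.exists(CACHE_FILE):
        with open(CACHE_FILE) as f:
            return json.load(f)
    # Otherwise fetch fresh data and cache it for the next run
    media = fetch_from_api()
    with open(CACHE_FILE, "w") as f:
        json.dump(media, f)
    return media
```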

&lt;h3&gt;
  
  
  Options
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;location_iq_token&lt;/code&gt; is optional. If not set, only basic location analysis will be run and saved to file.&lt;/li&gt;
&lt;li&gt;Once &lt;code&gt;InstaAnalyzer&lt;/code&gt; has been run, all data is available to access:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Configure InstaAnalyzer
&lt;/span&gt;&lt;span class="n"&gt;analyzer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InstaAnalyzer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;insta_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;INSTAGRAM_TOKEN_HERE&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;location_iq_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'&amp;lt;LOCATION_IQ_TOKEN_HERE&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Run InstaAnalyzer    
&lt;/span&gt;&lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Access cities, countries and location data
&lt;/span&gt;&lt;span class="n"&gt;cities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cities&lt;/span&gt;
&lt;span class="n"&gt;countires&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;countires&lt;/span&gt;
&lt;span class="n"&gt;locations&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locations&lt;/span&gt;

&lt;span class="c1"&gt;# Access instagram media data
&lt;/span&gt;&lt;span class="n"&gt;instagram_media&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;insta_media_data&lt;/span&gt;

&lt;span class="c1"&gt;# Print locations later
&lt;/span&gt;&lt;span class="n"&gt;analyzer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;print_locations&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stored data examples
&lt;/h2&gt;

&lt;p&gt;When executed for &lt;a href="https://www.instagram.com/nejckorasa"&gt;nejckorasa&lt;/a&gt;, the data for one country item (Spain) looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"Spain"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"media_items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/e7705068da5e289f5e44c0c396c08f74/5BD54C95/t51.2885-15/sh0.08/e35/p640x640/36149213_609452269436842_8766778259800064000_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/Bkh3-KfgxL9/"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/2b239894a363f6bbe93d604ab2cdfa8a/5BE953CD/t51.2885-15/sh0.08/e35/p640x640/33941046_171665143683479_8766885676932136960_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/Bj7Uj56gxBs/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/9d7003f674af9ca05accf9961df893a6/5BE28FDA/t51.2885-15/sh0.08/e35/p640x640/33120615_197967877520708_8731075699906969600_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/Bjmp-6bAYus/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/1e7ca79fc44823ff3ef8b24e6dd55e61/5BD1E8C3/t51.2885-15/sh0.08/e35/p640x640/33608474_597094857325212_724188974242856960_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/BjR_9lpAqpc/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/1b046c05b1cbe9708f57f5e591b68d1c/5BD8E039/t51.2885-15/sh0.08/e35/p640x640/32947036_172314443452529_4611639929133334528_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/BjNEIwiA6Py/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/5ac0e05fb60700cba4c41d6d1216eb5b/5BC8A9DB/t51.2885-15/e15/10802615_318814311644936_1896556761_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/vdWuHBkwuY/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/40620d8f5e7e01a546e2b958d18bd42a/5BE9E99F/t51.2885-15/e15/10784835_319487204924131_388050040_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/vYybQyEwiA/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/b733c0bdf312ee5c21bb3fd6148e6221/5BE263EA/t51.2885-15/e15/10802986_691193854310946_2042620114_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/vc9ZFakwrq/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/875bff08c310444273eae90a67e525dd/5BC8F29F/t51.2885-15/e15/928044_671144066338855_1666493611_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/vaWbQLEwqX/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, &lt;code&gt;&amp;lt;post_id&amp;gt;&lt;/code&gt; will be an actual post ID.&lt;/p&gt;

&lt;p&gt;Data for cities is almost the same. For specific locations, one location item looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"236678869"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"latitude"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;45.7925&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"longitude"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;15.1647&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Novo Mesto"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;236678869&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"media_items"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/6941d16b164ec488dd3a303004344f78/5BE40DE8/t51.2885-15/sh0.08/e35/p640x640/31270267_1592482480868234_8257495365851283456_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/Bij24yzAdHB/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/3189c0f2e5931f47b4506046ff26afff/5BDB6109/t51.2885-15/e15/10724200_1496985983889525_746072573_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/uDDPHekwtW/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/fbf31b5c410c9036ce43862012249d02/5BEC3F36/t51.2885-15/e15/10488704_250740985124191_1862853011_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/q94LWMkwlk/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;post_id&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://scontent.cdninstagram.com/vp/27c6681709c7b71fc86d8477c11d2b88/5BCAD041/t51.2885-15/e15/10013254_641464529259998_1091484863_n.jpg?efg=eyJ1cmxnZW4iOiJ1cmxnZW5fZnJvbV9pZyJ9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"link"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.instagram.com/p/mKDvsikwsC/"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"city"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Novo mesto"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"additional_data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"place_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"113385772"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"licence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;00a9 LocationIQ.org CC BY 4.0, Data &lt;/span&gt;&lt;span class="se"&gt;\u&lt;/span&gt;&lt;span class="s2"&gt;00a9 OpenStreetMap contributors, ODbL 1.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"osm_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"way"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"osm_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"167321715"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"lat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"45.7897769"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"lon"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"15.1680662"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"display_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Krka, Novo mesto, Jugovzhodna Slovenija, 8000, Slovenia"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"suburb"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Krka"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"town"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Novo mesto"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"state_district"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Jugovzhodna Slovenija"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"postcode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"country"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Slovenia"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"country_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"si"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"boundingbox"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"45.7858017"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"45.7927137"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"15.1640388"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"15.1725268"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;additional_data&lt;/code&gt; field; it is populated using the &lt;a href="https://locationiq.com"&gt;Location IQ API&lt;/a&gt;.&lt;/p&gt;
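&lt;p&gt;Once this data is on disk, summarising it takes only a few lines of plain Python. A minimal sketch, assuming the countries data has the shape shown above (the inline &lt;code&gt;countries_json&lt;/code&gt; string is a trimmed-down stand-in for the stored file):&lt;/p&gt;

```python
import json

# A trimmed-down version of the stored countries JSON shown above
countries_json = """
{
  "Spain": {"count": 8, "media_items": [{"id": "<post_id>", "link": "https://www.instagram.com/p/Bkh3-KfgxL9/"}]},
  "Slovenia": {"count": 12, "media_items": []}
}
"""

countries = json.loads(countries_json)

# Rank countries by how many posts were taken there
ranking = sorted(countries.items(), key=lambda kv: kv[1]["count"], reverse=True)
for name, data in ranking:
    print(f"{name}: {data['count']} posts")
```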

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Why does it take so long to load additional location data?
&lt;/h4&gt;

&lt;p&gt;For reverse geocoding, the Location IQ API is used. The free version of that API is rate limited to 1 request per second, which is why loading the additional data takes roughly &lt;code&gt;&amp;lt;different_location_count&amp;gt;&lt;/code&gt; seconds.&lt;/p&gt;
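&lt;p&gt;The 1 request/second limit is straightforward to respect client-side. Here is a minimal throttle sketch, not the library's actual implementation; &lt;code&gt;reverse_geocode&lt;/code&gt; is a hypothetical stand-in for the real Location IQ call:&lt;/p&gt;

```python
import time

def reverse_geocode(lat, lon):
    # Hypothetical stand-in for a Location IQ reverse-geocoding request
    return {"lat": lat, "lon": lon}

def geocode_all(coords, min_interval=1.0):
    # Space requests at least min_interval seconds apart to respect
    # the free tier's 1 request/second limit
    results = []
    last_call = 0.0
    for lat, lon in coords:
        wait = min_interval - (time.monotonic() - last_call)
        if wait > 0:
            time.sleep(wait)
        last_call = time.monotonic()
        results.append(reverse_geocode(lat, lon))
    return results
```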

&lt;h2&gt;
  
  
  Go check it out, leave feedback 🙏
&lt;/h2&gt;

&lt;p&gt;Here's a link to GitHub: &lt;a href="https://github.com/nejckorasa/instagram-analyzer"&gt;instagram-analyzer&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>showdev</category>
      <category>opensource</category>
      <category>python</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
