Kafka is often described as a message broker, an event bus, or a streaming platform. Those descriptions are not wrong, but they can be misleading if they make it sound like Kafka is just another queue with more knobs.
The most useful way to understand Kafka is this: Kafka is a durable, distributed log that many producers can append to and many consumers can read from at their own pace. A lot of its design starts making sense once you focus on the log part.
If you have only built request-response systems, Kafka can feel unfamiliar at first. In a typical HTTP setup, one service calls another service directly and waits for an answer. That works well for many problems, but it creates tight coupling in time and availability. The caller has to know who to call, the callee has to be up, and the two sides need to agree on how fast the interaction should happen.
Kafka solves a different kind of problem. It helps when systems need to exchange facts about things that happened, without every producer and consumer needing to talk to each other directly.
The problem Kafka solves
Imagine an e-commerce system. An order is placed. Several things may need to happen next: inventory should update, payment should be processed, an email should be sent, analytics should record the purchase, fraud checks may run, and a warehouse system may start fulfillment.
You could wire all of that directly into the order service. But then the order service becomes responsible for too much. It needs to know about every downstream system, handle their failures, deal with latency, and evolve whenever a new consumer appears.
Kafka changes the shape of the integration. Instead of the order service calling everything else, it publishes an event like OrderPlaced. Other systems subscribe to that stream and react independently.
That gives you looser coupling in a few important ways. Producers do not need to know which consumers exist. Consumers can be added later without changing producers. Consumers can fall behind temporarily and catch up later. Events can also be replayed, which is a major difference from many queue-based systems.
That last point matters. In a queue, the main model is often “work to be done.” In Kafka, the model is closer to “a history of facts.” Consumers read that history and derive their own view of the world from it.
The log abstraction
The log abstraction is the core idea.
A log is an ordered sequence of records that can only be appended to. You do not normally insert into the middle. You do not update old entries in place. You keep adding new records to the end.
Kafka stores records in topics, and each topic is split into partitions. A partition is an append-only ordered log. Every record in a partition gets an offset, which is basically its position in that log.
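To make the offset idea concrete, here is a minimal sketch of a partition as an append-only list. This is an illustration of the concept, not the Kafka API: the offset of a record is simply its position in the log, assigned at append time.

```python
# Sketch of a partition: an append-only, ordered log.
# Offsets are just positions; nothing is ever inserted or updated in place.

class Partition:
    def __init__(self):
        self.records = []

    def append(self, record):
        offset = len(self.records)   # the next position in the log
        self.records.append(record)
        return offset

    def read(self, offset):
        return self.records[offset]

p = Partition()
assert p.append("OrderPlaced:1") == 0
assert p.append("OrderPlaced:2") == 1
assert p.read(0) == "OrderPlaced:1"  # old records remain readable
```

Everything else in this section builds on this picture: reading does not remove anything, and any past offset can be read again.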
This is a very different mental model from a traditional queue. In many queue systems, once a message is consumed, it is gone. In Kafka, records stay in the log for a configured retention period (or, in compacted topics, until a newer record with the same key supersedes them). Consumption does not remove the record. A consumer tracks its own position, usually by storing offsets.
That means two consumers can read the same topic independently. One consumer might be at offset 1000, another at 4000. That is normal. Kafka is not asking “has this message been consumed?” in a global sense. It is asking “where is each consumer in the log?”
This is what makes replay possible. If a bug in your analytics service caused bad calculations for the last two days, you can fix the code and replay from an earlier offset. That is a powerful capability, and it is one of the reasons Kafka shows up in data pipelines, event-driven systems, and audit-style architectures.
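The "each consumer keeps its own position" model, including replay, can be sketched in a few lines. Again, this is a conceptual illustration rather than real client code: each consumer's offset lives in a simple dict, and rewinding one consumer has no effect on the other.

```python
# Sketch: two consumers read the same log at their own pace.
# Rewinding one consumer's offset is replay; the other is unaffected.

log = ["e0", "e1", "e2", "e3", "e4"]
positions = {"analytics": 0, "email": 0}

def poll(consumer):
    offset = positions[consumer]
    if offset >= len(log):
        return None          # caught up with the end of the log
    positions[consumer] = offset + 1
    return log[offset]

for _ in range(5):
    poll("email")            # email service reads everything
for _ in range(2):
    poll("analytics")        # analytics is further behind

assert positions == {"analytics": 2, "email": 5}

# Fix the analytics bug, then replay from the beginning:
positions["analytics"] = 0
assert poll("analytics") == "e0"
```

Real consumers commit offsets back to Kafka (or elsewhere), but the shape of the model is the same: position is per consumer, not per record.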
Core architecture
At a high level, Kafka has producers, brokers, topics, partitions, and consumers.
Producers write records to topics. A record usually has a key, a value, a timestamp, and some metadata. The key is especially important because it often determines which partition the record goes to. If you use a customer ID as the key, all records for that customer can be routed to the same partition, preserving order for that key.
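Key-based routing boils down to hashing the key and taking it modulo the partition count. The sketch below uses CRC32 for simplicity; Kafka's default partitioner uses murmur2, but the principle is identical: the same key deterministically maps to the same partition.

```python
# Sketch of key-based partition routing (hash then modulo).
# Kafka's default partitioner uses murmur2; CRC32 is used here only
# to keep the example dependency-free.

import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    return zlib.crc32(key) % num_partitions

# The same key always routes to the same partition:
assert partition_for(b"customer-42", 6) == partition_for(b"customer-42", 6)
```

One consequence worth knowing: changing the number of partitions changes where keys land, which is why partition counts are usually chosen carefully up front.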
Brokers are the Kafka servers that store data and serve reads and writes. A Kafka cluster has multiple brokers, and topic partitions are distributed across them.
Topics are logical streams of records. You can think of a topic as a named feed such as orders, payments, or user-signups.
Partitions are where ordering and scalability meet. Ordering in Kafka is guaranteed within a partition, not across an entire topic. This is an important detail. If a topic has six partitions, Kafka can scale reads and writes across them, but there is no single total order across all six. There is only an order inside each partition.
Consumers read records from topics. Consumers are usually organized into consumer groups. Within a consumer group, each partition is assigned to one consumer instance at a time. That lets a group scale out horizontally. If a topic has eight partitions, up to eight consumers in the same group can process in parallel. If you run more than eight, some will sit idle.
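The partition-to-consumer assignment within a group can be sketched as a simple round-robin (Kafka ships several assignment strategies; round-robin is just one illustrative choice). Note what happens when there are more consumers than partitions: the extras sit idle.

```python
# Sketch of partition assignment within a consumer group (round-robin).
# Each partition goes to exactly one consumer in the group.

def assign(partitions, consumers):
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 8 partitions across 3 consumers: work is spread out
a = assign(list(range(8)), ["c1", "c2", "c3"])
assert a["c1"] == [0, 3, 6]
assert a["c2"] == [1, 4, 7]

# 2 partitions across 3 consumers: one consumer gets nothing
b = assign(list(range(2)), ["c1", "c2", "c3"])
assert b["c3"] == []   # idle consumer
```

This is why the partition count caps the useful parallelism of a consumer group.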
This is one of Kafka’s central tradeoffs. Partitions give you throughput and parallelism, but they also define your ordering boundaries.
Why partitioning matters
Partitioning is not just a storage detail. It affects correctness.
Suppose you are processing bank account events and order matters per account. If AccountDebited and AccountCredited for the same account can land in different partitions, consumers may see them in different relative orders. That can break assumptions.
The usual answer is to choose a partition key that matches your ordering needs. If order matters per account, key by account ID. If order matters per tenant, key by tenant ID. You are deciding what unit of ordering you care about.
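The effect of keying by account ID can be seen directly in a sketch: because both events carry the same key, they land in the same partition and their relative order survives, even though other accounts' events are scattered elsewhere. (Hashing and partition count here are illustrative choices, not Kafka's actual defaults.)

```python
# Sketch: keying by account ID keeps each account's events in one
# partition, preserving per-account order across a partitioned topic.

import zlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]

events = [
    ("acct-1", "AccountDebited"),
    ("acct-2", "AccountDebited"),
    ("acct-1", "AccountCredited"),
]

for key, event in events:
    p = zlib.crc32(key.encode()) % NUM_PARTITIONS
    partitions[p].append((key, event))

# Both acct-1 events sit in the same partition, in publish order:
p1 = zlib.crc32(b"acct-1") % NUM_PARTITIONS
acct1_events = [e for k, e in partitions[p1] if k == "acct-1"]
assert acct1_events == ["AccountDebited", "AccountCredited"]
```

Had the key been something unrelated (say, a random request ID), the debit and credit could have landed in different partitions, and no consumer could rely on seeing them in order.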
This is one reason Kafka design discussions often turn into domain discussions. Picking a key is really about deciding which events belong to the same ordered stream.
Delivery semantics and duplicates
Kafka often gets described with phrases like at-most-once, at-least-once, and exactly-once. These are worth understanding without overcomplicating them.
At-most-once means records may be lost but are never redelivered. At-least-once means records are not lost in normal operation, but duplicates can happen. Exactly-once is the strongest and most complicated model; in practice it depends on more than Kafka alone, requiring cooperation from producers and consumers (and often idempotent or transactional processing end to end).
For most engineers, the important point is this: assume duplicates are possible unless you have designed very carefully around them. Consumer logic should often be idempotent, meaning processing the same event twice should not produce a bad result.
That advice applies even outside Kafka, but Kafka makes it especially relevant because retries and rebalances are normal parts of distributed systems.
Kafka is not just async HTTP
One mistake is to treat Kafka as if it were just asynchronous RPC. That misses the point.
If service A publishes an event only because it expects service B to act immediately in a request-like chain, you have preserved much of the same coupling, just with more moving parts. Kafka is most useful when the event is meaningful on its own. “Order placed” is a fact. “Please call this specific service later” is usually not.
That distinction matters because events should represent something that happened in the domain, not just a transport mechanism between services.
Schemas, compatibility, and evolution
Once multiple systems rely on the same event stream, schemas matter. If one team changes the shape of an event carelessly, other teams can break.
That is why Kafka setups often include schema discipline, whether through JSON conventions, Avro, Protobuf, or a schema registry. The exact tooling varies, but the principle is the same: event formats are contracts, and those contracts evolve over time.
Experienced engineers usually already know this lesson from APIs. Kafka makes it even more important because consumers may lag behind, replay old data, or be owned by different teams.
When Kafka is a good fit
Kafka fits well when you have multiple consumers for the same stream of events, when replay matters, when systems need to be decoupled in time, or when throughput is high enough that a distributed log is worth the operational cost.
It is especially useful for audit trails, integration between services, event-driven workflows, change-data-capture pipelines, and stream processing.
It is not always the right answer. If one service just needs to call another and get a response, plain HTTP may be simpler. If you only need a small background job queue, a simpler queue may be enough. Kafka adds real operational and conceptual complexity, so it is worth using when its specific strengths matter.
A practical mental model
If you are new to Kafka, the best mental model is not “queue” but “shared history.”
Producers append facts to a durable log. Consumers read that history and build their own outcomes from it. Offsets let each consumer decide where it is in the stream. Partitions let the system scale, but they also define where ordering exists. Consumer groups let you distribute work across instances. Schemas keep the shared contract stable enough for many systems to coexist.
Once you see Kafka as a distributed log rather than a fancy mailbox, the rest becomes easier to reason about. It is a system for recording streams of events durably and letting many independent consumers make use of them without every integration turning into a web of direct calls.
That is the problem Kafka solves, and that is why the log abstraction is the center of the whole design.