Tayfun Yalcinkaya

Posted on Jun 4

Part 2: Event Pipeline Design: The Real-Time Data Lifecycle from Kafka to Lakehouse

#kafka #architecture #streaming #dataengineering

In the first article, we covered the core concepts of Event Driven Architecture, topic/channel design on Kafka, the difference between events and commands, schema contracts, and producer-consumer relationships. In this article, we go one step further and look at the lifecycle of an event inside the platform.

In EDA design, producing events is only part of the challenge. The harder part is making those events reliable, traceable, reprocessable, enrichable, and usable by different consumers.

In this article, we will focus on these questions:

What happens when a raw event arrives in the platform?
How is an event validated, enriched, and made consumable?
How should raw, validated, enriched, and curated topics be positioned?
How can this structure be related to the Medallion approach in modern lakehouse architectures?
When do DLQ and alert topics come into play?
How should replay, idempotency, monitoring, security, and governance be designed?

What Is an Event Pipeline?

In EDA architectures, especially in data platform projects, events usually pass through a lifecycle.

This lifecycle can be modeled like this:

raw -> validated -> enriched -> curated
                  |             |
                  v             v
                 dlq          alert

This structure allows the data flow to mature step by step.

The raw topic carries the original event from the source. The validated topic contains events that passed schema and basic quality checks. The enriched topic contains events improved with reference data or other data sources. The curated topic represents trusted, normalized, and business-ready events for consumers.

The Relationship Between Event Pipeline and Medallion Architecture

This structure has a natural similarity with the Medallion approach often used in modern lakehouse architectures.

In the lakehouse world, the Bronze layer represents raw data, the Silver layer represents cleaned and enriched data, and the Gold layer represents business-ready data products. Kafka raw, validated, enriched, and curated topics apply a similar maturity logic to data in motion.

However, there is an important difference. On the Kafka side, these stages work as event channels. On the lakehouse side, they become persistent, queryable, and governed datasets.

We can think about it like this:

Kafka raw topic       -> Lakehouse bronze table
Kafka enriched topic  -> Lakehouse silver table
Kafka curated topic   -> Lakehouse gold table

This is not a one-to-one or mandatory mapping. Not every topic must become a separate lakehouse table. But from an architectural point of view, both approaches aim to move data from a raw state to a trusted and consumable state.

A more accurate distinction is:

Data in motion: Kafka topics
Data at rest: Lakehouse tables

Kafka provides real-time flow, consumer independence, fan-out, low latency, and replay capability. The lakehouse provides persistent storage, historical analysis, SQL access, BI/ML consumption, data governance, and analytical data products.

Raw Topic

The raw topic is the first channel where the original event from the source is written.

raw.payment.transaction

At this stage, the data is stored as it comes from the source. The format may be incorrect, fields may be missing, or the message may not fully match the expected schema. The raw layer is valuable for replay, audit, and troubleshooting.

The raw topic can also be the source for the bronze layer in the lakehouse. This creates a link between the streaming event flow and the analytical raw data archive.

Validated Topic

The validated topic contains events that passed schema and basic data quality checks.

validated.payment.transaction

Typical checks at this stage include:

Are required fields present?
Does the event match the schema?
Are date, number, and currency formats correct?
Is there an event ID?
Is the event time valid?
Is duplicate detection required?

A validation service consumes the raw topic and writes successful events to the validated topic. Failed events can be written to a DLQ topic.

raw.payment.transaction
        |
        v
validation-service
        |
        +--> validated.payment.transaction
        +--> dlq.payment.transaction.validation

The validated topic prevents downstream processes from working directly with raw and unreliable data.
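
The routing decision made by the validation service can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the required field list, the `validate` and `route_raw_event` functions, and the simplified checks are all hypothetical, and a real service would typically validate against a registered schema and publish through a Kafka client.

```python
from datetime import datetime

# Hypothetical required fields; a real pipeline would derive these from the schema contract.
REQUIRED_FIELDS = ("event_id", "transaction_id", "customer_id", "amount", "currency", "event_time")

VALIDATED_TOPIC = "validated.payment.transaction"
DLQ_TOPIC = "dlq.payment.transaction.validation"


def validate(event: dict) -> list:
    """Return a list of validation errors; an empty list means the event is valid."""
    errors = [f"{field} is missing" for field in REQUIRED_FIELDS if field not in event]
    if "amount" in event:
        amount = event["amount"]
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            errors.append("amount is not a number")
    if "event_time" in event:
        try:
            datetime.fromisoformat(str(event["event_time"]))
        except ValueError:
            errors.append("event_time is not a valid ISO 8601 timestamp")
    return errors


def route_raw_event(event: dict) -> tuple:
    """Pick the output topic for a raw event based on the validation result."""
    errors = validate(event)
    return (DLQ_TOPIC if errors else VALIDATED_TOPIC, errors)
```

Keeping the decision in a pure function like this makes the validation rules testable without a running broker. The caller is then responsible for actually producing the event to the returned topic.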

Enriched Topic

The enriched topic contains the validated event after additional information is added.

enriched.payment.transaction

For example, the raw event may contain only customer ID and transaction details:

{
  "transaction_id": "T1001",
  "customer_id": "C789",
  "merchant_id": "M456",
  "amount": 950,
  "currency": "TRY"
}

After enrichment, the event may include customer segment, merchant category, country, risk score, or location information:

{
  "transaction_id": "T1001",
  "customer_id": "C789",
  "customer_segment": "premium",
  "merchant_id": "M456",
  "merchant_category": "electronics",
  "country": "TR",
  "amount": 950,
  "currency": "TRY",
  "risk_score": 42
}

At this stage, the consumer is no longer only moving data. It may join with other sources, perform lookups, or match reference data.

From a lakehouse perspective, enriched events are often meaningful inputs for the silver layer. The data is no longer raw. It has passed certain quality checks and has become more useful for analytics.
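
A lookup-based enrichment step for the example event could be sketched like this. The in-memory dictionaries and the `enrich` function are assumptions made for illustration. A real enrichment job would more likely read reference data from a cache, a state store, or a compacted topic, and the risk score would come from a separate scoring component that is not modeled here.

```python
# Hypothetical in-memory reference data standing in for real lookup sources.
CUSTOMER_SEGMENTS = {"C789": "premium"}
MERCHANTS = {"M456": {"merchant_category": "electronics", "country": "TR"}}


def enrich(event: dict) -> dict:
    """Return a new event with reference data added; the input event is not modified."""
    enriched = dict(event)
    # Unknown keys fall back to "unknown" so that a missing lookup does not stop the flow.
    enriched["customer_segment"] = CUSTOMER_SEGMENTS.get(event["customer_id"], "unknown")
    merchant = MERCHANTS.get(event["merchant_id"], {})
    enriched["merchant_category"] = merchant.get("merchant_category", "unknown")
    enriched["country"] = merchant.get("country", "unknown")
    return enriched


validated_event = {
    "transaction_id": "T1001",
    "customer_id": "C789",
    "merchant_id": "M456",
    "amount": 950,
    "currency": "TRY",
}
enriched_event = enrich(validated_event)
```

Whether a failed lookup should fall back to a default value, as in this sketch, or send the event to an enrichment DLQ is a design decision that depends on how downstream consumers use the added fields.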

Curated Topic

The curated topic contains trusted, normalized, and business-ready events for consumers.

curated.payment.transaction

The curated layer is the level where most downstream consumers should connect. BI, dashboards, data lake ingestion, machine learning feature pipelines, notification services, and operational analytics usually consume curated events.

On the lakehouse side, curated events can feed gold-layer data products. For example, operational dashboards, customer segment analysis, fraud KPIs, or payment success rate metrics can be created from curated events.

Alert Topic

The alert topic is the channel where action-oriented events are published, such as alarms, fraud alerts, threshold breaches, or anomalies.

alert.payment.fraud
alert.machine.temperature
alert.network.anomaly

The alert topic does not always have to be the final step of a linear pipeline. It can be produced by a separate stream processing job consuming from validated, enriched, or curated topics.

For example:

enriched.payment.transaction
        |
        v
fraud-detection-service
        |
        +--> alert.payment.fraud

Alert topics are designed for operational reaction. For this reason, retention, consumer SLA, retry behavior, and notification integration should be considered separately.
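
As a rough sketch of the fraud-detection-service shown above, the function below turns an enriched event into an alert event when a risk threshold is crossed. The threshold value, the alert fields, and the `detect_fraud` name are hypothetical and chosen only for illustration. Real fraud detection would use far richer rules or models.

```python
ALERT_TOPIC = "alert.payment.fraud"
RISK_THRESHOLD = 80  # hypothetical threshold, not taken from any real rule set


def detect_fraud(event: dict):
    """Return an alert event for a high-risk transaction, or None when no alert is needed."""
    score = event.get("risk_score", 0)
    if score < RISK_THRESHOLD:
        return None
    return {
        "alert_type": "FRAUD_SUSPECTED",
        "target_topic": ALERT_TOPIC,
        "transaction_id": event["transaction_id"],
        "risk_score": score,
        "threshold": RISK_THRESHOLD,
    }
```

The alert carries the threshold that triggered it, which helps operators understand why the alert fired when the rule changes later.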

DLQ Topic

DLQ means Dead Letter Queue. It is the channel where failed or unprocessable events are sent.

dlq.payment.transaction.validation
dlq.payment.transaction.enrichment

A DLQ event should not only contain the failed payload. It should also contain the reason for the failure.

{
  "original_topic": "raw.payment.transaction",
  "original_partition": 3,
  "original_offset": 982133,
  "error_type": "VALIDATION_ERROR",
  "error_message": "customer_id is missing",
  "failed_at": "2026-05-13T12:30:00Z",
  "payload": {
    "transaction_id": "T1001",
    "amount": 950
  }
}

This approach prevents failed records from being lost and makes later correction, investigation, or replay possible.
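
A helper that builds the DLQ envelope shown above might look like the following sketch. The function name `build_dlq_event` and its parameters are illustrative assumptions. The field names mirror the example in this article.

```python
from datetime import datetime, timezone


def build_dlq_event(payload, original_topic, partition, offset,
                    error_type, error_message, failed_at=None):
    """Wrap a failed payload with the context needed for investigation and replay."""
    # failed_at is expected to be timezone-aware; it defaults to the current UTC time.
    moment = (failed_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "original_topic": original_topic,
        "original_partition": partition,
        "original_offset": offset,
        "error_type": error_type,
        "error_message": error_message,
        "failed_at": moment.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "payload": payload,
    }
```

Recording the original topic, partition, and offset lets an operator locate the exact source record later, which is what makes targeted correction and replay practical.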

In EDA projects without a DLQ design, failed events either disappear silently or create continuous retries on the consumer side, slowing down the whole pipeline.

Does Every Stage Have to Be a Separate Topic?

No. Raw, validated, enriched, and curated stages do not always need to be physical Kafka topics.

There are three common approaches.

1. Stage-Based Pipeline

Each stage is modeled as a separate topic.

raw -> validated -> enriched -> curated

This model is strong for observability, replay, audit, and separation of team responsibilities. However, it increases the number of topics and the operational complexity.

2. Single Processor, Multi-Output

A single stream processing job reads the raw topic, performs validation, enrichment, and curation, and writes only to result topics.

raw.payment.transaction
        |
        v
stream-processing-job
        |
        +--> curated.payment.transaction
        +--> alert.payment.fraud
        +--> dlq.payment.transaction

This model creates fewer topics and lower latency. However, debugging and replay may be harder because intermediate stages are not visible.

3. Branching Pipeline

Multiple consumers read from the same topic for different purposes.

validated.payment.transaction
        |
        +--> enrichment-service
        +--> audit-sink-service
        +--> fraud-service
        +--> realtime-dashboard-service

This is one of the strongest parts of EDA. The same event can be used by multiple independent consumers for different purposes.

How Kafka and Lakehouse Work Together

In a well-designed EDA architecture, Kafka and the lakehouse are not alternatives to each other. They complement each other.

Kafka is used for:

Event transport.
Low-latency processing.
Consumer fan-out.
Operational reaction.
Replay.
Loose coupling between systems.

The lakehouse is used for:

Persistent storage.
Historical analysis.
SQL analytics.
BI and dashboards.
Machine learning feature generation.
Governance, lineage, and data products.

An example flow may look like this:

raw.payment.transaction
        |
        +--> bronze.payment_transaction_raw
        |
        v
validation-service
        |
        +--> validated.payment.transaction
        +--> dlq.payment.transaction
        |
        v
enrichment-service
        |
        +--> enriched.payment.transaction
        +--> silver.payment_transaction_enriched
        |
        v
curation-service
        |
        +--> curated.payment.transaction
        +--> gold.payment_transaction_analytics
        +--> alert.payment.fraud

In this model, Kafka manages data in motion, while the lakehouse manages data at rest.

The Same Scenario from a Pipeline Perspective: From Streaming to Archive and Analytics

In the first article, we looked at the anonymized energy distribution field example from a channel-based topic design perspective. Now let’s look at the same scenario from an event pipeline perspective.

In this architecture, XML-based meter and activity messages from different sources first arrived in a secure Kafka ingress layer. Because of security and operational isolation needs, data from external sources was not taken directly into the internal platform. Instead, a controlled ingress approach was preferred. In such architectures, separating the external or semi-external data landing layer from the internal event backbone provides important benefits for security and operational management.

In the next step, the stream processing layer read messages from Kafka, parsed the XML content, interpreted the messages, and routed them to target systems. For operational processes, data was written to a relational database. For raw and historical analysis needs, a separate big data/archive layer was fed. As a result, the same event flow served both daily operational needs and historical query and analytics scenarios.

This structure is a good field example of the relationship between EDA and lakehouse thinking. On the Kafka side, flowing events provided real-time transport, separation, and consumer independence. On the archive and analytics side, raw data storage, historical querying, anomaly detection, and forecasting scenarios were supported. If we use modern lakehouse terminology, raw meter messages correspond to a bronze-like archive layer, parsed and interpreted data corresponds to a silver-like analytics layer, and operational dashboards or reporting outputs correspond to gold-like consumption layers.

Kafka raw meter topics
        |
        +--> Raw archive / bronze-like layer
        |
        v
stream processing / parsing / enrichment
        |
        +--> Operational database
        +--> Search and historical query layer
        +--> Analytics and forecasting layer
        +--> Dashboard and alert mechanisms

In this scenario, Kafka retention was also a critical part of the architecture. When a target system slowed down, became temporarily unavailable, or underwent maintenance, Kafka acted as a controlled durability layer that prevented data loss, provided the outage stayed within the retention period. Consumers could continue reading from their last committed offset when the system became healthy again. This shows why retention and replay design in EDA is not only a technical setting, but also a business continuity decision.

Operational visibility was as important as data transport. The platform tracked message counts per distribution company and data type over the last seconds and hours, checked whether data arrived at the expected frequency, and raised alerts on flow interruptions. This made the platform not only functional, but also observable.

The main lesson from this example is this: EDA is not only about systems writing events to Kafka. A real enterprise data flow architecture appears when secure data ingress, topic design, stream processing, retention, replay, operational monitoring, archiving, and analytics layers are designed together.

In such a flow, XML messages that cannot be parsed, missing meter information, unexpected format changes, or target system write errors should not stop the main pipeline. For this reason, DLQ or quarantine-style error flows should not be treated as later improvements. In large-scale architectures like this, they should be considered a natural part of the design.

Why Consumer Idempotency Matters

In Kafka-based systems, we should always assume that consumers may process the same event more than once. Duplicate processing can happen because of network issues, retries, offset commit problems, or application restarts.

For this reason, consumers should be designed to be idempotent.

This means that processing the same event more than once should produce the same final result as processing it once.

For example, if a refund event arrives twice, the customer should not receive two refunds. To prevent this, duplicate checks should be performed using event_id, transaction_id, or a business key.

Idempotency is especially critical for these consumers:

Consumers performing financial operations.
Services sending notifications.
Sinks writing to operational databases.
Systems producing alerts.
Jobs performing upsert or merge operations on lakehouse tables.
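
A duplicate check keyed on event_id can be sketched as below. The class name and the in-memory set are assumptions made for illustration. In a real consumer the processed keys would live in a durable store, and the side effect and the key would ideally be committed together, for example in the same database transaction.

```python
class IdempotentRefundConsumer:
    """Toy consumer that applies each refund event at most once, keyed on event_id."""

    def __init__(self):
        # In-memory stand-in for a durable deduplication store.
        self._processed_ids = set()
        self.refunds = []

    def handle(self, event: dict) -> bool:
        """Apply the refund and return True, or return False if the event was already seen."""
        key = event["event_id"]
        if key in self._processed_ids:
            return False
        self.refunds.append((event["transaction_id"], event["amount"]))
        self._processed_ids.add(key)
        return True


consumer = IdempotentRefundConsumer()
refund = {"event_id": "E-1", "transaction_id": "T1001", "amount": 950}
first = consumer.handle(refund)
second = consumer.handle(refund)  # redelivery of the same event
```

In this sketch the second call is ignored, so the redelivered event leaves the list of refunds unchanged.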

Replay Design

One of the strengths of EDA is that events can be retained in Kafka for a certain time and read again when needed.

Replay may be needed in these situations:

A new consumer wants to process past events from the beginning.
An incorrect enrichment logic is fixed and events must be processed again.
Missing events need to be reloaded into the data lake.
A machine learning feature pipeline must be rebuilt.
Records in the DLQ are corrected and sent back to the flow.

Replay is powerful, but it also carries risk. During replay, downstream systems must not create duplicate records, and irreversible actions such as financial transactions must not be triggered again.

Replay design should answer these questions from the beginning:

Is Kafka retention long enough?
Are raw events archived in the lakehouse bronze layer?
Are consumers idempotent?
Will replay be done with a separate consumer group?
Will replayed events trigger operational actions again?
Which topics are considered safe for replay?
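
Two of these questions, the separate consumer group and the suppression of operational actions, can be illustrated with a toy model. `InMemoryTopic` is a hypothetical stand-in for a single Kafka partition and is not a Kafka API. With a real cluster, a new consumer group combined with the consumer setting `auto.offset.reset=earliest` gives similar "start from the beginning" behavior.

```python
class InMemoryTopic:
    """Append-only log with an independent committed offset per consumer group."""

    def __init__(self):
        self._log = []
        self._committed = {}

    def append(self, event: dict) -> None:
        self._log.append(event)

    def poll(self, group: str, max_records: int = 100) -> list:
        """Return the next records for a group and advance only that group's offset."""
        start = self._committed.get(group, 0)
        records = self._log[start:start + max_records]
        self._committed[group] = start + len(records)
        return records

    def committed(self, group: str) -> int:
        return self._committed.get(group, 0)


def process(records: list, notifications: list, side_effects_enabled: bool) -> dict:
    """Rebuild per-customer totals; send notifications only when side effects are enabled."""
    totals = {}
    for record in records:
        totals[record["customer_id"]] = totals.get(record["customer_id"], 0) + record["amount"]
        if side_effects_enabled:
            notifications.append(record["transaction_id"])
    return totals
```

Because each group tracks its own offset, a replay group can read the full history without moving the live group's position. The `side_effects_enabled` flag is one simple way to rebuild derived state while keeping notifications and other irreversible actions switched off.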

Monitoring and Operations

Operational visibility must be part of EDA design.

Key metrics to monitor include:

Topic throughput.
Producer error rate.
Consumer lag.
Consumer processing time.
Failed event count.
DLQ event count.
Partition skew.
Broker disk usage.
Replication status.
End-to-end latency.

It is not enough to know that the Kafka cluster is running. We must also know how long it takes for an event to move from the source to the target consumer, and where it is waiting in the pipeline.

To achieve this level of visibility, monitoring should not depend only on infrastructure metrics. The event itself should also be traceable. For this reason, standard fields should be carried either in the event payload or in the event metadata.

The following fields become critical for tracking the journey of an event inside the platform:

event_id          -> Unique identifier of the event
event_type        -> Type of the event
event_time        -> Time when the event was created in the source system
source_system     -> Source system that produced the event
correlation_id    -> Identifier used to connect events from the same business flow
schema_version    -> Version of the event schema
processing_stage  -> Current stage of the event inside the pipeline

With these fields, we can track not only whether an event was written to a Kafka topic, but also which stages it passed through inside the platform.
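
One way to carry these fields consistently is a small immutable metadata object that each stage copies and updates. The sketch below is an assumption about how this could be modeled in Python. The class name, the stage list, and the `advance_stage` helper are illustrative and not part of any standard.

```python
from dataclasses import dataclass, replace

PIPELINE_STAGES = ("raw", "validated", "enriched", "curated")


@dataclass(frozen=True)
class EventMetadata:
    event_id: str
    event_type: str
    event_time: str
    source_system: str
    correlation_id: str
    schema_version: str
    processing_stage: str = "raw"


def advance_stage(meta: EventMetadata) -> EventMetadata:
    """Return a copy of the metadata moved to the next pipeline stage."""
    index = PIPELINE_STAGES.index(meta.processing_stage)
    if index == len(PIPELINE_STAGES) - 1:
        raise ValueError("event is already at the final stage")
    return replace(meta, processing_stage=PIPELINE_STAGES[index + 1])
```

Making the object immutable means a stage cannot silently change the event_id or correlation_id, which are the fields that tracking depends on.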

For example, different timestamps can be captured for the same event:

event_time                -> Time when the event was created in the source system
ingestion_time            -> Time when the event was first written to Kafka
validation_time           -> Time when the event passed the validation stage
enrichment_time           -> Time when the event passed the enrichment stage
curation_time             -> Time when the event was written to the curated topic
downstream_delivery_time  -> Time when the event reached the target system

This approach makes it possible to measure end-to-end latency more accurately. It also helps identify whether the delay happened in the source system, Kafka topic, stream processing job, downstream sink, or lakehouse write layer.
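
Given those timestamps, stage latency and end-to-end latency reduce to simple differences. The following sketch assumes the timestamps are ISO 8601 strings with an explicit UTC offset. The `stage_latencies` function is a hypothetical helper, not part of any monitoring product.

```python
from datetime import datetime

STAGE_ORDER = (
    "event_time",
    "ingestion_time",
    "validation_time",
    "enrichment_time",
    "curation_time",
    "downstream_delivery_time",
)


def stage_latencies(timestamps: dict) -> dict:
    """Return seconds between consecutive recorded timestamps, plus the end-to-end total."""
    # Stages without a recorded timestamp are skipped rather than treated as errors.
    present = [
        (name, datetime.fromisoformat(timestamps[name]))
        for name in STAGE_ORDER
        if name in timestamps
    ]
    latencies = {}
    for (prev_name, prev_ts), (name, ts) in zip(present, present[1:]):
        latencies[f"{prev_name} -> {name}"] = (ts - prev_ts).total_seconds()
    if len(present) >= 2:
        latencies["end_to_end"] = (present[-1][1] - present[0][1]).total_seconds()
    return latencies
```

Note that event_time comes from the source system clock while the other timestamps come from platform components, so clock skew between systems can distort the first interval.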

In practice, monitoring design should be handled at three levels:

1. Platform level
   Broker health
   Disk usage
   Replication status
   Partition distribution

2. Flow level
   Topic throughput
   Consumer lag
   Failed event count
   DLQ count

3. Event lifecycle level
   End-to-end latency
   Stage latency
   Correlation ID based tracking
   Event freshness

Managing this level of visibility only with manual scripts, separate dashboards, or scattered alert rules can become difficult over time. Especially in enterprise environments, platforms with advanced monitoring, observability, alerting, and governance capabilities for Kafka operations can reduce this operational burden.

For example, enterprise Kafka distributions such as Cloudera Kafka, Confluent Platform, or similar Kafka platforms can provide a more centralized approach for topic-level monitoring, consumer lag tracking, cluster health checks, capacity planning, security integration, audit, and operational alert management.

The goal is not only to collect metrics, but to make Kafka-based event flows operationally manageable. In EDA, monitoring is not just about seeing a few charts on a dashboard. It is a continuous validation mechanism that confirms whether the data flow is healthy, secure, traceable, and aligned with SLAs.

Alert thresholds should also be defined from the beginning for every critical pipeline. For example, an automatic alert should be produced when data does not arrive from a specific source within the expected time, consumer lag exceeds a defined threshold, DLQ count rises above normal behavior, or end-to-end latency violates the SLA.
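
The threshold checks described above can be expressed as a small rule evaluation. The metric names and limits below are hypothetical values chosen for illustration. Real thresholds should come from the SLA of each pipeline.

```python
# Hypothetical thresholds for one critical pipeline.
THRESHOLDS = {
    "seconds_since_last_event": 300,
    "consumer_lag": 10_000,
    "dlq_events_per_minute": 50,
    "end_to_end_latency_seconds": 5.0,
}


def breached_thresholds(metrics: dict, thresholds: dict) -> list:
    """Return the sorted names of metrics that exceed their limit or are missing entirely."""
    breached = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        # A missing metric is reported too, because silence can hide a broken flow.
        if value is None or value > limit:
            breached.append(name)
    return sorted(breached)
```

Treating a missing metric as a breach is a deliberate choice in this sketch: a pipeline that stops reporting is often the one that has stopped working.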

For this reason, a good EDA monitoring design should answer not only the question “Is Kafka running?”, but also the following questions:

Did the event arrive at the expected time?
Did the event pass through the correct stage?
At which stage was it delayed?
Which consumer is falling behind?
Which topic has an increasing error rate?
Did the DLQ count move outside normal behavior?
Did the curated event reach the downstream consumer on time?

Especially in a stage-based pipeline, latency and error metrics should be monitored for each stage:

raw -> validated latency
validated -> enriched latency
enriched -> curated latency
curated -> downstream latency

Security and Governance

In enterprise EDA designs, security and governance should not be added later.

Important questions include:

Who can write to which topic?
Who can read from which topic?
Does the event contain sensitive data?
Is masking or tokenization required?
Is event lineage tracked?
Which consumer reads which data?
Are audit logs available?
Is retention aligned with regulations?
Are lakehouse table permissions consistent with topic permissions?
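
For the masking and tokenization question, one common technique is keyed hashing, which replaces a sensitive value with a stable token so that joins still work downstream. The sketch below uses HMAC-SHA256 from the Python standard library. The function names, the choice of customer_id as the sensitive field, and the inline secret are assumptions for illustration only. A real deployment would load the key from a secret manager.

```python
import hashlib
import hmac


def tokenize(value: str, secret: bytes) -> str:
    """Return a deterministic token for a value using HMAC-SHA256."""
    return hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_event(event: dict, secret: bytes, sensitive_fields=("customer_id",)) -> dict:
    """Return a copy of the event with the sensitive fields replaced by tokens."""
    masked = dict(event)
    for field in sensitive_fields:
        if field in masked:
            masked[field] = tokenize(str(masked[field]), secret)
    return masked
```

Because the token is deterministic for a given key, the same customer maps to the same token across events. This preserves analytical usefulness, but it also means the key must be protected as carefully as the original data.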

EDA is not only a technical streaming project. It is a platform approach that requires data governance, security, and operational discipline.

On the Kafka side, topic-level access control, schema governance, and audit are important. On the lakehouse side, table-level policies, lineage, data catalog, and data classification become critical.

Where Should You Start with EDA?

When starting with EDA, the first goal should not be making the whole organization event-driven. A better approach is to choose a pilot use case with clear business value.

Good starting use cases include:

Fraud detection.
Real-time transaction monitoring.
IoT sensor monitoring.
Customer activity tracking.
Payment event streaming.
Security log analytics.
Real-time operational dashboard.

The first phase should produce these outputs:

Event domain model.
Topic naming standard.
Schema standard.
Producer and consumer contracts.
Raw, validated, enriched, curated approach.
DLQ and retry strategy.
Lakehouse bronze/silver/gold mapping principles.
Monitoring approach.
Security and access model.
Replay strategy.

Example Reference Architecture

A simple EDA + lakehouse reference architecture can look like this:

Source Systems
      |
      v
Source system adapters / Custom Producers
      |
      v
Kafka Event Channels
      |
      +--> Stream Processing
      |        |
      |        +--> Validated / Enriched / Curated Topics
      |        +--> Alert Topics
      |        +--> DLQ Topics
      |
      +--> Lakehouse Bronze / Silver / Gold
      +--> Operational Database
      +--> BI / Dashboard
      +--> Machine Learning Pipelines
      +--> Notification Services

Here, Kafka provides the event transport backbone. Source system adapters or custom producer applications move data into Kafka event channels. The stream processing layer validates, enriches, routes, and produces alerts from events. The lakehouse is positioned as the persistent analytical data layer.

Conclusion of the Second Article

In Event Driven Architecture, producing events is not enough. How events mature inside the platform, how they are validated, how they are enriched, how failed records are handled, and how they are safely provided to downstream consumers are as important as the event model itself.

The raw, validated, enriched, and curated approach allows streaming data to mature in a controlled way inside the platform. This structure shares a similar way of thinking with the Bronze, Silver, and Gold layers in modern lakehouse architectures. On the Kafka side, these stages appear as event channels. On the lakehouse side, they become persistent and queryable datasets.

A well-designed EDA + lakehouse architecture gives organizations two important capabilities at the same time: real-time action and reliable historical analytics.

If it is poorly designed, Kafka topics, stream processing jobs, lakehouse tables, DLQs, and consumers can become a complex system that is hard to trace.

For this reason, EDA projects should not start only by creating topics. They should start by designing the event lifecycle, the lakehouse relationship, the replay strategy, the monitoring approach, and the governance model together.

DEV Community