ayoabass777

Posted on Apr 18 • Edited on Apr 30

# Building a Streaming Session Analytics Pipeline with Kafka, Postgres, and dbt

#dataengineering #kafka #dbt #postgres

How I built an end-to-end clickstream pipeline with exactly-once delivery guarantees

When I set out to build Pulse, I had a specific goal: demonstrate that I could work with streaming data, not just batch. My first portfolio project (Ballistics) was a batch pipeline — API calls on a schedule, Airflow orchestration, daily refreshes. That's the bread and butter of most data engineering work, but it's only half the picture.

Pulse is the other half. Real-time events flowing through Kafka, landing in Postgres, transformed by dbt into session analytics. Same dbt layer, completely different ingestion paradigm.

What I Built

Pulse is a session analytics pipeline that processes clickstream events in real-time:

Event Simulator → Kafka → Python Consumer → Postgres → dbt → Metabase

The simulator generates realistic user behavior — page views, product views, add-to-cart events, checkouts, and payments. These flow through Kafka, get written to Postgres with exactly-once semantics, and dbt transforms them into:

Session metrics — duration, bounce rate, landing pages
Funnel analysis — step-by-step conversion from awareness to purchase
User engagement — DAU/WAU/MAU with stickiness ratios

Architecture

┌─────────────┐     ┌──────────┐     ┌──────────────┐     ┌──────────┐     ┌───────────┐
│   Event     │     │  Kafka   │     │   Python     │     │ Postgres │     │ Metabase  │
│  Simulator  │────▶│ (KRaft)  │────▶│  Consumer    │────▶│   raw    │────▶│ Dashboard │
│  (producer) │     │          │     │              │     │  events  │     │           │
└─────────────┘     └──────────┘     └──────────────┘     └────┬─────┘     └───────────┘
                         │                                     │
                    ┌────▼─────┐                          ┌────▼─────┐
                    │   DLQ    │                          │   dbt    │
                    │  topic   │                          │ models   │
                    └──────────┘                          └──────────┘

Everything runs locally in Docker — Kafka in KRaft mode (no Zookeeper), Postgres, and Metabase. The producer and consumer are Python scripts.

The Hard Part: Exactly-Once Delivery

The most interesting engineering challenge was achieving exactly-once semantics end-to-end. This required two separate mechanisms working together:

Layer 1: Idempotent Producer (`enable.idempotence=True`)

When the producer sends a message and the network times out, it doesn't know if Kafka received it. So it retries. Without idempotence, you'd get duplicate messages in the topic.

The idempotent producer solves this by tagging each message with a sequence number. If Kafka already has that sequence, it silently drops the retry. Duplicates never enter the topic.

Layer 2: Idempotent Consumer Key (`ON CONFLICT DO NOTHING`)

Even with an idempotent producer, the consumer can still create duplicates. Here's how:

Consumer reads message from Kafka
Consumer writes row to Postgres ✓
Consumer crashes before committing offset to Kafka
Consumer restarts, replays from last committed offset
Consumer writes the same row again ← duplicate

The fix is an idempotent key. Every event gets a unique event_id derived from user_id + kafka_timestamp_ms. The Postgres table has a primary key constraint on this field, and every insert uses ON CONFLICT (event_id) DO NOTHING.

When the replay happens, Postgres silently rejects the duplicate. No error, no data corruption.

INSERT INTO raw.events (...) VALUES (...)
ON CONFLICT (event_id) DO NOTHING;

These two mechanisms are not the same thing. The producer idempotence prevents duplicates in Kafka. The consumer idempotent key prevents duplicates in Postgres. You need both for end-to-end exactly-once.

I verified this works by stopping the consumer mid-stream, resetting the Kafka offset to the beginning, and replaying all messages. Zero duplicates in Postgres.

Two Timestamps, Two Purposes

Every event carries two timestamps:

Timestamp	Source	Purpose
`event_timestamp`	Producer (business time)	When did the user act? Used for session ordering.
`kafka_timestamp`	Kafka broker (ingestion time)	When did we receive it? Used for freshness checks.

Session reconstruction uses event_timestamp because that's the truth of user behavior. kafka_timestamp is for operational concerns — "is data flowing?" and "how stale is the latest batch?"

This distinction matters because events can arrive out of order. A user might click at 10:00:01, but network latency means Kafka receives it at 10:00:03. If you sessionize on ingestion time, you get wrong session boundaries.

Session Reconstruction in SQL

The sessionization logic uses a 30-minute inactivity gap — industry standard for web analytics. If a user is idle for more than 30 minutes, the next event starts a new session.

-- Flag new sessions based on time gap
CASE
  WHEN prev_event_timestamp IS NULL THEN 1  -- first event
  WHEN event_timestamp - prev_event_timestamp > INTERVAL '30 minutes' THEN 1
  ELSE 0
END AS is_new_session

Then a running sum of those flags gives each event its session number:

SUM(is_new_session) OVER (
  PARTITION BY user_id
  ORDER BY event_timestamp
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_number

I chose to do this in dbt (batch) rather than a stream processor (Flink/Spark Streaming) deliberately. The session definition is still evolving — maybe 30 minutes becomes 20, maybe we add page-specific rules. SQL is testable, rerunnable, and version-controlled. Once the rules stabilize, I can move to stream processing if latency requires it.

Error Handling: DLQ for Non-Transient Errors Only

Not all errors are equal:

Error Type	Example	Action
Transient	Connection timeout, deadlock	Don't commit offset → Kafka replays automatically
Non-transient	Missing required field, bad data type	Route to DLQ topic → commit offset to unblock

Transient errors fix themselves. Just let Kafka replay. Non-transient errors need human attention, so they go to a dead letter queue where someone can inspect and decide what to do.

The key insight: DLQ is for errors you can't retry your way out of. If you DLQ transient errors, you're throwing away free retries.

Funnel Analysis

The funnel tracks users through a five-step purchase journey:

Page View (0) → Product View (1) → Add to Cart (2) → Checkout (3) → Payment (4)

Each event carries a funnel_step_index from the producer. dbt aggregates this into daily conversion rates:

-- What % of users who viewed a product added it to cart?
ROUND(100.0 * users_step_2 / NULLIF(users_step_1, 0), 2) AS cvr_product_to_cart

DAU/WAU/MAU with Stickiness

User engagement uses rolling windows:

DAU: Distinct users today
WAU: Distinct users in the last 7 days
MAU: Distinct users in the last 30 days
Stickiness: DAU / MAU — how often do monthly users come back daily?

A stickiness of 25% means a quarter of your monthly users are daily actives. Consumer apps aim for 20%+. Below 10% suggests users try the product once and ghost.

Late Data Detection

In streaming, events can arrive out of order. A user clicks at 10:00:00, but network lag means Kafka receives it at 10:06:00. That's "late" data.

Pulse flags these in the staging layer:

CASE
  WHEN kafka_timestamp - event_timestamp > INTERVAL '5 minutes' 
  THEN TRUE
  ELSE FALSE
END AS is_late

I chose flagging over exclusion. Late events still contribute to session reconstruction — they're just marked for observability. If late data becomes a problem (>5% of events), that's a signal to investigate upstream latency, not throw data away.

What's Not Here (And Why)

No Airflow. Pulse is event-driven. The consumer runs continuously, reacting to Kafka messages. There's nothing to schedule for ingestion. dbt runs on a simple cron or EventBridge trigger — Airflow would be overkill for a single transform job.

No S3 landing zone. For a production deployment, I'd add S3 between Kafka and Postgres as a raw archive layer. Enables replay from cold storage if the database needs to be rebuilt. I documented this in the production architecture doc but didn't implement it locally — diminishing returns for a portfolio project.

Simulated data, not real traffic. The event simulator generates fake clickstream. Real production would swap in a JavaScript SDK tracking actual user behavior. The pipeline architecture doesn't change — only the producer does.

Production Path

If this were going to production on AWS:

Local	Production
Kafka (Docker)	Amazon MSK
Postgres (Docker)	Amazon RDS
Manual dbt runs	EventBridge + Lambda
localhost:3000	QuickSight or Power BI Service

The patterns stay identical. Idempotent producer, idempotent consumer key, DLQ for non-transient errors, dbt for transforms. Just swap local containers for managed services.

What I Learned

Exactly-once is a chain, not a single mechanism. Producer idempotence and consumer idempotent keys solve different failure modes. You need both.
Timestamps are a design decision. Business time vs ingestion time isn't academic — it affects session reconstruction correctness.
DLQ is for non-retryable errors. Transient failures should replay from Kafka, not clutter your dead letter queue.
dbt works for streaming too. The transform layer doesn't care if events arrived via batch API or real-time Kafka. Same staging → intermediate → marts pattern.

What's Next

This completes my second portfolio project. Ballistics showed batch patterns (API → Airflow → S3 → Postgres → dbt). Pulse shows streaming patterns (Kafka → Postgres → dbt). Together they tell the story: I can work across both paradigms.

Next up: an AI-flavoured pipeline. RAG ingestion, embeddings, vector store. The "DE + AI" trend isn't about building ML models — it's about building pipelines that feed them.

Code: github.com/ayoabass777/Pulse

Author: Ayomide Abass — Data Engineer, Vancouver

LinkedIn · GitHub

DEV Community

# Building a Streaming Session Analytics Pipeline with Kafka, Postgres, and dbt

What I Built

Architecture

The Hard Part: Exactly-Once Delivery

Layer 1: Idempotent Producer (`enable.idempotence=True`)

Layer 2: Idempotent Consumer Key (`ON CONFLICT DO NOTHING`)

Two Timestamps, Two Purposes

Session Reconstruction in SQL

Error Handling: DLQ for Non-Transient Errors Only

Funnel Analysis

DAU/WAU/MAU with Stickiness

Late Data Detection

What's Not Here (And Why)

Production Path

What I Learned

What's Next

Top comments (0)

What I Built

Architecture

The Hard Part: Exactly-Once Delivery

Layer 1: Idempotent Producer (enable.idempotence=True)

Layer 2: Idempotent Consumer Key (ON CONFLICT DO NOTHING)

Two Timestamps, Two Purposes

Session Reconstruction in SQL

Error Handling: DLQ for Non-Transient Errors Only

Funnel Analysis

DAU/WAU/MAU with Stickiness

Late Data Detection

What's Not Here (And Why)

Production Path

What I Learned

What's Next

Layer 1: Idempotent Producer (`enable.idempotence=True`)

Layer 2: Idempotent Consumer Key (`ON CONFLICT DO NOTHING`)