<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ayoabass777</title>
    <description>The latest articles on DEV Community by ayoabass777 (@ayoabass777).</description>
    <link>https://dev.to/ayoabass777</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874283%2F7c93aa46-f08a-4f5a-a2de-6a18c0800659.png</url>
      <title>DEV Community: ayoabass777</title>
      <link>https://dev.to/ayoabass777</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ayoabass777"/>
    <language>en</language>
    <item>
      <title>Building an AI-Augmented News Intelligence Pipeline with Kafka, Delta Lake, and LLMs</title>
      <dc:creator>ayoabass777</dc:creator>
      <pubDate>Thu, 30 Apr 2026 01:47:55 +0000</pubDate>
      <link>https://dev.to/ayoabass777/building-an-ai-augmented-news-intelligence-pipeline-with-kafka-delta-lake-and-llms-2nj3</link>
      <guid>https://dev.to/ayoabass777/building-an-ai-augmented-news-intelligence-pipeline-with-kafka-delta-lake-and-llms-2nj3</guid>
      <description>&lt;p&gt;&lt;em&gt;How I built a streaming pipeline that uses LLMs as a transform layer and Delta Lake for stateful content versioning&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;My first portfolio project (Ballistics) was batch — API calls on a schedule, Airflow orchestration, S3 landing zone. My second (Pulse) was streaming — Kafka, exactly-once delivery, session analytics in dbt. Both used the same transformation tool (dbt) with different ingestion patterns.&lt;/p&gt;

&lt;p&gt;Sentinel is the third project, and the question changed. Ballistics and Pulse processed structured data — JSON from APIs, simulated clickstream events. What happens when the raw data is &lt;em&gt;unstructured&lt;/em&gt;? When the "transformation" isn't a SQL model but an LLM that extracts entities, sentiment, and summaries from raw HTML?&lt;/p&gt;

&lt;p&gt;Sentinel is a news intelligence pipeline that ingests articles from multiple sources, uses LLMs to extract structured data, and serves it through an API and dashboard. It's not a product — it's a proof of work for AI-augmented data engineering patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GDELT + RSS ──► Kafka ──► Fetcher ──► Kafka ──► LLM Parser ──► Kafka ──► Bronze Writer ──► Delta Lake ──► FastAPI
                  │                                    │                       │                │
               Redis L1/L2                          DLQ topic          Delta Lake write    PySpark MERGE
               (dedup gate)                      (exponential                                  (CDF)
                                                  backoff)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two producers (GDELT and an 18-feed RSS aggregator) discover article URLs and push them to Kafka. A fetcher consumer retrieves the HTML. An LLM parser extracts structured data — title, author, companies, people, topics, sentiment, summary — and produces it to a &lt;code&gt;parsed_articles&lt;/code&gt; topic. A Bronze Writer consumer reads from that topic and writes to Delta Lake. A PySpark job transforms Bronze into Silver using a stateful MERGE. FastAPI serves the Silver layer to a React dashboard.&lt;/p&gt;

&lt;p&gt;Everything runs locally in Docker — Kafka in KRaft mode, Redis, and the dashboard. The producers, fetcher, and parser are long-running Python services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┐
│    GDELT     │──┐    ┌─────────────┐     ┌─────────────┐
│   Producer   │  ├───►│ Redis Dedup │────►│sentinel.urls│
└──────────────┘  │    │  L1+L2 keys │     └──────┬──────┘
┌──────────────┐  │    └─────────────┘            │
│     RSS      │──┘                               ▼
│  (18 feeds)  │                        ┌─────────────────────┐
└──────────────┘                        │ Fetcher (txn)       │
                                        │ begin→fetch→commit  │
                                        └──────────┬──────────┘
                                                   ▼
                                        ┌──────────────────┐
                                        │sentinel.raw_html │
                                        └──────────┬───────┘
                                                   ▼
                                        ┌─────────────────────┐
                                        │ LLM Parser (txn)    │
                                        │ OpenAI / Anthropic  │
                                        │ / DeepSeek          │
                                        └──────────┬──────────┘
                      ┌──────────┐                 │
                      │   DLQ    │◄── fails        │
                      │  Replay  │                 │
                      │1m→5m→30m │                 │
                      └──────────┘                 ▼
                                      ┌────────────────────────┐
                                      │sentinel.parsed_articles│
                                      └───────────┬────────────┘
                                                  ▼
                                        ┌──────────────────┐
                                        │  Bronze Writer   │
                                        └──────────┬───────┘
                                                   ▼
                                            ┌─────────────┐
                                            │Delta Bronze │
                                            │(CDF-enabled)│
                                            └──────┬──────┘
                                                   ▼
                                            ┌─────────────┐
                                            │  PySpark    │
                                            │ MERGE→Silver│
                                            └──────┬──────┘
                                                   ▼
                                            ┌─────────────┐
                                            │  FastAPI +  │
                                            │  Dashboard  │
                                            └─────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1pkwhoyei4fjkqpqdhi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1pkwhoyei4fjkqpqdhi.png" alt="Sentinel Dashboard"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Flow: Life of an Article
&lt;/h2&gt;

&lt;p&gt;One article, end to end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Discovery.&lt;/strong&gt; The RSS producer polls a tech feed. A new entry appears — &lt;code&gt;https://example.com/article-about-funding&lt;/code&gt;. The producer checks Redis: L1 key (&lt;code&gt;sentinel:src:rss:{guid}&lt;/code&gt;) doesn't exist, L2 key (&lt;code&gt;sentinel:url:example.com/article-about-funding&lt;/code&gt;) doesn't exist. Both keys are set with TTLs. The URL goes to &lt;code&gt;sentinel.urls&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fetch.&lt;/strong&gt; The fetcher consumer reads the message inside a Kafka transaction — &lt;code&gt;begin&lt;/code&gt;, fetch HTML via &lt;code&gt;httpx&lt;/code&gt;, extract clean text with &lt;code&gt;trafilatura&lt;/code&gt; (stripping nav bars, ads, boilerplate), produce to &lt;code&gt;sentinel.raw_html&lt;/code&gt;, &lt;code&gt;commit&lt;/code&gt;. If the fetch fails (timeout, 404), the message goes to &lt;code&gt;sentinel.dlq.fetch&lt;/code&gt; for retry with exponential backoff (1m → 5m → 30m).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parse.&lt;/strong&gt; The LLM parser reads the clean text and sends it to the LLM with a structured extraction prompt. The LLM returns JSON: title, author, companies, people, topics, sentiment, summary. The parser validates the response against a Pydantic schema, computes &lt;code&gt;data_value_score&lt;/code&gt;, and produces the structured &lt;code&gt;BronzeArticle&lt;/code&gt; to &lt;code&gt;sentinel.parsed_articles&lt;/code&gt;. If the LLM returns malformed JSON or the response fails validation, the message goes to &lt;code&gt;sentinel.dlq.parse&lt;/code&gt; — same exponential backoff as the fetch DLQ. The entire step — read, extract, produce — is a Kafka transaction. No Delta Lake write here.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bronze Write.&lt;/strong&gt; The Bronze Writer consumer reads from &lt;code&gt;sentinel.parsed_articles&lt;/code&gt; and writes each article to the Delta Lake Bronze table with CDF enabled. If the write fails, the offset isn't committed — Kafka replays the message. This separation means the parser's transaction stays fully within Kafka's guarantees, and the storage write has its own error boundary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transform.&lt;/strong&gt; The PySpark job reads Bronze's Change Data Feed — only new rows since the last checkpoint. It dedupes (keeping the latest per URL, aggregating all sources), filters by quality score, enriches with &lt;code&gt;freshness_status&lt;/code&gt; and &lt;code&gt;ingestion_lag_hours&lt;/code&gt;, and MERGEs into Silver. If this URL already exists in Silver with a different &lt;code&gt;content_hash&lt;/code&gt;, the MERGE bumps &lt;code&gt;content_version&lt;/code&gt; and stores the previous hash.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Serve.&lt;/strong&gt; FastAPI reads Silver via &lt;code&gt;deltalake&lt;/code&gt; + PyArrow (no Spark). The dashboard fetches &lt;code&gt;/articles&lt;/code&gt; and renders cards with sentiment badges, freshness status, source tags, and quality scores.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No scheduler told stages 1–4 when to run. The URL landing on a Kafka topic &lt;em&gt;was&lt;/em&gt; the trigger. Each service reacted to data arriving, not a clock ticking. The transform (stage 5) is the exception — it's a batch job triggered manually or on a cron, reading only new changes via CDF. The architecture is designed so this becomes a one-line swap to Spark Structured Streaming when throughput demands it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Docker Compose, Not Airflow
&lt;/h2&gt;

&lt;p&gt;This was the first architectural decision and it surprised me. I used Airflow for Ballistics. Three DAGs, task dependencies, scheduled triggers — it worked because Ballistics is a batch pipeline. Tasks start, run, finish, repeat on a schedule.&lt;/p&gt;

&lt;p&gt;Sentinel's consumers are different. They're &lt;em&gt;services&lt;/em&gt;, not &lt;em&gt;jobs&lt;/em&gt;. The fetcher doesn't run at 8am and finish at 8:05. It starts and stays alive, polling Kafka for new messages in a loop. Same for the parser. If I used Airflow, it would look like: "every 5 minutes, wake up, poll Kafka, process messages, shut down." That's wasteful — the consumer spends more time starting and stopping than working.&lt;/p&gt;

&lt;p&gt;Docker Compose keeps containers alive. Airflow schedules tasks. Different tools for different orchestration patterns.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Orchestrator&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Batch (Ballistics)&lt;/td&gt;
&lt;td&gt;Airflow&lt;/td&gt;
&lt;td&gt;Tasks with dependencies, on a schedule&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service (Sentinel)&lt;/td&gt;
&lt;td&gt;Docker Compose&lt;/td&gt;
&lt;td&gt;Long-running consumers, react to events&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: Kafka is the real orchestrator here. A message landing on a topic &lt;em&gt;is&lt;/em&gt; the trigger. No scheduler tells the fetcher when to run — data arriving on &lt;code&gt;sentinel.urls&lt;/code&gt; is the signal. Docker Compose just keeps the services alive so they can listen.&lt;/p&gt;

&lt;h2&gt;
  
  
  4-Level Dedup: Why One Layer Isn't Enough
&lt;/h2&gt;

&lt;p&gt;Two producers ingesting from overlapping sources means duplicates are inevitable. The same article can appear in both GDELT and an RSS feed within minutes. I needed dedup at every boundary.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Level&lt;/th&gt;
&lt;th&gt;Key&lt;/th&gt;
&lt;th&gt;Where&lt;/th&gt;
&lt;th&gt;TTL&lt;/th&gt;
&lt;th&gt;What it catches&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;L1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sentinel:src:{source}:{record_id}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;24h&lt;/td&gt;
&lt;td&gt;Same source, same record&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;sentinel:url:{normalized_url}&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;6h&lt;/td&gt;
&lt;td&gt;Same URL from different sources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;url + kafka_ts&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Bronze Delta&lt;/td&gt;
&lt;td&gt;∞&lt;/td&gt;
&lt;td&gt;Replay after crash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;L4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;normalized_url + content_hash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Silver MERGE&lt;/td&gt;
&lt;td&gt;∞&lt;/td&gt;
&lt;td&gt;Content change detection&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;L1 and L2 are cheap — Redis &lt;code&gt;SET NX&lt;/code&gt; calls in a pipeline batch. They catch 95% of duplicates before anything touches Kafka. L3 is the parser's safety net against replayed messages (same mechanism as Pulse's &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt;). L4 is the stateful MERGE that tracks whether content actually changed.&lt;/p&gt;

&lt;p&gt;URL normalization matters here. &lt;code&gt;https://www.example.com/article?utm_source=rss&lt;/code&gt; and &lt;code&gt;http://example.com/article&lt;/code&gt; are the same article. The normalizer strips &lt;code&gt;www.&lt;/code&gt;, tracking params (&lt;code&gt;utm_*&lt;/code&gt;, &lt;code&gt;fbclid&lt;/code&gt;, &lt;code&gt;gclid&lt;/code&gt;), protocol differences, and trailing slashes. Without this, L2 misses cross-source duplicates.&lt;/p&gt;
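
&lt;p&gt;A minimal normalizer along these lines, using only the standard library (illustrative, not Sentinel's exact implementation):&lt;/p&gt;

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Illustrative normalizer: collapses protocol, www., tracking params, and
# trailing slashes so the L2 dedup key matches the same article across sources.
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}

def normalize_url(url: str) -> str:
    parts = urlsplit(url)
    host = parts.netloc.lower().removeprefix("www.")
    path = parts.path.rstrip("/")
    # Drop tracking parameters; sort the rest for a stable key
    query = sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if not k.startswith(TRACKING_PREFIXES) and k not in TRACKING_PARAMS
    )
    qs = f"?{urlencode(query)}" if query else ""
    return f"{host}{path}{qs}"
```

&lt;p&gt;Both example URLs from the paragraph above normalize to &lt;code&gt;example.com/article&lt;/code&gt;, so the L2 Redis key collides exactly as intended.&lt;/p&gt;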

&lt;h2&gt;
  
  
  The LLM as a Transform Layer
&lt;/h2&gt;

&lt;p&gt;In Ballistics and Pulse, transformation meant SQL — dbt models that reshape structured data. In Sentinel, the "transformation" is an LLM that takes article text and extracts structured fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;Clean&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;text&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;LLM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;→&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;title&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;author&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;publish_date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;companies&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;people&lt;/span&gt;&lt;span class="p"&gt;[],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;topics&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates two problems batch SQL doesn't have:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The bottleneck is cost, not throughput.&lt;/strong&gt; Each LLM call takes 2–5 seconds and costs tokens. The fetcher can saturate the &lt;code&gt;sentinel.raw_html&lt;/code&gt; topic far faster than the parser can drain it. Kafka absorbs the pressure difference as buffered messages — that's backpressure for free.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. The output is non-deterministic.&lt;/strong&gt; The same article parsed twice might produce slightly different entity lists. This is why the Silver MERGE uses &lt;code&gt;content_hash&lt;/code&gt; (an MD5 of the article body) rather than comparing extracted fields. If the source content hasn't changed, the extraction shouldn't re-run — regardless of whether the LLM might produce different output.&lt;/p&gt;
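
&lt;p&gt;The hash itself is trivial; the important choice is hashing the cleaned body rather than the LLM output. A sketch, assuming the MD5-of-body scheme described above:&lt;/p&gt;

```python
import hashlib

# Hash the cleaned article body, not the extracted fields: non-deterministic
# LLM output then can't trigger spurious "content changed" updates.
def content_hash(body: str) -> str:
    return hashlib.md5(body.encode("utf-8")).hexdigest()

body_v1 = "Acme raises $50M Series B..."
body_v2 = "Acme raises $50M Series B... (updated with CEO quote)"
```

&lt;p&gt;Identical bodies always hash the same, so a re-fetched but unchanged article is skipped at the MERGE, while any edit to the source text produces a new hash.&lt;/p&gt;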

&lt;p&gt;The parser is pluggable — a &lt;code&gt;--provider&lt;/code&gt; flag switches between OpenAI, Anthropic, and DeepSeek per instance. This was a practical decision: OpenAI quotas ran out during development, so I needed a fallback. But it also demonstrates a production pattern — run expensive models (GPT-4o, Claude Sonnet) for quality-critical extraction and cheap models (GPT-4o-mini, DeepSeek) for bulk ingestion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Transaction Boundaries: Why the Parser Doesn't Write to Delta Lake
&lt;/h2&gt;

&lt;p&gt;The parser reads from one Kafka topic and produces to another — &lt;code&gt;sentinel.raw_html&lt;/code&gt; in, &lt;code&gt;sentinel.parsed_articles&lt;/code&gt; out. That read-process-produce cycle is wrapped in a Kafka transaction (&lt;code&gt;begin&lt;/code&gt;, extract, produce, &lt;code&gt;commit&lt;/code&gt;). If the LLM call fails or validation rejects the output, the transaction aborts — the offset isn't committed, the output message isn't visible, and the failed message is routed to &lt;code&gt;sentinel.dlq.parse&lt;/code&gt; for retry with exponential backoff. No half-written state, no silent data loss.&lt;/p&gt;

&lt;p&gt;Critically, the parser does &lt;em&gt;not&lt;/em&gt; write to Delta Lake. That's a separate consumer (the Bronze Writer) with its own error boundary. Earlier in development, the parser wrote directly to Delta Lake inside the transaction — but Delta writes aren't part of Kafka's transactional guarantees, so a successful Delta write followed by a failed transaction commit would leave data in Bronze with an uncommitted offset. Separating them keeps each transaction boundary clean.&lt;/p&gt;

&lt;h2&gt;
  
  
  Delta Lake CDF: Incremental Transforms Without Full Scans
&lt;/h2&gt;

&lt;p&gt;The Bronze → Silver transform needs to process only &lt;em&gt;new&lt;/em&gt; articles, not re-scan the entire Bronze table every run. Delta Lake's Change Data Feed (CDF) solves this.&lt;/p&gt;

&lt;p&gt;When CDF is enabled on a table, every write creates a versioned changelog entry. The transform reads only changes since the last processed version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read only new changes from Bronze
&lt;/span&gt;&lt;span class="n"&gt;cdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;readChangeFeed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;startingVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;last_processed_version&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;option&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;endingVersion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_version&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bronze_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;_change_type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;update_postimage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A simple checkpoint file stores the last processed version. The first run does a full scan automatically (no checkpoint found); subsequent runs process only new changes. The &lt;code&gt;--full&lt;/code&gt; flag forces a reconciliation scan if Silver drifts.&lt;/p&gt;

&lt;p&gt;This is the same design pattern as Kafka consumer offsets — "give me everything since my last position." The difference is that Kafka coordinates the streaming stages (producers, fetcher, parser), and CDF coordinates the storage stages (Bronze → Silver). Each tool tracks its own "last position" within its own domain.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stateful MERGE: Content Versioning
&lt;/h2&gt;

&lt;p&gt;The Silver MERGE does more than dedup. It tracks whether an article's content has changed over time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;silver_dt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;alias&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s.normalized_url = b.normalized_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenMatchedUpdate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;b.content_hash != s.content_hash AND b.kafka_ts &amp;gt; s.kafka_ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;# Update content fields...
&lt;/span&gt;        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s.content_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;previous_content_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s.content_hash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;is_updated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="c1"&gt;# first_seen_ts intentionally NOT updated
&lt;/span&gt;    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;whenNotMatchedInsertAll&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three outcomes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Incoming vs Silver&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New URL&lt;/td&gt;
&lt;td&gt;Insert (&lt;code&gt;content_version=1, is_updated=False&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same URL + same content_hash&lt;/td&gt;
&lt;td&gt;Skip (true duplicate)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same URL + different content_hash&lt;/td&gt;
&lt;td&gt;Update: bump version, store old hash, set &lt;code&gt;is_updated=True&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The statefulness is in three lines: &lt;code&gt;s.content_version + 1&lt;/code&gt; reads Silver's current state to compute the update. The MERGE isn't just transforming data — it's making decisions based on what's already there.&lt;/p&gt;

&lt;p&gt;If you need the old content, Delta time travel has it. Every previous version of Silver is accessible by version number. No need to store duplicate rows — the history is in the transaction log.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Quality: Don't Drop Late Data, Flag It
&lt;/h2&gt;

&lt;p&gt;Every article gets a &lt;code&gt;freshness_status&lt;/code&gt; computed from the gap between &lt;code&gt;publish_date&lt;/code&gt; and &lt;code&gt;fetch_timestamp&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Condition&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;fresh&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ingested within 48 hours of publication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;old&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ingested 48 hours to 7 days after publication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stale&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ingested more than 7 days after publication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;unknown&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;No &lt;code&gt;publish_date&lt;/code&gt; available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Combined with &lt;code&gt;ingestion_lag_hours&lt;/code&gt; (the continuous metric) and a composite &lt;code&gt;data_value_score&lt;/code&gt; (freshness 40%, completeness 40%, accuracy 20%), Silver consumers can filter by quality without losing data. Articles below &lt;code&gt;data_value_score &amp;lt; 0.3&lt;/code&gt; are excluded from Silver — but they remain in Bronze if you ever need to reprocess them with different thresholds.&lt;/p&gt;

&lt;p&gt;The key decision: &lt;strong&gt;flag, don't drop.&lt;/strong&gt; A week-old article about a funding round is still valuable. Dropping it because it's "stale" loses signal. Flagging it lets the consumer decide.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling: Kafka Consumer Groups
&lt;/h2&gt;

&lt;p&gt;Each stage scales horizontally via Kafka consumer groups. If &lt;code&gt;sentinel.raw_html&lt;/code&gt; has 3 partitions, you can run up to 3 parser instances in parallel — each gets a partition, no code changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;parser-0&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python -m sentinel.consumers.llm_parser &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
&lt;span class="na"&gt;parser-1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python -m sentinel.consumers.llm_parser &lt;/span&gt;&lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kafka rebalances automatically when consumers join or crash. But the real throughput ceiling isn't partitions — it's LLM cost. Three parsers at 2 calls/minute each give 6 articles per minute. Switching to a cheaper model (GPT-4o-mini, Haiku, or DeepSeek at ~$0.001/article) shifts the bottleneck from cost back to Kafka partitions, and adding partitions is a one-command fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Production Would Look Like
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Local&lt;/th&gt;
&lt;th&gt;Production&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kafka (Docker)&lt;/td&gt;
&lt;td&gt;Amazon MSK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis (Docker)&lt;/td&gt;
&lt;td&gt;ElastiCache&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Delta Lake (local filesystem)&lt;/td&gt;
&lt;td&gt;Delta Lake on S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual &lt;code&gt;docker-compose up&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;ECS Fargate + KEDA autoscaling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PySpark batch transform&lt;/td&gt;
&lt;td&gt;Spark Structured Streaming (one-line swap)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;localhost API&lt;/td&gt;
&lt;td&gt;ECS Fargate + ALB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The CDF-based transform is designed so switching to streaming is a one-line change: &lt;code&gt;spark.read&lt;/code&gt; becomes &lt;code&gt;spark.readStream&lt;/code&gt;. The MERGE logic, dedup, and quality filters stay identical.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The LLM is a liability, not a feature.&lt;/strong&gt; The attention framework needs attention. It's the slowest, most expensive, least deterministic component in the pipeline. Everything around it — rate limiting, DLQ with exponential backoff, pluggable providers, content hashing to avoid re-extraction — exists to manage that liability. The architecture assumes the LLM will fail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming and batch coexist.&lt;/strong&gt; Sentinel isn't purely streaming or purely batch. Kafka coordinates the streaming stages, CDF coordinates the storage stages, and the batch transform sits at the boundary. Production systems almost always have both patterns running in parallel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dedup is a system design problem, not a single filter.&lt;/strong&gt; Each dedup layer catches different classes of duplicates at different costs. Redis is fast but ephemeral. Delta MERGE is durable but expensive. Stacking them means each layer only handles what the previous one missed.&lt;/p&gt;
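&lt;p&gt;A minimal sketch of the stacking idea, with a dict standing in for Redis and a set standing in for the Delta MERGE key. All names here are hypothetical, not from the Sentinel repo:&lt;/p&gt;

```python
import hashlib

recent_cache = {}   # stands in for Redis: fast, ephemeral (TTL eviction)
table_keys = set()  # stands in for the Delta MERGE key: durable, expensive

def content_hash(text):
    return hashlib.sha256(text.encode()).hexdigest()

def is_duplicate(url, body):
    key = content_hash(body)
    if recent_cache.get(url) == key:  # layer 1: O(1) lookup, recent items only
        return True
    recent_cache[url] = key
    if key in table_keys:             # layer 2: catches what the cache missed
        return True
    table_keys.add(key)
    return False

assert is_duplicate("a.com/x", "breaking news") is False  # first sighting
assert is_duplicate("a.com/x", "breaking news") is True   # caught by cache
recent_cache.clear()                                      # simulate TTL expiry
assert is_duplicate("a.com/x", "breaking news") is True   # caught by table key
```

Each layer only sees traffic the previous one let through, which is exactly why the cheap layer goes first.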




&lt;p&gt;&lt;strong&gt;The progression:&lt;/strong&gt; Ballistics (batch, Airflow, S3) → Pulse (streaming, Kafka, exactly-once) → Sentinel (streaming + LLM, Kafka, Delta Lake, stateful MERGE). Three projects, three ingestion paradigms, each building on the patterns learned in the last.&lt;/p&gt;

&lt;p&gt;Code: &lt;a href="https://github.com/ayoabass777/Sentinel" rel="noopener noreferrer"&gt;github.com/ayoabass777/Sentinel&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Ayomide Abass — Data Engineer, Vancouver&lt;/em&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/ayomide-abass-36b40025a/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; · &lt;a href="https://github.com/ayoabass777" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>systemdesign</category>
      <category>sideprojects</category>
      <category>dataengineering</category>
    </item>
    <item>
      &lt;title&gt;Building a Streaming Session Analytics Pipeline with Kafka, Postgres, and dbt&lt;/title&gt;
      <dc:creator>ayoabass777</dc:creator>
      <pubDate>Sat, 18 Apr 2026 23:10:46 +0000</pubDate>
      <link>https://dev.to/ayoabass777/-building-a-streaming-session-analytics-pipeline-with-kafka-postgres-and-dbt-4n79</link>
      <guid>https://dev.to/ayoabass777/-building-a-streaming-session-analytics-pipeline-with-kafka-postgres-and-dbt-4n79</guid>
      <description>&lt;p&gt;&lt;em&gt;How I built an end-to-end clickstream pipeline with exactly-once delivery guarantees&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;When I set out to build Pulse, I had a specific goal: demonstrate that I could work with streaming data, not just batch. My first portfolio project (Ballistics) was a batch pipeline — API calls on a schedule, Airflow orchestration, daily refreshes. That's the bread and butter of most data engineering work, but it's only half the picture.&lt;/p&gt;

&lt;p&gt;Pulse is the other half. Real-time events flowing through Kafka, landing in Postgres, transformed by dbt into session analytics. Same dbt layer, completely different ingestion paradigm.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;Pulse is a session analytics pipeline that processes clickstream events in real time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Event Simulator → Kafka → Python Consumer → Postgres → dbt → Metabase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The simulator generates realistic user behavior — page views, product views, add-to-cart events, checkouts, and payments. These flow through Kafka, get written to Postgres with exactly-once semantics, and dbt transforms them into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Session metrics&lt;/strong&gt; — duration, bounce rate, landing pages&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Funnel analysis&lt;/strong&gt; — step-by-step conversion from awareness to purchase&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User engagement&lt;/strong&gt; — DAU/WAU/MAU with stickiness ratios&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxk7ke87pdmimypeotu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzxk7ke87pdmimypeotu7.png" alt="Sessions Dashboard" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────┐     ┌──────────┐     ┌──────────────┐     ┌──────────┐     ┌───────────┐
│   Event     │     │  Kafka   │     │   Python     │     │ Postgres │     │ Metabase  │
│  Simulator  │────▶│ (KRaft)  │────▶│  Consumer    │────▶│   raw    │────▶│ Dashboard │
│  (producer) │     │          │     │              │     │  events  │     │           │
└─────────────┘     └──────────┘     └──────────────┘     └────┬─────┘     └───────────┘
                         │                                     │
                    ┌────▼─────┐                          ┌────▼─────┐
                    │   DLQ    │                          │   dbt    │
                    │  topic   │                          │ models   │
                    └──────────┘                          └──────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything runs locally in Docker — Kafka in KRaft mode (no Zookeeper), Postgres, and Metabase. The producer and consumer are Python scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hard Part: Exactly-Once Delivery
&lt;/h2&gt;

&lt;p&gt;The most interesting engineering challenge was achieving exactly-once semantics end-to-end. This required two separate mechanisms working together:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Idempotent Producer (&lt;code&gt;enable.idempotence=True&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;When the producer sends a message and the network times out, it doesn't know if Kafka received it. So it retries. Without idempotence, you'd get duplicate messages in the topic.&lt;/p&gt;

&lt;p&gt;The idempotent producer solves this by tagging each message with a sequence number. If Kafka already has that sequence, it silently drops the retry. Duplicates never enter the topic.&lt;/p&gt;
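&lt;p&gt;In librdkafka-style config keys, the relevant producer settings look roughly like this sketch (the broker address is a placeholder):&lt;/p&gt;

```python
# The settings involved, in librdkafka-style config keys (broker address
# is a placeholder). Enabling idempotence requires acks from all in-sync
# replicas; sequence numbering is what makes aggressive retries safe.
producer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder
    "enable.idempotence": True,
    "acks": "all",
    "retries": 2147483647,
}

assert producer_config["enable.idempotence"] is True
```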

&lt;h3&gt;
  
  
  Layer 2: Idempotent Consumer Key (&lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Even with an idempotent producer, the consumer can still create duplicates. Here's how:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Consumer reads message from Kafka&lt;/li&gt;
&lt;li&gt;Consumer writes row to Postgres ✓&lt;/li&gt;
&lt;li&gt;Consumer crashes before committing offset to Kafka&lt;/li&gt;
&lt;li&gt;Consumer restarts, replays from last committed offset&lt;/li&gt;
&lt;li&gt;Consumer writes the same row again ← duplicate&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The fix is an idempotent key. Every event gets a unique &lt;code&gt;event_id&lt;/code&gt; derived from &lt;code&gt;user_id&lt;/code&gt; + &lt;code&gt;kafka_timestamp_ms&lt;/code&gt;. The Postgres table has a primary key constraint on this field, and every insert uses &lt;code&gt;ON CONFLICT (event_id) DO NOTHING&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When the replay happens, Postgres silently rejects the duplicate. No error, no data corruption.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;INSERT&lt;/span&gt; &lt;span class="k"&gt;INTO&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;(...)&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;CONFLICT&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DO&lt;/span&gt; &lt;span class="k"&gt;NOTHING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;These two mechanisms are not the same thing.&lt;/strong&gt; The producer idempotence prevents duplicates &lt;em&gt;in Kafka&lt;/em&gt;. The consumer idempotent key prevents duplicates &lt;em&gt;in Postgres&lt;/em&gt;. You need both for end-to-end exactly-once.&lt;/p&gt;

&lt;p&gt;I verified this works by stopping the consumer mid-stream, resetting the Kafka offset to the beginning, and replaying all messages. Zero duplicates in Postgres.&lt;/p&gt;
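&lt;p&gt;The crash-and-replay scenario can be simulated in a few lines, modeling the primary key with a dict and &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt; with &lt;code&gt;setdefault&lt;/code&gt;. This is a toy model, not the repo's consumer code:&lt;/p&gt;

```python
import hashlib

events_table = {}  # the dict key models the PRIMARY KEY (event_id)

def event_id(user_id, kafka_timestamp_ms):
    return hashlib.sha256(f"{user_id}:{kafka_timestamp_ms}".encode()).hexdigest()

def insert_event(user_id, kafka_timestamp_ms, payload):
    # setdefault keeps the first row and silently ignores conflicts,
    # like ON CONFLICT (event_id) DO NOTHING
    events_table.setdefault(event_id(user_id, kafka_timestamp_ms), payload)

kafka_log = [("u1", 1000, {"event": "page_view"}),
             ("u1", 2000, {"event": "add_to_cart"})]

for msg in kafka_log:   # first pass: rows written...
    insert_event(*msg)
for msg in kafka_log:   # ...crash before offset commit, so everything replays
    insert_event(*msg)

assert len(events_table) == 2  # zero duplicates after the replay
```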

&lt;h2&gt;
  
  
  Two Timestamps, Two Purposes
&lt;/h2&gt;

&lt;p&gt;Every event carries two timestamps:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Timestamp&lt;/th&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;event_timestamp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Producer (business time)&lt;/td&gt;
&lt;td&gt;When did the user act? Used for session ordering.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;kafka_timestamp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Kafka broker (ingestion time)&lt;/td&gt;
&lt;td&gt;When did we receive it? Used for freshness checks.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Session reconstruction uses &lt;code&gt;event_timestamp&lt;/code&gt; because that's the truth of user behavior. &lt;code&gt;kafka_timestamp&lt;/code&gt; is for operational concerns — "is data flowing?" and "how stale is the latest batch?"&lt;/p&gt;

&lt;p&gt;This distinction matters because events can arrive out of order. A user might click at 10:00:01, but network latency means Kafka receives it at 10:00:03. If you sessionize on ingestion time, you get wrong session boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Session Reconstruction in SQL
&lt;/h2&gt;

&lt;p&gt;The sessionization logic uses a 30-minute inactivity gap — industry standard for web analytics. If a user is idle for more than 30 minutes, the next event starts a new session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Flag new sessions based on time gap&lt;/span&gt;
&lt;span class="k"&gt;CASE&lt;/span&gt;
  &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;prev_event_timestamp&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;-- first event&lt;/span&gt;
  &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;event_timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;prev_event_timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'30 minutes'&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
  &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_new_session&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then a running sum of those flags gives each event its session number:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;is_new_session&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;OVER&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
  &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;event_timestamp&lt;/span&gt;
  &lt;span class="k"&gt;ROWS&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="n"&gt;UNBOUNDED&lt;/span&gt; &lt;span class="k"&gt;PRECEDING&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;CURRENT&lt;/span&gt; &lt;span class="k"&gt;ROW&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;session_number&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I chose to do this in dbt (batch) rather than a stream processor (Flink/Spark Streaming) deliberately. The session definition is still evolving — maybe 30 minutes becomes 20, maybe we add page-specific rules. SQL is testable, rerunnable, and version-controlled. Once the rules stabilize, I can move to stream processing if latency requires it.&lt;/p&gt;
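&lt;p&gt;The same gap logic, rendered in Python to make the running-sum trick concrete (a toy equivalent, not the dbt model itself):&lt;/p&gt;

```python
from datetime import datetime, timedelta

def sessionize(timestamps, gap=timedelta(minutes=30)):
    session, prev, numbers = 0, None, []
    for ts in timestamps:
        # Same conditions as the CASE: first event, or inactivity gap exceeded
        if prev is None or ts - prev > gap:
            session += 1          # the running SUM of is_new_session flags
        numbers.append(session)
        prev = ts
    return numbers

clicks = [datetime(2026, 4, 18, 10, 0),
          datetime(2026, 4, 18, 10, 10),  # 10 min gap: same session
          datetime(2026, 4, 18, 11, 0)]   # 50 min gap: new session
assert sessionize(clicks) == [1, 1, 2]
```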

&lt;h2&gt;
  
  
  Error Handling: DLQ for Non-Transient Errors Only
&lt;/h2&gt;

&lt;p&gt;Not all errors are equal:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Error Type&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Transient&lt;/td&gt;
&lt;td&gt;Connection timeout, deadlock&lt;/td&gt;
&lt;td&gt;Don't commit offset → Kafka replays automatically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-transient&lt;/td&gt;
&lt;td&gt;Missing required field, bad data type&lt;/td&gt;
&lt;td&gt;Route to DLQ topic → commit offset to unblock&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Transient errors fix themselves. Just let Kafka replay. Non-transient errors need human attention, so they go to a dead letter queue where someone can inspect and decide what to do.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;DLQ is for errors you can't retry your way out of.&lt;/strong&gt; If you DLQ transient errors, you're throwing away free retries.&lt;/p&gt;
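&lt;p&gt;The routing rule fits in one handler. Here is a hedged sketch where the exception classes and names stand in for real driver errors, not Pulse's actual consumer:&lt;/p&gt;

```python
TRANSIENT = (TimeoutError, ConnectionError)  # stand-ins for driver errors

dlq = []  # stands in for the DLQ Kafka topic

def handle(message, process):
    try:
        process(message)
        return "committed"
    except TRANSIENT:
        return "not_committed"  # offset not committed: Kafka redelivers
    except Exception as exc:
        dlq.append({"message": message, "error": str(exc)})
        return "committed"      # commit to unblock the partition

def bad_schema(msg):  # non-transient: a retry can't conjure a missing field
    raise ValueError("missing required field: event_id")

def db_busy(msg):     # transient: the next attempt may just work
    raise TimeoutError("connection timed out")

assert handle({"raw": "{}"}, bad_schema) == "committed"
assert len(dlq) == 1
assert handle({"raw": "{}"}, db_busy) == "not_committed"
```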

&lt;h2&gt;
  
  
  Funnel Analysis
&lt;/h2&gt;

&lt;p&gt;The funnel tracks users through a five-step purchase journey:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Page View (0) → Product View (1) → Add to Cart (2) → Checkout (3) → Payment (4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each event carries a &lt;code&gt;funnel_step_index&lt;/code&gt; from the producer. dbt aggregates this into daily conversion rates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- What % of users who viewed a product added it to cart?&lt;/span&gt;
&lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;users_step_2&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="k"&gt;NULLIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;users_step_1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;cvr_product_to_cart&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
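&lt;p&gt;The &lt;code&gt;NULLIF&lt;/code&gt; guard translated to Python, with made-up step counts:&lt;/p&gt;

```python
def conversion_rate(users_next, users_prev):
    if users_prev == 0:  # NULLIF(users_step_1, 0) makes the result NULL
        return None
    return round(100.0 * users_next / users_prev, 2)

assert conversion_rate(37, 120) == 30.83  # product view to add-to-cart
assert conversion_rate(5, 0) is None      # nobody reached the earlier step
```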



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tcv9mm7ra8at86t7ate.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tcv9mm7ra8at86t7ate.png" alt="Funnel Dashboard" width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  DAU/WAU/MAU with Stickiness
&lt;/h2&gt;

&lt;p&gt;User engagement uses rolling windows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DAU&lt;/strong&gt;: Distinct users today&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WAU&lt;/strong&gt;: Distinct users in the last 7 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MAU&lt;/strong&gt;: Distinct users in the last 30 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stickiness&lt;/strong&gt;: DAU / MAU — how often do monthly users come back daily?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A stickiness of 25% means a quarter of your monthly users are daily actives. Consumer apps aim for 20%+. Below 10% suggests users try the product once and ghost.&lt;/p&gt;
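&lt;p&gt;As a toy computation over a made-up activity log (the data and helper name are illustrative):&lt;/p&gt;

```python
from datetime import date, timedelta

activity = {  # made-up daily active-user sets
    date(2026, 4, 1): {"a", "b", "c", "d"},
    date(2026, 4, 15): {"a", "b"},
    date(2026, 4, 18): {"a"},
}

def distinct_users(as_of, days):
    # Count distinct users active in the trailing window ending at as_of
    window_start = as_of - timedelta(days=days - 1)
    users = set()
    for day, active in activity.items():
        if as_of >= day >= window_start:
            users |= active
    return len(users)

as_of = date(2026, 4, 18)
dau = distinct_users(as_of, 1)    # {"a"}
mau = distinct_users(as_of, 30)   # {"a", "b", "c", "d"}
assert dau / mau == 0.25          # 25% stickiness
```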

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tmmgz0yfkw26wdb54ul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2tmmgz0yfkw26wdb54ul.png" alt="Engagement Dashboard" width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Late Data Detection
&lt;/h2&gt;

&lt;p&gt;In streaming, events can arrive out of order. A user clicks at 10:00:00, but network lag means Kafka receives it at 10:06:00. That's "late" data.&lt;/p&gt;

&lt;p&gt;Pulse flags these in the staging layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CASE&lt;/span&gt;
  &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;kafka_timestamp&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;event_timestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'5 minutes'&lt;/span&gt; 
  &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="k"&gt;TRUE&lt;/span&gt;
  &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="k"&gt;FALSE&lt;/span&gt;
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;is_late&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I chose &lt;strong&gt;flagging over exclusion&lt;/strong&gt;. Late events still contribute to session reconstruction — they're just marked for observability. If late data becomes a problem (&amp;gt;5% of events), that's a signal to investigate upstream latency, not throw data away.&lt;/p&gt;
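&lt;p&gt;The flag in Python form, with the same five-minute threshold:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def is_late(event_ts, kafka_ts, threshold=timedelta(minutes=5)):
    # Late when ingestion time trails business time by more than 5 minutes
    return kafka_ts - event_ts > threshold

on_time = is_late(datetime(2026, 4, 18, 10, 0, 0),
                  datetime(2026, 4, 18, 10, 0, 3))
late = is_late(datetime(2026, 4, 18, 10, 0, 0),
               datetime(2026, 4, 18, 10, 6, 0))
assert (on_time, late) == (False, True)
```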

&lt;h2&gt;
  
  
  What's Not Here (And Why)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;No Airflow.&lt;/strong&gt; Pulse is event-driven. The consumer runs continuously, reacting to Kafka messages. There's nothing to schedule for ingestion. dbt runs on a simple cron or EventBridge trigger — Airflow would be overkill for a single transform job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No S3 landing zone.&lt;/strong&gt; For a production deployment, I'd add S3 between Kafka and Postgres as a raw archive layer. Enables replay from cold storage if the database needs to be rebuilt. I documented this in the production architecture doc but didn't implement it locally — diminishing returns for a portfolio project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simulated data, not real traffic.&lt;/strong&gt; The event simulator generates fake clickstream. Real production would swap in a JavaScript SDK tracking actual user behavior. The pipeline architecture doesn't change — only the producer does.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Path
&lt;/h2&gt;

&lt;p&gt;If this were going to production on AWS:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Local&lt;/th&gt;
&lt;th&gt;Production&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Kafka (Docker)&lt;/td&gt;
&lt;td&gt;Amazon MSK&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Postgres (Docker)&lt;/td&gt;
&lt;td&gt;Amazon RDS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Manual dbt runs&lt;/td&gt;
&lt;td&gt;EventBridge + Lambda&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;localhost:3000&lt;/td&gt;
&lt;td&gt;QuickSight or Power BI Service&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The patterns stay identical. Idempotent producer, idempotent consumer key, DLQ for non-transient errors, dbt for transforms. Just swap local containers for managed services.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exactly-once is a chain, not a single mechanism.&lt;/strong&gt; Producer idempotence and consumer idempotent keys solve different failure modes. You need both.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Timestamps are a design decision.&lt;/strong&gt; Business time vs ingestion time isn't academic — it affects session reconstruction correctness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;DLQ is for non-retryable errors.&lt;/strong&gt; Transient failures should replay from Kafka, not clutter your dead letter queue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;dbt works for streaming too.&lt;/strong&gt; The transform layer doesn't care if events arrived via batch API or real-time Kafka. Same staging → intermediate → marts pattern.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This completes my second portfolio project. Ballistics showed batch patterns (API → Airflow → S3 → Postgres → dbt). Pulse shows streaming patterns (Kafka → Postgres → dbt). Together they tell the story: I can work across both paradigms.&lt;/p&gt;

&lt;p&gt;Next up: an AI-flavoured pipeline. RAG ingestion, embeddings, vector store. The "DE + AI" trend isn't about building ML models — it's about building pipelines that feed them.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; &lt;a href="https://github.com/ayoabass777/Pulse" rel="noopener noreferrer"&gt;github.com/ayoabass777/Pulse&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author:&lt;/strong&gt; Ayomide Abass — Data Engineer, Vancouver&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/ayomide-abass-36b40025a/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; · &lt;a href="https://github.com/ayoabass777" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>kafka</category>
      <category>dbt</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Building a Football Analytics Pipeline: Patterns, Tradeoffs, and What Production Would Look Like</title>
      <dc:creator>ayoabass777</dc:creator>
      <pubDate>Sun, 12 Apr 2026 02:50:39 +0000</pubDate>
      <link>https://dev.to/ayoabass777/building-a-football-analytics-pipeline-patterns-tradeoffs-and-what-production-would-look-like-n66</link>
      <guid>https://dev.to/ayoabass777/building-a-football-analytics-pipeline-patterns-tradeoffs-and-what-production-would-look-like-n66</guid>
      <description>&lt;p&gt;Football is the most watched sport on the planet. Millions of fans follow their teams across leagues, tracking form, streaks, and head-to-head records. I built &lt;strong&gt;Ballistics&lt;/strong&gt; — a pipeline that automates the ingestion, transformation, and analytics of football data, currently covering 19 leagues across 15 European countries. The goal is to serve fans with analytical data about their favourite teams: streak tracking, head-to-head breakdowns, and performance metrics.&lt;/p&gt;

&lt;p&gt;This isn't a production system. It's a portfolio project. But every pattern in it maps to a real-world equivalent, and every shortcut I took, I can explain what the production version would look like. That's what this post is about.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/ayoabass777/ballistics" rel="noopener noreferrer"&gt;github.com/ayoabass777/ballistics&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                         ┌──────────────┐
                         │ Football API │
                         │  (RapidAPI)  │
                         └──────┬───────┘
                                │
                      ┌─────────▼──────────┐
                      │  Apache Airflow    │
                      │  (Dockerized)      │
                      └──┬─────────────┬───┘
                         │             │
              ┌──────────▼──┐   ┌──────▼──────────┐
              │  S3 Landing │   │   Postgres       │
              │  Zone (raw) │──►│   raw schema     │
              └──────┬──────┘   └──────┬───────────┘
                     │                 │
              ┌──────▼──────┐   ┌──────▼───────┐
              │  S3 DLQ     │   │     dbt      │
              │  (failures) │   │  (Docker)    │
              └──────┬──────┘   └──────┬───────┘
                     │                 │
              ┌──────▼──────┐        ┌─┴─────────┬────────────┐
              │ Replay DAG  │        │           │            │
              │ (re-extract)│    ┌───▼────┐ ┌────▼─────┐ ┌────▼────┐
              └─────────────┘    │  stg   │ │   int    │ │  mart   │
                                 │ views  │ │  tables  │ │ tables  │
                                 └────────┘ └──────────┘ └─────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline has three DAGs, each with a distinct job:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Bootstrap&lt;/strong&gt; — one-time setup triggered by config changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental&lt;/strong&gt; — daily fixture updates, standings refresh, dbt transforms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replay&lt;/strong&gt; — retries failed extractions from a dead letter queue&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let me walk through each one and explain why I made the choices I did.&lt;/p&gt;




&lt;h2&gt;
  
  
  Bootstrap: Adding a League Should Be a Config Change
&lt;/h2&gt;

&lt;p&gt;The bootstrap DAG runs on a daily schedule, but most days it does nothing. A custom &lt;code&gt;MetadataChangeSensor&lt;/code&gt; compares the modification time of &lt;code&gt;metadata.yaml&lt;/code&gt; against the last-processed timestamp stored as an Airflow Variable. If the file hasn't changed, all downstream tasks soft-fail and skip.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MetadataChangeSensor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseSensorOperator&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;poke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;current_mtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getmtime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;filepath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;last_mtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata_last_mtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_var&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_mtime&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;last_mtime&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;Variable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;metadata_last_mtime&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;current_mtime&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you &lt;em&gt;do&lt;/em&gt; edit &lt;code&gt;metadata.yaml&lt;/code&gt; — say, adding the Egyptian Premier League — the sensor triggers the full chain: create schemas, extract metadata from the API, full-load all historical fixtures into S3, then load into Postgres.&lt;/p&gt;

&lt;p&gt;The design principle: &lt;strong&gt;adding a league is a config change, not a code change.&lt;/strong&gt; You edit a YAML file, and the pipeline handles the rest.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why a sensor and not a manual trigger?
&lt;/h3&gt;

&lt;p&gt;I considered using &lt;code&gt;TriggerDagRunOperator&lt;/code&gt; from a separate "admin" DAG, or just running the bootstrap manually. The sensor pattern won out because it's self-documenting — the DAG itself encodes when it should run. No one has to remember to trigger it, no runbook to maintain. Edit the file, walk away.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parallel Branches
&lt;/h3&gt;

&lt;p&gt;The bootstrap DAG has a fan-out/fan-in structure. Two schema creation tasks — metadata schema (dim tables) and raw fixtures schema — run in parallel before converging at the extract step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sensor ──► metadata_schema ──► extract_metadata ──┐
                                                   ├──► extract_fixtures ──► load_fixtures
sensor ──► raw_fixtures_schema ───────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a small optimisation, but it demonstrates an important principle: tasks that don't depend on each other shouldn't wait for each other. In a production pipeline with heavier DDL operations or network-bound tasks, this pattern pays off significantly.&lt;/p&gt;

&lt;h3&gt;
  
  
  XCom: Keep It Lightweight
&lt;/h3&gt;

&lt;p&gt;The extract task returns a list of S3 keys via XCom. The load task consumes them. The critical decision here: &lt;strong&gt;pass URIs, not data.&lt;/strong&gt; XCom stores values in the Airflow metadata database. Stuffing raw JSON payloads into XCom creates bloat and can crash the metadata DB in production. S3 keys are just strings — a few bytes each.&lt;/p&gt;

&lt;p&gt;In production, I'd go further and use an S3-backed XCom backend. But for a demo, default XCom with lightweight strings works fine.&lt;/p&gt;
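&lt;p&gt;To make the contrast concrete, here's a minimal pure-Python sketch (made-up record shapes and an example key, not the repo's actual payloads) of what each design would actually push through XCom:&lt;/p&gt;

```python
import json

# Hypothetical raw payload a bootstrap extraction might produce.
raw_records = [
    {"fixture_id": i, "home_goals": 1, "away_goals": 0} for i in range(10_000)
]

# Anti-pattern: push the whole payload through XCom. Every byte of this
# lands in the Airflow metadata database.
payload_bytes = len(json.dumps(raw_records).encode("utf-8"))

# Pattern used here: push only the S3 key; the data itself stays in S3.
s3_keys = ["full_load/39/2024/2024-05-01/fixtures.json"]  # example key
keys_bytes = len(json.dumps(s3_keys).encode("utf-8"))

print(payload_bytes, keys_bytes)  # the key list is orders of magnitude smaller
```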




&lt;h2&gt;
  
  
  S3 as a Raw Landing Zone
&lt;/h2&gt;

&lt;p&gt;Every extraction — whether full load or incremental — writes raw JSON to S3 before touching Postgres. The prefix structure makes each write immutable and replayable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;full_load/{league_id}/{season}/{ds}/fixtures.json
incremental/{ds}/fixtures.json
dlq/{league_id}/{season}/{ds}/error.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Why not write directly to Postgres?
&lt;/h3&gt;

&lt;p&gt;If the Postgres load fails (network issue, schema mismatch, disk full), the raw data is still sitting in S3. I can replay the load from the landing zone without re-hitting the API, which matters when the API has rate limits and costs per call.&lt;/p&gt;

&lt;p&gt;This is a common pattern in production pipelines: &lt;strong&gt;extract once, load many times.&lt;/strong&gt; S3 is cheap storage that gives you a replayable audit trail. Think of it as a recording studio approach — capture the raw signal first, then process it. You can always reprocess, but you can't un-lose a raw recording.&lt;/p&gt;
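&lt;p&gt;As a sketch of the prefix scheme (the helper and its signature are illustrative, not the repo's actual API):&lt;/p&gt;

```python
def fixtures_key(load_type: str, league_id: int, season: int, ds: str) -> str:
    """Build the landing-zone key for one extraction.

    Bootstrap (full_load) writes are scoped per league-season; incremental
    writes are scoped by execution date alone, matching the prefixes above.
    """
    if load_type == "full_load":
        return f"full_load/{league_id}/{season}/{ds}/fixtures.json"
    return f"incremental/{ds}/fixtures.json"

print(fixtures_key("full_load", 144, 2024, "2024-05-01"))
# full_load/144/2024/2024-05-01/fixtures.json
```

&lt;p&gt;Because &lt;code&gt;ds&lt;/code&gt; (the Airflow execution date) is baked into every key, replaying a load is just re-reading a known prefix, with no API call required.&lt;/p&gt;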

&lt;h3&gt;
  
  
  Why date-partitioned prefixes?
&lt;/h3&gt;

&lt;p&gt;Each write goes to a unique prefix that includes the execution date (&lt;code&gt;{ds}&lt;/code&gt;). This means incremental loads never overwrite each other. If I need to debug what was loaded on a specific day, the S3 prefix tells me exactly where to look.&lt;/p&gt;

&lt;p&gt;In production, you'd add lifecycle policies to move older prefixes to S3 Infrequent Access or Glacier. Football fixture data is public and re-fetchable, so I didn't bother with versioning — but for non-recoverable data, I'd enable it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure Handling: The Dead Letter Queue
&lt;/h2&gt;

&lt;p&gt;Not every API call succeeds. Rate limits, network timeouts, malformed responses — any extraction can fail. The question is: what happens when it does?&lt;/p&gt;

&lt;p&gt;The bootstrap DAG doesn't retry inline. Instead, failed extractions write an error record to the &lt;code&gt;dlq/&lt;/code&gt; prefix in S3 with enough context to retry later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_to_dlq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_league_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;season_year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;league_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_league_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_league_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;season_year&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;season_year&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;league_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;league_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ds&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traceback&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;traceback&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format_exc&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;timestamp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;isoformat&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dlq/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_league_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;season_year&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/error.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A separate &lt;strong&gt;Replay DAG&lt;/strong&gt; runs daily, scans the DLQ, re-extracts from the API, and loads successfully replayed entries into Postgres. After replay, it checks if any entries remain and logs a warning. In production, this is where you'd wire up a Slack or PagerDuty alert.&lt;/p&gt;
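&lt;p&gt;The replay logic hinges on the DLQ key itself carrying the retry context. A sketch of that recovery step (pure Python; the helper name is mine, and in the real DAG the keys would come from an S3 listing):&lt;/p&gt;

```python
def parse_dlq_key(key: str) -> dict:
    """Turn a DLQ object key back into the arguments needed to retry the
    failed extraction, per the dlq/{league_id}/{season}/{ds}/error.json
    layout shown earlier."""
    parts = key.split("/")
    if parts[0] != "dlq" or parts[-1] != "error.json":
        raise ValueError(f"not a DLQ key: {key}")
    return {"league_id": int(parts[1]), "season": int(parts[2]), "ds": parts[3]}

print(parse_dlq_key("dlq/144/2024/2024-05-01/error.json"))
# {'league_id': 144, 'season': 2024, 'ds': '2024-05-01'}
```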

&lt;h3&gt;
  
  
  Why a separate DAG instead of retries?
&lt;/h3&gt;

&lt;p&gt;Airflow has built-in retries (&lt;code&gt;retries=3&lt;/code&gt;, &lt;code&gt;retry_delay=timedelta(minutes=5)&lt;/code&gt;). Why not just use those?&lt;/p&gt;

&lt;p&gt;Because the failure mode matters. If one league-season fails during a bootstrap of 30 league-seasons, I don't want to re-run the entire bootstrap DAG. The DLQ pattern isolates failures — the 29 successful extractions are safe in S3 and loaded into Postgres. Only the failed one needs another attempt, and it happens independently without blocking anything else.&lt;/p&gt;

&lt;p&gt;This is the same pattern AWS SQS dead letter queues use. The idea: &lt;strong&gt;separate the happy path from the recovery path.&lt;/strong&gt; Each can run on its own schedule, with its own retry logic, without coupling them together.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Incremental Path: Daily Updates and Edge Cases
&lt;/h2&gt;

&lt;p&gt;The daily DAG picks up fixtures that have been played since the last run — specifically, fixtures where &lt;code&gt;kickoff_utc&lt;/code&gt; is in the past but fulltime goals are still null. It fetches updates by fixture ID, writes to S3, upserts into Postgres, corrects any rescheduled kickoff times, refreshes league standings, and triggers dbt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;extract_updates ──► load_updates ──► fixture_corrections ──► update_standings ──► dbt_run
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
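&lt;p&gt;The selection predicate driving &lt;code&gt;extract_updates&lt;/code&gt; can be sketched in plain Python (field names mirror the prose; the record shape is illustrative):&lt;/p&gt;

```python
from datetime import datetime, timezone

def needs_update(fixture: dict, now: datetime) -> bool:
    """True when kickoff has passed but the fulltime score is still null,
    meaning the match should have finished and we owe it a refresh."""
    kicked_off = now >= fixture["kickoff_utc"]
    unscored = fixture["home_goals_ft"] is None or fixture["away_goals_ft"] is None
    return kicked_off and unscored

now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
played = {"kickoff_utc": datetime(2024, 5, 1, 19, 0, tzinfo=timezone.utc),
          "home_goals_ft": None, "away_goals_ft": None}
future = {"kickoff_utc": datetime(2024, 5, 9, 19, 0, tzinfo=timezone.utc),
          "home_goals_ft": None, "away_goals_ft": None}
print(needs_update(played, now), needs_update(future, now))  # True False
```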



&lt;h3&gt;
  
  
  The Belgian Playoff Problem
&lt;/h3&gt;

&lt;p&gt;A week after going live, I noticed warnings in the logs: &lt;code&gt;Fixture 1540006 not found in DB; skipping.&lt;/code&gt; The fixture correction script fetches the last/next N fixtures per league from the API and checks them against the database. These fixture IDs existed in the API but not in my database.&lt;/p&gt;

&lt;p&gt;Investigation revealed these were Belgium's Jupiler Pro League championship playoff fixtures. Belgian football splits into playoff groups mid-season, and the league body creates entirely new fixture IDs for the playoff round. My bootstrap captured the regular season, but the playoff fixtures were created after that.&lt;/p&gt;

&lt;p&gt;This exposed a gap: &lt;strong&gt;the incremental path only updates existing fixtures.&lt;/strong&gt; It doesn't insert new ones. Newly created fixtures (playoffs, rescheduled additions, cup ties added mid-season) get silently skipped.&lt;/p&gt;

&lt;p&gt;For now, re-bootstrapping the affected league-season picks them up. The proper fix is adding an "insert new fixtures" path to the incremental flow. I've documented this as a known gap — acknowledging it matters more to me than pretending it doesn't exist.&lt;/p&gt;

&lt;h3&gt;
  
  
  Standings Deduplication
&lt;/h3&gt;

&lt;p&gt;Another edge case from Belgium: the API returns multiple standings tables for the same league-season when playoff groups are active. The same team appears in both the regular season table and the playoff table, both tagged with the same &lt;code&gt;league_season_id&lt;/code&gt;. Postgres rejected the batch upsert because two rows with the same &lt;code&gt;(league_season_id, api_team_id)&lt;/code&gt; can't appear in a single INSERT.&lt;/p&gt;

&lt;p&gt;The fix: deduplicate before upserting, keeping the last occurrence (the most current table). A small function, but it's the kind of thing that only surfaces with real data from real APIs — not from tutorials.&lt;/p&gt;
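&lt;p&gt;A minimal version of that deduplication, relying on dict key re-assignment so the later (playoff) row wins (the row shape here is illustrative):&lt;/p&gt;

```python
def dedupe_standings(rows: list) -> list:
    """Keep only the last row per (league_season_id, api_team_id).

    Re-assigning a dict key keeps the newer value, so the most recently
    returned table survives into the batch upsert."""
    latest = {}
    for row in rows:
        latest[(row["league_season_id"], row["api_team_id"])] = row
    return list(latest.values())

rows = [
    {"league_season_id": 7, "api_team_id": 554, "rank": 3},  # regular season
    {"league_season_id": 7, "api_team_id": 554, "rank": 1},  # playoff group
]
print(dedupe_standings(rows))  # only the playoff row remains
```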




&lt;h2&gt;
  
  
  dbt: Three Layers of Transformation
&lt;/h2&gt;

&lt;p&gt;The dbt project follows a standard staging → intermediate → mart structure:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staging&lt;/strong&gt; views are thin wrappers over raw tables. They rename columns, cast types, and derive flags like &lt;code&gt;is_played&lt;/code&gt; from &lt;code&gt;fixture_status = 'FT'&lt;/code&gt;. The staging layer uses incremental materialisation keyed on &lt;code&gt;api_fixture_id&lt;/code&gt;, so daily runs only process changed fixtures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intermediate&lt;/strong&gt; tables enrich fixtures with league context (league name, season label, current season flag) and team dimensions. This is where streak computation happens — win runs, clean sheet runs, scoring runs — along with relevance scores based on configurable weights stored as dbt seeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mart&lt;/strong&gt; tables assemble API-ready JSON payloads for the frontend: team pages, fixture pages, head-to-head breakdowns, homepage streak rankings. These rebuild fully on each run — at ~10k fixtures, that's fine. At higher volumes, you'd push incremental materialisation further down the DAG.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Quality at Three Levels
&lt;/h3&gt;

&lt;p&gt;I set up tests at each layer, each catching a different class of problem:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source tests&lt;/strong&gt; validate the raw data at the door. Not-null and unique constraints on primary keys, accepted-value checks on result columns (&lt;code&gt;win&lt;/code&gt;, &lt;code&gt;draw&lt;/code&gt;, &lt;code&gt;loss&lt;/code&gt;, null). Source freshness monitoring warns at 24 hours stale, errors at 48 hours. Freshness tests don't validate the data itself — they validate that the pipeline is &lt;em&gt;running&lt;/em&gt;. A passing not-null test on a table that hasn't been updated in a week gives a false sense of security.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Staging model tests&lt;/strong&gt; validate the output of my transforms. Key uniqueness, not-null on derived columns like &lt;code&gt;is_played&lt;/code&gt;, and referential integrity — &lt;code&gt;stg_dim_league_seasons.league_id&lt;/code&gt; must reference a valid &lt;code&gt;stg_dim_leagues.league_id&lt;/code&gt;. These are like foreign key constraints enforced at test time rather than in the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Singular tests&lt;/strong&gt; encode business logic. I wrote one: &lt;code&gt;assert_no_future_completed_fixtures.sql&lt;/code&gt; — no fixture should have &lt;code&gt;is_played = TRUE&lt;/code&gt; with a &lt;code&gt;kickoff_utc&lt;/code&gt; in the future. This catches data quality issues where the API returns a "finished" status for a match that hasn't happened yet, typically a sync bug or timezone mismatch.&lt;/p&gt;
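&lt;p&gt;Restated in Python rather than the test's SQL (with an illustrative record shape), the rule flags any row where both conditions hold:&lt;/p&gt;

```python
from datetime import datetime, timezone

def future_completed(fixtures: list, now: datetime) -> list:
    """Rows that would fail assert_no_future_completed_fixtures:
    marked as played, yet scheduled to kick off in the future."""
    return [f for f in fixtures if f["is_played"] and f["kickoff_utc"] > now]

now = datetime(2024, 5, 2, tzinfo=timezone.utc)
fixtures = [
    {"api_fixture_id": 1, "is_played": True,
     "kickoff_utc": datetime(2024, 5, 9, tzinfo=timezone.utc)},  # violates the rule
    {"api_fixture_id": 2, "is_played": True,
     "kickoff_utc": datetime(2024, 5, 1, tzinfo=timezone.utc)},  # fine
]
print([f["api_fixture_id"] for f in future_completed(fixtures, now)])  # [1]
```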

&lt;p&gt;The distinction between generic and singular tests matters. Generic tests (not_null, unique) are reusable patterns. Singular tests are custom SQL queries specific to your domain. Having both shows you understand data quality isn't just "add not_null to everything."&lt;/p&gt;




&lt;h2&gt;
  
  
  What Production Would Look Like
&lt;/h2&gt;

&lt;p&gt;This project runs on Docker Compose on my laptop. Here's what I'd change for a real deployment:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Demo (current)&lt;/th&gt;
&lt;th&gt;Production&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Airflow&lt;/td&gt;
&lt;td&gt;Local Docker Compose&lt;/td&gt;
&lt;td&gt;EC2/ECS with IAM role&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Credentials&lt;/td&gt;
&lt;td&gt;Access keys in &lt;code&gt;.env&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Role-based, no keys in code&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;S3&lt;/td&gt;
&lt;td&gt;Single bucket, prefix-partitioned&lt;/td&gt;
&lt;td&gt;Separate buckets per environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Versioning&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;td&gt;On for critical data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encryption&lt;/td&gt;
&lt;td&gt;SSE-S3 (default)&lt;/td&gt;
&lt;td&gt;SSE-KMS for sensitive data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IAM&lt;/td&gt;
&lt;td&gt;Admin user + access keys&lt;/td&gt;
&lt;td&gt;Least-privilege roles per service&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The biggest shift: &lt;strong&gt;moving from access keys to IAM roles.&lt;/strong&gt; In the demo, Airflow connects to S3 using access keys stored in &lt;code&gt;.env&lt;/code&gt;. In production, the Airflow instance would run on EC2 or ECS with an IAM role attached — no keys in code, no keys to rotate, no keys to leak.&lt;/p&gt;

&lt;p&gt;I'd also add proper monitoring: the DLQ check task currently logs a warning. In production, that becomes a Slack notification or PagerDuty alert. Source freshness errors would trigger similar alerts. The infrastructure for observability is already there in the pipeline design — it just needs to be wired to real alerting tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I'd Do Differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The playoff gap&lt;/strong&gt; is the most concrete lesson. I assumed that bootstrapping a league-season once would capture all fixtures. It doesn't — leagues create new fixtures mid-season for playoffs, rescheduled matches, and cup ties. The incremental path needs an "insert new fixtures" capability, not just updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;dbt schema management&lt;/strong&gt; gave me trouble. I didn't initially set up a separate profile for the project, so both the old and new dbt projects shared the same &lt;code&gt;profiles.yml&lt;/code&gt; block. Changing the database for one would break the other. The fix was simple — a separate &lt;code&gt;ballistics&lt;/code&gt; profile — but it cost me a debugging session that could have been avoided with upfront isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker-in-Docker&lt;/strong&gt; is a footgun. Running dbt via &lt;code&gt;DockerOperator&lt;/code&gt; inside a Dockerized Airflow requires mounting the host's Docker socket. This works for local development but wouldn't fly in production — you'd run dbt as a dedicated container service or use the &lt;code&gt;BashOperator&lt;/code&gt; with dbt installed directly in the Airflow image.&lt;/p&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Ballistics currently tracks 19 leagues across 15 countries, ingesting and transforming fixture data daily. The patterns — sensor-gated bootstrapping, S3 landing zones, dead letter queues, layered data quality testing — are all transferable to production pipelines at any scale.&lt;/p&gt;

&lt;p&gt;The gap between a portfolio project and production isn't the patterns. It's the operational maturity: IAM roles instead of access keys, alerting instead of log messages, lifecycle policies instead of unbounded storage. Knowing that gap exists, and being able to articulate it, matters as much as the code itself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Check out the repo:&lt;/strong&gt; &lt;a href="https://github.com/ayoabass777/ballistics" rel="noopener noreferrer"&gt;github.com/ayoabass777/ballistics&lt;/a&gt;&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>aws</category>
      <category>dbt</category>
    </item>
  </channel>
</rss>
