How I built a streaming pipeline that uses LLMs as a transform layer and Delta Lake for stateful content versioning
My first portfolio project (Ballistics) was batch — API calls on a schedule, Airflow orchestration, S3 landing zone. My second (Pulse) was streaming — Kafka, exactly-once delivery, session analytics in dbt. Both used the same transformation tool (dbt) with different ingestion patterns.
Sentinel is the third project, and the question changed. Ballistics and Pulse processed structured data — JSON from APIs, simulated clickstream events. What happens when the raw data is unstructured? When the "transformation" isn't a SQL model but an LLM that extracts entities, sentiment, and summaries from raw HTML?
Sentinel is a news intelligence pipeline that ingests articles from multiple sources, uses LLMs to extract structured data, and serves it through an API and dashboard. It's not a product — it's a proof of work for AI-augmented data engineering patterns.
What I Built
```
GDELT + RSS ──► Kafka ──► Fetcher ──► Kafka ──► LLM Parser ──► Kafka ──► Bronze Writer ──► Delta Lake ──► FastAPI
     │                                              │                          │               │
Redis L1/L2                                     DLQ topic              Delta Lake write  PySpark MERGE
(dedup gate)                                  (exponential                                   (CDF)
                                                 backoff)
```
Two producers (GDELT and an 18-feed RSS aggregator) discover article URLs and push them to Kafka. A fetcher consumer retrieves the HTML. An LLM parser extracts structured data — title, author, companies, people, topics, sentiment, summary — and produces it to a parsed_articles topic. A Bronze Writer consumer reads from that topic and writes to Delta Lake. A PySpark job transforms Bronze into Silver using a stateful MERGE. FastAPI serves the Silver layer to a React dashboard.
Everything runs locally in Docker — Kafka in KRaft mode, Redis, and the dashboard. The producers, fetcher, and parser are long-running Python services.
Architecture
```
┌──────────────┐
│    GDELT     │──┐    ┌─────────────┐     ┌─────────────┐
│   Producer   │  ├───►│ Redis Dedup │────►│sentinel.urls│
└──────────────┘  │    │ L1+L2 keys  │     └──────┬──────┘
┌──────────────┐  │    └─────────────┘            │
│     RSS      │──┘                               ▼
│  (18 feeds)  │                       ┌─────────────────────┐
└──────────────┘                       │    Fetcher (txn)    │
                                       │ begin→fetch→commit  │
                                       └──────────┬──────────┘
                                                  ▼
                                       ┌──────────────────┐
                                       │sentinel.raw_html │
                                       └──────────┬───────┘
                                                  ▼
                                       ┌─────────────────────┐
                                       │  LLM Parser (txn)   │
                                       │ OpenAI / Anthropic  │
                                       │     / DeepSeek      │
                                       └──────────┬──────────┘
                       ┌──────────┐               │
                       │   DLQ    │◄── fails      │
                       │  Replay  │               │
                       │1m→5m→30m │               │
                       └──────────┘               ▼
                                      ┌────────────────────────┐
                                      │sentinel.parsed_articles│
                                      └───────────┬────────────┘
                                                  ▼
                                       ┌──────────────────┐
                                       │  Bronze Writer   │
                                       └──────────┬───────┘
                                                  ▼
                                           ┌─────────────┐
                                           │Delta Bronze │
                                           │(CDF-enabled)│
                                           └──────┬──────┘
                                                  ▼
                                           ┌─────────────┐
                                           │   PySpark   │
                                           │ MERGE→Silver│
                                           └──────┬──────┘
                                                  ▼
                                           ┌─────────────┐
                                           │  FastAPI +  │
                                           │  Dashboard  │
                                           └─────────────┘
```
Data Flow: Life of an Article
One article, end to end:
1. **Discovery.** The RSS producer polls a tech feed. A new entry appears: `https://example.com/article-about-funding`. The producer checks Redis: the L1 key (`sentinel:src:rss:{guid}`) doesn't exist, and the L2 key (`sentinel:url:example.com/article-about-funding`) doesn't exist. Both keys are set with TTLs, and the URL goes to `sentinel.urls`.
2. **Fetch.** The fetcher consumer reads the message inside a Kafka transaction: begin, fetch HTML via `httpx`, extract clean text with `trafilatura` (stripping nav bars, ads, boilerplate), produce to `sentinel.raw_html`, commit. If the fetch fails (timeout, 404), the message goes to `sentinel.dlq.fetch` for retry with exponential backoff (1m → 5m → 30m).
3. **Parse.** The LLM parser reads the clean text and sends it to the LLM with a structured extraction prompt. The LLM returns JSON: title, author, companies, people, topics, sentiment, summary. The parser validates the response against a Pydantic schema, computes `data_value_score`, and produces the structured `BronzeArticle` to `sentinel.parsed_articles`. If the LLM returns malformed JSON or the response fails validation, the message goes to `sentinel.dlq.parse` with the same exponential backoff as the fetch DLQ. The entire step (read, extract, produce) is one Kafka transaction. No Delta Lake write happens here.
4. **Bronze Write.** The Bronze Writer consumer reads from `sentinel.parsed_articles` and writes each article to the Delta Lake Bronze table with CDF enabled. If the write fails, the offset isn't committed and Kafka replays the message. This separation means the parser's transaction stays fully within Kafka's guarantees, and the storage write has its own error boundary.
5. **Transform.** The PySpark job reads Bronze's Change Data Feed, picking up only new rows since the last checkpoint. It dedupes (keeping the latest per URL, aggregating all sources), filters by quality score, enriches with `freshness_status` and `ingestion_lag_hours`, and MERGEs into Silver. If the URL already exists in Silver with a different `content_hash`, the MERGE bumps `content_version` and stores the previous hash.
6. **Serve.** FastAPI reads Silver via `deltalake` + PyArrow (no Spark). The dashboard fetches `/articles` and renders cards with sentiment badges, freshness status, source tags, and quality scores.
No scheduler told stages 1–4 when to run. The URL landing on a Kafka topic was the trigger. Each service reacted to data arriving, not a clock ticking. The transform (stage 5) is the exception — it's a batch job triggered manually or on a cron, reading only new changes via CDF. The architecture is designed so this becomes a one-line swap to Spark Structured Streaming when throughput demands it.
Why Docker Compose, Not Airflow
This was the first architectural decision and it surprised me. I used Airflow for Ballistics. Three DAGs, task dependencies, scheduled triggers — it worked because Ballistics is a batch pipeline. Tasks start, run, finish, repeat on a schedule.
Sentinel's consumers are different. They're services, not jobs. The fetcher doesn't run at 8am and finish at 8:05. It starts and stays alive, polling Kafka for new messages in a loop. Same for the parser. If I used Airflow, it would look like: "every 5 minutes, wake up, poll Kafka, process messages, shut down." That's wasteful — the consumer spends more time starting and stopping than working.
Docker Compose keeps containers alive. Airflow schedules tasks. Different tools for different orchestration patterns.
| Pattern | Orchestrator | Why |
|---|---|---|
| Batch (Ballistics) | Airflow | Tasks with dependencies, on a schedule |
| Service (Sentinel) | Docker Compose | Long-running consumers, react to events |
The key insight: Kafka is the real orchestrator here. A message landing on a topic is the trigger. No scheduler tells the fetcher when to run — data arriving on sentinel.urls is the signal. Docker Compose just keeps the services alive so they can listen.
4-Level Dedup: Why One Layer Isn't Enough
Two producers ingesting from overlapping sources means duplicates are inevitable. The same article can appear in both GDELT and an RSS feed within minutes. I needed dedup at every boundary.
| Level | Key | Where | TTL | What it catches |
|---|---|---|---|---|
| L1 | `sentinel:src:{source}:{record_id}` | Redis | 24h | Same source, same record |
| L2 | `sentinel:url:{normalized_url}` | Redis | 6h | Same URL from different sources |
| L3 | `url + kafka_ts` | Bronze Delta | ∞ | Replay after crash |
| L4 | `normalized_url + content_hash` | Silver MERGE | ∞ | Content change detection |
L1 and L2 are cheap — Redis SET NX calls in a pipeline batch. They catch 95% of duplicates before anything touches Kafka. L3 is the parser's safety net against replayed messages (same mechanism as Pulse's ON CONFLICT DO NOTHING). L4 is the stateful MERGE that tracks whether content actually changed.
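The semantics of that L1/L2 gate are small enough to sketch in plain Python. The real pipeline batches two Redis `SET NX EX` calls in a pipeline; here an in-memory store stands in for Redis so the sketch is runnable, and `TTLStore`/`should_produce` are illustrative names, not Sentinel's actual code:

```python
import time


class TTLStore:
    """In-memory stand-in for Redis SET NX EX, used only to illustrate the gate."""

    def __init__(self):
        self._expiry: dict[str, float] = {}

    def set_nx(self, key: str, ttl_seconds: int) -> bool:
        """Return True if the key was newly set (first sighting), False if still live."""
        now = time.monotonic()
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return False  # duplicate: key exists and hasn't expired
        self._expiry[key] = now + ttl_seconds
        return True


def should_produce(store: TTLStore, source: str, record_id: str, normalized_url: str) -> bool:
    """Produce to Kafka only if both gates pass. Like a pipelined Redis batch,
    both SETs execute even when one fails, so the L1 key is stamped regardless."""
    l1_new = store.set_nx(f"sentinel:src:{source}:{record_id}", ttl_seconds=24 * 3600)
    l2_new = store.set_nx(f"sentinel:url:{normalized_url}", ttl_seconds=6 * 3600)
    return l1_new and l2_new
```

The second `should_produce` call for the same source record fails at L1; a different source reporting the same URL fails at L2.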
URL normalization matters here. https://www.example.com/article?utm_source=rss and http://example.com/article are the same article. The normalizer strips www., tracking params (utm_*, fbclid, gclid), protocol differences, and trailing slashes. Without this, L2 misses cross-source duplicates.
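A minimal normalizer along these lines can be built from the stdlib alone. `normalize_url` is a hypothetical name and the exact rule set in Sentinel may differ, but it covers the four cases above (protocol, `www.`, tracking params, trailing slash):

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Collapse protocol, www., tracking params, and trailing slashes
    so that variants of the same article produce the same L2 key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.startswith(TRACKING_PREFIXES) and k not in TRACKING_PARAMS
    ]
    path = parts.path.rstrip("/")
    qs = f"?{urlencode(query)}" if query else ""
    return f"{host}{path}{qs}"
```

Both example URLs from the paragraph above normalize to `example.com/article`, so L2 treats them as one article.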
The LLM as a Transform Layer
In Ballistics and Pulse, transformation meant SQL — dbt models that reshape structured data. In Sentinel, the "transformation" is an LLM that takes article text and extracts structured fields:
```
Clean text → LLM → {
  title, author, publish_date,
  body, summary, sentiment,
  companies[], people[], topics[]
}
```
This creates two problems batch SQL doesn't have:
1. The bottleneck is cost, not throughput. Each LLM call takes 2–5 seconds and costs tokens. The fetcher can saturate the sentinel.raw_html topic far faster than the parser can drain it. Kafka absorbs the pressure difference as buffered messages — that's backpressure for free.
2. The output is non-deterministic. The same article parsed twice might produce slightly different entity lists. This is why the Silver MERGE uses content_hash (an MD5 of the article body) rather than comparing extracted fields. If the source content hasn't changed, the extraction shouldn't re-run — regardless of whether the LLM might produce different output.
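The hash itself is one line. The point is that it is computed from the source body, which is deterministic, rather than from the LLM's output, which isn't:

```python
import hashlib


def content_hash(body: str) -> str:
    """MD5 of the article body: a stable answer to 'did the source content change?'"""
    return hashlib.md5(body.encode("utf-8")).hexdigest()
```

Two fetches of an unchanged article hash identically, so the MERGE can skip them even if a re-run of the LLM would have extracted a slightly different entity list.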
The parser is pluggable — a --provider flag switches between OpenAI, Anthropic, and DeepSeek per instance. This was a practical decision: OpenAI quotas ran out during development, so I needed a fallback. But it also demonstrates a production pattern — run expensive models (GPT-4o, Claude Sonnet) for quality-critical extraction and cheap models (GPT-4o-mini, DeepSeek) for bulk ingestion.
Transaction Boundaries: Why the Parser Doesn't Write to Delta Lake
The parser reads from one Kafka topic and produces to another — sentinel.raw_html in, sentinel.parsed_articles out. That read-process-produce cycle is wrapped in a Kafka transaction (begin, extract, produce, commit). If the LLM call fails or validation rejects the output, the transaction aborts — the offset isn't committed, the output message isn't visible, and the failed message is routed to sentinel.dlq.parse for retry with exponential backoff. No half-written state, no silent data loss.
Critically, the parser does not write to Delta Lake. That's a separate consumer (the Bronze Writer) with its own error boundary. Earlier in development, the parser wrote directly to Delta Lake inside the transaction — but Delta writes aren't part of Kafka's transactional guarantees, so a successful Delta write followed by a failed transaction commit would leave data in Bronze with an uncommitted offset. Separating them keeps each transaction boundary clean.
Delta Lake CDF: Incremental Transforms Without Full Scans
The Bronze → Silver transform needs to process only new articles, not re-scan the entire Bronze table every run. Delta Lake's Change Data Feed (CDF) solves this.
When CDF is enabled on a table, every write creates a versioned changelog entry. The transform reads only changes since the last processed version:
```python
# Transform job: read only new changes from Bronze
from pyspark.sql import functions as F

cdf = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version + 1)
    .option("endingVersion", current_version)
    .load(bronze_path)
    .filter(F.col("_change_type").isin("insert", "update_postimage"))
)
```
A simple checkpoint file stores the last processed version. First run does a full scan automatically (no checkpoint found), subsequent runs only process new changes. The --full flag forces a reconciliation scan if Silver drifts.
This is the same design pattern as Kafka consumer offsets — "give me everything since my last position." The difference is that Kafka coordinates the streaming stages (producers, fetcher, parser), and CDF coordinates the storage stages (Bronze → Silver). Each tool handles readiness signaling in its domain.
Stateful MERGE: Content Versioning
The Silver MERGE does more than dedup. It tracks whether an article's content has changed over time:
```python
(
    silver_dt.alias("s")
    .merge(
        source_df.alias("b"),
        "s.normalized_url = b.normalized_url",
    )
    .whenMatchedUpdate(
        condition="b.content_hash != s.content_hash AND b.kafka_ts > s.kafka_ts",
        set={
            # ...update content fields...
            "content_version": F.col("s.content_version") + 1,
            "previous_content_hash": "s.content_hash",
            "is_updated": F.lit(True),
            # first_seen_ts intentionally NOT updated
        },
    )
    .whenNotMatchedInsertAll()
    .execute()
)
```
Three outcomes:
| Incoming vs Silver | Action |
|---|---|
| New URL | Insert (`content_version=1`, `is_updated=False`) |
| Same URL + same `content_hash` | Skip (true duplicate) |
| Same URL + different `content_hash` | Update: bump version, store old hash, set `is_updated=True` |
The statefulness is in three lines: s.content_version + 1 reads Silver's current state to compute the update. The MERGE isn't just transforming data — it's making decisions based on what's already there.
If you need the old content, Delta time travel has it. Every previous version of Silver is accessible by version number. No need to store duplicate rows — the history is in the transaction log.
Data Quality: Don't Drop Late Data, Flag It
Every article gets a freshness_status computed from the gap between publish_date and fetch_timestamp:
| Status | Condition |
|---|---|
| `fresh` | Ingested within 48 hours of publication |
| `old` | Ingested 48 hours to 7 days after publication |
| `stale` | Ingested more than 7 days after publication |
| `unknown` | No publish_date available |
Combined with ingestion_lag_hours (the continuous metric) and a composite data_value_score (freshness 40%, completeness 40%, accuracy 20%), Silver consumers can filter by quality without losing data. Articles with data_value_score < 0.3 are excluded from Silver, but they remain in Bronze if you ever need to reprocess them with different thresholds.
The key decision: flag, don't drop. A week-old article about a funding round is still valuable. Dropping it because it's "stale" loses signal. Flagging it lets the consumer decide.
Scaling: Kafka Consumer Groups
Each stage scales horizontally via Kafka consumer groups. If sentinel.raw_html has 3 partitions, you can run up to 3 parser instances in parallel — each gets a partition, no code changes:
```yaml
parser-0:
  command: python -m sentinel.consumers.llm_parser 0
parser-1:
  command: python -m sentinel.consumers.llm_parser 1
```
Kafka rebalances automatically when consumers join or crash. But the real throughput ceiling isn't partitions; it's LLM cost. Three parsers at 2 calls/minute each yield 6 articles per minute. Switching to a cheaper model (GPT-4o-mini, Haiku, or DeepSeek at ~$0.001/article) shifts the bottleneck from cost back to Kafka partitions, which is a one-command fix.
What Production Would Look Like
| Local | Production |
|---|---|
| Kafka (Docker) | Amazon MSK |
| Redis (Docker) | ElastiCache |
| Delta Lake (local filesystem) | Delta Lake on S3 |
| Manual `docker-compose up` | ECS Fargate + KEDA autoscaling |
| PySpark batch transform | Spark Structured Streaming (one-line swap) |
| localhost API | ECS Fargate + ALB |
The CDF-based transform is designed so switching to streaming is a one-line change: spark.read becomes spark.readStream. The MERGE logic, dedup, and quality filters stay identical.
What I Learned
The LLM is a liability, not a feature. It's the slowest, most expensive, least deterministic component in the pipeline. Everything around it (rate limiting, DLQ with exponential backoff, pluggable providers, content hashing to avoid re-extraction) exists to manage that liability. The architecture assumes the LLM will fail.
Streaming and batch coexist. Sentinel isn't purely streaming or purely batch. Kafka coordinates the streaming stages, CDF coordinates the storage stages, and the batch transform sits at the boundary. Production systems almost always have both patterns running in parallel.
Dedup is a system design problem, not a single filter. Each dedup layer catches different classes of duplicates at different costs. Redis is fast but ephemeral. Delta MERGE is durable but expensive. Stacking them means each layer only handles what the previous one missed.
The progression: Ballistics (batch, Airflow, S3) → Pulse (streaming, Kafka, exactly-once) → Sentinel (streaming + LLM, Kafka, Delta Lake, stateful MERGE). Three projects, three ingestion paradigms, each building on the patterns learned in the last.
