How I built a streaming pipeline that uses LLMs as a transform layer and Delta Lake for stateful content versioning
My first portfolio project (Ballistics) was batch — API calls on a schedule, Airflow orchestration, S3 landing zone. My second (Pulse) was streaming — Kafka, exactly-once delivery, session analytics in dbt. Both used the same transformation tool (dbt) with different ingestion patterns.
Sentinel is the third project, and the question changed. Ballistics and Pulse processed structured data — JSON from APIs, simulated clickstream events. What happens when the raw data is unstructured? When the "transformation" isn't a SQL model but an LLM that extracts entities, sentiment, and summaries from raw HTML?
Sentinel is a news intelligence pipeline that ingests articles from multiple sources, uses LLMs to extract structured data, and serves it through an API and dashboard. It's not a product — it's a proof of work for AI-augmented data engineering patterns.
What I Built
```
GDELT + RSS ──► Kafka ──► Fetcher ──► Kafka ──► LLM Parser ──► Kafka ──► Bronze Writer ──► Delta Lake ──► FastAPI
     │                                              │                          │               │
Redis L1/L2                                     DLQ topic              Delta Lake write  PySpark MERGE
(dedup gate)                                  (exponential                                   (CDF)
                                                 backoff)
```
Two producers (GDELT and an 18-feed RSS aggregator) discover article URLs and push them to Kafka. A fetcher consumer retrieves the HTML. An LLM parser extracts structured data — title, author, companies, people, topics, sentiment, summary — and produces it to a parsed_articles topic. A Bronze Writer consumer reads from that topic and writes to Delta Lake. A PySpark job transforms Bronze into Silver using a stateful MERGE. FastAPI serves the Silver layer to a React dashboard.
Everything runs locally in Docker — Kafka in KRaft mode, Redis, and the dashboard. The producers, fetcher, and parser are long-running Python services.
Architecture
```
┌──────────────┐
│    GDELT     │──┐    ┌─────────────┐     ┌─────────────┐
│   Producer   │  ├───►│ Redis Dedup │────►│sentinel.urls│
└──────────────┘  │    │ L1+L2 keys  │     └──────┬──────┘
┌──────────────┐  │    └─────────────┘            │
│     RSS      │──┘                               ▼
│  (18 feeds)  │                       ┌─────────────────────┐
└──────────────┘                       │    Fetcher (txn)    │
                                       │ begin→fetch→commit  │
                                       └──────────┬──────────┘
                                                  ▼
                                       ┌──────────────────┐
                                       │sentinel.raw_html │
                                       └──────────┬───────┘
                                                  ▼
                                       ┌─────────────────────┐
                                       │  LLM Parser (txn)   │
                                       │ OpenAI / Anthropic  │
                                       │     / DeepSeek      │
                                       └──────────┬──────────┘
                       ┌──────────┐               │
                       │   DLQ    │◄── fails      │
                       │  Replay  │               │
                       │1m→5m→30m │               │
                       └──────────┘               ▼
                                      ┌────────────────────────┐
                                      │sentinel.parsed_articles│
                                      └───────────┬────────────┘
                                                  ▼
                                       ┌──────────────────┐
                                       │  Bronze Writer   │
                                       └──────────┬───────┘
                                                  ▼
                                           ┌─────────────┐
                                           │Delta Bronze │
                                           │(CDF-enabled)│
                                           └──────┬──────┘
                                                  ▼
                                           ┌─────────────┐
                                           │   PySpark   │
                                           │ MERGE→Silver│
                                           └──────┬──────┘
                                                  ▼
                                           ┌─────────────┐
                                           │  FastAPI +  │
                                           │  Dashboard  │
                                           └─────────────┘
```
Data Flow: Life of an Article
One article, end to end:
1. **Discovery.** The RSS producer polls a tech feed. A new entry appears: `https://example.com/article-about-funding`. The producer checks Redis: the L1 key (`sentinel:src:rss:{guid}`) doesn't exist, and the L2 key (`sentinel:url:example.com/article-about-funding`) doesn't exist. Both keys are set with TTLs, and the URL goes to `sentinel.urls`.
2. **Fetch.** The fetcher consumer reads the message inside a Kafka transaction: begin, fetch HTML via `httpx`, extract clean text with `trafilatura` (stripping nav bars, ads, boilerplate), produce to `sentinel.raw_html`, commit. If the fetch fails (timeout, 404), the message goes to `sentinel.dlq.fetch` for retry with exponential backoff (1m → 5m → 30m).
3. **Parse.** The LLM parser reads the clean text and sends it to the LLM with a structured extraction prompt. The LLM returns JSON: title, author, companies, people, topics, sentiment, summary. The parser validates the response against a Pydantic schema, computes `data_value_score`, and produces the structured `BronzeArticle` to `sentinel.parsed_articles`. If the LLM returns malformed JSON or the response fails validation, the message goes to `sentinel.dlq.parse` with the same exponential backoff as the fetch DLQ. The entire step (read, extract, produce) is one Kafka transaction. No Delta Lake write happens here.
4. **Bronze Write.** The Bronze Writer consumer reads from `sentinel.parsed_articles` and writes each article to the Delta Lake Bronze table with CDF enabled. If the write fails, the offset isn't committed and Kafka replays the message. This separation means the parser's transaction stays fully within Kafka's guarantees, and the storage write has its own error boundary.
5. **Transform.** The PySpark job reads Bronze's Change Data Feed, picking up only new rows since the last checkpoint. It dedupes (keeping the latest per URL, aggregating all sources), filters by quality score, enriches with `freshness_status` and `ingestion_lag_hours`, and MERGEs into Silver. If the URL already exists in Silver with a different `content_hash`, the MERGE bumps `content_version` and stores the previous hash.
6. **Serve.** FastAPI reads Silver via `deltalake` + PyArrow (no Spark). The dashboard fetches `/articles` and renders cards with sentiment badges, freshness status, source tags, and quality scores.
No scheduler told stages 1–4 when to run. The URL landing on a Kafka topic was the trigger. Each service reacted to data arriving, not a clock ticking. The transform (stage 5) is the exception — it's a batch job triggered manually or on a cron, reading only new changes via CDF. The architecture is designed so this becomes a one-line swap to Spark Structured Streaming when throughput demands it.
Why Docker Compose, Not Airflow
This was the first architectural decision and it surprised me. I used Airflow for Ballistics. Three DAGs, task dependencies, scheduled triggers — it worked because Ballistics is a batch pipeline. Tasks start, run, finish, repeat on a schedule.
Sentinel's consumers are different. They're services, not jobs. The fetcher doesn't run at 8am and finish at 8:05. It starts and stays alive, polling Kafka for new messages in a loop. Same for the parser. If I used Airflow, it would look like: "every 5 minutes, wake up, poll Kafka, process messages, shut down." That's wasteful — the consumer spends more time starting and stopping than working.
Docker Compose keeps containers alive. Airflow schedules tasks. Different tools for different orchestration patterns.
| Pattern | Orchestrator | Why |
|---|---|---|
| Batch (Ballistics) | Airflow | Tasks with dependencies, on a schedule |
| Service (Sentinel) | Docker Compose | Long-running consumers, react to events |
The key insight: Kafka is the real orchestrator here. A message landing on a topic is the trigger. No scheduler tells the fetcher when to run — data arriving on sentinel.urls is the signal. Docker Compose just keeps the services alive so they can listen.
4-Level Dedup: Why One Layer Isn't Enough
Two producers ingesting from overlapping sources means duplicates are inevitable. The same article can appear in both GDELT and an RSS feed within minutes. I needed dedup at every boundary.
| Level | Key | Where | TTL | What it catches |
|---|---|---|---|---|
| L1 | `sentinel:src:{source}:{record_id}` | Redis | 24h | Same source, same record |
| L2 | `sentinel:url:{normalized_url}` | Redis | 6h | Same URL from different sources |
| L3 | `url + kafka_ts` | Bronze Delta | ∞ | Replay after crash |
| L4 | `normalized_url + content_hash` | Silver MERGE | ∞ | Content change detection |
L1 and L2 are cheap — Redis SET NX calls in a pipeline batch. They catch 95% of duplicates before anything touches Kafka. L3 is the parser's safety net against replayed messages (same mechanism as Pulse's ON CONFLICT DO NOTHING). L4 is the stateful MERGE that tracks whether content actually changed.
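The semantics of that L1/L2 gate are small enough to sketch in plain Python. The real pipeline batches two Redis `SET NX EX` calls in a pipeline; here an in-memory store stands in for Redis so the sketch is runnable, and `TTLStore`/`should_produce` are illustrative names, not Sentinel's actual code:

```python
import time


class TTLStore:
    """In-memory stand-in for Redis SET NX EX, used only to illustrate the gate."""

    def __init__(self):
        self._expiry: dict[str, float] = {}

    def set_nx(self, key: str, ttl_seconds: int) -> bool:
        """Return True if the key was newly set (first sighting), False if still live."""
        now = time.monotonic()
        expires = self._expiry.get(key)
        if expires is not None and expires > now:
            return False  # duplicate: key exists and hasn't expired
        self._expiry[key] = now + ttl_seconds
        return True


def should_produce(store: TTLStore, source: str, record_id: str, normalized_url: str) -> bool:
    """Produce to Kafka only if both gates pass. Like a pipelined Redis batch,
    both SETs execute even when one fails, so the L1 key is stamped regardless."""
    l1_new = store.set_nx(f"sentinel:src:{source}:{record_id}", ttl_seconds=24 * 3600)
    l2_new = store.set_nx(f"sentinel:url:{normalized_url}", ttl_seconds=6 * 3600)
    return l1_new and l2_new
```

The second `should_produce` call for the same source record fails at L1; a different source reporting the same URL fails at L2.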
URL normalization matters here. https://www.example.com/article?utm_source=rss and http://example.com/article are the same article. The normalizer strips www., tracking params (utm_*, fbclid, gclid), protocol differences, and trailing slashes. Without this, L2 misses cross-source duplicates.
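A minimal normalizer along these lines can be built from the stdlib alone. `normalize_url` is a hypothetical name and the exact rule set in Sentinel may differ, but it covers the four cases above (protocol, `www.`, tracking params, trailing slash):

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Collapse protocol, www., tracking params, and trailing slashes
    so that variants of the same article produce the same L2 key."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if not k.startswith(TRACKING_PREFIXES) and k not in TRACKING_PARAMS
    ]
    path = parts.path.rstrip("/")
    qs = f"?{urlencode(query)}" if query else ""
    return f"{host}{path}{qs}"
```

Both example URLs from the paragraph above normalize to `example.com/article`, so L2 treats them as one article.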
The LLM as a Transform Layer
In Ballistics and Pulse, transformation meant SQL — dbt models that reshape structured data. In Sentinel, the "transformation" is an LLM that takes article text and extracts structured fields:
```
Clean text → LLM → {
  title, author, publish_date,
  body, summary, sentiment,
  companies[], people[], topics[]
}
```
This creates two problems batch SQL doesn't have:
1. The bottleneck is cost, not throughput. Each LLM call takes 2–5 seconds and costs tokens. The fetcher can saturate the sentinel.raw_html topic far faster than the parser can drain it. Kafka absorbs the pressure difference as buffered messages — that's backpressure for free.
2. The output is non-deterministic. The same article parsed twice might produce slightly different entity lists. This is why the Silver MERGE uses content_hash (an MD5 of the article body) rather than comparing extracted fields. If the source content hasn't changed, the extraction shouldn't re-run — regardless of whether the LLM might produce different output.
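The hash itself is one line. The point is that it is computed from the source body, which is deterministic, rather than from the LLM's output, which isn't:

```python
import hashlib


def content_hash(body: str) -> str:
    """MD5 of the article body: a stable answer to 'did the source content change?'"""
    return hashlib.md5(body.encode("utf-8")).hexdigest()
```

Two fetches of an unchanged article hash identically, so the MERGE can skip them even if a re-run of the LLM would have extracted a slightly different entity list.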
The parser is pluggable — a --provider flag switches between OpenAI, Anthropic, and DeepSeek per instance. This was a practical decision: OpenAI quotas ran out during development, so I needed a fallback. But it also demonstrates a production pattern — run expensive models (GPT-4o, Claude Sonnet) for quality-critical extraction and cheap models (GPT-4o-mini, DeepSeek) for bulk ingestion.
Transaction Boundaries: Why the Parser Doesn't Write to Delta Lake
The parser reads from one Kafka topic and produces to another — sentinel.raw_html in, sentinel.parsed_articles out. That read-process-produce cycle is wrapped in a Kafka transaction (begin, extract, produce, commit). If the LLM call fails or validation rejects the output, the transaction aborts — the offset isn't committed, the output message isn't visible, and the failed message is routed to sentinel.dlq.parse for retry with exponential backoff. No half-written state, no silent data loss.
Critically, the parser does not write to Delta Lake. That's a separate consumer (the Bronze Writer) with its own error boundary. Earlier in development, the parser wrote directly to Delta Lake inside the transaction — but Delta writes aren't part of Kafka's transactional guarantees, so a successful Delta write followed by a failed transaction commit would leave data in Bronze with an uncommitted offset. Separating them keeps each transaction boundary clean.
Delta Lake CDF: Incremental Transforms Without Full Scans
The Bronze → Silver transform needs to process only new articles, not re-scan the entire Bronze table every run. Delta Lake's Change Data Feed (CDF) solves this.
When CDF is enabled on a table, every write creates a versioned changelog entry. The transform reads only changes since the last processed version:
```python
# Transform job: read only new changes from Bronze
from pyspark.sql import functions as F

cdf = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_processed_version + 1)
    .option("endingVersion", current_version)
    .load(bronze_path)
    .filter(F.col("_change_type").isin("insert", "update_postimage"))
)
```
A simple checkpoint file stores the last processed version. First run does a full scan automatically (no checkpoint found), subsequent runs only process new changes. The --full flag forces a reconciliation scan if Silver drifts.
This is the same design pattern as Kafka consumer offsets — "give me everything since my last position." The difference is that Kafka coordinates the streaming stages (producers, fetcher, parser), and CDF coordinates the storage stages (Bronze → Silver). Each tool handles readiness signaling in its domain.
Stateful MERGE: Content Versioning
The Silver MERGE does more than dedup. It tracks whether an article's content has changed over time:
```python
(
    silver_dt.alias("s")
    .merge(
        source_df.alias("b"),
        "s.normalized_url = b.normalized_url",
    )
    .whenMatchedUpdate(
        condition="b.content_hash != s.content_hash AND b.kafka_ts > s.kafka_ts",
        set={
            # ...update content fields...
            "content_version": F.col("s.content_version") + 1,
            "previous_content_hash": "s.content_hash",
            "is_updated": F.lit(True),
            # first_seen_ts intentionally NOT updated
        },
    )
    .whenNotMatchedInsertAll()
    .execute()
)
```
Three outcomes:
| Incoming vs Silver | Action |
|---|---|
| New URL | Insert (`content_version=1`, `is_updated=False`) |
| Same URL + same `content_hash` | Skip (true duplicate) |
| Same URL + different `content_hash` | Update: bump version, store old hash, set `is_updated=True` |
The statefulness is in three lines: s.content_version + 1 reads Silver's current state to compute the update. The MERGE isn't just transforming data — it's making decisions based on what's already there.
If you need the old content, Delta time travel has it. Every previous version of Silver is accessible by version number. No need to store duplicate rows — the history is in the transaction log.
Data Quality: Don't Drop Late Data, Flag It
Every article gets a freshness_status computed from the gap between publish_date and fetch_timestamp:
| Status | Condition |
|---|---|
| `fresh` | Ingested within 48 hours of publication |
| `old` | Ingested 48 hours to 7 days after publication |
| `stale` | Ingested more than 7 days after publication |
| `unknown` | No publish_date available |
Combined with ingestion_lag_hours (the continuous metric) and a composite data_value_score (freshness 40%, completeness 40%, accuracy 20%), Silver consumers can filter by quality without losing data. Articles with data_value_score < 0.3 are excluded from Silver, but they remain in Bronze if you ever need to reprocess them with different thresholds.
The key decision: flag, don't drop. A week-old article about a funding round is still valuable. Dropping it because it's "stale" loses signal. Flagging it lets the consumer decide.
Scaling: Kafka Consumer Groups
Each stage scales horizontally via Kafka consumer groups. If sentinel.raw_html has 3 partitions, you can run up to 3 parser instances in parallel — each gets a partition, no code changes:
```yaml
parser-0:
  command: python -m sentinel.consumers.llm_parser 0
parser-1:
  command: python -m sentinel.consumers.llm_parser 1
```
Kafka rebalances automatically when consumers join or crash. But the real throughput ceiling isn't partitions; it's LLM cost. Three parsers at 2 calls/minute each yield 6 articles per minute. Switching to a cheaper model (GPT-4o-mini, Haiku, or DeepSeek at ~$0.001/article) shifts the bottleneck from cost back to Kafka partitions, which is a one-command fix.
What Production Would Look Like
| Local | Production |
|---|---|
| Kafka (Docker) | Amazon MSK |
| Redis (Docker) | ElastiCache |
| Delta Lake (local filesystem) | Delta Lake on S3 |
| Manual `docker-compose up` | ECS Fargate + KEDA autoscaling |
| PySpark batch transform | Spark Structured Streaming (one-line swap) |
| localhost API | ECS Fargate + ALB |
The CDF-based transform is designed so switching to streaming is a one-line change: spark.read becomes spark.readStream. The MERGE logic, dedup, and quality filters stay identical.
What I Learned
The LLM is a liability, not a feature. It's the slowest, most expensive, least deterministic component in the pipeline. Everything around it (rate limiting, DLQ with exponential backoff, pluggable providers, content hashing to avoid re-extraction) exists to manage that liability. The architecture assumes the LLM will fail.
Streaming and batch coexist. Sentinel isn't purely streaming or purely batch. Kafka coordinates the streaming stages, CDF coordinates the storage stages, and the batch transform sits at the boundary. Production systems almost always have both patterns running in parallel.
Dedup is a system design problem, not a single filter. Each dedup layer catches different classes of duplicates at different costs. Redis is fast but ephemeral. Delta MERGE is durable but expensive. Stacking them means each layer only handles what the previous one missed.
The progression: Ballistics (batch, Airflow, S3) → Pulse (streaming, Kafka, exactly-once) → Sentinel (streaming + LLM, Kafka, Delta Lake, stateful MERGE). Three projects, three ingestion paradigms, each building on the patterns learned in the last.
