What if you could turn any trending topic into a 25-minute investigative podcast episode — automatically, every day?
That's what I built with d33ply. It's an AI-powered investigative podcast platform that monitors global events, researches them from 8+ sources, writes a two-host dialogue script, generates voice audio, and publishes — all without human intervention.
In this post, I'll walk through the architecture, the technical decisions, and the real costs of running this in production.
## The Problem: Headlines Without Depth
News moves fast. Most coverage gives you the headline and maybe two paragraphs. If you actually want to understand a topic — the context, the competing perspectives, the implications — you either need to spend an hour reading or wait for a long-form journalist to cover it (which might take weeks, if it happens at all).
I wanted something that could take a trending topic and produce the kind of deep-dive episode you'd expect from a professional podcast team — but do it the same day the story breaks.
## Architecture Overview
Here's the system at a glance:
```
┌─────────────────────────────────────────────────────────────────┐
│ FLUTTER ADMIN (lib/) - Web + Desktop                            │
│ Trends → RAG (LinkUp+Claude) → Script (Claude) → TTS → Upload   │
│ WEB: Cloud Function proxies | NATIVE: Direct API (.env keys)    │
└──────────────────────┬──────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────────┐
│ FIREBASE                                                        │
│ Firestore (data) + Storage (audio/images) + Functions (BE)      │
└──────────┬──────────────────────────────────────────────────────┘
           ↓
┌──────────────────────────┐  ┌──────────────────────────────────┐
│ SVELTEKIT SITE (site/)   │  │ REMOTION (social/)               │
│ SPA + Svelte 5 runes     │  │ Video rendering (Reels/Shorts)   │
│ Stripe + Audio Player    │  │ Local render server :3456        │
└──────────────────────────┘  └──────────────────────────────────┘
```
The admin app runs on Flutter — which might seem like an unusual choice for a content pipeline, but it lets me run the same codebase on web (Firebase Hosting) and desktop (native macOS). The web build routes all external API calls through Cloud Function proxies to avoid CORS issues and keep API keys server-side. The native build calls APIs directly with keys loaded from .env. Same pipeline logic, two execution paths.
The full production pipeline looks like this:
Trend Detection → RAG Research → Script Generation → TTS → QA → Banner → Publish
Let me break down each stage.
## 1. Trend Detection
The pipeline doesn't just pick a random topic — it actively monitors multiple news APIs and scores candidates against each other.
### Signal Collection
Three primary sources feed the aggregator:
- NewsAPI.ai (primary) — event-based news clustering with global coverage
- The Guardian API — quality journalism signal
- finlight — financial and geopolitical events
Optional enrichment sources include SerpAPI (Google Trends validation) and Apify (social virality scoring from Reddit/Twitter).
### Filtering & Scoring
Raw results pass through a 4-step pipeline:
Step 1 — Collect: Each source returns candidate topics tagged to one of 14 categories across 5 macro areas (Technology, Science & Health, Global Affairs, Business & Innovation, Culture & Lifestyle).
Step 2 — Filter: Each candidate is checked against category-specific keywords and negative keyword lists. These live in CategoryRegistry as defaults, but Firestore overrides (trend_config/overrides) are applied at runtime — so the system can learn what not to fetch without a code deploy.
Step 3 — Score: A composite confidence score (0–100) from four equally weighted dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Sources | 25% | How many independent sources report the story |
| Engagement | 25% | Social signals (Reddit upvotes, comments) |
| Freshness | 25% | Recency — score of 100 for today, decaying with age |
| Quality | 25% | Source credibility and content depth |
Step 4 — Deduplicate: Jaccard similarity at 0.7 threshold. If two candidates have 70%+ word overlap, the higher-scoring one wins. The top 5 per category survive to the dashboard.
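Steps 3 and 4 can be sketched in a few lines of TypeScript. This is a simplified model of the scoring and dedup logic, assuming the candidate shape shown here (the real implementation lives in the Dart admin app and has more fields):

```typescript
// Simplified trend candidate: four sub-scores, each 0-100.
interface TrendCandidate {
  title: string;
  sources: number;
  engagement: number;
  freshness: number;
  quality: number;
}

// Composite confidence: four equally weighted dimensions (25% each).
function confidence(c: TrendCandidate): number {
  return 0.25 * (c.sources + c.engagement + c.freshness + c.quality);
}

// Jaccard similarity over the word sets of two titles.
function jaccard(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\s+/));
  const wb = new Set(b.toLowerCase().split(/\s+/));
  const inter = [...wa].filter((w) => wb.has(w)).length;
  const union = new Set([...wa, ...wb]).size;
  return union === 0 ? 0 : inter / union;
}

// Keep the higher-scoring candidate of any pair above the 0.7 threshold.
function deduplicate(candidates: TrendCandidate[], threshold = 0.7): TrendCandidate[] {
  const sorted = [...candidates].sort((x, y) => confidence(y) - confidence(x));
  const kept: TrendCandidate[] = [];
  for (const c of sorted) {
    if (!kept.some((k) => jaccard(k.title, c.title) >= threshold)) kept.push(c);
  }
  return kept;
}
```

Sorting by confidence first means that whenever two candidates collide, the survivor is automatically the stronger one.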
### Self-Optimizing Quality Loop
This is the part I'm most proud of in the trend system. A quality optimization pipeline runs periodically across all 14 categories:
- Assess — Score each category on 6 dimensions: relevance, cross-source validation, freshness, depth, trending strength, and confidence
- Diagnose — Flag weak categories (composite score < 50)
- Optimize — Claude Haiku generates improved search queries and negative keywords for underperforming categories
- Apply — Winning overrides are stored in Firestore and take effect on the next trend scan
The system also supports three aggregation presets — fast (NewsAPI.ai only, 3 results), standard (multi-source, 5 results), and enhanced (all sources including social, 8 results) — so the dashboard can load quickly for browsing while quality assessments use the full pipeline.
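The assess-and-diagnose half of that loop reduces to a composite threshold check. A sketch, assuming the six dimensions are averaged equally (the real weighting may differ):

```typescript
// One quality assessment per category, six dimensions scored 0-100.
interface Assessment {
  relevance: number;
  crossSource: number;
  freshness: number;
  depth: number;
  trending: number;
  confidence: number;
}

// Composite score: equal average of the six dimensions (assumed).
function composite(a: Assessment): number {
  const vals = [a.relevance, a.crossSource, a.freshness, a.depth, a.trending, a.confidence];
  return vals.reduce((s, v) => s + v, 0) / vals.length;
}

// Diagnose: any category under the composite-50 bar gets flagged
// for query/keyword regeneration by Claude Haiku.
function weakCategories(scores: Record<string, Assessment>): string[] {
  return Object.entries(scores)
    .filter(([, a]) => composite(a) < 50)
    .map(([name]) => name);
}
```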
## 2. 4-Layer RAG Research
This is where it gets interesting. I needed more than "ask an LLM about the topic." LLMs hallucinate, miss recent events, and lack the kind of multi-source synthesis that makes journalism credible.
So I built a 4-layer RAG (Retrieval-Augmented Generation) system:
| Layer | What It Does | Provider | Cost |
|---|---|---|---|
| L1 — Surface Facts | Broad web search for the latest on the topic | LinkUp Search API | ~$0.006 |
| L2 — Context | 5 parallel deep searches for background, data, expert opinions, counter-arguments, historical context | LinkUp (5x parallel) | ~$0.027 |
| L3 — Pattern Analysis | Identify recurring themes, contradictions, and gaps across L1+L2 results | Claude Haiku 3.5 | ~$0.005 |
| L4 — Synthesis | Deep reasoning over all layers to produce a unified research brief | Claude Sonnet 4 | ~$0.048 |
Standard episodes use L1-L2 only (~$0.03). Investigative episodes run all four layers (~$0.085). That's the total RAG cost per episode, not per query.
The key insight: search is cheap, reasoning is where the value is. I originally built this on Perplexity's API for both search and reasoning. Switching to LinkUp for L1-L2 was ~88% cheaper for the same quality of web results. For L3-L4, using Claude instead of Perplexity's reasoning models saved ~68%. Total episode cost dropped ~30-40%.
The architecture uses an AI Provider abstraction (AIProviderFactory) so swapping providers is a config change, not a rewrite. Each layer has a dedicated provider mapping — search goes to LinkUp, reasoning goes to Claude — and the UnifiedRagService orchestrates depth selection (RagDepth.standard vs RagDepth.investigative) with built-in caching.
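The shape of that layer-to-provider mapping looks roughly like this in TypeScript. The provider stubs stand in for the real LinkUp and Anthropic clients, and the names are illustrative rather than the actual AIProviderFactory API:

```typescript
type Depth = "standard" | "investigative";

interface Provider {
  run(prompt: string): string;
}

// Stand-in providers: the real ones call the LinkUp and Anthropic APIs.
const providers: Record<string, Provider> = {
  linkup: { run: (p) => `search: ${p}` },
  claude: { run: (p) => `reasoning: ${p}` },
};

// Layer→provider mapping: search layers go to LinkUp, reasoning to Claude.
const layerProvider: Record<string, Provider> = {
  L1: providers.linkup,
  L2: providers.linkup,
  L3: providers.claude,
  L4: providers.claude,
};

// Depth selection: standard stops after L2, investigative runs all four.
function research(topic: string, depth: Depth): string[] {
  const layers = depth === "standard" ? ["L1", "L2"] : ["L1", "L2", "L3", "L4"];
  return layers.map((layer) => layerProvider[layer].run(`${layer}: ${topic}`));
}
```

Because each layer resolves its provider through the mapping, swapping Perplexity for LinkUp (or Claude for something else) is a one-line change.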
## 3. Script Generation
The RAG output feeds into Claude Opus 4.5, which generates a two-host dialogue script. The script follows a strict 7-section structure:
| Section | Turns | Duration | Purpose |
|---|---|---|---|
| Cold Open | 4-6 | ~2-3 min | Hook + brand signature line |
| Context & Stakes | 8-10 | ~3-4 min | Why this matters NOW |
| Deep Dive | 10-14 | ~4-5 min | Core analysis |
| Human Impact | 8-10 | ~3-4 min | Real stories & effects |
| Challenges & Debate | 6-8 | ~3 min | Critical perspectives |
| Future Outlook | 6-8 | ~2-3 min | Implications going forward |
| Closing | 3-5 | ~1-2 min | Takeaway + promo |
Target: 20-25 minutes total. The brand signature — "It's time to understand this... deeply" — is delivered by the female host by turn 4 of the Cold Open. It sounds like a small detail, but consistent brand moments in AI-generated content are what make it feel produced rather than generated.
### Prompt Engineering as Software Engineering
The prompts are hundreds of lines of structured XML using Anthropic's format:

```xml
<identity>You are a podcast scriptwriter for an investigative show...</identity>
<context>Topic: {topic}\nResearch: {rag_output}</context>
<instructions>1. Create dialogue between two hosts...</instructions>
<output_format>[THEMATIC INTRODUCTION]...</output_format>
```

These get versioned, tested, and iterated on like any other code.
### Title Enhancement
Titles go through a separate Claude Opus 4.5 pass with tight constraints:
- Temperature: 0.55 (lower = more reliable, less cliché)
- Hard-banned words: Game-Changer, Revolutionary, Unleashed, Deep Dive, Unprecedented, and a dozen more
- Hard-banned structures: Alliterative clichés, "X: The Y That Z" patterns
- Requirements: Each title must include a number, named entity, or concrete outcome
- Character limit: 50-90 characters
The banned-words list came from analyzing what makes AI-generated titles feel immediately artificial. Removing those patterns made a measurable difference in click-through.
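As a sketch, the constraints above translate into a simple validation gate. The banned list here is a subset of the real one, and the named-entity check is a crude heuristic standing in for the actual requirement:

```typescript
// Subset of the real banned-words list (lowercased for matching).
const BANNED = ["game-changer", "revolutionary", "unleashed", "deep dive", "unprecedented"];

function titlePasses(title: string): boolean {
  // Hard character limit: 50-90 characters.
  if (title.length < 50 || title.length > 90) return false;

  // Reject any hard-banned word.
  const lower = title.toLowerCase();
  if (BANNED.some((w) => lower.includes(w))) return false;

  // Require a number or (crudely) a capitalized non-initial word
  // as a stand-in for "named entity or concrete outcome".
  const hasNumber = /\d/.test(title);
  const hasEntity = /\s[A-Z][a-z]+/.test(title);
  return hasNumber || hasEntity;
}
```

In practice a failed gate triggers a regeneration pass rather than a hard rejection.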
### Section Parsing
Each [SECTION] marker in the script becomes exactly one audio chunk for TTS. This was a hard-won lesson — the original approach used [CHUNK] markers that split the script into 30+ fragments, producing 40+ minute episodes with awkward pauses. The SemanticScriptParser now produces ~7 semantic sections, each containing all its dialogue turns as a single chunk. Fewer cuts = better pacing.
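A minimal version of the marker-based parse, assuming sections are introduced by bracketed uppercase lines like [COLD OPEN] (the marker format is my assumption; the real SemanticScriptParser is more involved):

```typescript
interface Section {
  name: string;
  lines: string[]; // all dialogue turns in this section, kept as one chunk
}

// Split a script into ~7 semantic sections, one per bracketed marker.
function parseSections(script: string): Section[] {
  const sections: Section[] = [];
  for (const raw of script.split("\n")) {
    const line = raw.trim();
    const marker = line.match(/^\[([A-Z &]+)\]$/);
    if (marker) {
      // New section starts; subsequent turns accumulate under it.
      sections.push({ name: marker[1], lines: [] });
    } else if (line && sections.length > 0) {
      sections[sections.length - 1].lines.push(line);
    }
  }
  return sections;
}
```

Every section becomes exactly one TTS chunk, which is what keeps the cut count (and the awkward pauses) down.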
Cost per script: ~$0.65 (Claude Opus 4.5).
## 4. Text-to-Speech
The TTS stage converts each of the 7 script sections into audio using Qwen3-TTS (1.7B parameters) running on RunPod GPUs.
### Speaker Mapping
The script uses host names ("Sofia:" and "Marco:"), but the TTS model expects speaker tokens. A formatting step converts:

```
"Sofia: Welcome to the show" → "[SPEAKER1] Welcome to the show" (female)
"Marco: Thanks for having me" → "[SPEAKER0] Thanks for having me" (male)
```
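That conversion is essentially one regex. A sketch, assuming the host-to-token mapping shown above:

```typescript
// Host name → speaker token (mapping taken from the example above).
const SPEAKERS: Record<string, string> = {
  Sofia: "[SPEAKER1]", // female voice
  Marco: "[SPEAKER0]", // male voice
};

// Replace a leading "Name: " prefix with its speaker token;
// lines with unknown names pass through untouched.
function toSpeakerTokens(line: string): string {
  return line.replace(/^(\w+):\s*/, (m, name: string) =>
    name in SPEAKERS ? `${SPEAKERS[name]} ` : m
  );
}
```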
### Factory Pattern with Fallback

```
TtsFactory.getService()  → Active engine (Qwen3-TTS)
TtsFactory.getPrimary()  → Qwen3-TTS (official)
TtsFactory.getFallback() → VibeVoice 1.5B (backup)
```
The active engine is a config switch. If Qwen3 has an outage, I flip one string and the pipeline routes to VibeVoice without touching any other code.
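In TypeScript terms (the real code is Dart), the switch looks roughly like this; the plain variable here stands in for the real config store:

```typescript
interface TtsService {
  name: string;
  synthesize(text: string): Promise<Uint8Array>;
}

// Stand-in services: the real ones submit jobs to RunPod.
const qwen3: TtsService = { name: "qwen3-tts", synthesize: async () => new Uint8Array() };
const vibeVoice: TtsService = { name: "vibevoice-1.5b", synthesize: async () => new Uint8Array() };

// The one-string config switch (a variable here; config storage in reality).
let activeEngine = "qwen3-tts";

class TtsFactory {
  static getPrimary(): TtsService { return qwen3; }
  static getFallback(): TtsService { return vibeVoice; }
  static getService(): TtsService {
    // Flipping activeEngine reroutes the whole pipeline.
    return activeEngine === "qwen3-tts" ? qwen3 : vibeVoice;
  }
}
```

Callers only ever ask for `TtsFactory.getService()`, so an outage response is a config edit, not a deploy.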
### Parallel Job Submission
This was a 75% speed improvement. Instead of generating sections sequentially:
```
Sequential: S1→wait→S2→wait→S3→wait... (total: sum of all ≈ 5 min)
Parallel:   Submit S1,S2,S3... → Poll all → Done (total: max of all ≈ 1 min)
```
The implementation has two phases:
- **Phase 1:** Submit all 7 TTS jobs immediately — no `await` between submissions
- **Phase 2:** `Future.wait()` polls all jobs concurrently
Sections complete out of order as GPU workers finish, and the orchestrator reassembles them in sequence.
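The same two-phase pattern in TypeScript (`Promise.all` is the JS analogue of Dart's `Future.wait`; the job-submission API here is a stand-in):

```typescript
// Stand-in: the real call submits a job to RunPod and returns its id.
async function submitJob(section: string): Promise<string> {
  return `job-${section}`;
}

// Stand-in: the real call polls until the GPU worker finishes.
async function pollUntilDone(jobId: string): Promise<string> {
  return `${jobId}.wav`;
}

async function generateAll(sections: string[]): Promise<string[]> {
  // Phase 1: fire off every job with no await between submissions.
  const jobIds = await Promise.all(sections.map(submitJob));
  // Phase 2: poll all jobs concurrently; total time ≈ slowest job.
  // Promise.all preserves input order, so out-of-order completion
  // still reassembles in sequence.
  return Promise.all(jobIds.map(pollUntilDone));
}
```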
## 5. Automated QA
Here's something I don't see discussed much in AI audio pipelines: quality assurance. TTS models occasionally garble words, add strange pauses, or skip phrases. You can't ship that.
I built an automated QA pipeline:
- Qwen3-ASR transcribes each audio section back to text
- Text diff (Levenshtein distance + word alignment) compares the transcript against the original script
- Issues are flagged with timestamps, severity levels (high/medium/low), and word-error-rate (WER)
- A Reaper project file (.rpp) is auto-generated with markers at each flagged timestamp
This means I can open the project in a DAW and jump directly to problems — instead of listening to 25 minutes of audio to find that one garbled word at minute 18.
The issue types are specific: ttsError (model garbled the word), longPause (unexpected silence), missingWord (word dropped), extraWord (hallucinated audio). Each gets a severity based on impact — a garbled proper noun is high severity, an extra breath pause is low.
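The core of the transcript-vs-script comparison is a word-level edit distance. A self-contained sketch of the Levenshtein + WER computation (timestamps, alignment output, and issue typing omitted):

```typescript
// Word-level Levenshtein distance via dynamic programming.
function levenshtein(a: string[], b: string[]): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion (missingWord)
        dp[i][j - 1] + 1, // insertion (extraWord)
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution (ttsError)
      );
    }
  }
  return dp[a.length][b.length];
}

// WER = edit distance divided by reference (script) length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/);
  const hyp = hypothesis.toLowerCase().split(/\s+/);
  return levenshtein(ref, hyp) / ref.length;
}
```

The three edit operations map naturally onto the issue types: substitutions flag garbled words, deletions flag dropped words, insertions flag hallucinated audio.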
## 6. Entity Extraction & Content Graph
When an episode is created, the title, description, and script (capped at 4,000 characters to stay within token limits) are sent to Claude Haiku 3.5 for entity extraction. The model returns structured entities across four types:
```json
{
  "people": ["Donald Trump", "Robert Lighthizer"],
  "organizations": ["Tesla", "SpaceX"],
  "locations": ["Washington DC", "Silicon Valley"],
  "topics": ["tariffs", "AI regulation", "electric vehicles"]
}
```
Each type is capped at 10 entities, and podcast host names (Sofia, Marco, etc.) are explicitly excluded.
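The cap-and-exclude step is straightforward post-processing. A sketch, assuming the field names from the JSON above:

```typescript
// Host names are never entities, even if the model returns them.
const HOST_NAMES = new Set(["Sofia", "Marco"]);

type Entities = Record<"people" | "organizations" | "locations" | "topics", string[]>;

// Drop host names, then cap each entity type at 10.
function sanitize(raw: Entities): Entities {
  const clean = {} as Entities;
  for (const key of Object.keys(raw) as (keyof Entities)[]) {
    clean[key] = raw[key].filter((e) => !HOST_NAMES.has(e)).slice(0, 10);
  }
  return clean;
}
```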
### Heat Classification
A Cloud Function classifies every episode into one of three heat levels based on entity recurrence and overlap with recent content:
| Heat | Criteria | Auto-Archive |
|---|---|---|
| HOT | Recurrence > 5 (30-day window) AND Jaccard entity overlap > 0.7 (7-day window) | 45 days |
| WARM | Recurrence > 2 OR overlap > 0.5 | 180 days |
| EVERGREEN | Default | Never |
Recurrence is the count of recent episodes (last 30 days) sharing the same primary entity. Overlap is the Jaccard similarity of entity sets with episodes from the last 7 days. The thresholds were tuned empirically — 0.7 overlap means two episodes share 70%+ of their entities, which almost always means they're covering the same breaking story.
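The rules in the table reduce to a pure function:

```typescript
type Heat = "HOT" | "WARM" | "EVERGREEN";

// recurrence: episodes in the last 30 days sharing the primary entity.
// overlap: Jaccard entity-set similarity with episodes from the last 7 days.
function classifyHeat(recurrence: number, overlap: number): Heat {
  if (recurrence > 5 && overlap > 0.7) return "HOT";
  if (recurrence > 2 || overlap > 0.5) return "WARM";
  return "EVERGREEN";
}

// Auto-archive horizon per heat level, in days (null = never archive).
const ARCHIVE_DAYS: Record<Heat, number | null> = {
  HOT: 45,
  WARM: 180,
  EVERGREEN: null,
};
```

Note the asymmetry: HOT requires both conditions (recurrence AND overlap), while WARM only needs one, which is what keeps breaking-news clusters from being misfiled as evergreen.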
### Topic Channels
When an episode is classified, a topic channel is auto-created (or updated) with a deterministic ID:

```
${category}-${normalizeEntity(primaryEntity)}
// Example: "ai-ml-elon-musk", "geopolitics-donald-trump"
```
Channels are always created in the database, but only displayed to listeners when episodeCount >= 2. This prevents one-off topics from cluttering the browsing experience. When a second episode about the same entity publishes, the channel suddenly appears — and listeners can browse the full thread.
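A plausible `normalizeEntity` that reproduces the example IDs above (the real implementation may handle more edge cases):

```typescript
// Lowercase, collapse runs of non-alphanumerics to single hyphens,
// and trim hyphens from the ends.
function normalizeEntity(entity: string): string {
  return entity
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Deterministic channel ID: same category + entity always maps
// to the same document, so re-classification is idempotent.
function channelId(category: string, primaryEntity: string): string {
  return `${category}-${normalizeEntity(primaryEntity)}`;
}
```

Determinism is the point: a second episode about the same entity updates the existing channel instead of creating a duplicate.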
### Related Episodes
The discovery system finds related episodes using Jaccard entity overlap > 0.3, weighted by recency. This turns a flat list of episodes into a navigable knowledge graph. Listeners come for one topic and discover related coverage they didn't know existed.
## 7. SEO Layer
AI-generated content needs to be discoverable. I built a server-side rendering layer using Cloud Functions:
- **Episode pages:** Slug-first lookup, with UUID fallback and automatic 301 redirect. If someone hits /episode/{uuid}, the function looks up the slug and redirects to /episode/{slug} — so search engines always see the clean URL.
- **Channel pages:** SSR with JSON-LD structured data (CollectionPage + Breadcrumb schemas). Crawlers get fully rendered HTML; real users get redirected to the SPA.
- **Dynamic sitemap (/sitemap.xml):** Auto-generated from all published episodes (slug-based URLs) and active channels (episodeCount >= 2). Cached for 1 hour.
- **RSS feed (/feed.xml):** RSS 2.0 with full iTunes extensions for podcast directory distribution (Apple Podcasts, Spotify, etc.).
- **Grouped search:** Search results are triaged into channels, HOT episodes (last 7 days), EVERGREEN episodes, and everything else — so the search page surfaces structure, not just a flat list.
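The slug-first lookup with 301 fallback can be sketched as a plain resolver function. `findBySlug` and `findById` here are hypothetical helpers standing in for Firestore queries:

```typescript
interface Episode {
  id: string;   // UUID
  slug: string; // canonical URL segment
  html: string; // server-rendered page
}

// In-memory stand-in for the Firestore episodes collection.
const episodes: Episode[] = [
  { id: "550e8400-e29b-41d4-a716-446655440000", slug: "tariffs-explained", html: "<html>...</html>" },
];

const findBySlug = (s: string) => episodes.find((e) => e.slug === s);
const findById = (id: string) => episodes.find((e) => e.id === id);

// Slug first; UUID falls back to a 301 so crawlers only ever
// index the clean URL.
function resolveEpisode(param: string): { status: number; location?: string; body?: string } {
  const bySlug = findBySlug(param);
  if (bySlug) return { status: 200, body: bySlug.html };
  const byId = findById(param);
  if (byId) return { status: 301, location: `/episode/${byId.slug}` };
  return { status: 404 };
}
```

In the real Cloud Function this return value maps onto the Express-style `res.status(...).send(...)` / `res.redirect(301, ...)` calls.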
## 8. Banner Generation
Each episode gets a unique banner image. Claude Haiku 3.5 generates a cinematic scene description from the episode title, then Flux 2 Pro (via Replicate) renders the image. The prompts are tuned for a consistent dark, editorial aesthetic across all categories.
## The Stack
| Layer | Tech |
|---|---|
| Admin app | Flutter 3.29 + Riverpod (web + desktop) |
| Public site | SvelteKit 2 + Svelte 5 + Tailwind |
| Backend | Firebase Functions (Node.js 22) + Firestore |
| AI | Claude Opus 4.5 (scripts), Haiku 3.5 (quick tasks), LinkUp (search) |
| TTS | Qwen3-TTS 1.7B on RunPod |
| QA | Qwen3-ASR + Whisper (captions) |
| Images | Flux 2 Pro via Replicate |
| Social video | Remotion 4 (Instagram Reels, Daily Bulletins) |
## Real Production Costs
Here's what a single episode actually costs:
| Component | Cost |
|---|---|
| RAG Research (standard, L1-L2) | ~$0.03 |
| Script Generation (Opus 4.5) | ~$0.65 |
| TTS (7 sections, RunPod) | ~$0.15 |
| Banner (Haiku + Flux 2 Pro) | ~$0.08 |
| Total per episode | ~$0.91 |
For investigative episodes (all 4 RAG layers), add ~$0.055 for a total of ~$0.97.
At daily production across multiple categories, we're looking at $5-10/day depending on volume.
One cost-saving detail: the batch generation system saves a checkpoint after the draft step. If the polish step (the Opus 4.5 enhancement pass) fails, you can retry from polish only — skipping RAG + draft, which saves ~$0.19 per failed trend. When you're generating 5-10 episodes in a batch, those retries add up.
## What I Learned
1. Search is solved, reasoning is the bottleneck. Getting facts from the web is cheap and reliable with APIs like LinkUp. The hard part is synthesizing those facts into something coherent and insightful. That's where the multi-layer RAG approach pays off — and why L3-L4 (Claude reasoning) cost more than L1-L2 (LinkUp search) despite doing less I/O.
2. TTS quality requires QA, period. No TTS model is perfect. The ASR-based QA pipeline catches issues that manual listening would miss (or take forever to find). Automate your QA.
3. Prompt engineering is software engineering. The script generation prompts are hundreds of lines of structured XML. They get versioned, tested, and iterated on like any other code. Treat them accordingly.
4. Parallel processing changes everything. Submitting TTS jobs in parallel (instead of sequentially) cut generation time by 75%. Same principle applies to the RAG layers — L2 runs 5 searches in parallel.
5. The content graph is more valuable than individual episodes. Entity extraction and topic channels turned a flat list of episodes into a navigable knowledge base. The heat classification (HOT/WARM/EVERGREEN) with automatic archiving keeps the content fresh without manual curation.
6. Let the system optimize itself. The quality loop — where Claude Haiku assesses trend quality across 14 categories and generates improved search queries for weak ones — means the system gets better at finding stories without me tuning configs by hand.
## Try It
d33ply publishes daily across 14 categories. You can listen on:
- Web: d33ply.xyz
- Apple Podcasts: d33ply on Apple
- Spotify: d33ply on Spotify
- RSS: d33ply.xyz/feed.xml
- YouTube: d33ply playlist
If you're building something similar or have questions about the pipeline, drop a comment — happy to go deeper on any part of this.
Follow me for more posts about building AI-powered content systems.