What if you could turn any trending topic into a 25-minute investigative podcast episode — automatically, every day?
That's what I built with d33ply. It's an AI-powered investigative podcast platform that monitors global events, researches them from 8+ sources, writes a two-host dialogue script, generates voice audio, and publishes — all without human intervention.
In this post, I'll walk through the architecture, the technical decisions, and the real costs of running this in production.
## The Problem: Headlines Without Depth
News moves fast. Most coverage gives you the headline and maybe two paragraphs. If you actually want to understand a topic — the context, the competing perspectives, the implications — you either need to spend an hour reading or wait for a long-form journalist to cover it (which might take weeks, if it happens at all).
I wanted something that could take a trending topic and produce the kind of deep-dive episode you'd expect from a professional podcast team — but do it the same day the story breaks.
## Architecture Overview
Here's the system at a glance:
```
┌─────────────────────────────────────────────────────────────────┐
│ FLUTTER ADMIN (lib/) - Web + Desktop                            │
│ Trends → RAG (LinkUp+Claude) → Script (Claude) → TTS → Upload   │
│ WEB: Cloud Function proxies | NATIVE: Direct API (.env keys)    │
└──────────────────────┬──────────────────────────────────────────┘
                       ↓
┌─────────────────────────────────────────────────────────────────┐
│ FIREBASE                                                        │
│ Firestore (data) + Storage (audio/images) + Functions (BE)      │
└──────────┬──────────────────────────────────────────────────────┘
           ↓
┌──────────────────────────┐  ┌──────────────────────────────────┐
│ SVELTEKIT SITE (site/)   │  │ REMOTION (social/)               │
│ SPA + Svelte 5 runes     │  │ Video rendering (Reels/Shorts)   │
│ Stripe + Audio Player    │  │ Local render server :3456        │
└──────────────────────────┘  └──────────────────────────────────┘
```
The admin app runs on Flutter — which might seem like an unusual choice for a content pipeline, but it lets me run the same codebase on web (Firebase Hosting) and desktop (native macOS). The web build routes all external API calls through Cloud Function proxies to avoid CORS issues and keep API keys server-side. The native build calls APIs directly with keys loaded from .env. Same pipeline logic, two execution paths.
The full production pipeline looks like this:
Trend Detection → RAG Research → Script Generation → TTS → QA → Banner → Publish
Let me break down each stage.
## 1. Trend Detection
The pipeline doesn't just pick a random topic — it actively monitors multiple news APIs and scores candidates against each other.
### Signal Collection
Three primary sources feed the aggregator:
- NewsAPI.ai (primary) — event-based news clustering with global coverage
- The Guardian API — quality journalism signal
- finlight — financial and geopolitical events
Optional enrichment sources include SerpAPI (Google Trends validation) and Apify (social virality scoring from Reddit/Twitter).
### Filtering & Scoring
Raw results pass through a 4-step pipeline:
Step 1 — Collect: Each source returns candidate topics tagged to one of 14 categories across 5 macro areas (Technology, Science & Health, Global Affairs, Business & Innovation, Culture & Lifestyle).
Step 2 — Filter: Each candidate is checked against category-specific keywords and negative keyword lists. These live in CategoryRegistry as defaults, but Firestore overrides (trend_config/overrides) are applied at runtime — so the system can learn what not to fetch without a code deploy.
Step 3 — Score: A composite confidence score (0–100) from four equally weighted dimensions:
| Dimension | Weight | What It Measures |
|---|---|---|
| Sources | 25% | How many independent sources report the story |
| Engagement | 25% | Social signals (Reddit upvotes, comments) |
| Freshness | 25% | Recency — score of 100 for today, decaying with age |
| Quality | 25% | Source credibility and content depth |
Step 4 — Deduplicate: Jaccard similarity at 0.7 threshold. If two candidates have 70%+ word overlap, the higher-scoring one wins. The top 5 per category survive to the dashboard.
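Steps 3 and 4 can be sketched in a few lines of TypeScript. This is a simplified model of the scoring and dedup logic, assuming the candidate shape shown here (the real implementation lives in the Dart admin app and has more fields):

```typescript
// Simplified trend candidate: four sub-scores, each 0-100.
interface TrendCandidate {
  title: string;
  sources: number;
  engagement: number;
  freshness: number;
  quality: number;
}

// Composite confidence: four equally weighted dimensions (25% each).
function confidence(c: TrendCandidate): number {
  return 0.25 * (c.sources + c.engagement + c.freshness + c.quality);
}

// Jaccard similarity over the word sets of two titles.
function jaccard(a: string, b: string): number {
  const wa = new Set(a.toLowerCase().split(/\s+/));
  const wb = new Set(b.toLowerCase().split(/\s+/));
  const inter = [...wa].filter((w) => wb.has(w)).length;
  const union = new Set([...wa, ...wb]).size;
  return union === 0 ? 0 : inter / union;
}

// Keep the higher-scoring candidate of any pair above the 0.7 threshold.
function deduplicate(candidates: TrendCandidate[], threshold = 0.7): TrendCandidate[] {
  const sorted = [...candidates].sort((x, y) => confidence(y) - confidence(x));
  const kept: TrendCandidate[] = [];
  for (const c of sorted) {
    if (!kept.some((k) => jaccard(k.title, c.title) >= threshold)) kept.push(c);
  }
  return kept;
}
```

Sorting by confidence first means that whenever two candidates collide, the survivor is automatically the stronger one.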
### Self-Optimizing Quality Loop
This is the part I'm most proud of in the trend system. A quality optimization pipeline runs periodically across all 14 categories:
- Assess — Score each category on 6 dimensions: relevance, cross-source validation, freshness, depth, trending strength, and confidence
- Diagnose — Flag weak categories (composite score < 50)
- Optimize — Claude Haiku generates improved search queries and negative keywords for underperforming categories
- Apply — Winning overrides are stored in Firestore and take effect on the next trend scan
The system also supports three aggregation presets — fast (NewsAPI.ai only, 3 results), standard (multi-source, 5 results), and enhanced (all sources including social, 8 results) — so the dashboard can load quickly for browsing while quality assessments use the full pipeline.
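The assess-and-diagnose half of that loop reduces to a composite threshold check. A sketch, assuming the six dimensions are averaged equally (the real weighting may differ):

```typescript
// One quality assessment per category, six dimensions scored 0-100.
interface Assessment {
  relevance: number;
  crossSource: number;
  freshness: number;
  depth: number;
  trending: number;
  confidence: number;
}

// Composite score: equal average of the six dimensions (assumed).
function composite(a: Assessment): number {
  const vals = [a.relevance, a.crossSource, a.freshness, a.depth, a.trending, a.confidence];
  return vals.reduce((s, v) => s + v, 0) / vals.length;
}

// Diagnose: any category under the composite-50 bar gets flagged
// for query/keyword regeneration by Claude Haiku.
function weakCategories(scores: Record<string, Assessment>): string[] {
  return Object.entries(scores)
    .filter(([, a]) => composite(a) < 50)
    .map(([name]) => name);
}
```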
## 2. 4-Layer RAG Research
This is where it gets interesting. I needed more than "ask an LLM about the topic." LLMs hallucinate, miss recent events, and lack the kind of multi-source synthesis that makes journalism credible.
So I built a 4-layer RAG (Retrieval-Augmented Generation) system:
| Layer | What It Does | Provider | Cost |
|---|---|---|---|
| L1 — Surface Facts | Broad web search for the latest on the topic | LinkUp Search API | ~$0.006 |
| L2 — Context | 5 parallel deep searches for background, data, expert opinions, counter-arguments, historical context | LinkUp (5x parallel) | ~$0.027 |
| L3 — Pattern Analysis | Identify recurring themes, contradictions, and gaps across L1+L2 results | Claude Haiku 3.5 | ~$0.005 |
| L4 — Synthesis | Deep reasoning over all layers to produce a unified research brief | Claude Sonnet 4 | ~$0.048 |
Standard episodes use L1-L2 only (~$0.03). Investigative episodes run all four layers (~$0.085). That's the total RAG cost per episode, not per query.
The key insight: search is cheap, reasoning is where the value is. I originally built this on Perplexity's API for both search and reasoning. Switching to LinkUp for L1-L2 was ~88% cheaper for the same quality of web results. For L3-L4, using Claude instead of Perplexity's reasoning models saved ~68%. Total episode cost dropped ~30-40%.
The architecture uses an AI Provider abstraction (AIProviderFactory) so swapping providers is a config change, not a rewrite. Each layer has a dedicated provider mapping — search goes to LinkUp, reasoning goes to Claude — and the UnifiedRagService orchestrates depth selection (RagDepth.standard vs RagDepth.investigative) with built-in caching.
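The shape of that layer-to-provider mapping looks roughly like this in TypeScript. The provider stubs stand in for the real LinkUp and Anthropic clients, and the names are illustrative rather than the actual AIProviderFactory API:

```typescript
type Depth = "standard" | "investigative";

interface Provider {
  run(prompt: string): string;
}

// Stand-in providers: the real ones call the LinkUp and Anthropic APIs.
const providers: Record<string, Provider> = {
  linkup: { run: (p) => `search: ${p}` },
  claude: { run: (p) => `reasoning: ${p}` },
};

// Layer→provider mapping: search layers go to LinkUp, reasoning to Claude.
const layerProvider: Record<string, Provider> = {
  L1: providers.linkup,
  L2: providers.linkup,
  L3: providers.claude,
  L4: providers.claude,
};

// Depth selection: standard stops after L2, investigative runs all four.
function research(topic: string, depth: Depth): string[] {
  const layers = depth === "standard" ? ["L1", "L2"] : ["L1", "L2", "L3", "L4"];
  return layers.map((layer) => layerProvider[layer].run(`${layer}: ${topic}`));
}
```

Because each layer resolves its provider through the mapping, swapping Perplexity for LinkUp (or Claude for something else) is a one-line change.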
## 3. Script Generation
The RAG output feeds into Claude Opus 4.5, which generates a two-host dialogue script. The script follows a strict 7-section structure:
| Section | Turns | Duration | Purpose |
|---|---|---|---|
| Cold Open | 4-6 | ~2-3 min | Hook + brand signature line |
| Context & Stakes | 8-10 | ~3-4 min | Why this matters NOW |
| Deep Dive | 10-14 | ~4-5 min | Core analysis |
| Human Impact | 8-10 | ~3-4 min | Real stories & effects |
| Challenges & Debate | 6-8 | ~3 min | Critical perspectives |
| Future Outlook | 6-8 | ~2-3 min | Implications going forward |
| Closing | 3-5 | ~1-2 min | Takeaway + promo |
Target: 20-25 minutes total. The brand signature — "It's time to understand this... deeply" — is delivered by the female host by turn 4 of the Cold Open. It sounds like a small detail, but consistent brand moments in AI-generated content are what make it feel produced rather than generated.
### Prompt Engineering as Software Engineering
The prompts are hundreds of lines of structured XML using Anthropic's format:

```xml
<identity>You are a podcast scriptwriter for an investigative show...</identity>
<context>Topic: {topic}\nResearch: {rag_output}</context>
<instructions>1. Create dialogue between two hosts...</instructions>
<output_format>[THEMATIC INTRODUCTION]...</output_format>
```

These get versioned, tested, and iterated on like any other code.
### Title Enhancement
Titles go through a separate Claude Opus 4.5 pass with tight constraints:
- Temperature: 0.55 (lower = more reliable, less cliché)
- Hard-banned words: Game-Changer, Revolutionary, Unleashed, Deep Dive, Unprecedented, and a dozen more
- Hard-banned structures: Alliterative clichés, "X: The Y That Z" patterns
- Requirements: Each title must include a number, named entity, or concrete outcome
- Character limit: 50-90 characters
The banned-words list came from analyzing what makes AI-generated titles feel immediately artificial. Removing those patterns made a measurable difference in click-through.
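As a sketch, the constraints above translate into a simple validation gate. The banned list here is a subset of the real one, and the named-entity check is a crude heuristic standing in for the actual requirement:

```typescript
// Subset of the real banned-words list (lowercased for matching).
const BANNED = ["game-changer", "revolutionary", "unleashed", "deep dive", "unprecedented"];

function titlePasses(title: string): boolean {
  // Hard character limit: 50-90 characters.
  if (title.length < 50 || title.length > 90) return false;

  // Reject any hard-banned word.
  const lower = title.toLowerCase();
  if (BANNED.some((w) => lower.includes(w))) return false;

  // Require a number or (crudely) a capitalized non-initial word
  // as a stand-in for "named entity or concrete outcome".
  const hasNumber = /\d/.test(title);
  const hasEntity = /\s[A-Z][a-z]+/.test(title);
  return hasNumber || hasEntity;
}
```

In practice a failed gate triggers a regeneration pass rather than a hard rejection.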
### Section Parsing
Each [SECTION] marker in the script becomes exactly one audio chunk for TTS. This was a hard-won lesson — the original approach used [CHUNK] markers that split the script into 30+ fragments, producing 40+ minute episodes with awkward pauses. The SemanticScriptParser now produces ~7 semantic sections, each containing all its dialogue turns as a single chunk. Fewer cuts = better pacing.
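A minimal version of the marker-based parse, assuming sections are introduced by bracketed uppercase lines like [COLD OPEN] (the marker format is my assumption; the real SemanticScriptParser is more involved):

```typescript
interface Section {
  name: string;
  lines: string[]; // all dialogue turns in this section, kept as one chunk
}

// Split a script into ~7 semantic sections, one per bracketed marker.
function parseSections(script: string): Section[] {
  const sections: Section[] = [];
  for (const raw of script.split("\n")) {
    const line = raw.trim();
    const marker = line.match(/^\[([A-Z &]+)\]$/);
    if (marker) {
      // New section starts; subsequent turns accumulate under it.
      sections.push({ name: marker[1], lines: [] });
    } else if (line && sections.length > 0) {
      sections[sections.length - 1].lines.push(line);
    }
  }
  return sections;
}
```

Every section becomes exactly one TTS chunk, which is what keeps the cut count (and the awkward pauses) down.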
Cost per script: ~$0.65 (Claude Opus 4.5).
## 4. Text-to-Speech
The TTS stage converts each of the 7 script sections into audio using Qwen3-TTS (1.7B parameters) running on RunPod GPUs.
### Speaker Mapping
The script uses host names ("Sofia:" and "Marco:"), but the TTS model expects speaker tokens. A formatting step converts:

```
"Sofia: Welcome to the show" → "[SPEAKER1] Welcome to the show" (female)
"Marco: Thanks for having me" → "[SPEAKER0] Thanks for having me" (male)
```
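That conversion is essentially one regex. A sketch, assuming the host-to-token mapping shown above:

```typescript
// Host name → speaker token (mapping taken from the example above).
const SPEAKERS: Record<string, string> = {
  Sofia: "[SPEAKER1]", // female voice
  Marco: "[SPEAKER0]", // male voice
};

// Replace a leading "Name: " prefix with its speaker token;
// lines with unknown names pass through untouched.
function toSpeakerTokens(line: string): string {
  return line.replace(/^(\w+):\s*/, (m, name: string) =>
    name in SPEAKERS ? `${SPEAKERS[name]} ` : m
  );
}
```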
### Factory Pattern with Fallback

```
TtsFactory.getService()  → Active engine (Qwen3-TTS)
TtsFactory.getPrimary()  → Qwen3-TTS (official)
TtsFactory.getFallback() → VibeVoice 1.5B (backup)
```
The active engine is a config switch. If Qwen3 has an outage, I flip one string and the pipeline routes to VibeVoice without touching any other code.
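In TypeScript terms (the real code is Dart), the switch looks roughly like this; the plain variable here stands in for the real config store:

```typescript
interface TtsService {
  name: string;
  synthesize(text: string): Promise<Uint8Array>;
}

// Stand-in services: the real ones submit jobs to RunPod.
const qwen3: TtsService = { name: "qwen3-tts", synthesize: async () => new Uint8Array() };
const vibeVoice: TtsService = { name: "vibevoice-1.5b", synthesize: async () => new Uint8Array() };

// The one-string config switch (a variable here; config storage in reality).
let activeEngine = "qwen3-tts";

class TtsFactory {
  static getPrimary(): TtsService { return qwen3; }
  static getFallback(): TtsService { return vibeVoice; }
  static getService(): TtsService {
    // Flipping activeEngine reroutes the whole pipeline.
    return activeEngine === "qwen3-tts" ? qwen3 : vibeVoice;
  }
}
```

Callers only ever ask for `TtsFactory.getService()`, so an outage response is a config edit, not a deploy.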
### Parallel Job Submission
This was a 75% speed improvement. Instead of generating sections sequentially:
```
Sequential: S1→wait→S2→wait→S3→wait... (total: sum of all ≈ 5 min)
Parallel:   Submit S1,S2,S3... → Poll all → Done (total: max of all ≈ 1 min)
```
The implementation has two phases:
- **Phase 1:** Submit all 7 TTS jobs immediately — no `await` between submissions
- **Phase 2:** `Future.wait()` polls all jobs concurrently
Sections complete out of order as GPU workers finish, and the orchestrator reassembles them in sequence.
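The same two-phase pattern in TypeScript (`Promise.all` is the JS analogue of Dart's `Future.wait`; the job-submission API here is a stand-in):

```typescript
// Stand-in: the real call submits a job to RunPod and returns its id.
async function submitJob(section: string): Promise<string> {
  return `job-${section}`;
}

// Stand-in: the real call polls until the GPU worker finishes.
async function pollUntilDone(jobId: string): Promise<string> {
  return `${jobId}.wav`;
}

async function generateAll(sections: string[]): Promise<string[]> {
  // Phase 1: fire off every job with no await between submissions.
  const jobIds = await Promise.all(sections.map(submitJob));
  // Phase 2: poll all jobs concurrently; total time ≈ slowest job.
  // Promise.all preserves input order, so out-of-order completion
  // still reassembles in sequence.
  return Promise.all(jobIds.map(pollUntilDone));
}
```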
## 5. Automated QA
Here's something I don't see discussed much in AI audio pipelines: quality assurance. TTS models occasionally garble words, add strange pauses, or skip phrases. You can't ship that.
I built an automated QA pipeline:
- Qwen3-ASR transcribes each audio section back to text
- Text diff (Levenshtein distance + word alignment) compares the transcript against the original script
- Issues are flagged with timestamps, severity levels (high/medium/low), and word-error-rate (WER)
- A Reaper project file (.rpp) is auto-generated with markers at each flagged timestamp
This means I can open the project in a DAW and jump directly to problems — instead of listening to 25 minutes of audio to find that one garbled word at minute 18.
The issue types are specific: ttsError (model garbled the word), longPause (unexpected silence), missingWord (word dropped), extraWord (hallucinated audio). Each gets a severity based on impact — a garbled proper noun is high severity, an extra breath pause is low.
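The core of the transcript-vs-script comparison is a word-level edit distance. A self-contained sketch of the Levenshtein + WER computation (timestamps, alignment output, and issue typing omitted):

```typescript
// Word-level Levenshtein distance via dynamic programming.
function levenshtein(a: string[], b: string[]): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion (missingWord)
        dp[i][j - 1] + 1, // insertion (extraWord)
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution (ttsError)
      );
    }
  }
  return dp[a.length][b.length];
}

// WER = edit distance divided by reference (script) length.
function wordErrorRate(reference: string, hypothesis: string): number {
  const ref = reference.toLowerCase().split(/\s+/);
  const hyp = hypothesis.toLowerCase().split(/\s+/);
  return levenshtein(ref, hyp) / ref.length;
}
```

The three edit operations map naturally onto the issue types: substitutions flag garbled words, deletions flag dropped words, insertions flag hallucinated audio.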
## 6. Entity Extraction & Content Graph
When an episode is created, the title, description, and script (capped at 4,000 characters to stay within token limits) are sent to Claude Haiku 3.5 for entity extraction. The model returns structured entities across four types:
```json
{
  "people": ["Donald Trump", "Robert Lighthizer"],
  "organizations": ["Tesla", "SpaceX"],
  "locations": ["Washington DC", "Silicon Valley"],
  "topics": ["tariffs", "AI regulation", "electric vehicles"]
}
```
Each type is capped at 10 entities, and podcast host names (Sofia, Marco, etc.) are explicitly excluded.
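The cap-and-exclude step is straightforward post-processing. A sketch, assuming the field names from the JSON above:

```typescript
// Host names are never entities, even if the model returns them.
const HOST_NAMES = new Set(["Sofia", "Marco"]);

type Entities = Record<"people" | "organizations" | "locations" | "topics", string[]>;

// Drop host names, then cap each entity type at 10.
function sanitize(raw: Entities): Entities {
  const clean = {} as Entities;
  for (const key of Object.keys(raw) as (keyof Entities)[]) {
    clean[key] = raw[key].filter((e) => !HOST_NAMES.has(e)).slice(0, 10);
  }
  return clean;
}
```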
### Heat Classification
A Cloud Function classifies every episode into one of three heat levels based on entity recurrence and overlap with recent content:
| Heat | Criteria | Auto-Archive |
|---|---|---|
| HOT | Recurrence > 5 (30-day window) AND Jaccard entity overlap > 0.7 (7-day window) | 45 days |
| WARM | Recurrence > 2 OR overlap > 0.5 | 180 days |
| EVERGREEN | Default | Never |
Recurrence is the count of recent episodes (last 30 days) sharing the same primary entity. Overlap is the Jaccard similarity of entity sets with episodes from the last 7 days. The thresholds were tuned empirically — 0.7 overlap means two episodes share 70%+ of their entities, which almost always means they're covering the same breaking story.
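The rules in the table reduce to a pure function:

```typescript
type Heat = "HOT" | "WARM" | "EVERGREEN";

// recurrence: episodes in the last 30 days sharing the primary entity.
// overlap: Jaccard entity-set similarity with episodes from the last 7 days.
function classifyHeat(recurrence: number, overlap: number): Heat {
  if (recurrence > 5 && overlap > 0.7) return "HOT";
  if (recurrence > 2 || overlap > 0.5) return "WARM";
  return "EVERGREEN";
}

// Auto-archive horizon per heat level, in days (null = never archive).
const ARCHIVE_DAYS: Record<Heat, number | null> = {
  HOT: 45,
  WARM: 180,
  EVERGREEN: null,
};
```

Note the asymmetry: HOT requires both conditions (recurrence AND overlap), while WARM only needs one, which is what keeps breaking-news clusters from being misfiled as evergreen.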
### Topic Channels
When an episode is classified, a topic channel is auto-created (or updated) with a deterministic ID:

```
${category}-${normalizeEntity(primaryEntity)}
// Example: "ai-ml-elon-musk", "geopolitics-donald-trump"
```
Channels are always created in the database, but only displayed to listeners when episodeCount >= 2. This prevents one-off topics from cluttering the browsing experience. When a second episode about the same entity publishes, the channel suddenly appears — and listeners can browse the full thread.
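A plausible `normalizeEntity` that reproduces the example IDs above (the real implementation may handle more edge cases):

```typescript
// Lowercase, collapse runs of non-alphanumerics to single hyphens,
// and trim hyphens from the ends.
function normalizeEntity(entity: string): string {
  return entity
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Deterministic channel ID: same category + entity always maps
// to the same document, so re-classification is idempotent.
function channelId(category: string, primaryEntity: string): string {
  return `${category}-${normalizeEntity(primaryEntity)}`;
}
```

Determinism is the point: a second episode about the same entity updates the existing channel instead of creating a duplicate.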
### Related Episodes
The discovery system finds related episodes using Jaccard entity overlap > 0.3, weighted by recency. This turns a flat list of episodes into a navigable knowledge graph. Listeners come for one topic and discover related coverage they didn't know existed.
## 7. SEO Layer
AI-generated content needs to be discoverable. I built a server-side rendering layer using Cloud Functions:
- **Episode pages:** Slug-first lookup, with UUID fallback and automatic 301 redirect. If someone hits /episode/{uuid}, the function looks up the slug and redirects to /episode/{slug} — so search engines always see the clean URL.
- **Channel pages:** SSR with JSON-LD structured data (CollectionPage + Breadcrumb schemas). Crawlers get fully rendered HTML; real users get redirected to the SPA.
- **Dynamic sitemap (/sitemap.xml):** Auto-generated from all published episodes (slug-based URLs) and active channels (episodeCount >= 2). Cached for 1 hour.
- **RSS feed (/feed.xml):** RSS 2.0 with full iTunes extensions for podcast directory distribution (Apple Podcasts, Spotify, etc.).
- **Grouped search:** Search results are triaged into channels, HOT episodes (last 7 days), EVERGREEN episodes, and everything else — so the search page surfaces structure, not just a flat list.
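The slug-first lookup with 301 fallback can be sketched as a plain resolver function. `findBySlug` and `findById` here are hypothetical helpers standing in for Firestore queries:

```typescript
interface Episode {
  id: string;   // UUID
  slug: string; // canonical URL segment
  html: string; // server-rendered page
}

// In-memory stand-in for the Firestore episodes collection.
const episodes: Episode[] = [
  { id: "550e8400-e29b-41d4-a716-446655440000", slug: "tariffs-explained", html: "<html>...</html>" },
];

const findBySlug = (s: string) => episodes.find((e) => e.slug === s);
const findById = (id: string) => episodes.find((e) => e.id === id);

// Slug first; UUID falls back to a 301 so crawlers only ever
// index the clean URL.
function resolveEpisode(param: string): { status: number; location?: string; body?: string } {
  const bySlug = findBySlug(param);
  if (bySlug) return { status: 200, body: bySlug.html };
  const byId = findById(param);
  if (byId) return { status: 301, location: `/episode/${byId.slug}` };
  return { status: 404 };
}
```

In the real Cloud Function this return value maps onto the Express-style `res.status(...).send(...)` / `res.redirect(301, ...)` calls.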
## 8. Banner Generation
Each episode gets a unique banner image. Claude Haiku 3.5 generates a cinematic scene description from the episode title, then Flux 2 Pro (via Replicate) renders the image. The prompts are tuned for a consistent dark, editorial aesthetic across all categories.
## The Stack
| Layer | Tech |
|---|---|
| Admin app | Flutter 3.29 + Riverpod (web + desktop) |
| Public site | SvelteKit 2 + Svelte 5 + Tailwind |
| Backend | Firebase Functions (Node.js 22) + Firestore |
| AI | Claude Opus 4.5 (scripts), Haiku 3.5 (quick tasks), LinkUp (search) |
| TTS | Qwen3-TTS 1.7B on RunPod |
| QA | Qwen3-ASR + Whisper (captions) |
| Images | Flux 2 Pro via Replicate |
| Social video | Remotion 4 (Instagram Reels, Daily Bulletins) |
## Real Production Costs
Here's what a single episode actually costs:
| Component | Cost |
|---|---|
| RAG Research (standard, L1-L2) | ~$0.03 |
| Script Generation (Opus 4.5) | ~$0.65 |
| TTS (7 sections, RunPod) | ~$0.15 |
| Banner (Haiku + Flux 2 Pro) | ~$0.08 |
| Total per episode | ~$0.91 |
For investigative episodes (all 4 RAG layers), add ~$0.055 for a total of ~$0.97.
At daily production across multiple categories, we're looking at $5-10/day depending on volume.
One cost-saving detail: the batch generation system saves a checkpoint after the draft step. If the polish step (the Opus 4.5 enhancement pass) fails, you can retry from polish only — skipping RAG + draft, which saves ~$0.19 per failed trend. When you're generating 5-10 episodes in a batch, those retries add up.
## What I Learned
1. Search is solved, reasoning is the bottleneck. Getting facts from the web is cheap and reliable with APIs like LinkUp. The hard part is synthesizing those facts into something coherent and insightful. That's where the multi-layer RAG approach pays off — and why L3-L4 (Claude reasoning) cost more than L1-L2 (LinkUp search) despite doing less I/O.
2. TTS quality requires QA, period. No TTS model is perfect. The ASR-based QA pipeline catches issues that manual listening would miss (or take forever to find). Automate your QA.
3. Prompt engineering is software engineering. The script generation prompts are hundreds of lines of structured XML. They get versioned, tested, and iterated on like any other code. Treat them accordingly.
4. Parallel processing changes everything. Submitting TTS jobs in parallel (instead of sequentially) cut generation time by 75%. Same principle applies to the RAG layers — L2 runs 5 searches in parallel.
5. The content graph is more valuable than individual episodes. Entity extraction and topic channels turned a flat list of episodes into a navigable knowledge base. The heat classification (HOT/WARM/EVERGREEN) with automatic archiving keeps the content fresh without manual curation.
6. Let the system optimize itself. The quality loop — where Claude Haiku assesses trend quality across 14 categories and generates improved search queries for weak ones — means the system gets better at finding stories without me tuning configs by hand.
## Try It
d33ply publishes daily across 14 categories. You can listen on:
- Web: d33ply.xyz
- Apple Podcasts: d33ply on Apple
- Spotify: d33ply on Spotify
- RSS: d33ply.xyz/feed.xml
- YouTube: d33ply playlist
If you're building something similar or have questions about the pipeline, drop a comment — happy to go deeper on any part of this.
Follow me for more posts about building AI-powered content systems.