I read too much. PDFs, newsletters, long articles - my reading list is a graveyard of good intentions. At some point I stopped fighting it and just built a tool to listen to it all instead.
That's Podcastify: paste a URL, upload a PDF, drop some text, and get back a podcast-style audio conversation between two AI hosts discussing your content. Six weeks after launching, we have 3 paying subscribers. Not a hockey stick, but real people handing over real money for a thing I shipped. Here's how it works under the hood.
What it does
The core loop is simple:
- You submit any content - a URL, a PDF, raw text, or an image
- Gemini reads it and writes a Q&A-style conversation between two hosts
- A TTS provider converts that transcript to audio, per speaker
- The segments are merged into a single MP3, stored, and served back to you
The output feels like a podcast episode where two people actually discuss the content, not just read it aloud.
Architecture: why two phases
The pipeline is split into two distinct phases, and this isn't just a design preference - it's a practical necessity.
Phase 1 - Transcript generation
Input → ContentParser → Gemini (LLM) → Transcript → Supabase
Phase 2 - Audio generation
Transcript → TTS (per speaker) → Audio segments → Merge → MP3 → Supabase Storage
Separating them means:
- You can regenerate audio without re-running the LLM (cheaper)
- You can inspect and even edit the transcript before rendering audio
- Failures are isolated, so a TTS hiccup doesn't waste a Gemini call
Both phases run as Celery tasks behind a FastAPI backend, with Redis as the broker. Long-running jobs simply don't belong in an HTTP request/response cycle. A typical generation takes 30–90 seconds depending on content length and TTS provider.
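The two-phase split can be sketched as two plain functions; in production these are Celery tasks, but the shape is easier to see without the broker plumbing. The function names and the stand-in `llm`/`tts` callables are illustrative, not the real task code:

```python
# Sketch of the two-phase split (illustrative, not the production tasks).

def generate_transcript(content: str, llm=None) -> list[dict]:
    """Phase 1: turn parsed content into a host/guest transcript."""
    # `llm` stands in for the Gemini client
    llm = llm or (lambda text: [
        {"speaker": "host", "text": f"Today we're discussing: {text[:40]}"},
        {"speaker": "guest", "text": "Great topic, let's dig in."},
    ])
    transcript = llm(content)
    # Persisting the transcript at this point is what makes phase 2
    # replayable without paying for another LLM call.
    return transcript

def render_audio(transcript: list[dict], tts=None) -> bytes:
    """Phase 2: synthesize each line per speaker, then merge segments."""
    tts = tts or (lambda text, speaker: f"[{speaker}:{text}]".encode())
    segments = [tts(line["text"], line["speaker"]) for line in transcript]
    return b"".join(segments)  # the real pipeline merges MP3 audio instead

transcript = generate_transcript("Why async pipelines beat request/response")
audio = render_audio(transcript)
```

Because each phase takes and returns plain data, a TTS failure can be retried by calling `render_audio` again on the stored transcript.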
The stack
Frontend: Next.js 16 (App Router) + React 19 + TypeScript + Tailwind CSS 4
Backend: FastAPI + Celery + Redis
Database: Supabase (PostgreSQL + Auth + file storage)
LLM: Google Gemini
TTS: Factory pattern → ElevenLabs / OpenAI / Gemini TTS / Edge TTS
The frontend calls Next.js API routes, which proxy to the FastAPI backend. This keeps secrets server-side and gives us a clean separation between the Next.js layer (auth, UX, billing) and the Python layer (AI, heavy lifting).
For storage, a single Supabase bucket (audios_n_transcripts) holds both transcripts (JSON) and final audio (MP3). Row-level security keeps everything scoped to the generating user.
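The post doesn't spell out the bucket layout, but one scheme that makes user-scoped RLS straightforward is to prefix every object with the owner's ID, so policies can match on the first path segment. These helper names and paths are hypothetical:

```python
# Hypothetical path scheme for the audios_n_transcripts bucket.
# Scoping every object under the user's ID lets a row-level-security
# policy grant access by matching the first path segment.

def transcript_path(user_id: str, job_id: str) -> str:
    return f"{user_id}/{job_id}/transcript.json"

def audio_path(user_id: str, job_id: str) -> str:
    return f"{user_id}/{job_id}/episode.mp3"

p = audio_path("user-123", "job-456")
```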
The hardest part: parsing anything
The promise, "submit any content", is easy to say and painful to implement.
The ContentParser service has to handle:
- Web pages: rendered via Playwright (headless Chromium), because half the modern web is JavaScript-rendered and can't be scraped with a simple HTTP fetch
- PDFs: text extraction, with layout awareness to avoid garbled column ordering
- Images: sent directly to Gemini's multimodal endpoint
- Raw text: trivial, but still needs cleaning and length normalization
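The dispatch logic above can be sketched as a single entry point that branches on content type. This is a minimal illustration, not the real `ContentParser` service; the heavy branches are stubbed with comments describing what actually happens:

```python
# Sketch of ContentParser dispatch (names are assumptions, not the real service).

from dataclasses import dataclass

@dataclass
class ParsedContent:
    text: str
    source_type: str

def parse(content: str, content_type: str) -> ParsedContent:
    if content_type == "url":
        # would render via Playwright and extract the visible text
        text = f"<rendered page text for {content}>"
    elif content_type == "pdf":
        # would extract text with layout awareness (column ordering)
        text = f"<pdf text for {content}>"
    elif content_type == "image":
        # would go straight to Gemini's multimodal endpoint
        text = f"<multimodal description for {content}>"
    elif content_type == "text":
        text = " ".join(content.split())  # cleaning / length normalization
    else:
        raise ValueError(f"unsupported content type: {content_type}")
    return ParsedContent(text=text, source_type=content_type)

result = parse("  hello   world ", "text")
```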
Playwright in particular adds real overhead: it's a full browser. We run it in the Celery worker rather than the API process, and cache aggressively to avoid re-fetching the same URL.
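The caching idea can be sketched with a pluggable fetcher, so the expensive Playwright render only runs on a cache miss. The real cache would live in Redis with a TTL; a dict shows the shape:

```python
# URL caching around an expensive fetch, sketched with a plain dict.
# In production the cache would be Redis with a TTL.

class CachingFetcher:
    def __init__(self, fetch):
        self._fetch = fetch          # e.g. a Playwright-backed callable
        self._cache: dict[str, str] = {}
        self.misses = 0

    def get(self, url: str) -> str:
        if url not in self._cache:
            self.misses += 1         # only cache misses hit the browser
            self._cache[url] = self._fetch(url)
        return self._cache[url]

fetcher = CachingFetcher(lambda url: f"<html for {url}>")
first = fetcher.get("https://example.com")
second = fetcher.get("https://example.com")
```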
TTS: the factory pattern
Different TTS providers have very different tradeoffs - latency, voice quality, cost, language support. Rather than hardcoding one, we use a factory:
# tts/factory.py
def get_tts_provider(name: str) -> BaseTTSProvider:
    providers = {
        "gemini": GeminiTTSProvider,
        "openai": OpenAITTSProvider,
        "elevenlabs": ElevenLabsTTSProvider,
        "edge": EdgeTTSProvider,
    }
    return providers[name]()
Each provider implements the same interface: synthesize(text, voice, language) -> bytes. Swapping providers is a config change, not a code change. This matters because TTS pricing and quality move fast, and we've already switched defaults once.
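A minimal sketch of that shared interface: only the `synthesize(text, voice, language) -> bytes` signature comes from the post; the base class name matches the factory's return type, and the fake provider is purely illustrative.

```python
# Minimal sketch of the shared TTS provider interface.

from abc import ABC, abstractmethod

class BaseTTSProvider(ABC):
    @abstractmethod
    def synthesize(self, text: str, voice: str, language: str) -> bytes:
        """Return raw audio bytes for one speaker's line."""

class FakeTTSProvider(BaseTTSProvider):
    """Stand-in provider; real ones wrap ElevenLabs, OpenAI, Gemini, or Edge."""
    def synthesize(self, text: str, voice: str, language: str) -> bytes:
        return f"{voice}/{language}:{text}".encode()

audio = FakeTTSProvider().synthesize("Hello", voice="host_a", language="en")
```

Because callers only see the base class, switching the default provider really is a config change.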
Monetization: card-first reverse trial
The billing model went through a few iterations. Here's where we landed:
- No free tier - new signups must enter a credit card to unlock generation
- 7-day Hobby trial managed by Stripe, with trial_period_days: 7 and payment_method_collection: "always"
- After the trial: auto-charge unless cancelled
- Quota is enforced in audio characters (TTS character count), not generation count, which is fairer for users with varying content lengths
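With Stripe Checkout, the two parameters mentioned above land in the session creation call roughly like this. The price ID is a placeholder; only trial_period_days and payment_method_collection come from the setup described here:

```python
# Sketch of the Checkout Session parameters implied by the setup above.
# The price ID is a placeholder.

checkout_params = {
    "mode": "subscription",
    "line_items": [{"price": "price_hobby_placeholder", "quantity": 1}],
    # collect a card up front, even though the trial itself is free
    "payment_method_collection": "always",
    "subscription_data": {"trial_period_days": 7},
}
# stripe.checkout.Session.create(**checkout_params)  # the actual API call
```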
The middleware (proxy.ts) enforces the paywall: any generation attempt without an active subscription returns a 401 with a redirect to the checkout page. No subscription row in the DB = no generation, full stop.
This "reverse trial" approach (card first, trial after) filters out people who were never going to pay, and converts the ones who get value quickly. Three paying users in six weeks from a technical product with zero marketing spend. Not viral, but validated.
Lessons learned
Ship the infra first. The async job pipeline (FastAPI + Celery + Redis) was the most painful part to set up, but getting it right early meant every feature after was just another task type.
Two-phase pipelines are worth it. The ability to inspect and replay individual phases saved hours of debugging and reduced AI API costs significantly during development.
TTS quality is a product differentiator. Users notice voice quality immediately. The factory abstraction let us tune this without touching business logic.
Quota in output units, not input actions. Charging per generation sounds simple but punishes users who feed short content. Characters generated is a much better proxy for actual resource consumption.
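Metering in output characters is simple to implement: tally the characters actually sent to the TTS provider and check them against the plan limit before rendering. The 50,000-character limit below is illustrative, not the real plan quota:

```python
# Quota metering in TTS output characters (sketch; limit is illustrative).

def tts_characters(transcript: list[dict]) -> int:
    """Characters actually sent to the TTS provider."""
    return sum(len(line["text"]) for line in transcript)

def within_quota(used: int, transcript: list[dict], limit: int = 50_000) -> bool:
    return used + tts_characters(transcript) <= limit

ok = within_quota(49_000, [{"speaker": "host", "text": "x" * 500}])
```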
Card-first converts. Adding the reverse trial (vs. a freemium model) was uncomfortable to ship but immediately filtered signal from noise.
What's next
- Multi-language support (the TTS layer is ready; the LLM prompts need work)
- Transcript editing UI before audio render
- Podcast RSS feeds so you can subscribe to your own generated shows in any app
If you're building something similar or want to try it out: podcastify.io
Stack questions, architecture feedback, roast my code, happy to discuss in the comments.