I read too much. PDFs, newsletters, long articles - my reading list is a graveyard of good intentions. At some point I stopped fighting it and just built a tool to listen to it all instead.
That's Podcastify: paste a URL, upload a PDF, drop some text, and get back a podcast-style audio conversation between two AI hosts discussing your content. Six weeks after launching, we have 3 paying subscribers. Not a hockey stick, but real people handing over real money for a thing I shipped. Here's how it works under the hood.
What it does
The core loop is simple:
- You submit any content - a URL, a PDF, raw text, or an image
- Gemini reads it and writes a Q&A-style conversation between two hosts
- A TTS provider converts that transcript to audio, per speaker
- The segments are merged into a single MP3, stored, and served back to you
The output feels like a podcast episode where two people actually discuss the content, not just read it aloud.
Architecture: why two phases
The pipeline is split into two distinct phases, and this isn't just a design preference - it's a practical necessity.
Phase 1 - Transcript generation
Input → ContentParser → Gemini (LLM) → Transcript → Supabase
Phase 2 - Audio generation
Transcript → TTS (per speaker) → Audio segments → Merge → MP3 → Supabase Storage
Separating them means:
- You can regenerate audio without re-running the LLM (cheaper)
- You can inspect and even edit the transcript before rendering audio
- Failures are isolated, so a TTS hiccup doesn't waste a Gemini call
Both phases run as Celery tasks behind a FastAPI backend, with Redis as the broker. Long-running jobs simply don't belong in an HTTP request/response cycle. A typical generation takes 30–90 seconds depending on content length and TTS provider.
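The two-phase split can be sketched as two plain functions; in production these are Celery tasks, but the shape is easier to see without the broker plumbing. The function names and the stand-in `llm`/`tts` callables are illustrative, not the real task code:

```python
# Sketch of the two-phase split (illustrative, not the production tasks).

def generate_transcript(content: str, llm=None) -> list[dict]:
    """Phase 1: turn parsed content into a host/guest transcript."""
    # `llm` stands in for the Gemini client
    llm = llm or (lambda text: [
        {"speaker": "host", "text": f"Today we're discussing: {text[:40]}"},
        {"speaker": "guest", "text": "Great topic, let's dig in."},
    ])
    transcript = llm(content)
    # Persisting the transcript at this point is what makes phase 2
    # replayable without paying for another LLM call.
    return transcript

def render_audio(transcript: list[dict], tts=None) -> bytes:
    """Phase 2: synthesize each line per speaker, then merge segments."""
    tts = tts or (lambda text, speaker: f"[{speaker}:{text}]".encode())
    segments = [tts(line["text"], line["speaker"]) for line in transcript]
    return b"".join(segments)  # the real pipeline merges MP3 audio instead

transcript = generate_transcript("Why async pipelines beat request/response")
audio = render_audio(transcript)
```

Because each phase takes and returns plain data, a TTS failure can be retried by calling `render_audio` again on the stored transcript.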
The stack
Frontend: Next.js 16 (App Router) + React 19 + TypeScript + Tailwind CSS 4
Backend: FastAPI + Celery + Redis
Database: Supabase (PostgreSQL + Auth + file storage)
LLM: Google Gemini
TTS: Factory pattern → ElevenLabs / OpenAI / Gemini TTS / Edge TTS
The frontend calls Next.js API routes, which proxy to the FastAPI backend. This keeps secrets server-side and gives us a clean separation between the Next.js layer (auth, UX, billing) and the Python layer (AI, heavy lifting).
For storage, a single Supabase bucket (audios_n_transcripts) holds both transcripts (JSON) and final audio (MP3). Row-level security keeps everything scoped to the generating user.
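The post doesn't spell out the bucket layout, but one scheme that makes user-scoped RLS straightforward is to prefix every object with the owner's ID, so policies can match on the first path segment. These helper names and paths are hypothetical:

```python
# Hypothetical path scheme for the audios_n_transcripts bucket.
# Scoping every object under the user's ID lets a row-level-security
# policy grant access by matching the first path segment.

def transcript_path(user_id: str, job_id: str) -> str:
    return f"{user_id}/{job_id}/transcript.json"

def audio_path(user_id: str, job_id: str) -> str:
    return f"{user_id}/{job_id}/episode.mp3"

p = audio_path("user-123", "job-456")
```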
The hardest part: parsing anything
The promise, "submit any content", is easy to say and painful to implement.
The ContentParser service has to handle:
- Web pages: rendered via Playwright (headless Chromium), because half the modern web is JavaScript-rendered and can't be scraped with a simple HTTP fetch
- PDFs: text extraction, with layout awareness to avoid garbled column ordering
- Images: sent directly to Gemini's multimodal endpoint
- Raw text: trivial, but still needs cleaning and length normalization
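The dispatch logic above can be sketched as a single entry point that branches on content type. This is a minimal illustration, not the real `ContentParser` service; the heavy branches are stubbed with comments describing what actually happens:

```python
# Sketch of ContentParser dispatch (names are assumptions, not the real service).

from dataclasses import dataclass

@dataclass
class ParsedContent:
    text: str
    source_type: str

def parse(content: str, content_type: str) -> ParsedContent:
    if content_type == "url":
        # would render via Playwright and extract the visible text
        text = f"<rendered page text for {content}>"
    elif content_type == "pdf":
        # would extract text with layout awareness (column ordering)
        text = f"<pdf text for {content}>"
    elif content_type == "image":
        # would go straight to Gemini's multimodal endpoint
        text = f"<multimodal description for {content}>"
    elif content_type == "text":
        text = " ".join(content.split())  # cleaning / length normalization
    else:
        raise ValueError(f"unsupported content type: {content_type}")
    return ParsedContent(text=text, source_type=content_type)

result = parse("  hello   world ", "text")
```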
Playwright in particular adds real overhead: it's a full browser. We run it in the Celery worker rather than the API process, and cache aggressively to avoid re-fetching the same URL.
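The caching idea can be sketched with a pluggable fetcher, so the expensive Playwright render only runs on a cache miss. The real cache would live in Redis with a TTL; a dict shows the shape:

```python
# URL caching around an expensive fetch, sketched with a plain dict.
# In production the cache would be Redis with a TTL.

class CachingFetcher:
    def __init__(self, fetch):
        self._fetch = fetch          # e.g. a Playwright-backed callable
        self._cache: dict[str, str] = {}
        self.misses = 0

    def get(self, url: str) -> str:
        if url not in self._cache:
            self.misses += 1         # only cache misses hit the browser
            self._cache[url] = self._fetch(url)
        return self._cache[url]

fetcher = CachingFetcher(lambda url: f"<html for {url}>")
first = fetcher.get("https://example.com")
second = fetcher.get("https://example.com")
```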
TTS: the factory pattern
Different TTS providers have very different tradeoffs - latency, voice quality, cost, language support. Rather than hardcoding one, we use a factory:
# tts/factory.py
def get_tts_provider(name: str) -> BaseTTSProvider:
    providers = {
        "gemini": GeminiTTSProvider,
        "openai": OpenAITTSProvider,
        "elevenlabs": ElevenLabsTTSProvider,
        "edge": EdgeTTSProvider,
    }
    return providers[name]()
Each provider implements the same interface: synthesize(text, voice, language) -> bytes. Swapping providers is a config change, not a code change. This matters because TTS pricing and quality move fast, and we've already switched defaults once.
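A minimal sketch of that shared interface: only the `synthesize(text, voice, language) -> bytes` signature comes from the post; the base class name matches the factory's return type, and the fake provider is purely illustrative.

```python
# Minimal sketch of the shared TTS provider interface.

from abc import ABC, abstractmethod

class BaseTTSProvider(ABC):
    @abstractmethod
    def synthesize(self, text: str, voice: str, language: str) -> bytes:
        """Return raw audio bytes for one speaker's line."""

class FakeTTSProvider(BaseTTSProvider):
    """Stand-in provider; real ones wrap ElevenLabs, OpenAI, Gemini, or Edge."""
    def synthesize(self, text: str, voice: str, language: str) -> bytes:
        return f"{voice}/{language}:{text}".encode()

audio = FakeTTSProvider().synthesize("Hello", voice="host_a", language="en")
```

Because callers only see the base class, switching the default provider really is a config change.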
Monetization: card-first reverse trial
The billing model went through a few iterations. Here's where we landed:
- No free tier - new signups must enter a credit card to unlock generation
- 7-day Hobby trial managed by Stripe, with trial_period_days: 7 and payment_method_collection: "always"
- After the trial: auto-charge unless cancelled
- Quota is enforced in audio characters (TTS character count), not generation count, which is fairer for users with varying content lengths
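With Stripe Checkout, the two parameters mentioned above land in the session creation call roughly like this. The price ID is a placeholder; only trial_period_days and payment_method_collection come from the setup described here:

```python
# Sketch of the Checkout Session parameters implied by the setup above.
# The price ID is a placeholder.

checkout_params = {
    "mode": "subscription",
    "line_items": [{"price": "price_hobby_placeholder", "quantity": 1}],
    # collect a card up front, even though the trial itself is free
    "payment_method_collection": "always",
    "subscription_data": {"trial_period_days": 7},
}
# stripe.checkout.Session.create(**checkout_params)  # the actual API call
```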
The middleware (proxy.ts) enforces the paywall: any generation attempt without an active subscription returns a 401 with a redirect to the checkout page. No subscription row in the DB = no generation, full stop.
This "reverse trial" approach (card first, trial after) filters out people who were never going to pay, and converts the ones who get value quickly. Three paying users in six weeks from a technical product with zero marketing spend. Not viral, but validated.
Lessons learned
Ship the infra first. The async job pipeline (FastAPI + Celery + Redis) was the most painful part to set up, but getting it right early meant every feature after was just another task type.
Two-phase pipelines are worth it. The ability to inspect and replay individual phases saved hours of debugging and reduced AI API costs significantly during development.
TTS quality is a product differentiator. Users notice voice quality immediately. The factory abstraction let us tune this without touching business logic.
Quota in output units, not input actions. Charging per generation sounds simple but punishes users who feed short content. Characters generated is a much better proxy for actual resource consumption.
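Metering in output characters is simple to implement: tally the characters actually sent to the TTS provider and check them against the plan limit before rendering. The 50,000-character limit below is illustrative, not the real plan quota:

```python
# Quota metering in TTS output characters (sketch; limit is illustrative).

def tts_characters(transcript: list[dict]) -> int:
    """Characters actually sent to the TTS provider."""
    return sum(len(line["text"]) for line in transcript)

def within_quota(used: int, transcript: list[dict], limit: int = 50_000) -> bool:
    return used + tts_characters(transcript) <= limit

ok = within_quota(49_000, [{"speaker": "host", "text": "x" * 500}])
```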
Card-first converts. Adding the reverse trial (vs. a freemium model) was uncomfortable to ship but immediately filtered signal from noise.
What's next
- Multi-language support (the TTS layer is ready; the LLM prompts need work)
- Transcript editing UI before audio render
- Podcast RSS feeds so you can subscribe to your own generated shows in any app
If you're building something similar or want to try it out: podcastify.io
Stack questions, architecture feedback, roast my code, happy to discuss in the comments.