Luis Novo

Building a podcast generation app

Hi everyone, I am the maintainer of Open Notebook, an open source version of Google's Notebook LM that works locally and with a wide array of models and providers.

One of the biggest reasons people use this tool is the audio learning feature (aka podcasts). So I decided to spin off a separate project that specializes in just that.

Enter Podcast Creator

An app for generating podcasts in a simple and extensible way:
https://github.com/lfnovo/podcast-creator

My challenge was to:

  • Have a state-of-the-art transcript generation process that adds value to the conversation, rather than just chit-chat
  • Use multiple providers for text and audio
  • Enable 1-4 speakers, instead of the hardcoded 2-speaker format in Notebook LM
  • Make it very extensible and easy to use for devs and non-devs

This post showcases the end result and shares some tips on the challenges and decisions made along the way.

Transcript Quality

We all know that LLMs tend to lose text quality if you ask them to generate a lot of content in one go, like "give me a transcript for a 20 minute podcast". No matter what model you use, you'll be disappointed.

Using an outline + transcript strategy

So, the first thing I tried was to have one LLM call for building an outline of what should be discussed in the podcast and then a second call for generating the transcript based on that.

Although it did improve the quality a bit, the result was still underwhelming.

Outline + writing sections in parallel

The second attempt was to create an outline with, say, 10 sections and then parallelize the creation of the dialogue for each section.

Result: the sections didn't "fit" each other nicely. Each one was written without seeing the others, so there was no continuity and the flow broke many times.

Success: Outline + Sequential generation

This worked beautifully. I generate the outline, write the dialogue for section 1, then pass the transcript so far as context when writing section 2, and so on, accumulating the transcript as I go. Yes, it does consume more tokens, but the end result was worth it.
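The loop described above can be sketched roughly as follows. This is a minimal illustration, not the actual Podcast Creator code: `generate_transcript` and `fake_write_section` are hypothetical names, and the fake writer stands in for a real LLM call so the example runs on its own.

```python
from typing import Callable


def generate_transcript(
    outline: list[str],
    write_section: Callable[[str, str], str],
) -> str:
    """Write one outline section at a time, feeding back everything written so far."""
    sections_so_far: list[str] = []
    for section in outline:
        # The accumulated transcript becomes context for the next section
        context = "\n".join(sections_so_far)
        sections_so_far.append(write_section(section, context))
    return "\n".join(sections_so_far)


def fake_write_section(section: str, context: str) -> str:
    # Hypothetical stand-in for an LLM call; it only reports how much context it saw
    prior_lines = len(context.splitlines())
    return f"[{section}] written with {prior_lines} prior line(s) of context"


outline = ["Intro", "Main idea", "Wrap-up"]
transcript = generate_transcript(outline, fake_write_section)
print(transcript)
```

Because each call receives a growing context, the prompt size increases with every section, which is where the extra token cost comes from.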

Workflow management

Since the project needed some workflow management (extract files -> sequential transcript -> individual audio -> merging audio into a single file), LangGraph was my tool of choice. It does a great job at this and comes with free session tracking and resuming. Not a hard choice.
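To make the shape of that workflow concrete, here is a plain-Python sketch of the four stages passing a shared state along. It deliberately does not use LangGraph; it only mimics the general pattern of each step returning a partial state update. All function names, state keys, and file names are made up for illustration.

```python
from typing import Callable

State = dict[str, object]
Stage = Callable[[State], State]


def extract_content(state: State) -> State:
    # Hypothetical stand-in for reading and cleaning the input files
    return {"source_text": "notes about transformers"}


def write_transcript(state: State) -> State:
    # Stand-in for the sequential transcript generation step
    topic = state["source_text"]
    return {"transcript": [f"Host: Today we cover {topic}.", "Guest: Let's dig in."]}


def synthesize_clips(state: State) -> State:
    # One audio clip per transcript line (file names are placeholders)
    return {"clips": [f"clip_{i:03d}.mp3" for i in range(len(state["transcript"]))]}


def merge_clips(state: State) -> State:
    # Stand-in for merging the clips into a single file
    return {"final_audio": f"episode_{len(state['clips'])}_clips.mp3"}


def run_pipeline(stages: list[Stage], state: State) -> State:
    """Run stages in order, merging each partial update into the shared state."""
    for stage in stages:
        state = {**state, **stage(state)}
    return state


final_state = run_pipeline(
    [extract_content, write_transcript, synthesize_clips, merge_clips], {}
)
print(final_state["final_audio"])
```

What a workflow library adds on top of a loop like this is the part that is tedious to build by hand, such as tracking a session and resuming it after a failure.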

Speaker unique traits

The second challenge was to make each speaker behave like a different person, adding depth to the conversation and making it less robotic. After messing a lot with the prompts, I ended up settling on the option to define each speaker's backstory and personality. Worked like a charm. Setting a speaker to be controversial generated my best episodes so far 😂.
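As a rough sketch of what such a speaker definition could look like in code: the field names below follow the ones used in the prompt template later in this post, but the `Speaker` class, `validate_speakers`, and the sample personas are hypothetical and not the library's actual schema.

```python
from dataclasses import dataclass

MAX_SPEAKERS = 4  # the project goal is 1-4 speakers per episode


@dataclass(frozen=True)
class Speaker:
    name: str
    backstory: str
    personality: str


def validate_speakers(speakers: list[Speaker]) -> list[Speaker]:
    """Reject panels outside the supported 1-4 speaker range."""
    if not 1 <= len(speakers) <= MAX_SPEAKERS:
        raise ValueError(f"expected 1-{MAX_SPEAKERS} speakers, got {len(speakers)}")
    return speakers


# Made-up personas; giving one of them a contrarian streak keeps the dialogue lively
panel = validate_speakers([
    Speaker("Ada", "Former compiler engineer turned educator", "Patient, loves analogies"),
    Speaker("Max", "Founder who has shipped three ML products", "Controversial, challenges every claim"),
])
print([speaker.name for speaker in panel])
```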

This is one I created using 4 speakers and a clone of my own voice made with ElevenLabs:

And this one is featuring a single speaker, like your personal teacher:

Easy to use API for the developers

I wanted this to be a very simple method call for devs so that they could drop this in any working app in minutes.

So, instead of having them mess around with LangGraph directly, I just wrapped it in an easy-to-use method call:

import asyncio
from podcast_creator import create_podcast

async def main():
    result = await create_podcast(
        content="Your content here...",
        briefing="Create an engaging discussion about...",
        episode_name="my_podcast",
        output_dir="output/my_podcast",
        speaker_config="ai_researchers"
    )
    print(f"✅ Podcast created: {result['final_output_file_path']}")

asyncio.run(main())

Easy to use interface for other humans

Not everyone who uses this type of tool is a developer, so I wanted it to have an easy-to-use UI. In the Python world, it's hard to beat the ease of use of Streamlit. Very few lines of code and you have yourself a responsive working app.

Templates without the string concat nightmare

I hate building my prompts by messing around with strings and loops. So I am using Jinja for a much more elegant solution:

The podcast features the following speakers:
<speakers>
{% for speaker in speakers %}
- **{{ speaker.name }}**: {{ speaker.backstory }}
  Personality: {{ speaker.personality }}
{% endfor %}
</speakers>
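
For reference, here is a minimal sketch of rendering a template like the one above with the `jinja2` package. The speaker data is invented for the example, and enabling `trim_blocks` and `lstrip_blocks` is my assumption for keeping the output free of blank lines, not necessarily how the project configures Jinja.

```python
from jinja2 import Environment

TEMPLATE = """The podcast features the following speakers:
<speakers>
{% for speaker in speakers %}
- **{{ speaker.name }}**: {{ speaker.backstory }}
  Personality: {{ speaker.personality }}
{% endfor %}
</speakers>"""

# trim_blocks / lstrip_blocks stop the {% ... %} lines from leaving blank lines behind
env = Environment(trim_blocks=True, lstrip_blocks=True)

# Made-up speakers; plain dicts work because Jinja falls back to key lookup
speakers = [
    {"name": "Ada", "backstory": "Former compiler engineer", "personality": "Patient"},
    {"name": "Max", "backstory": "Startup founder", "personality": "Controversial"},
]

rendered = env.from_string(TEMPLATE).render(speakers=speakers)
print(rendered)
```

Adding a third or fourth speaker only means adding another entry to the list; the template itself stays untouched.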

That's all folks

This was a fun project to build. I licensed it under MIT so anyone can play around and basically do whatever they want with it :)

Take it for a spin: https://github.com/lfnovo/podcast-creator

Hope you enjoy it.
