I stumbled upon this gem on GitHub and had to share it with the community. If you're working with podcasts, audio transcription, or building AI applications that need clean, structured audio data - this is going to save you weeks of work.
What is be-flow-dtd?
be-flow-dtd (Download → Transcribe → Diarize) is a production-ready podcast transcription pipeline that:
- Automatically fetches new podcast episodes via the Taddy API
- Transcribes audio with word-level timestamps using Whisper large-v3
- Identifies who speaks when using Pyannote 3.1 speaker diarization
- Matches voices against known speaker embeddings (ECAPA-TDNN)
- Uploads structured JSON to Supabase cloud storage
Why This Matters for AI Developers
The output is perfectly formatted for LLM training, RAG systems, and semantic search:
```json
{
  "transcript": [
    {
      "text": "Hello and welcome to the show.",
      "start": 0.0,
      "end": 2.5,
      "speaker_id": "host-uuid",
      "speaker_name": "John Host",
      "words": [
        {"text": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98}
      ]
    }
  ]
}
```
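This segment-level schema makes speaker-attributed chunking straightforward. A minimal sketch of consuming it for RAG ingestion (the `to_rag_chunks` helper is hypothetical, not part of the repo):

```python
def to_rag_chunks(payload: dict) -> list[str]:
    """Flatten a transcript payload into speaker-attributed text chunks."""
    chunks = []
    for seg in payload["transcript"]:
        speaker = seg.get("speaker_name") or seg.get("speaker_id", "unknown")
        # Keep the timestamps so retrieved chunks can link back to the audio.
        chunks.append(f"[{seg['start']:.1f}-{seg['end']:.1f}] {speaker}: {seg['text']}")
    return chunks

payload = {
    "transcript": [
        {
            "text": "Hello and welcome to the show.",
            "start": 0.0,
            "end": 2.5,
            "speaker_id": "host-uuid",
            "speaker_name": "John Host",
            "words": [{"text": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98}],
        }
    ]
}

print(to_rag_chunks(payload))
# → ['[0.0-2.5] John Host: Hello and welcome to the show.']
```

Each chunk carries speaker identity and timing, which is exactly what you want for citation-backed retrieval.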
The Architecture is Clean
```
Taddy API  →  Download  →  Transcribe  →  Diarize    →  Identify  →  Upload
(episodes)    (yt-dlp)     (Whisper)      (Pyannote)    (ECAPA)      (Supabase)
```
Each GPU model is loaded sequentially with explicit VRAM management - no more OOM errors!
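The repo's exact loader code isn't shown here, but the sequential load-run-free pattern can be sketched like this (the `run_stage` helper and stage names are illustrative; on a real GPU the cleanup step would also call `torch.cuda.empty_cache()` to return cached VRAM to the driver):

```python
import gc

def run_stage(load_model, run, audio_path):
    """Load one model, run it on the audio, then free it before the next stage.

    Keeping only one model resident at a time is what lets the whole
    pipeline fit on an 8GB card without out-of-memory errors.
    """
    model = load_model()
    try:
        return run(model, audio_path)
    finally:
        del model
        gc.collect()  # plus torch.cuda.empty_cache() in the real pipeline

# Hypothetical stages illustrating the sequential flow:
transcript = run_stage(lambda: "whisper-large-v3",
                       lambda m, a: f"{m} transcribed {a}", "ep01.mp3")
speakers = run_stage(lambda: "pyannote-3.1",
                     lambda m, a: f"{m} diarized {a}", "ep01.mp3")
print(transcript)
# → whisper-large-v3 transcribed ep01.mp3
```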
Key Features I Love
✅ Automatic Episode Discovery - Set it and forget it
✅ State Tracking - SQLite prevents reprocessing
✅ GPU Optimized - Works on 8GB VRAM cards
✅ Docker Ready - Deploy anywhere with docker-compose
✅ MIT Licensed - Use it however you want
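The SQLite state tracking boils down to a processed-episodes table checked before each run. A minimal sketch of the idea (table and column names are my guesses, not the repo's actual schema):

```python
import sqlite3

def already_processed(conn: sqlite3.Connection, episode_id: str) -> bool:
    """Return True if the episode was handled in a previous run."""
    row = conn.execute(
        "SELECT 1 FROM processed WHERE episode_id = ?", (episode_id,)
    ).fetchone()
    return row is not None

def mark_processed(conn: sqlite3.Connection, episode_id: str) -> None:
    """Record an episode so later runs skip it (idempotent)."""
    conn.execute(
        "INSERT OR IGNORE INTO processed (episode_id) VALUES (?)", (episode_id,)
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # the real pipeline would use a file on disk
conn.execute("CREATE TABLE processed (episode_id TEXT PRIMARY KEY)")

print(already_processed(conn, "ep-001"))  # → False
mark_processed(conn, "ep-001")
print(already_processed(conn, "ep-001"))  # → True
```

`INSERT OR IGNORE` against the primary key makes re-marking an episode safe, so a crashed run can simply be restarted.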
Quick Start
```bash
git clone https://github.com/goonerstrike/be-flow-dtd
cd be-flow-dtd
pip install -r requirements.txt
# Configure your API keys in .env
python main.py --dry-run --verbose
```
You'll need:
- Taddy API key (podcast metadata)
- HuggingFace token (for Pyannote models)
- Supabase credentials (cloud storage)
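A quick sanity check before the first run saves a failed pipeline halfway through a download. A sketch that verifies the credentials are present in the environment (the variable names here are assumptions; check the repo's `.env` template for the real ones):

```python
import os

# Assumed credential names - confirm against the project's .env example.
REQUIRED = ["TADDY_API_KEY", "HUGGINGFACE_TOKEN", "SUPABASE_URL", "SUPABASE_KEY"]

def missing_credentials() -> list[str]:
    """Return the names of any required credentials absent from the environment."""
    return [name for name in REQUIRED if not os.environ.get(name)]

missing = missing_credentials()
print("missing:", ", ".join(missing) or "none")
```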
Built with CocoIndex
The project uses CocoIndex for pipeline visualization and monitoring. Run `cocoindex server -ci cocoindex_flow.py` and connect to CocoInsight for real-time visibility into your data flows.
Performance
On an RTX 3090/4090, you can process ~100 hours of audio per day with the large-v3 model. For smaller GPUs (8GB), the medium model works great.
GitHub: https://github.com/goonerstrike/be-flow-dtd
If you're building anything with podcast data, audio transcription, or need speaker-attributed transcripts for your AI projects - definitely check this out. The structured JSON output is chef's kiss for downstream ML pipelines.
Has anyone else been working on similar audio processing pipelines? Would love to hear about your approaches!