I stumbled upon this gem on GitHub and had to share it with the community. If you're working with podcasts, audio transcription, or building AI applications that need clean, structured audio data - this is going to save you weeks of work.
What is be-flow-dtd?
be-flow-dtd (Download → Transcribe → Diarize) is a production-ready podcast transcription pipeline that:
- Automatically fetches new podcast episodes via the Taddy API
- Transcribes audio with word-level timestamps using Whisper large-v3
- Identifies who speaks when using Pyannote 3.1 speaker diarization
- Matches voices against known speaker embeddings (ECAPA-TDNN)
- Uploads structured JSON to Supabase cloud storage
Why This Matters for AI Developers
The output is perfectly formatted for LLM training, RAG systems, and semantic search:
```json
{
  "transcript": [
    {
      "text": "Hello and welcome to the show.",
      "start": 0.0,
      "end": 2.5,
      "speaker_id": "host-uuid",
      "speaker_name": "John Host",
      "words": [
        {"text": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98}
      ]
    }
  ]
}
```
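This segment-level schema makes speaker-attributed chunking straightforward. A minimal sketch of consuming it for RAG ingestion (the `to_rag_chunks` helper is hypothetical, not part of the repo):

```python
def to_rag_chunks(payload: dict) -> list[str]:
    """Flatten a transcript payload into speaker-attributed text chunks."""
    chunks = []
    for seg in payload["transcript"]:
        speaker = seg.get("speaker_name") or seg.get("speaker_id", "unknown")
        # Keep the timestamps so retrieved chunks can link back to the audio.
        chunks.append(f"[{seg['start']:.1f}-{seg['end']:.1f}] {speaker}: {seg['text']}")
    return chunks

payload = {
    "transcript": [
        {
            "text": "Hello and welcome to the show.",
            "start": 0.0,
            "end": 2.5,
            "speaker_id": "host-uuid",
            "speaker_name": "John Host",
            "words": [{"text": "Hello", "start": 0.0, "end": 0.4, "confidence": 0.98}],
        }
    ]
}

print(to_rag_chunks(payload))
# → ['[0.0-2.5] John Host: Hello and welcome to the show.']
```

Each chunk carries speaker identity and timing, which is exactly what you want for citation-backed retrieval.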
The Architecture is Clean
```
Taddy API  →  Download  →  Transcribe  →  Diarize    →  Identify  →  Upload
(episodes)    (yt-dlp)     (Whisper)      (Pyannote)    (ECAPA)      (Supabase)
```
Each GPU model is loaded sequentially with explicit VRAM management - no more OOM errors!
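The repo's exact loader code isn't shown here, but the sequential load-run-free pattern can be sketched like this (the `run_stage` helper and stage names are illustrative; on a real GPU the cleanup step would also call `torch.cuda.empty_cache()` to return cached VRAM to the driver):

```python
import gc

def run_stage(load_model, run, audio_path):
    """Load one model, run it on the audio, then free it before the next stage.

    Keeping only one model resident at a time is what lets the whole
    pipeline fit on an 8GB card without out-of-memory errors.
    """
    model = load_model()
    try:
        return run(model, audio_path)
    finally:
        del model
        gc.collect()  # plus torch.cuda.empty_cache() in the real pipeline

# Hypothetical stages illustrating the sequential flow:
transcript = run_stage(lambda: "whisper-large-v3",
                       lambda m, a: f"{m} transcribed {a}", "ep01.mp3")
speakers = run_stage(lambda: "pyannote-3.1",
                     lambda m, a: f"{m} diarized {a}", "ep01.mp3")
print(transcript)
# → whisper-large-v3 transcribed ep01.mp3
```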
Key Features I Love
✅ Automatic Episode Discovery - Set it and forget it
✅ State Tracking - SQLite prevents reprocessing
✅ GPU Optimized - Works on 8GB VRAM cards
✅ Docker Ready - Deploy anywhere with docker-compose
✅ MIT Licensed - Use it however you want
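The SQLite state tracking boils down to a processed-episodes table checked before each run. A minimal sketch of the idea (table and column names are my guesses, not the repo's actual schema):

```python
import sqlite3

def already_processed(conn: sqlite3.Connection, episode_id: str) -> bool:
    """Return True if the episode was handled in a previous run."""
    row = conn.execute(
        "SELECT 1 FROM processed WHERE episode_id = ?", (episode_id,)
    ).fetchone()
    return row is not None

def mark_processed(conn: sqlite3.Connection, episode_id: str) -> None:
    """Record an episode so later runs skip it (idempotent)."""
    conn.execute(
        "INSERT OR IGNORE INTO processed (episode_id) VALUES (?)", (episode_id,)
    )
    conn.commit()

conn = sqlite3.connect(":memory:")  # the real pipeline would use a file on disk
conn.execute("CREATE TABLE processed (episode_id TEXT PRIMARY KEY)")

print(already_processed(conn, "ep-001"))  # → False
mark_processed(conn, "ep-001")
print(already_processed(conn, "ep-001"))  # → True
```

`INSERT OR IGNORE` against the primary key makes re-marking an episode safe, so a crashed run can simply be restarted.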
Quick Start
```bash
git clone https://github.com/goonerstrike/be-flow-dtd
cd be-flow-dtd
pip install -r requirements.txt
# Configure your API keys in .env
python main.py --dry-run --verbose
```
You'll need:
- Taddy API key (podcast metadata)
- HuggingFace token (for Pyannote models)
- Supabase credentials (cloud storage)
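A quick sanity check before the first run saves a failed pipeline halfway through a download. A sketch that verifies the credentials are present in the environment (the variable names here are assumptions; check the repo's `.env` template for the real ones):

```python
import os

# Assumed credential names - confirm against the project's .env example.
REQUIRED = ["TADDY_API_KEY", "HUGGINGFACE_TOKEN", "SUPABASE_URL", "SUPABASE_KEY"]

def missing_credentials() -> list[str]:
    """Return the names of any required credentials absent from the environment."""
    return [name for name in REQUIRED if not os.environ.get(name)]

missing = missing_credentials()
print("missing:", ", ".join(missing) or "none")
```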
Built with CocoIndex
The project uses CocoIndex for pipeline visualization and monitoring. Run `cocoindex server -ci cocoindex_flow.py` and connect to CocoInsight for real-time visibility into your data flows.
Performance
On an RTX 3090/4090, you can process ~100 hours of audio per day with the large-v3 model. For smaller GPUs (8GB), the medium model works great.
GitHub: https://github.com/goonerstrike/be-flow-dtd
If you're building anything with podcast data, audio transcription, or need speaker-attributed transcripts for your AI projects - definitely check this out. The structured JSON output is chef's kiss for downstream ML pipelines.
Has anyone else been working on similar audio processing pipelines? Would love to hear about your approaches!