Stop paying OpenAI to transcribe your voice notes (My offline Telegram bot stack) 🎙️

Every tutorial on building an AI Telegram bot right now uses the exact same lazy architecture:

  • User sends a voice message.
  • Bot downloads the .ogg file.
  • Bot sends the file to OpenAI's Whisper API.
  • You get billed per minute of audio.

This is fine if you are building a quick prototype. But if you actually use your bot every single day, you are burning money on a task your own CPU can do for free. Not to mention the privacy nightmare of shipping all your personal audio logs to a third-party cloud.

The local alternative ⚙️
I wanted to build a Telegram interface for the Nomi API. I heavily rely on voice messages, so I needed speech-to-text.

Instead of defaulting to a paid API, I built the entire transcription pipeline locally using Vosk and FFmpeg.

The workflow is dead simple:

  • Telegram sends the .ogg voice note.
  • FFmpeg runs a local process to convert it to a .wav file with the correct sample rate.
  • The offline Vosk model reads the file and returns the text.
  • The text is passed on to the LLM.
```python
# The core logic is just wrapping the offline tools
# No network requests, no API keys for the transcription

async def process_voice(file_path):
    wav_path = convert_ogg_to_wav_ffmpeg(file_path)
    text = await run_vosk_transcription(wav_path)
    return text
```
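The actual helpers live in the repo; as a rough sketch of the Vosk half (the function body and the default model path here are my own assumptions, not the repo's exact code), you wrap the blocking Vosk recognizer in a worker thread so the bot's event loop stays responsive:

```python
import asyncio
import json
import wave

async def run_vosk_transcription(wav_path: str, model_path: str = "model") -> str:
    # Vosk's API is blocking, so run it off the event loop.
    from vosk import Model, KaldiRecognizer  # pip install vosk

    def _transcribe() -> str:
        model = Model(model_path)
        with wave.open(wav_path, "rb") as wf:
            rec = KaldiRecognizer(model, wf.getframerate())
            while True:
                chunk = wf.readframes(4000)
                if not chunk:
                    break
                rec.AcceptWaveform(chunk)
        # FinalResult() returns a JSON string like {"text": "..."}
        return json.loads(rec.FinalResult())["text"]

    return await asyncio.to_thread(_transcribe)
```

This is why FFmpeg has to normalize the audio first: Vosk expects a mono 16-bit PCM WAV, and the recognizer is fed the file's actual sample rate.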

Why this matters 🧱
Network latency disappears. Your server isn't waiting on a cloud provider to process an audio file. The transcription happens instantly, offline, and costs absolutely zero dollars.

More importantly, it forces you to understand your tools. Setting up async subprocesses for FFmpeg inside a Telegram bot (I use aiogram) teaches you way more about backend architecture than just making another HTTP request to OpenAI.
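That async-subprocess pattern looks roughly like this (a minimal sketch under my own naming, not the repo's exact code): `asyncio.create_subprocess_exec` spawns FFmpeg without blocking, so the bot keeps handling updates while the conversion runs.

```python
import asyncio

async def run_cmd(*argv: str) -> tuple[int, bytes, bytes]:
    # Spawn a subprocess without blocking the event loop.
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await proc.communicate()
    return proc.returncode, out, err

async def convert_ogg_to_wav(src: str, dst: str) -> str:
    # 16 kHz mono is what Vosk expects from the WAV file.
    code, _, err = await run_cmd(
        "ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst)
    if code != 0:
        raise RuntimeError(f"ffmpeg failed: {err.decode()[:200]}")
    return dst
```

The same `run_cmd` wrapper works for any external binary you want to drive from an aiogram handler.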

You can find the entire implementation in my NomiAssistantTG repo. It handles the Telegram async loop, the offline voice processing, and the API connections.

Why I build micro-tools (and a quick favor) 🤝
Developers default to paid SaaS tools because setting up local tools like FFmpeg and Vosk feels like a headache.

I spend my time taking those headaches, wrapping them into clean, reusable Python code, and dropping them on GitHub. My repositories aren't theoretical frameworks; they are the actual tools I use to bypass expensive API limits.

If my NomiAssistantTG code just saved you a monthly OpenAI Whisper bill, or gave you a working async boilerplate for your next Telegram bot, consider dropping a sponsorship on my GitHub:

👉 Sponsor AmaLS367 on GitHub

Your support directly buys the time I need to keep breaking things, reading awful documentation, and open-sourcing production-ready templates so you don't have to.

TL;DR

  • Stop sending basic audio transcription to the cloud.
  • Use FFmpeg to normalize Telegram voice notes.
  • Use Vosk for free, offline, instant speech-to-text.
  • Keep your architecture lean.

Are you still using Whisper API for personal projects, or have you moved to local models? Let me know 👇
