Every tutorial on building an AI Telegram bot right now uses the exact same lazy architecture:
- User sends a voice message.
- Bot downloads the .ogg file.
- Bot sends the file to OpenAI's Whisper API.
- You get billed per minute of audio.
This is fine if you are building a quick prototype. But if you actually use your bot every single day, you are burning money on a task your own CPU can do for free. Not to mention the privacy nightmare of shipping all your personal audio logs to a third-party cloud.
The local alternative ⚙️
I wanted to build a Telegram interface for the Nomi API. I heavily rely on voice messages, so I needed speech-to-text.
Instead of defaulting to a paid API, I built the entire transcription pipeline locally using Vosk and FFmpeg.
The workflow is dead simple:
- Telegram sends the .ogg voice note.
- FFmpeg runs a local process to convert it to a 16 kHz mono .wav file (the sample rate Vosk's small models expect).
- The offline Vosk model reads the file and returns the text.
- Then the text is sent to the LLM.
```python
# The core logic is just wrapping the offline tools:
# no network requests, no API keys for the transcription.
async def process_voice(file_path):
    wav_path = convert_ogg_to_wav_ffmpeg(file_path)
    text = await run_vosk_transcription(wav_path)
    return text
```
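For reference, here is roughly what a `run_vosk_transcription` helper can look like. This is a hedged sketch, not the repo's actual code: the function names and the chunked-read loop are mine, but `KaldiRecognizer`, `AcceptWaveform`, `Result`, and `FinalResult` are the real Vosk API, which emits each finished utterance as a small JSON fragment.

```python
import json
import wave


def join_vosk_results(raw_results):
    """Merge the JSON fragments Vosk emits ({"text": "..."}) into one transcript."""
    texts = [json.loads(r).get("text", "") for r in raw_results]
    return " ".join(t for t in texts if t)


def transcribe_wav(wav_path, model_dir="model"):
    """Run a 16 kHz mono WAV file through Vosk chunk by chunk, fully offline."""
    # Third-party package: pip install vosk, plus a downloaded model folder
    # (model_dir is an assumed path, adjust to wherever you unpacked the model).
    from vosk import Model, KaldiRecognizer

    results = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            if rec.AcceptWaveform(data):   # True once an utterance is finalized
                results.append(rec.Result())
        results.append(rec.FinalResult())  # flush whatever is left in the buffer
    return join_vosk_results(results)
```

Wrap `transcribe_wav` in `asyncio.to_thread(...)` if you call it from an async handler, since the recognizer loop itself is blocking.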
Why this matters 🧱
Network latency disappears. Your server isn't waiting on a cloud provider to process an audio file. The transcription happens instantly, offline, and costs absolutely zero dollars.
More importantly, it forces you to understand your tools. Setting up async subprocesses for FFmpeg inside a Telegram bot (I use aiogram) teaches you way more about backend architecture than just making another HTTP request to OpenAI.
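To make that concrete, here is a minimal sketch of running FFmpeg as an async subprocess so the bot's event loop never blocks on conversion. The function names and exact flag set are my assumptions, not code from the repo; `-ar 16000 -ac 1` resamples to the 16 kHz mono format the small Vosk models expect.

```python
import asyncio


def ffmpeg_args(src: str, dst: str) -> list:
    # -y overwrites the output; -ar 16000 / -ac 1 resample to 16 kHz mono
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]


async def convert_ogg_to_wav(src: str, dst: str) -> str:
    # create_subprocess_exec keeps the event loop free while FFmpeg runs,
    # so the bot can keep handling other updates during conversion.
    proc = await asyncio.create_subprocess_exec(
        *ffmpeg_args(src, dst),
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.DEVNULL,
    )
    returncode = await proc.wait()
    if returncode != 0:
        raise RuntimeError(f"ffmpeg failed with exit code {returncode}")
    return dst
```

Inside an aiogram handler you would simply `await convert_ogg_to_wav(ogg_path, wav_path)` before handing the result to the transcriber.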
You can find the entire implementation in my NomiAssistantTG repo. It handles the Telegram async loop, the offline voice processing, and the API connections.
Why I build micro-tools (and a quick favor) 🤝
Developers default to paid SaaS tools because setting up local binaries like FFmpeg and Vosk feels like a headache.
I spend my time taking those headaches, wrapping them into clean, reusable Python code, and dropping them on GitHub. My repositories aren't theoretical frameworks; they are the actual tools I use to bypass expensive API limits.
If my NomiAssistantTG code just saved you a monthly OpenAI Whisper bill, or gave you a working async boilerplate for your next Telegram bot, consider dropping a sponsorship on my GitHub.
Your support directly buys the time I need to keep breaking things, reading awful documentation, and open-sourcing production-ready templates so you don't have to.
TL;DR
- Stop sending basic audio transcription to the cloud.
- Use FFmpeg to normalize Telegram voice notes.
- Use Vosk for free, offline, instant speech-to-text.
- Keep your architecture lean.
Are you still using Whisper API for personal projects, or have you moved to local models? Let me know 👇
Top comments (2)
Solid approach! The privacy argument for local transcription is underrated. Sending all your voice messages to OpenAI servers — including potentially sensitive conversations — is a real concern most people don't think about.
I went through a similar "build it yourself" journey with my Telegram Mini App. Not with transcription, but with the general philosophy of keeping things self-hosted and lean. My stack (Node.js + PostgreSQL + React) runs on a single VPS and handles everything from anonymous messaging to payments via Telegram Stars. Total monthly cost: ~$10.
The aiogram async approach you mention is solid. On the Node.js side, I found that using Telegraf with Express middleware gives you similar flexibility — you can handle both bot webhooks and Mini App API routes in the same process.
One question though: how does Vosk compare to Whisper in terms of accuracy for non-English languages? I've been considering adding voice-to-text features and quality across languages is a dealbreaker for my use case (EN + RU users).
Yeah, Telegraf + Express on a cheap VPS is the way to go.
About Vosk vs Whisper: Whisper is definitely more accurate. Vosk struggles with punctuation and heavy background noise.
I stick to Vosk because of what happens after the transcription in my stack. The raw text goes straight into an LLM, and the model easily fixes missing commas and slightly misheard words based on context. I don't need perfect phonetic accuracy; I just need the intent.
If you actually need to display the transcribed text back to your users (RU+EN), you should probably look into self-hosting whisper.cpp. But if you're just extracting intent to trigger an API or backend logic, Vosk is fast, completely offline, and does the job.