Solved: Convert Voice Memos from Telegram to Text using OpenAI Whisper API

#devops #programming #tutorial #cloud

🚀 Executive Summary

TL;DR: This project solves the problem of unstructured voice memos in Telegram by creating a Python bot that automatically transcribes them. It uses the Telegram Bot API to receive voice notes and the OpenAI Whisper API to convert them into searchable, copy-pasteable text, so you no longer have to replay audio to find a specific thought.

🎯 Key Takeaways

The solution integrates python-telegram-bot for message handling, pydub with ffmpeg for Ogg Opus to MP3 audio conversion, and the openai library for Whisper API transcription.
Secure management of API keys is achieved using python-dotenv to load TELEGRAM_BOT_TOKEN and OPENAI_API_KEY from a config.env file, preventing hardcoding.
Temporary audio files (OGA and MP3) are downloaded, processed, and then removed with os.remove inside a finally block, so they are deleted even when an error occurs and do not pile up on disk.

Convert Voice Memos from Telegram to Text using OpenAI Whisper API

Alright, team. Darian here. Let’s talk about efficiency. I used to leave myself voice memos on the go—quick thoughts, reminders, even mini-debug sessions while walking the dog. The problem? They’d pile up in my Telegram “Saved Messages,” becoming a black hole of unstructured audio. Listening back to find one specific thought was a huge time sink. This little project changed that. Now, I just send a voice note to a bot, and a few seconds later, I get a clean text transcription back. It’s searchable, copy-pasteable, and has genuinely saved me a couple of hours a week.

This isn’t just a gimmick; it’s a powerful way to bridge the gap between spoken ideas and actionable, written data. Let’s build it.

Prerequisites

Before we dive in, make sure you have the following ready. We’re all busy, so getting this sorted out first will make the process much smoother.

A Telegram Bot Token: You can get this from the BotFather on Telegram. Just start a chat with him, create a new bot, and he’ll give you an API token.
An OpenAI API Key: You’ll need an account on the OpenAI platform. Grab your API key from your account dashboard.
Python Environment: A working Python 3.8+ installation.
FFmpeg: This is a crucial dependency for audio processing. You’ll need to install it on your system. A quick search for “install ffmpeg on [your OS]” will get you there. Pydub, the library we’ll use, depends on it.

The Guide: Step-by-Step

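I’ll skip the standard virtual environment setup (venv, etc.) since you likely have your own workflow for that. Let’s jump straight to the logic. You’ll need four Python libraries: python-telegram-bot, openai, python-dotenv, and pydub. Install them with your package installer, for example pip install python-telegram-bot openai python-dotenv pydub. The code in this guide uses the async Application API from python-telegram-bot version 20 or later and the OpenAI client class from openai version 1.0 or later, so older releases of either library will not work.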

Step 1: Environment and Configuration

First rule of production: never hardcode secrets. We’ll store our API keys in a config.env file. Create a file with that name in your project directory and add your keys like this:

TELEGRAM_BOT_TOKEN="YOUR_TELEGRAM_TOKEN_HERE"
OPENAI_API_KEY="YOUR_OPENAI_KEY_HERE"

Now, let’s start our Python script. We’ll call it transcriber_bot.py. We’ll begin by importing the necessary libraries and loading our environment variables.

import os
import logging
from dotenv import load_dotenv
from telegram import Update
from telegram.ext import Application, MessageHandler, filters, ContextTypes
from openai import OpenAI
from pydub import AudioSegment

# Load environment variables from config.env
load_dotenv('config.env')

# Setup basic logging
logging.basicConfig(
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    level=logging.INFO
)
logger = logging.getLogger(__name__)

# Initialize OpenAI client
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
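
Since os.getenv returns None when a variable is missing, it can help to fail at startup with a message that names the missing variable. The following is a minimal sketch using only the standard library; the helper name require_env is hypothetical and not part of the original script.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value


# Example usage (after load_dotenv has run):
# openai_api_key = require_env("OPENAI_API_KEY")
# telegram_token = require_env("TELEGRAM_BOT_TOKEN")
```

With a check like this, a typo in config.env shows up as one clear error at launch rather than as a failure on the first voice message.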

Step 2: The Telegram Bot Core Logic

Next, we’ll set up the main structure of our bot. This involves creating an Application instance and adding a MessageHandler. We specifically want to filter for voice messages, so we’ll use filters.VOICE.

async def handle_voice_message(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    # This is where the magic will happen. We'll fill this in next.
    await update.message.reply_text("Processing your voice memo...")
    # (Future steps go here)

def main() -> None:
    """Start the bot."""
    telegram_token = os.getenv("TELEGRAM_BOT_TOKEN")
    if not telegram_token:
        logger.error("TELEGRAM_BOT_TOKEN not found in environment variables!")
        return

    application = Application.builder().token(telegram_token).build()

    # Add a handler for voice messages
    application.add_handler(MessageHandler(filters.VOICE, handle_voice_message))

    # Start the Bot
    logger.info("Bot is starting...")
    application.run_polling()

if __name__ == '__main__':
    main()

This boilerplate code sets up a listener. When the bot receives a voice message, it will call our handle_voice_message function.

Step 3: Downloading and Converting the Audio

Telegram voice messages are typically Opus audio in an Ogg container (.oga files). The Whisper API works best with more standard formats like MP3 or WAV. This is where pydub and ffmpeg shine. We’ll download the file, then convert it.

Let’s flesh out the handle_voice_message function:

async def handle_voice_message(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Downloads, converts, and transcribes a voice message."""
    file_id = update.message.voice.file_id

    # Define the temporary file paths before the try block so the
    # finally block can always reference them, even if the download fails
    oga_path = f'{file_id}.oga'
    mp3_path = f'{file_id}.mp3'

    try:
        # 1. Download the file
        voice_file = await context.bot.get_file(file_id)
        await voice_file.download_to_drive(oga_path)
        logger.info(f"Downloaded voice file to {oga_path}")

        # 2. Convert OGA to MP3
        audio = AudioSegment.from_ogg(oga_path)
        audio.export(mp3_path, format="mp3")
        logger.info(f"Converted {oga_path} to {mp3_path}")

        # (Transcription step comes next)

    except Exception as e:
        logger.error(f"An error occurred: {e}")
        await update.message.reply_text("Sorry, I couldn't process that voice memo.")
    finally:
        # 4. Clean up the temporary files
        if os.path.exists(oga_path):
            os.remove(oga_path)
        if os.path.exists(mp3_path):
            os.remove(mp3_path)
        logger.info("Cleaned up temporary files.")

Pro Tip: In my production setups, I handle file paths more robustly, often using a dedicated /tmp or temporary directory structure. For this example, creating files in the local directory is fine, but always be mindful of where you’re writing data, especially in a containerized environment. Cleaning up files in a finally block ensures they get deleted even if an error occurs.
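
A minimal sketch of the temporary-directory idea, using the standard library's tempfile module. The helper name temp_audio_paths is hypothetical and only illustrates the pattern; it is not the setup described above.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_audio_paths(file_id: str):
    """Yield (oga_path, mp3_path) located inside a temporary directory.

    The directory and any files written into it are removed when the
    with block exits, whether or not an exception was raised.
    """
    with tempfile.TemporaryDirectory(prefix="voice_") as tmp_dir:
        oga_path = os.path.join(tmp_dir, f"{file_id}.oga")
        mp3_path = os.path.join(tmp_dir, f"{file_id}.mp3")
        yield oga_path, mp3_path


# Example usage inside the handler:
# with temp_audio_paths(file_id) as (oga_path, mp3_path):
#     await voice_file.download_to_drive(oga_path)
#     ...
```

Because the directory is removed as a whole, this approach would replace the individual os.remove calls rather than sit alongside them.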

Step 4: Transcribing with OpenAI Whisper

With our MP3 file ready, sending it to OpenAI is straightforward. We’ll call the audio.transcriptions.create method on the openai_client we initialized in Step 1.
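
One caveat about this call: the OpenAI client created in Step 1 is synchronous, so the request blocks until the API responds, and inside an async handler that stalls the bot's event loop for the duration. A minimal sketch of one way around this uses asyncio.to_thread, which needs Python 3.9 or later. The function blocking_transcribe is a hypothetical stand-in that only simulates the slow call.

```python
import asyncio
import time


def blocking_transcribe(path: str) -> str:
    """Hypothetical stand-in for a slow, synchronous call such as the Whisper request."""
    time.sleep(0.2)  # simulate network latency
    return f"transcript of {path}"


async def transcribe_without_blocking(path: str) -> str:
    """Run the blocking function in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(blocking_transcribe, path)


async def demo() -> tuple:
    """Show that other coroutines keep running while the transcription is in flight."""
    ticks = 0

    async def ticker() -> None:
        # This counter only advances if the event loop is not blocked.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    text = await transcribe_without_blocking("memo.mp3")
    task.cancel()
    await asyncio.gather(task, return_exceptions=True)
    return text, ticks
```

The same wrapper could be applied to the pydub conversion, which is also blocking. The openai package also provides an AsyncOpenAI client, which is an alternative to threads for the API call itself.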

Let’s add the transcription logic into our handle_voice_message function:

async def handle_voice_message(update: Update, context: ContextTypes.DEFAULT_TYPE) -> None:
    """Downloads, converts, and transcribes a voice message."""
    file_id = update.message.voice.file_id
    oga_path = f'{file_id}.oga'
    mp3_path = f'{file_id}.mp3'

    try:
        await update.message.reply_text("Processing your voice memo...")
        voice_file = await context.bot.get_file(file_id)
        await voice_file.download_to_drive(oga_path)

        audio = AudioSegment.from_ogg(oga_path)
        audio.export(mp3_path, format="mp3")

        # 3. Send to Whisper API for transcription
        with open(mp3_path, "rb") as audio_file:
            transcription = openai_client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file
            )

        transcribed_text = transcription.text
        logger.info(f"Transcription successful: {transcribed_text}")

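        # 4. Reply to the user as plain text. No parse_mode is set because the
        #    transcription is arbitrary text and may contain characters such as
        #    "<" or "&" that would break Telegram's HTML parsing.
        await update.message.reply_text(f"Transcription:\n\n{transcribed_text}")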

    except Exception as e:
        logger.error(f"An error occurred: {e}")
        await update.message.reply_text("Sorry, I couldn't process that voice memo.")
    finally:
        # 5. Clean up
        if os.path.exists(oga_path):
            os.remove(oga_path)
        if os.path.exists(mp3_path):
            os.remove(mp3_path)
        logger.info("Cleaned up temporary files for " + file_id)

And that’s the complete loop! The bot receives a voice note, downloads it, converts it, sends it to Whisper, and replies with the text.
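
One limit to keep in mind for the reply step: Telegram caps a text message at 4096 characters, so a long memo could produce a transcription that is too large for a single reply. The following is a minimal sketch of splitting the text into several messages. The helper split_for_telegram is hypothetical, and preferring to break at a space is just one reasonable choice.

```python
TELEGRAM_MAX_CHARS = 4096  # Telegram's per-message text limit


def split_for_telegram(text: str, limit: int = TELEGRAM_MAX_CHARS) -> list:
    """Split text into chunks of at most `limit` characters, preferring to break at a space."""
    chunks = []
    remaining = text.strip()
    while len(remaining) > limit:
        # Find the last space that keeps the chunk within the limit;
        # fall back to a hard cut when there is none.
        cut = remaining.rfind(" ", 0, limit + 1)
        if cut <= 0:
            cut = limit
        chunks.append(remaining[:cut].rstrip())
        remaining = remaining[cut:].lstrip()
    if remaining:
        chunks.append(remaining)
    return chunks


# Example usage inside the handler:
# for chunk in split_for_telegram(transcribed_text):
#     await update.message.reply_text(chunk)
```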

Common Pitfalls

Here are a few places I’ve tripped up in the past. Hopefully, you can avoid them.

ffmpeg Not Found: The most common issue. The pydub library is just a Python wrapper around the ffmpeg command-line tool. If ffmpeg isn’t installed and available in your system’s PATH, pydub will fail. The error message is usually pretty clear about this.
API Key Errors: Double-check your config.env file. A typo in the variable name or a misplaced quote can lead to authentication failures. Make sure the file is in the same directory you’re running the script from, or provide an absolute path to it.
File Size Limits: The OpenAI Whisper API has a file size limit (currently 25 MB). For a simple voice memo bot, this is rarely an issue. But if you were adapting this for longer audio, you’d need to implement chunking—splitting the audio into smaller pieces and processing them sequentially.
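
The chunking idea from the last item can be sketched as follows. Both function names are hypothetical. The ten-minute default is an arbitrary assumption, not a value from the OpenAI documentation, and should be tuned so that each exported chunk stays under the size limit. The split_audio function relies on pydub's millisecond-based slicing of AudioSegment objects.

```python
def plan_chunks(total_ms: int, chunk_ms: int = 10 * 60 * 1000) -> list:
    """Return (start_ms, end_ms) ranges that cover total_ms in order."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [
        (start, min(start + chunk_ms, total_ms))
        for start in range(0, total_ms, chunk_ms)
    ]


def split_audio(mp3_path: str, chunk_ms: int = 10 * 60 * 1000) -> list:
    """Export each planned range to its own MP3 file and return the file paths."""
    from pydub import AudioSegment  # imported lazily; requires pydub and ffmpeg

    audio = AudioSegment.from_file(mp3_path, format="mp3")
    paths = []
    # len(audio) is the duration in milliseconds
    for index, (start, end) in enumerate(plan_chunks(len(audio), chunk_ms)):
        chunk_path = f"{mp3_path}.part{index}.mp3"
        audio[start:end].export(chunk_path, format="mp3")  # slices are in milliseconds
        paths.append(chunk_path)
    return paths
```

Each path would then be sent to the transcription call in order and the resulting texts joined. Cutting at fixed offsets can split a word in half, so a more careful version might cut at silences instead.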

Conclusion

You now have a fully functional, private transcription service. This pattern is incredibly versatile. You could modify it to save transcriptions to a database, send them to a Notion page, or create a Jira ticket. It’s a fantastic building block for automating any workflow that starts with a spoken idea.
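
As one illustration of the database idea, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the function names save_transcription and search_transcriptions are assumptions made for this example, not part of the bot above.

```python
import sqlite3
from datetime import datetime, timezone


def save_transcription(conn: sqlite3.Connection, file_id: str, text: str) -> None:
    """Insert one transcription row, creating the table on first use."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transcriptions ("
        "file_id TEXT PRIMARY KEY, text TEXT NOT NULL, created_at TEXT NOT NULL)"
    )
    conn.execute(
        "INSERT OR REPLACE INTO transcriptions (file_id, text, created_at) VALUES (?, ?, ?)",
        (file_id, text, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()


def search_transcriptions(conn: sqlite3.Connection, term: str) -> list:
    """Return stored texts containing the term (LIKE is case-insensitive for ASCII)."""
    rows = conn.execute(
        "SELECT text FROM transcriptions WHERE text LIKE ? ORDER BY created_at",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]
```

In the handler, save_transcription could be called right after the transcription succeeds. Note that search_transcriptions assumes the table already exists, so it should only be called after at least one save.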

Happy building,

Darian Vance

Senior DevOps Engineer, TechResolve