Building a Telegram Video Avatar Converter with FFmpeg and Aiogram 3

#telegram #python #ffmpeg #tutorial

The bug that ate my afternoon

I record a quick clip on my iPhone, open Telegram, tap the camera icon next to my profile, pick the video, hit set as avatar. Telegram thinks for a second, then shows the old avatar like nothing happened. No error, no toast, no log. The file is fine. It plays in Photos. It uploads to other chats. But as a video avatar, it gets silently rejected.

The culprit: iPhones have recorded HEVC (H.265) by default since iOS 11. Telegram's video avatar slot accepts only H.264 in an MP4 container, plus a stack of other constraints that are not documented anywhere user-facing. The mobile app does not tell you why the upload was rejected; it just discards it.

I built @LiveAvaBot to fix that one specific paper cut. You send any video or GIF, the bot returns a Telegram-ready video avatar. Here is how the pipeline works.

What Telegram actually wants

The official Bot API docs are vague, but reverse-engineering the desktop client plus a lot of trial and error gives you the real spec for video avatars:

Container: MP4 with faststart (moov atom at the front)
Video codec: H.264 (libx264), yuv420p pixel format
Resolution: exactly 800x800, square
Duration: at most 10 seconds
File size: at most 2 MB
Audio: must be stripped entirely (no silent audio track, just no track at all)
Framerate: anything up to 30 fps is safe, I default to 25

Miss any one of these and the upload gets silently dropped. The 2 MB cap is the meanest one because it forces you to compromise between duration, resolution detail, and bitrate.
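Since Telegram gives no feedback, it helps to check a candidate file against the list before uploading. The sketch below is illustrative only: it assumes you have already parsed the JSON that `ffprobe -v error -print_format json -show_format -show_streams out.mp4` prints, and `check_avatar_spec` is a name I made up for this example. It does not verify faststart, and ffprobe reports the same format name for MOV and MP4, so the container check is only a coarse filter.

```python
from fractions import Fraction

MAX_BYTES = 2 * 1024 * 1024   # 2 MB cap from the list above
MAX_SECONDS = 10.0
MAX_FPS = 30


def check_avatar_spec(probe: dict) -> list:
    """Return human-readable violations; an empty list means the file passes."""
    problems = []
    fmt = probe.get("format", {})
    streams = probe.get("streams", [])
    videos = [s for s in streams if s.get("codec_type") == "video"]
    audios = [s for s in streams if s.get("codec_type") == "audio"]

    # ffprobe reports the MP4/MOV family as "mov,mp4,m4a,3gp,3g2,mj2".
    if "mp4" not in fmt.get("format_name", ""):
        problems.append("container is not MP4")
    if len(videos) != 1:
        problems.append(f"expected exactly one video stream, found {len(videos)}")
    if audios:
        problems.append("audio track present (encode with -an)")

    for v in videos:
        if v.get("codec_name") != "h264":
            problems.append(f"codec is {v.get('codec_name')}, not h264")
        if v.get("pix_fmt") != "yuv420p":
            problems.append(f"pixel format is {v.get('pix_fmt')}, not yuv420p")
        if (v.get("width"), v.get("height")) != (800, 800):
            problems.append(f"resolution is {v.get('width')}x{v.get('height')}, not 800x800")
        try:
            fps = Fraction(v.get("avg_frame_rate", "0/1"))  # e.g. "25/1"
        except (ValueError, ZeroDivisionError):
            fps = Fraction(0)
        if fps > MAX_FPS:
            problems.append(f"frame rate {float(fps):.2f} exceeds {MAX_FPS} fps")

    # ffprobe emits duration and size as strings.
    if float(fmt.get("duration", 0)) > MAX_SECONDS:
        problems.append("duration exceeds 10 seconds")
    if int(fmt.get("size", 0)) > MAX_BYTES:
        problems.append("file is larger than 2 MB")
    return problems
```

Feeding it the probe of a raw iPhone clip would typically flag the HEVC codec, the audio track, the resolution, and the size all at once, which is a lot more than Telegram tells you.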

The ffmpeg recipe

The conversion takes two ffmpeg runs: one to analyze the frame, one to encode. First, I run cropdetect to find the non-black area of the input. Most videos are 16:9 or 9:16, so some cropping is unavoidable, and a naive center-crop of the full frame can waste part of the square on black bars. cropdetect catches letterboxes and pillarboxes, which lets me fit the square inside the actual picture.

ffmpeg -hide_banner -i input.mov \
  -vf "cropdetect=24:16:0,metadata=mode=print" \
  -f null - 2>&1 | grep -oP 'crop=\S+' | tail -n 1

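That spits out something like crop=1920:1072:0:4 for a clean 1080p landscape clip: width, height, x offset, y offset. With round set to 16, cropdetect shrinks each dimension to a multiple of 16, so 1080 comes back as 1072. I parse the four numbers, pick a square region inside that rectangle (centered here), then run the real encode:

ffmpeg -y -i input.mov \
  -t 10 \
  -vf "crop=1072:1072:424:4,scale=800:800:flags=lanczos,format=yuv420p,fps=25" \
  -c:v libx264 -preset slow -crf 28 \
  -movflags +faststart \
  -an \
  output.mp4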

Key flags worth calling out:

-t 10 clamps duration to 10 seconds. Telegram rejects anything longer.
crop=W:H:X:Y then scale=800:800:flags=lanczos gives a clean 800x800 with decent downscaling.
format=yuv420p is non-negotiable. yuv444 or yuv422 will be silently rejected even though the file plays fine elsewhere.
-crf 28 -preset slow hits the 2 MB target for most 10 second clips. For busier scenes I raise crf to 30 (higher crf means a smaller file at lower quality) and re-encode.
-movflags +faststart puts the moov atom at the start of the file. Without it, Telegram clients hang on the first frame.
-an strips audio. Even a silent AAC track will cause rejection.

If the encoded file overshoots 2 MB I bump crf in steps of 2 and retry, up to crf 36. Past that I shorten duration. So far the loop terminates in at most three retries on real inputs.
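A rough bitrate budget shows why the cap bites and why shortening is the last resort. This is back-of-the-envelope arithmetic rather than anything from the bot's code; `avg_bitrate_budget_kbps` is a hypothetical helper and it ignores container overhead.

```python
def avg_bitrate_budget_kbps(duration_s: float,
                            size_limit_bytes: int = 2 * 1024 * 1024) -> float:
    """Average bitrate (kbit/s) that fills size_limit_bytes over duration_s."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return size_limit_bytes * 8 / duration_s / 1000
```

For a full 10 second clip the budget is about 1.7 Mbit/s for 800x800 video. Halving the duration doubles the budget, which is why trimming the clip works when raising crf alone does not.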

The aiogram 3 handler

The bot is a single file in spirit, though I split it for testability. Here is the trimmed video handler:

from aiogram import Router, F
from aiogram.types import Message, FSInputFile
from pathlib import Path
import tempfile

router = Router()

# Note: F.document matches any attached file, not only videos.
@router.message(F.video | F.video_note | F.animation | F.document)
async def handle_video(message: Message) -> None:
    # Same priority order as the filter, which guarantees one of these is set.
    media = message.video or message.video_note or message.animation or message.document
    file_id = media.file_id

    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "in.mov"
        dst = Path(tmp) / "out.mp4"

        file = await message.bot.get_file(file_id)
        await message.bot.download_file(file.file_path, destination=src)

        ok = await convert_to_avatar(src, dst)
        if not ok:
            await message.answer("Could not convert this one. Try a shorter clip.")
            return

        await message.answer_video(
            FSInputFile(dst),
            width=800,
            height=800,
            caption="Set this as your Telegram video avatar in Settings > Edit profile.",
        )

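MAX_AVATAR_BYTES = 2 * 1024 * 1024  # video avatar size cap

async def convert_to_avatar(src: Path, dst: Path) -> bool:
    """Encode src into dst, raising crf until the file fits under the cap."""
    crop = await detect_crop(src)
    crf = 28
    while crf <= 36:  # tries crf 28, 30, 32, 34, 36
        await run_ffmpeg(src, dst, crop, crf)
        if dst.stat().st_size <= MAX_AVATAR_BYTES:
            return True
        crf += 2  # higher crf = smaller file, lower quality
    return False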

detect_crop shells out to the cropdetect command from earlier and parses the last crop= line. run_ffmpeg is a thin asyncio.create_subprocess_exec wrapper around the encode command. Nothing exotic, just gluing ffmpeg to aiogram.
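For readers who want something concrete, here is one way detect_crop could look. Treat it as a sketch under assumptions: the helper names `parse_last_crop` and `centered_square` are mine, the centered-square choice is just the simplest policy, and the real bot may do this differently.

```python
import asyncio
import re
from pathlib import Path
from typing import Optional, Tuple

Crop = Tuple[int, int, int, int]  # (width, height, x, y)

CROP_RE = re.compile(r"crop=(\d+):(\d+):(\d+):(\d+)")


def parse_last_crop(stderr_text: str) -> Optional[Crop]:
    """Return the last crop=W:H:X:Y found in ffmpeg's stderr, or None."""
    matches = CROP_RE.findall(stderr_text)
    if not matches:
        return None
    w, h, x, y = (int(n) for n in matches[-1])
    return w, h, x, y


def centered_square(w: int, h: int, x: int, y: int) -> Crop:
    """Largest square centered inside the detected rectangle."""
    side = min(w, h)
    return side, side, x + (w - side) // 2, y + (h - side) // 2


async def detect_crop(src: Path) -> Optional[Crop]:
    """Run the cropdetect command from earlier and return a square crop."""
    proc = await asyncio.create_subprocess_exec(
        "ffmpeg", "-hide_banner", "-i", str(src),
        "-vf", "cropdetect=24:16:0,metadata=mode=print",
        "-f", "null", "-",
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,   # cropdetect logs to stderr
    )
    _, err = await proc.communicate()
    crop = parse_last_crop(err.decode(errors="replace"))
    return centered_square(*crop) if crop else None
```

With a final log line ending in crop=1920:1072:0:4, parse_last_crop returns (1920, 1072, 0, 4) and centered_square turns that into (1072, 1072, 424, 4). Keeping the parsing separate from the subprocess call makes it testable without ffmpeg installed.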

The handler accepts video, video_note, animation, and document because users send videos as any of those four types depending on the client. The document branch covers people who attached the .mov file directly to bypass Telegram's compression.
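Because F.document matches any attached file, a cheap guard before downloading can save a wasted ffmpeg run. The helper below is a hypothetical addition, not part of the handler shown above: `looks_convertible` and its MIME policy are assumptions. The 20 MB figure is the download limit the hosted Bot API applies to getFile.

```python
from typing import Optional

MAX_DOWNLOAD_BYTES = 20 * 1024 * 1024  # hosted Bot API getFile limit


def looks_convertible(mime_type: Optional[str], file_size: Optional[int]) -> bool:
    """Cheap pre-check on a document before downloading it."""
    if file_size is not None and file_size > MAX_DOWNLOAD_BYTES:
        return False  # the bot could not download it anyway
    if mime_type is None:
        return True   # unknown type: let ffmpeg decide
    return mime_type.startswith("video/") or mime_type == "image/gif"
```

In the handler this could be called with message.document.mime_type and message.document.file_size before get_file, replying with a short hint when it returns False.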

Shipping it as @LiveAvaBot

The pipeline runs on a small Hetzner VPS. ffmpeg does the heavy lifting; I just wrote the wrapper. Total Python is maybe 400 lines including error handling and a sqlite log of conversions.

A few production-only things I learned the hard way:

Always use a tempdir per request. I started with a shared workdir and got mysterious failures under concurrent uploads.
aiogram 3's FSInputFile is the right primitive for sending the result back. BufferedInputFile works too but doubles memory for no reason.
Telegram's sendVideo method (send_video in aiogram) has width and height arguments. If you do not pass them, the desktop client picks a weird thumbnail. Pass 800, 800.
Keep ffmpeg stderr in a buffer and log it on non-zero exit. cropdetect failures are usually a malformed input rather than a bug in your pipeline.
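The stderr advice can be sketched as a small wrapper around asyncio.create_subprocess_exec. This is an assumed shape, not the bot's real run_ffmpeg: `run_logged`, the logger name, and the 4000-byte tail are all placeholders.

```python
import asyncio
import logging

log = logging.getLogger("avatar.ffmpeg")  # hypothetical logger name


async def run_logged(*args: str, tail_bytes: int = 4000) -> int:
    """Run a command, keep its stderr, and log the tail on non-zero exit."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    # communicate() buffers all of stderr in memory, which is fine for
    # short clips; only the tail is logged to keep log lines bounded.
    _, stderr = await proc.communicate()
    if proc.returncode != 0:
        tail = stderr[-tail_bytes:].decode(errors="replace")
        log.error("%s exited with code %s:\n%s", args[0], proc.returncode, tail)
    return proc.returncode
```

The tail is usually enough, because ffmpeg prints the actual error in its last few lines. Returning the exit code lets the caller decide whether to retry or give up.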

You can try it at https://t.me/LiveAvaBot?start=devto_article_20260629. Send a video or a GIF, get back a Telegram-ready avatar in a few seconds.

What is next

The current pipeline does not handle 4K input gracefully; it just downscales and hopes for the best. I want to add a smart pre-scale step for anything above 1080p so cropdetect runs faster on the source. I also want to expose a mode where you can nudge the crop interactively instead of trusting cropdetect blindly.

The cropdetect threshold (24 in the command above) is also tuned for typical phone footage. Screen recordings with dark UI elements sometimes confuse it. If your input is mostly UI, drop the threshold to 16 or it will crop too aggressively.

Built by me, @LiveAvaBot is the Telegram side.