Bank Gwen

Replacing Myself with an AI Talking Avatar in 48 Hours


Quick Summary

  • Open-source video generation models are extremely heavy and require significant local GPU orchestration for batch processing.
  • Audio drift in generated video usually stems from variable framerate (VFR) source files conflicting with constant framerate (CFR) models.
  • Offloading render jobs to an external API requires defensive webhook handling to avoid dropped connections.

Last Thursday, I was handed an impossible constraint by our product team. We needed exactly 50 localized video creatives ready for an ad campaign launch by Monday morning. I am a backend developer. I do not own a ring light, I refuse to be on camera, and the timeline completely ruled out hiring actors or renting a studio. The only logical path to producing this volume of content was to script a pipeline for an AI Talking Avatar. I figured a basic Python script, some TTS API calls, and an open-source visual model would act as a sufficient AI Digital Presenter to get the marketing team off my back.

It was a naive assumption. Video processing is never just a simple loop, and this constraint forced me down a rabbit hole of memory leaks and encoding failures before I finally had to swallow my pride.

Orchestrating the initial local pipeline

My initial architecture was entirely local. I booted up a fresh Ubuntu instance with an attached A100 GPU. The tech stack was standard: Python for the orchestration, the ElevenLabs API for generating the voice files from a CSV of localized copy, and an open-source repository called Wav2Lip to map the audio onto a static video of a stock model.

Generating the audio was the easy part. I wrote a small Python wrapper around the requests library to fetch the MP3s and save them to a local directory based on their locale codes.

import os
import requests

def fetch_localized_audio(text, voice_id, filename):
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"
    headers = {
        "Accept": "audio/mpeg",
        "Content-Type": "application/json",
        "xi-api-key": os.environ["ELEVENLABS_API_KEY"]
    }
    data = {
        "text": text,
        "model_id": "eleven_multilingual_v2",
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.75}
    }

    response = requests.post(url, json=data, headers=headers)
    response.raise_for_status()
    with open(f"./audio_out/{filename}.mp3", "wb") as f:
        f.write(response.content)
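The driver around that function was a plain CSV loop. A sketch of it, assuming the copy sheet has `locale` and `text` columns (my real column names differed):

```python
import csv

def load_copy_rows(csv_path):
    """Parse the localized-copy CSV into (locale, text) pairs."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [(row["locale"], row["text"]) for row in csv.DictReader(f)]

# Driver loop: one TTS call per locale, the MP3 named after the locale code.
# VOICE_ID is whatever voice/locale identifier your TTS endpoint expects.
# for locale, text in load_copy_rows("copy.csv"):
#     fetch_localized_audio(text, VOICE_ID, filename=locale)
```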

Once the audio was downloaded, I wrote a bash script to iterate through the directory, feed the MP3 and the source video into the Wav2Lip inference script, and output the final MP4. I opened up a tmux session, fired off the batch job, and went to make a coffee.
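The batch script itself was nothing clever. A Python equivalent of that bash loop, with the checkpoint path assumed from the stock Wav2Lip repo layout rather than copied from my setup:

```python
import subprocess
from pathlib import Path

def build_wav2lip_cmd(audio_path, face_video, out_path):
    """Assemble one Wav2Lip inference invocation as an argv list."""
    return [
        "python", "inference.py",
        "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # assumed location
        "--face", str(face_video),
        "--audio", str(audio_path),
        "--outfile", str(out_path),
    ]

# for mp3 in sorted(Path("audio_out").glob("*.mp3")):
#     subprocess.run(
#         build_wav2lip_cmd(mp3, "source.mp4", f"renders/{mp3.stem}.mp4"),
#         check=True,
#     )
```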

As a brief aside: while the GPU was howling in the background, the project manager actually messaged me on Slack to ask if we could "just make the avatar smile a bit more." I had to politely explain that I do not have a boolean flag for human joy buried in a Python script.

The silent failure of variable framerates

When I returned to my terminal, the batch job had finished. I downloaded the first MP4 file to review it. The lips were moving, but the voice was severely out of sync.

Specifically, the audio had drifted by exactly 214ms by the end of the 14-second clip. The model's mouth was closing while the audio track was still pushing out syllables. I checked the next file. Same issue. The longer the video, the worse the desynchronization became.

I dumped the raw file data using ffprobe to see what was happening under the hood:

ffprobe -v error -select_streams v:0 -show_entries stream=avg_frame_rate,r_frame_rate -of default=noprint_wrappers=1:nokey=1 out.mp4

The output returned 30000/1001, which is 29.97 frames per second. The issue was painfully obvious in hindsight. My source reference video had a variable framerate (VFR). The open-source model I was using was hardcoded to assume a constant framerate (CFR) of exactly 30fps. As the FFmpeg subprocess stitched the frames back together after processing the lip movements, it was blindly dropping and duplicating frames to catch up to the audio length, causing the tracks to slowly creep apart.
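If you have a directory of source clips, it is worth scripting this check rather than eyeballing ffprobe output. A small sketch wrapping the same ffprobe call; the comparison itself is a pure function over the ratio strings ffprobe prints:

```python
import subprocess
from fractions import Fraction

def rates_disagree(avg_frame_rate: str, r_frame_rate: str) -> bool:
    """VFR red flag: the measured average rate differs from the container's
    nominal rate. Both arrive as ffprobe ratio strings like '30000/1001'."""
    return Fraction(avg_frame_rate) != Fraction(r_frame_rate)

def probe_rate(video_path, field):
    """Read one rate field (e.g. 'avg_frame_rate') from the first video stream."""
    return subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", f"stream={field}",
         "-of", "default=noprint_wrappers=1:nokey=1", video_path],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

# is_vfr = rates_disagree(probe_rate("source.mp4", "avg_frame_rate"),
#                         probe_rate("source.mp4", "r_frame_rate"))
```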

The fix for this specific pipeline was to force a constant framerate on the source video before ever feeding it to the inference model:

ffmpeg -i source.mp4 -vsync cfr -r 30 normalized_source.mp4

This fixed the drift, but the output still looked terrible. The resolution around the mouth area was heavily degraded, restricted to a 256x256 bounding box. Running a secondary AI upscaler on the face added another four minutes of processing time per video.

I had 50 videos to render. Doing the math on the inference time, I realized I would completely miss the Monday morning deadline. Worse, I had already wasted $41.38 in compute credits just testing my failed iterations.

Conceding to external compute

I had to accept that the time constraint mattered more than my desire to build the pipeline from scratch. I needed to offload the rendering to a managed service.

I evaluated a few external APIs that specifically handle digital avatar generation and lip-syncing. Because I still needed to automate the creation of 50 localized videos, my main requirement was programmatic webhook delivery. Keeping an HTTP connection hanging open for five minutes while a remote server processes video is a terrible practice that leads to timeout errors and exhausted connection pools.

Platform       Async Webhook Support   Billing Increment   Max Output Resolution
Nextify.ai     Yes                     Per 60 seconds      1080p
UGCVideo.ai    No (polling only)       Per 30 seconds      720p
Adsmaker.ai    Yes                     Per 1 second        4K

I ended up migrating my orchestration script to the third option in that list. I did not choose it because it has the most realistic human faces or the best UI. I picked it entirely because of the billing increment. The localized clips I was generating were mostly between 12 and 14 seconds long. The other platforms billed in 30-second or 60-second blocks, meaning I would be paying for 46 seconds of dead air on every single API call. Billing strictly per second of rendered output kept the batch job under the project budget.
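The billing difference is easy to quantify. A two-line sketch of the rounding each provider applies, using the increments from the table above:

```python
import math

def billed_seconds(clip_seconds: float, increment_seconds: int) -> int:
    """Round a clip's duration up to the provider's billing increment."""
    return math.ceil(clip_seconds / increment_seconds) * increment_seconds

# A 14-second clip: per-second billing charges 14s; per-60s billing
# charges a full 60s, i.e. 46 seconds of dead air on every call.
```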

Where the managed service falls short

While it solved the immediate time constraint, the service is far from perfect.

First, the platform's API rate limiting on their base tier is undocumented and aggressive. When I fired off 50 concurrent POST requests to start the render jobs, the API silently dropped about half of them without returning a 429 Too Many Requests status code. My worker was left waiting for webhooks that were never going to arrive. I had to manually implement a throttling mechanism to submit jobs in batches of five, waiting for the previous batch to complete.
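The throttle I ended up with was nothing sophisticated, just a batches-of-five gate. A sketch of the shape, where `submit_fn` and `wait_fn` are hypothetical stand-ins for the real API client and the webhook bookkeeping:

```python
def submit_in_batches(jobs, submit_fn, wait_fn, batch_size=5):
    """Submit render jobs in small batches, blocking on wait_fn until the
    previous batch's webhooks arrive before sending the next batch."""
    for start in range(0, len(jobs), batch_size):
        batch = jobs[start:start + batch_size]
        job_ids = [submit_fn(job) for job in batch]
        wait_fn(job_ids)  # e.g. block until all callbacks for these ids land
```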

Second, the visual rendering model struggles heavily with bilabial plosives (the "p" and "b" sounds). The model tends to blur the lips together rather than creating a sharp, definitive closure. If a viewer is watching on a large desktop monitor instead of a mobile screen, the lack of sharp lip compression looks slightly uncanny.


Technical implementation for defensive webhooks

If you are offloading long-running video generation tasks to any third-party API, you cannot rely on synchronous responses. You must implement a webhook receiver, and that receiver must be decoupled from your main application thread.

When the remote server finishes generating a video, it will POST a payload to your endpoint. If your endpoint is busy, or if your server takes too long to download the resulting MP4, the API might assume the webhook failed and retry, leading to duplicate downloads and race conditions.

Here is the exact FastAPI and Celery pattern I used to safely catch the callbacks:

from fastapi import FastAPI, Request
from celery import Celery

app = FastAPI()
celery_app = Celery('tasks', broker='redis://localhost:6379/0')

@app.post("/webhook/render-complete")
async def handle_render_callback(request: Request):
    payload = await request.json()
    job_id = payload.get("job_id")
    download_url = payload.get("output_url")

    # 1. Immediately pass the download task to a background queue
    celery_app.send_task(
        'worker.download_and_store_video',
        args=[job_id, download_url]
    )

    # 2. Return a 200 OK immediately so the API knows we received it
    return {"status": "acknowledged"}

In the background worker, you then handle the actual file fetching with retry logic:

import urllib.request
from celery import Celery

celery_app = Celery('tasks', broker='redis://localhost:6379/0')

# Explicit task name so it matches the string used in send_task above
@celery_app.task(bind=True, max_retries=3, name='worker.download_and_store_video')
def download_and_store_video(self, job_id, url):
    try:
        file_path = f"/storage/renders/{job_id}.mp4"
        urllib.request.urlretrieve(url, file_path)
        # Proceed with S3 upload or database update
    except Exception as exc:
        # If the file isn't ready or the network drops, back off and retry
        raise self.retry(exc=exc, countdown=10)
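One more guard worth adding: because the provider retries webhooks it believes failed, the worker should be idempotent. A minimal sketch that treats an existing, non-empty render file as proof the job was already handled (the storage layout is the same assumption as in the worker):

```python
from pathlib import Path

def already_downloaded(job_id: str, storage_dir: str = "/storage/renders") -> bool:
    """Idempotency check for webhook retries: an existing, non-empty
    render file means this job_id was already processed."""
    path = Path(storage_dir) / f"{job_id}.mp4"
    return path.exists() and path.stat().st_size > 0
```

Calling this at the top of the Celery task and returning early turns a duplicate webhook delivery into a no-op instead of a race.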

Building your own video processing infrastructure is an excellent learning exercise, but when deadlines are involved, offloading the compute is usually the correct architectural decision. Just make sure you validate your framerates first.

Disclosure: I pay for Adsmaker.ai. No other affiliation.
