DEV Community

Om Prakash

Cutting PixelAPI's Failure Rate from 35% to 3.2% — A Technical Post-Mortem


Three bugs were silently killing 1 in 3 jobs. Here's exactly what was wrong and how I fixed it.


Three weeks ago, PixelAPI had a 35% job failure rate.

Every third API call was returning an error instead of an image. Users were complaining. I was embarrassed. And honestly? I had no idea where to start — the errors were scattered across different models, different GPU machines, different Python modules.

Today, PixelAPI sits at 96.8% job success rate.

This is the honest, technical breakdown of what was broken and how I fixed it.

The Three Root Causes

Bug 1: WAN 2.1 Video Generation — Timeout Set 40% Too Low

PixelAPI's video generation endpoint uses the Wan 2.1 (I2V) model on LLM3. The timeout was set to 70 seconds.

Problem: Wan 2.1 takes 70–120 seconds on a good day. When the GPU is warm and the model is already resident in memory, it hits the lower end of that range. Under any real load, it easily exceeds 70s.

Error: Job timed out after 70 seconds

Fix: Bumped timeout to 120 seconds. Added a GPU pre-warming step so the model is loaded before the first request hits it.

# Before
TIMEOUT = 70

# After  
TIMEOUT = 120
# + GPU pre-warming: keep model loaded in memory between requests
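For context, a timeout guard like this can be sketched with a worker thread. This is a hypothetical wrapper, not PixelAPI's actual job runner, and the real GPU pre-warming depends on the serving stack:

```python
import concurrent.futures

TIMEOUT = 120  # seconds: headroom above Wan 2.1's 70-120 s generation range

def run_with_timeout(job, *args, timeout=TIMEOUT):
    # Run the job in a worker thread and raise TimeoutError if it
    # exceeds the budget, instead of letting the request hang.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(job, *args).result(timeout=timeout)
```

The point is that the timeout lives in one named constant instead of being scattered across call sites, which makes it easy to revisit when the model or hardware changes.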

Result: Wan 2.1 success rate went from ~40% to ~97%.


Bug 2: Background Removal — CUDA OOM on Large Images

The background removal tool uses RMBG-1.4 on GPU. For small images, it worked fine. For anything over 2048x2048, it crashed with:

RuntimeError: CUDA out of memory. Tried to allocate 2.4GB

The fix wasn't just a timeout issue — it was a memory management problem. The model was being reloaded on every single request, consuming ~6GB VRAM each time without proper cleanup.

Fix: Implemented model caching (load once, reuse across requests) + automatic image resize for inputs over 2048px:

# Auto-downscale oversized inputs before inference.
# thumbnail() resizes in place and preserves aspect ratio,
# unlike resize((2048, 2048)), which would distort the image.
if image.size[0] > 2048 or image.size[1] > 2048:
    image.thumbnail((2048, 2048), Image.LANCZOS)

Result: No more OOM crashes. Background removal is now PixelAPI's most reliable endpoint.


Bug 3: Remove Text — Python Variable Shadowing Bug

This one was embarrassing.

The text removal module used a local variable named io for its image I/O buffer. In Python, assigning to a name anywhere in a function makes that name local for the entire function body, so inside remove_text the standard-library io module became unreachable:

import io

def remove_text(image):
    io = io.BytesIO()  # ← shadows the `io` module: because `io` is assigned
                       #   here, Python treats it as local throughout the
                       #   function, so even this right-hand side raises
                       #   UnboundLocalError
    ...

The shadowing assignment sat on a code path that only ran for certain images, which is why the failure was intermittent and hard to reproduce.

Fix: Renamed the local variable from io to img_buffer. Three-line fix, silent failures for weeks.
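A standalone repro of the trap and the fix (minimal hypothetical functions, not the real module):

```python
import io

def broken():
    # Because `io` is assigned in this function, Python treats it as
    # local everywhere in the body — this line raises UnboundLocalError.
    io = io.BytesIO()

def fixed():
    # A distinct name leaves the stdlib `io` module visible.
    img_buffer = io.BytesIO()
    img_buffer.write(b"pixels")
    return img_buffer.getvalue()
```

Calling `broken()` fails immediately with UnboundLocalError; `fixed()` works because the module name is never rebound.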


The Result: 96.8% Success Rate

After all three fixes:

| Metric | Before | After |
| --- | --- | --- |
| Job success rate | 65% | 96.8% |
| Daily failed jobs (avg) | ~35 | ~3 |
| Revenue impact | Users churning | Retention up |

April so far: 2,819 completed jobs, 94 failures across all endpoints.

What I Learned

  1. Timeout values are living parameters. Set them once and forget them, and they'll bite you when models evolve or hardware load changes.

  2. CUDA memory management is not optional. Model caching + input size limits should be implemented from day one, not added retroactively.

  3. Variable shadowing in Python is a silent killer. Use a linter that catches it — pylint's redefined-outer-name (W0621) flags exactly this pattern. I now run ruff check on every new module before it touches production.

  4. Intermittent failures are harder than obvious ones. The text removal bug took longest to find because it only failed under specific image conditions.


PixelAPI now processes AI image and video generation at roughly half the cost of PhotoRoom, Replicate, and other mainstream competitors — and with a 96.8% success rate to back it up.

If you're building with AI media APIs and hitting reliability issues, feel free to reach out. Happy to share what I learned.
