How I Fixed PixelAPI's 35% Job Failure Rate — And Hit 96.8% Success
Three bugs were silently killing 1 in 3 jobs. Here's exactly what was wrong and how I fixed it.
Three weeks ago, PixelAPI had a 35% job failure rate.
Every third API call was returning an error instead of an image. Users were complaining. I was embarrassed. And honestly? I had no idea where to start — the errors were scattered across different models, different GPU machines, different Python modules.
Today, PixelAPI sits at 96.8% job success rate.
This is the honest, technical breakdown of what was broken and how I fixed it.
The Three Root Causes
Bug 1: WAN 2.1 Video Generation — Timeout Set 40% Too Low
PixelAPI's video generation endpoint uses the Wan 2.1 (I2V) model on LLM3. The timeout was set to 70 seconds.
Problem: Wan 2.1 takes 70–120 seconds on a good day. When the GPU is warm and the model is already loaded, it hits the lower end. But under any real load, it easily exceeds 70s.
Error: Job timed out after 70 seconds
Fix: Bumped timeout to 120 seconds. Added a GPU pre-warming step so the model is loaded before the first request hits it.
# Before
TIMEOUT = 70
# After
TIMEOUT = 120
# + GPU pre-warming: keep model loaded in memory between requests
Result: Wan 2.1 success rate went from ~40% to ~97%.
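Here is a minimal sketch of how a timeout and a warm-up step like this could fit together. It is not PixelAPI's actual code: run_with_timeout, prewarm, and the thread pool are hypothetical names chosen for illustration, and the generate callable stands in for the real Wan 2.1 call.

```python
import concurrent.futures
import time

TIMEOUT_SECONDS = 120  # the new value from the fix; the old one was 70

# Shared worker pool (the size is an arbitrary choice for this sketch).
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def run_with_timeout(job, timeout=TIMEOUT_SECONDS):
    """Run job() in a worker thread and stop waiting after `timeout` seconds.

    The worker thread is not killed on timeout; only the caller stops waiting.
    """
    future = _pool.submit(job)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"Job timed out after {timeout} seconds") from None


def prewarm(generate, dummy_input):
    """Run one throwaway generation so later requests skip the cold start.

    Returns the warm-up duration in seconds.
    """
    start = time.monotonic()
    generate(dummy_input)
    return time.monotonic() - start
```

Calling prewarm once at worker startup, before accepting traffic, means the first real request does not pay the model-load cost inside its timeout window.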
Bug 2: Background Removal — CUDA OOM on Large Images
The background removal tool uses RMBG-1.4 on GPU. For small images, it worked fine. For anything over 2048x2048, it crashed with:
RuntimeError: CUDA out of memory. Tried to allocate 2.4GB
This wasn't just a timeout issue; it was a memory management problem. The model was being reloaded on every single request, consuming ~6GB VRAM each time without proper cleanup.
Fix: Implemented model caching (load once, reuse across requests) + automatic image resize for inputs over 2048px:
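from PIL import Image

# Auto-downscale large images before processing.
# thumbnail() resizes in place and keeps the aspect ratio,
# so the longer side ends up at most 2048px.
if image.size[0] > 2048 or image.size[1] > 2048:
    image.thumbnail((2048, 2048), Image.LANCZOS)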
Result: No more OOM crashes. Background removal is now PixelAPI's most reliable endpoint.
Bug 3: Remove Text — Python Variable Shadowing Bug
This one was embarrassing.
The text removal module had a variable named io that was being used for the image IO buffer. But somewhere in the processing pipeline, that local name was shadowing the standard-library io module:
import io

def remove_text(image):
    io = io.BytesIO()  # ← This shadows the `io` module!
    ...
    # Inside this function, `io` no longer refers to the module, so io.BytesIO() fails
This bug only triggered when certain image processing conditions were met, which is why it was intermittent and hard to reproduce.
Fix: Renamed the local variable from io to img_buffer. Three-line fix, silent failures for weeks.
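A small reproduction of this kind of shadowing, next to the renamed version, may make the failure mode easier to see. The function names below are hypothetical and the bodies are reduced to the bare minimum; only the img_buffer rename comes from the actual fix.

```python
import io


def remove_text_broken():
    # Assigning to `io` anywhere in the function makes `io` a local name
    # for the whole function, so the right-hand side cannot see the module.
    io = io.BytesIO()  # raises UnboundLocalError
    return io


def remove_text_fixed(data: bytes) -> bytes:
    # A distinct local name leaves the module reachable.
    img_buffer = io.BytesIO()
    img_buffer.write(data)
    return img_buffer.getvalue()
```

In a reduced case like this one the error is immediate. If the assignment sits on a rarely taken code path, the failure only shows up for some inputs.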
The Result: 96.8% Success Rate
After all three fixes:
| Metric | Before | After |
|---|---|---|
| Job success rate | 65% | 96.8% |
| Daily failed jobs (avg) | ~35 | ~3 |
| Revenue impact | Users churning | Retention up |
April so far: 2,819 completed jobs and 94 failures across all endpoints (2,819 of 2,913 total, or 96.8%).
What I Learned
Timeout values are living parameters. Set them once and forget them, and they'll bite you when models evolve or hardware load changes.
CUDA memory management is not optional. Model caching + input size limits should be implemented from day one, not added retroactively.
Variable shadowing in Python is a silent killer. Use linters (ruff, pylint) that catch `io` shadowing. I now run `ruff check` on every new module before it touches production.
Intermittent failures are harder than obvious ones. The text removal bug took longest to find because it only failed under specific image conditions.
PixelAPI now processes AI image and video generation at 2x lower cost than PhotoRoom, Replicate, and other mainstream competitors — and with a 96.8% success rate to back it up.
If you're building with AI media APIs and hitting reliability issues, feel free to reach out. Happy to share what I learned.
Links:
- API Docs: https://pixelapi.dev/docs
- Dashboard: https://pixelapi.dev/app
- GitHub SDK: Coming soon