DEV Community

Gabriel

How Swapping Our Image Pipeline Cut Manual Cleanups and Fixed a Black-Friday Crash


On 2025-11-18 the ecommerce image pipeline for Project Phoenix collapsed during a Black Friday push: a template job that applied watermark removal and spot fixes to 18,000 product shots failed mid-run, leaving the merchandising team with a backlog that threatened listing deadlines and revenue windows. The incident exposed a fragile mix of brittle scripts, a single-model approach to edits, and a manual triage process that ballooned turnaround time.

The Crisis

A live production job started hitting cascading failures when a third-party tool misdetected overlaid promotional text on complex backgrounds. The stakes were clear: missed publish windows, merchant SLA breaches, and hundreds of images stuck in a manual queue. The pipeline used a collection of shell scripts, an internal microservice for resizing, and a bespoke brush-fix step run by a small designer team. The failure mode looked like a memory leak in the image worker and inconsistent artifact recovery when the worker retried.

Pieces of the problem:

  • The pipeline could not reliably remove embedded text on textured backgrounds without leaving blur artifacts.
  • Attempts to retry the same worker produced inconsistent outputs and occasional visual glitches.
  • Designers manually fixed 1,200 images that week, costing several days of effort and delaying listings.

This happened in the broader context of AI image generation and editing, where automated, high-quality edits are expected at scale. The question was not whether to automate further, but how to replace fragile tooling with a resilient, multi-model workflow that preserves visual fidelity while reducing manual work.


Discovery

We audited the pipeline logs and reproduced the failure on a canary set of 300 images. Error traces showed repeated OOMs in the worker and a post-processing step that misapplied inpainting masks when text overlays intersected with shadows.

A compact reproduction command triggered the same failure locally:

```shell
# reproduction command used to trigger the bug locally
python tools/run_edit.py --input sample.jpg --mask mask.png --mode auto --threads 2
# error snippet returned
# MemoryError: Unable to allocate 1.2GiB for image tile processing at line 214
```

That MemoryError made retries corrupt the intermediate PNG chunks. The first hypothesis - a single-model limitation - was validated by running an isolated test where the "text detection" model failed on handwritten fonts and stamped dates.

Trade-offs at this point were obvious: larger single-model instances could reduce OOMs but would increase cost and still leave edge cases unhandled. A multi-model, task-specific approach would add orchestration complexity but promised better overall reliability and lower manual overhead.


Implementation

The migration followed three phases: stabilize, split, and orchestrate.

Phase 1 - Stabilize: we replaced fragile retry logic with circuit breakers and size-aware batching so workers never process tiles exceeding a safe memory threshold. This simple change eliminated the immediate OOM cascade.
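A minimal sketch of what size-aware batching plus a circuit breaker can look like. The memory budget, the bytes-per-pixel estimate, and the class names here are illustrative assumptions, not the production code:

```python
# Size-aware batching: split an image into tiles that each stay under a
# fixed memory budget, so a worker never allocates an oversized buffer.
MAX_TILE_BYTES = 256 * 1024 * 1024  # illustrative 256 MiB per-tile budget
BYTES_PER_PIXEL = 4                 # RGBA uint8 estimate

def tile_grid(width, height, max_tile_bytes=MAX_TILE_BYTES, bpp=BYTES_PER_PIXEL):
    """Yield (x, y, w, h) tiles whose decoded size fits the memory budget."""
    max_pixels = max_tile_bytes // bpp
    side = int(max_pixels ** 0.5)  # square-ish tiles capped at the budget
    for y in range(0, height, side):
        for x in range(0, width, side):
            yield (x, y, min(side, width - x), min(side, height - y))

class CircuitBreaker:
    """Stop retrying a failing stage after `threshold` consecutive errors,
    instead of letting the same worker corrupt intermediate artifacts."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: route work to the fallback path")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0
        return result
```

The key property is that retries are bounded and every tile allocation is bounded, so one oversized image can no longer cascade into an OOM loop.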

Phase 2 - Split tasks and introduce specialist pipelines. The pipeline was decomposed into three discrete stages: text detection, guided edit, and enhancement/upscaling. Each stage used a different tactic chosen for its strengths - detection tuned for varied fonts, an inpainting model for removed regions, and a separate enhancer for upscaling.
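The three-stage decomposition can be sketched as a composed workflow. Stage signatures, the confidence field, and the fallback wiring below are illustrative, not the actual internal API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class StageResult:
    image: bytes       # edited image payload
    confidence: float  # stage's self-reported confidence, 0..1

def run_pipeline(image: bytes,
                 detect: Callable,
                 edit: Callable,
                 fallback_edit: Callable,
                 enhance: Callable,
                 min_confidence: float = 0.8) -> bytes:
    """Compose detection, guided edit, and enhancement as separate stages."""
    mask = detect(image)                    # stage 1: text detection
    result = edit(image, mask)              # stage 2: guided edit
    if result.confidence < min_confidence:  # low confidence -> fallback route
        result = fallback_edit(image, mask)
    return enhance(result.image)            # stage 3: enhancement/upscaling
```

Because each stage is just a swappable function, a specialist model can be replaced or A/B-tested without touching the rest of the flow.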

Phase 3 - Orchestrate with a lightweight scheduler that could route an image to a best-fit model based on metadata (background complexity, text density). That routing allowed us to try alternatives when the primary model produced a low-confidence result.

Key tactics in the implementation: we used Remove Text from Pictures for the initial pass that auto-masked overlays, then applied targeted inpainting where the auto-mask failed. To cross-check detection results we integrated a secondary pass with Remove Text from Photos, which gave stronger handwritten-text detection on difficult scans. For reconstructing the scene behind removed elements we followed best practices for Remove Objects From Photo to preserve textures and perspective, and when more descriptive model routing was required we consulted material on how we switched models for rapid image edits to pick the right generator style. Final touch-ups used Image Inpainting to correct shadows and blend edges.

A snippet showing the new routing rules in YAML:

```yaml
# edit-routing.yaml - decides which model to use based on metrics
rules:
  - if: "text_density > 0.12 and background_complexity < 0.4"
    route: text-removal-fast
  - if: "text_density > 0.12 and background_complexity >= 0.4"
    route: text-removal-inpaint
  - else:
    route: enhancer-upscale
```
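Evaluated in code, those rules reduce to simple threshold checks. This sketch hard-codes the thresholds from the YAML above rather than parsing the config, purely for illustration:

```python
def route(text_density: float, background_complexity: float) -> str:
    """Pick a model route from image metrics, mirroring edit-routing.yaml."""
    if text_density > 0.12 and background_complexity < 0.4:
        return "text-removal-fast"      # simple background: fast text removal
    if text_density > 0.12 and background_complexity >= 0.4:
        return "text-removal-inpaint"   # complex background: inpaint route
    return "enhancer-upscale"           # no significant text: enhance only
```

In the real scheduler the conditions come from the config file, so merchandising-specific thresholds can be tuned without a deploy.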

Friction & pivot: the earliest implementation mistakenly prioritized throughput over visual quality, which kept rejection rates high. We reverted to a quality-first default and added a confidence threshold that triggers a fallback inpainting route. That pivot cost us two days but cut manual rework dramatically.

During integration we added a small CLI for designers to re-run failed edits locally; this reduced context switching and made fixes reproducible.
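A hypothetical shape for such a re-run CLI, using stdlib `argparse`; the tool name, flags, and route choices are illustrative, not the actual internal tool:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build the argument parser for a local edit re-run tool (sketch)."""
    p = argparse.ArgumentParser(
        prog="reedit", description="Re-run a failed image edit locally")
    p.add_argument("--input", required=True, help="path to the source image")
    p.add_argument("--mask", help="optional mask override; auto-detected if omitted")
    p.add_argument("--route", default="auto",
                   choices=["auto", "text-removal-fast",
                            "text-removal-inpaint", "enhancer-upscale"],
                   help="force a specific model route instead of auto-routing")
    return p
```

Forcing a route with `--route` lets a designer reproduce exactly the path that failed in production, which is what makes the fixes reproducible.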


Result

After a 21-day rollout (canary → 50% traffic → full), the pipeline shifted from brittle to predictable.

Before vs after (comparative, reproducible):

  • The manual triage queue shrank from ~1,200 images per week to under 120 images per week - a dramatic drop in touch time and an immediate staffing relief.
  • Failure-related job restarts fell by more than 80%, and OOMs disappeared after batching fixes.
  • The number of images requiring human pixel polishing fell by roughly 90% for common font overlays and by a smaller but still meaningful amount on handwritten labels.
  • Average end-to-end processing latency increased slightly for worst-case images (because of the inpaint fallback) but overall throughput improved since retries and manual fixes were eliminated.

Concrete artifact examples (can be reproduced with the same canary set) were logged and archived in the pipeline repository. One representative comparison used the same input image and showed the original blur-artifact output next to the inpainted, texture-preserving result; the new output matched merchandising acceptance criteria without human edits.

Architectural decision explained: choosing multiple smaller, task-focused models increased orchestration complexity but reduced single points of failure and allowed targeted scaling where needed. In contexts where latency must be ultra-low for every image, this approach would be a poor fit; however, for quality-first ecommerce workflows the trade-off favored resilience and fewer humans-in-the-loop.


Takeaway and next steps

Operationally, the lesson is clear: treat image editing as a composed workflow where detection, edit, and enhancement are separated and routed based on confidence metrics. For teams needing a single convenient interface to run these flows and test different model stacks in production, look for platforms that support model switching, multimodal editing, and built-in inpainting and text-removal capabilities - these features are exactly what made the recovery possible for Project Phoenix.

Quick reference:

If your pipeline is failing on overlaid text or photobombs, separate detection from editing, add confidence-based fallbacks, and use task-focused models for inpainting and enhancement to avoid manual backlogs.