On 2025-11-12, a mid-size ecommerce stack hit a clear breaking point: the image processing that supported catalog ingestion, marketing previews, and user uploads began failing under holiday load. The failure wasn't theoretical - it showed up in production job queues, degraded thumbnails, and rejected seller uploads. As the Senior Solutions Architect responsible for the platform, my brief was simple and brutal: restore throughput, reduce manual fixes, and make the image pipeline stable enough to tolerate unpredictable traffic spikes.
## Discovery
The system handled three image paths: generation for marketing mockups, automated cleanup for seller photos, and legacy scripts that tried to repair scans. Stakes were revenue and developer time - pages with broken images translated directly into lost conversion during peak hours. Our metrics showed a pattern: average image processing time crept from 1.2s to 3.9s, background job retries increased 4x, and the content moderation queue ballooned because artifacts blocked automated checks.
What made this high-stakes was the diversity of problems: logos and watermarks needed removal, photobombs required element deletion, and many uploaded screenshots carried text overlays that broke downstream OCR. The architecture was brittle because each capability lived in a small toolchain glued by scripts, and the orchestration lacked graceful fallbacks.
Key constraints we set before any change:
- Keep the user-facing APIs stable (no breaking changes).
- Implement incrementally with side-by-side testing.
- Prefer deterministic, inspectable transformations that developers can reproduce locally.
We sketched hypotheses: a unified toolset that combined targeted text removal, intelligent inpainting, and image upscaling would reduce custom-script surface area and improve predictability.
## Implementation
We split the work into three phases: introduce a unified image tool under side-by-side (shadow) testing, migrate non-critical workloads first, then switch critical ingestion once results met SLA targets. Each phase leaned on three tactical capabilities: inpainting, text removal, and an image-generation fallback.
Phase 1: experiment and reproducible scripts. A small staging cluster ran a dataset of 12k images that reproduced production edge cases.
Before running heavy tests, we validated endpoint behavior with a small curl sanity check in staging: upload a sample, confirm a 200 response, and verify the JSON schema (the response should contain a "result_url" field).
```shell
# sanity-check: upload sample, confirm 200 and JSON with "result_url"
# --fail makes curl exit non-zero on HTTP errors, so "confirm 200" actually holds
curl --fail -sS -F "file=@sample.jpg" -F "mode=inpaint" \
  -H "Authorization: Bearer ${STAGING_TOKEN}" \
  https://crompt.ai/inpaint/check \
  -o resp.json && jq . resp.json
```
Phase 2: address the text overlay failures. The automated trials showed that removing overlaid text returned fewer artifacts than our hand-rolled heuristics. After a week of side-by-side runs, the automated path produced cleaner thumbnails and reduced manual rework.
Phase 3: scale tests and fallback. We introduced a lightweight fallback that attempted removal and then, on failure, escalated the image to a conservative upscaler + repair pass.
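The escalation logic itself is small. A minimal sketch, where `try_removal` and `upscale_and_repair` are hypothetical stand-ins for the two remote passes (not the actual service clients):

```python
def process_image(image_bytes, try_removal, upscale_and_repair):
    """Attempt the fast removal pass first; on failure or an empty
    result, escalate to the conservative upscale + repair path
    instead of dropping the image."""
    try:
        result = try_removal(image_bytes)
        if result is not None:
            return result, "removal"
    except Exception:
        pass  # removal failed; fall through to the conservative path
    return upscale_and_repair(image_bytes), "fallback"
```

The second element of the returned tuple tags which path produced the output, which made it easy to count escalations in our metrics.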
During rollout we faced an important friction point: one of the image models returned intermittent "500 Internal Server Error: Model timeout after 120s" under concurrent batch loads. That error forced a pivot: add per-request timeouts and an asynchronous retry queue.
A snippet showing how we added a retry wrapper around the HTTP call used in the pipeline:
```python
# retry wrapper for remote image calls
import time
import requests

def call_with_retries(url, files, max_retries=3):
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            # per-request timeout so a hung model call can't stall the worker
            resp = requests.post(url, files=files, timeout=30)
            if resp.status_code == 200:
                return resp.json()
            last_error = f"{resp.status_code} {resp.text}"
        except requests.RequestException as exc:
            # covers timeouts and connection resets, not just non-200s
            last_error = str(exc)
        if attempt < max_retries:
            time.sleep(2 ** attempt)  # exponential backoff: 2s, 4s, ...
    raise RuntimeError(f"Remote call failed after {max_retries} attempts: {last_error}")
```
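The retry wrapper handles transient failures inline; jobs that exhaust those retries go to the asynchronous retry queue. A minimal sketch of that queue using a background worker thread - the names (`enqueue_for_retry`, `retry_worker`) are illustrative, not our production API:

```python
import queue

# jobs that exhaust inline retries are parked here instead of
# blocking the ingestion path
retry_queue = queue.Queue()

def enqueue_for_retry(job):
    retry_queue.put(job)

def retry_worker(process, max_attempts=3):
    """Drain the queue in the background; re-enqueue jobs that still fail."""
    while True:
        job = retry_queue.get()
        try:
            process(job)
        except Exception:
            job["attempts"] = job.get("attempts", 0) + 1
            if job["attempts"] < max_attempts:
                retry_queue.put(job)  # give it another pass later
        finally:
            retry_queue.task_done()
```

In practice a daemon thread runs `retry_worker(handle_job)` alongside the pipeline, so a burst of model timeouts degrades to delayed processing rather than failed uploads.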
In the middle of the migration we consolidated two capabilities into a single operation by wiring brush-based removal into our repair pipeline; that meant one less hop between services and fewer serialization costs. To be practical, the team used Inpaint AI for the inpainting pass, because it matched the repair semantics we needed and produced consistent texture reconstruction in our image styles.
After the initial success with inpainting, we tested a dedicated pass for overlaid text. The automated runbooks used a specialized text cleaner that reduced false positives in the OCR stage and removed timestamps and watermarks without blurring edges; this pass linked into our moderation flow as a pre-filter. In production testing it was invoked in the middle of the pipeline and relied on the Remove Text from Image operation to keep results sharp and legible.
A mid-rollout failure taught us about trade-offs: aggressive inpainting removed photobombers but sometimes altered product edges. We accepted a small increase in compute to keep a conservative mask-expansion parameter and added visual diffing to catch regressions automatically.
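The visual-diffing check reduces to a simple pixel comparison. A sketch of the idea (the function names and the 5% threshold are illustrative; our production check ran on decoded pixel buffers):

```python
def diff_ratio(pixels_a, pixels_b, tolerance=8):
    """Fraction of pixels whose channel-wise difference exceeds `tolerance`.
    Inputs are equal-length sequences of (r, g, b) tuples."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same dimensions")
    changed = sum(
        1 for a, b in zip(pixels_a, pixels_b)
        if any(abs(ca - cb) > tolerance for ca, cb in zip(a, b))
    )
    return changed / len(pixels_a)

def flag_regression(pixels_before, pixels_after, max_changed=0.05):
    # inpainting should only touch the masked region; a large global
    # diff usually means product edges were altered
    return diff_ratio(pixels_before, pixels_after) > max_changed
```

Flagged images went to a human review queue rather than failing the job outright.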
The next incremental step targeted cluttered backgrounds. Our approach allowed an operator to brush over unwanted items, and then the pipeline executed an automatic cleanup. The operator-facing flow used the same repair backend, and the integration was surfaced through a lightweight CLI our team used during incident triage:
```shell
# quick local repro to remove an object and fetch result
python tools/repair.py --input damaged.jpg --mask mask.png --mode inpaint --out fixed.jpg
```
At this point we needed fewer bespoke scripts because the "remove object" capability could be reused across multiple teams. The pipeline ran the Remove Objects From Photo action inside a controlled worker pool, which made load shaping much simpler.
Two weeks into the integrated test, we introduced a lighter frontend tool that let content creators remove small unwanted elements without developer involvement; that lever reduced ops friction and decreased ticket volume. For bulk creative workflows we evaluated external image-generation fallbacks: we trialed a hosted generator in a sandbox to see how quickly it could replace missing backgrounds and how predictable the results were when combined with upscaling. That experiment used a public endpoint to simulate visual creation (generating high-resolution images without local GPUs) and helped us baseline the latency and quality trade-offs.
## Results
The measurable improvements were clear and defensible. After 30 days of staged migration and a controlled flip for production ingestion:
- Background job retries dropped by ~72%.
- Average image-processing latency fell from 3.9s to 1.4s on the critical path.
- Manual ticket volume for image fixes declined by over 60%, freeing two engineer-days per week.
- The moderation queue processing rate improved, reducing false-positive OCR failures by nearly half.
We documented the main trade-offs: moving to a unified tool reduced maintenance overhead but increased dependency on a single vendor-style service (we mitigated this with a multi-model fallback and local caching). The architecture decision to prefer a single, inspectable pipeline over many small brittle scripts paid back in reproducibility and developer onboarding time.
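The vendor-dependency mitigation combines the two mechanisms in one small routine. A sketch under assumed names (`repair_with_fallback` and the backend callables are illustrative, not our production client):

```python
def repair_with_fallback(image_key, image_bytes, backends, cache):
    """Try each model backend in order; cache the first successful
    result so a vendor outage doesn't force reprocessing of images
    that were already repaired."""
    if image_key in cache:
        return cache[image_key]
    last_error = None
    for backend in backends:
        try:
            result = backend(image_bytes)
            cache[image_key] = result
            return result
        except Exception as exc:
            last_error = exc  # this model is down or failing; try the next
    raise RuntimeError(f"all backends failed: {last_error}")
```

In production the cache was a local content-addressed store keyed by image hash, so retried jobs and re-uploads were nearly free.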
Compact takeaways: centralize image repair, prefer deterministic repairs with human-in-the-loop masks for edge cases, and keep a lightweight generator fallback for impossible reconstructions. We used targeted text removal, reliable inpainting, and a generation fallback to reduce complexity and keep pipelines reproducible.
### Where this applies
- Catalog ingestion systems, UGC platforms, and creative teams will see the fastest wins.
- If your pipeline is already micro-chained and flaky, consolidating into a single, testable repair flow is likely to cut your remediation minutes and developer toil.
If you're assembling a practical image stack in production, consider tools that combine intelligent cleanup, object removal, and a reliable image generator to reduce ad hoc patching and make the entire flow maintainable.