I spent the last six months building an AI product photography platform. The premise is simple: upload a product photo, pick a scene template, get a professional-looking shot back. The implementation was anything but simple.
What started as "call an API, return an image" evolved into a multi-stage pipeline involving background removal, product segmentation, scene composition, local model inference, cloud-based refinement, and a workflow engine to orchestrate all of it. This post walks through what we built, what it costs, and the mistakes that nearly killed us along the way.
## The Problem We Were Solving
E-commerce product photography is expensive and slow. A single studio shoot for a product line runs $500–$2,000 when you factor in the photographer, studio rental, lighting setup, props, and post-production. For small sellers on Shopify or Amazon, that's a non-starter.
The numbers tell the story: 75% of online shoppers rely on product photos to make purchasing decisions, and high-quality product images show 94% higher conversion rates than low-quality ones. Yet most small sellers are stuck with smartphone photos on a bedsheet background.
AI image generation has reached the point where it can fill this gap. The challenge isn't the AI model quality — it's building a production pipeline that's reliable, affordable, and fast enough to serve real users at scale.
## The Architecture: It's More Complex Than You Think
A naive "send prompt → get image" approach fails for product photography. Here's why: the AI needs to preserve the exact product while generating a new environment around it. That requires multiple processing stages.
### Our Multi-Stage Pipeline

```
Upload → Pre-processing → Scene Composition → Generation → Post-processing → Delivery
         │                │                   │            │                 │
         Background       Template            Multi-model  Color             CDN
         Removal          Matching            Fusion       Correction        Distribution
         │                │                   │            │                 │
         Segmentation     Prompt              Local GPU +  Quality           User
         & Masking        Engineering         Cloud API    Check             Gallery
```
Stage 1 — Pre-processing: User uploads a raw photo. We run background removal using a locally deployed RMBG-2.0 model on our GPU server. This gives us a clean product mask and a segmented foreground. Running this locally is significantly cheaper per image than cloud-based alternatives, and it's faster (under 2 seconds on an A10G).
Stage 2 — Scene Composition: Based on the user's template selection, we build a composite prompt that includes the product description (auto-detected via a vision model), the scene parameters, lighting instructions, and preservation directives. This is where most of the "secret sauce" lives — the difference between "product on a table" and "product on an oak table next to a window with golden hour light streaming in from camera left, shallow depth of field, warm color grading."
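The composite prompt step can be sketched as a small template builder. This is an illustrative sketch only: the field names (`scene`, `lighting`, `grading`) and the exact preservation wording are assumptions, not the platform's actual schema.

```typescript
// Hypothetical template-driven prompt builder. Field names are
// illustrative, not the production schema.
interface SceneTemplate {
  scene: string;    // e.g. "on an oak table next to a window"
  lighting: string; // e.g. "golden hour light from camera left"
  grading: string;  // e.g. "warm color grading, shallow depth of field"
}

function buildPrompt(productDesc: string, t: SceneTemplate): string {
  return [
    productDesc,
    t.scene,
    t.lighting,
    t.grading,
    // Preservation directive: tell the model what must not change.
    "preserve exact product shape, color, and logo",
  ].join(", ");
}
```

The point of the structure is that users never touch the raw prompt; they only pick a template, and the preservation directive is always appended.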
Stage 3 — Multi-Model Generation: This is where things get interesting. We don't use a single model. Instead, we run a parallel generation workflow:
- Primary generation: A cloud-hosted inference API handles the main scene generation. We route to faster models for standard shots and higher-quality models (like Seedream) for premium outputs.
- Local fallback: We run FLUX.1 Dev on our own GPU for overflow and for users on premium tiers who want maximum product fidelity. Local inference gives us full control over seeds, CFG scale, and denoising steps.
- Ensemble selection: For high-value generations, we run 2–3 variants in parallel and use a CLIP-based scoring model to auto-select the best one.
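The ensemble step reduces to "generate N, score each, keep the best." Here is a minimal sketch, where `generate` and `scoreVariant` stand in for the real model calls (a diffusion backend and a CLIP-style scorer):

```typescript
// Best-of-N selection: run N generations in parallel, score each,
// return the highest-scoring variant.
type Variant = { image: string; score: number };

async function selectBest(
  generate: () => Promise<string>,
  scoreVariant: (image: string) => Promise<number>,
  n: number,
): Promise<Variant> {
  // Kick off all N generations concurrently.
  const images = await Promise.all(Array.from({ length: n }, () => generate()));
  const scored = await Promise.all(
    images.map(async (image) => ({ image, score: await scoreVariant(image) })),
  );
  // Highest score wins.
  return scored.reduce((best, v) => (v.score > best.score ? v : best));
}
```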
Stage 4 — Post-processing: Generated images go through automated quality checks. We run a product comparison model (fine-tuned on our own data) that scores how well the product was preserved. If the score is below our threshold, the pipeline automatically retries with adjusted parameters — without the user ever knowing.
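The invisible-retry loop looks roughly like this. The parameter shape (`seed`, `preservationWeight`) and the adjustment rule are assumptions for illustration; the real pipeline tunes more knobs than this:

```typescript
// Quality-gate retry loop: regenerate with adjusted parameters until
// the fidelity score clears the threshold or attempts run out.
interface GenParams { seed: number; preservationWeight: number }

async function generateWithQualityGate(
  generate: (p: GenParams) => Promise<string>,
  scoreFidelity: (image: string) => Promise<number>,
  threshold: number,
  maxAttempts: number,
): Promise<string | null> {
  let params: GenParams = { seed: 1, preservationWeight: 1.0 };
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const image = await generate(params);
    if ((await scoreFidelity(image)) >= threshold) return image;
    // Retry with a fresh seed and stronger preservation weight.
    params = { seed: params.seed + 1, preservationWeight: params.preservationWeight + 0.5 };
  }
  return null; // retries exhausted: caller marks the task failed and refunds
}
```

Returning `null` rather than a bad image is the whole point: the user either gets a passing result or a refund, never a distorted product.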
### The Workflow Engine
Coordinating all these stages required a proper workflow engine. We built a lightweight DAG-based orchestrator that handles dependencies between stages:
```
[Background Removal] ──→ [Mask Generation] ──→ [Prompt Engineering]
                                                        │
                                            [Parallel Generation]
                                             ╱          │          ╲
                                     Cloud API      Local GPU      Variant #3
                                             ╲          │          ╱
                                     [Quality Scoring] ──→ [Best Selection]
                                                        │
                                               [Post-processing]
                                                        │
                                                    [Delivery]
```
Each node in the DAG is an independent worker that can be scaled horizontally. The orchestrator handles retries, timeouts, and dead-letter queuing for failed tasks.
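A minimal sketch of that scheduling core: run each node once all of its dependencies have finished, fanning out independent nodes in parallel. This omits the retries, timeouts, and dead-letter handling the real orchestrator adds:

```typescript
// Minimal DAG scheduler: repeatedly run every node whose dependencies
// are satisfied, in parallel, until all nodes are done.
interface DagNode { id: string; deps: string[]; run: () => Promise<void> }

async function runDag(nodes: DagNode[]): Promise<string[]> {
  const done = new Set<string>();
  const order: string[] = [];
  let remaining = [...nodes];
  while (remaining.length > 0) {
    const ready = remaining.filter((n) => n.deps.every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("cycle or missing dependency");
    // Ready nodes have no mutual dependencies, so run them concurrently.
    await Promise.all(ready.map((n) => n.run()));
    for (const n of ready) { done.add(n.id); order.push(n.id); }
    remaining = remaining.filter((n) => !done.has(n.id));
  }
  return order; // completion order, useful for tracing
}
```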
## The Async Task Problem
With a multi-stage pipeline, latency adds up fast. A single image can take 15–60 seconds from upload to delivery. You can't block the HTTP request for that long.
### Our Solution: Event-Driven Processing
```
User uploads photo
  → API creates task (status: "pending")
  → API enqueues first workflow stage
  → Returns task ID immediately

Frontend polls every 3 seconds: "Is my image ready?"

Workflow engine processes stages sequentially:
  → Stage completes → enqueue next stage
  → Any stage fails → retry (up to 3 times)
  → All retries exhausted → mark task failed, refund credits

Final delivery:
  → Upload result to object storage
  → Update task status to "completed"
  → Frontend receives image URL on next poll
```
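On the client side, the polling half of this flow is a short loop. `fetchStatus` below is a stand-in for whatever status endpoint the API exposes (e.g. a hypothetical `GET /tasks/:id`):

```typescript
// Frontend polling loop: check task status on an interval until the
// task resolves or we give up.
type TaskStatus = { status: "pending" | "completed" | "failed"; imageUrl?: string };

async function pollTask(
  fetchStatus: (taskId: string) => Promise<TaskStatus>,
  taskId: string,
  intervalMs = 3000,
  maxPolls = 40, // ~2 minutes at the default interval
): Promise<TaskStatus> {
  for (let i = 0; i < maxPolls; i++) {
    const s = await fetchStatus(taskId);
    if (s.status !== "pending") return s;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return { status: "failed" }; // timed out waiting
}
```

In production you would likely add jitter or switch to server-sent events/WebSockets, but plain polling is robust and trivially compatible with load balancers.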
The key insight: don't couple the user request to the processing pipeline. The API's only job is to accept the upload and return a task ID. Everything else happens asynchronously.
Here's a simplified version of the workflow coordinator:
```typescript
// Simplified workflow coordinator. `Stage`, `PipelineData`, and the
// helpers (executeWithRetry, markTaskFailed, refundCredits,
// uploadToStorage, updateTask) are defined elsewhere.
async function runPipeline(taskId: string, stages: Stage[]) {
  let currentData: PipelineData = { taskId };

  for (const stage of stages) {
    const result = await executeWithRetry(stage, currentData, {
      maxRetries: 3,
      timeout: 60_000, // 60s per stage
    });

    if (result.status === 'failed') {
      await markTaskFailed(taskId);
      await refundCredits(taskId);
      return;
    }

    // Each stage's output feeds into the next stage's input.
    currentData = { ...currentData, ...result.output };
  }

  // All stages passed — deliver result
  const imageUrl = await uploadToStorage(currentData.finalImage);
  await updateTask(taskId, { status: 'completed', imageUrl });
}
```
### Why Not Just Use a Queue?
We did start with a simple message queue. The problem is that multi-stage pipelines have complex failure modes. Stage 3 might fail because Stage 1 produced a bad mask. A flat queue doesn't capture these dependencies — you end up with orphaned messages and mysterious failures.
The DAG-based approach lets us replay from any stage, which is critical for debugging. When a user reports a bad generation, we can trace exactly which stage introduced the artifact and fix it without re-running the entire pipeline.
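Replay works because each stage's output is persisted. A sketch of the idea, with illustrative stage names and snapshot shape:

```typescript
// Replay the pipeline from a named stage, seeded with the persisted
// output of the stage before it. Earlier stages are skipped entirely.
interface StageDef {
  name: string;
  run: (data: Record<string, unknown>) => Record<string, unknown>;
}

function replayFrom(
  stages: StageDef[],
  startStage: string,
  snapshot: Record<string, unknown>, // persisted intermediate data
): Record<string, unknown> {
  const start = stages.findIndex((s) => s.name === startStage);
  if (start < 0) throw new Error(`unknown stage: ${startStage}`);
  // Fold the remaining stages over the snapshot, merging each output.
  return stages.slice(start).reduce((data, s) => ({ ...data, ...s.run(data) }), snapshot);
}
```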
## Local GPU vs. Cloud API: When to Use Which
We run a hybrid setup: local GPUs for some workloads, cloud APIs for others. Here's the decision matrix:
| Factor | Local GPU (FLUX.1 Dev) | Cloud Inference API |
|---|---|---|
| Latency | 8–15s (single image) | 10–30s (varies by model) |
| Cost at scale | Lower (amortized GPU) | Higher (pay per call) |
| Burst capacity | Limited by GPU count | Virtually unlimited |
| Customization | Full control over params | Limited to API options |
| Cold start | 3–5s model loading | Sub-second (managed) |
| Product fidelity | Excellent (fine-tuned) | Very good (out of box) |
When we use local: Background removal (always), quality scoring (always), primary generation for premium users, batch jobs where we control the throughput.
When we use cloud API: Burst traffic that exceeds local capacity, models we haven't fine-tuned locally, and geographic routing for users far from our GPU region.
The hybrid approach is key to keeping our per-image cost low enough to offer a competitive product.
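The routing rules above can be condensed into a small decision function. The signals and thresholds here are illustrative, not the production heuristics:

```typescript
// Local-vs-cloud routing sketch. Inputs and the queue-depth threshold
// are assumptions chosen to mirror the decision matrix above.
interface RouteInput {
  premiumTier: boolean;
  localQueueDepth: number;        // jobs currently waiting on local GPUs
  modelAvailableLocally: boolean; // is the requested model fine-tuned locally?
}

function chooseBackend(r: RouteInput): "local" | "cloud" {
  if (!r.modelAvailableLocally) return "cloud"; // not hosted locally
  if (r.premiumTier && r.localQueueDepth < 10) return "local"; // max fidelity
  if (r.localQueueDepth >= 10) return "cloud"; // burst overflow
  return "local"; // default: cheaper amortized GPU time
}
```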
## Cost Breakdown: $1–2/image vs. $25–150 Traditional
This is the part I wish someone had written before we started. When you factor in the entire production pipeline — not just the raw API call — here's what a single AI-generated product photo actually costs to deliver:
| Component | Cost per Image | Notes |
|---|---|---|
| GPU infrastructure (local inference, amortized) | ~$0.15–0.25 | A10G instances, background removal + generation + quality scoring |
| Cloud inference API (burst/overflow) | ~$0.10–0.20 | Multi-model routing, premium model access |
| Model fine-tuning & training (amortized) | ~$0.05–0.10 | Custom product preservation models, ongoing improvement |
| Object storage & CDN | ~$0.05 | Reference images, generated images, global delivery |
| Database & application hosting | ~$0.03 | Task metadata, user state, workflow orchestration |
| Quality assurance pipeline | ~$0.05–0.10 | Multi-pass verification, automated retries, CLIP scoring |
| Engineering overhead (amortized) | ~$0.10–0.15 | Pipeline maintenance, model updates, monitoring |
| **Total** | **~$0.55–0.98** | All-in cost per delivered image |
So we're looking at roughly $1 per image on the efficient path and up to $2 when cloud burst capacity kicks in or a generation needs multiple retries.
Compare that to traditional product photography at $25–150 per image. Even the cheapest stock photo services run $5–10 per image, and those aren't even your product.
The math is brutal for traditional studios: a seller with 200 SKUs needing 5 photos each would spend $25,000–$150,000 on photography. With AI, that same catalog costs $1,000–$2,000. That's not a marginal improvement — it's a completely different business model for small sellers.
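The catalog arithmetic above, spelled out:

```typescript
// Catalog cost = SKUs × photos per SKU × per-image price.
function catalogCost(skus: number, photosPerSku: number, perImage: number): number {
  return skus * photosPerSku * perImage;
}

// 200 SKUs × 5 photos each = 1,000 images:
const images = 200 * 5;
const aiLow = catalogCost(200, 5, 1);        // $1,000 at ~$1/image
const aiHigh = catalogCost(200, 5, 2);       // $2,000 at ~$2/image
const studioLow = catalogCost(200, 5, 25);   // $25,000 at the low studio rate
const studioHigh = catalogCost(200, 5, 150); // $150,000 at the high studio rate
```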
## Mistakes We Made
### 1. Single-Model Dependency (Month 1)
We started with one cloud API provider for everything. When they had a 6-hour outage, our entire platform went dark. Zero fallback, zero redundancy.
Fix: Built the hybrid local + cloud architecture. Now if the cloud API goes down, local GPU inference takes over automatically. We maintain 99.5% uptime even during provider outages.
### 2. No Quality Gate (Month 2)
Early on, every generation was delivered directly to the user — including the ones where the AI hallucinated extra buttons on a shirt or changed the product color entirely. We got angry emails daily.
Fix: Added the automated quality scoring stage. A fine-tuned model compares the output against the original product photo and rejects generations that distort the product. Rejected images trigger an automatic retry with adjusted parameters. This brought our "bad generation" rate from ~15% to under 3%.
### 3. Synchronous Processing (Month 2)
Our first architecture processed the entire pipeline synchronously during the HTTP request. A complex generation could take 40+ seconds, which caused frequent timeouts and a terrible user experience.
Fix: Moved to fully async event-driven processing. The API returns immediately with a task ID, and the frontend polls for results. This eliminated timeout errors entirely.
### 4. No Credit Refund on Failure (Month 3)
Users were getting charged even when the pipeline failed entirely or the quality gate rejected the output after all retries. Support tickets piled up fast.
Fix: The workflow engine now automatically refunds credits for any task that fails at any stage — whether it's a model error, a timeout, or a quality rejection after exhausting retries.
## Product Preservation: The Hardest Problem
Here's something most "AI product photography" articles don't mention: the AI will change your product. It might add buttons that don't exist, warp logos, change the product color, or subtly alter the shape.
This is a dealbreaker for e-commerce. Your product photo has to accurately represent what the customer receives, or you get returns and bad reviews. NRF data shows retailers lose $890B annually to returns, and inaccurate product photos are a major contributor.
Our multi-layered approach:
- Semantic segmentation first: Before any generation, we extract a precise product mask. This tells the generation model exactly what to preserve.
- Structured prompt engineering: We don't let users write free-form prompts. Every generation uses a curated template with explicit preservation instructions embedded in the prompt architecture.
- Multi-pass verification: After generation, our quality model scores product fidelity. Low-scoring outputs are retried automatically with adjusted parameters (stronger preservation weight, different seed, modified scene complexity).
- Human-in-the-loop (premium): For enterprise clients, failed quality checks route to a human reviewer who can manually adjust parameters before a final retry.
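The retry adjustments mentioned above can be expressed as a simple schedule. The specific increments and caps here are illustrative, not the tuned production values:

```typescript
// Retry adjustment schedule: each failed quality check retries with a
// stronger preservation weight, a fresh seed, and a simpler scene.
interface RetryParams {
  preservationWeight: number;
  seed: number;
  sceneComplexity: number; // 1 = plainest scene
}

function adjustForRetry(p: RetryParams, attempt: number): RetryParams {
  return {
    // Ramp up preservation, capped so the scene doesn't collapse entirely.
    preservationWeight: Math.min(p.preservationWeight + 0.25 * attempt, 2.0),
    seed: p.seed + attempt, // decorrelate from the failed generation
    sceneComplexity: Math.max(p.sceneComplexity - 1, 1), // simplify the scene
  };
}
```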
## The Result
All of this pipeline complexity exists to serve one purpose: letting a small seller upload a phone photo and get back a professional product shot they can actually use on their storefront.
We shipped this as Photoshoot.app — an AI product photography platform that handles product shoots, OOTD fashion photos, and social media content. Users upload a product image, pick a scene template, and receive a studio-quality photo in about 30 seconds. At $1–2 per image, it's accessible to sellers who could never afford a traditional shoot.
## What's Next
We're expanding the pipeline to handle video generation (product demo clips from static photos), batch processing for catalogs with 500+ SKUs, and a self-serve API for e-commerce platforms.
The multi-stage workflow architecture scales well — adding a new processing stage is just another node in the DAG. The hard part wasn't building the pipeline; it was getting product preservation to a level where users trust the output enough to put it on their storefront.
If you're building something similar — multi-model AI pipelines, product-focused generation, or just wrestling with async task processing at scale — I'd love to hear about your approach in the comments.
