I spent the last six months building an AI product photography platform. The premise is simple: upload a product photo, pick a scene template, get a professional-looking shot back. The implementation was anything but simple.
What started as "call an API, return an image" evolved into a multi-stage pipeline involving background removal, product segmentation, scene composition, local model inference, cloud-based refinement, and a workflow engine to orchestrate all of it. This post walks through what we built, what it costs, and the mistakes that nearly killed us along the way.
## The Problem We Were Solving
E-commerce product photography is expensive and slow. A single studio shoot for a product line runs $500–$2,000 when you factor in the photographer, studio rental, lighting setup, props, and post-production. For small sellers on Shopify or Amazon, that's a non-starter.
The numbers tell the story: 75% of online shoppers rely on product photos to make purchasing decisions, and high-quality product images show 94% higher conversion rates than low-quality ones. Yet most small sellers are stuck with smartphone photos on a bedsheet background.
AI image generation has reached the point where it can fill this gap. The challenge isn't the AI model quality — it's building a production pipeline that's reliable, affordable, and fast enough to serve real users at scale.
## The Architecture: It's More Complex Than You Think
A naive "send prompt → get image" approach fails for product photography. Here's why: the AI needs to preserve the exact product while generating a new environment around it. That requires multiple processing stages.
### Our Multi-Stage Pipeline

```
Upload → Pre-processing → Scene Composition → Generation → Post-processing → Delivery
         │                │                   │            │                 │
         Background       Template            Multi-model  Color             CDN
         Removal          Matching            Fusion       Correction        Distribution
         │                │                   │            │                 │
         Segmentation     Prompt              Local GPU +  Quality           User
         & Masking        Engineering         Cloud API    Check             Gallery
```
Stage 1 — Pre-processing: User uploads a raw photo. We run background removal using a locally deployed RMBG-2.0 model on our GPU server. This gives us a clean product mask and a segmented foreground. Running this locally is significantly cheaper per image than cloud-based alternatives, and it's faster (under 2 seconds on an A10G).
Stage 2 — Scene Composition: Based on the user's template selection, we build a composite prompt that includes the product description (auto-detected via a vision model), the scene parameters, lighting instructions, and preservation directives. This is where most of the "secret sauce" lives — the difference between "product on a table" and "product on an oak table next to a window with golden hour light streaming in from camera left, shallow depth of field, warm color grading."
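The composite prompt step can be sketched as a small template builder. This is an illustrative sketch only: the field names (`scene`, `lighting`, `grading`) and the exact preservation wording are assumptions, not the platform's actual schema.

```typescript
// Hypothetical template-driven prompt builder. Field names are
// illustrative, not the production schema.
interface SceneTemplate {
  scene: string;    // e.g. "on an oak table next to a window"
  lighting: string; // e.g. "golden hour light from camera left"
  grading: string;  // e.g. "warm color grading, shallow depth of field"
}

function buildPrompt(productDesc: string, t: SceneTemplate): string {
  return [
    productDesc,
    t.scene,
    t.lighting,
    t.grading,
    // Preservation directive: tell the model what must not change.
    "preserve exact product shape, color, and logo",
  ].join(", ");
}
```

The point of the structure is that users never touch the raw prompt; they only pick a template, and the preservation directive is always appended.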
Stage 3 — Multi-Model Generation: This is where things get interesting. We don't use a single model. Instead, we run a parallel generation workflow:
- Primary generation: A cloud-hosted inference API handles the main scene generation. We route to faster models for standard shots and higher-quality models (like Seedream) for premium outputs.
- Local fallback: We run FLUX.1 Dev on our own GPU for overflow and for users on premium tiers who want maximum product fidelity. Local inference gives us full control over seeds, CFG scale, and denoising steps.
- Ensemble selection: For high-value generations, we run 2–3 variants in parallel and use a CLIP-based scoring model to auto-select the best one.
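The ensemble step reduces to "generate N, score each, keep the best." Here is a minimal sketch, where `generate` and `scoreVariant` stand in for the real model calls (a diffusion backend and a CLIP-style scorer):

```typescript
// Best-of-N selection: run N generations in parallel, score each,
// return the highest-scoring variant.
type Variant = { image: string; score: number };

async function selectBest(
  generate: () => Promise<string>,
  scoreVariant: (image: string) => Promise<number>,
  n: number,
): Promise<Variant> {
  // Kick off all N generations concurrently.
  const images = await Promise.all(Array.from({ length: n }, () => generate()));
  const scored = await Promise.all(
    images.map(async (image) => ({ image, score: await scoreVariant(image) })),
  );
  // Highest score wins.
  return scored.reduce((best, v) => (v.score > best.score ? v : best));
}
```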
Stage 4 — Post-processing: Generated images go through automated quality checks. We run a product comparison model (fine-tuned on our own data) that scores how well the product was preserved. If the score is below our threshold, the pipeline automatically retries with adjusted parameters — without the user ever knowing.
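The invisible-retry loop looks roughly like this. The parameter shape (`seed`, `preservationWeight`) and the adjustment rule are assumptions for illustration; the real pipeline tunes more knobs than this:

```typescript
// Quality-gate retry loop: regenerate with adjusted parameters until
// the fidelity score clears the threshold or attempts run out.
interface GenParams { seed: number; preservationWeight: number }

async function generateWithQualityGate(
  generate: (p: GenParams) => Promise<string>,
  scoreFidelity: (image: string) => Promise<number>,
  threshold: number,
  maxAttempts: number,
): Promise<string | null> {
  let params: GenParams = { seed: 1, preservationWeight: 1.0 };
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const image = await generate(params);
    if ((await scoreFidelity(image)) >= threshold) return image;
    // Retry with a fresh seed and stronger preservation weight.
    params = { seed: params.seed + 1, preservationWeight: params.preservationWeight + 0.5 };
  }
  return null; // retries exhausted: caller marks the task failed and refunds
}
```

Returning `null` rather than a bad image is the whole point: the user either gets a passing result or a refund, never a distorted product.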
### The Workflow Engine
Coordinating all these stages required a proper workflow engine. We built a lightweight DAG-based orchestrator that handles dependencies between stages:
```
[Background Removal] ──→ [Mask Generation] ──→ [Prompt Engineering]
                                                        │
                                            [Parallel Generation]
                                             ╱          │          ╲
                                     Cloud API      Local GPU      Variant #3
                                             ╲          │          ╱
                                     [Quality Scoring] ──→ [Best Selection]
                                                        │
                                               [Post-processing]
                                                        │
                                                    [Delivery]
```
Each node in the DAG is an independent worker that can be scaled horizontally. The orchestrator handles retries, timeouts, and dead-letter queuing for failed tasks.
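A minimal sketch of that scheduling core: run each node once all of its dependencies have finished, fanning out independent nodes in parallel. This omits the retries, timeouts, and dead-letter handling the real orchestrator adds:

```typescript
// Minimal DAG scheduler: repeatedly run every node whose dependencies
// are satisfied, in parallel, until all nodes are done.
interface DagNode { id: string; deps: string[]; run: () => Promise<void> }

async function runDag(nodes: DagNode[]): Promise<string[]> {
  const done = new Set<string>();
  const order: string[] = [];
  let remaining = [...nodes];
  while (remaining.length > 0) {
    const ready = remaining.filter((n) => n.deps.every((d) => done.has(d)));
    if (ready.length === 0) throw new Error("cycle or missing dependency");
    // Ready nodes have no mutual dependencies, so run them concurrently.
    await Promise.all(ready.map((n) => n.run()));
    for (const n of ready) { done.add(n.id); order.push(n.id); }
    remaining = remaining.filter((n) => !done.has(n.id));
  }
  return order; // completion order, useful for tracing
}
```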
## The Async Task Problem
With a multi-stage pipeline, latency adds up fast. A single image can take 15–60 seconds from upload to delivery. You can't block the HTTP request for that long.
### Our Solution: Event-Driven Processing
```
User uploads photo
  → API creates task (status: "pending")
  → API enqueues first workflow stage
  → Returns task ID immediately

Frontend polls every 3 seconds: "Is my image ready?"

Workflow engine processes stages sequentially:
  → Stage completes → enqueue next stage
  → Any stage fails → retry (up to 3 times)
  → All retries exhausted → mark task failed, refund credits

Final delivery:
  → Upload result to object storage
  → Update task status to "completed"
  → Frontend receives image URL on next poll
```
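On the client side, the polling half of this flow is a short loop. `fetchStatus` below is a stand-in for whatever status endpoint the API exposes (e.g. a hypothetical `GET /tasks/:id`):

```typescript
// Frontend polling loop: check task status on an interval until the
// task resolves or we give up.
type TaskStatus = { status: "pending" | "completed" | "failed"; imageUrl?: string };

async function pollTask(
  fetchStatus: (taskId: string) => Promise<TaskStatus>,
  taskId: string,
  intervalMs = 3000,
  maxPolls = 40, // ~2 minutes at the default interval
): Promise<TaskStatus> {
  for (let i = 0; i < maxPolls; i++) {
    const s = await fetchStatus(taskId);
    if (s.status !== "pending") return s;
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return { status: "failed" }; // timed out waiting
}
```

In production you would likely add jitter or switch to server-sent events/WebSockets, but plain polling is robust and trivially compatible with load balancers.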
The key insight: don't couple the user request to the processing pipeline. The API's only job is to accept the upload and return a task ID. Everything else happens asynchronously.
Here's a simplified version of the workflow coordinator:
```typescript
// Simplified workflow coordinator. `Stage`, `PipelineData`, and the
// helpers (executeWithRetry, markTaskFailed, refundCredits,
// uploadToStorage, updateTask) are defined elsewhere.
async function runPipeline(taskId: string, stages: Stage[]) {
  let currentData: PipelineData = { taskId };

  for (const stage of stages) {
    const result = await executeWithRetry(stage, currentData, {
      maxRetries: 3,
      timeout: 60_000, // 60s per stage
    });

    if (result.status === 'failed') {
      await markTaskFailed(taskId);
      await refundCredits(taskId);
      return;
    }

    // Each stage's output feeds into the next stage's input.
    currentData = { ...currentData, ...result.output };
  }

  // All stages passed — deliver result
  const imageUrl = await uploadToStorage(currentData.finalImage);
  await updateTask(taskId, { status: 'completed', imageUrl });
}
```
### Why Not Just Use a Queue?
We did start with a simple message queue. The problem is that multi-stage pipelines have complex failure modes. Stage 3 might fail because Stage 1 produced a bad mask. A flat queue doesn't capture these dependencies — you end up with orphaned messages and mysterious failures.
The DAG-based approach lets us replay from any stage, which is critical for debugging. When a user reports a bad generation, we can trace exactly which stage introduced the artifact and fix it without re-running the entire pipeline.
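Replay works because each stage's output is persisted. A sketch of the idea, with illustrative stage names and snapshot shape:

```typescript
// Replay the pipeline from a named stage, seeded with the persisted
// output of the stage before it. Earlier stages are skipped entirely.
interface StageDef {
  name: string;
  run: (data: Record<string, unknown>) => Record<string, unknown>;
}

function replayFrom(
  stages: StageDef[],
  startStage: string,
  snapshot: Record<string, unknown>, // persisted intermediate data
): Record<string, unknown> {
  const start = stages.findIndex((s) => s.name === startStage);
  if (start < 0) throw new Error(`unknown stage: ${startStage}`);
  // Fold the remaining stages over the snapshot, merging each output.
  return stages.slice(start).reduce((data, s) => ({ ...data, ...s.run(data) }), snapshot);
}
```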
## Local GPU vs. Cloud API: When to Use Which
We run a hybrid setup: local GPUs for some workloads, cloud APIs for others. Here's the decision matrix:
| Factor | Local GPU (FLUX.1 Dev) | Cloud Inference API |
|---|---|---|
| Latency | 8–15s (single image) | 10–30s (varies by model) |
| Cost at scale | Lower (amortized GPU) | Higher (pay per call) |
| Burst capacity | Limited by GPU count | Virtually unlimited |
| Customization | Full control over params | Limited to API options |
| Cold start | 3–5s model loading | Sub-second (managed) |
| Product fidelity | Excellent (fine-tuned) | Very good (out of box) |
When we use local: Background removal (always), quality scoring (always), primary generation for premium users, batch jobs where we control the throughput.
When we use cloud API: Burst traffic that exceeds local capacity, models we haven't fine-tuned locally, and geographic routing for users far from our GPU region.
The hybrid approach is key to keeping our per-image cost low enough to offer a competitive product.
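The routing rules above can be condensed into a small decision function. The signals and thresholds here are illustrative, not the production heuristics:

```typescript
// Local-vs-cloud routing sketch. Inputs and the queue-depth threshold
// are assumptions chosen to mirror the decision matrix above.
interface RouteInput {
  premiumTier: boolean;
  localQueueDepth: number;        // jobs currently waiting on local GPUs
  modelAvailableLocally: boolean; // is the requested model fine-tuned locally?
}

function chooseBackend(r: RouteInput): "local" | "cloud" {
  if (!r.modelAvailableLocally) return "cloud"; // not hosted locally
  if (r.premiumTier && r.localQueueDepth < 10) return "local"; // max fidelity
  if (r.localQueueDepth >= 10) return "cloud"; // burst overflow
  return "local"; // default: cheaper amortized GPU time
}
```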
## Cost Breakdown: $1–2/image vs. $25–150 Traditional
This is the part I wish someone had written before we started. When you factor in the entire production pipeline — not just the raw API call — here's what a single AI-generated product photo actually costs to deliver:
| Component | Cost per Image | Notes |
|---|---|---|
| GPU infrastructure (local inference, amortized) | ~$0.15–0.25 | A10G instances, background removal + generation + quality scoring |
| Cloud inference API (burst/overflow) | ~$0.10–0.20 | Multi-model routing, premium model access |
| Model fine-tuning & training (amortized) | ~$0.05–0.10 | Custom product preservation models, ongoing improvement |
| Object storage & CDN | ~$0.05 | Reference images, generated images, global delivery |
| Database & application hosting | ~$0.03 | Task metadata, user state, workflow orchestration |
| Quality assurance pipeline | ~$0.05–0.10 | Multi-pass verification, automated retries, CLIP scoring |
| Engineering overhead (amortized) | ~$0.10–0.15 | Pipeline maintenance, model updates, monitoring |
| **Total** | **~$0.55–0.98** | All-in cost per delivered image |
So we're looking at roughly $1 per image on the efficient path and up to $2 when cloud burst capacity kicks in or a generation needs multiple retries.
Compare that to traditional product photography at $25–150 per image. Even the cheapest stock photo services run $5–10 per image, and those aren't even your product.
The math is brutal for traditional studios: a seller with 200 SKUs needing 5 photos each would spend $25,000–$150,000 on photography. With AI, that same catalog costs $1,000–$2,000. That's not a marginal improvement — it's a completely different business model for small sellers.
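The catalog arithmetic above, spelled out:

```typescript
// Catalog cost = SKUs × photos per SKU × per-image price.
function catalogCost(skus: number, photosPerSku: number, perImage: number): number {
  return skus * photosPerSku * perImage;
}

// 200 SKUs × 5 photos each = 1,000 images:
const images = 200 * 5;
const aiLow = catalogCost(200, 5, 1);        // $1,000 at ~$1/image
const aiHigh = catalogCost(200, 5, 2);       // $2,000 at ~$2/image
const studioLow = catalogCost(200, 5, 25);   // $25,000 at the low studio rate
const studioHigh = catalogCost(200, 5, 150); // $150,000 at the high studio rate
```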
## Mistakes We Made
### 1. Single-Model Dependency (Month 1)
We started with one cloud API provider for everything. When they had a 6-hour outage, our entire platform went dark. Zero fallback, zero redundancy.
Fix: Built the hybrid local + cloud architecture. Now if the cloud API goes down, local GPU inference takes over automatically. We maintain 99.5% uptime even during provider outages.
### 2. No Quality Gate (Month 2)
Early on, every generation was delivered directly to the user — including the ones where the AI hallucinated extra buttons on a shirt or changed the product color entirely. We got angry emails daily.
Fix: Added the automated quality scoring stage. A fine-tuned model compares the output against the original product photo and rejects generations that distort the product. Rejected images trigger an automatic retry with adjusted parameters. This brought our "bad generation" rate from ~15% to under 3%.
### 3. Synchronous Processing (Month 2)
Our first architecture processed the entire pipeline synchronously during the HTTP request. A complex generation could take 40+ seconds, which caused frequent timeouts and a terrible user experience.
Fix: Moved to fully async event-driven processing. The API returns immediately with a task ID, and the frontend polls for results. This eliminated timeout errors entirely.
### 4. No Credit Refund on Failure (Month 3)
Users were getting charged even when the pipeline failed entirely or the quality gate rejected the output after all retries. Support tickets piled up fast.
Fix: The workflow engine now automatically refunds credits for any task that fails at any stage — whether it's a model error, a timeout, or a quality rejection after exhausting retries.
## Product Preservation: The Hardest Problem
Here's something most "AI product photography" articles don't mention: the AI will change your product. It might add buttons that don't exist, warp logos, change the product color, or subtly alter the shape.
This is a dealbreaker for e-commerce. Your product photo has to accurately represent what the customer receives, or you get returns and bad reviews. NRF data shows retailers lose $890B annually to returns, and inaccurate product photos are a major contributor.
Our multi-layered approach:
- Semantic segmentation first: Before any generation, we extract a precise product mask. This tells the generation model exactly what to preserve.
- Structured prompt engineering: We don't let users write free-form prompts. Every generation uses a curated template with explicit preservation instructions embedded in the prompt architecture.
- Multi-pass verification: After generation, our quality model scores product fidelity. Low-scoring outputs are retried automatically with adjusted parameters (stronger preservation weight, different seed, modified scene complexity).
- Human-in-the-loop (premium): For enterprise clients, failed quality checks route to a human reviewer who can manually adjust parameters before a final retry.
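The retry adjustments mentioned above can be expressed as a simple schedule. The specific increments and caps here are illustrative, not the tuned production values:

```typescript
// Retry adjustment schedule: each failed quality check retries with a
// stronger preservation weight, a fresh seed, and a simpler scene.
interface RetryParams {
  preservationWeight: number;
  seed: number;
  sceneComplexity: number; // 1 = plainest scene
}

function adjustForRetry(p: RetryParams, attempt: number): RetryParams {
  return {
    // Ramp up preservation, capped so the scene doesn't collapse entirely.
    preservationWeight: Math.min(p.preservationWeight + 0.25 * attempt, 2.0),
    seed: p.seed + attempt, // decorrelate from the failed generation
    sceneComplexity: Math.max(p.sceneComplexity - 1, 1), // simplify the scene
  };
}
```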
## The Result
All of this pipeline complexity exists to serve one purpose: letting a small seller upload a phone photo and get back a professional product shot they can actually use on their storefront.
We shipped this as Photoshoot.app — an AI product photography platform that handles product shoots, OOTD fashion photos, and social media content. Users upload a product image, pick a scene template, and receive a studio-quality photo in about 30 seconds. At $1–2 per image, it's accessible to sellers who could never afford a traditional shoot.
## What's Next
We're expanding the pipeline to handle video generation (product demo clips from static photos), batch processing for catalogs with 500+ SKUs, and a self-serve API for e-commerce platforms.
The multi-stage workflow architecture scales well — adding a new processing stage is just another node in the DAG. The hard part wasn't building the pipeline; it was getting product preservation to a level where users trust the output enough to put it on their storefront.
If you're building something similar — multi-model AI pipelines, product-focused generation, or just wrestling with async task processing at scale — I'd love to hear about your approach in the comments.
