Most teams building with AI image generation APIs obsess over which model to use. FLUX or Stable Diffusion? Which checkpoint? Which LoRA?
I ran an AI headshot company that generated over 35 million images in two years. Crossed $2.2M in revenue. Hit 87% gross margins. And the model we used was open source. Free.
The model was never what made it work. The workflow around the model was.
Here's what I learned building AI image pipelines at scale, and why most teams get the architecture completely wrong.
## The "generate and pray" problem
Here's how most teams ship AI-generated images today:
- User sends a request
- Call an image generation API
- Return whatever comes back
- Hope it's good
This works fine at 10 images a day. It breaks completely at 10,000.
At scale, defects become statistical certainties. Face distortions. Wrong backgrounds. Artifacts that look fine at thumbnail size and horrific at full resolution. Skin tone inconsistencies. Missing fingers (the classic).
When you generate 100 images, you might get lucky. When you generate 100,000, you will ship garbage. Guaranteed. The only question is how much.
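The arithmetic behind that certainty is simple. A minimal sketch, assuming an illustrative 1% per-image defect rate (the real rate depends on your model and use case):

```python
# With a 1% per-image defect rate, the chance an entire batch
# ships defect-free collapses as volume grows.
defect_rate = 0.01

def p_all_clean(n_images: int) -> float:
    """Probability that every image in a batch of n is defect-free."""
    return (1 - defect_rate) ** n_images

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} images: {p_all_clean(n):.2%} chance of zero defects")
```

At 100 images you still have about a one-in-three chance of a clean batch. At 10,000 the probability is effectively zero.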
We learned this the hard way. Our first month running AI headshots, we generated a batch of images for a customer and delivered them without any automated QA. The customer's feedback: "Why does my colleague have three ears?"
That was the last time we shipped without scoring.
## The assembly line, not the craftsman
In 1913, a skilled craftsman took 12 hours to build a single car chassis. Henry Ford didn't hire a better craftsman. He built the assembly line. Specialized stations. Quality inspection at every step. Rework loops when something failed. Result: 93 minutes per chassis. 8x faster. 69% cheaper.
Most AI image teams today are still in the craftsman era. One model call. One output. Ship it.
What we built instead was an assembly line for AI images. Three distinct layers, each solving a different problem.
## Layer 1: Generate more than you need
This sounds wasteful. It's the opposite.
For every customer request, we didn't generate 1 image. We generated 240 candidates. Only the best 60 made it to the customer. The other 180 went straight to the trash.
The math works because GPU time is cheap compared to a bad customer experience. At our volumes, generating 4x more candidates added roughly $0.02 per delivered image. A single refund from a bad image costs 100x that.
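The back-of-envelope version of that math, using an illustrative per-candidate GPU cost (an assumption for the sketch, not a figure from our billing data):

```python
# Funnel economics: the cost of overgenerating, spread across
# the images that actually ship. Per-candidate cost is assumed.
cost_per_candidate = 0.0067        # dollars of GPU time per generated image
candidates_per_request = 240
delivered_per_request = 60

extra_candidates = candidates_per_request - delivered_per_request
added_cost_per_delivered = extra_candidates * cost_per_candidate / delivered_per_request
print(f"Overgeneration adds ~${added_cost_per_delivered:.3f} per delivered image")
```

Roughly two cents of insurance per delivered image, against refunds that cost dollars.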
The key insight: treat image generation like a funnel, not a function call. You're not calling an API. You're running a selection process.
```python
# Simplified version of our generation loop
candidates = []
for _ in range(num_candidates):
    image = generate_image(
        prompt=prompt,
        seed=random_seed(),
        provider=select_cheapest_available_provider(),
    )
    candidates.append(image)

# Score all candidates
scored = [quality_score(img, config) for img in candidates]

# Deliver only what passes every quality dimension
delivered = [s.image for s in scored if s.passed]
```
## Layer 2: Score everything before it ships
This is where most teams have a blind spot. They generate images but have no automated way to evaluate whether the output is actually good.
We built a three-tier scoring system:
Tier 1: Generic quality. Does the image have artifacts? Is it sharp? Does it match the prompt? These checks apply to every single image regardless of use case. Think of it as a basic sanity check.
Tier 2: Use-case specific. For headshots, this meant: face fidelity, expression naturalness, skin tone consistency, lighting quality, background coherence. A perfectly sharp image with a distorted face is still unusable.
Tier 3: Custom rules. Business-specific criteria. "No visible branding in the background." "Skin tone must be within 2 stops of reference." "Eyes must be open." Whatever the client cares about.
Each dimension gets scored independently. The final decision isn't a single number. It's a pass/fail across all dimensions, with configurable thresholds.
```python
def quality_score(image, config):
    scores = {}

    # Tier 1: Generic
    scores['artifacts'] = detect_artifacts(image)
    scores['sharpness'] = measure_sharpness(image)
    scores['prompt_alignment'] = clip_similarity(image, config.prompt)

    # Tier 2: Use-case specific
    if config.use_case == 'headshot':
        scores['face_fidelity'] = score_face(image)
        scores['expression'] = score_expression(image)
        scores['skin_tone'] = score_skin_consistency(image, config.reference)

    # Tier 3: Custom rules
    for rule in config.custom_rules:
        scores[rule.name] = rule.evaluate(image)

    # Pass/fail per dimension
    passed = all(
        scores[dim] >= config.thresholds[dim]
        for dim in scores
    )
    return ScoredImage(image=image, scores=scores, passed=passed)
```
The result: we eliminated manual QA entirely. No human ever looked at the rejected images. The scoring layer caught everything.
When we didn't have this (early days), our customer support tickets were 40% image quality complaints. After implementing automated scoring, they dropped to under 3%.
## Layer 3: Route to the cheapest GPU that can do the job
This is the one nobody talks about.
When you're calling AI image generation APIs at scale, you're probably using one provider. Maybe fal.ai, maybe Replicate, maybe Together.ai. You picked one, integrated it, and moved on.
That's leaving money on the table.
We built a routing layer that checked multiple providers on every single request and sent the job to the cheapest one that was currently available and fast enough.
Why this matters: provider pricing varies wildly. Not just between providers, but within the same provider over time. Spot pricing changes. Capacity fluctuates. Cold start times spike during peak hours.
Some real numbers from our routing data:
| Provider scenario | Cost per image (1 megapixel) |
|---|---|
| Single provider, no routing | $0.035 |
| Cheapest provider at any given moment | $0.012 |
| With fallback on timeout/error | $0.014 |
That's a 60-65% cost reduction just from routing. At 100K+ images per month, this is the difference between a viable business and burning cash.
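The savings compound with volume. A quick back-of-envelope using the table's numbers at a 100K-image month:

```python
# Monthly savings implied by the routing table above.
single, routed = 0.035, 0.014      # $/image, from the table
volume = 100_000                   # images per month

savings = (single - routed) * volume
reduction = 1 - routed / single
print(f"${savings:,.0f}/month saved ({reduction:.0%} cheaper)")
# → $2,100/month saved (60% cheaper)
```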
The routing decision is simple in concept:
```python
def select_provider(model, requirements):
    available = get_healthy_providers(model)

    # Filter by capability
    capable = [p for p in available if p.supports(requirements)]

    # Sort by current effective cost
    capable.sort(key=lambda p: p.current_cost_per_image(requirements))

    # Return cheapest, with a fallback chain
    return capable[0] if capable else fallback_provider
```
In practice, there's more to it. You need health checking (is this provider actually responding right now?), timeout handling (if it takes too long, abort and retry on a different provider), and cost tracking (did the actual cost match what we expected?).
But the basic pattern is dead simple: check what's available, pick the cheapest, have a fallback.
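A sketch of the fallback behavior described above. The `Provider` stubs and error class are illustrative stand-ins, not a real SDK:

```python
class ProviderError(Exception):
    pass

# Minimal stand-ins for real provider clients (illustrative only).
class Provider:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def generate(self, request, timeout):
        if not self.healthy:
            raise ProviderError(f"{self.name} unavailable")
        return f"image-from-{self.name}"

def generate_with_fallback(request, providers, timeout_s=30):
    """Try providers cheapest-first; on error or timeout, fall through."""
    last_error = None
    for provider in providers:
        try:
            return provider.generate(request, timeout=timeout_s)
        except (TimeoutError, ProviderError) as err:
            last_error = err       # provider failed: try the next-cheapest
    raise last_error

chain = [Provider("spot-a", healthy=False), Provider("fal-b")]
print(generate_with_fallback("headshot-prompt", chain))
# → image-from-fal-b
```

The cheapest provider fails, the chain absorbs it, and the request still completes.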
## The numbers that convinced me
Before the routing and scoring layers, our unit economics looked like this:
- COGS: ~40% of revenue
- Customer complaints about quality: ~40% of support tickets
- Manual QA required: yes, for every batch
After:
- COGS: 11% of revenue
- Quality complaints: under 3% of tickets
- Manual QA: zero
Gross margins went from roughly 60% to 87%. On the same models. Same images. Same customers. The only thing that changed was the workflow around the model.
## Why this pattern works for any AI image use case
We started with headshots. But the pattern applies everywhere.
Background removal? Same thing. Commercial APIs charge $0.02 to $0.20 per image. Self-hosted open source models can do it for $0.0004. But only if you have the routing and quality layers to handle provider failures, cold starts, and the occasional garbage output.
Product photography? Virtual try-on? Ad creative generation? The specific models change. The scoring dimensions change. But the architecture stays the same:
- Generate more candidates than you need
- Score every candidate automatically
- Route to the cheapest capable provider
- Only deliver what passes your quality bar
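The four steps above can be sketched as a single funnel. The `generate` stub below fakes quality with a random number; in reality that's your model call plus the scoring layer:

```python
import random
random.seed(0)

# Toy stand-in: a real pipeline would call the model and scorer here.
def generate(prompt):
    return {"prompt": prompt, "quality": random.random()}

def fulfill_request(prompt, n_candidates=240, n_deliver=60, threshold=0.5):
    candidates = [generate(prompt) for _ in range(n_candidates)]
    passing = [c for c in candidates if c["quality"] >= threshold]
    best = sorted(passing, key=lambda c: c["quality"], reverse=True)
    return best[:n_deliver]   # only the best survivors ship

delivered = fulfill_request("studio headshot")
print(len(delivered))  # at most 60, every one above the quality bar
```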
It's not complicated. It's just a pattern most teams haven't adopted yet because they're still in the "call one API and hope" phase.
## What I'd do differently
If I were starting a new AI image product today, I'd build the scoring layer before I built the product. Not after. Not when quality becomes a problem. Before.
Here's why: the scoring layer changes what's possible. When you can automatically evaluate quality, you can:
- Use cheaper models and compensate with volume
- Switch providers without regression testing every image by hand
- Set up automated retry loops (generate, score, regenerate if failed)
- Give customers quality guarantees instead of quality hopes
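The retry loop in that list is a few lines once scoring exists. A sketch with a stub scorer that (purely for illustration) fails the first two attempts:

```python
# Illustrative retry loop: generate, score, regenerate on failure.
attempts = {"n": 0}

def generate(prompt):
    attempts["n"] += 1
    return f"{prompt}-v{attempts['n']}"

def passes(image):
    return attempts["n"] >= 3   # stub: pretend the third attempt is good

def generate_until_passing(prompt, max_tries=5):
    for _ in range(max_tries):
        image = generate(prompt)
        if passes(image):
            return image
    return None   # out of budget: escalate instead of shipping garbage

print(generate_until_passing("headshot"))
# → headshot-v3
```

Note the budget cap: an unbounded retry loop on a bad prompt burns GPU money forever.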
The model is a commodity. There are hundreds of them. New ones every week. The workflow is the moat.
## What we're building now
We took everything we learned from generating 35 million images and turned it into Runflow. It's the infrastructure layer we wish existed when we started: automated quality evaluation, multi-provider routing, one-click deployment for ComfyUI workflows. The things that took us two years to build from scratch.
If you're running AI image generation at any kind of scale and want to compare notes, I'm always up for a conversation. Find me on LinkedIn or drop a comment.
The model is never the product. The workflow is the product.
Ricardo Ghekiere, CEO at Runflow