DEV Community

Brad Hankee

How I Built an Evaluation Pipeline for AI Image Generation

We automated blog image generation with Imagen 4. It worked beautifully... except when it didn't.

Sometimes the AI would sneak text into images despite explicit instructions not to. Here's how we fixed it with automated evaluation.

The Problem

We're generating retro hero images for blog posts at Vets Who Code. The requirements were strict:

  • Bold navy/red/white color palette only
  • NO text or typography
  • Retro poster aesthetic

Imagen 4 is powerful, but non-deterministic. We'd get 8 perfect images, then 2 with random text. Manual QA doesn't scale.

The Solution: Automated Evaluation

We built a test harness using Gemini Vision to grade Imagen's outputs:

Step 1: Generate test images
Run the same prompts 10x to measure consistency
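In practice, Step 1 is just a loop over (prompt, run) pairs. Here's a minimal sketch; `generate_image` is a hypothetical wrapper around the Imagen 4 call, not our actual harness code:

```python
import itertools

def make_test_jobs(prompts, runs=10):
    """Pair every prompt with a run index so each prompt is generated `runs` times."""
    return [(prompt, i) for prompt, i in itertools.product(prompts, range(runs))]

if __name__ == "__main__":
    jobs = make_test_jobs(["retro hero poster, navy/red/white, no text"], runs=10)
    for prompt, run in jobs:
        # image = generate_image(prompt)        # hypothetical Imagen 4 wrapper
        # image.save(f"out/run_{run:02d}.png")  # keep every output for grading
        pass
```

Keeping the run index in the job list matters later: it lets you trace a failing image back to the exact run that produced it.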

Step 2: Automated validation

  • Use Gemini Vision to detect text (even single letters)
  • Validate color palette adherence
  • Check style consistency
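The validation step boils down to asking Gemini Vision for a structured verdict and parsing it. A sketch, assuming the `google-generativeai` SDK; the grading prompt wording, the JSON schema, and the `parse_verdict` helper are illustrative, not our exact setup:

```python
import json

GRADING_PROMPT = """Examine this image. Return JSON only:
{"has_text": bool, "off_palette": bool, "style_ok": bool, "notes": str}
Flag has_text for ANY glyph, even a single letter or number."""

def parse_verdict(raw):
    """Parse the model's JSON reply; Gemini often wraps JSON in a code fence."""
    cleaned = raw.strip().removeprefix("```json").removesuffix("```").strip()
    verdict = json.loads(cleaned)
    passed = (not verdict["has_text"]
              and not verdict["off_palette"]
              and verdict["style_ok"])
    return passed, verdict

# Hypothetical call (model name is an assumption):
# model = genai.GenerativeModel("gemini-1.5-flash")
# reply = model.generate_content([GRADING_PROMPT, PIL.Image.open(path)])
# passed, verdict = parse_verdict(reply.text)
```

Asking for JSON instead of free-form prose is what makes the harness automatable: pass/fail becomes a boolean, not a judgment call.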

Step 3: Track metrics

  • Pass rate across test cases
  • Violation patterns (text vs colors vs style)
  • Save failed images for inspection
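The metrics layer needs no AI at all. A minimal sketch of the bookkeeping (the result-tuple shape is an assumption):

```python
import shutil
from collections import Counter
from pathlib import Path

def summarize(results):
    """results: list of (image_path, passed, violations) tuples.
    Returns the overall pass rate and a count of each violation type."""
    total = len(results)
    passed = sum(1 for _, ok, _ in results if ok)
    violations = Counter(v for _, ok, vs in results if not ok for v in vs)
    return (passed / total if total else 0.0), violations

def save_failures(results, out_dir="eval_failures"):
    """Copy failing images aside so they can be inspected by hand."""
    Path(out_dir).mkdir(exist_ok=True)
    for path, ok, _ in results:
        if not ok:
            shutil.copy(path, out_dir)
```

The violation counter is the part that drove our prompt iteration: seeing "text" dominate the failures told us where to focus.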

Step 4: Iterate on prompts
Based on the eval data, we discovered:

  • Moving "NO TEXT" to the top of the prompt improved adherence by 40%
  • Repeating constraints in multiple places reduced violations BUT only to an extent
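The layout the eval data favored can be sketched as a small prompt builder; the exact wording here is illustrative, not the production prompt:

```python
def build_prompt(subject, negatives=("no text", "no typography")):
    """Put hard constraints first (where adherence was best in our evals),
    then the subject, then repeat the constraints exactly once at the end.
    Repeating them more than that made violations worse, not better."""
    header = "NO TEXT OF ANY KIND. " + ", ".join(negatives) + "."
    footer = "Reminder: " + ", ".join(negatives) + "."
    return f"{header}\n{subject}\n{footer}"
```

The key design choice is the single repetition: one header plus one footer, nothing more, for the reason covered in the next section.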

Too many NOs equal a YES

  • This turned out to be a lot like telling children "no": the more often you say it, the more likely it is to fail. The number of negative constraints (no letters, no text, no numbers, etc.) followed a bell curve: zero constraints didn't work, a few at deliberate locations worked well, and adding more caused more failures. Why?

Our theory: piling on negative constraints floods the prompt with the very words we're trying to exclude. Even though we were explicitly saying NOT to include them, the sheer repetition of "text", "letters", and "numbers" seemed to signal to the model that these concepts were important enough to render.

The Takeaway

When using AI in production:

  1. Assume non-determinism — what works once might fail the 10th time
  2. Use AI to evaluate AI — Gemini Vision is great for validating image outputs
  3. Make evals fast — automated testing lets you iterate rapidly
  4. Track metrics over time — you want to know if model updates break your workflow

Brad
bradhankee.com
