We automated blog image generation with Imagen 4. It worked beautifully... except when it didn't.
Sometimes the AI would sneak text into images despite explicit instructions not to. Here's how we fixed it with automated evaluation.
The Problem
We're generating retro hero images for blog posts at Vets Who Code. The requirements were strict:
- Bold navy/red/white color palette only
- NO text or typography
- Retro poster aesthetic
Imagen 4 is powerful, but non-deterministic. We'd get 8 perfect images, then 2 with random text. Manual QA doesn't scale.
The Solution: Automated Evaluation
We built a test harness using Gemini Vision to grade Imagen's outputs:
Step 1: Generate test images
Run the same prompts 10x to measure consistency
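A minimal sketch of the repetition harness: `generate` is injected so the loop can be exercised without the real Imagen API (the `fake_generate` stub below is a stand-in, not actual SDK code; in production it would wrap the image-generation call):

```python
# Run each prompt n times and collect outputs for grading.
def run_trials(prompts, generate, n=10):
    results = {}
    for prompt in prompts:
        results[prompt] = [generate(prompt) for _ in range(n)]
    return results

# Hypothetical stub standing in for the Imagen call.
def fake_generate(prompt):
    return f"image-for:{prompt}"

trials = run_trials(["retro hero, no text"], fake_generate, n=3)
```

Injecting the generator also makes the harness itself testable, which matters once you start trusting its numbers.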
Step 2: Automated validation
- Use Gemini Vision to detect text (even single letters)
- Validate color palette adherence
- Check style consistency
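One way to structure the grading step is to ask the vision model for a structured verdict and validate it locally. The JSON shape below is our own convention for illustration, not a Gemini schema, and the validation logic is a sketch:

```python
import json

ALLOWED_COLORS = {"navy", "red", "white"}

def parse_verdict(raw):
    """Parse a JSON verdict from the vision grader.

    Assumed shape (illustrative, not a Gemini API schema):
      {"has_text": bool, "colors": [...], "style": "retro" | other}
    Returns the list of violated rules, empty if the image passes.
    """
    v = json.loads(raw)
    violations = []
    if v["has_text"]:
        violations.append("text")
    if set(v["colors"]) - ALLOWED_COLORS:
        violations.append("colors")
    if v["style"] != "retro":
        violations.append("style")
    return violations

# Example verdict as the grader might return it:
verdict = '{"has_text": true, "colors": ["navy", "green"], "style": "retro"}'
print(parse_verdict(verdict))  # -> ['text', 'colors']
```

Keeping the pass/fail decision in your own code, rather than asking the model for a bare yes/no, makes the rules auditable and easy to tighten later.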
Step 3: Track metrics
- Pass rate across test cases
- Violation patterns (text vs colors vs style)
- Save failed images for inspection
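The metrics above reduce to a small aggregation over per-image violation lists; a sketch:

```python
from collections import Counter

def summarize(results):
    """results: one violation list per generated image
    (empty list = pass). Returns pass rate and violation counts."""
    total = len(results)
    passed = sum(1 for v in results if not v)
    patterns = Counter(kind for v in results for kind in v)
    return {"pass_rate": passed / total, "violations": dict(patterns)}

# Five runs: three clean, one text violation, one text + colors.
runs = [[], ["text"], [], ["text", "colors"], []]
print(summarize(runs))  # -> {'pass_rate': 0.6, 'violations': {'text': 2, 'colors': 1}}
```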
Step 4: Iterate on prompts
Based on the eval data, we discovered:
- Moving "NO TEXT" to the top of the prompt improved adherence by 40%
- Repeating constraints in multiple places reduced violations, but only up to a point
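Prompt iteration is easiest when the variants live in code and run through the same harness. The wording below is illustrative, not our production prompts:

```python
# Hypothetical prompt variants for A/B evaluation:
# same base description, different constraint placement.
BASE = "Retro poster hero image, bold navy/red/white palette"

VARIANTS = {
    "constraint_last": f"{BASE}. NO TEXT.",
    "constraint_first": f"NO TEXT, no letters, no typography. {BASE}.",
}

for name, prompt in VARIANTS.items():
    print(name, "->", prompt)
```

Each variant gets its own pass rate, so a prompt change is judged by numbers rather than by eyeballing a handful of images.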
Too many NOs equal a YES
This turned out to be much like talking to children: the more often you say "NO", the more likely it is to fail. The number of negative constraints (no letters, no text, no numbers, etc.) followed a bell curve: zero constraints didn't work, a few at deliberate locations worked well, and piling on more made failures climb again. Why?
Too many negative constraints flood the prompt with the very words you're trying to exclude. Even though we were telling the model NOT to include them, the sheer repetition of "text" and "letters" appears to signal that those concepts are important enough to render.
The Takeaway
When using AI in production:
- Assume non-determinism — what works once might fail the 10th time
- Use AI to evaluate AI — Gemini Vision is great for validating image outputs
- Make evals fast — automated testing lets you iterate rapidly
- Track metrics over time — you want to know if model updates break your workflow
Brad
bradhankee.com