How do you objectively compare the output of Midjourney, DALL-E 3, Stable Diffusion, and Flux? That was the engineering challenge behind AI Image Compare, and the answer turned out to involve more signal processing than I expected.
The Subjective Quality Problem
Image quality is inherently subjective. One person loves photorealistic output; another prefers artistic interpretation. Building a comparison platform means you cannot just say "Midjourney is better" — you need to break quality down into measurable dimensions.
We identified five axes that matter for most users:
- Prompt adherence — Did the image match what was asked for?
- Visual fidelity — Resolution, artifact-free rendering, coherent lighting
- Text rendering — Can the model render text in images correctly?
- Style consistency — Same prompt, similar style across generations
- Generation speed — Time from prompt submission to output
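On the site, these five axes roll up into a single ranking number. A minimal sketch of that weighting — the weight values and the `composite_score` helper here are illustrative, not our production numbers:

```python
def composite_score(axis_scores, weights=None):
    """Combine per-axis scores (0-10 each) into one 0-10 ranking score.

    axis_scores: dict mapping axis name -> score, e.g.
        {"adherence": 8, "fidelity": 7, "text": 5, "consistency": 9, "speed": 6}
    """
    # Illustrative weights; adherence dominates because it is the first
    # thing users check when comparing outputs.
    weights = weights or {
        "adherence": 0.35, "fidelity": 0.25, "text": 0.15,
        "consistency": 0.15, "speed": 0.10,
    }
    total = sum(axis_scores[axis] * w for axis, w in weights.items())
    return round(total, 2)
```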
Automated Prompt Testing Framework
We built a test suite of 200 standardized prompts organized by difficulty level:
test_prompts = {
    "basic_object": [
        "A red apple on a white table",
        "A blue car parked on a street",
    ],
    "complex_scene": [
        "A medieval castle on a cliff at sunset with birds flying overhead",
        "A busy Tokyo street at night with rain reflections",
    ],
    "text_rendering": [
        "A storefront sign reading OPEN 24 HOURS",
        "A birthday cake with the text Happy 30th written in icing",
    ],
    "anatomy": [
        "A person playing piano, hands visible on keys",
        "Two people shaking hands in a business setting",
    ],
    "abstract": [
        "The concept of time flowing like water",
        "Synesthesia visualized as colors emerging from music",
    ],
}
Every week, a cron job sends each prompt to every supported API and stores the results. This gives us longitudinal data — we can track how models improve (or regress) over time.
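The job itself is conceptually simple. A sketch of the runner, where `generate(model, prompt)` is a stand-in for our per-provider API adapters and is assumed to return the image bytes plus elapsed seconds:

```python
import datetime
import json
import pathlib

def run_weekly_benchmark(models, test_prompts, generate, out_dir="benchmarks"):
    """Send every prompt to every model; store each image next to its metadata.

    `generate(model, prompt)` is a placeholder for per-provider adapters;
    it is assumed to return (image_bytes, elapsed_seconds).
    """
    week = datetime.date.today().isoformat()
    root = pathlib.Path(out_dir) / week
    for category, prompts in test_prompts.items():
        for i, prompt in enumerate(prompts):
            for model in models:
                image, elapsed = generate(model, prompt)
                stem = root / category / f"{model}_{i:03d}"
                stem.parent.mkdir(parents=True, exist_ok=True)
                stem.with_suffix(".png").write_bytes(image)
                # Sidecar JSON keeps the run queryable without a database.
                stem.with_suffix(".json").write_text(json.dumps({
                    "model": model, "prompt": prompt,
                    "category": category, "seconds": elapsed,
                }))
```

Keying the directory tree by ISO week date is what makes the longitudinal analysis easy: diffing two weeks is a filesystem walk.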
The Prompt Adherence Scorer
This was the trickiest component. How do you automatically judge whether an image matches a prompt?
We use a multi-model approach. The generated image gets passed to a vision model (GPT-4V or Claude) with the original prompt, and we ask for a structured evaluation:
import json

def score_adherence(image_path, original_prompt):
    response = vision_api.analyze(
        image=image_path,
        system="You are an image quality evaluator. Score strictly.",
        prompt=f"""The user requested: "{original_prompt}"
Score this image on:
1. Object presence (0-10): Are all requested objects present?
2. Spatial accuracy (0-10): Are objects positioned correctly?
3. Attribute accuracy (0-10): Colors, sizes, styles correct?
4. Overall match (0-10): How well does this match the prompt?
Return JSON only.""",
    )
    return json.loads(response)
Using a vision LLM to evaluate another model's output feels circular, but in practice the evaluator model is remarkably consistent. We validated this by having three humans score 500 images and comparing against the automated scores. The correlation was 0.87 — good enough for ranking purposes, though not for absolute quality claims.
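The validation step is small once you have paired scores. A sketch using SciPy's Pearson correlation, averaging the human raters per image first (the array shapes here are illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

def validate_scorer(human_ratings, auto_scores):
    """Correlate mean human scores with automated scores for the same images.

    human_ratings: (n_images, n_raters) array of per-rater scores.
    auto_scores:   (n_images,) array of automated scores.
    """
    mean_human = np.asarray(human_ratings, dtype=float).mean(axis=1)
    r, p = pearsonr(mean_human, np.asarray(auto_scores, dtype=float))
    return float(r), float(p)
```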
Image Quality Metrics Without AI
Not everything requires an LLM. We also compute traditional image quality metrics:
import os

from PIL import Image
import numpy as np

def analyze_technical_quality(image_path):
    img = Image.open(image_path)
    arr = np.array(img)
    return {
        "resolution": f"{img.width}x{img.height}",
        "file_size_kb": os.path.getsize(image_path) / 1024,
        "color_depth": len(np.unique(arr.reshape(-1, arr.shape[2]), axis=0)),
        "edge_sharpness": compute_laplacian_variance(arr),
        "noise_level": estimate_noise(arr),
        "dynamic_range": int(arr.max()) - int(arr.min()),
    }
from scipy.signal import convolve2d

def compute_laplacian_variance(img_array):
    """Variance of the Laplacian response; higher values = sharper image."""
    gray = np.mean(img_array, axis=2)
    laplacian = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]])
    filtered = convolve2d(gray, laplacian, mode="same")
    return float(np.var(filtered))
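The `estimate_noise` helper referenced above is not shown; one common choice for this kind of metric is Immerkær's fast noise variance estimator, which convolves the grayscale image with a Laplacian-difference kernel and averages the absolute response. A sketch under that assumption:

```python
import numpy as np
from scipy.signal import convolve2d

def estimate_noise(img_array):
    """Immerkaer (1996) fast noise sigma estimate; higher = noisier image."""
    gray = np.mean(img_array, axis=2) if img_array.ndim == 3 else img_array
    kernel = np.array([[1, -2, 1], [-2, 4, -2], [1, -2, 1]])
    h, w = gray.shape
    # Sum of absolute Laplacian-difference responses, normalized to a sigma.
    total = np.sum(np.abs(convolve2d(gray, kernel, mode="valid")))
    return float(total * np.sqrt(0.5 * np.pi) / (6 * (w - 2) * (h - 2)))
```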
The Laplacian variance is particularly useful for detecting the "AI smoothness" problem — where generated images look superficially polished but lack the fine detail of real photographs.
Storage Architecture
Each weekly benchmark run generates approximately 1,000 images (200 prompts x 5 models). At an average of 2MB per image, that is 2GB per week, roughly 100GB per year.
We use a tiered storage approach:
- Hot storage: Last 4 weeks of images on the web server for fast display
- Warm storage: Cloudflare R2 for older images (cheap, fast CDN)
- Cold storage: Compressed archives of raw benchmark data for analysis
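Because R2 speaks the S3 API, the hot-to-warm migration can be a small script. A sketch with a four-week cutoff — the client is injected (e.g. a boto3 S3 client pointed at the R2 endpoint), and the bucket and directory names are illustrative:

```python
import datetime
import pathlib

def migrate_to_warm(s3, hot_dir="benchmarks", bucket="image-archive", today=None):
    """Upload weekly benchmark folders older than 4 weeks, then delete locally.

    `s3` is any S3-compatible client (e.g. boto3 against the R2 endpoint);
    directories under hot_dir are assumed to be named by ISO week date.
    """
    today = today or datetime.date.today()
    cutoff = today - datetime.timedelta(weeks=4)
    migrated = []
    for week_dir in sorted(pathlib.Path(hot_dir).iterdir()):
        if datetime.date.fromisoformat(week_dir.name) >= cutoff:
            continue  # still inside the hot window
        for f in week_dir.rglob("*"):
            if f.is_file():
                s3.upload_file(str(f), bucket, str(f.relative_to(hot_dir)))
                f.unlink()
                migrated.append(f.name)
    return migrated
```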
The comparison pages do not serve full-resolution images. We generate WebP thumbnails at three breakpoints (400px, 800px, 1200px) and use responsive srcset attributes. This dropped our average page weight from 12MB to under 800KB.
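Thumbnail generation is plain Pillow. A sketch of the three-breakpoint pipeline (the quality setting and output naming are illustrative):

```python
from PIL import Image

BREAKPOINTS = (400, 800, 1200)

def make_thumbnails(image_path, quality=80):
    """Write WebP variants at each breakpoint width, preserving aspect ratio."""
    img = Image.open(image_path).convert("RGB")
    outputs = []
    for width in BREAKPOINTS:
        if width >= img.width:
            continue  # never upscale
        height = round(img.height * width / img.width)
        out_path = image_path.rsplit(".", 1)[0] + f"_{width}.webp"
        img.resize((width, height), Image.LANCZOS).save(
            out_path, "WEBP", quality=quality
        )
        outputs.append(out_path)
    return outputs
```

The `srcset` attribute then lists the three variants and lets the browser pick based on viewport width.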
The Comparison UI Challenge
Showing two or more AI-generated images side by side sounds simple until you actually build it. Key decisions:
Synchronized zoom: When a user zooms into one image, all comparison images zoom to the same region. This is essential for comparing fine details like hand rendering or text quality.
function syncZoom(sourceCanvas, targetCanvases, region) {
  const { x, y, width, height } = region;
  targetCanvases.forEach(canvas => {
    const ctx = canvas.getContext("2d");
    // Draw the same source region, scaled to fill each target canvas.
    // sourceImage is attached to the canvas when its image finishes loading.
    ctx.drawImage(
      canvas.sourceImage,
      x, y, width, height,
      0, 0, canvas.width, canvas.height
    );
  });
}
Slider comparison: A draggable slider overlay that lets you compare two images pixel-by-pixel. This is built with a single CSS clip-path on the overlay image, updated on mousemove.
Blind mode: Users can evaluate images without knowing which model generated them. This reduces brand bias in community ratings.
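Server-side, blind mode just means shuffling the outputs and stripping model names before the payload leaves the API. A sketch — the payload shape and function name are illustrative:

```python
import random
import string

def anonymize_comparison(entries, seed=None):
    """Shuffle model outputs and replace model names with letters (A, B, C...).

    entries: list of dicts like {"model": "flux", "url": "..."}.
    Returns (public_entries, answer_key); the key stays server-side until
    the user submits a rating.
    """
    rng = random.Random(seed)
    shuffled = entries[:]
    rng.shuffle(shuffled)
    public, key = [], {}
    for label, entry in zip(string.ascii_uppercase, shuffled):
        public.append({"label": label, "url": entry["url"]})
        key[label] = entry["model"]
    return public, key
```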
API Cost Management
Running 200 prompts through 5 image generation APIs weekly is not free. Current approximate costs per run:
| Model | Cost per image | Weekly cost (200 prompts) |
|---|---|---|
| DALL-E 3 | $0.04-0.08 | $8-16 |
| Midjourney | ~$0.05 | ~$10 |
| Stable Diffusion (API) | $0.01-0.03 | $2-6 |
| Flux Pro | $0.05 | $10 |
| Ideogram | $0.04 | $8 |
Total: roughly $40-50 per weekly benchmark cycle. We offset this through affiliate partnerships with the platforms — if a user decides to try a tool after seeing our comparison, we earn a referral fee.
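That total is just the per-image ranges from the table multiplied by 200 prompts; a quick sanity check:

```python
# Per-image cost ranges (low, high) in USD, from the table above.
COSTS = {
    "dalle3": (0.04, 0.08),
    "midjourney": (0.05, 0.05),
    "stable_diffusion": (0.01, 0.03),
    "flux_pro": (0.05, 0.05),
    "ideogram": (0.04, 0.04),
}

def weekly_cost(prompts=200):
    low = sum(lo for lo, _ in COSTS.values()) * prompts
    high = sum(hi for _, hi in COSTS.values()) * prompts
    return low, high  # roughly (38, 50)
```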
Lessons From a Year of Benchmarking
After running automated benchmarks for over a year, some observations:
- Text rendering improved dramatically across all models between mid-2024 and early 2025. It went from "mostly broken" to "usually correct."
- Hand and finger rendering is still the most reliable differentiator between models.
- Speed matters more than people admit. A model that generates in 5 seconds consistently beats a model that takes 30 seconds, even if the slower model has slightly better quality.
- Consistency is underrated. Some models produce amazing results 60% of the time and garbage the other 40%. Others produce good-not-great results 95% of the time. Users prefer the latter.
If you are interested in seeing how the current models stack up, check out the latest benchmarks at aiimagecompare.com.
The benchmark methodology is fully documented on the site. We believe in transparent evaluation.