
Igor Gridel

Originally published at igorgridel.com

No AI Model Can Carry a Creative Project End to End. The HCB Just Proved It.


Subtitle: Contra Labs ran 15 AI models through 93 prompts across 5 creative domains. Professional creatives judged the output. The result: different models win at different stages, and no single tool can carry a project from concept to final.

Excerpt: 93 prompts, 80 evaluation sessions, roughly 15,000 judgments from professional creatives. The Human Creativity Benchmark found no model leads all three creative phases in any domain. The data kills the single-model workflow.

Takeaway: The model you start with should not be the model you finish with. Multi-model pipelines are not a power user hack. They are the only workflow the data supports.



The Human Creativity Benchmark from Contra Labs is the largest structured creative AI test anyone has run. Ninety-three prompts across five creative domains: landing pages, product videos, ad images, brand design, desktop apps. Each domain broken into three phases: ideation, mockup, refinement. Eighty evaluation sessions with professional creatives producing roughly 15,000 judgments.

The headline finding flattens most of what people assume about creative AI tools:

No single model led all three phases in any domain. Not Claude Opus 4.6. Not Gemini 3.1 Pro Preview. Not GPT. Not Veo 3.1. Every model owns one or two phases and falls behind on the rest. The pattern is so consistent across all five domains that it stops looking like model comparison and starts looking like a structural truth about creative work itself.

Here is what happened, domain by domain.

Landing pages. Claude Opus 4.6 dominates ideation. Visual hierarchy and layout coherence on first pass looked intentional, not generated. Designers rated it highest for feeling like someone thought about it. But when the task shifted to mockup (color palettes, typography pairing, grid structure), Gemini 3.1 Pro Preview took over with a 68.9% win rate and the highest Usability scalar of any model in any domain at 4.03 out of 5. Then refinement arrived and Claude reclaimed the lead at 60.0%. The evaluators literally changed their minds as the task changed. It wasn't about which model was better. It was about which model matched the phase.

Product videos. Veo 3.1 generates the strongest concepts from scratch, winning ideation at 61.1%. Kling 3.0 Pro is the most consistent model across all three phases, competitive at every stage (51.4% to 61.1%). Grok Imagine Video climbs to 56.5% in refinement as the task narrows to production fidelity. But here is the genuinely strange finding: Veo 3.1 is the only model in the entire benchmark that gets worse as tasks become more constrained. Give it creative freedom and it thrives. Add constraints and it introduces unwanted changes during iteration. Every other model either stays flat or improves under constraints. Veo alone degrades.

Ad images. GPT Image 1.5 leads ideation and holds the lead through mockup. Then Seedream 4.5 climbs from third to first as the criteria narrow to typography, CTA placement, and contrast. Flux 2 Pro climbs from last place to second by refinement. The early leader gets overtaken. The late-stage specialist materializes.

Brand design. GPT Image 1.5 owns the early phase. Gemini 3 Pro Image takes the middle on composition, lighting, and product accuracy. Then Gemini collapses in refinement and Seedream 4.5 and Flux 2 Pro take over.

Desktop apps. Claude Opus 4.6 at ideation. Gemini 3.1 Pro Preview for prompt adherence and usability in mockup. Claude Opus 4.6 and GPT 5.3 Codex for detail execution in refinement.

The pattern is not subtle. Ideation rewards concept generation and divergent thinking. Mockup rewards prompt adherence and design system fidelity. Refinement rewards precision editing and incremental improvement. These are different skills. Different models have different strengths on these axes. Of course no single model wins all three.

Two more findings deserve their own screen time.

First, the convergence and divergence framework Contra Labs uses to measure agreement. Convergence means evaluators agree. That signals best practices, objective quality, things that work regardless of taste. Divergence means they disagree. That signals taste, preference, the stuff no model can optimize for because there is no right answer.

The most counterintuitive data point here: landing page agreement falls as the work progresses. Kendall's W drops from 0.484 at ideation to 0.293 at mockup and recovers only partially to 0.333 at refinement. Claude's ideation dominance creates initial agreement, then personal taste takes over and evaluators scatter. Meanwhile ad images run the opposite direction: 0.345 to 0.436 to 0.549. As the criteria narrow, people agree more. Different domains move toward or away from consensus as the work deepens.
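For readers who want to compute the same agreement statistic on their own evaluation data, Kendall's W (the coefficient of concordance) has a simple closed form. Here is a minimal NumPy sketch; the evaluator-by-output rank matrix is invented for illustration, not HCB data, and ties are ignored for simplicity.

```python
import numpy as np

def kendalls_w(ranks: np.ndarray) -> float:
    """Kendall's coefficient of concordance for an m x n matrix of rankings:
    one row per evaluator, one column per output, each row a permutation of
    1..n (no tie correction in this simple version)."""
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)                     # column totals R_j
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()   # spread of the totals
    return 12 * s / (m ** 2 * (n ** 3 - n))           # 0 = no agreement, 1 = perfect

# Toy example: 4 evaluators each ranking the same 5 outputs.
ranks = np.array([
    [1, 2, 3, 4, 5],
    [2, 1, 3, 5, 4],
    [1, 3, 2, 4, 5],
    [2, 1, 4, 3, 5],
])
print(round(kendalls_w(ranks), 3))  # ~0.79: these evaluators mostly agree
```

Values near the 0.484 ideation figure mean evaluators are pulling in roughly the same direction; values near 0.293 mean taste has taken over.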

Second, product video scene coherence is net negative across every model tested. Temporal consistency in video generation is still broken at the fundamental level. No model can keep scenes coherent end to end. That is not a model limitation. That is a category limitation.

One more number: 84% of ad images that score usability-5 finish in the top 2. Usability-1 finishes top 2 only 10% of the time. Usability doesn't just predict performance. It is performance.
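If you have the per-session judgment rows, that relationship is a one-line conditional rate. A sketch, assuming a hypothetical table with a 1-to-5 usability column and a top-2 flag; the column names and the toy numbers are placeholders, not the HCB schema.

```python
import pandas as pd

# Hypothetical judgment rows: one per (output, evaluation session).
# "usability" is the 1-5 scalar, "top2" marks a top-2 finish in that session.
judgments = pd.DataFrame({
    "usability": [5, 5, 5, 5, 4, 3, 2, 1, 1, 1],
    "top2":      [1, 1, 1, 0, 1, 0, 0, 0, 0, 0],
})

# P(top-2 finish | usability score) -- the benchmark reports roughly
# 0.84 at usability-5 and 0.10 at usability-1 for ad images.
print(judgments.groupby("usability")["top2"].mean())
```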

So what does any of this mean for someone actually using these tools?

If you are building landing pages, you should be ideating in Claude Opus 4.6, mocking up in Gemini 3.1 Pro Preview, and refining back in Claude. If you are producing product videos: Veo 3.1 for concepts, Kling 3.0 Pro for the middle, Grok Imagine Video for the final pass. If you are running ad images: GPT Image 1.5 for the first two phases, Seedream 4.5 for refinement.
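One way to make that concrete: treat the phase winners above as a routing table and pick the model per (domain, phase) instead of per project. A minimal sketch; the model slugs and function names are illustrative, not official API identifiers or anything Scopeful ships.

```python
# Phase-based routing built from the HCB phase winners cited above.
# Slugs are informal labels, not provider API model names.
PHASE_ROUTES: dict[tuple[str, str], str] = {
    ("landing_page", "ideation"):    "claude-opus-4.6",
    ("landing_page", "mockup"):      "gemini-3.1-pro-preview",
    ("landing_page", "refinement"):  "claude-opus-4.6",
    ("product_video", "ideation"):   "veo-3.1",
    ("product_video", "mockup"):     "kling-3.0-pro",
    ("product_video", "refinement"): "grok-imagine-video",
    ("ad_image", "ideation"):        "gpt-image-1.5",
    ("ad_image", "mockup"):          "gpt-image-1.5",
    ("ad_image", "refinement"):      "seedream-4.5",
}

def pick_model(domain: str, phase: str) -> str:
    """Return the phase winner for a domain; fall back to that domain's ideation pick."""
    return PHASE_ROUTES.get((domain, phase), PHASE_ROUTES[(domain, "ideation")])

print(pick_model("landing_page", "mockup"))  # gemini-3.1-pro-preview
```

The point is not the dictionary. It is that the routing key includes the phase, which is exactly what a single-model workflow throws away.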

One model per project is the wrong shape. Model-switching is not a power user optimization. It is the only workflow the data supports at every level.

This is why I am building Scopeful. There are dozens of creative AI tools, each with its own pricing, its own model lineup, and, most importantly, its own strengths at different phases of the creative process. Nobody has built a simple way to figure out which tool to use at which stage. Price comparison sites exist. Phase-based quality comparison doesn't. The HCB data makes the case clearer than I could make it myself: the single-model era is already over. The question is not which tool is best. The question is which tool for which phase.

The Scopeful waitlist is open at scopeful.org. I also write about tool comparisons, creative AI strategy, and benchmark data as it drops at igorgridel.com.

The HCB findings are not academic. They are operational. The model you start with should not be the model you finish with. The data is in. The workflow has to change.
