
Igor Gridel

Originally published at igorgridel.com

The handoff is the workflow

For about a year, I have been running the same quiet experiment in my own work: keep swapping AI tools until one of them clearly wins.

It never happens.

A model that opens up beautifully in ideation chokes the moment I ask it to honor a spec. A model that nails the spec produces work so averaged it could have been made by anyone. I keep stacking subscriptions, hoping the next release will collapse the stack into one. It does not. After a while, I started to suspect the question itself was wrong.

Last month, Contra Labs published the Human Creativity Benchmark (HCB): around 15,000 individual judgments across five creative domains, 93 prompts, and 80 sessions. They have a commercial stake in this work, so read their numbers as data, not gospel. But the central finding maps almost exactly to what I have been seeing in my own studio. No model in the study led all three phases (ideation, mockup, refinement) in any domain. Not one. Leadership shifts as the work moves.

This is not a tooling problem. It is a workflow problem. Once you see it that way, almost everything about how indie operators evaluate AI changes.

The trait that wins the demo loses the deadline

The cleanest case in the HCB data is Veo 3.1.

In ideation, evaluators rated it positively for realism: a net sentiment of +6. By refinement, the same model on the same realism axis dropped to -3. Same tool, opposite verdict, depending on where in the workflow you ask it to perform.

The Contra writeup phrases it well: what makes Veo 3.1 excellent at generation, its creativity, is also what makes it unreliable for refinement. The model's openness, the thing you want when you are exploring, becomes noise when you are trying to land a specific frame.

This is the pattern I keep tripping over. Generation strength and refinement reliability are not the same competency. They are often opposite competencies. The trait that wins the demo loses the deadline.

If you have ever loved a tool in a sandbox and quietly stopped using it on real client work, you already know this in your hands. You just did not have language for it.

Two axes, not one

The other thing the benchmark surfaces, and the part I find most useful as an operator, is that "good model" is not a single dimension.

There are at least two:

  • Best-practice adherence: how well the model honors verifiable constraints. Typography, contrast, CTA placement, brand spec, prompt fidelity.
  • Taste-steerability: how well the model responds when you push it past the safe average toward a specific point of view.

These are orthogonal. A model can be strong on one and weak on the other. Most current models, in the HCB sample at least, sit firmly in a single quadrant: strong on one axis, unremarkable on the other.

Contra frames it as creative partner versus opinionated engine. Some models behave like a partner: high latitude, weaker on spec compliance, useful when you do not yet know what you want. Others behave like an engine: strong on spec, narrower in range, useful when you know exactly what you want and need it executed without drift.

You can probably guess which models in your stack play which role. I am not naming them as buy-now recommendations because the version numbers in the study are forward-dated and the leaderboard will look different by the time you read this. The point is the shape, not the names.

The shape is this: if you grade tools on one axis, you will pick the wrong tool half the time. You will pick the partner when you needed the engine, or the engine when you needed the partner, and then blame your prompt.

Why your outputs feel flat

There is a phrase from the report I keep coming back to: "the frustration with AI tools is not that they produce bad work, but that they produce undifferentiated work."

That is the whole problem in one line.

Ask a single model to carry an entire creative process and it converges on safe averaged aesthetics. It defaults to the median of its training distribution. Every output ends up technically fine and recognizably nothing. A model can be technically excellent and creatively flat, and most of them are, most of the time, when you use them like a one-shot.

The fix is not a better prompt. The fix is a different unit of work.

The handoff is the workflow

Here is the reframe I am building my own systems around.

The unit of advantage is not the tool. It is the handoff between tools. The model that explores does not have to be the model that executes, and trying to make it so is what produces undifferentiated work.

A phase-aware workflow looks something like this:

Ideation phase. Use the partner. Reward range, surprise, divergence. Do not grade the output against a spec yet. You are looking for direction, not delivery. Save more than you think you need.

Mockup phase. Switch tools. The criteria change. Now you are testing whether the chosen direction can survive contact with constraints: layout, typography, brand fit, usability. The HCB data shows usability behaving as a hard gate in ad images. Outputs that scored 5 on usability finished top-two 84% of the time. Score-1 outputs only 10%. Visual quality could not rescue an unusable layout. Plan accordingly.

Refinement phase. Switch again, often back. By refinement, the field compresses. Models cluster. Evaluator agreement on visual appeal goes up because everything looks fine, and now the question is taste, not quality. This is where the engine, the spec-honoring model, tends to outperform the partner.

You can run this with two models. You can run it with four. The number matters less than the discipline of admitting that the criteria change between phases, and so should the tool.
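To make the shape concrete, here is a minimal sketch of what phase-aware routing can look like if you want to pin it down in code. Everything in it is illustrative: the phase names follow the three-phase scaffold above, the model labels are placeholders rather than recommendations, and call_model stands in for whatever client you already use.

```python
# Illustrative sketch of a phase-aware stack. The phase names mirror the
# three-phase scaffold above; the model labels are placeholders, not picks.
PHASE_STACK = {
    "ideation":   {"model": "partner-model", "grade_on": ["range", "surprise", "divergence"]},
    "mockup":     {"model": "engine-model",  "grade_on": ["layout", "typography", "brand fit", "usability"]},
    "refinement": {"model": "engine-model",  "grade_on": ["spec fidelity", "taste"]},
}

def run_phase(phase, brief, call_model):
    """Route one unit of work to the tool assigned to this phase.

    `call_model(model_name, brief)` is a stand-in for whatever client you
    already use. The point is that the tool and the grading criteria
    change together when the phase changes.
    """
    config = PHASE_STACK[phase]
    output = call_model(config["model"], brief)
    return {
        "phase": phase,
        "model": config["model"],
        "criteria": config["grade_on"],
        "output": output,
    }
```

The dictionary is the whole idea: which tool runs is decided by the phase, not by habit or by whichever subscription renewed last.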

A caveat I want to be honest about. The three-phase model is a study scaffold, not how creative work actually unfolds. Real work loops, doubles back, abandons phases entirely. Treat the structure as a thinking tool, not a workflow law. The goal is not to march in three steps. The goal is to stop asking one model to do three different jobs.

What this changes for solo operators

If you run a small team or you are a one-person shop with a stack of subscriptions, three things follow.

Stop grading tools on one axis. When you evaluate a new model, ask separately how it does on best-practice adherence and on taste-steerability. Then ask which phase of your work it would actually serve. Most tools are good somewhere and mediocre everywhere else. That is fine, if you know where the somewhere is.

Design for the handoff. The expensive part of an AI workflow is not the generation. It is the export, format, and re-import friction between tools. Pick stacks where the seams are clean. A great partner model that produces outputs your engine cannot ingest is not a partner. It is a tax.
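One cheap way to keep those seams clean is to have every phase export the same small manifest, so the next tool ingests files and context without manual re-formatting. A sketch, and the format here is my own invention, not a standard:

```python
import json
from pathlib import Path

def write_handoff(phase, files, notes, out_dir="handoffs"):
    """Write a tiny manifest describing what this phase produced.

    Hypothetical format: a JSON file listing the exported assets and the
    working notes the next phase needs. The value is not the schema, it is
    that every handoff looks the same no matter which tool produced it.
    """
    Path(out_dir).mkdir(exist_ok=True)
    manifest = {"phase": phase, "files": files, "notes": notes}
    path = Path(out_dir) / f"{phase}.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```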

Stop trying to consolidate to one subscription. I know this is the opposite of every productivity essay this year. But the honest read of the data, and of my own desk, is that creative workflows are not single-model problems. The cost of running two right tools is almost always lower than the cost of forcing one wrong tool through a phase it cannot handle.

The new question

For most of the last two years, the operator question has been: which AI model is best?

That question stops working as soon as everything clears the quality bar. The HCB authors put it directly: once every output is good enough, designers stop evaluating against standards and start evaluating against taste. They diverge not because of disagreements on quality, but because quality is no longer the question.

The better question, the one I am running my own work on now, is the one Contra ends with:

Good for whom, at what stage, and toward what end?

That is the question a workflow answers. A tool cannot.


If you found this useful, I write a newsletter about workflows like this one: what is working in my own builds and what quietly is not. You can subscribe at igorgridel.com. If you want to see the actual phase stack I am running right now, with the tools named, that lives on Patreon.
