Generative UI Is Three Things. Only One Ships.

#ai #ux #uxdesign #ui

Google shipped a generative UI experiment in November 2025 that matched a human-designed interface in roughly half the cases it was tested against. In the other half, it produced an artifact the way an unsteamed dumpling is food: recognizably the right shape, not actually edible. The system could take a minute or more to render a single page and produced different output on every refresh. Most of the takes I read framed the gap as a "more compute will fix it" problem. It won't.

Google's own scoring is honest about it: its implementation earned an Elo rating of 1736.2, a strong preference over every other output format, and lost only to human experts, whom it merely matched about half the time. That's an impressive result aimed at the wrong target.

Notice which half gets the spotlight. "Comparable to a human expert in half the cases" is the generous read; the inverse is that the other half come out worse. And screens don't arrive one at a time; they chain into workflows. Probability compounds in the direction nobody's gut expects. At an independent coin flip per screen, a four-step flow renders cleanly about one time in sixteen: roughly 6% of users get through without a visible failure, and the other 94% hit at least one bad screen. Stretch it to six steps and 98% hit one. The demo reads like a coin toss; the workflow reads like a near-certain stumble. People hear "50%" and picture a wash. It isn't. It's a near-guarantee of failure, dressed up as a fair bet.
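The arithmetic is easy to check. A minimal sketch, assuming every screen succeeds independently and at the same rate (an idealization; the function names are mine, not anything from Google's write-up):

```typescript
// Probability that every screen in a flow renders cleanly, assuming each
// screen succeeds independently with the same probability (an idealization).
function cleanRunProbability(perScreenSuccess: number, steps: number): number {
  return Math.pow(perScreenSuccess, steps);
}

// Probability that a user hits at least one bad screen somewhere in the flow.
function atLeastOneFailure(perScreenSuccess: number, steps: number): number {
  return 1 - cleanRunProbability(perScreenSuccess, steps);
}

console.log(cleanRunProbability(0.5, 4)); // 0.0625, about 6%
console.log(atLeastOneFailure(0.5, 4));   // 0.9375, about 94%
console.log(atLeastOneFailure(0.5, 6));   // 0.984375, about 98%
```

If failures are correlated in practice (the same prompt tends to fail the same way), the real numbers will differ, but the direction of the compounding does not.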

"Generative UI" is one name stretched across at least three distinct technical approaches, and the conversation suffers for the conflation. One is the demo everyone admires and nobody can ship. One is a modest improvement on the first that still doesn't work. The third is tame enough that most people won't file it under generative UI at all, and it's the one that wins.

One distinction up front, because the rest depends on it: generating code is not the same as generating interfaces. Tools like v0 and Claude Code generate UI, but the output is a static artifact, a box you build once and reuse. Generative UI in the sense that matters here is a runtime behavior, an interface that adapts to context and input as you use it. It's the difference between asking an AI to build you a box for your camping gear and having an AI watch you nearly drop an armful of cans, decide you need a backpack, sew one, and hand it over. The second is a marvel, and also a parade of assumptions, each one a fresh chance to be wrong, which is roughly why we evolved language first. Being told what someone needs beats inferring it from watching them fumble.

Three Variants Wearing One Name

The most ambitious variant generates the interface from scratch on every load. The model takes a prompt, writes HTML, CSS, and JavaScript, and the browser renders whatever comes back. It has the highest ceiling (anything could be in the UI), and it's the version Google demoed. It also matched human-designed interfaces at only a coin-flip rate, took a minute or more per page, burned tokens by the fistful, and rendered differently on every refresh. Make the model ten times bigger and faster and it's still stochastic, still expensive, still inconsistent. Compute doesn't change the shape of the problem.

The middle variant constrains the model to a finite component library. Instead of writing layout code, the model emits a serialized payload, usually JSON, naming which existing components to render, with what props, in what arrangement. The client reads the payload and assembles the page. This is a genuine improvement: components stay consistent, accessibility doesn't regress on every refresh, and the model can't ship anything the design system forbids. It also has real momentum. Google open-sourced A2UI in December 2025 to standardize exactly this kind of agent-emitted interface, and a wave of commercial tools now lets non-technical staff assemble forms and reports inside a sanctioned component set. Boring, mostly. But boring is what most corporate software is, and the CIO gets at least some say over the standards — enough to be a little less likely to wake up at 3 AM already sweating before the phone finishes its first ring.
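To make the payload idea concrete, here is a minimal sketch of a client-side check that a model-emitted tree only uses sanctioned components and props. The payload shape, the component names, and `validatePayload` are all hypothetical; this is not the A2UI format or any particular tool's schema.

```typescript
// Hypothetical payload shape: the model names components, it never writes markup.
type UiNode = {
  component: string;
  props: Record<string, unknown>;
  children?: UiNode[];
};

// The sanctioned component set: component name mapped to the props it accepts.
const allowedComponents = new Map<string, string[]>([
  ["Stack", ["gap"]],
  ["TextField", ["label", "required"]],
  ["Button", ["label", "variant"]],
]);

// Walk the tree and collect every violation of the design system.
function validatePayload(node: UiNode, path = "root"): string[] {
  const errors: string[] = [];
  const allowedProps = allowedComponents.get(node.component);
  if (allowedProps === undefined) {
    errors.push(`${path}: unknown component "${node.component}"`);
  } else {
    for (const prop of Object.keys(node.props)) {
      if (allowedProps.indexOf(prop) === -1) {
        errors.push(`${path}: "${node.component}" does not accept prop "${prop}"`);
      }
    }
  }
  (node.children ?? []).forEach((child, i) => {
    errors.push(...validatePayload(child, `${path}.children[${i}]`));
  });
  return errors;
}

// Example model output with one component that is not in the library.
const modelOutput: UiNode = {
  component: "Stack",
  props: { gap: 8 },
  children: [
    { component: "TextField", props: { label: "Email", required: true } },
    { component: "Carousel", props: {} },
  ],
};

console.log(validatePayload(modelOutput)); // one error, for the Carousel at root.children[1]
```

A check like this enforces what may appear. It says nothing about field order or which action is primary, so a payload can pass validation and still differ from the one the user saw a minute ago.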

The catch is that the middle variant still hands the model authority over the whole screen. Every button, every input, every layout call is the model's. The variance that made the full-code version unreliable just moves to the component-selection axis: refresh and you might get a different field order, a different empty state, a different primary action. That's the same usability problem with extra steps. Add competing, half-formed standards and the fact that non-designers still can't organize an information flow, and this variant stalls in the near term even where the plumbing works.

The third variant is where products will actually land. Keep the core interface deterministic: designed by a human, shipped as fixed components, predictable across visits. Then carve out a bounded surface where dynamic widget selection happens. The design team builds a finite library of widgets (twenty, thirty, fifty) and a layer above them (an LLM, a plain rules engine, often both) decides which to surface in which context. The model isn't laying out the page. It's picking from a menu. That surface is often supplemental, but it doesn't have to be: it can be the primary one, a form that grows its own fields as it works out which data is still missing.
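"Picking from a menu" can be enforced in a few lines. A sketch under assumed names: the widget ids are invented, and `chooseWidget` stands in for whatever glue sits between your rules engine and your model. The rules decide what is eligible; the model's suggestion is accepted only if it is on that shortlist.

```typescript
// Hypothetical widget ids from a design-team-owned library.
const widgetMenu = ["missing-passport", "seat-upgrade", "hotel-near-venue"] as const;
type WidgetId = (typeof widgetMenu)[number];

function isWidgetId(value: string): value is WidgetId {
  return (widgetMenu as readonly string[]).indexOf(value) !== -1;
}

// Rules narrow the menu to what is eligible right now; the model only picks
// within that shortlist. Anything off-menu or ineligible falls back to the
// first eligible widget, or to showing nothing at all.
function chooseWidget(modelSuggestion: string, eligible: WidgetId[]): WidgetId | null {
  if (isWidgetId(modelSuggestion) && eligible.indexOf(modelSuggestion) !== -1) {
    return modelSuggestion;
  }
  return eligible[0] ?? null;
}

console.log(chooseWidget("seat-upgrade", ["seat-upgrade", "hotel-near-venue"])); // "seat-upgrade"
console.log(chooseWidget("<div>free pizza</div>", ["hotel-near-venue"]));        // "hotel-near-venue"
console.log(chooseWidget("seat-upgrade", []));                                   // null
```

The model's raw output never reaches the renderer. The worst it can do is lose its vote.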

The 80/20 of Music

Effective interfaces follow the balance that effective music does. Most of a song is familiar: the beat you can clock, the progression you can predict, the structure you expect. A thin sliver is novel: the unexpected modulation, the dropped chorus, the turn you didn't see coming. The familiar parts are why you can listen. The novel parts are why you keep listening. Tilt the ratio too far either way and the song fails: too much repetition bores, too much novelty becomes unlistenable.

Interfaces work the same way. Most of what a user touches needs to be predictable enough that they build muscle memory and stop seeing the chrome. The smaller novel portion — the smart default, the contextual helper, the just-in-time widget — is what makes the product feel intelligent. Invert the ratio and you get generative UI in the full-code sense: little is predictable, and the user relearns the interface on every visit. The model can make a good local call every single time and the experience still fails, because consistency is a property of the sequence of interactions, not any one of them.

The third variant respects the ratio, but the ratio is about the whole product, not each screen. The deterministic core is the 80%. The dynamic surface is the 20%. Some individual screens will lean heavily dynamic, and that's fine; what matters is that the parts the user depends on stay put while the model gets to be clever in a bounded space where being wrong is recoverable.

The Travel App Test

Travel booking is the example I keep returning to, because the assumptions are so visible. Kayak, Travelocity, the airline sites: they're all built around one traveler or one group, one origin, one destination, maybe a round trip with a couple of legs. That covers most trips. It also falls apart the moment a trip doesn't fit the mold, and the interfaces have no graceful way to bend.

Picture a bachelorette party. Six friends converging on Vegas from four cities, all needing to land before 8 PM Friday for Penn & Teller tickets. Two have hard work deadlines pinning their return dates; the others have more flexibility. Today you solve this with six or more browser tabs, a spreadsheet, and a group chat full of screenshots. Nobody ships an interface for it: it's NP-shaped and rare enough not to be worth hand-designing.

Here's what generative UI actually buys you, and it isn't what the demos suggest. You don't design an interface for the bachelorette case. You design an origin widget, a destination widget, a lodging widget, a ground-transport widget, each one independent, each with sensible defaults, and you give the model enough context to know when to flex them. The origin widget either grows a state that accepts up to ten cities, or the model drops a second origin widget on the canvas for parallel tracking. Either way, the origin widget is the origin widget. It neither knows nor cares whether the trip is one person round-tripping to Vegas or six people scattering home to four cities. Build enough flexibility into each component, hand the model good defaults and enough context to choose between them, and the permutations resolve themselves. You were never going to hand-design all of them anyway.
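A sketch of the first option, an origin widget whose state simply grows. The type, the function names, and the airport codes are made up for illustration; the cap of ten comes from the paragraph above.

```typescript
// Hypothetical state for an origin widget: one traveler or many, same widget.
type OriginState = {
  cities: string[]; // one code per departure point
  maxCities: number;
};

function createOriginState(firstCity: string, maxCities = 10): OriginState {
  return { cities: [firstCity], maxCities };
}

// Returns a new state; ignores duplicates and refuses to exceed the cap.
function addOriginCity(state: OriginState, city: string): OriginState {
  if (state.cities.indexOf(city) !== -1) return state;
  if (state.cities.length >= state.maxCities) {
    throw new Error(`origin widget holds at most ${state.maxCities} cities`);
  }
  return { ...state, cities: [...state.cities, city] };
}

// The widget's mode is derived from its data, not chosen by a separate layout.
function originMode(state: OriginState): "single" | "multi" {
  return state.cities.length > 1 ? "multi" : "single";
}

// Four departure cities for the bachelorette party (codes invented here).
let partyOrigins = createOriginState("AUS");
for (const city of ["ORD", "SEA", "BOS"]) {
  partyOrigins = addOriginCity(partyOrigins, city);
}
console.log(originMode(partyOrigins), partyOrigins.cities.length); // "multi" 4
```

The solo round trip and the six-person convergence go through the same functions. Nothing about the widget had to know the party existed.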

The same machinery handles the far more common case, the business loop trip, A → B → C → D → A. The core interface shows the loop. A bounded panel runs the pigeonhole logic: which legs have a rental car, which don't, which hotels are confirmed, which are still open. Twenty-five holes to fill, sixteen filled, nine left, and the model's job is to rank the nine and surface the most important one next. The widgets that render them are already designed.
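The pigeonhole panel reduces to "filter the open holes, rank them, show the top one." A minimal sketch, where the slot shape and the `urgency` score are assumptions; in practice that score might come from rules, a model, or both.

```typescript
// One "hole" in the trip: something a leg needs that is either filled or open.
type TripSlot = {
  leg: string;                        // e.g. "A-B"
  kind: "flight" | "hotel" | "car";
  filled: boolean;
  urgency: number;                    // hypothetical score from rules or a model
};

// Open slots, most urgent first. Ties keep their itinerary order.
function rankOpenSlots(slots: TripSlot[]): TripSlot[] {
  return slots
    .filter((slot) => !slot.filled)
    .map((slot, index) => ({ slot, index }))
    .sort((a, b) => b.slot.urgency - a.slot.urgency || a.index - b.index)
    .map((entry) => entry.slot);
}

// The single slot the bounded panel should surface next, if any remain.
function nextSlotToSurface(slots: TripSlot[]): TripSlot | null {
  return rankOpenSlots(slots)[0] ?? null;
}

const loopTrip: TripSlot[] = [
  { leg: "A-B", kind: "flight", filled: true, urgency: 0 },
  { leg: "A-B", kind: "car", filled: false, urgency: 2 },
  { leg: "B-C", kind: "hotel", filled: false, urgency: 5 },
  { leg: "C-D", kind: "hotel", filled: false, urgency: 5 },
];

console.log(nextSlotToSurface(loopTrip)); // the B-C hotel: top urgency, earliest leg
```

Only the ranking is up for debate here. The filtering is bookkeeping, and the widget that renders a hotel slot is the same one whichever slot wins.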

Decompose the Permutation, Not the Interface

I shipped a version of this at Reva. Credit and criminal screening reports come back in thousands of permutations, and we had to represent that range while telling both the applicant and the property manager, in plain language, what a given result actually meant, then drive different workflows, validation, and escalation off it. Designing a layout for every combination was never on the table.

So I didn't. I decomposed the screening result into about thirty key metrics. I decomposed the interface into a dozen regions: headline, subheading, summary, a credit line item, and so on. Then each region into the five to twenty text variants that could fill it. The output was a clean, human-readable JSON payload of thirty-odd metrics that drove the i18n keys, and those keys covered every combination: clean credit with a criminal flag, thin file with nothing on it, all of it. I never solved the whole permutation at once. I solved small, independent slices, which is the only reason I could prove the result was mathematically complete. In the end it was a set of rules. Decompose the information problem cleanly enough and most of the apparent complexity was never really there — Shannon's fingerprints are on every logistics problem if you look for them.
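A toy version of that decomposition, with two made-up metrics and three made-up regions standing in for the thirty metrics and dozen regions above. Every field and key name here is invented for illustration, not taken from the Reva system. The point it tries to show: each region is its own small function from metrics to an i18n key, so each can be checked for completeness on its own.

```typescript
// Two hypothetical metrics; the payload described above had about thirty.
type ScreeningMetrics = {
  creditFile: "clean" | "derogatory" | "thin";
  criminalFlag: boolean;
};

// Each region is solved alone: a total function from metrics to one i18n key.
const regionRules = {
  headline: (m: ScreeningMetrics): string => {
    if (m.criminalFlag || m.creditFile === "derogatory") return "headline.needs_review";
    return m.creditFile === "thin" ? "headline.limited_history" : "headline.all_clear";
  },
  creditLine: (m: ScreeningMetrics): string => `credit.${m.creditFile}`,
  criminalLine: (m: ScreeningMetrics): string =>
    m.criminalFlag ? "criminal.flagged" : "criminal.clear",
};

type RegionName = keyof typeof regionRules;

function resolveKeys(m: ScreeningMetrics): Record<RegionName, string> {
  return {
    headline: regionRules.headline(m),
    creditLine: regionRules.creditLine(m),
    criminalLine: regionRules.criminalLine(m),
  };
}

// Completeness by enumeration: 3 credit states x 2 flag states = 6 inputs.
const allResults: Record<RegionName, string>[] = [];
for (const creditFile of ["clean", "derogatory", "thin"] as const) {
  for (const criminalFlag of [true, false]) {
    allResults.push(resolveKeys({ creditFile, criminalFlag }));
  }
}
console.log(allResults.length); // 6

console.log(resolveKeys({ creditFile: "clean", criminalFlag: true }));
// headline.needs_review, credit.clean, criminal.flagged
```

With thirty metrics you can't enumerate the full product, but you don't need to. If each region reads only a few metrics, you enumerate each region's small input space separately.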

Why This De-Risks the Bad Days

The third variant carries a quiet benefit the ambitious ones don't. When the model gets it wrong — and it will, routinely — the damage stays inside a surface the user can ignore. The core interface keeps working, defaults stay sane, and escape hatches are easy to design because the design team owns the chrome.

In the full-code variant, a model mistake is a broken page. In the component-everywhere variant, it's a confusing layout the user has to decode. In the bounded variant, it's an oddly chosen widget the user can dismiss (or correct, which feeds a signal back to sharpen the next call) while their actual task runs on untouched. Same model, same error rate, wildly different blast radius.

What to Actually Build

If I were building generative UI into a product today, the work would be concrete. Build a finite widget library, twenty to fifty widgets covering the workflow components your application actually has. The trick is clean separation of concerns: if the origin widget only ever captures origin, it contributes exactly one slice to the information problem at the heart of nearly every logistics task. Define what data each widget reads, what states it can hold, what context makes it relevant. Then build the selection layer (sometimes an LLM, sometimes a rules engine, usually both) that picks widgets based on the live context: what the user is doing, what they've finished, what's still missing.
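One way to write that contract down, as a sketch. The context fields, the three widgets, and `relevantWidgets` are all illustrative names, and this shows only the rules-engine half of the selection layer.

```typescript
// Hypothetical live context the selection layer sees.
type TripContext = {
  travelers: number;
  hasOrigin: boolean;
  hasLodging: boolean;
  destinationIsRemote: boolean;
};

// The contract each widget declares: what it reads, which states it can hold,
// and when it is relevant. Names and fields are illustrative.
type WidgetSpec = {
  id: string;
  reads: (keyof TripContext)[];
  states: string[];
  relevantWhen: (ctx: TripContext) => boolean;
};

const widgetLibrary: WidgetSpec[] = [
  {
    id: "origin",
    reads: ["hasOrigin", "travelers"],
    states: ["single", "multi"],
    relevantWhen: (ctx) => !ctx.hasOrigin,
  },
  {
    id: "lodging",
    reads: ["hasOrigin", "hasLodging"],
    states: ["empty", "confirmed"],
    relevantWhen: (ctx) => ctx.hasOrigin && !ctx.hasLodging,
  },
  {
    id: "ground-transport",
    reads: ["destinationIsRemote"],
    states: ["empty", "booked"],
    relevantWhen: (ctx) => ctx.destinationIsRemote,
  },
];

// Rules-engine half of the selection layer: deterministic, cheap, testable.
// A model could reorder this shortlist, but it cannot add to it.
function relevantWidgets(ctx: TripContext, library: WidgetSpec[] = widgetLibrary): string[] {
  return library.filter((w) => w.relevantWhen(ctx)).map((w) => w.id);
}

console.log(relevantWidgets({
  travelers: 6, hasOrigin: true, hasLodging: false, destinationIsRemote: true,
})); // ["lodging", "ground-transport"]
```

Declaring `reads` next to `relevantWhen` is a design choice rather than a requirement. It makes it obvious when a widget's relevance rule starts depending on data the widget was never supposed to touch.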

The same widgets pay off twice, because they work in a chat context as readily as a panel. The user who wants a traditional UI gets widgets in a side panel; the user who wants to type gets the same widgets inline in the conversation. One library, one data contract, two surfaces, and the design investment carries across interaction modes instead of being rebuilt for each.
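In code, "one library, two surfaces" mostly means the widget is data and the surface is a thin wrapper. A deliberately small sketch: the types are hypothetical, and plain strings stand in for whatever your UI framework actually renders.

```typescript
// One widget instance, described as data. Hypothetical shape.
type WidgetInstance = { id: string; title: string; fields: string[] };
type Surface = "panel" | "chat";

// Both surfaces consume the same instance; only the framing differs.
function renderWidget(widget: WidgetInstance, surface: Surface): string {
  const body = widget.fields.map((field) => `[${field}]`).join(" ");
  return surface === "panel"
    ? `<aside data-widget="${widget.id}">${widget.title}: ${body}</aside>`
    : `(inline ${widget.id}) ${widget.title}: ${body}`;
}

const lodgingWidget: WidgetInstance = {
  id: "lodging",
  title: "Where are you staying?",
  fields: ["check-in", "check-out", "guests"],
};

console.log(renderWidget(lodgingWidget, "panel"));
console.log(renderWidget(lodgingWidget, "chat"));
```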

The full-dynamic dream — an interface generated from scratch every time — ignores why interfaces work at all. Consistency isn't a symptom of stale design; it's the point. The right ratio is most of the song you recognize and a chorus that surprises you. Get it right and generative UI is genuinely useful. Get it wrong and you're shipping an unsteamed dumpling.