Ryan Brinn

Posted on Jun 29

The case for letting your UI tell you what to build next

#webdev #ai #programming

There's a pattern I keep coming back to that sits somewhere between "chat interface" and "traditional app UI," and I've been trying to figure out whether it's actually interesting or just a cool demo that falls apart in production. I wrote a longer whitepaper on this, but I wanted to share the rough edges of the thinking here and see what resonates — or where I'm wrong.

The problem I kept running into

I've worked on software for over 20 years, and what's surprising is how much we still get wrong: Pendo found that 80% of software features are rarely or never used (the Standish Group puts it at 64%). Either way, it's a lot of engineering effort going nowhere.

The frustrating thing is this isn't a "bad engineers" problem. I think it's a signal problem. Teams build the wrong things because they have no reliable feedback loop between what users actually need and what ends up on the roadmap. Roadmaps get driven by whoever is loudest — escalated tickets, squeaky customers, whoever had the best slide.

Wouldn't it be nice if the interface itself could tell you what to build next?

That led me into Generative UI territory.

What I mean by Generative UI (and what I don't)

Here's the pattern I'm working with: an AI layer sits alongside a traditional UI — not replacing it — and handles requests the static interface can't. The key constraint: the AI generates structured data that selects from a fixed set of pre-built components. It does not generate arbitrary code or markup.

The most concrete production implementation of this is Vercel's AI SDK Generative UI, which originated from v0.dev and was open-sourced around 2024. The LLM calls predefined "tools," and tool results are passed to corresponding pre-built React components for rendering — the LLM never touches the component code itself. There's also a 2025 arXiv paper formalizing the framework and research from CHI '26 on semantic guidance for closing the gap between intent and generated output. The academic lineage goes back further through Ink & Switch's malleable software work and Horvitz's mixed-initiative UI research, but Vercel is where I'd point anyone who wants to see it running in production.

That schema-vs-code distinction matters more than it might seem, and I'll come back to it.

The "alongside" part is equally important. I'm not interested in the "everything is a chatbot now" approach. I'm interested in UIs where 95% of interactions still go through normal, fast, deterministic code — and the AI layer activates only when the user hits a ceiling. Two ways that activation can happen: the user explicitly reaches for it (they hit a wall and ask for something the UI doesn't support), or the system notices something and surfaces UI proactively, without being asked.

I've been spending most of my thinking time on the proactive case, because it's weirder and less explored.

The cost problem reframed

I think it's safe to say that GenUI is structurally more expensive than a conventional web stack. Every request that passes through an LLM inference layer costs orders of magnitude more than a normal database query and render. That gap narrows as model costs fall, but it never closes.

But I think people compare against the wrong baseline. The real comparison isn't "LLM inference vs. a database query." It's "LLM inference costs (cents per request) vs. engineering waste costs (hundreds of thousands of dollars per year building things nobody uses)."

If the AI layer generates demand signal — if every unresolvable user request becomes a labeled data point about what users actually want — then the inference costs are buying something that didn't exist before: a continuous, behavior-grounded product roadmap. Not "GenUI is cheap" (it isn't), but "GenUI's costs show up on the infrastructure budget while its savings show up in planning efficiency and reduced waste."
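To make the baseline comparison concrete, here is a rough back-of-the-envelope sketch. Every input is a hypothetical placeholder (request volume, per-request cost, annual waste, and the share of waste the demand signal helps avoid), so the output illustrates the shape of the argument, not a measured result.

```typescript
// All inputs are hypothetical placeholders for illustration, not measurements.
interface CostInputs {
  aiRequestsPerDay: number;      // requests that actually reach the LLM path
  costPerRequestCents: number;   // inference cost per such request, in cents
  annualWasteUsd: number;        // yearly engineering spend on rarely used features
  wasteReductionPercent: number; // share of that waste the demand signal helps avoid
}

interface CostSummary {
  inferenceUsd: number;
  savingsUsd: number;
  netUsd: number;
}

// Compares yearly inference spend against yearly waste avoided.
function annualCostSummary(c: CostInputs): CostSummary {
  const inferenceUsd = (c.aiRequestsPerDay * c.costPerRequestCents * 365) / 100;
  const savingsUsd = (c.annualWasteUsd * c.wasteReductionPercent) / 100;
  return { inferenceUsd, savingsUsd, netUsd: savingsUsd - inferenceUsd };
}

const example = annualCostSummary({
  aiRequestsPerDay: 2000,
  costPerRequestCents: 2,
  annualWasteUsd: 300000,
  wasteReductionPercent: 10,
});
// example.inferenceUsd === 14600, example.savingsUsd === 30000, example.netUsd === 15400
```

With these made-up inputs the layer pays for itself once it helps avoid a bit under 5% of the assumed waste. The real question is whether that reduction is achievable, which the arithmetic can't answer.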

Whether that's convincing probably depends a lot on your org structure. I'm genuinely curious how others see it.

The graduation pipeline

This is the piece that made the whole thing feel deployable rather than theoretical.

Don't think of GenUI as a permanent runtime. Think of it as a temporary handling layer with a built-in exit ramp. The lifecycle: the AI layer handles a novel request and logs it ({intent, parameters, timestamp}). Once that intent crosses some threshold — say, 50+ occurrences over 90 days — it becomes a graduation candidate. The team ships a real hardcoded view. The AI layer routes that intent to the new view permanently. Inference cost for that pattern: zero, forever.
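A minimal sketch of the candidate check, assuming intents are already normalized to stable string labels and the log fits in memory. The type and function names are hypothetical, and the 50-in-90-days defaults are just the example threshold from above.

```typescript
// Hypothetical log record following the {intent, parameters, timestamp} shape.
interface IntentLogEntry {
  intent: string;
  parameters: Record<string, unknown>;
  timestamp: number; // milliseconds since the Unix epoch
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Returns the intents seen at least `minCount` times in the last `windowDays` days.
function graduationCandidates(
  log: IntentLogEntry[],
  now: number,
  minCount = 50,
  windowDays = 90,
): string[] {
  const cutoff = now - windowDays * DAY_MS;
  const counts = new Map<string, number>();
  for (const entry of log) {
    // Ignore entries outside the window (too old, or stamped in the future).
    if (entry.timestamp < cutoff || entry.timestamp > now) continue;
    counts.set(entry.intent, (counts.get(entry.intent) ?? 0) + 1);
  }
  const candidates: string[] = [];
  counts.forEach((count, intent) => {
    if (count >= minCount) candidates.push(intent);
  });
  return candidates.sort();
}
```

In practice this would more likely be a scheduled query against wherever the log lives. The hard part, which this sketch assumes away, is mapping differently worded requests onto the same intent label.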

The AI surface area shrinks over time. You're paying inference costs early, when patterns are novel and uncertain, and cashing out into zero marginal cost later, once patterns are understood. The inference path keeps narrowing toward the genuine long tail.

The thing I'm still working out: what's the right graduation threshold? Frequency is the obvious answer, but I wonder if there are better signals — user confirmation rate on AI-composed views, task completion time, explicit "save this" actions. Haven't landed on this yet, and I suspect the right answer is domain-specific.

Where I've actually been building this: arr-mcp

The proactive variant is easier to think through in a concrete domain, so I've been prototyping against a home lab setup I already maintain — an open-source MCP server for self-hosted media stacks (Plex, Sonarr, Radarr, that whole ecosystem).

The core scenario: you pull a new container image. Without doing anything else, the system identifies what it is, surfaces a configuration form specific to that service in your current stack context, and pre-populates what it can infer. You answer a handful of questions. Dependent services get updated silently.

That works here because the domain has exactly the right shape for this kind of intelligence. The container runtime emits events you can observe. The canonical Plex + Sonarr + Radarr + SABnzbd stack is well-understood enough to encode in a static service registry. Setup is genuinely painful — API keys, library paths, inter-service connections — so there's real value in automating it. And the user base is split between power users who want full control and household members who just want it to work, which is exactly the population where proactive UI earns its keep.
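As a rough illustration of what a static service registry could look like, here is a small sketch. The entries, field names, and the image-name matching are all assumptions made for the example; they are not taken from arr-mcp and are not an authoritative map of how these services connect.

```typescript
interface ServiceEntry {
  service: string;
  connectsTo: string[];  // services this one is typically wired to
  setupFields: string[]; // what a configuration form would ask for
}

// Illustrative entries only; a real registry would be curated and more detailed.
const SERVICE_REGISTRY: ServiceEntry[] = [
  { service: "plex", connectsTo: [], setupFields: ["libraryPaths"] },
  { service: "sabnzbd", connectsTo: [], setupFields: ["downloadPath"] },
  { service: "sonarr", connectsTo: ["sabnzbd", "plex"], setupFields: ["apiKey", "libraryPath"] },
  { service: "radarr", connectsTo: ["sabnzbd", "plex"], setupFields: ["apiKey", "libraryPath"] },
];

// Reduces an image reference such as "registry/org/name:tag" to "name".
function imageBaseName(imageRef: string): string {
  const at = imageRef.indexOf("@");
  const withoutDigest = at === -1 ? imageRef : imageRef.slice(0, at);
  const lastSegment = withoutDigest.slice(withoutDigest.lastIndexOf("/") + 1);
  const colon = lastSegment.indexOf(":");
  return (colon === -1 ? lastSegment : lastSegment.slice(0, colon)).toLowerCase();
}

// Looks up a pulled image in the registry; null means "unknown service".
function identifyService(imageRef: string): ServiceEntry | null {
  const base = imageBaseName(imageRef);
  for (const entry of SERVICE_REGISTRY) {
    if (entry.service === base) return entry;
  }
  return null;
}

// Connections that could be pre-populated because the peer is already running.
function inferableConnections(entry: ServiceEntry, running: string[]): string[] {
  return entry.connectsTo.filter((peer) => running.indexOf(peer) !== -1);
}
```

Matching on the image's base name is deliberately naive: it says nothing about whether the image is trustworthy, which is a separate question from identifying it.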

The architecture I've been working with has three agents in two stages. In the first stage, two agents run in parallel: a domain agent that looks at the detected event and answers "what is this, what does it depend on, what does it replace?", and a security agent that vets the image before anything happens. In the second stage, a UX/component agent takes both outputs and decides what the user actually needs to see. The UX agent doesn't act until both upstream agents resolve. A block from the security agent halts the whole pipeline, with no override except explicit, logged admin acknowledgment.
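The ordering can be sketched with plain promises: the domain and security agents start together, and the UX step only runs once both have resolved and security has not blocked. All type and function names here are hypothetical, the agents are passed in as functions so the sketch stays independent of any model or framework, and the logged admin override is left out.

```typescript
interface DomainResult {
  service: string;
  dependsOn: string[];
  replaces: string | null;
}

interface SecurityResult {
  verdict: "pass" | "block";
  reason?: string;
}

interface UiPlan {
  component: string; // name of a pre-built component to show
  service: string;
}

type DomainAgent = (image: string) => Promise<DomainResult>;
type SecurityAgent = (image: string) => Promise<SecurityResult>;
type UxAgent = (domain: DomainResult, security: SecurityResult) => Promise<UiPlan>;

type PipelineOutcome =
  | { kind: "halted"; reason: string }
  | { kind: "ui"; plan: UiPlan };

async function handleImagePulled(
  image: string,
  domainAgent: DomainAgent,
  securityAgent: SecurityAgent,
  uxAgent: UxAgent,
): Promise<PipelineOutcome> {
  // Start both upstream agents at once and wait for both to resolve.
  const [domain, security] = await Promise.all([
    domainAgent(image),
    securityAgent(image),
  ]);

  // A security block halts the pipeline before the UX agent is ever called.
  if (security.verdict === "block") {
    return { kind: "halted", reason: security.reason ?? "blocked by security agent" };
  }

  return { kind: "ui", plan: await uxAgent(domain, security) };
}
```

Note that `Promise.all` rejects if either upstream agent throws, so in this sketch an agent failure also stops the UX step; a real pipeline would need to decide how to report that to the user.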

The confidence thresholds are where it gets interesting in practice. High confidence (>0.90): act and tell the user what you did. Medium confidence (0.60–0.90): surface a question, wait for confirmation. Low confidence (<0.60): flag it, don't guess. Security warning at any confidence level: block regardless. The numbers above are hypothetical — I'm still calibrating what they should actually be. If anyone has done empirical calibration of this in a deployed system, I'd genuinely love to hear about it.
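Written out as a decision function, the policy looks roughly like this. The cut-offs are the same uncalibrated placeholders as above, the names are hypothetical, and treating an invalid score as low confidence is an added assumption.

```typescript
type Action = "act_and_notify" | "ask_user" | "flag_only" | "block";

interface Assessment {
  confidence: number;       // expected range 0..1
  securityWarning: boolean; // set by the security agent
}

// Placeholder cut-offs from the text; they have not been calibrated.
const HIGH_CONFIDENCE = 0.9;
const LOW_CONFIDENCE = 0.6;

function decideAction(a: Assessment): Action {
  // A security warning wins at any confidence level.
  if (a.securityWarning) return "block";
  // Assumption: a missing or out-of-range score is treated as "don't guess".
  if (!(a.confidence >= 0 && a.confidence <= 1)) return "flag_only";
  if (a.confidence > HIGH_CONFIDENCE) return "act_and_notify";
  if (a.confidence >= LOW_CONFIDENCE) return "ask_user";
  return "flag_only";
}
```

The boundary values follow the ranges as stated: exactly 0.90 and exactly 0.60 both fall into the "ask the user" band.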

The one architectural thing I feel strongly about

There are two architectures that both get called "Generative UI" and they have very different risk profiles. In one, the LLM generates structured data selecting from a closed, pre-validated schema — component code is written and reviewed by developers in advance, and LLM output is treated as untrusted input, validated before rendering. In the other, the LLM generates executable code or markup that gets run.

That second one is genuinely risky. It creates an LLM-controlled code injection vector. Indirect prompt injection — malicious instructions hidden in data the LLM processes — becomes an attack surface for the entire UI. Everything I've described above assumes the schema-generating model. The audit guarantees, the governance properties, the safety of proactive action — none of that holds if you let the LLM write code.

The weak point to watch even in the safe variant: an overly generic "escape hatch" component in your schema — arbitrary HTML, raw markup, iframes. Every component the LLM can select should be something a developer would have shipped on its own.
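Here is a minimal sketch of what "treat LLM output as untrusted input" can mean in code. It is hand-rolled and library-free; the two component names are hypothetical, and a real project would probably use a schema validation library instead. The point is that the set of components is closed, unknown keys are rejected, and there is no raw-markup variant to select.

```typescript
// The closed set of renderable components (hypothetical names). No HTML/iframe variant.
type ComponentSpec =
  | { component: "stat_card"; props: { label: string; value: number } }
  | { component: "confirm_dialog"; props: { message: string } };

function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

function hasOnlyKeys(obj: Record<string, unknown>, allowed: string[]): boolean {
  return Object.keys(obj).every((key) => allowed.indexOf(key) !== -1);
}

// Parses raw model output; anything outside the closed schema yields null.
function parseComponentSpec(raw: string): ComponentSpec | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not even valid JSON
  }
  if (!isRecord(data) || !hasOnlyKeys(data, ["component", "props"])) return null;
  const { component, props } = data;
  if (!isRecord(props)) return null;

  if (component === "stat_card") {
    const { label, value } = props;
    if (
      hasOnlyKeys(props, ["label", "value"]) &&
      typeof label === "string" &&
      typeof value === "number" &&
      isFinite(value)
    ) {
      return { component: "stat_card", props: { label, value } };
    }
    return null;
  }

  if (component === "confirm_dialog") {
    const { message } = props;
    if (hasOnlyKeys(props, ["message"]) && typeof message === "string") {
      return { component: "confirm_dialog", props: { message } };
    }
    return null;
  }

  return null; // unknown component name: never rendered
}
```

A renderer would then switch on the validated `component` field and pass `props` to the matching pre-built component. Strings such as `label` and `message` still need to be rendered as text, not markup, for the guarantee to hold.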

What I don't have good answers to yet

Confidence calibration at scale is the one that keeps me up a little. The thresholds I'm using are hypothetical and I suspect they're very domain-specific, which makes me wonder how much the approach generalizes.

Multi-agent conflict resolution is also unresolved for me — when the domain agent and security agent return conflicting signals, what does the UX agent actually do? I have a rough intuition (security wins, always) but haven't thought through the edge cases.

And I'm not sure this works at all in regulated industries. Auditability requirements are in direct tension with non-deterministic UI generation. The exception-path scoping probably helps — the AI layer is discovery and triage, not the system of record — but I haven't tried to make that argument to a compliance team yet.

Where I'm at

I'm in the early stages of building this out more seriously, and I'm less certain about the right approach than this article might make it sound. The home lab prototype is narrow enough that I trust it. Generalizing to production software products is where the interesting hard parts live.

If you've shipped something that looks like this — or tried and hit walls — I'd really like to compare notes. What broke? What surprised you? Where did the confidence calibration go sideways?

If you want the full architecture document with the cost/tradeoff tables and a minimal viable experiment spec, I'm happy to share that separately.

DEV Community