Day 4: Generative UI Gen 3 — MCP Apps and Open-Ended Surfaces

#ai #llm #mcp #ui

This is Day 4 of my 6-part series on how LLMs rewrote the user interface over the past year. Day 3 covered declarative specs with A2UI — and ended at its ceiling: the agent can compose your primitives, but it can't ship anything you didn't build.

The far end of the freedom axis

Gen 1 let the agent pick components. Gen 2 let it compose them. Gen 3 drops the training wheels: the tool ships an entire interactive UI surface — real HTML, CSS, and JavaScript — and the host renders it in the conversation.

A data tool can render its own interactive chart. A booking tool can ship its own seat map. A 3D modeling tool can embed an actual viewport. None of it pre-built by the chat client, none of it expressible as card-and-list primitives.

This is what MCP Apps standardizes — and the politics are almost as interesting as the tech: the extension (SEP-1865) was authored jointly by the MCP-UI creators, OpenAI, and Anthropic. The two biggest rivals in AI agreed on one way to put interfaces inside conversations. ChatGPT, Claude, Goose, and VS Code have all shipped support.

How it works

Three pieces:

1. A tool declares a UI resource. Alongside its normal schema, a tool references an HTML resource via the ui:// scheme:

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js"
import { z } from "zod"

// PICKER_HTML (the picker's HTML as a string) and searchFlights()
// are assumed to be defined elsewhere in the server.
const server = new McpServer({ name: "flight-server", version: "1.0.0" })

// The UI template the host will render
server.registerResource(
  "flight-picker",
  "ui://flight-server/picker.html",
  { mimeType: "text/html" },
  async () => ({
    contents: [{ uri: "ui://flight-server/picker.html", text: PICKER_HTML }],
  })
)

// The tool that uses it
server.registerTool(
  "search_flights",
  {
    description: "Search flights and show an interactive picker",
    inputSchema: { destination: z.string(), date: z.string() },
    // Points the host at the UI resource registered above
    _meta: { "ui/resourceUri": "ui://flight-server/picker.html" },
  },
  async ({ destination, date }) => ({
    // Plain-text summary for the model, structured data for the UI
    content: [{ type: "text", text: "Found 12 flights" }],
    structuredContent: { flights: await searchFlights(destination, date) },
  })
)

2. The host renders it in a sandboxed iframe. When the tool runs, the client fetches the HTML resource and renders it with restricted permissions: no access to the host's cookies, no access to the parent DOM, and no arbitrary network requests.
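
To make step 2 concrete, here is a minimal sketch of how a host might wrap the fetched HTML before putting it on screen. The helper names `buildSandboxedFrame` and `escapeAttribute` and the use of `srcdoc` are my assumptions for illustration, not something the spec prescribes; a real host may serve the UI from a separate origin and layer a Content Security Policy on top. The only standard pieces here are the HTML `sandbox` attribute and its `allow-scripts` token.

```typescript
// Hypothetical host-side helper: wraps tool-supplied HTML in a sandboxed iframe.
// Only "allow-scripts" is granted. Leaving out "allow-same-origin" gives the
// frame an opaque origin, so it cannot read the host's cookies, storage, or DOM.

function escapeAttribute(value: string): string {
  // Escape the characters that could terminate a double-quoted attribute.
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;")
}

function buildSandboxedFrame(html: string, title: string): string {
  return (
    `<iframe sandbox="allow-scripts" ` +
    `title="${escapeAttribute(title)}" ` +
    `srcdoc="${escapeAttribute(html)}"></iframe>`
  )
}

const frame = buildSandboxedFrame('<button id="go">Book</button>', "Flight picker")
console.log(frame)
```

The design choice worth noticing is what is left out: each extra sandbox token (forms, popups, same-origin) is a permission the host would have to justify granting to someone else's code.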

3. The iframe talks back over JSON-RPC via postMessage. The embedded UI can read the tool result, update as new data arrives, and even request follow-up tool calls — every message loggable by the host, every tool call subject to host approval:

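// Inside the iframe: react to the tool result the host passes in
window.addEventListener("message", (event) => {
  // Ignore anything that did not come from the embedding host
  if (event.source !== window.parent) return
  const msg = event.data
  if (msg?.method === "ui/toolResult") {
    renderFlightCards(msg.params.structuredContent.flights)
  }
})

// Ask the host to run a tool (host can require user approval)
let nextRequestId = 1
function bookFlight(flightId) {
  window.parent.postMessage({
    jsonrpc: "2.0",
    id: nextRequestId++, // unique per request so the response can be matched
    method: "tools/call",
    params: { name: "book_flight", arguments: { flightId } },
  }, "*") // the sandboxed frame may not know the host's origin, hence "*"
}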
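
The other half of that exchange lives in the host. Below is a rough sketch of what a host-side broker could look like: it records every incoming request and only forwards `tools/call` when an approval hook says yes. Everything named here (`handleUiRequest`, `BrokerDeps`, `approve`, `callTool`) is hypothetical, and the flow is synchronous purely to keep the example short; a real host would await a user prompt and the tool call. The error codes -32601 and -32602 come from JSON-RPC 2.0, while -32000 is my own pick from its implementation-defined range.

```typescript
// Hypothetical host-side broker for requests arriving from a tool's iframe.
// Every request is logged, and "tools/call" runs only if the approval hook agrees.

type JsonRpcRequest = {
  jsonrpc: "2.0"
  id: number | string
  method: string
  params?: { name?: string; arguments?: Record<string, unknown> }
}

type JsonRpcResponse =
  | { jsonrpc: "2.0"; id: number | string; result: unknown }
  | { jsonrpc: "2.0"; id: number | string; error: { code: number; message: string } }

type BrokerDeps = {
  approve: (tool: string, args: Record<string, unknown>) => boolean
  callTool: (tool: string, args: Record<string, unknown>) => unknown
  log: JsonRpcRequest[]
}

function handleUiRequest(req: JsonRpcRequest, deps: BrokerDeps): JsonRpcResponse {
  deps.log.push(req) // audit trail: recorded before any decision is made

  if (req.method !== "tools/call") {
    return { jsonrpc: "2.0", id: req.id, error: { code: -32601, message: "Method not found" } }
  }
  const tool = req.params?.name
  if (!tool) {
    return { jsonrpc: "2.0", id: req.id, error: { code: -32602, message: "Invalid params" } }
  }
  const args = req.params?.arguments ?? {}
  if (!deps.approve(tool, args)) {
    // Assumed application-level code for a declined call.
    return { jsonrpc: "2.0", id: req.id, error: { code: -32000, message: "User declined tool call" } }
  }
  return { jsonrpc: "2.0", id: req.id, result: deps.callTool(tool, args) }
}

// Usage: a stand-in approval hook that declines bookings and allows everything else.
const log: JsonRpcRequest[] = []
const deps: BrokerDeps = {
  approve: (tool) => tool !== "book_flight",
  callTool: (tool, args) => ({ tool, args }),
  log,
}

const declined = handleUiRequest(
  { jsonrpc: "2.0", id: 1, method: "tools/call", params: { name: "book_flight", arguments: { flightId: "UA123" } } },
  deps
)
const allowed = handleUiRequest(
  { jsonrpc: "2.0", id: 2, method: "tools/call", params: { name: "search_flights", arguments: { destination: "LIS" } } },
  deps
)
console.log(declined, allowed, log.length)
```

Note that the declined request still lands in the log: auditing and gating are separate jobs, and the broker does both.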

If you've used ChatGPT Apps, this is the same architecture — OpenAI's Apps SDK (with its Skybridge sandbox runtime) was one of the two foundations the standard was built on, alongside the community's MCP-UI project. ChatGPT now supports MCP Apps for compatibility.

Why this generation is different in kind

Gen 1 and Gen 2 kept a crucial property: everything on screen came from your code. Gen 3 breaks it — the pixels now come from the tool developer's code, running inside your product.

That makes MCP Apps less like a component model and more like an app store inside the conversation. The chat client becomes an OS: it manages windows (iframes), brokers IPC (JSON-RPC over postMessage), and enforces permissions (sandbox flags, tool-call approval). The conversation is the new desktop.

That framing explains the design choices that look conservative at first:

Sandboxed iframes with restricted permissions — because you're running someone else's code
Declared resources, fetched ahead of render — so hosts can review HTML before showing it
All communication through loggable JSON-RPC — no invisible side channels
Host-gated tool calls from the UI — a button click can't silently spend the user's money
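
The "declared resources" point can be sketched as a tiny pre-render check. This assumes the host keeps a set of the UI resource URIs a server declared up front; the function name `canRender` and that bookkeeping are hypothetical, and a real host would likely do far more (content review, size limits, CSP).

```typescript
// Hypothetical pre-render check: only render a UI resource that the server
// declared ahead of time, and only if it uses the ui:// scheme.

const UI_SCHEME = "ui://"

function canRender(resourceUri: string, declaredUris: ReadonlySet<string>): boolean {
  return resourceUri.startsWith(UI_SCHEME) && declaredUris.has(resourceUri)
}

const declared = new Set(["ui://flight-server/picker.html"])

console.log(canRender("ui://flight-server/picker.html", declared)) // true
console.log(canRender("ui://flight-server/other.html", declared)) // false
console.log(canRender("https://example.com/picker.html", declared)) // false
```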

The trade-offs, honestly

	Gen 2 (A2UI)	Gen 3 (MCP Apps)
Expressiveness	Your primitives, any arrangement	Anything a webview can do
Consistency	Native look, your design system	Tool's own look — may clash
Trust model	Data only, no code	Sandboxed third-party code
Cross-platform	Same stream renders natively	Needs a webview everywhere
Review surface	Catalog + layout	Entire HTML/JS bundle

Two practical pain points to plan for. Design consistency: ten tools from ten vendors means ten visual styles in one conversation; expect hosts to push styling guidelines and CSS variables the way mobile OSes pushed design languages. Security review: the sandbox contains the blast radius, but a malicious or compromised tool UI can still mislead the user, for example with a button labeled "Cancel" that actually sends "approve." The host can log every JSON-RPC message, but logging is not prevention. Trust now extends to the tool developer, not just the model.
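
One partial mitigation for the mislabeled-button problem is for the host to build its approval prompt from the request payload itself rather than from anything the iframe displays. The sketch below assumes a hypothetical helper, `describeToolCall`, that a host could call before asking the user to confirm. It does not stop a deceptive UI, but it means the user sees the real tool name and arguments at the moment of approval.

```typescript
// Hypothetical helper: build the approval prompt from the request payload,
// never from a label supplied by the iframe.

function describeToolCall(tool: string, args: Record<string, unknown>): string {
  const lines = Object.entries(args).map(
    ([key, value]) => `  ${key}: ${JSON.stringify(value)}`
  )
  return [`The app wants to run "${tool}" with:`, ...lines].join("\n")
}

const prompt = describeToolCall("book_flight", { flightId: "UA123", seats: 2 })
console.log(prompt)
// The app wants to run "book_flight" with:
//   flightId: "UA123"
//   seats: 2
```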

Pick your generation by trust boundary, not by ambition: in-house agent in your own app → Gen 1 or 2 keeps your design system intact; platform hosting third-party tools → Gen 3 is the only real option, with all the OS-like responsibilities that follow.

What's next

Tomorrow wraps the series with everything that doesn't fit in a bubble or an iframe: canvas and branching interfaces, adaptive UX that reshapes itself per user, and the security bill that comes due when interfaces are generated at runtime.