Gabriel Anhaia

Why JSON Mode Slows Your Streaming UX (And When That Tradeoff Makes Sense)


You wired up streaming on your chat UI and it felt great. Tokens flowed in, the cursor blinked, the user watched the answer build. Then product asked for one structured field on the response (a next_step the UI could route on), and you flipped on JSON mode. The stream still works. But now the user sees {"r, then eason", then :, then "the, each fragment flickering into the message bubble before you can hide it. The streaming UX you were proud of has turned into a falling pile of half-formed braces.

You are not doing it wrong. JSON mode and a token-by-token stream genuinely fight each other. The model still emits one token at a time, but the tokens are now JSON syntax, not the prose you wanted to render. Decide which characters are user-facing and which are wire format, and render only the former.

Three patterns work in practice. They scale from "lightest-touch" to "I am rewriting the response shape."

Why the wire format leaks into the UI

The streaming protocols themselves are not the problem. OpenAI's Chat Completions stream and the Responses API stream send tokens as small text deltas with no awareness of JSON structure (streaming reference). Anthropic's Messages stream does the same; tool-input deltas arrive as input_json_delta events whose partial_json field carries the bytes (streaming docs), and Anthropic notes the partial JSON may not be valid mid-stream.

The API gives you bytes. Whether those bytes form a valid token of prose, a half-typed key, or a stray comma is your problem to figure out client-side. If you take the raw delta and append it to a <span>, the user sees the wire format.
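The naive wiring is a one-liner, which is why it ships. Here is a minimal sketch of the failure mode, assuming the same Chat Completions delta shape the snippets below use; messageEl is a hypothetical DOM node:

const messageEl = document.getElementById("message")!;

// Naive: append every delta verbatim. With JSON mode on,
// the user watches keys, quotes, and braces get typed.
for await (const chunk of stream) {
  const delta = chunk.choices[0].delta.content ?? "";
  messageEl.textContent += delta; // renders {"reason": "...
}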

OpenAI's structured outputs guarantee is about the final response shape — once the response is complete, it matches the schema. That guarantee says nothing about intermediate deltas. Mid-stream, the bytes are simply however far the decoder has gotten.

Pattern 1: tolerant parser, render whatever cleans up

The lightest-touch fix is to never render the raw delta. You buffer the JSON as it streams, run a tolerant parser on the buffer after every chunk, and re-render the UI from whatever fields the parser was able to extract.

A tolerant parser is one that closes any open string with ", any open object with }, any open array with ], and ignores trailing commas. Two third-party libraries do this off the shelf: partial-json on npm (MIT) and json-repair on PyPI (MIT). A small hand-rolled state machine works too.
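To make "tolerant" concrete, here is the behavior you can expect from partial-json on a mid-stream buffer; the exact completion rules are the library's, so treat the output comment as illustrative:

import { parse, Allow } from "partial-json";

// Open object, open string: the parser closes both and
// returns whatever fields it could recover so far.
const snapshot = parse('{"reason": "The ans', Allow.ALL);
// snapshot => { reason: "The ans" }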

Here is the pattern in TypeScript with partial-json. The buffer grows; a fresh parse runs on every delta; React rerenders from the parsed object.

import { parse, Allow } from "partial-json";

let buffer = "";
let lastGood: { reason?: string; answer?: string } = {};

// This snippet targets Chat Completions deltas.
// Responses API uses response.output_text.delta events
// with a different shape — adapt the read accordingly.
for await (const chunk of stream) {
  const delta = chunk.choices[0].delta.content ?? "";
  buffer += delta;
  try {
    lastGood = parse(buffer, Allow.ALL);
  } catch {
    // partial parser failed harder than usual; keep
    // the last good snapshot and wait for more bytes
  }
  render(lastGood);
}

render only reads reason and answer. It never sees a stray brace. While the reason field is being typed, lastGood.reason grows one character at a time exactly the way the user expects. The object render reads from stays structurally valid the whole time.
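For completeness, render can be as dumb as two assignments. A hypothetical sketch; the element IDs are placeholders for whatever your UI actually binds:

const reasonEl = document.getElementById("reason")!;
const answerEl = document.getElementById("answer")!;

function render(view: { reason?: string; answer?: string }) {
  // Reads only parsed fields; a stray brace never reaches the DOM.
  reasonEl.textContent = view.reason ?? "";
  answerEl.textContent = view.answer ?? "";
}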

The cost is small. You parse on every delta. For a 500-token JSON response over 3 seconds, that is maybe 200 parses on a string that grows to a couple of kilobytes. A tolerant parser handles that in single-digit milliseconds per parse. Negligible against the model latency you are already paying.

Pattern 2: depth-aware filter, only show closed fields

The tolerant parser shows you the field as it is being typed. Sometimes you don't want that. You want the field to appear only once it is committed. No flicker, no quote marks.

The trick is to track JSON depth as the bytes arrive and lift a field's value out only when its closing quote lands. You don't need a real parser for this. A small state machine over the byte stream is enough.

Track four things: string state, escape state, depth, and the last opened key. When the closer for the current value arrives, you emit (key, value) upward. Everything else is silent.

class FieldEmitter:
    """Scans a flat JSON object byte by byte and emits
    (key, value) pairs only when a string value's closing
    quote arrives. Strings only; see the notes below."""

    def __init__(self):
        self.buf = ""
        self.in_string = False
        self.escape = False
        self.current_key = None
        self._string_start = None
        self.depth = 0

    def feed(self, chunk: str):
        out = []
        for ch in chunk:
            self.buf += ch
            i = len(self.buf) - 1
            # An escaped character never opens or closes anything.
            if self.escape:
                self.escape = False
                continue
            if ch == "\\" and self.in_string:
                self.escape = True
                continue
            if ch == '"':
                if not self.in_string:
                    self.in_string = True
                    self._string_start = i + 1
                else:
                    self.in_string = False
                    text = self.buf[
                        self._string_start:i
                    ]
                    # First string at depth 1 is a key; the
                    # next closed string is its value.
                    if (self.current_key is None
                            and self.depth == 1):
                        self.current_key = text
                    elif self.current_key is not None:
                        out.append(
                            (self.current_key, text)
                        )
                        self.current_key = None
                continue
            # Braces inside strings are just text.
            if self.in_string:
                continue
            if ch == "{":
                self.depth += 1
            elif ch == "}":
                self.depth -= 1
                self.current_key = None
        return out
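
# Usage sketch: chunk boundaries can split keys and values;
# a pair is emitted only when the value's closing quote lands.
emitter = FieldEmitter()
assert emitter.feed('{"inte') == []
assert emitter.feed('nt": "refund"') == [("intent", "refund")]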

The output is a list of (key, value) tuples that landed during this chunk. The renderer reads them and updates the corresponding UI elements. As written, this snippet handles flat single-level objects with string values — the common case for routing decisions like {"intent": "...", "next_step": "..."}. Nested objects will mis-pair their inner keys, and number or boolean values are not emitted at all. To extend, watch for , and } boundaries outside string state and capture the bytes between the colon and the boundary as the value.

This is the right pattern when the user-facing field is a routing decision or short label that should snap into place, not type itself out. Use it for the next_step field that drives a UI route. Don't use it for the long-form answer field, because the user will see a few seconds of nothing followed by a wall of text. Pattern 1 reads better there.

Pattern 3: stop-token streaming, parse once at the end

The third option is to stop fighting the schema and put the prose outside it. The model emits human-readable text bracketed by sentinel tokens, and the JSON envelope arrives only at the end. You stream the prose unmodified and parse the structured part once when the stream closes.

The prompt looks like this:

Reply in two parts. First, the user-visible answer between
<<<ANSWER>>> and <<<END_ANSWER>>>. Then a single JSON
object with the fields {"intent": "...", "confidence": ...}.
Do not put the JSON inside the answer block.

The renderer keeps a flag for "are we inside the answer block." Bytes go to the UI when the flag is true. After <<<END_ANSWER>>> the flag flips and the rest of the buffer is collected into a string that gets parsed once the stream stops.

Sentinels like <<<ANSWER>>> will not be a single token in any modern tokenizer, so they can straddle a chunk boundary. The loop below accumulates into a small lookahead buffer before scanning, so a sentinel split across two deltas still gets caught.

import { parse, Allow } from "partial-json";

let inside = false;
let visible = "";
let envelope = "";
let pending = "";

const OPEN = "<<<ANSWER>>>";
const CLOSE = "<<<END_ANSWER>>>";

for await (const chunk of stream) {
  const delta = chunk.choices[0].delta.content ?? "";
  pending += delta;

  // Rescan pending in a loop: one delta can contain a
  // sentinel plus text on either side of it.
  while (true) {
    if (!inside) {
      const open = pending.indexOf(OPEN);
      if (open < 0) {
        // No opener yet. Flush all but the last OPEN.length
        // chars to the envelope; hold those back in case a
        // sentinel straddles the next chunk boundary.
        envelope += pending.slice(
          0,
          Math.max(0, pending.length - OPEN.length)
        );
        pending = pending.slice(
          Math.max(0, pending.length - OPEN.length)
        );
        break;
      }
      envelope += pending.slice(0, open);
      pending = pending.slice(open + OPEN.length);
      inside = true;
    } else {
      const close = pending.indexOf(CLOSE);
      if (close < 0) {
        // Same hold-back, this time so a split CLOSE never
        // leaks its first characters into the visible text.
        const safe = Math.max(
          0,
          pending.length - CLOSE.length
        );
        visible += pending.slice(0, safe);
        pending = pending.slice(safe);
        render(visible);
        break;
      }
      visible += pending.slice(0, close);
      pending = pending.slice(close + CLOSE.length);
      inside = false;
      render(visible);
    }
  }
}

envelope += pending;

let decision: { intent?: string; confidence?: number };
try {
  decision = JSON.parse(envelope.trim());
} catch {
  // The model may wrap the JSON in a code fence or leak
  // text around it: trim to the outermost braces, then
  // fall back to the tolerant parse Pattern 1 relies on.
  const start = envelope.indexOf("{");
  const end = envelope.lastIndexOf("}");
  const candidate = start >= 0
    ? envelope.slice(start, end > start ? end + 1 : undefined)
    : envelope.trim();
  decision = parse(candidate, Allow.ALL);
}
route(decision);

You lose the structured-outputs guarantee on the final shape: the model can drop the JSON block or leak text outside the markers, so expect retry rates to climb a percent or two. In exchange, the UX is the cleanest of the three.

Stop-token streaming reads as the most "obviously correct" pattern. The guarantees are the worst of the three. Pick it when the user-visible part is the long-form answer and the JSON is a one-line routing decision tacked on the end. Skip it when the structured part is the whole point of the response.

Picking between them

Pattern choice comes down to one question: how much of the JSON do you want visible mid-stream?

If most fields should appear as they are typed — a chat answer, a long explanation, a code block — reach for Pattern 1. The tolerant parser keeps the structure intact and the prose flows.

If most fields are short labels or routing decisions and one or two are long, Pattern 2 is the right call. Snap the labels in when they close; let the long field stream as text by passing it through the depth filter as a special case, as sketched below.
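Here is one way to get that hybrid, sketched with Pattern 1's tolerant parser instead of the depth filter. The substring check is a heuristic that assumes the label contains no escape sequences, and the field names are illustrative:

import { parse, Allow } from "partial-json";

// Stream "answer" live; surface "next_step" only once its
// full quoted value has landed in the raw buffer.
function project(buffer: string) {
  const parsed = parse(buffer, Allow.ALL) as {
    answer?: string;
    next_step?: string;
  };
  const closed =
    parsed.next_step !== undefined &&
    buffer.includes(`"${parsed.next_step}"`);
  return {
    answer: parsed.answer ?? "",
    next_step: closed ? parsed.next_step : undefined,
  };
}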

If the long field is genuinely the whole answer and the JSON is a thin envelope around it, Pattern 3 earns its keep. Don't make the parser do work the prompt can do.

Don't render raw deltas from a JSON-mode stream into a user-facing surface. The flicker isn't a model bug. It's wire format leaking through, and adding one of these three filters is your job. Pick the one that matches the shape of your response and the streaming UX returns to the thing you were proud of.


If this was useful

The Prompt Engineering Pocket Guide has more on the prompt-shape decisions that hide inside "should this be JSON mode." Output contracts, sentinel tokens, when a schema pays for itself versus when a regex does — the same questions, applied to ten more shapes of response than this post had room for.

