When we built the Novellum interactive fiction player, the first obvious move was streaming. Push LLM tokens to the frontend as they arrive. Simple.
It didn't work.
The LLM response in an interactive fiction system isn't plain text. It's a narrative envelope:
[DIALOGUE]You were brave to come here.[/DIALOGUE]
[SPEAKER_NPC_ID]42[/SPEAKER_NPC_ID]
[PORTRAIT_EXPRESSION]sad[/PORTRAIT_EXPRESSION]
[BGM_MOOD]melancholic[/BGM_MOOD]
[SCENE]A candlelit library. Rain against tall windows.[/SCENE]
Dialogue for the user. Character expression, music mood, and scene description for the rendering layer. All from a single generation pass.
Stream this raw → brackets and tag names appear mid-story.
Buffer the full response → 3-5 second loading wait.
Neither is acceptable.
The Reactive Stream Scanner
We built a scanner that processes the stream chunk-by-chunk and behaves differently per tag type:
For [DIALOGUE] content:
Emit partial tokens immediately as they accumulate. UTF-8 safe: never cut a multi-byte Chinese or Japanese character in the middle of its byte sequence.
For semantic tags (BGM_MOOD, PORTRAIT_EXPRESSION, SPEAKER_NPC_ID, SCENE):
Accumulate until the closing tag is found. Emit a single complete event.
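A minimal sketch of that two-mode loop (Python here for illustration; the class name and event tuples are mine, not the production API). DIALOGUE text streams out as soon as it can no longer be the prefix of a closing tag; semantic tags buffer until their close arrives:

```python
import re

# Tags whose content is buffered until the closing tag. Everything inside
# [DIALOGUE] is instead emitted incrementally as partial events.
SEMANTIC_TAGS = {"BGM_MOOD", "PORTRAIT_EXPRESSION", "SPEAKER_NPC_ID", "SCENE"}

class StreamScanner:
    def __init__(self):
        self.buf = ""   # text not yet emitted
        self.tag = None # tag we are currently inside, if any

    def feed(self, chunk):
        """Consume one LLM chunk, yield ('partial'|'complete', tag, text) events."""
        self.buf += chunk
        while True:
            if self.tag is None:
                # look for the next opening tag (text outside tags is skipped
                # in this sketch, since the envelope is all-tags)
                m = re.search(r"\[([A-Z_]+)\]", self.buf)
                if not m:
                    return
                self.tag = m.group(1)
                self.buf = self.buf[m.end():]
            close = f"[/{self.tag}]"
            idx = self.buf.find(close)
            if idx >= 0:
                # closing tag found; for DIALOGUE this event carries only the
                # not-yet-streamed tail, earlier text already went out as partials
                yield ("complete", self.tag, self.buf[:idx])
                self.buf = self.buf[idx + len(close):]
                self.tag = None
                continue
            if self.tag == "DIALOGUE":
                # stream everything that cannot be the start of the closing
                # tag -- a one-tag lookahead held back at the buffer's end
                safe = len(self.buf) - len(close)
                if safe > 0:
                    yield ("partial", self.tag, self.buf[:safe])
                    self.buf = self.buf[safe:]
            return
```

The held-back suffix is what keeps a `[` that might begin `[/DIALOGUE]` from being streamed prematurely.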
Each complete event triggers an immediate downstream action:
| Event | Action |
|---|---|
| DIALOGUE partial | stream WS message → text appears word-by-word |
| BGM_MOOD complete | Track lookup → bgm_change → music fades in mid-stream |
| SPEAKER_NPC_ID + PORTRAIT_EXPRESSION both present | speaker_state → portrait expression changes |
| SCENE complete | Async image generation queued → asset_ready later |
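The mapping in the table could be sketched as a dispatch function (handler names, event fields, and the `lookup_track` hook are illustrative assumptions, not the real API):

```python
def dispatch(tag, text, send, lookup_track=None):
    """Turn one completed tag into a typed WebSocket message via `send`.

    `send` is any callable accepting a dict; `lookup_track` (optional)
    resolves a mood string to a concrete track before forwarding.
    """
    if tag == "DIALOGUE":
        send({"type": "stream", "text": text})
    elif tag == "BGM_MOOD":
        # resolve the mood server-side so the client gets a concrete track
        track = lookup_track(text) if lookup_track else text
        send({"type": "bgm_change", "track": track})
    elif tag == "SCENE":
        # image generation is queued asynchronously; asset_ready arrives later
        send({"type": "scene_pending", "prompt": text})
```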
Timeline of a Single Turn
- t+0ms: User message sent
- t+100ms: First dialogue words appear ← user is already reading
- t+300ms: BGM_MOOD closed → music starts changing
- t+500ms: speaker_state fires → character expression shifts
- t+2000ms: LLM finishes generating
- t+10000ms: Scene background image arrives
By the time the LLM finishes, the user has been reading for 1-2 seconds. The music changed. The character's face settled. The world moved while the story was being written.
Notable Edge Cases
UTF-8 safety: before emitting a partial event, the scanner must find the longest prefix that ends on a valid rune boundary. East Asian content breaks if you cut at an arbitrary byte offset.
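The boundary check itself is small. A sketch (the function name is mine; the logic follows the standard UTF-8 byte layout, where continuation bytes match `0b10xxxxxx`):

```python
def utf8_safe_prefix(data: bytes) -> bytes:
    """Longest prefix of `data` that does not end mid-codepoint."""
    if not data:
        return data
    i = len(data) - 1
    # walk back over up to three continuation bytes to find the lead byte
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    lead = data[i]
    if lead < 0x80:
        need = 1          # ASCII byte, always complete
    elif lead >= 0xF0:
        need = 4          # 4-byte sequence
    elif lead >= 0xE0:
        need = 3          # 3-byte sequence (most CJK characters)
    elif lead >= 0xC0:
        need = 2          # 2-byte sequence
    else:
        return data[:i]   # stray continuation byte: cut before it
    # keep everything if the final sequence is complete, else trim it off
    return data if len(data) - i == need else data[:i]
```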
Ambiguous brackets: Inside [DIALOGUE], a [ might start a closing tag or be literal character dialogue. The scanner keeps a one-bracket lookahead before committing.
speaker_state coordination: SPEAKER_NPC_ID and PORTRAIT_EXPRESSION arrive as separate tags in no guaranteed order. The processor accumulates partial state and emits speaker_state only when both are present and valid.
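That accumulate-then-emit coordination might look like this (class name, the validation set, and the event shape are illustrative assumptions):

```python
VALID_EXPRESSIONS = {"neutral", "sad", "happy", "angry"}  # illustrative set

class SpeakerAccumulator:
    """Collects SPEAKER_NPC_ID and PORTRAIT_EXPRESSION in either order,
    emitting a speaker_state event only once both are present and valid."""

    def __init__(self):
        self.npc_id = None
        self.expression = None

    def feed(self, tag, text):
        if tag == "SPEAKER_NPC_ID" and text.isdigit():
            self.npc_id = int(text)
        elif tag == "PORTRAIT_EXPRESSION" and text in VALID_EXPRESSIONS:
            self.expression = text
        if self.npc_id is not None and self.expression is not None:
            event = {"type": "speaker_state", "npc_id": self.npc_id,
                     "expression": self.expression}
            self.npc_id = self.expression = None  # reset for the next speaker
            return event
        return None  # still waiting for the other half
```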
Fallback mode: If the buffer exceeds a threshold without recognized tag structure (malformed model output), the scanner enters fallback mode rather than stalling indefinitely.
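The trigger condition can be sketched in a few lines (the threshold value and function name are illustrative, not the production numbers):

```python
FALLBACK_THRESHOLD = 2048  # illustrative: chars buffered before giving up

def should_fall_back(buf: str, threshold: int = FALLBACK_THRESHOLD) -> bool:
    """True when the buffer has grown past the threshold without any
    opening bracket that could begin a recognized tag."""
    return len(buf) > threshold and "[" not in buf
```

Once triggered, the scanner would stream the buffer as plain dialogue rather than stall waiting for structure that never arrives.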
Why This Is Hard to Retrofit
This layer requires 4 coordinated parts:
- Prompt engineering — the model must emit the tag structure consistently
- Server-side scanner — runs on the backend so BGM and speaker lookups are resolved before events are forwarded
- Typed WebSocket protocol — not raw text, typed event envelopes per event class
- Frontend projection system — handles concurrent stream + bgm_change + speaker_state events without visual artifacts
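On the wire, the typed envelopes might look like this (field names are illustrative, inferred from the event names above):

```python
import json

# One self-describing JSON envelope per event class, instead of raw text frames.
envelopes = [
    {"type": "stream", "text": "You were brave to come here."},
    {"type": "bgm_change", "track": "rain_library_03"},
    {"type": "speaker_state", "npc_id": 42, "expression": "sad"},
    {"type": "asset_ready", "kind": "scene_bg", "url": "/assets/scenes/lib.png"},
]

# each serialized envelope becomes one WebSocket frame
frames = [json.dumps(e) for e in envelopes]
```

Because every frame carries its own `type`, the frontend can route each one to the right projection (text, audio, portrait, background) without parsing free text.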
You can't bolt this onto a chatbot architecture. You'd be replacing the interaction layer.
About Novellum
Novellum is a full-stack interactive fiction system for platform deployment. The reactive stream parsing layer is a core component of the standard player experience — included with full-system deployments alongside creator tools, operations backend, and monetization.
For the product strategy context: Why AI Chat Products Lose Users and Interactive Fiction Keeps Them