shinji shimizu

Posted on May 31 • Originally published at dev.to

Fitting LLM Reply Suggestions Into Every Provider's Prompt Cache — Without Structured Output

#ai #rust #llm #webdev

I wanted to add reply suggestions to a voice roleplay chat — the classic UX where three "you could say this next" chips appear under each AI response. Sounds simple. But when your chat is built around streaming and prompt caching, every obvious approach turns out to be a bad fit.

I ended up going with the unglamorous move of embedding inline markers in the response and stripping them out afterward. The path to that decision was interesting enough to write up.

What I wanted to build: three "you could say this" chips per AI response — no structured output, no stream interruption, no cache invalidation.

Two Hard Constraints

1. The conversation is built around prompt caching

Keeping token costs down in an LLM chat comes down to caching, and every provider does it differently.

Gemini: explicit cache. A cache object is created per session, containing the persona prompt and conversation history. Each turn sends only the diff. When history grows too long, the cache is rebuilt.
DeepSeek / Cerebras (OpenAI-compatible): send system + full history + user every time and ride the server's implicit prefix cache (hits are measurable via usage fields such as DeepSeek's prompt_cache_hit_tokens).
Grok (xAI): the x-grok-conv-id header ties requests to the same conversation, keeping them pinned to the cache.

The common thread: the conversation prefix (persona + history) should be reused as much as possible. Anything that disturbs that prefix hurts both cost and latency.
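To make "reuse the prefix" concrete, here is a minimal sketch of how a turn's request could be assembled and how much of it overlaps with the previous turn. The `Message` type and both function names are hypothetical, and real caches match on tokens rather than whole messages, so treat the message-level count as a rough proxy only.

```rust
/// One chat message. `role` is "system", "user", or "assistant".
/// (Hypothetical type for illustration, not the article's actual data model.)
#[derive(Clone, Debug, PartialEq)]
pub struct Message {
    pub role: &'static str,
    pub content: String,
}

/// Assemble one turn's request: persona first, then stored history,
/// then the new user line. The order never changes between turns.
pub fn build_request(persona: &str, history: &[Message], user_text: &str) -> Vec<Message> {
    let mut messages = Vec::with_capacity(history.len() + 2);
    messages.push(Message { role: "system", content: persona.to_string() });
    messages.extend_from_slice(history);
    messages.push(Message { role: "user", content: user_text.to_string() });
    messages
}

/// Count how many leading messages two requests have in common.
/// A prefix cache can only help with that shared part.
pub fn shared_prefix_len(previous: &[Message], next: &[Message]) -> usize {
    previous
        .iter()
        .zip(next.iter())
        .take_while(|(a, b)| a == b)
        .count()
}
```

When history is only ever appended to, the whole previous request is a prefix of the next one. Changing the persona text, or rewriting an earlier message, drops the shared count to wherever the first difference sits.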

2. Structured output is off the table

The natural-looking approach to fetching three suggestions would be something like {"reply": "...", "suggestions": ["...", "...", "..."]}. I ruled it out for two reasons.

Gemini flash-lite class models show a noticeable latency increase with structured output. The lighter the model, the larger the cost of schema compliance relative to the task itself.
It directly conflicts with sentence-level TTS streaming. This chat is designed to start speaking from the very first sentence. While the model is emitting JSON, there is no clean way to pull out that first sentence, so in practice structured output means waiting for full generation before any audio plays.
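For context, sentence-level streaming over plain text can be as small as the sketch below: chunks go in, complete sentences come out as soon as their terminator arrives. `SentenceBuffer` is a hypothetical name and the terminator list is an assumption. A real splitter would also need to handle ellipses, abbreviations, and decimals.

```rust
/// Accumulates streamed text chunks and releases complete sentences.
/// (Illustrative sketch; not the article's actual TTS pipeline.)
#[derive(Default)]
pub struct SentenceBuffer {
    pending: String,
}

impl SentenceBuffer {
    /// Append one streamed chunk and return every sentence it completed.
    pub fn push(&mut self, chunk: &str) -> Vec<String> {
        self.pending.push_str(chunk);
        let mut sentences = Vec::new();
        // Repeatedly cut at the first terminator still in the buffer.
        while let Some(end) = self
            .pending
            .find(|c: char| matches!(c, '.' | '!' | '?' | '。' | '！' | '？'))
        {
            // Terminators may be multi-byte, so measure the one we found.
            let terminator_len = self.pending[end..]
                .chars()
                .next()
                .map_or(1, char::len_utf8);
            let rest = self.pending.split_off(end + terminator_len);
            let sentence = std::mem::replace(&mut self.pending, rest);
            let sentence = sentence.trim();
            if !sentence.is_empty() {
                sentences.push(sentence.to_string());
            }
        }
        sentences
    }

    /// Return whatever is left once the stream has ended.
    pub fn finish(self) -> Option<String> {
        let rest = self.pending.trim();
        if rest.is_empty() {
            None
        } else {
            Some(rest.to_string())
        }
    }
}
```

Each returned sentence can be handed to TTS immediately. With a JSON envelope the same text would sit inside a string value that is only known to be complete once the closing quote arrives, which is the conflict described above.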

Three Approaches I Considered

A. Separate API call to generate suggestions
Fire a second request after the main turn. The prefix would likely hit the cache again, but there's an extra round-trip, and maintaining cache consistency — across Grok's conv-id, implicit prefix caches, etc. — becomes your problem.

B. Structured output, bundled in the main turn
No second request, so cache consistency is trivial. But ruled out for the reasons above (latency + streaming conflict).

C. Inline markers, bundled in the main turn (chosen)
Ask the model to append {{SUGGEST: option1 | option2 | option3}} at the very end of its response, and extract it server-side.

Why C Works

It's the same request. There is no "second request." Whether the provider uses an explicit cache or an implicit prefix cache, the turn already runs against the cached prefix, so alignment is automatic. No per-provider logic needed.
No structured output. Plain text generation all the way through.
Zero perceived latency increase. TTS is already playing from the first sentence while {{SUGGEST}} trickles out at the end. Generation finishes while the user is listening.
Reuses the existing marker infrastructure. This chat already has inline markers like {{SHOW: label}}, {{POSE: ...}}, and {{IMAGE: ...}}, plus a pipeline for extracting and stripping them. Suggestions are just one more entry in that system. Design stays consistent.
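One detail the points above imply: while the response streams, marker text has to be held back so TTS never reads it aloud. A minimal sketch of such a filter follows, assuming markers always look like `{{...}}` and never nest. `MarkerFilter` is a hypothetical name, not the actual pipeline.

```rust
/// Splits a streamed response into speakable text and withheld markers.
/// (Illustrative sketch under the assumptions stated above.)
#[derive(Default)]
pub struct MarkerFilter {
    /// Text received but not yet classified as speech or marker.
    buffer: String,
    /// Complete markers seen so far, e.g. "{{SUGGEST: a | b | c}}".
    pub markers: Vec<String>,
}

impl MarkerFilter {
    /// Feed one chunk; returns the text that is safe to forward right now.
    pub fn push(&mut self, chunk: &str) -> String {
        self.buffer.push_str(chunk);
        let mut speakable = String::new();
        loop {
            match self.buffer.find("{{") {
                Some(start) => {
                    // Everything before the marker is ordinary speech.
                    speakable.push_str(&self.buffer[..start]);
                    match self.buffer[start..].find("}}") {
                        Some(rel_end) => {
                            // Marker is complete: record it and keep scanning.
                            let end = start + rel_end + 2;
                            self.markers.push(self.buffer[start..end].to_string());
                            self.buffer.drain(..end);
                        }
                        None => {
                            // Marker is still open: keep it buffered.
                            self.buffer.drain(..start);
                            return speakable;
                        }
                    }
                }
                None => {
                    // A trailing '{' may be the first half of "{{", so hold it.
                    let keep = if self.buffer.ends_with('{') { 1 } else { 0 };
                    let cut = self.buffer.len() - keep;
                    speakable.push_str(&self.buffer[..cut]);
                    self.buffer.drain(..cut);
                    return speakable;
                }
            }
        }
    }
}
```

Because the suggestion marker comes last, everything the user hears has already been forwarded by the time the filter starts buffering it.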

The Key Implementation Detail: Strip From Both Places

The important part: once extracted, the marker must be removed from both the TTS/display text and the DB history. Suggestions are ephemeral UI scaffolding, not part of the character's actual speech — leaving them in history would pollute context for future turns.

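// Extract {{SUGGEST: a | b | c}} and remove it entirely from the body.
// Assumes the `regex` and `once_cell` crates.
use once_cell::sync::Lazy;
use regex::Regex;

// Case-insensitive; the marker body may span multiple lines.
static RE_SUGGEST: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"(?is)\{\{\s*SUGGEST\s*:\s*([\s\S]*?)\}\}").unwrap());

// Returns the text with every SUGGEST marker removed, plus up to three
// suggestions parsed from the first marker found.
fn extract_suggest(text: &str) -> (String, Vec<String>) {
    match RE_SUGGEST.captures(text) {
        Some(cap) => {
            let suggestions: Vec<String> = cap[1]
                .split('|')
                .map(|s| s.trim().to_string())
                .filter(|s| !s.is_empty()) // drop empty options such as "a || b"
                .take(3)
                .collect();
            let clean = RE_SUGGEST.replace_all(text, "").trim().to_string();
            (clean, suggestions)
        }
        // No marker: pass the text through and return no suggestions.
        None => (text.to_string(), Vec::new()),
    }
}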

This is where the existing "store annotated / display clean" separation pays off. In this chat:

ai_text returned to the client (display + TTS) is fully stripped of all markers.
The text saved to the DB has the {{SHOW}}/{{POSE}} markers re-attached, so the model keeps seeing its own canonical format in history and continues using it correctly.

{{SUGGEST}} is different from {{SHOW}}/{{POSE}} — it doesn't go back into the DB at all. It's ephemeral. The design of choosing per-marker whether to persist or discard let suggestions slot in cleanly without touching anything else.
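The per-marker choice can be sketched as a small policy table plus one pass over the raw model output. Everything here is hypothetical naming (`Persistence`, `policy_for`, `render`). The sketch keeps persisted markers in place rather than re-attaching them, and defaulting unlisted markers to discard is my assumption, not something the real system is known to do.

```rust
/// Whether a marker kind survives into stored history.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Persistence {
    Persist,
    Discard,
}

/// Hypothetical policy table. Only SHOW/POSE are known to be persisted;
/// treating every other name (including SUGGEST) as Discard is an assumption.
pub fn policy_for(marker_name: &str) -> Persistence {
    match marker_name {
        "SHOW" | "POSE" => Persistence::Persist,
        _ => Persistence::Discard,
    }
}

/// Produce (display_text, stored_text) from raw model output.
/// Display text drops every marker; stored text keeps the persisted ones.
pub fn render(raw: &str) -> (String, String) {
    let mut display = String::new();
    let mut stored = String::new();
    let mut rest = raw;
    while let Some(start) = rest.find("{{") {
        let end = match rest[start..].find("}}") {
            Some(rel) => start + rel + 2,
            None => break, // unterminated marker: leave the remainder as-is
        };
        display.push_str(&rest[..start]);
        stored.push_str(&rest[..start]);
        let marker = &rest[start..end];
        // "{{POSE: wave}}" -> "POSE"
        let name = marker[2..].split(':').next().unwrap_or("").trim();
        if policy_for(&name.to_ascii_uppercase()) == Persistence::Persist {
            stored.push_str(marker);
        }
        rest = &rest[end..];
    }
    display.push_str(rest);
    stored.push_str(rest);
    (display.trim().to_string(), stored.trim().to_string())
}
```

Adding a new ephemeral marker then means adding no code at all, and adding a persisted one means adding a single match arm.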

On the prompt side, it's just one extra block gated by a feature flag in the persona config:

At the very end of your response, add exactly three short replies the user
might say next, in this format:
{{SUGGEST: option1 | option2 | option3}}
- Always place it last (after any {{SHOW}}/{{POSE}} markers)
- Write each option in first person, casual, short
- Vary the direction: one enthusiastic, one deflecting, one asking a question back
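A minimal sketch of that gating, assuming a persona config with a boolean flag. `PersonaConfig`, `suggest_replies`, and `system_prompt` are hypothetical names, and appending the block at the end of the persona prompt is my assumption about placement.

```rust
/// Hypothetical persona settings; only the two fields needed here.
pub struct PersonaConfig {
    pub base_prompt: String,
    pub suggest_replies: bool,
}

/// The instruction block from the article, kept as one constant so the
/// system prompt is byte-identical on every turn of a session.
const SUGGEST_INSTRUCTIONS: &str = "\
At the very end of your response, add exactly three short replies the user\n\
might say next, in this format:\n\
{{SUGGEST: option1 | option2 | option3}}\n\
- Always place it last (after any {{SHOW}}/{{POSE}} markers)\n\
- Write each option in first person, casual, short\n\
- Vary the direction: one enthusiastic, one deflecting, one asking a question back";

/// Build the system prompt, appending the suggestion block only when enabled.
pub fn system_prompt(config: &PersonaConfig) -> String {
    if config.suggest_replies {
        format!("{}\n\n{}", config.base_prompt, SUGGEST_INSTRUCTIONS)
    } else {
        config.base_prompt.clone()
    }
}
```

Since the flag lives in the persona config rather than changing per turn, the system prompt stays fixed for the whole session and remains part of the cacheable prefix.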

A Note on Implicit Prefix Cache Alignment

Implicit prefix caches hit when the token sequence at the start of a request matches a previously seen prefix. The marker approach simply generates suggestions as part of the current turn's response — the next turn's input prefix (system + history) is identical to what it would be in a plain conversation. The prefix keeps hitting the cache normally. The suggestions never touch the prefix at all. That's a quiet but important property.

Summary

When adding secondary structured data to a streaming + caching chat, consider inline markers + extraction before reaching for structured output.
Bundling everything into the same request makes cross-provider cache alignment a non-issue by construction.
If you already have a marker extraction pipeline, the marginal cost is nearly zero. Design it so you can choose per-marker whether to persist or discard — that flexibility makes ephemeral UI additions painless to add later.

The costs: output tokens increase by a few dozen, and occasionally the model mangles the marker format (same risk level as {{SHOW}}/{{POSE}}). Both are acceptable.
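For the mangled-marker case, one option is a lenient fallback that needs no regex. The sketch below assumes the typical damage is a missing or single closing brace, and that the marker is always the last thing in the response, so everything from the marker onward can be dropped. `extract_suggest_lenient` is a hypothetical helper, not part of the actual pipeline.

```rust
/// Lenient extraction: tolerates a missing "}}" and any letter case.
/// Returns (clean_text, suggestions), like a strict extractor would.
pub fn extract_suggest_lenient(text: &str) -> (String, Vec<String>) {
    const OPEN: &str = "{{SUGGEST";
    // ASCII uppercasing never changes byte lengths, so offsets found in
    // `upper` are valid offsets into `text` as well.
    let upper = text.to_ascii_uppercase();
    let start = match upper.rfind(OPEN) {
        Some(i) => i,
        None => return (text.trim().to_string(), Vec::new()),
    };
    let tail = &text[start + OPEN.len()..];
    // Use the closing braces if present, otherwise take the rest of the text.
    let body = match tail.find("}}") {
        Some(end) => &tail[..end],
        None => tail,
    };
    let body = body.trim_start().trim_start_matches(':');
    let suggestions: Vec<String> = body
        .split('|')
        .map(|s| s.trim().trim_end_matches('}').trim().to_string())
        .filter(|s| !s.is_empty())
        .take(3)
        .collect();
    // Assumption: the marker is last, so nothing after it is speech.
    (text[..start].trim().to_string(), suggestions)
}
```

If the model drops the marker entirely, the function returns the text unchanged with an empty list, and the UI can simply show no chips for that turn.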

This chat is part of kotonia, a voice roleplay product running multilingual TTS × lip-sync avatars on a local GPU.
