shinji shimizu

Posted on • Originally published at dev.to

Fitting LLM Reply Suggestions Into Every Provider's Prompt Cache — Without Structured Output

I wanted to add reply suggestions to a voice roleplay chat — the classic UX where three "you could say this next" chips appear under each AI response. Sounds simple. But when your chat is built around streaming and prompt caching, every obvious approach turns out to be a bad fit.

I ended up going with the unglamorous move of embedding inline markers in the response and stripping them out afterward. The path to that decision was interesting enough to write up.

Three reply suggestion chips shown below an AI response (kotonia)
What I wanted to build: three "you could say this" chips per AI response — no structured output, no stream interruption, no cache invalidation.

Two Hard Constraints

1. The conversation is built around prompt caching

Keeping token costs down in an LLM chat comes down to caching, and every provider does it differently.

  • Gemini: explicit cache. A cache object is created per session, containing the persona prompt and conversation history. Each turn sends only the diff. When history grows too long, the cache is rebuilt.
  • DeepSeek / Cerebras (OpenAI-compatible): send system + full history + user every time and ride the server's implicit prefix cache (measurable via prompt_cache_hit_tokens etc.).
  • Grok (xAI): the x-grok-conv-id header ties requests to the same conversation, keeping them pinned to the cache.

The common thread: the conversation prefix (persona + history) should be reused as much as possible. Anything that disturbs that prefix hurts both cost and latency.

2. Structured output is off the table

The natural-looking approach to fetching three suggestions would be something like {"reply": "...", "suggestions": ["...", "...", "..."]}. I ruled it out for two reasons.

  • Gemini flash-lite class models show noticeable latency increases with structured output. The lighter the model, the heavier schema compliance costs are relative to the task.
  • It directly conflicts with sentence-level TTS streaming. This chat is designed to start speaking from the very first sentence. While the model is outputting JSON, there's no way to pull out that first sentence. Structured output means waiting for full generation before any audio plays.
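
To make the streaming point concrete, here is a minimal sketch of sentence-level buffering over a plain-text stream. `SentenceBuffer` and its terminator list are hypothetical and not taken from the actual codebase; the point is only that with plain text, a sentence can be handed to TTS as soon as its terminator arrives.

```rust
/// Hypothetical helper: buffers streamed text and releases complete sentences.
pub struct SentenceBuffer {
    pending: String,
}

/// Naive terminator check (an assumption; real text needs more care).
fn is_terminator(c: char) -> bool {
    matches!(c, '.' | '!' | '?' | '。' | '！' | '？')
}

impl SentenceBuffer {
    pub fn new() -> Self {
        Self { pending: String::new() }
    }

    /// Append a streamed chunk and return every sentence it completes.
    pub fn push(&mut self, chunk: &str) -> Vec<String> {
        self.pending.push_str(chunk);
        let mut done = Vec::new();
        while let Some(pos) = self.pending.find(is_terminator) {
            // Keep runs such as "?!" or "..." inside the same sentence.
            let run: usize = self.pending[pos..]
                .chars()
                .take_while(|c| is_terminator(*c))
                .map(char::len_utf8)
                .sum();
            let sentence: String = self.pending.drain(..pos + run).collect();
            let sentence = sentence.trim();
            if !sentence.is_empty() {
                done.push(sentence.to_string());
            }
        }
        done
    }

    /// Return whatever is left once the stream has ended.
    pub fn finish(self) -> Option<String> {
        let rest = self.pending.trim();
        if rest.is_empty() { None } else { Some(rest.to_string()) }
    }
}
```

Each string returned by `push` can go to TTS immediately. With a JSON envelope, the same text would arrive inside a string value, so it could not be used this directly.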

Three Approaches I Considered

A. Separate API call to generate suggestions
Fire a second request after the main turn. The prefix would likely hit the cache again, but there's an extra round-trip, and maintaining cache consistency — across Grok's conv-id, implicit prefix caches, etc. — becomes your problem.

B. Structured output, bundled in the main turn
No second request, so cache consistency is trivial. But ruled out for the reasons above (latency + streaming conflict).

C. Inline markers, bundled in the main turn (chosen)
Ask the model to append {{SUGGEST: option1 | option2 | option3}} at the very end of its response, and extract it server-side.

Why C Works

  • It's the same request. There is no "second request." Whether the provider uses an explicit cache or an implicit prefix cache, the turn that produces the reply also produces the suggestions, so cache alignment is automatic. No per-provider logic needed.
  • No structured output. Plain text generation all the way through.
  • Zero perceived latency increase. TTS is already playing from the first sentence while {{SUGGEST}} trickles out at the end. Generation finishes while the user is listening.
  • Reuses the existing marker infrastructure. This chat already has inline markers like {{SHOW: label}}, {{POSE: ...}}, and {{IMAGE: ...}}, plus a pipeline for extracting and stripping them. Suggestions are just one more entry in that system. Design stays consistent.
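
For the marker to stay inaudible while it trickles out, the streaming path has to hold marker text back from TTS. The sketch below shows one way to do that, assuming markers always have the `{{...}}` shape. `MarkerGate` is a hypothetical name, and this is not necessarily how the real pipeline is structured.

```rust
/// Hypothetical streaming gate: releases speakable text and sets markers aside.
pub struct MarkerGate {
    held: String,         // text not yet released (possibly a partial marker)
    markers: Vec<String>, // complete markers seen so far
}

impl MarkerGate {
    pub fn new() -> Self {
        Self { held: String::new(), markers: Vec::new() }
    }

    /// Feed a chunk; returns the text that is safe to hand to TTS now.
    pub fn push(&mut self, chunk: &str) -> String {
        self.held.push_str(chunk);
        let mut speakable = String::new();
        loop {
            match self.held.find("{{") {
                Some(open) => {
                    // Text before the marker is safe to speak.
                    speakable.extend(self.held.drain(..open));
                    match self.held.find("}}") {
                        Some(close) => {
                            // Complete marker: set it aside and keep scanning.
                            let marker: String = self.held.drain(..close + 2).collect();
                            self.markers.push(marker);
                        }
                        // Marker is still streaming in: hold it until it closes.
                        None => break,
                    }
                }
                None => {
                    // A lone trailing '{' may be the first half of "{{".
                    let keep = usize::from(self.held.ends_with('{'));
                    let cut = self.held.len() - keep;
                    speakable.extend(self.held.drain(..cut));
                    break;
                }
            }
        }
        speakable
    }

    /// End of stream: the collected markers plus any unreleased tail.
    pub fn finish(self) -> (Vec<String>, String) {
        (self.markers, self.held)
    }
}
```

The collected markers can then be parsed after the stream ends, by which point the audio for the spoken part is already playing.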

The Key Implementation Detail: Strip From Both Places

The important part: once extracted, the marker must be removed from both the TTS/display text and the DB history. Suggestions are ephemeral UI scaffolding, not part of the character's actual speech — leaving them in history would pollute context for future turns.

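// Assumes the `regex` and `once_cell` crates are available.
use once_cell::sync::Lazy;
use regex::Regex;

// Matches {{SUGGEST: a | b | c}}, case-insensitively and across line breaks.
static RE_SUGGEST: Lazy<Regex> =
    Lazy::new(|| Regex::new(r"(?is)\{\{\s*SUGGEST\s*:\s*([\s\S]*?)\}\}").unwrap());

// Extract the suggestions and remove the marker entirely from the body.
fn extract_suggest(text: &str) -> (String, Vec<String>) {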
    match RE_SUGGEST.captures(text) {
        Some(cap) => {
            let suggestions = cap[1]
                .split('|')
                .map(|s| s.trim().to_string())
                .filter(|s| !s.is_empty())
                .take(3)
                .collect();
            let clean = RE_SUGGEST.replace_all(text, "").trim().to_string();
            (clean, suggestions)
        }
        None => (text.to_string(), Vec::new()),
    }
}

This is where the existing "store annotated / display clean" separation pays off. In this chat:

  • ai_text returned to the client (display + TTS) is fully stripped of all markers.
  • The text saved to the DB has the {{SHOW}}/{{POSE}} markers re-attached, so the model keeps seeing its own canonical format in history and continues using it correctly.

{{SUGGEST}} is different from {{SHOW}}/{{POSE}}: it doesn't go back into the DB at all. It's ephemeral. Because persistence is decided per marker, suggestions slotted in cleanly without touching anything else.
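
A per-marker persistence policy could be sketched as follows. This version uses only the standard library, keeps persisted markers in place instead of re-attaching them, and treats anything other than SHOW/POSE as discardable. All three choices are assumptions made for illustration, as are the names `Persistence`, `policy`, and `split_markers`.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Persistence {
    Persist, // kept in the DB history
    Discard, // ephemeral: never stored
}

/// Assumed policy table; only SHOW/POSE persistence is stated in the article.
pub fn policy(name: &str) -> Persistence {
    match name {
        "SHOW" | "POSE" => Persistence::Persist,
        _ => Persistence::Discard, // includes SUGGEST
    }
}

pub struct Split {
    pub display: String, // client + TTS: no markers at all
    pub stored: String,  // DB history: persisted markers only
}

pub fn split_markers(raw: &str) -> Split {
    let mut display = String::new();
    let mut stored = String::new();
    let mut rest = raw;
    while let Some(open) = rest.find("{{") {
        // An unclosed marker ends the scan; the tail is passed through as-is.
        let Some(len) = rest[open..].find("}}") else { break };
        let close = open + len + 2;
        let marker = &rest[open..close];
        display.push_str(&rest[..open]);
        stored.push_str(&rest[..open]);
        // The marker name is the text between "{{" and the first ':'.
        let inner = &marker[2..marker.len() - 2];
        let name = inner.split(':').next().unwrap_or("").trim().to_uppercase();
        if policy(&name) == Persistence::Persist {
            stored.push_str(marker);
        }
        rest = &rest[close..];
    }
    display.push_str(rest);
    stored.push_str(rest);
    Split {
        display: display.trim().to_string(),
        stored: stored.trim().to_string(),
    }
}
```

With this shape, adding another ephemeral marker only means adding a name to the policy table. The display and storage paths stay unchanged.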

On the prompt side, it's just one extra block gated by a feature flag in the persona config:

At the very end of your response, add exactly three short replies the user
might say next, in this format:
{{SUGGEST: option1 | option2 | option3}}
- Always place it last (after any {{SHOW}}/{{POSE}} markers)
- Write each option in first person, casual, short
- Vary the direction: one enthusiastic, one deflecting, one asking a question back
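
The flag gating might look roughly like this. `PersonaConfig`, `suggest_replies`, and `build_system_prompt` are hypothetical names, and the instruction block is abbreviated from the one above.

```rust
/// Hypothetical persona config; the field names are assumptions.
pub struct PersonaConfig {
    pub base_prompt: String,
    pub suggest_replies: bool,
}

/// Abbreviated version of the instruction block from the article.
const SUGGEST_BLOCK: &str = "At the very end of your response, add exactly three short replies the user\n\
might say next, in this format:\n\
{{SUGGEST: option1 | option2 | option3}}\n\
- Always place it last (after any {{SHOW}}/{{POSE}} markers)";

/// Build the system prompt deterministically from the config.
pub fn build_system_prompt(cfg: &PersonaConfig) -> String {
    let mut prompt = cfg.base_prompt.trim_end().to_string();
    if cfg.suggest_replies {
        prompt.push_str("\n\n");
        prompt.push_str(SUGGEST_BLOCK);
    }
    prompt
}
```

Because the block lives in the persona prompt, it is part of the cached prefix. As long as the flag does not change mid-session, the prompt stays byte-identical from turn to turn.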

A Note on Implicit Prefix Cache Alignment

Implicit prefix caches hit when the token sequence at the start of a request matches a previously seen prefix. The marker approach simply generates suggestions as part of the current turn's response — the next turn's input prefix (system + history) is identical to what it would be in a plain conversation. The prefix keeps hitting the cache normally. The suggestions never touch the prefix at all. That's a quiet but important property.
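
The property can be checked with a small sketch. Real prefix caches match on tokens, not messages, so comparing whole messages is a simplification. `Message`, `build_request`, and `shared_prefix_len` are hypothetical names.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Message {
    pub role: &'static str,
    pub content: String,
}

/// Assemble system + history + new user message, as in the OpenAI-compatible flow.
pub fn build_request(system: &str, history: &[Message], user: &str) -> Vec<Message> {
    let mut messages = vec![Message { role: "system", content: system.to_string() }];
    messages.extend_from_slice(history);
    messages.push(Message { role: "user", content: user.to_string() });
    messages
}

/// Number of leading messages two requests have in common.
pub fn shared_prefix_len(a: &[Message], b: &[Message]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}
```

If turn 1 is sent with an empty history and turn 2 is sent with the stored turn-1 exchange as history, every message of the turn-1 request reappears unchanged at the start of the turn-2 request. The suggestions appear in neither request, because they were stripped before the assistant message was stored.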

Summary

  • When adding secondary structured data to a streaming + caching chat, consider inline markers + extraction before reaching for structured output.
  • Bundling everything into the same request makes cross-provider cache alignment a non-issue by construction.
  • If you already have a marker extraction pipeline, the marginal cost is nearly zero. Design it so you can choose per marker whether to persist or discard: that flexibility makes later ephemeral UI additions painless.

The costs: output tokens increase by a few dozen, and occasionally the model mangles the marker format (same risk level as {{SHOW}}/{{POSE}}). Both are acceptable.


This chat is part of kotonia, a voice roleplay product running multilingual TTS × lip-sync avatars on a local GPU.
