DEV Community

Paul Chen


Synthadoc: Streaming Queries, Local Web Chat, and a Self-Invalidating Cache

There's a moment every Synthadoc user hits eventually. You've got forty or fifty compiled pages, a nightly ingest schedule running, lint keeping everything healthy. And then you open a terminal, type synthadoc query "...", and wait. The BM25 retrieval is instant. But then the cursor blinks. The LLM is thinking. You wait four seconds, six seconds, eight seconds. The answer eventually appears, all at once, like a curtain dropping.

That wait is fine the first time. It gets annoying on the tenth query when you're in a research session and you already know the answer is coming - you just want to read it as it forms, not stare at a blinking cursor.

v0.7.0 improves that. It adds streaming query output across all three query surfaces, a local web chat UI that understands the health of your wiki, and a query cache that skips the LLM call entirely for repeated questions when nothing in your wiki has changed. The architecture behind each of these turned out to be more interesting than I expected when we started building them.


Diagram 1: What Changed in v0.7.0: The Architecture at a Glance

The diagram below maps the full Synthadoc architecture as it stands after v0.7.0. Items marked [NEW] are additions in this release; everything else was already present. The three features in this post - Web Chat UI, streaming query, and query cache - touch three separate layers: the access layer gains a new client, the engine gains new agents, and the core gains a cache component tied to a new wiki_epoch counter.

Three additions connect the three features in this post:

  • Query Web UI (Access Layer): the new synthadoc web browser client using HTTP + SSE
  • Query (stream) · Action · Hint Engine (Agents): streaming query pipeline, live command execution, deterministic hint generation
  • Query Cache + wiki_epoch (Core): shared cache with epoch-based automatic invalidation

Everything flows through one server process per wiki. The CLI, Obsidian plugin, and Web Chat UI are all thin clients talking to the same HTTP + SSE endpoint. There's no separate service for the web UI. The MCP Server (shown as optional with a dashed border) is a fourth access path for AI tools like Claude Desktop or Cursor. It exposes the same wiki operations over the Model Context Protocol and requires opt-in setup.


Three Ways to Query Your Wiki and When to Use Each

Before getting into the streaming and caching mechanics, it's worth laying out the three query surfaces Synthadoc now supports. All three can answer the same question; the difference is workflow fit.

CLI: when the query is part of a larger workflow

The CLI is where you go when a query isn't just a question but a step in something automated. The obvious case is CI/CD: a post-ingest job that queries the wiki to verify a newly compiled page before promoting it to active. Less obvious is using it as part of an agent integration, where an external orchestrator issues queries and parses the structured JSON output.

# Stream to terminal - tokens appear as the LLM generates them
synthadoc query "What were the main causes of the 2008 financial crisis?"

# Script mode - waits for full response, stdout is clean for piping
synthadoc query "Summarize page: moore's-law" --no-stream | jq .

# Force LLM call even if cache has a result - useful when wiki just changed
synthadoc query "What changed in the latest ingest?" --no-cache

The --no-stream flag is specifically for automation. Streaming output is beautiful on a terminal and disruptive in a pipeline. A script that parses stdout doesn't want token-by-token delivery; it wants a complete JSON blob when the query is done. --no-stream gives it that.
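For scripted use, a thin wrapper around the CLI might look like the sketch below. It uses only the flags mentioned in this post (--no-stream, --no-cache, --timeout) and assumes they can be combined. The helper names are hypothetical, and because the JSON schema isn't spelled out here, the result is returned as parsed JSON without assuming any fields.

```python
import json
import subprocess
from typing import Any, List, Optional


def build_query_command(question: str, no_cache: bool = False,
                        timeout: Optional[int] = None) -> List[str]:
    """Build the argv for a non-streaming query (hypothetical helper)."""
    cmd = ["synthadoc", "query", question, "--no-stream"]
    if no_cache:
        cmd.append("--no-cache")
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd


def run_query(question: str, **kwargs: Any) -> Any:
    """Run the CLI and parse stdout as JSON; raises if the CLI exits non-zero."""
    proc = subprocess.run(
        build_query_command(question, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return json.loads(proc.stdout)
```

Passing the command as a list (rather than a shell string) sidesteps quoting problems with questions that contain apostrophes or quotes.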

Obsidian Plugin: when you're in a research session

The Obsidian plugin exists for a different moment: you're writing a note, you need to check a claim against your wiki, and you don't want to leave Obsidian. The query modal (Ctrl/Cmd+P → Synthadoc: Query: ask the wiki...) is the right tool here. It renders [[wikilinks]] as clickable links, which means an answer that references related pages becomes navigable instantly.

The streaming behaviour in the Obsidian plugin mirrors the CLI: tokens appear as they arrive, and citations follow at the end. The bypass cache checkbox is visible in the modal, unchecked by default. For researchers doing active ingest sessions, checking it once gets you fresh output without reaching for the terminal.

Web Chat UI: when you want a session, not a one-shot query

synthadoc web is the new entry. It opens a local chat interface in your browser. Nothing leaves your machine: no cloud service, no authentication. It's designed for the kind of session that's too exploratory for the CLI and too long for the Obsidian modal.

Each turn is an independent query - the same cache applies here as in the CLI and Obsidian plugin. The chat history is displayed in the browser, but prior messages are not yet injected into the LLM prompt; multi-turn context injection is planned for a future release.

What the web UI adds over the other surfaces: operational commands. You can type "run lint", "show wiki status", "what pages are orphan pages?" or "schedule ingest every night at 9 PM" directly in the chat, and the Action Agent parses those and executes them live against your wiki, with results shown inline.

The screenshot above shows a live session against the history-of-computing demo wiki. The response to "What changed in the wiki this week?" includes a date-indexed ingest table, current lifecycle counts (Active: 80, Draft/Stale/Contradicted/Archived: all zero), and three action chips - "Activate a draft page", "Archive a stale page", "Restore an archived page to draft" - rendered inline as clickable buttons. The left panel shows prior session queries, allowing you to jump back into an earlier thread.


Streaming: The Architecture Behind a Two-Phase Response

Every Synthadoc query goes through two phases. Phase 1 is retrieval: BM25 search, routing, sub-question decomposition if needed. This is synchronous and fast, typically 100–200ms. Phase 2 is synthesis: the LLM generates an answer from the retrieved pages. This is where the latency lives.

The decision to stream only Phase 2 was deliberate. Phase 1 finishes before the first LLM token could possibly arrive, so there's no partial retrieval state worth exposing. That keeps the SSE protocol clean: status events first, then the answer tokens, then a final done event.

The status events let the UI give immediate feedback. The user knows within 150ms whether the wiki found relevant pages or not before any LLM latency has accumulated. "sources: 3" in the synthesizing event tells them the answer is backed by three pages before they've read a single word of it.

The gap event fires only when the wiki doesn't have enough to answer confidently. Instead of a vague "I don't know," it returns suggested_searches - concrete ingest strings the user can use to fill the gap. These are generated by a secondary LLM call that decomposes the original question into targeted search queries - the same decomposition that drives sub-question retrieval, reused here to produce actionable ingest suggestions.
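To make the event flow concrete, here is a minimal client-side sketch that parses SSE text and folds the events into one result. The wire format isn't reproduced in this post, so treat the details as assumptions: status, gap, and done are the event types named above, while the token event name and every payload field other than sources, suggested_searches, and next_hints are hypothetical.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs from SSE text lines; a blank line ends an event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):  # comment / keep-alive line
            continue
        field, _, value = line.partition(":")
        value = value[1:] if value.startswith(" ") else value
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


def consume(lines: Iterable[str]) -> dict:
    """Fold a query stream into one result dict (payload fields are assumed)."""
    result = {"answer": "", "sources": None,
              "suggested_searches": [], "next_hints": []}
    for event, data in parse_sse(lines):
        payload = json.loads(data)
        if event == "status":
            result["sources"] = payload.get("sources", result["sources"])
        elif event == "token":  # hypothetical event name for Phase 2 text
            result["answer"] += payload["text"]
        elif event == "gap":
            result["suggested_searches"] = payload.get("suggested_searches", [])
        elif event == "done":
            result["next_hints"] = payload.get("next_hints", [])
    return result


# Illustrative stream only; not captured from a real server.
SAMPLE = """\
event: status
data: {"phase": "synthesizing", "sources": 3}

event: token
data: {"text": "Moore's "}

event: token
data: {"text": "Law"}

event: done
data: {"next_hints": ["run lint"]}

"""
```

Calling consume(SAMPLE.splitlines()) accumulates the two token payloads into one answer string and picks up the source count before any answer text arrives, which is the ordering the status events are meant to guarantee.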

Provider Streaming Behavior

Not all providers stream in the same sense. API-based providers - OpenAI, Anthropic, Gemini, Ollama - emit tokens as they are generated, so the CLI and web UI render them character-by-character in real time. The latency shown in the SSE sequence above (one token every ~20ms) is what these providers deliver.

CLI subprocess providers - Claude Code (claude-code) and Opencode (opencode) - work differently. They run as child processes and write their output only when the process exits, so there is no per-token stream to intercept. Synthadoc runs the subprocess to completion, then emits the result word-by-word through the same SSE pipe. The words arrive in a rapid burst rather than a gradual flow - the total wait is the same, but the perceived streaming effect is a short pause followed by the full answer appearing almost at once.
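The replay step for subprocess providers can be pictured with a small generator. This is only a sketch of the idea, assuming the completed output is split on whitespace; how Synthadoc actually chunks the text isn't shown in this post, and replay_words is a hypothetical name.

```python
import re
from typing import Iterator


def replay_words(text: str) -> Iterator[str]:
    """Yield text in word-sized chunks, keeping the whitespace around each word."""
    for match in re.finditer(r"\s*\S+\s*", text):
        yield match.group(0)
```

Because each chunk keeps its surrounding whitespace, joining the chunks reproduces the original output exactly, so the client can append them the same way it appends live tokens.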

If you are using a CLI subprocess provider and queries are timing out, raise the timeout with --timeout:

synthadoc query "..." --timeout 180

The default is 60 seconds, which is sufficient for API providers but may be short for subprocess providers on complex queries.

Session Management

Sessions live server-side in audit.db, in two tables: chat_sessions and chat_messages. The React UI stores only the session_id in memory - it's React state, not localStorage. This means sessions don't survive page reload, and every new browser tab starts fresh. This is a deliberate design choice: a session is tied to one exploratory thread, not your entire browsing history.

Chat messages are stored to audit.db after each turn, but prior messages are not yet injected into the LLM prompt; each query is answered independently. The session record is used for mode persistence and hint rotation, not for conversational context. Multi-turn prompt injection is planned for a future release.
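As a rough picture of that storage model, the sketch below creates the two tables in an in-memory SQLite database and records one turn. Only the table names chat_sessions and chat_messages come from this post; every column name and helper function here is hypothetical.

```python
import sqlite3
import uuid
from typing import Optional

# Hypothetical columns; only the two table names are taken from the post.
SCHEMA = """
CREATE TABLE chat_sessions (
    session_id TEXT PRIMARY KEY,
    mode       TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE chat_messages (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL REFERENCES chat_sessions(session_id),
    role       TEXT NOT NULL,
    content    TEXT NOT NULL
);
"""


def open_audit_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def create_session(conn: sqlite3.Connection, mode: str) -> str:
    """Store the mode once; the client only needs to remember the returned id."""
    session_id = uuid.uuid4().hex
    conn.execute("INSERT INTO chat_sessions (session_id, mode) VALUES (?, ?)",
                 (session_id, mode))
    return session_id


def record_turn(conn: sqlite3.Connection, session_id: str,
                question: str, answer: str) -> None:
    """Persist one question/answer pair; nothing is fed back into the prompt."""
    conn.executemany(
        "INSERT INTO chat_messages (session_id, role, content) VALUES (?, ?, ?)",
        [(session_id, "user", question), (session_id, "assistant", answer)],
    )


def session_mode(conn: sqlite3.Connection, session_id: str) -> Optional[str]:
    row = conn.execute("SELECT mode FROM chat_sessions WHERE session_id = ?",
                       (session_id,)).fetchone()
    return row[0] if row else None
```

The point of the split is that the mode lives with the session row, so it can shape hints on every turn without the server ever re-reading the message history.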

Diagram 2: Web Query Flow: Client to Server, Session to Stream

The diagram below traces a complete web UI query round-trip, from the user typing a question to the hint chips updating after the response. The left column is the browser; the right column is the server.

A few things are worth highlighting in this flow. The session_id lives only in React state - close the tab and it's gone. The mode determined at POST /sessions (step 1) persists for the lifetime of that tab and shapes hint generation at every done event (steps 3 and 4). The HintEngine never calls the LLM - it reads the answer content and the session mode and applies deterministic rules to generate the three chips.

Adaptive Hints: No LLM Required

The hint chips - three clickable suggestions rendered below the chat input - update after every response. They're generated by a deterministic HintEngine, not an LLM. No API call, no extra cost.

The engine first classifies the wiki's health state when the session is created:

Mode | Condition | Initial hints
NEW_WIKI | Fewer than 5 pages | Guide user toward first ingest
EXPLORER | First session, healthy wiki | Offer tour queries
HEALTH_CHECK | Stale or contradicted pages exist | Surface lint and lifecycle actions
POWER_USER | Returning user, healthy wiki | Context-sensitive topic suggestions
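The classification in the table can be sketched as a pure function. The four modes and their conditions come from the table; the input fields, the WikiSnapshot name, and the order in which the rules are checked are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    NEW_WIKI = "NEW_WIKI"
    EXPLORER = "EXPLORER"
    HEALTH_CHECK = "HEALTH_CHECK"
    POWER_USER = "POWER_USER"


@dataclass
class WikiSnapshot:
    # Hypothetical inputs; the real engine's fields are not shown in the post.
    page_count: int
    stale_pages: int
    contradicted_pages: int
    first_session: bool


def classify(snapshot: WikiSnapshot) -> Mode:
    """Apply the table's conditions in an assumed priority order."""
    if snapshot.page_count < 5:
        return Mode.NEW_WIKI
    if snapshot.stale_pages or snapshot.contradicted_pages:
        return Mode.HEALTH_CHECK
    return Mode.EXPLORER if snapshot.first_session else Mode.POWER_USER
```

Checking the unhealthy case before the first-session case is a guess, but it matches the table's wording that EXPLORER and POWER_USER both require a healthy wiki.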

After each assistant response, the done SSE event carries a next_hints array: three suggestions computed from the answer content and session mode. If the answer mentioned a specific page, the hints might suggest a follow-up on a related page. If the answer triggered a knowledge gap, the hints offer the suggested_searches as clickable options.

The design principle here is that hints should reflect where you are in the conversation, not where you were when you opened the browser. A user on a HEALTH_CHECK session who just asked about contradicted pages shouldn't see generic "try querying about X" chips; they should see "run lint", "list orphan pages", "archive contradicted page". The mode carries through the session, shaping every hint update.
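A deterministic rule set along those lines could look like the following sketch. The three HEALTH_CHECK strings are the examples quoted above; the NEW_WIKI string, the follow-up phrasing, and the priority order (gap suggestions, then mode actions, then page follow-ups) are assumptions, not the HintEngine's actual rules.

```python
import re
from typing import List, Sequence

# HEALTH_CHECK entries are quoted from the post; the NEW_WIKI entry is a placeholder.
MODE_ACTIONS = {
    "HEALTH_CHECK": ["run lint", "list orphan pages", "archive contradicted page"],
    "NEW_WIKI": ["ingest your first source"],
}


def next_hints(mode: str, answer: str,
               suggested_searches: Sequence[str] = (), limit: int = 3) -> List[str]:
    """Build up to `limit` hints from the answer text and session mode, no LLM call."""
    candidates = list(suggested_searches)        # knowledge gap: offer the searches
    candidates += MODE_ACTIONS.get(mode, [])     # mode-specific operational actions
    # Follow-ups for each [[wikilink]] in the answer (alias after "|" is dropped).
    candidates += [f"Tell me more about [[{page}]]"
                   for page in re.findall(r"\[\[([^\]|]+)", answer)]
    hints: List[str] = []
    for candidate in candidates:                 # de-duplicate, preserving order
        if candidate not in hints:
            hints.append(candidate)
    return hints[:limit]
```

Since the function depends only on its arguments, the same answer in the same mode always yields the same chips, which is what makes the hints free to compute on every done event.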


Query Caching: When Not to Call the LLM

The cache design started from a specific observation: most queries against a domain wiki are repeated. A team maintaining a knowledge base about their software architecture will ask the same questions dozens of times - during onboarding, during incident reviews, during planning sessions. Every one of those calls hits the LLM and incurs both latency and cost.

The cache eliminates that. But the tricky part is knowing when to invalidate it.

Cache Key Design

key = SHA-256(normalized_question + "|" + wiki_epoch + "|" + provider_model)

Three components:

Normalized question - lowercased, whitespace-collapsed. "What is Moore's Law?" and "what is moore's law?" hit the same cache entry.

Wiki epoch - an integer counter on the server instance. It starts at 0 on startup and increments on every ingest job completion and every lifecycle state transition. When the epoch changes, the cache key for every question changes. Prior entries don't get deleted immediately; they just become unreachable. Old entries are cleaned up in a background sweep (entries more than 5 epochs behind current, or older than 7 days).

Provider/model - "openai/gpt-4o-mini" or "anthropic/claude-sonnet-4-6". Switching models invalidates the cache. A cached answer from a smaller model shouldn't surface when you've upgraded to a better one.

The epoch approach is what makes invalidation automatic. You don't call "invalidate cache" after an ingest; the epoch bump does it implicitly. Any query after a wiki change computes a new key that has never been seen, misses the cache, and calls the LLM fresh. The previous answer doesn't need to be deleted; it simply ceases to be looked up.
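Here is a minimal sketch of the key derivation and the epoch behaviour, using an in-memory dict in place of the real SQLite-backed cache. The key formula, the normalization rules, and the sweep thresholds follow the description above; the class and method names are hypothetical.

```python
import hashlib
import time
from typing import Dict, Optional, Tuple


def normalize(question: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(question.lower().split())


def cache_key(question: str, wiki_epoch: int, provider_model: str) -> str:
    raw = f"{normalize(question)}|{wiki_epoch}|{provider_model}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class EpochCache:
    """In-memory stand-in for the real cache (illustrative only)."""

    def __init__(self) -> None:
        self.epoch = 0
        # key -> (epoch when stored, timestamp when stored, cached result)
        self._entries: Dict[str, Tuple[int, float, str]] = {}

    def get(self, question: str, model: str) -> Optional[str]:
        entry = self._entries.get(cache_key(question, self.epoch, model))
        return entry[2] if entry else None

    def put(self, question: str, model: str, result: str,
            now: Optional[float] = None) -> None:
        stored_at = time.time() if now is None else now
        key = cache_key(question, self.epoch, model)
        self._entries[key] = (self.epoch, stored_at, result)

    def bump_epoch(self) -> None:
        """Called after an ingest or lifecycle transition; old keys become unreachable."""
        self.epoch += 1

    def sweep(self, now: Optional[float] = None, max_lag: int = 5,
              max_age_s: float = 7 * 24 * 3600) -> int:
        """Delete entries more than max_lag epochs behind or older than max_age_s."""
        now = time.time() if now is None else now
        dead = [key for key, (epoch, stored_at, _) in self._entries.items()
                if self.epoch - epoch > max_lag or now - stored_at > max_age_s]
        for key in dead:
            del self._entries[key]
        return len(dead)
```

Note that get never inspects the stored epoch: a stale entry is skipped simply because the current epoch produces a different key, and sweep exists only to reclaim space.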

Diagram 3: Cache Lookup, Hit, Miss, and Epoch Invalidation

Measured Results

Rather than leaving the latency claims as estimates, we wrote a full performance test suite against the cache layer. All numbers below come from running pytest tests/performance/test_query_cache_perf.py locally on a Windows development machine with an SSD. Linux bare-metal numbers are consistently 30–40% better.

Chart 1 - Cache read latency distribution (500 reads, 200 cached entries)

Cache read latency distribution - P50=0.26ms P95=0.34ms P99=0.41ms

P50 = 0.26ms, P95 = 0.34ms, P99 = 0.41ms against a 10ms SLO. The distribution is extremely tight - the persistent connection eliminates the per-call connection-open overhead that was the main source of outliers. Every percentile sits well inside the budget with headroom to spare.
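For readers who want to reproduce the shape of this measurement, the sketch below times point lookups over one persistent connection and reports the same three percentiles. It is not the project's test suite: it uses the standard-library sqlite3 module against an in-memory database, and the query_cache table name is hypothetical, so absolute numbers will differ from the ones above.

```python
import sqlite3
import statistics
import time


def measure_read_latency(entries: int = 200, reads: int = 500) -> dict:
    """Time point lookups on one persistent SQLite connection; results in ms."""
    conn = sqlite3.connect(":memory:")  # in-memory here; the post measured an SSD
    conn.execute(
        "CREATE TABLE query_cache (key TEXT PRIMARY KEY, result_json TEXT)"
    )
    conn.executemany(
        "INSERT INTO query_cache VALUES (?, ?)",
        [(f"key-{i}", '{"answer": "cached"}') for i in range(entries)],
    )
    samples = []
    for i in range(reads):
        start = time.perf_counter()
        conn.execute(
            "SELECT result_json FROM query_cache WHERE key = ?",
            (f"key-{i % entries}",),
        ).fetchone()
        samples.append((time.perf_counter() - start) * 1000.0)
    conn.close()
    cuts = statistics.quantiles(samples, n=100)  # 99 cut points; cuts[k-1] is Pk
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

The connection is opened once outside the timed loop, which is the property the paragraph above credits for the tight distribution.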

Chart 2 - Cache hit vs miss latency at varying LLM speeds

Cache hit vs miss latency and speedup factor at 50ms / 200ms / 500ms / 2000ms simulated LLM

The left panel uses a log scale because the gap is so large it can't be shown linearly. Cache hit P50 stays flat at ~0.25ms regardless of LLM speed - one shared persistent connection makes the hit path a pure queue-and-execute SQLite read with no file-open cost. The miss path scales directly with LLM latency. The right panel shows the resulting speedup factor:

Simulated LLM latency | Cache miss P50 | Cache hit P50 | Speedup
50ms (fast provider) | 95ms | 0.29ms | ~330×
200ms (mid provider) | 235ms | 0.32ms | ~730×
500ms (slow provider) | 544ms | 0.24ms | ~2270×
2000ms (reasoning model) | 2055ms | 0.26ms | ~7900×

The cache hit time is so small relative to any real LLM that the ratio is dominated entirely by provider latency. With reasoning models (o3-mini, MiniMax M2), a single saved round-trip reclaims 15–30 seconds of wall time.
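The speedup column is simply miss P50 divided by hit P50. A few lines of Python make the arithmetic explicit, using the four rows from the table:

```python
# (simulated LLM latency, cache miss P50 in ms, cache hit P50 in ms)
ROWS = [
    ("50ms", 95.0, 0.29),
    ("200ms", 235.0, 0.32),
    ("500ms", 544.0, 0.24),
    ("2000ms", 2055.0, 0.26),
]


def speedup(miss_ms: float, hit_ms: float) -> float:
    """Ratio of cache-miss latency to cache-hit latency."""
    return miss_ms / hit_ms


for label, miss, hit in ROWS:
    print(f"{label:>7}: {speedup(miss, hit):,.0f}x")
```

The unrounded ratios come out at roughly 328, 734, 2267, and 7904, which the table rounds to two or three significant figures. Because the hit time barely moves between rows, the ratio tracks the miss latency almost linearly.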

Chart 3 - Concurrent readers: persistent connection scaling curve

Concurrent cache readers P95 latency vs concurrency - smooth monotonic scaling

Single reader: 0.5ms P95. Ten readers: 2.0ms. Twenty-five: 3.8ms. Fifty: 7.8ms. One hundred: 14.9ms. The curve is smooth and monotonically increasing - no Windows spikes, no non-monotonic jitter. All concurrent reads queue through one shared aiosqlite background thread; the connection-open overhead that caused the old instability is simply not there. For a local single-user tool the realistic ceiling is n=5–10 concurrent reads, where P95 is under 2ms. Even at n=100 the tail is well inside a 50ms budget.

Chart 4 - Cache vs no-cache throughput (queries/second)

Cache vs no-cache throughput and advantage ratio at concurrency 1 to 100

The throughput advantage starts at 75.7× at n=1 and compresses to 4.1× at n=100. The compression is expected: asyncio.gather() parallelises the simulated LLM calls so the no-cache path scales nearly linearly with concurrency. The cache path, sharing one connection, serializes through the aiosqlite queue and grows sublinearly. But critically, the cache always wins by a wide margin - 4.1× at n=100 is far better than the 1.3× seen before the persistent connection fix. At realistic single-user concurrency (n=1–5), the advantage is 33–76×.
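The compression effect can be reproduced in miniature. In this sketch, asyncio.sleep stands in for both the LLM call and the cache read, and an asyncio.Lock stands in for the single shared connection queue; the latencies and function names are illustrative assumptions, not the benchmark's actual parameters.

```python
import asyncio
import time


async def fake_llm(latency_s: float) -> str:
    await asyncio.sleep(latency_s)  # stands in for a network-bound LLM call
    return "answer"


async def no_cache_qps(n: int, latency_s: float = 0.05) -> float:
    """Queries per second when n simulated LLM calls run concurrently."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_llm(latency_s) for _ in range(n)))
    return n / (time.perf_counter() - start)


async def serialized_qps(n: int, read_s: float = 0.0005) -> float:
    """Queries per second when n reads take turns on one shared resource."""
    lock = asyncio.Lock()

    async def read() -> None:
        async with lock:  # only one read at a time, like a single worker queue
            await asyncio.sleep(read_s)

    start = time.perf_counter()
    await asyncio.gather(*(read() for _ in range(n)))
    return n / (time.perf_counter() - start)
```

Running asyncio.run(no_cache_qps(20)) against asyncio.run(no_cache_qps(1)) shows the concurrent path's throughput growing with n, while the serialized path stays roughly flat. The cached path still wins in absolute terms because each read is so much shorter, but the ratio between the two narrows as concurrency rises.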

Estimated Latency Gains

A typical Synthadoc query against a mid-size wiki has two latency components:

  • Phase 1 (BM25 retrieval): 100–200ms. This runs regardless of cache.
  • Phase 2 (LLM synthesis): 2–10 seconds depending on provider, model, and answer length.

A cache hit skips Phase 2 entirely. The server reads the cached result_json from SQLite (~0.26ms P50 on SSD via a persistent connection), then emits a synthetic SSE burst at full network speed. The client receives what looks like a live streamed response, but the entire burst completes in under 100ms instead of waiting 2–10 seconds for the LLM. With a reasoning model provider, that gap widens to 15–30 seconds per query, and the cache makes those queries feel instant.
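On the server side, a synthetic burst amounts to serializing the cached result back into SSE frames. The sketch below shows one way that could work. The frame layout follows the standard SSE text format, but the token event name, the payload fields, and the shape of the cached JSON (answer, sources, next_hints) are assumptions for illustration.

```python
import json
from typing import Iterator


def sse_frame(event: str, payload: dict) -> str:
    """Serialize one SSE event: an event line, a data line, then a blank line."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


def replay_cached(result_json: str) -> Iterator[str]:
    """Turn a cached result into the frame sequence a live query would emit."""
    result = json.loads(result_json)
    yield sse_frame("status", {"phase": "synthesizing",
                               "sources": len(result.get("sources", []))})
    words = result["answer"].split()
    for i, word in enumerate(words):
        # Re-add the separating space except after the final word.
        text = word if i == len(words) - 1 else word + " "
        yield sse_frame("token", {"text": text})
    yield sse_frame("done", {"next_hints": result.get("next_hints", [])})
```

Because the client sees the same event types in the same order, it needs no special code path for cache hits; the only observable difference is how quickly the frames arrive.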

The Cache Is Shared Across All Three Surfaces

CLI, Obsidian plugin, and Web Chat UI all share the same cache.db. If you ran synthadoc query "..." from the CLI this morning and the wiki hasn't changed, opening the Obsidian modal and asking the same question will hit the cache. The key is identical - same normalized question, same epoch, same model.

# Drop the entire cache - both LLM response cache and query cache
synthadoc cache clear
Cache cleared: 47 entries removed.

What Makes This Architecturally Different

Most streaming chat interfaces work the same way: user sends a message, server calls the LLM, tokens stream back. There's no retrieval, no structured knowledge, no notion of whether the answer is backed by reviewed sources.

Synthadoc's streaming pipeline is a two-phase system where the first phase is a structured knowledge retrieval against a compiled, lifecycle-tracked wiki. The tokens that stream in are not hallucinated filler - they're synthesized from pages that passed lint, have known provenance, and carry a lifecycle state that tells you when they were last reviewed. The sources: N in the status event isn't decorative. It tells you before the first word of the answer how much of your wiki was relevant.

The session mode detection adds something I haven't seen elsewhere: the server classifies your wiki's health state when you open a session and uses that classification to shape every hint update for the rest of the session. A HEALTH_CHECK session doesn't give you generic "explore your wiki" prompts; it gives you "these pages need attention." The hints aren't cosmetic. They're a live triage system for wiki health.

The caching architecture also differs from the typical approach of setting an explicit TTL (cache for 24 hours, or cache for one week). TTL-based caches are almost always wrong at the edges: they're either too short (you evict answers that are still valid) or too long (you serve stale content after a wiki update). Epoch-based invalidation is event-driven: the cache stays valid exactly until something in the wiki changes.


Quick Demo

All three query surfaces are covered in the quick-start guide, which runs against the history-of-computing demo wiki. The full thing runs locally in about ten minutes:

git clone https://github.com/axoviq-ai/synthadoc.git
cd synthadoc   # the editable install below must run from the repository root
pip install -e ".[dev]"
synthadoc install history-of-computing --target ~/wikis --demo
synthadoc plugin install history-of-computing
synthadoc web   # opens browser

If you find Synthadoc useful, a ⭐ on GitHub helps the project reach more people: https://github.com/axoviq-ai/synthadoc.
