Paul Chen

Posted on Jun 8 • Edited on Jun 11

Synthadoc: Streaming Queries and a Local Web Chat UI

#ai #llm #architecture #cli

There's a moment every Synthadoc user hits eventually. You've got forty or fifty compiled pages, a nightly ingest schedule running, lint keeping everything healthy. And then you open a terminal, type synthadoc query "...", and wait. The BM25 retrieval is instant. But then the cursor blinks. The LLM is thinking. You wait four seconds, six seconds, eight seconds. The answer eventually appears, all at once, like a curtain dropping.

That wait is fine the first time. It gets annoying on the tenth query when you're in a research session and you already know the answer is coming - you just want to read it as it forms, not stare at a blinking cursor.

v0.7.0 improves that with two features: streaming query output across all three query surfaces, and a local web chat UI that understands the health of your wiki. The architecture behind each of these turned out to be more interesting than I expected when we started building them. (A companion post covers the third feature in this release: a self-invalidating query cache.)

Diagram 1: What Changed in v0.7.0: The Architecture at a Glance

The diagram below maps the full Synthadoc architecture as it stands after v0.7.0. Items marked [NEW] are additions in this release; everything else was already present. The two features in this post touch two separate layers: the access layer gains a new client, and the engine gains new agents.

Two additions are the focus of this post:

Query Web UI (Access Layer): the new synthadoc web browser client using HTTP + SSE
Query (stream) · Action · Hint Engine (Agents): streaming query pipeline, live command execution, deterministic hint generation

Everything flows through the same server process per wiki. The CLI, Obsidian plugin, and Web Chat UI are all thin clients talking to the same HTTP + SSE endpoint. There's no separate service for the web UI. The MCP Server (shown as optional with a dashed border) is a fourth access path for AI tools like Claude Desktop or Cursor - it exposes the same wiki operations over the Model Context Protocol and requires opt-in setup.

Three Ways to Query Your Wiki and When to Use Each

Before getting into the streaming mechanics, it's worth laying out the three query surfaces Synthadoc now supports. All three can answer the same question; the difference is workflow fit.

CLI: when the query is part of a larger workflow

The CLI is where you go when a query isn't just a question but a step in something automated. The obvious case is CI/CD: a post-ingest job that queries the wiki to verify a newly compiled page before promoting it to active. Less obvious is using it as part of an agent integration, where an external orchestrator issues queries and parses the structured JSON output.

# Stream to terminal - tokens appear as the LLM generates them
synthadoc query "What were the main causes of the 2008 financial crisis?"

# Script mode - waits for full response, stdout is clean for piping
synthadoc query "Summarize page: moore's-law" --no-stream | jq .

# Force LLM call even if cache has a result - useful when wiki just changed
synthadoc query "What changed in the latest ingest?" --no-cache

The --no-stream flag is specifically for automation. Streaming output is beautiful on a terminal and disruptive in a pipeline. A script that parses stdout doesn't want token-by-token delivery; it wants a complete JSON blob when the query is done. --no-stream gives it that.
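For the automation case, here is a minimal sketch of how a script might wrap the CLI. It uses only the flags shown above (--no-stream, --no-cache). The function names and the injectable runner are my own illustration, and because this post doesn't spell out the JSON shape, the parsed value is returned untouched instead of being read field by field.

```python
import json
import subprocess
from typing import Any, Callable, List


def build_query_command(question: str, no_cache: bool = False) -> List[str]:
    """Build the argv for a non-streaming query (hypothetical helper)."""
    cmd = ["synthadoc", "query", question, "--no-stream"]
    if no_cache:
        # Skip the cache, e.g. right after an ingest changed the wiki.
        cmd.append("--no-cache")
    return cmd


def run_query(question: str, no_cache: bool = False,
              runner: Callable[..., Any] = subprocess.run) -> Any:
    """Run the query and parse stdout as one complete JSON document.

    `runner` defaults to subprocess.run; it is injectable so the wrapper
    can be exercised without a wiki or an LLM behind it.
    """
    result = runner(build_query_command(question, no_cache),
                    capture_output=True, text=True, check=True)
    return json.loads(result.stdout)
```

With check=True, a non-zero exit raises CalledProcessError instead of handing half-parsed output to the next pipeline step, which is usually what a CI job wants.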

Obsidian Plugin: when you're in a research session

The Obsidian plugin exists for a different moment: you're writing a note, you need to check a claim against your wiki, and you don't want to leave Obsidian. The query modal (Ctrl/Cmd+P → Synthadoc: Query: ask the wiki...) is the right tool here. It renders [[wikilinks]] as clickable links, which means an answer that references related pages becomes navigable instantly.

The streaming behaviour in the Obsidian plugin mirrors the CLI: tokens appear as they arrive, and citations follow at the end. The bypass cache checkbox is visible in the modal, unchecked by default. For researchers doing active ingest sessions, checking it once gets you fresh output without reaching for the terminal.

Web Chat UI: when you want a session, not a one-shot query

synthadoc web is the new entry. It opens a local chat interface in your browser. Nothing leaves your machine: no cloud service, no authentication. It's designed for the kind of session that's too exploratory for the CLI and too long for the Obsidian modal.

Each turn is an independent query - the same cache applies here as in the CLI and Obsidian plugin. The chat history is displayed in the browser, but prior messages are not yet injected into the LLM prompt; multi-turn context injection is planned for a future release.

What the web UI adds over the other surfaces: operational commands. You can type "run lint", "show wiki status", "what pages are orphan pages?" or "schedule ingest every night at 9 PM" directly in the chat, and the Action Agent parses those and executes them live against your wiki, with results shown inline.

The screenshot above shows a live session against the history-of-computing demo wiki. The response to "What changed in the wiki this week?" includes a date-indexed ingest table, current lifecycle counts (Active: 80, Draft/Stale/Contradicted/Archived: all zero), and three action chips - "Activate a draft page", "Archive a stale page", "Restore an archived page to draft" - rendered inline as clickable buttons. The left panel shows prior session queries, allowing you to jump back into an earlier thread.

Streaming: The Architecture Behind a Two-Phase Response

Every Synthadoc query goes through two phases. Phase 1 is retrieval: BM25 search, routing, sub-question decomposition if needed. This is synchronous and fast, typically 100–200ms. Phase 2 is synthesis: the LLM generates an answer from the retrieved pages. This is where the latency lives.

The decision to stream only Phase 2 was deliberate. Phase 1 finishes before the first LLM token could possibly arrive, so there's no partial retrieval state worth exposing. That keeps the SSE protocol clean.
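As a rough server-side sketch of that ordering: retrieval runs to completion, a status frame reports how many sources were found, and only then do tokens flow. The retrieve and synthesize callables, the "token" event name, and the payload field names other than sources are assumptions for illustration, not Synthadoc's actual internals.

```python
import json
from typing import Callable, Iterable, Iterator, List


def sse(event: str, data: dict) -> str:
    """Format one Server-Sent Events frame (event line, data line, blank line)."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_query(question: str,
                 retrieve: Callable[[str], List[str]],
                 synthesize: Callable[[str, List[str]], Iterable[str]]
                 ) -> Iterator[str]:
    # Phase 1: synchronous retrieval. Nothing is streamed until it finishes.
    pages = retrieve(question)
    yield sse("status", {"stage": "synthesizing", "sources": len(pages)})

    # Phase 2: forward each token as soon as the LLM yields it.
    for token in synthesize(question, pages):
        yield sse("token", {"text": token})

    # The real done event carries more (for example next_hints); omitted here.
    yield sse("done", {})
```

Because stream_query is a generator, the status frame reaches the client before the first call into the LLM has produced anything.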

The status events let the UI give immediate feedback. The user knows within 150ms whether the wiki found relevant pages or not before any LLM latency has accumulated. "sources: 3" in the synthesizing event tells them the answer is backed by three pages before they've read a single word of it.

The gap event fires only when the wiki doesn't have enough to answer confidently. Instead of a vague "I don't know," it returns suggested_searches - concrete ingest strings the user can use to fill the gap. These are generated by a secondary LLM call that decomposes the original question into targeted search queries - the same decomposition that drives sub-question retrieval, reused here to produce actionable ingest suggestions.
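To make the event handling concrete, here is a small client-side sketch that parses an SSE text stream and folds it into a view state. The field names sources, suggested_searches and next_hints come from this post. The "token" event name, its text field, and the idea that status is a single event type are assumptions on my part.

```python
import json
from typing import Dict, Iterator, List, Tuple


def parse_sse(stream: str) -> Iterator[Tuple[str, dict]]:
    """Yield (event, data) pairs; a blank line terminates each frame."""
    event, data_lines = "message", []
    for line in stream.splitlines():
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data_lines.append(line[len("data:"):].strip())
        elif line == "":
            if data_lines:
                yield event, json.loads("\n".join(data_lines))
            event, data_lines = "message", []


def render(stream: str) -> Dict[str, object]:
    """Fold a finished stream into what the UI would end up displaying."""
    tokens: List[str] = []
    view: Dict[str, object] = {"sources": None, "answer": "",
                               "suggested_searches": [], "next_hints": []}
    for event, data in parse_sse(stream):
        if event == "status":
            # Known before any answer text has arrived.
            view["sources"] = data.get("sources", view["sources"])
        elif event == "token":
            tokens.append(data["text"])
        elif event == "gap":
            view["suggested_searches"] = data.get("suggested_searches", [])
        elif event == "done":
            view["next_hints"] = data.get("next_hints", [])
    view["answer"] = "".join(tokens)
    return view
```

A real client would update the page on every frame instead of waiting for the end; the fold just makes the per-event responsibilities easy to see.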

Provider Streaming Behavior

Not all providers stream in the same sense. API-based providers - OpenAI, Anthropic, Gemini, Ollama - emit tokens as they are generated, so the CLI and web UI render them token by token in real time. The latency shown in the SSE sequence above (one token every ~20ms) is what these providers deliver.

CLI subprocess providers - Claude Code (claude-code) and Opencode (opencode) - work differently. They run as child processes and write their output only when the process exits, so there is no per-token stream to intercept. Synthadoc runs the subprocess to completion, then emits the result word-by-word through the same SSE pipe. The words arrive in a rapid burst rather than a gradual flow - the total wait is the same, but the perceived streaming effect is a short pause followed by the full answer appearing almost at once.
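The word-by-word replay is easy to picture in code. This is a sketch of the idea, not the actual implementation: it splits completed output into word-sized chunks while keeping the whitespace, so the client can reassemble the exact text.

```python
import re
from typing import Iterator


def replay_words(completed_output: str) -> Iterator[str]:
    """Yield a finished answer as word-sized chunks, whitespace preserved.

    Each chunk is a run of non-space characters plus any whitespace that
    follows it, so "".join(replay_words(s)) == s for every input string.
    """
    for match in re.finditer(r"\S+\s*|\s+", completed_output):
        yield match.group(0)
```

Each chunk would then be wrapped in the same kind of SSE frame that API providers use for real tokens, which is why the client needs no special case for subprocess providers.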

If you are using a CLI subprocess provider and queries are timing out, raise the timeout above its default:

synthadoc query "..." --timeout 180

The default is 60 seconds, which is sufficient for API providers but may be short for subprocess providers on complex queries.

Session Management

Sessions live server-side in audit.db, in two tables: chat_sessions and chat_messages. The React UI stores only the session_id in memory - it's React state, not localStorage. This means sessions don't survive page reload, and every new browser tab starts fresh. This is a deliberate design choice: a session is tied to one exploratory thread, not your entire browsing history.

Chat messages are stored to audit.db after each turn, but prior messages are not yet injected into the LLM prompt, so each query is answered independently. The session record is used for mode persistence and hint rotation, not for conversational context. Multi-turn prompt injection is planned for a future release.
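For a feel of what that storage could look like, here is a sketch using Python's built-in sqlite3. Only the table names chat_sessions and chat_messages come from this post. Every column, the helper functions, and the assumption that audit.db is a SQLite file are guesses made for illustration.

```python
import sqlite3
from typing import List, Tuple

# Hypothetical columns; only the two table names are taken from the post.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chat_sessions (
    session_id TEXT PRIMARY KEY,
    mode       TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE IF NOT EXISTS chat_messages (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    session_id TEXT NOT NULL REFERENCES chat_sessions(session_id),
    role       TEXT NOT NULL,
    content    TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def create_session(conn: sqlite3.Connection, session_id: str, mode: str) -> None:
    # The mode is fixed at session creation and persists with the session.
    with conn:
        conn.execute(
            "INSERT INTO chat_sessions (session_id, mode) VALUES (?, ?)",
            (session_id, mode))


def record_turn(conn: sqlite3.Connection, session_id: str,
                question: str, answer: str) -> None:
    # Store both sides of the turn; nothing here feeds a later prompt.
    with conn:
        conn.executemany(
            "INSERT INTO chat_messages (session_id, role, content) "
            "VALUES (?, ?, ?)",
            [(session_id, "user", question),
             (session_id, "assistant", answer)])


def history(conn: sqlite3.Connection, session_id: str) -> List[Tuple[str, str]]:
    rows = conn.execute(
        "SELECT role, content FROM chat_messages "
        "WHERE session_id = ? ORDER BY id", (session_id,))
    return list(rows)
```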

Diagram 2: Web Query Flow: Client to Server, Session to Stream

The diagram below traces a complete web UI query round-trip, from the user typing a question to the hint chips updating after the response. The left column is the browser; the right column is the server.

A few things are worth highlighting in this flow. The session_id lives only in React state - close the tab and it's gone. The mode determined at POST /sessions (step 1) persists for the lifetime of that tab and shapes hint generation at every done event (steps 3 and 4). The HintEngine never calls the LLM - it reads the answer content and the session mode and applies deterministic rules to generate the three chips.

Adaptive Hints: No LLM Required

The hint chips - three clickable suggestions rendered below the chat input - update after every response. They're generated by a deterministic HintEngine, not an LLM. No API call, no extra cost.

The engine first classifies the wiki's health state when the session is created:

Mode	Condition	Initial hints
`NEW_WIKI`	Fewer than 5 pages	Guide user toward first ingest
`EXPLORER`	First session, healthy wiki	Offer tour queries
`HEALTH_CHECK`	Stale or contradicted pages exist	Surface lint and lifecycle actions
`POWER_USER`	Returning user, healthy wiki	Context-sensitive topic suggestions
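The table reads naturally as a small decision function. The sketch below is one way to write it. The WikiSnapshot fields are hypothetical, and the order in which the conditions are checked (page count first, then health, then returning user) is my assumption, since the table doesn't say which mode wins when several conditions hold.

```python
from dataclasses import dataclass


@dataclass
class WikiSnapshot:
    """Hypothetical inputs to mode detection at session creation."""
    page_count: int
    stale_pages: int
    contradicted_pages: int
    returning_user: bool


def classify_mode(snapshot: WikiSnapshot) -> str:
    # Assumed precedence: a near-empty wiki beats everything else.
    if snapshot.page_count < 5:
        return "NEW_WIKI"
    # Any stale or contradicted page puts the session in triage mode.
    if snapshot.stale_pages > 0 or snapshot.contradicted_pages > 0:
        return "HEALTH_CHECK"
    # Healthy wiki: distinguish first-time from returning users.
    return "POWER_USER" if snapshot.returning_user else "EXPLORER"
```

No model call is involved: the same snapshot always produces the same mode.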

After each assistant response, the done SSE event carries a next_hints array: three suggestions computed from the answer content and session mode. If the answer mentioned a specific page, the hints might suggest a follow-up on a related page. If the answer triggered a knowledge gap, the hints offer the suggested_searches as clickable options.

The design principle here is that hints should reflect where you are in the conversation, not where you were when you opened the browser. A user on a HEALTH_CHECK session who just asked about contradicted pages shouldn't see generic "try querying about X" chips; they should see "run lint", "list orphan pages", "archive contradicted page". The mode carries through the session, shaping every hint update.
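Here is a sketch of what deterministic hint rules of this kind could look like. The HEALTH_CHECK strings are the ones quoted above. The other default hints, the follow-up wording, and the priority order (gap searches first, then mode actions or page follow-ups) are invented for illustration and are not the real HintEngine rules.

```python
from typing import Dict, List, Sequence

# HEALTH_CHECK hints are quoted from the post; the rest are placeholders.
MODE_DEFAULTS: Dict[str, List[str]] = {
    "NEW_WIKI": ["ingest your first source", "show wiki status", "run lint"],
    "EXPLORER": ["what topics does this wiki cover?", "show wiki status",
                 "what changed in the wiki this week?"],
    "HEALTH_CHECK": ["run lint", "list orphan pages",
                     "archive contradicted page"],
    "POWER_USER": ["what changed in the wiki this week?", "show wiki status",
                   "run lint"],
}


def next_hints(mode: str,
               suggested_searches: Sequence[str] = (),
               related_pages: Sequence[str] = (),
               limit: int = 3) -> List[str]:
    """Pick up to `limit` hints from rule-ordered candidates, no LLM call."""
    follow_ups = [f"tell me more about [[{page}]]" for page in related_pages]
    defaults = MODE_DEFAULTS.get(mode, [])

    # A knowledge gap takes priority: offer its searches as chips.
    candidates = list(suggested_searches)
    if mode == "HEALTH_CHECK":
        # Triage sessions keep health actions ahead of topic follow-ups.
        candidates += defaults + follow_ups
    else:
        candidates += follow_ups + defaults

    hints: List[str] = []
    for candidate in candidates:
        if candidate not in hints:  # drop duplicates, keep first occurrence
            hints.append(candidate)
        if len(hints) == limit:
            break
    return hints
```

The point of keeping the mode as an input on every call is that a HEALTH_CHECK session never drifts back to generic chips, whatever the last answer mentioned.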

What Makes This Architecturally Different

Most streaming chat interfaces work the same way: user sends a message, server calls the LLM, tokens stream back. There's no retrieval, no structured knowledge, no notion of whether the answer is backed by reviewed sources.

Synthadoc's streaming pipeline is a two-phase system where the first phase is a structured knowledge retrieval against a compiled, lifecycle-tracked wiki. The tokens you receive as they arrive are not hallucinated filler - they're synthesized from pages that passed lint, have known provenance, and carry a lifecycle state that tells you when they were last reviewed. The sources: N in the status event isn't decorative. It tells you before the first word of the answer how much of your wiki was relevant.

The session mode detection adds something I haven't seen elsewhere: the server classifies your wiki's health state when you open a session and uses that classification to shape every hint update for the rest of the session. A HEALTH_CHECK session doesn't give you generic "explore your wiki" prompts; it gives you "these pages need attention." The hints aren't cosmetic. They're a live triage system for wiki health.

Quick Demo

The CLI streaming and web UI are covered in the quick-start guide against the history-of-computing demo wiki:

CLI streaming: Step 5 - Query the pre-built wiki
Web Chat UI: Step 22 - Use the web chat UI

The full thing runs locally in about ten minutes:

git clone https://github.com/axoviq-ai/synthadoc.git
cd synthadoc   # the editable install below must run from the repo root
pip install -e ".[dev]"
synthadoc install history-of-computing --target ~/wikis --demo
synthadoc plugin install history-of-computing
synthadoc web   # opens browser

If you find Synthadoc useful, a ⭐ on GitHub helps the project reach more people: https://github.com/axoviq-ai/synthadoc.

DEV Community