Paul Chen

Posted on Jun 11

Synthadoc: A Self-Invalidating Query Cache

#ai #automation #llm #architecture

Most LLM queries feel expensive the second time. You've already asked your wiki "What is Moore's Law?" this morning. The answer hasn't changed - nothing in your wiki has changed - but if you ask again this afternoon, Synthadoc will hit the retrieval pipeline, build a prompt, call the LLM, and wait 2–10 seconds for an answer you already have.

v0.7.0 eliminates that. A query cache stores the result the first time, and serves it instantly on every repeat - with one rule: the cache invalidates itself automatically whenever your wiki changes. No manual flush, no TTL to configure. You never serve a stale answer, and you never pay twice for the same question.

This post covers the cache design and the performance benchmarks we ran against it. The streaming query pipeline and web chat UI that shipped alongside it in v0.7.0 are covered in the companion post: Synthadoc: Streaming Queries and a Local Web Chat UI.

Query Caching: When Not to Call the LLM

The cache design started from a specific observation: most queries against a domain wiki are repeated. A team maintaining a knowledge base about their software architecture will ask the same questions dozens of times - during onboarding, during incident reviews, during planning sessions. Every one of those calls hits the LLM and incurs both latency and cost.

The cache eliminates that. But the tricky part is knowing when to invalidate it.

Cache Key Design

key = SHA-256(normalized_question + "|" + wiki_epoch + "|" + provider_model)

Three components:

Normalized question - lowercased, whitespace-collapsed. "What is Moore's Law?" and "what is moore's law?" hit the same cache entry.

Wiki epoch - an integer counter on the server instance. It starts at 0 on startup and increments on every ingest job completion and every lifecycle state transition. When the epoch changes, the cache key for every question changes. Prior entries aren't deleted immediately; they just become unreachable. A background sweep cleans up old entries (those more than 5 epochs behind the current one, or older than 7 days).

Provider/model - "openai/gpt-4o-mini" or "anthropic/claude-sonnet-4-6". Switching models invalidates the cache. A cached answer from a smaller model shouldn't surface when you've upgraded to a better one.
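As a rough illustration, the key derivation could look like the sketch below. The function names are hypothetical, and stripping leading/trailing whitespace is an assumption on top of the lowercase-and-collapse rule described above; this is not the actual Synthadoc source.

```python
import hashlib
import re


def normalize_question(question: str) -> str:
    """Lowercase and collapse whitespace runs (the strip() is an assumption)."""
    return re.sub(r"\s+", " ", question.strip().lower())


def cache_key(question: str, wiki_epoch: int, provider_model: str) -> str:
    """SHA-256 over the three components, joined with '|' as in the formula."""
    raw = f"{normalize_question(question)}|{wiki_epoch}|{provider_model}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


# Same question, different casing and spacing: same key.
same = cache_key("What is Moore's Law?", 3, "openai/gpt-4o-mini")
also_same = cache_key("  what is   moore's law? ", 3, "openai/gpt-4o-mini")

# A bumped epoch or a different model produces a different key.
after_ingest = cache_key("What is Moore's Law?", 4, "openai/gpt-4o-mini")
other_model = cache_key("What is Moore's Law?", 3, "anthropic/claude-sonnet-4-6")
```

Because the epoch and model are hashed into the key itself, there is no separate "is this entry still valid?" check on the read path: a lookup either finds the exact key or it doesn't.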

The epoch approach is what makes invalidation automatic. You don't call "invalidate cache" after an ingest; the epoch bump does it implicitly. Any query after a wiki change computes a new key that has never been seen, misses the cache, and calls the LLM fresh. The previous answer doesn't need to be deleted; it simply ceases to be looked up.
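Here is a minimal in-memory sketch of that behaviour, assuming a plain dict in place of the real SQLite store and a readable tuple key in place of the SHA-256 hash. The class and method names are hypothetical; only the sweep thresholds (5 epochs, 7 days) come from the description above.

```python
import time
from dataclasses import dataclass, field

MAX_EPOCH_LAG = 5                # sweep entries more than 5 epochs behind
MAX_AGE_SECONDS = 7 * 24 * 3600  # ...or older than 7 days


@dataclass
class EpochCache:
    epoch: int = 0
    # (question, epoch, model) -> (answer, created_at)
    entries: dict = field(default_factory=dict)

    def get(self, question, model):
        hit = self.entries.get((question, self.epoch, model))
        return hit[0] if hit else None

    def put(self, question, model, answer):
        self.entries[(question, self.epoch, model)] = (answer, time.time())

    def bump_epoch(self):
        # Called after an ingest or lifecycle transition. Nothing is deleted.
        self.epoch += 1

    def sweep(self):
        now = time.time()
        stale = [
            key for key, (_, created) in self.entries.items()
            if self.epoch - key[1] > MAX_EPOCH_LAG or now - created > MAX_AGE_SECONDS
        ]
        for key in stale:
            del self.entries[key]
        return len(stale)


cache = EpochCache()
cache.put("what is moore's law?", "openai/gpt-4o-mini", "cached answer")
hit_before = cache.get("what is moore's law?", "openai/gpt-4o-mini")

cache.bump_epoch()  # the wiki changed
hit_after = cache.get("what is moore's law?", "openai/gpt-4o-mini")
size_after_bump = len(cache.entries)  # old entry is still stored, just unreachable

for _ in range(5):  # entry is now 6 epochs behind
    cache.bump_epoch()
removed = cache.sweep()
```

The old entry survives the first bump but can no longer be found, and only the later sweep actually removes it.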

Diagram: Cache Lookup, Hit, Miss, and Epoch Invalidation

Measured Results

Rather than leaving the latency claims as estimates, we wrote a full performance test suite against the cache layer. All numbers below come from running pytest tests/performance/test_query_cache_perf.py locally on a Windows development machine with an SSD. Linux bare-metal numbers are consistently 30–40% better.

Chart 1 - Cache read latency distribution (500 reads, 200 cached entries)

P50 = 0.26ms, P95 = 0.34ms, P99 = 0.41ms against a 10ms SLO. The distribution is extremely tight - the persistent connection eliminates the per-call connection-open overhead that was the main source of outliers. Every percentile sits well inside the budget with headroom to spare.
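For readers who want to reproduce the shape of this measurement, the sketch below times 500 reads against 200 entries over one reused connection and derives the percentiles. It is not the project's test suite: the table name and schema are assumptions, and it uses an in-memory database with the synchronous sqlite3 module, so absolute numbers will differ from the ones above.

```python
import sqlite3
import statistics
import time

# One connection, opened once and reused for every read.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE query_cache (key TEXT PRIMARY KEY, result_json TEXT)")
conn.executemany(
    "INSERT INTO query_cache VALUES (?, ?)",
    [(f"key-{i}", '{"answer": "placeholder"}') for i in range(200)],
)
conn.commit()

latencies_ms = []
for i in range(500):
    start = time.perf_counter()
    row = conn.execute(
        "SELECT result_json FROM query_cache WHERE key = ?", (f"key-{i % 200}",)
    ).fetchone()
    latencies_ms.append((time.perf_counter() - start) * 1000)

# quantiles(n=100) returns 99 cut points; index 49 is P50, 94 is P95, 98 is P99.
cuts = statistics.quantiles(latencies_ms, n=100)
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"P50={p50:.3f}ms  P95={p95:.3f}ms  P99={p99:.3f}ms")
```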

Chart 2 - Cache hit vs miss latency at varying LLM speeds

The left panel uses a log scale because the gap is so large it can't be shown linearly. Cache hit P50 stays flat at ~0.25ms regardless of LLM speed - one shared persistent connection makes the hit path a pure queue-and-execute SQLite read with no file-open cost. The miss path scales directly with LLM latency. The right panel shows the resulting speedup factor:

Simulated LLM speed	Cache miss P50	Cache hit P50	Speedup
50ms (fast provider)	95ms	0.29ms	~330×
200ms (mid provider)	235ms	0.32ms	~730×
500ms (slow provider)	544ms	0.24ms	~2270×
2000ms (reasoning model)	2055ms	0.26ms	~7900×

The cache hit time is so small relative to any real LLM that the ratio is dominated entirely by provider latency. With reasoning models (o3-mini, MiniMax M2), a single saved round-trip reclaims 15–30 seconds of wall time.
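The speedup column is simply miss P50 divided by hit P50, rounded. A quick check using the table's own numbers:

```python
# (simulated LLM speed, cache miss P50 in ms, cache hit P50 in ms), from the table
rows = [
    ("50ms (fast provider)", 95, 0.29),
    ("200ms (mid provider)", 235, 0.32),
    ("500ms (slow provider)", 544, 0.24),
    ("2000ms (reasoning model)", 2055, 0.26),
]

speedups = {label: miss_ms / hit_ms for label, miss_ms, hit_ms in rows}

for label, factor in speedups.items():
    print(f"{label}: {factor:.0f}x")
```

The unrounded ratios come out at roughly 328, 734, 2267, and 7904, which matches the table's rounded figures.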

Chart 3 - Concurrent readers: persistent connection scaling curve

P95 by reader count: one reader, 0.5ms; ten, 2.0ms; twenty-five, 3.8ms; fifty, 7.8ms; one hundred, 14.9ms. The curve is smooth and monotonically increasing - no Windows spikes, no non-monotonic jitter. All concurrent reads queue through one shared aiosqlite background thread; the connection-open overhead that caused the old instability is simply not there. For a local single-user tool the realistic ceiling is n=5–10 concurrent reads, where P95 is about 2ms or less. Even at n=100 the tail is well inside a 50ms budget.
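The queueing behaviour can be approximated with the standard library alone. The sketch below is a stand-in, not the real code path: a single-worker ThreadPoolExecutor plays the role of aiosqlite's background thread, and the table name, schema, and in-memory database are assumptions. Expect different absolute numbers, but the same pattern of every read waiting its turn behind one worker.

```python
import asyncio
import sqlite3
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

# Created here, then used only from the single worker thread below.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE query_cache (key TEXT PRIMARY KEY, result_json TEXT)")
conn.execute("INSERT INTO query_cache VALUES (?, ?)", ("k", '{"answer": "placeholder"}'))
conn.commit()

# One worker thread: all reads are serialized through it.
worker = ThreadPoolExecutor(max_workers=1)


def read_one():
    return conn.execute(
        "SELECT result_json FROM query_cache WHERE key = ?", ("k",)
    ).fetchone()


async def timed_read(loop):
    start = time.perf_counter()
    await loop.run_in_executor(worker, read_one)
    return (time.perf_counter() - start) * 1000  # includes time spent queued


async def p95_for(n_readers):
    loop = asyncio.get_running_loop()
    latencies = await asyncio.gather(*(timed_read(loop) for _ in range(n_readers)))
    if len(latencies) < 2:
        return latencies[0]
    return statistics.quantiles(latencies, n=100, method="inclusive")[94]


async def main():
    return {n: await p95_for(n) for n in (1, 10, 25, 50, 100)}


results = asyncio.run(main())
worker.shutdown()

for n, p95_ms in results.items():
    print(f"n={n:>3}  P95={p95_ms:.3f}ms")
```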

Chart 4 - Cache vs no-cache throughput (queries/second)

The throughput advantage starts at 75.7× at n=1 and compresses to 4.1× at n=100. The compression is expected: asyncio.gather() parallelises the simulated LLM calls so the no-cache path scales nearly linearly with concurrency. The cache path, sharing one connection, serializes through the aiosqlite queue and grows sublinearly. But critically, the cache always wins by a wide margin - 4.1× at n=100 is far better than the 1.3× seen before the persistent connection fix. At realistic single-user concurrency (n=1–5), the advantage is 33–76×.

Estimated Latency Gains

A typical Synthadoc query against a mid-size wiki has two latency components:

Phase 1 (BM25 retrieval): 100–200ms. This runs regardless of cache.
Phase 2 (LLM synthesis): 2–10 seconds depending on provider, model, and answer length.

A cache hit skips Phase 2 entirely. The server reads the cached result_json from SQLite (~0.26ms P50 on SSD via a persistent connection), then emits a synthetic SSE burst at full network speed. The client receives what looks like a live streamed response, but the entire burst completes in under 100ms instead of waiting 2–10 seconds for the LLM. With a reasoning-model provider, that gap widens to 15–30 seconds per query, and the cache makes those queries feel instant.
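To make the "synthetic SSE burst" concrete, here is one way a cached answer could be replayed as a stream. The JSON shape of result_json (an "answer" field), the per-frame "token" payload, the chunk size, and the [DONE] sentinel are all assumptions for illustration; Synthadoc's actual wire format may differ.

```python
import json


def sse_burst(result_json: str, chunk_size: int = 40):
    """Yield SSE frames that replay a cached answer in small chunks."""
    answer = json.loads(result_json)["answer"]  # assumed field name
    for i in range(0, len(answer), chunk_size):
        payload = json.dumps({"token": answer[i:i + chunk_size]})
        yield f"data: {payload}\n\n"  # one SSE event per chunk
    yield "data: [DONE]\n\n"  # hypothetical end-of-stream marker


cached = json.dumps({"answer": "x" * 100})
frames = list(sse_burst(cached))

# A client would concatenate the chunks exactly as it does for a live stream.
replayed = "".join(
    json.loads(frame[len("data: "):])["token"] for frame in frames[:-1]
)
```

Since the frames are produced without waiting on any upstream model, the client code that renders a live stream can render a cache hit unchanged.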

The Cache Is Shared Across All Three Surfaces

CLI, Obsidian plugin, and Web Chat UI all share the same cache.db. If you ran synthadoc query "..." from the CLI this morning and the wiki hasn't changed, opening the Obsidian modal and asking the same question will hit the cache. The key is identical - same normalized question, same epoch, same model.

# Drop the entire cache - both LLM response cache and query cache
synthadoc cache clear
Cache cleared: 47 entries removed.

What Makes This Architecturally Different

The caching architecture differs from the typical approach of setting an explicit TTL (cache for 24 hours, or cache for one week). TTL-based caches are almost always wrong at the edges: they're either too short (you evict answers that are still valid) or too long (you serve stale content after a wiki update). Epoch-based invalidation is event-driven: a cached answer stays valid exactly until something in the wiki changes. You don't think about expiry. You ingest new content, and the next query automatically goes to the LLM. Every repeat of that query then hits the cache again until the next change.

Quick Demo

Query caching is covered in the quick-start guide against the history-of-computing demo wiki:

Query caching: Step 23 - Query caching

The full thing runs locally in about ten minutes:

git clone https://github.com/axoviq-ai/synthadoc.git
cd synthadoc
pip install -e ".[dev]"
synthadoc install history-of-computing --target ~/wikis --demo
synthadoc plugin install history-of-computing
synthadoc web   # opens browser

If you find Synthadoc useful, a ⭐ on GitHub helps the project reach more people: https://github.com/axoviq-ai/synthadoc.
