<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Paul Chen</title>
    <description>The latest articles on DEV Community by Paul Chen (@paul_chen_90371fe7426cb44).</description>
    <link>https://dev.to/paul_chen_90371fe7426cb44</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3898190%2F38f30ba0-280f-4a52-8c7d-c773315b8da8.jpg</url>
      <title>DEV Community: Paul Chen</title>
      <link>https://dev.to/paul_chen_90371fe7426cb44</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/paul_chen_90371fe7426cb44"/>
    <language>en</language>
    <item>
      <title>Synthadoc: Beyond Keyword Search - How v0.2.0 Combines BM25 and Vector Search to Build a Smarter Domain Wiki</title>
      <dc:creator>Paul Chen</dc:creator>
      <pubDate>Mon, 27 Apr 2026 00:21:34 +0000</pubDate>
      <link>https://dev.to/paul_chen_90371fe7426cb44/beyond-keyword-search-how-synthadoc-v020-combines-bm25-and-vector-search-to-build-a-smarter-43l7</link>
      <guid>https://dev.to/paul_chen_90371fe7426cb44/beyond-keyword-search-how-synthadoc-v020-combines-bm25-and-vector-search-to-build-a-smarter-43l7</guid>
      <description>&lt;h1&gt;
  
  
  What is Synthadoc?
&lt;/h1&gt;

&lt;p&gt;Synthadoc is an open-source, LLM-powered wiki engine. Point it at your&lt;br&gt;
organisation's documents - PDFs, PPTX, spreadsheets, DOCX, images, or web pages - and it builds a persistent, structured knowledge base your team can query, audit, and extend over time.&lt;/p&gt;

&lt;p&gt;Unlike general-purpose RAG pipelines that retrieve raw chunks at query&lt;br&gt;
time and discard results afterwards, Synthadoc compiles knowledge at&lt;br&gt;
ingest time into a living wiki that grows smarter and more consistent&lt;br&gt;
with every new source. The core lifecycle is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ingest: extract and synthesise facts from any source format (PDF,
XLSX, PNG, web URL)&lt;/li&gt;
&lt;li&gt;Detect: flag contradictions with existing pages and quarantine them
for review&lt;/li&gt;
&lt;li&gt;Link: connect related pages and surface knowledge gaps&lt;/li&gt;
&lt;li&gt;Query: answer questions with hybrid BM25 + optional vector search, citing the pages used&lt;/li&gt;
&lt;li&gt;Lint: resolve contradictions and surface orphan pages for human or
automated action&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Synthadoc is designed for organisations that need domain-specific, auditable knowledge management: legal teams tracking regulatory&lt;br&gt;
precedent, financial analysts maintaining market research, engineering&lt;br&gt;
groups documenting system behaviour, and research teams building&lt;br&gt;
institutional memory that persists beyond individual contributors.&lt;/p&gt;

&lt;p&gt;Synthadoc v0.2.0 was released last week. It adds hybrid BM25 + vector retrieval, query decomposition, and knowledge gap detection - this post walks through the retrieval changes in detail.&lt;/p&gt;

&lt;p&gt;👉 Synthadoc GitHub: &lt;a href="https://github.com/axoviq-ai/synthadoc" rel="noopener noreferrer"&gt;https://github.com/axoviq-ai/synthadoc&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Most LLM knowledge tools take one of two approaches to retrieval: pure&lt;br&gt;
keyword search (fast but vocabulary-dependent) or pure vector/semantic&lt;br&gt;
search (flexible but resource-intensive). In practice, both have&lt;br&gt;
meaningful blind spots.&lt;/p&gt;

&lt;p&gt;Synthadoc v0.2.0 ships a hybrid retrieval pipeline that uses BM25 as a&lt;br&gt;
fast, precise first-pass filter and optional vector re-ranking as a&lt;br&gt;
semantic second pass. The result is a system that is accurate on&lt;br&gt;
exact-match queries, robust on paraphrased or conceptual queries, and&lt;br&gt;
fast enough to run on a laptop with no cloud dependency.&lt;/p&gt;

&lt;p&gt;This post explains how each technique works, where each one falls short&lt;br&gt;
alone, why the hybrid matters for a persistent domain wiki, and how&lt;br&gt;
Synthadoc v0.2.0 layers query decomposition and knowledge gap detection&lt;br&gt;
on top.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Is BM25?
&lt;/h1&gt;

&lt;p&gt;BM25 (Best Match 25) is a probabilistic ranking function. It scores a&lt;br&gt;
page relative to a query by counting how often query terms appear in the&lt;br&gt;
page, discounting terms that appear in almost every page, and penalising&lt;br&gt;
very long pages for artificially inflated counts. BM25 is the retrieval&lt;br&gt;
backbone of Elasticsearch, Lucene, and most production search systems.&lt;/p&gt;
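
&lt;p&gt;A minimal, self-contained sketch of the Okapi BM25 formula (toy corpus, default k1/b parameters; not Synthadoc's internal implementation) shows all three effects - term frequency, rarity discounting, and length normalisation:&lt;/p&gt;

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenised document against a query."""
    n_docs = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n_docs
    score = 0.0
    for term in query_terms:
        tf = doc.count(term)                       # term frequency in this doc
        df = sum(1 for d in corpus if term in d)   # docs containing the term
        # Rare terms score high; near-ubiquitous terms are discounted
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        # Length normalisation penalises artificially long documents
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [
    "alan turing cracked the enigma cipher with the bombe machine".split(),
    "von neumann architecture stores program and data together".split(),
]
scores = [bm25_score("bombe machine enigma".split(), d, corpus) for d in corpus]
```

&lt;p&gt;The first document wins purely on exact token overlap; swap any query word for a synonym and its score collapses to zero, which is exactly the blind spot discussed next.&lt;/p&gt;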

&lt;h2&gt;
  
  
  Scoring intuition
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegaoeg5kyikmfkmz5481.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fegaoeg5kyikmfkmz5481.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where BM25 falls short
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Vocabulary mismatch: query says "contributions", page says
"pioneered". Score: near zero.&lt;/li&gt;
&lt;li&gt;Synonyms: "ML" and "machine learning" are different tokens.&lt;/li&gt;
&lt;li&gt;Conceptual distance: "reasoning under uncertainty" and
"probabilistic inference" are semantically identical but lexically
distant.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On a domain wiki ingested from diverse sources (papers, docs, blog posts, PDFs), the same concept will be described in many different vocabularies. BM25 alone misses a meaningful fraction of relevant pages.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Is Vector Search?
&lt;/h1&gt;

&lt;p&gt;Vector (semantic) search encodes text into dense numerical embeddings using a neural language model. Semantically similar texts land close together in that high-dimensional space regardless of surface wording.&lt;br&gt;
Similarity is measured as the cosine similarity between the query vector and each page vector.&lt;/p&gt;
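
&lt;p&gt;As a sketch, cosine similarity is just the normalised dot product. The toy 4-dimensional vectors below stand in for real embeddings (bge-small-en-v1.5 produces 384-dimensional vectors):&lt;/p&gt;

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings; a real model maps whole sentences to vectors like these
query  = np.array([0.9, 0.1, 0.4, 0.2])   # "Turing's contributions to computing"
page_a = np.array([0.8, 0.2, 0.5, 0.1])   # "theoretical foundations of computers"
page_b = np.array([0.1, 0.9, 0.0, 0.7])   # an unrelated page
sim_a = cosine_similarity(query, page_a)
sim_b = cosine_similarity(query, page_b)
```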

&lt;h2&gt;
  
  
  Embedding intuition
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakwlpeoe9thsuu831zza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fakwlpeoe9thsuu831zza.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two sentences share almost no keywords, but their vectors point in nearly the same direction because the model understands they describe the same concept.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where vector search falls short alone
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Cold-start penalty: embedding thousands of pages takes time and compute; BM25 is instant.&lt;/li&gt;
&lt;li&gt;Exact-match dilution: specific product names or identifiers can be blurred by semantic proximity.&lt;/li&gt;
&lt;li&gt;Domain drift: general-purpose models may not distinguish highly specific domain terminology.&lt;/li&gt;
&lt;li&gt;Resource requirement: needs a model (~130 MB for bge-small-en-v1.5) and inference at query time.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  BM25 vs. Vector Search: Side-by-Side
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;BM25&lt;/th&gt;
&lt;th&gt;Vector / Semantic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Matching strategy&lt;/td&gt;
&lt;td&gt;Exact term overlap (TF × IDF)&lt;/td&gt;
&lt;td&gt;Semantic similarity (embedding cosine)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vocabulary required&lt;/td&gt;
&lt;td&gt;Query words must appear in page&lt;/td&gt;
&lt;td&gt;Paraphrases and synonyms handled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Microseconds -- no model needed&lt;/td&gt;
&lt;td&gt;Milliseconds -- model inference required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup cost&lt;/td&gt;
&lt;td&gt;Zero -- pure algorithm&lt;/td&gt;
&lt;td&gt;~130 MB model download (one-time)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Exact-match queries&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Synonym / paraphrase&lt;/td&gt;
&lt;td&gt;Often misses&lt;/td&gt;
&lt;td&gt;Handles well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domain terminology&lt;/td&gt;
&lt;td&gt;Good if terms match&lt;/td&gt;
&lt;td&gt;Depends on model training&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Interpretability&lt;/td&gt;
&lt;td&gt;Score is explainable&lt;/td&gt;
&lt;td&gt;Black-box similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Known vocabulary, structured content&lt;/td&gt;
&lt;td&gt;Conceptual queries, diverse sources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  How Synthadoc Combines Both
&lt;/h1&gt;

&lt;p&gt;Synthadoc uses a hybrid pipeline where BM25 and vector search are not alternatives - they are sequential layers. BM25 does the heavy filtering; vector re-ranks the survivors.&lt;/p&gt;

&lt;h2&gt;
  
  
  The retrieval pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4ydqh6gcu86b8ffu8qp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg4ydqh6gcu86b8ffu8qp.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;
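
&lt;p&gt;In Python, the two-stage idea looks like this - a hypothetical sketch with toy scoring functions, not Synthadoc's actual code. The point is that only the small BM25 survivor pool ever pays the embedding cost:&lt;/p&gt;

```python
def hybrid_search(query, pages, bm25_score, embed, similarity,
                  top_candidates=20, top_k=5):
    """Two-stage retrieval: cheap lexical filter, then semantic re-rank."""
    # Stage 1: BM25 pass over the whole wiki (fast, no model needed)
    pool = sorted(pages, key=lambda p: bm25_score(query, p),
                  reverse=True)[:top_candidates]
    # Stage 2: embed only the survivors and re-rank semantically
    qv = embed(query)
    return sorted(pool, key=lambda p: similarity(qv, embed(p)),
                  reverse=True)[:top_k]

# Toy stand-ins: word overlap for "BM25", bag-of-characters for "embeddings"
def toy_bm25(query, page):
    return len(set(query.split()).intersection(page.split()))

def toy_embed(text):
    return {ch: text.count(ch) for ch in set(text)}

def toy_similarity(a, b):
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in set(a).union(b))
    norm = (sum(v * v for v in a.values()) * sum(v * v for v in b.values())) ** 0.5
    return dot / norm if norm else 0.0

pages = ["bombe machine enigma decryption",
         "theoretical foundations of modern computers",
         "von neumann architecture"]
top = hybrid_search("bombe machine enigma", pages, toy_bm25,
                    toy_embed, toy_similarity, top_candidates=2, top_k=1)
```

&lt;p&gt;With vector search disabled, stage 2 is simply skipped and the BM25 ranking is returned as-is - which is why BM25 remains the zero-setup default.&lt;/p&gt;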

&lt;h2&gt;
  
  
  Query decomposition: why it matters for retrieval
&lt;/h2&gt;

&lt;p&gt;Before any search happens, Synthadoc v0.2.0 breaks compound questions into focused sub-questions via an LLM call. Each sub-question runs its own BM25 (and vector) search in parallel. Results are merged by best score per page before synthesis. One complex query can retrieve from multiple distinct parts of the wiki simultaneously.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Example: Query: "Compare Turing's contributions with Von Neumann's &lt;br&gt;
architecture"&lt;br&gt;
-&amp;gt; Decomposed: ["Turing contributions computing"] | ["Von Neumann &lt;br&gt;
architecture design"]&lt;br&gt;
-&amp;gt; Two parallel BM25 searches -&amp;gt; merged candidates -&amp;gt; one synthesised&lt;br&gt;
answer&lt;/p&gt;
&lt;/blockquote&gt;
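
&lt;p&gt;Merging the parallel sub-query results is a small operation in itself: keep each page's best score across all sub-questions, then rank. A minimal sketch (page names and scores are hypothetical):&lt;/p&gt;

```python
def merge_by_best_score(result_sets):
    """Merge parallel sub-query results, keeping each page's best score."""
    best = {}
    for results in result_sets:              # one result set per sub-question
        for page, score in results:
            if score > best.get(page, float("-inf")):
                best[page] = score
    # Highest-scoring pages first, ready for synthesis
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

turing = [("Turing - Theoretical Foundations", 4.2),
          ("Enigma and the Bombe Machine", 3.1)]
von_neumann = [("Von Neumann Architecture", 5.0),
               ("Enigma and the Bombe Machine", 1.2)]
merged = merge_by_best_score([turing, von_neumann])
```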

&lt;h2&gt;
  
  
  Knowledge gap detection
&lt;/h2&gt;

&lt;p&gt;After retrieval, Synthadoc evaluates three independent signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fewer than 3 pages retrieved - the wiki barely covers the topic&lt;/li&gt;
&lt;li&gt;Max BM25 score below configurable threshold (default: 2.0) - weak keyword overlap&lt;/li&gt;
&lt;li&gt;Fewer than 2 candidates contain key nouns from the question - off-topic matches&lt;/li&gt;
&lt;/ul&gt;
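
&lt;p&gt;The three signals are independent - any single one firing marks a gap. A minimal sketch (hypothetical function, default threshold as above):&lt;/p&gt;

```python
def has_knowledge_gap(candidates, bm25_scores, key_nouns, threshold=2.0):
    """Return True if any of the three gap signals fires."""
    thin_coverage = 3 > len(candidates)               # wiki barely covers topic
    weak_overlap = threshold > max(bm25_scores, default=0.0)
    on_topic = sum(1 for page in candidates
                   if any(noun in page for noun in key_nouns))
    off_topic = 2 > on_topic                          # matches miss key nouns
    return thin_coverage or weak_overlap or off_topic

# Finance example from later in the post: one page back, top score 0.8
gap = has_knowledge_gap(["macro overview"], [0.8],
                        key_nouns=["quantitative easing", "inflation"])
```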

&lt;p&gt;When a gap fires, Synthadoc generates targeted web search suggestions and surfaces them as an Obsidian callout or CLI tip, creating a feedback loop that makes the wiki progressively denser over time.&lt;/p&gt;

&lt;h1&gt;
  
  
  Practical Examples in Synthadoc
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Example 1: BM25 exact match (no vector needed)
&lt;/h2&gt;

&lt;p&gt;Wiki page: "Alan Turing - Enigma and the Bombe Machine"&lt;/p&gt;

&lt;p&gt;Query: "Bombe machine Enigma decryption"&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;BM25 succeeds: BM25 score: HIGH - "Bombe", "machine", "Enigma", &lt;br&gt;
"decryption" all present.&lt;br&gt;
Result: page retrieved correctly. Vector re-ranking not required.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Example 2: BM25 misses, vector rescues
&lt;/h2&gt;

&lt;p&gt;Wiki page: "Alan Turing - Theoretical Foundations of Modern Computers"&lt;/p&gt;

&lt;p&gt;Query: "What were Turing's contributions to computing?"&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;BM25 misses, vector rescues: BM25 score: LOW - "contributions" and &lt;br&gt;
"computing" absent from the page.&lt;br&gt;
Vector cosine score: HIGH - embeddings are semantically close.&lt;br&gt;
Result: page retrieved correctly after re-ranking. BM25 alone would have missed it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Example 3: Knowledge gap fires, ingest suggestion generated
&lt;/h2&gt;

&lt;p&gt;Wiki: finance domain. Query: "What is the impact of quantitative easing on inflation?"&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gap detected: BM25: 1 page returned, score 0.8 (below threshold 2.0) &lt;br&gt;
Knowledge gap detected. Synthadoc generates:&lt;br&gt;
synthadoc ingest "search for: quantitative easing inflation impact" -w finance-wiki&lt;br&gt;
synthadoc ingest "search for: central bank monetary policy effects" -w finance-wiki&lt;br&gt;
After ingest and re-query: 7 pages returned, fully synthesised answer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Enabling Vector Search in Synthadoc
&lt;/h1&gt;

&lt;p&gt;BM25 is the default - zero setup, zero dependencies. To add vector re-ranking:&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install fastembed
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pip install fastembed&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Enable in config
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;[search]&lt;br&gt;
vector = true&lt;br&gt;
vector_top_candidates = 20 # BM25 pool size before re-ranking&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Restart the server
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;synthadoc serve -w my-wiki&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;On first start, Synthadoc downloads BAAI/bge-small-en-v1.5 (~130 MB) once and embeds existing pages in the background. BM25 stays active throughout - no downtime. If the model is unavailable, the system falls back to BM25 silently.&lt;/p&gt;

&lt;h1&gt;
  
  
  Synthadoc v0.2.0: Full Feature Summary
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Query decomposition&lt;/td&gt;
&lt;td&gt;Compound questions split into parallel BM25 sub-queries, merged before synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector re-ranking&lt;/td&gt;
&lt;td&gt;Opt-in semantic re-ranking (BAAI/bge-small-en-v1.5 via fastembed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge gap detection&lt;/td&gt;
&lt;td&gt;3-signal gap check; auto-generates targeted ingest suggestions as Obsidian callout&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Web search decomposition&lt;/td&gt;
&lt;td&gt;Broad search topics split into focused Tavily queries; URL deduplication and cap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-model cost tracking&lt;/td&gt;
&lt;td&gt;Per-token rate table; ingest + query cost in audit.db, CLI, and Obsidian&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query audit trail&lt;/td&gt;
&lt;td&gt;Full query history with sub-question count, tokens, cost, timestamp&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Obsidian live web search view&lt;/td&gt;
&lt;td&gt;Real-time polling panel: phase, pages created, URL errors as fan-out completes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8 new Obsidian commands&lt;/td&gt;
&lt;td&gt;15 commands total: lint, auto-resolve, job retry/purge, audit history, scaffold&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax support&lt;/td&gt;
&lt;td&gt;M2.5/M2.7 reasoning models with reasoning_content fallback for structured output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rate-limit requeue&lt;/td&gt;
&lt;td&gt;429 responses requeue job (retry budget preserved); fail-fast on daily quota&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job crash recovery&lt;/td&gt;
&lt;td&gt;in_progress jobs at shutdown auto-reset to pending on next startup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bulk job cancel&lt;/td&gt;
&lt;td&gt;Cancel all pending jobs in one operation via CLI or API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  How Synthadoc Compares to Alternatives
&lt;/h1&gt;

&lt;p&gt;Most LLM knowledge tools are general-purpose RAG pipelines that retrieve raw chunks at query time with no persistent synthesis. Synthadoc compiles knowledge at ingest time, maintains a living wiki, and is designed for domain-specific, auditable deployments.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Synthadoc v0.2.0&lt;/th&gt;
&lt;th&gt;LlamaIndex / LangChain&lt;/th&gt;
&lt;th&gt;Notion AI&lt;/th&gt;
&lt;th&gt;Obsidian Copilot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ingest-time synthesis&lt;/td&gt;
&lt;td&gt;Compiled wiki&lt;/td&gt;
&lt;td&gt;Raw chunks at query time&lt;/td&gt;
&lt;td&gt;Page-level only&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domain scope filtering&lt;/td&gt;
&lt;td&gt;purpose.md&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-model support&lt;/td&gt;
&lt;td&gt;6 providers&lt;/td&gt;
&lt;td&gt;Many providers&lt;/td&gt;
&lt;td&gt;OpenAI only&lt;/td&gt;
&lt;td&gt;OpenAI / Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audit trail&lt;/td&gt;
&lt;td&gt;Full SQLite audit&lt;/td&gt;
&lt;td&gt;None built-in&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost tracking&lt;/td&gt;
&lt;td&gt;Per-token, per-op&lt;/td&gt;
&lt;td&gt;Manual / callback&lt;/td&gt;
&lt;td&gt;Opaque&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offline / local&lt;/td&gt;
&lt;td&gt;Fully local&lt;/td&gt;
&lt;td&gt;Depends on provider&lt;/td&gt;
&lt;td&gt;Cloud only&lt;/td&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Obsidian-native output&lt;/td&gt;
&lt;td&gt;Wikilinks, Dataview&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Notion-only&lt;/td&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HTTP API + MCP server&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Manual wiring&lt;/td&gt;
&lt;td&gt;Proprietary API&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Contradiction detection&lt;/td&gt;
&lt;td&gt;Automated&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query decomposition&lt;/td&gt;
&lt;td&gt;Parallel BM25&lt;/td&gt;
&lt;td&gt;Manual chains&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge gap detection&lt;/td&gt;
&lt;td&gt;Auto-suggestions&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Extensible skills&lt;/td&gt;
&lt;td&gt;Drop-in folders&lt;/td&gt;
&lt;td&gt;Custom loaders&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Licence&lt;/td&gt;
&lt;td&gt;AGPL-3.0 open source&lt;/td&gt;
&lt;td&gt;Apache-2.0&lt;/td&gt;
&lt;td&gt;Proprietary SaaS&lt;/td&gt;
&lt;td&gt;MIT&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  Enterprise and Domain-Specific Readiness
&lt;/h1&gt;

&lt;p&gt;Synthadoc is built for organisations that need a knowledge system they control, audit, and deploy into existing infrastructure - not a SaaS black box.&lt;/p&gt;

&lt;p&gt;Concrete use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Legal - track regulatory updates, case precedents, and compliance requirements across jurisdictions. New ruling ingested, old page flagged contradicted, compliance team reviews.&lt;/li&gt;
&lt;li&gt;Finance - build a living market research wiki from analyst reports, earnings calls, and regulatory filings. Query with natural language, get cited answers with full audit trail.&lt;/li&gt;
&lt;li&gt;Engineering - maintain a persistent runbook that absorbs incident post-mortems, architecture decision records, and API docs. Contradiction detection prevents stale documentation from accumulating.&lt;/li&gt;
&lt;li&gt;Research - aggregate papers, datasets, and notes into a structured knowledge base. Knowledge gap detection surfaces what the team does not yet know and generates targeted ingest suggestions.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Domain specificity
&lt;/h2&gt;

&lt;p&gt;Every wiki defines its own scope via purpose.md. The LLM reads this before every ingest decision and rejects out-of-scope sources cleanly. A legal wiki does not absorb marketing copy. A financial wiki does not absorb engineering runbooks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Auditability
&lt;/h2&gt;

&lt;p&gt;Every ingest, query, contradiction detection, and auto-resolution is written to an append-only SQLite audit trail with token counts, cost, timestamps, and page-level actions -- all queryable from the CLI or Obsidian audit commands.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;synthadoc audit history -w my-wiki # ingest records&lt;/p&gt;

&lt;p&gt;synthadoc audit cost -w my-wiki # token spend breakdown&lt;/p&gt;

&lt;p&gt;synthadoc audit events -w my-wiki # contradiction, gate, resolution&lt;br&gt;
events&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Product and integration readiness
&lt;/h2&gt;

&lt;p&gt;Synthadoc exposes the same operations across four surfaces sharing a single agent and storage layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CLI: for operators, automation scripts, and CI pipelines&lt;/li&gt;
&lt;li&gt;HTTP REST API: for product integrations and custom front-ends&lt;/li&gt;
&lt;li&gt;MCP server: for direct agent-to-agent communication&lt;/li&gt;
&lt;li&gt;Obsidian plugin: for knowledge workers doing active research&lt;/li&gt;
&lt;/ul&gt;
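
&lt;p&gt;For example, a product integration talking to the HTTP API might build its query request like this. The endpoint path and payload shape are assumptions for illustration - check the API docs for the real contract:&lt;/p&gt;

```python
import json
import urllib.request

def build_query_request(base_url, question):
    """Build the POST request an integration would send to a wiki server."""
    return urllib.request.Request(
        base_url + "/query",                      # hypothetical endpoint
        data=json.dumps({"question": question}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_query_request("http://localhost:8000",
                          "What were Turing's contributions to computing?")
```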

&lt;p&gt;Hook scripts fire on lifecycle events (on_ingest_complete, on_lint_complete), enabling event-driven automation: post a Slack summary when ingest completes, trigger a downstream build when a key page changes, or chain into a broader orchestration pipeline. Cron scheduling is built in, and multi-wiki isolation means each team or domain runs on its own port with its own audit trail.&lt;/p&gt;
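
&lt;p&gt;A hook script itself can be tiny. The sketch below assumes the hook receives a JSON payload with "wiki" and "pages_created" fields - that payload shape is a guess for illustration, not Synthadoc's documented hook contract:&lt;/p&gt;

```python
import json

def summarise_ingest(payload):
    """One-line summary of an on_ingest_complete event, e.g. for Slack."""
    pages = payload.get("pages_created", [])
    wiki = payload.get("wiki", "unknown-wiki")
    return "Ingest complete on {}: {} new page(s): {}".format(
        wiki, len(pages), ", ".join(pages))

# Sample payload; a real hook might read this from stdin or an env var
event = json.loads('{"wiki": "finance-wiki", "pages_created": ["QE", "Inflation"]}')
summary = summarise_ingest(event)
```

&lt;p&gt;From here, posting to a Slack webhook is a single HTTP call, and the same pattern applies to on_lint_complete.&lt;/p&gt;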

&lt;h1&gt;
  
  
  Synthadoc in Agentic Autonomous Systems
&lt;/h1&gt;

&lt;p&gt;Synthadoc is purpose-built to serve as the persistent knowledge layer for LLM agent systems. Where an agent's context window is ephemeral and limited, Synthadoc's wiki is persistent, structured, and queryable - it gives agents a long-term memory that survives across sessions, scales to millions of tokens of accumulated knowledge, and is fully auditable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent integration via MCP
&lt;/h2&gt;

&lt;p&gt;The built-in MCP (Model Context Protocol) server exposes ingest, query, and lint as native tool calls. An agent running in any MCP-compatible host - Claude, GPT-4o, a custom LangChain pipeline - can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Call query to retrieve cited, synthesised answers from accumulated knowledge before acting&lt;/li&gt;
&lt;li&gt;Call ingest to push new findings, research results, or external documents back into the wiki&lt;/li&gt;
&lt;li&gt;Call lint to check for contradictions introduced by new data before committing to a decision&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Event-driven agent pipelines
&lt;/h2&gt;

&lt;p&gt;Hook scripts fire on lifecycle events and can trigger downstream agent actions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;on_ingest_complete: a downstream agent reads newly created pages
and decides whether to trigger follow-up ingests or alert a human
reviewer&lt;/li&gt;
&lt;li&gt;on_lint_complete: an orchestrator agent receives contradiction and orphan reports and routes resolution tasks to specialised sub-agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example pipeline: a web-crawling agent ingests raw URLs; Synthadoc synthesises and deduplicates; a reporting agent queries the updated wiki and posts a daily briefing - all without human intervention.&lt;/p&gt;

&lt;h2&gt;
  
  
  Persistent domain memory for multi-agent systems
&lt;/h2&gt;

&lt;p&gt;In multi-agent architectures, shared knowledge is a coordination bottleneck. Synthadoc solves this by acting as a shared, structured memory store:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple agents read from the same wiki via parallel HTTP queries - no shared state management required&lt;/li&gt;
&lt;li&gt;One agent's ingest results are immediately available to all agents querying the same wiki&lt;/li&gt;
&lt;li&gt;Multi-wiki isolation means separate agent clusters maintain scoped knowledge without interference&lt;/li&gt;
&lt;li&gt;The audit trail provides a complete record of which agent ingested what, when, at what cost - making multi-agent systems auditable by design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because Synthadoc is self-hosted and open-source, teams building autonomous systems retain full control over data residency, model selection, and cost - a critical requirement for enterprise agentic&lt;br&gt;
deployments.&lt;/p&gt;

&lt;h1&gt;
  
  
  Try Synthadoc v0.2.0
&lt;/h1&gt;

&lt;p&gt;Synthadoc v0.2.0 is available now on GitHub under the AGPL-3.0 licence. BM25 search works out of the box. Vector re-ranking is one pip install away. The Gemini free tier means you can run a full ingest-and-query cycle at zero cost.&lt;/p&gt;

&lt;p&gt;Feedback, issues, and contributions are very welcome. Open an issue on GitHub or start a discussion - the roadmap is shaped by what users need.&lt;/p&gt;

&lt;p&gt;👉 README: &lt;a href="https://github.com/axoviq-ai/synthadoc#readme" rel="noopener noreferrer"&gt;https://github.com/axoviq-ai/synthadoc#readme&lt;/a&gt;&lt;br&gt;
👉 Quick-start guide: &lt;a href="https://github.com/axoviq-ai/synthadoc/blob/main/docs/user-quick-start-guide.md" rel="noopener noreferrer"&gt;https://github.com/axoviq-ai/synthadoc/blob/main/docs/user-quick-start-guide.md&lt;/a&gt;&lt;br&gt;
👉 Design document: &lt;a href="https://github.com/axoviq-ai/synthadoc/blob/main/docs/design.md" rel="noopener noreferrer"&gt;https://github.com/axoviq-ai/synthadoc/blob/main/docs/design.md&lt;/a&gt;&lt;br&gt;
👉 Release notes: &lt;a href="https://github.com/axoviq-ai/synthadoc/releases/tag/v0.2.0" rel="noopener noreferrer"&gt;https://github.com/axoviq-ai/synthadoc/releases/tag/v0.2.0&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>agentskills</category>
    </item>
  </channel>
</rss>
