DEV Community: Paul Chen

Synthadoc: From Raw Documents to Domain Intelligence (YouTube) https://youtu.be/rIGO6zi9XQE?si=nuhdnvdl9pcZ4NV_

Paul Chen — Sun, 17 May 2026 04:22:14 +0000

Synthadoc: Routing at Scale, Quality Gates, and the Knowledge Backend Pattern

Paul Chen — Mon, 11 May 2026 14:21:30 +0000

When we shipped v0.1.0, Synthadoc did one thing well: it turned raw sources into a structured wiki that got smarter with every ingest. v0.2.0 made that wiki searchable with hybrid BM25 + vector retrieval. v0.3.0 opened up the source types - YouTube transcripts, web search fan-out, CLI provider integration so your existing Claude Code or Opencode subscription could power the whole thing.

But a pattern kept surfacing in user feedback. Once a wiki crossed a few hundred pages, three problems appeared in a cluster: queries got slower, low-confidence pages polluted search results, and there was no clean way to pipe the wiki's structured knowledge into an agent prompt without getting back a synthesised answer when you wanted the raw evidence.

v0.4.0 addresses all three. This post walks through the design decisions behind each feature, the benchmark numbers, and why we think the third one, context packs, points at something larger than a feature.

The Scale Problem

BM25 is fast. On a 100-page wiki, a query completes in single-digit milliseconds. The problem is that BM25 is also undiscriminating: it scores every page against every query, regardless of how obviously irrelevant most of those pages are.

That's fine at 100 pages. At 1,000 pages with diverse topics - a personal research wiki covering ML, distributed systems, organisational theory, and management literature - you're running a full-corpus scan for every query. And because query decomposition splits one question into 3–5 sub-questions, you multiply that cost by 3–5 on every call.

We benchmarked the unrouted baseline across corpus sizes. The numbers aren't alarming until they are:

Corpus size	Full-corpus P95 latency	Routed P95 latency (2 of 10 branches)
100 pages	7 ms	7 ms
500 pages	38 ms	12 ms
1,000 pages	74 ms	18 ms
5,000 pages	112 ms	21 ms
10,000 pages	191 ms	24 ms

Routed search stays near-flat as the corpus grows, because the per-branch page count doesn't change even as the total wiki does. Full-corpus search grows with the wiki - not catastrophically, but noticeably, and that growth compounds across decomposed sub-queries.
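To make the compounding concrete, here is a small worked calculation using the table's P95 figures. It assumes the 3-5 sub-questions run one after another; if they run in parallel, the wall-clock cost would be lower, so treat this as an upper-bound sketch rather than a measurement.

```python
# P95 latencies in milliseconds, copied from the benchmark table above.
FULL_CORPUS_MS = {100: 7, 500: 38, 1_000: 74, 5_000: 112, 10_000: 191}
ROUTED_MS = {100: 7, 500: 12, 1_000: 18, 5_000: 21, 10_000: 24}


def decomposed_cost_ms(per_query_ms: int, sub_questions: int) -> int:
    """Total search time if sub-questions run sequentially (an assumption)."""
    return per_query_ms * sub_questions


for pages in FULL_CORPUS_MS:
    full = [decomposed_cost_ms(FULL_CORPUS_MS[pages], n) for n in (3, 5)]
    routed = [decomposed_cost_ms(ROUTED_MS[pages], n) for n in (3, 5)]
    print(f"{pages:>6} pages  full: {full[0]}-{full[1]} ms  routed: {routed[0]}-{routed[1]} ms")
```

Under that assumption, a decomposed query at 10,000 pages spends 573-955 ms in unrouted search versus 72-120 ms routed.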

The other problem at scale is more subtle: a query about treatment protocols for hypertension shouldn't touch the pages about distributed consensus algorithms. Not just for performance reasons - also because irrelevant pages can drift into synthesis and dilute the answer.

Feature 1: Routing Layer

Why It Matters for a Self-Growing Wiki

The scale problem above gets worse in a specific way that's easy to underestimate: a well-configured Synthadoc wiki doesn't stay at its initial size. With nightly scheduled ingests pulling from web searches, PDFs, YouTube transcripts, and curated source lists, a wiki can double from 200 to 400 pages within a few weeks, then reach 1,000 within a few months without the user doing anything manually. That's exactly the point.

But without routing, query quality degrades silently as that growth happens. Every new page increases the BM25 corpus, and the search engine has no way to know that "What are the treatment protocols for hypertension?" should not touch the 300 pages about software architecture. More pages means more irrelevant candidates competing to drift into synthesis, more false positives, and more latency - and none of it is visible until queries start returning noticeably diluted answers.

Routing is what makes autonomous growth sustainable. It's the mechanism that keeps query scope bounded to what's actually relevant, regardless of how large the total wiki becomes. The branch taxonomy is defined once; IngestAgent maintains it automatically from that point forward - every new page created by ingest is auto-placed into the most relevant branch, so ROUTING.md stays accurate as the wiki grows without manual intervention.

The routing layer introduces a file called ROUTING.md at the wiki root. Its format is intentionally simple - the same ## H2 → [[slug]] structure already used in index.md:

## People
- [[alan-turing]]
- [[grace-hopper]]
- [[ada-lovelace]]

## Hardware
- [[von-neumann-architecture]]
- [[eniac-computer]]
- [[transistor-and-microchip]]

## Networks
- [[internet-origins]]
- [[arpanet-history]]

User-owned. Scaffold creates it once from the current index structure and never rewrites it. The user defines the branch taxonomy; the system maintains it.

How Routing Works at Query Time

At query time, QueryAgent reads ROUTING.md (cached per session, invalidated on write), passes the branch headings and the user's query to the LLM, and receives back a short JSON array of the 1–2 most relevant branch names. BM25 then runs only over the slugs listed under those branches.

If ROUTING.md is absent, or if no branch scores above threshold, the system falls back to full-corpus search transparently - no error, no degraded output.
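The scoping step can be sketched in a few lines of Python. This is an illustration of the idea, not Synthadoc's actual code: `parse_routing` and `search_scope` are hypothetical names, and the LLM call that picks the branches is left out (the chosen branch names are simply passed in).

```python
import re

HEADING = re.compile(r"^##\s+(.+?)\s*$")
SLUG = re.compile(r"\[\[([^\]]+)\]\]")


def parse_routing(text: str) -> dict[str, list[str]]:
    """Map each '## Branch' heading to the [[slug]] entries listed under it."""
    branches: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        heading = HEADING.match(line)
        if heading:
            current = heading.group(1)
            branches[current] = []
        elif current is not None:
            branches[current].extend(SLUG.findall(line))
    return branches


def search_scope(branches: dict[str, list[str]], chosen: list[str], all_slugs: list[str]) -> list[str]:
    """Slugs that BM25 should score; falls back to the full corpus when nothing matches."""
    scoped = [slug for name in chosen for slug in branches.get(name, [])]
    return scoped if scoped else list(all_slugs)


routing_md = """## People
- [[alan-turing]]
- [[grace-hopper]]

## Hardware
- [[eniac-computer]]
"""
branches = parse_routing(routing_md)
everything = [slug for slugs in branches.values() for slug in slugs]
print(search_scope(branches, ["People"], everything))
print(search_scope(branches, [], everything))
```

The fallback lives in one place: an empty or unknown branch selection simply yields the whole corpus, which mirrors the "no error, no degraded output" behaviour described above.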

IngestAgent also uses routing: when a new page is created, it's slotted into the most relevant branch automatically. ROUTING.md stays consistent without manual maintenance.

Aliases

The other half of the routing feature is alias resolution. Anyone who maintains a personal knowledge base has personal terminology that diverges from canonical names. You might always call a concept by a shorthand, an acronym, or a translation - and BM25 will miss the connection because the strings don't match.

Aliases live in each page's YAML frontmatter:

---
title: Alan Turing
aliases: [turing, the turing paper, incomputability guy]
---

At query time, before BM25 runs, QueryAgent expands any alias matches in the query to their canonical slug. The user's internal vocabulary resolves to the wiki's vocabulary without re-learning what the LLM decided to call things.

ScaffoldAgent suggests initial aliases when generating a page. Users refine them in Obsidian's Properties panel.
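A minimal sketch of alias expansion, assuming the aliases have already been read from frontmatter into a slug-to-aliases mapping. The function name is hypothetical, and whether Synthadoc replaces the alias or appends the slug alongside it is not stated here; this version replaces it.

```python
import re


def expand_aliases(query: str, aliases: dict[str, list[str]]) -> str:
    """Replace any alias found in the query with its canonical slug.

    Matching is case-insensitive and done in a single pass, with longer
    aliases tried first so "the turing paper" wins over plain "turing".
    """
    lookup = {alias.lower(): slug for slug, names in aliases.items() for alias in names}
    if not lookup:
        return query
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda match: lookup[match.group(0).lower()], query)


aliases = {"alan-turing": ["turing", "the turing paper", "incomputability guy"]}
print(expand_aliases("what did the Turing paper prove?", aliases))
```

Doing the substitution in one pass matters: replacing aliases one at a time would let "turing" match again inside the freshly inserted "alan-turing".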

Protected Scaffold Zone

One problem that appeared as wikis grew: users would hand-edit index.md to add an introduction, a personal note, a custom link - and the next scaffold run would erase it. Scaffold owned the whole file.

v0.4.0 introduces a marker line:

# My Research Wiki

My custom introduction and notes here.
Scaffold never touches anything above this line.

<!-- synthadoc:scaffold -->

## People
- [[alan-turing]] — Theoretical foundations of computation

Scaffold only regenerates content below the <!-- synthadoc:scaffold --> marker. Everything above it is preserved verbatim across every scaffold run. The marker is inserted automatically on the first scaffold run if absent.
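The preserve-above, regenerate-below behaviour can be sketched as a simple split on the marker. `regenerate_index` is a hypothetical name, and the handling of a file with no marker (keep the existing text above a newly inserted marker) is an assumption; the post only says the marker is inserted on the first run.

```python
MARKER = "<!-- synthadoc:scaffold -->"


def regenerate_index(existing: str, generated: str) -> str:
    """Keep everything above the marker; replace everything below it."""
    head, found, _ = existing.partition(MARKER)
    if not found and head and not head.endswith("\n"):
        # No marker yet: keep the existing text as user content (assumption).
        head += "\n"
    return f"{head}{MARKER}\n\n{generated}"


existing = (
    "# My Research Wiki\n\nMy notes.\n\n"
    "<!-- synthadoc:scaffold -->\n\n## People\n- [[alan-turing]]\n"
)
generated = "## People\n- [[alan-turing]]\n- [[grace-hopper]]\n"
print(regenerate_index(existing, generated))
```

Running the function twice with the same generated content yields the same file, which is the property you want from something that runs on a weekly schedule.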

Routing CLI

synthadoc routing init           # generate ROUTING.md from current index (one-time)
synthadoc routing validate       # report dangling slugs — dry run, no changes
synthadoc routing clean          # auto-remove dangling entries

routing validate is worth running after bulk ingests or manual page deletions:

Dangling slugs in ROUTING.md (3):
  [Hardware]  [[eniac-computer]]
  [People]    [[konrad-zuse]]
  [Networks]  [[arpanet-history]]
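Conceptually, validate and clean are the same check with and without a write. A rough sketch, assuming ROUTING.md has already been parsed into a branch-to-slugs mapping and the set of existing page slugs is known (both function names are hypothetical):

```python
def dangling_slugs(routing: dict[str, list[str]], existing_pages: set[str]) -> list[tuple[str, str]]:
    """Return (branch, slug) pairs whose slug has no corresponding wiki page."""
    return [
        (branch, slug)
        for branch, slugs in routing.items()
        for slug in slugs
        if slug not in existing_pages
    ]


def clean_routing(routing: dict[str, list[str]], existing_pages: set[str]) -> dict[str, list[str]]:
    """Drop dangling entries but keep every branch heading, even an empty one."""
    return {
        branch: [slug for slug in slugs if slug in existing_pages]
        for branch, slugs in routing.items()
    }


routing = {
    "People": ["alan-turing", "konrad-zuse"],
    "Hardware": ["eniac-computer"],
}
pages = {"alan-turing"}

found = dangling_slugs(routing, pages)
print(f"Dangling slugs in ROUTING.md ({len(found)}):")
for branch, slug in found:
    print(f"  [{branch}]  [[{slug}]]")
```

Note that `clean_routing` leaves the branch keys alone, matching the rule that cleaning removes entries for missing pages without touching branch structure.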

Scheduling for a Self-Growing Wiki

There are two distinct scheduling patterns worth setting up: nightly growth and weekly housekeeping.

Nightly growth — ingest pulls in new sources automatically:

synthadoc schedule add --op "ingest --batch raw_sources/" --cron "0 2 * * *" -w my-wiki

As each ingest job completes, IngestAgent writes the new page to wiki/ and appends its slug to ROUTING.md under the most relevant branch. No manual step needed — the routing index grows alongside the wiki.

Weekly housekeeping - three operations, run in sequence:

synthadoc schedule add --op "lint run"      --cron "0 3 * * 0" -w my-wiki   # Sunday 3 AM
synthadoc schedule add --op "scaffold"      --cron "0 4 * * 0" -w my-wiki   # Sunday 4 AM
synthadoc schedule add --op "routing clean" --cron "0 5 * * 0" -w my-wiki   # Sunday 5 AM

The order matters. Lint runs first and removes dead wikilinks left behind by deleted pages. Scaffold runs next and regenerates index.md to reflect the current page set - new categories get added, empty ones get removed. Routing clean runs last and prunes any dangling slug entries from ROUTING.md that no longer have a corresponding wiki page. After all three, the index and routing table are consistent with the actual state of the wiki.

One thing to be clear about: routing init is a one-time setup command, not something to schedule. Running it again would overwrite ROUTING.md and erase any branch customisations you've made since the initial setup. routing clean is the recurring maintenance command - it only removes entries for missing pages and never touches branch structure.

If you prefer to declare the schedule in config rather than via the CLI, add a [schedule] block to .synthadoc/config.toml and register everything in one step:

[schedule]
jobs = [
  { op = "ingest --batch raw_sources/", cron = "0 2 * * *" },
  { op = "lint run",                    cron = "0 3 * * 0" },
  { op = "scaffold",                    cron = "0 4 * * 0" },
  { op = "routing clean",               cron = "0 5 * * 0" },
]

synthadoc schedule apply -w my-wiki

With nightly ingests and a weekly maintenance trio in place, the wiki grows, stays accurate, and self-corrects - without requiring manual intervention.

Feature 2: Candidates Staging

The second feature addresses a different scale problem: quality at the write path.

Before v0.4.0, IngestAgent wrote new pages directly to wiki/ with no review step. A high-confidence page about a well-structured source landed right next to a speculative page inferred from a thin web article. Both entered BM25, both appeared in orphan detection and contradiction checks, and both showed up in synthesis - with no signal to distinguish them.

The Staging Concept

Candidates staging introduces a fork in the write path. Pages go to wiki/candidates/ instead of wiki/ when they don't meet a configurable confidence threshold. They're excluded from BM25, orphan detection, and contradiction checks until explicitly promoted.

The policy is configured in .synthadoc/config.toml:

[ingest]
staging_policy = "threshold"      # "off" | "all" | "threshold"
staging_confidence_min = "high"   # "high" | "medium" | "low"

Three policies:

Policy	Behaviour
`"off"`	All new pages go directly to `wiki/` - current behaviour, default
`"threshold"`	Pages meeting `staging_confidence_min` auto-promote; lower confidence → `wiki/candidates/`
`"all"`	Every new page requires explicit promotion, regardless of confidence

The config is hot-reloaded - a policy change takes effect on the next ingest job with no server restart.
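The three policies reduce to one small decision function. This is a sketch of the rule as described in the table, with a hypothetical function name and an assumed ordering of low < medium < high:

```python
# Assumed ordering of the three confidence levels.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def destination(policy: str, page_confidence: str, confidence_min: str = "high") -> str:
    """Return the directory a newly ingested page should be written to."""
    if policy == "off":
        return "wiki/"
    if policy == "all":
        return "wiki/candidates/"
    if policy == "threshold":
        meets = CONFIDENCE_RANK[page_confidence] >= CONFIDENCE_RANK[confidence_min]
        return "wiki/" if meets else "wiki/candidates/"
    raise ValueError(f"unknown staging_policy: {policy!r}")


for confidence in ("low", "medium", "high"):
    print(confidence, "->", destination("threshold", confidence, "high"))
```

With `staging_confidence_min = "high"`, only high-confidence pages skip review; lowering it to `"medium"` lets medium-confidence pages through as well.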

The Staging Workflow

Reviewing Candidates

synthadoc candidates list

Candidates (3):
  eniac-computer    confidence: low     ingested: 2026-05-05
  konrad-zuse       confidence: medium  ingested: 2026-05-05
  arpanet-history   confidence: high    ingested: 2026-05-04

synthadoc candidates promote arpanet-history
synthadoc candidates discard eniac-computer
synthadoc candidates promote --all

Promotion does four things atomically: moves the file from wiki/candidates/ to wiki/, appends the slug to index.md under the best-matching category, appends it to ROUTING.md under the best-matching branch, and records the promotion in the audit trail. Discard deletes the file and records the reason.

Scheduling and Staging

Unlike routing and lint, candidates staging doesn't have a useful scheduled action. The two bulk commands, candidates promote --all and candidates discard --all, are too blunt to schedule safely. Scheduling promote --all on a timer would auto-approve exactly the pages the quality gate held back. Scheduling discard --all would silently delete pages you haven't reviewed yet. Neither command has a --older-than or --confidence filter that would make scheduled execution sensible.

The right integration is manual: run candidates list after a large batch ingest, or once a week if the wiki is growing quickly. It takes seconds. The review step is the intentional human checkpoint in an otherwise automated pipeline: it's where you decide what enters the wiki's searchable knowledge base.

If candidates are accumulating faster than you can review them, the right response is to adjust the policy rather than schedule a cleanup:

# Widen the auto-promote threshold - fewer pages need review
synthadoc staging policy threshold --min-confidence medium

# Or turn staging off entirely - if you trust all your sources
synthadoc staging policy off

The staging policy is the dial; the CLI review commands are for the remainder that needs a human decision.

Why This Matters

The practical effect of staging_policy = "threshold" on a high-volume wiki is significant. After ingesting 30 web articles from varied sources, typical results look like: 22 high-confidence pages auto-promote and enter the main wiki immediately; 8 lower-confidence pages wait in candidates. Those 8 include speculative pages, ambiguous slug assignments, and pages where the source was thin enough that the LLM wasn't confident what it was describing.

In the full-corpus approach, those 8 pages would be silently diluting every query that touched their topic. In the staged approach, they're visible and actionable.

Feature 3: Context Packs

The third feature came from a different direction. Users weren't asking "how do I get a better answer?" They were asking "how do I get the raw evidence so I can do something with it myself?"

QueryAgent synthesises. That's its job. But synthesis isn't always what you want. Sometimes you want the actual page excerpts - cited, bounded, ranked by relevance - to paste into an agent prompt, to attach to a report, to review before a meeting, or to feed into an automated pipeline.

Context packs are the answer.

How Context Packs Work

The ContextAgent reuses QueryAgent.decompose() and HybridSearch. It doesn't synthesise - it packs. Each page excerpt is included verbatim (up to its per-page limit), with attribution: slug, relevance score, confidence level, tags, source path.

Output Format

# Context Pack: history of early computing pioneers
Generated: 2026-05-09T14:22:01
Token budget: 4000 | Used: 3847 | Omitted: 2 pages (budget exceeded)

---

## [[alan-turing]] - relevance: 0.92
> Alan Turing developed the theoretical foundations of computation with his 1936
> paper "On Computable Numbers." The paper introduced the abstract Turing machine
> and proved the existence of undecidable problems...
Source: `wiki/alan-turing.md` | Confidence: high | Tags: people, mathematics

## [[grace-hopper]] - relevance: 0.87
> Grace Hopper pioneered compiler development and made programming accessible to
> humans. She developed the first compiler (A-0) in 1952 and later led the team
> that created COBOL...
Source: `wiki/grace-hopper.md` | Confidence: high | Tags: people, software

---

## Omitted — token budget exceeded
- [[von-neumann-architecture]] — ~820 tokens
- [[eniac-computer]] — ~650 tokens

CLI Usage

synthadoc context build "microservices patterns"
synthadoc context build "microservices patterns" --tokens 8000
synthadoc context build "microservices patterns" --output context.md

Token Budget Control

This is where context packs differ from simply querying the wiki. An external agent doesn't want an unbounded blob of text - it has its own context window to manage, its own prompt already taking up tokens, and its own cost constraints. The token_budget parameter gives the caller precise control over how much of the wiki's knowledge gets included.

Synthadoc packs pages greedily by relevance score until the budget is exhausted, then lists everything that didn't fit with estimated token counts. The calling agent knows exactly what it got and what it didn't - and can decide whether to request a larger budget, run a second more focused query, or proceed with what's there. There are no surprises about how much context the call will consume.
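Greedy packing by relevance is easy to sketch. The version below is illustrative only: the `Page` type and `pack` function are hypothetical, the per-page token counts and the relevance scores of the omitted pages are made-up numbers chosen to reproduce the 3847-of-4000 example above, and it assumes a page that doesn't fit is skipped while later, smaller pages may still be tried. The real packer might instead stop at the first page that overflows.

```python
from dataclasses import dataclass


@dataclass
class Page:
    slug: str
    relevance: float
    tokens: int  # estimated token count of the excerpt


def pack(pages: list[Page], token_budget: int) -> tuple[list[Page], list[Page], int]:
    """Highest relevance first; anything that no longer fits goes to `omitted`."""
    included, omitted, used = [], [], 0
    for page in sorted(pages, key=lambda p: p.relevance, reverse=True):
        if used + page.tokens <= token_budget:
            included.append(page)
            used += page.tokens
        else:
            omitted.append(page)
    return included, omitted, used


candidates = [
    Page("eniac-computer", 0.75, 650),
    Page("alan-turing", 0.92, 2000),
    Page("von-neumann-architecture", 0.80, 820),
    Page("grace-hopper", 0.87, 1847),
]
included, omitted, used = pack(candidates, token_budget=4000)
print(f"Token budget: 4000 | Used: {used} | Omitted: {len(omitted)} pages")
```

The important property is the invariant, not the exact strategy: `used` never exceeds the budget, and everything left out is reported with its estimated size.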

This predictability is what makes context packs suitable for production agentic pipelines. An agent orchestrator can reserve a fixed token slice for domain knowledge, call context/build with that exact budget, and get back a response that fits - every time, regardless of how large the underlying wiki has grown.

The REST API - Synthadoc as a Knowledge Backend

Context packs expose a POST /context/build endpoint that returns the same data as structured JSON. This is where the pattern becomes interesting.

An agent that needs grounding context before reasoning can call Synthadoc's REST API directly, get back a bounded, cited, ranked set of page excerpts, and inject them into its own prompt - without going through a synthesis step. Synthadoc becomes the knowledge layer, the agent provides the reasoning.

POST /context/build
{ "goal": "microservices patterns for high-throughput event processing", "token_budget": 4000 }

200 OK
{
  "goal": "...",
  "token_budget": 4000,
  "tokens_used": 3847,
  "pages": [
    {
      "slug": "event-driven-architecture",
      "relevance": 0.94,
      "excerpt": "Event-driven architecture decouples producers from consumers...",
      "source": "wiki/event-driven-architecture.md",
      "confidence": "high",
      "tags": ["architecture", "distributed-systems"]
    }
  ],
  "omitted": [
    { "slug": "kafka-internals", "estimated_tokens": 980 }
  ]
}

The existing MCP server exposes this as a native tool call. An agent running in any MCP-compatible host can call context/build before it reasons, get structured evidence back, and proceed with grounding it wouldn't otherwise have.

This is what we mean by the "knowledge backend pattern": Synthadoc manages accumulation, deduplication, contradiction detection, and retrieval. The calling agent manages reasoning and action. The division of labour is clean, the token envelope is caller-controlled, and the knowledge layer is persistent across agent sessions.
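From the calling agent's side, the pattern looks roughly like this. The sketch uses only the Python standard library; the request and response fields follow the JSON shown above, while the function names, the server address, and the absence of authentication are assumptions.

```python
import json
import urllib.request


def build_context_request(base_url: str, goal: str, token_budget: int) -> urllib.request.Request:
    """Prepare the POST /context/build request without sending it."""
    body = json.dumps({"goal": goal, "token_budget": token_budget}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/context/build",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_context(base_url: str, goal: str, token_budget: int) -> dict:
    """Send the request and decode the JSON context pack."""
    request = build_context_request(base_url, goal, token_budget)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def to_prompt_block(pack: dict) -> str:
    """Render the returned excerpts as cited evidence for an agent prompt."""
    lines = []
    for page in pack["pages"]:
        lines.append(f"[{page['slug']}] (relevance {page['relevance']}, confidence {page['confidence']})")
        lines.append(page["excerpt"])
        lines.append("")
    return "\n".join(lines).rstrip()


sample = {
    "pages": [
        {
            "slug": "event-driven-architecture",
            "relevance": 0.94,
            "excerpt": "Event-driven architecture decouples producers from consumers...",
            "confidence": "high",
        }
    ],
    "omitted": [{"slug": "kafka-internals", "estimated_tokens": 980}],
}
print(to_prompt_block(sample))
```

An orchestrator would call `fetch_context(...)` with whatever slice of its context window it has reserved for domain knowledge, then prepend `to_prompt_block(...)` to its own prompt. The `omitted` list tells it what a larger budget would buy.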

Other v0.4.0 Changes

Plugin Install CLI

synthadoc plugin install history-of-computing

In earlier versions, installing the Obsidian plugin required locating the plugins directory manually, copying files, and restarting Obsidian. v0.4.0 adds a single CLI command that installs the plugin directly into the active Obsidian vault. That's the entire CLI step - the rest is done in Obsidian.

AI Research Demo: Contradiction Detection End-to-End

The demos/ai-research/ demo now includes a PDF source (llm-benchmarks-q1-2026.pdf) that explicitly disputes a claim in the existing wiki. Running the demo shows the full contradiction detection and flagging lifecycle: source ingested, existing page status updated to contradicted, audit event recorded. The demo now covers all five IngestAgent decision paths - create, update, skip, flag, and contradiction.

Decision Cache Prompt-Awareness

A quiet but consequential fix: the LLM decision cache key previously included only the content hash and existing slugs. This meant that changes to purpose.md - the file that scopes what belongs in the wiki - were invisible to the cache. An ingest run after a purpose change would serve stale decisions for any source whose content hadn't changed. v0.4.0 includes the decision prompt itself in the cache key, so any change to purpose.md automatically busts the cache for all affected sources.
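The idea behind the fix can be shown with a hash over all three inputs. This is a sketch, not the actual key derivation: the function name, the use of SHA-256, and the way the fields are joined are all assumptions.

```python
import hashlib


def decision_cache_key(content_hash: str, existing_slugs: list[str], decision_prompt: str) -> str:
    """Derive a cache key that changes whenever any of the three inputs changes."""
    digest = hashlib.sha256()
    for part in (content_hash, "\n".join(sorted(existing_slugs)), decision_prompt):
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so adjacent fields cannot run together
    return digest.hexdigest()


slugs = ["alan-turing", "grace-hopper"]
before = decision_cache_key("abc123", slugs, "Purpose: history of computing")
after = decision_cache_key("abc123", slugs, "Purpose: history of computing and AI")
print(before != after)
```

Because the prompt is built from purpose.md, editing that file changes the prompt text and therefore the key, even when the source content hash is identical.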

v0.1 to v0.4: What Changed

It's worth stepping back to see what the four versions have built:

Version	Core addition
v0.1.0	Ingest-time synthesis: sources become a structured wiki, not raw chunks
v0.2.0	Hybrid BM25 + vector search; query decomposition; knowledge gap detection; full audit trail
v0.3.0	YouTube and web search ingestion; CLI provider integration (Claude Code & Opencode support); CJK support; contradiction detection improvements
v0.4.0	Routing at scale; candidates staging for quality control; context packs as a knowledge backend API

The first two versions established the ingest-then-query model and made it trustworthy - every action audited, every contradiction surfaced. The third version expanded what you could ingest and who could afford to run it. The fourth version addresses what happens when the wiki grows to a size where the original flat model starts to show cracks.

The routing benchmarks are honest about where we are: 191ms for a full-corpus query across 10,000 pages is fine for interactive use, but it compounds across sub-questions and becomes a real cost in high-throughput agentic pipelines. The routed 24ms figure at the same corpus size is where we want the system to be. The benchmark-gated release process we introduced in v0.4.0 is the mechanism that ensures it stays there as the codebase evolves.

Try Synthadoc v0.4.0

Synthadoc v0.4.0 is available now on GitHub under the AGPL-3.0 licence. The quickest path:

git clone https://github.com/axoviq-ai/synthadoc.git
cd synthadoc
pip3 install -e ".[dev]"
synthadoc install history-of-computing --target ~/wikis --demo
synthadoc plugin install history-of-computing
synthadoc use history-of-computing
synthadoc serve

Then open Obsidian, open ~/wikis/history-of-computing as a vault, install the Dataview community plugin, and enable the Synthadoc plugin. The demo wiki runs against Gemini 2.0 Flash, which is free-tier eligible, so a full ingest-query-lint cycle costs nothing to run.

GitHub: https://github.com/axoviq-ai/synthadoc
Quick-start guide: https://github.com/axoviq-ai/synthadoc/blob/main/docs/user-quick-start-guide.md
Design document: https://github.com/axoviq-ai/synthadoc/blob/main/docs/design.md
Release notes: https://github.com/axoviq-ai/synthadoc/releases/tag/v0.4.0

Feedback is welcome. The routing taxonomy and context pack output format are both early - if your use case pushes against their current shape, we want to know.

Synthadoc: Your Coding Tool Is Now Your Wiki Brain

Paul Chen — Tue, 05 May 2026 16:02:35 +0000

If you use Claude Code or Opencode, you are already paying for an LLM subscription.

Before v0.3.0, running Synthadoc also required a separate API key - Anthropic, OpenAI, Gemini, or one of the others.

v0.3.0 removes that requirement. Set provider = "claude-code" in one config file and your coding tool subscription becomes the brain of your personal wiki. No additional API key. No additional cost.

Why This Matters Beyond Convenience

The obvious win is billing consolidation. But there is a more interesting point underneath it.

Coding tools like Claude Code and Opencode started as pair programmers. They answer questions about your codebase. They write functions, fix bugs, explain unfamiliar code. That is what the marketing says.

What they actually are is a general-purpose LLM with a subscription model - capable of anything the underlying model can do, not just code. The coding framing is a UI convention, not a capability limit.

Synthadoc v0.3.0 makes that explicit. The same Claude subscription you use to navigate a TypeScript monorepo can now synthesize your research documents, detect contradictions in your notes, and answer structured questions about your domain knowledge. The subscription becomes infrastructure, not a point tool.

This is the same pattern that made cloud storage interesting once it stopped being "backup for photos" and became "filesystem for any application." The value is in the generality.

How It Works

Synthadoc's LLM providers are abstracted behind a single LLMProvider interface. Every agent - IngestAgent, QueryAgent, LintAgent - calls provider.complete() and receives a structured response. The provider implementation handles everything underneath: API calls, authentication, response parsing, error handling, quota detection.

For cloud APIs (Anthropic, OpenAI, Gemini), the provider sends an HTTP request. For Claude Code and Opencode, a new CodingToolCLIProvider class takes a different route: it launches the CLI tool as a subprocess, passes the prompt via stdin, reads the response from stdout, and parses the output.
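The subprocess route can be sketched with the standard library alone. This is not the CodingToolCLIProvider implementation: `complete_via_cli` is a hypothetical helper, and because the real CLI invocation and its flags are not shown in this post, the example drives a stand-in command (a Python one-liner that echoes stdin in upper case) so it runs anywhere.

```python
import subprocess
import sys


def complete_via_cli(command: list[str], prompt: str, timeout: float = 120.0) -> str:
    """Run a CLI tool, send the prompt on stdin, and return its stdout."""
    result = subprocess.run(
        command,
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if result.returncode != 0:
        raise RuntimeError(f"CLI exited with {result.returncode}: {result.stderr.strip()}")
    return result.stdout.strip()


# Stand-in for a real coding-tool CLI: echoes stdin back in upper case.
echo_upper = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
print(complete_via_cli(echo_upper, "hello"))
```

A real provider would add output parsing and quota detection on top of this, but the shape is the same: one process per completion, with the prompt going in on stdin and the answer coming back on stdout.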

# .synthadoc/config.toml
[agents]
default = { provider = "claude-code" }
lint    = { provider = "claude-code" }

That is the entire configuration change. No API keys to set. No environment variables.

The server also accepts a runtime override:

synthadoc serve -w my-wiki --provider claude-code

This overrides config.toml for the lifetime of the server process - useful for quickly switching providers without editing files.

Figure 1 - All agents share one provider interface. CLI providers slot in without changing anything else in the stack.

What Works, What Does Not

The CLI provider route is not a perfect substitute for a direct API connection. Two limitations matter:

No vector embeddings. Synthadoc's optional semantic re-ranking (hybrid BM25 + vector search) requires an embed() call, which the CLI tools do not expose. When using a CLI provider, search falls back to BM25-only. For most wikis up to a few hundred pages, BM25 is fast and accurate - this only matters if you have enabled vector search and are running a large corpus.

Quota is shared. Your CLI provider subscription has a daily or session quota. Heavy ingest operations - batch-ingesting 50 documents, for example - consume from the same budget as your coding work. The engine detects quota exhaustion, permanently fails the job with a clear message, and does not retry. You resume after the quota resets.

Both are manageable trade-offs for a personal wiki. If you need vector search or are running a high-volume ingest, a direct API key is the better choice — pick from any of the eight providers based on your quality, vision, and budget requirements. If you want zero additional configuration and are comfortable with BM25 search, the CLI provider path is clean.

The Provider Landscape in v0.3.0

One of the quieter design decisions in Synthadoc is that the LLM provider is not baked into the product. You pick the one that fits your constraints:

Provider	API Key Required	Free Tier	Vision	Notes
Gemini Flash	Yes	1M tokens/day, no credit card	Yes	Default; best free option
Groq	Yes	Rate-limited	No	Fast, good for text-only
Ollama	No	Fully local	Model-dependent	Runs on your machine
DeepSeek	Yes	No (cheap)	No	Very low cost per token
Anthropic	Yes	No	Yes
OpenAI	Yes	No	Yes
MiniMax	Yes	No	Yes
Claude Code	No	Included with subscription	No	New in v0.3.0
Opencode	No	Included with subscription	No	New in v0.3.0

Figure 2 — Per-agent model routing from a single subscription. Ingest and query use Opus for synthesis quality; lint uses Haiku for speed and lower quota consumption. Two config lines; one bill.

DeepSeek is also new in v0.3.0. It routes through the OpenAI-compatible endpoint, offers a very low cost per token for text-heavy ingest workloads, and the R1 reasoning model's <think> tags are stripped automatically before the response reaches Synthadoc.

The point of this table is not that any one provider is the best. It is that you should not need to change your workflow to use the tool. If you are already set up with Anthropic for Claude Code, switching your wiki to the same provider takes one config line.

Quota Exhaustion and Reliability

One detail the implementation gets right: quota exhaustion is a permanent failure, not a retryable one.

Most LLM API errors are transient - a timeout, a 5xx from an overloaded endpoint. Synthadoc retries those with exponential backoff. Quota exhaustion is different. Retrying a quota-exhausted call wastes the next request, then the one after that, burning whatever small remaining budget exists.

When a CodingToolCLIProvider detects that the subprocess output indicates quota exhaustion, it raises CodingToolQuotaExhaustedException. The orchestrator catches this and calls fail_permanent() on the job - no retries, no backoff. The job sits in failed state with a clear message: "Claude Code quota exhausted. Resume after quota resets." You do not return to find a queue of 50 failed retries.
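The distinction between the two failure classes can be sketched as follows. The exception classes, `run_job`, and the returned status dictionaries are hypothetical stand-ins for illustration; they are not the orchestrator's real types, and the retry count and base delay are arbitrary.

```python
import time


class TransientError(Exception):
    """Timeouts, 5xx responses: worth retrying."""


class QuotaExhausted(Exception):
    """Retrying cannot help until the quota resets."""


def run_job(call, max_attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep) -> dict:
    """Retry transient errors with exponential backoff; fail fast on quota."""
    for attempt in range(max_attempts):
        try:
            return {"status": "done", "result": call()}
        except QuotaExhausted as exc:
            # Permanent failure: no retry, no backoff.
            return {"status": "failed_permanent", "reason": str(exc), "attempts": attempt + 1}
        except TransientError as exc:
            if attempt == max_attempts - 1:
                return {"status": "failed", "reason": str(exc), "attempts": attempt + 1}
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...


def exhausted():
    raise QuotaExhausted("Claude Code quota exhausted. Resume after quota resets.")


print(run_job(exhausted))
```

The quota branch returns after a single attempt, which is the whole point: one failed job with a clear message instead of a queue of doomed retries.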

Small detail. Meaningful when it happens.

Getting Started with Claude Code or Opencode Provider

Prerequisite: Claude Code or Opencode must already be installed and logged in on your machine. If not, follow the Claude Code setup guide or the Opencode setup guide first.

# 1. Clone and install
git clone https://github.com/axoviq-ai/synthadoc.git
cd synthadoc
pip3 install -e ".[dev]"

# 2. Install the demo wiki
# Linux / macOS
synthadoc install history-of-computing --target ~/wikis --demo

# Windows
synthadoc install history-of-computing --target %USERPROFILE%\wikis --demo

# 3. Set as the active wiki (no -w needed from here on)
synthadoc use history-of-computing

4. Configure the CLI provider - edit ~/wikis/history-of-computing/.synthadoc/config.toml:

[agents]
default = { provider = "claude-code" }

Or skip editing config.toml entirely and pass --provider at startup:

# 5. Start the engine - no API key needed
synthadoc serve --provider claude-code

6. Install the Obsidian plugin - from the cloned synthadoc/ repo directory:

# Linux / macOS
cd ~/synthadoc   # or wherever you cloned it
vault=~/wikis/history-of-computing
mkdir -p "$vault/.obsidian/plugins/synthadoc"
cp obsidian-plugin/main.js obsidian-plugin/manifest.json "$vault/.obsidian/plugins/synthadoc/"

# Windows (cmd.exe)
cd %USERPROFILE%\synthadoc
mkdir "%USERPROFILE%\wikis\history-of-computing\.obsidian\plugins\synthadoc"
copy obsidian-plugin\main.js "%USERPROFILE%\wikis\history-of-computing\.obsidian\plugins\synthadoc\"
copy obsidian-plugin\manifest.json "%USERPROFILE%\wikis\history-of-computing\.obsidian\plugins\synthadoc\"

Then fully quit and reopen Obsidian, and apply these settings:

Settings → Community plugins → Synthadoc → Enable, set Server URL to http://127.0.0.1:7070
Settings → Community plugins → Browse → search "Dataview" → Install → Enable (required for the live dashboard)

The full configuration reference is in docs/design.md - Coding tool CLI providers.

What Else Is in v0.3.0

The CLI provider work is one piece of a larger release. v0.3.0 also ships:

YouTube video ingestion - paste a URL, get a structured wiki page with executive summary and [MM:SS] timestamped transcript
Web search fan-out - one search query decomposes into sub-questions, ingests multiple sources, builds cross-references automatically
CJK multilingual support - Chinese, Japanese, and Korean queries no longer trigger false knowledge-gap reports
Knowledge gap detection hardening - signal 5 redesigned for deterministic multi-aspect query scoring
DeepSeek provider - eighth provider, OpenAI-compatible, very low cost

Full release notes: github.com/axoviq-ai/synthadoc/releases/tag/v0.3.0.

Synthadoc is open source under AGPL-3.0 at github.com/axoviq-ai/synthadoc.

If you are already using Claude Code or Opencode daily, the marginal cost of running a structured personal wiki on top of the same subscription is zero. The marginal benefit compounds over time. That is a trade worth making.

Synthadoc: From YouTube to Wiki: How v0.3.0 Turns Any Content into Structured Knowledge

Paul Chen — Mon, 04 May 2026 19:10:00 +0000

You have a YouTube playlist of 40 conference talks. You have bookmarked 200 web articles. You have a folder of PDFs you keep meaning to read.

None of that is knowledge yet. It is a queue.

Synthadoc v0.3.0 drains the queue.

The Problem with "Saving" Things

Most people have a system for collecting information. Bookmarks, Notion pages, Pocket, starred emails. The collection grows. The retrieval never quite works. You remember you saved something about transformer attention mechanisms six months ago but cannot find it. You watch a 45-minute conference talk, absorb maybe 30% of it, and have no structured record of the rest.

The issue is not storage. It is synthesis. Saving a link preserves a pointer. It does not extract the claim, connect it to what you already know, or surface the contradiction with something you read last week.

Synthadoc v0.1.0 solved this for documents - PDFs, Word files, spreadsheets, images. v0.2.0 added hybrid BM25 + vector search so retrieval stayed sharp as the wiki grew. v0.3.0 extends the ingest surface to the two sources where most knowledge actually lives in 2026: video and the live web.

Ingesting a YouTube Video

The workflow is a single command:

synthadoc ingest "https://www.youtube.com/watch?v=dQw4w9WgXcQ"

Or from Obsidian: open the Ingest: from URL... modal, paste the YouTube link, press Ingest.

What happens next:

Synthadoc fetches the video's caption track - no audio download, no third-party transcription API, no API key required.
The transcript is chunked with embedded [MM:SS] timestamps preserved so every claim is traceable to a specific moment in the video.
The LLM generates an executive summary: what the video is about, the main topics covered, and the key takeaway - in three to five sentences.
The full timestamped transcript follows the summary in the wiki page.
Cross-references to existing wiki pages are built automatically during ingest. If your wiki already has a page on "attention mechanisms" and the video mentions it, a [[attention-mechanisms]] wikilink appears in the new page.

Figure 1 — The YouTube ingest pipeline. One URL in; a structured, cross-referenced wiki page out.
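
The chunking step can be sketched roughly as follows. This is an illustration under assumptions, not Synthadoc's implementation: the `Caption` type, the `chunk_captions` function, and the `max_chars` limit are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    start_seconds: float  # offset of the caption from the start of the video
    text: str


def format_timestamp(seconds: float) -> str:
    """Render an offset in seconds as [MM:SS]."""
    minutes, secs = divmod(int(seconds), 60)
    return f"[{minutes:02d}:{secs:02d}]"


def chunk_captions(captions: list[Caption], max_chars: int = 200) -> list[str]:
    """Group consecutive captions into chunks (hypothetical rule: a size cap).

    Each chunk is prefixed with the timestamp of its first caption, so the
    text stays traceable to a moment in the video.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_start = 0.0
    for cap in captions:
        size = sum(len(text) + 1 for text in current)
        if current and size + len(cap.text) > max_chars:
            chunks.append(f"{format_timestamp(current_start)} {' '.join(current)}")
            current = []
        if not current:
            current_start = cap.start_seconds
        current.append(cap.text)
    if current:
        chunks.append(f"{format_timestamp(current_start)} {' '.join(current)}")
    return chunks
```

Keeping the timestamp inside the chunk text, rather than in separate metadata, means it survives any later step that only sees the page body.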

The result is a wiki page that looks like this:

---
title: "The Illustrated Transformer"
status: active
confidence: medium
created: 2026-05-03T14:22:01
sources: [youtube.com/watch?v=...]
---

**Executive summary:** Jay Alammar's visual walkthrough of the Transformer
architecture. Covers self-attention, multi-head attention, positional
encoding, and the encoder-decoder structure using animated diagrams.
Key takeaway: the attention mechanism allows each token to "look at" every
other token in the sequence simultaneously, which is what enables parallelism
over RNNs.

---

[00:42] The problem with sequence models is that they process tokens
one at a time, making parallelisation during training difficult...

[03:15] Self-attention computes a weighted sum of all values in the
sequence. The weights come from a compatibility function between
a query and all keys...

No manual work. The video is now part of your wiki.

Web Search Fan-Out

The YouTube capability sits alongside a web search feature that works differently from a standard web search.

synthadoc ingest "search for: transformer attention mechanisms 2025"

Synthadoc does not return ten blue links. It:

Decomposes the query into 3–5 sub-questions covering different facets of the topic.
Searches the web for each sub-question independently.
Ingests the top results for each, synthesizing each source into a wiki page.
Builds cross-references across all the newly created pages.

A single web search command can add eight to fifteen structured pages to your wiki in one operation. The result is not a reading list - it is synthesized knowledge, cross-referenced and ready to query.
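
The selection logic behind a fan-out like this might look like the sketch below. It assumes the sub-questions have already been produced by the LLM and that the search backend is passed in as a function; `fan_out`, `per_question`, and `max_urls` are hypothetical names, and the default values are placeholders rather than Synthadoc's settings.

```python
from typing import Callable


def fan_out(
    sub_questions: list[str],
    search: Callable[[str], list[str]],
    per_question: int = 3,
    max_urls: int = 15,
) -> list[str]:
    """Run one search per sub-question and return the URLs to ingest.

    URLs are deduplicated across sub-questions, kept in first-seen order,
    and capped at `max_urls` in total.
    """
    seen: set[str] = set()
    selected: list[str] = []
    for question in sub_questions:
        for url in search(question)[:per_question]:
            if url in seen:
                continue  # already selected via another sub-question
            seen.add(url)
            selected.append(url)
            if len(selected) >= max_urls:
                return selected
    return selected
```

Deduplicating before ingest matters because overlapping sub-questions often surface the same article, and each ingest costs tokens.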

Why Timestamps Matter

One detail worth pausing on: the [MM:SS] timestamps in the transcript are not decoration.

When you later ask:

synthadoc query "what did the transformer paper say about positional encoding?"

The answer includes the source citation. Because the timestamp is embedded in the page body, the citation points not just to the video but to the moment in the video. You can verify the claim in thirty seconds by jumping to that timestamp.

This is the same principle that makes citations in academic papers useful. The claim is not just "somewhere in this source." It is traceable.
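
To make the traceability concrete, here is a small sketch that turns the [MM:SS] markers in a page body into links that open the video at that moment. The `timestamp_links` helper is hypothetical and not part of Synthadoc; the `t=<seconds>s` URL parameter is standard YouTube behaviour.

```python
import re

# Matches markers such as [00:42] or [103:15]
TIMESTAMP = re.compile(r"\[(\d{1,3}):(\d{2})\]")


def timestamp_links(page_body: str, video_url: str) -> list[tuple[str, str]]:
    """Return (marker, deep link) pairs for every [MM:SS] marker in the body."""
    separator = "&" if "?" in video_url else "?"
    links = []
    for match in TIMESTAMP.finditer(page_body):
        seconds = int(match.group(1)) * 60 + int(match.group(2))
        links.append((match.group(0), f"{video_url}{separator}t={seconds}s"))
    return links
```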

The Ingest Surface in v0.3.0

With v0.3.0, Synthadoc can ingest from the following source types in a single unified pipeline:

Source	How
PDF, Word, XLSX, CSV, TXT	Direct file extraction
Images (PNG, JPG, WEBP, etc.)	Vision LLM extracts text and structure
Web pages and articles	URL fetch + synthesis
YouTube videos	Caption extraction + executive summary
Web search results	Multi-query fan-out + synthesis
PowerPoint / presentations	Slide text extraction

Every source type produces the same output: a structured Markdown wiki page with frontmatter, wikilinks to related pages, and a traceable source reference.
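
Because the output format is uniform, downstream tooling can treat every page the same way. The sketch below reads the frontmatter fields and wikilink targets from a page; it is a deliberately simple parser written for this post (values are kept as raw strings, no YAML library), and `split_page` and `wikilinks` are hypothetical helper names.

```python
import re

# Captures the target of [[target]], [[target|alias]] or [[target#heading]]
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def split_page(markdown: str) -> tuple[dict[str, str], str]:
    """Split a wiki page into (frontmatter fields, body)."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, markdown
    fields: dict[str, str] = {}
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":  # closing delimiter of the frontmatter
            return fields, "\n".join(lines[index + 1:])
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip('"')
    return {}, markdown  # no closing delimiter: treat everything as body


def wikilinks(body: str) -> list[str]:
    """Return the link targets found in the page body, in order."""
    return [match.group(1).strip() for match in WIKILINK.finditer(body)]
```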

Figure 2 — Every source type feeds the same pipeline. The wiki grows; the query quality compounds.

What the Wiki Looks Like After 30 Days

The compounding effect is the real story. After 30 days of normal usage - ingesting the things you would have saved anyway - a Synthadoc wiki typically contains:

50–150 pages covering your domain
Automatic cross-references linking related concepts
Contradiction flags where two sources disagree (Synthadoc surfaces these, you resolve them)
Orphan detection for pages no other page links to yet
Full audit trail of what was ingested, when, and at what cost

The 50th query to this wiki is dramatically smarter than the first, because every previous ingest has built the structure the query runs against.

That is the core idea from v0.1.0, still true - but now the inputs include everything.

Getting Started

# 1. Clone and install
git clone https://github.com/axoviq-ai/synthadoc.git
cd synthadoc
pip3 install -e ".[dev]"

2. Set an API key — Gemini Flash is the default (free tier, 1M tokens/day, no credit card).
Get a key at aistudio.google.com/app/apikey, then:

# macOS / Linux
export GEMINI_API_KEY=AIza…

# Windows
set GEMINI_API_KEY=AIza…

No API key? If you already have Claude Code or Opencode, set provider = "claude-code" in your wiki's config.toml instead - see docs/design.md.

# 3. Install the demo wiki
# Linux / macOS
synthadoc install history-of-computing --target ~/wikis --demo

# Windows
synthadoc install history-of-computing --target %USERPROFILE%\wikis --demo

# 4. Set as the active wiki (no -w needed from here on)
synthadoc use history-of-computing

# 5. Start the engine
synthadoc serve

6. Install the Obsidian plugin — from the cloned synthadoc/ repo directory, copy the pre-built plugin into your vault:

# Linux / macOS (run from the synthadoc/ repo root)
cd ~/synthadoc   # or wherever you cloned it
vault=~/wikis/history-of-computing
mkdir -p "$vault/.obsidian/plugins/synthadoc"
cp obsidian-plugin/main.js obsidian-plugin/manifest.json "$vault/.obsidian/plugins/synthadoc/"

# Windows (cmd.exe — run from the synthadoc\ repo root)
cd %USERPROFILE%\synthadoc
mkdir "%USERPROFILE%\wikis\history-of-computing\.obsidian\plugins\synthadoc"
copy obsidian-plugin\main.js "%USERPROFILE%\wikis\history-of-computing\.obsidian\plugins\synthadoc\"
copy obsidian-plugin\manifest.json "%USERPROFILE%\wikis\history-of-computing\.obsidian\plugins\synthadoc\"

Then fully quit and reopen Obsidian, and:

Settings → Community plugins → Synthadoc → Enable, set Server URL to http://127.0.0.1:7070
Settings → Community plugins → Browse → search "Dataview" → Install → Enable (required for the live dashboard)

# 7. Ingest a YouTube video
synthadoc ingest "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"

Synthadoc is open source under AGPL-3.0. The full quick-start guide, architecture docs, and demo wiki are at github.com/axoviq-ai/synthadoc.

👉 README: https://github.com/axoviq-ai/synthadoc#readme
👉 Quick-start guide: https://github.com/axoviq-ai/synthadoc/blob/main/docs/user-quick-start-guide.md

Synthadoc v0.3.0 also ships CJK multilingual query support, knowledge gap detection hardening, a DeepSeek provider, and coding tool CLI providers (Claude Code, Opencode) - no separate API key needed if you already have a coding tool subscription. Full release notes in docs/design.md.

Synthadoc: Beyond Keyword Search - How It Combines BM25 and Vector Search to Build a Smarter Domain Wiki

Paul Chen — Mon, 27 Apr 2026 00:21:34 +0000

What is Synthadoc?

Synthadoc is an open-source, LLM-powered wiki engine. Point it at your organisation's documents - PDFs, PPTX, spreadsheets, DOCX, images, or web pages - and it builds a persistent, structured knowledge base your team can query, audit, and extend over time.

Unlike general-purpose RAG pipelines that retrieve raw chunks at query time and discard results afterwards, Synthadoc compiles knowledge at ingest time into a living wiki that grows smarter and more consistent with every new source. The core lifecycle is:

Ingest: extract and synthesise facts from any source format (PDF, XLSX, PNG, web URL)
Detect: flag contradictions with existing pages and quarantine them for review
Link: connect related pages and surface knowledge gaps
Query: answer questions with hybrid BM25 + optional vector search, citing the pages used
Lint: resolve contradictions and surface orphan pages for human or automated action

Synthadoc is designed for organisations that need domain-specific, auditable knowledge management: legal teams tracking regulatory precedent, financial analysts maintaining market research, engineering groups documenting system behaviour, and research teams building institutional memory that persists beyond individual contributors.

Synthadoc v0.2.0 was released last week. It scales seamlessly while maintaining accuracy through autonomous self-optimization.

👉 Synthadoc GitHub: https://github.com/axoviq-ai/synthadoc

Introduction

Most LLM knowledge tools take one of two approaches to retrieval: pure keyword search (fast but vocabulary-dependent) or pure vector/semantic search (flexible but resource-intensive). In practice, both have meaningful blind spots.

Synthadoc v0.2.0 ships a hybrid retrieval pipeline that uses BM25 as a fast, precise first-pass filter and optional vector re-ranking as a semantic second pass. The result is a system that is accurate on exact-match queries, robust on paraphrased or conceptual queries, and fast enough to run on a laptop with no cloud dependency.

This post explains how each technique works, where each one falls short alone, why the hybrid matters for a persistent domain wiki, and how Synthadoc v0.2.0 layers query decomposition and knowledge gap detection on top.

What Is BM25?

BM25 (Best Match 25) is a probabilistic ranking function. It scores a page relative to a query by counting how often query terms appear in the page, discounting terms that appear in almost every page, and penalising very long pages for artificially inflated counts. BM25 is the retrieval backbone of Elasticsearch, Lucene, and most production search systems.

Scoring intuition
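
As a rough illustration of the scoring described above, here is a compact implementation of the standard Okapi BM25 formula. It is a sketch, not Synthadoc's code: tokenisation is plain whitespace splitting, and the `k1` and `b` values are common textbook defaults rather than values taken from Synthadoc.

```python
import math
from collections import Counter


def bm25_scores(query: str, pages: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every page against the query with the Okapi BM25 formula."""
    docs = [page.lower().split() for page in pages]
    avg_len = sum(len(doc) for doc in docs) / len(docs)
    # Number of pages that contain each term at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = counts[term]
            if tf == 0:
                continue
            # Rare terms get a high IDF; terms found in almost every page get close to zero.
            idf = math.log(1 + (len(docs) - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates (k1) and is normalised by page length (b).
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A query term that never appears in a page contributes nothing to that page's score, which is exactly the vocabulary-mismatch weakness discussed next.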

Where BM25 falls short

Vocabulary mismatch: query says "contributions", page says "pioneered". Score: near zero.
Synonyms: "ML" and "machine learning" are different tokens.
Conceptual distance: "reasoning under uncertainty" and "probabilistic inference" are semantically identical but lexically distant.

On a domain wiki ingested from diverse sources (papers, docs, blog posts, PDFs), the same concept will be described in many different vocabularies. BM25 alone misses a meaningful fraction of relevant pages.

What Is Vector Search?

Vector (semantic) search encodes text into dense numerical embeddings using a neural language model. Semantically similar texts land close together in that high-dimensional space regardless of surface wording. Similarity is measured as the cosine similarity between the query vector and each page vector.

Embedding intuition

The two sentences share almost no keywords, but their vectors point in nearly the same direction because the model understands they describe the same concept.
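
The similarity measure itself is a short computation. The sketch below uses made-up three-dimensional vectors purely to show the arithmetic; real embedding models produce vectors with hundreds of dimensions, and these numbers do not come from any model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" for a query and two pages
query_vec = [0.9, 0.1, 0.3]
related_page_vec = [0.8, 0.2, 0.4]
unrelated_page_vec = [0.1, 0.9, 0.0]

related = cosine_similarity(query_vec, related_page_vec)
unrelated = cosine_similarity(query_vec, unrelated_page_vec)
```

Here `related` comes out far higher than `unrelated`, even though nothing about the comparison depends on shared words.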

Where vector search falls short alone

Cold-start penalty: embedding thousands of pages takes time and compute; BM25 is instant.
Exact-match dilution: specific product names or identifiers can be blurred by semantic proximity.
Domain drift: general-purpose models may not distinguish highly specific domain terminology.
Resource requirement: needs a model (~130 MB for bge-small-en-v1.5) and inference at query time.

BM25 vs. Vector Search: Side-by-Side

Dimension	BM25	Vector / Semantic
Matching strategy	Exact term overlap (TF x IDF)	Semantic similarity (cosine similarity)
Vocabulary required	Query words must appear in page	Paraphrases and synonyms handled
Speed	Microseconds -- no model needed	Milliseconds -- model inference required
Setup cost	Zero -- pure algorithm	~130 MB model download (one-time)
Exact-match queries	Excellent	Good
Synonym / paraphrase	Often misses	Handles well
Domain terminology	Good if terms match	Depends on model training
Interpretability	Score is explainable	Black-box similarity
Best for	Known vocabulary, structured content	Conceptual queries, diverse sources

How Synthadoc Combines Both

Synthadoc uses a hybrid pipeline where BM25 and vector search are not alternatives - they are sequential layers. BM25 does the heavy filtering; vector re-ranks the survivors.

The retrieval pipeline
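
A minimal sketch of the two-stage flow, with both scorers passed in as functions so the example stays self-contained. The `hybrid_search` function and its parameters are hypothetical; the one explicit assumption is that pages with zero keyword score never enter the candidate pool.

```python
from typing import Callable, Optional, Sequence

Scorer = Callable[[str, str], float]  # (query, page) -> score


def hybrid_search(
    query: str,
    pages: Sequence[str],
    keyword_score: Scorer,
    semantic_score: Optional[Scorer] = None,
    pool_size: int = 20,
    top_k: int = 5,
) -> list[str]:
    """Stage 1: keep the best `pool_size` pages by keyword score.
    Stage 2: if a semantic scorer is available, re-rank only that pool."""
    scored = [(keyword_score(query, page), page) for page in pages]
    scored.sort(key=lambda item: item[0], reverse=True)
    # Assumption: a page with no keyword overlap is not a candidate at all.
    pool = [page for score, page in scored if score > 0][:pool_size]
    if semantic_score is not None:
        pool.sort(key=lambda page: semantic_score(query, page), reverse=True)
    return pool[:top_k]
```

The expensive scorer only ever sees `pool_size` pages, so its cost stays flat as the wiki grows, and leaving `semantic_score` unset gives plain keyword ranking.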

Query decomposition: why it matters for retrieval

Before any search happens, Synthadoc v0.2.0 breaks compound questions into focused sub-questions via an LLM call. Each sub-question runs its own BM25 (and vector) search in parallel. Results are merged by best score per page before synthesis. One complex query can retrieve from multiple distinct parts of the wiki simultaneously.

Example query: "Compare Turing's contributions with Von Neumann's architecture"
-> Decomposed: ["Turing contributions computing"] | ["Von Neumann architecture design"]
-> Two parallel BM25 searches -> merged candidates -> one synthesised answer
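
The merge step ("best score per page") can be sketched as below. The function name and the page slugs in the usage are hypothetical, and running the sub-question searches in parallel is left out to keep the focus on the merge.

```python
def merge_best_score(results_per_subquestion: list[dict[str, float]]) -> list[tuple[str, float]]:
    """Merge per-sub-question results, keeping each page's best score.

    Each input dict maps a page name to its score for one sub-question.
    Returns (page, best score) pairs sorted from best to worst.
    """
    best: dict[str, float] = {}
    for results in results_per_subquestion:
        for page, score in results.items():
            if score > best.get(page, float("-inf")):
                best[page] = score
    return sorted(best.items(), key=lambda item: item[1], reverse=True)
```

Taking the maximum rather than the sum avoids rewarding a page just for matching several sub-questions weakly.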

Knowledge gap detection

After retrieval, Synthadoc evaluates three independent signals:

Fewer than 3 pages retrieved - the wiki barely covers the topic
Max BM25 score below configurable threshold (default: 2.0) - weak keyword overlap
Fewer than 2 candidates contain key nouns from the question - off-topic matches

When a gap fires, Synthadoc generates targeted web search suggestions and surfaces them as an Obsidian callout or CLI tip, creating a feedback loop that makes the wiki progressively denser over time.
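
The three signals can be expressed as a small check. This is a sketch under assumptions: the `Candidate` type and signal names are hypothetical, key nouns are supplied by the caller rather than extracted, and the function only reports which signals fired, since the post does not state how Synthadoc combines them.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    bm25_score: float


def gap_signals(
    candidates: list[Candidate],
    key_nouns: list[str],
    min_pages: int = 3,
    min_score: float = 2.0,
    min_noun_hits: int = 2,
) -> list[str]:
    """Return the names of the knowledge-gap signals that fired."""
    fired = []
    if len(candidates) < min_pages:
        fired.append("few_pages")  # the wiki barely covers the topic
    if max((c.bm25_score for c in candidates), default=0.0) < min_score:
        fired.append("weak_keyword_overlap")
    noun_hits = sum(
        1 for c in candidates
        if any(noun.lower() in c.text.lower() for noun in key_nouns)
    )
    if noun_hits < min_noun_hits:
        fired.append("off_topic")  # too few candidates mention the key nouns
    return fired
```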

Practical Examples in Synthadoc

Example 1: BM25 exact match (no vector needed)

Wiki page: "Alan Turing - Enigma and the Bombe Machine"

Query: "Bombe machine Enigma decryption"

BM25 succeeds: BM25 score: HIGH - "Bombe", "machine", "Enigma", "decryption" all present.
Result: page retrieved correctly. Vector re-ranking not required.

Example 2: BM25 misses, vector rescues

Wiki page: "Alan Turing - Theoretical Foundations of Modern Computers"

Query: "What were Turing's contributions to computing?"

BM25 misses, vector rescues: BM25 score: LOW - "contributions" and "computing" absent from the page.
Vector cosine score: HIGH - embeddings are semantically close.
Result: page retrieved correctly after re-ranking. BM25 alone would have missed it.

Example 3: Knowledge gap fires, ingest suggestion generated

Wiki: finance domain. Query: "What is the impact of quantitative easing on inflation?"

Gap detected: BM25: 1 page returned, score 0.8 (below threshold 2.0)
Knowledge gap detected. Synthadoc generates:
synthadoc ingest "search for: quantitative easing inflation impact" -w finance-wiki
synthadoc ingest "search for: central bank monetary policy effects" -w finance-wiki
After ingest and re-query: 7 pages returned, fully synthesised answer.

Enabling Vector Search in Synthadoc

BM25 is the default - zero setup, zero dependencies. To add vector re-ranking:

Step 1: Install fastembed

pip install fastembed

Step 2: Enable in config

[search]
vector = true
vector_top_candidates = 20  # BM25 pool size before re-ranking

Step 3: Restart the server

synthadoc serve -w my-wiki

On first start, Synthadoc downloads BAAI/bge-small-en-v1.5 (~130 MB) once and embeds existing pages in the background. BM25 stays active throughout - no downtime. If the model is unavailable, the system falls back to BM25 silently.

Synthadoc v0.2.0: Full Feature Summary

Feature	What it does
Query decomposition	Compound questions split into parallel BM25 sub-queries, merged before synthesis
Vector re-ranking	Opt-in semantic re-ranking (BAAI/bge-small-en-v1.5 via fastembed)
Knowledge gap detection	3-signal gap check; auto-generates targeted ingest suggestions as Obsidian callout
Web search decomposition	Broad search topics split into focused Tavily queries; URL deduplication and cap
Per-model cost tracking	Per-token rate table; ingest + query cost in audit.db, CLI, and Obsidian
Query audit trail	Full query history with sub-question count, tokens, cost, timestamp
Obsidian live web search view	Real-time polling panel: phase, pages created, URL errors as fan-out completes
8 new Obsidian commands	15 commands total: lint, auto-resolve, job retry/purge, audit history, scaffold
MiniMax support	M2.5/M2.7 reasoning models with reasoning_content fallback for structured output
Rate-limit requeue	429 responses requeue job (retry budget preserved); fail-fast on daily quota
Job crash recovery	in_progress jobs at shutdown auto-reset to pending on next startup
Bulk job cancel	Cancel all pending jobs in one operation via CLI or API

How Synthadoc Compares to Alternatives

Most LLM knowledge tools are general-purpose RAG pipelines that retrieve raw chunks at query time with no persistent synthesis. Synthadoc compiles knowledge at ingest time, maintains a living wiki, and is designed for domain-specific, auditable deployments.

Capability	Synthadoc v0.2.0	LlamaIndex / LangChain	Notion AI	Obsidian Copilot
Ingest-time synthesis	Compiled wiki	Raw chunks at query time	Page-level only	None
Domain scope filtering	purpose.md	Manual	None	None
Multi-model support	6 providers	Many providers	OpenAI only	OpenAI / Ollama
Audit trail	Full SQLite audit	None built-in	None	None
Cost tracking	Per-token, per-op	Manual / callback	Opaque	None
Offline / local	Fully local	Depends on provider	Cloud only	Ollama
Obsidian-native output	Wikilinks, Dataview	None	Notion-only	Read-only
HTTP API + MCP server	Built-in	Manual wiring	Proprietary API	None
Contradiction detection	Automated	None	None	None
Query decomposition	Parallel BM25	Manual chains	None	None
Knowledge gap detection	Auto-suggestions	None	None	None
Extensible skills	Drop-in folders	Custom loaders	None	None
Licence	AGPL-3.0 open source	Apache-2.0	Proprietary SaaS	MIT

Enterprise and Domain-Specific Readiness

Synthadoc is built for organisations that need a knowledge system they control, audit, and deploy into existing infrastructure - not a SaaS black box.

Concrete use cases:

Legal - track regulatory updates, case precedents, and compliance requirements across jurisdictions. New ruling ingested, old page flagged contradicted, compliance team reviews.
Finance - build a living market research wiki from analyst reports, earnings calls, and regulatory filings. Query with natural language, get cited answers with full audit trail.
Engineering - maintain a persistent runbook that absorbs incident post-mortems, architecture decision records, and API docs. Contradiction detection prevents stale documentation from accumulating.
Research - aggregate papers, datasets, and notes into a structured knowledge base. Knowledge gap detection surfaces what the team does not yet know and generates targeted ingest suggestions.

Domain specificity

Every wiki defines its own scope via purpose.md. The LLM reads this before every ingest decision and rejects out-of-scope sources cleanly. A legal wiki does not absorb marketing copy. A financial wiki does not absorb engineering runbooks.

Auditability

Every ingest, query, contradiction detection, and auto-resolution is written to an append-only SQLite audit trail with token counts, cost, timestamps, and page-level actions -- all queryable from the CLI or Obsidian audit commands.

synthadoc audit history -w my-wiki # ingest records

synthadoc audit cost -w my-wiki # token spend breakdown

synthadoc audit events -w my-wiki # contradiction, gate, resolution events
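
For readers curious what an append-only audit table can look like in SQLite, here is a self-contained sketch. The table name, columns, and triggers are assumptions made for illustration; they are not the actual layout of Synthadoc's audit.db.

```python
import sqlite3

# Hypothetical schema: triggers reject UPDATE and DELETE, so rows can only be added.
SCHEMA = """
CREATE TABLE IF NOT EXISTS audit_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    timestamp TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    operation TEXT NOT NULL,
    page TEXT,
    tokens INTEGER NOT NULL,
    cost_usd REAL NOT NULL
);
CREATE TRIGGER IF NOT EXISTS audit_no_update BEFORE UPDATE ON audit_events
BEGIN SELECT RAISE(ABORT, 'audit trail is append-only'); END;
CREATE TRIGGER IF NOT EXISTS audit_no_delete BEFORE DELETE ON audit_events
BEGIN SELECT RAISE(ABORT, 'audit trail is append-only'); END;
"""


def open_audit(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the audit database and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record(conn: sqlite3.Connection, operation: str, page, tokens: int, cost_usd: float) -> None:
    """Append one event; the timestamp is filled in by SQLite."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO audit_events (operation, page, tokens, cost_usd) VALUES (?, ?, ?, ?)",
            (operation, page, tokens, cost_usd),
        )


def cost_by_operation(conn: sqlite3.Connection) -> dict[str, float]:
    """Total spend per operation type."""
    rows = conn.execute("SELECT operation, SUM(cost_usd) FROM audit_events GROUP BY operation")
    return dict(rows.fetchall())
```

Enforcing append-only behaviour inside the database, rather than in application code, means every client that opens the file is bound by the same rule.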

Product and grid readiness

Synthadoc exposes the same operations across four surfaces sharing a single agent and storage layer:

CLI: for operators, automation scripts, and CI pipelines
HTTP REST API: for product integrations and custom front-ends
MCP server: for direct agent-to-agent communication
Obsidian plugin: for knowledge workers doing active research

Hook scripts fire on lifecycle events (on_ingest_complete, on_lint_complete), enabling event-driven automation: post a Slack summary when ingest completes, trigger a downstream build when a key page changes, or chain into a broader orchestration pipeline. Cron scheduling is built in, and multi-wiki isolation means each team or domain runs on its own port with its own audit trail.

Synthadoc in Agentic Autonomous Systems

Synthadoc is purpose-built to serve as the persistent knowledge layer for LLM agent systems. Where an agent's context window is ephemeral and limited, Synthadoc's wiki is persistent, structured, and queryable - it gives agents a long-term memory that survives across sessions, scales to millions of tokens of accumulated knowledge, and is fully auditable.

Agent integration via MCP

The built-in MCP (Model Context Protocol) server exposes ingest, query, and lint as native tool calls. An agent running in any MCP-compatible host - Claude, GPT-4o, a custom LangChain pipeline - can:

Call query to retrieve cited, synthesised answers from accumulated knowledge before acting
Call ingest to push new findings, research results, or external documents back into the wiki
Call lint to check for contradictions introduced by new data before committing to a decision

Event-driven agent pipelines

Hook scripts fire on lifecycle events and can trigger downstream agent actions:

on_ingest_complete: a downstream agent reads newly created pages and decides whether to trigger follow-up ingests or alert a human reviewer
on_lint_complete: an orchestrator agent receives contradiction and orphan reports and routes resolution tasks to specialised sub-agents

Example pipeline: a web-crawling agent ingests raw URLs; Synthadoc synthesises and deduplicates; a reporting agent queries the updated wiki and posts a daily briefing - all without human intervention.

Persistent domain memory for multi-agent systems

In multi-agent architectures, shared knowledge is a coordination bottleneck. Synthadoc solves this by acting as a shared, structured memory store:

Multiple agents read from the same wiki via parallel HTTP queries - no shared state management required
One agent's ingest results are immediately available to all agents querying the same wiki
Multi-wiki isolation means separate agent clusters maintain scoped knowledge without interference
The audit trail provides a complete record of which agent ingested what, when, at what cost - making multi-agent systems auditable by design

Because Synthadoc is self-hosted and open-source, teams building autonomous systems retain full control over data residency, model selection, and cost - a critical requirement for enterprise agentic deployments.

Try Synthadoc v0.2.0

Synthadoc v0.2.0 is available now on GitHub under the AGPL-3.0 licence. BM25 search works out of the box. Vector re-ranking is one pip install away. The Gemini free tier means you can run a full ingest-and-query cycle at zero cost.

Feedback, issues, and contributions are very welcome. Open an issue on GitHub or start a discussion - the roadmap is shaped by what users need.

👉 README: https://github.com/axoviq-ai/synthadoc#readme
👉 Quick-start guide: https://github.com/axoviq-ai/synthadoc/blob/main/docs/user-quick-start-guide.md
👉 Design document: https://github.com/axoviq-ai/synthadoc/blob/main/docs/design.md
👉 Release notes: https://github.com/axoviq-ai/synthadoc/releases/tag/v0.2.0