Status: Draft v1 · 2026-07-02
Author: design session (Claude Fable 5)
Implementer: any capable coding model, per SYSTEM-DESIGN.md
1. One-liner

Convert the cheap token throughput of open-weight thinking models into verified conclusions, by externalizing their reasoning as a graph, sampling it N independent times, and letting mechanical graph checks — not model vibes — decide what survives.
2. Problem
- Self-correction by prompting fails. Re-feeding a model its own chain-of-thought ("let's revisit...") makes it anchor to its own tokens and rubber-stamp itself. Known result (Huang et al., "LLMs Cannot Self-Correct Reasoning Yet").
- tokens/sec is a garbage metric. A cheap model emitting 1M tokens of confident nonsense is worth less than nothing. The metric that matters is correct conclusions per dollar.
- Verification-by-LLM inherits the problem. Asking a model "is this true?" just moves the vibes one level up. We need checks that are mechanical — computable, deterministic, not model-judged.
- Cheap models confabulate from pretraining. A system whose claims must trace to imported sources cannot tolerate a model that silently answers from its weights.
3. Core insights (the product thesis)
- Shape is checkable even when truth isn't. An argument externalized as a directed graph admits mechanical checks: unsupported premises (orphans), circular reasoning (cycles), disconnected conclusions (reachability), refuted-claims-still-load-bearing.
- "Truthiness" becomes topology. Confidence in a conclusion ∝ how many node-disjoint chains connect evidence to it (Menger's theorem). One road to Rome = fragile; three independent roads = an adversary must break three separate claims.
- Independence is sampled, not requested. Asking one context for "3 independent arguments" yields the same argument paraphrased 3 times. Asking N independent inference calls (fresh context, temperature > 0) yields genuinely decorrelated attempts — the detective's technique: separate interrogations, then cross-examine. Contradictions between runs are the interrogation slip-ups; they are invisible inside any single transcript.
- Direct attention, don't diffuse it. "Revisit your reasoning" makes a model skim everything shallowly. "Edge 7 is load-bearing and unverified" makes it spend its whole budget on the right spot. Min-cut/betweenness computes where that spot is.
- The formalization tax is a feature. A vague step becomes visibly vague when it must be written as a typed node with typed edges.
4. Users & use cases
- Primary user: a developer/researcher running multi-step reasoning or research-over-documents workloads on cheap open-weight models (via OpenRouter or local inference), who needs reliability without frontier-model prices.
- Use cases:
  - U1: Multi-hop question answering over provided documents, with an auditable argument graph as output.
  - U2: Claim verification: "here is a conclusion; how much would have to be wrong for it to fail?"
  - U3: As an MCP tool available to any agent, giving it external working memory for arguments plus structural self-checks.
5. Goals & success metrics
| # | Goal | Metric | Target |
|---|---|---|---|
| G1 | Screening identifies grounded models | screen produces grounded% / memorized% / abstain% per model | screen runs end-to-end < $1 |
| G2 | System beats naive self-correction | accuracy on eval set vs "revisit" and verbatim re-injection arms | ≥ +5pp at equal or lower cost |
| G3 | Cheap ensemble beats one big call | correct-per-dollar vs one frontier-class call | ≥ 2× correct-per-dollar |
| G4 | Confidence is calibrated by structure | conclusions with support width ≥ 2 are correct more often than width 1 | monotone relationship observed |
| G5 | Auditable output | every conclusion ships with its graph: surviving claims, width, disputed nodes | 100% of runs |
6. Non-goals
- Not a truth oracle. The graph verifies argument shape; every edge is semantically vouched for only by a model. Validly-connected garbage passes structure checks — by design, the structural layer only decides where semantic effort (re-interrogation) goes.
- No fine-tuning, no inference-loop surgery. Everything works through standard chat-completions + tool calls. No custom sampling, no template hacking, no stop-at-`</think>` interception.
- No persistence. Graphs are session-scoped and ephemeral (in-memory). No database.
- Not a general theorem prover. Natural-language claims, heuristic dedup — not formal logic.
System Design
1. Architecture
Two deployment modes for the graph server, same codebase:
- Library mode (default for orchestrator): orchestrator imports `graph_server.store` directly, no MCP transport. Simpler, faster, use for P3/P4.
- MCP mode: `server.py` exposes the same functions as MCP tools over stdio, so any MCP-capable agent (Claude Code, etc.) can use the graph as external working memory. Thin wrapper only: zero logic in the transport layer.
2. Component A — Grounding screen (P1, exists)
Already implemented at repo root: screen.py, items.jsonl (50 items), models.json. Contract: temperature 0, 1 call/item/model, regex-graded. See README.md. Output feeds one decision: which model id the orchestrator defaults to. Do not modify except to fix bugs.
3. Component B — Graph server (P2)
3.1 Data model
In-memory only. One GraphStore holds many independent graphs keyed by graph_id (string, caller-supplied).
Node:
id: str # caller-supplied, unique within graph, e.g. "r2:n4" (run 2, node 4)
claim: str # natural-language claim, one sentence
type: str # "given" | "inference" | "assumption" | "conclusion"
confidence: float # 0.0-1.0, model-stated, default 0.8
run_ids: set[str] # provenance: which interrogation runs asserted this claim
refuted: bool # default False, set only via mark_refuted
refute_reason: str | None
aliases: list[str] # claims of nodes merged into this one (dedup provenance)
Edge:
src: str, dst: str
relation: str # "supports" | "attacks" | "assumes"
confidence: float # 0.0-1.0, default 0.8
run_ids: set[str]
Backed by one networkx.MultiDiGraph per graph_id (multi because supports and attacks may both exist between the same pair). Node attrs hold the Node fields; edge attrs hold relation/confidence/run_ids.
Validation on ingest (reject with error message, don't crash): unknown type/relation, edge endpoints not in graph after this ingest, confidence outside [0,1], duplicate node id with different claim text (same id + same claim = idempotent, merge run_ids).
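The ingest rules above can be sketched for nodes as follows. This is a minimal illustration under stated assumptions, not the store's actual code: `validate_node` and `ingest_node` are hypothetical helper names, a plain dict stands in for the networkx-backed store, and the edge-endpoint check is left out.

```python
from dataclasses import dataclass, field
from typing import Optional

NODE_TYPES = {"given", "inference", "assumption", "conclusion"}


@dataclass
class Node:
    id: str
    claim: str
    type: str
    confidence: float = 0.8
    run_ids: set = field(default_factory=set)
    refuted: bool = False
    refute_reason: Optional[str] = None
    aliases: list = field(default_factory=list)


def validate_node(node: Node, existing: dict) -> Optional[str]:
    """Return a rejection reason, or None if the node is acceptable."""
    if node.type not in NODE_TYPES:
        return f"unknown type: {node.type!r}"
    if not 0.0 <= node.confidence <= 1.0:
        return f"confidence outside [0,1]: {node.confidence}"
    prior = existing.get(node.id)
    if prior is not None and prior.claim != node.claim:
        return f"duplicate id {node.id!r} with different claim text"
    return None


def ingest_node(node: Node, existing: dict) -> dict:
    """Reject with a message instead of raising; same id + same claim is idempotent."""
    reason = validate_node(node, existing)
    if reason is not None:
        return {"accepted": False, "reason": reason}
    if node.id in existing:
        existing[node.id].run_ids |= node.run_ids  # merge provenance only
    else:
        existing[node.id] = node
    return {"accepted": True, "reason": None}
```

Re-ingesting `r1:n1` with the same claim from a second run only widens `run_ids`; a different claim under the same id comes back as a rejection entry rather than an exception.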
3.2 Public functions / MCP tools
Eight functions. Signatures below are the library API; MCP tool schemas mirror them 1:1 with graph_id always the first param.
| # | Function | Params | Returns |
|---|---|---|---|
| 1 | `assert_graph` | `graph_id, nodes[], edges[]` | `{accepted_nodes, accepted_edges, rejected: [{item, reason}], auto_merged: [[kept_id, merged_id], ...]}`; runs dedup (§3.3) on the new nodes against existing ones automatically |
| 2 | `merge_duplicates` | `graph_id, jaccard_threshold=0.7, ratio_threshold=0.85` | `{merges: [[kept_id, merged_id]], contradictions_created: [[id_a, id_b]]}`; full-graph re-pass |
| 3 | `check_structure` | `graph_id, conclusion_id` | `{orphans: [id], cycles: [[id,...]], unreachable_conclusion: bool, refuted_but_feeding: [id]}` |
| 4 | `critical_links` | `graph_id, conclusion_id` | `{min_cut_nodes: [id], bridge_edges: [[src,dst]], ranked: [{edge, betweenness, min_confidence_on_edge}]}` sorted least-confident-first |
| 5 | `support_width` | `graph_id, conclusion_id` | `{disjoint_paths: int, paths: [[id,...]], max_flow: float}` |
| 6 | `surviving_claims` | `graph_id` | `{in: [id], out: [id], undecided: [id], surviving: [id]}` (§3.5) |
| 7 | `mark_refuted` | `graph_id, node_id, reason` | `{ok, width_before, width_after}` for the graph's conclusion node if one exists |
| 8 | `disputed_nodes` | `graph_id, conclusion_id` | `{contradiction_pairs: [[id_a, id_b]], isolated_load_bearing: [{id, run_count, on_path}]}` (§3.6) |
Every function MUST be pure-read except 1, 2, 7. Every function MUST return JSON-serializable dicts. Errors return {error: "..."}, never raise across the MCP boundary.
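One way to enforce the "never raise across the MCP boundary" rule is a small decorator applied to every public function. The sketch below is illustrative only: `mcp_safe` and `support_width_stub` are hypothetical names, and the stub's payload is a placeholder rather than real output.

```python
import functools
import json


def mcp_safe(fn):
    """Wrap a store function so failures come back as {"error": "..."}."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            result = fn(*args, **kwargs)
            json.dumps(result)  # fail here if the payload is not JSON-serializable
            return result
        except Exception as exc:  # deliberate catch-all at the boundary
            return {"error": f"{type(exc).__name__}: {exc}"}
    return wrapper


@mcp_safe
def support_width_stub(graph_id: str, conclusion_id: str) -> dict:
    # Hypothetical stand-in for the real function; only graph "g1" exists.
    if graph_id != "g1":
        raise KeyError(f"unknown graph_id {graph_id}")
    return {"disjoint_paths": 2, "paths": [["B", "C", "E", "Z"], ["D", "Z"]]}
```

The `json.dumps` probe also catches a common slip with this data model: returning `run_ids` as a Python `set`, which is not serializable and would otherwise fail only inside the transport layer.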
3.3 Dedup / merge (THE load-bearing component)
Purpose: identical claims from different runs MUST collapse to one node, otherwise (a) support width is inflated by paraphrase and (b) contradictions never meet each other.
Normalization norm(claim):
- lowercase; unicode NFC
- strip punctuation except `.` inside numbers and `%`
- canonicalize numbers: remove thousands separators (`84,200` → `84200`)
- collapse whitespace
- tokenize on whitespace; drop English stopwords (small fixed list: a, an, the, is, are, was, were, of, in, on, at, to, that, this, it, and)
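The normalization steps above could be written roughly as below. It is a sketch with two assumptions that the list does not settle: stripped punctuation is replaced by a space rather than deleted, and a comma counts as a thousands separator only when it sits between a digit and a group of exactly three digits. The helper name `norm_tokens` is hypothetical.

```python
import re
import unicodedata

STOPWORDS = {"a", "an", "the", "is", "are", "was", "were", "of", "in",
             "on", "at", "to", "that", "this", "it", "and"}


def norm_tokens(claim: str) -> list:
    text = unicodedata.normalize("NFC", claim).lower()
    # 84,200 -> 84200: drop commas between a digit and a group of three digits.
    text = re.sub(r"(?<=\d),(?=\d{3}\b)", "", text)
    # Replace punctuation with a space, keeping "%" and any "." that has a
    # digit on both sides (a decimal point inside a number).
    text = re.sub(r"(?<!\d)\.|\.(?!\d)|[^\w\s%.]", " ", text)
    # split() collapses runs of whitespace and tokenizes in one step.
    return [tok for tok in text.split() if tok not in STOPWORDS]


def norm(claim: str) -> str:
    return " ".join(norm_tokens(claim))
```

With this reading, `"Server X runs Linux"` and `"server x runs linux."` normalize to the same string, which is what lets the two runs collapse into one node.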
Negation guard (checked BEFORE similarity): let NEG = {not, no, never, cannot, "n't", without, false}. If tokens(a) - NEG == tokens(b) - NEG but the two differ in negation-token count parity → DO NOT merge; instead create mutual attacks edges between them and report in contradictions_created.
Numeric-conflict guard: if token sets are equal except for differing numeric tokens (e.g. "trellium melts at 412" vs "trellium melts at 350") → do not merge; create mutual attacks edges and report.
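Both guards operate on the normalized token lists, so they can be sketched as two small predicates. The function names and the `NUMERIC` pattern are assumptions for illustration; the numeric guard here additionally requires both claims to contain at least one number, which is one reading of "differing numeric tokens".

```python
import re

NEG = {"not", "no", "never", "cannot", "n't", "without", "false"}
NUMERIC = re.compile(r"^\d+(\.\d+)?%?$")


def negation_conflict(tokens_a: list, tokens_b: list) -> bool:
    """Same content tokens, but the negation counts differ in parity."""
    same_content = (set(tokens_a) - NEG) == (set(tokens_b) - NEG)
    neg_a = sum(tok in NEG for tok in tokens_a)
    neg_b = sum(tok in NEG for tok in tokens_b)
    return same_content and neg_a % 2 != neg_b % 2


def numeric_conflict(tokens_a: list, tokens_b: list) -> bool:
    """Token sets equal except for the numeric tokens, which differ."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    nums_a = {tok for tok in set_a if NUMERIC.match(tok)}
    nums_b = {tok for tok in set_b if NUMERIC.match(tok)}
    same_words = (set_a - nums_a) == (set_b - nums_b)
    return same_words and bool(nums_a) and bool(nums_b) and nums_a != nums_b
```

Under this literal reading, "server x running linux" vs "server x not running linux" trips the negation guard. The pair "Server X runs Linux" vs "Server X does not run Linux" also differs in "runs" vs "does ... run", so catching it would presumably need light stemming or do-support handling on top; the sketch does not attempt that.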
Similarity (only if guards pass): merge iff jaccard(tokens_a, tokens_b) >= 0.7 OR difflib.SequenceMatcher(None, norm_a, norm_b).ratio() >= 0.85. Both thresholds tunable via config. (v1 upgrade path, out of scope: cosine over a small sentence-embedding model; keep the interface identical.)
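The similarity rule is an OR of two cheap measures, both available from the standard library. A minimal sketch, assuming the inputs are already-normalized strings and that the guards have been checked beforehand (`should_merge` is a hypothetical name):

```python
import difflib


def jaccard(tokens_a, tokens_b) -> float:
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 1.0  # two empty claims are treated as identical
    return len(a & b) / len(a | b)


def should_merge(norm_a: str, norm_b: str,
                 jaccard_threshold: float = 0.7,
                 ratio_threshold: float = 0.85) -> bool:
    """Merge iff token Jaccard OR character-level ratio clears its threshold."""
    if jaccard(norm_a.split(), norm_b.split()) >= jaccard_threshold:
        return True
    ratio = difflib.SequenceMatcher(None, norm_a, norm_b).ratio()
    return ratio >= ratio_threshold
```

The two measures cover different paraphrase styles: Jaccard tolerates reordered words, while the sequence ratio tolerates small spelling or inflection differences that change the token set.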
Merge policy: keep the node with the earliest assertion (stable order: first run_id, then node id). Union run_ids and aliases; keep max confidence; re-point all edges to the kept node; collapse resulting parallel edges of the same relation (max confidence, union run_ids). If merged nodes disagree on type, given wins over inference, which wins over assumption. Never merge a refuted=True node into a live one: when a live node matches a refuted node, the live node is merged into the refuted one and the result stays refuted.
Clustering: union-find over pairwise matches (O(n²) pairwise is fine; graphs are hundreds of nodes, not millions).
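Union-find turns pairwise matches into merge clusters, so that "a matches b" and "b matches c" put all three in one group even if a and c fall below the threshold. A small sketch (the function name `cluster` is hypothetical; choosing which node in each group is kept follows the merge policy and is not shown):

```python
def cluster(ids: list, matches: list) -> list:
    """Group node ids into clusters given pairwise (id_a, id_b) matches."""
    parent = {node_id: node_id for node_id in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for node_id in ids:
        groups.setdefault(find(node_id), []).append(node_id)
    return list(groups.values())
```

For example, `cluster(["r1:n1", "r2:n3", "r3:n2", "r1:n5"], [("r1:n1", "r2:n3"), ("r2:n3", "r3:n2")])` groups the first three ids together and leaves `r1:n5` alone.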
3.4 Structural checks (networkx recipes)
Work on the supports+assumes subgraph (attacks edges excluded) unless stated. Build D = nx.DiGraph view collapsing the multigraph, keeping max confidence per (src,dst).
- orphans: nodes with in-degree 0 that are not `type=given` and not `type=assumption` (assumptions are declared floats; still reported separately as `assumptions` in the payload. Implementer: include key `assumptions: [id]`).
- cycles: `list(nx.simple_cycles(D))`, cap output at first 10.
- unreachable_conclusion: conclusion not in `descendants(D, g)` for any given g. Compute once: `reachable = union of nx.descendants(D, g) | {g} for all givens`; unreachable iff conclusion not in reachable.
- refuted_but_feeding: nodes with `refuted=True` that still have a path to the conclusion in D.
- critical_links: `nx.minimum_node_cut(D_aug, s, t)` and betweenness via `nx.betweenness_centrality_subset(D, sources=givens, targets=[conclusion])`. For s-t computations with multiple sources add a virtual super-source `__S__` with edges to every given (confidence 1.0), and use conclusion as sink; NEVER report `__S__` in results. Bridge edges: edges whose removal makes conclusion unreachable (test each edge on the paths; graphs are small, brute force is fine).
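The first four checks can be sketched directly with networkx on the supports/assumes subgraph of the rack-7 fixture from §8. This is an illustration under assumptions: node attributes are named `type` and `refuted`, `build_fixture` is a hypothetical helper, and `critical_links` is not covered. It uses `itertools.islice` so that cycle enumeration stops at 10 instead of listing every cycle first.

```python
import itertools
import networkx as nx


def build_fixture() -> nx.DiGraph:
    """Supports/assumes edges of the rack-7 fixture (the attacks edge is excluded)."""
    types = {"A": "given", "B": "given", "C": "inference", "D": "given",
             "E": "inference", "Z": "conclusion", "F": "inference", "G": "inference"}
    D = nx.DiGraph()
    for node_id, node_type in types.items():
        D.add_node(node_id, type=node_type, refuted=False)
    D.add_edges_from([("A", "C"), ("B", "C"), ("C", "E"),
                      ("D", "E"), ("E", "Z"), ("D", "Z")])
    return D


def check_structure(D: nx.DiGraph, conclusion: str) -> dict:
    kinds = nx.get_node_attributes(D, "type")
    givens = [n for n in D if kinds[n] == "given"]
    orphans = sorted(n for n in D
                     if D.in_degree(n) == 0 and kinds[n] not in ("given", "assumption"))
    assumptions = sorted(n for n in D if kinds[n] == "assumption")
    cycles = list(itertools.islice(nx.simple_cycles(D), 10))
    reachable = set()
    for g in givens:
        reachable |= nx.descendants(D, g) | {g}
    feeding = sorted(n for n, refuted in D.nodes(data="refuted")
                     if refuted and nx.has_path(D, n, conclusion))
    return {"orphans": orphans, "assumptions": assumptions, "cycles": cycles,
            "unreachable_conclusion": conclusion not in reachable,
            "refuted_but_feeding": feeding}
```

On the fixture this reports F and G as orphans, no cycles, and a reachable conclusion; flipping `refuted` on D makes D appear in `refuted_but_feeding`.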
3.5 support_width & surviving_claims
support_width (Menger): on D_aug (super-source __S__ → all non-refuted givens):
- `disjoint_paths = len(list(nx.node_disjoint_paths(D_aug, "__S__", conclusion)))`; strip `__S__` from reported paths. Handle `NetworkXNoPath` → 0.
- `max_flow`: capacities = edge confidence; node capacities = node confidence (implement via standard node-splitting: v → v_in, v_out with capacity = confidence). `nx.maximum_flow_value` on the split graph. Givens and `__S__` edges get capacity ∞ (use 1e9).
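The disjoint-path half of `support_width` can be sketched as below, again on the rack-7 fixture. It assumes the caller passes the given ids and the refuted set explicitly (the real function would read them from node attributes), and it leaves out the `max_flow` computation.

```python
import networkx as nx

SUPER_SOURCE = "__S__"


def support_width(D: nx.DiGraph, givens: list, conclusion: str,
                  refuted: frozenset = frozenset()) -> dict:
    # Refuted nodes are removed before the augmented graph is built.
    live = D.subgraph([n for n in D if n not in refuted]).copy()
    live_givens = [g for g in givens if g in live]
    if conclusion not in live or not live_givens:
        return {"disjoint_paths": 0, "paths": []}
    for g in live_givens:
        live.add_edge(SUPER_SOURCE, g)  # virtual super-source -> every live given
    try:
        raw = list(nx.node_disjoint_paths(live, SUPER_SOURCE, conclusion))
    except nx.NetworkXNoPath:
        raw = []
    paths = [list(p)[1:] for p in raw]  # never report the super-source
    return {"disjoint_paths": len(paths), "paths": paths}


D = nx.DiGraph([("A", "C"), ("B", "C"), ("C", "E"),
                ("D", "E"), ("E", "Z"), ("D", "Z")])
print(support_width(D, ["A", "B", "D"], "Z")["disjoint_paths"])  # 2
```

Passing `refuted=frozenset({"D"})` drops the width to 1, with the surviving route running through C and E.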
surviving_claims (grounded semantics over attacks, then reachability):
A = subgraph with attacks edges only (all nodes, attack edges)
label all nodes UNDEC
repeat until no change:
node -> IN if every attacker of it is OUT (vacuously true if no attackers)
node -> OUT if some attacker of it is IN
# refuted nodes are forced OUT before the loop starts
surviving = { n : label(n) != OUT
and (type(n)=="given"
or n reachable from some IN/UNDEC given
via supports/assumes edges through only non-OUT nodes) }
Return the full labelling plus surviving.
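The labelling loop and the reachability filter above translate almost line for line into Python. The sketch below takes plain dicts and edge lists rather than a networkx graph, and the parameter names are assumptions. It follows the rule exactly as stated, so a node that is labelled IN but is neither a given nor reachable from one (G in the §8 fixture) is not placed in `surviving`, whereas the fixture lists G as surviving; an implementation has to settle whether unattacked attackers are admitted as well.

```python
def grounded_labels(nodes, attacks, refuted=frozenset()) -> dict:
    """Grounded labelling over attack edges; refuted nodes are forced OUT first."""
    attackers = {n: set() for n in nodes}
    for src, dst in attacks:
        attackers[dst].add(src)
    label = {n: "OUT" if n in refuted else "UNDEC" for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if label[n] != "UNDEC":
                continue
            if all(label[a] == "OUT" for a in attackers[n]):
                label[n], changed = "IN", True   # vacuously true with no attackers
            elif any(label[a] == "IN" for a in attackers[n]):
                label[n], changed = "OUT", True
    return label


def surviving_claims(types: dict, supports, attacks, refuted=frozenset()) -> dict:
    """types: {id: type}; supports: (src, dst) pairs for supports/assumes edges."""
    label = grounded_labels(list(types), attacks, refuted)
    alive = {n for n in types if label[n] != "OUT"}
    reached = {n for n in alive if types[n] == "given"}
    frontier = list(reached)
    while frontier:
        current = frontier.pop()
        for src, dst in supports:
            if src == current and dst in alive and dst not in reached:
                reached.add(dst)
                frontier.append(dst)
    return {"in": sorted(n for n in types if label[n] == "IN"),
            "out": sorted(n for n in types if label[n] == "OUT"),
            "undecided": sorted(n for n in types if label[n] == "UNDEC"),
            "surviving": sorted(reached)}
```

Two nodes that attack each other, with no other attackers, stay UNDEC under this loop: neither can become IN until the other is OUT, which is the skeptical behaviour grounded semantics is chosen for.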
3.6 disputed_nodes
- `contradiction_pairs`: all pairs connected by mutual attacks created by the guards in §3.3, plus any pair the caller asserted with explicit mutual attacks.
- `isolated_load_bearing`: nodes with `len(run_ids) == 1` AND (betweenness_subset > 0 from givens to conclusion OR node lies on any given→conclusion simple path). These are the prime confabulation suspects; the orchestrator re-interrogates them first.
4. Component C — Orchestrator (P3)
CLI: python -m orchestrator.run --task task.json --model <id> --n 6 --k 2 --budget-calls 20 --temp 0.8
task.json: {"question": str, "documents": [str], "expected_answer": str|null} (expected_answer only used by the eval harness).
4.1 Control loop
1. FAN OUT: N parallel interrogation calls (fresh context each, temperature=temp).
Each call gets: system prompt (§4.2) + documents + question.
Parse each response's JSON graph (§4.3). assert_graph with run_id="r{i}",
node ids prefixed "r{i}:".
2. MERGE: merge_duplicates(graph_id).
3. ASSESS: check_structure, support_width(conclusion), disputed_nodes.
The conclusion node: each run marks exactly one node type=conclusion; after
merge there should be 1..N conclusion nodes. If >1 cluster of conclusions
survive dedup, the runs DISAGREE on the answer itself -> treat each cluster
as a candidate; the report ranks candidates by support width.
4. RE-INTERROGATE (while budget remains and disputed set nonempty):
For each disputed node (contradiction pairs first, then isolated load-bearing,
max 3 per round): send M=3 fresh verification calls (§4.2 template B: the
claim + documents ONLY — no graph, no prior transcript). Majority verdict:
- refuted by >=2 of 3 -> mark_refuted(node, reason)
- confirmed by >=2 of 3 -> bump confidence to max(old, 0.9), add run_id "v*"
5. STOP when: (a) no disputed nodes remain, or (b) all conclusion candidates
stable AND top candidate width >= k, or (c) budget-calls exhausted.
6. REPORT: JSON + rendered markdown: chosen conclusion (claim text), support
width + disjoint paths, surviving_claims, killed nodes with reasons,
remaining disputed, per-call and total cost, wall clock.
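Step 1 of the loop (fan out, prefix ids, tolerate a bad run) can be sketched with a thread pool, which also fits the no-fork constraint in §6. Assumptions: `interrogate` is a hypothetical callable that performs one fresh-context model call and returns a parsed graph in the Template A shape, and run ids are numbered from 1.

```python
from concurrent.futures import ThreadPoolExecutor


def prefix_ids(run_id: str, graph: dict) -> dict:
    """Namespace one run's node ids as "r{i}:<id>" so runs cannot collide."""
    def tag(node_id):
        return f"{run_id}:{node_id}"
    return {
        "conclusion_node": tag(graph["conclusion_node"]),
        "nodes": [{**n, "id": tag(n["id"])} for n in graph["nodes"]],
        "edges": [{**e, "from": tag(e["from"]), "to": tag(e["to"])}
                  for e in graph["edges"]],
    }


def fan_out(interrogate, n: int) -> dict:
    """Run n independent interrogations on threads; failed runs are dropped."""
    run_ids = [f"r{i}" for i in range(1, n + 1)]

    def one(run_id):
        try:
            return run_id, prefix_ids(run_id, interrogate(run_id))
        except Exception:
            return run_id, None  # a real loop would log the failure here

    with ThreadPoolExecutor(max_workers=max(1, n)) as pool:
        results = list(pool.map(one, run_ids))
    return {run_id: graph for run_id, graph in results if graph is not None}
```

Each surviving entry is then ready for `assert_graph` with its `run_id`; a run that raised simply does not appear, so the loop continues with N-1.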
4.2 Prompt templates (verbatim starting points; iterate only with eval evidence)
Template A — interrogation (system):
You are building a rigorous argument. Read the documents, then answer the
question by constructing an explicit reasoning graph.
Rules:
- Every claim must come from the documents or be an explicitly labeled
assumption. Do not use outside knowledge.
- Aim for at least {k} independent lines of support for your conclusion,
but NEVER invent support: a single honest chain beats two fabricated ones.
- If you find evidence against a claim, include it as a node with an
"attacks" edge. Objections make the graph stronger, not weaker.
Output ONLY a JSON object, no prose, matching:
{"conclusion_node": "<node id>",
"nodes": [{"id": "n1", "claim": "<one sentence>",
"type": "given|inference|assumption|conclusion",
"confidence": 0.0-1.0}],
"edges": [{"from": "n1", "to": "n2",
"relation": "supports|attacks|assumes",
"confidence": 0.0-1.0}]}
"given" = stated in a document. "inference" = derived. Every inference must
have at least one incoming supports edge.
Template B — verification (system; fresh context, no graph shown):
Assess one claim against the documents. Reply ONLY with JSON:
{"verdict": "supported" | "refuted" | "not_determinable",
"reason": "<one sentence citing the document or the gap>"}
Use ONLY the documents. If the documents don't settle it, say not_determinable.
User: documents + Claim: "<claim text>". not_determinable counts toward neither majority; 3× not_determinable → leave node as-is but drop its confidence to min(old, 0.5).
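The verdict-counting rule for the M=3 verification calls is small enough to pin down in code. A sketch, assuming verdict strings exactly as in Template B; the function name and the `action` labels in the returned dict are hypothetical.

```python
from collections import Counter


def majority_verdict(verdicts: list, old_confidence: float) -> dict:
    """Combine the fresh verification verdicts for one disputed node."""
    counts = Counter(verdicts)
    if counts["refuted"] >= 2:
        return {"action": "mark_refuted", "confidence": old_confidence}
    if counts["supported"] >= 2:
        return {"action": "confirm", "confidence": max(old_confidence, 0.9)}
    if verdicts and counts["not_determinable"] == len(verdicts):
        # All calls abstained: keep the node but cap its confidence.
        return {"action": "keep", "confidence": min(old_confidence, 0.5)}
    return {"action": "keep", "confidence": old_confidence}
```

A split such as one supported, one refuted, one not_determinable reaches neither majority, so the node is left unchanged and stays in the disputed set for a later round if budget allows.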
4.3 JSON parsing policy
- Strip markdown fences and any text before the first `{` / after the last `}`.
- `json.loads`; on failure, one retry call appending the parse error and "Output ONLY the JSON object."
- On second failure, salvage with a lenient parser (`json_repair` lib or regex-extract of nodes/edges arrays); log `salvaged=true`.
- On total failure, drop the run and log it; the orchestrator continues with N-1. NEVER crash the loop on one bad run.
- Schema-compliance rate per model is a reported metric — feed it back into model choice.
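The first two policy steps can be sketched as follows. Slicing from the first `{` to the last `}` removes markdown fences and surrounding chatter in one move. `retry_call` is a hypothetical hook that re-asks the model with the given extra instruction; the lenient salvage step is not shown.

```python
import json
from typing import Optional


def extract_json(text: str) -> Optional[dict]:
    """Keep only the span from the first "{" to the last "}" and parse it."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        return None
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def parse_run(raw: str, retry_call) -> Optional[dict]:
    graph = extract_json(raw)
    if graph is None:
        # One retry only, as the policy requires.
        graph = extract_json(retry_call("Output ONLY the JSON object."))
    return graph  # None means: drop this run and continue with N-1
```

Returning `None` instead of raising keeps the "never crash the loop" rule local to the parser: the caller only has to skip empty results.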
4.4 API client
Reuse the conventions from screen.py: OpenRouter chat-completions, OPENROUTER_API_KEY env var, retry ×3 with exponential backoff on 429/5xx, cost accounting from usage × catalog pricing. Factor this into orchestrator/client.py — do NOT import from screen.py (P1 stays frozen); copy and adapt.
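The retry convention can be kept separate from the HTTP library by wrapping a callable. In this sketch `send` is a hypothetical zero-argument function returning `(status_code, body)`, "retry ×3" is read as three retries after the first attempt, and the base delay of one second is an assumption; `sleep` is injectable so tests do not wait.

```python
import time


def with_retries(send, max_retries: int = 3, base_delay: float = 1.0,
                 sleep=time.sleep):
    """Call send(); retry on 429/5xx with exponential backoff."""
    for attempt in range(max_retries + 1):
        status, body = send()
        retryable = status == 429 or 500 <= status < 600
        if retryable and attempt < max_retries:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
            continue
        return status, body
```

Non-retryable errors such as a 400 return immediately, so a malformed request does not burn the call budget on repeats.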
5. Component D — Eval harness (P4)
Six arms, same items, same model (screen winner), same total-token budget reported per arm:
| Arm | Description |
|---|---|
| 1 | Single pass baseline (one call, direct answer) |
| 2 | Naive revisit: answer → "revisit your reasoning" → final |
| 3 | Verbatim CoT re-injection: answer → full trace fed back → final |
| 4 | Arrow-notation forced thinking (system prompt), single pass |
| 5 | Model-compressed chain: answer → model compresses own trace to arrow chain → fresh context verifies chain → final |
| 6 | Full system (orchestrator §4) |
Datasets: (a) GSM8K test subset, 200 items, exact-match on final number; (b) a grounded multi-hop set — MuSiQue or HotpotQA distractor setting, 150 items, answer exact-match after normalization. (b) matters more: the system's thesis is imported-knowledge reasoning.
Metrics per arm: accuracy, total cost USD, cost per correct answer (headline), mean wall-clock, tokens. Plus arm-6-only: does accuracy stratify by support width (PRD G4)?
Output: evals/results/<runstamp>/ with raw JSONL per arm + one summary markdown table.
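The headline metric and the G4 stratification can be computed from the per-item rows with a few lines. The row field names (`correct`, `cost_usd`, `width`) are hypothetical; the actual JSONL schema is not specified here.

```python
def summarize_arm(rows: list) -> dict:
    """rows: one dict per item with "correct" (bool) and "cost_usd" (float)."""
    n = len(rows)
    correct = sum(1 for r in rows if r["correct"])
    total_cost = sum(r["cost_usd"] for r in rows)
    return {
        "accuracy": correct / n if n else 0.0,
        "total_cost_usd": total_cost,
        # Undefined when nothing is correct; None avoids a division by zero.
        "cost_per_correct_usd": total_cost / correct if correct else None,
    }


def accuracy_by_width(rows: list) -> dict:
    """Arm 6 only: accuracy bucketed by the support width of the chosen conclusion."""
    buckets = {}
    for r in rows:
        hits, seen = buckets.get(r["width"], (0, 0))
        buckets[r["width"]] = (hits + int(r["correct"]), seen + 1)
    return {w: hits / seen for w, (hits, seen) in sorted(buckets.items())}
```

If accuracy rises from width 1 to width 2 and beyond in `accuracy_by_width`, that is the monotone relationship G4 asks for.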
6. Stack, layout, dependencies
- Python ≥ 3.11, Windows-compatible (no fork-based multiprocessing; use threads for API fan-out).
- Deps: `networkx`, `requests`, `mcp` (or `fastmcp`) for MCP mode, `pytest`. Optional: `json_repair`. NO heavyweight deps (no torch/transformers) in v0.
- Secrets: `OPENROUTER_API_KEY` env var only. Never on disk, never in logs.
logic-screen/ # existing repo root (rename to logic-graph/ if desired)
screen.py items.jsonl models.json README.md # P1, frozen
docs/PRD.md docs/SYSTEM-DESIGN.md
graph_server/
__init__.py store.py # GraphStore + 8 functions (library mode)
dedup.py # §3.3
checks.py # §3.4
semantics.py # §3.5 grounded labelling, width, flow
server.py # MCP stdio wrapper (thin)
tests/test_store.py test_dedup.py test_semantics.py test_fixture.py
orchestrator/
__init__.py run.py client.py prompts.py parse.py report.py
tests/test_parse.py test_loop.py # loop tested with a mocked client
evals/
arms.py score.py datasets/ results/
7. Milestones & acceptance criteria (implement in order)
M0 — run the P1 screen (human triggers; needs API key). Done when: results.jsonl exists and a model is chosen. Everything after is buildable without a key.
M1 — GraphStore + dedup (store.py, dedup.py). Done when:
- ingest of §8 fixture passes validation; re-ingest is idempotent
- `"Server X runs Linux"` (r1) and `"server x runs linux."` (r2) merge; run_ids unioned
- `"Server X runs Linux"` vs `"Server X does not run Linux"` do NOT merge; mutual attacks created (negation guard)
- `"trellium melts at 412 C"` vs `"trellium melts at 350 C"` → attacks (numeric guard)
- pytest green.
M2 — checks + semantics (checks.py, semantics.py). Done when all §8 expected outputs reproduce exactly in test_fixture.py.
M3 — MCP wrapper (server.py). Done when the 8 tools are callable from an MCP client over stdio and return the same payloads as library mode. No logic in this file.
M4 — orchestrator with a mocked client (canned JSON responses incl. one malformed, one contradicting). Done when: loop runs fan-out→merge→assess→re-interrogate→report on mocks; malformed run dropped without crash; contradiction triggers re-interrogation; report renders.
M5 — live smoke + eval harness. Done when arm 1 and arm 6 run live on 10 GSM8K items under $0.50, then full 6-arm run produces the summary table.
8. Worked example — canonical test fixture
Ingest (single run r1, then r2 adds one contradicting node):
r1 nodes: A given "the survey covers every server in rack 7" conf .9
B given "server x9 is listed in the rack 7 survey" conf .9
C infer "server x9 was included in the survey" conf .85
D given "the survey marks server x9 as running linux" conf .9
E infer "server x9 can run the cron scheduler" conf .8
Z concl "server x9 can be used for the nightly cron job" conf .8
F infer "rack 7 has spare capacity" conf .6 (orphan)
r1 edges: A→C supports .9 ; B→C supports .9 ; C→E supports .8 ;
D→E supports .85 ; E→Z supports .8 ; D→Z supports .7
r2 nodes: G infer "the rack 7 survey is outdated and unreliable" conf .6
r2 edges: G attacks A .6
Expected outputs (these ARE the M2 acceptance tests):
- `check_structure(Z)`: orphans = [F, G]*, cycles = [], unreachable_conclusion = false, refuted_but_feeding = []. (*G has no incoming supports; report it, since an unsupported attacker is itself suspect.)
- `support_width(Z)`: disjoint_paths = 2. B→C→E→Z is NOT disjoint from D→E→Z because they share E; the two disjoint routes are {A or B}→C→E→Z and D→Z.
- `critical_links(Z)`: min_cut_nodes size 2; bridge_edges = [] (with two disjoint routes, no single edge removal disconnects Z).
- `surviving_claims`: G unattacked → IN; A attacked by IN node → OUT; C still reachable via B → C survives; surviving = {B, C, D, E, Z, G}; A ∉ surviving. (F is an orphan inference with no path from a given, so it is not reachable and not surviving.)
- After `mark_refuted(D, "survey column misread")`: support_width(Z) = 1 and refuted_but_feeding = [D] until D's edges are ignored (refuted nodes are excluded from D_aug, so width drops to 1 via B→C→E→Z).
- `disputed_nodes(Z)`: isolated_load_bearing includes G (single-run, attacker of load-bearing A). Implementer note: attackers of on-path nodes count as load-bearing.
If any expected value above conflicts with your implementation's output, re-read §3 before changing the expectation; the fixture was hand-computed and one discrepancy (e.g. off-by-one on disjoint paths) usually means super-source handling or refuted-node exclusion is wrong.
9. Glossary
- Support width: number of node-disjoint paths from givens to a conclusion (Menger's theorem: equals the minimum number of nodes whose failure disconnects it).
- Grounded extension (Dung): the unique minimal fixed-point labelling of an attack graph; the "skeptically acceptable" claims.
- Interrogation: one independent, fresh-context model call that emits a full reasoning graph.
- Detective pass: cross-run comparison after merge — corroboration (many runs), contradiction (X vs ¬X across runs), isolation (single-run load-bearing claim).
- Formalization tax: the useful difficulty of restating vague prose as typed nodes/edges.






