Ryo Suwito

Posted on Jul 2

Logic-Graph Verification System

#ai #webdev #productivity #programming

Status: Draft v1 · 2026-07-02
Author: design session (Claude Fable 5)
Implementer: any capable coding model, per SYSTEM-DESIGN.md

1. One-liner

Convert the cheap token throughput of open-weight thinking models into verified conclusions, by externalizing their reasoning as a graph, sampling it N independent times, and letting mechanical graph checks — not model vibes — decide what survives.

2. Problem

Self-correction by prompting fails. Re-feeding a model its own chain-of-thought ("let's revisit...") makes it anchor to its own tokens and rubber-stamp itself. Known result (Huang et al., "LLMs Cannot Self-Correct Reasoning Yet").
tokens/sec is a garbage metric. A cheap model emitting 1M tokens of confident nonsense is worth less than nothing. The metric that matters is correct conclusions per dollar.
Verification-by-LLM inherits the problem. Asking a model "is this true?" just moves the vibes one level up. We need checks that are mechanical — computable, deterministic, not model-judged.
Cheap models confabulate from pretraining. A system whose claims must trace to imported sources cannot tolerate a model that silently answers from its weights.

3. Core insights (the product thesis)

Shape is checkable even when truth isn't. An argument externalized as a directed graph admits mechanical checks: unsupported premises (orphans), circular reasoning (cycles), disconnected conclusions (reachability), refuted-claims-still-load-bearing.

"Truthiness" becomes topology. Confidence in a conclusion ∝ how many node-disjoint chains connect evidence to it (Menger's theorem). One road to Rome = fragile; three independent roads = an adversary must break three separate claims.

Independence is sampled, not requested. Asking one context for "3 independent arguments" yields the same argument paraphrased 3 times. Asking N independent inference calls (fresh context, temperature > 0) yields genuinely decorrelated attempts — the detective's technique: separate interrogations, then cross-examine. Contradictions between runs are the interrogation slip-ups; they are invisible inside any single transcript.

Direct attention, don't diffuse it. "Revisit your reasoning" makes a model skim everything shallowly. "Edge 7 is load-bearing and unverified" makes it spend its whole budget on the right spot. Min-cut/betweenness computes where that spot is.

The formalization tax is a feature. A vague step becomes visibly vague when it must be written as a typed node with typed edges.

4. Users & use cases

Primary user: a developer/researcher running multi-step reasoning or research-over-documents workloads on cheap open-weight models (via OpenRouter or local inference), who needs reliability without frontier-model prices.
Use cases:
- U1: Multi-hop question answering over provided documents, with an auditable argument graph as output.
- U2: Claim verification — "here is a conclusion; how much would have to be wrong for it to fail?"
- U3: As an MCP tool available to any agent, giving it external working memory for arguments plus structural self-checks.

5. Goals & success metrics

#	Goal	Metric	Target
G1	Screening identifies grounded models	screen produces grounded% / memorized% / abstain% per model	screen runs end-to-end < $1
G2	System beats naive self-correction	accuracy on eval set vs "revisit" and verbatim re-injection arms	≥ +5pp at equal or lower cost
G3	Cheap ensemble beats one big call	correct-per-dollar vs one frontier-class call	≥ 2× correct-per-dollar
G4	Confidence is calibrated by structure	conclusions with support width ≥ 2 are correct more often than width 1	monotone relationship observed
G5	Auditable output	every conclusion ships with its graph: surviving claims, width, disputed nodes	100% of runs

6. Non-goals

Not a truth oracle. The graph verifies argument shape; every edge is semantically vouched for only by a model. Validly-connected garbage passes structure checks — by design, the structural layer only decides where semantic effort (re-interrogation) goes.
No fine-tuning, no inference-loop surgery. Everything works through standard chat-completions + tool calls. No custom sampling, no template hacking, no stop-at-</think> interception.
No persistence. Graphs are session-scoped and ephemeral (in-memory). No database.
Not a general theorem prover. Natural-language claims, heuristic dedup — not formal logic.

System Design

1. Architecture

Two deployment modes for the graph server, same codebase:

Library mode (default for orchestrator): orchestrator imports graph_server.store directly, no MCP transport. Simpler, faster, use for P3/P4.
MCP mode: server.py exposes the same functions as MCP tools over stdio, so any MCP-capable agent (Claude Code, etc.) can use the graph as external working memory. Thin wrapper only — zero logic in the transport layer.

2. Component A — Grounding screen (P1, exists)

Already implemented at repo root: screen.py, items.jsonl (50 items), models.json. Contract: temperature 0, 1 call/item/model, regex-graded. See README.md. Output feeds one decision: which model id the orchestrator defaults to. Do not modify except to fix bugs.

3. Component B — Graph server (P2)

3.1 Data model

In-memory only. One GraphStore holds many independent graphs keyed by graph_id (string, caller-supplied).

Node:
  id: str            # caller-supplied, unique within graph, e.g. "r2:n4" (run 2, node 4)
  claim: str         # natural-language claim, one sentence
  type: str          # "given" | "inference" | "assumption" | "conclusion"
  confidence: float  # 0.0-1.0, model-stated, default 0.8
  run_ids: set[str]  # provenance: which interrogation runs asserted this claim
  refuted: bool      # default False, set only via mark_refuted
  refute_reason: str | None
  aliases: list[str] # claims of nodes merged into this one (dedup provenance)

Edge:
  src: str, dst: str
  relation: str      # "supports" | "attacks" | "assumes"
  confidence: float  # 0.0-1.0, default 0.8
  run_ids: set[str]

Backed by one networkx.MultiDiGraph per graph_id (multi because supports and attacks may both exist between the same pair). Node attrs hold the Node fields; edge attrs hold relation/confidence/run_ids.

Validation on ingest (reject with error message, don't crash): unknown type/relation, edge endpoints not in graph after this ingest, confidence outside [0,1], duplicate node id with different claim text (same id + same claim = idempotent, merge run_ids).

3.2 Public functions / MCP tools

Eight functions. Signatures below are the library API; MCP tool schemas mirror them 1:1 with graph_id always the first param.

#	Function	Params	Returns
1	`assert_graph`	graph_id, nodes[], edges[]	`{accepted_nodes, accepted_edges, rejected: [{item, reason}], auto_merged: [[kept_id, merged_id], ...]}` — runs dedup (§3.3) on the new nodes against existing ones automatically
2	`merge_duplicates`	graph_id, jaccard_threshold=0.7, ratio_threshold=0.85	`{merges: [[kept_id, merged_id]], contradictions_created: [[id_a, id_b]]}` — full-graph re-pass
3	`check_structure`	graph_id, conclusion_id	`{orphans: [id], cycles: [[id,...]], unreachable_conclusion: bool, refuted_but_feeding: [id]}`
4	`critical_links`	graph_id, conclusion_id	`{min_cut_nodes: [id], bridge_edges: [[src,dst]], ranked: [{edge, betweenness, min_confidence_on_edge}]}` sorted least-confident-first
5	`support_width`	graph_id, conclusion_id	`{disjoint_paths: int, paths: [[id,...]], max_flow: float}`
6	`surviving_claims`	graph_id	`{in: [id], out: [id], undecided: [id], surviving: [id]}` (§3.5)
7	`mark_refuted`	graph_id, node_id, reason	`{ok, width_before, width_after}` for the graph's conclusion node if one exists
8	`disputed_nodes`	graph_id, conclusion_id	`{contradiction_pairs: [[id_a, id_b]], isolated_load_bearing: [{id, run_count, on_path}]}` (§3.6)

Every function MUST be pure-read except 1, 2, 7. Every function MUST return JSON-serializable dicts. Errors return {error: "..."}, never raise across the MCP boundary.

3.3 Dedup / merge (THE load-bearing component)

Purpose: identical claims from different runs MUST collapse to one node, otherwise (a) support width is inflated by paraphrase and (b) contradictions never meet each other.

Normalization norm(claim):

lowercase; unicode NFC
strip punctuation except . inside numbers and %
canonicalize numbers: remove thousands separators (84,200 → 84200)
collapse whitespace
tokenize on whitespace; drop English stopwords (small fixed list: a, an, the, is, are, was, were, of, in, on, at, to, that, this, it, and)

Negation guard (checked BEFORE similarity): let NEG = {not, no, never, cannot, "n't", without, false}. If tokens(a) - NEG == tokens(b) - NEG but the two differ in negation-token count parity → DO NOT merge; instead create mutual attacks edges between them and report in contradictions_created.

Numeric-conflict guard: if token sets are equal except for differing numeric tokens (e.g. "trellium melts at 412" vs "trellium melts at 350") → do not merge; create mutual attacks edges and report.

Similarity (only if guards pass): merge iff jaccard(tokens_a, tokens_b) >= 0.7 OR difflib.SequenceMatcher(None, norm_a, norm_b).ratio() >= 0.85. Both thresholds tunable via config. (v1 upgrade path, out of scope: cosine over a small sentence-embedding model; keep the interface identical.)

Merge policy: keep the node with the earliest assertion (stable order: first run_id, then node id). Union run_ids and aliases; keep max confidence; re-point all edges to the kept node; collapse resulting parallel edges of the same relation (max confidence, union run_ids). given type wins over inference wins over assumption if merged nodes disagree on type. Never merge a refuted=True node into a live one — merge live into refuted keeps refuted.

Clustering: union-find over pairwise matches (O(n²) pairwise is fine; graphs are hundreds of nodes, not millions).

3.4 Structural checks (networkx recipes)

Work on the supports+assumes subgraph (attacks edges excluded) unless stated. Build D = nx.DiGraph view collapsing the multigraph, keeping max confidence per (src,dst).

orphans: nodes with in-degree 0 that are not type=given and not type=assumption (assumptions are declared floats; still reported separately as assumptions in the payload — implementer: include key assumptions: [id]).
cycles: list(nx.simple_cycles(D)), cap output at first 10.
unreachable_conclusion: conclusion not in descendants(D, g) for any given g. Compute once: reachable = union of nx.descendants(D, g) | {g} for all givens; unreachable iff conclusion not in reachable.
refuted_but_feeding: nodes with refuted=True that still have a path to the conclusion in D.
critical_links: nx.minimum_node_cut(D_aug, s, t) and betweenness via nx.betweenness_centrality_subset(D, sources=givens, targets=[conclusion]). For s-t computations with multiple sources add a virtual super-source __S__ with edges to every given (confidence 1.0), and use conclusion as sink; NEVER report __S__ in results. Bridge edges: edges whose removal makes conclusion unreachable (test each edge on the paths; graphs are small, brute force is fine).

3.5 support_width & surviving_claims

support_width (Menger): on D_aug (super-source __S__ → all non-refuted givens):

disjoint_paths = len(list(nx.node_disjoint_paths(D_aug, "__S__", conclusion))) — strip __S__ from reported paths. Handle NetworkXNoPath → 0.
max_flow: capacities = edge confidence; node capacities = node confidence (implement via standard node-splitting: v → v_in, v_out with capacity = confidence). nx.maximum_flow_value on the split graph. Givens and __S__ edges get capacity ∞ (use 1e9).

surviving_claims (grounded semantics over attacks, then reachability):

A = subgraph with attacks edges only (all nodes, attack edges)
label all nodes UNDEC
repeat until no change:
    node -> IN  if every attacker of it is OUT (vacuously true if no attackers)
    node -> OUT if some attacker of it is IN
# refuted nodes are forced OUT before the loop starts
surviving = { n : label(n) != OUT
              and (type(n)=="given"
                   or n reachable from some IN/UNDEC given
                      via supports/assumes edges through only non-OUT nodes) }

Return the full labelling plus surviving.

3.6 disputed_nodes

contradiction_pairs: all pairs connected by mutual attacks created by the guards in §3.3, plus any pair the caller asserted with explicit mutual attacks.
isolated_load_bearing: nodes with len(run_ids) == 1 AND (betweenness_subset > 0 from givens to conclusion OR node lies on any given→conclusion simple path). These are the prime confabulation suspects; the orchestrator re-interrogates them first.

4. Component C — Orchestrator (P3)

CLI: python -m orchestrator.run --task task.json --model <id> --n 6 --k 2 --budget-calls 20 --temp 0.8

task.json: {"question": str, "documents": [str], "expected_answer": str|null} (expected_answer only used by the eval harness).

4.1 Control loop

1. FAN OUT: N parallel interrogation calls (fresh context each, temperature=temp).
   Each call gets: system prompt (§4.2) + documents + question.
   Parse each response's JSON graph (§4.3). assert_graph with run_id="r{i}",
   node ids prefixed "r{i}:".
2. MERGE: merge_duplicates(graph_id).
3. ASSESS: check_structure, support_width(conclusion), disputed_nodes.
   The conclusion node: each run marks exactly one node type=conclusion; after
   merge there should be 1..N conclusion nodes. If >1 cluster of conclusions
   survive dedup, the runs DISAGREE on the answer itself -> treat each cluster
   as a candidate; the report ranks candidates by support width.
4. RE-INTERROGATE (while budget remains and disputed set nonempty):
   For each disputed node (contradiction pairs first, then isolated load-bearing,
   max 3 per round): send M=3 fresh verification calls (§4.2 template B: the
   claim + documents ONLY — no graph, no prior transcript). Majority verdict:
   - refuted by >=2 of 3 -> mark_refuted(node, reason)
   - confirmed by >=2 of 3 -> bump confidence to max(old, 0.9), add run_id "v*"
5. STOP when: (a) no disputed nodes remain, or (b) all conclusion candidates
   stable AND top candidate width >= k, or (c) budget-calls exhausted.
6. REPORT: JSON + rendered markdown: chosen conclusion (claim text), support
   width + disjoint paths, surviving_claims, killed nodes with reasons,
   remaining disputed, per-call and total cost, wall clock.

4.2 Prompt templates (verbatim starting points; iterate only with eval evidence)

Template A — interrogation (system):

You are building a rigorous argument. Read the documents, then answer the
question by constructing an explicit reasoning graph.

Rules:
- Every claim must come from the documents or be an explicitly labeled
  assumption. Do not use outside knowledge.
- Aim for at least {k} independent lines of support for your conclusion,
  but NEVER invent support: a single honest chain beats two fabricated ones.
- If you find evidence against a claim, include it as a node with an
  "attacks" edge. Objections make the graph stronger, not weaker.

Output ONLY a JSON object, no prose, matching:
{"conclusion_node": "<node id>",
 "nodes": [{"id": "n1", "claim": "<one sentence>",
            "type": "given|inference|assumption|conclusion",
            "confidence": 0.0-1.0}],
 "edges": [{"from": "n1", "to": "n2",
            "relation": "supports|attacks|assumes",
            "confidence": 0.0-1.0}]}
"given" = stated in a document. "inference" = derived. Every inference must
have at least one incoming supports edge.

Template B — verification (system; fresh context, no graph shown):

Assess one claim against the documents. Reply ONLY with JSON:
{"verdict": "supported" | "refuted" | "not_determinable",
 "reason": "<one sentence citing the document or the gap>"}
Use ONLY the documents. If the documents don't settle it, say not_determinable.

User: documents + Claim: "<claim text>". not_determinable counts toward neither majority; 3× not_determinable → leave node as-is but drop its confidence to min(old, 0.5).

4.3 JSON parsing policy

Strip markdown fences and any text before the first { / after the last }.
json.loads; on failure, one retry call appending the parse error and "Output ONLY the JSON object."
On second failure, salvage with a lenient parser (json_repair lib or regex-extract of nodes/edges arrays); log salvaged=true.
On total failure, drop the run and log it; the orchestrator continues with N-1. NEVER crash the loop on one bad run.
Schema-compliance rate per model is a reported metric — feed it back into model choice.

4.4 API client

Reuse the conventions from screen.py: OpenRouter chat-completions, OPENROUTER_API_KEY env var, retry ×3 with exponential backoff on 429/5xx, cost accounting from usage × catalog pricing. Factor this into orchestrator/client.py — do NOT import from screen.py (P1 stays frozen); copy and adapt.

5. Component D — Eval harness (P4)

Six arms, same items, same model (screen winner), same total-token budget reported per arm:

Arm	Description
1	Single pass baseline (one call, direct answer)
2	Naive revisit: answer → "revisit your reasoning" → final
3	Verbatim CoT re-injection: answer → full trace fed back → final
4	Arrow-notation forced thinking (system prompt), single pass
5	Model-compressed chain: answer → model compresses own trace to arrow chain → fresh context verifies chain → final
6	Full system (orchestrator §4)

Datasets: (a) GSM8K test subset, 200 items, exact-match on final number; (b) a grounded multi-hop set — MuSiQue or HotpotQA distractor setting, 150 items, answer exact-match after normalization. (b) matters more: the system's thesis is imported-knowledge reasoning.

Metrics per arm: accuracy, total cost USD, cost per correct answer (headline), mean wall-clock, tokens. Plus arm-6-only: does accuracy stratify by support width (PRD G4)?

Output: evals/results/<runstamp>/ with raw JSONL per arm + one summary markdown table.

6. Stack, layout, dependencies

Python ≥ 3.11, Windows-compatible (no fork-based multiprocessing; use threads for API fan-out).
Deps: networkx, requests, mcp (or fastmcp) for MCP mode, pytest. Optional: json_repair. NO heavyweight deps (no torch/transformers) in v0.
Secrets: OPENROUTER_API_KEY env var only. Never on disk, never in logs.

logic-screen/                # existing repo root (rename to logic-graph/ if desired)
  screen.py items.jsonl models.json README.md    # P1, frozen
  docs/PRD.md docs/SYSTEM-DESIGN.md
  graph_server/
    __init__.py store.py     # GraphStore + 8 functions (library mode)
    dedup.py                 # §3.3
    checks.py                # §3.4
    semantics.py             # §3.5 grounded labelling, width, flow
    server.py                # MCP stdio wrapper (thin)
    tests/test_store.py test_dedup.py test_semantics.py test_fixture.py
  orchestrator/
    __init__.py run.py client.py prompts.py parse.py report.py
    tests/test_parse.py test_loop.py   # loop tested with a mocked client
  evals/
    arms.py score.py datasets/ results/

7. Milestones & acceptance criteria (implement in order)

M0 — run the P1 screen (human triggers; needs API key). Done when: results.jsonl exists and a model is chosen. Everything after is buildable without a key.

M1 — GraphStore + dedup (store.py, dedup.py). Done when:

ingest of §8 fixture passes validation; re-ingest is idempotent
"Server X runs Linux" (r1) and "server x runs linux." (r2) merge; run_ids unioned
"Server X runs Linux" vs "Server X does not run Linux" do NOT merge; mutual attacks created (negation guard)
"trellium melts at 412 C" vs "trellium melts at 350 C" → attacks (numeric guard)
pytest green.

M2 — checks + semantics (checks.py, semantics.py). Done when all §8 expected outputs reproduce exactly in test_fixture.py.

M3 — MCP wrapper (server.py). Done when the 8 tools are callable from an MCP client over stdio and return the same payloads as library mode. No logic in this file.

M4 — orchestrator with a mocked client (canned JSON responses incl. one malformed, one contradicting). Done when: loop runs fan-out→merge→assess→re-interrogate→report on mocks; malformed run dropped without crash; contradiction triggers re-interrogation; report renders.

M5 — live smoke + eval harness. Done when arm 1 and arm 6 run live on 10 GSM8K items under $0.50, then full 6-arm run produces the summary table.

8. Worked example — canonical test fixture

Ingest (single run r1, then r2 adds one contradicting node):

r1 nodes: A given  "the survey covers every server in rack 7"            conf .9
          B given  "server x9 is listed in the rack 7 survey"            conf .9
          C infer  "server x9 was included in the survey"                conf .85
          D given  "the survey marks server x9 as running linux"         conf .9
          E infer  "server x9 can run the cron scheduler"                conf .8
          Z concl  "server x9 can be used for the nightly cron job"      conf .8
          F infer  "rack 7 has spare capacity"                           conf .6   (orphan)
r1 edges: A→C supports .9 ; B→C supports .9 ; C→E supports .8 ;
          D→E supports .85 ; E→Z supports .8 ; D→Z supports .7
r2 nodes: G infer  "the rack 7 survey is outdated and unreliable"        conf .6
r2 edges: G attacks A .6

Expected outputs (these ARE the M2 acceptance tests):

check_structure(Z): orphans = [F, G]* , cycles = [], unreachable_conclusion = false, refuted_but_feeding = []. (*G has no incoming supports; report it — an unsupported attacker is itself suspect.)
support_width(Z): disjoint_paths = 2 (e.g. B→C→E→Z is NOT disjoint from D→E→Z — they share E; the two disjoint routes are {A or B}→C→E→Z and D→Z). Expected: 2.
critical_links(Z): min_cut_nodes size 2; bridge_edges = .
surviving_claims: G unattacked → IN; A attacked by IN node → OUT; C still reachable via B → C survives; surviving = {B, C, D, E, Z, G}; A ∉ surviving. (F: no path from a given? F is an orphan inference — not reachable → not surviving.)
After mark_refuted(D, "survey column misread"): support_width(Z) = 1 and refuted_but_feeding = [D] until D's edges are ignored (refuted nodes are excluded from D_aug — width drops to 1 via B→C→E→Z).
disputed_nodes(Z): isolated_load_bearing includes G (single-run, attacker of load-bearing A) — implementer note: attackers of on-path nodes count as load-bearing.

If any expected value above conflicts with your implementation's output, re-read §3 before changing the expectation; the fixture was hand-computed and one discrepancy (e.g. off-by-one on disjoint paths) usually means super-source handling or refuted-node exclusion is wrong.

9. Glossary

Support width: number of node-disjoint paths from givens to a conclusion (Menger's theorem: equals the minimum number of nodes whose failure disconnects it).
Grounded extension (Dung): the unique minimal fixed-point labelling of an attack graph; the "skeptically acceptable" claims.
Interrogation: one independent, fresh-context model call that emits a full reasoning graph.
Detective pass: cross-run comparison after merge — corroboration (many runs), contradiction (X vs ¬X across runs), isolation (single-run load-bearing claim).
Formalization tax: the useful difficulty of restating vague prose as typed nodes/edges.

DEV Community