<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AlexMikhalev</title>
    <description>The latest articles on DEV Community by AlexMikhalev (@alexmikhalev).</description>
    <link>https://dev.to/alexmikhalev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F615498%2F1d6840a1-b457-4a78-a93a-1fa431bfc53c.png</url>
      <title>DEV Community: AlexMikhalev</title>
      <link>https://dev.to/alexmikhalev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexmikhalev"/>
    <language>en</language>
    <item>
      <title>Plug Terraphim Search into Claude Code and opencode (CLI First, MCP When You Need It)</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Sat, 18 Apr 2026 08:53:07 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/plug-terraphim-search-into-claude-code-and-opencode-cli-first-mcp-when-you-need-it-4404</link>
      <guid>https://dev.to/alexmikhalev/plug-terraphim-search-into-claude-code-and-opencode-cli-first-mcp-when-you-need-it-4404</guid>
      <description>&lt;p&gt;Your AI coding agent already has a knowledge graph. It is just not yours yet. The model knows GitHub, Stack Overflow, and the public training corpus -- it has no idea that in your project &lt;code&gt;npm&lt;/code&gt; should be &lt;code&gt;bun&lt;/code&gt;, that &lt;code&gt;RFP&lt;/code&gt; is shorthand for &lt;code&gt;acquisition need&lt;/code&gt;, or that the email about the Stripe receipt for the Obsidian licence lives in your Fastmail mailbox. This post shows the smallest path to fixing that for both &lt;a href="https://claude.com/claude-code" rel="noopener noreferrer"&gt;Claude Code&lt;/a&gt; and &lt;a href="https://opencode.ai" rel="noopener noreferrer"&gt;opencode&lt;/a&gt;, using &lt;a href="https://terraphim.ai" rel="noopener noreferrer"&gt;Terraphim&lt;/a&gt; and the three roles we have published over the last week (Terraphim Engineer, &lt;a href="https://terraphim.ai/posts/personal-assistant-role-jmap-obsidian/" rel="noopener noreferrer"&gt;Personal Assistant&lt;/a&gt;, &lt;a href="https://terraphim.ai/posts/system-operator-logseq-knowledge-graph/" rel="noopener noreferrer"&gt;System Operator&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Two paths. CLI first.&lt;/p&gt;

&lt;h2&gt;
  
  
  What "integrate" means here
&lt;/h2&gt;

&lt;p&gt;The host -- Claude Code or opencode -- needs a way to ask your role-aware Terraphim setup a question and get back ranked, source-attributed results. The model decides when to ask. The role decides which haystacks to search. Terraphim's &lt;code&gt;terraphim-graph&lt;/code&gt; ranker decides which results come back first.&lt;/p&gt;

&lt;p&gt;Concrete example. You are working in opencode and you type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tsearch "System Operator" RFP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The slash command runs against the System Operator role. The role's knowledge graph normalises &lt;code&gt;RFP&lt;/code&gt; to its INCOSE-canonical form &lt;code&gt;acquisition need&lt;/code&gt;. The Aho-Corasick matcher walks the role's haystack (1,347 Logseq pages from the &lt;a href="https://github.com/terraphim/system-operator" rel="noopener noreferrer"&gt;terraphim/system-operator&lt;/a&gt; repository). The top hit comes back ranked 13 -- &lt;code&gt;Acquisition need.md&lt;/code&gt; -- with the &lt;code&gt;synonyms::&lt;/code&gt; line that mapped your query to it visible in the snippet. The model now has the right page in its context window and can answer your follow-up without a hallucinated INCOSE handbook reference.&lt;/p&gt;

&lt;p&gt;This works in both hosts because both speak the same two integration languages: shell-out slash commands and MCP servers. We are going to use both.&lt;/p&gt;

&lt;h2&gt;
  
  
  Path A -- CLI via slash command
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why this is the recommended starting point
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;terraphim-agent&lt;/code&gt; already exists, takes &lt;code&gt;--role&lt;/code&gt; and &lt;code&gt;--limit&lt;/code&gt;, and writes ranked results to stdout. There is nothing to build. Both Claude Code and opencode let slash commands shell out via Bash. So a two-line command file is the entire integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  One file, two hosts
&lt;/h3&gt;

&lt;p&gt;Drop this at &lt;code&gt;~/.claude/commands/tsearch.md&lt;/code&gt; (and an identical copy at &lt;code&gt;~/.config/opencode/command/tsearch.md&lt;/code&gt; -- both hosts read the same frontmatter shape):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Terraphim&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;search&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;across&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;configured&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;roles.&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Usage:&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;/tsearch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;[role]&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;lt;query&amp;gt;"&lt;/span&gt;
&lt;span class="na"&gt;allowed-tools&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Bash(terraphim-agent search:*), Bash(terraphim-agent-pa search:*)&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
Run &lt;span class="sb"&gt;`terraphim-agent search --role "&amp;lt;role&amp;gt;" --limit 5 "&amp;lt;query&amp;gt;"`&lt;/span&gt; (or
&lt;span class="sb"&gt;`terraphim-agent-pa search ...`&lt;/span&gt; if the role is "Personal Assistant" and
the query needs the JMAP haystack). Return the top results as a numbered
list with title, source path/URL, and a 120-char snippet.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is it. The &lt;code&gt;allowed-tools&lt;/code&gt; line auto-approves the two CLI invocations so the model does not have to ask permission per call. Restart the host (or reload commands) and &lt;code&gt;/tsearch&lt;/code&gt; is live.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why it is fast enough
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;terraphim-agent&lt;/code&gt; reads its persisted role state at start (a few milliseconds), runs the query against the role's haystacks, and returns. For a typical knowledge-graph query against the Terraphim Engineer role on a laptop, the round trip from slash command to formatted output is well under a second. The agent already has the typed CLI -- &lt;code&gt;--role&lt;/code&gt;, &lt;code&gt;--limit&lt;/code&gt;, &lt;code&gt;--format json&lt;/code&gt; -- so there is nothing the MCP layer adds for the search-only flow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three example queries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/tsearch "Terraphim Engineer" rolegraph
/tsearch "System Operator" RFP
/tsearch "Personal Assistant" invoice    # uses terraphim-agent-pa wrapper for JMAP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Personal Assistant case is the most interesting because it crosses surfaces -- Obsidian notes interleave with &lt;code&gt;jmap:///email/&amp;lt;id&amp;gt;&lt;/code&gt; URLs from your Fastmail mailbox, ranked by the same &lt;code&gt;terraphim-graph&lt;/code&gt; scoring. The wrapper script injects &lt;code&gt;JMAP_ACCESS_TOKEN&lt;/code&gt; from 1Password at call time so the secret never lands on disk; the bare &lt;code&gt;terraphim-agent&lt;/code&gt; continues to work for the other five roles without paying the unlock cost.&lt;/p&gt;
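
&lt;p&gt;The wrapper itself is not reproduced in this post. A minimal sketch, assuming your token lives at a hypothetical &lt;code&gt;op://&lt;/code&gt; secret reference (&lt;code&gt;op run&lt;/code&gt; resolves &lt;code&gt;op://&lt;/code&gt; values in the environment at call time):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# ~/bin/terraphim-agent-pa -- sketch, not the published wrapper.
# The op:// path is illustrative; point it at your own 1Password item.
export JMAP_ACCESS_TOKEN="op://Personal/Fastmail/jmap-token"
exec op run -- terraphim-agent "$@"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;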

&lt;h2&gt;
  
  
  Path B -- MCP server (when you want typed tools)
&lt;/h2&gt;

&lt;p&gt;The CLI path is enough for search. If you want the model to call &lt;code&gt;search&lt;/code&gt; as a first-class tool with structured JSON parameters -- alongside &lt;code&gt;autocomplete_terms&lt;/code&gt;, &lt;code&gt;autocomplete_with_snippets&lt;/code&gt;, four flavours of fuzzy autocomplete, &lt;code&gt;build_autocomplete_index&lt;/code&gt;, and &lt;code&gt;update_config_tool&lt;/code&gt; -- that is what &lt;code&gt;terraphim_mcp_server&lt;/code&gt; exposes. It reads the same &lt;code&gt;~/.config/terraphim/embedded_config.json&lt;/code&gt;, so the role list is identical.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build and install
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/projects/terraphim/terraphim-ai
cargo build &lt;span class="nt"&gt;--release&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; terraphim_mcp_server &lt;span class="nt"&gt;--features&lt;/span&gt; jmap
&lt;span class="nb"&gt;cp &lt;/span&gt;target/release/terraphim_mcp_server ~/.cargo/bin/terraphim_mcp_server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the Personal Assistant role, mirror the existing &lt;code&gt;terraphim-agent-pa&lt;/code&gt; wrapper at &lt;code&gt;~/bin/terraphim_mcp_server-pa&lt;/code&gt; so the JMAP token flows through &lt;code&gt;op run&lt;/code&gt; instead of being baked into config.&lt;/p&gt;

&lt;h3&gt;
  
  
  Register
&lt;/h3&gt;

&lt;p&gt;opencode -- add to &lt;code&gt;~/.config/opencode/opencode.json&lt;/code&gt; under &lt;code&gt;mcp&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json-doc"&gt;&lt;code&gt;&lt;span class="nl"&gt;"terraphim"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt;    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"local"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/Users/alex/.cargo/bin/terraphim_mcp_server"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="nl"&gt;"terraphim-pa"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"local"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/Users/alex/bin/terraphim_mcp_server-pa"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code -- one shell command per server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude mcp add terraphim    /Users/alex/.cargo/bin/terraphim_mcp_server
claude mcp add terraphim-pa /Users/alex/bin/terraphim_mcp_server-pa
claude mcp list      &lt;span class="c"&gt;# both should show as Connected&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model now sees &lt;code&gt;mcp__terraphim__search&lt;/code&gt; and &lt;code&gt;mcp__terraphim_pa__search&lt;/code&gt; (plus the autocomplete tools) in its tool list.&lt;/p&gt;
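
&lt;p&gt;For illustration, here is a typed call the model might emit against that tool surface -- the argument names are assumed from the CLI flags, not taken from the server's published schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ "tool": "mcp__terraphim__search",
  "arguments": { "query": "rolegraph", "role": "Terraphim Engineer", "limit": 5 } }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;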

&lt;h2&gt;
  
  
  SessionStart primer (both paths)
&lt;/h2&gt;

&lt;p&gt;Slash commands and MCP tools are useless if the model does not know the roles exist. Extend the SessionStart hook in &lt;code&gt;~/.claude/settings.json&lt;/code&gt; to print a one-screen role index when each session starts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'\n--- Terraphim search via /tsearch [role] &amp;lt;query&amp;gt; ---\n'&lt;/span&gt;
&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'  Terraphim Engineer  (Rust/agent KG)\n'&lt;/span&gt;
&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'  Personal Assistant  (Obsidian + Fastmail JMAP, use terraphim-agent-pa for email)\n'&lt;/span&gt;
&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'  System Operator     (INCOSE/MBSE Logseq KG)\n'&lt;/span&gt;
&lt;span class="nb"&gt;printf&lt;/span&gt; &lt;span class="s1"&gt;'  Context Engineering Author, Rust Engineer, Default\n'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
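
&lt;p&gt;To register this on the Claude Code side, save the snippet as an executable script (the path below is illustrative) and point a &lt;code&gt;SessionStart&lt;/code&gt; hook at it in &lt;code&gt;~/.claude/settings.json&lt;/code&gt; -- a sketch of the hook shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "hooks": {
    "SessionStart": [
      { "hooks": [ { "type": "command", "command": "~/.claude/hooks/terraphim-roles.sh" } ] }
    ]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;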



&lt;p&gt;Equivalent hook in opencode. Cost: one screen of context per session. Benefit: the model picks the right role on the first try instead of guessing.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to pick which path
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;CLI (Path A)&lt;/th&gt;
&lt;th&gt;MCP (Path B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;New binaries&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;terraphim_mcp_server&lt;/code&gt; plus wrapper&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-call latency&lt;/td&gt;
&lt;td&gt;~50-200 ms per call&lt;/td&gt;
&lt;td&gt;~10-50 ms per call (long-lived process)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tools exposed&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;search&lt;/code&gt; only&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;search&lt;/code&gt; + 4 autocomplete + &lt;code&gt;build_autocomplete_index&lt;/code&gt; + &lt;code&gt;update_config_tool&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works in any host&lt;/td&gt;
&lt;td&gt;Yes -- anything that runs a slash command&lt;/td&gt;
&lt;td&gt;Only hosts that speak MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token handling&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;terraphim-agent-pa&lt;/code&gt; wrapper&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;terraphim_mcp_server-pa&lt;/code&gt; wrapper&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For the search-across-roles flow, CLI is enough. Add MCP when the model needs autocomplete-as-you-type, when you want it to manage role configuration without leaving the conversation, or when you are using a host where the typed-tool surface matters more than the cold-start cost.&lt;/p&gt;

&lt;p&gt;You do not have to choose. Wire both. The slash command above defaults to CLI and falls back to MCP if the binary is missing -- the two paths coexist cleanly because they read the same role config.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is the right shape
&lt;/h2&gt;

&lt;p&gt;Most "AI assistant + knowledge base" integrations end up tightly coupled to a specific host. Vendor X's plugin marketplace, Vendor Y's tool format. Terraphim takes the opposite stance: the role configuration lives in your filesystem, the haystacks live in your filesystem (or your mailbox), the ranker runs in a process you own, and the integration with the AI host is the thinnest possible shim -- a slash command or an MCP server, both of which are commodity surfaces.&lt;/p&gt;

&lt;p&gt;Yesterday the Personal Assistant role was a private setup on one laptop. Today it is callable from inside two different AI coding hosts via a one-file slash command. Tomorrow you can add Cursor or Aider with the same two-line wrapper because the integration surface is &lt;code&gt;terraphim-agent search&lt;/code&gt;, not &lt;code&gt;vendor-specific-tool-protocol-v3&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The expensive part of context engineering is not the ranker. It is the vocabulary in the knowledge graph and the haystacks the role can reach. The integration layer should not be allowed to compete for that budget. CLI-first keeps it small.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Build (or install the published crate when JMAP feature lands on crates.io)&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/projects/terraphim/terraphim-ai
cargo &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--path&lt;/span&gt; crates/terraphim_agent &lt;span class="nt"&gt;--features&lt;/span&gt; jmap

&lt;span class="c"&gt;# Configure roles -- copy the snippets from the how-tos linked below&lt;/span&gt;
&lt;span class="nv"&gt;$EDITOR&lt;/span&gt; ~/.config/terraphim/embedded_config.json

&lt;span class="c"&gt;# Install the slash command&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ~/.claude/commands ~/.config/opencode/command
&lt;span class="nb"&gt;cp&lt;/span&gt; ~/projects/terraphim/terraphim-ai/docs/src/howto/mcp-integration-claude-opencode.md &lt;span class="se"&gt;\&lt;/span&gt;
   /tmp/tsearch.md  &lt;span class="c"&gt;# adapt to your slash command file shape&lt;/span&gt;

&lt;span class="c"&gt;# Reload roles&lt;/span&gt;
terraphim-agent config reload

&lt;span class="c"&gt;# Try&lt;/span&gt;
terraphim-agent search &lt;span class="nt"&gt;--role&lt;/span&gt; &lt;span class="s2"&gt;"Terraphim Engineer"&lt;/span&gt; &lt;span class="nt"&gt;--limit&lt;/span&gt; 3 &lt;span class="s2"&gt;"rolegraph"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step-by-step in the docs: &lt;a href="https://docs.terraphim.ai/howto/mcp-integration-claude-opencode.html" rel="noopener noreferrer"&gt;Plug Terraphim Search into Claude Code and opencode&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For the underlying engine, start with &lt;a href="https://terraphim.ai/posts/why-graph-embeddings-matter/" rel="noopener noreferrer"&gt;Why Graph Embeddings Matter&lt;/a&gt;. For the two roles this integration most cleanly exposes, see &lt;a href="https://terraphim.ai/posts/personal-assistant-role-jmap-obsidian/" rel="noopener noreferrer"&gt;Personal Assistant&lt;/a&gt; and &lt;a href="https://terraphim.ai/posts/system-operator-logseq-knowledge-graph/" rel="noopener noreferrer"&gt;System Operator&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>terraphim</category>
      <category>claudecode</category>
      <category>opencode</category>
      <category>mcp</category>
    </item>
    <item>
      <title>System Operator Demo: A Logseq Knowledge Graph Drives Enterprise MBSE Search</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Fri, 17 Apr 2026 18:31:00 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/system-operator-demo-a-logseq-knowledge-graph-drives-enterprise-mbse-search-1kb</link>
      <guid>https://dev.to/alexmikhalev/system-operator-demo-a-logseq-knowledge-graph-drives-enterprise-mbse-search-1kb</guid>
      <description>&lt;p&gt;Terraphim's &lt;a href="https://github.com/terraphim/system-operator" rel="noopener noreferrer"&gt;System Operator role&lt;/a&gt; is the demo we point people at when they want to see a real Logseq knowledge graph drive search. 1,347 Logseq pages, 52 of them carrying explicit &lt;code&gt;synonyms::&lt;/code&gt; lines, covering Model-Based Systems Engineering vocabulary -- requirements, architecture, verification, validation, life cycle concepts. This post walks the demo end-to-end and shows the piece people miss: the KG is doing real work, not just re-ranking text matches.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the demo is
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;terraphim/system-operator&lt;/code&gt; repository on GitHub is a &lt;strong&gt;Logseq vault&lt;/strong&gt; -- flat folder of markdown files under &lt;code&gt;pages/&lt;/code&gt;, one page per concept, with Logseq's bullet-tree syntax for structure and Terraphim-format &lt;code&gt;synonyms::&lt;/code&gt; lines for the knowledge-graph layer. Two things make it a useful demo rather than a toy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Real MBSE vocabulary.&lt;/strong&gt; The synonyms are not invented; they track the INCOSE Systems Engineering Handbook v4, the V-Model, and SEMP conventions. When you type &lt;code&gt;RFP&lt;/code&gt;, the automaton normalises it to &lt;code&gt;acquisition need&lt;/code&gt; because that is what the handbook calls it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real scale.&lt;/strong&gt; 1,347 markdown files is enough to expose cold-start behaviour (~5-10 seconds to index on a laptop) without being so large it obscures the ranking signal.&lt;/li&gt;
&lt;/ol&gt;
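
&lt;p&gt;The mechanism is a one-line Logseq page property. A hypothetical excerpt in that format (the synonym list on the real &lt;code&gt;Acquisition need.md&lt;/code&gt; page may differ):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Acquisition need
  synonyms:: RFP, request for proposal
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;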

&lt;h2&gt;
  
  
  Run it
&lt;/h2&gt;

&lt;p&gt;There is an automated setup script in the repo. As of today it clones to a durable path under &lt;code&gt;~/.config/terraphim/system_operator&lt;/code&gt; instead of &lt;code&gt;/tmp&lt;/code&gt;, so the vault survives a reboot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./scripts/setup_system_operator.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then either drive the role via the server --&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo run &lt;span class="nt"&gt;--bin&lt;/span&gt; terraphim_server &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; terraphim_server/default/system_operator_config.json
curl &lt;span class="s2"&gt;"http://127.0.0.1:8000/documents/search?q=RFP&amp;amp;role=System%20Operator&amp;amp;limit=5"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;-- or via the &lt;code&gt;terraphim-agent&lt;/code&gt; CLI after adding the role entry to &lt;code&gt;~/.config/terraphim/embedded_config.json&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;terraphim-agent config reload
terraphim-agent search &lt;span class="nt"&gt;--role&lt;/span&gt; &lt;span class="s2"&gt;"System Operator"&lt;/span&gt; &lt;span class="nt"&gt;--limit&lt;/span&gt; 5 &lt;span class="s2"&gt;"RFP"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full config snippet and the &lt;code&gt;embedded_config.json&lt;/code&gt; entry are in the &lt;a href="https://github.com/terraphim/terraphim-ai/blob/main/terraphim_server/README_SYSTEM_OPERATOR.md" rel="noopener noreferrer"&gt;&lt;code&gt;README_SYSTEM_OPERATOR.md&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The piece people miss
&lt;/h2&gt;

&lt;p&gt;Search-over-notes tools usually describe ranking in terms of "it uses a knowledge graph". That sentence hides a lot. Is the graph actually consulted at query time? Is it just a post-hoc re-ranker on top of BM25? Does it expand synonyms? On what vocabulary?&lt;/p&gt;

&lt;p&gt;Terraphim exposes the answer directly. &lt;code&gt;validate --connectivity&lt;/code&gt; prints which words in your query the automaton matched and what canonical terms they normalised to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;terraphim-agent validate &lt;span class="nt"&gt;--role&lt;/span&gt; &lt;span class="s2"&gt;"System Operator"&lt;/span&gt; &lt;span class="nt"&gt;--connectivity&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="s2"&gt;"RFP business analysis life cycle model business requirements documentation tree"&lt;/span&gt;

Connectivity Check &lt;span class="k"&gt;for &lt;/span&gt;role &lt;span class="s1"&gt;'System Operator'&lt;/span&gt;:
  Connected: &lt;span class="nb"&gt;false
  &lt;/span&gt;Matched terms: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"acquisition need"&lt;/span&gt;, &lt;span class="s2"&gt;"business or mission analysis"&lt;/span&gt;,
                  &lt;span class="s2"&gt;"business requirements"&lt;/span&gt;, &lt;span class="s2"&gt;"documentation tree"&lt;/span&gt;,
                  &lt;span class="s2"&gt;"life cycle concepts"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five query fragments, five canonical matches. &lt;code&gt;RFP&lt;/code&gt; collapsed to &lt;code&gt;acquisition need&lt;/code&gt; (its synonym, from &lt;code&gt;Acquisition need.md&lt;/code&gt; in the vault). &lt;code&gt;business analysis&lt;/code&gt; collapsed to &lt;code&gt;business or mission analysis&lt;/code&gt; (INCOSE terminology). &lt;code&gt;life cycle model&lt;/code&gt; collapsed to &lt;code&gt;life cycle concepts&lt;/code&gt;. None of this is text matching -- the word &lt;code&gt;RFP&lt;/code&gt; does not appear in the canonical page body; it lives in the &lt;code&gt;synonyms::&lt;/code&gt; line.&lt;/p&gt;

&lt;p&gt;Once a query is normalised, the ranker walks the graph. A document that mentions &lt;code&gt;acquisition need&lt;/code&gt; directly outranks one that mentions it through three synonym hops, and both outrank a document that mentions none of the canonical terms at all. Ranks come back with concrete integer scores -- &lt;code&gt;[13]&lt;/code&gt; on a top result, not an opaque 0.87 cosine.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it compares to the Personal Assistant role
&lt;/h2&gt;

&lt;p&gt;We &lt;a href="https://terraphim.ai/posts/personal-assistant-role-jmap-obsidian/" rel="noopener noreferrer"&gt;wrote up the Personal Assistant role yesterday&lt;/a&gt;: a private per-user role that indexes a Fastmail mailbox plus an Obsidian vault. Same engine, same ranker, different haystacks. The knowledge graph there is a small &lt;code&gt;kg/&lt;/code&gt; folder inside the user's vault with 14 synonym files covering personal vocabulary (&lt;code&gt;bun&lt;/code&gt; with &lt;code&gt;npm&lt;/code&gt;/&lt;code&gt;yarn&lt;/code&gt;/&lt;code&gt;pnpm&lt;/code&gt; synonyms, &lt;code&gt;odilo&lt;/code&gt;, &lt;code&gt;invoice&lt;/code&gt;, &lt;code&gt;meeting&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;The two roles expose the same pattern at two scales:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;System Operator&lt;/th&gt;
&lt;th&gt;Personal Assistant&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;KG size&lt;/td&gt;
&lt;td&gt;52 synonym files, 1,300-concept vocabulary&lt;/td&gt;
&lt;td&gt;14 synonym files, ~30-concept personal vocabulary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Haystacks&lt;/td&gt;
&lt;td&gt;1 (Logseq repo)&lt;/td&gt;
&lt;td&gt;2 (Obsidian vault + Fastmail JMAP)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Source&lt;/td&gt;
&lt;td&gt;Public GitHub repo&lt;/td&gt;
&lt;td&gt;Private user files and mailbox&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Audience&lt;/td&gt;
&lt;td&gt;Demos, onboarding, public showcase&lt;/td&gt;
&lt;td&gt;One user&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lifetime&lt;/td&gt;
&lt;td&gt;Frozen per release&lt;/td&gt;
&lt;td&gt;Edited daily, rebuilt in 20 ms per edit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both use &lt;code&gt;terraphim-graph&lt;/code&gt; ranking. Both build an Aho-Corasick automaton once at role-load time. Both run in a 4 GB process on a laptop with no cloud round-trip. The only interesting difference is the vocabulary, which is exactly the separation of concerns a knowledge-graph-first design is supposed to deliver.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters for teams evaluating MBSE tooling
&lt;/h2&gt;

&lt;p&gt;If you are evaluating Terraphim for a systems engineering group, the System Operator role is the honest starting point. It runs on a laptop against a public vault; you can check that every synonym mapping traces back to a concrete page; you can diff the &lt;code&gt;pages/&lt;/code&gt; folder against the INCOSE handbook and argue about terminology. When your team's own vocabulary diverges (every organisation's does), you clone the repo, edit &lt;code&gt;synonyms::&lt;/code&gt; lines, and the graph rebuilds in 20 milliseconds without a retraining step.&lt;/p&gt;

&lt;p&gt;The expensive part of enterprise search is not the ranker. It is the vocabulary. A deterministic graph makes the vocabulary an asset you curate, not a black box you tune.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/terraphim/terraphim-ai
&lt;span class="nb"&gt;cd &lt;/span&gt;terraphim-ai
./scripts/setup_system_operator.sh
cargo run &lt;span class="nt"&gt;--bin&lt;/span&gt; terraphim_server &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; terraphim_server/default/system_operator_config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or cut to the CLI if you already have &lt;code&gt;terraphim-agent&lt;/code&gt; installed -- the &lt;code&gt;embedded_config.json&lt;/code&gt; snippet is in the &lt;a href="https://github.com/terraphim/terraphim-ai/blob/main/terraphim_server/README_SYSTEM_OPERATOR.md" rel="noopener noreferrer"&gt;README&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For the underlying engine, start with &lt;a href="https://terraphim.ai/posts/why-graph-embeddings-matter/" rel="noopener noreferrer"&gt;Why Graph Embeddings Matter&lt;/a&gt;. For the personal-productivity analogue, see the &lt;a href="https://terraphim.ai/posts/personal-assistant-role-jmap-obsidian/" rel="noopener noreferrer"&gt;Personal Assistant role post&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>terraphim</category>
      <category>logseq</category>
      <category>knowledgegraph</category>
      <category>mbse</category>
    </item>
    <item>
      <title>Personal Assistant Role: One Search Across Email and Notes</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Fri, 17 Apr 2026 13:31:38 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/personal-assistant-role-one-search-across-email-and-notes-5691</link>
      <guid>https://dev.to/alexmikhalev/personal-assistant-role-one-search-across-email-and-notes-5691</guid>
      <description>&lt;p&gt;Most "personal AI" tools split your context across silos: one search box for email, another for notes, a third for your chat history. Terraphim treats every source as a haystack on the same role, so a single query crosses all of them. This post shows how to wire up the two most common personal sources -- email via JMAP and notes in an Obsidian vault -- under a new Personal Assistant role.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a unified role matters
&lt;/h2&gt;

&lt;p&gt;The mental tax of personal search is not the typing. It is the &lt;em&gt;deciding&lt;/em&gt;. "Did I read that in an email or write it in a note?" Each silo you skip is a context switch with no useful payload. Once Terraphim is the front door for both surfaces, the question collapses to "where is the thing about X" and the role's &lt;code&gt;terraphim-graph&lt;/code&gt; ranking serves whichever source actually has the strongest signal.&lt;/p&gt;

&lt;p&gt;The Personal Assistant role uses two haystacks under one role:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Obsidian vault&lt;/strong&gt; indexed by the &lt;code&gt;Ripgrep&lt;/code&gt; service. Plain markdown, sub-millisecond local search, no daemon.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fastmail mailbox&lt;/strong&gt; indexed by the &lt;code&gt;Jmap&lt;/code&gt; service (&lt;a href="https://www.rfc-editor.org/rfc/rfc8620" rel="noopener noreferrer"&gt;RFC 8620/8621&lt;/a&gt;). One HTTPS round trip per query, server-side full-text against your real mailbox, results returned with &lt;code&gt;jmap:///email/&amp;lt;id&amp;gt;&lt;/code&gt; URLs you can paste back into a mail client.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ranking is the same &lt;code&gt;terraphim-graph&lt;/code&gt; scoring as every other Terraphim role: an Aho-Corasick automaton built from the Obsidian vault contributes synonyms specific to your project vocabulary, then both haystacks share the same rank ladder. Notes and email interleave by relevance, not by source.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you get
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A single command: &lt;code&gt;terraphim-agent-pa search "&amp;lt;query&amp;gt;"&lt;/code&gt; returns mixed hits ordered by rank.&lt;/li&gt;
&lt;li&gt;Determinism. Every match traces back to a concrete edge in your knowledge graph -- no opaque embedding score.&lt;/li&gt;
&lt;li&gt;Privacy. The Obsidian vault never leaves disk; the JMAP query goes directly to your mail provider with your token. No Terraphim cloud component sits in the path.&lt;/li&gt;
&lt;li&gt;Composition. The role is just a JSON entry in &lt;code&gt;~/.config/terraphim/embedded_config.json&lt;/code&gt;. Add another haystack tomorrow (calendar, contacts, browser history) and the same query sweeps it too.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A 4 GB process on your laptop holds the whole working set; queries return in single-digit milliseconds for the local side and a few hundred for the remote JMAP round trip.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wiring sketch
&lt;/h2&gt;

&lt;p&gt;The role config is roughly thirty lines of JSON: two haystacks, one knowledge-graph pointer, no LLM. The Fastmail token is &lt;em&gt;not&lt;/em&gt; in the config -- it is injected at runtime via &lt;code&gt;op run&lt;/code&gt; from 1Password into the &lt;code&gt;JMAP_ACCESS_TOKEN&lt;/code&gt; environment variable, so the secret never lands on disk:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;exec &lt;/span&gt;op run &lt;span class="nt"&gt;--account&lt;/span&gt; my.1password.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--env-file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'JMAP_ACCESS_TOKEN=op://VAULT/ITEM/credential'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--&lt;/span&gt; /Users/alex/.cargo/bin/terraphim-agent &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$@&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wrap that in &lt;code&gt;~/bin/terraphim-agent-pa&lt;/code&gt;, &lt;code&gt;chmod +x&lt;/code&gt;, and the JMAP haystack lights up only when you invoke the wrapper. The other roles keep using the bare &lt;code&gt;terraphim-agent&lt;/code&gt; and never pay for the 1Password unlock.&lt;/p&gt;
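The thirty-line role entry can be sketched like this. The field names below are illustrative, not the exact Terraphim schema (the how-to has the canonical shape); only the two haystack services, the ranking function, and the config path come from this post:

```json
{
  "Personal Assistant": {
    "relevance_function": "terraphim-graph",
    "haystacks": [
      { "service": "Ripgrep", "location": "~/Obsidian/vault" },
      { "service": "Jmap", "location": "https://api.fastmail.com/jmap/session" }
    ],
    "knowledge_graph": "~/Obsidian/vault"
  }
}
```

The JMAP token is deliberately absent: it arrives via the JMAP_ACCESS_TOKEN environment variable set by the wrapper above, so the secret never lands in the config file.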

&lt;h2&gt;
  
  
  Why graph embeddings make this practical
&lt;/h2&gt;

&lt;p&gt;The reason a unified role works at all -- not just for two haystacks but for any reasonable number -- is that Terraphim's graph-embeddings layer is sub-millisecond and deterministic. There is no per-query embedding API call to amortise across sources, no vector database to keep in sync, no opaque ranker that has to be retrained when you add a new haystack. The matching is byte-level Aho-Corasick traversal of an automaton built once at role-load time. We wrote up the engine in detail at &lt;a href="https://terraphim.ai/posts/why-graph-embeddings-matter/" rel="noopener noreferrer"&gt;Why Graph Embeddings Matter&lt;/a&gt;; this Personal Assistant role is one application of that engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;The end-to-end how-to is in the docs: install the prerequisites, add the JSON snippet, write the wrapper, run three verification queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read the how-to: &lt;a href="https://docs.terraphim.ai/howto/personal-assistant-role.html" rel="noopener noreferrer"&gt;Personal Assistant Role on docs.terraphim.ai&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One caveat worth surfacing up front: the published &lt;code&gt;terraphim-agent&lt;/code&gt; on crates.io does not yet ship with the JMAP haystack (the &lt;code&gt;haystack_jmap&lt;/code&gt; dependency is not published either). For email search you need to build from local source with &lt;code&gt;cargo build --release -p terraphim_agent --features jmap&lt;/code&gt;. The how-to walks through the two Cargo.toml edits required.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is next
&lt;/h2&gt;

&lt;p&gt;Personal Assistant is the smallest useful instance of "Terraphim as the front door for everything I read." Calendar (CalDAV), contacts (CardDAV), browser bookmarks, RSS, and AI session logs are all natural follow-ups -- each is a single haystack entry on the same role. The pattern composes; the cost stays linear in haystacks, not quadratic in cross-source queries.&lt;/p&gt;

&lt;p&gt;If you want the underlying engine, start with &lt;a href="https://terraphim.ai/posts/why-graph-embeddings-matter/" rel="noopener noreferrer"&gt;Why Graph Embeddings Matter&lt;/a&gt;. If you want to wire knowledge-graph hooks into your AI coding agent on the same machine, &lt;a href="https://terraphim.ai/posts/teaching-ai-agents-with-knowledge-graphs/" rel="noopener noreferrer"&gt;Teaching AI Coding Agents with Knowledge Graph Hooks&lt;/a&gt; covers that side of the same engine.&lt;/p&gt;

</description>
      <category>terraphim</category>
      <category>personalassistant</category>
      <category>jmap</category>
      <category>obsidian</category>
    </item>
    <item>
      <title>Teaching AI Coding Agents with Knowledge Graph Hooks</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Fri, 17 Apr 2026 10:50:07 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/teaching-ai-coding-agents-with-knowledge-graph-hooks-497i</link>
      <guid>https://dev.to/alexmikhalev/teaching-ai-coding-agents-with-knowledge-graph-hooks-497i</guid>
      <description>&lt;p&gt;How we use Aho-Corasick automata and knowledge graphs to automatically enforce coding standards across AI coding agents like Claude Code, Cursor, and Aider.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;New:&lt;/strong&gt; see &lt;a href="https://terraphim.ai/posts/why-graph-embeddings-matter/" rel="noopener noreferrer"&gt;Why Graph Embeddings Matter&lt;/a&gt; for the underlying engine that makes these hooks possible — sub-millisecond, deterministic, fully explainable.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Anthropic Bought Bun. Claude Still Outputs &lt;code&gt;npm install&lt;/code&gt;.
&lt;/h2&gt;

&lt;p&gt;On December 3, 2025, &lt;a href="https://www.anthropic.com/news/anthropic-acquires-bun-as-claude-code-reaches-usd1b-milestone" rel="noopener noreferrer"&gt;Anthropic announced its first-ever acquisition&lt;/a&gt;: Bun, the blazing-fast JavaScript runtime. This came alongside Claude Code reaching &lt;a href="https://bun.com/blog/bun-joins-anthropic" rel="noopener noreferrer"&gt;$1 billion in run-rate revenue&lt;/a&gt; just six months after public launch.&lt;/p&gt;

&lt;p&gt;As Mike Krieger, Anthropic's Chief Product Officer, put it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Bun represents exactly the kind of technical excellence we want to bring into Anthropic... bringing the Bun team into Anthropic means we can build the infrastructure to compound that momentum."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Claude Code &lt;a href="https://simonwillison.net/2025/Dec/2/anthropic-acquires-bun/" rel="noopener noreferrer"&gt;ships as a Bun executable&lt;/a&gt; to millions of developers. Anthropic now owns the runtime their flagship coding tool depends on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And yet...&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Ask Claude to set up a Node.js project, and what do you get?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;express
yarn add lodash
pnpm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--save-dev&lt;/span&gt; jest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Anthropic's own models still default to npm, yarn, and pnpm in their outputs: the training data predates the acquisition, and old habits die hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So how do you teach your AI coding tools to consistently use Bun, regardless of what the underlying LLM insists on?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: LLMs Don't Know Your Preferences
&lt;/h2&gt;

&lt;p&gt;AI coding agents are powerful, but they're trained on the internet's collective habits—which means npm everywhere. Your team might have standardized on Bun for its speed (25% monthly growth, &lt;a href="https://devclass.com/2025/12/03/bun-javascript-runtime-acquired-by-anthropic-tying-its-future-to-ai-coding/" rel="noopener noreferrer"&gt;7.2 million downloads&lt;/a&gt; in October 2025), but every AI agent keeps suggesting the old ways.&lt;/p&gt;

&lt;p&gt;Manually fixing these inconsistencies is tedious. What if your knowledge graph could automatically intercept and transform AI outputs?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Knowledge Graph Hooks
&lt;/h2&gt;

&lt;p&gt;Terraphim provides a hook system that intercepts AI agent actions and applies knowledge graph-based transformations. The system uses:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Aho-Corasick automata&lt;/strong&gt; for efficient multi-pattern matching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LeftmostLongest strategy&lt;/strong&gt; ensuring specific patterns match before general ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markdown-based knowledge graph&lt;/strong&gt; files that are human-readable and version-controlled&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input Text → Aho-Corasick Automata → Pattern Match → Knowledge Graph Lookup → Transformed Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The knowledge graph is built from simple markdown files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# bun install&lt;/span&gt;

Fast package installation with Bun.

synonyms:: pnpm install, npm install, yarn install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the automata encounter any synonym, they replace it with the canonical term (the heading).&lt;/p&gt;
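The &lt;code&gt;synonyms::&lt;/code&gt; convention can be parsed in a few lines. This is an illustrative reimplementation, not the actual thesaurus builder in &lt;code&gt;terraphim_automata&lt;/code&gt;:

```python
import re

def parse_kg_file(text):
    """Parse one knowledge-graph markdown file into {synonym: canonical} pairs.

    The first '# heading' is the canonical term; the 'synonyms::' line lists
    the patterns that should be rewritten to it.
    """
    heading = re.search(r"^# (.+)$", text, re.MULTILINE)
    synonyms = re.search(r"^synonyms::\s*(.+)$", text, re.MULTILINE)
    if not heading or not synonyms:
        return {}
    canonical = heading.group(1).strip()
    return {s.strip(): canonical for s in synonyms.group(1).split(",")}

doc = """# bun install

Fast package installation with Bun.

synonyms:: pnpm install, npm install, yarn install
"""
print(parse_kg_file(doc))
```

Merging the dictionaries from every file in the kg directory gives the full synonym-to-canonical map the automata are built from.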

&lt;h2&gt;
  
  
  Real-World Example: npm → bun
&lt;/h2&gt;

&lt;p&gt;Let's prove it works. Here's a live test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"npm install"&lt;/span&gt; | terraphim-agent replace
bun &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"yarn install lodash"&lt;/span&gt; | terraphim-agent replace
bun &lt;span class="nb"&gt;install &lt;/span&gt;lodash

&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"pnpm install --save-dev jest"&lt;/span&gt; | terraphim-agent replace
bun &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--save-dev&lt;/span&gt; jest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;LeftmostLongest matching ensures the longer, more specific &lt;code&gt;npm install&lt;/code&gt; pattern wins before standalone &lt;code&gt;npm&lt;/code&gt; can match.&lt;/p&gt;
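The priority rule can be sketched independently of the automaton. This is a naive illustrative scanner (longest pattern tried first at each position), not Terraphim's O(n) Aho-Corasick implementation:

```python
def leftmost_longest_replace(text, rules):
    """Replace synonyms with canonical terms; the longest pattern wins at each position.

    `rules` maps pattern -> canonical term. Terraphim does this with a single
    Aho-Corasick automaton in linear time; this scan is for exposition only.
    """
    patterns = sorted(rules, key=len, reverse=True)  # longest patterns first
    out, i = [], 0
    while i < len(text):
        for p in patterns:
            if text.startswith(p, i):   # first hit is the longest at this position
                out.append(rules[p])
                i += len(p)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

rules = {"npm install": "bun install", "npm": "bun"}
print(leftmost_longest_replace("npm install express && npm run build", rules))
# -> bun install express && bun run build
```

At position 0 both `npm install` and `npm` match; trying the longer pattern first is exactly the LeftmostLongest tie-break.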

&lt;h2&gt;
  
  
  Hook Integration Points
&lt;/h2&gt;

&lt;p&gt;Terraphim hooks integrate at multiple points in the development workflow:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Claude Code PreToolUse Hooks
&lt;/h3&gt;

&lt;p&gt;Intercept Bash commands before execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"PreToolUse"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"matcher"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Bash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hooks"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"terraphim-agent replace"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When Claude Code tries to run &lt;code&gt;npm install express&lt;/code&gt;, the hook transforms it to &lt;code&gt;bun install express&lt;/code&gt; before execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Git prepare-commit-msg Hooks
&lt;/h3&gt;

&lt;p&gt;Enforce attribution standards in commits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;COMMIT_MSG_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
&lt;span class="nv"&gt;ORIGINAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$COMMIT_MSG_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;TRANSFORMED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ORIGINAL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | terraphim-agent replace&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$TRANSFORMED&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$COMMIT_MSG_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With a knowledge graph entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Terraphim AI&lt;/span&gt;

Attribution for AI-assisted development.

synonyms:: Claude Code, Claude, Anthropic Claude
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every commit message mentioning "Claude Code" becomes "Terraphim AI".&lt;/p&gt;

&lt;h3&gt;
  
  
  3. MCP Tools
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;replace_matches&lt;/code&gt; MCP tool exposes the same functionality to any MCP-compatible client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"replace_matches"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Run npm install to setup"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The hook system is built on three crates:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Crate&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;terraphim_automata&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Aho-Corasick pattern matching, thesaurus building&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;terraphim_hooks&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ReplacementService, HookResult, binary discovery&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;terraphim_agent&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CLI with &lt;code&gt;replace&lt;/code&gt; subcommand&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pattern matching&lt;/strong&gt;: O(n) where n is input length (not pattern count)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Startup&lt;/strong&gt;: ~50ms to load knowledge graph and build automata&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory&lt;/strong&gt;: Automata are compact finite state machines&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Extending the Knowledge Graph
&lt;/h2&gt;

&lt;p&gt;Adding new patterns is simple. Create a markdown file in the mdBook source tree under &lt;code&gt;docs/src/kg/&lt;/code&gt; (published at &lt;a href="https://docs.terraphim.ai/src/kg/" rel="noopener noreferrer"&gt;https://docs.terraphim.ai/src/kg/&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# pytest&lt;/span&gt;

Python testing framework.

synonyms:: python -m unittest, unittest, nose
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system automatically rebuilds the automata on startup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern Priority
&lt;/h3&gt;

&lt;p&gt;The LeftmostLongest strategy means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;npm install&lt;/code&gt; matches before &lt;code&gt;npm&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;python -m pytest&lt;/code&gt; matches before &lt;code&gt;python&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Longer, more specific patterns always win&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Quick Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install all hooks&lt;/span&gt;
./scripts/install-terraphim-hooks.sh &lt;span class="nt"&gt;--easy-mode&lt;/span&gt;

&lt;span class="c"&gt;# Test the replacement&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"npm install"&lt;/span&gt; | ./target/release/terraphim-agent replace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Manual Setup
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Build the agent:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo build &lt;span class="nt"&gt;-p&lt;/span&gt; terraphim_agent &lt;span class="nt"&gt;--features&lt;/span&gt; repl-full &lt;span class="nt"&gt;--release&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Configure Claude Code hooks in &lt;code&gt;.claude/settings.local.json&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install Git hooks:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp &lt;/span&gt;scripts/hooks/prepare-commit-msg .git/hooks/
&lt;span class="nb"&gt;chmod&lt;/span&gt; +x .git/hooks/prepare-commit-msg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Use Cases
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Replacement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Package manager standardization&lt;/td&gt;
&lt;td&gt;npm, yarn, pnpm&lt;/td&gt;
&lt;td&gt;bun&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI attribution&lt;/td&gt;
&lt;td&gt;Claude Code, Claude&lt;/td&gt;
&lt;td&gt;Terraphim AI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Framework migration&lt;/td&gt;
&lt;td&gt;React.Component&lt;/td&gt;
&lt;td&gt;React functional components&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API versioning&lt;/td&gt;
&lt;td&gt;/api/v1&lt;/td&gt;
&lt;td&gt;/api/v2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deprecated function replacement&lt;/td&gt;
&lt;td&gt;moment()&lt;/td&gt;
&lt;td&gt;dayjs()&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Claude Code Skills Plugin
&lt;/h2&gt;

&lt;p&gt;For AI agents that support skills, we provide a dedicated plugin:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude plugin &lt;span class="nb"&gt;install &lt;/span&gt;terraphim-engineering-skills@terraphim-ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;terraphim-hooks&lt;/code&gt; skill teaches agents how to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the replace command correctly&lt;/li&gt;
&lt;li&gt;Extend the knowledge graph&lt;/li&gt;
&lt;li&gt;Debug hook issues&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Knowledge graph hooks provide a powerful, declarative way to enforce coding standards across AI agents. By defining patterns in simple markdown files, you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Standardize package managers across your team&lt;/li&gt;
&lt;li&gt;Ensure consistent attribution in commits&lt;/li&gt;
&lt;li&gt;Migrate deprecated patterns automatically&lt;/li&gt;
&lt;li&gt;Keep your knowledge graph version-controlled and human-readable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Aho-Corasick automata ensure efficient matching regardless of pattern count, making this approach scale to large knowledge graphs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;To wire knowledge-graph hooks into your own project, the &lt;a href="https://terraphim.ai/how-tos/command-rewriting-howto/" rel="noopener noreferrer"&gt;Command Rewriting How-to&lt;/a&gt; walks through the configuration end to end. To understand &lt;em&gt;why&lt;/em&gt; the matching is sub-millisecond and deterministic — and what that lets you promise to your users — read &lt;a href="https://terraphim.ai/posts/why-graph-embeddings-matter/" rel="noopener noreferrer"&gt;Why Graph Embeddings Matter&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/terraphim/terraphim-ai" rel="noopener noreferrer"&gt;Terraphim AI Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/terraphim/terraphim-claude-skills" rel="noopener noreferrer"&gt;Claude Code Skills Plugin&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.terraphim.ai/hooks/" rel="noopener noreferrer"&gt;Hook Installation Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.terraphim.ai/knowledge-graph/" rel="noopener noreferrer"&gt;Knowledge Graph Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>terraphim</category>
      <category>ai</category>
      <category>hooks</category>
      <category>knowledgegraph</category>
    </item>
    <item>
      <title>Why Graph Embeddings Matter</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Fri, 17 Apr 2026 10:49:56 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/why-graph-embeddings-matter-epp</link>
      <guid>https://dev.to/alexmikhalev/why-graph-embeddings-matter-epp</guid>
      <description>&lt;p&gt;Vector databases are probabilistic and slow. Graph embeddings are deterministic and sub-millisecond. If you are building context for an AI coding agent — or any system where you need to know &lt;em&gt;why&lt;/em&gt; a result came back — the difference is not academic. It changes what your application is allowed to promise.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pitch in One Paragraph
&lt;/h2&gt;

&lt;p&gt;Terraphim represents concepts as nodes in a knowledge graph and ranks them by how many synonyms and edges connect them. There is no embedding model, no GPU, no per-query distance computation in a 1024-dimensional space. There is an &lt;a href="https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm" rel="noopener noreferrer"&gt;Aho-Corasick&lt;/a&gt; automaton built once, queried in O(n + m + z) time, where n is the input length, m the total length of the patterns, and z the number of matches. The mechanism is described in detail on the &lt;a href="https://terraphim.ai/docs/graph-embeddings/" rel="noopener noreferrer"&gt;Graph Embeddings reference&lt;/a&gt; page; this post is about why it matters.&lt;/p&gt;
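To make the build-once/query-fast split concrete, here is a from-scratch sketch of the two phases: construct the automaton over the pattern set, then scan any input in one pass with no backtracking. This is illustrative Python, not Terraphim's engine (which is the Rust &lt;code&gt;terraphim_automata&lt;/code&gt; crate with leftmost-longest semantics):

```python
from collections import deque

class Automaton:
    """Toy Aho-Corasick: build once over the patterns, then match in O(n + z)."""

    def __init__(self, patterns):
        self.goto = [{}]   # trie transitions per state
        self.fail = [0]    # failure links (longest proper suffix still in the trie)
        self.out = [[]]    # patterns ending at each state
        for pat in patterns:              # build phase: O(m) total pattern length
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({}); self.fail.append(0); self.out.append([])
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].append(pat)
        q = deque(self.goto[0].values())  # depth-1 states already fail to root
        while q:                          # BFS wires the failure links
            r = q.popleft()
            for ch, s in self.goto[r].items():
                q.append(s)
                f = self.fail[r]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[s] = self.goto[f].get(ch, 0)
                self.out[s] += self.out[self.fail[s]]  # inherit suffix matches

    def find_all(self, text):
        state, hits = 0, []
        for i, ch in enumerate(text):     # query phase: one pass, no backtracking
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                hits.append((i - len(pat) + 1, pat))
        return hits

ac = Automaton(["npm install", "npm"])
print(ac.find_all("run npm install"))  # [(4, 'npm'), (4, 'npm install')]
```

The query loop never revisits a byte of input, which is why per-step cost stays in the nanosecond range once the automaton is resident in memory.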

&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;p&gt;Three numbers carry the argument. Each is reproducible on a laptop.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.4 million patterns matched in under one millisecond, with under 4 GB of RAM.&lt;/strong&gt; That is the working set behind a multi-role knowledge graph — operator, engineer, analyst — held resident in the same process that serves the query.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;5–10 nanoseconds per knowledge-graph inference step.&lt;/strong&gt; Not microseconds. Nanoseconds. Once the automaton is built, traversal is a tight loop over byte slices and graph edges, and modern CPUs are extremely good at that.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;20 milliseconds to rebuild the embeddings for a role from scratch.&lt;/strong&gt; Rename a synonym, add a new term, drop an obsolete one — the whole role's graph is reconstituted before your editor has rendered the next frame.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For comparison, a typical vector-DB nearest-neighbour query lands in the 5–50 ms range &lt;em&gt;after&lt;/em&gt; you have paid the embedding API call (50–500 ms) and the network round-trip. We are not in the same regime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Consequences
&lt;/h2&gt;

&lt;p&gt;The numbers are interesting on their own. The reason they matter is what they let you build.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Full Explainability
&lt;/h3&gt;

&lt;p&gt;Every match in Terraphim traces back to a specific edge in the knowledge graph and a specific synonym in a specific role. There is no "the model said so." When a search returns a document, you can show the user exactly which terms matched, which role's graph supplied the synonym, and which edges connected them. That is not a debugging nicety — it is a regulatory requirement in any domain where you have to defend a decision after the fact. Healthcare, legal, finance, government. Vector search by construction cannot do this.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. No Training, No Retraining, No Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;Adding a new concept is a text edit. You write the synonym down, you point Terraphim at the file, the graph rebuilds in 20 ms. There is no training run, no GPU bill, no "we need to schedule a retrain on the new corpus." This collapses the loop between &lt;em&gt;noticing a gap&lt;/em&gt; and &lt;em&gt;fixing the gap&lt;/em&gt; from days or weeks to seconds. For an AI coding agent that needs to learn a project's vocabulary as you onboard, this is the difference between a working tool and a stalled rollout.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Language-Agnostic Without Language Detection
&lt;/h3&gt;

&lt;p&gt;Because matching is done on normalised terms — synonyms you supply explicitly — the same node in the graph can carry English, French, Russian, and Mandarin labels at no extra cost. There is no language-detection step, no per-language embedding model, no separate index. The query "consensus" and the query "консенсус" both reach the same node if you have told the graph they are synonyms. Stop-word lists become irrelevant: if a word is not in the graph, it does not match, full stop.&lt;/p&gt;
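As a sketch of that idea (the node id and labels here are invented for illustration, not taken from any real Terraphim role):

```python
# One graph node, labels in several languages: matching is explicit lookup of
# supplied synonyms -- no language detection, no per-language embedding model.
thesaurus = {
    "consensus": "consensus",            # canonical node id (illustrative)
    "консенсус": "consensus",            # Russian label -> same node
    "consensus distribué": "consensus",  # French phrase -> same node
}

def node_for(term):
    # Absent from the graph means no match: stop-word lists are unnecessary.
    return thesaurus.get(term.lower())

print(node_for("Консенсус"))   # -> consensus
print(node_for("blockchain"))  # -> None
```

Adding a new language is one more key per node, and the 20 ms rebuild makes that edit effectively free.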

&lt;h2&gt;
  
  
  What This Lets You Do
&lt;/h2&gt;

&lt;p&gt;The pieces above are infrastructure. The story arc continues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Build hooks that transform AI coding agent output deterministically.&lt;/strong&gt; When Claude Code suggests &lt;code&gt;npm install&lt;/code&gt;, intercept it via a graph-embeddings match and replace it with &lt;code&gt;bun install&lt;/code&gt;. We wrote this up at &lt;a href="https://terraphim.ai/posts/teaching-ai-agents-with-knowledge-graphs/" rel="noopener noreferrer"&gt;Teaching AI Coding Agents with Knowledge Graph Hooks&lt;/a&gt; — that post is the &lt;em&gt;demo&lt;/em&gt; of what this engine enables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture and reuse mistakes.&lt;/strong&gt; When an agent gets corrected, store the correction as a new synonym and the next session never repeats it. See &lt;a href="https://terraphim.ai/posts/teaching-ai-agents-to-learn-from-mistakes/" rel="noopener noreferrer"&gt;Teaching AI Agents to Learn from Their Mistakes&lt;/a&gt; and &lt;a href="https://terraphim.ai/posts/learning-via-negativa/" rel="noopener noreferrer"&gt;Learning via Negativa&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the whole thing in a 4 GB process on your laptop with no network calls.&lt;/strong&gt; The compactness is not an accident — it is the engineering brief from the &lt;a href="https://terraphim.ai/capabilities/origin-story/" rel="noopener noreferrer"&gt;Origin Story&lt;/a&gt;, which explains where the design came from and why it has stayed this small.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to Apply
&lt;/h2&gt;

&lt;p&gt;If you want to wire this into your own project, the &lt;a href="https://terraphim.ai/how-tos/command-rewriting-howto/" rel="noopener noreferrer"&gt;Command Rewriting How-to&lt;/a&gt; walks through the moving parts: where to put your synonyms, how the role graph is built, how hooks call the matcher.&lt;/p&gt;

&lt;p&gt;The mechanism — automata, ranking formula, ASCII walk-through — is on the &lt;a href="https://terraphim.ai/docs/graph-embeddings/" rel="noopener noreferrer"&gt;Graph Embeddings reference&lt;/a&gt; page. Read that next if you want the data structures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Bother Saying This Out Loud
&lt;/h2&gt;

&lt;p&gt;The current default in the AI tooling ecosystem is to reach for a vector database the moment anyone mentions "semantic search." It is the path of least resistance because the tools are well-marketed and the API surface is familiar. But for a large class of problems — explainability-first systems, on-device agents, anywhere you need a hard latency budget or a hard explainability guarantee — graph embeddings are the better-engineered answer. Not the only answer; the better one for that class.&lt;/p&gt;

&lt;p&gt;The promotion campaign over the next few weeks goes deeper: a &lt;a href="https://terraphim.ai/posts/sub-millisecond-context-knowledge-graphs/" rel="noopener noreferrer"&gt;sub-millisecond context article&lt;/a&gt; walks through the FST/Aho-Corasick implementation, and the &lt;em&gt;Context Engineering with Knowledge Graphs&lt;/em&gt; book (launching in May) puts it in the wider context of moving from RAG to context graphs.&lt;/p&gt;

&lt;p&gt;Until then: read the reference, try the how-to, and let us know in &lt;a href="https://terraphim.discourse.group" rel="noopener noreferrer"&gt;Discourse&lt;/a&gt; what you build with it.&lt;/p&gt;

</description>
      <category>terraphim</category>
      <category>graphembeddings</category>
      <category>knowledgegraph</category>
      <category>ahocorasick</category>
    </item>
    <item>
      <title>Disciplined Engineering: How We Build AI Systems That Actually Work</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Thu, 16 Apr 2026 15:00:55 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/disciplined-engineering-ai-systems-19di</link>
      <guid>https://dev.to/alexmikhalev/disciplined-engineering-ai-systems-19di</guid>
      <description>&lt;p&gt;AI coding agents are making us worse engineers, unless we add discipline back. Here is what we do instead of vibe coding, and how you can do it too in 30 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Vibe Coding Problem
&lt;/h2&gt;

&lt;p&gt;Every AI-generated pull request we review has the same pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scope creep&lt;/strong&gt; beyond the original task. You ask for a bug fix, you get a refactored module.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No traceability&lt;/strong&gt; from requirements to tests. The agent shipped code, but nobody verified it does what was actually asked.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge lost between sessions.&lt;/strong&gt; Each conversation starts from scratch. Yesterday's design decisions evaporate overnight.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent shipped code. It even passed the tests. But the tests were written by the same agent that wrote the code, optimising for the metric rather than understanding the problem.&lt;/p&gt;

&lt;p&gt;The missing piece is not better models. It is engineering discipline. AI agents need the same rigour humans use: understand the problem before coding, verify against the design, validate against requirements. We encoded this as executable skills that any AI coding agent can follow.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The research evidence behind this framework, including language-specific scaling laws and the 30% adoption gap between code intelligence research and production harnesses, is the subject of our next article.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The V-Model: Adding Discipline Back
&lt;/h2&gt;

&lt;p&gt;We built a V-model for AI agents. The left side asks "what should we build?" The right side asks "did we build it correctly?"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvq4u9pmf9ojwl8mad4xf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvq4u9pmf9ojwl8mad4xf.png" alt="The V-Model for AI Agents: Research, Design, Specification, Implementation, Verification, Validation with quality gates at each transition" width="800" height="567"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Phase 1: Research
&lt;/h3&gt;

&lt;p&gt;Before writing any code, the agent must understand the problem space:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Search existing knowledge graphs for relevant patterns&lt;/li&gt;
&lt;li&gt;Identify language-specific constraints&lt;/li&gt;
&lt;li&gt;Determine optimal context window for target language&lt;/li&gt;
&lt;li&gt;Find similar implementations in the codebase&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 2: Design
&lt;/h3&gt;

&lt;p&gt;Create a specification before implementation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define interfaces and data structures&lt;/li&gt;
&lt;li&gt;Identify cross-language considerations&lt;/li&gt;
&lt;li&gt;Plan for compiler feedback integration&lt;/li&gt;
&lt;li&gt;Document the "why" not just the "what"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 3: Implementation
&lt;/h3&gt;

&lt;p&gt;Write code with tests from the start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Language-appropriate context management&lt;/li&gt;
&lt;li&gt;Compiler feedback integration points&lt;/li&gt;
&lt;li&gt;Type-safe by default (for typed languages)&lt;/li&gt;
&lt;li&gt;Self-documenting through clear structure&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 4: Verification
&lt;/h3&gt;

&lt;p&gt;Verify against the design, not just tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Type-checking passes&lt;/li&gt;
&lt;li&gt;Linting passes&lt;/li&gt;
&lt;li&gt;Compiler warnings addressed&lt;/li&gt;
&lt;li&gt;Design intent preserved&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Phase 5: Validation
&lt;/h3&gt;

&lt;p&gt;Validate against the original requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Does it solve the actual problem?&lt;/li&gt;
&lt;li&gt;Are there simpler alternatives?&lt;/li&gt;
&lt;li&gt;What's the maintenance burden?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  terraphim-skills: 32+ Executable Disciplines
&lt;/h2&gt;

&lt;p&gt;We packaged the V-model as executable skills you can add to any AI agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add terraphim/terraphim-skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs skills that enforce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;disciplined-research&lt;/strong&gt;: Understand before building&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;disciplined-design&lt;/strong&gt;: Plan before coding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;disciplined-implementation&lt;/strong&gt;: Build with tests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;disciplined-verification&lt;/strong&gt;: Verify against design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;disciplined-validation&lt;/strong&gt;: Validate against requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each skill is a self-contained prompt that guides the agent through the phase's inputs, outputs, and quality gates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quality Gates: The Judge System
&lt;/h2&gt;

&lt;p&gt;Our judge system (Kimi K2.5) catches what humans miss:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;90% verdict agreement with human reviewers&lt;/li&gt;
&lt;li&gt;62.5% NO-GO detection rate on genuinely flawed code&lt;/li&gt;
&lt;li&gt;~9s average review latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Automated quality gates, zero manual overhead. Every PR reviewed before it merges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Guard Rails in Practice
&lt;/h2&gt;

&lt;p&gt;AI agents do not type commands into a terminal. They invoke tools programmatically, and they do not always get it right. "Cleaning up build artefacts" becomes &lt;code&gt;rm -rf ./src&lt;/code&gt; (one mistyped character in the path). "Resetting to last commit" becomes &lt;code&gt;git reset --hard&lt;/code&gt; (uncommitted work gone). You need a safety net that operates between the agent and your shell.&lt;/p&gt;

&lt;p&gt;We use two layers of guard rails:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: git-safety-guard&lt;/strong&gt; (a terraphim-skill that runs as a PreToolUse hook):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blocks &lt;code&gt;git reset --hard&lt;/code&gt;, &lt;code&gt;git push --force&lt;/code&gt;, &lt;code&gt;rm -rf&lt;/code&gt; and similar destructive commands before they execute&lt;/li&gt;
&lt;li&gt;Checks for secrets in diffs before commits&lt;/li&gt;
&lt;li&gt;Validates commit message format&lt;/li&gt;
&lt;li&gt;Zero configuration: install the skill, protection is immediate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: &lt;a href="https://github.com/Dicklesworthstone/destructive_command_guard" rel="noopener noreferrer"&gt;Destructive Command Guard (DCG)&lt;/a&gt;&lt;/strong&gt; by Jeff Emanuel, integrated via tool hooks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Rust binary using SIMD-accelerated pattern matching&lt;/li&gt;
&lt;li&gt;Intercepts every shell command the agent attempts to run&lt;/li&gt;
&lt;li&gt;Returns allow/block verdicts in under 1ms&lt;/li&gt;
&lt;li&gt;Works with Claude Code, OpenCode, and any agent that exposes a pre-execution hook&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture is simple: the agent calls a bash tool, the hook pipes the command to DCG as JSON, DCG pattern-matches against known destructive commands, and blocks execution before damage occurs. The agent receives an error explaining why, and can adjust its approach.&lt;/p&gt;
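
&lt;p&gt;As an illustrative sketch only (not the real DCG interface: the function name and tiny pattern list here are hypothetical, whereas the real binary is SIMD-accelerated Rust with a far larger ruleset), the verdict step reduces to matching the proposed command against known destructive patterns:&lt;/p&gt;

```shell
#!/bin/bash
# Toy allow/block verdict in the spirit of DCG. Hypothetical names and
# patterns; the real guard covers many more destructive command shapes.

guard_verdict() {
    local cmd="$1"
    case "$cmd" in
        *"rm -rf"* | *"git reset --hard"* | *"git push --force"*)
            echo "block" ;;   # destructive: refuse before execution
        *)
            echo "allow" ;;   # everything else passes through
    esac
}

guard_verdict "rm -rf ./src"    # prints "block"
guard_verdict "git status"      # prints "allow"
```

&lt;p&gt;A hook wired to a function like this would exit non-zero on a "block" verdict, which is what surfaces the explanatory error back to the agent.&lt;/p&gt;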

&lt;h2&gt;
  
  
  Real Example: AI Dark Factory
&lt;/h2&gt;

&lt;p&gt;We run 12+ AI agents overnight on a single machine, coordinated by a Rust orchestrator. Each agent follows the V-model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Safety agents&lt;/strong&gt; run continuously with automatic restart and cooldown. They handle monitoring, log analysis, and drift detection. If one crashes, the orchestrator waits 15 minutes before restarting (up to 3 times) to prevent crash loops.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core agents&lt;/strong&gt; are scheduled via cron. They pick the highest-priority unblocked issue from the Gitea board (ranked by PageRank across the dependency graph), claim it, branch, implement with tests, and open a pull request.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Growth agents&lt;/strong&gt; run on demand for research, code review, and content generation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every agent's output passes through the judge system before merge. The morning routine is reviewing verdicts, not debugging overnight chaos. When the judge returns a NO-GO verdict, the PR is flagged with the specific issues: missing test coverage, undocumented API changes, or security concerns.&lt;/p&gt;

&lt;p&gt;This is disciplined engineering at scale: not process overhead, but automated quality gates that catch problems before they compound.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The gap between what AI agents can do and what they should do is real. It is not a technology gap: it is a discipline gap. The V-model and 32+ executable skills we built are available today:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx skills add terraphim/terraphim-skills
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add discipline back. Your future self will thank you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Deeper dive: The V-model and quality gates we use are detailed in Chapters 3-4 of "Context Engineering with Knowledge Graphs". Coming soon.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related posts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://terraphim.ai/posts/teaching-ai-agents-with-knowledge-graphs/" rel="noopener noreferrer"&gt;Teaching AI Coding Agents with Knowledge Graph Hooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://terraphim.ai/posts/teaching-ai-agents-to-learn-from-mistakes/" rel="noopener noreferrer"&gt;Teaching AI Agents to Learn from Their Mistakes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>engineering</category>
      <category>devtools</category>
      <category>coding</category>
    </item>
    <item>
      <title>From Learning Capture to Self-Evolving Rules: Adding Verification Sweeps to terraphim-agent</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Mon, 30 Mar 2026 16:21:44 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/from-learning-capture-to-self-evolving-rules-adding-verification-sweeps-to-terraphim-agent-4l43</link>
      <guid>https://dev.to/alexmikhalev/from-learning-capture-to-self-evolving-rules-adding-verification-sweeps-to-terraphim-agent-4l43</guid>
      <description>&lt;h1&gt;
  
  
  From Learning Capture to Self-Evolving Rules: Adding Verification Sweeps to terraphim-agent
&lt;/h1&gt;

&lt;p&gt;A self-evolving AI coding agent sounds like science fiction. It is not. It is a shell script, a markdown file with grep patterns, and a weekly review discipline.&lt;/p&gt;

&lt;p&gt;We have been running &lt;a href="https://github.com/terraphim/terraphim-ai" rel="noopener noreferrer"&gt;terraphim-agent&lt;/a&gt; in production for months. It captures every failed bash command from Claude Code and OpenCode, stores them in a persistent learning database, and lets agents query past mistakes before repeating them. The capture loop works. The query system works. The correction mechanism works.&lt;/p&gt;

&lt;p&gt;What was missing was &lt;strong&gt;verification&lt;/strong&gt;. We could capture mistakes and add corrections, but we had no way to prove the corrections were being followed. No machine-checkable enforcement. No audit trail. No quantitative measure of whether the system was actually improving.&lt;/p&gt;

&lt;p&gt;Then &lt;a href="https://x.com/meta_alchemist/status/2038222105654022325" rel="noopener noreferrer"&gt;Meta Alchemist published a viral guide&lt;/a&gt; on transforming Claude Code into a self-evolving system, and two ideas jumped out: &lt;strong&gt;verification patterns on every rule&lt;/strong&gt; and &lt;strong&gt;session scorecards&lt;/strong&gt;. We already had the foundation. The article showed us what to build on top.&lt;/p&gt;

&lt;p&gt;This post covers what we added, what we deliberately did not copy, and why the combination of a Rust CLI with a thin shell verification layer is more robust than an all-in-JSONL approach.&lt;/p&gt;

&lt;p&gt;If you have not read the &lt;a href="https://zestic.ai/blog/terraphim-agent-learning-hooks" rel="noopener noreferrer"&gt;foundation post on configuring terraphim-agent for Claude Code and OpenCode&lt;/a&gt;, start there. This post assumes you have the capture system running.&lt;/p&gt;




&lt;h2&gt;
  
  
  What we already had
&lt;/h2&gt;

&lt;p&gt;Before reading the Meta Alchemist article, our learning infrastructure had three layers:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: Learning capture (PostToolUse hook)
&lt;/h3&gt;

&lt;p&gt;Every failed bash command in Claude Code triggers our &lt;code&gt;post_tool_use.sh&lt;/code&gt; hook. The hook extracts the command, exit code, and error output, then pipes them to &lt;code&gt;terraphim-agent learn hook --format claude&lt;/code&gt;. The learning is stored as a structured file in &lt;code&gt;~/.local/share/terraphim/learnings/&lt;/code&gt; (global) or &lt;code&gt;.terraphim/learnings/&lt;/code&gt; (project-scoped).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# What the hook does on every failed command:&lt;/span&gt;
terraphim-agent learn capture &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$COMMAND&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--error&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ERROR_OUTPUT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--exit-code&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$EXIT_CODE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The design is fail-open: if terraphim-agent is missing or crashes, the hook passes through silently. An observability tool must never break the tool it observes.&lt;/p&gt;
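
&lt;p&gt;A minimal sketch of that fail-open shape (plain bash; &lt;code&gt;terraphim-agent&lt;/code&gt; stands in for any capture binary, and the flags mirror the command above):&lt;/p&gt;

```shell
#!/bin/bash
# Minimal fail-open wrapper: if the capture binary is missing or errors,
# the hook still exits 0 so the observed command proceeds untouched.

capture_learning() {
    # Missing binary: pass through silently rather than failing the hook.
    if ! command -v terraphim-agent >/dev/null; then
        return 0
    fi
    # Swallow capture errors too; observability must never break the tool.
    terraphim-agent learn capture "$1" --error "$2" --exit-code "$3" || true
}

capture_learning "docker compse up" "command not found" 127
echo "hook exit: $?"    # prints "hook exit: 0" even with no terraphim-agent installed
```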

&lt;h3&gt;
  
  
  Layer 2: Safety guard (PreToolUse hook)
&lt;/h3&gt;

&lt;p&gt;Before any bash command executes, our &lt;code&gt;pre_tool_use.sh&lt;/code&gt; hook runs two checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;terraphim-agent guard --json&lt;/code&gt; blocks destructive commands (rm -rf, git push --force, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;terraphim-agent replace --role "Terraphim Engineer"&lt;/code&gt; performs knowledge graph text replacement (npm -&amp;gt; bun, pip -&amp;gt; uv, etc.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The guard blocks. The replacement corrects. Neither depends on the LLM remembering instructions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: Query and correct
&lt;/h3&gt;

&lt;p&gt;Humans and agents query the learning database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List recent learnings&lt;/span&gt;
terraphim-agent learn list

&lt;span class="c"&gt;# Search by pattern (with synonym expansion via thesaurus)&lt;/span&gt;
terraphim-agent learn query &lt;span class="s2"&gt;"docker"&lt;/span&gt;

&lt;span class="c"&gt;# Add a correction to a captured learning&lt;/span&gt;
terraphim-agent learn correct 3 &lt;span class="nt"&gt;--correction&lt;/span&gt; &lt;span class="s2"&gt;"Use 'docker compose' (v2 plugin)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The thesaurus has 20 semantic categories and 160+ synonym mappings. Search for "error" and you find "failure", "bug", "issue". Search for "setup" and you find "configuration", "install", "init". This is not keyword matching. It is structured retrieval.&lt;/p&gt;

&lt;h3&gt;
  
  
  What was missing
&lt;/h3&gt;

&lt;p&gt;The capture-query-correct loop is a journal. It records mistakes and lets you look them up. What it does not do:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enforce rules mechanically.&lt;/strong&gt; A rule saying "never use pip" exists only in CLAUDE.md text that the LLM might or might not follow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify compliance at session start.&lt;/strong&gt; No sweep checks whether graduated rules are actually being obeyed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track improvement quantitatively.&lt;/strong&gt; No session scorecards. No trend data. No way to prove the system is getting better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provide an audit trail for rule changes.&lt;/strong&gt; Rules appear and disappear without record.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These gaps are exactly what the Meta Alchemist article addressed.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Meta Alchemist proposed
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://x.com/meta_alchemist/status/2038222105654022325" rel="noopener noreferrer"&gt;full guide&lt;/a&gt; describes a four-layer system:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cognitive core&lt;/strong&gt; (CLAUDE.md): A decision framework Claude runs before writing code, plus completion criteria that must pass before any task is done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialised agents&lt;/strong&gt;: An architect (plans, read-only) and a reviewer (validates, read-only) that spawn as subagents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Path-scoped rules&lt;/strong&gt;: Security rules that only load when editing auth code. API design rules that only activate in handler directories. Keeps context lean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evolution engine&lt;/strong&gt;: A memory system that captures corrections in JSONL files, runs verification sweeps at session start, generates session scorecards, and promotes patterns through a confidence ladder.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The genuinely good ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Verify lines on every rule.&lt;/strong&gt; Each learned rule gets a machine-checkable grep pattern. &lt;code&gt;verify: Grep("\.\.\.options", path="src/api/") -&amp;gt; 0 matches&lt;/code&gt;. The sweep runs the grep and reports PASS/FAIL. This is brilliant. It turns instructions into guardrails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session scorecards.&lt;/strong&gt; Quantitative tracking of corrections received, rules checked, rules passed, violations found. Trend detection over time. If corrections are flat or increasing, the rules are not working.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Promotion ladder.&lt;/strong&gt; Corrected once = logged. Corrected twice = auto-promoted to permanent rule. In learned-rules for 10+ sessions = candidate for graduation to CLAUDE.md.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capacity management.&lt;/strong&gt; Max 50 lines in learned-rules.md forces graduation or pruning. Prevents unbounded growth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key quote: "A rule without a verification check is a wish. A rule with a verification check is a guardrail. Only guardrails survive."&lt;/p&gt;

&lt;p&gt;We agree with the principle. We disagree with the implementation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where our approaches diverge
&lt;/h2&gt;

&lt;p&gt;The Meta Alchemist article builds everything from scratch using JSONL files parsed by the LLM:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;corrections.jsonl&lt;/code&gt; -- user corrections as JSON objects&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;observations.jsonl&lt;/code&gt; -- verified discoveries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;violations.jsonl&lt;/code&gt; -- rule violations caught by sweep&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sessions.jsonl&lt;/code&gt; -- session scorecards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a reasonable approach if you have no existing infrastructure. We do. terraphim-agent already provides structured file storage with frontmatter, synonym-expanded querying, project/global scoping, and correction chaining. Adding parallel JSONL files would create a split-brain problem: two sources of truth for the same data.&lt;/p&gt;

&lt;p&gt;The article also proposes auto-promotion: when the same correction appears twice, it automatically becomes a permanent rule. This is risky. A correction might be context-dependent (correct for one project, wrong for another). It might be a preference rather than a constraint. Auto-promotion without a quality gate means the system accumulates rules without human judgement about which ones deserve to be permanent.&lt;/p&gt;

&lt;p&gt;Our approach: &lt;strong&gt;capture in terraphim-agent, verify with shell scripts, promote with CTO approval.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The verification layer we added
&lt;/h2&gt;

&lt;p&gt;Three new components, all configuration. No Rust code changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  learned-rules.md: graduated rules with verify patterns
&lt;/h3&gt;

&lt;p&gt;The file lives at &lt;code&gt;.claude/memory/learned-rules.md&lt;/code&gt;. Each rule has three parts: the constraint text, a machine-checkable verify pattern, and a source annotation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Learned Rules&lt;/span&gt;

Rules graduated from terraphim-agent corrections and CLAUDE.md conventions.
Each rule has a &lt;span class="sb"&gt;`verify:`&lt;/span&gt; pattern checked by the /boot verification sweep.
&lt;span class="p"&gt;
---

-&lt;/span&gt; Never use pip, pip3, or pipx; always use uv instead.
  verify: Grep("pip install|pip3 install|pipx install", path="automation/") -&amp;gt; 0 matches
  [source: CLAUDE.md convention, terraphim-agent learning #4, 2026-03-30]
&lt;span class="p"&gt;
-&lt;/span&gt; Never use npm, yarn, or pnpm; always use bun instead.
  verify: Grep("npm install|yarn add|pnpm add", path="automation/") -&amp;gt; 0 matches
  [source: CLAUDE.md convention, terraphim KG hook replacement, 2026-03-30]
&lt;span class="p"&gt;
-&lt;/span&gt; Never use double dashes in document titles or markdown headings.
  verify: Grep("^#.&lt;span class="err"&gt;*&lt;/span&gt;--", path="knowledge/") -&amp;gt; 0 matches
  [source: corrected 2x, terraphim-agent learning #5, 2026-03-30]
&lt;span class="p"&gt;
-&lt;/span&gt; Never hardcode API keys as default values in bash scripts.
  verify: Grep("API_KEY=.[a-zA-Z0-9]", path="automation/") -&amp;gt; 0 matches
  [source: MEMORY.md security lesson, 2026-03-30]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The format is deliberately simple. No JSON. No YAML frontmatter. Just markdown that a human can read and a shell script can parse. The verify line follows a consistent pattern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;verify: Grep&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"regex_pattern"&lt;/span&gt;, &lt;span class="nv"&gt;path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"scope/"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; -&amp;gt; N matches
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;-&amp;gt; 0 matches&lt;/code&gt; means the pattern should NOT appear (absence check) and &lt;code&gt;-&amp;gt; 1+ matches&lt;/code&gt; means the pattern MUST appear (presence check).&lt;/p&gt;

&lt;p&gt;Rules without a verify line are flagged as technical debt during evolution review.&lt;/p&gt;

&lt;h3&gt;
  
  
  verify-sweep.sh: the verification engine
&lt;/h3&gt;

&lt;p&gt;The core script parses &lt;code&gt;learned-rules.md&lt;/code&gt;, extracts each verify line, runs the check, and reports PASS/FAIL. It uses &lt;code&gt;rg&lt;/code&gt; (ripgrep) when available for native output limiting (no SIGPIPE issues from pipe chains).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# Verification Sweep: parse learned-rules.md, run verify: checks, report PASS/FAIL&lt;/span&gt;
&lt;span class="c"&gt;# Always exits 0 (advisory tool, never blocks).&lt;/span&gt;

&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-uo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;RULES_FILE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;1&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="p"&gt;.claude/memory/learned-rules.md&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;PROJECT_ROOT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git rev-parse &lt;span class="nt"&gt;--show-toplevel&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;pwd&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nv"&gt;RG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;which rg 2&amp;gt;/dev/null &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nv"&gt;TOTAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;PASSED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;FAILED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nv"&gt;MANUAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="nv"&gt;current_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;

&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; line&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="c"&gt;# Capture rule text (lines starting with "- ")&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | rg &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'^\s*- .+'&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nv"&gt;current_rule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/^\s*- //'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;fi&lt;/span&gt;

    &lt;span class="c"&gt;# Process verify: lines&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | rg &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s1"&gt;'^\s*verify:'&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nv"&gt;TOTAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;TOTAL &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;

        &lt;span class="c"&gt;# Skip manual checks&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | rg &lt;span class="nt"&gt;-qi&lt;/span&gt; &lt;span class="s1"&gt;'manual'&lt;/span&gt; 2&amp;gt;/dev/null&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
            &lt;/span&gt;&lt;span class="nv"&gt;MANUAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;MANUAL &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
            &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"SKIP: &lt;/span&gt;&lt;span class="nv"&gt;$current_rule&lt;/span&gt;&lt;span class="s2"&gt; (manual check)"&lt;/span&gt;
            &lt;span class="k"&gt;continue
        fi&lt;/span&gt;

        &lt;span class="c"&gt;# Extract pattern, path, and expected count&lt;/span&gt;
        &lt;span class="nv"&gt;pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'s/.*Grep("\([^"]*\)".*/\1/p'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
        &lt;span class="nv"&gt;path_scope&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'s/.*path="\([^"]*\)".*/\1/p'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
        &lt;span class="nv"&gt;expected&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$line&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s1"&gt;'s/.*-&amp;gt; \([0-9]*\).*/\1/p'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

        &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pattern&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;MANUAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;MANUAL &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"SKIP: &lt;/span&gt;&lt;span class="nv"&gt;$current_rule&lt;/span&gt;&lt;span class="s2"&gt; (unparseable)"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="o"&gt;}&lt;/span&gt;

        &lt;span class="nv"&gt;search_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;path_scope&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
        &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$search_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ROOT&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$search_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nv"&gt;search_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PROJECT_ROOT&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$search_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

        &lt;span class="c"&gt;# Count matches&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
            &lt;/span&gt;&lt;span class="nv"&gt;match_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pattern&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$search_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="se"&gt;\&lt;/span&gt;
                | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt;: &lt;span class="s1"&gt;'{s+=$NF} END {print s+0}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true
        &lt;/span&gt;&lt;span class="k"&gt;else
            &lt;/span&gt;&lt;span class="nv"&gt;match_count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-rEc&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pattern&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$search_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="se"&gt;\&lt;/span&gt;
                | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt;: &lt;span class="s1"&gt;'{s+=$NF} END {print s+0}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true
        &lt;/span&gt;&lt;span class="k"&gt;fi&lt;/span&gt;

        &lt;span class="c"&gt;# Compare against expectation&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$expected&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
            if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$match_count&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-eq&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
                &lt;/span&gt;&lt;span class="nv"&gt;PASSED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;PASSED &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"PASS: &lt;/span&gt;&lt;span class="nv"&gt;$current_rule&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;else
                &lt;/span&gt;&lt;span class="nv"&gt;FAILED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;FAILED &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
                &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FAIL: &lt;/span&gt;&lt;span class="nv"&gt;$current_rule&lt;/span&gt;&lt;span class="s2"&gt; (found &lt;/span&gt;&lt;span class="nv"&gt;$match_count&lt;/span&gt;&lt;span class="s2"&gt; matches, expected 0)"&lt;/span&gt;
                &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RG&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="nt"&gt;--max-count&lt;/span&gt; 1 &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$pattern&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$search_path&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="se"&gt;\&lt;/span&gt;
                    | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-3&lt;/span&gt; | &lt;span class="nb"&gt;sed&lt;/span&gt; &lt;span class="s1"&gt;'s/^/  &amp;gt;&amp;gt; /'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true
            &lt;/span&gt;&lt;span class="k"&gt;fi
        else
            if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$match_count&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-gt&lt;/span&gt; 0 &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
                &lt;/span&gt;&lt;span class="nv"&gt;PASSED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;PASSED &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"PASS: &lt;/span&gt;&lt;span class="nv"&gt;$current_rule&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
            &lt;span class="k"&gt;else
                &lt;/span&gt;&lt;span class="nv"&gt;FAILED&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;$((&lt;/span&gt;FAILED &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="k"&gt;))&lt;/span&gt;
                &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"FAIL: &lt;/span&gt;&lt;span class="nv"&gt;$current_rule&lt;/span&gt;&lt;span class="s2"&gt; (found 0 matches, expected 1+)"&lt;/span&gt;
            &lt;span class="k"&gt;fi
        fi
    fi
done&lt;/span&gt; &amp;lt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$RULES_FILE&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"--- Verification Summary ---"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Rules checked: &lt;/span&gt;&lt;span class="nv"&gt;$TOTAL&lt;/span&gt;&lt;span class="s2"&gt; | Passed: &lt;/span&gt;&lt;span class="nv"&gt;$PASSED&lt;/span&gt;&lt;span class="s2"&gt; | Failed: &lt;/span&gt;&lt;span class="nv"&gt;$FAILED&lt;/span&gt;&lt;span class="s2"&gt; | Skipped: &lt;/span&gt;&lt;span class="nv"&gt;$MANUAL&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Real output from our production environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PASS: Never use pip, pip3, or pipx; always use uv instead.
PASS: Never use npm, yarn, or pnpm; always use bun instead.
FAIL: Never use double dashes in document titles or markdown headings. (found 252 matches, expected 0)
  &amp;gt;&amp;gt; knowledge/mem-layer-graph-memory.md:1:# Mem-Layer -- Graph-Based AI Memory System
  &amp;gt;&amp;gt; knowledge/claude-1m-context-ga.md:7:# Claude 1M Context Window -- General Availability
  &amp;gt;&amp;gt; knowledge/ukri-funding-opportunities-2026.md:7:# UKRI Funding Opportunities -- 2026 Active/Recent
FAIL: Use British English spelling in all generated content. (found 316 matches, expected 0)
  &amp;gt;&amp;gt; knowledge/topics/context-engineering.md:41:- locality-of-behavior-dev-community.md
  &amp;gt;&amp;gt; knowledge/topics/conway-vs-strongdm-identity.md:49:- Govern AI agent behavior at runtime
SKIP: Always run date before date-sensitive operations. (manual check)
PASS: Never hardcode API keys as default values in bash scripts.
PASS: Never use git commit --amend in pre-push hooks.

--- Verification Summary ---
Rules checked: 7 | Passed: 4 | Failed: 2 | Skipped: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two failures are from imported external articles that use American English spelling and double dashes. These are expected: imported content is not generated content. This kind of nuance is exactly why we do not auto-fix violations. The sweep surfaces them; the human decides what to do.&lt;/p&gt;

&lt;h3&gt;
  
  
  /boot skill: session-start verification
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;/boot&lt;/code&gt; skill wraps the verification sweep into a session-start ritual:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Run &lt;code&gt;date&lt;/code&gt; to establish the actual current date (never trust stale context)&lt;/li&gt;
&lt;li&gt;Read &lt;code&gt;learned-rules.md&lt;/code&gt; to load all graduated rules&lt;/li&gt;
&lt;li&gt;Execute &lt;code&gt;verify-sweep.sh&lt;/code&gt; to check compliance&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;terraphim-agent learn list&lt;/code&gt; to surface recent learnings&lt;/li&gt;
&lt;li&gt;Report a one-line summary
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Boot complete: 7 rules checked, 4 passed, 2 failed. 10 recent learnings loaded.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Meta Alchemist article proposes a &lt;code&gt;SessionStart&lt;/code&gt; hook to trigger this automatically, but Claude Code's documented hook system has no &lt;code&gt;SessionStart&lt;/code&gt; type, so that hook would never fire. We invoke &lt;code&gt;/boot&lt;/code&gt; manually at the start of each session instead. A manual invocation that runs reliably beats an automatic hook that does not exist.&lt;/p&gt;
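&lt;p&gt;As a rough illustration, the five-step ritual can be sketched as a plain shell script (a hedged sketch: the paths follow this post, and each step degrades gracefully when a component is missing, so it is safe to run anywhere):&lt;/p&gt;

```shell
#!/bin/sh
# Hedged sketch of the /boot ritual as a standalone script.
# Paths follow this post; each step is skipped cleanly if its input is missing.
RULES_FILE=".claude/memory/learned-rules.md"
SWEEP="automation/learning/verify-sweep.sh"

today=$(date +%Y-%m-%d)                      # 1. establish the real current date
echo "Today: $today"

if [ -f "$RULES_FILE" ]; then                # 2. load graduated rules
    rule_count=$(wc -l "$RULES_FILE" | awk '{print $1}')
else
    rule_count=0
fi
echo "Rules file lines: $rule_count"

if [ -f "$SWEEP" ]; then                     # 3. run the verification sweep
    bash "$SWEEP"
else
    echo "verify-sweep.sh not found; skipping sweep"
fi

if agent=$(command -v terraphim-agent); then # 4. surface recent learnings
    "$agent" learn list
else
    echo "terraphim-agent not installed; skipping learnings"
fi

echo "Boot check finished for $today"        # 5. one-line summary
```

&lt;p&gt;In practice the skill runs these steps through Claude Code rather than as a standalone script; the sketch only makes the sequence concrete.&lt;/p&gt;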




&lt;h2&gt;
  
  
  The evolution engine
&lt;/h2&gt;

&lt;p&gt;Verification tells you what is wrong. Evolution fixes it over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  /evolve skill: weekly review with approval gate
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;/evolve&lt;/code&gt; skill is the mechanism by which the system improves. It runs weekly (or on demand) and does the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Gathers corrections&lt;/strong&gt; from &lt;code&gt;terraphim-agent learn list&lt;/code&gt; (recent captures and corrections)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reads current rules&lt;/strong&gt; from &lt;code&gt;learned-rules.md&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reads the evolution log&lt;/strong&gt; from &lt;code&gt;evolution-log.md&lt;/code&gt; (to avoid re-proposing rejected rules)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Groups failures by pattern&lt;/strong&gt; and identifies repeat corrections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proposes changes&lt;/strong&gt; using a structured format:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;PROPOSE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PROMOTE&lt;/span&gt;
&lt;span class="na"&gt;Rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never use timeout command on macOS (does not exist)&lt;/span&gt;
&lt;span class="na"&gt;Source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraphim-agent learning&lt;/span&gt; &lt;span class="c1"&gt;#8, #12&lt;/span&gt;
&lt;span class="na"&gt;Evidence&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Corrected twice across different sessions&lt;/span&gt;
&lt;span class="na"&gt;Verify&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Grep("timeout ", path="automation/") -&amp;gt; 0 matches&lt;/span&gt;
&lt;span class="na"&gt;Destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;learned-rules.md&lt;/span&gt;
&lt;span class="na"&gt;Risk&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Low. The command genuinely does not exist on macOS.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Waits for CTO approval.&lt;/strong&gt; No changes are applied until each proposal is individually approved, rejected, or modified.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logs everything&lt;/strong&gt; to &lt;code&gt;evolution-log.md&lt;/code&gt;: approved changes, rejected proposals, and the reasoning.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where we differ most from the Meta Alchemist approach. Their system auto-promotes on the second correction. Ours proposes and waits. The CTO reviews and approves.&lt;/p&gt;

&lt;p&gt;Why? Because a correction is context. "Don't use pip" is correct for our projects. It is not correct for a project that deliberately uses pip. Auto-promotion assumes all corrections are universally true. They are not.&lt;/p&gt;

&lt;p&gt;Our principle: &lt;strong&gt;A rule without CTO approval is an assumption.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Promotion ladder
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Signal&lt;/th&gt;
&lt;th&gt;Destination&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Failed command captured&lt;/td&gt;
&lt;td&gt;terraphim-agent learn database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human adds correction&lt;/td&gt;
&lt;td&gt;Same learning, enriched&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same correction appears twice&lt;/td&gt;
&lt;td&gt;Flagged for &lt;code&gt;/evolve&lt;/code&gt; review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Approved during &lt;code&gt;/evolve&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;learned-rules.md&lt;/code&gt; with verify: pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Passing for 5+ sessions&lt;/td&gt;
&lt;td&gt;Candidate for graduation to CLAUDE.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rejected during &lt;code&gt;/evolve&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;evolution-log.md&lt;/code&gt; (never re-proposed)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ladder is one-way unless the CTO explicitly overrides. Rejected rules do not come back. Graduated rules do not regress. The evolution log is the audit trail that makes this provable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Session scorecards
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;session-scorecard.sh&lt;/code&gt; script generates a quantitative summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;=== Session Scorecard: 2026-03-30 ===

--- Recent Learnings ---
Total learnings in database: 10
With corrections: 2

--- Verification Sweep ---
Rules checked: 7 | Passed: 4 | Failed: 2 | Skipped: 1

--- Relevant Past Learnings ---
No learnings matching 'cto-executive-system'.

=== End Scorecard ===
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Over time, the trend data answers a fundamental question: &lt;strong&gt;is the system getting better?&lt;/strong&gt; If corrections decrease and pass rates increase, the evolution loop is working. If they are flat, the rules are too vague or too disconnected from actual work.&lt;/p&gt;
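&lt;p&gt;The trend check itself is mechanical. A minimal sketch (the summary-line format is taken from the sweep output above; the two sample lines are illustrative):&lt;/p&gt;

```shell
#!/bin/sh
# Hedged sketch: extract the Passed count from two scorecard summary lines
# (format taken from the sweep output shown above) and report the trend.
earlier="Rules checked: 7 | Passed: 3 | Failed: 3 | Skipped: 1"
latest="Rules checked: 7 | Passed: 4 | Failed: 2 | Skipped: 1"

passed_then=$(echo "$earlier" | sed 's/.*Passed: \([0-9]*\).*/\1/')
passed_now=$(echo "$latest" | sed 's/.*Passed: \([0-9]*\).*/\1/')

if [ "$passed_now" -gt "$passed_then" ]; then
    echo "Trend: improving ($passed_then then $passed_now rules passing)"
elif [ "$passed_now" -lt "$passed_then" ]; then
    echo "Trend: regressing"
else
    echo "Trend: flat; rules may be too vague or too disconnected from the work"
fi
```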




&lt;h2&gt;
  
  
  What we deliberately did not build
&lt;/h2&gt;

&lt;p&gt;Engineering is as much about what you leave out as what you put in. Here is what the Meta Alchemist article proposes that we skipped, and why.&lt;/p&gt;

&lt;h3&gt;
  
  
  JSONL files for corrections and observations
&lt;/h3&gt;

&lt;p&gt;The article creates &lt;code&gt;corrections.jsonl&lt;/code&gt;, &lt;code&gt;observations.jsonl&lt;/code&gt;, &lt;code&gt;violations.jsonl&lt;/code&gt;, and &lt;code&gt;sessions.jsonl&lt;/code&gt;. Each is an append-only log of JSON objects that Claude parses at session start.&lt;/p&gt;

&lt;p&gt;We already have &lt;code&gt;terraphim-agent learn&lt;/code&gt; which provides structured file storage, thesaurus-expanded querying, project/global scoping, and correction chaining. Adding JSONL files would create two sources of truth for the same data. The agent CLI is the single source.&lt;/p&gt;

&lt;h3&gt;
  
  
  Auto-promotion on second correction
&lt;/h3&gt;

&lt;p&gt;The article promotes automatically when the same correction appears twice. We flag for review. The difference matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-promotion: fast, no human bottleneck, but accumulates rules without judgement&lt;/li&gt;
&lt;li&gt;Reviewed promotion: slower, requires CTO time, but every rule is intentional&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We chose reviewed promotion because our project spans multiple contexts (CTO executive system, Terraphim AI, client projects). A correction that is right in one context might be wrong in another. The human knows the difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  SessionStart and Stop hooks
&lt;/h3&gt;

&lt;p&gt;The article configures &lt;code&gt;SessionStart&lt;/code&gt; and &lt;code&gt;Stop&lt;/code&gt; hooks in Claude Code's &lt;code&gt;settings.json&lt;/code&gt;. These hook types do not exist in Claude Code's documented hook system. The available types are &lt;code&gt;PreToolUse&lt;/code&gt;, &lt;code&gt;PostToolUse&lt;/code&gt;, and (in some versions) &lt;code&gt;SubagentStart&lt;/code&gt;. The article either assumes a future feature or describes a different version.&lt;/p&gt;

&lt;p&gt;We replaced the SessionStart hook with a &lt;code&gt;/boot&lt;/code&gt; skill. We replaced the Stop hook with a manual &lt;code&gt;session-scorecard.sh&lt;/code&gt; invocation. Both work reliably because they use mechanisms that actually exist.&lt;/p&gt;
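&lt;p&gt;For completeness, a &lt;code&gt;PostToolUse&lt;/code&gt; hook of the kind we rely on is configured in &lt;code&gt;.claude/settings.json&lt;/code&gt; roughly like this (a sketch: the &lt;code&gt;capture-failure.sh&lt;/code&gt; script name is hypothetical, and the exact schema may vary by Claude Code version):&lt;/p&gt;

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "bash automation/learning/capture-failure.sh"
          }
        ]
      }
    ]
  }
}
```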

&lt;h3&gt;
  
  
  Path-scoped rules
&lt;/h3&gt;

&lt;p&gt;The article loads different rule files based on which file Claude is editing: security rules for auth code, API design rules for handlers, performance rules everywhere. This is a genuine Claude Code feature (&lt;code&gt;.claude/rules/&lt;/code&gt; with &lt;code&gt;paths:&lt;/code&gt; frontmatter).&lt;/p&gt;

&lt;p&gt;We skipped it because our project is not a single codebase. The CTO executive system contains knowledge articles, automation scripts, domain models, plans, and publishing workflows. Path-scoped rules make sense for a web application with clear directory boundaries. They are premature for a heterogeneous knowledge system.&lt;/p&gt;
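&lt;p&gt;For teams that do want them, a path-scoped rule file might look like this (a sketch: the &lt;code&gt;paths:&lt;/code&gt; frontmatter key is as described above, but the directory glob and rule text are hypothetical):&lt;/p&gt;

```markdown
---
paths:
  - "src/auth/**"
---

# Security rules for auth code

- Never log raw credentials; redact before writing to any sink.
- Use constant-time comparison for token checks.
```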

&lt;h3&gt;
  
  
  Hard capacity cap
&lt;/h3&gt;

&lt;p&gt;The article enforces a maximum of 50 lines in &lt;code&gt;learned-rules.md&lt;/code&gt;. If you hit the cap, you must graduate or prune before adding more.&lt;/p&gt;

&lt;p&gt;This is a useful forcing function for projects that might accumulate hundreds of rules. We started with 7. When we approach a natural limit, &lt;code&gt;/evolve&lt;/code&gt; will recommend pruning. We do not need an artificial constraint to force a behaviour that good engineering practice already demands.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Meta Alchemist&lt;/th&gt;
&lt;th&gt;terraphim-agent + verify layer&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;JSONL files parsed by LLM&lt;/td&gt;
&lt;td&gt;Rust CLI with structured file storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Capture trigger&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom evolution SKILL.md (auto-triggered)&lt;/td&gt;
&lt;td&gt;PostToolUse bash hook (fail-open)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Query mechanism&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;LLM reads and interprets JSONL&lt;/td&gt;
&lt;td&gt;CLI with thesaurus-expanded search&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Verification&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grep patterns in learned-rules.md&lt;/td&gt;
&lt;td&gt;Same (we adopted this idea)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Promotion&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Auto on 2nd correction&lt;/td&gt;
&lt;td&gt;Manual via /evolve with CTO approval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audit trail&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;evolution-log.md&lt;/td&gt;
&lt;td&gt;Same (we adopted this idea)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Session scoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;sessions.jsonl (auto-written)&lt;/td&gt;
&lt;td&gt;session-scorecard.sh (manual)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cross-tool support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Claude Code only&lt;/td&gt;
&lt;td&gt;Claude Code + OpenCode + any CLI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Safety guard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;settings.json deny list&lt;/td&gt;
&lt;td&gt;terraphim-agent guard (pattern matching)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Text replacement&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Not included&lt;/td&gt;
&lt;td&gt;terraphim-agent replace (KG-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fundamental difference: Meta Alchemist builds a complete system inside Claude Code's configuration. We build a thin verification layer on top of an existing CLI. The CLI handles storage, querying, and correction chaining. The verification layer handles enforcement and evolution. Each does what it is good at.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where this fits in the broader landscape
&lt;/h3&gt;

&lt;p&gt;The idea of self-improving AI coding agents is not new. Several approaches exist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Devin's knowledge suggestions&lt;/strong&gt;: captures corrections as project-specific "knowledge" entries that load into future sessions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenClaw's metacognitive loops&lt;/strong&gt;: three-phase review cycle that captures Phase 2 findings as learnings for future Phase 1s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ouroboros pattern&lt;/strong&gt;: self-modifying agents with constitutional guardrails, event sourcing, and multi-model review chains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compound agency learning architecture&lt;/strong&gt;: six nested learning loops from failure-to-guardrail up to loop-evolution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our approach sits between Devin (simple capture) and Ouroboros (full self-modification). We capture automatically, verify mechanically, but promote deliberately. The human stays in the loop for rule changes. The machine handles enforcement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting started
&lt;/h2&gt;

&lt;p&gt;If you already have terraphim-agent configured with the PostToolUse hook (see the &lt;a href="https://zestic.ai/blog/terraphim-agent-learning-hooks" rel="noopener noreferrer"&gt;foundation post&lt;/a&gt;), adding the verification layer takes five steps:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Create the directory structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; automation/learning .claude/skills/boot .claude/skills/evolve .claude/memory
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Seed learned-rules.md
&lt;/h3&gt;

&lt;p&gt;Start with 3 to 5 rules from your existing CLAUDE.md or project conventions. Each rule needs a verify pattern. If you cannot write a grep check for a rule, the rule is too vague.&lt;/p&gt;
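&lt;p&gt;A seeded entry might look like this (a sketch: the on-disk layout is inferred from what &lt;code&gt;verify-sweep.sh&lt;/code&gt; parses, so adjust it to whatever format your sweep script expects):&lt;/p&gt;

```markdown
## Rule: Never use pip, pip3, or pipx; always use uv instead.
verify: Grep("pip3? install", path="automation/") -> 0

## Rule: Never use npm, yarn, or pnpm; always use bun instead.
verify: Grep("(npm|yarn|pnpm) install", path="automation/") -> 0
```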

&lt;h3&gt;
  
  
  3. Write verify-sweep.sh
&lt;/h3&gt;

&lt;p&gt;Copy the script from above. Make it executable. Test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x automation/learning/verify-sweep.sh
bash automation/learning/verify-sweep.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see PASS/FAIL for each rule. If a rule fails, either fix the violation or refine the verify pattern.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Create the /boot and /evolve skills
&lt;/h3&gt;

&lt;p&gt;These are SKILL.md files in &lt;code&gt;.claude/skills/boot/&lt;/code&gt; and &lt;code&gt;.claude/skills/evolve/&lt;/code&gt;. The boot skill runs the sweep and surfaces learnings. The evolve skill reviews corrections and proposes rule changes. Full skill definitions are in our &lt;a href="https://github.com/terraphim/terraphim-ai" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;
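&lt;p&gt;The shape of the boot skill, in outline (a sketch: the frontmatter fields are our assumption about the SKILL.md format; the repository has the full definitions):&lt;/p&gt;

```markdown
---
name: boot
description: Verify learned rules and surface recent learnings at session start
---

1. Run `date` and state the actual current date.
2. Read `.claude/memory/learned-rules.md`.
3. Execute `bash automation/learning/verify-sweep.sh`.
4. Run `terraphim-agent learn list`.
5. Report a one-line summary of passes, failures, and learnings.
```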

&lt;h3&gt;
  
  
  5. Add to CLAUDE.md
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Learning Evolution System&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Run /boot at session start to verify learned rules and surface past learnings
&lt;span class="p"&gt;-&lt;/span&gt; Run /evolve weekly to review corrections and propose rule promotions
&lt;span class="p"&gt;-&lt;/span&gt; Graduated rules with verify: patterns: .claude/memory/learned-rules.md
&lt;span class="p"&gt;-&lt;/span&gt; Evolution audit trail: .claude/memory/evolution-log.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Meta Alchemist article gave us the idea we were missing: &lt;strong&gt;machine-checkable verification patterns on every rule&lt;/strong&gt;. That single concept transforms a learning journal into an immune system. We credit the article for the insight.&lt;/p&gt;

&lt;p&gt;What we brought to the table: a Rust CLI that already handles capture, storage, querying, and correction chaining. The verification layer is 100 lines of bash on top of a structured backend, not 500 lines of JSONL parsing instructions for an LLM.&lt;/p&gt;

&lt;p&gt;The combination works. Corrections are captured automatically by the PostToolUse hook. Rules are verified mechanically by the sweep script. Promotions are approved deliberately by a human. The system gets better every week, and we can prove it with session scorecards.&lt;/p&gt;

&lt;p&gt;Two principles emerged from building this:&lt;/p&gt;

&lt;p&gt;From Meta Alchemist: &lt;strong&gt;A rule without a verification check is a wish.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;From us: &lt;strong&gt;A rule without CTO approval is an assumption.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Only verified, approved guardrails survive.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;The terraphim-agent learning system is open source at &lt;a href="https://github.com/terraphim/terraphim-ai" rel="noopener noreferrer"&gt;github.com/terraphim/terraphim-ai&lt;/a&gt;. The verification layer described in this post is configuration, not code: shell scripts and markdown files on top of the existing CLI.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This is the third post in a series: &lt;a href="https://zestic.ai/blog/terraphim-agent-learning-hooks" rel="noopener noreferrer"&gt;Part 1: Configuring terraphim-agent for Claude Code and OpenCode&lt;/a&gt; | &lt;a href="https://zestic.ai/blog/terraphim-agent-verification-checklist" rel="noopener noreferrer"&gt;Part 2: Verification Checklist&lt;/a&gt; | Part 3: Self-Evolving Rules (this post)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claudecode</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>OpenClaw + Terraphim LLM Proxy: OpenAI, Z.ai GLM-5, and MiniMax M2.5</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Fri, 13 Feb 2026 20:17:04 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/openclaw-terraphim-llm-proxy-openai-zai-glm-5-and-minimax-m25-59m7</link>
      <guid>https://dev.to/alexmikhalev/openclaw-terraphim-llm-proxy-openai-zai-glm-5-and-minimax-m25-59m7</guid>
<description>&lt;p&gt;If you want OpenClaw to use multiple providers through a single endpoint, the &lt;a href="https://github.com/terraphim/terraphim-llm-proxy" rel="noopener noreferrer"&gt;Terraphim AI intelligent LLM proxy&lt;/a&gt; gives you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI Codex (&lt;code&gt;gpt-5.2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Z.ai (&lt;code&gt;glm-5&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;MiniMax (&lt;code&gt;MiniMax-M2.5&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;intelligent keyword routing&lt;/li&gt;
&lt;li&gt;automatic fallback when a provider goes down&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guide reflects a real build-in-public rollout on &lt;code&gt;terraphim-llm-proxy&lt;/code&gt;, including production debugging, fallback drills, and routing verification.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this setup
&lt;/h2&gt;

&lt;p&gt;Most agent stacks fail at provider outages and model sprawl. A single proxy with explicit route chains keeps clients stable while you switch providers underneath.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proxy config pattern
&lt;/h2&gt;

&lt;p&gt;Use route chains in &lt;code&gt;/etc/terraphim-llm-proxy/config.toml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[router]&lt;/span&gt;
&lt;span class="py"&gt;default&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai-codex,gpt-5.2-codex|zai,glm-5"&lt;/span&gt;
&lt;span class="py"&gt;think&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai-codex,gpt-5.2|minimax,MiniMax-M2.5|zai,glm-5"&lt;/span&gt;
&lt;span class="py"&gt;long_context&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai-codex,gpt-5.2|zai,glm-5"&lt;/span&gt;
&lt;span class="py"&gt;web_search&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai-codex,gpt-5.2|zai,glm-5"&lt;/span&gt;
&lt;span class="py"&gt;strategy&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"fill_first"&lt;/span&gt;

&lt;span class="nn"&gt;[[providers]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai-codex"&lt;/span&gt;
&lt;span class="py"&gt;api_base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.openai.com/v1"&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"oauth-token-managed-internally"&lt;/span&gt;
&lt;span class="py"&gt;models&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"gpt-5.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"gpt-5.2-codex"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"gpt-5.3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;transformers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nn"&gt;[[providers]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"zai"&lt;/span&gt;
&lt;span class="py"&gt;api_base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.z.ai/api/paas/v4"&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"$ZAI_API_KEY"&lt;/span&gt;
&lt;span class="py"&gt;models&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"glm-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"glm-4.7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"glm-4.6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"glm-4.5"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;transformers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"openai"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nn"&gt;[[providers]]&lt;/span&gt;
&lt;span class="py"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"minimax"&lt;/span&gt;
&lt;span class="py"&gt;api_base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"https://api.minimax.io/anthropic"&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"$MINIMAX_API_KEY"&lt;/span&gt;
&lt;span class="py"&gt;models&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"MiniMax-M2.5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"MiniMax-M2.1"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="py"&gt;transformers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"anthropic"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep secrets in env, never inline.&lt;/p&gt;
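
&lt;p&gt;Assuming the proxy expands &lt;code&gt;$VAR&lt;/code&gt;-style values from the process environment (as the &lt;code&gt;"$ZAI_API_KEY"&lt;/code&gt; entries above suggest), the resolution logic amounts to something like this sketch; the helper name is hypothetical:&lt;/p&gt;

```python
import os
import re

def resolve_secret(value: str) -> str:
    # Hypothetical helper: expand a "$VAR"-style config value from the environment.
    match = re.fullmatch(r"\$(\w+)", value)
    if match:
        return os.environ.get(match.group(1), "")
    return value

os.environ["ZAI_API_KEY"] = "sk-demo"  # for illustration only
assert resolve_secret("$ZAI_API_KEY") == "sk-demo"
assert resolve_secret("inline-literal") == "inline-literal"
```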

&lt;h2&gt;
  
  
  OpenClaw config pattern
&lt;/h2&gt;

&lt;p&gt;In both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;/home/alex/.openclaw/openclaw.json&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;/home/alex/.openclaw/clawdbot.json&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;set the Terraphim provider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;baseUrl&lt;/code&gt;: &lt;code&gt;http://127.0.0.1:3456/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;api&lt;/code&gt;: &lt;code&gt;openai-completions&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;model ids include:

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;openai-codex,gpt-5.2&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;zai,glm-5&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;minimax,MiniMax-M2.5&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
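
&lt;p&gt;Put together, the provider entry could look roughly like the sketch below. The field names come straight from the bullets above; the surrounding structure is an assumption, since the exact OpenClaw schema is not reproduced here.&lt;/p&gt;

```json
{
  "providers": {
    "terraphim-proxy": {
      "baseUrl": "http://127.0.0.1:3456/v1",
      "api": "openai-completions",
      "models": [
        "openai-codex,gpt-5.2",
        "zai,glm-5",
        "minimax,MiniMax-M2.5"
      ]
    }
  }
}
```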

&lt;h2&gt;
  
  
  Intelligent routing example
&lt;/h2&gt;

&lt;p&gt;Add taxonomy file:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/etc/terraphim-llm-proxy/taxonomy/routing_scenarios/minimax_keyword_routing.md&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;route:: minimax, MiniMax-M2.5
priority:: 100
synonyms:: minimax, m2.5, minimax keyword, minimax route
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now a normal request containing a MiniMax keyword can route to MiniMax even when the requested model is generic.&lt;/p&gt;
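
&lt;p&gt;Conceptually, the taxonomy entry above gives the router a synonym list and a priority. A minimal sketch of that matching logic (illustrative only, not the proxy's actual implementation):&lt;/p&gt;

```python
# Hypothetical, simplified keyword router: each taxonomy entry maps
# synonyms to a (provider, model) route with a priority.
RULES = [
    {"route": ("minimax", "MiniMax-M2.5"), "priority": 100,
     "synonyms": {"minimax", "m2.5", "minimax keyword", "minimax route"}},
]

def route_for(prompt: str, default: tuple) -> tuple:
    """Pick the highest-priority rule whose synonym appears in the prompt."""
    text = prompt.lower()
    hits = [r for r in RULES if any(s in text for s in r["synonyms"])]
    if hits:
        return max(hits, key=lambda r: r["priority"])["route"]
    return default

assert route_for("use minimax for this", ("zai", "glm-5")) == ("minimax", "MiniMax-M2.5")
assert route_for("plain request", ("zai", "glm-5")) == ("zai", "glm-5")
```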

&lt;h2&gt;
  
  
  Validation commands
&lt;/h2&gt;

&lt;p&gt;Direct provider checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:3456/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-api-key: &amp;lt;PROXY_API_KEY&amp;gt;'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"openai-codex,gpt-5.2","messages":[{"role":"user","content":"Reply exactly: openai-ok"}],"stream":false}'&lt;/span&gt;

curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:3456/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-api-key: &amp;lt;PROXY_API_KEY&amp;gt;'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"zai,glm-5","messages":[{"role":"user","content":"Reply exactly: zai-ok"}],"stream":false}'&lt;/span&gt;

curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:3456/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-api-key: &amp;lt;PROXY_API_KEY&amp;gt;'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"minimax,MiniMax-M2.5","messages":[{"role":"user","content":"Reply exactly: minimax-ok"}],"stream":false}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fallback proof (simulate Codex outage):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cp&lt;/span&gt; /etc/hosts /tmp/hosts.bak
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'127.0.0.1 chatgpt.com'&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/hosts &lt;span class="o"&gt;&amp;gt;&lt;/span&gt;/dev/null

curl &lt;span class="nt"&gt;-sS&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://127.0.0.1:3456/v1/chat/completions &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'x-api-key: &amp;lt;PROXY_API_KEY&amp;gt;'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"model":"gpt-5.2","messages":[{"role":"user","content":"Reply exactly: fallback-ok"}],"stream":false}'&lt;/span&gt;

&lt;span class="nb"&gt;sudo cp&lt;/span&gt; /tmp/hosts.bak /etc/hosts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Look for fallback logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;journalctl &lt;span class="nt"&gt;-u&lt;/span&gt; terraphim-llm-proxy &lt;span class="nt"&gt;-n&lt;/span&gt; 120 &lt;span class="nt"&gt;--no-pager&lt;/span&gt; | rg &lt;span class="s1"&gt;'Primary target failed, attempting fallback target|next_provider='&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key lesson
&lt;/h2&gt;

&lt;p&gt;Reliable multi-model routing is mostly configuration discipline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;explicit route chains&lt;/li&gt;
&lt;li&gt;provider-specific endpoint handling where needed&lt;/li&gt;
&lt;li&gt;deterministic fallback order&lt;/li&gt;
&lt;li&gt;logs that prove routing decisions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps OpenClaw simple and makes provider outages routine instead of incidents. Donate 3 USD to unlock the open-source proxy on &lt;a href="https://github.com/terraphim/terraphim-llm-proxy" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>minimax</category>
      <category>glm</category>
      <category>openai</category>
    </item>
    <item>
      <title>Deploy BERT Large Question Answering models with a tenth of the second's inference on CPU in Python</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Thu, 21 Jul 2022 12:52:47 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/deploy-bert-large-question-answering-models-with-a-tenth-of-the-seconds-inference-on-cpu-in-python-1nfp</link>
      <guid>https://dev.to/alexmikhalev/deploy-bert-large-question-answering-models-with-a-tenth-of-the-seconds-inference-on-cpu-in-python-1nfp</guid>
      <description>&lt;h3&gt;
  
  
  Deploy BERT Large Question Answering models with a tenth of a second's inference on CPU in Python
&lt;/h3&gt;

&lt;p&gt;How to deploy and benchmark a large uncased BERT model for a Question Answering API with ~0.088387-second inference&lt;/p&gt;

&lt;h4&gt;
  
  
  Summary of the article
&lt;/h4&gt;

&lt;p&gt;This article explores the challenges and opportunities of deploying a large BERT Question Answering Transformer model (bert-large-uncased-whole-word-masking-finetuned-squad) from Hugging Face, where &lt;a href="https://developer.redis.com/howtos/redisgears?utm_campaign=write_for_redis"&gt;RedisGears&lt;/a&gt; and &lt;a href="https://developer.redis.com/howtos/redisai/getting-started?utm_campaign=write_for_redis"&gt;RedisAI&lt;/a&gt; perform the heavy lifting while leveraging the in-memory datastore Redis. The end result is a Question Answering API with ~0.088387-second inference on the first run and nanosecond-scale responses on subsequent cached calls.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why do we need RedisAI?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;In a data-science workload, you want to push high-performance hardware as close to 100% utilization as possible.&lt;/li&gt;
&lt;li&gt;In a user-facing workload, you want to distribute the load evenly so it never reaches 100%, leaving client-facing servers free to perform additional functions.&lt;/li&gt;
&lt;li&gt;In data science, you prefer to re-calculate results.&lt;/li&gt;
&lt;li&gt;In a client-facing application, you prefer to cache the results of calculations and fetch data from the cache as fast as possible, to drive a seamless customer experience.&lt;/li&gt;
&lt;/ul&gt;
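
&lt;p&gt;The last two bullets describe the classic cache-aside pattern. A minimal sketch in Python, with an in-process dict standing in for Redis and a sleep standing in for model inference:&lt;/p&gt;

```python
import time

cache = {}  # stands in for Redis

def slow_inference(question: str) -> str:
    time.sleep(0.01)  # pretend this is a multi-second BERT forward pass
    return f"answer to: {question}"

def answer(question: str) -> str:
    # Cache-aside: serve from cache when possible, compute and store otherwise
    if question not in cache:
        cache[question] = slow_inference(question)
    return cache[question]

assert answer("who?") == "answer to: who?"   # computed on first call
assert answer("who?") == "answer to: who?"   # served from cache afterwards
```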

&lt;p&gt;Some numbers for inspiration and why to read this article:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 transformers_plain_bert_qa.py 
airborne transmission of respiratory infections is the lack of established methods for the detection of airborne respiratory microorganisms
10.351818372 seconds

time curl -i -H "Content-Type: application/json" -X POST -d '{"search":"Who performs viral transmission among adults"}' http://localhost:8080/qasearch

real    0m0.747s
user    0m0.004s
sys 0m0.000s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Background
&lt;/h3&gt;

&lt;p&gt;In BERT Question Answering inference, the model selects an answer from a given text. In other words, BERT QA “thinks” through the following: “What is the answer from the text, assuming the answer to the question exists within the selected paragraph?”&lt;/p&gt;

&lt;p&gt;So it’s important to select text potentially containing an answer. A typical pattern is to use Wikipedia data to build &lt;a href="https://lilianweng.github.io/posts/2020-10-29-odqa/"&gt;Open Domain Question Answering&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Our QA system is a medical domain-specific question/answering pipeline. Hence we need a first pipeline that turns data into a knowledge graph. This NLP pipeline is available at Redis LaunchPad, is fully &lt;a href="https://github.com/applied-knowledge-systems/the-pattern"&gt;open source&lt;/a&gt;, and is described in &lt;a href="https://dev.to/howtos/nlp"&gt;a previous article&lt;/a&gt;. Here is a 5-minute &lt;a href="https://www.youtube.com/watch?v=VgJ8DTX5Mt4"&gt;video&lt;/a&gt; describing it, and below, you will find an architectural overview:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hEbCHOhM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/960/1%2Ark_ifZ1FOKihhYeATnzcRw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hEbCHOhM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/960/1%2Ark_ifZ1FOKihhYeATnzcRw.png" alt="" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  BERT Question Answering pipeline and API
&lt;/h3&gt;

&lt;p&gt;In the BERT QA pipeline (or in any other modern NLP inference task), there are two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Tokenize text — turn text into numbers&lt;/li&gt;
&lt;li&gt;Run the inference — large matrix multiplication&lt;/li&gt;
&lt;/ol&gt;
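
&lt;p&gt;As a toy illustration of step 1, tokenization simply maps text to numeric ids; the vocabulary below is an arbitrary stand-in, whereas the real pipeline uses the Hugging Face BERT WordPiece tokenizer:&lt;/p&gt;

```python
# Toy illustration of "text -> numbers"; the ids are arbitrary stand-ins.
VOCAB = {"[CLS]": 101, "[SEP]": 102, "who": 2040, "performs": 10280, "viral": 13434}

def toy_tokenize(text: str) -> list:
    # Keep only known words, then add the special start/end tokens
    ids = [VOCAB[w] for w in text.lower().split() if w in VOCAB]
    return [VOCAB["[CLS]"]] + ids + [VOCAB["[SEP]"]]

assert toy_tokenize("Who performs viral") == [101, 2040, 10280, 13434, 102]
```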

&lt;p&gt;With Redis, we have the opportunity to pre-compute everything and store it in memory, but how do we do it? Unlike a summarization task, the question is not known in advance, so we can’t pre-compute all possible answers. However, we can pre-tokenize all potential answers (i.e. all paragraphs in the dataset) using RedisGears:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def parse_sentence(record):
    import redisAI
    import numpy as np
    global tokenizer
    if not tokenizer:
        tokenizer=loadTokeniser()
    hash_tag="{%s}" % hashtag()

    for idx, value in sorted(record['value'].items(), key=lambda item: int(item[0])):
        tokens = tokenizer.encode(value, add_special_tokens=False, max_length=511, truncation=True, return_tensors="np")
        tokens = np.append(tokens,tokenizer.sep_token_id).astype(np.int64)
        tensor=redisAI.createTensorFromBlob('INT64', tokens.shape, tokens.tobytes())

        key_prefix='sentence:'
        sentence_key=remove_prefix(record['key'],key_prefix)
        token_key = f"tokenized:bert:qa:{sentence_key}:{idx}"
        redisAI.setTensorInKey(token_key, tensor)
        execute('SADD',f'processed_docs_stage3_tokenized{hash_tag}', token_key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See the &lt;a href="https://github.com/applied-knowledge-systems/the-pattern-api/blob/156633b9934f1243775671ce6c18ff2bf471c0ce/qasearch/tokeniser_gears_redisai.py#L17"&gt;full code on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Then for each Redis Cluster shard, we pre-load the BERT QA model by downloading, exporting it into torchscript, then loading it into each shard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def load_bert():
    model_file = 'traced_bert_qa.pt'

    with open(model_file, 'rb') as f:
        model = f.read()
    startup_nodes = [{"host": "127.0.0.1", "port": "30001"}, {"host": "127.0.0.1", "port":"30002"}, {"host":"127.0.0.1", "port":"30003"}]
    cc = ClusterClient(startup_nodes = startup_nodes)
    hash_tags = cc.execute_command("RG.PYEXECUTE", "gb = GB('ShardsIDReader').map(lambda x:hashtag()).run()")[0]
    print(hash_tags)
    for hash_tag in hash_tags:
        print("Loading model bert-qa{%s}" %hash_tag.decode('utf-8'))
        cc.modelset('bert-qa{%s}' %hash_tag.decode('utf-8'), 'TORCH', 'CPU', model)
        print(cc.infoget('bert-qa{%s}' %hash_tag.decode('utf-8')))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;a href="https://github.com/applied-knowledge-systems/the-pattern-api/blob/156633b9934f1243775671ce6c18ff2bf471c0ce/qasearch/export_load_bert.py"&gt;full code is available on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And when a question comes from the user, we tokenize and append the question to the list of potential answers before running the RedisAI model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;token_key = f"tokenized:bert:qa:{sentence_key}"
    # encode question
    input_ids_question = tokenizer.encode(question, add_special_tokens=True, truncation=True, return_tensors="np")
    t=redisAI.getTensorFromKey(token_key)
    input_ids_context=to_np(t,np.int64)
    # merge (append) with potential answer, context - is pre-tokenized paragraph
    input_ids = np.append(input_ids_question,input_ids_context)
    attention_mask = np.array([[1]*len(input_ids)])
    input_idss=np.array([input_ids])
    num_seg_a=input_ids_question.shape[1]
    num_seg_b=input_ids_context.shape[0]
    token_type_ids = np.array([0]*num_seg_a + [1]*num_seg_b)
    # create actual model runner for RedisAI
    modelRunner = redisAI.createModelRunner(f'bert-qa{hash_tag}')
    # make sure all types are correct
    input_idss_ts=redisAI.createTensorFromBlob('INT64', input_idss.shape, input_idss.tobytes())
    attention_mask_ts=redisAI.createTensorFromBlob('INT64', attention_mask.shape, attention_mask.tobytes())
    token_type_ids_ts=redisAI.createTensorFromBlob('INT64', token_type_ids.shape, token_type_ids.tobytes())
    redisAI.modelRunnerAddInput(modelRunner, 'input_ids', input_idss_ts)
    redisAI.modelRunnerAddInput(modelRunner, 'attention_mask', attention_mask_ts)
    redisAI.modelRunnerAddInput(modelRunner, 'token_type_ids', token_type_ids_ts)
    redisAI.modelRunnerAddOutput(modelRunner, 'answer_start_scores')
    redisAI.modelRunnerAddOutput(modelRunner, 'answer_end_scores')
    # run RedisAI model runner
    res = await redisAI.modelRunnerRunAsync(modelRunner)
    answer_start_scores=to_np(res[0],np.float32)
    answer_end_scores = to_np(res[1],np.float32)
    answer_start = np.argmax(answer_start_scores)
    answer_end = np.argmax(answer_end_scores) + 1
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end],skip_special_tokens = True))
    log("Answer "+str(answer))
    return answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check out the &lt;a href="https://github.com/applied-knowledge-systems/the-pattern-api/blob/156633b9934f1243775671ce6c18ff2bf471c0ce/qasearch/qa_redisai_keymiss_no_cache_np.py#L34"&gt;full code, available on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The process for making a BERT QA API call looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R4Zxbm_0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AXVlFo_JNqihOfqU4K9uicg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R4Zxbm_0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AXVlFo_JNqihOfqU4K9uicg.png" alt="" width="880" height="495"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Architecture Diagram for BERT QA RedisGears and RedisAI&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here I use two remarkable features of RedisGears: capturing events on key miss, and using async/await to run RedisAI on each shard without locking the primary thread, so that the Redis Cluster can continue to serve other customers. For the benchmarks, caching of responses from RedisAI is &lt;a href="https://github.com/applied-knowledge-systems/the-pattern-api/blob/156633b9934f1243775671ce6c18ff2bf471c0ce/qasearch/qa_redisai_keymiss_no_cache_np.py#L29"&gt;disabled&lt;/a&gt;. If you are getting response times in nanoseconds rather than milliseconds on the second call, check that the line linked above is commented out.&lt;/p&gt;
&lt;h3&gt;
  
  
  Running the Benchmark
&lt;/h3&gt;

&lt;p&gt;Pre-requisites for running the benchmark:&lt;/p&gt;

&lt;p&gt;Assuming you are running Debian or Ubuntu and have Docker and docker-compose installed (or can create a virtual environment via conda), run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone --recurse-submodules https://github.com/applied-knowledge-systems/the-pattern.git
cd the-pattern
./bootstrap_benchmark.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above commands should end with a curl call to the qasearch API, since Redis caching is disabled for the benchmark.&lt;/p&gt;

&lt;p&gt;Next, invoke curl like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;time curl -i -H "Content-Type: application/json" -X POST -d '{"search":"Who performs viral transmission among adults"}' [http://localhost:8080/qasearch](http://localhost:8080/qasearch)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expect the following output, or something similar based on your runtime environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HTTP/1.1 200 OK
Server: nginx/1.18.0 (Ubuntu)
Date: Sun, 29 May 2022 12:05:39 GMT
Content-Type: application/json
Content-Length: 2120
Connection: keep-alive

{"links":[{"created_at":"2002","rank":13,"source":"C0001486","target":"C0152083"}],"results":[{"answer":"adenovirus","sentence":"The medium of 40 T150 flasks of adenovirus transducer dec CAR CHO cells yielded 0 5 1 my of purified msCEACAM1a 1 4 protein","sentencekey":"sentence:PMC125375.xml:{mG}:202","title":"Crystal structure of murine sCEACAM1a[1,4]: a coronavirus receptor in the CEA family"}] OUTPUT_REDUCTED}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I modified the output of the API for the benchmark to return results from all shards — even if the answer is empty. In the run above five shards return answers. The overall API call response takes less than one second with all additional hops to search in RedisGraph!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JIMKUwE---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Ar0QGATTPREX4Tm1ZiEtalg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JIMKUwE---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Ar0QGATTPREX4Tm1ZiEtalg.png" alt="" width="880" height="411"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Architecture Diagram for BERT QA API call&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Deep Dive into the Benchmark
&lt;/h3&gt;

&lt;p&gt;Let’s dig deeper into what’s happening under the hood:&lt;/p&gt;

&lt;p&gt;You should have a sentence key with a shard id, which you can get by looking at the “Cache key” entry in &lt;code&gt;docker logs -f rgcluster&lt;/code&gt;. In my setup the cache key is "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults". If it looks like a function call, that’s because it is one: it is triggered whenever the key is not present in the Redis Cluster, which for the benchmark is every time, since we disabled caching of the output.&lt;/p&gt;

&lt;p&gt;One more thing to figure out from the logs is the port of the shard corresponding to the hashtag, also known as the shard id. It is the text between the curly brackets (it looks like {6fd} above). The same value appears in the output of the export_load script. In my case the cache key was found in "30012.log", so my port is 30012.&lt;/p&gt;

&lt;p&gt;Next I run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;redis-cli -c -p 300012 -h 127.0.0.1 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and then run the benchmark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;redis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======
  10 requests completed in 0.04 seconds
  50 parallel clients
  3 bytes payload
  keep alive: 1

10.00% &amp;lt;= 41 milliseconds
100.00% &amp;lt;= 41 milliseconds
238.10 requests per second
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are wondering, &lt;code&gt;-n&lt;/code&gt; is the number of requests; in this case the benchmark issues 10. You can also add:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--csv&lt;/code&gt; if you want the output in CSV format&lt;/p&gt;

&lt;p&gt;&lt;code&gt;--precision 3&lt;/code&gt; if you want more decimal places in the millisecond figures&lt;/p&gt;

&lt;p&gt;More information about the benchmarking tool can be found on the &lt;a href="https://redis.io/topics/benchmarks"&gt;redis.io Benchmarks page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you don’t have redis-tools installed locally, you can use Docker as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec -it rgcluster /bin/bash
redis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======
  10 requests completed in 1.75 seconds
  50 parallel clients
  99 bytes payload
  keep alive: 1
  host configuration "save":
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% &amp;lt;= 243.711 milliseconds (cumulative count 1)
50.000% &amp;lt;= 987.135 milliseconds (cumulative count 5)
75.000% &amp;lt;= 1577.983 milliseconds (cumulative count 8)
87.500% &amp;lt;= 1662.975 milliseconds (cumulative count 9)
93.750% &amp;lt;= 1744.895 milliseconds (cumulative count 10)
100.000% &amp;lt;= 1744.895 milliseconds (cumulative count 10)

Cumulative distribution of latencies:
0.000% &amp;lt;= 0.103 milliseconds (cumulative count 0)
10.000% &amp;lt;= 244.223 milliseconds (cumulative count 1)
20.000% &amp;lt;= 409.343 milliseconds (cumulative count 2)
30.000% &amp;lt;= 575.487 milliseconds (cumulative count 3)
40.000% &amp;lt;= 821.247 milliseconds (cumulative count 4)
50.000% &amp;lt;= 987.135 milliseconds (cumulative count 5)
60.000% &amp;lt;= 1157.119 milliseconds (cumulative count 6)
70.000% &amp;lt;= 1497.087 milliseconds (cumulative count 7)
80.000% &amp;lt;= 1577.983 milliseconds (cumulative count 8)
90.000% &amp;lt;= 1662.975 milliseconds (cumulative count 9)
100.000% &amp;lt;= 1744.895 milliseconds (cumulative count 10)

Summary:
  throughput summary: 5.73 requests per second
  latency summary (msec):
          avg min p50 p95 p99 max
     1067.296 243.584 987.135 1744.895 1744.895 1744.895
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The platform only has 20 articles and 8 Redis nodes (4 masters + 4 replicas), so relevance ranking will be skewed, but the deployment doesn’t need a lot of memory.&lt;/p&gt;

&lt;h4&gt;
  
  
  AI.INFO
&lt;/h4&gt;

&lt;p&gt;Now let’s check how long our RedisAI model runs on the {6fd} shard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;127.0.0.1:30012&amp;gt; AI.INFO bert-qa{6fd}
 1) "key"
 2) "bert-qa{6fd}"
 3) "type"
 4) "MODEL"
 5) "backend"
 6) "TORCH"
 7) "device"
 8) "CPU"
 9) "tag"
10) ""
11) "duration"
12) (integer) 8928136
13) "samples"
14) (integer) 58
15) "calls"
16) (integer) 58
17) "errors"
18) (integer) 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;bert-qa{6fd} is the key under which the actual (very large) model is saved. The AI.INFO command gives us a cumulative duration of 8928136 microseconds across 58 calls, which is approximately 154 milliseconds per call.&lt;/p&gt;

&lt;p&gt;Let’s double-check that by resetting the stats and then re-running the benchmark.&lt;/p&gt;

&lt;p&gt;First, reset the stats:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;127.0.0.1:30012&amp;gt; AI.INFO bert-qa{6fd} RESETSTAT
OK
127.0.0.1:30012&amp;gt; AI.INFO bert-qa{6fd}
 1) "key"
 2) "bert-qa{6fd}"
 3) "type"
 4) "MODEL"
 5) "backend"
 6) "TORCH"
 7) "device"
 8) "CPU"
 9) "tag"
10) ""
11) "duration"
12) (integer) 0
13) "samples"
14) (integer) 0
15) "calls"
16) (integer) 0
17) "errors"
18) (integer) 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, re-run the benchmark:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;redis-benchmark -p 30012 -h 127.0.0.1 -n 10 get "bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults"
====== get bertqa{6fd}_PMC169038.xml:{6fd}:33_Who performs viral transmission among adults ======
  10 requests completed in 1.78 seconds
  50 parallel clients
  99 bytes payload
  keep alive: 1
  host configuration "save":
  host configuration "appendonly": no
  multi-thread: no

Latency by percentile distribution:
0.000% &amp;lt;= 188.927 milliseconds (cumulative count 1)
50.000% &amp;lt;= 995.839 milliseconds (cumulative count 5)
75.000% &amp;lt;= 1606.655 milliseconds (cumulative count 8)
87.500% &amp;lt;= 1692.671 milliseconds (cumulative count 9)
93.750% &amp;lt;= 1779.711 milliseconds (cumulative count 10)
100.000% &amp;lt;= 1779.711 milliseconds (cumulative count 10)

Cumulative distribution of latencies:
0.000% &amp;lt;= 0.103 milliseconds (cumulative count 0)
10.000% &amp;lt;= 189.183 milliseconds (cumulative count 1)
20.000% &amp;lt;= 392.191 milliseconds (cumulative count 2)
30.000% &amp;lt;= 540.159 milliseconds (cumulative count 3)
40.000% &amp;lt;= 896.511 milliseconds (cumulative count 4)
50.000% &amp;lt;= 996.351 milliseconds (cumulative count 5)
60.000% &amp;lt;= 1260.543 milliseconds (cumulative count 6)
70.000% &amp;lt;= 1456.127 milliseconds (cumulative count 7)
80.000% &amp;lt;= 1606.655 milliseconds (cumulative count 8)
90.000% &amp;lt;= 1692.671 milliseconds (cumulative count 9)
100.000% &amp;lt;= 1779.711 milliseconds (cumulative count 10)

Summary:
  throughput summary: 5.62 requests per second
  latency summary (msec):
          avg min p50 p95 p99 max
     1080.454 188.800 995.839 1779.711 1779.711 1779.711
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
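&lt;p&gt;As a quick sanity check (a sketch of my own, not part of the benchmark tooling), the throughput summary follows directly from the request count and elapsed time reported above:&lt;/p&gt;

```python
# Sanity-check the redis-benchmark summary: 10 requests completed in 1.78 seconds.
requests = 10
elapsed_s = 1.78
throughput = requests / elapsed_s
print(f"{throughput:.2f} requests per second")  # 5.62 requests per second
```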



&lt;p&gt;Now check the stats again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AI.INFO bert-qa{6fd}
 1) "key"
 2) "bert-qa{6fd}"
 3) "type"
 4) "MODEL"
 5) "backend"
 6) "TORCH"
 7) "device"
 8) "CPU"
 9) "tag"
10) ""
11) "duration"
12) (integer) 1767749
13) "samples"
14) (integer) 20
15) "calls"
16) (integer) 20
17) "errors"
18) (integer) 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The counters now show 20 calls with a cumulative duration of 1767749 microseconds, i.e. 88387.45 microseconds (~0.088 seconds) per call, which is pretty fast! Considering we started at 10 seconds per call, the benefits of using RedisAI in combination with RedisGears are obvious. The trade-off, however, is high memory usage.&lt;/p&gt;
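&lt;p&gt;The per-call figure comes straight from the AI.INFO counters: the duration field is cumulative inference time in microseconds across all calls (a quick back-of-the-envelope sketch):&lt;/p&gt;

```python
# AI.INFO reports cumulative inference time; divide by the call count for the average.
duration_us = 1_767_749   # "duration" field, microseconds
calls = 20                # "calls" field
per_call_us = duration_us / calls
print(f"{per_call_us:.2f} microseconds per call (~{per_call_us / 1e6:.6f} s)")
# 88387.45 microseconds per call (~0.088387 s)
```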

&lt;p&gt;There are many ways to optimize this deployment further. For example, you can add FP16 quantization and the ONNX runtime. If you want to try that, &lt;a href="https://github.com/applied-knowledge-systems/the-pattern-api/blob/7bcf021e537dc8d453036730f0a993dd52e1781f/qasearch/export_load_bert.py"&gt;this script&lt;/a&gt; is a good starting point.&lt;/p&gt;
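&lt;p&gt;To see why FP16 helps, here is a minimal NumPy sketch (illustrative only; the linked script does the real model export). Halving the width of every weight tensor halves the model's memory footprint, at the cost of reduced numeric precision:&lt;/p&gt;

```python
import numpy as np

# FP16 quantization: casting FP32 weights to FP16 halves their memory,
# at the cost of reduced numeric precision.
w32 = np.random.randn(1024, 1024).astype(np.float32)
w16 = w32.astype(np.float16)
print(w32.nbytes, "bytes vs", w16.nbytes, "bytes")  # 4194304 bytes vs 2097152 bytes
```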

&lt;h3&gt;
  
  
  Using Grafana to monitor RedisGears throughput, CPU, and Memory usage
&lt;/h3&gt;

&lt;p&gt;Thanks to the contribution of &lt;a href="https://volkovlabs.com/from-a-basic-redistimeseries-data-source-to-2-million-downloads-in-grafana-marketplace-9921ed9ac5a"&gt;Mikhail Volkov&lt;/a&gt;, we can now observe RedisGears and RedisGraph throughput and memory consumption using Grafana. When you cloned the repository, it also started a Grafana Docker container with pre-built dashboards for monitoring the Redis cluster (including RedisGears and RedisAI) and the graph node (Redis with RedisGraph). The “The Pattern” dashboard provides an overview with all the key benchmark metrics you care about:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_TTCkmBY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Acb0U5fY0rbBGlbqvAtkQLg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_TTCkmBY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Acb0U5fY0rbBGlbqvAtkQLg.png" alt="" width="880" height="356"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Grafana for RedisGraph&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v00__1sx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AaUvP_YGGFg44kMkaASy8lg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v00__1sx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AaUvP_YGGFg44kMkaASy8lg.png" alt="" width="880" height="448"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Grafana for RedisCluster&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This post is in collaboration with Redis.&lt;/p&gt;

</description>
      <category>python</category>
      <category>redis</category>
      <category>nlp</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Announcing Reference Architecture for AI: Build on Redis with Python</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Thu, 16 Jun 2022 16:10:28 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/announcing-reference-architecture-for-ai-build-on-redis-with-python-445g</link>
      <guid>https://dev.to/alexmikhalev/announcing-reference-architecture-for-ai-build-on-redis-with-python-445g</guid>
      <description>&lt;p&gt;We launch in two full-featured articles — &lt;a href="https://reference-architecture.ai/docs/nlp/"&gt;NLP ML pipeline for&lt;/a&gt; turning unstructured JSON text into a knowledge graph and fresh off the press &lt;a href="https://reference-architecture.ai/docs/bert-qa-benchmarking/"&gt;Benchmarks for BERT Large Question Answering inference for RedisAI and RedisGears&lt;/a&gt; with Graphana Dashboards by &lt;a href="https://volkovlabs.com/from-a-basic-redistimeseries-data-source-to-2-million-downloads-in-grafana-marketplace-9921ed9ac5a"&gt;Mikhail Volkov&lt;/a&gt;. Further announcement below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://reference-architecture.ai/posts/post-0/"&gt;Announcing Reference Architecture for AI&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>reference</category>
      <category>ai</category>
    </item>
    <item>
      <title>Ethics of creativity: is it the same for scientists, engineers or other creative professions?</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Sat, 19 Feb 2022 00:17:11 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/ethics-of-creativity-is-it-the-same-for-scientists-engineers-or-other-creative-professions-12gm</link>
      <guid>https://dev.to/alexmikhalev/ethics-of-creativity-is-it-the-same-for-scientists-engineers-or-other-creative-professions-12gm</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--of243uUB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2Ah_BZbya_w-1Plh-Y" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--of243uUB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2Ah_BZbya_w-1Plh-Y" alt="" width="880" height="586"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Photo by Clark Tibbs on Unsplash&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are common assumptions about ethics for scientists:&lt;/p&gt;

&lt;p&gt;That bioscientists working on even a harmless virus will put effort into containing it, and will think about the consequences of releasing such a virus into the public, even one as benign as forcing people to sneeze at 8 a.m.: all the drivers on a motorway sneezing together would have dire consequences.&lt;/p&gt;

&lt;p&gt;That physicists or radiologists working on new radioactive materials or tracing methods will refrain from spreading radioactive material in front of their labs just to see how much the passers-by glow.&lt;/p&gt;

&lt;p&gt;In the software industry, when you create a computer virus you can be called a hero, a hacker or a criminal, depending on whether your virus damages or repairs infrastructure (say, a virus patching vulnerable DNS servers), demands ransom, or destroys people’s valuables.&lt;/p&gt;

&lt;p&gt;But the ethics rules seem to work differently for content creators of numerous social networks.&lt;/p&gt;

&lt;p&gt;When you create a movie or cartoon that affects the human brain and damages children’s behaviour, what should that creative person be called?&lt;/p&gt;

&lt;p&gt;Through cognitive psychology research, we know how to trick humans into binge-watching, or into the slot-machine pull-to-refresh of social media, and we apply that knowledge. Even adults can’t resist it, yet even children’s cartoon series have a “hook” at the end of each episode, and the hook may target a PG6 or PG10 audience: children whose strategic brain functions will only develop over the next ten years or more.&lt;/p&gt;

&lt;p&gt;The variety of the content affecting our children is very substantial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Noise: content like “kids playing with toys”. As a parent, I want my children to play with their toys, not watch other people playing with them.&lt;/li&gt;
&lt;li&gt;Behaviour damage: modern high-quality cartoons like Booba (and Masha &amp;amp; the Bear, Tom &amp;amp; Jerry, Grizzy &amp;amp; the Lemmings, and a large amount of video material tagged “kids” on YouTube). I can trace behaviour changes in my 5-year-old son depending on which cartoons he watches; other parents confirm this observation.&lt;/li&gt;
&lt;li&gt;Outdated archetypes: if you are the parent of a daughter, you may not agree with the archetype introduced by Cinderella, where the role of the good daughter is to do chores and to sit and wait for the fairy godmother and Prince Charming.&lt;/li&gt;
&lt;li&gt;Purely toxic and harmful content, which introduces self-harm and suicidal behaviour.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And there are creators and re-posters of such content. Some do it for money, some for fame and some out of ignorance.&lt;/p&gt;

&lt;p&gt;But it doesn’t have to be that way:&lt;/p&gt;

&lt;p&gt;Ethics is a personal choice of every person, and we can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Make ethical choices personally, building on existing philosophical and religious ethical frameworks, whether Stoic philosophy or the Ten Commandments.&lt;/li&gt;
&lt;li&gt;Think about the consequences of our actions, long term and across the lifecycle of our community, our family and ourselves.&lt;/li&gt;
&lt;li&gt;Build not only profitable and lawful but also ethical and valuable products and services.&lt;/li&gt;
&lt;li&gt;Expand and build ethical AI to remind humans about their moral choices and biases.&lt;/li&gt;
&lt;li&gt;Build intelligent AI filters for ourselves and share them with others.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, why are the best and brightest minds of our data science and engineering community focused on building advanced noise generators like GPT-3?&lt;/p&gt;

</description>
      <category>engineering</category>
      <category>creativity</category>
      <category>ethics</category>
      <category>leadership</category>
    </item>
    <item>
      <title>Soft skills for the modern world</title>
      <dc:creator>AlexMikhalev</dc:creator>
      <pubDate>Wed, 16 Feb 2022 09:28:27 +0000</pubDate>
      <link>https://dev.to/alexmikhalev/soft-skills-for-the-modern-world-f7c</link>
      <guid>https://dev.to/alexmikhalev/soft-skills-for-the-modern-world-f7c</guid>
      <description>&lt;p&gt;For data scientists and engineers, soft skills for the modern world are not the old &lt;a href="https://www.oxfordreference.com/view/10.1093/oi/authority.20110803105813319"&gt;Trivium&lt;/a&gt; — rhetoric, logic, and grammar. My view on the set of skills that allow you to navigate complexity in the modern world, common for engineers, architects and data scientists:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Systems Engineering (Thinking): the ability to identify and select your own system of interest, its operational environment, emergent properties, processes and lifecycle. See INCOSE material and ISO 42010.&lt;/li&gt;
&lt;li&gt;The ability to identify domains, stakeholders, concerns and their needs/requirements. See INCOSE material, IEEE 15288:2015 and the NIST standard for Cyber-Physical Systems.&lt;/li&gt;
&lt;li&gt;The ability to formulate a hypothesis with supporting measurements (see “Feynman on Scientific Method”).&lt;/li&gt;
&lt;li&gt;Stakeholder management: working through conflicting goals and requirements using the Theory of Constraints (Thinking Tools); see the extensive material from the TOC community by E. Goldratt, E. Schragenheim and many others. For a short introduction, see Clark Ching’s “Bottleneck Rules”.&lt;/li&gt;
&lt;li&gt;Technical conflict resolution (e.g. “the metal shall be both soft and hard”): see TRIZ and its variations, TRIZ+ and BioTRIZ.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The above requires the base ability to work productively within a given time (GTD, Superfocus, Pomodoro) and to work with long texts (Zettelkasten), which leads to identifying and building a taxonomy and ontology of your own concepts.&lt;/p&gt;

&lt;p&gt;When you focus on the old Trivium alone, you can’t build complex systems or organisations (systems of systems), and you can’t make good decisions based on well-sounding arguments. To quote a conversation with one of my friends:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Friend: “Agile/DevOps is a method of delivering things faster/better/cheaper. Would you like it?”&lt;/li&gt;
&lt;li&gt;Me: Of course, yes.&lt;/li&gt;
&lt;li&gt;Friend: And I didn’t say anything meaningful until now.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building complex and fruitful systems requires precise communication and shared goals. Let’s focus on building complex, valuable, ethical systems.&lt;/p&gt;

</description>
      <category>engineering</category>
      <category>systemsthinking</category>
      <category>datascience</category>
      <category>softskills</category>
    </item>
  </channel>
</rss>
