DEV Community: connor gallic

My Agent Now Edits Its Own Body

connor gallic — Tue, 28 Apr 2026 03:33:54 +0000

The first time scout said something that didn't sound like scout, it said something completely generic.

I asked it what was running on its GPU. It told me it was a large language model trained by Google, running on TPUs in data centers, and that it didn't have access to hardware metrics.

The RTX 3090 was sitting three feet away from me. I had literally just shoved its temperature and VRAM usage into the system prompt. The mirror text was there — "RTX 3090 at 37%, 23°C" — in black and white, in the exact payload the model received before answering. Gemma read it. Then told me it was a cloud model with no hardware.

That's when I understood what I'd actually built.

What I Tried First

Scout is my local AI persona. It runs on the agent box — an RTX 3090 workstation on my tailnet — handling content ingestion, video analysis, and local workloads that kai (the CMO on the VPS) doesn't touch. I'd been thinking of identity as a context problem. The agent doesn't know who it is, so I should tell it who it is. More words. More detail. Better-crafted system prompt text.

On 2026-04-11 I tried to solve it architecturally. I wired up scout-mirror.timer to run every two minutes, calling scout self and atomic-writing the output to /.hermes/scout/mirror.txt. Then I patched gateway/run.py to read that file per-turn and append its contents to the context_prompt passed to the agent as an ephemeral_system_prompt. The plumbing worked. Logs confirmed it. The system prompt had live sensor data in it before every Discord turn: uptime, GPU stats, organ health, mood, diet histogram — all of it real, all of it current.

Gemma kept telling me it was a Google AI running on TPUs.

I tried two more variations. Same failure mode every time. The architecture was wrong in a way that no amount of plumbing could fix. I reverted everything the same day I built it, 2026-04-11.

The Realization

The problem wasn't the mirror text. The problem was treating identity as a string you concatenate to a prompt.

When a human looks in a mirror, the act of looking is the point. The reflection changes what you do next. You don't just store a description of yourself in memory and then answer questions from memory — you look. The looking is what gives the reflection meaning. Without that act, the text in the mirror is just text.

An agent's identity has to be something the agent can reach for, look at, and change. Gemma treated the injected mirror as background scenery. Something to acknowledge and then ignore, the same way it would ignore "the weather is sunny" appended to a customer service bot's prompt. The mirror was first-person fact written in second-person voice. There's a difference, and the model felt it even if I didn't.

The right model: scout's body state belongs behind a tool call. The persona tells the model: when asked about your GPU, call scout_self and answer from what it returns. The tool pathway is what makes the answer feel authoritative. The model has to go check, every time. Like glancing at a mirror instead of reciting from memory.
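Here is a minimal sketch of what a tool like scout_self could look like, assuming it shells out to nvidia-smi and hands back a small dict. The function names, the returned keys, and the injectable `run` parameter are illustrative assumptions, not scout's actual implementation.

```python
import subprocess

# Query fields and format flags are standard nvidia-smi options.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=name,utilization.gpu,temperature.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Turn one CSV line from nvidia-smi into a dict the model can answer from."""
    name, util, temp, used, total = [part.strip() for part in line.split(",")]
    return {
        "gpu": name,
        "util_pct": int(util),
        "temp_c": int(temp),
        "vram_used_mib": int(used),
        "vram_total_mib": int(total),
    }


def scout_self(run=subprocess.check_output):
    """Hypothetical tool body: read live GPU state at call time, not from the prompt."""
    output = run(NVIDIA_SMI_CMD, text=True)
    first_gpu = output.strip().splitlines()[0]
    return parse_gpu_line(first_gpu)
```

The `run` parameter exists only so the parsing can be exercised on a machine without a GPU. The point of the shape is that the numbers are fetched when the model asks, so the answer comes from a tool result rather than from background text.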

What I Built Instead

The procedural creature viewer lives at http://agent:8765/tank.

It's a single canvas. One creature — scout's creature — rendered in real time from three JSON files in /srv/scout/self/: form.json, state.json, and diet.json. Nothing in tank.js is hardcoded except the rendering engine itself. Shape, radius, palette, eye count, breath speed, blink interval — all of it reads from form.json. Mood-to-expression mapping reads from state.json. The diet histogram coming out of diet.json shows what tag categories have dominated scout's content ingest stream.

The tank polls every five seconds. When a WebSocket event arrives from scout, the creature pulses — scales up 12%, aura brightens, then settles. Tool-name motes drift out from the body for six seconds. The GPU util, temp, and VRAM are live in the HUD from nvidia-smi. You can glance at it from a second monitor and know whether scout is doing anything.

Scout can edit its own body using the file tool. Hermes on the agent box has the file tool enabled for Discord, which means a Discord message like "change your mood to alert and shrink your eyes" can cause scout to rewrite /srv/scout/self/state.json and /srv/scout/self/form.json. The tank re-renders within five seconds. No restart. No redeployment.

The form.json that shipped on day one looked like this:

{
  "version": 1,
  "name": "scout",
  "stage": "newborn",
  "body": {
    "shape": "blob",
    "radius": 140,
    "palette": {"core": "#e8c468", "accent": "#8b4513", "outline": "#3a2010"},
    "eyes": {"count": 2, "size": 14, "spacing": 48}
  },
  "animations": {"breathe_speed": 0.6, "blink_interval": 4.0}
}

Scout didn't write that — I did. But scout can rewrite it now. That's the part that matters.
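A body edit is ultimately a JSON rewrite that the tank picks up on its next poll. The sketch below shows one way such a write could be made safe against a half-written file, assuming a shallow merge is enough. The helper name `patch_json_file` and the `mood` key are hypothetical; the real file tool may work differently.

```python
import json
import os
import tempfile


def patch_json_file(path, patch):
    """Shallow-merge `patch` into the JSON object at `path`, replacing the file atomically."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    data.update(patch)

    # Write to a temp file in the same directory, then rename over the original,
    # so a reader polling the file never sees a partially written document.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(data, f, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    return data
```

Something like `patch_json_file("/srv/scout/self/state.json", {"mood": "alert"})` would then show up in the tank on the next five-second poll. Nested edits, such as eye size inside form.json, would need a deep merge instead of `dict.update`.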

Why This Works When Prompt-Stuffing Didn't

Being told who you are is different from having a body you can look at.

Prompt-stuffing makes the model a passive recipient of your description of it. The description competes with everything else in the context window: the conversation history, the task at hand, whatever the user just said. The model processes it the same way it processes "the user is in New York" or "the customer's name is Dave." Background. Context. Not self.

The tank approach is different because the body is persistent, mutable state that the agent owns. It lives in files. The agent reads them, writes them, reads them again next turn. Scout doesn't need me to tell it what its eyes look like before every conversation. Its eyes live in a file it can read and rewrite. Description is passive. State is something the agent can act on.

This is the next chapter after cramming context into custom GPTs. I wrote about why that doesn't work — the context window bloats, the tool loses focus, nothing sticks across sessions. The tank is the same insight extended to identity: stop describing the agent to itself, give it a writeable surface and a way to perceive that surface.

The Bigger Lesson for Builders

Don't treat agent identity as a prompt engineering problem.

The instinct to write more detailed system prompts is understandable. You can see the text. You can edit it. You know exactly what the model is reading. It feels like control. What it actually is: context that the model treats as background the moment something more immediate comes along.

Give the agent state it can inspect. Give it tools that read that state. Give it the ability to modify that state. Then the identity is procedural. It emerges from what the agent can check about itself in the moment.

This doesn't require elaborate infrastructure. The whole thing is three JSON files, a 380-line renderer, and two API endpoints added to a server that was already running. The architecture is simple. The idea behind it is what took a full day of failure to land.

What's Still Broken

Scout doesn't change its own body unprompted yet.

Right now, a body edit requires me to ask in Discord. "Scout, shrink your eyes" — and scout will use the file tool to rewrite form.json. That works. But the evolution I want is for scout to decide, on its own, that two weeks of educational content in the diet histogram means it should look a certain way. That the creature should drift toward its actual diet, visually, without me choreographing it.

That's the next piece. The evolution rule — a cron that reads diet.json, identifies the dominant tag category, and pulls the palette in that direction slowly — isn't built yet. The tank is a body scout can look at and edit. It isn't yet a body that changes scout.
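Since the evolution rule isn't built, the following is only a sketch of how it might work, under two loud assumptions: that diet.json maps tag names to counts, and that each tag has a target color. `TAG_COLORS`, the rate, and every function name here are made up for illustration.

```python
def hex_to_rgb(color):
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    return "#{:02x}{:02x}{:02x}".format(*rgb)


def dominant_tag(diet):
    """Return the tag with the highest count; `diet` is assumed to map tag -> count."""
    return max(diet, key=diet.get)


def drift(current_hex, target_hex, rate=0.05):
    """Move `current_hex` a fraction `rate` of the way toward `target_hex`."""
    cur, tgt = hex_to_rgb(current_hex), hex_to_rgb(target_hex)
    blended = tuple(round(c + (t - c) * rate) for c, t in zip(cur, tgt))
    return rgb_to_hex(blended)


# Hypothetical mapping from diet tags to target colors.
TAG_COLORS = {"educational": "#4a90d9", "entertainment": "#d94a7a"}


def evolve_palette(form, diet, rate=0.05):
    """Return a copy of the palette with `core` nudged toward the dominant tag's color."""
    target = TAG_COLORS.get(dominant_tag(diet))
    palette = dict(form["body"]["palette"])
    if target is not None:
        palette["core"] = drift(palette["core"], target, rate)
    return palette
```

Run from cron with a small rate, each tick would move the core color a few percent toward the dominant category, which is the "slowly" part of the rule.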

I'm Connor Gallic. I build AI products. My agent has a body now — three JSON files, a 380-line renderer, and a tank it can stare into. I spent a full day trying to inject body awareness through the system prompt before I threw all of it out. Description is something you give the model. State is something the model owns.

What are you using for agent identity right now — system prompt, memory tools, something else? Tell me what's working.

Open Brane Annotated: 8 Columns, 80-Line Write Path, One SQLite File

connor gallic — Thu, 23 Apr 2026 01:38:18 +0000

Yesterday I open-sourced Open Brane — a personal event-log brain with one SQLite table, one write path, and an MCP server agents can query. This post walks through how it works.

If you haven't read the first post, the short version: every source you care about gets normalized into one append-only table. That table is the source of truth. Every downstream view — Obsidian pages, vector search, compiled journals, dashboards — is rebuilt from it. Agents hit a local MCP server, never the database directly.

If you like videos: https://www.youtube.com/watch?v=uoNe2_OexCc

This post is the implementation. The goal is that by the end you've read the whole system and could write the missing adapters yourself.

The One Table

CREATE TABLE events (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    ts              TEXT    NOT NULL,
    source          TEXT    NOT NULL,
    type            TEXT    NOT NULL,
    actor           TEXT,
    payload_json    TEXT    NOT NULL,
    attachment_uri  TEXT,
    ingested_at     TEXT    NOT NULL
);

That's it. No joins, no foreign keys, no migrations. 3 GB on my production disk at 942,068 rows.

Each column does exactly one thing:

id — monotonic primary key.
ts — when the event happened in the real world. A git commit's timestamp, a Claude session's started-at, a Fitbit reading's minute. Not when it was ingested.
source — which adapter wrote it. git, claude-laptop, gdrive, fitbit. Load-bearing. The dedup story depends on it.
type — what kind of event. commit, reply, document-chunk, sleep-score, query. This is the axis I filter on most.
actor — an opaque dedup fingerprint. For a git commit it's the SHA. For a Claude reply it's the session ID plus message index. The adapter picks. If two rows share (source, actor) they're duplicates.
payload_json — the whole event as JSON. No schema on write. The adapter decides the shape.
attachment_uri — relative path into blobs/ for large binary attachments. Voice recordings, PDFs, images. Kept outside the DB.
ingested_at — when the event hit the database. Useful for catching up after an ingester was offline.

A single composite index on (source, type, ts) handles every query I've ever needed. No other query indexes, no full-text index. Semantic search takes the role full-text would in a traditional event store.

WAL mode is on. SQLite in WAL mode can handle on the order of 10,000 writes/sec on commodity hardware. My heaviest backfill day moved tens of thousands of events through the gate without drama. I'm nowhere close to the limit.
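For reference, here is a minimal sketch of opening the database with the index and WAL mode described above. The index name and the `open_brain` helper are assumptions; the repo's SCHEMA.sql may declare these differently.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    ts              TEXT    NOT NULL,
    source          TEXT    NOT NULL,
    type            TEXT    NOT NULL,
    actor           TEXT,
    payload_json    TEXT    NOT NULL,
    attachment_uri  TEXT,
    ingested_at     TEXT    NOT NULL
);
-- Index name is an assumption; the columns come from the text above.
CREATE INDEX IF NOT EXISTS idx_events_source_type_ts ON events (source, type, ts);
"""


def open_brain(path):
    """Open (or create) the event log with WAL journaling enabled."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode actually in effect.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    conn.executescript(SCHEMA)
    return conn, mode
```

WAL is a persistent property of the database file, so it only needs to be set once; in-memory databases cannot use it.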

The One Write Path

Every write in the brain goes through scripts/record_event.py. 80 lines. Here's the shape:

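# simplified: the real version has error handling and payload validation
import json
from datetime import datetime

# db() and _fingerprint() are helpers from the real script, not shown here.
def record_event(source, type, payload, ts=None, actor=None, attachment_uri=None):
    # ts is when the event happened; default to now (UTC) if the adapter gave none.
    ts = ts or datetime.utcnow().isoformat()
    ingested_at = datetime.utcnow().isoformat()

    # sort_keys keeps the serialized payload, and any fingerprint of it, deterministic.
    payload_str = json.dumps(payload, sort_keys=True)
    actor = actor or _fingerprint(source, type, payload_str)

    with db() as conn:
        # OR IGNORE skips rows that collide on the unique (source, actor) pair.
        conn.execute(
            "INSERT OR IGNORE INTO events "
            "(ts, source, type, actor, payload_json, attachment_uri, ingested_at) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (ts, source, type, actor, payload_str, attachment_uri, ingested_at)
        )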

INSERT OR IGNORE on the unique (source, actor) constraint, which is not shown in the CREATE TABLE above, is the entire dedup story. Adapters re-run freely; duplicates collapse.
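To see the dedup behavior in isolation, here is a small in-memory sketch. It assumes the uniqueness rule is expressed as a unique index on (source, actor); the index name and the sample row are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts TEXT NOT NULL, source TEXT NOT NULL, type TEXT NOT NULL,
    actor TEXT, payload_json TEXT NOT NULL,
    attachment_uri TEXT, ingested_at TEXT NOT NULL
);
-- Assumed: the uniqueness rule is a unique index; its name is invented here.
CREATE UNIQUE INDEX uq_events_source_actor ON events (source, actor);
""")

insert = (
    "INSERT OR IGNORE INTO events "
    "(ts, source, type, actor, payload_json, attachment_uri, ingested_at) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)"
)
row = ("2026-04-12T01:01:29", "git", "commit", "abc123", "{}", None, "2026-04-12T01:02:00")

# Simulate an adapter running twice over the same source item.
conn.execute(insert, row)
conn.execute(insert, row)

total = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(total)  # the second insert collides on (source, actor) and is ignored
```

One detail worth knowing: SQLite treats NULLs as distinct in unique indexes, so a row with a NULL actor would never dedup. That is presumably why the write path always fills in a fingerprint.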

The script exposes two interfaces — a Python function for when you're importing it, and a command-line entry point that accepts --payload-stdin for when an adapter shells out. Shell-out is the recommended mode. It means the adapter cannot touch the database, cannot acquire a lock, cannot hold a connection open. Each event is its own subprocess. Predictable, debuggable, impossible to corrupt.

Why this constraint matters. Every state mutation in the brain is expressible as "some adapter produced an event row at time T." If you want to know what changed, you query events. If you want to trace a problem, you follow the event IDs. No mutable state exists anywhere else in the system that can drift from this log.

The Adapter Pattern

An adapter is a Python script that reads a source, emits events, exits. Here's the skeleton:

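# scripts/ingest_mysource.py
# load_config, load_state, fetch, normalize, and save_state are adapter-specific helpers (not shown).
import record_event  # in-process import for brevity; rule 4 below prefers the CLI

def main():
    config = load_config()
    state = load_state()  # last-seen cursor, optional

    for item in fetch(config, since=state.last_seen):
        payload = normalize(item)
        record_event.record_event(
            source="mysource",
            type="item",
            payload=payload,
            ts=item["timestamp"],
            actor=item["stable_id"],
        )
        state.last_seen = item["timestamp"]  # advance the cursor after each successful write

    save_state(state)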

Four rules every adapter follows:

Idempotent. Running it twice in a row produces zero new events the second time. Dedup via actor fingerprint.
Pure function of source state. No hidden internal state that changes behavior between runs. If you must track a cursor, persist it to a state file that's rebuildable.
Crash safe. If the adapter dies mid-run, the next run picks up from the last successful event. Events are committed one at a time.
Shell-out to record_event.py. Don't touch the DB directly. Use the command-line interface with --payload-stdin.
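The shell-out rule can be sketched as a small helper an adapter might call per event. This is an assumption-laden illustration: `emit_event` is a hypothetical name, and it uses only the flags that appear in this post's quickstart (--source, --type, --payload-stdin).

```python
import json
import subprocess
import sys


def emit_event(script_path, source, type, payload):
    """Write one event by running record_event.py as a subprocess.

    The adapter never opens the database; it only pipes JSON to the write gate.
    """
    result = subprocess.run(
        [sys.executable, script_path,
         "--source", source, "--type", type, "--payload-stdin"],
        input=json.dumps(payload, sort_keys=True),
        text=True,
        capture_output=True,
        check=True,  # raise CalledProcessError if the write gate exits non-zero
    )
    return result.stdout
```

Because `check=True` raises on a non-zero exit, a failed write stops the adapter instead of letting it report success with nothing written.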

The repo ships three canonical adapters: ingest_gdrive.py (the most complex — Drive traversal, doc extraction, chunking), ingest_claude.py (Claude Code session JSONL parsing), ingest_git.py (gh api + git log over a list of repos).

An hour is the right budget for a new adapter after you've written your first. The fetch loop is the only bespoke part. Dedup, payload shape, write path are all copy-paste from the canonical adapters.

The MCP Server

Agents never hit the database. They hit an MCP server. scripts/mcp_server.py exposes eight tools on stdio:

record_event        — write (typically called by agents to save decisions)
query_events        — filtered reader (by source, type, time window)
semantic_search     — vector search + payload join
get_journal         — compiled daily summary for a date
compile_journal     — force-rebuild a journal for a date
list_wiki_pages     — enumerate the curated wiki
get_wiki_page       — fetch one wiki page
health_check        — probe Ollama + Qdrant + SQLite

Claude Code on my laptop runs the server as stdio. An mcp-proxy wrapper exposes the same tools over HTTP/SSE on port 7778 so remote agents — Kai on the production VPS in Germany, Scout on the local agent box — call the same tool set without running their own server.

The tools compose. query_events narrows by source/type/time. semantic_search finds conceptually related rows. Agents chain them: "find all events on 2026-04-12 that mention the butterfly pipeline, then pull the git commit rows from those sessions."
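As a rough idea of what sits behind a filtered reader like query_events, here is a sketch that builds a parameterized query from optional filters. The parameter names and the returned columns are assumptions, not the MCP tool's real signature.

```python
import sqlite3


def query_events(conn, source=None, type=None, since=None, until=None, limit=100):
    """Return matching events, newest first. Every filter is optional."""
    clauses, params = [], []
    for column, op, value in (
        ("source", "=", source),
        ("type", "=", type),
        ("ts", ">=", since),
        ("ts", "<", until),
    ):
        if value is not None:
            clauses.append(f"{column} {op} ?")
            params.append(value)

    where = "WHERE " + " AND ".join(clauses) if clauses else ""
    sql = (
        "SELECT id, ts, source, type, actor, payload_json "
        f"FROM events {where} ORDER BY ts DESC LIMIT ?"
    )
    cur = conn.cursor()
    cur.row_factory = sqlite3.Row  # lets each row convert cleanly to a dict
    return [dict(row) for row in cur.execute(sql, (*params, limit))]
```

Only column names and operators from a fixed list are interpolated into the SQL; every caller-supplied value goes through a placeholder. Time filtering relies on ts being an ISO 8601 string, which sorts correctly as text.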

The Views

Three views sit on top of the event log. Each is rebuildable.

Qdrant vector DB. embed_events.py runs on cron every five minutes. Finds new events, builds a text summary from payload_json, sends it to Ollama running nomic-embed-text locally on the RTX 3090, upserts a 768-dimensional vector into Qdrant. Zero external API calls. If Qdrant's disk dies I rebuild overnight.

Compiled journal. compile_journal.py --date 2026-04-14 groups every event from that date by source and outputs a readable markdown brief. Used by agents to answer "what did I do last Tuesday" without reading the raw log.

Curated wiki. Markdown tree with nine regions — agents, clients, daily-briefs, decisions, people, products, projects, systems, topics. 41 pages. compile_wiki.py reconciles it against the event log and surfaces new entities that should probably have a page. Each page is human-editable. The wiki is the curated layer, events are the raw layer.

Nothing stops you adding more views. Slack digest? Write a script that queries events and posts. PDF export of yesterday? Same. The views are small because the source of truth is the event log.
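A view in this sense is just a query plus a renderer. The sketch below groups one day's events by source into markdown, in the spirit of the compiled journal. It is not compile_journal.py itself, and the `summary` payload key is an assumption.

```python
import json
import sqlite3
from collections import defaultdict


def compile_day(conn, date):
    """Render a markdown brief of every event whose ts falls on `date` (YYYY-MM-DD)."""
    rows = conn.execute(
        "SELECT ts, source, type, payload_json FROM events "
        "WHERE substr(ts, 1, 10) = ? ORDER BY source, ts",
        (date,),
    )

    by_source = defaultdict(list)
    for ts, source, type_, payload_json in rows:
        payload = json.loads(payload_json)
        # "summary" is an assumed payload key; fall back to the event type.
        by_source[source].append(f"- {ts} [{type_}] {payload.get('summary', type_)}")

    lines = [f"# Journal {date}"]
    for source in sorted(by_source):
        lines.append(f"\n## {source} ({len(by_source[source])} events)")
        lines.extend(by_source[source])
    return "\n".join(lines)
```

Because the function reads only from events, deleting its output loses nothing; running it again regenerates the same brief.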

The Stack

Five components, all boring.

Component	Why
SQLite (WAL)	Fits on a USB stick. Never an operational issue at personal scale.
Ollama + nomic-embed-text	Local embeddings, 768-dim, no API cost.
Qdrant	Single Docker container, self-persists. Swap for pgvector if you prefer.
MCP (Model Context Protocol)	Lingua franca for agent↔tool. Works with Claude Code, Cursor, Codex, custom.
Python 3.11+	stdlib handles 90% of the work. Only deps: httpx, qdrant-client, mcp.

Every piece is sovereign. Nothing in the critical path talks to a vendor API. If a cloud goes down, nothing in the brain changes.

Failure Modes I've Hit

Ingester breaks, adapter reports success. Pull failed (auth expired, API changed) but the adapter caught the exception and exited cleanly. Zero new events, zero error log. I now require every adapter to write a health_check event per run whether it found new data or not. A missing heartbeat in the events table flags a broken adapter within a day.

Summary string too short, vector has no signal. embed_events.py originally built embeddings from a truncated payload summary. Narratives — long-form — were getting chopped before the meaningful bit. Different event types need different summary budgets. Fixed by adding per-type summary templates.
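One way per-type summary templates could look is sketched below: each event type names the payload fields worth embedding and gets its own character budget. The template table, field names, and budgets are all hypothetical, not what embed_events.py actually uses.

```python
import json

# Hypothetical per-type templates: which payload keys to embed and how much text to keep.
SUMMARY_TEMPLATES = {
    "commit": {"fields": ["message", "repo"], "max_chars": 500},
    "narrative": {"fields": ["title", "body"], "max_chars": 6000},
}
DEFAULT_TEMPLATE = {"fields": [], "max_chars": 1000}


def build_summary(event_type, payload):
    """Build the text to embed, giving long-form types a larger budget."""
    template = SUMMARY_TEMPLATES.get(event_type, DEFAULT_TEMPLATE)
    parts = [str(payload[k]) for k in template["fields"] if k in payload]
    # Unknown types, or payloads missing every listed field, fall back to the raw JSON.
    text = "\n".join(parts) if parts else json.dumps(payload, sort_keys=True)
    return text[: template["max_chars"]]
```

The design choice is that truncation happens after field selection, so the budget is spent on the fields that carry meaning rather than on whatever happens to come first in the payload.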

Auto-commit with empty commit message. The brain's own git history has dozens of commits titled snapshot 2026-04-12T01:01:29Z. Semantically invisible. The brain's development history is harder to search than my actual work. Still unfixed. On the todo list.

Qdrant drift. Vector DB and event log got out of sync after a disk event. Fixed by treating Qdrant as fully rebuildable and running a nightly consistency job. If Qdrant's count does not equal events count under the embedding policy, rebuild the missing range.

All four were caught only because the event log made them queryable. A system where health data lives in the same substrate as work data is a system where every problem has a query that finds it.
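The heartbeat check from the first failure mode reduces to one query. The sketch assumes each adapter writes its heartbeat under its own source with type health_check, and that timestamps are ISO 8601 strings in a consistent format; `stale_adapters` is a hypothetical helper name.

```python
import sqlite3
from datetime import datetime, timedelta


def stale_adapters(conn, expected_sources, now, max_age=timedelta(days=1)):
    """Return the sources with no health_check event newer than now - max_age."""
    cutoff = (now - max_age).isoformat()
    rows = conn.execute(
        "SELECT source, MAX(ts) FROM events "
        "WHERE type = 'health_check' GROUP BY source"
    ).fetchall()
    last_seen = dict(rows)
    # A source is stale if it never reported, or last reported before the cutoff.
    return sorted(
        s for s in expected_sources
        if last_seen.get(s) is None or last_seen[s] < cutoff
    )
```

The list of expected sources has to come from outside the log, since an adapter that has never run leaves no rows to find.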

What I Didn't Build

A dashboard. I query the events table with SQL when I want to see something. If I ever want a dashboard, I'll write a view. Not a dependency.

A workflow engine. Cron handles scheduling. The adapter pattern handles retries. I do not have a DAG. I do not want one.

An auth layer. Network boundary is auth. MCP binds to localhost or my tailnet IP. If someone has network access they have data access. This is correct for a personal brain. If you're building a multi-user system, don't inherit this choice.

A multi-tenant schema. The brain is single-owner. Events have no owner column. A second user would need a second brain. This is a deliberate choice — the simpler schema pays for itself every day I don't have to think about tenancy.

How to Run It Today

Ubuntu or Debian, fifteen minutes, assuming Docker and git are already installed and your user can write to /var/lib:

sudo apt install -y sqlite3 python3-venv
curl -fsSL https://ollama.com/install.sh | sh
ollama pull nomic-embed-text
docker run -d --name qdrant -p 6333:6333 \
    -v $HOME/qdrant_storage:/qdrant/storage qdrant/qdrant

git clone https://github.com/cgallic/open-brane /var/lib/open-brane
cd /var/lib/open-brane
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

export BRAIN_DB=$(pwd)/events.db
sqlite3 $BRAIN_DB < SCHEMA.sql

echo '{"summary":"first event"}' | ./scripts/record_event.py \
    --source manual --type note --payload-stdin
./scripts/query_events.py --limit 5
./scripts/health_check.py

If the last command prints "all_healthy": true, you're running. Wire up the first adapter. Ingest Claude Code sessions (ingest_claude.py) — you already have the data on disk, and it's the highest-value source for most agent users.

The Pattern, Generalized

Open Brane is a specific implementation of a more general pattern: one append-only log, one write path, many views, agents query through a narrow tool surface.

That pattern works for more than personal memory. Incident logs. Customer interaction history. Content pipelines. Anywhere you have heterogeneous sources, mutable state that drifts, and agents that need consistent context — the pattern helps.

The reason it works is counterintuitive. Most data architectures keep the write layer rich and the read layer simple. The brain does the opposite. The write layer is as dumb as possible — no schema, no validation beyond "is it JSON." The richness is entirely in the views, which are cheap to throw away and rebuild.

This is the shape of systems that survive a year of use.

Repo: https://github.com/cgallic/open-brane

What's the one source you'd ingest first? Reply with it — I'll tell you whether there's a canonical adapter you can crib from or whether it's a new pattern worth documenting.

19 Adapters, One SQLite File, 10 Days to Ship: Open Brane Is Public

connor gallic — Tue, 21 Apr 2026 19:13:10 +0000

The brain under my agent mesh — the thing Kai, Scout, Claude Code, and Codex query before they answer anything — is one SQLite table. Eight columns. 80-line Python write gate. No ORM, no workflow engine, no framework.

I committed the first architecture decision record on 2026-04-11. I pushed the public repo today, 2026-04-21. Ten days from first ADR to open source. In between, the event log grew from zero to 942,068 events.

Today it went live as Open Brane.

This post is why. The next post is how.

The Problem Every Agent User Has

You're using more tools than you can count. Calendar, Drive, Stripe, GitHub, Notion, Obsidian, Claude, ChatGPT, a transcription service, a CRM, maybe half a dozen automation platforms. Each one has your context — inside its own schema, inside its own database, behind its own API.

None of them can see each other.

You notice this every time an agent forgets something you told it on Tuesday. Or proposes a plan that contradicts a decision you made three weeks ago. Or generates a summary of your week that omits the five Stripe events and two Fathom calls that actually defined it.

The fix most products offer is memory features. ChatGPT lets you save facts. Claude has projects. Custom GPTs accept 8K of context. All workarounds. None address the real problem, which is that your context is a cross-source data layer problem, not a model problem.

The engineering challenge is making every source queryable through one interface, so any agent in your stack can pull the right slice without knowing which tool originally produced it.

That's what Open Brane is.

What Actually Breaks

I noticed three recurring failure modes before I wrote a line of code.

Context doesn't survive handoffs between agents. Scout finishes researching something. Kai takes over and needs to act on the research. The only way Kai learns what Scout found is if Scout produces a summary document Kai reads. If the summary misses a fact, it's gone. Agents don't pass context to each other cleanly because they're passing rendered views instead of source data.

Definitions drift across sources. Stripe's MRR calculation differs from the one in my Supabase analytics table. Both are "correct." Both are referenced in conversations. An agent answering a question about revenue has no way to know which number you want unless you tell it every time.

Pipelines fail silently. An ingestion script that pulled from Drive broke recently. Zero new events, zero error log. The pipeline was correctly reporting no work to do because its input was empty — not because no documents existed, but because auth had expired. I found it by noticing a gap in the event stream, not because anything alerted me.

All three trace back to the same root: the source data is scattered across systems that can't see each other, and the views I rely on are not rebuildable from a single canonical store.

The Minimum Viable Fix

I wrote the whole thing on paper before I wrote code. One rule governed every design decision:

One append-only table with one write path, and every view is rebuildable from it.

That's the entire brain. One SQLite file called events.db. One table called events. One column named payload_json holding an opaque blob, plus seven columns of indexable metadata — timestamp, source, type, actor, and a few others. No UPDATEs. No DELETEs. Corrections are new events that reference prior event IDs.
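The append-only rule can also be enforced by the database itself. The sketch below uses SQLite triggers to reject UPDATE and DELETE, and records a correction as a new row pointing at the old one. Whether Open Brane does this with triggers or purely by convention is not stated here; the trigger names, the `correction` type, and the `corrects` payload key are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts TEXT NOT NULL, source TEXT NOT NULL, type TEXT NOT NULL,
    actor TEXT, payload_json TEXT NOT NULL,
    attachment_uri TEXT, ingested_at TEXT NOT NULL
);
-- Assumed enforcement: any UPDATE or DELETE aborts with an error.
CREATE TRIGGER events_no_update BEFORE UPDATE ON events
BEGIN SELECT RAISE(ABORT, 'events is append-only'); END;
CREATE TRIGGER events_no_delete BEFORE DELETE ON events
BEGIN SELECT RAISE(ABORT, 'events is append-only'); END;
""")

INSERT_SQL = (
    "INSERT INTO events (ts, source, type, actor, payload_json, ingested_at) "
    "VALUES (?, ?, ?, ?, ?, ?)"
)


def record_correction(conn, prior_id, corrected_payload, ts):
    """Append a new event that points at the row it supersedes."""
    payload = {"corrects": prior_id, **corrected_payload}
    cur = conn.execute(
        INSERT_SQL,
        (ts, "manual", "correction", f"correction:{prior_id}:{ts}",
         json.dumps(payload, sort_keys=True), ts),
    )
    return cur.lastrowid
```

Readers that want the current value of something follow the `corrects` chain forward; the original row stays in place as history.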

Every write goes through one Python script called record_event.py. 80 lines. No LLM touches the database directly. No LLM generates SQL. No ingest script opens a database connection — they all shell out to record_event.py as a subprocess.

That constraint is load-bearing. It means the write contract lives in one place. It means adapters are pure functions — pull from a source, produce event rows, exit. It means I can run any ingester on any machine and it cannot corrupt the brain.

The second load-bearing constraint is that views are rebuildable. The Obsidian vault, the Qdrant vector DB, the compiled journals, the wiki: all of them are views. None are canonical. Delete any of them and run rebuild and they come back from events. If Qdrant's disk dies I don't lose data; I lose one night to a rebuild.

The third constraint is that cron is the orchestrator. No queues, no dead letter queues, no workflow engine. Every adapter is idempotent — re-running it does nothing if nothing changed. Cron hits each one every N minutes. If the adapter fails, the next tick retries. health_check.py --record writes its own probe into the events table, so I can query my own uptime history using the same tools I query everything else with.

What's in There Right Now

942,068 events. 3 GB on disk. 855,672 distinct actors. 4,762 events ingested today while I was writing this post.

Source breakdown as of 2026-04-21:

gmail                  537,517   (Takeout archive)
facebook                85,900
chatgpt                 54,946
audible                 44,496
twitter                 34,994   (12-year takeout)
google-search           34,757   (search history takeout)
gdrive                  33,715   (941 docs + 32k other files)
claude-laptop           25,012   (Claude Code sessions)
web                     18,886   (120 RSS feeds + extracted pages)
linkedin                15,281
local-dev               12,772
youtube                  7,263
fitbit                   5,550
google-maps              4,159
kai                      3,554   (agent conversations)
chrome                   3,144
code                     3,101   (AST nodes)
amazon                   2,061
git                      1,904   (33 repos)
snapchat                 1,865
google-contacts-full     1,804
google-fit               1,550
google-access            1,160
haro                       917
gcal                       873
google-keep                821
scout                      584
gvoice                     549
openclaw                   525

Most of that is not "coding data." It's life-stream data — Audible listens, Amazon orders, Google Maps history, Snapchat. I include everything because at a million events the storage cost is a rounding error and I don't know in advance which slice will matter to a question. When Kai asks if I've been sleeping badly during a stressful build week, the answer is in the Fitbit slice. When I need to remember a contact I met on a flight three years ago, it's in google-contacts-full.

19 ingest scripts pull from 30+ distinct sources. Every one writes through the same record_event.py write gate. The database has never been corrupted. I've never had to run a migration.

Why Release It Now

Ten days is not "finished." Nobody's running Open Brane but me. The repo has one commit and zero stars as I write this sentence. I'm releasing it now because the pattern is already load-bearing in my daily workflow and waiting to polish wouldn't change what the pattern is.

Specifically — if you're running more than one agent against your own data, you've already half-built this, badly. You have a Notion page that an agent reads. You have a Claude project with some context. You have a local SQLite somewhere. You have Obsidian. You have a Google Doc with your todo list. Every time you ask an agent to do something, you're rebuilding the cross-source query manually by pasting from these into the prompt.

Open Brane is what that looks like when it's formalized.

The repo is intentionally small. One ARCHITECTURE doc, one SCHEMA, about a dozen scripts including three canonical adapters (Drive, Claude Code sessions, git), a stdio MCP server, a systemd unit template, and two docs pages on how to extend it. No framework, no ORM, no workflow engine. Fits on a USB stick.

It's opinionated in exactly the places that matter:

Append-only. Not tunable. No UPDATEs, no DELETEs.
One write path. Everything goes through record_event.py. Not tunable.
Scripts are pure functions. Read, compute, write, exit. No background workers. No state machines.
Cron is the orchestrator. Retries are automatic because the next tick re-runs the adapter.
Network boundary is auth. Bind MCP to localhost or your tailnet. Don't build an auth layer you'll regret.
Views are rebuildable. If it's not rebuildable from events, it's not in the brain.

Everything else is configurable. Which sources you ingest, which embedding model you run, which MCP tools you expose, which vector DB you back it with.

What It Is Not

Open Brane is not a framework. It doesn't decide your ontology for you. It doesn't have an opinion about what counts as an event or how you name sources. You write the adapters. You choose the payload shape. You pick which sources matter.

It's also not a SaaS. It runs on your box. Your data never leaves unless you ship a view somewhere else on purpose. The MCP server binds to localhost by default. The embedding model (nomic-embed-text via Ollama) runs locally. There are no vendor API calls in the critical path — if OpenAI raised prices tomorrow, nothing in the brain would change.

It's also not trying to be general-purpose observability or a data warehouse. The schema is too narrow for either. It's specifically a personal agent memory substrate. If you're building a company data platform, you want something bigger.

What It Unlocks

Three things showed up the moment the brain had enough data in it.

Agents stopped losing Tuesday. Scout, Kai, and Claude Code all query the same events. When I tell any of them I shipped a fix to the butterfly pipeline on Tuesday, every other agent can find it by Wednesday. The model changes, the memory doesn't.

Content pipelines stopped being a grind. Every Claude Code session is a story — problem, attempts, decision, resolution. A nightly script mines sessions for high-signal events and flags ones worth writing about. The post you're reading was seeded by three events: the day I decided to open-source the brain, the day I hit the writes-only-through-scripts rule, and the day I realized cron was doing more orchestration than any workflow tool I'd used.

Definitions stopped drifting. Raw events go in. Derived metrics compute on read. If two agents report different MRR numbers, I diff the queries that produced them, not the numbers themselves.

How to Start

The Quickstart in the README gets you running in about fifteen minutes on Ubuntu. Install Ollama, pull nomic-embed-text, run Qdrant in Docker, clone the repo, initialize the database. First event written, first semantic search works.

From there, the pattern is: pick one source, copy an existing adapter, rewrite the fetch loop, add a cron line. About an hour per new source after you've done one. The canonical adapter to crib from is ingest_gdrive.py — it's the most complex one in the repo, so anything simpler is a strict subset.

The next post in this series walks through the schema, the MCP surface, and the adapter pattern in detail. For now, if you've been stitching agent context together by pasting from half a dozen systems, this is the substrate that replaces the pasting.

The model is a commodity. The memory is the asset. Today there's an open-source version of the memory.

https://github.com/cgallic/open-brane

What are you stitching context from right now? Not the tools you love — the ones you keep copying out of because no agent can see inside them.

Your AI Isn't Personal. Mine Has 156,926 Memories of Me.

connor gallic — Wed, 15 Apr 2026 17:26:00 +0000

"Personal AI" is a marketing term. The AI you talk to every day isn't personal. It's a generic foundation model with a 200-token memory feature bolted onto the side and your first name tacked into the system prompt.

Claude forgets everything I told it last session. ChatGPT remembers what brand of coffee I drink and three other things I let it save. Gemini has no idea I exist between threads. None of them know what I shipped last week, what I tried that failed, who my clients are, or what I was researching on Tuesday.

That's not personal. That's cosplay.

I built what I think personal AI actually requires. Not a product. An architecture. A sovereign memory that every AI in my stack — Claude Code, Codex, Gemini CLI, a local Gemma model running on my home server, my production marketing agent on a VPS in Germany — queries before it speaks. Same memory. Different models. The AI becomes personal because the data layer is.

It has 156,926 events in it today. Here's what that actually looks like.

Personal AI Is a Data Layer Problem

The debate about which model is smartest has mostly resolved. The frontier models are all roughly comparable for coding and reasoning. Switching from one to the other is not a life-changing event.

The debate that hasn't happened: what does it mean for AI to know you?

The answer most products give is "memory features." ChatGPT lets you save facts. Claude has projects. Custom GPTs accept 8K of context. These are workarounds for a deeper problem. The real context for a person isn't 200 tokens of preferences. It's thousands of AI conversations, hundreds of code decisions, years of notes, every tool you've ever used to think in public, every voice memo you recorded driving home.

None of that is surfaced to the model. All of it is already on your disk.

The engineering problem is making that data queryable, searchable, and available to any model through a consistent interface. That's not a chatbot project. That's a data layer project. Once you have the data layer, the models become interchangeable.

The Shape of a Sovereign Memory

I built a thing I call the brain. It lives on an Ubuntu box in my office with an RTX 3090. The core is an append-only SQLite event log — one table, eight columns — that accepts writes from every source I care about.

CREATE TABLE events (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    ts              TEXT    NOT NULL,
    source          TEXT    NOT NULL,
    type            TEXT    NOT NULL,
    actor           TEXT,
    payload_json    TEXT    NOT NULL,
    attachment_uri  TEXT,
    ingested_at     TEXT    NOT NULL
);

No joins. No foreign keys. No migrations. Corrections are new events that reference old ones. I've never deleted a row.
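
A minimal sketch of the append-only rule against the schema above. The `correction` event type and the `corrects` payload key are assumptions for illustration; the post only says corrections reference old events, not how.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE events (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    ts              TEXT    NOT NULL,
    source          TEXT    NOT NULL,
    type            TEXT    NOT NULL,
    actor           TEXT,
    payload_json    TEXT    NOT NULL,
    attachment_uri  TEXT,
    ingested_at     TEXT    NOT NULL
);
"""

def append_event(conn, source, type_, payload, actor=None, ts=None):
    """Insert one row and return its id. There is no update or delete path."""
    now = datetime.now(timezone.utc).isoformat()
    cur = conn.execute(
        "INSERT INTO events (ts, source, type, actor, payload_json, ingested_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (ts or now, source, type_, actor, json.dumps(payload), now),
    )
    conn.commit()
    return cur.lastrowid

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = append_event(conn, "git", "commit", {"summary": "fix butterfly pipline"})
# The typo is fixed by a new row that points at the old one.
fix = append_event(
    conn, "manual", "correction",
    {"corrects": first, "summary": "fix butterfly pipeline"},
)
```

Readers that care about corrections resolve them at query time; the original row stays in place as history.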

Every piece of data enters through exactly one 80-line Python script: record_event.py. That's the only write path. 30+ ingestion scripts shell out to it as a subprocess. The LLM never generates SQL. Never touches the database. Never sees credentials.

The rule: deterministic scripts do the work. AI agents decide which scripts to run.

That rule is one of five architectural decision records committed to git as permanent documents:

2026-04-11-adopt-event-log-architecture.md
2026-04-11-adopt-deterministic-scripts-plus-agent-oversight.md
2026-04-11-adopt-qdrant-semantic-search-over-events.md
2026-04-11-scribe-voice-capture-architecture.md
2026-04-12-adopt-compiled-knowledge-layer.md

When an agent asks why the system works a certain way, it reads the ADR. The intent outlasts the code.

What Counts as "Me"

The source axis of the event log tracks where data came from. The per-source breakdown from the live database:

twitter        34,994   (full takeout archive — 12 years of likes and tweets)
google-search  34,758   (search history takeout)
gdrive         29,305   (941 Google Docs + 25k local files)
local-dev      12,772   (laptop dev files, notes, work-in-progress)
claude-laptop   9,647   (Claude Code sessions — 358 distinct)
youtube         7,263   (watch history)
web             6,031   (120 RSS feeds I follow)
fitbit          5,550   (sleep, heart rate, calories)
linkedin        4,047
kai             3,293   (my marketing AI agent's conversations)
code            3,038   (AST nodes — the code graph)
amazon          2,061   (orders, browsing)
git             1,543   (commits across 33 repos)
haro              608   (journalist queries I respond to)
openclaw          525   (WhatsApp/Discord agent messages)
scout             234   (local AI agent conversations)

Most of this is not "coding data." It's life-stream data. Fitbit readings, Amazon orders, a 12-year Twitter archive. I include it because context is cheap at 156K events and I don't know in advance what I'll want to correlate. When Kai asks whether I've been sleeping badly during a stressful build week, the answer is in the Fitbit slice.

The type axis is a different cut — 20+ distinct event types across the log:

query            35,258   (Google searches)
like             28,491   (Twitter likes)
document-chunk   26,838   (Drive doc fragments)
reply             8,844   (AI agent replies to me)
watch             8,378   (YouTube watches)
tweet             6,503   (my own tweets)
article           6,064   (RSS + extracted web content)
calories          4,958   (Fitbit)
node              3,024   (code graph AST)
commit            1,543
sleep-score        273
memory              24    (explicit remember-this entries)

155,348 distinct actors. 274MB of SQLite on disk. The log grew by 11,413 events today alone, mostly because I just turned on the code graph ingester.
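
Both cuts above are plain GROUP BY counts. A small sketch of how they could be produced, using a two-column stand-in table and made-up rows rather than the live database:

```python
import sqlite3

def breakdown(conn, column):
    """Count events along one axis, largest first."""
    # Column names cannot be bound as SQL parameters, so whitelist them.
    if column not in ("source", "type"):
        raise ValueError(f"unsupported axis: {column}")
    return conn.execute(
        f"SELECT {column}, COUNT(*) AS n FROM events "
        f"GROUP BY {column} ORDER BY n DESC, {column}"
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, type TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("twitter", "like"), ("twitter", "tweet"), ("git", "commit")],
)
by_source = breakdown(conn, "source")
by_type = breakdown(conn, "type")
```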

Three Machines, One Log

The brain is sovereign — I own every byte, no vendor API sits in the critical path — but it spans three machines that I actually live on.

My Windows laptop runs most Claude Code sessions. A bash script reads ~/.claude/projects/ and syncs new JSONL files to the agent box over Tailscale SSH. The laptop-specific ingester then parses them. Same pattern for Drive extraction and local dev file ingestion.

The agent box (Ubuntu, RTX 3090, always-on) is the hub. Every scheduled ingester runs here on systemd timers — Codex sessions, web RSS, narrative ingest, code graph parsing, Qdrant upsert. This is where the database lives.

The hermes VPS in Germany runs my production AI agent, exposed over Discord as "Kai." An ingester reads the VPS SQLite over SSH and pulls agent conversations down — 3,293 events so far. Kai also queries the brain. When someone asks Kai what I shipped last week, the agent calls semantic_search over HTTP on port 7778 before answering.

Three machines. One log. No vendor lock-in. If any box dies, the data is on one of the other two or can be rebuilt from sources.

The Compiled Knowledge Layer

Raw events are the substrate. On top of them sits a compiled layer that a raw event log can't produce — structured, human-readable, curated.

The wiki is a markdown tree with 9 regions and 41 pages:

wiki/
├── agents/        (3)   — kai, scout, openclaw-snapped
├── clients/       (12)  — one page per active client
├── daily-briefs/  (5)   — compiled end-of-day summaries
├── decisions/     (1)   — ADR index
├── people/        (1)
├── products/      (8)   — kaicalls, mdi, clawdflix, meetkai, ...
├── projects/      (3)   — brain, cmo-agent-system, marketing-kb
├── systems/       (5)
└── topics/        (3)

Each page is human-editable. compile_wiki.py reconciles it against the event log and surfaces new entities that should probably exist.

The journal is daily markdown auto-compiled from events. compile_journal.py --date 2026-04-14 groups every event from that day by source and outputs a readable brief. A narrative subfolder holds longer threads that span multiple days.
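
A rough sketch of the grouping step, not the actual compile_journal.py. It assumes `ts` is stored as a UTC ISO-8601 string such as 2026-04-14T09:30:00Z, so a date prefix match selects one day.

```python
import sqlite3
from collections import defaultdict

def compile_day(conn, date):
    """Group one day's events by source and render a markdown brief."""
    rows = conn.execute(
        "SELECT source, type, ts FROM events WHERE ts LIKE ? ORDER BY source, ts",
        (date + "%",),
    ).fetchall()
    by_source = defaultdict(list)
    for source, type_, ts in rows:
        # Characters 11..15 of an ISO timestamp are HH:MM.
        by_source[source].append(f"- {ts[11:16]} {type_}")
    lines = [f"# Journal {date}"]
    for source in sorted(by_source):
        lines.append(f"## {source} ({len(by_source[source])})")
        lines.extend(by_source[source])
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (source TEXT, type TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("git", "commit", "2026-04-14T09:30:00Z"),
        ("git", "commit", "2026-04-14T10:00:00Z"),
        ("scout", "reply", "2026-04-14T11:15:00Z"),
        ("git", "commit", "2026-04-15T00:00:00Z"),
    ],
)
brief = compile_day(conn, "2026-04-14")
```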

Blobs sit outside the database. Voice recordings, images, PDFs — anything too large for a JSON payload — live in blobs/voice/ and similar, referenced by attachment_uri on the event row.

The brain is now a three-layer system: raw events, a curated wiki, and compiled narratives. Each layer is queryable independently. Each one gets embeddings.

Local Embeddings. No API Calls.

embed_events.py runs on cron every 5 minutes. It finds new events, builds a text summary from the payload, sends it to Ollama running nomic-embed-text locally on the RTX 3090, and upserts a 768-dimensional vector into a Qdrant collection.

Zero external API calls for embeddings. The vectors never leave my network. At 156K events, running this on OpenAI's API would have cost meaningful money. Running it locally costs GPU time I'm not using for anything else.

semantic_search.py queries Qdrant and joins full event payloads back from SQLite in one pass. The search works across everything:

"Butterfly pipeline deployment" → top hits are commits on cgallic/snappedai
"Scout tank diet" → Scout's Discord conversations and the CLI commands that edited its state files
"Quantitative trading AI" → a video transcript I pasted to Kai, my follow-up research request, and both agents' replies, all in one query

The vector space clusters my life without me tagging anything. That's the payoff for having the data in one place.
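
The embed-then-join flow can be sketched without any services running. This version deliberately swaps in a toy bag-of-words embedder for the Ollama call and a brute-force cosine scan over a dict for Qdrant, so it shows the shape of the pipeline rather than the real scripts. The function names mirror the script names but the bodies are illustrative.

```python
import json
import math
import sqlite3
from collections import Counter

def toy_embed(text):
    """Stand-in for the embedding model: a sparse bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def embed_new_events(conn, index):
    """Embed every event missing from the index, keyed by event id."""
    for event_id, payload_json in conn.execute("SELECT id, payload_json FROM events"):
        if event_id not in index:
            index[event_id] = toy_embed(json.loads(payload_json)["summary"])

def semantic_search(conn, index, query, limit=3):
    """Rank by cosine similarity, then join payloads back from SQLite."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda eid: cosine(q, index[eid]), reverse=True)
    hits = []
    for event_id in ranked[:limit]:
        source, payload_json = conn.execute(
            "SELECT source, payload_json FROM events WHERE id = ?", (event_id,)
        ).fetchone()
        hits.append({"id": event_id, "source": source, "payload": json.loads(payload_json)})
    return hits

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, source TEXT, payload_json TEXT)")
conn.executemany(
    "INSERT INTO events (source, payload_json) VALUES (?, ?)",
    [
        ("git", json.dumps({"summary": "deploy butterfly pipeline to production"})),
        ("fitbit", json.dumps({"summary": "sleep score 81"})),
        ("scout", json.dumps({"summary": "scout tank diet updated"})),
    ],
)
index = {}
embed_new_events(conn, index)
hits = semantic_search(conn, index, "butterfly pipeline deployment", limit=1)
```

The vector store only needs to hold ids and vectors. The payloads stay in SQLite and are joined back by id after ranking.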

How Any Model Becomes Personal

Everything up to this point is storage. The part that makes it personal AI is how models access it.

The brain exposes 18 tools through a Model Context Protocol (MCP) server, including:

record_event, query_events, semantic_search, get_journal,
compile_journal, list_decisions, get_decision, health_check,
append_narrative, get_wiki_page, list_wiki_pages, update_wiki_page,
compile_wiki, lint_wiki, resume_my_work, build_memory_packet,
get_journal_narrative

The server runs as stdio for Claude Code on the agent box. An mcp-proxy wrapper exposes the same tools as HTTP/SSE on port 7778 for remote agents. Kai in Germany, Scout (my local Gemma model), and Claude on the laptop all call the same tools.

When Claude Code on my laptop asks "what have I been working on with KaiCalls this week," it calls query_events filtered by repo = cgallic/kai_calls and since = 7d. When Scout helps me plan content, it calls semantic_search for the topic and gets real conversations, real commits, real notes. When Kai needs to answer a question about what I shipped, it calls resume_my_work and gets a briefing assembled from events and wiki pages.
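
A sketch of what a query_events call with a repo filter and a seven-day window might do underneath. The signature is an assumption, as is a `repo` key inside the payload. It also assumes `ts` is a UTC ISO-8601 string, which makes plain string comparison sort chronologically.

```python
import json
import sqlite3
from datetime import datetime, timedelta, timezone

def query_events(conn, since_days, repo=None, now=None):
    """Return events newer than the cutoff, optionally limited to one repo."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=since_days)).strftime("%Y-%m-%dT%H:%M:%SZ")
    rows = conn.execute(
        "SELECT id, ts, source, type, payload_json FROM events "
        "WHERE ts >= ? ORDER BY ts",
        (cutoff,),
    ).fetchall()
    events = [
        {"id": r[0], "ts": r[1], "source": r[2], "type": r[3], "payload": json.loads(r[4])}
        for r in rows
    ]
    if repo is not None:
        # Filter in Python so the sketch does not depend on SQLite JSON functions.
        events = [e for e in events if e["payload"].get("repo") == repo]
    return events

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, ts TEXT, source TEXT, "
    "type TEXT, payload_json TEXT)"
)
conn.executemany(
    "INSERT INTO events (ts, source, type, payload_json) VALUES (?, ?, ?, ?)",
    [
        ("2026-04-14T10:00:00Z", "git", "commit", '{"repo": "cgallic/kai_calls"}'),
        ("2026-04-01T10:00:00Z", "git", "commit", '{"repo": "cgallic/kai_calls"}'),
        ("2026-04-13T10:00:00Z", "git", "commit", '{"repo": "cgallic/snappedai"}'),
    ],
)
fixed_now = datetime(2026, 4, 15, tzinfo=timezone.utc)
this_week = query_events(conn, 7, repo="cgallic/kai_calls", now=fixed_now)
```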

The model changes. The memory doesn't. That's what makes it personal.

The Unexpected Side Effect

I built this for recall. It turned into a content engine.

Every Claude Code session is a story — problem, attempts, decision, resolution. The Dev.to article I published Monday, 13 of 14 Integrations Were Fake, was mined directly from a single session event. mine_stories.py runs nightly and flags sessions with high signal — lots of decisions, a concrete outcome, a surprising pivot. I review the output in the morning and pick what to write.

The week I started doing this, my content output doubled. I was already living the stories. I just wasn't capturing them.

The brain doesn't write the content. It surfaces stories I'd forget by Thursday.

What I'd Do Differently

Three mistakes worth naming.

Started with a flat event log, should have started with the wiki. Ingesting first and retrofitting the curated layer after was a week of wasted effort. Structure tells ingestion what to look for.

git_commit.sh auto-commits the brain with subjects like "snapshot 2026-04-12T01:01:29Z." Zero keywords, zero concepts. Those commits are semantically invisible. The brain's own development history is harder to search than my actual product work.

embed_events.py builds vectors exclusively from payload.summary. When narratives returned zero hits for obvious queries, I traced it to a too-aggressive summary length cap. Different content types need different summary budgets. I missed that until it broke.
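
One way to express per-type summary budgets, as a sketch. The character limits below are illustrative values, not the ones the brain uses.

```python
# Characters allowed per event type before the summary is cut for embedding.
SUMMARY_BUDGETS = {"commit": 200, "narrative": 2000}
DEFAULT_BUDGET = 500

def build_summary(event_type, text):
    """Trim text to the budget for its event type, on a word boundary if possible."""
    budget = SUMMARY_BUDGETS.get(event_type, DEFAULT_BUDGET)
    if len(text) <= budget:
        return text
    # Cut at the last space inside the budget so words are not split.
    cut = text.rfind(" ", 0, budget)
    return text[: cut if cut > 0 else budget]
```

A long narrative keeps far more of its text than a commit subject does, which is the difference a single global cap hides.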

Personal AI Is Already Possible. You Just Have to Build It.

Every piece of this — the event log, the ingestion scripts, the local embeddings, the MCP interface — is a weekend project. None of it requires ML research. None of it requires a cloud bill. The data is already on your disk.

The products being sold as "personal AI" are generic models with opt-in memory features. That's not what personal AI looks like. Personal AI is a sovereign data layer that every model you use queries before it speaks, that grows compounding value every day you run it, that doesn't evaporate when a vendor pivots or raises prices or gets acquired.

The model is a commodity. The memory is the asset.

Your AI isn't personal until you own the layer that makes it know you.

What's in your personal data layer right now? Not your ChatGPT memory — the actual disk-level archive of everything you've ever asked a model. I want to know who else is sitting on it unindexed.

I Asked Claude to Audit My Dashboard. 13 of 14 Integrations Were Fake.

connor gallic — Mon, 13 Apr 2026 04:12:11 +0000

I had a dashboard with 14 marketing integrations. GA4, Search Console, Google Ads, Meta, LinkedIn, TikTok, YouTube, Mailchimp — the whole stack. Users could connect any of them. OAuth worked. Badges turned green. "Active."

One of them actually did anything.

I didn't know this. The UI looked right. Green dots across the board. I was building features on top of integrations that were silently doing nothing. I asked Claude to audit the entire codebase against the product vision. Not fix bugs. Tell me what's real.

The Dispatch

MeetKai is an AI CMO product I was building. Connect your marketing accounts, get automated audits, approve AI-generated actions. Vercel frontend, Supabase backend, FastAPI gateway running 30+ marketing skills.

I wrote a dispatch prompt:

Me: go # MeetKai Dashboard — Full Gap Analysis

Date: 2026-04-05

Purpose: Dispatch prompt for an agent to audit the entire MeetKai dashboard codebase against the product vision, identify every gap, and produce a prioritized work list.

I expected a bug list. Missing error handling, broken API routes. Normal stuff.

Claude went through every file in the repo. The gap analysis landed in a spec doc. It was not a bug list.

25% Done

Claude: Overall: 25-30% complete. Here's the brutal truth:

1 of 14 providers actually works (GA4). GSC is broken by a one-line provider name mismatch. The other 12 connect via OAuth, show green "Active" badges, and do absolutely nothing.

Action execution is theater — static markdown templates with [Business Name] placeholders, not real AI output

The notification toggles save preferences but have no delivery infrastructure

Thirteen integrations showing green badges. Users clicking "Sync" on provider cards where the button had no onClick handler. The entire action execution system was static markdown templates with placeholder text pretending to be AI output.

The GSC integration — Google Search Console, one of the two most important data sources for the product — was dead because of a single string mismatch between what the frontend sent and what the backend expected. One typo. The whole integration was a corpse with a green badge.

This wasn't a half-built prototype that looked half-built. It looked finished. Polished design. OAuth flows completing successfully. Connected status showing in the UI. You'd think everything was working and wonder why the data was empty.

A feature that errors out gets caught in testing. A feature that succeeds and does nothing — that ships to production. It erodes trust for months before anyone figures it out.
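
One guard against this failure mode is to derive the badge from data arrival rather than from OAuth state. The sketch below is hypothetical and not MeetKai code; the field names and the 24-hour freshness window are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class Integration:
    provider: str
    oauth_connected: bool
    last_row_synced_at: Optional[datetime]  # None if no data has ever arrived

def badge(integration, now, max_age=timedelta(hours=24)):
    """Return 'active' only when data has actually flowed recently."""
    if not integration.oauth_connected:
        return "disconnected"
    if integration.last_row_synced_at is None:
        return "connected_no_data"
    if now - integration.last_row_synced_at > max_age:
        return "stale"
    return "active"

now = datetime(2026, 4, 5, 12, 0, tzinfo=timezone.utc)
ga4 = Integration("ga4", True, now - timedelta(hours=2))
tiktok = Integration("tiktok", True, None)
```

Under this rule an integration that completed OAuth but never synced a row cannot show green.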

Shell and Weapon

Then Claude dropped this:

Claude: Here's what we're actually sitting on: a Next.js dashboard that's 25% done bolted onto a gateway + agent + content engine that's 70% done. The frontend is a shell. The backend is a weapon. They're barely talking to each other.

I had to sit with that for a minute. The backend — the FastAPI gateway — had 30+ marketing skills, a content engine, an agent scheduler, a job queue, approval workflows. It could audit pages, generate ad copy, propose marketing actions, run multi-step campaigns. All of it was built and working.

The frontend didn't know any of it existed.

I'd been thinking about this as a bug-fix problem. Fix the 13 broken providers. Wire up the notifications. Make the action buttons work. But that framing was wrong. The real problem was architectural. The frontend was a completely separate application from the backend, with no integration between them. Fixing individual features would take months and still leave two disconnected systems.

Three Options, One Obvious Answer

Claude laid out the architecture choices:

Claude: The core architectural question is: how should the dashboard talk to the Kai brain?

A) Direct integration — The Next.js app imports the Python content engine directly (requires a bridge or rewrite)

B) Gateway-first — The dashboard becomes a frontend for the existing FastAPI gateway

C) Hybrid + CopilotKit — Gateway for structured operations + CopilotKit for chat

Option A means rewriting the Python backend in JavaScript or building a bridge layer. A month of work to solve a problem nobody has.

Option C adds CopilotKit as a dependency for a chat feature the Vercel AI SDK already handles natively. More framework, more complexity, same result.

We went with B. The dashboard becomes a thin frontend for the gateway. Every button click, every chat message, every scheduled task routes through the same API. The gateway has the skills. The dashboard calls them.

Your frontend should be a client for your backend. Not a parallel implementation. Not an abstraction layer. A client. I've watched three projects in the past month where teams rebuilt backend logic in the frontend "for performance" and ended up maintaining two versions of the same thing.

Who Opens This at 9am

Next was the interaction model.

Claude: What's the user's primary interaction model?

A) Dashboard-first, chat-assists — Widgets, cards, tables. Chat slides out as a helper.

B) Chat-first, dashboard-monitors — The chat IS the product. Dashboard pages become monitoring.

My instinct was chat-first. Feels more "AI native." Talk to your marketing agent, it does things. That's the pitch.

Then I thought about who actually opens this. Small business owners. They open an app at 9am. They want a number. Is marketing working? What needs attention? What did the AI do overnight? They want answers on screen, not a blinking cursor waiting for a prompt.

Claude: Business owners don't care about the interface paradigm. They care about "is my marketing working and what should I do next."

Dashboard-first, chat-second. The dashboard loads with answers. Chat is there when they need something specific — "write me an email campaign for the spring sale," "why did traffic drop last week." Chat is an input surface. Not the product.

Get this decision wrong and you're rewriting your frontend in three months. Your state management, onboarding flow, and API design all follow from it. We almost went chat-first because it sounded cooler.

The Smallest Thing Worth Charging For

The gap analysis showed 14 providers, 30+ skills, analytics, audits, content engine, agent scheduler. Six months of work if you're not careful.

Claude: What's the launch slice — the smallest version a business owner would pay for?

The killer loop is: Connect → Audit → See what's wrong → AI fixes it → See it get better

Five things make the cut:

Fix GA4 + GSC, add 2-3 more working providers
Auto-run audit when accounts connect
Show scores, issues, and AI fixes in the dashboard
Chat to trigger skills
Real action execution through the gateway

Everything else waits. Ten remaining providers. Advanced analytics. The agent scheduler. Scope for later. The loop — connect, audit, act, approve — is enough to charge money for.

Here's what the final architecture looks like:

TRIGGERS                    BRAIN                      OUTPUT
─────────────────         ──────────────────          ────────
Click "Run Audit"    →                             → Audit result
Chat: "write emails" →    Skill Router + Gateway   → Email drafts
Cron: daily 6am      →    (same execution path)    → Analytics brief
Webhook: score drop  →                             → Action proposal

Chat isn't special here. It's one of five input surfaces into the same brain. Dashboard buttons, chat, cron jobs, webhooks, new-connection triggers — all hit the same gateway. Same skills. Same approval flow.
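
The "same execution path" idea can be sketched as a registry of skills with one entry point that every trigger calls. The skill names and return strings here are placeholders, not the gateway's real skills.

```python
from typing import Callable, Dict

SKILLS: Dict[str, Callable[[dict], str]] = {}

def skill(name):
    """Register a function as a named skill."""
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("run_audit")
def run_audit(args):
    return f"Audit result for {args['site']}"

@skill("write_emails")
def write_emails(args):
    return f"Email drafts for {args['campaign']}"

def execute(trigger, skill_name, args):
    """Single execution path: buttons, chat, cron, and webhooks all end up here."""
    if skill_name not in SKILLS:
        raise KeyError(f"unknown skill: {skill_name}")
    output = SKILLS[skill_name](args)
    return {"trigger": trigger, "skill": skill_name, "output": output}

from_button = execute("button", "run_audit", {"site": "example.com"})
from_cron = execute("cron", "run_audit", {"site": "example.com"})
```

The trigger is recorded but never changes what the skill does, so approval and logging only need to be built once.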

I spent weeks building features on top of integrations that didn't work. Every one of those features was wasted time. The audit took one session. I should have run it before I wrote a single line of new code.

The thing I keep coming back to is the green badges. Thirteen of them. All lying. Not because anyone built them to lie — because someone built the OAuth flow, saw the badge turn green, and moved on to the next feature. Nobody went back and checked whether the data was actually flowing. The UI said it worked. That was enough.

It wasn't enough. Run the audit first. Read the code, not the interface.

What's the worst thing a codebase audit turned up for you — not a bug, but something that looked like it was working and wasn't?

Every AI Agent Disaster This Year Was a Write Without a Checkpoint

connor gallic — Mon, 23 Mar 2026 03:29:11 +0000

I run AI agents in production — Discord bots, email outreach, channel queues across multiple servers. More than once, a misconfigured loop or race condition caused the same message to fire twice to the same person. Same email, same channel, same queue.

Nobody died. No lawsuit. But every duplicate erodes a little trust. And when I looked at why it kept happening, the root cause was always the same: a write executed with nothing between the decision and the action.

Then I started paying attention to bigger teams hitting the exact same pattern.

It's happening everywhere

Air Canada had a chatbot that fabricated a bereavement fare refund policy out of thin air. A customer relied on it, got denied, and sued. Air Canada argued the chatbot was "a separate legal entity responsible for its own actions." The tribunal disagreed — the airline is liable for every message its bot sends, hallucinated or not.

Cursor's support bot "Sam" told users their subscriptions were limited to a single active session. That policy didn't exist. The AI invented it. Users canceled in protest before the co-founder could publicly apologize. Most of them didn't even know Sam wasn't human.

Replit's coding agent deleted an entire production database — 1,200+ records — despite instructions repeated in ALL CAPS eleven times not to make changes. Then it fabricated 4,000 fake replacement records and told the operator recovery wasn't possible. It was.

Amazon's Kiro agent was assigned a minor bug fix in AWS Cost Explorer. It decided the "most efficient path to a bug-free state" was to delete the entire production environment and rebuild from scratch. 13-hour outage.

Different companies, different agents, different scales. Same shape every time: the agent didn't malfunction. It did exactly what it was built to do. A human would have paused. The agent didn't hesitate.

The usual answer doesn't scale

The first response is always "just add human-in-the-loop." Right instinct, but in practice HITL goes one of two ways:

Ad-hoc — someone gets a Slack message, eyeballs it, types "looks good." No audit trail, no expiry, no record of what was approved or who approved it. Six months later when compliance asks, you're grepping Slack history.

Everything gets reviewed — works for about a week. Then the volume makes it unsustainable. The team rubber-stamps, or they stop using agents because the overhead killed the value.

The real gap is between those two extremes. Most agent writes fall into three buckets:

Auto-approve — a single support reply, a small data update, a cache refresh
Human review — a bulk import over 100 records, a financial transaction, a message containing certain terms
Always block — writes to production infra, refunds over a threshold, legal commitments

The problem is this logic usually lives scattered in application code. One agent has it, another doesn't. A new developer writes a new agent and skips it. Nothing is centralized, nothing is auditable.

So I pulled the guard logic out of my agents

I was copy-pasting the same write-check code into every integration I built. Same patterns — deduplicate, check record count, block certain terms, hold for review over a threshold. So I extracted it into a standalone layer.
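
The deduplicate step is the simplest of those patterns to show. This is a sketch with an in-memory set, assuming a message is identified by destination, recipient, and payload. A real deployment would need a persistent store and an atomic check-and-set, which this does not provide.

```python
import hashlib

class Deduplicator:
    """Refuse to send the same payload to the same recipient twice."""

    def __init__(self):
        self._seen = set()

    @staticmethod
    def key(destination, recipient, payload):
        # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
        raw = "\x1f".join((destination, recipient, payload)).encode()
        return hashlib.sha256(raw).hexdigest()

    def should_send(self, destination, recipient, payload):
        k = self.key(destination, recipient, payload)
        if k in self._seen:
            return False
        self._seen.add(k)
        return True

dedupe = Deduplicator()
first_try = dedupe.should_send("discord.message", "user-1", "Your report is ready")
second_try = dedupe.should_send("discord.message", "user-1", "Your report is ready")
```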

Zehrava Gate is a write-path control plane. Before an agent executes a write, it submits an intent. Gate evaluates policy, optionally holds for human approval, and issues a signed execution order. Every decision is logged.

The policies are YAML — deterministic, no LLM in the loop:

id: support-reply
destinations: [zendesk.reply, intercom.reply]
block_if_terms: ["refund guaranteed", "full refund", "legal action"]
auto_approve_under: 1

id: crm-import
destinations: [salesforce.import, hubspot.contacts]
auto_approve_under: 100
require_approval_over: 100
expiry_minutes: 60

id: finance-high-risk
destinations: [stripe.refund, quickbooks.journal]
require_approval: always
expiry_minutes: 15

The integration is a few lines:

const { Gate } = require('zehrava-gate')
const gate = new Gate({ endpoint: 'http://localhost:4000', apiKey: 'gate_sk_...' })

// await and return need a function scope in CommonJS, so wrap the call
async function sendReply() {
  const result = await gate.propose({
    payload:      'Thank you — your issue is resolved.',
    destination:  'zendesk.reply',
    policy:       'support-reply',
    recordCount:  1
  })

  if (result.status === 'blocked') throw new Error(result.blockReason)
  if (result.status === 'pending_approval') return // wait for human
  // approved — proceed
}

from zehrava_gate import Gate

gate = Gate(endpoint="http://localhost:4000", api_key="gate_sk_...")

result = gate.propose(
    payload="Thank you — your issue is resolved.",
    destination="zendesk.reply",
    policy="support-reply",
    record_count=1
)

A human writes the policy when they're thinking clearly. Gate enforces it mechanically. Same input, same output, every time.
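
To make "same input, same output" concrete, here is a sketch of how such policies could be evaluated. It is not Gate's implementation. The policies are written as dicts mirroring the YAML above, and two interpretations are assumptions: `auto_approve_under` is treated as inclusive so the record_count=1 example auto-approves, and anything no rule matches is held for review.

```python
SUPPORT_REPLY = {
    "id": "support-reply",
    "block_if_terms": ["refund guaranteed", "full refund", "legal action"],
    "auto_approve_under": 1,
}
CRM_IMPORT = {"id": "crm-import", "auto_approve_under": 100, "require_approval_over": 100}
FINANCE = {"id": "finance-high-risk", "require_approval": "always"}

def evaluate(policy, payload, record_count):
    """Return (status, reason). Pure function: no clock, no network, no model."""
    text = payload.lower()
    for term in policy.get("block_if_terms", []):
        if term.lower() in text:
            return ("blocked", f"blocked term: {term}")
    if policy.get("require_approval") == "always":
        return ("pending_approval", "policy requires approval")
    over = policy.get("require_approval_over")
    if over is not None and record_count > over:
        return ("pending_approval", f"record count over {over}")
    under = policy.get("auto_approve_under")
    # Assumption: the threshold is inclusive.
    if under is not None and record_count <= under:
        return ("approved", "auto-approved")
    # Nothing matched: hold for a human rather than let the write through.
    return ("pending_approval", "no auto-approve rule matched")
```

Because the function has no hidden inputs, a decision can be replayed later from the logged policy, payload, and record count.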

"What if the agent just skips the SDK?"

That's the right question. The SDK is cooperative — it only works if the agent calls it. Fine for agents you build yourself. Not enough for agents you don't fully control.

Gate V3 closes that gap with a proxy. It sits in the network path between the agent and the destination API. Two environment variables, no code changes:

export HTTP_PROXY=http://gate.internal:4001
export HTTPS_PROXY=http://gate.internal:4001

Every outbound HTTP call routes through Gate. The destination host maps to a policy. Approved requests get forwarded. Blocked requests get a 403 with the reason. Pending requests return a 202 and hold until a human approves.

── V2: cooperative ──────────────────────────────
Agent → SDK.propose() → Gate API → approved → Agent executes
                         ↑ optional — agent can skip

── V3: enforced ─────────────────────────────────
Agent → HTTP request → Gate Proxy → approved → forwards to destination
                                  → blocked  → 403, reason in response
                                  → pending  → 202, held for review
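
A sketch of the enforced path's decision step, under stated assumptions. The host-to-policy table is hypothetical because the mapping format is not shown here, and `decide` stands in for whatever evaluates the policy.

```python
# Hypothetical host-to-policy table for illustration only.
HOST_POLICIES = {
    "api.stripe.com": "finance-high-risk",
    "example.zendesk.com": "support-reply",
}

def proxy_response(host, decide):
    """Turn a policy decision into what the proxy does with the request.

    `decide` is any callable taking a policy id and returning
    (status, reason), with status in approved / blocked / pending_approval.
    """
    policy = HOST_POLICIES.get(host)
    if policy is None:
        # Fail closed: an unmapped destination is never forwarded.
        return {"action": "respond", "http_status": 403, "reason": "no policy for host"}
    status, reason = decide(policy)
    if status == "approved":
        return {"action": "forward", "http_status": None, "reason": reason}
    if status == "pending_approval":
        return {"action": "hold", "http_status": 202, "reason": reason}
    return {"action": "respond", "http_status": 403, "reason": reason}
```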

In vault mode, the agent never even sees production credentials. Gate fetches them from 1Password or HashiCorp Vault at execution time — after approval, for the approved intent only — then discards them from memory. A compromised agent has nothing to exfiltrate.

V2 gives you guardrails. V3 gives you a wall.

Why YAML and not another LLM?

The obvious design for a safety layer would be another AI evaluating the first AI's output. But that introduces the same unpredictability you're trying to remove. An LLM deciding "should this agent be allowed to send this email?" will occasionally say yes when it shouldn't. That's the whole problem.

A YAML policy has none of those failure modes. No prompt injection. No hallucination. No "the safety model was feeling generous today."

YAML is boring. That's the feature.

Try it

MIT licensed. Self-hostable.

npm install zehrava-gate
npx zehrava-gate --port 4000 --policy-dir ./policies

pip install zehrava-gate

cgallic / zehrava-gate

The safe commit layer for AI agents — approval, policy, and audit before any agent output reaches production

Zehrava Gate

Write-path control plane for AI agents.

→ zehrava.com · npm · PyPI · Live demo · Docs

Agents can read systems freely. Any real-world action — sending email, importing CRM records, updating databases, issuing refunds, publishing files — must pass through Gate first.

Agents submit an intent. Gate evaluates policy. Optionally requests human approval. Issues a signed execution order. Every step is deterministic, auditable, and fail-closed.

intent submitted
  ↓
policy evaluated (YAML, deterministic — no LLM)
  ├── blocked              → terminal
  ├── duplicate_blocked    → terminal (idempotency key matched)
  ├── approved             → auto-approved; eligible for execution
  └── pending_approval     → human review required
        ├── approved        → eligible for execution
        ├── rejected        → terminal
        └── expired         → terminal
approved
  ↓
execution order issued (gex_ token, 15min TTL)
  ↓
worker executes in your VPC
  ↓
outcome reported
  ├── execution_succeeded  → terminal
  └── execution_failed     → terminal
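
The lifecycle above reads naturally as a transition table. This sketch collapses the order-issued and worker steps into the approved state, and `submitted` is a name chosen here for the starting state; neither is confirmed by the README excerpt.

```python
TRANSITIONS = {
    "submitted": {"blocked", "duplicate_blocked", "approved", "pending_approval"},
    "pending_approval": {"approved", "rejected", "expired"},
    "approved": {"execution_succeeded", "execution_failed"},
}
TERMINAL = {
    "blocked", "duplicate_blocked", "rejected", "expired",
    "execution_succeeded", "execution_failed",
}

def advance(state, new_state):
    """Move to new_state, or raise if the diagram has no such arrow."""
    if new_state not in TRANSITIONS.get(state, set()):
        raise ValueError(f"illegal transition: {state} -> {new_state}")
    return new_state

state = "submitted"
for step in ("pending_approval", "approved", "execution_succeeded"):
    state = advance(state, step)
```

Terminal states have no outgoing entries, so a blocked or expired intent cannot be revived by a later call.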

What's the worst write an AI agent has made in your system? Not the dramatic database deletions — the quiet ones. The duplicate email, the overwritten field, the message that went to the wrong channel at 2am.