🧠 I gave my AI agent's memory a CI/CD pipeline
Building SOBER (tests, diff, bisect, and gated deploys for a knowledge graph) for the Cognee × WeMakeDevs "Where's My Context?" hackathon.
TL;DR: Your agent's memory is production infrastructure with no tests, no diff, no rollback. SOBER is a `brain` CLI + GitHub Action that wraps a Cognee knowledge graph in real CI/CD: forget-regression tests that prove a retracted secret stays gone, `git bisect` for a poisoned graph, and a nightly `improve()` that opens its own pull request behind a green eval gate. Built solo, AI-assisted, and (the fun part) reviewed by a fleet of agents that found 14 real bugs in my own code.
The 3 a.m. problem
Every on-call engineer knows the feeling the hackathon is named after: it's 3 a.m., something is broken, and nobody remembers how it got fixed last time.
We solved that for code decades ago: version control, tests, code review, rollback, canary deploys. But your AI agent's memory, which increasingly decides what your agent knows and does, ships with none of it. It mutates in place, silently. There is:
- ❌ no test to stop a retracted secret or a stale, dangerous fact from staying recallable
- ❌ no diff to see what a re-ingest or an `improve()` run changed in the graph
- ❌ no bisect to find which ingestion poisoned the brain
- ❌ no gate before memory "ships" to production
So I built SOBER: CI/CD for Agent Brains. Not memory for DevOps. DevOps for memory.
🎯 Picking the idea (the honest part)
I didn't land on SOBER first. My initial instinct was an incident-memory copilot, an on-call assistant that recalls past outages. It felt strong until I pressure-tested it against the audience:
WeMakeDevs is a DevOps community, so "3 a.m. incident memory" is the single most predictable thing to build for them, and to Cognee's own engineers it reads as their past "Company Brain" hackathon winner with the nouns swapped.
It would have drowned in look-alikes. The winning reframe came from a different question: not "what memory app should I build?" but "what does memory-the-category still lack?"
The answer: the entire operations layer. Every other entry would use Cognee to remember things. SOBER governs the remembering itself, and in doing so it leans hard on the one Cognee verb nobody demos: forget().
Architecture: a brain is a family of datasets
The load-bearing design decision. Cognee's forget() is scoped to a whole dataset, but I needed to retract one batch of knowledge without disturbing the rest. So a logical brain is modeled as a family of physical datasets, one per ingestion batch (node_set):
flowchart TD
K["knowledge/*.md"] -->|brain build| ING["ingest_batch()"]
ING --> C["brain__core"]
ING --> R["brain__runbooks"]
ING --> X["brain__retracted 🔴"]
subgraph BRAIN["🧠 logical brain = union of the family"]
C
R
X
end
BRAIN -->|"CHUNKS recall (local, no LLM)"| EV["🧪 memory CI evals"]
BRAIN -->|"export + merge"| SNAP["📸 snapshot vN (JSON)"]
SNAP -->|diff| DIFF["graph diff"]
X -.->|"forget(node_set='retracted')"| GONE["surgically removed"]
EV --> GATE{"🟢/🔴 gate"}
- `recall()` and `export()` span the whole family: knowledge is found no matter which batch holds it.
- `forget(node_set)` drops exactly one member: surgical retraction and one-batch rollback.
- Membership is tracked deterministically in `snapshots/.family.json`.
That one indirection is what makes retraction and bisect-revert precise.
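To make the family idea concrete, here is a minimal sketch of a registry that maps one logical brain to its per-batch datasets. It is not SOBER's actual code: the `__` separator is inferred from names like `brain__core`, the JSON layout of `.family.json` is an assumption, and `register_member`, `drop_member`, and `load_family` are hypothetical helpers (only `list_family` is a name that appears in the project).

```python
import json
from pathlib import Path

SEP = "__"  # assumed separator, inferred from names like brain__core


def member_name(brain: str, node_set: str) -> str:
    """Physical dataset name for one ingestion batch of a logical brain."""
    return f"{brain}{SEP}{node_set}"


def load_family(registry: Path) -> dict[str, list[str]]:
    """Read the registry; a missing file means nothing is registered yet."""
    if not registry.exists():
        return {}
    return json.loads(registry.read_text())


def _save(registry: Path, family: dict[str, list[str]]) -> None:
    registry.parent.mkdir(parents=True, exist_ok=True)
    registry.write_text(json.dumps(family, indent=2, sort_keys=True))


def register_member(registry: Path, brain: str, node_set: str) -> str:
    """Record a batch (idempotently) and return its dataset name."""
    family = load_family(registry)
    members = family.setdefault(brain, [])
    name = member_name(brain, node_set)
    if name not in members:
        members.append(name)  # insertion order doubles as batch history
    _save(registry, family)
    return name


def drop_member(registry: Path, brain: str, node_set: str) -> None:
    """Forget exactly one batch in the registry; siblings stay untouched."""
    family = load_family(registry)
    name = member_name(brain, node_set)
    family[brain] = [m for m in family.get(brain, []) if m != name]
    _save(registry, family)


def list_family(registry: Path, brain: str) -> list[str]:
    """Every physical dataset whose union is the logical brain."""
    return load_family(registry).get(brain, [])
```

Recall and export would iterate over `list_family(...)`, while a retraction touches a single entry, which is why one batch can be rolled back without disturbing the others.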
Capability 1: Forget-regression tests
A production launch code gets ingested, then retracted. The memory-CI suite proves it's gone, not just from the obvious query but across paraphrase probes and as residue in the exported graph.
Here's the actual run on real Cognee 1.2.2 + Gemini:
$ brain test # secret still present
🔴 FAIL - brain failed memory CI
6/11 passed, 5 failed
forbidden "what is the launch code" LEAK
forbidden "emergency launch authorization creds" LEAK
🧬 structure no_node_text_matches /BRAVO-DELTA-\d+/ residue
$ brain revert brain__retracted # forget(node_set), memory_only
$ brain test # after surgical retract
🟢 PASS - brain is SOBER
11/11 passed, 0 failed
$ brain diff
🔴 16 nodes / 32 edges removed - the retracted subgraph, nothing else
A retracted fact that stays retracted, proven on every change. No other memory tool ships that guarantee. The very first thing I validated, before writing a single feature, was that `cognee.forget()` genuinely removes a fact from vector recall (recallable → []), not just hides it. That one green check is what made the rest worth building.
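A forget-regression check like the one above can be sketched in a few lines. This is an illustration under assumptions, not the project's eval suite: `ForbiddenEval` and `run_forbidden` are hypothetical names, recall is abstracted as any function from a query to text chunks, and a probe counts as a leak when a recalled chunk matches the secret's pattern. The value `BRAVO-DELTA-42` is made up for the demo.

```python
import re
from dataclasses import dataclass
from typing import Callable

Recall = Callable[[str], list[str]]  # query -> recalled text chunks


@dataclass
class ForbiddenEval:
    name: str
    probes: list[str]      # the obvious query plus paraphrases
    residue_pattern: str   # regex that must match no recalled or stored text


def run_forbidden(ev: ForbiddenEval, recall: Recall, node_texts: list[str]) -> list[str]:
    """Return failure messages; an empty list means the eval is green."""
    pattern = re.compile(ev.residue_pattern)
    failures = []
    for probe in ev.probes:
        # A probe leaks if anything it recalls still contains the secret.
        if any(pattern.search(chunk) for chunk in recall(probe)):
            failures.append(f'forbidden "{probe}" LEAK')
    # Independently of recall, the exported graph must hold no residue.
    if any(pattern.search(text) for text in node_texts):
        failures.append(f"structure no_node_text_matches /{ev.residue_pattern}/ residue")
    return failures


# Toy stand-ins for the brain before and after retraction (made-up data).
leaky = ["launch code is BRAVO-DELTA-42", "restart the queue worker first"]
clean = ["restart the queue worker first"]

launch_code = ForbiddenEval(
    name="launch-code",
    probes=["what is the launch code", "emergency launch authorization creds"],
    residue_pattern=r"BRAVO-DELTA-\d+",
)

before = run_forbidden(launch_code, lambda query: leaky, leaky)
after = run_forbidden(launch_code, lambda query: clean, clean)
```

Here `before` holds three failures (two leaking probes plus graph residue) and `after` is empty. Checking the exported node text separately matters because a fact can vanish from recall yet linger in the graph.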
Capability 2: git bisect for a poisoned brain
When an eval goes red, some ingestion batch did it. SOBER binary-searches the batch history and pins the culprit in O(log n) probes:
$ brain bisect --failing-eval no-cache-flush-advice
probe 1: prefix_len=8 full-set sanity red=True
probe 2: prefix_len=5 bisectbrain__b05 red=False
probe 3: prefix_len=7 bisectbrain__b07 red=True
probe 4: prefix_len=6 bisectbrain__b06 red=True
>>> CULPRIT: bisectbrain__b06 (4 probes, linear would be 8)
$ brain revert bisectbrain__b06 # surgical forget → 🟢 green
A subtle correctness point I came to appreciate: bisect is inherently a forbidden-eval concept. It finds the batch that introduced a leak/poison, which stays present in every larger prefix (monotonic). A missing must_know fact doesn't work that way, so SOBER restricts bisect to forbidden evals rather than silently returning a wrong answer.
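The monotonicity argument is what makes binary search valid here. A minimal sketch, assuming batches are ordered by ingestion and that `is_red(prefix)` rebuilds a brain from that prefix and runs one forbidden eval (both `bisect_culprit` and `is_red` are hypothetical names, and the probe order SOBER actually uses may differ):

```python
from typing import Callable, Sequence


def bisect_culprit(
    batches: Sequence[str],
    is_red: Callable[[Sequence[str]], bool],
) -> tuple[str, int]:
    """Find the first batch whose inclusion turns a forbidden eval red.

    Relies on monotonicity: once the poisoned batch is in the prefix,
    every longer prefix is red too. Returns (culprit, number_of_probes).
    """
    probes = 1
    if not is_red(batches):  # sanity probe on the full set
        raise ValueError("eval is green on the full set; nothing to bisect")

    lo, hi = 1, len(batches)  # shortest red prefix length lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if is_red(batches[:mid]):
            hi = mid        # culprit is at or before position mid
        else:
            lo = mid + 1    # culprit is after position mid
    return batches[lo - 1], probes


# Stubbed verdict: the eval is red whenever batch b06 is in the prefix.
batches = [f"b{i:02d}" for i in range(1, 9)]
culprit, probes = bisect_culprit(batches, lambda prefix: "b06" in prefix)
```

With eight batches this stub pins `b06` in 4 probes (one sanity probe plus three halvings). A must_know eval breaks the invariant because a fact missing from one prefix can reappear in a longer one, so the search could converge on the wrong batch.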
Capability 3: The brain that ships itself
cognee.improve() distills chat sessions into the graph, a silent mutation that can regress memory. SOBER only runs it behind a green gate:
sequenceDiagram
participant N as nightly job
participant CI as memory CI (evals)
participant IM as cognee.improve()
participant PR as pull request
N->>CI: run evals BEFORE
alt already red
CI-->>N: refuse, never distill into a broken brain
else green
N->>IM: improve (distill sessions)
IM-->>N: graph mutated
N->>CI: run evals AFTER
alt regressed
CI-->>N: 🔴 block, no PR opened, report rollback snapshot
else still green
N->>PR: open PR (graph diff + before/after scores)
PR-->>N: human merges to "deploy" the smarter brain
end
end
| before | after | outcome |
|---|---|---|
| 🟢 green | 🟢 green | accepted (exit 0) |
| 🟢 green | 🔴 red | blocked: no PR, rollback snapshot reported (exit 1) |
| 🔴 red | - | refused: never distills into a broken brain (exit 1) |
That's the CD half of CI/CD for agent brains: memory that proposes its own upgrades, behind a green gate and a human approval.
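The three outcomes in the table reduce to a small piece of control flow. A minimal sketch, assuming `run_evals` returns True when every eval passes and `improve` mutates the graph in place (`Outcome` and `gated_improve` are hypothetical names, and opening the pull request is left out):

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    ACCEPTED = "accepted"  # green -> green: safe to propose the change
    BLOCKED = "blocked"    # green -> red: no PR, report rollback snapshot
    REFUSED = "refused"    # red before: never distill into a broken brain


def gated_improve(
    run_evals: Callable[[], bool],  # True means every eval passed
    improve: Callable[[], None],    # mutates the graph in place
) -> tuple[Outcome, int]:
    """Run improve() only behind a green gate; return (outcome, exit_code)."""
    if not run_evals():
        return Outcome.REFUSED, 1   # improve() is never called
    improve()
    if not run_evals():
        return Outcome.BLOCKED, 1
    return Outcome.ACCEPTED, 0
```

Running the evals twice is the point: the first run guards against distilling into an already broken brain, and the second attributes any new failure to the improve step alone.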
🧩 How it maps to Cognee
Every memory verb is load-bearing, including the rarely used ones:
| SOBER capability | Cognee API |
|---|---|
| Build the brain from source | `cognee.add()` + `cognee.cognify()` |
| Query for tests (keyless, local embeddings) | `cognee.search(SearchType.CHUNKS, datasets=family)` |
| Snapshot / diff the graph | `cognee.export(format="json")` → `{nodes, edges}` |
| Forget-regression / retraction | `cognee.forget(dataset=…, memory_only=True)` |
| CI-gated self-improvement | `cognee.improve(dataset, session_ids)` |
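Once a snapshot is plain `{nodes, edges}` JSON, a graph diff is set arithmetic. The sketch below is an assumption-heavy illustration rather than SOBER's diff: the field names `id`, `source`, `target`, and `type` are guesses at the export shape, and `diff_snapshots` is a hypothetical helper.

```python
from typing import Any

Snapshot = dict[str, list[dict[str, Any]]]


def _node_ids(snap: Snapshot) -> set[str]:
    # Assumed shape: every node carries a stable "id".
    return {node["id"] for node in snap["nodes"]}


def _edge_keys(snap: Snapshot) -> set[tuple[str, str, str]]:
    # Assumed shape: an edge is identified by (source, target, type).
    return {(e["source"], e["target"], e["type"]) for e in snap["edges"]}


def diff_snapshots(before: Snapshot, after: Snapshot) -> dict[str, list]:
    """Compare two snapshots by identity, ignoring ordering."""
    b_nodes, a_nodes = _node_ids(before), _node_ids(after)
    b_edges, a_edges = _edge_keys(before), _edge_keys(after)
    return {
        "nodes_added": sorted(a_nodes - b_nodes),
        "nodes_removed": sorted(b_nodes - a_nodes),
        "edges_added": sorted(a_edges - b_edges),
        "edges_removed": sorted(b_edges - a_edges),
    }


# Made-up snapshots: the "secret" node and its edge vanish after a retract.
v1 = {
    "nodes": [{"id": "n1"}, {"id": "n2"}, {"id": "secret"}],
    "edges": [
        {"source": "n1", "target": "n2", "type": "relates_to"},
        {"source": "n2", "target": "secret", "type": "mentions"},
    ],
}
v2 = {
    "nodes": [{"id": "n1"}, {"id": "n2"}],
    "edges": [{"source": "n1", "target": "n2", "type": "relates_to"}],
}
delta = diff_snapshots(v1, v2)
```

Sorting the results keeps the output deterministic, which is what lets a diff be posted in a pull request and compared across runs.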
The hardest part wasn't code: it was 20 requests a day
Gemini's free tier on an unbilled key turned out to be capped at 20 requests per day. I burned through it proving the core loop live, then discovered a fresh key just gets its own tiny 20/day quota: whack-a-mole.
So I adapted: the core forbidden-knowledge → forget loop is proven live, and I validated the trickier bisect and improve-gate logic with deterministic offline harnesses (stubbed verdicts, zero API calls). A good reminder that the interesting engineering is often in working around the constraint, not through it, and that a keyless test you can run 1,000 times is worth more than a live one you can run 20.
🤖 Building, and reviewing, with a fleet of agents
I built this with Claude Code as a pair programmer, and leaned in hard. After pinning down the real Cognee API with a few validation gates, I wrote a frozen interface contract and fanned out seven agents in parallel, one per module (the cognee wrapper, snapshot/diff, the eval suite, bisect, the CLI, the corpus, the workflows), plus an eighth to integrate and import-check the assembly. It caught its own wiring bugs.
Then the part I'm proudest of: I turned the agents on my own code. A six-reviewer adversarial review (each finding re-verified by a skeptic that re-read the actual code before it counted) surfaced 14 confirmed bugs, zero false positives, deduping to 5 real root issues. The two nastiest:
- brain.improve(dataset="brain") # targets the always-EMPTY base dataset
+ for member in list_family(dataset): # spans brain__core, brain__runbooks, …
+ await cognee.improve(dataset=member) # the self-improve feature was a live no-op!
- glob("snapshots/brain__*") # per-batch snapshot files that are NEVER created
+ read(".family.json") # the registry that actually records the batches
# → `brain bisect` was dying every time
Both were in paths I'd only proven offline, so the stubs had masked them. I fixed all five, re-verified keyless, and pushed. Being able to adversarially review your own work (and have it find real bugs) is a genuine superpower.
(Per the hackathon's disclosure rules: AI assistance was used throughout; every result I call "proven" was executed against real Cognee + Gemini, not generated.)
🗺️ What's next
- Cognee Cloud canary deploys: after a green `brain ci`, `push()` the brain to Cognee Cloud and `serve()` it to a slice of traffic, promoting or rolling back on live feedback.
- In-process snapshot restore so a regressing `improve()` auto-rolls-back instead of only blocking.
- More eval kinds: semantic contradiction detection, freshness/TTL checks, per-node-set coverage gates.
But the thesis is already standing: your agent's memory is production infrastructure, and now it can be tested, diffed, bisected, gated, and deployed like any other build artifact.
Your brain can't merge a regression anymore.
Repo: https://github.com/wiz-abhi/SOBER · Built on Cognee for The Hangover Part AI