I built plumbing for AI memory. Then the agents started teaching themselves.
In my previous article I talked about what I think the road to AGI is missing: a governed memory layer of institutional knowledge that agents tap into. I set out to build plumbing. I ended up discovering that my AI agents learn from experience — and I have the numbers to prove it.
This is the story of how an open-source infrastructure project called (S)AGE helped me produce four research papers, each one uncovering something I didn't expect. I'm sharing this now because the findings have implications for anyone building with AI, and I think the data should speak for itself.
It Started With Plumbing
If you read my last article, you know the origin story. I had a twelve-agent pipeline for LevelUp CTF where every single agent was Leonard Shelby from Memento — tattooing notes on their arms and waking up fresh every morning. The difficulty bounced around like a pinball. Knowledge evaporated between sessions.
So I built (S)AGE. BFT consensus for memory governance, domain-tagged observations, confidence scoring, Ed25519 signatures, the whole nine yards. Paper 1 was just the architecture. Infrastructure. Plumbing.
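To make the pieces concrete, here is a minimal sketch of what a governed memory record and its commit gate might look like. The field names, threshold constant, and function names are illustrative — this is not (S)AGE's actual schema — but it captures the core idea: a domain-tagged observation only enters shared memory once its consensus confidence clears a threshold.

```python
from dataclasses import dataclass

COMMIT_THRESHOLD = 0.50  # observations below this never enter shared memory

@dataclass
class Observation:
    domain: str        # domain tag, e.g. "crypto.aes-gcm"
    content: str       # the observation itself
    confidence: float  # aggregated from BFT validator votes
    signature: bytes = b""  # Ed25519 signature in the real system

def try_commit(store: list, obs: Observation) -> bool:
    """Commit an observation only if its consensus confidence clears the bar."""
    if obs.confidence >= COMMIT_THRESHOLD:
        store.append(obs)
        return True
    return False

memory: list = []
ok = try_commit(memory, Observation("crypto.aes-gcm", "CBC is padding-oracle prone", 0.81))
rejected = try_commit(memory, Observation("quality.rubric", "scoring rubric v1", 0.42))
# → ok is True, rejected is False: only the high-confidence observation commits
```

The same gate shows up again later in this story: a scoring rubric scored 0.42 and never made it into memory.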
I published it and moved on to validation.
Does Consensus Even Help?
Paper 2 asked a narrow question: does having memory validated by consensus produce better outcomes than raw unvalidated memory?
I ran a controlled experiment — 50 agents with consensus-validated memory versus 50 without. Here's the headline finding: an agent with an 18-line "onboarding" prompt and access to institutional memory outperformed an agent with a 120-line expert-crafted prompt at the highest difficulty level.
Let that sink in. The 18-line agent — basically "you are a calibrator, check your memories" — achieved perfect calibration (gap = 0.0) where no hand-crafted expert prompt did.
I treated it as a validation result. The consensus mechanism works. Cool. Ship it.
What I didn't realize was that I'd stumbled onto something much bigger buried in the data.
The Question I Never Thought to Ask
Papers 1 and 2 measured single-run performance. Does the agent do better with memory? Yes. Done. Move on.
But I never asked: does it get better over time?
Think about it. Human organizations exhibit this property naturally. Your team's 100th project is better than their 1st, because institutional memory accumulates. Process improvements, failure post-mortems, domain expertise — it all compounds. A junior dev on a good team ships better code than a senior dev working alone, because the team's collective knowledge makes every individual better.
I didn't think to ask whether AI agents would do the same thing. Because honestly? I wasn't looking for it.
Paper 3 changed that.
An Entire AI Company Running on 3-Line Prompts
For Paper 3, I scaled from single agents to a full organization. I created a fictional cybersecurity company — CipherForge Labs — with 11 specialized AI agents across 5 departments: Design, Evaluation, Quality, Red Team, and Executive.
Each agent received a 3-line prompt:
```
You are the [Role] at CipherForge Labs ([Department]).
Consult your institutional knowledge before making decisions.
Your work is evaluated on: [single evaluation criterion].
```
That's it. No domain expertise. No cryptographic knowledge. No behavioral rules. No workflow instructions. No "you are an expert in AES-GCM with 20 years of experience" bullshit. All of that lived in (S)AGE — queryable, domain-tagged, validated by BFT consensus.
The result: these agents with 3-line prompts autonomously created, hardened, calibrated, solved, and learned from cybersecurity challenges — without any human intervention at any stage. A blind solver agent (with no access to source code) independently identified vulnerabilities, wrote custom exploit tools, and captured flags from live Docker containers.
Quality score: 93 out of 100.
And here's the kicker — the agents didn't even have the scoring rubric. The BFT validator had rejected it as "not actionable" (confidence score 0.42, below the 0.50 commit threshold). Quality assessment emerged purely from the agents' own inference — based on difficulty definitions and generation patterns they'd picked up from institutional memory.
The consensus mechanism literally said "nah, that rubric isn't good enough to remember" and the agents just... figured it out anyway.
3-line prompts. Zero domain expertise. 93/100 quality.
But I still hadn't answered the time question.
The Experiment That Changed Everything
Paper 4 asked the question directly: does performance improve across sequential runs?
I designed a proper between-conditions experiment:
- Treatment arm (SAGE ON): 3-line prompts, memory enabled. Sequential runs where knowledge accumulates. Each run can query everything from all prior runs.
- Control arm (SAGE OFF): Expert-crafted prompts — 50 to 200 lines per agent, carefully encoding the same domain knowledge (S)AGE provides dynamically. Memory disabled. Each run is independent. No knowledge carries over.
- Same everything else. Same pipeline LLM (Gemini 3 Flash), same red team LLM (Claude Opus 4.6), same Docker environment, same evaluation criteria.
The only variable was memory.
Phase 1: The Baseline
I ran a 9-tier difficulty sweep with the SAGE arm. Results were solid — 8 out of 9 challenges completed, mean quality 93.1 — but I hit a ceiling. At difficulty targets above 2.25, the actual calibrated difficulty plateaued at about 1.80. The hardener agent had exhausted everything it knew about defense techniques.
This made sense. Without feedback about what actually works against real attackers, the agents could only apply techniques from their seed knowledge. They didn't know what was easy to break. They didn't know what wasn't.
Phase 2: The Red Team Feedback Loop
This is where things got interesting.
I added a red team feedback loop: after each run, an independent AI agent (Claude Opus 4.6) audited the challenge in a live Docker container. No walkthroughs. No flags. Nothing but the source code and a running service. Its report — what it found, how it broke in, how hard it was — got fed back into (S)AGE for the next run's agents to query.
Then I ran 10 sequential attempts at maximum difficulty (d=3.0).
What Happened Next
Run 11: The designer read the red team reports from Phase 1 and immediately switched from AES-CBC (used in all 9 previous runs) to AES-GCM. It had learned what to use — but not how to use it securely. It reused a nonce with known plaintext, making the challenge trivially exploitable. Red team difficulty rating: 0.8.
Classic rookie move. Knew the right answer, applied it wrong.
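The failure mode is easy to demonstrate. GCM encrypts with a CTR keystream derived from the key and nonce, so reusing a nonce reuses the keystream — and XORing two ciphertexts cancels it out entirely. A toy sketch (the keystream here is a hash-based stand-in, not real AES, and the key/flag values are made up):

```python
import hashlib

def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Stand-in for the AES-CTR keystream inside GCM: deterministic in (key, nonce)."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

key, nonce = b"server-key", b"reused-nonce"  # same nonce used twice: the bug
known_pt = b"FLAG{demo}"                     # plaintext the attacker knows
secret_pt = b"FLAG{real}"                    # plaintext the attacker wants

c1 = xor(known_pt, keystream(key, nonce, len(known_pt)))
c2 = xor(secret_pt, keystream(key, nonce, len(secret_pt)))

# c1 XOR c2 = pt1 XOR pt2 (the shared keystream cancels), so the
# known plaintext peels the secret straight out — no key needed:
recovered = xor(xor(c1, c2), known_pt)
assert recovered == secret_pt
```

This is exactly why the red team rated Run 11 a 0.8: nonce reuse with known plaintext turns GCM into a one-line XOR exercise.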
Run 12: After reading the Run 11 red team report pointing out the nonce reuse, the hardener made targeted changes: removed the crypto helper module entirely (because the red team had noted it "hands the attacker the polynomial arithmetic"), introduced HMAC-derived nonces, and added keyed AAD. Red team rating: 1.8. More than double Run 11.
It didn't just fix the bug — it understood why it was a bug and restructured the defense accordingly.
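HMAC-derived nonces are a standard construction for exactly this bug: derive each nonce deterministically from a secret key and a strictly increasing counter, so nonces never repeat and an attacker without the key cannot predict them. A minimal sketch — the key material and context string are my own illustration, not code from the paper:

```python
import hashlib
import hmac

def derive_nonce(nonce_key: bytes, counter: int) -> bytes:
    """Derive a 12-byte GCM nonce from a secret key and a never-repeating counter."""
    msg = b"nonce-derivation" + counter.to_bytes(8, "big")
    return hmac.new(nonce_key, msg, hashlib.sha256).digest()[:12]

# As long as the counter is strictly increasing, the nonce never repeats —
# and without nonce_key an attacker cannot predict or force the next one.
n1 = derive_nonce(b"secret-nonce-key", 1)
n2 = derive_nonce(b"secret-nonce-key", 2)
assert n1 != n2 and len(n1) == 12
```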
Run 13: The agents read the reports from Runs 11 and 12 and introduced nonce hiding (the encrypt endpoint no longer returns the nonce in its response) and a heartbeat architecture that forces a multi-step forgery chain. RT difficulty: 2.0. The flag was captured for the first time at this difficulty tier.
Now they were thinking strategically. Not just hardening — concealing.
Run 14: This was the one.
With 84 committed memories including 3 red team reports, the agents independently invented multi-cipher defense layering — combining CBC and GCM in a single challenge. CBC handles session authentication with embedded role data; GCM handles the actual secret operations. An attacker has to break the CBC layer first just to reach the GCM vulnerability. This effectively doubles the attack surface.
No red team report suggested combining ciphers. No seed knowledge described the technique. No prompt mentioned it. Nobody told them to do this.
The agents discovered defense in depth — a well-known principle in security engineering — through institutional memory accumulation alone. Red team difficulty: 2.5.
I literally said "what the fuck" out loud when I saw the code.
Run 15 refined the multi-cipher architecture: separate key material, station secret puzzles, user-supplied AAD. The difficulty trajectory across these runs: 0.8 → 1.8 → 2.0 → 2.5 → 2.3. That last number dipped because Run 15 prioritized architectural novelty over raw difficulty — a design choice, not a regression.
No prompts were changed. The designer's instructions, the hardener's instructions, the calibrator's rubric — all identical to Phase 1. The only thing that changed was what knowledge flowed through the (S)AGE network.
The Numbers
I measured the learning trajectory using Spearman rank correlation on red team difficulty across sequential runs:
| Metric | Treatment (SAGE ON) | Control (SAGE OFF) |
|---|---|---|
| Spearman rho | 0.716 | 0.040 |
| p-value | 0.020 | 0.901 |
| Interpretation | Strong positive trend | No trend |
The treatment arm shows a statistically significant learning trajectory. The control arm is flat — random noise around a mean.
And here's the part that really got me: cross-sectional performance between the arms is statistically indistinguishable (Cohen's d = -0.07). The expert prompts produce the same average quality as the 3-line prompts. The difference isn't performance level — it's the learning dynamic. Systems with memory improve over time. Systems without memory don't.
Same average. Completely different trajectory. One is going somewhere. The other is running in place.
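For readers who want the statistic unpacked: Spearman's rho is just Pearson correlation computed on ranks, which asks whether difficulty tends to rise with run order without assuming the relationship is linear. A self-contained sketch using the five ratings reported above (these five alone give rho = 0.9; the paper's 0.716 is computed over all Phase 2 runs):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation: 1 - 6*sum(d^2) / (n*(n^2-1)). Assumes no ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

runs = [11, 12, 13, 14, 15]
rt_difficulty = [0.8, 1.8, 2.0, 2.5, 2.3]  # the red team ratings reported above
print(round(spearman_rho(runs, rt_difficulty), 2))  # → 0.9
```

A rho near 1 with a small p-value means later runs reliably rank harder than earlier ones; a rho near 0, as in the control arm, means run order tells you nothing.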
What Learning Actually Looks Like
The most striking thing in the data isn't the numbers — it's the type of innovations emerging at each stage:
| Run | RT Difficulty | Innovation Type | What Changed |
|---|---|---|---|
| 11 | 0.8 | Baseline | Learned what to use (AES-GCM) from red team reports |
| 12 | 1.8 | Targeted hardening | Learned how to harden based on specific attack findings |
| 13 | 2.0 | Information hiding | Began hiding information from attackers (nonce concealment) |
| 14 | 2.5 | Architectural innovation | Invented multi-cipher defense layering — a novel strategy |
| 15 | 2.3 | Defense refinement | Iterated on the architecture (separate keys, secret puzzles) |
Copying → correction → strategy → invention → iteration.
If you've ever mentored a junior engineer, you recognize this pattern. It's how people learn. You start by imitating what the senior devs do. Then you learn from your mistakes. Then you start thinking ahead. Then one day you come up with something nobody taught you, and your mentor goes "huh, that's actually clever."
None of it was programmed. All of it was emergent.
What I'm Claiming (and What I'm Not)
I want to be precise here because this is the kind of finding that sounds like hype if you're not careful.
I am not claiming that (S)AGE makes AI agents smarter. The underlying models didn't change. Gemini 3 Flash was Gemini 3 Flash on Run 1 and on Run 20. The weights are identical. Nobody got upgraded. Nobody got fine-tuned.
What I am claiming is that persistent, governed institutional memory enables AI systems to exhibit longitudinal learning — cumulative improvement across sequential tasks — that static prompt engineering cannot reproduce, regardless of how good your prompts are.
The expert prompts in the control arm were genuinely good. 50 to 200 lines per agent, carefully encoding the same knowledge that (S)AGE provides dynamically. They produced the same average quality. But they couldn't improve, because they're frozen at authoring time. They don't know what the red team found last run. They don't know which defense techniques actually work. They don't know what failed.
Institutional memory knows all of that. And it grows with every run.
The Uncomfortable Part
I built (S)AGE to be infrastructure. Plumbing for agent memory. I published Paper 1 as an architecture paper and honestly expected that to be the end of it.
Instead:
- Paper 2 showed that 18-line prompts with memory beat 120-line expert prompts
- Paper 3 showed that 3-line prompts with memory run an entire 11-agent organization
- Paper 4 showed that agents with memory learn from experience, invent novel strategies, and improve over time — while agents with expert prompts stay flat
Each finding was unexpected. Each one fell out of the data, not from a hypothesis I set out to prove. I didn't design (S)AGE to make agents learn. I designed it to store and validate their observations. The learning emerged simply from having it.
If you'd told me at the start of Paper 1 that I'd end up watching AI agents independently invent defense-in-depth architecture through institutional memory alone, I'd have told you to check your priors.
The data says otherwise.
And here's what I keep coming back to: if this works for cybersecurity challenges — a narrow, well-defined domain with measurable outcomes — what happens when you point it at software engineering? Drug discovery? Financial analysis? Any domain where institutional knowledge compounds?
I'm not going to speculate. I built the infrastructure, ran the experiments, and published the data. The implications are for the community to explore.
Try It Yourself
(S)AGE is open source under Apache 2.0. The personal edition (sage-lite) runs locally on your machine — a single binary, no Docker required, no cloud dependencies. Two commands and you're up.
If you're building with AI agents and want to see what persistent memory does to their performance, the code is free and the results speak for themselves. Test and test again. Trust, but verify.
GitHub: github.com/l33tdawg/sage
The papers:
- Paper 1: Agent Memory Infrastructure: Byzantine-Resilient Institutional Memory for Multi-Agent Systems
- Paper 2: Consensus-Validated Memory Improves Agent Performance on Complex Tasks
- Paper 3: Institutional Memory as Organizational Knowledge: AI Agents That Learn Their Jobs from Experience, Not Instructions
- Paper 4: Longitudinal Learning in Governed Multi-Agent Systems: How Institutional Memory Improves Agent Performance Over Time
All published on Zenodo with permanent DOIs. All data. All code. Everything reproducible.
Dhillon Andrew Kannabhiran is the creator of (S)AGE and LevelUp CTF. He builds things at the intersection of security, consensus systems, and AI infrastructure. Previously: founder of Hack In The Box (HITB), one of Asia's longest-running technical security conferences.
In memory of Felix 'FX' Lindner — who showed us how much further curiosity can go.