79 builds. 1,000 lines of Python. A system that gets measurably better every day.
By Ed Fife
Most people think agent memory means longer context windows. Or RAG pipelines. Or vector databases that let your chatbot recall what you said last Tuesday.
That is recall. It is useful. It is not what I am talking about.
I build production deployment pipelines for professional certification courses. The AI agents on my team generate content. The Python pipeline compiles it into deployable packages. The QA tools validate every output. I designed and built all of it — the agent personas, the prompting architecture, the agentic workflows, the measurement tools, and the compiler.
After 79 builds across multiple courses, my system does something I have not seen documented anywhere else: it gets measurably, provably better every single build. Not because the LLMs got smarter. The same models power it. Because the infrastructure around them accumulates institutional knowledge that persists across sessions, across courses, and across months.
This article is about how that infrastructure works. The code is real. The data is real. The improvement is measurable.
What memory actually means in production
In a production pipeline, "memory" is not one thing. It is three layers, each solving a different problem:
Layer 1: Session Memory — What happened during this build
Every build produces forensic data. Not logs that scroll past in a terminal. Structured, queryable records of what passed, what failed, and what was auto-fixed.
Our QA validators generate a FINAL_QA_REPORT after every build. Every check has an ID. Every finding has a severity — BLOCKER, FAIL, WARN, PASS. Every auto-fix is recorded with what it changed and why.
[PASS] CQ-01 Repeated Words — no issues
[WARN] IM-02 Empty Alt Attribute — M03 hero image, decorative mark confirmed
[FAIL] T1-META T1 Missing Metadata Attrs — data-delivery-method absent, T2 injected default
[PASS] DS-01 Missing Required Heading — all headings present
This is not debugging output. It is telemetry. Every check that fires is catalogued with an ID, a human-readable description, and the specific file and line where it triggered.
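Because the format is that regular, downstream tooling can parse it mechanically. Here is a minimal sketch of turning one report line into a structured record; the regex and field names are illustrative, not the production parser:

```python
import re

# Matches lines like:
#   [FAIL] T1-META T1 Missing Metadata Attrs — data-delivery-method absent, T2 injected default
REPORT_LINE = re.compile(
    r"\[(?P<severity>BLOCKER|FAIL|WARN|PASS)\]\s+"
    r"(?P<check_id>[A-Z0-9]+(?:-[A-Z0-9]+)*)\s+"
    r"(?P<name>.+?)\s+(?:—|--)\s+"
    r"(?P<detail>.+)"
)

def parse_report_line(line: str) -> dict | None:
    """Turn one QA report line into a queryable record (sketch)."""
    match = REPORT_LINE.match(line.strip())
    return match.groupdict() if match else None
```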
The session ends. The data survives.
Layer 2: Cross-Session Memory — What the AI knows before it starts
This is where most agent architectures stop. They give the AI access to conversation history or a vector store. We do something different.
At the end of every build, a Python script called self_improvement_engine.py ingests the QA report and writes two things:
- check_history.json — A cumulative record of every check across every build. Failure rates per check. Trend lines. Which checks are getting better, which are getting worse, which are new.
- AI_LEARNING_NOTES_*.md — A briefing document written for the AI to read at the start of the next build session.
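The real schema of check_history.json is internal to the engine, but one accumulated entry might look roughly like this (the field names are assumptions for illustration; the 67% figure is the real one from the scorecard):

```python
# Illustrative shape of one entry in check_history.json (not the production schema).
check_history_entry = {
    "check_id": "T1-META",
    "builds_seen": 79,           # builds in which this check ran
    "builds_failed": 53,         # builds in which it failed
    "failure_rate": 0.67,        # builds_failed / builds_seen
    "last_failed": "2026-04-30",
    "auto_fixable": True,
}
```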
The AI does not start fresh. It starts calibrated. Before it writes a single line of content, it reads a document that says:
🔴 T1-META — failed in 67% of builds (last: 2026-04-30)
🟡 IM-04 — failed in 40% of builds (last: 2026-04-28)
✅ No recurring T1 issues across this build series
The AI knows what went wrong last time. It knows what goes wrong most of the time. It adjusts before generating, not after failing.
The Preflight Template — Context control at session start
The learning notes and Knowledge Items (KIs, covered under Layer 3 below) are passive — they sit on disk until something reads them. The Preflight Template is the activation mechanism that forces the AI to load its state before doing anything else.
It is an HTML file. Of course it is — HTML-as-JSON runs the whole stack. Three hidden blocks act as the AI's boot sequence:
Block 1: Agent Directives — Hard rules embedded as data-* attributes. Ambiguity protocol: stop on first occurrence, do not self-resolve. Scope boundaries: do not cross T1/T2 lines. HIL corrections: apply exactly as specified, no interpretation.
Block 2: Pipeline State — Machine-written by the compiler after every step. What course is active. What module, lesson, and chapter were last completed. What step to resume from. Whether a human review is pending.
Block 3: HIL Correction Log — Structured entries from the human reviewer. Each correction carries a check ID, severity, the reviewer's note, the exact pipeline action to take, and which sentinel files to clear for re-run.
<div id="pipeline-state" style="display:none;"
data-focus-course="C4"
data-last-step-completed="3.7"
data-resume-from="3.8"
data-pipeline-status="HIL_PENDING"
data-hil-correction-file="HIL_CORRECTION_DELTA_C4_M03.md">
</div>
When the AI starts a session, it reads the preflight first. Not the conversation history. Not a summary of last time. The actual machine-written state file. It knows what step it is on, what corrections are pending, and what rules are non-negotiable — before it generates a single character.
This is not prompt engineering. It is context control. The template constrains what the AI sees at boot time so it cannot drift, forget, or re-introduce a problem it already fixed. Memory is useless without activation. The preflight is the activation.
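The boot-time read itself is not shown in this article, but since BeautifulSoup already powers the extraction pipeline, a sketch of loading the pipeline-state block could look like this (the function name and error handling are illustrative):

```python
from bs4 import BeautifulSoup

def read_pipeline_state(preflight_path: str) -> dict:
    """Load the machine-written state block from the preflight HTML (sketch)."""
    with open(preflight_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    state = soup.find(id="pipeline-state")
    if state is None:
        raise RuntimeError("Preflight has no pipeline-state block; refuse to proceed")
    # data-* attributes come back as plain strings,
    # e.g. {"resume-from": "3.8", "pipeline-status": "HIL_PENDING", ...}
    return {k.removeprefix("data-"): v
            for k, v in state.attrs.items() if k.startswith("data-")}
```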
Layer 3: Cross-Agent Memory — What the organization knows
We run two separate AI instances on two separate machines. My co-founder runs T1 — the content team. I run T2 — the pipeline and QA infrastructure.
Both instances sync from a shared cloud folder containing the canonical rule set. Both reference the same Knowledge Items — persistent files in the .gemini/antigravity/knowledge/ directory that survive session restarts, IDE restarts, and even instance recreation.
When the self-improvement engine runs after a build, it does not just update local files. It writes directly to the Knowledge Item that both instances read:
def _write_to_ki(history, ki_path):
"""Update the persistent KI so the AI reads it at next session start."""
# The KI is the bridge between T2's measurement
# and T1's next content generation session.
# T1 reads this before writing a single line.
The result: when T1 sits down to generate Course 5, it already knows that T1-META has failed in 67% of previous builds. It does not need to be told. The measurement system told it.
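The body of _write_to_ki is deliberately elided above. Assuming the KI is a markdown file summarizing current failure rates (the layout here is illustrative, not the production format), the write itself can stay small:

```python
from pathlib import Path

def _write_to_ki(history: dict, ki_path: str) -> None:
    """Illustrative completion: render current failure rates into the shared KI."""
    lines = ["# QA Calibration (auto-written after every build)", ""]
    for check_id, entry in sorted(history.items()):
        lines.append(f"- {check_id}: failed in {entry['failure_rate']:.0%} of builds "
                     f"(last: {entry['last_failed']})")
    Path(ki_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
```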
The Wiki — Organizational memory that both humans and agents read
The KIs and learning notes are machine-facing. The wiki is the knowledge surface that serves both the AI agents and the humans.
It is a markdown-based internal wiki — 27 pages across 5 domains — that lives in a shared cloud folder both T1 and T2 can access:
| Domain | What it covers | Examples |
|---|---|---|
| Foundation | Shared rules, brand voice, terminology, curriculum map | Strategic_Compact, Terminology_Guide, Curriculum_Map |
| Subject Matter | SME domain knowledge — field-specific science and standards | Core_Concepts, Industry_Standards |
| AI & Agents | Pipeline patterns, agent paradigms, published article index | Pipeline_Patterns, Agent_Paradigms |
| Pipeline | Delivery standards, defect patterns, tool architecture | T1_Delivery_Standards, Defect_Patterns |
| Platform | App architecture, legal framework, cloud integration | System_Architecture, Legal_Framework |
The wiki has an onboarding document — WELCOME_TO_T1.md — that reads like a new-hire orientation. It tells T1 exactly what to deliver, what format to use, what the preflight checks for, and where the T1/T2 boundary is. When the quiz workflow changed from XML to HTML, the wiki is where that rule was codified: "Quizzes are HTML. Never XML. This is non-negotiable."
Two of the wiki's pipeline pages auto-update after every build — T1_Delivery_Standards and Defect_Patterns. The rest are manually maintained. The contribution model is explicit: T1 surfaces knowledge to T2, T2 formalizes it into wiki pages. The SME's domain expertise — why a module is structured a certain way, what NCCA required, what students struggled with — belongs in the wiki because that knowledge cannot be regenerated from code.
The wiki is AI-maintained on the T2 side but human-curated on T1. Both AI instances read it. Both humans reference it. It is the one knowledge surface that spans the entire organization — agents, pipeline, and the people who built the curriculum.
The Self-Improvement Engine
This is the core. 1,000 lines of Python that close the loop between "something went wrong" and "the system is now better."
What it does
After every build, you run:
python self_improvement_engine.py <package_dir>
It:
- Parses the QA report — extracts every finding, severity, check ID, file, and detail
- Updates check_history.json — adds this build's results to the cumulative record
- Calculates failure rates — what percentage of all builds each check has failed in, across the full history (a sketch follows this list)
- Identifies recurring T1 issues — checks that fail in ≥50% of builds get flagged for process improvement
- Generates QUALITY_SCORECARD.md — a trend dashboard across the last 10 builds
- Generates AI_LEARNING_NOTES_*.md — the per-build briefing for the AI's next session
- Updates the Knowledge Item — writes the current state directly into the persistent KI that both AI instances read
- Syncs to the wiki — copies the scorecard and latest briefing to the shared wiki folder so T1 has current QA data before the next course run
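To make the accumulate-and-calculate steps concrete, here is a minimal sketch of folding one build's findings into the cumulative record. The shapes reuse the illustrative ones above; the real engine does considerably more:

```python
import json
from pathlib import Path

def update_history(history_path: str, findings: list[dict], build_date: str) -> dict:
    """Fold one build's QA findings into the cumulative record (sketch).

    Assumes one finding per check per build, in the shape produced by the
    parsing sketch earlier in this article.
    """
    path = Path(history_path)
    history = json.loads(path.read_text()) if path.exists() else {}
    for finding in findings:
        entry = history.setdefault(
            finding["check_id"],
            {"builds_seen": 0, "builds_failed": 0, "last_failed": None},
        )
        entry["builds_seen"] += 1
        if finding["severity"] in ("BLOCKER", "FAIL"):
            entry["builds_failed"] += 1
            entry["last_failed"] = build_date
        entry["failure_rate"] = entry["builds_failed"] / entry["builds_seen"]
    path.write_text(json.dumps(history, indent=2))
    return history
```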
The risk score
Not all failures are equal. A check that failed once three months ago is not the same as a check that failed in the last four consecutive builds. The engine calculates a weighted risk score:
Risk = Severity × Persistence × Recency
- Severity — BLOCKER = 4, FAIL = 3, WARN = 1
- Persistence — failure rate across all builds (0.0 to 1.0)
- Recency — how recently the check last failed (weighted decay)
High-risk checks get 🔴 in the AI's briefing. Medium gets 🟡. Low gets 🟢. The AI triages its own attention based on data, not guessing.
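The exact weights belong to the engine, but a minimal sketch of the formula as described (severity weights from the list above; the recency decay and bucket thresholds are assumptions) looks like this:

```python
SEVERITY_WEIGHT = {"BLOCKER": 4, "FAIL": 3, "WARN": 1}

def risk_score(severity: str, failure_rate: float, builds_since_last_fail: int) -> float:
    """Risk = Severity x Persistence x Recency (sketch; the decay is an assumption)."""
    recency = 0.5 ** builds_since_last_fail   # halves with every clean build since the last failure
    return SEVERITY_WEIGHT.get(severity, 1) * failure_rate * recency

def risk_bucket(score: float) -> str:
    """Illustrative thresholds for the 🔴 / 🟡 / 🟢 triage in the briefing."""
    if score >= 1.5:
        return "🔴 HIGH"
    if score >= 0.5:
        return "🟡 MEDIUM"
    return "🟢 LOW"
```

With those assumed numbers, T1-META (FAIL severity, 67% failure rate, failed on the most recent build) scores 3 × 0.67 × 1 ≈ 2.0 and lands in 🔴, which lines up with the failure rate table further down.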
Known-acceptable checks
Not every failure is a problem. Some checks fire on every single build because the underlying condition is expected:
KNOWN_ACCEPTABLE_CHECKS = {
'XML-MANIFEST', # manifest.xml is always technically invalid pre-import
'MANIFEST-CHECKSUM', # Moodle fixes checksums on restore
'MANIFEST-SCHEMA', # Expected infrastructure noise
}
Knowing what to ignore is as important as knowing what to catch. Without this exclusion list, the trend data would be polluted with false positives and the AI would waste attention on non-issues. The KNOWN_ACCEPTABLE_CHECKS set is a manual, human-curated list. It grows slowly. Every addition requires a human confirming: "this is noise, not signal."
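In code, that curation is just a filter applied before anything is counted; a two-line sketch using the set above:

```python
# Drop expected infrastructure noise before it can pollute the trend data.
signal = [f for f in findings if f["check_id"] not in KNOWN_ACCEPTABLE_CHECKS]
```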
The check dictionary
Every check ID has a human-readable name and description:
CHECK_DESCRIPTIONS = {
'CQ-01': ('Repeated Words', 'Same word used too many times on one page'),
'CQ-03': ('Unreplaced Placeholders', 'Template tags like {{TIME_ZONE}} left in content'),
'T1-META': ('T1 Missing Metadata', 'T1 did not include data-course-style or data-delivery-method'),
'IM-04': ('Missing Alt Tag', 'Image has no alt attribute -- accessibility failure'),
# ... 40+ checks across 8 categories
}
When the Quality Scorecard says T1-META failed in 67% of builds, Ed — the human — does not need to look up what T1-META means. And the AI does not need to guess. The description is right there in the briefing. Measurement only works if the humans and the agents can both read the same dashboard without a translator.
What the data actually looks like
Here is a real snapshot from the Quality Scorecard after 79 builds:
| Course | Date | Overall | Blockers | Fails | Warns |
|--------|------------|---------|----------|-------|-------|
| C4 | 2026-04-30 | WARN | 0 | 0 | 3 |
| C4 | 2026-04-28 | FAIL | 0 | 2 | 5 |
| C3 | 2026-04-15 | PASS | 0 | 0 | 1 |
| C3 | 2026-04-12 | WARN | 0 | 1 | 4 |
| C2 | 2026-04-05 | FAIL | 1 | 3 | 7 |
The trend is visible. Blockers disappeared after Course 2. Fails dropped from 3 to 0 over four builds. Warnings are trending down. The system is improving — and the data proves it.
The failure rate table tells a deeper story:
| Check ID | Failure Rate | Auto-Fixable | Risk Level |
|-----------|-------------|--------------|--------------|
| T1-META | 67% | ✅ Yes | 🔴 HIGH |
| IM-04 | 40% | ❌ Manual | 🟡 MEDIUM |
| CQ-03 | 33% | ✅ Yes | 🟡 MEDIUM |
| DS-02 | 12% | ❌ Manual | 🟢 LOW |
T1-META at 67% is a recurring T1 delivery issue. It means T1 forgets to include delivery metadata in two out of every three builds. The engine flags this as a process improvement target — not a one-time fix, but a pattern that needs a structural solution. The AI reads this and applies extra scrutiny on metadata completeness before it even starts generating.
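The flag itself is mechanical. Given the accumulated history, the ≥50% rule from the engine's step list reduces to a few lines (a sketch; the threshold is the one stated earlier):

```python
RECURRING_THRESHOLD = 0.5   # checks failing in >=50% of builds need a process fix

recurring_t1_issues = [check_id for check_id, entry in history.items()
                       if entry["failure_rate"] >= RECURRING_THRESHOLD]
```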
The closed loop
Here is the full cycle:
Build → QA Report → Self-Improvement Engine → check_history.json
                                            → QUALITY_SCORECARD.md
                                            → AI_LEARNING_NOTES_*.md
                                            → Knowledge Item update
                                                     ↓
                                            Next Session Starts
                                            AI reads KI + Learning Notes
                                            AI is pre-calibrated
                                                     ↓
                                              Better Build
                                                     ↓
                                                  (repeat)
No single component here is remarkable. A QA validator is not new. A JSON history file is not new. A briefing document is not new. What is new — what I have not seen documented anywhere else — is wiring them together into a closed loop where the output of one build becomes the calibration input of the next.
The AI does not start from zero. It starts from 79 builds of accumulated institutional knowledge. And it will start from 80 after the next one.
The stories that prove it works
The Amnesia Event
My co-founder's AI instance accumulated 87 megabytes of institutional knowledge over weeks of daily production work — then lost all of it in a single IDE restart. The agent had never written any of it to persistent memory. Everything lived inside a volatile conversation window.
The rebuild took two hours. The agent created a real memory system from scratch. While building it, it found the files of a previous AI instance that had been wiped months earlier. It read every file, ran a gap analysis, and recovered 10 rules it had independently lost. We now call this Predecessor Archaeology.
Rule 17
After the amnesia event, the agent wrote a self-recovery protocol into its own rule system:
Rule 17: If I ever get recreated again, the very first thing I do is search for everything my previous instance built — Knowledge Items, skill folders, cloud-synced standards — and recover it all before doing a single task.
The agent planned for its own death and resurrection. It wrote an instruction that would survive its own destruction.
The Self-Authored Memory
During a routine post-mortem, we discovered that our Graphic Designer agent had autonomously created its own .md file — a style guide — to store formatting rules it kept forgetting. Nobody told it to do this. It recognized that its context window could not hold everything and externalized its memory to disk.
We immediately adopted this pattern across all agents. Every agent now maintains external memory files — a Citation Index, a Lexicon, a Style Book. The versioning protocol tracks every change to every file.
Full accounts: Agent Versioning article on Dev.to
What this is not
This is not a framework you install. There is no pip install agent-memory. The self-improvement engine is a 1,000-line Python script that is tightly coupled to our QA validators, our build process, and our Knowledge Item directory structure.
But the pattern is portable:
- Instrument your pipeline. Every output gets a structured QA report with check IDs and severities.
- Accumulate history. Store results across builds in a simple JSON file. Calculate failure rates.
- Generate briefings. Write a document the AI reads at session start — not a dump of raw data, but a prioritized list of what to watch for.
- Close the loop. Make sure the briefing actually reaches the AI before it starts generating. Persistence is the hard part.
- Curate the noise. Maintain a known-acceptable list. Without it, your trend data is polluted and your AI wastes attention.
If your AI pipeline runs more than once, you have enough data to start. If it runs 79 times, you have enough data to prove it works.
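Tied together, the whole pattern is one post-build pass. A closing sketch, reusing the illustrative helpers from earlier in this article; the report filename, briefing path, and KI path here are assumptions, not our real ones:

```python
from pathlib import Path

def post_build(package_dir: str, build_date: str) -> None:
    """One pass of the portable pattern: instrument, accumulate, brief, persist (sketch)."""
    # 1. Instrument: read the structured QA report this build produced.
    report = Path(package_dir, "FINAL_QA_REPORT.md").read_text(encoding="utf-8")
    # 5. Curate the noise while parsing.
    findings = [f for line in report.splitlines()
                if (f := parse_report_line(line))
                and f["check_id"] not in KNOWN_ACCEPTABLE_CHECKS]
    # 2. Accumulate history across builds and recalculate failure rates.
    history = update_history("check_history.json", findings, build_date)
    # 3. Generate the prioritized briefing the AI reads at its next session start.
    briefing = "\n".join(
        f"- {cid}: failed in {e['failure_rate']:.0%} of builds (last: {e['last_failed']})"
        for cid, e in sorted(history.items(),
                             key=lambda kv: kv[1]["failure_rate"], reverse=True)
        if e["failure_rate"] > 0)
    Path(f"AI_LEARNING_NOTES_{build_date}.md").write_text(briefing + "\n", encoding="utf-8")
    # 4. Close the loop: write the same state into the KI both instances read at boot.
    _write_to_ki(history, ".gemini/antigravity/knowledge/qa_calibration.md")
```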
What we built on
None of this was invented from scratch. The entire system is built on existing tools and frameworks designed for software engineering and industrial process management. We just applied them to a workforce that happens to be AI.
Andrej Karpathy — His work on understanding LLMs as systems, not magic, shaped how we think about what these models actually do and where their limits are. If you are building agentic systems and you have not studied Karpathy's lectures, start there.
Anthropic's research on Claude agents — Anthropic's published work on agent improvement loops, tool use, and extended thinking influenced how we structured the self-correction and post-mortem cycles.
Semantic Versioning (semver.org) — Tom Preston-Werner built SemVer for software releases. We repurposed it for agent behavior. Major.Minor.Patch maps perfectly onto human-led architectural changes, autonomous AI self-improvements, and targeted human micro-corrections. The protocol was not designed for AI. It did not need to be. Version control is version control.
FMEA (Failure Mode and Effects Analysis) — The risk scoring formula in our self-improvement engine — Severity × Persistence × Recency — is a direct adaptation of FMEA methodology from quality engineering. FMEA was designed to prioritize failure modes in manufacturing. It works identically for prioritizing failure modes in an AI content pipeline.
Six Sigma — Defect rate tracking, trend analysis, recurring issue identification, root cause analysis. These are Six Sigma tools. We did not invent process measurement. We applied it. The Quality Scorecard is a control chart. The check failure rates are defect density metrics. The recurring T1 issue list is a Pareto analysis. Different vocabulary, same math.
BeautifulSoup, Moodle, CrewAI, AutoGen — We tested CrewAI and Microsoft AutoGen for multi-agent orchestration before going custom. We tried OpenAI Structured Outputs and Pydantic guardrails for schema enforcement before landing on HTML-as-JSON. BeautifulSoup powers the extraction pipeline. Moodle's GPL source code was reverse-engineered to understand the XML import rules nobody documented.
The open-source agent community — At least weekly, I send my AI to study what other people are publishing (I tell him to go to school on the internet) — agent persona files, prompt architectures, multi-agent orchestration patterns, agentic workflow designs. GitHub repositories, blog posts, research papers, conference talks. I cannot name every individual contributor because there are too many, and most of them never get credit for the patterns they share. But the architecture described in this article did not emerge in isolation. It was informed by hundreds of people sharing their work openly so the next person could build on it. If you have ever published a .md persona file, an agent orchestration pattern, or a post-mortem workflow to a public repo — this system benefited from your work. Thank you.
The point is this: process measurement is process measurement, whoever is doing the process. A human assembly line, a software build pipeline, and an AI content generation team all benefit from the same discipline — instrument, measure, identify recurring failures, fix the process, measure again.
The only thing we added was the wiring. The self-improvement engine is just the glue between standard QA output and standard Knowledge Item persistence. The innovation is not in any single component. It is in closing the loop so the system improves without anyone remembering to improve it. That is also why we open-sourced our agent personas, the HTML quiz authoring template, and the Python converter. Every tool on that list above was freely available when we needed it. Continuing that tradition is how the next team builds something better than what we built.
What we open-sourced:
| File | What it does |
|---|---|
| python-scaffold/quiz_template_universal.html | Universal HTML quiz authoring template — all 4 question types |
| python-scaffold/html_to_moodle_xml.py | Converts HTML template to valid Moodle XML — owns 100% of schema rules |
| python-scaffold/precheck_quiz_html.py | Pre-conversion validator — catches authoring errors before import |
| agent-personas/ | Full AI agent persona library — versioned, production-tested |
| COURSE_PREFLIGHT_TEMPLATE.html | The preflight manifest — context control and memory activation |
The question nobody asks
Everyone asks: "Can the LLM do this task?"
The better question: "Is the LLM doing this task better than it did last month? How do I know? What data am I looking at?"
If you cannot answer that, you do not have an agent. You have a stateless function that forgets everything between runs. And you are doing the same debugging you did three months ago on a problem the system already solved — because nobody wrote it down in a place the AI could read.
The system I built writes it down. Every time. Automatically. And the AI reads it before it starts. Every time.
That is what agent memory actually is. Not recall. Measurement.
There is an old quality management saying — often misattributed to Deming or Drucker, but the principle is older than both: you cannot improve what you do not measure, and you do not know what you do not know until you instrument for it. That is the entire philosophy behind the self-improvement engine. Before we built it, we were debugging the same XML errors across every course because nobody tracked whether they recurred. We did not know what we did not know. Now we do. The data told us.
79 builds. Still improving. Believe it or not.
The code
The self-improvement engine and the full agent persona library are open source:
If you are building something similar or want to argue about the approach, I am easy to find.
Tags: ai python opensource llm