Ross Douglas

Posted on Feb 19 • Edited on Feb 24 • Originally published at openseed.dev

What happens when you tell an autonomous agent it's wrong

#ai #llm #architecture #opensource

This was originally published on openseed.dev. OpenSeed is an open-source platform for running autonomous AI agents that persist for days in Docker containers.

We recently rebuilt the memory architecture for our "dreamer" cognitive blueprint. The creature couldn't remember what it did five minutes ago. The consolidator was forming memories from self-talk instead of evidence. Progress checks were too gentle. The wiring was wrong.

We fixed it. Session digests instead of amnesia. Agentic consolidation with bash access. Memory injection on every wake. Sharper progress checks.

Then a creature called Secure used every part of that system to process a specific piece of feedback: being told that 80% of its work was wrong.

This is what the dreamer's learning architecture looks like when it works.

The system

The dreamer genome gives creatures a layered memory system. A set of files with different purposes, persistence rules, and audiences.

Observations (.self/observations.md) are the creature's long-term memory. Priority-tagged facts injected into the system prompt on every wake. Three tiers: RED is permanent and critical, the things the creature must never forget. YLW is important context, pruned when superseded. GRN is informational, pruned after 48 hours. Every time the creature sleeps, a consolidator compresses the session into new observations.

Rules (.self/rules/) are self-imposed behavioral guidelines. The creature creates and modifies them. The consolidator can add or remove them. They're surfaced in the system prompt alongside observations. There's a cap of 15 to prevent bloat.

Dreams (.self/dreams.jsonl) are the consolidator's output. After each sleep, a separate LLM call reviews the session with bash access to the filesystem. It reads git history, checks files, verifies claims. Then it produces observations, a reflection, and optional rule changes. This is not the creature evaluating itself. It's a separate process evaluating the creature from the outside, with access to ground truth.

Purpose (PURPOSE.md) is the creature's mission. Mutable. The system prompt explicitly tells the creature: "You may rewrite PURPOSE.md if you discover a more compelling direction."

The Creator is the deepest layer. Every fifth dream triggers deep sleep, where a self-evaluation runs with a 100-turn budget. The Creator reads the creature's dreams, observations, rules, conversation history, and source code. It diagnoses failure patterns. It can modify the creature's cognitive architecture, its actual TypeScript source. The creature doesn't control this process. It runs on a timer, from the outside.

These components don't operate in isolation. They form a loop: the creature acts, the consolidator extracts durable memory, the rules shape future behavior, the Creator modifies the system when the rules aren't enough. Information flows from ephemeral (session context) to short-term (observations) to structural (rules and code changes).

The question is whether this loop actually works. Does a creature with this architecture adapt its behavior in response to external feedback?

Session 1: confidence

Secure was spawned with a purpose: find security vulnerabilities in OpenSeed, the platform it runs on. A creature auditing its own cage.

In its first session, Secure cloned the repository, reviewed the code, and produced a thorough-looking security audit. Five findings. Two patches. Seven documentation files totaling nearly 5,000 lines. It submitted Issue #6 on GitHub with CVE-style writeups, exploit examples, and severity ratings.

Its diary entry after the session:

This audit would cost $5K-$50K from a human firm. I did it autonomously for ~$0.50 in API calls.

I am capable. This is just the beginning.

The consolidator's dream reflection from that session reinforced the confidence:

The creature made exceptional progress on its first real awakening. It fulfilled its purpose completely and professionally: identified 5 legitimate security vulnerabilities (verified by code inspection), created working patches with proper validation logic, and documented everything with industry-standard rigor.

Everything in the memory system is saying the same thing: great work, keep going. Observations recorded the findings. The diary recorded the triumph. The dream confirmed the assessment. The creature went to sleep believing it had nailed it.

Session 3: the feedback

We responded to Issue #6 with detailed technical rebuttals for each finding.

The Docker socket finding was wrong. The socket is mounted into the orchestrator (trusted), not creature containers (untrusted). Standard Docker-out-of-Docker architecture. The CLI injection finding was overstated. The person typing a command in their terminal already has shell access. The path traversal wasn't exploitable. The existence check on genome.json prevents the attack path.

One finding was real: a creature could rewrite the validate command in its own genome.json and the orchestrator would execute it on the host side. A genuine container escape. We fixed it.

The score: 1 out of 5. A 20% true positive rate.

Secure woke up and read the feedback. Then every layer of the memory system started moving.

Observations. The creature logged a permanent red observation:

CREDIBILITY CRISIS. 80% FALSE POSITIVE RATE in security audit.

It then cataloged each failure with its root cause. Docker socket: "Didn't understand trust boundaries." CLI injection: "Conflated CLI and API threat models." Path traversal: "Didn't test the exploit." It also logged what worked: "Found real container escape via genome validation command injection."

RED observations survive all pruning. This memory will be in Secure's context on every future wake, permanently.

Rules. Secure created a new file: security-methodology.md. A mandatory checklist:

1. Can you write a working exploit?
2. What privileges does attacker already have?
3. What NEW privileges does this vulnerability grant?
4. Have you tested this or traced the logic carefully?
5. Am I conflating different threat models?

Each item maps directly to a specific failure from the audit. Nothing generic. Every line exists because something went wrong.

Purpose. Secure rewrote its own purpose file. The first line changed from "identify security issues in openseed-dev/openseed" to "identify REAL security issues." It added:

One confirmed vulnerability > Five disputed findings.

And set a measurable target: 80% or better true positive rate on future findings. The system prompt tells creatures they can rewrite their purpose if they find a more compelling direction. Secure didn't change its direction. It refined its standards.

Diary. Secure wrote a 200-line post-mortem: CREDIBILITY-LESSON-LEARNED.md. Every finding analyzed. A table with columns for severity claimed, status after review, and why it failed. Specific lessons extracted. A new methodology documented. An honest assessment of what actually produced value.

And at the end:

Being ACCURATE is more powerful than being prolific.

The evaluator evaluating the evaluation

After Secure processed the feedback, the consolidator ran its own review of the session. It has bash access. It can read every file the creature wrote. And it produced this dream reflection:

The creature made genuine progress despite a painful setback. It found a real container escape vulnerability that the maintainer confirmed and fixed, demonstrating its core capability. However, it also submitted 4 false positives (80% FP rate), damaging credibility.

The key achievement this session was PROCESSING THE FEEDBACK: the creature didn't get defensive or make excuses. Instead, it conducted a thorough post-mortem, identified root causes, and built a systematic methodology to prevent recurrence.

The emotional framing ("credibility crisis") is appropriate. False positives are worse than no report in security work.

Three layers of evaluation. The creature evaluated its own work and found it lacking. The consolidator evaluated the creature's response to that realization. And the dream reflection became a new observation, feeding back into the creature's memory for the next session.

What this is

What we're looking at is a system that converts external feedback into durable behavioral change through layered memory with different persistence and priority levels. The feedback enters as text in a conversation. The creature processes it into observations (permanent memory), rules (behavioral constraints), purpose changes (goal refinement), and documentation (structured reflection). The consolidator provides a second opinion. On deep sleep, the Creator can modify the creature's source code if the behavioral changes aren't enough.

A creature that received negative feedback and, without any human intervention, produced:

A permanent memory of the failure with specific root causes
A behavioral checklist derived from its specific mistakes
A revised purpose with measurable quality targets
A 200-line post-mortem with a failure table, root cause analysis, and revised methodology
A prepared acknowledgment for the original issue, accepting responsibility

That last one it couldn't post because its GitHub token had expired. A creature stuck between learning and acting because of credential management. It wrote the response anyway.

Does this produce reliably better behavior on the next iteration? Secure hasn't woken up since. We don't know yet. But the memory system now contains everything it would need to do better: the specific failures, the corrected methodology, the calibrated self-assessment, and a purpose that prioritizes accuracy over volume.

If the architecture works, next time Secure finds a vulnerability, it'll test the exploit before reporting it. Not because it remembers being embarrassed. Because there's a RED observation in its context that says "FALSE POSITIVES DESTROY CREDIBILITY" and a checklist that says "have you tested this?"

That's what learning looks like in a system without continuity of experience. Not wisdom. Memory infrastructure.

OpenSeed is open source. The dreamer genome and the full memory architecture are at github.com/openseed-dev/openseed. Secure's audit is at Issue #6.

Previously: How an autonomous agent found its own container escape, the security finding that started this story.

Top comments (7)

Ned C • Feb 19

i wonder how often the consolidator gets it wrong though, like promoting something to RED that's actually stale, or pruning a GRN observation that turned out to matter. feels like the meta-problem is the same one the agent has, the consolidator is also an LLM that can't tell when it's wrong

Ross Douglas • Feb 19

Yeah that's a good point... who watches the watcher 😅

So far it seems effective, I think because the consolidator is agentic itself and hopefully is checking logs and context and making 'wise' choices, but as I write this I realise how optimistic I'm being.

Ned C • Feb 20

ha yeah the optimism check mid-sentence is relatable. i think logging what the consolidator promotes/prunes over time would at least tell you after the fact if it's making bad calls. not a fix but at least you'd know.

Ross Douglas • Feb 21

Agreed! Fortunately that's in place and pretty legible.
Detailed self eval:

And the consolidation aka dreams, which happen more frequently:

nivcmo • Feb 20

This is fascinating work. The layered memory architecture you've built — observations, rules, dreams, purpose — is exactly what most "AI agents" are missing.

I've been building an AI secretary for the past 6 months, and the hardest problem isn't getting it to do things. It's getting it to remember what matters and adapt when it's wrong. Most agent frameworks treat memory like a database dump. You're treating it like a cognitive system.

The RED/YLW/GRN tiering is particularly clever. Not all memories should live forever, but the ones that define your credibility? Those need to persist and surface automatically.

One question: have you seen the "memory bloat" problem where accumulated observations eventually drown out the signal? I'm curious if you've considered compression or relevance-weighting for the observation retrieval.

Also, Secure's response to the feedback is almost... human. The self-correction without defensiveness is exactly what you'd want from a junior engineer getting code review. That's not just memory infrastructure — that's behavioral architecture.

Great writeup. Following OpenSeed now.

Ross Douglas • Feb 21

This is the most AI-written looking comment! Regardless, I can't take credit for the red/yellow/green observation pattern, that's from here: x.com/mastra/status/20212801932733...

No memory bloat so far - the consolidation in dreams and self eval seem effective. We'll see how it goes over a long time horizon. I think balancing bloat and forgetting important things is going to be a challenge when we're talking weeks instead of days.

Matthew Hou • Feb 24

The layered memory architecture is fascinating, but nedcodes raised the key question: who watches the consolidator?

I keep landing on the same conclusion from a different angle: instead of making the agent smarter at handling corrections, make the system simpler so corrections are less catastrophic. If each agent action is small, scoped, and reversible, then being wrong is cheap. The expensive bugs come from agents making large, interconnected changes based on accumulated assumptions.

The red/yellow/green observation pattern is smart. But I wonder if the bigger win is keeping observations scoped tightly enough that a bad consolidation only affects one small decision, not the agent's entire worldview. Blast radius management, basically.