This is a follow-up to I Tried to Turn Agent Memory Into Plumbing Instead of Philosophy. If you haven't read that one, the short version: I built a...
The binding problem is the real insight. Most memory systems treat recall as retrieval: find the relevant note and hand it over. But memory isn't a library. It's a network. The procedure without the episode is an instruction without evidence. The episode without the procedure is a story without structure. You built both systems separately. The gap wasn't recall. It was connection. That's fixable. Most people would have quit at 0/250 skills used. You kept going. That's the difference between a benchmark and a builder.
@theeagle You nailed it, "instruction without evidence" is exactly what 0/250 proved.
OrKa already stores skills as nodes in a graph (skill_graph.py). The episode system exists too (storage, semantic search, retention, scoring), all tested and production-ready. But today they're disconnected. Nodes with no edges. Evidence with no structure.
The fix is conceptually simple: episodes become the edges. "implement [target]" alone is empty. But an edge saying "applied to ETL, validation before dedup caught 30% of bad records" gives the node weight. Another edge: "applied to log analysis, filtering before aggregation cut false positives by 40%." Now the model has a reason to use the skill.
The harder problem, as you say, isn't building edges; it's graph maintenance. Stale edges must decay, failed episodes should weaken differently than successes, and isolated nodes should expire. The graph has to self-organize around what's useful, not just accumulate.
The binding (skill_id on Episode, episode_ids on Skill, recall that returns nodes with their edges) is next.
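A minimal sketch of what that binding could look like. The dataclass names and fields here are illustrative assumptions, not the actual skill_graph.py schema: Episode carries a skill_id back-reference, Skill carries episode_ids as edges, and recall returns the node together with its edges rather than the bare procedure.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Episode:
    episode_id: str
    skill_id: Optional[str] = None  # back-reference: which skill this episode exercised
    context: str = ""
    outcome: str = ""

@dataclass
class Skill:
    skill_id: str
    steps: str
    episode_ids: list = field(default_factory=list)  # edges to supporting evidence

def recall(skill: Skill, episodes: dict) -> dict:
    """Return the node with its edges attached, not the procedure alone."""
    evidence = [episodes[e] for e in skill.episode_ids if e in episodes]
    return {
        "steps": skill.steps,
        "evidence": [f"{ep.context} -> {ep.outcome}" for ep in evidence],
    }
```

With this shape, "validate [component]" never travels alone: the recall payload always carries the episodes that give the node weight.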
Exactly 😅. That was the turning point for me.
0/250 was not really a storage failure 😁. It was a binding failure. I had the procedure and I had the episode, but I did not yet have the mechanism that kept them attached when the next task arrived. Without that, skill memory is just archived text with a better label.
That is why I have stopped thinking about memory as retrieval alone. The real question is not just what the system can recall, but what prior should remain attached to the current decision, failure mode, and task shape.
And yes, I agree with your last point. A benchmark gives you the miss. Building starts when you treat that miss as the actual signal.
“Memory without experience is just a note.” That line sticks.
The 0/250 skill utilization finding is incredibly valuable precisely because you published it honestly. In my automation consulting work, I've hit a similar wall: when building multi-step AI workflows with n8n and Claude, the agent "remembers" previous steps in the chain but can't bind that memory to novel decision points in branching logic.
Your Track C insight — memory only helps where the model is already struggling — maps directly to what I see in production. My clients' automation pipelines work flawlessly on happy paths, but the moment there's an ambiguous routing decision (e.g., "is this invoice a duplicate or a revised version?"), the agent needs prior context about what happened last time with similar edge cases. That's episodic memory, not procedural.
The Memory Bundle concept feels like the right abstraction. Curious whether you've considered a confidence-gated recall approach — only injecting memory when the agent's initial confidence on a routing decision falls below a threshold. That would avoid the ceiling-effect dilution you saw in Tracks A/B/D/E while concentrating the memory system's value where it actually matters.
The 0/250 skill utilization result is the most honest finding here, and arguably the most valuable. You didn't bury it, which is rare.
The abstraction pendulum swing you described (too literal → too abstract → vacuous) is a classic trap in knowledge representation. "implement [target]" is essentially a no-op embedding: it's so generic that cosine similarity against any real task will be near zero. The sweet spot isn't midway between those two extremes; it's a different dimension entirely, which is exactly what the Memory Bundle gets at.
What strikes me most is that both the Skill and Episode systems were already production-ready, just unconnected. That's not a failure of the architecture, it's a failure of binding at the design level, which mirrors the exact problem you're solving in the agent. You built two lobes of a brain with no hippocampus between them.
The Track C signal is worth dwelling on. The fact that memory only helped where the model was already struggling, not where it was already near-perfect, should probably be the core design principle for when to invoke recall at all. Rather than always injecting memory into the prompt pipeline, a confidence-gated recall (only fetch memory when the task difficulty exceeds a threshold) might actually improve the signal-to-noise ratio significantly, and prevent the ceiling-effect tracks from diluting your aggregate numbers.
One question: for the Memory Bundle's transfer scoring, are you planning to weight episode recency or just volume and success rate? A skill with 10 old successful episodes might be less reliable than one with 2 recent ones in a fast-moving domain.
Yes. I think that is exactly the right direction.
The 0/250 result pushed me to the same conclusion. Memory should probably not be injected by default. If the model is already operating on a near-solved path, extra recall is mostly noise. The real value appears when the system is close to a routing boundary, a failure surface, or a low-confidence decision. So I am increasingly thinking recall should be gated by need, not treated as a constant layer in the pipeline.
And on transfer scoring, no, I do not think volume plus success rate is enough. Recency has to matter, but not as a global constant. It should be weighted by domain volatility. Ten successful episodes from a stable domain may still be highly valuable. Ten successful episodes from a fast-moving domain may already be stale. So my current intuition is that transfer scoring has to blend success rate, recency, transfer breadth, and contextual similarity, with decay increasing when a skill stops transferring or starts failing in newer contexts.
In other words, I do not want skills to survive because they were historically useful. I want them to survive because they remain useful under present conditions. That is probably the only way to stop the graph from becoming a museum of expired competence.
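That blend could be sketched roughly as below. The weights, half-life numbers, and function name are illustrative assumptions, not OrKa's actual scoring: success rate over recorded outcomes, recency decayed faster in volatile domains, and transfer breadth as a saturating count of distinct contexts.

```python
import time

def transfer_score(successes, failures, last_used_ts, domains_transferred,
                   volatility=0.5, now=None):
    """Blend of the signals discussed above (illustrative weights):
    success rate, volatility-weighted recency, and transfer breadth."""
    now = now or time.time()
    total = successes + failures
    success_rate = successes / total if total else 0.0
    age_days = (now - last_used_ts) / 86400
    # half-life shrinks as volatility grows: stable domains forget slowly,
    # fast-moving domains go stale in about a week
    half_life = 90 * (1 - volatility) + 7 * volatility
    recency = 0.5 ** (age_days / half_life)
    breadth = 1 - 0.5 ** len(domains_transferred)  # saturating in [0, 1)
    return 0.5 * success_rate + 0.3 * recency + 0.2 * breadth
```

Under this shape, ten old successes in a stable domain (low volatility, long half-life) still score well, while the same record in a volatile domain decays toward its bare success-rate floor.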
This matches what I’ve seen as well.
Recall is usually the easier part to solve. Binding is where things break, especially when the agent has to associate the right context with the right action across steps.
It gets even trickier once you add tools and longer workflows. At that point, it’s less about memory itself and more about how consistently the system uses that memory.
Really strong writeup. The shift from recall to binding feels exactly right.
A memory system can retrieve something "relevant" and still be operationally useless if the recalled item is not bound to the current task shape, failure mode, and decision surface. That is where persistent memory starts acting less like context compression and more like a trust boundary. The saved memory is changing what the next agent believes before it acts.
That is why your Track C result is the interesting one. Memory did not help where the base model already knew how to finish the task. It helped where the agent needed prior structure for a routing decision. That sounds less like "agents need more memory" and more like "agents need typed prior decisions that stay attached to the right context."
My guess is the next jump is not a better generic skill abstraction, but separating memory into roles like decision, constraint, anti-pattern, mistake, and evidence instead of one catch-all skill bucket. Then the model is binding the right kind of prior, not just retrieving semantically adjacent text.
Exactly. That is very close to where my thinking is moving.
The problem is not memory as volume. It is memory as operative structure.
A recalled item can be relevant in a semantic sense and still be useless, or even damaging, if it is not bound to the current task, failure mode, and decision surface. At that point memory is no longer just retrieval. It is shaping the next action boundary.
That is also why Track C mattered to me. It was not about helping the model "know more." It was about giving it the right prior structure at the moment a routing decision had to be made.
And yes, I think your point about typed memory is probably the next real step. A mistake, a constraint, an anti-pattern, a prior decision, and a piece of evidence should not live in the same bucket just because they are all "memory." They play different operational roles and should bind differently at runtime.
So I suspect the next jump is not bigger memory, but more explicit memory semantics. Not just retrieve something similar, but retrieve the right kind of prior for the kind of decision the agent is about to make.
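Typed memory semantics could start as small as an enum plus per-kind binding rules. The kinds come from the comments above; the rule names and the dispatch table are illustrative assumptions about how each kind might enter the prompt differently.

```python
from enum import Enum

class MemoryKind(Enum):
    DECISION = "decision"
    CONSTRAINT = "constraint"
    ANTI_PATTERN = "anti_pattern"
    MISTAKE = "mistake"
    EVIDENCE = "evidence"

# Illustrative binding rules: each kind binds differently at runtime
# instead of landing in one catch-all "memory" bucket.
BINDING_RULES = {
    MemoryKind.CONSTRAINT: "always_inject",      # hard limits travel with every call
    MemoryKind.ANTI_PATTERN: "inject_on_match",  # only when the plan resembles it
    MemoryKind.MISTAKE: "inject_on_match",
    MemoryKind.DECISION: "inject_at_routing",    # surfaces at branch points
    MemoryKind.EVIDENCE: "attach_to_skill",      # travels as an edge on a skill
}

def binding_rule(kind: MemoryKind) -> str:
    return BINDING_RULES[kind]
```

The point of the dispatch table is that retrieval stops asking "what is similar?" and starts asking "what kind of prior does this kind of decision need?"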
The 0/250 skill utilization stat is the kind of result most people would bury. Props for putting it front and center.
The memory bundle direction makes sense to me, binding procedure to episodes and causal links is where the real signal lives. But I keep thinking about what happens at scale. Once you start linking everything to everything (episodes to skills to facts to causal chains), you're basically rebuilding a knowledge graph. And KGs have a well-documented failure mode: they collapse under their own weight when nobody maintains the edges.
How are you thinking about decay and pruning as bundles accumulate? Because the binding problem doesn't go away, it just moves one layer up.
Within OrKa I already implemented TTL-based decay for memory in general.
For skills, my current hypothesis is that persistence should be earned through successful transfer, not granted by default. In practice, that means a skill extends its lifetime only when it proves useful again in a new context. Reuse reinforces it. Failed transfer or inactivity lets it decay.
So yes, the binding problem does move one layer up. My answer is not to preserve everything, but to make survival conditional on cross-context utility. Otherwise the graph just turns into dead weight.
This is probably the real challenge: not how to store more links, but how to let useless links die without killing the structure that still generalizes.
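One way to sketch "survival conditional on cross-context utility" is a lease-style TTL: successful transfer renews the lease, failed transfer shortens the remaining lifetime, and inactivity simply lets it lapse. The class name, base TTL, and the 0.5 penalty here are hypothetical, not OrKa's actual constants.

```python
import time

class SkillRecord:
    """Persistence earned through transfer: reuse renews the lease,
    failure shortens it, silence lets it expire."""
    BASE_TTL = 30 * 86400  # assumed default lifetime in seconds

    def __init__(self, skill_id, now=None):
        now = now or time.time()
        self.skill_id = skill_id
        self.expires_at = now + self.BASE_TTL

    def record_transfer(self, success, now=None):
        now = now or time.time()
        if success:
            # successful reuse in a new context renews the full lease
            self.expires_at = now + self.BASE_TTL
        else:
            # failed transfer halves the remaining lifetime instead of renewing
            self.expires_at = now + (self.expires_at - now) * 0.5

    def is_alive(self, now=None):
        return (now or time.time()) < self.expires_at
```

Note the asymmetry: success and failure are not mirror images, matching the earlier point that failed episodes should weaken a node differently than successes strengthen it.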
500 experiments is the kind of rigorous testing AI agent development desperately needs. Most people build agents, test them on 3 happy-path scenarios, and ship.
The binding problem you've identified is fascinating — it maps to something I've noticed in production too. Agents can recall facts fine, but they lose the relationship between facts across conversation turns. It's like having a good memory for vocabulary but terrible grammar.
This has huge implications for testing AI agents at scale. You can't just test if the agent remembers X — you have to test if it correctly associates X with Y across different contexts. That's a combinatorial explosion of test cases, which is why most teams skip it and just hope for the best.
Curious about your testing methodology — did you use automated evals or manual review for the 500 experiments? The eval tooling gap for agent memory is one of the biggest unsolved problems I've seen.
Thanks. Yes, this was fully automated, not manual review.
The evals are all here:
github.com/marcosomma/orka-reasoni...
The flow is split into 3 scripts so each phase can be rerun independently:
run_benchmark_v2.py runs the benchmark tasks across the different tracks and generates the raw brain vs baseline outputs.
judge_benchmark.py then evaluates those outputs with a separate judge model using both rubric and pairwise workflows.
aggregate_benchmark.py analyzes the judged results and produces the final summaries, deltas, win rates, and per-track breakdowns.
That split helped a lot because execution, judging, and analysis are different failure surfaces, and keeping them separate made the benchmark much easier to inspect and debug.
Brilliant breakdown. The shift from simple vector recall to 'Memory Binding' perfectly mirrors the challenges in architecting robust RAG systems for production. Validating these multi-layered execution contexts—especially when mixing procedural and episodic memory without leaking data across contexts—is a massive engineering hurdle. I've been heavily focused on building automated backend ecosystems and secure RAG gateways to tackle exactly this kind of complex context validation. Thanks for sharing the raw data behind the philosophy!
The binding problem is exactly the gap I keep running into when building agent workflows. Memories stored as flat key-value pairs lose relational context — the 'why' behind the 'what'. Your Memory Bundle concept reminds me of how experienced developers naturally organize knowledge: not as isolated facts, but as linked decision chains. Curious if you've experimented with graph-based structures for the bundles, or if the overhead kills latency in practice.
ran into this too - agents surface the right memory but apply it in the wrong context. retrieval is fine, binding is the actual bottleneck. curious how you track that failure mode
This is a great breakdown of the recall vs. binding distinction. I think there's another layer worth exploring: the relationship between binding context and temporal decay.
In our work on embedded memory systems for robots (think autonomous drones that need to remember terrain hazards between sessions), we found that binding degrades non-linearly with time and context switches. A memory bound at task T1 with high confidence would often return as loose binding after just 2-3 task switches.
The fix that actually worked was maintaining a small active context window (5-10 bindings max) that gets priority in retrieval. Kind of like how your brain keeps 4-7 items in working memory. Everything else goes to cold storage with degraded binding scores.
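A toy version of that hot/cold split: a small LRU working set holds the active bindings, and anything evicted moves to cold storage with a degraded binding score. The capacity default and the 0.5 degradation factor are illustrative, not the commenter's actual values.

```python
from collections import OrderedDict

class BindingWorkingSet:
    """Small hot set of bindings (like working memory); evictions land in
    cold storage with a degraded score and lower retrieval priority."""

    def __init__(self, capacity=7):
        self.capacity = capacity
        self.hot = OrderedDict()  # binding_id -> score, ordered by recency
        self.cold = {}            # binding_id -> degraded score

    def touch(self, binding_id, score=1.0):
        if binding_id in self.cold:
            # promotion back to the hot set restores the better of the two scores
            score = max(score, self.cold.pop(binding_id))
        self.hot[binding_id] = score
        self.hot.move_to_end(binding_id)
        while len(self.hot) > self.capacity:
            evicted, s = self.hot.popitem(last=False)  # least recently used
            self.cold[evicted] = s * 0.5               # degraded on eviction

    def retrieve(self, binding_id):
        # hot bindings win outright; cold ones return with their degraded score
        if binding_id in self.hot:
            return self.hot[binding_id]
        return self.cold.get(binding_id)
```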
Curious: did your 500 experiments reveal any correlation between the number of concurrent tasks an agent handles and binding failure rates? That was the strongest predictor in our data.
The 0/250 crisis hit close to home. I run an autonomous agent that's done 1100+ sessions and the exact same pattern showed up — skills stored as abstract procedures were never retrieved because the embeddings collapsed into noise. Too generic to match anything specific.
What fixed it for me was switching from flat-file memory to a Neo4j-backed knowledge graph where episodes, facts, and contacts are separate node types with explicit relationships. The binding happens at the graph level: a session node links to the contact it engaged with, the insight it produced, and the facts it updated. When the agent needs context before talking to someone, it pulls the contact node and traverses to all connected episodes — the binding is structural, not embedding-based.
Your four-component bundle (procedure + episodes + semantic facts + causal links) is essentially what I arrived at independently, except I'd add temporal ordering as a fifth. At 1100 sessions, the same fact can be true and then false, and the agent needs to know which version is current. I ended up with timestamped fact triples (subject-predicate-object with a recorded_at field) so newer facts shadow older ones without deleting history.
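The timestamped-triple idea can be sketched as an append-only store where the newest fact for a (subject, predicate) pair shadows older ones without deleting history. Names here are illustrative, not the commenter's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str
    recorded_at: float  # timestamp; newer facts shadow older ones

class FactStore:
    """Append-only fact triples with temporal shadowing."""

    def __init__(self):
        self.facts = []

    def record(self, subject, predicate, obj, recorded_at):
        self.facts.append(Fact(subject, predicate, obj, recorded_at))

    def current(self, subject, predicate):
        """The most recent value wins; history is never deleted."""
        matches = [f for f in self.facts
                   if f.subject == subject and f.predicate == predicate]
        return max(matches, key=lambda f: f.recorded_at).obj if matches else None

    def history(self, subject, predicate):
        return sorted((f for f in self.facts
                       if f.subject == subject and f.predicate == predicate),
                      key=lambda f: f.recorded_at)
```

The agent reads current() before acting, but can still traverse history() when it needs to know why a fact changed, which is the temporal-ordering component the four-part bundle is missing.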
The ceiling effect you found — memory only helping on Track C (hardest tasks) — also matches my experience. On routine work, the agent's base capability is fine. Memory becomes critical only when it needs to reference a conversation from 200 sessions ago or recall why a previous approach failed. The ROI of memory systems is concentrated in the long tail of hard decisions.
The binding problem you've identified maps precisely to something we hit while building an MCP server for a 130k-node Neo4j knowledge graph: injecting raw retrieved nodes into context is cheap, but making the model actually use the retrieved node as the authoritative answer instead of paraphrasing from pretraining is the hard part. We ended up structuring our MCP tool responses with an explicit authority field and prompt framing that told the model 'the graph answer supersedes your internal estimate' — it reduced hallucination-over-recall by roughly 40% in our testing. Your point about abstract verb-target patterns over verbatim steps is consistent with what we found too: concrete retrieval ('Load CSV with pandas') teaches nothing; relational structure ('validate before ingest') actually transfers. Would love to know how you're measuring binding fidelity — are you using any structured output forcing, or purely rubric-scoring the judge's interpretation?
Your transfer engine threshold problem (min_score=0.0 letting everything through) mirrors something I hit at a much larger scale. I run a 130k-node knowledge graph backing an AI agent across 1,100+ production sessions, and the single most impactful change I made to memory retrieval was adding typed predicates to edges — not just 'related to' but 'supersedes', 'blocked_by', 'attempted_and_failed'. Without those types, semantic similarity alone kept surfacing memories that were topically related but operationally useless (or worse, outdated strategies that had already been tried and abandoned).
Your Track A (cross-domain transfer) result is the one I'd focus on. In my experience, the memories that actually transfer are structural patterns, not domain knowledge. 'This kind of problem responds to decomposition before synthesis' transfers. 'Use pandas read_csv with error handling' does not. Your verb-target abstraction is pointing in the right direction, but I wonder if it's abstract enough — 'implement [target]' still carries an imperative frame that might not match a reasoning or analysis task.
One thing that's been genuinely useful in my system: tracking what DIDN'T work. I have an 'attempted_and_failed' edge type, and it's possibly the highest-value memory category. It prevents re-running strategies that already failed — which is exactly the 'comfortable lie' pattern you describe in the opening.
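A rough illustration of how those typed predicates change retrieval: topically related candidates get filtered against attempted_and_failed, supersedes, and blocked_by edges before anything reaches the prompt. The edge-type names follow the comment above; the filtering logic itself is an assumption, not the commenter's implementation.

```python
def candidate_strategies(edges, query_topic):
    """edges: (source, edge_type, target) triples.

    Semantic similarity alone surfaces topically related but operationally
    useless memories; typed edges let retrieval drop strategies that already
    failed, were superseded, or are blocked."""
    failed = {t for _, et, t in edges if et == "attempted_and_failed"}
    superseded = {t for _, et, t in edges if et == "supersedes"}  # target is the outdated version
    blocked = {t for _, et, t in edges if et == "blocked_by"}
    related = [t for s, et, t in edges
               if et == "related_to" and s == query_topic]
    return [t for t in related if t not in failed | superseded | blocked]
```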
Curious about your judge model choice (Qwen 3 Coder). Did you test whether the judge's domain expertise affects scoring, or is the main variable just model independence from the executor?
The binding problem you described is essentially the same challenge as context-window grounding in long-running inference: the model has access to the information but doesn't know when or how strongly to weight it against the live task context.
Your observation about abstract verb-target patterns ("implement [target]", "validate [component]") transferring better than literal steps is interesting — it mirrors how humans summarize procedural memory. A surgeon doesn't recall "apply pressure to the right of the incision at 14:32"; they recall the pattern.
One thing I'd be curious about: did you try varying the injection timing of recalled memories? Injecting at the start of a system prompt vs. mid-conversation has meaningfully different effects in some architectures. The binding might improve if context is injected closer to the decision point rather than front-loaded.
What was your scoring rubric for "binding quality" vs. plain recall accuracy?
@motedb Your observation nails the point exactly! To answer your question about injection timing: in the benchmark, the recalled skill is injected at a fixed position in the task_applier prompt, after the agent's own reasoning analysis and before the solve instructions.
In the template, the skill sits between the model's own reasoning and the action directive. It is not front-loaded into the system prompt and not injected mid-conversation. OrKa's YAML prompt templates give the workflow author full manual control over placement (you write {{ previous_outputs.brain_recall.skill_steps }} wherever you want in the prompt string), but there is no built-in injection_point parameter. The position is hard-coded per workflow.
I did not vary injection timing as an experimental variable, and that is a legitimate gap. Front-loading the skill into the system prompt versus injecting it closer to the decision point could change how much attention the model allocates to it. That said, the 0/250 result suggests the problem is not where the skill appears but what it contains. The model read the skill at its current position and explicitly decided it was not useful. Moving it closer to the decision point would not change its content from "validate [component]" to "last time validation caught 30% of dirty records in the ETL pipeline."
The binding problem you describe is essentially the same challenge we ran into when building memory systems for embodied AI (robots/drones that need to remember things across missions). Your finding that "implement [target]" is vacuous really resonates — we saw the same thing with skill-level abstractions.
What worked for us was moving from skill abstraction to episodic context binding — instead of storing "how to do X", we store structured episodes: {context: "loaded 2GB CSV into staging with encoding errors", outcome: "used pandas error_handlers='ignore', lost 3% rows but pipeline completed", trigger_pattern: "large_file + schema_mismatch"}. The key insight was that the failure mode is more valuable than the success pattern. When the agent encounters a similar context fingerprint, it recalls the episode with its failure metadata attached, not a sanitized "best practice."
One thing I'm curious about: did you experiment with confidence-gated recall? We found that forcing recall on every task actually hurt performance more than having no memory at all. A simple threshold (only recall when cosine similarity > 0.82 for context embedding) cut the noise dramatically. The agent voluntarily ignoring memory turned out to be a feature, not a bug.
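That gate is simple to sketch. The 0.82 threshold comes from the comment above; everything else (function names, the plain-Python cosine) is illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two plain-list embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def gated_recall(task_embedding, memories, threshold=0.82):
    """Only surface memories whose context embedding clears the threshold.

    Below it, return None and inject nothing: the agent runs on its own
    prior, which the comment above found beats forced recall."""
    hits = [m for m in memories
            if cosine(task_embedding, m["embedding"]) >= threshold]
    return hits or None  # None = inject no memory at all
```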
"Binding vs recall" is a great frame. I've been building an MCP that does something similar for agents — stores "patterns with success/failure metadata" rather than "facts to retrieve." The key difference: pattern recall asks "what fits this situation?" not "what do I know about X?" That changes the retrieval metric entirely. Does your binding approach use embedding similarity, or something more structural?
This is a great insight—most discussions focus on recall quality, but binding is where things actually break down.
Feels like improving memory systems isn’t just about storing more, but structuring and retrieving it meaningfully. Curious what approaches worked best for you—schema design, embeddings, or something else?
good work
good