Tirso García
Building Kernel Memory Protocol: Navigable Memory for AI Agents

Spanish version: Construyendo Kernel Memory Protocol: memoria navegable para agentes de IA

The hard part with many AI agents is not the amount of text in the prompt.
The hard part is that they do not have memory they can query, traverse, and
audit.

Most current approaches try to solve this in one of three ways:

  • copying parts of previous conversations into the next prompt;
  • searching similar chunks with embeddings;
  • letting an agent framework store memory internally, often in a way that is hard to inspect, replay, or explain.

Those approaches help, but they are not enough when an agent is doing real
work. At that point, retrieving text is not the whole problem. You need to
reconstruct the process.

The important questions become different:

  • What did the agent know when it made a decision?
  • Which solution attempts did it try?
  • Which attempt failed, and why?
  • What new information changed the direction of the work?
  • Which sequence of steps led to the final answer?
  • Which evidence supports a decision or answer?
  • Can a person review that evidence without reading the whole raw conversation?
  • Can another model navigate the same memory without knowing how it is stored underneath?

Underpass KMP started with a smaller goal: recovering only the context an agent
needed to continue a task without rereading the whole previous conversation. I
called that context rehydration: taking already recorded memory and rebuilding
only the useful part for the next step.

The more I tested it, the clearer the real problem became. This was not about
making better prompts. I needed a memory layer that could record what happened,
when it happened, who produced it, what evidence supported it, and how it could
be traversed later.

That is where Kernel Memory Protocol, or KMP, comes from: a small, explicit API
for writing, querying, traversing, tracing, and inspecting agent memory.

From Searching Chunks to Navigating Memory

The first mistake was treating memory as if it were just search.

A search system can return text that looks similar to the question you just
asked. That is useful for finding isolated information, but it is not enough to
understand a work process.

When an agent solves a task, the key question is not only which sentence looks
similar. The key question is what happened:

  • what information the agent had when it made a decision;
  • which solution attempts it tried;
  • which attempt failed, and why;
  • which new data changed the direction of the work;
  • which sequence of steps led to the final result;
  • which evidence supports each conclusion.

That distinction shaped the kernel. I did not want to build another mechanism
for searching text. I wanted navigable memory.

That is why KMP does not expose a vector database API. It exposes memory
operations:

ingest   -> record memory
wake     -> recover the state needed to continue
ask      -> query memory with evidence
goto     -> move to a specific moment or reference
near     -> inspect what happened around a moment or reference
rewind   -> move backward
forward  -> move forward
trace    -> explain a relation path
inspect  -> inspect a memory node
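As a rough sketch, that operation surface can be modeled as a tiny in-memory kernel. Everything below — the `Entry` shape, the storage dict, the substring matching — is invented for illustration; only the operation names (`ingest`, `ask`, `inspect`) come from KMP:

```python
from dataclasses import dataclass, field

# Hypothetical in-memory sketch of part of the KMP operation surface.
# The storage model is invented; the point is that every answer
# carries references and evidence that can be inspected later.

@dataclass
class Entry:
    ref: str                                   # stable reference, usable by goto/inspect
    text: str                                  # recorded memory content
    evidence: list = field(default_factory=list)

class Kernel:
    def __init__(self):
        self._log: dict[str, Entry] = {}

    def ingest(self, ref: str, text: str, evidence=None) -> str:
        """Record memory; returns the reference for later traversal."""
        self._log[ref] = Entry(ref, text, evidence or [])
        return ref

    def ask(self, query: str) -> list[Entry]:
        """Query memory; every hit carries its own evidence refs."""
        return [e for e in self._log.values() if query.lower() in e.text.lower()]

    def inspect(self, ref: str) -> Entry:
        """Expose a single memory node in full detail."""
        return self._log[ref]

k = Kernel()
k.ingest("n1", "attempt 1 failed: timeout", evidence=["log:42"])
k.ingest("n2", "attempt 2 succeeded after retry")
hits = k.ask("failed")
print([h.ref for h in hits])     # ['n1']
print(k.inspect("n1").evidence)  # ['log:42']
```

The real kernel does much more (scoping, relations, time), but even this toy version shows the contract: reads return references, and references can be inspected.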

The central system is intentionally small. KMP is not trying to be the agent,
and it is not responsible for deciding the final answer. Its job is to store
structured memory, make that memory traversable in a deterministic way, and
return evidence that can be audited.

Answer generation, business rules, and domain plugins can live around KMP
without being pushed into the memory protocol itself.

The Mental Model

The central object in Underpass KMP is an about.

An about is the case, topic, or memory world being worked on. It can be an
incident, a task, a customer, a benchmark case, a repository, a user, or a
long-running agent process.

Inside that about, memory does not need to live on a single line. It can be
split into dimensions:

about
  dimension: session
  dimension: agent
  dimension: task
  dimension: entity
  dimension: preference
  dimension: attempt
  dimension: incident_phase
  dimension: success_path
  dimension: failure_path

A dimension can represent a session, an agent, a task, an entity, a solution
attempt, or a phase of the process.

Time is not just another dimension.

Time is what lets you ask what was known before a step, what changed after it,
or which information did not exist yet when a decision was made.

The mental model is:

about -> the case or memory world
dimensions -> memory planes inside that case
time -> the temporal axis crossing those planes
relations -> why two memory items are connected
evidence -> proof attached to memory
provenance -> who observed or wrote it, and when
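One way to picture that model is as a record shape. The field names below follow the mental model — about, dimension, time, relations, evidence, provenance — but the classes themselves are illustrative, not the real KMP schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative shape only: field names mirror the mental model above,
# not the actual KMP wire format.

@dataclass
class Relation:
    target_ref: str   # which memory item this one connects to
    name: str         # e.g. "corrects", "refines", "follows"
    why: str          # the writer's justification for the edge

@dataclass
class MemoryItem:
    about: str                                    # the case or memory world
    dimension: str                                # memory plane inside that case
    at: datetime                                  # the temporal axis
    text: str
    relations: list[Relation] = field(default_factory=list)
    evidence: list[str] = field(default_factory=list)
    provenance: str = "unknown"                   # who observed or wrote it, and when

item = MemoryItem(
    about="incident-7",
    dimension="attempt",
    at=datetime(2025, 1, 10, 9, 30, tzinfo=timezone.utc),
    text="Rolled back deploy; error rate dropped.",
    relations=[Relation("ref:hypothesis-2", "corrects", "contradicted by metrics")],
    evidence=["dashboard:err-rate"],
    provenance="agent:ops-1",
)
print(item.about, item.dimension, item.relations[0].name)
```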

Visually, KMP memory looks more like this than like a list of messages:


Figure 1. A single about can contain several dimensions crossed by time.
Blue arrows are semantic relations; dashed arrows show continuity inside a
dimension.

This matters because agent memory is rarely linear.

A long task can involve several agents. Each agent can have its own session.
Each session can produce hypotheses, failed attempts, tool results, and final
decisions. A useful memory layer must let you look at one dimension, several
dimensions, or the whole case, while making the query scope explicit every
time.

Why Dimensions Need Namespaces

One important implementation decision was making about act as the namespace
for dimensions.

When a client ingests memory, IngestRequest.about defines the default scope.
Internally, the real identity of a dimension is equivalent to something like:

about:<about>:dimension:<dimension_id>

This may look like a small detail, but it prevents important mistakes.

If two different tasks both have a dimension called session:1, I do not want
them to be mixed by accident. Once the dimension lives inside its about, each
session:1 belongs to the case it was created for.

Reads are explicit too:

  • CURRENT_ABOUT queries the current case;
  • ABOUTS queries a concrete list of cases;
  • ALL_ABOUTS queries all cases, but only when the caller asks for that intentionally.

If a caller asks for ABOUTS without providing the list of cases, the kernel
rejects the request. If a caller asks for ALL_ABOUTS, the request is clearly
global and can be audited as such.

The reason is simple: a query that looked scoped to one case should not
silently end up mixing memory from other cases.

Protocol First, Tools Second

MCP is a useful way for a model to call tools. For example, it lets an LLM use
operations such as kernel_ask, kernel_near, kernel_trace, and
kernel_inspect.

That is valuable, but I did not want MCP to define how memory works.

The rule belongs in a more stable place: KMP. In the current implementation,
the same operations are exposed through the typed gRPC service
KernelMemoryService.

Separating those layers has a practical benefit:

  • an LLM can use KMP through MCP tools;
  • an application can call the gRPC service directly;
  • a future HTTP API or SDK can expose the same behavior;
  • all of those entry points must mean the same thing when they ask, traverse, trace, or inspect memory.

The project follows a hexagonal architecture for exactly this reason: entry
points can change without changing the memory semantics. gRPC is the main API.
MCP is the agent-facing entry point: the way to expose the same operations to
an AI model as tools it can use without ambiguity.

I have been careful about keeping MCP and gRPC in parity. Both entry points
must respect the same behavior. If a REST API, SDK, or another integration is
added later, it should become another entry point into the same protocol, not a
different version of memory.

The principle is:

KMP defines memory semantics.
gRPC, MCP, HTTP, SDKs, and CLIs are ways to use those semantics.

The separation looks like this:

Figure 2. MCP, gRPC, and future entry points operate over the same memory
semantics defined by KMP.

Time Is Not Just Another Filter

Useful memory is not only about what was said. It also matters when it was said
and in which order the information appeared.

An answer can be valid with the information available at one moment and become
obsolete later. A decision can be reasonable before a tool result arrives and
wrong once new data appears. Even a failed attempt can be useful if it explains
why a different solution was chosen afterwards.

That is why KMP does not treat time as a secondary filter. It makes time part
of memory navigation:

  • goto moves to a concrete moment or reference;
  • near shows what happened around it;
  • rewind moves backward;
  • forward moves forward;
  • trace explains a path of relations and evidence;
  • inspect exposes the details of a node.

With that, you do not need to ask an LLM to reread a huge conversation and
guess what happened. A person or a model can move through memory with explicit,
reproducible operations.

For a person, the process becomes inspectable. For an AI model, memory becomes
something it can operate through tools.
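A minimal timeline sketch shows what these temporal moves look like. The cursor model below is an assumption; only the operation names come from KMP:

```python
# Toy timeline implementing goto/near/rewind/forward over ordered,
# referenced entries. A real kernel works over structured memory with
# evidence; this only illustrates the navigation semantics.

class Timeline:
    def __init__(self, entries):   # entries: ordered list of (ref, text)
        self.entries = entries
        self.pos = 0

    def goto(self, ref):
        """Move to a concrete moment or reference."""
        self.pos = next(i for i, (r, _) in enumerate(self.entries) if r == ref)
        return self.entries[self.pos]

    def near(self, radius=1):
        """Show what happened around the current position."""
        lo = max(0, self.pos - radius)
        return self.entries[lo:self.pos + radius + 1]

    def rewind(self):
        """Move backward: what was known before this step."""
        self.pos = max(0, self.pos - 1)
        return self.entries[self.pos]

    def forward(self):
        """Move forward: what changed after this step."""
        self.pos = min(len(self.entries) - 1, self.pos + 1)
        return self.entries[self.pos]

t = Timeline([("t1", "hypothesis"), ("t2", "tool result"), ("t3", "decision")])
t.goto("t2")
print(t.near())    # the tool result plus its neighbors
print(t.rewind())  # ('t1', 'hypothesis'): what was known before
```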

Writing Memory Well Is the Hard Part

All of the above depends on one condition: the memory must be written well.

goto, near, rewind, forward, trace, and inspect are only useful if
the stored memory has enough structure. To traverse memory later, you first
need to write it properly.

Saving unstructured text is not enough. It lets you search for phrases later,
but it does a poor job of reconstructing the process: which step depended on
which, which decision corrected an earlier one, which evidence supported a
conclusion, or which attempt was discarded.

That is why writing is as important as reading.

Writing memory in KMP means recording entries, relations, evidence, dimensions,
and time. It also means deciding how a new piece of memory connects to what was
already there.

This is an important boundary. The kernel is not responsible for inference.
Inference belongs to whoever uses it: a person, an agent, a model, or an
adapter.

Writing to KMP is not just adding text. The writer also has to say which prior
memory the text connects to, and why it connects there. That relation is part
of the memory, not a secondary detail. The kernel should validate what is
written and make it traversable; it should not invent the meaning of what
happened.

I call the piece that writes memory the writer. It can be:

  • a person;
  • an agent;
  • a model using MCP;
  • a benchmark adapter;
  • a future specialist model trained to write memory.

The writer decides why a new entry connects to previous memory. The kernel
checks that the relation is valid, scoped correctly, backed by evidence, and
auditable later.

The write flow looks like this:

Figure 3. The writer decides meaning and relations. KMP validates what is
written, but it does not infer meaning on its own.

That separation led to two write paths:

kernel_ingest       -> canonical low-level write path
kernel_write_memory -> writer helper that ultimately compiles to ingest

kernel_ingest is the strict entry point. It receives already structured
memory.

kernel_write_memory is more convenient for a writer. It lets the writer
express a new entry and its connections, while still validating the quality of
what is about to be written:

  • relation name;
  • semantic class;
  • target node reference;
  • why;
  • evidence;
  • context read before writing;
  • fallback quality.

This matters because a memory graph full of vague relations is not very useful.

If every relation says supports_answer, the memory is connected, but it does
not explain anything. It does not tell you whether an entry depends on a
previous answer, contradicts it, refines it, replaces it, or merely appears
near it.

In KMP, relation quality is part of memory quality.

Relations Need to Be Honest

There is also the opposite risk: making relations look richer than they are.

A writer should not create smart-looking edges just to make the graph look
better. If it cannot justify a relation from the context it observed, it should
fall back to a simpler, anemic, or structural relation.

That fallback is not a failure. It is an honest signal.

A good memory system must be able to say:

I know these nodes are related by order or proximity.
I do not yet know a stronger semantic reason.

That gives me metrics I can inspect:

  • rich relations;
  • anemic relations;
  • structural relations;
  • suspect or rejected relations;
  • prior context observed before writing;
  • evidence coverage.

Those metrics give me a practical way to improve the writer without hiding
uncertainty.

The Boundary Between Memory and Interpretation

To measure KMP quality, I have mainly been working with two kinds of benchmarks.

MemoryArena is interesting because it looks closer to the kind of memory I want
to build: multi-step tasks, attempts, feedback, course corrections, and memory
that has to be reused later.

LongMemEval is interesting for a different reason. It is more conversational,
but it stresses a very useful case: recovering evidence scattered across many
sessions and checking whether the system can use it to answer.

That comparison made another boundary clear: the same memory layer can support
many use cases, and not all of them need the same kind of interpretation.

The kernel can retrieve the right evidence, and the final answer can still be
wrong if the reader has to perform domain work:

  • summing money;
  • counting entities;
  • deduplicating events;
  • selecting the latest value;
  • comparing dates;
  • normalizing code, URLs, or currencies;
  • deciding whether an amount is paid, planned, cancelled, or only mentioned.

That is where plugins come in.

In this context, a plugin is a specialized component that interprets evidence
the kernel has already retrieved. For example: detecting amounts, summing
money, comparing dates, counting entities, recognizing URLs, identifying code,
or resolving the latest value.

The reason for introducing plugins is not to win a specific benchmark. It is to
adapt memory to different use cases without putting all those rules inside KMP
itself.

I do not want to contaminate the kernel with logic specific to one benchmark,
money, dates, preferences, or any other domain. The kernel should stay
use-case agnostic: it stores memory, relations, time, evidence, and traces.
Specialized interpretation should live outside it.

The kernel should retrieve memory and evidence reliably. Plugins and readers
can then work on that evidence to solve domain operations.

The separation is:

kernel -> memory, traversal, proof, inspection
plugins -> typed value extraction and deterministic operations
reader -> answer construction and task policy

Figure 4. KMP retrieves traceable evidence. Plugins interpret typed values and
the reader builds the final answer.
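To make the plugin idea concrete, here is a hypothetical money plugin: the kernel retrieves evidence snippets, the plugin extracts typed amounts deterministically, and the reader only has to present the result. The regex and snippet format are invented; deciding whether an amount is paid, refunded, or merely mentioned would be further plugin work:

```python
import re
from decimal import Decimal

# Hypothetical money plugin: turn retrieved evidence text into typed
# Decimal values so that arithmetic is deterministic, not model-guessed.

def extract_amounts(snippets: list[str]) -> list[Decimal]:
    """Pull '$<number>' amounts out of retrieved evidence text."""
    amounts = []
    for s in snippets:
        amounts += [Decimal(m) for m in re.findall(r"\$(\d+(?:\.\d+)?)", s)]
    return amounts

evidence = ["paid $120.50 for hosting", "refunded $20.50", "meeting at 3pm"]
amounts = extract_amounts(evidence)
print(sum(amounts))  # 141.00 — exact arithmetic, no model involved
```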

This distinction is central.

Underpass KMP should not become a custom solution for a benchmark or a single
domain. It should do its part well: recover memory, evidence, and relations
reliably so that readers, plugins, and future specialist models can work on
top.

Why This Matters for Agents

Agent memory should not only help answer a user question by looking at old
chat history.

The more interesting case appears when an AI works through several steps: it
tries a hypothesis, uses tools, makes a mistake, changes direction, receives
new information, and eventually reaches a solution. In that setting, memory is
not a text archive. It is a navigable record of how something was solved.

With that kind of memory, a person or a model can go back into the process and
ask:

  • what was known before a decision was made;
  • which solution attempt failed;
  • which new data changed the direction of the work;
  • which agent introduced a wrong assumption;
  • why a later answer replaced an earlier one;
  • which sequence of steps led to the final solution;
  • which evidence supports the result.

This is where multidimensional and temporal memory becomes useful. Each agent
can be a dimension. Each session, task, entity, attempt, or work phase can be
another. Time lets you move across them and understand how the state of the
process changed.

The graph is not decoration. It is the shape of the process: what happened, in
which order, connected to what, and why.

Observability Is Not Optional

If agent memory is infrastructure, it has to be observable.

I need to know:

  • whether a write became queryable;
  • how long projection took;
  • which scope a query used;
  • how many references were inspected;
  • whether trace pagination worked;
  • whether proof was complete;
  • whether a reader ignored correct evidence;
  • whether a writer created rich, anemic, or suspect relations.

That is why the kernel records structured KMP and MCP logs, OTel metrics for
KMP calls, projection processing latency, relation quality metrics, and
explicit inspect and trace behavior.

The operational goal is simple:

A failed agent answer should be classifiable.

Possible classes include:

  • ingestion gap;
  • projection gap;
  • retrieval gap;
  • proof gap;
  • reader consumption gap;
  • task reasoning gap.
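One simple way to operationalize that taxonomy is as an ordered checklist over the pipeline stages: report the first stage whose check failed. The enum and the check dictionary below are invented for illustration:

```python
from enum import Enum

# Sketch of the failure taxonomy: walk the pipeline stages in order
# and report the first gap. If every stage checks out, the failure
# was in task reasoning over correct evidence.

class FailureClass(Enum):
    INGESTION_GAP = "ingestion gap"
    PROJECTION_GAP = "projection gap"
    RETRIEVAL_GAP = "retrieval gap"
    PROOF_GAP = "proof gap"
    READER_CONSUMPTION_GAP = "reader consumption gap"
    TASK_REASONING_GAP = "task reasoning gap"

def classify(checks: dict[FailureClass, bool]) -> FailureClass:
    """Return the first pipeline stage whose check failed."""
    for stage in FailureClass:
        if not checks.get(stage, True):
            return stage
    return FailureClass.TASK_REASONING_GAP

run = {FailureClass.INGESTION_GAP: True,
       FailureClass.PROJECTION_GAP: True,
       FailureClass.RETRIEVAL_GAP: False}
print(classify(run).value)  # retrieval gap
```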

Without that classification, every failure looks the same: "the AI got it
wrong". That is not good enough for production agents.

Security and Auditability

Navigable memory can also be sensitive memory.

If the system can reconstruct what happened, who said it, which decision was
made, and which evidence supported it, then it must also control who can see
each thing and at what level of detail.

Asking for a summary is not the same as asking for raw memory. Querying the
current case is not the same as crossing memory from many cases. And logs or
traces must not casually expose secrets, credentials, complete prompts, or
content that did not need to leave the system.

That is why KMP treats security and auditability as part of the design, not as
an afterthought:

  • API boundaries are typed;
  • reads have explicit scope;
  • raw inspection is a deliberate option;
  • errors fail fast instead of activating silent fallback;
  • references, evidence, and relations are designed for audit;
  • TLS/mTLS is used on infrastructure boundaries that support it.

The goal is that a person can review why the system returned an answer without
opening all memory, while the system avoids exposing more information than
needed.

What Underpass KMP Promises

Before talking about results, it is worth being clear about what KMP promises
and what it does not try to solve.

Underpass KMP is not:

  • a general replacement for a vector database;
  • a final answer generator;
  • a benchmark-specific solution;
  • a hidden agent framework;
  • a guarantee that every model will interpret evidence correctly.

It is a deterministic, auditable memory layer. Its job is to preserve enough
structure for people, agents, plugins, readers, and future specialist models to
work with memory without reading everything again from scratch.

Benchmarks: What I Learned

I have been careful not to claim more than the current evidence supports.

The most important early result is not "the kernel wins every memory
benchmark". The important result is that the kernel makes a previously blurry
boundary visible:

Did memory retrieval fail, or did the reader fail to use correct evidence?

That distinction matters.

In a MemoryArena public-TLS run with 100 progressive-search tasks and the
smart writer enabled, the kernel reached:

Metric                     Result
Correct KMP events         2259/2259
Known-at-clean queries     753/753
Full-ref recall            753/753
Future-answer leaks        0
Local paper-aligned score  97/100
Final misses               3
The 3 final misses were classified as reader answer-selection failures over
complete evidence, not as kernel retrieval failures or graph contamination.

In a realistic MemoryArena 2x/domain slice, the kernel reached:

Metric                  Result
Correct KMP events      221/221
Known-at-clean queries  73/73
Full-ref recall         73/73
Future leaks            0
Unexpected references   0
Missing references      0

The remaining task failures were reader or agent gaps, not evidence gaps.

LongMemEval taught a different lesson. In a 30-item multi-session smart-writer
slice, the recovered evidence was complete, but the same evidence produced
different results depending on the reader:

Reader       Result
GPT-4o       22/30
Gemma 4 31B  25/30

In a 100-item test using an external embedding model and derivations, the same
boundary appeared again:

Measure                                         Result
Broad evidence recall                           ~99%
Official multi-session aggregate end-to-end QA  71.7%

The remaining failures were mostly structured operand problems: missed count
predicates, omitted qualifying evidence, or comparison mistakes.

That is useful information.

It tells me that the next improvement is not to hide more logic inside the
kernel. The next improvement is better candidate retrieval, reranking, typed
operand extraction, and reusable domain plugins.

Roadmap

The next step is to keep validating the idea with real cases and make the
kernel easier to use.

In the short term, the work is practical:

  • stronger MemoryArena and MemoryAgentBench runs;
  • an official-style LongMemEval regression as a secondary benchmark;
  • hybrid candidate retrieval behind ports;
  • reranking experiments;
  • visual graph and timeline exploration for traversing memory;
  • better proof and traversal observability;
  • stable pagination, limits, and scopes in KMP.

In the medium term, the direction becomes more interesting:

  • a small model specialized in operating kernel tools, trained from audited MCP trajectories;
  • process queries such as known_at, why, failed_paths, final_path, and best_path;
  • reusable interpretation plugins for money, dates, counts, URLs, code, and domain-specific operators;
  • conformance tests so kernel semantics are independent from the storage implementation;
  • public visual experiences that let people replay an agent process as a graph and timeline.

The operator model is especially important to me. It would not be a general
agent, and it would not be a magical model that "understands memory". It would
be a small specialist trained to use KMP efficiently:

Which tool should I call now?
With which bounded arguments?
Should I inspect, trace, move through time, or stop?
Which references prove that I have enough evidence?

That is a narrow and measurable problem.

The Product Thesis

The thesis behind Underpass KMP is simple:

Reliable agents need memory they can navigate, not just context they can
retrieve.

That memory must be:

  • scoped by what it is about;
  • split into meaningful dimensions;
  • traversable through time;
  • connected by honest relations;
  • backed by evidence;
  • inspectable by people;
  • usable by LLMs through tools;
  • observable and auditable in production.

That is why I am building Kernel Memory Protocol: so agent memory is not just
accumulated text, but a structure that can be traversed, inspected, and reused.

This is not about making prompts longer. It is the opposite: rebuilding the
useful context without forcing the model to read all the raw material, and
making token usage intelligent, measurable, and auditable.

The goal is to turn agent memory into a real working layer.

If this direction interests you, you can check the
Underpass KMP repository.
And if you find it useful, a GitHub star helps give the project visibility.


Written by Tirso García Ibáñez ·
LinkedIn ·
Underpass AI

Underpass KMP is part of the Underpass AI project. The repository is licensed
under the Apache License 2.0,
unless stated otherwise.

Copyright © 2026 Tirso García Ibáñez.
