<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tirso García</title>
    <description>The latest articles on DEV Community by Tirso García (@tirsogarcia).</description>
    <link>https://dev.to/tirsogarcia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1723304%2F484b0a43-4957-467a-9131-06846c86b84f.png</url>
      <title>DEV Community: Tirso García</title>
      <link>https://dev.to/tirsogarcia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tirsogarcia"/>
    <language>en</language>
    <item>
      <title>Building Kernel Memory Protocol: Navigable Memory for AI Agents</title>
      <dc:creator>Tirso García</dc:creator>
      <pubDate>Sun, 10 May 2026 14:59:29 +0000</pubDate>
      <link>https://dev.to/tirsogarcia/construyendo-kernel-memory-protocol-memoria-navegable-para-agentes-de-ia-24lc</link>
      <guid>https://dev.to/tirsogarcia/construyendo-kernel-memory-protocol-memoria-navegable-para-agentes-de-ia-24lc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;English version: &lt;a href="https://dev.to/tirsogarcia/building-kernel-memory-protocol-navigable-memory-for-ai-agents-315j"&gt;Building Kernel Memory Protocol: Navigable Memory for AI Agents&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem with many AI agents is not that they lack text in the prompt.
The problem is that they have no memory they can query, traverse, and audit.&lt;/p&gt;

&lt;p&gt;Today, many solutions try to solve this in three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;copying part of previous conversations into the new prompt;&lt;/li&gt;
&lt;li&gt;searching for similar fragments via embeddings;&lt;/li&gt;
&lt;li&gt;delegating memory to a framework that stores it internally, but does not
always allow you to inspect it, traverse it, or explain how a decision was
reached.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These solutions help, but they fall short when an agent is doing real
work. In that context, retrieving text is not enough: you need to be able to
reconstruct the process.&lt;/p&gt;

&lt;p&gt;The questions that matter are different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did the agent know when it made a decision?&lt;/li&gt;
&lt;li&gt;Which solution attempts did it try?&lt;/li&gt;
&lt;li&gt;Which one failed, and why?&lt;/li&gt;
&lt;li&gt;What new information changed the course of the work?&lt;/li&gt;
&lt;li&gt;What sequence of steps led to the final solution?&lt;/li&gt;
&lt;li&gt;What evidence justifies a decision or an answer?&lt;/li&gt;
&lt;li&gt;Can a person review that evidence without reading the entire raw
conversation?&lt;/li&gt;
&lt;li&gt;Can another model navigate the same memory without knowing where or how it
is stored underneath?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Underpass KMP was born trying to solve something smaller: retrieving only
the context an agent needs to continue a task without rereading the entire
previous conversation. I called that context rehydration: taking memory that
was already recorded and reconstructing only the part useful for the next
step.&lt;/p&gt;

&lt;p&gt;But while testing it I saw that the real problem was bigger. Preparing a
better prompt was not enough. I needed a memory layer that stored what
happened, when it happened, who produced it, what evidence supported it, and
how it could be traversed afterward.&lt;/p&gt;

&lt;p&gt;That is where Kernel Memory Protocol, or KMP, comes from: a small, explicit
API for writing, querying, traversing, tracing, and inspecting agent memory.&lt;/p&gt;
&lt;h2&gt;From searching fragments to traversing memory&lt;/h2&gt;

&lt;p&gt;The initial mistake was treating memory as if it were just a search
engine.&lt;/p&gt;

&lt;p&gt;A search engine can return texts similar to what you just asked. That is
useful for finding loose information, but it is not enough to understand a
work process.&lt;/p&gt;

&lt;p&gt;When an agent solves a task, what matters is not finding a similar
sentence. What matters is being able to reconstruct what happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what information the agent had when it made a decision;&lt;/li&gt;
&lt;li&gt;which solution attempts it tried;&lt;/li&gt;
&lt;li&gt;which attempt failed, and why;&lt;/li&gt;
&lt;li&gt;what new data changed the direction of the work;&lt;/li&gt;
&lt;li&gt;what sequence of steps led to the final solution;&lt;/li&gt;
&lt;li&gt;what evidence justifies each conclusion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That difference shaped the kernel. I did not want to build another
mechanism for searching text. I wanted to build navigable memory.&lt;/p&gt;

&lt;p&gt;That is why KMP does not expose a vector database API. It exposes memory
movements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingest   -&amp;gt; registrar memoria
wake     -&amp;gt; recuperar el estado necesario para continuar
ask      -&amp;gt; preguntar a la memoria con evidencia
goto     -&amp;gt; ir a un momento o referencia concreta
near     -&amp;gt; ver qué ocurrió alrededor de un momento o referencia
rewind   -&amp;gt; moverse hacia atrás
forward  -&amp;gt; moverse hacia delante
trace    -&amp;gt; explicar una ruta de relaciones
inspect  -&amp;gt; inspeccionar un nodo de memoria
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
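&lt;p&gt;As a rough sketch, the ingest and ask movements could look like this against a toy in-memory store. Everything here (&lt;code&gt;ToyKernel&lt;/code&gt;, &lt;code&gt;MemoryEntry&lt;/code&gt;, the sample entry) is an illustrative assumption, not the real KMP API:&lt;/p&gt;

```python
# Toy sketch of the ingest/ask movements over an in-memory store.
# ToyKernel and MemoryEntry are illustrative assumptions, not KMP's API.
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    about: str                 # the memory case this entry belongs to
    text: str                  # what was recorded
    evidence: list = field(default_factory=list)  # sources backing it

class ToyKernel:
    def __init__(self) -> None:
        self._entries: list[MemoryEntry] = []

    def ingest(self, entry: MemoryEntry) -> None:
        """Record memory (the `ingest` movement)."""
        self._entries.append(entry)

    def ask(self, about: str, keyword: str) -> list[MemoryEntry]:
        """Question the memory; every hit carries its evidence (`ask`)."""
        return [e for e in self._entries
                if e.about == about and keyword in e.text]

kernel = ToyKernel()
kernel.ingest(MemoryEntry("incident-1",
                          "retry with backoff fixed the timeout",
                          evidence=["log:1423"]))
hits = kernel.ask("incident-1", "timeout")
```

&lt;p&gt;Note that even in this toy, an answer never travels without the evidence that backs it.&lt;/p&gt;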



&lt;p&gt;The idea is to keep the core of the system small. KMP does not try to be
the agent or to decide the final answer. Its responsibility is different:
storing structured memory, allowing it to be traversed deterministically, and
returning auditable evidence.&lt;/p&gt;

&lt;p&gt;Final-answer generation, business rules, and domain plugins can be added
around KMP without pushing them into that core.&lt;/p&gt;
&lt;h2&gt;The mental model&lt;/h2&gt;

&lt;p&gt;The central object in Underpass KMP is an &lt;code&gt;about&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;about&lt;/code&gt; is the case, topic, or memory world being worked on.
It can be an incident, a task, a customer, a benchmark, a repository, a user,
or an agent's long-running process.&lt;/p&gt;

&lt;p&gt;Within that &lt;code&gt;about&lt;/code&gt;, memory does not live on a single line. It
can be split into dimensions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about
  dimension: session
  dimension: agent
  dimension: task
  dimension: entity
  dimension: preference
  dimension: attempt
  dimension: incident_phase
  dimension: success_path
  dimension: failure_path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A dimension can represent a session, an agent, a task, an entity, a
solution attempt, or a phase of the process.&lt;/p&gt;

&lt;p&gt;Time is not just another dimension.&lt;/p&gt;

&lt;p&gt;Time is what makes it possible to ask what was known before a step, what
changed afterward, or what information did not yet exist when a decision was
made.&lt;/p&gt;

&lt;p&gt;The mental model looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about -&amp;gt; caso o mundo de memoria
dimensions -&amp;gt; planos de memoria dentro de ese caso
time -&amp;gt; eje temporal que atraviesa esos planos
relations -&amp;gt; por qué dos elementos están conectados
evidence -&amp;gt; evidencias que sostienen la memoria
provenance -&amp;gt; quién lo observó o escribió, y cuándo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visually, a KMP memory looks more like this than like a list of messages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjt9wk79r1f0q9iba284j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjt9wk79r1f0q9iba284j.jpg" alt="Memoria KMP multidimensional atravesada por el tiempo" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1. A single &lt;code&gt;about&lt;/code&gt; can have several dimensions
crossed by time. Blue arrows are semantic relations; dashed ones show
continuity within a dimension.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This model matters because an agent's memory is rarely linear.&lt;/p&gt;

&lt;p&gt;A long task can involve several agents. Each agent can have its own
session. Each session can produce hypotheses, failed attempts, tool results,
and final decisions. A useful memory layer must allow looking at a single
dimension, several dimensions, or the whole case, while making clear at all
times which scope is being queried.&lt;/p&gt;
&lt;h2&gt;Why dimensions need a namespace&lt;/h2&gt;

&lt;p&gt;One of the important implementation decisions was making &lt;code&gt;about&lt;/code&gt;
act as the namespace for dimensions.&lt;/p&gt;

&lt;p&gt;When a client ingests memory, &lt;code&gt;IngestRequest.about&lt;/code&gt; defines the
default scope. Internally, the real identity of a dimension is equivalent to
something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about:&amp;lt;about&amp;gt;:dimension:&amp;lt;dimension_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
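&lt;p&gt;A minimal sketch of that identity scheme (the helper name is an assumption; only the key shape comes from the article):&lt;/p&gt;

```python
# Compose the namespaced identity of a dimension inside its `about`.
# The helper name is an assumption; the key shape follows the article.
def dimension_key(about: str, dimension_id: str) -> str:
    return f"about:{about}:dimension:{dimension_id}"

# Two tasks can both have a `session:1` dimension without colliding:
a = dimension_key("task-alpha", "session:1")
b = dimension_key("task-beta", "session:1")
```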



&lt;p&gt;It may look like a small detail, but it prevents important mistakes.&lt;/p&gt;

&lt;p&gt;If two different tasks each have a dimension called &lt;code&gt;session:1&lt;/code&gt;,
I do not want them to get mixed up by accident. By nesting the dimension
inside its &lt;code&gt;about&lt;/code&gt;, each &lt;code&gt;session:1&lt;/code&gt; belongs to the case
it corresponds to.&lt;/p&gt;

&lt;p&gt;Reads are explicit as well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CURRENT_ABOUT&lt;/code&gt; queries the current case;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ABOUTS&lt;/code&gt; queries a specific list of cases;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ALL_ABOUTS&lt;/code&gt; queries all cases, but only when the client asks for it
intentionally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If someone requests &lt;code&gt;ABOUTS&lt;/code&gt; without indicating which cases to
query, the kernel rejects the request. And if someone requests
&lt;code&gt;ALL_ABOUTS&lt;/code&gt;, it is clear they are asking to cross all available
memories.&lt;/p&gt;

&lt;p&gt;The reason is simple: a query that appeared limited to one case should not
end up mixing in memory from other cases by accident.&lt;/p&gt;
&lt;h2&gt;Protocol first, tools later&lt;/h2&gt;

&lt;p&gt;MCP is a convenient way for a model to use tools. For example, it lets an
LLM call operations such as &lt;code&gt;kernel_ask&lt;/code&gt;, &lt;code&gt;kernel_near&lt;/code&gt;,
&lt;code&gt;kernel_trace&lt;/code&gt;, or &lt;code&gt;kernel_inspect&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is very useful, but I did not want MCP to define how memory works.&lt;/p&gt;

&lt;p&gt;The rule had to live somewhere more stable: KMP. In the current
implementation, those operations are exposed through the typed gRPC service
&lt;code&gt;KernelMemoryService&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The advantage of separating the two is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an LLM can use KMP through MCP tools;&lt;/li&gt;
&lt;li&gt;an application can call the gRPC service directly;&lt;/li&gt;
&lt;li&gt;an HTTP API or an SDK may exist in the future;&lt;/li&gt;
&lt;li&gt;all of those paths must do the same thing when they question, traverse,
trace, or inspect memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project keeps a hexagonal architecture precisely so those entry points
can be swapped. The main API is gRPC. MCP is the agentic entry point: the way
to expose those same operations so an AI can use them as tools without getting
confused.&lt;/p&gt;

&lt;p&gt;I have been very careful about parity between MCP and gRPC. Both entry
points must respect the same behavior. And if a REST API, an SDK, or another
kind of integration appears tomorrow, it should be possible to add it as
another entry point to the same protocol, not as a different version of the
memory.&lt;/p&gt;

&lt;p&gt;The principle is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KMP define la semántica de memoria.
gRPC, MCP, HTTP, SDKs y CLIs son formas de usar esa semántica.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The separation looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jd1z7xj93vv5jia23nb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jd1z7xj93vv5jia23nb.jpg" alt="KMP as a common protocol for MCP, gRPC, and future entry points" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2. MCP, gRPC, and future entry points operate on the same
memory semantics defined by KMP.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Time is not just another filter&lt;/h2&gt;

&lt;p&gt;In a useful memory it is not enough to store what was said. It also
matters when it was said and in what order the information appeared.&lt;/p&gt;

&lt;p&gt;An answer can make sense given the information available at one moment and
become obsolete later. A decision can be reasonable before receiving a tool's
result, yet stop being reasonable when a new piece of data appears. Even a
failed attempt can be valuable if it explains why another solution was chosen
afterward.&lt;/p&gt;

&lt;p&gt;That is why KMP does not treat time as a secondary filter. It turns it
into a way of navigating memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;goto&lt;/code&gt; jumps to a concrete moment or reference;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;near&lt;/code&gt; shows what happened around it;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rewind&lt;/code&gt; looks backward;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;forward&lt;/code&gt; advances from a point;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;trace&lt;/code&gt; explains a path of relations and evidence;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inspect&lt;/code&gt; reviews the details of a node.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way there is no need to ask an LLM to reread a huge conversation and
guess what happened. A person or a model can move through the memory with
explicit, reproducible operations.&lt;/p&gt;

&lt;p&gt;For a person, the process becomes inspectable. For an AI, the memory
becomes something it can use through tools.&lt;/p&gt;
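&lt;p&gt;The time movements can be sketched as deterministic operations over an ordered timeline. The cursor model and the sample events below are illustrative assumptions, not the real KMP implementation:&lt;/p&gt;

```python
# Toy timeline showing goto, near, rewind, and forward as deterministic
# operations. The cursor model is an illustrative assumption.
events = [  # (timestamp, what happened), already in time order
    (1, "hypothesis A recorded"),
    (2, "tool run failed"),
    (3, "hypothesis B recorded"),
    (4, "tool run succeeded"),
    (5, "final answer written"),
]

def goto(t):
    """Jump to the event at a concrete moment."""
    return next(i for i, (ts, _) in enumerate(events) if ts == t)

def near(i, window=1):
    """See what happened around a position."""
    return events[max(0, i - window): i + window + 1]

def rewind(i, steps=1):
    """Move backward from a position."""
    return max(0, i - steps)

def forward(i, steps=1):
    """Move forward from a position."""
    return min(len(events) - 1, i + steps)

i = goto(3)                  # land on "hypothesis B recorded"
ctx = near(i)                # the failed run, hypothesis B, the success
prev = events[rewind(i)][1]  # the step immediately before
```

&lt;p&gt;Running the same navigation twice yields the same path, which is what makes the traversal reproducible for both people and models.&lt;/p&gt;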
&lt;h2&gt;Writing memory well is the hard part&lt;/h2&gt;

&lt;p&gt;But everything above depends on one condition: the memory has to be well
written.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;goto&lt;/code&gt;, &lt;code&gt;near&lt;/code&gt;, &lt;code&gt;rewind&lt;/code&gt;, &lt;code&gt;forward&lt;/code&gt;, &lt;code&gt;trace&lt;/code&gt;, and &lt;code&gt;inspect&lt;/code&gt; are only useful if
what was stored has enough structure. To traverse a memory later, it first has
to be written well.&lt;/p&gt;

&lt;p&gt;Storing unstructured text is not enough. That makes it possible to search
for sentences later, but not to properly reconstruct the process: which step
depended on another, which decision corrected an earlier one, what evidence
justified a conclusion, or which attempt was discarded.&lt;/p&gt;

&lt;p&gt;That is why writing is as important as reading.&lt;/p&gt;

&lt;p&gt;Writing memory in KMP means recording entries, relations, evidence,
dimensions, and time. It also means deciding how a new piece of memory
connects to what already existed.&lt;/p&gt;

&lt;p&gt;An important boundary appears there. The kernel has no inference
responsibility. Inference is done by whoever uses it: a person, an agent, a
model, or an adapter.&lt;/p&gt;

&lt;p&gt;Writing to KMP is not just adding text. You also have to say which memory
that text connects to and why it connects there. That relation is part of the
memory, not a secondary detail. The kernel must validate what is written and
make it traversable; it must not invent the meaning of what happened.&lt;/p&gt;

&lt;p&gt;I call the piece that writes memory a writer. It can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a person;&lt;/li&gt;
&lt;li&gt;an agent;&lt;/li&gt;
&lt;li&gt;a model using MCP;&lt;/li&gt;
&lt;li&gt;a benchmark adapter;&lt;/li&gt;
&lt;li&gt;a future model specialized in writing memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The writer is the one who must say why a new entry connects to earlier
memory. The kernel checks that the relation is valid, that it stays within the
correct scope, that it has evidence, and that it can be audited later.&lt;/p&gt;

&lt;p&gt;The write flow looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdplkvqdf9m9e0hulqjt5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdplkvqdf9m9e0hulqjt5.jpg" alt="Memory write flow in KMP" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3. The writer decides meaning and relations. KMP validates what
is written, but does not infer on its own what it means.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This separation led to two ways of writing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kernel_ingest       -&amp;gt; escritura canónica de bajo nivel
kernel_write_memory -&amp;gt; ayuda para writers, que acaba convirtiéndose en ingest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kernel_ingest&lt;/code&gt; is the strict entry point. It receives memory
that is already structured.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kernel_write_memory&lt;/code&gt; is more convenient for a writer. It lets
the writer express a new entry and its connections, while still validating the
quality of what is about to be written:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relation name;&lt;/li&gt;
&lt;li&gt;semantic class;&lt;/li&gt;
&lt;li&gt;target node reference;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;why&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;evidence;&lt;/li&gt;
&lt;li&gt;context read before writing;&lt;/li&gt;
&lt;li&gt;fallback quality.&lt;/li&gt;
&lt;/ul&gt;
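&lt;p&gt;As a sketch, that validation could amount to checking a proposed relation for the fields listed above. The checker and the sample payloads are assumptions; only the field names come from the article:&lt;/p&gt;

```python
# Sketch of the quality checks a write helper could apply to a proposed
# relation; the field names follow the article, the checker is an assumption.
REQUIRED = ("relation", "semantic_class", "target_ref", "why", "evidence")

def check_relation(rel: dict) -> list[str]:
    """Return the list of missing fields; empty means acceptable."""
    return [f for f in REQUIRED if not rel.get(f)]

good = {"relation": "refines", "semantic_class": "revision",
        "target_ref": "node:17", "why": "corrects the earlier estimate",
        "evidence": ["tool:run-9"]}
bad = {"relation": "supports_answer"}  # connected, but explains nothing

missing_good = check_relation(good)
missing_bad = check_relation(bad)
```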

&lt;p&gt;This matters because a memory full of vague relations is not worth much.&lt;/p&gt;

&lt;p&gt;If every relation says &lt;code&gt;supports_answer&lt;/code&gt;, the memory is
connected, but it explains nothing. It does not say whether an entry depends
on a previous answer, contradicts it, refines it, replaces it, or simply
appears near it.&lt;/p&gt;

&lt;p&gt;In KMP, the quality of the relations is part of the quality of the memory.&lt;/p&gt;
&lt;h2&gt;Relations must be honest&lt;/h2&gt;

&lt;p&gt;The opposite risk also exists: inventing relations that are too rich.&lt;/p&gt;

&lt;p&gt;A writer must not create seemingly intelligent edges just to make the
graph look better. If it cannot justify a relation from the context it has
observed, it must fall back to a simpler, anemic, or structural relation.&lt;/p&gt;

&lt;p&gt;That fallback is not a failure. It is an honest signal.&lt;/p&gt;

&lt;p&gt;A good memory system must be able to say:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sé que estos nodos están relacionados por orden o cercanía.
Todavía no conozco una razón semántica más fuerte.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This yields metrics I can inspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rich relations;&lt;/li&gt;
&lt;li&gt;anemic relations;&lt;/li&gt;
&lt;li&gt;structural relations;&lt;/li&gt;
&lt;li&gt;suspicious or rejected relations;&lt;/li&gt;
&lt;li&gt;prior context observed before writing;&lt;/li&gt;
&lt;li&gt;evidence coverage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics give me a practical way to improve the writer without hiding
the uncertainty.&lt;/p&gt;
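&lt;p&gt;A minimal sketch of such metrics, counting edges by quality class. The sample data is invented; the classes are the ones listed above:&lt;/p&gt;

```python
# Sketch of relation-quality metrics: counting rich, anemic, and structural
# edges keeps the writer's uncertainty visible. Sample data is invented.
from collections import Counter

relations = [
    {"name": "refines", "quality": "rich"},
    {"name": "follows", "quality": "structural"},  # order/proximity only
    {"name": "related_to", "quality": "anemic"},   # honest fallback
    {"name": "contradicts", "quality": "rich"},
]

metrics = Counter(r["quality"] for r in relations)
```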
&lt;h2&gt;The boundary between memory and interpretation&lt;/h2&gt;

&lt;p&gt;To measure the quality of KMP I have been working mainly with two kinds of
benchmarks.&lt;/p&gt;

&lt;p&gt;MemoryArena interests me because it is closer to the kind of memory I want
to build: tasks with multiple steps, attempts, feedback, changes of course,
and memory that must be reused later.&lt;/p&gt;

&lt;p&gt;LongMemEval interests me for a different reason. It is more
conversational, but it stresses a very useful case: retrieving evidence
scattered across many sessions and checking whether the system knows how to
use it to answer.&lt;/p&gt;

&lt;p&gt;That comparison made something else clear: a memory can serve many
different use cases, and not all of them require the same kind of
interpretation.&lt;/p&gt;

&lt;p&gt;The kernel can retrieve the right evidence and still not produce the final
answer if the reader has to do domain work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adding up money;&lt;/li&gt;
&lt;li&gt;counting entities;&lt;/li&gt;
&lt;li&gt;deduplicating events;&lt;/li&gt;
&lt;li&gt;choosing the most recent value;&lt;/li&gt;
&lt;li&gt;comparing dates;&lt;/li&gt;
&lt;li&gt;normalizing code, URLs, or currencies;&lt;/li&gt;
&lt;li&gt;deciding whether an amount is paid, planned, canceled, or merely mentioned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where plugins come in.&lt;/p&gt;

&lt;p&gt;In this context, a plugin is a specialized piece that interprets the
evidence the kernel has retrieved. For example: detecting amounts, adding up
money, comparing dates, counting entities, recognizing URLs, identifying code,
or resolving which value is the most recent.&lt;/p&gt;

&lt;p&gt;The reason for introducing plugins is not to win a specific benchmark. It
is to be able to adapt the memory to different use cases without pushing all
those rules into KMP's core.&lt;/p&gt;

&lt;p&gt;I do not want to contaminate the kernel with logic specific to a
benchmark, to money, to dates, to preferences, or to any other domain. The
kernel must remain use-case agnostic: it stores memory, relations, time,
evidence, and traces. Specialized interpretation must live outside.&lt;/p&gt;

&lt;p&gt;The kernel must retrieve memory and evidence reliably. Plugins and readers
can then work on that evidence to solve domain operations.&lt;/p&gt;

&lt;p&gt;The separation is this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kernel -&amp;gt; memoria, recorrido, prueba e inspección
plugins -&amp;gt; extracción de valores tipados y operaciones deterministas
lector -&amp;gt; construcción de respuesta y política de tarea
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
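&lt;p&gt;As a sketch, a domain plugin interprets evidence the kernel retrieved without that logic ever entering the core. The money-summing plugin below is an illustrative assumption:&lt;/p&gt;

```python
# Sketch of a domain plugin: extract typed dollar amounts from evidence
# the kernel retrieved and sum them deterministically. The plugin shape
# is an assumption; the kernel itself stays domain-agnostic.
import re

def money_plugin(evidence_lines: list[str]) -> float:
    """Extract dollar amounts from evidence text and add them up."""
    total = 0.0
    for line in evidence_lines:
        for amount in re.findall(r"\$(\d+(?:\.\d+)?)", line):
            total += float(amount)
    return total

evidence = ["paid $120.50 for hosting", "paid $30 for the domain"]
total = money_plugin(evidence)
```

&lt;p&gt;The reader can then use this typed total when building the final answer, while the kernel only guarantees that the evidence lines were the right ones.&lt;/p&gt;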



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qvy7knyt67iqovt7q4s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qvy7knyt67iqovt7q4s.jpg" alt="Memory read flow in KMP" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4. KMP retrieves traceable evidence. Plugins interpret typed
values and the reader builds the final answer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This distinction is central.&lt;/p&gt;

&lt;p&gt;Underpass KMP must not become a solution tailor-made for a benchmark or
for a specific domain. It must do its own part well: retrieving memory,
evidence, and relations reliably so that readers, plugins, and future
specialized models can build on top.&lt;/p&gt;
&lt;h2&gt;Why it matters for agents&lt;/h2&gt;

&lt;p&gt;An agent's memory should not exist only to answer a question by looking
at old conversations.&lt;/p&gt;

&lt;p&gt;The really interesting part appears when an AI works across several steps:
it tests a hypothesis, uses tools, makes mistakes, corrects course, receives
new information, and finally reaches a solution. There, memory is not a text
archive. It is the navigable record of how something was solved.&lt;/p&gt;

&lt;p&gt;With a memory like that, a person or a model can go back over the process
and ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what was known before a decision was made;&lt;/li&gt;
&lt;li&gt;which solution attempt failed;&lt;/li&gt;
&lt;li&gt;what new data changed the course;&lt;/li&gt;
&lt;li&gt;which agent introduced a wrong assumption;&lt;/li&gt;
&lt;li&gt;why a later answer replaced an earlier one;&lt;/li&gt;
&lt;li&gt;what sequence of steps led to the final solution;&lt;/li&gt;
&lt;li&gt;what evidence justifies the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where multidimensional, temporal memory becomes useful. Each agent
can be a dimension. Each session, task, entity, attempt, or phase of the work
can be another. Time makes it possible to cut across them and understand how
the state of the process changed.&lt;/p&gt;

&lt;p&gt;The graph is not a decorative visualization. It is the shape of the
process: what happened, in what order, connected to what, and why.&lt;/p&gt;
&lt;h2&gt;Observability is not optional&lt;/h2&gt;

&lt;p&gt;If agent memory is infrastructure, it must be observable.&lt;/p&gt;

&lt;p&gt;I need to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether a write actually became queryable;&lt;/li&gt;
&lt;li&gt;how long the projection took;&lt;/li&gt;
&lt;li&gt;what scope a query used;&lt;/li&gt;
&lt;li&gt;how many references were inspected;&lt;/li&gt;
&lt;li&gt;whether &lt;code&gt;trace&lt;/code&gt; pagination worked;&lt;/li&gt;
&lt;li&gt;whether the proof was complete;&lt;/li&gt;
&lt;li&gt;whether a reader ignored correct evidence;&lt;/li&gt;
&lt;li&gt;whether a writer created rich, anemic, or suspicious relations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the kernel records structured KMP and MCP logs, OTel metrics
for KMP calls, projection processing latency, relation quality metrics, and
explicit &lt;code&gt;inspect&lt;/code&gt; and &lt;code&gt;trace&lt;/code&gt; behavior.&lt;/p&gt;

&lt;p&gt;The operational goal is simple:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Una respuesta fallida de un agente debe poder clasificarse.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The possible classes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingestion gap;&lt;/li&gt;
&lt;li&gt;projection gap;&lt;/li&gt;
&lt;li&gt;retrieval gap;&lt;/li&gt;
&lt;li&gt;proof gap;&lt;/li&gt;
&lt;li&gt;reader consumption gap;&lt;/li&gt;
&lt;li&gt;task reasoning gap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that classification, every failure looks the same: "the AI got it
wrong." That is not enough for agents in production.&lt;/p&gt;
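&lt;p&gt;The classification can be sketched as walking the pipeline stage by stage and stopping at the first broken one. The diagnostic flags are assumptions; the gap classes are the ones listed above:&lt;/p&gt;

```python
# Sketch of the failure taxonomy: classify a failed answer by walking the
# pipeline stage by stage. The diagnostic flags are assumptions; the gap
# classes follow the article.
def classify_failure(d: dict) -> str:
    if not d.get("ingested"):
        return "ingestion gap"
    if not d.get("projected"):
        return "projection gap"
    if not d.get("retrieved"):
        return "retrieval gap"
    if not d.get("proof_complete"):
        return "proof gap"
    if not d.get("reader_used_evidence"):
        return "reader consumption gap"
    return "task reasoning gap"

# Evidence was retrieved with a complete proof, but the reader ignored it:
gap = classify_failure({"ingested": True, "projected": True,
                        "retrieved": True, "proof_complete": True,
                        "reader_used_evidence": False})
```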
&lt;h2&gt;Security and auditability&lt;/h2&gt;

&lt;p&gt;A navigable memory can also be a sensitive memory.&lt;/p&gt;

&lt;p&gt;If the system can reconstruct what happened, who said it, what decision
was made, and what evidence justified it, then it must also control very
carefully who can see each thing and at what level of detail.&lt;/p&gt;

&lt;p&gt;Asking for a summary is not the same as asking for the raw memory.
Querying the current case is not the same as crossing memory from many cases.
And it is not acceptable for logs or traces to end up exposing secrets,
credentials, full prompts, or content that did not need to come out.&lt;/p&gt;

&lt;p&gt;That is why KMP treats security and auditing as part of the design, not as
an afterthought:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API boundaries are typed;&lt;/li&gt;
&lt;li&gt;reads have explicit scope;&lt;/li&gt;
&lt;li&gt;raw inspection is a deliberate option;&lt;/li&gt;
&lt;li&gt;errors fail fast instead of triggering a silent fallback;&lt;/li&gt;
&lt;li&gt;references, evidence, and relations are designed to be auditable;&lt;/li&gt;
&lt;li&gt;TLS/mTLS is used at the infrastructure boundaries that support it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is for a person to be able to review why the system returned an
answer without having to open the entire memory, while the system, at the
same time, exposes no more information than necessary.&lt;/p&gt;
&lt;h2&gt;What Underpass KMP promises&lt;/h2&gt;

&lt;p&gt;Before talking about results, it is worth making clear what KMP promises
and what it does not try to solve.&lt;/p&gt;

&lt;p&gt;Underpass KMP is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a general replacement for a vector database;&lt;/li&gt;
&lt;li&gt;a final-answer generator;&lt;/li&gt;
&lt;li&gt;a solution tailor-made for benchmarks;&lt;/li&gt;
&lt;li&gt;a hidden agent framework;&lt;/li&gt;
&lt;li&gt;a guarantee that any model will interpret the evidence correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a deterministic, auditable memory layer. Its job is to preserve
enough structure so that people, agents, plugins, readers, and future
specialized models can work on the memory without rereading everything from
scratch.&lt;/p&gt;
&lt;h2&gt;
  
  
  Benchmarks: What I Learned
&lt;/h2&gt;

&lt;p&gt;I have been careful not to claim more than the current evidence supports.&lt;/p&gt;

&lt;p&gt;The most important early result is not "the kernel wins every memory&lt;br&gt;
benchmark". What matters is that the kernel makes visible a boundary that used&lt;br&gt;
to be blurry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Did memory retrieval fail, or did the reader fail while using correct evidence?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;In a public-TLS MemoryArena run of 100 tasks, with progressive search and the&lt;br&gt;
smart-writer, the kernel achieved:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Métrica&lt;/th&gt;
&lt;th&gt;Resultado&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Eventos KMP correctos&lt;/td&gt;
&lt;td&gt;2259/2259&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consultas known-at-clean&lt;/td&gt;
&lt;td&gt;753/753&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-ref recall&lt;/td&gt;
&lt;td&gt;753/753&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fugas de respuestas futuras&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Score local alineado con el paper&lt;/td&gt;
&lt;td&gt;97/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fallos finales&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 3 final failures were classified as reader answer-selection failures over&lt;br&gt;
complete evidence, not as kernel retrieval failures or graph contamination.&lt;/p&gt;

&lt;p&gt;In a realistic MemoryArena 2x/domain slice, the kernel achieved:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Métrica&lt;/th&gt;
&lt;th&gt;Resultado&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Eventos KMP correctos&lt;/td&gt;
&lt;td&gt;221/221&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consultas known-at-clean&lt;/td&gt;
&lt;td&gt;73/73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-ref recall&lt;/td&gt;
&lt;td&gt;73/73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fugas futuras&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Referencias inesperadas&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Referencias perdidas&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining task failures were reader or agent gaps, not evidence gaps.&lt;/p&gt;

&lt;p&gt;LongMemEval taught a different lesson. In a 30-item multi-session smart-writer&lt;br&gt;
slice, the retrieved evidence was complete, but the same evidence scored&lt;br&gt;
differently depending on the reader:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reader&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;22/30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;25/30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In a 100-item test using an external model for embeddings and derivations, the&lt;br&gt;
boundary showed up again:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Medida&lt;/th&gt;
&lt;th&gt;Resultado&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Recall amplio de evidencia&lt;/td&gt;
&lt;td&gt;~99%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QA agregado multi-session oficial end-to-end&lt;/td&gt;
&lt;td&gt;71,7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining failures were mostly structured-operand problems: missed&lt;br&gt;
counting predicates, omitted qualifying evidence, or comparison errors.&lt;/p&gt;

&lt;p&gt;To me, that information is valuable.&lt;/p&gt;

&lt;p&gt;It tells me that the next improvement is not hiding more logic inside the&lt;br&gt;
kernel. The next improvement lies in retrieving candidates better, reordering&lt;br&gt;
them with a reranker, extracting typed operands, and using reusable domain&lt;br&gt;
plugins.&lt;/p&gt;
&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;The next step is to keep validating the idea with real cases and to make the&lt;br&gt;
kernel easier to use.&lt;/p&gt;

&lt;p&gt;In the short term, the work is practical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stronger runs on MemoryArena and MemoryAgentBench;&lt;/li&gt;
&lt;li&gt;official-style LongMemEval regression as a secondary benchmark;&lt;/li&gt;
&lt;li&gt;hybrid candidate retrieval behind ports;&lt;/li&gt;
&lt;li&gt;reranking experiments;&lt;/li&gt;
&lt;li&gt;visual graph and timeline exploration for traversing memory;&lt;/li&gt;
&lt;li&gt;better proof and traversal observability;&lt;/li&gt;
&lt;li&gt;stable pagination, limits, and scopes in KMP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the medium term, the direction gets more interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a small model specialized in operating kernel tools, trained on audited MCP
trajectories;&lt;/li&gt;
&lt;li&gt;process queries such as &lt;code&gt;known_at&lt;/code&gt;, &lt;code&gt;why&lt;/code&gt;, &lt;code&gt;failed_paths&lt;/code&gt;,
&lt;code&gt;final_path&lt;/code&gt;, and &lt;code&gt;best_path&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;reusable interpretation plugins for money, dates, counts, URLs, code, and
domain-specific operators;&lt;/li&gt;
&lt;li&gt;conformance tests so that kernel semantics stay independent of the storage
backend;&lt;/li&gt;
&lt;li&gt;public visual demos that reproduce an agent process as a graph and a
timeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operator model seems especially important to me. It would not be a general&lt;br&gt;
agent, nor a model that "understands memory" magically. It would be a small&lt;br&gt;
specialist trained to use KMP efficiently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Which tool should I call now?
With which scoped arguments?
Should I inspect, trace, move through time, or stop?
Which references prove that I have enough evidence?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is a narrow, measurable problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Product Thesis
&lt;/h2&gt;

&lt;p&gt;The thesis behind Underpass KMP is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reliable agents need memory they can navigate, not just context they can
retrieve.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That memory must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoped by what it is about;&lt;/li&gt;
&lt;li&gt;split into meaningful dimensions;&lt;/li&gt;
&lt;li&gt;traversable through time;&lt;/li&gt;
&lt;li&gt;connected by honest relations;&lt;/li&gt;
&lt;li&gt;backed by evidence;&lt;/li&gt;
&lt;li&gt;inspectable by people;&lt;/li&gt;
&lt;li&gt;usable through tools by LLMs;&lt;/li&gt;
&lt;li&gt;observable and auditable in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why I am building Kernel Memory Protocol: so that an agent's memory is&lt;br&gt;
not just accumulated text, but a structure that can be traversed, inspected,&lt;br&gt;
and reused.&lt;/p&gt;

&lt;p&gt;This is not about making prompts longer. It is the opposite: rebuilding the&lt;br&gt;
useful context without forcing the model to read all the raw material, and&lt;br&gt;
making token consumption smart, measurable, and auditable.&lt;/p&gt;

&lt;p&gt;The goal is to turn agent memory into a real working layer.&lt;/p&gt;

&lt;p&gt;If you are interested in this line of work, you can check out the&lt;br&gt;
&lt;a href="https://github.com/underpass-ai/rehydration-kernel" rel="noopener noreferrer"&gt;Underpass KMP&lt;/a&gt; repository. And if&lt;br&gt;
you find it useful, a star on GitHub helps give the project visibility.&lt;/p&gt;




&lt;p&gt;Written by &lt;a href="https://github.com/tgarciai" rel="noopener noreferrer"&gt;Tirso García Ibáñez&lt;/a&gt; ·&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/tirsogarcia/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; ·&lt;br&gt;
&lt;a href="https://github.com/underpass-ai" rel="noopener noreferrer"&gt;Underpass AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Underpass KMP is part of the Underpass AI project. The repository is licensed&lt;br&gt;
under the &lt;a href="https://github.com/underpass-ai/rehydration-kernel/blob/main/LICENSE" rel="noopener noreferrer"&gt;Apache License 2.0&lt;/a&gt;,&lt;br&gt;
unless otherwise noted.&lt;/p&gt;

&lt;p&gt;Copyright © 2026 Tirso García Ibáñez.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building Kernel Memory Protocol: Navigable Memory for AI Agents</title>
      <dc:creator>Tirso García</dc:creator>
      <pubDate>Sun, 10 May 2026 14:20:36 +0000</pubDate>
      <link>https://dev.to/tirsogarcia/building-kernel-memory-protocol-navigable-memory-for-ai-agents-315j</link>
      <guid>https://dev.to/tirsogarcia/building-kernel-memory-protocol-navigable-memory-for-ai-agents-315j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Versión en español: &lt;a href="https://dev.to/tirsogarcia/construyendo-kernel-memory-protocol-memoria-navegable-para-agentes-de-ia-24lc"&gt;Construyendo Kernel Memory Protocol: memoria navegable para agentes de IA&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The hard part with many AI agents is not the amount of text in the prompt.&lt;br&gt;
The hard part is that they do not have memory they can query, traverse, and&lt;br&gt;
audit.&lt;/p&gt;

&lt;p&gt;Most current approaches try to solve this in one of three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;copying parts of previous conversations into the next prompt;&lt;/li&gt;
&lt;li&gt;searching similar chunks with embeddings;&lt;/li&gt;
&lt;li&gt;letting an agent framework store memory internally, often in a way that is
hard to inspect, replay, or explain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those approaches help, but they are not enough when an agent is doing real&lt;br&gt;
work. At that point, retrieving text is not the whole problem. You need to&lt;br&gt;
reconstruct the process.&lt;/p&gt;

&lt;p&gt;The important questions become different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did the agent know when it made a decision?&lt;/li&gt;
&lt;li&gt;Which solution attempts did it try?&lt;/li&gt;
&lt;li&gt;Which attempt failed, and why?&lt;/li&gt;
&lt;li&gt;What new information changed the direction of the work?&lt;/li&gt;
&lt;li&gt;Which sequence of steps led to the final answer?&lt;/li&gt;
&lt;li&gt;Which evidence supports a decision or answer?&lt;/li&gt;
&lt;li&gt;Can a person review that evidence without reading the whole raw
conversation?&lt;/li&gt;
&lt;li&gt;Can another model navigate the same memory without knowing how it is stored
underneath?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Underpass KMP started with a smaller goal: recovering only the context an agent&lt;br&gt;
needed to continue a task without rereading the whole previous conversation. I&lt;br&gt;
called that context rehydration: taking already recorded memory and rebuilding&lt;br&gt;
only the useful part for the next step.&lt;/p&gt;

&lt;p&gt;The more I tested it, the clearer the real problem became. This was not about&lt;br&gt;
making better prompts. I needed a memory layer that could record what happened,&lt;br&gt;
when it happened, who produced it, what evidence supported it, and how it could&lt;br&gt;
be traversed later.&lt;/p&gt;

&lt;p&gt;That is where Kernel Memory Protocol, or KMP, comes from: a small, explicit API&lt;br&gt;
for writing, querying, traversing, tracing, and inspecting agent memory.&lt;/p&gt;
&lt;h2&gt;
  
  
  From Searching Chunks to Navigating Memory
&lt;/h2&gt;

&lt;p&gt;The first mistake was treating memory as if it were just search.&lt;/p&gt;

&lt;p&gt;A search system can return text that looks similar to the question you just&lt;br&gt;
asked. That is useful for finding isolated information, but it is not enough to&lt;br&gt;
understand a work process.&lt;/p&gt;

&lt;p&gt;When an agent solves a task, the key question is not only which sentence looks&lt;br&gt;
similar. The key question is what happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what information the agent had when it made a decision;&lt;/li&gt;
&lt;li&gt;which solution attempts it tried;&lt;/li&gt;
&lt;li&gt;which attempt failed, and why;&lt;/li&gt;
&lt;li&gt;which new data changed the direction of the work;&lt;/li&gt;
&lt;li&gt;which sequence of steps led to the final result;&lt;/li&gt;
&lt;li&gt;which evidence supports each conclusion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction shaped the kernel. I did not want to build another mechanism&lt;br&gt;
for searching text. I wanted navigable memory.&lt;/p&gt;

&lt;p&gt;That is why KMP does not expose a vector database API. It exposes memory&lt;br&gt;
operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingest   -&amp;gt; record memory
wake     -&amp;gt; recover the state needed to continue
ask      -&amp;gt; query memory with evidence
goto     -&amp;gt; move to a specific moment or reference
near     -&amp;gt; inspect what happened around a moment or reference
rewind   -&amp;gt; move backward
forward  -&amp;gt; move forward
trace    -&amp;gt; explain a relation path
inspect  -&amp;gt; inspect a memory node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The central system is intentionally small. KMP is not trying to be the agent,&lt;br&gt;
and it is not responsible for deciding the final answer. Its job is to store&lt;br&gt;
structured memory, make that memory traversable in a deterministic way, and&lt;br&gt;
return evidence that can be audited.&lt;/p&gt;

&lt;p&gt;Answer generation, business rules, and domain plugins can live around KMP&lt;br&gt;
without being pushed into the memory protocol itself.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Mental Model
&lt;/h2&gt;

&lt;p&gt;The central object in Underpass KMP is an &lt;code&gt;about&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;about&lt;/code&gt; is the case, topic, or memory world being worked on. It can be an&lt;br&gt;
incident, a task, a customer, a benchmark case, a repository, a user, or a&lt;br&gt;
long-running agent process.&lt;/p&gt;

&lt;p&gt;Inside that &lt;code&gt;about&lt;/code&gt;, memory does not need to live on a single line. It can be&lt;br&gt;
split into dimensions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about
  dimension: session
  dimension: agent
  dimension: task
  dimension: entity
  dimension: preference
  dimension: attempt
  dimension: incident_phase
  dimension: success_path
  dimension: failure_path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A dimension can represent a session, an agent, a task, an entity, a solution&lt;br&gt;
attempt, or a phase of the process.&lt;/p&gt;

&lt;p&gt;Time is not just another dimension.&lt;/p&gt;

&lt;p&gt;Time is what lets you ask what was known before a step, what changed after it,&lt;br&gt;
or which information did not exist yet when a decision was made.&lt;/p&gt;

&lt;p&gt;The mental model is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about -&amp;gt; the case or memory world
dimensions -&amp;gt; memory planes inside that case
time -&amp;gt; the temporal axis crossing those planes
relations -&amp;gt; why two memory items are connected
evidence -&amp;gt; proof attached to memory
provenance -&amp;gt; who observed or wrote it, and when
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visually, KMP memory looks more like this than like a list of messages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytr3ihasjuwcwou0e7ab.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytr3ihasjuwcwou0e7ab.jpg" alt="IKMP multidimensional memory crossed by time" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1. A single &lt;code&gt;about&lt;/code&gt; can contain several dimensions crossed by time.&lt;br&gt;
Blue arrows are semantic relations; dashed arrows show continuity inside a&lt;br&gt;
dimension.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This matters because agent memory is rarely linear.&lt;/p&gt;

&lt;p&gt;A long task can involve several agents. Each agent can have its own session.&lt;br&gt;
Each session can produce hypotheses, failed attempts, tool results, and final&lt;br&gt;
decisions. A useful memory layer must let you look at one dimension, several&lt;br&gt;
dimensions, or the whole case, while making the query scope explicit every&lt;br&gt;
time.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Dimensions Need Namespaces
&lt;/h2&gt;

&lt;p&gt;One important implementation decision was making &lt;code&gt;about&lt;/code&gt; act as the namespace&lt;br&gt;
for dimensions.&lt;/p&gt;

&lt;p&gt;When a client ingests memory, &lt;code&gt;IngestRequest.about&lt;/code&gt; defines the default scope.&lt;br&gt;
Internally, the real identity of a dimension is equivalent to something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about:&amp;lt;about&amp;gt;:dimension:&amp;lt;dimension_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This may look like a small detail, but it prevents important mistakes.&lt;/p&gt;

&lt;p&gt;If two different tasks both have a dimension called &lt;code&gt;session:1&lt;/code&gt;, I do not want&lt;br&gt;
them to be mixed by accident. Once the dimension lives inside its &lt;code&gt;about&lt;/code&gt;, each&lt;br&gt;
&lt;code&gt;session:1&lt;/code&gt; belongs to the case it was created for.&lt;/p&gt;
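The namespacing rule above can be sketched directly from the identity format shown earlier. The helper function is illustrative, not the real kernel code:

```python
# Sketch of the about-as-namespace rule: the internal identity of a
# dimension is composed from its about, following the format
# about:<about>:dimension:<dimension_id> described in the article.
# The function name is hypothetical.

def dimension_key(about: str, dimension_id: str) -> str:
    """Compose the internal identity of a dimension inside its about."""
    return f"about:{about}:dimension:{dimension_id}"

# The same local dimension id yields distinct identities per case,
# so two cases with a "session:1" dimension can never mix by accident.
incident = dimension_key("incident-42", "session:1")
task = dimension_key("task-7", "session:1")

print(incident)          # about:incident-42:dimension:session:1
print(incident == task)  # False
```

Collisions become impossible by construction, instead of being something each caller has to remember to avoid.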

&lt;p&gt;Reads are explicit too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CURRENT_ABOUT&lt;/code&gt; queries the current case;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ABOUTS&lt;/code&gt; queries a concrete list of cases;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ALL_ABOUTS&lt;/code&gt; queries all cases, but only when the caller asks for that
intentionally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a caller asks for &lt;code&gt;ABOUTS&lt;/code&gt; without providing the list of cases, the kernel&lt;br&gt;
rejects the request. If a caller asks for &lt;code&gt;ALL_ABOUTS&lt;/code&gt;, the request is clearly&lt;br&gt;
global and can be audited as such.&lt;/p&gt;
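A minimal sketch of that read-scope rule, using the scope names from the article (`CURRENT_ABOUT`, `ABOUTS`, `ALL_ABOUTS`); the validation function itself is hypothetical, not the kernel's actual implementation:

```python
# Resolve a read scope to the set of cases it covers, failing fast on
# under-specified requests as described in the article.

def resolve_scope(scope, current_about, abouts=None):
    if scope == "CURRENT_ABOUT":
        # Scoped to the current case only.
        return [current_about]
    if scope == "ABOUTS":
        if not abouts:
            # Fail fast: a multi-case query must name its cases explicitly.
            raise ValueError("ABOUTS requires a non-empty list of cases")
        return list(abouts)
    if scope == "ALL_ABOUTS":
        # Explicitly global; callers opt in, and the request is auditable.
        return "*"
    raise ValueError(f"unknown scope: {scope}")

print(resolve_scope("CURRENT_ABOUT", "incident-42"))  # ['incident-42']
```

The design choice is that no scope value silently widens a query; every cross-case read is either named or explicitly global.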

&lt;p&gt;The reason is simple: a query that looked scoped to one case should not&lt;br&gt;
silently end up mixing memory from other cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  Protocol First, Tools Second
&lt;/h2&gt;

&lt;p&gt;MCP is a useful way for a model to call tools. For example, it lets an LLM use&lt;br&gt;
operations such as &lt;code&gt;kernel_ask&lt;/code&gt;, &lt;code&gt;kernel_near&lt;/code&gt;, &lt;code&gt;kernel_trace&lt;/code&gt;, and&lt;br&gt;
&lt;code&gt;kernel_inspect&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is valuable, but I did not want MCP to define how memory works.&lt;/p&gt;

&lt;p&gt;The rule belongs in a more stable place: KMP. In the current implementation,&lt;br&gt;
the same operations are exposed through the typed gRPC service&lt;br&gt;
&lt;code&gt;KernelMemoryService&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Separating those layers has a practical benefit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an LLM can use KMP through MCP tools;&lt;/li&gt;
&lt;li&gt;an application can call the gRPC service directly;&lt;/li&gt;
&lt;li&gt;a future HTTP API or SDK can expose the same behavior;&lt;/li&gt;
&lt;li&gt;all of those entry points must mean the same thing when they ask, traverse,
trace, or inspect memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project follows a hexagonal architecture for exactly this reason: entry&lt;br&gt;
points can change without changing the memory semantics. gRPC is the main API.&lt;br&gt;
MCP is the agent-facing entry point: the way to expose the same operations to&lt;br&gt;
an AI model as tools it can use without ambiguity.&lt;/p&gt;

&lt;p&gt;I have been careful about keeping MCP and gRPC in parity. Both entry points&lt;br&gt;
must respect the same behavior. If a REST API, SDK, or another integration is&lt;br&gt;
added later, it should become another entry point into the same protocol, not a&lt;br&gt;
different version of memory.&lt;/p&gt;

&lt;p&gt;The principle is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KMP defines memory semantics.
gRPC, MCP, HTTP, SDKs, and CLIs are ways to use those semantics.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The separation looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxak5o706rka8tarkijk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxak5o706rka8tarkijk.jpg" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2. MCP, gRPC, and future entry points operate over the same memory&lt;br&gt;
semantics defined by KMP.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Time Is Not Just Another Filter
&lt;/h2&gt;

&lt;p&gt;Useful memory is not only about what was said. It also matters when it was said&lt;br&gt;
and in which order the information appeared.&lt;/p&gt;

&lt;p&gt;An answer can be valid with the information available at one moment and become&lt;br&gt;
obsolete later. A decision can be reasonable before a tool result arrives and&lt;br&gt;
wrong once new data appears. Even a failed attempt can be useful if it explains&lt;br&gt;
why a different solution was chosen afterwards.&lt;/p&gt;

&lt;p&gt;That is why KMP does not treat time as a secondary filter. It makes time part&lt;br&gt;
of memory navigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;goto&lt;/code&gt; moves to a concrete moment or reference;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;near&lt;/code&gt; shows what happened around it;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rewind&lt;/code&gt; moves backward;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;forward&lt;/code&gt; moves forward;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;trace&lt;/code&gt; explains a path of relations and evidence;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inspect&lt;/code&gt; exposes the details of a node.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With that, you do not need to ask an LLM to reread a huge conversation and&lt;br&gt;
guess what happened. A person or a model can move through memory with explicit,&lt;br&gt;
reproducible operations.&lt;/p&gt;

&lt;p&gt;For a person, the process becomes inspectable. For an AI model, memory becomes&lt;br&gt;
something it can operate through tools.&lt;/p&gt;
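The temporal operations above can be sketched as a cursor over an ordered timeline. The operation names mirror the article; the cursor class itself is a hypothetical illustration, not the real KMP implementation:

```python
# Deterministic temporal navigation over an ordered list of memory events.
class TimelineCursor:
    def __init__(self, events):
        self.events = events
        self.pos = 0

    def goto(self, index):
        """Move to a concrete moment."""
        self.pos = index
        return self.events[self.pos]

    def near(self, radius=1):
        """Show what happened around the current moment."""
        lo = max(0, self.pos - radius)
        return self.events[lo : self.pos + radius + 1]

    def rewind(self):
        """Move backward one step."""
        self.pos = max(0, self.pos - 1)
        return self.events[self.pos]

    def forward(self):
        """Move forward one step."""
        self.pos = min(len(self.events) - 1, self.pos + 1)
        return self.events[self.pos]

cursor = TimelineCursor(["hypothesis", "tool_result", "failed_attempt", "decision"])
print(cursor.goto(2))   # failed_attempt
print(cursor.near())    # ['tool_result', 'failed_attempt', 'decision']
print(cursor.rewind())  # tool_result
```

Because each step is an explicit operation rather than a free-form reread, the same traversal can be replayed and audited later.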
&lt;h2&gt;
  
  
  Writing Memory Well Is the Hard Part
&lt;/h2&gt;

&lt;p&gt;All of the above depends on one condition: the memory must be written well.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;goto&lt;/code&gt;, &lt;code&gt;near&lt;/code&gt;, &lt;code&gt;rewind&lt;/code&gt;, &lt;code&gt;forward&lt;/code&gt;, &lt;code&gt;trace&lt;/code&gt;, and &lt;code&gt;inspect&lt;/code&gt; are only useful if&lt;br&gt;
the stored memory has enough structure. To traverse memory later, you first&lt;br&gt;
need to write it properly.&lt;/p&gt;

&lt;p&gt;Saving unstructured text is not enough. It lets you search for phrases later,&lt;br&gt;
but it does not reconstruct the process very well: which step depended on&lt;br&gt;
another, which decision corrected an earlier one, which evidence supported a&lt;br&gt;
conclusion, or which attempt was discarded.&lt;/p&gt;

&lt;p&gt;That is why writing is as important as reading.&lt;/p&gt;

&lt;p&gt;Writing memory in KMP means recording entries, relations, evidence, dimensions,&lt;br&gt;
and time. It also means deciding how a new piece of memory connects to what was&lt;br&gt;
already there.&lt;/p&gt;

&lt;p&gt;This is an important boundary. The kernel is not responsible for inference.&lt;br&gt;
Inference belongs to whoever uses it: a person, an agent, a model, or an&lt;br&gt;
adapter.&lt;/p&gt;

&lt;p&gt;Writing to KMP is not just adding text. The writer also has to say which prior&lt;br&gt;
memory the text connects to, and why it connects there. That relation is part&lt;br&gt;
of the memory, not a secondary detail. The kernel should validate what is&lt;br&gt;
written and make it traversable; it should not invent the meaning of what&lt;br&gt;
happened.&lt;/p&gt;

&lt;p&gt;I call the piece that writes memory the writer. It can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a person;&lt;/li&gt;
&lt;li&gt;an agent;&lt;/li&gt;
&lt;li&gt;a model using MCP;&lt;/li&gt;
&lt;li&gt;a benchmark adapter;&lt;/li&gt;
&lt;li&gt;a future specialist model trained to write memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The writer decides why a new entry connects to previous memory. The kernel&lt;br&gt;
checks that the relation is valid, scoped correctly, backed by evidence, and&lt;br&gt;
auditable later.&lt;/p&gt;

&lt;p&gt;The write flow looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n3n7vs51wirfuqeejaa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n3n7vs51wirfuqeejaa.jpg" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3. The writer decides meaning and relations. KMP validates what is&lt;br&gt;
written, but it does not infer meaning on its own.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That separation led to two write paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kernel_ingest       -&amp;gt; canonical low-level write path
kernel_write_memory -&amp;gt; writer helper that ultimately compiles to ingest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kernel_ingest&lt;/code&gt; is the strict entry point. It receives already structured&lt;br&gt;
memory.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kernel_write_memory&lt;/code&gt; is more convenient for a writer. It lets the writer&lt;br&gt;
express a new entry and its connections, while still validating the quality of&lt;br&gt;
what is about to be written:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relation name;&lt;/li&gt;
&lt;li&gt;semantic class;&lt;/li&gt;
&lt;li&gt;target node reference;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;why&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;evidence;&lt;/li&gt;
&lt;li&gt;context read before writing;&lt;/li&gt;
&lt;li&gt;fallback quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because a memory graph full of vague relations is not very useful.&lt;/p&gt;

&lt;p&gt;If every relation says &lt;code&gt;supports_answer&lt;/code&gt;, the memory is connected, but it does&lt;br&gt;
not explain anything. It does not tell you whether an entry depends on a&lt;br&gt;
previous answer, contradicts it, refines it, replaces it, or merely appears&lt;br&gt;
near it.&lt;/p&gt;

&lt;p&gt;In KMP, relation quality is part of memory quality.&lt;/p&gt;
&lt;h2&gt;
  
  
  Relations Need to Be Honest
&lt;/h2&gt;

&lt;p&gt;There is also the opposite risk: making relations look richer than they are.&lt;/p&gt;

&lt;p&gt;A writer should not create smart-looking edges just to make the graph look&lt;br&gt;
better. If it cannot justify a relation from the context it observed, it should&lt;br&gt;
fall back to a simpler, anemic, or structural relation.&lt;/p&gt;

&lt;p&gt;That fallback is not a failure. It is an honest signal.&lt;/p&gt;

&lt;p&gt;A good memory system must be able to say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I know these nodes are related by order or proximity.
I do not yet know a stronger semantic reason.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives me metrics I can inspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rich relations;&lt;/li&gt;
&lt;li&gt;anemic relations;&lt;/li&gt;
&lt;li&gt;structural relations;&lt;/li&gt;
&lt;li&gt;suspect or rejected relations;&lt;/li&gt;
&lt;li&gt;prior context observed before writing;&lt;/li&gt;
&lt;li&gt;evidence coverage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those metrics give me a practical way to improve the writer without hiding&lt;br&gt;
uncertainty.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Boundary Between Memory and Interpretation
&lt;/h2&gt;

&lt;p&gt;To measure KMP quality, I have mainly been working with two kinds of benchmarks.&lt;/p&gt;

&lt;p&gt;MemoryArena is interesting because it looks closer to the kind of memory I want&lt;br&gt;
to build: multi-step tasks, attempts, feedback, course corrections, and memory&lt;br&gt;
that has to be reused later.&lt;/p&gt;

&lt;p&gt;LongMemEval is interesting for a different reason. It is more conversational,&lt;br&gt;
but it stresses a very useful case: recovering evidence scattered across many&lt;br&gt;
sessions and checking whether the system can use it to answer.&lt;/p&gt;

&lt;p&gt;That comparison made another boundary clear: the same memory layer can support&lt;br&gt;
many use cases, and not all of them need the same kind of interpretation.&lt;/p&gt;

&lt;p&gt;The kernel can retrieve the right evidence, and the final answer can still be&lt;br&gt;
wrong if the reader has to perform domain work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summing money;&lt;/li&gt;
&lt;li&gt;counting entities;&lt;/li&gt;
&lt;li&gt;deduplicating events;&lt;/li&gt;
&lt;li&gt;selecting the latest value;&lt;/li&gt;
&lt;li&gt;comparing dates;&lt;/li&gt;
&lt;li&gt;normalizing code, URLs, or currencies;&lt;/li&gt;
&lt;li&gt;deciding whether an amount is paid, planned, cancelled, or only mentioned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where plugins come in.&lt;/p&gt;

&lt;p&gt;In this context, a plugin is a specialized component that interprets evidence&lt;br&gt;
the kernel has already retrieved. For example: detecting amounts, summing&lt;br&gt;
money, comparing dates, counting entities, recognizing URLs, identifying code,&lt;br&gt;
or resolving the latest value.&lt;/p&gt;

&lt;p&gt;The reason for introducing plugins is not to win a specific benchmark. It is to&lt;br&gt;
adapt memory to different use cases without putting all those rules inside KMP&lt;br&gt;
itself.&lt;/p&gt;

&lt;p&gt;I do not want to contaminate the kernel with logic specific to one benchmark,&lt;br&gt;
money, dates, preferences, or any other domain. The kernel should stay&lt;br&gt;
use-case agnostic: it stores memory, relations, time, evidence, and traces.&lt;br&gt;
Specialized interpretation should live outside it.&lt;/p&gt;

&lt;p&gt;The kernel should retrieve memory and evidence reliably. Plugins and readers&lt;br&gt;
can then work on that evidence to solve domain operations.&lt;/p&gt;

&lt;p&gt;The separation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kernel -&amp;gt; memory, traversal, proof, inspection
plugins -&amp;gt; typed value extraction and deterministic operations
reader -&amp;gt; answer construction and task policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
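&lt;p&gt;As a sketch, that separation can be expressed as a small interface boundary. The names below (&lt;code&gt;Plugin&lt;/code&gt;, &lt;code&gt;Evidence&lt;/code&gt;, &lt;code&gt;TypedValue&lt;/code&gt;) are illustrative, not the real KMP types:&lt;/p&gt;

```go
package main

import "fmt"

// Evidence is a retrieved memory fragment. The field names are
// hypothetical; the real KMP types live inside the kernel.
type Evidence struct {
	Ref  string // reference id used for proof
	Text string // raw content the kernel retrieved
}

// TypedValue is a deterministic, typed result a reader can consume.
type TypedValue struct {
	Kind  string // e.g. "money_total", "latest_date"
	Value float64
	Refs  []string // evidence refs that back the value, for audit
}

// Plugin interprets evidence the kernel has already retrieved.
// It never touches storage or traversal: that stays in the kernel.
type Plugin interface {
	Name() string
	Interpret(evidence []Evidence) (TypedValue, error)
}

// moneySumPlugin is a toy plugin: it sums amounts that appear as
// plain numbers in the evidence text.
type moneySumPlugin struct{}

func (moneySumPlugin) Name() string { return "money-sum" }

func (moneySumPlugin) Interpret(evidence []Evidence) (TypedValue, error) {
	total := 0.0
	refs := make([]string, 0, len(evidence))
	for _, e := range evidence {
		var amount float64
		if _, err := fmt.Sscanf(e.Text, "%f", &amount); err != nil {
			continue // not an amount; a real plugin would be stricter
		}
		total += amount
		refs = append(refs, e.Ref)
	}
	return TypedValue{Kind: "money_total", Value: total, Refs: refs}, nil
}

func main() {
	var p Plugin = moneySumPlugin{}
	v, _ := p.Interpret([]Evidence{{Ref: "r1", Text: "120"}, {Ref: "r2", Text: "80"}})
	fmt.Printf("%s = %.2f (refs: %v)\n", v.Kind, v.Value, v.Refs)
	// → money_total = 200.00 (refs: [r1 r2])
}
```

&lt;p&gt;The point of the boundary is that the kernel never learns what money is, and the plugin never learns how evidence was retrieved.&lt;/p&gt;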



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mwyjktjwserob359jo0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mwyjktjwserob359jo0.jpg" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4. KMP retrieves traceable evidence. Plugins interpret typed values and&lt;br&gt;
the reader builds the final answer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This distinction is central.&lt;/p&gt;

&lt;p&gt;Underpass KMP should not become a custom solution for a benchmark or a single&lt;br&gt;
domain. It should do its part well: recover memory, evidence, and relations&lt;br&gt;
reliably so that readers, plugins, and future specialist models can work on&lt;br&gt;
top.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why This Matters for Agents
&lt;/h2&gt;

&lt;p&gt;Agent memory should not only help answer a user question by looking at old&lt;br&gt;
chat history.&lt;/p&gt;

&lt;p&gt;The more interesting case appears when an AI works through several steps: it&lt;br&gt;
tries a hypothesis, uses tools, makes a mistake, changes direction, receives&lt;br&gt;
new information, and eventually reaches a solution. In that setting, memory is&lt;br&gt;
not a text archive. It is a navigable record of how something was solved.&lt;/p&gt;

&lt;p&gt;With that kind of memory, a person or a model can go back into the process and&lt;br&gt;
ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what was known before a decision was made;&lt;/li&gt;
&lt;li&gt;which solution attempt failed;&lt;/li&gt;
&lt;li&gt;which new data changed the direction of the work;&lt;/li&gt;
&lt;li&gt;which agent introduced a wrong assumption;&lt;/li&gt;
&lt;li&gt;why a later answer replaced an earlier one;&lt;/li&gt;
&lt;li&gt;which sequence of steps led to the final solution;&lt;/li&gt;
&lt;li&gt;which evidence supports the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where multidimensional and temporal memory becomes useful. Each agent&lt;br&gt;
can be a dimension. Each session, task, entity, attempt, or work phase can be&lt;br&gt;
another. Time lets you move across them and understand how the state of the&lt;br&gt;
process changed.&lt;/p&gt;

&lt;p&gt;The graph is not decoration. It is the shape of the process: what happened, in&lt;br&gt;
which order, connected to what, and why.&lt;/p&gt;
&lt;h2&gt;
  
  
  Observability Is Not Optional
&lt;/h2&gt;

&lt;p&gt;If agent memory is infrastructure, it has to be observable.&lt;/p&gt;

&lt;p&gt;I need to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether a write became queryable;&lt;/li&gt;
&lt;li&gt;how long projection took;&lt;/li&gt;
&lt;li&gt;which scope a query used;&lt;/li&gt;
&lt;li&gt;how many references were inspected;&lt;/li&gt;
&lt;li&gt;whether &lt;code&gt;trace&lt;/code&gt; pagination worked;&lt;/li&gt;
&lt;li&gt;whether proof was complete;&lt;/li&gt;
&lt;li&gt;whether a reader ignored correct evidence;&lt;/li&gt;
&lt;li&gt;whether a writer created rich, anemic, or suspect relations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the kernel records structured KMP and MCP logs, OTel metrics for&lt;br&gt;
KMP calls, projection processing latency, relation quality metrics, and&lt;br&gt;
explicit &lt;code&gt;inspect&lt;/code&gt; and &lt;code&gt;trace&lt;/code&gt; behavior.&lt;/p&gt;

&lt;p&gt;The operational goal is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A failed agent answer should be classifiable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Possible classes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingestion gap;&lt;/li&gt;
&lt;li&gt;projection gap;&lt;/li&gt;
&lt;li&gt;retrieval gap;&lt;/li&gt;
&lt;li&gt;proof gap;&lt;/li&gt;
&lt;li&gt;reader consumption gap;&lt;/li&gt;
&lt;li&gt;task reasoning gap.&lt;/li&gt;
&lt;/ul&gt;
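&lt;p&gt;Because the classes follow the pipeline order, classification can be a first-gap walk. A minimal sketch, assuming the observability signals above are reduced to booleans (the struct is hypothetical, not the real telemetry schema):&lt;/p&gt;

```go
package main

import "fmt"

// Checks summarizes what observability answered for one failed run.
// These booleans map to the signals listed above; the struct itself
// is a sketch, not the real KMP telemetry schema.
type Checks struct {
	WriteQueryable    bool // did the write become queryable?
	Projected         bool // did projection finish?
	EvidenceRetrieved bool // did retrieval return the needed refs?
	ProofComplete     bool // was the proof complete?
	ReaderUsedProof   bool // did the reader consume the evidence?
}

// classify walks the pipeline in order and returns the first gap.
func classify(c Checks) string {
	switch {
	case !c.WriteQueryable:
		return "ingestion gap"
	case !c.Projected:
		return "projection gap"
	case !c.EvidenceRetrieved:
		return "retrieval gap"
	case !c.ProofComplete:
		return "proof gap"
	case !c.ReaderUsedProof:
		return "reader consumption gap"
	default:
		return "task reasoning gap"
	}
}

func main() {
	fmt.Println(classify(Checks{WriteQueryable: true, Projected: true,
		EvidenceRetrieved: true, ProofComplete: true, ReaderUsedProof: false}))
	// → reader consumption gap
}
```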

&lt;p&gt;Without that classification, every failure looks the same: "the AI got it&lt;br&gt;
wrong". That is not good enough for production agents.&lt;/p&gt;
&lt;h2&gt;
  
  
  Security and Auditability
&lt;/h2&gt;

&lt;p&gt;Navigable memory can also be sensitive memory.&lt;/p&gt;

&lt;p&gt;If the system can reconstruct what happened, who said it, which decision was&lt;br&gt;
made, and which evidence supported it, then it must also control who can see&lt;br&gt;
each thing and at what level of detail.&lt;/p&gt;

&lt;p&gt;Asking for a summary is not the same as asking for raw memory. Querying the&lt;br&gt;
current case is not the same as crossing memory from many cases. And logs or&lt;br&gt;
traces must not casually expose secrets, credentials, complete prompts, or&lt;br&gt;
content that did not need to leave the system.&lt;/p&gt;

&lt;p&gt;That is why KMP treats security and auditability as part of the design, not as&lt;br&gt;
an afterthought:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API boundaries are typed;&lt;/li&gt;
&lt;li&gt;reads have explicit scope;&lt;/li&gt;
&lt;li&gt;raw inspection is a deliberate option;&lt;/li&gt;
&lt;li&gt;errors fail fast instead of activating silent fallback;&lt;/li&gt;
&lt;li&gt;references, evidence, and relations are designed for audit;&lt;/li&gt;
&lt;li&gt;TLS/mTLS is used on infrastructure boundaries that support it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is that a person can review why the system returned an answer without&lt;br&gt;
opening all memory, while the system avoids exposing more information than&lt;br&gt;
needed.&lt;/p&gt;
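&lt;p&gt;A sketch of what a typed, explicitly scoped read boundary could look like; the scope names and the &lt;code&gt;authorize&lt;/code&gt; check are hypothetical, not the real KMP API:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// ReadScope makes the blast radius of a query explicit. The names are
// illustrative; the real KMP scopes live in the protocol definition.
type ReadScope string

const (
	ScopeCurrentCase ReadScope = "current_case" // one case only
	ScopeCrossCase   ReadScope = "cross_case"   // memory across cases
)

// ReadRequest is a typed boundary: scope and raw inspection are
// deliberate fields, never defaults the caller can forget about.
type ReadRequest struct {
	Scope      ReadScope
	RawInspect bool // true only when raw memory is explicitly needed
}

// authorize fails fast instead of silently widening the scope.
func authorize(r ReadRequest, callerMayCross bool) error {
	if r.Scope == ScopeCrossCase && !callerMayCross {
		return errors.New("cross-case read denied: caller lacks permission")
	}
	return nil
}

func main() {
	err := authorize(ReadRequest{Scope: ScopeCrossCase}, false)
	fmt.Println(err)
}
```

&lt;p&gt;The design choice is that a denial is an explicit error, never a silent fallback to a narrower result.&lt;/p&gt;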
&lt;h2&gt;
  
  
  What Underpass KMP Promises
&lt;/h2&gt;

&lt;p&gt;Before talking about results, it is worth being clear about what KMP promises&lt;br&gt;
and what it does not try to solve.&lt;/p&gt;

&lt;p&gt;Underpass KMP is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a general replacement for a vector database;&lt;/li&gt;
&lt;li&gt;a final answer generator;&lt;/li&gt;
&lt;li&gt;a benchmark-specific solution;&lt;/li&gt;
&lt;li&gt;a hidden agent framework;&lt;/li&gt;
&lt;li&gt;a guarantee that every model will interpret evidence correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a deterministic, auditable memory layer. Its job is to preserve enough&lt;br&gt;
structure for people, agents, plugins, readers, and future specialist models to&lt;br&gt;
work with memory without reading everything again from scratch.&lt;/p&gt;
&lt;h2&gt;
  
  
  Benchmarks: What I Learned
&lt;/h2&gt;

&lt;p&gt;I have been careful not to claim more than the current evidence supports.&lt;/p&gt;

&lt;p&gt;The most important early result is not "the kernel wins every memory&lt;br&gt;
benchmark". The important result is that the kernel makes a previously blurry&lt;br&gt;
boundary visible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Did memory retrieval fail, or did the reader fail to use correct evidence?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;In a MemoryArena public-TLS run with 100 progressive-search tasks and the&lt;br&gt;
smart writer enabled, the kernel reached:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correct KMP events&lt;/td&gt;
&lt;td&gt;2259/2259&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Known-at-clean queries&lt;/td&gt;
&lt;td&gt;753/753&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-ref recall&lt;/td&gt;
&lt;td&gt;753/753&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Future-answer leaks&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local paper-aligned score&lt;/td&gt;
&lt;td&gt;97/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final misses&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 3 final misses were classified as reader answer-selection failures over&lt;br&gt;
complete evidence, not as kernel retrieval failures or graph contamination.&lt;/p&gt;

&lt;p&gt;In a realistic MemoryArena 2x/domain slice, the kernel reached:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correct KMP events&lt;/td&gt;
&lt;td&gt;221/221&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Known-at-clean queries&lt;/td&gt;
&lt;td&gt;73/73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-ref recall&lt;/td&gt;
&lt;td&gt;73/73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Future leaks&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unexpected references&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing references&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining task failures were reader or agent gaps, not evidence gaps.&lt;/p&gt;

&lt;p&gt;LongMemEval taught a different lesson. In a 30-item multi-session smart-writer&lt;br&gt;
slice, the recovered evidence was complete, but the same evidence produced&lt;br&gt;
different results depending on the reader:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reader&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;22/30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;25/30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In a 100-item test using an external embedding model and derivations, the same&lt;br&gt;
boundary appeared again:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Measure&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Broad evidence recall&lt;/td&gt;
&lt;td&gt;~99%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Official multi-session aggregate end-to-end QA&lt;/td&gt;
&lt;td&gt;71.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining failures were mostly structured operand problems: missed count&lt;br&gt;
predicates, omitted qualifying evidence, or comparison mistakes.&lt;/p&gt;

&lt;p&gt;That is useful information.&lt;/p&gt;

&lt;p&gt;It tells me that the next improvement is not to hide more logic inside the&lt;br&gt;
kernel. The next improvement is better candidate retrieval, reranking, typed&lt;br&gt;
operand extraction, and reusable domain plugins.&lt;/p&gt;
&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;The next step is to keep validating the idea with real cases and make the&lt;br&gt;
kernel easier to use.&lt;/p&gt;

&lt;p&gt;In the short term, the work is practical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stronger MemoryArena and MemoryAgentBench runs;&lt;/li&gt;
&lt;li&gt;an official-style LongMemEval regression as a secondary benchmark;&lt;/li&gt;
&lt;li&gt;hybrid candidate retrieval behind ports;&lt;/li&gt;
&lt;li&gt;reranking experiments;&lt;/li&gt;
&lt;li&gt;visual graph and timeline exploration for traversing memory;&lt;/li&gt;
&lt;li&gt;better proof and traversal observability;&lt;/li&gt;
&lt;li&gt;stable pagination, limits, and scopes in KMP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the medium term, the direction becomes more interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a small model specialized in operating kernel tools, trained from audited MCP
trajectories;&lt;/li&gt;
&lt;li&gt;process queries such as &lt;code&gt;known_at&lt;/code&gt;, &lt;code&gt;why&lt;/code&gt;, &lt;code&gt;failed_paths&lt;/code&gt;, &lt;code&gt;final_path&lt;/code&gt;, and
&lt;code&gt;best_path&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;reusable interpretation plugins for money, dates, counts, URLs, code, and
domain-specific operators;&lt;/li&gt;
&lt;li&gt;conformance tests so kernel semantics are independent from the storage
implementation;&lt;/li&gt;
&lt;li&gt;public visual experiences that let people replay an agent process as a graph
and timeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operator model is especially important to me. It would not be a general&lt;br&gt;
agent, and it would not be a magical model that "understands memory". It would&lt;br&gt;
be a small specialist trained to use KMP efficiently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Which tool should I call now?
With which bounded arguments?
Should I inspect, trace, move through time, or stop?
Which references prove that I have enough evidence?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a narrow and measurable problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Product Thesis
&lt;/h2&gt;

&lt;p&gt;The thesis behind Underpass KMP is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reliable agents need memory they can navigate, not just context they can
retrieve.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That memory must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoped by what it is about;&lt;/li&gt;
&lt;li&gt;split into meaningful dimensions;&lt;/li&gt;
&lt;li&gt;traversable through time;&lt;/li&gt;
&lt;li&gt;connected by honest relations;&lt;/li&gt;
&lt;li&gt;backed by evidence;&lt;/li&gt;
&lt;li&gt;inspectable by people;&lt;/li&gt;
&lt;li&gt;usable by LLMs through tools;&lt;/li&gt;
&lt;li&gt;observable and auditable in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why I am building Kernel Memory Protocol: so agent memory is not just&lt;br&gt;
accumulated text, but a structure that can be traversed, inspected, and reused.&lt;/p&gt;

&lt;p&gt;This is not about making prompts longer. It is the opposite: rebuilding the&lt;br&gt;
useful context without forcing the model to read all the raw material, and&lt;br&gt;
making token usage intelligent, measurable, and auditable.&lt;/p&gt;

&lt;p&gt;The goal is to turn agent memory into a real working layer.&lt;/p&gt;

&lt;p&gt;If this direction interests you, you can check the&lt;br&gt;
&lt;a href="https://github.com/underpass-ai/rehydration-kernel" rel="noopener noreferrer"&gt;Underpass KMP repository&lt;/a&gt;.&lt;br&gt;
And if you find it useful, a GitHub star helps give the project visibility.&lt;/p&gt;




&lt;p&gt;Written by &lt;a href="https://github.com/tgarciai" rel="noopener noreferrer"&gt;Tirso García Ibáñez&lt;/a&gt; ·&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/tirsogarcia/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; ·&lt;br&gt;
&lt;a href="https://github.com/underpass-ai" rel="noopener noreferrer"&gt;Underpass AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Underpass KMP is part of the Underpass AI project. The repository is licensed&lt;br&gt;
under the &lt;a href="https://github.com/underpass-ai/rehydration-kernel/blob/main/LICENSE" rel="noopener noreferrer"&gt;Apache License 2.0&lt;/a&gt;,&lt;br&gt;
unless stated otherwise.&lt;/p&gt;

&lt;p&gt;Copyright © 2026 Tirso García Ibáñez.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>What an event-driven agent pipeline looks like when you trace it end-to-end</title>
      <dc:creator>Tirso García</dc:creator>
      <pubDate>Thu, 23 Apr 2026 22:32:49 +0000</pubDate>
      <link>https://dev.to/tirsogarcia/what-an-event-driven-agent-pipeline-looks-like-when-you-trace-it-end-to-end-1cck</link>
      <guid>https://dev.to/tirsogarcia/what-an-event-driven-agent-pipeline-looks-like-when-you-trace-it-end-to-end-1cck</guid>
      <description>&lt;p&gt;In an earlier post I argued that event-driven agents reduce scope, cost, and decision dispersion because they narrow the decision space before the model starts reasoning.&lt;/p&gt;

&lt;p&gt;This article is the empirical follow-up to that idea.&lt;/p&gt;

&lt;p&gt;It does not try to re-argue the thesis. It tries to show what it looks like when the architecture is wired end-to-end, running on a real case and instrumented enough that the behavior of the system can be observed instead of inferred after the fact.&lt;/p&gt;

&lt;p&gt;The point here is not just that a multi-agent pipeline exists. The point is that the pipeline emits a readable operational shape: ingestion, incident opening, specialized hops, differentiated latencies, outcomes by role, and distributed traces that let you follow a concrete execution from the initial event to its close.&lt;/p&gt;

&lt;p&gt;The central image of this article is not a diagram. It is a real trace. In Tempo, a single incident appears as a sequence of spans with different durations, visible overlaps, and a critical path that can be inspected without being reconstructed by hand. For me, that is the important leap: when the system stops being just a designed architecture and starts being an observable architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzxqjc0ujviedhju0tuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzxqjc0ujviedhju0tuj.png" alt="Figure 1. Tempo waterfall for a concrete incident." width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1. Tempo waterfall for a concrete incident. The trace shows the full path of the incident as an observable sequence of spans, durations, and specialized hops.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To anchor the captures to concrete executions, I am working with two real incidents from this run: a CPU saturation on &lt;code&gt;underpass-demo-payments-api&lt;/code&gt; and a latency regression correlated with a recent canary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline emits its own shape
&lt;/h2&gt;

&lt;p&gt;One of the common problems in agent systems is that the architecture is usually explained better than it can be inspected. It is easy to draw boxes, arrows, and role names. It is much less common for those boxes and arrows to leave enough operational evidence to verify that the system actually behaves as designed once real events start coming in.&lt;/p&gt;

&lt;p&gt;This is where telemetry starts to change the conversation.&lt;/p&gt;

&lt;p&gt;The per-specialist rate chart does not show aggregated activity in the abstract. It shows the temporal shape of the pipeline. First &lt;code&gt;ingress&lt;/code&gt; appears, then intermediate hops like &lt;code&gt;routing&lt;/code&gt; and &lt;code&gt;kernelseed&lt;/code&gt; come in, and later the specialists activate with distinct cadences and durations. You are not looking at a list of components in the abstract, but at stages that turn on, overlap, and turn off in an observable order.&lt;/p&gt;

&lt;p&gt;That matters because it turns architecture into emitted behavior. The reader does not have to reconstruct the path from scattered logs or from a retrospective narrative. The system itself shows how the incident moves through differentiated hops, which ones hold the flow longer, and which ones intervene more briefly. The architecture stops being a promise I describe and becomes a measurable temporal sequence.&lt;/p&gt;

&lt;p&gt;In other words: the pipeline no longer lives only in the diagram. It also lives in the time series. And that difference, for me, is an important part of what makes an agent system start to be operationally readable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1sww9qiq2l266y0cbh8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1sww9qiq2l266y0cbh8.png" alt="Figure 2. Rate per specialist during a pipeline execution." width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2. Rate per specialist during a pipeline execution. The temporal activation sequence reveals the operational shape of the system: ingress, routing, context materialization, and specialists with differentiated cadences.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The event narrows the decision space
&lt;/h2&gt;

&lt;p&gt;The core idea of the architecture is still the same as in the previous article: narrow the decision space before reasoning.&lt;/p&gt;

&lt;p&gt;Instead of starting with an open space of context, tools, and possible courses of action, the system starts from an explicit event. That event narrows the problem from the beginning. It does not solve anything on its own, but it constrains what kind of incident we are dealing with, which specialist should intervene first, what context needs to be materialized, and which parts of the system are not yet relevant.&lt;/p&gt;

&lt;p&gt;That matters because much of the cost of agentic systems does not come just from "using an LLM", but from leaving too many decisions open too early. When the model receives an action space that is too wide, it also receives too many opportunities to be wrong, distracted, or over-reasoning.&lt;/p&gt;

&lt;p&gt;Here the event does the initial compression work.&lt;/p&gt;

&lt;p&gt;It does not eliminate the need for reasoning. It makes that reasoning operate within an already-narrowed decision space.&lt;/p&gt;

&lt;p&gt;But the important point here is that this initial narrowing does not stay in the raw event. The system turns it into persisted structure. An alert enters as an operational fact, ingress transforms it into deterministic evidence, and from there the specialists deposit their artifacts — findings, plans, decisions — onto the incident graph, each with an explicit author, revision, and verifiable &lt;code&gt;content_hash&lt;/code&gt;. What circulates between phases stops being loose text or history and becomes structure with traceability.&lt;/p&gt;

&lt;p&gt;That trail is what allows the full materialization cycle to be shown. The specialists do not only write the text of each artifact: they also declare which relations must remain explicit between them inside a shared typed vocabulary. What ends up stored is not a narration but a grammar: what evidence sustains a finding, what finding grounds a plan, and what decision mitigates an incident.&lt;/p&gt;

&lt;p&gt;In other words, the event does not only trigger execution. It also starts to build structured memory.&lt;/p&gt;

&lt;p&gt;The first detail worth showing is precisely that deterministic input artifact. Before any specialist intervenes, ingress leaves something like this in Valkey:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alert_id=article-sat-1776973582
alert_name=PaymentsCPUSaturation
service_name=payments-api
environment=cluster
severity=SEV2
namespace=underpass-runtime
workload_kind=deployment
workload_name=underpass-demo-payments-api
symptom_kind=saturation
symptom_value=cpu=94%
threshold=cpu &amp;gt; 90% for 5m
summary=Payments API CPU saturation
description=Sustained CPU utilization above 90% for over 5 minutes. The workload is horizontally scalable.
runbook_url=https://runbooks.internal/payments/cpu-saturation
dashboard_url=https://grafana.internal/d/payments-cpu
firing_at=2026-04-23T19:46:22Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Real block from evidence_56daf53e-4c97-4344-bbb6-f63cf513ae89_initial_alert.json. &lt;code&gt;content_hash&lt;/code&gt;: &lt;code&gt;sha256:116b246ae0bc4adf269ce29dd76cd794ccde11befa29567e8abdbf988abd3dc0&lt;/code&gt;. &lt;code&gt;revision&lt;/code&gt;: &lt;code&gt;1&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;
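&lt;p&gt;The &lt;code&gt;content_hash&lt;/code&gt; above follows the common &lt;code&gt;sha256:&amp;lt;hex&amp;gt;&lt;/code&gt; convention. A minimal sketch of computing such a hash; how the system canonicalizes content before hashing is not shown here, so this assumes the exact stored bytes are hashed:&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// contentHash returns the "sha256:<hex>" form that makes an artifact
// verifiable: anyone holding the stored bytes can recompute the hash
// and check that the evidence was not altered after the fact.
func contentHash(content []byte) string {
	sum := sha256.Sum256(content)
	return fmt.Sprintf("sha256:%x", sum)
}

func main() {
	// Hypothetical artifact bytes, not the real evidence file.
	artifact := []byte("alert_name=PaymentsCPUSaturation\nseverity=SEV2\n")
	fmt.Println(contentHash(artifact))
}
```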

&lt;p&gt;And that text does not stay isolated. The kernel turns it into typed nodes and relations. For the saturation incident, the shape stored in Neo4j looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs87kmxnyxx5oozh25s0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs87kmxnyxx5oozh25s0p.png" alt="Figure 3. Typed graph stored in Neo4j for the saturation incident." width="800" height="214"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3. Typed graph stored in Neo4j for the saturation incident. Same node kinds, same &lt;code&gt;semantic_class&lt;/code&gt; vocabulary, composed into the 3-in-series shape that this incident type requires.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Not every hop costs the same
&lt;/h2&gt;

&lt;p&gt;Another thing the instrumentation makes clear is that not every hop of the pipeline pays the same cost.&lt;/p&gt;

&lt;p&gt;In the panel, fast hops are distinguishable from model-bound hops. That separation is important because it lets you read the system with more precision. Not every pipeline step needs generative capability, and not every operational cost should be attributed to the model's reasoning.&lt;/p&gt;

&lt;p&gt;Some steps are essentially operational: ingestion, routing, persistence, context retrieval, component coordination. And some steps are tied to LLM-bound specialists, where the real cognitive cost of the system appears.&lt;/p&gt;

&lt;p&gt;Separating both planes changes the conversation. Instead of talking about the pipeline as a single opaque block, you can see which parts of the work are infrastructure, which parts are reasoning, and which parts are coordination between both. It also lets you detect whether the system is spending model capacity where it should not.&lt;/p&gt;

&lt;p&gt;In other words: it is not enough that the pipeline works. It has to be clear which hops consume expensive intelligence and which are simply governed execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh9zaj4tfvclljrx58hq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh9zaj4tfvclljrx58hq.png" alt="Figure 4. p95 latency per hop." width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4. p95 latency per hop. The instrumentation separates LLM-bound specialists from cheaper operational hops, so the cost of the pipeline is not treated as an opaque block.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Specialists are not named prompts
&lt;/h2&gt;

&lt;p&gt;There is a common risk in this space: calling "agents" or "specialists" what are really just prompts with different labels.&lt;/p&gt;

&lt;p&gt;What I want to avoid here is precisely that.&lt;/p&gt;

&lt;p&gt;In this architecture, investigator, planner, and operator are not just semantic names for three nice phases of a demo. They appear as differentiated stages with their own timing, their own outcomes, and bounded responsibility within the incident cycle. The instrumentation lets you see them as operational roles, not just distinct voices generated by the same model.&lt;/p&gt;

&lt;p&gt;That does not mean the problem is solved in general. It means something more concrete and defensible: in this architecture, the specialists leave enough trace that you can inspect what they did, when they intervened, and with what outcome each hop finished. That trace does not only appear in telemetry. It also appears in the &lt;code&gt;node.details&lt;/code&gt; persisted in Valkey, where each actor leaves text with explicit authorship and where the reader can compare deterministic evidence with LLM intervention without the two getting mixed.&lt;/p&gt;

&lt;p&gt;That change seems small, but it is not. In many systems, multi-agent behavior is narrated. Here it starts to be auditable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b486vswa3t1om6b9oe8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b486vswa3t1om6b9oe8.png" alt="Figure 5. Specialist outcomes in the observed window." width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 5. Specialist outcomes in the observed window. The roles leave their own operational trace: they are not just prompts with different names but stages with distinguishable outcomes and responsibility.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Governing what the LLM cannot invent
&lt;/h2&gt;

&lt;p&gt;There is a part of the design I care about separating from reasoning: &lt;strong&gt;what each agent can and cannot decide&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In many agent systems, governance is delegated to the prompt: &lt;em&gt;"do not do X, only do Y"&lt;/em&gt;. That is fragile because it depends on the model obeying. Here I prefer governance to live in layers prior to the model, so that when the LLM gets it wrong — and it does — the system can no longer execute the mistake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The specialist closes the decision space.&lt;/strong&gt; Each specialist exposes a closed enum to the model. The saturation operator can only respond &lt;code&gt;execute&lt;/code&gt;, &lt;code&gt;escalate&lt;/code&gt;, or &lt;code&gt;reject&lt;/code&gt;. The planner picks from five actions: &lt;code&gt;scale_up&lt;/code&gt;, &lt;code&gt;restart_pods&lt;/code&gt;, &lt;code&gt;circuit_break&lt;/code&gt;, &lt;code&gt;escalate&lt;/code&gt;, or &lt;code&gt;not_enough_evidence&lt;/code&gt;. The rollout operator: &lt;code&gt;rollback&lt;/code&gt;, &lt;code&gt;pause_rollout&lt;/code&gt;, &lt;code&gt;escalate&lt;/code&gt;, &lt;code&gt;not_enough_evidence&lt;/code&gt;. That space is not a prompt convention; these are enums declared in the domain catalog (&lt;code&gt;eventsv1.SaturationOperatorDecision&lt;/code&gt;, &lt;code&gt;eventsv1.SaturationAction&lt;/code&gt;, &lt;code&gt;eventsv1.RuntimeRolloutDecision&lt;/code&gt;) with an &lt;code&gt;IsValid&lt;/code&gt; method. The specialist validates the model's response before emitting any event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;eventsv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SaturationOperatorDecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsValid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"llm returned invalid operator decision %q; defaulting to escalate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eventsv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SaturationOperatorDecisionEscalate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM's output schema is also closed: each specialist asks the model for a JSON with explicit fields (&lt;code&gt;decision&lt;/code&gt;, &lt;code&gt;confidence&lt;/code&gt;, &lt;code&gt;node_detail&lt;/code&gt;, &lt;code&gt;relations&lt;/code&gt; with their explanations). If the JSON does not parse, safe fallback. If it parses but the values are invalid, safe fallback. The model's uncertainty becomes an operational signal — the &lt;code&gt;insufficient_data&lt;/code&gt; outcome appears on the dashboard — not an execution risk.&lt;/p&gt;
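&lt;p&gt;The validate-then-fall-back pattern can be sketched in a few lines. This is an illustrative reconstruction, not the project's actual code: the type and function names (&lt;code&gt;SpecialistOutput&lt;/code&gt;, &lt;code&gt;DecideOrFallback&lt;/code&gt;) are hypothetical.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// SpecialistOutput is an illustrative closed schema; the real PIR
// specialists also carry node_detail and relations with explanations.
type SpecialistOutput struct {
	Decision   string  `json:"decision"`
	Confidence float64 `json:"confidence"`
	NodeDetail string  `json:"node_detail"`
}

var validDecisions = map[string]bool{
	"execute": true, "escalate": true, "reject": true,
}

// DecideOrFallback parses the raw LLM response and returns a safe
// "escalate" whenever the JSON is malformed or the decision value
// falls outside the closed enum.
func DecideOrFallback(raw string) (decision, reason string) {
	var out SpecialistOutput
	if err := json.Unmarshal([]byte(raw), &out); err != nil {
		return "escalate", fmt.Sprintf("llm output did not parse: %v", err)
	}
	d := strings.ToLower(out.Decision)
	if !validDecisions[d] {
		return "escalate", fmt.Sprintf("llm returned invalid decision %q", out.Decision)
	}
	return d, ""
}

func main() {
	d, _ := DecideOrFallback(`{"decision":"EXECUTE","confidence":0.9}`)
	fmt.Println(d) // a valid value passes through, normalized

	d, reason := DecideOrFallback(`{"decision":"improvise"}`)
	fmt.Println(d, reason) // invalid value: safe fallback with a reason
}
```

Either failure mode ends in the same place: the fallback decision plus a reason string that, in the real system, becomes the telemetry signal rather than an execution.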

&lt;p&gt;&lt;strong&gt;The kernel closes the graph vocabulary.&lt;/strong&gt; Relations between nodes are not free text. Each relation carries a &lt;code&gt;semantic_class&lt;/code&gt; that has to be one of six: &lt;code&gt;structural&lt;/code&gt;, &lt;code&gt;causal&lt;/code&gt;, &lt;code&gt;motivational&lt;/code&gt;, &lt;code&gt;procedural&lt;/code&gt;, &lt;code&gt;evidential&lt;/code&gt;, &lt;code&gt;constraint&lt;/code&gt;. That set is fixed in the kernel's domain. If a specialist tries to emit a relation with a class outside that list — even if the LLM wanted to generate something like "inspirational" or "quasi-causal" — the projector drops it and the batch is recorded as failed in telemetry. What reaches the graph always comes from a known vocabulary.&lt;/p&gt;
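&lt;p&gt;A minimal sketch of that projector-side check, assuming a simplified relation shape. The six classes are the kernel's; the Go types and the &lt;code&gt;ProjectBatch&lt;/code&gt; name here are illustrative:&lt;/p&gt;

```go
package main

import "fmt"

// The closed relation vocabulary: exactly six semantic classes.
var semanticClasses = map[string]bool{
	"structural": true, "causal": true, "motivational": true,
	"procedural": true, "evidential": true, "constraint": true,
}

// Relation is a simplified stand-in for what a specialist emits.
type Relation struct {
	From, To      string
	SemanticClass string
}

// ProjectBatch keeps only relations whose class belongs to the fixed
// vocabulary; anything outside it is dropped and the batch is flagged
// as failed (recorded in telemetry in the real system).
func ProjectBatch(rels []Relation) (accepted []Relation, failed bool) {
	for _, r := range rels {
		if !semanticClasses[r.SemanticClass] {
			failed = true
			continue
		}
		accepted = append(accepted, r)
	}
	return accepted, failed
}

func main() {
	acc, failed := ProjectBatch([]Relation{
		{"finding", "decision", "causal"},
		{"finding", "plan", "inspirational"}, // outside the vocabulary
	})
	fmt.Println(len(acc), failed) // only the causal relation survives
}
```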

&lt;p&gt;The content of the nodes is fingerprinted. Each &lt;code&gt;node.detail&lt;/code&gt; is persisted with &lt;code&gt;content_hash&lt;/code&gt; and &lt;code&gt;revision&lt;/code&gt; from the moment the specialist emits it; the hash stays with the text permanently. Months later, anyone can recompute the hash against the stored detail and confirm that the text has not changed since the author wrote it.&lt;/p&gt;
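&lt;p&gt;The fingerprint itself is plain SHA-256 over the detail text. A minimal sketch of the write-time hash and the later verification (the helper names are hypothetical):&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ContentHash fingerprints a node detail at write time, in the same
// "sha256:<hex>" shape the persisted blocks use.
func ContentHash(detail string) string {
	sum := sha256.Sum256([]byte(detail))
	return "sha256:" + hex.EncodeToString(sum[:])
}

// Verify recomputes the hash against the stored detail; any edit to
// the text changes the digest and the check fails.
func Verify(detail, storedHash string) bool {
	return ContentHash(detail) == storedHash
}

func main() {
	detail := "Workload under pressure: deployment/payments-api"
	h := ContentHash(detail)
	fmt.Println(Verify(detail, h))           // unchanged text verifies
	fmt.Println(Verify(detail+" edited", h)) // any edit is detected
}
```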

&lt;p&gt;The combination — six closed semantic classes for the relations and cryptographic hashing over the node content — makes the system's persisted memory not a free narration but a graph with a fixed vocabulary and verifiable text. That is what allows different specialists, at different times, to operate on the same structure without contaminating each other's work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catalog ties the layers together.&lt;/strong&gt; &lt;code&gt;SpecialistID&lt;/code&gt;, &lt;code&gt;ToolProfile&lt;/code&gt;, &lt;code&gt;GovernanceProfile&lt;/code&gt;, and &lt;code&gt;SuccessProfile&lt;/code&gt; live in YAML and are cross-checked against the Go constants via conformance tests. If someone adds a specialist or a tool without declaring it in both places, the build fails. Governance is not a convention read from a wiki; it is versioned, executable structure.&lt;/p&gt;
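&lt;p&gt;A conformance check of that kind reduces to a set comparison between the two declarations. A sketch under assumed names — in the real repo this runs as a Go test and also covers tools and profiles:&lt;/p&gt;

```go
package main

import "fmt"

// CatalogDrift returns the specialist IDs declared as Go constants
// but missing from the YAML catalog. A non-empty result is what a
// conformance test would turn into a build failure.
func CatalogDrift(goConsts, yamlIDs []string) []string {
	declared := make(map[string]bool, len(yamlIDs))
	for _, id := range yamlIDs {
		declared[id] = true
	}
	var missing []string
	for _, id := range goConsts {
		if !declared[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	// Hypothetical IDs: someone added runtime-rollout in Go but
	// forgot to declare it in YAML.
	goConsts := []string{"saturation-operator", "planner", "runtime-rollout"}
	yamlIDs := []string{"saturation-operator", "planner"}
	fmt.Println(CatalogDrift(goConsts, yamlIDs))
}
```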

&lt;p&gt;&lt;strong&gt;The runtime is the last guardian.&lt;/strong&gt; When a specialist opens a session with the runtime to invoke a tool, it declares a &lt;code&gt;tool_profile&lt;/code&gt;, a &lt;code&gt;governance_profile&lt;/code&gt;, and a &lt;code&gt;success_profile&lt;/code&gt;. The runtime validates each invocation — not just at session-open, but on every call — against three rules: if the tool's scope (workspace / cluster) does not match the caller's roles, it fails; if the tool's risk level is high and the caller does not have &lt;code&gt;platform_admin&lt;/code&gt;, it fails; if the tool requires approval and the invocation does not carry &lt;code&gt;Approved=true&lt;/code&gt;, it fails. Each runtime decision — allowed, denied, failed — is recorded in a structured audit log with &lt;code&gt;actor_id&lt;/code&gt;, &lt;code&gt;tenant_id&lt;/code&gt;, &lt;code&gt;session_id&lt;/code&gt;, &lt;code&gt;invocation_id&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, and redacted metadata. PIR does not decide whether it has permission; the runtime decides it against declared structure.&lt;/p&gt;
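&lt;p&gt;The three rules can be sketched as a single authorization function evaluated on every call. The field names here are illustrative, not the runtime's actual API:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// Tool describes the declared profile of a tool (illustrative shape).
type Tool struct {
	Scope            string // "workspace" or "cluster"
	HighRisk         bool
	RequiresApproval bool
}

// Invocation carries the caller's roles and the approval flag.
type Invocation struct {
	CallerRoles map[string]bool
	Approved    bool
}

// Authorize applies the three rules in order; the first violated
// rule denies the call. The real runtime also writes an audit record
// with actor_id, tenant_id, session_id, invocation_id, and status.
func Authorize(t Tool, inv Invocation) error {
	if !inv.CallerRoles[t.Scope] {
		return errors.New("denied: tool scope does not match caller roles")
	}
	if t.HighRisk && !inv.CallerRoles["platform_admin"] {
		return errors.New("denied: high-risk tool requires platform_admin")
	}
	if t.RequiresApproval && !inv.Approved {
		return errors.New("denied: invocation lacks Approved=true")
	}
	return nil
}

func main() {
	tool := Tool{Scope: "cluster", HighRisk: true, RequiresApproval: true}
	inv := Invocation{CallerRoles: map[string]bool{"cluster": true}}
	fmt.Println(Authorize(tool, inv)) // denied by the risk rule
}
```

The point of the ordering is that the caller never argues about permission: each rule consults only declared structure, so the deny happens regardless of what the model proposed.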

&lt;p&gt;&lt;strong&gt;Governance leaves a trail.&lt;/strong&gt; Every runtime denial, every specialist fallback, every &lt;code&gt;insufficient_data&lt;/code&gt; from the model, every kernel rejection lands in telemetry. These are not errors; they are evidence that the layers are doing their job. That lets you distinguish between &lt;em&gt;"the system did nothing because it was right not to act"&lt;/em&gt; and &lt;em&gt;"the system did nothing because it failed"&lt;/em&gt;. The warnings panel on the dashboard is not a negative indicator; it is a health indicator.&lt;/p&gt;

&lt;p&gt;Governance emerges as composition. The event narrows the problem. The specialist narrows the model's output. The kernel narrows the persisted vocabulary. The catalog narrows the universe of roles, tools, and profiles. The runtime narrows execution. No single layer is enough on its own; the combination is what allows a potentially ambiguous model to operate on real infrastructure without unexpected behavior.&lt;/p&gt;

&lt;p&gt;That change of register — from &lt;em&gt;"promising the agent behaves well"&lt;/em&gt; to &lt;em&gt;"structurally preventing it from behaving badly"&lt;/em&gt; — is what makes me think the word governance is appropriate here. It is not prompt governance nor external policy governance. It is architecture governance: the system does not rely on the model's goodness; it operates under constraints the model cannot reach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The distributed trace is the incident
&lt;/h2&gt;

&lt;p&gt;The most valuable part of this setup, for me, is not the aggregated dashboard. It is the bridge between aggregates and concrete executions.&lt;/p&gt;

&lt;p&gt;The dashboard's traces panel already shows that bridge: each row lists &lt;code&gt;payments-incident-response&lt;/code&gt;, operations like &lt;code&gt;pir.ingress.handle_alert_firing&lt;/code&gt;, and durations of just over a minute per incident. But the real value appears when you open one of those entries in Tempo — and what unfolds there is exactly the waterfall that opens this article.&lt;/p&gt;

&lt;p&gt;Now that we have walked the pipeline, it is worth coming back to Figure 1 with different eyes. Each of those spans corresponds to a hop you already know how to read: ingress as root, five siblings in parallel (routing, event proxy, kernel seed, investigators) and then the serial chain of the saturation pipeline. The hops in microseconds or milliseconds are infrastructure doing its work; the spans of tens of seconds are the LLM-bound specialists. That asymmetry — visible without reconstructing anything — is what lets you read the incident as a path, not as an aggregate.&lt;/p&gt;

&lt;p&gt;The critical path also stops being a guess. You can see which span dominated the total duration, which were practically instantaneous, and which parts overlapped. That avoids a common trap of observability in LLM systems: staying at global metrics that say little about the real behavior of an individual execution. Averages help, but they are not enough. To understand how the system reasons and executes under a specific incident, you have to be able to follow a trace.&lt;/p&gt;

&lt;p&gt;Distributed tracing does not replace the system's reasoning. It makes it inspectable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The kernel does not store free-floating text
&lt;/h2&gt;

&lt;p&gt;The other half of the story is not in Grafana or in Tempo. It is in what remains stored after each hop.&lt;/p&gt;

&lt;p&gt;Each incident leaves a chain of real &lt;code&gt;node.detail&lt;/code&gt; blocks: initial evidence, findings, plans, and decisions. Some of those blocks come from ingress and are deterministic. Others come from LLM-bound specialists. What is useful is that they do not end up mixed in an indistinct narration. Each piece keeps key, author, raw content, &lt;code&gt;content_hash&lt;/code&gt;, and &lt;code&gt;revision&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That lets you show something that is usually lost in this kind of system: not only what the pipeline decided, but what specific text each actor produced and with what provenance it can be read. It also lets you link a single incident to four different views of the same fact: Grafana screenshot, Tempo trace, Valkey-persisted detail, and typed relations in Neo4j.&lt;/p&gt;

&lt;p&gt;For example, the saturation &lt;code&gt;finding&lt;/code&gt; does not describe an "agent" in the abstract. It describes a concrete reading of concrete evidence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Workload under pressure: deployment/underpass-demo-payments-api in namespace underpass-runtime.
Resource saturated: cpu (symptom_kind=saturation, symptom_value=cpu=94%).
Observed pressure level: 94% vs alert threshold of cpu &amp;gt; 90% for 5m.
Hypotheses: The evidence supports a sudden spike or regression related to recent deployments, as the workload received multiple canary rollouts (v2.7.4, v2.7.5, v2.7.6 23min before, and v2.7.7 15min before the symptom fired).
Missing information: Metrics window for trend analysis, top-N consumers, and limits vs requests to determine if the saturation is due to resource constraints or increased load.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Real block from finding_56daf53e-4c97-4344-bbb6-f63cf513ae89_saturation.json. &lt;code&gt;content_hash&lt;/code&gt;: &lt;code&gt;sha256:7c67c37f8e3439942993b09a7dde70c164abdaa6b69d0e46a65872d7f17824ae&lt;/code&gt;. &lt;code&gt;revision&lt;/code&gt;: &lt;code&gt;1&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And in the rollout incident, the operational &lt;code&gt;decision&lt;/code&gt; is also persisted with its own explicit rationale:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scope: payments-api service in the cluster environment, specifically the underpass-demo-payments-api deployment.
Observed Data: alert_id article-roll-1776973582 reports a symptom_kind of latency with a p99 value of 2.41s, exceeding the threshold of p99 &amp;gt; 2s for 5m. A recent_deploy (v2.7.8-canary) is present with an age_minutes of 18.
Operational Rationale: The regression is correlated with a recent canary deploy that is young (&amp;lt; 60 minutes) and the symptom is latency, meeting the criteria for a rollback to a healthy previous revision.
Falsification: This decision would be invalidated if evidence emerges that the latency is caused by a global infrastructure failure unrelated to the v2.7.8-canary rollout.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Real block from decision_a65c9a9b-3dff-4ce4-aeeb-d6497984ee57_runtime-rollout.json. &lt;code&gt;content_hash&lt;/code&gt;: &lt;code&gt;sha256:2830e840c213de2baaf82444c6383e3ea9b63dac41944153534e61ead72a07d7&lt;/code&gt;. &lt;code&gt;revision&lt;/code&gt;: &lt;code&gt;1&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The other graph worth showing is the rollout-regression one, because it makes clear that the ontology is shared but the topology changes depending on the incident type:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw97spe550db0qrcs9vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw97spe550db0qrcs9vg.png" alt="Figure 6. Typed graph stored in Neo4j for the rollout-regression incident." width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 6. Typed graph stored in Neo4j for the rollout-regression incident. Shared ontology, different topology: finding and decision in parallel instead of the 3-in-series chain.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recovery and closure are also architecture
&lt;/h2&gt;

&lt;p&gt;An agent architecture is not proven only in the "interesting" moment of reasoning. It is also proven in the less visible parts: retries, durability, message consumption, correct pipeline closure, the absence of spurious escalations, and stable behavior when several components work in a chain.&lt;/p&gt;

&lt;p&gt;That matters because many demos show the best case but not the operational fabric that keeps the system from breaking at the first mismatch between services, queues, context, and execution.&lt;/p&gt;

&lt;p&gt;Part of what I wanted to build here was precisely that: a pipeline where recovery, coordination, and traceability are not an afterthought but part of the design from the beginning.&lt;/p&gt;

&lt;p&gt;If the system resolves something but cannot explain which hop it passed through, how long it took, who intervened, what failed, and where it could recover, then you do not yet have reliable infrastructure. You have a lucky sequence.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I think generalizes
&lt;/h2&gt;

&lt;p&gt;I do not think a single demo allows broad claims about "agents" in general. I do think this setup shows something more bounded.&lt;/p&gt;

&lt;p&gt;When the event narrows the decision space well, when the pipeline distributes responsibility across specialists with clear scope, and when execution leaves enough telemetry to inspect each hop, the system becomes more governable. Not necessarily simpler on the inside, but more readable from the outside. And that operational readability matters.&lt;/p&gt;

&lt;p&gt;It matters for debugging.&lt;br&gt;
It matters for evaluation.&lt;br&gt;
It matters for audit.&lt;br&gt;
It matters for deciding whether the cost of reasoning is justified at each phase.&lt;/p&gt;

&lt;p&gt;In that sense, what is interesting is not just that there are several specialists. What is interesting is that the complete cycle can be observed as a typed, bounded, and measurable sequence of transitions between state, context, and execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Close
&lt;/h2&gt;

&lt;p&gt;The main result here is not that a multi-agent pipeline can be drawn. It is that it can be observed as a sequence of bounded hops, with latency, outcome, and traceability per execution.&lt;/p&gt;

&lt;p&gt;The event defines the boundary. The specialist closes the model's output. The kernel closes the graph vocabulary. The runtime closes execution. The instrumentation leaves enough evidence to inspect what actually happened, instead of reconstructing it afterwards from prompts, loose logs, or intuition.&lt;/p&gt;

&lt;p&gt;This pipeline is not a monolithic application. It runs on two pieces of open-source infrastructure I published earlier in separate articles: &lt;a href="https://github.com/underpass-ai/rehydration-kernel" rel="noopener noreferrer"&gt;rehydration-kernel&lt;/a&gt; provides structured context with explicit causality and a closed ontology, and &lt;a href="https://github.com/underpass-ai/underpass-runtime" rel="noopener noreferrer"&gt;underpass-runtime&lt;/a&gt; provides governed execution with a policy engine and auditing. This article shows what emerges when the two are composed.&lt;/p&gt;

&lt;p&gt;If you work on infrastructure for agents, governed execution, or observability for LLM systems, I would especially value technical feedback on the design and the instrumentation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>observability</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Why event-driven agents reduce scope, cost, and decision dispersion</title>
      <dc:creator>Tirso García</dc:creator>
      <pubDate>Thu, 16 Apr 2026 22:45:25 +0000</pubDate>
      <link>https://dev.to/tirsogarcia/why-event-driven-agents-reduce-scope-cost-and-decision-dispersion-2062</link>
      <guid>https://dev.to/tirsogarcia/why-event-driven-agents-reduce-scope-cost-and-decision-dispersion-2062</guid>
      <description>&lt;p&gt;Most agent systems do not control their costs because they spend tokens letting the model discover boundaries that the architecture should have defined up front.&lt;/p&gt;

&lt;p&gt;A task arrives. A general-purpose agent receives a large context window, too many available tools, mixed historical signals, and a loosely defined objective. From there, the model must infer what matters, what does not, which tools are plausible, which constraints apply, and how success should be measured.&lt;/p&gt;

&lt;p&gt;That is expensive.&lt;/p&gt;

&lt;p&gt;Not only in tokens, but in unnecessary exploration, latency, failed tool calls, policy denials, and avoidable reasoning over irrelevant possibility space.&lt;/p&gt;

&lt;p&gt;This is the core issue I want to highlight: &lt;strong&gt;many agent systems are expensive because the architecture leaves too many decisions open for the model before reasoning even begins.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I think event-driven agents are one of the cleanest ways to reduce those open decisions.&lt;/p&gt;

&lt;p&gt;A well-defined event does more than trigger work. It defines the boundaries of the problem and becomes the initial context the agent works with. The better the event is designed, the more precise the agent will be.&lt;/p&gt;

&lt;p&gt;That event is routed to a specialist agent — not a generic agent that has to figure out what to do, but one that already knows what type of problem it focuses on.&lt;/p&gt;
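&lt;p&gt;One way to picture "the event is the initial context" is as a typed payload that already carries the boundary the specialist will work inside. The field names below are hypothetical:&lt;/p&gt;

```go
package main

import "fmt"

// Event is an illustrative shape: the payload itself states what
// happened, where, and under which constraint, so the specialist
// does not have to infer any of that at inference time.
type Event struct {
	Kind       string            // e.g. "IncidentSeverityRaised"
	Subject    string            // the affected workload
	Symptom    map[string]string // e.g. kind and observed value
	Constraint string            // the threshold that fired
}

func main() {
	e := Event{
		Kind:       "IncidentSeverityRaised",
		Subject:    "deployment/payments-api",
		Symptom:    map[string]string{"kind": "saturation", "value": "cpu=94%"},
		Constraint: "cpu > 90% for 5m",
	}
	fmt.Printf("%s on %s (%s, fired on %s)\n",
		e.Kind, e.Subject, e.Symptom["value"], e.Constraint)
}
```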




&lt;h2&gt;
  
  
  The real cost of generality
&lt;/h2&gt;

&lt;p&gt;The default pattern in many agent systems is still broad and implicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;give the model a lot of context&lt;/li&gt;
&lt;li&gt;expose a wide action surface&lt;/li&gt;
&lt;li&gt;provide generic instructions&lt;/li&gt;
&lt;li&gt;hope the model discovers the right boundary at inference time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That scales poorly.&lt;/p&gt;

&lt;p&gt;As systems grow, the agent is forced to reason over:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;heterogeneous history&lt;/li&gt;
&lt;li&gt;multiple subsystems&lt;/li&gt;
&lt;li&gt;weakly related signals&lt;/li&gt;
&lt;li&gt;many candidate tools&lt;/li&gt;
&lt;li&gt;overlapping objectives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the problem is no longer just context size.&lt;/p&gt;

&lt;p&gt;It is &lt;strong&gt;decision dispersion&lt;/strong&gt;: too many plausible interpretations, too many candidate actions, and too much irrelevant context competing for attention.&lt;/p&gt;

&lt;p&gt;A broad agent can still succeed, but the system is making it solve problem decomposition again and again on every cycle. And the more the system grows, the more likely the model is to fail or take shortcuts just to get through.&lt;/p&gt;

&lt;p&gt;That is architectural waste.&lt;/p&gt;




&lt;h2&gt;
  
  
  Events are not just triggers
&lt;/h2&gt;

&lt;p&gt;In a well-designed event-driven system, an event is not merely a transport primitive.&lt;/p&gt;

&lt;p&gt;It is a &lt;strong&gt;semantic boundary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A well-defined event already carries a strong signal about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what class of situation has occurred&lt;/li&gt;
&lt;li&gt;which specialist capability is relevant&lt;/li&gt;
&lt;li&gt;which context should be materialized&lt;/li&gt;
&lt;li&gt;which tools are worth considering&lt;/li&gt;
&lt;li&gt;which policies should govern the response&lt;/li&gt;
&lt;li&gt;how the result should be evaluated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That changes the starting point of the system.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What should an agent do in this entire environment?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system can ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How should this specialist handle this class of situation under these constraints?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a much healthier question.&lt;/p&gt;




&lt;h2&gt;
  
  
  The key idea: narrowing before reasoning
&lt;/h2&gt;

&lt;p&gt;The architectural value of event-driven agents is not just decoupling. It is &lt;strong&gt;control&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A well-defined event lets the system narrow four things before the model starts reasoning:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Problem scope
&lt;/h3&gt;

&lt;p&gt;The event defines the operational boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Context scope
&lt;/h3&gt;

&lt;p&gt;Only the relevant knowledge should be materialized.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Action scope
&lt;/h3&gt;

&lt;p&gt;Only the relevant tools and permissions should be exposed.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Evaluation scope
&lt;/h3&gt;

&lt;p&gt;Success criteria become more local and easier to observe.&lt;/p&gt;

&lt;p&gt;This is why event-driven systems can become cheaper and more reliable at the same time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Narrowing across four architectural layers
&lt;/h2&gt;

&lt;p&gt;A serious event-driven agent system should narrow the problem across four layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Event routing — narrowing the problem surface
&lt;/h3&gt;

&lt;p&gt;The first narrowing step is event classification and routing.&lt;/p&gt;

&lt;p&gt;A well-defined event such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ThermalDriftDetected&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PolicyViolationDetected&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ExecutionFailureObserved&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;IncidentSeverityRaised&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;already tells the system that not every capability is equally relevant.&lt;/p&gt;

&lt;p&gt;Routing should select the specialist capability or specialist set that is appropriate for that class of problem.&lt;/p&gt;

&lt;p&gt;The model should not spend tokens discovering what the event already told us.&lt;/p&gt;
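&lt;p&gt;Routing at this layer can be a deterministic lookup, resolved before any model call. A sketch using the event names above — the map shape and the specialist names are assumptions:&lt;/p&gt;

```go
package main

import "fmt"

// routes maps an event class to the specialist responsible for it.
// The event names come from the list above; the specialist IDs are
// hypothetical.
var routes = map[string]string{
	"ThermalDriftDetected":     "thermal-specialist",
	"PolicyViolationDetected":  "policy-specialist",
	"ExecutionFailureObserved": "execution-specialist",
	"IncidentSeverityRaised":   "incident-specialist",
}

// Route resolves deterministically; unknown classes go to a human
// queue instead of falling back to a generic agent.
func Route(eventKind string) string {
	if specialist, ok := routes[eventKind]; ok {
		return specialist
	}
	return "escalate-to-human"
}

func main() {
	fmt.Println(Route("IncidentSeverityRaised"))
	fmt.Println(Route("SomethingNovel")) // no route: escalate
}
```

No tokens are spent on this step: the classification the event already encodes is simply looked up.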

&lt;h3&gt;
  
  
  2. Context materialization — narrowing the knowledge surface
&lt;/h3&gt;

&lt;p&gt;Once the event boundary is known, context should not be assembled as a flat prompt bundle.&lt;/p&gt;

&lt;p&gt;It should be materialized explicitly and narrowly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relevant entities only&lt;/li&gt;
&lt;li&gt;causal relationships only where useful&lt;/li&gt;
&lt;li&gt;prior mitigations and outcomes&lt;/li&gt;
&lt;li&gt;rationale from previous decisions&lt;/li&gt;
&lt;li&gt;constraints tied to the event class&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where many systems either win or fail.&lt;/p&gt;

&lt;p&gt;A narrow context is not automatically a good context. The goal is not merely to shrink tokens. The goal is to increase &lt;strong&gt;relevance density&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is why context quality should be observable.&lt;/p&gt;

&lt;p&gt;Useful metrics here include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;raw_equivalent_tokens&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;compression_ratio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;causal_density&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;noise_ratio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;detail_coverage&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those metrics make it possible to ask a much better question than “how much context did we pass?”&lt;/p&gt;

&lt;p&gt;The better question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how much unnecessary context did we discard without losing what the specialist actually needs?&lt;/p&gt;
&lt;/blockquote&gt;
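&lt;p&gt;Two of those metrics reduce to simple ratios. The exact definitions in a real system may differ; this is a sketch of the intent:&lt;/p&gt;

```go
package main

import "fmt"

// CompressionRatio compares the raw-equivalent tokens against the
// tokens actually materialized into the specialist's context.
func CompressionRatio(rawTokens, materializedTokens int) float64 {
	if materializedTokens == 0 {
		return 0
	}
	return float64(rawTokens) / float64(materializedTokens)
}

// NoiseRatio is the fraction of materialized items not tied to the
// event class — the part of the context that competes for attention
// without adding relevance.
func NoiseRatio(totalItems, irrelevantItems int) float64 {
	if totalItems == 0 {
		return 0
	}
	return float64(irrelevantItems) / float64(totalItems)
}

func main() {
	// Hypothetical numbers: 12k raw-equivalent tokens compressed to 1.5k.
	fmt.Println(CompressionRatio(12000, 1500))
	fmt.Println(NoiseRatio(20, 3))
}
```

Read together, they answer the question in the blockquote: a rising compression ratio with a falling noise ratio means context is getting denser, not just smaller.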

&lt;h3&gt;
  
  
  3. Governed execution — narrowing the action surface
&lt;/h3&gt;

&lt;p&gt;Even with the right context, an agent should not operate over an unrestricted action surface.&lt;/p&gt;

&lt;p&gt;Execution should be governed.&lt;/p&gt;

&lt;p&gt;A runtime layer can narrow execution by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;restricting the candidate tool set&lt;/li&gt;
&lt;li&gt;ranking likely actions before invocation&lt;/li&gt;
&lt;li&gt;applying policy checks before execution&lt;/li&gt;
&lt;li&gt;isolating execution environments&lt;/li&gt;
&lt;li&gt;capturing telemetry, logs, and traces&lt;/li&gt;
&lt;/ul&gt;
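&lt;p&gt;The first two bullets — restricting and ranking — can be sketched as a filter plus a sort over a tool catalog annotated with per-class priors. All names here are hypothetical:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// ToolCandidate annotates each tool with the event class it applies
// to and a prior score (e.g. historical first-call success rate).
type ToolCandidate struct {
	Name  string
	Class string
	Score float64
}

// Narrow filters the global catalog down to the event class, then
// ranks the survivors so the most promising action is tried first.
func Narrow(catalog []ToolCandidate, eventClass string) []ToolCandidate {
	var out []ToolCandidate
	for _, t := range catalog {
		if t.Class == eventClass {
			out = append(out, t)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	catalog := []ToolCandidate{
		{"scale_up", "saturation", 0.7},
		{"rollback", "rollout", 0.9},
		{"restart_pods", "saturation", 0.4},
	}
	for _, t := range Narrow(catalog, "saturation") {
		fmt.Println(t.Name) // only saturation tools, best first
	}
}
```

Policy checks and isolation sit after this step: they apply to the already-narrowed candidate set, not the whole catalog.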

&lt;p&gt;The cost of a system is not only what it spends on prompts. It is also what it spends on tool fan-out, denied actions, and unnecessary exploration.&lt;/p&gt;

&lt;p&gt;This is why execution quality also needs metrics.&lt;/p&gt;

&lt;p&gt;Useful metrics here include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;workspace_tool_calls_per_task&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_success_on_first_tool_rate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_recommendation_acceptance_rate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_policy_denial_rate_bad_recommendation&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;invocation_latency_histograms&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Observability and feedback — narrowing through evidence
&lt;/h3&gt;

&lt;p&gt;The final layer is observability.&lt;/p&gt;

&lt;p&gt;Without observability, “event-driven agents reduce cost” remains a belief.&lt;/p&gt;

&lt;p&gt;With observability, it becomes testable.&lt;/p&gt;

&lt;p&gt;A well-instrumented system can show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether context is becoming denser or just smaller&lt;/li&gt;
&lt;li&gt;whether specialists use fewer tools than broad agents&lt;/li&gt;
&lt;li&gt;whether routing improves first-action success&lt;/li&gt;
&lt;li&gt;whether policy boundaries are helping or creating churn&lt;/li&gt;
&lt;li&gt;whether outcome quality improves as scope narrows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the architecture stops being an opinion and becomes an operational hypothesis.&lt;/p&gt;




&lt;h2&gt;
  
  
  A concrete example: alert-driven remediation
&lt;/h2&gt;

&lt;p&gt;A useful way to think about this is an operational remediation loop in a live cluster.&lt;/p&gt;

&lt;p&gt;Imagine an alert arrives from the observability stack because a subsystem crosses a critical threshold.&lt;/p&gt;

&lt;p&gt;A broad agent design might do something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gather recent logs&lt;/li&gt;
&lt;li&gt;gather broad system history&lt;/li&gt;
&lt;li&gt;expose many tools&lt;/li&gt;
&lt;li&gt;ask a general-purpose agent to stabilize the situation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That approach pushes too much decomposition work into the model.&lt;/p&gt;

&lt;p&gt;An event-driven design works differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — a well-defined event enters the system
&lt;/h3&gt;

&lt;p&gt;The alert becomes a well-defined event such as &lt;code&gt;IncidentSeverityRaised&lt;/code&gt; or &lt;code&gt;ExecutionFailureObserved&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 — the event selects a specialist path
&lt;/h3&gt;

&lt;p&gt;The system routes the event to a specialist capability responsible for that class of issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 — context is materialized narrowly
&lt;/h3&gt;

&lt;p&gt;The context layer assembles only what is relevant to that incident type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the affected subsystem&lt;/li&gt;
&lt;li&gt;recent related failures&lt;/li&gt;
&lt;li&gt;prior mitigations&lt;/li&gt;
&lt;li&gt;current operational constraints&lt;/li&gt;
&lt;li&gt;known causal dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4 — execution is governed
&lt;/h3&gt;

&lt;p&gt;The runtime narrows the available action space:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;only relevant tools are visible&lt;/li&gt;
&lt;li&gt;suggested actions are ranked&lt;/li&gt;
&lt;li&gt;policy checks can reject unsafe actions before execution&lt;/li&gt;
&lt;li&gt;telemetry is attached to the full cycle&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5 — the outcome becomes evidence
&lt;/h3&gt;

&lt;p&gt;The result of the mitigation becomes a new event and a new measurement point.&lt;/p&gt;

&lt;p&gt;At that point, the system can observe not only whether the incident was addressed, but how expensive the path was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how much context was needed&lt;/li&gt;
&lt;li&gt;how many tools were considered&lt;/li&gt;
&lt;li&gt;whether the first recommended action succeeded&lt;/li&gt;
&lt;li&gt;whether policy narrowed or blocked the path&lt;/li&gt;
&lt;li&gt;how long the cycle took end to end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the real strength of the pattern.&lt;/p&gt;

&lt;p&gt;The event is not just a trigger for work. It is the boundary that lets the whole system narrow problem scope, knowledge scope, and action scope before reasoning begins.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture in one line
&lt;/h2&gt;

&lt;p&gt;A useful mental model is this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc7xcyllpflzwu7k0fxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc7xcyllpflzwu7k0fxy.png" alt="Seven stacked boxes connected by arrows showing the stages: Event, Specialist routing, Context materialization, Tool suggestion and policy check, Governed execution, Outcome event, and&amp;lt;br&amp;gt;
   Metrics, Traces, Logs" width="800" height="966"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every stage should remove irrelevant possibilities.&lt;/p&gt;

&lt;p&gt;If the system keeps adding options instead of removing them, it is probably moving in the wrong direction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this reduces cost
&lt;/h2&gt;

&lt;p&gt;Cost reduction comes from multiple sources at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer input tokens&lt;/li&gt;
&lt;li&gt;denser context&lt;/li&gt;
&lt;li&gt;fewer candidate tools&lt;/li&gt;
&lt;li&gt;fewer executions without result&lt;/li&gt;
&lt;li&gt;fewer unsafe actions reaching execution&lt;/li&gt;
&lt;li&gt;fewer retries caused by vague reasoning&lt;/li&gt;
&lt;li&gt;shorter cycles to first useful action&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A broad agent pays these costs implicitly.&lt;/p&gt;

&lt;p&gt;An event-driven specialist system avoids them structurally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this reduces decision dispersion
&lt;/h2&gt;

&lt;p&gt;Decision dispersion appears when the system leaves too many paths open at once.&lt;/p&gt;

&lt;p&gt;Too much context.&lt;br&gt;
Too many tools.&lt;br&gt;
Too many plausible interpretations.&lt;br&gt;
Too many weakly bounded goals.&lt;/p&gt;

&lt;p&gt;A well-defined event cuts through that.&lt;/p&gt;

&lt;p&gt;It does not eliminate uncertainty, but it turns a diffuse reasoning problem into a more local one.&lt;/p&gt;

&lt;p&gt;The system no longer asks for a global interpretation of the world. It asks for a bounded response to a bounded class of situation.&lt;/p&gt;

&lt;p&gt;That is the kind of narrowing that helps both quality and cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to measure in a live system
&lt;/h2&gt;

&lt;p&gt;For this architecture to be credible, it has to be measurable.&lt;/p&gt;

&lt;p&gt;A strong live demonstration would compare a broader path against an event-driven specialist path on the same class of incident.&lt;/p&gt;

&lt;p&gt;For the context layer, useful measurements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;raw_equivalent_tokens&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;compression_ratio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;causal_density&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;noise_ratio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;detail_coverage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;context_bytes_saved&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
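&lt;p&gt;As a sketch of how the context-layer metrics could be computed (token counting here is naive whitespace splitting, and the causal/noise labels are assumed to be produced by whatever annotates the materialized context):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative computation of the context-layer metrics above.
# "causal" and "noise" flags are assumed inputs from the annotation step.

def context_metrics(raw_text, materialized_lines):
    raw_tokens = len(raw_text.split())
    kept_tokens = sum(len(l["text"].split()) for l in materialized_lines)
    causal = sum(1 for l in materialized_lines if l["causal"])
    noisy = sum(1 for l in materialized_lines if l["noise"])
    n = len(materialized_lines)
    kept_bytes = sum(len(l["text"]) for l in materialized_lines)
    return {
        "raw_equivalent_tokens": raw_tokens,
        "compression_ratio": raw_tokens / max(kept_tokens, 1),
        "causal_density": causal / max(n, 1),
        "noise_ratio": noisy / max(n, 1),
        "context_bytes_saved": len(raw_text) - kept_bytes,
    }

m = context_metrics(
    "a b c d e f g h",
    [{"text": "a b", "causal": True, "noise": False},
     {"text": "c d", "causal": False, "noise": True}],
)
print(m["compression_ratio"], m["noise_ratio"])  # 2.0 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A real implementation would use the model's tokenizer rather than whitespace splitting, but the ratios keep the same shape.&lt;/p&gt;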

&lt;p&gt;For the execution layer, useful measurements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;workspace_tool_calls_per_task&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_success_on_first_tool_rate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_recommendation_acceptance_rate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_policy_denial_rate_bad_recommendation&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;invocation_latency_histograms&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;trace_span_durations&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
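&lt;p&gt;A minimal in-process sketch of collecting the execution-layer counters; a real deployment would export them through something like Prometheus or OpenTelemetry, and the method names below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter, defaultdict

class WorkspaceMetrics:
    """Toy in-memory collector for the execution-layer signals above."""

    def __init__(self):
        self.counters = Counter()
        self.latencies_ms = defaultdict(list)  # per-tool latency samples

    def record_tool_call(self, tool, latency_ms, ok, first_call):
        self.counters["workspace_tool_calls_per_task"] += 1
        if ok and first_call:
            self.counters["workspace_success_on_first_tool"] += 1
        self.latencies_ms[tool].append(latency_ms)

    def record_policy_denial(self):
        self.counters["workspace_policy_denial_bad_recommendation"] += 1

metrics = WorkspaceMetrics()
metrics.record_tool_call("df", 12.5, ok=True, first_call=True)
metrics.record_policy_denial()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The latency samples feed the invocation histograms; rates like first-tool success come from dividing the counters over a window.&lt;/p&gt;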

&lt;p&gt;The point is not to claim that every event-driven design is automatically better.&lt;/p&gt;

&lt;p&gt;The point is that this design gives you a coherent way to test whether narrowing is actually happening and whether it is paying off.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this does not solve
&lt;/h2&gt;

&lt;p&gt;Event-driven agents do not solve everything.&lt;/p&gt;

&lt;p&gt;They can still fail badly if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;events are poorly designed&lt;/li&gt;
&lt;li&gt;specialist boundaries are unclear&lt;/li&gt;
&lt;li&gt;context materialization is weak&lt;/li&gt;
&lt;li&gt;the runtime exposes the wrong tools&lt;/li&gt;
&lt;li&gt;policies are too loose or too rigid&lt;/li&gt;
&lt;li&gt;observability is incomplete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A noisy event taxonomy creates noise, not clarity.&lt;/p&gt;

&lt;p&gt;A bad specialist boundary just moves confusion from the prompt to the routing layer.&lt;/p&gt;

&lt;p&gt;A narrow system is only better if the narrowing is semantically sound.&lt;/p&gt;




&lt;h2&gt;
  
  
  The final idea
&lt;/h2&gt;

&lt;p&gt;Reducing cost, improving focus, and eliminating dispersion are consequences of the same principle: narrow before reasoning. When that is combined with materialized context, governed execution, and real observability, the system stops being a prompt pipeline and becomes operational infrastructure.&lt;/p&gt;

&lt;p&gt;The systems that will scale are not the ones that expose larger models to more context and more tools.&lt;/p&gt;

&lt;p&gt;They are the ones that learn how to narrow the world before the model starts thinking.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;— &lt;a href="https://www.linkedin.com/in/tirsogarcia/" rel="noopener noreferrer"&gt;Tirso Garcia&lt;/a&gt; · April 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Building these ideas in the open:&lt;br&gt;
&lt;a href="https://github.com/underpass-ai/rehydration-kernel" rel="noopener noreferrer"&gt;rehydration-kernel&lt;/a&gt; · &lt;a href="https://github.com/underpass-ai/underpass-runtime" rel="noopener noreferrer"&gt;underpass-runtime&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're working on similar problems, I'd love to hear from you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>agents</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
