<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tirso García</title>
    <description>The latest articles on DEV Community by Tirso García (@tirsogarcia).</description>
    <link>https://dev.to/tirsogarcia</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1723304%2F484b0a43-4957-467a-9131-06846c86b84f.png</url>
      <title>DEV Community: Tirso García</title>
      <link>https://dev.to/tirsogarcia</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tirsogarcia"/>
    <language>en</language>
    <item>
      <title>Building Kernel Memory Protocol: Navigable Memory for AI Agents</title>
      <dc:creator>Tirso García</dc:creator>
      <pubDate>Sun, 10 May 2026 14:59:29 +0000</pubDate>
      <link>https://dev.to/tirsogarcia/construyendo-kernel-memory-protocol-memoria-navegable-para-agentes-de-ia-24lc</link>
      <guid>https://dev.to/tirsogarcia/construyendo-kernel-memory-protocol-memoria-navegable-para-agentes-de-ia-24lc</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;English version: &lt;a href="https://dev.to/tirsogarcia/building-kernel-memory-protocol-navigable-memory-for-ai-agents-315j"&gt;Building Kernel Memory Protocol: Navigable Memory for AI Agents&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The problem with many AI agents is not that they lack text in the prompt.
The problem is that they have no memory they can query, traverse, and audit.&lt;/p&gt;

&lt;p&gt;Today, many solutions try to solve this in three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;copying part of previous conversations into the new prompt;&lt;/li&gt;
&lt;li&gt;searching for similar fragments via embeddings;&lt;/li&gt;
&lt;li&gt;delegating memory to a framework that stores it internally, but does not
always allow you to inspect it, traverse it, or explain how a decision was
reached.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These solutions help, but they fall short when an agent is doing real
work. In that context, retrieving text is not enough: you need to be able to
reconstruct the process.&lt;/p&gt;

&lt;p&gt;The questions that matter are different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did the agent know when it made a decision?&lt;/li&gt;
&lt;li&gt;Which solution attempts did it try?&lt;/li&gt;
&lt;li&gt;Which one failed, and why?&lt;/li&gt;
&lt;li&gt;What new information changed the course of the work?&lt;/li&gt;
&lt;li&gt;What sequence of steps led to the final solution?&lt;/li&gt;
&lt;li&gt;What evidence justifies a decision or an answer?&lt;/li&gt;
&lt;li&gt;Can a person review that evidence without reading the entire raw
conversation?&lt;/li&gt;
&lt;li&gt;Can another model navigate the same memory without knowing where or how it
is stored underneath?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Underpass KMP was born trying to solve something smaller: retrieving only
the context an agent needs to continue a task without rereading the entire
previous conversation. I called that context rehydration: taking memory that
was already recorded and reconstructing only the part useful for the next
step.&lt;/p&gt;

&lt;p&gt;But while testing it I saw that the real problem was bigger. Preparing a
better prompt was not enough. I needed a memory layer that stored what
happened, when it happened, who produced it, what evidence supported it, and
how it could be traversed afterward.&lt;/p&gt;

&lt;p&gt;That is where Kernel Memory Protocol, or KMP, comes from: a small, explicit
API for writing, querying, traversing, tracing, and inspecting agent memory.&lt;/p&gt;
&lt;h2&gt;From searching fragments to traversing memory&lt;/h2&gt;

&lt;p&gt;The initial mistake was treating memory as if it were just a search
engine.&lt;/p&gt;

&lt;p&gt;A search engine can return texts similar to what you just asked. That is
useful for finding loose information, but it is not enough to understand a
work process.&lt;/p&gt;

&lt;p&gt;When an agent solves a task, what matters is not finding a similar
sentence. What matters is being able to reconstruct what happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what information the agent had when it made a decision;&lt;/li&gt;
&lt;li&gt;which solution attempts it tried;&lt;/li&gt;
&lt;li&gt;which attempt failed, and why;&lt;/li&gt;
&lt;li&gt;what new data changed the direction of the work;&lt;/li&gt;
&lt;li&gt;what sequence of steps led to the final solution;&lt;/li&gt;
&lt;li&gt;what evidence justifies each conclusion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That difference shaped the kernel. I did not want to build another
mechanism for searching text. I wanted to build navigable memory.&lt;/p&gt;

&lt;p&gt;That is why KMP does not expose a vector database API. It exposes memory
movements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingest   -&amp;gt; registrar memoria
wake     -&amp;gt; recuperar el estado necesario para continuar
ask      -&amp;gt; preguntar a la memoria con evidencia
goto     -&amp;gt; ir a un momento o referencia concreta
near     -&amp;gt; ver qué ocurrió alrededor de un momento o referencia
rewind   -&amp;gt; moverse hacia atrás
forward  -&amp;gt; moverse hacia delante
trace    -&amp;gt; explicar una ruta de relaciones
inspect  -&amp;gt; inspeccionar un nodo de memoria
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
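&lt;p&gt;As a rough sketch, the ingest and ask movements could look like this against a toy in-memory store. Everything here (&lt;code&gt;ToyKernel&lt;/code&gt;, &lt;code&gt;MemoryEntry&lt;/code&gt;, the sample entry) is an illustrative assumption, not the real KMP API:&lt;/p&gt;

```python
# Toy sketch of the ingest/ask movements over an in-memory store.
# ToyKernel and MemoryEntry are illustrative assumptions, not KMP's API.
from dataclasses import dataclass, field

@dataclass
class MemoryEntry:
    about: str                 # the memory case this entry belongs to
    text: str                  # what was recorded
    evidence: list = field(default_factory=list)  # sources backing it

class ToyKernel:
    def __init__(self) -> None:
        self._entries: list[MemoryEntry] = []

    def ingest(self, entry: MemoryEntry) -> None:
        """Record memory (the `ingest` movement)."""
        self._entries.append(entry)

    def ask(self, about: str, keyword: str) -> list[MemoryEntry]:
        """Question the memory; every hit carries its evidence (`ask`)."""
        return [e for e in self._entries
                if e.about == about and keyword in e.text]

kernel = ToyKernel()
kernel.ingest(MemoryEntry("incident-1",
                          "retry with backoff fixed the timeout",
                          evidence=["log:1423"]))
hits = kernel.ask("incident-1", "timeout")
```

&lt;p&gt;Note that even in this toy, an answer never travels without the evidence that backs it.&lt;/p&gt;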



&lt;p&gt;The idea is to keep the core of the system small. KMP does not try to be
the agent or to decide the final answer. Its responsibility is different:
storing structured memory, allowing it to be traversed deterministically, and
returning auditable evidence.&lt;/p&gt;

&lt;p&gt;Final-answer generation, business rules, and domain plugins can be added
around KMP without pushing them into that core.&lt;/p&gt;
&lt;h2&gt;The mental model&lt;/h2&gt;

&lt;p&gt;The central object in Underpass KMP is an &lt;code&gt;about&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;about&lt;/code&gt; is the case, topic, or memory world being worked on.
It can be an incident, a task, a customer, a benchmark, a repository, a user,
or an agent's long-running process.&lt;/p&gt;

&lt;p&gt;Within that &lt;code&gt;about&lt;/code&gt;, memory does not live on a single line. It
can be split into dimensions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about
  dimension: session
  dimension: agent
  dimension: task
  dimension: entity
  dimension: preference
  dimension: attempt
  dimension: incident_phase
  dimension: success_path
  dimension: failure_path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A dimension can represent a session, an agent, a task, an entity, a
solution attempt, or a phase of the process.&lt;/p&gt;

&lt;p&gt;Time is not just another dimension.&lt;/p&gt;

&lt;p&gt;Time is what makes it possible to ask what was known before a step, what
changed afterward, or what information did not yet exist when a decision was
made.&lt;/p&gt;

&lt;p&gt;The mental model looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about -&amp;gt; caso o mundo de memoria
dimensions -&amp;gt; planos de memoria dentro de ese caso
time -&amp;gt; eje temporal que atraviesa esos planos
relations -&amp;gt; por qué dos elementos están conectados
evidence -&amp;gt; evidencias que sostienen la memoria
provenance -&amp;gt; quién lo observó o escribió, y cuándo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visually, a KMP memory looks more like this than like a list of messages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjt9wk79r1f0q9iba284j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjt9wk79r1f0q9iba284j.jpg" alt="Memoria KMP multidimensional atravesada por el tiempo" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1. A single &lt;code&gt;about&lt;/code&gt; can have several dimensions
crossed by time. Blue arrows are semantic relations; dashed ones show
continuity within a dimension.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This model matters because an agent's memory is rarely linear.&lt;/p&gt;

&lt;p&gt;A long task can involve several agents. Each agent can have its own
session. Each session can produce hypotheses, failed attempts, tool results,
and final decisions. A useful memory layer must allow looking at a single
dimension, several dimensions, or the whole case, while making clear at all
times which scope is being queried.&lt;/p&gt;
&lt;h2&gt;Why dimensions need a namespace&lt;/h2&gt;

&lt;p&gt;One of the important implementation decisions was making &lt;code&gt;about&lt;/code&gt;
act as the namespace for dimensions.&lt;/p&gt;

&lt;p&gt;When a client ingests memory, &lt;code&gt;IngestRequest.about&lt;/code&gt; defines the
default scope. Internally, the real identity of a dimension is equivalent to
something like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about:&amp;lt;about&amp;gt;:dimension:&amp;lt;dimension_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
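&lt;p&gt;A minimal sketch of that identity scheme (the helper name is an assumption; only the key shape comes from the article):&lt;/p&gt;

```python
# Compose the namespaced identity of a dimension inside its `about`.
# The helper name is an assumption; the key shape follows the article.
def dimension_key(about: str, dimension_id: str) -> str:
    return f"about:{about}:dimension:{dimension_id}"

# Two tasks can both have a `session:1` dimension without colliding:
a = dimension_key("task-alpha", "session:1")
b = dimension_key("task-beta", "session:1")
```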



&lt;p&gt;It may look like a small detail, but it prevents important mistakes.&lt;/p&gt;

&lt;p&gt;If two different tasks each have a dimension called &lt;code&gt;session:1&lt;/code&gt;,
I do not want them to get mixed up by accident. By nesting the dimension
inside its &lt;code&gt;about&lt;/code&gt;, each &lt;code&gt;session:1&lt;/code&gt; belongs to the case
it corresponds to.&lt;/p&gt;

&lt;p&gt;Reads are explicit as well:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CURRENT_ABOUT&lt;/code&gt; queries the current case;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ABOUTS&lt;/code&gt; queries a specific list of cases;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ALL_ABOUTS&lt;/code&gt; queries all cases, but only when the client asks for it
intentionally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If someone requests &lt;code&gt;ABOUTS&lt;/code&gt; without indicating which cases to
query, the kernel rejects the request. And if someone requests
&lt;code&gt;ALL_ABOUTS&lt;/code&gt;, it is clear they are asking to cross all available
memories.&lt;/p&gt;

&lt;p&gt;The reason is simple: a query that appeared limited to one case should not
end up mixing in memory from other cases by accident.&lt;/p&gt;
&lt;h2&gt;Protocol first, tools later&lt;/h2&gt;

&lt;p&gt;MCP is a convenient way for a model to use tools. For example, it lets an
LLM call operations such as &lt;code&gt;kernel_ask&lt;/code&gt;, &lt;code&gt;kernel_near&lt;/code&gt;,
&lt;code&gt;kernel_trace&lt;/code&gt;, or &lt;code&gt;kernel_inspect&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is very useful, but I did not want MCP to define how memory works.&lt;/p&gt;

&lt;p&gt;The rule had to live somewhere more stable: KMP. In the current
implementation, those operations are exposed through the typed gRPC service
&lt;code&gt;KernelMemoryService&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The advantage of separating the two is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an LLM can use KMP through MCP tools;&lt;/li&gt;
&lt;li&gt;an application can call the gRPC service directly;&lt;/li&gt;
&lt;li&gt;an HTTP API or an SDK may exist in the future;&lt;/li&gt;
&lt;li&gt;all of those paths must do the same thing when they question, traverse,
trace, or inspect memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project keeps a hexagonal architecture precisely so those entry points
can be swapped. The main API is gRPC. MCP is the agentic entry point: the way
to expose those same operations so an AI can use them as tools without getting
confused.&lt;/p&gt;

&lt;p&gt;I have been very careful about parity between MCP and gRPC. Both entry
points must respect the same behavior. And if a REST API, an SDK, or another
kind of integration appears tomorrow, it should be possible to add it as
another entry point to the same protocol, not as a different version of the
memory.&lt;/p&gt;

&lt;p&gt;The principle is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KMP define la semántica de memoria.
gRPC, MCP, HTTP, SDKs y CLIs son formas de usar esa semántica.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The separation looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jd1z7xj93vv5jia23nb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jd1z7xj93vv5jia23nb.jpg" alt="KMP as a common protocol for MCP, gRPC, and future entry points" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2. MCP, gRPC, and future entry points operate on the same
memory semantics defined by KMP.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;Time is not just another filter&lt;/h2&gt;

&lt;p&gt;In a useful memory it is not enough to store what was said. It also
matters when it was said and in what order the information appeared.&lt;/p&gt;

&lt;p&gt;An answer can make sense given the information available at one moment and
become obsolete later. A decision can be reasonable before receiving a tool's
result, yet stop being reasonable when a new piece of data appears. Even a
failed attempt can be valuable if it explains why another solution was chosen
afterward.&lt;/p&gt;

&lt;p&gt;That is why KMP does not treat time as a secondary filter. It turns it
into a way of navigating memory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;goto&lt;/code&gt; jumps to a concrete moment or reference;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;near&lt;/code&gt; shows what happened around it;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rewind&lt;/code&gt; looks backward;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;forward&lt;/code&gt; advances from a point;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;trace&lt;/code&gt; explains a path of relations and evidence;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inspect&lt;/code&gt; reviews the details of a node.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way there is no need to ask an LLM to reread a huge conversation and
guess what happened. A person or a model can move through the memory with
explicit, reproducible operations.&lt;/p&gt;

&lt;p&gt;For a person, the process becomes inspectable. For an AI, the memory
becomes something it can use through tools.&lt;/p&gt;
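&lt;p&gt;The time movements can be sketched as deterministic operations over an ordered timeline. The cursor model and the sample events below are illustrative assumptions, not the real KMP implementation:&lt;/p&gt;

```python
# Toy timeline showing goto, near, rewind, and forward as deterministic
# operations. The cursor model is an illustrative assumption.
events = [  # (timestamp, what happened), already in time order
    (1, "hypothesis A recorded"),
    (2, "tool run failed"),
    (3, "hypothesis B recorded"),
    (4, "tool run succeeded"),
    (5, "final answer written"),
]

def goto(t):
    """Jump to the event at a concrete moment."""
    return next(i for i, (ts, _) in enumerate(events) if ts == t)

def near(i, window=1):
    """See what happened around a position."""
    return events[max(0, i - window): i + window + 1]

def rewind(i, steps=1):
    """Move backward from a position."""
    return max(0, i - steps)

def forward(i, steps=1):
    """Move forward from a position."""
    return min(len(events) - 1, i + steps)

i = goto(3)                  # land on "hypothesis B recorded"
ctx = near(i)                # the failed run, hypothesis B, the success
prev = events[rewind(i)][1]  # the step immediately before
```

&lt;p&gt;Running the same navigation twice yields the same path, which is what makes the traversal reproducible for both people and models.&lt;/p&gt;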
&lt;h2&gt;Writing memory well is the hard part&lt;/h2&gt;

&lt;p&gt;But everything above depends on one condition: the memory has to be well
written.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;goto&lt;/code&gt;, &lt;code&gt;near&lt;/code&gt;, &lt;code&gt;rewind&lt;/code&gt;, &lt;code&gt;forward&lt;/code&gt;, &lt;code&gt;trace&lt;/code&gt;, and &lt;code&gt;inspect&lt;/code&gt; are only useful if
what was stored has enough structure. To traverse a memory later, it first has
to be written well.&lt;/p&gt;

&lt;p&gt;Storing unstructured text is not enough. That makes it possible to search
for sentences later, but not to properly reconstruct the process: which step
depended on another, which decision corrected an earlier one, what evidence
justified a conclusion, or which attempt was discarded.&lt;/p&gt;

&lt;p&gt;That is why writing is as important as reading.&lt;/p&gt;

&lt;p&gt;Writing memory in KMP means recording entries, relations, evidence,
dimensions, and time. It also means deciding how a new piece of memory
connects to what already existed.&lt;/p&gt;

&lt;p&gt;An important boundary appears there. The kernel has no inference
responsibility. Inference is done by whoever uses it: a person, an agent, a
model, or an adapter.&lt;/p&gt;

&lt;p&gt;Writing to KMP is not just adding text. You also have to say which memory
that text connects to and why it connects there. That relation is part of the
memory, not a secondary detail. The kernel must validate what is written and
make it traversable; it must not invent the meaning of what happened.&lt;/p&gt;

&lt;p&gt;I call the piece that writes memory a writer. It can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a person;&lt;/li&gt;
&lt;li&gt;an agent;&lt;/li&gt;
&lt;li&gt;a model using MCP;&lt;/li&gt;
&lt;li&gt;a benchmark adapter;&lt;/li&gt;
&lt;li&gt;a future model specialized in writing memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The writer is the one who must say why a new entry connects to earlier
memory. The kernel checks that the relation is valid, that it stays within the
correct scope, that it has evidence, and that it can be audited later.&lt;/p&gt;

&lt;p&gt;The write flow looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdplkvqdf9m9e0hulqjt5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdplkvqdf9m9e0hulqjt5.jpg" alt="Memory write flow in KMP" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3. The writer decides meaning and relations. KMP validates what
is written, but does not infer on its own what it means.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This separation led to two ways of writing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kernel_ingest       -&amp;gt; escritura canónica de bajo nivel
kernel_write_memory -&amp;gt; ayuda para writers, que acaba convirtiéndose en ingest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kernel_ingest&lt;/code&gt; is the strict entry point. It receives memory
that is already structured.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kernel_write_memory&lt;/code&gt; is more convenient for a writer. It lets
the writer express a new entry and its connections, while still validating the
quality of what is about to be written:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relation name;&lt;/li&gt;
&lt;li&gt;semantic class;&lt;/li&gt;
&lt;li&gt;target node reference;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;why&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;evidence;&lt;/li&gt;
&lt;li&gt;context read before writing;&lt;/li&gt;
&lt;li&gt;fallback quality.&lt;/li&gt;
&lt;/ul&gt;
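&lt;p&gt;As a sketch, that validation could amount to checking a proposed relation for the fields listed above. The checker and the sample payloads are assumptions; only the field names come from the article:&lt;/p&gt;

```python
# Sketch of the quality checks a write helper could apply to a proposed
# relation; the field names follow the article, the checker is an assumption.
REQUIRED = ("relation", "semantic_class", "target_ref", "why", "evidence")

def check_relation(rel: dict) -> list[str]:
    """Return the list of missing fields; empty means acceptable."""
    return [f for f in REQUIRED if not rel.get(f)]

good = {"relation": "refines", "semantic_class": "revision",
        "target_ref": "node:17", "why": "corrects the earlier estimate",
        "evidence": ["tool:run-9"]}
bad = {"relation": "supports_answer"}  # connected, but explains nothing

missing_good = check_relation(good)
missing_bad = check_relation(bad)
```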

&lt;p&gt;This matters because a memory full of vague relations is not worth much.&lt;/p&gt;

&lt;p&gt;If every relation says &lt;code&gt;supports_answer&lt;/code&gt;, the memory is
connected, but it explains nothing. It does not say whether an entry depends
on a previous answer, contradicts it, refines it, replaces it, or simply
appears near it.&lt;/p&gt;

&lt;p&gt;In KMP, the quality of the relations is part of the quality of the memory.&lt;/p&gt;
&lt;h2&gt;Relations must be honest&lt;/h2&gt;

&lt;p&gt;The opposite risk also exists: inventing relations that are too rich.&lt;/p&gt;

&lt;p&gt;A writer must not create seemingly intelligent edges just to make the
graph look better. If it cannot justify a relation from the context it has
observed, it must fall back to a simpler, anemic, or structural relation.&lt;/p&gt;

&lt;p&gt;That fallback is not a failure. It is an honest signal.&lt;/p&gt;

&lt;p&gt;A good memory system must be able to say:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sé que estos nodos están relacionados por orden o cercanía.
Todavía no conozco una razón semántica más fuerte.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This yields metrics I can inspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rich relations;&lt;/li&gt;
&lt;li&gt;anemic relations;&lt;/li&gt;
&lt;li&gt;structural relations;&lt;/li&gt;
&lt;li&gt;suspicious or rejected relations;&lt;/li&gt;
&lt;li&gt;prior context observed before writing;&lt;/li&gt;
&lt;li&gt;evidence coverage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These metrics give me a practical way to improve the writer without hiding
the uncertainty.&lt;/p&gt;
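&lt;p&gt;A minimal sketch of such metrics, counting edges by quality class. The sample data is invented; the classes are the ones listed above:&lt;/p&gt;

```python
# Sketch of relation-quality metrics: counting rich, anemic, and structural
# edges keeps the writer's uncertainty visible. Sample data is invented.
from collections import Counter

relations = [
    {"name": "refines", "quality": "rich"},
    {"name": "follows", "quality": "structural"},  # order/proximity only
    {"name": "related_to", "quality": "anemic"},   # honest fallback
    {"name": "contradicts", "quality": "rich"},
]

metrics = Counter(r["quality"] for r in relations)
```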
&lt;h2&gt;The boundary between memory and interpretation&lt;/h2&gt;

&lt;p&gt;To measure the quality of KMP I have been working mainly with two kinds of
benchmarks.&lt;/p&gt;

&lt;p&gt;MemoryArena interests me because it is closer to the kind of memory I want
to build: tasks with multiple steps, attempts, feedback, changes of course,
and memory that must be reused later.&lt;/p&gt;

&lt;p&gt;LongMemEval interests me for a different reason. It is more
conversational, but it stresses a very useful case: retrieving evidence
scattered across many sessions and checking whether the system knows how to
use it to answer.&lt;/p&gt;

&lt;p&gt;That comparison made something else clear: a memory can serve many
different use cases, and not all of them require the same kind of
interpretation.&lt;/p&gt;

&lt;p&gt;The kernel can retrieve the right evidence and still not produce the final
answer if the reader has to do domain work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;adding up money;&lt;/li&gt;
&lt;li&gt;counting entities;&lt;/li&gt;
&lt;li&gt;deduplicating events;&lt;/li&gt;
&lt;li&gt;choosing the most recent value;&lt;/li&gt;
&lt;li&gt;comparing dates;&lt;/li&gt;
&lt;li&gt;normalizing code, URLs, or currencies;&lt;/li&gt;
&lt;li&gt;deciding whether an amount is paid, planned, canceled, or merely mentioned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where plugins come in.&lt;/p&gt;

&lt;p&gt;In this context, a plugin is a specialized piece that interprets the
evidence the kernel has retrieved. For example: detecting amounts, adding up
money, comparing dates, counting entities, recognizing URLs, identifying code,
or resolving which value is the most recent.&lt;/p&gt;

&lt;p&gt;The reason for introducing plugins is not to win a specific benchmark. It
is to be able to adapt the memory to different use cases without pushing all
those rules into KMP's core.&lt;/p&gt;

&lt;p&gt;I do not want to contaminate the kernel with logic specific to a
benchmark, to money, to dates, to preferences, or to any other domain. The
kernel must remain use-case agnostic: it stores memory, relations, time,
evidence, and traces. Specialized interpretation must live outside.&lt;/p&gt;

&lt;p&gt;The kernel must retrieve memory and evidence reliably. Plugins and readers
can then work on that evidence to solve domain operations.&lt;/p&gt;

&lt;p&gt;The separation is this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kernel -&amp;gt; memoria, recorrido, prueba e inspección
plugins -&amp;gt; extracción de valores tipados y operaciones deterministas
lector -&amp;gt; construcción de respuesta y política de tarea
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
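&lt;p&gt;As a sketch, a domain plugin interprets evidence the kernel retrieved without that logic ever entering the core. The money-summing plugin below is an illustrative assumption:&lt;/p&gt;

```python
# Sketch of a domain plugin: extract typed dollar amounts from evidence
# the kernel retrieved and sum them deterministically. The plugin shape
# is an assumption; the kernel itself stays domain-agnostic.
import re

def money_plugin(evidence_lines: list[str]) -> float:
    """Extract dollar amounts from evidence text and add them up."""
    total = 0.0
    for line in evidence_lines:
        for amount in re.findall(r"\$(\d+(?:\.\d+)?)", line):
            total += float(amount)
    return total

evidence = ["paid $120.50 for hosting", "paid $30 for the domain"]
total = money_plugin(evidence)
```

&lt;p&gt;The reader can then use this typed total when building the final answer, while the kernel only guarantees that the evidence lines were the right ones.&lt;/p&gt;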



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qvy7knyt67iqovt7q4s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qvy7knyt67iqovt7q4s.jpg" alt="Memory read flow in KMP" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4. KMP retrieves traceable evidence. Plugins interpret typed
values and the reader builds the final answer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This distinction is central.&lt;/p&gt;

&lt;p&gt;Underpass KMP must not become a solution tailor-made for a benchmark or
for a specific domain. It must do its own part well: retrieving memory,
evidence, and relations reliably so that readers, plugins, and future
specialized models can build on top.&lt;/p&gt;
&lt;h2&gt;Why it matters for agents&lt;/h2&gt;

&lt;p&gt;An agent's memory should not exist only to answer a question by looking
at old conversations.&lt;/p&gt;

&lt;p&gt;The really interesting part appears when an AI works across several steps:
it tests a hypothesis, uses tools, makes mistakes, corrects course, receives
new information, and finally reaches a solution. There, memory is not a text
archive. It is the navigable record of how something was solved.&lt;/p&gt;

&lt;p&gt;With a memory like that, a person or a model can go back over the process
and ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what was known before a decision was made;&lt;/li&gt;
&lt;li&gt;which solution attempt failed;&lt;/li&gt;
&lt;li&gt;what new data changed the course;&lt;/li&gt;
&lt;li&gt;which agent introduced a wrong assumption;&lt;/li&gt;
&lt;li&gt;why a later answer replaced an earlier one;&lt;/li&gt;
&lt;li&gt;what sequence of steps led to the final solution;&lt;/li&gt;
&lt;li&gt;what evidence justifies the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where multidimensional, temporal memory becomes useful. Each agent
can be a dimension. Each session, task, entity, attempt, or phase of the work
can be another. Time makes it possible to cut across them and understand how
the state of the process changed.&lt;/p&gt;

&lt;p&gt;The graph is not a decorative visualization. It is the shape of the
process: what happened, in what order, connected to what, and why.&lt;/p&gt;
&lt;h2&gt;Observability is not optional&lt;/h2&gt;

&lt;p&gt;If agent memory is infrastructure, it must be observable.&lt;/p&gt;

&lt;p&gt;I need to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether a write actually became queryable;&lt;/li&gt;
&lt;li&gt;how long the projection took;&lt;/li&gt;
&lt;li&gt;what scope a query used;&lt;/li&gt;
&lt;li&gt;how many references were inspected;&lt;/li&gt;
&lt;li&gt;whether &lt;code&gt;trace&lt;/code&gt; pagination worked;&lt;/li&gt;
&lt;li&gt;whether the proof was complete;&lt;/li&gt;
&lt;li&gt;whether a reader ignored correct evidence;&lt;/li&gt;
&lt;li&gt;whether a writer created rich, anemic, or suspicious relations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the kernel records structured KMP and MCP logs, OTel metrics
for KMP calls, projection processing latency, relation quality metrics, and
explicit &lt;code&gt;inspect&lt;/code&gt; and &lt;code&gt;trace&lt;/code&gt; behavior.&lt;/p&gt;

&lt;p&gt;The operational goal is simple:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Una respuesta fallida de un agente debe poder clasificarse.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The possible classes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingestion gap;&lt;/li&gt;
&lt;li&gt;projection gap;&lt;/li&gt;
&lt;li&gt;retrieval gap;&lt;/li&gt;
&lt;li&gt;proof gap;&lt;/li&gt;
&lt;li&gt;reader consumption gap;&lt;/li&gt;
&lt;li&gt;task reasoning gap.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without that classification, every failure looks the same: "the AI got it
wrong." That is not enough for agents in production.&lt;/p&gt;
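&lt;p&gt;The classification can be sketched as walking the pipeline stage by stage and stopping at the first broken one. The diagnostic flags are assumptions; the gap classes are the ones listed above:&lt;/p&gt;

```python
# Sketch of the failure taxonomy: classify a failed answer by walking the
# pipeline stage by stage. The diagnostic flags are assumptions; the gap
# classes follow the article.
def classify_failure(d: dict) -> str:
    if not d.get("ingested"):
        return "ingestion gap"
    if not d.get("projected"):
        return "projection gap"
    if not d.get("retrieved"):
        return "retrieval gap"
    if not d.get("proof_complete"):
        return "proof gap"
    if not d.get("reader_used_evidence"):
        return "reader consumption gap"
    return "task reasoning gap"

# Evidence was retrieved with a complete proof, but the reader ignored it:
gap = classify_failure({"ingested": True, "projected": True,
                        "retrieved": True, "proof_complete": True,
                        "reader_used_evidence": False})
```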
&lt;h2&gt;Security and auditability&lt;/h2&gt;

&lt;p&gt;A navigable memory can also be a sensitive memory.&lt;/p&gt;

&lt;p&gt;If the system can reconstruct what happened, who said it, what decision
was made, and what evidence justified it, then it must also control very
carefully who can see each thing and at what level of detail.&lt;/p&gt;

&lt;p&gt;Asking for a summary is not the same as asking for the raw memory.
Querying the current case is not the same as crossing memory from many cases.
And it is not acceptable for logs or traces to end up exposing secrets,
credentials, full prompts, or content that did not need to come out.&lt;/p&gt;

&lt;p&gt;That is why KMP treats security and auditing as part of the design, not as
an afterthought:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API boundaries are typed;&lt;/li&gt;
&lt;li&gt;reads have explicit scope;&lt;/li&gt;
&lt;li&gt;raw inspection is a deliberate option;&lt;/li&gt;
&lt;li&gt;errors fail fast instead of triggering a silent fallback;&lt;/li&gt;
&lt;li&gt;references, evidence, and relations are designed to be auditable;&lt;/li&gt;
&lt;li&gt;TLS/mTLS is used at the infrastructure boundaries that support it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is for a person to be able to review why the system returned an
answer without having to open the entire memory, while the system, at the
same time, exposes no more information than necessary.&lt;/p&gt;
&lt;h2&gt;What Underpass KMP promises&lt;/h2&gt;

&lt;p&gt;Before talking about results, it is worth making clear what KMP promises
and what it does not try to solve.&lt;/p&gt;

&lt;p&gt;Underpass KMP is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a general replacement for a vector database;&lt;/li&gt;
&lt;li&gt;a final-answer generator;&lt;/li&gt;
&lt;li&gt;a solution tailor-made for benchmarks;&lt;/li&gt;
&lt;li&gt;a hidden agent framework;&lt;/li&gt;
&lt;li&gt;a guarantee that any model will interpret the evidence correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a deterministic, auditable memory layer. Its job is to preserve
enough structure so that people, agents, plugins, readers, and future
specialized models can work on the memory without rereading everything from
scratch.&lt;/p&gt;
&lt;h2&gt;
  
  
  Benchmarks: What I Learned
&lt;/h2&gt;

&lt;p&gt;I have been careful not to claim more than the current evidence supports.&lt;/p&gt;

&lt;p&gt;The most important early result is not "the kernel wins every memory&lt;br&gt;
benchmark". What matters is that the kernel makes visible a boundary that used&lt;br&gt;
to be blurry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Did memory retrieval fail, or did the reader fail while using correct evidence?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;In a public-TLS MemoryArena run of 100 tasks, with progressive search and the&lt;br&gt;
smart-writer, the kernel achieved:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Métrica&lt;/th&gt;
&lt;th&gt;Resultado&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Eventos KMP correctos&lt;/td&gt;
&lt;td&gt;2259/2259&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consultas known-at-clean&lt;/td&gt;
&lt;td&gt;753/753&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-ref recall&lt;/td&gt;
&lt;td&gt;753/753&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fugas de respuestas futuras&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Score local alineado con el paper&lt;/td&gt;
&lt;td&gt;97/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fallos finales&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 3 final failures were classified as reader answer-selection failures over&lt;br&gt;
complete evidence, not as kernel retrieval failures or graph contamination.&lt;/p&gt;

&lt;p&gt;In a realistic MemoryArena 2x/domain slice, the kernel achieved:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Métrica&lt;/th&gt;
&lt;th&gt;Resultado&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Eventos KMP correctos&lt;/td&gt;
&lt;td&gt;221/221&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Consultas known-at-clean&lt;/td&gt;
&lt;td&gt;73/73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-ref recall&lt;/td&gt;
&lt;td&gt;73/73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fugas futuras&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Referencias inesperadas&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Referencias perdidas&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining task failures were reader or agent gaps, not evidence gaps.&lt;/p&gt;

&lt;p&gt;LongMemEval taught a different lesson. In a 30-item multi-session smart-writer&lt;br&gt;
slice, the retrieved evidence was complete, but the same evidence scored&lt;br&gt;
differently depending on the reader:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reader&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;22/30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;25/30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In a 100-item test using an external model for embeddings and derivations, the&lt;br&gt;
boundary showed up again:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Medida&lt;/th&gt;
&lt;th&gt;Resultado&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Recall amplio de evidencia&lt;/td&gt;
&lt;td&gt;~99%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;QA agregado multi-session oficial end-to-end&lt;/td&gt;
&lt;td&gt;71,7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining failures were mostly structured-operand problems: missed&lt;br&gt;
counting predicates, omitted qualifying evidence, or comparison errors.&lt;/p&gt;

&lt;p&gt;To me, that information is valuable.&lt;/p&gt;

&lt;p&gt;It tells me that the next improvement is not hiding more logic inside the&lt;br&gt;
kernel. The next improvement lies in retrieving candidates better, reordering&lt;br&gt;
them with a reranker, extracting typed operands, and using reusable domain&lt;br&gt;
plugins.&lt;/p&gt;
&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;The next step is to keep validating the idea with real cases and to make the&lt;br&gt;
kernel easier to use.&lt;/p&gt;

&lt;p&gt;In the short term, the work is practical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stronger runs on MemoryArena and MemoryAgentBench;&lt;/li&gt;
&lt;li&gt;official-style LongMemEval regression as a secondary benchmark;&lt;/li&gt;
&lt;li&gt;hybrid candidate retrieval behind ports;&lt;/li&gt;
&lt;li&gt;reranking experiments;&lt;/li&gt;
&lt;li&gt;visual graph and timeline exploration for traversing memory;&lt;/li&gt;
&lt;li&gt;better proof and traversal observability;&lt;/li&gt;
&lt;li&gt;stable pagination, limits, and scopes in KMP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the medium term, the direction gets more interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a small model specialized in operating kernel tools, trained on audited MCP
trajectories;&lt;/li&gt;
&lt;li&gt;process queries such as &lt;code&gt;known_at&lt;/code&gt;, &lt;code&gt;why&lt;/code&gt;, &lt;code&gt;failed_paths&lt;/code&gt;,
&lt;code&gt;final_path&lt;/code&gt;, and &lt;code&gt;best_path&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;reusable interpretation plugins for money, dates, counts, URLs, code, and
domain-specific operators;&lt;/li&gt;
&lt;li&gt;conformance tests so that kernel semantics stay independent of the storage
backend;&lt;/li&gt;
&lt;li&gt;public visual demos that reproduce an agent process as a graph and a
timeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operator model seems especially important to me. It would not be a general&lt;br&gt;
agent, nor a model that "understands memory" magically. It would be a small&lt;br&gt;
specialist trained to use KMP efficiently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Which tool should I call now?
With which scoped arguments?
Should I inspect, trace, move through time, or stop?
Which references prove that I have enough evidence?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is a narrow, measurable problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Product Thesis
&lt;/h2&gt;

&lt;p&gt;The thesis behind Underpass KMP is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reliable agents need memory they can navigate, not just context they can
retrieve.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That memory must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoped by what it is about;&lt;/li&gt;
&lt;li&gt;split into meaningful dimensions;&lt;/li&gt;
&lt;li&gt;traversable through time;&lt;/li&gt;
&lt;li&gt;connected by honest relations;&lt;/li&gt;
&lt;li&gt;backed by evidence;&lt;/li&gt;
&lt;li&gt;inspectable by people;&lt;/li&gt;
&lt;li&gt;usable through tools by LLMs;&lt;/li&gt;
&lt;li&gt;observable and auditable in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why I am building Kernel Memory Protocol: so that an agent's memory is&lt;br&gt;
not just accumulated text, but a structure that can be traversed, inspected,&lt;br&gt;
and reused.&lt;/p&gt;

&lt;p&gt;This is not about making prompts longer. It is the opposite: rebuilding the&lt;br&gt;
useful context without forcing the model to read all the raw material, and&lt;br&gt;
making token consumption smart, measurable, and auditable.&lt;/p&gt;

&lt;p&gt;The goal is to turn agent memory into a real working layer.&lt;/p&gt;

&lt;p&gt;If you are interested in this line of work, you can check out the&lt;br&gt;
&lt;a href="https://github.com/underpass-ai/rehydration-kernel" rel="noopener noreferrer"&gt;Underpass KMP&lt;/a&gt; repository. And if&lt;br&gt;
you find it useful, a star on GitHub helps give the project visibility.&lt;/p&gt;




&lt;p&gt;Written by &lt;a href="https://github.com/tgarciai" rel="noopener noreferrer"&gt;Tirso García Ibáñez&lt;/a&gt; ·&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/tirsogarcia/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; ·&lt;br&gt;
&lt;a href="https://github.com/underpass-ai" rel="noopener noreferrer"&gt;Underpass AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Underpass KMP is part of the Underpass AI project. The repository is licensed&lt;br&gt;
under the &lt;a href="https://github.com/underpass-ai/rehydration-kernel/blob/main/LICENSE" rel="noopener noreferrer"&gt;Apache License 2.0&lt;/a&gt;,&lt;br&gt;
unless otherwise noted.&lt;/p&gt;

&lt;p&gt;Copyright © 2026 Tirso García Ibáñez.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building Kernel Memory Protocol: Navigable Memory for AI Agents</title>
      <dc:creator>Tirso García</dc:creator>
      <pubDate>Sun, 10 May 2026 14:20:36 +0000</pubDate>
      <link>https://dev.to/tirsogarcia/building-kernel-memory-protocol-navigable-memory-for-ai-agents-315j</link>
      <guid>https://dev.to/tirsogarcia/building-kernel-memory-protocol-navigable-memory-for-ai-agents-315j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Versión en español: &lt;a href="https://dev.to/tirsogarcia/construyendo-kernel-memory-protocol-memoria-navegable-para-agentes-de-ia-24lc"&gt;Construyendo Kernel Memory Protocol: memoria navegable para agentes de IA&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The hard part with many AI agents is not the amount of text in the prompt.&lt;br&gt;
The hard part is that they do not have memory they can query, traverse, and&lt;br&gt;
audit.&lt;/p&gt;

&lt;p&gt;Most current approaches try to solve this in one of three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;copying parts of previous conversations into the next prompt;&lt;/li&gt;
&lt;li&gt;searching similar chunks with embeddings;&lt;/li&gt;
&lt;li&gt;letting an agent framework store memory internally, often in a way that is
hard to inspect, replay, or explain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those approaches help, but they are not enough when an agent is doing real&lt;br&gt;
work. At that point, retrieving text is not the whole problem. You need to&lt;br&gt;
reconstruct the process.&lt;/p&gt;

&lt;p&gt;The important questions become different:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did the agent know when it made a decision?&lt;/li&gt;
&lt;li&gt;Which solution attempts did it try?&lt;/li&gt;
&lt;li&gt;Which attempt failed, and why?&lt;/li&gt;
&lt;li&gt;What new information changed the direction of the work?&lt;/li&gt;
&lt;li&gt;Which sequence of steps led to the final answer?&lt;/li&gt;
&lt;li&gt;Which evidence supports a decision or answer?&lt;/li&gt;
&lt;li&gt;Can a person review that evidence without reading the whole raw
conversation?&lt;/li&gt;
&lt;li&gt;Can another model navigate the same memory without knowing how it is stored
underneath?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Underpass KMP started with a smaller goal: recovering only the context an agent&lt;br&gt;
needed to continue a task without rereading the whole previous conversation. I&lt;br&gt;
called that context rehydration: taking already recorded memory and rebuilding&lt;br&gt;
only the useful part for the next step.&lt;/p&gt;

&lt;p&gt;The more I tested it, the clearer the real problem became. This was not about&lt;br&gt;
making better prompts. I needed a memory layer that could record what happened,&lt;br&gt;
when it happened, who produced it, what evidence supported it, and how it could&lt;br&gt;
be traversed later.&lt;/p&gt;

&lt;p&gt;That is where Kernel Memory Protocol, or KMP, comes from: a small, explicit API&lt;br&gt;
for writing, querying, traversing, tracing, and inspecting agent memory.&lt;/p&gt;
&lt;h2&gt;
  
  
  From Searching Chunks to Navigating Memory
&lt;/h2&gt;

&lt;p&gt;The first mistake was treating memory as if it were just search.&lt;/p&gt;

&lt;p&gt;A search system can return text that looks similar to the question you just&lt;br&gt;
asked. That is useful for finding isolated information, but it is not enough to&lt;br&gt;
understand a work process.&lt;/p&gt;

&lt;p&gt;When an agent solves a task, the key question is not only which sentence looks&lt;br&gt;
similar. The key question is what happened:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what information the agent had when it made a decision;&lt;/li&gt;
&lt;li&gt;which solution attempts it tried;&lt;/li&gt;
&lt;li&gt;which attempt failed, and why;&lt;/li&gt;
&lt;li&gt;which new data changed the direction of the work;&lt;/li&gt;
&lt;li&gt;which sequence of steps led to the final result;&lt;/li&gt;
&lt;li&gt;which evidence supports each conclusion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction shaped the kernel. I did not want to build another mechanism&lt;br&gt;
for searching text. I wanted navigable memory.&lt;/p&gt;

&lt;p&gt;That is why KMP does not expose a vector database API. It exposes memory&lt;br&gt;
operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ingest   -&amp;gt; record memory
wake     -&amp;gt; recover the state needed to continue
ask      -&amp;gt; query memory with evidence
goto     -&amp;gt; move to a specific moment or reference
near     -&amp;gt; inspect what happened around a moment or reference
rewind   -&amp;gt; move backward
forward  -&amp;gt; move forward
trace    -&amp;gt; explain a relation path
inspect  -&amp;gt; inspect a memory node
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The central system is intentionally small. KMP is not trying to be the agent,&lt;br&gt;
and it is not responsible for deciding the final answer. Its job is to store&lt;br&gt;
structured memory, make that memory traversable in a deterministic way, and&lt;br&gt;
return evidence that can be audited.&lt;/p&gt;

&lt;p&gt;Answer generation, business rules, and domain plugins can live around KMP&lt;br&gt;
without being pushed into the memory protocol itself.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Mental Model
&lt;/h2&gt;

&lt;p&gt;The central object in Underpass KMP is an &lt;code&gt;about&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;An &lt;code&gt;about&lt;/code&gt; is the case, topic, or memory world being worked on. It can be an&lt;br&gt;
incident, a task, a customer, a benchmark case, a repository, a user, or a&lt;br&gt;
long-running agent process.&lt;/p&gt;

&lt;p&gt;Inside that &lt;code&gt;about&lt;/code&gt;, memory does not need to live on a single line. It can be&lt;br&gt;
split into dimensions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about
  dimension: session
  dimension: agent
  dimension: task
  dimension: entity
  dimension: preference
  dimension: attempt
  dimension: incident_phase
  dimension: success_path
  dimension: failure_path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A dimension can represent a session, an agent, a task, an entity, a solution&lt;br&gt;
attempt, or a phase of the process.&lt;/p&gt;

&lt;p&gt;Time is not just another dimension.&lt;/p&gt;

&lt;p&gt;Time is what lets you ask what was known before a step, what changed after it,&lt;br&gt;
or which information did not exist yet when a decision was made.&lt;/p&gt;

&lt;p&gt;The mental model is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about -&amp;gt; the case or memory world
dimensions -&amp;gt; memory planes inside that case
time -&amp;gt; the temporal axis crossing those planes
relations -&amp;gt; why two memory items are connected
evidence -&amp;gt; proof attached to memory
provenance -&amp;gt; who observed or wrote it, and when
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visually, KMP memory looks more like this than like a list of messages:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytr3ihasjuwcwou0e7ab.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytr3ihasjuwcwou0e7ab.jpg" alt="IKMP multidimensional memory crossed by time" width="800" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1. A single &lt;code&gt;about&lt;/code&gt; can contain several dimensions crossed by time.&lt;br&gt;
Blue arrows are semantic relations; dashed arrows show continuity inside a&lt;br&gt;
dimension.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This matters because agent memory is rarely linear.&lt;/p&gt;

&lt;p&gt;A long task can involve several agents. Each agent can have its own session.&lt;br&gt;
Each session can produce hypotheses, failed attempts, tool results, and final&lt;br&gt;
decisions. A useful memory layer must let you look at one dimension, several&lt;br&gt;
dimensions, or the whole case, while making the query scope explicit every&lt;br&gt;
time.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Dimensions Need Namespaces
&lt;/h2&gt;

&lt;p&gt;One important implementation decision was making &lt;code&gt;about&lt;/code&gt; act as the namespace&lt;br&gt;
for dimensions.&lt;/p&gt;

&lt;p&gt;When a client ingests memory, &lt;code&gt;IngestRequest.about&lt;/code&gt; defines the default scope.&lt;br&gt;
Internally, the real identity of a dimension is equivalent to something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;about:&amp;lt;about&amp;gt;:dimension:&amp;lt;dimension_id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This may look like a small detail, but it prevents important mistakes.&lt;/p&gt;

&lt;p&gt;If two different tasks both have a dimension called &lt;code&gt;session:1&lt;/code&gt;, I do not want&lt;br&gt;
them to be mixed by accident. Once the dimension lives inside its &lt;code&gt;about&lt;/code&gt;, each&lt;br&gt;
&lt;code&gt;session:1&lt;/code&gt; belongs to the case it was created for.&lt;/p&gt;
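The namespacing rule above can be sketched directly from the identity format shown earlier. The helper function is illustrative, not the real kernel code:

```python
# Sketch of the about-as-namespace rule: the internal identity of a
# dimension is composed from its about, following the format
# about:<about>:dimension:<dimension_id> described in the article.
# The function name is hypothetical.

def dimension_key(about: str, dimension_id: str) -> str:
    """Compose the internal identity of a dimension inside its about."""
    return f"about:{about}:dimension:{dimension_id}"

# The same local dimension id yields distinct identities per case,
# so two cases with a "session:1" dimension can never mix by accident.
incident = dimension_key("incident-42", "session:1")
task = dimension_key("task-7", "session:1")

print(incident)          # about:incident-42:dimension:session:1
print(incident == task)  # False
```

Collisions become impossible by construction, instead of being something each caller has to remember to avoid.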

&lt;p&gt;Reads are explicit too:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;CURRENT_ABOUT&lt;/code&gt; queries the current case;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ABOUTS&lt;/code&gt; queries a concrete list of cases;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ALL_ABOUTS&lt;/code&gt; queries all cases, but only when the caller asks for that
intentionally.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If a caller asks for &lt;code&gt;ABOUTS&lt;/code&gt; without providing the list of cases, the kernel&lt;br&gt;
rejects the request. If a caller asks for &lt;code&gt;ALL_ABOUTS&lt;/code&gt;, the request is clearly&lt;br&gt;
global and can be audited as such.&lt;/p&gt;
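A minimal sketch of that read-scope rule, using the scope names from the article (`CURRENT_ABOUT`, `ABOUTS`, `ALL_ABOUTS`); the validation function itself is hypothetical, not the kernel's actual implementation:

```python
# Resolve a read scope to the set of cases it covers, failing fast on
# under-specified requests as described in the article.

def resolve_scope(scope, current_about, abouts=None):
    if scope == "CURRENT_ABOUT":
        # Scoped to the current case only.
        return [current_about]
    if scope == "ABOUTS":
        if not abouts:
            # Fail fast: a multi-case query must name its cases explicitly.
            raise ValueError("ABOUTS requires a non-empty list of cases")
        return list(abouts)
    if scope == "ALL_ABOUTS":
        # Explicitly global; callers opt in, and the request is auditable.
        return "*"
    raise ValueError(f"unknown scope: {scope}")

print(resolve_scope("CURRENT_ABOUT", "incident-42"))  # ['incident-42']
```

The design choice is that no scope value silently widens a query; every cross-case read is either named or explicitly global.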

&lt;p&gt;The reason is simple: a query that looked scoped to one case should not&lt;br&gt;
silently end up mixing memory from other cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  Protocol First, Tools Second
&lt;/h2&gt;

&lt;p&gt;MCP is a useful way for a model to call tools. For example, it lets an LLM use&lt;br&gt;
operations such as &lt;code&gt;kernel_ask&lt;/code&gt;, &lt;code&gt;kernel_near&lt;/code&gt;, &lt;code&gt;kernel_trace&lt;/code&gt;, and&lt;br&gt;
&lt;code&gt;kernel_inspect&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is valuable, but I did not want MCP to define how memory works.&lt;/p&gt;

&lt;p&gt;The rule belongs in a more stable place: KMP. In the current implementation,&lt;br&gt;
the same operations are exposed through the typed gRPC service&lt;br&gt;
&lt;code&gt;KernelMemoryService&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Separating those layers has a practical benefit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an LLM can use KMP through MCP tools;&lt;/li&gt;
&lt;li&gt;an application can call the gRPC service directly;&lt;/li&gt;
&lt;li&gt;a future HTTP API or SDK can expose the same behavior;&lt;/li&gt;
&lt;li&gt;all of those entry points must mean the same thing when they ask, traverse,
trace, or inspect memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The project follows a hexagonal architecture for exactly this reason: entry&lt;br&gt;
points can change without changing the memory semantics. gRPC is the main API.&lt;br&gt;
MCP is the agent-facing entry point: the way to expose the same operations to&lt;br&gt;
an AI model as tools it can use without ambiguity.&lt;/p&gt;

&lt;p&gt;I have been careful about keeping MCP and gRPC in parity. Both entry points&lt;br&gt;
must respect the same behavior. If a REST API, SDK, or another integration is&lt;br&gt;
added later, it should become another entry point into the same protocol, not a&lt;br&gt;
different version of memory.&lt;/p&gt;

&lt;p&gt;The principle is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;KMP defines memory semantics.
gRPC, MCP, HTTP, SDKs, and CLIs are ways to use those semantics.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The separation looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxak5o706rka8tarkijk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxak5o706rka8tarkijk.jpg" alt=" " width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2. MCP, gRPC, and future entry points operate over the same memory&lt;br&gt;
semantics defined by KMP.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Time Is Not Just Another Filter
&lt;/h2&gt;

&lt;p&gt;Useful memory is not only about what was said. It also matters when it was said&lt;br&gt;
and in which order the information appeared.&lt;/p&gt;

&lt;p&gt;An answer can be valid with the information available at one moment and become&lt;br&gt;
obsolete later. A decision can be reasonable before a tool result arrives and&lt;br&gt;
wrong once new data appears. Even a failed attempt can be useful if it explains&lt;br&gt;
why a different solution was chosen afterwards.&lt;/p&gt;

&lt;p&gt;That is why KMP does not treat time as a secondary filter. It makes time part&lt;br&gt;
of memory navigation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;goto&lt;/code&gt; moves to a concrete moment or reference;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;near&lt;/code&gt; shows what happened around it;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;rewind&lt;/code&gt; moves backward;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;forward&lt;/code&gt; moves forward;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;trace&lt;/code&gt; explains a path of relations and evidence;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;inspect&lt;/code&gt; exposes the details of a node.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With that, you do not need to ask an LLM to reread a huge conversation and&lt;br&gt;
guess what happened. A person or a model can move through memory with explicit,&lt;br&gt;
reproducible operations.&lt;/p&gt;

&lt;p&gt;For a person, the process becomes inspectable. For an AI model, memory becomes&lt;br&gt;
something it can operate through tools.&lt;/p&gt;
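The temporal operations above can be sketched as a cursor over an ordered timeline. The operation names mirror the article; the cursor class itself is a hypothetical illustration, not the real KMP implementation:

```python
# Deterministic temporal navigation over an ordered list of memory events.
class TimelineCursor:
    def __init__(self, events):
        self.events = events
        self.pos = 0

    def goto(self, index):
        """Move to a concrete moment."""
        self.pos = index
        return self.events[self.pos]

    def near(self, radius=1):
        """Show what happened around the current moment."""
        lo = max(0, self.pos - radius)
        return self.events[lo : self.pos + radius + 1]

    def rewind(self):
        """Move backward one step."""
        self.pos = max(0, self.pos - 1)
        return self.events[self.pos]

    def forward(self):
        """Move forward one step."""
        self.pos = min(len(self.events) - 1, self.pos + 1)
        return self.events[self.pos]

cursor = TimelineCursor(["hypothesis", "tool_result", "failed_attempt", "decision"])
print(cursor.goto(2))   # failed_attempt
print(cursor.near())    # ['tool_result', 'failed_attempt', 'decision']
print(cursor.rewind())  # tool_result
```

Because each step is an explicit operation rather than a free-form reread, the same traversal can be replayed and audited later.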
&lt;h2&gt;
  
  
  Writing Memory Well Is the Hard Part
&lt;/h2&gt;

&lt;p&gt;All of the above depends on one condition: the memory must be written well.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;goto&lt;/code&gt;, &lt;code&gt;near&lt;/code&gt;, &lt;code&gt;rewind&lt;/code&gt;, &lt;code&gt;forward&lt;/code&gt;, &lt;code&gt;trace&lt;/code&gt;, and &lt;code&gt;inspect&lt;/code&gt; are only useful if&lt;br&gt;
the stored memory has enough structure. To traverse memory later, you first&lt;br&gt;
need to write it properly.&lt;/p&gt;

&lt;p&gt;Saving unstructured text is not enough. It lets you search for phrases later,&lt;br&gt;
but it does not reconstruct the process very well: which step depended on&lt;br&gt;
another, which decision corrected an earlier one, which evidence supported a&lt;br&gt;
conclusion, or which attempt was discarded.&lt;/p&gt;

&lt;p&gt;That is why writing is as important as reading.&lt;/p&gt;

&lt;p&gt;Writing memory in KMP means recording entries, relations, evidence, dimensions,&lt;br&gt;
and time. It also means deciding how a new piece of memory connects to what was&lt;br&gt;
already there.&lt;/p&gt;

&lt;p&gt;This is an important boundary. The kernel is not responsible for inference.&lt;br&gt;
Inference belongs to whoever uses it: a person, an agent, a model, or an&lt;br&gt;
adapter.&lt;/p&gt;

&lt;p&gt;Writing to KMP is not just adding text. The writer also has to say which prior&lt;br&gt;
memory the text connects to, and why it connects there. That relation is part&lt;br&gt;
of the memory, not a secondary detail. The kernel should validate what is&lt;br&gt;
written and make it traversable; it should not invent the meaning of what&lt;br&gt;
happened.&lt;/p&gt;

&lt;p&gt;I call the piece that writes memory the writer. It can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a person;&lt;/li&gt;
&lt;li&gt;an agent;&lt;/li&gt;
&lt;li&gt;a model using MCP;&lt;/li&gt;
&lt;li&gt;a benchmark adapter;&lt;/li&gt;
&lt;li&gt;a future specialist model trained to write memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The writer decides why a new entry connects to previous memory. The kernel&lt;br&gt;
checks that the relation is valid, scoped correctly, backed by evidence, and&lt;br&gt;
auditable later.&lt;/p&gt;

&lt;p&gt;The write flow looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n3n7vs51wirfuqeejaa.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7n3n7vs51wirfuqeejaa.jpg" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 3. The writer decides meaning and relations. KMP validates what is&lt;br&gt;
written, but it does not infer meaning on its own.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That separation led to two write paths:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kernel_ingest       -&amp;gt; canonical low-level write path
kernel_write_memory -&amp;gt; writer helper that ultimately compiles to ingest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;kernel_ingest&lt;/code&gt; is the strict entry point. It receives already structured&lt;br&gt;
memory.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kernel_write_memory&lt;/code&gt; is more convenient for a writer. It lets the writer&lt;br&gt;
express a new entry and its connections, while still validating the quality of&lt;br&gt;
what is about to be written:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relation name;&lt;/li&gt;
&lt;li&gt;semantic class;&lt;/li&gt;
&lt;li&gt;target node reference;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;why&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;evidence;&lt;/li&gt;
&lt;li&gt;context read before writing;&lt;/li&gt;
&lt;li&gt;fallback quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This matters because a memory graph full of vague relations is not very useful.&lt;/p&gt;

&lt;p&gt;If every relation says &lt;code&gt;supports_answer&lt;/code&gt;, the memory is connected, but it does&lt;br&gt;
not explain anything. It does not tell you whether an entry depends on a&lt;br&gt;
previous answer, contradicts it, refines it, replaces it, or merely appears&lt;br&gt;
near it.&lt;/p&gt;

&lt;p&gt;In KMP, relation quality is part of memory quality.&lt;/p&gt;
&lt;h2&gt;
  
  
  Relations Need to Be Honest
&lt;/h2&gt;

&lt;p&gt;There is also the opposite risk: making relations look richer than they are.&lt;/p&gt;

&lt;p&gt;A writer should not create smart-looking edges just to make the graph look&lt;br&gt;
better. If it cannot justify a relation from the context it observed, it should&lt;br&gt;
fall back to a simpler, anemic, or structural relation.&lt;/p&gt;

&lt;p&gt;That fallback is not a failure. It is an honest signal.&lt;/p&gt;

&lt;p&gt;A good memory system must be able to say:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I know these nodes are related by order or proximity.
I do not yet know a stronger semantic reason.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That gives me metrics I can inspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;rich relations;&lt;/li&gt;
&lt;li&gt;anemic relations;&lt;/li&gt;
&lt;li&gt;structural relations;&lt;/li&gt;
&lt;li&gt;suspect or rejected relations;&lt;/li&gt;
&lt;li&gt;prior context observed before writing;&lt;/li&gt;
&lt;li&gt;evidence coverage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those metrics give me a practical way to improve the writer without hiding&lt;br&gt;
uncertainty.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Boundary Between Memory and Interpretation
&lt;/h2&gt;

&lt;p&gt;To measure KMP quality, I have mainly been working with two kinds of benchmarks.&lt;/p&gt;

&lt;p&gt;MemoryArena is interesting because it looks closer to the kind of memory I want&lt;br&gt;
to build: multi-step tasks, attempts, feedback, course corrections, and memory&lt;br&gt;
that has to be reused later.&lt;/p&gt;

&lt;p&gt;LongMemEval is interesting for a different reason. It is more conversational,&lt;br&gt;
but it stresses a very useful case: recovering evidence scattered across many&lt;br&gt;
sessions and checking whether the system can use it to answer.&lt;/p&gt;

&lt;p&gt;That comparison made another boundary clear: the same memory layer can support&lt;br&gt;
many use cases, and not all of them need the same kind of interpretation.&lt;/p&gt;

&lt;p&gt;The kernel can retrieve the right evidence, and the final answer can still be&lt;br&gt;
wrong if the reader has to perform domain work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;summing money;&lt;/li&gt;
&lt;li&gt;counting entities;&lt;/li&gt;
&lt;li&gt;deduplicating events;&lt;/li&gt;
&lt;li&gt;selecting the latest value;&lt;/li&gt;
&lt;li&gt;comparing dates;&lt;/li&gt;
&lt;li&gt;normalizing code, URLs, or currencies;&lt;/li&gt;
&lt;li&gt;deciding whether an amount is paid, planned, cancelled, or only mentioned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where plugins come in.&lt;/p&gt;

&lt;p&gt;In this context, a plugin is a specialized component that interprets evidence&lt;br&gt;
the kernel has already retrieved. For example: detecting amounts, summing&lt;br&gt;
money, comparing dates, counting entities, recognizing URLs, identifying code,&lt;br&gt;
or resolving the latest value.&lt;/p&gt;

&lt;p&gt;The reason for introducing plugins is not to win a specific benchmark. It is to&lt;br&gt;
adapt memory to different use cases without putting all those rules inside KMP&lt;br&gt;
itself.&lt;/p&gt;

&lt;p&gt;I do not want to contaminate the kernel with logic specific to one benchmark,&lt;br&gt;
money, dates, preferences, or any other domain. The kernel should stay&lt;br&gt;
use-case agnostic: it stores memory, relations, time, evidence, and traces.&lt;br&gt;
Specialized interpretation should live outside it.&lt;/p&gt;

&lt;p&gt;The kernel should retrieve memory and evidence reliably. Plugins and readers&lt;br&gt;
can then work on that evidence to solve domain operations.&lt;/p&gt;

&lt;p&gt;The separation is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kernel -&amp;gt; memory, traversal, proof, inspection
plugins -&amp;gt; typed value extraction and deterministic operations
reader -&amp;gt; answer construction and task policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
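&lt;p&gt;As a sketch, that separation can be expressed as a small interface boundary. The names below (&lt;code&gt;Plugin&lt;/code&gt;, &lt;code&gt;Evidence&lt;/code&gt;, &lt;code&gt;TypedValue&lt;/code&gt;) are illustrative, not the real KMP types:&lt;/p&gt;

```go
package main

import "fmt"

// Evidence is a retrieved memory fragment. The field names are
// hypothetical; the real KMP types live inside the kernel.
type Evidence struct {
	Ref  string // reference id used for proof
	Text string // raw content the kernel retrieved
}

// TypedValue is a deterministic, typed result a reader can consume.
type TypedValue struct {
	Kind  string // e.g. "money_total", "latest_date"
	Value float64
	Refs  []string // evidence refs that back the value, for audit
}

// Plugin interprets evidence the kernel has already retrieved.
// It never touches storage or traversal: that stays in the kernel.
type Plugin interface {
	Name() string
	Interpret(evidence []Evidence) (TypedValue, error)
}

// moneySumPlugin is a toy plugin: it sums amounts that appear as
// plain numbers in the evidence text.
type moneySumPlugin struct{}

func (moneySumPlugin) Name() string { return "money-sum" }

func (moneySumPlugin) Interpret(evidence []Evidence) (TypedValue, error) {
	total := 0.0
	refs := make([]string, 0, len(evidence))
	for _, e := range evidence {
		var amount float64
		if _, err := fmt.Sscanf(e.Text, "%f", &amount); err != nil {
			continue // not an amount; a real plugin would be stricter
		}
		total += amount
		refs = append(refs, e.Ref)
	}
	return TypedValue{Kind: "money_total", Value: total, Refs: refs}, nil
}

func main() {
	var p Plugin = moneySumPlugin{}
	v, _ := p.Interpret([]Evidence{{Ref: "r1", Text: "120"}, {Ref: "r2", Text: "80"}})
	fmt.Printf("%s = %.2f (refs: %v)\n", v.Kind, v.Value, v.Refs)
	// → money_total = 200.00 (refs: [r1 r2])
}
```

&lt;p&gt;The point of the boundary is that the kernel never learns what money is, and the plugin never learns how evidence was retrieved.&lt;/p&gt;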



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mwyjktjwserob359jo0.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mwyjktjwserob359jo0.jpg" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4. KMP retrieves traceable evidence. Plugins interpret typed values and&lt;br&gt;
the reader builds the final answer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This distinction is central.&lt;/p&gt;

&lt;p&gt;Underpass KMP should not become a custom solution for a benchmark or a single&lt;br&gt;
domain. It should do its part well: recover memory, evidence, and relations&lt;br&gt;
reliably so that readers, plugins, and future specialist models can work on&lt;br&gt;
top.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why This Matters for Agents
&lt;/h2&gt;

&lt;p&gt;Agent memory should not only help answer a user question by looking at old&lt;br&gt;
chat history.&lt;/p&gt;

&lt;p&gt;The more interesting case appears when an AI works through several steps: it&lt;br&gt;
tries a hypothesis, uses tools, makes a mistake, changes direction, receives&lt;br&gt;
new information, and eventually reaches a solution. In that setting, memory is&lt;br&gt;
not a text archive. It is a navigable record of how something was solved.&lt;/p&gt;

&lt;p&gt;With that kind of memory, a person or a model can go back into the process and&lt;br&gt;
ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what was known before a decision was made;&lt;/li&gt;
&lt;li&gt;which solution attempt failed;&lt;/li&gt;
&lt;li&gt;which new data changed the direction of the work;&lt;/li&gt;
&lt;li&gt;which agent introduced a wrong assumption;&lt;/li&gt;
&lt;li&gt;why a later answer replaced an earlier one;&lt;/li&gt;
&lt;li&gt;which sequence of steps led to the final solution;&lt;/li&gt;
&lt;li&gt;which evidence supports the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where multidimensional and temporal memory becomes useful. Each agent&lt;br&gt;
can be a dimension. Each session, task, entity, attempt, or work phase can be&lt;br&gt;
another. Time lets you move across them and understand how the state of the&lt;br&gt;
process changed.&lt;/p&gt;

&lt;p&gt;The graph is not decoration. It is the shape of the process: what happened, in&lt;br&gt;
which order, connected to what, and why.&lt;/p&gt;
&lt;h2&gt;
  
  
  Observability Is Not Optional
&lt;/h2&gt;

&lt;p&gt;If agent memory is infrastructure, it has to be observable.&lt;/p&gt;

&lt;p&gt;I need to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether a write became queryable;&lt;/li&gt;
&lt;li&gt;how long projection took;&lt;/li&gt;
&lt;li&gt;which scope a query used;&lt;/li&gt;
&lt;li&gt;how many references were inspected;&lt;/li&gt;
&lt;li&gt;whether &lt;code&gt;trace&lt;/code&gt; pagination worked;&lt;/li&gt;
&lt;li&gt;whether proof was complete;&lt;/li&gt;
&lt;li&gt;whether a reader ignored correct evidence;&lt;/li&gt;
&lt;li&gt;whether a writer created rich, anemic, or suspect relations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why the kernel records structured KMP and MCP logs, OTel metrics for&lt;br&gt;
KMP calls, projection processing latency, relation quality metrics, and&lt;br&gt;
explicit &lt;code&gt;inspect&lt;/code&gt; and &lt;code&gt;trace&lt;/code&gt; behavior.&lt;/p&gt;

&lt;p&gt;The operational goal is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A failed agent answer should be classifiable.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Possible classes include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingestion gap;&lt;/li&gt;
&lt;li&gt;projection gap;&lt;/li&gt;
&lt;li&gt;retrieval gap;&lt;/li&gt;
&lt;li&gt;proof gap;&lt;/li&gt;
&lt;li&gt;reader consumption gap;&lt;/li&gt;
&lt;li&gt;task reasoning gap.&lt;/li&gt;
&lt;/ul&gt;
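&lt;p&gt;Because the classes follow the pipeline order, classification can be a first-gap walk. A minimal sketch, assuming the observability signals above are reduced to booleans (the struct is hypothetical, not the real telemetry schema):&lt;/p&gt;

```go
package main

import "fmt"

// Checks summarizes what observability answered for one failed run.
// These booleans map to the signals listed above; the struct itself
// is a sketch, not the real KMP telemetry schema.
type Checks struct {
	WriteQueryable    bool // did the write become queryable?
	Projected         bool // did projection finish?
	EvidenceRetrieved bool // did retrieval return the needed refs?
	ProofComplete     bool // was the proof complete?
	ReaderUsedProof   bool // did the reader consume the evidence?
}

// classify walks the pipeline in order and returns the first gap.
func classify(c Checks) string {
	switch {
	case !c.WriteQueryable:
		return "ingestion gap"
	case !c.Projected:
		return "projection gap"
	case !c.EvidenceRetrieved:
		return "retrieval gap"
	case !c.ProofComplete:
		return "proof gap"
	case !c.ReaderUsedProof:
		return "reader consumption gap"
	default:
		return "task reasoning gap"
	}
}

func main() {
	fmt.Println(classify(Checks{WriteQueryable: true, Projected: true,
		EvidenceRetrieved: true, ProofComplete: true, ReaderUsedProof: false}))
	// → reader consumption gap
}
```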

&lt;p&gt;Without that classification, every failure looks the same: "the AI got it&lt;br&gt;
wrong". That is not good enough for production agents.&lt;/p&gt;
&lt;h2&gt;
  
  
  Security and Auditability
&lt;/h2&gt;

&lt;p&gt;Navigable memory can also be sensitive memory.&lt;/p&gt;

&lt;p&gt;If the system can reconstruct what happened, who said it, which decision was&lt;br&gt;
made, and which evidence supported it, then it must also control who can see&lt;br&gt;
each thing and at what level of detail.&lt;/p&gt;

&lt;p&gt;Asking for a summary is not the same as asking for raw memory. Querying the&lt;br&gt;
current case is not the same as crossing memory from many cases. And logs or&lt;br&gt;
traces must not casually expose secrets, credentials, complete prompts, or&lt;br&gt;
content that did not need to leave the system.&lt;/p&gt;

&lt;p&gt;That is why KMP treats security and auditability as part of the design, not as&lt;br&gt;
an afterthought:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API boundaries are typed;&lt;/li&gt;
&lt;li&gt;reads have explicit scope;&lt;/li&gt;
&lt;li&gt;raw inspection is a deliberate option;&lt;/li&gt;
&lt;li&gt;errors fail fast instead of activating silent fallback;&lt;/li&gt;
&lt;li&gt;references, evidence, and relations are designed for audit;&lt;/li&gt;
&lt;li&gt;TLS/mTLS is used on infrastructure boundaries that support it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is that a person can review why the system returned an answer without&lt;br&gt;
opening all memory, while the system avoids exposing more information than&lt;br&gt;
needed.&lt;/p&gt;
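&lt;p&gt;A sketch of what a typed, explicitly scoped read boundary could look like; the scope names and the &lt;code&gt;authorize&lt;/code&gt; check are hypothetical, not the real KMP API:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// ReadScope makes the blast radius of a query explicit. The names are
// illustrative; the real KMP scopes live in the protocol definition.
type ReadScope string

const (
	ScopeCurrentCase ReadScope = "current_case" // one case only
	ScopeCrossCase   ReadScope = "cross_case"   // memory across cases
)

// ReadRequest is a typed boundary: scope and raw inspection are
// deliberate fields, never defaults the caller can forget about.
type ReadRequest struct {
	Scope      ReadScope
	RawInspect bool // true only when raw memory is explicitly needed
}

// authorize fails fast instead of silently widening the scope.
func authorize(r ReadRequest, callerMayCross bool) error {
	if r.Scope == ScopeCrossCase && !callerMayCross {
		return errors.New("cross-case read denied: caller lacks permission")
	}
	return nil
}

func main() {
	err := authorize(ReadRequest{Scope: ScopeCrossCase}, false)
	fmt.Println(err)
}
```

&lt;p&gt;The design choice is that a denial is an explicit error, never a silent fallback to a narrower result.&lt;/p&gt;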
&lt;h2&gt;
  
  
  What Underpass KMP Promises
&lt;/h2&gt;

&lt;p&gt;Before talking about results, it is worth being clear about what KMP promises&lt;br&gt;
and what it does not try to solve.&lt;/p&gt;

&lt;p&gt;Underpass KMP is not:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a general replacement for a vector database;&lt;/li&gt;
&lt;li&gt;a final answer generator;&lt;/li&gt;
&lt;li&gt;a benchmark-specific solution;&lt;/li&gt;
&lt;li&gt;a hidden agent framework;&lt;/li&gt;
&lt;li&gt;a guarantee that every model will interpret evidence correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is a deterministic, auditable memory layer. Its job is to preserve enough&lt;br&gt;
structure for people, agents, plugins, readers, and future specialist models to&lt;br&gt;
work with memory without reading everything again from scratch.&lt;/p&gt;
&lt;h2&gt;
  
  
  Benchmarks: What I Learned
&lt;/h2&gt;

&lt;p&gt;I have been careful not to claim more than the current evidence supports.&lt;/p&gt;

&lt;p&gt;The most important early result is not "the kernel wins every memory&lt;br&gt;
benchmark". The important result is that the kernel makes a previously blurry&lt;br&gt;
boundary visible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Did memory retrieval fail, or did the reader fail to use correct evidence?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;In a MemoryArena public-TLS run with 100 progressive-search tasks and the&lt;br&gt;
smart writer enabled, the kernel reached:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correct KMP events&lt;/td&gt;
&lt;td&gt;2259/2259&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Known-at-clean queries&lt;/td&gt;
&lt;td&gt;753/753&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-ref recall&lt;/td&gt;
&lt;td&gt;753/753&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Future-answer leaks&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local paper-aligned score&lt;/td&gt;
&lt;td&gt;97/100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final misses&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 3 final misses were classified as reader answer-selection failures over&lt;br&gt;
complete evidence, not as kernel retrieval failures or graph contamination.&lt;/p&gt;

&lt;p&gt;In a realistic MemoryArena 2x/domain slice, the kernel reached:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Correct KMP events&lt;/td&gt;
&lt;td&gt;221/221&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Known-at-clean queries&lt;/td&gt;
&lt;td&gt;73/73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Full-ref recall&lt;/td&gt;
&lt;td&gt;73/73&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Future leaks&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unexpected references&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Missing references&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining task failures were reader or agent gaps, not evidence gaps.&lt;/p&gt;

&lt;p&gt;LongMemEval taught a different lesson. In a 30-item multi-session smart-writer&lt;br&gt;
slice, the recovered evidence was complete, but the same evidence produced&lt;br&gt;
different results depending on the reader:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Reader&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o&lt;/td&gt;
&lt;td&gt;22/30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 31B&lt;/td&gt;
&lt;td&gt;25/30&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;In a 100-item test using an external embedding model and derivations, the same&lt;br&gt;
boundary appeared again:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Measure&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Broad evidence recall&lt;/td&gt;
&lt;td&gt;~99%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Official multi-session aggregate end-to-end QA&lt;/td&gt;
&lt;td&gt;71.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The remaining failures were mostly structured operand problems: missed count&lt;br&gt;
predicates, omitted qualifying evidence, or comparison mistakes.&lt;/p&gt;

&lt;p&gt;That is useful information.&lt;/p&gt;

&lt;p&gt;It tells me that the next improvement is not to hide more logic inside the&lt;br&gt;
kernel. The next improvement is better candidate retrieval, reranking, typed&lt;br&gt;
operand extraction, and reusable domain plugins.&lt;/p&gt;
&lt;h2&gt;
  
  
  Roadmap
&lt;/h2&gt;

&lt;p&gt;The next step is to keep validating the idea with real cases and make the&lt;br&gt;
kernel easier to use.&lt;/p&gt;

&lt;p&gt;In the short term, the work is practical:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;stronger MemoryArena and MemoryAgentBench runs;&lt;/li&gt;
&lt;li&gt;an official-style LongMemEval regression as a secondary benchmark;&lt;/li&gt;
&lt;li&gt;hybrid candidate retrieval behind ports;&lt;/li&gt;
&lt;li&gt;reranking experiments;&lt;/li&gt;
&lt;li&gt;visual graph and timeline exploration for traversing memory;&lt;/li&gt;
&lt;li&gt;better proof and traversal observability;&lt;/li&gt;
&lt;li&gt;stable pagination, limits, and scopes in KMP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the medium term, the direction becomes more interesting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a small model specialized in operating kernel tools, trained from audited MCP
trajectories;&lt;/li&gt;
&lt;li&gt;process queries such as &lt;code&gt;known_at&lt;/code&gt;, &lt;code&gt;why&lt;/code&gt;, &lt;code&gt;failed_paths&lt;/code&gt;, &lt;code&gt;final_path&lt;/code&gt;, and
&lt;code&gt;best_path&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;reusable interpretation plugins for money, dates, counts, URLs, code, and
domain-specific operators;&lt;/li&gt;
&lt;li&gt;conformance tests so kernel semantics are independent from the storage
implementation;&lt;/li&gt;
&lt;li&gt;public visual experiences that let people replay an agent process as a graph
and timeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operator model is especially important to me. It would not be a general&lt;br&gt;
agent, and it would not be a magical model that "understands memory". It would&lt;br&gt;
be a small specialist trained to use KMP efficiently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Which tool should I call now?
With which bounded arguments?
Should I inspect, trace, move through time, or stop?
Which references prove that I have enough evidence?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is a narrow and measurable problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Product Thesis
&lt;/h2&gt;

&lt;p&gt;The thesis behind Underpass KMP is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Reliable agents need memory they can navigate, not just context they can
retrieve.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That memory must be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoped by what it is about;&lt;/li&gt;
&lt;li&gt;split into meaningful dimensions;&lt;/li&gt;
&lt;li&gt;traversable through time;&lt;/li&gt;
&lt;li&gt;connected by honest relations;&lt;/li&gt;
&lt;li&gt;backed by evidence;&lt;/li&gt;
&lt;li&gt;inspectable by people;&lt;/li&gt;
&lt;li&gt;usable by LLMs through tools;&lt;/li&gt;
&lt;li&gt;observable and auditable in production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why I am building Kernel Memory Protocol: so agent memory is not just&lt;br&gt;
accumulated text, but a structure that can be traversed, inspected, and reused.&lt;/p&gt;

&lt;p&gt;This is not about making prompts longer. It is the opposite: rebuilding the&lt;br&gt;
useful context without forcing the model to read all the raw material, and&lt;br&gt;
making token usage intelligent, measurable, and auditable.&lt;/p&gt;

&lt;p&gt;The goal is to turn agent memory into a real working layer.&lt;/p&gt;

&lt;p&gt;If this direction interests you, you can check the&lt;br&gt;
&lt;a href="https://github.com/underpass-ai/rehydration-kernel" rel="noopener noreferrer"&gt;Underpass KMP repository&lt;/a&gt;.&lt;br&gt;
And if you find it useful, a GitHub star helps give the project visibility.&lt;/p&gt;




&lt;p&gt;Written by &lt;a href="https://github.com/tgarciai" rel="noopener noreferrer"&gt;Tirso García Ibáñez&lt;/a&gt; ·&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/tirsogarcia/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; ·&lt;br&gt;
&lt;a href="https://github.com/underpass-ai" rel="noopener noreferrer"&gt;Underpass AI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Underpass KMP is part of the Underpass AI project. The repository is licensed&lt;br&gt;
under the &lt;a href="https://github.com/underpass-ai/rehydration-kernel/blob/main/LICENSE" rel="noopener noreferrer"&gt;Apache License 2.0&lt;/a&gt;,&lt;br&gt;
unless stated otherwise.&lt;/p&gt;

&lt;p&gt;Copyright © 2026 Tirso García Ibáñez.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>opensource</category>
    </item>
    <item>
      <title>What an event-driven agent pipeline looks like when you trace it end-to-end</title>
      <dc:creator>Tirso García</dc:creator>
      <pubDate>Thu, 23 Apr 2026 22:32:49 +0000</pubDate>
      <link>https://dev.to/tirsogarcia/what-an-event-driven-agent-pipeline-looks-like-when-you-trace-it-end-to-end-1cck</link>
      <guid>https://dev.to/tirsogarcia/what-an-event-driven-agent-pipeline-looks-like-when-you-trace-it-end-to-end-1cck</guid>
      <description>&lt;p&gt;In an earlier post I argued that event-driven agents reduce scope, cost, and decision dispersion because they narrow the decision space before the model starts reasoning.&lt;/p&gt;

&lt;p&gt;This article is the empirical follow-up to that idea.&lt;/p&gt;

&lt;p&gt;It does not try to re-argue the thesis. It tries to show what it looks like when the architecture is wired end-to-end, running on a real case and instrumented enough that the behavior of the system can be observed instead of inferred after the fact.&lt;/p&gt;

&lt;p&gt;The point here is not just that a multi-agent pipeline exists. The point is that the pipeline emits a readable operational shape: ingestion, incident opening, specialized hops, differentiated latencies, outcomes by role, and distributed traces that let you follow a concrete execution from the initial event to its close.&lt;/p&gt;

&lt;p&gt;The central image of this article is not a diagram. It is a real trace. In Tempo, a single incident appears as a sequence of spans with different durations, visible overlaps, and a critical path that can be inspected without being reconstructed by hand. For me, that is the important leap: when the system stops being just a designed architecture and starts being an observable architecture.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzxqjc0ujviedhju0tuj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjzxqjc0ujviedhju0tuj.png" alt="Figure 1. Tempo waterfall for a concrete incident." width="800" height="338"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 1. Tempo waterfall for a concrete incident. The trace shows the full path of the incident as an observable sequence of spans, durations, and specialized hops.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;To anchor the captures to concrete executions, I am working with two real incidents from this run: a CPU saturation on &lt;code&gt;underpass-demo-payments-api&lt;/code&gt; and a latency regression correlated with a recent canary.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline emits its own shape
&lt;/h2&gt;

&lt;p&gt;One of the common problems in agent systems is that the architecture is usually explained better than it can be inspected. It is easy to draw boxes, arrows, and role names. It is much less common for those boxes and arrows to leave enough operational evidence to verify that the system actually behaves as designed once real events start coming in.&lt;/p&gt;

&lt;p&gt;This is where telemetry starts to change the conversation.&lt;/p&gt;

&lt;p&gt;The per-specialist rate chart does not show aggregated activity in the abstract. It shows the temporal shape of the pipeline. First &lt;code&gt;ingress&lt;/code&gt; appears, then intermediate hops like &lt;code&gt;routing&lt;/code&gt; and &lt;code&gt;kernelseed&lt;/code&gt; come in, and later the specialists activate with distinct cadences and durations. You are not looking at a list of components in the abstract, but at stages that turn on, overlap, and turn off in an observable order.&lt;/p&gt;

&lt;p&gt;That matters because it turns architecture into emitted behavior. The reader does not have to reconstruct the path from scattered logs or from a retrospective narrative. The system itself shows how the incident moves through differentiated hops, which ones hold the flow longer, and which ones intervene more briefly. The architecture stops being a promise I describe and becomes a measurable temporal sequence.&lt;/p&gt;

&lt;p&gt;In other words: the pipeline no longer lives only in the diagram. It also lives in the time series. And that difference, for me, is an important part of what makes an agent system start to be operationally readable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1sww9qiq2l266y0cbh8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk1sww9qiq2l266y0cbh8.png" alt="Figure 2. Rate per specialist during a pipeline execution." width="800" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 2. Rate per specialist during a pipeline execution. The temporal activation sequence reveals the operational shape of the system: ingress, routing, context materialization, and specialists with differentiated cadences.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The event narrows the decision space
&lt;/h2&gt;

&lt;p&gt;The core idea of the architecture is still the same as in the previous article: narrow the decision space before reasoning.&lt;/p&gt;

&lt;p&gt;Instead of starting with an open space of context, tools, and possible courses of action, the system starts from an explicit event. That event narrows the problem from the beginning. It does not solve anything on its own, but it constrains what kind of incident we are dealing with, which specialist should intervene first, what context needs to be materialized, and which parts of the system are not yet relevant.&lt;/p&gt;

&lt;p&gt;That matters because much of the cost of agentic systems does not come just from "using an LLM", but from leaving too many decisions open too early. When the model receives an action space that is too wide, it also receives too many opportunities to be wrong, distracted, or over-reasoning.&lt;/p&gt;

&lt;p&gt;Here the event does the initial compression work.&lt;/p&gt;

&lt;p&gt;It does not eliminate the need for reasoning. It makes that reasoning operate within an already-narrowed decision space.&lt;/p&gt;

&lt;p&gt;But the important point here is that this initial narrowing does not stay in the raw event. The system turns it into persisted structure. An alert enters as an operational fact, ingress transforms it into deterministic evidence, and from there the specialists deposit their artifacts — findings, plans, decisions — onto the incident graph, each with an explicit author, revision, and verifiable &lt;code&gt;content_hash&lt;/code&gt;. What circulates between phases stops being loose text or history and becomes structure with traceability.&lt;/p&gt;

&lt;p&gt;That trail is what allows the full materialization cycle to be shown. The specialists do not only write the text of each artifact: they also declare which relations must remain explicit between them inside a shared typed vocabulary. What ends up stored is not a narration but a grammar: what evidence sustains a finding, what finding grounds a plan, and what decision mitigates an incident.&lt;/p&gt;

&lt;p&gt;In other words, the event does not only trigger execution. It also starts to build structured memory.&lt;/p&gt;

&lt;p&gt;The first detail worth showing is precisely that deterministic input artifact. Before any specialist intervenes, ingress leaves something like this in Valkey:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alert_id=article-sat-1776973582
alert_name=PaymentsCPUSaturation
service_name=payments-api
environment=cluster
severity=SEV2
namespace=underpass-runtime
workload_kind=deployment
workload_name=underpass-demo-payments-api
symptom_kind=saturation
symptom_value=cpu=94%
threshold=cpu &amp;gt; 90% for 5m
summary=Payments API CPU saturation
description=Sustained CPU utilization above 90% for over 5 minutes. The workload is horizontally scalable.
runbook_url=https://runbooks.internal/payments/cpu-saturation
dashboard_url=https://grafana.internal/d/payments-cpu
firing_at=2026-04-23T19:46:22Z
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Real block from evidence_56daf53e-4c97-4344-bbb6-f63cf513ae89_initial_alert.json. &lt;code&gt;content_hash&lt;/code&gt;: &lt;code&gt;sha256:116b246ae0bc4adf269ce29dd76cd794ccde11befa29567e8abdbf988abd3dc0&lt;/code&gt;. &lt;code&gt;revision&lt;/code&gt;: &lt;code&gt;1&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;
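&lt;p&gt;The &lt;code&gt;content_hash&lt;/code&gt; above follows the common &lt;code&gt;sha256:&amp;lt;hex&amp;gt;&lt;/code&gt; convention. A minimal sketch of computing such a hash; how the system canonicalizes content before hashing is not shown here, so this assumes the exact stored bytes are hashed:&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// contentHash returns the "sha256:<hex>" form that makes an artifact
// verifiable: anyone holding the stored bytes can recompute the hash
// and check that the evidence was not altered after the fact.
func contentHash(content []byte) string {
	sum := sha256.Sum256(content)
	return fmt.Sprintf("sha256:%x", sum)
}

func main() {
	// Hypothetical artifact bytes, not the real evidence file.
	artifact := []byte("alert_name=PaymentsCPUSaturation\nseverity=SEV2\n")
	fmt.Println(contentHash(artifact))
}
```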

&lt;p&gt;And that text does not stay isolated. The kernel turns it into typed nodes and relations. For the saturation incident, the shape stored in Neo4j looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs87kmxnyxx5oozh25s0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs87kmxnyxx5oozh25s0p.png" alt="Figure 3. Typed graph stored in Neo4j for the saturation incident." width="800" height="214"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Figure 3. Typed graph stored in Neo4j for the saturation incident. Same node kinds, same &lt;code&gt;semantic_class&lt;/code&gt; vocabulary, composed into the 3-in-series shape that this incident type requires.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Not every hop costs the same
&lt;/h2&gt;

&lt;p&gt;Another thing the instrumentation makes clear is that not every hop of the pipeline pays the same cost.&lt;/p&gt;

&lt;p&gt;In the panel, fast hops are distinguishable from model-bound hops. That separation is important because it lets you read the system with more precision. Not every pipeline step needs generative capability, and not every operational cost should be attributed to the model's reasoning.&lt;/p&gt;

&lt;p&gt;Some steps are essentially operational: ingestion, routing, persistence, context retrieval, component coordination. And some steps are tied to LLM-bound specialists, where the real cognitive cost of the system appears.&lt;/p&gt;

&lt;p&gt;Separating both planes changes the conversation. Instead of talking about the pipeline as a single opaque block, you can see which parts of the work are infrastructure, which parts are reasoning, and which parts are coordination between both. It also lets you detect whether the system is spending model capacity where it should not.&lt;/p&gt;

&lt;p&gt;In other words: it is not enough that the pipeline works. It has to be clear which hops consume expensive intelligence and which are simply governed execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh9zaj4tfvclljrx58hq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsh9zaj4tfvclljrx58hq.png" alt="Figure 4. p95 latency per hop." width="800" height="145"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 4. p95 latency per hop. The instrumentation separates LLM-bound specialists from cheaper operational hops, so the cost of the pipeline is not treated as an opaque block.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Specialists are not named prompts
&lt;/h2&gt;

&lt;p&gt;There is a common risk in this space: calling "agents" or "specialists" what are really just prompts with different labels.&lt;/p&gt;

&lt;p&gt;What I want to avoid here is precisely that.&lt;/p&gt;

&lt;p&gt;In this architecture, investigator, planner, and operator are not just semantic names for three nice phases of a demo. They appear as differentiated stages with their own timing, their own outcomes, and bounded responsibility within the incident cycle. The instrumentation lets you see them as operational roles, not just distinct voices generated by the same model.&lt;/p&gt;

&lt;p&gt;That does not mean the problem is solved in general. It means something more concrete and defensible: in this architecture, the specialists leave enough trace that you can inspect what they did, when they intervened, and with what outcome each hop finished. That trace does not only appear in telemetry. It also appears in the &lt;code&gt;node.details&lt;/code&gt; persisted in Valkey, where each actor leaves text with explicit authorship and where the reader can compare deterministic evidence with LLM intervention without the two getting mixed.&lt;/p&gt;

&lt;p&gt;That change seems small, but it is not. In many systems, multi-agent behavior is narrated. Here it starts to be auditable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b486vswa3t1om6b9oe8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b486vswa3t1om6b9oe8.png" alt="Figure 5. Specialist outcomes in the observed window." width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 5. Specialist outcomes in the observed window. The roles leave their own operational trace: they are not just prompts with different names but stages with distinguishable outcomes and responsibility.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Governing what the LLM cannot invent
&lt;/h2&gt;

&lt;p&gt;There is a part of the design I care about separating from reasoning: &lt;strong&gt;what each agent can and cannot decide&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In many agent systems, governance is delegated to the prompt: &lt;em&gt;"do not do X, only do Y"&lt;/em&gt;. That is fragile because it depends on the model obeying. Here I prefer governance to live in layers prior to the model, so that when the LLM gets it wrong — and it does — the system can no longer execute the mistake.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The specialist closes the decision space.&lt;/strong&gt; Each specialist exposes a closed enum to the model. The saturation operator can only respond &lt;code&gt;execute&lt;/code&gt;, &lt;code&gt;escalate&lt;/code&gt;, or &lt;code&gt;reject&lt;/code&gt;. The planner picks from five actions: &lt;code&gt;scale_up&lt;/code&gt;, &lt;code&gt;restart_pods&lt;/code&gt;, &lt;code&gt;circuit_break&lt;/code&gt;, &lt;code&gt;escalate&lt;/code&gt;, or &lt;code&gt;not_enough_evidence&lt;/code&gt;. The rollout operator: &lt;code&gt;rollback&lt;/code&gt;, &lt;code&gt;pause_rollout&lt;/code&gt;, &lt;code&gt;escalate&lt;/code&gt;, &lt;code&gt;not_enough_evidence&lt;/code&gt;. That space is not a prompt convention; these are enums declared in the domain catalog (&lt;code&gt;eventsv1.SaturationOperatorDecision&lt;/code&gt;, &lt;code&gt;eventsv1.SaturationAction&lt;/code&gt;, &lt;code&gt;eventsv1.RuntimeRolloutDecision&lt;/code&gt;) with an &lt;code&gt;IsValid&lt;/code&gt; method. The specialist validates the model's response before emitting any event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;decision&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;eventsv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SaturationOperatorDecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ToLower&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;decision&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;IsValid&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;fmt&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"llm returned invalid operator decision %q; defaulting to escalate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decision&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eventsv1&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SaturationOperatorDecisionEscalate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM's output schema is also closed: each specialist asks the model for a JSON with explicit fields (&lt;code&gt;decision&lt;/code&gt;, &lt;code&gt;confidence&lt;/code&gt;, &lt;code&gt;node_detail&lt;/code&gt;, &lt;code&gt;relations&lt;/code&gt; with their explanations). If the JSON does not parse, safe fallback. If it parses but the values are invalid, safe fallback. The model's uncertainty becomes an operational signal — the &lt;code&gt;insufficient_data&lt;/code&gt; outcome appears on the dashboard — not an execution risk.&lt;/p&gt;
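&lt;p&gt;The validate-then-fall-back pattern can be sketched in a few lines. This is an illustrative reconstruction, not the project's actual code: the type and function names (&lt;code&gt;SpecialistOutput&lt;/code&gt;, &lt;code&gt;DecideOrFallback&lt;/code&gt;) are hypothetical.&lt;/p&gt;

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// SpecialistOutput is an illustrative closed schema; the real PIR
// specialists also carry node_detail and relations with explanations.
type SpecialistOutput struct {
	Decision   string  `json:"decision"`
	Confidence float64 `json:"confidence"`
	NodeDetail string  `json:"node_detail"`
}

var validDecisions = map[string]bool{
	"execute": true, "escalate": true, "reject": true,
}

// DecideOrFallback parses the raw LLM response and returns a safe
// "escalate" whenever the JSON is malformed or the decision value
// falls outside the closed enum.
func DecideOrFallback(raw string) (decision, reason string) {
	var out SpecialistOutput
	if err := json.Unmarshal([]byte(raw), &out); err != nil {
		return "escalate", fmt.Sprintf("llm output did not parse: %v", err)
	}
	d := strings.ToLower(out.Decision)
	if !validDecisions[d] {
		return "escalate", fmt.Sprintf("llm returned invalid decision %q", out.Decision)
	}
	return d, ""
}

func main() {
	d, _ := DecideOrFallback(`{"decision":"EXECUTE","confidence":0.9}`)
	fmt.Println(d) // a valid value passes through, normalized

	d, reason := DecideOrFallback(`{"decision":"improvise"}`)
	fmt.Println(d, reason) // invalid value: safe fallback with a reason
}
```

Either failure mode ends in the same place: the fallback decision plus a reason string that, in the real system, becomes the telemetry signal rather than an execution.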

&lt;p&gt;&lt;strong&gt;The kernel closes the graph vocabulary.&lt;/strong&gt; Relations between nodes are not free text. Each relation carries a &lt;code&gt;semantic_class&lt;/code&gt; that has to be one of six: &lt;code&gt;structural&lt;/code&gt;, &lt;code&gt;causal&lt;/code&gt;, &lt;code&gt;motivational&lt;/code&gt;, &lt;code&gt;procedural&lt;/code&gt;, &lt;code&gt;evidential&lt;/code&gt;, &lt;code&gt;constraint&lt;/code&gt;. That set is fixed in the kernel's domain. If a specialist tries to emit a relation with a class outside that list — even if the LLM wanted to generate something like "inspirational" or "quasi-causal" — the projector drops it and the batch is recorded as failed in telemetry. What reaches the graph always comes from a known vocabulary.&lt;/p&gt;
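&lt;p&gt;A minimal sketch of that projector-side check, assuming a simplified relation shape. The six classes are the kernel's; the Go types and the &lt;code&gt;ProjectBatch&lt;/code&gt; name here are illustrative:&lt;/p&gt;

```go
package main

import "fmt"

// The closed relation vocabulary: exactly six semantic classes.
var semanticClasses = map[string]bool{
	"structural": true, "causal": true, "motivational": true,
	"procedural": true, "evidential": true, "constraint": true,
}

// Relation is a simplified stand-in for what a specialist emits.
type Relation struct {
	From, To      string
	SemanticClass string
}

// ProjectBatch keeps only relations whose class belongs to the fixed
// vocabulary; anything outside it is dropped and the batch is flagged
// as failed (recorded in telemetry in the real system).
func ProjectBatch(rels []Relation) (accepted []Relation, failed bool) {
	for _, r := range rels {
		if !semanticClasses[r.SemanticClass] {
			failed = true
			continue
		}
		accepted = append(accepted, r)
	}
	return accepted, failed
}

func main() {
	acc, failed := ProjectBatch([]Relation{
		{"finding", "decision", "causal"},
		{"finding", "plan", "inspirational"}, // outside the vocabulary
	})
	fmt.Println(len(acc), failed) // only the causal relation survives
}
```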

&lt;p&gt;The content of the nodes is fingerprinted. Each &lt;code&gt;node.detail&lt;/code&gt; is persisted with &lt;code&gt;content_hash&lt;/code&gt; and &lt;code&gt;revision&lt;/code&gt; from the moment the specialist emits it; the hash stays with the text permanently. Months later, anyone can recompute the hash against the stored detail and confirm that the text has not changed since the author wrote it.&lt;/p&gt;
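&lt;p&gt;The fingerprint itself is plain SHA-256 over the detail text. A minimal sketch of the write-time hash and the later verification (the helper names are hypothetical):&lt;/p&gt;

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ContentHash fingerprints a node detail at write time, in the same
// "sha256:<hex>" shape the persisted blocks use.
func ContentHash(detail string) string {
	sum := sha256.Sum256([]byte(detail))
	return "sha256:" + hex.EncodeToString(sum[:])
}

// Verify recomputes the hash against the stored detail; any edit to
// the text changes the digest and the check fails.
func Verify(detail, storedHash string) bool {
	return ContentHash(detail) == storedHash
}

func main() {
	detail := "Workload under pressure: deployment/payments-api"
	h := ContentHash(detail)
	fmt.Println(Verify(detail, h))           // unchanged text verifies
	fmt.Println(Verify(detail+" edited", h)) // any edit is detected
}
```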

&lt;p&gt;The combination — six closed semantic classes for the relations and cryptographic hashing over the node content — makes the system's persisted memory not a free narration but a graph with a fixed vocabulary and verifiable text. That is what allows different specialists, at different times, to operate on the same structure without contaminating each other's work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The catalog ties the layers together.&lt;/strong&gt; &lt;code&gt;SpecialistID&lt;/code&gt;, &lt;code&gt;ToolProfile&lt;/code&gt;, &lt;code&gt;GovernanceProfile&lt;/code&gt;, and &lt;code&gt;SuccessProfile&lt;/code&gt; live in YAML and are cross-checked against the Go constants via conformance tests. If someone adds a specialist or a tool without declaring it in both places, the build fails. Governance is not a convention read from a wiki; it is versioned, executable structure.&lt;/p&gt;
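&lt;p&gt;A conformance check of that kind reduces to a set comparison between the two declarations. A sketch under assumed names — in the real repo this runs as a Go test and also covers tools and profiles:&lt;/p&gt;

```go
package main

import "fmt"

// CatalogDrift returns the specialist IDs declared as Go constants
// but missing from the YAML catalog. A non-empty result is what a
// conformance test would turn into a build failure.
func CatalogDrift(goConsts, yamlIDs []string) []string {
	declared := make(map[string]bool, len(yamlIDs))
	for _, id := range yamlIDs {
		declared[id] = true
	}
	var missing []string
	for _, id := range goConsts {
		if !declared[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	// Hypothetical IDs: someone added runtime-rollout in Go but
	// forgot to declare it in YAML.
	goConsts := []string{"saturation-operator", "planner", "runtime-rollout"}
	yamlIDs := []string{"saturation-operator", "planner"}
	fmt.Println(CatalogDrift(goConsts, yamlIDs))
}
```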

&lt;p&gt;&lt;strong&gt;The runtime is the last guardian.&lt;/strong&gt; When a specialist opens a session with the runtime to invoke a tool, it declares a &lt;code&gt;tool_profile&lt;/code&gt;, a &lt;code&gt;governance_profile&lt;/code&gt;, and a &lt;code&gt;success_profile&lt;/code&gt;. The runtime validates each invocation — not just at session-open, but on every call — against three rules: if the tool's scope (workspace / cluster) does not match the caller's roles, it fails; if the tool's risk level is high and the caller does not have &lt;code&gt;platform_admin&lt;/code&gt;, it fails; if the tool requires approval and the invocation does not carry &lt;code&gt;Approved=true&lt;/code&gt;, it fails. Each runtime decision — allowed, denied, failed — is recorded in a structured audit log with &lt;code&gt;actor_id&lt;/code&gt;, &lt;code&gt;tenant_id&lt;/code&gt;, &lt;code&gt;session_id&lt;/code&gt;, &lt;code&gt;invocation_id&lt;/code&gt;, &lt;code&gt;status&lt;/code&gt;, and redacted metadata. PIR does not decide whether it has permission; the runtime decides it against declared structure.&lt;/p&gt;
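&lt;p&gt;The three rules can be sketched as a single authorization function evaluated on every call. The field names here are illustrative, not the runtime's actual API:&lt;/p&gt;

```go
package main

import (
	"errors"
	"fmt"
)

// Tool describes the declared profile of a tool (illustrative shape).
type Tool struct {
	Scope            string // "workspace" or "cluster"
	HighRisk         bool
	RequiresApproval bool
}

// Invocation carries the caller's roles and the approval flag.
type Invocation struct {
	CallerRoles map[string]bool
	Approved    bool
}

// Authorize applies the three rules in order; the first violated
// rule denies the call. The real runtime also writes an audit record
// with actor_id, tenant_id, session_id, invocation_id, and status.
func Authorize(t Tool, inv Invocation) error {
	if !inv.CallerRoles[t.Scope] {
		return errors.New("denied: tool scope does not match caller roles")
	}
	if t.HighRisk && !inv.CallerRoles["platform_admin"] {
		return errors.New("denied: high-risk tool requires platform_admin")
	}
	if t.RequiresApproval && !inv.Approved {
		return errors.New("denied: invocation lacks Approved=true")
	}
	return nil
}

func main() {
	tool := Tool{Scope: "cluster", HighRisk: true, RequiresApproval: true}
	inv := Invocation{CallerRoles: map[string]bool{"cluster": true}}
	fmt.Println(Authorize(tool, inv)) // denied by the risk rule
}
```

The point of the ordering is that the caller never argues about permission: each rule consults only declared structure, so the deny happens regardless of what the model proposed.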

&lt;p&gt;&lt;strong&gt;Governance leaves a trail.&lt;/strong&gt; Every runtime denial, every specialist fallback, every &lt;code&gt;insufficient_data&lt;/code&gt; from the model, every kernel rejection lands in telemetry. These are not errors; they are evidence that the layers are doing their job. That lets you distinguish between &lt;em&gt;"the system did nothing because it was right not to act"&lt;/em&gt; and &lt;em&gt;"the system did nothing because it failed"&lt;/em&gt;. The warnings panel on the dashboard is not a negative indicator; it is a health indicator.&lt;/p&gt;

&lt;p&gt;Governance emerges as composition. The event narrows the problem. The specialist narrows the model's output. The kernel narrows the persisted vocabulary. The catalog narrows the universe of roles, tools, and profiles. The runtime narrows execution. No single layer is enough on its own; the combination is what allows a potentially ambiguous model to operate on real infrastructure without unexpected behavior.&lt;/p&gt;

&lt;p&gt;That change of register — from &lt;em&gt;"promising the agent behaves well"&lt;/em&gt; to &lt;em&gt;"structurally preventing it from behaving badly"&lt;/em&gt; — is what makes me think the word governance is appropriate here. It is not prompt governance nor external policy governance. It is architecture governance: the system does not rely on the model's goodness; it operates under constraints the model cannot reach.&lt;/p&gt;

&lt;h2&gt;
  
  
  The distributed trace is the incident
&lt;/h2&gt;

&lt;p&gt;The most valuable part of this setup, for me, is not the aggregated dashboard. It is the bridge between aggregates and concrete executions.&lt;/p&gt;

&lt;p&gt;The dashboard's traces panel already shows that bridge: each row lists &lt;code&gt;payments-incident-response&lt;/code&gt;, operations like &lt;code&gt;pir.ingress.handle_alert_firing&lt;/code&gt;, and durations of just over a minute per incident. But the real value appears when you open one of those entries in Tempo — and what unfolds there is exactly the waterfall that opens this article.&lt;/p&gt;

&lt;p&gt;Now that we have walked the pipeline, it is worth coming back to Figure 1 with different eyes. Each of those spans corresponds to a hop you already know how to read: ingress as root, five siblings in parallel (routing, event proxy, kernel seed, investigators) and then the serial chain of the saturation pipeline. The hops in microseconds or milliseconds are infrastructure doing its work; the spans of tens of seconds are the LLM-bound specialists. That asymmetry — visible without reconstructing anything — is what lets you read the incident as a path, not as an aggregate.&lt;/p&gt;

&lt;p&gt;The critical path also stops being a guess. You can see which span dominated the total duration, which were practically instantaneous, and which parts overlapped. That avoids a common trap of observability in LLM systems: staying at global metrics that say little about the real behavior of an individual execution. Averages help, but they are not enough. To understand how the system reasons and executes under a specific incident, you have to be able to follow a trace.&lt;/p&gt;

&lt;p&gt;Distributed tracing does not replace the system's reasoning. It makes it inspectable.&lt;/p&gt;

&lt;h2&gt;
  
  
  The kernel does not store free-floating text
&lt;/h2&gt;

&lt;p&gt;The other half of the story is not in Grafana or in Tempo. It is in what remains stored after each hop.&lt;/p&gt;

&lt;p&gt;Each incident leaves a chain of real &lt;code&gt;node.detail&lt;/code&gt; blocks: initial evidence, findings, plans, and decisions. Some of those blocks come from ingress and are deterministic. Others come from LLM-bound specialists. What is useful is that they do not end up mixed in an indistinct narration. Each piece keeps key, author, raw content, &lt;code&gt;content_hash&lt;/code&gt;, and &lt;code&gt;revision&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That lets you show something that is usually lost in this kind of system: not only what the pipeline decided, but what specific text each actor produced and with what provenance it can be read. It also lets you link a single incident to four different views of the same fact: Grafana screenshot, Tempo trace, Valkey-persisted detail, and typed relations in Neo4j.&lt;/p&gt;

&lt;p&gt;For example, the saturation &lt;code&gt;finding&lt;/code&gt; does not describe an "agent" in the abstract. It describes a concrete reading of concrete evidence:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Workload under pressure: deployment/underpass-demo-payments-api in namespace underpass-runtime.
Resource saturated: cpu (symptom_kind=saturation, symptom_value=cpu=94%).
Observed pressure level: 94% vs alert threshold of cpu &amp;gt; 90% for 5m.
Hypotheses: The evidence supports a sudden spike or regression related to recent deployments, as the workload received multiple canary rollouts (v2.7.4, v2.7.5, v2.7.6 23min before, and v2.7.7 15min before the symptom fired).
Missing information: Metrics window for trend analysis, top-N consumers, and limits vs requests to determine if the saturation is due to resource constraints or increased load.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Real block from finding_56daf53e-4c97-4344-bbb6-f63cf513ae89_saturation.json. &lt;code&gt;content_hash&lt;/code&gt;: &lt;code&gt;sha256:7c67c37f8e3439942993b09a7dde70c164abdaa6b69d0e46a65872d7f17824ae&lt;/code&gt;. &lt;code&gt;revision&lt;/code&gt;: &lt;code&gt;1&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And in the rollout incident, the operational &lt;code&gt;decision&lt;/code&gt; is also persisted with its own explicit rationale:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Scope: payments-api service in the cluster environment, specifically the underpass-demo-payments-api deployment.
Observed Data: alert_id article-roll-1776973582 reports a symptom_kind of latency with a p99 value of 2.41s, exceeding the threshold of p99 &amp;gt; 2s for 5m. A recent_deploy (v2.7.8-canary) is present with an age_minutes of 18.
Operational Rationale: The regression is correlated with a recent canary deploy that is young (&amp;lt; 60 minutes) and the symptom is latency, meeting the criteria for a rollback to a healthy previous revision.
Falsification: This decision would be invalidated if evidence emerges that the latency is caused by a global infrastructure failure unrelated to the v2.7.8-canary rollout.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Real block from decision_a65c9a9b-3dff-4ce4-aeeb-d6497984ee57_runtime-rollout.json. &lt;code&gt;content_hash&lt;/code&gt;: &lt;code&gt;sha256:2830e840c213de2baaf82444c6383e3ea9b63dac41944153534e61ead72a07d7&lt;/code&gt;. &lt;code&gt;revision&lt;/code&gt;: &lt;code&gt;1&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The other graph worth showing is the rollout-regression one, because it makes clear that the ontology is shared but the topology changes depending on the incident type:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw97spe550db0qrcs9vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw97spe550db0qrcs9vg.png" alt="Figure 6. Typed graph stored in Neo4j for the rollout-regression incident." width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Figure 6. Typed graph stored in Neo4j for the rollout-regression incident. Shared ontology, different topology: finding and decision in parallel instead of the 3-in-series chain.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Recovery and closure are also architecture
&lt;/h2&gt;

&lt;p&gt;An agent architecture is not proven only in the "interesting" moment of reasoning. It is also proven in the less visible parts: retries, durability, message consumption, correct pipeline closure, the absence of spurious escalations, and stable behavior when several components work in a chain.&lt;/p&gt;

&lt;p&gt;That matters because many demos show the best case but not the operational fabric that keeps the system from breaking at the first mismatch between services, queues, context, and execution.&lt;/p&gt;

&lt;p&gt;Part of what I wanted to build here was precisely that: a pipeline where recovery, coordination, and traceability are not an afterthought but part of the design from the beginning.&lt;/p&gt;

&lt;p&gt;If the system resolves something but cannot explain which hop it passed through, how long it took, who intervened, what failed, and where it could recover, then you do not yet have reliable infrastructure. You have a lucky sequence.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I think generalizes
&lt;/h2&gt;

&lt;p&gt;I do not think a single demo allows broad claims about "agents" in general. I do think this setup shows something more bounded.&lt;/p&gt;

&lt;p&gt;When the event narrows the decision space well, when the pipeline distributes responsibility across specialists with clear scope, and when execution leaves enough telemetry to inspect each hop, the system becomes more governable. Not necessarily simpler on the inside, but more readable from the outside. And that operational readability matters.&lt;/p&gt;

&lt;p&gt;It matters for debugging.&lt;br&gt;
It matters for evaluation.&lt;br&gt;
It matters for audit.&lt;br&gt;
It matters for deciding whether the cost of reasoning is justified at each phase.&lt;/p&gt;

&lt;p&gt;In that sense, what is interesting is not just that there are several specialists. What is interesting is that the complete cycle can be observed as a typed, bounded, and measurable sequence of transitions between state, context, and execution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Close
&lt;/h2&gt;

&lt;p&gt;The main result here is not that a multi-agent pipeline can be drawn. It is that it can be observed as a sequence of bounded hops, with latency, outcome, and traceability per execution.&lt;/p&gt;

&lt;p&gt;The event defines the boundary. The specialist closes the model's output. The kernel closes the graph vocabulary. The runtime closes execution. The instrumentation leaves enough evidence to inspect what actually happened, instead of reconstructing it afterwards from prompts, loose logs, or intuition.&lt;/p&gt;

&lt;p&gt;This pipeline is not a monolithic application. It runs on two pieces of open-source infrastructure I published earlier in separate articles: &lt;a href="https://github.com/underpass-ai/rehydration-kernel" rel="noopener noreferrer"&gt;rehydration-kernel&lt;/a&gt; provides structured context with explicit causality and a closed ontology, and &lt;a href="https://github.com/underpass-ai/underpass-runtime" rel="noopener noreferrer"&gt;underpass-runtime&lt;/a&gt; provides governed execution with a policy engine and auditing. This article shows what emerges when the two are composed.&lt;/p&gt;

&lt;p&gt;If you work on infrastructure for agents, governed execution, or observability for LLM systems, I would especially value technical feedback on the design and the instrumentation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>observability</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Why event-driven agents reduce scope, cost, and decision dispersion</title>
      <dc:creator>Tirso García</dc:creator>
      <pubDate>Thu, 16 Apr 2026 22:45:25 +0000</pubDate>
      <link>https://dev.to/tirsogarcia/why-event-driven-agents-reduce-scope-cost-and-decision-dispersion-2062</link>
      <guid>https://dev.to/tirsogarcia/why-event-driven-agents-reduce-scope-cost-and-decision-dispersion-2062</guid>
      <description>&lt;p&gt;Most agent systems do not control their costs because they spend tokens letting the model discover boundaries that the architecture should have defined up front.&lt;/p&gt;

&lt;p&gt;A task arrives. A general-purpose agent receives a large context window, too many available tools, mixed historical signals, and a loosely defined objective. From there, the model must infer what matters, what does not, which tools are plausible, which constraints apply, and how success should be measured.&lt;/p&gt;

&lt;p&gt;That is expensive.&lt;/p&gt;

&lt;p&gt;Not only in tokens, but in unnecessary exploration, latency, failed tool calls, policy denials, and avoidable reasoning over irrelevant possibility space.&lt;/p&gt;

&lt;p&gt;This is the core issue I want to highlight: &lt;strong&gt;many agent systems are expensive because the architecture leaves too many decisions open for the model before reasoning even begins.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I think event-driven agents are one of the cleanest ways to reduce those open decisions.&lt;/p&gt;

&lt;p&gt;A well-defined event does more than trigger work. It defines the boundaries of the problem and becomes the initial context the agent works with. The better the event is designed, the more precise the agent will be.&lt;/p&gt;

&lt;p&gt;That event is routed to a specialist agent — not a generic agent that has to figure out what to do, but one that already knows what type of problem it focuses on.&lt;/p&gt;
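&lt;p&gt;One way to picture "the event is the initial context" is as a typed payload that already carries the boundary the specialist will work inside. The field names below are hypothetical:&lt;/p&gt;

```go
package main

import "fmt"

// Event is an illustrative shape: the payload itself states what
// happened, where, and under which constraint, so the specialist
// does not have to infer any of that at inference time.
type Event struct {
	Kind       string            // e.g. "IncidentSeverityRaised"
	Subject    string            // the affected workload
	Symptom    map[string]string // e.g. kind and observed value
	Constraint string            // the threshold that fired
}

func main() {
	e := Event{
		Kind:       "IncidentSeverityRaised",
		Subject:    "deployment/payments-api",
		Symptom:    map[string]string{"kind": "saturation", "value": "cpu=94%"},
		Constraint: "cpu > 90% for 5m",
	}
	fmt.Printf("%s on %s (%s, fired on %s)\n",
		e.Kind, e.Subject, e.Symptom["value"], e.Constraint)
}
```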




&lt;h2&gt;
  
  
  The real cost of generality
&lt;/h2&gt;

&lt;p&gt;The default pattern in many agent systems is still broad and implicit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;give the model a lot of context&lt;/li&gt;
&lt;li&gt;expose a wide action surface&lt;/li&gt;
&lt;li&gt;provide generic instructions&lt;/li&gt;
&lt;li&gt;hope the model discovers the right boundary at inference time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That scales poorly.&lt;/p&gt;

&lt;p&gt;As systems grow, the agent is forced to reason over:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;heterogeneous history&lt;/li&gt;
&lt;li&gt;multiple subsystems&lt;/li&gt;
&lt;li&gt;weakly related signals&lt;/li&gt;
&lt;li&gt;many candidate tools&lt;/li&gt;
&lt;li&gt;overlapping objectives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the problem is no longer just context size.&lt;/p&gt;

&lt;p&gt;It is &lt;strong&gt;decision dispersion&lt;/strong&gt;: too many plausible interpretations, too many candidate actions, and too much irrelevant context competing for attention.&lt;/p&gt;

&lt;p&gt;A broad agent can still succeed, but the system is making it solve problem decomposition again and again on every cycle. And the more the system grows, the more likely the model is to fail or take shortcuts just to get through.&lt;/p&gt;

&lt;p&gt;That is architectural waste.&lt;/p&gt;




&lt;h2&gt;
  
  
  Events are not just triggers
&lt;/h2&gt;

&lt;p&gt;In a well-designed event-driven system, an event is not merely a transport primitive.&lt;/p&gt;

&lt;p&gt;It is a &lt;strong&gt;semantic boundary&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A well-defined event already carries a strong signal about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what class of situation has occurred&lt;/li&gt;
&lt;li&gt;which specialist capability is relevant&lt;/li&gt;
&lt;li&gt;which context should be materialized&lt;/li&gt;
&lt;li&gt;which tools are worth considering&lt;/li&gt;
&lt;li&gt;which policies should govern the response&lt;/li&gt;
&lt;li&gt;how the result should be evaluated&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That changes the starting point of the system.&lt;/p&gt;

&lt;p&gt;Instead of asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What should an agent do in this entire environment?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The system can ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How should this specialist handle this class of situation under these constraints?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a much healthier question.&lt;/p&gt;




&lt;h2&gt;
  
  
  The key idea: narrowing before reasoning
&lt;/h2&gt;

&lt;p&gt;The architectural value of event-driven agents is not just decoupling. It is &lt;strong&gt;control&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A well-defined event lets the system narrow four things before the model starts reasoning:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Problem scope
&lt;/h3&gt;

&lt;p&gt;The event defines the operational boundary.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Context scope
&lt;/h3&gt;

&lt;p&gt;Only the relevant knowledge should be materialized.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Action scope
&lt;/h3&gt;

&lt;p&gt;Only the relevant tools and permissions should be exposed.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Evaluation scope
&lt;/h3&gt;

&lt;p&gt;Success criteria become more local and easier to observe.&lt;/p&gt;

&lt;p&gt;This is why event-driven systems can become cheaper and more reliable at the same time.&lt;/p&gt;




&lt;h2&gt;
  
  
  Narrowing across four architectural layers
&lt;/h2&gt;

&lt;p&gt;A serious event-driven agent system should narrow the problem across four layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Event routing — narrowing the problem surface
&lt;/h3&gt;

&lt;p&gt;The first narrowing step is event classification and routing.&lt;/p&gt;

&lt;p&gt;A well-defined event such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;ThermalDriftDetected&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;PolicyViolationDetected&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;ExecutionFailureObserved&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;IncidentSeverityRaised&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;already tells the system that not every capability is equally relevant.&lt;/p&gt;

&lt;p&gt;Routing should select the specialist capability or specialist set that is appropriate for that class of problem.&lt;/p&gt;

&lt;p&gt;The model should not spend tokens discovering what the event already told us.&lt;/p&gt;
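&lt;p&gt;Routing at this layer can be a deterministic lookup, resolved before any model call. A sketch using the event names above — the map shape and the specialist names are assumptions:&lt;/p&gt;

```go
package main

import "fmt"

// routes maps an event class to the specialist responsible for it.
// The event names come from the list above; the specialist IDs are
// hypothetical.
var routes = map[string]string{
	"ThermalDriftDetected":     "thermal-specialist",
	"PolicyViolationDetected":  "policy-specialist",
	"ExecutionFailureObserved": "execution-specialist",
	"IncidentSeverityRaised":   "incident-specialist",
}

// Route resolves deterministically; unknown classes go to a human
// queue instead of falling back to a generic agent.
func Route(eventKind string) string {
	if specialist, ok := routes[eventKind]; ok {
		return specialist
	}
	return "escalate-to-human"
}

func main() {
	fmt.Println(Route("IncidentSeverityRaised"))
	fmt.Println(Route("SomethingNovel")) // no route: escalate
}
```

No tokens are spent on this step: the classification the event already encodes is simply looked up.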

&lt;h3&gt;
  
  
  2. Context materialization — narrowing the knowledge surface
&lt;/h3&gt;

&lt;p&gt;Once the event boundary is known, context should not be assembled as a flat prompt bundle.&lt;/p&gt;

&lt;p&gt;It should be materialized explicitly and narrowly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;relevant entities only&lt;/li&gt;
&lt;li&gt;causal relationships only where useful&lt;/li&gt;
&lt;li&gt;prior mitigations and outcomes&lt;/li&gt;
&lt;li&gt;rationale from previous decisions&lt;/li&gt;
&lt;li&gt;constraints tied to the event class&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where many systems either win or fail.&lt;/p&gt;

&lt;p&gt;A narrow context is not automatically a good context. The goal is not merely to shrink tokens. The goal is to increase &lt;strong&gt;relevance density&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That is why context quality should be observable.&lt;/p&gt;

&lt;p&gt;Useful metrics here include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;raw_equivalent_tokens&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;compression_ratio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;causal_density&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;noise_ratio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;detail_coverage&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those metrics make it possible to ask a much better question than “how much context did we pass?”&lt;/p&gt;

&lt;p&gt;The better question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;how much unnecessary context did we discard without losing what the specialist actually needs?&lt;/p&gt;
&lt;/blockquote&gt;
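&lt;p&gt;Two of those metrics reduce to simple ratios. The exact definitions in a real system may differ; this is a sketch of the intent:&lt;/p&gt;

```go
package main

import "fmt"

// CompressionRatio compares the raw-equivalent tokens against the
// tokens actually materialized into the specialist's context.
func CompressionRatio(rawTokens, materializedTokens int) float64 {
	if materializedTokens == 0 {
		return 0
	}
	return float64(rawTokens) / float64(materializedTokens)
}

// NoiseRatio is the fraction of materialized items not tied to the
// event class — the part of the context that competes for attention
// without adding relevance.
func NoiseRatio(totalItems, irrelevantItems int) float64 {
	if totalItems == 0 {
		return 0
	}
	return float64(irrelevantItems) / float64(totalItems)
}

func main() {
	// Hypothetical numbers: 12k raw-equivalent tokens compressed to 1.5k.
	fmt.Println(CompressionRatio(12000, 1500))
	fmt.Println(NoiseRatio(20, 3))
}
```

Read together, they answer the question in the blockquote: a rising compression ratio with a falling noise ratio means context is getting denser, not just smaller.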

&lt;h3&gt;
  
  
  3. Governed execution — narrowing the action surface
&lt;/h3&gt;

&lt;p&gt;Even with the right context, an agent should not operate over an unrestricted action surface.&lt;/p&gt;

&lt;p&gt;Execution should be governed.&lt;/p&gt;

&lt;p&gt;A runtime layer can narrow execution by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;restricting the candidate tool set&lt;/li&gt;
&lt;li&gt;ranking likely actions before invocation&lt;/li&gt;
&lt;li&gt;applying policy checks before execution&lt;/li&gt;
&lt;li&gt;isolating execution environments&lt;/li&gt;
&lt;li&gt;capturing telemetry, logs, and traces&lt;/li&gt;
&lt;/ul&gt;
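&lt;p&gt;The first two bullets — restricting and ranking — can be sketched as a filter plus a sort over a tool catalog annotated with per-class priors. All names here are hypothetical:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sort"
)

// ToolCandidate annotates each tool with the event class it applies
// to and a prior score (e.g. historical first-call success rate).
type ToolCandidate struct {
	Name  string
	Class string
	Score float64
}

// Narrow filters the global catalog down to the event class, then
// ranks the survivors so the most promising action is tried first.
func Narrow(catalog []ToolCandidate, eventClass string) []ToolCandidate {
	var out []ToolCandidate
	for _, t := range catalog {
		if t.Class == eventClass {
			out = append(out, t)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func main() {
	catalog := []ToolCandidate{
		{"scale_up", "saturation", 0.7},
		{"rollback", "rollout", 0.9},
		{"restart_pods", "saturation", 0.4},
	}
	for _, t := range Narrow(catalog, "saturation") {
		fmt.Println(t.Name) // only saturation tools, best first
	}
}
```

Policy checks and isolation sit after this step: they apply to the already-narrowed candidate set, not the whole catalog.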

&lt;p&gt;The cost of a system is not only what it spends on prompts. It is also what it spends on tool fan-out, denied actions, and unnecessary exploration.&lt;/p&gt;

&lt;p&gt;This is why execution quality also needs metrics.&lt;/p&gt;

&lt;p&gt;Useful metrics here include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;workspace_tool_calls_per_task&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_success_on_first_tool_rate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_recommendation_acceptance_rate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_policy_denial_rate_bad_recommendation&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;invocation_latency_histograms&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Observability and feedback — narrowing through evidence
&lt;/h3&gt;

&lt;p&gt;The final layer is observability.&lt;/p&gt;

&lt;p&gt;Without observability, “event-driven agents reduce cost” remains a belief.&lt;/p&gt;

&lt;p&gt;With observability, it becomes testable.&lt;/p&gt;

&lt;p&gt;A well-instrumented system can show:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;whether context is becoming denser or just smaller&lt;/li&gt;
&lt;li&gt;whether specialists use fewer tools than broad agents&lt;/li&gt;
&lt;li&gt;whether routing improves first-action success&lt;/li&gt;
&lt;li&gt;whether policy boundaries are helping or creating churn&lt;/li&gt;
&lt;li&gt;whether outcome quality improves as scope narrows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the architecture stops being an opinion and becomes an operational hypothesis.&lt;/p&gt;




&lt;h2&gt;
  
  
  A concrete example: alert-driven remediation
&lt;/h2&gt;

&lt;p&gt;A useful way to think about this is an operational remediation loop in a live cluster.&lt;/p&gt;

&lt;p&gt;Imagine an alert arrives from the observability stack because a subsystem crosses a critical threshold.&lt;/p&gt;

&lt;p&gt;A broad agent design might do something like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;gather recent logs&lt;/li&gt;
&lt;li&gt;gather broad system history&lt;/li&gt;
&lt;li&gt;expose many tools&lt;/li&gt;
&lt;li&gt;ask a general-purpose agent to stabilize the situation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That approach pushes too much decomposition work into the model.&lt;/p&gt;

&lt;p&gt;An event-driven design works differently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — a well-defined event enters the system
&lt;/h3&gt;

&lt;p&gt;The alert becomes a well-defined event such as &lt;code&gt;IncidentSeverityRaised&lt;/code&gt; or &lt;code&gt;ExecutionFailureObserved&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 — the event selects a specialist path
&lt;/h3&gt;

&lt;p&gt;The system routes the event to a specialist capability responsible for that class of issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 — context is materialized narrowly
&lt;/h3&gt;

&lt;p&gt;The context layer assembles only what is relevant to that incident type:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the affected subsystem&lt;/li&gt;
&lt;li&gt;recent related failures&lt;/li&gt;
&lt;li&gt;prior mitigations&lt;/li&gt;
&lt;li&gt;current operational constraints&lt;/li&gt;
&lt;li&gt;known causal dependencies&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4 — execution is governed
&lt;/h3&gt;

&lt;p&gt;The runtime narrows the available action space:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;only relevant tools are visible&lt;/li&gt;
&lt;li&gt;suggested actions are ranked&lt;/li&gt;
&lt;li&gt;policy checks can reject unsafe actions before execution&lt;/li&gt;
&lt;li&gt;telemetry is attached to the full cycle&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 5 — the outcome becomes evidence
&lt;/h3&gt;

&lt;p&gt;The result of the mitigation becomes a new event and a new measurement point.&lt;/p&gt;

&lt;p&gt;At that point, the system can observe not only whether the incident was addressed, but how expensive the path was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how much context was needed&lt;/li&gt;
&lt;li&gt;how many tools were considered&lt;/li&gt;
&lt;li&gt;whether the first recommended action succeeded&lt;/li&gt;
&lt;li&gt;whether policy narrowed or blocked the path&lt;/li&gt;
&lt;li&gt;how long the cycle took end to end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the real strength of the pattern.&lt;/p&gt;

&lt;p&gt;The event is not just a trigger for work. It is the boundary that lets the whole system narrow problem scope, knowledge scope, and action scope before reasoning begins.&lt;/p&gt;




&lt;h2&gt;
  
  
  The architecture in one line
&lt;/h2&gt;

&lt;p&gt;A useful mental model is this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc7xcyllpflzwu7k0fxy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqc7xcyllpflzwu7k0fxy.png" alt="Seven stacked boxes connected by arrows showing the stages: Event, Specialist routing, Context materialization, Tool suggestion and policy check, Governed execution, Outcome event, and&amp;lt;br&amp;gt;
   Metrics, Traces, Logs" width="800" height="966"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every stage should remove irrelevant possibilities.&lt;/p&gt;

&lt;p&gt;If the system keeps adding options instead of removing them, it is probably moving in the wrong direction.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this reduces cost
&lt;/h2&gt;

&lt;p&gt;Cost reduction comes from multiple sources at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fewer input tokens&lt;/li&gt;
&lt;li&gt;denser context&lt;/li&gt;
&lt;li&gt;fewer candidate tools&lt;/li&gt;
&lt;li&gt;fewer executions without result&lt;/li&gt;
&lt;li&gt;fewer unsafe actions reaching execution&lt;/li&gt;
&lt;li&gt;fewer retries caused by vague reasoning&lt;/li&gt;
&lt;li&gt;shorter cycles to first useful action&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A broad agent pays these costs implicitly.&lt;/p&gt;

&lt;p&gt;An event-driven specialist system avoids them structurally.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this reduces decision dispersion
&lt;/h2&gt;

&lt;p&gt;Decision dispersion appears when the system leaves too many paths open at once.&lt;/p&gt;

&lt;p&gt;Too much context.&lt;br&gt;
Too many tools.&lt;br&gt;
Too many plausible interpretations.&lt;br&gt;
Too many weakly bounded goals.&lt;/p&gt;

&lt;p&gt;A well-defined event cuts through that.&lt;/p&gt;

&lt;p&gt;It does not eliminate uncertainty, but it turns a diffuse reasoning problem into a more local one.&lt;/p&gt;

&lt;p&gt;The system no longer asks for a global interpretation of the world. It asks for a bounded response to a bounded class of situation.&lt;/p&gt;

&lt;p&gt;That is the kind of narrowing that helps both quality and cost.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to measure in a live system
&lt;/h2&gt;

&lt;p&gt;For this architecture to be credible, it has to be measurable.&lt;/p&gt;

&lt;p&gt;A strong live demonstration would compare a broader path against an event-driven specialist path on the same class of incident.&lt;/p&gt;

&lt;p&gt;For the context layer, useful measurements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;raw_equivalent_tokens&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;compression_ratio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;causal_density&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;noise_ratio&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;detail_coverage&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;context_bytes_saved&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
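&lt;p&gt;As a sketch of how the context-layer metrics could be computed (token counting here is naive whitespace splitting, and the causal/noise labels are assumed to be produced by whatever annotates the materialized context):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Illustrative computation of the context-layer metrics above.
# "causal" and "noise" flags are assumed inputs from the annotation step.

def context_metrics(raw_text, materialized_lines):
    raw_tokens = len(raw_text.split())
    kept_tokens = sum(len(l["text"].split()) for l in materialized_lines)
    causal = sum(1 for l in materialized_lines if l["causal"])
    noisy = sum(1 for l in materialized_lines if l["noise"])
    n = len(materialized_lines)
    kept_bytes = sum(len(l["text"]) for l in materialized_lines)
    return {
        "raw_equivalent_tokens": raw_tokens,
        "compression_ratio": raw_tokens / max(kept_tokens, 1),
        "causal_density": causal / max(n, 1),
        "noise_ratio": noisy / max(n, 1),
        "context_bytes_saved": len(raw_text) - kept_bytes,
    }

m = context_metrics(
    "a b c d e f g h",
    [{"text": "a b", "causal": True, "noise": False},
     {"text": "c d", "causal": False, "noise": True}],
)
print(m["compression_ratio"], m["noise_ratio"])  # 2.0 0.5
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A real implementation would use the model's tokenizer rather than whitespace splitting, but the ratios keep the same shape.&lt;/p&gt;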

&lt;p&gt;For the execution layer, useful measurements include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;workspace_tool_calls_per_task&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_success_on_first_tool_rate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_recommendation_acceptance_rate&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;workspace_policy_denial_rate_bad_recommendation&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;invocation_latency_histograms&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;trace_span_durations&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
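&lt;p&gt;A minimal in-process sketch of collecting the execution-layer counters; a real deployment would export them through something like Prometheus or OpenTelemetry, and the method names below are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from collections import Counter, defaultdict

class WorkspaceMetrics:
    """Toy in-memory collector for the execution-layer signals above."""

    def __init__(self):
        self.counters = Counter()
        self.latencies_ms = defaultdict(list)  # per-tool latency samples

    def record_tool_call(self, tool, latency_ms, ok, first_call):
        self.counters["workspace_tool_calls_per_task"] += 1
        if ok and first_call:
            self.counters["workspace_success_on_first_tool"] += 1
        self.latencies_ms[tool].append(latency_ms)

    def record_policy_denial(self):
        self.counters["workspace_policy_denial_bad_recommendation"] += 1

metrics = WorkspaceMetrics()
metrics.record_tool_call("df", 12.5, ok=True, first_call=True)
metrics.record_policy_denial()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The latency samples feed the invocation histograms; rates like first-tool success come from dividing the counters over a window.&lt;/p&gt;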

&lt;p&gt;The point is not to claim that every event-driven design is automatically better.&lt;/p&gt;

&lt;p&gt;The point is that this design gives you a coherent way to test whether narrowing is actually happening and whether it is paying off.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this does not solve
&lt;/h2&gt;

&lt;p&gt;Event-driven agents do not solve everything.&lt;/p&gt;

&lt;p&gt;They can still fail badly if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;events are poorly designed&lt;/li&gt;
&lt;li&gt;specialist boundaries are unclear&lt;/li&gt;
&lt;li&gt;context materialization is weak&lt;/li&gt;
&lt;li&gt;the runtime exposes the wrong tools&lt;/li&gt;
&lt;li&gt;policies are too loose or too rigid&lt;/li&gt;
&lt;li&gt;observability is incomplete&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A noisy event taxonomy creates noise, not clarity.&lt;/p&gt;

&lt;p&gt;A bad specialist boundary just moves confusion from the prompt to the routing layer.&lt;/p&gt;

&lt;p&gt;A narrow system is only better if the narrowing is semantically sound.&lt;/p&gt;




&lt;h2&gt;
  
  
  The final idea
&lt;/h2&gt;

&lt;p&gt;Reducing cost, improving focus, and eliminating dispersion are consequences of the same principle: narrow before reasoning. When that is combined with materialized context, governed execution, and real observability, the system stops being a prompt pipeline and becomes operational infrastructure.&lt;/p&gt;

&lt;p&gt;The systems that will scale are not the ones that expose larger models to more context and more tools.&lt;/p&gt;

&lt;p&gt;They are the ones that learn how to narrow the world before the model starts thinking.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;— &lt;a href="https://www.linkedin.com/in/tirsogarcia/" rel="noopener noreferrer"&gt;Tirso Garcia&lt;/a&gt; · April 2026&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Building these ideas in the open:&lt;br&gt;
&lt;a href="https://github.com/underpass-ai/rehydration-kernel" rel="noopener noreferrer"&gt;rehydration-kernel&lt;/a&gt; · &lt;a href="https://github.com/underpass-ai/underpass-runtime" rel="noopener noreferrer"&gt;underpass-runtime&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're working on similar problems, I'd love to hear from you.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>agents</category>
      <category>observability</category>
    </item>
  </channel>
</rss>
