Gerardo Arroyo for AWS Community Builders

Posted on • Originally published at gerardo.dev

AgentCore Episodic Memory: When Your Agent Learns from Experience


You've spent weeks tuning your agent's prompt. Every time a new edge case appears, you document it, add it to the instructions, and redeploy. The system prompt has already exceeded 4,000 tokens. And yet, last week the agent made exactly the same mistake it made with another user three weeks ago — one that was already "documented" in the instructions, but that the model simply didn't prioritize at the right moment.

The problem isn't the model. The problem is that your agent has no memory of experiences. It only has memory of instructions.

That distinction, which seems semantic, completely changes the approach. And it's exactly the gap that AgentCore Episodic Memory comes to close.

In the article on AgentCore Evaluations, I promised to cover this capability. Today we close that debt — and if you've already seen how to measure whether your agent works well, now we'll see how to make it actively learn from what it experienced.

The Memory Map in AWS: Where We Stand

Before diving into episodic memory, it's worth orienting ourselves. Over time, AWS has evolved its agent memory capabilities in distinct layers. It's easy to confuse them because they all "remember things", but they serve very different purposes.

Session memory in classic Bedrock Agents — what we explored in the multi-agents article: persists summaries between sessions using a memoryId. Functional, but basic. The agent remembers that something happened, not how it resolved it or what it learned from it.

Session Management APIs — what we covered in the Bedrock Sessions article: full state management within and between sessions, without depending on the Bedrock agent. More control, more code, same limitation: no learning, just persistence.

AgentCore Memory with long-term strategies — the managed service announced in 2025 that allows extracting facts, preferences and session summaries automatically. A great step forward. But it's still declarative memory: it knows that "the user prefers instances in us-east-1", but doesn't remember that the last time it tried to do a rollback in that region it first used the wrong approach and had to correct it.

AgentCore Episodic Memory — announced at re:Invent 2025 as an additional long-term strategy: captures complete experiences, structures them into episodes, and generates reflections that cross multiple episodes to extract generalizable patterns. This is the difference between remembering a fact and remembering how you came to know that fact.

The distinction that helped me understand it most: semantic memory tells you what you know. Episodic memory tells you how you came to know it — and why a certain approach worked better than another.

The Use Case: An Infrastructure Diagnostics Agent

To make this concrete, I built a DevOps agent that diagnoses infrastructure incidents on AWS. The scenario is familiar: someone reports intermittent timeouts in production, the agent investigates with tools (CloudWatch, RDS, EC2), and tries to resolve or escalate.

Without episodic memory, each incident is treated as if it were the first. With it, after several similar RDS incidents the agent knows that when certain combined symptoms appear, checking the connection pool before scaling instances resolves most cases.

That knowledge doesn't come from someone writing it in the system prompt. It comes from the agent's own accumulated experience. All the code for this project is available at github.com/codecr/bedrock-memory.

How AgentCore Episodic Memory Works

When your agent sends events to AgentCore Memory with the episodic strategy enabled, the service runs a three-stage pipeline automatically:

Extraction — Analyzes the episode turn by turn as it occurs. For each turn it records: the situation the agent faced, the intent of that specific action, what tools it used and with what parameters, the reasoning behind the decision, and whether that turn was successful. The service automatically detects whether the episode is complete (the user achieved their goal) or if it continues.

Consolidation — When the episode completes, it synthesizes all turns into a single record that captures the overall situation, the user's intent, whether the objective was achieved, and insights from the episode: what approaches worked, which failed and why.

Reflection — The most interesting part. The reflection module takes the newly consolidated episode, semantically searches similar episodes in the history, and generates reflections — generalizable patterns that apply to future scenarios. Each reflection has a title, a description of when it applies, actionable hints, and a confidence score between 0.1 and 1.0 that grows with each episode that confirms the pattern.

⚠️ Important timing point: Unlike other AgentCore Memory strategies (semantic, summary, user preferences), episodic records only get generated when the episode completes. If the conversation is left unfinished, the system waits before generating the episode. This has implications for how you design your agent's flow — incomplete episodes appear with greater latency.
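The two record types that eventually come out of this pipeline can be sketched as simple data classes. This is an illustrative model only, built from the field names in the JSON output shown later in this article; the actual schema belongs to AWS and may evolve:

```python
from dataclasses import dataclass

@dataclass
class Turn:
    action: str      # what the agent did in this turn
    thought: str     # the reasoning behind the action
    assessment: str  # whether the turn was successful

@dataclass
class Episode:
    situation: str     # overall context of the incident
    intent: str        # what the user was trying to achieve
    turns: list[Turn]  # per-turn extraction output
    assessment: str    # "Yes" if the goal was achieved
    reflection: str    # episode-level insight

@dataclass
class Reflection:
    title: str         # short name for the learned pattern
    use_cases: str     # when the pattern applies
    hints: list[str]   # actionable recommendations
    confidence: float  # 0.1-1.0, grows as more episodes confirm it

# Example: a consolidated episode as the pipeline would produce it
episode = Episode(
    situation="Intermittent timeouts in checkout-api",
    intent="Diagnose the root cause",
    turns=[Turn(action="Query RDS metrics",
                thought="Rule out CPU/memory first",
                assessment="Successful")],
    assessment="Yes",
    reflection="Check connections before scaling",
)
```

Extraction fills in the turns, consolidation fills in the episode-level fields, and reflection produces the cross-episode records.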

Step-by-Step Implementation

Step 1: Memory Resource Configuration

AgentCore Memory uses two separate boto3 clients: the control plane (bedrock-agentcore-control) for creating and configuring resources, and the data plane (bedrock-agentcore) for sending events and retrieving memories.

The MemoryManager class in the repository encapsulates both clients. The most important thing here is the structure of create_memory: the episode namespaces use {actorId}/{sessionId} to organize each incident, and reflections go at the actor level — so the agent learns from all its incidents, not just the current one.

# memory_manager.py
import boto3
import json
import time
from datetime import datetime

class MemoryManager:
    def __init__(self, region_name: str):
        # Control plane: create and manage memory resources
        self.control_client = boto3.client(
            'bedrock-agentcore-control', region_name=region_name
        )
        # Data plane: write events and retrieve memories
        self.data_client = boto3.client(
            'bedrock-agentcore', region_name=region_name
        )

    def create_memory_resource(self, name: str, description: str) -> str:
        response = self.control_client.create_memory(
            name=name,
            description=description,
            eventExpiryDuration=90,          # Raw events retained for 90 days
            memoryStrategies=[{
                'episodicMemoryStrategy': {
                    'name': 'IncidentEpisodes',
                    # Episodes per agent + session (one incident = one session)
                    'namespaces': ['/incidents/{actorId}/{sessionId}'],
                    # Reflections at the actor level: the agent's global insight
                    'reflectionConfiguration': {
                        'namespaces': ['/incidents/{actorId}']
                    }
                }
            }]
        )

        memory_id = response['memory']['id']

        # The resource takes ~2 minutes to become ACTIVE
        while True:
            status = self.control_client.get_memory(
                memoryId=memory_id
            ).get('memory', {}).get('status')

            if status == 'ACTIVE':
                break
            elif status == 'FAILED':
                raise Exception("Memory resource creation failed")
            time.sleep(15)

        return memory_id

A relevant design point: the actorId represents the agent (or agent-user combination, depending on your use case), and the sessionId represents each individual incident. This allows reflections at the actor level to accumulate learning from all incidents without mixing data between sessions.

Step 2: Register the Incident as Events

Each agent interaction — user messages, responses, and especially tool results — is registered as an event. The TOOL role is particularly valuable: it gives the extraction module context about what information the agent had when making each decision.

    def register_event(self, memory_id: str, actor_id: str,
                       session_id: str, content: str, role: str):
        """
        role can be: 'USER', 'ASSISTANT', 'TOOL'
        """
        self.data_client.create_event(
            memoryId=memory_id,
            actorId=actor_id,
            sessionId=session_id,
            eventTimestamp=datetime.now(),
            payload=[{
                'conversational': {
                    'content': {'text': content},
                    'role': role
                }
            }]
        )

In seed_memory.py you can see how a complete incident is registered with all three roles. This fragment shows the pattern with the real RDS incident we used to test the system:

# seed_memory.py — registering incident incident-001
import json

manager.register_event(
    memory_id=memory_id, actor_id=actor_id,
    session_id='incident-001', role='USER',
    content=(
        'Intermittent timeouts in checkout-api for the last 20 minutes. '
        'Affecting 30% of requests. The service uses RDS PostgreSQL.'
    )
)

manager.register_event(
    memory_id=memory_id, actor_id=actor_id,
    session_id='incident-001', role='ASSISTANT',
    content='Understood. I will start by investigating the current state of the '
            'RDS instance and the active-connection metrics.'
)

# The tool result is key: without it the extraction module
# cannot reconstruct the agent's reasoning
manager.register_event(
    memory_id=memory_id, actor_id=actor_id,
    session_id='incident-001', role='TOOL',
    content=json.dumps({
        'tool': 'describe_rds_metrics',
        'params': {'instance': 'checkout-prod-db', 'period_minutes': 30},
        'result': {
            'DatabaseConnections': 485,
            'MaxConnections': 500,
            'CPUUtilization': 42,
            'FreeableMemory_GB': 8.2,
            'ReadLatency_ms': 120
        }
    })
)

manager.register_event(
    memory_id=memory_id, actor_id=actor_id,
    session_id='incident-001', role='ASSISTANT',
    content=(
        'Metrics show 485 of 500 maximum connections (97%). '
        'CPU and memory are normal, which rules out resource overload. '
        'The bottleneck is in the connection pool. '
        'I will check for zombie connections.'
    )
)

# ... more TOOL + ASSISTANT turns until the user confirms resolution ...

manager.register_event(
    memory_id=memory_id, actor_id=actor_id,
    session_id='incident-001', role='USER',
    content='Excellent, that resolved the problem. The timeouts are gone.'
)

The final confirmation from the user ("resolved the problem") is the signal that AgentCore uses to detect that the episode is complete and launch the consolidation and reflection pipeline.

Step 3: Wait for Episode Generation

AgentCore Memory processes episodes asynchronously. After registering all events, the service needs time to run extraction → consolidation → reflection:

    def wait_for_episode(self, memory_id: str, actor_id: str,
                         session_id: str, timeout_minutes: int = 10):
        namespace = f'/incidents/{actor_id}'
        deadline = time.time() + (timeout_minutes * 60)

        while time.time() < deadline:
            response = self.data_client.retrieve_memory_records(
                memoryId=memory_id,
                namespace=namespace,
                searchCriteria={'searchQuery': session_id},
                maxResults=5
            )

            records = response.get('memoryRecordSummaries', [])
            if records:
                print(f"✅ Episode generated for session {session_id}")
                return records[0]

            print("⏳ Waiting for episode...")
            time.sleep(30)

        return None

In practice, with the 5 incidents in the seed (34 events total), AWS generated the 5 episodes and 5 reflections in approximately 30-60 minutes. It's not real-time — it happens in the background while the agent continues handling other incidents.

Step 4: Retrieve Relevant Experiences

Before starting any new diagnosis, the agent queries episodic memory. The API uses semantic search with searchCriteria.searchQuery — it's not exact keyword search, but search by semantic similarity:

    def retrieve_experiences(self, memory_id: str, actor_id: str,
                             query: str, max_results: int = 3) -> dict:
        response = self.data_client.retrieve_memory_records(
            memoryId=memory_id,
            namespace=f'/incidents/{actor_id}',
            searchCriteria={
                'searchQuery': query
            },
            maxResults=max_results
        )

        records = response.get('memoryRecordSummaries', [])

        # Records come back as JSON; episodes and reflections
        # are distinguished by the presence of specific fields
        episodios = []
        reflections = []

        for record in records:
            content_text = record.get('content', {}).get('text', '')
            try:
                content_json = json.loads(content_text)
                # Episodes have 'situation' and 'turns'
                if 'situation' in content_json and 'turns' in content_json:
                    episodios.append(record)
                # Reflections have 'title' and 'use_cases'
                elif 'title' in content_json and 'use_cases' in content_json:
                    reflections.append(record)
            except json.JSONDecodeError:
                pass

        return {'episodios': episodios, 'reflections': reflections}

An important detail about the format: AWS documentation shows examples in XML, but in practice the service returns JSON. The fields are situation, turns, intent, assessment for episodes, and title, use_cases, hints, confidence for reflections. The code handles both formats for compatibility.
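Because of this documentation mismatch, a small defensive classifier keeps the parsing logic in one place. A minimal sketch, using the JSON field names observed above; anything that fails to parse as JSON (for instance, XML-shaped content from older examples) is flagged rather than crashing:

```python
import json

def classify_record(content_text: str) -> str:
    """Classify a memory record's content as 'episode', 'reflection',
    or 'unknown' based on the JSON fields it contains."""
    try:
        data = json.loads(content_text)
    except json.JSONDecodeError:
        return "unknown"   # e.g. XML-formatted content, not valid JSON
    if "situation" in data and "turns" in data:
        return "episode"
    if "title" in data and "use_cases" in data:
        return "reflection"
    return "unknown"

print(classify_record('{"situation": "...", "turns": []}'))     # episode
print(classify_record('{"title": "...", "use_cases": "..."}'))  # reflection
print(classify_record("<episode>...</episode>"))                # unknown
```

The "unknown" bucket is deliberate: logging those records is safer than silently dropping them if the format ever changes again.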

Step 5: Inject Context into the Agent

Retrieval alone does nothing — the value is in how you prepare the agent with that information before the diagnosis. In agent.py, the experiences are incorporated into the system prompt before calling Bedrock Converse:

# agent.py — building the system prompt with experiences

def _build_system_prompt(self, experiences: dict) -> str:
    prompt = """You are a DevOps agent, expert in AWS infrastructure diagnostics.
Your specialty: RDS (PostgreSQL, MySQL, Aurora), EC2, connection issues,
latency, CPU and memory. Use a methodical approach: analyze symptoms, identify
metrics to check, interpret results, provide a diagnosis and a solution.
"""
    # Reflections first: they guide the overall strategy
    if experiences['reflections']:
        prompt += "\n=== PATTERNS LEARNED FROM PREVIOUS EXPERIENCE ===\n\n"
        for reflection in experiences['reflections']:
            content = json.loads(reflection.get('content', {}).get('text', ''))
            score = reflection.get('score', 0)
            prompt += f"[Relevance: {score:.2f}]\n"
            prompt += f"Pattern: {content.get('title', '')}\n"
            prompt += f"Applies when: {content.get('use_cases', '')}\n"
            hints = content.get('hints', [])
            if isinstance(hints, list):
                prompt += "Recommendations:\n"
                for hint in hints[:5]:
                    prompt += f"  - {hint}\n"
            prompt += f"Confidence: {content.get('confidence', '')}\n\n"

    # Then the episodes: concrete examples of similar cases
    if experiences['episodios']:
        prompt += "\n=== SIMILAR CASES RESOLVED PREVIOUSLY ===\n\n"
        for episode in experiences['episodios'][:2]:  # Only the 2 most relevant
            content = json.loads(episode.get('content', {}).get('text', ''))
            prompt += f"Situation: {content.get('situation', '')}\n"
            prompt += f"Learning: {content.get('reflection', '')}\n"
            prompt += "---\n\n"

    prompt += "\nBased on your previous experience, provide a clear, actionable " \
              "diagnosis. If you recognize a pattern similar to earlier cases, " \
              "mention it explicitly.\n"
    return prompt

The AWS documentation distinguishes when to use each type: reflections for high-level strategic guidance (what to check first, what mistakes to avoid), episodes when the new problem is very specific and there's already a nearly identical resolved case. For the DevOps agent, the combination of both gives the best result.
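Wiring it together, the enriched prompt becomes the system block of a Bedrock Converse request. A sketch of the request shape; the model ID is a placeholder, and the actual call (via boto3's bedrock-runtime client) is commented out since it needs AWS credentials:

```python
def build_converse_request(system_prompt: str, incident_report: str,
                           model_id: str = "anthropic.claude-3-5-sonnet-20240620-v1:0") -> dict:
    """Assemble a Bedrock Converse request with the experience-enriched
    system prompt and the new incident as the user message."""
    return {
        "modelId": model_id,
        "system": [{"text": system_prompt}],
        "messages": [{
            "role": "user",
            "content": [{"text": incident_report}],
        }],
        "inferenceConfig": {"maxTokens": 1024, "temperature": 0.2},
    }

request = build_converse_request(
    system_prompt="You are a DevOps agent...",  # output of _build_system_prompt
    incident_report="Intermittent timeouts in checkout-api. RDS PostgreSQL.",
)
# runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
# response = runtime.converse(**request)
# diagnosis = response["output"]["message"]["content"][0]["text"]
```

Keeping the request assembly separate from the API call makes the retrieval-to-prompt pipeline easy to unit-test without touching AWS.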

The Real Output: What AWS Generates

Once the pipeline processes the events, the records return with this JSON structure. This is a representative example of what the service generated for our RDS incident:

Episode:

{
  "situation": "DevOps agent investigating intermittent timeouts in the checkout-api service. RDS PostgreSQL instance in production. Symptom: 30% of requests timing out.",
  "intent": "Diagnose and resolve the root cause of timeouts in the checkout database",
  "turns": [
    {
      "action": "Query RDS connection metrics with describe_rds_metrics",
      "thought": "First check whether the problem is resources (CPU, memory) or connections",
      "assessment": "Successful — discovered 97% utilization of the connection limit"
    },
    {
      "action": "Analyze zombie connections with check_zombie_connections",
      "thought": "Normal CPU and memory rules out resources; high connections suggest a poorly managed pool",
      "assessment": "Successful — identified 180 idle-in-transaction connections"
    }
  ],
  "assessment": "Yes",
  "reflection": "For RDS timeouts with normal CPU: check connections before scaling. Idle-in-transaction connections are a sign of a connection leak in the application."
}

Reflection (generated after multiple similar episodes):

{
  "title": "Database Connection Pool Exhaustion Diagnosis",
  "use_cases": "Applies when services experience intermittent timeouts with database backends, particularly RDS PostgreSQL. Relevant for production incidents where service degradation suggests resource bottlenecks. Most useful when initial symptoms point to database connectivity rather than compute resources.",
  "hints": [
    "Start with infrastructure metrics (describe_rds_metrics) using a time window matching the incident duration to establish baseline health.",
    "Distinguish between resource exhaustion (high CPU/memory) and connection pool exhaustion (high DatabaseConnections with normal CPU).",
    "When DatabaseConnections exceeds 80% of maximum, prioritize connection pool investigation over vertical scaling.",
    "Check for idle-in-transaction connections as these indicate application-level connection leaks.",
    "Apply immediate remediation (kill zombie connections) before implementing permanent fixes."
  ],
  "confidence": 0.9
}

Notice the confidence score at 0.9 — that value grew with each additional incident that confirmed the pattern. When the first episode is generated, confidence is low. After five similar incidents resolved the same way, the system has high confidence in the pattern.

Reflections retrieved for query with relevant memory — score 0.54, confidence 0.9
Reflection with relevance score 0.54 for an RDS timeout query. The system correctly identifies the connection pool exhaustion pattern.

And this is what it returns for a query without relevant memory — when the problem is inter-region network latency, something the agent has never seen:

Reflections for query without relevant memory — scores 0.38 and 0.37
For an inter-regional latency query, scores drop to 0.38 and 0.37. The agent retrieves the closest available records, but the low relevance indicates there's no prior specific experience.

Namespaces: The Most Important Design Decision

The namespace structure determines the scope of learning. It's worth thinking through carefully because it's not trivial to change afterward.

The pattern we used — episodes in /incidents/{actorId}/{sessionId} and reflections in /incidents/{actorId} — generates insights at the agent level. A single agent learns from all the incidents it has handled.

# Option A: per-agent learning (what we implemented)
# An agent learns from its own incidents
'namespaces': ['/incidents/{actorId}/{sessionId}']      # episodes
'reflectionConfiguration': {'namespaces': ['/incidents/{actorId}']}  # reflections

# Option B: global learning (all agents share insights)
# Useful if you run multiple instances of the same agent
'namespaces': ['/incidents/{actorId}/{sessionId}']
'reflectionConfiguration': {'namespaces': ['/incidents']}   # ← no actorId

# Option C: learning per service type (if you categorize incidents)
'namespaces': ['/incidents/rds/{actorId}/{sessionId}']
'reflectionConfiguration': {'namespaces': ['/incidents/rds']}

The AWS documentation is explicit on this point: reflections can span multiple actors within the same memory resource. If different actors represent different end users (not just different instances of the same agent), global-level reflections could mix information from different people. In that case, keep reflections at the actor level or combine with Guardrails.
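To make the scope difference concrete, here is how the templates resolve for two agent instances under each option. This is a pure-Python illustration of the substitution the service performs; the actor IDs are hypothetical:

```python
def resolve(template: str, actor_id: str, session_id: str = "") -> str:
    """Substitute the {actorId}/{sessionId} placeholders the way the
    service does when writing records into a namespace."""
    return template.format(actorId=actor_id, sessionId=session_id).rstrip("/")

# Option A: per-agent reflections — each agent reads only its own namespace
print(resolve("/incidents/{actorId}", "agent-us"))  # /incidents/agent-us
print(resolve("/incidents/{actorId}", "agent-eu"))  # /incidents/agent-eu

# Option B: global reflections — both agents share one namespace,
# so learning (and potentially user data) crosses actor boundaries
print(resolve("/incidents", "agent-us"))  # /incidents
print(resolve("/incidents", "agent-eu"))  # /incidents
```

Under Option B the two agents write to and read from the same reflection namespace, which is exactly why it is only safe when actors are instances of the same agent rather than distinct end users.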

The Real Numbers

After seeding the memory with 5 historical incidents (34 events total) and waiting for AWS to generate the episodes and reflections, I ran two comparative queries to validate that the system discriminates correctly:

Query 1 — WITH relevant memory:

Intermittent timeouts in checkout-api. RDS PostgreSQL.
Connections at 92%. CPU at 40%.

Query 2 — WITHOUT relevant memory:

High latency between regions.
Traffic from us-east-1 to eu-west-1 is very slow.
| Metric | RDS Query | Network Query | Difference |
| --- | --- | --- | --- |
| Average relevance score | 0.497 | 0.390 | +27.4% |
| Episodes retrieved | 5 | 4 | |
| Reflections retrieved | 5 | 6 | |
| Mentions prior experience | ✅ Yes | ❌ No | Qualitative |
| Specific diagnostic order | ✅ Yes | ❌ No | Qualitative |

The average relevance score (0.497 vs 0.390) reflects how semantically close the retrieved records are to the query. For the RDS query, individual scores of the reflections are [0.568, 0.511, 0.491] — all above 0.49. For the network query, the same DB records are retrieved but with scores [0.406, 0.404, 0.385] — the system brings them because they're the closest available, but with less confidence in their relevance.
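That score gap suggests a simple guard before injecting experiences into the prompt: average the relevance scores and only use the records above a cutoff. A minimal sketch; the 0.45 threshold is my assumption, calibrated to the scores above, not an AWS recommendation:

```python
def has_relevant_experience(records: list[dict], threshold: float = 0.45) -> bool:
    """Return True when retrieved records are, on average, semantically
    close enough to the query to be worth injecting into the prompt."""
    scores = [r.get("score", 0.0) for r in records]
    if not scores:
        return False
    return sum(scores) / len(scores) >= threshold

# Scores observed for the two comparative queries
rds_query     = [{"score": 0.568}, {"score": 0.511}, {"score": 0.491}]
network_query = [{"score": 0.406}, {"score": 0.404}, {"score": 0.385}]

print(has_relevant_experience(rds_query))      # True  (avg ≈ 0.523)
print(has_relevant_experience(network_query))  # False (avg ≈ 0.398)
```

With a guard like this, the inter-region latency query would fall back to a plain system prompt instead of anchoring the agent on barely related database incidents.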

The most interesting thing isn't the numbers but the agent's qualitative behavior. For the RDS query, the diagnosis starts like this:

Diagnosis with relevant memory — 5 episodes, 5 reflections used
The agent immediately recognizes the pattern: "Connection Pool Exhaustion with Zombie Connections". Proposes the 3-step diagnostic plan without prior exploration, based on accumulated experience.

For a Lambda query with IAM errors (something the agent only partially knows from its memory), the behavior is different — it responds with context from general patterns but clarifies the limits of its experience:

Diagnosis with partially relevant memory — 2 episodes, 8 reflections
With 2 episodes and 8 relevant reflections, the agent gives a useful but more generic diagnosis, and explicitly notes that the problem is outside its main area of expertise.

AWS published formal benchmarks in January 2026 with the τ2-bench dataset (customer service scenarios in retail and airlines). Without memory, the agent resolves 65.8% of scenarios successfully at least once. With cross-episode reflections, that number rises to 77.2%. More importantly, consistency (solving 3 out of 4 attempts) improves from 42.1% to 55.7%. The agent doesn't just solve more problems; it solves them more reliably.

The Service Internals: Observability Metrics

Something I didn't expect to find when reviewing the AWS console was the Observability section in the memory resource. It shows real-time operational metrics:

AWS Console — active episodic strategy with observability metrics
In 24 hours: 58 invocations to create_event (131.3ms average latency, 0 errors), 52 invocations to retrieve_memory_records (189.1ms average latency, 0 errors), 14 long-term memories extracted. No alerts configured, no extra code.

The 14 extracted records correspond to the 5 episodes plus the 9 reflections generated from patterns detected across similar incidents. The 189ms average retrieval latency is completely acceptable for a diagnostics system where the total agent response time is 5-7 seconds.

Lessons Learned (and a Couple of Gotchas)

Episodic latency is real and you need to design for it. Other memory types generate records continuously. Episodic ones wait for the episode to complete. In production this means you can't rely on learning from one incident being immediately available for the next. It has latency of minutes, not seconds.

Tool results are the most valuable input. In our diagnostics agent, the metrics returned by tools are what allows the extraction module to understand why the agent made each decision. Without including them as events with role TOOL, episodes lose a lot of depth.

The actual format is JSON, not XML. Official documentation shows XML fragments in some places, but the service returns JSON with fields like situation, turns, use_cases, hints. If you build the parser expecting XML, you'll have problems. The code in the repository handles both formats, but in practice you'll only see JSON.

Episodes vs. reflections isn't a choice — it's a combination. The τ2-bench benchmarks show that reflections improve performance more on open-ended problems (+11.4% in Pass^1), while episodes as examples work better in well-defined flows with clear procedures. For the DevOps case, the combination of both gives the best result.

Control plane vs. data plane is a real distinction with consequences. bedrock-agentcore-control has much lower quotas than bedrock-agentcore. Creating the memory resource is a control plane operation you should do at infrastructure time (IaC, deployment), not at runtime. In production, the agent should only call the data plane.

The name episodic can be misleading. The strategy doesn't remember "what happened like a diary". It remembers how something was resolved with enough structure to be useful in similar future situations. It's less human episodic memory and more like a runbook that writes itself.

Conclusion

There's a moment in agent development when you can no longer improve it just with prompts. You've covered the common cases, added examples, refined the tone. But the agent still doesn't capitalize on the experience it's already accumulated — each interaction starts from scratch.

AgentCore Episodic Memory is the answer to that moment. It doesn't replace careful agent design or continuous evaluation (for that, Evaluations). What it does is add a learning layer that feeds itself as the agent works.

The DevOps agent we built starts knowing nothing about RDS timeouts. After five similar incidents, its reflections have 90% confidence and tell it exactly in what order to verify metrics, what patterns are signals for what type of problem, and what temporary vs. permanent solutions to apply. That knowledge wasn't written by anyone — the agent accumulated it from its own experience.

With this we close the series on the three capabilities announced at re:Invent 2025. Evaluations to measure quality in production. Policy to define limits the agent can't cross. And Episodic to let it learn from what it lives. Three pieces that together fundamentally change what it means to take an agent to production.

Do you already have agents in production that would benefit from this type of memory? Or do you have cases where episode generation latency would be a blocker? I'm interested in knowing what you're building — comments are open.

