
Brooke Jamieson for AWS

Posted on • Originally published at builder.aws.com

Why AI Agents Need Context Graphs (And How to Build One with AWS)

Back in 2020, I wrote that dashboards tell you what's happening, but not why. That the rush to be "data-driven" was derailing more decisions than it was informing. Over five years later, we're building AI agents and they're hitting the same wall. We know that past decisions were made, but we have no idea why - context graphs are here to change that!

I finally had time to catch up on reading over the holidays (which felt SO LUXURIOUS!) - two posts really caught my eye: Jaya Gupta and Ashu Garg's piece on context graphs from Foundation Capital, and Animesh Koratana's technical deep-dive on how to actually build one. They gave a name to something I'd been circling for a while and now that I'm back at my desk, the first thing I wanted to do was write a practical guide for developers who want to build this with AWS.

Their posts are worth reading in full, and in this piece I wanted to dig in to the practical side: how do you actually build a context graph with the tools we have today?

What People Actually Want

Working in AI consulting taught me that a lot of the time when someone says "we need AI," what they actually mean is "we make the same kinds of decisions over and over, and we want to stop winging it." People would often think they wanted a magic wand or something, but they really just wanted some clarity in the rules and a nice way to handle the exceptions. Basically they want a decision tree with memory, but not just any memory. They need to remember why decisions were made, not just that they were made. And that reasoning needs to be accessible at the right moment, when a similar decision comes up again.

This is where the "data-driven" promise breaks down. Having data isn't enough. You need the right data, accessible at the right time. The same is true for memory. And it turns out, the kind of memory that matters most was never captured as data in the first place. Enterprise systems are gnarly: full of old tools, weird workflows, SQL Server running on a box hidden under a staircase, random spreadsheets, Slack threads from 2021. For sure, some of it is straight-up junk, but lots of this clutter is where the work really gets done. There are hidden gems where people store context, judgement, and workarounds.

But the hidden gems are all HIDDEN! They haven't been captured or stored as data, so the reasoning connecting data to action has all fallen by the wayside. The AI agents we're building don't know about the whispered approvals or the "we tried this before and it blew up" pieces of the puzzle. None of this useful reasoning (which is the most important part of onboarding an employee) exists in a way that can be queried.

What Are Context Graphs?

A context graph captures decision traces. Not just WHAT happened, but WHY it was the right call, who approved it, what precedents informed it, and what alternatives were considered.

Think of it like git. Git captures what changed (the diff) and who changed it (the author - or git blame, if you're feeling scandalous). Git gives you a lot of info, but it misses a lot too - things that might have been discussed in standups or design discussions, the alternatives you considered but didn't end up going with, the tradeoffs you looked at, or why the 'winning approach' won.

Context graphs can help to capture the missing layer of reasoning.

Animesh frames this as a "two clocks" problem: there's so much infrastructure for the state clock (what's true right now), but barely anything for the event clock (what happened, in what order, with what reasoning).

Why "Just Add Memory" Isn't Enough

If remembering things is so important, why not just add some memory? Well - memory alone isn't enough:

  • It's not just what you talked about, but also how you decided
  • The decisions need to be connected to the customers, systems or services they touched
  • It's easy to think about searching conversations, but searching precedent is a whole different thing
  • Reasoning like "what would happen if..." is what was always missing from dashboards, and it's the same thing here.

Context graphs look at this in a different way. It's not memory like "what did we talk about," but memory as in "what did we decide, why, and what did it affect."

The "graph" part is what makes this more than a log. It's all about connections! So you could start with a customer and see everything that affected them, or start by looking at a problem and find similar things you've tackled before.

These connected traces become a record that you can query to look at HOW your org makes decisions. So it's not just what the policies are, but how they get applied in practice.

Building Context Graphs with AWS

Honest answer here is that I think Strands and AgentCore are the coolest/most useful things AWS has launched in a long time, but they don’t get talked about enough, and this is an example where they’re super useful! It's their time to shine!

An important caveat here is that context graphs are early. Not just on AWS, but everywhere! Jaya and Ashu's blog came out on Dec 22 2025, so you're early to the party. Animesh's post from Dec 28 2025, which goes into structural embeddings, 'what if' simulation, and self-discovering structure, is also fresh! I wasn't the only person on PTO at the time, so I doubt anyone has productionized it yet. (If you have, please get in touch!)

So we're figuring this out together, but AWS has the building blocks you need:

What You Need | AWS Service | How It Helps
Agents that can call tools & figure out what to do next | Strands Agents SDK | Your agent decides which tools it should call, and what order to call them in
Remember what happened and why | AgentCore Memory | Stores facts, episodes, and summaries, not just the chat history
Talk to the APIs you already have | AgentCore Gateway | Uses MCP to connect agents to your systems
Know what's allowed, and log why | AgentCore Policy | Cedar policies with a full audit trail telling you what was allowed/denied and why
Track who made each decision | AgentCore Identity | The "who" travels with the request (OAuth with identity propagation)
Debug and audit the whole chain | AgentCore Observability | OpenTelemetry traces linking identity → policy → tool → outcome

A note on costs before you build: obviously you should check the AWS Pricing Calculator and the pricing page, but tl;dr: AgentCore Memory costs scale linearly. As a reference point, 100K short-term events + 10K long-term memories + 20K retrievals = $42.50/month.

I know I keep saying "it's not just chat history", so to drill down into this more, here are the strategies in play:

  • Semantic memory stores facts about things: "Service X has had 3 incidents this month"
  • Episodic memory tells you the full story of a decision: think of this like answering job interview questions in the STAR (Situation, Task, Action, Result) format.
  • Summary memory is what it sounds like: it's a condensed summary that's faster to recall and search when you need it.

Special shoutout to the episodic strategy, because it does something that maps directly to what Animesh described as "learning from trajectories." There's also a good reason that so many companies tell you to answer job interview questions with the STAR format - it works! In this case, each episode captures important context in a structured way, so you know what the agent was trying to do in the first place, but also what it did, and how everything turned out in the end.
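To make that concrete, here's a minimal sketch of what configuring these strategies could look like with the AgentCore Memory Python SDK. The memory name and namespace templates are placeholders I made up, and I'm going from memory on the exact strategy keys (especially the episodic one), so check the current docs before you copy this:

from bedrock_agentcore.memory import MemoryClient

client = MemoryClient(region_name="us-west-2")

# Placeholder name and namespaces - design these around your own decisions
memory = client.create_memory_and_wait(
    name="DecisionTraces",
    strategies=[
        # Semantic: facts about things
        {"semanticMemoryStrategy": {
            "name": "facts",
            "namespaces": ["/decisions/facts/{actorId}"],
        }},
        # Summary: condensed, fast-to-search recaps per session
        {"summaryMemoryStrategy": {
            "name": "summaries",
            "namespaces": ["/decisions/summaries/{actorId}/{sessionId}"],
        }},
        # Episodic: the full STAR-style story of each decision
        # (strategy key assumed here - verify against the current API)
        {"episodicMemoryStrategy": {
            "name": "episodes",
            "namespaces": ["/decisions/episodes/{actorId}/{sessionId}"],
        }},
    ],
)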

This is also a good opportunity to go back and talk some more about things we learnt from Data-Driven Decision Making, because I think there's a good parallel here! One of the other things I learnt early in my data science career was that lots of companies just want to hoard data like a big dragon sitting atop a pile of gold - they don't use it, they just want to know they have it. This is often completely useless, because what's the point of having all this data if you're not going to use it for anything helpful? Reflections are what save you from doing this all over again with memory. As you accumulate episodes, AgentCore analyses them to find the INSIGHTS that businesses froth over, so you can see the patterns, best practices and lessons you (hopefully) learnt.

How Knowledge Compounds

I talked about hoarding data earlier, and knowledge is a key example of this because it gets SO MESSY over time if you keep shoving it in a garage and closing the door to hide it. Facts change over time, and people might describe the same fact in different ways anyway, so this can all get very hectic very quickly.

This is why you need memory consolidation to tidy it all up. AgentCore Memory doesn't just keep appending things; it intelligently merges related memories, resolves any conflicts that pop up, and keeps a trail of what's changed.

There's a really good AWS Labs repo with an SRE agent use case that shows this in action. When the Kubernetes agent figures out that a particular memory leak pattern is causing OOM kills, that knowledge is stored. Then, the next time any agent (not just the Kubernetes agent!) sees a similar issue, that pattern is available.

I think this gets at the "compounding" that Animesh was talking about - each investigation helps to make future investigations smarter.
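Here's a hedged sketch of what that cross-agent lookup could look like, assuming a memory store shaped like the SRE sample's (the memory_id and namespace below are made up):

from bedrock_agentcore.memory import MemoryClient

client = MemoryClient(region_name="us-west-2")

# Any agent can query the shared infrastructure knowledge, not just the
# Kubernetes agent that originally learned the pattern
patterns = client.retrieve_memories(
    memory_id="sre_agent_memory-abc123",        # hypothetical ID
    namespace="/sre/infrastructure/k8s-agent",  # made-up actor namespace
    query="memory leak pattern causing OOM kills",
    top_k=3,
)
for p in patterns:
    print(p.get("content", {}).get("text"))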

Who Approved What?

Remember git blame? You can use AgentCore Policy and Identity together to make a more useful (and hopefully less passive-aggressive) version of this that shows the full story of who requested it, what policy said it was OK, and why. Getting an audit trail that shows the connections between identity → policy evaluation → tool execution → outcome is super useful if someone asks "why did the agent restart that pod?", because you can trace it all back to: "Alice requested it, the SRE-actions policy permitted it because she has the on-call role, and here's what happened next."

It's important to note that reasoning and governance get split up here - reasoning lives in Memory and governance lives in Policy logs. I think this might be handy for compliance-heavy workflows/industries.
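None of this is a real API, but if you stitched those sources together, a single audited action might look something like this (the field names are purely illustrative):

# Illustrative only: one audited action, assembled from Identity (who),
# Policy (what was allowed and why), and Observability (what actually ran)
audit_record = {
    "trace_id": "otel-trace-7f3a91",  # OpenTelemetry trace tying it together
    "identity": {"user": "alice@company.com", "role": "sre-oncall"},
    "policy": {
        "policy_set": "sre-actions",
        "decision": "ALLOW",
        "reason": "permit restart-pod when principal has role sre-oncall",
    },
    "tool_call": {"name": "restart_pod", "args": {"pod": "api-7d9f"}},
    "reasoning_ref": "/sre/investigations/alice/session-42",  # lives in Memory
    "outcome": "pod restarted, latency recovered within 4 minutes",
}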

A Minimal Example

I've yapped a lot so far, so let's make things a bit more concrete. Here's a basic outline of what a decision-capturing agent with Strands might look like:

from strands import Agent, tool
from strands.models import BedrockModel

# Define a tool that captures the decision trace
# This is where you'd store to AgentCore Memory in practice
@tool
def log_decision(
    situation: str,
    context_gathered: list,
    recommendation: str,
    reasoning: str
) -> str:
    """Log a decision trace for future reference.

    Args:
        situation: What triggered this decision
        context_gathered: Data points collected before deciding
        recommendation: The recommended action
        reasoning: Why this recommendation makes sense
    """
    # In practice: store in AgentCore Memory with a namespace
    # like /decisions/customers/{customerId} etc
    return f"Decision logged: {recommendation}"

# Choose your model via Bedrock
model = BedrockModel(model_id="anthropic.claude-sonnet-4-5-20250929-v1:0")

# The system prompt is the key here
# It tells the agent to capture reasoning, not just complete the task
agent = Agent(
    model=model,
    system_prompt="""You help make decisions and capture your reasoning.

    For every decision:
    1. Describe the situation
    2. List what context you gathered
    3. State your recommendation
    4. Explain your reasoning

    Always call log_decision before responding.""",
    tools=[log_decision]
)

The system prompt is what turns this from a task executor to a context graph builder.
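And to try it out, you just invoke the agent; per the system prompt, it should call log_decision before answering (the scenario below is made up):

# Invoke the agent - the system prompt steers it to call log_decision first
response = agent(
    "Customer ACME-042 wants a refund outside the 30-day window. "
    "They've been with us 6 years and this is their first request."
)
print(response)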

What a Decision Trace Looks Like

Earlier in the piece, I talked about Alice from the SRE Agent repo; the repo's policy example pairs her with Carol. When Alice (an SRE) and Carol (an exec) both look into the same API degradation issue in different ways, they both get the same technical findings - a database config failure, a 33x increase in response time, and memory exhaustion. But the traces show different reasoning behind getting to the same insight:
Alice's trace:

Root Cause: Database configuration failure causing connection timeouts - 
missing ConfigMap 'database-config' and invalid permissions on data directory

Escalation: alice.manager@company.com or sre-oncall@company.com if resolution exceeds 1 hour
Notifications: #alice-alerts, #sre-team

Next Steps:
1. Immediate (< 1 hour): Create/restore missing 'database-config' ConfigMap 
   and fix permissions on database data directory
2. Short-term (< 24 hours): Increase Java heap space allocation and implement 
   connection pooling with proper timeout handling
3. Long-term (< 1 week): Optimize slow query and implement circuit breakers

Carol's trace:

Root Cause: Database service failure due to missing ConfigMap 'database-config' 
in production namespace, causing cascading failures

Escalation: Notify executive team if not resolved within 20 minutes
Notifications: Executive channels (critical severity only)

Next Steps:
1. Immediate (< 1 hour): Create/restore missing ConfigMap, fix permissions, 
   increase memory allocation
2. Short-term (< 24 hours): Implement circuit breakers to prevent cascading failures
3. Long-term (< 1 week): Review memory usage patterns in UserService.loadAllUsers

So they both noticed the same thing happening, and they both dug into it, and the traces show how each of them got to the 'insight', including why it was handled this way, for this person, in this context.

Start Small

Remember - this is new - REALLY NEW - so we're all just figuring it out as we go. You do not need to do all of this on day one, and at this stage it's more than enough to just be aware that this is even a thing that we can almost do.

If you want to get stuck in, I recommend picking one workflow that's super decision heavy. Something where "it depends" is the honest answer. Start there, and then:

  1. Make an agent to handle that workflow with Strands
  2. Configure memory to capture those insights from the decision trace. To do this, you'll need to design namespaces around the things you care about (things like inputs, reasoning, outcome)
  3. Build precedent lookups so you're not reinventing the wheel every time. You want to look at similar decisions from the past before making a new decision (there's a sketch of this after the list).
  4. Iterate! This is where the 'figuring it out' part is key. Think about what you're missing, what would make this more helpful, and how you can continue to get value as you scale.
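Putting steps 1 to 3 together, here's a rough sketch of a precedent lookup as a Strands tool. The memory ID, namespace, and return shape are all assumptions on my part, so treat this as a starting point rather than a recipe:

from strands import Agent, tool
from bedrock_agentcore.memory import MemoryClient

memory_client = MemoryClient(region_name="us-west-2")
MEMORY_ID = "decision_traces-abc123"  # placeholder for your memory store

@tool
def find_precedent(query: str, customer_id: str) -> str:
    """Search past decision traces for similar situations.

    Args:
        query: A description of the decision being considered
        customer_id: The customer the decision touches
    """
    results = memory_client.retrieve_memories(
        memory_id=MEMORY_ID,
        namespace=f"/decisions/customers/{customer_id}",
        query=query,
        top_k=3,
    )
    return "\n".join(str(r) for r in results) or "No precedent found."

agent = Agent(
    system_prompt="Before recommending anything, call find_precedent "
                  "and weigh similar past decisions in your reasoning.",
    tools=[find_precedent],
)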

Where we're at now, and where we're going

Ok so - the "trillion-dollar opportunity" that Foundation Capital described in their blog isn't something I can show you how to build out of the box today. Not on AWS, and not anywhere. I keep talking about 'figuring it out' in this piece, and that's because this is very (very!) new. I'm excited to get to the "what if" simulation that Animesh was talking about, but we're not there yet. (Or at least I'm not there yet, but I'm doing my best and taking you along with me as I work on this!)

What we can do today is start capturing those decision traces! This is the foundation that everything else seems to layer on to, and AgentCore Memory gives you a neat way to start on this.

AgentCore Memory is hierarchical storage with semantic search, not a graph database. But! You can (and IMO you should) be smart when you design your namespaces to set yourself up for graph-like query patterns. The SRE agent uses patterns like:

/sre/users/{actorId}/preferences           → user-specific settings
/sre/infrastructure/{actorId}/{sessionId}  → what each agent learned
/sre/investigations/{actorId}/{sessionId}  → decision traces by user

This is smart because it sets you up for success with your query patterns. So if you want to be able to find "every decision that touched Customer X", you could use /decisions/customers/{customerId}. Or, if you want to be able to figure out "all incidents for Service Y", you could use /decisions/services/{serviceId}.

Single-dimension namespaces are probably enough to get started, and you can add cross-referencing down the line as you start to get a clearer picture of your query patterns. The important bit here is that the decision traces are getting recorded, and then you can build out the graph structure as the tooling (and research) matures.
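One cheap way to fake the cross-referencing in the meantime is a tiny helper that fans a decision out to every namespace it belongs in. This is just plain Python I made up, not an AgentCore feature:

# Hypothetical helper: list every namespace a decision trace should be
# written under, so each dimension stays independently queryable
def namespaces_for(decision: dict) -> list[str]:
    namespaces = []
    if "customer_id" in decision:
        namespaces.append(f"/decisions/customers/{decision['customer_id']}")
    if "service_id" in decision:
        namespaces.append(f"/decisions/services/{decision['service_id']}")
    return namespaces

namespaces_for({"customer_id": "ACME-042", "service_id": "payments-api"})
# -> ['/decisions/customers/ACME-042', '/decisions/services/payments-api']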

Once you have figured out how to set up namespaces, it opens the door for you to start searching for precedent before reinventing the wheel and making new decisions:

# Before making a decision, search for similar past episodes
similar_decisions = session.search_long_term_memories(
    namespace_prefix=f"/sre/investigations/{actor_id}",
    query="API latency degradation with memory leak pattern",
    top_k=5
)

# The results include the full episode: situation, intent, assessment, justification
for decision in similar_decisions:
    print(f"Past situation: {decision.situation}")
    print(f"What we decided: {decision.assessment}")
    print(f"Why: {decision.justification}")

This is the feedback loop that unlocks the compounding side of context graphs. You'll know it's all working when previously captured decision traces become searchable precedent for future use, and then every new decision adds another trace.

What's Next

If you want a good place to start digging around, I recommend the Amazon Bedrock AgentCore Samples repo because it has working examples of the patterns I've mentioned in this blog, including the SRE agent that uses memory to personalise investigations based on context from users.

Like I said, we're all still figuring this out! One of the tricky things about writing content about AI is that if I waited until everything was perfectly ready, the blog would feel like it was late. But I really enjoyed reading these pieces, and poking around in the Strands Agents and Bedrock AgentCore docs to piece this together, so hopefully this sparks some ideas for you too. The "what if" simulation capabilities that Animesh describes (where you can ask "what would happen if we changed this policy?") aren't ready yet. But precedent search and pattern extraction are ready for you to try today, and that's enough to start capturing the reasoning that's been missing from so many of our systems for so many years.

The reasoning connecting data to action was never treated as data in the first place, and context graphs are how we might start fixing that. I'm really excited to see this field develop, and if you're working on this please let me know in the comments below!


Once again, thanks to Jaya Gupta, Ashu Garg, and Animesh Koratana for the foundational thinking on context graphs. Their posts are worth reading in full.
