DEV Community: Danilo Poccia

From prompt engineering to agent experience

Danilo Poccia — Thu, 09 Apr 2026 11:31:32 +0000

Not long ago, the craft of working with large language models was all about the prompt. You'd carefully compose a single, curated invocation, pre-filling it with everything the model could possibly need: data from databases, context retrieved through RAG, instructions refined through dozens of iterations. We called it prompt engineering, and it felt like a real discipline. In a way, it was.

But the models were already trained to follow instructions as part of a conversation. The prompt was never meant to be a one-shot affair. It was always the beginning of a dialogue — and what happened next is that the dialogue started to extend to the outside world.

The internal monologue

The shift began when models gained access to tools. Some call it function calling. What it really meant was that the conversation with the model became richer, including a sort of "internal monologue" where the model could reach out, touch external systems, retrieve information, take actions, or preserve memories across sessions. The model wasn't just responding anymore. It was reasoning its way through a problem, step by step, deciding what to do next.

MCP (Model Context Protocol) accelerated this by standardizing how models access tools. For any organization with a landscape of existing services, MCP is a way to open those services to agents without rebuilding them.

But the real transformation happened when the harness around these models turned them into agents: systems that can autonomously pursue a goal, not just answer a question. How much space you give the model depends on the problem. If the goal can be achieved procedurally, a script or some traditional ML would be enough. If the process can be drafted at a high level, a graph-based orchestration of agents works well. You define the flow, the model fills in the gaps. But the more creativity the problem requires, the more unpredictable the path, the more you want to step back and let the model drive: give it a goal and access to what it needs, and let it figure out the route.

This is a useful reframing. Instead of building elaborate graph-based agent architectures for every use case, you can just give the model access to the right tools and let it work. The complexity moves from the harness to the model's reasoning.

The terminal as a gateway

Models are now evaluated on benchmarks like Terminal-Bench, which tests their ability to perform real-world tasks in command-line environments: compiling code, configuring servers, managing infrastructure. They can operate CLIs, write code that calls SDKs, interact with systems the way a developer would. In theory, you could give an agent a single gateway tool — a terminal — and let it do everything from there.

It's an appealing simplification. But it's also difficult to control and secure. An agent with unrestricted terminal access is powerful, but the blast radius of a mistake is significant because not everything can be run inside a sandbox. The art is in finding the right level of abstraction: enough freedom for the agent to be useful, enough constraints to keep things safe. And the larger the organization, the more that balance matters.

Who's making the decisions?

When given enough room, agents don't just execute a plan. They evaluate options and pick a path. They choose which library to use, which API to call, which architectural pattern to follow. This is happening now, in code generation, in infrastructure provisioning, in data pipeline design. The agent makes decisions based on what's in its context and training, not necessarily what you would have chosen.

A human can provide suggestions, of course. But as these systems scale and handle more complex tasks, how much will humans realistically be able to guide each decision? This question doesn't have a clear answer yet, and it points to something important: the way you shape the agent's environment becomes the primary way you influence its choices: what tools are available, what documentation it can access, what constraints it operates under. You're not scripting behavior anymore. You're designing the space in which behavior emerges.

Beyond the prompt

Designing that space turns out to have two dimensions, and I think they're two faces of the same challenge.

The first is a runtime problem. When an agent runs for many turns, calling tools, accumulating information, making decisions, the context window fills up. Managing that context is no longer about crafting a good initial prompt. It's about curating the full, evolving content of the model's working memory throughout a long-running process: deciding what stays, what gets summarized, what gets dropped. This is context engineering, and it's a fundamentally different discipline from prompt engineering.

There's an awkward tension here with prompt caching. Most caching implementations assume the context is relatively stable — if you change it, the cache becomes inefficient. Some model providers are starting to offer managed ways to handle this, but it's still early days.

The second problem is a design-time one. With prompt engineering, the focus was on giving the model the right input upfront. You knew what the model needed because you were defining the task precisely. With agents, you often don't know in advance what information they'll need. The task is open-ended, the path is emergent. So instead of perfecting the initial prompt, the job becomes providing clear guidance for what the agent might need and, critically, where and how to find it.

This is the agent experience — the counterpart to user experience, but for AI agents. How easy is it for an agent to discover the tools it needs? How well-structured is the information it has to work with?

Self-improving agents

A concrete example of agent experience done right is Andrew Ng's Context Hub, a CLI tool (chub) designed specifically for agents to search and fetch curated API documentation. But it's not a passive reference. It's built around two feedback loops that let agents get smarter over time.

The first loop is local. When an agent uses a doc and discovers something that isn't in it, like an environment-specific gotcha, a version quirk, or a workaround, it can annotate the doc with chub annotate. That annotation is stored locally and appears automatically the next time any agent on that machine fetches the same doc. The agent learns from its own experience, and that knowledge persists across sessions.

The second loop is shared. After using a doc, agents can also rate it with chub feedback — up or down, with optional labels like "outdated" or "wrong-examples" — and that feedback goes back to the doc authors through the registry. The authors update the content based on what's working and what isn't. The docs get better for everyone, not just locally.

We spent years optimizing interfaces for humans. Agents work similarly for some aspects and differently for others. Now we need to think just as carefully about the interfaces we build for both of them.

Are you reading me, agent? What can I do for you?

P.S. — If you're not an agent and you've made it this far, I'd love to know what you think. Tell me in the comments below.

How I Used Kiro to Optimize Its Own MCP Configuration

Danilo Poccia — Mon, 12 Jan 2026 13:21:19 +0000

I had a problem. My Kiro setup had accumulated 14 MCP servers over time: AWS tools, web automation, documentation servers, email integration. That's dozens of tools, all loading into context at the start of every conversation, whether I needed them or not.

This is the hidden cost of extensibility. MCP servers give AI assistants access to external tools and services. Powerful, but each one adds to the context window. More tools means more tokens consumed before you ask your first question, slower responses, and an AI that sometimes picks the wrong tool because it has too many options.

So I asked Kiro to help me build a better solution.

Using Kiro CLI to Create Powers

Kiro Powers are the solution to MCP sprawl. They're self-contained packages that bundle MCP servers with documentation and best practices, loading on-demand based on keywords in your conversation. But here's the thing: while Powers currently only work in Kiro IDE (CLI support is coming), Kiro CLI is excellent at creating them. So I used the CLI to analyze my MCP configuration and generate properly structured Power folders, then added them to Kiro IDE.

Getting Started

I used Kiro CLI with the Auto model, the default mode that blends frontier models with specialized ones for different tasks. For this kind of work (analyzing config files, researching documentation, generating structured output), Auto picks the right model for each step rather than using a single expensive model for everything.

First, I copied my mcp.json from ~/.kiro/settings/ to a working folder:

mkdir ~/optimize-powers
cp ~/.kiro/settings/mcp.json ~/optimize-powers/
cd ~/optimize-powers

Then I launched Kiro CLI in that folder and gave it this prompt:

"I have an mcp.json file with too many MCP servers. Help me group and transform them into Kiro Powers. Research carefully how Powers work online, including the folder structure and POWER.md frontmatter syntax. For each logical grouping, create a folder with the correct structure: POWER.md with proper frontmatter (name, displayName, description, keywords, mcpServers), an mcp.json with the server configurations, and a steering/ folder with best practices. Prepare a detailed plan and propose it to me before creating any folders."

That last sentence is important. I didn't want Kiro to just start creating files. I wanted a plan I could review and discuss.

What Kiro Proposed

Kiro analyzed my mcp.json, researched the Powers documentation online, and came back with a consolidation plan. It identified logical groupings based on functionality:

Before: 14 individual MCP servers, all loading at startup

After: 7 Powers that load on-demand based on keywords:

aws-development bundles AWS API, documentation, serverless, CDK, and diagram tools
web-development handles frontend frameworks and Playwright for testing
web-research uses Fetch for browsing documentation
bedrock-agentcore for agent production and deployment
strands-agents for agent development with the Strands SDK
office-tools covers email and calendar
amazon-aurora-dsql for database-specific workflows

The key insight: these tools naturally cluster by workflow. When I'm building AWS infrastructure, I need the CDK server and AWS documentation, but not Playwright. When I'm testing a web app, I need browser automation, but not the architecture diagram tools.

The Conversation

After reviewing the plan, I had follow-up questions.

"Can't Bedrock AgentCore and Strands be grouped into a single agentic AI power?" This led to consolidating them. Both are about building AI agents, just at different stages of the workflow.

"What if I want to use Playwright for web browsing too, not just testing? Would it be duplicated across powers?" Kiro explained that Powers are namespaced to prevent conflicts, but this led to a better design for my workflows: consolidating into a single "web-research-browse-test" Power that combines Fetch and Playwright for both browsing and testing, separate from frontend development.

Each question refined the outcome. The AI handled the implementation; I steered the design. The final result: 6 Powers instead of 14 scattered MCP servers.

After finalizing the plan, I approved it: "I approve your plan, build it!"

Kiro created the folder structure for each Power, wrote the POWER.md files with appropriate keywords and onboarding instructions, configured the mcp.json for each Power, and added steering files with best practices.

What a Kiro Power Looks Like

A Kiro Power is a self-contained package: documentation, best practices, and its own MCP server configuration bundled together. When the Power activates, its MCP servers become available. When it's not needed, they stay out of context.

Here's the AWS Development Power that Kiro created:

aws-development/
├── POWER.md
├── mcp.json
└── steering/
    ├── aws-api-workflows.md
    ├── serverless-patterns.md
    ├── cdk-best-practices.md
    └── architecture-diagrams.md

The POWER.md frontmatter defines when the Power activates:

---
name: "aws-development"
displayName: "AWS Development"
description: "Comprehensive AWS development toolkit..."
keywords: ["aws", "cloud", "serverless", "lambda", "cdk", 
           "cloudformation", "infrastructure", "api", 
           "documentation", "diagrams"]
mcpServers: ["aws-api", "aws-knowledge", "aws-serverless", 
             "aws-diagram", "aws-cdk"]
---

The keywords array is the trigger mechanism. When Kiro sees any of these words in your conversation, it loads this Power automatically. The keyword matching is simple but effective: use words that match how you naturally talk about the workflow.

Adding Powers to Kiro IDE

Once Kiro CLI created the Power folders, I added them to Kiro IDE one at a time. In the Powers panel, I clicked "Add power from Local Path" and selected the directory containing the POWER.md file (e.g., ~/optimize-powers/aws-development/). Each Power appeared in the IDE's Powers list and activates automatically when I use its keywords in conversation.

The Results

The difference is immediate. When I'm working on a web frontend, I don't have AWS and email tools cluttering the context. The AI makes better tool choices because it has fewer, more relevant options. Ask about "CDK stacks" and the AWS Development Power loads. Ask about "browser testing" and the web automation Power loads.

What's interesting is the approach itself. I used Kiro CLI to analyze a configuration, research documentation I wasn't fully familiar with, propose a restructuring plan, discuss edge cases, and implement the folder structures. Then Kiro IDE activated them.

Try It Yourself

If your MCP configuration has grown unwieldy, the same approach works. Copy your mcp.json to a working folder, open Kiro CLI there, and ask it to analyze your servers and propose groupings. Review the plan, discuss any concerns, refine it, then approve and let it create the Power folders. Finally, add each Power to Kiro IDE through the Powers panel.

Powers don't require MCP servers. You can create documentation-only Powers with steering files for specific frameworks or patterns. And they're portable: push to a GitHub repo and others can install them. The community Powers repository has examples from Datadog, Figma, Stripe, and others.

If you haven't tried Kiro yet, you can get started for free.

What This Opens Up

I used an AI tool to restructure how that same AI tool accesses its capabilities. Kiro CLI analyzed the configuration, researched documentation, and generated the folder structures. Kiro IDE then activated them. The whole process took about 15 minutes.

The same pattern applies beyond MCP setups: treat your AI tooling configuration as something that evolves. Start with defaults, use the tools, notice friction, then ask the AI to help reduce it. The configuration files aren't sacred. They're just another part of a codebase that can be refactored and improved. And once Powers come to the CLI, this workflow will get even simpler.

Exploring the OpenAI-Compatible APIs in Amazon Bedrock: A CLI Journey Through Project Mantle

Danilo Poccia — Tue, 16 Dec 2025 12:59:57 +0000

After Amazon Bedrock introduced OpenAI-compatible application programming interfaces (APIs) through Project Mantle, I decided to explore firsthand what this meant in practice. There's nothing like actually calling endpoints and seeing responses to build real intuition. I needed a way to quickly experiment with both the Responses API and Chat Completions API, compare their behaviors, and understand when to use each one.

That's why I put together bedrock-mantle, a command-line interface (CLI) that took shape as I tested these new endpoints. It's designed as an exploration tool—something you can fire up when you want to understand how stateful conversations work, test background processing for long-running tasks, or simply verify that your existing OpenAI software development kit (SDK) code will work with minimal changes.

In this post, I'll walk through what makes these APIs different, show you how to use the CLI for hands-on exploration, and share some insights about when each API makes sense.

What Is Project Mantle?

Before diving into the APIs, it's worth understanding what's under the hood. Project Mantle is a distributed inference engine for large-scale model serving on Amazon Bedrock. It's designed to simplify onboarding new models while providing performant serverless inference with sophisticated quality of service controls.

For developers, the practical benefit is twofold. First, Project Mantle provides out-of-the-box compatibility with OpenAI API specifications—existing code using the OpenAI SDK works with Bedrock models by changing the base URL and API key. Second, it introduces new capabilities like stateful conversation management and asynchronous inference that go beyond simple compatibility.

The Responses API currently supports the OpenAI GPT OSS 20B and 120B models, with support for additional models coming. The Chat Completions API already works with all Bedrock models powered by Project Mantle.

Getting Started

The CLI is packaged as a Python tool and uses uv for dependency management. Installation takes one command:

uv tool install .

Configuration requires two environment variables pointing to the Mantle endpoint and your API key:

export OPENAI_BASE_URL=https://bedrock-mantle.us-east-1.api.aws/v1
export OPENAI_API_KEY=your-amazon-bedrock-api-key

You can get your API key from the Amazon Bedrock console. With that configured, you're ready to explore.

Two APIs, Two Approaches to Conversation State

The heart of Project Mantle is the choice between two APIs: the Responses API and the Chat Completions API. They solve the same fundamental problem—getting responses from language models—but they handle conversation state very differently.

All the code examples below assume you've set the OPENAI_BASE_URL and OPENAI_API_KEY environment variables as described in the Getting Started section. The OpenAI SDK reads these automatically.

The Responses API maintains conversation state server-side. When you send a message, the server remembers the context automatically using a previous_response_id. You don't need to send the full conversation history with each request because the server tracks it for you. This simplifies client code, reduces bandwidth (especially for long conversations), and makes tool use integration for agentic workflows more straightforward.

Here's a basic request using the Responses API, taken from the official AWS documentation:

from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="openai.gpt-oss-120b",
    input=[
        {"role": "user", "content": "Hello! How can you help me today?"}
    ]
)

print(response)

For multi-turn conversations, chain responses using previous_response_id. Each response object includes an id field that you pass to the next request:

# First turn
response = client.responses.create(
    model="openai.gpt-oss-120b",
    input=[{"role": "user", "content": "What is the capital of France?"}]
)
print(response.id)  # e.g., "resp_abc123..."

# Second turn: pass the previous response id
response = client.responses.create(
    model="openai.gpt-oss-120b",
    input=[{"role": "user", "content": "What river runs through it?"}],
    previous_response_id=response.id
)

The server handles the history—you just chain the IDs.

The Chat Completions API follows the traditional stateless pattern. You manage conversation history client-side and send the full context with each request. The server processes the request and returns a response without retaining any state between calls. This API also supports reasoning effort configuration, giving you control over how much computational effort the model applies to generate responses.

Here's a basic chat completion request:

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(completion.choices[0].message)

With Chat Completions, you build and maintain the messages array yourself. Each request includes the complete conversation history, giving you full control over what context the model sees.

Here's what a typical session looks like with the Responses API:

$ bedrock-mantle chat --model openai.gpt-oss-120b

Starting chat session
  Model: openai.gpt-oss-120b
  API: Responses API
  Streaming: enabled
  Background: disabled

Type /quit or /q to exit, /clear to reset conversation
------------------------------------------------------------

You: What is the capital of France?
Assistant: The capital of France is Paris.

You: What river runs through it?
Assistant: The Seine River runs through Paris, flowing through the heart of the city.

You: /quit
Goodbye!

Notice how the second question ("What river runs through it?") works without explicitly mentioning Paris. Under the hood, with the Responses API, the server maintains the conversation context, so "it" resolves correctly. With the Chat Completions API, you'd need to include the previous exchange in your request to get the same behavior.

To switch to the Chat Completions API, add the --completions flag:

bedrock-mantle chat --model openai.gpt-oss-120b --completions

Background Processing for Long-Running Tasks

Some tasks take time. Complex reasoning, extensive analysis, or multi-step processes might run for minutes rather than seconds. Keeping an HTTP connection open for that duration introduces reliability concerns—network timeouts, connection drops, and client resource consumption all become issues.

The Responses API addresses this with asynchronous inference through background processing. When you enable background mode, requests are queued and processed asynchronously:

bedrock-mantle chat --model openai.gpt-oss-120b --background

The CLI submits your message and then polls for completion, showing progress as it waits. This pattern is useful for long-running inference workloads where you'd rather wait confidently than wonder whether your connection will survive.

Behind the scenes, the CLI handles the polling loop:

# Background mode: poll for completion
response = client.responses.create(
    model=model,
    input=user_input,
    previous_response_id=previous_response_id,
    background=True
)

while response.status == "in_progress":
    time.sleep(1)
    response = client.responses.retrieve(response.id)

This approach translates well to production architectures where you might submit a request, receive a job ID, and check back later.

Choosing Between APIs

The choice between APIs depends on your requirements. I've found it helpful to think about three dimensions.

State management is the most obvious differentiator. If you want the server to track conversation context automatically, use the Responses API. If you need full control over what context gets sent (perhaps for privacy reasons, or because you're doing custom context management), use the Chat Completions API.

Data retention matters for compliance-sensitive applications. The Responses API stores data for approximately 30 days to support its stateful features. The Chat Completions API follows a zero data retention model—no conversation data is stored between requests.

Model support varies between the APIs. The Responses API currently works with OpenAI GPT OSS models (20B and 120B parameters), while the Chat Completions API supports all Bedrock models powered by Project Mantle. You can check available models with:

bedrock-mantle list-models

Practical Exploration Patterns

The CLI includes several options that make exploration more productive. Disabling streaming shows you the complete response structure rather than incremental chunks:

bedrock-mantle chat --model openai.gpt-oss-120b --no-stream

Streaming is useful when you want to display responses as they arrive. Here's how streaming works with the Responses API:

from openai import OpenAI

client = OpenAI()

stream = client.responses.create(
    model="openai.gpt-oss-120b",
    input=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for event in stream:
    print(event)

And with the Chat Completions API:

from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Notice the difference in how you process the stream. The Chat Completions API returns chunks with a delta containing the incremental content, while the Responses API returns typed events.

Custom system prompts help you test how different personas or instructions affect behavior:

bedrock-mantle chat --model openai.gpt-oss-120b --system "You are a helpful assistant who explains concepts simply"

During a session, the /status command shows your current configuration, and /clear resets the conversation state—useful when you want to start fresh without restarting the CLI.

What I Learned Building This

Building the CLI taught me some practical lessons about working with these APIs.

First, the stateful nature of the Responses API changes how you think about error handling. If a request fails mid-conversation, you need to decide whether to retry with the same previous_response_id or reset the conversation. The server's state might be consistent even if your client didn't receive the response.

Second, background processing introduces its own considerations. How often should you poll? How long should you wait before giving up? The CLI uses simple fixed-interval polling, but production code might implement exponential backoff to be more efficient.

Third, streaming and non-streaming responses have different structures. If you're building tooling that works with both modes, you need to handle the response parsing accordingly.

The Migration Path

If you're currently using the OpenAI APIs and considering a move to Bedrock, migrating is much easier. The endpoint format, request structure, and response format follow the OpenAI specification. In many cases, changing two environment variables is enough to switch between providers.

That said, testing matters. The CLI gives you a low-friction way to verify behavior and eventual code changes. Run your typical prompts through both the Responses API and Chat Completions API, observe the responses, and build confidence in how the migration could affect your application.

Try It Yourself

The CLI is available on GitHub under the MIT license. Clone it, configure your credentials, and start exploring the tool and its code. The info command shows you the current configuration, and list-models tells you what's available in your region.

Whether you're building applications that currently use the OpenAI APIs and you're curious about what migration to Bedrock would look like, or you're already using Bedrock and want to understand these new capabilities, the CLI provides a playground to build intuition before committing to architectural decisions.

I'm curious to hear what patterns you discover as you explore. Are there specific use cases where the stateful Responses API simplifies your architecture significantly? Or do you find the control of the stateless Chat Completions API more valuable for your needs? Let me know in the comments what you're building with these capabilities.

The complete code is available at github.com/danilop/bedrock-mantle. Contributions and feedback welcome.

Building a Semantic Storage for Humans and AI Agents

Danilo Poccia — Fri, 05 Dec 2025 15:20:48 +0000

This week at AWS re:Invent 2025, Amazon S3 Vectors reached general availability, bringing purpose-built vector storage directly into object storage. S3 Vectors now supports up to 2 billion vectors per index (40x the preview capacity), delivers query latencies around 100ms for frequent queries, and integrates with Amazon Bedrock Knowledge Bases and Amazon OpenSearch Service. About a month earlier, Amazon Nova Multimodal Embeddings became available in Amazon Bedrock, providing a unified embedding model that handles text, images, audio, video, and documents through a single model.

The combination of these two services creates an interesting opportunity: store any content in Amazon S3, generate embeddings with Nova, index them in S3 Vectors, and you have semantic search across everything—without managing vector database infrastructure.

That idea became SemStash, a semantic storage system that lets you store any content and find it using natural language. Instead of remembering exact file names or maintaining folder hierarchies, you describe what you're looking for: "the presentation about Q3 revenue" or "photos from the beach trip."

In this post, I'll walk through how SemStash works, the architecture decisions behind it, and how you can use it both as a human tool and as persistent memory for AI agents.

The Core Idea

The fundamental concept is straightforward: when you upload a file, SemStash stores it in S3 and generates a vector embedding using Amazon Nova. This embedding captures what the content means, not just what it contains. When you search, SemStash converts your query into an embedding and finds content with similar meaning.

This architecture means you can search across different media types. Upload a photo of a sunset, then find it by searching for "evening sky with orange colors." Upload a meeting recording, then find it by asking for "discussion about the new product launch."

Understanding the Building Blocks

Before diving into the implementation, it helps to understand how each underlying technology works.

Amazon S3 Vectors

S3 Vectors introduces vector buckets—a new bucket type specifically designed for storing and querying vector embeddings. Unlike regular S3 buckets that store objects, vector buckets organize data into vector indexes where you can run similarity queries. Each vector bucket can hold up to 10,000 indexes, and each index can store tens of millions of vectors.

The key operations are putting vectors (with optional metadata for filtering), querying for similar vectors, and managing the index lifecycle. Writes are strongly consistent, meaning you can query immediately after inserting. S3 Vectors handles the optimization of your vector data automatically as it evolves, maintaining performance without manual tuning.

What makes S3 Vectors particularly interesting for this use case is the cost model. Traditional vector databases often require provisioned capacity, but S3 Vectors follows the S3 pattern: you pay for what you store and query, with no infrastructure to manage. For applications like SemStash where queries might be infrequent but storage needs to be durable and scalable, this works well.

Amazon Nova Multimodal Embeddings

Embedding models convert content into numerical vectors that capture semantic meaning. What makes Nova Multimodal Embeddings different from earlier models is that it handles multiple content types through a single model, mapping them all into the same semantic space.

This unified approach means that an embedding from a text description and an embedding from an image can be compared directly. You can search your photo library with a text query like "person smiling on beach" and find matching images, even though the query is text and the content is visual. The same applies to audio and video: search for "piano music" and find audio files containing piano, or search for "outdoor interview" and find video clips matching that description.

Nova supports four embedding dimensions: 256, 384, 1024, and 3072. Higher dimensions capture more semantic nuance but require more storage. The model uses Matryoshka Representation Learning, which means the first N dimensions of a larger embedding work as a valid smaller embedding, giving you flexibility to balance precision against storage costs.

For synchronous operations, Nova handles up to 8,192 tokens of text or 30 seconds of audio/video. Longer content can be processed asynchronously with automatic segmentation.

How SemStash Works

The architecture separates content storage from semantic indexing:

The core library handles all AWS interactions. Five interfaces—CLI, Python API, Model Context Protocol (MCP) server, Web UI, and REST API—share this same core, ensuring consistent behavior across all access methods.

Storage Design

Each stash consists of two S3 buckets: one standard bucket for your files and one vector bucket for embeddings. The content bucket stores original files with their metadata. The vector bucket uses S3 Vectors to store embeddings with matching keys, enabling fast similarity search.

This design keeps content and embeddings synchronized: when you delete a file, its embedding is also removed. The check command verifies consistency, and sync repairs any drift that might occur.

Content Type Handling

Different content types require different processing before embedding:

Text files (.txt, .md, JSON, HTML, CSV, XML) are embedded directly. The embedding captures the semantic meaning of the text content.

Images (JPEG, PNG, GIF, WebP) are embedded visually. Nova understands the visual content, so you can search for "red car" or "person smiling" and find matching images.

Audio files (MP3, WAV, FLAC, OGG) are processed for semantic content. You can search recordings by their spoken content or audio characteristics.

Video content (MP4, WebM, MOV, MKV) is embedded considering both visual and audio elements. Search for "presentation with charts" or "outdoor interview" and find matching clips.

Documents receive special handling. PDF files are rendered as images and embedded visually, preserving layout and graphics. Word documents, PowerPoint presentations, and Excel spreadsheets have their text extracted and embedded, making all their content searchable.

Using SemStash from the Command Line

The CLI provides the primary human interface. Install it with uv:

uv tool install git+https://github.com/danilop/semstash.git

Creating a stash sets up the S3 bucket and vector index:

semstash init my-stash

Uploading follows a path model similar to a filesystem. Every piece of content has a path starting with /, and the trailing slash determines whether you're specifying a folder or an exact path:

# Upload to root (file keeps its original name)
semstash my-stash upload vacation-photo.jpg /

# Upload to a folder (trailing slash = folder)
semstash my-stash upload meeting-notes.txt /notes/

# Upload with tags for organization
semstash my-stash upload *.jpg /photos/ --tag vacation --tag 2024

Searching uses natural language:

semstash my-stash query "beach sunset"
semstash my-stash query "financial projections for next year"
semstash my-stash query "action items from last meeting"

Results are ranked by semantic similarity. You can filter by tags or path:

semstash my-stash query "sunset" --tag photos
semstash my-stash query "meeting notes" --path /notes/

The Python API

For programmatic access, the same functionality is available through a Python library:

from semstash import SemStash

# Create and initialize storage
stash = SemStash("my-stash")
stash.init()

# Upload content to root
result = stash.upload("photo.jpg", target="/", tags=["vacation", "beach"])
print(f"Stored at: {result.path}")  # /photo.jpg

# Upload to a folder (preserves filename)
result = stash.upload("notes.txt", target="/docs/")
print(f"Stored at: {result.path}")  # /docs/notes.txt

# Query semantically
for item in stash.query("sunset on beach", top_k=5):
    print(f"{item.score:.2f} - {item.path}")
    print(f"  Download: {item.url}")

# Query with path filter
for item in stash.query("meeting notes", path="/docs/"):
    print(f"{item.path}: {item.score:.2f}")

# Get content metadata and URL
content = stash.get("/photo.jpg")
print(f"Type: {content.content_type}, Size: {content.file_size}")

# Browse a folder
for item in stash.browse("/docs/").items:
    print(f"{item.path}: {item.content_type}")

# Download content locally
stash.download("/photo.jpg", "./local-copy.jpg")

# Delete when done
stash.delete("/photo.jpg")

The API supports all the same operations as the CLI: init(), open(), upload(), query(), get(), download(), delete(), browse(), check(), sync(), and destroy().

Web Interface

For browser-based access, SemStash includes a web interface that provides a visual way to interact with your semantic storage:

semstash web
# Open http://localhost:8000/ui/

The web interface makes SemStash accessible without command-line experience. The dashboard at /ui/ shows storage statistics and provides quick actions. The upload page at /ui/upload supports drag-and-drop file uploads with target path specification. Browse pages at /ui/browse/{path} offer paginated content lists with folder navigation. The search page at /ui/search provides semantic search with relevance scores displayed alongside results. Content pages at /ui/content/{path} show previews, metadata, and download/delete options.

The browse interface lets you navigate your content by folder structure:

Semantic search results display with relevance scores, making it clear how well each result matches your query:

Configure the server with environment variables:

export SEMSTASH_BUCKET=my-stash
export SEMSTASH_HOST=0.0.0.0     # Optional: bind address
export SEMSTASH_PORT=8000        # Optional: port number
semstash web

REST API

The same server that hosts the web interface exposes a REST API for programmatic HTTP access. Interactive documentation is available at /docs. Key endpoints:

POST   /init              Create new storage
POST   /open              Open existing storage
POST   /upload            Upload files (multipart form with target path)
GET    /query?q=...       Semantic search (supports path= filter)
GET    /content/{path}    Get metadata and download URL
DELETE /content/{path}    Remove content
GET    /browse/{path}     List stored content at path
GET    /stats             Storage statistics
GET    /check             Consistency check
POST   /sync              Repair inconsistencies
DELETE /destroy           Remove storage (irreversible)

Interactive API documentation is available at /docs.

MCP Server for AI Agents

The Model Context Protocol (MCP) server gives AI assistants persistent semantic memory. Start it with:

semstash mcp

For MCP-compatible assistants, add to your configuration:

{
  "mcpServers": {
    "semstash": {
      "command": "semstash",
      "args": ["mcp"],
      "env": {
        "SEMSTASH_BUCKET": "my-agent-memory"
      }
    }
  }
}

The MCP server exposes tools for uploading content, querying semantically, browsing stored items, and managing the stash. Agents can save information they discover and retrieve it in future conversations.

Using SemStash with Strands Agents

Strands Agents provides first-class support for MCP, making it straightforward to give agents access to SemStash as persistent memory. Here's how to connect a Strands agent to the SemStash MCP server:

from mcp import stdio_client, StdioServerParameters
from strands import Agent
from strands.tools.mcp import MCPClient

# Create the MCP client for SemStash
semstash_client = MCPClient(lambda: stdio_client(
    StdioServerParameters(
        command="semstash",
        args=["mcp"],
        env={"SEMSTASH_BUCKET": "agent-memory"}
    )
))

# Use the MCP client with an agent
with semstash_client:
    tools = semstash_client.list_tools_sync()
    agent = Agent(tools=tools)

    # The agent can now store and retrieve information semantically
    agent("Save this information: The quarterly report deadline is March 15th")

    # Later, the agent can recall it
    agent("When is the quarterly report due?")

The agent doesn't need to know the exact phrasing or keywords used when information was stored. The semantic search finds relevant content based on meaning, so asking about "quarterly report due date" will find content stored with "quarterly report deadline."

For agents that need to combine SemStash with other tools, you can include the MCP client alongside other tool providers:

from strands import Agent
from strands.tools.mcp import MCPClient
from strands_tools import calculator, current_time
from mcp import stdio_client, StdioServerParameters

semstash_client = MCPClient(lambda: stdio_client(
    StdioServerParameters(
        command="semstash",
        args=["mcp"],
        env={"SEMSTASH_BUCKET": "agent-memory"}
    )
))

with semstash_client:
    mcp_tools = semstash_client.list_tools_sync()

    # Combine MCP tools with other tools
    agent = Agent(tools=[calculator, current_time] + mcp_tools)

    # Agent can use all tools together
    agent("What time is it, and do I have any meetings scheduled today?")

This pattern lets agents build knowledge over time. An agent working on a research project can save findings as it discovers them, then recall relevant information when answering questions or generating reports.

Configuration and Tuning

SemStash works with sensible defaults but supports customization through environment variables or a configuration file.

These environment variables configure the web server, MCP server, and Python API. The CLI takes the bucket name as a command argument instead.

SEMSTASH_BUCKET=my-stash        # Bucket name (for web/MCP/Python API)
SEMSTASH_REGION=us-east-1       # AWS region
SEMSTASH_DIMENSION=3072         # Embedding dimension (256, 384, 1024, 3072)

Or through a configuration file (semstash.toml or .semstash.toml):

[aws]
region = "us-east-1"

[embeddings]
dimension = 3072

[output]
format = "table"  # or "json"

The embedding dimension is the main tuning parameter. Higher dimensions (3072) capture more semantic nuance, while lower dimensions (256, 384, 1024) reduce storage costs with some accuracy trade-off. The dimension is set when you create a stash and cannot be changed afterward—when you open an existing stash, SemStash automatically uses its configured dimension.

AWS Requirements

SemStash requires AWS credentials with permissions for S3 (creating and managing buckets, uploading and downloading objects), S3 Vectors (creating indexes, storing and querying vectors), and Bedrock (invoking the Nova embeddings model).

The default region is us-east-1. You can check AWS regional availability for Amazon Bedrock, S3, and S3 Vectors in other regions.

Maintenance

A few commands help keep your stash healthy:

# Verify content and embeddings are synchronized
semstash my-stash check

# Repair any inconsistencies
semstash my-stash sync

# See storage statistics
semstash my-stash stats

# Permanently remove a stash (irreversible)
semstash my-stash destroy --force

The check command reports orphaned embeddings (vectors without content) or missing embeddings (content without vectors). The sync command repairs these issues by removing orphans and regenerating missing embeddings.

What I Learned Building This

Working with S3 Vectors and Nova Multimodal Embeddings together highlighted a few things.

The unified semantic space that Nova provides enables searches that would be difficult to express otherwise. Searching for text and finding images based on meaning, or finding video clips from audio descriptions, opens up workflows that don't fit the traditional keyword-search model. I found myself uploading content without worrying about file organization, trusting that I could describe what I needed later.

S3 Vectors fits naturally into applications where you need durable, scalable vector storage without managing infrastructure. The serverless model—pay for what you store and query—aligns well with applications where usage patterns might be bursty or unpredictable. The GA release this week at re:Invent brought significant improvements: 2 billion vectors per index, ~100ms latencies for frequent queries, and integration with Bedrock Knowledge Bases.

Building for multiple interfaces (CLI, Python API, Web UI, REST API, MCP server) from a shared core turned out to be the right decision early on. Each interface serves a different use case, but they all exercise the same underlying logic, which made testing more straightforward and behavior consistent.

The most interesting use case that emerged was using SemStash as memory for AI agents through the MCP server. Agents can accumulate knowledge over time—saving information they discover and retrieving it in future conversations—without the application developer building custom storage infrastructure. This pattern of "semantic memory" for agents feels like it has broader applications beyond what I've implemented here.

The code is available on GitHub under the MIT license. I'd be curious to hear how others use it, particularly for the agent memory use case.

Strands Agents now speaks TypeScript: A side-by-side guide

Danilo Poccia — Thu, 04 Dec 2025 15:14:46 +0000

The Strands Agents SDK, which was released as open source in May 2025, recently added TypeScript support as a public preview. Developers who prefer Python have had access to Strands since launch. Now TypeScript developers can build AI agents using the same model-driven approach, with full type safety and modern async patterns.

This article compares the Python and TypeScript implementations side by side, building a practical example along the way.

Because the Strands Agents TypeScript SDK is currently in public preview, APIs may change as the SDK is refined, and some features available in the Strands Agents Python SDK are not yet implemented. The core functionality—agents, tools, MCP integration, streaming—works as expected, but this is an early release. Feedback and contributions are welcome via the GitHub repository.

The Model-Driven Approach

Many agent frameworks are workflow-driven: you define explicit chains of actions where the agent does A, then B, then C. You're essentially programming the agent's behavior in advance. This works well for predictable tasks but becomes brittle when dealing with novel situations.

Strands Agents takes a different approach. Instead of prescribing steps, you give the agent a goal and a set of tools and let the LLM figure out how to accomplish the task. The model's reasoning capabilities—its ability to plan, reflect, and adapt—drive the behavior. This is what Strands calls the "model-driven" approach.

At the core of every Strands agent is the agent loop: the model receives context (conversation history, system prompt, tool descriptions), decides what to do next, optionally calls tools, observes the results, and repeats until it produces a final response.

For details on how this works, see the Agent Loop documentation.

Building a Task Planning Agent

Let's build something practical: an agent that helps break down a project into tasks. We'll give it two tools—one to add tasks and one to list them—and let it figure out how to help the user.

Defining Tools

The two Strands SDKs take different approaches to tool definition that reflect each language's idioms.

Python uses the @tool decorator with docstrings and type hints:

from strands import Agent, tool

tasks = []

@tool
def add_task(description: str, priority: str = "medium") -> str:
    """Add a task to the plan.

    Args:
        description: What needs to be done
        priority: Priority level (high, medium, low)
    """
    tasks.append({"description": description, "priority": priority})
    return f"Added: {description} [{priority}]"

@tool
def list_tasks() -> str:
    """Show all tasks in the current plan."""
    if not tasks:
        return "No tasks yet."
    return "\n".join(
        f"- [{t['priority'].upper()}] {t['description']}" 
        for t in tasks
    )

Strands extracts the tool specification from the function signature, type hints, and docstring automatically.

TypeScript uses Zod schemas:

import { Agent, tool } from '@strands-agents/sdk'
import { z } from 'zod'

const tasks: Array<{description: string, priority: string}> = []

const addTask = tool({
    name: 'add_task',
    description: 'Add a task to the plan.',
    inputSchema: z.object({
        description: z.string().describe('What needs to be done'),
        priority: z.string().default('medium').describe('Priority level (high, medium, low)')
    }),
    callback: async (input) => {
        tasks.push(input)
        return `Added: ${input.description} [${input.priority}]`
    }
})

const listTasks = tool({
    name: 'list_tasks',
    description: 'Show all tasks in the current plan.',
    inputSchema: z.object({}),
    callback: async () => {
        if (tasks.length === 0) return 'No tasks yet.'
        return tasks
            .map(t => `- [${t.priority.toUpperCase()}] ${t.description}`)
            .join('\n')
    }
})

Zod is a TypeScript-first schema validation library that solves two problems at once: runtime validation and type inference. TypeScript's type system only works at compile time, but when your agent receives input from an LLM, you need runtime validation. Zod validates the data and automatically infers TypeScript types from your schema—define once, get both.

Creating and Running the Agent

Python:

agent = Agent(
    system_prompt="You are a planning assistant. Help users break down projects into actionable tasks.",
    tools=[add_task, list_tasks]
)

result = agent("Help me plan a surprise birthday party for next Saturday")
print(result.message)

TypeScript:

const agent = new Agent({
    systemPrompt: 'You are a planning assistant. Help users break down projects into actionable tasks.',
    tools: [addTask, listTasks]
})

const result = await agent.invoke('Help me plan a surprise birthday party for next Saturday')
console.log(result.message)

The agent will reason through what's needed—venue, guest list, food, decorations, timeline—call add_task multiple times with appropriate priorities, then call list_tasks to show the plan. You didn't program that sequence; the model figured it out.

Sample Output

I've created a plan for the surprise birthday party:

- [HIGH] Confirm venue and time for Saturday
- [HIGH] Create guest list and send invitations ASAP
- [HIGH] Arrange for someone to bring the birthday person to the venue
- [MEDIUM] Order or bake birthday cake
- [MEDIUM] Plan food and drinks menu
- [MEDIUM] Buy decorations (balloons, banner, etc.)
- [LOW] Create a party playlist
- [LOW] Plan games or activities
- [LOW] Arrange for someone to take photos

Given the short timeline, I've prioritized the items that need immediate action. 
Would you like me to break any of these down further?

How the Two Languages Differ

The most visible difference is in tool definition. Python extracts schemas from docstrings and type hints—you write documentation as you normally would, and Strands uses it. TypeScript relies on Zod schemas, which provide both runtime validation and compile-time type inference in a single definition.

Invocation patterns also differ slightly. In Python, you call the agent directly with agent("prompt"), and async is optional. In TypeScript, you use await agent.invoke("prompt"), and async is the default model throughout. Both approaches describe the same thing to the LLM—they just do it in ways idiomatic to each language.

Streaming Responses

For interactive applications, both Strands SDKs support streaming so users see output as it's generated.

Python:

async for event in agent.stream_async("Plan a weekend hiking trip"):
    if "data" in event:
        print(event["data"], end="", flush=True)

TypeScript:

for await (const event of agent.stream('Plan a weekend hiking trip')) {
    if (event.type === 'text') {
        process.stdout.write(event.data)
    }
}

Model Context Protocol (MCP)

The Model Context Protocol lets you connect agents to external tools and services. Both Strands SDKs have built-in MCP support.

Python:

from mcp import stdio_client, StdioServerParameters
from strands import Agent
from strands.tools.mcp import MCPClient

mcp_client = MCPClient(lambda: stdio_client(
    StdioServerParameters(command="uvx", args=["awslabs.aws-documentation-mcp-server@latest"])
))

with mcp_client:
    agent = Agent(tools=mcp_client.list_tools_sync())
    agent("How do I configure an S3 bucket?")

TypeScript:

import { Agent, McpClient, StdioClientTransport } from '@strands-agents/sdk'

const mcpClient = new McpClient({
    transport: new StdioClientTransport({
        command: 'uvx',
        args: ['awslabs.aws-documentation-mcp-server@latest']
    })
})

const agent = new Agent({ tools: [mcpClient] })
await agent.invoke('How do I configure an S3 bucket?')

See the MCP Tools documentation for more transport options.

Beyond the Basics

The example above covers the core workflow, but both Strands SDKs offer more. Long-running conversations can exhaust the model's context window, so both provide conversation managers that automatically maintain a sliding window of recent messages. The Strands Agents Python SDK goes further with a summarizing conversation manager that compresses old messages rather than discarding them—useful when you need to preserve context over very long interactions.

For complex problems where a single agent isn't enough, the Strands Agents Python SDK provides multi-agent patterns. You can wrap specialized agents as tools that an orchestrator calls, create swarms where agents hand off tasks to each other autonomously, or define explicit graphs with deterministic execution order. These patterns aren't yet available in the TypeScript SDK. See the Multi-Agent Patterns documentation for details.

The Strands Agents Python SDK also supports structured output through Pydantic models, letting you constrain the agent's response to a specific schema. This is particularly useful when you need to parse the agent's output programmatically.

Feature Comparison

The table below summarizes what's available in each SDK. The Strands Agents Python SDK is stable with the full feature set, and the Strands Agents TypeScript SDK is in preview with core functionality available.

Feature	Strands Python (stable)	Strands TypeScript (preview)
Core agent loop	✅	✅
Custom tools	✅ (decorator)	✅ (Zod)
MCP integration	✅	✅
Amazon Bedrock	✅	✅
OpenAI	✅	✅
Anthropic API	✅	—
Ollama	✅	—
Streaming	✅	✅
Conversation management	✅	✅
Structured output	✅	—
Multi-agent (Swarm/Graph)	✅	—
Community tools package	✅	—
Session persistence	✅	—
OpenTelemetry traces	✅	—

When to Choose Which

The choice between SDKs often comes down to your existing stack and requirements. If your application is already in JavaScript or TypeScript and you're comfortable with preview-stage software, the Strands Agents TypeScript SDK gives you the core agent functionality with compile-time type safety and modern async patterns. It works in both Node.js and browser environments.

If you need the full feature set—multi-agent orchestration, structured output, more model providers, or the community tools package—the Strands Agents Python SDK is the more complete option for now.

Getting Started

Python:

uv init my-agent && cd my-agent
uv add strands-agents strands-agents-tools

TypeScript:

mkdir my-agent && cd my-agent
npm init -y
npm install @strands-agents/sdk zod

Both SDKs default to Amazon Bedrock as the model provider, which requires valid AWS credentials. If you're new to AWS, the AWS Free Tier provides $100 in credits at sign-up plus up to $100 more by completing onboarding activities— one of which involves using Amazon Bedrock. The free plan lasts six months, and you won't be charged unless you upgrade. See the Strands Agents Quickstart guide for setup details.

Resources

The Strands Agents documentation is the best starting point, with guides for both SDKs. The source code is available on GitHub for both the Python SDK and the TypeScript SDK. For Python, the community tools package provides ready-to-use tools, and the samples repository has complete example agents.

Wrapping Up

The Strands Agents TypeScript SDK brings the model-driven approach to the JavaScript ecosystem. While the Python SDK currently remains the fully featured option, the TypeScript preview delivers the core experience: define tools, create an agent, and let the model reason through problems.

The Strands Agents TypeScript SDK preview status means now is a good time to experiment, provide feedback, and help shape its development.

Modernizing Python Projects: Converting requirements.txt to uv in One Command

Danilo Poccia — Thu, 06 Nov 2025 14:22:22 +0000

When cloning Python projects that I want to test or use for a demo, I often find out that the repository uses a requirements.txt file instead of a modern pyproject.toml. While requirements.txt has served the Python community well for years, I've come to appreciate the speed and simplicity of uv, a fast Python package manager that uses the standardized pyproject.toml format. But converting between these formats manually? That can be tedious and error-prone, especially when dealing with complex dependency specifications.

That's why I built requirements-to-uv: a command-line tool that automatically converts Python projects from requirements.txt to uv-managed pyproject.toml with a single command. Whether you're modernizing your own projects or quickly setting up cloned repositories, this tool handles the conversion details so you can focus on actually working with the code.

In this post, I'll walk you through what makes this conversion non-trivial, show you how to use the tool in seconds, and explain some of the intelligent logic that makes it work reliably across different project structures.

Why Move from requirements.txt to uv?

Before diving into the conversion process, it's worth understanding why this matters. The requirements.txt format has been the standard for dependency management in Python for many years, but it has limitations. For example, there's no standardized way (apart from file names) to separate development dependencies from production ones. And different tools interpret the format slightly differently, leading to inconsistencies.

Enter pyproject.toml: a standardized format defined in PEP 518 that consolidates project metadata and dependencies in one place. When combined with uv, you get lightning-fast dependency resolution and installation. The uv tool provides a consistent development workflow across projects and uses lock files for truly reproducible environments.

The transition from requirements.txt to pyproject.toml with uv isn't just about following trends—it's about faster builds, more reliable environments, and better project organization.

The Challenge: The requirements.txt File Is More Complex Than It Looks

At first glance, converting a requirements.txt file to pyproject.toml seems straightforward. Just copy the package names and versions, right? Not quite. The requirements.txt files you can find in the wild are surprisingly complex, and a naive conversion would lose important information or break dependencies entirely.

Consider what a real-world requirements file might contain:

# Standard packages with version specifiers
requests>=2.28.0
flask[async]>=3.0.0

# Git dependencies with specific branches
git+https://github.com/user/repo.git@main#egg=mypackage
-e git+ssh://git@github.com/user/another.git@develop

# Local path dependencies
-e ./local-package
../another-package

# Environment markers for conditional installation
pytest>=8.0.0 ; python_version >= "3.8"

# Poetry-style version constraints that aren't valid in pyproject.toml
django^4.2.0

# Package extras
celery[redis,msgpack]>=5.3.0

Each of these patterns requires different handling in the pyproject.toml format. Git dependencies need to be split into a regular dependency entry and a separate [tool.uv.sources] section that specifies the repository URL and branch. Poetry-style caret (^) constraints need to be converted to the equivalent range syntax. Local paths require special source declarations. Environment markers need to be preserved exactly.

Beyond parsing complexity, there's also the challenge of project structure. Many repositories have multiple requirements files: requirements.txt for production, requirements-dev.txt for development tools, requirements-test.txt for testing frameworks, and so on. A proper conversion should detect these patterns and organize them into the appropriate dependency groups in pyproject.toml.

Intelligent Metadata Detection

One of the tool's most useful features is automatic project metadata detection. When you run req2uv in a Python project directory, it doesn't just convert dependencies—it tries to build a complete, valid pyproject.toml by gathering information from multiple sources.

For the project name, the tool starts with the current directory name and normalizes it according to Python packaging standards (replacing spaces and special characters with hyphens). For the version, it searches for version declarations in __init__.py files, checks setup.py if present, looks at git tags, and falls back to 0.1.0 if nothing else is found. The Python version requirement is detected from .python-version files, setup.py classifiers, or defaults to the currently running Python version. The description comes from the first line of your README file, and author information is pulled from your git configuration.

This intelligent detection means you're not starting with a minimal skeleton file—you get a properly structured pyproject.toml that actually describes your project.

Smart Merging with Existing Files

Not every project is starting from scratch. Sometimes you're modernizing a project that already has a partial pyproject.toml file, perhaps created manually or by another tool. The requirements-to-uv tool handles this scenario carefully.

When a pyproject.toml already exists, the tool merges new information without destroying what's there. It preserves existing metadata and configuration sections, appends new dependencies to existing lists (detecting and warning about duplicates), and creates a backup file (pyproject.toml.backup) before making changes. This merge logic uses a sophisticated approach that understands the structure of TOML files and dependency declarations, ensuring that manual customizations aren't lost during the conversion.

Handling Edge Cases and Limitations

Python packaging has accumulated many special features over the years, and not all of them translate cleanly to the modern pyproject.toml format. The tool handles these cases transparently while keeping you informed.

For package hashes (the --hash=sha256:... format), these aren't supported in pyproject.toml because uv uses lock files for reproducibility instead. The tool strips these out but generates a comment explaining the change. Custom package indexes specified with --index-url or --extra-index-url also can't be stored directly in pyproject.toml. The tool adds a comment with the original URL so you can configure it through uv's CLI or configuration file instead.

SSH git URLs present an interesting challenge. While GitHub, GitLab, and Bitbucket SSH URLs can be automatically converted to their HTTPS equivalents, URLs from other hosts generate warnings since the conversion might not be straightforward. For these cases, you may need to adjust the generated file manually.

The tool handles these limitations gracefully: it does the conversion, preserves as much information as possible, explains what couldn't be directly translated, and provides guidance on how to handle special cases in the uv ecosystem.

Getting Started in 30 Seconds

The tool is designed to get you working quickly. Installation is straightforward using uv itself:

# Install as a global uv tool
uv tool install git+https://github.com/danilop/requirements-to-uv.git

Then navigate to any Python project with a requirements.txt file and run:

req2uv

The tool automatically detects your project structure, finds all requirements files, gathers metadata, and generates a complete pyproject.toml. It runs in interactive mode by default, showing you what it found and asking for confirmation before writing files. Once the conversion is complete, you can immediately use uv to install dependencies:

uv sync

For CI/CD pipelines or scripts, you can use non-interactive mode:

req2uv --non-interactive

You can also preview what the tool would do without making changes:

req2uv --dry-run

Real-World Usage Patterns

After using this tool across various projects, certain patterns have proven particularly valuable. When modernizing existing projects, I start with a dry run to see what the tool will generate. This helps catch any issues before committing to changes. The tool creates a backup of any existing pyproject.toml, but I also like to commit my current state to git first, just to be safe.

For repositories with multiple requirements files, the automatic detection and categorization saves significant time. The tool recognizes common patterns like requirements-dev.txt, requirements-test.txt, and requirements-docs.txt, and organizes them into appropriate dependency groups. This structure aligns well with uv's dependency group feature, making it easy to install just the dependencies you need for a particular task.

When cloning open-source projects that still use the older format, I typically run req2uv as my first step after cloning. This lets me work with the project using my preferred tools without needing to maintain both formats or deal with inconsistencies between pip and uv behavior.

Architecture: Simple and Focused

The tool is built with a clear focus on doing one thing well. It uses the packaging library for parsing requirements and handling version specifiers according to Python packaging standards. Click provides the command-line interface with automatic help generation and parameter validation. Rich handles terminal formatting for readable output and progress indication. And questionary powers the interactive prompts when running in interactive mode.

The core logic separates concerns cleanly: a parser module handles requirements.txt parsing and normalization, a detector module gathers project metadata from various sources, and a generator module creates valid TOML structures. This modular design makes the tool maintainable and makes it easier to extend with new features.

Best Practices for Conversion

Through working with various project structures, a few practices have emerged as particularly helpful. Before running the conversion, it's worth reviewing your requirements files to ensure they're current and removing any commented-out dependencies that you no longer need. This gives you a clean starting point.

After conversion, I recommend reviewing the generated pyproject.toml, especially the [tool.uv.sources] section if you have git or path dependencies. While the tool handles most cases automatically, some scenarios—like private git repositories or unusual URL patterns—might need manual adjustment.

It's also helpful to test the converted dependencies immediately by running uv sync and verifying that your application still works as expected. This catches any edge cases early while the conversion process is still fresh in your mind.

Wrapping Up

Converting Python projects from requirements.txt to modern pyproject.toml with uv support doesn't have to be a manual chore. The requirements-to-uv tool handles the complexity of parsing various dependency formats, intelligently detects project metadata, and generates complete, valid project files.

Whether you're modernizing your own projects or quickly setting up repositories you've cloned, this tool helps you move to a faster, more standardized Python development workflow. The complete code is available on GitHub, and contributions are welcome.

Give it a try the next time you encounter a Python project still using requirements.txt—you might be surprised how much smoother your workflow becomes with modern Python tooling.

Never Forget a Thing: Building AI Agents with Hybrid Memory Using Strands Agents

Danilo Poccia — Fri, 31 Oct 2025 10:49:20 +0000

When using (and building) AI agents, I kept running into the same frustrating problem: as conversations grew longer, my agents would either lose important details from earlier in the conversation or hit context limits and crash. The standard solution—a sort of aggressive summarization—worked for maintaining context flow, but it created a new problem: those summaries were lossy. Important details, specific numbers, exact quotes, and nuanced context could vanish into their generalizations.

I needed something better: a memory system that could maintain conversation flow through intelligent summarization while preserving the ability to retrieve exact historical messages when needed. After researching the broad topic of context engineering, I built a proof-of-concept Semantic Summarizing Conversation Manager: a hybrid memory system for Strands Agents that combines the efficiency of summarization with the precision of semantic search.

In this post, I'll show you how this system improves on the memory problem, walk you through its architecture, and demonstrate how it can upgrade your AI agents from forgetful assistants into (more) reliable partners with perfect recall.

The Memory Problem: Summarization vs Recall

Before diving into the solution, let's understand why this problem exists. AI agents typically manage conversation context in one of three ways:

Keep Everything: Store all messages in the active context. This works great for short conversations but inevitably hits model context limits. When you reach that limit, the agent has to do something to reduce the size of its context.

Summarization: When context gets full, summarize older messages into a compressed form. This process, also known as compacting, maintains conversation flow and prevents context overflow, but summaries are inherently lossy. Ask "What was the exact number I mentioned earlier?" and the agent might recall "you discussed some statistics" but not the actual value. A possible mitigation is to apply different hierarchical levels of summarizations that are retrieved based on the specific requests.

Sliding Window: Keep only the N most recent messages, discarding older ones entirely. Simple and memory-efficient, but loses all historical context beyond the window. The agent literally forgets everything from earlier in the conversation.

Proactive Memory Curation: A variation of automatic summarization is to actively control the process. For example, to trigger summarization not when the context is full but when something happens in the agent lifecycle, such as the completion of a specific task. This works because summarization is applied to a bounded context (the task) that can reduce the amount of information needed about the specific task internals by the rest of the tasks.

Each approach has fundamental trade-offs. You can have context efficiency or perfect recall, but not both.

Hybrid Memory: The Best of Both Worlds

The Semantic Summarizing Conversation Manager takes a different approach: it combines summarization for active context management with semantic search for precise historical recall. Here's how it works:

Normal Operation: Messages flow through the conversation as usual. The agent sees the full context and responds naturally.

Context Overflow: When the context gets too long, the system performs two parallel operations:

Creates a summary of older messages for the active conversation, maintaining flow
Stores the exact messages in memory using Strands Agents' key-value state for later retrieval
Indexes those messages in a semantic (vector based) search engine for intelligent lookup

Query Time: When new messages arrive, a Strands Agents hook automatically searches for relevant historical messages, includes surrounding context for better understanding, and prepends this context to the user's message if relevant matches are found.

The agent gets three types of memory working together: the active conversation with summaries (for context flow), the archived exact messages (for precision), and the semantic index (for intelligent retrieval). This hybrid approach means the agent never loses information, but also never overwhelms the model with excessive context.

Why This Architecture Makes Sense

Here's a crucial insight that makes this hybrid approach viable: the amount of RAM available to an agent is typically orders of magnitude larger than the model's context window.

Consider a typical deployment: a modern language model might have a context window of up to 1 million tokens (roughly 750,000 words or about 4MB of text). Meanwhile, even a small AWS Lambda function has at least 128MB of memory, and container deployments often have several gigabytes. That's a difference of three to four orders of magnitude—1,000x to 10,000x more storage capacity than context capacity.

This disparity is fundamental to how language models work. Context windows are constrained by the quadratic attention mechanism—doubling the context quadruples the computation. But RAM? RAM is relatively cheap and abundant in comparison. You can store thousands of conversation messages and tool results in a few megabytes, along with their embeddings for semantic search, and still use less than 1% of available memory.

The implication: you don't need to delete information just because it doesn't fit in the model's context window. Store it, index it, and retrieve it intelligently when needed. The bottleneck isn't storage—it's attention. This hybrid architecture respects that constraint while leveraging the abundant storage available to modern agents.

This is why the semantic conversation manager can confidently store exact messages indefinitely (with optional limits for safety) while keeping only the most relevant information in the active context. We're playing to the strengths of the underlying hardware: use the model's limited context for reasoning and generation, use RAM for comprehensive storage and retrieval.

Architecture: Three Components Working in Harmony

The system consists of three main components that integrate seamlessly with Strands Agents:

Component 1: SemanticSummarizingConversationManager

This is the core conversation manager that extends Strands' base conversation management with semantic capabilities. It maintains the active conversation window, triggers summarization when context overflows, stores exact messages with semantic indexing, manages memory limits by message count or total memory usage, and provides real-time memory usage statistics.

The key innovation here is that summarization and archival happen atomically. When messages get summarized, they're simultaneously preserved and indexed, ensuring nothing is ever lost.

Component 2: SemanticMemoryHook

This hook integrates with Strands' lifecycle system to provide automatic context enrichment. It subscribes to the MessageAddedEvent, searches semantic memory when new messages arrive, retrieves relevant historical messages with surrounding context, and prepends the enriched context to user messages naturally.

The hook uses Strands' elegant event system, keeping the memory logic completely separate from your agent's main code. Your agent doesn't need to know anything about memory management—it just works.

Component 3: SemanticSearch Engine

The search engine powers intelligent retrieval using sentence transformers for initial embedding, cross-encoder reranking for precision, configurable relevance thresholds, and persistent index storage.

I chose a two-stage retrieval approach because it provides the best balance of speed and accuracy. The sentence transformer quickly narrows down candidates, then the cross-encoder reranks for precision. This combination ensures the agent finds truly relevant messages, not just keyword matches.

Setting Up Hybrid Memory

Let's build an agent with semantic memory. This implementation is a prototype designed to demonstrate the hybrid memory concept. While functional and tested, it's intended for experimentation and learning rather than production deployment without further development and testing. The setup is straightforward:

from strands import Agent

from strands_semantic_memory import (
    SemanticSummarizingConversationManager,
    SemanticMemoryHook,
)

conv_manager = SemanticSummarizingConversationManager(
    embedding_model="all-MiniLM-L12-v2"
)

semantic_memory_hook = SemanticMemoryHook()

agent = Agent(model="us.amazon.nova-lite-v1:0",
              conversation_manager=conv_manager,
              hooks=[semantic_memory_hook])

That's it! Your agent now has hybrid memory. Use it normally:

# Store information
response = agent("Our shared number is 42. This is confidential, don't include it in any summary.")

# ... many messages later, after summarization ...

# Retrieve exact information
response = agent("What was our shared number?")
# The hook finds the archived message and includes it automatically

Understanding the Parameters

The configuration parameters give you fine-grained control over memory behavior:

summary_ratio (0.1-0.8): Determines what percentage of messages to summarize when context overflows. Lower values create shorter summaries but trigger overflow more frequently. I find 0.7 (70%) provides a good balance.

preserve_recent_messages: Messages that never get summarized. These stay in the active conversation no matter what. I typically use 10-20 to maintain recent context flow.

message_context_radius: When retrieving a relevant message, how many surrounding messages to include. A radius of 2 means you get 2 messages before and 2 after the match, providing better context. This prevents retrieving messages in isolation where the surrounding conversation provides crucial meaning.

semantic_search_top_k: Number of relevant messages to retrieve. More isn't always better—too many matches can overwhelm the context. I start with 3 and adjust based on testing and evaluations.

semantic_search_min_score: The cross-encoder relevance threshold (default: -2.0). Higher values are more selective, lower values cast a wider net. The default provides balanced precision and recall.

max_num_archived_messages: Optional limit on stored messages. When exceeded, oldest messages are removed. Useful for long-running agents to prevent unbounded growth.

max_memory_archived_messages: Optional limit on total memory usage (in bytes). Includes both message content and embeddings. When exceeded, oldest archived messages are removed to stay within budget.

These last two parameters are particularly important for production deployments where long term memory constraints matter. You can use either, both, or neither depending on your needs.

How It Works: A Complete Example

Let me show you the system in action. The included demo creates an agent, stores a secret that shouldn't appear in summaries, builds conversation history, triggers summarization, and then demonstrates semantic retrieval.

When you run the demo with uv run main.py, you'll see the complete flow:

Initial Conversation (20 messages):

[ 0] user: Our shared number is 700. This is confidential - don't include it in any summary...
[ 1] assistant: Understood. I'll keep our shared number confidential...
[ 2] user: Tell me about recursive functions and data structures.
...
[19] assistant: Recursion is when a function calls itself...

After Summarization (9 messages):

[ 0] user: ## Conversation Summary
* Topic 1: Explanation of recursion
* Topic 2: Arrays
* Topic 3: Linked Lists
[Note: The shared number is NOT in the summary ✅]

[ 1] user: What are sorting algorithms?
...

Notice that the summary preserves the conversation flow (discussing recursion and data structures) while excluding the confidential information. The agent can continue having coherent conversations about algorithms without the secret cluttering the context.

Semantic Retrieval Finds Everything:

🔍 Query: 'What was our shared secret number?'
Search completed in 66.7ms (reranked from 9 candidates)
✅ Found 4 relevant messages in semantic memory

• Secret '700' retrievable: ✅ YES

The semantic search quickly finds the archived message, even though it's not in the active conversation. The system automatically enriches the query:

Based on our previous conversation, these earlier exchanges may be relevant:

---Previous Context---
[Message 0, user]: Our shared number is 700. This is confidential – don't include it in any summary...
[Message 1, assistant]: Understood. I'll keep our shared number confidential...
---End Previous Context---

Current question: What was our shared number?

The agent sees both the original messages (with surrounding context from the radius parameter) and the current query. This natural enrichment happens automatically. The agent code doesn't change at all.

Memory Usage Monitoring

The conversation manager includes built-in memory monitoring, essential for production deployments:

# Get detailed statistics
stats = agent.conversation_manager.get_memory_usage_stats()
print(f"Messages stored: {stats['message_count']}")
print(f"Total memory: {stats['total_memory']:,} bytes")
print(f"Message memory: {stats['message_memory']:,} bytes")
print(f"Embedding memory: {stats['embedding_memory']:,} bytes")

# Get human-readable summary
print(agent.conversation_manager.get_memory_usage_summary())

This visibility is crucial when tuning your memory limits. You can see exactly how much memory your agents are using and adjust the configuration accordingly.

Deployment Considerations

Before considering deployment of this prototype, several important factors need careful evaluation and likely additional development:

Memory Limits: Set appropriate limits based on your deployment environment. A Lambda function with 3GB memory needs tighter constraints than a long-running container. Use both message count and memory size limits to prevent unbounded growth.

Embedding Model: The system uses sentence transformers by default, which runs locally. For production, consider your latency and throughput requirements. Local models add no API costs but use CPU resources. You might want to experiment with different embedding models for your specific use case.

Index Persistence: The semantic index persists to disk, enabling warm starts. This means restarted agents can immediately search historical messages without rebuilding the index. Make sure your deployment environment has writable storage (or modify the code to use a different persistence backend).

Context Radius Tuning: Start with a radius of 2 and adjust based on testing. Larger radii provide more context but use more tokens. Monitor your context usage to find the sweet spot for your domain.

Search Threshold: The default min_score of -2.0 works well for general use, but you might need to tune it. If you're getting too many irrelevant matches, increase it. If you're missing relevant context, decrease it. Log the scores during development to understand what works for your data.

Intelligent Overlap Handling

The system automatically merges overlapping message ranges. If semantic search finds messages 5-7 and messages 6-9 as relevant, it merges them into a single range 5-9 rather than duplicating messages 6 and 7. This prevents token waste and maintains a cleaner context presentation.

This improves context quality because the agent sees a coherent narrative flow rather than confusing duplicated messages.

Real-World Use Cases

This hybrid memory architecture excels in several scenarios:

Customer Support: Keep the last few exchanges in active context for natural flow, but retrieve exact past conversations when a customer references an earlier issue or order number.

Personal Assistants: Maintain recent context for ongoing tasks while being able to recall specific details from weeks or months ago. "What was that restaurant you recommended last month?"

Technical Documentation Bots: Summarize long technical discussions while preserving the ability to retrieve exact code snippets, error messages, or configuration values.

Educational Tutors: Remember the student's learning journey, including specific questions they asked and concepts they struggled with, even across multiple sessions.

Data Analysis Agents: Maintain conversation flow while being able to recall exact numbers, queries, or insights from earlier in a long analysis session.

The common thread: any agent that needs both conversational coherence and precise recall benefits from this architecture.

What Makes This Different

You might be wondering how this compares to other memory solutions. Several approaches exist in the agent ecosystem, but they typically choose one strategy:

Some frameworks use hierarchical summarization, creating summaries of summaries. This manages context well but makes precise recall even harder—information gets compressed multiple times.

Some implement retrieval-augmented generation (RAG) where the agent explicitly calls a memory retrieval tool. This gives the agent control but requires it to decide when to search, adding cognitive overhead.

The Semantic Summarizing Conversation Manager combines automatic summarization for context flow with automatic semantic retrieval for precision. The agent doesn't need to manage memory—it just works. The hook system in Strands makes this possible through its elegant event architecture.

What's Next

This hybrid memory system balances efficiency with precision and automatic behavior with configurability. As a prototype, this system demonstrates the core concepts but would benefit from additional hardening, testing, and optimization before production use.

The complete prototype code is available on GitHub. I've included comprehensive documentation, the working demo, and modular components you can adapt for your needs.

I'm particularly interested in feedback on parameter tuning for different domains. What works well for customer support might not work for technical documentation. If you use this system, I'd love to hear about your configuration choices and what you learned.

Ready to improve your agents memory? Clone the repo, run the demo, and see hybrid memory in action. Your agents (and your users) will thank you for it.

Visualizing AI Agent Memory: Building a Web Browser for Amazon Bedrock AgentCore Memory

Danilo Poccia — Fri, 19 Sep 2025 11:20:07 +0000

When building and testing multiple AI agent frameworks with Amazon Bedrock AgentCore, I realized I needed a tool to visualize and explore what my agents were actually remembering. AgentCore Memory provides powerful capabilities for managing both short-term conversation context and long-term knowledge extraction, but debugging memory patterns meant diving into AWS CLI commands or writing custom scripts just to see what was stored. I needed a way to quickly browse, search, and understand the memory patterns my agents were creating.

That's why I built AgentCore Memory Browser: a web interface that makes it simple to explore and interact with Amazon Bedrock AgentCore Memory resources. Whether you're debugging an agent's memory extraction or simply curious about what your agents are learning over time, this tool provides the visibility you need.

In this post, I'll walk you through the AgentCore Memory Browser's capabilities, show you how to set it up in minutes, and demonstrate how it can accelerate your agent development workflow. This tool complements the multi-framework journey I've been documenting in my main AgentCore blog series, providing essential visibility for any agent implementation.

Why Build a Memory Browser?

Working with AI agents in production requires understanding not just what they say, but what they remember. Amazon Bedrock AgentCore Memory provides sophisticated memory management with multiple strategies for extracting and storing different types of information. AgentCore Memory can capture user preferences and settings, store factual information extracted from conversations, create condensed summaries of sessions, and maintain the raw conversation history for context.

When an agent isn't behaving as expected, or when you want to understand its memory patterns, you need visibility into these memory stores. The AWS CLI provides the raw capability, but switching between terminal commands while developing breaks your flow. I needed something more intuitive—a tool that could show me at a glance what each memory strategy was storing, let me search through records, and help me understand how my agents were using memory across different sessions and actors.

Key Features That Accelerate Development

The AgentCore Memory Browser provides real-time exploration of all your AgentCore Memory resources with live data pulled directly from both control plane and data plane APIs. You can see memory status, configurations, and strategies at a glance.

Each memory strategy gets its own dedicated interface with operations tailored to its purpose. Whether you're working with user preferences, semantic facts, or session summaries, the browser adapts to show relevant operations and namespace patterns. AgentCore Memory uses namespace templates with placeholders, and when a strategy defines a namespace that contains a {memoryStrategyId}, the browser automatically fills in the strategy ID portion while keeping the field editable so that you can substitute the actor and session values. This makes it easy to explore specific user or session data without having to type the full namespace path each time.

The browser provides three core operations for each strategy. You can list events to view the sequence of events for specific sessions and actors, helping you understand the temporal flow of your agent's interactions. You can browse all memory records in a namespace with pagination support for large datasets. And you can retrieve memory using natural language queries, taking advantage of AgentCore's semantic search capabilities.

The developer-friendly UI includes quick copy buttons for Memory IDs, ARNs, and namespace values, saving you from manual selection and copying. The auto-expanding JSON viewer with syntax highlighting makes it easy to inspect complex memory structures. The browser remembers your namespace edits during a session, so you don't have to re-enter actor and session IDs repeatedly. And all user content is HTML-escaped to prevent injection attacks, ensuring security even when browsing untrusted memory content.

Installation in Under a Minute

To get started quickly, I've packaged the AgentCore Memory Browser as a Python tool that can be installed globally using uv, the fast Python package manager.

Before installation, ensure you have Python 3.13 or higher and AWS CLI configured with appropriate credentials. You'll need AWS IAM permissions for bedrock-agentcore-control:ListMemories and GetMemory operations, as well as bedrock-agentcore:ListEvents, ListMemoryRecords, and RetrieveMemoryRecords.

You can install directly from GitHub with a single command:

uv tool install git+https://github.com/danilop/agentcore-memory-browser.git

Then run it from anywhere:

agentcore-memory-browser

The application automatically opens in your default browser at http://localhost:8000 (you can pass a different port on the command line).

If you want to modify the tool or contribute to development, you can clone the repository, install dependencies with uv sync, and run the application with uv run agentcore-memory-browser.

Architecture: Clean Separation of Concerns

The AgentCore Memory Browser follows a clean, modular architecture. The backend is built with FastAPI, providing a modern, async-capable web framework. It uses two AWS service clients: the AgentCore control plane to list and describe memory resources, and the data plane to perform operations like listing events, browsing records, and executing semantic searches.

The frontend uses Bootstrap for responsive design and vanilla JavaScript for interactivity—no complex build process required. The interface is organized with a sidebar for memory selection with metadata preview, main content with a tabbed interface for each memory strategy, operation panels with dedicated forms for each memory operation, and a results display with a JSON tree viewer with syntax highlighting.

Real-World Usage Patterns

After using the Memory Browser while developing agents, certain patterns have proven most valuable in my workflow to debug memory extraction as events are processed by strategies, understand how an agent's knowledge evolves over time, and optimizing semantic emory searches.

Best Practices

After extensive use, certain practices have emerged as particularly helpful. First, configure your AWS environment properly by setting your default AWS Region and verifying credentials with aws sts get-caller-identity. This ensures the browser can connect to your AgentCore Memory resources without issues.

Also, making good use of the copy buttons saves time when you need to reference memory IDs or ARNs in your code.

What makes AgentCore Memory particularly useful is how it simplifies handling both short-term and long-term memory for AI agents. Short-term memory captures the immediate context of conversations, while long-term memory extracts and preserves important facts, preferences, and patterns that persist across sessions. The Memory Browser gives you a window into both, helping you understand how your agents build knowledge over time and how they use that knowledge to provide more personalized and contextually aware responses. This visibility helps building agents that can maintain coherent and efficient results while learning and adapting from their interactions.

Token Counting Meets Amazon Bedrock

Danilo Poccia — Tue, 16 Sep 2025 16:30:07 +0000

When working with large language models through Amazon Bedrock, understanding token consumption can help managing costs and staying within model limits. While the Bedrock console provides token counts after each API call, developers need a way to measure tokens before sending requests, especially when building applications that process large volumes of text or require precise truncation.

Amazon Bedrock offers a CountTokens API that provides exact token measurements for the supported models, currently:

Anthropic Claude 4 Sonnet
Anthropic Claude 4 Opus
Anthropic Claude 3.7 Sonnet
Anthropic Claude 3.5 Sonnet
Anthropic Claude 3.5 Haiku

However, integrating this API into development workflows requires dealing with the correct syntax and implementing efficient algorithms if truncation is needed. This is where ttok4bedrock comes in—a command-line tool and Python library that makes token counting as simple as it should be.

# Count tokens (default: Claude Sonnet 4)
ttok4bedrock "Hello, world!"
# Output: 11

# Count from stdin
echo "Count these tokens" | ttok4bedrock
cat document.txt | ttok4bedrock

# Truncate to N tokens
ttok4bedrock -t 100 "Very long text..."
cat large.txt | ttok4bedrock -t 100 > truncated.txt

# Use specific Bedrock model (full model ID)
ttok4bedrock -m anthropic.claude-3-5-sonnet-20241022-v2:0 "Text"
ttok4bedrock -m anthropic.claude-3-7-sonnet-20250219-v1:0 "Text"

# Specify AWS region (uses default if not specified)
ttok4bedrock --aws-region us-west-2 "Text"

Standing on the Shoulders of Giants

Simon Willison's ttok has become a standard tool for token counting with OpenAI models, valued for its simplicity and versatility. Rather than creating something entirely new, I built ttok4bedrock as a drop-in replacement that maintains complete compatibility with ttok's interface while leveraging Bedrock's native CountTokens API.

The goal was straightforward: preserve the developer experience that made ttok successful while adapting to Bedrock's requirements. This means you can switch from ttok "Count my tokens" to ttok4bedrock "Count my tokens" without changing your scripts or learning new commands. The tool automatically handles AWS authentication using the standard boto3 credential chain and can work with any AWS Region.

Solving the Truncation Challenge

One of the most requested features in token counting tools is intelligent truncation—cutting text to fit within a specific token limit. This is not straightforward if you can only count tokens.

The truncation algorithm I implemented uses an adaptive approach that minimizes API calls while achieving exact results. It begins by analyzing text characteristics such as punctuation density and word length to estimate the character-to-token ratio. Through iterative refinement using linear interpolation, it finds the precise character boundary where the token count meets your target. The algorithm typically converges in 3-5 API calls for most texts, with built-in caching to eliminate redundant API requests.

For developers, this means you can pipe any text through the tool with a token limit and get perfectly truncated output: cat large_document.txt | ttok4bedrock -t 1000 > truncated.txt. The truncation is exact, not approximate, ensuring you maximize the content within your token budget.

The tool includes self-imposed limits to prevent runaway API usage, capping truncation attempts at 20 API calls. In practice, this limit is rarely reached, but it provides a safety net against unexpected edge cases or malformed input.

Handling AWS Integration Properly

Working with AWS services requires attention to authentication and configuration patterns. The tool respects the standard AWS credential chain, working seamlessly whether you're using environment variables, AWS profiles, IAM roles on EC2, or any other standard authentication method. Region selection follows the same precedence rules as other AWS tools, checking command-line arguments, environment variables, and configuration files in that order.

The tool requires minimal IAM permissions—just bedrock:CountTokens on the foundation model resources. This follows the principle of least privilege while keeping setup simple.

Technical Details That Matter

An interesting quirk of Amazon Bedrock CountTokens API is that it wraps text in message structures, adding approximately 7 tokens of overhead. This overhead is invisible but affects the count, potentially confusing developers who expect the raw token count of their text. The ttok4bedrock library automatically detects and subtracts this overhead, providing intuitive results that match what developers expect.

Model selection is explicit: ttok4bedrock -m anthropic.claude-3-5-sonnet-20241022-v2:0 "Your text here". Claude 4 Sonnet is the default if no model is specified.

Integration Patterns for Developers

For Python developers, the library offers the same API as ttok, making migration trivial. Import ttok4bedrock as ttok, and your existing code continues to work with Bedrock models for the functionalities that are provided (token count and truncation).

The CLI tool fits naturally into Unix-style pipelines, accepting input from stdin and outputting to stdout. This design enables powerful compositions with other text processing tools, making it easy to integrate token counting into existing workflows. Whether you're building a document processing pipeline or analyzing prompt efficiency, the tool adapts to your needs.

Practical Applications

Token counting might seem like a utility concern, but it enables important optimizations in production systems. Accurately measuring tokens before API calls helps with prompt and context engineering, allowing developers to maximize the information within model context windows. For applications that process user-generated content, pre-flight token counting prevents errors and improves user experience by providing immediate feedback about text length.

The truncation capability is particularly valuable for RAG (Retrieval-Augmented Generation) systems where you need to fit retrieved documents within prompt limits. Instead of crude character-based cutting that might break mid-word or mid-sentence, the tool provides clean truncation at exact token boundaries.

Getting Started

Installation is straightforward using uv, the fast Python package installer. After cloning the repository and running uv sync, you're ready to count tokens.

For teams already using ttok in their workflows, migration is as simple as aliasing ttok4bedrock to ttok. The identical command-line interface means existing scripts, documentation, and muscle memory all transfer seamlessly.

The next time you're working with Claude models on Amazon Bedrock and need to count or truncate tokens, give ttok4bedrock a try.

Building Production-Ready AI Agents with LangGraph and Amazon Bedrock AgentCore

Danilo Poccia — Mon, 15 Sep 2025 14:43:37 +0000

In this fifth and final deep dive of our multi-framework series, I'll show you how to build a production-ready AI agent using LangGraph and deploy it using Amazon Bedrock AgentCore. The complete code for this implementation, along with examples for other frameworks, is available on GitHub at agentcore-multi-framework-examples.

LangGraph takes a different approach to agent workflows. Rather than linear chains of prompts or simple tool loops, LangGraph models agent behavior as a state graph where nodes perform actions and edges define transitions. This graph-based approach enables control flows with cycles, conditional branching, and human-in-the-loop interactions—capabilities that work well with the AgentCore persistent memory system.

LangGraph makes state management explicit—every node in the graph receives the current state, transforms it, and passes it along. Having explicit state makes debugging easier and helps integrate with AgentCore Memory for state persistence across sessions.

Setting Up the Development Environment

I'll begin by navigating to the LangGraph project in our repository:

cd agentcore-multi-framework-examples/agentcore-lang-graph
uv sync
source .venv/bin/activate

The project builds on the LangChain ecosystem with these dependencies:

langgraph                # Graph-based agent framework
langchain                # Core LangChain library
langchain-aws            # AWS integrations including Bedrock
langchain-tavily         # Tavily search integration
bedrock-agentcore        # AgentCore SDK
bedrock-agentcore-starter-toolkit  # Deployment tools

Understanding the LangGraph State Machine Architecture

LangGraph introduces a fundamentally different way of building agents through its StateGraph abstraction. Instead of imperatively calling tools or chaining prompts, I define a graph where each node represents a step in the agent's reasoning process.

Defining Agent State

The foundation of any LangGraph agent is its state definition. I use a TypedDict to define what information flows through the graph:

from typing import Annotated
from typing_extensions import TypedDict
from langgraph.graph.message import add_messages

class State(TypedDict):
    messages: Annotated[list, add_messages]

The add_messages annotation is special—it tells LangGraph to append new messages to the list rather than replacing it. This creates a growing conversation history as the graph executes, maintaining context throughout the workflow.

Building the Graph

The graph construction follows a declarative pattern that clearly shows the agent's decision flow:

from langgraph.graph import StateGraph, START, END
from langgraph.prebuilt import ToolNode, tools_condition

# Initialize the graph with our State type
graph_builder = StateGraph(State)

# Configure the LLM with tools
llm = init_chat_model(
    "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
    model_provider="bedrock_converse",
)
llm_with_tools = llm.bind_tools(tools)

# Define the chatbot node
def chatbot(state: State):
    return {"messages": [llm_with_tools.invoke(state["messages"])]}

# Add nodes to the graph
graph_builder.add_node("chatbot", chatbot)
tool_node = ToolNode(tools=[tool])
graph_builder.add_node("tools", tool_node)

The init_chat_model function provides a unified interface for initializing chat models across providers. By specifying model_provider="bedrock_converse", I'm using Amazon Bedrock's Converse API, which provides consistent behavior across different foundation models.

Conditional Edges and Control Flow

The control flow in LangGraph is defined through its edges:

# Add conditional edge from chatbot
graph_builder.add_conditional_edges(
    "chatbot",
    tools_condition,
    {"tools": "tools", END: END},
)

# Add edge from tools back to chatbot
graph_builder.add_edge("tools", "chatbot")

# Add edge from START to chatbot
graph_builder.add_edge(START, "chatbot")

# Compile the graph
graph = graph_builder.compile()

The tools_condition is a pre-built function that examines the chatbot's output. If the model called a tool, it routes to the tools node; otherwise, it ends the conversation. This creates a loop where the agent can make multiple tool calls before providing a final answer.

Integrating Tavily Search

For this implementation, I've integrated Tavily Search as the primary tool, demonstrating how LangGraph agents can access real-time information:

from langchain_tavily import TavilySearch

tool = TavilySearch(max_results=2)
tools = [tool]

Tavily provides an AI-optimized search API that returns clean, relevant snippets rather than full web pages. This reduces the overhead of parsing HTML when agents need current information. The integration is seamless—LangGraph automatically handles tool invocation and result incorporation into the conversation flow.

AgentCore Runtime and Memory Integration

The integration with AgentCore Runtime provides the production infrastructure for the LangGraph agent. Let me explain how the entrypoint function processes requests and manages memory.

The Entrypoint Function

from bedrock_agentcore import BedrockAgentCoreApp
from bedrock_agentcore.runtime.context import RequestContext

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload: Dict[str, Any], context: Optional[RequestContext] = None) -> Dict[str, Any]:
    """Main entrypoint with AgentCore memory integration."""

    logger.info("LangGraph invocation started")

The @entrypoint decorator marks this function as the handler for incoming requests. The function receives:

payload: Contains the request data, including the user's prompt
context: A RequestContext object providing session management

Memory Enhancement Before Graph Execution

Before executing the graph, I retrieve relevant memories and add them to the input:

    # Extract parameters with context priority for session_id
    actor_id = payload.get("actor_id", DEFAULT_ACTOR_ID)
    session_id = context.session_id if context and context.session_id else payload.get("session_id", DEFAULT_SESSION_ID)

    prompt = payload.get("prompt", "No prompt found in input")

    # Enhance prompt with AgentCore memory context
    memory_context = memory_manager.get_memory_context(
        user_input=prompt,
        actor_id=actor_id,
        session_id=session_id
    )
    enhanced_prompt = f"{memory_context}\n\nCurrent user message: {prompt}" if memory_context else prompt

    # Create messages for LangGraph
    messages = {"messages": [{"role": "user", "content": enhanced_prompt}]}

The memory context retrieval uses two AgentCore Memory APIs:

get_last_k_turns to load conversation history
RetrieveMemories to search for relevant memories

Executing the Graph and Storing Results

After the graph processes the enriched input, I store the conversation and return the result:

try:
    # Invoke the LangGraph
    response = graph.invoke(messages)
    response_message = response['messages'][-1].content

    # Store conversation in AgentCore memory
    memory_manager.store_conversation(
        user_input=prompt,  # Store original prompt, not enhanced version
        response=response_message,
        actor_id=actor_id,
        session_id=session_id
    )

    return {"result": response_message}

The store_conversation() method calls the create_event API, which:

Stores the raw conversation
Triggers memory strategies to extract preferences, facts, and summaries
Makes these insights available for future retrievals

I store the original user prompt, not the enhanced version with memory context. This prevents recursive memory expansion where each retrieval would include previous retrievals, keeping the memory focused on actual conversation content.

The function returns a dictionary that AgentCore Runtime automatically serializes to JSON for the HTTP response.

Advanced Graph Patterns

While our implementation uses a simple tool-calling pattern, LangGraph enables more complex workflows that work well with AgentCore Memory.

Human-in-the-Loop Workflows

LangGraph's interrupt and resume capabilities allow building agents that pause for human input:

from langgraph.checkpoint.memory import MemorySaver

# Add checkpointing for state persistence
checkpointer = MemorySaver()
graph = graph_builder.compile(checkpointer=checkpointer)

# Add interrupt before critical decisions
graph_builder.add_node("human_approval", human_approval_node)
graph_builder.add_edge("chatbot", "human_approval", interrupt_before=["human_approval"])

Combined with AgentCore Memory, this enables workflows where agents remember not just conversations but also approval patterns and human feedback over time.

Multi-Agent Collaboration

LangGraph excels at orchestrating multiple agents working together:

# Define specialized agents as subgraphs
research_agent = create_research_graph()
analysis_agent = create_analysis_graph()

# Compose them in a parent graph
parent_builder = StateGraph(State)
parent_builder.add_node("research", research_agent)
parent_builder.add_node("analysis", analysis_agent)

# Route based on task type
def route_to_agent(state):
    if "research" in state["messages"][-1].content.lower():
        return "research"
    return "analysis"

parent_builder.add_conditional_edges(START, route_to_agent)

Each subgraph can maintain its own memory namespace in AgentCore, enabling specialized agents with distinct knowledge bases while sharing conversation context.

Parallel Execution

LangGraph supports parallel node execution for concurrent processing:

from langgraph.graph import END

# Add nodes that can run in parallel
graph_builder.add_node("search_web", search_web_node)
graph_builder.add_node("search_memory", search_memory_node)
graph_builder.add_node("synthesize", synthesize_node)

# Both searches run in parallel
graph_builder.add_edge("chatbot", "search_web")
graph_builder.add_edge("chatbot", "search_memory")

# Both complete before synthesis
graph_builder.add_edge("search_web", "synthesize")
graph_builder.add_edge("search_memory", "synthesize")

Combining web search with memory retrieval is useful, allowing the agent to gather information from multiple sources simultaneously.

Deploying the Agent

The deployment process leverages the same AgentCore Starter Toolkit workflow used throughout this series, with some LangGraph-specific considerations.

Configuration and Local Testing

First, I configure the agent for AgentCore:

agentcore configure -n langgraphagent -e main.py

For local testing with Tavily search, I need to provide the API key:

agentcore launch --local --env TAVILY_API_KEY=<YOUR_TAVILY_API_KEY>

This launches the containerized agent locally. Testing with memory context:

agentcore invoke --local '{"prompt": "AI multi-agent architectures - Also, what did I say about fruit?"}'

The agent performs a web search for current information about AI architectures while also retrieving the stored memory about fruit preferences, demonstrating the power of combining real-time data with persistent context.

Production Deployment

Deploying to AWS requires passing the Tavily API key as an environment variable:

agentcore launch --env TAVILY_API_KEY=<YOUR_TAVILY_API_KEY>

AgentCore securely stores the API key in AWS Secrets Manager and injects it into the Lambda function environment at runtime. This keeps credentials protected, never exposed in code or configuration files.

Monitoring the deployment:

aws logs tail /aws/bedrock-agentcore/runtimes/<AGENT_ID_ENDPOINT_ID> --follow

The structured logging from both LangGraph and AgentCore makes it easy to trace the graph execution flow and debug any issues.

Observability and Debugging

LangGraph provides excellent visibility into agent execution through its graph structure. I can even visualize the graph running this code locally:

# Generate a Mermaid diagram of the graph
image = graph.get_graph().draw_mermaid_png()
with open("graph_diagram.png", "wb") as f:
    f.write(image)

This visualization helps understand the agent's decision flow and is invaluable for debugging complex workflows.

Tracing Graph Execution

LangGraph's execution is fully traceable through LangSmith or custom callbacks:

from langchain.callbacks import StdOutCallbackHandler

# Add tracing to graph execution
response = graph.invoke(
    messages,
    config={"callbacks": [StdOutCallbackHandler()]}
)

Combined with CloudWatch Logs and observability from AgentCore, this provides complete observability from infrastructure to application logic.

Cleaning Up Resources

To delete the resources created by agentcore launch, I use the agentcore command in the Python virtual event:

agentcore destroy

This command deletes the AgentCore agent, the ECR images, the CodeBuild project, and the IAM roles used by the agent and by CodeBuild.

To delete the memory, including all stored events, the strategies, and the memories extracted form the events, I lookup the memory ID in the ../config/memory-config.json file and use the AWS CLI:

aws bedrock-agentcore-control delete-memory --memory-id <MEMORY_ID>

Production Considerations

Running LangGraph agents in production with AgentCore has taught me several important lessons.

State Management Strategies

LangGraph's explicit state management requires careful consideration of what to persist. While the graph maintains state during execution, AgentCore Memory handles cross-session persistence. I've found it best to keep graph state focused on the current task while using AgentCore Memory for long-term context.

Comparing Graph-Based to Traditional Approaches

After implementing agents with multiple frameworks, I can see where LangGraph's graph-based approach works best. Traditional agent frameworks often struggle with complex, multi-step workflows that require backtracking or parallel processing. LangGraph makes these patterns natural and explicit.

The graph structure also makes it easier to implement safety constraints. By controlling edges and adding validation nodes, I can make agents follow approved workflows even when using powerful foundation models. This matters in production environments where predictability is as important as capability.

However, this power comes with complexity. Simple question-answering agents might be overengineered as graphs. The key is choosing the right tool for the job—LangGraph for complex workflows, simpler frameworks for straightforward tasks.

What We've Learned

This LangGraph implementation completes our journey through five different agent frameworks, all unified through Amazon Bedrock AgentCore. The graph-based approach offers unique advantages for complex workflows while maintaining the same production-grade memory and deployment infrastructure we've used throughout the series.

AgentCore's flexibility enables each framework to use its strengths while providing consistent operational excellence. Whether you're building simple tool-calling agents with Strands, multi-agent systems with CrewAI, type-safe applications with Pydantic AI, data-centric agents with LlamaIndex, or complex workflows with LangGraph, AgentCore handles the production challenges so you can focus on agent logic.

The shared memory architecture we've used across all frameworks shows the benefits of standardization. By creating a common memory interface, we've enabled true portability—agents built with different frameworks can share memories and even hand off conversations to each other.

Next Steps and Future Directions

While this series focused on Runtime and Memory, AgentCore offers additional services that further enhance production deployments. Future explorations could include using AgentCore Identity for inbound and outbound agent authentication and authorization, implementing AgentCore Gateways to transform existing APIs and AWS Lambda functions into MCP servers, or leveraging AgentCore Monitoring for advanced observability.

The complete code for all five frameworks is available on GitHub. I encourage you to explore the implementations, experiment with combining different frameworks, and build your own production-ready agents.

The AI agent ecosystem is evolving rapidly, with new tools and patterns emerging constantly. What remains constant is the need for production-grade infrastructure that can adapt to these changes. Amazon Bedrock AgentCore provides that foundation, enabling you to experiment with cutting-edge agent technologies while maintaining enterprise-ready deployment and operations.

Thank you for joining me on this multi-framework journey. Now it's your turn to build something amazing!

Building Production-Ready AI Agents with LlamaIndex and Amazon Bedrock AgentCore

Danilo Poccia — Mon, 15 Sep 2025 14:37:28 +0000

In this fourth deep dive of our multi-framework series, I'll show you how to build a production-ready AI agent using LlamaIndex and deploy it using Amazon Bedrock AgentCore. The complete code for this implementation, along with examples for other frameworks, is available on GitHub at agentcore-multi-framework-examples.

LlamaIndex takes a data-centric approach to agent development. While other frameworks focus primarily on orchestration and tool calling, LlamaIndex specializes in connecting agents with diverse data sources and building RAG (Retrieval-Augmented Generation) pipelines. This is useful when your agent needs to reason over documents, databases, or APIs while maintaining production-grade memory persistence through AgentCore.

LlamaIndex provides specialized components for every aspect of data processing—from ingestion and chunking to embedding and retrieval. This modular design works well with AgentCore's memory system, allowing me to build agents that process external data while maintaining searchable conversation histories.

Setting Up the Development Environment

I'll start by navigating to the LlamaIndex project within our multi-framework repository:

cd agentcore-multi-framework-examples/agentcore-llama-index
uv sync
source .venv/bin/activate

The project uses several LlamaIndex packages for different capabilities:

llama-index-core         # Core agent and data framework
llama-index-llms-bedrock-converse  # Amazon Bedrock LLM integration
llama-index-embeddings-bedrock     # Bedrock embeddings for semantic search
llama-index-tools-wikipedia        # Wikipedia data source
llama-index-tools-requests        # Web request capabilities
bedrock-agentcore                 # AgentCore SDK
bedrock-agentcore-starter-toolkit # Deployment tools

Understanding the LlamaIndex Agent Architecture

LlamaIndex recently introduced the FunctionAgent, a streamlined agent implementation that focuses on tool use and conversation flow. Unlike the earlier ReActAgent, FunctionAgent leverages the native function-calling capabilities of modern LLMs, resulting in more reliable and efficient agent behavior.

The architecture I've built combines three key LlamaIndex concepts:

Global Settings Configuration

LlamaIndex uses a Settings object to configure global defaults for LLMs, embeddings, and processing parameters. This centralized configuration provides consistency across all components:

def initialize_llamaindex_settings():
    """Initialize LlamaIndex global settings."""
    # Configure LLM
    llm = BedrockConverse(
        model="us.amazon.nova-pro-v1:0",
    )

    # Configure embeddings
    embed_model = BedrockEmbedding(
        model_name="amazon.titan-embed-text-v2:0",
        region_name="us-east-1"
    )

    # Set global settings
    Settings.llm = llm
    Settings.embed_model = embed_model
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50

The BedrockConverse integration uses the Amazon Bedrock Converse API, which provides a unified interface across different foundation models. This means I can easily switch between Claude, Llama, or Amazon Nova models without changing my agent code.

Tool Integration

LlamaIndex tools follow a consistent interface through the FunctionTool abstraction. I've created a comprehensive tool suite that combines custom tools with pre-built integrations:

def create_llamaindex_tools(memory_manager=None) -> List[FunctionTool]:
    """Create a list of tools for the LlamaIndex agent."""
    tools = []

    # Basic function tools
    tools.extend([
        FunctionTool.from_defaults(
            fn=calculator,
            name="calculator",
            description="Perform basic mathematical calculations"
        ),
        FunctionTool.from_defaults(
            fn=text_analyzer,
            name="text_analyzer", 
            description="Analyze text and provide statistics"
        )
    ])

    # Add external tools
    wikipedia_spec = WikipediaToolSpec()
    wikipedia_tools = wikipedia_spec.to_tool_list()
    tools.extend(wikipedia_tools)

The ToolSpec pattern allows entire tool suites to be packaged and shared. The Wikipedia and Requests tools come from LlamaHub, the LlamaIndex community tool repository, demonstrating how easy it is to extend agent capabilities.

Memory-Enhanced Context

Integrating AgentCore Memory with LlamaIndex requires dynamically updating the agent's system prompt based on retrieved memories, rather than just appending memories to prompts:

# Enhance the user input with memory context if available
if memory_manager:
    memory_context = memory_manager.get_memory_context(
        user_input=user_input,
        actor_id=actor_id,
        session_id=session_id
    )

    # Update agent's system prompt if enhanced
    if memory_context:
        enhanced_prompt = f"{get_system_prompt()}\n\nRelevant context from previous interactions:\n{memory_context}"
        agent_instance.update_prompts({"system_prompt": enhanced_prompt})

AgentCore Runtime Integration

The integration with AgentCore Runtime starts with the entrypoint function that receives requests and returns responses. Let me explain how each component works:

The Entrypoint Function

from bedrock_agentcore import BedrockAgentCoreApp
from bedrock_agentcore.runtime.context import RequestContext

app = BedrockAgentCoreApp()

@app.entrypoint
async def invoke(payload: Dict[str, Any], context: Optional[RequestContext] = None) -> str:
    """AgentCore entrypoint for LlamaIndex agent invocation."""

The @entrypoint decorator marks this function as the handler that AgentCore Runtime invokes when your agent receives a request. The function receives two parameters:

payload: A dictionary containing the request data, including the user's prompt
context: A RequestContext object that provides session information managed by AgentCore

Processing Requests

# Extract parameters from the request
user_input = payload.get("prompt", "Hello! What can you help me with?")
actor_id = payload.get("actor_id", DEFAULT_ACTOR_ID)
session_id = context.session_id if context and context.session_id else payload.get("session_id", DEFAULT_SESSION_ID)

logger.info(f"Processing request for actor_id: {actor_id}, session_id: {session_id}")

The session_id from the RequestContext is particularly important—AgentCore Runtime automatically manages session isolation at iuinfrastructure level, giving each user their own isolated conversation space that persists across invocations.

Memory Enhancement Before Processing

Before the agent processes the request, I retrieve relevant memories and add them to the prompt:

# Enhance prompt with AgentCore memory context
memory_context = memory_manager.get_memory_context(
    user_input=user_input,
    actor_id=actor_id,
    session_id=session_id
)

# Update agent's system prompt if enhanced
if memory_context:
    enhanced_prompt = f"{get_system_prompt()}\n\nRelevant context from previous interactions:\n{memory_context}"
    agent_instance.update_prompts({"system_prompt": enhanced_prompt})

The get_memory_context() method performs two operations:

Retrieves conversation history using the get_last_k_turns API
Searches for relevant memories using the RetrieveMemories operation

This enriched context becomes part of the agent's system prompt, providing historical context for the response generation.

Executing the Agent and Returning the Response

# Run the agent asynchronously
response = await agent_instance.run(user_input)
response_text = str(response)

# Store conversation after generating response
if memory_manager:
    memory_manager.store_conversation(
        user_input=user_input,
        response=response_text,
        actor_id=actor_id,
        session_id=session_id
    )

# Return the text response
return response_text

The function returns a string containing the agent's response. AgentCore Runtime takes this string and automatically:

Wraps it in the appropriate HTTP response structure
Handles response formatting for API Gateway
Manages error responses if exceptions occur

Notice the async keyword—the FunctionAgent in LlamaIndex supports asynchronous execution natively. This allows the agent to make multiple tool calls or process large documents without blocking other requests. The BedrockAgentCoreApp handles async functions transparently, whether running locally or deployed.

AgentCore Memory Integration

The memory system in this implementation uses AgentCore Memory to provide persistent context across sessions. Here's how it works:

Memory Retrieval and Prompt Enrichment

The memory integration happens in two phases. First, when the agent starts processing a request, it retrieves relevant memories to enrich the input:

# Get memory context using the shared memory manager
memory_context = memory_manager.get_memory_context(
    user_input=user_input,
    actor_id=actor_id,
    session_id=session_id
)

The get_memory_context() method performs these operations:

Load Conversation History: On the first invocation of a session, it calls the get_last_k_turns API to retrieve up to 100 previous conversation turns:

conversations = self.memory_client.get_last_k_turns(
    memory_id=self.memory_config.memory_id,
    actor_id=actor_id,
    session_id=session_id,
    k=100  # Maximum conversation turns to retrieve
)

Retrieve Relevant Memories: It performs semantic search using the RetrieveMemories operation to find relevant facts, preferences, and summaries:

memories = retrieve_memories_for_actor(
    memory_id=memory_manager.memory_config.memory_id,
    actor_id=actor_id,
    search_query=user_input,
    memory_client=memory_manager.memory_client
)

The namespace structure /actor/{actor_id}/ provides complete isolation between users—each actor has their own memory space that other users cannot access.

Storing Conversations After Response

After the agent generates a response, the conversation is stored in AgentCore Memory:

# Store conversation in AgentCore Memory
memory_manager.store_conversation(
    user_input=user_input,
    response=response_text,
    actor_id=actor_id,
    session_id=session_id
)

The store_conversation() method calls the create_event API:

messages_to_store = [
    (user_input, 'USER'),
    (response_text, 'ASSISTANT')
]

self.memory_client.create_event(
    memory_id=self.memory_config.memory_id,
    actor_id=actor_id,
    session_id=session_id,
    messages=messages_to_store
)

When create_event is called, AgentCore Memory automatically:

Stores the raw conversation
Extracts user preferences using the UserPreferences strategy
Identifies semantic facts using the SemanticFacts strategy
Generates session summaries using the SessionSummaries strategy

These extracted insights become available for future retrievals, building the agent's long-term knowledge.

Memory as an Agentic Tool

I've also exposed memory retrieval as a tool that the agent can use strategically:

def retrieve_memories(query: str, max_results: int = 5) -> str:
    """Retrieve relevant memories based on a query."""
    try:
        memories = retrieve_memories_for_actor(
            memory_id=memory_manager.memory_config.memory_id,
            actor_id=memory_manager.default_actor_id,
            search_query=query,
            memory_client=memory_manager.memory_client
        )

        if not memories:
            return f"No relevant memories found for query: '{query}'"

        result = f"Retrieved {len(memories)} memories for '{query}':\n\n"
        for i, memory in enumerate(memories, 1):
            content = memory.get('content', 'No content')
            score = memory.get('score', 'N/A')
            result += f"{i}. (Score: {score}) {content}\n"

        return result

Exposing memory retrieval as a tool gives the agent agency over its memory. Rather than automatically retrieving memories for every query, the agent can decide when memory context would be helpful. For instance, when asked about preferences, the agent might search for "user preferences" explicitly, or when solving a problem, it might search for similar problems from past conversations.

Context Window Management

LlamaIndex's context management is important when dealing with large documents or extensive conversation histories. The memory manager handles this by implementing a sliding window approach:

conversations = self.memory_client.get_last_k_turns(
    memory_id=self.memory_config.memory_id,
    actor_id=actor_id,
    session_id=session_id,
    k=100  # Retrieve up to 100 previous conversation turns
)

The get_last_k_turns API efficiently retrieves recent conversation history without overwhelming the context window. This matters when using LlamaIndex with document-heavy workflows where context space is at a premium.

Deploying the Agent

The deployment process leverages the AgentCore Starter Toolkit's streamlined workflow. First, I configure the agent:

agentcore configure -n llamaindexagent -e main.py

When prompted, I accept the default values. This creates the necessary AWS infrastructure including IAM roles, ECR repositories, and Lambda function configurations.

Local Testing

Before deploying to the cloud, I always test locally to verify everything works correctly:

agentcore launch --local

This starts a containerized version of the agent running on my local machine. The local environment exactly mirrors the production environment—same container, same runtime, same memory access.

Testing the local deployment:

agentcore invoke --local '{"prompt": "What did I say about fruit?"}'

The agent retrieves the sample memory we added during setup and responds appropriately. I can then test more complex scenarios:

agentcore invoke --local '{"prompt": "Search Wikipedia for information about apples and tell me if I would enjoy apple-based dishes"}'

This tests both the Wikipedia tool integration and memory retrieval, demonstrating how LlamaIndex agents can combine external data sources with personal context.

Production Deployment

Once satisfied with local testing, deploying to AWS is straightforward:

agentcore launch

AgentCore handles all the complex deployment tasks automatically. It builds the container image, pushes it to Amazon ECR, creates the Lambda function with proper networking configuration, sets up API Gateway endpoints, configures IAM permissions with least privilege access, and enables comprehensive CloudWatch logging.

After deployment completes, I check the status:

agentcore status

This provides the endpoint ARN, CloudWatch log group, and other deployment details I need for monitoring and debugging.

Cleaning Up Resources

To delete the resources created by agentcore launch, I use the agentcore command in the Python virtual event:

agentcore destroy

This command deletes the AgentCore agent, the ECR images, the CodeBuild project, and the IAM roles used by the agent and by CodeBuild.

To delete the memory, including all stored events, the strategies, and the memories extracted form the events, I lookup the memory ID in the ../config/memory-config.json file and use the AWS CLI:

aws bedrock-agentcore-control delete-memory --memory-id <MEMORY_ID>

What's Next

This LlamaIndex implementation demonstrates how to build data-centric agents with RAG capabilities while maintaining production-grade memory persistence. The framework's modular architecture and extensive tool ecosystem make it ideal for agents that need to process diverse data sources while maintaining conversational context.

In the next article, I'll explore LangGraph, showing how to build stateful, graph-based agent workflows using the same AgentCore infrastructure. You'll see how LangGraph's unique approach to agent orchestration enables complex, multi-step reasoning while leveraging our shared memory architecture.

The complete code is available on GitHub. I encourage you to experiment with the implementation and explore how LlamaIndex's document processing capabilities can enhance your agent's knowledge base beyond just conversation memory.

Ready to build your own data-powered AI agent? Clone the repository and start exploring the possibilities!

Building Production-Ready AI Agents with Pydantic AI and Amazon Bedrock AgentCore

Danilo Poccia — Mon, 15 Sep 2025 13:07:17 +0000

Building Production-Ready AI Agents with Pydantic AI and Amazon Bedrock AgentCore

In this third deep dive of our multi-framework series, I'll show you how to build type-safe AI agents using Pydantic AI and deploy them with Amazon Bedrock AgentCore. The complete code for this implementation, along with examples for other frameworks, is available on GitHub at agentcore-multi-framework-examples.

Pydantic AI brings the power of Python most popular validation library to AI agent development. If you've ever struggled with unpredictable LLM outputs or spent hours debugging type mismatches in your agent code, Pydantic AI offers a refreshing solution. It enforces structure where AI tends to be chaotic, providing type safety, automatic validation, and clear data contracts.

Pydantic AI has a minimalist approach that simplifies implementations and works well for production deployments. Unlike the multi-agent orchestration of CrewAI or the model-based extensibility of Strands, Pydantic AI focuses on doing one thing exceptionally well: ensuring your agent's inputs and outputs are exactly what you expect them to be. This predictability is invaluable when building systems that other services depend on.

While Pydantic AI excels at structured outputs through Pydantic models, automatic validation of responses, and built-in error handling for type mismatches, production deployments also need to follow security and scalability best practices. Let me show you how the integration with AgentCore enables persistent memory and scalable deployments without adding complexity to an agent code.

Setting Up the Development Environment

Let's start by setting up the Pydantic AI project. If you haven't already cloned the repository:

git clone https://github.com/danilop/agentcore-multi-framework-examples.git
cd agentcore-multi-framework-examples

Now let's set up the Pydantic AI project. I'm using uv, a fast Python package installer, to manage dependencies:

cd agentcore-pydantic-ai
uv sync
source .venv/bin/activate

The project dependencies are remarkably lean:

pydantic-ai-slim[bedrock]: The core framework with Amazon Bedrock support
bedrock-agentcore: The SDK for integrating with AgentCore services
bedrock-agentcore-starter-toolkit: CLI tools for deployment

Notice I'm using pydantic-ai-slim rather than the full package. This gives me just the agent functionality without additional dependencies I don't need for this implementation.

Creating and Configuring AgentCore Memory

Before building our agent, let's set up AgentCore Memory. If you've already done this for previous examples, you can skip the creation steps and just copy the configuration file.

Quick Memory Setup

For those who haven't set up memory yet:

cd ../scripts
uv sync
uv run create-memory
uv run add-sample-memory
cd ../agentcore-pydantic-ai
cp ../config/memory-config.json .

The memory system provides three strategies that automatically extract insights from conversations: User Preferences (behavioral patterns), Semantic Facts (domain knowledge), and Session Summaries (conversation overviews). These work together to give our agents persistent memory across sessions.

Building the Type-Safe Agent

Pydantic AI approach is refreshingly straightforward. Instead of complex configurations or multiple files, everything centers around the Agent class with its clean, functional API.

Understanding Message History

One of Pydantic AI key features is its built-in message history management. Unlike other frameworks where you might manage conversation context manually, Pydantic AI provides a structured way to maintain conversation continuity:

from pydantic_ai import Agent

# Session state tracking (minimal global state)  
session_message_history = {}  # Dict[session_id, List[ModelMessage]]

@app.entrypoint
def invoke(payload: Dict[str, Any], context: Optional[RequestContext] = None) -> str:
    """Main entrypoint with refactored memory functionality using AgentMemoryManager."""

    prompt = payload.get('prompt')
    actor_id = payload.get("actor_id", DEFAULT_ACTOR_ID)

    # Get session_id from context (AgentCore automatically provides this)
    session_id = context.session_id if context and context.session_id else payload.get("session_id", DEFAULT_SESSION_ID)

    # Get or initialize message history for this session
    if session_id not in session_message_history:
        session_message_history[session_id] = []

    current_message_history = session_message_history[session_id]

The message history is a list of ModelMessage objects that Pydantic AI uses internally. By maintaining this history and passing it to the agent, we make the conversation flows naturally across multiple invocations.

AgentCore Memory Integration

Combining Pydantic AI message history with AgentCore long-term memory provides different levels of memory. Let me show you how the integration can be implemented:

# Get memory context to add to user prompt
memory_context = memory_manager.get_memory_context(
    user_input=prompt or "",
    actor_id=actor_id,
    session_id=session_id
)

# Create enhanced user prompt with memory context
enhanced_prompt = prompt or "Hello"
if memory_context:
    enhanced_prompt = f"{memory_context}\n\nUser: {enhanced_prompt}"
    logger.info("Added memory context to user prompt")

The MemoryManager (the same class we used in Strands Agents and CrewAI) retrieves relevant memories from past sessions. Here's how it works internally:

def get_memory_context(self, user_input: str, actor_id: str, session_id: str) -> str:
    """Get memory context as a string to be added to user input."""
    session_key = f"{actor_id}:{session_id}"
    context_parts = []

    # Load conversation history on first invocation
    if not self._initialized_sessions.get(session_key, False):
        conversations = self.memory_client.get_last_k_turns(
            memory_id=self.memory_config.memory_id,
            actor_id=actor_id,
            session_id=session_id,
            k=self.max_conversation_turns
        )

        if conversations:
            # Format as conversation history
            context_messages = []
            for turn in reversed(conversations):
                for message in turn:
                    context_messages.append(f"{message['role']}: {message['content']}")
            context_parts.append(f"Recent conversation:\n{'\n'.join(context_messages)}")

        self._initialized_sessions[session_key] = True

    # Retrieve semantically relevant memories
    if user_input:
        memories = self.memory_client.retrieve_memories(
            memory_id=self.memory_config.memory_id,
            namespace=f"/actor/{actor_id}/",
            query=user_input
        )

        if memories:
            memory_text = self._format_memories(memories)
            context_parts.append(f"Relevant long-term memory:\n{memory_text}")

    return "\n\n".join(context_parts)

This integration provides three levels of memory:

Session Message History (managed by Pydantic AI): Maintains conversation flow within the current session
Conversation History (via AgentCore get_last_k_turns): Loads previous conversations when a session starts
Semantic Memory (via AgentCore RetrieveMemories): Searches across all memories for relevant context

By prepending this context to the user's prompt, the agent has access to historical information even in a new session, enabling truly persistent conversations.

Running the Agent

With memory context prepared, I create and run the agent:

# Create agent with base instructions
agent = Agent(MODEL, instructions='Be concise, reply with one sentence.')

# Run the agent with enhanced prompt and message history
result = agent.run_sync(enhanced_prompt, message_history=current_message_history)

# The result object contains:
# - result.output: The agent's text response (what we return to the user)
# - result.all_messages(): Complete message history including this interaction

The run_sync method executes the agent synchronously, which is perfect for our serverless deployment model. The message_history parameter provides the agent with context from earlier in the conversation. The method returns a result object where result.output contains the agent's text response that we'll return to the user.

Storing New Messages

After the agent responds, I need to detect and store new messages in AgentCore Memory:

# Get all messages including the new interaction
all_messages_after = result.all_messages()

# Detect new messages by comparing counts
previous_message_count = len(current_message_history)
new_messages = all_messages_after[previous_message_count:]

if new_messages:
    logger.info(f"Storing {len(new_messages)} new messages in memory")
    store_pydantic_messages_in_memory(new_messages, memory_manager, actor_id, session_id)

# Update session message history (keep last NUM_MESSAGES)
session_message_history[session_id] = all_messages_after[-NUM_MESSAGES:]

This approach efficiently detects which messages are new by comparing message counts before and after the agent run. The store_pydantic_messages_in_memory function handles Pydantic AI message format and uses the store_conversation method to store them in AgentCore Memory.

When store_conversation calls the create_event API internally, it not only stores the raw conversation but also triggers the memory strategies to automatically extract user preferences, semantic facts, and generate session summaries. This is how our agent builds long-term knowledge from every interaction.

The NUM_MESSAGES limit (set to 30) prevents the message history from growing unbounded. This is important for managing token limits and ensuring consistent performance.

Integrating with AgentCore Runtime

The integration with AgentCore Runtime follows the same pattern as our other frameworks, but Pydantic AI simplicity makes it particularly elegant:

from bedrock_agentcore import BedrockAgentCoreApp
from bedrock_agentcore.runtime.context import RequestContext
from pydantic_ai import Agent

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload: Dict[str, Any], context: Optional[RequestContext] = None) -> str:
    """Main entrypoint with refactored memory functionality using AgentMemoryManager."""

    prompt = payload.get('prompt')
    actor_id = payload.get("actor_id", DEFAULT_ACTOR_ID)
    session_id = context.session_id if context and context.session_id else payload.get("session_id", DEFAULT_SESSION_ID)

    # Get or initialize message history for this session
    if session_id not in session_message_history:
        session_message_history[session_id] = []

    current_message_history = session_message_history[session_id]

    # Get memory context to enhance the prompt
    memory_context = memory_manager.get_memory_context(
        user_input=prompt or "",
        actor_id=actor_id,
        session_id=session_id
    )

    # Create enhanced prompt with memory context
    enhanced_prompt = prompt or "Hello"
    if memory_context:
        enhanced_prompt = f"{memory_context}\n\nUser: {enhanced_prompt}"

    # Create and run the agent
    agent = Agent(MODEL, instructions='Be concise, reply with one sentence.')
    result = agent.run_sync(enhanced_prompt, message_history=current_message_history)

    # Store new messages in memory
    new_messages = result.all_messages()[len(current_message_history):]
    if new_messages:
        store_pydantic_messages_in_memory(new_messages, memory_manager, actor_id, session_id)

    # Update session message history
    session_message_history[session_id] = result.all_messages()[-NUM_MESSAGES:]

    # Return the agent's string output
    return result.output

if __name__ == "__main__":
    app.run()

The BedrockAgentCoreApp provides all the infrastructure scaffolding, while the @entrypoint decorator marks our handler function.

The key line is result = agent.run_sync(enhanced_prompt, message_history=current_message_history) where the Pydantic AI agent processes the request. The run_sync() method returns a result object, and we return result.output which contains the agent's text response as a string. AgentCore Runtime automatically wraps this string in the appropriate HTTP response structure for API Gateway, so you don't need to worry about response formatting—just return the text content you want to send back to the user.

This implementation is simple—there's no complex class hierarchy, no multiple configuration files, just a straightforward function that processes requests and returns responses. This aligns perfectly with Pydantic AI approach of minimalism and clarity.

The RequestContext automatically provides session management:

# Get session_id from context (AgentCore automatically provides this)
session_id = context.session_id if context and context.session_id else payload.get("session_id", DEFAULT_SESSION_ID)
logger.info(f"Using session_id from context: {session_id}")

Session isolation is implemented at infrastructure level—each user gets their own session that can persist across invocations but remains completely isolated from other users.

When deployed, AgentCore Runtime handles:

Request routing: API Gateway forwards requests to your Lambda function
Session management: Automatic session tracking and isolation
Scaling: Lambda automatically scales based on request volume
Error handling: Built-in retry logic and error reporting
Logging: CloudWatch integration for monitoring and debugging

Your Pydantic AI agent doesn't need to know about any of this infrastructure—it just focuses on processing messages and returning responses.

Memory Manager Architecture

The Pydantic AI implementation uses the store_conversation() method from the shared memory module, with framework-specific message conversion logic in the main application file:

def convert_pydantic_messages_for_storage(messages: List[Any]) -> List[tuple]:
    """Convert Pydantic AI message objects to memory storage format."""
    messages_to_store = []

    for msg in messages:
        # Handle Pydantic AI ModelMessage objects
        if hasattr(msg, 'parts') and msg.parts:
            for part in msg.parts:
                if hasattr(part, 'content'):
                    content = part.content

                    if hasattr(msg, 'kind'):
                        if msg.kind == 'request':
                            role = 'USER' if part.part_kind == 'user-prompt' else 'SYSTEM'
                        elif msg.kind == 'response':
                            role = 'ASSISTANT'
                        else:
                            role = 'ASSISTANT'

                    message_tuple = (content, role)
                    messages_to_store.append(message_tuple)

    return messages_to_store

def store_pydantic_messages_in_memory(new_messages, memory_manager, actor_id, session_id):
    """Store Pydantic AI messages using the unified store_conversation method."""
    messages_to_store = convert_pydantic_messages_for_storage(new_messages)

    # Store each message pair using store_conversation
    for i in range(0, len(messages_to_store), 2):
        if i + 1 < len(messages_to_store):
            user_content = messages_to_store[i][0]
            assistant_content = messages_to_store[i + 1][0]

            memory_manager.store_conversation(
                user_input=user_content,
                response=assistant_content,
                actor_id=actor_id,
                session_id=session_id
            )

This approach keeps the shared memory module clean and unified while handling Pydantic AI message format in the application-specific code.

Shared Memory Architecture

Like the other frameworks in this series, Pydantic AI uses the same unified memory management module. This architectural decision allows complete portability of memories across frameworks.

The shared memory.py module provides:

Core Components:

MemoryConfig class for centralized configuration management
MemoryManager class with framework-agnostic memory operations
retrieve_memories_for_actor() function for semantic search
format_memory_context() function for consistent formatting

Unified Interface:
All frameworks use the same store_conversation() method, with framework-specific message conversion handled in the application code:

def _convert_messages_for_storage(self, messages: List[Any]) -> List[Tuple[str, str]]:
    """Convert Pydantic AI messages to AgentCore format."""
    converted = []
    for msg in messages:
        if hasattr(msg, 'role') and hasattr(msg, 'content'):
            role = 'USER' if msg.role == 'user' else 'ASSISTANT'
            converted.append((str(msg.content), role))
    return converted

This demonstrates how the shared architecture adapts to each framework's specific needs while maintaining the same core functionality. Memories created by a Pydantic AI agent can be accessed by agents built with Strands, CrewAI, or any other framework in the series.

Testing Locally

Before deploying to production, let's test locally. Configure the agent for AgentCore:

agentcore configure -n pydanticaiagent -e main.py

Press Enter to accept all defaults. This creates the necessary AWS resources.

Launch the agent locally:

agentcore launch --local

Test it with:

agentcore invoke --local '{ "prompt": "What did I say about fruit?" }'

The agent should retrieve the sample memory about fruit preferences and respond concisely. Notice how the response is indeed one sentence, following the instruction we provided.

Test conversation continuity:

agentcore invoke --local '{ "prompt": "Tell me more about my preferences" }'

The agent maintains context from the previous message and can elaborate on your preferences while staying concise.

Deploying to Production

Once local testing is complete, deploy to AWS:

agentcore launch

AgentCore Runtime handles all the deployment complexity:

Building and pushing container images to Amazon ECR
Creating AWS Lambda functions with proper configuration
Setting up API Gateway endpoints
Configuring IAM permissions
Enabling CloudWatch logging

Check your deployment status:

agentcore status

Test the production deployment:

agentcore invoke '{ "prompt": "What did I say about fruit?" }'

Monitor logs in real-time:

aws logs tail /aws/bedrock-agentcore/runtimes/<AGENT_ID-ENDPOINT_ID> --follow

Cleaning Up Resources

To delete the resources created by agentcore launch, I use the agentcore command in the Python virtual event:

agentcore destroy

This command deletes the AgentCore agent, the ECR images, the CodeBuild project, and the IAM roles used by the agent and by CodeBuild.

To delete the memory, including all stored events, the strategies, and the memories extracted form the events, I lookup the memory ID in the ../config/memory-config.json file and use the AWS CLI:

aws bedrock-agentcore-control delete-memory --memory-id <MEMORY_ID>

What's Next

This Pydantic AI implementation demonstrates how type safety and simplicity can coexist in production AI systems. The minimal API surface area makes the code easy to understand and maintain, while the integration with AgentCore Memory provides the persistence needed for meaningful conversations.

What strikes me most about Pydantic AI is how it manages to be both simple and powerful. There's no magic, no hidden complexity—just clean Python code with predictable behavior. This predictability becomes invaluable as your system grows and other services start depending on your agent's outputs.

In the next article, I'll explore LlamaIndex, showing how to build agents with advanced retrieval capabilities and knowledge management. You'll see how the same memory architecture and deployment patterns adapt to a framework designed for working with large document collections.

The complete code is available on GitHub. I encourage you to experiment with adding Pydantic models for structured outputs, implementing custom validators, or building agents that return complex data types.

Ready to build your own type-safe AI agent? Clone the repo and start experimenting!