
Kazuya


AWS re:Invent 2025 - Build autonomous code improvement agents with Amazon Nova 2 Lite (AIM429)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Build autonomous code improvement agents with Amazon Nova 2 Lite (AIM429)

In this video, Pat and Gene from AWS demonstrate building autonomous coding agents using Amazon Nova 2 Lite, a large context window reasoning model announced that day. They walk through practical implementation using the Strands framework and GitHub's Model Context Protocol (MCP) server to create agents that analyze GitHub issues and generate code changes. Key topics include optimizing context management by using repository file structures instead of dumping entire codebases, implementing multi-agent architectures with separate planning and acting agents, managing tool configurations to avoid overwhelming the model with 40+ tools, and creating secure code execution environments using Docker containers. They demonstrate reasoning budget settings (low, medium, high) and show how proper system prompts with core mandates, tool usage guidelines, and error handling significantly improve agent performance while reducing token consumption and costs.


Note: This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction to Amazon Nova 2 Lite and Workshop Overview

Welcome. We are here today to talk about Amazon Nova 2 Lite. How many folks here have tried out Nova 2 Lite since its launch, less than 7 hours ago? OK, you didn't leave the keynote and do it immediately. This might be the first workshop where we do that. I'm Pat. I'm a solutions architect with the WWSO Nova go-to-market team, and I've been working for the last year with Gene on Nova. My name is Gene, and I'm on the Amazon AGI team. I am a senior solutions architect, so happy to be here with you all today. I'm glad the badges work too, so congratulations on getting in.

We're going to cover a lot today, and we have a rough agenda, but if we're doing this the right way, we're going to be doing a lot of coding. Unfortunately, some of it will be mostly code review given the difficulties here, but we're going to get through a lot. When Gene and I were first planning this, we had a very tangible, bite-size idea, and we quickly realized this is a huge problem. Solving this at scale and building a fully autonomous coding agent has a lot of nuts and bolts, and we're going to talk about all of that.

Thumbnail 50

I want to set the right expectations here that we may not get through all of it, but what we want you to see is how you can prompt Nova 2 Lite, the major considerations to make if you're going to do this at scale, and any of the stuff Gene's going through or I'm going through, make sure you ask questions. We want everyone engaged. I know it's a large audience, but that's always the best way to do this. A question: how many people have actually done a code talk before? Are you familiar with the format? The goal for this is we're going to be in the code and we're going to be very interactive. We want to be talking. We want to hear the questions that you have as you have them.

This is not a ten-minute Q&A at the end. This is a conversation we want to have with everybody, so it might be a little bit different than all the breakout sessions you've attended today. It's really scrappy. We're going to be in the code and we're going to be talking about the model. I have Leon and Cesar who I can call on, which is great. Does anyone else want to be called on? I figured that was the answer. We'll get you talking in a second. Don't worry.

So let's talk about Nova 2 Lite. Nova 2 Lite was announced today. It's a large context window reasoning model. It comes with built-in support for our system tools. You'll get to see some of that at work here. It's our next generation of Lite. We've also announced Pro 2 and Omni that are on the horizon in preview, but today we're using Nova 2 Lite exclusively and we're going to walk you through how to use reasoning, how to set budgets, and how that impacts code generation.

Thumbnail 150

Thumbnail 190

Where Amazon Nova 2 Lite fits in: like I said, it's got a 1 million token context window, and it's got 64,000 tokens of output, so if you are using reasoning, you will use some of that. The output of the model we've seen perform really well. It supports all the languages we've supported thus far, and we're seeing it do really well at translation tasks. You can use it to fine-tune, and we'll talk about why you might want to do that for code generation use cases. We'll go ahead and get started in the code. Gene's going to get started on building out this coding agent. Feel free to follow along in your notebooks if you want.
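As a rough sketch of how a reasoning budget might be wired up, the helper below builds the extra request fields you would hand to a model client. To be clear, the field names (`reasoningConfig`, `budget`) and the model id in the comment are assumptions for illustration, not the documented Nova 2 API; check the Bedrock model catalog and Nova 2 documentation for the real identifiers.

```python
# Hedged sketch: mapping the low/medium/high reasoning budgets mentioned in
# the talk to a hypothetical request-field payload. Field names are assumed.
BUDGETS = ("low", "medium", "high")

def reasoning_fields(budget: str) -> dict:
    """Build extra request fields for a given reasoning budget (assumed shape)."""
    if budget not in BUDGETS:
        raise ValueError(f"budget must be one of {BUDGETS}")
    return {"reasoningConfig": {"type": "enabled", "budget": budget}}

# Hypothetical usage with Strands (`pip install strands-agents`):
# from strands.models import BedrockModel
# model = BedrockModel(
#     model_id="us.amazon.nova-2-lite-v1:0",            # assumed model id
#     additional_request_fields=reasoning_fields("medium"),
# )
```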

Thumbnail 210

Setting Up the Coding Agent with MCP and GitHub Integration

You want to switch me over? OK, nice. We'll walk through some of that today. Another question: this can just be a show of hands. Is everyone familiar with Strands? OK, some people are. I'll give a little bit of an overview, but I'll make sure to cover what's important here. What about MCP, Model Context Protocol? Nice. That's a good amount of the group. Has anyone built an agent using an MCP server before? OK, pretty good. So we're going to be using MCP today as well to talk through this.

Thumbnail 280

Thumbnail 290

To set the scene for where we're going to start and how we're going to build, we are going to have a GitHub issue that we're working with. The problem is a little bit meta. [An attendee asks a question.] Yes, I absolutely can. I am in a repository, langchain-aws. So for LangChain, we have our own dedicated sub-repository for the AWS integration.

Thumbnail 310

A new model was released, and we want to make sure we're integrated well and that we support all the new features of the model. This is the meta problem we're going to talk through today. We're starting with an actual GitHub issue in a real repository. We're not building an entire application from scratch. We're working in a real repository with real code that serves a real business function, and we're going to iterate on what it looks like to actually interact with that codebase. How do we get the model to understand the context of the space it's working in? How do we get it to solve real business problems in a space that has been around for a while with real code?

Thumbnail 350

I'm going to be using Strands, which is a high-level framework. If you're not familiar with it, it's similar to LangChain, except it's AWS's version and has really great integration with AWS services and tools. This means that our models and other things we've been building on AWS all play very nicely together. I'm going to use that today to supercharge and get started with building this agent. The nice thing about agentic frameworks like this is that they do a lot of the code pieces for you, so I can get an agent going in a couple of lines of code.
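To make "an agent in a couple of lines of code" concrete, here is a minimal sketch using the Strands Agents SDK (`pip install strands-agents`). The Nova 2 Lite model id is an assumption; the imports are deferred inside the function so the sketch can be read and tested without Strands installed or AWS credentials configured.

```python
# Minimal Strands agent sketch, as described above. Model id is assumed.
SYSTEM_PROMPT = "You are a coding assistant working on the langchain-aws repository."

def make_agent(model_id: str = "us.amazon.nova-2-lite-v1:0"):
    # Deferred imports: requires `pip install strands-agents` and AWS credentials.
    from strands import Agent
    from strands.models import BedrockModel

    return Agent(model=BedrockModel(model_id=model_id), system_prompt=SYSTEM_PROMPT)

# agent = make_agent()
# agent("Look at this issue and comment with your implementation plan.")
```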

Thumbnail 420

Thumbnail 430

Thumbnail 440

What I want to start with is just what happens if I prompt the model with something like: look at this issue and comment on the issue with your implementation plan. This is extremely vague. I'm not giving any context at all. All it has is the GitHub MCP client. We're going to run this and see what happens. I have some boilerplate code that should make this a little bit nicer, and I'm going to zoom in because I'm sure you can't see it clearly. It's not as pretty when I do that, I apologize. What's happening here is the model is starting to call tools through the GitHub MCP client.

Thumbnail 450

Thumbnail 460

Thumbnail 480

Thumbnail 490

Let me open my Python environment to show you what the model is seeing behind the scenes. I'm going to import the client and initialize it in a Python execution. The goal here is to show you what the model is seeing behind the scenes. I put in a one-line prompt and initialized a GitHub MCP client. GitHub provides a couple of ways you can use their MCP server. We're using their remote URL, and we're going to take 15 seconds to explain what the MCP is. Yes, absolutely. MCP stands for Model Context Protocol. It provides a uniform, standardized way to interact with arbitrary servers and things like that, and it allows agents to build with them very easily.

Instead of having to go and implement every single API that's going to interact with GitHub, you can use their GitHub MCP server. Now I can get issues, make changes to files, create a branch. I don't have to implement any of that myself. They've created this standardized interface for me. GitHub released their own MCP, and many companies are doing the same. We're going to see how you can use it and how you cannot use it, but it's going to be one of the first tools you're going to want to consider when you're rolling out a coding agent of any sort. AWS also has MCP servers for all the announcements today, so if you're building agents that need to use the latest and greatest, you can make a tool to look that stuff up for you.
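Connecting to GitHub's remote MCP server from Strands looks roughly like the sketch below. The endpoint URL is GitHub's published remote MCP endpoint; the `MCPClient`/`streamablehttp_client` usage follows the Strands MCP documentation, but treat the details as assumptions and check the current SDK docs. As the speakers stress later, scope the GitHub token narrowly.

```python
import os

GITHUB_MCP_URL = "https://api.githubcopilot.com/mcp/"  # GitHub's remote MCP endpoint

def github_mcp_tool_names():
    """Connect to GitHub's remote MCP server and list its tool names."""
    # Deferred imports: requires `pip install strands-agents mcp`.
    from mcp.client.streamable_http import streamablehttp_client
    from strands.tools.mcp import MCPClient

    token = os.environ["GITHUB_TOKEN"]  # be smart about this token's permissions!
    client = MCPClient(lambda: streamablehttp_client(
        GITHUB_MCP_URL, headers={"Authorization": f"Bearer {token}"}))
    with client:
        return [tool.tool_name for tool in client.list_tools_sync()]
```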

Thumbnail 540

When I interact with the GitHub MCP client, it takes me a couple of lines of code, and this is all of the tools that become available to me. I can get a comment, add a comment to a pending review, get team members, get a commit, assign a co-pilot to an issue. Maybe we don't even want to do this code task today and we'll just say get a co-pilot and go do this for me. You have these standardized ways that you can start interacting with GitHub.

Thumbnail 640

Starting to interact with GitHub is nice and easy to get started with, but here's what's tricky. I did a couple lines of code and now I'm passing my model 40 different tools it has to choose from. That is a lot of context for the model to immediately have and decide what tool to use. There are many things that come into play here where it's important to be intentional. It's a very nice thing, but it's easy to get things a little out of hand just to begin with. Remember, the model gets passed a giant tool config, and the model has to decide which of these tools is the most relevant for what I'm trying to do.

Managing Tool Complexity and Crafting Effective System Prompts

If you give it 5 very well-selected tools, it's going to have a better chance and you're not going to consume as much context window. If you give it 20 or 30, you're really making a big bet on the model's ability to traverse long context and pick the right one. Nova 2 Lite does struggle with this. The consideration here is that once you get above 20 tools, you'll see a lot of guidance for all model families recommending breaking things into sub-agents and exploring multi-agent architectures. It's a really important consideration that you want to make.
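One straightforward mitigation is to curate the tool list down to a handful before handing it to the agent. The sketch below filters by name; the tool names in the allow-list are illustrative, not necessarily the exact names the GitHub MCP server exposes, and the `tool_name` attribute lookup is an assumption about the tool object shape.

```python
# Curate the ~40 MCP tools down to the few this agent actually needs.
# Tool names below are illustrative examples.
ALLOWED = {"get_issue", "add_issue_comment", "get_file_contents",
           "search_code", "create_branch"}

def select_tools(tools, allowed=ALLOWED, name=lambda t: getattr(t, "tool_name", t)):
    """Keep only the tools whose name is in the allow-list."""
    return [t for t in tools if name(t) in allowed]

# agent = Agent(tools=select_tools(mcp_client.list_tools_sync()))
print(select_tools(["get_issue", "assign_copilot_to_issue", "search_code"]))
# -> ['get_issue', 'search_code']
```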

Thumbnail 690

Thumbnail 710

What you definitely need if you're going to do that is a good system prompt to give some guidance. I didn't have any system prompt. I just said go analyze the issue and I'm not going to lie to you, it did okay. It read the issue, it added a comment, it made an implementation plan. Let me pull up the issue and make sure my internet connection is working. Let's see what it said. It said "I'll implement support for this" and made a couple of steps. This is all pretty reasonable, right? With no instructions, just providing the context of the tools, it did something that I think is pretty fair.

What might stand out to you is that this is all very vague. It didn't actually read any of the files. It didn't actually go and start exploring the repository. It read the issue, made a comment, and created a decent plan, but that's not what I want it to do. If I'm going to build a real agent, I want it to look at the files. I want it to actually be interacting with what's real.

Thumbnail 750

Thumbnail 760

So what we can do next is go back to my Strands agent and use this new template. I'm going to take a second here to talk through some best practices with Nova 2 and I think a lot of complex agent prompts that you'll want to build overall. You need more than just a simple prompt. The first line might look pretty familiar. We're providing a role, which you do for every agentic application or any LLM application, but we need to take it a step further.

Thumbnail 810

Thumbnail 840

A couple things that we have trained our models to be good at are these specific sections that you can use to draw attention to specific pieces, workflows, and business processes. The first one is core mandates. This is the important business rules that you need the model to follow. For example, I don't want you to update the original issue. Don't edit it, just add a new comment. Another one is the actual workflow you want the model to follow. I want you to get the issue, but I want you to explore it. I want you to go and look at the files. Nova does really well when we pass the tool names that we want it to use. It really draws attention and links to the tool configuration it's seeing. Then I'm having it return the plan in a specific format. I have output formatting at the end of the prompt.
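A compressed version of the prompt structure described above might look like the template below: role first, then core mandates, a workflow that names the tools, and an output format at the end. The section headings follow the talk; the specific mandates and tool names are illustrative.

```python
# Illustrative system prompt following the structure from the talk:
# role, core mandates, workflow (naming tools), and output format.
SYSTEM_PROMPT = """You are a senior engineer working on the langchain-aws repository.

## Core Mandates
- Never edit the original issue; only add new comments.
- Never push directly to main; work only on a protected branch.

## Workflow
1. Read the issue with `get_issue`.
2. Explore the relevant files with `get_file_contents` before planning.
3. Post your plan as a comment with `add_issue_comment`.

## Output Format
Return the plan as a numbered list of concrete, file-level changes.
"""
```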

Optimizing Tool Usage and Parallel Calling for Better Performance

The next section is tool usage. You might wonder, okay, we're passing it the tool configuration which is going to go to the model. What else would we really have to specify? Tool usage is where we want to put the important business rules for how we call the tools. MCP is great and provides a standardized interface, but how we want to use the tools is going to change based on every single use case that you're going to have. This is where you bring in some of the specific things that are the most important for you.

I always want to make sure that we're reading the issue, and I provide some tips on how to interact with the file paths. I'm also encouraging parallel tool calling: having the model call multiple tools in parallel instead of waiting for one after the other. A question to pause on here for parallelism: is this a parallel invocation of the model occurring when you ask for it, or something else?

What happens is the model will make two tool calls in a single assistant response. Where I would hope to see this come up, if necessary, is in viewing the contents of multiple files at the same time. That's a behavior I'd like to encourage.
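On the wire, a parallel tool call is a single assistant turn carrying two `toolUse` blocks, here shown in the Bedrock Converse message shape, so both file reads can be dispatched at once. The file paths and tool name are illustrative.

```python
# One assistant turn with two toolUse blocks (Bedrock Converse message shape).
# Paths and tool name are illustrative, not from the actual repository.
assistant_turn = {
    "role": "assistant",
    "content": [
        {"toolUse": {"toolUseId": "call-1", "name": "get_file_contents",
                     "input": {"path": "src/chat_model.py"}}},
        {"toolUse": {"toolUseId": "call-2", "name": "get_file_contents",
                     "input": {"path": "tests/test_chat_model.py"}}},
    ],
}
tool_calls = [block["toolUse"]["name"] for block in assistant_turn["content"]]
```

Both tool results then come back in the following user turn, rather than one round trip per file.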

The next thing is around error handling. It's very nice to assume that everything will work perfectly all the time, but when you're using MCP servers and tools in agentic workflows, things break a lot. Something's going to time out or something else might go wrong. So you want to focus on how the model should respond. If there's a timeout, do you want the model to try again? Do you want it to respond gracefully and give up or fail quickly? This is where we start to think about that section.

Thumbnail 970

Thumbnail 980

Thumbnail 990

Thumbnail 1000

Thumbnail 1010

I'm going to now run this. I'm just going to pass it in the system prompt here with no other changes necessary. I'm going to keep the prompt exactly the same. We're going to run it again and look at how the model changes. What I'm hoping to see is that we still read the issue, and I'm hoping the model will start actually looking at the files and exploring the repository. Fortunately, that is what I'm seeing. Something you might notice though is that it's going to be a little bit slower than it was the first time. Why is that? It's going to have to make a lot of tool calls because it doesn't have any context about what this repository looks like at all. It has no idea. So what is it going to do? It's going to guess and then try to start exploring and figuring this out. A lot of models are very good at that, and if I'm doing my own day-to-day job and I want to connect to arbitrary repos, that's fine. But if I'm building a coding agent for a repository I know is mine, we can take advantage of that. We don't have to make the model explore by itself all the time.

Thumbnail 1060

Thumbnail 1080

Thumbnail 1090

Thumbnail 1100

You can see this is just slow, and I don't blame the model at all. It's going to be searching for a repository and exploring and iterating. It's fine. So this is where we get into an optimization opportunity for context. The Nova models, as Pat mentioned earlier, support a 1 million token context length. So I wouldn't blame you if the first thing you thought about was why don't we just dump the entire codebase into the model context. That seems easy, right? I'll pause here because we got the output, but you can see there were 8 tool calls. It did great. It read the issue, explored a lot of different files, added the comments, and we can see here the quality has improved just on our system prompt. We're actually looking at specific code changes we want to see happening. It took a long time to get there, but it got there, and now we have real code that is actually going to be usable and relevant to our own repository. But I want to improve on this because I'm not a very patient person and I think that we could do it better.

Thumbnail 1130

Context Window Management: Testing Full Codebase vs. Selective Context

So, real quick though, any questions on what we just went through? [An attendee asks whether you specify which tools the model gets.] That's a great question, and it's actually going to be one of the optimizations we're going to make a little bit down the line. At first, I am just passing all of the tools. I'm passing all 40 of them, and you can see, as Pat mentioned, the Nova models are tuned for agents. It does well with 40 tools. It made the right tool calls, and I'm happy with the decisions it was making. But that's a lot of tool calls, and the first prompt, where I just have one line with all of the tools, immediately adds over 8,000 tokens in the tool config alone. That's helpful context, but it's a lot and it's unnecessary. So that's one of the steps we'll take: shortcutting that and being really intentional.

Yeah, and the quantity of tools is actually a good place to start if you're trying to now make multiple agents, because then you can compartmentalize your tool use associated with the behavior you want out of the agent. So it's a good place to start if you're moving into multi-agent architectures.

I'm using GitHub's hosted MCP server, which they have available for self-hosting as a Docker container that you can run locally, but they also have one that's available through an API endpoint, which is what I'm using today. I definitely recommend that if you try this at home and you make your GitHub token, be smart about the permissions you put on your GitHub token. I've done some crazy things, so learn from me.

We're going to push to the LangChain repo today, right? Just kidding, we're not going to do it. You got suspicious when I asked you not to ask. Yes, I did. You caught me there. One time I wasn't following our own security best practices and it was starting to push to the main repo, and I was like, no. I made a protected branch and then I was intelligent about it. It's fun to play with this stuff and see what happens. It's fun to break things.

Thumbnail 1280

Tools are an implicit security boundary too. If you don't want an agent to ever do anything associated with a tool, you don't give it the tool. When you're first iterating on things like we're doing here, you'll find those rough edges really quickly. I went a little bit on a tangent there with the tools and everything that was happening. Going back to how we take advantage of the context length, I do want to show what happens when we pass the entire codebase. I think it's a very natural path to take, especially when we have repositories that can handle that much context.

Thumbnail 1350

Thumbnail 1360

Thumbnail 1380

I have a formatted script where I dumped all of the files. I'll call out that I'm using some of our best practices for how to format for long context, so I'm breaking up the documents in a specific structure. One thing I did that wasn't very nice to the model: instead of passing the path as it appears in the GitHub repository, I used my local file paths. So I'm not making it easier, but I'm at least providing some context about the files and the content in the repository. What we're going to try now is putting this in the prompt. I called it codebase. Next, I'm going to say format repo content and we're going to put this at the top, which is another best practice for long context: you'll want to put that content at the top of your prompt. Now we're going to run it and we're going to see what happens. I'm going to call something out that you might pick up on really fast.
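A script like the one being described might look as follows: each file is wrapped in a delimited, labeled block (one long-context best practice), and context-burning files like lockfiles are skipped. The `<document path=...>` wrapper and skip-list are my assumptions about the formatting, not the speakers' exact script.

```python
import pathlib

# Files that burn context without helping (lockfiles, VCS internals).
SKIP_NAMES = {"uv.lock", "poetry.lock"}

def format_repo_content(root: str) -> str:
    """Dump every file in the repo as a delimited, path-labeled document block."""
    root_path = pathlib.Path(root)
    parts = []
    for p in sorted(root_path.rglob("*")):
        if not p.is_file() or p.name in SKIP_NAMES or ".git" in p.parts:
            continue
        rel = p.relative_to(root_path)
        parts.append(f'<document path="{rel}">\n{p.read_text(errors="ignore")}\n</document>')
    return "\n\n".join(parts)
```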

Thumbnail 1420

Thumbnail 1440

One, it is a lot slower. We're passing an absurd number of tokens through. I was smart and ditched the uv.lock and some other files that would have blown right past the context limit. It's going to be about 500,000 tokens, which is on the higher end of our token context length. There are a lot of papers on this topic. We can take advantage of the context length, and we train for long context, but something seen across all models is that performance degrades as the context increases. Long context is usually best when we're looking for something very specific: a needle in a haystack, where I want to search for one thing in a huge amount of context. Models are very good at that. But what's important when we're doing code is semantic connections and how things play together, and you don't get as much of that when you're just dumping everything into the context.

Thumbnail 1470

Thumbnail 1500

The trade-off of having all of it is a little bit tough, right? What we're probably going to see is still very good outputs. It's probably going to call fewer tools, but it's a lot slower and it's going to cost a lot more. We've seen customers take navigating the dependencies of a software project in a whole direction unto itself, building a graph of dependencies so you can look up dependencies using a query.

There are many things that can go into making this specific part of a coding agent more efficient. Hopefully you get the results I'm looking for. We'll notice a couple of things, right? Fewer tool calls, so that's nice. I had to do fewer steps, which maybe saves us a little cost. But what stands out to me immediately is that it didn't add a comment like I asked it to. The performance of the agent decreased with all of that context, and the quality of the code, while still great, shows there is a big trade-off here for agentic workflows. You want to be really intentional about managing the context because of this.

Repository Structure as Context: A Surgical Approach to Code Navigation

What we see a lot with coding agents is that you can run through your context so fast, and you don't want to shortcut that by adding a bunch of context at the beginning. You want to allow the context to build up naturally and have more proactive compaction strategies throughout the agentic workflow. We don't want to handicap the model or put it at a disadvantage from the beginning. How many folks are using coding assistants in their IDEs? You're willing to admit it? That's awesome. I do too.

Thumbnail 1530

What you notice a lot with Cursor, with Continue, with any of these plugins is that they're very surgical about how they look at files. They'll search for something in a file and take the line numbers and put that into the context. Just to highlight what's being explained here, that's for context management. They're getting very specific about the things they put into the context window.

Thumbnail 1630

My proposal to you all, and what you'll see if you really get into the nitty gritty of how a lot of these coding assistants work, is that they will use the repository structure in the context instead. One benefit of the fact that we as humans have been working in codebases for so long is that we try, hopefully, to design them in a way that's intuitive and easy for us to navigate. Now we have the ability to take advantage of that: the codebase is designed for us to be able to navigate it. If you're lucky (I've been at companies where maybe we're not as good about that), you have human-preference-tuned file paths that provide context out of the box for what is probably happening in each file. You can provide that to the model, and it gives it a starting point for where to look.

Thumbnail 1680

What I have now is a separate text file that has all of the file paths in the entire repository. I'm not at this moment providing any additional context to it, but I'm providing the file paths. You can see here that the thing I called out in the issue itself is to add something to the ChatBedrockConverse class. We can see here we have this file path that says bedrock_converse. It probably will give a hint to the model that this might be a really good place for it to start looking.
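Generating that file-path listing is a small script; a minimal version might look like this (my sketch, not the speakers' exact script). The output is one relative path per line, which is the whole "repository map" the model receives.

```python
import pathlib

def list_repo_files(root: str) -> str:
    """Return every file path in the repo, one relative path per line."""
    root_path = pathlib.Path(root)
    return "\n".join(
        str(p.relative_to(root_path))
        for p in sorted(root_path.rglob("*"))
        if p.is_file() and ".git" not in p.parts
    )

# open("file_list.txt", "w").write(list_repo_files("langchain-aws"))
```

Paths like `libs/aws/langchain_aws/chat_models/bedrock_converse.py` then act as the human-readable hints described above.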

Thumbnail 1710

Thumbnail 1720

If we switch this out now and just instead pass this new text file that has the file list and try to run that, we should see a couple of things. Ideally, the Wi-Fi holds up. It's going a bit faster. We should still see a pretty high quality output from the model. This kind of really easy tuning, to be honest, wasn't a very hard script to write. This helps us provide context to the model very easily.

Thumbnail 1730

While that's running, I will show the next step that you can take. You can actually sort the file paths and stuff, and if you're willing to put the time in, this is another way that you can be a little bit more intentional about providing context to the model. A lot of repositories that want to be really easy for agents to work in will provide this kind of stuff in an agents.md file and things like that.

Thumbnail 1750

Thumbnail 1760

Thumbnail 1790

It's being a little slow. I'll blame the Wi-Fi. But so far we've only had two different tools called. We're getting another file's contents, so we'll see what the output looks like. What I'm hoping for is a high quality output with the implementation plan.

One thing to keep in mind because agents in particular are token hogs, right? We know they use a lot of tokens. Codebases will eat through those tokens really fast.

Thumbnail 1820

Thumbnail 1830

The other reason to be conscious about how many tokens you're using in your context window and how good your tools are is because that's a measure of cost in the long run. If you're running this agent for every single issue that gets submitted and it's running autonomously, there's a consideration for cost. We'll see the output tokens here, but it's considerable. We're only doing a pretty straightforward task.

Thumbnail 1860

Thumbnail 1870

Thumbnail 1880

We can see the output now. We had a significant reduction in the tool calls from the first scenario where we provide no context to the repository. We also, in comparison to the one where we're passing the entire code base, see that this is much more reasonable on the input tokens going into the model. The quality of the output we're getting is still very high. We're still going to be making very specific code changes that are targeted for the files that will have the impact we want for the GitHub issue. This is all good when we're thinking about an autonomous coding agent in terms of planning.

We're going to do a long horizon coding task. We want high quality planning at the beginning. If you're using a coding tool, you want to plan and then you act. It's the same when you're building it yourself. We put a lot of time into building a high quality plan. This is only so helpful because at this point nothing's actually writing any code. I still have to go implement the plan. We want to build on this and we want to actually build something that can take actions in the code base.

Transitioning to Multi-Agent Architecture for Coding Tasks

We have two basic options. We can either continue to build on the current agent that we have, or we can expand into a multi-agent implementation strategy. There are a lot of cons to a mono-agent architecture here, and a lot of it comes down to context. We're going to spend all of these tool calls and context just to create the plan, but all that really matters from the plan is its output. If we build on top of the same agent, we're still carrying all of the tokens we used to get to the plan when we act and actually build it.

Thumbnail 1970

Thumbnail 1980

What you'll see in a lot of the coding agents and the direction we want to go now is a multi-agent architecture. We will start using Strands and we're going to implement it using a graph-based approach where we will have a planning agent and then an act agent. Our plan agent will do all the tool calls to look at the files and determine what needs to happen. Then that output will go to our act agent. The act agent will take the implementation plan and it will actually start iterating and writing the code.

Thumbnail 2050

Thumbnail 2060

So the direction we've been going so far, you might assume we're going to use the GitHub MCP. One thing you might pick up on and I'll call out is when we use the GitHub MCP server, every single file, every single tool call where I'm getting the file contents, I am getting the entire file contents from their API. It's a lot and it's quite slow. If we do the create update changes, we're also outputting the entire file through a tool call back to the server, so it's very slow and not necessarily the most practical way to do it. But we'll get to that a little bit later as a teaser. So we'll try it out.

Thumbnail 2070

I have two agents like I said. I'll just show very quickly in Strands if you've done a graph approach before. What you'll do is you'll create this graph builder. You'll add your nodes. That's really it. It's quite easy. Is everyone familiar with the basic agent orchestration patterns? I'll go over those really quick if that helps.
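The graph-builder pattern just described can be sketched as below. The `GraphBuilder` module path and method names follow the Strands multi-agent documentation, but treat them as assumptions and check the current SDK; the import is deferred so the sketch stands on its own.

```python
# Sketch of the plan -> act graph with Strands' GraphBuilder.
def build_plan_act_graph(plan_agent, act_agent):
    # Deferred import: requires `pip install strands-agents`.
    from strands.multiagent import GraphBuilder

    builder = GraphBuilder()
    builder.add_node(plan_agent, "plan")  # explores the repo, writes the plan
    builder.add_node(act_agent, "act")    # implements the plan
    builder.add_edge("plan", "act")       # plan's output feeds the act agent
    builder.set_entry_point("plan")
    return builder.build()

# graph = build_plan_act_graph(plan_agent, act_agent)
# result = graph("Implement the changes described in the GitHub issue.")
```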

So the most ubiquitous pattern right now in agents is the manager agent and the sub-agents. The manager agent dictates when and who it calls to get work done, and it decides when the work is done. There are also graph agents. You can think of it as a left-to-right graph. In our case for coding, every agent is dependent on the previous one (they call them nodes).

Thumbnail 2120

Thumbnail 2130

Thumbnail 2140

Different frameworks have their own terminology. CrewAI has its own terminology, Strands has its own terminology, but the idea is that the agents move left to right, and once they move to the right, they don't come back. You typically have that in Kiro projects, if you've used Kiro. It feels like a left-to-right process, right? You have a requirements doc, then a design doc, then implementation, and then you unleash the agents to do the things. So those are the two most ubiquitous orchestration patterns.

Thumbnail 2150

Thumbnail 2160

Thumbnail 2170

Thumbnail 2180

I'm sorry, I missed that. Is Strands doing like a group chat of the agents? A group chat, that would be cool. No, essentially what's happening in a graph agent orchestration is that when the agent is done, you can send the results of that agent to the next one and then it picks up where it left off. You can send state to it. You can propagate the input from the original all the way to the last agent. But no, they're not commingling and talking about things. That would be more of the manager agent orchestration. You can ask it to do a whiteboarding session, update a markdown file with your ideas, let's take votes based on what you think of what's been proposed, stuff like that.

Thumbnail 2190

Thumbnail 2230

The manager agent says, and this would go in your prompt, right? You'd say: you are a manager agent, you're assessing the work here. You have an implementation agent responsible for this. You have a coding agent responsible for this. And you're relying on the reasoning ability of the model to decide which agent to work with, so you're leaving that completely up to the manager agent.

So I have it running. I'm actually not going to let it run all the way, just because we only have twenty minutes left and we're going to get to some really cool stuff. But the message I want to give here is that we're able to switch into this multi-agent architecture. We still have a disadvantage because it's still really slow. So the path we want to talk through now, as Pat takes over, is how we can make this more locally centric: how do we create an environment where we can actually make code changes and test them, because that's going to be a lot more practical for these kinds of applications.

Thumbnail 2260

Building Code Execution Environments with Docker and Security Considerations

Let's do it. I'll have to show you what I wrote, which I know isn't as exciting, but we've had some issues: my laptop doesn't connect well, so I've neglected this side of the room. I hope you're all getting as much out of this as I hope you are. So let's talk about a couple of quick concepts. When you write code, you never just write it and push it. I mean, some of you might be that good at writing code, but you've got to make sure it actually executes and actually does what it's supposed to do, and we take that for granted as developers. We run the code, we write a test, and if it works, it works. If it doesn't, you write more unit tests and you keep iterating until it's done.

Thumbnail 2370

So how do you get an agent to do that on its own because it doesn't have a workspace? Well, you give it a workspace. What I'm going to talk about now is how to maybe do that with code execution. How can you use Docker to create a workspace for this agent to put some files in it, mount it to the image or the container, run it, see what happens, get the exception, and keep iterating. This is something that we're seeing a lot of folks do. They might have a container image that they've baked for their pipeline that they want to use. Those are great candidates, by the way. We're going to show you a really simple example in Docker. Before I go there, let's take a look at that tool.
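To make that run-observe-iterate loop concrete, here is a minimal sketch that executes a snippet in a fresh interpreter subprocess and retries on failure. A real setup would run inside the Docker workspace instead of a bare subprocess, and the hard-coded second attempt stands in for a fix the model would generate after reading the error:

```python
import subprocess
import sys

def run_snippet(code: str) -> tuple[bool, str]:
    """Execute a code snippet in a fresh interpreter process and return
    (success, stderr). In the real tool this command would run inside
    the mounted Docker workspace rather than on the host directly."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=30,
    )
    return proc.returncode == 0, proc.stderr

# Simulated iterate loop: the first attempt fails, the "agent" reads the
# traceback and produces a fixed attempt. In the real agent, the fix
# comes from the model, not from a hard-coded list.
attempts = ["print(1 / 0)", "print(1 / 1)"]
for attempt in attempts:
    ok, err = run_snippet(attempt)
    if ok:
        break
```

The key point is that the stderr text is fed back into the agent's context, which is what lets it "see what happens, get the exception, and keep iterating."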


I called it the code execution tool, by the way. In Strands, as long as you're following the rules, anything you put in the tools directory can be picked up by the agent without question, right? That's good and bad, because if you've got hundreds of tools, that gets to be a problem, but if you're building agents for the first time, it's a good way to get started. Code execution. I'll walk through a few things here. What I've done is hard-code a default workspace path, and I'm going to tell the agent: put all your work in this Docker workspace. If you're going to clone anything, if you're going to write any code, make sure it goes in there, and put your tests somewhere else, so we're keeping these things separate. File system security and network security are also a thing, right? You don't want to deploy this and let it ping the Internet, pulling down any library it wants that's completely unvetted by your security team.

These are all super important considerations to make. Anthropic released Sandbox Runtime in late October. I would highly suggest taking a look at it; they've offered a great tool for solving this in a very thoughtful way. You can send all your traffic to a proxy if you want. If you don't want this agent doing anything else, you can lock down the files or directories it has access to. We're not going to get into that today, but I wanted to put that out there. There's also Amazon Bedrock AgentCore with its code interpreter, which is designed specifically for this use case.

Thumbnail 2450

Thumbnail 2470

My execute-Docker tool, just like all tools, needs good descriptions. Remember the context: all of that goes into the context. When Strands picks up the tool, it's picking up the description to help the model decide what to do. I got a little carried away here, but it's important to remember that I'm sending back a return error if I don't have the right information and dependencies. So how are you going to tell the agent if it needs to do a pip install or get the environment ready before it executes the code? You need to tell it where to put things. Requirements.txt is going to be really important. I'm using Python, so of course if you're using Node or some other language, that's going to involve any number of other steps, as we know. But here we just have a simple Docker command.

Thumbnail 2500

If I have all the requirements, I'm going to include the requirements doc. If I don't, then I don't. But I've pulled down a container image. I'm telling it the workspace to use, and I'm going to pip install everything before any code runs, so I have parity between my test environment and my prod environment. You can imagine you probably wouldn't do this for real; you'd probably have a hardened container. But we're just moving fast here. Any questions on this approach? Any worries about doing things this way?

Thumbnail 2550

Yeah, go ahead. The question was how do you prevent the agent from downloading packages that are either not approved or not from your artifact repository, which is what most enterprises require. That's where you'd want to give a lot of thought to locking down which addresses it's allowed to use, and that's why using a proxy is good. Sandbox Runtime lets you say only allow traffic to this domain; if it's going anywhere, it has to go through this address. That's one way. You could also create a separate tool or a separate agent to prepare the dependencies, do it in a secured manner, get the trust of your security department, and maybe they manage it. There are a lot of angles you could take, but it's worth thinking about.
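One simple way to approximate that lockdown in code is an allowlist check on the package index host before any install command is assembled. This is a rough stand-in for what a proxy or Sandbox Runtime policy would enforce at the network layer; the host name and helper functions below are hypothetical:

```python
from urllib.parse import urlparse

# Hypothetical internal artifact mirror approved by the security team.
ALLOWED_HOSTS = {"artifacts.example.internal"}

def vet_index_url(index_url: str) -> bool:
    """Return True only if the pip index URL points at an approved host."""
    return urlparse(index_url).hostname in ALLOWED_HOSTS

def pip_install_args(package: str, index_url: str) -> list[str]:
    """Build a pip install command, refusing unapproved index hosts."""
    if not vet_index_url(index_url):
        raise ValueError(f"index host not approved: {index_url}")
    return ["pip", "install", "--index-url", index_url, package]
```

In practice you would enforce this at the network boundary rather than trusting the tool code, but it illustrates the shape of the policy.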

She asked, you know, it's just in Docker, it's probably not going to go anywhere. No, it probably won't. But Docker is leveraging my network hardware, right, so we get to hand-wave that. If you're launching or spawning a container in an ECS cluster, that's going to be managed for good reason. But you wouldn't want it reaching out through your NAT gateway and pip installing whatever it feels like, because I can tell you it'll end poorly. Did you have a question? Cool.

Thumbnail 2660

Thumbnail 2670

Thumbnail 2680

This executes, runs the Docker command, and then returns. I'm sorry, we had this working earlier, but we can't get this laptop working. It returns errors if it doesn't work. Okay, so code execution. I'll show you the Dockerfile I've written for this, if that helps clarify. It's roughly what you would expect: you're setting the work directory to the Docker workspace. If your code has to create and persist things, that's a consideration too. If its task is to output more code or artifacts you need for some other process, telling it where to put them is also important. Go put it in S3. You could put it locally, but then when the container dies you lose it. So these are the same considerations.

Thumbnail 2730

Thumbnail 2740

Thumbnail 2750

These are the same considerations you'd make in any serverless architecture. In our case, we have a very simple Dockerfile. I'm going to show you how we've proposed building a multi-agent system that does this. For this very trivial coding agent example, we're just using the Docker runtime on your desktop. However, you might want to host it on AWS using ECS, Fargate, or whatever your favorite container platform is.

Remember that when the agents are running in a CI/CD pipeline or in GitHub Actions, you need to orchestrate where that compute and runtime occurs. When I think about practical applications of this use case, we have coding agents that run locally all the time. So the question is, why would you want to do this? I see the benefit here, actually running in a GitHub Action or similar environments. You want it to be able to spawn in a place that's not local all the time. Obviously, we have one hour, and it would be really cool if we could talk through all of this in one hour, but we're going to try. We have ten minutes left, so I'm going to keep going. Please raise your hand for questions.

Thumbnail 2830

Thumbnail 2840

Thumbnail 2850

Configuring Reasoning Budgets and Enhanced Planning Agent Implementation

I want to show you how to actually make a request to Nova. We have a model helper here where we're wrapping the Strands Bedrock model class and putting Nova 2 Lite in there. For a lot of heavy reasoning tasks, like if you're traversing a big context, you're going to want to use reasoning on high. I mentioned this in the keynote session today. The Nova 2 Lite model we're specifically talking about, along with Pro and Premier that are announced in preview, are extended reasoning models. You can turn reasoning on or off. It defaults to off.

Thumbnail 2860

If you're familiar with other extended reasoning models, you might be used to setting budget tokens or something like that. We do it a little differently: we have reasoning budgets, and we group them into three categories, low, medium, and high. They'll use different amounts of tokens in their reasoning. It's t-shirt sizing you can use to control the amount of reasoning that's done. Don't always assume that more reasoning is better. Reasoning high will use up to 32,000 output tokens, which you can imagine using for really complex problems. It's a really cool thing, but it costs tokens and time. We've actually found that if you use too much reasoning, the outputs can degrade. So that's something to keep in mind.

Thumbnail 2950

These are just different configurations of the same model to make things a little easier. The question was which one we had before; I was just using reasoning medium, so I was using the same model. You can also tune it for the actual task, which is a really good question when we think about multi-agent. Maybe you want more complex reasoning for the initial planning phase and less reasoning for the act or the validation phase. You don't need that much reasoning at the end.

Thumbnail 2980

Thumbnail 2990

Let me talk about where I was with the actual agent here. We understand roughly how the tool would be orchestrated so you can get code execution. In this example, I've taken the liberty of making an enhanced plan agent. One of the bottlenecks we ran into was: how does the model even know what the repository structure looks like? Well, now we're getting into basic file system tools. Any of the IDE tools you work with have capabilities to show you a directory tree, show you the first few files, show you the first 100 rows of a file, and look for anything in a file that contains specific text.

Then expand and give you the previous and following 20 lines. So you very quickly find yourself building tools that the agent can use, because as we said, you can't just dump 1000-line Python files into the context and expect it to perform well. It's going to be expensive and it's just not going to be accurate.

Thumbnail 3050

We show here some file system tools called show directory tree and get directory summary to see how many files are in a directory. We're also using some Git tools. As you move from working with the GitHub MCP to a local workspace, like any developer actually does, you're going to want Git tools. That's scary because you can do dangerous things with Git, but in this case all we're doing is clone. We're just cloning the repo down and then telling the agent to use these tools to explore: do a code analysis, read the issue, and tell me where I should be making changes and how.

Thumbnail 3090

One question worth calling out: we're no longer passing the entire GitHub MCP tool config, all 40-plus tools. We're narrowing it down; you can see we're allowing only issue read and add issue comment. There is still value in using the MCP server, because those tools provide a lot of utility, but the operations that would pull a lot of data through the context benefit from using the local code environment instead. How many of you have built file system tools or anything like that for your agents and gone that far? How has it gone?
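Narrowing the tool set can be as simple as filtering the catalog against an allowlist before handing it to the agent; the tool names below are illustrative stand-ins, not the exact GitHub MCP identifiers:

```python
# Instead of handing the agent all 40+ GitHub MCP tools, pass an
# explicit allowlist of the ones the planning agent actually needs.
ALLOWED_TOOLS = {"issue_read", "add_issue_comment"}

def narrow_tools(all_tools: list[dict]) -> list[dict]:
    """Keep only allowlisted tools, shrinking both the context the model
    sees and the blast radius of a wrong tool call."""
    return [t for t in all_tools if t["name"] in ALLOWED_TOOLS]

# A toy catalog standing in for the full MCP tool listing.
catalog = [{"name": "issue_read"}, {"name": "add_issue_comment"},
           {"name": "delete_repository"}, {"name": "create_release"}]
tools = narrow_tools(catalog)
```

Every tool description you drop is context the model no longer has to read on every turn, which is where much of the token savings comes from.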

I thought about it, so the first step was always clone the repo. I didn't even think about grabbing specific parts because I thought the model wouldn't have any context. Yeah, and another step you could take is what we were talking about before: a lot of repos have a CONTRIBUTING.md. Tell the agent to take a look at that. Until superintelligence is here, all of these agents will be coupled very tightly to the repository they're associated with. That's just inevitable. You're going to have to prompt it on guidelines for contributing. You'll probably have more information on where to find things in the prompts. You're going to have to couple this agent tightly to your repo unless your development practices are completely uniform across all code bases, which is rarely the case.

Thumbnail 3200

Thumbnail 3220

I'm going to get through this quickly. Lastly, I'll show you the agent prompt if that would help. We're creating a Strands agent and we're using reasoning low. You could use high, but high is going to take a while, and it's generally more expensive because you're using more tokens to arrive at the answer. So let's take a look at the prompt. We're at just under four minutes left, so I'll try to go through this quickly. I'll turn on word wrap so we can actually see. You are an enhanced planning agent with GitHub issue validation capabilities. I'm telling it what an issue is and to determine whether it's a bug or a user error, because a lot of issues are like "go check the readme, this is the wrong way to use it." So it validates whether this is a legitimate issue or a feature request. You can imagine these are tracks you might go down where you handle features completely differently, or handle bug fixes with some level of urgency after assessing the urgency of the issue.

Thumbnail 3290

I want to call this out too. Models really want to please the users, right? If we had an agent that went and changed the repository every single time a user made a mistake and just wasn't using things correctly, that would get way out of hand. We don't want it to do that. So it is an important consideration, because a lot of times people, I don't know if you're managing any open source repos, but they might be commenting things all the time that are just "that's not how you do this." We have some repositories from SAs at Amazon who are already managing their repository in this manner, so I can send you an example.

There's some dynamism here that changes over time. The things that you bake into your prompt, you're going to have to keep revisiting. We haven't even talked about agent evaluation, which is analogous to employee performance evaluation. So not sidestepping all that, but a lot of it is very analogous to that.

The question is whether you can do this when a PR is submitted. I think there would be a different application, maybe when a PR is submitted, doing a code review or things of that nature. What we're looking at here might be almost reactive to specific issues that are open on the repository. Something is reported as a failure. How can we go and create the PR that would have the fix and then have the human go and do the review and merge it.

We are more or less out of time, so I think this is probably a good place to stop. Any questions? We'll publish the repo for this code and make it available. Yes, we can do all of that. We will do that. We're just going to push it to the LangChain repo. We're just kidding.

Well, thank you for joining everybody. It's been an absolute pleasure. I'm sorry we had the technology issues, but the point of this wasn't to go too much into the features of the Nova models. It's not a breakout session; it's talking about the code. If you have questions about any of the new releases, we have a booth at the expo with people there all the time that can answer questions. I hope that you got some value out of this and had a good time. I appreciate that.


; This article is entirely auto-generated using Amazon Bedrock.
