
AWS re:Invent 2025 - Code completion to agents: The evolution of development (DVT405)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Code completion to agents: The evolution of development (DVT405)

In this video, Giovanni Zappella and Laurent Callot from Amazon Q Developer trace the evolution of coding agents from basic code completion to autonomous systems. They detail their journey from RAG-based solutions achieving low performance to sophisticated architectures like the Taxy Code Agent (38% on SWE-bench Verified) and the Logos agent (51%) with sandbox execution. The presentation culminates with Huang, a supervisor/sub-agent architecture for complex tasks. Key lessons emphasize optimizing for specific use cases, creating relevant metrics (introducing the PolyBench benchmark), building reliable systems using the Strands Agents SDK and Amazon Bedrock AgentCore, and maintaining flexibility as models and customer needs evolve.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

The Evolution of Coding Tools: From Basic IDEs to Autonomous Agents

Hi everyone. My name is Giovanni Zappella, and with Laurent Callot today we will try to give you a very short summary of the evolution of coding agents, starting from code completion all the way to the different architectures that we adopted in the last years. We are both Principal Scientists working on Amazon Q Developer and in particular on the Amazon Q Developer autonomous agent.

Thumbnail 40

When I started coding, IDEs looked pretty much like this. You got a little bit of syntax highlighting. There was autocompletion, but it was much more about completion than about being automatic. Support was a very vague term, and you still needed to remember where to put your semicolons if you actually wanted your code to compile.

Thumbnail 70

Fast forward 10 years, we get very powerful IDEs which are able to basically complete the variable names, the method names, and so on and so forth, everything you're thinking while you write the code. They typically show some kinds of suggestions directly in your editor, and they are able to complete fairly large chunks of code. At the same time, this is still something that is able to speed up your typing activity, not necessarily to code on your behalf.

Thumbnail 110

Fast forward another couple of years. We start seeing agents like the one that we have in the Amazon Q Developer CLI that, given a problem in natural language, are able to autonomously interact with your file system, identify files that need to be changed, and make these changes in order to achieve a certain goal. At the same time, this is not exactly how things started. This is already something that saw some iterations and progress in the last couple of years. So let's get started from the beginning.

Thumbnail 140

Two Families of Agents: Synchronous Companions vs. Asynchronous Autonomous Systems

First of all, when we speak about agents, we use this term for pretty much every kind of autonomous entity, but I would say there are two big families of agents. It's not really a hard split, it's more of a spectrum, and some agents sit somewhere in the middle of these two families, but for simplicity, let's focus on the two extremes and I will let you decide which family each agent belongs to.

On one side we have the synchronous experience, the experience like the Amazon Q Developer CLI where the agent is interactive, is your companion, and is basically helping you to accelerate in accomplishing the task you have. On the other hand, we have an asynchronous experience, or an experience where you can delegate certain tasks to agents and let the agent work autonomously in order to achieve that goal. These are, as I said, two extremes, but they also entail a very different kind of experience for the developer.

Thumbnail 210

For instance, if we look at a software developer operating in a sequential way, starting a task, completing it, and then moving on to the next one, it's a little bit of an oversimplification, I know, but it follows a lot of the patterns we see in the day-to-day activity of developers. We can see the companion agent, the assistant, speeding up this sequential activity. The developer will still be able to complete a much bigger number of goals and tasks, but it remains a sequential activity, similar to the one they would carry out on their own.

When we look at the second family of agents, the autonomous one, things are a bit different: the human developer can start delegating and parallelizing their work, and the resulting activities look quite different. Some of them will be long running, some will be shorter, but generally speaking, it's not a strictly sequential operation anymore.

Thumbnail 300

It's also not completely true that these tasks are monolithic blocks. In fact, when we look at the task structure, especially when we want to delegate a task, there are still some touch points between the agent and the human developer.

For instance, the task needs to be defined, so the human developer needs to write a specific definition of the task and what needs to be achieved. After this synchronous activity, there is an asynchronous phase where the agent operates autonomously and creates a pull request, for instance. The pull request still needs to be reviewed by a human developer, typically before being merged, and this is another synchronous activity where there is interaction between the two. Eventually, the agent will need to autonomously iterate on the changes, and so on, until the code is ready to be shipped. This is again a bit of a simplification, but it should give you an idea of what we call synchronous and what we call asynchronous.

Thumbnail 360

We started working on this kind of agent a couple of years ago, and in particular we focused on the second family of agents, the ones operating asynchronously and autonomously. For the rest of this presentation, I will share with you results on a benchmark called SWE-bench. You have probably seen it around before. I will use that benchmark not because it's the most important or the only one we have, but because it has been around for quite a long time, even though our work started before it existed; we managed to retroactively compute some of the results for the purpose of this presentation. It will also give you reference points for how much these agents evolved and changed over time.

The purpose of this talk is not to explain to you how to implement an agent in five minutes or something like that. The purpose of this talk is to share with you what we built over the last two years in order to, for instance, get to the top of the leaderboard of SWE-bench a few times and give you the idea of how the architecture of this agent changed over time. What I would like to achieve is to give you some inspirational ideas that you can use the next time you want to create a new agent.

Thumbnail 440

So, let's get started. For those who are not familiar with the benchmark, it is pretty simple. It's scraped from GitHub. There is a GitHub issue that was solved by a human developer. The human developer solved the issue by writing the code needed, for instance, to fix the bug, and typically some unit tests which verify that the code is correct. What the creators of the benchmark did was take these two components, remove the code written by the human developer, let the agent write the code instead, and verify that the code is correct by running the unit tests provided by the human developer. If the unit tests pass, we assume that the code is correct and solves the problem. If the tests fail, we assume that the code is incorrect. There are a few assumptions along the way. For instance, the unit tests may not cover all possible cases, or the reviewer who first approved the pull request may not have done a great job, but it's fairly well correlated with the quality of the code.
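To make that evaluation loop concrete, here is a minimal sketch of the idea (this is not the official SWE-bench harness; the repository path, patch file, and test command are placeholders):

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    """Apply an agent-generated patch, then run the human-written unit tests.

    The patch counts as a solution only if the original tests pass afterwards.
    """
    # Apply the model-generated patch to a clean checkout of the repository.
    applied = subprocess.run(["git", "apply", patch_file], cwd=repo_dir)
    if applied.returncode != 0:
        return False  # the patch did not even apply cleanly

    # Run the unit tests the human developer wrote alongside the original fix.
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0

# e.g. evaluate_patch("/tmp/checkout", "candidate.diff", ["pytest", "tests/"])
```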

Thumbnail 510

From RAG to Fixed Workflow: Overcoming Early Retrieval Challenges

The start for us was in an epoch that looks very far away right now, when people were starting to evaluate LLMs on benchmarks like HumanEval. These benchmarks were pretty simple: they provided the signature of a Python function, for instance, along with some description, and required completing the body of the function so that it returned the correct values. When we switched from this kind of benchmark, still using fairly complicated solutions, to something realistic, we immediately noticed that these benchmarks were far from capturing the quality of code required for real-world applications. In particular, we saw problems in generating actual patches, failures in passing the tests, and so on. This was already a non-trivial system, and it was basically based on RAG: we had a retriever retrieving some code, typically some files, passing them to the LLM and letting it make the changes.

Thumbnail 590

What was even more concerning was the inability of the retriever to identify and provide the correct pieces of code, the correct code snippets and files, to the LLM. For people who are not familiar with these metrics: recall measures the fraction of relevant files that are retrieved by the agent. In other words, given the files which are relevant for the task at hand, recall is the fraction of those files which are actually retrieved. This means that with a recall of 50%, only half of the files which should be modified to complete the task were actually retrieved.
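As a small illustration, file-level recall can be computed like this (the file names are purely illustrative):

```python
def file_recall(relevant: set[str], retrieved: set[str]) -> float:
    """Fraction of the files that must change which the retriever actually found."""
    if not relevant:
        return 1.0  # nothing needed to be retrieved
    return len(relevant & retrieved) / len(relevant)

# Two files need changes, only one was retrieved -> recall of 0.5
print(file_recall({"src/app.py", "src/models.py"}, {"src/app.py", "README.md"}))
```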

This is a fairly complex problem because we switched from simple benchmarks that had only one function provided, or maybe two functions being provided, to a benchmark that relies on a real-world code base with thousands of files possibly. Just identifying what needs to be changed was a challenge on its own. We clearly understood that RAG was not a solution for that, but at the same time, we wanted to keep something somehow controlled, effective, and relatively fast in delivering a result.

Thumbnail 670

So the first solution we created was a fixed workflow. The workflow was made of a few steps. It started by identifying files which were potentially relevant, with very high recall; we retrieved a lot of code to make sure we captured at least a large fraction of the files which were actually needed. In the following step, we further expanded the content of those files, for instance by providing class names, method names, and so on. Then we ran a second retrieval step where we discarded from the initial set the files which were no longer relevant given the new information. After that, we selected the code chunks that needed to be modified and rewrote them before generating the actual diff.
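In spirit, that fixed workflow can be sketched as four chained LLM calls. This is an illustration only, not the production Amazon Q Developer implementation; `call_llm` stands in for whatever model client you use, and the prompts are simplified:

```python
from typing import Callable

def fixed_workflow(issue: str, file_index: str, call_llm: Callable[[str], str]) -> str:
    """Four fixed steps: broad retrieval, re-ranking, chunk selection, rewriting."""
    # Step 1: high-recall retrieval -- list every file that could plausibly matter.
    candidates = call_llm(
        f"Issue:\n{issue}\n\nRepository files:\n{file_index}\n"
        "List every file that might need changes."
    )
    # Step 2: expand candidates with class/method names, then drop irrelevant files.
    narrowed = call_llm(
        f"Issue:\n{issue}\n\nCandidate files with their classes and methods:\n{candidates}\n"
        "Keep only the files that must actually change."
    )
    # Step 3: pick the exact code chunks to modify inside the surviving files.
    chunks = call_llm(
        f"Issue:\n{issue}\n\nFiles to change:\n{narrowed}\n"
        "Select the code chunks that need modification."
    )
    # Step 4: rewrite the chunks and emit the final diff.
    return call_llm(
        f"Issue:\n{issue}\n\nChunks to rewrite:\n{chunks}\n"
        "Return the rewritten code as a unified diff."
    )
```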

Thumbnail 750

This is a very simple workflow. In fact, it was only using four LLM calls. At the same time, this gave us an improvement of roughly 10x compared to RAG-based solutions, and it was the most effective agent out there at the time. It was on top of the leaderboard of SWE-Bench. Was it satisfactory for us? And the answer generally is no, and for a very simple reason. The workflow was extremely efficient, but it was not flexible.

The Taxy Code Agent: Introducing Flexibility Through Agentic Loops and Specialized Tools

Flexible is a vague term here, but you can imagine as a software developer that when you have a task at hand, you don't always follow the same workflow. If you want to fix a bug, maybe you first want to reproduce it, then observe what's in the traces, then move on to identifying which files need to be changed, and so on. Whereas if you want to start a new project from scratch, you start implementing something first and then you run things along the way. So the fixed workflow was not flexible enough to cover all these use cases, which are real.

Thumbnail 830

We also noticed something significantly more challenging: the quality of the tools that humans had in their hands was much, much higher than the quality of the tools we were providing to the agent. So we moved on and created the Taxy Code Agent. The Taxy Code Agent relied on two big improvements. The first one was to create a set of tools and an environment where the agent could actually interact with the code base and the file system in an agentic way.

It's a very simple concept if you want, but you can imagine that human tools are very visual, while the agent relies on text, and in particular the text should be fairly short and to the point without useless information that may confuse the agent in the next turn. At the same time, we wanted to keep a specific amount of information which is relevant to solve the problem always available to the agent and give it some kind of workspace where to store, somehow memorize if you want, some useful information.

Thumbnail 910

Thumbnail 940

We created an agent based on a so-called agentic loop where the agent continuously iterates and uses the LLM to select which tool to use at the next step. The agent was equipped with a number of tools for opening files, modifying files, selecting code chunks to put in the memory of the workspace, and so on. This was a significantly more complex agent, but it also gave us a big boost in terms of performance.
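The heart of such an agent is a simple loop: the model looks at the history so far, picks a tool, and the observation is fed back in until it decides it is done. A minimal sketch, with the tool set and the stop convention purely illustrative:

```python
from typing import Callable

def agentic_loop(task: str,
                 tools: dict[str, Callable[[str], str]],
                 call_llm: Callable[[str], str],
                 max_steps: int = 50) -> str:
    """Let the LLM choose the next tool at every step until it emits FINISH."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = call_llm(
            "\n".join(history)
            + f"\nAvailable tools: {sorted(tools)}.\n"
              "Answer with '<tool> <argument>' or 'FINISH <result>'."
        )
        action, _, argument = decision.partition(" ")
        if action == "FINISH":
            return argument  # the agent believes the task is complete
        tool = tools.get(action)
        observation = tool(argument) if tool else f"Unknown tool: {action}"
        history.append(f"{action}({argument}) -> {observation}")  # feed the result back
    return ""  # step budget exhausted without finishing
```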

Thumbnail 980

On SWE-bench, the full dataset, the performance went from roughly 13% to 19%. However, you need to account for the fact that some of these tasks are probably not solvable or they have overly specific unit tests. On the verified subset of SWE-bench, which is a human-annotated and verified set of test cases, we were seeing a much bigger boost going from 25% to 38%. This was also the first agent reaching the top of the leaderboard on verified. We also noticed that the agent was significantly more flexible than the workflow, which is exactly what we wanted to achieve.

This flexibility created opportunities to expand the use cases supported by the agent. For instance, in this case, we tried to create an agent that drafted documentation instead of writing code. It was based on exactly the same tools, the same workspace, and the same model. We just changed the instructions to make sure it focused on documentation files, and in particular on readme files. We also compared it to specialized workflows which were actually studied and tuned to produce the best readme files. We immediately saw that the new structure was flexible enough to match, and in some cases outperform, the specialized workflows. This was certainly something that gave us hope to further expand the supported use cases and improve the behavior of the agent.

Thumbnail 1050

At the same time, we were also hitting an obstacle. Once we created the agent, we identified a number of tools, kept adding tools, and provided better feedback to the agent along the trajectory. However, it was pretty clear that the agent was still struggling because the models were not able to understand the so-called semantics of the code. I don't want to push the parallel with humans too far, but when humans think about code, they don't typically think about the code itself or about the words they see in the code. They kind of execute the code in their mind and think about what it implies and what those actions will return as a result.

This was not happening here, and we immediately noticed that LLMs were not able to effectively self-correct. Even in the presence of feedback, it was really a big struggle for them to improve the code when it was incorrect. At the same time, we didn't really have a viable alternative because if the LLM cannot understand the code, what can we do? Tests were not really generated in a very effective way.

Thumbnail 1150

The Logos Agent: Leveraging Code Execution and Test-Time Compute for Enhanced Performance

As time went by, models got much better at generating tests, and we could finally leverage code execution. Code execution was a big shift, not because it's particularly complex to imagine, but because you need to have on one side a model that is able to generate correct and relevant tests, and on the other side you need to have all the infrastructure to safely execute both code and tests in a protected environment. You don't want an agent which is based on a stochastic model connected to the Internet that can generate random code and do whatever it wants. So this is a fairly significant challenge in engineering terms.

In engineering terms, it's also the kind of environment where you need to scale and be able to run thousands of these agents at the same time if you want to effectively evaluate them. So it was a significant effort, but this led us to create what we call the Logos agent. The Logos agent allowed us to create a significantly more complex architecture.

Thumbnail 1230

In fact, we didn't only have a sandbox to run the code; we could start using tests to leverage more compute at test time and try to let the agent autonomously identify problematic solutions. So I will give you a quick overview of this. If you look at the left side, we have the code writer agent, which is fairly similar to the Taxy Code agent in spirit, at least, and it has a number of tools that can interact with the file system. But the file system is now in the sandbox environment, not the local file system anymore. Conceptually it doesn't change much, but it's safe to run it this way.

We could run multiple of these agents, in this case three different versions or also the same agent three times, and let another component verify potential regression by running unit tests. So we could run the agent three times, generate three different patches solving the same problem, then run unit tests, not the unit tests that we use in the harness to evaluate the performance, but the unit tests already present in the code base. This allowed us to identify regressions which were not related to the task we have at hand or that we don't want to observe in our solution, discard potentially some of the patches which cause regression, and move on selecting one specific patch using an LLM-based algorithm.
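Reduced to its skeleton, that generate/verify/select pattern looks roughly like the sketch below; the sandbox, test runner, and LLM-based selector are abstracted into callables, and the names are illustrative, not the actual Logos internals:

```python
from typing import Callable

def generate_verify_select(task: str,
                           write_patch: Callable[[str], str],             # one code-writer run in the sandbox
                           passes_existing_tests: Callable[[str], bool],  # regression check on the repo's own tests
                           select_best: Callable[[list[str]], str],       # LLM-based patch selection
                           n_candidates: int = 3) -> str:
    """Generate several candidate patches, drop the ones that break existing
    unit tests, then let an LLM-based selector pick one of the survivors."""
    candidates = [write_patch(task) for _ in range(n_candidates)]

    # Keep only patches that do not regress the repository's own tests
    # (not the held-out tests used to score the benchmark).
    survivors = [p for p in candidates if passes_existing_tests(p)]

    if not survivors:
        return candidates[0]  # every candidate regressed; fall back to the first attempt
    return select_best(survivors)
```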

Thumbnail 1340

Thumbnail 1350

We can see that this not only gave us a significant improvement in one-shot performance, where we generate a single patch, going from the 38.8% we saw before to 51%. The selection over multiple patches, in this case three, with the ability to pick a single one, gave us another 4 percentage points of improvement, which may not sound like much right now. But if you consider that we started, probably 15 minutes ago, from 13 or 14%, 4 percentage points are a significant improvement.

By means of this, we could also scale further and instead of generating three patches, we can generate 10 patches, 25 patches, and so on and so forth. Using more compute at test time could give us much better performance because we can evaluate multiple solutions for the same problem.

Thumbnail 1400

At the same time, we were still not leveraging everything we could, in particular with LLMs getting better. We saw the so-called thinking and reasoning skills of the models improving a lot, and they proved to be effective and useful for planning and for thinking through specific situations, for instance when you're trying to solve a bug. So we created specialized tools offering the agent the opportunity to make good use of these reasoning skills, not by overthinking along the trajectory, but by letting the model select when to actually leverage them.

Thumbnail 1470

They also provided us with a significant boost in terms of performance, roughly 4 or 5 percentage points, which is another meaningful improvement. But we were still not getting to the point of solving very large tasks. In particular, we had a problem of context size, but we also had a problem of the flexibility of tasks, which was certainly improving compared to the fixed workflow but not always to the point of being sufficient for practical purposes.

For instance, if you try to reproduce a bug, you can try 10 times to reproduce a bug and only be successful the last time, which means that the previous 9 steps are completely useless for the purpose of solving the task.

This consumes part of your context and it's something that remains in your memory, in the memory of your agent. On the contrary, when you have tasks which are significantly different, maybe you don't want to only write code but also change your configuration for deployment, you may want to isolate that and avoid conditioning on those changes when making the next steps.

Thumbnail 1550

Huang Architecture: Supervisor and Sub-Agent Design for Complex, Multi-Step Tasks

So we created a more complex architecture which we call Huang. This is made of a supervisor agent and sub-agents. The supervisor agent is a fully equipped agent able to interact with the file system and read files; it typically creates a plan and then leverages sub-agents, which you can imagine as super-editors if you want, receiving instructions and accomplishing the subgoals necessary to achieve the final one.
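Schematically, the supervisor explores, plans, and then hands each step to a fresh sub-agent so that the noise of one step does not pollute the context of the next. A minimal sketch with stubbed components (the callables are placeholders, not the Huang implementation):

```python
from typing import Callable

def supervise(task: str,
              explore_repo: Callable[[str], str],         # e.g. wraps find/ls/grep in the sandbox
              plan_with_llm: Callable[[str], list[str]],  # returns an ordered list of sub-goals
              run_sub_agent: Callable[[str], str]) -> list[str]:
    """Plan first, then delegate each sub-goal to an isolated sub-agent."""
    overview = explore_repo(task)
    plan = plan_with_llm(f"Task: {task}\nRepository overview:\n{overview}")

    results = []
    for step in plan:
        # Each sub-agent starts from a clean context containing only the
        # instructions the supervisor wrote for this step.
        results.append(run_sub_agent(step))
    return results
```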

Thumbnail 1580

Let me give you an idea of how this works. Let's try to implement a new feature in a website. The website is a simple website containing detailed pages for restaurants, and this request basically sets a specific goal of creating a new route in this Flask website, allowing customers of this website to vote for a specific restaurant. The page must be password protected because we want only to have a selected amount of people voting.

Thumbnail 1620

Thumbnail 1650

What happens is that the supervisor agent starts exploring the code base. In this case it uses some command line tools like find and ls and grep, and it tries to explore the files to identify the relevant templates, the relevant folders, and uses some patterns in the file names to identify which files it wants to actually change. After that, it creates a plan. The plan is typically fairly long, so I cannot put it all in one slide, but I think you can get already a feeling from the titles.

Step one is to add authentication and create a /votes route in app.py, which is a file typically used in Flask applications. The second step is to create a template directory with votes.html, and then we create the HTML for the login page. It's fairly logical, nothing too crazy, and the split typically also makes sense to humans when they try to accomplish this task.
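To make the example tangible, here is a hand-written sketch of what such a password-protected /votes route could look like in Flask. This is illustrative code, not the agent's actual output, and the shared-password check is deliberately minimal:

```python
from flask import Flask, redirect, render_template, request, session, url_for

app = Flask(__name__)
app.secret_key = "change-me"        # needed for sessions; placeholder value
VOTE_PASSWORD = "letmein"           # illustrative only; use real authentication in practice
votes: dict[str, int] = {}          # restaurant name -> vote count

@app.route("/votes", methods=["GET", "POST"])
def votes_page():
    # Gate the page behind a shared password stored in the session.
    if not session.get("authenticated"):
        if request.method == "POST" and request.form.get("password") == VOTE_PASSWORD:
            session["authenticated"] = True
            return redirect(url_for("votes_page"))
        return render_template("login.html")            # templates/login.html
    if request.method == "POST":
        restaurant = request.form["restaurant"]
        votes[restaurant] = votes.get(restaurant, 0) + 1
    return render_template("votes.html", votes=votes)   # templates/votes.html
```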

Thumbnail 1700

Then it starts the first sub-agent, which receives a set of instructions: import the libraries necessary to modify this website and introduce the functionality we want to implement, then create the login route, then add some checks, and so on. These instructions are not part of our prompt and not part of the request we got from the user. They are generated by the supervisor agent for the sub-agent, and this gives us significantly more flexibility because, given a task, the supervisor is able to expand it and provide a detailed set of instructions to the model.

Thumbnail 1750

Thumbnail 1760

What we get at the end is a login page which requires a password to access a page where you can vote for a specific restaurant. This is a little recap of how we evolved our agents over the years. We went from RAG, which was basically a retriever plus one LLM call, to a fixed workflow made of only four LLM calls, which already gave a 10x boost in performance. We then switched from the fixed workflow to a real agent with a loop, where the model selects the next action rather than it being predetermined by the workflow structure. We also created an environment where the agent can interact better with the file system, use better tools, and overall move more effectively towards its goal. Finally, we reached a stage where we want an increased amount of flexibility, and we are willing to create more complex agents with the purpose of achieving bigger goals and having long-running agents that can complete big tasks. Along the way, we also learned a few lessons, and I will let Laurent give you a summary of those.

Three Key Lessons: Optimization, Reliability, and Continuous Evolution in Agent Development

Thank you, Giovanni. So, Giovanni covered about two years of the evolution of the development of coding agents and it has moved as fast as the field of AI has moved during this time. So what I'm going to do now is try to distill all of this experience into three lessons which I hope will be useful for you when you think about building your own agents or using agents.

Thumbnail 1880

The first one is that you need to optimize for your specific use cases. There are a lot of designs that are possible right now. Models are very powerful. You have a wide choice of how to build agents. If you're going to do repetitive tasks or well-defined tasks, use a simple workflow with LLMs. You get the power of LLMs, but you get the predictability of a deterministic workflow and the speed that goes with that.

If you need to work on something interactive, I think a design that favors low latency is critical, so you want to control which tools your agent is able to use, how often they are called, and how fast those tools are, for example. Or, if you want your agent to autonomously solve very hard tasks, which is the kind of thing we are working on in Kiro, then you want designs that maximize the power of these agents: agents that are able to create their own sub-agents and that have access to a very rich toolbox to validate their work, get feedback signals, and so on.

Thumbnail 1950

The other thing that you need is to create metrics that really reflect the customer experience that you want to create. Giovanni discussed SWE-Bench a lot. This has been the standout dataset in the coding agent community for a while, and we have used it extensively. We're very grateful for the team that released it. It's been super helpful. But as we grew experienced, as we gave our agents to our customers and observed how they use it, we noticed a few things.

First of all, the kind of problems that were in the benchmark were much simpler than the kind of problems our customers were actually putting to our agents. You can see that, for example, by looking at the number of files that are modified in each task. The problems in the benchmark typically modify one file, sometimes two, rarely more. Customer requests, on the other hand, require you to modify three, five, or ten files to properly solve. So relying exclusively on a benchmark whose tasks are too simple gives you the wrong idea of how good your agent is and doesn't reflect the customer experience very well.

Similarly, we've noticed that the problem statements in the benchmark are typically quite long, over 1,000 characters, with very, very clear expectations of the task to solve. This is often not how people actually use agents. When I use them, I often start with a vague idea and then try to refine the problem, and that seems to be how our customers use them as well, perhaps how you do it too. So we see problem statements of fewer than 100 characters that are very short and underspecified: implement this, do that. Again, we need to make sure that we have benchmarks with the right kind of problem statements.

Thumbnail 2070

To get a bit closer to the kind of tasks we are interested in solving, we created a new benchmark called PolyBench, which we've made publicly available. It has four languages as opposed to a single language for SWE-bench. It has harder tasks, and a subset of them is verified, so we know they are actually solvable by agents. There is also more variety of tasks: not just bug fixes, but new features, code refactoring, and things like that.

In addition, we have created a much richer set of metrics to measure, for example, not just whether the agent is able to solve the task, but whether it is looking in the right place to solve this task, whether it is retrieving the right files, looking at the right functions, and not retrieving too much. It is still not perfect, of course. Evaluation is a hard problem in itself, but it helped us better understand how our agents were performing for our customers.

Thumbnail 2130

The second lesson is that you need a reliable system. That might sound obvious, so let me be a little more specific about why. Agents are based on LLMs, and LLMs are stochastic. You can submit the same task to your agent twice and get different answers, sometimes different in subtle ways and sometimes completely different. That diversity is also the power behind what Giovanni was talking about, generating multiple solutions and picking the most promising one, but it also means that it's very hard to evaluate whether a new idea is a good one or not. You need to run a very large number of tests and wait for the law of large numbers to kick in, essentially, to be able to say whether a new idea is effective or not.
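In practice, each benchmark task is a noisy pass/fail trial, so you can only trust a difference between two agent versions once the confidence interval around the solve rate is tighter than the effect you are looking for. A quick sketch of that arithmetic:

```python
import math

def solve_rate_with_ci(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Solve rate and the half-width of a normal-approximation 95% confidence interval."""
    rate = successes / trials
    half_width = z * math.sqrt(rate * (1 - rate) / trials)
    return rate, half_width

# 50 tasks leave an interval of roughly +/-14 points; a few percentage points of
# improvement only become measurable with thousands of runs.
print(solve_rate_with_ci(26, 50))      # ~(0.52, 0.14)
print(solve_rate_with_ci(1040, 2000))  # ~(0.52, 0.02)
```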

These tests are typically long. We're interested in solving complex tasks. They can run for 30 minutes or an hour. So multiply this by a few thousand tasks and try to do that a few times a day. You need a system that is really solid to do that. The second thing is that agent trajectories are complex. So the trajectory is the sequence of steps that your agent is going to take, the LLM calls, tool use, retrieval, and so on. Failure attribution is very hard.

You might have dozens or hundreds of tools accessible to your agent. If one of them fails, models are now intelligent enough to try to find a workaround, try to find an alternative way to get what they wanted out of that tool call. This is great. This makes your system more robust, but it also makes it very hard to actually understand what works and what doesn't under the hood and so to improve your system. Understanding these trajectories is not something you can do through manual inspections anymore.

Thumbnail 2270

As Giovanni said, we started with fairly simple workflows with four LLM calls. You could put all of that on the screen, look at it, and understand what your agent does. Now, when you have hundreds and hundreds of steps with very large contexts used every time, it's unmanageable. You need great observability tools. So we've been building our agents on top of the Strands Agents SDK and Amazon Bedrock AgentCore.

Thumbnail 2310

Strands is an Apache 2.0 Python SDK and, as of today, it's also available in TypeScript. We've deployed a lot of agents in production based on this SDK, and not just us. It is proven, and it provides a lot of different tools and resources to build powerful agents. AgentCore, well, I won't elaborate too much on it; Swami talked very eloquently about it this morning, but it offers a broad range of tools to build, deploy, and observe agents. In particular, it provides all the things you need to get metrics about how your agent behaves, look at trajectories, analyze them, understand, and improve.
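For reference, the basic Strands pattern looks roughly like the snippet below. It follows the SDK's published hello-world style; the package name, the `@tool` decorator, and the `Agent` constructor arguments should be checked against the current documentation, and the default model depends on your AWS configuration:

```python
# pip install strands-agents
from strands import Agent, tool

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

# The Agent wraps the model, the system prompt, and the tools in an agentic loop;
# observability (e.g. through Amazon Bedrock AgentCore) can be layered on top.
agent = Agent(tools=[word_count], system_prompt="You are a concise assistant.")
agent("How many words are in 'code completion to agents'?")
```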

And the third lesson is that you need to be ready to evolve. This is not a static space. Customers are going to use your agents in ways that you don't expect. You need to be able to observe and understand how they're using your agent, understand how it is performing for them, and be ready to change the way you build so that you improve the power of your agent for their use case.

There will come new tools that will fundamentally change the way your agents work and the way they should be designed. For example, when we brought in sandbox environments and the possibility to execute code, it implied a very different architecture from static read-and-write workflows. Your scaffold also needs to adapt to new models. AI is moving extremely fast. New foundation models are released every few weeks, even every few days at this point, and they come with new capabilities every time.

Building agents is clearly top of mind for the people training and releasing models. They try to improve the way their models power agents. That means, in turn, that things that were critical to an agent scaffold with the previous iteration of a model might be completely obsolete. For example, building all kinds of dedicated tools for code representation or code navigation was great a year or a year and a half ago. That made the whole difference and allowed you to create state-of-the-art agents.

It's completely irrelevant now because models are extremely good at using bash, using standard tools that developers use directly to explore the code. So it's a change. And finally, you know, we're expecting to see more and more small models, faster models, become competitive and usable to power agents. And what that means is that the field of applications that you can envision changes a lot, right? The cheaper and the faster it is to run these agents, the more you can envision running them on low-latency tasks or on device or on simpler tasks. That completes our presentation. Thank you very much for your attention.


This article is entirely auto-generated using Amazon Bedrock.
