Kazuya

AWS re:Invent 2025 - AI is Breaking the SDLC: Here's How to Fix It (DVT103)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - AI is Breaking the SDLC: Here's How to Fix It (DVT103)

In this video, Michael Webster from CircleCI examines the rapid growth of AI coding agents and their impact on software delivery. He presents data from GitHub showing agents evolved from PR comments to actively pushing code by May 2025, with CircleCI's internal data confirming this trend across enterprise customers. Webster explains how agents create bottlenecks in code review and deployment despite faster code generation, using queuing theory to illustrate delays. He argues that investing in validation processes—reliable tests, faster CI pipelines, and automated quality checks—is more effective than chasing the latest AI techniques. Webster introduces Chunk, CircleCI's validation-first agent focused on fixing flaky tests and maintaining production-ready code, emphasizing that delivery pipeline improvements benefit both human and AI-generated code.


; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Thumbnail 10

The Rise of Headless AI Agents: Data from GitHub and CircleCI

Hi everyone, thank you. My name's Michael Webster. I'm a Principal Engineer at CircleCI, and we're going to talk a little bit today about where we are right now with agents and what they're doing to the SDLC, some of the problems that we're seeing already being created, and some techniques we're finding effective to fix it, including some things we're adding into our product. So yeah, let's go and get started.

Thumbnail 30

As a quick recap, I want to go over the history of agentic workflows. I won't go too in depth, but I think where we are now is starting to see a little bit of a decoupling point that we didn't have before. Back in the days of 2021 and 2022, there was a lot of developer copy-pasting between terminal, IDE, and a ChatGPT window, what have you. Then we started seeing IDEs come along that are still driven by the developer but are capable of more agentic tasks. They can do more long-range planning and execution. This was really interesting and unlocked a lot of potential within the AI space.

But now what we're seeing is these headless agents starting to emerge. You take a CLI, you throw it inside of a Docker container, and now suddenly you can start running cron jobs, webhook-triggered agent runs, anything you like to drive this kind of work. Now, there are a lot of products on the market that have launched in the last six months to do this. So I thought it was interesting to take a look and see how real this trend is.

Thumbnail 90

I went through and looked at all of the bot activity in the public GitHub archive going back to the end of last year. I just looked and said, for these known coding agents, a small subset of them, five or six, what are they actually doing on GitHub? And you can see a really rapid growth trend overall within these agents across all the big common ones: Copilot, Claude, Codex, all of those.

Thumbnail 130

But it gets more interesting. So there's obviously some growth, there's activity from these agents. But it gets more interesting when you break it down into the event types, like what are these agents actually doing, and you start to see a really interesting trend emerge after the first three months. When these things first launched, they were just doing PR comments. These were just code review bots. Every example starter workflow of how to use an agent in a GitHub Action or in a CircleCI config was doing things like issue triage and code review, and that's what they did.

But you can see around May of this year, these agents actually started to push code. This was something that people sort of expected, but it took a little while to happen. And at this point (this data is as recent as October), in some weeks you're seeing just as much push activity as you are comment and PR activity. So these agents are actually doing real work. They're pushing actual code to real repositories, which is what you would expect given the growth trends.

Thumbnail 180

So that's public GitHub activity, but that could simply be experiments, people with hobbyist projects, nothing necessarily in a real-world setting. At CircleCI we see a lot of build activity from projects that aren't in the public GitHub archive. These are enterprises, and large and small startups. So we decided to look and see what our activity pattern is. Do we see a similar growth trend to what we see in the public archive? This is a screenshot of some aggregated and anonymized data, and you can see the same pattern repeat itself.

Thumbnail 220

So there's activity happening, and in particular with CircleCI, these are cases where we actually ran a pipeline. This is not a case where someone simply updated a readme or maybe was building a static blog. This is someone who took the time to configure a pipeline to run unit tests and do deploys in response to a push event. So you would think this is economically valuable work being done by these agents. It's not just "I had a hobby project, let me turn it on and see if Claude can keep my GitHub activity green" or something like that.

Thumbnail 260

And again, this is all looking at agents that we can distinctly identify. So this is our low-end estimate of what's happening. If you use Claude Code or Codex or Cursor and commit under your own name, we're not necessarily going to be able to tell that it's you. So this is a very low-bar estimate, and within about five or six months we've seen a very fast growth clip. And this is reflected in the revenue of these tools and in the general trends overall, where people are more willing to try out these headless agent bots.

Thumbnail 290

The Delivery Bottleneck: Why Writing Code Faster Isn't Enough

Okay, so why is this a problem? This is the thing that everybody wanted. We wanted these agents to do more than just get us out of the IDE. Instead of multi-boxing Cursor, you can now just have multiple agents running in parallel. And the issue here really comes down to this:

Code isn't really valuable until it's in a customer's hands, and all of the stuff that happens after you write the code is not necessarily keeping up. PRs are getting massive. Open source projects are requiring AI disclosures because they suddenly get mysterious 2,000-line PRs that all look alike. Reviews are taking longer, with humans poring over those 2,000-line PRs. And in general, build stability isn't really improving, which is kind of what you would expect.

Thumbnail 330

This is a basic kind of queuing theory. It's a branch of math about how queues operate. To simplify it a lot, really, if work is arriving into your system faster than you're able to process it, you get delays. You probably understand this intuitively. If you've ever been at a store that only had a single checkout line when they were busy, everybody has to wait. Eventually, some people might get tired and give up, but when we have a revenue feature on the line, we can't really just stop. So the queue just builds up and builds up, and then people stop working to go drain the queue and do all the reviews.

Thumbnail 370

Thumbnail 380

So the reality is, for a lot of organizations, even though you can write a lot of code now, you probably couldn't actually go much faster if you wanted to in terms of your delivery processes. To give an example of what this looks like, I put together a queuing simulation of what happens under various scenarios with different speedups due to AI. On the bottom, we have a status quo baseline. It assumes that you can process code twice as fast as you can write it, but if you hold delivery constant, as the AI gets faster and faster, the delays get larger and they show up even more quickly.
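If you want to play with this effect yourself, here's a minimal sketch of the kind of single-server queue the simulation above describes. This is my own illustration, not the presenter's actual model, and the arrival and service rates are made-up numbers purely for demonstration.

```python
import random

def simulate_queue(arrival_rate, service_rate, n_jobs=10_000, seed=42):
    """Average wait in a single-server FIFO queue (Lindley recursion).

    arrival_rate: PRs produced per hour (goes up as AI writes code faster)
    service_rate: PRs reviewed/deployed per hour (your delivery pipeline)
    """
    rng = random.Random(seed)
    wait = 0.0          # wait of the current job
    prev_service = 0.0  # service time of the previous job
    total_wait = 0.0
    for _ in range(n_jobs):
        interarrival = rng.expovariate(arrival_rate)
        # This job waits for the previous job's wait plus its service time,
        # minus the gap since the last arrival (never below zero).
        wait = max(0.0, wait + prev_service - interarrival)
        total_wait += wait
        prev_service = rng.expovariate(service_rate)
    return total_wait / n_jobs

# Status quo baseline: delivery processes code twice as fast as it's written
print(simulate_queue(arrival_rate=1.0, service_rate=2.0))  # ~0.5 h average wait

# AI writes code ~1.8x faster, delivery stays the same: the queue blows up
print(simulate_queue(arrival_rate=1.8, service_rate=2.0))  # roughly 4-5 h average wait
```

The exact numbers don't matter; the point is the non-linear blow-up in waiting time once arrivals approach the rate at which you can review and deploy.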

And this is particularly a problem because with humans, the workday ends at some point. You don't have an infinite backlog because people don't work at an AI pace all the time, but the agents can, if you want to hook them up to your Jira backlog, your Linear queue, whatever it is. The reality is that for most organizations, you're not actually going to get the benefit, because you're going to be spending all your time reviewing the PRs and waiting for the deploys to go through.

Thumbnail 430

This theory aligns with subjective feedback. If you look at the DORA metrics, many teams are reporting an increase in instability from AI, minimal effects on product effectiveness, and, in general, more burnout. If you look at other industry benchmarks and dig in further, you see that a lot of the AI gains center around a 10% improvement, but that's widely distributed. It's basically a bimodal distribution. People who are really good at delivering software are getting most of the benefit. People who are average to mediocre are actually seeing anywhere from no benefit to negative effects from AI.

Thumbnail 470

Thumbnail 490

And this is a problem, obviously, because as technologists we want these tools to work, but also, we're not making investments sized for a 10% improvement on top of your senior engineers. We're spending hundreds of thousands to millions of dollars building data centers and running all kinds of inference. We need more than just incremental improvements. So, how can we fix this? Well, the simple, pithy answer is to go faster. In that simulation I showed before, we said, okay, the code comes in faster, but the delivery stays the same. Well, what if we just made the delivery faster?

Thumbnail 500

And you can see again the expected effects here. If you're able to ship the code faster, the curve kind of bends down. In some cases, even if you don't get any improvement from AI, if your agents are still producing at the same output as your developers but you've made your delivery faster, you can actually cut your delays. So this has benefit whether you're using AI or not. Okay, so we have a hypothesis of how we can improve this. We have some levers that we think we can pull. So what does it look like? How do we actually do that?
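Continuing the hypothetical `simulate_queue` sketch from above, you can bend the curve back down by speeding up the service side instead of (or as well as) the arrival side:

```python
# Same AI-accelerated arrival rate, but the delivery pipeline is 2x faster too
print(simulate_queue(arrival_rate=1.8, service_rate=4.0))  # back down to ~0.2 h

# No AI speedup at all, just a faster pipeline: still better than the baseline
print(simulate_queue(arrival_rate=1.0, service_rate=4.0))  # ~0.08 h
```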

Thumbnail 530

Thumbnail 540

Validation-First Strategy: Making AI Agents Work Through Better Testing

"Just go faster" is not a particularly actionable task. To start with, I love AI, I love agents, but I want to be clear: a lot of this isn't whiz-bang multi-agent orchestration. It's blocking-and-tackling engineering, stuff we've been doing for 15 to 25 years. Having things like acceptance tests that will reliably tell you if things broke. Being confident that when you get an automated PR from a bot, whether it's a CD bot or an AI bot, it didn't cause a regression. Being comfortable trusting the fact that if your tests say you're green, you're green, and you can go ahead and merge.

You have to make investments in your delivery pipelines, and I don't want to downplay this work. It's all super necessary, and it's very tractable. It's definitely something that everyone should invest in. AI is really great at helping out with this too. If you've got a bunch of batch scripts that are slow and doing a lot of subprocessing, you can rewrite them in a compiled language, and the AI will help you with that. So it's definitely a spot where AI benefits you, but you do have to do this work.

Thumbnail 600

Okay, but let's just assume for the sake of argument that's still not fast enough, that only gets you, you know, a slight reduction in your queuing delay. Well, how can you go even faster?

Thumbnail 610

Thumbnail 630

The way we think about this at CircleCI is that it really comes down to validation. I'm riffing on a really old saying about amateurs talking strategy and professionals talking logistics, but the idea here is: instead of obsessing about the latest and greatest way to prompt an agent, or the latest and greatest memory framework, or what have you, think about how you can validate the code. I'll try to make the case for what I mean by validation here.

This is a really simple loop. You make a plan, you do some work, and you have something that judges the quality of that work. If it passes, you go through. If it doesn't pass, you go back to the beginning and start over. This is the foundation of red-green-refactor, classic CI, and any of the local loops that agents go through. This is the pattern that's really effective at getting work done. You do something, judge the results, give feedback, and continue forward. So this is the basic recipe when we talk about validation.
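As a rough sketch of that loop (the helper names are mine, not anything CircleCI ships): propose a change, run the project's existing checks, and either accept the result or feed the failure output back into the next attempt.

```python
import subprocess

def run_checks() -> tuple[bool, str]:
    """Run the project's existing validation (tests, linters) and capture output."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def validation_loop(task: str, propose_change, max_attempts: int = 5) -> bool:
    """Plan -> work -> judge -> repeat until green or out of attempts.

    `propose_change(task, feedback)` is a stand-in for whatever agent call
    you use; it should apply an edit to the working tree.
    """
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        propose_change(task, feedback)   # do the work
        passed, output = run_checks()    # judge the result
        if passed:
            print(f"green on attempt {attempt}")
            return True
        feedback = output                # loop, with the failure as context
    return False

# e.g. validation_loop("fix the flaky login test", propose_change=my_agent_call)
```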

Thumbnail 670

To make a further argument for why I think validation is so important: it has a lot of nice properties. It's really scalable. The work you put into making sure you can validate code written by agents works for individual tasks; you probably use a version of this in your local loops, where Claude will run tests, linters, and formatting. It works if you want to tune prompts, by building a set of "we know this was a good change, we know this was a bad change" examples. And if you even want to go all the way to training your own reinforcement learning agent, that basic recipe I showed of task, output, evaluate, and loop is the fundamental recipe. So it works all the way from a developer on their machine up to the point where you want to train your own agent.

It's also durable. If you've been in the AI space at all, you're familiar with all of the churn: it's chain-of-thought, then it's graph-of-thought, then it's program-of-thoughts. There are so many techniques, and I'm glad people are researching them, but what ends up happening for developers is that they get trained into the underlying models. You need to chase techniques and tactics to some extent, and you want to use the tools that are available to you, but rather than making that the centerpiece of your strategy, you can probably steer your investment in more productive ways.

Finally, validation is tractable for most organizations. It's possible, but not straightforward, to run and train your own models. But making your tests faster, understanding why your deploys failed, understanding what good code review feedback looks like so that you can give it to the developers earlier, this is all information that's readily accessible. It's happening inside of your organizations. You're probably using a developer experience tool that is mining this to give you subjective feedback on your developers. So when you look across all of the things that you could do when it comes to agents, investing in how you validate their output, to me feels like the most effective high ROI option that we have.

Thumbnail 800

The other thing that I would recommend investing in is context. Whenever we're dealing with agents, one of the things you have to manage is what the agent knows, what it has to remember, and what it has to parse through in order to complete the task. The nice thing is that your validation pipeline becomes an input for the agent's context. Taking a log of Claude Code working through a backlog of tasks, having another AI summarize the results, and then feeding that back to improve the prompts, the tools, and the sub-agents you might be using is a really powerful technique. At the end of the day, you might just be changing a prompt, but you're driving it off of real feedback that's tuned to your organization and your codebases.
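Here's a deliberately simplified sketch of that feedback step, with every file name and helper hypothetical: collect the logs from a validation run, boil them down, and append the lessons to whatever rules file your agent reads (AGENTS.md, a prompt template, and so on).

```python
from pathlib import Path

def summarize_runs(log_lines: list[str]) -> str:
    """Placeholder summarizer. In practice you would hand the raw logs to a
    second model and ask for recurring failure patterns; here we just keep
    the failure lines so something concrete lands in the rules file."""
    failures = [line for line in log_lines if "FAILED" in line or "ERROR" in line]
    return "\n".join(f"- seen before: {line.strip()}" for line in failures[:10])

def update_agent_rules(log_path: Path, rules_path: Path) -> None:
    """Append lessons from the latest validation run to the agent's rules file."""
    summary = summarize_runs(log_path.read_text().splitlines())
    if summary:
        with rules_path.open("a") as rules:
            rules.write("\n## Lessons from the last CI run\n")
            rules.write(summary + "\n")

# e.g. update_agent_rules(Path("ci-run.log"), Path("AGENTS.md"))
```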

Thumbnail 850

Very quickly, when we talk about context, there are a few levers that I think are really important. You can control the tools, the task or prompting of the agent, and the check. This is a very simple version of what an agent does: you give it a task, and it keeps looping until it passes some standard. For that check, I would recommend a bottom-up approach. Start with the check. Then use that as a way to improve how you give tasks to the agent, and then extract those things into tools the agent can use. You use that validation as a way to ripple things back up until you end up with a library of things that you have really good confidence the agent can use and execute on effectively.

Thumbnail 900

I'll give a verbal version of the basic recipe that we've found really effective at CircleCI when working in our development teams: we use the feedback from CI pipelines or local test runs to get the model to produce the results we want.

Once we have a baseline, once we know that this is good, then we can start extracting that process into rules to get the model to do it all the time. So instead of everyone starting from scratch, one person figures out how to get a Playwright-based flaky-test-fixing process working, and then we're able to use that across all of our projects to perform that specific task.

And then, just like any library development process you might have done before, once you've got a few one-off encapsulated functions, you can turn those into rules that are more abstract. So it's taking the same techniques that we use with software and applying them to the prompts, all driven off of a clear indicator of whether what the agent did was good or bad.

Thumbnail 960

Introducing Chunk: CircleCI's Validation-Driven Agent for Production-Ready Software

All right, so I'll talk about Chunk now. Chunk is the agent we're building at CircleCI, built from the ground up with most of these principles in mind. The idea behind Chunk is to help make sure that your software is always validated and to make the process of validating that software faster, more reliable, and more effective.

Thumbnail 980

We built Chunk validation-first. Anytime this agent touches your codebase, it's going to execute your CI pipelines to verify that it works. This means you're getting feedback from real environments that you've already built. You already know that this environment and these commands are good enough to say that this code can go to production, so we're just going to reuse that. There's no real need to completely reinvent the wheel here and think too much about hooks or some other side process of development.

You've already configured a way to test whether these things are good. We also keep your software production ready. The first thing we're going after is flaky tests. When it comes to delivery, flaky tests are the bane of at least my existence. When you have a change that you know is good, but you have a test that is unreliable and can't always tell you whether something is good or bad, that slows everything down. It slows down humans, it slows down agents. It's sand in the gears of any kind of smooth delivery process.

We're also working on things like improving code coverage, which again is very useful context for agents to understand how files relate to each other and how they're tested effectively, and just generally handling a lot of the toil. Nobody really likes maintaining CI pipelines. I work at a CI company and I don't like writing YAML, but agents are really good at writing YAML, so things like build optimization and keeping your builds fast are all first-class tasks that we'll be adding into Chunk.

The other thing Chunk does, because we have context on all of the changes that occurred and whether they were good or bad, is follow a change all the way through, from the commit that came into GitHub to whether it was merged, deployed, or eventually rolled back. We're able to build a really good understanding of how to make an agent work more effectively. So instead of you having to build your own loops, your own feedback cycles, your own RL environments, we can take the results of your builds and use them as feedback to further tune the agents in a more automatic fashion.

Thumbnail 1110

All right, so a quick recap and takeaways. The AI agent adoption trend is quite real and growing. We're out of the realm of agents just reviewing PRs or triaging issues; they're writing real code. What we're seeing today is probably an underestimate of reality, but it is a real and growing trend. We're also seeing that this is causing some negative impact and definitely is not uniformly beneficial. Not all organizations are getting the results from AI initiatives that they would expect.

Thumbnail 1160

Also, hopefully I've convinced you that investing in the delivery process is a really good way to get going, and that the validation you do in delivery is the foundation of keeping your agents fast and reliable. That's all I have for you today. If you want to talk to me more about Chunk or see a demo, I'll be hanging out in the back, and CircleCI is at booth 1451. Great. Thanks y'all.


; This article is entirely auto-generated using Amazon Bedrock.
