
Kazuya

AWS re:Invent 2025 - Move fast & don't break things: Maintaining software excellence as you adopt AI

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Move fast & don't break things: Maintaining software excellence as you adopt AI

In this video, Ganesh, CTO at Cortex, discusses maintaining software excellence while adopting AI coding assistants. He presents survey findings showing that 70% of engineering leaders rank security and quality regressions as their top concerns with AI adoption. The talk examines how AI amplifies existing risks—developers understand less code, ownership becomes unclear, and incident management suffers. Ganesh emphasizes strengthening foundational practices: robust testing, clear ownership accountability, comprehensive security across all repositories, and production readiness processes. He introduces Cortex's approach of using scorecards for AI readiness and maturity, measuring customer impact metrics like MTTR and incidents rather than just AI adoption rates, and leveraging the Magellan AI engine for automatic ownership mapping.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part


The Hidden Trade-offs of Moving Faster: What Engineering Leaders Worry About Most

Hello. I'm really excited for this talk. I'm going to introduce myself in just a second, but first, let me provide some context for why I'm excited. I think a lot of the conversations I'm having these days are really about moving faster, shipping more, and deploying more. But are we talking about the downside of doing that? What are we giving up? What is the trade-off? As engineers, we always know that every decision is a trade-off decision. Moving faster is a trade-off decision. You are giving something up along the way. But maybe you don't have to. So how do you mitigate some of those trade-offs that you have to make along the way? That's what we're going to be talking about today.

A little bit about myself: I'm Ganesh, one of the co-founders and CTO at Cortex, which is an internal developer portal. Everything from cataloging all of your services and defining ownership to setting best practices and standards with scorecards to defining golden paths and letting you drive those things across the organization—that's our bread and butter. That's what we do. I'm excited to talk today about maintaining software excellence in the age of AI.


Instead of coming up here and talking a lot about what exactly that is, I thought we would take a different approach and start with first principles. What does that even mean? What does software excellence mean? What should we be focused on, and how do we work backwards from there so that we can walk away from this with an understanding of how to break down the things that matter to your organizations as well.


Let's talk about what leaders are worried about today. If you're thinking about software excellence in an organization, you're generally thinking about it from the perspective of the entire organization in aggregate. Let's break it down from what people care about. If we know what people care about, then we can work backwards to what are the things we should be investing in.


This year we ran a survey across a bunch of engineering leaders about AI adoption, their concerns, and the things they're investing in, because we wanted to get a state of the industry. One of our takeaways was that 70% of engineering leaders rank security and quality regressions as their top concerns. Obviously there are other concerns in there as well, but these are the top two. If you think about it, this makes sense. We're introducing AI that's writing code that we don't necessarily understand. It could be leaking secrets into our code. There are all kinds of new risk factors that we're introducing into our software development lifecycle. The unknowns here are what it is doing to our quality and what it is doing to our security.


But that's kind of a high-level thing. Quality and security—you kind of wave your hands and obviously everyone cares about this, but what does that actually mean in the day to day? Let's go to the second-order effects and break that down one level further. Quality regressions first: what do quality regressions actually mean in terms of things you can measure, things that actually drive impact? Quality regressions show up as more incidents, right? More incidents because you're shipping things you don't necessarily understand, you're trying to move too fast, and you don't have the right guardrails. You might be breaching your SLAs or your SLOs because maybe you're introducing new performance concerns or latencies, introducing new incidents. The cost of maintaining your business goes up—the keep-the-lights-on time. Maybe you're spending more time on incidents and resolution and escalations and bugs and things of that nature.

But at the end of the day, the thing this causes is customer impact. Your customers, whether that's another business or an end user, are impacted, and you lose their trust over time. These are things that you can measure. With quality regressions, you know whether or not you're having more incidents. You know if your SLAs or your SLOs are getting breached. You know what your MTTR looks like. These are the things that show up as second-order effects of quality starting to regress.
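To make those measurements concrete, here is a minimal sketch—plain Python with invented incident records—of computing MTTR per service from the kind of data an incident tracker exports:

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records exported from an incident tracker.
incidents = [
    {"service": "checkout", "opened": datetime(2025, 11, 3, 9, 15),
     "resolved": datetime(2025, 11, 3, 11, 40)},
    {"service": "checkout", "opened": datetime(2025, 11, 10, 22, 5),
     "resolved": datetime(2025, 11, 11, 0, 12)},
    {"service": "search", "opened": datetime(2025, 11, 7, 14, 0),
     "resolved": datetime(2025, 11, 7, 14, 45)},
]

def mttr_by_service(records):
    """Mean time to resolution (in minutes) per service."""
    durations = {}
    for r in records:
        minutes = (r["resolved"] - r["opened"]).total_seconds() / 60
        durations.setdefault(r["service"], []).append(minutes)
    return {svc: round(mean(mins), 1) for svc, mins in durations.items()}

print(mttr_by_service(incidents))
# {'checkout': 136.0, 'search': 45.0}
```

Tracking that number per team or service over time is what turns a quality regression from a feeling into a trend you can act on.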

The second thing is security risk. This one is a bit more open-ended, but I wanted to focus on two things that we know coding assistants are causing. In a recent meeting with a few engineering leaders, secret leaks, and vulnerabilities leading to data breaches, came up as the two things engineering leaders were most concerned about. Interestingly, the pattern we found was that businesses focused on consumer products were the most concerned about data breaches because, at the end of the day, consumer trust—especially if you're building a premium product—means the data of your customers is extremely important. I'm not saying that's not true for an enterprise business as well, but it happens to be one of the differentiating factors of a consumer business.

Things like secrets are another area of risk. If the human is no longer in the entire loop of writing code, then mistakes we know as humans to avoid—like copy-pasting a secret into our environment files—are more likely to happen now with LLMs. Obviously, you can put guardrails around these kinds of things, but these are the types of risks that people are concerned about. So now we've gone from what the concerns are—security and quality—to what impact we're seeing as a result of those two things being a potential risk.


Why AI Amplifies Existing Problems in Your Software Development Lifecycle

Talking a little bit about why now: we've always cared about these things.


We've always cared about things like MTTR and incidents, but why is there a renewed focus now? Why are engineering leaders more concerned than they were before about these particular things? Let's go through the breakdown of what is happening to our software development life cycle as a result of this. If you looked at the recent report from the DORA organization, they came to a similar conclusion: AI is basically an amplifier. Things that you do well, it helps you do more of. Things that you're not so great at, it's going to make those things worse as well. AI really is an amplifier or a mirror, whatever you want to call it.

Breaking this down, we know AI is writing more of our code. The exact percentage of code being written by AI is not particularly relevant here. We know that AI is writing more code and that code is getting shipped to production. Developers are understanding less of it. This was already a problem when humans were writing all the code. I understand the code that I write for the most part, but my peers, my teams, and my cross-functional peers in other teams may not necessarily understand that code either.

Now, if I'm not the one writing my own code, not only do I not understand it, my team understands it even less. You're creating a pile-up of pull requests that people have to review. People are going through those reviews faster than they used to, and they're giving them less attention. You're slowly building up gaps in tribal knowledge within the organization. Ownership becomes very unclear. We already see this today at a service level. That was one of the reasons we founded Cortex in the first place: clear, accountable ownership of services drives the behaviors you want across the organization. The same thing holds true for your code.

If AI is writing more of your code, you end up in a state where you don't know what was written and you don't know much about it. But whether it's your PR or somebody else's, we still need to have accountability. As a result of all of this, you end up managing incidents in a haze. Many organizations don't have clear ownership of their services to start with. Now all of a sudden you don't have clear ownership of your code. You don't really understand what's being written or what's being shipped. Incidents are already painful. We're already scrambling to find the right information in the right context in those moments. When we understand that code even less, that's leading to increased risk during incidents.


AI is amplifying existing risk factors. These are things that happen today, things that were happening before coding assistants, and AI is just making them worse. Now that we've defined the concerns that we care about as engineering leaders, what the impact of those things is, and why coding assistants in particular are amplifying that impact, let's break it down. What can we do to actually mitigate that risk? Let's go step by step.

Back to Basics: Strengthening Engineering Foundations to Mitigate AI Risk

We're talking about AI writing more code, developers understanding less of it, ownership being unclear, and managing incidents in a haze. Let's think about the practices that can help us with code quality and testing. Those of you who have adopted more agentic strategies for your coding assistants have realized that writing tests is a great thing. Humans should be involved in writing more of those tests because we can define the guardrails of what the code should do, and then your agents can operate within those boundaries.
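As an illustration of that split between human-defined guardrails and agent-written implementation, here is a minimal pytest sketch. The `billing` module and its `calculate_refund` rules are hypothetical—the point is that a person encodes the expected behavior before any agent touches the code:

```python
# test_refunds.py - human-authored guardrails for agent-generated code.
import pytest

from billing import calculate_refund  # hypothetical module the agent will implement

def test_full_refund_within_trial_period():
    assert calculate_refund(amount=100.0, days_since_purchase=5) == 100.0

def test_no_refund_after_policy_window():
    assert calculate_refund(amount=100.0, days_since_purchase=45) == 0.0

def test_negative_amounts_are_rejected():
    with pytest.raises(ValueError):
        calculate_refund(amount=-1.0, days_since_purchase=1)
```

However the agent ends up implementing `calculate_refund`, it has to stay inside the behavior those tests pin down.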

Even if we don't understand the code that we're pushing out in its entirety, being able to know and detect when things go wrong is at least a good first step. That way your customers are not realizing things are going wrong and you can catch those things first. Set up monitors and SLOs and all the things that we know about. SLOs can capture customer impact. If you can't necessarily measure the input into that, at least measure the output. Measure the things that are causing customer impact so you know when things are getting too bad.
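As a sketch of what "measure the output" can look like, an availability SLO reduces to a simple error-budget calculation over request counts; the 99.9% target and the numbers below are made up:

```python
# Minimal availability SLO check: are we still inside the error budget for this window?
SLO_TARGET = 0.999  # 99.9% of requests succeed (illustrative target)

def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left for the window (negative means breached)."""
    allowed_failures = total_requests * (1 - SLO_TARGET)
    if allowed_failures == 0:
        return 1.0 if failed_requests == 0 else -1.0
    return 1 - failed_requests / allowed_failures

# Hypothetical counts pulled from your metrics backend.
remaining = error_budget_remaining(total_requests=2_000_000, failed_requests=1_400)
print(f"Error budget remaining: {remaining:.0%}")  # 30% left; alert before it hits zero
```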

For security practices, don't let repos linger out in the wild. This is one of the anti-patterns we see: focusing only on the repos we know are active or critical. Those are not the only repos with code. All of your repos contain code that's a potential risk factor. Make sure that you're understanding the impact across all your repositories and make sure that you're enforcing security practices across all repos, not just the ones that you know about.
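One rough way to avoid lingering repos is to enumerate everything the organization owns rather than a hand-picked list. The sketch below uses the GitHub REST API with a placeholder org and token; the `security_and_analysis` block is only returned when your token and plan expose it, so a missing field is simply treated as "not confirmed enabled":

```python
# Enumerate every repo in the org and flag those without secret scanning enabled.
import os
import requests

ORG = "your-org"  # placeholder
HEADERS = {
    "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
    "Accept": "application/vnd.github+json",
}

def all_repos(org):
    """Yield every repository in the organization, page by page."""
    page = 1
    while True:
        resp = requests.get(f"https://api.github.com/orgs/{org}/repos",
                            headers=HEADERS, params={"per_page": 100, "page": page})
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return
        yield from batch
        page += 1

unprotected = [
    repo["full_name"]
    for repo in all_repos(ORG)
    if (repo.get("security_and_analysis") or {})
       .get("secret_scanning", {}).get("status") != "enabled"
]
print(f"{len(unprotected)} repos without confirmed secret scanning:", unprotected[:10])
```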

Finally, bring the human in the loop and keep the human in the loop. Create clear accountability and up-to-date ownership. This creates the right culture because if I know that the buck stops with me and my team for the SLOs and the quality of my services and the impact for customers of the services that I own, then that creates the right behaviors. I'm going to go back and invest in those tests so that I'm shipping better code. I know that I'm being held accountable for the things that I'm delivering. Making sure that you have clear, accountable ownership across all of your systems is more important than ever before.


It turns out engineering leaders that we surveyed also agree with these practices. Every single engineering leader we surveyed believes that security, testing discipline, and ownership clarity matter.


But I want to pause for a second. It feels like I'm probably regurgitating things that we've already talked about. We know that most organizations still don't have strong production readiness or security processes. Most organizations have a vibe, for lack of a better term, of a production readiness process. We generally know these are the things we should be doing. We probably should have monitors, we probably should have SLOs, we probably should have good code quality, but those are all part and parcel of a good production readiness program. When we say before something goes to production, we want it to meet all these requirements, not just as a checklist, but because these are the things that can help mitigate risk when something goes to production in the first place.


The thing that I wanted to quickly highlight here is that this is not new. These are things we've been talking about for a very long time. We've been talking about production readiness processes and their importance for software excellence for a long time. We've been talking about ownership and accountability as a key thing for a long time. Testing is not new. None of these things are new. These are things we've been discussing for a long time. But going back to the earlier point, AI is an amplifier, and we should focus on the things that AI is amplifying and focus on improving those things. We could boil the ocean and think about all the things that we can improve, but most organizations have a lot of things they could be improving at any given time.


So let's focus now, as we're adopting AI coding assistants, on the things that are becoming bottlenecks or choke points in our SDLC. Because AI affects quality, we know that things like testing, accountability, monitors, and SLOs are the things that matter. It just happens that these are basic engineering foundations. These are practices that we know we need to be following, but because AI is amplifying them, let's go focus on those things and make them better. So I want to zoom out a little bit and talk about AI excellence in the aggregate. How do we roll out AI coding tools with confidence?

Building AI Excellence: How Internal Developer Portals Drive Readiness and Impact

So, kind of summarizing what we've talked about: you want to build on a stable foundation. You're not going to build a skyscraper on a wonky surface. You're going to build on a stable foundation. Things like the right testing practices and the right test frameworks, investments in CI tooling, investments in production readiness, security processes—those foundations will help you adopt AI. They'll help you regardless, by the way. So you should probably do those things anyway, but especially as you're adopting AI coding assistants, these things are more important than ever before. So build on a stable foundation.

Second, you want to measure the impact of coding assistants on those metrics. When I say impact, I don't mean what a lot of people in the industry are talking about today, which is that our AI coding assistants are helping us move faster. That's only one part of the equation. The thing we've been talking about today is the customer impact. So for teams adopting coding assistants, not only are they moving faster, but are they improving or degrading quality? Is their MTTR better or worse? Is the number of incidents better or worse? So think about the actual downstream impact.

So don't focus on the AI metrics on their own. Focus on the second-order impact. One of the things I see a lot of teams get hung up on is, "I want to measure what percentage of lines of code are written by AI, and we can use that to correlate with incidents." My challenge to you is why does that matter? We're not asking the question of how developers who use Vim or IntelliJ compare. We just say you use your tools, what we care about is the end result, the end impact. And so that holds true even now, but we can build that correlation of, as we're rolling out these coding assistants, do we know the impact it's having on the organization? So measure the impact from the customer lens.
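A hedged sketch of that correlation view: given per-team adoption and delivery metrics (all numbers here are invented), you can check whether heavier assistant usage tracks with better or worse recovery times. This uses `statistics.correlation`, available in Python 3.10+:

```python
# Correlate coding-assistant adoption with downstream outcomes, per team.
# All figures are invented; the point is to join adoption data with
# customer-impact metrics (MTTR, incidents) rather than report adoption alone.
from statistics import correlation  # Python 3.10+

teams = {
    "payments": {"adoption_pct": 82, "mttr_minutes": 95, "incidents_per_month": 4},
    "search":   {"adoption_pct": 40, "mttr_minutes": 120, "incidents_per_month": 6},
    "platform": {"adoption_pct": 65, "mttr_minutes": 88, "incidents_per_month": 3},
    "mobile":   {"adoption_pct": 25, "mttr_minutes": 140, "incidents_per_month": 7},
}

adoption = [t["adoption_pct"] for t in teams.values()]
mttr = [t["mttr_minutes"] for t in teams.values()]

print(f"Adoption vs. MTTR correlation: {correlation(adoption, mttr):+.2f}")
# A strongly positive value would be a warning sign: more AI usage, slower recovery.
```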


And then finally, make the most of them. Organizations are spending a lot of money on coding assistants, and it is a new technology. It's a new set of practices that we're adopting across the organization. A lot of folks here have been through their cloud journey. That was one iteration of this, where it was new muscles and you were spending money on AWS and you wanted to make sure you were getting the most out of it. Coding assistants are no different. You're spending good money on coding assistants. Make sure that you're taking full advantage of them.

So when we think about the IDP, or internal developer portal—this is the shameless plug for Cortex—we see the IDP as a very powerful way of driving these three parts of AI excellence. In phase one, driving readiness, this is one of the core use cases for something like Cortex as an IDP: defining best practices and scaling them across the organization. Customers have been using Cortex for the past five years to drive things like production readiness. Customers in e-commerce and tax preparation use Cortex for things like, "We're going to get ready for Black Friday every year. We're going to make sure all of our services are up to speed for Black Friday or for tax season." We know security teams are using Cortex to drive security standards and shift left. Now you can use Cortex scorecards to drive AI readiness: making sure that every repository has an owner, that every repository has secret scanning and vulnerability scanning enabled, that you're reporting on code coverage, and that code coverage is not regressing. Being able to set those guardrails and then letting your teams have the autonomy to operate within them is something that scorecards are very good at.
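As an illustration—this is not Cortex's actual scorecard syntax—the readiness checks named above boil down to simple per-repository assertions that you can evaluate and report on:

```python
# Illustrative AI-readiness checks per repository (not Cortex's scorecard syntax);
# just the underlying assertions expressed as plain predicates.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Repo:
    name: str
    owner_team: Optional[str]
    secret_scanning: bool
    vulnerability_scanning: bool
    code_coverage: float       # current coverage on the default branch
    previous_coverage: float   # coverage at the last release

READINESS_CHECKS = {
    "has_owner": lambda r: r.owner_team is not None,
    "secret_scanning_on": lambda r: r.secret_scanning,
    "vuln_scanning_on": lambda r: r.vulnerability_scanning,
    "coverage_not_regressing": lambda r: r.code_coverage >= r.previous_coverage,
}

def readiness_report(repos):
    """Map each repo to the list of checks it currently fails."""
    return {r.name: [name for name, check in READINESS_CHECKS.items() if not check(r)]
            for r in repos}

repos = [Repo("checkout", "payments", True, True, 0.81, 0.79),
         Repo("legacy-cron", None, False, True, 0.40, 0.55)]
print(readiness_report(repos))
# {'checkout': [], 'legacy-cron': ['has_owner', 'secret_scanning_on', 'coverage_not_regressing']}
```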

The second thing that Cortex provides is measuring impact with a very strong focus on customer impact. So it's not just adoption rates or usage of AI tools. It's about the correlation between AI usage and things like MTTR and incidents and cycle time, the things that we know are bottlenecks and friction points. And then finally, AI maturity. So making sure that you're adopting the best practices.

This can be everything from whether we're setting up agent instruction files in a repository—so that we're able to make the most of the tools regardless of whether we're using Claude or something else—to whether we're following best practices in our code reviews. If you're following spec-first development now, are your specs actually good? And how do you scale this across your organization? How do you make sure that every repo and every team is adopting this?

I think one of the challenges you'll run into as you're trying to roll out AI coding assistants is that there are very different levels of maturity across different teams and individuals, and this is a new technology. So we want to make sure that we're bringing everyone with us. The same way that a production readiness scorecard helped bring the entire organization into a shared view of what good looks like when we're going into production, AI maturity scorecards can do the same thing for AI adoption. We can say, hey, this is what good looks like. Here's how you can make the most of the coding assistants you've been given access to, and then bring everyone with you along the way.


A couple of things that Cortex does to make this journey a little bit easier is that Cortex has Magellan, which is our AI data engine, where we can automatically map your entire engineering ecosystem and catalog it. When I was talking about ownership and accountability, that's a very hard problem to solve as humans in the loop. You might have thousands of repositories. How are you meaningfully going to go out and define ownership across the entire ecosystem?

The approach we're taking there is: if we're trying to help you adopt AI, why not use AI to solve some of those problems in the first place? Magellan can help you automatically figure out which team is accountable for which repository and define that for you up front. The same thing applies to automatically mapping things like PagerDuty rotations. We also provide out-of-the-box scorecards for things like AI readiness and AI maturity. We have out-of-the-box AI impact dashboards as well, and you can query all of this data through our MCP.

If you're an engineering leader and don't want to learn a new tool, you can just use the MCP—it's very easy to query. And then engineering intelligence allows you to measure the impact of these things directly within the IDP. We wrapped up a little bit early. I want to make sure I was on time here because I do have a tendency to go over. We have two minutes left. If you have questions, you can find us at booth 450. Feel free to add me on LinkedIn for additional thoughts, and I'm happy to connect and answer questions there as well. Thank you all so much. I hope this was helpful, and have a great rest of the event. Thanks everyone.


This article is entirely auto-generated using Amazon Bedrock.
