Kazuya

Posted on Dec 11, 2025

AWS re:Invent 2025 - Breaking AWS networks on purpose to build resilience (DEV343)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Breaking AWS networks on purpose to build resilience (DEV343)

In this video, Craig Johnson, Principal Solutions Architect at Forward Networks, demonstrates how to introduce controlled chaos into AWS networks to build resiliency. He explains using AWS Network Manager for network visualization and emphasizes the Reachability Analyzer as a critical tool for intent-based verification. Johnson shows how to establish baseline checks before changes, intentionally break network components (security groups, routes, Transit Gateway attachments), then use pre and post-change intent checks to verify network functionality. He advocates for automating these checks in CI/CD pipelines and conducting regular "chaos hours" as DR drills, providing a GitHub repository with Terraform code for implementation.

This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction: Breaking Networks to Build Resilience

Morning, morning. How's everyone going? So my name's Craig Johnson. I am a Principal Solutions Architect at Forward Networks. So quick show of hands, who here would consider themselves a network engineer, network guy, something like that? Okay. Who here has taken down a production network before? Awesome, awesome. Everyone has. So what I want to talk today about is introducing a little bit of chaos into your networks in an effort to build more resiliency, an idea that we've taken from on-prem data center networks. I want to apply the same kind of logic into our AWS networks.

So a little bit about me. Most of my career has actually been on the data center side. I spent many, many years at Cisco. I've got the CCIEs. They're all expired these days. Most of what I do is on the CLI. I'm going to apply a lot of those same principles here. Also, if you're interested in the Community Builder program, definitely reach me afterwards. It's a great program. It's probably the best thing I've done in my career in the last five years. And I also run a local networking group. If anyone's in the Dallas area, feel free to come talk to me about that.

So let's set a few ground rules. Like I said, I'm a CLI junkie. Most of what we're going to do is going to be using the CLI and actually putting these ideas into practice. Being an engineer, a network engineer, is hard. It's always the network. That's what the shirt says. It's always the network, and everyone always blames the network, and that's the mindset we're going to go into. It's natural that they blame the network. The network is the glue of everything inside your environment.

So when something doesn't work, you know, I can't say it's the database. Can't say it's the application. The network is the thing that connects all of those things together, so that's the kind of mindset you have to take. You're always having to disprove it's the network, and I'm going to show you how to do a little bit of that to give you some of that confidence and to give you the way to say, hey, I definitely can show management and everyone else the network is not the problem here.

Like I mentioned, I'm going to show you some screenshots, and that'll be useful, but I'm going to show you mostly what we do either via pipelines, CLI, infrastructure as code. This is going to come from a repo that I have that I've created that'll actually, you'll be able to download. At the end, you'll have a QR code for it. You can see the code that I've done to create this and actually run some of the scripts that I've done to be able to actually introduce some of this chaos into your environment.

And of course, what I don't like about doing things in the cloud is a lot of my traditional network engineer tools don't really work anymore. You know, I used to be able to log into a router. I'd SSH into my Cisco router, run a few show commands, show some routes, do some pings. I can't really SSH into a Transit Gateway to look at its routes. I can go to that screen and kind of see it, but I can't really look at a routing table. There really isn't a routing table to look at. I don't have access to the same level of functionality. Now, some people will say, well, just turn on things like flow logs, and you get that. I'll explain why that's not a great idea in a second.

But as I said, all of the code that I'm going to use today is available at the end of this presentation. You'll be able to go to my repo, download it, play with it. It's an environment you can set up, but you can also take the concepts into your own environment. Okay, so today, the game plan, what we're going to do. I'm going to talk about breaking your network for fun and profit. This is going to be showing how I'm going to introduce some of this controlled chaos into your environment.

Establishing Baselines with AWS Network Manager Visualization

But first we have to gather a baseline. So before you make your change, you know, if you've ever done any network changes, before you go to your change control board, you have to have a baseline of what your environment looks like and to verify everything works before that. How do we do that and how do we prove that out? Do I actually trust what my applications are saying? And this is always the disconnect between the network team and the rest of the IT crowd, is that, well, the network sees the flows, but you don't always really know if the application is actually working as it is. I'm going to show you why we don't necessarily need to trust those, but we can get our own metrics on this.

I want to introduce you to the idea of verifying my network based on intent. So when I do a change, I've got a MOP, and I'm like, okay, I'm going to make these route changes, these attachment changes, these changes to an ALB. I don't want you to think about it that way. I want you to think about what is the intent of my change, like what flows do I want? Where's my source, my destination, ports and protocols? Is that actually working? Whatever I change in the middle doesn't matter so much because oftentimes I'm not the only one making the change. I can verify that my change was done correctly. I implemented it the way I meant to. That doesn't mean the intent was right. So I want to give you a view on that.
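To make the idea of "intent" concrete, here is a minimal Python sketch of how a flow could be written down as data rather than as a list of device changes. The class name, field names, and resource IDs are all hypothetical placeholders, not something taken from the talk or its repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntentCheck:
    """One flow the network is supposed to allow (all values are illustrative)."""
    name: str
    source: str            # e.g. an instance or ENI ID
    destination: str
    protocol: str          # "tcp" or "udp"
    destination_port: int

    def describe(self) -> str:
        return (f"{self.name}: {self.source} -> {self.destination} "
                f"{self.protocol}/{self.destination_port}")


# Hypothetical flows; the IDs are placeholders, not real resources.
INTENTS = [
    IntentCheck("web-to-db", "eni-web-placeholder", "eni-db-placeholder", "tcp", 5432),
    IntentCheck("app-to-cache", "eni-app-placeholder", "eni-cache-placeholder", "tcp", 6379),
]

for intent in INTENTS:
    print(intent.describe())
```

Nothing in the record says which route table or security group sits in the middle. It only states the source, destination, port, and protocol that should work before and after the change.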

We're going to iterate on this a little bit, so I want to show you how we can take this and actually iterate through your environment, so you're not just saying, okay, do this one time, do it for one flow.

We can do this in a more automated way so that we're not doing this manually. Now, a lot of people still do the click-offs manually, but this will let you integrate it into your pipelines. My code has examples of that too.

So it turns out there's actually quite a few network engineer tools in AWS that a lot of people don't know about, and it kind of makes me sad that people don't really understand some of these tools that we can use. In lieu of me SSHing into a device, I can use AWS Network Manager to get a good view of my entire environment. If you are in one or more VPCs, you've got transit gateways, you've got transit gateway connects, you've got direct connects, VPNs, lots of tools that I want to show you here. The main thrust I want to really focus on is using the Reachability Analyzer. This will help me look at the intent of my environment and help me figure out what's going on here.

So it turns out you can actually have quite a bit here that honestly, I don't see many people using because you just kind of assume the network works. But when it doesn't, you kind of rush to some of these to figure things out. I want to give you an idea of how to use this today to figure these things out.

So the first thing you want to do, and I rarely see people do this, is get a good visualization of your network. Now if this was a data center network, this would be me either opening up Visio or Draw.io, connecting my routers together. Maybe I've got an automated tool to do this, but generally it's just doing those things and connecting them. Way back in the day, getting one of those really big plotters and making my nice network map with all of my top of rack switches and everything together. I fortunately don't have to do that on AWS.

I can use AWS Network Manager to visualize all of my AWS components, my transit gateways, my direct connects, my VPNs, my transit gateway connects just by registering in Network Manager. It doesn't cost you anything, and it's a way for you to pull up all of this environment together. So it's super easy to do. All you really need to do, and of course you can do this via the GUI, but like I said, CLI, you create a global network in Network Manager. You create the sites that you're going to do. So it's not even just your AWS components. Obviously, you're going to have direct connects, VPNs, things that are going to connect to external. I can create sites and actually show you where those things are connected to, to create this kind of picture of your environment.

Each of those references the same global network ID. You could even put locations in, because this will give you a global map to show you where these things are, so I can say, hey, here's all of my end-to-end network, not just AWS but everything else with it. I can create devices, so obviously I don't manage those devices, but I can add like my Cisco routers, my on-prem firewalls into this, same thing there, and then create links to show how they connect and even put bandwidth on those links. Now, once I've done that, I can actually even do this across accounts too. Many people don't have only one account. You can do this across multiple accounts as long as you give the right level of IAM access. You can create the entire environment no matter how big or small it is.
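The sequence described above (global network, then a site, a device, and a link that all reference the same global network ID) could look roughly like this in Python. This is a sketch that assumes a boto3 Network Manager client is passed in as `nm`. The function name and the shape of the `*_spec` dictionaries are made up for illustration, and the talk's repo uses Terraform rather than this code.

```python
def build_global_network(nm, description, site_spec, device_spec, link_spec):
    """Create a global network plus one site, one device, and one link.

    `nm` is expected to be a boto3 Network Manager client,
    e.g. boto3.client("networkmanager"). The spec dicts are hypothetical.
    """
    gn_id = nm.create_global_network(
        Description=description,
    )["GlobalNetwork"]["GlobalNetworkId"]

    # Every later call references the same global network ID.
    site_id = nm.create_site(
        GlobalNetworkId=gn_id,
        Description=site_spec["description"],
        Location=site_spec["location"],  # {"Address", "Latitude", "Longitude"} as strings
    )["Site"]["SiteId"]

    device_id = nm.create_device(
        GlobalNetworkId=gn_id,
        SiteId=site_id,
        Description=device_spec["description"],
        Vendor=device_spec["vendor"],
    )["Device"]["DeviceId"]

    link_id = nm.create_link(
        GlobalNetworkId=gn_id,
        SiteId=site_id,
        Description=link_spec["description"],
        Bandwidth={"UploadSpeed": link_spec["mbps"], "DownloadSpeed": link_spec["mbps"]},
    )["Link"]["LinkId"]

    # Attach the link to the device so the topology view can draw it.
    nm.associate_link(GlobalNetworkId=gn_id, DeviceId=device_id, LinkId=link_id)
    return {"global_network": gn_id, "site": site_id, "device": device_id, "link": link_id}
```

Passing the client in keeps the function easy to dry-run against a stub before pointing it at a real account. Registering transit gateways and handling errors are left out here.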

So once I have created this, this creates a dynamic map of my environment. So I can show, and it also updates whether things are up or down. So you see the purple links are my transit gateway connection, the green links are my VPN connectivity or direct connection. Everything, including my VPCs, transit gateway connects, everything that I have in this environment is going to show up in this environment. As things get updated, as you add new VPC attachments, as you add firewalls, as you add anything to this, it's going to update. So you can always say, hey, I know before I make any of my changes, this is a visual representation of my network so that you can go to your change control board and say, the network is in a good state, at least from a high macro level, to show me how my entire global network is set up.

Intent-Based Verification Using Reachability Analyzer

OK, now that you've done that, let's move on to actually doing a couple of guardrails here. So a lot of people say, well, I can just turn on flow logs, or I can use synthetic transactions. There are a lot of good observability tools to do that. It doesn't really help on a network side. They are super useful though. They give you a steady state of what the traffic of my network looks like. So before I've made the change, I can get an idea of, OK, here's the traffic that's going on in my environment. Here's what it looks like. What does that mean for the rest of the environment though?

Now, as a network guy, I often have no idea what my applications are doing. Things are deploying, spinning up, spinning down all the time. These tools give me a really great way to say, OK, here's the traffic in my environment. They are kind of pricey to leave on all the time, though. Turning on flow logs, if you have a lot of traffic that generates synthetic data, tells me source, destination, five-tuple data for all that. That's kind of expensive to leave on all the time, so it's something you may not want to do if you don't need that level of data analysis.

For outages, when we are actually trying to figure out what's wrong, they're not so useful. And the reason is these are things that are looking at the state of the network, looking at the traffic. If I've broken something and it's down, the traffic's not flowing. It doesn't tell me where something's going on. And it also has a lot of different places to look. I've got more than a few VPCs. I've got to look at flow logs in multiple places. There's some good AI tools that'll help you figure this out, but kind of not very useful here. But definitely have them when you need them, turn them on. You've got a good monitor of what the network looks like if you have transactions down, Application Load Balancers, things like that.

Okay, so let's talk about intent checks. Intent checks are how we're going to build this model of the environment. They provide a way to ensure changes I make never break the network. The idea is that whenever I make a change, I want to make sure I exit that change window never having had a broken network. And I need a way to prove it not just to myself, but to all the other stakeholders. So network changes are by far the largest source of outages that you will see in an environment. Now, network changes can encompass a lot of different things, but the reason that is, is back to my original point, the network touches everything, so they're going to cause the most outages. We need to provide a way to prevent that.

I should know all my critical flows. This brings me back to the previous slide. It gives me an idea of what my critical flows are, or at least I know what my applications care about. This source goes to this destination through this load balancer, whatever it happens to be. Flow logs only check for actual traffic. They don't help me when things are down. I need a way to make sure the network is performing like I expect it to pre and post change, and that's really the key, pre and post.

How can I do that though without relying on a bunch of application verifiers on the phone saying, yep, my app looks good, my app looks good. In any size environment that could be 100 people on a call checking for these things. I need to be able to do this myself and as a network guy, I don't necessarily have visibility in those applications. This is where Reachability Analyzer comes in. It is absolutely your best friend in making network changes, and I highly recommend you start using it.

What this means is we're going to introduce a little bit of chaos into our environment, but we're going to use Reachability Analyzer to help me figure out if what I've done actually breaks the environment. Now, the simulated change we're going to make here, we can make an actual real change. My environment here is actually going to show more of a simulated level change. I want to make sure any change that I make doesn't affect the intent of the network though.

Now what kind of change could it be? Could be a messed up route. I could mess up a security group, Network ACL. Maybe I screwed up a Transit Gateway attachment. It doesn't really matter. You can use this for a macro view, Transit Gateway Analyzer if I want to go across regions or even more of a micro view. Look at connectivity between instances in a VPC or a Lambda or whatever. So let's go ahead and break something. We're going to go ahead and create some chaos inside our network.

Implementing Controlled Chaos and Automated Pipeline Integration

So there's lots of fun things we could do. I could block something in a security group, misconfigure an attachment somewhere, maybe create a blackhole route, change Application Load Balancer targets, or even just ruin my entire AWS Network Firewall policy. There are lots of things like that which, just in the course of making changes, are pretty common and super normal to see in an environment. Now, before you start doing this, you might not want to do this on your production network. It's your career, you handle it how you want to, but you might want to do this in a twin of your environment or use my repo to actually practice some of this. Once you get a little better at this, you can start to introduce some of this controlled chaos, but maybe not just yet.
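One of the breaks listed above, blocking something in a security group, can be made reversible by wrapping it in a context manager. The sketch below assumes a boto3 EC2 client is passed in as `ec2`. The helper name `broken_ingress` is hypothetical, and this is not the randomized break script from the speaker's repo.

```python
from contextlib import contextmanager


@contextmanager
def broken_ingress(ec2, group_id, permission):
    """Temporarily revoke one ingress rule, then put it back.

    `ec2` is expected to be a boto3 EC2 client; `permission` is one
    IpPermissions entry that currently exists on the security group.
    """
    ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=[permission])
    try:
        yield
    finally:
        # Always restore the rule, even if the checks inside the block raise.
        ec2.authorize_security_group_ingress(GroupId=group_id, IpPermissions=[permission])
```

A call might look like `with broken_ingress(ec2, "sg-placeholder", {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432, "IpRanges": [{"CidrIp": "10.0.1.0/24"}]}): ...`, with the intent checks run inside the block. The group ID and CIDR there are placeholders. As the talk suggests, a lab or twin environment is the safer place to try it first.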

So let's go ahead and run our preflight intent checks. Okay, so this is Reachability Analyzer. Notice what I've done here is I have a series of checks that are here. These are all point in time checks though. So every time that I do this on a baseline, I do this before and after my change, I get a point in time that says, yes, the network is functioning correctly. And when I say the network is functioning correctly, what that means is I've specified a source interface, source port, source IP address, destination, ports and protocols that I'm trying to get to, and it's going to tell me the exact path it takes through my AWS environment.

It goes from the instance I'm going from, the ENI that it's attached to, any security groups and Network ACLs you pass through, the routing table that it's configured with, all the way till it finally gets to its destination. This is what I mean by intent. I have source, destination, everything in between, and really all I care about is a yes or no. Does this pass? Does this fail? From the previous slide, you saw that all of those were successful. So I run these checks. I give this to my change review board and say, yes, every one of my intent checks are passing before I make the change. I make my change, and then I run them again afterwards and see what happens. So let's do that. We'll go ahead and make our change that breaks something in the environment.
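A single point-in-time check like the ones described here can also be driven from code. The following Python sketch assumes a boto3 EC2 client is passed in as `ec2`. The function name, polling interval, and timeout are illustrative choices, not values from the talk.

```python
import time


def run_reachability_check(ec2, source, destination, protocol, port,
                           poll_seconds=5, max_polls=60):
    """Create a Reachability Analyzer path, analyze it, return (reachable, analysis)."""
    path_id = ec2.create_network_insights_path(
        Source=source,
        Destination=destination,
        Protocol=protocol,          # "tcp" or "udp"
        DestinationPort=port,
    )["NetworkInsightsPath"]["NetworkInsightsPathId"]

    analysis_id = ec2.start_network_insights_analysis(
        NetworkInsightsPathId=path_id,
    )["NetworkInsightsAnalysis"]["NetworkInsightsAnalysisId"]

    # The analysis runs asynchronously, so poll until it leaves "running".
    for _ in range(max_polls):
        analysis = ec2.describe_network_insights_analyses(
            NetworkInsightsAnalysisIds=[analysis_id],
        )["NetworkInsightsAnalyses"][0]
        if analysis["Status"] != "running":
            break
        time.sleep(poll_seconds)
    else:
        raise TimeoutError(f"analysis {analysis_id} still running")

    if analysis["Status"] != "succeeded":
        raise RuntimeError(f"analysis {analysis_id} ended as {analysis['Status']}")
    return analysis["NetworkPathFound"], analysis
```

The boolean is the yes-or-no answer to hand to a change review board. The returned analysis carries the hop-by-hop detail for when the answer is no. Each run is a billed analysis and creates a new path, so a real pipeline would probably reuse existing path IDs instead.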

The reason this is useful is if I'm making a change and I don't know the effect of it, trying to get all of my applications to check could take a long time. I only have so much time in my maintenance window to do this though. Like I said, I can run my pings and trace routes and things like that, but that's only going to give me basic connectivity. That's not going to tell me, going through a firewall or going through a security group, am I actually hitting the right ports. And it's extremely limited in time. If I have a change window and I have four hours to do it, if I mess up something and I don't necessarily know what it was or where I need to pinpoint, getting all my application people to tell me this app works, this app doesn't work, but it works in this region but not in this region, this is a way that you can use this to tell you right away what things work and what things don't work.

Okay, so I've made my bad change. I've run my analyzer afterwards. You can see it shows up as not reachable. Now I can quickly say, okay, not reachable. Well, I need to troubleshoot what's going on with this. Or if it's near the end of the change window, I back out of the change. Going inside, you can see exactly what's going on here. Fourth line down, it says you've got an attachment misconfigured on this particular VPC. That's the issue you need to go to. It may not be that simple to tell you exactly what the issue is, but you at least have that pass-fail to know that change that you made. Here is the pinpoint, and you know you're not guessing what the state of the network is. You have a good intent because you've run this before and after you've made the change. You've got actionable data. If the change goes bad, you know exactly what to do for next time, and you're not saying I'll back out the change, revert everything, and you're still blind because you don't exactly know what happened. You've got the forensic data to figure this out.
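The before-and-after comparison can be reduced to two small helpers: one that lists which checks regressed, and one that pulls the explanation codes out of a failed analysis. This is a Python sketch. The function names are hypothetical, and the sample analysis dictionary uses a placeholder code, not real Reachability Analyzer output.

```python
def regressions(pre, post):
    """Names of checks that passed before the change and do not pass after it."""
    return sorted(name for name, ok in pre.items() if ok and not post.get(name, False))


def explanation_codes(analysis):
    """Collect ExplanationCode values from one network insights analysis entry."""
    return [e.get("ExplanationCode", "UNKNOWN") for e in analysis.get("Explanations", [])]


# Illustrative data only: check names and the explanation code are placeholders.
pre_change = {"web-to-db": True, "app-to-cache": True}
post_change = {"web-to-db": False, "app-to-cache": True}
failed_analysis = {
    "NetworkPathFound": False,
    "Explanations": [{"ExplanationCode": "EXAMPLE_EXPLANATION_CODE"}],
}

print(regressions(pre_change, post_change))
print(explanation_codes(failed_analysis))
```

Keeping both snapshots gives the pass-fail evidence for the change record. The explanation codes are a starting point for troubleshooting, though as noted above they may not always name the exact cause.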

Now, I want you to think about setting this up in a pipeline. These manual changes are just fine, and you can use this for all sorts of things: attachments, transit gateways, VPC endpoints, even VPN connectivity all the way to your on-prem side. But set this up in a pipeline. The tagline I want you to walk away with is: never exit a change with a broken network. This is great for point-in-time stuff, but you can automate this a little bit too. The code that I have does this via Terraform; you can also do this via CloudFormation. I actually have a pre and post step: before I make the change, run my reachability checks and give me a pass, execute the change via the same pipeline, then run those same checks again, which gives me a pass-fail. If I get any fails, I can have a human intervene or I can optionally back out of the change entirely. You've created some chaos, but you also have a way to really narrow down what's going on.
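The pre-check, change, post-check, optional rollback flow can be expressed as one small gate function. The Python sketch below takes three caller-supplied callables, all hypothetical hooks. It illustrates the control flow only and is not the Terraform pipeline from the speaker's repo.

```python
def gated_change(run_checks, apply_change, rollback):
    """Run checks, apply the change, run checks again; roll back on any regression.

    run_checks() must return a dict of check name -> bool.
    apply_change() and rollback() are hypothetical hooks supplied by the caller.
    """
    pre = run_checks()
    failing = sorted(name for name, ok in pre.items() if not ok)
    if failing:
        # Never start a change on top of an already broken baseline.
        return {"applied": False, "reason": "baseline failing", "checks": failing}

    apply_change()
    post = run_checks()
    broken = sorted(name for name in pre if not post.get(name, False))
    if broken:
        rollback()
        return {"applied": False, "reason": "rolled back", "checks": broken}
    return {"applied": True, "reason": "all checks passed", "checks": []}
```

Swapping the automatic `rollback()` for a pause that waits on a human approval would give the "human intervention" variant. The returned dictionary is meant to be something a CI job can log or fail on.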

Now, what a lot of people say is my network is super complex. You've shown me just a few flows here. I have thousands of applications that are going on. I don't really know what they look like. How do I create all of those intent checks? That's where those observability tools I had before come in. Those flow logs that you had, those synthetic transactions, this is where you can look at all of the top talkers on your network. You gather this data from your applications, but I look at the top talkers and I say, okay, here is my critical application. Just because they're sending this much traffic gives me source, destination, IP address, all things I can easily plug into Reachability Analyzer. Firewall rules either from a physical or virtual or AWS Network Firewall will tell you which ones are getting the most hits, so you can see that data too. You can update these intent checks over time in an automated way. This is a super easy thing to do. In my code, I actually have these flow logs running and SageMaker dashboards, so you can actually start to pull them out and query it, so you'll be able to see that kind of data too. It's a very good use case for AI to keep this in a continually updated pipeline.
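Turning flow logs into candidate intent checks mostly comes down to ranking flows by volume. The Python sketch below assumes records in the default VPC Flow Logs format (version 2, 14 space-separated fields). The function name is hypothetical and the sample lines are made-up data, not output from a real account.

```python
from collections import Counter


def top_talkers(flow_log_lines, n=5):
    """Rank (src, dst, dst_port, protocol) tuples by accepted bytes."""
    totals = Counter()
    for line in flow_log_lines:
        fields = line.split()
        if len(fields) != 14 or fields[0] == "version":
            continue  # header line or a custom format this sketch does not handle
        if fields[12] != "ACCEPT":
            continue  # skip REJECT and NODATA/SKIPDATA records
        src, dst, dst_port, proto, nbytes = fields[3], fields[4], fields[6], fields[7], fields[9]
        totals[(src, dst, int(dst_port), int(proto))] += int(nbytes)
    return totals.most_common(n)


# Made-up sample records in the default field order.
SAMPLE = [
    "2 123456789012 eni-aaaa 10.0.1.10 10.0.2.20 49152 5432 6 10 8400 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-aaaa 10.0.1.10 10.0.2.20 49153 5432 6 4 1600 1700000060 1700000120 ACCEPT OK",
    "2 123456789012 eni-bbbb 10.0.1.11 10.0.3.30 49200 443 6 2 500 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-bbbb 10.0.9.9 10.0.3.30 49201 22 6 1 60 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-aaaa - - - - - - - 1700000000 1700000060 - NODATA",
]

for flow, total_bytes in top_talkers(SAMPLE, n=2):
    print(flow, total_bytes)
```

Each resulting tuple maps onto the source, destination, port, and protocol that Reachability Analyzer asks for. One caveat: flow logs record return traffic as well, so a real version would likely filter out flows whose destination port is ephemeral before turning them into checks.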

So that's the code. That's the link that I have there; a QR code will pop up. But what I want you to think about is scheduling your first chaos hour. This is something you should do kind of like a DR drill, something that you do monthly or quarterly. Make it a change where you don't necessarily know what was broken: the person breaking the network is a different person from the one fixing it. Now my code does this too. It has a randomized break in there so you can practice this on your own. But you want to blend these kinds of tools together. The observability tools are great for pulling the data in. The other part is the thought process, getting people to really think, hey, when I make my change, I need to have these checks. They're going to take a little more time, sure, but it really saves me because I can prove to people whether the network is the problem or the network is not the problem. Use the repo and the runbook to go for it. And if you'd like, submit some PRs with improvements, and I'll be here in the theater expo. There's the QR code as well if you'd like the link to the repo. So feel free to come up, ask any questions, and I appreciate everyone.

This article is entirely auto-generated using Amazon Bedrock.