🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Powering your success through AWS Infrastructure innovations (NET402)
In this video, AWS Senior Principal Engineers Jorge Vasquez and Stephen Callaghan explore infrastructure innovations through a fictional online store called Shannon Store. They cover CloudFront's new flat-rate pricing plans, quantum-safe TLS algorithms in s2n-tls, VPC Block Public Access and Encryption Controls, AWS WAF's anti-DDoS capabilities with fingerprinting that mitigates 90% of floods within 20 seconds, the 320 Tbps Fastnet transatlantic cable, network traffic engineering using distributed market simulation without central control, Cross-Region PrivateLink, and Ultra Cluster 3 with AWS UltraSwitch enabling seamless network maintenance and traffic management for ML workloads. The presentation demonstrates how physical infrastructure innovations from fiber duct banks to Geographic Information Systems (GIS) and Shared Risk Link Groups (SRLGs) ensure reliability and performance at scale.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Shannon Store's AWS Infrastructure Journey
Welcome to Powering Your Success Through AWS Infrastructure Innovations. I'm Jorge Vasquez. Next to me is Stephen Callaghan. We are both Senior Principal Engineers at AWS Infrastructure. Every year, we come here and talk about innovative solutions on AWS infrastructure that make your workloads more secure, reliable, and efficient. Today, we tell the story of a fictional online store whose success was accelerated by behind-the-scenes innovations.
We'll touch on all the layers of the stack, starting from the concrete-encased ducts under our data centers through which we pass fiber, all the way to how we create the bills you receive every month. We're going to pay special attention to AI, which has benefited from a bunch of previous innovations we made to make the network and infrastructure more flexible. Nevertheless, AI keeps Stephen and myself on our toes, given all the demands it places on performance and bandwidth. Let's get started.
It so happens that rivers make really good names for online stores. Since we're very creative, we looked at the places we came from. I come from Brazil, but a certain bookstore already took the main river from there to use as its name. So we went to Stephen, who comes from Ireland, and we took the River Shannon, which you can see on the screen. Meet Shannon Store. Shannon is deployed to eu-west-1, which is in Ireland.
It uses three Availability Zones for better reliability, and it's deployed across three VPCs. It has a front-end VPC, which is the only one that accepts connections from the internet. It has a back-end VPC where all the microservices that power the store are deployed. And finally, it has a payments VPC, which runs the credit card processing services that need additional scrutiny when it comes to PCI compliance.
CloudFront as Shannon's Internet Front Door: Performance and Cost Optimization
Like any e-commerce website, Shannon has sales events that generate large amounts of traffic from all around the internet. Let's look at how Shannon plans for very good days where this traffic is high. Starting from Shannon's internet front door, Shannon's engineers decided to use CloudFront. I think it's a solution that made sense. CloudFront has more than 700 points of presence around the world. These points of presence are edge locations that are very close to customers. This means Shannon customers get the best possible latency and Shannon gets to pay the least amount for data transfer out.
Since we're talking about cost, Shannon doesn't like surprise bills. I don't think anyone does. So they are taking advantage of the newly announced CloudFront flat-rate plans. These plans include everything needed to set up a web front door: data transfer out, S3 storage, web application firewall, API requests, DNS, and more. The best part is if Shannon has a really good day and the sales blow through their forecasts, this good day is not going to get ruined by a surprise bill. The bill is fixed.
Granted, this is not the usual under-the-hood infrastructure innovation I promised when I came on stage, but it's a first for us at AWS. So I'd like to ask you to please let us know if you like this and want to see more of it across the company. Shannon also uses CloudFront to serve dynamic content that cannot be cached, things like personalized experiences, APIs, and the shopping cart. The reason it does that is that CloudFront can make those requests faster. Let's see why.
First, the TCP and TLS handshakes between end users and Shannon can take up to three round-trip times. If I look at the round-trip time between Seattle, where we are located, and Ireland, where Shannon is, we are talking about 118 milliseconds. Over three round trips, that's a total of about a third of a second that customers are waiting for the network instead of shopping. Because points of presence are close to the user, the round-trip time to a POP is usually much smaller than that, about 20 milliseconds or less, so it's almost six times faster than my previous example.
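As a quick back-of-the-envelope check on those numbers, here is a small illustrative calculation (not AWS code; the figures are simply the ones quoted above):

```python
# Rough handshake-cost arithmetic for the example above (illustrative only).
# Assumes up to three round trips for TCP + TLS connection setup.

def setup_latency_ms(rtt_ms: float, round_trips: int = 3) -> float:
    """Time spent on connection setup before the first request can be sent."""
    return rtt_ms * round_trips

direct_to_region = setup_latency_ms(118)  # ~354 ms, roughly a third of a second
via_nearby_pop = setup_latency_ms(20)     # ~60 ms against a close edge location

print(f"direct to region: {direct_to_region:.0f} ms")
print(f"via nearby POP:   {via_nearby_pop:.0f} ms "
      f"(~{direct_to_region / via_nearby_pop:.1f}x faster)")
```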
CloudFront also keeps connections open to Shannon's servers inside the region. This means there's no additional handshake latency on the longer network segment: after a quick handshake at the edge, the request flows over an already-open connection, zero extra round trips if you will. With Origin Shield, a feature of CloudFront, it can do one better. It can pre-open these connections to the origin before they are needed, so when a request comes, there's always a connection waiting to be used.
At that point, the only handshake that needs to happen, other than the one from the customer, might be one from CloudFront infrastructure in the same region where Shannon already is, which means a handshake between endpoints that are only a few milliseconds apart. The third reason CloudFront accelerates content is that all the traffic on that longer network path rides the Amazon backbone. As you'll see throughout this presentation, we go to great lengths to make sure it's always available and never congested. This means lower latencies and higher availability.
The final reason is that CloudFront supports more advanced protocols such as HTTP/3 and TLS 1.3, and it does that even if Shannon's backend does not support them itself. Better still, CloudFront takes care of otherwise obscure details that make all the difference. One example is the graph you see on the screen right now. It shows the uptake in HTTP/3 connections compared to HTTP/2 connections when CloudFront launched support for HTTPS records. These records are somewhat esoteric: they allow clients to know that an endpoint supports QUIC before they make their first connection. Because of that, Shannon customers saw a first-byte latency reduction of more than 20 milliseconds at P90.
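If you're curious whether a hostname advertises HTTP/3 this way, you can look up its HTTPS (type 65) record yourself. A minimal sketch with dnspython (version 2.1 or later; the hostname below is a placeholder for your own distribution's domain):

```python
# Query the HTTPS (SVCB-style, type 65) DNS record for a hostname.
# An "alpn" parameter that includes "h3" tells clients they can start with QUIC
# on their very first connection instead of discovering it later.
import dns.resolver  # pip install "dnspython>=2.1"

answer = dns.resolver.resolve("d111111abcdef8.cloudfront.net", "HTTPS")
for rdata in answer:
    print(rdata.to_text())
```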
Securing TLS Communications: From s2n-tls to Quantum-Safe Cryptography
Changing gears to security. Since Heartbleed about 10 years ago, AWS has invested in making TLS safer for everyone. We developed an open-source, modern, secure-by-default TLS library called s2n-tls. Not only does it benefit from more modern engineering practices like defensive coding and, more recently, formal verification, but it also has saner defaults that reduce the surface area for defects that might cause security problems. s2n-tls is just part of the story though.
If you've ever run a public endpoint of any real size and had to go through three TLS protocol migrations over the last few years as we adopted TLS 1.2 and then 1.3, you know that these migrations can be painful. By using AWS networking services like Application Load Balancer, Network Load Balancer, and CloudFront, restricting your endpoint to the latest TLS version is only one click away. But the innovation I wanted to talk to you about when it comes to TLS is about preparing for a future where quantum computers might become a reality.
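That one-click restriction maps to a single API call. A minimal sketch with boto3, using a hypothetical listener ARN for Shannon's front-end Application Load Balancer:

```python
# Pin an ALB HTTPS listener to a TLS 1.3-capable security policy.
# The listener ARN below is a placeholder; substitute your own.
import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")
elbv2.modify_listener(
    ListenerArn=(
        "arn:aws:elasticloadbalancing:eu-west-1:123456789012:"
        "listener/app/shannon-frontend/0123456789abcdef/0123456789abcdef"
    ),
    # This AWS-managed policy negotiates TLS 1.3 and falls back to TLS 1.2.
    SslPolicy="ELBSecurityPolicy-TLS13-1-2-2021-06",
)
```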
Why is that? Quantum computers are the boogeymen of cryptographers, and that's because algorithms like Shor's algorithm can break elliptic curve, RSA, and Diffie-Hellman—all the algorithms in pink on your screen. While they are safe today given the key sizes we use, they are still subject to what we call harvest now and decrypt later, which is when someone captures encrypted traffic, stores it, and waits for the time when a large quantum computer is available.
It is true that quantum computers are not there yet. They're not big enough. They are about an order of magnitude away in terms of the number of logical qubits that are needed to decrypt these algorithms at practical key sizes. Yes, it's still early days, but there are billions of dollars of venture capital alone on the private side of the industry being invested in trying to come up with a practical quantum computer.
AWS has been working on quantum-safe algorithms since at least 2019 when the National Institute of Standards and Technology, NIST, started a process for standardizing such algorithms. That year, we implemented quantum-safe algorithms in s2n-tls and AWS-LC, which is our open-source cryptographic library. Fast forward to last year—that standardization process came to a head, and we started implementing the same algorithms in CloudFront, Application Load Balancer, and Network Load Balancer. These are available today. In CloudFront, it's even the default for everyone. If you're using CloudFront, you're already quantum-safe.
But why should Shannon care? It's just an online store, right? Who wants to decrypt that information? Well, for some customers, it could be a compliance checkbox, for example if you're subject to CNSA 2.0 in the US. But for Shannon, it's about privacy. It's about the peace of mind that no matter what results come from those billions of dollars, customers' privacy is still going to be safe when it comes to what they search for online and what they ultimately buy.
VPC Security Innovations: Private Origins, Block Public Access, and Encryption Controls
I originally told you that Shannon's front-end VPC was the only one with internet access.
However, since last year, CloudFront can now connect to your private resources inside the VPC as well. Shannon has been taking advantage of this feature in order to harden its front-end VPC and eliminate that last open internet endpoint. Now it can be accessed only through Shannon's own CloudFront distribution, and there's no need to configure security groups.
This is how it works. First, CloudFront creates a secure network tunnel between POPs and a CloudFront-controlled set of instances in the same Availability Zone where Shannon resources are located. Then, CloudFront adds one ENI in Shannon's VPC. Finally, using AWS Hyperplane, which is the technology behind products like Network Load Balancer and NAT Gateway, CloudFront connects between its EC2 instances and the web server inside Shannon's VPC.
This resource can be an instance, a load balancer, anything inside that VPC. And at the end of the day, only Shannon's CloudFront distribution can access that endpoint. This brings me to VPC Block Public Access, a feature we introduced last year. The other two VPCs, as I said before, have no reason to use the internet, and we don't want an accidental misconfiguration to create a door through which internet traffic could get into them.
However, I told you that this backend VPC had many different services running in it. Ideally, Shannon would break this VPC into many VPCs. But in reality, Shannon is an old startup and that's how it ended up. AWS is here to meet Shannon where Shannon is. So Block Public Access allows Shannon's platform team to say these VPCs do not get to connect to the internet.
Now, if individual team members accidentally configure an internet gateway, NAT gateway, Global Accelerator, Network Load Balancer, Application Load Balancer, API Gateway, or even the CloudFront private origins I just told you about, any of those things, or any other service that can connect to the internet, it gets blocked. It doesn't work. What I like about this one is that it's a different type of innovation. It's not about doing something technically interesting. It's about doing something that changes the rules of the game.
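For reference, that intent is declarative rather than something you audit after the fact. A minimal sketch of what it could look like with boto3, assuming the Block Public Access operation and parameter names exposed in recent EC2 API versions, with placeholder resource IDs:

```python
# Declare that VPCs in this account/region must not exchange traffic with the
# internet, with an exclusion for the front-end VPC. Operation and parameter
# names are assumptions based on recent boto3 releases; IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Block internet gateway traffic in both directions across the region.
ec2.modify_vpc_block_public_access_options(
    InternetGatewayBlockMode="block-bidirectional"
)

# Allow only the front-end VPC (placeholder ID) to keep an internet path.
ec2.create_vpc_block_public_access_exclusion(
    VpcId="vpc-0frontend1234567",
    InternetGatewayExclusionMode="allow-bidirectional",
)
```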
We take the intent from the customer and we do the hard work behind the scenes to make it real. At the end of the day, Shannon's engineers want to build a better web store, not audit network logs, and this lets them do that. This year, we announced what I believe is an equally innovative feature: VPC Encryption Controls. If I zoom into the backend VPC, teams have a pattern of regularly launching new instances and deploying workloads onto them, most likely as clusters of containers.
Some instances, the newest ones built on Nitro v3 and above, do support VPC encryption natively. But who keeps track of which instance types those are? I suppose Shannon's platform team does, so they can restrict the other teams from launching non-compliant instances. The problem happens when those teams get instances launched into the VPC by another service, like RDS. Now Shannon's platform team doesn't have a good way of controlling what gets launched, and the individual teams don't know which instance types support VPC encryption and which don't.
Once Encryption Controls is in enforcement mode, these launches are blocked, just like in the previous example. Shannon's platform team simply tells us their intent, VPC encryption only, and we take care of making sure that intent is never violated.
Bot Protection and DDoS Mitigation with AWS WAF
Let's switch gears once again and talk about bots and DDoS. When you are a very successful store like Shannon, people want to scrape your catalog. After all, just as another store has everything from A to Z, Shannon has everything from S to N. Shannon uses AWS WAF as part of its CloudFront configuration. WAF can do a lot to restrict bots from scraping the catalog. Let me focus on a couple of things it can do. The first is that instead of serving the content right away, it can first serve a script that requires the client to perform a computationally expensive challenge. This doesn't block bots completely, right? Bots can still do computation.
But it does change the economics of the game. Shannon has tens of millions of items in its catalog, and having to solve an expensive puzzle on top of parsing a web page imposes significantly more cost on the bot operator.
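To make that concrete, here is a minimal sketch of what such a rule could look like in the AWS WAFv2 API using the Challenge action; the rule name, path, and priority are made up for illustration:

```python
# A WAFv2 rule (as passed in the Rules list of create_web_acl / update_web_acl)
# that answers catalog requests with a silent browser challenge before the
# request is allowed through. Names, paths, and priorities are illustrative.
catalog_challenge_rule = {
    "Name": "challenge-catalog-scrapers",
    "Priority": 10,
    "Statement": {
        "ByteMatchStatement": {
            "SearchString": b"/catalog/",
            "FieldToMatch": {"UriPath": {}},
            "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
            "PositionalConstraint": "STARTS_WITH",
        }
    },
    # Challenge requires the client to solve a computational puzzle; real
    # browsers pass transparently, while large bot fleets pay for every request.
    "Action": {"Challenge": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "challenge-catalog-scrapers",
    },
}
```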
The second thing I wanted to talk about is the fact that AWS WAF has automatic mitigation for high volume HTTP request generators. This was initially created to block distributed denial of service attacks, and it's called the anti-DDoS AWS managed rule set. But it's also effective at blocking incoming requests from large collections of bots that want to get real-time information about Shannon's catalog.
Here's how it works. The WAF data plane gathers as much metadata as it can from the requests and streams it back to a central fleet. That fleet then fingerprints those requests and compares them to baselines that are individual to every single customer. When it identifies outliers, it streams back rules to the data plane that will then block requests based on this fingerprinting.
This works even when bots try to disguise themselves by faking headers like user agents, shuffling header orders, faking JA3 fingerprints on the TLS handshake, or even using IP addresses that don't belong to the bot operators. I can't tell you all the details of how we do fingerprinting, because it's somewhat of a cat-and-mouse game and we're always updating the way we do it. But I can tell you that 90 percent of the floods we see are blocked and mitigated within 20 seconds of starting. By the time a Shannon engineer gets paged, the flood has already been mitigated. That's our goal.
Network Scaling Philosophy: Proactive Infrastructure and the Nitro System
Coming to network scaling, our goal here is for Shannon to never have to worry about network scale. This is how we plan and deploy infrastructure so that this is actually real. There are two fundamentals. The first one is to be proactive and never wait to run out of capacity or redundancy. This seems simple, right? But in my and Stephen's world, it's not. Nothing is elastic. Scaling the network, from our point of view, means laying fiber, acquiring new buildings, sometimes digging trenches. It's all about the physical world. There's nothing elastic about it.
The second thing is about having enough redundancy so that disruptions do not cause an impact to Shannon when they inevitably happen. Disruptions can be fiber cuts and plant maintenance or power outages—these unforeseen events that happen all the time. In our case though, they can also be our own doing because we do take millions of devices out of service every day to do OS upgrades and configuration updates so that they're always doing what you expect them to do and they're always secure. We have to plan for all those disruptions.
One example of that philosophy of building ahead of time was our decision to build the newly announced transatlantic subsea cable called Fastnet that runs between Maryland in the US and Cork in Ireland. This is very good for Shannon. This cable has 320 terabits per second of capacity, and it includes a newly developed optical switching and branching unit which can be used to accommodate future changes to the topology of the network, for example, adding new landing locations on either side of the cable system.
A recent innovation in how we plan for the scaling of the network has to do with how we define enough redundancy. Up until a couple of years ago, we planned for a fixed number of failures depending on the segment of the network. In the backbone, that meant, for example, two concurrent failures. In the data centers, it meant losing an entire building. The result is that we ended up with excess capacity in places where the network was very, very reliable (imagine the connection between two buildings inside the same AWS-owned campus), but sometimes insufficient capacity in places where the infrastructure is not as reliable as that.
The solution was two-pronged. The first prong was to better understand our network, which started by fully characterizing every single element that we could in terms of mean time between failures (MTBF) and mean time to recovery (MTTR). We also needed to understand correlated failures, something Stephen is going to get into in more detail later.
The main idea is that we want to avoid scenarios like digging a trench, cutting more than one fiber, and then being surprised that we lost more than one circuit in a single event. The second prong of that strategy is to take that more detailed characterization of our network and feed it into a custom-built network simulator that runs the exact same network controllers, the software that does traffic engineering, that we run in production. Because we now have a fuller understanding of how the network fails and what's correlated and what's not, we can simulate realistic parallel and sequential failures with probabilities we actually trust.
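The flavor of that kind of analysis can be shown with a toy Monte Carlo sketch: given per-link MTBF and MTTR estimates, how often would two links happen to be down at the same moment? This is purely illustrative of the planning concept, not AWS's simulator, and all the numbers are invented:

```python
# Toy estimate of concurrent unavailability for two links, assuming failures
# are independent. If both fibers actually share a trench (a correlated,
# shared-risk failure), this estimate would badly understate the real risk.
import random

def is_down(mtbf_h: float, mttr_h: float) -> bool:
    """Sample whether a link is unavailable at a random instant (steady state)."""
    unavailability = mttr_h / (mtbf_h + mttr_h)
    return random.random() < unavailability

def concurrent_outage_probability(links, trials=200_000):
    hits = sum(all(is_down(m, r) for m, r in links) for _ in range(trials))
    return hits / trials

# Two toy links, each failing about once a month and taking 12 hours to repair.
links = [(30 * 24, 12.0), (30 * 24, 12.0)]
print(f"P(both down at once) ~ {concurrent_outage_probability(links):.2e}")
```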
The end result is that we could redirect network scaling to where it matters most and, with that, increase network reliability. This next one is an oldie but a goodie. Did you know that a VPC has no limit on the number of instances you can put in it, other than the IP addressing of its subnets? This means that for IPv6-only VPCs, there is practically no limit on the number of instances you can have in a subnet. However, you probably also know that we innovated many years ago by launching the Nitro system, which virtualizes the network function in hardware. You can probably imagine that hardware doesn't have infinite amounts of memory, right? There are trade-offs to doing those kinds of things.
However, when packets come out of an instance and get into the Nitro system, they need to be encapsulated towards the server that hosts the instance the packet is destined for. So we need a table that says this instance in your VPC is actually located on this server, at this IP address, somewhere else inside the AZ. For very large VPCs, that table does not fit in the Nitro card's memory. We also did not want to send those packets to a middle box and add a hop to every single communication path inside a VPC. So what was our solution?
Our solution was to deploy these green elements to our network, large services that can answer, for every VPC and every instance, where that instance is located. The Nitro card makes a call to the service, gets a response, and sends the packet. All of that has to happen in well under one millisecond for the experience to be seamless. So we had to innovate on at least two fronts to make that happen. One was to build a custom protocol; an HTTP REST API over TLS would not cut it within a millisecond here. It would not make the bar.
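Conceptually, the pattern is a small hot cache on the card backed by a remote mapping service, so most packets never pay the lookup cost. A purely hypothetical sketch of that pattern (a mental model, not AWS's protocol or code; every name here is made up):

```python
# Illustrative model of a cache-plus-mapping-service lookup.
from functools import lru_cache

class MappingService:
    """Stands in for the in-building fleet that knows where every instance lives."""
    def __init__(self, table):
        self._table = table

    def locate(self, vpc_id: str, dst_ip: str) -> str:
        return self._table[(vpc_id, dst_ip)]  # physical host address inside the AZ

service = MappingService({("vpc-0123", "10.0.1.7"): "host-az1-rack42.internal"})

@lru_cache(maxsize=65_536)  # the card only needs to remember the hot entries
def resolve_destination(vpc_id: str, dst_ip: str) -> str:
    return service.locate(vpc_id, dst_ip)

# The first packet to a destination pays one sub-millisecond lookup; subsequent
# packets to the same destination are encapsulated straight from the cache.
print(resolve_destination("vpc-0123", "10.0.1.7"))
```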
The second front, which is more interesting for Stephen and myself, was deploying those elements inside every single building, close to every instance, so that the network path is very short and the lookup does not add meaningful latency to a request that has to stay fast for the experience to be seamless. Now let's talk about Shannon's sales. Don't ask me about details, but apparently in Ireland, the day after Christmas, which is called Saint Stephen's Day, is the biggest sale day of the year. Shannon scales its compute capacity but doesn't have to scale the network to handle that sale. How can that be? How can different parts of the network cooperate so that videos and place-order clicks always get through the network unimpeded?
Traffic Engineering: Distributed Market Simulation and Dynamic Congestion Management
Let's start with the videos. Shannon's videos are stored in S3. S3 has a front end in the region, and it uses Network Load Balancer as well. Between that Network Load Balancer and CloudFront, data has to flow, and it flows over AWS's backbone. For this example, consider two types of videos. The video you see on the product description page is very popular and very cacheable, because everyone viewing that product sees that video. But there are also videos inside the individual reviews that users post.
Those don't get watched as much, and they therefore are not super cacheable on CloudFront. When you're streaming content on CloudFront that only a handful of users are seeing, CloudFront becomes more and more of a proxy, and that puts more demand on AWS's backbone. In that example, users in London are watching content and the paths between London and Dublin are getting close to getting congested. At that point, the traffic engineering controller kicks in and starts redirecting some of that content through Manchester.
The system enlists these new network paths transparently to the workload running on top of it. In this case, CloudFront fetches content from S3. The way we do traffic engineering is quite unique. I call it a distributed market simulation. The interesting part about it, though, is not this fancy name I just uttered. It's the fact that it has no central brain. It has no central element that is deciding where every flow goes on the network. The reason we don't want a central brain is because it's a central single point of failure that can bring down your entire network, and we don't want that.
It also does not use a reservation-based system like RSVP that is very common in MPLS networks because those are subject to first come, first served behaviors and have deadlocks. Last year, we went into way more details about this system and you can find it on the QR code on your screen. However innovative we might believe our traffic engineering system is, it is not magic. It cannot create fiber and routers and data centers out of thin air.
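To give a feel for the penalty idea, without any central brain, here's a deliberately tiny sketch: each candidate path advertises a penalty that rises as it fills, and each flow independently picks the cheapest path at the moment it's placed. The paths, capacities, and penalty function are invented for illustration; the production controllers are far more sophisticated:

```python
# Toy penalty-based path selection between Dublin and London. No central
# element assigns flows; each flow just picks the lowest-penalty path.
paths = {
    "London-Dublin (direct)":   {"capacity_gbps": 100, "load_gbps": 0.0, "base_cost": 1.0},
    "London-Manchester-Dublin": {"capacity_gbps": 100, "load_gbps": 0.0, "base_cost": 1.3},
}

def penalty(path):
    utilization = path["load_gbps"] / path["capacity_gbps"]
    # Longer paths start out more expensive; every path's penalty rises sharply
    # as it approaches capacity.
    return path["base_cost"] / max(1e-6, 1.0 - utilization)

def place_flow(demand_gbps):
    name, best = min(paths.items(), key=lambda kv: penalty(kv[1]))
    best["load_gbps"] += demand_gbps
    return name

# Thirty 5 Gbps video flows: the direct path is preferred while it has headroom,
# then traffic spills onto the Manchester path well before either one congests.
for _ in range(30):
    place_flow(5)

for name, p in paths.items():
    print(f"{name}: {p['load_gbps']:.0f} Gbps of {p['capacity_gbps']} Gbps")
```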
So how can we make sure that St. Stephen's Day is a good day even when everyone is watching those product review videos way more than we expected? Our solution was to integrate the backbone traffic engineering controller with services like CloudFront. Imagine that those videos are being watched so much that all the paths between the region and the edge location in this example are not enough. That path starts turning red, which means it's running out of capacity. The traffic engineering system then tells CloudFront, and CloudFront now has the opportunity to move some of the content to a different edge location.
It looks at which content is more cacheable, which content is less cacheable, what type of content is there, and makes an informed decision. In this case, CloudFront decided to move videos because videos are not affected by latency as much as they are by bandwidth. As long as customers can still watch those videos in their full 4K glory, they're fine with a few more milliseconds before the video starts. The limiting factor, however, can be ports connecting AWS to internet service providers instead of our own backbone. Last year, I explained how the traffic engineering controller for the internet connectivity balances things out, which is what you're seeing on your screen right now.
The same problem, however, can happen there too. What if all the sites in a metro area are at capacity towards a given ISP? Again, the traffic engineering controller tells CloudFront, and that gives CloudFront an opportunity to serve some content from another city where we presumably have enough capacity towards that ISP. Up until now I've been talking about CloudFront. Sometimes, however, the origin is an EC2 instance, with direct connectivity between users on the internet and something hosted in EC2. In Shannon's case, they have one API behind a load balancer.
Everything I just talked about also applies to EC2 traffic as of this year. The EC2 traffic engineering controller, which internally we call Catapult, has also been integrated with the traffic engineering controllers for the backbone and for internet connectivity, so that it can now choose different waypoints to avoid congestion on the internal network, on the backbone, and on the internet ports. Everything I just described happens automatically and behind the scenes. Customers might see a little latency fluctuation, but never congestion. All this machinery is also very useful for real customers like Amazon.com running Prime Day, Thursday Night Football streaming, or even Fortnite players downloading multi-gigabyte updates so they can go and play the new version of the game.
Planning for Failure: Availability Zones, Fiber Infrastructure, and Geographic Information Systems
Let me bring a flesh-and-blood Stephen to the stage to tell you how AWS helps Shannon Store prepare for days that are not as happy as the biggest sales day of the year. Thanks, Jorge. So let's continue our story. Shannon Store has set up their systems and they're able to serve their customers reliably. They've also scaled up. They've taken advantage of the services that Jorge talked about, and AWS has facilitated some of that under the covers. But just as we're all concerned here about how the infrastructure will predictably scale, we also want to plan for how this infrastructure could predictably fail. To recap the architecture, we've got multiple VPCs, and these are spread across three Availability Zones. Now, we all know that Availability Zones are described as discrete entities.
But let's look at what that physically means. So we state that availability zones are redundant from an infrastructure standpoint. This means they are separated by meaningful distances, they have distinct power sources, but they also have separate fiber paths. And we do this such that a single event won't impair more than one AZ. And in large regions where we have multiple buildings, we have multiple connections that are spread across these buildings, and so we need to plan for how this will affect our redundancy.
So let's zoom in on what happens inside an AZ, and what on the surface appears simple: two buildings that connect. This may seem simple, but there's a lot that needs to be accounted for here. We've got airports and houses and businesses and mountains and rivers. We need to build around the physical infrastructure that exists. And as we add new buildings, we have to grow and augment the existing fiber paths so we can extend out into these new locations.
The way we do this is by knowing as much as possible, and the only way we can do that is with our global Geographic Information System, or GIS. This allows us to really obsess about the details. Because our fiber plant is under our control, whether inside an AZ, between AZs, or across the backbone, we need enough controls in place to understand what's happening: the fiber itself, how many fibers are inside the conduit, the path it takes, and even details like whether a bridge crossing goes under or over that highway.
As an example, here's a real life situation where we've got fiber that runs down the left side of the highway, the right side of the highway, and even the middle of this divided highway. And because we know this, we can model different failure characteristics depending on which path we're looking at. Because while we may have construction impair the left side of the highway, it's not a guarantee that it's going to impair the right side of the highway at the same time. So we take this into account when we're doing our planning.
There are also things we can do to reduce the likelihood of impact, such as protecting the fiber by encasing it in concrete. This may not stop one fiber duct from being hit, but it can stop further damage, so we may not lose every duct in the trench. Traditionally, this was constructed on site, which, as you can imagine, is a fairly labor- and resource-intensive activity. It also made construction time unpredictable, because we could have weather disruptions. So lately, we've moved to using prefabricated fiber duct banks. And this is one of those multi-win scenarios that I just love so much.
Firstly, it speeds up deployment, because we do not have to wait for concrete to dry. Secondly, we can deploy capacity quickly and give ourselves space for scaling. Thirdly, because it's constructed in a shop environment and not in the field, it's actually a higher-quality product; we have fewer problems with it. And probably most importantly, it's safer for our construction teams, because they spend less time in trenches. Because we obsess about physical details like this, customers like Shannon only ever see the infrastructure becoming increasingly reliable and consistent.
So let's take a step up from the physical that we just talked about and move to the logical realm. So when we model demand, we do it between data center pairs. How much traffic is going from A to B? Now between any two data center pairs, there may be 2 or 4 paths. There's multiple ducts, there's cables like this in the ground that have multiple subducts and multiple strands inside. And each of those strands is going to be connected to a circuit. Now that circuit may be EC2 or backbone, or even direct connect.
Now, because we know where all these circuits are and where all their physical parts are, we have this mesh of capacity and demand that has to be planned. To manage this, we use a method called SRLGs, or shared risk link groups. This is where we analyze this complex graph, both logically and physically. We're looking for places where two fiber ducts cross, or where there's a concentration of capacity, say where we're trying to cross an ocean. Those are the places that could have the most significant impact if something were to happen, so that's what we want to plan for.
Now, in a scenario like this with four paths, it could already be a bit tricky; imagine a region like Virginia, with hundreds of thousands of fiber paths, and how we'd go about planning that. And if I look at customers all over Europe and at how we're going to manage this consistently, what can we do? Well, here's a high-level logical overview between the regions. But from what I've just mentioned, you can understand that in reality there are thousands of fiber paths, submarine cables, and interconnect points all at the physical layer.
Backbone Resilience: Fast Failover with Pre-Provisioned Bypass Tunnels and SRLGs
Last year when we were here, Jorge spoke about our penalty-based traffic engineering system. One of the benefits of this system across the backbone was that it provided a gradual return to equilibrium across the set of paths that we had. However, when we have an impairment like a fiber cut, we don't want to wait for this gradual rebalancing. We want to act quickly. So in addition to the optimal path through the system in steady state, we're also going to protect each path with a fast failover service. We do this through pre-provisioned IP-based bypass tunnels.
How do we look at this? Well, the first thing that's important to do is balance scale and performance. As part of that, it's important to delegate responsibilities to the layer of the stack that can most quickly respond to something. At a region or backbone level, it's not necessary for the path controller to know about the thousands of individual switches that are there. This is why we have an underlay routing protocol. Similarly, we don't need to know about every link, only bundles of links and how much capacity they have. We're delegating some of this responsibility to these lower levels because we feel that the problems can be dealt with faster and more simply.
With this, the backbone system kicks in, and after our primary path has been created, traffic is forwarded and balanced based on penalties. The backbone controllers also calculate a backup path for each segment, which you see here in blue. What's important is that these segments are independent, not just from a logical perspective like you see here, but also physically. It's important that no element in this protection path shares an SRLG with the primary path.
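The disjointness check itself is easy to picture on a toy graph: annotate each link with the shared-risk groups it belongs to (the trench, duct, or submarine segment it shares), then refuse any backup edge whose groups intersect the primary path's. The graph and SRLG labels below are invented for illustration:

```python
# Find a primary path, then a backup path that shares no SRLG with it.
from collections import deque

# edge -> set of shared risk link groups (e.g. which trench the fiber shares)
edges = {
    ("A", "B"): {"trench-1"},
    ("B", "D"): {"trench-2"},
    ("A", "C"): {"trench-3"},
    ("C", "D"): {"trench-1"},   # shares a trench with A-B
    ("A", "E"): {"trench-4"},
    ("E", "D"): {"trench-5"},
}

def neighbors(node):
    for (u, v), srlgs in edges.items():
        if u == node:
            yield v, srlgs
        elif v == node:
            yield u, srlgs

def find_path(src, dst, forbidden=frozenset()):
    """BFS that refuses any edge whose SRLGs intersect the forbidden set."""
    queue, seen = deque([(src, [src], set())]), {src}
    while queue:
        node, path, used = queue.popleft()
        if node == dst:
            return path, used
        for nxt, srlgs in neighbors(node):
            if nxt in seen or srlgs & forbidden:
                continue
            seen.add(nxt)
            queue.append((nxt, path + [nxt], used | srlgs))
    return None, set()

primary, primary_srlgs = find_path("A", "D")
backup, _ = find_path("A", "D", forbidden=frozenset(primary_srlgs))
print("primary:", primary)  # A-B-D
print("backup: ", backup)   # A-E-D; A-C-D is rejected because C-D shares trench-1
```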
Now, we know that because of the system that we have, there could be a chance of contention. If there were multiple failures, traffic would try to jump on the same path at the same time. To handle this, we're continually redoing our capacity calculations to manage that chance and make sure it stays extremely low. In the event of a failure, we're going to immediately recalculate these effective paths such that by the time we have reprogrammed the paths, we'll be ready before an uncorrelated failure occurs.
We also have a lot of experience with some of the largest RSVP and MPLS networks in the world, but the scale and meshiness of the backbone that we have means that those protocols are not flexible enough for us. They're not doing what we need. When I look at this, I think of solutions that can be described as simple. It sounds simple that every segment has a backup tunnel. But the elegance in a design like this is making something that's incredibly complex, with all those SRLGs, fiber paths, and interconnect points, appear simple by having the bypass tunnel. This is especially difficult when you consider the constraints we put on ourselves for scale, availability, performance, and agility.
All of this means that Shannon Store is not dealing with inter-region capacity management. They're not concerned about their customers accessing their store in Ireland and learning about things like submarine cuts on the news instead of being paged or seeing them on their dashboards. Shannon Store now has their global audience and a reliable stack, and their customers are using the store. But maybe they could do with some help in finding the right products. Well, how do you solve that? If you had it on your bingo card, yes, we're going to solve it with the world's favorite topic right now: AI.
Cross-Region Connectivity: Private Link for AI Integration with Penny AI
Shannon Store has decided to work with a third-party provider to build a customer assistance agent. Say hello to Penny AI. As you can tell from this email, their CEO has potentially overly high expectations about its ability to deliver real value. Let's see how we can try to meet these high expectations. Penny AI runs in US East 2, and their design exposes their service via PrivateLink. PrivateLink removes the requirement to renumber any networks: there are no overlapping CIDRs to deal with, no exposing of internal or beta services, and the respective security teams are happy.
Until now, though, this was the setup that was required: someone needed to handle that cross-region transfer themselves with transit VPCs. But with last year's launch of cross-region PrivateLink, Penny AI can now simply configure their load balancer-backed endpoint service for cross-region access.
Shannon Store then sets up the cross-region PrivateLink connection to connect to Penny AI's backend, and AWS has handled this connectivity under the covers. One thing I particularly like about what the team has done with this is that the infrastructure for cross-region PrivateLink does not intermingle with the in-region PrivateLink. This kind of region independence is part of the promise that we give, and we remove concerns of contagion between the services if there were any problems.
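On the consumer side, the sketch below shows what establishing that connection could look like with boto3, assuming the ServiceRegion parameter that cross-region PrivateLink added to CreateVpcEndpoint; every identifier is a placeholder:

```python
# Create an interface VPC endpoint in Shannon's Irish VPC that points at
# Penny AI's endpoint service in Ohio. IDs and the service name are placeholders,
# and the ServiceRegion parameter is an assumption based on the cross-region launch.
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # Shannon's region

ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0shannonbackend00",
    ServiceName="com.amazonaws.vpce.us-east-2.vpce-svc-0pennyai123456789",
    ServiceRegion="us-east-2",  # where Penny AI's service actually lives
    SubnetIds=["subnet-0aaa111", "subnet-0bbb222", "subnet-0ccc333"],
)
```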
But what about the other direction? Shannon Store has a huge data lake of customer data in S3 in Ireland. How can Penny AI get access to this data? Well, launched last week, cross-region PrivateLink for AWS services exposes VPC endpoints for AWS services between regions securely. With these tools, Penny AI has control over exactly what they're offering to Shannon Store, and Shannon Store can expose exactly the data they want in Ireland to the Penny AI team. We're not messing with global replicas or moving data around. This all stays secure within VPCs.
Ultra Cluster 3 and AWS UltraSwitch: Building Reliable AI Infrastructure at Scale
Now Penny AI and Shannon Store have access to the data they need. It's time to process it and build the AI. This is where the crucible of AI comes in, which is accelerated compute. You absolutely need to get access to the accelerators, and that's important, but that's just the start of it for me. No one likes a super fast race car that can't finish the race. You need something that runs reliably, consistently, and predictably, because that's what makes the winning team.
This is where AWS and I in particular spend a lot of our time building AI and ML networks. In my mind, nothing is more expensive than a system that regularly fails and struggles to make forward progress. We've spoken about Ultra Clusters before, and this year we have Ultra Cluster 3. We're deploying sites with hundreds of thousands of GPUs. In fact, the combination of network hardware that we have and our Elastic Fabric Adapter is flexible enough that we're probably a couple of years away from hitting network scaling limits even with the current generation of hardware. I think we'll run out of practical space and power first.
Hard limits like this are super interesting to me. How much heat can you move? How much power can you provision to a building or a location? I could probably speak for another hour on liquid cooling and power management alone, but let's stick with the network. Ultra Cluster 3 builds upon our flexible planar architecture. This fits the fabric to the size of the building and the requirements of the rack types. With this new generation, we're able to cater for a couple of hundred GPUs all the way up to potentially fifty thousand in a single network fabric. With our ability to connect multiple buildings together in a single cluster, we can form installations of millions of GPUs.
But let's get back to Earth. Penny's not in the billions. They need a couple hundred GPUs, and like I said, they're connected to US East 2, which is in Ohio. But actually these GPUs are next door in Indiana, and we've done that because we landed the hardware close to the power source. That's where the ML cluster lives. By extending the ML VPC into this cluster rather than keeping it in the region, I get a double win. Firstly, we can grow faster because we're building in two locations. Secondly, I can give more concentrated capacity. The ML buildings can be dedicated to ML, and this keeps the non-ML servers in the region closer together because there are fewer buildings there. So it helps both ML and non-ML customers because we're spreading this across many more sites.
Now what is an Ultra Cluster made of? We build Ultra Clusters for every type of accelerated compute that we have, but the ones pushing new limits right now are Ultra Servers. This is where we have a large scale-up domain in the rack, where accelerators can communicate locally at extremely high rates. We also have a scale-out Ethernet domain where the Ultra Servers can talk to other Ultra Servers in the Ultra Cluster. One thing that's required some extra innovation is the connectors for Ultra Servers, and this is the stuff I really love. There's a lot of network bandwidth required to connect these super-dense racks, and we want to make sure the connections are as reliable as possible.
We see a lot of improvements by pre-testing all of our cabling. We use these loop back connectors called Firefly, and this one's called Loki. We connect these to the end of the cables before the compute racks show up. This breaks the full length of cabling up, so we're able to test it segment by segment. We're validating these cables as soon as they're installed, and this is potentially days before the compute rack even ships. This is our new ganged MMC interface.
I have one up here too. This holds 4 MMC connectors, and each of them has 16 fibers, meaning this represents 64 fibers being inserted with a single connection. Here's how we connect them to the rack: we remove the dust cover and we insert the gang connector. The wings on this connector have a quite satisfying little click. One thing that I like about it is it ensures we've got the right mounting pressure. The connectors are being pushed together and held in with the right amount of pressure. There's no reliance on humans pushing this correctly or pushing this in—we've got that click.
When we ran a study of this connector, we found that we had a 36% reduction in link level failures from that alone and a 76% reduction in the time to cable a rack, meaning we're getting capacity into service faster. These are double wins that I really like. Of course, we can remove an individual connector if we want to do some troubleshooting.
So now we have the Ultra Clusters and the Ultra Servers. Let's talk about how we're going to make them perform well for customers like Shannon Store. What would a re:Invent be without adding to the family, and this year that's AWS UltraSwitch. AWS UltraSwitch is an innovation where we looked at the total problem of how an AI and ML network operates. We worked out that there are some circumstances where we'd like the network to behave a little differently.
In the normal world of network switches, when it comes to connecting servers to the network, there's a rack switch. This connects the hosts to the fabric. When the rack switch needs to determine where to send a packet from the host and there are multiple candidate paths into the fabric, we generally use a method known as equal cost multi-path, or ECMP, which most hardware network folks will have heard of. This mechanism creates a hash key for each packet to determine which egress interface to use. We use information in the packet header, like the IP addresses, the UDP ports, and any encapsulation, to produce this hash key. That hash key then gets mapped to an egress interface, and the packet gets sent out that port.
However, as you can see here on Switch 2, sometimes flows will hash to the same place, causing some network contention. Over the years, network chip designers have created new features that improve the efficiency of this behavior. If an egress candidate interface's buffers are full or its queues are full, the ASIC may move some of those flows to a new interface. Alternatively, you can embrace the chaos of nature and spray packets across all candidate interfaces without keeping the flows intact. The downside is that in a loaded network like this, if this switch starts seeing problems, then a lot of flows are going to be negatively affected by this.
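A tiny sketch of that hashing behavior, purely illustrative (real ASICs use their own hash functions and field selections):

```python
# Toy ECMP: hash the flow's 5-tuple to pick one of four uplinks. Distinct flows
# can still land on the same uplink, which is exactly the contention problem
# described above when there are only a few very large flows.
import hashlib

UPLINKS = ["uplink-1", "uplink-2", "uplink-3", "uplink-4"]

def ecmp_pick(src_ip, dst_ip, src_port, dst_port, proto="udp"):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = int.from_bytes(hashlib.sha256(key).digest()[:4], "big")
    return UPLINKS[digest % len(UPLINKS)]

flows = [
    ("10.0.0.1", "10.0.9.9", 40001, 4791),
    ("10.0.0.2", "10.0.9.9", 40002, 4791),
    ("10.0.0.3", "10.0.9.9", 40003, 4791),
    ("10.0.0.4", "10.0.9.9", 40004, 4791),
]
for flow in flows:
    print(flow, "->", ecmp_pick(*flow))
```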
Finally, you can also structure the architecture to be rigidly aligned with the data patterns. But this means the fate of the traffic is coupled between the GPU and the network. Loss of a network switch means GPUs are cut off. This is also the least flexible for emerging traffic patterns that we see because we're hardwired in this configuration. So what UltraSwitch gives us is ultimate flexibility. It sits between the servers and the Ultra Cluster and can give us all of the previous behaviors.
Given that Shannon's workload is evolving, along with the rest of the industry, from pre-training, reinforcement learning, and single-node inference to distributed inference, the ability to choose the behavior that matches the workload without physical recabling is a powerful capability. UltraSwitch also means we can provision additional backplanes in the Ultra Cluster, so we're better able to absorb traffic bursts because we simply have more raw capacity. And if Shannon's workload requires a fabric with virtual rails, like you see here in the middle of the switch where traffic is aligned to a backplane and the green stays with the green, we can do that too.
I can also say that AWS UltraSwitch helps Shannon's costs because, as I said earlier, cost in this case is related to downtime. Yes, all hardware fails, but what's not cool is when ML hardware is running but its access to the network has been cut off by a network failure. UltraSwitch lets us act like a hot spare drive in a ZFS pool, meaning it's there to take over in the case of a network failure. We can redirect the traffic to a standby backplane, keep our rail alignment like it was before, and the healing happens on the remote side. This means the host never sees the link going down.
Now, practically speaking, if I'm operating a network and I see a bit error rate increase on a link, I can proactively shift that traffic away, seamlessly, without Shannon ever knowing. We're keeping the network more reliable because of it. I can also say that Shannon's workload is being helped by AWS UltraSwitch because we can keep the network always up to date. We're not going to ask Shannon or anyone else for a weekly maintenance window on an ML network just so we can update the software on our routers. With AWS UltraSwitch, we can seamlessly take a whole backplane out of service for software upgrades. Let's say we want to upgrade the firmware on every transceiver in that backplane. We shift the traffic over, do the upgrade, and shift the traffic back. This is all done without Shannon's services knowing, because we can keep the flows intact the way they were before.
Shannon need not obsess about lowering the annual failure rate of a link by another 2% because we can do that. We know all those 2%s add up. AWS UltraSwitch has another feature where, if we wanted, we could take all the storage traffic that lands on a network and constrain it to its own backplane. We can then give physical separation between RDMA and storage traffic. From the customer's perspective, this is all down the same interface, so they don't see any difference. We can do this all on the back end.
What I love about AWS UltraSwitch is that we have built all of these capabilities and they're configurable. Our network performance team can run experiments, identify benefits, work with customers, and then we can deploy the changes through our intent-driven controller systems with a couple of API calls to hundreds of thousands of switches and millions of links. To learn more about how that works, check out the talk we gave two years ago about our intent-driven network.
With AWS Ultra Servers growing up with Ultra Clusters and getting the behaviors they need from AWS UltraSwitch, we don't just connect capacity by the book. We are iterating where it gives customers like Shannon value, improving reliability, providing consistency, performance, and agility, and providing the same experience regardless of whether it's in Oregon, Indonesia, or Stockholm, or whether it's 20,000 GPUs or just 72. We're innovating on the technology for benefits at all scales.
If you want to see some of these things in person, we'll have some of our network switches, cables, connectors, and more at the expo hall in the Venetian here at re:Invent. The booth is staffed by the engineers who build these devices, so please get in deep and feel free to ask as many questions as you like. Sessions like this run on feedback from you. Please let us know what you thought of the session in the survey, but also use it to let us know what you might like to see more of next year. We'll keep sharing under-the-covers details like these while they're of interest to you.
With that, I'll thank you for going on this journey with Shannon and their new AI assistant, Penny. It's been a pleasure showing you some of the innovations that help customers like Shannon scale, plan for failure, and embrace new technologies. Hopefully you can see the work that goes into giving your workloads competitive advantages in today's ever-evolving technological landscape.
; This article is entirely auto-generated using Amazon Bedrock.