Kazuya

AWS re:Invent 2025 - From threat to threat intel: 360 degrees of DDOS (NET318)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - From threat to threat intel: 360 degrees of DDOS (NET318)

In this video, AWS explains how it transforms DDoS attacks into defensive advantages through a comprehensive threat intelligence flywheel. The session covers AWS's evolution from manual IP table edits in the 2000s to automated systems processing 60 terabits per second daily. Key topics include the network layer known offenders list and application layer known offenders list derived from 700,000+ network detections, MadPot honeypot infrastructure for botnet infiltration, and integration of threat intelligence into services like CloudFront, Global Accelerator, and WAF. The new WAF Anti-DDoS Managed Rule Group uses baseline learning and challenge actions to differentiate legitimate traffic from attacks. Figma's case study demonstrates defense-in-depth strategies combining WAF rate limiting, custom Envoy proxy filters, and traffic isolation that successfully mitigated attacks reaching tens of millions of requests per minute. AWS's collaboration with law enforcement for botnet takedowns exemplifies their commitment to improving internet-wide resilience.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Transforming DDoS Threats into Defensive Intelligence

Good afternoon everyone. It's the afternoon before re:Invent, so I'm glad to see so many of you still here. Today we're going to talk about how AWS turns DDoS into a defensive advantage. My name is Alaina Wanek, and I lead the Shield Response Team. I'm accompanied by Bryan Van Hook, who is a senior specialist SA, and Bashar Al-Rawi, who is a software engineer with the company Figma.

Over the next hour we're going to take you behind the scenes of AWS's DDoS protection systems and show you how we transform DDoS into actual threat intelligence. This is a 300 level course and so there's an expectation that you have some high level knowledge on what AWS services do as we'll be digging deeper into how we integrate DDoS protection into those services.

Thumbnail 70

Thumbnail 80

Before we dive in, let me take you through a simple analogy for how AWS approaches threat intelligence. Think of the internet like the weather. Most days are predictable and mild, sunny with a chance of rain. And when we encounter rain, simple protection like an umbrella helps us handle it easily. But weather has a tendency to evolve from a drizzle to a storm, and similarly, DDoS can escalate from simple attacks to complex threats.

Thumbnail 110

Thumbnail 120

When severe weather approaches, we take shelter, and the same principle can apply to defending against major cyber threats. Just like meteorologists use sophisticated systems to track hurricanes, we at AWS have also evolved our defenses to not only react to DDoS but also adapt to emerging threats.

Thumbnail 140

In today's agenda we're going to go through a flywheel approach for protecting against DDoS. Bashar is going to present his use case on how Figma has used that approach to help benefit them, and then we'll end with some best practices that you can take away to help make your systems more resilient.

Thumbnail 160

Thumbnail 170

The Evolution of DDoS Attacks and AWS's 360 Degrees of DDoS Approach

To kick us off, we're going to dig into how DDoS has developed over the years. DDoS threats can target any type of application on the internet, and broadly speaking, bad actors commonly send commands to botnets to generate large volumes of DDoS traffic. For reflection attacks, bad actors scan the internet for open ports like DNS or NTP, then direct traffic at those ports, reflect it off of them, and amplify it up to 100 or even 1,000 times the original size in order to overwhelm the endpoint.

For proxy-driven attacks, threat actors use open or private HTTP proxies, and they use these to redirect requests to disrupt websites' availability. Generally these attacks hit various layers within the OSI model, and they aim to disrupt your availability, overwhelm your network capacity, overload your connection tracking, and generally overload your application server.
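The leverage a reflection attack gives an attacker is simple arithmetic: a small spoofed request produces a much larger response aimed at the victim. A back-of-the-envelope sketch (all numbers are illustrative assumptions, not figures from the talk):

```python
# Illustrative estimate of reflected/amplified attack volume. Every number
# here is an assumption for demonstration, not a measurement from the talk.

def amplified_bps(bots: int, request_bytes: int, requests_per_sec: float,
                  amplification: float) -> float:
    """Bandwidth arriving at the victim, in bits per second.

    Each bot sends small spoofed requests to open reflectors (e.g. DNS/NTP);
    the reflector's response is `amplification` times larger than the request.
    """
    bytes_per_sec = bots * requests_per_sec * request_bytes * amplification
    return bytes_per_sec * 8  # convert bytes/s to bits/s

# 10,000 bots sending 60-byte queries at 100 req/s each, amplified 100x:
print(f"{amplified_bps(10_000, 60, 100, 100) / 1e9:.1f} Gbps")  # 48.0 Gbps
```

Even a modest botnet with a 100x amplifier lands tens of gigabits per second on the target, which is why these vectors dominated the early era described next.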

Thumbnail 240

I'm going to talk through how the threats have evolved over the years. So in the early to mid 2000s, we mostly saw reflection and amplification attacks, and at that time we were getting on hosts and editing IP tables and doing this per host kind of on the fly to protect services like CloudFront and Route 53. This was complicated and unsophisticated, and so we knew we needed to evolve our approach.

In the 2010s, attacks evolved to leverage botnets, and we responded to this by deploying packet scrubbing hardware across our network edge in order to centralize mitigation in 2012. We then introduced automated mitigations in 2013 in order to help get the humans out of the loop and make it more automated. Then WAF and Shield were released a couple of years later in order to provide better customization for mitigations.

In the 2020s, we saw the rise of HTTP-driven proxy floods, and we responded to this by launching the Shield Application Layer Auto Mitigation in 2021. Then just this year, we also released the WAF Anti-DDoS Rule Group, and Bryan will dig deeper into that in a little bit.

Thumbnail 340

Thumbnail 360

Thumbnail 370

Thumbnail 380

Thumbnail 390

Now that we have some background context around DDoS over time, let's talk about how AWS thinks about addressing DDoS. So we're going to go on this journey through this flywheel that we call the 360 Degrees of DDoS, and this represents how we think about producing threat intelligence to protect ourselves from DDoS attacks. We went through a journey of how threat actors have evolved over time. It starts with a threat actor. We then dive deeply into the telemetry that we collect across our targeted endpoints. And then we analyze the signal to evolve that into actionable threat intelligence. From there, we show how AWS threat intelligence is built into AWS services for improved endpoint protection. And then finally, we wrap up with actual efforts to disrupt and take down threat actors in order to improve resilience from DDoS across the internet.

Thumbnail 410

Capturing Telemetry: AWS's Massive Network Detection Systems

So let's get into the telemetry piece for what we observe on our targeted endpoints. AWS's massive network provides a real-time, comprehensive view of global internet traffic, and not just global but also regional, because we're spread across various locations. Just to give you an idea of the scale, on average we see a peak total rate of 60 terabits per second, 12 billion packets per second, and 105 million requests per second. These are the peak rates we see, on average, every single day, and that's an enormous amount of traffic to sift through to get good information out of it.

Thumbnail 460

Meteorologists face a similar problem when forecasting the weather. If they were just looking at the temperature outside, they wouldn't get a good understanding of what it is actually like out there. And so, much like meteorologists set up weather stations to measure things like pressure, humidity, UV index, and so on, we need to be able to gather data on more than just IP addresses.

Thumbnail 490

Thumbnail 530

Thumbnail 550

Thumbnail 560

To do that, we've strategically integrated detection systems across AWS to capture different aspects of the traffic. When DDoS traffic is destined for an AWS endpoint, it first hits our network layer detection systems. This system sees what we call top talker traffic, the peak of the peak traffic coming through, and it operates at the byte and packet level, on the common vectors that we see in the IP and protocol layers. Our next system is in our edge services like AWS Global Accelerator, CloudFront, and Elastic Load Balancing, and these systems see information based more on top requests, TCP connections, ASN, et cetera. Further in, we also have a detection component in WAF, which looks at header order, header combinations, and request combinations. And then at the last stop in our detection systems, we have a fleet of honeypots that detect through direct threat actor interaction. We'll dig into this in just a few minutes.

Thumbnail 580

Across all of these detection systems, year to date, we've seen over 700,000 network layer detections and over 900,000 application layer connection detections. And this has helped us funnel down that raw telemetry that we've seen all across our network into more specified information on traffic anomalies. But that being said, this is still a ton to have to sift through. So we need to take it one step further, which is where threat intelligence comes in.

Thumbnail 610

Thumbnail 620

From Detection to Intelligence: Overcoming Challenges with MadPot Honeypots

And for me, this is kind of the fun stuff. Doing that has a few challenges. We collect so much data that our detection systems need to distill it into something meaningful without hindering real users.

Thumbnail 670

In DDoS, we regularly see threat actors reusing IP addresses when sourcing DDoS traffic. This is likely because spinning up new infrastructure is either costly or time-consuming, and so it tends to be reused. But unfortunately, it's not just as easy as blocking any IP address that we detect a traffic anomaly on. Often threat actors will use open proxies or they will use compromised home routers in order to send traffic, and so if we were to just outright block those IP addresses, we would be impacting legitimate users.

Thumbnail 690

Our second challenge is that our detection systems are very sensitive, which means they sometimes capture large increases of traffic that are non-malicious, commonly known as flash crowds, and so these aren't actually DDoS. And last, sometimes compromised hosts get patched and become good again.

Thumbnail 700

Thumbnail 710

Thumbnail 730

To account for these challenges, we have to get crafty with our traffic attributes. So to create a good list, we run through a cycle to continuously improve our list. When we detect DDoS, we run an analysis to understand how effective our list is at capturing the actual DDoS traffic. This analysis evaluates the match rate that our list is having on the actual DDoS traffic. When our match rate is low, we inspect the traffic even further.

Thumbnail 750

So what this means is we have to dig into this funnel of traffic attributes in order to better understand how the DDoS herd of traffic is behaving and how that differs from the legitimate traffic. Once we understand that behavior, we tune our algorithm to capture those attributes, and this tuning process helps us better capture the same herd-like behavior, so that way we are able to mitigate that same type of behavior when it repeats in the future. And then all of that leads to successful mitigation.
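The tuning loop above can be sketched in a few lines: measure how much of the observed DDoS traffic the known-offenders list actually matches, and send low-match detections back for deeper attribute analysis. The function names and the 0.8 threshold are illustrative assumptions, not AWS internals:

```python
# Hypothetical sketch of the match-rate feedback loop described above.
# Names and the 0.8 threshold are illustrative assumptions.

def match_rate(offender_ips: set, attack_ips: list) -> float:
    """Fraction of observed attack sources already on the offenders list."""
    if not attack_ips:
        return 1.0
    hits = sum(ip in offender_ips for ip in attack_ips)
    return hits / len(attack_ips)

def needs_tuning(offender_ips, attack_ips, threshold=0.8) -> bool:
    # A low match rate means the list is missing part of the attack "herd",
    # so we go back and inspect additional traffic attributes.
    return match_rate(offender_ips, attack_ips) < threshold

offenders = {"203.0.113.5", "198.51.100.7"}
attack = ["203.0.113.5", "198.51.100.7", "192.0.2.1", "192.0.2.2"]
print(needs_tuning(offenders, attack))  # 0.5 match rate -> True
```

In practice the "attributes" being tuned go well beyond IP addresses, but the decision structure is the same: a low match rate triggers deeper inspection, and successful tuning closes the loop with mitigation.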

Thumbnail 770

Thumbnail 810

But as we all know, threats are always evolving, and sometimes new DDoS vectors crop up and evade our detections. And when they evade our detections, that means they're not properly getting integrated into our lists. This is where our honeypots come in, known as MadPot. And so MadPot is a fleet of AWS honeypots that run across the AWS network and operate on EC2 instances. MadPot pretends to be a vulnerable service in order to get actors to interact with it, and this provides AWS intelligence on what the bad actors are actually trying to accomplish.

Thumbnail 840

For botnet-driven DDoS, MadPot captures the exploitation attempts. It then automatically downloads, detonates, and analyzes the malware to identify key parts of the botnet infrastructure. This also allows MadPot to infiltrate the botnet and see when the command and control server is actually issuing DDoS commands in order to run DDoS attacks. These commands include attack destinations, which gives us a positive DDoS detection signal to help us know when DDoS is evading our detection systems.

Thumbnail 860

Thumbnail 870

Known Offenders Lists: Application Layer and Network Layer Protection

We then run these through our threat cycle in order to incorporate them into our detection systems, and then once they're incorporated into our detection systems, we're then better able to include all of those into our known offenders lists. Okay, so we have two major known offenders lists. I'm going to start with the application or Layer 7 known offenders list, and this was created in 2022, kind of around the time of this increase of proxy-driven HTTP flood, and it was really to help combat that rise of attack pattern. This is curated on bad actors sending web-based, so HTTP or HTTPS type traffic.

Thumbnail 960

But one thing I want to note is that request-level traffic does not match one-to-one with network layer packets. What that means is you can cram many requests into a single network packet, and often threat actors exploit this by cramming a ton of application layer requests into a single packet. That makes the traffic a bit challenging to look at, because it doesn't map equally to what would congest our network capacity. We actually have to terminate TLS in order to understand how many requests are coming into the application. We've seen great benefit using this list as a mitigation tool and integrated it early on into CloudFront and WAF. Some exciting news: just this year we also integrated it into Elastic Load Balancing, specifically Application Load Balancer, so that it has some extra built-in DDoS resilience.

Thumbnail 970

Our second list was developed just this year, and it's the network layer known offenders list. It focuses on bit and packet heavy attacks, and this was curated using the detection data that we have on our network layer detection systems. To dig into the effectiveness of this list and highlight how it's both helping globally and regionally, I'm going to go over two case studies.

Thumbnail 1000

For the first case study, I want to walk through a recent UDP vector that we've seen out in the wild. The concept is pretty straightforward. It's a UDP flood, but what makes it challenging is that the threat actors send a distributed amount of traffic with both the source ports randomized and the destination ports randomized. What this means is it's really challenging to be able to pinpoint the traffic to any one of these attributes or any one of these ports because they're all being randomized. Earlier in the year, we weren't actually able to detect these, and we were able to lean on detection systems like MadPot and our other telemetry in order to help us understand this traffic profile. With all of that, we were then able to feed it into our detection systems, which then had the positive feedback loop of getting integrated into our network layer mitigations, or our network layer known offenders. This list is used today mostly to protect AWS network infrastructure.

Thumbnail 1070

For the second example, we'll examine a regional case study where, unlike the previous one with its particular UDP randomization, the DDoS vectors were a bit more varied. As I mentioned, this list broadly protects AWS infrastructure. When analyzing its effectiveness, we discovered something interesting in one particular region: the list was only seeing a small percentage of DDoS in this area. The strange thing is that we were accurately detecting the floods, but when we checked the match rate our known offenders list was having on that traffic, it was pretty low relative to the threats we were seeing in other regions.

Upon further investigation, we found that the attacking IP addresses were uniquely regional. These threat actors were specifically targeting this area rather than participating in the broader global threat landscape. To improve our protection coverage, we analyzed the traffic attributes and refined our inclusion criteria, which allowed us to better capture that threat actor and those IPs. When they sent traffic again, we were able to mitigate them with our known offenders list.

Thumbnail 1170

Thumbnail 1180

Integrating Threat Intelligence into AWS Services for Non-HTTP Workloads

All right, now I've shown you how AWS curates threat intelligence. I'm going to hand it over to Bryan so he can show you how we've baked it into our services. Thank you. All right, now that you've seen how we curate and invest in all of this great threat intelligence, what should we do with it to protect customers? Well, we integrate that threat intelligence into a number of AWS services, and this helps those services provide mitigation capabilities for customers. As you'll see, some of these capabilities are built in and always on to protect all customers, while other capabilities require you to opt in to certain features based on the type of workload that you have.

Thumbnail 1210

Thumbnail 1220

Now application traffic is often based on HTTP, so common examples there are websites or maybe a REST API. But there are a lot of non-HTTP workloads out there as well. Some examples might be a back-end gaming server that's running a popular massively multiplayer game, or maybe you're operating your own DNS resolvers.

Thumbnail 1240

So next we're going to talk about some DDoS resiliency best practices for both of these categories of workloads, and let's start with non-HTTP workloads. So here you see I have a non-HTTP workload running in a couple of private subnets. That's done intentionally to reduce and limit exposure to the public internet. We also recommend using a mechanism that automatically scales your workload to respond to an unexpected flood that wasn't fully mitigated so that you can absorb it and protect your business.

Thumbnail 1290

It's also very important where possible to use an intelligent load balancing service. Here I'm using Network Load Balancer, and this will evenly distribute incoming traffic across the entire fleet. And you'll see that this is also sitting in private subnets. Again, very important to limit your exposure to the public internet. Now for this type of workload, we recommend using AWS Global Accelerator. Global Accelerator is a networking service that improves performance and reliability by routing your traffic over the AWS global network instead of the internet.

Thumbnail 1340

So with Global Accelerator you get static IPs that are anycast from the AWS Edge network, which means your client traffic is routed to the closest edge location and then from there it's routed to your VPC over the AWS global network. And what's really great about this is that it works even for VPC resources that are private like our Network Load Balancer in this case. All right, so at all of these edge locations there are many built-in DDoS mitigations, and I'm going to cover a few here.

The first is that we rate limit application layer and network layer known offenders, based on the threat intelligence that Alaina outlined earlier. We don't outright block this traffic because these known offender lists can contain legitimate services that customers want to use but that have been abused recently, so we can't outright block it all. We also block SYN floods. As you probably know, the SYN packet is the first step in a TCP three-way handshake; the server responds with a SYN-ACK and then waits for an ACK from the client.

Well, the SYN flood attempts to overwhelm the server by exhausting the SYN queue, which is going to deny legitimate connections. So this is mitigated at our edge locations using something called a SYN proxy. So instead of storing that SYN, we encode the connection information into a sequence number that's sent back to the client. And when legitimate clients respond to that request, we decode the connection information and use that to reconstruct the connection state to complete the three-way handshake.
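The SYN-proxy trick above is the classic SYN-cookie idea: encode the connection tuple into the sequence number instead of allocating state, and only build state when a valid ACK comes back. A simplified illustration (real implementations also encode MSS and a rotating timestamp; the secret and field choices here are assumptions for the sketch):

```python
# Simplified SYN-cookie sketch, illustrating the stateless SYN-proxy idea.
# Real implementations also encode MSS and a timestamp; this is not AWS code.
import hashlib

SECRET = b"rotate-me-regularly"  # per-deployment secret, assumed

def syn_cookie(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> int:
    """Derive a 32-bit sequence number from the connection tuple."""
    material = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(SECRET + material).digest()
    return int.from_bytes(digest[:4], "big")  # fits a 32-bit TCP seq number

def ack_is_valid(src_ip, src_port, dst_ip, dst_port, ack_num: int) -> bool:
    # The client's ACK acknowledges cookie + 1 (mod 2^32); only then do we
    # reconstruct connection state and complete the handshake.
    return (ack_num - 1) % 2**32 == syn_cookie(src_ip, src_port, dst_ip, dst_port)

cookie = syn_cookie("198.51.100.9", 40000, "203.0.113.10", 443)
print(ack_is_valid("198.51.100.9", 40000, "203.0.113.10", 443, cookie + 1))  # True
```

Because spoofed SYNs never produce a matching ACK, the flood consumes no server-side memory, which is exactly the property the edge-location mitigation relies on.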

Next up is what is called a PSH-ACK flood, so this happens after the three-way handshake is complete. Clients can send packets with a push flag set if they want to. That's normal, and this acts as a directive for the server to take the data out of that packet and immediately push it up to the application layer without buffering. So a PSH-ACK flood creates a situation where there's a continuous state of urgency. And if you look at these packets on a normal basis, they represent a very small percentage typically of total TCP traffic, so we aggressively rate limit this type of traffic, and this helps avoid a state of continuous urgency.
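The "aggressively rate limit" behavior described above is commonly built on a token bucket: PSH-flagged packets spend tokens, the bucket refills slowly, and anything beyond the budget is dropped. A minimal sketch, where the rate and burst values are illustrative assumptions rather than AWS-published figures:

```python
# Illustrative token-bucket rate limiter for PSH-flagged packets.
# The rate/burst budget is an assumption, not an AWS-published figure.
class TokenBucket:
    def __init__(self, rate_per_sec: float, burst: float):
        self.rate, self.capacity = rate_per_sec, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, now: float) -> bool:
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=10, burst=5)  # allow ~10 PSH-ACKs/sec
allowed = sum(bucket.allow(now=0.0) for _ in range(20))
print(allowed)  # only the 5-packet burst gets through at t=0
```

Since PSH-ACKs are normally a small fraction of TCP traffic, a tight budget like this barely touches legitimate flows while starving a flood.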

Thumbnail 1480

The last I'll mention is that we block suspicious UDP traffic, based on port numbers that are typically only relevant outside of cloud environments. The next piece here is that it's important to protect your Global Accelerator endpoint with AWS Shield Advanced. Shield Advanced will place mitigations using thresholds that are twice as sensitive as Shield Standard, and these mitigations stay in place based on exponentially increasing TTLs to help mitigate frequently reoccurring attacks. Also, with Shield Advanced you gain access to the Shield Response Team, a team of DDoS response engineers around the world that you can engage for help troubleshooting more complex issues that you haven't been able to mitigate. They can help you implement sophisticated mitigations in your environment, and we are of course fortunate to have Alaina managing that team for us at AWS.

Thumbnail 1540

Thumbnail 1550

DDoS Protection for HTTP Workloads with CloudFront and AWS WAF

All right, let's move on to HTTP workloads. This part looks similar. We have an HTTP workload again running in private subnets to reduce internet exposure, and we're using an auto scaling mechanism to ensure you can scale out if you need to absorb an unmitigated flood. We're also using an intelligent load balancing service, in this case an Application Load Balancer because it's HTTP aware, and this will evenly distribute traffic across your fleet as well. You'll see that this is also in private subnets to reduce internet exposure.

Thumbnail 1580

You may be wondering how are your clients going to reach this load balancer if it's sitting in a private subnet that's not reachable from the internet. Well, for all HTTP workloads exposed to the internet, we recommend using Amazon CloudFront, our managed CDN service. A relatively new feature with CloudFront is called VPC Origins. This lets you define a VPC origin in CloudFront that references a resource in your VPC that's private. Then you can set up a CloudFront distribution using the VPC origin, which will route traffic over the AWS global network in a way that doesn't require your origin to be exposed to the public internet at all.

Thumbnail 1630

So most of you that are using CloudFront might have origins that are exposed to the internet, and I strongly encourage you to take a look at this feature if you haven't yet. Also, at all of our CloudFront edge locations, we rate limit application layer known offender traffic. Again, this is based on the threat intelligence that Alaina outlined earlier. And at these edge locations, we block all known network layer threats.

Thumbnail 1650

Now with CloudFront, we recommend protecting your distributions with AWS WAF. Now AWS WAF has been around since 2015, and it's evolved to address a wide variety of threats, but there are a few here that are specific and unique to DDoS threats. The first is a new managed rule group this year called the Anti-DDoS Managed Rule Group, and I'm going to dive deeper into this in a bit. Another one I have listed here is the DDoS IP List, and this is a managed rule with AWS WAF that has a list of IPs that's very similar to the application known offenders list that we rate limit at these edge locations. And as you'll see, you can use this rule to tailor your mitigations in AWS WAF if you need to, and we'll go through that as well.

Thumbnail 1720

AWS WAF Advanced Features: Rule Labels, Challenge Action, and the Anti-DDoS Managed Rule Group

Now before we get into the Anti-DDoS rule group, there are a couple of AWS WAF features that we'll review first. The first is WAF rule labels. The second one is the challenge action. So here's a request that might be coming into your application where it has a URI of terraform, and the source of this request is Mars, so we're getting a request from Mars just to keep things a little interesting here.

So WAF rules, they typically inspect one element of a request like maybe the URI or query arguments or the body of a request or the source IP, source geo, things of that nature. And often that WAF rule generates a signal that by itself is not enough to decide ultimately what you want to do with that request. So it would be nice if WAF rules could annotate a request by adding metadata that could then be inspected by other rules to make a final decision on what to do with that request based on a combination of signals, and that is where rule labels come into play.

Thumbnail 1770

Thumbnail 1790

So when this request comes in to your system and it's inspected by AWS WAF, let's say this rule looks at the URI and says if it's slash terraform, I'm just going to count this request. I'm not going to block it or allow it, and when you do that, you can have this rule attach a label to that request. So currently terraforming a planet is kind of expensive, so we're just going to label this with expensive. And so now the request inside of AWS WAF looks like this. It's got one label on it called expensive.

Thumbnail 1800

Thumbnail 1810

The next rule comes into play, and it's looking at the source geo, and if that is Mars, that's currently a hazardous location to operate, so we're going to add another label called hazardous.

Thumbnail 1820

And now the request has two labels on it: expensive and hazardous. The third rule now comes into play, and you'll see that it's not actually looking at any of the original fields on the HTTP request. It's just looking at labels, and it's looking for both of these labels. If they both are present, it's going to make a decision to block this request because maybe your business isn't quite ready yet to perform a project that's both this expensive and this hazardous.
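The label mechanics walked through above can be captured in a tiny simulation: early rules in Count mode attach labels, and a final label-match rule combines those signals into a block decision. This is not the WAF engine, just an illustration of the flow:

```python
# Tiny simulation (not the WAF engine) of rule labels: rules 1 and 2 run in
# Count mode and attach labels; rule 3 matches on the label combination.
def evaluate(request: dict) -> str:
    labels = set()
    if request.get("uri") == "/terraform":
        labels.add("expensive")          # rule 1: Count action + label
    if request.get("geo") == "Mars":
        labels.add("hazardous")          # rule 2: Count action + label
    if {"expensive", "hazardous"} <= labels:
        return "BLOCK"                   # rule 3: matches on both labels
    return "ALLOW"

print(evaluate({"uri": "/terraform", "geo": "Mars"}))   # BLOCK
print(evaluate({"uri": "/terraform", "geo": "Earth"}))  # ALLOW
```

Either signal alone only annotates the request; only the combination triggers the block, which is precisely what makes labels useful for composing weak signals.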

Thumbnail 1850

Next up is the challenge action. So when your web application receives a flood of traffic, you have to decide what to do with it. Now you could just block all of it, but that is almost certainly going to cause false positives because it's a mix of good and bad traffic. So it would be nice if there was a way for AWS WAF to respond to traffic during a flood that disrupts the malicious traffic but does not impact the experience of a normal user, and that is where the challenge action comes into play.

Thumbnail 1880

Thumbnail 1890

Thumbnail 1910

So here we have a client, you don't know if it's good or bad, sending a request to your login page. And the cookie is blank, which just means we haven't seen this client before. So AWS WAF is going to look at this request, and if the URI is login, we're not going to allow or block the request. We're going to respond with a challenge. So what happens here is the WAF returns some HTML content to the client that contains an embedded script. Now the client needs to run this script because what it does is it collects metadata about the client environment, and we call this a fingerprint.

Thumbnail 1950

Now the script introduces a computational cost that's expensive for botnets at scale to operate, but it's not something a normal human would notice. And also in the context of DDoS, usually what's happening is the client, the driver, is just simply trying to overwhelm you with requests and it's not even paying attention to the response at all, let alone take the time to run a script. So this script generates a fingerprint. The fingerprint is sent back to AWS WAF where it's evaluated, and what we're doing here is trying to determine if we think that this is a human or a bot, something that's not a human.

Thumbnail 1960

Thumbnail 1970

Thumbnail 1980

If it is a human, we generate an encrypted token that's sent back to the client. And then the client stores that token in a cookie and resends the request. Now this rule evaluates the request again, and if it sees that there's a valid WAF token in the cookie, it just moves on to the next rule to complete this handshake process.
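The token half of that handshake can be sketched with a signed token: once the fingerprint looks human, issue a token the client stores in a cookie, and let requests presenting a valid token skip the challenge. WAF's real tokens are encrypted and carry more state; the HMAC scheme and key here are purely illustrative assumptions:

```python
# Illustrative token issue/verify sketch for the challenge handshake.
# AWS WAF's actual tokens are encrypted and richer; this is not its format.
import hashlib
import hmac

KEY = b"waf-token-signing-key"  # assumed secret, rotated in practice

def issue_token(client_id: str) -> str:
    """Sign a client identifier so it can't be forged without the key."""
    sig = hmac.new(KEY, client_id.encode(), hashlib.sha256).hexdigest()
    return f"{client_id}.{sig}"

def token_is_valid(token: str) -> bool:
    client_id, _, sig = token.rpartition(".")
    expected = hmac.new(KEY, client_id.encode(), hashlib.sha256).hexdigest()
    return bool(client_id) and hmac.compare_digest(sig, expected)

tok = issue_token("fingerprint-abc123")
print(token_is_valid(tok))                # True
print(token_is_valid("forged.deadbeef"))  # False
```

A bot blindly replaying requests never completes the handshake, while a real browser pays the cost once and is then waved through on every subsequent request.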

Thumbnail 1990

Thumbnail 2010

All right, now that we've looked at WAF rule labels and the challenge action, let's get into this Anti-DDoS rule group and see how it works. So you're following best practices and you've got a CloudFront distribution set up, and you're seeing traffic coming from a wide variety of clients. And you've also protected your CloudFront distribution with AWS WAF, and you have implemented this new Anti-DDoS Managed Rule Group. So what happens right away is that the WAF starts to learn what normal traffic looks like for your web application, and we call this a baseline.

This baseline is established within 15 minutes, and it's a rolling baseline that's updated every 5 seconds. One of the great things about this baseline is that we can detect anomalies in traffic that are unique to your workload. Not all anomalies are necessarily bad, though. You might have a flash crowd, a spike in traffic from legitimate users, and you don't want to block that. You need a way to differentiate a good spike from a bad one, and we use the threat intelligence that Alaina outlined earlier to help make that differentiation.
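A toy version of such a rolling baseline: keep an exponentially weighted average of the request rate, update it each interval, and flag intervals that far exceed it. The smoothing factor and the 3x threshold are illustrative assumptions, not how the managed rule group is actually parameterized:

```python
# Toy rolling-baseline anomaly detector in the spirit described above.
# Alpha and the 3x threshold are illustrative assumptions, not WAF internals.
class Baseline:
    def __init__(self, alpha: float = 0.1, threshold: float = 3.0):
        self.alpha, self.threshold = alpha, threshold
        self.avg = None

    def observe(self, rps: float) -> bool:
        """Return True if this interval looks anomalous versus the baseline."""
        if self.avg is None:
            self.avg = rps       # first interval seeds the baseline
            return False
        anomaly = rps > self.threshold * self.avg
        if not anomaly:          # only let normal traffic move the baseline
            self.avg = (1 - self.alpha) * self.avg + self.alpha * rps
        return anomaly

b = Baseline()
flags = [b.observe(r) for r in [100, 110, 95, 105, 900]]
print(flags)  # only the 900-rps spike is flagged
```

Note the baseline deliberately ignores flagged intervals so a sustained flood can't "teach" the detector that attack-level traffic is normal; distinguishing a malicious spike from a flash crowd is where the threat intelligence comes in.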

Thumbnail 2070

Thumbnail 2100

Now when there's a malicious spike in traffic, we call this a DDoS event, and every request that arrives during a DDoS event gets an "event detected" label from these rules; you'll see later how that can come into play if you want to tailor the behavior here. During a DDoS event, not every request that comes in is malicious; there's going to be good traffic in the mix as well. So what this rule group does is assign a suspicion score with three levels, low, medium, and high, and you can see the corresponding labels that are added to requests based on the level of suspicion.

Thumbnail 2120

This is great because if you need to, you can fine tune the sensitivity of this rule group based on these suspicion levels, so you'll see how that works in a bit. The last piece here is that in this rule group you can define a regular expression that essentially determines which URIs in your application support the challenge action, so that helps you narrow down the scope for which these rule groups will respond with a challenge or not.

Thumbnail 2160

Now for most customers, you really don't need to understand any more than what I've already covered. This rule group works great for most customers using the defaults, but since this is a 300 level talk, we're going to go a little bit deeper. Alright, so what are the rules in this rule group? Well, there's three of them. The first one is called ChallengeAllDuringEvent. So this rule will respond with a challenge for every single request that comes in during a DDoS event that's been detected, but it only does this for requests that are challengeable based on that regular expression you've defined earlier, so it doesn't look at the suspicion score at all. So you can think of this rule as having the maximum sensitivity because it challenges everything.

Thumbnail 2200

Thumbnail 2220

Now, if you want to tailor the sensitivity of this behavior, you can put this rule in count mode, which brings the next rule into play: ChallengeDDoSRequests. This rule does something very similar. During a DDoS event it will challenge requests, but only if they are suspicious. By default the sensitivity of this rule is high, which means it will challenge any request that has any level of suspicion at all. Now let's say it's being a little too aggressive and you don't want it to challenge requests that have low suspicion. You can set the sensitivity from high to medium to make it less sensitive, and so on.

Thumbnail 2260

Thumbnail 2280

Tailoring WAF Behavior: Combining Rate-Based Rules with DDoS IP Lists

The third rule is called DDoSRequests, and it actually blocks requests that are suspicious during a DDoS event. By default the sensitivity of this rule is low to avoid false positives, so it only blocks requests that have the highest suspicion level during a DDoS event. Alright, so let's talk about how you might want to tailor the behavior of AWS WAF beyond the capability you get out of the Anti-DDoS Managed Rule Group. Here we have edge locations off on the right, and you've got CloudFront set up with a distribution that's protected by AWS WAF, with the Anti-DDoS rule group already in play.
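Stepping back for a moment, the interplay of the three rules just described can be sketched as a toy decision function. Everything here is invented for illustration (the level ordering, the sensitivity mapping, the function names), and ChallengeAllDuringEvent is assumed to be in count mode; this is not the rule group's actual implementation.

```python
import re

# Hypothetical suspicion ordering; label names are illustrative only.
LEVELS = {None: 0, "low": 1, "medium": 2, "high": 3}

def evaluate(uri, suspicion, event_active,
             challengeable=re.compile(r"^/"),
             challenge_sensitivity="high",   # ChallengeDDoSRequests default
             block_sensitivity="low"):       # DDoSRequests default
    """Sketch of how the challenge and block rules might combine.

    A "high" sensitivity acts on any level of suspicion; a "low"
    sensitivity acts only on the most suspicious requests. This mirrors
    the talk's description, not the rule group's internals.
    """
    if not event_active:
        return "allow"
    # DDoSRequests: block the most suspicious traffic first.
    needed = {"high": 1, "medium": 2, "low": 3}[block_sensitivity]
    if LEVELS[suspicion] >= needed:
        return "block"
    # ChallengeDDoSRequests: challenge suspicious, challengeable URIs.
    needed = {"high": 1, "medium": 2, "low": 3}[challenge_sensitivity]
    if challengeable.match(uri) and LEVELS[suspicion] >= needed:
        return "challenge"
    return "allow"
```

With the defaults above, a highly suspicious request is blocked, a mildly suspicious one is challenged, and an unsuspicious one passes through untouched, which matches the layered behavior described in the talk.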

Thumbnail 2310

Let's walk through four different types of traffic that you might want to consider when tailoring the behavior here. The first is normal users sending around 50 requests per minute. You'll see that some of these users might be using an open proxy, which is perfectly normal. Unfortunately, some of them are using an open proxy, shown in red here, that has been recently abused and is actually on our known offenders list, so it's a tricky scenario we have to deal with.

Thumbnail 2340

Thumbnail 2350

Thumbnail 2360

The next use case is a threat actor that's using an IP address that has bad DDoS reputation, and they're sending four times more traffic than you would expect. And then here's another user that's sending 20 times more traffic than you would expect, again using an IP that has a bad reputation. And then our fourth scenario here is a threat actor that's using an IP address that doesn't have a bad reputation. It's actually good as far as we can tell, and they're sending over 100 times more traffic than you would expect.

Thumbnail 2380

Thumbnail 2400

So what are we going to do with these situations? On the lower right corner, let's say you implement this DDoS IP list rule that I mentioned earlier, which is the list of known offenders based on our threat intelligence, and you put it in block mode to start off. What's going to happen here is that all of this top threat actor's traffic will be allowed, because they're using an IP that doesn't have a bad reputation, so it won't be blocked by the DDoS IP list, and

Thumbnail 2410

Thumbnail 2420

we call this a false negative. This is not the outcome you want. The next two will be blocked because they're using IPs that are known offenders; that's all good. But here we're blocking normal users because they're using an open proxy that's been abused recently, and we call this a false positive. Again, not the right outcome.

Thumbnail 2440

So what can we do to avoid false positives and false negatives here? Let's back up a little bit and try something else. Here we still have the Anti-DDoS rule group in play, but remember, it labels traffic during a DDoS event with the event detected label. Let's see if we can take advantage of that. I'm using numbers here because the ordering matters for what we're walking through next.

Thumbnail 2460

So let's put the DDoS IP list back into play, but instead of putting it in block mode, we'll just put it in count mode. The value here is that it will label requests coming from known-offender IPs with the DDoS list label, which gives you metadata to tell whether a request is coming from an IP that is a known offender. Note that DDoS list is not the actual label for this rule; I've abbreviated it so it fits on the slide.

Thumbnail 2500

The third rule we're going to add to this web ACL is a rate-based rule. Rate-based rules are great. They count the number of requests coming in, aggregating on some sort of key. The key I'm using here is the source IP, so it counts every request by source IP. It does this counting over a rolling time window, and I'm using a one-minute window here. You set a limit, and the limit I'm using is 2000. So this rule will block traffic from any source IP that has sent more than 2000 requests during the past minute, which is great: it handles the top situation here.

Now you might be tempted to set this limit of 2000 much lower to address the other use cases here, but be careful with that. You might have normal users who are behind a corporate proxy or a NAT gateway of some kind: a big collection of users that all look like they're coming from the same public IP address. They all get aggregated together and can easily reach much higher request rates. So with rate-based rules, the rule of thumb is to err on the side of limits that are higher than you might think, to avoid false positives; they're still very effective at handling very large floods.
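The mechanics of a rate-based rule, counting per source IP over a rolling window against a limit, can be sketched like this. It's a synchronous toy (the class name and API are invented); the real WAF evaluation is distributed and asynchronous.

```python
import time
from collections import defaultdict, deque

class RateBasedRule:
    """Minimal sliding-window sketch of a WAF-style rate-based rule.

    Counts requests per aggregation key (the source IP here) over a
    rolling window and reports whether the key is within its limit.
    """

    def __init__(self, limit=2000, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.hits = defaultdict(deque)  # key -> request timestamps

    def allow(self, key, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[key]
        # Drop timestamps that have aged out of the rolling window.
        while q and now - q[0] >= self.window:
            q.popleft()
        q.append(now)
        return len(q) <= self.limit
```

The NAT-gateway caveat above shows up directly in this model: every user behind one gateway shares a single `key`, so their requests pool into one deque, which is why the limit should be set generously.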

Thumbnail 2590

The fourth rule is another rate-based rule. Here the limit is much lower, down to 500 per minute, but the scope says DDoS list, which means it only counts requests coming from source IPs that are on that list. There's some level of risk with that traffic, so it's safer to use a lower limit that should not affect normal users.

Thumbnail 2620

The last rule we're going to add is yet another rate-based rule. This one is also counting by source IP, but the scope here is using DDoS list label plus a particular URI. You might pick a URI that is especially sensitive to DDoS threats within your application. It's only going to count requests for that URI and only if they're coming from a source that is suspicious, so you can use a much lower limit here and sort of layer these rate-based rules in a way.

Now just in case normal users kind of wander into this territory of sending 100 requests per minute for a particular URI, you don't want to impact those users, so we're using a challenge action here to avoid false positives. Now the Anti-DDoS rule group adds this label of event detected. You could put that label in the scope of all these rate-based rules if you want to further narrow them down so they only come into play during an actual DDoS event, which would further reduce the risk of false positives.
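Putting the pieces together, a scoped-down rate-based rule like the fourth one above might look like the following in AWS WAF's JSON rule form, expressed here as a Python dict. The label `Key` is illustrative only (the talk notes the real DDoS-list label is longer), so check the managed rule group documentation for the exact name before using this.

```python
# Sketch of a rate-based rule scoped down to known-offender IPs.
rate_limit_known_offenders = {
    "Name": "rate-limit-known-offenders",
    "Priority": 4,
    "Statement": {
        "RateBasedStatement": {
            "Limit": 500,               # lower limit is safer for this scope
            "EvaluationWindowSec": 60,  # rolling one-minute window
            "AggregateKeyType": "IP",   # count per source IP
            # Only count requests already labeled as known offenders.
            "ScopeDownStatement": {
                "LabelMatchStatement": {
                    "Scope": "LABEL",
                    "Key": "awswaf:managed:aws:ddos-ip-list",  # illustrative
                }
            },
        }
    },
    "Action": {"Block": {}},
    "VisibilityConfig": {
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "RateLimitKnownOffenders",
    },
}
```

The fifth rule from the talk follows the same shape, with the scope-down statement combining the label match with a URI match and the `Block` action swapped for `Challenge` to avoid false positives.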

Figma's Real-World Journey: Defense in Depth Against Evolving DDoS Attacks

All right, with that, let's hear from Bashar on how Figma has used these capabilities earlier this year to mitigate real-world DDoS threats. All right, thank you, Bryan. Hi, my name is Bashar, and I'm super excited to be talking about Figma's journey with DDoS attacks over the last couple of years.

Thumbnail 2710

So what is Figma? Figma is the platform that designers, product managers, and software engineers use to turn their ideas into shipped products. We have over 13 million users who use Figma products every month, and over 95% of Fortune 500 companies use Figma products today. Almost every product that you use has been designed in Figma.

Thumbnail 2740

Figma is a web-based platform that runs in the browser. Our customers connect to our services using their web browser or our mobile and desktop apps, and we funnel all that traffic through CloudFront to terminate their SSL connections closer to their location. We then funnel that traffic into our proxy for logging and observability, and finally it hits the backend services that power Figma's user experiences and collaborative features, via either our real-time services or our application servers.

Thumbnail 2770

Let's take a look at some of the attack patterns that we've seen over the last two years. This graph covers a recent two-month period and shows how many requests hit our platform distribution per minute. As you can see, these attacks happen almost weekly and sometimes multiple times in a day. They are very brief, lasting only a few minutes, and they are many orders of magnitude larger than our steady-state traffic. There is no way we can handle this much traffic: it's not economically possible to be provisioned for it at steady state, nor can we scale up quickly enough to absorb it. And when these attacks used to take us down, our customers could not iterate on their designs, prototypes, and ideas; they literally couldn't do their jobs.

Thumbnail 2830

Thumbnail 2840

Thumbnail 2850

So how did we approach this? First, we analyzed these requests to understand them and find any signatures we could use to step up our defenses. We adopted a defense-in-depth strategy, leveraging both AWS WAF, as Bryan mentioned, and protections in our proxy layer. And as these attacks evolved, we continued to iterate on our monitoring and observability and to fine-tune our protections over time.

Thumbnail 2870

Thumbnail 2900

So how did we analyze these requests? We relied heavily on AWS WAF logging to capture the raw requests as they came in, then used CloudWatch Logs Insights to identify suspicious patterns, such as the top-talker IP addresses or suspicious headers we could use in our blocking rules. We added some basic rules, starting in count mode because we didn't want to risk affecting user traffic, targeting the top-talker IP addresses, JA3 fingerprints, and suspicious headers such as unexpected user agents.

Thumbnail 2920

Thumbnail 2940

We also added rate-limiting rules on top of this, targeting different keys, time windows, and thresholds to provide multiple layers of protection. For example, we rate limited traffic per user by targeting the cookie header, giving us very specific rate limit rules. We also added endpoint-level protections: some endpoints are more expensive than others, so we set tighter limits there.

Thumbnail 2950

Thumbnail 2980

Finally, we had some catch-all rate limit rules, such as limiting the amount of traffic we accept per IP address. We had to be very careful with these thresholds because we didn't want to block good traffic coming from corporate IP addresses, proxies, and open proxies, as Bryan mentioned. Eventually we adopted challenges, which are really amazing because they let you set tighter limits: even if a rule occasionally matches good traffic, legitimate browsers solve the challenge and proceed, while the bad traffic is blocked automatically.

Thumbnail 2990

Now, we added all of these rules and they were great, but we were also worried that some of this traffic might slip through to our backend services, either because our rules weren't tuned perfectly or because attackers would evolve their attacks over time.

Thumbnail 3010

So we also tried to isolate traffic in our backend infrastructure. We noticed that a lot of these attacks came in as unauthenticated requests, so we brought up replicas of some of our backend services and changed our routing so that unauthenticated requests went to one set of application servers and authenticated requests went to a separate set. That way, if we get a spike in unauthenticated traffic, it may take down the API servers that handle unauthenticated requests, but authenticated requests would not be affected.

Thumbnail 3070

Similarly, we moved some of our application logic to the front end and served that traffic from S3, a globally scalable service, so it couldn't take down our backend services. This was great, but unfortunately it just wasn't maintainable: for every backend service you add, you have to think about bringing up a separate replica for unauthenticated traffic and changing all of these routing rules.

Thumbnail 3110

So what we ended up doing was centralizing all that intelligence in our proxy layer. This is also important because some of these rate limit rules in WAF don't kick in immediately; they take about 30 seconds, and a spike during that 30-second window could take down our backend services if we couldn't block it right away. So what did we do? We built a custom Go filter for our Envoy proxy that provides fair sharing across endpoints, priority levels, and authentication types, so that no single bucket can take more than the share of capacity it would use under steady-state traffic.
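As an illustration of the fair-sharing idea (Figma's actual filter is written in Go for Envoy and is not public; the names and the fixed-share policy below are invented), a minimal admission controller might look like:

```python
class FairShareAdmitter:
    """Toy fair-sharing admission control, loosely inspired by the
    proxy filter described in the talk.

    Each bucket (e.g. endpoint, priority level, or authentication type)
    gets a fixed share of total in-flight capacity, so one flooded
    bucket cannot starve the others.
    """

    def __init__(self, total_capacity, shares):
        # shares: bucket name -> fraction of capacity (should sum to <= 1.0)
        self.limits = {b: int(total_capacity * f) for b, f in shares.items()}
        self.in_flight = {b: 0 for b in shares}

    def try_admit(self, bucket):
        """Admit a request, or shed it if its bucket is at capacity."""
        if self.in_flight[bucket] >= self.limits[bucket]:
            return False
        self.in_flight[bucket] += 1
        return True

    def release(self, bucket):
        """Call when a request completes, freeing its slot."""
        self.in_flight[bucket] -= 1
```

With, say, 20% of capacity assigned to unauthenticated traffic, a flood of unauthenticated requests gets shed at the proxy while authenticated users keep their full 80% share, mirroring the traffic-isolation goal described above.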

We also added signing to our cookies and check the signatures at the proxy. As I mentioned, we added rate limit rules targeting the cookie header, but attackers can easily forge cookies, so we verify that each cookie is actually signed by Figma and reject forged traffic immediately. Signatures are great here because you don't need to talk to any backend service, so the check is extremely cheap.
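The stateless signature check described here is essentially an HMAC over the cookie value. A minimal sketch (the secret handling and cookie format are invented; Figma's actual scheme is not public):

```python
import hashlib
import hmac

SECRET = b"server-side-secret"  # illustrative; store and rotate securely

def sign_cookie(value: str) -> str:
    """Attach an HMAC so the proxy can verify the cookie statelessly."""
    mac = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    return f"{value}.{mac}"

def verify_cookie(cookie: str) -> bool:
    """Reject forged cookies without calling any backend service."""
    value, _, mac = cookie.rpartition(".")
    if not value:
        return False
    expected = hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking the MAC via timing.
    return hmac.compare_digest(mac, expected)
```

Because verification needs only the shared secret, it can run inside the proxy's hot path; forged or unsigned cookies are dropped before a request ever consumes backend capacity.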

Thumbnail 3180

Thumbnail 3190

Thumbnail 3200

Now, our attackers have evolved over time. We noticed they started rotating their IP addresses, JA3s, and so on, even within a single attack, and targeting more critical endpoints. So we continued to automate our analysis in near real time, identifying the top talkers, adding them to our blocking rules, and tuning our rate limit rules over time. Finally, we adopted the threat intelligence that Alaina talked about earlier. Figma is just one customer, but AWS has millions of customers and far more sophisticated threat intelligence than we could possibly build ourselves, so we adopted the Anti-DDoS rule group that Bryan just covered and the IP reputation list to identify these spikes and bad-reputation IPs automatically, without having to do it ourselves.

Thumbnail 3240

Thumbnail 3250

Thumbnail 3260

This has been a really exciting journey over the last couple of years partnering with AWS. We were one of the early adopters of the Anti-DDoS rule group we just talked about, we integrated the WAF JavaScript and iOS SDKs to provide challenge support seamlessly within our product, and we provided a lot of product feedback, such as requesting configurable time windows for rate-based rules, which we needed in order to have multiple rules targeting specific endpoints. Finally, let's look at a couple of real-world examples that show the magnitude of these DDoS attacks and how our protections held up against them.

Thumbnail 3300

The first example is a very common one, targeting our main web application, which used to provide both an authenticated and an unauthenticated experience. As you can see, there were tens of millions of requests per minute, and because we had multiple rules working together, we challenged that volume of requests; the rest of the traffic, as I mentioned, we funneled through S3, which scales globally. The final example is a more sophisticated attack that targeted one of our API endpoints, and it also shows how this defense-in-depth strategy works as a whole.

You can see in the second graph that during the first 30 seconds, before WAF kicked in, our intelligent proxy reacted automatically and very quickly, blocking around 70,000 requests per second. If those requests had reached our backends, they would have caused downtime. For the remaining period, once all the WAF rules kicked in, we were able to rely on WAF to handle the rest of the attack.

Thumbnail 3360

Thumbnail 3370

Disrupting Threat Actors and Best Practices for DDoS Resilience

And with that, I'll hand it over to Alaina for some final thoughts. All right, so we're going to come back to our threat intelligence cycle for our last stop: disrupting and taking down threat actors. I'll keep this quick, and then Bryan will wrap us up with some best practices, but I wanted to call out some of the cool things AWS has been a part of over the last year. We worked with law enforcement, giving them information that helped the successful takedown of actual botnet infrastructure. The top one here is one I want to go a little deeper on: we were seeing this botnet issuing multi-terabit, multi-billion-packet-per-second DDoS floods on a regular basis, not just toward AWS but broadly toward various locations on the Internet.

I don't know if all of you are familiar, but AWS has a set of what we call leadership principles, guiding tenets that help us ensure we're building good things. Some of them are customer obsession or earns trust, and one of my personal favorites is "success and scale bring broad responsibility." I really feel this work embodies that principle: AWS is actively working to dismantle and take down these threat actors and trying to leave the Internet a better place for everybody. So I'll do one last handover to Bryan so he can talk about some best practices.

Thumbnail 3460

Thumbnail 3480

Thumbnail 3490

Thumbnail 3500

All right, a quick recap of some best practices. First, auto scale out so that you can absorb unexpected traffic spikes. Next, make sure you're load balancing across your entire fleet, and leverage the built-in DDoS mitigations available at AWS edge locations. Do everything you can to minimize internet exposure. Protect your resources with AWS Shield Advanced and use the new WAF Anti-DDoS Managed Rule Group. And if you need to customize that behavior, use WAF rate-based rules and the DDoS IP reputation list.

At AWS we operate the largest public network of any cloud provider, and this gives us a unique view into the DDoS challenges our customers face every day. We work relentlessly to improve our threat intelligence so that we can help mitigate that risk. The result is a level of DDoS resiliency that's very difficult for most organizations to achieve on their own. So if you're headed outside into challenging weather, make sure to put AWS to work for you. With that, I want to thank you for being here, and we hope to work with you in the future. Thanks.


; This article is entirely auto-generated using Amazon Bedrock.
