Kazuya

AWS re:Invent 2025 - Infrastructure protection at scale with AWS Security, ft. Block, Inc. (SEC224)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Infrastructure protection at scale with AWS Security, ft. Block, Inc. (SEC224)

In this video, AWS Solutions Architect Dustin Ellis, Product Manager Amish Shah, and Block's Staff Security Engineer Ramesh Ramani discuss infrastructure protection at scale using AWS native services. The session covers platform engineering fundamentals, emphasizing how 80% of organizations are expected to adopt platform engineering by 2026. Block's journey from hybrid infrastructure to unified network security is highlighted, showcasing their evolution from microaccount models to standardized patterns. Key AWS services discussed include Network Firewall, WAF, and DNS Firewall, with announcements of new features like multi-VPC endpoints (60% cost reduction), partner managed rules, and active threat defense. Block's implementation details reveal handling 85,000 requests per second at Square alone, managing thousands of AWS accounts, and building a unified registry system for service mesh policies. The presentation demonstrates how platform engineering reduces cognitive overload, increases developer velocity, and ensures architectural consistency through centralized expertise and self-service capabilities.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Infrastructure Protection at Scale with AWS Native Services

Good afternoon everyone. Thank you for sticking around for the last session of the day. This is SEC 224, Infrastructure Protection at Scale Leveraging AWS Native Services. In this session, we're going to have a customer spotlight featuring one of our financial services customers that I have spent the last four years working closely with. If this sounds interesting to you, I encourage you to stick around, sit back, take some notes, and stay at the end for some Q&A.

My name is Dustin Ellis. I'm a Senior Solutions Architect with AWS. I've been here for the last five years, and the last four years I've worked closely with Block's engineering teams. I'm joined by my colleagues Amish Shah, who's a Senior Product Manager with our Infrastructure and Security Services, as well as Ramesh Ramani, who is a Staff Security Engineer with Block. Ramesh will be covering Block's journey later on in the session today.

Thumbnail 60

Let's take a look at the agenda over the next 45 minutes. First, I'm going to provide a little bit of background on platform engineering. We're going to do a quick recap of what it is and the benefits that it brings to organizations, particularly those operating in cloud environments. This is going to set the stage and provide important context for everything else that we're going to talk about in the session today.

After that, I'll hand it over to Amish. He's going to provide an overview of all of our infrastructure security services and capabilities that most of our customers are using to build their infrastructure security platforms. We'll also cover a couple of announcements that we've made this week that you probably have already seen. After that, we'll hand it over to Ramesh to close us out. He's going to share everything that they have done at Block over the last few years building their infrastructure security platform, or as much as we can cover in about 20 or 30 minutes. This includes some of the challenges that they faced early on, the architectural decisions that they made along the way, as well as some lessons and things that they continue to work on that maybe your team can adopt.

Thumbnail 120

The Rise of Platform Engineering and Its Strategic Value

Let's go ahead and provide a little bit of context on platform engineering. I just want to do a quick show of hands. How many people in the audience have already started implementing or practicing platform engineering within your organization? Okay, it's a little hard for me to see up here, but I think at least half the room, so that's great. That actually aligns with one of the statistics that I wanted to start off with.

Thumbnail 150

Gartner publishes their Strategic Technology Trends. I'm sure many of us in the audience have heard of this. One of them that stood out to me having operated in the space for quite a while was that 80% of organizations going into 2026 are expected to have platform engineering teams. I think this is interesting not because of that statistic alone, but just a couple of years ago that was 45%, so we've almost doubled over the last few years. Many of you have already learned about platform engineering and have already started implementing platforms internally at your organization. The question that I really want to answer here is why. What is the contributing factor to this rapid rate of adoption? Why are so many companies moving in this direction?

Well, if we pause and think about this for a second, in the absence of platforms, what we all have probably noticed is that developers will go away and come up with a lot of different solutions for the same problems. What this leads to is architectural inconsistency, redundancy, and ultimately this can lead to unintended effects like a lack of security in areas that we weren't expecting. In contrast, the benefit of platform engineering in a nutshell is that we can build reusable architectural components that can enable our developers to build faster, more securely, and more consistently.

Thumbnail 230

Oftentimes when I have this conversation with customers, the question becomes: we're just starting out on our cloud journey, we've got a single AWS account or a few AWS accounts. When should we really start thinking about building platforms into our architecture? My answer usually is as early as possible. We'll talk about that, taking us all the way back to AWS architecture day one and building on that. The key thing here before I continue is that it's easier to make these decisions earlier on in your cloud journey than later when maybe you have hundreds of accounts with different operating models, or maybe there's been some mergers and acquisitions. These kinds of decisions become increasingly difficult if you wait until later when your environment is much bigger.

Let's go back to day one real quick. Everyone started here with a single AWS account. This is a really simple diagram, but the benefit obviously with a single account is that we get one clear isolation boundary. With this, we put all of our applications, our data stores, and our identities into this account. The biggest benefit here is simplicity. It's easy to visualize this architecture.

Thumbnail 320

Scaling from Single to Multi-Account Architecture: Key Decision Points

As we start to scale our footprint on AWS and deploy more workloads into these accounts, limitations start to show themselves in different ways. You run into things like account-level limits, workload isolation becomes really difficult, and you're operating with all these constraints where your workloads are competing for resources. At some point on your cloud journey, you will reach a point where you move towards a multi-account paradigm. You'll start provisioning more AWS accounts, and this is a simple diagram. In this example, we went from one to two accounts. In reality, many of our customers have hundreds or even thousands—in the case of Block, as Ramesh will share later—thousands of AWS accounts attached to either a single AWS organization or multiple organizations.

This added complexity comes with real benefits. You get nice workload isolation, which is one of the first AWS best practices from an architecture standpoint. The biggest benefit with workload isolation is that we can reduce the scope of impact if there's any kind of security or operational issue. We also can start to implement distributed ownership models, which provides benefits for things like cost attribution or limits. If you haven't hit the point yet where you've encountered any of those account-level limits, be aware that many do exist, and it's better to be proactive and get ahead of this.

Thumbnail 390

On this scaling journey, you'll inevitably face some key inflection points and decisions that you're going to have to make. A few of them are shown here, which we've worked with Block quite a bit on over the last few years. One of them is how you will vend these AWS accounts. Some people might refer to this as an account factory or a vending machine, but there are many ways to do this, and you want to make sure that you do it the right way—in a way that's going to be secure and scalable for your organization.

What do we mean by account vending? It's not just how we create the accounts; it's how we baseline them and ensure that they're all held to the same security standard consistently. It's also how we monitor those accounts throughout the entire account lifecycle, and most importantly, if those accounts aren't being used anymore, how we decommission those accounts. Another decision you'll have to make is access patterns. Once those accounts are provisioned, how do we then grant our human users as well as our application identities access into those accounts so that they can create and manage resources?

This also includes what are the preferred interfaces for doing that. Is it the AWS console like a click-ops approach? Most people don't recommend that. Or is it infrastructure-as-code tools like Terraform or CloudFormation or even the AWS CLI? Third is your network topology, and this is where we're going to spend most of the time for the rest of the session today. Once you have those accounts vended and you have given people and your applications access to those accounts, the next logical question is: most of these resources that we're going to be creating will be deployed into a virtual private cloud network or a VPC.

So how should we think about VPC in our organization? Should we have one VPC per account? Should we lean into these new models like VPC sharing where we vend subnets into those accounts and create some abstraction layers for our developers? This also includes things like traffic management—ingress and egress traffic, inter-VPC traffic—and this will introduce other services which we're all probably familiar with: Transit Gateway, PrivateLink, and VPC peering. There's a lot of architectural decisions that fall into this category.

Thumbnail 540

Lastly is your data perimeter strategy. Block had a great session earlier this year at re:Inforce on data perimeters, so we won't be touching on that too much in this session today, but it's worth mentioning because as you scale, going back to what I mentioned, maintaining security consistency across your accounts becomes critical. This includes how we structure our accounts hierarchically using things like AWS organizational units as well as how we apply consistent guardrails using things like service control policies and resource control policies. If those are new concepts, I encourage you to investigate data perimeters. At its core, a data perimeter is all about ensuring that our resources are only being accessed by trusted identities from trusted networks.
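To make the guardrail idea concrete, here is a minimal sketch of what a network-perimeter service control policy could look like. This is illustrative only: the helper function, the VPC ID, and the condition set are simplified placeholders, and a production data perimeter involves considerably more statements than this.

```python
import json

# A minimal sketch of a data-perimeter style guardrail: an SCP that denies
# S3 access unless the request originates from a trusted VPC. The VPC ID is
# a hypothetical placeholder; real perimeters need more conditions.
def build_network_perimeter_scp(trusted_vpc_ids):
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyS3OutsideTrustedNetworks",
                "Effect": "Deny",
                "Action": "s3:*",
                "Resource": "*",
                "Condition": {
                    # Deny when the call does NOT come from a trusted VPC...
                    "StringNotEqualsIfExists": {"aws:SourceVpc": trusted_vpc_ids},
                    # ...and is not an AWS service acting on your behalf.
                    "BoolIfExists": {"aws:ViaAWSService": "false"},
                },
            }
        ],
    }

scp = build_network_perimeter_scp(["vpc-0123456789abcdef0"])
print(json.dumps(scp, indent=2))
```

A policy like this would be attached to an organizational unit so every vended account inherits it automatically.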

Platform Engineering Best Practices and Block's Strategic Approach

Just to drive platform engineering home before we move on to AWS infrastructure services, this is a great visual for a mental model of what a platform should look like at a very high level. As we can see at the top, we have our application users and our business users, and these are the users within our organization who should be focused on building, delivering business value, building new features, and servicing external customers—the customers of your products that your company builds.

Thumbnail 600

In contrast, we have our platform engineers at the bottom, and these platform engineers and teams should be focused on delivering this platform to those internal stakeholders. We can see some examples here of what those platform components might look like. You can start with one and expand on that later, but some examples that we've seen at Block include policy enforcement, authentication and authorization, compute and networking infrastructure platforms, event buses like Kafka, and most relevant to the session today, infrastructure and security platform components. This is where we'll spend the rest of the session.

Thumbnail 640

Let me go back—there's one thing that I skipped here which is really important. If you get platform engineering right, what you should see over time is this powerful feedback loop where the application users, your developers internally, are providing feedback and feature requests and requirements back to those platform engineers. There should be a communication channel between the two parties such that the platform engineers can translate those requirements into new capabilities that enhance the platform and allow developers to build faster, build more securely, and build more consistently.

Before I hand things over to Amish, there are a couple of things that I wanted to highlight working with Block over the last few years that I've observed and that I think they definitely did right and continue to iterate upon. First, before moving a lot of their workloads to AWS, which Ramesh will talk about later, they really took some time to be intentional about what their landing zones should look like—how should we vend AWS accounts, what should those baseline security configurations be, and what should our guardrails be at the organization level. Spending time here before you commence your migrations is incredibly important and will help scale and set a good security foundation later on.

The second thing was that Block was very intentional about their network topology and design decisions. They evaluated all the available options at the time—we're talking about something a few years ago at this point—but after that they landed on VPC sharing and other network design principles that Ramesh will talk about in his segment. The third thing is that they lean into platform engineering best practices and they continue to do this. One of those best practices is making sure that you have clear isolation between the roles and responsibilities of what developers are responsible for versus what the platform teams are responsible for. Obviously there's going to always be some overlap, but Block continues to iterate on this, and it's something that has really stood out to me as being pretty impressive.

Lastly, there's the infrastructure security platform that Block built—again, Ramesh will talk more about this shortly. They're leaning into AWS native services like AWS Network Firewall and new capabilities, having an experimentation mindset, being willing to test new services and features and capabilities. Sometimes they won't work, other times they're exactly what you need to continue to scale your platform. The builder mindset at Block is something that's really impressed me over the years. So now I'm going to hand things over to Amish to cover AWS infrastructure security services. Thank you, Amish, over to you.

Thumbnail 790

AWS Infrastructure Protection Services: Ingress and Egress Security Overview

All right. So let's talk about AWS infrastructure protection services. My name is Amish Shah. I'm the Senior Product Manager for Network and Application Protection Services at AWS. I focus on AWS Network Firewall service as well as threat intelligence offerings. AWS has a wide range of security services that we offer to customers. The network and application protection services are part of the infrastructure protection space. All these services are actually very tightly coupled together, helping you achieve your security posture goals. I'll talk a little bit more about some of these services and how customers use them to protect their infrastructure, as well as new features that we have launched on the AWS Network Firewall side and on the AWS WAF side that help you simplify your security operations and improve your security posture.

Thumbnail 850

Thumbnail 870

Before we go there, let's take a look at the overall AWS global backbone network. I just wanted to show you the scale at which we operate. As you know, this continues to expand. A few months ago, we launched a new region in New Zealand, and we'll continue to expand in more geographies by launching new regions. The size and scale of our deployment gives us this unique visibility into global Internet traffic. Using this information, we can now come up with curated threat intelligence that can be used to protect AWS infrastructure services. Now, imagine the overall total attack surface of AWS across the AWS domains, Amazon retail domains, and .com domains. Every day, we stop hundreds of cyberattacks.

Thumbnail 930

Thumbnail 940

Thumbnail 950

This is possible because of the digital decoy network that we have set up across our global infrastructure. I'll cover that in more detail later in the session on how we use that information to help keep infrastructure secure. Now, going back to the network and application protection services, one of the easiest ways to talk about how customers use this service is by splitting into two different areas. One is about ingress filtering, and the second is around egress filtering. This is an oversimplification, but if you are getting started with AWS Network Security Services, this is an easy way for you to think about what services are available to protect incoming traffic and outgoing traffic.

Talking about ingress security, especially when you want to protect your web applications, customers typically start with Web Application Firewall to protect incoming HTTP and HTTPS requests. WAF helps you protect anything that is coming into your web applications. On the other side, for egress traffic, you have services like AWS Network Firewall as well as Amazon Route 53 Resolver DNS Firewall to protect traffic that is going out from your VPCs.

As I said, this is a bit of an oversimplification because AWS Network Firewall is more like a Swiss Army knife. It does multiple things. It protects you from incoming traffic as well, and our customers use it in coordination with WAF. WAF helps you with HTTP and HTTPS traffic. For non-HTTP and HTTPS protocols, customers use AWS Network Firewall for defense in depth. It also helps you prevent lateral movement of traffic. If you want to stop VPC to VPC traffic communication or create separate logical boundaries, Network Firewall can be used to do that as well.

Thumbnail 1060

Thumbnail 1070

For this session, let's focus on the egress traffic. Customers use network firewall and DNS firewall to make sure that your applications and workloads are only talking to trusted destinations. We also have management services such as AWS Firewall Manager that allows you to manage Web Application Firewall, network firewall, security groups, and many other services from a single pane of glass.

Web Application Firewall: Simplified Configuration and Enhanced DDoS Protection

Let's talk a little bit more about ingress security. Here is a typical web application architecture. When we talk about ingress security, you think about traffic coming in, but this particular slide is designed to show you both sides, or the complete end-to-end web application architecture. We are showing the inbound traffic as well as the outbound traffic. This is really important because when you are designing your network, you have to make sure that you are applying multiple layers of defenses, both for incoming as well as outgoing traffic.

Thumbnail 1130

Thumbnail 1170

If you only focus on incoming traffic and for some reason an attacker is successful in exploiting one of your vulnerabilities, then the outbound filtering mechanisms will help you prevent damage. It will help stop data exfiltration, breaches, or command and control threats. So you have to look at both ingress as well as egress protections. That's how customers use some of our network and application protection services. As I said earlier, for inbound traffic, you have Web Application Firewall. You can also use network firewall for both web and non-web protocols. For outbound traffic, you have DNS firewall, which is looking at the DNS queries, and then network firewall, which is a stateful Layer 3 to Layer 7 firewall that allows you to do port-based filtering, domain-based filtering, URL filtering, custom rules, and IDS/IPS rules on network firewall.

Talking about Web Application Firewall, some of the benefits are that it offers layered security against different types of attacks, whether it is DDoS attacks, bot attacks, or web attacks. There are different types of rules that you can configure on WAF to mitigate these types of attacks. It also has very low effort when it comes to managing the WAF infrastructure. The resources are automatically provisioned for you, so that makes it much easier for you to apply web application security and API security.

Thumbnail 1230

The rules offered on AWS WAF are highly customizable, so you can write rules to match against any part of the incoming request, which gives you enough flexibility based on your requirements. To further simplify customer experience, earlier this year we launched a new console experience for WAF. With this new experience, we have significantly simplified how you configure application security and API security rules on your WAF by transforming configuration into onboarding wizards. With this new approach, we have reduced the configuration time by 80%, which means you can now get the foundational protections within minutes instead of hours.

When you go to your WAF console, you will now see preconfigured, curated protection rules based on workload types. You can select the type of workloads or applications that you're trying to protect, whether it is APIs or PHP applications, and we will have a list of predefined rules that you can start with. This is your base level security. You can always add custom rules on top of it, but by giving you these preconfigured rules, which are expertly curated by the AWS security team based on our experience with how customers protect their applications, this can become your starting point and allow you to move much faster and secure applications.

Thumbnail 1310

We also announced a new Managed Rule on AWS WAF, which is the anti-DDoS managed rule. This new managed rule protects you from volumetric DDoS attacks as well as novel DDoS attacks where you have a client that is sending a higher rate of requests per second compared to an average client. We have also improved the detection logic so we can easily differentiate between a malicious client and a flash crowd event where you have a large volume of actual clients sending requests. We can identify and detect which are volumetric DDoS attempts and mitigate that in seconds instead of minutes using this new AWS managed rule set.
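As a sketch, referencing a managed rule group like this one in a web ACL boils down to a rule statement of the shape that `wafv2` `create_web_acl` expects. The helper below is illustrative; the rule-group name `AWSManagedRulesAntiDDoSRuleSet` is my best reading of the managed rule catalog, so verify the exact name in your WAF console before depending on it.

```python
# Sketch of a web ACL rule entry referencing the anti-DDoS managed rule
# group. The rule-group name is an assumption taken from the AWS managed
# rule catalog; confirm it in the WAF console for your region.
def anti_ddos_rule(priority=1):
    return {
        "Name": "anti-ddos",
        "Priority": priority,
        "Statement": {
            "ManagedRuleGroupStatement": {
                "VendorName": "AWS",
                "Name": "AWSManagedRulesAntiDDoSRuleSet",
            }
        },
        # "None" means the rule group's own block/count actions apply.
        "OverrideAction": {"None": {}},
        "VisibilityConfig": {
            "SampledRequestsEnabled": True,
            "CloudWatchMetricsEnabled": True,
            "MetricName": "anti-ddos",
        },
    }

print(anti_ddos_rule())
```

You would append this dict to the `Rules` list of your web ACL alongside any custom rules.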

Thumbnail 1400

Thumbnail 1410

AWS Network Firewall: Comprehensive Egress Filtering and Use Cases

This managed rule set means AWS will constantly update it and make sure you have the most up-to-date rules to protect against all these different types of volumetric DDoS attacks. These are some of the enhancements that help you secure your web applications from traffic coming into your applications. Now, let's look at egress security and what AWS offers to help you protect traffic going out of your VPCs. Why is this important? One study found that 94% of all ransomware incidents included some form of exfiltration. Customers are worried about this and have a lot more overhead in protecting egress traffic. They constantly have to keep pace by adding new rule sets to prevent traffic going out of their VPCs and make sure that they're only talking to trusted destinations.

Thumbnail 1440

Thumbnail 1470

Here is a typical web application architecture for outbound traffic. There are two main things that customers are worried about. One is whether their workloads are talking to known domains, and second, whether their workloads are using the right protocols that are allowed when they connect to the internet. For example, if you don't want FTP or SSH connections from these workloads, are those sessions blocked? To do that, we have services like Amazon Route 53 Resolver DNS Firewall, which will allow you to block DNS queries. We also have AWS Network Firewall, which is a full-blown cloud-native firewall offered by AWS. We'll talk more about AWS Network Firewall and some of its capabilities.
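For the DNS side of that picture, a Route 53 Resolver DNS Firewall BLOCK rule is a small, declarative object. The sketch below builds the keyword arguments that boto3's `route53resolver` `create_firewall_rule` call takes; the rule group and domain list IDs are hypothetical placeholders.

```python
# Sketch of the parameters for a Route 53 Resolver DNS Firewall BLOCK rule
# (the kwargs shape of route53resolver.create_firewall_rule). IDs are
# placeholders; a real domain list would be created separately and filled
# with the domains you want to deny.
def dns_firewall_block_rule(rule_group_id, domain_list_id, priority=100):
    return {
        "FirewallRuleGroupId": rule_group_id,
        "FirewallDomainListId": domain_list_id,
        "Priority": priority,
        "Action": "BLOCK",
        # Answer blocked queries with NXDOMAIN so clients fail fast
        # instead of hanging on an unanswered lookup.
        "BlockResponse": "NXDOMAIN",
        "Name": "block-known-bad-domains",
    }

print(dns_firewall_block_rule("rslvr-frg-example", "rslvr-fdl-example"))
```

Because DNS Firewall acts on the query itself, it stops a connection before any packet ever leaves toward the bad destination.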

Thumbnail 1490

AWS Network Firewall is a fully managed firewall service that makes it easy for you to deploy essential protections for your VPC workloads. It is highly reliable and scalable.

You get up to 100 gigabits per second per availability zone, and the infrastructure automatically scales based on your traffic volume. As more traffic comes in, we automatically scale up the firewall instances behind the scenes. You don't have to worry about it. More importantly, it's fully managed, which means you don't have to deploy the infrastructure needed to run these firewalls. You don't have to worry about software upgrades and maintenance. All those things are done automatically for you. All you have to do is write your rules based on your requirements, and Network Firewall will automatically make sure that it is protecting any traffic that is coming in or going out from your VPCs.

It has a built-in stateful engine, so you can write IP-based rules, port-based rules, and application-based rules. You can use features like GeoIP filtering where you can stop traffic going to certain countries that are known to have high risks. It also offers managed rules, so you can block traffic to known bad IPs and bad domains with just a single click. You can also write custom rules. If there are certain parameters that you want to match against, you can do that using custom Suricata rules. It is fully integrated with Firewall Manager service, so using Firewall Manager you can now apply consistent security policy across all your AWS accounts.
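To illustrate the custom Suricata rule path, here is a small generator in the style of the rules that Network Firewall's domain allow lists produce: pass TLS traffic whose SNI matches an approved domain, then drop everything else. This is a sketch, not the service's actual compiled output; the domains and SIDs are examples, and the pass-before-drop ordering assumes the policy uses strict rule order.

```python
# Sketch: generate Suricata-compatible egress allow-list rules. Pass rules
# for approved TLS SNI values come first, followed by a catch-all drop.
# This ordering only behaves as intended under STRICT_ORDER evaluation.
def egress_tls_allowlist(domains, base_sid=100):
    rules = []
    for i, domain in enumerate(domains):
        rules.append(
            f'pass tls $HOME_NET any -> $EXTERNAL_NET any '
            f'(tls.sni; dotprefix; content:".{domain}"; endswith; '
            f'msg:"allow {domain}"; sid:{base_sid + i}; rev:1;)'
        )
    # Catch-all: drop any TLS session that did not match an allowed domain.
    rules.append(
        f'drop tls $HOME_NET any -> $EXTERNAL_NET any '
        f'(msg:"deny unlisted TLS destinations"; '
        f'sid:{base_sid + len(domains)}; rev:1;)'
    )
    return "\n".join(rules)

print(egress_tls_allowlist(["github.com", "pypi.org"]))
```

The `dotprefix`/`endswith` pair matches the domain and its subdomains without also matching look-alike suffixes such as `evilgithub.com`.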

Thumbnail 1600

Common use cases for Network Firewall include protecting incoming traffic. Customers want to prevent intrusion, and they use protocol detection, stateful inspection, and IPS capabilities of Network Firewall. Then we have egress filtering, which is where you want to stop traffic going out from your workloads. Say you have workloads that are talking to GitHub. When you allow those connections, it can also potentially open up a back channel that an attacker can misuse. They can install malware or exfiltrate data. Network Firewall helps you there by stopping these connections and only letting outbound connections to trusted destinations. This is one of our most common use cases, where customers start and then expand into other use cases.

Thumbnail 1680

Network Firewall Deployment Models and Active Threat Defense

Finally, we have VPC to VPC filtering, which is where customers want to prevent traffic between their workloads and create logical isolation. We offer features like resource-based tags, which can be used to create those logical isolations using Network Firewall. Regulated industries such as banking and financial services use Network Firewall for east-west filtering. How do you actually deploy Network Firewall? There are two different ways that customers typically deploy Network Firewall. One is the centralized inspection mechanism, where you have a central firewall sitting in an inspection VPC. You have a transit gateway that is routing traffic from all your application VPCs to the central inspection firewall. If that traffic is allowed, it goes out to the internet via an internet gateway, or if it is east-west traffic, it will be sent back to the transit gateway to the destination VPC.

We have simplified this design by launching a new feature, which is a native attachment of Network Firewall on transit gateway. Instead of you configuring that inspection VPC, adding Network Firewall in that inspection VPC, and managing the routing between these two services, you can simply use the native attachment of transit gateway for Network Firewall, and we will orchestrate the connection between your Network Firewall and transit gateway. You don't have to worry about deploying the inspection VPC. This native attachment further simplifies how you can quickly build this centralized inspection architecture.

Thumbnail 1760

Next, we have distributed deployment. This is where customers deploy a firewall within each of their VPCs. Some customers have a requirement where they want to reduce blast radius, or they have a need to have separate policies for each of these VPCs. That's where they deploy this distributed deployment. We have customers who do both centralized and distributed deployments. Some customers do centralized egress and distributed ingress. Many customers use both these designs. One of the challenges that customers faced with distributed design was especially at scale, when you have thousands of VPCs, the cost of Network Firewall endpoints increases because you are deploying a firewall in each of these VPCs.

Thumbnail 1810

To address that, we introduced a new feature called multi-VPC endpoint. Instead of deploying a full firewall in each VPC, you deploy a secondary endpoint in each VPC and share a primary firewall across up to 50 VPCs. The secondary endpoint is available to you at a 60% lower cost than the primary firewall endpoint, thereby bringing down the overall firewall endpoint cost.

You still get the distributed firewall functionality, and your blast radius is still reduced and contained to that particular VPC. So you get the best of both worlds at a lower cost. This is very popular for customers who have hundreds or thousands of VPCs, but these are low bandwidth VPCs where you don't need the 100 gig capacity that a firewall offers. You can share that across 50 VPCs and use this design.
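The 60% figure is easy to turn into back-of-the-envelope arithmetic. The sketch below compares N standalone primary endpoints against one primary plus N-1 secondaries billed at 60% less. The hourly rate is a placeholder, not a price quote; check current AWS pricing for real numbers.

```python
# Back-of-the-envelope comparison of firewall endpoint costs under the
# multi-VPC endpoint model. primary_hourly is a placeholder rate, not an
# actual AWS price; secondary_discount reflects the stated 60% reduction.
def endpoint_cost(num_vpcs, primary_hourly=0.395, secondary_discount=0.60):
    all_primary = num_vpcs * primary_hourly
    shared = primary_hourly + (num_vpcs - 1) * primary_hourly * (1 - secondary_discount)
    return all_primary, shared

all_primary, shared = endpoint_cost(50)
print(f"50 primary endpoints: ${all_primary:.2f}/hr vs shared model: ${shared:.2f}/hr")
```

At 50 VPCs the shared model costs well under half of the all-primary design, which matches the session's framing of this as a fix for distributed-deployment cost at scale.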

Thumbnail 1880

Now, I want to talk about the enhancement that we made on the threat intelligence side of things to help you remain secure. Remember earlier I mentioned the digital decoys that AWS has deployed? The digital decoy is a honeypot fleet. AWS has deployed honeypots across its global infrastructure that, to an external bad actor, look like a vulnerable AWS service. Attackers try to exploit that vulnerability, and we leverage that intelligence to identify the TTPs used by the attacker. That intelligence is then used to protect the AWS infrastructure.

Thumbnail 1950

Many of our customers said that you have all this intelligence, and they asked if we could give them that intelligence so they could use that same threat intelligence to protect their VPC workloads. That is our answer with active threat defense. Active threat defense uses the threat intelligence that was generated by the honeypot fleet that we have deployed, and we automatically create rules that can help you protect your VPC workloads. This is a new AWS managed rule available on AWS Network Firewall.

It is a managed rule, which means we constantly keep updating these rules. Every 10 minutes, we update the rules. If there are active attacks, those rules are already in place to protect your workloads. We also clean up rules based on whether our security experts deem that certain IOCs that are part of these rules are no longer valid. So you always get the latest threat signatures as part of this managed rule to protect against active attacks that are happening on AWS infrastructure. It is centrally integrated with GuardDuty as well, so if you are using GuardDuty, you can get central visibility into active threats.

Thumbnail 2010

Thumbnail 2020

Partner Managed Rules and Firewall Policy Best Practices

Beyond AWS managed rules and active threats, I am excited to announce our new enhancement, which is partner managed rules on AWS Network Firewall. We announced this feature two weeks ago. With this launch, we are bringing managed rules from some of the top AWS Marketplace partners directly into AWS Network Firewall. We have chosen seven partners as our launch partners based on their track record and expertise. Each of these brings a unique value proposition that helps customers keep their AWS workloads secure.

What this means to you is that you can now, with a few clicks, deploy threat intelligence from any of these vendors into your network firewall policy. This significantly simplifies how you add threat signatures to your policy and reduces the time and management effort it previously took to manage custom rules. All of this is available directly within the AWS Network Firewall console, which means you can select any of these partner managed rules, look at what each rule is meant to do, and then subscribe to it directly from the console. You don't have to go back and forth between the Marketplace, a third-party platform, and Network Firewall. Everything is within the Network Firewall console with just a few clicks, and you get consistent protection.
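Under the hood, subscribing amounts to referencing the rule group's ARN from your firewall policy document. Below is a minimal sketch of constructing that document, with a placeholder ARN; actually applying it would use the `network-firewall` API's `update_firewall_policy` call with the policy's current `UpdateToken`, which is omitted here:

```python
# Sketch: building the FirewallPolicy document that attaches a managed
# stateful rule group. The ARN is a PLACEHOLDER; applying the change would
# go through the network-firewall API (update_firewall_policy plus the
# current UpdateToken). Only the document construction is shown.

def policy_with_managed_rules(existing_refs: list[dict],
                              managed_rule_group_arn: str) -> dict:
    """Return a FirewallPolicy dict with the managed rule group appended."""
    refs = list(existing_refs) + [{"ResourceArn": managed_rule_group_arn}]
    return {
        "StatelessDefaultActions": ["aws:forward_to_sfe"],
        "StatelessFragmentDefaultActions": ["aws:forward_to_sfe"],
        "StatefulRuleGroupReferences": refs,
    }

arn = "arn:aws:network-firewall:us-east-1:aws-managed:stateful-rulegroup/EXAMPLE"
policy = policy_with_managed_rules([], arn)
print(policy["StatefulRuleGroupReferences"])
```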

Thumbnail 2130

Our partners are regularly updating their rule sets, keeping you up to date and ahead of attackers. This is available in 21 AWS regions, and we are planning to expand to other regions very soon. Now that we have all these managed rules available, one of the questions that customers ask when building their infrastructure protection design is: how should I create my firewall policy? Should I build an allow list or a deny list?

Thumbnail 2160

How many of you think it should be an allow list? Maybe a few of you. What about a deny list? Maybe one or two percent. Well, more or less, the answer is that you need both working together to build a policy. Our general guidance, which is also available in the Network Firewall best practices, is to use both. You add your deny list using the managed rules that we just discussed, whether AWS managed rules or partner managed rules, depending on your use cases and requirements. You can also use features natively available in Network Firewall, such as geo IP filtering, to block traffic to certain countries that are known to be high risk or where you don't have any business.

After you build that deny list, you start with a generous allow list by allowing certain trusted top-level domains, such as .gov and .edu. Eventually, over time, you narrow your allow list so that traffic is only allowed to certain trusted destinations. We launched a feature called the automated domain list, which helps you quickly build that allow list. Using that feature, you get visibility into all the domains currently being accessed by your workloads: a list of domains, the number of times each domain was accessed, how many unique users were trying to reach it, and when it was last used. Using that intelligence, you can quickly narrow your allow list down to those specific domains.
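The conversion from observed usage to an allow list can be sketched as a simple threshold filter. The field names and cutoffs below are illustrative assumptions, not the feature's actual output format:

```python
from datetime import datetime

def build_allow_list(domain_stats, min_hits=10, min_users=2,
                     max_idle_days=30, now=None):
    """Turn observed domain usage (domain, hit count, unique users, last
    used) into a candidate allow list. Thresholds are illustrative; tune
    them to your environment before enforcing."""
    now = now or datetime(2025, 12, 1)
    allowed = []
    for d in domain_stats:
        recent = (now - d["last_used"]).days <= max_idle_days
        if d["hits"] >= min_hits and d["unique_users"] >= min_users and recent:
            allowed.append(d["domain"])
    return sorted(allowed)

stats = [
    {"domain": "api.github.com", "hits": 5400, "unique_users": 37,
     "last_used": datetime(2025, 11, 28)},
    {"domain": "rarely-used.example", "hits": 3, "unique_users": 1,
     "last_used": datetime(2025, 6, 2)},
]
print(build_allow_list(stats))  # only the actively used domain survives
```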

So the general guidance is to use a deny list, then start with a generous allow list, and eventually only allow traffic to a set of trusted destinations. That concludes my session. I'll hand it over to Ramesh to talk about how Block uses some of our infrastructure protection services to deploy security at scale. Thank you.

Thumbnail 2310

Block's Infrastructure Journey: From Hybrid to Standardized Patterns

Hello all, welcome. My name is Ramesh Ramani. I work at Block as a security engineer on the network security team. I've been with Block for a little over six years now, and during my time here, I've worked as a cloud security engineer, Kubernetes security engineer, and now as a network security engineer. Apart from this, I have over a decade of experience in cloud, data center, and enterprise security.

Thumbnail 2350

What I hope to convey as I go over these 18 or so slides across 20 minutes is how we as a company have evolved our journey. We moved from having a hybrid, fragmented infrastructure toward standardized infrastructure patterns, and then finally toward what we call a unified approach to network security in AWS. What did we do? Why did we do it? And what are the benefits that we're seeing right now? Let's explore that together.

Thumbnail 2400

Let me first give you some quick insights into Block. We're an ecosystem of ecosystems, starting with Square. Square allows millions of users globally to process payments and provides them with business tools. Then there's Cash App, which allows users to spend, save, and send money to other users along with Bitcoin and stock market transactions. Then you have Afterpay, which allows users to buy now and pay later. Now these three business units require high throughput transactions and also real-time fraud prevention. And finally, there's Tidal, my favorite, because it allows users to stream high quality music content, which means high streaming throughput for music.

Let me give you some quick context around the scale that we work at. Zooming into Square and looking at ingress as an example here,

Thumbnail 2440

we average around 85,000 requests per second, peaking at 115,000 requests per second. When you extrapolate this, it comes to 2.7 trillion requests per year. Now look at egress. We average around 2,400 requests per second, peaking at 3,200 requests per second, which comes to around 77.4 billion requests per year. This is for egress at Square alone. Think about Cash App, Tidal, Afterpay, and all our other business units, and you can imagine how this quickly adds up.
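The extrapolation is straightforward arithmetic. (The quoted 77.4 billion egress requests per year implies an average slightly above the rounded 2,400 requests per second.)

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def yearly_requests(avg_rps: float) -> float:
    """Extrapolate an average requests-per-second rate to a yearly total."""
    return avg_rps * SECONDS_PER_YEAR

ingress = yearly_requests(85_000)  # ~2.68e12, i.e. roughly 2.7 trillion
egress = yearly_requests(2_400)    # ~7.57e10; the quoted 77.4B implies ~2,454 rps
print(f"ingress/yr ~ {ingress:.2e}, egress/yr ~ {egress:.2e}")
```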

We started our cloud journey in 2020, moving our data centers and all our workloads onto the cloud, specifically for Square. That was the year of COVID; it has been a long time, yet somehow feels too soon. We started with Square using a microaccount model, which means every single service or application has its own AWS account. That is great from an isolation standpoint, but it leads to scalability issues. Then you have Cash App, which went in the diametrically opposite direction: a monolithic model with all of their applications in a single account in an EKS cluster. Then you have Afterpay, with its hundreds of EC2 instances managed directly by the developers themselves, which is great from an agility standpoint but leads to consistency challenges.

Thumbnail 2510

Let me deep dive slightly into two of these. Square, for example, uses a microaccount model. Every single service gets its own production, development, and staging accounts. Just think about this: we have thousands and thousands of applications. So what does this imply? Tens of thousands of accounts. It is mind-boggling. With Square, subnets are shared directly to all of these accounts from a primary VPC, which allows for easy routing. At Square, we also have our own Kubernetes cluster as a platform, where every single service is its own namespace, and all of these namespaces can access their own AWS resources using their own EKS Pod Identity.

Thumbnail 2560

Next up is Cash App. They went in the diametrically opposite direction, with single production, development, and staging accounts, an EKS cluster in each of them, and all of their applications deployed in that cluster. All of this traffic is tightly controlled using a layer 7 service mesh. Of course, all of these services are their own namespaces and can access their AWS resources using, again, their own EKS Pod Identity.

Addressing Key Challenges Through Six Security Pillars

When we started in 2020, we realized that there are four key challenges that we need to face. First up is the threat landscape. As you can imagine, with all of these different business units and us moving to the cloud, the threat landscape was quite dynamic. What works for Square's payment systems may not work for Cash App's financial services or Afterpay. Not to mention us moving to the cloud, now we have to worry about these different perimeters. We had to scale accordingly as well.

Thumbnail 2590

Next up is the infrastructure. With all of these different perimeters, we had different WAFs, firewalls, and ingress access, different service-to-service communication mechanisms, and every single business unit having its own egress requirements. It was difficult. We realized very soon that we needed to standardize. Then there is scalability, which is a normal growing pain. In front of my eyes, I could see manual changes and a lack of observability, and the proliferation of AWS accounts was not exactly helping. And then finally, compliance. Our services, if they need to accept payments, now need to be PCI compliant, SOC compliant for financial reporting, and aligned with a bevy of other compliance and audit standards that we have to adhere to.

Thumbnail 2680

So what did we do? The platform teams and the security teams targeted six different pillars together. We have ingress and egress traffic, inter-business unit traffic, inter-environment traffic, multi-region traffic, service mesh and policy framework, and finally, our platform unification.

Thumbnail 2710

By tackling these six pillars, we feel that we'll be able to address all of the four challenges that we just laid out. Let's start with ingress traffic. We consolidated ingress using a dual-homed CDN providing global availability in a singular place where we can put consistent policies. Traffic from the internet comes and terminates on a load balancer in a separate VPC, which isolates all of this traffic from the workloads. From here, traffic moves into our workloads via a transit gateway. All of this is at layer 3 and layer 4, and then beyond this at layer 7, we have our own layer 7 ingress gateway which terminates this traffic into our layer 7 service mesh straight at the workloads.

Thumbnail 2780

This architecture provides key benefits. There's only one path for traffic, which means we now have full visibility into traffic patterns and can tune traffic accordingly. We have a single place to apply security policies, and this greatly improves operations as well. Egress is a different beast altogether. With egress, each business unit standardized it separately. For example, what you see on the screen is for Square. All of our workloads are on a shared VPC, so they could route out to the internet via a transit gateway, but instead they route directly out through a layer 7 Envoy proxy. This Envoy proxy is fronted by a network load balancer, and traffic goes out via a NAT gateway and then an internet gateway.

This, along with Kubernetes network policies and security groups, controls the entire path. For Cash App, it's similar, but they use a layer 7 Istio egress gateway. Afterpay was a different story altogether: all of these different VPCs route their traffic to a centralized inspection VPC via a transit gateway, and from there traffic goes out to the internet. For Square and Cash App, there are a few applications that cannot rely on our layer 7 egress gateways because of circular dependency reasons, and we created a paved path for them as well: a Terraform module. These users just use the module, and it spins up a network firewall with Suricata rule groups and a Route 53 DNS Firewall. They just enter a domain, and it configures both together, providing a paved path for these exception applications as well.
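What such a paved-path module might generate per domain can be sketched as follows. The Suricata rule follows the commonly documented TLS SNI allow-listing pattern; the function names and `sid` values are illustrative, not Block's actual module output:

```python
def suricata_allow_rule(domain: str, sid: int) -> str:
    """Emit a Suricata 'pass' rule allowing outbound TLS by SNI.
    The dotprefix/endswith pair matches the domain and its subdomains."""
    return (f'pass tls $HOME_NET any -> $EXTERNAL_NET any '
            f'(tls.sni; dotprefix; content:".{domain}"; endswith; '
            f'msg:"allow {domain}"; sid:{sid}; rev:1;)')

def dns_firewall_entries(domain: str) -> list[str]:
    """Matching DNS firewall allow-list entries for the same domain."""
    return [domain, f"*.{domain}"]

rule = suricata_allow_rule("api.openai.com", sid=100001)
print(rule)
print(dns_firewall_entries("api.openai.com"))
```

Generating both artifacts from a single domain entry is what keeps the layer 3/4/7 firewall and the DNS firewall from drifting apart.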

Thumbnail 2920

In fact, it's the same exception path which is enabling a lot of our AI workloads at Block as well, including Block's very own open source AI agent Goose. All of these different environments have standardized patterns that we've created. However, it's still fragmented. This teaches us valuable lessons about what we need to do for platform unification going forward. Next up is inter-business traffic. This is similar to having the inspection VPC in the path. All of our business units route to a centralized transit gateway, and from here, if they need to communicate with each other, they have to go via this inspection VPC with a firewall in it. This firewall has consistent policies that segment these different environments at layer 3 and layer 4. Of course, there are certain services which, if they need to communicate with each other across business units, directly do so via VPC endpoint services as well.

Thumbnail 2990

For inter-environment traffic, we have a core principle at Block: environments cannot talk to each other directly; they must pass through this inspection VPC. For example, Prod can talk to Prod, Dev to Dev, and Staging to Staging, but if they need to talk to one another, they have to go through the transit gateway and through the inspection VPC's AWS Network Firewall, which segments all of these environments unless traffic is explicitly allowed for particular services. For multi-region traffic, many services need to be deployed in multiple regions for reliability reasons. All of these applications live in their own environment in their own region.
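The segmentation principle reduces to a default-deny check with explicit per-service exceptions. The service names and exception table below are hypothetical, chosen only to illustrate the shape of the policy:

```python
# Default-deny cross-environment policy with explicit exceptions, in the
# spirit of the inspection-VPC firewall rules described above. Service
# names and the exception table are HYPOTHETICAL examples.

ALLOWED_ENV_PAIRS = {("prod", "prod"), ("dev", "dev"), ("staging", "staging")}
EXCEPTIONS = {("payments-sync", "staging", "prod")}  # explicitly allowed flows

def is_allowed(service: str, src_env: str, dst_env: str) -> bool:
    """Same-environment traffic passes; cross-environment traffic is
    denied unless an explicit (service, src, dst) exception exists."""
    if (src_env, dst_env) in ALLOWED_ENV_PAIRS:
        return True
    return (service, src_env, dst_env) in EXCEPTIONS

print(is_allowed("checkout", "prod", "prod"))          # True
print(is_allowed("checkout", "dev", "prod"))           # False
print(is_allowed("payments-sync", "staging", "prod"))  # True
```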

Thumbnail 3020

Thumbnail 3050

Service Mesh, Policy Framework, and the Path to Platform Unification

Applications now have their own environment in their own region, with transit gateways in their respective regions that are peered together. This is how these applications are able to seamlessly talk to each other. I want to spend a couple of minutes talking about our service mesh and our policy framework. Our story for the service mesh is one of committing to evolving security standards. When I joined Block, our data center had a layer 7 service mesh, our cloud with Square has its own Envoy service mesh, and Cash App has an Istio service mesh. We took it a step further with a key innovation here at Block called Registry.

Think about Registry as a catalog of all our services and applications. Users would go into the centralized platform or UI and define what their applications' dependencies are. For example, a Square user would go to Registry and say my application needs to talk to Bar. This intention in the backend gets converted into a namespace-specific Kubernetes network policy and also a layer 7 authorization policy. With this, we're able to provide defense in depth while completely eliminating operational overhead and manual errors.
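A minimal sketch of that intent-to-policy translation, assuming one namespace per service, is shown below. The helper and label names are illustrative, not Registry's actual schema; only the Kubernetes NetworkPolicy half is shown, with the layer 7 authorization policy omitted:

```python
def network_policy_from_intent(src: str, dst: str, namespace_of) -> dict:
    """Translate a Registry-style dependency ('src needs to talk to dst')
    into a Kubernetes NetworkPolicy allowing ingress to dst's pods from
    src's namespace. Field names follow networking.k8s.io/v1; the
    namespace mapping stands in for Registry's catalog lookup."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": f"allow-{src}-to-{dst}",
                     "namespace": namespace_of(dst)},
        "spec": {
            "podSelector": {"matchLabels": {"app": dst}},
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [{"namespaceSelector": {
                "matchLabels": {"name": namespace_of(src)}}}]}],
        },
    }

# One namespace per service, as in Square's platform cluster.
policy = network_policy_from_intent("foo", "bar", namespace_of=lambda s: s)
print(policy["metadata"]["name"])  # allow-foo-to-bar
```

Emitting both the network policy and the layer 7 authorization policy from one declared intention is what gives the defense in depth without manual, error-prone duplication.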

Thumbnail 3100

What's next for us is platform unification. We realized that we went from having hybrid infrastructure to having standardized patterns for ingress, egress, and more. The next step is to enable developers to move faster. They only have to define their intention, and we roll out the infrastructure and security components for them under the hood without them having to worry about much. We achieve this by building cookie-cutter environments across tenants and across business units with similar components and similar security policies.

For example, we call this a BKE environment, or Block Kubernetes environment. Going forward, every single application will be deployed in an EKS cluster in an account in the private subnet. There's also a DMZ subnet, which has the layer 7 ingress and egress gateways, and then a public subnet, which has a NAT gateway. If traffic were to egress, it would route from the private subnet to the layer 7 egress gateway in the DMZ subnet, and from there to the public subnet before going out to the internet via the NAT gateway.

Thumbnail 3210

Thumbnail 3240

Similarly, traffic coming from the internet would hit the internet gateway and then move to the layer 7 ingress gateway before terminating into our layer 7 service mesh in the private subnet. All of this includes defense in depth with Kubernetes network policies and security groups, along with authorization policies. Why stop at infrastructure when you can unify policies as well? We have a single framework, a single UI, where users define what their service-to-service, egress, and ingress dependencies are. Under the hood, these get spun up as infrastructure and security components without users ever having to worry about it.

What does the future look like? As I alluded to, it's platform unification. We want developers to move fast. They just provide us what their intentions are and everything gets built seamlessly under the hood, ensuring that we can focus on providing better services for our customers, ensuring that we're building much faster infrastructure and keeping it secure as well. With this, I hope I've been able to communicate a story which says how we moved from having hybrid fragmented infrastructure to moving to what we're now calling a unified approach to network security in AWS. I'm going to hand it over to Dustin now. Thank you all.

Conclusion: The Benefits of Platform Engineering and Next Steps

Pretty amazing stuff, right? I know that was a lot to try to capture in 30 minutes, but to Ramesh and the rest of the Block team: pretty incredible stuff. Love working with you.

Thumbnail 3310

I'm going to end here. We have about 5 minutes left, so let me do a quick recap. Bringing us all the way back to platform engineering, we heard from Ramesh about all of the different projects that Block has worked on since 2020 to improve their infrastructure security platform. We also heard from Amish about the AWS services you can take away from re:Invent to build your own infrastructure platforms.

But I want to go all the way back to the beginning and recap what are the key benefits of platform engineering in the first place. The first thing is reducing cognitive overload on our developers. That's a really big one because building in the cloud gets more and more complex. Just to become a networking specialist on AWS these days requires a lot of study and expertise. We don't want our developers to have to learn networking, security, compute, Kubernetes, and Terraform in order to continue to scale. One of the things that Block has done incredibly well is compartmentalize that and go back to shared roles and responsibilities. The intent of that is to reduce cognitive overload on our developers and enable them to build more features, build faster, and build more securely.

The second thing is increasing developer velocity and productivity. When developers are given paved paths and reusable architecture components that have been vetted and approved by a platform team, it's just less for them to have to worry about, and they can focus on writing code. Self-service capabilities are a great example of that. What Ramesh touched on is the registry, so being able to go in as a developer and define your service dependencies and have the platform take care of the rest, the policy propagation and enforcement and the service mesh. It takes time to build those things, and Block has invested years building that one platform component, the registry, and it continues to evolve. Ramesh talked about the unification aspect of that, which again enables our developers.

Another couple of things here are architectural consistency and standardized infrastructure components. You heard me mention at the very beginning that we generally see customers tend to struggle on the scaling journey when they allow their developers to have too much autonomy and there's no architectural standards or consistency. That can lead to some teams being more secure than other teams, with different architectural patterns and not really clear guidance from a developer standpoint about what is the approved path. And lastly, centralized expertise. Ramesh touched on it, but Block is one of those customers who has a very robust platform engineering team that covers infrastructure security but also compute, databases, networking, and traffic. A lot of them are here at re:Invent. Maybe you'll bump into them in other sessions.

Thumbnail 3480

And this last slide is really just other sessions. I went through the catalog, and there's a lot of these pertaining to networking that are also 200 level, so I wanted to highlight a couple of these. I won't read through all of them, but maybe you've attended some of these already, and I encourage you to take a look at these. I think you'll get a lot of value out of these, especially if you're just starting out on your networking journey building your AWS networking environment.

Thumbnail 3520

That's it. Last call to action here is please complete the session survey in the events app. Other than that, we'll stick around afterwards for some Q&A if you have any questions. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
