🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Architecting for hypergrowth: Scaling to 200 million users w/ Skyscanner-ARC209
In this video, AWS and Skyscanner present best practices for scaling applications from initial architecture to serving 200 million users globally. The session covers the build-measure-learn cycle, starting with three-tier web applications using EKS and Aurora DSQL. Skyscanner shares their 10-year journey from a .NET monolith to a cellular EKS architecture running 300 Java services across four regions with 24 production Kubernetes clusters. Key topics include compute options (EC2, ECS, EKS, Lambda), API fronting services (API Gateway, ALB, AppSync), database selection strategies favoring SQL initially, and scaling techniques like CloudFront CDN, ElastiCache, Karpenter for node autoscaling, and multi-region deployments. The presentation emphasizes managing blast radius through cell-based architectures, transitioning to asynchronous communication patterns, cost optimization achieving 95% Spot instance usage, and the importance of observability with CloudWatch. Skyscanner's Flights Live Pricing handles 5,000 searches per second generating 100 billion prices daily, demonstrating practical implementation of these scaling principles.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Building and Scaling on AWS from Day One
All right, hello everyone. I hope everyone's having a good re:Invent so far. Show of hands, how many of you are here today because you have an idea for an application but don't know where to start? Show of hands, don't be shy. Okay. How many of you are managing or building infrastructure on AWS and you have to learn how to accommodate for scale? Should be everyone in the room. Excellent. Okay, great.
Well, I'm really excited. No matter where you are in your scaling journey, I'm really excited to share with you today best practices and architecture guidance on how you can scale on AWS. I'm joined by Skyscanner today to tell you firsthand how they were able to scale to 200 million users globally. So let's get started.
You know, even if you don't have any architecture, it's okay, right? If you just have an idea, that's fine; no architecture is designed for high scalability on day one, but we'll try, right? So it's really important to get into the mindset of building, measuring, and learning first. The architecture that you start out with on day one may not necessarily be the one in production two or five years down the line, right? So we want to first build an initial architecture that can support maybe tens of thousands of users, then measure, understanding application performance metrics, learning from it, gathering feedback and evidence, and iterating through that process. So: build, measure, and learn.
So let's start from day one, right? We don't have an application yet. Perhaps the only application that I know how to get started with is a three-tier web application, right? That's a front end, a back end, and a data store layer, right? So that's something that we can think about when we first start building our architecture, and we also have this cycle of continuous improvement of building, measuring, and learning. So I'm going to start with my three-tier web application today. Maybe in a year it doesn't exist, it doesn't work out too well. Maybe I change to different types of architecture patterns, we'll see. And then you'll also see too from Skyscanner that they didn't start with their initial architecture on day one, right? So you'll hear later on how Skyscanner has been able to scale over the years, but for now let's start with day one.
Choosing Your Compute Platform: From EC2 to EKS Auto Mode
So starting with our three-tier web application, we need something to host our front end and back end on, right? Something like compute. And so on AWS we have a couple different options for compute, right? We have EC2, which is our virtual servers in the cloud. We have ECS, EKS, and Fargate for our container ecosystem. And finally we have Lambda for our serverless compute option. Now, how do I choose, right? There's a lot of different options, a lot of different use cases for them. Like what's the best? So for EC2 you have the most control over your configuration, but as you move down to ECS or EKS, right, that's your container ecosystem, you still have a couple of decisions to think about. So think about like networking configuration or security configuration. EKS helps you run Kubernetes at scale. ECS is for container management. And then as you move on to Fargate, right, you want to run your containers without having to manage servers. And as you go for AWS Lambda, you're really only worrying about your code and running it when it's needed.
So putting it all together, you can think of all of our different compute options as a spectrum. At the top we have AWS Lambda, where you're really only thinking about managing your application code. As you move further down, think about Fargate, right? You're starting to take on some more of that decision making, so now you're probably thinking about data integrations or security configurations, et cetera. As you keep moving down, you're taking on more of those responsibilities. So where's the best place to start? In my opinion, the best place to start is probably around that middle layer, because you can always take on additional responsibility later by moving further down the spectrum, and that's something to consider when you're thinking about hosting your application.
Skyscanner, for example, adopted EKS back in 2018.
So learning from Skyscanner, we're going to start off our initial architecture on EKS. EKS is a fully managed service for running Kubernetes clusters. It handles installing, operating, and scaling the Kubernetes control plane. This allows you to focus on your applications rather than the underlying infrastructure, and it's compatible with Kubernetes tooling and plugins.
Now EKS Auto Mode fully automates Kubernetes cluster management for compute, storage, and networking. So you can think of this as your personal Kubernetes assistant, optimizing your infrastructure by provisioning optimal compute resources automatically and scaling your node groups based on your workload demands. It also spans multiple Availability Zones for high availability.
Exposing Business Logic: API Gateway, Application Load Balancer, and AWS AppSync
So now that we've decided to run our front end and our back end on EKS, the next thing we want to think about is how do we expose the business logic to the front end, right? We have two different pieces now. Pretty much every customer I'm talking to today is building an API. So when it comes to building an API, we have three different options.
The first is Amazon API Gateway. Think of that as purpose-built for REST APIs. Next we have Application Load Balancer, so that layer 7 proxy, really great for route mapping. And finally we have AWS AppSync, which is great for hosting GraphQL-based APIs. Personally, I work with startups, so I see a lot of startups using API Gateway building REST APIs, but all of these services are still really common when it comes to exposing business logic to the front end.
So to put it all together, there is no wrong answer for choosing a service, right? So this is kind of our cheat sheet for picking an API fronting service. If you're looking for complex APIs with multiple data sources, go with AppSync. If you need WebSockets, throttling, usage tiers, and you're going to scale with millions of requests per month, go with API Gateway. If you have a single API action or method and need billions of requests per day, go with Application Load Balancer.
Starting with SQL: Amazon Aurora and Aurora DSQL for Your Data Store
So we talked extensively about our compute. Now thinking about our application, we're going to need some sort of data store, right? So if you're starting from scratch, one of the things that you want to think about is: am I going to go with a relational database or a non-relational database? And nowadays we have a lot of different offerings for different types of databases, right? Think graph databases, document stores, and so on.
So are you going to go with a relational database or are you going to go with something very specific or a NoSQL database? So my recommendation is start with SQL databases. Now some of you are probably thinking, why start with SQL? So let me explain why. SQL is well established and a well-known technology. People learn SQL in college. There are communities, existing code, existing documentation, so it's really easy to get started with SQL.
Now there are probably some of you who are still thinking: I still don't get it. Why should I start with SQL? I'm going to have massive amounts of data, and there's no way a SQL database is going to be able to handle it. We need to optimize for that today; there's no way a relational database is going to work.
Now let's dive in a little bit deeper on that, right? What do we mean by massive amounts of data? Are you going to have multiple terabytes in year one? So going back to the cycle of build, measure, and learn, right, we can build now and improve later, right? So if you don't have multiple terabytes of data right now, we can maybe think about revisiting that a year from now or two years from now, right? So this kind of use case is very rare.
Now why else might you need NoSQL? So NoSQL, again, means non-relational databases, things like document stores and graph databases, right? Super low latency applications are something that you might want to think about for a non-relational database, maybe something like a metadata-driven data set, for example. Those needs might emerge a couple of years down the line, but on day zero or day one, they're not something you encounter very often.
Again, this isn't most of you if you're starting out from the ground up, so start with SQL databases. So Amazon Aurora is our relational database service. It's durable, which means that your data is stored across three availability zones, and it's fully managed. So you don't have to worry about the underlying infrastructure. You don't have to worry about provisioning or administrative tasks like patching or backups.
Most recently in the past year or so, we released Amazon Aurora DSQL. This is our distributed SQL database service. It's PostgreSQL compatible. Again, it's fast and familiar. It's serverless, so again, you don't have to worry about any underlying infrastructure or any instances to manage. There's going to be no downtime because of patching or maintenance windows, and it has virtually unlimited scaling. So what does that mean? Let's dive into a little bit on how Aurora DSQL scales.
You don't really see any of this underlying infrastructure with DSQL. All you really need to do is create a cluster endpoint in the console, connect to it, and then you can begin using your database. But what happens underneath the hood is that the internal architecture is disaggregated and distributed. All of these individual components, the compute layer, the transaction log, the storage layer, have been separated completely, and each scales independently.
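To show just how little there is to manage, here is a minimal, hypothetical sketch of connecting to a DSQL cluster endpoint from Python with psycopg2. The endpoint, user, and token-generation step are placeholders rather than a definitive recipe; DSQL authenticates with short-lived IAM tokens in place of a password, and the token is assumed to be generated out of band (for example with the AWS SDK or CLI).

```python
# Hypothetical sketch: connect to an Aurora DSQL cluster endpoint with psycopg2.
import psycopg2

def query_dsql(cluster_endpoint: str, auth_token: str) -> None:
    conn = psycopg2.connect(
        host=cluster_endpoint,  # e.g. "<cluster-id>.dsql.<region>.on.aws" (placeholder)
        port=5432,
        user="admin",
        password=auth_token,    # short-lived IAM auth token used as the password
        dbname="postgres",
        sslmode="require",      # DSQL connections are encrypted in transit
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT now()")
        print(cur.fetchone())
```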
So to recap, here is our initial architecture. We're using EKS as compute and Aurora DSQL as a data store. Again, this is going to take us pretty far without a whole lot of effort upfront. EKS Auto Mode helps us with the operational overhead, and each of these services has high availability and is configured to be multi-AZ. So again, this can take us pretty far: hundreds of users, definitely thousands of users. I work with startups, so if a startup blows up in the next year or so and suddenly ends up with millions of users, that's typically where things start to go wrong.
Optimizing Performance: CloudWatch Observability, Caching, and Kubernetes Auto Scaling with Karpenter
Maybe that's year one, maybe that's year five for you, maybe that's right now, but things start to go wrong at some point. And so you're either seeing different parts of the business impacting others. You're maybe seeing slower queries in the database due to large table sizes or index growth. And so at some point, you're going to want to kind of rethink or reevaluate your architecture. So let's dive in a little bit, talking again about the front end, the back end, and the data store here. So how do we want to scale and optimize performance for each of these different layers?
Now, again, before we go too much further, we can't tune what we aren't measuring, going back to the cycle of build, measure, and learn. We're kind of missing that measure and learn before we start building again. So we have Amazon CloudWatch for observability. It's built natively on AWS. You can measure things like CPU usage, latency, and request rates. We have something called Real User Monitoring that allows you to collect and view client-side data about your web application.
And using the power of generative AI, we have something called CloudWatch Investigations that allows you to quickly investigate and resolve incidents by surfacing relevant information. So CloudWatch Investigations will take metrics, logs, traces, and other data to generate root cause hypotheses and actionable insights. So now that we have data at hand, we can now make data-driven decisions. So now we're maybe actually seeing slow database queries or slow API requests, things that we can actually address by changing our architecture.
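To make the "measure" step concrete, here is a minimal sketch of publishing a custom latency metric to CloudWatch with boto3. The namespace, dimension, and function names are illustrative assumptions, not something from the talk.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def record_request_latency(latency_ms: float) -> None:
    # Publish one data point for a custom latency metric.
    # Namespace and dimension names here are hypothetical.
    cloudwatch.put_metric_data(
        Namespace="MyApp/Backend",
        MetricData=[{
            "MetricName": "RequestLatency",
            "Dimensions": [{"Name": "Service", "Value": "orders-api"}],
            "Value": latency_ms,
            "Unit": "Milliseconds",
        }],
    )
```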
So let's learn more about the front end tier.
This is one of the ways Skyscanner also scaled their front end. They used CloudFront, which is our content delivery network, and it's built on top of over 700 globally available points of presence. So you want to make sure that you can cache your content closer to your end user to reduce latency.
So let's talk a little bit more about the data tier now that we've addressed the front end. Something that you might be thinking about is going multi-region, right? So when you create a multi-region cluster in DSQL, DSQL actually creates another cluster in a different region and links them together. These linked regions make sure that you have strongly consistent reads and writes, and then there's a third region called the witness region, which holds limited encrypted transaction logs and is used to provide durability and availability for these multi-region clusters.
Now the best database queries are the ones you never have to make, and that's where caching comes into play. ElastiCache is a fully managed service, so you don't have to worry about managing the underlying hosts. It speeds up reads by storing frequently accessed data in a faster memory location, and Skyscanner actually uses Valkey and Redis clusters to help with caching.
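As a rough illustration of the pattern (not Skyscanner's implementation), here is a cache-aside sketch using the redis-py client against an ElastiCache for Redis or Valkey endpoint; the endpoint, key format, and TTL are assumptions.

```python
import json
import redis  # the redis-py client also works against Valkey-compatible endpoints

cache = redis.Redis(host="my-cache.example.internal", port=6379)  # placeholder endpoint

def get_product(product_id: str, db_lookup) -> dict:
    # Cache-aside: check the cache first, fall back to the database, then populate the cache.
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    product = db_lookup(product_id)              # slow path: query the database
    cache.setex(key, 300, json.dumps(product))   # keep it warm for 5 minutes
    return product
```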
Now let's learn a little bit more about the back end here, right? So we can dive a little bit deeper into Kubernetes auto scaling. There are two main elements to auto scaling in EKS: node scaling and pod scaling. With pod scaling there's horizontal pod scaling and vertical pod scaling. Think of horizontal pod auto scaling as scaling out, right? You're increasing the number of pods. Think of vertical pod scaling as scaling up, so you're making the pod larger. But if there is no available capacity on the nodes in your cluster, then you want to think about cluster or node auto scaling.
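Before moving on to node scaling, here is a minimal sketch of creating a horizontal pod autoscaler with the official Kubernetes Python client; the deployment name, namespace, and thresholds are assumptions for illustration.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig pointing at the EKS cluster
autoscaling = client.AutoscalingV1Api()

# Scale a hypothetical "web" Deployment out between 3 and 20 replicas,
# targeting 70% average CPU utilization across its pods.
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="web-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="web"
        ),
        min_replicas=3,
        max_replicas=20,
        target_cpu_utilization_percentage=70,
    ),
)
autoscaling.create_namespaced_horizontal_pod_autoscaler(namespace="default", body=hpa)
```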
So the Cluster Autoscaler strategy is centered around the use of EC2 Auto Scaling groups, and Cluster Autoscaler assumes the instance types are identical within a node group. So if you have multiple different instance types, or even different purchase options like Spot versus On-Demand, you're going to need multiple node groups to support those node types, and managing all of those extra node groups can become difficult, right? And that's where Karpenter comes into play.
Karpenter doesn't use node groups or Auto Scaling groups. Instead, it manages the EC2 instances directly. Karpenter is an open source Kubernetes cluster autoscaler that provides dynamic, groupless provisioning of worker nodes. One of the ways it's able to improve efficiency and cost is by intelligently choosing instance types and consolidating pods to lower compute costs. This has been huge for Skyscanner, who adopted EKS in 2018 and use Karpenter to help manage their fleet. And who better to tell that story than Paul from Skyscanner.
Skyscanner's Scaling Context: From a Founder's Loft to 200 Million Users
Thanks, Christine. Hi everyone. So in this talk, I'm going to cover Skyscanner's scaling story over the last 10 years. I'll touch on the scaling context, so the problems that actually drive the scale that we operate at. I'll talk about our 10-year journey of our compute platform; we've had a few day ones in that journey. I'll talk about Flights Live Pricing, which is one of our largest workloads that runs on our platform. And then I'll also talk about some of the cultural and organizational scaling tactics that we use.
So Skyscanner is a global trusted travel marketplace. We bring millions of people to trusted partners every day for flights, car hire, hotels and package search. We work with over 1200 global partners. You'll see Visit Las Vegas is one of them. And at the start of this year we had 160 million monthly users and it's grown considerably over the course of this year.
We support over 18 million unique flight routes. And over the course of a day, it's not unusual for us to see 100 billion prices coming back from all these partners. We'll have 5,000 requests a second at peak. And over the course of a month, that's about 1.1 billion searches. So there's a lot of traffic there and the architectures have to scale to support it.
Flights metasearch is what drives the scale that we operate at. It looks easy from a UX perspective, like what's difficult about that. But there's some real high cardinality problems in this. So we have billions of unique ticketable flights per year. We have a huge number of partners to integrate. We have multi-petabyte data ingest coming in from these partners, and we see big traffic spikes both from seasonality and from marketing campaigns and other events. So our architecture has to cope with volume, volatility, and variability all at the same time.
So we use AWS to help us scale. We regard them as a technology partner. From our perspective, we care about the global reach that they provide, the focus on resilience, the ability to scale, and the breadth of services that they provide. Christine's already talked about a lot of them. There's also the pipeline of innovation behind those services as well. One thing that's apparent in our architecture is that we don't rely on a single region.
So Skyscanner yesterday was a .NET monolith on SQL Server running on a very heavy server box in our founder's loft, and that box now sits in our Edinburgh reception, so you can see how far we've come over the last 14 years. Skyscanner today is a cloud-native system. It's about 300 Java services. We use four active regions with 12 Availability Zones. We have 24 production Kubernetes clusters, and we routinely run over 37,000 cores a day across 250-plus different instance types. We'll see 400,000 requests per second across our service mesh across all the clusters, and we have hundreds of terabytes a day of cache in use.
So we use a lot of AWS services. There'll be a lot of familiar ones there, and Christine's already talked about a few of them. We use Bedrock as well, but this is not a talk about AI, sorry to say. We won't cover that; there's been plenty of AI this week. The key to this list is not about the individual services. It's about how we bring them together, how we compose them for reliability, managing blast radius, and cost.
Four Generations of Compute: Skyscanner's Evolution to Cellular EKS Architecture
So now I'll talk about scaling compute. Compute is the backbone of our scaling story, so I'll run through four generations of our compute platform. We'll go from hybrid clouds with Auto Scaling Groups and EC2 to large ECS clusters to large Kubernetes clusters, then on to cellular EKS architecture.
So Skyscanner V1, this is like 2014 to 2016. Very pragmatic use of AWS. It was for burst capacity for our partner scraping environment. We ran Classic Load Balancers, EC2 Auto Scaling Groups, and used CloudFormation almost from the start. And this architecture allowed us to scale to around about 100 services. Traffic back to our data centers ran through HAProxy that abstracted the service endpoints and then handled the failover when we had problems with Direct Connect. The key point here is we started hybrid, and we used AWS as a pressure valve for growth, not a big bang migration.
So skipping forward to V2, containerization arrived, and by 2018 we had around about 300 services deployed and hundreds of Lambda functions. The services were deployed onto centrally managed ECS clusters. Around about this time, we built our own continuous deployment system called Slingshot, and that allowed us to scale up the number of deployments that we were doing quite significantly. One problem with this architecture though, this was pre-ALB, so every service had an ELB. That proved to be quite costly, so that was one of the downsides of this quite simple approach. Another lesson we learned here is that with a large number of microservices, you can get sprawl. There's a lot of complexity that starts to come into your architecture.
So skipping forward to V3, we're now in the era of Kubernetes. So we adopted Kubernetes primarily to bring Spot into our compute platform. So by about 2019, we're at about 500 services. We still had hundreds of Lambda functions, but we were doing about 10,000 deploys a month. We operated ECS and Kubernetes side by side, centrally managed clusters, and we'd migrated most of our flights workloads across from the data center at this point. So we had efficiency and velocity, but the downside was what we called the mega clusters.
So huge failure domains essentially. So mid-2019, an inflection point had been reached.
We had the problem of two different container scheduler technologies. We had multiple mega clusters and a growing number of significant outages. Our clusters had become too big to fail, essentially. Developer experience was also poor: there were too many different ways to deploy, and platform operations were brittle as well. So we looked at blast radius, inspired by AWS's own work on failure isolation, and we knew we had to shrink the units of failure. This is a great example of re:Invent in action, being able to learn from AWS and speak to their leaders and service teams.
So Skyscanner V4, this is the current architecture that we have. It's about five years old now. We're a cellular Kubernetes platform. We have Istio cross-cluster service mesh. We run regional accounts. We have multiple EKS cells per region. We use GitOps to manage the lifecycle of the services and also the clusters using Argo CD. We have a standardized AMI and container tool chain. We still do about 10,000 deploys a month, but we have nearly half the number of services that we did in 2019, so the throughput is considerably higher. We have bounded size clusters and we use NLBs at the edge to feed traffic into the mesh.
So conceptually we traded mega clusters for a fleet of small composable cells. So why do we use cells? Well, it gives us a guarantee. A failure in one cluster only affects one over N of total capacity. So we cap the cluster size, we can stagger upgrades, and we enforce an N plus two deployment policy for services. So that means the service will be deployed in at least three clusters in the region, sometimes up to five for large services. So this means we can survive multiple cluster failures without dropping availability and without resorting to full regional failover most of the time. So the model here is designed for partial failure as the normal case and not an edge case.
So the cells architecture that we've got, the compute for that is supplied by Spot. Our normal state is 95% Spot compute and 5% On-Demand, which we use for critical functionality like traffic ingress. We rely heavily on Spot features like rebalance recommendations and placement scores, and on Karpenter. Karpenter has turned out to be a fantastic technology for us. It really reduces the complexity and the toil of running a large Spot fleet. The trick with Spot, though, is aggressive diversification. We have over 250 different instance types across four regions and 12 AZs, and this is what makes Spot usage at scale work. The diagram here shows 24 hours of Spot instances across our fleet.
So we also have the scar tissue of operating a cells environment. In 2021, a bad config push deleted all application namespaces across 24 clusters in a couple of minutes. It was effectively an rm -rf for Skyscanner. We published a full write-up if you want to go into the gory details. This incident forced us to get really serious about control plane simplicity, config blast radius, and operational drills.
So, rebuilding the resilience and trust: first, we minimized global config; any deploy that can be scoped to a service or a cell should be scoped that way. Second, we ran regular backups and restores with fresh runbooks. At the time we weren't using EKS, so we had to manage our own etcd databases. Third, if we were adding logic to templates, they were no longer config, they were code, so we treated them accordingly: unit tests, linting, comprehensive reviews. And finally, we continued to invest in incident commanders, which allowed engineers to stay focused on fixing, not coordinating.
So the lesson we learned from operating cells environments is the control plane is the most important system you'll build. Don't underinvest in it. We underestimated the migration effort of moving hundreds of services from legacy clusters. It's definitely a marathon, not a sprint. It took us several years to do that. Cells introduced an overprovisioning trade-off, so a small service can be over-replicated and large services can end up in 20 plus clusters, so that's one of the reasons that we're using Spot. And good observability and standardization is critical for preventing your cells architecture from becoming unmanageable.
Flights Live Pricing: Handling 5,000 Searches Per Second with Multi-Layered Infrastructure
So now I'll talk about Flights Live Pricing, because that's one of the largest workloads that we run on the cells environment. So it handles about 5,000 searches per second.
This is the thing that generates the 100 billion prices per day, and we'll see 70 gigabits per minute data transfer at individual regions, so there's a lot going on. If Flights Live Pricing is not available, users can't search for flights, so we have a P1 incident.
The scaling is not just about compute. We have many additional supporting systems that allow us to scale Flights Live Pricing, which I'll talk about now. So as Christine mentioned, we use CloudFront. To be able to get global traffic into our regions, we use Route 53, which resolves to multiple CloudFront distributions. CloudFront is for CDN functionality, so we do caching there, header control, and URL rewrites. We also have edge security there, provided by AWS WAF. So that's before traffic even reaches our regions.
We then have an additional layer of Route 53, which uses weighted DNS and health checks to shift traffic between regions. Then within a region, NLB feeds traffic into a cell cluster, and from there Istio handles the last mile routing down to the individual pods. In 2015, we made a very pragmatic decision to run our own NAT instances. This is because of the multi-petabyte data ingest that we have from partners. Managed NAT, although a great service, just wasn't economically viable for us to use.
We do fail over to managed NAT if we have issues with our NAT instances, like we saturate them or there are other problems, but we primarily run on EC2 network-optimized Graviton instances for all traffic in the EU and US. We also use Bring Your Own IP, which makes it easier for us to manage IP ranges with our partner APIs. The key point here is that you don't have to use managed services for everything. You can be pragmatic and choose your own pathway through it. And for us, this is about cost control.
So caching is our biggest data store use case in Skyscanner. We have hundreds of terabytes in use per region, and Flights Live Pricing is one of the biggest users there. We run multi-AZ Redis and Valkey clusters; we're in the process of migrating to Valkey. The responsibility for these caches sits with the service teams themselves; there's no central management of these stores because they're part of the services themselves. Conceptually, we cache partner quotes, constructed itineraries, indicative prices, geodata, and a lot more for flight search.
Skyscanner is a data business. We emit around about 55 billion events per day into our data lake, which is about 25 petabytes of data under management, and a significant amount of that comes out of Flights Live Pricing. We have regional Slipstream endpoints. That's a custom Go service that runs in our cell's environment, and that writes compressed micro-batches into multiple Kinesis streams. So we learned that not all data is equal, though, and that's why we use different streams with different SLAs and quality profiles.
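To illustrate the producer side of that pattern (sketched in Python rather than Skyscanner's actual Go service), here is a hedged example of writing a compressed micro-batch of events into a Kinesis stream; the stream name, event shape, and partition key are placeholders.

```python
import gzip
import json
import boto3

kinesis = boto3.client("kinesis")

def put_micro_batch(stream_name: str, events: list, partition_key: str) -> None:
    # Compress a small batch of events and write it as a single Kinesis record,
    # trading a little latency for fewer, denser records on the stream.
    payload = gzip.compress(json.dumps(events).encode("utf-8"))
    kinesis.put_record(
        StreamName=stream_name,      # placeholder stream name
        Data=payload,                # a single record can be up to 1 MB
        PartitionKey=partition_key,  # spreads records across shards
    )
```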
On the consumer side, we have Spark Structured Streaming jobs running on Databricks that consume from all the regions and then write to Delta tables and S3, where we use Intelligent Tiering for storage cost control. So bringing this together, Flights Live Pricing is a microservice architecture. The diagram there is considerably simplified from what it actually is, but it'll give you a hint of how it all hangs together. So we have sessions, we have Flights Pricing Service, which is where the traffic comes in, and you'll see the caches in use there. It's essentially caching your session for your search.
Itinerary Construction is one of the more compute-heavy services that we've got. It does a lot of the work that brings the itineraries together and actually is what gets presented back to you. It also handles ranking as well. There's a set of services that manage quote retrieval, and the Who to Ask service is all about the routes. It manages the routes that we actually search for you. Then we have the Egressor service, which holds all the logic for the partner integrations, and that's where the NATs come in, where we actually talk back to our partners.
Cultural and Organizational Scaling: Platform Engineering, Cost Management, and Continuous Improvement
Raw technology alone doesn't scale. You also have to think about abstractions and culture. So in Skyscanner, we have a strong principle of preferring open specifications. That's driven our adoption of Kubernetes, OpenTelemetry, Delta Lake, Istio, Karpenter, and other open source projects. This gives us portable, well-understood abstraction layers at key points in compute, networking, and data.
So at our scale, cost is a first-class concern. We track cost using cost per 1,000 sessions. We call it CPS. It's our primary infrastructure metric. CPS is a shared language between engineering and finance, so we set yearly targets with it. They're reviewed monthly across all engineering groups, and every team in Skyscanner owns its own spend.
That spend is managed with common tooling and tagging policies, so cost is delegated, not centralized.
So we adopted platform engineering before it was cool, essentially out of necessity in 2018. Today we have 40 engineers who operate and evolve our production platform. They cover cloud infrastructure, compute, global traffic routing, observability, CI/CD, and SRE enablement. Roughly 50% of their time is spent operating these systems, another 40% is spent on improving them, which is more like product work, and 10% is on learning and development. So we don't just keep the lights on with our production platform; we're continually moving it forward each day.
So wrapping up, some lessons that we've learned in the last 10 years. Be opinionated. You need to pick the smallest set of technologies that meet your scaling and resilience needs, and then standardize and harden them. Manage the blast radius, so decide explicitly how far failure is allowed to propagate. Speak business. Scaling isn't cheap, so get fluent in cost and value so you can have real conversations and make cost everyone's responsibility. Above all else, be pragmatic. You can serve a staggering amount of traffic with less than perfect architectures. The art is in improving them continuously without stopping the world. So this is how we went from a server in a loft to a multi-region cell-based platform serving close to 200 million users. Now I'll hand you over to Fraser, who'll talk about different scaling patterns and how they can be implemented using AWS services.
Breaking Down the Monolith: Microservices, Asynchronous Architecture, and Database Strategies
Thank you, Paul. It's great to hear Skyscanner's journey over the last 10 years on the cloud. So I'll take you back to some of the stuff Christine covered earlier. So we're 10 years in, we've got about a million users, it's a bit of an inflection point. Firstly, you're going to get new feature needs. Your business needs have changed. Look at Amazon.com. You've got things like personalization that have joined there that help grow your business. And with that comes new needs for infrastructure. Your beloved database, which probably has a single writer instance, is starting to become a real bottleneck and a problem for your operations. And finally, you've got a monolithic architecture that's really bogging you down. It's becoming a massive point of failure and it's slowing down development and operations. So what can we do about it?
The first thing is looking at a microservices architecture. Now there's a lot of talks you can hear at re:Invent that'll cover a microservices architecture, but effectively it's the act of taking a monolithic or large application and breaking it down into small components. When you're doing it, there are kind of two broad ways we look at doing this. Firstly is data domain mapping, which is when you look at all your data stores, your data structures, your schemas, et cetera, and look at all those commonalities and try and divide that way. Or you could divide by business function. When you're doing this, it's an inflection point to look at your compute platform as well. Paul mentioned earlier, they started on ECS then evolved into EKS. Those needs are going to change. Some technologies that work very well for you at one point may start to become a bit of a sticking point, and it's very much going to be dependent on your application and architecture. Finally, when you're operating a bunch of microservices, you need to work out how to mesh it all together and how to glue it together.
The next thing you want to look at is your databases. They're really critical to your application; they store all your really critical data, but at some point they become a bit too big, so you're going to need to start breaking them up. You'll probably look to break up by function or by purpose. So in the example here, we've got a forums database, a users database, and a products database. This won't help you if you've got very unoptimized queries or massive tables. A lot of the time when we see that, you actually have the wrong technology. So when you start out, you might need a quick way to get analytics and data, and you've done it on a relational database. It works really well to a point, and then it just becomes a massive headache. So really do think about the right technology you're picking. Purpose-built data warehouses will scale much better than your relational database will.
Another thing to think about is NoSQL. Christine covered this earlier in a fair bit of detail, and one thing I get a lot of customers asking is when to start looking at NoSQL (DynamoDB, in our case). DynamoDB is great for massive scale with very low latency, and some really good use cases are things like key-value data stores and metadata. But it's not a one-size-fits-all approach. We've heard a lot this week about AI. AI is going to massively change your data needs. Your data is going to evolve in a massive way, and it's going to give you a lot more data, both structured and unstructured, so put a lot of thought into the database technology you pick and how it's likely to grow with you.
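As a small sketch of that key-value use case, here is a hypothetical DynamoDB table accessed with boto3; the table name, key, and attributes are illustrative assumptions.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SessionMetadata")  # hypothetical table with "session_id" as the partition key

def save_session(session_id: str, metadata: dict) -> None:
    # Simple key-value write; consistent low-latency access at scale is the draw here.
    table.put_item(Item={"session_id": session_id, **metadata})

def load_session(session_id: str) -> dict | None:
    response = table.get_item(Key={"session_id": session_id})
    return response.get("Item")
```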
Another thing to look at is how you break up your backend services. Broadly, when I see people doing this, look to mirror your data tier. Look at that data pattern and break it up that way. Again, another inflection point on your compute platform. There's no right answer in this, so really make it work for your use case. Then look at your business logic, the thing that really makes your business work. And I also look at moving from synchronous communication over to asynchronous, and this is something I'm going to cover in the next few slides. And finally, look at technologies like queues and buses and streams that will help build an event-driven architecture.
One really important thing to remember is that as you scale, everything gets more difficult. Microservices adds an extra layer of complexity on top of it. You have more to manage, more to think about, there's more there. It will make things more operationally difficult, but it gives you so many more benefits.
So I mentioned thinking asynchronously, and we've got a diagram here that helps explain it. On the left we have a synchronous call, where your client calls Service A, which will then call Service B. If there's any issue with Service B, the client won't necessarily get a reply and the request gets held up. If we go asynchronous, the client only goes to Service A, which gives a reply and separately goes to Service B. So say there's an issue between Service A and Service B, maybe a network issue, or Service B has some form of issue or latency: the client still gets a good answer.
So let's look at a more relevant example. We've got an e-commerce application: your client posts to your orders API, and separately the order flows through to an invoice generation service. The client can also go straight to the invoice service when it needs to, but if there's anything that doesn't quite work with the invoice service, the customer can still make the order. You're still able to take the money, you're still able to generate revenue and operate your business.
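A minimal sketch of that decoupling, assuming an SQS queue sits between the orders API and the invoice service; the queue URL and helper function are hypothetical.

```python
import json
import boto3

sqs = boto3.client("sqs")
INVOICE_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/invoice-requests"  # placeholder

def save_order(order: dict) -> str:
    # Placeholder for the synchronous write to the orders data store.
    return "order-123"

def place_order(order: dict) -> dict:
    # Synchronous path: persist the order and reply to the client immediately.
    order_id = save_order(order)
    # Asynchronous path: hand invoice generation to a queue, so a slow or failing
    # invoice service no longer blocks order placement.
    sqs.send_message(
        QueueUrl=INVOICE_QUEUE_URL,
        MessageBody=json.dumps({"order_id": order_id}),
    )
    return {"order_id": order_id, "status": "accepted"}
```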
So transitioning to an asynchronous architecture is an investment that is going to take more time, as you really do need to understand your data and different commonalities with it. Understanding the communication amongst things is really important as well, as are any changes to configuration you need to make. Doing this really gives you a much more in-depth understanding of your application and its architecture though.
So when you're looking to grow, you'll need to look at how you hold things together and also how you kind of decouple them. And so you look at things like topics, streams, queues, and buses. And we've got four services here. We've got Simple Notification Service, Simple Queue Service, EventBridge, and Kinesis Data Streams. These all do slightly different things and they all have different use cases.
So if you're wondering what to use, say you've got a massive throughput of data, you need some ordering, you might have multiple consumers, the ability to replay your data, Kinesis Data Streams is a really good fit. If you're going one to one, you don't have much of a fan out and you're going straight to a target, SNS. If you need an ability to buffer your requests in a queue to have them be consumed, and you can order them or not, SQS is a really good service. And finally, if you've got a one to many fan out with a lot of different targets and schemas, EventBridge is a good option. It's worth saying you can use a number of these in combination. This isn't a one size fits all approach, and there's some really nice integrations between some of these. It will depend on your architecture.
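For the one-to-many case, here is a hedged sketch of publishing an event to EventBridge with boto3, which rules can then fan out to many targets; the source, detail type, and bus name are assumptions.

```python
import json
import boto3

events = boto3.client("events")

def publish_order_placed(order_id: str) -> None:
    # Publish a single domain event; EventBridge rules decide which targets
    # (queues, functions, other buses) receive it. Names here are illustrative.
    events.put_events(
        Entries=[{
            "Source": "myapp.orders",
            "DetailType": "OrderPlaced",
            "Detail": json.dumps({"order_id": order_id}),
            "EventBusName": "default",
        }]
    )
```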
So looking at microservices, this is quite a typical microservices architecture we see from customers. Don't be scared by it. There's a lot of lines there, there's a lot of services there, but it is quite typical. But let's walk it through. On the left we have Amazon Route 53, 100% SLA on the data plane that will resolve your DNS, and it can also do things like health checking and help with routing. We have CloudFront to help accelerate experience for your end users and caching any static content. You then go into your compute layer with EKS before fanning out into your different microservices and data stores.
Everyone's architecture will be unique. Your choice of data store to use will be unique, and your choice of compute will be unique, and how you use it will be unique to your business. Skyscanner have had fantastic success with EKS and Spot to deliver cost-effective resilience. As I've said a lot, this will really depend on what you need.
Advanced Scaling Patterns: Cell-Based and Multi-Regional Architectures with Best Practices
Another concept Paul touched upon was the concept of a cell-based architecture. It's something we use quite a lot for our own services at AWS and Amazon. This is when we deploy a full copy of an application into a cell. The cell is a fault boundary. It's done to reduce the area of impact for any failures. You partition your data, it's effectively sharding, and you have complete isolation between the cells. There are hard bulkheads between them. You'll have a routing layer at the top which is going to be highly resilient and available. A lot of people look at things like Route 53 for that with its 100% SLA on the data plane. But you will at some point need to be able to scale this, and you'll need to know how big the cells are. So there's a balance you'll have between the management overhead of having cells, because you need to manage each one of them, and how much you're willing to have fail in the event of an incident.
Another thing Paul briefly touched upon was multi-regional architecture. Now, a lot of our customers look at multi-regional architectures for different things. That could be for regulatory reasons. They may need to have a certain uptime availability if they're in a regulated industry. They may have another regulatory need to have their data in a particular country, or they may just want to be closer to their customers and have lower latency for them. One key thing if you have multi-region: architect for regional independence. If you have an issue in one region, that shouldn't impact another region. Keep them separate. Try to avoid cross-dependencies. Make your writes idempotent. And like cells, this will create additional overhead for you operationally. Be aware of that before you go in. Really think of the trade-offs you need and what you need to be able to comply with and what your availability needs are. There's a number of talks you'll be able to hear this week and see online around multi-regional architecture, and I'd recommend looking into that if you are looking at that journey.
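On making writes idempotent, one common approach (a sketch, assuming a DynamoDB table keyed on a client-supplied ID) is a conditional put, so a retried or replayed request cannot be applied twice:

```python
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("Payments")  # hypothetical table keyed on "payment_id"

def record_payment_once(payment_id: str, amount: str) -> bool:
    # Idempotent write: the condition rejects duplicates, so a retry or a replay
    # after a regional failover cannot record the same payment twice.
    try:
        table.put_item(
            Item={"payment_id": payment_id, "amount": amount},
            ConditionExpression="attribute_not_exists(payment_id)",
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already recorded; safe to treat as success upstream
        raise
```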
So I've simplified this with some best practices that apply to both cells and multi-region architectures. Firstly, when you're deploying code and updates, try and use your regions and cells as a way to do it. Deploy very fractionally, very gradually. So only deploy to a small number, a small area, one cell, one region, a small number of users, and if you have an issue, quickly back it out. Reduce that area of impact. It's the way AWS does deployments, very fractionally. Performance test. This is something I see a lot of customers struggle with, but performance testing really is a fantastic way to understand your scale. It'll also let you know at what point you need to then look at moving into another cell or another region, what will start to break, what will start to cause you problems.
Christine touched upon observability earlier, and this is a really critical factor. Make sure it aligns to your fault boundaries. If you've got an issue in one cell, how do you know it's in that cell? Make sure it's clearly identified, tagged, and marked. And finally, when you're going to be testing and observing, look for an outcome. Do you really care too much if an EC2 instance individually has a problem? Probably not. You care about the business outcome. Can your users still find a product? Can they still place an order? Look at those outcomes when you're doing your testing. What's the ultimate experience for your end user?
So we've heard a number of things today that will help you really scale your architecture to the next level. A lot of this is really going to depend on your business needs, your growth requirements, and how you're trying to do things. But these things will really let you scale to the next level for a long time to come.
In closing, we're in a much better place than we were even five years ago with the availability of managed services that let you scale much more efficiently out of the box. Throughout the stack, there are a lot more resources available, and it's a lot easier for you to do that. The best way to scale is to do less. Christine mentioned the best query you make is the one you don't make. Use caching so you don't actually have to do the work; let it do the work for you. It's there to get hit, it's there to take that heavy lifting, and it will reduce the scope of what your database is actually being queried with and processing.
Refactoring is a big investment. It's a lot of time you're going to have to put into it. It's going to involve some challenges along the way. So make sure when you're doing it, you think carefully about it and the different trade-offs. Look for your best fit technologies based on what you need as you go and be flexible around it. Look at the resilience and availability of your application. We're in a digital world, and people expect to be able to do things when they want, on demand. Any downtime is very damaging to your business, to your brand, and will cause bigger problems. Architect around these fault boundaries. Make sure you understand your area of impact. If you're going to fail, how big is the failure going to be?
So thank you for coming today to our session. Christine, Paul, and I will be available outside for a short time after to answer some questions, and please fill out the session survey in the mobile app. Thank you.
This article is entirely auto-generated using Amazon Bedrock.