🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Navigate cloud compute with Fargate and ECS Managed Instances (CNS342)
In this video, Mats Lannér and Alexandr Moroz from Amazon ECS discuss container orchestration on AWS, focusing on capacity provisioning, security, and cost optimization. They introduce ECS Managed Instances, a new option between Fargate and EC2 that provides managed compute with more control over instance types while handling patching and lifecycle management. The session covers the shared responsibility model, task isolation differences between Fargate and Managed Instances, and cost optimization strategies including Compute Savings Plans and container image caching. Ruben di Battista from Qube Research & Technologies shares their journey scaling from a small deployment to tens of thousands of ECS services running Python-based trading algorithms, achieving 100x growth using Fargate, AWS Cloud Map for service discovery, and a multi-account architecture with centralized observability through CloudWatch and OpenTelemetry Collector.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction to Amazon ECS: A Container Orchestrator Built for AWS
My name is Mats Lannér, and I'm the Director of Engineering for Amazon ECS, so I'm responsible for all things ECS and related. With me I have Alexandr Moroz, a Product Manager for Amazon ECS focusing on AWS Fargate. We also have our distinguished guest, Ruben di Battista from Qube Research & Technologies. Nice to meet you.
Today we're going to talk about what Amazon ECS is and how it works. This is a 300-level session, so we assume you know the basics of ECS since you chose to join us. I'm going to talk about capacity provisioning with ECS and how you get the compute you need to actually run your applications. After that, Alexandr will take over and talk about security and compliance aspects of running your applications on ECS and how we help you there, as well as cost optimization and what we do to make it cheaper for you to run your applications. Then, perhaps most interestingly, we'll have Ruben talk about the journey that Qube Research & Technologies have been on with ECS, going from a reasonable scale application to a very large scale workload on ECS and Fargate, which has been super exciting to see.
So, what is Amazon ECS? The simplest way of thinking about it is that ECS is a container orchestrator built for use on AWS where the guiding principle is that we take on as much of the operational chores as possible so that you can focus on your application instead. There's no control plane to manage because we take care of that. If you use Fargate, there's no compute to manage because we take care of things like that.
Four Pillars of ECS: Scale, Cost Efficiency, Speed, and Security
Looking at ECS, there are four pillars we think about for how ECS works and what we build it for. The first one is to launch containers at scale. It doesn't matter if your workload requires one task and runs at a fraction of a TPS, or if it's a very large scale application with tens of thousands of tasks running at hundreds of thousands of TPS. ECS is there to support you on that whole journey, giving you a consistent operating model regardless of what your scale is.
Second is that you only pay for what you use. What that means is that there's no charge for using ECS itself. What you do pay for is the resources that we provision for you. So if you launch EC2 instances, you pay for those instances, EBS volumes, load balancers, and things like that. You only pay for the things that you actually need to run your application. If you're using Fargate, for example, if you're not running a task, there's no charge for Fargate. You only pay while you run Fargate tasks.
Going back to what I mentioned earlier, we're all about improving speed and agility for you. We want you to be able to focus on your application, reach scale and production with your application as fast as possible, and not have to fiddle with a bunch of infrastructure mechanisms. That doesn't mean that we lock you down so that you can't do advanced scenarios, but the default experience is to get you up and running with AWS best practices. Then when your use case requires it, you can dig into more advanced capabilities and take advantage of those.
Finally, as I mentioned a little bit earlier, we embody AWS security and operational best practices. We try to set you up for success. Part of that is that if you use things like Fargate, you don't have to worry about how to correctly configure an instance to run your workload. We take care of all of that. With Fargate, you get task isolation. Every task you run executes in its own EC2 instance, and that EC2 instance is never reused. It's only used for the lifecycle of the task and then it's recycled.
To give a little bit of context about ECS and how ECS is growing: as of about two months ago, customers launched about three billion tasks on ECS every week across all of our regions. Looking at new customers who come to AWS, start running new workloads, and choose to use containers to run those workloads, we see about 65 percent of those customers choose ECS, and the vast majority of them choose Fargate with ECS because it just makes their life easier. Additionally, both AWS services and Amazon.com services run on ECS. One specific example that we can share here is that during Prime Day this past summer, the team supporting Prime Day launched about 18 million tasks on Fargate every day during Prime Day, which doesn't quite include everything they launched leading up to Prime Day. Still, it's an indication that Fargate and ECS are ready to support the biggest workloads out there.
So why do customers choose to use ECS and trust ECS to run their applications? Part of this is simplicity. Not simplicity in the sense that ECS is limited and only gives you a limited set of capabilities, but simple because it simplifies operations for customers. We really want to focus our effort to help you focus on your application. We'll take care of the other operational chores for you.
Other reasons include security. Obviously, we all want that. We make use of AWS security best practices and set you up for success with that from the beginning without you having to do anything explicit about it. Finally, there's efficiency. We're very focused on driving high utilization for you so that you can run your application at scale with high availability at the lowest possible cost to you.
We do that in a number of ways. For example, with Fargate, you tell us that you need a 2 vCPU task with 8 gigabytes of memory. Well, that will be provisioned for you and that's what you pay for. If there's any overhead on our end on that, that's our problem to solve, not your problem. You tell us you want 2 vCPUs, we give you 2 vCPUs. We do all kinds of things like that, including scaling up when you need to scale up and scaling down when you need to scale down. Things like that will make it easier and lower cost for you to operate your applications.
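As a concrete illustration, a Fargate task definition really only pins down vCPU and memory. Here is a minimal boto3 sketch registering the 2 vCPU / 8 GB task described above; the family name and image URI are hypothetical:

```python
import boto3

ecs = boto3.client("ecs")

# Register a Fargate task definition: the only sizing decisions
# are vCPU and memory (2 vCPU / 8 GB here).
ecs.register_task_definition(
    family="my-web-app",                      # hypothetical name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",                     # required for Fargate
    cpu="2048",                               # 2 vCPU, expressed in CPU units
    memory="8192",                            # 8 GB, expressed in MiB
    containerDefinitions=[
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-web-app:latest",
            "essential": True,
            "portMappings": [{"containerPort": 8080}],
        }
    ],
)
```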
Capacity Provisioning Options: Fargate vs EC2
So digging in a little bit into capacity provisioning, how do you get the compute you need to actually run your application tasks? AWS provides a wide range of compute options, and at the end of the day, most of it relies on EC2 in one way or another. The exception would be if you use ECS Anywhere and you run on-premises, in which case you're on the hook for providing the actual compute. But you can use ECS to run across all of these options and get a consistent operating model.
We also see that the vast majority of our customers use either EC2 or they use Fargate. So I'll dig in a little bit to what the trade-offs and benefits are between using Fargate and EC2. Fargate, which many of you are probably familiar with, provides a fully managed data plane for you. We abstract away the compute to the point where the only thing you need to think about is the amount of vCPUs and the amount of memory that you need.
There are a few other things that you can control as well. You can request more storage and things like that, but fundamentally it's CPU and memory that you think about. We take care of everything else. If you need to launch 2 tasks, we'll provide compute that can run 2 tasks. If you need to run 500 tasks, we'll provide that compute. We take care of providing workload isolation by design in Fargate. Every task runs on a separate EC2 instance. That EC2 instance is provisioned for that task. When the task stops, we throw away the EC2 instance.
So every time you launch a task, you know that you're running on fresh compute. There's nothing left from previous tasks or anything like that. There can be no cross-talk or anything like that, which for many customers is a key consideration.
As soon as the Fargate task stops, there's no more charge for it. You get a lot of benefits with Fargate, but you also give up control of things. So if you choose to use Amazon EC2 as your compute, you get a lot of control. You get to choose exactly which EC2 instance types and sizes you use for your application. Whereas with Fargate, we use a wide range of instance types and sizes for your tasks depending on region and availability zone. You may see us use an M8i for a task versus an M6i, depending on availability. We always focus on being able to launch customer tasks, prioritizing that over consistency in the underlying compute, which is an important aspect of Fargate.
For EC2, you control all of that. But that comes with a cost, because now you're also on the hook for providing the operating system that runs on those EC2 instances. You have to make sure that the operating system and instance are correctly and securely configured. You need to make sure that you patch your instances. But on the other hand, you do get to take advantage of all the EC2 purchasing models. You can use Reserved Instances, On-Demand Capacity Reservations, Savings Plans, things of that nature. Both Fargate and EC2 support Compute Savings Plans, so you do have that option available regardless.
ECS Managed Instances: Bridging the Gap Between Fargate and EC2
There are a few tensions between using Fargate and EC2. We've had lots of feedback from customers telling us that they love Fargate, but they need a little bit more control for their workloads. So in response to that, we launched ECS Managed Instances about two months ago. What we're trying to do here is strike the right balance between managed compute that we give with Fargate versus giving you more control over what we use to run your workloads.
With ECS Managed Instances, we launch EC2 instances in your account. You will see these EC2 instances, but you have no control over them. You can't take any mutating actions on these instances. Only we can do that for you. But that also means that we control the entire lifecycle of these instances. We make sure that we launch them when they're needed because you need more compute capacity to run your applications. After roughly two weeks, which is very similar to how Fargate operates, we will retire the instance and shift your tasks over to a new instance. We do this so we can patch your instances and make sure that we're always meeting patching and compliance requirements.
You have control over the instance types. You can either say, "AWS, I trust you, pick the most cost-optimized instance types for my application," or you can say, "I know exactly what I need: network-optimized Graviton instances for my workload," and choose exactly that. Or you can say you want GPUs for your workload; you can get those with ECS Managed Instances.
We have more robust cost optimization in ECS Managed Instances. One is that we've rebuilt how we scale up additional compute for you. We can scale up additional compute faster with ECS Managed Instances than we can with both Fargate and EC2. We also implemented cost optimization that continuously looks for idle instances or underutilized instances. When we find underutilized instances, we will shift tasks over to other instances and then shut down those instances, reducing your costs as a result.
Finally, we give you more control. Many of you who are Fargate customers today will know that you can't run privileged containers on Fargate. You can't make use of agents that use eBPF, for example. With Managed Instances, we give you this ability. By default, nothing runs as privileged, but you can say in your task definition, "Hey, I have this sidecar container. It needs privileged access because it requires eBPF," and you can do that. Going back to the earlier slide, where do Managed Instances fit into the picture? Right in between Fargate and EC2. You get more capabilities, and we still give you a fully managed compute experience, but you need to be more aware that we actually provision EC2 instances for you, even though we completely manage them on your behalf.
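To make that concrete, here is a hedged sketch of such a task definition via boto3. The family, images, and sidecar name are hypothetical, and the exact compatibility flag for Managed Instances should be confirmed in the docs; `privileged` is the standard ECS container-definition field that Fargate rejects:

```python
import boto3

ecs = boto3.client("ecs")

# A task with an ordinary app container plus a privileged eBPF-based
# monitoring sidecar -- allowed on EC2-backed capacity such as
# ECS Managed Instances, not on Fargate.
ecs.register_task_definition(
    family="app-with-ebpf-sidecar",           # hypothetical name
    requiresCompatibilities=["EC2"],          # check docs for the Managed Instances setting
    networkMode="awsvpc",
    cpu="2048",
    memory="4096",
    containerDefinitions=[
        {
            "name": "app",
            "image": "my-registry/app:latest",         # hypothetical image
            "essential": True,
        },
        {
            "name": "ebpf-agent",
            "image": "my-registry/ebpf-agent:latest",  # hypothetical image
            "essential": False,
            "privileged": True,   # grants the access an eBPF agent needs
        },
    ],
)
```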
Capacity Providers and When to Use Each Compute Option
A core concept in ECS is capacity providers. We introduced this back in 2019. Originally, we had the Fargate capacity provider, which is actually two capacity providers: Fargate on-demand and Fargate Spot. We also introduced the Auto Scaling group capacity provider for EC2. With managed instances, we've introduced a new capacity provider that works closely with the whole task lifecycle in ECS to make sure that you have instances available when you need them and only when you need them.
Looking at the Fargate capacity provider, it's simple. You don't have to worry about EC2 instances, configuration, patching, or anything like that. When your workload increases and you need more tasks to run your application, we will launch more tasks for you. When your workload decreases, we will scale down those tasks. All you need to think about is the amount of CPU and the amount of memory that you need.
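As a minimal sketch, attaching the two built-in Fargate capacity providers to a cluster and defaulting new services to on-demand Fargate looks roughly like this (cluster name is hypothetical):

```python
import boto3

ecs = boto3.client("ecs")

# Attach the built-in Fargate capacity providers to a cluster
# and default new services to on-demand Fargate.
ecs.put_cluster_capacity_providers(
    cluster="my-cluster",                     # hypothetical cluster
    capacityProviders=["FARGATE", "FARGATE_SPOT"],
    defaultCapacityProviderStrategy=[
        {"capacityProvider": "FARGATE", "weight": 1}
    ],
)
```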
For managed instances, I've touched on some of these things already. This new capacity provider is tightly integrated with the rest of ECS to know when it needs to launch additional EC2 instances and when it should shut down EC2 instances. We launch these instances in your account, but they're fully managed by us. You can see them, but you can't terminate them or take any mutating actions on them. We control everything about the management of these instances.
For example, we provide the Amazon Machine Image for these instances, and you cannot control that. We use Bottlerocket as our container operating system because we think it's the right Linux OS to use for container workloads. We make sure it's configured correctly, and as I mentioned, we run for about two weeks and then we replace instances, shifting your workloads transparently. One thing that you get with ECS managed instances is that you get to use EC2 event windows. If you want to control when we replace your instances, you can create EC2 event windows and tell us, "Use this four-hour time window on Sunday mornings between 10 a.m. and 2 p.m.," and we'll make sure that we only initiate operations in that event window.
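EC2 instance event windows are created through the EC2 API. A sketch of the Sunday 10:00-14:00 window from the example follows; the window name and tag are hypothetical, and exactly how Managed Instances picks up the window (for instance, via tag association as shown) is an assumption to confirm in the docs:

```python
import boto3

ec2 = boto3.client("ec2")

# Define a recurring four-hour window on Sundays, 10:00-14:00.
window = ec2.create_instance_event_window(
    Name="ecs-mi-maintenance",                # hypothetical name
    TimeRanges=[
        {
            "StartWeekDay": "sunday",
            "StartHour": 10,
            "EndWeekDay": "sunday",
            "EndHour": 14,
        }
    ],
)

# Associate the window with instances by tag (tag key/value hypothetical).
ec2.associate_instance_event_window(
    InstanceEventWindowId=window["InstanceEventWindow"]["InstanceEventWindowId"],
    AssociationTarget={
        "InstanceTags": [{"Key": "workload", "Value": "ecs-managed"}]
    },
)
```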
With managed instances, you have the option to map the instance type to your workload needs. You can use the default capacity provider, and then you don't have to do anything other than just selecting it and trust us to pick the right instance types. We'll focus on cost when we make those decisions. Obviously, the tasks you run also matter. We try to bin pack multiple tasks onto an instance, so we'll take that into consideration as well.
When you want to explore more capabilities, you can use the custom capacity provider, and then you get to use what we call attribute-based instance type selection. You can define things like, "I need a GPU. I need a GPU of this particular type. I need at least this number of vCPUs on the instance. It needs to be network optimized." You can control any number of attributes like that. You don't have to pick specific instance types, although you can if you really want to. Instead, focus on the attributes of these instances that map to what you need for your workload.
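Attribute-based selection follows the same idea as EC2's `InstanceRequirements` model. Assuming the Managed Instances capacity provider accepts attributes of this shape, you can preview which instance types a set of attributes would match using the EC2 API:

```python
import boto3

ec2 = boto3.client("ec2")

# Preview which instance types satisfy a set of attributes:
# Graviton (arm64), at least 8 vCPUs, at least 32 GiB of memory.
resp = ec2.get_instance_types_from_instance_requirements(
    ArchitectureTypes=["arm64"],
    VirtualizationTypes=["hvm"],
    InstanceRequirements={
        "VCpuCount": {"Min": 8},
        "MemoryMiB": {"Min": 32768},
        # e.g. additionally require a minimum network bandwidth:
        # "NetworkBandwidthGbps": {"Min": 25},
    },
)
print([t["InstanceType"] for t in resp["InstanceTypes"]])
```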
Now we have EC2, we have Fargate, and we have managed instances. So when should you use these? We strongly recommend that if you're using ECS and EC2 today, you should take a look at managed instances because we think we can give you a much better experience using managed instances.
For Fargate, Fargate is the default in ECS. It will remain the default because we think that is the best experience for the most customers. That said, if you're on Fargate today and you run into limitations on Fargate, then ECS Managed Instances is there to support you. For example, if you need to run a 48 vCPU task, Fargate does not support that, but ECS Managed Instances does. If you want to use a logging sidecar that requires eBPF, Fargate does not support that, but ECS Managed Instances does. If you want to be one of the cool kids and use a GPU workload, Fargate does not support that, but ECS Managed Instances does. Those are the kinds of things that you should think about when you want to use ECS Managed Instances.
Security, Compliance, and the Shared Responsibility Model
With that, I'm going to hand over to Alexandr, who's going to talk about security and compliance. Security and compliance, what an exciting topic, right? Let's try to make it more or less painless. When we talk about security, we usually start by looking at the shared responsibility model.
Let's take a look at the screen right now. That's the situation with ECS and self-managed EC2. AWS is responsible for the ECS control plane. It is responsible for all the foundational services: storage, compute, network, and all monitoring, and of course the global infrastructure (regions, Local Zones, everything that is actually physical). But you have to worry about and secure all your compute capacity. You need to think about how to provision that capacity, how to scale it up and down, how to distribute it across different availability zones. You need to think about instance selection, and more than that, each time you launch an instance, you need to pick the right AMI, make sure it has the right version of the agent, patch it, and of course keep monitoring all that stuff. On top of that, your application adds yet another layer of complexity. So that becomes rather complicated.
That's why with Fargate and now with ECS Managed Instances, AWS takes on more responsibilities. You no longer need to worry about the things you worry about with self-managed compute: capacity provisioning and the EC2 instances themselves. Everything that runs on them is fully managed by AWS in different ways, and I'm going to talk about that in a minute. The key point is that you're still responsible for your application: how the load balancing happens and how the security groups are configured. You configure security groups, define firewall rules, and of course IAM roles. Basically, this is your responsibility; everything else is AWS's. As you can imagine, that change in the shared responsibility model gives you far more time to focus on what actually matters to your business: building the application, building the business logic, and serving your customers, while AWS takes care of pretty much all the infrastructure and configuration.
Now, let's take a look at the isolation boundaries, and that's where we have a difference between Fargate and ECS Managed Instances. As Mats already mentioned, Fargate provides a really strong isolation boundary and security model. Each task is launched on a separate EC2 instance hidden in an AWS-managed account. We do not reuse these instances: we never place two tasks on the same instance, and we never reuse the same instance for two tasks in a row. We always discard it, scrub the memory, scrub the ephemeral storage, so there is absolutely no risk of data cross-contamination. With ECS Managed Instances, we launch EC2 instances in your accounts, and we believe it's the best option for you to launch multiple tasks per instance. We take care of bin packing to optimize the infrastructure and the utilization. With that approach, we can probably launch more tasks on less compute and really help you reduce your AWS bill.
Of course, you have controls there. If you're not comfortable launching tasks from different applications on the same instance, you can use different capacity providers, so you have certain controls there. You can also control the size of your instances to pretty much simulate a single-task experience. We do keep reusing these instances, and we keep them running for a short while after they become idle. That idle time is now configurable; you can set it up to one hour, so we don't just immediately release that capacity.
Why do we do this? Because each subsequent task launched onto an already-running instance starts way faster, about three times faster than a task on a brand-new instance that we have to spin up. So you have these controls with managed instances. Let's take a look at the best practices that we use with Fargate; they're very similar between Fargate and managed instances.
The key difference is the isolation point. Task isolation is probably the key feature of Fargate. We reduce the attack surface: we don't have access to the underlying AWS hardware, and you don't have access to it either. We only provide access to the logs that are important for telemetry and troubleshooting. We do run regular patching, and one very important thing that is often overlooked is awsvpc networking mode.
Why is it so important? Because each task gets an individual IP address. With that, you can use VPC Flow Logs to understand what's going on with your traffic, and you can use security groups for fine-grained control over each individual task. So you have full control over your network security, and that's a major thing, something you cannot achieve behind network address translation if you're running, for example, bridge mode on EC2. It's a really strong and very secure solution that follows pretty much all AWS security best practices.
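A sketch of what per-task networking looks like in practice: with awsvpc mode, each task gets its own ENI, so a dedicated security group can be attached at the service level (cluster, service, subnet, and security group IDs below are hypothetical):

```python
import boto3

ecs = boto3.client("ecs")

# With awsvpc networking each task gets its own ENI and IP address,
# so a dedicated security group applies to this service's tasks only.
ecs.create_service(
    cluster="my-cluster",
    serviceName="payments-api",
    taskDefinition="payments-api:1",
    desiredCount=2,
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)
```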
Managed instances have a list of benefits. They provide fully managed compute with compliance and patching every two weeks. If you tell me it looks like Fargate that launches instances in your accounts, you won't be wrong. Technically, this is Fargate that is exposed to your account, so you're getting the same experience with a couple of differences. Bottlerocket, as mentioned, is an operating system that is developed by AWS. It's an open source project that we believe is the best way to run containerized workloads. It has a very minimal set of packages required for containerized workloads. We don't have anything extra, resulting in a reduced attack surface.
Of course, all the instances that you launch in your accounts, you don't have access to them. You can still see them. Why do you still see them? Well, I can give you an example. Imagine if you don't see them, but they already exist and run in your account, and then you have to go and delete a subnet. What's going to happen? You're going to get an error message saying you have instances running, so you'd better see them. So you have visibility, but you don't have control. You don't have a way to change them directly using EC2 APIs. Everything goes through the ECS API, adding another layer of security.
Of course, there's also the maintenance windows feature. Your EC2 instances have to go through hardware replacement from time to time. You can define these windows, and now you can use the same concept to tell us when you're okay with patching. So it's not going to happen on a random day. It's going to happen in the time interval that you tell us to. Fargate and managed instances really help you achieve better compliance faster because now you can point to AWS and say AWS takes care of pretty much everything but my application. To achieve compliance, you need to provide a much shorter list of evidence that you actually have met these compliance requirements.
Everything else is on AWS, so we have a comprehensive list of things that we provide as evidence that we actually meet the compliance requirements. We help your workloads achieve compliance faster. You can find the list of services and scope on our website, which is really comprehensive. Pretty much everything you need is there, including Amazon ECS.
Cost Optimization Strategies: Hardware Selection, Purchasing Options, and Scaling
Compliance is a great topic, but now let's talk about something that is very dear to my heart: how to save money. We're going to discuss cost optimization strategies, which involve four key things. First, choosing the right hardware. Second, choosing the purchasing options that work best for your workloads. Third, figuring out how to scale and optimize your workloads. And fourth, observing and ensuring that all of this actually works to your benefit.
Starting with hardware, we offer different CPU options today: Intel, AMD, and Graviton. With Fargate, you have control over whether you want x86-64 or Arm (that is, Graviton). However, you cannot choose whether your tasks land on Intel or AMD. What if your application takes a dependency on AVX-512, for example, instruction sets that work really well on Intel? Or what if your application benefits from the very strong single-core performance of AMD CPUs? That's where managed instances come into play. You can choose the CPU type and the exact instance types that are best for your compute. If you run time-bound jobs, finishing them faster means you pay less. It's an excellent way to choose the right CPU for your workload and pay less, or just get the latest Graviton generation with its phenomenal price-to-performance ratio.
Switching to purchasing options, on-demand is the default and excellent for spiky workloads, though it's the most expensive. Savings Plans work well when you have predictable workloads. If you can commit to spending a certain amount of money on AWS over one year or three years, you can get significant discounts. If you have predictable workloads, please use Savings Plans; it's a very powerful tool in your arsenal. On top of that, there's Spot. Spot is a great way to save if you have fault-tolerant applications, easily offering up to 80% savings on compute. The caveat is that we can reclaim this capacity at any time, so this is really for fault-tolerant applications. Spot is not yet available with managed instances.
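One common pattern is to blend on-demand and Spot through a capacity provider strategy, keeping a guaranteed baseline while running the rest at Spot pricing. A sketch, with names and IDs hypothetical:

```python
import boto3

ecs = boto3.client("ecs")

# Keep 2 tasks on on-demand Fargate as a baseline, then place
# additional tasks on Fargate Spot 3:1 for fault-tolerant scale-out.
ecs.create_service(
    cluster="my-cluster",
    serviceName="batch-workers",
    taskDefinition="batch-workers:1",
    desiredCount=8,
    capacityProviderStrategy=[
        {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
        {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {"subnets": ["subnet-0123456789abcdef0"]}
    },
)
```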
Let's take a look at Fargate and how it helps you optimize costs. First, and not really obvious, you always get 100% utilization. We give you exactly what you asked for. You don't need to reason about bin packing your tasks onto an instance, or figure out what happens when a task goes away, what to do with the empty slot on the instance, or what else to place there. Another really important thing is that you can speed up task launches on Fargate with Seekable OCI (SOCI) lazy loading. The idea is that we fetch the container image and start the container before it's fully downloaded. It's an excellent feature, so please use it; it works best with large images. Of course, we also have Compute Optimizer now. It looks at your workloads and recommends right-sizing: if, for example, you run tasks at a very low level of CPU usage, it suggests going down a size because you don't need that much capacity, saving you money.
On managed instances, the situation is slightly different. Remember, there's a multiple-tasks bin-packing problem. What do we do with empty slots? Well, now we actively fill them. We look at your entire cluster, find underutilized instances, and compact your tasks to achieve better utilization. We shut down anything that is idle, which is a really powerful active optimization that runs all the time. You can also tell us that you don't want that, for example when particular tasks need to run for a long time and cannot be interrupted. We have a way for you to disable this feature completely, which is handy when you use something like Reserved Instances or On-Demand Capacity Reservations: you already paid for that hardware, so why would you want to stop those instances? Maybe you want to keep them running all the time so you can launch new tasks faster. There are different ways to think about it.
Since we have multiple tasks on a single instance, we can now use container image caching. Instead of downloading the container image from ECR every time, your tasks can grab it from that instance immediately. To wrap up, you need to think about when to use Fargate and when to use managed instances. We believe that Fargate is an excellent choice for the majority of workloads, but managed instances come in incredibly handy when you need things that just do not work on Fargate. eBPF is one example of that.
Another consideration is cost when you have short-lived tasks on Fargate. That is not efficient: you're spending too much money because you're paying for the time the image is being downloaded from ECR. Managed instances, as I said, offer container image caching and the ability to reuse instances for new tasks, which helps you save money. For anything that runs under two minutes, I'm confident it's going to be far more cost-optimal on managed instances, so please consider that as an option. If you have Reserved Instances, Savings Plans, or Capacity Reservations, managed instances are a simple way to use that capacity and save money.
QRT's Journey: Building a Scalable Cloud-Native Platform with Fargate
With that, we're coming to the most exciting part of this presentation. Ruben, please come on stage and chat about QRT's journey. Hello. I hope you're getting the best of the first day of re:Invent. My name is Ruben, and before diving into the journey and the technical talk, I would like to introduce myself a little bit. The first thing you need to know about me is that I'm Italian, and that's important for one reason. I have a challenge with myself today to try to deliver at least half of the volume of information I can convey with my hands by using words. So you're going to let me know if that works at the end of the presentation.
The second element is that I would like to talk about my previous life before joining QRT. I was actually a rocket scientist. I have a PhD in applied mathematics and I was simulating rocket engines. I must admit I don't regret the choice to switch, but sometimes I miss the random explosions in the lab. After introducing myself, I would like to talk about QRT. Qube Research and Technologies is a global systematic asset manager. Let's dissect that a little bit in the right order, starting from the end. Asset manager means we get money from our investors and we put it on the market to make more money with it. By systematic, what we mean is that the vast majority of our investment strategies are data-driven and automatic. We have a lot of algorithms which trade on the market on a daily basis in an automated way. By global, we mean that QRT is operating across the globe and across all types of assets: equities, futures, crypto, you name it.
The mission of QRT is to deliver results for our investors in all market conditions. It's easy to understand: when the market is bullish and things go well, we want to do better, and when things are less favorable, we try to mitigate that. Now we're good with the introductions. We want to dive into the nerdy talk.
When I joined QRT almost four years ago, there was already an existing solution in place that worked more or less like this. We have researchers—basically the bright people who come up with ideas for investment strategies. Most of them, especially in recent times, are accustomed to doing their research with Python because of the rich data science and scientific computing ecosystem. They would do their research with Python and then pair up with a quantitative developer who would rewrite the research in another language, compile that artifact, and deploy it on existing on-premises infrastructure, sharing resources with other ideas. That worked quite well. QRT has been a successful enterprise, but when I joined, we were tasked with another objective: to build something that is natively scalable and cloud native.
So we said, okay, maybe we can skip the detour. Let's keep what works. Researchers really like to work in Python, so let's keep them working in Python. What we decided to do is containerize our Python workloads. Now we have our container, and we had to decide which orchestration system we wanted to use. I guess the fact that I'm on this stage is probably a spoiler, but we ended up using Fargate and ECS. Why? I will try to explain to you why we chose Fargate as a capacity provider and ECS as an orchestrator.
First of all, at that time, we were a very small team. We were four people when I joined, and we wanted to focus on the business. The team was not particularly expert in cloud computing, but we had expertise in the financial part and the standard engineering part, so we wanted to focus on our core business. Fargate gave us that serverless vision which allowed us to focus on our core business. We tend to build things within the team, keeping the intellectual impedance at the lowest we can, and Fargate really helps with that.
At the time when we started our endeavor, we didn't know whether it would be successful. So we were looking for something that could help us start small but scale when we needed it. We needed the capacity to keep up with our growth. Last but not least, the financial industry is a very regulated industry, so we needed a compliant security posture by default. The fact that in Fargate everything is isolated and encrypted, and you cannot access the underlying machine, really played well for us. These were the design principles of our choice.
Now let me explain how things work in more detail. Some schematic here, just bear with me. In our system we have two types of services. One type of service is called a quant computational unit, or QCU, which you see here on the bottom of the slide. Then we have another type of service, which is a server that is marked as a logical server. We also call it a head node. The role of the head node is important because the QCUs cannot talk directly to the database. As you see in the top of the slide, we have our underlying firmwide database with all the data. The QCUs are designed not to talk directly to the database for performance reasons.
So the role of the head node is twofold. One is to aggregate requests. Imagine one of your units needs Amazon stock prices for the last ten years, and another one needs it for the last five years. Basically, the head node queries the database for just ten years and then does the slicing locally. The second role, as you can imagine, is caching. We cache some of the requests in-memory. I mentioned that this is a logical server. Why? Because in reality, our head node deployment is composed of several ECS services, and in the bigger deployments, several hundreds or even several thousands. In this slide, each square is an ECS service.
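As a purely illustrative sketch (not QRT's actual code), the aggregation-plus-caching role of a head node can be pictured like this: fetch the widest requested range once, cache it, and answer narrower requests by slicing locally. The `Row` shape and `db.query` client API are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Row:
    date: date
    price: float

class HeadNodeCache:
    """Illustrative only: one superset query per symbol, local slicing."""

    def __init__(self, db):
        self.db = db          # stand-in for the firmwide database client
        self.cache = {}       # symbol -> (start, end, rows)

    def get_prices(self, symbol, start, end):
        hit = self.cache.get(symbol)
        if hit is None or start < hit[0] or end > hit[1]:
            # Widen the window to cover the cached range plus the new
            # request, then hit the database once for the superset.
            lo = min(start, hit[0]) if hit else start
            hi = max(end, hit[1]) if hit else end
            rows = self.db.query(symbol, lo, hi)  # hypothetical client API
            self.cache[symbol] = (lo, hi, rows)
        # Answer narrower requests by slicing the cached superset locally.
        return [r for r in self.cache[symbol][2] if start <= r.date <= end]
```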
An ECS service in our deployment has one particular characteristic: each service has only one task. This is somewhat unconventional compared to AWS best practices. We chose this approach for simplicity. If you have two units publishing almost the same thing, you need to decide which one is correct. We decided to accept a slight loss of resiliency for the sake of simplicity, so we have just one unit publishing.
Another important element is that everything communicates using the gRPC protocol. The QCU talks gRPC to the head node, which talks gRPC to the shards, and the shards talk gRPC with the database. The shards exist because you cannot host all the data you need in a single service. The shards split this data based on several criteria to enable a distributed deployment. When the QCU finishes processing what it needs to elaborate, it publishes the results of its computation back into our database.
Service discovery is another important element. In our experience with other orchestrating systems, we realized that scaling DNS correctly is quite challenging. We decided to use AWS Cloud Map, which is serverless. When we deploy the head nodes, we register a DNS entry in Cloud Map and it works seamlessly.
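Consumers can resolve the registered head nodes either over DNS or through the Cloud Map API. A sketch of API-based discovery with boto3 (namespace and service names are hypothetical):

```python
import boto3

sd = boto3.client("servicediscovery")

# API-based discovery: look up healthy head-node instances directly,
# without going through DNS.
resp = sd.discover_instances(
    NamespaceName="trading.local",            # hypothetical namespace
    ServiceName="head-node-shard-a",          # hypothetical service
    HealthStatus="HEALTHY",
)
endpoints = [
    (i["Attributes"]["AWS_INSTANCE_IPV4"], i["Attributes"].get("AWS_INSTANCE_PORT"))
    for i in resp["Instances"]
]
```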
How did it go? The fact that I am on this stage is probably a spoiler, but it went very well. I would like to give you the reality of things at the scale we operate. Unfortunately, our compliance colleagues keep the exact figures in the shadows, so I cannot tell you the precise numbers. What I can tell you is that we achieved 100x growth, which means everything and nothing. We have deployed several tens of thousands of services currently running, where several tens of thousands is on the higher range. Based on the architecture I presented, you begin facing challenges when you scale. For example, we encountered quotas on the number of services per cluster and the number of CPUs per account.
Scaling to Tens of Thousands of Services: Multi-Account Architecture and Future Growth
We actually had to evolve our deployment. We worked with AWS through an engagement process that resulted in an evolution of the architecture I presented. We started by putting head nodes in dedicated accounts. We have some accounts for head nodes and some accounts for the QCUs. All these accounts are within the same VPC, which is stretched across all the accounts.
There are a few caveats here. When you do this, you need to first define the VPC in a single account, and then you can stretch it across the other accounts. The most important caveat is that you can register your DNS-based service discovery only in the account where you initially created the VPC. In our case, we created the VPC in the head node accounts, because the head nodes are what we want to discover and the QCUs need to discover them. Then we stretched it around. If you want to use API-based service discovery instead of DNS query-based discovery, that works across the VPC without any problem.
The second important element, and this is really the secret sauce to becoming a successful asset manager, is the naming system for the clusters. As you can see, we are using flowers. It is really the secret sauce for our deployment. I recommend you do that. It is great and fun.
We grew from a few services in a single account and a single cluster, to a single account with multiple clusters (which I did not show), and now we are multi-account, multi-cluster. You can see how this becomes challenging to control. You really need the right observability posture, and you really want your system to ping you instead of you polling the system, which is much more scalable.
When you approach observability, at least in our experience with ECS, you have several options. The most documented one is spinning up a sidecar on your services with FireLens and then scraping your telemetry. However, during our engagement with AWS, we decided to go with a different approach, which is a bit more exotic, but it works for us, and I will present it to you now. We use a single VPC with multiple accounts, some for head nodes and some for QCUs. Those accounts publish their logs serverlessly to CloudWatch Logs and, with Container Insights turned on, their metrics to CloudWatch metrics. Each account does that. What we did is connect what are called log subscriptions and metric streams. These are serverless mechanisms that react when a new log or a new metric appears in CloudWatch in any account, and you can connect them to Firehose, Kinesis, or a Lambda. We centralize these in a single account and use them to send our telemetry to our third-party observability vendor. That covers infrastructure-related telemetry.
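A sketch of wiring one account's logs and metrics into a central Firehose delivery stream (ARNs, names, and log groups are hypothetical; the cross-account destination setup is omitted):

```python
import boto3

logs = boto3.client("logs")
cw = boto3.client("cloudwatch")

# Forward every log event from an ECS log group to Firehose.
logs.put_subscription_filter(
    logGroupName="/ecs/head-node",            # hypothetical log group
    filterName="to-observability-vendor",
    filterPattern="",                         # empty pattern = all events
    destinationArn="arn:aws:firehose:us-east-1:111111111111:deliverystream/telemetry",
    roleArn="arn:aws:iam::111111111111:role/cwlogs-to-firehose",
)

# Stream CloudWatch metrics continuously to the same Firehose.
cw.put_metric_stream(
    Name="ecs-metrics-stream",
    FirehoseArn="arn:aws:firehose:us-east-1:111111111111:deliverystream/telemetry",
    RoleArn="arn:aws:iam::111111111111:role/metricstream-to-firehose",
    OutputFormat="json",
)
```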
For application telemetry, we deployed an OpenTelemetry Collector within the VPC we just discussed, spanning across accounts, in a dedicated cluster. This OpenTelemetry Collector is configured to auto scale. What is a bit different from standard practice, at least in our experience, is that normally you configure collectors to scrape your services; in our case, our services push to the collector. Why that approach? No sidecars means you don't have the problem of monitoring the monitor, you don't have to decide how much compute to allocate for the sidecar, and you have a single point of entry to control everything, which looked easier to us at the scale we operate at. The collector then pushes through a PrivateLink connection back to our observability vendor, which holds everything: logs, metrics, and traces if you are using them.
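On the service side, push-based export is just the standard OTLP exporter pointed at the central collector. A minimal Python sketch, assuming the `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-grpc` packages; the collector endpoint and service name are hypothetical:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Each QCU pushes its spans to the shared collector over gRPC --
# no sidecar, one central entry point.
provider = TracerProvider(resource=Resource.create({"service.name": "qcu"}))
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="otel-collector.internal:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("compute-signal"):
    pass  # the unit's actual work would run here
```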
We want more, and we are still growing. Right now we are at a scale where we cannot afford to just grow without thinking, so we need a sustainable growth approach. There is a lot of appetite in the market, and by extrapolation in the firm, to leverage hardware accelerators: GPUs, but not only GPUs. As you might know, when you use Fargate, you are limited to 120 gigabytes as the maximum memory you can choose. Some applications on our side require more than that, so we are looking at alternatives. The alternative, spoiler alert, is ECS Managed Instances, which we are already testing; it really looks like the sweet spot where we can adopt all these elements and make our business and our deployment grow.
Orthogonal to this, we want to keep the resiliency posture we already have. We increased it by going multi-account, and now we are starting to think about going multi-region, because right now all the accounts were initialized in a single region. So we have started to think about how to take the VPC unit with all the accounts and bring it up in another region, giving us a multi-region deployment. This will give us more flexibility for compute, but also a higher resiliency posture. That was it for me. Thank you for your attention, and back to Alex.
This article is entirely auto-generated using Amazon Bedrock.