Kazuya

AWS re:Invent 2025 - From monolith to microservices: Migrate and modernize with Amazon EKS (CNS210)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - From monolith to microservices: Migrate and modernize with Amazon EKS (CNS210)

In this video, Nirmal Mehta and Isaac Mosquera demonstrate migrating from monolithic applications to microservices architecture using Amazon EKS. They explain the strangler fig pattern for gradual decomposition, introduce containers and Kubernetes for managing microservices at scale, and showcase EKS Auto Mode with Karpenter for automated infrastructure management. The session covers multi-tenant architecture implementation using namespace isolation, network policies, AWS CNI, resource quotas, and EKS Pod Identity for secure AWS resource access. They detail compliance and auditing through CloudTrail, Open Policy Agent for policy-as-code enforcement, and introduce AWS Controllers for Kubernetes (ACK) for managing AWS resources declaratively. The presentation concludes with EKS Hybrid Nodes for on-premises integration and announces the newly launched EKS capabilities feature for managing controllers like Argo, ACK, and KRO, enabling GitOps-based control plane automation for scalable multi-tenant SaaS architectures.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

The Monolith Challenge: When Growth Demands Change

Good morning. How is everyone doing today? Great. I'm Nirmal Mehta, a Principal Specialist Solutions Architect and Containers Tech Lead at AWS. I'm joined today by my colleague Isaac Mosquera, and we're going to talk about migrating from monolith to microservices. Who here is on this journey? We've talked to a few people here already. Awesome. You're in the right session. So let's get started with something that might look familiar to you if you squint.

Thumbnail 0

Thumbnail 10

Let's get started with a real scenario. Imagine that you're an ISV with a successful monolithic application in your own data center. Does that resonate with you? Something similar to what you're working on right now? You've grown from something like four to ten developers to maybe four hundred developers, and things are evolving naturally. Weekend scaling is manual. The CPUs are idle during off-peak hours, and you have to coordinate deployments with all these other teams, going through the build, test, and release cycle as a unit. Does that resonate with some of you? I'm seeing some head nods.

Thumbnail 40

That means if one developer's QA test fails, the whole thing stops. You have to fix that piece and go through the entire process again with that monolith. This is our starting point, and often after this slide you'll see something that says "Why not microservices?" Well, at first, monoliths have a lot of pros. They're simple and not overengineered. You can understand the complexity of that monolithic application. It has lower operational complexity and resource efficiency at small scale.

Thumbnail 110

But all of a sudden, your CTO or CEO comes and asks, "Where's the AI?" I've heard that from a few folks. This is what our customers are coming to us saying. They've got this monolithic application, but all of a sudden there's competitive pressure, and they can't integrate AI into the application. They're being asked to do it, and they just don't move fast enough. This is where the monolith breaks down, and we start talking to our customers about adopting another architectural principle such as microservices.

Thumbnail 140

Breaking Down the Monolith: The Promise of Microservices and the Strangler Fig Pattern

So what is the promise of microservices? It's the ability to be more agile, flexible, and innovate faster. You have a little bit more autonomy so development teams can make their own choices about programming languages, databases, and APIs. The functional components become more scoped down, and at the same time, productivity can go up because of this architecture. That's the promise of these loosely coupled microservices.

Thumbnail 180

Thumbnail 230

Thumbnail 240

So what do those look like? Microservices solve the limitations of monolithic applications through functional isolation. You take that one application and start to break it down into many functional services. Each of those functional services is independently built and can scale independently of other components. Those functional independent components are organized around individual teams. Now you have a team that takes ownership of that functional service and another team that takes ownership of the next functional service. They have the ability to go at their own pace and pick their own technologies if it's a well-functioning organization. You can organize your team structure to best support the different services and their needs.

That means you can also harness individual skills that are focused and really good at specific types of services and not have them scattered across the monolithic application necessarily. On top of that, microservices play really well with elastic infrastructure, which is what we're all about here at an AWS conference. Once you get more of these functional services, you can start to take advantage of that elasticity.

Thumbnail 330

You can also start to reuse those patterns and have some standardization about what those services look like, so those API contracts between them are cohesive, comprehensive, and teams are able to communicate better. A microservice is a collection of APIs, compute, and storage for that specific service.

So you're buying into microservices. We're at the early stage of this journey that we're going to go on for the next 54 minutes. I love microservices. How do I get there? There are lots of different ways to start tackling your monolithic application and breaking it down into microservices. One pattern that we see a lot of our customers use is the strangler fig pattern.

Thumbnail 400

You don't want to just jump in and start hacking away at your monolithic application. You want to be prudent, understand the risks, and be pointed about what you want to break out first. The strangler fig pattern lets you take that monolithic application, identify the services you want to break apart, and pull them out while leaving the rest of the monolith alone. You do it over time at your own pace, and as services come online and you get more familiar with the microservices architecture, you can increase that pace and break down the monolith over time.

Thumbnail 410

Thumbnail 420

From EC2 Chaos to Container Orchestration: Why Amazon EKS Matters

Let's see what our new architecture looks like. We've been doing this for a little bit, and we've been breaking down our monolithic application. We still have our web front end, we have our identity service, and some shared database services on our corporate data center, but we've now started breaking out microservices onto EC2 instances. We have payment service, metric service, billing service, and all these other services as EC2 instances in AWS Cloud.

Thumbnail 450

Things are going great, but we just have three services here, and over time you're going to get a web of microservices. All of a sudden, you've got hundreds of EC2 instances all with different configurations, all with different deployment pipelines, and no standardization. You're starting to lose control over the complexity. How do I manage that? As a specialist solution architect focused on containers, that's where containers come in.

Containers came onto the stage to address exactly this issue. The issue is that it works on my developer laptop in my microservices team, but when we throw it over the fence and put it into an EC2 instance, the environment is not the same. There's no service discovery, there's no isolation, resources are all over the place, and it's difficult to comprehend what these different teams need and how to orchestrate and manage all these services. Containers allow you to have a consistent environment and consistent dependencies within the application definition. You create a container image that's owned by that team or multiple container images owned by that microservices team, and you get reduced operational overhead. You have a consistent environment from the developer's environment to production, and it increases the speed and ease of testing and iterating on those microservices.

Thumbnail 470

Thumbnail 570

But now you have a bunch of containers, and it seems like you have the same kind of problem again. Instead of hundreds of EC2 instances, you have thousands of containers. You still need to figure out service discovery, you need to understand load balancing, and you need to figure out the platform capabilities that you need to operate thousands or hundreds of thousands of containers in production on AWS. That's where Kubernetes came onto the scene, primarily to orchestrate and manage containers at scale.

The Kubernetes open source project creates a control plane that schedules and orchestrates multiple containers at scale. There are other sessions at the deeper end of the spectrum in our containers track. There's an INV 500 session and an EKS under the hood session which go very deep into the massive scale that we can achieve today with Kubernetes. But initially, this is what it was for: managing all these containers at scale. It provides out-of-the-box service discovery, load balancing, auto-scaling capabilities, and rolling deployments.

Thumbnail 620

However, managing open source Kubernetes is complex to get all those benefits. The Kubernetes control plane has significant complexity to it, and much of that complexity is undifferentiated heavy lifting that you shouldn't need to do. It's maintaining that control plane, and nowadays with EKS Auto Mode, which Isaac will go into more detail later in this presentation, you can manage the nodes and the compute as well. But I'm getting a little ahead of myself. If you've used open source Kubernetes, managing the control plane can get complicated and complex very quickly. That's where Amazon EKS comes in.

Thumbnail 660

Thumbnail 710

Amazon Elastic Kubernetes Service is our managed Kubernetes service. We help you focus on building, running, and scaling your workloads on a production-ready Kubernetes control plane and cluster. With Amazon EKS, you can accelerate innovation because you don't have to worry about the control plane or the complexity of managing the Kubernetes platform. You can optimize cost and performance, especially with tools like Karpenter and all the goodness in the open source community. You can enhance availability, scalability, and reliability with all the well-architected best practices from AWS built into Amazon EKS. And then on top of that, you can run Kubernetes with Amazon EKS in any environment, anywhere.

The key features are that it runs in any environment, you get a fully managed cluster including automatic updates with Auto Mode, and with Amazon EKS you get automatic updates of the control plane. You have native AWS integration, so thinking about database services, IAM, and security groups. The list goes on, and Isaac will go through it in more detail. It's Kubernetes compliant, so you get all the goodness of the Cloud Native Compute Foundation open source ecosystem that's built around the Kubernetes open source project.

Thumbnail 760

Bridging Cloud and Data Center with Amazon EKS Hybrid Nodes

Let's put it all back together and go back to our environment. We bought into microservices and we've been breaking down the monolith. Now we have an Amazon EKS cluster with all of our microservices managed on managed instances. But there's one more thing: we still have those resources in the data center. Maybe there are some services still in your data center that you're just not going to be able to move anytime soon. Maybe they are related to data availability or compliance or some other regulation that you need to meet. That's where Amazon EKS Hybrid Nodes comes into play.

Thumbnail 790

Thumbnail 820

This is a feature we launched last year at re:Invent. It allows you as a customer to hook up an Amazon EKS cluster to your VMware nodes or other on-premises nodes in your data center and take advantage of the Amazon EKS goodness in the cloud and the compute resources and other resources that you already purchased and have available in your data center. Let's take Hybrid Nodes and put it into our architecture. Now we have an awesome single-tenant microservices-based Amazon EKS Auto Mode environment with Hybrid Nodes where you have your microservices running on managed instances and you have your on-premises microservices still running in your data center using Hybrid Nodes and that connectivity.

Thumbnail 850

This was a really quick journey, but we're now in the microservices world. Things start to get better, right? You're taking advantage of the microservices architecture. You've got this well-oiled platform and development teams are accelerating. You're starting to make progress towards innovating and being more competitive with your competitors. And things start to grow. With that awesomeness in your application, you're getting more customers. More customers is always great, right? But with that growth, you start to have to manage all the operations of managing multiple customers in that single-tenant kind of architecture. So are you going to spin up a new cluster for each customer? Probably. Isaac, can you help me figure out how to make sure that I'm not losing my head in operational efficiency with all these customers that I'm seeing on my new microservices architecture? Absolutely, absolutely.

Thumbnail 920

The Multi-Tenant Challenge: Moving Beyond Single-Tenant Clusters

Can you give me a thumbs up if you can hear me? Alright, good. This is my first silent session ever, so this is great. When Nirmal got us to this point, we were breaking down this monolith into microservices. We now have a tenant on a single cluster, and inside of that tenant we have a bunch of individual microservices running. As Nirmal was explaining, we started to gain success. We're moving faster and shipping features faster because we broke down this monolith and allowed our teams to run independently.

Thumbnail 940

Thumbnail 950

Soon enough, we get some more customers, and we get a few more customers, and then we get a few more customers. Now we are running individual clusters with a single tenant on each cluster to make sure we have that isolation. The sales teams are excited because every time they're selling these deals, they're cheering it on. But you who have to manage these pets are just thinking about what you're going to have to deal with at 2 a.m., waking up to manage individual clusters for individual tenants.

From a compliance and security point of view, this looks very advantageous. They're going to love this—a whole bunch of isolation. Except that your CFO is not going to love it because it's extremely inefficient. You, as the people who have to manage these pets, also have to deal with it every single time we sell a new deal. It's going to be more work for you. But you have to ask yourself, why now? Why are you all sitting here? Why am I talking about this now? This isn't that new.

What has changed over the last five to six years is that we moved into the remote-first economy. Everybody is working, or at least partially working from home. Our expectations as consumers have also changed. When we go online and sign up for Netflix, we don't wait weeks for them to give us an account. As we consume more cloud services, we expect that all to be instantaneous. Your customers are also expecting this. The competitive landscape has also changed. Your boards and investors are all expecting SaaS returns in multiples—the ability to provision your tenants instantaneously so that you can start billing instantaneously.

Thumbnail 1000

Thumbnail 1070

Lastly, cloud-native technology now is extremely mature. Nobody thinks of it as a risk. It is an extremely safe bet. In fact, it's probably the opposite. If you're going into the data center, someone's going to ask you why we're moving back into the data center. So how do we get to a multi-tenant architecture? Something that allows us to scale, something that allows us to not scream every time the sales team makes a new deal.

While this is a relatively simple architecture diagram, there is a lot of complexity that we'll get into in the next few slides. On the left-hand side, we have our control plane, just like Kubernetes has a control plane. We have to think about building our own control plane as well for ISV or SaaS-oriented businesses. We have onboarding, identity, metrics, and tenant provisioning all happening from a centralized place. We have to move away from pipelines and being able to just provision customers from a pipeline, which is automated-ish—somebody still has to push that button and coordinate a bunch of different pipelines in order to get a tenant up and running.

Instead, we move into something that's more automated using services and APIs to get that going. On the right-hand side, we have our application plane. You'll notice we're moving into more multi-tenancy. Of course, there are contracts and certain compliance requirements. There are certain things that a security team just won't allow us to do, so we will still have a single tenant on a single cluster. I don't think we'll ever really get away from those types of requirements. But what we want to do for ourselves and for the business is make sure that we are using infrastructure efficiently and that our lives are good—that we wake up and are able to automate things and are not just rushing around taking care of individual clusters.

Thumbnail 1170

Thumbnail 1180

Building Secure Multi-Tenant Infrastructure: Network Isolation, Resource Management, and Auto Mode

So how do we get here? How do we get into this particular architecture? That's what we're going to get into now. So we have all of our clusters. The first thing that we want to do is create namespace-based isolation. It helps us reduce cluster count and creates a logical boundary around tenants. It centralizes some of our operations purely by using namespaces. As we use namespaces, we'll be able to reduce the number of clusters that we use, and some tenants will be able to share the same cluster.

Thumbnail 1200

From a CTO or engineer perspective you're screaming yes, and it looks really pretty on an architecture diagram.

Thumbnail 1230

However, what nobody tells you is that all Kubernetes networking by default is a flat network architecture. Pods and namespaces kind of mean nothing at this level. We haven't done anything to secure it down, so a pod from tenant 1 can reach out to a pod in tenant 2, and there's nothing to block it.

Thumbnail 1240

Thumbnail 1250

Thumbnail 1280

So what do we do? We have our cluster. We have our namespaces, and this is what it could look like if we don't lock it down. Somebody from a basic tier namespace can reach out to tenant 1 and tenant 2, and we want to be able to block that. We do so by building digital fences, right? Building network policies. Now the policy that you're looking at here says that a pod in tenant 1 with a role of analytics or a role of subscriber can reach out and talk to the backend namespace. By doing so, we start creating these digital fences, limiting the amount of network flow that goes through our systems, our pods, and our namespaces. We've created our first little digital fence.
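As a rough sketch, a policy like the one described above could look as follows; the namespace and label names (tenant-1, backend, role: analytics/subscriber) are assumptions that mirror the description rather than the exact slide:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-tenant-1-roles-to-backend
  namespace: backend                 # the policy protects pods in the backend namespace
spec:
  podSelector: {}                    # applies to every pod in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        # only tenant 1 pods labeled with an allowed role may connect
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: tenant-1
          podSelector:
            matchExpressions:
              - key: role
                operator: In
                values: ["analytics", "subscriber"]
```

Keep in mind that network policies only take effect when the cluster's networking layer enforces them; on EKS that typically means the VPC CNI's network policy support or an add-on such as Calico.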

Thumbnail 1300

Thumbnail 1340

The next thing that we're going to want to do is create gateways or ingresses and make sure that we have path-based routing to our tenants. We don't want pods to just communicate with each other directly. We want to make sure that we are controlling it through a gateway. And last, if you're not already using the AWS VPC CNI, we're going to enable that as well. We're going to make sure that the AWS VPC CNI is installed, and what that allows us to do is create ENIs for each individual pod. What this means is that you're able to use the security groups and all of the networking policies that you have for the rest of your AWS constructs or resources, and we can treat pods in the same way. So we have secure networking, and we can now start enforcing security groups as well.
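To sketch the path-based routing idea, here is roughly what a per-tenant Ingress could look like, assuming the AWS Load Balancer Controller is installed; the annotation set, service name, and the idea of sharing one ALB across tenants via an ingress group are illustrative assumptions, not the manifest from the slide:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tenant-1-routes
  namespace: tenant-1
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/group.name: tenants    # share one ALB across tenant Ingresses
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /tenant-1          # path-based routing into this tenant
            pathType: Prefix
            backend:
              service:
                name: tenant-1-frontend
                port:
                  number: 80
```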

Thumbnail 1400

In doing so, we've now created a full digital fence. We've turned this open house into a gated community, and now we're able to control which tenants can communicate with other tenants. Everything looks pretty good up to now. However, we still have to deal with the noisy neighbor problem, which I'm sure you've all had to deal with at some point or at least heard of it. It sounds pretty scary at first when you say we're going to put a whole bunch of tenants on the same cluster. There are ways to help us deal with this. Tenant one can potentially have a runaway workload and consume all of the CPU inside of pool one. Because you have a lot of different tenants running inside of that pool, they may consume all of the memory on this box, leaving tenant 1 starved.

Thumbnail 1410

The way that we normally deal with that sometimes is we overprovision, but overprovisioning is going to make the CFO very sad because that's where those bills start to mount up purely because we want to make sure nothing bad happens. So there has to be a balance between overprovisioning and making sure we have the right resources at the right time. The first thing that we're going to do is install resource quotas, which are namespace-based. If you think about a namespace sort of like an apartment, the resource quotas are all the things that this apartment could possibly use. What are the resources that it can consume? Here, pretty simply, we're saying this namespace or this apartment can only use 8 CPUs, 16 gigabytes of memory, and have 20 pods running at most.
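A ResourceQuota along those lines could look like this; the CPU, memory, and pod numbers come from the example above, while the choice to constrain requests (rather than limits, or both) is an assumption:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-1-quota
  namespace: tenant-1
spec:
  hard:
    requests.cpu: "8"        # at most 8 CPUs requested across the namespace
    requests.memory: 16Gi    # at most 16 GiB of memory requested
    pods: "20"               # at most 20 pods running
```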

Thumbnail 1440

The next thing that we're going to want to look at to make sure that we have the appropriate resources running on our machine is to make sure that we install limit ranges. In this apartment analogy, if resource quotas are about the entire apartment, pods are like individual rooms, and we can set limit ranges for these individual rooms to make sure that each individual pod does not starve out the machine. Here we have a max CPU of 2 and then memory limitation as well. It allows us to apply it to a container type. So we have this figured out. We have our limit ranges, we have our resource quotas, and we're able to make sure that one individual tenant isn't starving out the other tenants.
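A LimitRange for those per-container bounds might look roughly like this; the max CPU of 2 is from the talk, while the memory value and the defaults are illustrative assumptions:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: tenant-1-limits
  namespace: tenant-1
spec:
  limits:
    - type: Container        # bounds apply to each individual container
      max:
        cpu: "2"
        memory: 4Gi
      default:               # limit injected when a container specifies none
        cpu: 500m
        memory: 512Mi
      defaultRequest:        # request injected when a container specifies none
        cpu: 250m
        memory: 256Mi
```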

We're still managing these, and we still have to manage the EC2 nodes. That doesn't go away. We still have to provide infrastructure. We are limiting the tenant, but what if the tenant needs more? What if they actually need more? We'll have to provision that ourselves. The way that most of our customers start is they start with managed node groups, and that works up to a certain point. But the challenge with managed node groups is that you have to specify a particular instance type.

As you start to grow, you have to make sure that you're still doing capacity planning and estimating what each tenant is going to use in the future. What if you need different types of instances? What if you have different types of workloads running on those machines? You have to create more and more managed node groups. As you get more tenants and need to continue having this type of isolation, you'll need even more managed node groups. Now we've gone from managing a bunch of clusters to managing a bunch of tenant node groups.

Thumbnail 1540

The way we solve that is through Auto Mode. How many of you have heard of Auto Mode? Some of you. How many of you are using Auto Mode? Good. What Auto Mode allows us to do is have a Kubernetes-centric view of managing our infrastructure and our nodes under the covers while giving us flexibility. We can now write a manifest that gives us flexibility in the types of infrastructure that we use. It automatically scales because it is Kubernetes native. We can now scale based on workload types by looking at what's happening inside that cluster and the metrics, and scaling on those metrics. All of this is part of EKS right now as a built-in feature that we'll manage for you.

Thumbnail 1600

Let's look at what these manifests start to look like. Part of Auto Mode is managing Karpenter. Have you all heard of Karpenter? Good. Underneath the covers, part of it is Karpenter, and you can see on the left-hand side for tenant 1 we have two types of instances that we can use. Obviously we are limited to two here, but you could put other values in there as well and mix and match other instance types. For tenant 1, because they signed a particular contract with us and we have a particular SLA with them, we're going to say we only want spot instances for that tenant. If the availability is there, great; if not, that's part of their contract. We can also limit the amount of CPU and memory.

Thumbnail 1680

Thumbnail 1690

On the right-hand side, we can have bigger instances because they signed a bigger contract and we have a different SLA with them. They need more resources. As you can see in the values there, we have spot and on-demand because their SLA is different and requires us to provide resources to them. Auto Mode will provision these resources for us, and as our tenants grow or shrink or need more resources, Auto Mode will take care of that for you. It may put Tenant 1 on an entirely separate node, isolated completely from everybody else.
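To give a feel for those manifests, here is a hedged sketch of two Karpenter NodePools along the lines described; the instance types, limits, and the Auto Mode NodeClass reference are assumptions and will differ from the exact slide:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: tenant-1
spec:
  template:
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.large", "m5.xlarge"]       # limited to two instance types
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]                        # tenant 1's SLA allows spot-only capacity
      nodeClassRef:
        group: eks.amazonaws.com                  # EKS Auto Mode's built-in NodeClass
        kind: NodeClass
        name: default
  limits:
    cpu: "64"
    memory: 128Gi
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: tenant-2
spec:
  template:
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.2xlarge", "m5.4xlarge"]    # bigger instances for the bigger contract
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]           # stricter SLA, so fall back to on-demand
      nodeClassRef:
        group: eks.amazonaws.com
        kind: NodeClass
        name: default
  limits:
    cpu: "256"
    memory: 512Gi
```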

Securing AWS Resource Access with EKS Pod Identity

We've gone through networking and resource quotas, and we have Auto Mode running helping us scale up and down. But we still need access to AWS resources like S3, RDS, and DynamoDB. Nothing runs on its own inside of Amazon. It's always dependent on other resources. Customers start managing secrets and AWS credentials by storing them as environment variables, hard coding them directly into the container, or providing them as config maps or secrets. What we start to see is a proliferation of hard-coded AWS keys everywhere. Long-lasting AWS keys make it hard to manage compliance and security because ultimately what we want to do is rotate them frequently.

Thumbnail 1770

If we're doing this all manually, what happens next is we forget to clean those out or rotate them. Sometimes we see a lot of key leakage. If somebody escapes out of Tenant 1's pod, they're able to grab secrets and reach out to a different bucket. The way that we deal with this inside of EKS is called EKS Pod Identity. Pod Identity is AWS native IAM policies for pods themselves, sort of equivalent to the AWS CNI where we're applying security groups and AWS native constructs for networking. It's the same thing here for IAM. I want to make sure that you're using Pod Identity because it does allow you to continue to have this separation of duties, which is really important for SaaS companies and compliance. We're able to manage what these pods can access not in a Kubernetes native way but through AWS IAM roles and policies.

You can continue to use the exact same tooling and infrastructure you have set up to manage those policies and ensure they remain intact. We map those AWS roles to Kubernetes service accounts. What we're doing here is mapping an AWS role and policy to a service account. Additionally, this approach allows you to have auditability through the same tools you use to audit everything else inside of AWS and enables you to scale more efficiently through the teams and processes you're already using to manage identity across all of your other AWS resources.

Thumbnail 1840

Thumbnail 1860

Let's put one of these things together. It's a pretty simple policy. We're saying here that tenant one gets access to tenant one's S3 bucket with basic operations: get, put, delete, and list. The next thing we're going to do is create a service account inside of Kubernetes. We then map that service account to the pod, which allows the pod to have access to that bucket we defined earlier. As we deploy this into the Kubernetes cluster, we're following the exact same processes, security, and compliance policies we had before. Nothing has really changed there.
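As a rough sketch of the Kubernetes side of that, assuming a tenant-1 namespace and a hypothetical IAM role that carries the S3 policy just described:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: tenant-1-app            # the service account that will carry the IAM role
  namespace: tenant-1
---
apiVersion: v1
kind: Pod
metadata:
  name: billing-worker
  namespace: tenant-1
spec:
  serviceAccountName: tenant-1-app   # pods using this service account get the role's credentials
  containers:
    - name: app
      image: public.ecr.aws/amazonlinux/amazonlinux:2023
      command: ["sleep", "infinity"]
```

The association between that service account and the IAM role is created on the AWS side, for example with something like `aws eks create-pod-identity-association --cluster-name <cluster> --namespace tenant-1 --service-account tenant-1-app --role-arn <role-arn>`, with the EKS Pod Identity Agent add-on installed in the cluster; no keys appear in the manifest.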

Thumbnail 1910

What's great about this approach is that I didn't create an AWS key, put it into an environment variable, or store it in a secrets manager. I didn't have to do any of that. AWS handles that for you and automatically rotates the keys behind the scenes. By doing this, we ensure that tenant one does not have access to the wrong bucket.

Thumbnail 1920

Achieving Compliance and Automation: Audit Logs, Policy as Code, and Infrastructure Controllers

Now let's get a little deeper into the internals. We still need to get to that control plane and application plane architecture, but to do so, we need to dig deeper into Kubernetes and discuss auditability and why it's important. Let's go through the steps. I know the numbers are off, so we'll go through them anyway, and the numbers will be off in a couple of other slides as well, but we'll get through it. First, a DevOps engineer deploys into Kubernetes. Behind the scenes, it's just an API server. As we get to step three, it stores the configuration into etcd. Then the scheduler picks up the change and creates or puts the pod inside the tenant namespace in step five. Now the pod that was created can communicate out to RDS and S3.

Thumbnail 1990

When we get into compliance and auditing, it's about understanding what is happening inside your system. We want to understand what's happening at the Kubernetes level, who is making these changes, what changes were made, and what is communicating with what out into an AWS account. We want to make sure we understand what's happening inside these namespaces with these tenants and ensure we have auditable proof across the entire chain, which is pretty difficult to do when you think about all the pieces and changes happening across the system. We're going to cover how we deal with deployments, system access, auditable proof, and Open Policy Agent for compliance enforcement.

Thumbnail 2020

The first thing we can do is make sure we're auditing the right logs out of Kubernetes. We can control this in many different ways, but here you can see we have a request and response logged for any user or anything asking about pods. We're literally logging the request and the response, so you can limit it to the things you need to prove to your auditor to ensure you're doing the right things and can prove who's actually making these changes or querying the system.
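For reference, an audit rule like the one described looks roughly like this in open-source Kubernetes audit policy terms; on EKS you don't author this file yourself, you enable the audit log type in the cluster's control plane logging and read the entries from CloudWatch Logs:

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse      # log both the request body and the response
    resources:
      - group: ""               # core API group
        resources: ["pods"]
  - level: Metadata             # for everything else, record only who did what and when
```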

Thumbnail 2050

The next thing we're going to discuss is OPA, Open Policy Agent, and Kyverno, which is a very similar system for policy as code. We'll focus on OPA today. We're going to go through the same workflow. A DevOps engineer submits a manifest or makes a change into the API server. Behind the scenes, the request goes to the admission controller. The admission controller determines whether this is a valid manifest that can be submitted into the system from the right user at the right time and for the right tenant by looking at OPA rules. It goes out to OPA and asks: is this a valid manifest submitted by the right person at the right time? If the answer is no, it rejects it. These become auditable, provable, declarative rules that we can give to compliance officers to ensure we're doing the right things. Security loves this as well. If the answer is yes, it follows the same flow through steps four, five, six, and seven.

One other change, though, is number 8, which is monitoring this through CloudTrail. As we monitor what's happening at the API server, we can log all of the changes that are happening, from which users, and why. From OPA, we can see what was rejected and why, or approved and why. The rule sets inside of OPA determine whether those manifests are accepted or rejected. We now have provability across the entire system. This is what compliance and policy as code looks like.

Thumbnail 2150

Is anybody here using OPA or Kyverno for policy? Just a few? What this is saying is that if Isaac were to submit a manifest—a deployment object, let's just say, or a deployment manifest—it's going to look at the pod and check whether it has the label classification: PHI, protected health information. If it does, we want to make sure that it also has the rule to not run as root, because that's very sensitive private information. If it's not there, it's going to reject it.
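The talk shows this as an OPA rule; as a hedged sketch, an equivalent in Kyverno (the other policy engine mentioned above) is plain YAML and could look something like this, with the label selector and message wording as assumptions:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: phi-pods-must-not-run-as-root
spec:
  validationFailureAction: Enforce      # reject non-compliant manifests at admission time
  rules:
    - name: require-run-as-non-root
      match:
        any:
          - resources:
              kinds: ["Pod"]
              selector:
                matchLabels:
                  classification: PHI   # only pods handling protected health information
      validate:
        message: "Pods labeled classification=PHI must set runAsNonRoot: true."
        pattern:
          spec:
            securityContext:
              runAsNonRoot: true
```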

This helps us build trust into the system. It allows developers to move quicker because we know that, as the people who have set up the system, nobody can submit a manifest that will break our rules. Now this is a pretty simple rule, but you can imagine all the very complex rules that you can create. With tools like OPA and Kyverno, you can actually integrate this into other systems inside of your organization. You can verify all of that at deploy time so nothing gets into the system that shouldn't be in there.

Thumbnail 2240

So now what do we have? We have compliance and auditing. We've dealt with the noisy neighbor problem. We've dealt with networking. We now have logical namespaces, but what we find still is that tenant onboarding takes weeks, months, sometimes honestly years to onboard a particular tenant. We were talking to a few of you before we got started, and somebody had mentioned we're still going through approvals to just get access to something. This happens frequently and actually repetitively. If you need a new Kubernetes cluster, sometimes it'll take six weeks to months.

So what makes it take so long? As the organization grows, there's more and more individual teams and more specialization. Somebody's the landing zone team. They'll have the IAM policies. They'll create the accounts. If you ask them, "Is everything automated?" they'll say yes. Then you go to the networking team and you ask them, "Is everything automated?" Yes. You go through them and they'll do their thing. Then for me it always begs the question: if everybody is always so automated, why does it take so long? Shouldn't automation be instant?

What ends up happening is what kills us is the coordination between teams. When we say we're automated, what we really mean is my stuff is automated. But the way that you kick off that automation is through Jira tickets or ServiceNow tickets. That isn't automated. Somebody has to manually type that in. Somebody has to accept that ticket. Somebody has to actually execute the pipelines. It's all of this coordination that happens that kills us, and that's what takes a really long time for onboarding.

The other thing too is it lacks version control. Some of those things aren't as automated as we want them to be, and there are a lot of manual steps and a lot of kicking of the pipeline. Those become things like tribal knowledge. You might say, "First kick off this pipeline, then this one, but if you do it the other way around, it'll create havoc." When I tell someone this and then I leave the company, and then maybe that person leaves the company, where does that go? It's all in our heads. So it creates a lot of risk and just doesn't scale.

But there is a solution to this, and we see it inside of Kubernetes already. If you ever noticed inside of Kubernetes when you let somebody deploy a manifest, what's actually happening behind the scenes—we already talked about it. We talked about how something like auto mode can create EC2 nodes. We talked about networking automatically being there. We talked about everything being managed through manifest, creating ingresses. There's a lot of automation that happens already inside of Kubernetes that creates underlying infrastructure. I've been using Kubernetes for close to eleven or twelve years, and I remember when I first started using Kubernetes, nothing was automated. You wanted an EC2 instance, you had to create it yourself and attach it to the system yourself.

Thumbnail 2410

Thumbnail 2440

All that was managed were pods. But over a period of time, not just AWS but the community started creating more and more controllers. We now have controllers for ELBs, ALBs, things like Route 53, Secrets. We have managed node groups. I just told you about Auto Mode. We have the CNIs. Now we're creating more and more infrastructure. With the Cluster API, we can now create Kubernetes clusters with Kubernetes. When I first started with Kubernetes, I remember everybody saying this will never work. It will never be a generalized compute platform because it will never do things like stateful workloads. But look at us now. We have EBS volumes and many other ways to get attached storage. We're dealing with AI workloads, GPU AI workloads, databases all run on Kubernetes. And so it begs the question: why can't we manage more of our infrastructure in this way?

Thumbnail 2470

Thumbnail 2500

If we manage more of our infrastructure in this way, what do we get? The same deployment system, the same auditability, the same networking—everything becomes the same. What we see in our customer base is that we are moving from Kubernetes just as a container or application plane for our pods and turning it into a platform. Because ultimately this is what you need to do to scale your multi-tenancy. Luckily for us, there's a great Kubernetes open source community that has been creating controllers. There are thousands upon thousands of controllers. I haven't listed them all out here. On the lower left-hand side, you see it's not much effort to run some of these controllers. Some of them we will actually manage for you—inside of Auto Mode, we do. But we may not have full coverage. As you get to the right, as we get into custom controllers and the Operator SDK, it allows you to manage anything.

I had one customer ask me: can we manage satellites with this? Yes, you can manage anything with an API. And it wouldn't be so crazy to think, because I thought it was crazy too at first. I was like, why would you want to do that? Well, they said, sometimes we lose connectivity with satellites, and we want to be able to have that reconciliation loop. We have hundreds of satellites up in the sky, and they all have different APIs and different versions. We would like to abstract that out from the people who have to operate and manage all of the different satellites. And that clicked for me. I was like, you know, I'd never thought about that before, but conceptually it should work. And if you think about that and you just take satellites, the same analogy, and use that for Kubernetes clusters or AWS resources, then it all becomes the same.

Thumbnail 2590

The Control Plane Architecture: AWS Controllers for Kubernetes and the Path to GenAI Integration

You can do that because sometimes you do lose connectivity, sometimes we do have network partitions, sometimes we may have rate limits, which I'm sure you've all dealt with. I know I've dealt with my fair share. So today we're going to just focus on one, which is AWS Controllers for Kubernetes, called ACK. A quick show of hands—who's heard of this? A fair amount. OK. So what ACK is: it's an open source project, mostly run by AWS, that provides controllers for a lot of our resources. Our most popular AWS resources and services—RDS, DynamoDB, S3, IAM policies—are all there. This means you can start actually managing your AWS resources through Kubernetes in the same way.

Thumbnail 2620

This is what a pretty simple manifest looks like for S3. We have our tenant, tenant one, and we can create a bucket. So now when we're provisioning our tenant, I don't have to go to that other team. I don't have to go to another pipeline. I don't have to do all the other things. I'm just provisioning it along with my namespace and all of the other resources that tenant needs all at one go in the same pipeline. Now you can see how we can get to true automation because it's all going through the exact same thing. You can start setting tags, so you can still see where these resources are going. You can have public access blocks. You can see who this resource belongs to.
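A manifest along those lines, using the ACK S3 controller's Bucket resource, might look roughly like this; the bucket name, tags, and exact field names are illustrative and depend on the controller version:

```yaml
apiVersion: s3.services.k8s.aws/v1alpha1
kind: Bucket
metadata:
  name: tenant-1-data
  namespace: tenant-1                 # provisioned alongside the tenant's other resources
spec:
  name: example-tenant-1-data-bucket  # the actual S3 bucket name (must be globally unique)
  tagging:
    tagSet:
      - key: tenant
        value: tenant-1               # tags let you see which tenant owns the resource
  publicAccessBlock:
    blockPublicACLs: true
    blockPublicPolicy: true
    ignorePublicACLs: true
    restrictPublicBuckets: true
```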

Thumbnail 2660

So let's put this together now. We're going to get into our control plane and our application plane architecture. This is how we get here. On the left-hand side, we have Git. Developers are no longer actually submitting manifests directly into the cluster. We're going to do this all through Git. Git will store that and have a webhook into Argo. Argo CD is a GitOps controller, the most popular one. I'm sure many of you have heard of it, and probably many of you are using it. The thing that's different here is we're not trying to use this to deploy into a namespace. We're using it to actually create AWS infrastructure from Argo.
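A hedged sketch of what that wiring could look like as an Argo CD Application; the repository URL, path, and project here are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tenant-1-infrastructure
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/tenant-infrastructure.git  # hypothetical repo
    targetRevision: main
    path: tenants/tenant-1        # this tenant's namespace, policy, and ACK manifests
  destination:
    server: https://kubernetes.default.svc   # the control plane cluster Argo runs in
    namespace: tenant-1
  syncPolicy:
    automated:
      prune: true                 # remove resources deleted from Git
      selfHeal: true              # revert drift back to what Git declares
```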

Thumbnail 2720

Thumbnail 2750

From Argo, it will submit it into the API server. We'll follow the same flow. We still have it going to OPA, and if it's not a valid manifest, it'll reject it. Let's go back to this manifest here. Some of the rules that we can create with OPA say do not give me a manifest that doesn't have a key for tenant, and we can reject that. We can also say if it doesn't have the proper public access block rules, reject it. We can give it values that say we only accept these values or these types, or reject this type. So now we have control in the same way that we control our deployments. We can control our AWS resources.

Now the difference here is that in this tenant namespace, I'm not deploying a pod. I'm deploying a CRD. ACK is listening for that, listening for a new S3 bucket, a new RDS, or just changes to resources that already exist. Then ACK will reach out to the AWS API and create those resources for us. You can still be deploying into remote EKS clusters, which become your application plane. So on the right-hand side, we have that application plane. On the left-hand side, look what we have now. Notice there's no workloads running in there. We haven't deployed a workload into the left-hand side. That becomes our control plane. That becomes the center of everything that we do with our operations. This is where we can have other applications like billing, metrics, and metering. We start deploying more and more resources here or centralized resources here.

Thumbnail 2810

Thumbnail 2840

But wait, where's the GenAI? I promise I won't talk too much about GenAI, but all of our leadership, all of your leadership, is still asking about GenAI. Nirmal was talking about this earlier. We now have microservices. We now have this control plane. We have our application plane. How do we start thinking about enabling GenAI because that's ultimately what our leadership is looking for? But nothing prevents us from creating Bedrock or SageMaker resources in the exact same way. We have Bedrock and SageMaker controllers inside of ACK. So now not only can we provision our S3 buckets and our RDS for a particular tenant, we can start provisioning resources that we need inside of SageMaker or Bedrock for that tenant as well. All through the exact same pipeline, same compliance, same auditing, same rules.

Thumbnail 2870

Let's tie it in a little bit more and bring hybrid nodes back into it because maybe you already have GPU resources running in your data center or you already have bought them and want to use and leverage them, while still leveraging the exact same infrastructure and the exact same control plane. Notice we haven't changed anything on that left-hand side in order to get our application plane to have different resources and our tenants to be able to access those different resources. All of it could be managed and controlled through EKS itself, whether it's in the data center, whether it's a GPU in the data center, whether it's a hybrid node in the data center, or whether it's AWS resources. We talked about how we can manage AWS networking through Kubernetes. We can manage IAM policies. All through the same interface, which simplifies your operations.

Thumbnail 2940

Going back to why does it take so long, it's all that coordination. But when you have a system like this running, there's way less coordination. The rules are declaratively written. People trust the system, and it reduces the amount of back and forth you have to do through Jira tickets and meetings. All of that is written down in code and in your Git repo. This is where we ended up. We talked a little bit about how that left-hand side started to have that, and we didn't go into every single box here. But we gave you the foundations of what you can build upon. On the right-hand side, we still have our application plane, super simplified, of course. We didn't talk about GPUs on the right-hand side or the other resources inside of AWS that you can provision, but you have this now. You have the foundations of what to build upon.

We see many customers as they start breaking down monoliths, coming in from the data center needing AWS resources while still leveraging EKS and Kubernetes, getting into this particular architecture because that allows them to scale. At the end of the day, what we're all trying to do is scale. We're not getting more headcount to help us grow, or if we do, it's not going to be a linear investment. We need to figure out how to scale ourselves through automation, and this is the way that our customers are going when they're leveraging EKS. So I'd like to invite Nirmal back on here to close it out. Thank you.

Awesome. Just a second. So what do we think? We went through that journey. We started with our monolithic application and now we have a multi-tenant, multi-cluster, highly available infrastructure with all the services you can imagine from AWS and your ability to adapt to those new requests from your developers. Even though we did go into a lot of depth, this is just a teaser of all the capabilities that EKS can offer and a starting point for your journey to adopt all these new features. One of which was just launched while we were sitting here, so this is super fresh news.

Thumbnail 3050

We now have a new feature called EKS capabilities. I can go back to this slide, which shows our EKS managed controllers for Argo, ACK, and KRO, the Kube Resource Orchestrator. This is a new, optional component of this architecture. So check that out. There is a session tomorrow at 1 o'clock, CNS 378, where they are going to go into the details of EKS capabilities. This is super fresh, and they were messaging me while we were all sitting here, so you all are the first to know.

Thumbnail 3110

It gets even better. Isaac mentioned these things that you have to put into your cluster to manage and do this kind of automation and deployment. Well, now that is even easier, because we can manage those controllers for you and streamline it for you to implement the architectures that you need to support your customers. In addition to that, there are some other sessions this week. We have an amazing EKS track this year with lots of different sessions. This is just a sampling of some of the ones that you might be interested in.

Thumbnail 3130

In addition to that, we have a hybrid nodes workshop as well. You can find out all the details here. Here are a lot of resources that you can use to get started today, including labs and workshops for auto mode and for hybrid nodes, more information on EKS blueprints, so you do not have to figure this stuff out for yourself. You can use pre-made blueprints to get started. If you are just getting started with EKS and do not know many of the words that we said today such as containers and Kubernetes, we also have an EKS digital badge which you can use to get started and understand the basics.

Thumbnail 3190

Here are the session resources for this session specifically, but it is mostly the same thing as the previous one. EKS capabilities just launched, so check that out. CNS 378 tomorrow. With that, I would like you to take the survey, please.


This article is entirely auto-generated using Amazon Bedrock.
