Kazuya
AWS re:Invent 2025 - Scaling Serverless with platform engineering: A blueprint for success (CNS361)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Scaling Serverless with platform engineering: A blueprint for success (CNS361)

In this video, Anton and Ron from CyberArk demonstrate how to scale serverless engineering across organizations through platform engineering blueprints. They explain moving from traditional infrastructure/application team separation to a unified approach using Infrastructure as Code modules (Terraform, CDK) that embed best practices, security, and observability. CyberArk reduced new service creation time from 5 months to 3 hours by building reusable blueprints for common patterns like API Gateway-Lambda-DynamoDB architectures. The session covers creating vetted IaC modules, implementing governance with tools like Checkov and CDK-NAG, balancing developer autonomy with standardization, and building a service platform that scaffolds complete enterprise-grade services including frontend, backend, CI/CD pipelines, and SaaS integrations. They emphasize treating platform engineering as a product, building with customers, and providing flexible customization within defined boundaries.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction: Scaling Serverless Engineering Across Organizations

Thank you so much for coming. My name's Anton. This is my good colleague and friend, Ron from CyberArk. We started as a partnership, and now we're good friends. Today we're going to be talking about scaling serverless with platform engineering, a blueprint for success.

Now it's important to mention that when I'm saying scaling, I do not necessarily mean what you think scaling means in the context of serverless. It's not how you scale from 5 requests per second to 5,000 requests per second. That's not the kind of scaling we're discussing. When we're talking about scaling here, it's how do you scale serverless engineering in your organizations? How do you get from a point where you have one team developing with serverless to a point where you have 100 teams developing with serverless without losing their efficiency? That's what we're going to be talking about today.

On the very last slide, there's a giant QR code that will lead you to a page with everything you see today, including these slides. You'll have these slides in 59 minutes roughly, so feel free to take pictures, but technically you don't have to. All right, we've got to start somewhere, right? And it always starts here. In the beginning we have application development teams and infrastructure teams.

So let's do a quick show of hands. Who's here on more of the application development side? A few people. More in infrastructure platform engineering? Oh, you're in the right place. You're in the right place. All right, let's talk. So everyone is familiar with this model, right? It exists for ages. You have application developers who are responsible for the application layer. They write code in whatever language, and you have that infrastructure layer under the hood: virtual machines, databases, networking, et cetera, which is usually managed by this awesome team with many different names—infrastructure, DevOps, operations, SRE—depends on your organization, right?

Now, sometimes developers need to make changes to the infrastructure. As you develop new applications, you need changes in infrastructure. So who's here using a ticketing system between development and operations? It's pretty common. It's still pretty common. So application developers open a ticket using whatever system, a change request, the infrastructure team implements that change, and it works great if you have one application development team and one operations team. But at some point you get your second development team, and then you have your third development team, and it continues and so on and so on. Anyone resonate with that, right? I think all of us know what we're talking about here.

It's not a nice thing to say, but at some point in time we hit a place where that infrastructure team, without really wanting to get to this point, they're kind of becoming the bottleneck, and that's not a good place to be. No one wants to be that team that everyone complains about, right? So how do we solve that? How do we solve that? Well, there are a couple of ways we can solve it.

The Serverless Shift: Balancing Developer Autonomy with Operational Standardization

First of all, yay, shift left to the rescue, right? Now, shift left works, but the problem with shift left is that the definition of shift left evolves over time. And you know you can always throw more stuff into your development teams, right? So now you're also responsible for DevOps, and now you're also responsible for security, and now you're also responsible for sustainability, right? And now you're also responsible for picking the right model for your agentic application, right? So it's not going to scale for 100 engineering teams. You cannot just throw stuff to the left and expect it to work, right?

So another way is, you know, since this is a serverless session, you can start using serverless. If you have problems managing a lot of infrastructure, well, let's manage less infrastructure. Let's shift to serverless, right? I'm not going to cover every single bullet here since you're in this session. You're probably familiar with the idea of serverless, right? Minimized infrastructure management. But when you're moving from this world to serverless, your perception changes a bit, right?

So you come from the world of traditional applications where there's a clean separation between the application layer and people and the infrastructure layer and people that handle that into this new shiny world of serverless applications, right, where you might still have the infrastructure layer. It's pretty slim, but it's still there. It's not necessarily gone, right? You might still be using RDS, right? But most of your application components—I'm going to be using the term components here—are serverless. It's essentially a collection of loosely coupled serverless services like Lambda, ECS, DynamoDB, et cetera, et cetera, et cetera.

And we're heading into an interesting state here, right? Application developers and infrastructure people, they kind of start having this interesting conversation. A Lambda function, it's kind of my application resource as a developer. I want to control it. But the operations people are saying, no, no, no, no, no, no, no, we do it with Terraform, CDK, CloudFormation. It's infrastructure.

We want to control it, right? And both sides have good intentions. No one is wrong here. They have good intentions, but they have slightly conflicting intentions. So it becomes a conflict between the autonomy that developers want versus the standardization that the operations team wants. Is anyone familiar with this? Oh yeah, so if we put it on a chart, it's going to be a little bit interactive. There are startups, and startups care more about autonomy, right? They don't need to do PCI, HIPAA, and other certifications. They need to move as fast as possible. But there are also highly regulated enterprises that need that compliance because they just cannot afford every single team doing things in their own way. It's super hard for regulated industries.

So let me ask you a question. Let's take about three seconds. Where do you think you are on this chart? Who's thinking they're more into autonomy? What about standardization? Probably most of you are thinking, oh, we're definitely having the right balance, right? We're exactly in the right spot. So the goal of today's session is to talk about how you can get here, how you can get the best of both worlds. I'll be talking about different techniques, and then Ron will cover how they implemented these techniques and many more in their environment and sped up their development. To give you a little teaser, or maybe a spoiler but more of a teaser, they've managed to reduce the time it takes to launch a new service from scratch, from nothing, from pretty much five months to three hours. That's like a 99% improvement. So today you'll see some of these practices from me in hypothetical terms and from Ron in practical terms, and again, all of that is open source on GitHub.

Platform Engineering Blueprint: Building a Standardized Serverless Stack

Cool. Whoa, sounds great. Can you please be a little bit more specific? You're probably wondering at that point, right? Yeah, let's be more specific. So this is what development organizations commonly look like. That's what we call traditional ownership boundaries. You have an application development team that manages their code bases, and then you have operations people, infrastructure operations, or platform, which is the new fancy term. They manage a lot of things like CI/CD, monitoring, and so on. It's not like everyone is exactly the same, but this is the general picture. Now, shift left brings us to this. You might have seen it. I call it the Wild Wild West, where essentially you're saying every single team is free to pick their CI/CD process, and then you have 100 teams with 100 CI/CD approaches. And when you need to collect evidence for PCI certification or something like that, good luck, it's going to take you seven to twelve months.

So shift left without proper governance in place is dangerous. It is dangerous. That's why the idea of platform engineering that we're talking about here implies you do shift left. Shift left works, but you cannot do it in a chaotic way. You need to have proper governance in place, and that governance is managed by platform engineering. Essentially, your teams do own more. They do own the stack end to end, but they don't need to figure out everything on their own. It's based on the artifacts produced by the platform team. So the first step, let's get practical. The first step would be to define the infrastructure and application resources ownership model. Let's define tools, let's define processes, and we're not talking here about EC2 instances. We're talking about services, because application developers want to own some of that, but infrastructure people don't want to lose control. So let's talk about what that platform story kind of looks like.

Let's build that platform for a serverless world. At the very bottom of that platform, you will have AWS. You've got stuff that is running on AWS. That's the services we provide, management API, Lambda, Fargate, whatever you're using. That's kind of the baseline. On top of that, you need to make a decision. Your organization needs to make a decision. What's the Infrastructure as Code framework that we're going to use? It doesn't have to be one. Anyone here, let's see, Terraform? CDK? Awesome. CloudFormation? SAM? Okay, so we went through the statistics. Our examples today are going to be primarily CDK and Terraform. Statistically, this is where most people are. Everything you see today is applicable to any Infrastructure as Code framework. It's just these two are quite popular in the serverless world. So pick a framework. If you want to use Terraform, if you're using Terraform in other parts of the organization, go ahead, continue using it. If you're using Crossplane or ACK and you manage your resources through the Kubernetes control plane, continue doing that. Don't reinvent the wheel.

On top of this, you have the CI/CD flow. Once again, you need to make a choice. Do you use GitHub, GitLab, or CodeBuild? You probably noticed that towards the left side, those are more AWS options. Towards the right side, those are more third-party partners, open source, and so on. Every layer will have something like this.

On top of that, you need to define what governance tools you're going to use. Why? Well, because you're giving your developers freedom, and we'll be talking about that in a moment, you also need to have governance. There is no freedom in the selection of tools without proper governance that ensures your developers are actually selecting from an approved list. And obviously on the very top, and this is not going to be the main topic for today, but still, you can have something like Backstage or CNOE or GitLab: somewhere you store your assets, somewhere your developers can access. It can be something as sophisticated as Backstage or something as simple as a Git repository. Essentially, something that your developers can use to obtain access to those resources.

You'll have a bunch of partners. Those are common ones. I just arbitrarily picked three of them. So you'll have different partners that you need to integrate into your tools, and the question is, that's a lot of icons and a lot of layers. So how do you actually standardize this? How do you build what we call a blueprint? And this is where we're recommending you start with a catalog of vetted infrastructure code modules. Everything before this slide was hypothetical. It sounds awesome in theory, right? Where's the practice? This is the practice. This is something tangible, something that your developers can use. It's much more than a pretty slide that sells a great idea. This is tangible.

Creating Vetted Infrastructure as Code Modules with Embedded Best Practices

So you're building a catalog of vetted Infrastructure as Code modules. Depending on your Infrastructure as Code framework, it will have different names. Terraform uses modules, CDK uses constructs. Different frameworks have different terminology, but the idea is you identify what your developers need and you modularize that. So why? Big question. Well, the same reasons we're modularizing stuff. First of all, best practices. If you're creating something as a module, you can embed best practices into that implementation, and you'll see examples, don't worry. Reusability. If it's a module, you can reuse it across the whole organization. And the last one is composability. Once you have different modules implementing different functionality, you can start composing your architectures from these modules.

So let's take a look at some code. Let's take a look at how this actually works. Terraform, right? A lot of people raised their hands. If at this point you're thinking this is one pretty useless Terraform configuration, you're kind of right. We're going to evolve it, don't worry. So I've got two variables, function name and runtime. We're talking about Lambda here, and then I'm creating a resource, and I'm just using function name and runtime as variables. There is nothing fancy here at all. I'm just using variables to define the resource.
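As a sketch, the configuration being narrated might look like this (the handler, filename, and role variable are illustrative placeholders, not from the talk):

```hcl
# Two inputs, fed straight into the resource. Nothing fancy yet.
variable "function_name" {
  type = string
}

variable "runtime" {
  type = string
}

variable "role_arn" {
  type = string   # placeholder: IAM role managed elsewhere
}

resource "aws_lambda_function" "this" {
  function_name = var.function_name
  runtime       = var.runtime
  role          = var.role_arn
  handler       = "index.handler"   # placeholder
  filename      = "function.zip"    # placeholder
}
```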

But there is one interesting thing here. If you look at the left side, variable runtime, the default is set as node 20. So with this one single line, I can standardize the default runtime across my whole organization. It's kind of basic. It's pretty rudimentary, but what you're doing with this single line is basically you're saying my development teams don't need to make this decision, and this is just a simple example. That's the default that we've defined in our organization. Let's evolve it. How about log retention days? You're probably familiar with that. The default retention for CloudWatch logs is forever. Forever means pain forever. How about you standardize it in a module, let's say 14 days, again, arbitrary number. But basically by creating this very simple approach, you're saying, well, by default all the logs will be stored for 14 days and deleted automatically after that. Not a single person in your engineering teams needs to know about it or worry about it anymore, because that's the standard across the organization.
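Those two defaults, sketched in Terraform (the values are the arbitrary examples from the talk; `var.function_name` is the variable narrated earlier):

```hcl
variable "runtime" {
  type    = string
  default = "nodejs20.x"   # one line standardizes the org-wide default runtime
}

variable "log_retention_days" {
  type    = number
  default = 14             # logs expire automatically instead of living forever
}

# Creating the log group inside the module lets it own the retention policy,
# so the CloudWatch default of "never expire" no longer applies.
resource "aws_cloudwatch_log_group" "this" {
  name              = "/aws/lambda/${var.function_name}"
  retention_in_days = var.log_retention_days
}
```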

A few more examples. I want a standardized logging config, because previously every single team had different logging structure. I think many of you can relate here. I want a standardized usage of Lambda Powertools, so I want every single function to have a specific library embedded. Now my developers don't need to worry what's Powertools, which version do I use, do I get it through ARN or whatever other way. You just standardized it.
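One hedged way to standardize this inside the same module: pin an org-approved Powertools layer and a logging configuration once. The layer ARN below is a placeholder, not a real published ARN, and the environment variable names follow Lambda Powertools conventions:

```hcl
variable "powertools_layer_arn" {
  type    = string
  default = "arn:aws:lambda:REGION:ACCOUNT:layer:powertools:1"   # placeholder
}

resource "aws_lambda_function" "this" {
  function_name = var.function_name   # variables as declared earlier in the module
  runtime       = var.runtime
  role          = var.role_arn
  handler       = "index.handler"     # placeholder
  filename      = "function.zip"      # placeholder

  # Every function gets the same vetted Powertools version...
  layers = [var.powertools_layer_arn]

  # ...and the same structured-logging configuration.
  environment {
    variables = {
      POWERTOOLS_SERVICE_NAME = var.function_name
      LOG_LEVEL               = "INFO"
    }
  }
}
```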

Once you build this Terraform module, you can package it. Again, I'm using Terraform as an example, and Ron will show CDK. We specifically use these two frameworks to show you it's not about the framework, it's about the approach. So you've created a Terraform module. You called it my awesome enterprise Lambda function or whatever you want to name it. You've defined what the configuration variables are, and that's something that your engineering teams can change because obviously they need to specify function name and memory size and so on. But you've also defined defaults for many of these parameters, right? So for example, if it's a Java application, probably a good default will be 2 gigabytes of memory. But if it's Python, probably 0.5 gigabytes would be just fine. For many of you it sounds pretty basic, but once again, your engineering teams don't need to worry about this anymore. Can they change it? Of course they can, but do they have to know about it? No, they don't. That's the beauty of it. And you're also going to have some outputs.
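A sketch of what that packaged module's interface might look like, with per-runtime memory defaults and an output. The module name and numbers are the talk's arbitrary examples, and `startswith` needs Terraform 1.3+:

```hcl
# modules/my-awesome-enterprise-lambda-function/variables.tf
variable "function_name" {
  type = string
}

variable "runtime" {
  type    = string
  default = "python3.12"
}

variable "memory_size" {
  type    = number
  default = null   # null means "use the org default for this runtime"
}

locals {
  # Java gets a bigger default; everything else does fine on 512 MB.
  default_memory = startswith(var.runtime, "java") ? 2048 : 512
  memory_size    = coalesce(var.memory_size, local.default_memory)
}

# modules/my-awesome-enterprise-lambda-function/outputs.tf
output "function_arn" {
  value = aws_lambda_function.this.arn   # resource defined elsewhere in the module
}
```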

So those are three examples of very real composable modules that we see across many customers, including in Ron's company. We already talked about the baseline Lambda function. So how about a Lambda function with a periodic schedule? Anyone here implement this scenario, a Lambda function with an EventBridge schedule? Yeah, I see about four people. SQS with redrive, a DLQ with redrive. So this is not rocket science. You probably already implement this in your scenario. Those are three patterns that I've implemented as modules: Terraform modules, CDK constructs, doesn't matter. How do I turn these three modules into an architecture? This is probably the most important part of this presentation. Do you know what's the difference between three modules and an architecture? This: two arrows, right? Now it's an architecture. Again: three distinct modules. Architecture. Obviously, I'm simplifying this a little bit, but now you can take composable components and build an architecture.

So you can have, for example, an analytics job that runs once a day, processes data, throws that into SQS, and you have a Lambda function processing the data from the queue. Every single component implements best practices provided by the platform team. Development teams don't need to worry about what's the best practice for using a specific runtime or whatsoever. You would put those modules somewhere in the repository, right? It's quite flexible. It can be Git. It can be whatever you're using internally. We see Git as pretty popular because again, it's a tangible piece of code, right? So you would put it in some sort of repository and your engineering teams will consume those modules. If you use something like Terraform Cloud or Pulumi, they provide explicit ways to distribute those modules. But the idea is your engineering teams consume those modules.
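Composing the three modules into that daily analytics pipeline could look roughly like this; the module sources and input names are hypothetical:

```hcl
module "analytics_job" {
  source        = "git::https://git.example.com/platform/modules.git//scheduled-lambda"
  function_name = "analytics-job"
  schedule      = "rate(1 day)"                   # runs once a day
  target_arn    = module.jobs_queue.queue_arn     # first arrow: job -> queue
}

module "jobs_queue" {
  source = "git::https://git.example.com/platform/modules.git//sqs-with-redrive"
  name   = "analytics-jobs"
}

module "queue_consumer" {
  source           = "git::https://git.example.com/platform/modules.git//lambda-function"
  function_name    = "analytics-consumer"
  event_source_arn = module.jobs_queue.queue_arn  # second arrow: queue -> consumer
}
```

Each module carries the platform team's best practices internally; the root configuration only does the wiring.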

Over time, since once again this is code and you're talking about engineering teams, they start contributing to these modules. So it's not like you as a platform team are now responsible 100%. No, you can have a social coding exercise. You can go to your engineering team and tell them you want an update, give us a PR. An engineering team can start contributing to that as well, right? You don't have to start from scratch. There are a bunch of open source repositories. Again, you'll get these slides like in what, like 42 minutes, right? There are a bunch of open source projects that already implement modules for various serverless components that you can either use as is or start building on top of. It's open source so you can take it, you can take the part that you like and add whatever is missing. So completely up to you.

From Modules to Architectural Blueprints: Composing Serverless Patterns

What's next? So let's evolve this idea. So we talked about creating those modules. Let's evolve it into architectural blueprints, right? Let's build something bigger out of it. For example, right, I know my company. I've spoken to a few people in different parts of the company, and I figured out that this is a pretty popular pattern in my organization. API Gateway, bunch of Lambdas, DynamoDB. Anyone here using it? It's like probably one of the most popular patterns ever, right? So it's popular in my organization, meaning there are multiple teams that are using this pattern, and those teams are communicating with event-driven architecture using EventBridge or maybe something else like SQS. So you have different parts that are doing the same pattern, right? And there is a mechanism to communicate between these parts. What does it actually mean?

So these are the building blocks. What's missing? Arrows. Arrows are as important as blocks because in the serverless world you define these arrows also with Infrastructure as Code when you create event source mapping for a Lambda to read from SQS, right? So those arrows are as important as blocks because once again you're responsible for defining those arrows, right?
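That arrow is literally a resource. For SQS-to-Lambda, for example, it's an event source mapping (resource names here are illustrative):

```hcl
# The "arrow" between the queue and the function, defined as code.
resource "aws_lambda_event_source_mapping" "queue_to_consumer" {
  event_source_arn = aws_sqs_queue.jobs.arn
  function_name    = aws_lambda_function.consumer.arn
  batch_size       = 10   # another default the platform team can standardize
}
```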

You don't just give a URL to developers and leave it completely up to them to do whatever they want with that. It's also part of your architecture. Well, you know, it's common, it's useful. My engineering teams would love to have this as a pattern. I'm going to call it a blueprint. This is my first blueprint: synchronous API with database. So whenever someone in my company needs to implement a synchronous API with a database, they can just reuse my module, and it will have all the best practices that I've implemented.

Which best practices? Well, how about this? This is just a small subset of what you can do when you're providing this thing as a module. As a platform team, you can preconfigure defaults and enforce things for a lot of things on these resources. So now your engineering teams are essentially getting that out of the box. Moreover, you can embed observability, security, best practices, and so on, end to end. Ron will talk more about that as well. Again, a few examples. How many of you ever experienced what we call orphaned resources? Resources that you have no idea who owns them. Oh yeah, that's like, I wish I had a dollar. How about you set default tags on that module, and now every single resource has an owner, and developers don't need to do anything. They don't even need to know about it. Like what, five lines of Terraform code, or CDK supports that as well. You standardize this.
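Those few lines, sketched with the AWS provider's `default_tags` (the tag keys and `team_name` variable are illustrative):

```hcl
# Every resource created through this provider gets an owner automatically;
# developers don't need to do, or even know, anything.
provider "aws" {
  default_tags {
    tags = {
      Owner     = var.team_name        # hypothetical variable
      ManagedBy = "platform-blueprint"
    }
  }
}
```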

Another problem: you can have development variables and production variables, basically injecting those values. How about you standardize your DynamoDB security? You're saying that server-side encryption must always be enabled. TTL must be enabled. Point-in-time recovery must be enabled. How many of your developers are even aware of the fact that DynamoDB provides point-in-time recovery and they might need to use it? Now you standardize it. Security: if you have DynamoDB as part of your module, instead of saying, well, I have no idea what's going to happen there, we're just going to give asterisk as permissions, no. You can standardize that. So by default, put item, get item, update item. If your developers need more, they can update it, but by default you scope down the permissions to what you think is a good selection.
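Sketching those DynamoDB defaults in the module (the table layout and TTL attribute name are illustrative):

```hcl
resource "aws_dynamodb_table" "this" {
  name         = var.table_name
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "pk"

  attribute {
    name = "pk"
    type = "S"
  }

  server_side_encryption { enabled = true }   # always on
  point_in_time_recovery { enabled = true }   # always on

  ttl {
    enabled        = true
    attribute_name = "expires_at"   # org-standard TTL attribute
  }
}

# Scoped-down default permissions instead of "*"; teams extend if they need more.
data "aws_iam_policy_document" "table_access" {
  statement {
    actions   = ["dynamodb:PutItem", "dynamodb:GetItem", "dynamodb:UpdateItem"]
    resources = [aws_dynamodb_table.this.arn]
  }
}
```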

Okay, we spent like what, 23 minutes, and I never mentioned AI. Let's fix that. Bedrock. This is something that your developers, your customers, might be using. So it's another component, a frequent component in every single architecture today, and everyone is looking for answers. How do I properly configure it: context size, which model do I use, what's the configuration for temperature, top P, and so on? Everyone suddenly needs to become an LLM expert. Well, how about, first of all, you standardize security? It's part of your blueprint. You standardize that your blueprint has access to a specific model. So now your developers don't need to wonder which model to use. This is what we use as a standard in our organization. Moreover, the configuration (max tokens, temperature, top P) is also standardized. Can your developers override it? Yeah, easy. Do they need to worry about that? No, they don't. That's the difference.
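A hedged sketch of that standardization: one approved model id (the value below is only an example), one approved inference configuration, and IAM access scoped to just that model:

```hcl
variable "region" {
  type = string
}

variable "model_id" {
  type    = string
  default = "anthropic.claude-3-haiku-20240307-v1:0"   # example approved model
}

variable "inference_config" {
  type = object({
    max_tokens  = number
    temperature = number
    top_p       = number
  })
  default = {
    max_tokens  = 1024   # org defaults; teams can override within limits
    temperature = 0.2
    top_p       = 0.9
  }
}

# Access is granted to the one approved model, not bedrock:* on "*".
data "aws_iam_policy_document" "bedrock_invoke" {
  statement {
    actions   = ["bedrock:InvokeModel"]
    resources = ["arn:aws:bedrock:${var.region}::foundation-model/${var.model_id}"]
  }
}
```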

This is actually interesting because this is the first time that you see function code, not Infrastructure as Code, but function code. That's a little bit different. So as an infrastructure team, well, not infrastructure anymore, platform. As a platform team, are you saying that now in addition to being able to define the architecture with Infrastructure as Code, we can also enforce things in the actual application code in the function handler, for example? And the answer is yes. So if you think about a function handler, it has some business domain logic, business domain code, but there are also some things which are not specific to a particular function, not specific to a particular handler. Like observability. How many of you are embedding some observability extension, for example? Do you want your developers to worry about that? Probably not.

Some things like the way you do config or secret management, multi-tenancy, that's a lot of work that Ron and CyberArk did. Security governance integrations and SDK. There are things that are unique to this function, and that's the handler, that's the business value that the function provides, but there are also things which are more generic across multiple functions, and you can actually standardize that as well. And now you probably have a big question: but what if developers do want to make changes? I've just standardized everything with amazing defaults, and now I have a lot of developers coming to me saying, how do I override these defaults?

Defaults are nice, but I need custom configuration. Obviously it makes perfect sense. So like I said in the beginning, you do want to give your developers this flexibility. You do want to allow them to change things, but you want to allow them to change things within boundaries that you've defined.

Governance Through Proactive and Detective Controls

We have this notion of various controls. So this is, in a nutshell, the standard pipeline for your application code, generalized a little: you write your code and Infrastructure as Code, commit, build, test, package, deploy, run. So you have the phase where it's development, then you have continuous integration. This is where you build and test stuff through something like Jenkins or whatever you're using. Then you have continuous delivery where you're actually pushing that to the cloud. At any point in time here, your developers might want to change things, to customize things. They want flexibility, because we started with flexibility.

We provide this notion of proactive controls and detective controls. You're familiar with that. You've heard about that. So proactive controls are essentially controls that catch things before something happened. For example, during the development phase when a developer changes something, you want to make sure that the change they made is safe. To give you an example, let's say in your organization you want to define what are the approved runtimes. You want to say we only approve, again arbitrary, Node and Python and only the two latest versions. So whenever a developer decides, well, I've got to experiment with Java, maybe at first they're going to get an alert saying no, no, no, no, we don't do Java here. Again, just picking on a particular language. It can be anything.
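That proactive control can live in the module itself as a Terraform variable validation; the approved list below mirrors the talk's arbitrary example:

```hcl
variable "runtime" {
  type    = string
  default = "nodejs20.x"

  validation {
    condition = contains(
      ["nodejs20.x", "nodejs22.x", "python3.12", "python3.13"],
      var.runtime
    )
    error_message = "Runtime is not on the approved list. Talk to the platform team."
  }
}
```

With this in place, a developer who tries a Java runtime gets the "no, no, we don't do Java here" feedback at plan time, before anything reaches the cloud.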

And then you evolve that. So you want to have these proactive controls and detective controls at every single stage. Now there are different tools to achieve that. Some of those tools are coming from AWS. Other tools are coming from other vendors. The choice of those governance tools is heavily dependent on your choice of Infrastructure as Code tools. So if you're using CloudFormation, for example, you can use CloudFormation Guard. If you're using CDK, you have CDK-NAG, an amazing open source framework. If you're using Terraform, you've got HashiCorp Sentinel or Checkov. Anyone using these? So what do these frameworks do? That's a really good question.

Those are the frameworks that I'm seeing pretty commonly. What do they do? Well, we'll take Checkov as an example because we had a few examples of Terraform code. So Checkov is a governance framework for Terraform (it covers other Infrastructure as Code formats as well). You can see here a list of controls, all of them green, all of them passed, that I validated before committing code. So this is part of my local development experience. This is part of my CI/CD process. First, before code is actually committed, I run Checkov and it validates that my Infrastructure as Code complies with all the rules that I've defined. And then the same thing happens during the CI process.

Now what's going to happen if I introduce a change that is not compliant with whatever guardrails the platform team supplied? Well, this. You can see here at the bottom: ensure that the CloudWatch log group specifies retention days. That's because I accidentally deleted that property from my configuration, so now my logs are going to be stored forever. And: ensure CloudWatch retains logs for at least one year. That's a default control Checkov ships with. So what it means is the developer has made a change locally according to what they think they need, but that change goes beyond what you've decided should be allowed.

So they do have flexibility to make those changes, but you control what's the range of what is allowed within your organization. To give you a few examples, anyone here ever used a Lambda function running on AWS with 10 gigabytes of memory? Why is it less common? It's perfectly fine, but why is it less common? Because at 10 gigabytes of memory, you're getting six virtual CPUs, and Node by default is not going to use six virtual CPUs. You need to write your code in a very special way for this to happen. So it's doable, it works, but you need to know about the fact that now you need to write your code in this specific way.

So as a platform team, you're familiar with this.

You can write a rule that says if this is a Node application, you can set memory up to 2 gigabytes. Why? It's a safeguard, because you know you can set it to 10 gigabytes and you're going to be paying a lot but not actually using that. If you need to exceed that, let's talk. Maybe we'll add one more property saying I know what I'm doing, something like that. Essentially, to summarize the section: Do your developers have flexibility? Yes. Do they need to be aware of everything that's going on under the hood? No. The cognitive load on the development teams is significantly reduced.
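The kind of guardrail described above can be expressed as a simple pre-deployment validation step. This is a minimal sketch of the idea, not CyberArk's actual policy: the rule, function name, limits, and the `override_memory_cap` escape hatch are all invented for illustration.

```python
# Hypothetical guardrail: cap Lambda memory per runtime unless the
# developer explicitly opts out ("I know what I'm doing").
MEMORY_CAP_MB = {"nodejs20.x": 2048, "nodejs18.x": 2048}


def validate_function(config: dict) -> list[str]:
    """Return a list of guardrail violations for one Lambda function config."""
    violations = []
    runtime = config.get("runtime", "")
    memory = config.get("memory_mb", 128)
    cap = MEMORY_CAP_MB.get(runtime)
    if cap and memory > cap and not config.get("override_memory_cap"):
        violations.append(
            f"{runtime} functions are capped at {cap} MB "
            f"(requested {memory} MB); set override_memory_cap if intentional"
        )
    return violations
```

With this sketch, `validate_function({"runtime": "nodejs20.x", "memory_mb": 10240})` reports a violation, while the same config with `"override_memory_cap": True` passes, which is exactly the "if you need to exceed that, let's talk" flow described above.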

Thumbnail 1840

Thumbnail 1860

So let's summarize the section. The platform team builds and maintains those blueprints, and they put them in some sort of curated blueprint catalog. Developers consume those blueprints. Over time, developers provide feedback to the platform team, so the platform team can evolve those blueprints. Think of it as a product. It's a product, it's not a weekend project. It's a product within a company.

Thumbnail 1870

Thumbnail 1880

If everything works great, over time developers start contributing to those blueprints, because this is the way developers can get what they need faster. You know the process: if you want it, we're happy to merge your pull request. Over time you can have security and compliance teams and whatever other teams you have in your organization actually bringing their requirements. Now developers don't need to talk to security teams. No one really enjoys that. We have to, but no one really enjoys that. You can bake the requirements from security, compliance, and other teams into the blueprints.

CyberArk's Platform Engineering Journey: From Fragmentation to Unified Experience

Now I've spoken for what, 30 something minutes, but I think you want to see how this actually works in a huge, production-grade system. So I'm going to pass it to Ron, and he'll tell you exactly that. Thank you. Everybody can hear me? Yeah. Happy to be here.

Thumbnail 1930

So before we start talking about platform engineering at CyberArk, I'm going to quickly introduce you to what we do at CyberArk. CyberArk was founded in 1999, and we are the global leader in identity and access management. We have over 4,000 employees across the globe and over 1,000 developers, and we are a cloud-native, serverless-first SaaS company, and that's where I want to shift our focus to.

Thumbnail 1950

Now if we go back about six or eight years ago, we had several SaaS solutions, but their experience was fragmented. From the developer perspective, there were multiple silos, so there was no unified developer experience, tech stack, or even architecture. From the SRE perspective, again, multiple solutions, so it was harder for them to support all our solutions. And for the customer experience, which is the most important one, again, there was no unified experience. There was a different onboarding experience and the look and feel wasn't the same.

Thumbnail 1990

Thumbnail 2000

So in 2020 we decided to do better and we started our platform engineering team with 15 engineers, and I was one of those engineers, and our goal was to basically unify all these experiences. Now fast forward to today and we're actually able to do that. We've streamlined our tech stack. We're using AWS, serverless, and Python. We have a unified observability stack, unified customer experience onboarding, look and feel, and we've defined best practices for security, governance, and created multiple toolings. Now these toolings are used by hundreds of developers, and we saved with them years of development time. And from a humble start of 15 engineers, we are now well over 100 engineers across two divisions.

Thumbnail 2030

Now our goal is basically to adopt and scale serverless across the organization, but we want to do it in a smart manner. We want to maintain standards and best practices for architecture, governance, security, and observability. Basically we want to help our developers and the organization to deliver value faster for our customers.

Thumbnail 2050

So what we've done is build a service platform, and this platform basically encapsulates automations and best practices into blueprints like Anton has mentioned, and these blueprints are used by hundreds of developers. We've created dozens of services and we reduced the new service creation time by 99%. That's not a typo, that's actually real, and we're going to see the numbers later on, and we're able to save years of development time and millions of dollars.

Thumbnail 2080

Thumbnail 2100

So my name is Ron Isenberg. I'm a principal software architect at CyberArk in the platform engineering division. I'm an AWS Serverless Hero, and I have a website called RunTheBuilder.cloud where I talk about AWS serverless and platform engineering. So over the years, within the platform, we've created multiple generic SaaS services and products across multiple planes.

For example, in the application plane we've created our shell service, which is our UI that loads all sorts of iframes for the different SaaS services that we have, and we've created an audit service which is a centralized audit service that shows audits from all of our services. In the more traditional platform engineering plane, the control plane, we've created our tenant management service or the customer onboarding service, license services and such. And even in the data plane, we've created our centralized Pub/Sub for service to service communications.

But all of these services are built on top of what I like to call the pillars of impact. Now this can be SDKs, blueprints, automations across very important and critical domains: observability, security, governance, automation, and developer experience. And my goal and the platform engineering team's goal is to help other teams at CyberArk, the other service teams, to build their services and use the same pillars of impact.

Thumbnail 2170

Thumbnail 2180

Enterprise-Grade Serverless at Scale: Reducing Service Creation from Five Months to Three Hours

So let's talk about scaling enterprise grade serverless services. How do we build them at a grander scale? So this might seem familiar because this is actually what we're building. When we start a new serverless service, we start with the backend service, and this is exactly what Anton was showing a couple of minutes ago. We have some CRUD API or some entity. We have an API gateway that invokes a set of Lambda functions that read and write from a DynamoDB table.

Thumbnail 2200

But I did mention that it needs to be enterprise grade, right? So now we need to think about all of these elements. Infrastructure as Code: I need to write my CDK code with all the best practices to spin up all these resources. I need to have my pipeline take me from PR to production through multiple gates, multiple environments: dev, test, stage, pre-production, production. These are different AWS accounts in our case, and even multiple regions. We deploy to dozens of regions. Then you have to learn the best practices: hexagonal architecture, input validation, tenant isolation libraries that we've created at CyberArk. Then you have testing: unit tests, integration tests, and end-to-end tests. Observability: we have numerous observability libraries. So it's a lot of work, right? And this is a very simple backend microservice.
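To make the backend piece concrete, here is a minimal sketch of what one such CRUD Lambda handler might look like, with input validation separated from the transport layer in the spirit of the hexagonal architecture mentioned above. The payload shape, field names, and status codes are invented for illustration; CyberArk's actual libraries are not shown.

```python
import json


def parse_item(body: str) -> dict:
    """Validate the incoming payload before it reaches the domain layer
    (a stand-in for the schema validation the talk describes)."""
    item = json.loads(body)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(item.get("name"), str) or not item["name"]:
        raise ValueError("'name' must be a non-empty string")
    return {"name": item["name"]}


def handler(event: dict, context: object) -> dict:
    """API Gateway proxy integration entry point for a 'create item' call."""
    try:
        item = parse_item(event.get("body") or "{}")
    except ValueError as exc:
        # Reject invalid input at the edge with a 400 instead of
        # letting it reach the domain or persistence layer.
        return {"statusCode": 400, "body": json.dumps({"error": str(exc)})}
    # ... persist to DynamoDB via a repository port (omitted) ...
    return {"statusCode": 201, "body": json.dumps(item)}
```

The point of the separation is that `parse_item` and the (omitted) repository can be unit tested without any API Gateway event plumbing, which is where the unit, integration, and end-to-end test layers mentioned above come in.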

Thumbnail 2250

But it doesn't stop there because we also need a user interface. Think of this new service as a single web page application. You have a table that shows items from the backend. You can read, write, and change the items. So we're going to add CloudFront for distribution. We're going to add our S3 bucket with the static React files. But again, we need all the best practices: the Infrastructure as Code, our CI/CD pipeline, our frontend best practices. We need to think about input validation, our integration with the backend, error handling, accessibility, testing, localhost testing, Cypress, Playwright implementations, and even telemetry. We're using Mixpanel, so now we need to add our Mixpanel SDK so we know what our customers are doing in the UI, so we know which features we want to implement more and expand and which we can drop.

Thumbnail 2310

And I did mention that this is a SaaS service, so it gets even more complicated. So now we need to integrate with our SaaS control plane, and at CyberArk that means we need to integrate with the tenant management service. We need to have our own subdomain for our service, our own hostname and TLS certificate for our customers to send our service API requests. And we need to have cross-account access. Maybe we have some SQS that needs to subscribe to an SNS topic in a different account, or maybe we need to do role delegation and assume a role in another account to access an IAM-protected API gateway. These are things that we need to do out of the box.

Thumbnail 2340

So what I'm trying to say here is that enterprise grade is complex. It's complicated. It's a lot of work. One microservice is just not enough. You need multiple of them, right? You need to integrate with other SaaS components. You need to deploy to multiple AWS accounts: dev, test, stage, pre-production, production, multiple regions. And you need to do all of these by following service and enterprise best practices and libraries and tooling.

Thumbnail 2370

So before we had our amazing automation and tooling, we saw from surveys and estimations that it would take a senior engineer about five months or 23 to 25 weeks to implement these microservices with all the tooling and all the SDKs and all the best practices out of the box. But now that we've created our automation, it takes about three hours, just three hours.

Thumbnail 2400

So let's see this automation in practice. Let's see the developer experience. So our goal is to build a new service, the OneClick service, and it's going to have our CRUD API, all the best practices, our frontend,

Thumbnail 2420

Thumbnail 2440

the CyberArk unified UI, the look and feel, and we want to connect it to our control plane right out of the box. So all of these things are going to be part of our new service. For us, it always starts with a developer portal. For those who don't know, this is Port. We're using it. It's a service that we integrated with. You go to self-service at the top, you click, and then you choose create new business service, and you have this form. Now you need to fill in your service ID in the SaaS control plane, your service name (this appears for the customers in the UI), your name, your GitHub organization, and things like that, and then you click on start. That's where the magic happens.

Thumbnail 2460

Thumbnail 2480

Then three hours later, you are greeted with new components and integrations that are provisioned. We have six new GitHub repositories that we scaffolded and deployed to AWS. And now all it takes is to create a new tenant with our new service and log in as the customer and see the new service in action. So here we can see the new service in the CyberArk UI, and you can see on the left we have the application picker, our SaaS application picker, and we have our new service with the default icon. But here in the middle we can see all the items from the backend. So we fetched all the items from DynamoDB with the table, so it works. Our integration works, and even in the title where you can see FF enabled, that's actually a feature flag that we got from AppConfig via the backend API calls.

Thumbnail 2510

Thumbnail 2540

So what did we see? We saw six blueprints. They have all these best practices, security, observability, all these toolings that are baked in. We have our frontend, our backend, our feature flags configuration, our CloudFront distribution, our tenant management integration, and these are all deployed to four different AWS accounts: dev, test, stage, and integration, where in integration we deploy to two regions out of the box. Now, from the developer perspective, that's amazing, because now they get all this heavy lifting out of the box, and they can focus just on the business domain. They can take what we built, what we gave them, and just expand on that. As Werner says, go build.

Thumbnail 2560

Now these are some basic automation developer experience tips from our use cases. Your automation needs to be simple to use with minimal prerequisites, and I cannot stress this enough, minimal prerequisites, otherwise people will get it wrong, get confused, and get very frustrated. It needs to be retryable because people do make mistakes. It needs to be customizable because one blueprint will not fit everybody's needs, and you need to have the ability to delete failures. You also need to have ongoing maintenance on all the blueprints, but not just a single blueprint, but the entire process. You need to create all these one-click services from time to time, so you can see that this process runs from the beginning to the end.

Thumbnail 2600

Thumbnail 2610

So now I want to talk about architectural blueprints, which are a different type of blueprints. So as Anton mentioned, we're using CDK Python, and in this example we're creating a Python library that contains CDK constructs that encapsulate these black box architectural patterns, and these are versioned. We document them on GitHub Pages, we have release notes around them, and they're easy to update and use. So a few examples: we have our Lambda function with dynamic provisioned concurrency, meaning the provisioned concurrency setting goes up and down according to traffic shifts. We have a secure S3 bucket, which is basically a bucket whose configuration our security architects have given the thumbs up. We have the classic SQS queue with DLQ. You've got to have those. We have our CMK in KMS where we can sign and encrypt messages. We even have a DynamoDB table that works seamlessly with our tenant isolation library for idempotency use cases, and we have our WAF ACL association that knows how to connect to our WAF that came from our centralized Firewall Manager.

Thumbnail 2680

So this is an example of actual code. This is code that I copied from one of the constructs. It's a very simple construct. It's our S3 bucket that we use across the organization. So you can see that in line thirteen we enforce the SSL communication. We set the removal policy to retain in production environments because we don't want to delete our customers' data by mistake. We block public access and we enable encryption. It's very effective, and all of our developers can use it across their services.
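The construct described above is CDK Python; as a library-free sketch of the same guardrails, the function below emits the CloudFormation-level settings such a construct pins down. Property names follow the `AWS::S3::Bucket` schema; the SSL enforcement shown in the construct additionally becomes a bucket policy denying non-TLS requests, which is omitted here. The function name and `env` parameter are illustrative, not the construct's real API.

```python
def secure_bucket_resource(env: str) -> dict:
    """CloudFormation resource fragment mirroring the secure-bucket
    defaults: retain data in production, block public access,
    encrypt at rest."""
    return {
        "Type": "AWS::S3::Bucket",
        # Never delete customer data by mistake in production.
        "DeletionPolicy": "Retain" if env == "production" else "Delete",
        "Properties": {
            "PublicAccessBlockConfiguration": {
                "BlockPublicAcls": True,
                "BlockPublicPolicy": True,
                "IgnorePublicAcls": True,
                "RestrictPublicBuckets": True,
            },
            "BucketEncryption": {
                "ServerSideEncryptionConfiguration": [
                    {"ServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
                ]
            },
        },
    }
```

Baking these defaults into one shared construct means no individual team has to remember the public-access or encryption settings, which is exactly the point of the blueprint approach.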

Thumbnail 2710

Thumbnail 2720

Now, it's 2025 and things are changing as you've noticed, and the platform engineering domain is also evolving. Now, if you recall this diagram from before, I'd argue that now in 2025, we need to add another pillar of impact. Yes, it's agentic AI.

Thumbnail 2730

Thumbnail 2750

Now, let's think about the following problem, and this is a real problem that we have. So we have a requirement for three or four developers from different services to build an MCP server and expose their service APIs through it. Now, it might be fine, but what's the problem here? The problem is that they basically need to reinvent the wheel. It's a new world. It's a new domain. You need to think about authentication and authorization, and they're doing this all in parallel. They need to write the CDK code. They need to write the CI/CD pipeline. They need to learn how to test with MCP clients and figure out observability. So this can cause duplicated efforts, architectural inconsistencies, and maybe even technical debt.

Thumbnail 2780

But there's a better solution, a very simple solution. Let's use platform engineering blueprints. So this is something that I actually had the pleasure of writing and open sourcing, by the way. We have our own MCP server blueprint. We have our Amazon API Gateway with a WAF connected to a Lambda function that runs the Lambda Web Adapter extension. We run the Python FastMCP server, and then we have some examples of MCP tools and resources that our developers can use, extend, do whatever they want with. But on top of that, you get Infrastructure as Code, a CI/CD pipeline, security best practices for this new domain, testing, and observability. So now our developers can just scaffold this blueprint and create their own new MCP server, and that's great. They get all this tooling out of the box and it's easy to use. Back to you, Anton.

Thumbnail 2840

Best Practices and Key Takeaways for Platform Engineering Success

Thanks, Ron. All right, yeah, so you saw a real world implementation of what these practices look like. You probably noticed that it takes time. It's not perfect. It's not something you do within a few days. It's a process. You evolve over time. You add more functionality based on what your consumers, your development engineering teams are looking for.

Thumbnail 2870

So some best practices that we've seen as very efficient when adopting this approach. Don't boil the ocean. Do not say I'm going to solve all the problems in the world with my blueprint. Identify a problem that is the most impactful in your organization. API Gateway, Lambda, DynamoDB, like 70% of teams are using this. Let's standardize it and let's provide flexible customization for that. Identify what you can solve realistically, not something that will take two years to implement. Solve that problem, measure success, evolve, iterate. You'll probably notice it sounds like building a product, because it is. Don't treat it as a weekend project. It is a product. As you've noticed, they started with 15 people, over 100 people are working on this now, and they're saving years of development time for the whole organization. So focus and prioritize. Don't spread yourself too thin.

Thumbnail 2920

One size doesn't fit all workloads. Obviously, your amazing blueprint will immediately get feedback from engineering teams, and no, we cannot use that for whatever the reason is. Trust me, it's going to happen. Listen to this feedback, ideally collect it early, and make sure that you address it and you provide customization. So for example, your teams, they want to be flexible on the runtime, cool. They want to be flexible on TLS configuration, awesome. Collect that feedback and implement it into your blueprints because there's never going to be a product that is suitable for 100% of consumers. I'm not familiar with such a product, but try to cover large scope and make your blueprints customizable.

Thumbnail 2970

Documentation and education are key. So you're building a product for people to consume. Make it easy to consume. Dropping a piece of code on GitHub and saying go clone it, that's not good enough. It might be good enough for a very small part of your organization. But if your engineering teams need to invest more time into understanding your blueprint than building it themselves, that's not going to work. Make it easy to consume, set up education sessions in your organization.

Thumbnail 3030

When you're building a product for your organization, you need to prove the value of that product. Does this approach work? Does the blueprint work? Yes, it does. We have examples. But you need to make sure your organization is educated and they know how to use it. The good thing is generating documentation with Generative AI today is really simple. This is probably the most important thing. If you believe in "we're going to build it and the users will come themselves," well, no. Build with your customers. Before you start building, validate that this is actually solving a real problem. Don't try to talk to every single team. Don't boil the ocean. Find a small subset, go to them and say, "Hey, I think I have something that you will benefit from, and we can work on that and start solving your problems." Build with your customers. In this scenario, customers are the engineering teams that you as platform engineers are trying to help. These are the general best practices that we've seen that actually work quite well when starting to adopt this approach.

Thumbnail 3080

Thumbnail 3090

With that, we have several other sessions that we would like to recommend. Some of them are actually in the past, but it's all going to be on YouTube. There's one on Thursday. If you're not familiar with Serverless Land, anyone here familiar with Serverless Land? This is a website maintained mostly by our developer advocacy team. It has a lot of templates, blueprints, and tutorials, hundreds of things that you can use. I highly recommend you explore that one. We have weekly office hours on YouTube and Twitch, so there's a lot of good information there. Ron and I are going to be right here if you have any questions. We're not going to run away. We're happy to take any of your questions once we're done in about 30 seconds.

The last thing I promised is a giant QR code with everything you saw today: source code, slides, everything, videos, and more. Don't forget to complete the survey in the application, and we hope this was helpful. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
