Kazuya
AWS re:Invent 2025 - From Reactive to Proactive: Infrastructure governance by design (COP352)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - From Reactive to Proactive: Infrastructure governance by design (COP352)

In this video, David Killmon and Sefi Avrech demonstrate how to implement proactive controls for infrastructure as code using CloudFormation Hooks and AWS Control Tower. They explain how Hooks prevent non-compliant resources from being deployed by evaluating CloudFormation Guard rules before resource creation. Through live demos, they show creating Hooks with pre-written Guard rules from GitHub, testing them against AI-generated templates, and fixing compliance violations for S3 buckets, DynamoDB tables, and Lambda functions. They demonstrate Terraform integration using the AWSCC provider and Cloud Control API, and show how to deploy Hooks across organizational units using Control Tower's Control Catalog for multi-account governance with framework-specific controls like NIST compliance.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction: Infrastructure as Code Governance and the Power of Proactive Controls

Hi everybody, happy Thursday, unofficial last day of re:Invent. How are you all doing? Okay, alright, that's the energy. Okay, I was actually watching some of the talks that my coworkers presented on Monday morning at 8 o'clock and there was nothing, there was silence. That was so devastating for me to watch. This will not be like that. We are friends up here and we're going to go on a journey today through infrastructure as code and governance and it's going to be very fun.

But before we get into all of that, I do have some questions for you all because this is an us situation. This is not a breakout session. They won't let me do those anymore. This is an us session, so if you have any questions at any time, you just say David or Sefi, and we'll be like, what's up? Speaking of David, I'm David. My name is David Killmon. I'm a Principal Engineer in AWS, specifically infrastructure as code, so things like CloudFormation and Terraform and CDK.

A quick survey: who here uses infrastructure as code? Oh, that is a beautiful 100 percent. We actually can't see up here because of the light, so the narrative will be changed to suit the talk. A quick survey, who uses CloudFormation? Okay, I see you, I see you. Who uses Terraform? Nice. Who uses both? Oh yeah, I like this guy. Cool. Any Pulumi? Any Pulumi people? They have the best swag, that's why I ask. Okay, anyway, so my name is David Killmon. I am a Principal Engineer on CloudFormation, and today we're talking about infrastructure as code governance, making deployments of IaC safe. And I am joined today by Sefi. This is my friend.

Hi, nice to meet you all. I'm Sefi Avrech. I'm a Senior Solutions Architect. I work with governance, compliance, and security. This is the side that I'm covering here, so let's get started. Okay, so we're all familiar with this pattern. You write some code, some CloudFormation or some Terraform, you push it to Git, and then the CI/CD pipeline starts running, right? And you have your resources deployed inside your AWS account. We're all familiar with it, right? This is how we work. Great.

But how do we implement some guardrails around this situation? This is what we're going to talk about today. We will run through this slide quickly so we can go to the code directly. So developers are moving very fast, right? You all use, I believe, generative AI to generate code to help you with the infrastructure. And we need something to help you stay within the guardrails of your organization: the security requirements, cost requirements, even machine image requirements. So let's see if the car is working. This took us a few hours to put together. It's working. Inside the guardrails. Yeah, please hold the applause until the end.

So this is where CloudFormation Hooks come in to help us, huh? Ooh dude. It's so good we want to show you twice. You make it so good. Ah, yeah, see what automation don't do cops do. So on the slide you see a simple CloudFormation template that is going to create a resource. And it's going to fail, okay, it's going to fail because it does not meet the organizational requirements. And now I'm passing to David, who will give you some use cases.

Cool. So whenever we talk about failing things, Sefi always sends it back to me, so yeah, today we're going to talk about infrastructure as code and how proactive controls can stop bad things from happening. So we're going to talk about Hooks. Hooks is a CloudFormation feature. I work on CloudFormation. They sign all of my paychecks, but we also realized that infrastructure as code spans beyond just CloudFormation. So we're also going to talk about Terraform in a bit. So for the first part of this talk we're going to do a lot of demos in CloudFormation, but Sefi's going to show you how things work in Terraform as well.

Anytime I type terraform apply on my Amazon-issued laptop, it lights on fire. So Sefi will be doing the Terraform stuff, but the root of it is that proactive controls are the same across all of IaC, right? And what are proactive controls for? You know, I think probably the most common and most important reason we talk about proactive controls is around things like safety and governance. And we want to stop bad things from happening basically, right?

Why Proactive Controls Matter: Real-World Lessons from Production Incidents

So if you look at this slide, it's a really simple slide, right? On the left is a CloudFormation template, and this is the simplest CloudFormation template I could write. It just creates an S3 bucket.

On the right, you can see that if I tried to provision this bucket, it actually fails because there's no bucket encryption enabled, and that's totally fine. I have a quick question for the audience. Who here considers themselves security people, like security is part of my responsibility? Cool. How about platforming people, like the platform team? Okay, cool. Yeah, so we see this a lot. I'm an engineer, and I don't know if you can tell, but I kind of just do stuff very fast. I basically try to find the fastest way to make things work, and I don't always think about the platform. I always think about security, but sometimes I omit security in the name of speed, like this bucket.

I think all of us in security and platforming teams are familiar with this trade-off. Developers want to go really fast, but we who own the platform or get paged for security incidents don't want to sacrifice the platform security or governance or compliance. You know, tons of organizations have things like FedRAMP compliance or NIST compliance, all these compliance packs which require you to always put on bucket encryption or enable object lock on your S3 bucket. These are really nuanced configuration options, and we don't really want to have that conversation with developers every time they want to create an S3 bucket, telling them that they have to turn on SSE encryption for their bucket. We want to make sure that they can move fast but not break our platform. So Hooks and proactive controls help stop bad things from happening before they happen.

I'm going to tell you a story. We're going to have a lot of stories, and they're all going to be a little embarrassing about me or our interns. A long time ago, does anyone remember when Lambda launched Function Endpoints? It was maybe two or three years ago. Does anyone here not know what Lambda is? It's totally fine. Do we count on Lambda? Okay, cool. I love Lambda. re:Invent is always very fun for us as well, because this is the first time we're hearing about a lot of new features.

So I remember a few years ago at re:Invent, Lambda launched this feature called Function Endpoints. It basically lets you generate an HTTP URL that when someone calls it, like sends a GET request to it or a POST request or something, it invokes a Lambda function, and I thought this was super cool and it was really easy to set up. This was sort of back before things like Hooks existed, and so I was experimenting. I was trying to set up this Lambda with the function endpoint, and it was actually really fun. I had the Lambda function send a text message to my manager saying that you should give David a raise, and then I posted the URL into our organization's Slack channel and everyone kept clicking it. It was really a beautiful day, and it was really fun for me because I got to experiment with this new feature. I had a lot of fun with it.

The cool thing about function endpoints is that there's also a lot of configuration that you can do for security. You can say, hey, only this IP address range can call this endpoint, or it has to be within this VPC and it's private. Those are really cool, and I did not care about them because I just wanted to get it to work. I wanted to make a funny joke for my coworkers and send this link out, so I did that. It was fun, and then I was going home. I was on the train listening to, I don't know, some podcast or something, and I got this alert. I thought it was an amber alert, but I was actually getting paged, and I was like, I am not on call. Why am I getting paged?

It turns out we had this detective control system that scanned our AWS accounts every hour or so to make sure that all of our resources were compliant. It turns out our security team had decided that a globally accessible function endpoint that costs money when it is invoked, that's just accessible by anyone, was not compliant with security. So I had to get off the train and go find a coffee shop and read this ticket that I got paged for and go disable this, delete this function. I was really mad, actually. This is the thing that radicalized me on proactive controls. This is why I care about it so much, because I was like, how did you just let me create something that you knew was wrong?

So that's where proactive controls come in. That experience for developers kind of sucks. If you create a resource and it is misconfigured and you just let it happen, that change can propagate through pipelines. It can go to production. Reactive controls and detective controls are amazing. They always catch things, but proactive controls are better because they can stop things before they happen. So that's why I love proactive controls, and that was just a long story to say that proactive controls are really good for governance. Another thing that proactive controls are really helpful for is stopping you from making accidental mistakes.

Preventing Costly Mistakes: AutoScaling Groups, COE Culture, and Cost Management

In this example, I have a CloudFormation template with an AutoScaling group, and my AutoScaling group had a minimum of one EC2 instance and a maximum of 100 EC2 instances. In CloudFormation, you can create this thing called a change set, which is very similar to a Terraform plan, which basically just says, "Hey, I'm transitioning from this state to this other state."

So in this case, I had a Hook that was running that when I tried to make this change, failed. It failed because the Hook had logic in it that said, "Hey, whenever some sort of capacity change happens on a bunch of resources, so like CPU, memory, IOPS, AutoScaling group sizes, whenever those changes happen and they exceed 50%, I'm just going to fail," right? Because this is production. You're changing things really drastically.

So in this example, it actually wasn't me. I didn't make the change. I'm not the villain of this particular story. We had an intern. This intern was so smart and much smarter than me. One of the things that our intern was doing was they really wanted to deploy our service to their own account. We give all developers at Amazon an AWS account that they can do anything with. So they wanted to deploy our service to their own personal account, and I was so excited because I was like, "Yay, bias for action. This is fun. You're going to experiment."

This intern also was very frugal. They were like, "I don't need 100 EC2 instances being spun up in my account," so they actually changed their template to make it so that the number of EC2 instances that could be provisioned was 10 instead of 100. Through a series of unfortunate accidents, they accidentally committed this change to our repository, and our pipeline picked it up. Our pipeline started to try to deploy it to our beta stage, and then that's when this Hook was invoked and stopped that change from rolling out, which is great because if our service, you know, CloudFormation powers hundreds of thousands of deployments, so if we had only 10 EC2 instances to do that, that would have been a very difficult day for our on-call.
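
To make the capacity check concrete, here is a minimal Python sketch of the kind of logic the Hook in this story is described as running. It is not the team's actual Hook and it is not Guard syntax. The function name, the property label, and the handling of a zero baseline are assumptions made for illustration; only the 50% threshold and the 100-to-10 example come from the talk.

```python
from typing import Optional

# Threshold described in the talk: reject capacity changes larger than 50%.
MAX_RELATIVE_CHANGE = 0.5


def capacity_change_violation(name: str, old: float, new: float) -> Optional[str]:
    """Return a failure message if `new` differs from `old` by more than 50%."""
    if old == 0:
        # Assumption: with no baseline to compare against, let the change through.
        return None
    relative_change = abs(new - old) / old
    if relative_change > MAX_RELATIVE_CHANGE:
        return (
            f"{name}: capacity changed from {old} to {new} "
            f"({relative_change:.0%}), which exceeds the {MAX_RELATIVE_CHANGE:.0%} limit"
        )
    return None


# Shrinking the maximum from 100 to 10 is a 90% change, so it is rejected.
print(capacity_change_violation("AutoScalingGroup.MaxSize", 100, 10))
# Growing the maximum from 100 to 120 is a 20% change, so it passes.
print(capacity_change_violation("AutoScalingGroup.MaxSize", 100, 120))
```

The check is symmetric on purpose: a large increase can be as surprising in production as a large decrease, so both directions are treated as a reason to stop and ask.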

The other nice thing about this story is that Amazon has this culture called COE culture. Are folks familiar with COEs? They're just basically when something goes wrong, you write this doc that's like, "What went wrong? Why did it go wrong?" And most importantly, "How do we stop it from happening again?" The reason we had this Hook written for our team was not because I'm really wise and a wonderful platform engineer and foresaw this, but because we had a COE a couple of years ago where another intern had done something very similar.

So the nice thing about our COE process is we have these tenets that action items should not be best intentions, right? So an easy action item is like, "Update a runbook," so that when our intern comes back, they'll follow the steps to make sure that they do the right thing. We call those good intentions, and sometimes they're fine, but we are always searching for mechanisms. How can we enforce that this bad thing does not happen again? So with things like Hooks, you can take your learnings from different types of outages, like over-eager interns or maybe careless engineers, and codify them and enforce them in your IaC so that these mistakes don't have to happen again.

Okay, I'm going to go over one more use case that my manager insisted I put in as a punishment. So the other great thing that proactive controls can do is make sure that you don't spend too much money. Sometimes engineers provision extremely large EC2 instances because they're not responsible for their AWS bill, and they just leave them in their account and run up huge bills. That is something I have done, so now we have a Hook, on our team only, that makes sure we can only deploy EC2 instance types from an allow list. My manager made me do that because we had a huge bill.

But the actual practical reason why folks would do this a lot is that a lot of organizations do a lot of negotiation with Amazon to make sure that they have Savings Plans or certain EC2 instances that they get discounts on. That's really hard information to trickle down to engineers or people who are building platforms, to know that, "Hey, you should only use this particular type of instance because we get a discount on it." So that kind of information can be codified in a proactive control that says, "Hey, you didn't do anything wrong. It's just that there's a better instance type that you can use that we get a discount on."
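
As a rough illustration of the allow-list idea, the following Python sketch checks a CloudFormation-style resource against a set of permitted instance types and returns a friendly message that points to the alternatives. The instance types in the set and the function name are hypothetical; a real list would come from your organization's discount agreements. This is a stand-in for the logic, not the Hook the team runs.

```python
from typing import Optional

# Hypothetical allow list; a real one would reflect your organization's
# negotiated discounts or Savings Plans coverage.
ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small", "m5.large"}


def instance_type_violation(resource: dict) -> Optional[str]:
    """Return a message if an EC2 instance uses a type outside the allow list."""
    if resource.get("Type") != "AWS::EC2::Instance":
        return None  # The rule only applies to EC2 instances.
    instance_type = (resource.get("Properties") or {}).get("InstanceType")
    if instance_type in ALLOWED_INSTANCE_TYPES:
        return None
    # Assumption: a missing InstanceType is also reported, so the choice is explicit.
    allowed = ", ".join(sorted(ALLOWED_INSTANCE_TYPES))
    return (
        f"InstanceType {instance_type!r} is not on the allow list. "
        f"Use one of: {allowed}"
    )


big = {"Type": "AWS::EC2::Instance", "Properties": {"InstanceType": "x1e.32xlarge"}}
small = {"Type": "AWS::EC2::Instance", "Properties": {"InstanceType": "t3.micro"}}
print(instance_type_violation(big))
print(instance_type_violation(small))
```

Listing the allowed types in the message matters more than the check itself: the developer learns what to use instead without having to ask the platform team.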

So proactive controls can help you with things like governance and security. They can help codify best practices and learnings from COEs. They can help you make sure that you trickle down organizational best practices like cost best practices. So they're really useful.

I think they're super valuable in the Infrastructure as Code world, and I think in CloudFormation and Amazon specifically, we're only just in the past couple of years really understanding how valuable these are. So I'm excited for us to get together and experiment with them.

Building Your First Hook: CloudFormation Guard and the Guard Rules Registry

Cool, so you thought that it was cool that I was talking for so long, but now we're actually going to code because this is a code talk. In this code talk we're going to do a couple of things. I'm going to teach you a little bit about CloudFormation Guard. This is our policy-as-code DSL. There's nothing really special about Guard here. You can replace it in your mind with any policy-as-code language that you're familiar with, like Sentinel or Rego or Checkov, whatever you're comfortable with. Conceptually they all do the same thing. I'm going to show you how to build a Hook using Guard. We're going to go through a beautiful journey of trying to deploy a CloudFormation stack, a process that I'm sure we've all experienced and enjoyed a lot. And then I'm going to show you how we can skip all of that and just use a couple of clicks to build a Hook using something called the Control Catalog. And of course, I will show you how to make it work in Terraform too, which is going to be fun and delightful. But now we're actually going to do all that stuff.

Cool, so, yeah, switch. Okay, I'm going to assume my official coding position, which is sitting on this chair with my knees. Cool, yeah, let's get into it. Let's build stuff.

Okay, so also Amazon has this thing called Connections where every day they ask us a question and you cannot stop it from happening and it always pops up during presentations, so I'm going to move this over here and we're just not going to talk about it anymore. Okay, cool, so let's set up some proactive controls for Infrastructure as Code and show how they work. All right, so let's go to CloudFormation, which is my favorite Infrastructure as Code service that's natively hosted on AWS. I talk a lot of smack about Terraform, but I actually love it, so please don't take it personally. I'm going to close every single notification that came up because we are very eager this year to launch a lot of features.

And I'm going to go to our Hooks section. So this is our Hooks page. This is a three-step picture that explains what I took 20 minutes to explain. So this is always delightful if you ever forget and don't feel like listening to this talk again. It just shows how a Hook basically sits in front of CloudFormation doing something and evaluates if that thing is good or bad. If it is bad, it stops that bad thing from happening.

So I'm actually going to go down here and we're going to create a Hook and we're going to use Guard because that's my DSL of choice. And okay, so over here it tells us, hey, you have to upload your Guard rules to S3. And so this is a code talk, so I'm supposed to be coding, but I'm remarkably lazy, so we're not actually going to write these rules ourselves. So there's this community on GitHub of pre-written Guard rules which are super cool. So I'm actually just going to go steal this from our Guard rules registry.

So this is GitHub. If we go into the rules, you can see that there's a bunch of different AWS services, so like Amazon S3. This is one of my favorite ones, so I'm going to click it and you can see that there's a bunch of different files that each map to a policy. So we have S3 bucket default lock enabled, S3 bucket logging is enabled. Let's see, what is a good one? Oh, bucket versioning enabled, that's kind of fun. I already took a few of these, but I just wanted to show you there's a bunch of different services over here. Let's see, IAM exists, which is always a fun one. IAM is probably the place where you can do the most damage from a security point of view. So there's a lot of really fun policies written already for IAM, for example, no admin access, which is delightful. We should not have admin access for a lot of things.

So you know, what I did earlier for demo magic purposes is I cloned this repository locally and I took a couple of rules from that repository, so I have some DynamoDB, some IAM, some Lambda, some S3, and I just zipped it up. And then I uploaded it to S3. We never know what the Wi-Fi situation is going to be in these rooms, so I try to pre-buffer as much as I can.
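
The packaging step described here can be sketched in a few lines of Python using only the standard library. The directory layout, file names, and archive name below are hypothetical placeholders, and the upload to S3 is deliberately left out because it depends on your account and tooling. The sketch only shows collecting `.guard` files into one zip archive.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def zip_guard_rules(rules_dir: Path, archive_path: Path) -> List[str]:
    """Zip every .guard file under rules_dir and return the archived names."""
    archived = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for rule_file in sorted(rules_dir.rglob("*.guard")):
            # Store paths relative to rules_dir so the archive has no temp prefix.
            arcname = rule_file.relative_to(rules_dir).as_posix()
            archive.write(rule_file, arcname)
            archived.append(arcname)
    return archived


# Demo with placeholder files so the sketch runs without cloning the registry.
with tempfile.TemporaryDirectory() as tmp:
    rules = Path(tmp) / "rules"
    for name in ("s3/bucket_lock.guard", "dynamodb/table_encrypted.guard"):
        path = rules / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("# placeholder rule body\n")
    print(zip_guard_rules(rules, Path(tmp) / "guard-rules.zip"))
```

Filtering on the `.guard` extension keeps READMEs and test fixtures from the cloned repository out of the archive.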

I uploaded my Guard rules to S3 earlier, so I'm going to select this. And I'm going to tell Guard to write a bunch of logs whenever something fails, and we're going to call this something. This is fun. Does anyone have a fun name for this Hook that we should call it? I'm thinking very good Hook. Does anyone have any other suggestions?

I heard someone say gotcha hook. I like that because it's like pew pew, I got you. So we're going to call this the gotcha hook. Hooks can run in a lot of different places: before a CloudFormation stack is updated, through Cloud Control API, or at change sets. For this one we're going to just do resources. So any time CloudFormation goes to provision a resource, it's going to run these Guard rules to make sure that resource is compliant. And I'm going to tell hooks to run this whenever a resource is updated or created.

There's this thing called a hook mode. I'm going to change it to fail. Basically any time this hook evaluates to false, it blocks something. You can also put a hook in warn mode, so a hook can fail and it doesn't actually block anything. This is really good if you're authoring new rules and you want to test them out without necessarily blocking people, but for this we're going to block. And I'm going to create a new role called the gotcha role. You can have hooks be really specific about exactly when they run. I'm just going to have it run all the time. And I'm going to create this hook. Okay, so I created this hook. It's running in my account now.
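
The difference between fail mode and warn mode can be modeled with a small Python sketch. This is a simplified model of the behavior described in the talk, not the CloudFormation API. The function name, the mode strings, and the returned messages are assumptions chosen for illustration.

```python
from typing import List


def hook_decision(failure_mode: str, failed_rules: List[str]) -> str:
    """Model how a failed evaluation is handled in FAIL versus WARN mode."""
    if failure_mode not in ("FAIL", "WARN"):
        raise ValueError(f"unknown failure mode: {failure_mode}")
    if not failed_rules:
        return "proceed"  # Every rule passed, so the mode does not matter.
    if failure_mode == "FAIL":
        # Fail mode stops the operation before the resource is provisioned.
        return "blocked: " + ", ".join(failed_rules)
    # Warn mode records the failure but lets the operation continue.
    return "proceed with warning: " + ", ".join(failed_rules)


print(hook_decision("WARN", ["S3_BUCKET_DEFAULT_LOCK_ENABLED"]))
print(hook_decision("FAIL", ["S3_BUCKET_DEFAULT_LOCK_ENABLED"]))
print(hook_decision("FAIL", []))
```

A common rollout pattern that follows from this is to ship a new rule in warn mode first, watch what it would have blocked, and switch to fail mode once the noise is understood.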

Fun question, has anyone here used Gen AI for IaC stuff yet? Okay, cool. Yeah, it's fun. It's fun. I also use Gen AI for IaC stuff for this talk. I asked Kiro to generate for me a CloudFormation template. And my prompt to Kiro was basically like I want a really simple CloudFormation template that creates a Lambda function, a DynamoDB table, and an S3 bucket, just to create a really simple application that writes to DynamoDB and pulls files from S3. And this is the template that it gave me. It created a Lambda execution role. This is the role that Lambda will use to do stuff. It created a Lambda function, so a simple inline function using Node.js. Sure, that sounds good. It created an S3 bucket with no properties, fine, that is a correct bucket. And it created a DynamoDB table. And the thing that I like about this template that it generated is it tried a little hard, right? Like it turned on point in time recovery for me which is a really nice gesture. I do appreciate that because you don't want to accidentally delete your DynamoDB table and realize you don't have point in time recovery turned on. So this is a great template, and I think maybe as I was scrolling through this, the security people may have lifted an eyebrow or maybe folks realized that I was alluding to certain things earlier with this S3 bucket, but we're going to try to deploy it to CloudFormation and see what happens now.

Iterative Debugging with Hooks: From Stack Failures to Change Set Validation

So I'm going to take this template over to CloudFormation. Oops, sorry. Cool. Next. Yeah. Oops, sorry. Sometimes the AI doesn't always create perfectly valid CloudFormation. Okay, so that one works. Okay, we're going to call this "thiswillprobablywork." And I'm going to just power through all these prompts and tell it to do stuff and it's going to start creating stuff. Oh, does anyone have any feedback for CloudFormation? We'll get to that later. We love feedback. Okay, cool. So CloudFormation tried to create my stack, my DynamoDB table, and it failed. So if I scroll through all of what happened chronologically, it started creating this Lambda role and S3 bucket, but before it actually called the service APIs to create a bucket or create a Lambda function, it invoked our gotcha hook. And as you can see, the S3 bucket failed to create because the gotcha hook failed.

So if I click on this we can see that this hook failed with the message that we had a rule called S3 bucket default lock enabled that failed, and a rule called S3 bucket public read prohibited that failed. So I guess we're going to have to fix that. So if I go into the hook, you can see, all right, here are the details of this particular hook run. We had the Lambda rules and the DynamoDB rule and the IAM rules. They did not run because this is an S3 bucket, but the S3 ones did. And I'm going to click this S3 bucket default lock enabled real quick, and it tells us exactly what went wrong, right?

The error message tells us exactly what went wrong. It says the check was not compliant because the object lock property is missing from the S3 bucket's properties, and we have this helpful hint saying that the violation is that S3 bucket object lock was not set and we should set it to true to make this check pass. So just to show you what this rule looks like, let me see. Here it is. This is a Guard rule. This is all comments, but it's just describing what this rule does. This is my object lock enabled rule, and every time this rule runs it takes as input the CloudFormation resource that's getting provisioned. We say this rule applies if any resource is of type AWS S3 bucket. Our rule says that S3 bucket default lock must be enabled: when it has found a resource of type S3 bucket, assert that there is a property called object lock enabled and that the property object lock enabled equals true. You can see in this Guard rule I have this free-form text here. It says violation, and if we get here that means that the object lock enabled property did not exist and it needs to, so we can set this feedback. If the object lock enabled property existed but was not true, we say you have to set this to true, and you can see this is the data that pops up right here. You can be as specific or as cryptic as you'd like to be with your feedback, but this is a really good example. I think this is really good feedback for me because it's really clear what happened. It's looking for this object lock enabled property and it's looking for it to be true.
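
The structure of the rule described above (a type filter, an existence check, then a value check, each with its own feedback message) can be mirrored in Python. This is a stand-in to show the logic, not Guard syntax and not the registry rule itself. The function name and message wording are assumptions; `AWS::S3::Bucket` and `ObjectLockEnabled` are the CloudFormation resource type and property the talk refers to.

```python
from typing import Optional


def object_lock_violation(resource: dict) -> Optional[str]:
    """Mirror the described rule: S3 buckets must set ObjectLockEnabled to true."""
    # The rule only applies to S3 buckets; other resource types are skipped.
    if resource.get("Type") != "AWS::S3::Bucket":
        return None
    properties = resource.get("Properties") or {}
    # First assertion: the property has to exist at all.
    if "ObjectLockEnabled" not in properties:
        return "Violation: ObjectLockEnabled is not set. Fix: set ObjectLockEnabled to true."
    # Second assertion: the property has to be true (strict boolean in this sketch).
    if properties["ObjectLockEnabled"] is not True:
        return "Violation: ObjectLockEnabled is not true. Fix: set ObjectLockEnabled to true."
    return None


bare_bucket = {"Type": "AWS::S3::Bucket"}
fixed_bucket = {"Type": "AWS::S3::Bucket", "Properties": {"ObjectLockEnabled": True}}
print(object_lock_violation(bare_bucket))
print(object_lock_violation(fixed_bucket))
```

Separating the "missing" case from the "present but false" case is what lets the feedback say exactly which fix is needed.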

So let's go back to our template and help out a little bit. I'm going to add a property called object lock enabled, and it says it should be true. I don't know what object lock enabled means, but my platform administrator, who was very wise, said that we needed it, so I'll believe them. Actually I do know what object lock enabled is, and it is very cool. It's a really interesting feature, but critically, putting my platform hat on now, you can only enable object lock on S3 buckets when you're creating the bucket. So this is actually a really good rule because you cannot add that feature after the fact. We're going to try creating the stack again with my new and improved template, and we're going to call it This Will Def Work. You can see that I go through this experience pretty often, so I'm just going to power through the CloudFormation console and we're going to say submit. We're going to see again what happened. It failed. This is not too surprising. We knew some things were going to fail, but I just wanted to see what it looked like now.

So if we go back to our hook, you can see now our S3 bucket default lock enabled rule worked. It was fixed and it passed, but we still have this other rule that failed called S3 bucket public read prohibited. If I look at this one, it's a little more complicated. It was looking for this property called public access block configuration in the properties of this S3 bucket, and there's a couple of properties it wanted to be set: block public ACLs, block public policy, ignore public ACLs, and restrict public buckets. It wanted all of those to be true. I think buckets being public is bad in general, so as the platform teams and security folks, we are always really excited when folks want to make sure that their buckets can have absolutely no way of having public access. By setting all of these values to true, we can make that a reality.
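
A check of this shape, where several nested settings must all be true, can be sketched in Python as follows. Again this is an illustrative stand-in rather than the registry's Guard rule. The function name is hypothetical; the four flag names are the CloudFormation property names under `PublicAccessBlockConfiguration` on `AWS::S3::Bucket`.

```python
from typing import List

# The four settings the rule is described as requiring.
REQUIRED_FLAGS = (
    "BlockPublicAcls",
    "BlockPublicPolicy",
    "IgnorePublicAcls",
    "RestrictPublicBuckets",
)


def public_access_violations(resource: dict) -> List[str]:
    """Return the required flags that are missing or not true on an S3 bucket."""
    if resource.get("Type") != "AWS::S3::Bucket":
        return []  # Not a bucket, so the rule does not apply.
    properties = resource.get("Properties") or {}
    config = properties.get("PublicAccessBlockConfiguration") or {}
    # A flag that is absent counts the same as a flag set to false.
    return [flag for flag in REQUIRED_FLAGS if config.get(flag) is not True]


partial = {
    "Type": "AWS::S3::Bucket",
    "Properties": {"PublicAccessBlockConfiguration": {"BlockPublicAcls": True}},
}
print(public_access_violations(partial))
```

Returning the list of offending flags, rather than a single pass or fail, lets the developer fix all four settings in one edit.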

So let's go back to our S3 bucket. I have this public access block configuration. I'm just going to copy straight from this output and I'm going to say true. All right, cool, so I did that. That was fun, and I'm sure you all were really excited for me to go through the create wizard again, but we know there's more things that failed. If I go back to our template, you know, the DynamoDB table failed. The Lambda execution role failed.

Let's see, the DynamoDB table failed because the table must be encrypted. This is fun, and I enjoy coding with you all, but as a developer, the experience of waiting for my stack to create a resource and then iteratively trying to fix it is not great, right? I think this stack is really fun because it's serverless. It's using DynamoDB, it's S3, it's really fast. But if I was using something like ECS, which, when you create an ECS service in CloudFormation, it waits for the deployment to actually be done before it considers the resource successfully created, that can take a while. That can take a few minutes. If you're creating a CloudFront distribution, that also can take minutes, so your stack execution time can take a while, and you don't really want to find out that the template that you're building was not compliant only after minutes, like 20 to 30 minutes. We want to know early, as early as possible.

So this is fun. Instead of going through and iterating on this template one by one, trying to fix all these errors, I'm going to show you a fun trick. So we're going to go back to our hook, our gotcha hook. That was a really good name. I don't know who said that, but you should consider yourself really cool. So I'm going to go to our hook, and you can see for this hook all the different times it ran and when it ran. So it ran for this old stack, ran for this new stack. You can also see if it failed or not. Oh, they all failed, surprise. But I'm going to change this hook to actually run not just when we create resources but when we create a change set. Are folks here familiar with change sets? Raise your hand if you know what they are. Cool. Change sets are basically CloudFormation's version of like a Terraform plan. I'll show you about that rather than just describe it. I'll show you how it works.

Thumbnail 1930

Thumbnail 1940

Thumbnail 1950

Thumbnail 1970

So I'm going to go to CloudFormation. I'm going to upload my template. And change sets rule. And okay, so instead of clicking submit and having CloudFormation go try to provision everything, I'm going to create a change set, and this change set will show me, hey, here are all the things that would happen if you tried to apply that. So this change set says, hey, if you tried to apply this change set, you're going to create a DynamoDB table, Lambda execution role, Lambda function. But my change set is actually in a failed state because my hook failed, my gotcha hook failed, right? So this time, what's cool about this experience is that you get every single one of the rules that I ran. They run early, and you can see all the details of how to fix this stuff directly.

Thumbnail 1990

Thumbnail 2000

Thumbnail 2010

Thumbnail 2020

Thumbnail 2030

So if I go back to my gotcha hook, I can look at this hook run, and you can see, well, look, we fixed a bunch of stuff. We fixed the public read prohibited and we fixed the object lock, but now you can see in one place, hey, your DynamoDB table has to have SSE enabled and your IAM role has an admin policy, and the fix for that is remove it, which makes a lot of sense. So just for fun, we're going to, I'm going to show you, we're going to, through the power of movie magic, I've already fixed all of this actually, and I'll just show you ta-da how it works. Yeah, what happens in the delightful case? I'll show you. Cool. Yeah, so in this case you can see that my change set can actually run. I can execute it. You can see that our hook passed because we fixed it and it was truly delightful.
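The difference between fixing one failed resource at a time and getting every finding from a single change set can be sketched in plain Python. This is only a toy model, not how Hooks or Guard are implemented: the rule names and the `evaluate_template` helper are made up for illustration, and only the CloudFormation property names (`PublicAccessBlockConfiguration`, `SSESpecification`, `ManagedPolicyArns`) are real.

```python
ADMIN_POLICY_ARN = "arn:aws:iam::aws:policy/AdministratorAccess"


def s3_public_access_blocked(props):
    """Pass only if all four public access block settings are true."""
    block = props.get("PublicAccessBlockConfiguration", {})
    keys = ("BlockPublicAcls", "BlockPublicPolicy",
            "IgnorePublicAcls", "RestrictPublicBuckets")
    return all(block.get(key) is True for key in keys)


def dynamodb_sse_enabled(props):
    """Pass only if server-side encryption is explicitly enabled."""
    return props.get("SSESpecification", {}).get("SSEEnabled") is True


def iam_no_admin_policy(props):
    """Pass only if the role does not attach the AWS admin managed policy."""
    return ADMIN_POLICY_ARN not in props.get("ManagedPolicyArns", [])


# Hypothetical rule names, keyed by CloudFormation resource type.
RULES = {
    "AWS::S3::Bucket": [("S3_PUBLIC_ACCESS_BLOCKED", s3_public_access_blocked)],
    "AWS::DynamoDB::Table": [("DYNAMODB_SSE_ENABLED", dynamodb_sse_enabled)],
    "AWS::IAM::Role": [("IAM_NO_ADMIN_POLICY", iam_no_admin_policy)],
}


def evaluate_template(template):
    """Check every resource up front and return all findings in one list."""
    findings = []
    for logical_id, resource in template.get("Resources", {}).items():
        props = resource.get("Properties", {})
        for rule_name, check in RULES.get(resource["Type"], []):
            if not check(props):
                findings.append((logical_id, rule_name))
    return findings


template = {
    "Resources": {
        "Bucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {
                "PublicAccessBlockConfiguration": {
                    "BlockPublicAcls": True,
                    "BlockPublicPolicy": True,
                    "IgnorePublicAcls": True,
                    "RestrictPublicBuckets": True,
                }
            },
        },
        "Table": {"Type": "AWS::DynamoDB::Table", "Properties": {}},
        "ExecutionRole": {
            "Type": "AWS::IAM::Role",
            "Properties": {"ManagedPolicyArns": [ADMIN_POLICY_ARN]},
        },
    }
}

findings = evaluate_template(template)
for logical_id, rule_name in findings:
    print(f"{logical_id}: {rule_name}")
```

The bucket has already been fixed, so only the table and the role are reported, and both show up in the same pass instead of after two separate stack failures.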

Thumbnail 2050

Thumbnail 2070

Thumbnail 2100

Control Catalog and Compliance Frameworks: Enabling Hooks Without Writing Code

So that's kind of the experience that I have as a team lead of writing CloudFormation Guard to make sure things work. I want to show you one more thing real quick because CloudFormation Guard is really cool, but not everyone wants to write it. We have another hook called the Control Catalog, and this is meant for compliance checks. If you look, this basically has a bunch of prebuilt policies. So instead of having to write the Guard myself, Amazon owns and runs these rules, so we maintain them. I can basically say, hey, this bucket has to have S3 versioning enabled. It has to have server-side encryption. And Sefi is going to show you more about this later, but what's really cool about this is that these policies are also grouped by their compliance framework, right?

Thumbnail 2120

Thumbnail 2140

You can see that this rule applies to the NIST framework, and if you're excited about security, you could just create this hook with all of these proactive controls already enabled without having to write a single line of code. You can take this home and just enable this hook in warn mode with all these controls turned on and see what resources people in your organization are creating that are not compliant. Also, if you're working in a multi-account environment, you can run it on all accounts. How many of you are working with a multi-account environment from a security perspective? How many of you implement Control Tower? Oh cool. So it's delightful that I can create this hook in my account and play with it and experiment with it. But in reality, as Sefi is saying, you really want to apply this to organizational units and enforce it, so he'll show you how to do that later.

Also, I love Terraform and I've been talking a lot, so I'm actually going to hand it over to Sefi to show you how this can work and how you can use CloudFormation Hooks to also run policy and proactive controls for Terraform. Yeah, thank you David. You're welcome. Let's just fix the credentials. What's up? Oh, I don't want to run for that. Just seems like what I want. Yes, so the question is why not just have hooks always run on change sets, is that right? Basically, yeah, that's a really good question.

Thumbnail 2280

There is nuance. When we run at the resource level, we're literally about to call the APIs, so we have every piece of information that we need to call the APIs. At change set time, there are dynamic values that we don't necessarily know, so you don't always have all the information that you need to make a fully deterministic decision. An example would be if you're creating an RDS instance and you want to look at the subnet IDs and evaluate them to make sure that they're correct or allowed. If your CloudFormation template references the subnet ID as an output of the subnet resource, there's no way you can know that until you create the resource. So basically, it's sort of like a filter. You have less information at change set time because properties that are only resolved after a resource is created are not available yet. But at resource time, you have all the information, so usually we run our hooks on both, and we usually write our rules such that if the value just isn't present because it hasn't been resolved yet, we skip it. So that's why the change set said that certain values cannot be evaluated at this time. Yes, yes. But that's what I'm saying. Oh yeah, I mean that's fair.
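The "skip it if it hasn't been resolved yet" pattern can be sketched as follows. This is a simplified illustration under stated assumptions rather than real Guard syntax: `check_subnets`, `is_unresolved`, and the subnet IDs are hypothetical, and the `SubnetIds` property is used only as a stand-in for whatever property carries the subnets on the resource being checked.

```python
# Intrinsic functions that typically stay unresolved until resources exist.
INTRINSIC_KEYS = ("Ref", "Fn::GetAtt", "Fn::ImportValue", "Fn::Sub")


def is_unresolved(value):
    """True if the value is still a template reference, not a concrete ID."""
    return isinstance(value, dict) and any(key in value for key in INTRINSIC_KEYS)


def check_subnets(properties, allowed_subnets):
    """Return PASS, FAIL, or SKIP for a resource's subnet IDs."""
    subnet_ids = properties.get("SubnetIds", [])
    # At change set time the IDs may still be references: skip, do not fail.
    if is_unresolved(subnet_ids) or any(is_unresolved(s) for s in subnet_ids):
        return "SKIP"
    return "PASS" if all(s in allowed_subnets for s in subnet_ids) else "FAIL"


allowed = {"subnet-0abc"}

# Change set time: the subnet is created in the same template.
change_set_view = {"SubnetIds": [{"Ref": "PrivateSubnet"}]}
# Resource time: the reference has been resolved to a real ID.
resource_view = {"SubnetIds": ["subnet-0abc"]}

print(check_subnets(change_set_view, allowed))
print(check_subnets(resource_view, allowed))
```

Running the same rule at both points gives early feedback where the value is known and leaves the final decision to resource time where it is not.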

Thumbnail 2340

I think it's just that the change set hook launched last year, so we don't have a super opinionated approach yet with the wizard. But it's a fair point. I think if I were to create a hook, the defaults would be resources and change sets. The reason why sometimes we don't make that the default is because, as Sefi will show you, we're going to apply this to Terraform now, and Terraform doesn't have change sets. So it's a good point. I think for most cases, if you're using CloudFormation, I would almost always just choose all of the things, like resources, stack, and change set. So good question.

Okay, so the question is basically what do I do with exceptions? There are a few things you can do. I skipped through the hook creation screen pretty quickly, but in there you can exempt certain stacks. You can just say, you know what, this is fine. The other thing that I didn't show you is that in the Guard hook, you can provide this thing called an input parameter. A common thing that we see is basically that this stack and this particular bucket are exempt. What you would do is put that into a file, put it in S3, and then tell your Guard hook to read that file, and that data becomes available as context when you write your rules. Another thing, if there are rules that you think are useful but developers should be able to override themselves: if you look at the Guard repository, you can include additional metadata in the CloudFormation template at the top, acknowledging an issue but indicating you don't care, and you can write your rules such that if there's a suppression for that rule in the template, the rule is skipped.

But it's up to you. Good question. Was there another question over there?
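The two exception mechanisms described above, a central exemption file and a suppression written into the template, could be modeled roughly like this. It is a sketch under assumptions: the `exempt_stacks` and `exempt_resources` keys are a made-up layout for the exemption file, and the `Metadata` / `guard` / `SuppressedRules` shape is modeled on the convention used in the public Guard rules repository, so check the repository for the exact form.

```python
def should_skip(rule_name, stack_name, logical_id, resource, exemptions):
    """Decide whether a rule should be skipped for one resource."""
    # 1. Central exemptions maintained by the governance team (hypothetical layout).
    if stack_name in exemptions.get("exempt_stacks", []):
        return True
    exempt_ids = exemptions.get("exempt_resources", {}).get(stack_name, [])
    if logical_id in exempt_ids:
        return True
    # 2. Developer-side suppression declared in the template metadata.
    suppressed = (
        resource.get("Metadata", {}).get("guard", {}).get("SuppressedRules", [])
    )
    return rule_name in suppressed


# Contents of the exemption file that would be stored in S3 (hypothetical).
exemptions = {
    "exempt_stacks": ["legacy-reporting"],
    "exempt_resources": {"website": ["PublicAssetsBucket"]},
}

suppressed_bucket = {
    "Type": "AWS::S3::Bucket",
    "Metadata": {"guard": {"SuppressedRules": ["S3_BUCKET_VERSIONING_ENABLED"]}},
}
plain_bucket = {"Type": "AWS::S3::Bucket"}

print(should_skip("S3_BUCKET_VERSIONING_ENABLED", "app", "Logs", suppressed_bucket, exemptions))
print(should_skip("S3_BUCKET_VERSIONING_ENABLED", "app", "Logs", plain_bucket, exemptions))
```

The central file keeps exemptions under the control of whoever owns the hook, while the metadata route only makes sense for rules you are comfortable letting developers override.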

Oh cool, I had stated that there's some certain clients. Yes, so Terraform plan is almost exactly equivalent to change sets. The weird thing with Terraform plans is we don't actually have an interception point for that because it's on the client side, so we can't apply hooks at that level. We can do it with resources in Terraform because we have the AWS Cloud Control provider which basically just takes, yeah, yeah, and Sefi's going to show you that right now, so I won't show you. David, you will help me. Yeah, yeah.

Thumbnail 2470

Thumbnail 2480

Thumbnail 2500

Terraform Integration: Using CloudFormation Hooks with the AWSCC Provider

So before we start, I will run Terraform apply and we're going to fail. Okay, everybody relax. We're going to fail. This is one of the reasons we're doing it. And then we'll start walking through it. So I have two rules. If you saw David's template before, the CloudFormation template for the Lambda, it used a deprecated runtime. I think Node 16 because he will create it for you. So basically I created some Guard rules that check exactly which runtime is inside the resource, because you don't want to run deprecated runtimes. And you see that, as David mentioned, we use the resource type AWS::Lambda::Function, and then we take the Runtime property. But where do we take it from? We take it from the CloudFormation template reference. Are you familiar with that? Who is working with CloudFormation? Yeah, so there you can see which properties you can take from it and then just paste them inside the rule. For example, you can see all the runtime versions of Lambda.
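The logic of a runtime check like this one can be sketched in Python. It is not the Guard rule from the demo, which is not shown in the transcript: `check_lambda_runtime` is a hypothetical helper and the set of deprecated runtimes is an illustrative assumption, so the current list should come from the Lambda runtime documentation.

```python
# Illustrative list only; consult the Lambda runtimes page for the real one.
DEPRECATED_RUNTIMES = {"nodejs16.x", "nodejs18.x"}


def check_lambda_runtime(resource):
    """Return PASS, FAIL, or SKIP for a single template resource."""
    if resource.get("Type") != "AWS::Lambda::Function":
        return "SKIP"  # the rule only applies to Lambda functions
    runtime = resource.get("Properties", {}).get("Runtime")
    if runtime is None:
        return "SKIP"  # e.g. container image functions set no Runtime
    return "FAIL" if runtime in DEPRECATED_RUNTIMES else "PASS"


old_function = {
    "Type": "AWS::Lambda::Function",
    "Properties": {"Runtime": "nodejs16.x", "Handler": "index.handler"},
}
new_function = {
    "Type": "AWS::Lambda::Function",
    "Properties": {"Runtime": "nodejs22.x", "Handler": "index.handler"},
}

print(check_lambda_runtime(old_function))
print(check_lambda_runtime(new_function))
```

The property name `Runtime` is taken from the `AWS::Lambda::Function` entry in the CloudFormation template reference, which is the lookup step described above.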

Thumbnail 2520

Thumbnail 2530

Thumbnail 2560

Thumbnail 2570

But now that Terraform is running, I want to show you how we're going to use the AWS console to create the hook. Okay? So we'll do it really fast. We're going to create a hook with Guard. Okay, let me browse to the file that I zipped before. Okay, and I'm going to use the same bucket for the results also. And of course the name, but for the hook target we're going to use the Cloud Control API, okay, which basically is going to catch the Terraform applies that we're doing, and it is exactly the same as with CloudFormation. Now we can choose what actions we want, create, update, et cetera, and we can choose if we want to warn or fail. Okay, so I already created one of the hooks. Now let's see what happened to our Terraform.
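The choices made in the console (target, actions, warn or fail, rule file, results bucket) correspond to a hook configuration document. The sketch below builds one in Python; the field names reflect the hook configuration schema as I understand it and should be verified against the CloudFormation Hooks documentation, while `build_hook_configuration` and the bucket name are hypothetical.

```python
import json


def build_hook_configuration(rule_location, log_bucket, failure_mode="WARN"):
    """Assemble a Guard hook configuration that targets Cloud Control API calls."""
    if failure_mode not in ("WARN", "FAIL"):
        raise ValueError("failure_mode must be 'WARN' or 'FAIL'")
    return {
        "CloudFormationConfiguration": {
            "HookConfiguration": {
                "HookInvocationStatus": "ENABLED",
                # CLOUD_CONTROL is what lets the hook see AWSCC provider calls.
                "TargetOperations": ["CLOUD_CONTROL"],
                "FailureMode": failure_mode,
                "Properties": {
                    "ruleLocation": rule_location,  # zipped Guard rules in S3
                    "logBucket": log_bucket,        # where detailed results go
                },
                "TargetFilters": {"Actions": ["CREATE", "UPDATE"]},
            }
        }
    }


config = build_hook_configuration(
    "s3://example-guard-bucket/rules.zip",  # hypothetical bucket and key
    "example-guard-bucket",
    failure_mode="FAIL",
)
print(json.dumps(config, indent=2))
```

Starting with `WARN` and switching to `FAIL` once the findings look reasonable mirrors the warn-or-fail choice in the wizard.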

Thumbnail 2580

Thumbnail 2590

Thumbnail 2600

Thumbnail 2610

Thumbnail 2620

Thumbnail 2650

Thumbnail 2660

Thumbnail 2670

Oh okay, so basically what we did here, I also used ChatGPT to generate it. Okay, we're going to try to create a Lambda with a runtime of Node.js 18, which is deprecated. And also we're going to try to deploy an S3 bucket with a lot of missing properties like block public access, bucket encryption, et cetera. Now there is something new. My screen is a little bit big for me. It's okay. You can see like this. You can see that with Terraform too, in the IDE, you can see the reason why the hook failed you. Okay, for example, for the S3 bucket you can see that it failed because of S3 encryption required, S3 no public read, and S3 versioning enabled, like David showed before. You can see it in the UI or through the IDE, but if I want to see the exact reason, all the output of the hook of what happened, I can just take the S3 JSON, add a kind of a dash, and then it will go to my standard out over here. Okay, so if I'm a developer, I can use it from my VS Code, and if I'm a ClickOps guy, I can use the UI, right. Okay. What else do we want to see?

Thumbnail 2680

Thumbnail 2690

Thumbnail 2700

Just to show you from here also. You see, it's exactly the same like the CloudFormation hooks. It's working the same with Terraform. Right? Okay. Basically this is a demo of, yeah, the cool thing, so Sefi used the AWSCC provider which is different from the AWS provider. With the AWSCC provider, basically it uses Cloud Control API under the hood.

It takes the resource and sends it to CloudFormation to create it, rather than orchestrating API calls directly on the client side like the AWS provider. Since it sends the entire payload to us before it gets created, that's why hooks can run at the same time. That's why you can catch it before. Exactly. So we want to show now how to implement it on organizational units. Any questions about the Terraform implementation?

Thumbnail 2740

Oh yeah, can you show the provider? What do you mean, provider? Yeah, so the required provider here is hashicorp/awscc, instead of AWS. This is different because, as David mentioned, we want to catch it before it creates a resource, so Terraform is not making API calls straight to the AWS account. It's sending the payload and then you can block it before it happens. Yeah, basically all of these hooks, as we said when we started this talk, are there to stop people from getting calls after hours because they created something that they weren't supposed to create and they didn't even know that they were supposed to do it a different way. Right.
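For readers who could not see the screen, here is a rough idea of what a configuration using that provider looks like. It is not the file from the demo. To stay in one language it is written as a Python dictionary rendered in Terraform's JSON syntax (a `main.tf.json` file), and the region, the resource label `demo`, and the choice of attributes are assumptions.

```python
import json

terraform_config = {
    "terraform": {
        "required_providers": {
            # The Cloud Control based provider, not hashicorp/aws.
            "awscc": {"source": "hashicorp/awscc"},
        }
    },
    "provider": {"awscc": {"region": "us-east-1"}},  # assumed region
    "resource": {
        "awscc_s3_bucket": {
            "demo": {
                # Attribute names mirror the CloudFormation properties in snake_case.
                "public_access_block_configuration": {
                    "block_public_acls": True,
                    "block_public_policy": True,
                    "ignore_public_acls": True,
                    "restrict_public_buckets": True,
                }
            }
        }
    },
}

rendered = json.dumps(terraform_config, indent=2)
print(rendered)  # save as main.tf.json to use it with Terraform
```

Because resources declared with `awscc_` types go through Cloud Control API, a hook that targets Cloud Control can evaluate them before creation; resources declared with the `aws` provider in the same project would not be intercepted.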

So this is not a drop-in replacement if you have existing Terraform and you're using the AWS provider. Hooks is not very helpful for you going forward. Our relationship with HashiCorp is that HashiCorp owns this provider, but it's a partnership with us. One of the main reasons why we partnered with HashiCorp to create this is not for hooks at all. That's just a nice side effect, but mostly around coverage. We've noticed, and our customers have noticed, that support for new feature launches in Terraform can lag. Sometimes they get there before AWS support does, but within AWS we have mandates on launching with resource support immediately. There's this growing gap of sort of medium to lower popularity resources in Terraform just never getting support.

To help with that, we basically said, hey, use the AWS provider for stuff if you want, but you also have access to the CloudFormation versions as well, so you can mix and match. A side effect of that is that with hooks you can also intercept governance stuff with that. But yes, to your point, there's not a lot of sugar coating on that: if you're using the AWS provider today, we can't put hooks in that call path. The only option you really have is sort of like IAM and SCPs to do governance stuff, but those are declarative policies. There are a lot of things that you can do at the organization level through Control Tower to block this thing from happening. Cool. One more question, yeah.

Yes, yes, yes. Change sets and hooks and the Control Tower hook and Lambda and Guard hook, all those things are also in GovCloud, the China regions, and the secret regions as well. Yeah, we're actually, we had a really interesting launch in CloudFormation this, like last week or this week. I don't know what day it is. We do early validation, so basically we'll check before a CloudFormation stack runs to make sure that the resources that you're trying to create don't already exist or some properties aren't set correctly. It's sort of a developer experience thing, but it's built on top of hooks. We're considered part of launch blocking for new region builds, so hooks and all of these features are there for all new regions. Would you like to see how to implement the controls and hooks on organization? Yes.

Thumbnail 2970

Thumbnail 2980

Thumbnail 2990

Thumbnail 3000

Organization-Wide Governance: Deploying Controls Through Control Tower

Okay, so we're going to close all of this. So I have a multi-account strategy, a landing zone, in this demo environment, of course. And we're going to go into the management account. It's up, and we will go to Control Tower. Everybody familiar with Control Tower? Yeah, cool. And we will go to Control Catalog. It just got a new interface, which is amazing. And we're going to filter the controls by proactive behavior, okay.

Thumbnail 3010

Thumbnail 3020

Thumbnail 3040

Thumbnail 3050

Thumbnail 3060

You can choose whatever behavior you want, and you can select any specific resource type, such as Redshift. You can see exactly what frameworks are associated with each control, as David mentioned before, and which regions it can be deployed in. If you work in a region where this control is not available, you basically cannot deploy it. We're going to enable the control, and here we can see the selected control. You can choose multiple controls. You don't have to choose just one, but for this demo, we'll choose one so it will be faster. Then we have the organizational units that we can choose. Let's take the Sandbox OU for example. Let me just verify that we have an account in there for a second. Okay, so we choose the OU, we can target it, and we enable the controls.

Thumbnail 3080

Thumbnail 3100

Thumbnail 3110

It will take several seconds until it will be provisioned. You can see all the controls that I've already implemented. We'll give it a second. The good thing about this approach is that from here you can choose from a security perspective if you want a specific framework, and then you can enable those controls. You need to do it carefully because you want your team to succeed with this. Let's see if it's ready. Yeah, we'll go to the account.

Thumbnail 3120

Thumbnail 3130

Thumbnail 3140

The cool thing about this approach versus the one I showed earlier, where I was enabling the hook in an individual account, is that he's enabling it for an entire OU. There's a resource policy on that hook that basically means you cannot turn it off unless you are in the root account. So you can see we have the CloudFormation Guard hook that I built before, and now we also have the Control Tower hook that's managed by Control Tower. Under this OU, I might have 5, 6, 10, or 100 accounts. It will be applied to all of these accounts, and of course nobody can delete it. That's how you can secure your environment from a management account.

Thumbnail 3180

Thumbnail 3190

The nice thing about the Control Catalog through Control Tower is that this talk is called "From Reactive to Proactive." The Control Tower Catalog also groups things by control objective. For example, if the control objective that I want to accomplish is no public S3 buckets, through the Control Catalog you can enable the proactive version of it, but you can also enable the reactive version, the detective version too. It's a defense-in-depth thing. If someone goes to the console or uses the AWS provider in Terraform, they're not going to hit hooks, so you still probably want detective controls as well.

Thumbnail 3210

Thumbnail 3240

Basically, you want to implement all kinds of controls. It depends on the environment and the security requirements, but what I do want to show you here is that you have common controls from Control Tower. It's also divided by services and by frameworks, if you're using specific frameworks and things like this. You can also see which controls are enabled, and it's also cool to see exactly which accounts each one is enabled on. If you want to investigate or look at the impact that this is having on the different accounts that you have, you can see exactly which accounts and which framework as well, which is also cool. Any questions?

The limit increased from 100 deployable controls. Are the ones that are in? It was 50 back from the quarter? Did you open a support case? I'm pretty sure that they use the same implementation of the Control Tower hook as we have, and we completely rewrote it. I'm pretty sure that that limit is now gone. We should double check to make sure, but we'll take it offline afterward and we'll take your details. No problem. They had to rewrite it basically. We rewrote the implementation to support this.

Actually, it was really funny. The hook runtimes used to be constrained to Java and Python. We added Rust. CloudFormation Guard is written in Rust. All of these controls are written in Guard, so we were able to, they were doing truly gnarly things with the JVM in order to run Rust. But we rewrote it, and now I think probably a lot of these limits are gone.

Thumbnail 3310

The limits are gone, but we should definitely double-click on that for you. Any more questions? Oh, there's a question in the back.

Q&A and Closing: CloudFormation vs Terraform, Observability, and Future Improvements

The question is about Terraform versus CloudFormation. Of course, CloudFormation—just kidding, they're paying my bills. I would say use what's right for you and your organization. For me, with CloudFormation, you get a lot of safety and auditability built in if you're using AWS Organizations. If you have to manage an organization, it's really well integrated with that. There's a product called StackSets, which basically allows you to provision CloudFormation stacks across Organizational Units and accounts by default.

I would also say, and I don't think this is a huge secret so I don't mind telling you, that most CloudFormation customers are also Terraform customers. A lot of shops just use the right tool for the right thing. Basically, if you want a multi-cloud infrastructure, you have the solution of Terraform and Account Factory for Terraform, which is also very good. What you run into when you use Terraform just from an AWS perspective (and I love Terraform) is that a lot of the work that we do, like hooks, is supported with an asterisk, so things like that sort of happen. But yeah, I also love the CDK, so I think YAML is fine, but I love the CDK and I love writing my code in TypeScript.

Cool, any other questions? Oh yeah, the question is about Control Tower integration—can you make a hook turn into a warn instead of fail? It's a wonderful idea. Technically, yes, and we're working with them to figure out what the right experience for that is. I think some of that work is under the covers. The Hooks team has been doing some work on backporting, like sending invocation history backwards to the root account basically, so you can actually do some observability on it and say, "Hey, this is failing for 100% of the accounts that it is deployed into." So we're working with them to support that. It's pretty challenging from our side with all the problems we had on warn, right, to see what the stats are.

Thumbnail 3470

Thumbnail 3480

Thumbnail 3510

Yes, I will show you one thing. So we did launch some UX improvements. It's still not there, but I will tell you two things about observability. Does anyone have any feedback for Amazon Connections? One is that we don't talk about this much, but there's an EventBridge integration for Hooks. Anytime we invoke a hook, we send the invocations to EventBridge, and you can route them to wherever you want. This is how we're implementing this under the hood for Control Tower. Also, we've launched this new screen basically. I don't know why I turned off results. Basically, you can see all the invocations that happen in this particular account and see whether they passed or failed and what the validation error was, for the exact reason that it's actually really difficult without something like this to see the impact that a warn hook is having. Observability and Organizations integration is sort of the next set of features that Hooks is really diving into.
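Once invocation results are routed somewhere you control, the kind of question raised earlier ("is this failing for 100% of the accounts it is deployed into?") becomes a small aggregation. The sketch below assumes the results have already been collected into simple records; the `hook`, `account`, and `status` fields are hypothetical and are not the actual EventBridge event schema.

```python
from collections import defaultdict


def failure_rate_by_hook(invocations):
    """Return the fraction of failed invocations for each hook name."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for invocation in invocations:
        totals[invocation["hook"]] += 1
        if invocation["status"] == "FAILED":
            failures[invocation["hook"]] += 1
    return {hook: failures[hook] / totals[hook] for hook in totals}


# Hypothetical, already-normalized records gathered from several accounts.
invocations = [
    {"hook": "S3PublicRead", "account": "111111111111", "status": "FAILED"},
    {"hook": "S3PublicRead", "account": "222222222222", "status": "FAILED"},
    {"hook": "LambdaRuntime", "account": "111111111111", "status": "PASSED"},
    {"hook": "LambdaRuntime", "account": "222222222222", "status": "FAILED"},
]

rates = failure_rate_by_hook(invocations)
for hook, rate in rates.items():
    print(f"{hook}: {rate:.0%} failed")
```

A hook in warn mode that fails almost everywhere is a signal to fix templates or adjust the rule before switching it to fail mode.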

Thumbnail 3550

Cool, any other questions? Cool. Remember to like and subscribe—I mean, fill out your survey on your app for this session. If you give us great reviews, they give us raises. If not, they put us back in the pen. So yeah, thank you so much. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
