When I started developing services on AWS, I thought CloudFormation resources could cover all my needs. I was wrong.
I quickly discovered that production environments are complex, with numerous edge cases. Fortunately, CloudFormation allows for extension through custom resources. While custom resources can be handy, improper implementation can result in stack failures, deletion issues, and significant headaches.
In this blog post, we’ll explore CloudFormation custom resources, why you need them, and their different types. We’ll also define best practices for implementing them correctly with AWS CDK and Python code examples using Powertools for AWS, Pydantic and crhelper.
This blog post was originally published on my website, “Ran The Builder.”
Table of Contents
The Case for a CloudFormation Custom Resource
Post Deployment Scripts
Custom Resource — One Stack to Rule Them All
CloudFormation Custom Resource Types
Plain AWS SDK Calls
SNS-Backed Custom Resource
Lambda Backed Custom Resource
Custom Resources Best Practices
Summary
The Case for a CloudFormation Custom Resource
[Custom resources] can be useful when your provisioning requirements involve complex logic or workflows that can’t be expressed with CloudFormation’s built-in resource types. (AWS Docs)
Here are some examples that come to mind:
Adding a database to an Aurora cluster.
Creating a Cognito admin/test user for a user pool.
Creating a Route53 DNS entry or creating a certificate in a domain created in a different AWS account.
Uploading a JSON file as an observability dashboard to DataDog.
Triggering resource provisioning that takes a long time, perhaps up to an hour.
Creating any non-AWS resource.
Post Deployment Scripts
To handle such scenarios, I’ve seen people add a mysterious ‘post_deploy’ script to their CI/CD pipeline that runs after the CF deployment stage and creates the missing resources and configurations via API calls.
This is dangerous. If that script fails, you cannot automatically roll back the CF stack deployment because it has already completed, which leaves your service in an unstable state.
In addition, people forget that resources have a lifecycle and neglect to handle deletion, so many orphaned resources remain after the stack is deleted.
Custom Resource — One Stack to Rule Them All
The way I see it, every resource you add or reconfigure during the pipeline’s deployment stage should be updated together, because there are dependencies between them. That way, if there’s a failure, CloudFormation reliably rolls back the stack deployment and safeguards your production environment from being broken.
The solution is to include ALL resources and configuration changes, along with their lifecycle event handling (more on that below), in the CloudFormation stack as **custom resources**.
However, it’s not all roses and daisies. Many people stay away from custom resources because mistakes can be highly annoying, from custom resources that fail to delete to waiting up to an hour for a stack deployment to fail.
Rest assured, you’ll be fine if you follow the code examples and best practices.
Let’s review the types of custom resources.
CloudFormation Custom Resource Types
It’s important to remember that every CloudFormation resource has life cycle events it needs to implement. The main events include creation, update (due to logical ID or configuration changes), and deletion. When we build our custom resource, we will need to define its behavior in reaction to these CloudFormation events.
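To make the lifecycle concrete, here is a minimal, dependency-free sketch of how a handler might branch on the `RequestType` field that CloudFormation sends with every custom resource event. The function name and the returned strings are hypothetical placeholders for real provisioning logic.

```python
from typing import Any, Dict


def describe_action(event: Dict[str, Any]) -> str:
    """Say what a handler would do for one CloudFormation lifecycle event."""
    request_type = event["RequestType"]  # "Create", "Update" or "Delete"

    if request_type == "Create":
        # No physical ID exists yet; the handler is expected to return one.
        props = event.get("ResourceProperties", {})
        return f"create a resource from properties {sorted(props)}"
    if request_type == "Update":
        return f"update resource {event['PhysicalResourceId']}"
    if request_type == "Delete":
        return f"delete resource {event['PhysicalResourceId']}"
    raise ValueError(f"unexpected RequestType: {request_type!r}")
```

Every type of custom resource described below ends up answering these same three events; the types differ mainly in who writes and runs this routing code.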
There are three types of custom resources; let’s list them from the simplest to the most customized and complicated options:
Plain AWS SDK calls — simple, less code to write
SNS-backed resource — more complicated
Lambda-backed resource — the most complicated but the most flexible
Let’s start with the first type.
Plain AWS SDK Calls
This is the simplest way to implement a custom resource. In the example below, we want to create a Cognito user pool test user right after the user pool is created.
The process of creating and deleting a user is as simple as making a call to the AWS SDK. You can find the necessary steps [here] and [here].
Let’s see how we can translate these API calls to a simple CDK object.
We define a CDK function that receives a Cognito user pool object; its ID and ARN are used as SDK parameters.
In line 7, we create a new ‘AwsCustomResource’ instance.
In line 10, we pass the API definition for the creation process: the AWS SDK service, the API name (‘adminCreateUser’), and its parameters. Similarly, we can add ‘on_delete’ and ‘on_update’ handlers.
Behind the scenes, AWS creates a singleton Lambda function that handles the CloudFormation lifecycle events for you — super simple and easy!
In line 26, we add a dependency: the user pool must be created before this resource runs its API call.
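For reference, here is a minimal sketch of such a function, assuming `aws-cdk-lib` v2. It is not the post's original listing, so the line numbers cited above do not map onto it. The construct ID, the username, and the choice of explicit IAM statements are my own assumptions; the service and action names follow the AWS SDK for JavaScript naming that `AwsCustomResource` expects.

```python
from aws_cdk import aws_cognito as cognito
from aws_cdk import aws_iam as iam
from aws_cdk import custom_resources as cr
from constructs import Construct


def create_test_user(scope: Construct, user_pool: cognito.UserPool) -> cr.AwsCustomResource:
    """Create a Cognito test user with the pool and delete it with the stack."""
    username = "test-user"  # hypothetical value

    test_user = cr.AwsCustomResource(
        scope,
        "CognitoTestUser",  # hypothetical construct ID
        on_create=cr.AwsSdkCall(
            service="CognitoIdentityServiceProvider",
            action="adminCreateUser",
            parameters={
                "UserPoolId": user_pool.user_pool_id,
                "Username": username,
                "MessageAction": "SUPPRESS",  # do not send an invitation message
            },
            physical_resource_id=cr.PhysicalResourceId.of(f"CognitoTestUser-{username}"),
        ),
        on_delete=cr.AwsSdkCall(
            service="CognitoIdentityServiceProvider",
            action="adminDeleteUser",
            parameters={"UserPoolId": user_pool.user_pool_id, "Username": username},
        ),
        # Grant the singleton Lambda only the two calls it needs on this pool.
        policy=cr.AwsCustomResourcePolicy.from_statements(
            [
                iam.PolicyStatement(
                    actions=["cognito-idp:AdminCreateUser", "cognito-idp:AdminDeleteUser"],
                    resources=[user_pool.user_pool_arn],
                )
            ]
        ),
    )
    # The API call must not run before the user pool exists.
    test_user.node.add_dependency(user_pool)
    return test_user
```

Defining `on_delete` alongside `on_create` is what prevents the orphaned test user that a post-deploy script would leave behind.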
Bottom line: if you can map your lifecycle events to AWS SDK API calls, it’s the best and most straightforward way to cover CloudFormation’s missing capabilities with minimal code.
SNS-Backed Custom Resource
The second type is an interesting one.
I’d use this custom resource to trigger long provisioning (up to an hour!) in a decoupled and async manner via an SNS message. Depending on where the SNS topic resides, it can create resources or configurations, even in a different account.
One practical application of this custom resource type is to send all custom resource creation information to a centralized account. This allows for easy tracking of unique resources, enhancing organizational visibility.
This is a use case I describe in an article that I wrote with Bill Tarr from the AWS SaaS Factory for the AWS Cloud Operations blog. Hopefully, it will be released soon.
The entire GitHub repository can be found here.
Event Flow
Let’s review the custom resource creation flow below. Please note that the SNS to SQS to Lambda pattern is not included in the CDK code below; it is assumed that the SNS topic owner (perhaps even in a different CF stack) creates this pattern. However, I will provide the Lambda function code, as it contains logic specific to custom resources.
Custom resource creation event flow:
Parameters are sent as a dictionary to the SNS topic. You must ensure the topic accepts messages from the deploying account/organization.
SNS topic passes the message to its subscriber, the SQS queue.
SQS queue triggers the Lambda function with a batch of messages (the minimum batch size is 1).
The Lambda function parses the messages and extracts the custom resource event type (create/delete/update) and its parameters, which appear in the ‘resource_properties’ property of the SQS message body. Note that you will be given both the previous and current parameters for an update event.
The Lambda function handles the logic aspect of the custom resource, creating or configuring resources.
The Lambda function sends an HTTP PUT request to the pre-signed S3 URL that is part of the event, with the correct status (failure/success) and any other required information. Click here for a ‘create’ event example.
Custom resource is released from its wait state, deployment ends with a success or failure (reverted).
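Steps 3 and 4 of the flow above can be sketched with the standard library alone. This assumes raw message delivery is disabled on the SNS subscription, so each SQS body is an SNS envelope whose `Message` field holds the CloudFormation request as a JSON string. In the raw event the parameters live under `ResourceProperties`; the snake_case ‘resource_properties’ name is what a parsing model exposes. The function name is hypothetical.

```python
import json
from typing import Any, Dict, Iterator


def iter_cfn_events(sqs_event: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield the CloudFormation request wrapped inside each SQS record."""
    for record in sqs_event.get("Records", []):
        # Outer layer: the SQS body is the SNS notification envelope.
        sns_envelope = json.loads(record["body"])
        # Inner layer: the SNS message is the custom resource request itself.
        yield json.loads(sns_envelope["Message"])
```

Each yielded dictionary carries `RequestType`, `ResourceProperties`, and `ResponseURL`, plus `OldResourceProperties` on updates.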
During the deployment, the custom resource enters a wait state after it sends the SNS message in step 1. The message receiver needs to release the resource from its wait state. If an hour passes without this release (the default timeout), the stack fails on a timeout, and a rollback takes place. If the message receiver sends a failure message back, the stack fails, and a rollback takes place.
The receiver must send an HTTP PUT request with a specific body that marks success or failure to the pre-signed S3 URL that CloudFormation generates for the request.
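Libraries normally build and send this response for you, but a hand-rolled sketch shows what is involved. The function names and the default reason text are my own; the body fields are the ones CloudFormation expects in a custom resource response, and the request is an HTTP PUT to `ResponseURL`.

```python
import json
import urllib.request
from typing import Any, Dict, Optional


def build_response_body(
    event: Dict[str, Any],
    status: str,  # "SUCCESS" or "FAILED"
    physical_resource_id: str,
    reason: str = "",
    data: Optional[Dict[str, Any]] = None,
) -> bytes:
    """Build the JSON body that releases the custom resource from its wait state."""
    body = {
        "Status": status,
        "Reason": reason or "See the handler logs for details",
        "PhysicalResourceId": physical_resource_id,
        # These three values must be echoed back exactly as received.
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }
    return json.dumps(body).encode("utf-8")


def send_response(event: Dict[str, Any], body: bytes) -> None:
    """PUT the body to the pre-signed S3 URL carried in the event."""
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        # An empty Content-Type mirrors what AWS's cfn-response helper sends.
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        response.read()
```

If this call never happens, for example because the handler crashed before reaching it, the stack simply waits until the timeout expires.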
Elements 2–6 can be part of a different AWS account belonging to an entirely different team in your organization and serve as a ‘black box’ orchestration. In that case, you just build the custom resource, which is relatively easy.
Custom Resource CDK Code
Let’s start with the custom resource definition. The custom resource sends the SNS topic a message with predefined parameters as the message body. Each lifecycle event (create, delete, update) automatically sends a different SNS message containing the CDK properties we defined. In an update event, both the current and previous parameters are sent.
In lines 9–18, we define the custom resource.
In line 12, we provide the SNS topic ARN as the message target.
In line 13, we define the resource type (it will appear in the CF console); it must start with ‘Custom::’.
In line 15, we define the dictionary SNS message payload that will be sent to the topic. We can use any set of keys and values we want as long as their value is known during deployment.
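A minimal sketch of such a definition, assuming `aws-cdk-lib` v2, could look like the following. It is not the post's original listing, so the line numbers cited above do not apply to it; the construct ID, resource type name, and property keys are hypothetical.

```python
from aws_cdk import CustomResource
from constructs import Construct


def create_sns_backed_resource(scope: Construct, topic_arn: str) -> CustomResource:
    """Define a custom resource whose lifecycle events are published to an SNS topic."""
    return CustomResource(
        scope,
        "SnsBackedResource",  # hypothetical construct ID
        # An SNS topic ARN as the service token makes this an SNS-backed resource.
        service_token=topic_arn,
        # Shown in the CloudFormation console; must start with "Custom::".
        resource_type="Custom::CentralRegistration",  # hypothetical type name
        # Sent to the topic on every lifecycle event; values must be known at deploy time.
        properties={"team_name": "payments", "environment": "dev"},
    )
```

The topic can live in another account, as long as its policy allows the deploying account or organization to publish to it.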
Lambda Function’s Code
Let’s review the receiver side of the flow and how it handles the CF custom resource events. We will use the ‘crhelper’ library to handle the events, combined with the Powertools Parser utility and ‘pydantic’ for input validation. ‘crhelper’ will route each event to the appropriate function inside the handler, manage the response to the S3 pre-signed URL, and handle errors (it sends a failure response for every uncaught exception).
The code below is taken from one of my open-source projects, which deploys Service Catalog products and uses custom resources and SNS messages. Other than the code under the ‘logic’ folder, which you can replace with your own implementation, most of the code is generic.
You can view the complete code here.
The flow is simple:
Initialize the CR helper library. It will handle the routing to the inner event handler functions and, once completed, release the custom resource from a wait state (see 2c below) with an HTTP request.
Iterate the batch of SQS messages and per SQS message:
Route to the correct inner function according to the custom resource CF event in the SQS message body. Route ‘create’ events to my ‘create_event’ function, ‘delete’ to the ‘delete_event’ function, and ‘update’ to ‘update_event’.
Each ‘x_event’ function parses the input according to the ‘CloudFormationCustomResource’ schemas (lines 5–7) and the expected parameters defined in the CDK code. We leverage the Powertools for AWS parser utility and pass the payload to the logic layer that creates, deletes, or updates resources.
‘crhelper’ sends an HTTP PUT request to the pre-signed URL with either success or failure information. Failure is sent when the inner event handlers raise an exception.
In lines 17–22, we initialize the ‘crhelper’ utility.
In line 43, we must return a resource ID in the ‘create_event’ function. It’s crucial to make sure it is unique. Otherwise, you won’t be able to create multiple custom resources of this type in the same account.
In line 50, we implement an update flow. This can happen when either the resource id changes or the input parameters change. The CloudFormation event will contain both the current and previous parameters so it’s possible to find the differences and make changes in the logic code accordingly.
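Since an update event carries both sets of parameters, the logic layer can compute exactly what changed before touching anything. Here is a small standard-library sketch; the function name and the shape of the returned dictionary are my own choices.

```python
from typing import Any, Dict


def diff_properties(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Compare the previous and current custom resource properties."""
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    changed = {key: new[key] for key in new.keys() & old.keys() if new[key] != old[key]}
    return {"added": added, "removed": removed, "changed": changed}
```

An update handler could call this with the previous and current properties and skip the work entirely when all three groups are empty.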
The bottom line is that if you need to trigger a provision or logic in another account or service (that might belong to another team), this is a great way to decouple this logic between the services and allow a long process, which can last up to an hour.
Lambda Backed Custom Resource
In this case, the custom resource triggers a Lambda function with a CloudFormation lifecycle event to handle. It’s beneficial when you want to write the entire provisioning flow yourself and maintain it in the same project. That’s in contrast to the previous custom resource, where you send an async message to an SNS topic and let someone else handle the resource logic.
Let’s review the custom resource creation flow in the diagram below.
Event Flow
Custom resource creation event Flow:
Parameters are sent as a dictionary as part of the event to the invoked Lambda function.
Lambda function parses the event and extracts the custom resource event type (create/delete/update) and its parameters, which appear in ‘resource_properties’. Note that for an ‘update’ event you will be given both the previous and current parameters.
The Lambda function handles the logic aspect of the custom resource, creating or configuring resources.
The Lambda function sends an HTTP PUT request to the pre-signed S3 URL (‘ResponseURL’ in the event) with the correct status (failure/success) and any other required information. Click here for a ‘create’ event example.
Custom resource is released from its wait state, deployment ends with a success or failure (reverted).
You can use this resource to trigger a longer provision process (up to an hour) by triggering a Step Function state machine in the Lambda function, as long as you send the S3 pre-signed URL to that process so it can mark the result instead.
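A sketch of that hand-off, under the assumption that the state machine (not the Lambda function) reports the final result: the function forwards the fields a later step needs in order to answer CloudFormation. The client is passed in so the function is easy to test; in a real handler it would be `boto3.client("stepfunctions")`. The function name and the input layout are hypothetical.

```python
import json
from typing import Any, Dict


def start_provisioning(sfn_client: Any, state_machine_arn: str, event: Dict[str, Any]) -> str:
    """Start a state machine execution that will answer CloudFormation later."""
    execution_input = {
        # Everything a later step needs to build and send the response.
        "ResponseURL": event["ResponseURL"],
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "RequestType": event["RequestType"],
        "ResourceProperties": event.get("ResourceProperties", {}),
    }
    response = sfn_client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps(execution_input),
    )
    return response["executionArn"]
```

With this design the Lambda function must not mark success itself, otherwise CloudFormation would consider the resource complete before provisioning has finished.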
Custom Resource CDK Code
Let’s review the code below.
In lines 10–16, we build the Lambda function to handle the CF custom resource events.
In line 18, we define a provider, a synonym for an event handler, and set our lambda function as the custom resource event target.
In lines 19–27, we define the custom resource and set the service_token as the provider’s service token. See the provider definition here.
In lines 24–25, we define the input parameters we want the Lambda to receive. We can pass whatever parameters the Lambda can use during the provisioning process.
In line 27, we set the custom resource type that appears in the CF console. It must start with ‘Custom::’.
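Here is a minimal sketch of this kind of definition, assuming `aws-cdk-lib` v2. It is not the post's original listing, so the line numbers cited above do not apply to it. The asset path, handler name, construct IDs, property keys, and timeout values are hypothetical, and `service_timeout` assumes a CDK release recent enough to support it.

```python
from aws_cdk import CustomResource, Duration
from aws_cdk import aws_lambda as _lambda
from aws_cdk import custom_resources as cr
from constructs import Construct


def create_lambda_backed_resource(scope: Construct) -> CustomResource:
    """Define a Lambda function, a provider, and the custom resource that uses them."""
    event_handler = _lambda.Function(
        scope,
        "CustomResourceHandler",
        runtime=_lambda.Runtime.PYTHON_3_12,
        handler="handler.handler",  # hypothetical module and function name
        code=_lambda.Code.from_asset("custom_resource_handler"),  # hypothetical path
        timeout=Duration.minutes(5),
    )
    # The provider routes CloudFormation lifecycle events to the function.
    provider = cr.Provider(scope, "CustomResourceProvider", on_event_handler=event_handler)

    return CustomResource(
        scope,
        "LambdaBackedResource",
        service_token=provider.service_token,
        # Input parameters the Lambda function receives with every event.
        properties={"table_name": "orders", "seed_data": "true"},
        # Shown in the CloudFormation console; must start with "Custom::".
        resource_type="Custom::SeededTable",
        # Fail fast instead of waiting for the one-hour default.
        service_timeout=Duration.minutes(10),
    )
```

As far as I know, the CDK provider framework submits the response to CloudFormation on the handler's behalf. If your function reports the result itself, pointing `service_token` directly at the function's ARN is the other option.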
Lambda Function’s Code
Let’s review the function’s code below. It will look familiar from the previous example, minus the SQS batch iteration section, which is replaced with a global error handler in lines 19–23.
We define one function for each event type: create, update, delete, and the ‘helper’ library knows which one to trigger based on the incoming input event properties.
Pydantic and Powertools’s parser utility are used as before to parse the input of every event. This input is then passed to any logic function you write to handle the event: create a resource, send an API request, delete resources, etc.
As before, we need to return a resource ID in the ‘create_event’ function. It’s crucial to make sure it is unique; otherwise, you won’t be able to create multiple custom resources of this type in the same account.
As in the SNS example, the functions ‘handle_delete,’ ‘handle_create,’ and ‘handle_update’ are your implementation logic.
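The overall shape of such a handler can be sketched as follows, assuming the `crhelper` and `aws-lambda-powertools` packages (with its parser extra) are installed. It is not the post's original listing. The model and field names reflect my understanding of the Powertools parser models, so verify them against your installed version; the three `handle_*` functions are empty stand-ins for your own logic.

```python
import uuid
from typing import Any, Dict

from aws_lambda_powertools.utilities.parser import parse
from aws_lambda_powertools.utilities.parser.models import (
    CloudFormationCustomResourceCreateModel,
    CloudFormationCustomResourceDeleteModel,
    CloudFormationCustomResourceUpdateModel,
)
from crhelper import CfnResource

# crhelper routes each event to the decorated function and reports the result
# to CloudFormation; an uncaught exception is reported as a failure.
helper = CfnResource(json_logging=False, log_level="INFO", boto_level="CRITICAL")


def handle_create(properties: Any) -> None:
    """Hypothetical logic layer: create or configure the real resource."""


def handle_update(properties: Any, old_properties: Any) -> None:
    """Hypothetical logic layer: apply the difference between old and new."""


def handle_delete(properties: Any) -> None:
    """Hypothetical logic layer: remove the resource, tolerating 'not found'."""


@helper.create
def create_event(event: Dict[str, Any], context: Any) -> str:
    parsed = parse(event=event, model=CloudFormationCustomResourceCreateModel)
    handle_create(parsed.resource_properties)
    # The returned value becomes the physical resource ID, so keep it unique.
    return f"my-custom-resource-{uuid.uuid4()}"


@helper.update
def update_event(event: Dict[str, Any], context: Any) -> None:
    parsed = parse(event=event, model=CloudFormationCustomResourceUpdateModel)
    handle_update(parsed.resource_properties, parsed.old_resource_properties)
    # Returning nothing keeps the existing physical resource ID.


@helper.delete
def delete_event(event: Dict[str, Any], context: Any) -> None:
    parsed = parse(event=event, model=CloudFormationCustomResourceDeleteModel)
    handle_delete(parsed.resource_properties)


def handler(event: Dict[str, Any], context: Any) -> None:
    # Single entry point: crhelper picks the function that matches the event.
    helper(event, context)
```

Parsing failures raise inside the decorated functions, so a malformed event ends up as a failure response rather than a stack that hangs until the timeout.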
Bottom line: If you need to trigger a flow and manage it entirely in the same account via Lambda function code, this is a great way to do so and handle its life-cycle events.
Custom Resources Best Practices
Custom resources are error-prone, and you must put extra care into your error handling code.
Failing to do so can result in resources that CF cannot delete.
Here are a few pointers:
Use the tools in this guide: ‘crhelper’ and Powertools.
Read the documents specified in this guide to make sure you understand the input events and when each event is sent.
Understand timeouts and ensure you configure all the resources accordingly — Lambda timeout definition, CR timeout, etc.
Try to be as flexible in the Lambda function logic implementation as possible. Don’t fail on every issue. For example, if you need to delete a resource via API and it’s not there, you can return a success instead of failing.
Test, test, and test again the create, update, and delete flows. Be creative and ensure proper integration and E2E tests for your Lambda. Learn more about serverless tests here in my testing blog series.
Set the custom resource timeout setting. It can now be changed so you don’t wait for an hour in case of an error in your code.
‘crhelper’ also provides a polling mechanism helper for longer creation flows; use it when required. I have yet to use it. See the readme.
Finally, choose the simplest custom resource that makes sense to you. Don’t over-engineer and think about custom resource team ownership. Decouple when possible with the SNS mechanism if another team handles the provision flow. In that case, it’s best to do it in a centralized manner.
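The pointer above about not failing on every issue can be sketched as a tolerant delete. The client, its `delete_resource` method, and the `ResourceNotFoundError` exception are all hypothetical stand-ins for whatever API your custom resource manages.

```python
from typing import Any


class ResourceNotFoundError(Exception):
    """Hypothetical error a client raises when the target no longer exists."""


def delete_if_exists(client: Any, resource_id: str) -> bool:
    """Delete a resource, treating 'already gone' as success.

    Returns True if the resource was deleted now, False if it was already missing.
    Any other exception propagates and should fail the custom resource.
    """
    try:
        client.delete_resource(resource_id)  # hypothetical client API
    except ResourceNotFoundError:
        # Nothing left to delete, so the desired end state is already reached.
        return False
    return True
```

A delete handler written this way can be retried safely, which helps when a stack deletion is attempted again after a partial failure.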
Summary
This post covered several cases that CloudFormation native resources don’t cover. We learned about custom resources, their types, and their use cases, and reviewed general best practices with CDK and Python code.