TL;DR
We wrap our bootstrap logic in an AWS Step Functions workflow to transition from a manual script to an automated platform capability. By leveraging EventBridge and CloudTrail, we create an event-driven loop that detects infrastructure changes and reconciles the database state automatically—achieving a production-ready deployment with zero manual intervention.
Orchestrating with AWS Step Functions
In Part 1, we identified the problem: Terraform is great for infrastructure, but poor for database state management. In Part 2, we built the solution: a "driver-less" Lambda function that uses the Aurora Data API to perform secure, idempotent bootstrap operations.
Now we face the final hurdle: a Lambda function sitting in the console is just a tool. To make it a platform capability, we need to wrap it in a workflow that handles retries, provides visibility, and allows for controlled execution.
In this final part, we explore how to orchestrate that bootstrap logic using AWS Step Functions. This workflow does not replace migrations, schema management, or application workflows; it only controls when the bootstrap Lambda needs to run.
The Architecture: The State Machine
I designed the state machine to treat the Lambda as a Worker. The State Machine itself acts as the Manager, responsible for:
- Input Validation: Ensuring the target cluster exists.
- Task Execution: Invoking the Bootstrap Lambda.
- Error Handling: Managing retries and catching failures.
The Workflow
The result is an event-driven architecture (EDA) that automatically triggers the bootstrap workflow whenever Terraform creates or updates the Lambda function's code or configuration, as below:
Terraform Apply
↓
AWS Lambda APIs
(CreateFunction / UpdateFunctionCode)
↓
AWS CloudTrail
(Records the Management Event)
↓
Amazon EventBridge
(Matches Rule: source="aws.lambda")
↓
AWS Step Functions
(Started via the EventBridge IAM role)
↓
Bootstrap Lambda
(Executes Idempotent Logic)
For this to work effectively, the EventBridge rule pattern should match only specific API calls to avoid misfires. I filtered (using { "prefix": ... } matching) for:
- CreateFunction20150331 (New deployments)
- UpdateFunctionCode20150331v2 (Code changes)
- UpdateFunctionConfiguration20150331v2 (Env var/Config changes)
This ensures the database bootstrap runs exactly when the code or configuration (like a new DB name in Env Vars) changes.
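The prefix filters above can be sanity-checked locally. Here is a minimal Python sketch of how EventBridge compares CloudTrail event names against the prefix filters; the event names are the real versioned values CloudTrail records, but the matching function is my own illustration, not the EventBridge implementation:

```python
# Illustration only: mimics EventBridge's {"prefix": ...} matching for eventName.
PREFIX_FILTERS = ["CreateFunction", "UpdateFunctionCode", "UpdateFunctionConfiguration"]

def matches(event_name: str) -> bool:
    """Return True if the CloudTrail event name matches any configured prefix filter."""
    return any(event_name.startswith(p) for p in PREFIX_FILTERS)

# Versioned API names recorded by CloudTrail still match their prefixes:
assert matches("CreateFunction20150331")
assert matches("UpdateFunctionCode20150331v2")
assert matches("UpdateFunctionConfiguration20150331v2")
# Unrelated Lambda calls do not fire the rule:
assert not matches("Invoke")
assert not matches("TagResource")
```

Prefix matching is what makes the rule robust against the versioned suffixes (20150331, v2) AWS appends to these API names.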
Implementation: Infrastructure as Code
Consistent with our IaC principles, the State Machine is provisioned via Terraform, while its workflow logic is defined using the Amazon States Language (ASL). The implementation follows a modular architecture, splitting responsibilities between the root (security & orchestration) and child (application logic) modules.
Terraform Resources - Root module
1️⃣ Creates one SFN execution role (states.amazonaws.com) with:
- RDS cluster actions (start/stop/modify/describe/snapshot) scoped to module.db_cluster[KY].arn
- lambda:InvokeFunction scoped to the installer Lambdas
- CloudWatch Logs delivery permissions (CreateLogDelivery, PutResourcePolicy, etc.) on *
# ----------------------------------------------------------
# IAM role for Step-Function
# ----------------------------------------------------------
resource "aws_iam_role" "air_sfn" {
count = length(local.sfn_db_engines) > 0 ? 1 : 0
name = "${local.template_name}-sfn-Role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Service = "states.amazonaws.com"
}
Action = "sts:AssumeRole"
}]
})
tags = {
Name = "${local.template_name}-sfn-Role"
Module = var.tf_module_name,
Resource = "aws_iam_role.air_sfn",
}
}
data "aws_iam_policy_document" "aipd_sfn" {
statement {
sid = "AllowLambdaInvoke"
effect = "Allow"
actions = [
"lambda:GetFunction",
"lambda:GetFunctionConfiguration",
"lambda:InvokeFunction",
]
resources = [
for KY in local.sfn_db_engines : module.db_cluster[KY].lfn_resource_arn
]
}
statement {
sid = "AllowCloudWatchLogs"
effect = "Allow"
actions = [
"logs:CreateLogDelivery",
"logs:GetLogDelivery",
"logs:UpdateLogDelivery",
"logs:DeleteLogDelivery",
"logs:ListLogDeliveries",
"logs:PutLogEvents",
"logs:PutResourcePolicy",
"logs:DescribeResourcePolicies",
"logs:DescribeLogGroups",
]
resources = ["*"]
}
}
resource "aws_iam_role_policy" "airp_sfn" {
count = length(local.sfn_db_engines) > 0 ? 1 : 0
role = aws_iam_role.air_sfn[0].id
policy = data.aws_iam_policy_document.aipd_sfn.json
}
# ----------------------------------------------------------
# IAM role: EventBridge → Step Functions
# ----------------------------------------------------------
resource "aws_iam_role" "air_eb_sfn" {
count = length(local.sfn_db_engines) > 0 ? 1 : 0
name = "${local.template_name}-eb-sfn-Role"
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Service = "events.amazonaws.com"
}
Action = "sts:AssumeRole"
}]
})
tags = {
Name = "${local.template_name}-eb-sfn-Role"
Module = var.tf_module_name,
Resource = "aws_iam_role.air_eb_sfn",
}
}
# Policy: allow states:StartExecution
data "aws_iam_policy_document" "aipd_eb_sfn" {
statement {
effect = "Allow"
actions = [
"states:StartExecution"
]
resources = [
for KY in local.sfn_db_engines :
module.sfn_machine[KY].arn
]
}
}
resource "aws_iam_role_policy" "airp_eb_sfn" {
count = length(local.sfn_db_engines) > 0 ? 1 : 0
role = aws_iam_role.air_eb_sfn[0].id
policy = data.aws_iam_policy_document.aipd_eb_sfn.json
}
2️⃣ Creates one EventBridge → SFN role (events.amazonaws.com) with:
- states:StartExecution scoped to the child module SFN ARNs (module.sfn_machine[KY].arn)
3️⃣ Instantiates the sfn_machine module per engine via local.sfn_db_engines:
# ----------------------------------------------------------
# AWS Step-function
# ----------------------------------------------------------
module "sfn_machine" {
for_each = toset(local.sfn_db_engines)
source = "./services/sfn"
lambda_function_arn = module.db_cluster[each.key].lfn_resource_arn
name_prefix = join("-", [
replace(local.template_name, "/-[^-]+$/", ""),
each.key
])
eb_role_arn = aws_iam_role.air_eb_sfn[0].arn
sfn_role_arn = aws_iam_role.air_sfn[0].arn
extra_tags = {
Resource = "module.sfn_machine"
Module = var.tf_module_name
}
}
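The name_prefix expression above strips the last hyphenated segment from local.template_name and appends the engine key. A quick Python equivalent of that string manipulation, using a made-up template name purely for illustration:

```python
import re

def name_prefix(template_name: str, engine: str) -> str:
    # Mirrors: join("-", [replace(local.template_name, "/-[^-]+$/", ""), each.key])
    # i.e. drop the trailing "-<segment>" and append the engine key.
    return "-".join([re.sub(r"-[^-]+$", "", template_name), engine])

# Hypothetical template name, for illustration only:
print(name_prefix("myapp-dev-installer", "postgres"))  # → myapp-dev-postgres
```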
Terraform Resources - Child module
1️⃣ Creates the state machine:
- Task resource: arn:aws:states:::lambda:invoke
- FunctionName = var.lambda_function_arn
- CloudWatch logs enabled to /aws/stepfunctions/${var.name_prefix}-bootstrap
# ----------------------------------------------------------
# Step Function Resource
# ----------------------------------------------------------
resource "aws_sfn_state_machine" "this" {
name = "${var.name_prefix}-bootstrap"
role_arn = var.sfn_role_arn
definition = jsonencode({
Comment = "DB / User bootstrap orchestration"
StartAt = "InvokeBootstrapLambda"
States = {
InvokeBootstrapLambda = {
Type = "Task"
Resource = "arn:aws:states:::lambda:invoke"
Parameters = {
FunctionName = var.lambda_function_arn
Payload = {}
}
Retry = [{
ErrorEquals = [
"Lambda.ServiceException",
"Lambda.TooManyRequestsException"
]
IntervalSeconds = 5
MaxAttempts = 3
BackoffRate = 2.0
}]
End = true
}
}
})
logging_configuration {
level = "ALL"
include_execution_data = true
log_destination = "${aws_cloudwatch_log_group.sfn_logs.arn}:*"
}
tags = merge(
var.extra_tags,
{
Name = "${var.name_prefix}-bootstrap"
SubResource = "aws_sfn_state_machine.db_bootstrap"
}
)
}
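With IntervalSeconds = 5, BackoffRate = 2.0, and MaxAttempts = 3, the Retry block above waits 5, 10, and then 20 seconds between attempts before the execution fails. A small Python sketch of that schedule (my own calculation of the documented backoff formula, not Step Functions code):

```python
def retry_delays(interval: int, backoff: float, max_attempts: int) -> list[float]:
    """Delay before each retry attempt: interval * backoff ** attempt."""
    return [interval * backoff ** attempt for attempt in range(max_attempts)]

# Matches the state machine's Retry settings above:
print(retry_delays(5, 2.0, 3))  # → [5.0, 10.0, 20.0]
```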
# CloudWatch log group for SFN execution logs
resource "aws_cloudwatch_log_group" "sfn_logs" {
name = "/aws/stepfunctions/${var.name_prefix}-bootstrap"
retention_in_days = 14
}
2️⃣ Creates the EventBridge rule that works reliably:
- source = ["aws.lambda"]
- detail-type = ["AWS API Call via CloudTrail"]
- eventName prefix match for both:
  - UpdateFunctionCode*
  - UpdateFunctionConfiguration*
- Filters requestParameters.functionName with both forms:
  - the full ARN: var.lambda_function_arn
  - the short name extracted via regex("[^:]+$", var.lambda_function_arn)
- Creates the EventBridge target:
  - Starts the SFN with an input_transformer payload containing account/region/event_id/time/function_name
# ----------------------------------------------------------
# EventBridge rule: detect Lambda code updates
# ----------------------------------------------------------
resource "aws_cloudwatch_event_rule" "lambda_code_update" {
name = "${var.name_prefix}-lambda-update"
description = "Trigger SFN when bootstrap Lambda code is updated"
event_pattern = jsonencode({
source = ["aws.lambda"]
"detail-type" = ["AWS API Call via CloudTrail"]
detail = {
eventSource = ["lambda.amazonaws.com"]
eventName = [
{ "prefix" : "UpdateFunctionCode" },
{ "prefix" : "UpdateFunctionConfiguration" }
]
# Without this filter, any Lambda update would trigger all the SFNs
requestParameters = {
functionName = [
var.lambda_function_arn,
regex("[^:]+$", var.lambda_function_arn),
]
}
}
})
}
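The regex("[^:]+$", ...) expression in the pattern above grabs everything after the last colon of the ARN, i.e. the short function name. The same extraction in Python, using a made-up ARN for illustration:

```python
import re

# Hypothetical ARN, for illustration only:
arn = "arn:aws:lambda:eu-west-1:123456789012:function:db-bootstrap"

# Same expression as Terraform's regex("[^:]+$", var.lambda_function_arn):
short_name = re.search(r"[^:]+$", arn).group(0)
print(short_name)  # → db-bootstrap
```

Matching both the full ARN and the short name matters because CloudTrail records whichever form the caller passed to the Lambda API.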
resource "aws_cloudwatch_event_target" "lambda_update_to_sfn" {
rule = aws_cloudwatch_event_rule.lambda_code_update.name
target_id = "start-${var.name_prefix}-bootstrap"
arn = aws_sfn_state_machine.this.arn
role_arn = var.eb_role_arn
retry_policy {
maximum_retry_attempts = 0
maximum_event_age_in_seconds = 60
}
# Optional - To capture the "Silent" error
dead_letter_config {
arn = aws_sqs_queue.eb_dlq.arn
}
input_transformer {
input_paths = {
update_fn = "$.detail.requestParameters.functionName"
event_id = "$.id",
region = "$.region"
account = "$.account"
time = "$.time"
}
input_template = <<-EOF
{
"account": <account>,
"action": "AUTO_BOOTSTRAP",
"aws_region": <region>,
"event_id": <event_id>,
"event_time": <time>,
"function_name": <update_fn>,
"trigger": "lambda-code-update"
}
EOF
}
}
Setting up the automated trigger was a classic cat-and-mouse engineering challenge. The real complexity was correctly mapping the CloudTrail metadata into a format the State Machine understands. I spent significant time fine-tuning the input_transformer logic, navigating the JSON paths to ensure that the specific Lambda update events were extracted into exactly the payload needed to kick off the bootstrap. The parameters above were the only ones that worked for me; anything more added to the transformer stopped the trigger on the next run.
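To reason about the transformer without redeploying, it helps to replay the mapping locally. Below is a rough Python simulation of the input_paths / input_template step; the sample event is hand-written and trimmed, with placeholder values, not a real CloudTrail capture:

```python
# Rough local simulation of the EventBridge input transformer above.
# The sample event is hand-crafted; all field values are placeholders.
sample_event = {
    "id": "abcd-1234",
    "account": "123456789012",
    "region": "eu-west-1",
    "time": "2024-01-01T00:00:00Z",
    "detail": {"requestParameters": {"functionName": "db-bootstrap"}},
}

def transform(event: dict) -> dict:
    """Build the SFN input the same way the input_template does."""
    return {
        "account": event["account"],
        "action": "AUTO_BOOTSTRAP",
        "aws_region": event["region"],
        "event_id": event["id"],
        "event_time": event["time"],
        "function_name": event["detail"]["requestParameters"]["functionName"],
        "trigger": "lambda-code-update",
    }

payload = transform(sample_event)
print(payload["function_name"])  # → db-bootstrap
```

A harness like this makes it cheap to verify that each JSON path in input_paths actually resolves against a captured event before wiring it into the rule.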
3️⃣ Adds an SQS DLQ and queue policy allowing EventBridge to SendMessage for that rule (optional)
# ----------------------------------------------------------
# SQS: Dead-letter queue redrive
# ----------------------------------------------------------
resource "aws_sqs_queue" "eb_dlq" {
name = "${var.name_prefix}-eb-dlq"
}
resource "aws_sqs_queue_policy" "dlq_policy" {
queue_url = aws_sqs_queue.eb_dlq.id
policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = { Service = "events.amazonaws.com" }
Action = "sqs:SendMessage"
Resource = aws_sqs_queue.eb_dlq.arn
Condition = {
ArnEquals = {
"aws:SourceArn" : aws_cloudwatch_event_rule.lambda_code_update.arn
}
}
}]
})
}
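When a StartExecution call fails silently, the event lands on the DLQ with EventBridge's delivery metadata attached as SQS message attributes (such as ERROR_CODE, ERROR_MESSAGE, RULE_ARN, and TARGET_ARN). A small helper to condense such a message into a readable summary; the sample message here is hand-crafted for illustration, not a real capture:

```python
def summarize_dlq_message(msg: dict) -> str:
    """Condense an EventBridge DLQ message into a one-line failure summary."""
    attrs = msg.get("MessageAttributes", {})
    code = attrs.get("ERROR_CODE", {}).get("StringValue", "UNKNOWN")
    reason = attrs.get("ERROR_MESSAGE", {}).get("StringValue", "no detail")
    return f"{code}: {reason}"

# Hand-crafted sample in the shape boto3's receive_message returns:
sample = {
    "Body": "{...original event...}",
    "MessageAttributes": {
        "ERROR_CODE": {"StringValue": "NO_PERMISSIONS"},
        "ERROR_MESSAGE": {"StringValue": "Not authorized to call states:StartExecution"},
    },
}
print(summarize_dlq_message(sample))  # → NO_PERMISSIONS: Not authorized to call states:StartExecution
```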
Validation: The Orchestration in Action
Once the EventBridge rule triggers the workflow, progress can be monitored in real time. The Step Functions Executions console lists the failed and successful executions.
Beyond the high-level list, the console also lets us drill down into the specific input and output of the bootstrap Lambda, where you can see exactly how the CloudTrail event was mapped into the parameters that drove the database creation.
Conclusion: An Automated Database Platform
That brings us to the end of Part 3. We have successfully transformed our bootstrap Lambda into a robust, event-driven platform capability that triggers automatically whenever our infrastructure changes. By orchestrating this with AWS Step Functions, we’ve also gained the visibility, retries, and auditability needed for our environment. It's now a complete, driver-less solution that ensures our Aurora databases are initialized and ready the moment they are provisioned - and most importantly, without any manual intervention.
In This Series
- Part 1: Architectural foundation and engineering basics.
- Part 2: Deep dive into Lambda implementation and IAM-based access.
- Part 3: Orchestrating workflows and execution using AWS Step Functions.
