Santanu Das

BootStrapping Aurora RDS Databases using Lambda and Terraform (Part 3)

TL;DR
We wrap our bootstrap logic in an AWS Step Functions workflow to transition from a manual script to an automated platform capability. By leveraging EventBridge and CloudTrail, we create an event-driven loop that detects infrastructure changes and reconciles the database state automatically—achieving a production-ready deployment with zero manual intervention.

Orchestrating with AWS Step Functions

In Part 1, we identified the problem: Terraform is great for infrastructure, but poor for database state management. In Part 2, we built the solution: A "driver-less" Lambda function that uses the Aurora Data API to perform secure, idempotent bootstrap operations.

Now we face the final hurdle: a Lambda function sitting in the console is just a tool. To make it a platform capability, we need to wrap it in a workflow that handles retries, provides visibility, and allows for controlled execution.

In this final part, we explore how to orchestrate that bootstrap logic using AWS Step Functions. The workflow does not replace migrations, schema management, or application workflows; it only controls when the bootstrap Lambda runs.

The Architecture: The State Machine

I designed a state machine that treats the Lambda as a Worker. The State Machine itself acts as the Manager, responsible for:

  • Input Validation: Ensuring the target cluster exists.
  • Task Execution: Invoking the Bootstrap Lambda.
  • Error Handling: Managing retries and catching failures.

The Workflow

It's an Event-Driven Architecture (EDA) that automatically triggers the bootstrap workflow whenever Terraform updates the Lambda function's code or configuration, as shown below:

Terraform Apply
      ↓
AWS Lambda APIs
(CreateFunction / UpdateFunctionCode)
      ↓
AWS CloudTrail
(Records the Management Event)
      ↓
Amazon EventBridge
(Matches Rule: source="aws.lambda")
      ↓
EventBridge assumes the EventBridge IAM role
      ↓
AWS Step Functions
(Triggered via IAM Role)
      ↓
Bootstrap Lambda
(Executes Idempotent Logic)
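The CloudTrail step in this chain is easy to overlook: EventBridge delivers "AWS API Call via CloudTrail" events only when a trail in the account is logging management events. Many accounts already have one. If yours does not, a minimal sketch might look like the following. The variable trail_bucket_name, the resource label, and the trail name are all hypothetical and not part of the original module, and the bucket is assumed to already carry a bucket policy that lets CloudTrail write to it.

```hcl
# Assumption: an S3 bucket with a CloudTrail-compatible bucket policy
# already exists; its name is passed in through this hypothetical variable.
variable "trail_bucket_name" {
  type        = string
  description = "Existing S3 bucket that CloudTrail is allowed to write to"
}

# Records write-type management events (which include the Lambda
# Create/Update API calls) so that EventBridge can match on them.
resource "aws_cloudtrail" "lambda_mgmt_events" {
  name           = "lambda-management-events" # hypothetical name
  s3_bucket_name = var.trail_bucket_name

  event_selector {
    read_write_type           = "WriteOnly"
    include_management_events = true
  }
}
```

WriteOnly is enough here because all three API calls of interest mutate the function; read events would only add noise and cost.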

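For this to work reliably, the EventBridge rule pattern should match specific API calls to avoid misfires. I filtered on these CloudTrail event names (using {"prefix": ...} matching):

  • CreateFunction20150331 (New deployments)
  • UpdateFunctionCode20150331v2 (Code changes)
  • UpdateFunctionConfiguration20150331v2 (Env var/Config changes)

This ensures the database bootstrap runs whenever the code or configuration (like a new DB name in Env Vars) changes. Note that the rule shown in the child module later in this post matches only the two Update* prefixes, so a CreateFunction prefix would have to be added there to cover brand-new deployments.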
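As a rough sketch of how all three event names in the list could be covered by prefix matching, the pattern can be generated from a list of prefixes. The local names below are hypothetical. The prefix "CreateFunction20" is my own assumption: a bare "CreateFunction" prefix would presumably also match unrelated event names that start the same way (for example function URL configuration calls), so the sketch keeps the first digits of the API version.

```hcl
locals {
  # Hypothetical list: one prefix per CloudTrail event name from the list above.
  # "CreateFunction20" is an assumption, chosen to avoid matching other
  # event names that merely begin with "CreateFunction".
  bootstrap_event_name_prefixes = [
    "CreateFunction20",
    "UpdateFunctionCode",
    "UpdateFunctionConfiguration",
  ]

  # Event pattern in the same shape as the rule used in the child module,
  # minus the per-function requestParameters filter.
  bootstrap_event_pattern = jsonencode({
    source        = ["aws.lambda"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["lambda.amazonaws.com"]
      eventName = [
        for p in local.bootstrap_event_name_prefixes : { prefix = p }
      ]
    }
  })
}

# Handy for inspecting the rendered JSON with `terraform output`.
output "bootstrap_event_pattern" {
  value = local.bootstrap_event_pattern
}
```

In a real rule this pattern would still need the functionName filter, otherwise it matches every Lambda function in the account and region.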

Implementation: Infrastructure as Code

Consistent with our IaC principles, the State Machine is provisioned via Terraform, while its workflow logic is defined using the Amazon States Language (ASL). The implementation follows a modular architecture, splitting responsibilities between the root (security & orchestration) and child (application logic) modules.

Terraform Resources - Root module

1️⃣ Creates one SFN execution role (states.amazonaws.com) with:

  • RDS cluster actions (start/stop/modify/describe/snapshot) scoped to module.db_cluster[KY].arn (this statement is omitted from the excerpt below)
  • lambda:InvokeFunction scoped to the installer Lambdas
  • CloudWatch Logs delivery permissions (CreateLogDelivery, PutResourcePolicy, etc.) on *
# ----------------------------------------------------------
# IAM role for Step-Function
# ----------------------------------------------------------
resource "aws_iam_role" "air_sfn" {
  count = length(local.sfn_db_engines) > 0 ? 1 : 0
  name  = "${local.template_name}-sfn-Role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Service = "states.amazonaws.com"
      }
      Action = "sts:AssumeRole"
    }]
  })

  tags = {
    Name     = "${local.template_name}-sfn-Role"
    Module   = var.tf_module_name
    Resource = "aws_iam_role.air_sfn"
  }
}

data "aws_iam_policy_document" "aipd_sfn" {

  statement {
    sid    = "AllowLambdaInvoke"
    effect = "Allow"

    actions = [
      "lambda:GetFunction",
      "lambda:GetFunctionConfiguration",
      "lambda:InvokeFunction",
    ]

    resources = [
      for KY in local.sfn_db_engines : module.db_cluster[KY].lfn_resource_arn
    ]
  }

  statement {
    sid    = "AllowCloudWatchLogs"
    effect = "Allow"

    # Log-delivery permissions Step Functions needs for CloudWatch logging
    # (duplicate entries removed)
    actions = [
      "logs:CreateLogDelivery",
      "logs:GetLogDelivery",
      "logs:UpdateLogDelivery",
      "logs:DeleteLogDelivery",
      "logs:ListLogDeliveries",
      "logs:PutLogEvents",
      "logs:PutResourcePolicy",
      "logs:DescribeResourcePolicies",
      "logs:DescribeLogGroups",
    ]
    resources = ["*"]
  }
}

resource "aws_iam_role_policy" "airp_sfn" {
  count  = length(local.sfn_db_engines) > 0 ? 1 : 0
  role   = aws_iam_role.air_sfn[0].id
  policy = data.aws_iam_policy_document.aipd_sfn.json
}

# ----------------------------------------------------------
# IAM role: EventBridge → Step Functions
# ----------------------------------------------------------
resource "aws_iam_role" "air_eb_sfn" {
  count = length(local.sfn_db_engines) > 0 ? 1 : 0
  name  = "${local.template_name}-eb-sfn-Role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Service = "events.amazonaws.com"
      }
      Action = "sts:AssumeRole"
    }]
  })

  tags = {
    Name     = "${local.template_name}-eb-sfn-Role"
    Module   = var.tf_module_name
    Resource = "aws_iam_role.air_eb_sfn"
  }
}

// Policy: allow states:StartExecution
data "aws_iam_policy_document" "aipd_eb_sfn" {
  statement {
    effect = "Allow"
    actions = [
      "states:StartExecution"
    ]
    resources = [
      for KY in local.sfn_db_engines :
      module.sfn_machine[KY].arn
    ]
  }
}

resource "aws_iam_role_policy" "airp_eb_sfn" {
  count  = length(local.sfn_db_engines) > 0 ? 1 : 0
  role   = aws_iam_role.air_eb_sfn[0].id
  policy = data.aws_iam_policy_document.aipd_eb_sfn.json
}
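The RDS cluster permissions listed for the SFN execution role do not appear in the excerpt above. Purely as an illustration of what such a statement might look like, here is a sketch. The action list is my guess at "start/stop/modify/describe/snapshot", and the variable db_cluster_arns and the data source label are hypothetical stand-ins for the module.db_cluster[KY].arn values; none of this is taken from the author's module.

```hcl
# Hypothetical input standing in for the module.db_cluster[KY].arn values.
variable "db_cluster_arns" {
  type        = list(string)
  description = "ARNs of the Aurora clusters the state machine may manage"
}

data "aws_iam_policy_document" "sfn_rds_sketch" {
  statement {
    sid    = "AllowClusterLifecycle"
    effect = "Allow"

    # Assumed mapping of start/stop/modify/describe/snapshot to IAM actions.
    actions = [
      "rds:DescribeDBClusters",
      "rds:StartDBCluster",
      "rds:StopDBCluster",
      "rds:ModifyDBCluster",
      "rds:CreateDBClusterSnapshot",
    ]

    # Note: snapshot actions may additionally need a cluster-snapshot ARN
    # in the resource list; only the cluster ARNs are covered here.
    resources = var.db_cluster_arns
  }
}
```

Merging such a statement into aipd_sfn would keep the role down to a single inline policy.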

2️⃣ Creates one EventBridge → SFN role (events.amazonaws.com), defined at the end of the block above, with:

  • states:StartExecution scoped to the child module SFN ARNs (module.sfn_machine[KY].arn)

It then instantiates module sfn_machine once per engine via local.sfn_db_engines:
# ----------------------------------------------------------
# AWS Step-function
# ----------------------------------------------------------
module "sfn_machine" {
  for_each = toset(local.sfn_db_engines)
  source   = "./services//sfn"

  lambda_function_arn = module.db_cluster[each.key].lfn_resource_arn

  name_prefix = join("-", [
    replace(local.template_name, "/-[^-]+$/", ""),
    each.key
  ])

  eb_role_arn  = aws_iam_role.air_eb_sfn[0].arn
  sfn_role_arn = aws_iam_role.air_sfn[0].arn

  extra_tags = {
    Resource = "module.sfn_machine"
    Module   = var.tf_module_name
  }
}

Terraform Resources - Child module

1️⃣ Creates the state machine:

  • arn:aws:states:::lambda:invoke
  • FunctionName = var.lambda_function_arn
  • CW logs enabled to /aws/stepfunctions/${var.name_prefix}-bootstrap
# ----------------------------------------------------------
# Step Function Resource
# ----------------------------------------------------------
resource "aws_sfn_state_machine" "this" {
  name     = "${var.name_prefix}-bootstrap"
  role_arn = var.sfn_role_arn

  definition = jsonencode({
    Comment = "DB / User bootstrap orchestration"
    StartAt = "InvokeBootstrapLambda"
    States = {
      InvokeBootstrapLambda = {
        Type     = "Task"
        Resource = "arn:aws:states:::lambda:invoke"
        Parameters = {
          FunctionName = var.lambda_function_arn
          Payload      = {}
        }
        Retry = [{
          ErrorEquals = [
            "Lambda.ServiceException",
            "Lambda.TooManyRequestsException"
          ]
          IntervalSeconds = 5
          MaxAttempts     = 3
          BackoffRate     = 2.0
        }]
        End = true
      }
    }
  })

  logging_configuration {
    level                  = "ALL"
    include_execution_data = true
    log_destination        = "${aws_cloudwatch_log_group.sfn_logs.arn}:*"
  }

  tags = merge(
    var.extra_tags,
    {
      Name        = "${var.name_prefix}-bootstrap"
      SubResource = "aws_sfn_state_machine.this"
    }
  )
}

// CW logs
resource "aws_cloudwatch_log_group" "sfn_logs" {
  name              = "/aws/stepfunctions/${var.name_prefix}-bootstrap"
  retention_in_days = 14
}
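The definition above retries transient Lambda service errors but has no Catch, so once the retries are exhausted the execution ends with the raw task error. One way the "catching failures" responsibility could be sketched is to route every unhandled error to an explicit Fail state. The state name BootstrapFailed, the local name, and the Error/Cause strings are hypothetical; the rest mirrors the original definition.

```hcl
# Already declared in the child module; repeated here only so the
# sketch stands alone.
variable "lambda_function_arn" {
  type = string
}

locals {
  # Hypothetical variant of the definition with a Catch clause added.
  bootstrap_definition_with_catch = jsonencode({
    Comment = "DB / User bootstrap orchestration (with Catch)"
    StartAt = "InvokeBootstrapLambda"
    States = {
      InvokeBootstrapLambda = {
        Type     = "Task"
        Resource = "arn:aws:states:::lambda:invoke"
        Parameters = {
          FunctionName = var.lambda_function_arn
          Payload      = {}
        }
        Retry = [{
          ErrorEquals = [
            "Lambda.ServiceException",
            "Lambda.TooManyRequestsException"
          ]
          IntervalSeconds = 5
          MaxAttempts     = 3
          BackoffRate     = 2.0
        }]
        # Anything not retried (or still failing after retries) lands here.
        Catch = [{
          ErrorEquals = ["States.ALL"]
          ResultPath  = "$.error"
          Next        = "BootstrapFailed"
        }]
        End = true
      }
      # Terminal state that marks the execution as failed with a clear label.
      BootstrapFailed = {
        Type  = "Fail"
        Error = "BootstrapFailed"
        Cause = "Bootstrap Lambda failed after retries; see execution history"
      }
    }
  })
}
```

A dedicated Fail state gives the Executions list a consistent error name to filter or alarm on, and it is a natural place to hang a notification step later.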

2️⃣ Creates the EventBridge rule that works reliably:

  • source = ["aws.lambda"]
  • detail-type = ["AWS API Call via CloudTrail"]
  • eventName prefix match for both:
    • UpdateFunctionCode*
    • UpdateFunctionConfiguration*
  • Filters requestParameters.functionName with both forms:
    • the full ARN: var.lambda_function_arn
    • short name extracted via regex("[^:]+$", var.lambda_function_arn)
  • Creates the EventBridge target:
    • Starts the SFN with an input_transformer payload containing account/region/event_id/time/function_name
# ----------------------------------------------------------
# EventBridge rule: detect Lambda code updates
# ----------------------------------------------------------
resource "aws_cloudwatch_event_rule" "lambda_code_update" {
  name        = "${var.name_prefix}-lambda-update"
  description = "Trigger SFN when bootstrap Lambda code is updated"

  event_pattern = jsonencode({
    source        = ["aws.lambda"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["lambda.amazonaws.com"]
      eventName = [
        { "prefix" : "UpdateFunctionCode" },
        { "prefix" : "UpdateFunctionConfiguration" }
      ]

      # Without this filter, every state machine would fire on any Lambda update
      requestParameters = {
        functionName = [
          var.lambda_function_arn,
          regex("[^:]+$", var.lambda_function_arn),
        ]
      }
    }
  })
}

resource "aws_cloudwatch_event_target" "lambda_update_to_sfn" {
  rule      = aws_cloudwatch_event_rule.lambda_code_update.name
  target_id = "start-${var.name_prefix}-bootstrap"
  arn       = aws_sfn_state_machine.this.arn
  role_arn  = var.eb_role_arn

  retry_policy {
    maximum_retry_attempts       = 0
    maximum_event_age_in_seconds = 60
  }

  # Optional - To capture the "Silent" error 
  dead_letter_config {
    arn = aws_sqs_queue.eb_dlq.arn
  }

  input_transformer {
    input_paths = {
      update_fn = "$.detail.requestParameters.functionName"
      event_id  = "$.id"
      region    = "$.region"
      account   = "$.account"
      time      = "$.time"
    }

    input_template = <<-EOF
{
  "account": <account>,
  "action": "AUTO_BOOTSTRAP",
  "aws_region": <region>,
  "event_id": <event_id>,
  "event_time": <time>,
  "function_name": <update_fn>,
  "trigger": "lambda-code-update"
}
    EOF
  }
}

Setting up the automated trigger was a classic cat-and-mouse engineering challenge. The real complexity was mapping the CloudTrail metadata into a format the State Machine understands. I spent significant time fine-tuning the input_transformer logic, working through the JSON paths until each Lambda update event yielded the payload needed to kick off the bootstrap. The five input paths shown above were the only combination that worked for me; adding anything more stopped the trigger from firing on the next run.
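One detail worth noting: as written, the state machine invokes the Lambda with Payload = {}, so the fields built by input_transformer show up as the execution input but are not handed to the function. If the Lambda should receive them, one option is to forward the whole execution input using the ASL ".$" suffix. This is a sketch only; whether the Part 2 Lambda reads anything from its event is an assumption on my part, and the local name is hypothetical.

```hcl
# Already declared in the child module; repeated here only so the
# sketch stands alone.
variable "lambda_function_arn" {
  type = string
}

locals {
  # Hypothetical replacement for the InvokeBootstrapLambda state:
  # "Payload.$" = "$" passes the full execution input (account, action,
  # aws_region, event_id, event_time, function_name, trigger) to the Lambda.
  invoke_with_event_input = {
    Type     = "Task"
    Resource = "arn:aws:states:::lambda:invoke"
    Parameters = {
      FunctionName = var.lambda_function_arn
      "Payload.$"  = "$"
    }
    End = true
  }
}
```

Keeping Payload = {} is equally valid when the function takes all of its settings from environment variables; the forwarded input mostly helps with logging and tracing which event caused a given run.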

3️⃣ Adds an SQS DLQ and queue policy allowing EventBridge to SendMessage for that rule (optional)

# ----------------------------------------------------------
# SQS: Dead-letter queue for the EventBridge target
# ----------------------------------------------------------
resource "aws_sqs_queue" "eb_dlq" {
  name = "${var.name_prefix}-eb-dlq"
}

resource "aws_sqs_queue_policy" "dlq_policy" {
  queue_url = aws_sqs_queue.eb_dlq.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Service = "events.amazonaws.com" }
      Action    = "sqs:SendMessage"
      Resource  = aws_sqs_queue.eb_dlq.arn
      Condition = {
        ArnEquals = {
          "aws:SourceArn" : aws_cloudwatch_event_rule.lambda_code_update.arn
        }
      }
    }]
  })
}

Validation: The Orchestration in Action

Once the EventBridge rule triggers the workflow, progress can be monitored in real time. The Executions view in the Step Functions console lists failed and successful executions, as shown below:

State Machine Executions

Beyond the high-level list, the console also allows us to drill down into the specific input and output of the bootstrap Lambda, where you can see exactly how the CloudTrail event was mapped into the parameters that drove the database creation.

Conclusion: An automated Database Platform

That brings us to the end of Part 3. We have successfully transformed our bootstrap Lambda into a robust, event-driven platform capability that triggers automatically whenever our infrastructure changes. By orchestrating this with AWS Step Functions, we’ve also gained the visibility, retries, and auditability needed for our environment. It's now a complete, driver-less solution that ensures our Aurora databases are initialized and ready the moment they are provisioned - and most importantly, without any manual intervention.


In This Series

  • Part 1: Architectural foundation and engineering basics.
  • Part 2: Deep dive into Lambda implementation and IAM-based access.
  • Part 3: Orchestrating workflows and execution using AWS Step Functions.
