TL;DR
Terraform can create an Aurora cluster, but it cannot safely manage databases, users, or grants.
This series shows how we decouple infrastructure provisioning from database bootstrapping using Lambda — and later, Step Functions — without breaking Terraform’s declarative model.
The Problems
Terraform’s Limitations
While Terraform is ideal for provisioning Aurora clusters, it is poorly suited for managing database-level objects because:
Limited Scope: It can only create one initial database; it cannot manage multiple databases, complex ownership models, or granular IAM-based access.
State Conflict: Managing SQL internals within Terraform leads to brittle plans, partial failures, and irreversible drift.
Risk: Running SQL during terraform apply creates tight coupling between infrastructure and runtime data, increasing operational risk.
The Manual Gap
Although Terraform can provision the cluster and an initial database in minutes, it leaves behind an operational vacuum. Once the cluster is live, the engineering team is frequently forced into a manual waiting game:
Access Stagnation: The cluster exists, but there are no application-specific users.
Security Hurdles: A platform engineer must manually log in to configure IAM authentication and map roles to database users.
Permission Complexity: Defining granular Read/Write privileges, schema ownership, and grant structures becomes a ticket-driven manual task.
Developer Friction: Application developers sit idle until the tasks above are completed because the Initial Database provided by AWS lacks the schema-level management required for actual deployment.
This creates a bottleneck where infrastructure is automated, but the database remains unusable until a human intervenes to bridge the gap between a cloud resource and a functional data store.
This post outlines a platform engineering approach to bootstrapping Amazon Aurora databases, arguing that database internals should be decoupled from Terraform state.
Terraform’s responsibility ends when the cluster becomes reachable.
Everything beyond that point is runtime behavior, not infrastructure state.
The Cluster Basics
While this post isn't a deep dive into cluster creation itself, I will cover the essential infrastructure components to provide context for Part 2. My implementation utilizes the standard terraform-aws-rds-aurora module, wrapped within a custom child module to meet our specific requirements.
Child Module
module "cluster" {
source = "terraform-aws-modules/rds-aurora/aws"
version = "10.0.2" #<-- latest at the time of writing
name = local.cluster_name
engine = var.db_profile.engine
engine_mode = "provisioned" #<-- NOT cluster-type
engine_version = var.db_profile.engine_version
storage_encrypted = true
enable_http_endpoint = anytrue([
var.enable_db_managment, var.data_api_enabled
])
master_username = var.master_uname
manage_master_user_password = true
master_user_secret_kms_key_id = var.kms_key_id
manage_master_user_password_rotation = true
master_user_password_rotate_immediately = true
master_user_password_rotation_schedule_expression = "rate(21 days)"
vpc_id = var.vpc_id
db_subnet_group_name = var.subnet_group_name
security_group_ingress_rules = {
for ci, dr in var.ingress_cidr_blocks :
"cidr${ci}" => { cidr_ipv4 = dr }
}
cluster_monitoring_interval = 60
cluster_parameter_group = {
name = local.cluster_name
family = local.db_engine_family
description = "${local.cluster_name} cluster parameter group"
parameters = var.cluster_parameter_group_parameters
}
db_parameter_group = {
name = local.cluster_name
family = local.db_engine_family
description = "${local.cluster_name} Database parameter group"
parameters = var.db_parameter_group_parameters
}
apply_immediately = true
deletion_protection = true
skip_final_snapshot = true
# ----------------------------------------------
# Only required for Serverless V2 clusters
# ----------------------------------------------
serverlessv2_scaling_configuration = (
var.cluster_type == "serverless"
) ? {
min_capacity = 2
max_capacity = 10
} : null
cluster_instance_class = (
var.cluster_type == "serverless"
) ? "db.serverless" : var.db_profile.instance_class
# ----------------------------------------------
# Based on the number of AZs, it creates
# 1x RW-instance & 1x RO-instance per AZ
# ----------------------------------------------
instances = var.cluster_type == "serverless" ? {
for ix, az in concat(var.aws_zones, [join("", var.aws_zones)]) :
"${var.db_profile.engine_type}${ix + 1}" => {}
} : merge(
{
"0rw" = {
instance_class = lookup(
var.db_profile, "rw_instance_class", var.db_profile.instance_class
)
}
},
{
for ix, az in var.aws_zones :
"${ix + 1}ro" => {
instance_class = var.db_profile.instance_class
}
}
)
#create_cloudwatch_log_group = true
enabled_cloudwatch_logs_exports = (
var.db_profile.engine == "aurora-mysql" ? [
"audit",
"error",
"general",
"slowquery",
] : (
var.db_profile.engine == "aurora-postgresql"
? ["postgresql"] : []
)
)
iam_database_authentication_enabled = true #<-- MUST be true
tags = merge(
var.extra_tags,
{
Name = local.cluster_name
SubResource = "module.cluster"
}
)
}
Engine Mode Ambiguity: When deploying Aurora Serverless v2, it is important to note that `engine_mode` must be set to `provisioned`. Unlike Serverless v1, v2 operates under the provisioned framework to allow for better scaling and feature parity.
Log Group Conflicts: Avoid setting `create_cloudwatch_log_group = true` within the module. In many Aurora configurations, AWS automatically creates the log group as soon as logging is enabled; if Terraform attempts to create it simultaneously, the deployment will fail due to a naming conflict.
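If Terraform still needs a handle on those log groups, for example to attach a metric filter or read the retention setting, a data source lookup after cluster creation avoids the race entirely. A minimal sketch, assuming the PostgreSQL export and the module's `cluster_id` output:

```hcl
# Look up the log group that AWS creates automatically once log exports are
# enabled, instead of letting Terraform try to create it and hit a conflict.
data "aws_cloudwatch_log_group" "postgresql" {
  name       = "/aws/rds/cluster/${module.cluster.cluster_id}/postgresql"
  depends_on = [module.cluster]
}
```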
Root Module
In the root module, the child module is simply referenced like this:
module "db_cluster" {
for_each = anytrue(values(local.service_enabled)) ? tomap(local.db_profiles) : {}
source = "./services//aurora"
aws_zones = var.aws_zones
cluster_type = var.cluster_type
db_profile = each.value #<-- details below
ingress_cidr_blocks = distinct(local.static_route_cidrs)
name_prefix = replace(local.template_name, "/-[^-]+$/", "")
kms_key_id = var.kms_master_key_arn
subnet_group_name = aws_db_subnet_group.default[var.service_name].name
vpc_id = local.my_vpc_info.id
cluster_parameter_group_parameters = lookup(
local.cluster_parameters, each.key, []
)
db_parameter_group_parameters = lookup(
local.db_parameters, each.key, []
)
dns_host_prefix = var.dns_name_prefix
dns_hosted_zone_id = var.dns_hosted_zone_id
enable_db_managment = each.value.enable_db_managment
lambda_service_role = aws_iam_role.air_lfn[var.service_name].arn
extra_tags = {
Resource = "module.db_cluster"
Module = var.tf_module_name
"${var.eks_access_tag}" = true
}
/*
master_uname = var.dbs_master_user
*/
}
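The `aws_db_subnet_group.default` referenced above lives in the same root module. Its exact shape is not the focus of this post; a minimal sketch, assuming the private subnet IDs are already resolved into `local.my_vpc_info` (the attribute name below is used for illustration only):

```hcl
# Hypothetical sketch of the subnet group keyed by service name, matching the
# aws_db_subnet_group.default[var.service_name] reference in the module call.
resource "aws_db_subnet_group" "default" {
  for_each   = toset([var.service_name])
  name       = "${local.template_name}-${each.key}"
  subnet_ids = local.my_vpc_info.private_subnet_ids # assumed attribute

  tags = {
    Name = "${local.template_name}-${each.key}"
  }
}
```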
While the `name_prefix` can be customized to any string, I utilize a standardized naming structure to maintain consistency across the platform. By using `local.template_name`, I construct a prefix following the pattern `<env><acc>-<vpc_name>-<service_name>`, which ensures resources are easily identifiable and organized.
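Inside the child module, that prefix is then combined with the engine type to build `local.cluster_name`. The exact composition is specific to my setup; a rough sketch of the idea:

```hcl
# Illustrative only: one readable, unique cluster name per engine, e.g.
# "prd01-corevpc-payments" + "mysql" => "prd01-corevpc-payments-mysql".
locals {
  cluster_name = lower("${var.name_prefix}-${var.db_profile.engine_type}")
}
```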
Data set
I leverage Terragrunt to orchestrate our deployments. Below are the specific input variables and configurations I defined as the source of truth for the Aurora cluster:
cluster_type: serverless #<-- or provisioned
db_clusters:
  - mysql
  - pgsql
#
db_profile:
  default:
    allocated_storage: 20
    auto_stop: true
    autoscaling_enabled: true
    autoscaling_max_size: 3
    autoscaling_min_size: 1
    database_user: zenapp-iam-user
    enable_db_managment: true #<-- required for part2
    instance_class: db.t4g.small
    max_allocated_storage: 0
    multi_az: true
  mysql:
    enabled: true
    database_port: 3306
    engine: aurora-mysql
    engine_version: '8.0'
    databases:
      - my-database1
      - my-database2
  pgsql:
    enabled: true
    database_port: 5432
    engine: aurora-postgresql
    engine_version: 17
    databases:
      - pg-database1
      - pg-database2
#
cluster_parameter_group:
  mysql:
    family: aurora-mysql8.0
    description: 'MySQL Aurora Cluster Parameters'
    immediate:
      innodb_lock_wait_timeout: 60
      log_bin_trust_function_creators: 1
      long_query_time: 1
      max_allowed_packet: 67108864
      require_secure_transport: 'ON'
      slow_query_log: 1
      time_zone: UTC
      wait_timeout: 28800
    pending-reboot:
      aurora_parallel_query: 0
      binlog_checksum: NONE
      binlog_format: ROW
      log_output: FILE
      max_connections: LEAST({DBInstanceClassMemory/12582912},5000)
      tls_version: TLSv1.2
  pgsql:
    family: aurora-postgresql17
    description: 'PostgreSQL Aurora Cluster Parameters'
    immediate:
      deadlock_timeout: 1000
      idle_in_transaction_session_timeout: 600000
      log_lock_waits: 1
      log_min_duration_statement: 2000
      log_statement: ddl
    pending-reboot:
      log_line_prefix: '%t:%r:%u@%d:[%p]:'
      max_connections: LEAST({DBInstanceClassMemory/9531392},5000)
      rds.force_ssl: 1
      shared_preload_libraries: pg_stat_statements
      timezone: UTC
      track_activity_query_size: 2048
#
db_parameter_group:
  mysql:
    family: aurora-mysql8.0
    description: 'MySQL DB parameters'
    immediate:
      connect_timeout: 60
      general_log: 0
      log_bin_trust_function_creators: 1
      slow_query_log: 1
    pending-reboot:
      performance_schema: 1
  pgsql:
    family: aurora-postgresql17
    description: 'PostgreSQL DB parameters'
    immediate:
      default_statistics_target: 100
      maintenance_work_mem: 131072
      random_page_cost: 1.1
      temp_buffers: 16384
      work_mem: 8192
Aurora configuration updates are governed by two apply methods: `immediate` and `pending-reboot`. These settings apply to both cluster and DB parameter groups, but the engine, particularly PostgreSQL, is notoriously strict about how these classifications are handled. After extensive trial and error to avoid unexpected downtime or deployment friction, I've settled on the categorized list above as a stable, battle-tested baseline for platform environments.
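For context, the dataset above is plain YAML that Terragrunt decodes and passes to the root module. A minimal `terragrunt.hcl` sketch, with the file names and module path as assumptions:

```hcl
terraform {
  source = "../../modules//aurora-platform" # illustrative path
}

inputs = merge(
  # shared platform settings such as VPC and account metadata
  yamldecode(file(find_in_parent_folders("common.yaml"))),
  # the Aurora dataset shown above
  yamldecode(file("${get_terragrunt_dir()}/cluster.yaml"))
)
```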
Local Variables
I use a `locals {...}` block to normalize the dataset, ensuring direct compatibility with the terraform-aws-rds-aurora module. The key transformations are outlined below:
locals {
....
service_enabled = {
for DB in var.db_clusters :
DB => var.service_enabled && tobool(
local.db_profiles[DB].enabled
) if contains(keys(local.db_profiles), DB)
}
# ---------------------------------------------
# Decision making and conditional resource
# creation happens from the values from here
# ---------------------------------------------
db_profiles = {
for DB in var.db_clusters : DB => merge(
var.db_profile.default,
{
for K1, V1 in var.db_profile[DB] :
K1 => V1
if V1 != null
},
{
dbs_user_host = var.eks_base_cidr != null ? join(".", [
replace(var.eks_base_cidr, "/\\.[^./]+\\/\\d+$/", ""),
"%",
]) : "%"
engine_type = DB
}
)
if contains(keys(var.db_profile), DB) && var.db_profile[DB].enabled
}
#
db_parameters = {
for DB in var.db_clusters : DB => flatten([
for K1, V1 in {
immediate = merge(
try(var.db_parameter_group["default"].immediate, {}),
try(var.db_parameter_group[DB].immediate, {})
)
pending-reboot = merge(
try(var.db_parameter_group["default"]["pending-reboot"], {}),
try(var.db_parameter_group[DB]["pending-reboot"], {})
)
} : [
for K2, V2 in V1 : {
name = K2
value = try(tonumber(V2), V2)
apply_method = K1
}
]
])
if contains(keys(local.db_profiles), DB)
}
#
cluster_parameters = {
for DB in var.db_clusters : DB => flatten([
for K1, V1 in {
immediate = merge(
try(var.cluster_parameter_group["default"].immediate, {}),
try(var.cluster_parameter_group[DB].immediate, {})
)
pending-reboot = merge(
try(var.cluster_parameter_group["default"]["pending-reboot"], {}),
try(var.cluster_parameter_group[DB]["pending-reboot"], {})
)
} : [
for K2, V2 in V1 : {
name = K2,
value = try(tonumber(V2), V2),
apply_method = K1
}
]
])
if contains(keys(local.db_profiles), DB)
}
}
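While iterating on these transformations, it helps to expose the normalized structures as temporary outputs and inspect them with `terraform output` or a targeted plan. A throwaway sketch (the output name is arbitrary):

```hcl
# Temporary debugging aid: the flattened parameter list, where each element
# looks like { name = "connect_timeout", value = 60, apply_method = "immediate" }.
output "debug_db_parameters" {
  value = local.db_parameters
}
```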
Input Variable Definitions
For completeness, these are the Terraform variables used to ingest the configuration from the Terragrunt YAML dataset:
variable "cluster_type" {
type = string
default = "serverless"
description = <<-EOT
The deployment mode for the Aurora cluster.
Valid values are 'serverless' or 'provisioned'.
EOT
validation {
condition = contains(
["serverless", "provisioned"], var.cluster_type
)
error_message = <<-EOT
The cluster_type must be either 'serverless' or 'provisioned'.
EOT
}
}
variable "cluster_parameter_group" {
type = map(object({
family = optional(string, null)
description = optional(string, null)
immediate = optional(map(string), {})
pending-reboot = optional(map(string), {})
}))
default = {}
description = <<-EOT
Configuration for RDS Cluster Parameter Groups (applies to the entire Aurora cluster).
Use this to manage engine-specific settings that affect all instances in the cluster.
Key attributes:
- family: The engine family (e.g., 'aurora-postgresql15').
- description: A custom description for the parameter group.
- immediate: A map of parameters to apply immediately without a reboot (static and dynamic).
- pending-reboot: A map of parameters that require a manual reboot to take effect.
EOT
}
variable "db_parameter_group" {
type = map(object({
family = optional(string, null)
description = optional(string, null)
immediate = optional(map(string), {})
pending-reboot = optional(map(string), {})
}))
default = {}
description = <<-EOT
Configuration for RDS DB Parameter Groups (applies to individual DB instances).
Use this for settings that can vary between the primary and replica instances.
Key attributes:
- family: The engine family (e.g., 'aurora-postgresql15').
- description: A custom description for the parameter group.
- immediate: A map of parameters to apply immediately via the AWS API.
- pending-reboot: A map of parameters that will only apply after the instance is rebooted.
EOT
}
variable "db_profile" {
type = map(object({
allocated_storage = optional(number, 20)
auto_stop = optional(bool, false)
autoscaling_enabled = optional(bool, true)
autoscaling_max_size = optional(number, 3)
autoscaling_min_size = optional(number, 1)
database_port = optional(number, 5432)
database_user = optional(string)
enabled = optional(bool, false)
engine = optional(string, null)
engine_version = optional(string, null)
databases = optional(list(string), [])
enable_db_managment = optional(bool, true)
instance_class = optional(string, "db.t4g.small")
max_allocated_storage = optional(number, 0)
multi_az = optional(bool, true)
rw_instance_class = optional(string, "db.t4g.small")
}))
description = <<-EOT
A map of database profile objects used to define one or more database clusters/instances.
Key attributes:
- allocated_storage: Initial disk size in GB (Default: 20).
- auto_stop: Whether to automatically stop the instance during idle hours to save costs.
- autoscaling_enabled: Enables horizontal or vertical scaling for the DB instances.
- database_port: The network port for database connections (Default: 5432 for PostgreSQL).
- database_user: The name of the application-user for the database.
- engine/engine_version: Specific RDS engine type (e.g., 'aurora-postgresql') and its version.
- databases: A list of specific database schemas to create within the cluster.
- instance_class/rw_instance_class: The hardware specifications of DB instances (Default: db.t4g.small).
- multi_az: If true, deploys a standby instance in a different Availability Zone for High Availability.
EOT
}
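The counterpart to these inputs is a small set of outputs that the bootstrapping Lambda will consume in Part 2. The exact list is covered there; a minimal sketch, assuming the wrapped module's standard outputs, could look like this:

```hcl
output "cluster_arn" {
  value = module.cluster.cluster_arn
}

output "cluster_endpoint" {
  value = module.cluster.cluster_endpoint
}

output "master_user_secret_arn" {
  # Secrets Manager secret holding the managed master password
  value = try(module.cluster.cluster_master_user_secret[0].secret_arn, null)
}
```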
This concludes the foundational setup for the Aurora cluster and our specific Terraform approach. With the infrastructure in place, we have a stable platform to build upon. In Part 2, we will dive into the Lambda implementation and explore how to consume the outputs from this module to drive the database bootstrapping process.
This approach is primarily aimed at platform and infrastructure teams managing shared Aurora clusters across multiple services and environments.
In This Series
- Part 1: Architectural foundation and engineering basics.
- Part 2: Deep dive into Lambda implementation and IAM-based access.
- Part 3: Orchestrating workflows and on-demand execution using AWS Step Functions.