TL;DR
Terraform can create an Aurora cluster, but it cannot safely manage databases, users, or grants.
This series shows how we decouple infrastructure provisioning from database bootstrapping using Lambda — and later, Step Functions — without breaking Terraform’s declarative model.
The Problems
Terraform’s Limitations
While Terraform is ideal for provisioning Aurora clusters, it is poorly suited for managing database-level objects because:
Limited Scope: It can only create one initial database; it cannot manage multiple databases, complex ownership models, or granular IAM-based access.
State Conflict: Managing SQL internals within Terraform leads to brittle plans, partial failures, and irreversible drift.
Risk: Running SQL during terraform apply creates tight coupling between infrastructure and runtime data, increasing operational risk.
The Manual Gap
Although Terraform can provision the cluster and an initial database in minutes, it leaves behind an operational vacuum. Once the cluster is live, the engineering team is frequently forced into a manual waiting game:
Access Stagnation: The cluster exists, but there are no application-specific users.
Security Hurdles: A platform engineer must manually log in to configure IAM authentication and map roles to database users.
Permission Complexity: Defining granular Read/Write privileges, schema ownership, and grant structures becomes a ticket-driven manual task.
Developer Friction: Application developers sit idle until the tasks above are completed because the Initial Database provided by AWS lacks the schema-level management required for actual deployment.
This creates a bottleneck where infrastructure is automated, but the database remains unusable until a human intervenes to bridge the gap between a cloud resource and a functional data store.
This post outlines a platform engineering approach to bootstrapping Amazon Aurora databases, arguing that database internals should be decoupled from Terraform state.
Terraform’s responsibility ends when the cluster becomes reachable.
Everything beyond that point is runtime behavior, not infrastructure state.
The Cluster Basics
While this post isn't a deep dive into cluster creation itself, I will cover the essential infrastructure components to provide context for Part 2. My implementation utilizes the standard terraform-aws-rds-aurora module, wrapped within a custom child module to meet our specific requirements.
Child Module
module "cluster" {
source = "terraform-aws-modules/rds-aurora/aws"
version = "10.0.2" #<-- latest at the time of writing
name = local.cluster_name
engine = var.db_profile.engine
engine_mode = "provisioned" #<-- NOT cluster-type
engine_version = var.db_profile.engine_version
storage_encrypted = true
enable_http_endpoint = anytrue([
var.enable_db_managment, var.data_api_enabled
])
master_username = var.master_uname
manage_master_user_password = true
master_user_secret_kms_key_id = var.kms_key_id
manage_master_user_password_rotation = true
master_user_password_rotate_immediately = true
master_user_password_rotation_schedule_expression = "rate(21 days)"
vpc_id = var.vpc_id
db_subnet_group_name = var.subnet_group_name
security_group_ingress_rules = {
for ci, dr in var.ingress_cidr_blocks :
"cidr${ci}" => { cidr_ipv4 = dr }
}
cluster_monitoring_interval = 60
cluster_parameter_group = {
name = local.cluster_name
family = local.db_engine_family
description = "${local.cluster_name} cluster parameter group"
parameters = var.cluster_parameter_group_parameters
}
db_parameter_group = {
name = local.cluster_name
family = local.db_engine_family
description = "${local.cluster_name} Database parameter group"
parameters = var.db_parameter_group_parameters
}
apply_immediately = true
deletion_protection = true
skip_final_snapshot = true
# ----------------------------------------------
# Only required for Serverless V2 clusters
# ----------------------------------------------
serverlessv2_scaling_configuration = (
var.cluster_type == "serverless"
) ? {
min_capacity = 2
max_capacity = 10
} : null
cluster_instance_class = (
var.cluster_type == "serverless"
) ? "db.serverless" : var.db_profile.instance_class
# ----------------------------------------------
# Based on the number of AZs, it creates
# 1x RW-instance & 1x RO-instance per AZ
# ----------------------------------------------
instances = var.cluster_type == "serverless" ? {
for ix, az in concat(var.aws_zones, [join("", var.aws_zones)]) :
"${var.db_profile.engine_type}${ix + 1}" => {}
} : merge(
{
"0rw" = {
instance_class = lookup(
var.db_profile, "rw_instance_class", var.db_profile.instance_class
)
}
},
{
for ix, az in var.aws_zones :
"${ix + 1}ro" => {
instance_class = var.db_profile.instance_class
}
}
)
#create_cloudwatch_log_group = true
enabled_cloudwatch_logs_exports = (
var.db_profile.engine == "aurora-mysql" ? [
"audit",
"error",
"general",
"slowquery",
] : (
var.db_profile.engine == "aurora-postgresql"
? ["postgresql"] : []
)
)
iam_database_authentication_enabled = true #<-- MUST be true
tags = merge(
var.extra_tags,
{
Name = local.cluster_name
SubResource = "module.cluster"
}
)
}
Engine Mode Ambiguity: When deploying Aurora Serverless v2, it is important to note that `engine_mode` must be set to `provisioned`. Unlike Serverless v1, v2 operates under the provisioned framework to allow for better scaling and feature parity.
Log Group Conflicts: Avoid setting `create_cloudwatch_log_group = true` within the module. In many Aurora configurations, AWS automatically creates the log group as soon as logging is enabled; if Terraform attempts to create it simultaneously, the deployment will fail due to a naming conflict.
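If Terraform still needs a handle on those log groups, for example to attach a metric filter or read the retention setting, a data source lookup after cluster creation avoids the race entirely. A minimal sketch, assuming the PostgreSQL export and the module's `cluster_id` output:

```hcl
# Look up the log group that AWS creates automatically once log exports are
# enabled, instead of letting Terraform try to create it and hit a conflict.
data "aws_cloudwatch_log_group" "postgresql" {
  name       = "/aws/rds/cluster/${module.cluster.cluster_id}/postgresql"
  depends_on = [module.cluster]
}
```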
Root Module
In the root module, the child module is simply referenced like this:
module "db_cluster" {
for_each = anytrue(values(local.service_enabled)) ? tomap(local.db_profiles) : {}
source = "./services//aurora"
aws_zones = var.aws_zones
cluster_type = var.cluster_type
db_profile = each.value #<-- details below
ingress_cidr_blocks = distinct(local.static_route_cidrs)
name_prefix = replace(local.template_name, "/-[^-]+$/", "")
kms_key_id = var.kms_master_key_arn
subnet_group_name = aws_db_subnet_group.default[var.service_name].name
vpc_id = local.my_vpc_info.id
cluster_parameter_group_parameters = lookup(
local.cluster_parameters, each.key, []
)
db_parameter_group_parameters = lookup(
local.db_parameters, each.key, []
)
dns_host_prefix = var.dns_name_prefix
dns_hosted_zone_id = var.dns_hosted_zone_id
enable_db_managment = each.value.enable_db_managment
lambda_service_role = aws_iam_role.air_lfn[var.service_name].arn
extra_tags = {
Resource = "module.db_cluster"
Module = var.tf_module_name
"${var.eks_access_tag}" = true
}
/*
master_uname = var.dbs_master_user
*/
}
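The `aws_db_subnet_group.default` referenced above lives in the same root module. Its exact shape is not the focus of this post; a minimal sketch, assuming the private subnet IDs are already resolved into `local.my_vpc_info` (the attribute name below is used for illustration only):

```hcl
# Hypothetical sketch of the subnet group keyed by service name, matching the
# aws_db_subnet_group.default[var.service_name] reference in the module call.
resource "aws_db_subnet_group" "default" {
  for_each   = toset([var.service_name])
  name       = "${local.template_name}-${each.key}"
  subnet_ids = local.my_vpc_info.private_subnet_ids # assumed attribute

  tags = {
    Name = "${local.template_name}-${each.key}"
  }
}
```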
While the `name_prefix` can be customized to any string, I utilize a standardized naming structure to maintain consistency across the platform. By using `local.template_name`, I construct a prefix following the pattern `<env><acc>-<vpc_name>-<service_name>`, which ensures resources are easily identifiable and organized.
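Inside the child module, that prefix is then combined with the engine type to build `local.cluster_name`. The exact composition is specific to my setup; a rough sketch of the idea:

```hcl
# Illustrative only: one readable, unique cluster name per engine, e.g.
# "prd01-corevpc-payments" + "mysql" => "prd01-corevpc-payments-mysql".
locals {
  cluster_name = lower("${var.name_prefix}-${var.db_profile.engine_type}")
}
```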
Data set
I leverage Terragrunt to orchestrate our deployments. Below are the specific input variables and configurations I defined as the source of truth for the Aurora cluster:
cluster_type: serverless #<-- or provisioned
db_clusters:
  - mysql
  - pgsql
#
db_profile:
  default:
    allocated_storage: 20
    auto_stop: true
    autoscaling_enabled: true
    autoscaling_max_size: 3
    autoscaling_min_size: 1
    database_user: zenapp-iam-user
    enable_db_managment: true #<-- required for part2
    instance_class: db.t4g.small
    max_allocated_storage: 0
    multi_az: true
  mysql:
    enabled: true
    database_port: 3306
    engine: aurora-mysql
    engine_version: '8.0'
    databases:
      - my-database1
      - my-database2
  pgsql:
    enabled: true
    database_port: 5432
    engine: aurora-postgresql
    engine_version: 17
    databases:
      - pg-database1
      - pg-database2
#
cluster_parameter_group:
  mysql:
    family: aurora-mysql8.0
    description: 'MySQL Aurora Cluster Parameters'
    immediate:
      innodb_lock_wait_timeout: 60
      log_bin_trust_function_creators: 1
      long_query_time: 1
      max_allowed_packet: 67108864
      require_secure_transport: 'ON'
      slow_query_log: 1
      time_zone: UTC
      wait_timeout: 28800
    pending-reboot:
      aurora_parallel_query: 0
      binlog_checksum: NONE
      binlog_format: ROW
      log_output: FILE
      max_connections: LEAST({DBInstanceClassMemory/12582912},5000)
      tls_version: TLSv1.2
  pgsql:
    family: aurora-postgresql17
    description: 'PostgreSQL Aurora Cluster Parameters'
    immediate:
      deadlock_timeout: 1000
      idle_in_transaction_session_timeout: 600000
      log_lock_waits: 1
      log_min_duration_statement: 2000
      log_statement: ddl
    pending-reboot:
      log_line_prefix: '%t:%r:%u@%d:[%p]:'
      max_connections: LEAST({DBInstanceClassMemory/9531392},5000)
      rds.force_ssl: 1
      shared_preload_libraries: pg_stat_statements
      timezone: UTC
      track_activity_query_size: 2048
#
db_parameter_group:
  mysql:
    family: aurora-mysql8.0
    description: 'MySQL DB parameters'
    immediate:
      connect_timeout: 60
      general_log: 0
      log_bin_trust_function_creators: 1
      slow_query_log: 1
    pending-reboot:
      performance_schema: 1
  pgsql:
    family: aurora-postgresql17
    description: 'PostgreSQL DB parameters'
    immediate:
      default_statistics_target: 100
      maintenance_work_mem: 131072
      random_page_cost: 1.1
      temp_buffers: 16384
      work_mem: 8192
Aurora configuration updates are governed by two apply methods: `immediate` and `pending-reboot`. These settings apply to both cluster and DB parameter groups, but the engine, particularly PostgreSQL, is notoriously strict about how these classifications are handled. After extensive trial and error to avoid unexpected downtime or deployment friction, I've settled on the categorized list above as a stable, battle-tested baseline for platform environments.
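For context, the dataset above is plain YAML that Terragrunt decodes and passes to the root module. A minimal `terragrunt.hcl` sketch, with the file names and module path as assumptions:

```hcl
terraform {
  source = "../../modules//aurora-platform" # illustrative path
}

inputs = merge(
  # shared platform settings such as VPC and account metadata
  yamldecode(file(find_in_parent_folders("common.yaml"))),
  # the Aurora dataset shown above
  yamldecode(file("${get_terragrunt_dir()}/cluster.yaml"))
)
```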
Local Variables
I use a `locals {...}` block to normalize the dataset, ensuring direct compatibility with the terraform-aws-rds-aurora module. The key transformations are outlined below:
locals {
....
service_enabled = {
for DB in var.db_clusters :
DB => var.service_enabled && tobool(
local.db_profiles[DB].enabled
) if contains(keys(local.db_profiles), DB)
}
# ---------------------------------------------
# Decision making and conditional resource
# creation happens from the values from here
# ---------------------------------------------
db_profiles = {
for DB in var.db_clusters : DB => merge(
var.db_profile.default,
{
for K1, V1 in var.db_profile[DB] :
K1 => V1
if V1 != null
},
{
dbs_user_host = var.eks_base_cidr != null ? join(".", [
replace(var.eks_base_cidr, "/\\.[^./]+\\/\\d+$/", ""),
"%",
]) : "%"
engine_type = DB
}
)
if contains(keys(var.db_profile), DB) && var.db_profile[DB].enabled
}
#
db_parameters = {
for DB in var.db_clusters : DB => flatten([
for K1, V1 in {
immediate = merge(
try(var.db_parameter_group["default"].immediate, {}),
try(var.db_parameter_group[DB].immediate, {})
)
pending-reboot = merge(
try(var.db_parameter_group["default"]["pending-reboot"], {}),
try(var.db_parameter_group[DB]["pending-reboot"], {})
)
} : [
for K2, V2 in V1 : {
name = K2
value = try(tonumber(V2), V2)
apply_method = K1
}
]
])
if contains(keys(local.db_profiles), DB)
}
#
cluster_parameters = {
for DB in var.db_clusters : DB => flatten([
for K1, V1 in {
immediate = merge(
try(var.cluster_parameter_group["default"].immediate, {}),
try(var.cluster_parameter_group[DB].immediate, {})
)
pending-reboot = merge(
try(var.cluster_parameter_group["default"]["pending-reboot"], {}),
try(var.cluster_parameter_group[DB]["pending-reboot"], {})
)
} : [
for K2, V2 in V1 : {
name = K2,
value = try(tonumber(V2), V2),
apply_method = K1
}
]
])
if contains(keys(local.db_profiles), DB)
}
}
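While iterating on these transformations, it helps to expose the normalized structures as temporary outputs and inspect them with `terraform output` or a targeted plan. A throwaway sketch (the output name is arbitrary):

```hcl
# Temporary debugging aid: the flattened parameter list, where each element
# looks like { name = "connect_timeout", value = 60, apply_method = "immediate" }.
output "debug_db_parameters" {
  value = local.db_parameters
}
```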
Input Variable Definitions
For completeness, these are the Terraform variables used to ingest the configuration from the Terragrunt YAML dataset:
variable "cluster_type" {
type = string
default = "serverless"
description = <<-EOT
The deployment mode for the Aurora cluster.
Valid values are 'serverless' or 'provisioned'.
EOT
validation {
condition = contains(
["serverless", "provisioned"], var.cluster_type
)
error_message = <<-EOT
The cluster_type must be either 'serverless' or 'provisioned'.
EOT
}
}
variable "cluster_parameter_group" {
type = map(object({
family = optional(string, null)
description = optional(string, null)
immediate = optional(map(string), {})
pending-reboot = optional(map(string), {})
}))
default = {}
description = <<-EOT
Configuration for RDS Cluster Parameter Groups (applies to the entire Aurora cluster).
Use this to manage engine-specific settings that affect all instances in the cluster.
Key attributes:
- family: The engine family (e.g., 'aurora-postgresql15').
- description: A custom description for the parameter group.
- immediate: A map of parameters to apply immediately without a reboot (static and dynamic).
- pending-reboot: A map of parameters that require a manual reboot to take effect.
EOT
}
variable "db_parameter_group" {
type = map(object({
family = optional(string, null)
description = optional(string, null)
immediate = optional(map(string), {})
pending-reboot = optional(map(string), {})
}))
default = {}
description = <<-EOT
Configuration for RDS DB Parameter Groups (applies to individual DB instances).
Use this for settings that can vary between the primary and replica instances.
Key attributes:
- family: The engine family (e.g., 'aurora-postgresql15').
- description: A custom description for the parameter group.
- immediate: A map of parameters to apply immediately via the AWS API.
- pending-reboot: A map of parameters that will only apply after the instance is rebooted.
EOT
}
variable "db_profile" {
type = map(object({
allocated_storage = optional(number, 20)
auto_stop = optional(bool, false)
autoscaling_enabled = optional(bool, true)
autoscaling_max_size = optional(number, 3)
autoscaling_min_size = optional(number, 1)
database_port = optional(number, 5432)
database_user = optional(string)
enabled = optional(bool, false)
engine = optional(string, null)
engine_version = optional(string, null)
databases = optional(list(string), [])
enable_db_managment = optional(bool, true)
instance_class = optional(string, "db.t4g.small")
max_allocated_storage = optional(number, 0)
multi_az = optional(bool, true)
rw_instance_class = optional(string, "db.t4g.small")
}))
description = <<-EOT
A map of database profile objects used to define one or more database clusters/instances.
Key attributes:
- allocated_storage: Initial disk size in GB (Default: 20).
- auto_stop: Whether to automatically stop the instance during idle hours to save costs.
- autoscaling_enabled: Enables horizontal or vertical scaling for the DB instances.
- database_port: The network port for database connections (Default: 5432 for PostgreSQL).
- database_user: The name of the application-user for the database.
- engine/engine_version: Specific RDS engine type (e.g., 'aurora-postgresql') and its version.
- databases: A list of specific database schemas to create within the cluster.
- instance_class/rw_instance_class: The hardware specifications of DB instances (Default: db.t4g.small).
- multi_az: If true, deploys a standby instance in a different Availability Zone for High Availability.
EOT
}
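The counterpart to these inputs is a small set of outputs that the bootstrapping Lambda will consume in Part 2. The exact list is covered there; a minimal sketch, assuming the wrapped module's standard outputs, could look like this:

```hcl
output "cluster_arn" {
  value = module.cluster.cluster_arn
}

output "cluster_endpoint" {
  value = module.cluster.cluster_endpoint
}

output "master_user_secret_arn" {
  # Secrets Manager secret holding the managed master password
  value = try(module.cluster.cluster_master_user_secret[0].secret_arn, null)
}
```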
This concludes the foundational setup for the Aurora cluster and our specific Terraform approach. With the infrastructure in place, we have a stable platform to build upon. In Part 2, we will dive into the Lambda implementation and explore how to consume the outputs from this module to drive the database bootstrapping process.
This approach is primarily aimed at platform and infrastructure teams managing shared Aurora clusters across multiple services and environments.
In This Series
- Part 1: Architectural foundation and engineering basics.
- Part 2: Deep dive into Lambda implementation and IAM-based access.
- Part 3: Orchestrating workflows and on-demand execution using AWS Step Functions.