Restore AWS RDS Snapshots using Terraform

#aws #rds #terraform #cloud

The problem

A common use case for AWS accounts is the creation of ephemeral platforms.
For development or integration environments in particular, we want to optimize costs and therefore shut down services when they are not needed.

In our case, this process is managed by Terraform, which creates and destroys the platform on a schedule driven by our CI/CD tool.
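
As a rough illustration, the scheduled jobs can be as small as two shell functions wrapping Terraform. This is only a sketch: the function names platform_up and platform_down are hypothetical, and it assumes the CI/CD tool passes in the name of the Terraform workspace to act on.

```shell
#!/bin/sh
# Hypothetical helpers for a scheduled CI/CD job; the names are illustrative.

# Select the workspace (creating it on first use), then build the platform.
platform_up() {
  workspace=$1
  terraform workspace select "$workspace" || terraform workspace new "$workspace" || return 1
  terraform apply -auto-approve
}

# Select the workspace, then tear the platform down.
platform_down() {
  workspace=$1
  terraform workspace select "$workspace" || return 1
  terraform destroy -auto-approve
}
```

A morning job would then call platform_up with the workspace name, and an evening job would call platform_down with the same name.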

The database problem

Because the platform goes through this create/destroy cycle, customers often want their database to come back filled with the test data that their integration and functional tests rely on.

Fortunately, RDS can take a final snapshot of the database just before deletion, and you can restore from it the next time the platform is created.
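
To see what a destroy left behind, you can ask the AWS CLI for the most recent snapshot of the instance. The sketch below is one possible way to do it and is not part of the original setup. The function name latest_snapshot_id is made up, and it relies on final snapshots being stored as manual snapshots.

```shell
#!/bin/sh
# Hypothetical helper: print the identifier of the most recent manual
# snapshot of the DB instance whose identifier is given as $1.
latest_snapshot_id() {
  aws rds describe-db-snapshots \
    --db-instance-identifier "$1" \
    --snapshot-type manual \
    --query 'sort_by(DBSnapshots, &SnapshotCreateTime)[-1].DBSnapshotIdentifier' \
    --output text
}
```

Calling latest_snapshot_id db-instance-id prints the snapshot identifier. When the instance has no manual snapshot yet, the query result is null, which the CLI typically renders as None in text output.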

And this is where things are gonna get messy.

The Terraform RDS resource problem

Let's check how the AWS RDS resource works with Terraform.
You have two options to create an RDS instance:

Without a Snapshot

resource "aws_db_instance" "dbname" {
  allocated_storage         = 10
  identifier                = "db-instance-id"
  db_name                   = "dbname"
  engine                    = "postgres"
  engine_version            = data.aws_rds_engine_version.pg_version.version
  instance_class            = "db.t3.micro"
  username                  = "adminuser"
  password                  = random_password.admin.result
  skip_final_snapshot       = false
  final_snapshot_identifier = "${terraform.workspace}-${formatdate("YYYYMMDDhhmmss", timestamp())}"
  storage_encrypted         = true

  backup_retention_period = 5
  backup_window           = "07:00-09:00"

  maintenance_window = "Tue:05:00-Tue:07:00"

  vpc_security_group_ids = [
    aws_security_group.allow_postgres[0].id
  ]

  db_subnet_group_name = var.subnet_db_name

}

With a Snapshot

resource "aws_db_instance" "dbname" {
  identifier                = "db-instance-id"
  db_name                   = "dbname"
  instance_class            = "db.t3.micro"
  skip_final_snapshot       = false
  final_snapshot_identifier = "${terraform.workspace}-${formatdate("YYYYMMDDhhmmss", timestamp())}"
  snapshot_identifier       = data.aws_db_snapshot.latest_snapshot.id
  storage_encrypted         = true

  backup_retention_period = 5
  backup_window           = "07:00-09:00"

  maintenance_window = "Tue:05:00-Tue:07:00"

  vpc_security_group_ids = [
    aws_security_group.allow_postgres[0].id
  ]

  db_subnet_group_name = var.subnet_db_name

}

As we can see, the same resource is configured differently depending on whether the snapshot_identifier property is set.

Before the first Terraform destroy

Before the first Terraform destroy, there is no snapshot to restore from, so the first apply must use the first definition above. After the first destroy, a snapshot becomes available, and the RDS resource should use the second definition.

Can we come up with an RDS definition that works in all cases?
Turns out we can, with a few Terraform tricks.

The solution

Terraform data sources should point to existing resources

The first thing to note is that the snapshot identifier to restore from comes from a Terraform data source:

snapshot_identifier       = data.aws_db_snapshot.latest_snapshot.id

which is defined like this:

data "aws_db_snapshot" "latest_snapshot" {
  db_instance_identifier = "db-instance-id"
  most_recent            = true
}

But Terraform cannot read a data source that matches nothing: the lookup fails and aborts the whole Terraform run. So we need to read the snapshot data source only if a snapshot already exists.

Reading data only if it exists

We first need to check whether a snapshot exists before reading it with Terraform.
So we make the following changes:

data "external" "rds_final_snapshot_exists" {
  program = [
    "./check-rds-snapshot.sh",
    "db-instance-${terraform.workspace}"
  ]
}


data "aws_db_snapshot" "latest_snapshot" {
  count                  = data.external.rds_final_snapshot_exists.result.db_exists ? 1 : 0
  db_instance_identifier = "db-instance-id"
  most_recent            = true
}

And the content of the check-rds-snapshot.sh script:

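#!/bin/bash
# Prints {"db_exists": "true"} or {"db_exists": "false"} for the Terraform external data source.

db_id=$1

if [ -z "${db_id}" ]; then
  echo "usage: $0 <db_id>" >&2
  exit 1
fi

# With text output, the first word is DBSNAPSHOTS when at least one snapshot is returned.
RESULT=($(aws rds describe-db-snapshots --db-instance-identifier "${db_id}" --output text 2> /dev/null))
aws_result=$?

if [ ${aws_result} -eq 0 ] && [[ ${RESULT[0]} == "DBSNAPSHOTS" ]]; then
  result='true'
else
  result='false'
fi

# The external data source expects a JSON object whose values are all strings.
jq -n --arg exists "${result}" '{"db_exists": $exists}'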

The external data source uses the AWS CLI to check whether a snapshot exists, and the count argument on the snapshot data source prevents Terraform from reading it when none exists.
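
Since the external data source only accepts a JSON object whose values are strings, it can help to check the script's output by hand before wiring it into Terraform. The helper below is a hypothetical sketch, and its name validate_external_result is made up. It uses plain pattern matching and assumes jq's default formatting, with a space after the colon.

```shell
#!/bin/sh
# Hypothetical helper: read the script's stdout on stdin and succeed only if
# it carries a db_exists flag set to the string "true" or "false".
validate_external_result() {
  output=$(cat)
  case "$output" in
    *'"db_exists": "true"'* | *'"db_exists": "false"'*)
      # Echo the validated JSON back so it can be inspected or piped on.
      printf '%s\n' "$output"
      ;;
    *)
      echo "unexpected output: $output" >&2
      return 1
      ;;
  esac
}
```

For example, piping ./check-rds-snapshot.sh db-instance-id into validate_external_result prints the JSON back when it has the expected shape. It returns a non-zero status otherwise, for instance when the flag is a bare boolean instead of a string.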

Now we only need to combine the two RDS declarations into one that works every time!

resource "aws_db_instance" "dbname" {
  allocated_storage         = 10
  identifier                = "db-instance-id"
  db_name                   = "dbname"
  engine                    = "postgres"
  engine_version            = data.aws_rds_engine_version.pg_version.version
  instance_class            = "db.t3.micro"
  username                  = "adminuser"
  password                  = random_password.admin.result
  skip_final_snapshot       = false
  final_snapshot_identifier = "${terraform.workspace}-${formatdate("YYYYMMDDhhmmss", timestamp())}"
  snapshot_identifier       = try(data.aws_db_snapshot.latest_snapshot[0].id, null)
  storage_encrypted         = true

  backup_retention_period = 5
  backup_window           = "07:00-09:00"

  maintenance_window = "Tue:05:00-Tue:07:00"

  vpc_security_group_ids = [
    aws_security_group.allow_postgres[0].id
  ]

  db_subnet_group_name = var.subnet_db_name

  lifecycle {
    ignore_changes = [
      snapshot_identifier,
      final_snapshot_identifier
    ]
  }
}