Terraform AWS - Imports, Key-Pairs and Broken States

#aws #terraform

There is nothing better than starting a new week with a full terraform infra refactoring using a clean (and shared) remote state.

What do I have?

an old and functional terraform state
a new code folder better organized
an empty s3 bucket to store the new state

Starting with the clean and shared remote state:

variable "the_profile" {}
variable "region" {}

provider "aws" {
  region = var.region
  profile = var.the_profile
}

resource "aws_s3_bucket" "state_bucket" {
  bucket = "my-uniquely-named-state-bucket"
  acl    = "private"

  versioning {
    enabled = true # this is a must
  }

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm = "AES256"
      }
    }
  }
}

resource "aws_dynamodb_table" "terraform_state_lock" {
  name = "my-lock-table"
  read_capacity  = 1
  write_capacity = 1
  hash_key       = "LockID"
  attribute {
    name = "LockID"
    type = "S"
  }
}

Creating this init structure will then allow us to properly configure our backend:

## Backend to keep status

provider "aws" {
  region  = var.region
  profile = var.the_profile
}

terraform {
  backend "s3" {
    bucket         = "my-uniquely-named-state-bucket"
    key            = "mystate.tfstate"
    region         = "INSERT_YOUR_REGION_HERE"
    encrypt        = true
    dynamodb_table = "my-lock-table"
  }
}

Things are starting to happen.

After some copy+paste oriented programming, I managed to move all my files from the old structure into the new one, and I'm ready to run my first plan.

And terraform said:

Plan: WAY TOO MANY to add, 0 to change, 0 to destroy.

( ok, it didn't say that, but that's what I read )

Adding the VPC, the Subnets, the Routing Tables, the Security Groups, the Instances and all the other "direct resources" was fairly easy: nothing that a few queries here and there and some terraform import TERRAFORM_RESOURCE_NAME AWS_ID didn't handle. And the easiest part was done.

Enter the Security Group Rules

This is where things started to get interesting.
Quoting directly from the Terraform aws_security_group_rule page:
Security Group Rules can be imported using the security_group_id, type, protocol, from_port, to_port, and source(s)/destination(s) (e.g. cidr_block) separated by underscores (_). All parts are required.

Luckily I had access to my previous functional terraform state. And with a little bit of python I could find and extract the required ID for each missing security group rule.
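A minimal sketch of what that bit of Python could look like. It is not the original script. It assumes the old state uses the version 4 JSON layout (a top-level `resources` list whose entries carry `instances` with `attributes`), that the rules live in the root module without `count` or `for_each`, and that the sources are joined in the order shown. The function names are made up for illustration.

```python
def sg_rule_import_id(attrs):
    """Build an aws_security_group_rule import ID from state attributes."""
    # Collect every source/destination recorded for the rule.
    # The ordering used here is an assumption, not something the docs quote fixes.
    sources = []
    sources += attrs.get("cidr_blocks") or []
    sources += attrs.get("ipv6_cidr_blocks") or []
    sources += attrs.get("prefix_list_ids") or []
    if attrs.get("source_security_group_id"):
        sources.append(attrs["source_security_group_id"])

    parts = [
        attrs["security_group_id"],
        attrs["type"],
        attrs["protocol"],
        str(attrs["from_port"]),
        str(attrs["to_port"]),
        *sources,
    ]
    return "_".join(parts)


def import_commands(state):
    """Return one `terraform import` command per security group rule in the state."""
    commands = []
    for res in state.get("resources", []):
        if res.get("mode") != "managed" or res.get("type") != "aws_security_group_rule":
            continue
        address = f'{res["type"]}.{res["name"]}'
        for inst in res.get("instances", []):
            commands.append(
                f'terraform import {address} {sg_rule_import_id(inst["attributes"])}'
            )
    return commands


# Tiny stand-in for the old state file (values taken from this post).
old_state = {
    "version": 4,
    "resources": [
        {
            "mode": "managed",
            "type": "aws_security_group_rule",
            "name": "sg_allow_stuff",
            "instances": [
                {
                    "attributes": {
                        "security_group_id": "sg-001122334455",
                        "type": "ingress",
                        "protocol": "tcp",
                        "from_port": 8080,
                        "to_port": 8080,
                        "cidr_blocks": [],
                        "source_security_group_id": "sg-001122334455",
                    }
                }
            ],
        }
    ],
}

for command in import_commands(old_state):
    print(command)
```

With a real state file, `old_state` would come from `json.load()` on the downloaded `.tfstate` instead of the inline dictionary.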

So, once more, the import:
terraform import aws_security_group_rule.sg_allow_stuff sg-001122334455_ingress_tcp_8080_8080_sg-001122334455

Thing is, the plan continued to show things like:

-/+ resource "aws_security_group_rule" "sg_allow_stuff" {
      - cidr_blocks              = [] -> null
      ~ id                       = "sgrule-12345678" -> (known after apply)
      - ipv6_cidr_blocks         = [] -> null
      - prefix_list_ids          = [] -> null
      ~ self                     = false -> true # forces replacement
      ~ source_security_group_id = "sg-001122334455" -> (known after apply)
        # (5 unchanged attributes hidden)
    }

That's when I noticed this line:
self = false -> true # forces replacement
and the fact that the source and destination security group IDs were the same.

So, the self was the issue. No problem, just add it at the end of the AWS_ID part on the import command:
terraform import aws_security_group_rule.sg_allow_stuff sg-001122334455_ingress_tcp_8080_8080_sg-001122334455_self

And it imported!
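If the import IDs are generated by a script, the same workaround can be folded in. This is only a sketch of the fix described above: `rule_config` is a hypothetical dictionary holding the arguments of the rule as written in the new code, and the helper name is made up.

```python
def with_self(import_id, rule_config):
    """Append the `self` token when the rule in the new code sets self = true."""
    if rule_config.get("self") and not import_id.endswith("_self"):
        return f"{import_id}_self"
    return import_id


base_id = "sg-001122334455_ingress_tcp_8080_8080_sg-001122334455"
print(with_self(base_id, {"self": True}))
print(with_self(base_id, {"self": False}))
```

Rules that do not set `self` keep their ID untouched, so the helper can be applied to every rule.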

And the stubborn Key

Two minutes (to midnight) later, all the security group rules were in the new state.

There was only this aws_key_pair missing.
Ok! Let's go!
terraform import aws_key_pair.auth thekeyname
Ok! It imported as all the other resources! Nothing new so far...

But on the next apply, THAT key was still marked for creation:

  # aws_key_pair.auth will be created
  + resource "aws_key_pair" "auth" {
      + arn         = (known after apply)
      + fingerprint = (known after apply)
      + id          = (known after apply)
      + key_name    = "thekeyname"
      + key_pair_id = (known after apply)
      + public_key  = "#### EMPTY"
    }

Oh, that "#### EMPTY". I forgot to add the actual key on my side.
A few keystrokes later (now with the correct key), the same:

  # aws_key_pair.auth will be created
  + resource "aws_key_pair" "auth" {
      + arn         = (known after apply)
      + fingerprint = (known after apply)
      + id          = (known after apply)
      + key_name    = "thekeyname"
      + key_pair_id = (known after apply)
      + public_key  = "ssh-rsa yaddayaddayadda..."
    }

Hello Google, my old friend :)
Turns out I wasn't the first to hit this wall.

Ok, so let's try one of the suggested solutions there.

Download the new state from S3
Edit the state, search for the entry corresponding to the key
Enter the public key manually on the public_key field (not forgetting the quotes)
Save the state
Upload it and ...

!!!RUN TERRAFORM!!! ( !!! RUN FOR YOUR LIVES !!! )

Sadly, Terraform says NO! :(

Error: Error loading state: state data in S3 does not have the expected content.

This may be caused by unusually long delays in S3 processing a previous state
update.  Please wait for a minute or two and try again. If this problem
persists, and neither S3 nor DynamoDB are experiencing an outage, you may need
to manually verify the remote state and update the Digest value stored in the
DynamoDB table to the following value: 5d3a61408eb4ba89aeaf09818561d0a7

Like anyone in trouble, I just focused on the first part of the message. "Things are looking bad", I thought. On top of that, I was so confident that I didn't even back up the state before altering it.
BUT ... Bucket versioning to the rescue!

aws s3api get-object --bucket my-uniquely-named-state-bucket --key mystate.tfstate --version-id abcdversionreferencehere mystate.tfstate
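The version ID has to come from somewhere. One way to find it is `aws s3api list-object-versions`, which returns a JSON document with a `Versions` list. The sketch below picks the newest version that is not the current one. It assumes the listing was saved as JSON, that all timestamps in it share one format (so string order matches time order), and that the version you want really is the one just before the latest. The function name is made up.

```python
def previous_version_id(listing, key):
    """Return the VersionId of the newest non-latest version of `key`, or None."""
    candidates = [
        v for v in listing.get("Versions", [])
        if v["Key"] == key and not v["IsLatest"]
    ]
    if not candidates:
        return None
    # ISO 8601 timestamps in the same format sort chronologically as strings.
    newest = max(candidates, key=lambda v: v["LastModified"])
    return newest["VersionId"]


# Stand-in for the saved output of list-object-versions (IDs are invented).
listing = {
    "Versions": [
        {"Key": "mystate.tfstate", "VersionId": "v3", "IsLatest": True,
         "LastModified": "2021-03-01T12:00:00+00:00"},
        {"Key": "mystate.tfstate", "VersionId": "v2", "IsLatest": False,
         "LastModified": "2021-03-01T11:00:00+00:00"},
        {"Key": "mystate.tfstate", "VersionId": "v1", "IsLatest": False,
         "LastModified": "2021-03-01T10:00:00+00:00"},
    ]
}

print(previous_version_id(listing, "mystate.tfstate"))
```

After more than one bad upload, the "previous" version is itself one of the broken ones, so it is worth checking the timestamps by eye before restoring.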

Of course this didn't solve the problem (as I figured out after a second upload). I decided to calm down and read everything.

DynamoDB Digest? Interesting. So, there was no big problem after all. It turns out the md5 digest of the tfstate is stored in DynamoDB, and there were a couple of steps missing from my procedure:

Download the new state from S3
Edit the state, search for the entry corresponding to the key
Enter the public key manually on the public_key field (not forgetting the quotes)
Save the state
Get the md5 sum of the edited state
Update the md5 digest in the DynamoDB lock table (my-lock-table)
Upload the modified state
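The edit and the digest steps above can be sketched in a few lines of Python. This is an illustration under assumptions, not the exact procedure I ran: it assumes the version 4 state layout, and it assumes the digest lives in an item whose `LockID` is `<bucket>/<key>-md5` with the hash in a `Digest` attribute (check the existing item in your own table before writing anything). The function names are made up. Note that the error message already prints the digest Terraform computed for the file currently in S3.

```python
import hashlib
import json


def set_public_key(state, resource_name, public_key):
    """Set public_key on every instance of aws_key_pair.<resource_name>; return the count."""
    changed = 0
    for res in state.get("resources", []):
        if res.get("type") == "aws_key_pair" and res.get("name") == resource_name:
            for inst in res.get("instances", []):
                inst["attributes"]["public_key"] = public_key
                changed += 1
    return changed


def digest_item(bucket, key, state_bytes):
    """Build the DynamoDB item holding the MD5 digest of the uploaded state."""
    return {
        "LockID": {"S": f"{bucket}/{key}-md5"},  # assumed item key format
        "Digest": {"S": hashlib.md5(state_bytes).hexdigest()},
    }


# Stand-in for the downloaded state (trimmed to the relevant resource).
state = {
    "version": 4,
    "resources": [
        {
            "mode": "managed",
            "type": "aws_key_pair",
            "name": "auth",
            "instances": [{"attributes": {"key_name": "thekeyname", "public_key": None}}],
        }
    ],
}

changed = set_public_key(state, "auth", "ssh-rsa yaddayaddayadda...")

# The digest must be computed over the exact bytes that get uploaded to S3.
state_bytes = json.dumps(state, indent=2).encode("utf-8")
item = digest_item("my-uniquely-named-state-bucket", "mystate.tfstate", state_bytes)
print(json.dumps(item, indent=2))
```

The printed item is in the attribute-value shape that `aws dynamodb put-item --item` expects, and `state_bytes` is what would be written to the file that goes back to the bucket.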

And everything worked like it was supposed to!

What did I learn?