Andrey Berezhinsky

Posted on Mar 26, 2019

Lessons learned from 4 years of working with cloudformation

#aws #cloudformation #cli

I started working with cloudformation 4 years ago. Since then I have screwed up many pieces of infrastructure, some of them in production. But every time I screwed something up, I learned something new as well. Thanks to those mistakes, 4 years later I can share some of the most important lessons I learned.

Lesson 1: verify changes before deploying changeset

I learned this lesson pretty early in my journey with cloudformation. I don't remember exactly what I broke, but I remember that back then I was deploying my stacks with the aws cloudformation update-stack command. This command rolls out your template without any verification of the changes to be deployed. It hardly needs explaining that you should understand what you are about to deploy before you deploy it.

After that screw-up I immediately changed my deployment pipeline, replacing the update-stack command with the create-change-set command:

# OPERATION is either "UPDATE" or "CREATE"
changeset_id=$(aws cloudformation create-change-set \
    --change-set-name "$CHANGE_SET_NAME" \
    --stack-name "$STACK_NAME" \
    --template-body "file://$TPL_PATH" \
    --change-set-type "$OPERATION" \
    --parameters "$PARAMETERS" \
    --output text \
    --query Id)

aws cloudformation wait \
    change-set-create-complete --change-set-name "$changeset_id"
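
The snippet above expects OPERATION to be set by the caller. A minimal sketch of deriving it automatically is shown here, assuming that "the stack exists" can be approximated by describe-stacks exiting successfully. The function name detect_operation is hypothetical and not part of any tool mentioned in this post.

```shell
#!/usr/bin/env bash

# Prints "UPDATE" if the stack can be described, "CREATE" otherwise.
# Assumption: describe-stacks exits non-zero when the stack does not exist.
# Caveat: any other failure (expired credentials, no network) is also treated
# as "stack not found", so a real script should inspect the error output.
detect_operation() {
    local stack_name=$1
    if aws cloudformation describe-stacks \
            --stack-name "$stack_name" >/dev/null 2>&1; then
        echo "UPDATE"
    else
        echo "CREATE"
    fi
}
```

It could be used as OPERATION=$(detect_operation "$STACK_NAME") before calling create-change-set. Because a lost connection looks the same as a missing stack here, the result is best treated as a hint rather than as the truth.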

When a changeset is created it doesn't affect the existing stack in any way. Unlike the update-stack command, the changeset approach doesn't trigger the actual deployment. Instead it creates a list of changes that you can review before deploying. You can view the changes in the aws console GUI. But if you have an automate-everything mentality, you will probably prefer checking them in your cli:

# this command is presented only for demonstrational purposes.
# the real command should take pagination into account
aws cloudformation describe-change-set \
    --change-set-name "$changeset_id" \
    --query 'Changes[*].ResourceChange.{Action:Action,Resource:ResourceType,ResourceId:LogicalResourceId,ReplacementNeeded:Replacement}' \
    --output table

This command should produce an output similar to the following one:

--------------------------------------------------------------------
|                         DescribeChangeSet                        |
+---------+--------------------+----------------------+------------+
| Action  | ReplacementNeeded  |      Resource        | ResourceId |
+---------+--------------------+----------------------+------------+
|  Modify | True               |  AWS::ECS::Cluster   |  MyCluster |
|  Modify | True               |  AWS::RDS::DBInstance|  MyDB      |
|  Add    | None               |  AWS::SNS::Topic     |  MyTopic   |
+---------+--------------------+----------------------+------------+

Pay special attention to changes where Action is Remove, or where ReplacementNeeded is True or Conditional (a replacement is reported as a Modify action with Replacement set to True). Those are the most dangerous changes and usually cause some data loss.
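
Spotting such changes can also be automated. The following is a rough sketch, assuming the Action and Replacement fields returned by describe-change-set are enough to decide what counts as dangerous. The function names filter_dangerous_changes and check_change_set are made up for illustration, and the pagination caveat from the command above applies here as well.

```shell
#!/usr/bin/env bash

# Reads tab-separated rows "Action Replacement LogicalResourceId ResourceType"
# on stdin and prints only the rows that remove or (possibly) replace a resource.
filter_dangerous_changes() {
    awk -F '\t' '$1 == "Remove" || $2 == "True" || $2 == "Conditional"'
}

# Hypothetical gate: prints the risky changes of a changeset to stderr and
# returns 1 if there are any, so that a pipeline can stop before executing it.
check_change_set() {
    local changeset_id=$1 risky
    risky=$(aws cloudformation describe-change-set \
        --change-set-name "$changeset_id" \
        --query 'Changes[].ResourceChange.[Action,Replacement,LogicalResourceId,ResourceType]' \
        --output text | filter_dangerous_changes)
    if [[ -n "$risky" ]]; then
        echo "Dangerous changes detected:" >&2
        echo "$risky" >&2
        return 1
    fi
}
```

Calling check_change_set "$changeset_id" || exit 1 before execute-change-set makes the dangerous rows impossible to scroll past, at the price of needing a deliberate override when a replacement is actually intended.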

Once the changes are reviewed, they can be deployed with:

aws cloudformation execute-change-set --change-set-name "$changeset_id"

operation_lowercase=$(echo "$OPERATION" | tr '[:upper:]' '[:lower:]')
aws cloudformation wait "stack-${operation_lowercase}-complete" \
    --stack-name "$STACK_NAME"

Lesson 2: use stack policy to prevent replacement or deletion of stateful resources

Well, sometimes simply reviewing changes is not enough. We are all human and we all make mistakes. Soon after we started using changesets, a teammate of mine unknowingly performed a deployment that caused a database to be recreated. No harm was done, as it was a testing environment. Even though our scripts showed the list of changes and asked for confirmation, the Replace change was overlooked because the change list was too long to fit on the screen. And as it was a routine update in a testing environment, not much attention was paid to the changes.

There are resources that you never want to be replaced or deleted: stateful services such as an RDS database instance or an Elasticsearch cluster. It would be nice if aws automatically refused a deployment whenever the operation required deleting such a resource. Fortunately, cloudformation has a built-in way of doing this. It's called a stack policy, and you can read more about it in the docs:

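STACK_NAME=$1
RESOURCE_ID=$2

# Once a stack policy is set, every update is denied unless a statement allows
# it. So the policy first allows all updates and then denies replacement and
# deletion of the protected resource (an explicit Deny overrides the Allow).
POLICY_JSON=$(cat <<EOF
{
    "Statement" : [{
        "Effect" : "Allow",
        "Action" : "Update:*",
        "Principal": "*",
        "Resource" : "*"
    }, {
        "Effect" : "Deny",
        "Action" : [
            "Update:Replace",
            "Update:Delete"
        ],
        "Principal": "*",
        "Resource" : "LogicalResourceId/$RESOURCE_ID"
    }]
}
EOF
)

aws cloudformation set-stack-policy --stack-name "$STACK_NAME" \
    --stack-policy-body "$POLICY_JSON"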
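
A stack usually contains more than one stateful resource. The sketch below generates a policy for any number of logical ids, assuming that one Deny statement per resource is acceptable and that everything else in the stack should stay updatable (hence the leading Allow statement, since a stack policy protects all resources by default once it is set). The function name build_stack_policy is hypothetical.

```shell
#!/usr/bin/env bash

# Prints a stack policy that allows all updates except replacement or
# deletion of the resources whose logical ids are passed as arguments.
build_stack_policy() {
    local id
    printf '{\n  "Statement": [\n'
    printf '    {"Effect": "Allow", "Action": "Update:*", "Principal": "*", "Resource": "*"}'
    for id in "$@"; do
        # One Deny statement per protected resource.
        printf ',\n    {"Effect": "Deny", "Action": ["Update:Replace", "Update:Delete"], "Principal": "*", "Resource": "LogicalResourceId/%s"}' "$id"
    done
    printf '\n  ]\n}\n'
}
```

The output of build_stack_policy MyDB MySearchDomain can be passed to set-stack-policy through --stack-policy-body in the same way as POLICY_JSON above. The two logical ids are examples only.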

Lesson 3: use UsePreviousValue when updating stack with secret parameters

When you create an RDS mysql instance, aws requires you to provide MasterUsername and MasterUserPassword. As storing secrets in the source code is not an option, and as I wanted to automate absolutely everything, I implemented a "clever mechanism": prior to deployment the credentials would be retrieved from s3, and if no credentials were found, new ones would be generated and stored in s3. These credentials would then be passed as parameters to the cloudformation create-change-set command. While I was experimenting with the script, the connection to s3 was lost once, and my "clever mechanism" treated that as a signal to generate new credentials. Had I started using this script in production and the connection problem occurred again, it would have updated the stack with new credentials. In this particular case probably nothing bad would have happened.

However, I abandoned my "clever mechanism" anyway and switched to the less clever approach of providing credentials only once, when the stack is created. Later on, when the stack requires an update, I don't specify the secret parameter's value at all and use UsePreviousValue=true instead:

aws cloudformation create-change-set \
    --change-set-name "$CHANGE_SET_NAME" \
    --stack-name "$STACK_NAME" \
    --template-body "file://$TPL_PATH" \
    --change-set-type "UPDATE" \
    --parameters "ParameterKey=MasterUserPassword,UsePreviousValue=true"
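
The "provide the secret only once" rule can be encoded in the deployment script itself. This is a minimal sketch, assuming the same OPERATION convention ("CREATE" or "UPDATE") as in Lesson 1. The function name secret_parameter is hypothetical.

```shell
#!/usr/bin/env bash

# Prints the --parameters entry for a secret parameter.
# On CREATE the value is passed explicitly; on any other operation the value
# is never needed, because cloudformation is told to reuse the stored one.
# Caveat: the shorthand syntax used here breaks if the value contains commas.
secret_parameter() {
    local operation=$1 key=$2 value=${3:-}
    if [[ "$operation" == "CREATE" ]]; then
        printf 'ParameterKey=%s,ParameterValue=%s\n' "$key" "$value"
    else
        printf 'ParameterKey=%s,UsePreviousValue=true\n' "$key"
    fi
}
```

For example, --parameters "$(secret_parameter "$OPERATION" MasterUserPassword "$PASSWORD")" sends the password only when the stack is created. PASSWORD is an assumed variable that can stay empty on updates.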

Lesson 4: use rollback configuration

Another team at my workplace started to use a cloudformation feature called rollback configuration. I had never seen it before, and I quickly realized that it would make the deployment of my cloudformation stacks even more bulletproof. Now I use it whenever I deploy my code to lambda or ECS via cloudformation. Here is how it works: you specify a cloudwatch alarm arn in the --rollback-configuration parameter when you create the changeset. Later on, when you execute the changeset, aws monitors this alarm while the update is running and for the number of minutes set in MonitoringTimeInMinutes after it finishes. It rolls the deployment back if the alarm changes state to ALARM during this time.

Here is an example excerpt of a cloudformation template where I create a cloudwatch alarm that monitors a custom cloudwatch metric counting the errors in the cloudwatch logs (the metric is created via MetricFilter):

Resources:
  # this metric tracks number of errors in the cloudwatch logs. In this
  # particular case it's assumed logs are in json format and the error logs are
  # identified by level "error". See FilterPattern
  ErrorMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: !Ref LogGroup
      FilterPattern: '{$.level = "error"}'
      MetricTransformations:
      - MetricNamespace: !Sub "${AWS::StackName}-log-errors"
        MetricName: Errors
        MetricValue: 1
        DefaultValue: 0

  ErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: !Sub "${AWS::StackName}-errors"
      Namespace: !Sub "${AWS::StackName}-log-errors"
      MetricName: Errors
      Statistic: Maximum
      ComparisonOperator: GreaterThanThreshold
      Period: 60 # in seconds, i.e. 1 minute
      EvaluationPeriods: 1
      Threshold: 0
      TreatMissingData: notBreaching
      ActionsEnabled: yes

Now this alarm can be used as a rollback trigger when the changeset is executed:

ALARM_ARN=$1

ROLLBACK_TRIGGER=$(cat <<EOF
{
  "RollbackTriggers": [
    {
      "Arn": "$ALARM_ARN",
      "Type": "AWS::CloudWatch::Alarm"
    }
  ],
  "MonitoringTimeInMinutes": 1
}
EOF
)

aws cloudformation create-change-set \
    --change-set-name "$CHANGE_SET_NAME" \
    --stack-name "$STACK_NAME" \
    --template-body "file://$TPL_PATH" \
    --change-set-type "UPDATE" \
    --rollback-configuration "$ROLLBACK_TRIGGER"
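
The script above takes the alarm arn as an argument. One possible way to look it up is sketched here, assuming the alarm follows the "${AWS::StackName}-errors" naming from the template excerpt and already exists from a previous deployment. The function name alarm_arn_for_stack is hypothetical.

```shell
#!/usr/bin/env bash

# Prints the arn of the "<stack name>-errors" alarm, or returns 1 if the
# alarm cannot be found (the cli prints "None" for a missing value).
alarm_arn_for_stack() {
    local stack_name=$1 arn
    arn=$(aws cloudwatch describe-alarms \
        --alarm-names "${stack_name}-errors" \
        --query 'MetricAlarms[0].AlarmArn' \
        --output text)
    if [[ -z "$arn" || "$arn" == "None" ]]; then
        echo "Alarm ${stack_name}-errors not found" >&2
        return 1
    fi
    echo "$arn"
}
```

With this in place, ALARM_ARN=$(alarm_arn_for_stack "$STACK_NAME") || exit 1 could replace the positional argument. On the very first deployment the alarm does not exist yet, so the rollback configuration would have to be skipped for that run.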

Lesson 5: make sure you are deploying the latest version of the template

It's very easy to deploy a version of the cloudformation template that is not the most up-to-date one and cause a lot of damage. In fact, this happened in our team: a developer hadn't pulled the latest changes from git and unknowingly deployed a previous version of the stack. This caused some downtime in the application that was using this stack.

Something as simple as checking that your branch is up-to-date with the remote before executing a deployment will do just fine (assuming git is your version control tool):

git fetch
HEADHASH=$(git rev-parse HEAD)
UPSTREAMHASH=$(git rev-parse master@{upstream})

if [[ "$HEADHASH" != "$UPSTREAMHASH" ]] ; then
   echo "Branch is not up to date with origin. Aborting"
   exit 1
fi

Lesson 6: don't reinvent the wheel

It might look like deploying cloudformation is straightforward. You just need a bunch of bash scripts executing aws cli commands.

Four years ago I started with a simple script that just called the aws cloudformation create-stack command. It was not long before the script was not simple anymore. Each lesson learned made the script more and more complex. It was not only complex, it was also buggy. Only much later did I notice that the command presenting the changeset changes wouldn't return the complete set of changes when the output was too large.

I work in a relatively small IT department. As it turned out, not surprisingly, every team had invented its own way of deploying cloudformation stacks. That is definitely a bad thing. We could have done better and not reinvented the same thing independently. But it has already happened, and it probably happens in other teams as well. Right now we are migrating to a unified way of deployment, and it takes more effort than I originally thought. I wish our teams had agreed on one deployment tool from the beginning.

Fortunately, there are lots of tools that help with deploying and configuring cloudformation stacks. In fact, I've written one such tool myself. It's called stack-assembly, and it was written with all the lessons above in mind.

Top comments (1)

Juan • Apr 17 '21

Thanks for the tips!

I'm conflicted between putting each service in my serverless app in a separate template/stack (deployed separately) versus putting them all together.

I'd like to be able to deploy the services as quickly as possible, to prototype and test them in a staging environment. Is there any advantage in either option above?