CloudFormation rolled back our template but not our infrastructure

#cloudformation #aws #sqs #vpc

Imagine this: you have just released a critical update to your network infrastructure via CloudFormation, and after a while you see your services' error rate starting to climb. Of course, you roll back your update and everything goes back to normal.
But what if your rollback does not work?
In this article we'll walk through what happened, why, and what you can do about it.

The IaC contract

The fundamental promise of declarative IaC tools is simple: what you define in your template will be the final state of your infrastructure. CloudFormation's own docs put it clearly:

"CloudFormation makes underlying service calls to AWS to provision and configure your resources as defined in your CloudFormation templates."

This is true for the vast majority of cases, but not all, as we learned the hard way.

Our Issue

It all started as we were deploying a simple but very delicate update to our network stack.
The update looked something like this:

# 🅰️ STARTING SITUATION 
SQSEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal: "*"
          Action: "sqs:*"
          Resource: "*"
    SubnetIds:
      - !Ref PrivateSubnetA
      - !Ref PrivateSubnetB
      - !Ref PrivateSubnetC
    SecurityGroupIds:
      - !Ref SecurityGroupAllFromVpc  
    ServiceName: !Sub "com.amazonaws.${AWS::Region}.sqs"
    VpcEndpointType: Interface
    VpcId: !Ref VPC

# 🅱️ TARGET SITUATION
SQSEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    PolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal: "*"
          Action: "sqs:*"
          Resource: "*"
    SubnetIds:
      - !Ref PrivateSubnetA
      - !Ref PrivateSubnetB
      - !Ref PrivateSubnetC
    PrivateDnsEnabled: true # 👈🏼👈🏼👈🏼 Added this line 
    SecurityGroupIds:
      - !Ref SecurityGroupAllFromVpc  
    ServiceName: !Sub "com.amazonaws.${AWS::Region}.sqs"
    VpcEndpointType: Interface
    VpcId: !Ref VPC

If you are not familiar with VPC endpoints, they let you reach AWS services from within your VPC without going through the public internet. For each subnet the endpoint is associated with, an ENI with a private IP address is created.
There are two ways to use them. The first is to take the DNS name associated with the endpoint, something like vpce-12345678abcdb-91011abcd.sqs.us-east-1.vpce.amazonaws.com, and use it in your code to override the default SQS service endpoint.
The second is to set PrivateDnsEnabled: true. Doing so makes the VPC endpoint hijack the main SQS hostname: in practice, sqs.us-east-1.amazonaws.com will resolve to the endpoint's ENIs instead of the public SQS IPs, so every SQS call made within the VPC goes through the private endpoint. This is exactly what the update above was meant to do.
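A quick way to see which of the two modes is in effect is to resolve the regional SQS hostname from a machine inside the VPC and check whether the answers are private addresses. The sketch below uses only the Python standard library. The helper names are ours, and the heuristic assumes the endpoint ENIs live in private address space, which holds for typical VPC CIDR ranges but is worth double-checking for yours.

```python
import ipaddress
import socket


def resolve_ipv4(hostname: str) -> list[str]:
    """Return the IPv4 addresses the local resolver gives for hostname."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return addresses


def all_private(addresses: list[str]) -> bool:
    """True when the list is non-empty and every address is in a private range."""
    return bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )


def describe_resolution(hostname: str) -> str:
    """Summarise whether hostname currently resolves to private or public IPs."""
    addresses = resolve_ipv4(hostname)
    kind = "private (VPC endpoint)" if all_private(addresses) else "public"
    return f"{hostname} -> {addresses}: {kind}"
```

Calling `print(describe_resolution("sqs.us-east-1.amazonaws.com"))` from an instance in the VPC should report private addresses when Private DNS is enabled and public ones otherwise. The region in the hostname is just the one used in the example above.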

In our case, this snippet is part of our network stack. We proceeded as usual and let our CI/CD pipeline do the work by applying Commit 🅱️ via CloudFormation. Everything worked flawlessly in development and quality.
Given the positive tests, we rolled the update out to production as well... and within a couple of minutes we started noticing some weird behavior in our platform. At this point we were not sure this update was causing the malfunctions, but we rolled back to Commit 🅰️ as a precaution. Rolling back, however, had no effect on what we were seeing: we scrambled for about 20 minutes, rebooting malfunctioning services (in case they were holding on to bad DNS records) and looking for other possible causes.
It became clear that calls to our SQS queues were still failing, so we opened the console and found that Private DNS was still enabled!

We manually disabled the option and slowly, but surely, everything started to work again.

After the whole situation had settled, we needed to answer two questions:

Why did the update fail?
Why didn't the rollback actually restore the previous state?

Digging in the console revealed the answer to the first question. The VPC endpoint's security group had drifted in production and had been set to the default security group, blocking all incoming connections. Of course, CloudFormation, whose drift detection and correction are lacking, did not notice a thing, but this is a matter for another time.

We then started looking into the second question. From the CloudFormation stack history we could see that PrivateDnsEnabled: true had been successfully picked up and applied.
The failed rollback is particularly strange given what CloudFormation's docs for the AWS::EC2::VPCEndpoint resource say about this property: omitting the flag should result in its value being set to false!
To rule out any possible effect from account settings, existing resources or really anything else, we tested the following minimal examples in a sandbox environment and got the same result.

Minimal example gist 👉🏼👉🏼 https://gist.github.com/sebacaccaro/13510372e69575f732c2a0d1a3bd0d00

How can this happen? It probably has something to do with the way CloudFormation issues the underlying API call to update the VPC endpoint resource. This makes sense for an API call: if no PrivateDnsEnabled flag is passed, no change is needed. It does not make sense for CloudFormation.

In this situation, the same version of a template produces a different outcome depending on the existing state of the resource.
This goes against one of the fundamental assumptions we make when using tools like CloudFormation: the same template should produce the same result, regardless of the order in which changes are applied or any other factor.
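To make the difference concrete, here is a toy model in Python. It is not CloudFormation or AWS code, only a sketch of the behaviour we suspect: an update that sends just the properties present in the template, compared with a truly declarative apply that resets omitted properties to their documented defaults. The function names and the DEFAULTS table are made up for illustration.

```python
# Toy model only: none of this is CloudFormation or AWS code.
# Default for the property, per our reading of the resource docs.
DEFAULTS = {"PrivateDnsEnabled": False}


def patch_update(live: dict, template: dict) -> dict:
    """Suspected behaviour: properties missing from the template stay as they are."""
    return {**live, **template}


def declarative_apply(live: dict, template: dict) -> dict:
    """Expected behaviour: properties missing from the template fall back to defaults.

    The live state is deliberately ignored: only the template matters.
    """
    return {**DEFAULTS, **template}


commit_a = {}                             # flag omitted
commit_b = {"PrivateDnsEnabled": True}    # flag added

fresh = dict(DEFAULTS)                    # endpoint as first created from commit A
after_b = patch_update(fresh, commit_b)   # deploy commit B
rolled_back = patch_update(after_b, commit_a)  # roll back to commit A

print(patch_update(fresh, commit_a))         # commit A on a fresh endpoint: flag off
print(rolled_back)                           # commit A after commit B: flag still on
print(declarative_apply(after_b, commit_a))  # what we expected: flag off
```

The first two prints apply the very same template and end up in different states, which is exactly the property a declarative tool is supposed to rule out.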

Takeaways

We strongly suspect this behaviour stems from the ModifyVpcEndpoint API call that CloudFormation is likely to use under the hood: when PrivateDnsEnabled is left blank, the previous state is left untouched.
At best, this is inconsistent behaviour; at worst, it can contribute to serious issues like ours. Knowing this could potentially happen with other resources, there are some countermeasures to take:

Always prepare for the worst. Do not test only your deployments: test the rollback too, especially for critical changes like this one.
Whenever possible, set flags and properties explicitly, even when the value you want is the documented default. This avoids this kind of undesired behaviour.
When something does not add up, do not blindly trust CloudFormation: check that the state in the console reflects what you expect.
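The last check does not have to be manual. As a minimal sketch, assuming a boto3 EC2 client is passed in, the live flag can be read with describe_vpc_endpoints and compared with what the template is supposed to produce. The two helper functions are ours, not part of any library.

```python
def private_dns_enabled(ec2_client, endpoint_id: str) -> bool:
    """Return the live PrivateDnsEnabled value of a VPC endpoint.

    ec2_client is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    """
    response = ec2_client.describe_vpc_endpoints(VpcEndpointIds=[endpoint_id])
    endpoints = response["VpcEndpoints"]
    if not endpoints:
        raise LookupError(f"VPC endpoint {endpoint_id} not found")
    return bool(endpoints[0].get("PrivateDnsEnabled", False))


def assert_private_dns(ec2_client, endpoint_id: str, expected: bool) -> None:
    """Raise if the live flag differs from what the template should produce."""
    actual = private_dns_enabled(ec2_client, endpoint_id)
    if actual != expected:
        raise RuntimeError(
            f"{endpoint_id}: PrivateDnsEnabled is {actual}, expected {expected}"
        )
```

In a pipeline, this could run after both the deployment and a rehearsed rollback, for example `assert_private_dns(boto3.client("ec2"), "vpce-0123456789abcdef0", expected=False)`, where the endpoint ID is a placeholder.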
