If you are following the ‘What’s new in AWS’ page like many AWS professionals do, you must have noticed the new drift detection support that was announced recently. We have dived into this new feature and this article is about what we have found so far.
Drift detection is one of the many features that have been missing for years from the Cloudformation toolset. Ideally, we should always use Cloudformation to publish our infrastructure changes. However, practicality rules, and frequently we have to change our infrastructure outside of Cloudformation. When we do that, the configuration of our infrastructure drifts away from what is defined in Cloudformation. From time to time we need a summary of what has drifted so we can think about what we should do. Not all of the drifts are bad, and some are even expected. For example, in order to dynamically enable/disable an endpoint, we might have placed a lambda function to tweak the priority of a load balancer rule, which could result in a drift. Therefore, we should focus on other drifts that were created during ad-hoc operations, like a change in the maximum size of an auto scaling group introduced during traffic peaks.
Run drift detection with Python
Now that the feature is available, I wanted to write a Python script to detect drift in our stacks, and have our Jenkins run this script periodically so we can stay up to date. I installed the latest version of boto3 (>=1.9.44) and looked at the boto3 API: we need to trigger a drift detection with a DetectStackDrift call, then make a DescribeStackDriftDetectionStatus call to view the result. Since drift detection takes time to finish, the second call has to sit inside a loop, so it looks like this:
import time

import boto3

client = boto3.client("cloudformation")
stack = "your-stack-name"
# detect_stack_drift returns a dict; the detection ID is under "StackDriftDetectionId".
detection_id = client.detect_stack_drift(StackName=stack)["StackDriftDetectionId"]
while True:
    time.sleep(3)
    response = client.describe_stack_drift_detection_status(
        StackDriftDetectionId=detection_id
    )
    if response["DetectionStatus"] == "DETECTION_IN_PROGRESS":
        continue
    print(f"Stack `{stack}` has a drift status: {response['StackDriftStatus']}")
    # Leave the loop once detection is no longer in progress.
    break
In general, this snippet should give you the same result as what you can see in the AWS web console. Easy peasy.
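For a job that runs unattended on Jenkins, it may help to wrap the polling loop in a function with a timeout, so a stuck detection cannot block the build forever. This is a minimal sketch, not a definitive implementation: the function name `wait_for_drift_detection` and the `poll_seconds` and `timeout_seconds` parameters are my own choices, and `client` is assumed to behave like `boto3.client("cloudformation")`.

```python
import time


def wait_for_drift_detection(client, stack_name, poll_seconds=3, timeout_seconds=300):
    """Start drift detection and poll until it is no longer in progress.

    `client` is assumed to behave like boto3.client("cloudformation").
    Returns the final DescribeStackDriftDetectionStatus response.
    """
    detection_id = client.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if response["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Drift detection on {stack_name} did not finish in {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

Passing the client in as an argument keeps the function easy to exercise with a stub; in a real run you would call it with `boto3.client("cloudformation")` and your stack name.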
Calling DetectStackDrift
Let’s now step back and take a look at the DetectStackDrift API. Apart from the obvious stack name parameter, this call can optionally accept a list of logical resource IDs defined in the stack.
Should we worry about the resources defined in our stack which do not support drift detection yet? Not really. By default, if you call the DetectStackDrift API without this list, and your stack happens to have resources that do not support drift detection, AWS will simply skip them.
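To make the optional list concrete, here is a small sketch of a wrapper that only passes `LogicalResourceIds` when a list is actually given. The helper name `start_drift_detection` is hypothetical, and `client` is again assumed to behave like `boto3.client("cloudformation")`.

```python
def start_drift_detection(client, stack_name, logical_resource_ids=None):
    """Trigger drift detection, optionally limited to some logical resource IDs.

    Returns the detection ID to poll with DescribeStackDriftDetectionStatus.
    """
    kwargs = {"StackName": stack_name}
    if logical_resource_ids:
        # Only send the optional parameter when the caller asked for a subset.
        kwargs["LogicalResourceIds"] = list(logical_resource_ids)
    return client.detect_stack_drift(**kwargs)["StackDriftDetectionId"]
```

Calling it with just a stack name checks the whole stack; passing something like `["SNSTopic"]` would limit the check to that one resource.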
Can we find out the last change time of all the resources in a stack, perhaps with the help of a DescribeStackResources call, and skip drift detection for the stack when nothing has changed? The answer is sadly no. Although ListStackResources returns a LastUpdatedTimestamp field for each resource, and DescribeStackResources a similar Timestamp field, these only record when the resource was last changed by Cloudformation. When the resource was changed outside of Cloudformation, the field is not updated. Therefore, you cannot short-circuit a DetectStackDrift call with list/describe stack resources calls.
In conclusion, unless you have some very specific requirements, we recommend just calling DetectStackDrift with the stack name as the only argument and letting AWS take care of the rest.
Getting drift status with DescribeStackDriftDetectionStatus
If you take a closer look at the response from a DescribeStackDriftDetectionStatus call, it has several fields, including:
- StackDriftStatus: whether a stack has drifted.
- DetectionStatus: whether the detection succeeded or failed.
- DriftedStackResourceCount: number of resources that have drifted.
This is where things get tricky. On some occasions, AWS fails to detect the status of some of the resources defined in the template. In this case, the response looks something like this:
{
    "StackId": "arn:aws:cloudformation:ap-southeast-2:123456789012:stack/your-stack-name/a61c13dc-e875-11e8-8f0f-000c6c095ac3",
    "StackDriftDetectionId": "b11fdd5e-e875-11e8-8f0f-000c6c095ac3",
    "StackDriftStatus": "IN_SYNC",
    "DetectionStatus": "DETECTION_FAILED",
    "DetectionStatusReason": "Failed to detect drift on resource [SNSTopic]",
    "DriftedStackResourceCount": 0,
    "Timestamp": datetime.datetime(2018, 11, 14, 22, 59, 43, 536000, tzinfo=tzutc()),
}
For the record, the logical resource ID SNSTopic is indeed an SNS topic, and it should have drift detection support. From what we can see, the behaviour is consistent within a stack: once drift detection has failed, it keeps failing no matter how many times you retry. It is inconsistent across stacks, however: the same resource type can fail in one stack but not in another.
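Because the response above reports StackDriftStatus as IN_SYNC even though DetectionStatus is DETECTION_FAILED, a script that only looks at StackDriftStatus would report a clean stack. The sketch below checks DetectionStatus first. The function name `summarize_detection` and the shape of its return value are my own assumptions, not part of the API.

```python
def summarize_detection(response):
    """Return (ok, message) for a DescribeStackDriftDetectionStatus response.

    DetectionStatus is checked before StackDriftStatus, because a failed
    detection can still carry StackDriftStatus == "IN_SYNC".
    """
    status = response.get("DetectionStatus")
    if status == "DETECTION_FAILED":
        reason = response.get("DetectionStatusReason", "no reason given")
        return False, f"detection failed: {reason}"
    if status != "DETECTION_COMPLETE":
        return False, f"detection not finished: {status}"
    drift = response.get("StackDriftStatus")
    if drift == "IN_SYNC":
        return True, "in sync"
    count = response.get("DriftedStackResourceCount", 0)
    return False, f"{drift}: {count} drifted resource(s)"
```

In a Jenkins job, the first element of the result could decide the exit code, so that a failed detection is surfaced instead of being silently treated as "no drift".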
Other minor hiccups
The official documentation does mention some limitations. I recommend taking some time to read it first. Here’s a list of some additional hiccups that we have observed:
Empty PropertyDifferences
Some resources were reported as MODIFIED with no property differences. In the following response from Cloudformation, ExpectedProperties and ActualProperties are exactly the same and PropertyDifferences is an empty list, yet StackResourceDriftStatus is reported as MODIFIED.
{
    "StackId": "arn:aws:cloudformation:ap-southeast-2:123456789012:stack/your-stack-name/a61c13dc-e875-11e8-8f0f-000c6c095ac3",
    "LogicalResourceId": "TargetTrackingPolicy",
    "PhysicalResourceId": "some-arn",
    "ResourceType": "AWS::AutoScaling::ScalingPolicy",
    "ExpectedProperties": {
        "AutoScalingGroupName": "your-stack-name-AutoScalingGroup-L3YOV2E9J92N",
        "Cooldown": 900,
        "EstimatedInstanceWarmup": 300,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 40
        }
    },
    "ActualProperties": {
        "AutoScalingGroupName": "your-stack-name-AutoScalingGroup-L3YOV2E9J92N",
        "Cooldown": 900,
        "EstimatedInstanceWarmup": 300,
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization"
            },
            "TargetValue": 40
        }
    },
    "PropertyDifferences": [],
    "StackResourceDriftStatus": "MODIFIED",
    "Timestamp": "2018-11-14T22:53:53.090Z"
}
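A periodic report gets noisy if entries like the one above are listed as real drift. The sketch below collects every page of DescribeStackResourceDrifts and flags resources that are MODIFIED without any reported difference. The helper names are hypothetical, and I am assuming that ExpectedProperties and ActualProperties may arrive either as JSON strings or as already-parsed dicts, so both are handled.

```python
import json


def _as_dict(properties):
    """Properties may arrive as a JSON string or as an already-parsed dict."""
    if isinstance(properties, str):
        return json.loads(properties)
    return properties or {}


def is_phantom_drift(drift):
    """True when a resource is MODIFIED but no actual difference is reported."""
    if drift.get("StackResourceDriftStatus") != "MODIFIED":
        return False
    if drift.get("PropertyDifferences"):
        return False
    return _as_dict(drift.get("ExpectedProperties")) == _as_dict(
        drift.get("ActualProperties")
    )


def list_drifts(client, stack_name):
    """Collect all pages of DescribeStackResourceDrifts for a stack."""
    drifts = []
    kwargs = {"StackName": stack_name}
    while True:
        page = client.describe_stack_resource_drifts(**kwargs)
        drifts.extend(page["StackResourceDrifts"])
        token = page.get("NextToken")
        if not token:
            return drifts
        # Ask for the next page on the following iteration.
        kwargs["NextToken"] = token
```

A report could then list `[d for d in list_drifts(client, stack) if not is_phantom_drift(d)]` as the drifts worth a human look, while still logging the phantom ones separately.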
Flattened list
In the following excerpt from a DescribeStackResourceDrifts response, the NotificationTypes list was flattened to a string. I’m not 100% sure, but I think what happened was that we manually saved the ASG, which triggered a save on everything and flattened the NotificationTypes. Both configurations work the same way, so I would argue this is not a real change and could have been handled internally by AWS.
# ExpectedProperties
"NotificationConfigurations": [
    {
        "NotificationTypes": [
            "autoscaling:EC2_INSTANCE_LAUNCH"
        ],
        "TopicARN": "arn:aws:sns:ap-southeast-2:123456789012:your-stack-name-SNSTopic"
    }
]
# ActualProperties
"NotificationConfigurations": [
    {
        "NotificationTypes": "autoscaling:EC2_INSTANCE_LAUNCH",
        "TopicARN": "arn:aws:sns:ap-southeast-2:123456789012:your-stack-name-SNSTopic"
    }
]
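If you want your own report to ignore this kind of flattening, one option is to normalise both sides before comparing them. The sketch below treats a one-element list of a scalar as equal to the scalar itself. This is my own workaround rather than anything AWS provides, the function names are hypothetical, and it assumes that the two forms really are equivalent for the property in question.

```python
def unwrap_single_item_lists(value):
    """Recursively replace one-element lists of scalars with the scalar itself."""
    if isinstance(value, dict):
        return {key: unwrap_single_item_lists(item) for key, item in value.items()}
    if isinstance(value, list):
        items = [unwrap_single_item_lists(item) for item in value]
        # Only collapse lists such as ["x"]; lists of dicts or nested lists stay lists.
        if len(items) == 1 and not isinstance(items[0], (dict, list)):
            return items[0]
        return items
    return value


def equivalent_ignoring_flattening(expected, actual):
    """Compare two parsed property structures, ignoring single-item list flattening."""
    return unwrap_single_item_lists(expected) == unwrap_single_item_lists(actual)
```

Applied to the two NotificationConfigurations structures above, the comparison treats them as equal, while a genuinely different notification type or an extra list item still shows up as a difference.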
Internal type change
In this case, we adjusted a security group ingress rule and then changed it back. On the second save, AWS must have stored the IpProtocol as a name (tcp), whereas in the template we had specified it as a number (6). According to the RFC, protocol number 6 is simply TCP, so this should not count as a drift at all.
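A similar normalisation can hide the name-versus-number mismatch in your own reporting. The sketch below maps a few protocol names to their assigned numbers before comparing. The mapping is deliberately partial (only icmp, tcp and udp), and the function names are hypothetical.

```python
# Partial mapping of protocol names to their assigned protocol numbers.
PROTOCOL_NUMBERS = {"icmp": 1, "tcp": 6, "udp": 17}


def normalize_ip_protocol(value):
    """Return a canonical string for an IpProtocol given as a name or a number."""
    text = str(value).strip().lower()
    if text in PROTOCOL_NUMBERS:
        return str(PROTOCOL_NUMBERS[text])
    # Numbers, "-1" (all protocols) and unknown names pass through unchanged.
    return text


def same_ip_protocol(expected, actual):
    """True when two IpProtocol values refer to the same protocol."""
    return normalize_ip_protocol(expected) == normalize_ip_protocol(actual)
```

With this in place, a template value of 6 and an actual value of tcp compare as the same protocol, so the rule would not be flagged by a custom report built on top of the drift results.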
Altered order in list items
One of my colleagues has reported that for a list of security groups, the change of order of security groups is considered a drift. I’m yet to witness this case.
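If this turns out to be reproducible, and assuming the order of the list really does not matter for the property involved, an order-insensitive comparison is enough to filter it out. The helper below is a hypothetical sketch that compares two lists of hashable items, such as security group IDs, as multisets.

```python
from collections import Counter


def same_items_ignoring_order(expected, actual):
    """Compare two lists of hashable items as multisets (order is ignored)."""
    return Counter(expected) == Counter(actual)
```

Using a Counter rather than a set means a duplicated entry on one side still counts as a difference.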
The takeaways
The good:
- It will definitely help us to better maintain our Cloudformation stacks.
The bad:
- Expect some glitches.
- Resource type support is limited (for now).
The ugly:
- Some errors are silently ignored (for now).
References:
- General introduction to this new feature
- Resource types that support drift detection
- Latest boto3 cloudformation API which includes the new APIs