A few months back, a colleague reached out to me to troubleshoot a Terraform destroy operation that had taken several hours to execute. They had already terminated and restarted the operation multiple times with no change in outcome.
By the time they reached out, the only remaining resource was an S3 bucket.
In my earlier post Teaching Terraform from the Ground Up, I described how Terraform abstracts multiple AWS API calls into a single resource definition. Debugging this abstraction when it fails to operate as we expect is sometimes challenging and frequently requires us to dig into the provider source code to understand root causes.
This is exactly the issue my colleague experienced when they enabled the force_destroy property on their S3 bucket resource and then attempted to delete it.
Debugging force_destroy
The Terraform resource documentation for the force_destroy property states:
A boolean that indicates all objects (including any locked objects) should be deleted from the bucket so that the bucket can be destroyed without error. These objects are not recoverable.
When I debug Terraform abstractions, I frequently consult AWS API documentation to determine if a specified behavior also exists in the core AWS API. If it does, e.g., as a hypothetical aws s3api delete-bucket --force-destroy option, we should be able to replicate our issue outside of Terraform and dramatically reduce the surface area of our issue. In this case the S3 DeleteBucket documentation states:
All objects (including all object versions and delete markers) in the bucket must be deleted before the bucket itself can be deleted (emphasis mine)
The documentation for the AWS CLI and Boto3 S3 also confirms this behavior.
Source Code Review
Since force_destroy exists in the Terraform resource and not in the underlying AWS API, a viable theory is that Terraform implements an abstraction around this property.
Let's dive into the AWS S3 Bucket resource source code to see what API calls are made when that property is set:
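// The initial DeleteBucket call failed because the bucket still has objects
if isAWSErr(err, "BucketNotEmpty", "") {
    // Only empty the bucket when force_destroy is enabled
    if d.Get("force_destroy").(bool) {
        ...
        err := deleteAllS3ObjectVersions(s3Conn, d.Id(), "", objectLockEnabled, false)
    }
}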
The conditional statements look boilerplate, but the deleteAllS3ObjectVersions method call is interesting (in this case d.Id() returns the name of the bucket), so let's pull on that thread some more.
This source code is more verbose, so we'll again simplify it to the relevant details in our pseudo-code:
input := &s3.ListObjectVersionsInput{
    Bucket: aws.String(bucketName),
}
// Page through every object version in the bucket
err := conn.ListObjectVersionsPages(input, func(page *s3.ListObjectVersionsOutput, lastPage bool) bool {
    if page == nil {
        return !lastPage
    }
    for _, objectVersion := range page.Versions {
        objectKey := aws.StringValue(objectVersion.Key)
        objectVersionID := aws.StringValue(objectVersion.VersionId)
        // One delete request per object version
        err := deleteS3ObjectVersion(conn, bucketName, objectKey, objectVersionID, force)
    }
    return !lastPage
})
A plain-English explanation of this operation might contain the following instructions:
- List all the object versions associated with the S3 bucket
- Paginate through the response object versions
- Delete each object version one-by-one
That can be quite a few S3 DeleteObject API calls depending on the number of objects in your bucket.
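To put a rough number on "quite a few", here is a minimal Go sketch of the request count implied by the steps above. It assumes one list request per page of up to 1,000 object versions (the S3 page size limit) plus one delete request per object version; estimateAPICalls is a hypothetical helper written for this post, not part of the provider, and it ignores retries and the initial DeleteBucket attempt.

```go
package main

import "fmt"

// listPageSize is the maximum number of entries S3 returns per
// ListObjectVersions page.
const listPageSize = 1000

// estimateAPICalls is a hypothetical helper: it returns a rough count of the
// S3 requests needed to delete every object version one at a time, i.e. one
// list request per page plus one delete request per object version.
func estimateAPICalls(objectVersions int) int {
	if objectVersions <= 0 {
		return 0
	}
	// Integer ceiling division gives the number of list pages.
	listCalls := (objectVersions + listPageSize - 1) / listPageSize
	return listCalls + objectVersions
}

func main() {
	for _, n := range []int{10, 100, 1000, 10000} {
		fmt.Printf("%6d object versions -> ~%d API calls\n", n, estimateAPICalls(n))
	}
}
```

Under these assumptions the delete requests dominate: the list requests add only one call per thousand objects.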
Terraform force_destroy performance benchmarks
Now that we've seen that the number of AWS API calls scales linearly based on the number of objects in our S3 bucket, we can quantify some performance benchmarks:
# of Objects | # of API Calls (est.) | Delete Time in Seconds (Minutes) |
---|---|---|
10 | 11 | 11.3 (0.18) |
100 | 101 | 36.3 (0.60) |
1000 | 1001 | 282.3 (4.7) |
10000 | 10010 | 2323.8 (38.73) |
To summarize those findings, it takes almost five minutes to delete 1000 objects and at 10,000 objects, it takes almost forty minutes. As my peer discovered, this can be a prohibitively expensive operation for buckets without any object lifecycle management.
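To get a feel for how a destroy can stretch into hours, here is a small Go sketch that extrapolates from the 10,000-object measurement above. It assumes the per-object cost stays constant as the bucket grows, which I have not verified beyond these benchmarks; estimateDestroyTime is a hypothetical helper, and the larger object counts are illustrative rather than measured.

```go
package main

import (
	"fmt"
	"time"
)

// Taken from the benchmark above: 10,000 objects took 2323.8 seconds.
const (
	benchmarkObjects = 10000
	benchmarkSeconds = 2323.8
)

// estimateDestroyTime is a hypothetical helper that linearly extrapolates the
// benchmark to another object count. Real timings will vary with latency,
// throttling, and object versions.
func estimateDestroyTime(objects int) time.Duration {
	perObject := benchmarkSeconds / benchmarkObjects // seconds per object
	return time.Duration(float64(objects) * perObject * float64(time.Second))
}

func main() {
	// Illustrative bucket sizes, not measurements.
	for _, n := range []int{100000, 1000000} {
		fmt.Printf("%8d objects -> roughly %s\n", n, estimateDestroyTime(n).Round(time.Minute))
	}
}
```

If the linear assumption holds, a bucket with a million objects would need on the order of days, not hours.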
What's a developer to do?!?
Deleting large volumes of objects using the force_destroy parameter is likely a non-starter, so what other options are available?
If we want a Terraform-only solution, we can add an object lifecycle management rule to expire all bucket objects.
lifecycle_rule {
  enabled = true
  expiration {
    days = 1
  }
}
Lifecycle rules are evaluated and executed around 12 AM UTC each day, so this may delay the deletion of your S3 bucket. If time is essential, the AWS API provides a bulk delete-objects operation that can operate on up to 1000 keys per request.
A sample implementation using the AWS CLI and jq is demonstrated below:
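bucket_name=YOUR_BUCKET_NAME_GOES_HERE
# List up to 1000 object versions, the most that delete-objects accepts per request
aws s3api list-object-versions --bucket "$bucket_name" --max-items 1000 > object_versions.json
# Keep going while the listing still contains object versions
# (delete markers in versioned buckets are not handled here)
while jq -e '(.Versions // []) | length > 0' object_versions.json > /dev/null; do
  # Build the delete-objects payload from the listed keys and version IDs
  jq --compact-output '{ "Objects": [ .Versions[] | { "Key": .Key, "VersionId": .VersionId } ] }' object_versions.json > delete.json
  aws s3api delete-objects --bucket "$bucket_name" --delete file://delete.json
  # Re-list from the start, since the versions deleted above are gone
  aws s3api list-object-versions --bucket "$bucket_name" --max-items 1000 > object_versions.json
done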
This implementation performs the delete operation for 10,000 objects in less than 20 seconds.
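The speed-up comes from batching: each bulk delete request carries up to 1,000 keys instead of one. The Go sketch below illustrates only that batching step, assuming the keys have already been listed; chunkKeys and the generated key names are hypothetical, and no AWS calls are made.

```go
package main

import "fmt"

// maxKeysPerDelete is the limit of keys accepted by one bulk delete request.
const maxKeysPerDelete = 1000

// chunkKeys is a hypothetical helper that splits keys into batches of at most
// size entries, preserving order. Each batch would become one bulk delete request.
func chunkKeys(keys []string, size int) [][]string {
	if size <= 0 {
		return nil
	}
	var batches [][]string
	for start := 0; start < len(keys); start += size {
		end := start + size
		if end > len(keys) {
			end = len(keys)
		}
		batches = append(batches, keys[start:end])
	}
	return batches
}

func main() {
	// Hypothetical key names standing in for a real bucket listing.
	keys := make([]string, 10000)
	for i := range keys {
		keys[i] = fmt.Sprintf("object-%05d", i)
	}
	batches := chunkKeys(keys, maxKeysPerDelete)
	fmt.Printf("%d keys -> %d bulk delete requests\n", len(keys), len(batches))
}
```

Compared with one delete request per object, batching cuts the number of delete requests by up to a factor of 1,000, which is consistent with the drop from minutes to seconds.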
If no automation is required, you can also use the Empty Bucket option from the AWS Console. I always get nervous when I need to execute manual operations in the AWS Console. If you can avoid this option, I would.
Wrap Up
A recent tweet by Angie Jones reminded me of why I started blogging:
I ran into a basic configuration issue and couldn't find a solution online.
After I figured it out, I wrote a simple post with the exact error message and the solution.
~100K views.
Everything doesn't have to be a think piece. Don't be afraid to share 🙏🏾
Angie Jones (@techgirl1908), February 14, 2021
It was likely a similar tweet by her that first inspired me several years back. To paraphrase, document what you learn along the way to make it easier for developers who come after you.
And if the issue is obscure enough and you're the type who can't remember what you did yesterday much less several months ago, that developer may end up being you ❤️.
Top comments (1)
Great post! I usually run into this problem at the most inopportune moments. The problem of deleting S3 buckets containing a large number of objects has plagued developers since S3 was released in 2006. Here are some notes from my own experience on the subject:
terraform destroy or the aws cli commands will often drastically reduce your delete time. I haven't personally tried it yet, but it looks like the new AWS CloudShell service would be a good zero-infrastructure option too.
PutObject requests.
Delete operation so you would need to write a Lambda that can delete an S3 object and then use the LambdaInvoke operation within S3 Batch, and 2. S3 Batch requires you to provide a manifest of all S3 objects you want to perform the batch operation on, so you would need to first set up an S3 Inventory report for your bucket, which may take up to 24 hours to generate.