ericksoen

Teaching Terraform from the ground up: Benchmarking S3 force_destroy

A few months back, a colleague reached out to me to troubleshoot a Terraform destroy operation that had taken several hours to execute. They had already terminated and restarted the operation multiple times with no change in outcome.

By the time they reached out, the only remaining resource was an S3 bucket.

In my earlier post Teaching Terraform from the Ground Up, I described how Terraform abstracts multiple AWS API calls into a single resource definition. Debugging this abstraction when it fails to operate as we expect is sometimes challenging and frequently requires us to dig into the provider source code to understand root causes.

This is exactly the issue my colleague experienced when they enabled the force_destroy property on their S3 bucket resource and then attempted to delete it.

Debugging force_destroy

The Terraform resource documentation for the force_destroy property states:

A boolean that indicates all objects (including any locked objects) should be deleted from the bucket so that the bucket can be destroyed without error. These objects are not recoverable.

When I debug Terraform abstractions, I frequently consult the AWS API documentation to determine whether a given behavior also exists in the core AWS API. If it does (imagine a hypothetical aws s3api delete-bucket --force-destroy flag), we should be able to replicate our issue outside of Terraform and dramatically reduce its surface area. In this case, the S3 DeleteBucket documentation states:

All objects (including all object versions and delete markers) in the bucket must be deleted before the bucket itself can be deleted (emphasis mine)

The documentation for the AWS CLI and Boto3 also confirms this behavior.
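
We can confirm this from the command line as well. The following is a minimal reproduction, assuming a bucket named my-nonempty-bucket (a hypothetical name) that still contains at least one object:

# attempt to delete a bucket that still contains objects
aws s3api delete-bucket --bucket my-nonempty-bucket

# the call fails with an error along these lines:
# An error occurred (BucketNotEmpty) when calling the DeleteBucket operation:
# The bucket you tried to delete is not empty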

Source Code Review

Since force_destroy exists in the Terraform resource and not in the underlying AWS API, a viable theory is that Terraform implements an abstraction around this property.

Let's dive into the AWS S3 Bucket resource source code to see what API calls are made when that property is set:

if isAWSErr(err, "BucketNotEmpty", "") {
    if d.Get("force_destroy").(bool) {
        ...
        err := deleteAllS3ObjectVersions(s3Conn, d.Id(), "", objectLockEnabled, false)
    }
}

The conditional statements look like boilerplate, but the deleteAllS3ObjectVersions method call is interesting (in this case d.Id() returns the name of the bucket), so let's pull on that thread some more.

This source code is more verbose, so we'll again simplify it to the relevant details in our pseudo-code:

input := &s3.ListObjectVersionsInput{
    Bucket: aws.String(bucketName),
}

err := conn.ListObjectVersionsPages(input, func(page *s3.ListObjectVersionsOutput, lastPage bool) bool {

    if page == nil {
        return !lastPage
    }

    for _, objectVersion := range page.Versions {

        objectKey := aws.StringValue(objectVersion.Key)
        objectVersionID := aws.StringValue(objectVersion.VersionId)

        err := deleteS3ObjectVersion(conn, bucketName, objectKey, objectVersionID, force)
    }

    return !lastPage
})

A plain-English explanation of this operation might contain the following instructions:

  • List all the object versions associated with the S3 bucket
  • Paginate through the response object versions
  • Delete each object version one by one

That can be quite a few S3 DeleteObject API calls depending on the number of objects in your bucket.
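
To make the cost concrete, here is a rough shell equivalent of that loop, issuing one delete-object call per object version. The bucket name is hypothetical, and keys containing whitespace would need extra handling:

# list every object version, then delete each one individually
aws s3api list-object-versions --bucket my-bucket \
    --query 'Versions[].[Key, VersionId]' --output text |
while read -r key version_id; do
    # one DeleteObject API call per object version, mirroring the provider's loop
    aws s3api delete-object --bucket my-bucket --key "$key" --version-id "$version_id"
done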

Terraform force_destroy performance benchmarks

Now that we've seen that the number of AWS API calls scales linearly with the number of objects in our S3 bucket, we can quantify some performance benchmarks:

| # of Objects | # of API Calls (est.) | Delete Time in Seconds (Minutes) |
| ------------ | --------------------- | -------------------------------- |
| 10           | 11                    | 11.3 (0.18)                      |
| 100          | 101                   | 36.3 (0.60)                      |
| 1,000        | 1,001                 | 282.3 (4.7)                      |
| 10,000       | 10,010                | 2,323.8 (38.73)                  |

To summarize those findings: deleting 1,000 objects takes almost five minutes, and deleting 10,000 objects takes almost forty minutes. As my peer discovered, this can be a prohibitively expensive operation for buckets without any object lifecycle management.
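
If you want to reproduce a benchmark like this yourself, a rough sketch is to seed a test bucket with small objects and time the destroy. The bucket name below is hypothetical, and your timings will vary with region and network:

# seed a test bucket with 1,000 one-line objects (hypothetical bucket name)
for i in $(seq 1 1000); do
    echo "$i" | aws s3 cp - "s3://my-benchmark-bucket/object-$i"
done

# time the destroy that exercises force_destroy on the bucket
time terraform destroy -auto-approve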

What's a developer to do?!?

Deleting large volumes of objects using the force_destroy parameter is likely a non-starter, so what other options are available?

If we want a Terraform-only solution, we can add an object lifecycle management rule to expire all bucket objects.

# this lifecycle_rule block belongs inside your aws_s3_bucket resource
lifecycle_rule {
    enabled = true
    expiration {
        days = 1
    }
}

Lifecycle rules are evaluated and executed around 12 AM UTC each day, so this may introduce a delay of up to a day before your S3 bucket can be deleted. If time is of the essence, the AWS API provides a bulk delete-objects operation that can act on up to 1,000 keys per request.

A sample implementation using the AWS CLI and jq is demonstrated below:

bucket_name=YOUR_BUCKET_NAME_GOES_HERE

# repeat until the bucket reports no remaining object versions
# (delete-objects accepts at most 1,000 keys per request)
while : ; do
    aws s3api list-object-versions --bucket "$bucket_name" --max-items 1000 > object_versions.json

    # exit once no object versions remain
    if [ "$(jq '.Versions | length' object_versions.json)" -eq 0 ]; then
        break
    fi

    # build a bulk-delete request that targets each object version explicitly
    jq --compact-output '{ Objects: [ .Versions[] | { Key: .Key, VersionId: .VersionId } ] }' object_versions.json > delete.json
    aws s3api delete-objects --bucket "$bucket_name" --delete file://delete.json
done

# note: any delete markers (DeleteMarkers in the listing) would need the same treatment

This implementation performs the delete operation for 10,000 objects in less than 20 seconds.

If no automation is required, you can also use the Empty Bucket option from the AWS Console. I always get nervous when I need to execute manual operations in the AWS Console. If you can avoid this option, I would.

Wrap Up

A recent tweet by Angie Jones reminded me of why I started blogging.

It was likely a similar tweet by her that first inspired me several years back. To paraphrase, document what you learn along the way to make it easier for developers who come after you.

And if the issue is obscure enough and you're the type who can't remember what you did yesterday much less several months ago, that developer may end up being you ❤️.

Top comments (1)

Alex Burck

Great post! I usually run into this problem at the most inopportune moments. The problem of deleting S3 buckets containing a large number of objects has plagued developers since S3 was released in 2006. Here are some notes from my own experience on the subject:

  • For many use-cases, using the AWS CLI is your easiest option (even if it takes longer). Go with aws s3 rm s3://mybucket --recursive if you don't have bucket versioning enabled; the script you provided is awesome for versioning-enabled buckets!
  • I have found it is potentially orders of magnitude faster to perform operations like this within AWS rather than on your personal computer. So setting up a temporary EC2 instance, or an ECS container, or your AWS compute du jour to run terraform destroy or the AWS CLI commands will often drastically reduce your delete time. I haven't personally tried it yet, but it looks like the new AWS CloudShell service would be a good zero-infrastructure option too.
  • For the more extreme situations with many thousands of objects, setting up a lifecycle rule with a one-day expiration is your easiest and cheapest option, that is, if you can wait 2 days (one day before the latest objects can expire, one day until the lifecycle rule is evaluated at midnight). The caveat with this option, though, is that if any objects are written to the bucket after you set up the lifecycle rule, you will need to ensure those objects are deleted before you can delete the bucket. This could mean you need to wait even longer for those objects to expire, or you can use one of the other delete options to clean up the stragglers. To prevent this problem, in conjunction with the lifecycle policy you could also set up a bucket policy to deny any further PutObject requests.
  • AWS released S3 Batch back in 2019, which allows you to perform large-scale batch operations on S3 objects. One could potentially use this service to batch-delete S3 objects, but this comes with two caveats: 1. S3 Batch does not have a native Delete operation, so you would need to write a Lambda that can delete an S3 object and then use the LambdaInvoke operation within S3 Batch, and 2. S3 Batch requires you to provide a manifest of all the S3 objects you want to perform the batch operation on, so you would need to first set up an S3 Inventory report for your bucket, which may take up to 24 hours to generate.