I've been analyzing and optimizing the performance of our CI/CD pipeline in a current project and ran into some unexpected behavior with Terraform. Since my Googling didn't turn up anything useful, I'm writing this to share my experience. I'll explain how I identified why Terraform didn't use the cached providers and how to avoid the underlying problem with platform-specific hashes in the Terraform provider lock file.
We're using a private GitLab instance to host our code and have a dedicated runner to execute our pipeline. The Terraform part of the pipeline is responsible for rolling out code and infrastructure changes and consists of two stages with their own jobs - plan and apply. If you've used Terraform before, you probably already guessed that the plan job creates a Terraform plan file, i.e., the diff between the current and the target state. The subsequent apply job consumes that plan and executes the changes (unless something else touched the state in the meantime). Depending on the configuration, the apply is sometimes automated and in other cases manual, which is one of the reasons for separating the two steps.
The pipeline uses images that don't have Terraform installed, so each of the two jobs first installs Terraform and then runs terraform init to set up the backend and providers. Only once these two steps are complete is the environment fully set up and we can run terraform plan/apply.
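To make this a bit more concrete, here's a stripped-down sketch of what such a pipeline could look like. The job names, Terraform version, and plan file name are illustrative and not our exact configuration.
stages:
  - plan
  - apply

default:
  before_script:
    # the image doesn't ship Terraform, so install it first (version/URL illustrative)
    - curl -sSLo terraform.zip "https://releases.hashicorp.com/terraform/1.13.0/terraform_1.13.0_linux_amd64.zip"
    - unzip terraform.zip && mv terraform /usr/local/bin/
    # set up the backend and providers
    - terraform init -input=false

plan:
  stage: plan
  script:
    - terraform plan -out=plan.tfplan
  artifacts:
    paths:
      - plan.tfplan

apply:
  stage: apply
  when: manual   # automated in some of our configurations
  script:
    - terraform apply plan.tfplan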
This project primarily deploys resources to AWS, so the code depends on the AWS provider. Unfortunately, that one is a bit on the heavier side: the current version from the Terraform registry weighs in at about 700 MB once extracted, and the zipped form that's actually downloaded is about 150-180 MB. Downloading this for each job adds quite a bit of overhead - even on a fast connection it takes at least a few seconds, and the decompression may take additional time depending on your build environment.
In our case, the terraform init, which pulls in more than just the AWS provider, took about 3-4 minutes without caching. That means about 6-8 minutes of extra time for each run, which is unacceptable. Actually, "without caching" isn't quite fair - we had configured caching for the .terraform directory, it just wasn't working. Well, it was actually working - the directory was cached and restored between runs - it just didn't make a difference, which was odd.
Below is a caching configuration that's close to what we're using. It's attached to both the plan and apply jobs and creates a separate cache for each deployment stage (e.g. dev/prod). We could have included the .terraform.lock.hcl file, which defines which provider versions to use, as part of the cache key here, but more on that later.
cache:
  - key: 'terraform-init-$STAGE'
    paths:
      - .terraform
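Had we wanted to key the cache on the lock file as well, GitLab's cache:key:files together with a prefix would be one way to do it - the cache then gets invalidated whenever the lock file changes. A sketch of that variant:
cache:
  - key:
      files:
        - .terraform.lock.hcl
      prefix: 'terraform-init-$STAGE'
    paths:
      - .terraform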
For a long time, I thought that our cache somehow corrupted the files so that the init command couldn't see them, but after logging in to the runner I could see that the permissions were perfectly fine. The next step was increasing the log level, which I did by setting the TF_LOG environment variable before the init command.
TF_LOG=TRACE terraform init ...
Here's the abbreviated and commented output. I removed the timestamps and some unrelated messages. Yes, we're using a more recent version of the provider now. I suggest you focus on the comments.
// TF scans the local provider dir and finds the expected version
[TRACE] getproviders.SearchLocalDirectory: found registry.terraform.io/hashicorp/aws v6.4.0 for linux_amd64 at .terraform/providers/registry.terraform.io/hashicorp/aws/6.4.0/linux_amd64
// TF registers this as a candidate for the aws provider
[TRACE] providercache.fillMetaCache: including .terraform/providers/registry.terraform.io/hashicorp/aws/6.4.0/linux_amd64 as a candidate package for registry.terraform.io/hashicorp/aws 6.4.0
...
// TF decides to install the SAME provider from the internet
// instead of using the existing binary
[TRACE] providercache.Dir.InstallPackage: installing registry.terraform.io/hashicorp/aws v6.4.0 from https://releases.hashicorp.com/terraform-provider-aws/6.4.0/terraform-provider-aws_6.4.0_linux_amd64.zip
[TRACE] HTTP client GET request to https://releases.hashicorp.com/terraform-provider-aws/6.4.0/terraform-provider-aws_6.4.0_linux_amd64.zip
[DEBUG] Provider signed by 34365D9472D7468F HashiCorp Security (hashicorp.com/security) <security@hashicorp.com>
// It scans the local directory again
[TRACE] providercache.fillMetaCache: scanning directory .terraform/providers
// And finds the binary it has just downloaded
[TRACE] getproviders.SearchLocalDirectory: found registry.terraform.io/hashicorp/aws v6.4.0 for linux_amd64 at .terraform/providers/registry.terraform.io/hashicorp/aws/6.4.0/linux_amd64
[TRACE] providercache.fillMetaCache: including .terraform/providers/registry.terraform.io/hashicorp/aws/6.4.0/linux_amd64 as a candidate package for registry.terraform.io/hashicorp/aws 6.4.0
In a nutshell, Terraform saw the local provider matching the desired version but still decided to download it again. That didn't make a lot of sense to me and still doesn't. Here, I went down a rabbit hole comparing the checksums of the downloaded binaries and trying to figure out if GitLab somehow modified them or their metadata. Of course it didn't - that would be a strange caching implementation - but I had to be sure.
I was able to narrow down the issue further when I decided to run terraform init again as part of the same job. The second run reused the cached version from the first init and completed almost instantly. Taking a closer look at the output led me to realize that the first run was printing something that the second didn't:
Terraform has made some changes to the provider dependency selections recorded in the .terraform.lock.hcl file. Review those changes and commit them to your version control system if they represent changes you intended to make.
I mentioned the .terraform.lock.hcl briefly earlier; now it's time to dive a bit deeper into it and its function. The dependency lock file is one of two mechanisms that determine which exact provider versions to install. At the time of writing (Terraform v1.13.x), this file keeps track of which exact version of a provider was used and includes hashes to verify that the correct binary is installed.
When you run terraform init, it checks the version constraints in the provider configuration as well as the .terraform.lock.hcl. If there's a version of the provider that satisfies the constraints and is already recorded in the lock file, it will install that one (unless you specify the -upgrade parameter). Otherwise, it will select a version that matches the constraints and update the lock file.
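For context, an entry in that file looks roughly like this - the hash values are placeholders here, shortened for readability:
provider "registry.terraform.io/hashicorp/aws" {
  version     = "6.4.0"
  constraints = "6.4.0"
  hashes = [
    # "h1:" hashes cover the unpacked package and are recorded per platform
    "h1:...",
    # "zh:" hashes cover the release zip archives listed in the registry
    "zh:...",
    "zh:...",
  ]
}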
In my case, the provider config had a version constraint pinning v6.4.0 and the lock file listed v6.4.0 as well. That means I expected it to take the local version from the cache - but it didn't. As a further troubleshooting step, I included the lock file as a job artifact of the plan job. This caused the apply job to use the cache as I expected, so I was onto something. I downloaded the file and took a look at it. Compared to my local version, the pipeline had added another hash value for the AWS provider.
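For reference, passing the lock file along was as simple as adding it to the plan job's artifacts - something along these lines (the plan file name is illustrative):
plan:
  artifacts:
    paths:
      - plan.tfplan
      - .terraform.lock.hcl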
Then it dawned on me. Providers are compiled Go binaries, and my dev environment is a Windows VM while the build pipeline is powered by Linux containers. When I upgraded to v6.4.0 a few months back, I did that on the Windows machine, and apparently that only added the hashes for the Windows version of the provider. In fact, there is a dedicated terraform providers lock command that you can use to add the hashes of different platforms to the lock file.
I used the following command on my dev VM to add the hashes of the 64-bit Windows and Linux versions of the providers.
# windows_amd64 = 64-bit Windows, linux_amd64 = 64-bit Linux
terraform providers lock \
  -platform=windows_amd64 \
  -platform=linux_amd64
This command ran for a while, because it doesn't just fetch metadata from the registry - it actually downloads the providers for both platforms, computes the checksums, and adds them to the .terraform.lock.hcl file. I'm sure for tiny providers this happens in the blink of an eye, but for bigger ones, such as the AWS provider, it takes a while. This issue is not unique to Windows and Linux: whenever your dev team uses different platforms, you may run into it. Check out the docs I linked for all supported platforms.
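If your team also works on other platforms, say macOS on Apple Silicon, you'd simply extend the platform list - for example:
# example: also record hashes for Apple Silicon Macs
terraform providers lock \
  -platform=windows_amd64 \
  -platform=linux_amd64 \
  -platform=darwin_arm64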
Once I committed the updated lock file, the job run times went down significantly and caching worked as I expected. The install and init phases now take less than 20 seconds, which speeds up each pipeline run (unless the caches are deleted).
This experience didn't give me warm fuzzy feelings about the current implementation. It seems like it will always download the binaries from the registry just to compute the checksums. It would be much more efficient to store these checksums as metadata in the registry, although that would probably require changes on both the backend and the frontend. An alternative would be to at least compute the checksums based on locally available versions of the provider. That might also reduce the load on HashiCorp's CDN.
— Maurice