Tibo Beijen

Posted on Dec 3, 2021 • Originally published at tibobeijen.nl

Shifting Akamai to the left using Terraform

#devops #akamai #terraform #testing

This article was originally written for my personal blog

Recently we migrated our CDNs from Cloudfront to Akamai. We use Terraform for infrastructure as code (IaC) and luckily it supports Akamai as well. Since we had Cloudfront distributions for pretty much every environment, it served as a good moment to reflect on what we've taken for granted in the past years, especially since Akamai has the concept of a 'staging network' which doesn't naturally seem to fit in a test-early, test-often approach (Spoiler alert: We don't use the staging network).

Shift Left testing

"Shift left" is popular in the contemporary agile and DevOps IT landscape, and for good reasons. This article by BMC summarizes it nicely:

Shift Left is a practice intended to find and prevent defects early in the software delivery process. The idea is to improve quality by moving tasks to the left as early in the lifecycle as possible. Shift Left testing means testing earlier in the software development process.

[Figure: Shift left testing illustrated]

Quoting yet another source, devops.com:

Shifting left requires two key DevOps practices: continuous testing and continuous deployment.

And:

Another way to reduce the failure rate is to make all environments in the pipeline look as much like production as possible.

So, how does a CDN (any CDN, be it Cloudfront, Akamai, Fastly, you name it) fit into this shift left approach? Very well actually, as long as:

The CDN isn't limited to production only but is present in every environment as early as possible in the development lifecycle. ¹
Setting up and updating the CDN is no different from any other code or infra change. As Martin Fowler says: "If it hurts, do it more often".

Akamai concepts

Compared to Cloudfront, Akamai has some advanced concepts that need to be fitted into an IaC workflow in some way.

Activation

An Akamai 'property' (what Cloudfront calls a 'distribution') has versions, one of which is active at any moment, typically the most recent one. Modifying a configuration whose latest version is active creates a new property version, which can be activated when ready.

The staging network

Akamai provides two networks: production and staging. Property versions can be activated on the staging and production networks independently. The staging network is feature-complete but doesn't provide the performance of the production network. If the production network uses mysite.com.edgekey.net, then the staging network is reachable at mysite.com.edgekey-staging.net. By pointing the site's hostname at the staging network in the /etc/hosts file, a version can be tested before it is activated on the production network.

Adapting to IaC

One can observe that both of the above concepts seem to originate from a more traditional acceptance testing practice that happens late in the development lifecycle. In an IaC practice they lose some of their relevance and can even cause undesirable ambiguity:

Configuration versions are already present by having configuration in source control. The active version is determined by the branching model that is used (commonly 'latest master'), combined with any automation that exists.
The Akamai staging network can be used to test a property version, but it's not really a staging environment since it uses the production origins. ² To illustrate: one could only test the integration of an application change and a CDN change after deploying the application to production. This limits the scope of what can be tested using the staging network. So for a test environment, let alone multiple test (feature) environments, more than one property is needed.

What we found works well:

Create a property for all environments: test (one or multiple), staging and production.
Always activate the latest version.
Test on test, which is fully representative, using any automation one has, for example Cypress e2e tests.

This way the delivery of Akamai config changes is identical to that of application changes.
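As a rough illustration of "a property for all environments", the sketch below creates one property per entry in a map. It is not taken from our actual setup: the `environments` variable, the property naming scheme and the other input variables are hypothetical, and it assumes default certificates (`cert_provisioning_type = "DEFAULT"`).

```hcl
# Hypothetical inputs: names and structure are assumptions for this sketch.
variable "contract_id" { type = string }
variable "group_id" { type = string }
variable "product_id" { type = string }
variable "rule_format" { type = string }

variable "environments" {
  description = "One entry per environment, e.g. test, staging, production."
  type = map(object({
    hostname   = string # e.g. test.mysite.com
    rules_json = string # rule tree as JSON, pointing at that environment's origin
  }))
}

# One property per environment, all built from the same definition.
resource "akamai_property" "per_environment" {
  for_each = var.environments

  name        = "mysite-${each.key}"
  contract_id = var.contract_id
  group_id    = var.group_id
  product_id  = var.product_id
  rule_format = var.rule_format
  rules       = each.value.rules_json

  hostnames {
    cname_from             = each.value.hostname
    cname_to               = "${each.value.hostname}.edgekey.net"
    cert_provisioning_type = "DEFAULT" # assumption: default certificates
  }
}
```

The same effect can be achieved with one Terraform workspace or module instance per environment, in which case a single, non-iterated property resource per configuration is enough.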

Note that it still allows shit-hits-the-fan rollbacks: The first hour after activating a production property version, there's a quick fallback option. This can be activated (stop the bleeding), after which the active version defined in IaC can be aligned with the actual active version and a fix can be worked on (proper surgery).

Terraform

Overall the Akamai Terraform provider does a fine job of translating declarative Terraform config into Akamai API actions. There are, however, some things to consider:

Version to be activated

An activation is a separate Terraform resource. Under the hood, if the version changes, the provider uses Akamai's Property Manager API (PAPI) to create a new activation.

The Terraform property resource has 3 attributes related to versions: latest_version, production_version and staging_version. These are determined after the property has been updated, but before any activation has finished.

We take 'always activating latest' as a starting point. However, scenarios can exist where you want to pin a version. One possible way to accomplish this is setting a local like this:

locals {
  # Activate the latest version when requested; otherwise use the pinned
  # version if one is set (> 0), or else keep the currently active version.
  production_version_to_activate = (var.production_activate_latest ?
    akamai_property.property.latest_version :
    (var.production_pinned_version > 0 ? var.production_pinned_version : akamai_property.property.production_version))
}

With these variable defaults:

# Note: Similar variables would exist for staging network
variable "production_pinned_version" {
  description = "Pin PRODUCTION network activation to this version. Set to 0 to always use previous property version on production (don't activate any property changes)."
  type        = number
  default     = 0
}

variable "production_activate_latest" {
  description = "Apply latest version to production. This supersedes any pinned version so disable if wanting to stay at a specific version."
  type        = bool
  default     = true
}

This way, tfvars can be set for various scenarios, as in the examples below:

# Directly activate latest property version (default)
production_activate_latest   = true

# Stick to previously active version (update the property, activate later, or via GUI)
production_activate_latest   = false

# Activate a specific version (e.g. reverting to a known-good version)
production_pinned_version    = 7
production_activate_latest   = false
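For completeness, a minimal sketch of how the local could feed the activation resource. The `activation_contact_email` variable and the note text are assumptions made for this example; the property reference matches the one used in the local above.

```hcl
# Hypothetical variable: who gets notified about activations.
variable "activation_contact_email" {
  description = "Email address Akamai notifies about this activation."
  type        = string
}

resource "akamai_property_activation" "production" {
  property_id = akamai_property.property.id
  network     = "PRODUCTION"

  # Resolves to latest, pinned or currently active version,
  # depending on the tfvars scenarios shown above.
  version = local.production_version_to_activate

  contact = [var.activation_contact_email]
  note    = "Activated via Terraform"
}
```

When the local resolves to the version that is already active, the activation resource has nothing to change, which is what makes the "stick to previously active version" scenario work.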

Slow activations

Activating on the staging network takes about 2 to 3 minutes. Activating on production typically takes between 9 and 11 minutes. To shorten the feedback loop, one can configure DNS for the test environment to use Akamai's staging network and avoid activating on the production network altogether. Example:

test.mysite.com CNAME test.mysite.com.edgekey-staging.net

Given low traffic, the cache-hit ratio on test usually can't be compared to production anyway, so not having production performance would normally not be an issue.
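If DNS is managed in Terraform as well, the CNAME above can live next to the property. The sketch below assumes Route 53 and a hypothetical `route53_zone_id` variable; any DNS provider with a record resource would work the same way.

```hcl
# Hypothetical variable: the hosted zone that contains mysite.com.
variable "route53_zone_id" {
  description = "ID of the Route 53 hosted zone for mysite.com."
  type        = string
}

# Point the test hostname at Akamai's staging network instead of production.
resource "aws_route53_record" "test_cdn" {
  zone_id = var.route53_zone_id
  name    = "test.mysite.com"
  type    = "CNAME"
  ttl     = 300
  records = ["test.mysite.com.edgekey-staging.net"]
}
```

With this in place, the test property only needs an activation on the staging network.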

Implicit edge hostnames

The edge hostname resource requires a certificate enrollment ID when using enhanced TLS (edge hostnames ending in edgekey.net). However, if you're a 'Secure by default' customer, you can (but don't have to) use default certificates. In that case the edge hostname will be created implicitly by the Property Manager API.

As a result, the edge hostname that is created is not managed via Terraform. Most of the edge hostname attributes hardly ever need to be changed, but for ip_behavior this can be a problem (GitHub issue).
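For comparison, this is roughly what the explicit alternative looks like when you do have a certificate enrollment: the edge hostname becomes a regular Terraform resource and ip_behavior can be set directly. All variables and the chosen ip_behavior value are assumptions for this sketch.

```hcl
# Hypothetical inputs for this sketch.
variable "contract_id" { type = string }
variable "group_id" { type = string }
variable "product_id" { type = string }

variable "certificate_enrollment_id" {
  description = "Certificate enrollment ID, needed for enhanced TLS edge hostnames."
  type        = number
}

# Explicitly managed edge hostname (not using default certificates).
resource "akamai_edge_hostname" "explicit" {
  contract_id   = var.contract_id
  group_id      = var.group_id
  product_id    = var.product_id
  edge_hostname = "mysite.com.edgekey.net"
  ip_behavior   = "IPV6_COMPLIANCE" # example value: dual-stack IPv4 and IPv6
  certificate   = var.certificate_enrollment_id
}
```

The trade-off is that you then manage the enrollment yourself instead of relying on default certificates.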

Final thoughts

The main take-away is: Treat a CDN like any other cloud resource, making sure to have representative environments as early as possible in the development lifecycle, whether it is via Terraform, the Akamai CLI or another tool of choice.

Shift-left in the context of Akamai results in achieving confidence in provisioning 1...n near-identical CDN properties, reducing the need for Akamai's staging network and ultimately speeding up the delivery process.

Worth noting is that end-to-end tests in a caching setup can be challenging to keep fast due to cache TTLs. This can be mitigated via cache busters, reduced max-age values in response headers or other constructs.

A representative test environment with carefully considered exceptions still beats shifting right.

Thanks for reading! Please leave any feedback or comments below, or find me on Twitter.

¹ For a CDN, representative local development seems a bit far-fetched, but once you deploy, having a representative environment should be the goal.
² One could attempt to mitigate this by selecting a staging origin based on the request host, but this is a bad idea for a variety of reasons, the most obvious one being that it adds complexity that can easily backfire (production traffic ending up on staging origin), while still being limited to just production and staging. No test. No shifting left.
