Greg Schafer for Tangram Vision

Abusing Terraform to Upload Static Websites to S3

S3 has been a great option for hosting static websites for a long time, but it's still a pain to set up by hand. You need to traverse dozens of pages in the AWS Console to create and manage users, buckets, certificates, a CDN, and about a hundred different configuration options. If you do this repeatedly, it gets old fast. We can automate the process with Terraform, a well-known "infrastructure as code" tool, which lets us declare resources (e.g. servers, storage buckets, users, policies, DNS records) and leaves it to Terraform to figure out how to build and connect them.

Terraform can create the infrastructure needed for a static website on AWS (e.g. users, bucket, CDN, DNS). It can also create and update the content (e.g. webpages, CSS/JS files, images), which goes beyond the "infrastructure" part of "infrastructure as code"; that's why I'm labeling it an abuse or misuse of Terraform. Still, it works and has a few benefits:

  • You can define the bucket, properties, DNS, CDN, etc. in the same place as your content
  • You have a fully-automated process for standing up websites that only requires a single tool, Terraform

... and a few downsides:

  • Uploading files is slow compared to something like the AWS CLI's sync command
  • Terraform isn't meant for transforming or managing content, so you may outgrow Terraform's capabilities if you want advanced features or optimization

This article will breeze over the infrastructure parts of creating a static website on AWS and focus more on how to upload content and manage content metadata (MIME types and caching behavior). If you want to learn more about the infrastructure parts (e.g. setting up CloudFront, an SSL certificate, DNS routes), there are many great tutorials out there.

Let's get on to the code! If you want just the code, you can find it here: https://gitlab.com/tangram-vision/oss/tangram-visions-blog/-/tree/main/2021.10.06_TerraformS3Upload

The Boilerplate

We need some boilerplate to set up infrastructure before we can upload files to an S3 bucket. So, let's create a bucket with Terraform and the AWS provider. We'll configure the provider and create the bucket in a main.tf file containing the following:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.60.0"
    }
  }
}

provider "aws" {
  # This should match the profile name in the credentials file described below
  profile = "aws_admin"
  # Choose the region where you want the S3 bucket to be hosted
  region  = "us-west-1"
}

# To avoid repeatedly specifying the path, we'll declare it as a variable
variable "website_root" {
  type        = string
  description = "Path to the root of website content"
  default     = "../content"
}

resource "aws_s3_bucket" "my_static_website" {
  bucket = "blog-example-m9wtv64y"
  acl    = "private"

  website {
    index_document = "index.html"
  }
}

# To print the bucket's website URL after creation
output "website_endpoint" {
  value = aws_s3_bucket.my_static_website.website_endpoint
}

AWS Credentials

To create or interact with AWS resources, we need to provide credentials. The AWS Terraform provider accepts authentication in a variety of ways, but I'm going to use a credential file. That file is located at ~/.aws/credentials and looks like:

[aws_admin]
aws_access_key_id = AKIA...
aws_secret_access_key = ...

If you don't have credentials handy, you can follow AWS documentation to create a new user with a policy that grants S3 permissions.

Uploading Files to S3 with Terraform

Here's where we start using Terraform... creatively, i.e. for managing content instead of just infrastructure. For the content, I've created a basic multi-page website: a couple of HTML files, a CSS file, and a couple of images. By using Terraform's fileset function and the AWS provider's aws_s3_bucket_object resource, we can collect all the files in a directory and upload each of them as an object in S3:

# in main.tf, below the aforementioned boilerplate
resource "aws_s3_bucket_object" "file" {
  for_each = fileset(var.website_root, "**")

  bucket      = aws_s3_bucket.my_static_website.id
  key         = each.key
  source      = "${var.website_root}/${each.key}"
  source_hash = filemd5("${var.website_root}/${each.key}")
  acl         = "public-read"
}

The for_each meta-argument loops over all files in the website directory tree, binding the file path (index.html, assets/normalize.css, etc.) to each.key, which can be used elsewhere in the block. The source_hash argument hashes the file, which helps Terraform determine when the file has changed and needs to be re-uploaded to the S3 bucket. (There's a similar etag argument, but it doesn't work when some kinds of S3 encryption are enabled.)
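
To see exactly which paths fileset hands to for_each, a throwaway output can help while debugging. This is a minimal sketch rather than part of the original code: the output name website_files_debug is made up for illustration, and it assumes the website_root variable declared in the boilerplate above.

```hcl
# Hypothetical debugging aid: prints the relative paths that fileset() finds
# under var.website_root (the variable declared in the boilerplate above).
output "website_files_debug" {
  value = fileset(var.website_root, "**")
}
```

After terraform apply, the output should list one relative path per file. Each of those strings is what each.key holds inside the resource block.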

Terraform Apply

With our trusty main.tf file in hand, we can now invoke dark and mysterious powers, conjuring infinite computational power out of nothing! With the merest flourish of our terminal, unfathomable forces precipitate to our whim — we are the tactician, the champion and commander over greater numbers than were ever deployed in any Greek myth!

Ahem... anyway, do the following:

# Initialize terraform in the current directory and download the AWS provider
terraform init
# Preview what changes will be made
terraform plan
# Make the changes (create and populate the S3 bucket)
terraform apply

At the end of the output from the apply command, you should see the website endpoint:

...
Apply complete! Resources: 6 added, 0 changed, 0 destroyed.

Outputs:

website_endpoint = "blog-example-m9wtv64y.s3-website-us-west-1.amazonaws.com"

Content Types, MIME Types, Oh My

Let's visit that URL in a browser and...

[Screenshot: instead of rendering the page, the browser prompts to download the file]

That's not what we expected. It turns out that S3 assigns a content type of binary/octet-stream to uploaded files by default. When visiting the website endpoint URL (which serves the index.html file), the browser sees that Content-Type: binary/octet-stream header and thinks "This is a binary file, so I'll prompt the user to download it".

We would prefer the browser to treat our HTML files as HTML, the CSS files as CSS, and so on. For that, we need the browser to receive the correct MIME type (e.g. text/html, text/css, image/png) in the Content-Type header. The easiest way to do that is to specify the correct content type when uploading files. There are two approaches to determining the correct type of our files.

Determining MIME Types with a CLI Tool

The first approach is to use a command-line tool like file, xdg-mime or mimetype. These tools use different approaches:

  • file uses "magic tests" (looking for identifying bits at a small fixed offset into the file) to determine the type of files
  • xdg-mime and mimetype match against the file extension first, falling back to using file if the file doesn't have an extension

The shell session below demonstrates basic usage of each command (a leading dollar sign marks input commands; the other lines are output):

# Demo of file
$ file --brief --mime-type index.html
text/html
$ file --brief --mime-type assets/normalize.css
text/plain

# Demo of xdg-mime
$ xdg-mime query filetype index.html
text/html
$ xdg-mime query filetype assets/normalize.css
text/css

# Demo of mimetype
$ mimetype --brief index.html
text/html
$ mimetype --brief assets/normalize.css
text/css

A subtle detail in the above is that file may not label text files very precisely — it outputs the CSS file as text/plain instead of text/css because there's no magic test or consistent file header that can identify CSS files (nor the many other variations of text file types).

To determine MIME types with a CLI tool in our Terraform file, we'll add three pieces:

  1. An external data source which, for each file to be uploaded, will call...
  2. An external script that calls a CLI tool (e.g. mimetype) to determine the file's MIME type
  3. The content_type argument of the aws_s3_bucket_object resource to assign the MIME type for each uploaded file

The external data source is a new block in main.tf as follows (I've turned the file list into a local value, because we're using it in multiple places now):

locals {
  website_files = fileset(var.website_root, "**")
}

data "external" "get_mime" {
  for_each = local.website_files
  program  = ["bash", "./get_mime.sh"]
  query = {
    filepath = "${var.website_root}/${each.key}"
  }
}

The data source calls bash ./get_mime.sh once for each file, passing the filepath as JSON to stdin. Using the example from the Terraform docs, we can implement the bash script to grab the JSON filepath from stdin, run mimetype on the file, and export the result as a JSON object on stdout.

#!/bin/bash

# Exit if any of the intermediate steps fail
set -e

# Extract "filepath" from the input JSON into FILEPATH shell variable.
eval "$(jq -r '@sh "FILEPATH=\(.filepath)"')"

# Run mimetype on filepath to get the correct mime type.
MIME=$(mimetype --brief "$FILEPATH")

# Safely produce a JSON object containing the result value.
jq -n --arg mime "$MIME" '{"mime":$mime}'

And finally, in main.tf, we associate the MIME type returned by the bash script with each file when uploading to S3:

resource "aws_s3_bucket_object" "file" {
  for_each = local.website_files

  bucket       = aws_s3_bucket.my_static_website.id
  key          = each.key
  source       = "${var.website_root}/${each.key}"
  source_hash  = filemd5("${var.website_root}/${each.key}")
  acl          = "public-read"
  # added:
  content_type = data.external.get_mime[each.key].result.mime
}

Determining MIME Types with a File Extension Map

The second approach to determining correct MIME types for our files is to simply provide a map of file extensions to MIME types. I first ran into this approach (for uploading files with Terraform) in an article on the StateFarm engineering blog, but it's a common approach in general.

To use this approach, we add a mime.json file that maps file extensions to MIME types for whatever files we need to upload. It could be as simple as the below:

{
    ".html": "text/html",
    ".css": "text/css",
    ".png": "image/png"
}

We then load that file as a local value in Terraform and use it when looking up the content type:

locals {
  website_files = fileset(var.website_root, "**")

  mime_types = jsondecode(file("mime.json"))
}

resource "aws_s3_bucket_object" "file" {
  for_each = local.website_files

  bucket       = aws_s3_bucket.my_static_website.id
  key          = each.key
  source       = "${var.website_root}/${each.key}"
  source_hash  = filemd5("${var.website_root}/${each.key}")
  acl          = "public-read"
  content_type = lookup(local.mime_types, regex("\\.[^.]+$", each.key), null)
}

This mapping-based approach has the advantages of being simple and more cross-platform than shelling out to CLI tools. The downside is that you need to make sure all filetypes you're using exist in the extension-to-MIME mapping and are correct.
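
One gap worth guarding against: Terraform's regex function raises an error when nothing matches, so a file without an extension (say, one named LICENSE) would fail the plan. A minimal sketch of a more forgiving lookup follows. The names default_mime_type and content_types are made up for illustration, the fallback value is an assumption, and the block relies on local.website_files and local.mime_types being defined as above.

```hcl
locals {
  # Assumed fallback for files with a missing or unmapped extension;
  # this is the same type S3 assigns when no content type is given.
  default_mime_type = "binary/octet-stream"

  # Hypothetical helper map: relative file path => MIME type.
  # regex() errors when nothing matches, so try() substitutes an empty
  # string, which is not a key in mime.json and falls through to the default.
  # Excluding "/" keeps a dot in a directory name from counting as an extension.
  content_types = {
    for f in local.website_files :
    f => lookup(local.mime_types, try(regex("\\.[^./]+$", f), ""), local.default_mime_type)
  }
}
```

With that in place, the resource could use content_type = local.content_types[each.key] instead of the inline lookup.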

Fixing a Stale CloudFront Cache

Now we have a working static website that we can visit in our browser! If you don't care about SSL or caching for some reason, you could stop here. But I would argue that an important part of modern websites is making them secure and fast, so you'll likely want to put a CloudFront distribution in front of your S3 bucket. There are many other tutorials that cover CloudFront, so I won't dig into the details of that. However, I do want to dig into a problem that you run into when serving a static website via CloudFront: a stale cache.

By default, CloudFront applies a TTL of 86400 seconds (1 day), meaning CloudFront will fetch website files from your S3 bucket and serve the same files to visitors for a full day before re-fetching from S3. If you update website content (e.g. change CSS styles or JavaScript behavior) in S3, visitors may continue receiving cached versions from CloudFront and won't see your updates for up to a whole day! We'd prefer visitors to see the latest version of all website content, but we'd also like CloudFront to cache files as long as possible, so files can be served faster (directly from cache).

Cache Busting

One solution is cache-busting, which involves adding a hash (or "fingerprint") to non-HTML files' names. If the files' content changes, then the hash changes, so the browser downloads a completely different file (which can be cached forever).

I tried to implement this with Terraform, but uh... Terraform isn't meant for this sort of thing. Between the Terraform filemd5 and regex functions, you can get close, but I hit a wall when trying to replace filenames with their hashed version in all files. This could maybe work if you used template variables (e.g. <link href="${main.css}"> instead of <link href="main.css">), but then you can no longer browse your website via the filesystem or a local server. Alas, here dies my ill-advised dream of making a Terraform-based static-site generator/bundler.
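
For a sense of how far filemd5 and regex-based replacement can go, here is a minimal sketch of the renaming half of the problem. It is an illustration rather than a tested solution: the name fingerprinted_keys and the 8-character hash prefix are arbitrary choices, and it assumes local.website_files and var.website_root from the earlier snippets.

```hcl
locals {
  # Hypothetical map: original relative path => fingerprinted S3 key,
  # e.g. "assets/normalize.css" => "assets/normalize.<8 hex chars>.css".
  # replace() treats a search string wrapped in forward slashes as a regular
  # expression, and $1 in the replacement re-inserts the captured extension.
  # HTML files are skipped so that page URLs stay stable.
  fingerprinted_keys = {
    for f in local.website_files :
    f => replace(
      f,
      "/(\\.[^./]+)$/",
      ".${substr(filemd5("${var.website_root}/${f}"), 0, 8)}$1"
    )
    if !can(regex("\\.html$", f))
  }
}
```

Setting key = lookup(local.fingerprinted_keys, each.key, each.key) on the upload resource would store the renamed objects, but the references inside the HTML files would still point at the old names, which is exactly the wall described above.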

[Image: melting face emoji]

Fun fact: the melting face emoji was recently approved!

Cache Invalidation

The other solution to a stale CloudFront cache is invalidating files. This approach does not fit into Terraform's declarative paradigm — there are no resources for invalidations in the AWS provider and no third-party modules either. So, it requires more hacky-ness, in the form of a null_resource that triggers based on changes in file hashes and shells out to the AWS CLI to create a new invalidation. That approach might look something like the below:

locals {
  website_files = fileset(var.website_root, "**")

  file_hashes = {
    for filename in local.website_files :
    filename => filemd5("${var.website_root}/${filename}")
  }
}

resource "null_resource" "invalidate_cache" {
  triggers = local.file_hashes

  provisioner "local-exec" {
    command = "aws --profile=aws_admin cloudfront create-invalidation --distribution-id=${aws_cloudfront_distribution.my_distribution.id} --paths=/*"
  }
}

The null_resource comes from the null provider, which this configuration hasn't used until now, so you'll need to run terraform init again to download it.
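
Terraform can pick up the null provider on its own, but if you'd rather pin it alongside the AWS provider, the terraform block from the boilerplate could be extended roughly as follows. This is a sketch: the version constraint for the null provider is an assumption, so adjust it to whatever release you want to use.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.60.0"
    }
    # Added: pin the provider that supplies null_resource.
    # The version constraint here is an assumption; adjust as needed.
    null = {
      source  = "hashicorp/null"
      version = "~> 3.1"
    }
  }
}
```

This replaces the existing terraform block in main.tf rather than sitting next to it.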

What About Browser Caching?

We've talked about CloudFront caching, but there's another cache in between your content and your visitor: the browser. The browser cache and the Cache-Control header are a big topic all on their own; Harry Roberts's Cache-Control for Civilians is a great resource if you want to learn more.

For the purpose of this article, it's important to note that you shouldn't set an aggressive cache control header (e.g. Cache-Control: public, max-age=604800, immutable) on your website files without fingerprinting them. Otherwise, visitors' browsers will keep serving a file from their local cache for the max-age duration (one week, in the above example) before they send a request to CloudFront to check if the file is stale. CloudFront invalidations force CloudFront to fetch fresh content, but have no impact on the caching of visitors' browsers.
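
If you do want to set the header from Terraform, the aws_s3_bucket_object resource has a cache_control argument. Below is a minimal sketch built on the extension-map version of the resource. The header value is an illustrative, deliberately conservative assumption rather than a recommendation; CloudFront reads this header too, so it interacts with your distribution's TTL settings.

```hcl
resource "aws_s3_bucket_object" "file" {
  for_each = local.website_files

  bucket       = aws_s3_bucket.my_static_website.id
  key          = each.key
  source       = "${var.website_root}/${each.key}"
  source_hash  = filemd5("${var.website_root}/${each.key}")
  acl          = "public-read"
  content_type = lookup(local.mime_types, regex("\\.[^.]+$", each.key), null)

  # Added (illustrative value, an assumption):
  # max-age=0 asks browsers to revalidate before reusing a cached copy, while
  # s-maxage lets shared caches such as CloudFront keep the object for a day.
  cache_control = "public, max-age=0, s-maxage=86400, must-revalidate"
}
```

Because the files aren't fingerprinted, this keeps browsers checking back for updates while the CloudFront invalidation above takes care of the CDN copy.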


That's all for this adventure — thanks for joining me in pushing Terraform out of its comfort zone! If you have any suggestions or corrections, please let me know or send us a tweet, and if you’re curious to learn more about how we improve perception sensors, visit us at Tangram Vision.
