DEV Community: Greg Schafer

Abusing Terraform to Upload Static Websites to S3

Greg Schafer — Wed, 06 Oct 2021 18:27:29 +0000

S3 has been a great option for hosting static websites for a long time, but it's still a pain to set up by hand. You need to traverse dozens of pages in the AWS Console to create and manage users, buckets, certificates, a CDN, and about a hundred different configuration options. If you do this repeatedly, it gets old fast. We can automate the process with Terraform, a well-known "infrastructure as code" tool, which lets us declare resources (e.g. servers, storage buckets, users, policies, DNS records) and let Terraform figure out how to build and connect them.

Terraform can create the infrastructure needed for a static website on AWS (e.g. users, bucket, CDN, DNS). It can also create and update the content (e.g. webpages, CSS/JS files, images). That second part goes beyond the "infrastructure" in "infrastructure as code", which is why I'm labeling it an abuse or misuse of Terraform. Still, it works and has a few benefits:

You can define the bucket, properties, DNS, CDN, etc. in the same place as your content
You have a fully-automated process for standing up websites that only requires a single tool, Terraform

... and a few downsides:

Uploading files is slow compared to something like the AWS CLI's sync command
Terraform isn't meant for transforming or managing content, so you may outgrow Terraform's capabilities if you want advanced features or optimization

This article will breeze over the infrastructure parts of creating a static website on AWS and focus more on how to upload content and manage content metadata (MIME types and caching behavior). If you want to learn more about the infrastructure parts (e.g. setting up CloudFront, an SSL certificate, DNS routes), there are many great tutorials out there. Here are a few:

Let's get on to the code! If you want just the code, you can find it here: https://gitlab.com/tangram-vision/oss/tangram-visions-blog/-/tree/main/2021.10.06_TerraformS3Upload

The Boilerplate

We need some boilerplate to set up infrastructure before we can upload files to an S3 bucket. So, let's create a bucket with Terraform and the AWS provider. We'll configure the provider and create the bucket in a main.tf file containing the following:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.60.0"
    }
  }
}

provider "aws" {
  # This should match the profile name in the credentials file described below
  profile = "aws_admin"
  # Choose the region where you want the S3 bucket to be hosted
  region  = "us-west-1"
}

# To avoid repeatedly specifying the path, we'll declare it as a variable
variable "website_root" {
  type        = string
  description = "Path to the root of website content"
  default     = "../content"
}

resource "aws_s3_bucket" "my_static_website" {
  bucket = "blog-example-m9wtv64y"
  acl    = "private"

  website {
    index_document = "index.html"
  }
}

# To print the bucket's website URL after creation
output "website_endpoint" {
  value = aws_s3_bucket.my_static_website.website_endpoint
}

AWS Credentials

To create or interact with AWS resources, we need to provide credentials. The AWS Terraform provider accepts authentication in a variety of ways, but I'm going to use a credential file. That file is located at ~/.aws/credentials and looks like:

[aws_admin]
aws_access_key_id = AKIA...
aws_secret_access_key = ...

If you don't have credentials handy, you can follow AWS documentation to create a new user with a policy that grants S3 permissions.

Uploading Files to S3 with Terraform

Here's where we start using Terraform... creatively, i.e. for managing content instead of just infrastructure. For the content, I've created a basic multi-page website: a couple of HTML files, a CSS file, and a couple of images. By using Terraform's fileset function and the AWS provider's aws_s3_bucket_object resource, we can collect all the files in a directory and upload each of them as an object in S3:

# in main.tf, below the aforementioned boilerplate
resource "aws_s3_bucket_object" "file" {
  for_each = fileset(var.website_root, "**")

  bucket      = aws_s3_bucket.my_static_website.id
  key         = each.key
  source      = "${var.website_root}/${each.key}"
  source_hash = filemd5("${var.website_root}/${each.key}")
  acl         = "public-read"
}

The for_each meta-argument loops over all files in the website directory tree, binding the file path (index.html, assets/normalize.css, etc.) to each.key, which can be used elsewhere in the block. The source_hash argument hashes the file, which helps Terraform determine when the file has changed and needs to be re-uploaded to the S3 bucket. (There's a similar etag argument, but it doesn't work when some kinds of S3 encryption are enabled.)
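To make the mechanics concrete, here is a rough Python sketch of what the fileset and filemd5 combination computes: one key per file, each paired with a content hash, so that a changed hash marks a file for re-upload. The function names and the temporary directory are made up for illustration; this is not how Terraform is implemented internally.

```python
import hashlib
import tempfile
from pathlib import Path


def file_hashes(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the MD5 hex digest of its bytes."""
    return {
        path.relative_to(root).as_posix(): hashlib.md5(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_keys(before: dict[str, str], after: dict[str, str]) -> set[str]:
    """Keys that are new or whose hash differs, i.e. files that need uploading."""
    return {key for key, digest in after.items() if before.get(key) != digest}


# Demo with a throwaway directory standing in for the website root.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "assets").mkdir()
    (root / "index.html").write_text("<h1>Hello</h1>")
    (root / "assets" / "normalize.css").write_text("body { margin: 0; }")

    first = file_hashes(root)
    (root / "index.html").write_text("<h1>Hello again</h1>")
    second = file_hashes(root)

    changed = changed_keys(first, second)
    print(sorted(first))  # ['assets/normalize.css', 'index.html']
    print(changed)        # {'index.html'}
```

Only index.html is reported as changed, which mirrors how a changed source_hash causes Terraform to plan an update for that one object.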

Terraform Apply

With our trusty main.tf file in hand, we can now invoke dark and mysterious powers, conjuring infinite computational power out of nothing! With the merest flourish of our terminal, unfathomable forces precipitate to our whim — we are the tactician, the champion and commander over greater numbers than were ever deployed in any Greek myth!

Ahem... anyway, do the following:

# Initialize terraform in the current directory and download the AWS provider
terraform init
# Preview what changes will be made
terraform plan
# Make the changes (create and populate the S3 bucket)
terraform apply

At the end of the output from the apply command, you should see the website endpoint:

...
Apply complete! Resources: 6 added, 0 changed, 0 destroyed.

Outputs:

website_endpoint = "blog-example-m9wtv64y.s3-website-us-west-1.amazonaws.com"

Content Types, MIME Types, Oh My

Let's visit that URL in a browser and...

That's not what we expected. It turns out that S3 assigns a content type of binary/octet-stream to uploaded files by default. When visiting the website endpoint URL (which serves the index.html file), the browser sees that Content-Type: binary/octet-stream header and thinks "This is a binary file, so I'll prompt the user to download it".

We would prefer the browser to treat our HTML files as HTML, the CSS files as CSS, and so on. For that, we need the browser to receive the correct MIME type (e.g. text/html, text/css, image/png) in the Content-Type header. The easiest way to do that is to specify the correct content type when uploading files. To determine the correct type of our files, there are 2 approaches.

Determining MIME Types with a CLI Tool

The first approach is to use a command-line tool like file, xdg-mime or mimetype. These tools use different approaches:

file uses "magic tests" (looking for identifying bits at a small fixed offset into the file) to determine the type of files
xdg-mime and mimetype match against the file extension first, falling back to using file if the file doesn't have an extension

The below shell session demonstrates basic usage of each command (a dollar sign is used to distinguish input commands from output results):

# Demo of file
$ file --brief --mime-type index.html
text/html
$ file --brief --mime-type assets/normalize.css
text/plain

# Demo of xdg-mime
$ xdg-mime query filetype index.html
text/html
$ xdg-mime query filetype assets/normalize.css
text/css

# Demo of mimetype
$ mimetype --brief index.html
text/html
$ mimetype --brief assets/normalize.css
text/css

A subtle detail in the above is that file may not label text files very precisely — it outputs the CSS file as text/plain instead of text/css because there's no magic test or consistent file header that can identify CSS files (nor the many other variations of text file types).

To determine MIME types with a CLI tool in our Terraform file, we'll add three pieces:

An external data source which, for each file to be uploaded, will call...
An external script that calls a CLI tool (e.g. mimetype) to determine the file's MIME type
The content_type argument of the aws_s3_bucket_object resource to assign the MIME type for each uploaded file

The external data source is a new block in main.tf as follows (I've turned the file list into a local value, because we're using it in multiple places now):

locals {
  website_files = fileset(var.website_root, "**")
}

data "external" "get_mime" {
  for_each = local.website_files
  program  = ["bash", "./get_mime.sh"]
  query = {
    filepath = "${var.website_root}/${each.key}"
  }
}

The data source calls bash ./get_mime.sh once for each file, passing the filepath as JSON to stdin. Using the example from the Terraform docs, we can implement the bash script to grab the JSON filepath from stdin, run mimetype on the file, and export the result as a JSON object on stdout.

#!/bin/bash

# Exit if any of the intermediate steps fail
set -e

# Extract "filepath" from the input JSON into FILEPATH shell variable.
eval "$(jq -r '@sh "FILEPATH=\(.filepath)"')"

# Run mimetype on filepath to get the correct mime type.
MIME=$(mimetype --brief "$FILEPATH")

# Safely produce a JSON object containing the result value.
jq -n --arg mime "$MIME" '{"mime":$mime}'
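The same stdin/stdout contract can be sketched in Python, which may be handy where jq or mimetype aren't installed. Note the swap: this version guesses the type from the file extension with the standard-library mimetypes module rather than calling the mimetype CLI, and the handle_query name and the fallback type are assumptions of this sketch.

```python
import json
import mimetypes


def handle_query(raw_query: str) -> str:
    """Turn the JSON query Terraform sends on stdin into the JSON result it expects."""
    query = json.loads(raw_query)
    mime, _encoding = mimetypes.guess_type(query["filepath"])
    # Every value in the result object must be a string, so fall back to a
    # generic type (an assumption of this sketch) when the extension is unknown.
    return json.dumps({"mime": mime or "application/octet-stream"})


# As a standalone script, the last step would be:
#     print(handle_query(sys.stdin.read()))
print(handle_query('{"filepath": "../content/index.html"}'))  # {"mime": "text/html"}
```

If saved as a hypothetical get_mime.py, the data source's program argument would point at a Python interpreter and that file instead of bash and get_mime.sh.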

And finally, in main.tf, we associate the MIME type returned by the bash script with each file when uploading to S3:

resource "aws_s3_bucket_object" "file" {
  for_each = local.website_files

  bucket       = aws_s3_bucket.my_static_website.id
  key          = each.key
  source       = "${var.website_root}/${each.key}"
  source_hash  = filemd5("${var.website_root}/${each.key}")
  acl          = "public-read"
  # added:
  content_type = data.external.get_mime[each.key].result.mime
}

Determining MIME Types with a File Extension Map

The second approach to determining correct MIME types for our files is to simply provide a map of file extensions to MIME types. I first ran into this approach (for uploading files with Terraform) in this article on the StateFarm engineering blog, but it's a common approach in general:

The hashicorp/dir/template Terraform module has a mapping of extensions and MIME types
- Sidenote: An open Terraform issue requesting native MIME type detection directs users to use this Terraform module.
The AWS CLI uses the Python mimetypes module, which has a built-in mapping as a fallback if it can't read a mapping from the system (at /etc/mime.types)
In non-desktop environments, the xdg-mime tool falls back to using the mimetype tool, which checks file extensions before performing magic tests (for the most part)

To use this approach, we add a mime.json file that maps file extensions to MIME types for whatever files we need to upload. It could be as simple as the below:

{
    ".html": "text/html",
    ".css": "text/css",
    ".png": "image/png"
}

And we load that file as a local variable in Terraform and use it when looking up the content type:

locals {
  website_files = fileset(var.website_root, "**")

  mime_types = jsondecode(file("mime.json"))
}

resource "aws_s3_bucket_object" "file" {
  for_each = local.website_files

  bucket       = aws_s3_bucket.my_static_website.id
  key          = each.key
  source       = "${var.website_root}/${each.key}"
  source_hash  = filemd5("${var.website_root}/${each.key}")
  acl          = "public-read"
  content_type = lookup(local.mime_types, regex("\\.[^.]+$", each.key), null)
}

This mapping-based approach has the advantages of being simple and more cross-platform than shelling out to CLI tools. The downside is that you need to make sure all filetypes you're using exist in the extension-to-MIME mapping and are correct.
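Since the downside is missing or wrong entries, it can help to check the mapping against your file list before running Terraform. The following Python sketch mirrors the lookup (same regular expression, same three example extensions) and reports keys that would end up without a content type. The function names are hypothetical, and one difference is worth knowing: Terraform's regex function raises an error when nothing matches, whereas this sketch returns None.

```python
import re
from typing import Optional

# Same idea as mime.json: extension (with leading dot) -> MIME type.
MIME_TYPES = {
    ".html": "text/html",
    ".css": "text/css",
    ".png": "image/png",
}


def content_type_for(key: str) -> Optional[str]:
    """Look up the MIME type for an object key by its final extension."""
    match = re.search(r"\.[^.]+$", key)
    if match is None:
        return None  # the key has no extension at all
    return MIME_TYPES.get(match.group(0))


def unmapped(keys: list[str]) -> list[str]:
    """Keys that the mapping does not cover."""
    return [key for key in keys if content_type_for(key) is None]


print(content_type_for("assets/normalize.css"))               # text/css
print(unmapped(["index.html", "assets/logo.svg", "LICENSE"]))  # ['assets/logo.svg', 'LICENSE']
```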

Fixing a Stale CloudFront Cache

Now we have a working static website that we can visit in our browser! If you don't care about SSL or caching for some reason, you could stop here. But, I would argue that an important part of modern websites is making them secure and fast, so you'll likely want to put a CloudFront distribution in front of your S3 bucket. There are many other tutorials (such as all the ones linked at the top of this article) that cover CloudFront, so I won't dig into the details of that. However, I do want to dig into a problem that you run into when serving a static website via CloudFront: a stale cache.

By default, CloudFront applies a TTL of 86400 seconds (1 day), meaning CloudFront will fetch website files from your S3 bucket and serve the same files to visitors for a full day before re-fetching from S3. If you update website content (e.g. change CSS styles or javascript behavior) in S3, visitors may continue receiving cached versions from CloudFront and won't see your updates for up to a whole day! We'd prefer visitors to see the latest version of all website content, but we'd also like CloudFront to cache files as long as possible, so files can be served faster (directly from cache).

Cache Busting

One solution is cache-busting, which involves adding a hash (or "fingerprint") to non-HTML files' names. If the files' content changes, then the hash changes, so the browser downloads a completely different file (which can be cached forever).

I tried to implement this with Terraform, but uh... Terraform isn't meant for this sort of thing. Between the Terraform filemd5 and regex functions, you can get close, but I hit a wall when trying to replace filenames with their hashed versions in all the files that reference them. This could maybe work if you used template variables (e.g. <link href="${main.css}"> instead of <link href="main.css">), but then you can no longer browse your website via the filesystem or a local server. Alas, here dies my ill-advised dream of making a Terraform-based static-site generator/bundler.
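For comparison, the two steps of cache-busting (fingerprint the filename, then rewrite references to it) are short in a general-purpose language. This is only a sketch under assumptions: the name.hash.ext naming scheme, the 8-character MD5 prefix, and the naive string replacement are all illustrative choices, not something a real bundler would necessarily do.

```python
import hashlib
from pathlib import PurePosixPath


def fingerprinted_name(key: str, content: bytes, length: int = 8) -> str:
    """Insert a short content hash before the extension: main.css -> main.<hash>.css"""
    digest = hashlib.md5(content).hexdigest()[:length]
    path = PurePosixPath(key)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))


def rewrite_references(html: str, renames: dict[str, str]) -> str:
    """Naively replace each original asset path with its fingerprinted path."""
    for original, hashed in renames.items():
        html = html.replace(original, hashed)
    return html


css_v1 = b"body { margin: 0; }"
css_v2 = b"body { margin: 1em; }"
name_v1 = fingerprinted_name("assets/main.css", css_v1)
name_v2 = fingerprinted_name("assets/main.css", css_v2)

page = '<link rel="stylesheet" href="assets/main.css">'
rewritten = rewrite_references(page, {"assets/main.css": name_v1})
print(rewritten)           # the href now points at the fingerprinted file
print(name_v1 != name_v2)  # True: new content yields a new filename
```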

Fun fact: the melting face emoji was recently approved!

Cache Invalidation

The other solution to a stale CloudFront cache is invalidating files. This approach does not fit into Terraform's declarative paradigm — there are no resources for invalidations in the AWS provider and no third-party modules either. So, it requires more hacky-ness, in the form of a null_resource that triggers based on changes in file hashes and shells out to the AWS CLI to create a new invalidation. That approach might look something like the below:

locals {
  website_files = fileset(var.website_root, "**")

  file_hashes = {
    for filename in local.website_files :
    filename => filemd5("${var.website_root}/${filename}")
  }
}

resource "null_resource" "invalidate_cache" {
  triggers = local.file_hashes

  provisioner "local-exec" {
    command = "aws --profile=aws_admin cloudfront create-invalidation --distribution-id=${aws_cloudfront_distribution.my_distribution.id} --paths=/*"
  }
}

The null_resource resource comes from the null provider, which we haven't used until now, so you'll need to run terraform init again to download it.
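The trigger logic boils down to "run the command whenever the map of file hashes differs from last time". A small Python sketch of that decision follows; the function names and the distribution ID are placeholders, and the command is only built, not executed.

```python
def needs_invalidation(previous: dict[str, str], current: dict[str, str]) -> bool:
    """True when any file was added, removed, or changed since the last run."""
    return previous != current


def invalidation_command(distribution_id: str, profile: str = "aws_admin") -> list[str]:
    """Build the AWS CLI call as an argument list, so no shell expands the /* path."""
    return [
        "aws", "--profile", profile,
        "cloudfront", "create-invalidation",
        "--distribution-id", distribution_id,
        "--paths", "/*",
    ]


old = {"index.html": "aaa", "assets/main.css": "bbb"}
new = {"index.html": "aaa", "assets/main.css": "ccc"}

if needs_invalidation(old, new):
    # EXAMPLEDISTID is a placeholder, not a real distribution ID.
    print(" ".join(invalidation_command("EXAMPLEDISTID")))
```

Passing the list to subprocess.run would avoid shell quoting concerns around the /* path.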

What About Browser Caching?

We've talked about CloudFront caching, but there's another cache in between your content and your visitor: the browser. The browser cache and the Cache-Control header are a big topic all on their own; Harry Roberts's Cache-Control for Civilians is a great resource if you want to learn more.

For the purpose of this article, it's important to note that you shouldn't set an aggressive cache control header (e.g. Cache-Control: public, max-age=604800, immutable) on your website files without fingerprinting them. Otherwise, visitors' browsers will keep serving a file from their local cache for the max-age duration (one week, in the above example) before they send a request to CloudFront to check if the file is stale. CloudFront invalidations force CloudFront to fetch fresh content, but have no impact on the caching of visitors' browsers.
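As a sketch of that rule, the function below picks a Cache-Control value per object key: the aggressive header only for fingerprinted files, and a revalidate-every-time header for everything else. Two assumptions are baked in and are not from the article's code: fingerprinted files are recognized by an 8-character hex hash before the extension, and "no-cache" is used as the conservative default.

```python
import re

# Assumed naming convention: an 8-character hex hash before the extension,
# e.g. assets/main.3f2a9c1b.css (hypothetical, for illustration only).
FINGERPRINT = re.compile(r"\.[0-9a-f]{8}\.[^./]+$")


def cache_control_for(key: str) -> str:
    """Choose a Cache-Control header value based on whether the key is fingerprinted."""
    if FINGERPRINT.search(key):
        # Content can never change under this name, so cache it for a week.
        return "public, max-age=604800, immutable"
    # Browsers may store the file but must revalidate before reusing it.
    return "no-cache"


for key in ["index.html", "assets/main.css", "assets/main.3f2a9c1b.css"]:
    print(key, "->", cache_control_for(key))
```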

That's all for this adventure — thanks for joining me in pushing Terraform out of its comfort zone! If you have any suggestions or corrections, please let me know or send us a tweet, and if you’re curious to learn more about how we improve perception sensors, visit us at Tangram Vision.

Creating PostgreSQL Test Data with SQL, PL/pgSQL, and Python

Greg Schafer — Fri, 30 Apr 2021 21:18:30 +0000

After exploring various ways to load test data into PostgreSQL for my last blog post, I wanted to dive into different approaches for generating test data for PostgreSQL. Generating test data, rather than using static manually-created data, can be valuable for a few reasons:

Writing the logic for generating test data forces you to take a second look at your data model and consider what values are allowed and which values are edge cases.
Tools for generating test data make it easier to set up data per test. I would argue this is better than the alternatives of (a) hand-creating data per test or (b) trying to maintain a single dataset that is used across the entire test suite. The first option is tedious, and the second option can be brittle. As an example, if you're testing an e-commerce website and your test suite uses hard-coded product details and deactivating the product in your test dataset causes many tests to unexpectedly fail, then those tests were reliant on a pre-condition that happened to be satisfied in your test dataset. Generating data per test can make such pre-conditions more explicit and clear, especially for colleagues who inherit your tests and test data in the future.
Unless you already have a large dataset from a production environment or a partner company that you can use (hopefully after anonymization!), generating test data is the only way to get large datasets for benchmarking and load testing.

Similar to the previous article, if you're using an Object-Relational Mapping (ORM) library, then you'll probably create and persist objects into the database using the ORM or use the ORM to dump and restore test data fixtures using JSON or CSV. If you're not using an ORM, the approaches in this article may provide some learning or inspiration for how you can best generate data for your particular testing situation.

Follow Along with Docker

Similar to the previous article, you can follow along using Docker and the scripts in a subfolder of our Tangram Vision blog repo: https://gitlab.com/tangram-vision-oss/tangram-visions-blog/-/tree/main/2021.04.30_GeneratingTestDataInPostgreSQL

Unlike the previous article, I've provided a Dockerfile to add Python into the Postgres Docker image so we can run Python inside the PostgreSQL database. As described in the repo's README, you can build the Docker image and run examples with:

docker build . --tag=postgres-test-data-blogpost

# The base postgres image requires a password to be set, but we'll just be
# testing locally, so no need to set a strong password.
docker run --name=postgres --rm --env=POSTGRES_PASSWORD=foo \
    --volume=$(pwd)/schema.sql:/docker-entrypoint-initdb.d/schema.sql \
    --volume=$(pwd):/repo \
    postgres-test-data-blogpost -c log_statement=all

The repo contains a variety of files that start with add-data- which demonstrate different ways of loading and generating test data. After the Postgres Docker container is running, you can run add-data- files in a new terminal window with a command like:

docker exec --workdir=/repo postgres \
    psql --host=localhost --username=postgres \
         --file=add-data-insert-random.sql

If you want to interactively poke around the database with psql, use:

docker exec --interactive --tty postgres \
    psql --host=localhost --username=postgres

Sample Schema

For example code and data, I'll use the following simple schema again:

Musical artists have a name
An artist can have many albums (one-to-many), which have a title and release date
Genres have a name
Albums can belong to many genres (many-to-many)

Sample schema relating musical artists, albums, and genres.

Generating Data

Using static datasets has advantages (you know exactly what data is in your database), but they can be tedious to maintain over time and impractical to create if you need a lot of data (e.g. for benchmarking or load testing). Generating data is an alternative approach which lets you define how data should look in one place and then generate and use as much data as you like.

There are a few different tools for generating test data that are worth exploring, from plain ol' SQL to higher-level programming languages like Python.

SQL

If you're like me, you may have started this article not expecting SQL to be capable of generating test data. With [generate_series](https://www.postgresql.org/docs/current/functions-srf.html) and [random](https://www.postgresql.org/docs/current/functions-math.html#FUNCTIONS-MATH-RANDOM-TABLE) and a little creativity, however, SQL is well-equipped to generate a variety of data.

To create 5 artists with 8 random hex characters for their names, you can do the following:

INSERT INTO artists (name)
SELECT substr(md5(random()::text), 1, 8) FROM generate_series(1, 5) as _g;
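If the md5-of-random trick looks opaque, the same idea reads as follows in Python: turn a random float into text, hash it, and keep the first 8 hex characters. This is only an illustration of the technique (the helper name is made up), not something the SQL needs.

```python
import hashlib
import random


def random_hex_name(length: int = 8) -> str:
    """Mimic substr(md5(random()::text), 1, 8): hash a random float, keep a prefix."""
    return hashlib.md5(str(random.random()).encode()).hexdigest()[:length]


# One name per row, like SELECT ... FROM generate_series(1, 5)
artist_names = [random_hex_name() for _ in range(5)]
print(artist_names)
```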

If you want to use random words instead of random hex characters, you can pick words from the system dictionary. I've copied Ubuntu's american-english word list to /usr/share/dict/words in the Docker image, so we just need to load it and pick a word randomly:

-- Temporary tables are only accessible to the current psql session and are
-- dropped at the end of the session.
CREATE TEMPORARY TABLE words (word TEXT);

-- The WHERE clause excludes possessive words (almost 30k of them!)
COPY words (word) FROM '/usr/share/dict/words' WHERE word NOT LIKE '%''%';

-- Randomly order the table and pick the first result
SELECT * FROM words ORDER BY random() LIMIT 1;

No joke, the first word that the above query returned for me was "bravo". I don't know whether to be encouraged or creeped out.

On a separate note, the dictionary contains words that may be offensive and inappropriate in some settings. If you're pulling test data from the dictionary and don't want these words to pop up in your next demo to customers/bosses, make sure to take appropriate precautions!

Anyway, moving on... using these tools (and a few more), we can generate interesting test data for all of our tables. Comments in the code below explain extra functions and techniques being used.

-- Excerpt from add-data-insert-random.sql in the sample code repo

-- Use 8 random hex chars as the genre name.
INSERT INTO genres (name)
SELECT substr(md5(random()::text), 1, 8) FROM generate_series(1, 5) AS _g;

INSERT INTO artists (name)
SELECT
  -- Pick one random word as the artist name.
  (SELECT * FROM words ORDER BY random() LIMIT 1)
FROM generate_series(1, 4) AS _g;

INSERT INTO albums (artist_id, title, released)
SELECT
  -- Select a random artist from the artists table.
  -- NOTE: random() is only evaluated once in this subquery unless it depends on
  -- the outer query, hence the "_g*0" after random().
  (SELECT id FROM artists ORDER BY random()+_g*0 LIMIT 1),

  -- Select the first 1-3 rows after randomly sorting the word list, then join
  -- them with spaces between each word and capitalize the first letter of each
  -- word.
  initcap(array_to_string(array(
    SELECT * FROM words ORDER BY random()+_g*0 LIMIT ceil(random() * 3)
  ), ' ')),

  -- Subtract between 0-5 years from today as the album release date.
  (now() - '5 years'::interval * random())::date
FROM generate_series(1, 8) AS _g;

-- Assign a random album a random genre. Repeat 10 times.
INSERT INTO album_genres (album_id, genre_id)
SELECT
  (SELECT id FROM albums ORDER BY random()+_g*0 LIMIT 1),
  (SELECT id FROM genres ORDER BY random()+_g*0 LIMIT 1)
FROM generate_series(1, 10) AS _g
-- If we insert a row that already exists, do nothing (don't raise an error)
ON CONFLICT DO NOTHING;
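The ON CONFLICT DO NOTHING clause matters because picking random pairs will often pick the same pair twice. Assuming album_genres enforces uniqueness on (album_id, genre_id), which the schema excerpt here doesn't show, the effect is the same as adding pairs to a set, as in this small Python sketch with made-up IDs:

```python
import random

album_ids = [1, 2, 3]   # hypothetical IDs standing in for rows in albums
genre_ids = [1, 2]      # hypothetical IDs standing in for rows in genres
attempts = 10

rng = random.Random(0)  # seeded only so repeated runs behave the same
album_genres: set[tuple[int, int]] = set()
for _ in range(attempts):
    pair = (rng.choice(album_ids), rng.choice(genre_ids))
    # Adding an existing pair is a no-op, like ON CONFLICT DO NOTHING.
    album_genres.add(pair)

# Only 3 * 2 = 6 distinct pairs exist, so at least 4 attempts were duplicates.
print(len(album_genres), "unique pairs from", attempts, "attempts")
```

So the number of rows actually inserted can be smaller than the number of generated rows.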

But that's not all! We can define functions in SQL to reuse logic — if we want genres, artist names, and album titles to all be random words, then we can move random-word-picking into a function and use it in many places:

-- Excerpt from add-data-insert-random-function.sql in the sample code repo
CREATE OR REPLACE FUNCTION generate_random_title(num_words int default 1) RETURNS text AS $$
  SELECT initcap(array_to_string(array(
    SELECT * FROM words ORDER BY random() LIMIT num_words
  ), ' '))
$$ LANGUAGE sql;

INSERT INTO genres (name)
SELECT generate_random_title()
FROM generate_series(1, 5) AS _g;

INSERT INTO artists (name)
-- Generate 1-2 random words as the artist name.
SELECT generate_random_title(ceil(random() * 2 + _g * 0)::int)
FROM generate_series(1, 4) AS _g;

-- ...

PL/pgSQL

If the declarative style of SQL is awkward/difficult, we can turn to PL/pgSQL to generate test data in PostgreSQL using a more procedural/imperative programming style. PL/pgSQL provides familiar programming concepts like variables, conditionals, loops, return statements, and exception handling.

To demonstrate some of what PL/pgSQL can do, let's specify some more requirements for our generated data — roughly half of our artists should have names starting with "DJ" and all albums by DJ artists should belong to an "Electronic" genre. That implementation might look like:

-- Excerpt from add-data-plpgsql-insert.sql in the sample code repo
DO $$
DECLARE
  -- Declare (and optionally assign) variables used in the below code block.
  genre_options text[] := array['Hip Hop', 'Jazz', 'Rock', 'Electronic'];
  artist_name text;
  dj_album RECORD;
BEGIN
  -- Convert each array option into a row and insert them into genres table.
  INSERT INTO genres (name) SELECT unnest(genre_options);

  FOR i IN 1..8 LOOP
    SELECT generate_random_title(ceil(random() * 2)::int) INTO artist_name;
    -- About 50% of the time, add 'DJ ' to the front of the artist's name.
    IF random() > 0.5 THEN
      artist_name = 'DJ ' || artist_name;
    END IF;
    INSERT INTO artists (name)
    SELECT artist_name;
  END LOOP;

  -- ...

  -- Ensure all albums by a 'DJ' artist belong to the Electronic genre.
  FOR dj_album IN
    SELECT albums.* FROM albums
    INNER JOIN artists ON albums.artist_id = artists.id
    WHERE artists.name LIKE 'DJ %'
  LOOP
    RAISE NOTICE 'Ensuring DJ album % belongs to Electronic genre!', quote_literal(dj_album.title);
    INSERT INTO album_genres (album_id, genre_id)
    SELECT dj_album.id, (SELECT id FROM genres WHERE name = 'Electronic')
    -- If we insert a row that already exists, do nothing (don't raise an error)
    ON CONFLICT DO NOTHING;
  END LOOP;
END;
$$ LANGUAGE plpgsql;

As you can see in the above code snippet, PL/pgSQL lets us:

Test conditions with IF statements (which can have ELSIF and ELSE blocks or alternately be represented with CASE statements),
Loop over a range of integers with FOR i IN 1..8 LOOP (which can loop in reverse or with a step),
Loop over rows from a query, as in the FOR dj_album IN ... example above,
Print helpful log statements with RAISE,
and do all the above in a performant way, because the client can send the whole code block to the server to execute, rather than serializing and sending each statement to the server one at a time as it would with raw SQL.

There's much more to learn about PL/pgSQL than I can cover here in a reasonable amount of space, but hopefully the above provides some insight into its capabilities to help you decide what tool makes sense for you!

Using Python

PL/pgSQL isn't the only procedural language available with PostgreSQL; it also supports Python! The Python procedural language, plpython3u for Python 3, is "untrusted" (hence the u at the end of the name), meaning you must be a superuser to create functions, and Python code can access and do anything that a superuser could. Luckily, we're generating test data in non-production environments, so Python is an acceptable option despite these security concerns.

To use plpython3u, we need to install python3 and postgresql-plpython3-$PG_MAJOR system packages and create the extension in the SQL script with the command below. I've already taken these steps for the Docker image and plpython script in the sample code repo.

CREATE EXTENSION IF NOT EXISTS plpython3u;

The main difference to be aware of when using Python in PostgreSQL is that all database access happens via the plpy module that is automatically imported in plpython3u blocks. The following example should help clarify some basics of using plpython3u and the plpy module:

-- Excerpt from add-data-plpython-intro.sql in the sample code repo
DO $$
    print("Print statements don't appear anywhere!")

    # Manually convert value to string, quote it, and interpolate
    artist_name = plpy.quote_nullable("DJ Okawari")
    returned = plpy.execute(f"INSERT INTO artists (name) VALUES ({artist_name})")
    plpy.info(returned)  # Outputs the next line
    # INFO:  <PLyResult status=7 nrows=1 rows=[]>

    # Let PostgreSQL parameterize the query
    artist_name = "Ella Fitzgerald"
    plan = plpy.prepare("INSERT INTO artists (name) VALUES ($1) RETURNING *", ["text"])
    returned = plan.execute([artist_name])
    plpy.info(returned)  # Outputs the next line
    # INFO:  <PLyResult status=11 nrows=1 rows=[{'artist_id': 2, 'name': 'Ella Fitzgerald'}]>

    returned = plpy.execute("SELECT * FROM artists")
    plpy.info(returned)  # Outputs the next line
    # INFO:  <PLyResult status=5 nrows=2 rows=[{'artist_id': 1, 'name': 'DJ Okawari'}, {'artist_id': 2, 'name': 'Ella Fitzgerald'}]>
$$ LANGUAGE plpython3u;

Here are the most important insights from the above code:

You can't print debugging information with Python's print function; instead, use the logging methods available in the plpy module (such as info, warning, error).
The [plpy.execute function](https://www.postgresql.org/docs/12/plpython-database.html) can execute a simple string as a query. If you're interpolating variables into the query, you are responsible for converting the variable value into a string and properly quoting it.
Alternately, use plan = plpy.prepare then plan.execute to prepare and execute a query, which allows you to leave data conversion and quoting up to PostgreSQL. As a bonus, you can save plans so the database only has to parse the query string and formulate an execution plan once.
The return value of plpy.execute can tell you the status of the query, how many rows were inserted or returned, and the rows themselves.
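The plpy module only exists inside PostgreSQL, so the difference between interpolating and parameterizing can't be tried in a plain Python shell. As a stand-in, the sketch below uses the standard-library sqlite3 module, which is a different database and API (note the ? placeholders instead of $1), to show the same principle: pass values separately and let the database handle quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artists (artist_id INTEGER PRIMARY KEY, name TEXT)")

# A name containing a quote would break naive string interpolation.
tricky_name = "Guns N' Roses"

# The query text stays fixed; only the parameter values change per call.
insert_sql = "INSERT INTO artists (name) VALUES (?)"
for name in ["Ella Fitzgerald", tricky_name]:
    conn.execute(insert_sql, (name,))

rows = conn.execute("SELECT artist_id, name FROM artists ORDER BY artist_id").fetchall()
print(rows)  # [(1, 'Ella Fitzgerald'), (2, "Guns N' Roses")]
```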

Now that we have an understanding of how to use Python in PostgreSQL, let's apply it to generating test data for our sample schema. While we could translate the previous section's PL/pgSQL code to Python with very few changes, doing so wouldn't capitalize on the biggest advantage of using Python — the plethora of standard and third-party libraries available.

The Faker Package

Faker is a Python package that provides many helpers for generating fake data. You can generate realistic-looking first and last names, addresses, emails, URLs, job titles, company names, and much more. Faker also supports generating random words and sentences, and generating random data across many different data types (numbers, strings, dates, JSON, and more). Using Faker is straightforward:

-- Excerpt from add-data-plpython-faker.sql in the sample code repo
DO $$
    from random import randint, choice
    from faker import Faker

    fake = Faker()

    # Prepare the plan once and reuse it for every insert
    artist_plan = plpy.prepare("INSERT INTO artists (name) VALUES ($1)", ["text"])
    for _ in range(6):
        artist_plan.execute([fake.name()])

    # Alternately, we could add "RETURNING artist_id" to the above query and
    # save those values to avoid making this extra query for all artist_ids
    artist_ids = [row["artist_id"] for row in plpy.execute("SELECT artist_id FROM artists")]
    for _ in range(10):
        title = " ".join(word.title() for word in fake.words(nb=randint(1, 3)))
        plan = plpy.prepare(
            "INSERT INTO albums (artist_id, title, released) VALUES ($1, $2, $3)",
            ["int", "text", "date"],
        )
        plan.execute([choice(artist_ids), title, fake.date()])

    # ...
$$ LANGUAGE plpython3u;

The dataclasses Module

If you prefer to create Python objects to represent rows from your different tables, you could use a variety of different packages, such as attrs, factory_boy, or the built-in module dataclasses. These packages allow you to declare a field per table column and associate data types and factories for generating test data.

Please note that if you go very far down this path of representing rows as Python objects, you will find yourself re-creating a lot of ORM functionality. In that case, you should probably just use an ORM!

Here's an example of how you could use the dataclasses module to generate test data for our sample schema:

-- Excerpt from add-data-plpython-dataclasses.sql in the sample code repo
DO $$
    from dataclasses import dataclass, field
    import datetime
    from random import randint, choice
    from typing import List, Any, Type, TypeVar

    from faker import Faker

    T = TypeVar("T", bound="DataGeneratorBase")
    fake = Faker()

    # This is a useful base class for tracking instances so we can use them in
    # relationships (picking a random artist or genre to foreign key to).
    class DataGeneratorBase:
        def __new__(cls: Type[T], *args: Any, **kwargs: Any) -> T:
            "Track class instances in a list on the class"
            # object.__new__ rejects extra arguments, so pass only cls
            instance = super().__new__(cls)
            if "instances" not in cls.__dict__:
                cls.instances = []
            cls.instances.append(instance)
            return instance

    @dataclass
    class Genre(DataGeneratorBase):
        genre_id: int = field(init=False)
        name: str = field(default_factory=fake.street_name)

    @dataclass
    class Artist(DataGeneratorBase):
        artist_id: int = field(init=False)
        name: str = field(default_factory=fake.name)

    @dataclass
    class Album(DataGeneratorBase):
        album_id: int = field(init=False)
        artist: Artist = field(default_factory=lambda: choice(Artist.instances))
        title: str = field(
            default_factory=lambda: " ".join(
                word.title() for word in fake.words(nb=randint(1, 3))
            )
        )
        released: datetime.date = field(default_factory=fake.date)
        genres: List[Genre] = field(
            # Use Faker to pick a list of genres to avoid duplicates
            default_factory=lambda: fake.random_elements(Genre.instances, length=randint(0, 3), unique=True)
        )

    for _ in range(6):
        g = Genre()
        # "RETURNING genre_id" lets us get the database-generated ID and store
        # it on the Python object for later reference without needing to issue
        # additional queries.
        plan = plpy.prepare(
            "INSERT INTO genres (name) VALUES ($1) RETURNING genre_id", ["text"]
        )
        g.genre_id = plan.execute([g.name])[0]["genre_id"]
    for _ in range(6):
        artist = Artist()
        plan = plpy.prepare(
            "INSERT INTO artists (name) VALUES ($1) RETURNING artist_id", ["text"]
        )
        artist.artist_id = plan.execute([artist.name])[0]["artist_id"]
    for _ in range(8):
        album = Album()
        plan = plpy.prepare(
            "INSERT INTO albums (artist_id, title, released) VALUES ($1, $2, $3) RETURNING album_id",
            ["int", "text", "date"],
        )
        album.album_id = plan.execute(
            [album.artist.artist_id, album.title, album.released]
        )[0]["album_id"]

        # Insert album_genres rows
        for g in album.genres:
            plan = plpy.prepare(
                "INSERT INTO album_genres (album_id, genre_id) VALUES ($1, $2)",
                ["int", "int"],
            )
            plan.execute([album.album_id, g.genre_id])
$$ LANGUAGE plpython3u;

The above snippet defines classes for each main table in our example schema: Genre, Artist, and Album. Then, it defines fields for each column along with a default_factory function that tells Python (or the Faker package, in many cases) how to generate suitable test data. I made the Album class the "owner" of the many-to-many relationship with Genres, so when an Album is created, it automatically picks 0-3 existing Genres to associate itself with during initialization.

The second half of the code passes the Python objects into SQL INSERT queries, returning the primary key IDs (which weren't generated during object creation, due to the init=False field argument) so they can be saved on the objects and used later when setting foreign keys. This highlights a difficulty with doing this sort of object-relational mapping yourself — you have to figure out dependencies between your types of data and enforce an ordering (in Python and SQL) so that you have database-created IDs at the right times. This can be a bit tedious and messy, especially if you have circular dependencies or self-referencing relationships in your tables.
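
If the ordering gets hard to track by hand, one option is to write down which tables reference which and let Python work out an insert order. This is only a sketch: the depends_on mapping below is hand-written by me for the sample schema (it is not read from the database), and graphlib requires Python 3.9 or newer.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hand-written for the sample schema: each table maps to the tables whose
# IDs it references through foreign keys.
depends_on = {
    "genres": set(),
    "artists": set(),
    "albums": {"artists"},
    "album_genres": {"albums", "genres"},
}

# static_order() yields every table after all of the tables it depends on.
insert_order = list(TopologicalSorter(depends_on).static_order())
print(insert_order)
```

Circular dependencies make static_order() raise graphlib.CycleError, which at least flags the problem early. You would still need to break the cycle yourself, for example by inserting a nullable foreign key as NULL and updating it afterwards.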

Importing External .py Files

If your data model or data-generation code starts to get complex, it can be annoying to have a lot of Python code in SQL files: your IDE won't lint, type-check, or auto-format Python that is embedded in SQL! Luckily, you can keep your Python code in external .py files that you import and execute from inside a plpython3u block, using the technique shown below:


-- Excerpt from add-data-plpython-external-pyfile.sql in the sample code repo
DO $$
    import importlib.util

    # The second argument is the filepath on the server (inside the container)
    spec = importlib.util.spec_from_file_location("add_test_data", "/repo/add_test_data.py")
    add_test_data = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(add_test_data)
    add_test_data.main(plpy)
$$ LANGUAGE plpython3u;

The add_test_data.py file can look exactly the same as the body of the plpython3u block from the previous example, but you'll need to wrap the bottom half (which uses plpy to run queries) in a function that accepts plpy as an argument, so it looks like:

# Excerpt from add_test_data.py in the sample code repo

# ...
def main(plpy: Any) -> None:
    for _ in range(6):
        g = Genre()
    # ...
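
A side benefit of passing plpy in as an argument is that you can call main outside of PostgreSQL with a stand-in object, for example in a quick unit test. The sketch below is illustrative only: FakePlpy and FakePlan are hypothetical test doubles I made up, and main here is a simplified stand-in for the real function, which inserts more than genres.

```python
from typing import Any, Dict, List


class FakePlan:
    """Stand-in for a prepared plan; records calls instead of running SQL."""

    def __init__(self, owner: "FakePlpy", query: str) -> None:
        self.owner = owner
        self.query = query

    def execute(self, args: List[Any]) -> List[Dict[str, int]]:
        self.owner.calls.append((self.query, args))
        # Pretend the database generated the next sequential ID.
        return [{"genre_id": len(self.owner.calls)}]


class FakePlpy:
    """Hypothetical test double exposing only the prepare() method used below."""

    def __init__(self) -> None:
        self.calls: List[Any] = []

    def prepare(self, query: str, argtypes: List[str]) -> FakePlan:
        return FakePlan(self, query)


def main(plpy: Any) -> None:
    # Simplified stand-in for add_test_data.main: insert three genres.
    for name in ["Jazz", "Funk", "Soul"]:
        plan = plpy.prepare(
            "INSERT INTO genres (name) VALUES ($1) RETURNING genre_id", ["text"]
        )
        plan.execute([name])


fake_plpy = FakePlpy()
main(fake_plpy)
print(len(fake_plpy.calls))
```

This only checks that the expected queries were issued. It doesn't replace running the code against a real database, where constraints and types are actually enforced.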

Other (Trusted) Ways to Use Python

I want to briefly touch on two ways of using Python outside of PostgreSQL — running Python externally may be preferable if you want or need to avoid the untrusted nature of plpython3u. These approaches let you maintain your Python code completely independent of the database, which may be beneficial for reusability and maintainability.

You could use Python scripts to generate test data into CSV files and then load those into PostgreSQL with the COPY command. With this approach, however, you will likely end up with a multi-step process to generate and load test data. If you invoke a Python script (which outputs CSV) within the SQL COPY command, then you can't populate multiple tables with a single command. If you use multiple SQL COPY commands, each invoking its own script, it becomes convoluted to keep foreign-key IDs consistent between the separate Python script executions. The remaining reasonable approach is a multi-step one: run a Python script that saves multiple CSV files to disk (one per database table) and then run an SQL COPY command per CSV file to load the data.
You could run Python scripts that connect to PostgreSQL via a client library such as psycopg2. The psycopg2 package is used by many ORMs, such as the Django ORM and SQLAlchemy, but it doesn't impose any restrictions on how you handle your data — it just provides a Python interface for connecting to PostgreSQL, sending SQL commands, and receiving results.
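
The multi-step CSV approach from the first option above can be sketched as follows. Everything here is an assumption for illustration: the placeholder names, the fixed release date, and the choice to write into a temporary directory. The key idea is that IDs are assigned in Python, so foreign keys stay consistent across files.

```python
import csv
import random
import tempfile
from pathlib import Path

random.seed(0)  # fixed seed so repeated runs produce the same files

# IDs are assigned in Python so albums can reference artists without
# asking the database for generated keys.
artists = [(i, f"Artist {i}") for i in range(1, 4)]
albums = [
    (i, random.choice(artists)[0], f"Album {i}", "2000-01-01")
    for i in range(1, 6)
]


def write_csv(path: Path, header, rows) -> None:
    """Write a header row followed by the data rows."""
    with path.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


out_dir = Path(tempfile.mkdtemp())
write_csv(out_dir / "artists.csv", ["artist_id", "name"], artists)
write_csv(
    out_dir / "albums.csv",
    ["album_id", "artist_id", "title", "released"],
    albums,
)
print(sorted(p.name for p in out_dir.iterdir()))
```

Each file can then be loaded with its own COPY ... CSV HEADER command, parents (artists) before children (albums).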

Thank you for joining me on this exploration of loading test data (in the previous blog post) and generating test data for PostgreSQL! We tried out a variety of approaches and got some hands-on experience with code — I hope this helps you understand how to use these different approaches, weigh their tradeoffs, and choose which approach makes the most sense for your team and project.

If you have any suggestions or corrections, please let me know or send us a tweet, and if you’re curious to learn more about how we improve perception sensors, visit us at Tangram Vision.

Loading Test Data into PostgreSQL

Greg Schafer — Wed, 28 Apr 2021 22:47:27 +0000

Most web apps/services that use a relational database are built around a web framework and an Object-Relational Mapping (ORM) library, which typically have conventions that prescribe how to create and load test fixtures/data into the database for testing. If you're building a webapp without an ORM [1], the story for how to create and load test data is less clear. What tools and approaches are available, and which work best? There are a lot of articles around the internet that describe specific techniques or example code in isolation, but few that provide a broader survey of the many different approaches that are possible. I hope this article will help fill that gap, exploring and discussing different approaches for creating and loading test data in PostgreSQL.

[1] Wait a minute, why would you build a webapp without an ORM?! This question could spawn an entire article of its own and in fact, many other articles have debated about ORMs for the last couple decades. I won't dive into that debate — it's up to the creator to decide if a project should use an ORM or not, and that decision depends on a lot of project-specific factors, such as the expertise of the creator and their team, the types and velocity of data involved, the performance and scaling requirements, and much more.

If you're interested in generating test data instead of (or in addition to) loading test data, please check out the follow-up article that explores generating test data for PostgreSQL using SQL, PL/pgSQL, and Python!

Follow Along with Docker

Want to follow along? I've collected sample data and scripts in a subfolder of our Tangram Vision blog repo: https://gitlab.com/tangram-vision-oss/tangram-visions-blog/-/tree/main/2021.04.28_LoadingTestDataIntoPostgreSQL

As described in the repo's README, you can run examples using the official Postgres Docker image with:

# The base postgres image requires a password to be set, but we'll just be
# testing locally, so no need to set a strong password.
docker run --name=postgres --rm --env=POSTGRES_PASSWORD=foo \
    --volume=$(pwd)/schema.sql:/docker-entrypoint-initdb.d/schema.sql \
    --volume=$(pwd):/repo \
    postgres:latest -c log_statement=all

To explain this Docker command a bit:

The base postgres image requires a password to be set (via the POSTGRES_PASSWORD environment variable), but we'll just be testing locally, so no need to set a strong password.
Executable scripts (*.sh and *.sql files) in the /docker-entrypoint-initdb.d folder inside the container will be executed as PostgreSQL starts up. The above command mounts schema.sql into that folder, so the database tables will be created.
The repo is also mounted to /repo inside the container, so example SQL and CSV files are accessible.
The PostgreSQL server is started with the log_statement=all config override, which increases the logging verbosity.

To run one of the example SQL files against the database inside the running container, use:

docker exec --workdir=/repo postgres \
    psql --host=localhost --username=postgres \
         --file=add-data-sql-copy-csv.sql

If you want to interactively poke around the database with psql, use:

docker exec --interactive --tty postgres \
    psql --host=localhost --username=postgres

Sample Schema

For example code and data, I'll use the following simple schema:

Musical artists have a name
An artist can have many albums (one-to-many), which have a title and release date
Genres have a name
Albums can belong to many genres (many-to-many)

Sample schema relating musical artists, albums, and genres.

Loading Static Data

The simplest way to get test data into PostgreSQL is to make a static dataset, which you can save as CSV files or embed in SQL files directly.

SQL COPY from CSV Files

In the code repo accompanying this blogpost, there are 4 small CSV files, one for each table of the sample schema. The CSV files contain headers and data rows as shown in the image below.

A small, static sample dataset of musical artists, albums, and genres.

We can import the data from these CSV files into a PostgreSQL database with the SQL COPY command:

-- Excerpt from add-data-copy-csv.sql in the sample code repo
COPY artists FROM '/repo/artists.csv' CSV HEADER;
COPY albums FROM '/repo/albums.csv' CSV HEADER;
COPY genres FROM '/repo/genres.csv' CSV HEADER;
COPY album_genres FROM '/repo/album_genres.csv' CSV HEADER;

The COPY command has a variety of options for controlling quoting, delimiters, escape characters, and more. You can even limit which rows are imported with a WHERE clause. One potential downside is that you must run it as a database superuser or as a user granted the relevant predefined role (pg_read_server_files to read a server-side file, pg_write_server_files to write one, or pg_execute_server_program to run a program). This isn't a concern when loading data for local testing, but keep it in mind if you ever want to use it in a more restrictive or production-like environment.

Psql Copy from CSV Files

The PostgreSQL interactive terminal (called psql) provides a \copy command that is very similar to SQL COPY:

-- Excerpt from add-data-copy-csv.psql in the sample code repo
\copy artists from 'artists.csv' csv header
\copy albums from 'albums.csv' csv header
\copy genres from 'genres.csv' csv header
\copy album_genres from 'album_genres.csv' csv header

There are some important differences between SQL COPY and psql copy:

Like other psql commands, the psql version of the copy command starts with a backslash (\) and doesn't need to end with a semicolon (;).
SQL COPY runs in the server environment whereas psql copy runs in the client environment. To clarify, the filepath you provide to SQL COPY should point to a file on the server's filesystem. The filepath you provide to psql copy points to a file on the filesystem where you're running the psql client. If you're following along using the Docker image and commands provided in this blogpost, the server and client are the same container, but if you ever want to load data from your local machine to a database on a remote server, then you'll want to use psql copy.
As a corollary to the above, psql copy is less performant than SQL COPY, because all the data must travel from the client to the server, rather than being directly loaded by the server.
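SQL COPY interprets a relative input filepath relative to the server process's working directory (normally the data directory), so absolute filepaths are the practical choice there. Psql copy resolves relative filepaths against the directory where you're running psql.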
Psql copy runs with the privileges of the user you're connecting to the server as, so it doesn't require superuser or local file read/write/execute permissions like SQL COPY does.

Putting Data in SQL Directly

As an alternative to storing data in separate CSV files (which are loaded with SQL or psql commands), you can store data in SQL files directly.

SQL COPY from stdin and pg_dump

The SQL COPY and psql copy commands can load data from stdin instead of a file. They will parse and load all the lines between the copy command and \. as rows of data.

-- Excerpt from add-data-copy-stdin.sql in the sample code repo
COPY public.artists (artist_id, name) FROM stdin CSV;
1,"DJ Okawari"
2,"Steely Dan"
3,"Missy Elliott"
4,"TWRP"
5,"Donald Fagen"
6,"La Luz"
7,"Ella Fitzgerald"
\.

COPY public.albums (album_id, artist_id, title, released) FROM stdin CSV;
1,1,"Mirror",2009-06-24
2,2,"Pretzel Logic",1974-02-20
3,3,"Under Construction",2002-11-12
4,4,"Return to Wherever",2019-07-11
5,5,"The Nightfly",1982-10-01
6,6,"It's Alive",2013-10-15
7,7,"Pure Ella",1994-02-15
\.

...

In fact, this COPY ... FROM stdin approach is how [pg_dump](https://www.postgresql.org/docs/current/app-pgdump.html) outputs data if you're creating a dump or backup from an existing PostgreSQL database. However, pg_dump uses a tab-separated format by default, rather than the comma-separated format shown above.
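
To make the tab-separated format concrete, here is a rough sketch that renders rows in COPY's default text format: tabs between fields, \N for NULL, and backslash escapes for a few special characters. The function names are mine, and this handles only the common escapes rather than reproducing pg_dump's output exactly.

```python
from typing import Any, Iterable, Optional

# Characters that must be backslash-escaped in COPY's text format.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}


def copy_field(value: Optional[Any]) -> str:
    """Render one value as a field in COPY's default text format."""
    if value is None:
        return "\\N"  # NULL marker in text format
    return "".join(_ESCAPES.get(ch, ch) for ch in str(value))


def copy_block(
    table: str, columns: Iterable[str], rows: Iterable[Iterable[Any]]
) -> str:
    """Build a COPY ... FROM stdin statement followed by its data lines."""
    lines = [f"COPY {table} ({', '.join(columns)}) FROM stdin;"]
    lines += ["\t".join(copy_field(v) for v in row) for row in rows]
    lines.append("\\.")  # end-of-data marker
    return "\n".join(lines) + "\n"


block = copy_block(
    "public.artists", ["artist_id", "name"], [(1, "DJ Okawari"), (2, None)]
)
print(block)
```

Note that strings are not quoted in this format, which is one visible difference from the CSV examples above.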

By default, pg_dump also outputs SQL to re-create everything about the database (tables, constraints, views, functions, reset sequences, etc.), but you can instruct it to output only data with the --data-only flag. To try out pg_dump with the example Docker image, run:

docker exec --workdir=/repo postgres \
    pg_dump --host=localhost --username=postgres postgres

SQL INSERTs

Another way to put data directly in SQL is to use INSERT statements. This approach could look like the following:

-- Excerpt from add-data-insert-static-ids.sql in the sample code repo
INSERT INTO artists (artist_id, name)
OVERRIDING SYSTEM VALUE
VALUES
  (1, 'DJ Okawari'),
  (2, 'Steely Dan'),
  (3, 'Missy Elliott'),
  (4, 'TWRP'),
  (5, 'Donald Fagen'),
  (6, 'La Luz'),
  (7, 'Ella Fitzgerald');

INSERT INTO albums (album_id, artist_id, title, released)
OVERRIDING SYSTEM VALUE
VALUES
  (1, 1, 'Mirror', '2009-06-24'),
  (2, 2, 'Pretzel Logic', '1974-02-20'),
  (3, 3, 'Under Construction', '2002-11-12'),
  (4, 4, 'Return to Wherever', '2019-07-11'),
  (5, 5, 'The Nightfly', '1982-10-01'),
  (6, 6, 'It''s Alive', '2013-10-15'),
  (7, 7, 'Pure Ella', '1994-02-15');

...

The OVERRIDING SYSTEM VALUE clause lets us INSERT values into the primary key ID columns explicitly even though they are defined as GENERATED ALWAYS.

The pg_dump command's --column-inserts option will output data as INSERT statements (a separate statement per row), rather than as the default TSV format. Using INSERTs instead of COPY will run much slower when restoring the data, so this is only recommended if you're restoring the data to a database that doesn't support COPY, such as sqlite3. Using INSERTs can be sped up somewhat with the --rows-per-insert option, allowing you to INSERT many rows at a time per command, reducing the overhead of back-and-forth communication between client and server for every SQL statement.
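
If you are writing INSERT-based fixtures by hand rather than with pg_dump, the same batching idea can be sketched in a few lines. The helper names below are hypothetical, and the quoting is deliberately simple (numbers, strings, and NULL only, assuming PostgreSQL's default standard_conforming_strings setting), so treat it as a tool for trusted static data rather than a general way to build SQL.

```python
from typing import Any, Iterator, Sequence


def sql_literal(value: Any) -> str:
    """Render a value as a SQL literal (numbers, strings, and NULL only)."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Double any single quotes, as in 'It''s Alive'.
    return "'" + str(value).replace("'", "''") + "'"


def batched_inserts(
    table: str,
    columns: Sequence[str],
    rows: Sequence[Sequence[Any]],
    rows_per_insert: int,
) -> Iterator[str]:
    """Yield INSERT statements that each carry up to rows_per_insert rows."""
    for start in range(0, len(rows), rows_per_insert):
        batch = rows[start:start + rows_per_insert]
        values = ",\n".join(
            "  (" + ", ".join(sql_literal(v) for v in row) + ")" for row in batch
        )
        yield f"INSERT INTO {table} ({', '.join(columns)})\nVALUES\n{values};"


albums = [
    (1, "Mirror", "2009-06-24"),
    (6, "It's Alive", "2013-10-15"),
    (7, "Pure Ella", "1994-02-15"),
]
statements = list(
    batched_inserts("albums", ["artist_id", "title", "released"], albums, 2)
)
print("\n\n".join(statements))
```

With three rows and two rows per statement, this yields two INSERT statements instead of three.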

Using INSERT statements, we could start moving away from statically declaring everything about our datasets — we could omit the primary key ID columns and lookup IDs as needed when inserting foreign keys, as in the following example:

-- Excerpt from add-data-insert-queried-ids.sql in the sample code repo
INSERT INTO artists (name)
VALUES
  ('DJ Okawari'),
  ('Steely Dan'),
  ('Missy Elliott'),
  ('TWRP'),
  ('Donald Fagen'),
  ('La Luz'),
  ('Ella Fitzgerald');

INSERT INTO albums (artist_id, title, released)
VALUES
  ((SELECT artist_id FROM artists WHERE name = 'DJ Okawari'), 'Mirror', '2009-06-24'),
  ((SELECT artist_id FROM artists WHERE name = 'Steely Dan'), 'Pretzel Logic', '1974-02-20'),
  ((SELECT artist_id FROM artists WHERE name = 'Missy Elliott'), 'Under Construction', '2002-11-12'),
  ((SELECT artist_id FROM artists WHERE name = 'TWRP'), 'Return to Wherever', '2019-07-11'),
  ((SELECT artist_id FROM artists WHERE name = 'Donald Fagen'), 'The Nightfly', '1982-10-01'),
  ((SELECT artist_id FROM artists WHERE name = 'La Luz'), 'It''s Alive', '2013-10-15'),
  ((SELECT artist_id FROM artists WHERE name = 'Ella Fitzgerald'), 'Pure Ella', '1994-02-15');

...

This is hardly convenient, though, because we need to duplicate other row information (such as the artist name) in order to look up the corresponding ID. It gets even more complex if multiple artists have the same name! So, if you have a static dataset, I'd suggest sticking to one of the previously mentioned approaches that use SQL COPY or psql copy.

Putting Data in CSVs vs in SQL Files

Is there a reason to prefer putting static datasets in CSVs or directly in SQL files? My thoughts boil down to the following points:

CSVs are a widely understood and supported format (just make sure to be clear and consistent with encoding!). If your datasets will be maintained or created by people who prefer spreadsheet programs to database-admin and command-line tools, CSVs may be preferable.
If you want to keep all your test data and database setup in one place, SQL files are a convenient way to do that.
If your testing or continuous integration processes use pg_dump or its output, then you're already using datasets embedded in an SQL file — keep doing what makes sense for you!

I hope you learned something new and useful about the different approaches and tools available for loading static datasets into PostgreSQL. If you're looking to learn more, check out the follow-up article about generating test data for PostgreSQL!

If you have any suggestions or corrections, please let me know or send us a tweet, and if you’re curious to learn more about how we improve perception sensors, visit us at Tangram Vision.

Cover Photo by Susan Q Yin on Unsplash

Exploring Ansible via Setting Up a WireGuard VPN

Greg Schafer — Thu, 04 Mar 2021 17:41:20 +0000

Photo by Thomas Jensen on Unsplash

In my previous blogpost, we set up a WireGuard VPN server and client and learned about various configuration options for WireGuard, how to improve VPN server uptime, how to relay traffic, and more. Setting up a server and client like that is a lot of work! If the server dies or you want to set up a new server (maybe for a friend or family member this time), you have to go back to the walk-through and follow all the steps, remembering if you deviated from those instructions at any point.

There's a better way — automation! If you're only going to do a thing once (e.g. set up a VPN), investing in automation probably doesn't make sense. But if you anticipate doing a thing repeatedly, automating it frees up your time to learn and accomplish more in the future. You can also share your automation, empowering others to build and achieve more, faster.

Automation is the heart of computing, and many different automation tools and approaches have sprung up over time. For our project of automating VPN server setup, we can consider a variety of tools:

Shell scripts
- The simplest approach from a tooling perspective, writing shell scripts would involve running the commands from the previous WireGuard tutorial blogpost, using ssh for the commands that run on the server and rsync to copy configuration files to the server.
SSH scripting libraries like Capistrano or Fabric
- If shell scripting isn't ideal, there are libraries that expose similar scripting functionality in a more ergonomic interface for developers familiar with higher-level languages like Ruby and Python.
Infrastructure/configuration automation tools like Puppet, Chef, or Ansible
- Tools in this category are even more specialized for automating server infrastructure and configuration, often including an ecosystem of packages and plugins to automatically set up or configure nearly anything you can think of.
Infrastructure-as-code tools like Terraform
- Infrastructure-as-code (IaC) tools have a lot of overlap with the above category, but support provisioning cloud resources in a more first-class/native way.
Containers like Docker
- You could also run WireGuard in containers, deploying a server-configured container image to a cloud provider and running a client-configured container image locally to connect to the server. There are a few existing examples of this approach.

For this tutorial, I'm going to focus on the middle category above — infrastructure/configuration automation tools — and specifically, I'll focus on Ansible. There is a great comparison of different tools in this area by Gruntwork and, even though that article favors Terraform, Ansible is still a useful general-purpose tool, especially if you're working with servers that aren't "in the cloud", such as a Raspberry Pi at home.

Let's get started with automating VPN setup with Ansible! By the end of this article, we'll be able to set up a VPN server and client with a single command. Similar to the previous blogpost, I'll use Ubuntu 20.04 and DigitalOcean droplets.

Setting up Ansible

Ansible can be installed via an OS package manager like apt, but I prefer to use pip so I can get the latest updates and avoid cluttering system package management with third-party PPAs (Personal Package Archives). We'll also use pyenv (as suggested by Hypermodern Python) to make sure we're not breaking or cluttering the system Python installation. Install pyenv with the following:



# From https://github.com/pyenv/pyenv/wiki#suggested-build-environment
sudo apt-get update

sudo apt-get install --no-install-recommends make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

curl https://pyenv.run | bash

It's a good habit when a tutorial gives you curl <url> | bash to open up that URL and see what it's going to do. In this case, you'll see that it'll download and execute a shell script on GitHub that will clone 6 repos from GitHub to your ~/.pyenv folder and prompt you to add a few lines to your shell's initialization script.

Follow the output prompt from above, which asks you to put lines like the below in your shell initialization script (e.g. ~/.bashrc if you use the bash shell). Make sure to fill in your own username!



export PATH="/home/YOUR_USERNAME/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

Install a recent python version:



# List available python versions
pyenv install --list

# Install a specific version
pyenv install 3.9.2

# (Suggested) If you want to always use that version when running `python`
# in your terminal
pyenv global 3.9.2

If you want, you can also create a virtualenv to further isolate the Ansible installation, and make that virtualenv automatically activate when you're in a particular folder/repo. That would look like:



# (Optional)

# Feel free to pick a different virtualenv name than "ansible-tutorial"
pyenv virtualenv 3.9.2 ansible-tutorial

# Create a .python-version file that pyenv will find when your shell is in the 
# same directory (or a sub-directory) and automatically activate the named
# virtualenv
pyenv local ansible-tutorial

Install the ansible pip package, which will install various command-line tools, including ansible-playbook, which we'll use to run a "playbook" of commands that will set up a VPN server and client for us.



pip install ansible

# Confirm installation worked
ansible --version

Get a Server

To use Ansible for a VPN server, we need... a server! Ansible could provision a server from a cloud provider for us (and I'll touch on this briefly later), but we'll keep our playbook hardware-provider-agnostic for now, so you can run it as easily against a cloud server as a Raspberry Pi on your home network. I'm going to create a $5/month DigitalOcean droplet to test against, but you could also use Vagrant (to test against a local VM) or any server you can SSH to.

Testing Ansible playbooks against VMs, rather than a bare-metal machine, comes with an advantage — after you've written the playbook, you can start a new, empty VM and test the whole playbook start to finish to ensure that it works consistently.

Connecting to the Server with Ansible

Once you have your server or VM, take note of its IP address and use it to create an inventory.ini file like the below:



[vpn]
vpn_server ansible_host=203.0.113.1 ansible_user=root

An inventory file tells Ansible what servers it can act upon and how to access them. Let's use the above inventory file as an example. When we run Ansible and target the vpn group of servers or the vpn_server host, it will try to connect to the server using a command like:



ssh root@203.0.113.1
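
To make the mapping from inventory line to SSH target explicit, here is a rough approximation in Python. This is not Ansible's actual parser: ssh_target is a name I made up, and it only understands the two variables used in the inventory above.

```python
import shlex


def ssh_target(inventory_line: str) -> str:
    """Approximate the user@host target for one INI inventory host line."""
    # The first token is the inventory name; the rest are key=value variables.
    alias, *pairs = shlex.split(inventory_line)
    host_vars = dict(pair.split("=", 1) for pair in pairs)
    # Without ansible_host, the inventory name itself is used as the address.
    host = host_vars.get("ansible_host", alias)
    user = host_vars.get("ansible_user")
    return f"{user}@{host}" if user else host


target = ssh_target("vpn_server ansible_host=203.0.113.1 ansible_user=root")
print("ssh " + target)
```

The real connection also involves options (port, key file, connection plugin) that this sketch ignores.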

So, if you can't SSH to the server, then Ansible won't be able to connect either!

Connecting to the server with an SSH key is strongly recommended! Add your SSH key to your server to connect without needing a password. If you must connect with a password, you can sudo apt install sshpass and then provide your SSH password when using Ansible by adding the --ask-pass flag to all ansible commands.

Let's test to make sure that Ansible can connect to the server:



ansible -i inventory.ini -m ping vpn

This runs the ping Ansible module, targeting the vpn group of servers. You should see "pong" in the output, meaning that Ansible could connect to the server and the server has a Python installation that Ansible can use.

Ansible's Built-in Variables and Facts

There are other useful Ansible modules that we can use with the ansible command:

The setup module fetches system information, also known as "facts", about the server. You can use these facts as variables in Ansible commands and playbooks.
The debug module can evaluate variables, which is useful for... well, debugging!

Try running both of these modules with your server so you can see what facts and information Ansible makes available:



ansible -i inventory.ini -m setup vpn
ansible -i inventory.ini -m debug -a "var=hostvars" vpn

This was one of the most confusing parts for me when learning Ansible — figuring out what all these built-in variables and facts (like groups, inventory_dir, and ansible_distribution) were and how to find them.

Writing an Ansible Playbook

The ansible command lets you run ad-hoc commands across groups of servers. This is powerful, but we probably shouldn't try to automate server setup and configuration in a single ansible command... probably. 🤔 Instead, we can organize multiple tasks in one or multiple YAML files, which we will run with the ansible-playbook command.

Let's write a playbook.yml file in the same folder as inventory.ini. Here are its contents:



---
- name: setup vpn server
  hosts: vpn_server
  tasks:
  - name: ping
    ping:
  - name: show variables and facts
    debug: var=hostvars

If you're not familiar with YAML, the above is equivalent to this JSON structure:



[{"name": "setup vpn server",
  "hosts": "vpn_server",
  "tasks": [{"name": "ping", "ping": null},
            {"name": "show variables and facts", "debug": "var=hostvars"}]}]

Breaking down the above:

The top-level structure is a "play" in Ansible lexicon. Our play above has a name, a hosts pattern which describes which servers the play will run against, and a list of tasks.
We have 2 tasks; each has a name and the name of an Ansible module that will do something.
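
If it helps to see that structure walked through programmatically, the sketch below loads the same play as JSON and pulls out the pieces named in the breakdown. It is purely illustrative and says nothing about how Ansible itself loads playbooks. It also relies on each task having exactly one key besides name, which holds for this small example but not for real tasks that carry extra keywords.

```python
import json

playbook = json.loads("""
[{"name": "setup vpn server",
  "hosts": "vpn_server",
  "tasks": [{"name": "ping", "ping": null},
            {"name": "show variables and facts", "debug": "var=hostvars"}]}]
""")

summary = []
for play in playbook:
    for task in play["tasks"]:
        # Whatever key is left besides "name" is the module to run
        # (true for this small example only).
        (module, args), = [(k, v) for k, v in task.items() if k != "name"]
        summary.append((play["hosts"], task["name"], module, args))

for row in summary:
    print(row)
```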

Run the playbook...



ansible-playbook -i inventory.ini playbook.yml

... and you'll see that it gathers facts from the server (just like the ansible -m setup command above did), and then runs the "ping" task and the "debug" task to show all the gathered facts and variables defined for vpn_server.

There are tons of built-in Ansible modules, even more curated Ansible community modules, and even more published to Ansible Galaxy (an open repository for Ansible collections and roles).

WireGuard Server Setup

There's much more to learn about Ansible! But let's stop here and apply what we've learned in order to set up a WireGuard server.

Referring to the steps we took in the previous tutorial, we want to:

Install the wireguard system package
Create public and private keys with correct permissions
Create the server's WireGuard configuration file
(Optionally) Enable IP forwarding for relaying traffic
Start the VPN

Managing the Keys

As hinted at in the previous tutorial, if we want to repeatably deploy the VPN server without needing to reconfigure all VPN clients, we need to use the same private key every time.

Put another way: if we generated a private key while deploying the server and used the corresponding public key on various clients, and the server ends up dying, we could deploy it again by generating a new private key. However, all of our VPN clients would then need to update to the new public key to be able to connect to the new VPN server. This would be inconvenient!

Instead, we'll generate the server keys once by hand and use them in the playbook so they're consistent between every deploy. This means we won't include step #2 from above in the Ansible playbook.

Generate the keys with wg genkey and wg pubkey commands. You can output both with the following command:



privkey=$(wg genkey) sh -c 'echo "
    server_privkey: $privkey
    server_pubkey: $(echo $privkey | wg pubkey)"'

Copy the output lines and add them to a new vars mapping under the play in playbook.yml. Here's what mine looks like now (your keys will be different):



---
- name: setup vpn server
  hosts: vpn_server
  vars:
    server_privkey: aBYk1JZyP8ck+FeaTjb3xi94U4Nv8V+gWoTW1hRLQlo=
    server_pubkey: 7/6f7bUT+2hWMEP5BxeK51PGuMuTnQ9pRpkxg5jUSTo=
  tasks:
  # ...
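
Since copy-paste mistakes are easy to make with keys, a quick sanity check before running the playbook can save some debugging. WireGuard keys are 32 bytes encoded as base64, so the sketch below simply checks for that. The function name is my own, and passing this check only means the format is plausible, not that the key pair matches.

```python
import base64
import binascii


def looks_like_wireguard_key(text: str) -> bool:
    """Return True if text is base64 that decodes to exactly 32 bytes."""
    try:
        raw = base64.b64decode(text, validate=True)
    except (binascii.Error, ValueError):
        return False
    return len(raw) == 32


pubkey = "7/6f7bUT+2hWMEP5BxeK51PGuMuTnQ9pRpkxg5jUSTo="
print(looks_like_wireguard_key(pubkey))
print(looks_like_wireguard_key(pubkey + "\n"))
```

The second call fails the check because of the trailing newline, which is the kind of invisible character that tends to sneak in when copying from a terminal.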

Encrypting the Private Key

It's a good practice to AVOID having secrets in plaintext (like the VPN private key above). This is especially true if those secrets will be shared with anyone else, like via a git repo. Let's prevent this by using Ansible Vault. Vault is a tool for encrypting secret values and using them in playbooks. Encrypt the private key with:



ansible-vault encrypt_string --ask-vault-password --stdin-name server_privkey

You'll be prompted twice for a Vault encryption password, after which you'll paste your privkey value and hit Ctrl+d twice. If the command completed after a single Ctrl+d, try again and make sure you're not copy-pasting an invisible newline character at the end of the privkey value. Copy the output into your playbook, which will now look like:



---
- name: setup vpn server
  hosts: vpn_server
  vars:
    server_privkey: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          646438636565343063343631326136386239623935393637336539653636386135363
          663386639393232346534643163656363316234306439306566306534610a31326664
          363763663139383034636632343230376365333130333230373866353033326563303
          5636138373830633534373033303536303566663166616539360a3936353033663263
          336662663034376661616631343661333164363134373061343739633637623739306
          465653532383838393662396333623966343165366635353132396332313762343534
          65313761623964653532623839356633343838
    server_pubkey: 7/6f7bUT+2hWMEP5BxeK51PGuMuTnQ9pRpkxg5jUSTo=
  tasks:
  # ...

Make sure to remember your encryption password (and save it in a password manager); you'll need to enter it every time you run the playbook.

Installing and Configuring WireGuard

Next, we'll remove our testing ping and debug tasks and write tasks for steps 1, 3, 4, and 5 from the above list. These steps translate neatly into Ansible tasks in our updated playbook.yml:



---
- name: setup vpn server
  hosts: vpn_server
  vars:
    server_privkey: !vault |
          $ANSIBLE_VAULT;1.1;AES256
          646438636565343063343631326136386239623935393637336539653636386135363
          663386639393232346534643163656363316234306439306566306534610a31326664
          363763663139383034636632343230376365333130333230373866353033326563303
          5636138373830633534373033303536303566663166616539360a3936353033663263
          336662663034376661616631343661333164363134373061343739633637623739306
          465653532383838393662396333623966343165366635353132396332313762343534
          65313761623964653532623839356633343838
    server_pubkey: 7/6f7bUT+2hWMEP5BxeK51PGuMuTnQ9pRpkxg5jUSTo=
  tasks:
  # https://docs.ansible.com/ansible/latest/collections/ansible/builtin/apt_module.html
  - name: install wireguard package
    apt:
      name: wireguard
      state: present
      update_cache: yes

  # https://docs.ansible.com/ansible/latest/collections/ansible/builtin/template_module.html
  - name: create server wireguard config
    template:
      dest: /etc/wireguard/wg0.conf
      src: server_wg0.conf.j2
      owner: root
      group: root
      mode: '0600'

  # https://docs.ansible.com/ansible/latest/collections/ansible/posix/sysctl_module.html
  - name: enable and persist ip forwarding
    sysctl:
      name: net.ipv4.ip_forward
      value: "1"
      state: present
      sysctl_set: yes
      reload: yes

  # https://docs.ansible.com/ansible/latest/collections/ansible/builtin/systemd_module.html
  - name: start wireguard and enable on boot
    systemd:
      name: wg-quick@wg0
      enabled: yes
      state: started

Ok ok, yes, this is a bit like drawing an owl.

Source: https://29.media.tumblr.com/tumblr_l7iwzq98rU1qa1c9eo1_500.jpg

...but usually an Ansible playbook like the above can be written quickly. I follow a cycle:

Type "ansible module install package" into a search engine
Open the docs.ansible.com result that looks most helpful
Read through available parameters and the (often helpful) examples at the bottom
Copy an example into my playbook and modify parameters as needed
Go back to step 1, searching for the next task (e.g. "ansible module template file")

I've included a comment line linking to the Ansible docs page for each module used in the playbook.yml above, in case you want to read about the parameters.

Testing our First Attempt

Let's test our playbook.



$ ansible-playbook -i inventory.ini --ask-vault-password playbook.yml
Vault password: 

PLAY [setup vpn server] ********************************************************

TASK [Gathering Facts] *********************************************************
ok: [vpn_server]

TASK [install wireguard package] ***********************************************
changed: [vpn_server]

TASK [create server wireguard config] ******************************************
fatal: [vpn_server]: FAILED! => {"changed": false, "msg": "Could not find or access 'server_wg0.conf.j2'\nSearched in: ..."}

PLAY RECAP *********************************************************************
vpn_server                 : ok=2    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0

Oh no! Installing WireGuard was successful, but creating the config failed. Ansible's error messages are usually helpful, and this one indicates that the template file (server_wg0.conf.j2) we're trying to use to create the server's configuration couldn't be found. Let's create it at templates/server_wg0.conf.j2:



# {{ ansible_managed }}
[Interface]
Address = 10.0.0.1/24
ListenPort = 51820
PrivateKey = {{ server_privkey }}

A few notes about the above:

Ansible automatically searches in relative paths like templates/ and files/ when running Ansible modules that have a src parameter. Our template task has a parameter src: server_wg0.conf.j2, so Ansible will search for it in the templates/ folder.
It's convention to suffix template files with .j2, to indicate that the file will be templated with Jinja2.
In Jinja2, values inside double curly braces ({{ variable }}) will be replaced with the value of the variable. In this template, the server_privkey variable will be decrypted and its value inserted into the resulting file in place of {{ server_privkey }}.
The {{ ansible_managed }} text is replaced with the string "Ansible managed". It's a good convention to put this in a comment at the top of templated files, because it signals to anyone reading the file on the server that the file is managed by Ansible — any edits they make could be overwritten when Ansible next runs, so they should find and make edits in the corresponding Ansible playbook and template files instead.

Let's run the test again:



$ ansible-playbook -i inventory.ini --ask-vault-password playbook.yml
Vault password: 

PLAY [setup vpn server] ********************************************************

TASK [Gathering Facts] *********************************************************
ok: [vpn_server]

TASK [install wireguard package] ***********************************************
ok: [vpn_server]

TASK [create server wireguard config] ******************************************
changed: [vpn_server]

TASK [enable and persist ip forwarding] ****************************************
changed: [vpn_server]

TASK [start wireguard and enable on boot] **************************************
changed: [vpn_server]

PLAY RECAP *********************************************************************
vpn_server                 : ok=5    changed=3    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

It succeeded! The WireGuard interface is now running on the server.

Notice that the "install wireguard package" step shows ok instead of changed this time. The apt module (like most modules) detects that the server is already in the desired state (the wireguard package was installed last time we ran the playbook, so it satisfies state=present) and performs no action. The task is idempotent, meaning you can run it repeatedly and the outcome is the same. Idempotent tasks make it easy to see what changed and what didn't each time a playbook is run.
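
To make the ok versus changed distinction concrete, here's a minimal shell sketch of an idempotent step. The ensure_line function and the demo file are made up for illustration; this is an analogy, not how the apt module is implemented.

```shell
#!/bin/sh
# ensure_line FILE LINE: append LINE to FILE only if it is not already there.
# Prints "changed" when it had to act and "ok" when the file was already in
# the desired state, mirroring the statuses Ansible reports.
ensure_line() {
  file="$1"
  line="$2"
  if [ -f "$file" ] && grep -qxF -- "$line" "$file"; then
    echo "ok"
  else
    printf '%s\n' "$line" >> "$file"
    echo "changed"
  fi
}

demo_file="$(mktemp)"
ensure_line "$demo_file" "net.ipv4.ip_forward=1"   # first run has to act
ensure_line "$demo_file" "net.ipv4.ip_forward=1"   # second run finds nothing to do
rm -f "$demo_file"
```

The first call reports changed and the second reports ok, which is the same pattern we see between the two playbook runs above.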

WireGuard Client Setup

Ansible can also operate on the local machine. To set up our local machine as a client, we want to:

Install the wireguard system package
Create public and private keys with correct permissions
Create the client's WireGuard configuration file, which must include the server's public key
Start the VPN

We also need to update the server's configuration file with a [Peer] section including the client's public key, so the client can connect to the server. The client's public key isn't known until after we create it — we could create client keys manually like we did for the server's keys, but then the playbook wouldn't be able to set up multiple clients without having to manually edit the keys for each client.

Acting on Localhost

Because we're targeting a new host (localhost), we need to write a new play in playbook.yml. We can put it above the existing play (which targets vpn_server), so the client's keys are generated before the server config is templated.



---
- name: setup vpn client
  hosts: localhost
  connection: local
  become: yes
  vars:
    # Use system python so apt package is available
    ansible_python_interpreter: "/usr/bin/env python"
  tasks:
    # Coming soon

- name: setup vpn server
  hosts: vpn_server
  # Rest of server vars/tasks here...

Lots of new things here!

We target the local machine by using localhost for the hosts pattern.
We "connect" locally by using the local connection plugin.
The become: yes line indicates that the play will run as root, which we need to be able to install the wireguard package. Ansible will effectively run sudo apt-get install wireguard, rather than just apt-get install wireguard (which would fail). Because of this setting, we'll need to run the playbook with the --ask-become-pass flag. We didn't need this line for the server setup play, because we're already connecting as root via the ansible_user=root connection variable.
With the ansible_python_interpreter var, we tell Ansible to use the system python (which includes the apt python package). Alternatively, we could install that package for our current python 3.9.2 installation. If you get a No such file or directory error, you may need to change the line from python to python3.

Client Setup Tasks and Config

Writing the Ansible tasks for the client-side VPN setup is similar to the server side.



---
- name: setup vpn clients
  hosts: localhost
  connection: local
  become: yes
  vars:
    # Use system python so apt package is available
    ansible_python_interpreter: "/usr/bin/env python"
  tasks:
  - name: install wireguard package
    apt:
      name: wireguard
      state: present
      update_cache: yes

  - name: generate private key
    shell:
      cmd: umask 077 && wg genkey | tee privatekey | wg pubkey > publickey
      chdir: /etc/wireguard
      creates: /etc/wireguard/publickey

  - name: get public key
    command: cat /etc/wireguard/publickey
    register: publickey_contents
    changed_when: False

  # Save pubkey as a fact, so we can use it to template wg0.conf for the server
  - name: set public key fact
    set_fact:
      pubkey: "{{ publickey_contents.stdout }}"

  - name: create client wireguard config
    template:
      dest: /etc/wireguard/wg0.conf
      src: client_wg0.conf.j2
      owner: root
      group: root
      mode: '0600'

- name: setup vpn server
  hosts: vpn_server
  # Rest of server vars/tasks here...

Breaking this down:

Installing the wireguard package should look very familiar!
We generate keys with the shell module so we can use pipes and file redirection. The keys are only generated if the publickey file doesn't already exist, thanks to the creates parameter.
Next, we need to save the public key so we can add it as a [Peer] section in the server config. Normally, we'd use {{ lookup('file', '/etc/wireguard/publickey') }} to look up a value from a file, but the file lookup plugin doesn't seem to respect become: yes; it tries to read the file without escalating to root privileges and fails as a result. So, we instead cat the file and save the resulting output as a fact.
Finally, template the client config file. Its contents closely match the previous tutorial's, but we use the ansible_host IP address of the VPN server from inventory.ini to set the server's endpoint.



[Interface]
# The address your computer will use on the VPN
Address = 10.0.0.8/32

# Load your privatekey from file
PostUp = wg set %i private-key /etc/wireguard/privatekey
# Also ping the vpn server to ensure the tunnel is initialized
PostUp = ping -c1 10.0.0.1

[Peer]
# VPN server's wireguard public key
PublicKey = {{ server_pubkey }}

# Public IP address of your VPN server (USE YOURS!)
# Use the floating IP address if you created one for your VPN server
Endpoint = {{ hostvars['vpn_server'].ansible_host }}:51820

# 10.0.0.0/24 is the VPN subnet
AllowedIPs = 10.0.0.0/24

# To also accept and send traffic to a VPC subnet at 10.110.0.0/20
# AllowedIPs = 10.0.0.0/24,10.110.0.0/20

# To accept traffic from and send traffic to any IP address through the VPN
# AllowedIPs = 0.0.0.0/0

# To keep a connection open from the server to this client
# (Use if you're behind a NAT, e.g. on a home network, and
# want peers to be able to connect to you.)
# PersistentKeepalive = 25

Managing Variables

If we run the playbook now, it will fail with a 'server_pubkey' is undefined error. That's because server_pubkey is defined only for the play that targets the server, so it's not available to the play targeting the client. We need to move the variable somewhere that's readable by the entire playbook. Ansible looks for YAML files in a group_vars/ folder where the filename matches a group name in the inventory file. So, we could create a group_vars/vpn.yml file and declare variables in it, which would be directly usable when running a play against any servers in the vpn group. However, localhost isn't in the vpn group (though we could add it), so we'll instead use the special group_vars/all.yml file, which makes variables available to all hosts.

Move the server keys' variables from playbook.yml to group_vars/all.yml:



---
server_privkey: !vault |
      $ANSIBLE_VAULT;1.1;AES256
      646438636565343063343631326136386239623935393637336539653636386135363
      663386639393232346534643163656363316234306439306566306534610a31326664
      363763663139383034636632343230376365333130333230373866353033326563303
      5636138373830633534373033303536303566663166616539360a3936353033663263
      336662663034376661616631343661333164363134373061343739633637623739306
      465653532383838393662396333623966343165366635353132396332313762343534
      65313761623964653532623839356633343838
server_pubkey: 7/6f7bUT+2hWMEP5BxeK51PGuMuTnQ9pRpkxg5jUSTo=

Your directory should now look like this:



.
├── group_vars
│   └── all.yml
├── inventory.ini
├── playbook.yml
└── templates
    ├── client_wg0.conf.j2
    └── server_wg0.conf.j2

Run the playbook and the client should run all its tasks successfully:



ansible-playbook -i inventory.ini --ask-vault-password --ask-become-pass playbook.yml

The VPN client is now set up. The only remaining step for the client is to start the VPN after the server is running and configured to accept connections from the client (so the client's PostUp ping will succeed).

Adding a Peer to the Server Config

Add a [Peer] section to the server template at templates/server_wg0.conf.j2:



# {{ ansible_managed }}
[Interface]
Address = 10.0.0.1/24
ListenPort = 51820
PrivateKey = {{ server_privkey }}

[Peer]
PublicKey = {{ hostvars['localhost'].pubkey }}
AllowedIPs = 10.0.0.8

We read the {{ server_privkey }} from group_vars/all.yml and we read {{ hostvars['localhost'].pubkey }} from the set_fact module that runs during the client-targeted play in the playbook.

Reloading the Server Config

If we run the playbook, the config file on the server will be updated with the new [Peer] section, but the WireGuard interface is already running and configured based on the old file contents. We need to reload the configuration when it changes. Handlers are the Ansible-provided mechanism for this, and they run when a task that notifies them reports a change. Handlers run at the end of the play in which they're notified, so many tasks could notify a "reload config" handler, but the handler would only run once at the end. Let's create a couple of handlers, each in a handlers list after its play's tasks list in playbook.yml, and notify them from the create client wireguard config and create server wireguard config tasks:



  # ...
  - name: create client wireguard config
    template:
      dest: /etc/wireguard/wg0.conf
      src: client_wg0.conf.j2
      owner: root
      group: root
      mode: '0600'
    notify: restart wireguard

  handlers:
  # Restarts WireGuard interface, loading any new config and running PostUp
  # commands in the process. Notify this handler on client config changes.
  - name: restart wireguard
    shell: wg-quick down wg0; wg-quick up wg0
    args:
      executable: /bin/bash

- name: setup vpn server
  hosts: vpn_server
  tasks:
  # ...
  - name: create server wireguard config
    template:
      dest: /etc/wireguard/wg0.conf
      src: server_wg0.conf.j2
      owner: root
      group: root
      mode: '0600'
    notify: reload wireguard config
  # ...

  handlers:
  # Reloads config without disrupting current peer sessions, but does not
  # re-run PostUp commands. Notify this handler on server config changes.
  - name: reload wireguard config
    shell: wg syncconf wg0 <(wg-quick strip wg0)
    args:
      executable: /bin/bash
# ...

The template Ansible module only performs an action and marks the task as changed if the config file changes; in other words, it is idempotent. Idempotence is valuable when used with handlers, because the handler will only run when the task changes. Notifying a handler on a task that isn't idempotent may result in the handler always running (e.g. a service is unnecessarily restarted every time the playbook is run).
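
If handlers still feel abstract, here's a rough shell analogy of the "notify on change, run once at the end" behavior. The install_config and run_handlers functions are hypothetical names for this sketch, and the handler only prints a message where a real one would reload WireGuard; Ansible's actual implementation is different.

```shell
#!/bin/sh
# A rough shell analogy for Ansible handlers (not how Ansible implements them).
handler_pending=0

# install_config SRC DEST: copy SRC over DEST only when the contents differ,
# and record a pending handler when a change was made.
install_config() {
  src="$1"
  dest="$2"
  if [ -f "$dest" ] && cmp -s "$src" "$dest"; then
    echo "ok: $dest"
  else
    cp "$src" "$dest"
    handler_pending=1
    echo "changed: $dest"
  fi
}

# run_handlers: run the reload step once if any task reported a change.
run_handlers() {
  if [ "$handler_pending" -eq 1 ]; then
    # Placeholder for the real reload command.
    echo "handler: reload wireguard config"
    handler_pending=0
  else
    echo "handler: nothing to do"
  fi
}

workdir="$(mktemp -d)"
printf '[Interface]\nListenPort = 51820\n' > "$workdir/new.conf"
install_config "$workdir/new.conf" "$workdir/wg0.conf"   # file is new, so it changes
install_config "$workdir/new.conf" "$workdir/wg0.conf"   # identical contents, no change
run_handlers                                              # one change was recorded
run_handlers                                              # nothing pending anymore
rm -rf "$workdir"
```

Because the second install_config call finds identical contents, it doesn't notify anything, which is the property that keeps a handler from firing on every run.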

Start the VPN Client

Add one final play to the end of the playbook to start the client VPN now that the server is configured to accept its connection:



# ...
- name: start vpn on clients
  hosts: localhost
  connection: local
  become: yes
  tasks:
  - name: start vpn
    command: wg-quick up wg0
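
One thing to keep in mind: wg-quick up exits with an error if the interface already exists, so this last step may fail when the tunnel is already up. Here's a hedged shell sketch of a guard that makes the step safe to repeat. The ensure_wg_up function and the DRY_RUN variable are my own inventions for illustration, and the /sys/class/net check assumes Linux.

```shell
#!/bin/sh
# ensure_wg_up IFACE: start IFACE with wg-quick unless it already exists.
# With DRY_RUN=1 the command is printed instead of executed.
ensure_wg_up() {
  iface="$1"
  # On Linux, every network interface has an entry under /sys/class/net.
  if [ -e "/sys/class/net/$iface" ]; then
    echo "ok: $iface already exists"
  elif [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run: wg-quick up $iface"
  else
    wg-quick up "$iface"
  fi
}

DRY_RUN=1 ensure_wg_up wg0
```

The same idea could be expressed in the playbook itself, but the shell version shows the check most directly.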

Automation Complete!

Now we can run the whole playbook and — whether the server and client are brand-new or in some intermediate state — this single command will set up a WireGuard VPN server and client!



ansible-playbook -i inventory.ini --ask-vault-password --ask-become-pass playbook.yml

The complete Ansible code can be found at: https://gitlab.com/tangram-vision-oss/tangram-visions-blog

There are many improvements that could be made:

Provision a cloud server automatically, using an Ansible module such as community.digitalocean.digital_ocean_droplet.
Automatically update a floating IP address when provisioning a new cloud VPN server.
Configure multiple clients automatically. One approach is to add a vpn_clients group to the inventory, define VPN IPs in the inventory (e.g. vpn_ip=10.0.0.8), and use those host variables in the config templates. When templating the server config, loop over hostnames in the clients group, adding a new [Peer] block for each.
Organize the playbook as roles, one for the server and one for the client. Roles are more reusable and shareable than playbooks.
Test and lint with molecule and ansible-lint.
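
To give a feel for the multiple-clients idea, here's a small shell sketch that turns a list of clients into [Peer] blocks. In the playbook this would be a loop in the server's Jinja2 template rather than a shell script; the render_peers function and the placeholder keys below are hypothetical.

```shell
#!/bin/sh
# render_peers: read "public_key vpn_ip" pairs from stdin, one per line, and
# print a [Peer] block for each. Blank lines and # comments are skipped.
render_peers() {
  while read -r pubkey vpn_ip; do
    case "$pubkey" in
      ''|'#'*) continue ;;
    esac
    printf '[Peer]\nPublicKey = %s\nAllowedIPs = %s/32\n\n' "$pubkey" "$vpn_ip"
  done
}

# Placeholder keys for illustration only; real ones come from `wg pubkey`.
render_peers <<'EOF'
# public_key            vpn_ip
CLIENT_A_PUBLIC_KEY     10.0.0.8
CLIENT_B_PUBLIC_KEY     10.0.0.9
EOF
```

Each client gets its own [Peer] block with a single-address AllowedIPs entry, matching the shape of the block we wrote by hand for one client.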

Thanks for joining me on this Ansible-learning journey! If you have any suggestions or corrections, please let me know or send us a tweet, and if you’re curious to learn more about how we improve perception sensors, visit us at Tangram Vision.

What They Don’t Tell You About Setting Up A WireGuard VPN

Greg Schafer — Tue, 12 Jan 2021 19:34:03 +0000

WireGuard is a relatively new VPN implementation that was added to the Linux 5.6 kernel in 2020 and is faster and simpler than other popular VPN options like IPsec and OpenVPN.

We'll walk through setting up an IPv4-only WireGuard VPN server on DigitalOcean, and I'll highlight tips and tricks and educational asides that should help you build a deeper understanding and, ultimately, save you time compared to "just copy these code blocks" WireGuard tutorials.

Let's get a server!

To set up a VPN, we need two computers that we want to connect. One of these is typically a desktop/laptop/phone in your possession. If you're looking to remotely access company intranet sites and services, the other computer would be a server in an office or on a company cloud network. If you're looking to remotely access your own home network, privately network with family/friends, or encrypt all of your internet traffic, then the other computer would be a personal server on a cloud provider like DigitalOcean or AWS.

VPN connectivity overview. CC BY-SA 4.0, Image attribution: Creative Commons License

For this walkthrough, we'll use a new Ubuntu 20.04 server on DigitalOcean, though you could follow similar steps using any cloud provider. To create a new DigitalOcean server, follow their guide to creating a droplet. A "droplet" is the term DigitalOcean uses for a "server" or a "VM" or an "instance".

VPCs and Private Networks

DigitalOcean servers are automatically created in a Virtual Private Cloud aka VPC (most cloud providers have VPC or private networking functionality), meaning they have an additional network interface (eth1 in addition to eth0) and an additional private IP address. All servers, databases, and load balancers created in the same VPC can communicate with each other via their private IP addresses, which is a boost to security because all inbound traffic from the public internet (on eth0) can be blocked with a firewall.

You can use your VPN server as a sort of bastion host to access other resources inside your VPC using their private IP addresses. That is, your VPN server can route traffic to any IP address in the VPC and all the servers in your VPC can accept traffic only to their private IP addresses (to eth1), which protects those servers and the services they run from all sorts of attacks. The server configuration section below will mention how to set up this sort of architecture.

How can I keep my VPN server up?

Given the importance of VPN uptime — especially if it serves as the only way to access important servers in a VPC or remote company network — it's worth considering how to handle or avoid downtime. There is a range of options and tradeoffs to consider, ordered below in increasing complexity/effort:

Do nothing! If you set up a server on DigitalOcean, install and configure the VPN, and take no further actions, then your VPN will go down when the server does. It's not uncommon for DigitalOcean to migrate droplets between physical machines due to hardware issues, and the VPN will be unavailable if the migration can't be performed without downtime. If a more serious issue causes downtime (e.g. accidental rm -rf /, networking misconfiguration, or a successful attack), then you'll need to set up and configure a new server from scratch to bring your VPN back up. If you didn't save the VPN server's private key offline, you'll need to generate a new private key and reconfigure all VPN clients to be able to connect to the new VPN server.
Enable droplet backups. You can enable backups for an extra 20% of the droplet price, which will take weekly snapshots of the server. If the droplet ends up horribly broken or unresponsive, you can restore the latest backup and your VPN will be working again (in about 1 minute for a 1 GB droplet).
Set up manual failover. Set up the VPN server and take a snapshot, then restore the snapshot to a new droplet. Point a floating IP to one of the servers and use that IP address when connecting to the VPN. When the primary/active VPN server goes down for any reason, you can update the floating IP to point to the secondary/standby VPN server and your VPN will work again!
Set up automatic failover / high-availability. The next step up in sophistication is to either:
- detect when the VPN server goes down and automatically switch (point a floating IP address) to a healthy standby using something like Pacemaker, or
- put a UDP load balancer in front of multiple VPN servers, but... you might need some network trickery to allow multiple active VPN servers with the same IP address and you might also need sticky sessions, which break down for roaming clients without some protocol-level changes like Cloudflare made for WARP.

Set up a WireGuard server

With your shiny new server running, let's install and configure WireGuard. For non-Linux platforms, follow the WireGuard website's instructions and links. For this walkthrough, I'll show instructions for Ubuntu 20.04, starting with installing the wireguard package:



sudo apt update
sudo apt install wireguard

The wireguard package installs two binaries:

wg — a tool for managing configuration of WireGuard interfaces
wg-quick — a convenience script for easily starting and stopping WireGuard interfaces

I encourage reading the manpages (man wg and man wg-quick), because they are concise, well-written, and contain a lot of information that is glossed over in most WireGuard tutorials!

To encrypt and decrypt packets, we need keys. 🔑



# Change to the root user
sudo -s

# Make sure files created after this point are accessible only to the root user
umask 077

# Generate keys in /etc/wireguard
cd /etc/wireguard
wg genkey | tee privatekey | wg pubkey > publickey

Now we have a private key (which only the server should possess and know about) and a public key (which should be shared to all VPN clients that will connect to this server).
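
If you're curious what umask 077 did for us, here's a small sketch that reproduces the effect in a throwaway directory, so it doesn't touch your real keys. The file contents are a dummy string.

```shell
#!/bin/sh
# Show the effect of umask 077 using a throwaway directory, so no real keys
# are touched. The subshell keeps the umask change from leaking out.
demo_dir="$(mktemp -d)"
(
  umask 077
  cd "$demo_dir" || exit 1
  echo "not-a-real-key" > privatekey
)
# The first ten characters of `ls -l` are the file type and permission bits.
perms="$(ls -l "$demo_dir/privatekey" | cut -c1-10)"
echo "$perms"
rm -rf "$demo_dir"
```

The file comes out readable and writable by its owner only. You can run ls -l /etc/wireguard to confirm that your real key files have the same permissions.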

Next, create a configuration file at /etc/wireguard/wg0.conf.

If we use wg-quick (spoiler: we will) to start/stop the VPN interface, it will create the interface with wg0 as the name. You can create other interface config files with other names, such as wg1.conf, my-company-vpn.conf, or us_east_1.conf. The wg-quick script will create interfaces with names that match the config filename (minus the .conf part), as long as the name fits the regex tested in /usr/bin/wg-quick.

Print out your private key with cat /etc/wireguard/privatekey and then add the following to the configuration file:



# /etc/wireguard/wg0.conf on the server
[Interface]
Address = 10.0.0.1/24
ListenPort = 51820
# Use your own private key, from /etc/wireguard/privatekey
PrivateKey = WCzcoJZaxurBVM/wO1ogMZgg5O5W12ON94p38ci+zG4=

We'll add the public keys of clients that are allowed to connect to the VPN later, but the above is all you need to run the VPN server for now. Here's what it means:

Address = 10.0.0.1/24 — The server will have an IP address in the VPN of 10.0.0.1. The /24 at the end of the IP address is a CIDR mask and means that the server will relay other traffic in the 10.0.0.1-10.0.0.254 range to peers in the VPN.
ListenPort = 51820 — The port that WireGuard will listen to for inbound UDP packets.
PrivateKey = ... — The private key of the VPN server, used for encryption/decryption.
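
If CIDR notation is new to you, this sketch computes the first and last address that a prefix covers. The cidr_range function is a helper written for this article (IPv4 only, and it assumes well-formed input), not part of WireGuard.

```shell
#!/bin/sh
# cidr_range ADDRESS/PREFIX: print the first and last address of the block.
# IPv4 only; assumes well-formed input.
cidr_range() {
  addr="${1%/*}"
  prefix="${1#*/}"
  IFS=. read -r o1 o2 o3 o4 <<EOF
$addr
EOF
  # 4294967295 is 2^32 - 1, i.e. all 32 bits set.
  ip=$(( (o1 << 24) | (o2 << 16) | (o3 << 8) | o4 ))
  mask=$(( (4294967295 << (32 - prefix)) & 4294967295 ))
  first=$(( ip & mask ))
  last=$(( first | (~mask & 4294967295) ))
  printf '%d.%d.%d.%d - %d.%d.%d.%d\n' \
    $(( first >> 24 & 255 )) $(( first >> 16 & 255 )) \
    $(( first >> 8 & 255 ))  $(( first & 255 )) \
    $(( last >> 24 & 255 ))  $(( last >> 16 & 255 )) \
    $(( last >> 8 & 255 ))   $(( last & 255 ))
}

cidr_range 10.0.0.1/24
cidr_range 10.0.0.8/32
```

For 10.0.0.1/24 the block runs from 10.0.0.0 to 10.0.0.255. By convention the first and last addresses are reserved as the network and broadcast addresses, which leaves 10.0.0.1 through 10.0.0.254 for hosts. A /32 covers exactly one address.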

At this point, you can start the VPN!



# This will run a few commands with "ip" and "wg" to
# create the interface and configure it
wg-quick up wg0

# To see the WireGuard-specific details of the interface
wg

# To start the VPN on boot
systemctl enable wg-quick@wg0

Find more example commands for inspecting the interface at https://github.com/pirate/wireguard-docs#inspect.

Relaying traffic

Recall from above that Address = 10.0.0.1/24 means the server will relay traffic to peers in the subnet. That is, if you connect to the VPN and ping 10.0.0.14 (and a server exists on the VPN at that address), then your ping will go to the VPN server at 10.0.0.1 and be forwarded on to the machine at 10.0.0.14. However, this won't work without one additional piece of configuration: IP Forwarding.

To enable IP Forwarding, open /etc/sysctl.conf and uncomment or add the line:



net.ipv4.ip_forward=1

Then apply the settings by running:



sysctl -p

Now, the VPN server should be able to relay traffic to other VPN hosts. From my understanding, running ping 10.0.0.14 will follow the left-to-right path shown in the diagram below. The diagram doesn't show the ping response from Peer C to Peer A, but you can mentally reverse all the arrows to see what the returning response path would look like.

The path of network packets from a ping command on Peer A to the destination server, Peer C. The packets enter the VPN at Peer A and route to the VPN server (Peer B), which relays the packets to Peer C via the VPN.

Troubleshooting relayed traffic

There are many places where something could go wrong, especially when relaying traffic between multiple servers as in the diagram above. When network requests are failing, tcpdump is a great tool for finding the source of failures and misconfigurations. If you wanted a complete view of the flow in the diagram above, you could run the following tcpdump commands on each machine:



sudo tcpdump -nn -i wg0
sudo tcpdump -nn -i eth0 udp and port 51820

Just be aware that clocks on servers might be slightly out-of-sync, so comparing timestamps in tcpdump output between servers could be misleading!

If you're debugging network packets on a machine with a display like your desktop or laptop, you can use Wireshark, which is a graphical, user-friendly alternative to tcpdump.

For more insight into WireGuard itself, you can enable debug logging by following the instructions at https://www.wireguard.com/quickstart/#debug-info and then running tail -f /var/log/syslog to see the log messages.

Relaying traffic to a VPC or the internet

In addition to using a VPN server to relay traffic between VPN clients, you can use a VPN server as a way to access servers in a VPC (on DigitalOcean or AWS, for example) that are firewalled off from the public internet. This approach requires no change in WireGuard configuration on the server, but you will need to enable masquerading so that responses on one network (e.g. the VPC) can be mapped to the requesting machine on the other network (e.g. the VPN). If you're unfamiliar with masquerading, check out this brief explanation. Assuming your VPN server is connected to the VPC on its eth1 interface, you can enable masquerading on the VPN server with:



iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth1 -j MASQUERADE

Now, a VPN client such as your laptop should be able to ping servers in the VPC, as in the diagram below.

The path of network packets from a ping command on Peer A to the destination server, Peer C. The packets enter the VPN at Peer A and route to the VPN server (Peer B), which terminates the VPN connection and relays the packets to Peer C via the VPC.

If you want to relay traffic through the VPN server to the internet (in which case, the VPN server is often labeled a bounce server), enable masquerading on the public-internet-facing interface (e.g. eth0) of the VPN server:



iptables -t nat -A POSTROUTING -s 10.0.0.0/24 -o eth0 -j MASQUERADE

Now, a VPN client such as your laptop can visit public internet sites via your VPN — if you're on an unsecured coffeeshop wifi connection or you don't trust your ISP, all they'll see is an encrypted VPN connection.

The path of network packets from a ping command on Peer A to the destination server on the internet. The packets enter the VPN at Peer A and route to the VPN server (Peer B), which terminates the VPN connection and relays the packets over the public internet to the destination server.

Firewall rules

We've used iptables above for masquerading, but iptables is also important for managing the VPN server's firewall. You can use ufw instead, but learn and use iptables if you have the time — iptables is more foundational and powerful. Regardless of how you manage your firewall (I like this sort of approach), you'll need to:

allow UDP traffic to the WireGuard ListenPort (51820 in the sample server config above)
allow traffic forwarded to or from the WireGuard interface wg0

The iptables commands for those changes are:



iptables -A INPUT -p udp -m udp --dport 51820 -j ACCEPT

iptables -A FORWARD -i wg0 -j ACCEPT
iptables -A FORWARD -o wg0 -j ACCEPT

Many WireGuard tutorials suggest putting these iptables commands in the PostUp lines of the server WireGuard configuration, meaning the commands will be run when the wg0 interface is created. Be warned that, depending on how you manage your firewall, you may end up erasing these commands if you restart your firewall while the WireGuard interface is running, thereby making the VPN unreachable. Consider managing WireGuard firewall rules in the same place and with the same tool that you manage all your other firewall rules.

Set up a WireGuard client

Similar to the server setup, install WireGuard (follow the WireGuard website's instructions and links for non-Linux platforms):



sudo apt update
sudo apt install wireguard

Generate keys, similar to server setup:



# Change to the root user
sudo -s

# Make sure files created after this point are accessible only to the root user
umask 077

# Generate keys in /etc/wireguard
cd /etc/wireguard
wg genkey | tee privatekey | wg pubkey > publickey

Next, create a configuration file at /etc/wireguard/wg0.conf with the following content:



# /etc/wireguard/wg0.conf on the client
[Interface]
# The address your computer will use on the VPN
Address = 10.0.0.8/32

# Load your privatekey from file
PostUp = wg set %i private-key /etc/wireguard/privatekey
# Also ping the vpn server to ensure the tunnel is initialized
PostUp = ping -c1 10.0.0.1

[Peer]
# VPN server's wireguard public key (USE YOURS!)
PublicKey = CcZHeaO08z55/x3FXdsSGmOQvZG32SvHlrwHnsWlGTs=

# Public IP address of your VPN server (USE YOURS!)
# Use the floating IP address if you created one for your VPN server
Endpoint = 123.123.123.123:51820

# 10.0.0.0/24 is the VPN subnet
AllowedIPs = 10.0.0.0/24

# To also accept and send traffic to a VPC subnet at 10.110.0.0/20
# AllowedIPs = 10.0.0.0/24,10.110.0.0/20

# To accept traffic from and send traffic to any IP address through the VPN
# AllowedIPs = 0.0.0.0/0

# To keep a connection open from the server to this client
# (Use if you're behind a NAT, e.g. on a home network, and
# want peers to be able to connect to you.)
# PersistentKeepalive = 25

There's lots to talk about here!

Address = ... — Set the IP address of this client in the VPN. Packets sent to the VPN server with a destination of this address will be sent to whatever public IP address (endpoint) this client was last seen at.
PostUp = wg set %i private-key ... — Load the private key from the file after the wg0 interface is up. You can copy-paste the contents of the private key file into a PrivateKey line directly (as in the server config) if you prefer. I suggest not loading the private key via PostUp in the VPN server config however, because reloading the config (e.g. after adding a new client/peer) does not re-run PostUp commands, so the VPN will no longer know its private key and the VPN won't work as a result.
PostUp = ping -c1 10.0.0.1 — Ping the VPN server after the wg0 interface is up to test that the VPN connection was successful. If the ping fails, wg-quick will take the interface back down. In my testing, sending traffic from the VPN server to the client didn't work until something was sent from the client to the server — sending 1 ping packet to the server with PostUp does the trick.
[Peer] — There can be multiple peer sections in the config, one for each VPN peer you wish to connect directly to. Often, the VPN server will be the only peer in a client's config file. Lines under the [Peer] header define how and where the client will connect to the peer.
PublicKey = ... — The public key of the VPN server.
Endpoint = ... — The (usually publicly-accessible) IP address and port of your VPN server. This could be a floating IP address if you're using a cloud provider like DigitalOcean or AWS.
AllowedIPs = ... — For incoming packets from the VPN server, their source IP address must match the addresses or ranges in AllowedIPs. For outgoing packets, the AllowedIPs is the mapping that tells WireGuard what peer (specifically their public key and endpoint) should be used when encrypting and sending. The last example (AllowedIPs = 0.0.0.0/0) would enable WireGuard to send traffic destined for any IP address to the VPN server. With AllowedIPs = 0.0.0.0/0, wg-quick up will conveniently run ip route and ip rule commands to route all your traffic through the VPN (useful in the aforementioned unsecured coffeeshop wifi or malicious ISP scenarios). For more info on how AllowedIPs works, check out WireGuard's documentation.
PersistentKeepalive = 25 — Send a packet to the VPN server every 25 seconds, to ensure that the server can successfully route traffic to the client when the client doesn't have a public or stable IP address. Without this setting, the client can still send traffic to the VPN server and receive responses, but routers between the client and the server only keep their NAT/masquerade mapping for a few dozen seconds. After the mapping expires, the server won't be able to send anything to the client until the client sends something first. You typically won't enable this setting, unless you want to allow new connections from other devices on the VPN — for example, you would enable this on your home desktop if you wanted to connect to it from your laptop or phone while traveling.
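To make the AllowedIPs source check more concrete, here is a rough sketch of the matching logic in plain shell. It is a simplified, IPv4-only illustration and not how WireGuard implements it (the real lookup lives in the kernel and, for outgoing packets, picks the most specific matching range across all peers). The function names ip_to_int, ip_in_cidr, and allowed_ips_match are my own.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
  old_ifs="$IFS"; IFS=.
  set -- $1
  IFS="$old_ifs"
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Return 0 if address $1 falls within CIDR range $2 (e.g. 10.0.0.0/24).
ip_in_cidr() {
  addr="$(ip_to_int "$1")"
  net="$(ip_to_int "${2%/*}")"
  bits="${2#*/}"
  if [ "$bits" -eq 0 ]; then
    mask=0
  else
    mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
  fi
  [ $(( addr & mask )) -eq $(( net & mask )) ]
}

# Return 0 if address $1 matches any range in the comma-separated list $2.
allowed_ips_match() {
  old_ifs="$IFS"; IFS=,
  for range in $2; do
    IFS="$old_ifs"
    if ip_in_cidr "$1" "$range"; then return 0; fi
  done
  IFS="$old_ifs"
  return 1
}

# A packet from the VPN subnet passes; one from a home LAN address does not.
allowed_ips_match 10.0.0.8 "10.0.0.0/24,10.110.0.0/20" && echo "10.0.0.8 accepted"
allowed_ips_match 192.168.1.5 "10.0.0.0/24,10.110.0.0/20" || echo "192.168.1.5 dropped"
```

With AllowedIPs = 0.0.0.0/0, the mask is zero and every address matches, which is why that setting sends all traffic to the VPN server.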

Before starting the VPN on the client, the VPN server needs to be configured to allow connections from the client. Open /etc/wireguard/wg0.conf on the VPN server again and update the contents to match:



# /etc/wireguard/wg0.conf on the server
[Interface]
Address = 10.0.0.1/24
ListenPort = 51820
# Use your own private key, from /etc/wireguard/privatekey
PrivateKey = WCzcoJZaxurBVM/wO1ogMZgg5O5W12ON94p38ci+zG4=

[Peer]
# VPN client's public key
PublicKey = lIINA9aXWqLzbkApDsg3cpQ3m4LnPS0OXogSasNW5RY=
# VPN client's IP address in the VPN
AllowedIPs = 10.0.0.8/32

The added [Peer] section enables the VPN server to coordinate encryption keys with the client and validate that traffic from and to the client is allowed. To apply these changes, you can restart the WireGuard interface on the server:



wg-quick down wg0 && wg-quick up wg0

If you want to avoid disrupting or dropping active VPN connections, reload the config with:



wg syncconf wg0 <(wg-quick strip wg0)
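If you add clients regularly, the [Peer] section for the server config can be generated instead of typed by hand. This is only a sketch: peer_stanza is a hypothetical helper of mine, and it assumes each client gets a single /32 address in the VPN subnet, as in the server config above.

```shell
# Sketch: print a [Peer] section for the server config.
# peer_stanza is a hypothetical helper, not part of WireGuard.
peer_stanza() {
  client_pubkey="$1"
  client_vpn_ip="$2"
  printf '\n[Peer]\n'
  printf 'PublicKey = %s\n' "$client_pubkey"
  printf 'AllowedIPs = %s/32\n' "$client_vpn_ip"
}

# Preview the section for the client from this article:
peer_stanza "lIINA9aXWqLzbkApDsg3cpQ3m4LnPS0OXogSasNW5RY=" 10.0.0.8

# To apply it for real (as root, in bash), append and reload:
#   peer_stanza "$CLIENT_PUBKEY" 10.0.0.9 >> /etc/wireguard/wg0.conf
#   wg syncconf wg0 <(wg-quick strip wg0)
```

Printing to stdout first lets you review the section before it touches the live config.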

At this point, you can start the VPN on the client!



# This will run a few commands with "ip" and "wg" to
# create the interface and configure it
wg-quick up wg0

# To see the WireGuard-specific details of the interface
wg
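Once the interface is up, the handshake time is a handy signal that the tunnel actually works. The sketch below reads the output of wg show wg0 latest-handshakes (one line per peer: public key, a tab, and a Unix timestamp, with 0 meaning no handshake yet) and flags peers that look stale. The stale_peers function and the 180-second threshold are assumptions of mine, so tune the threshold to taste.

```shell
# Sketch: read "publickey<TAB>epoch" lines on stdin and report peers whose
# last handshake is older than max_age seconds, or that never completed one.
# Arguments: max_age (seconds), now (current Unix time).
stale_peers() {
  max_age="$1"
  now="$2"
  while read -r key ts; do
    [ -n "$key" ] || continue
    if [ "$ts" -eq 0 ]; then
      echo "$key never"
    elif [ $(( now - ts )) -gt "$max_age" ]; then
      echo "$key stale"
    fi
  done
}

# Usage (as root):
#   wg show wg0 latest-handshakes | stale_peers 180 "$(date +%s)"
```

No output means every peer has completed a handshake within the threshold. A peer that is idle and has PersistentKeepalive disabled can legitimately show up as stale.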

Connecting from a Chromebook

If you're connecting to a WireGuard VPN from a Chromebook, I suggest using the official Android WireGuard app. My efforts to run WireGuard under crouton failed, because crouton uses a chroot, so I was stuck with the Chromebook's old Linux kernel (4.19) and unable to add kernel modules or network interfaces from within crouton. Similarly, crostini doesn't allow updating or using custom kernel modules, but it does provide a great way to SSH into VPN-accessible servers while the Android WireGuard app is active.

Connecting from other devices

If you want to connect to a VPN from devices where you don't have root access, you can try installing a userspace implementation of WireGuard such as wireguard-go.

If you want to connect to a VPN from devices you don't control (e.g. smart TVs, IoT sensors), look into setting up WireGuard on your router (e.g. instructions for OpenWRT), so you can route all those devices' outbound traffic through a VPN.

Thanks for reading! Hopefully, I’ve saved you time by passing on some of the insights and tips that I learned while digging deeper into the many facets of setting up a WireGuard VPN. If you have any suggestions or corrections, please let me know or send us a tweet, and if you’re curious to learn more about how we improve perception sensors, visit us at Tangram Vision.

If you're setting up multiple VPNs or multiple VPN clients — or if you're interested in learning about infrastructure and configuration automation — check out the next tutorial I wrote: Exploring Ansible via Setting Up a WireGuard VPN.

Corrections

2020-01-13: Previously, my explanation of what AllowedIPs does and how to route all traffic through the VPN was incomplete/misleading. Thanks to Chris Siebenmann on Twitter for catching that!
