
Automating deploying Vault, a password manager for your code, with Terraform

Yuan Gao ・ 9 min read

This post looks at deploying a single Vault instance on Google Cloud using Terraform. It is intended for those who are already using Terraform and Google Cloud, and are looking into using Vault.

Introduction to Vault

Vault is a secrets management service. There are many ways to use it and many benefits to doing so, but essentially you no longer have to hard-code things like database passwords into your code; instead, your apps request secrets at runtime when they need them. Think of it as a password manager that your apps/microservices can use.

The three main benefits of doing this are:

  1. Security: you don't leave database credentials lying around in your code or environment.
  2. You can automatically generate and rotate passwords, rather than copy/pasting a new password by hand into the many places it may be used.
  3. You can add access rules to secrets based on who's requesting them: production apps can access production passwords, while development apps can't.

An example of setting up a password automatically is spinning up a Cloud SQL instance on Google Cloud with Terraform:


resource "google_sql_database_instance" "prod" {
  ...
}

resource "random_password" "prod" {
  length = 16
}

resource "google_sql_database" "prod" {
  name = "prod_db"
  instance = google_sql_database_instance.prod.name
}

resource "google_sql_user" "prod" {
  name     = "prod_user"
  instance = google_sql_database_instance.prod.name
  password = random_password.prod.result
}

The above Terraform code creates a new Cloud SQL instance on Google Cloud, creates a database called prod_db and a user called prod_user, and assigns that user a random 16-character password.

At this point, if you weren't using Vault, you'd have to peek at what that random 16-character password was, and then go configure a bunch of apps or environments to use it. With Vault, you could instead add these lines to your Terraform code:

resource "vault_generic_secret" "prod_db" {
  path = "secret/production/db"

  data_json = jsonencode({
    username = google_sql_user.prod.name
    password = google_sql_user.prod.password
    db = google_sql_database.prod.name
    host = google_sql_database_instance.prod.ip_address.0.ip_address
  })
}

Now any app that is authorized to access secret/production/db can fetch not only the random password that you never saw and never have to deal with, but also the username, database name, and even host IP of the database — all values that might change from time to time for security reasons or redeployment. You could repeat the process to set up a development database and store its secret at secret/development/db. Apps can then switch between the two databases simply by changing which secret they fetch.

An example of the Python client code for this (highly simplified):

import requests

role = "production"
vault_addr = "https://vault.example.com:8200"
secret_path = "production/db"

# fetch JWT from metadata server
req_jwt = requests.get(
    "http://metadata/computeMetadata/v1/instance/service-accounts/default/identity",
    headers={"Metadata-Flavor": "Google"},
    params={"audience": f"http://vault/{role}", "format": "full"},
)
jwt = req_jwt.text

# authenticate with Vault
req_token = requests.post(
    f"{vault_addr}/v1/auth/gcp/login",
    json={"role": role, "jwt": jwt},
)
token = req_token.json()["auth"]["client_token"]

# request the secret
req_secret = requests.get(
    f"{vault_addr}/v1/secret/data/{secret_path}",
    headers={"x-vault-token": token},
)
secret = req_secret.json()["data"]["data"]

The first request fetches a JWT from the VM's metadata server; this JWT is the app's proof that it's running inside an authorized server. The JWT is how you avoid having to hard-code passwords, since only apps running inside the VM are able to fetch it.

The second request authenticates against Vault using JWT (and Vault must have been set up beforehand to allow this authentication method) and receives a short-lived access token.

The third and final request uses this access token to fetch the secret it needs. The app can cache the token or JWT for re-use later.

The above code is simplified and doesn't include any error checking. I have a vault.py script containing better-wrapped code that I include in all services as a library to give them access to Vault.
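My actual wrapper isn't shown here, but as a rough sketch, a vault.py helper might wrap the three requests above in a class with error checking and token caching. The class name, caching policy, and expiry margin below are all hypothetical, not the author's real code:

```python
import json
import time
import urllib.parse
import urllib.request


class VaultClient:
    """Hypothetical minimal wrapper around Vault's HTTP API with token caching."""

    METADATA_URL = (
        "http://metadata/computeMetadata/v1/instance/"
        "service-accounts/default/identity"
    )

    def __init__(self, vault_addr, role):
        self.vault_addr = vault_addr
        self.role = role
        self.token = None
        self.token_expiry = 0.0

    @staticmethod
    def token_expired(expiry, now, margin=30.0):
        # Treat the token as expired slightly early to avoid racing the TTL
        return now >= expiry - margin

    def _fetch_jwt(self):
        # Ask the GCE metadata server for an identity JWT for our role
        params = urllib.parse.urlencode(
            {"audience": f"http://vault/{self.role}", "format": "full"}
        )
        req = urllib.request.Request(
            f"{self.METADATA_URL}?{params}",
            headers={"Metadata-Flavor": "Google"},
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode()

    def login(self):
        # Exchange the metadata JWT for a short-lived Vault token
        body = json.dumps({"role": self.role, "jwt": self._fetch_jwt()}).encode()
        req = urllib.request.Request(
            f"{self.vault_addr}/v1/auth/gcp/login", data=body, method="POST"
        )
        with urllib.request.urlopen(req) as resp:
            auth = json.load(resp)["auth"]
        self.token = auth["client_token"]
        self.token_expiry = time.time() + auth["lease_duration"]

    def get_secret(self, path):
        # Re-authenticate if we have no token or it's about to expire
        if self.token is None or self.token_expired(self.token_expiry, time.time()):
            self.login()
        req = urllib.request.Request(
            f"{self.vault_addr}/v1/secret/data/{path}",
            headers={"x-vault-token": self.token},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["data"]["data"]
```

Services would then just call `VaultClient(vault_addr, "production").get_secret("production/db")` and let the wrapper handle login and token refresh.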

Vault also has a CLI and UI, to help you work with it manually.

(Screenshot: the Vault web UI)

Deploying Vault

There are lots of ways of deploying Vault; it can also run as a cluster and become highly available. However, the easiest (and perhaps cheapest) way is to run it as a single program on a single VM. This is good enough for light use and testing, and you can always expand later.

I'm just going to go ahead and code-dump my Terraform code for this:


resource "google_compute_instance" "vault" {
  name         = "vault"
  machine_type = "e2-small"
  zone         = "us-central1-a"

  tags         = [google_compute_firewall.vault.name]

  metadata = {
    ssh-keys = "terraform:${tls_private_key.vault.public_key_openssh}"
  }

  boot_disk {
    initialize_params {
      image = "ubuntu-minimal-2004-lts"
    }
  }

  network_interface {
    network = google_compute_network.default.name

    access_config {
      nat_ip = google_compute_address.vault.address
    }
  }

  service_account {
    email = google_service_account.vault.email
    scopes = [ "cloud-platform" ]
  }

  provisioner "file" {
    content = <<EOF
storage "gcs" {
  bucket = "${google_storage_bucket.vault.name}"
  prefix  = "vault/store"
}

listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = 0
  tls_cert_file = "/etc/ssl/fullchain.pem"
  tls_key_file = "/etc/ssl/privkey.pem"
}

disable_mlock = true
api_addr = "https://${local.vault_domain}:8200"
ui = true
log_level = "warn"
EOF
    destination = "/tmp/vault.hcl"

    connection {
      type        = "ssh"
      user        = "terraform"
      host        = self.network_interface.0.access_config.0.nat_ip
      private_key = tls_private_key.vault.private_key_pem
    }
  }

  provisioner "file" {
    content = <<EOF
[Unit]
Description="HashiCorp Vault - A tool for managing secrets"
Documentation=https://www.vaultproject.io/docs/
Requires=network-online.target
After=network-online.target
ConditionFileNotEmpty=/etc/vault.d/vault.hcl
StartLimitIntervalSec=60
StartLimitBurst=3

[Service]
User=terraform
Group=terraform
ProtectSystem=full
ProtectHome=read-only
PrivateTmp=yes
PrivateDevices=yes
SecureBits=keep-caps
AmbientCapabilities=CAP_IPC_LOCK
Capabilities=CAP_IPC_LOCK+ep
CapabilityBoundingSet=CAP_SYSLOG CAP_IPC_LOCK
NoNewPrivileges=yes
ExecStart=/usr/local/bin/vault server -config=/etc/vault.d/vault.hcl
ExecReload=/bin/kill --signal HUP $MAINPID
KillMode=process
KillSignal=SIGINT
Restart=on-failure
RestartSec=20
TimeoutStopSec=30
LimitNOFILE=65536
LimitMEMLOCK=infinity

[Install]
WantedBy=multi-user.target
EOF
    destination = "/tmp/vault.service"

    connection {
      type        = "ssh"
      user        = "terraform"
      host        = self.network_interface.0.access_config.0.nat_ip
      private_key = tls_private_key.vault.private_key_pem
    }
  }

  provisioner "file" {
    content = tls_private_key.cert_private_key.private_key_pem
    destination = "/tmp/privkey.pem"

    connection {
      type        = "ssh"
      user        = "terraform"
      host        = self.network_interface.0.access_config.0.nat_ip
      private_key = tls_private_key.vault.private_key_pem
    }
  }

  provisioner "file" {
    content = "${acme_certificate.certificate.certificate_pem}${acme_certificate.certificate.issuer_pem}"
    destination = "/tmp/fullchain.pem"

    connection {
      type        = "ssh"
      user        = "terraform"
      host        = self.network_interface.0.access_config.0.nat_ip
      private_key = tls_private_key.vault.private_key_pem
    }
  }

  provisioner "remote-exec" {
    # Install Vault and set up the service
    inline = [
      "echo '*** Install vault ***'",
      "sudo apt update && sudo apt install -y unzip",
      "cd /tmp",
      "wget https://releases.hashicorp.com/vault/${local.vault_version }/vault_${local.vault_version}_linux_amd64.zip -O vault.zip && unzip vault.zip && rm vault.zip",
      "sudo mv vault /usr/local/bin/vault && chmod +x /usr/local/bin/vault",

      "echo '*** Move files to right place ***'",
      "sudo mkdir -p /etc/ssl",
      "sudo mv /tmp/*.pem /etc/ssl",

      "echo '*** Start vault service ***'",
      "sudo mkdir -p /etc/vault.d",
      "sudo mv /tmp/vault.hcl /etc/vault.d",
      "sudo mv /tmp/vault.service /etc/systemd/system/vault.service",
      "sudo systemctl enable vault",
      "sudo systemctl start vault"
    ]

    connection {
      type        = "ssh"
      user        = "terraform"
      host        = self.network_interface.0.access_config.0.nat_ip
      private_key = tls_private_key.vault.private_key_pem
    }
  }
}

It's a bit of a mouthful, and could be simplified and modularized. The two inline config files should really be loaded from template files rather than defined in-line, but I've kept everything in one place to make the post easier to follow.

I'll try to explain what's going on here. Most of the top half is standard Terraform for deploying Google VMs: it creates an e2-small instance in us-central1-a using an Ubuntu 20.04 image. We also generate a temporary SSH key so that Terraform can SSH in and do some light deployment work (the resources for doing that are shown further below).

The first provisioner "file" block deploys Vault's config file. This config file defines the storage bucket Vault needs to access, the TLS/SSL listener settings, and a few other options.

The second provisioner "file" is the systemd service definition for Vault; it's mostly boilerplate.

The third and fourth provisioners copy in the SSL certs for HTTPS. The way I have these set up is similar to my previous post.

Next, the provisioner "remote-exec" completes the setup: it runs an apt update and installs unzip; downloads Vault from HashiCorp's servers and installs it; moves the rest of the files out of the /tmp folder where they were dumped (mostly due to file-permission issues); and then enables and starts the service.

The above script does depend on a few other lines, which I've separated out to keep things readable:

locals {
  vault_version = "1.5.3"
  vault_domain = "vault.example.com"
}

These local variables make it easier to pick the Vault version you want to install, and define the domain where Vault will live once it's fully deployed.

resource "tls_private_key" "vault" {
  algorithm = "RSA"
}

This generates a temporary SSH key for provisioning the VM; it's pretty boilerplate for Terraform VM deployment.

resource "google_compute_address" "vault" {
  name = "vault"
}

This reserves an IP address for the Vault VM instance. It isn't strictly needed: a VM without an IP address reservation will acquire an ephemeral one by default. However, if you ever have to tear down the Vault VM and create a new one, the new VM would receive a new IP address, requiring a DNS update that takes a few minutes to propagate, leaving you with some downtime. It's a bit better to reserve the IP.
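If you also manage DNS in Google Cloud, a sketch of pointing a record at the reserved address might look like the following. The managed zone name here is an assumption, not something from my actual setup:

```hcl
resource "google_dns_record_set" "vault" {
  name         = "${local.vault_domain}."  # note the trailing dot
  managed_zone = "example-zone"            # assumed Cloud DNS zone name
  type         = "A"
  ttl          = 300
  rrdatas      = [google_compute_address.vault.address]
}
```

Because the address is reserved, this record never needs to change even if the VM is recreated.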

resource "google_compute_firewall" "vault" {
  name    = "vault"
  network = google_compute_network.default.name

  allow {
    protocol = "tcp"
    ports    = ["8200"]
  }

  source_ranges = ["0.0.0.0/0"]
  target_tags   = ["vault"]
}

This is the firewall rule that allows access to Vault's listener.

resource "google_service_account" "vault" {
  account_id = "vault"
  display_name = "Vault Service Account"
  description = "For vault state storage backend"
}

resource "google_service_account_key" "vault" {
  service_account_id = google_service_account.vault.name
}

resource "google_project_iam_member" "vault-iam" {
  project = data.google_project.project.project_id
  role = "roles/iam.serviceAccountUser"
  member = "serviceAccount:${google_service_account.vault.email}"
}

This is Vault's service account. Vault needs it for two reasons: accessing its storage bucket, and verifying GCP-generated JWTs. The service account key is needed later.

resource "google_storage_bucket" "vault" {
  project = google_project.project.project_id
  name = "<name of your vault storage bucket>"
  location = "us-central1"

  uniform_bucket_level_access = true
}

resource "google_storage_bucket_iam_member" "vault" {
  bucket  = google_storage_bucket.vault.name
  role = "roles/storage.admin"
  member = "serviceAccount:${google_service_account.vault.email}"
}

This creates a storage bucket for Vault's secret storage. For more information about how Vault encrypts secrets at rest, see their documentation.

If you were to run the above, and assuming you have the rest of the resources set up, it will spin up and install a Vault instance ready to go. Please go through the HCL closely: you'll find references to resources I haven't talked about in this post, including things like network configs, SSL certificates, etc.

Configuring Vault

Once deployed, Vault is a blank slate; it doesn't even have any keys generated. At this point you probably want to get to the instance and generate the root keys before someone else does. You can do this either through the CLI or through the web UI. The actual act of configuring and operating Vault is outside the scope of this post, which covers only deploying a single instance on Google Cloud with Terraform; for usage details, I suggest checking out the many tutorials on HashiCorp's site.

Once you've generated the keys, you need to "unseal" Vault before configuring it.
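For reference, initializing and unsealing from the CLI looks roughly like this. The key-share counts are just example values, and the address is the placeholder domain used throughout this post:

```shell
export VAULT_ADDR="https://vault.example.com:8200"

# Generate the root token and unseal keys (run once; store the output safely)
vault operator init -key-shares=5 -key-threshold=3

# Unseal by supplying any 3 of the 5 unseal keys; repeat until sealed=false
vault operator unseal
vault operator unseal
vault operator unseal

vault status
```

The unseal keys and root token printed by `vault operator init` are shown exactly once, so distribute them to keyholders immediately.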

Configuring Vault can be done manually using the CLI or UI, but it's much easier to do through Terraform. The slight complication with configuring a Vault that was itself deployed in a VM by Terraform is the manual step: you need to initialize and unseal Vault after the VM goes live, but before the rest of the Vault-related Terraform code can run. This can cause some confusing deadlocks in the Terraform run. There are a few ways around it, such as manual waits and depends_on, but I've yet to find one I really like; I usually either separate the configuration into its own Terraform workspace, or just put up with the errors while I go and manually unseal.
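As a partial mitigation, you can make the first Vault resource explicitly depend on the VM, which at least serializes creation (it does not remove the manual unseal step). A hedged sketch:

```hcl
resource "vault_policy" "prod_read" {
  # Don't attempt to talk to Vault until the VM exists;
  # the apply will still fail here if Vault hasn't been unsealed yet.
  depends_on = [google_compute_instance.vault]
  ...
}
```

Keeping Vault configuration in a separate workspace avoids the problem entirely, at the cost of a second `terraform apply`.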

Exactly how you want to configure Vault is going to be down to your requirements, but here's some examples of what I use:

provider "vault" {
  address = "https://${var.vault_domain}:8200"
}

You first need the provider set up. See the documentation for more details; your token must be provided either here or via your environment.
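The Vault provider reads its connection settings from the environment, so one way to supply the token without committing it to code is via environment variables. The token value below is a placeholder:

```shell
export VAULT_ADDR="https://vault.example.com:8200"
export VAULT_TOKEN="s.xxxxxxxxxxxxxxxx"   # placeholder; use a real admin token
terraform plan
```

This keeps the credential out of your Terraform files and state-adjacent config.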

resource "vault_policy" "prod_read" {
  name = "prod_read"

  policy = <<EOF
path "secrets" {
  capabilities = ["list"]
}

path "secret/*" {
  capabilities = ["read"]
}

path "secret/prod/*" {
  capabilities = [ "read", "list"]
}
EOF
}

Here's a policy for roles that can read production secrets:

resource "vault_gcp_auth_backend" "vault_gcp" {
  credentials = base64decode(google_service_account_key.vault.private_key)
}

resource "vault_gcp_auth_backend_role" "production" {
  role = "production"
  type = "iam"
  bound_projects = [ data.google_project.project.project_id ]
  bound_service_accounts = [ google_service_account.production.email ]
  token_policies         = [ vault_policy.prod_read.name ]
  max_jwt_exp = 3600
}

This enables the GCP JWT authentication mentioned earlier. The private_key is the service account key generated above. Note the bound_projects and bound_service_accounts: you'll want to list the service accounts from which JWTs are accepted.

There's lots of additional configuration that falls outside the scope of this blog post, but hopefully this gives some guidance on how to set up and operate a single-instance Vault automatically, for testing, evaluation, or small-scale deployment.


Cover Photo by Jason Dent on Unsplash
