Originally published on graycloudarch.com.
I deployed a permission set for our data engineers five times before it worked correctly.
The first deployment: S3 reads worked, Glue Data Catalog reads worked. Athena queries failed — the query engine needs KMS decrypt through a service principal, and I'd missed the kms:ViaService condition. Second deployment: Athena worked. EMR Serverless job submission failed — missing iam:PassRole. Third deployment: EMR submission worked. Job execution failed — missing permissions on the EMR Serverless execution role boundary. I kept deploying, engineers kept getting blocked, I kept opening tickets.
Five iterations. Two weeks. Every failure meant a data engineer opened a ticket instead of running their job.
The problem wasn't that IAM is complicated — it is, but that's expected. The problem was that I had no way to catch these issues before deploying to the account where real engineers were trying to do real work. Every bug was a production bug.
The "Access Denied" Debugging Loop
Here's what the reactive debugging cycle looks like from the inside.
Engineer opens a ticket: AccessDeniedException: User is not authorized to perform: s3:GetObject. I add s3:GetObject to the permission set. Next day: AccessDeniedException: s3:PutObject. I add s3:PutObject. Day after: write succeeds but cleanup fails — s3:DeleteObject. At this point I've done four deployment cycles and two days of work to get S3 read/write/delete working. If I'd just added s3:* I'd be done, but that violates least-privilege and opens the raw zone to write access, which we explicitly don't want.
The deeper issue is that individual services don't fail atomically. Athena requires athena:StartQueryExecution and athena:GetQueryResults and athena:GetQueryExecution, but it also requires KMS decrypt through the Athena service principal to read encrypted S3 results. That last piece isn't in the Athena docs — you find it by failing in production.
I wanted a way to find it before deploying.
What I Built
The testing framework has four components: per-persona permission set templates, a Bash test library, per-service test scripts, and a GitHub Actions workflow that runs everything on pull requests.
The workflow triggers on any pull request that modifies the identity-center Terraform directory. Tests run against real AWS accounts — dev and nonprod — using test credentials provisioned for that purpose. Results post as a PR comment before anyone approves the change.
Phase 1: Pre-Validated Templates
Before I wrote a single test, I needed a starting point for permission sets that captured the patterns I'd learned the hard way. Templates that handle the non-obvious pieces — zone-scoped S3 access, KMS conditions tied to specific services, explicit denies for destructive operations.
The AnalystAccess template is representative. Analysts get read-only access to the curated zone of the data lake, Athena query execution in the primary workgroup, and KMS decrypt — but only when the decrypt request originates from S3 or Athena, not from arbitrary API calls:
inline_policy = jsonencode({
  Version = "2012-10-17"
  Statement = [
    {
      Sid    = "GlueCatalogReadOnly"
      Effect = "Allow"
      Action = ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions", "glue:SearchTables"]
      Resource = [
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:database/curated_*",
        "arn:aws:glue:*:*:table/curated_*/*"
      ]
    },
    {
      # s3:prefix is only present on ListBucket requests, so the read and
      # list permissions live in separate statements. A combined statement
      # would silently deny GetObject: a condition on a key that is absent
      # from the request context never matches.
      Sid      = "S3CuratedObjectRead"
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = ["arn:aws:s3:::lake-bucket-*/curated/*"]
    },
    {
      Sid       = "S3CuratedList"
      Effect    = "Allow"
      Action    = ["s3:ListBucket"]
      Resource  = ["arn:aws:s3:::lake-bucket-*"]
      Condition = { StringLike = { "s3:prefix" = ["curated/*"] } }
    },
    {
      Sid      = "AthenaQueryExecution"
      Effect   = "Allow"
      Action   = ["athena:StartQueryExecution", "athena:GetQueryExecution", "athena:GetQueryResults", "athena:StopQueryExecution"]
      Resource = "arn:aws:athena:*:*:workgroup/primary"
    },
    {
      Sid      = "KMSDecryptViaSvc"
      Effect   = "Allow"
      Action   = ["kms:Decrypt", "kms:DescribeKey"]
      Resource = "arn:aws:kms:*:*:key/*"
      Condition = {
        StringEquals = { "kms:ViaService" = ["s3.us-east-1.amazonaws.com", "athena.us-east-1.amazonaws.com"] }
      }
    },
    {
      Sid      = "DenyDestructiveOps"
      Effect   = "Deny"
      Action   = ["s3:DeleteObject", "s3:DeleteBucket", "glue:DeleteDatabase", "glue:DeleteTable"]
      Resource = "*"
    }
  ]
})
The kms:ViaService condition is the piece that first failed deployment taught me. Without it, KMS decrypt allows an analyst to call kms:Decrypt directly from their shell, which is not what we want. The condition locks decrypt to requests that pass through S3 or Athena specifically.
The explicit deny block matters too. Without it, if someone later grants broader S3 permissions to this persona for a different reason, the curated zone protection evaporates. The deny creates a hard floor regardless of what else gets added.
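That deny-wins behavior can be checked without deploying anything, using the IAM policy simulator from the CLI. A minimal sketch (the helper name and the inline two-statement policy are mine, for illustration, not the deployed template):

```shell
#!/usr/bin/env bash
# Ask the IAM policy simulator whether s3:DeleteObject would be allowed
# when a broad Allow and the explicit Deny coexist in the same policy set.
check_deny_wins() {
  local policy='{"Version":"2012-10-17","Statement":[
    {"Effect":"Allow","Action":"s3:*","Resource":"*"},
    {"Effect":"Deny","Action":"s3:DeleteObject","Resource":"*"}]}'
  aws iam simulate-custom-policy \
    --policy-input-list "$policy" \
    --action-names s3:DeleteObject \
    --query 'EvaluationResults[0].EvalDecision' \
    --output text
}
# The decision comes back "explicitDeny": the Deny statement wins no
# matter what other Allow statements are present.
```

This is also a cheap pre-review check for reviewers who don't want to read policy JSON by eye.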
Phase 2: The Test Framework
I chose Bash over Python or a proper test framework deliberately. The tests run in CI with no dependencies beyond the AWS CLI — no package installs, no virtual environments, no version pinning of test libraries. The machines running these tests already have the AWS CLI.
The core library in lib/test-framework.sh:
declare -a TESTS_PASSED=()
declare -a TESTS_FAILED=()

run_test() {
  local test_name="$1"
  local test_command="$2"
  local description="$3"
  if eval "$test_command" &>/dev/null; then
    TESTS_PASSED+=("$test_name")
    echo "  ✅ PASS: $test_name ($description)"
  else
    TESTS_FAILED+=("$test_name")
    echo "  ❌ FAIL: $test_name ($description)"
  fi
}

generate_text_report() {
  echo "Total:  $((${#TESTS_PASSED[@]} + ${#TESTS_FAILED[@]}))"
  echo "Passed: ${#TESTS_PASSED[@]}"
  echo "Failed: ${#TESTS_FAILED[@]}"
  [ ${#TESTS_FAILED[@]} -gt 0 ] && printf '  - %s\n' "${TESTS_FAILED[@]}"
}
The most important design decision in the test scripts is testing denials as carefully as allowances. Testing only what should succeed tells you the permission set isn't obviously broken. Testing what should fail tells you it's not accidentally too permissive.
# Test what should succeed
run_test "s3-list-curated" \
  "aws s3 ls s3://lake-bucket-dev/curated/" \
  "Analyst can list curated zone"

# Test what should fail (negative test). grep -q exits 0 when the
# denial message is present, so each test passes exactly when the
# operation is denied -- no extra negation needed.
echo "probe" > /tmp/test.txt
run_test "s3-write-denied" \
  "aws s3 cp /tmp/test.txt s3://lake-bucket-dev/curated/test.txt 2>&1 | grep -q 'AccessDenied'" \
  "Analyst cannot write to curated zone"
run_test "s3-raw-zone-denied" \
  "aws s3 ls s3://lake-bucket-dev/raw/ 2>&1 | grep -q 'AccessDenied'" \
  "Analyst cannot access raw zone"
Beyond service-level tests, I run persona tests that simulate end-to-end workflows. An analyst's workflow isn't "call S3, then call Athena separately" — it's "run an Athena query that reads encrypted S3 data and writes results to the query results bucket." That integration test catches failures that individual service tests miss. The original five-iteration DataPlatformAccess failure? An individual S3 test would have passed. A persona test running an actual Athena query against the encrypted lake would have caught the KMS gap.
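A persona test along those lines can be sketched as a single function. This assumes the primary workgroup already has a query results location configured; the function name and query are illustrative, not the exact script from the repo:

```shell
#!/usr/bin/env bash
# Sketch of a persona-level test: one Athena query exercises Athena, S3,
# Glue, and KMS permissions in a single evaluation path -- the path that
# isolated service tests miss.
run_athena_persona_test() {
  local qid state
  qid=$(aws athena start-query-execution \
    --query-string 'SELECT 1' \
    --work-group primary \
    --query 'QueryExecutionId' --output text) || return 1
  # Poll until the query reaches a terminal state.
  while :; do
    state=$(aws athena get-query-execution \
      --query-execution-id "$qid" \
      --query 'QueryExecution.Status.State' --output text)
    case "$state" in
      SUCCEEDED)        return 0 ;;
      FAILED|CANCELLED) return 1 ;;
      *)                sleep 2 ;;
    esac
  done
}
```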
Phase 3: CI/CD Integration
The GitHub Actions workflow triggers on pull requests that touch the identity-center Terraform directory, runs tests in a matrix against dev and nonprod, and posts a summary comment to the PR.
on:
  pull_request:
    paths:
      - 'common/modules/identity-center/**/*.tf'

permissions:
  contents: read
  id-token: write
  pull-requests: write

jobs:
  test-permissions:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        include:
          - environment: workloads-dev
            account: "111111111111"   # placeholder dev account ID
          - environment: workloads-nonprod
            account: "222222222222"   # placeholder nonprod account ID
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ matrix.account }}:role/github-actions-role
          aws-region: us-east-1
      - run: ./scripts/test-permissions/run-permission-tests.sh --persona analyst
The id-token: write permission is required for OIDC authentication to AWS — the workflow assumes a role in each account rather than using long-lived credentials in GitHub Secrets. This is the right pattern: credentials rotate automatically, and there's no secret to rotate manually or accidentally expose.
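For reference, the role side of that OIDC handshake looks roughly like this in Terraform. This is a sketch: the OIDC provider resource name and the repo path are my assumptions, not values from the actual repo:

```hcl
# Trust policy for github-actions-role: trust GitHub's OIDC provider,
# scoped to one repository. Provider resource name and repo path are
# illustrative assumptions.
data "aws_iam_policy_document" "github_oidc_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Scope the trust to workflows from this one repo, nothing else.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:example-org/platform-infra:*"] # hypothetical repo
    }
  }
}
```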
The PR comment posts the full test output with pass/fail counts per persona per account. A reviewer can look at the comment and immediately see whether the permission change has test coverage and whether the tests pass.
Three Things I Learned the Hard Way
First: test KMS decryption through each service separately. kms:Decrypt via S3 and kms:Decrypt via Athena are different IAM evaluation paths even though they're the same API call. A test that puts an object and gets it back via S3 directly won't catch a broken Athena KMS path.
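Concretely, that means two tests for what is nominally one permission. A sketch with placeholder bucket, key, and object names:

```shell
#!/usr/bin/env bash
# Same kms:Decrypt permission, two evaluation paths. Bucket, key alias,
# and object names are placeholders, not values from the real lake.
run_kms_path_tests() {
  local direct
  # Path 1: calling kms:Decrypt directly from the shell should be
  # denied, because the policy requires kms:ViaService.
  direct=$(aws kms decrypt \
    --ciphertext-blob fileb:///tmp/blob.enc \
    --key-id alias/lake-key 2>&1)
  grep -q 'AccessDenied' <<<"$direct" || return 1

  # Path 2: the same key used *via* S3 should succeed, because the
  # request now arrives through the S3 service principal.
  aws s3 cp s3://lake-bucket-dev/curated/sample.parquet /tmp/sample.parquet \
    || return 1
}
```

An Athena-path variant of the same check falls out of the persona test: if the query succeeds against encrypted data, the Athena KMS path works.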
Second: negative tests matter as much as positive ones. Before I had the test framework, every permission set I wrote was tested only for what it should allow. I had no systematic check that it didn't allow more. The denial tests are what give security reviewers confidence.
Third: persona tests catch failures that service tests miss. Individual service tests are fast to write and good for regression coverage, but they test permissions in isolation. Real workflows cross service boundaries. Build both.
What Changed
Before the framework: five iterations to get one permission set right, every iteration a production impact. After: 95% of permission issues caught at PR review time. Zero production impacts from permission bugs since we shipped it. The templates reduced new permission set creation time by about 70% — instead of starting from scratch with the IAM documentation, we start from a pre-validated base and modify from there.
The time investment was about a week: two days for templates, two days for the test framework and scripts, one day for CI/CD integration and documentation. That investment paid back in the first sprint when the analyst permission set for a new hire went out correct on the first deployment.
Running into IAM permission debugging loops on your team? Reach out — permission testing infrastructure is one of the first things I build when joining a new platform team.
