Glenn Gray

Posted on • Originally published at graycloudarch.com

Building Automated AWS Permission Testing Infrastructure for CI/CD


I deployed a permission set for our data engineers five times before
it worked correctly.

The first deployment: S3 reads worked, Glue Data Catalog reads
worked. Athena queries failed --- the query engine needs KMS decrypt
through a service principal, and I'd missed the
kms:ViaService condition. Second deployment: Athena worked.
EMR Serverless job submission failed --- missing
iam:PassRole. Third deployment: EMR submission worked. Job
execution failed --- missing permissions on the EMR Serverless execution
role boundary. I kept deploying, engineers kept getting blocked, I kept
opening tickets.

Five iterations. Two weeks. Every failure meant a data engineer
opened a ticket instead of running their job.

The problem wasn't that IAM is complicated --- it is, but that's
expected. The problem was that I had no way to catch these issues before
deploying to the account where real engineers were trying to do real
work. Every bug was a production bug.

The "Access Denied" Debugging Loop

Here's what the reactive debugging cycle looks like from the
inside.

Engineer opens a ticket:
AccessDeniedException: User is not authorized to perform: s3:GetObject.
I add s3:GetObject to the permission set. Next day:
AccessDeniedException: s3:PutObject. I add
s3:PutObject. Day after: write succeeds but cleanup fails ---
s3:DeleteObject. At this point I've done four deployment
cycles and two days of work to get S3 read/write/delete working. If I'd
just added s3:* I'd be done, but that violates
least-privilege and opens the raw zone to write access, which we
explicitly don't want.

The deeper issue is that individual services don't fail atomically.
Athena requires athena:StartQueryExecution and
athena:GetQueryResults and
athena:GetQueryExecution, but it also requires KMS decrypt
through the Athena service principal to read encrypted S3 results. That
last piece isn't in the Athena docs --- you find it by failing in
production.

I wanted a way to find it before deploying.

What I Built

The testing framework has four components: per-persona permission set
templates, a Bash test library, per-service test scripts, and a GitHub
Actions workflow that runs everything on pull requests.

┌─────────────────────────────────────────────────┐
│  GitHub Pull Request (Permission Set Changes)   │
└───────────────────┬─────────────────────────────┘
                    │
         ┌──────────▼──────────┐
         │  CI/CD Workflow     │
         │  (GitHub Actions)   │
         └──────────┬──────────┘
                    │
    ┌───────────────┼───────────────┐
    ▼               ▼               ▼
┌───────┐      ┌──────────┐   ┌──────────┐
│ S3    │      │  Glue    │   │ Athena   │
│ Tests │      │  Tests   │   │  Tests   │
└───────┘      └──────────┘   └──────────┘
                    │
         ┌──────────▼──────────┐
         │  Test Report        │
         │  (Posted to PR)     │
         └─────────────────────┘

The workflow triggers on any pull request that modifies the
identity-center Terraform directory. Tests run against real AWS accounts
--- dev and nonprod --- using test credentials provisioned for that purpose.
Results post as a PR comment before anyone approves the change.

Phase 1: Pre-Validated Templates

Before I wrote a single test, I needed a starting point for
permission sets that captured the patterns I'd learned the hard way.
Templates that handle the non-obvious pieces --- zone-scoped S3 access,
KMS conditions tied to specific services, explicit denies for
destructive operations.

The AnalystAccess template is representative. Analysts
get read-only access to the curated zone of the data lake, Athena query
execution in the primary workgroup, and KMS decrypt --- but only when the
decrypt request originates from S3 or Athena, not from arbitrary API
calls:

inline_policy = jsonencode({
  Version = "2012-10-17"
  Statement = [
    {
      Sid    = "GlueCatalogReadOnly"
      Effect = "Allow"
      Action = ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions", "glue:SearchTables"]
      Resource = [
        "arn:aws:glue:*:*:catalog",
        "arn:aws:glue:*:*:database/curated_*",
        "arn:aws:glue:*:*:table/curated_*/*"
      ]
    },
    {
      Sid      = "S3CuratedObjectRead"
      Effect   = "Allow"
      Action   = ["s3:GetObject"]
      Resource = ["arn:aws:s3:::lake-bucket-*/curated/*"]
    },
    {
      # s3:prefix only exists on ListBucket requests, so the listing
      # condition gets its own statement rather than gating GetObject too.
      Sid       = "S3CuratedList"
      Effect    = "Allow"
      Action    = ["s3:ListBucket"]
      Resource  = ["arn:aws:s3:::lake-bucket-*"]
      Condition = { StringLike = { "s3:prefix" = ["curated/*"] } }
    },
    {
      Sid    = "AthenaQueryExecution"
      Effect = "Allow"
      Action = ["athena:StartQueryExecution", "athena:GetQueryExecution", "athena:GetQueryResults", "athena:StopQueryExecution"]
      Resource = "arn:aws:athena:*:*:workgroup/primary"
    },
    {
      # Athena writes query results with the caller's credentials, so the
      # results location needs read/write (bucket pattern here is illustrative).
      Sid    = "AthenaResultsReadWrite"
      Effect = "Allow"
      Action = ["s3:GetBucketLocation", "s3:GetObject", "s3:ListBucket", "s3:PutObject"]
      Resource = [
        "arn:aws:s3:::lake-athena-results-*",
        "arn:aws:s3:::lake-athena-results-*/*"
      ]
    },
    {
      Sid    = "KMSDecryptViaSvc"
      Effect = "Allow"
      Action = ["kms:Decrypt", "kms:DescribeKey"]
      Resource = "arn:aws:kms:*:*:key/*"
      Condition = {
        StringEquals = { "kms:ViaService" = ["s3.us-east-1.amazonaws.com", "athena.us-east-1.amazonaws.com"] }
      }
    },
    {
      Sid    = "DenyDestructiveOps"
      Effect = "Deny"
      Action = ["s3:DeleteObject", "s3:DeleteBucket", "glue:DeleteDatabase", "glue:DeleteTable"]
      Resource = "*"
    }
  ]
})

The kms:ViaService condition is the piece that took five
production failures to discover. KMS decrypt without that condition
allows an analyst to call kms:Decrypt directly from their
shell, which is not what we want. The condition locks decrypt to
requests that pass through S3 or Athena specifically.

The explicit deny block matters too. Without it, if someone later
grants broader S3 permissions to this persona for a different reason,
the curated zone protection evaporates. The deny creates a hard floor
regardless of what else gets added.
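That floor can be checked without a deployment at all, via the IAM policy simulator. A minimal sketch, assuming the rendered inline policy has been written to a local JSON file; the helper name and file path are mine, not part of the repo:

```shell
#!/usr/bin/env bash
# Sketch: ask the IAM policy simulator how an action would be evaluated
# against the rendered policy JSON. Helper name and arguments are
# assumptions, not part of the post's framework.
eval_decision() {
  local policy_file="$1" action="$2" resource="$3"
  aws iam simulate-custom-policy \
    --policy-input-list "file://${policy_file}" \
    --action-names "${action}" \
    --resource-arns "${resource}" \
    --query 'EvaluationResults[0].EvalDecision' \
    --output text
}
```

For s3:DeleteObject against any lake ARN, the decision should come back explicitDeny even after someone adds a broad s3:* allow to the same policy.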

Phase 2: The Test Framework

I chose Bash over Python or a proper test framework deliberately. The
tests run in CI with no dependencies beyond the AWS CLI --- no package
installs, no virtual environments, no version pinning of test libraries.
The machines running these tests already have the AWS CLI.

The core library in lib/test-framework.sh:

declare -a TESTS_PASSED=()
declare -a TESTS_FAILED=()

run_test() {
  local test_name="$1"
  local test_command="$2"
  local description="$3"

  if eval "$test_command" &>/dev/null; then
    TESTS_PASSED+=("$test_name")
    echo "  ✅ PASS: $test_name"
  else
    TESTS_FAILED+=("$test_name")
    echo "  ❌ FAIL: $test_name"
  fi
}

generate_text_report() {
  echo "Total: $((${#TESTS_PASSED[@]} + ${#TESTS_FAILED[@]}))"
  echo "Passed: ${#TESTS_PASSED[@]}"
  echo "Failed: ${#TESTS_FAILED[@]}"
  [ ${#TESTS_FAILED[@]} -gt 0 ] && printf '  - %s\n' "${TESTS_FAILED[@]}"
}
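Because run_test cares only about exit codes, the accounting can be sanity-checked with no AWS access at all. A self-contained sketch (the framework core is inlined here so it runs standalone; in the repo you would source lib/test-framework.sh instead):

```shell
#!/usr/bin/env bash
# Standalone demo of the pass/fail accounting using dummy commands.
declare -a TESTS_PASSED=()
declare -a TESTS_FAILED=()

run_test() {
  local test_name="$1" test_command="$2"
  if eval "$test_command" &>/dev/null; then
    TESTS_PASSED+=("$test_name"); echo "  PASS: $test_name"
  else
    TESTS_FAILED+=("$test_name"); echo "  FAIL: $test_name"
  fi
}

run_test "always-passes" "true"
run_test "always-fails"  "false"

echo "Passed: ${#TESTS_PASSED[@]}  Failed: ${#TESTS_FAILED[@]}"
# Prints: Passed: 1  Failed: 1
```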

The most important design decision in the test scripts is testing
denials as carefully as allowances. Testing only what should succeed
tells you the permission set isn't obviously broken. Testing what should
fail tells you it's not accidentally too permissive.

# Test what should succeed
run_test "s3-list-curated" \
  "aws s3 ls s3://lake-bucket-dev/curated/" \
  "Analyst can list curated zone"

# Test what should fail: the command must be denied, so the test
# asserts that AccessDenied appears in the output
run_test "s3-write-denied" \
  "aws s3 cp /tmp/test.txt s3://lake-bucket-dev/curated/test.txt 2>&1 | grep -q 'AccessDenied'" \
  "Analyst cannot write to curated zone"

run_test "s3-raw-zone-denied" \
  "aws s3 ls s3://lake-bucket-dev/raw/ 2>&1 | grep -q 'AccessDenied'" \
  "Analyst cannot access raw zone"

Beyond service-level tests, I run persona tests that simulate
end-to-end workflows. An analyst's workflow isn't "call S3, then call
Athena separately" --- it's "run an Athena query that reads encrypted S3
data and writes results to the query results bucket." That integration
test catches failures that individual service tests miss. The original
five-iteration DataPlatformAccess failure? An individual S3 test would
have passed. A persona test running an actual Athena query against the
encrypted lake would have caught the KMS gap.
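A persona test in that style can be sketched as a single function that drives the whole workflow through the AWS CLI, so the S3, Glue, Athena, and KMS-via-Athena paths are all evaluated together. The function name, workgroup, and result location below are illustrative, not the repo's actual values:

```shell
#!/usr/bin/env bash
# Sketch of an end-to-end analyst persona test. Defines the function only;
# a runner would invoke it via run_test. Names here are illustrative.
analyst_athena_roundtrip() {
  local qid state
  qid=$(aws athena start-query-execution \
    --query-string "SELECT 1" \
    --work-group primary \
    --result-configuration "OutputLocation=s3://lake-bucket-dev/athena-results/" \
    --query QueryExecutionId --output text) || return 1

  # Poll until the query settles; a failure here is how the original
  # KMS gap would have surfaced before a deployment.
  while :; do
    state=$(aws athena get-query-execution --query-execution-id "$qid" \
      --query 'QueryExecution.Status.State' --output text) || return 1
    case "$state" in
      SUCCEEDED) break ;;
      FAILED|CANCELLED) return 1 ;;
      *) sleep 2 ;;
    esac
  done

  # Fetching results exercises the decrypt-via-Athena path on the
  # encrypted results object.
  aws athena get-query-results --query-execution-id "$qid" >/dev/null
}
```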

Phase 3: CI/CD Integration

The GitHub Actions workflow triggers on pull requests that touch the
identity-center Terraform directory, runs tests in a matrix against dev
and nonprod, and posts a summary comment to the PR.

on:
  pull_request:
    paths:
      - 'common/modules/identity-center/**/*.tf'

permissions:
  contents: read
  id-token: write
  pull-requests: write

jobs:
  test-permissions:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [workloads-dev, workloads-nonprod]
    # Maps to a GitHub environment of the same name; each one stores the
    # role ARN for its account as a variable.
    environment: ${{ matrix.environment }}
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ vars.AWS_TEST_ROLE_ARN }}
          aws-region: us-east-1
      - run: ./scripts/test-permissions/run-permission-tests.sh --persona analyst

The id-token: write permission is required for OIDC
authentication to AWS --- the workflow assumes a role in each account
rather than using long-lived credentials in GitHub Secrets. This is the
right pattern: credentials rotate automatically, and there's no secret
to rotate manually or accidentally expose.
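For completeness, the AWS side of that handoff is an IAM role whose trust policy accepts GitHub's OIDC identity provider and pins the repository. A sketch of that trust policy, with the account ID and org/repo as placeholders (the real values live in the Terraform, which this post doesn't show):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::ACCOUNT_ID:oidc-provider/token.actions.githubusercontent.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
        },
        "StringLike": {
          "token.actions.githubusercontent.com:sub": "repo:ORG/REPO:*"
        }
      }
    }
  ]
}
```

The sub condition is what stops any other repository's workflow from assuming the role.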

The PR comment posts the full test output with pass/fail counts per
persona per account. A reviewer can look at the comment and immediately
see whether the permission change has test coverage and whether the
tests pass.

Three Things I Learned the Hard Way

First: test KMS decryption through each service separately.
kms:Decrypt via S3 and kms:Decrypt via Athena
are different IAM evaluation paths even though they're the same API
call. A test that puts an object and gets it back via S3 directly won't
catch a broken Athena KMS path.
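The policy simulator can make that split visible before anything deploys: the same kms:Decrypt action evaluates differently depending on the kms:ViaService context value supplied. A sketch, with the helper name, policy path, and key ARN as placeholders:

```shell
#!/usr/bin/env bash
# Sketch: evaluate kms:Decrypt under a chosen kms:ViaService context.
# Helper name, policy path, and key ARN are placeholders.
decrypt_decision_via() {
  local service="$1"
  aws iam simulate-custom-policy \
    --policy-input-list "file:///tmp/analyst-policy.json" \
    --action-names kms:Decrypt \
    --resource-arns "arn:aws:kms:us-east-1:ACCOUNT_ID:key/KEY_ID" \
    --context-entries "ContextKeyName=kms:ViaService,ContextKeyValues=${service},ContextKeyType=string" \
    --query 'EvaluationResults[0].EvalDecision' \
    --output text
}
```

Against the AnalystAccess policy above, athena.us-east-1.amazonaws.com should come back allowed, while an unlisted service comes back implicitDeny.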

Second: negative tests matter as much as positive ones. Before I had
the test framework, every permission set I wrote was tested only for
what it should allow. I had no systematic check that it didn't allow
more. The denial tests are what give security reviewers confidence.

Third: persona tests catch failures that service tests miss.
Individual service tests are fast to write and good for regression
coverage, but they test permissions in isolation. Real workflows cross
service boundaries. Build both.

What Changed

Before the framework: five iterations to get one permission set
right, every iteration a production impact. After: 95% of permission
issues caught at PR review time. Zero production impacts from permission
bugs since we shipped it. The templates reduced new permission set
creation time by about 70% --- instead of starting from scratch with the
IAM documentation, we start from a pre-validated base and modify from
there.

The time investment was about a week: two days for templates, two
days for the test framework and scripts, one day for CI/CD integration
and documentation. That investment paid back in the first sprint when
the analyst permission set for a new hire went out correct on the first
deployment.

Running into IAM permission debugging loops on your team? Reach out --- permission testing infrastructure is
one of the first things I build when joining a new platform team.
