Hari Krishna Pokala

Eliminating Static AWS Credentials From GitHub Actions With OIDC and Terragrunt

Quick Start

If you want to clone and run before reading:

  1. Update bootstrap/terraform.tfvars and terragrunt/account.hcl with your account details
  2. Run cd bootstrap && terraform init && terraform apply
  3. Add AWS_ROLE_ARN (bootstrap output) and INFRACOST_API_KEY to GitHub Secrets
  4. Open a PR — Checkov, plan diff, and cost estimate appear as PR comments
  5. Merge to develop → deploys to dev. Merge to main → deploys to prod.

Full repo: github.com/krishph/terragrunt-aws-secure-starter


The Bootstrap Problem

Most Terraform + GitHub Actions setups start with: "Add your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to GitHub Secrets."

That works. But it means a long-lived credential with broad AWS access is sitting in GitHub. Rotate it and you break the pipeline until every reference is updated. Leak it and someone can deploy — or destroy — your infrastructure.

The better approach is OIDC. GitHub mints a short-lived token per workflow run, AWS verifies it matches a trust policy scoped to your specific repo, and the pipeline gets temporary credentials that expire in minutes. No secrets stored, no rotation burden.

But there is a well-known catch: GitHub Actions needs an IAM role to access AWS, and creating that IAM role requires AWS access. You cannot use the pipeline to create the role the pipeline depends on.

The solution is a one-time bootstrap step run locally with your personal credentials. After that, your credentials are never used again. This article walks through that pattern end to end — OIDC bootstrap, Terraform modules, Terragrunt multi-environment config, and a GitHub Actions pipeline with security scanning, cost estimation, and on-demand drift detection (with an optional daily schedule).


Architecture

The demo app is a URL shortener: POST /shorten stores a URL in S3 and returns a 6-character code, GET /{code} resolves it and 301-redirects.

┌───────────────────────────────────────────────────────────────────────────────────┐
│                                   AWS Account                                     │
│                                                                                   │
│  ┌────────────────┐                                                               │
│  │ POST /shorten  ├──┐   ┌──────────────────┐   ┌──────────────────────────────┐  │
│  └────────────────┘  ├──▶│  API Gateway     │   │ VPC (2 AZs)                  │  │
│  ┌────────────────┐  │   │  (REST)          ├──▶│  ┌────────────────────────┐  │  │
│  │  GET /{code}   ├──┘   └──────────────────┘   │  │    Private Subnet      │  │  │    ┌──────────────┐ 
│  └────────────────┘                             │  │  ┌──────────────────┐  │  │    │ S3 Bucket    │
│                                                 │  │  │ Lambda (Py 3.12) ├──┼──┼──┼──▶ │ (URL store)  │
│                                                 │  │  └──────────────────┘  │  │    └──────────────┘
│                                                 │  └────────────────────────┘  │           ▲
│                                                 └──────────────────────────────┘           │
│                                                        S3 Gateway endpoint        │           │
│                                                        (no NAT required) ─────────┼───────────┘
└───────────────────────────────────────────────────────────────────────────────────┘

The pipeline flow:

GitHub
  ├── plan (on PR or manual, env dropdown)
  │     ├── Checkov    →  blocks on misconfiguration
  │     ├── Infracost  →  posts cost estimate to job summary + PR comment
  │     └── Terragrunt plan  →  posts plan diff as PR comment
  │
  ├── apply (on push to develop/main)
  │     └── terragrunt run-all apply  →  AWS (VPC → S3 → Lambda → API Gateway)
  │
  ├── destroy (manual, requires typing DESTROY)
  │     └── terragrunt run-all destroy  →  AWS
  │
  └── drift-detect (manual)
        └── plan -detailed-exitcode against dev + prod in parallel
            opens GitHub issue if live AWS ≠ Terraform state

Why These Choices

Before diving into the code, it is worth explaining the design decisions. These come up in every review.

OIDC over static secrets — Static keys do not expire, cannot be scoped to a single workflow run, and require a rotation process most teams skip. OIDC tokens are single-use and expire in minutes. The only tradeoff is the one-time bootstrap cost, which this article addresses directly.

Terragrunt over raw Terraform — Every Terraform module needs a backend block and a provider block. With three environments and four modules, that is twenty-four places to keep in sync. Terragrunt generates both from a single root config. The dependency block also makes cross-module output references explicit, which is better than copy-pasting ARNs into tfvars.

REST API Gateway over HTTP API — HTTP API is cheaper and faster, but REST API supports per-stage deployments and WAF attachment without extra configuration. For a starter repo that people will extend, REST API is the safer default.

S3 as the URL datastore — DynamoDB would be more appropriate at scale, with indexing and single-digit millisecond reads. S3 is used here deliberately to keep the infrastructure surface small so the article stays focused on the IaC patterns rather than database provisioning. The tradeoff is that S3 object reads add ~10–30ms of latency compared to DynamoDB.

S3 VPC endpoint over NAT for S3 traffic — Lambda runs in private subnets and needs to reach S3. Without a VPC endpoint, every read and write goes through the NAT gateway at $0.045 per GB. An S3 Gateway endpoint is free and routes traffic directly on the AWS network. This is a practical detail most tutorials skip.


Repository Structure

├── bootstrap/                  ← run once from your local machine
│   ├── main.tf                 # OIDC provider, IAM role, state bucket, lock table
│   ├── variables.tf
│   ├── outputs.tf
│   └── terraform.tfvars        # ⚠️  update before running
│
├── terraform/modules/
│   ├── vpc/                    # VPC, subnets, IGW, NAT GW, S3 endpoint, Lambda SG
│   ├── s3/                     # App bucket (versioned, encrypted, private)
│   ├── lambda/                 # Function, IAM role, CloudWatch log group
│   └── apigw/                  # REST API Gateway → Lambda proxy
│
├── terragrunt/
│   ├── terragrunt.hcl          # root: remote state config + provider generation
│   ├── account.hcl             # ⚠️  update account ID, region, bucket name
│   └── environments/
│       ├── dev/  (vpc → s3 → lambda → apigw)
│       └── prod/ (same layout, different CIDRs and retention)
│
├── lambda/index.py
│
└── .github/workflows/
    ├── plan.yml                # PR: security scan + plan + cost estimate
    ├── apply.yml               # push to develop/main: apply
    ├── destroy.yml             # manual only, requires typing DESTROY
    └── drift-detection.yml     # on-demand plan against live infra (schedule optional)

Part 1: Breaking the Bootstrap Loop

The bootstrap/ folder is plain Terraform, no Terragrunt. You run it once from your machine.

What it creates

GitHub OIDC provider — registers token.actions.githubusercontent.com as a trusted identity provider in your AWS account. This is what lets GitHub tokens be verified by AWS STS.

IAM role with a repo-scoped trust policy:

resource "aws_iam_role" "github_actions" {
  name = var.role_name

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:${var.github_org}/${var.github_repo}:*"
        }
      }
    }]
  })
}

The sub condition is the important part. Each GitHub token includes a sub claim like repo:org/repo:ref:refs/heads/main. The condition uses StringLike with a wildcard so any branch or tag in your repo can assume the role, but no other repo can — even within the same GitHub organization.
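You can get a feel for how that wildcard behaves before touching IAM. StringLike supports * and ? wildcards, which behave much like shell globbing, so Python's fnmatchcase is a reasonable local stand-in for experimenting (the helper function here is mine, purely illustrative):

```python
from fnmatch import fnmatchcase

# IAM StringLike-style pattern from the trust policy, scoped to this repo.
PATTERN = "repo:krishph/terragrunt-aws-secure-starter:*"

def would_allow(sub_claim: str) -> bool:
    # Case-sensitive glob match, like IAM's StringLike condition operator.
    return fnmatchcase(sub_claim, PATTERN)

print(would_allow("repo:krishph/terragrunt-aws-secure-starter:ref:refs/heads/main"))  # branch push
print(would_allow("repo:krishph/terragrunt-aws-secure-starter:pull_request"))         # PR run
print(would_allow("repo:someone-else/some-other-repo:ref:refs/heads/main"))           # other repo
```

The first two claims match (any ref or event in the repo); the third is rejected because the org/repo prefix differs.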

S3 state bucket and DynamoDB lock table — Terragrunt's remote state backend must exist before any terragrunt run-all command can run. Bootstrap creates these so the pipeline never hits a "bucket does not exist" error on its first run.
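The core of that backend infrastructure is only a few resources. A sketch — resource and variable names here are illustrative, not necessarily the repo's — showing the versioned state bucket and the DynamoDB table with the LockID hash key that Terraform's S3 backend locking requires:

```hcl
resource "aws_s3_bucket" "tf_state" {
  bucket = var.terraform_state_bucket
}

# Versioning lets you recover an earlier state file after a bad apply.
resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

# Terraform's S3 backend locking expects exactly this key schema.
resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```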

A note on the IAM permissions — The role policy is broad for a demo. It uses lambda:*, apigateway:*, and similar wildcards. In a production setup you would scope each action down to specific resource ARNs, and use IAM Access Analyzer to generate a least-privilege policy from actual usage after your first successful deploy.

Running it

Update bootstrap/terraform.tfvars:

github_org             = "your-github-username"
github_repo            = "your-repo-name"
terraform_state_bucket = "your-unique-bucket-name"  # globally unique across all AWS accounts

Update terragrunt/account.hcl with your account ID and the same bucket name:

locals {
  aws_region             = "us-east-1"
  account_id             = "123456789012"
  terraform_state_bucket = "your-unique-bucket-name"
  terraform_lock_table   = "terraform-state-lock"
}

Then run:

cd bootstrap
terraform init
terraform apply

The output prints the role ARN. Add it to GitHub as a secret named AWS_ROLE_ARN. That is the last time you touch AWS credentials manually.


Part 2: Terraform Modules

Each service gets its own module under terraform/modules/. Modules are pure Terraform with no Terragrunt code, which keeps them reusable and independently testable.

VPC module — S3 endpoint is the detail most tutorials miss

# S3 VPC endpoint routes Lambda → S3 traffic on the AWS network,
# avoiding NAT gateway data charges for every object read/write.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.this.id
  service_name      = "com.amazonaws.${data.aws_region.current.region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = aws_route_table.private[*].id
}

Gateway endpoints are free. The alternative — routing S3 traffic through NAT — costs $0.045 per GB. For a URL shortener with thousands of reads per day that adds up quickly.
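To put rough numbers on that, here is a back-of-the-envelope sketch. The prices and traffic volume are assumptions (us-east-1 list prices at time of writing), not figures from the repo:

```python
# Back-of-the-envelope: S3 traffic via NAT gateway vs. an S3 Gateway endpoint.
# Prices are assumed us-east-1 list prices — check current AWS pricing.
NAT_PER_GB = 0.045       # NAT data-processing charge, USD per GB
NAT_HOURLY = 0.045       # NAT hourly charge, USD per hour
HOURS_PER_MONTH = 730

gb_per_month = 50        # assumed S3 read/write volume

nat_cost = gb_per_month * NAT_PER_GB + HOURS_PER_MONTH * NAT_HOURLY
endpoint_cost = 0.0      # S3 Gateway endpoints carry no data or hourly charge

print(f"via NAT: ${nat_cost:.2f}/month, via Gateway endpoint: ${endpoint_cost:.2f}/month")
```

Even at modest volume, the hourly NAT charge alone dominates — which is why removing NAT from the S3 path matters more than the per-GB rate suggests.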

Lambda module — source_code_hash forces redeployment on code changes

resource "aws_lambda_function" "this" {
  function_name    = var.function_name
  role             = aws_iam_role.lambda.arn
  runtime          = "python3.12"
  filename         = var.filename
  source_code_hash = var.source_hash

  vpc_config {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [var.lambda_security_group_id]
  }
}

Without source_code_hash, Terraform redeploys the function only when another declared attribute changes (such as the filename path) — it never inspects the contents of the zip. The hash ensures any change to the source code triggers a redeployment.

One subtle point: hashing the zip file (filebase64sha256(var.filename)) causes false positives because zip files embed timestamps. A zip rebuilt in CI from identical source will have a different hash than the previously deployed one, so Terraform always sees a change even when the code has not changed. The fix is to hash the source file directly and pass it as an input variable:

# terragrunt/environments/dev/lambda/terragrunt.hcl
inputs = {
  filename    = "${get_repo_root()}/lambda/handler.zip"
  source_hash = filebase64sha256("${get_repo_root()}/lambda/index.py")
}

Now the hash only changes when index.py actually changes, and drift detection stops reporting false positives on every CI run.
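If you want to sanity-check what Terraform computes, filebase64sha256() is just the base64-encoded SHA-256 of the file's bytes, which is easy to reproduce locally (the helper name mirrors the Terraform function; it is mine, not part of the repo):

```python
import base64
import hashlib

def filebase64sha256(path: str) -> str:
    # Mirrors Terraform's filebase64sha256(): SHA-256 over the raw file
    # bytes, then base64-encoded. Identical bytes -> identical hash,
    # regardless of file timestamps — which is exactly why hashing the
    # source file avoids the zip-timestamp false positives.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).digest()
    return base64.b64encode(digest).decode()
```

Run it against lambda/index.py before and after a no-op CI checkout and the value is stable, unlike a hash of a freshly rebuilt zip.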

API Gateway module — proxy integration handles routing in Lambda

The module uses a single {proxy+} resource with AWS_PROXY integration type, forwarding all paths and methods to Lambda. This means the function owns its own routing logic — adding a new endpoint requires no API Gateway changes, only a code update.
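The wiring for that looks roughly like the following — a sketch with illustrative resource names, assuming the REST API resource and a lambda_invoke_arn input already exist in the module:

```hcl
# Catch-all resource: matches every path under the stage root.
resource "aws_api_gateway_resource" "proxy" {
  rest_api_id = aws_api_gateway_rest_api.this.id
  parent_id   = aws_api_gateway_rest_api.this.root_resource_id
  path_part   = "{proxy+}"
}

# ANY method: GET, POST, etc. all flow through the same integration.
resource "aws_api_gateway_method" "proxy" {
  rest_api_id   = aws_api_gateway_rest_api.this.id
  resource_id   = aws_api_gateway_resource.proxy.id
  http_method   = "ANY"
  authorization = "NONE"
}

resource "aws_api_gateway_integration" "proxy" {
  rest_api_id             = aws_api_gateway_rest_api.this.id
  resource_id             = aws_api_gateway_resource.proxy.id
  http_method             = aws_api_gateway_method.proxy.http_method
  integration_http_method = "POST"       # Lambda invocations are always POST
  type                    = "AWS_PROXY"
  uri                     = var.lambda_invoke_arn
}
```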


Part 3: Terragrunt — One Config, Two Environments

The root terragrunt.hcl generates the backend and provider blocks for every module automatically. No copy-pasting:

remote_state {
  backend = "s3"
  config = {
    bucket         = local.terraform_state_bucket
    key            = "${local.environment}/${path_relative_to_include()}/terraform.tfstate"
    region         = local.aws_region
    encrypt        = true
    dynamodb_table = local.terraform_lock_table
  }
}

path_relative_to_include() is what gives each module its own state file automatically. dev/lambda/terragrunt.hcl gets the key dev/lambda/terraform.tfstate. You do not configure this per module.

The dependency block is what makes Terragrunt genuinely useful for multi-module setups:

# terragrunt/environments/dev/lambda/terragrunt.hcl
dependency "vpc" {
  config_path = "../vpc"
}

dependency "s3" {
  config_path = "../s3"
}

inputs = {
  private_subnet_ids       = dependency.vpc.outputs.private_subnet_ids
  lambda_security_group_id = dependency.vpc.outputs.lambda_security_group_id
  s3_bucket_arn            = dependency.s3.outputs.bucket_arn
}

terragrunt run-all apply reads these dependencies, determines the correct order (VPC → S3 → Lambda → API Gateway), and applies them in sequence. No Makefile, no manual ordering.


Part 4: The URL Shortener

The Lambda handler uses S3 as a simple key-value store — POST /shorten generates a 6-character code and writes the original URL as an S3 object body, GET /{code} reads that object and returns a 301 redirect. If the key does not exist, it returns a 404.

boto3 is included in the AWS-managed Python 3.12 runtime, so the Lambda package is just the source file zipped: zip -j lambda/handler.zip lambda/index.py. No dependency bundling needed.
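The handler logic fits in a few lines. Here is a simplified, locally testable sketch of it — the store parameter stands in for the S3 bucket (in the real function the two marked lines would be boto3 put_object/get_object calls), and the function names are illustrative, not the repo's:

```python
import json
import secrets
import string

ALPHABET = string.ascii_letters + string.digits  # 62 characters

def make_code(n: int = 6) -> str:
    # 62^6 ≈ 57 billion combinations — plenty for a demo URL shortener.
    return "".join(secrets.choice(ALPHABET) for _ in range(n))

def handle(event: dict, store: dict) -> dict:
    path = event.get("path", "/")
    method = event.get("httpMethod", "GET")

    if method == "POST" and path.endswith("/shorten"):
        url = json.loads(event.get("body") or "{}")["url"]
        code = make_code()
        store[code] = url                # real handler: s3.put_object(...)
        return {"statusCode": 200, "body": json.dumps({"code": code})}

    code = path.rsplit("/", 1)[-1]
    url = store.get(code)                # real handler: s3.get_object(...)
    if url is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 301, "headers": {"Location": url}, "body": ""}
```

Because the storage is injected, the routing and redirect logic can be exercised with a plain dict before anything is deployed.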


Part 5: The Pipelines

plan.yml — three jobs, triggered by pull requests or manually per environment

Checkov security scan runs first and independently of the plan. It scans the Terraform modules statically — no AWS API calls, no credentials needed. If it finds a misconfiguration (open security group, unencrypted bucket, Lambda without VPC config), the PR is blocked before any plan runs.

Intentional suppressions live in .checkov.yaml with a comment explaining why, so the next person does not re-investigate them:

skip-check:
  - CKV_AWS_116   # Lambda DLQ — not needed for synchronous API use case
  - CKV_AWS_272   # Lambda code signing — out of scope for this demo

Terragrunt plan authenticates via OIDC — the first time the pipeline actually touches AWS — and posts the diff as a PR comment when triggered from a pull request, or to the job summary otherwise. When triggering the workflow manually, the environment is selected from a dropdown, so you can plan against dev or prod independently of which branch you are on.

- name: Configure AWS credentials via OIDC
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
    aws-region: ${{ env.AWS_REGION }}
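One prerequisite that step depends on: the job must be granted permission to request an OIDC token, or configure-aws-credentials cannot fetch one. In the workflow file this is a standard permissions block:

```yaml
permissions:
  id-token: write   # allow the job to request an OIDC token from GitHub
  contents: read    # allow actions/checkout to read the repository
```

Forgetting id-token: write is the most common reason an otherwise-correct OIDC setup fails on its first run.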

Infracost posts a monthly cost estimate to the job summary on every run, and additionally as a PR comment when triggered from a pull request — updating the existing comment on re-runs rather than stacking duplicates.

A reviewer now sees — in a single PR — what security posture changes, what infrastructure changes, and what the monthly bill changes by.

apply.yml — unattended on merge

The apply pipeline is intentionally simple. The PR review is the approval gate. Pushing to develop deploys to dev; pushing to main deploys to prod. Once merged, terragrunt run-all apply runs without prompts in the correct dependency order.

destroy.yml — manual only

Destroy is workflow_dispatch only — no branch trigger, no schedule. It requires navigating to the Actions tab, selecting the environment, and typing DESTROY into a confirmation field before the job will run. There is no path that destroys infrastructure automatically.

drift-detection.yml — the practical one

This pipeline runs on demand from the Actions tab using plan -detailed-exitcode:

exit 0 = no changes, live AWS matches state
exit 1 = error
exit 2 = changes detected, drift found
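In a shell step, capturing and classifying that exit code looks roughly like this — a sketch, with a helper name of my choosing rather than the repo's exact workflow code:

```shell
# Map `plan -detailed-exitcode` results to pipeline outcomes.
classify_plan_exit() {
  case "$1" in
    0) echo "no-drift" ;;   # live AWS matches state
    2) echo "drift" ;;      # changes detected — open/append the GitHub issue
    *) echo "error" ;;      # the plan itself failed
  esac
}

# Usage in a job step. GitHub Actions runs bash with -e, so capture the
# code without letting a non-zero exit abort the step:
# terragrunt run-all plan -detailed-exitcode || exit_code=$?
# classify_plan_exit "${exit_code:-0}"
```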

When drift is detected it opens a GitHub issue with the full plan diff. If an issue for that environment is already open, it adds a comment instead of creating a duplicate — so after a week of ignored drift you have one issue with comments, not seven identical issues.

The most common source of drift in practice: someone made a manual fix in the AWS console under pressure, noted they would "put it in Terraform later," and did not. Drift detection is what catches that before it causes an incident.

Enabling a schedule is a one-line change in the workflow file:

on:
  workflow_dispatch:
  schedule:
    - cron: "0 6 * * *"   # daily at 06:00 UTC

Two things to know before you enable it. First, scheduled GitHub Actions workflows do not have access to environment-scoped secrets — AWS_ROLE_ARN must be set as a repository-level secret, not scoped to the dev or prod environment. Second, each run downloads providers and plans all modules in both environments, which takes around five minutes. If that generates too much noise, limit the matrix to prod only or switch to weekly (0 6 * * 1).


Production Tradeoffs

Things this repo deliberately simplifies that a real production setup would revisit:

IAM permissions — The bootstrap role policy uses service-level wildcards (lambda:*, apigateway:*). Fine for a demo. For production, scope to specific resource ARNs and use IAM Access Analyzer on real usage logs to generate the minimum required policy.

NAT gateway vs VPC endpoint — The VPC module includes an S3 Gateway endpoint to avoid NAT costs for S3 traffic. For other AWS services Lambda calls (SSM, Secrets Manager, STS), you would add Interface endpoints and remove NAT gateways entirely, reducing both cost and attack surface.

REST API Gateway vs HTTP API — HTTP API is ~70% cheaper per million requests and has lower latency. REST API is used here because it supports WAF, per-stage throttling, and usage plans without additional configuration. If you do not need those features, switch to HTTP API.

S3 vs DynamoDB for URL storage — S3 object reads add ~10–30ms of latency. DynamoDB would be more appropriate at scale, with consistent single-digit millisecond reads and native support for TTL-based expiry of short codes. The tradeoff is another module and a DynamoDB table in every environment.

Module versioning — The Terragrunt configs reference local modules directly with a relative path. In a multi-team setup, you would publish modules to a private registry or reference tagged Git releases so environments can pin to a specific version.

Drift detection scheduling — Drift detection in this repo is manual by default. To run it on a schedule, uncomment the schedule block in drift-detection.yml. Teams that frequently change resources outside Terraform, or that use feature flags, often find daily scheduled runs noisy. A practical alternative is to run drift detection only on prod, or switch to weekly. See the note in the workflow file for the scheduling considerations.


What Can Go Wrong

State bucket name collision — S3 bucket names are globally unique across all AWS accounts. If someone already has your chosen bucket name, bootstrap will fail with BucketAlreadyExists. Make the name specific: include your GitHub username and a project identifier.

OIDC subject condition mismatch — The trust policy uses StringLike with repo:org/repo:*. If you fork the repo, the fork's organization or username will not match. Pull requests from forks will not have access to secrets and the plan job will fail silently. This is expected behavior for security reasons.

Terragrunt dependency ordering surprise — If you add a new module and forget to declare a dependency block for it, run-all apply may apply modules in parallel in an order that fails. The error message from Terragrunt is usually clear, but it can be confusing the first time. Always declare dependencies explicitly even if the ordering seems obvious.

Plan output exceeding GitHub comment limits — GitHub PR comments are capped at 65536 characters. The pipeline truncates plan output and appends ...(truncated) if it exceeds the limit. For very large plans, use the Actions run log instead.
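The truncation itself is worth getting right: budget for the marker so the final comment stays under the cap. A sketch of the idea (helper name mine):

```python
LIMIT = 65536  # GitHub PR/issue comment size cap, in characters

def truncate_for_comment(text: str, limit: int = LIMIT,
                         marker: str = "\n...(truncated)") -> str:
    # Keep the head of the plan output; the interesting resource changes
    # usually appear early, and the full diff lives in the Actions log.
    if len(text) <= limit:
        return text
    return text[: limit - len(marker)] + marker
```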

DynamoDB lock not released after a failed run — If a pipeline run is cancelled mid-apply, the DynamoDB lock may not be released. The next run will fail with Error acquiring the state lock. Release it with terraform force-unlock <LOCK_ID> run from the relevant module directory.


End-to-End Test

Once deployed, get the API Gateway URL:

cd terragrunt/environments/dev/apigw
terragrunt output invoke_url

Shorten a URL:

curl -X POST https://<invoke-url>/dev/shorten \
  -H "Content-Type: application/json" \
  -d '{"url": "https://devto.com"}'

# {"code": "aB3xYz", "short_url": "https://<invoke-url>/dev/aB3xYz"}

Resolve it:

curl -L https://<invoke-url>/dev/aB3xYz
# 301 → https://devto.com

Summary

Concern                           Solution
-------                           --------
No static credentials             GitHub OIDC → temporary STS tokens per run
Bootstrap chicken-and-egg         One-time local terraform apply in bootstrap/
DRY multi-environment config      Terragrunt root config + dependency blocks
Security misconfigs caught early  Checkov on every PR
Cost visibility before merge      Infracost posts breakdown as PR comment
NAT costs for S3 traffic          S3 VPC Gateway endpoint
Manual changes caught             On-demand drift detection, GitHub issue on drift
Accidental destroy                workflow_dispatch only + DESTROY confirmation

What to Add Next

  • Terratest — write Go tests that deploy to a throwaway environment, hit the API, and destroy. Gives you infrastructure integration tests, not just plan validation.
  • Custom domain — ACM + Route 53 + API Gateway custom domain so short codes are on your own domain.
  • WAF — attach a Web ACL to API Gateway to rate-limit the /shorten endpoint against abuse.
  • Least-privilege IAM — use IAM Access Analyzer on the first successful deploy to generate a scoped policy for the GitHub Actions role.

Full source: github.com/krishph/terragrunt-aws-secure-starter
