Henry A
How I Packaged 130+ Hours of AWS Infrastructure Into Reusable Templates

Every new project starts the same way.

You spin up an AWS account. You need a VPC. Three AZs, public and private subnets, NAT Gateways. Then you need security — CloudTrail, GuardDuty, IAM hardening. Then CI/CD pipelines. Then Terraform modules. Then Docker Compose for local dev. Then Nginx for production.

Each time, you either copy-paste from your last project (hoping nothing broke) or build from scratch (knowing how long it takes).

I got tired of it. So I packaged every infrastructure pattern I keep rebuilding into reusable templates. 75+ files across 7 products. This article walks through the architecture decisions behind each one, with free samples you can use right now.


1. The Security Checklist No One Actually Follows

Most AWS accounts ship with default settings. No audit trail. No MFA enforcement. S3 buckets one misconfiguration away from being public.

The CIS AWS Foundations Benchmark exists, but it's a 300-page PDF. Nobody reads it. So I distilled it into a 50-point checklist and wrote CloudFormation templates that implement the critical controls.

Here's the full checklist — free, no catch:

AWS Security Hardening Checklist (50 points)

Identity & Access Management

  • [ ] Enable MFA on root account
  • [ ] Do NOT use root account for daily tasks
  • [ ] Enable MFA for all IAM users with console access
  • [ ] Set strong password policy (14+ chars, complexity, rotation)
  • [ ] Remove unused IAM users
  • [ ] Remove unused IAM credentials (access keys)
  • [ ] Rotate access keys every 90 days
  • [ ] Attach policies to groups, not directly to users
  • [ ] Use IAM roles for applications, not access keys
  • [ ] Implement least-privilege permissions
  • [ ] Use AWS SSO for multi-account access
  • [ ] Review IAM policies for * resource permissions

Logging & Monitoring

  • [ ] Enable CloudTrail in all regions
  • [ ] Enable CloudTrail log file validation
  • [ ] Ensure CloudTrail logs are encrypted (KMS)
  • [ ] Enable CloudTrail log delivery to S3+CloudWatch
  • [ ] Enable AWS Config in all regions
  • [ ] Enable VPC Flow Logs for all VPCs
  • [ ] Enable GuardDuty
  • [ ] Enable Security Hub
  • [ ] Set up CloudWatch alarms for: root usage, unauthorized API calls, console sign-in failures
  • [ ] Enable S3 access logging for critical buckets
  • [ ] Configure SNS alerts for security findings

Networking

  • [ ] No security groups allow 0.0.0.0/0 to port 22 (SSH)
  • [ ] No security groups allow 0.0.0.0/0 to port 3389 (RDP)
  • [ ] Default security group restricts all traffic
  • [ ] Use private subnets for databases and application servers
  • [ ] Use VPC endpoints for AWS service access (S3, DynamoDB, ECR)
  • [ ] Enable DNS query logging
  • [ ] Use Network ACLs as secondary defense layer
  • [ ] No public IP on EC2 instances in private subnets

Data Protection

  • [ ] Enable S3 Block Public Access at account level
  • [ ] Enable S3 default encryption on all buckets
  • [ ] Enable S3 versioning on critical buckets
  • [ ] Enable RDS encryption at rest
  • [ ] Enable RDS automated backups
  • [ ] Enable EBS encryption by default
  • [ ] Use KMS CMKs for sensitive data (not default keys)
  • [ ] Enable SSL/TLS for data in transit
  • [ ] Enforce S3 bucket policies requiring SSL

Compute

  • [ ] Use IMDSv2 (require HTTP tokens) on all EC2 instances
  • [ ] Keep AMIs patched and up to date
  • [ ] Use Systems Manager Session Manager instead of SSH
  • [ ] Enable detailed monitoring on production instances
  • [ ] Use Auto Scaling groups (no single points of failure)

Account & Organization

  • [ ] Enable AWS Organizations with SCPs
  • [ ] Deny root account usage via SCP
  • [ ] Restrict regions via SCP
  • [ ] Prevent disabling of CloudTrail/GuardDuty/Config via SCP
  • [ ] Enable AWS Budgets alerts (catch unexpected spend = possible compromise)

How to use this: Score yourself. Every unchecked item is a risk. The CloudFormation templates in my Security Hardening Kit automate items 1-23 and 32-40 with 4 aws cloudformation deploy commands.
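Deploying a CloudFormation template is a single CLI call per stack. A sketch of what those deploys look like (the stack and file names here are placeholders, not the kit's actual file names):

```shell
# Hypothetical template names — substitute the actual files you deploy
aws cloudformation deploy \
  --stack-name security-iam-baseline \
  --template-file iam-baseline.yaml \
  --capabilities CAPABILITY_NAMED_IAM

aws cloudformation deploy \
  --stack-name security-logging-baseline \
  --template-file cloudtrail-config.yaml
```

The --capabilities flag is only needed for stacks that create named IAM resources.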


2. VPC Architecture — Why Three Subnet Tiers, Not Two

The most common VPC mistake I see: two tiers (public + private). This forces your databases into the same subnets as your application servers.

The architecture that works:

Internet Gateway
       │
   ┌───┴───┐
   │Public │ ← ALB, NAT Gateways, Bastion
   │Subnets│   (3 AZs: 10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
   └───┬───┘
       │
   ┌───┴───┐
   │Private│ ← App servers, ECS tasks, Lambda
   │Subnets│   (3 AZs: 10.0.11.0/24, 10.0.12.0/24, 10.0.13.0/24)
   └───┬───┘
       │
   ┌───┴────┐
   │Isolated│ ← RDS, ElastiCache, Secrets
   │Subnets │   (3 AZs: 10.0.21.0/24, 10.0.22.0/24, 10.0.23.0/24)
   └────────┘
       │
   No route to internet. No NAT. Nothing gets in or out.

Why three tiers:

  • Public subnets have a route to the Internet Gateway. Only load balancers and NAT Gateways live here.
  • Private subnets have a route to NAT Gateway (outbound internet only). App servers live here — they can pull packages and call APIs, but nothing can reach them directly.
  • Isolated subnets have NO route to the internet. Period. Databases live here. The only way to reach them is from within the VPC. Even if an attacker compromises your app server, they can't exfiltrate data from your database subnet to the internet.
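In CloudFormation terms, "no route to the internet" just means the isolated subnets get a route table that never receives an IGW or NAT route, leaving only the implicit local VPC route. A minimal sketch (logical IDs are placeholders):

```yaml
IsolatedRouteTable:
  Type: AWS::EC2::RouteTable
  Properties:
    VpcId: !Ref Vpc
    # No AWS::EC2::Route resources attached: local VPC routing only

IsolatedSubnetARouteAssociation:
  Type: AWS::EC2::SubnetRouteTableAssociation
  Properties:
    RouteTableId: !Ref IsolatedRouteTable
    SubnetId: !Ref IsolatedSubnetA
```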

The NAT Gateway cost trap:

NAT Gateways cost ~$32/month each. A production VPC with 3 AZs needs 3 NAT Gateways = ~$96/month just for outbound internet access in private subnets.

My templates include a dev variant: single NAT Gateway, 2 AZs, ~$32/month. Same architecture, lower cost for non-production environments.

VPC Endpoints slash your NAT bill further:

Every time your EC2 instance or Lambda calls S3, DynamoDB, ECR, or CloudWatch, that traffic goes through the NAT Gateway by default. You're paying data processing fees for traffic that never needed to leave AWS.

VPC Endpoints route that traffic over AWS's private network instead. Free for gateway endpoints (S3, DynamoDB), pennies for interface endpoints. I include endpoints for the 5 services that generate the most NAT traffic.
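As a sketch, a gateway endpoint for S3 in CloudFormation looks like this (logical IDs and the route table reference are placeholders):

```yaml
S3GatewayEndpoint:
  Type: AWS::EC2::VPCEndpoint
  Properties:
    VpcId: !Ref Vpc
    ServiceName: !Sub "com.amazonaws.${AWS::Region}.s3"
    VpcEndpointType: Gateway
    RouteTableIds:
      - !Ref PrivateRouteTableA   # S3 traffic from this table now bypasses NAT
```

Once the endpoint exists, S3 calls from instances in the associated subnets stop traversing the NAT Gateway automatically; no application changes needed.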

My VPC Starter Kit includes the full production VPC, dev variant, security groups, VPC endpoints, and a hardened bastion — 5 CloudFormation templates, 1,376 lines.


3. CI/CD — Why OIDC and Why It Matters

Here's a pattern I still see in production:

# DON'T DO THIS
- name: Configure AWS credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
    aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

Long-lived access keys stored as GitHub secrets. Anyone with write access to the repo can exfiltrate them from a workflow step (GitHub masks secrets in logs, but the masking is trivial to work around). If GitHub gets breached, your keys are exposed. And the keys never expire unless you rotate them manually.

The fix is OIDC (OpenID Connect). GitHub proves its identity to AWS using a signed token, and AWS gives back temporary credentials that expire in 1 hour:

permissions:
  id-token: write  # Required for OIDC

steps:
  - name: Configure AWS credentials
    uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
      aws-region: us-east-1

No long-lived secrets. No keys to rotate. The IAM role's trust policy restricts which repositories and branches can assume it.
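That trust policy is where the restriction lives. A sketch of one that locks the role to a single repo and branch (the account ID, org, and repo names are placeholders):

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"
      }
    }
  }]
}
```

Widen the sub condition (e.g. repo:my-org/my-repo:*) only if you genuinely want every branch and PR to deploy.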

The Trivy pattern — another thing I build into every Docker workflow:

- name: Build image (local for scanning)
  uses: docker/build-push-action@v5
  with:
    context: .
    load: true           # Build locally, don't push yet
    tags: ${{ steps.meta.outputs.tags }}

- name: Run Trivy vulnerability scan
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: ${{ steps.meta.outputs.tags }}  # Scan the image we just built
    format: table
    exit-code: 1         # FAIL the build
    severity: CRITICAL,HIGH
    ignore-unfixed: true # Don't flag unfixable CVEs

- name: Push to ECR      # Only runs if Trivy passes
  uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: ${{ steps.meta.outputs.tags }}

Build → scan → push. If Trivy finds a CRITICAL or HIGH vulnerability, the build fails. The image never reaches your registry. This catches vulnerable base images and dependencies before they hit production.

My GitHub Actions CI/CD Template Pack includes 5 complete workflows with OIDC, Trivy, and multi-environment approval gates.


4. Terraform Module Design — Keep the Interfaces Clean

The mistake: one giant main.tf with everything in it. 800 lines, impossible to reuse, breaks when you sneeze.

The pattern that works: small modules with clean interfaces.

# Root composition — this is all you touch per environment
module "vpc" {
  source       = "./modules/vpc"
  environment  = var.environment
  vpc_cidr     = var.vpc_cidr
  az_count     = var.az_count
}

module "ec2" {
  source          = "./modules/ec2"
  environment     = var.environment
  vpc_id          = module.vpc.vpc_id
  private_subnets = module.vpc.private_subnet_ids
  public_subnets  = module.vpc.public_subnet_ids
}

module "rds" {
  source             = "./modules/rds"
  environment        = var.environment
  vpc_id             = module.vpc.vpc_id
  db_subnets         = module.vpc.isolated_subnet_ids
  app_security_group = module.ec2.app_sg_id
}

Why this works:

  • Each module has a clear input/output contract. VPC outputs subnet IDs, EC2 consumes them.
  • Environment separation through tfvars, not code duplication. Same modules, different variables:
# environments/dev/terraform.tfvars
environment   = "dev"
instance_type = "t3.micro"
az_count      = 2
multi_az_rds  = false

# environments/production/terraform.tfvars
environment   = "production"
instance_type = "t3.large"
az_count      = 3
multi_az_rds  = true
  • Want to add a new module (say, ElastiCache)? Add it to the root composition. Existing modules don't change.
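The contract side of a module looks roughly like this (variable and output names match the root composition above; the resource names inside the module are assumptions for illustration):

```hcl
# modules/vpc/variables.tf — the module's inputs
variable "environment" { type = string }
variable "vpc_cidr"    { type = string }
variable "az_count"    { type = number }

# modules/vpc/outputs.tf — the module's public interface
output "vpc_id"              { value = aws_vpc.main.id }
output "public_subnet_ids"   { value = aws_subnet.public[*].id }
output "private_subnet_ids"  { value = aws_subnet.private[*].id }
output "isolated_subnet_ids" { value = aws_subnet.isolated[*].id }
```

Everything not declared in these two files is private to the module, which is what keeps the root composition small.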

My Terraform AWS Starter Pack includes 5 modules (VPC, EC2+ASG+ALB, RDS Multi-AZ, IAM, S3) with this exact pattern.


5. Docker Compose — Health Checks Change Everything

Most Docker Compose files I see in the wild look like this:

services:
  app:
    build: .
    ports:
      - "3000:3000"
    depends_on:
      - db
  db:
    image: postgres:16

The problem: depends_on only waits for the container to start, not for the service inside it to be ready. Your app crashes on boot because Postgres hasn't finished initializing.

The fix:

services:
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: ${DB_NAME:-myapp}
      POSTGRES_USER: ${DB_USER:-postgres}
      POSTGRES_PASSWORD: ${DB_PASSWORD:-postgres}
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${DB_USER:-postgres}"]
      interval: 5s
      timeout: 3s
      retries: 5

  app:
    build: .
    ports:
      - "${APP_PORT:-3000}:3000"
    depends_on:
      db:
        condition: service_healthy  # Waits for healthcheck
    environment:
      DATABASE_URL: postgres://${DB_USER:-postgres}:${DB_PASSWORD:-postgres}@db:5432/${DB_NAME:-myapp}

volumes:
  pgdata:

What changed:

  • healthcheck on Postgres uses pg_isready — only reports healthy when Postgres can accept connections
  • depends_on with condition: service_healthy makes the app wait for a real readiness signal
  • Environment variables with defaults via .env — no hardcoded credentials
  • Named volume so data survives docker compose down
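The .env file those ${VAR:-default} references read from is plain key=value pairs in the project root (values here are placeholders; never commit real credentials):

```shell
DB_NAME=myapp
DB_USER=postgres
DB_PASSWORD=change-me-locally
APP_PORT=3000
```

Compose loads it automatically, and the :-default fallbacks mean the stack still boots if the file is missing.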

My Docker Compose Templates include 8 stacks (Node+Postgres, Django+Celery, MERN, Go, Rails+Sidekiq, Spring Boot, Next.js, Traefik) all with health checks, volumes, and .env management.


6. Nginx — The SSL Config That Gets an A+

Getting an A+ on SSL Labs shouldn't take 3 hours of Googling. Here's the config pattern:

ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384;
ssl_prefer_server_ciphers off;

# OCSP Stapling — proves your cert isn't revoked without the client calling the CA
ssl_stapling on;
ssl_stapling_verify on;
resolver 1.1.1.1 8.8.8.8 valid=300s;

# Session caching — avoids full TLS handshake on reconnection
ssl_session_cache shared:SSL:10m;
ssl_session_timeout 1d;
ssl_session_tickets off;  # Disable for perfect forward secrecy

Why these choices:

  • TLS 1.2+1.3 only — TLS 1.0 and 1.1 are deprecated. Drop them.
  • ssl_prefer_server_ciphers off — counterintuitive, but with TLS 1.3 the client picks the cipher. Setting this to on only matters for TLS 1.2 and can actually result in worse cipher selection with modern clients.
  • OCSP stapling — your server fetches the cert status from the CA and includes it in the handshake. The client doesn't need to make a separate request to verify the cert isn't revoked. Faster and more private.

And the security headers that complete the picture:

add_header Strict-Transport-Security "max-age=63072000; includeSubDomains; preload" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-Frame-Options "DENY" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
add_header Permissions-Policy "camera=(), microphone=(), geolocation=()" always;
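One piece the snippets above assume: a plain-HTTP server block whose only job is to redirect. A minimal sketch (the domain is a placeholder):

```nginx
# Redirect all plain HTTP to HTTPS; serves no content itself
server {
    listen 80;
    listen [::]:80;
    server_name example.com;
    return 301 https://$host$request_uri;
}
```

With HSTS set above, browsers only hit this block on their very first visit; after that they go straight to HTTPS.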

My Nginx Config Pack includes 6 production configs: reverse proxy, load balancer, static site with SPA fallback, API gateway with rate limiting, SSL termination, and security headers.


7. Lambda — Partial Batch Failure Reporting

This is the Lambda pattern most people get wrong with SQS:

def handler(event, context):
    for record in event['Records']:
        process(record)  # If ANY message fails, ALL get retried

If your batch has 10 messages and message #7 fails, SQS retries all 10. The nine messages that succeeded get processed again. Message #7 still fails, so the batch keeps bouncing: duplicate processing on every retry and, without a dead-letter queue, no end to the cycle.

The fix — partial batch failure reporting:

def handler(event, context):
    failures = []

    for record in event['Records']:
        try:
            process(record)
        except Exception as e:
            failures.append({
                'itemIdentifier': record['messageId']
            })

    return {
        'batchItemFailures': failures  # Only retry the failed ones
    }

With FunctionResponseTypes: [ReportBatchItemFailures] in your SAM template, SQS only retries the specific messages that failed. The successful ones are removed from the queue.
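The pattern is easy to sanity-check locally, no AWS required. Here process is a stand-in that fails on a marked record:

```python
def process(record):
    # Stand-in for real work: fail on records marked "bad"
    if record["body"] == "bad":
        raise ValueError("unprocessable")

def handler(event, context=None):
    failures = []
    for record in event["Records"]:
        try:
            process(record)
        except Exception:
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}

event = {"Records": [
    {"messageId": "m1", "body": "ok"},
    {"messageId": "m2", "body": "bad"},
    {"messageId": "m3", "body": "ok"},
]}
print(handler(event))
# → {'batchItemFailures': [{'itemIdentifier': 'm2'}]}
```

Only m2 is reported back, so only m2 returns to the queue for retry.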

My Lambda Starter Templates include 5 SAM-based patterns: REST API + DynamoDB CRUD, S3 event processor, SQS batch worker (with partial failures), scheduled EventBridge tasks, and a custom JWT authorizer.


The Full Stack

Each of these patterns is something I've built and rebuilt multiple times. I packaged them into 7 products:

  • AWS Security Hardening Kit ($19): 4 CloudFormation stacks + 50-point checklist
  • VPC Starter Kit ($24): 5 templates (production VPC, dev VPC, security groups, endpoints, bastion)
  • GitHub Actions CI/CD ($15): 5 workflows (Python, Java, Docker+Trivy, Terraform, multi-env)
  • Terraform Starter Pack ($29): 5 modules (VPC, EC2+ASG+ALB, RDS, IAM, S3)
  • Docker Compose Templates ($15): 8 stacks (Node, Django, MERN, Go, Rails, Spring Boot, Next.js, Traefik)
  • Nginx Config Pack ($13): 6 configs (reverse proxy, load balancer, SSL, API gateway, security headers)
  • Lambda Starter Templates ($19): 5 SAM patterns (API CRUD, S3 processor, SQS worker, scheduler, authorizer)
  • Full Bundle ($69): all 7 products, 75+ templates

Every template is production-tested, fully commented, and licensed for unlimited commercial use. No subscriptions, no SaaS — download the files and they're yours.

If the 50-point security checklist above was useful, the full templates go deeper. Grab the bundle or pick the individual products you need.


Questions about any of the patterns? Drop a comment — happy to go deeper on any of these.
