The Challenge
I was approached by a mid-sized fintech company — let's call them PayStream Solutions — that was processing roughly 4 million transactions daily across a microservices architecture running on ECS Fargate. On the surface, they had a functioning CI/CD pipeline. Engineers could push code and see it running in production within about 45 minutes. That sounds fine until you look at what was actually happening inside that pipeline.
There was no security. Not "minimal security" — I mean literally no security gates. Developers pushed Docker images directly to ECR without any vulnerability scanning. Infrastructure changes were applied manually by a senior engineer who had AdministratorAccess on the production account. Secrets were hardcoded in environment variables visible in the CodeBuild console logs. And the compliance team was running quarterly manual audits that produced 80-page PDF reports nobody read.
The specific pain points the CTO articulated when I sat down with the leadership team:
- Regulatory pressure: PayStream was preparing for PCI-DSS Level 1 certification. Their QSA (Qualified Security Assessor) had flagged the lack of continuous compliance monitoring as a critical gap.
- Incident aftermath: Three months prior, a developer accidentally pushed a container image built on an `ubuntu:20.04` base that carried 14 known HIGH-severity CVEs. Nobody caught it. It ran in production for 11 days before a routine scan surfaced it.
- No audit trail: When the compliance team asked "who deployed what, when, and what was the security posture at deploy time?", the answer was a shrug and a CloudTrail log nobody knew how to read.
- Slow feedback loops: Security findings from their periodic scans took weeks to route back to developers, by which time the affected code had already been superseded three times.
- Budget constraints: They had a hard ceiling of roughly $8,000/month for the entire DevOps toolchain, which ruled out third-party SAST/DAST platforms like Veracode or Checkmarx in their enterprise tiers.
The timeline pressure was acute — the PCI-DSS audit was scheduled in 90 days. That gave me roughly 12 weeks to design, implement, test, and document the entire pipeline transformation. It was tight but doable with the right architecture decisions upfront.
Initial Assessment
I spent the first week doing nothing but observing and asking uncomfortable questions. I've found that the most dangerous assumptions in any infrastructure engagement are the ones everyone considers "obviously true." My job was to challenge those.
What I discovered during the analysis:
I pulled up the existing CodePipeline configuration and found a two-stage setup: Source (CodeCommit) → Deploy (a manual ECS deploy via a CLI script embedded in a CodeBuild project). That's it. No test stage. No build validation. The "build" step was literally `docker build && docker push`. The deploy step was an `aws ecs update-service` call.
Here's what the metrics told me:
- Mean time to detect (MTTD) a vulnerability: 47 days (based on the last four incidents)
- Mean time to remediate (MTTR): 18 days after detection
- Pipeline execution time: 43 minutes average (almost all of it was the ECS service update waiting for health checks)
- False production deployments per month: 6 — meaning code that failed basic functional tests still reached production because there were no automated gates
- Manual security review overhead: Two engineers spending ~30% of their time on security-adjacent work that should have been automated
I interviewed the DevOps lead, two senior developers, the CISO, and the compliance manager. The CISO's exact words were: "I don't know what's running in my containers right now, and that scares me." That sentence became the north star for the entire engagement.
The compliance manager showed me their current evidence collection process for audits — a spreadsheet with manual screenshots. My reaction was visceral. They needed evidence to be machine-generated, timestamped, immutable, and linkable to specific pipeline executions.
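To make "machine-generated, timestamped, immutable, and linkable" concrete, here is a minimal sketch of what such an evidence record might look like. The field names and function are illustrative assumptions, not an official schema; the content hash makes tampering detectable once records land in a write-once store (e.g. S3 with Object Lock).

```python
import hashlib
import json
from datetime import datetime, timezone

def build_evidence_record(pipeline_execution_id, stage, findings_summary):
    """Build an audit evidence record tied to a specific pipeline execution.

    Illustrative sketch only -- the field names are assumptions. The
    SHA-256 over the canonical JSON lets an auditor verify the record
    was not altered after generation.
    """
    record = {
        "pipelineExecutionId": pipeline_execution_id,
        "stage": stage,
        "generatedAt": datetime.now(timezone.utc).isoformat(),
        "findingsSummary": findings_summary,
    }
    payload = json.dumps(record, sort_keys=True)
    record["contentSha256"] = hashlib.sha256(payload.encode()).hexdigest()
    return record
```

A record like this replaces a screenshot: it answers "who deployed what, when, and what was the security posture" with data an auditor can verify independently.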
Risk factors I identified:
- No separation between the IAM roles used for CI and those used for CD — the build role could deploy to production directly
- Secrets Manager was not being used; credentials were in Parameter Store as plaintext strings
- ECR repositories had no lifecycle policies, so image bloat was costing approximately $340/month in unnecessary image storage
- No VPC Flow Logs were enabled on the production VPC — a PCI-DSS requirement
- CloudTrail was enabled but logs were flowing to a bucket in the same account, making them susceptible to tampering
- No GuardDuty, no Security Hub, no Config Rules — essentially no detective control layer whatsoever
Solution Design
The architecture I proposed centered on a single principle: security as a first-class pipeline citizen, not an afterthought bolted on at the end. Every gate in the pipeline had to be automated, auditable, and fail-closed — meaning if a security check couldn't run, the pipeline failed. Not warned. Failed.
AWS Services Selected (and Why)
AWS CodePipeline (V2) was the natural orchestration backbone. V2 introduced pipeline-level variables and triggers that V1 lacked, which was important for passing scan results between pipeline stages without writing to S3 and reading back. CodePipeline also integrates natively with EventBridge, which meant I could fire compliance events without custom Lambda glue code.
AWS CodeBuild handled all the compute-intensive security scanning tasks. The key architectural decision here was splitting security scanning into separate CodeBuild projects rather than stuffing everything into one monolithic buildspec. This gave us independent scalability, cleaner failure attribution ("the SAST stage failed, not the build stage"), and cheaper retries on transient failures.
Amazon ECR with Enhanced Scanning (Amazon Inspector v2) replaced the basic CVE scanning that was previously disabled. Enhanced scanning provides continuous monitoring — not just on-push — and sends findings to Security Hub and EventBridge automatically. The basic scanning uses the Clair project database, while enhanced scanning uses Inspector's more comprehensive intelligence that includes OS packages and programming language packages.
AWS Security Hub became the single pane of glass for all security findings. Inspector, GuardDuty, Config, and our custom CodeBuild scan results all funneled into Security Hub using the AWS Security Finding Format (ASFF). This was critical for the PCI-DSS audit — one place to pull evidence from.
AWS Config + Conformance Packs handled the automated compliance validation layer. Config continuously evaluates resource configurations against rules. CodePipeline can query Config compliance status as a pipeline gate — if your infrastructure drift check fails, the deployment doesn't proceed.
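The fail-closed semantics of that gate can be sketched in a few lines. This is a pure-logic illustration, not the Config API: the record shape and rule names are assumptions, and the key behavior is that a check that could not run blocks the deployment rather than warning.

```python
def config_gate(evaluations):
    """Fail-closed gate over Config rule evaluation results.

    `evaluations` mimics the shape of Config compliance results: a list
    of {"rule": str, "compliance": "COMPLIANT" | "NON_COMPLIANT" |
    "INSUFFICIENT_DATA"}. Fail-closed means anything other than an
    explicit COMPLIANT blocks the deployment -- including the case
    where no evaluations came back at all.
    """
    if not evaluations:  # the check didn't run: fail, don't warn
        return False, ["no evaluations returned"]
    blockers = [e["rule"] for e in evaluations
                if e["compliance"] != "COMPLIANT"]
    return len(blockers) == 0, blockers
```

The empty-list branch is the important design choice: a scanner outage must never silently wave a deployment through.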
AWS Secrets Manager replaced every hardcoded credential and Parameter Store plaintext value. Secrets Manager supports automatic rotation, which is a hard requirement for PCI-DSS.
AWS KMS (Customer Managed Keys) was used to encrypt everything: CodePipeline artifacts in S3, CodeBuild environment variables, ECR images, and CloudWatch Logs. Using CMKs rather than AWS-managed keys gave us fine-grained control over who could decrypt what and when — critical for the separation-of-duties requirement.
Amazon Inspector v2 provided the InspectorScan action that is now natively available in CodePipeline, which can run both source code scans and ECR image scans as first-class pipeline actions without any custom CodeBuild glue.
AWS Systems Manager Session Manager replaced all SSH/bastion access. No port 22, no key pairs, no bastion hosts — Session Manager provides browser-based or CLI shell access to EC2 and ECS container instances with full session logging to CloudWatch and S3.
Architecture Description
Cost vs. Performance Trade-offs
The most contentious design discussion was around CodeBuild compute sizing. Running security scans on BUILD_GENERAL1_LARGE instances (4 vCPU, 7 GB) versus BUILD_GENERAL1_MEDIUM (2 vCPU, 3.75 GB) was roughly a 2x cost difference per build minute. I ran the math with the team: at roughly 80 pipeline executions per day, the SAST scan stage alone would cost ~$180/month on LARGE versus ~$90/month on MEDIUM. We went with MEDIUM for all scan stages and LARGE only for the Docker build itself, which is the most compute-intensive step. That's a practical FinOps decision that most teams overlook.
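The arithmetic behind those figures is easy to reproduce. The per-minute prices and the ~7.5-minute scan duration below are assumptions chosen to match the numbers above; check current CodeBuild pricing for your region before reusing them.

```python
def monthly_stage_cost(executions_per_day, minutes_per_run,
                       price_per_minute, days=30):
    """Monthly CodeBuild cost for a single pipeline stage."""
    return executions_per_day * minutes_per_run * price_per_minute * days

# Assumed list prices per Linux build minute (verify for your region):
MEDIUM, LARGE = 0.005, 0.01

# A ~7.5-minute SAST stage at 80 executions/day reproduces the
# roughly $180 (LARGE) vs $90 (MEDIUM) comparison above.
large_cost = monthly_stage_cost(80, 7.5, LARGE)
medium_cost = monthly_stage_cost(80, 7.5, MEDIUM)
```

Running this kind of back-of-envelope model with the team turned a subjective sizing debate into a concrete trade-off discussion.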
For the build environment, I used CodeBuild spot capacity where possible — particularly for the non-blocking SAST stages that could tolerate interruption and retry. AWS Spot capacity for CodeBuild isn't the same as EC2 Spot, but CodeBuild does support fleet-mode with spot capacity that can reduce costs by up to 70%.
Security and Compliance Requirements Addressed
PayStream needed to satisfy PCI-DSS requirements 6.3 (vulnerability scanning), 6.4 (security gates in SDLC), 8.2 (credential management), 10.2 (audit logging), and 11.3 (penetration testing integration). The architecture addressed each of these through automated pipeline stages rather than manual controls.
In-Depth Discussion of Key Areas
FinOps: Embedding Cost Governance Into the Pipeline
One thing I've learned across several engagements is that FinOps and DevSecOps intersect more than people realize. The pipeline itself is a cost center — every build minute, every artifact stored, every Lambda invocation for a compliance check costs money.
Here's how we embedded cost awareness:
- Tagging enforcement via Config Rules: Every resource deployed through the pipeline had to carry `Environment`, `Project`, `CostCenter`, and `Owner` tags. A Config managed rule (`required-tags`) evaluated this on every resource change. Non-compliant resources triggered a Security Hub finding and an EventBridge notification to the cost owner.
- ECR Lifecycle Policies: This was an immediate quick win. I implemented lifecycle policies to expire untagged images after 1 day and keep only the last 10 tagged images per repository. This alone reduced ECR storage costs from $340/month to under $40/month.
{
"rules": [
{
"rulePriority": 1,
"description": "Expire untagged images after 1 day",
"selection": {
"tagStatus": "untagged",
"countType": "sinceImagePushed",
"countUnit": "days",
"countNumber": 1
},
"action": { "type": "expire" }
},
{
"rulePriority": 2,
"description": "Keep last 10 tagged images",
"selection": {
"tagStatus": "tagged",
"tagPrefixList": ["v"],
"countType": "imageCountMoreThan",
"countNumber": 10
},
"action": { "type": "expire" }
}
]
}
- CodeBuild artifact caching: The dependency download phase was responsible for ~40% of build time. Enabling S3-backed cache for `pip install` and `npm install` reduced average build time from 43 minutes to 17 minutes — which in turn cut CodeBuild costs almost proportionally.
- AWS Cost Anomaly Detection: I set up alerts for >20% week-over-week cost increases in the CodePipeline cost category. This catches situations like a runaway retry loop where a misconfigured pipeline triggers thousands of builds per hour.
- AWS Compute Optimizer recommendations: After two weeks of pipeline operation, Compute Optimizer data showed that our CodeBuild LARGE instances were consistently using only 45% of CPU during Docker builds. I right-sized three stages down to MEDIUM, saving another $60/month.
💡 Pro Tip: Use AWS Cost Allocation Tags from day one. Retroactively tagging resources is painful and usually incomplete. In a DevSecOps context, tags serve double duty — they're both FinOps tools and security evidence artifacts.
AWS Well-Architected Framework (WAF) Application
The Six Pillars of the Well-Architected Framework aren't just a compliance checkbox — they're a design forcing function. Here's how each pillar manifested in this project:
Operational Excellence: All pipeline stages produced structured JSON output written to S3. I configured CloudWatch Dashboards with pipeline health metrics — MTTR, deployment frequency, change failure rate, and lead time. These are the four DORA metrics, and having them automated meant the CISO could see security health at a glance without asking engineers.
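Two of those DORA metrics can be computed directly from deployment records. The record shape below is an assumption (not a CloudWatch or DORA API) — just enough structure to show the calculation behind the dashboard.

```python
from datetime import datetime

def dora_metrics(deployments):
    """Compute change failure rate and mean lead time from deployments.

    `deployments` is a list of dicts with ISO-8601 `committed_at` and
    `deployed_at` timestamps plus a `failed` flag -- an assumed record
    shape for illustration. Lead time is reported in hours.
    """
    failures = sum(1 for d in deployments if d["failed"])
    lead_times = [
        (datetime.fromisoformat(d["deployed_at"])
         - datetime.fromisoformat(d["committed_at"])).total_seconds() / 3600
        for d in deployments
    ]
    return {
        "change_failure_rate": failures / len(deployments),
        "mean_lead_time_hours": sum(lead_times) / len(lead_times),
    }
```

Feeding structured pipeline output into a function like this is what let the dashboards update themselves instead of relying on engineers compiling numbers by hand.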
Security (the primary pillar for this engagement): Every action in the pipeline ran under a purpose-specific IAM role with the minimum permissions needed. The CodeBuild role for SAST could only read from the source bucket and write to the artifacts bucket. It had zero IAM, EC2, or ECS permissions. The deploy role could update ECS services but couldn't modify IAM policies. This is the least-privilege principle applied concretely.
Reliability: Pipeline stages had retry logic configured (CodePipeline supports up to 5 retries per action). The Config conformance pack deployment used an idempotent CloudFormation template, so re-running it produced the same result regardless of current state.
Performance Efficiency: By parallelizing the SAST scan and the Dockerfile linting into concurrent CodeBuild actions within the same pipeline stage, I cut the security scanning wall-clock time from sequential 18 minutes to parallel 11 minutes.
Cost Optimization: Covered in the FinOps section above.
Sustainability: Fewer build minutes through caching and right-sizing = less compute energy consumed. It's a secondary benefit, but AWS Sustainability pillar guidance specifically calls out compute right-sizing as a key lever.
AWS Security Reference Architecture (SRA) Application
The AWS SRA provides prescriptive guidance for deploying security services in a multi-account AWS Organizations environment. PayStream was a single-account shop when I found them. My first recommendation — and it was non-negotiable from a PCI-DSS standpoint — was to migrate to a multi-account structure.
The SRA-aligned account structure I implemented:
| Account | Purpose | Key Services |
|---|---|---|
| Management (Root) | SCPs, billing, AWS Organizations | AWS Organizations, Service Control Policies |
| Security Tooling | Centralized security services | Security Hub (delegated admin), GuardDuty, Config aggregator, CloudTrail Lake |
| Log Archive | Immutable centralized logging | S3 (Object Lock, MFA delete), CloudWatch Logs |
| Shared Services | Shared pipeline infrastructure | CodePipeline, CodeBuild, ECR, Secrets Manager |
| Dev Workload | Development environment | ECS, RDS, VPC |
| Staging Workload | Pre-production environment | ECS, RDS, VPC |
| Prod Workload | Production environment | ECS, RDS, VPC (strict SCPs) |
The critical SRA concept I explained to the team is delegated administration. Rather than enabling Security Hub in every account individually, you designate the Security Tooling account as the Security Hub administrator. All member accounts automatically send findings there. This means even if a developer accidentally disables Security Hub in their account (which I've seen happen), the Security Tooling account still has a complete record of all prior findings.
Service Control Policies at the root OU level enforced guardrails that no individual account could override:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "DenyLeavingOrganization",
"Effect": "Deny",
"Action": "organizations:LeaveOrganization",
"Resource": "*"
},
{
"Sid": "DenyDisableSecurityServices",
"Effect": "Deny",
"Action": [
"guardduty:DeleteDetector",
"guardduty:DisassociateFromMasterAccount",
"securityhub:DisableSecurityHub",
"config:DeleteConfigRule",
"config:StopConfigurationRecorder",
"cloudtrail:DeleteTrail",
"cloudtrail:StopLogging"
],
"Resource": "*"
},
{
"Sid": "RequireIMDSv2",
"Effect": "Deny",
"Action": "ec2:RunInstances",
"Resource": "arn:aws:ec2:*:*:instance/*",
"Condition": {
"StringNotEquals": {
"ec2:MetadataHttpTokens": "required"
}
}
}
]
}
AWS Systems Manager Session Manager
Eliminating SSH was one of the first things I did — and the one developers pushed back on most. The objection is always "but how do I debug a running container?"
Session Manager answers that question completely. Instead of opening port 22 on a security group, you install the SSM Agent on your EC2 instances (it comes pre-installed on Amazon Linux 2 and Amazon Linux 2023) and give the instance profile the AmazonSSMManagedInstanceCore managed policy. That's it. No inbound rules needed in the security group at all.
For ECS containers, you enable ECS Exec, which tunnels through Session Manager to give you a shell inside a running container:
# Enable ECS Exec on a service
aws ecs update-service \
--cluster paystream-prod \
--service payment-api \
--enable-execute-command
# Connect to a running container
aws ecs execute-command \
--cluster paystream-prod \
--task <task-id> \
--container payment-api \
--interactive \
--command "/bin/bash"
Every Session Manager session is automatically logged to CloudWatch Logs and S3. The logs capture the full command history with timestamps and the IAM principal who initiated the session. For PCI-DSS audit purposes, this is gold — you have a complete, tamper-evident record of every interactive access to production systems.
⚠️ Gotcha: Session Manager requires the instance to have outbound HTTPS (port 443) access to the SSM endpoints — either via an internet gateway, NAT gateway, or VPC Interface Endpoints for SSM. In a private VPC with no internet access, you need three VPC endpoints: `com.amazonaws.region.ssm`, `com.amazonaws.region.ssmmessages`, and `com.amazonaws.region.ec2messages`. I learned this the hard way on the staging environment when Session Manager silently failed to connect and I spent two hours assuming it was an IAM issue.
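A tiny helper keeps those three service names consistent across environments (a convenience sketch, not an AWS SDK function):

```python
def ssm_endpoint_services(region):
    """Return the three VPC interface endpoint service names that
    Session Manager needs in a VPC with no internet egress."""
    return [f"com.amazonaws.{region}.{svc}"
            for svc in ("ssm", "ssmmessages", "ec2messages")]
```

Looping over this list in your IaC ensures no environment silently ships with only two of the three endpoints, which is exactly the failure mode described above.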
VPC Flow Logs: Network Visibility and Compliance
VPC Flow Logs capture metadata about accepted and rejected IP traffic flowing through your VPC's ENIs (Elastic Network Interfaces). They don't capture packet payloads — just the "who talked to whom, on what port, was it accepted or rejected, and how many bytes" metadata. That metadata is surprisingly powerful for compliance.
For PCI-DSS, requirement 10.2 mandates logging of all access to cardholder data, which means network traffic to and from the payment processing services needed to be logged. Flow Logs satisfied this requirement automatically.
Here's the Terraform to enable VPC Flow Logs at the VPC level, publishing to both CloudWatch Logs and S3 (dual-destination for redundancy and different retention policies):
resource "aws_flow_log" "paystream_vpc_flow_log" {
  vpc_id          = aws_vpc.paystream_prod.id
  traffic_type    = "ALL" # Capture both ACCEPT and REJECT records
  iam_role_arn    = aws_iam_role.flow_log_role.arn
  log_destination = aws_cloudwatch_log_group.vpc_flow_logs.arn
  # Note the $$ escaping: Terraform would otherwise treat ${...} as its own interpolation
  log_format      = "$${version} $${account-id} $${interface-id} $${srcaddr} $${dstaddr} $${srcport} $${dstport} $${protocol} $${packets} $${bytes} $${start} $${end} $${action} $${tcp-flags} $${flow-direction}"
}
resource "aws_flow_log" "paystream_vpc_flow_log_s3" {
  vpc_id               = aws_vpc.paystream_prod.id
  traffic_type         = "ALL"
  log_destination_type = "s3"
  log_destination      = "${aws_s3_bucket.flow_logs_archive.arn}/vpc-flow-logs/"
  log_format           = "$${version} $${account-id} $${interface-id} $${srcaddr} $${dstaddr} $${srcport} $${dstport} $${protocol} $${packets} $${bytes} $${start} $${end} $${action} $${tcp-flags} $${flow-direction}"
}
resource "aws_cloudwatch_log_group" "vpc_flow_logs" {
name = "/aws/vpc/flowlogs/paystream-prod"
retention_in_days = 90 # PCI-DSS requires 1 year; use S3 for longer-term archival
kms_key_id = aws_kms_key.cloudwatch_key.arn
}
I also set up a CloudWatch Metric Filter and Alarm to detect port scanning patterns — a sudden spike in REJECT records from a single source IP within a 5-minute window triggers a GuardDuty finding and an SNS alert to the security team.
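The detection logic behind that alarm can be sketched as a sliding-window count over flow log records. The threshold, window, and record layout here are illustrative assumptions, not the exact metric filter that ran in production.

```python
from collections import defaultdict

def detect_port_scan(records, threshold=50, window_seconds=300):
    """Flag source IPs whose REJECT count exceeds `threshold` within
    any sliding window of `window_seconds`.

    `records` are (epoch_seconds, srcaddr, action) tuples distilled
    from flow log lines. Values are illustrative defaults.
    """
    rejects = defaultdict(list)
    for ts, src, action in records:
        if action == "REJECT":
            rejects[src].append(ts)
    flagged = []
    for src, times in rejects.items():
        times.sort()
        lo = 0
        for hi in range(len(times)):
            # shrink the window until it spans <= window_seconds
            while times[hi] - times[lo] > window_seconds:
                lo += 1
            if hi - lo + 1 >= threshold:
                flagged.append(src)
                break
    return flagged
```

In production the same idea is expressed as a CloudWatch Metric Filter plus Alarm rather than custom code, but the sliding-window counting is the core of it.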
One thing that often surprises people: Flow Logs have a small but measurable cost. At PayStream's traffic volume (~2 GB of flow log data per day), the CloudWatch Logs ingestion cost was approximately $1/GB = ~$60/month. I moved logs older than 14 days to S3 Glacier Instant Retrieval, which reduced the total storage cost to under $15/month while maintaining the 90-day hot access period required for active investigation.
KMS Key Policies: Encryption Architecture
AWS KMS with Customer Managed Keys (CMKs) is one of those areas where the gap between "it works" and "it's properly secured" is enormous, and most teams live in "it works" territory.
The key insight about KMS key policies is that they're resource-based policies that are always evaluated — access to a key must be granted by the key policy itself, either directly or by delegating to IAM via the account-root statement. If the key policy grants or delegates nothing to a principal, then even a user with AdministratorAccess on the account cannot use that key. This differs from most AWS resources, where an identity policy alone can grant same-account access without any involvement from the resource policy.
For PayStream, I created purpose-specific CMKs with tight key policies:
{
"Version": "2012-10-17",
"Id": "paystream-codepipeline-artifacts-key-policy",
"Statement": [
{
"Sid": "EnableIAMUserPermissions",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::ACCOUNT_ID:root"
},
"Action": "kms:*",
"Resource": "*"
},
{
"Sid": "AllowCodePipelineServiceRole",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::ACCOUNT_ID:role/CodePipelineServiceRole"
},
"Action": [
"kms:GenerateDataKey",
"kms:Decrypt"
],
"Resource": "*"
},
{
"Sid": "AllowCodeBuildServiceRole",
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::ACCOUNT_ID:role/CodeBuildSecurityScanRole"
},
"Action": [
"kms:Decrypt",
"kms:GenerateDataKey"
],
"Resource": "*"
},
{
      "Sid": "DenyCrossAccountDecrypt",
"Effect": "Deny",
"Principal": "*",
"Action": "kms:Decrypt",
"Resource": "*",
"Condition": {
"StringNotEquals": {
"kms:CallerAccount": "ACCOUNT_ID"
}
}
}
]
}
An important distinction people often miss: the `EnableIAMUserPermissions` statement (the account-root principal with `kms:*`) doesn't give every principal in the account access to the key — it delegates access control to IAM. A role relying on that statement still needs an IAM identity policy that allows the KMS action. Only when the key policy names a role directly (as the two service-role statements above do) is the key policy alone sufficient. This is a frequent source of confusion.
For cross-account scenarios (the deploy role in the Prod account needs to decrypt artifacts encrypted by the Shared Services account KMS key), you need to grant permissions in two places: the KMS key policy must list the external role's ARN, and the role's IAM policy in the target account must grant the KMS decrypt action.
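Those two grant paths can be captured in a simplified evaluation model. This deliberately ignores explicit Deny statements, grants, conditions, and SCPs — it is a sketch of the logic described above, not real KMS authorization, and it assumes the key policy contains the usual account-root delegation statement.

```python
def kms_access_allowed(principal, action, key_policy_grants, iam_grants,
                       same_account=True):
    """Simplified model of the two KMS access paths.

    `key_policy_grants` and `iam_grants` are sets of (principal, action)
    pairs. Same-account access succeeds if the key policy names the
    principal directly, or via the root-delegation path when an IAM
    identity policy grants the action. Cross-account principals need
    both the key policy grant and an IAM grant in their own account.
    """
    named_in_key_policy = (principal, action) in key_policy_grants
    iam_allows = (principal, action) in iam_grants
    if not same_account:
        return named_in_key_policy and iam_allows
    return named_in_key_policy or iam_allows
```

Walking the team through a model like this resolved several "but my role has AdministratorAccess, why can't it decrypt?" tickets.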
💡 Pro Tip: Enable KMS key rotation (annual automatic rotation) and set up CloudWatch Alarms on `kms:Decrypt` calls that significantly exceed the baseline. A sudden spike in decrypt calls can indicate credential compromise where an attacker is exfiltrating data.
IAM Permission Boundary Mechanics
Permission boundaries are one of the most misunderstood IAM concepts, but in a DevSecOps context they're indispensable — particularly for CI/CD pipelines that need to create IAM roles.
The core problem they solve: Your CodePipeline deploy stage needs to create IAM roles for new Lambda functions or ECS task definitions. But if the deploy role has iam:CreateRole, it can theoretically create a role with AdministratorAccess — which is a privilege escalation path. This is why security teams often block CI/CD pipelines from touching IAM at all, which breaks modern IaC workflows.
The permission boundary solution: A permission boundary is a managed IAM policy that you attach to a role, which acts as a ceiling on what permissions that role can ever have — regardless of what policies are directly attached to it. The effective permissions are the intersection of what the identity policy allows AND what the boundary allows.
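The intersection semantics are worth seeing concretely. This toy model reduces policies to flat sets of action strings (real IAM also evaluates resources, conditions, and explicit denies), but it shows why a boundary is a ceiling rather than a grant:

```python
def effective_permissions(identity_policy, boundary):
    """Effective permissions are the intersection of the identity
    policy and the permission boundary (simplified to sets of action
    strings -- no resources, conditions, or explicit denies)."""
    return identity_policy & boundary

identity = {"ecs:UpdateService", "iam:CreateRole", "iam:AttachRolePolicy"}
boundary = {"ecs:UpdateService", "lambda:InvokeFunction"}
# The identity policy grants IAM actions, but the boundary strips them:
assert effective_permissions(identity, boundary) == {"ecs:UpdateService"}
```

Note the flip side: `lambda:InvokeFunction` is in the boundary but not in the result, because the identity policy never granted it — the boundary alone grants nothing.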
Here's how I implemented this for PayStream's Terraform deploy role:
# The permission boundary policy - defines the MAXIMUM permissions
# any role created by the deploy pipeline can have
resource "aws_iam_policy" "devsecops_permission_boundary" {
name = "DevSecOpsDeploymentBoundary"
description = "Maximum permissions for roles created via CI/CD pipeline"
policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
Sid = "AllowECSAndLambdaPermissions"
Effect = "Allow"
Action = [
"ecs:*",
"lambda:*",
"s3:GetObject",
"s3:PutObject",
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"xray:PutTraceSegments"
]
Resource = "*"
},
{
Sid = "DenyPrivilegeEscalation"
Effect = "Deny"
Action = [
"iam:CreateUser",
"iam:AttachUserPolicy",
"iam:PutUserPolicy",
"organizations:*",
"account:*"
]
Resource = "*"
}
]
})
}
# The deploy role itself - must pass the boundary when creating child roles
resource "aws_iam_role" "codepipeline_deploy_role" {
name = "CodePipelineDeployRole"
permissions_boundary = aws_iam_policy.devsecops_permission_boundary.arn
assume_role_policy = jsonencode({
Version = "2012-10-17"
Statement = [{
Effect = "Allow"
Principal = {
Service = "codepipeline.amazonaws.com"
}
Action = "sts:AssumeRole"
}]
})
}
The critical mechanic: when this deploy role creates a new IAM role via Terraform (aws_iam_role), the new role does not inherit the boundary automatically — the Terraform code must set permissions_boundary on it, and the deploy role's IAM policy enforces that. It includes iam:CreateRole but only with a condition that requires the boundary:
{
"Sid": "AllowCreateRoleWithBoundary",
"Effect": "Allow",
"Action": "iam:CreateRole",
"Resource": "*",
"Condition": {
"StringEquals": {
"iam:PermissionsBoundary": "arn:aws:iam::ACCOUNT_ID:policy/DevSecOpsDeploymentBoundary"
}
}
}
Without this condition, the deploy role could create a role without a boundary. The condition locks it down so every role created through the pipeline must carry the same restrictive boundary. It's an elegant mechanism.
⚠️ Gotcha: Permission boundaries do NOT grant permissions — they only constrain them. A common mistake is treating a permission boundary as an "allow list" and wondering why the role still can't do things that are in the boundary policy. The boundary is a ceiling; the floor is set by the identity policy attached to the role.
Disaster Recovery with RTO/RPO
While this engagement was primarily about DevSecOps, a complete production architecture had to address DR. PayStream's previous DR plan was "we have daily RDS snapshots, good luck." That's not a plan. I established formal RTO and RPO targets in collaboration with the business team:
- Recovery Time Objective (RTO): 4 hours (how long the business can tolerate being down)
- Recovery Point Objective (RPO): 15 minutes (how much data they can afford to lose)
The 15-minute RPO drove several architectural decisions:
For RDS (Aurora PostgreSQL): I enabled Aurora Global Database with a secondary region (ap-south-1, Mumbai, given PayStream's India-focused user base). Aurora replication typically achieves sub-second replication lag, well within the 15-minute RPO. Global Database failover (managed failover) completes in roughly 1 minute, which also easily satisfies the 4-hour RTO.
For the pipeline artifacts and state: S3 buckets holding pipeline artifacts were configured with Cross-Region Replication (CRR) to the DR region. ECR images were pushed to both regions simultaneously using a post-build CodeBuild step. This ensured the DR region could deploy the latest image without pulling from the primary region.
For the pipeline itself: The CodePipeline and CodeBuild configuration was maintained as Terraform code in a Git repository. In a disaster scenario, re-creating the pipeline in the DR region was a terraform apply away — a process that took approximately 8 minutes in our DR runbook test.
DR testing was integrated into the pipeline itself. Quarterly, a RunDrTest CodePipeline execution would:
- Spin up a test ECS cluster in the DR region using the latest DR image
- Run a synthetic transaction suite against it
- Verify database connectivity to the Aurora Global secondary
- Publish a DR test report to S3 and a Security Hub finding with the results
- Tear down the test infrastructure
This meant the DR plan was validated continuously, not just during the annual DR exercise that most companies do (and usually fake).
Implementation Journey
Phase 1: Foundation
The first two weeks were entirely about infrastructure foundations. No application code, no pipeline stages — just the bedrock that everything else would sit on.
VPC Design:
I designed a three-tier VPC architecture with strict subnet separation:
Production VPC: 10.0.0.0/16
Public Subnets (10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24)
→ ALB, NAT Gateways only. No application instances here.
Private App Subnets (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24)
→ ECS Fargate tasks (payment API, transaction processor)
→ Route table: 0.0.0.0/0 → NAT Gateway
Private Data Subnets (10.0.20.0/24, 10.0.21.0/24, 10.0.22.0/24)
→ Aurora PostgreSQL cluster
→ ElastiCache (Redis)
→ No internet route whatsoever
Pipeline Subnet (10.0.30.0/24)
→ CodeBuild VPC network interface (for scans needing VPC access)
→ VPC Interface Endpoints for AWS services
VPC Flow Logs were enabled before any application traffic flowed through the VPC. I've made the mistake before of enabling Flow Logs after the fact and losing the forensic baseline for "normal" traffic patterns.
IAM Baseline:
I created a purpose-specific IAM role hierarchy before writing a single line of application code:
- `CodePipelineOrchestrationRole` — can read/write artifacts S3, start CodeBuild, pass roles to CodeBuild/ECS
- `CodeBuildSASTRole` — read-only S3 artifacts, no AWS API calls except CloudWatch Logs write
- `CodeBuildBuildRole` — ECR push, S3 artifacts, CloudWatch Logs
- `CodeBuildComplianceRole` — Config read, Security Hub write, CloudWatch Logs
- `ECSTaskExecutionRole` — ECR pull, Secrets Manager read, CloudWatch Logs
- `ECSTaskRole` — application-specific permissions only (DynamoDB, SQS, etc.)
The initial challenge was explaining to the team why there were 6 roles for a single pipeline. The answer is that blast radius control is worth the setup overhead. If CodeBuildSASTRole is compromised (e.g., a malicious dependency in a SAST tool), the attacker can read build artifacts but cannot push images, touch ECS, or read secrets.
Phase 2: Core Services
With the foundation in place, I built the actual pipeline stages.
Stage 1: Source
I connected GitHub (migrating away from CodeCommit given GitHub's richer webhook support) using a CodeStar Connection. This kept credentials out of CodeBuild entirely — the connection is a service-managed credential stored securely by AWS.
Branch protection rules on GitHub enforced:
- All PRs require at least one approved review
- Status checks must pass (the SAST pipeline ran on every PR, not just merges to main)
- No direct pushes to
mainorrelease/*branches
Stage 2: Static Analysis Security Testing (SAST)
This stage ran three concurrent CodeBuild actions:
# buildspec-sast.yml
version: 0.2
phases:
install:
runtime-versions:
python: 3.11
commands:
- pip install bandit semgrep checkov
- npm install -g @hadolint/hadolint
build:
commands:
# Python SAST with Bandit
- bandit -r ./src -f json -o bandit-report.json || true
# Semgrep with OWASP ruleset
- semgrep scan --config=p/owasp-top-ten
--config=p/python
--json
--output=semgrep-report.json
./src || true
# Dockerfile linting
- hadolint Dockerfile --format json > hadolint-report.json || true
# IaC scanning for Terraform
- checkov -d ./terraform
--framework terraform
-o json
--output-file checkov-report.json || true
post_build:
commands:
# Fail on HIGH/CRITICAL findings
- python evaluate_sast_results.py
# Upload reports to S3 audit bucket
- aws s3 cp bandit-report.json s3://paystream-audit-reports/sast/$CODEBUILD_BUILD_ID/
- aws s3 cp semgrep-report.json s3://paystream-audit-reports/sast/$CODEBUILD_BUILD_ID/
artifacts:
files:
- '**/*-report.json'
base-directory: '.'
The evaluate_sast_results.py script parsed each report, applied a configurable severity threshold (CRITICAL always fails, HIGH fails unless explicitly suppressed with a JIRA ticket reference in a .security-exceptions.yaml file), and exited with code 1 if the threshold was exceeded.
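The gate logic can be sketched as follows. This is a simplified stand-in for `evaluate_sast_results.py` (the real script also parses each tool's report format); the finding IDs, suppression shape, and function name are illustrative.

```python
def evaluate_findings(findings, suppressions):
    """Return the finding IDs that should block the pipeline.

    `findings` is a list of {"id", "severity"}; `suppressions` maps a
    finding id to a JIRA ticket reference, as loaded from
    .security-exceptions.yaml. CRITICAL always blocks; HIGH blocks
    only when unsuppressed. A non-empty result means exit code 1.
    """
    blocking = []
    for f in findings:
        if f["severity"] == "CRITICAL":
            blocking.append(f["id"])
        elif f["severity"] == "HIGH" and f["id"] not in suppressions:
            blocking.append(f["id"])
    return blocking

blockers = evaluate_findings(
    [{"id": "B101", "severity": "HIGH"},
     {"id": "B608", "severity": "HIGH"},
     {"id": "B602", "severity": "CRITICAL"}],
    {"B101": "SEC-142"},
)
# B101 is suppressed via its ticket; B608 and B602 still block
```

Tying each suppression to a ticket reference matters for the audit: every accepted risk has an owner and a paper trail, not just a commented-out check.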
Stage 3: Container Build and Scanning
This was the heart of the pipeline:
# buildspec-build-scan.yml
version: 0.2
env:
  secrets-manager:
    SONAR_TOKEN: "arn:aws:secretsmanager:ap-south-1:ACCOUNT:secret:sonar-token"
phases:
  pre_build:
    commands:
      - REPOSITORY_URI=$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com/paystream-payment-api
      - COMMIT_HASH=$(echo $CODEBUILD_RESOLVED_SOURCE_VERSION | cut -c 1-7)
      - IMAGE_TAG=v${CODEBUILD_BUILD_NUMBER}-${COMMIT_HASH}
      - aws ecr get-login-password --region $AWS_DEFAULT_REGION |
          docker login --username AWS --password-stdin $REPOSITORY_URI
  build:
    commands:
      # Multi-stage build with explicit base image pinning
      - docker build
          --build-arg BUILD_DATE=$(date -u +'%Y-%m-%dT%H:%M:%SZ')
          --build-arg VCS_REF=$COMMIT_HASH
          --label "build.id=$CODEBUILD_BUILD_ID"
          -t $REPOSITORY_URI:$IMAGE_TAG
          -f Dockerfile .
      - docker push $REPOSITORY_URI:$IMAGE_TAG
  post_build:
    commands:
      # Wait for ECR Enhanced Scanning (Inspector v2) to complete
      - |
        echo "Waiting for Inspector scan..."
        for i in $(seq 1 30); do
          STATUS=$(aws ecr describe-image-scan-findings \
            --repository-name paystream-payment-api \
            --image-id imageTag=$IMAGE_TAG \
            --query 'imageScanStatus.status' \
            --output text 2>/dev/null || echo "IN_PROGRESS")
          if [ "$STATUS" = "COMPLETE" ]; then
            echo "Scan complete after ${i} checks"
            break
          fi
          echo "Scan in progress... ($i/30)"
          sleep 20
        done
      # Parse and gate on findings
      - |
        CRITICAL=$(aws ecr describe-image-scan-findings \
          --repository-name paystream-payment-api \
          --image-id imageTag=$IMAGE_TAG \
          --query 'imageScanFindings.findingCounts.CRITICAL' \
          --output text)
        HIGH=$(aws ecr describe-image-scan-findings \
          --repository-name paystream-payment-api \
          --image-id imageTag=$IMAGE_TAG \
          --query 'imageScanFindings.findingCounts.HIGH' \
          --output text)
        echo "Critical: $CRITICAL, High: $HIGH"
        if [ "$CRITICAL" != "None" ] && [ "$CRITICAL" -gt 0 ]; then
          echo "PIPELINE BLOCKED: Critical vulnerabilities found"
          exit 1
        fi
        if [ "$HIGH" != "None" ] && [ "$HIGH" -gt 5 ]; then
          echo "PIPELINE BLOCKED: Too many HIGH vulnerabilities ($HIGH)"
          exit 1
        fi
      # Write image metadata for downstream stages
      - printf '[{"name":"payment-api","imageUri":"%s"}]'
          $REPOSITORY_URI:$IMAGE_TAG > imagedefinitions.json
artifacts:
  files:
    - imagedefinitions.json
Stage 4: Automated Compliance Validation
This stage queried AWS Config to verify that the target environment (staging or production) was in a compliant state before deploying to it:
# compliance_gate.py
import boto3
import sys

def check_compliance_before_deploy(environment):
    config = boto3.client('config')
    # Check the Operational Best Practices for PCI DSS conformance pack
    response = config.describe_compliance_by_config_rule(
        ComplianceTypes=['NON_COMPLIANT']
    )
    non_compliant_critical = [
        r for r in response['ComplianceByConfigRules']
        if r['ConfigRuleName'].startswith('paystream-critical-')
        and r['Compliance']['ComplianceType'] == 'NON_COMPLIANT'
    ]
    if non_compliant_critical:
        print(f"❌ COMPLIANCE GATE FAILED: {len(non_compliant_critical)} critical rules non-compliant")
        for rule in non_compliant_critical:
            print(f"  - {rule['ConfigRuleName']}: {rule['Compliance']['ComplianceType']}")
        sys.exit(1)
    print(f"✅ Compliance gate passed for {environment}")
    return True

if __name__ == '__main__':
    check_compliance_before_deploy(sys.argv[1] if len(sys.argv) > 1 else 'staging')
The conformance pack I deployed included both AWS Managed Config Rules and custom rules specific to PayStream's PCI-DSS requirements:
# paystream-pci-conformance-pack.yaml
Parameters:
  AccessKeysRotatedParamMaxAccessKeyAge:
    Default: '90'
    Type: String
Resources:
  # PCI Req 8.3 - MFA for console access
  MFAEnabledForIAMConsoleAccess:
    Properties:
      ConfigRuleName: paystream-critical-mfa-console-access
      Source:
        Owner: AWS
        SourceIdentifier: MFA_ENABLED_FOR_IAM_CONSOLE_ACCESS
    Type: AWS::Config::ConfigRule
  # PCI Req 6.3 - Vulnerability management
  ECRImageScanningEnabled:
    Properties:
      ConfigRuleName: paystream-critical-ecr-scan-on-push
      Source:
        Owner: AWS
        SourceIdentifier: ECR_PRIVATE_IMAGE_SCANNING_ENABLED
    Type: AWS::Config::ConfigRule
  # PCI Req 10.3 - VPC Flow Logs
  VPCFlowLogsEnabled:
    Properties:
      ConfigRuleName: paystream-critical-vpc-flow-logs
      Source:
        Owner: AWS
        SourceIdentifier: VPC_FLOW_LOGS_ENABLED
    Type: AWS::Config::ConfigRule
  # PCI Req 3.4 - Encryption at rest
  S3BucketServerSideEncryptionEnabled:
    Properties:
      ConfigRuleName: paystream-critical-s3-encryption
      Source:
        Owner: AWS
        SourceIdentifier: S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED
    Type: AWS::Config::ConfigRule
  # Secrets Manager rotation enabled
  SecretsManagerRotationEnabled:
    Properties:
      ConfigRuleName: paystream-critical-secrets-rotation
      Source:
        Owner: AWS
        SourceIdentifier: SECRETSMANAGER_ROTATION_ENABLED_CHECK
    Type: AWS::Config::ConfigRule
Phase 3: Advanced Features
Monitoring and Observability Stack:
I deployed a three-layer observability stack:
- Infrastructure metrics: CloudWatch Container Insights for ECS (CPU, memory, network, disk per task), custom CloudWatch Metrics from application code published via the EMF (Embedded Metric Format) library
- Application logs: Structured JSON logs via FireLens (AWS's log routing solution built on Fluent Bit) routing to CloudWatch Logs and OpenSearch Service
- Distributed tracing: AWS X-Ray with sampling rules set to 5% for normal traffic and 100% for requests flagged with specific transaction IDs
The Security Dashboard in CloudWatch combined pipeline execution success/failure rates, Security Hub finding trends, Config compliance scores, and MTTD/MTTR metrics into a single operations view.
Auto-Scaling Configuration:
ECS Service Auto Scaling was configured with a multi-metric policy:
- Target tracking on CPU utilization (target: 60%)
- Step scaling on custom metric: SQS queue depth per ECS task (scale out when >500 messages/task)
- Scheduled scaling for known peak periods (payment processing business hours)
Minimum task count was set to 3 (spanning all three AZs) to ensure High Availability even before auto-scaling kicks in.
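The queue-depth policy reduces to messages-per-task arithmetic. A sketch of the scale-out calculation (the 500-messages-per-task target and the 3-task minimum come from the configuration above; `max_tasks` is an assumed capacity ceiling):

```python
import math

def desired_task_count(queue_depth, target_per_task=500, min_tasks=3, max_tasks=30):
    """Tasks needed so no task owns more than target_per_task messages,
    clamped to the service minimum (3, one per AZ) and an assumed maximum."""
    needed = math.ceil(queue_depth / target_per_task)
    return max(min_tasks, min(max_tasks, needed))
```

A backlog of 2,600 messages, for example, asks for 6 tasks; an empty queue settles at the 3-task floor.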
Disaster Recovery Implementation:
As described in the DR section above, I implemented Aurora Global Database with automated failover testing integrated into the pipeline's quarterly schedule.
Phase 4: Optimization
Performance Tuning:
After two weeks in production, I analyzed X-Ray traces and found that 35% of payment API latency was database connection establishment overhead. The solution was Amazon RDS Proxy, which maintains a persistent connection pool and reduces connection time from ~150ms to ~2ms for typical ECS tasks that start and stop frequently.
X-Ray trace analysis also revealed that the container image pull time on cold ECS task starts was 45 seconds for a 1.2 GB image. I addressed this through two measures:
- Multi-stage Docker builds that reduced the final image size from 1.2 GB to 280 MB
- ECR pull-through cache configuration to ensure images were always warm in the local ECR endpoint
Cost Optimization Measures:
- Moved non-production environments to ECS Fargate Spot — 70% compute cost reduction for dev and staging
- Implemented CloudWatch Logs tiering: 14 days in CloudWatch (hot), then S3 Standard-IA (warm), then Glacier Instant Retrieval after 90 days (cold)
- Used AWS Compute Optimizer recommendations to right-size ECS task CPU/memory allocations — found several tasks over-provisioned by 40%
- Enabled S3 Intelligent-Tiering on the audit reports bucket, which automatically moved infrequently accessed compliance reports to cheaper tiers
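The log-tiering rules can be expressed as a standard S3 lifecycle configuration. A sketch (rule ID, prefix, and the 400-day expiration are hypothetical; note S3 requires objects to be at least 30 days old before a STANDARD_IA transition, so exported logs sit briefly in S3 Standard first):

```python
import json

# Lifecycle for logs exported out of CloudWatch after their 14-day hot window.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "log-archive-tiering",             # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "exported-logs/"},  # hypothetical prefix
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER_IR"},
            ],
            "Expiration": {"Days": 400},  # assumed retention ceiling
        }
    ]
}
print(json.dumps(lifecycle_configuration, indent=2))
```

Applied with `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=lifecycle_configuration)` in boto3.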
Security Hardening:
- Enabled Amazon Macie on all S3 buckets storing PII data (payment records) — Macie automatically discovers and alerts on sensitive data
- Configured GuardDuty Runtime Monitoring for ECS, which detects suspicious activity (unexpected processes, anomalous network connections) inside running tasks
- Implemented AWS WAF in front of the ALB with the AWS Managed Rules group for common vulnerabilities (SQLi, XSS, bot control)
- Set ECR repository policies to deny image pull from outside the account unless explicitly authorized
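The last measure is a repository policy with an explicit Deny on pulls from any other account. A sketch of the policy document (the account ID is a placeholder):

```python
import json

ACCOUNT_ID = "123456789012"  # placeholder

# Deny image pulls unless the calling principal belongs to this account.
ecr_repository_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyPullFromOutsideAccount",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"],
            "Condition": {
                "StringNotEquals": {"aws:PrincipalAccount": ACCOUNT_ID}
            },
        }
    ],
}
print(json.dumps(ecr_repository_policy, indent=2))
```

Explicitly authorized cross-account principals can be carved out by adding their account IDs as extra values in the `StringNotEquals` list.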
Technical Specifications
Compute
- ECS Fargate (Production): Tasks sized at 1 vCPU / 2 GB (payment API), 0.5 vCPU / 1 GB (transaction processor)
- ECS Fargate Spot (Staging/Dev): Same sizes, 70% cost reduction
- CodeBuild: `BUILD_GENERAL1_MEDIUM` (4 vCPU, 7 GB) for security scans; `BUILD_GENERAL1_LARGE` (8 vCPU, 15 GB) for the Docker build stage
- RDS Proxy: Enabled between ECS tasks and Aurora, connection pool max 100 connections
Database
resource "aws_rds_cluster" "paystream_aurora" {
  cluster_identifier          = "paystream-prod"
  engine                      = "aurora-postgresql"
  engine_version              = "15.4"
  database_name               = "paystream"
  master_username             = "paystream_admin"
  manage_master_user_password = true # Secrets Manager managed password

  vpc_security_group_ids = [aws_security_group.aurora_sg.id]
  db_subnet_group_name   = aws_db_subnet_group.data_tier.name

  storage_encrypted = true
  kms_key_id        = aws_kms_key.aurora_key.arn

  backup_retention_period      = 35 # PCI-DSS requires 1 year; use exports for long-term
  preferred_backup_window      = "02:00-03:00"
  preferred_maintenance_window = "sun:04:00-sun:05:00"

  enabled_cloudwatch_logs_exports = ["postgresql"]
  deletion_protection             = true # Cannot accidentally delete production DB

  tags = {
    Environment = "production"
    CostCenter  = "platform-engineering"
    DataClass   = "PCI-DSS-Cardholder"
  }
}
Network Architecture
- VPC CIDR: `10.0.0.0/16` across 3 AZs (ap-south-1a, ap-south-1b, ap-south-1c)
- Security Groups: Allow-listed per service-to-service flow; no broad CIDR-based rules
- NACLs: Stateless subnet-level controls supplementing security groups
- VPC Endpoints: Interface endpoints for ECR (API + Docker), S3 (Gateway), Secrets Manager, SSM, STS, CloudWatch Logs, KMS — eliminating internet traffic for AWS API calls
- Transit Gateway: Connecting workload VPCs to shared services VPC
Terraform Infrastructure Summary
# Key module structure
modules/
├── vpc/ # VPC, subnets, route tables, Flow Logs, VPC endpoints
├── iam/ # Roles, permission boundaries, policies
├── ecr/ # Repositories, lifecycle policies, scanning config
├── codepipeline/ # Pipeline stages, actions, artifacts bucket (KMS)
├── codebuild/ # Build projects, environments, VPC config
├── ecs/ # Cluster, services, task definitions, auto-scaling
├── rds/ # Aurora cluster, proxy, subnet group, parameter groups
├── security/ # Config rules, conformance pack, Security Hub, GuardDuty
├── monitoring/ # CloudWatch dashboards, alarms, metric filters
└── kms/ # CMKs for each service tier
Challenges and Solutions
Challenge 1: Container Scan False Positives Blocking Deployments
About two weeks after go-live, developers started complaining that pipeline deployments were failing on container scan results for CVEs that had no available fix. The python:3.11-slim base image carried several HIGH-severity CVEs in system libraries where the upstream maintainers had not yet released patches.
How I troubleshot it: I pulled the ECR scan findings for the last 50 failed builds and ran a frequency analysis. 73% of build failures were caused by exactly three CVEs — CVE-2023-XXXX in libssl1.1, CVE-2024-YYYY in zlib, and CVE-2024-ZZZZ in glibc — all with no available fix in the slim variant.
Solution: I implemented a structured exception process. If a CVE has no available fix (verified via the National Vulnerability Database API), developers could submit a suppression entry in a .security-exceptions.yaml file:
suppressions:
  - cve_id: "CVE-2023-XXXX"
    reason: "No available fix in base image; tracked in JIRA-4521"
    expires: "2025-03-01"
    approved_by: "security-team"
    risk_accepted: true
The pipeline script checked expirations and required a JIRA ticket reference. Any suppression older than 90 days automatically became a blocking finding again. This gave developers a legitimate escape valve while maintaining security accountability.
Lesson learned: Zero-tolerance CVE policies sound good but fail in practice. Real security posture comes from risk-based decision making with documented accountability, not blanket blocks.
Challenge 2: Config Compliance Rules Constantly Non-Compliant Due to Terraform Plan Artifacts
The automated compliance gate was blocking pipeline runs because S3 buckets created temporarily during Terraform plan operations weren't encrypted — they existed for less than 3 minutes but Config evaluated them immediately.
How I troubleshot it: CloudTrail showed S3 CreateBucket events followed immediately by DeleteBucket events within 2-3 minutes. Config was capturing the intermediate "non-compliant" state of the short-lived bucket and marking the conformance pack as failed.
Solution: I added a jittered wait before the compliance gate queried Config, giving transient evaluations time to settle. More importantly, I moved ephemeral Terraform plan operations to a separate AWS account (the Shared Services account) where the compliance rules were scoped to persistent resources only, not transient build artifacts.
Lesson learned: Config's near-real-time evaluation model can create false compliance failures for ephemeral resources. Design your compliance rules to exclude build-time resources or use resource-level suppression tags.
Challenge 3: Session Manager Connectivity Failures in Private Subnets
Three ECS container instances in the private data subnet became unreachable via Session Manager during a network configuration change. No SSH fallback existed (by design), so I needed Session Manager to work.
How I troubleshot it: I checked the SSM Agent status via CloudWatch Logs (SSM agent logs its connection status). The logs showed "Failed to connect to service endpoint" — a clear indication of a networking issue rather than an IAM issue.
Solution: I had accidentally removed the com.amazonaws.ap-south-1.ssmmessages VPC endpoint during a security group cleanup. Restoring the endpoint restored connectivity within 2 minutes.
Lesson learned: Maintain a runbook for Session Manager troubleshooting. The three VPC endpoints required for private subnet access (ssm, ssmmessages, ec2messages) should be deployed from Terraform and protected from manual deletion via SCP. Also document the SSM Agent log location before you need it in an emergency.
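Part of that runbook is now scripted. A sketch of the endpoint check, operating on the service names returned by `ec2:DescribeVpcEndpoints` (the live boto3 call is omitted so the logic stands alone):

```python
REQUIRED_SUFFIXES = {"ssm", "ssmmessages", "ec2messages"}

def missing_ssm_endpoints(endpoint_service_names, region="ap-south-1"):
    """Return which of the three Session Manager VPC endpoints are absent."""
    prefix = f"com.amazonaws.{region}."
    present = {
        name[len(prefix):]
        for name in endpoint_service_names
        if name.startswith(prefix)
    }
    return sorted(REQUIRED_SUFFIXES - present)
```

Feeding it the VPC's current endpoint list immediately names what a security-group cleanup silently removed.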
Challenge 4: KMS Key Policy Locking Out CodeBuild During Cross-Account Deployments
When I extended the pipeline to deploy to the production account (separate AWS account), CodeBuild started failing with "Unable to decrypt artifact" errors despite the KMS key policy appearing correct.
How I troubleshot it: CloudTrail showed kms:Decrypt calls being denied with ExplicitDeny. But the KMS key policy had the CodeBuild role listed. The issue was that the CodeBuild role was in the Shared Services account, but the principal in the key policy was specified without the cross-account ARN format.
Solution: The fix required two changes: (1) updating the KMS key policy to explicitly list the cross-account role ARN (arn:aws:iam::PROD_ACCOUNT_ID:role/CodePipelineDeployRole), and (2) adding the KMS decrypt permission to the IAM role in the production account. Both conditions must be met for cross-account KMS access.
Lesson learned: Always test cross-account KMS scenarios in a non-production environment first. Cross-account KMS access requires changes in both accounts and the failure mode is a silent permission denial that looks identical to a misconfigured key policy in the same account.
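For reference, a sketch of the key-policy half of the fix, as it would appear on the artifact key in the Shared Services account (account ID is a placeholder; the matching IAM allow on the role in the production account is the other required half):

```python
PROD_ACCOUNT_ID = "222222222222"  # placeholder

# Key policy statement: grant the production deploy role decrypt access
# to pipeline artifacts. Cross-account KMS access also requires an IAM
# policy on the role itself in the production account.
cross_account_decrypt = {
    "Sid": "AllowProdDeployRoleDecrypt",
    "Effect": "Allow",
    "Principal": {
        "AWS": f"arn:aws:iam::{PROD_ACCOUNT_ID}:role/CodePipelineDeployRole"
    },
    "Action": ["kms:Decrypt", "kms:DescribeKey"],
    "Resource": "*",
}
```

Note the fully qualified cross-account role ARN in the Principal; an account-local role name here was exactly the mistake that caused the outage.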
Challenge 5: Pipeline Runtime Costs Exceeding Budget During Load Testing
During a load test phase, a developer left a test pipeline triggered by a misconfigured webhook running for 8 hours. It executed 340 pipeline runs, burning approximately $420 in CodeBuild costs in a single day.
Solution: I implemented multiple guardrails: (1) CodePipeline execution throttling using EventBridge rules to cap pipeline executions at 20/hour, (2) AWS Budgets alert at 80% of the daily CodePipeline cost threshold, (3) a Cost Anomaly Detection alert for >50% hour-over-hour increases in CodeBuild spend.
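Guardrail (1) is essentially a sliding-window rate limit. A pure-logic sketch of the throttle an EventBridge-triggered function could apply before allowing another execution (the 20/hour cap matches the rule above; durable state storage is left out):

```python
from collections import deque

class ExecutionThrottle:
    """Sliding one-hour window: allow at most max_per_hour pipeline starts."""

    def __init__(self, max_per_hour=20):
        self.max_per_hour = max_per_hour
        self._starts = deque()  # epoch seconds of allowed starts

    def allow(self, now):
        # Evict starts older than one hour, then check remaining capacity.
        while self._starts and now - self._starts[0] >= 3600:
            self._starts.popleft()
        if len(self._starts) >= self.max_per_hour:
            return False
        self._starts.append(now)
        return True
```

The misconfigured-webhook incident (340 runs in 8 hours) would have been capped at 160 by this window alone, before the budget alerts even fired.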
Results and Metrics
After 60 days in production (prior to the PCI-DSS audit), here were the measurable outcomes:
Security Posture Improvements:
- Mean time to detect vulnerabilities: 47 days → 4 minutes (99.9% improvement) — findings now appear in Security Hub within minutes of image push
- Critical CVEs reaching production: 100% elimination over the 60-day observation period (vs. 3 incidents in the prior 60 days)
- Security findings from SAST per week: 12 high/critical findings identified and resolved that would previously have reached production undetected
Pipeline and Development Velocity:
- Deployment frequency: 3/week → 12/week (4x improvement) — smaller, safer deploys became the norm
- Lead time for changes: 72 hours → 18 hours (75% reduction)
- Change failure rate: 22% → 4% (deployments requiring rollback)
- Pipeline execution time: 43 minutes → 17 minutes (60% faster due to caching and parallelization)
Compliance:
- Automated compliance evidence: 0% → 100% of evidence machine-generated and timestamped
- PCI-DSS audit result: Passed (Level 1 certification achieved in week 12)
- Config compliance score across 47 rules: 94% (up from unmeasured/~20% estimated)
- Manual security review hours per week: 40 hours → 6 hours (85% reduction)
Cost Optimization:
- ECR storage costs: $340/month → $38/month (89% reduction via lifecycle policies)
- Build time costs: ~$2,800/month → ~$1,100/month (61% reduction via right-sizing and caching)
- Eliminated bastion host EC2 costs: $140/month (replaced by Session Manager)
- Total DevOps toolchain spend: $7,200/month (within the $8,000 ceiling)
Reliability:
- Production availability: 99.4% → 99.97% (change failure rate reduction + faster rollback)
- Mean time to recover (MTTR) from incidents: 4.2 hours → 38 minutes (automated rollback triggered by CloudWatch alarms)
Key Takeaways
What worked exceptionally well:
- Parallel security scan stages dramatically cut pipeline execution time without reducing coverage — running SAST, Dockerfile linting, and dependency scanning concurrently rather than sequentially was an easy win
- Permission boundaries on CI/CD deploy roles gave us IaC automation power (creating IAM roles) without opening privilege escalation paths — this is the single IAM pattern I now apply to every engagement
- Config conformance packs as pipeline gates meant compliance was enforced continuously, not just at audit time — the QSA was genuinely surprised and impressed
- Structured exception management for CVE suppressions prevented the all-too-common outcome where teams disable security gates because they generate too much noise
- Terraform as the single source of truth for all infrastructure meant disaster recovery and account recreation were genuinely fast and reliable
What I'd do differently next time:
- Start the multi-account migration before any other work. I did it in parallel with pipeline development, and the account restructuring caused some painful rework of IAM ARNs and KMS key policies.
- Implement Amazon Inspector v2 enhanced scanning from day one rather than starting with basic ECR scanning and upgrading later — the migration required updating Config rules and Security Hub integrations mid-project
- Build the developer feedback loop earlier. I focused heavily on the security tooling and didn't prioritize the developer experience (IDE plugins for SAST, pre-commit hooks mirroring the pipeline checks) until week 6. Developers would have accepted the pipeline changes more readily with earlier visibility
- Use AWS Config Auto-Remediation for low-risk fixes (like adding missing tags) rather than just alerting — it reduces the compliance backlog significantly
Best Practices Discovered:
- Fail-closed, not fail-open: When a security check can't run (e.g., SAST tool crashes), the pipeline should fail. It's tempting to fail-open for availability, but you lose all security guarantees
- Every security finding needs an SLA: Critical = 24 hours, High = 7 days, Medium = 30 days. Without an SLA, findings pile up and teams start ignoring them
- Tag everything at creation, not retroactively: A `created-by-pipeline: true` tag on every resource makes cost allocation and security investigation dramatically simpler
- Test your DR regularly and automatically: A DR plan that isn't tested regularly is a theoretical DR plan. Integrate DR testing into the pipeline schedule
- Security Hub as the single pane of glass: Resist the urge to build custom dashboards for individual security tools. Everything flows to Security Hub, everyone looks at Security Hub
Recommendations for Similar Projects:
- Before touching any pipeline code, establish your IAM role hierarchy and KMS key structure. Everything else depends on these
- Get developer buy-in early by showing how the pipeline catches real bugs in their code, not just theoretical security issues
- Budget 20% of your timeline for the compliance evidence documentation — auditors want specifics, and "we have automated compliance" isn't enough without a clear evidence chain
- In a regulated environment (PCI-DSS, HIPAA, SOC 2), deploy the AWS Security Hub PCI DSS standard immediately — it maps 100+ Config rules directly to PCI requirements and gives you an audit-ready compliance dashboard out of the box
Tech Stack Summary
- Compute: Amazon ECS Fargate (production), ECS Fargate Spot (non-production), CodeBuild (BUILD_GENERAL1_MEDIUM/LARGE)
- Storage: Amazon S3 (artifacts, audit reports, flow logs archive, CloudTrail), S3 Intelligent-Tiering (compliance evidence)
- Database: Amazon Aurora PostgreSQL 15.4 (Global Database for DR), RDS Proxy, ElastiCache Redis 7
- Networking: Amazon VPC (3-tier), VPC Flow Logs, AWS WAF v2, Application Load Balancer, VPC Interface Endpoints, Transit Gateway
- Security: AWS KMS (CMKs per service tier), AWS Secrets Manager, IAM with Permission Boundaries, SCPs, Amazon Inspector v2, Amazon GuardDuty (+ Malware Protection), Amazon Macie, AWS Security Hub, AWS Config + Conformance Packs, CloudTrail (+ CloudTrail Lake), AWS Certificate Manager
- Monitoring: Amazon CloudWatch (Metrics, Logs, Alarms, Container Insights, Dashboards), AWS X-Ray, Amazon OpenSearch Service (log analytics), Amazon EventBridge, Amazon SNS
- CI/CD: AWS CodePipeline V2, AWS CodeBuild, Amazon ECR (Enhanced Scanning), AWS CodeStar Connections (GitHub integration), Amazon Inspector (InspectorScan pipeline action)
- IaC Tools: Terraform (primary), AWS CloudFormation (conformance packs, Config rules), Checkov (IaC scanning), Hadolint (Dockerfile linting), Semgrep (SAST), Bandit (Python SAST)
This project showed me — again — that the best security architecture is one developers don't fight. When the pipeline gives developers faster, more reliable feedback on their code and catches issues that would otherwise cause 2 AM incidents, they stop treating security gates as obstacles and start treating them as features. That cultural shift, more than any individual AWS service, was the real outcome of this engagement.