I built a pipeline that rolls itself back when production breaks

#aws #devops #cicd #cloudformation

Deployments that break silently at night bother me. By the time someone checks Slack in the morning, users have been hitting 502s for hours. I built ShipGuard because I wanted the infrastructure to fix itself before I even knew something was wrong.

It's a CodePipeline that does blue/green deployment with canary traffic shifting to EC2. If the new version starts returning 5xx errors, CodeDeploy shifts traffic back to the old version and kills the broken instances. I don't have to do anything.

Three CloudFormation templates. Everything in source control. Nothing configured by hand.

The pipeline flow

Push to main. Everything after that is automatic:

1. Pull source from GitHub
2. npm audit, Trivy, and git-secrets run first. High or critical vuln? Build dies.
3. Tests run, TypeScript compiles, artifact gets packaged
4. Deploy to staging (one instance, in-place)
5. Email lands asking me to approve
6. I approve. Blue/green starts on production.
7. 10% of traffic routes to the new version for 5 minutes
8. CloudWatch alarm stays quiet? Remaining traffic shifts over.
9. Old instances terminated. Done.

If the alarm fires during steps 7 or 8, traffic goes 100% back to the previous version. Green instances die. I get an email explaining which alarm triggered the rollback.

Things that bit me

TimeBasedCanary doesn't work for EC2

I spent an afternoon trying to configure TimeBasedCanary in a custom deployment config. The template passed linting, then CloudFormation rejected it at deploy time with "Traffic routing configuration should be null for Server deployment configuration."

Turns out canary percentage configs only exist for ECS and Lambda. For EC2, the ALB and target group weights handle traffic shifting, not a CodeDeploy config. Nowhere in the docs does it say this clearly; you just find out when it breaks.
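For comparison, here is a minimal sketch of a custom deployment config that the Server (EC2) platform does accept. The resource name Ec2DeploymentConfig and the 50% value are placeholders for illustration, not what ShipGuard necessarily uses:

```yaml
Ec2DeploymentConfig:                 # hypothetical logical ID
  Type: AWS::CodeDeploy::DeploymentConfig
  Properties:
    ComputePlatform: Server
    # EC2 configs are expressed as "how many hosts must stay healthy",
    # not as a traffic percentage.
    MinimumHealthyHosts:
      Type: FLEET_PERCENT
      Value: 50                      # illustrative value
    # No TrafficRoutingConfig here: adding one to a Server config
    # produces the "should be null" error quoted above.
```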

IAM role chain from hell

Four roles. They all need to trust different AWS services, and they all need slightly different permissions:

Pipeline role needs iam:PassRole to hand off to CodeDeploy
CodeBuild role needs S3 access to the artifact bucket plus CloudWatch Logs
CodeDeploy role needs EC2, ASG, ALB, S3
EC2 instance profile needs to pull from S3 and push CloudWatch metrics

Miss one permission and you get "Access Denied" with no indication of which call failed or which role is the problem. I iterated on this more times than I'd like to admit.
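As a rough sketch of what one link in that chain looks like, here is a CodeDeploy service role using the AWS managed policy. The logical ID CodeDeployRole is a placeholder; the other three roles follow the same shape with a different trusted principal:

```yaml
CodeDeployRole:                      # hypothetical logical ID
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            # Swap the principal per role:
            #   codepipeline.amazonaws.com, codebuild.amazonaws.com,
            #   codedeploy.amazonaws.com, ec2.amazonaws.com
            Service: codedeploy.amazonaws.com
          Action: sts:AssumeRole
    ManagedPolicyArns:
      # AWS managed policy covering the EC2, Auto Scaling, and load balancer calls
      - arn:aws:iam::aws:policy/service-role/AWSCodeDeployRole
```

When an "Access Denied" shows up, checking the trust principal first is cheap: a role that the calling service can't assume fails before any permission policy is evaluated.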

You need an AMI in your Launch Template

This one's embarrassing. cfn-lint doesn't catch a missing ImageId in a Launch Template. CloudFormation doesn't catch it either, until the ASG actually tries to spin up an instance and fails. The fix is an SSM parameter that resolves to the latest Amazon Linux 2023 AMI:

LatestAmiId:
  Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
  Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64
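Referencing that parameter from the Launch Template might look like the sketch below. AppLaunchTemplate and the instance type are placeholders I picked for illustration:

```yaml
AppLaunchTemplate:                   # hypothetical logical ID
  Type: AWS::EC2::LaunchTemplate
  Properties:
    LaunchTemplateData:
      # Resolves to the current Amazon Linux 2023 AMI at deploy time
      ImageId: !Ref LatestAmiId
      InstanceType: t3.micro         # illustrative value
```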

CodeStarNotifications has different tag syntax

I deployed the pipeline stack three times before figuring this out. Most resources in CloudFormation take Tags as a list:

Tags:
  - Key: Name
    Value: my-rule

AWS::CodeStarNotifications::NotificationRule wants a map:

Tags:
  Name: my-rule

CloudFormation gives you "Properties validation failed" with no hint about what's wrong. I found the answer buried in a GitHub issue after 20 minutes of searching.

Security scanning

Three scans run before tests. If any exit non-zero, the build stops, and nothing reaches staging:

pre_build:
  commands:
    - npm audit --audit-level=high
    - trivy fs --severity HIGH,CRITICAL --exit-code 1 .
    - git secrets --scan

No third-party service needed. npm audit is already there, Trivy downloads in a few seconds during install, and git-secrets is an AWS open source tool that's one clone away. The pipeline sends a notification identifying which scan killed the build.
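The install phase for those tools could be sketched roughly like this. It assumes the build image has curl, git, and make available, and it registers the AWS patterns globally because a CodePipeline source artifact may not include a .git directory. The /tmp path is arbitrary:

```yaml
version: 0.2
phases:
  install:
    commands:
      # Trivy's install script from the aquasecurity/trivy repo
      - curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin
      # git-secrets: clone and install (path is an arbitrary choice)
      - git clone --depth 1 https://github.com/awslabs/git-secrets.git /tmp/git-secrets
      - make -C /tmp/git-secrets install
      # Without registered patterns, the scan has nothing to match against
      - git secrets --register-aws --global
```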

The rollback alarm

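Production5xxAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    MetricName: HTTPCode_Target_5XX_Count
    Namespace: AWS/ApplicationELB
    # ALB metrics are published per load balancer, so the alarm needs this dimension.
    # ProductionALB is a placeholder for your load balancer resource.
    Dimensions:
      - Name: LoadBalancer
        Value: !GetAtt ProductionALB.LoadBalancerFullName
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching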

More than 10 5xx errors in 60 seconds trigger the alarm (the comparison is GreaterThanThreshold, so exactly 10 doesn't). CodeDeploy has DEPLOYMENT_STOP_ON_ALARM in its rollback config, so it catches the alarm and reverses the deployment.
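The deployment group side of that wiring looks roughly like this. It's a fragment, not a full resource: CodeDeployApplication and CodeDeployRole are placeholder names, and the blue/green, load balancer, and Auto Scaling properties are left out:

```yaml
ProductionDeploymentGroup:           # hypothetical logical ID
  Type: AWS::CodeDeploy::DeploymentGroup
  Properties:
    ApplicationName: !Ref CodeDeployApplication    # placeholder
    ServiceRoleArn: !GetAtt CodeDeployRole.Arn     # placeholder
    # Tell CodeDeploy which alarm to watch during the deployment
    AlarmConfiguration:
      Enabled: true
      Alarms:
        - Name: !Ref Production5xxAlarm            # Ref returns the alarm name
    # Roll back when the deployment fails or the alarm stops it
    AutoRollbackConfiguration:
      Enabled: true
      Events:
        - DEPLOYMENT_FAILURE
        - DEPLOYMENT_STOP_ON_ALARM
```

Both halves are needed: AlarmConfiguration stops the deployment when the alarm fires, and the DEPLOYMENT_STOP_ON_ALARM event is what turns that stop into a rollback.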

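TreatMissingData: notBreaching is worth noting. The ALB only publishes HTTPCode_Target_5XX_Count when there are errors to count, so quiet periods (nights, weekends) produce no datapoints at all. The default setting, missing, leaves the alarm in INSUFFICIENT_DATA instead of OK when that happens; notBreaching treats the gap as healthy. I added it after a false rollback the first time I tested this on a weekend.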

What I'd change next time

I'd probably use ECS Fargate. CodeDeploy's blue/green for ECS actually supports TimeBasedCanary properly, so you can do a true 10% -> 100% canary with an observation window in between, or use TimeBasedLinear for equal-sized steps. EC2 blue/green is coarser. You get "new instances with traffic control," but the fine-grained percentage steps aren't natively there.
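For reference, the ECS version of the config that failed on EC2 would be something like this sketch. The logical ID is a placeholder, and the 10% / 5 minute values mirror the canary step described in the pipeline flow:

```yaml
EcsCanaryConfig:                     # hypothetical logical ID
  Type: AWS::CodeDeploy::DeploymentConfig
  Properties:
    ComputePlatform: ECS
    TrafficRoutingConfig:
      Type: TimeBasedCanary
      TimeBasedCanary:
        CanaryPercentage: 10         # first shift: 10% of traffic
        CanaryInterval: 5            # minutes to wait before shifting the rest
```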

I stuck with EC2 because I wanted to learn how the instance-level deployment mechanics work. Worth it for the education. Probably not what I'd pick for a real production system in 2026.

Repo

Public: github.com/suletetes/ShipGuard

Three templates, one buildspec, one appspec, four shell scripts. Deploy staging first, then production, then pipeline. Push code. Pipeline picks it up.

Costs about $45/month with everything running. ALBs are most of that ($16 each). Tear down staging when you're not testing.

Stack

CodePipeline, CodeBuild, CodeDeploy
EC2 Auto Scaling behind ALBs
CloudWatch alarms
SNS
S3
CloudFormation
TypeScript/Express (the app being deployed)

If you've done the "deploy a Lambda" tutorials and want something closer to what production infrastructure actually looks like, this hits the right problems. Cross-stack references, IAM chains, blue/green mechanics, alarm-driven automation. The stuff that takes four tries to get right and nobody warns you about in advance.