Suleiman Abdulkadir

I built a pipeline that rolls itself back when production breaks

Deployments that break silently at night bother me. By the time someone checks Slack in the morning, users have been hitting 502s for hours. I built ShipGuard because I wanted the infrastructure to fix itself before I even knew something was wrong.

It's a CodePipeline pipeline that does blue/green deployments to EC2 with canary traffic shifting. If the new version starts returning 5xx errors, CodeDeploy shifts traffic back to the old version and kills the broken instances. I don't have to do anything.

Three CloudFormation templates. Everything in source control. Nothing configured by hand.

The pipeline flow

Push to main. Everything after that is automatic:

  1. Pull source from GitHub
  2. npm audit, Trivy, git-secrets run first. High or critical vuln? Build dies.
  3. Tests run, TypeScript compiles, artifact gets packaged
  4. Deploy to staging (one instance, in-place)
  5. Email lands asking me to approve
  6. I approve. Blue/green starts on production.
  7. 10% of traffic routes to the new version for 5 minutes
  8. CloudWatch alarm stays quiet? Remaining traffic shifts over.
  9. Old instances terminated. Done.

If the alarm fires during steps 7 or 8, traffic goes 100% back to the previous version. Green instances die. I get an email explaining which alarm triggered the rollback.
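
The approval gate in steps 5 and 6 could look roughly like the fragment below, placed inside the pipeline's Stages list. This is a sketch, not the repo's actual template: the stage name, the ApprovalTopic logical ID, and the message text are assumptions made for illustration.

```yaml
# Fragment of AWS::CodePipeline::Pipeline -> Properties -> Stages (sketch only)
- Name: Approve
  Actions:
    - Name: ManualApproval
      ActionTypeId:
        Category: Approval
        Owner: AWS
        Provider: Manual
        Version: "1"
      Configuration:
        # Hypothetical SNS topic; !Ref on a topic returns its ARN
        NotificationArn: !Ref ApprovalTopic
        CustomData: "Staging deploy finished. Approve to start blue/green on production."
      RunOrder: 1
```

The pipeline pauses at this stage until someone approves or rejects, so nothing reaches production without that click.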

[Figure: ShipGuard architecture diagram]

Things that bit me

TimeBasedCanary doesn't work for EC2

I spent an afternoon trying to configure TimeBasedCanary in a custom deployment config. CloudFormation accepted the template at lint time and then failed at deploy time with "Traffic routing configuration should be null for Server deployment configuration."

Turns out canary percentage configs only exist for ECS and Lambda. For EC2, the ALB and target group weights handle traffic shifting, not a CodeDeploy config. Nowhere in the docs does it say this clearly; you just find out when it breaks.
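
For contrast, a custom deployment config that the Server (EC2) platform does accept might look like the sketch below. It only says how many hosts must stay healthy during a deployment and has no TrafficRoutingConfig block at all. The logical ID and the 50 percent value are illustrative assumptions.

```yaml
# Sketch only: logical ID and threshold are assumptions
Ec2DeploymentConfig:
  Type: AWS::CodeDeploy::DeploymentConfig
  Properties:
    ComputePlatform: Server
    MinimumHealthyHosts:
      Type: FLEET_PERCENT   # or HOST_COUNT
      Value: 50
    # No TrafficRoutingConfig here: adding one is what triggers
    # "Traffic routing configuration should be null for Server deployment configuration."
```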

IAM role chain from hell

Four roles. They all need to trust different AWS services, and they all need slightly different permissions:

  • Pipeline role needs iam:PassRole to hand off to CodeDeploy
  • CodeBuild role needs S3 access to the artifact bucket plus CloudWatch Logs
  • CodeDeploy role needs EC2, ASG, ALB, S3
  • EC2 instance profile needs to pull from S3 and push CloudWatch metrics

Miss one permission and you get "Access Denied" with no indication of which call failed or which role is the problem. I iterated on this more times than I'd like to admit.
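
As a rough sketch of two links in that chain, the snippet below shows a CodeDeploy service role with its trust policy and a pipeline role that is allowed to pass it. The logical IDs and policy name are hypothetical, and the permissions are deliberately minimal; a real pipeline role needs more than this one statement.

```yaml
# Sketch only: logical IDs and policy names are assumptions
CodeDeployRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: codedeploy.amazonaws.com   # who may assume this role
          Action: sts:AssumeRole
    ManagedPolicyArns:
      # AWS-managed policy covering the EC2, ASG and ELB calls CodeDeploy makes
      - arn:aws:iam::aws:policy/service-role/AWSCodeDeployRole

PipelineRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: codepipeline.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: PassRoleToCodeDeploy
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: Allow
              Action: iam:PassRole
              Resource: !GetAtt CodeDeployRole.Arn   # scope PassRole to one role
```

The CodeBuild role and the EC2 instance role follow the same pattern with codebuild.amazonaws.com and ec2.amazonaws.com as the trusted services.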

You need an AMI in your Launch Template

This one's embarrassing. cfn-lint doesn't catch a missing ImageId in a Launch Template. CloudFormation doesn't catch it either, until the ASG actually tries to spin up an instance and fails. The fix is an SSM parameter that resolves to the latest Amazon Linux 2023 AMI:

LatestAmiId:
  Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
  Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64
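
Wired into a Launch Template, that parameter could be used as in the sketch below. The AppLaunchTemplate logical ID and the instance type are assumptions; the point is only that ImageId is set explicitly.

```yaml
# Sketch only: logical ID and instance type are assumptions
Parameters:
  LatestAmiId:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64

Resources:
  AppLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateData:
        ImageId: !Ref LatestAmiId   # resolved from SSM at stack create/update time
        InstanceType: t3.micro
```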

CodeStarNotifications has different tag syntax

I deployed the pipeline stack three times before figuring this out. Most other resources in CloudFormation take Tags as a list of Key/Value pairs:

Tags:
  - Key: Name
    Value: my-rule

AWS::CodeStarNotifications::NotificationRule wants a map:

Tags:
  Name: my-rule

CloudFormation gives you "Properties validation failed" with no hint about what's wrong. I found the answer buried in a GitHub issue after 20 minutes of searching.
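
In context, a complete notification rule with map-style tags might look like the sketch below. The rule name, the Pipeline and NotificationTopic logical IDs, and the choice of event type are assumptions, not taken from the repo.

```yaml
# Sketch only: names and logical IDs are assumptions
PipelineFailureRule:
  Type: AWS::CodeStarNotifications::NotificationRule
  Properties:
    Name: my-rule
    DetailType: FULL
    Resource: !Sub arn:${AWS::Partition}:codepipeline:${AWS::Region}:${AWS::AccountId}:${Pipeline}
    EventTypeIds:
      - codepipeline-pipeline-pipeline-execution-failed
    Targets:
      - TargetType: SNS
        TargetAddress: !Ref NotificationTopic   # !Ref on a topic returns its ARN
    Tags:
      Name: my-rule   # a map, not a list of Key/Value pairs
```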

Security scanning

Three scans run before tests. If any exit non-zero, the build stops, and nothing reaches staging:

pre_build:
  commands:
    - npm audit --audit-level=high
    - trivy fs --severity HIGH,CRITICAL --exit-code 1 .
    - git secrets --scan

No third-party service needed. npm audit is already there, Trivy downloads in a few seconds during install, and git-secrets is an AWS open source tool that's one clone away. The pipeline sends a notification identifying which scan killed the build.
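
The install step mentioned above could be sketched as the buildspec phase below. The install locations and the use of the global pattern registration are assumptions; pin a Trivy version in a real pipeline rather than taking whatever the script installs.

```yaml
# Sketch only: paths and the --global registration are assumptions
version: 0.2
phases:
  install:
    commands:
      # Trivy's install script from the aquasecurity/trivy repository
      - curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh | sh -s -- -b /usr/local/bin
      # git-secrets is installed from source with make
      - git clone https://github.com/awslabs/git-secrets.git /tmp/git-secrets
      - make -C /tmp/git-secrets install
      # Register the built-in AWS credential patterns so the scan has something to match
      - git secrets --register-aws --global
```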

The rollback alarm

Production5xxAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    MetricName: HTTPCode_Target_5XX_Count
    Namespace: AWS/ApplicationELB
    Statistic: Sum
    Period: 60
    EvaluationPeriods: 1
    Threshold: 10
    ComparisonOperator: GreaterThanThreshold
    TreatMissingData: notBreaching

More than 10 5xx errors in 60 seconds trigger the alarm (the comparison is GreaterThanThreshold, so exactly 10 does not). CodeDeploy has DEPLOYMENT_STOP_ON_ALARM in its rollback config, so it catches the alarm and reverses the deployment.

TreatMissingData: notBreaching is worth noting. Without it, periods with zero traffic (nights, weekends) produce no data points and the alarm does not stay in OK: CloudWatch's default setting is missing, which drops the alarm into INSUFFICIENT_DATA instead of treating "no data" as healthy. That caused a false rollback the first time I tested this on a weekend.
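
On the CodeDeploy side, the hookup between the alarm and the rollback could look like the sketch below. The CodeDeployApplication and CodeDeployRole logical IDs are hypothetical, and the blue/green and load balancer settings are left out to keep the focus on the two relevant blocks.

```yaml
# Sketch only: logical IDs are assumptions; blue/green settings omitted
ProductionDeploymentGroup:
  Type: AWS::CodeDeploy::DeploymentGroup
  Properties:
    ApplicationName: !Ref CodeDeployApplication
    ServiceRoleArn: !GetAtt CodeDeployRole.Arn
    AlarmConfiguration:
      Enabled: true
      Alarms:
        - Name: !Ref Production5xxAlarm   # !Ref on an alarm returns its name
    AutoRollbackConfiguration:
      Enabled: true
      Events:
        - DEPLOYMENT_FAILURE
        - DEPLOYMENT_STOP_ON_ALARM   # stop and roll back when the alarm goes off
```

AlarmConfiguration tells CodeDeploy which alarms to watch during a deployment; AutoRollbackConfiguration tells it what to do when one of them stops the deployment.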

What I'd change next time

I'd probably use ECS Fargate. CodeDeploy's blue/green for ECS actually supports TimeBasedCanary properly, so you can do a true 10% -> 100% canary shift with an observation window in between, or use TimeBasedLinear for equal percentage steps at fixed intervals. EC2 blue/green is coarser. You get "new instances with traffic control," but the fine-grained percentage steps aren't natively there.
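
For comparison, a canary config on the ECS platform could be declared as in the sketch below. The logical ID is an assumption; the values mirror the 10 percent for 5 minutes used earlier in this post.

```yaml
# Sketch only: logical ID is an assumption
EcsCanaryConfig:
  Type: AWS::CodeDeploy::DeploymentConfig
  Properties:
    ComputePlatform: ECS
    TrafficRoutingConfig:
      Type: TimeBasedCanary
      TimeBasedCanary:
        CanaryPercentage: 10   # first shift
        CanaryInterval: 5      # minutes to wait before shifting the rest
```

The predefined CodeDeployDefault.ECSCanary10Percent5Minutes config does the same thing without a custom resource.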

I stuck with EC2 because I wanted to learn how the instance-level deployment mechanics work. Worth it for the education. Probably not what I'd pick for a real production system in 2026.

Repo

Public: github.com/suletetes/ShipGuard

Three templates, one buildspec, one appspec, four shell scripts. Deploy staging first, then production, then pipeline. Push code. Pipeline picks it up.

Costs about $45/month with everything running. ALBs are most of that ($16 each). Tear down staging when you're not testing.

Stack

  • CodePipeline, CodeBuild, CodeDeploy
  • EC2 Auto Scaling behind ALBs
  • CloudWatch alarms
  • SNS
  • S3
  • CloudFormation
  • TypeScript/Express (the app being deployed)

If you've done the "deploy a Lambda" tutorials and want something closer to what production infrastructure actually looks like, this hits the right problems. Cross-stack references, IAM chains, blue/green mechanics, alarm-driven automation. The stuff that takes four tries to get right and nobody warns you about in advance.

The deployed infrastructure

Here's what the stacks actually create in AWS:

  • CloudFormation stacks
  • VPC subnets (multi-AZ)
  • Security groups and their rules
  • Load balancers (ALB)
  • Target groups (blue and green)
  • Auto Scaling groups
  • EC2 instances
  • S3 artifact bucket
  • SNS topics
