DEV Community

Mehmet TURAÇ


Great Stack to Doesn't Work #6 — CI/CD: "Pipeline Green, Production Red"

A survival guide for when everything goes wrong in production.


The pipeline is green. Every stage passed. Tests: green. Lint: green. Build: green. Security scan: green. The deploy button says "Ready." You click it.

Five minutes later, the error rate jumps to 15%. The pipeline is still green. It will stay green while your users can't check out, because the pipeline tests what you wrote, not what production does with it.


Why Your Pipeline Lies to You

A green pipeline means your code compiles, your tests pass, and your container builds. It does not mean your code works in production. The gap between "works in CI" and "works in production" is where incidents live.

The most common gaps:

Environment drift. CI runs on a clean container with a fresh database. Production has 3 years of accumulated data, schema migrations that ran in a different order during the early days, and environment variables that were set manually by someone who left the company.

Data shape. Your tests use factory-generated data with predictable shapes. Production has users who put emojis in their name field, addresses that are 4,000 characters long, and order records with null values in columns that "should never be null."

Traffic patterns. CI runs one test at a time, sequentially. Production handles 10,000 concurrent requests. Race conditions that never appear in CI appear within minutes in production.

Dependency versions. Your lock file pins exact versions, but your Docker base image pulls latest, or a system package updates between builds. The code is identical. The runtime is not.

The pipeline can't test for all of this. But it can test for more than it currently does.
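One cheap way to narrow the data-shape gap is to feed your tests the kinds of values production actually contains. The sketch below is illustrative only: `normalize_name` and the 200-character limit are made-up stand-ins for whatever your own code does with a name field.

```python
import unicodedata

MAX_NAME_LENGTH = 200  # assumed limit for illustration, not from a real schema


def normalize_name(raw):
    """Return a cleaned display name, or None if the input is unusable."""
    if raw is None:
        return None
    # NFC keeps emoji and accented characters but canonicalizes their encoding.
    cleaned = unicodedata.normalize("NFC", raw).strip()
    if not cleaned:
        return None
    return cleaned[:MAX_NAME_LENGTH]


# Inputs shaped like the production surprises described above:
# nulls, blanks, emoji, and absurdly long values.
AWKWARD_INPUTS = [None, "", "   ", "Zoë 🚀", "x" * 4000]

for value in AWKWARD_INPUTS:
    result = normalize_name(value)
    assert result is None or 0 < len(result) <= MAX_NAME_LENGTH
```

The point is less the function than the input list: keep a shared set of ugly values and run every field handler against it.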


Layer Caching: Cutting Build Times by 80%

Docker builds are often slow because they rebuild layers that haven't changed. Instructions such as RUN, COPY, and ADD each create a layer. If a layer's inputs haven't changed, and no earlier layer has changed either, Docker can reuse the cached version.

The problem: CI environments often start with an empty cache. Every build is a fresh build. 12 minutes to install dependencies that haven't changed since last week.

Solutions:

Registry-based caching. Push cache layers to your container registry. Pull them at the start of each build.

# BuildKit is required for inline cache metadata
DOCKER_BUILDKIT=1 docker build \
  --cache-from myregistry/myapp:latest \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  -t myregistry/myapp:latest .
# The pushed image carries the cache metadata that the next build pulls
docker push myregistry/myapp:latest

GitHub Actions cache (or equivalent):

- uses: actions/cache@v4
  with:
    path: /tmp/.buildx-cache
    key: ${{ runner.os }}-buildx-${{ hashFiles('**/package-lock.json') }}
    # Fall back to the most recent cache when the lockfile hash changes
    restore-keys: |
      ${{ runner.os }}-buildx-

Separate dependency and code layers. This is Docker 101 but people still get it wrong:

# Manifests first: this layer is invalidated only when dependencies change
COPY package*.json ./
RUN npm ci
# Source last: code-only changes reuse the cached npm ci layer
COPY . .

Dependencies change weekly. Code changes hourly. Separate them so the expensive npm ci layer is cached across code-only changes.
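If the ordering rule feels arbitrary, a toy model helps. The sketch below is a deliberate simplification, not Docker's actual implementation: it assumes each layer's cache key depends on its own instruction, its inputs, and the key of the layer before it. The step names and version labels are invented for the example.

```python
import hashlib


def layer_keys(steps):
    """Chain each step's key to the previous one, the way a layer cache does."""
    keys, parent = [], ""
    for instruction, inputs in steps:
        digest = hashlib.sha256(f"{parent}|{instruction}|{inputs}".encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


def rebuilt_layers(old_steps, new_steps):
    """Return the instructions whose cached layer can no longer be reused."""
    old, new = layer_keys(old_steps), layer_keys(new_steps)
    return [
        step[0]
        for i, step in enumerate(new_steps)
        if i >= len(old) or old[i] != new[i]
    ]


# Dependencies copied first: a code-only change rebuilds one cheap layer.
good_old = [("COPY package*.json ./", "lock-v1"), ("RUN npm ci", ""), ("COPY . .", "src-v1")]
good_new = [("COPY package*.json ./", "lock-v1"), ("RUN npm ci", ""), ("COPY . .", "src-v2")]
print(rebuilt_layers(good_old, good_new))  # ['COPY . .']

# Everything copied first: the same change also invalidates npm ci.
bad_old = [("COPY . .", "lock-v1+src-v1"), ("RUN npm ci", "")]
bad_new = [("COPY . .", "lock-v1+src-v2"), ("RUN npm ci", "")]
print(rebuilt_layers(bad_old, bad_new))  # ['COPY . .', 'RUN npm ci']
```

Once one layer misses the cache, everything after it misses too. That is the whole reason the expensive step should sit above the frequently changing one.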

A team I worked with reduced their build from 14 minutes to 3 minutes by adding registry-based caching and reordering their Dockerfile. No infrastructure changes. No new tools. Just understanding how Docker layer caching works.


Parallel Stages: Stop Running Tests Sequentially

If your test suite takes 20 minutes, and you have 4 CI runners, split the tests into 4 parallel groups. Each group takes 5 minutes. Total wall time: 5 minutes.

The naive approach — splitting by file count — creates unbalanced groups. One group might have 3 integration test files that each take 2 minutes, while another group has 50 unit test files that each take 100ms.

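Better: split by historical timing data where your tooling supports it. The simplest starting point, though, is the test runner's built-in sharding: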

# GitHub Actions example with test splitting
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx jest --shard=${{ matrix.shard }}/4

Jest's --shard flag distributes tests across shards using file hashing. For more sophisticated balancing, tools like split_tests (Ruby), pytest-split, or CI-specific features (CircleCI's test splitting, Buildkite's parallelism) use timing data from previous runs to create balanced groups.
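To make "split by timing" concrete, here is a minimal sketch of the idea, assuming you already have per-file durations from a previous run. It uses a greedy longest-first heuristic; I'm not claiming this is what any of the tools above do internally, and the file names are invented.

```python
import heapq


def split_by_timing(durations, shard_count):
    """Greedy longest-first assignment: each file goes to the lightest shard."""
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]
    # Place the slowest files first; ties are broken by name for stable output.
    for name, seconds in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (load + seconds, index))
    return shards


# The unbalanced example from above: 3 slow integration files, 50 fast unit files.
durations = {f"integration_{i}.test.js": 120.0 for i in range(3)}
durations.update({f"unit_{i}.test.js": 0.1 for i in range(50)})

shards = split_by_timing(durations, 4)
for shard in shards:
    print(len(shard), "files,", round(sum(durations[name] for name in shard), 1), "seconds")
```

With these numbers each integration file gets its own shard and all 50 unit files share the fourth, so no shard runs longer than the slowest single file.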


Flaky Tests: The "This Test Passes Sometimes" Syndrome

Flaky tests are worse than failing tests. A failing test tells you something is broken. A flaky test tells you nothing — it might be broken, or it might just be having a bad day.

The damage is insidious. Engineers start re-running the pipeline when a test fails. "Oh, that test is flaky, just retry." Now you're training the team to ignore test failures. The day a real bug causes a test to fail, nobody investigates — they just retry until it passes.

Detection:

  • Track test results over time. If a test fails more than 1% of the time and the failures don't correlate with code changes, it's flaky.
  • Quarantine flaky tests into a separate suite that runs but doesn't block the pipeline. Fix them with priority.
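A rough sketch of the detection rule, assuming your CI can export a history of (test, commit, passed) records. The record shape is my assumption, and the 1% default threshold is the one from the text. The trick is to count only failures on commits where the same test also passed, since those can't be blamed on a code change.

```python
from collections import defaultdict


def find_flaky_tests(runs, threshold=0.01):
    """runs: iterable of (test_name, commit_sha, passed) tuples."""
    outcomes = defaultdict(lambda: defaultdict(list))
    for test, commit, passed in runs:
        outcomes[test][commit].append(passed)

    flaky = {}
    for test, by_commit in outcomes.items():
        # Failures on a commit that also has a pass are "unexplained":
        # same code, different result.
        unexplained = sum(
            results.count(False)
            for results in by_commit.values()
            if True in results
        )
        total = sum(len(results) for results in by_commit.values())
        rate = unexplained / total
        if rate > threshold:
            flaky[test] = rate
    return flaky


history = [
    ("checkout", "abc123", True),
    ("checkout", "abc123", False),   # same commit, different result: flaky
    ("login", "abc123", False),
    ("login", "abc123", False),      # consistently failing: broken, not flaky
    ("search", "abc123", True),
]
print(find_flaky_tests(history))  # {'checkout': 0.5}
```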

Common causes:

  • Time dependency. Tests that assume a specific time or date, or that measure elapsed time with tight tolerances. A test that passes in 100ms locally might take 300ms in CI due to shared resources.
  • Order dependency. Test A creates data, test B reads it. When tests run in a different order (parallel execution, random seed), test B fails.
  • External dependency. Tests that call a real API, read from a shared database, or depend on DNS resolution.
  • Race conditions. Async operations that complete faster on your machine than in CI.

Fix: isolate, mock, use deterministic clocks, and clean up after every test.
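"Deterministic clocks" sounds fancier than it is. A minimal sketch, with a hypothetical `FakeClock` and `is_expired` standing in for your own code: the code under test asks an injected clock for the time, and the test moves that clock by hand instead of sleeping.

```python
from datetime import datetime, timedelta, timezone


class FakeClock:
    """Test double: time moves only when the test says so."""

    def __init__(self, start):
        self._now = start

    def now(self):
        return self._now

    def advance(self, **kwargs):
        self._now += timedelta(**kwargs)


def is_expired(created_at, ttl, clock):
    """The code under test asks the injected clock, never datetime.now()."""
    return clock.now() - created_at >= ttl


clock = FakeClock(datetime(2024, 1, 1, tzinfo=timezone.utc))
created = clock.now()
assert not is_expired(created, timedelta(minutes=30), clock)

clock.advance(minutes=30)  # no sleep, no tolerance window, same result on any machine
assert is_expired(created, timedelta(minutes=30), clock)
```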


Rollback Strategies: Choosing Your Safety Net

When a deployment goes wrong, how fast can you get back to the previous version?

Rolling update: Replace pods one by one. If the new version is broken, you notice after some pods are already updated. Rolling back means deploying the previous version, which takes as long as the original deployment.

Blue-green: Run two identical environments. Blue is live. Deploy to green. Test green. Switch traffic from blue to green. If green fails, switch back to blue. Rollback is instant — just change the traffic routing. Cost: you need double the infrastructure.

Canary: Send 1% of traffic to the new version. Monitor error rates, latency, and business metrics. If everything looks good, gradually increase to 10%, 25%, 50%, 100%. If anything looks bad at any stage, route all traffic back to the stable version.
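The promotion logic for a canary can be sketched in a few lines. This is an illustration, not any particular tool's behavior: the stage list comes from the paragraph above, while `max_ratio` and `absolute_floor` are assumed tuning knobs. It compares the canary against the stable version's current error rate rather than a fixed number.

```python
STAGES = [1, 10, 25, 50, 100]  # percent of traffic, as in the text above


def next_canary_weight(current, canary_error_rate, stable_error_rate,
                       max_ratio=2.0, absolute_floor=0.001):
    """Return the next traffic percentage, or 0 to route everything back."""
    # Judge the canary relative to the stable baseline; the floor avoids
    # failing on noise when the baseline error rate is close to zero.
    allowed = max(stable_error_rate * max_ratio, absolute_floor)
    if canary_error_rate > allowed:
        return 0
    later = [stage for stage in STAGES if stage > current]
    return later[0] if later else 100


print(next_canary_weight(1, canary_error_rate=0.002, stable_error_rate=0.002))   # 10
print(next_canary_weight(10, canary_error_rate=0.05, stable_error_rate=0.002))   # 0
```

A relative check like this tends to trip earlier than a fixed threshold, which matters because real users see every error the canary serves while you wait.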

Feature flags: Deploy the code but don't activate it. The feature is behind a flag that defaults to off. Enable it for internal users first. Then 1% of users. Then 10%. If something breaks, flip the flag off. The code stays deployed; the feature deactivates. This is the most granular rollback mechanism — you can revert a single feature without touching the deployment.
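Here is a minimal sketch of how a percentage rollout can work, assuming a flag is just a small dict (the field names are mine, not from any specific flag service). Hashing the user ID gives each user a stable bucket, so someone in the 1% cohort stays enabled when you widen to 10%.

```python
import hashlib


def bucket(user_id, flag_name):
    """Map a user to a stable bucket in [0, 100) for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag, user_id, is_internal=False):
    """flag: dict with 'name', 'enabled', 'internal_only', 'percent'."""
    if not flag["enabled"]:  # kill switch: this is the rollback path
        return False
    if is_internal:
        return True
    if flag["internal_only"]:
        return False
    return bucket(user_id, flag["name"]) < flag["percent"]


flag = {"name": "new-checkout", "enabled": True, "internal_only": False, "percent": 10}
users = [f"user-{i}" for i in range(1000)]
print(sum(is_enabled(flag, u) for u in users), "of", len(users), "users enabled")

flag["enabled"] = False  # the "rollback": no deploy, no pipeline
print(sum(is_enabled(flag, u) for u in users), "of", len(users), "users enabled")
```

Including the flag name in the hash keeps cohorts independent, so the same unlucky users aren't first in line for every experiment.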

The team from the war story later in this post had a rollback strategy of "deploy the previous version," which meant another 42-minute pipeline run. Their canary threshold was set to a 5% error rate. By the time the canary caught the problem, 3% of real users had already been affected, and the rollback took another 42 minutes. Total incident duration: over an hour.

After fixing the pipeline speed (11 minutes) and implementing feature flags, their rollback time dropped from 42 minutes to under 10 seconds — just a flag flip.


Secret Management: Stop Hardcoding Credentials

Secrets in environment variables are the minimum bar. But CI/CD pipelines have their own secret lifecycle that most teams handle poorly.

Token expiration. CI tokens, deploy keys, API keys — they all expire. If nobody monitors expiration dates, one morning your pipeline fails and nobody can deploy until someone provisions a new token. This happened to us: a GitHub App installation token expired mid-deployment. 45 minutes of "why is git clone failing?" before someone checked the token creation date.
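For long-lived credentials, even a tiny scheduled check beats finding out during a deploy. A minimal sketch, assuming you keep an inventory of expiry dates somewhere; the secret names and the 14-day warning window are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def expiring_secrets(expirations, now, warn_within=timedelta(days=14)):
    """Return (name, time_left) for secrets expiring inside the window."""
    soon = [
        (name, expires_at - now)
        for name, expires_at in expirations.items()
        if expires_at - now <= warn_within
    ]
    # Most urgent first; already-expired secrets have a negative time_left.
    return sorted(soon, key=lambda item: item[1])


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
inventory = {
    "deploy-key": now + timedelta(days=3),
    "registry-token": now + timedelta(days=90),
    "api-key": now - timedelta(hours=1),
}
for name, time_left in expiring_secrets(inventory, now):
    print(name, "expired" if time_left < timedelta(0) else f"expires in {time_left.days} days")
```

Run something like this on a schedule and send the output wherever your team actually reads alerts.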

Secret rotation. If you rotate a database password, you need to update it in your CI secrets, your Kubernetes secrets, your application config, and your monitoring system. Miss one, and something breaks silently.

Least privilege. Your CI pipeline doesn't need admin access to your cloud account. It needs permission to push images, update deployments, and maybe run migrations. Create a dedicated CI service account with only the permissions it needs.

Use a secret manager (HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager) and pull secrets at runtime. Don't bake them into images. Don't store them in git. Don't pass them as build arguments: the values are recorded in the image metadata and anyone with the image can read them with docker history.


GitOps: Let Git Be the Source of Truth

GitOps (ArgoCD, Flux) flips the deployment model. Instead of "CI pushes a new version to the cluster," git is the desired state and an operator pulls the desired state from git.

The workflow:

  1. PR changes the Kubernetes manifests or Helm values in a git repo.
  2. PR is reviewed, approved, merged.
  3. ArgoCD detects the change, compares it to the current cluster state, and applies the diff.
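The "compare and apply the diff" step is easier to reason about with a toy version. This sketch is conceptual and not how ArgoCD or Flux are implemented: it assumes both states are plain dicts keyed by resource name, and the resource names are made up.

```python
def plan_sync(desired, live):
    """Diff desired state (from git) against live state (from the cluster)."""
    return {
        "create": sorted(desired.keys() - live.keys()),
        "update": sorted(
            name for name in desired.keys() & live.keys()
            if desired[name] != live[name]
        ),
        # Resources in the cluster but not in git: drift or leftovers.
        "prune": sorted(live.keys() - desired.keys()),
    }


desired = {
    "api": {"image": "api:1.5", "replicas": 3},
    "worker": {"image": "worker:2.0", "replicas": 2},
}
live = {
    "api": {"image": "api:1.4", "replicas": 3},
    "debug-pod": {"image": "busybox", "replicas": 1},  # applied by hand
}
print(plan_sync(desired, live))
# {'create': ['worker'], 'update': ['api'], 'prune': ['debug-pod']}
```

An operator runs this kind of comparison in a loop, which is why a manual change either gets reported as drift or quietly reverted.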

Benefits:

  • Every deployment is a git commit. Full audit trail.
  • Rollback is git revert. The operator sees the repo changed and syncs.
  • Drift detection: if someone runs kubectl apply manually, ArgoCD detects the drift and can auto-correct.

The operational reality: GitOps adds complexity. You now have a git repo to manage, an operator to keep healthy, and a reconciliation loop that can conflict with manual interventions during incidents. It's worth it for teams with 10+ services and frequent deployments. For a team with 3 services deploying twice a week, a simple CI/CD pipeline is simpler and sufficient.


War Story: From 42 Minutes to 11

Monorepo. 4 services. 1 pipeline that built everything, tested everything, and deployed everything, regardless of which service changed.

The 42-minute breakdown:

  • Docker build: 8 minutes (no caching)
  • Unit tests: 12 minutes (sequential, 2,400 tests)
  • Integration tests: 14 minutes (starting 3 databases, running sequentially)
  • Deploy: 8 minutes (rolling update, health check wait)

The 8 changes:

  1. Registry-based Docker caching. Build dropped from 8 minutes to 2.
  2. Only build changed services. Used git diff to detect which service directories changed. If only service-A changed, only service-A builds and deploys.
  3. Parallel unit tests with sharding. 4 shards, 3 minutes per shard (wall time: 3 minutes instead of 12).
  4. Shared test database. Instead of starting a fresh database per test file, start one per test shard and use schema isolation. Integration test setup dropped from 6 minutes to 45 seconds.
  5. Parallel integration tests. With the shared database, integration tests could run in parallel. 14 minutes down to 4.
  6. Cached dependency installation. node_modules cached by lockfile hash. npm ci only runs when package-lock.json changes.
  7. Deploy only changed services. Same git diff approach. If service-B didn't change, don't redeploy it.
  8. Canary deploy with automated rollback. Instead of waiting for a full rolling update, deploy canary to 1 pod, run smoke tests, then proceed. If smoke tests fail, automatic rollback in 30 seconds.
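Changes 2 and 7 both hinge on turning a list of changed files into a list of services. A minimal sketch of that mapping, assuming a `services/<name>/` layout and a couple of shared paths that force a full rebuild. Both the layout and the shared paths are my assumptions, not details from the actual repo.

```python
from pathlib import PurePosixPath

SERVICES_ROOT = "services"                     # hypothetical monorepo layout
SHARED_PATHS = ("libs/", "package-lock.json")  # assumed: changes here rebuild everything


def services_to_build(changed_files, all_services):
    """Map the output of `git diff --name-only` to the services to rebuild."""
    affected = set()
    for path in changed_files:
        if path.startswith(SHARED_PATHS):
            return sorted(all_services)
        parts = PurePosixPath(path).parts
        if len(parts) >= 2 and parts[0] == SERVICES_ROOT and parts[1] in all_services:
            affected.add(parts[1])
    return sorted(affected)


ALL = ["service-a", "service-b", "service-c", "service-d"]
print(services_to_build(["services/service-a/src/index.js", "README.md"], ALL))
# ['service-a']
print(services_to_build(["libs/logger/index.js"], ALL))
# ['service-a', 'service-b', 'service-c', 'service-d']
```

The shared-path rule is the part people forget. Without it, a change to a common library ships nothing, and you're back to a green pipeline with a red production.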

Result: 11 minutes end-to-end for a single service change. 16 minutes for a full monorepo change. Developers went from deploying twice a day (because each deploy took so long) to deploying 8-10 times a day.


Key Takeaways

A green pipeline is a necessary condition for deployment, not a sufficient one. Your pipeline tests your code. Production tests your system.

Speed matters. A 42-minute pipeline doesn't just slow down deployment — it changes developer behavior. People batch changes, skip tests locally, and deploy less frequently. All of which increase risk.

Feature flags are the most underrated deployment tool. They decouple deployment from release. You can deploy code any time and release features when you're ready. Rollback is a flag flip, not a redeployment.

And manage your CI secrets like production secrets. They expire, they need rotation, and when they break, nobody can deploy.



Over to You

What's the longest your CI/CD pipeline has ever taken? How did you cut it down? And has anyone else been burned by an expired CI token during an incident?


If you enjoyed this, I write about production engineering, AI systems, and the messy reality of building software at scale.


This is part of the Great Stack to Doesn't Work series, a survival guide for when everything goes wrong in production. Follow the series to catch every episode.
