Sumit Gautam

The CI/CD Pipeline That Looked Fine But Was Silently Failing

Everything was green. The deployment succeeded. Production was broken for hours. Here's what I learned.


There's a specific kind of production incident that's worse than an outage.

An outage is loud. Alerts fire, dashboards go red, everyone knows something is wrong. You fix it.

The silent failure is different. The pipeline is green. The deployment says "successful." No alerts fire. And somewhere in production, the wrong code is quietly running — serving stale responses, skipping validation, behaving in ways that don't match what you just merged. Nobody knows yet.

I've been on the wrong end of this more than once. Wrong Docker images deployed due to layer caching. Tests marked as passed that never actually ran. Environment variables from staging quietly bleeding into production. A deployment that reported success while the old version kept serving traffic because the agent never actually finished the job.

Each time, the CI/CD dashboard looked fine. That's what made it dangerous.

This article is about what green pipelines hide — and the specific verification habits that catch these failures before your users do.


Failure 1: The Docker Cache That Deployed Yesterday's Code

This one is subtle enough that it can fool you completely if you're not looking for it.

The scenario: you push a fix, the pipeline runs, the Docker build completes in 12 seconds instead of the usual 4 minutes. You don't think much of it — fast builds are good, right? Deployment succeeds. You check the service, it seems to respond. You close the laptop.

What actually happened: Docker's layer cache served a previously built image. Your COPY . . instruction reused a cached layer because the build context, as Docker hashed it, matched an earlier build — something that can happen in CI when a warm daemon or shared cache survives between runs of a freshly checked-out workspace. The image that got deployed was built from code that predated your fix.

The dangerous part is that the build log looks correct. You see your Dockerfile steps. You see layer hashes. Nothing screams "wrong image."

What catches this:

Always embed the Git commit SHA into your image at build time and verify it at deploy time:

# Dockerfile
ARG GIT_COMMIT=unknown
LABEL git-commit=$GIT_COMMIT
ENV GIT_COMMIT=$GIT_COMMIT

# GitHub Actions
- name: Build image
  run: |
    docker build \
      --build-arg GIT_COMMIT=${{ github.sha }} \
      --no-cache \
      -t myapp:${{ github.sha }} .

Then expose this via a /healthz or /version endpoint in your application and verify it immediately post-deployment. If the SHA in the running container doesn't match the SHA that triggered the pipeline — you have a problem, and you know it within seconds, not hours.
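As a sketch of that check (the endpoint path, the JSON field, and the verify_sha helper name are my assumptions, not a standard API):

```shell
# Post-deploy SHA gate: compare the SHA the pipeline built against the SHA
# the running service reports. verify_sha is a hypothetical helper name.
verify_sha() {
  expected="$1"
  running="$2"
  if [ "$running" != "$expected" ]; then
    echo "FATAL: service is running '$running', pipeline built '$expected'"
    return 1
  fi
  echo "Deploy verified: $running"
}

# In CI, feed it the triggering commit and the live /version response:
#   verify_sha "$GITHUB_SHA" \
#     "$(curl -sf https://myapp.example.com/version | jq -r '.git_commit')"
```

The point of the tiny helper is that the comparison fails loudly with both values in the log, so a mismatch is diagnosable from the pipeline output alone.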

For builds where you intentionally use caching for speed, use --cache-from with explicit cache sources rather than relying on local daemon cache. This gives you cache benefits with predictable, auditable behavior.
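A minimal sketch of that approach in GitHub Actions (the cache image name is a placeholder, and the cache image must have been pushed by a previous build for --cache-from to find anything):

```yaml
# GitHub Actions — cached build with an explicit, registry-backed cache source
# instead of whatever happens to be in the local daemon.
- name: Build image (cached, auditable)
  run: |
    docker pull myregistry.example.com/myapp:buildcache || true
    docker build \
      --build-arg GIT_COMMIT=${{ github.sha }} \
      --cache-from myregistry.example.com/myapp:buildcache \
      -t myapp:${{ github.sha }} .
```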


Failure 2: Tests That Were Skipped But Reported Green

This is the one that genuinely shook my confidence in pipelines for a while.

The scenario: a test suite that passed every run for weeks. No failures, consistent timing. Then a bug reaches production that the tests should have caught — and when you investigate, you find that the test step exited with code 0 (success) without actually running the tests. The framework had a configuration issue, found no test files matching the pattern, reported "0 tests run, 0 failures" and exited cleanly.

Zero failures. Zero tests. Green.

This happens across test frameworks, though each fails differently. Maven's Surefire plugin passes a build with zero tests by default; pytest does exit with code 5 when it collects nothing, but wrapper scripts and || true patterns routinely swallow that; Jest refuses an empty run unless passWithNoTests is set — until someone sets it. The frameworks aren't broken. They did exactly what you asked. You just didn't ask them to verify they ran something.

What catches this:

# GitHub Actions with pytest
- name: Run tests
  run: |
    pytest --tb=short -q

- name: Verify tests actually ran
  run: |
    COUNT=$(pytest --collect-only -q 2>&1 | tail -1 | grep -oE '^[0-9]+' || echo 0)
    if [ "$COUNT" -lt "10" ]; then
      echo "ERROR: Expected at least 10 tests, found $COUNT"
      exit 1
    fi

Add a minimum test count gate to your pipeline. It feels paranoid until the day it saves you. Also tighten your test framework's configuration so tests can't silently fall out of the run:

# pytest.ini — strict mode: a typo'd marker or config key becomes an error
# instead of silently deselecting tests
[pytest]
addopts = --strict-markers --strict-config

// jest.config.js — explicit, even though this is Jest's default
module.exports = {
  passWithNoTests: false,
};

The principle: a pipeline step that can succeed by doing nothing is a liability.


Failure 3: The Wrong Environment Variables in Production

This failure is almost embarrassingly simple — which is exactly why it happens.

The scenario: a deployment to production uses a configuration value from staging. A database connection string, an API endpoint, a feature flag threshold. The application starts fine because the staging value is valid — it just points somewhere wrong. The service runs, the pipeline is green, and for hours your production traffic is quietly hitting staging infrastructure or using misconfigured limits.

In a Jenkins multi-environment setup, this often happens when:

  • Environment-specific credential bindings aren't properly scoped to the deployment stage
  • A previous build's workspace has leftover .env files
  • Variable precedence between pipeline parameters, Jenkins credentials, and application defaults isn't clearly understood

What catches this:

First, never rely on implicit environment variable inheritance in pipelines. Be explicit and loud about what each stage receives:

// Jenkinsfile
stage('Deploy Production') {
  environment {
    APP_ENV = 'production'
    DB_HOST = credentials('prod-db-host')
  }
  steps {
    sh '''
      echo "Deploying to: $APP_ENV"
      # Confirm the credential is bound without echoing its value
      [ -n "$DB_HOST" ] || { echo "FATAL: DB_HOST is not set"; exit 1; }
      ./deploy.sh
    '''
  }
}

Second, add a post-deployment verification step that queries a /config or /env-check endpoint and asserts key environment markers are what you expect:

DEPLOYED_ENV=$(curl -sf https://myapp.prod/healthz | jq -r '.environment')
if [ "$DEPLOYED_ENV" != "production" ]; then
  echo "FATAL: Deployed environment is '$DEPLOYED_ENV', expected 'production'"
  exit 1
fi

This takes 30 seconds to write and catches an entire class of misconfiguration failures permanently.


Failure 4: Deployment Succeeded, Old Code Still Running

This one is specifically painful because the deployment tooling is telling you the truth — it did succeed. The problem is that "deployment succeeded" and "new code is serving traffic" are not the same statement.

The scenario: a Kubernetes rollout reports complete. GitHub Actions shows a green checkmark. You hit the service and you're getting responses consistent with the old version. What happened?

Common causes:

  • Rollout completed but pods are serving a cached image — imagePullPolicy: IfNotPresent on a node that already holds an old image under the same tag (the classic latest tag problem)
  • Old pods didn't terminate cleanly — they're still in Terminating state and still receiving traffic because the service selector hasn't fully propagated
  • The deployment updated but a HorizontalPodAutoscaler or another controller scaled back to old replicas before you checked
  • The CI agent itself failed mid-job, reported partial success, and the deployment step never fully executed

What catches this:

Never use mutable tags like latest in production Kubernetes manifests. Always deploy with the image SHA or a unique build tag:

# Bad
image: myapp:latest

# Good  
image: myapp:a3f8c21d

Add explicit rollout verification as a pipeline step, not a manual check:

# GitHub Actions
- name: Verify rollout
  run: |
    kubectl rollout status deployment/myapp --timeout=120s

- name: Verify correct image is running
  run: |
    RUNNING_IMAGE=$(kubectl get pods -l app=myapp \
      -o jsonpath='{.items[0].spec.containers[0].image}')
    EXPECTED_IMAGE="myapp:${{ github.sha }}"

    if [ "$RUNNING_IMAGE" != "$EXPECTED_IMAGE" ]; then
      echo "Image mismatch: running $RUNNING_IMAGE, expected $EXPECTED_IMAGE"
      exit 1
    fi

For the agent failure case — always configure your CI agents with heartbeat timeouts and ensure your pipeline has explicit failure handling for agent disconnection. A job that loses its agent mid-run should never report green.
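In GitHub Actions the analogous guard is an explicit timeout at the job and step level, so a hung or disconnected runner surfaces as a red X instead of a job stuck "in progress" (names here are illustrative):

```yaml
# A wedged runner becomes a hard failure after 30 minutes,
# not an indefinitely green-ish "in progress" job.
jobs:
  deploy:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - name: Deploy
        timeout-minutes: 10
        run: ./deploy.sh
```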


Failure 5: The Agent That Quietly Gave Up

This is the most operationally unglamorous failure on this list, and possibly the most common in Jenkins environments.

The scenario: a build agent goes offline, becomes unresponsive, or hits a resource limit mid-job. Depending on your Jenkins configuration, this can result in the job being marked as successful if the failure happens during a non-critical step, or if the agent timeout is set too generously and the job just... stops reporting.

You check the console log. It ends mid-line. No error. No stack trace. Just silence — and a green badge.

What catches this:

// Jenkinsfile — always set explicit timeouts
pipeline {
  agent any
  options {
    timeout(time: 30, unit: 'MINUTES')
    retry(2)  // one automatic re-run for infrastructure flakes
  }
  stages {
    stage('Build') {
      steps { sh './build.sh' }
    }
  }
  post {
    always {
      script {
        // A job that lost its agent can end with no result at all —
        // treat "no result" as a failure, never a success
        if (currentBuild.result == null) {
          currentBuild.result = 'FAILURE'
        }
      }
    }
  }
}

Monitor agent health as infrastructure — not as an afterthought. Agent failures should fire the same alerts as application failures. If your agents are running in Docker or Kubernetes, treat them with the same resource limits, health checks, and observability you'd apply to any production workload.
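If your agents run as Kubernetes pods, that means giving them the same spec you'd give an app. This is an illustrative sketch — the pod name, image tag, thresholds, and the liveness command are all placeholders to adapt:

```yaml
# Illustrative Jenkins agent pod: resource limits so a runaway build is
# killed rather than starving the node, and a liveness probe so a wedged
# agent gets restarted instead of silently holding jobs.
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-agent
spec:
  containers:
    - name: agent
      image: jenkins/inbound-agent:latest  # pin a specific tag in practice
      resources:
        requests: { cpu: "500m", memory: "1Gi" }
        limits: { cpu: "2", memory: "4Gi" }
      livenessProbe:
        exec:
          command: ["sh", "-c", "pgrep java"]  # adapt to your agent process
        initialDelaySeconds: 30
        periodSeconds: 30
```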


The Underlying Principle

Every failure above shares a root cause: the pipeline verified that steps executed, not that outcomes were correct.

A step that runs is not the same as a step that succeeded in the way you intended. Green means the process completed. It does not mean the result is what you think it is.

The discipline of post-deployment verification — checking the SHA, querying the running environment, asserting the test count, confirming the rollout image — closes this gap. It's not extra work. It's the last mile of the deployment that most pipelines are missing.

Build pipelines that are skeptical of themselves. Verify outcomes, not just execution. Treat a deployment as unconfirmed until the running system tells you it's correct — not until your CI dashboard does.

The dashboard will lie to you. Production won't.


Quick Reference: The Verification Checklist

Add these steps to every production deployment pipeline:

  • Image SHA verification — confirm running container matches the commit that triggered the build
  • Test count gate — assert minimum number of tests ran, fail on zero
  • Environment assertion — query running service to confirm correct environment config
  • Rollout image check — verify deployed pods are running the new image, not a cached version
  • Agent timeout + null result handling — ensure agent failures produce explicit pipeline failures
  • Explicit --no-cache policy — or documented, auditable cache-from strategy

None of these take more than 20 lines to implement. Together, they eliminate the entire class of "it looked fine" incidents.
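As a sketch, several of those gates condense into one post-deploy script built around a single helper (assert_eq is a hypothetical name, and the endpoints and labels are placeholders):

```shell
#!/bin/sh
# Post-deploy verification sketch: every gate goes through one helper
# that fails loudly with both values in the log.

assert_eq() {
  # assert_eq LABEL EXPECTED ACTUAL — exit 1 on mismatch
  if [ "$2" != "$3" ]; then
    echo "FATAL: $1 mismatch: expected '$2', got '$3'"
    exit 1
  fi
  echo "OK: $1 = $2"
}

# In a real pipeline, wire the gates to live values, e.g.:
#   assert_eq "git commit"  "$GITHUB_SHA" \
#     "$(curl -sf https://myapp.prod/healthz | jq -r '.git_commit')"
#   assert_eq "environment" "production" \
#     "$(curl -sf https://myapp.prod/healthz | jq -r '.environment')"
#   assert_eq "image"       "myapp:$GITHUB_SHA" \
#     "$(kubectl get pods -l app=myapp \
#       -o jsonpath='{.items[0].spec.containers[0].image}')"
```

Running the gates sequentially means the first wrong answer stops the pipeline, and the log shows exactly which assumption about production was false.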


Have you been burned by a silent pipeline failure? I'd genuinely like to hear what broke and what you did to catch it — drop it in the comments.

