Sumit Gautam

The CI/CD Pipeline That Looked Fine But Was Silently Failing

Everything was green. The deployment succeeded. Production was broken for hours. Here's what I learned.


There's a specific kind of production incident that's worse than an outage.

An outage is loud. Alerts fire, dashboards go red, everyone knows something is wrong. You fix it.

The silent failure is different. The pipeline is green. The deployment says "successful." No alerts fire. And somewhere in production, the wrong code is quietly running — serving stale responses, skipping validation, behaving in ways that don't match what you just merged. Nobody knows yet.

I've been on the wrong end of this more than once. Wrong Docker images deployed due to layer caching. Tests marked as passed that never actually ran. Environment variables from staging quietly bleeding into production. A deployment that reported success while the old version kept serving traffic because the agent never actually finished the job.

Each time, the CI/CD dashboard looked fine. That's what made it dangerous.

This article is about what green pipelines hide — and the specific verification habits that catch these failures before your users do.


Failure 1: The Docker Cache That Deployed Yesterday's Code

This one is subtle enough that it can fool you completely if you're not looking for it.

The scenario: you push a fix, the pipeline runs, the Docker build completes in 12 seconds instead of the usual 4 minutes. You don't think much of it — fast builds are good, right? Deployment succeeds. You check the service, it seems to respond. You close the laptop.

What actually happened: Docker's layer cache served a previously built image. Your COPY . . instruction reused a cached layer because the build context, as Docker hashed it, matched an earlier build — something that can happen in CI when a warm daemon or shared cache survives between runs of a freshly checked-out workspace. The image that got deployed was built from code that predated your fix.

The dangerous part is that the build log looks correct. You see your Dockerfile steps. You see layer hashes. Nothing screams "wrong image."

What catches this:

Always embed the Git commit SHA into your image at build time and verify it at deploy time:

# Dockerfile
ARG GIT_COMMIT=unknown
LABEL git-commit=$GIT_COMMIT
ENV GIT_COMMIT=$GIT_COMMIT

# GitHub Actions
- name: Build image
  run: |
    docker build \
      --build-arg GIT_COMMIT=${{ github.sha }} \
      --no-cache \
      -t myapp:${{ github.sha }} .

Then expose this via a /healthz or /version endpoint in your application and verify it immediately post-deployment. If the SHA in the running container doesn't match the SHA that triggered the pipeline — you have a problem, and you know it within seconds, not hours.
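As a sketch of that check (the endpoint path, the JSON field, and the verify_sha helper name are my assumptions, not a standard API):

```shell
# Post-deploy SHA gate: compare the SHA the pipeline built against the SHA
# the running service reports. verify_sha is a hypothetical helper name.
verify_sha() {
  expected="$1"
  running="$2"
  if [ "$running" != "$expected" ]; then
    echo "FATAL: service is running '$running', pipeline built '$expected'"
    return 1
  fi
  echo "Deploy verified: $running"
}

# In CI, feed it the triggering commit and the live /version response:
#   verify_sha "$GITHUB_SHA" \
#     "$(curl -sf https://myapp.example.com/version | jq -r '.git_commit')"
```

The point of the tiny helper is that the comparison fails loudly with both values in the log, so a mismatch is diagnosable from the pipeline output alone.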

For builds where you intentionally use caching for speed, use --cache-from with explicit cache sources rather than relying on local daemon cache. This gives you cache benefits with predictable, auditable behavior.
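A minimal sketch of that approach in GitHub Actions (the cache image name is a placeholder, and the cache image must have been pushed by a previous build for --cache-from to find anything):

```yaml
# GitHub Actions — cached build with an explicit, registry-backed cache source
# instead of whatever happens to be in the local daemon.
- name: Build image (cached, auditable)
  run: |
    docker pull myregistry.example.com/myapp:buildcache || true
    docker build \
      --build-arg GIT_COMMIT=${{ github.sha }} \
      --cache-from myregistry.example.com/myapp:buildcache \
      -t myapp:${{ github.sha }} .
```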


Failure 2: Tests That Were Skipped But Reported Green

This is the one that genuinely shook my confidence in pipelines for a while.

The scenario: a test suite that passed every run for weeks. No failures, consistent timing. Then a bug reaches production that the tests should have caught — and when you investigate, you find that the test step exited with code 0 (success) without actually running the tests. The framework had a configuration issue, found no test files matching the pattern, reported "0 tests run, 0 failures" and exited cleanly.

Zero failures. Zero tests. Green.

This happens across test frameworks, though each fails differently. Maven's Surefire plugin passes a build with zero tests by default; pytest does exit with code 5 when it collects nothing, but wrapper scripts and || true patterns routinely swallow that; Jest refuses an empty run unless passWithNoTests is set — until someone sets it. The frameworks aren't broken. They did exactly what you asked. You just didn't ask them to verify they ran something.

What catches this:

# GitHub Actions with pytest
- name: Run tests
  run: |
    pytest --tb=short -q

- name: Verify tests actually ran
  run: |
    COUNT=$(pytest --collect-only -q 2>&1 | tail -1 | grep -oE '^[0-9]+' || echo 0)
    if [ "$COUNT" -lt "10" ]; then
      echo "ERROR: Expected at least 10 tests, found $COUNT"
      exit 1
    fi

Add a minimum test count gate to your pipeline. It feels paranoid until the day it saves you. Also tighten your test framework's configuration so tests can't silently fall out of the run:

# pytest.ini — strict mode: a typo'd marker or config key becomes an error
# instead of silently deselecting tests
[pytest]
addopts = --strict-markers --strict-config

// jest.config.js — explicit, even though this is Jest's default
module.exports = {
  passWithNoTests: false,
};

The principle: a pipeline step that can succeed by doing nothing is a liability.


Failure 3: The Wrong Environment Variables in Production

This failure is almost embarrassingly simple — which is exactly why it happens.

The scenario: a deployment to production uses a configuration value from staging. A database connection string, an API endpoint, a feature flag threshold. The application starts fine because the staging value is valid — it just points somewhere wrong. The service runs, the pipeline is green, and for hours your production traffic is quietly hitting staging infrastructure or using misconfigured limits.

In a Jenkins multi-environment setup, this often happens when:

  • Environment-specific credential bindings aren't properly scoped to the deployment stage
  • A previous build's workspace has leftover .env files
  • Variable precedence between pipeline parameters, Jenkins credentials, and application defaults isn't clearly understood

What catches this:

First, never rely on implicit environment variable inheritance in pipelines. Be explicit and loud about what each stage receives:

// Jenkinsfile
stage('Deploy Production') {
  environment {
    APP_ENV = 'production'
    DB_HOST = credentials('prod-db-host')
  }
  steps {
    sh '''
      echo "Deploying to: $APP_ENV"
      # Confirm the credential is bound without echoing its value
      [ -n "$DB_HOST" ] || { echo "FATAL: DB_HOST is not set"; exit 1; }
      ./deploy.sh
    '''
  }
}

Second, add a post-deployment verification step that queries a /config or /env-check endpoint and asserts key environment markers are what you expect:

DEPLOYED_ENV=$(curl -sf https://myapp.prod/healthz | jq -r '.environment')
if [ "$DEPLOYED_ENV" != "production" ]; then
  echo "FATAL: Deployed environment is '$DEPLOYED_ENV', expected 'production'"
  exit 1
fi

This takes 30 seconds to write and catches an entire class of misconfiguration failures permanently.


Failure 4: Deployment Succeeded, Old Code Still Running

This one is specifically painful because the deployment tooling is telling you the truth — it did succeed. The problem is that "deployment succeeded" and "new code is serving traffic" are not the same statement.

The scenario: a Kubernetes rollout reports complete. GitHub Actions shows a green checkmark. You hit the service and you're getting responses consistent with the old version. What happened?

Common causes:

  • Rollout completed but pods are serving a cached image — imagePullPolicy: IfNotPresent on a node that already holds an old image under the same tag (the classic latest tag problem)
  • Old pods didn't terminate cleanly — they're still in Terminating state and still receiving traffic because the service selector hasn't fully propagated
  • The deployment updated but a HorizontalPodAutoscaler or another controller scaled back to old replicas before you checked
  • The CI agent itself failed mid-job, reported partial success, and the deployment step never fully executed

What catches this:

Never use mutable tags like latest in production Kubernetes manifests. Always deploy with the image SHA or a unique build tag:

# Bad
image: myapp:latest

# Good  
image: myapp:a3f8c21d

Add explicit rollout verification as a pipeline step, not a manual check:

# GitHub Actions
- name: Verify rollout
  run: |
    kubectl rollout status deployment/myapp --timeout=120s

- name: Verify correct image is running
  run: |
    RUNNING_IMAGE=$(kubectl get pods -l app=myapp \
      -o jsonpath='{.items[0].spec.containers[0].image}')
    EXPECTED_IMAGE="myapp:${{ github.sha }}"

    if [ "$RUNNING_IMAGE" != "$EXPECTED_IMAGE" ]; then
      echo "Image mismatch: running $RUNNING_IMAGE, expected $EXPECTED_IMAGE"
      exit 1
    fi

For the agent failure case — always configure your CI agents with heartbeat timeouts and ensure your pipeline has explicit failure handling for agent disconnection. A job that loses its agent mid-run should never report green.
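In GitHub Actions the analogous guard is an explicit timeout at the job and step level, so a hung or disconnected runner surfaces as a red X instead of a job stuck "in progress" (names here are illustrative):

```yaml
# A wedged runner becomes a hard failure after 30 minutes,
# not an indefinitely green-ish "in progress" job.
jobs:
  deploy:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - name: Deploy
        timeout-minutes: 10
        run: ./deploy.sh
```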


Failure 5: The Agent That Quietly Gave Up

This is the most operationally unglamorous failure on this list, and possibly the most common in Jenkins environments.

The scenario: a build agent goes offline, becomes unresponsive, or hits a resource limit mid-job. Depending on your Jenkins configuration, this can result in the job being marked as successful if the failure happens during a non-critical step, or if the agent timeout is set too generously and the job just... stops reporting.

You check the console log. It ends mid-line. No error. No stack trace. Just silence — and a green badge.

What catches this:

// Jenkinsfile — always set explicit timeouts
pipeline {
  agent any
  options {
    timeout(time: 30, unit: 'MINUTES')
    retry(2)  // one automatic re-run for infrastructure flakes
  }
  stages {
    stage('Build') {
      steps { sh './build.sh' }
    }
  }
  post {
    always {
      script {
        // A job that lost its agent can end with no result at all —
        // treat "no result" as a failure, never a success
        if (currentBuild.result == null) {
          currentBuild.result = 'FAILURE'
        }
      }
    }
  }
}

Monitor agent health as infrastructure — not as an afterthought. Agent failures should fire the same alerts as application failures. If your agents are running in Docker or Kubernetes, treat them with the same resource limits, health checks, and observability you'd apply to any production workload.
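If your agents run as Kubernetes pods, that means giving them the same spec you'd give an app. This is an illustrative sketch — the pod name, image tag, thresholds, and the liveness command are all placeholders to adapt:

```yaml
# Illustrative Jenkins agent pod: resource limits so a runaway build is
# killed rather than starving the node, and a liveness probe so a wedged
# agent gets restarted instead of silently holding jobs.
apiVersion: v1
kind: Pod
metadata:
  name: jenkins-agent
spec:
  containers:
    - name: agent
      image: jenkins/inbound-agent:latest  # pin a specific tag in practice
      resources:
        requests: { cpu: "500m", memory: "1Gi" }
        limits: { cpu: "2", memory: "4Gi" }
      livenessProbe:
        exec:
          command: ["sh", "-c", "pgrep java"]  # adapt to your agent process
        initialDelaySeconds: 30
        periodSeconds: 30
```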


The Underlying Principle

Every failure above shares a root cause: the pipeline verified that steps executed, not that outcomes were correct.

A step that runs is not the same as a step that succeeded in the way you intended. Green means the process completed. It does not mean the result is what you think it is.

The discipline of post-deployment verification — checking the SHA, querying the running environment, asserting the test count, confirming the rollout image — closes this gap. It's not extra work. It's the last mile of the deployment that most pipelines are missing.

Build pipelines that are skeptical of themselves. Verify outcomes, not just execution. Treat a deployment as unconfirmed until the running system tells you it's correct — not until your CI dashboard does.

The dashboard will lie to you. Production won't.


Quick Reference: The Verification Checklist

Add these steps to every production deployment pipeline:

  • Image SHA verification — confirm running container matches the commit that triggered the build
  • Test count gate — assert minimum number of tests ran, fail on zero
  • Environment assertion — query running service to confirm correct environment config
  • Rollout image check — verify deployed pods are running the new image, not a cached version
  • Agent timeout + null result handling — ensure agent failures produce explicit pipeline failures
  • Explicit --no-cache policy — or documented, auditable cache-from strategy

None of these take more than 20 lines to implement. Together, they eliminate the entire class of "it looked fine" incidents.
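As a sketch, several of those gates condense into one post-deploy script built around a single helper (assert_eq is a hypothetical name, and the endpoints and labels are placeholders):

```shell
#!/bin/sh
# Post-deploy verification sketch: every gate goes through one helper
# that fails loudly with both values in the log.

assert_eq() {
  # assert_eq LABEL EXPECTED ACTUAL — exit 1 on mismatch
  if [ "$2" != "$3" ]; then
    echo "FATAL: $1 mismatch: expected '$2', got '$3'"
    exit 1
  fi
  echo "OK: $1 = $2"
}

# In a real pipeline, wire the gates to live values, e.g.:
#   assert_eq "git commit"  "$GITHUB_SHA" \
#     "$(curl -sf https://myapp.prod/healthz | jq -r '.git_commit')"
#   assert_eq "environment" "production" \
#     "$(curl -sf https://myapp.prod/healthz | jq -r '.environment')"
#   assert_eq "image"       "myapp:$GITHUB_SHA" \
#     "$(kubectl get pods -l app=myapp \
#       -o jsonpath='{.items[0].spec.containers[0].image}')"
```

Running the gates sequentially means the first wrong answer stops the pipeline, and the log shows exactly which assumption about production was false.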


Have you been burned by a silent pipeline failure? I'd genuinely like to hear what broke and what you did to catch it — drop it in the comments.

