Everything was green. The deployment succeeded. Production was broken for hours. Here's what I learned.
There's a specific kind of production incident that's worse than an outage.
An outage is loud. Alerts fire, dashboards go red, everyone knows something is wrong. You fix it.
The silent failure is different. The pipeline is green. The deployment says "successful." No alerts fire. And somewhere in production, the wrong code is quietly running — serving stale responses, skipping validation, behaving in ways that don't match what you just merged. Nobody knows yet.
I've been on the wrong end of this more than once. Wrong Docker images deployed due to layer caching. Tests marked as passed that never actually ran. Environment variables from staging quietly bleeding into production. A deployment that reported success while the old version kept serving traffic because the agent never actually finished the job.
Each time, the CI/CD dashboard looked fine. That's what made it dangerous.
This article is about what green pipelines hide — and the specific verification habits that catch these failures before your users do.
Failure 1: The Docker Cache That Deployed Yesterday's Code
This one is subtle enough that it can fool you completely if you're not looking for it.
The scenario: you push a fix, the pipeline runs, and the Docker build completes in 12 seconds instead of the usual 4 minutes. You don't think much of it — fast builds are good, right? Deployment succeeds. You check the service; it seems to respond. You close the laptop.
What actually happened: Docker's layer cache served a previously built image. Your `COPY . .` instruction didn't invalidate the cache — common in CI environments where a shared daemon or remote cache source still holds layers from an earlier build and the freshly checked-out context looks unchanged to the cache key. The image that got deployed was built from code that predated your fix.
The dangerous part is that the build log looks correct. You see your Dockerfile steps. You see layer hashes. Nothing screams "wrong image."
What catches this:
Always embed the Git commit SHA into your image at build time and verify it at deploy time:
```dockerfile
# Dockerfile — declare the ARG after your FROM line so it is in scope
ARG GIT_COMMIT=unknown
LABEL git-commit=$GIT_COMMIT
ENV GIT_COMMIT=$GIT_COMMIT
```
```yaml
# GitHub Actions
- name: Build image
  run: |
    docker build \
      --build-arg GIT_COMMIT=${{ github.sha }} \
      --no-cache \
      -t myapp:${{ github.sha }} .
```
Then expose this via a `/healthz` or `/version` endpoint in your application and verify it immediately post-deployment. If the SHA in the running container doesn't match the SHA that triggered the pipeline — you have a problem, and you know it within seconds, not hours.
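That post-deploy assertion is worth wrapping in a small helper so every pipeline makes it the same way. A minimal sketch, assuming the `/version` endpoint returns a `git_commit` field; `check_sha` and the variable names in the usage comment are made up for illustration:

```shell
#!/usr/bin/env bash
# check_sha EXPECTED ACTUAL — hypothetical helper: nonzero exit on mismatch,
# so the pipeline step fails loudly instead of shrugging.
check_sha() {
  local expected=$1 actual=$2
  if [ "$actual" != "$expected" ]; then
    echo "FATAL: running $actual, expected $expected" >&2
    return 1
  fi
  echo "verified commit $expected"
}

# In the pipeline (endpoint shape and env vars assumed):
#   RUNNING=$(curl -sf "$SERVICE_URL/version" | jq -r '.git_commit')
#   check_sha "$GITHUB_SHA" "$RUNNING"
```

Because the helper returns a real exit code, it composes with `set -e` or an explicit `|| exit 1` in any CI system.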
For builds where you intentionally use caching for speed, use `--cache-from` with explicit cache sources rather than relying on the local daemon cache. This gives you cache benefits with predictable, auditable behavior.
Failure 2: Tests That Were Skipped But Reported Green
This is the one that genuinely shook my confidence in pipelines for a while.
The scenario: a test suite that passed every run for weeks. No failures, consistent timing. Then a bug reaches production that the tests should have caught — and when you investigate, you find that the test step exited with code 0 (success) without actually running the tests. The framework had a configuration issue, found no test files matching the pattern, reported "0 tests run, 0 failures" and exited cleanly.
Zero failures. Zero tests. Green.
This happens across ecosystems, in different ways. Maven's Surefire passes by default when it finds no tests. Pytest and Jest do flag empty runs (pytest exits with code 5 when it collects nothing; Jest fails unless `--passWithNoTests` is set), but those signals are routinely swallowed by wrapper scripts, `|| true` habits, or CI configs that treat any completed step as green. The tools aren't broken. They did exactly what you asked. You just didn't ask the pipeline to verify they ran something.
What catches this:
```yaml
# GitHub Actions with pytest
- name: Run tests
  run: |
    pytest --tb=short -q

- name: Verify tests actually ran
  run: |
    # Count collected test ids; grep -c prints 0 (and exits 1) on no match,
    # so the || true keeps the subshell alive while COUNT still reads 0
    COUNT=$(pytest --collect-only -q 2>/dev/null | grep -c '::' || true)
    if [ "$COUNT" -lt 10 ]; then
      echo "ERROR: Expected at least 10 tests, found $COUNT"
      exit 1
    fi
```
Add a minimum test count gate to your pipeline. It feels paranoid until the day it saves you. Also lean on the framework's own guards against empty runs, and make sure nothing in your pipeline swallows them:

```ini
# pytest.ini
# pytest already exits with code 5 when it collects nothing; --strict-markers
# additionally errors on typo'd markers instead of letting a bad -m expression
# silently deselect every test
[pytest]
addopts = --strict-markers
```

```js
// jest.config.js
// false is the default; set it explicitly so nobody flips it to
// silence a "no tests found" failure
module.exports = {
  passWithNoTests: false,
};
```
The principle: a pipeline step that can succeed by doing nothing is a liability.
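That principle can be encoded once as a tiny reusable guard and applied to any count a step reports: tests collected, files linted, migrations applied. A sketch; `require_min` is a made-up helper name:

```shell
#!/usr/bin/env bash
# require_min LABEL FLOOR ACTUAL — fail when a pipeline-reported count
# falls below a floor, so "did nothing" can never read as success.
require_min() {
  local label=$1 floor=$2 actual=${3:-0}
  if [ "$actual" -lt "$floor" ]; then
    echo "ERROR: $label count $actual is below minimum $floor" >&2
    return 1
  fi
  echo "$label count $actual ok"
}

# Usage in a test step (pytest on PATH assumed):
#   require_min "tests" 10 "$(pytest --collect-only -q | grep -c '::')"
```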
Failure 3: The Wrong Environment Variables in Production
This failure is almost embarrassingly simple — which is exactly why it happens.
The scenario: a deployment to production uses a configuration value from staging. A database connection string, an API endpoint, a feature flag threshold. The application starts fine because the staging value is valid — it just points somewhere wrong. The service runs, the pipeline is green, and for hours your production traffic is quietly hitting staging infrastructure or using misconfigured limits.
In a Jenkins multi-environment setup, this often happens when:
- Environment-specific credential bindings aren't properly scoped to the deployment stage
- A previous build's workspace has leftover `.env` files
- Variable precedence between pipeline parameters, Jenkins credentials, and application defaults isn't clearly understood
What catches this:
First, never rely on implicit environment variable inheritance in pipelines. Be explicit and loud about what each stage receives:
```groovy
// Jenkinsfile
stage('Deploy Production') {
    environment {
        APP_ENV = 'production'
        DB_HOST = credentials('prod-db-host')
    }
    steps {
        sh '''
            echo "Deploying to: $APP_ENV"
            # Print a short prefix only, never the full secret.
            # printf %.8s is POSIX-safe; bash-only ${DB_HOST:0:8} breaks
            # when sh is dash
            echo "DB host prefix: $(printf %.8s "$DB_HOST")..."
            ./deploy.sh
        '''
    }
}
```
Second, add a post-deployment verification step that queries a `/config` or `/env-check` endpoint and asserts key environment markers are what you expect:
```shell
DEPLOYED_ENV=$(curl -sf https://myapp.prod/healthz | jq -r '.environment')
if [ "$DEPLOYED_ENV" != "production" ]; then
  echo "FATAL: Deployed environment is '$DEPLOYED_ENV', expected 'production'"
  exit 1
fi
```
This takes 30 seconds to write and catches an entire class of misconfiguration failures permanently.
Failure 4: Deployment Succeeded, Old Code Still Running
This one is specifically painful because the deployment tooling is telling you the truth — it did succeed. The problem is that "deployment succeeded" and "new code is serving traffic" are not the same statement.
The scenario: a Kubernetes rollout reports complete. GitHub Actions shows a green checkmark. You hit the service and you're getting responses consistent with the old version. What happened?
Common causes:
- Rollout completed but pods are serving from a cached image — `imagePullPolicy: IfNotPresent` on a node that already has the old image under the same tag (the classic `latest` tag problem)
- Old pods didn't terminate cleanly — they're still in `Terminating` state and still receiving traffic because endpoint updates haven't fully propagated
- The Deployment updated, but a HorizontalPodAutoscaler or another controller scaled back to old replicas before you checked
- The CI agent itself failed mid-job, reported partial success, and the deployment step never fully executed
What catches this:
Never use mutable tags like `latest` in production Kubernetes manifests. Always deploy with the image SHA or a unique build tag:
```yaml
# Bad
image: myapp:latest

# Good
image: myapp:a3f8c21d
```
Add explicit rollout verification as a pipeline step, not a manual check:
```yaml
# GitHub Actions
- name: Verify rollout
  run: |
    kubectl rollout status deployment/myapp --timeout=120s

- name: Verify correct image is running
  run: |
    EXPECTED_IMAGE="myapp:${{ github.sha }}"
    # Check every pod, not just the first; a single straggler
    # on the old image is exactly the failure we're hunting
    for RUNNING_IMAGE in $(kubectl get pods -l app=myapp \
        -o jsonpath='{.items[*].spec.containers[0].image}'); do
      if [ "$RUNNING_IMAGE" != "$EXPECTED_IMAGE" ]; then
        echo "Image mismatch: running $RUNNING_IMAGE, expected $EXPECTED_IMAGE"
        exit 1
      fi
    done
```
For the agent failure case — always configure your CI agents with heartbeat timeouts and ensure your pipeline has explicit failure handling for agent disconnection. A job that loses its agent mid-run should never report green.
Failure 5: The Agent That Quietly Gave Up
This is the most operationally unglamorous failure on this list, and possibly the most common in Jenkins environments.
The scenario: a build agent goes offline, becomes unresponsive, or hits a resource limit mid-job. Depending on your Jenkins configuration, this can result in the job being marked as successful if the failure happens during a non-critical step, or if the agent timeout is set too generously and the job just... stops reporting.
You check the console log. It ends mid-line. No error. No stack trace. Just silence — and a green badge.
What catches this:
```groovy
// Jenkinsfile — always set explicit timeouts
// (declarative pipelines also require an agent and at least one stage)
pipeline {
    agent any
    options {
        timeout(time: 30, unit: 'MINUTES')
        retry(1)
    }
    stages {
        stage('Deploy') {
            steps {
                sh './deploy.sh'
            }
        }
    }
    post {
        always {
            script {
                // A build that ends without a recorded result counts as failed
                if (currentBuild.result == null) {
                    currentBuild.result = 'FAILURE'
                }
            }
        }
    }
}
```
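Outside Jenkins, the same belt-and-braces idea works at the shell level: put a hard wall-clock deadline on the step itself, so a hung agent or stalled deploy becomes a loud nonzero exit instead of silence. A sketch using GNU `timeout`, which exits with code 124 when the limit is hit; `run_with_deadline` and `deploy.sh` are made-up names:

```shell
#!/usr/bin/env bash
# run_with_deadline SECONDS CMD... — run a pipeline step under a hard
# deadline. GNU timeout returns 124 on expiry, which we surface loudly.
run_with_deadline() {
  local limit=$1; shift
  timeout "$limit" "$@"
  local status=$?
  if [ "$status" -eq 124 ]; then
    echo "FATAL: step exceeded ${limit}s deadline" >&2
  fi
  return $status
}

# e.g. run_with_deadline 1800 ./deploy.sh
```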
Monitor agent health as infrastructure — not as an afterthought. Agent failures should fire the same alerts as application failures. If your agents are running in Docker or Kubernetes, treat them with the same resource limits, health checks, and observability you'd apply to any production workload.
The Underlying Principle
Every failure above shares a root cause: the pipeline verified that steps executed, not that outcomes were correct.
A step that runs is not the same as a step that succeeded in the way you intended. Green means the process completed. It does not mean the result is what you think it is.
The discipline of post-deployment verification — checking the SHA, querying the running environment, asserting the test count, confirming the rollout image — closes this gap. It's not extra work. It's the last mile of the deployment that most pipelines are missing.
Build pipelines that are skeptical of themselves. Verify outcomes, not just execution. Treat a deployment as unconfirmed until the running system tells you it's correct — not until your CI dashboard does.
The dashboard will lie to you. Production won't.
Quick Reference: The Verification Checklist
Add these steps to every production deployment pipeline:
- Image SHA verification — confirm running container matches the commit that triggered the build
- Test count gate — assert minimum number of tests ran, fail on zero
- Environment assertion — query running service to confirm correct environment config
- Rollout image check — verify deployed pods are running the new image, not a cached version
- Agent timeout + null result handling — ensure agent failures produce explicit pipeline failures
- Explicit `--no-cache` policy — or a documented, auditable `--cache-from` strategy
None of these take more than 20 lines to implement. Together, they eliminate the entire class of "it looked fine" incidents.
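Stitched together, most of the checklist fits in one post-deploy gate built from a single assertion helper. A sketch, where the `/healthz` response shape and every variable name are assumptions about your setup:

```shell
#!/usr/bin/env bash
# Sketch of a consolidated post-deploy gate. All names (healthz fields,
# EXPECTED_SHA, EXPECTED_ENV, SERVICE_URL) are assumptions for illustration.
set -euo pipefail

verify() {
  # verify LABEL EXPECTED ACTUAL — one line per check, fail fast
  if [ "$3" != "$2" ]; then
    echo "FAIL: $1 is '$3', expected '$2'" >&2
    return 1
  fi
  echo "ok: $1 = $2"
}

# In the pipeline (assumed):
#   BODY=$(curl -sf "$SERVICE_URL/healthz")
#   verify "git commit"  "$EXPECTED_SHA" "$(jq -r .git_commit  <<<"$BODY")"
#   verify "environment" "$EXPECTED_ENV" "$(jq -r .environment <<<"$BODY")"
```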
Have you been burned by a silent pipeline failure? I'd genuinely like to hear what broke and what you did to catch it — drop it in the comments.