- Why fast post-deploy smoke tests matter
- Pre-test environment sanity checks
- 10 essential smoke tests to run immediately
- Interpreting failures and escalation steps
- Making the checklist repeatable and automated
- Practical Application
Deployments are the smallest event with the biggest potential impact: a trivial change that passes CI can still break the single user journey that generates revenue. You need a fast, deterministic signal from production in the first minutes after a release so you can either declare the build safe or stop everything and recover.
The problem you see on-call is rarely exotic: broken login, a 502 on the checkout API, a background job that never processed, or static files served with 404. Those failures surface as noise in the monitoring, angry customer messages, and frantic Slack threads — and by the time the team notices it’s often past the window where a quick revert would have sufficed. The right post-deploy smoke tests catch these show-stoppers before users do and give you an immediate action: pass, hold, or rollback.
Why fast post-deploy smoke tests matter
- A smoke test is a focused, minimal suite that validates whether the most important functions work after a build or deploy. Use smoke tests to decide whether a release is safe or must be stopped. Smoke tests are not exhaustive; they are a fast gate.
- Running post-deploy smoke tests rapidly reduces blast radius and shortens detection-to-decision time, which aligns with DORA/Accelerate findings that continuous testing and fast verification correlate with lower change-failure rates and faster recovery. Short feedback here amplifies delivery confidence.
- The operational trade-off is explicit: speed over depth. You want a binary signal in minutes, not a slow parade of flaky end‑to‑end checks that make decision-making ambiguous.
Pre-test environment sanity checks
Before you execute the 10 checks, confirm the production environment is actually what you expect. These sanity checks take 30–90 seconds and remove a surprising number of false alarms.
- Confirm the deployment finished and targets are healthy:
  - `kubectl rollout status deployment/my-service -n production --timeout=60s` (Kubernetes). Use the latest deployment tag or artifact ID to avoid ambiguity. `kubectl` readiness/liveness information is a primary signal.
- Verify the service health endpoint responds:
  - `curl -fsS -o /dev/null -w "%{http_code}\n" https://api.example.com/healthz`; expect `200`.
- Check traffic routing and feature flags:
  - Confirm DNS points to the expected load balancer, and that the relevant feature flag states match the release plan (especially for partial/feature-flagged rollouts).
- Confirm migrations & schema upgrades completed:
  - Verify migration job status, or run a `SELECT 1`-style probe on the new schema.
- Annotate the deployment in your observability tooling or dashboards so deployment-time comparisons are easy (deployment timestamp / version tags). This makes post-deploy signals attributable.
Important: Readiness and liveness probes are not optional. Use a lightweight `GET /healthz` that checks the dependencies you care about (DB connectivity, cache warm, required downstream APIs). Readiness probes are the standard Kubernetes mechanism for keeping traffic away from unhealthy pods; liveness probes restart containers that have stopped responding.
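To make "checks the dependencies you care about" concrete, here is a minimal sketch of how a smoke script might interpret a health response. The payload shape (`{"status": ..., "checks": {...}}`) and the dependency names are assumptions for illustration; the article does not define a schema, so adapt them to whatever your `/healthz` actually returns.

```python
from collections.abc import Mapping

# Assumed dependency names; replace with the ones your service reports.
REQUIRED_DEPENDENCIES = ("db", "cache")


def evaluate_health(http_status: int, body: Mapping) -> list[str]:
    """Return a list of problems; an empty list means the instance looks ready.

    Assumes a hypothetical body such as:
    {"status": "ok", "checks": {"db": "ok", "cache": "ok"}}
    """
    problems: list[str] = []
    if http_status != 200:
        problems.append(f"unexpected HTTP status {http_status}")
    checks = body.get("checks")
    if not isinstance(checks, Mapping):
        problems.append("health body has no 'checks' object")
        return problems
    for name in REQUIRED_DEPENDENCIES:
        state = checks.get(name)
        if state != "ok":
            problems.append(f"dependency {name!r} is {state!r}")
    return problems
```

Returning a list of problems rather than a boolean keeps the first-triage information (which dependency, which state) in the smoke output.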
10 essential smoke tests to run immediately
Run these in order, fastest-first. Each item includes the what, how to run quickly, expected result, and first-triage steps.
1) Core service health (global): check the canonical health endpoint.
- How: `curl -fsS https://api.prod.example.com/healthz`, expecting `200` and a small JSON body with statuses.
- Triage: if 5xx, run `kubectl logs` on recent pods and check readiness/liveness probes.
2) Authentication / login flow (critical path): verify token issuance for a test account.
- How (cURL):

      curl -s -X POST https://api.prod.example.com/auth/login \
        -H "Content-Type: application/json" \
        -d '{"email":"smoke@example.com","password":"__SMOKE__"}' \
        -w "\n%{http_code}\n"

- Expect: 200 + valid token format. If auth fails, user journeys collapse; treat as critical.
- Triage: check auth service logs and identity provider telemetry.
3) Primary read path (user home / profile): ensure key GETs return expected fields.
- How: `curl -s -H "Authorization: Bearer $TOKEN" https://api.prod.example.com/v1/users/me | jq .id`
- Expect: correct JSON shape, not a 500 or schema-less HTML error.
4) Primary write path (critical transaction): perform a minimal, safe write that exercises downstream processing (e.g., create an ephemeral cart item).
- How: `POST /cart` with a synthetic payload; ensure `201` and that a follow-up `GET` shows the item.
- Triage: if the write fails while the read passes, check the DB connection pool, write replicas, and migrations.
5) Payment / external gateway connectivity (integration): ping the payments sandbox endpoint or run a test-mode authorization. Never charge real cards during smoke.
- Triage: check outbound firewall, certificate expiry, and recent credential rotations.
6) Background job / queue processing: enqueue a short test job and confirm the worker processes it.
- How (example): `POST /jobs/smoke`, then poll `/jobs/{id}` for `completed`.
- Triage: if the job is created but not processed, look at worker pod logs, queue depth, and consumer lag.
7) Database connectivity + simple query: run `SELECT 1` or a cheap targeted sanity query (for example `SELECT 1 FROM crucial_table LIMIT 1`).
- How: `PGPASSWORD=$P psql -h db.prod -U smoke -d appdb -c "SELECT 1"`
- Expect: immediate success; on failure, investigate connection pool exhaustion or auth issues.
8) Static assets and CDN: fetch a recent JS/CSS file or image via the CDN URL to confirm caching/CDN routing.
- How: `curl -I https://cdn.example.com/assets/app.js` and inspect `X-Cache` / `Age`.
- Triage: 404s often indicate deployment slot swap problems or a missing artifact upload.
9) Search / indexing (if core): execute a trivial query and confirm a known document appears.
- How: `curl "https://search.prod.example.com?q=smoke-test-unique-token"`, expecting the smoke document.
- Triage: if the index is stale, check indexer logs and ingestion lag.
10) Telemetry ingestion & error pipeline: confirm logs/traces/metrics are flowing and recent.
- How: query your logging/metrics tool for a log from the last 2 minutes or ensure the APM shows a trace for your smoke API call.
- Why: an app that looks fine but stops sending telemetry leaves you blind. Treat missing telemetry as high priority for mitigation.
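The "enqueue, then poll" pattern from check 6 is easy to get wrong without a hard deadline. Below is a minimal sketch of a bounded poller. The `fetch_status` callable is a stand-in for whatever call reads `/jobs/{id}`; the `"failed"` status value is an assumption (the article only names `completed`); and the injectable `sleep`/`clock` parameters exist only so the logic can be exercised without waiting.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the job reports 'completed'; return False on 'failed' or timeout."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status == "completed":
            return True
        if status == "failed":  # assumed terminal status name
            return False
        if clock() + interval_s > deadline:
            return False  # the next poll would land past the deadline
        sleep(interval_s)
```

A timeout here should be reported as a failed check, which then feeds the worker-log and queue-depth triage described in check 6.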
Tools & automation notes:
- For fast backend checks, prefer lightweight programmatic checks using FastAPI's `TestClient` (or equivalent) or plain HTTP requests so tests run without booting a browser. `TestClient` supports direct app calls and integrates with `pytest`; because it exercises the app in-process, use real HTTP requests when the goal is to verify the deployed instance.
- For UI-critical checks (login, checkout smoke), use Playwright or Cypress configured for CI headless runs; both provide fast, deterministic runs suitable for a short smoke suite. Keep UI smoke specs tiny (2–4 steps).
Interpreting failures and escalation steps
A failure is either real (service truly broken) or flaky (test/environment). Triage quickly and escalate according to blast radius.
- Confirm quickly: reproduce the failure from a separate network and machine. Use `curl` or the Playwright trace.
- Scope the impact: single endpoint, single region, single tenant, or global? Look at traces, dashboards, error counts.
- Decide the action (triage matrix):
- Critical path broken (login, checkout, payments): Fail the deployment and rollback now. Rapid rollback is often the safest mitigation to buy time for investigation.
- Partial failure (one region, degraded performance): divert/shift traffic to healthy region, enable degraded mode, or increase capacity while investigating.
- Observability gap (telemetry missing): escalate to on-call infra/SRE — fix the telemetry first; otherwise you cannot triage.
- Document and communicate: produce a short Production Smoke Test Report with PASS/FAIL, build ID, timestamp, failed test(s), key log snippets, and the decision taken (rollback/mitigate/monitor). Use a single Slack/incident channel and pin the report.

  Example report template (paste into incident thread):

      Production Smoke Test Report
      Status: FAIL
      Build: 2025.12.22-45f2ab
      Time: 2025-12-22T15:08:32Z
      Failed checks:
      - POST /auth/login -> 500 (trace id: abc123)
      - Background worker queue: job not processed (queue-depth: 321)
      Immediate action: Rolled back to build 2025.12.22-12:00 (rollback completed 15:11Z)
      Key logs:
      auth-service[abc]: TypeError at /login ... stack...
      Next: Triage leads assigned (#auth, #workers)

- Follow the runbook: call the owners listed in your service catalog or PagerDuty rotation, open an incident if customer impact exists, and run the standard postmortem flow once resolved.
Hard rule from the field: When user-impacting errors start right after deploy, revert first — investigate second. This buys time, reduces cognitive overload, and prevents cascading changes.
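The triage matrix above can be encoded so the smoke job proposes an action instead of leaving it to a tired human. This is a minimal sketch: the check names (`login`, `checkout`, `payments`, `telemetry`) are assumed labels, and the precedence (critical path, then telemetry, then everything else) is one reading of the matrix, not a definitive policy.

```python
from enum import Enum


class Action(Enum):
    ROLLBACK = "rollback"
    ESCALATE = "escalate"
    MITIGATE = "mitigate"
    MONITOR = "monitor"


# Assumed check names; map them to whatever your smoke suite reports.
CRITICAL_PATH = {"login", "checkout", "payments"}
TELEMETRY = {"telemetry"}


def decide(failed_checks: set[str]) -> Action:
    """Apply the triage matrix: critical path first, then telemetry, then the rest."""
    if failed_checks & CRITICAL_PATH:
        return Action.ROLLBACK   # critical path broken: revert first
    if failed_checks & TELEMETRY:
        return Action.ESCALATE   # observability gap: fix telemetry before triage
    if failed_checks:
        return Action.MITIGATE   # partial failure: shift traffic or degrade
    return Action.MONITOR        # all green: normal monitoring window
```

For example, `decide({"login", "cdn"})` yields `Action.ROLLBACK`, because any critical-path failure outranks the others.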
Making the checklist repeatable and automated
Manual checks are error-prone and slow. Make the checklist a runnable artifact of your pipeline.
- Single executable script approach (recommended): create `smoke.sh` that runs the 10 checks in order, captures exit codes, and produces a concise summary (PASS/FAIL + failed items). Wrap each check so it times out quickly (e.g., `curl --max-time 10`) and returns a structured JSON result. Sample pattern:

      #!/usr/bin/env bash
      set -euo pipefail

      failures=()

      # Run one named check; discard its output and record the name on failure.
      run() {
        local desc="$1"; shift
        echo "-> $desc"
        if ! "$@" >/dev/null; then failures+=("$desc"); fi
      }

      run "health" curl -fsS --max-time 10 https://api.prod.example.com/healthz
      run "login" curl -fsS --max-time 10 -X POST https://api... -d '{"..."}'
      # ... other checks

      if [ ${#failures[@]} -ne 0 ]; then
        echo "SMOKE FAILED: ${failures[*]}"
        exit 2
      fi
      echo "SMOKE PASS"

- CI wiring: trigger the smoke job from the deployment workflow using GitHub Actions `workflow_run` or `deployment_status` so the smoke job runs only after deploy completes. Configure the job to run in the production environment context. A workflow triggered by `workflow_run` is a separate run, so make its failure visible to the release process (alerting, a rollback job) rather than assuming it fails the deploy workflow itself.

      name: Post-deploy smoke
      on:
        workflow_run:
          workflows: ["Deploy to production"]
          types: ["completed"]
      jobs:
        smoke:
          if: ${{ github.event.workflow_run.conclusion == 'success' }}
          runs-on: ubuntu-latest
          steps:
            - uses: actions/checkout@v4
            - name: Run smoke script
              run: ./smoke.sh

  The `if:` guard on `workflow_run.conclusion` keeps smoke from running when the deploy failed.
- UI smoke automation: store tiny Playwright specs that run in <60s. Capture the HTML report and screenshots as artifacts for failed runs. Playwright recommends CI-specific configuration and provides examples for GitHub Actions and Docker images.
- Reduce flakiness:
  - Use synthetic test accounts that are reset between runs and leave no orphaned data.
  - Test deterministically (avoid time-of-day dependent assertions).
  - Allow one automatic retry for transient network or infra blips, but treat repeated failures as real.
- Observability integration: the CI smoke job should publish a deployment marker and an outcome metric (e.g., `smoke.success = 0/1`) to your monitoring so your SRE dashboard shows post-deploy health at a glance.
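The points above (a structured result per check, a single automatic retry, and a `smoke.success` outcome metric) fit together in a small runner. The following is a sketch under stated assumptions: each check is modelled as a zero-argument callable returning `True` on pass, and the summary field names other than `smoke.success` are invented for illustration. Publishing the metric to a monitoring backend is deliberately left out, since that call depends on your tooling.

```python
import json
import time
from typing import Callable

Check = Callable[[], bool]  # returns True on pass; may raise on error


def run_with_retry(check: Check, retries: int = 1) -> tuple[bool, int]:
    """Run a check, retrying up to `retries` times. Returns (passed, attempts)."""
    attempts = 0
    for _ in range(retries + 1):
        attempts += 1
        try:
            if check():
                return True, attempts
        except Exception:
            pass  # a raised error counts as a failed attempt
    return False, attempts


def run_suite(checks: dict[str, Check]) -> dict:
    """Run checks in insertion order and build a JSON-serializable summary."""
    results = []
    for name, check in checks.items():
        started = time.monotonic()
        passed, attempts = run_with_retry(check)
        results.append({
            "name": name,
            "passed": passed,
            "attempts": attempts,
            "seconds": round(time.monotonic() - started, 3),
        })
    failed = [r["name"] for r in results if not r["passed"]]
    return {"smoke.success": 0 if failed else 1, "failed": failed, "results": results}


def summary_json(summary: dict) -> str:
    """Serialize the summary for CI logs or an artifact file."""
    return json.dumps(summary, indent=2)
```

Recording `attempts` alongside the pass/fail flag makes a check that only passes on retry visible, so chronic flakiness can be fixed rather than hidden.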
Practical Application
Below is a tight, copy-pasteable plan you can put into your next release process.
1) Pre-deploy (30–90s)
- Confirm artifact tag, migration status, deploy window, and feature-flag plan.
- Push deployment annotation (version, git sha) into observability.

2) Deploy (standard pipeline)

3) Post-deploy smoke (0–5 minutes)
- Run `smoke.sh` (backend checks); target total runtime under 5 minutes.
- Run `playwright-smoke` (UI checks) in parallel; target under 60s for headless runs.
- Collect artifacts: smoke report, Playwright HTML, screenshots, and two sample logs.

4) Decision (1–2 minutes)
- All green → normal post-deploy monitoring window (e.g., 30 minutes).
- Any red on a critical path test → immediate rollback and incident triage.

5) Post-incident
- Run a blameless postmortem for any rollback or significant regression.
- Add or adjust a smoke test if the failure was a test gap.

Minimal Playwright smoke example (TypeScript):

    // tests/smoke.spec.ts
    import { test, expect } from '@playwright/test';

    test('login and load dashboard', async ({ page }) => {
      // Relative URL: assumes baseURL is set in the Playwright config.
      await page.goto('/');
      await page.fill('[data-qa=email]', 'smoke@example.com');
      await page.fill('[data-qa=password]', '__SMOKE__');
      await page.click('[data-qa=login]');
      await page.waitForSelector('[data-qa=dashboard]');
      await expect(page).toHaveURL(/dashboard/);
    });

Minimal FastAPI backend smoke (pytest + TestClient):

    from fastapi.testclient import TestClient
    from myapp.main import app

    client = TestClient(app)


    def test_health():
        r = client.get("/healthz")
        assert r.status_code == 200
        assert r.json().get("status") == "ok"


    def test_login_smoke():
        r = client.post("/auth/login", json={"email": "smoke@example.com", "password": "__SMOKE__"})
        assert r.status_code == 200
        assert "token" in r.json()

Quick comparison table
| Test type | Typical runtime (goal) | Automation tool | Run frequency |
|---|---|---|---|
| Health endpoint | < 2s | curl / TestClient | Every deploy |
| Auth/login | 2–6s | curl / Playwright | Every deploy |
| Read path | 1–3s | curl / TestClient | Every deploy |
| Write path | 3–10s | curl / TestClient | Every deploy |
| Background job | 5–30s | API probe / queue metrics | Every deploy |
| CDN asset | < 2s | curl -I | Every deploy |
| Telemetry ingest | < 30s | Monitoring query | Every deploy |
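
The runtime goals in the table can double as budgets, so that a check which still passes but has become slow gets flagged. This is a small sketch: the upper bounds come from the "Typical runtime (goal)" column, while the dictionary keys are assumed check names.

```python
# Upper bounds from the "Typical runtime (goal)" column, in seconds.
# Keys are assumed check names; align them with your smoke suite.
BUDGET_SECONDS = {
    "health": 2.0,
    "auth_login": 6.0,
    "read_path": 3.0,
    "write_path": 10.0,
    "background_job": 30.0,
    "cdn_asset": 2.0,
    "telemetry": 30.0,
}


def over_budget(durations: dict[str, float]) -> dict[str, float]:
    """Return {check: seconds over budget} for checks that exceeded their goal."""
    return {
        name: round(seconds - BUDGET_SECONDS[name], 3)
        for name, seconds in durations.items()
        if name in BUDGET_SECONDS and seconds > BUDGET_SECONDS[name]
    }
```

Checks without a budget entry are ignored, which keeps the helper usable while the suite grows.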
Practical report format (use at incident start):
- Status: PASS / FAIL
- Build: `version+sha`
- Time: `YYYY-MM-DDThh:mm:ssZ`
- Failed checks: list + one-line error (HTTP code, trace id)
- Action taken: rollback / mitigate / monitor
- Owner(s): team aliases
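If the smoke job already has these fields, rendering the report can be automated so the incident thread always gets the same layout. The class below is a sketch: the `SmokeReport` name and its attribute names are hypothetical, while the field order follows the list above.

```python
from dataclasses import dataclass, field


@dataclass
class SmokeReport:
    status: str                      # "PASS" or "FAIL"
    build: str                       # version+sha
    time_utc: str                    # YYYY-MM-DDThh:mm:ssZ
    failed_checks: list[str] = field(default_factory=list)
    action: str = "monitor"          # rollback / mitigate / monitor
    owners: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the report as plain text for pasting into an incident thread."""
        lines = [
            "Production Smoke Test Report",
            f"Status: {self.status}",
            f"Build: {self.build}",
            f"Time: {self.time_utc}",
            "Failed checks:" if self.failed_checks else "Failed checks: none",
        ]
        lines += [f"- {item}" for item in self.failed_checks]
        lines.append(f"Action taken: {self.action}")
        lines.append(f"Owner(s): {', '.join(self.owners) or 'unassigned'}")
        return "\n".join(lines)
```

Calling `render()` on a failing report produces one line per failed check, each carrying the one-line error the format asks for.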
Sources
Types of software testing — Atlassian - Definition and role of smoke tests within a deployment/testing strategy.
Smoke test — MDN Web Docs - Concise glossary definition and context for smoke testing.
Accelerate / State of DevOps (DORA) — Google Cloud - Data-driven evidence linking continuous testing and delivery practices to improved deployment stability and recovery metrics.
Testing — FastAPI (TestClient) - Practical guidance for using TestClient to run lightweight backend checks and integrate with pytest.
Continuous Integration (CI) — Playwright docs - Recommended patterns for short, deterministic UI smoke suites and CI integration details.
Best Practices — Cypress Documentation - Guidance on keeping UI tests fast, deterministic, and suitable for CI smoke runs.
Pod lifecycle and probes — Kubernetes docs - Liveness/readiness/startup probe behavior and recommended use for health gating.
Events that trigger workflows — GitHub Actions docs - How to run post-deploy jobs (e.g., workflow_run or deployment_status) to execute smoke checks after a deployment completes.
SEV1 — The Art of Incident Command - Practical operational guidance for incident triage and the “rollback first” discipline used in on-call and SRE practice.