beefed.ai

Posted on Jun 23 • Originally published at beefed.ai

Production Smoke Test Checklist: 10 Fast Post-Deploy Checks

Why fast post-deploy smoke tests matter
Pre-test environment sanity checks
10 essential smoke tests to run immediately
Interpreting failures and escalation steps
Making the checklist repeatable and automated
Practical Application

Deployments are the smallest event with the biggest potential impact: a trivial change that passes CI can still break the single user journey that generates revenue. You need a fast, deterministic signal from production in the first minutes after a release so you can either declare the build safe or stop everything and recover.

The problem you see on-call is rarely exotic: broken login, a 502 on the checkout API, a background job that never processed, or static files served with 404. Those failures surface as noise in the monitoring, angry customer messages, and frantic Slack threads, and by the time the team notices, it is often past the window where a quick revert would have sufficed. The right post-deploy smoke tests catch these show-stoppers before users do and give you an immediate action: pass, hold, or roll back.

Why fast post-deploy smoke tests matter

A smoke test is a focused, minimal suite that validates whether the most important functions work after a build or deploy. Use them to decide whether a release is safe or must be stopped. Smoke tests are not exhaustive; they are a fast gate.
Running post-deploy smoke tests rapidly reduces blast radius and shortens detection-to-decision time, which aligns with DORA/Accelerate findings that continuous testing and fast verification correlate with lower change-failure rates and faster recovery. Short feedback here amplifies delivery confidence.
The operational trade-off is explicit: speed over depth. You want a binary signal in minutes, not a slow parade of flaky end‑to‑end checks that make decision-making ambiguous.

Pre-test environment sanity checks

Before you execute the 10 checks, confirm the production environment is actually what you expect. These sanity checks take 30–90 seconds and remove a surprising number of false alarms.

Confirm the deployment finished and targets are healthy:
- kubectl rollout status deployment/my-service -n production --timeout=60s (Kubernetes). Use the latest deployment tag or artifact ID to avoid ambiguity. kubectl readiness/liveness information is a primary signal.
Verify the service health endpoint responds:
- curl -fsS -o /dev/null -w "%{http_code}\n" https://api.example.com/healthz — expect 200.
Check traffic routing and feature flags:
- Confirm DNS points to the expected load balancer, and that the relevant feature flag states match the release plan (especially for partial/feature-flagged rollouts).
Confirm migrations & schema upgrades completed:
- Verify the migration job status, or run a SELECT 1-style probe against the new schema.
Annotate the deployment in your observability tooling or dashboards so deployment-time comparisons are easy (deployment timestamp / version tags). This makes post-deploy signals attributable.

Important: Readiness and liveness probes are not optional. Use a lightweight GET /healthz that checks dependencies you care about (DB connectivity, cache warm, required downstream APIs). Kubernetes readiness/liveness probes are the standard mechanism to keep traffic away from unhealthy pods.
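To make the "checks dependencies you care about" idea concrete, here is a minimal Python sketch of how a smoke script might evaluate a /healthz response. The payload shape (a top-level `status` plus a `checks` map) and the function names are assumptions for illustration; your endpoint may report health differently.

```python
import json
import urllib.error
import urllib.request


def evaluate_health(status_code: int, body: str) -> list[str]:
    """Return a list of problems; an empty list means the endpoint looks healthy.

    Assumed (hypothetical) payload shape:
    {"status": "ok", "checks": {"db": "ok", "cache": "ok"}}
    """
    if status_code != 200:
        return [f"unexpected HTTP status {status_code}"]
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # An HTML error page or empty body ends up here.
        return ["body is not valid JSON"]
    if not isinstance(payload, dict):
        return ["body is not a JSON object"]

    problems = []
    if payload.get("status") != "ok":
        problems.append(f"overall status is {payload.get('status')!r}")
    checks = payload.get("checks")
    if isinstance(checks, dict):
        for name, state in checks.items():
            if state != "ok":
                problems.append(f"dependency {name} is {state!r}")
    return problems


def probe(url: str, timeout: float = 5.0) -> list[str]:
    """Fetch the health URL with a hard timeout and evaluate the response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", "replace")
            return evaluate_health(resp.status, body)
    except urllib.error.HTTPError as exc:
        # urlopen raises for 4xx/5xx; report the code rather than crash.
        return [f"unexpected HTTP status {exc.code}"]
    except (urllib.error.URLError, TimeoutError) as exc:
        return [f"request failed: {exc}"]
```

Keeping the evaluation separate from the network call means the pass/fail rules can be unit-tested without touching production.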

10 essential smoke tests to run immediately

Run these in order, fastest-first. Each item includes the what, how to run quickly, expected result, and first-triage steps.

1) Core service health (global): check the canonical health endpoint.

How: curl -fsS https://api.prod.example.com/healthz expecting 200 and a small JSON body with statuses.
Triage: if 5xx, kubectl logs on recent pods and check readiness/liveness probes.

2) Authentication / login flow (critical path): verify token issuance for a test account.

How (cURL):

 curl -s -X POST https://api.prod.example.com/auth/login \
   -H "Content-Type: application/json" \
   -d '{"email":"smoke@example.com","password":"__SMOKE__"}' -w "\n%{http_code}\n"

Expect: 200 + valid token format. If auth fails, user journeys collapse — treat as critical. Check auth service logs and identity provider telemetry.

3) Primary read path (user home / profile): ensure key GETs return expected fields.

How: curl -s -H "Authorization: Bearer $TOKEN" https://api.prod.example.com/v1/users/me | jq .id
Expect: correct JSON shape, not a 500 or schema-less HTML error.

4) Primary write path (critical transaction): perform a minimal, safe write that exercises downstream processing (e.g., create an ephemeral cart item).

How: POST /cart with synthetic payload; ensure 201 and a follow-up GET shows the item.
Triage: if write fails while read passes, check DB connection pool / write replicas and migrations.

5) Payment / external gateway connectivity (integration): ping the payments sandbox endpoint or run a test-mode authorization. Never charge real cards during smoke.

Triage: check outbound firewall, certificate expiry, and recent credential rotations.

6) Background job / queue processing: enqueue a short test job and confirm the worker processes it.

How (example): POST /jobs/smoke then poll /jobs/{id} for completed.
Triage: if job created but not processed, look at worker pod logs, queue depth, and consumer lag.

7) Database connectivity + simple query: run SELECT 1 or a targeted sanity query (for example, SELECT 1 FROM crucial_table LIMIT 1).

How: PGPASSWORD=$P psql -h db.prod -U smoke -d appdb -c "SELECT 1"
Expect: immediate success — investigate connection pool exhaustion or auth issues on failure.

8) Static assets and CDN: fetch a recent JS/CSS file or image via the CDN URL to confirm caching/CDN routing.

How: curl -I https://cdn.example.com/assets/app.js and inspect X-Cache / Age.
Triage: 404s often indicate deployment slot swap problems or missing artifact upload.

9) Search / indexing (if core): execute a trivial query and confirm known document appears.

How: curl "https://search.prod.example.com?q=smoke-test-unique-token" expecting the smoke document.
Triage: if index stale, check indexer logs and ingestion lag.

10) Telemetry ingestion & error pipeline: confirm logs/traces/metrics are flowing and recent.

How: query your logging/metrics tool for a log from the last 2 minutes or ensure the APM shows a trace for your smoke API call.
Why: an app that looks fine but stops sending telemetry leaves you blind. Treat missing telemetry as high priority for mitigation.
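Several of the checks above (the background job in particular) boil down to "poll until done or give up". A minimal sketch of that loop in Python follows; the status strings `"completed"` and `"failed"`, the defaults, and the `get_status` callable are illustrative assumptions, and in practice `get_status` would wrap a GET on the job's status endpoint.

```python
import time
from typing import Callable


def wait_for_job(
    get_status: Callable[[], str],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_status() until it reports completion or the deadline passes.

    Returns True when the status is "completed"; returns False on "failed"
    or when another wait would overrun the deadline.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status == "completed":
            return True
        if status == "failed":
            return False
        # Stop before sleeping if the next poll would land past the deadline.
        if clock() + interval_s > deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters are injectable only so the loop can be tested without real waiting; a hard deadline keeps a stuck worker from stalling the whole smoke run.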

Tools & automation notes:

For fast backend checks, prefer lightweight programmatic checks over browser-driven ones. FastAPI's TestClient (or equivalent) calls the app in-process and integrates with pytest, which suits a pre-deploy gate; to verify the deployed production service itself, run the same assertions as plain HTTP requests against the live URL.
For UI-critical checks (login, checkout smoke), use Playwright or Cypress configured for CI headless runs; both provide fast, deterministic runs suitable for a short smoke suite. Keep UI smoke specs tiny (2–4 steps).

Interpreting failures and escalation steps

A failure is either real (service truly broken) or flaky (test/environment). Triage quickly and escalate according to blast radius.

Confirm quickly: reproduce the failure from a separate network and machine. Use curl or the Playwright trace.
Scope the impact: single endpoint, single region, single tenant, or global? Look at traces, dashboards, error counts.
Decide the action (triage matrix):
- Critical path broken (login, checkout, payments): Fail the deployment and roll back now. Rapid rollback is often the safest mitigation to buy time for investigation.
- Partial failure (one region, degraded performance): divert/shift traffic to healthy region, enable degraded mode, or increase capacity while investigating.
- Observability gap (telemetry missing): escalate to on-call infra/SRE — fix the telemetry first; otherwise you cannot triage.
Document and communicate: produce a short Production Smoke Test Report with PASS/FAIL, build ID, timestamp, failed test(s), key log snippets, and the decision taken (rollback/mitigate/monitor). Use a single Slack/incident channel and pin the report. Example report template (paste into incident thread):

   Production Smoke Test Report
   Status: FAIL
   Build: 2025.12.22-45f2ab
   Time: 2025-12-22T15:08:32Z
   Failed checks:
     - POST /auth/login -> 500 (trace id: abc123)
     - Background worker queue: job not processed (queue-depth: 321)
   Immediate action: Rolled back to build 2025.12.22-12:00 (rollback completed 15:11Z)
   Key logs:
     auth-service[abc]: TypeError at /login ... stack...
   Next: Triage leads assigned (#auth, #workers)
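The triage matrix in this section can be encoded so the pipeline proposes an action instead of leaving it to memory at 3 a.m. This is a rough Python sketch; the check names and the action labels are hypothetical and should be replaced with whatever your smoke suite actually reports.

```python
# Hypothetical check names; align these with the names your smoke suite emits.
CRITICAL_CHECKS = {"login", "checkout", "payments"}
TELEMETRY_CHECKS = {"telemetry"}


def decide_action(failed: set[str]) -> str:
    """Map the set of failed smoke checks to a first-response action."""
    if not failed:
        return "monitor"
    if failed & CRITICAL_CHECKS:
        # A broken critical path outranks everything else: roll back first.
        return "rollback"
    if failed & TELEMETRY_CHECKS:
        # Without telemetry you cannot triage, so escalate to infra/SRE.
        return "escalate"
    # Anything else is treated as a partial failure to mitigate in place.
    return "mitigate"
```

The ordering matters: a run that fails both login and telemetry still resolves to a rollback, which matches the "revert first" rule.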

Follow the runbook: call the owners listed in your service catalog or PagerDuty rotation, open an incident if customer impact exists, and run the standard postmortem flow once resolved.

Hard rule from the field: When user-impacting errors start right after deploy, revert first — investigate second. This buys time, reduces cognitive overload, and prevents cascading changes.

Making the checklist repeatable and automated

Manual checks are error-prone and slow. Make the checklist a runnable artifact of your pipeline.

Single executable script approach (recommended): create smoke.sh that runs the 10 checks in order, captures exit codes, and produces a concise summary (PASS/FAIL + failed items). Wrap each check so it times out quickly (e.g., curl --max-time 10) and returns a structured JSON result. Sample pattern:

  #!/usr/bin/env bash
  set -euo pipefail
  failures=()
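  # Run one named check. Its stdout is silenced inside the function so the
  # "-> name" progress line still prints (redirecting the run call hides it).
  run() { local desc="$1"; shift; echo "-> $desc"; if ! "$@" >/dev/null; then failures+=("$desc"); fi }

  run "health" curl -fsS --max-time 10 https://api.prod.example.com/healthz
  run "login" curl -fsS --max-time 10 -X POST https://api... -d '{"..."}'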
  # ... other checks

  if [ ${#failures[@]} -ne 0 ]; then
    echo "SMOKE FAILED: ${failures[*]}"
    exit 2
  fi
  echo "SMOKE PASS"
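If you want the structured JSON result mentioned above, a small Python runner may be easier to maintain than shell. The sketch below is one possible shape, not a drop-in replacement for smoke.sh; the result field names and the `run_check` and `summarize` helpers are assumptions for illustration.

```python
import json
import subprocess
import time


def run_check(name: str, cmd: list[str], timeout_s: float = 10.0) -> dict:
    """Run one command with a hard timeout and return a structured result."""
    started = time.monotonic()
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        ok = proc.returncode == 0
        detail = proc.stderr.strip()[-200:]  # keep only the tail of stderr
    except subprocess.TimeoutExpired:
        ok, detail = False, f"timed out after {timeout_s}s"
    except FileNotFoundError:
        ok, detail = False, f"command not found: {cmd[0]}"
    return {
        "name": name,
        "ok": ok,
        "seconds": round(time.monotonic() - started, 2),
        "detail": detail,
    }


def summarize(results: list[dict]) -> dict:
    """Collapse per-check results into a single PASS/FAIL summary."""
    failed = [r["name"] for r in results if not r["ok"]]
    return {"status": "FAIL" if failed else "PASS", "failed": failed, "checks": results}


def main(checks: list[tuple[str, list[str]]]) -> int:
    """Run every check, print the JSON summary, and return an exit code."""
    summary = summarize([run_check(name, cmd) for name, cmd in checks])
    print(json.dumps(summary, indent=2))
    return 0 if summary["status"] == "PASS" else 2
```

A caller would pass something like `[("health", ["curl", "-fsS", "--max-time", "10", "https://api.prod.example.com/healthz"])]` to `main` and hand the return value to `sys.exit`. Every check runs even after an earlier one fails, so the summary lists all failures at once.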

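CI wiring: trigger the smoke job from the deployment workflow using GitHub Actions workflow_run or deployment_status so the smoke job runs only after deploy completes. Configure the job to run in the production environment context. Note that workflow_run starts a separate workflow run, so a failed smoke job does not retroactively fail the deploy workflow; wire the smoke result into your rollback or alerting step, or run smoke as the final job inside the deploy workflow if it must gate the pipeline.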

  name: Post-deploy smoke
  on:
    workflow_run:
      workflows: ["Deploy to production"]
      types: ["completed"]

  jobs:
    smoke:
      if: ${{ github.event.workflow_run.conclusion == 'success' }}
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - name: Run smoke script
          run: ./smoke.sh

Use workflow_run guards to avoid running smoke when deploy failed.

UI smoke automation: store tiny Playwright specs that run in <60s. Capture the HTML report and screenshots as artifacts for failed runs. Playwright recommends CI-specific configuration and provides examples for GitHub Actions and Docker images.
Reduce flakiness:
- Use synthetic test accounts that are reset between runs and leave no orphaned data.
- Test deterministically (avoid time-of-day dependent assertions).
- Allow one automatic retry for transient network or infra blips, but treat repeated failures as real.
Observability integration: the CI smoke job should publish a deployment marker and an outcome metric (e.g., smoke.success = 0/1) to your monitoring so your SRE dashboard shows post-deploy health at a glance.
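The "one automatic retry" rule from the flakiness list can be expressed as a small wrapper. This is a sketch under the assumption that each check is a zero-argument callable returning True on success; the function name is hypothetical.

```python
from typing import Callable


def run_with_one_retry(check: Callable[[], bool]) -> tuple[bool, int]:
    """Run check(); retry exactly once on failure.

    Returns (passed, attempts). An exception counts as a failed attempt.
    """
    for attempt in (1, 2):
        try:
            if check():
                return True, attempt
        except Exception:
            # Deliberately broad: any error in a smoke check is just a failure.
            pass
    return False, 2
```

Returning the attempt count lets the caller record checks that only passed on the second try, so recurring flakiness stays visible instead of being silently absorbed.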

Practical Application

Below is a tight, copy-pasteable plan you can put into your next release process.

Pre-deploy (30–90s)
- Confirm artifact tag, migration status, deploy window, and feature-flag plan.
- Push deployment annotation (version, git sha) into observability.
Deploy (standard pipeline)
Post-deploy smoke (0–5 minutes)
- Run smoke.sh (backend checks) — target total runtime under 5 minutes.
- Run playwright-smoke (UI checks) in parallel — target under 60s for headless runs.
- Collect artifacts: smoke report, Playwright HTML, screenshots, and two sample logs.
Decision (1–2 minutes)
- All green → normal post-deploy monitoring window (e.g., 30 minutes).
- Any red on a critical path test → immediate rollback and incident triage.
Post-incident
- Run blameless postmortem for any rollback or significant regression.
- Add or adjust a smoke test if the failure was a test gap.

Minimal Playwright smoke example (TypeScript):

// tests/smoke.spec.ts
import { test, expect } from '@playwright/test';

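// Assumes `use.baseURL` is set in playwright.config.ts so that goto('/') resolves.
test('login and load dashboard', async ({ page }) => {
  await page.goto('/');
  await page.locator('[data-qa=email]').fill('smoke@example.com');
  await page.locator('[data-qa=password]').fill('__SMOKE__');
  await page.locator('[data-qa=login]').click();
  // Web-first assertions auto-wait, so no explicit waitForSelector is needed.
  await expect(page.locator('[data-qa=dashboard]')).toBeVisible();
  await expect(page).toHaveURL(/dashboard/);
});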

Minimal FastAPI backend smoke (pytest + TestClient):

from fastapi.testclient import TestClient
from myapp.main import app

client = TestClient(app)

def test_health():
    r = client.get("/healthz")
    assert r.status_code == 200
    assert r.json().get("status") == "ok"

def test_login_smoke():
    r = client.post("/auth/login", json={"email":"smoke@example.com","password":"__SMOKE__"})
    assert r.status_code == 200
    assert "token" in r.json()

Quick comparison table

Test type	Typical runtime (goal)	Automation tool	Run frequency
Health endpoint	< 2s	curl / TestClient	Every deploy
Auth/login	2–6s	curl / Playwright	Every deploy
Read path	1–3s	curl / TestClient	Every deploy
Write path	3–10s	curl / TestClient	Every deploy
Background job	5–30s	API probe / queue metrics	Every deploy
CDN asset	< 2s	curl -I	Every deploy
Telemetry ingest	< 30s	Monitoring query	Every deploy

Practical report format (use at incident start):

Status: PASS / FAIL

Build: version+sha

Time: YYYY-MM-DDThh:mm:ssZ

Failed checks: list + one-line error (HTTP code, trace id)

Action taken: rollback / mitigate / monitor

Owner(s): team aliases
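Filling in that report by hand under pressure invites mistakes, so it may be worth generating it. Here is a minimal Python sketch that renders the fields listed above; the function name and parameters are illustrative assumptions, and the inputs would come from your smoke runner.

```python
from datetime import datetime, timezone
from typing import Optional


def render_report(
    build: str,
    failed: list[str],
    action: str,
    owners: list[str],
    now: Optional[datetime] = None,
) -> str:
    """Render the smoke report as plain text for pasting into an incident thread."""
    # Normalize to UTC so the trailing "Z" is accurate.
    # A naive datetime is interpreted as local time by astimezone().
    stamp = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    lines = [
        f"Status: {'FAIL' if failed else 'PASS'}",
        f"Build: {build}",
        f"Time: {stamp.strftime('%Y-%m-%dT%H:%M:%SZ')}",
        "Failed checks:" + ("" if failed else " none"),
    ]
    lines += [f"  - {item}" for item in failed]
    lines += [f"Action taken: {action}", f"Owner(s): {', '.join(owners)}"]
    return "\n".join(lines)
```

The status line is derived from the failure list rather than passed in, which keeps the report from ever saying PASS while listing failed checks.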

Sources

Types of software testing — Atlassian - Definition and role of smoke tests within a deployment/testing strategy.

Smoke test — MDN Web Docs - Concise glossary definition and context for smoke testing.

Accelerate / State of DevOps (DORA) — Google Cloud - Data-driven evidence linking continuous testing and delivery practices to improved deployment stability and recovery metrics.

Testing — FastAPI (TestClient) - Practical guidance for using TestClient to run lightweight backend checks and integrate with pytest.

Continuous Integration (CI) — Playwright docs - Recommended patterns for short, deterministic UI smoke suites and CI integration details.

Best Practices — Cypress Documentation - Guidance on keeping UI tests fast, deterministic, and suitable for CI smoke runs.

Pod lifecycle and probes — Kubernetes docs - Liveness/readiness/startup probe behavior and recommended use for health gating.

Events that trigger workflows — GitHub Actions docs - How to run post-deploy jobs (e.g., workflow_run or deployment_status) to execute smoke checks after a deployment completes.

SEV1 — The Art of Incident Command - Practical operational guidance for incident triage and the “rollback first” discipline used in on-call and SRE practice.