Intro to Chaos Engineering for QA. Learn how to test resilience by injecting failures with Docker and Playwright.
We are obsessed with the "Happy Path".
In traditional QA, we verify that the application works when everything is perfect:
- The network is stable.
- The database responds in 5ms.
- Third-party APIs are online.
But in production, nothing is perfect. Pods crash, networks lag, and databases lock up.
When these things happen, a standard Selenium/Playwright test just says: Failed. It doesn't tell you how the application failed. Did it show a graceful error message? Or did it crash with a white screen and a raw stack trace?
This is where Chaos Engineering comes in.
From QA to Resilience Engineering
Chaos Engineering isn't just for Site Reliability Engineers (SREs). As modern QAs, we need to stop asking "Does it work?" and start asking "What happens when it breaks?"
Today, I’ll show you how to write a Chaos Test using Python, Playwright, and the Docker SDK.
The Goal
We aren't going to wait for the database to fail. We are going to kill it intentionally in the middle of a test and verify that our frontend handles it gracefully.
The Stack
- Python (Test logic)
- Playwright (UI Interaction)
- Docker SDK (The Chaos Injector)
The Code 🐍
Here is the complete script. It connects to your local Docker daemon, finds the Postgres container, and strangles it while the user is trying to work.
import docker
from playwright.sync_api import Page, expect


def test_database_failure_resilience(page: Page):
    # `page` is the fixture provided by the pytest-playwright plugin.

    # 1. Setup: Connect to Docker
    # We use the Docker SDK for Python (the `docker` package) to control the infrastructure
    client = docker.from_env()

    # Target your specific database container (a local or test instance, never real production)
    try:
        db_container = client.containers.get("postgres-prod")
    except docker.errors.NotFound:
        raise RuntimeError("Database container 'postgres-prod' not found. Check the container name.")

    # 2. Happy Path: Verify the app loads normally
    print("✅ Step 1: Loading Dashboard...")
    page.goto("http://localhost:3000/dashboard")
    expect(page.locator(".user-balance")).to_be_visible()

    try:
        # 🧨 CHAOS TIME: Kill the Database
        print("🔥 Step 2: Injecting Chaos (Stopping DB)...")
        db_container.stop()

        # 3. Resilience Assertion
        # The app should NOT show a white screen or crash.
        # It SHOULD show a friendly "Connection Lost" toast or retry button.
        print("👀 Step 3: Verifying graceful degradation...")

        # Trigger an action that requires the DB
        page.reload()

        # Assert UI handles the error
        expect(page.locator(".error-toast")).to_contain_text("Connection lost")
        expect(page.locator(".retry-button")).to_be_visible()
    finally:
        # 🩹 RECOVERY: Bring the Database back, even if an assertion above failed
        print("🩹 Step 4: Healing the infrastructure...")
        db_container.start()

    # Trigger a manual retry. Postgres may need a few seconds before it
    # accepts connections again, so the first retry can still fail.
    page.locator(".retry-button").click()

    # 4. Self-Healing Assertion
    # The app should recover without requiring a full page refresh
    expect(page.locator(".user-balance")).to_be_visible()
    print("✅ Test Passed: System is resilient.")
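The recovery step clicks the retry button right after `db_container.start()`, but a container that has started is not necessarily a database that accepts connections. One way to bridge that gap is to poll for readiness before retrying. This is a minimal sketch: the helper name `wait_until` and the readiness check in the comments are assumptions for illustration, not part of the original script or of any library.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse.

    Hypothetical helper for illustration; not part of Docker or Playwright.
    Returns True if the check passed in time, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example readiness check (assumption: `db_container` is a Docker SDK container).
# reload() refreshes the cached container state; `status` then reports e.g. "running".
#
#     def db_running() -> bool:
#         db_container.reload()
#         return db_container.status == "running"
#
#     assert wait_until(db_running), "database did not come back in time"
```

Note that a "running" status only says the container process is up. A stricter check would open a real database connection or call a health endpoint of the app, if it has one.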
Why this matters
If you run this test and your application shows a 500 Server Error page, you have found a bug. Not a functional bug, but an architectural bug.
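To make that failure mode explicit, the test can also scan the rendered page for signs of an unhandled crash instead of only looking for the toast. This is a rough sketch: the helper `find_crash_markers` and its marker list are hypothetical and would need tuning to the error pages your own stack produces.

```python
# Illustrative markers only (assumption); adjust them to your framework's error pages.
CRASH_MARKERS = (
    "Traceback (most recent call last)",
    "Internal Server Error",
    "500 Server Error",
)


def find_crash_markers(page_text: str) -> list[str]:
    """Return the crash markers found in the rendered page text.

    An empty page is reported as "<blank page>" so white screens are caught too.
    """
    if not page_text.strip():
        return ["<blank page>"]
    lowered = page_text.lower()
    return [marker for marker in CRASH_MARKERS if marker.lower() in lowered]


# Possible use inside the chaos test, after page.reload():
#     assert find_crash_markers(page.locator("body").inner_text()) == []
```

Keeping the check as a plain function on text means it can be unit-tested without a browser and reused across several chaos tests.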
By adding "Chaos Tests" to your regression suite, you check on every run that your product doesn't just work when everything is healthy. It also survives when a dependency goes down.
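Because chaos tests stop real containers, it may be worth making them opt-in inside the regression suite so they never fire by accident on a shared machine. One possible guard is sketched here. The environment variable `RUN_CHAOS_TESTS` and the helper `chaos_enabled` are assumptions for illustration, not a pytest, Docker, or Playwright feature.

```python
import os
from typing import Mapping, Optional


def chaos_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when RUN_CHAOS_TESTS is explicitly set to a truthy value.

    RUN_CHAOS_TESTS is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    return env.get("RUN_CHAOS_TESTS", "").strip().lower() in {"1", "true", "yes"}


# Possible use with pytest (assumption: pytest is the test runner):
#     import pytest
#
#     @pytest.mark.skipif(not chaos_enabled(), reason="set RUN_CHAOS_TESTS=1 to run")
#     def test_database_failure_resilience(page): ...
```

With a guard like this, the regular regression run skips the destructive tests, and a dedicated job can enable them on purpose.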
👋 Want more Chaos?
I write The 5-Minute QA—a daily newsletter for Senior QAs and SDETs. Every morning, I send one actionable tip on Chaos Engineering.
