DEV Community

Ilya Ploskovitov


Stop Testing Success. Kill the Database. 🧨

Intro to Chaos Engineering for QA. Learn how to test resilience by injecting failures with Docker and Playwright.

We are obsessed with the "Happy Path".

In traditional QA, we verify that the application works when everything is perfect:

  • The network is stable.
  • The database responds in 5ms.
  • Third-party APIs are online.

But in production, nothing is perfect. Pods crash, networks lag, and databases lock up.

When these things happen, a standard Selenium/Playwright test just says: Failed. It doesn't tell you how the application failed. Did it show a graceful error message? Or did it crash with a white screen and a raw stack trace?

This is where Chaos Engineering comes in.

From QA to Resilience Engineering

Chaos Engineering isn't just for Site Reliability Engineers (SREs). As modern QAs, we need to stop asking "Does it work?" and start asking "What happens when it breaks?"

Today, I’ll show you how to write a Chaos Test using Python, Playwright, and the Docker SDK.

The Goal

We aren't going to wait for the database to fail. We are going to kill it intentionally in the middle of a test and verify that our frontend handles it gracefully.

The Stack

  • Python (Test logic)
  • Playwright (UI Interaction)
  • Docker SDK (The Chaos Injector)

The Code 🐍

Here is the complete script. It connects to your local Docker daemon, finds the Postgres container, and stops it while the user is in the middle of a session.

import time

import docker
from playwright.sync_api import Page, expect


def test_database_failure_resilience(page: Page):
    # 1. Setup: Connect to Docker
    # We use the Docker SDK for Python (the `docker` package) to control the infrastructure
    client = docker.from_env()

    # Target your specific database container
    try:
        db_container = client.containers.get("postgres-prod")
    except docker.errors.NotFound as exc:
        # NotFound means the daemon answered but has no container with this name
        raise RuntimeError("Container 'postgres-prod' not found. Has it been created?") from exc

    # 2. Happy Path: Verify the app loads normally
    print("✅ Step 1: Loading Dashboard...")
    page.goto("http://localhost:3000/dashboard")
    expect(page.locator(".user-balance")).to_be_visible()

    # 🧨 CHAOS TIME: Kill the Database
    print("🔥 Step 2: Injecting Chaos (Stopping DB)...")
    db_container.stop()

    try:
        # 3. Resilience Assertion
        # The app should NOT show a white screen or crash.
        # It SHOULD show a friendly "Connection Lost" toast or retry button.
        print("👀 Step 3: Verifying graceful degradation...")

        # Trigger an action that requires the DB
        page.reload()

        # Assert UI handles the error
        expect(page.locator(".error-toast")).to_contain_text("Connection lost")
        expect(page.locator(".retry-button")).to_be_visible()
    finally:
        # 🩹 RECOVERY: Always bring the Database back, even if an assertion above failed
        print("🩹 Step 4: Healing the infrastructure...")
        db_container.start()

    # start() returns before Postgres accepts connections,
    # so give the database a moment before triggering a manual retry
    time.sleep(5)
    page.locator(".retry-button").click()

    # 4. Self-Healing Assertion
    # The app should recover without requiring a full page refresh
    expect(page.locator(".user-balance")).to_be_visible(timeout=15_000)
    print("✅ Test Passed: System is resilient.")
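One way to make the stop/start pairing harder to get wrong is to wrap it in a context manager, so the database comes back even when an assertion fails mid-test. The sketch below is an illustration, not part of the original script: `stopped` and `wait_until` are hypothetical helper names, and the only assumption about the container object is that it has `stop()` and `start()` methods, as Docker SDK containers do.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopped(container):
    """Stop a container for the duration of the block, then always restart it.

    `container` is anything with stop() and start() methods, such as the
    object returned by client.containers.get(...) in the Docker SDK.
    """
    container.stop()
    try:
        yield container
    finally:
        # Runs even when an assertion inside the block fails
        container.start()


def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns True or `timeout` seconds pass.

    Returns True on success and False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

In the test, the chaos step could then read `with stopped(db_container): ...`, with the resilience assertions inside the block. To replace a fixed sleep, something like `wait_until(lambda: db_container.exec_run("pg_isready").exit_code == 0)` should work, assuming the container runs the official Postgres image, which ships the `pg_isready` utility.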

Why this matters

If you run this test and your application shows a 500 Server Error page, you have found a bug. Not a functional bug, but an architectural bug.
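To make the difference between a graceful failure and an ugly one explicit in a test report, the page's response can be sorted into rough categories. This is a minimal sketch under stated assumptions: `classify_failure` is a hypothetical helper, the category names are made up for illustration, and the stack-trace markers are guesses that would need adjusting for your backend.

```python
def classify_failure(status_code: int, body_text: str) -> str:
    """Roughly classify how a page behaved while a dependency was down."""
    if not body_text.strip():
        # Nothing rendered at all
        return "white_screen"

    # Hypothetical markers of a leaked stack trace; adjust for your stack
    trace_markers = (
        "Traceback (most recent call last)",  # Python
        "Exception in thread",                # Java
        "\n    at ",                          # JavaScript / Java stack frames
    )
    if any(marker in body_text for marker in trace_markers):
        return "stack_trace"

    if status_code >= 500:
        # A server error page with no friendly handling in the UI
        return "server_error_page"

    return "graceful"
```

With Playwright, the inputs could come from `response = page.reload()` (using `response.status` when a response is returned) and `page.inner_text("body")`. A chaos test would then assert that the result is `"graceful"` and report the category when it is not.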

By adding "Chaos Tests" to your regression suite, you check on every run that your product doesn't just work: it survives the failures you have rehearsed.


👋 Want more Chaos?

I write The 5-Minute QA—a daily newsletter for Senior QAs and SDETs. Every morning, I send one actionable tip on Chaos Engineering.

👉 Subscribe here to get the tips in your inbox
