DEV Community: Ilya Ploskovitov

Masking PII in Kubernetes: How we solved 3 annoying sidecar edge cases (v2.0.0)

Ilya Ploskovitov — Wed, 29 Apr 2026 18:15:20 +0000

Building a mutating webhook for Kubernetes is easy in tutorials, but brutal in production. You immediately hit the reality of volume permissions, security contexts, and zombie sidecars.

I recently released v2.0.0 of the PII-Shield Operator. It’s a Go-based tool that injects a sidecar into your pods to mask sensitive data (PII) before the logs hit Datadog or ELK.

Getting the core Shannon entropy logic to work was step one. Making it bulletproof for strict SOC2 environments was step two. Here are the three K8s edge cases we solved for this release:

Dropping the Shell (Moving to Distroless)

Security teams hate sidecars with shell access. In earlier versions, we used Alpine. Now the agent is compiled with CGO_ENABLED=0 and deployed on gcr.io/distroless/static:nonroot. The image has no /bin/sh and no package manager, so the attack surface is far smaller. The agent tails the log files directly using native Go.

The "Immortal Sidecar" Problem

If you have ever injected sidecars into a K8s Job, you know it breaks the lifecycle. The main container finishes its work, but the sidecar keeps tailing logs forever, so the Job never reaches the Completed state.

To fix this, we moved to the Native Sidecars feature (K8s 1.28+). The webhook now puts the agent inside the initContainers array with restartPolicy: Always. Kubernetes finally understands how to kill the sidecar gracefully when the main app exits.

The emptyDir Permission Trap

When the webhook mounts an emptyDir volume to share logs, you often get Permission Denied. If the user's main app runs as root with a strict umask 0077, the nonroot sidecar can't read the file.

Instead of forcing users to rewrite their manifests, the Mutating Webhook now handles it silently. It checks the pod's SecurityContext and automatically injects fsGroup: 65532. The volume permissions match up, and the logs flow without errors.
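
To make the last two fixes concrete, here is a rough Python sketch of the kind of mutation described above. The real operator is written in Go and its actual patch logic is not shown in this post, so treat this as an illustration only: the container name, volume name, and mount path are hypothetical.

```python
import copy

NONROOT_GID = 65532  # gid used by the distroless "nonroot" user


def mutate_pod(pod: dict, agent_image: str) -> dict:
    """Return a mutated copy of a Pod manifest given as a plain dict."""
    pod = copy.deepcopy(pod)
    spec = pod.setdefault("spec", {})

    # Shared emptyDir volume for the log files (name is hypothetical).
    volumes = spec.setdefault("volumes", [])
    if not any(v.get("name") == "shared-logs" for v in volumes):
        volumes.append({"name": "shared-logs", "emptyDir": {}})

    # Native sidecar: an init container with restartPolicy "Always".
    sidecar = {
        "name": "pii-shield-agent",  # hypothetical container name
        "image": agent_image,
        "restartPolicy": "Always",
        "volumeMounts": [{"name": "shared-logs", "mountPath": "/shared-logs"}],
    }
    spec.setdefault("initContainers", []).append(sidecar)

    # Respect an fsGroup the user already chose; otherwise use the nonroot gid.
    security = spec.setdefault("securityContext", {})
    security.setdefault("fsGroup", NONROOT_GID)
    return pod
```

The function works on a copy, so the incoming manifest is never modified in place, and an fsGroup the user has already set is left alone.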

I packaged the whole thing into a Helm chart so it takes about two minutes to test. You just label your pod with pii-shield.io/inject: "true" and the webhook handles the rest.

How to mask PII in Kubernetes before sending logs to Datadog

Ilya Ploskovitov — Sun, 15 Mar 2026 23:05:58 +0000

The Problem: Datadog Bills and GDPR Nightmares

If you are running applications in Kubernetes and shipping your logs to Datadog, you have probably faced two major headaches:

Cost: Datadog charges you based on the volume of logs ingested and indexed. Every megabyte counts. Moreover, while Datadog offers a built-in Sensitive Data Scanner, it is a premium feature billed separately on top of your base log costs. By using a free, open-source sidecar, you can completely bypass expensive vendor-side premium scrubbers.
Compliance: Sending Personally Identifiable Information (PII) like emails, credit card numbers, or API keys to a third-party logging service often violates GDPR and other privacy laws.

The more comprehensive your logs are for debugging, the higher your Datadog bill gets, and the bigger your risk of a privacy breach becomes.

The Standard Approach (And Why It Hurts)

The standard way to solve this is to configure the Datadog Agent to mask or scrub PII before it leaves your cluster.

However, this approach has significant drawbacks:

Complexity: Setting up custom parsing rules, regexes, and pipelines in the Datadog Agent configuration can be tedious and difficult to maintain.
High CPU Usage: Running heavy regex operations over massive volumes of text inside your log shipper consumes a lot of CPU resources. This can slow down your node's performance or require larger, more expensive compute instances.
Whack-a-Mole: You are constantly updating rules as your application output changes, which takes time and effort.

The Solution: PII-Shield as a Lightweight Sidecar

Instead of burdening your cluster-wide log shipper with heavy processing, you can mask PII before it even leaves the pod.

PII-Shield is a lightning-fast, zero-dependency tool written in Go. It acts as a sidecar container that sits right next to your application. It intercepts the logs in real-time, scrubs sensitive data using entropy detection and deterministic hashing, and then passes the clean logs forward.

By the time the Datadog Agent picks up the logs from the Kubernetes node, they are already completely sanitized.

Ready-to-Use Pod Configuration

Here is a practical example of how to inject PII-Shield as a sidecar into your Kubernetes Pod. We use a shared volume so PII-Shield can read the application's output and write safe logs to its own standard output.

apiVersion: v1
kind: Pod
metadata:
  name: my-app-with-pii-shield
spec:
  containers:
    - name: my-app
      image: my-app-image:v1.0.0
      # Instead of writing directly to stdout, the app writes to a shared file or pipe
      command: ["/bin/sh", "-c"]
      args: ["./my-app-binary > /shared-logs/app.log"]
      volumeMounts:
        - name: shared-logs
          mountPath: /shared-logs

    - name: pii-shield-sidecar
      image: thelisdeep/pii-shield:v1.2.3
      env:
        - name: PII_SALT
          value: "your-secure-random-salt"
      # PII-Shield reads the file in real-time, scrubs the data, and outputs to stdout
      command: ["/bin/sh", "-c"]
      args: ["tail -n +1 -f /shared-logs/app.log | pii-shield"]
      volumeMounts:
        - name: shared-logs
          mountPath: /shared-logs

  volumes:
    - name: shared-logs
      emptyDir: {}

Note: The thelisdeep/pii-shield image is multi-arch (supporting both amd64 and arm64), which is perfect if you are saving costs by running on ARM processors like AWS Graviton.

How does Datadog know what to read?
Because the main application now redirects its output to a file, its standard output (stdout) is empty. The Datadog Agent, which natively listens to stdout across all containers via Autodiscovery, will automatically pick up only the clean stream from the pii-shield-sidecar. There are no conflicts and no duplicate logs.

Why this is better:

Zero Configuration for Log Shippers: Datadog just receives clean logs. There are no complex pipeline rules to manage.
Bypass Premium Vendor Fees: Datadog's built-in Sensitive Data Scanner is a premium feature billed on top of your regular log volumes. By using a free, open-source sidecar, you completely eliminate the need for expensive vendor-side scrubbing.
Predictable Performance: PII-Shield uses zero-allocation JSON parsing and has a memory footprint of only about 30Mi. For a sidecar running in every pod across your cluster, a footprint this small is critical.
Easy Debugging: With deterministic hashing, user@email.com becomes something like [HIDDEN:a1b2c3]. You can still trace that same user across your Datadog logs for debugging without ever knowing their real email.

By putting the shield right where the data is generated, you protect your users' privacy and keep your observability bills in check.

Ready to secure your Kubernetes logs?
Check out the PII-Shield repository on GitHub, try out the Helm chart, and if you find it useful, consider dropping a star!

Integrating PII-Shield into GuardSpine (WASM vs Native execution)

Ilya Ploskovitov — Wed, 25 Feb 2026 05:39:27 +0000

1. Introduction

GuardSpine's main job is AI-aware code governance—using AI models to automatically review pull requests and create cryptographic proof of those reviews. However, because GuardSpine sends code to third-party AI models (like Claude or GPT) to be reviewed, there is a massive risk. Sensitive information—like passwords and personal data (PII)—must be stopped from leaking to these AI providers. You cannot let personal data poison third-party AI models.

To solve this, the GuardSpine team turned to PII-Shield: a lightning-fast, open-source Go engine designed to find and redact sensitive data. The big engineering challenge was figuring out how to integrate this powerful external Go dependency into GuardSpine's strictly Python-based ecosystem. We had to make sure the integration was 100% secure (no data could accidentally leak to the internet) and extremely fast.

We had to make a choice. Should we run the Go tool as a normal program using Python's subprocess.run? Should we deploy a local server? Or, should we try something newer and run the Go code straight inside Python using WebAssembly (WASM)? This article explains how we tested normal programs against WebAssembly to adapt PII-Shield for this AI pipeline, the problems we found with slow startup times, and when you should use each method.

2. WebAssembly (WASI) Integration: The Sandboxed Approach

WebAssembly (WASM) is great for adding tools into Python. Its biggest benefit is that it works anywhere. Instead of building different versions of the tool for Linux, Mac, and Windows, we only need to build one .wasm file. Any computer running wasmtime can use it. Also, WASM is very secure because it runs in a "sandbox." This means the Go tool is locked inside and cannot access your network or your hard drive unless the host explicitly grants that access.

To make this work, we built a Python script (pii_wasm_client.py). When we need to hide data, this script starts the WASM engine. It passes important settings (like the secret PII_SALT password) safely using environment variables. Then, it uses temporary files to send the data into the WASM tool and get the clean text back.

import wasmtime
import os

# 1. Setup the Engine (Turn on the computer)
engine = wasmtime.Engine()

# 2. Load the Code (Load the program)
module = wasmtime.Module.from_file(engine, "pii-shield-wasi.wasm")

# 3. Setup the Rules (Configure WASI)
linker = wasmtime.Linker(engine)
linker.define_wasi()

# 4. Create a Safe Workspace (Create a Store)
store = wasmtime.Store(engine)
wasi = wasmtime.WasiConfig()
wasi.inherit_stdout()
store.set_wasi(wasi)

# 5. Run the Program
instance = linker.instantiate(store, module)
_start = instance.exports(store)["_start"]

try:
    _start(store)
except wasmtime.ExitTrap:
    pass # Expected exit

When we explain this to business leaders, they usually ask these important security questions:

"Why not just use a paid API on the internet?" Because the WASM tool runs locally on your machine, your code never leaves your private network.

"How do you keep the secret salt (PII_SALT) safe?" The secret is passed directly into the locked WASM sandbox. Because the sandbox is blocked off, nothing outside of it can steal the secret.

"Why not just use simple search patterns (Regex)?" Simple search patterns are fast but easily break on weird data, often flagging safe text as secrets by mistake (false positives). PII-Shield uses smart tokenization and entropy scoring to find real secrets, which requires a strong Go engine, not just simple text matching.

3. Real-World Benchmarks & The Bottlenecks

When we started, we thought running the WASM tool directly inside Python would be faster than asking the computer (OS) to start a whole new external program (subprocess.run).

To test this, we built a script to run the tool 10,000 times (you can find the open-source benchmark.py script in this public GitHub Gist). The text we tested was very small: just a 100-character JSON string. We compared a normal Linux program against the WASM engine.
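
For readers who want to reproduce this kind of measurement, a minimal timing harness might look like the sketch below. This is not the benchmark.py from the Gist, only an assumed shape for it: it times any zero-argument callable and reports the average cost per call in milliseconds.

```python
import time


def ms_per_run(fn, iterations: int = 10_000) -> float:
    """Call fn() `iterations` times and return the mean wall time in ms."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / iterations
```

You would pass it one callable that launches the native binary through subprocess.run and another that instantiates and runs the WASM module, then compare the two numbers.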

The results were surprising. In our test with the tiny text, the normal program was about 4.7x faster:

Normal Program (subprocess.run): ~0.86 ms per run.
Starting and running the WASM Engine: ~4.03 ms per run.

Even when scaled to 100,000 iterations, the per-execution latency stayed flat (0.85 ms vs 4.05 ms). Total execution time scaled linearly, which suggests that while WASM carries a fixed ~4 ms "cold start" tax, it did not leak memory or degrade over time in our runs. That is a crucial property for GuardSpine's stability.

But saying "WASM is slower" is not fair. WebAssembly code is actually very fast. The problem was the "Cold Start"—how long it takes to turn the tool on. We found three big slowdowns:

First, Starting Go is Slow. Every time we ran the WASM tool, the Go language had to start up its memory manager and background tasks inside the sandbox. When you run a normal program, your computer’s Operating System does this almost instantly. In WASM, we had to wait for it every single time.

Second, The File Speed Problem (I/O). To send text into the WASM tool, our Python script created temporary files on the hard drive. Creating, writing, and deleting files takes a lot of time and caused about 30-50% of the delay.

Finally, The Text Was Too Small. Because our test text was only 100 characters, 99% of the time was spent just turning the tool on and managing files. The actual work of finding the secrets took almost zero time. If we tested a massive 10MB file instead, the 4x speed difference would likely disappear because the real work (finding secrets) would take much longer than the startup time.

4. Unlocking WASM's True Potential

Even though the "Cold Start" is slow, we shouldn't give up on WASM's amazing security and portability. To make WASM lightning-fast, we need to change how we run it:

The easiest fix is Keeping It Running. Our test turned the WASM tool on and off 10,000 times. If we turn it on just once when the server starts, we never have to wait for the Go startup delay again. WASM would be just as fast as a normal program.

Next, we must fix the slow files by using In-Memory Pipes. Instead of writing data to real files on the hard drive, we can push the data directly through standard in-memory streams (pipes). This fixes the file speed problem easily.

The perfect, final goal is Shared Memory. Instead of copying text back and forth, Python and WASM can look at the exact same spot in the computer's memory. This is the fastest possible way to share data, though it introduces complex engineering challenges, such as safely managing Go's Garbage Collector across the Python-WASM boundary.

Finally, we should look at TinyGo. The normal Go compiler adds a lot of heavy, extra background tasks. TinyGo is a smaller compiler aimed at constrained targets such as WASM. Using it should make the tool much smaller, and we expect it to start up 10 to 20 times faster.

5. The Verdict: When to use which?

Even though normal programs start faster, speed isn't the only thing that matters—security and stability matter too. The hidden superpower of WASM is that Python is in total control. If Python crashes, the WASM tool dies cleanly with it. If you use subprocess.run, a crashed Python app might leave behind "zombie" programs that eat up your server's resources.

So, when should you use a Normal Program? They are best for things you run rarely, like a command-line developer tool or a script that runs once an hour. The computer's Operating System is great at starting them quickly. While proper process management in Python can mitigate risks, WASM remains inherently safer for memory lifecycles.

When should you use WASM? WASM is the indisputable winner for modern, always-on Cloud servers. When you need to run untrusted code safely, or you need your tool to work on any operating system without building 10 different versions, WASM is unbeatable. It guarantees that your code stays locked in a safe box.

6. Conclusion

Integrating PII-Shield into GuardSpine taught us a lot about the trade-offs between raw speed and safe, modern design. While normal programs won the simple speed test, adapting WebAssembly gave the platform unmatched security and the ability to run anywhere without complex cross-compilation.

As the platform's processing volume scales, the roadmap for this integration is clear. We need to stop turning the WASM tool on and off for every log line. Instead, the plan is to keep the engine running persistently, compile it with TinyGo, and share memory directly across the Python-WASM boundary. The ultimate goal isn't just to match the speed of normal programs, but to beat them completely, creating a lightning-fast, unbreakable redaction layer for AI development.

Stop Leaking API Keys in your AI Agent Logs: A Go Sidecar Approach

Ilya Ploskovitov — Thu, 05 Feb 2026 21:41:24 +0000

Subtitle: The Hidden Privacy Leak in your AI Agents (and why your LLM "Audit Logs" are a GDPR Nightmare)

1. The Problem

Everyone is building AI agents right now. Whether you're using LangChain, AutoGPT, or custom loops, you are almost certainly logging their work. You keep traces of every step to debug reasoning loops or monitor costs.

Here lies the pain: Everything goes into these logs. User prompts, model responses, and raw JSON payloads from APIs.

The Visualization

Imagine this scenario:

Input Log:

{"user": "aragossa", "prompt": "My key is sk-live-123456"}

What your Logging System Saved:

{"user": "aragossa", "prompt": "My key is sk-live-123456"}

If a user pastes their password, API key, or PII into the chat, or if an API returns a sensitive internal token, that data is now permanently etched into your Elasticsearch, Datadog, or S3 bucket.

The Consequence: Your fine-tuning dataset is now "poisoned" with real user data. This is a massive GDPR violation and a ticking security time bomb. You can't just "delete" it if you don't know where it is.

2. The Gap: Why usual methods fail

Regex? Good luck maintaining a regex list for every possible API key format, session token, and PII variation in existence. It’s a game of whack-a-mole you will lose.
ML-based protection? Too slow. If your agent operates in real-time, you cannot afford a 500ms roundtrip to a BERT model just to sanitize specific log lines. PII scans often become the bottleneck.
Python-native logic? Processing massive text streams in an interpreted language (like Python) adds significant CPU overhead per log line compared to a compiled Go binary. In high-throughput pipes, this latency adds up fast.
Existing Observability Tools?
Systems like Fluent Bit, Datadog, or OpenTelemetry already offer redaction and PII masking, usually via pattern rules and regex. For many workloads that’s perfectly fine. The trade‑off is that these pipelines are not optimized for AI‑agent traces: they either run late in the pipeline (after logs have already left the pod) or rely on configuration‑heavy pattern catalogs that are hard to keep up‑to‑date in a world of ever‑changing API keys and internal tokens.

Static scanners like TruffleHog shine for repositories and CI, where you scan code at rest. They’re not meant to sit inline on a hot log stream and make sub‑millisecond decisions on every line.

Where PII‑Shield is different is not in “inventing redaction”, but in the combination of techniques tailored for AI agents: entropy + bigram signals for unknown secrets, deterministic HMAC instead of *** for referential integrity, and deep JSON traversal to keep your log schemas intact while still scrubbing sensitive values.

3. The Implementation: Enter PII-Shield

Meet PII-Shield. It’s a lightweight sidecar written in Go that sits right next to your agent.

Killer Feature #1: Entropy-based Detection

We don't just search for "password=". We look for chaos.
API keys and authentication tokens naturally have high "entropy" (randomness/complexity). Normal human speech has low entropy. By calculating the mathematical complexity of strings, we can flag 64-character hex strings or base64 blobs without knowing their specific format.
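
As a quick illustration of the idea (in Python rather than the project's Go, and covering only the plain Shannon part of the score), entropy per character can be computed like this:

```python
import math
from collections import Counter


def shannon_entropy(token: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    if not token:
        return 0.0
    total = len(token)
    counts = Counter(token)
    # Sum p * log2(1/p) over the distinct characters in the token.
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A string made of one repeated character scores 0, while a string where every character is different scores log2 of its length, which is why random-looking keys stand out against ordinary words.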

Killer Feature #2: Deterministic HMAC

This is the feature that caught attention on Hacker News.
We don't just replace secrets with ***. We turn secret123 into [HIDDEN:a1b2c3].

Input Log:

{"user": "aragossa", "prompt": "My key is sk-live-123456"}

PII-Shield Output:

{"user": "aragossa", "prompt": "My key is [HIDDEN:8f2a1b]"}

Why?
This is a deterministic HMAC (Hash-based Message Authentication Code).

It allows you to trace a specific user or session across multiple log lines without knowing who they are.
It preserves Referential Integrity for debugging. You can see that "Session A" failed 5 times, but you cannot see the Session ID itself.
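
The same property can be shown in a few lines of Python with the standard hmac module. This is a sketch of the technique, not the project's code; the function name and the 6-character prefix simply mirror the examples in this post.

```python
import hashlib
import hmac


def redact(value: str, salt: bytes) -> str:
    """Replace a secret with a short, salted, deterministic tag."""
    digest = hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # Keep only a short prefix: enough to correlate log lines,
    # not enough to be useful as the original value.
    return f"[HIDDEN:{digest[:6]}]"
```

Calling redact twice with the same value and salt always yields the same tag, which is what lets you follow one session across many lines. A 6-hex-character prefix carries only 24 bits, so unrelated values can occasionally collide; that is the price of a short tag.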

Killer Feature #3: Statistical Adaptive Threshold

PII-Shield doesn't just use a hardcoded number. It learns the "baseline noise" of your logs. By calculating the mean and standard deviation ($2\sigma$), it automatically adjusts the sensitivity to your specific environment.
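
The post does not spell out how the baseline is maintained, so the following is only one plausible reading: keep the complexity scores of recent tokens and flag anything above mean plus two standard deviations. The function name and the choice of population standard deviation are assumptions.

```python
import statistics


def adaptive_threshold(scores, k: float = 2.0) -> float:
    """Return mean + k * standard deviation of the observed scores."""
    mean = statistics.fmean(scores)
    stdev = statistics.pstdev(scores)  # population std dev (assumption)
    return mean + k * stdev
```

With scores of 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the standard deviation is 2, so the threshold lands at 9 and only tokens scoring above that would be treated as suspicious.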

4. The Logic: Sidecar Architecture

The architecture is dead simple, leveraging the power of Kubernetes sidecars or UNIX pipes.

Your agent simply writes logs to stdout. PII-Shield intercepts the stream, scans it in real-time with near-zero overhead, sanitizes it, and passes it forward.

5. The Technical Meat

Why Go? Because we need raw speed and no dependencies.

1. Entropy & Bigrams (The Math)

Here is how we calculate the "Chaos" (Entropy) of a token in scanner.go. We use a combination of Shannon Entropy, Character Class Bonuses, and English Bigram analysis.

The Bigram Check is crucial: it penalizes strings that look like valid English (common letter pairs) and boosts score for "unnatural" strings. (Note: This is optimized for English but can be disabled or tuned via PII_DISABLE_BIGRAM_CHECK for other languages).

// From scanner.go
func CalculateComplexity(token string) float64 {
    // 1. Shannon Entropy
    entropy := calculateShannon(token) 

    // 2. Class Bonus (Upper, Lower, Digit, Symbols)
    // bonus := float64(classes-1) * 0.5
    bonus := calculateClassBonus(token) 

    // 3. Bigram Check (English Likelihood)
    // Penalizes common English, boosts random noise
    bigramScore := calculateBigramAdjustment(token) 

    return entropy + bonus + bigramScore
}

2. False Positives? (Whitelists)

"Entropy is great, but won't it eat my Git Hashes or UUIDs?"
PII-Shield includes built-in Whitelists (isSafe function) for standard identifiers like UUIDs, IPv6 addresses, Git Commit hashes (SHA-1), and MongoDB ObjectIDs. This ensures your debugging data stays intact while secrets get redacted.
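
A whitelist of this kind can be approximated in Python as shown below. The patterns are my own guesses at reasonable shapes for each identifier, not the rules inside the project's isSafe function.

```python
import ipaddress
import re

# Each pattern must match the whole token.
SAFE_PATTERNS = [
    # UUID, e.g. 123e4567-e89b-12d3-a456-426614174000
    re.compile(r"[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}"),
    re.compile(r"[0-9a-f]{40}"),  # SHA-1 Git commit hash
    re.compile(r"[0-9a-f]{24}"),  # MongoDB ObjectID
]


def is_safe(token: str) -> bool:
    """True if the token looks like a well-known, non-secret identifier."""
    if any(p.fullmatch(token) for p in SAFE_PATTERNS):
        return True
    try:
        ipaddress.IPv6Address(token)
        return True
    except ValueError:
        return False
```

The trade-off is visible in the patterns themselves: a secret that happens to be exactly 40 lowercase hex characters would be let through, so a whitelist like this trades a little recall for far fewer false positives.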

3. Credit Card Detection (Luhn Algorithm)

Entropy isn't enough for credit cards, as numbers often have low randomness. PII-Shield includes a high-performance implementation of the Luhn Algorithm to scan for valid card checksums in the stream (FindLuhnSequences).
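
For reference, the Luhn checksum itself is short. The Python sketch below validates a single candidate string; it does not attempt the stream scanning that FindLuhnSequences performs, and the 13 to 19 digit length window is an assumption based on common card number lengths.

```python
def luhn_valid(number: str) -> bool:
    """Check a candidate card number with the Luhn checksum."""
    cleaned = number.replace(" ", "").replace("-", "")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False
    if not 13 <= len(cleaned) <= 19:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

The well-known test number 4111 1111 1111 1111 passes, while changing its last digit to 2 makes the checksum fail.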

4. Deep JSON Inspection

For JSON logs, PII-Shield performs deep inspection (processJSONLine), preserving the schema while redacting values.
Note: PII-Shield re-serializes JSON (using json.Marshal), which may change key order/sorting. It guarantees semantic integrity but is best used early in your pipeline, before any byte-sensitive steps (like signing or exact-diff comparisons).
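
The traversal can be sketched in Python as a small recursive walk. The is_sensitive and redact callables are placeholders that the caller supplies; splitting string values on spaces is a simplification of whatever tokenization the Go engine really uses.

```python
import json


def redact_json_line(line: str, is_sensitive, redact) -> str:
    """Redact sensitive tokens inside string values, keeping the structure."""

    def walk(node):
        if isinstance(node, dict):
            # Keys are left untouched so the schema stays the same.
            return {key: walk(value) for key, value in node.items()}
        if isinstance(node, list):
            return [walk(item) for item in node]
        if isinstance(node, str):
            tokens = node.split(" ")
            return " ".join(redact(t) if is_sensitive(t) else t for t in tokens)
        return node  # numbers, booleans and null pass through

    return json.dumps(walk(json.loads(line)))
```

As with the Go version, the output is re-serialized, so whitespace and escaping may differ from the input even though the keys and nesting are unchanged.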

5. Deterministic Redaction

Here is the HMAC logic using a secure Salt:

func redactWithHMAC(sensitiveData string) string {
    // currentConfig.Salt (a []byte, as hmac.New requires) is loaded securely from env vars
    mac := hmac.New(sha256.New, currentConfig.Salt)
    mac.Write([]byte(sensitiveData))
    hash := hex.EncodeToString(mac.Sum(nil))
    // We only keep a short prefix for tracing identity
    return fmt.Sprintf("[HIDDEN:%s]", hash[:6])
}

The main loop (cmd/cleaner/main.go) is a highly efficient buffered reader:

func main() {
    reader := bufio.NewScanner(os.Stdin)
    for reader.Scan() {
        text := reader.Text()
        // Core logic: Low allocation overhead
        // Imports github.com/aragossa/pii-shield/pkg/scanner
        cleaned := scanner.ScanAndRedact(text) 
        fmt.Println(cleaned)
    }
}

A Note on Scope: High-Entropy vs. Natural Language

Let's be clear: PII-Shield is laser-focused on High-Entropy secrets (API keys, tokens, auth headers) and structural patterns (Credit Cards). It is not a magic NLP bullet for detecting names like "John Smith" or free-text addresses. For that, you should treat PII-Shield as a low-latency "first line of defense" for your infrastructure, potentially complemented by heavier offline NLP tools for semantic analysis.

6. How to try it

I am looking for edge cases. If you are building AI agents, try running your trace logs through this and see what it catches (or misses).

You can run it locally with Docker:
docker run -i -e PII_SALT="mysalt" pii-shield < logs.txt

GitHub: https://github.com/aragossa/pii-shield

Why this matters:
Security often trails behind innovation. With the explosion of AI Agents, we are generating massive amounts of sensitive data in logs. PII-Shield is a "drop-in" safety net to ensure your innovation doesn't become a liability.

Playwright & Chaos Engineering: 3 Ways to Break Your UI in 10 Lines of Code 🧨

Ilya Ploskovitov — Tue, 03 Feb 2026 13:00:00 +0000

"The tests are green, but production is down."

We’ve all been there. Your CI/CD pipeline looks like a Christmas tree (all green), yet 5 minutes after deployment, the support tickets start rolling in. Why? Because we tend to test only the "Happy Path." In the real world, users enter elevators (network loss), backends have database deadlocks (500 errors), and low-end devices struggle with heavy JS (CPU race conditions).

Here are 3 simple ways to inject chaos into your Playwright tests using Python and TypeScript without any external dependencies.

1. The "Kill the Backend" Scenario (500 Error Injection)

What happens if your billing API fails? Does your UI show a "Retry" button, or does it hang forever?

Scenario: Intercept a critical API call and return a 500 Internal Server Error.

Python Code

from playwright.sync_api import expect

def test_billing_failure(page):
    # Intercepting the payment endpoint
    page.route("**/api/v1/billing/pay", lambda route: route.fulfill(
        status=500,
        content_type="application/json",
        body='{"error": "Internal Database Error"}'
    ))

    page.goto("/checkout")
    page.get_by_role("button", name="Pay Now").click()

    # Assert that the UI handles the crash gracefully
    expect(page.locator(".error-message")).to_be_visible()

TypeScript Code

import { test, expect } from '@playwright/test';

test('handle billing failure', async ({ page }) => {
  await page.route('**/api/v1/billing/pay', route => route.fulfill({
    status: 500,
    contentType: 'application/json',
    body: JSON.stringify({ error: 'Internal Database Error' }),
  }));

  await page.goto('/checkout');
  await page.getByRole('button', { name: 'Pay Now' }).click();

  await expect(page.locator('.error-message')).toBeVisible();
});

2. The "Elevator Effect" (Sudden Offline Mode)

Users move. Networks drop. If your app is an SPA, losing connection mid-session can lead to corrupted local states.

Scenario: Start a file upload and cut the internet connection.

Python Code

from playwright.sync_api import expect

def test_upload_interruption(page, context):
    page.goto("/upload")
    page.get_by_label("File").set_input_files("heavy_video.mp4")

    # Chaos: Go offline instantly
    context.set_offline(True)

    # Expect a "Resume" button or "Connection lost" banner
    expect(page.get_by_role("button", name="Resume")).to_be_visible()

    context.set_offline(False) # Restore network

TypeScript Code

import { test, expect } from '@playwright/test';

test('recovery on network loss', async ({ page, context }) => {
  await page.goto('/upload');
  await page.getByLabel('File').setInputFiles('heavy_video.mp4');

  await context.setOffline(true);

  await expect(page.getByRole('button', { name: 'Resume' })).toBeVisible();

  await context.setOffline(false);
});

3. The "Old Phone" Race Condition (CPU Throttling)

Async bugs often hide behind the speed of your developer laptop. By slowing down the CPU, you change the execution order of scripts and catch elusive race conditions.

Python Code

from playwright.sync_api import expect

def test_race_condition(page):
    # Slow down CPU by 6x using Chrome DevTools Protocol (CDP)
    client = page.context.new_cdp_session(page)
    client.send("Emulation.setCPUThrottlingRate", {"rate": 6})

    page.goto("/heavy-dashboard")
    page.get_by_role("button", name="Load Stats").click()

    # Assert that the status eventually becomes 'Ready'
    expect(page.locator("#status")).to_contain_text("Ready", timeout=10000)

TypeScript Code

import { test, expect } from '@playwright/test';

test('catch race conditions', async ({ page }) => {
  const client = await page.context().newCDPSession(page);
  await client.send('Emulation.setCPUThrottlingRate', { rate: 6 });

  await page.goto('/heavy-dashboard');
  await page.getByRole('button', { name: 'Load Stats' }).click();

  await expect(page.locator('#status')).toContainText('Ready', { timeout: 10000 });
});

💡 Pro Tip: When to Run These?

Don't run chaos tests on every PR. They are inherently more complex and can be "flaky" if your timeouts aren't tuned.

Best Practice: Add them to a nightly or pre-release suite.

Limit: Remember that CDP (CPU Throttling) only works on Chromium-based browsers.

Wrapping Up
Resilience is a feature. If you only test for success, you're only doing half of your job as a QA Engineer. Break your UI before your users do.

I’ve written a more detailed deep-dive on Resilience Strategy & CI/CD integration on my new blog. Check it out at ChaosQA.com.

Please, Stop Redirecting to Login on 401 Errors 🛑

Ilya Ploskovitov — Thu, 15 Jan 2026 21:57:24 +0000

You spend 15 minutes filling out a long configuration form. You get a Slack notification, switch tabs, reply to a colleague, and grab a coffee.

30 minutes later, you come back to the form and click "Save".

The page flashes. The login screen appears. And your data is gone.

This is the most annoying UX pattern in web development, and we need to stop doing it.

The "Lazy" Pattern

Why does this happen? Usually, it's because the JWT (access token) expired, the backend returned a 401 Unauthorized, and the frontend code did exactly what the tutorials said to do:

// Don't do this
axios.interceptors.response.use(null, error => {
  if (error.response.status === 401) {
    window.location.href = '/login'; // RIP data 💀
  }
  return Promise.reject(error);
});

Developers often argue: "But it's a security requirement! The session is dead!"

Yes, the session is dead. But that doesn't mean you have to kill the current page state.

The Better Way (Resilience)

If a user is just reading a dashboard, a redirect is fine. But if they have unsaved input (forms, comments, settings), a redirect is a bug.

Here is how a robust app handles this:

Intercept: Catch the 401 error.
Queue: Pause the failed request. Do not reload the page.
Refresh: Try to get a new token in the background (using a refresh token) OR show a modal asking for the password again.
Retry: Once authenticated, replay the original request with the new token.
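
The four steps above can be sketched independently of any HTTP library. In the Python sketch below, send and refresh are hypothetical callables standing in for your real request and token-refresh functions; a response is modelled as a (status, body) tuple.

```python
from typing import Callable, Tuple

Response = Tuple[int, str]


def send_with_refresh(
    send: Callable[[str], Response],
    refresh: Callable[[], str],
    token: str,
) -> Tuple[Response, str]:
    """Send a request; on 401, refresh the token once and replay it."""
    response = send(token)
    if response[0] != 401:
        return response, token
    # Intercept + queue: we hold on to the original request (the `send`
    # callable) instead of navigating away and losing the user's input.
    new_token = refresh()
    # Retry exactly once to avoid looping if the new token is rejected too.
    return send(new_token), new_token
```

If refresh itself fails, that is the point where the app should show the re-authentication modal instead of redirecting, so the form state survives.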

The user doesn't even notice. The form saves successfully.

How to test this? (The hard part)

Implementing the "Silent Refresh" is tricky, but testing it is annoying.

Access tokens usually last 1 hour. You can't ask your QA team to "wait 60 minutes and then click Save" to verify the fix.

You need a way to trigger a 401 error exactly when you click the button, even if the token is valid.

The "Chaos" Approach

Instead of waiting for the token to expire naturally, we can just delete it "mid-flight."

I use Playwright for this. We can intercept the outgoing request and strip the Authorization header before it hits the server.

This forces the backend to reject the request, triggering your app's recovery logic immediately.

Here is a Python/Playwright snippet I use to verify my apps are "expiry-proof":

from playwright.sync_api import expect

def test_chaos_silent_logout(page):
    # 1. Login and go to a form
    page.goto("/login")
    # ... perform login logic ...
    page.goto("/settings/profile")

    # 2. Fill out data
    page.fill("#bio", "Important text I don't want to lose.")

    # 3. CHAOS: Intercept the 'save' request
    def kill_token(route):
        headers = route.request.headers
        # We manually delete the token to simulate expiration
        if "authorization" in headers:
            del headers["authorization"]

        # Send the "naked" request. Backend will throw 401.
        route.continue_(headers=headers)

    # Attach the interceptor
    page.route("**/api/profile/save", kill_token)

    # 4. Click Save
    page.click("#save-btn")

    # 5. Check if we survived

    # If the app is bad, we are now on /login
    # if page.url == "/login": fail()

    # If the app is good, it refreshed the token and retried.
    # The text should still be there, and the save should succeed.
    expect(page.locator("#bio")).to_have_value("Important text I don't want to lose.")
    expect(page.locator(".success-message")).to_be_visible()

Summary

Network failures and expired tokens are facts of life. Your app should handle them without punishing the user.

If you want to build high-quality software, treat 401 Unauthorized as a recoverable error, not a fatal crash.

PS: If you need to test this on real mobile devices where you can't run Playwright scripts, you can use a Chaos Proxy to strip headers on the network level.

I got tired of guessing why my server crashed: Building a "Smart" Monitor with Global Checks & JSON Validation

Ilya Ploskovitov — Sat, 03 Jan 2026 21:08:37 +0000

TL;DR: Basic uptime tools just tell you "It's down." I wanted a tool that tells me why (DNS? SSL? App crash?), checks from Tokyo/NY, and validates JSON schemas. So I built OpsPulse.

The "It’s Just Down" Problem

Every developer has been there. It’s 3:00 AM. PagerDuty/Telegram screams "Service Down." You wake up, rush to the terminal, check the logs... and the service is fine.

Was it a network blip? Did the load balancer choke? Did an ISP in Europe drop packets?

Most uptime monitors are lazy. They check for a 200 OK from a single region (usually AWS us-east-1) and call it a day. That wasn't enough for me. I decided to build a platform that digs deeper without costing as much as Datadog.

Here is how I built OpsPulse and what makes it different.

1. Smart Diagnostics (Root Cause Analysis)

The killer feature of OpsPulse is context. It doesn’t just yell "Error!", it tries to diagnose the patient.

When an HTTP check fails, the worker triggers a cascade of lower-level checks:

Ping (ICMP): Is the server even reachable?
TCP Connect: Is the port open, but Nginx is hanging?
SSL Handshake: Did the cert expire, or is the chain of trust broken?

The Result: instead of a generic "Error 500", you get a Telegram alert saying:

🔴 Status: DOWN
📉 Reason: Web Server Error
🧠 Context: Port 443 open, Ping OK, but Nginx returned 502 Bad Gateway. The issue is on the backend.
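To make the cascade concrete, here is a minimal sketch of how the results of the lower-level checks could be turned into a reason string. It is not the OpsPulse implementation: the function name `diagnose`, its parameters, and the exact wording of the reasons are assumptions for illustration.

```python
from typing import Optional


def diagnose(ping_ok: bool, tcp_ok: bool, tls_ok: bool, http_status: Optional[int]) -> str:
    """Map the results of the fallback checks to a human-readable reason."""
    if not tcp_ok:
        # Some hosts drop ICMP, so ping only refines the message here.
        return "Port closed or filtered" if ping_ok else "Host unreachable"
    if not tls_ok:
        return "SSL handshake failed"
    if http_status is None:
        return "Web server accepted the connection but did not answer"
    if http_status >= 500:
        return f"Web server error (HTTP {http_status})"
    return f"Unexpected response (HTTP {http_status})"


print(diagnose(ping_ok=True, tcp_ok=True, tls_ok=True, http_status=502))
```

The checks are ordered from the lowest layer up, so the first failing layer names the most likely culprit.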

2. True Global Monitoring (Multi-Region)

Local checks lie. To verify availability properly, I integrated Google Cloud Functions. OpsPulse spins up ephemeral runners to check your resource simultaneously from the US, Europe, and Asia.

This enabled a Global DNS Monitor:

Checks propagation of A, MX, TXT records worldwide.
Uses Fuzzy Matching (handling trailing dots and format quirks).
If your site is up in NY but down in Tokyo — you’ll know.
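The fuzzy matching mentioned above can be pictured as normalizing both sides before comparing them. This is a small sketch under assumptions: `normalize_record` and `records_match` are hypothetical helpers, and the exact quirks OpsPulse handles may differ.

```python
from typing import Iterable


def normalize_record(record_type: str, value: str) -> str:
    """Canonical form for comparing DNS answers from different resolvers."""
    value = value.strip()
    if record_type == "TXT":
        # TXT answers are often wrapped in double quotes; their content is
        # case-sensitive, so only the quotes are removed.
        if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
            value = value[1:-1]
        return value
    # Host names are case-insensitive and may carry a trailing root dot.
    return value.rstrip(".").lower()


def records_match(record_type: str, expected: Iterable[str], observed: Iterable[str]) -> bool:
    """Compare two record sets, ignoring order and formatting quirks."""
    want = {normalize_record(record_type, v) for v in expected}
    got = {normalize_record(record_type, v) for v in observed}
    return want == got


print(records_match("MX", ["10 mail.example.com"], ["10 MAIL.example.com."]))
```

Comparing sets rather than lists also ignores the order in which a resolver returns the records.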

3. Dev-Centric Features (Not just for websites)

I built this for developers, not just for marketing landing pages.

Advanced HTTP Monitor: It supports custom headers, all methods (GET, POST, PATCH), and strict Content Validation:

Positive Match: Ensure the response contains "Success".
Negative Match: Alert if the response contains "Exception" or "MySQL Error".
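In its simplest form, content validation is two substring passes over the response body. The sketch below assumes plain substring matching; `content_problems` is a hypothetical helper, not the monitor's real API.

```python
from typing import Iterable, List


def content_problems(body: str, must_contain: Iterable[str], must_not_contain: Iterable[str]) -> List[str]:
    """Return a list of validation failures; an empty list means the body is fine."""
    problems = []
    for needle in must_contain:
        if needle not in body:
            problems.append(f"missing expected text: {needle!r}")
    for needle in must_not_contain:
        if needle in body:
            problems.append(f"found forbidden text: {needle!r}")
    return problems


print(content_problems("MySQL Error: server has gone away", ["Success"], ["Exception", "MySQL Error"]))
```

This is what catches the classic case of a page that returns 200 OK while printing a database error in the body.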

Heartbeat (Dead Man's Switch) with JSON Schema: Perfect for backups and cron jobs.

Scenario: Your backup script sends { "status": "ok", "size_mb": 2 }.
Config: Alert if size_mb < 50.
The Magic: I added JSON Schema Validation. You can enforce a strict structure on your incoming webhooks. It turns uptime monitoring into business-metric monitoring.
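The scenario above boils down to two layers: check the structure of the payload first, then apply the business rule. The sketch below uses hand-rolled checks from the standard library rather than a real JSON Schema validator, and `check_heartbeat`, `REQUIRED_FIELDS`, and `min_size_mb` are hypothetical names.

```python
import json
from typing import List

# Expected structure of the heartbeat payload from the scenario above.
REQUIRED_FIELDS = {"status": str, "size_mb": (int, float)}


def check_heartbeat(raw: str, min_size_mb: float = 50) -> List[str]:
    """Return the alerts a heartbeat payload should trigger (empty list = healthy)."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["payload is not valid JSON"]
    if not isinstance(payload, dict):
        return ["payload must be a JSON object"]

    alerts = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in payload:
            alerts.append(f"missing field: {field}")
        elif isinstance(payload[field], bool) or not isinstance(payload[field], expected_type):
            alerts.append(f"wrong type for field: {field}")
    if alerts:
        return alerts  # structure is broken, skip the business rules

    if payload["status"] != "ok":
        alerts.append(f"status is {payload['status']!r}")
    if payload["size_mb"] < min_size_mb:
        alerts.append(f"size_mb {payload['size_mb']} is below {min_size_mb}")
    return alerts


print(check_heartbeat('{"status": "ok", "size_mb": 2}'))
```

A backup that reports "ok" but weighs 2 MB passes a plain heartbeat check and fails this one, which is the point of the feature.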

4. Security First

Since a monitoring tool sends requests everywhere, I had to prevent abuse:

SSRF Protection: Strict blocking of internal network scanning (localhost, 192.168.x.x) and cloud metadata endpoints.
SSL Chain Validation: We don't just check the expiry date. We validate the full chain of trust.
Header Sanitization: Stripping dangerous headers before webhook dispatch.
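The SSRF rule can be sketched with Python's standard `ipaddress` module. This is an illustration of the idea, not the OpsPulse code; `is_blocked_address` is a hypothetical helper.

```python
import ipaddress


def is_blocked_address(ip: str) -> bool:
    """True if a monitor must refuse to send requests to this IP address."""
    addr = ipaddress.ip_address(ip)
    return (
        addr.is_loopback        # 127.0.0.0/8, ::1
        or addr.is_private      # 10.x, 172.16.x, 192.168.x, ...
        or addr.is_link_local   # 169.254.x.x, including the 169.254.169.254 metadata address
        or addr.is_reserved
        or addr.is_multicast
        or addr.is_unspecified  # 0.0.0.0, ::
    )


for candidate in ["127.0.0.1", "192.168.1.10", "169.254.169.254", "8.8.8.8"]:
    print(candidate, is_blocked_address(candidate))
```

The check has to run on the IP address a hostname actually resolves to, not on the hostname string, because a public-looking domain can point at an internal address.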

5. Alerts You Actually Want to Read

Grace Period: Ignore 1-second network hiccups.
Recovery Alerts: Get notified when systems are back online.
Channels: Telegram (bot), Slack (rich formatting), and custom Webhooks.
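Grace periods and recovery alerts fit into one small state machine. The sketch below is an assumption about how such logic could look: `AlertState`, `observe`, and the 60-second default are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    grace_seconds: float = 60.0
    down_since: Optional[float] = None
    alerted: bool = False

    def observe(self, is_up: bool, now: float) -> Optional[str]:
        """Feed one check result; returns "DOWN", "RECOVERED", or None (stay quiet)."""
        if is_up:
            was_alerted = self.alerted
            self.down_since = None
            self.alerted = False
            # Only announce recovery if a DOWN alert was actually sent.
            return "RECOVERED" if was_alerted else None
        if self.down_since is None:
            self.down_since = now
        if not self.alerted and now - self.down_since >= self.grace_seconds:
            self.alerted = True
            return "DOWN"
        return None


state = AlertState(grace_seconds=60)
events = [state.observe(up, t) for up, t in [(False, 0), (True, 1), (False, 10), (False, 70), (True, 90)]]
print(events)
```

A one-second blip never reaches the grace threshold, so it produces neither a DOWN nor a RECOVERED message.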

The Tech Stack

Frontend: Next.js + React (Real-time Dashboard).
Backend Worker: Python (For heavy lifting and network checks).
Cloud: Google Cloud Functions (For global nodes).
Database: PostgreSQL (via Supabase).

Conclusion

OpsPulse started as a side project to stop the 3 AM guessing game. Now it’s a full platform that helps me sleep better.

OpsPulse

What checks are missing from your current monitoring tools? Let me know in the comments! 👇

Stop Building "Zombie UI": The Resilient UX Checklist (Playwright + Python)

Ilya Ploskovitov — Wed, 24 Dec 2025 10:08:06 +0000

The Problem: The "Zombie UI" 🧠

You click "Submit". The database is writing data. The API is processing the request perfectly. The backend is healthy. But on the screen... nothing happens.

The button still looks clickable. The cursor is still a pointer. This is the "Dead Zone" — the gap between the user's input and the interface's reaction.

Building on Jakob Nielsen's Response Time Limits (0.1 s feels instant, 1 s keeps the flow of thought, 10 s is the limit of attention), I work with a stricter budget:

0 - 100ms: Instant. Feels like manipulating a physical object.
100 - 300ms: Slight delay. Acceptable, but noticeable.
300 - 1000ms: User loses focus. "Is it working? Did I miss the button?"
> 1000ms: Zombie Mode. The user thinks the app crashed. They will refresh the page or rage-click the button, triggering duplicate transactions.

A working backend is not enough. If your UI freezes for 2 seconds without feedback, your feature is broken.

The Solution: The 3-Step Feedback Loop

We are used to writing tests like this: expect(success_message).to_be_visible(). But that is not enough. We must assert the intermediate states.

✅ I use this Resilient UX Checklist for every async action:
The Checklist

Immediate (<100ms): The button MUST become disabled. This prevents Rage Clicks and double-charges.
Short Wait (300ms): A spinner or skeleton loader MUST appear.
Long Wait (>3000ms): If the network is terrible (e.g., subway tunnel), show a "This is taking longer than usual..." toast. Never leave the user staring at an infinite spinner.

The Code (Python + Playwright)

How do we test this? We can't rely on random network lag. We need to deliberately freeze the request for 3 seconds to verify the application's "Patience Logic".

Here is a Playwright test that guarantees the "Zombie UI" never happens:

from playwright.sync_api import Page, Route, expect

def test_slow_network_ux(page: Page):
    # 🛑 1. Setup the "Freeze" Interceptor
    def slow_handler(route: Route):
        print(f"❄️ Freezing request to {route.request.url} for 3s...")
        # Simulate Bad 3G / Subway Network
        # Note: time.sleep() would block the sync API's event loop and stall
        # the assertions below, so we wait through Playwright instead.
        # In async tests, use 'await asyncio.sleep(3)'
        page.wait_for_timeout(3000)
        route.continue_()

    # Intercept the checkout API
    page.route("**/api/checkout", slow_handler)

    page.goto("/cart")

    # 🎬 2. Trigger the Action
    submit_btn = page.locator("#submit-order")
    submit_btn.click()

    # ✅ 3. Assert "Immediate Feedback" (0-100ms)
    # The button must be disabled immediately
    expect(submit_btn).to_be_disabled()

    # ✅ 4. Assert "Loading State" (100-300ms)
    # The spinner must appear while we wait (we have 3 seconds)
    spinner = page.locator(".spinner-loader")
    expect(spinner).to_be_visible()

    # ✅ 5. Assert "Success State" (After 3s)
    # Eventually, the request completes
    expect(page.locator(".success-message")).to_be_visible(timeout=5000)
    # The spinner should disappear
    expect(spinner).not_to_be_visible()

Why automate this?

If you test this manually on localhost, you will blink and miss the spinner. By forcing a 3-second delay in CI, you guarantee that every user—even those on a slow mobile connection—gets a responsive UI, not a dead one.

Architecture Note: CDP vs System Proxy 🏗️

Why your local tests might be lying to you.

Most network interception tests (like the one above) use the Chrome DevTools Protocol (CDP).

CDP is great for Browsers: It gives you perfect control over traffic inside the Chrome process.
CDP fails on "The Full Matrix": You cannot easily attach CDP to a physical iPhone running Safari, a Smart TV app, or a native Android build.

The "Real World" Reality

If you want to run these Chaos scenarios on real physical devices (not emulators), code-based interception isn't enough.

You need a System Level Proxy (like Charles Proxy or a cloud-native tool like Chaos Proxy Debuggo). These tools sit between the physical device and the internet, allowing you to apply "3G Throttling" or "Random 500 Errors" to a real iPhone without changing a single line of your app's code.

Conclusion

Perceived Performance > Actual Performance.

You cannot always fix the slow SQL query. You cannot fix the user's spotty 4G connection. But you can fix how your UI communicates that delay.

💥 Break your API before your users do. Automated Network Chaos for CI/CD.

Ilya Ploskovitov — Sun, 21 Dec 2025 17:46:45 +0000

The Story: Why I built Chaos Proxy

We've all been there. The feature works perfectly on localhost. The E2E tests pass with flying colors. Then we deploy to production, and users on 3G networks start complaining that the app freezes, crashes, or—worst of all—charges them twice.

I realized that our CI pipelines were living in a fantasy world of 0ms latency and 100% uptime.

I wanted to simulate "Bad Network" conditions automatically in GitHub Actions, specifically for mobile apps and backend idempotency checks. I tried mocking requests in Playwright, but that didn't cover native Android/iOS emulators. I tried local proxies, but they were hard to script.

So I built Chaos Proxy: a cloud-based, programmable proxy designed for CI/CD.

How it works

Debuggo isn't just a GUI tool. It’s an API-first platform. You can treat your network infrastructure like code.

Create: Your CI script calls our API to spin up an isolated, ephemeral proxy container.
Connect: You route your E2E test traffic (Web, Android, iOS) through this proxy.
Break: You send API commands to inject latency, trigger 503 errors, or tamper with headers in real-time.

Demo: Simulating 503 Errors in Chrome (Visual)

Key Features

API for CI/CD: Spin up and destroy proxies programmatically. No long-living servers to manage.
The "Rage Click" Test: Inject 3 seconds of latency into specific endpoints (e.g., /api/pay) to ensure your UI disables buttons correctly before the user clicks twice.
Native Mobile Support: Since it works at the network level (HTTP Proxy), it supports Android Emulators and iOS Simulators perfectly.
Response Fuzzing: Automatically tamper with JSON bodies to see if your app crashes on malformed data.

⚡️ See it in action
Here is how simple it is to inject 503 Service Unavailable errors into your checkout flow (here, for half of the requests) using curl:

curl -X PUT https://api.debuggo.app/v1/sessions/$SESSION_ID/rules \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "rules": [
      {
        "url_pattern": "*/api/checkout",
        "failure_rate": 50,
        "error_code": 503
      }
    ]
  }'

Demo: Automating Network Chaos via Terminal (CLI)

How you can get involved

I just launched the Public API Beta. I am looking for QA Engineers and DevOps folks who are tired of "flaky" apps and want to build true resilience.

Try the Free Tier: You can start manually or via API for free.
Break your App: Try the "Rage Click" test (Tip #5 on our blog).
Feedback: Let me know what integration you need next!

Stop trusting localhost. Start testing reality.

Announcing Chaos Proxy API: Automate Network Chaos in CI/CD 🚀

Ilya Ploskovitov — Sun, 21 Dec 2025 14:45:04 +0000

Moving Beyond "Localhost" Testing

Until now, Debuggo has been a fantastic tool for manual testing. You spin up a proxy, connect your phone, and verify how your app handles a 503 error or high latency. It works great for ad-hoc debugging.

But manual testing doesn't scale.

You cannot ask your QA team to manually verify "Offline Mode" handling on every single Pull Request. You cannot manually check if your payment gateway handles double-clicks correctly before every deploy.

To build truly resilient apps, you need Continuous Chaos.

Today, we are launching the Chaos Proxy API. Now you can programmatically create proxies, configure chaos rules, and tear them down—all within your CI/CD pipeline (GitHub Actions, GitLab CI, Jenkins).

Architecture: How it works in CI

The API gives you full control over the lifecycle of a Chaos Proxy directly from your pipeline scripts:

Create: Spin up a fresh, isolated proxy instance on demand (POST /sessions).
Configure: Apply chaos rules (latency, errors, body tampering) via JSON (PUT /rules).
Certify: Download the CA certificate to install on Android Emulators or iOS Simulators (GET /certs).
Test: Run your E2E suite (Playwright, Appium, Cypress) routing traffic through the proxy.
Destroy: Clean up resources when the test finishes (DELETE /sessions).

Real-World Example: GitHub Actions
Here is a complete workflow. This script spins up a proxy, injects 3 seconds of latency to simulate a slow network, runs tests to ensure the UI handles "Rage Clicks" correctly, and then shuts everything down.

name: 🧪 Chaos E2E Tests

on: [push]

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      # 1. Start the Proxy
      - name: 🚀 Start Debuggo Proxy
        id: start_proxy
        run: |
          RESPONSE=$(curl -sf -X POST https://chaos-proxy.debuggo.app/api/v1/sessions \
            -H "Authorization: Bearer ${{ secrets.DEBUGGO_API_KEY }}")

          # Keep the proxy credentials out of the job log
          echo "::add-mask::$(echo "$RESPONSE" | jq -r .auth)"

          # Extract and save details to ENV
          echo "PROXY_ID=$(echo "$RESPONSE" | jq -r .id)" >> "$GITHUB_ENV"
          echo "PROXY_HOST=$(echo "$RESPONSE" | jq -r .host)" >> "$GITHUB_ENV"
          echo "PROXY_PORT=$(echo "$RESPONSE" | jq -r .port)" >> "$GITHUB_ENV"
          echo "PROXY_AUTH=$(echo "$RESPONSE" | jq -r .auth)" >> "$GITHUB_ENV"

      # 2. Configure Chaos (The "Bad 3G" Simulation)
      - name: 💣 Configure Chaos Rules
        run: |
          curl -X PUT https://chaos-proxy.debuggo.app/api/v1/sessions/$PROXY_ID/rules \
            -H "Authorization: Bearer ${{ secrets.DEBUGGO_API_KEY }}" \
            -H "Content-Type: application/json" \
            -d '{
              "rules": [
                {
                  "url_pattern": "*/api/checkout",
                  "delay": 3000,
                  "error_code": null
                }
              ]
            }'

      # 3. Run Tests
      - name: 🧪 Run Playwright Tests
        run: |
          # Route traffic through the authenticated proxy
          export HTTPS_PROXY="http://$PROXY_AUTH@$PROXY_HOST:$PROXY_PORT"
          npx playwright test

      # 4. Cleanup (Always run this, even if tests fail)
      - name: 🧹 Cleanup
        if: always()
        run: |
          curl -X DELETE https://chaos-proxy.debuggo.app/api/v1/sessions/$PROXY_ID \
            -H "Authorization: Bearer ${{ secrets.DEBUGGO_API_KEY }}"

API Reference
Use these endpoints to integrate Chaos Proxy into your custom scripts.

Authentication
Authenticate all requests by including your API Key in the header. You can generate a key in your Dashboard Settings.

Authorization: Bearer dbg_ci_YOUR_KEY

Start Proxy Session
Creates a new isolated proxy container. Returns the host, port, and credentials.
Endpoint: POST /api/v1/sessions
Response:

{
  "id": "sess_abc123",
  "host": "proxy-us-east.debuggo.app",
  "port": 10245,
  "auth": "user:pass"
}

Configure Rules
Updates the chaos logic in real-time. You can change rules mid-test (e.g., test success first, then inject failure).
Endpoint: PUT /api/v1/sessions/{session_id}/rules
Body Example:

{
  "rules": [
    {
      "url_pattern": "*/api/v1/checkout",
      "failure_rate": 100,
      "error_code": 503,
      "delay": 0
    },
    {
      "url_pattern": "*/api/v1/search",
      "delay": 2000
    }
  ]
}

Download CA Certificate
Retrieves the Root CA certificate. Essential for automated setup of Android Emulators or iOS Simulators in CI.
Endpoint: GET /api/v1/certs/ca.pem
Usage:

curl -O https://chaos-proxy.debuggo.app/api/v1/certs/ca.pem
# Then install via adb

Stop Session
Stops the proxy and releases the port.
Endpoint: DELETE /api/v1/sessions/{session_id}
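For scripts that are not shell-based, the same endpoints can be wrapped in a few Python helpers. This is a minimal sketch that only builds the requests: the helper names are hypothetical, and `BASE_URL` is taken from the GitHub Actions example above, so adjust it if your account uses a different host.

```python
import json
import urllib.request
from typing import List, Optional

# Assumption: same base URL as in the workflow example above.
BASE_URL = "https://chaos-proxy.debuggo.app/api/v1"


def build_request(method: str, path: str, api_key: str, body: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated request; nothing is sent until urlopen() is called."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {api_key}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def create_session(api_key: str) -> urllib.request.Request:
    return build_request("POST", "/sessions", api_key)


def set_rules(api_key: str, session_id: str, rules: List[dict]) -> urllib.request.Request:
    return build_request("PUT", f"/sessions/{session_id}/rules", api_key, {"rules": rules})


def delete_session(api_key: str, session_id: str) -> urllib.request.Request:
    return build_request("DELETE", f"/sessions/{session_id}", api_key)


req = set_rules("dbg_ci_YOUR_KEY", "sess_abc123", [{"url_pattern": "*/api/v1/search", "delay": 2000}])
print(req.get_method(), req.full_url)
```

To actually call the API, pass each request to `urllib.request.urlopen(...)`, and put the `delete_session` call in a `finally` block so the proxy is released even when the tests fail.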

Why automate Chaos?

Catch Regressions in "Unhappy Paths": Developers often break error handling logic because they rarely see errors locally. Automating a 500 Error test ensures your "Something went wrong" screen never breaks.
Validate Idempotency: By injecting latency into your payment endpoints during CI, you can verify that your backend correctly handles duplicate requests (Rage Clicks) before they reach production.
Native Mobile Testing: Unlike Playwright’s built-in page.route (which only works in a browser context), Debuggo works at the system level. This allows you to test Native Android and iOS apps running in emulators within your CI pipeline.
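For the idempotency point, it helps to know what the backend is expected to do with a duplicate request. Below is a minimal in-memory sketch, assuming the client sends an idempotency key with every payment; `PaymentService` and its fields are hypothetical, and a production backend would typically enforce the same rule with a database constraint.

```python
from typing import Dict


class PaymentService:
    """Toy payment backend that charges at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, str] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key means a duplicate request (rage click or client retry):
        # return the stored receipt instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}-{amount_cents}"
        self._results[idempotency_key] = receipt
        return receipt


service = PaymentService()
first = service.charge("order-42", 1999)
second = service.charge("order-42", 1999)  # the user clicked "Pay" twice
print(first == second, service.charges_made)
```

A chaos test with injected latency then only has to assert that two clicks produce one charge.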

Ready to break your build (on purpose)? Get your API Key.

The "Spinner of Death": Why Localhost Latency is Lying to You

Ilya Ploskovitov — Thu, 18 Dec 2025 23:18:21 +0000

The "Localhost" Bias
We've all been there.

On your machine, the API responds in 5ms. The UI updates instantly. You click "Submit," the modal closes, and you move on to the next ticket. Status: Done. ✅

But on a user's 4G connection in a subway tunnel, that same API call takes 2 seconds.

Because you tested on localhost (Gigabit Fiber), you missed critical race conditions:

🖱️ The Double-Click Bug: The user clicks "Submit" twice because "nothing happened," charging their credit card twice.
🔄 The Infinite Spinner: The loader gets stuck forever because a packet was dropped.
🏎️ Race Conditions: Data arrives out of order, overwriting the user's input.
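The out-of-order case is the least obvious of the three, so here is a minimal sketch of one common guard: tag every request with a sequence number and ignore any response that is not the latest. `LatestWins` is a hypothetical name, and a real UI would implement the same idea in its data-fetching layer.

```python
class LatestWins:
    """Accept a response only if it belongs to the most recent request."""

    def __init__(self) -> None:
        self._latest_id = 0
        self.value = None

    def start_request(self) -> int:
        self._latest_id += 1
        return self._latest_id

    def apply_response(self, request_id: int, value) -> bool:
        if request_id != self._latest_id:
            return False  # stale response from an older request, drop it
        self.value = value
        return True


guard = LatestWins()
slow = guard.start_request()   # first search, delayed by the network
fast = guard.start_request()   # second search, answered first
guard.apply_response(fast, "results for 'playwright'")
guard.apply_response(slow, "results for 'play'")  # arrives late and is ignored
print(guard.value)
```

On localhost both responses arrive in order, so the missing guard never shows up until latency is added.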

Your app feels fast because you are cheating. 0ms latency is a lie.

The Wrong Solution: time.sleep()
I often see tests that look like this:

# ❌ Don't do this
page.click("#submit")
time.sleep(2) # Simulating "network lag"
expect(page.locator(".success")).to_be_visible()

Why this fails: `sleep()` just pauses the test execution script. The browser engine itself is still blazing fast. It doesn't simulate network queues, slow handshakes, or constrained bandwidth. You aren't testing the network; you're just making your test suite slower.

The Right Solution: Network Throttling (CDP)

To test this properly in automation, you need to talk directly to the browser engine. You need to tell Chrome: "Pretend you are on a terrible 50 KB/s connection."

We can do this using the Chrome DevTools Protocol (CDP) within Playwright. This forces the browser to handle packet delays and loading states exactly as a real user would experience.

The Code (Python + Playwright)
Here is how to inject a "Bad 3G" connection into your test:

from playwright.sync_api import Page, expect

def test_slow_network_handling(page: Page):
    # 1. Connect to Chrome DevTools Protocol (CDP)
    # This gives us low-level access to the browser (Chromium only)
    client = page.context.new_cdp_session(page)

    # 2. 🧨 CHAOS: Emulate "Bad 3G"
    # Latency: 2000ms (2 seconds)
    # Throughput: 50 KB/s (CDP expects bytes per second)
    client.send("Network.emulateNetworkConditions", {
        "offline": False,
        "latency": 2000,
        "downloadThroughput": 50 * 1024,
        "uploadThroughput": 50 * 1024
    })

    page.goto("https://myapp.com/search")

    # 3. Trigger the slow action
    page.fill("#search-box", "Playwright")
    page.click("#search-btn")

    # 4. Resilience Assertion

    # Check 1: Does the UI prevent double submission?
    expect(page.locator("#search-btn")).to_be_disabled()

    # Check 2: Does the user get immediate feedback?
    expect(page.locator(".loading-spinner")).to_be_visible()

Why this matters: This test proves your UI provides feedback. If a user clicks a button and waits 2 seconds with no visual feedback, they will assume the app is broken.

But wait, what about Mobile Apps? 📱
The script above is perfect for automated CI pipelines running Chrome. But CDP has a major limitation: it only talks to Chromium-based browsers, so it can't drive Safari on a physical iPhone or a native iOS or Android app.

If you are a Mobile Developer or manual QA, you can't "attach Playwright" to the phone in your hand to simulate a subway tunnel.

The Manual Alternative (System-Level Proxy)

To test latency on a real device without writing code, you need a System-Level Proxy that sits between your phone and the internet.

You can use desktop tools like Charles Proxy (if you enjoy configuring Java apps and firewalls), or you can use a cloud-based tool like Chaos Proxy (which I'm building).

It allows you to simulate "Subway Mode" (2s latency) on any device—iPhone, Android, or Laptop—just by connecting to a Wi-Fi proxy.

The Workflow:

Create a "Chaos Rule" (e.g., Latency = 2000ms).
Connect your phone to the proxy via QR code.
Watch your app struggle (and then fix it).

Summary

Stop trusting Localhost. It hides your worst bugs.
Automated: Use Playwright + CDP to inject latency in your E2E tests.
Manual/Mobile: Use a Chaos Proxy to test resilience on physical devices.

Happy testing! 🧪

If you found this useful, check out my previous post: Stop Testing Success. Kill the Database.

Stop Testing Success. Kill the Database. 🧨

Ilya Ploskovitov — Thu, 11 Dec 2025 10:00:44 +0000

Intro to Chaos Engineering for QA. Learn how to test resilience by injecting failures with Docker and Playwright.

We are obsessed with the "Happy Path".

In traditional QA, we verify that the application works when everything is perfect:

The network is stable.
The database responds in 5ms.
Third-party APIs are online.

But in production, nothing is perfect. Pods crash, networks lag, and databases lock up.

When these things happen, a standard Selenium/Playwright test just says: Failed. It doesn't tell you how the application failed. Did it show a graceful error message? Or did it crash with a white screen and a raw stack trace?

This is where Chaos Engineering comes in.

From QA to Resilience Engineering

Chaos Engineering isn't just for Site Reliability Engineers (SREs). As modern QAs, we need to stop asking "Does it work?" and start asking "What happens when it breaks?"

Today, I’ll show you how to write a Chaos Test using Python, Playwright, and the Docker SDK.

The Goal

We aren't going to wait for the database to fail. We are going to kill it intentionally in the middle of a test and verify that our frontend handles it gracefully.

The Stack

Python (Test logic)
Playwright (UI Interaction)
Docker SDK (The Chaos Injector)

The Code 🐍

Here is the complete script. It connects to your local Docker daemon, finds the Postgres container, and strangles it while the user is trying to work.

import docker
from playwright.sync_api import Page, expect

def test_database_failure_resilience(page: Page):
    # 1. Setup: Connect to Docker
    # We use the Docker SDK for Python to control the infrastructure
    client = docker.from_env()

    # Target your specific database container
    try:
        db_container = client.containers.get("postgres-prod")
    except docker.errors.NotFound:
        raise RuntimeError("Container 'postgres-prod' not found! Is the database running?")

    # 2. Happy Path: Verify the app loads normally
    print("✅ Step 1: Loading Dashboard...")
    page.goto("http://localhost:3000/dashboard")
    expect(page.locator(".user-balance")).to_be_visible()

    # 🧨 CHAOS TIME: Kill the Database
    print("🔥 Step 2: Injecting Chaos (Stopping DB)...")
    db_container.stop()

    try:
        # 3. Resilience Assertion
        # The app should NOT show a white screen or crash.
        # It SHOULD show a friendly "Connection lost" toast or retry button.
        print("👀 Step 3: Verifying graceful degradation...")

        # Trigger an action that requires the DB
        page.reload()

        # Assert UI handles the error
        expect(page.locator(".error-toast")).to_contain_text("Connection lost")
        expect(page.locator(".retry-button")).to_be_visible()
    finally:
        # 🩹 RECOVERY: Bring the Database back, even if an assertion above failed
        print("🩹 Step 4: Healing the infrastructure...")
        db_container.start()

    # Trigger a manual retry now that the database is back
    page.locator(".retry-button").click()

    # 4. Self-Healing Assertion
    # The app should recover without requiring a full page refresh
    expect(page.locator(".user-balance")).to_be_visible()
    print("✅ Test Passed: System is resilient.")

Why this matters

If you run this test and your application shows a 500 Server Error page, you have found a bug. Not a functional bug, but an architectural bug.

By adding "Chaos Tests" to your regression suite, you guarantee that your product doesn't just work—it survives.

👋 Want more Chaos?

I write The 5-Minute QA—a daily newsletter for Senior QAs and SDETs. Every morning, I send one actionable tip on Chaos Engineering.

👉 Subscribe here to get the tips in your inbox
