DEV Community

Pratik Mahalle

Chaos Engineering for Security: Breaking Systems To Strengthen Defenses

I'm excited to be speaking about this topic at OpenSSF Community Day India, and wanted to share some insights on this fascinating intersection of chaos engineering and security.

When most people hear "chaos engineering," they immediately think of Netflix's famous Chaos Monkey randomly terminating servers to test system resilience. But what if we took that same philosophy and applied it to security? What if, instead of waiting for attackers to find our vulnerabilities, we intentionally broke our own systems to discover weaknesses first?

Welcome to Security Chaos Engineering – a practice that's transforming how we think about proactive security testing.

Security Testing Reality Check

Let's start with some uncomfortable truths about modern security:

  • 277 days: the average time to identify and contain a data breach, according to IBM's Cost of a Data Breach Report
  • Security testing typically happens too late in the development cycle
  • Most assumptions about security controls go completely untested
  • Teams deploy defenses and hope they work, only discovering gaps during real incidents

I've witnessed this pattern repeatedly in organizations I've worked with. Teams spend months perfecting their security configurations, only to discover during a real incident that their assumptions were wrong. That firewall rule they thought was bulletproof? It fails under load. The network segmentation they relied on? It breaks when pods start communicating unexpectedly.

Here's the key question that changed how I think about security: What if we could break things safely and strengthen our defenses before attackers find our weaknesses?

Enter Security Chaos Engineering

This is where Security Chaos Engineering comes in - a proactive approach that flips the traditional security script:

Traditional Security Approach:
Wait for attacks → Respond to incidents → Fix what broke

Security Chaos Engineering Approach:
Inject controlled failures → Test security responses → Strengthen defenses proactively

Definition: Security Chaos Engineering is the practice of testing security controls and incident response procedures through controlled failure injection and attack simulation.

Goal: Find vulnerabilities and weaknesses before real attackers do, while building confidence in our defensive capabilities.

The core principle remains simple: If we can break it in a controlled environment, we can fix it before an attacker exploits it.

Tools of the Trade

Before we dive into practical examples, let's talk about the tools that make Security Chaos Engineering possible. Having the right toolkit is crucial for effective security chaos experiments.

Core Chaos Engineering Platforms

ChaosMesh: My go-to choice for Kubernetes-native environments. It offers excellent security experiment support with intuitive YAML configurations and a solid web UI for monitoring experiments.

LitmusChaos: Perfect for complex, multi-step chaos workflows. It has extensive experiment templates and integrates well with CI/CD pipelines. The community is active and constantly adding new security-focused experiments.

Security Monitoring and Detection

KubeArmor: Runtime security monitoring that integrates beautifully with chaos experiments. It can detect policy violations and unusual behavior patterns during your tests.

Falco: Essential for detecting suspicious activities during chaos experiments. Its rule-based detection engine helps identify when your experiments uncover real security issues.

Open Policy Agent (OPA): Useful for testing policy enforcement under various failure conditions.

Network and Infrastructure Tools

Istio Service Mesh: Provides detailed network observability during chaos experiments, helping you understand how security policies behave under stress.

Weave Scope: Excellent for visualizing network topology and understanding blast radius during security chaos experiments.

Why Tool Selection Matters

Each tool serves a specific purpose in the security chaos engineering workflow:

  • Chaos platforms inject the failures
  • Monitoring tools observe the impact
  • Security tools detect when defenses fail
  • Network tools help understand the scope of impact

The key is creating a cohesive toolchain where these components work together seamlessly.

Real-World Scenarios: Learning by Breaking

Now that we have our toolkit ready, let me walk you through some practical scenarios where Security Chaos Engineering has proven invaluable.

Scenario 1: Testing Kubernetes Pod Compromise

Imagine you have a Kubernetes cluster with what you believe is proper network segmentation. Your pods are isolated, your network policies are in place, and everything looks secure on paper.

Here's how we can test this assumption:

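# Using ChaosMesh to approximate a pod compromise by killing the targeted pod.
# Chaos Mesh 2.x schedules recurring experiments with the Schedule kind.
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: pod-compromise-test
spec:
  schedule: "*/10 * * * *"  # every 10 minutes
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - production
      labelSelectors:
        app: web-frontend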

But that's just the beginning. What we really want to test is what happens after compromise:

  1. Lateral Movement Testing: Once we simulate a pod compromise, can the "attacker" move laterally to other services?
  2. Data Exfiltration Simulation: Can sensitive data be accessed from the compromised pod?
  3. Privilege Escalation: Can the compromised pod gain higher privileges?
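
To make the lateral movement question concrete, here is a minimal sketch of a reachability probe you could run from inside the test pod. It only checks whether a TCP connection succeeds, which is a rough proxy for "the network policy let this through." The service names and ports in FORBIDDEN_TARGETS are hypothetical placeholders, not part of any real cluster.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unresolvable name: all count as "not reachable".
        return False


def unexpected_paths(targets, probe=can_connect):
    """Return the (host, port) pairs that were reachable but should not be."""
    return [(host, port) for host, port in targets if probe(host, port)]


# Hypothetical services that the web-frontend pod should NOT be able to reach.
FORBIDDEN_TARGETS = [
    ("payments-db.production.svc.cluster.local", 5432),
    ("admin-api.production.svc.cluster.local", 8443),
]
```

Calling unexpected_paths(FORBIDDEN_TARGETS) from the pod under test returns an empty list when segmentation holds. Anything else is a path an attacker could use.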

Scenario 2: Firewall Stress Testing

Your firewall rules work perfectly under normal conditions, but what happens when your system is under stress? Here's a test I regularly run:

# Using LitmusChaos for network chaos experiments
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: network-loss-test
spec:
  definition:
    scope: Namespaced
    permissions:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["create","delete","get","list"]
    image: "litmuschaos/go-runner:latest"
    args:
      - -c
      - ./experiments -name pod-network-loss  # experiment name as shipped in go-runner
    command:
      - /bin/bash
    env:
      - name: NETWORK_PACKET_LOSS_PERCENTAGE
        value: '50'   # percent of packets dropped
      - name: TOTAL_CHAOS_DURATION
        value: '300'  # seconds

This experiment simulates 50% packet loss for 5 minutes. During this chaos, we test:

  • Do firewall rules still apply correctly?
  • Are there any bypass opportunities when the network is degraded?
  • How does the system behave when connections are intermittent?
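
One way to answer these questions is to keep probing a connection that should be blocked for the whole chaos window and count how often it slips through. The sketch below assumes you supply your own probe callable (for example, a TCP connect to a port the firewall is meant to deny). The function name and the returned keys are my own illustration, not part of any tool.

```python
import time
from typing import Callable


def watch_deny_rule(
    probe: Callable[[], bool],
    duration_s: float,
    interval_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Probe repeatedly for duration_s and count should-be-blocked successes."""
    attempts = 0
    leaks = 0
    deadline = clock() + duration_s
    while clock() < deadline:
        attempts += 1
        if probe():  # True means the "blocked" connection got through
            leaks += 1
        sleep(interval_s)
    return {"attempts": attempts, "leaks": leaks, "rule_held": leaks == 0}
```

Running it with duration_s=300 lines up with the 5-minute experiment above. The sleep and clock parameters exist only so the loop can be exercised without waiting in real time.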

Demo: Setting Up Your First Security Chaos Experiment

Let me show you how to set up a basic security chaos engineering experiment using open-source tools.

Step 1: Environment Setup

First, let's set up a basic Kubernetes cluster with some security tools:

# Install ChaosMesh
curl -sSL https://mirrors.chaos-mesh.org/v2.6.2/install.sh | bash

# Install KubeArmor for runtime security
kubectl apply -f https://raw.githubusercontent.com/kubearmor/KubeArmor/main/deployments/kubearmor.yaml

# Verify installations
kubectl get pods -n chaos-mesh
kubectl get pods -n kubearmor

Step 2: Create a Vulnerable Application

Let's deploy a simple web application with intentional security gaps:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vulnerable-app
  labels:
    app: vulnerable-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vulnerable-app
  template:
    metadata:
      labels:
        app: vulnerable-app
    spec:
      containers:
      - name: web
        image: nginx:latest
        ports:
        - containerPort: 80
        securityContext:
          runAsUser: 0  # Running as root - intentionally vulnerable
          capabilities:
            add: ["NET_ADMIN"]  # Excessive privileges

Step 3: Create Security Chaos Experiments

Now, let's create experiments to test our assumptions:

Experiment 1: Network Degradation Under Attack Simulation

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: ddos-simulation
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: vulnerable-app
  delay:
    latency: "10ms"
    correlation: "100"
    jitter: "0ms"
  duration: "5m"

Experiment 2: Testing Security Controls During Resource Exhaustion

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: memory-stress-security-test
spec:
  engineState: 'active'  # 'active' runs the experiment; 'stop' aborts it
  appinfo:
    appns: default
    applabel: 'app=vulnerable-app'
    appkind: 'deployment'
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-memory-hog
    spec:
      components:
        env:
        - name: MEMORY_CONSUMPTION
          value: '500'  # megabytes
        - name: TOTAL_CHAOS_DURATION
          value: '300'  # seconds

Step 4: Monitor and Observe

While the chaos experiments run, we monitor several security metrics:

# Monitor KubeArmor alerts
kubectl logs -n kubearmor -l kubearmor-app=kubearmor --tail=100 -f

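# Check for events that may point to blocked privilege escalation attempts
# (comma-separated field selectors are ANDed, so query each reason separately)
kubectl get events --field-selector reason=FailedMount
kubectl get events --field-selector reason=SecurityContextDeny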

# Monitor network policies
kubectl describe networkpolicies

What We Learn from Breaking Things

Running these experiments consistently reveals patterns that traditional security testing misses:

Discovery 1: Security Controls Fail Under Load

In one experiment, I discovered that our API rate limiting completely broke down when pods were under memory pressure. The security control was there, but it became ineffective when the system was stressed.
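
A simple way to check for this kind of failure is to fire the same request burst before and during the experiment, then compare how many requests were actually served. The sketch below works on a list of HTTP status codes you have already collected. The sample numbers are invented for illustration and are not measurements from the experiment described above.

```python
def accepted_count(status_codes):
    """Count requests that were actually served (2xx responses)."""
    return sum(1 for code in status_codes if 200 <= code < 300)


def rate_limit_held(status_codes, limit):
    """True if the burst never got more than `limit` requests served."""
    return accepted_count(status_codes) <= limit


# Illustrative, made-up bursts of 50 requests against a limit of 10.
baseline = [200] * 10 + [429] * 40
under_pressure = [200] * 35 + [429] * 10 + [503] * 5

print(rate_limit_held(baseline, limit=10))        # True
print(rate_limit_held(under_pressure, limit=10))  # False
```

Counting only 2xx responses keeps 5xx errors from being mistaken for requests the limiter let through. Whether that is the right definition depends on how your limiter fails.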

Discovery 2: Network Policies Have Edge Cases

During a network partition experiment, we found that certain pod-to-pod communications bypassed our network policies when DNS resolution was flaky. This created an unexpected attack vector.

Discovery 3: Monitoring Blind Spots

Chaos experiments revealed that our security monitoring had significant blind spots during high-CPU usage periods. Critical security events were being dropped or delayed.
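
Blind spots like this are easier to quantify if you inject known marker events during the experiment and then reconcile them against what the monitoring pipeline reported. Here is a minimal sketch of that reconciliation, assuming you can export both sides as a mapping from event id to timestamp in seconds. The Coverage type and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Coverage:
    dropped: List[str]  # injected but never detected
    delayed: List[str]  # detected, but later than the allowed delay


def detection_coverage(
    injected: Dict[str, float],
    detected: Dict[str, float],
    max_delay_s: float,
) -> Coverage:
    """Compare injected marker events with the alerts that were observed."""
    dropped = sorted(eid for eid in injected if eid not in detected)
    delayed = sorted(
        eid
        for eid, sent_at in injected.items()
        if eid in detected and detected[eid] - sent_at > max_delay_s
    )
    return Coverage(dropped=dropped, delayed=delayed)
```

Running the same reconciliation once at idle and once under CPU stress shows whether the pipeline degrades under load or was already lossy.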

Best Practices for Security Chaos Engineering

Based on my experience implementing this across different organizations, here are key practices that work:

Start Small: Begin with non-production environments and simple experiments. Don't try to simulate a full-scale attack on day one.

Hypothesis-Driven: Each experiment should test a specific security assumption. "What happens to our authentication system when the database is slow?" is better than "let's see what breaks."
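
To keep experiments hypothesis-driven, it can help to write the assumption down as code: a statement plus a check that returns True while the control holds. The sketch below is one possible shape for that, with a rollback that always runs. The class, function, and return strings are my own illustration, and inject and rollback stand in for whatever your chaos tool does.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SecurityHypothesis:
    statement: str
    check: Callable[[], bool]  # returns True while the security control holds


def run_experiment(
    hypothesis: SecurityHypothesis,
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Verify steady state, inject the failure, re-check, and always roll back."""
    if not hypothesis.check():
        return "aborted: steady state not met before injection"
    inject()
    try:
        held = hypothesis.check()
    finally:
        rollback()  # runs even if the check raises
    return "held" if held else "falsified"
```

For the example above, the statement would be "authentication still rejects invalid tokens when the database is slow," and check would attempt a login with a bad token. A "falsified" result is the useful outcome, because it gives you a specific control to fix.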

Automate Everything: Manual chaos is not sustainable. Use tools like ChaosMesh and LitmusChaos to create repeatable, scheduled experiments.

Learn and Iterate: Each experiment should lead to concrete improvements in your security posture. If you're not fixing things you discover, you're not doing it right.

Build Observability First: You can't understand what breaks if you can't see it breaking. Ensure comprehensive monitoring and logging before running experiments.

Collaborate Across Teams: Security chaos engineering works best when security, development, and operations teams work together. Each brings unique perspectives on what might break.

Toolkit Summary

Here's my current toolkit for Security Chaos Engineering:

  • ChaosMesh: Excellent for Kubernetes-native chaos experiments with good security experiment support
  • LitmusChaos: Great for complex, multi-step chaos workflows
  • KubeArmor: Runtime security monitoring that integrates well with chaos experiments
  • Falco: For detecting unusual behavior during experiments
  • Gremlin: Commercial option with advanced security testing features

The Mindset Shift

The biggest challenge isn't technical – it's cultural. Organizations need to shift from "don't break production" to "break production safely and learn from it."

This means:

  • Embracing failure as a learning opportunity
  • Questioning assumptions about security controls regularly
  • Building systems that are resilient to both accidental and malicious failures
  • Investing time in proactive testing rather than just reactive incident response

Looking Forward

Security Chaos Engineering is still evolving, but I'm convinced it represents the future of proactive security. As systems become more complex and distributed, our security testing needs to evolve too.

The goal isn't to replace traditional security practices – it's to complement them. Penetration testing, vulnerability scanning, and compliance audits all have their place. But they're not enough anymore.

If you're curious to learn more about this approach, I'll be diving deeper into practical implementations and advanced scenarios at the OpenSSF Community Day India. We'll do live demos, explore real-world case studies, and hopefully convince a few more people that breaking things on purpose is actually a pretty good idea.


What's your experience with chaos engineering? Have you tried applying it to security? I'd love to hear your thoughts and experiences. Feel free to reach out – the security community gets stronger when we share our learnings, especially our failures.
