Elad Hirsch

Building Resilient Systems on AWS - Chaos Engineering with Amazon EKS and AWS Fault Injection Simulator

How to prove your Kubernetes platform can handle failure—before your users find out it can't


The Uncomfortable Truth About Platform Stability

Here's a scenario every platform engineer dreads — your production environment experiences a critical incident. Users can't log in to your SaaS product. Your on-call team rushes to respond—only to discover that the badge readers at your office rely on the same network infrastructure that just went down. The people who are supposed to fix the problem can't even get inside the building.

Sound far-fetched? It happened to Facebook in 2021. A configuration change during routine maintenance withdrew the BGP routes to its data centers, cutting them off from the internet for roughly six hours. No DNS meant no Facebook—and their on-site card readers went dark too.


The lesson isn't about BGP misconfigurations. It's about a fundamental shift in how we think about system reliability — we must stop assuming our systems are resilient and start proving they are.


From "Prevent Failure" to "Embrace Failure"

For years, the platform engineering playbook was straightforward — maximize uptime, add redundancy, and when something breaks, write a test case so it never happens again. We built increasingly complex architectures—dozens of microservices, heterogeneous storage layers, multiple cloud providers, mixed communication patterns—and somehow convinced ourselves that this complexity equaled robustness.

It doesn't.

Modern distributed systems are built on eight dangerous assumptions known as the Fallacies of Distributed Computing:

  1. The network is reliable
  2. Latency is zero
  3. Bandwidth is unlimited
  4. The network is secure
  5. Topology doesn't change
  6. There is one administrator
  7. Transport cost is zero
  8. The network is homogeneous

Every one of these assumptions will eventually fail in production. The question isn't if your system will experience failure—it's when, and more importantly, will you be ready?

The Network Is Secure — A Dangerous Assumption


As AWS CTO Werner Vogels famously said — "Dance like nobody's watching, encrypt like everyone is." This mantra captures a critical truth about distributed systems security. Vogels has repeatedly emphasized the importance of safeguarding encryption keys, noting that "the key is the only tool that ensures you're the only one with access to your data."

In cloud-native architectures, assuming the network is secure leads to devastating breaches. Zero-trust principles aren't optional—they're essential.


Transport Cost Is Zero — The Hidden Budget Killer

Perhaps the most expensive fallacy to ignore is assuming transport cost is zero. In AWS, Data Transfer Out (DTO) costs can quickly become one of the largest line items on your bill if not properly managed.

Consider a typical microservices architecture — services communicate across availability zones, data flows between regions for disaster recovery, and APIs serve traffic globally. Each of these transfers incurs costs:

  • Inter-AZ traffic — $0.01/GB in each direction
  • Inter-region traffic — $0.02-$0.09/GB depending on regions
  • Internet egress — $0.09/GB for the first 10TB/month


A service handling 100TB of monthly cross-AZ traffic could face $2,000/month in transfer costs alone—before any compute or storage charges. This is why architecture decisions like service placement, caching strategies, and data locality matter enormously in AWS.
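
(The arithmetic: 100 TB is roughly 100,000 GB, and cross-AZ traffic is billed at $0.01/GB in each direction, so 100,000 GB × $0.02/GB comes to about $2,000 per month.)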


The AWS Chaos Engineering Stack

AWS provides a powerful combination of services for implementing chaos engineering at scale:

  • Amazon EKS — Managed Kubernetes that provides the foundation for container orchestration with built-in resilience features
  • AWS Fault Injection Simulator (FIS) — A fully managed service for running chaos experiments against AWS resources
  • Chaos Mesh — A CNCF project that extends chaos capabilities specifically for Kubernetes workloads

What makes this combination particularly powerful is the deep integration between FIS and Kubernetes. AWS FIS can inject Kubernetes custom resources directly into your EKS clusters, allowing you to orchestrate Chaos Mesh experiments through a unified AWS control plane.


Why Amazon EKS Is the Foundation

Before we inject chaos, we need a platform that can actually respond to failure gracefully. Amazon EKS provides several built-in resilience mechanisms.

Self-Healing Through Controllers

Kubernetes controllers continuously reconcile the actual state of your cluster with the desired state. When a pod crashes, the Deployment controller notices the discrepancy and schedules a replacement. This reconciliation loop is the heartbeat of Kubernetes resilience.
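
You can watch this loop in action with a couple of kubectl commands. A minimal sketch, assuming a Deployment labeled app=payment-api running in an application namespace (both hypothetical here):

# Pick one pod backing the Deployment and delete it
kubectl get pods -n application -l app=payment-api
kubectl delete pod -n application payment-api-7b68fd5f58-xk9mn   # example pod name

# Watch a replacement pod get scheduled within seconds
kubectl get pods -n application -l app=payment-api -w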

Topology Awareness

EKS allows you to distribute pods across multiple Availability Zones within an AWS region. By using topology spread constraints, you can ensure that a single AZ failure doesn't take down your entire application:

topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: payment-api
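
After a rollout, you can sanity-check the spread (again assuming the payment-api pods live in an application namespace):

# The NODE column shows where each pod landed; cross-reference with
# kubectl get nodes -L topology.kubernetes.io/zone to confirm they span AZs
kubectl get pods -n application -l app=payment-api -o wide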

Pod Disruption Budgets

PDBs let you specify the minimum number of pods that must remain available during voluntary disruptions. This ensures that even during chaos experiments or cluster upgrades, your service maintains capacity:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: payment-api
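
Once applied, you can confirm the budget is being tracked (the application namespace is an assumption):

kubectl get pdb -n application payment-api-pdb
# NAME              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# payment-api-pdb   2               N/A               1                     2m
# ALLOWED DISRUPTIONS reflects currently healthy replicas (here, 3 replicas minus minAvailable 2)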

These mechanisms form the baseline. Chaos engineering tests whether they actually work under real failure conditions.


Setting Up the Chaos Engineering Environment

Here is a reference setup for chaos engineering on EKS. The from-kubernetes-to-chaos-mesh repository demonstrates how to integrate AWS FIS with Chaos Mesh.

Prerequisites

Before diving in, ensure you have:

  • AWS CLI configured with appropriate credentials
  • An existing Amazon EKS cluster (version 1.25+)
  • kubectl for Kubernetes cluster management
  • Helm for deploying Chaos Mesh

Connect to the EKS Cluster

First, configure kubectl to communicate with your Amazon EKS cluster. The AWS CLI makes this straightforward:

# Update kubeconfig for your EKS cluster
aws eks update-kubeconfig \
  --region us-east-1 \
  --name your-eks-cluster-name

# Verify the connection
kubectl get nodes

You should see your cluster nodes listed with Ready status:

NAME                              STATUS   ROLES    AGE   VERSION
ip-10-0-1-123.ec2.internal        Ready    <none>   45d   v1.31.2-eks-5678abc
ip-10-0-2-456.ec2.internal        Ready    <none>   45d   v1.31.2-eks-5678abc
ip-10-0-3-789.ec2.internal        Ready    <none>   45d   v1.31.2-eks-5678abc

Verify your cluster is spread across multiple Availability Zones — this is critical for meaningful resilience testing:

# Check node distribution across AZs
kubectl get nodes -L topology.kubernetes.io/zone

# Expected output shows nodes in different AZs:
# ip-10-0-1-123   Ready   us-east-1a
# ip-10-0-2-456   Ready   us-east-1b
# ip-10-0-3-789   Ready   us-east-1c

Confirm you have the necessary permissions by checking cluster info:

kubectl cluster-info
# Kubernetes control plane is running at https://ABCD1234.gr7.us-east-1.eks.amazonaws.com

Installing Chaos Mesh

Chaos Mesh deploys as a set of controllers and custom resource definitions in your cluster:

helm repo add chaos-mesh https://charts.chaos-mesh.org

helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock

The key components include:

  • Chaos Controller Manager — Orchestrates chaos experiments
  • Chaos Daemon — Runs on each node to execute failure injections
  • Dashboard — Web UI for managing experiments (optional)
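
Before moving on, it's worth confirming these components are healthy (pod name suffixes will differ):

kubectl get pods -n chaos-mesh
# chaos-controller-manager-xxxxx   Running
# chaos-daemon-xxxxx               Running   (one per node, as a DaemonSet)
# chaos-dashboard-xxxxx            Running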

AWS Fault Injection Simulator — The Control Plane for Chaos

AWS FIS is what ties everything together. Rather than running chaos experiments in isolation, FIS provides:

  • Centralized experiment management — Define, run, and monitor experiments from the AWS Console or API
  • Safety controls — Stop conditions that automatically halt experiments if metrics breach thresholds
  • Audit logging — Complete visibility into what experiments ran, when, and their outcomes
  • IAM integration — Fine-grained permissions for who can run which experiments

Creating an IAM Role for FIS

FIS needs permissions to interact with your EKS cluster:

# Create the trust policy
cat << EOF > fis-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "fis.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

# Create the role
aws iam create-role \
  --role-name fis-chaos-experiment-role \
  --assume-role-policy-document file://fis-trust-policy.json

# Attach necessary policies
aws iam attach-role-policy \
  --role-name fis-chaos-experiment-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy

Configuring EKS for FIS Integration

FIS needs to authenticate to your EKS cluster. Update the aws-auth ConfigMap to allow the FIS role:

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::123456789012:role/fis-chaos-experiment-role
      username: fis-user
      groups:
        - system:masters
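
A word of caution: your cluster's aws-auth ConfigMap almost certainly already contains mapRoles entries, such as the node instance role. Append the FIS role to the existing list (for example via kubectl edit -n kube-system configmap/aws-auth) rather than applying a ConfigMap that contains only this entry; otherwise your worker nodes will lose access to the cluster. Mapping FIS to system:masters is the quickest path for a demo; for production, consider a dedicated RBAC group with only the permissions the experiments need.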

Real-World Chaos Experiments with Chaos Mesh

Let's walk through practical experiments that test different failure modes using Chaos Mesh orchestrated through AWS FIS.

Experiment 1 — Network Fault Injection

Network issues are among the most common causes of distributed system failures. This experiment simulates complete network isolation for a target service:

First, identify your target pods:

kubectl get pod -n application --show-labels | grep order-service
# order-service-7b68fd5f58-xk9mn   1/1   Running   app.kubernetes.io/name=order-service

Create the FIS experiment template:

aws fis create-experiment-template \
  --cli-input-json '{
    "description": "Chaos Mesh network partition test",
    "targets": {
      "EKS-Cluster": {
        "resourceType": "aws:eks:cluster",
        "resourceArns": [
          "arn:aws:eks:us-east-1:123456789012:cluster/resilience-cluster"
        ],
        "selectionMode": "ALL"
      }
    },
    "actions": {
      "inject-network-partition": {
        "actionId": "aws:eks:inject-kubernetes-custom-resource",
        "description": "Simulate network partition on order-service",
        "parameters": {
          "kubernetesApiVersion": "chaos-mesh.org/v1alpha1",
          "kubernetesKind": "NetworkChaos",
          "kubernetesNamespace": "chaos-mesh",
          "kubernetesSpec": "{\"action\":\"partition\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"application\"],\"labelSelectors\":{\"app.kubernetes.io/name\":\"order-service\"}},\"direction\":\"both\"}",
          "maxDuration": "PT2M"
        },
        "targets": {
          "Cluster": "EKS-Cluster"
        }
      }
    },
    "stopConditions": [{ "source": "none" }],
    "roleArn": "arn:aws:iam::123456789012:role/fis-chaos-experiment-role",
    "tags": {
      "Purpose": "chaos-engineering",
      "Team": "platform"
    },
    "logConfiguration": {
      "cloudWatchLogsConfiguration": {
        "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:fis-experiments:*"
      },
      "logSchemaVersion": 2
    }
  }'
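
For readability, the escaped kubernetesSpec above corresponds to the following Chaos Mesh NetworkChaos resource (the metadata here is illustrative; FIS creates the object for you and removes it after maxDuration elapses):

apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-service-partition   # illustrative name; FIS assigns its own
  namespace: chaos-mesh
spec:
  action: partition
  mode: all
  direction: both
  selector:
    namespaces:
      - application
    labelSelectors:
      app.kubernetes.io/name: order-service

While the experiment runs, you can watch the injected resource with kubectl get networkchaos -n chaos-mesh.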

During the experiment, you can observe the impact:

# Before experiment
curl http://order-service.application:8080/health -v
# Response: HTTP/1.1 200 OK

# During experiment  
curl http://order-service.application:8080/health -v
# Response: curl: (7) Failed to connect - Connection refused

Experiment 2 — Container Kill with Chaos Mesh

This experiment tests your application's ability to recover from container failures:

aws fis create-experiment-template \
  --cli-input-json '{
    "description": "Chaos Mesh container termination test",
    "targets": {
      "EKS-Cluster": {
        "resourceType": "aws:eks:cluster",
        "resourceArns": [
          "arn:aws:eks:us-east-1:123456789012:cluster/resilience-cluster"
        ],
        "selectionMode": "ALL"
      }
    },
    "actions": {
      "terminate-container": {
        "actionId": "aws:eks:inject-kubernetes-custom-resource",
        "description": "Kill payment-service container",
        "parameters": {
          "kubernetesApiVersion": "chaos-mesh.org/v1alpha1",
          "kubernetesKind": "PodChaos",
          "kubernetesNamespace": "chaos-mesh",
          "kubernetesSpec": "{\"action\":\"container-kill\",\"mode\":\"one\",\"containerNames\":[\"payment-service\"],\"selector\":{\"namespaces\":[\"application\"],\"labelSelectors\":{\"app.kubernetes.io/name\":\"payment-service\"}}}",
          "maxDuration": "PT1M"
        },
        "targets": {
          "Cluster": "EKS-Cluster"
        }
      }
    },
    "stopConditions": [{ "source": "none" }],
    "roleArn": "arn:aws:iam::123456789012:role/fis-chaos-experiment-role",
    "tags": {
      "Purpose": "chaos-engineering"
    }
  }'
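
The create call returns the experiment template ID (the IDs below are placeholders). If you need to look it up later:

aws fis list-experiment-templates \
  --query 'experimentTemplates[].[id,description]' \
  --output table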

Run the experiment:

aws fis start-experiment --experiment-template-id EXTabc123def456

Monitor experiment status:

aws fis get-experiment --id EXPxyz789abc123 | jq '.experiment.state'
# Output: { "status": "completed" }

Verify the container restart:

kubectl get pod -n application | grep payment-service
# payment-service-5d8f9c7b6a-m2k9p   1/1   Running   1 (3m12s ago)   8m45s

kubectl describe pod -n application payment-service-5d8f9c7b6a-m2k9p | grep -A5 "Events:"
# Events:
#   Normal  Pulled   3m15s (x2 over 8m48s)  kubelet  Container image already present
#   Normal  Created  3m15s (x2 over 8m48s)  kubelet  Created container payment-service
#   Normal  Started  3m14s (x2 over 8m47s)  kubelet  Started container payment-service

The restart count of 1 confirms the chaos injection worked, and the Running status confirms Kubernetes successfully recovered the pod.


Measuring Success — What to Monitor

Chaos experiments are only valuable if you can observe their impact. Key metrics to track:

Application Metrics

  • Request latency (p50, p95, p99)
  • Error rates and HTTP status codes
  • Request throughput and queue depths

Kubernetes Metrics

  • Pod restart counts
  • Container CPU/memory during recovery
  • Time to pod ready state

AWS Metrics

  • EKS control plane API latency
  • Node health status across AZs
  • Application Load Balancer healthy target counts

AWS CloudWatch Container Insights provides much of this automatically for EKS clusters. For deeper application-level observability, consider integrating with AWS X-Ray for distributed tracing.
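
If Container Insights isn't enabled yet, one quick route on recent EKS versions is the CloudWatch Observability add-on (a sketch; the add-on also needs CloudWatch permissions on the node role or an IRSA role):

aws eks create-addon \
  --cluster-name your-eks-cluster-name \
  --addon-name amazon-cloudwatch-observability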


Implementing Safety Guardrails

Chaos engineering isn't about breaking things carelessly. AWS FIS provides stop conditions that automatically halt experiments when things go wrong:

{
  "stopConditions": [
    {
      "source": "aws:cloudwatch:alarm",
      "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServiceErrorRateHigh"
    }
  ]
}
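
The alarm referenced above is an ordinary CloudWatch alarm. A minimal sketch, assuming your service publishes a custom ErrorCount metric in a hypothetical ECommerceApp namespace:

# If this alarm enters ALARM state mid-experiment, FIS halts the experiment
aws cloudwatch put-metric-alarm \
  --alarm-name ServiceErrorRateHigh \
  --namespace ECommerceApp \
  --metric-name ErrorCount \
  --statistic Sum \
  --period 60 \
  --evaluation-periods 1 \
  --threshold 50 \
  --comparison-operator GreaterThanThreshold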

Let's see how the complete experiment flow works:

(Diagram: experiment complete flow)


Best practices for safe chaos experiments:

  1. Start in non-production environments — Validate experiments in staging before production
  2. Define clear rollback procedures — Know how to quickly restore normal operation
  3. Use blast radius controls — Target specific pods/services rather than entire clusters
  4. Run during business hours — Have engineers available to respond if needed
  5. Communicate with stakeholders — Ensure relevant teams know experiments are planned

Breaking Things on Purpose


The journey from reactive incident response to proactive resilience engineering represents a fundamental shift in how we think about system reliability.

From Celebrating 99.9% Uptime to Fearing It

For years, teams celebrated high uptime numbers as proof of system health. But 99.9% availability still means 8.76 hours of downtime per year—and those hours always seem to happen at the worst possible moment. The realization sets in — we don't actually know why our systems stay up, which means we don't know what will bring them down.

The Shift from "Prevent Failure" to "Embrace Failure"

Traditional engineering tries to eliminate all possible failure modes. Modern resilience engineering accepts that failure is inevitable and focuses on minimizing impact. This isn't pessimism—it's realism. Complex distributed systems have emergent behaviors that no amount of unit testing can predict.

From Netflix's Chaos Monkey to Chaos Engineering

Netflix pioneered this approach with Chaos Monkey in 2011—a tool that randomly terminated EC2 instances in production. The idea seemed radical — why would you intentionally break your own systems? The answer became clear — because you'd rather discover weaknesses on your terms, during business hours, with engineers ready to respond, than at 3 AM during peak traffic.

Today, chaos engineering has evolved far beyond random instance termination. Tools like Chaos Mesh enable sophisticated experiments — network partitions, DNS failures, clock skew, JVM faults, and more. AWS FIS brings this into the enterprise with centralized management, safety controls, and full audit trails.

Recovery Isn't the Only Goal

The most important outcome of chaos experiments isn't proving your system can recover—it's what you learn in the process. Each experiment reveals something about your architecture:

  • How does the system behave under partial failure?
  • Do circuit breakers trigger at the right thresholds?
  • Are timeout values appropriate?
  • Do health checks accurately reflect service health?
  • How long does recovery actually take?

This learning feeds back into system improvements, creating a virtuous cycle of increasing resilience.


Conclusion

The shift from "prevent failure" to "embrace failure" represents a fundamental change in how we build reliable systems. By combining Amazon EKS's orchestration capabilities with AWS Fault Injection Simulator's enterprise-grade chaos management and Chaos Mesh's Kubernetes-native failure injection, you can build platforms that don't just claim to be resilient—they prove it.

Your systems will fail. The only question is — will you learn from it?


About the author

Elad Hirsch is a Tech Lead at TeraSky CTO Office, a global provider of multi-cloud, cloud-native, and innovative IT solutions. He has held principal engineering positions at Agmatix, JFrog, IDI, and Finjan Security, and his primary areas of expertise are software architecture and DevOps practices. He is a proactive advocate for fostering a DevOps culture, enabling organizations to improve their software architecture and streamline operations in cloud-native environments.


Interested in learning more about chaos engineering on AWS? Check out the AWS Fault Injection Simulator documentation and the amazon-eks-chaos sample repository.

Tags: #AWS #Kubernetes #ChaosEngineering #EKS #DevOps #SRE #CloudNative #Resilience #ChaosMesh #FaultInjection
