<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elad Hirsch</title>
    <description>The latest articles on DEV Community by Elad Hirsch (@eladh).</description>
    <link>https://dev.to/eladh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F486233%2F9da78142-f4e8-41bc-946b-7939314fe735.jpeg</url>
      <title>DEV Community: Elad Hirsch</title>
      <link>https://dev.to/eladh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eladh"/>
    <language>en</language>
    <item>
      <title>Building Resilient Systems on AWS - Chaos Engineering with Amazon EKS and AWS Fault Injection Simulator</title>
      <dc:creator>Elad Hirsch</dc:creator>
      <pubDate>Fri, 05 Dec 2025 17:15:47 +0000</pubDate>
      <link>https://dev.to/eladh/building-resilient-systems-on-aws-chaos-engineering-with-amazon-eks-and-aws-fault-injection-251a</link>
      <guid>https://dev.to/eladh/building-resilient-systems-on-aws-chaos-engineering-with-amazon-eks-and-aws-fault-injection-251a</guid>
      <description>&lt;p&gt;&lt;em&gt;How to prove your Kubernetes platform can handle failure—before your users find out it can't&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Uncomfortable Truth About Platform Stability
&lt;/h2&gt;

&lt;p&gt;Here's a scenario every platform engineer dreads — your production environment experiences a critical incident. Users can't log in to your SaaS product. Your on-call team rushes to respond—only to discover that the badge readers at your office rely on the same network infrastructure that just went down. The people who are supposed to fix the problem can't even get inside the building.&lt;/p&gt;

&lt;p&gt;Sound far-fetched? It happened to Facebook in 2021. A BGP configuration change during routine maintenance accidentally disconnected all of Facebook's data centers from the internet, taking the company offline for roughly six hours. With its DNS servers unreachable, there was no Facebook—and the on-site badge readers went dark too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkunzjf02wpczqjd116c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkunzjf02wpczqjd116c.png" alt="Facebook tweet" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The lesson isn't about BGP misconfigurations. It's about a fundamental shift in how we think about system reliability — &lt;strong&gt;we must stop assuming our systems are resilient and start proving they are.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  From "Prevent Failure" to "Embrace Failure"
&lt;/h2&gt;

&lt;p&gt;For years, the platform engineering playbook was straightforward — maximize uptime, add redundancy, and when something breaks, write a test case so it never happens again. We built increasingly complex architectures—dozens of microservices, heterogeneous storage layers, multiple cloud providers, mixed communication patterns—and somehow convinced ourselves that this complexity equaled robustness.&lt;/p&gt;

&lt;p&gt;It doesn't.&lt;/p&gt;

&lt;p&gt;Modern distributed systems are built on eight dangerous assumptions known as the &lt;a href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing" rel="noopener noreferrer"&gt;Fallacies of Distributed Computing&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The network is reliable&lt;/li&gt;
&lt;li&gt;Latency is zero&lt;/li&gt;
&lt;li&gt;Bandwidth is unlimited&lt;/li&gt;
&lt;li&gt;The network is secure&lt;/li&gt;
&lt;li&gt;Topology doesn't change&lt;/li&gt;
&lt;li&gt;There is one administrator&lt;/li&gt;
&lt;li&gt;Transport cost is zero&lt;/li&gt;
&lt;li&gt;The network is homogeneous&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every one of these assumptions will eventually fail in production. The question isn't &lt;em&gt;if&lt;/em&gt; your system will experience failure—it's &lt;em&gt;when&lt;/em&gt;, and more importantly, &lt;em&gt;will you be ready?&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Network Is Secure — A Dangerous Assumption
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g9btbw2k9jmwmkadvpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g9btbw2k9jmwmkadvpw.png" alt="Dance Like" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As AWS CTO Werner Vogels famously said — &lt;em&gt;"Dance like nobody's watching, encrypt like everyone is."&lt;/em&gt; This mantra captures a critical truth about distributed systems security. Vogels has repeatedly emphasized the importance of safeguarding encryption keys, noting that "the key is the only tool that ensures you're the only one with access to your data."&lt;/p&gt;

&lt;p&gt;In cloud-native architectures, assuming the network is secure leads to devastating breaches. Zero-trust principles aren't optional—they're essential.&lt;/p&gt;




&lt;h3&gt;
  
  
  Transport Cost Is Zero — The Hidden Budget Killer
&lt;/h3&gt;

&lt;p&gt;Perhaps the most expensive fallacy to ignore is assuming transport cost is zero. In AWS, &lt;strong&gt;Data Transfer Out (DTO)&lt;/strong&gt; costs can quickly become one of the largest line items on your bill if not properly managed.&lt;/p&gt;

&lt;p&gt;Consider a typical microservices architecture — services communicate across availability zones, data flows between regions for disaster recovery, and APIs serve traffic globally. Each of these transfers incurs costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inter-AZ traffic&lt;/strong&gt; — $0.01/GB in each direction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inter-region traffic&lt;/strong&gt; — $0.02-$0.09/GB depending on regions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internet egress&lt;/strong&gt; — $0.09/GB for the first 10TB/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr99fd6tmtd26scf9h4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr99fd6tmtd26scf9h4o.png" alt="DTO Cost" width="800" height="611"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A service handling 100TB of monthly cross-AZ traffic could face $2,000/month in transfer costs alone—before any compute or storage charges. This is why architecture decisions like service placement, caching strategies, and data locality matter enormously in AWS.&lt;/p&gt;
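&lt;p&gt;As a sanity check on that figure, the arithmetic is simple enough to script. The sketch below assumes the $0.01/GB-per-direction rate quoted above and decimal units (100 TB = 100,000 GB):&lt;/p&gt;

```shell
# Back-of-the-envelope cross-AZ transfer cost (rates assumed from the list above)
TRAFFIC_GB=$((100 * 1000))      # 100 TB/month of cross-AZ traffic, in GB
RATE_CENTS_PER_GB=1             # $0.01/GB, billed in each direction
COST_CENTS=$((TRAFFIC_GB * RATE_CENTS_PER_GB * 2))  # charged on both sides of the AZ boundary
echo "Estimated monthly cross-AZ cost: \$$((COST_CENTS / 100))"
# -> Estimated monthly cross-AZ cost: $2000
```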




&lt;h2&gt;
  
  
  The AWS Chaos Engineering Stack
&lt;/h2&gt;

&lt;p&gt;AWS provides a powerful combination of services for implementing chaos engineering at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EKS&lt;/strong&gt; — Managed Kubernetes that provides the foundation for container orchestration with built-in resilience features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Fault Injection Simulator (FIS)&lt;/strong&gt; — A fully managed service for running chaos experiments against AWS resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Mesh&lt;/strong&gt; — A CNCF project that extends chaos capabilities specifically for Kubernetes workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What makes this combination particularly powerful is the deep integration between FIS and Kubernetes. AWS FIS can inject Kubernetes custom resources directly into your EKS clusters, allowing you to orchestrate Chaos Mesh experiments through a unified AWS control plane.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Amazon EKS Is the Foundation
&lt;/h2&gt;

&lt;p&gt;Before we inject chaos, we need a platform that can actually respond to failure gracefully. Amazon EKS provides several built-in resilience mechanisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-Healing Through Controllers
&lt;/h3&gt;

&lt;p&gt;Kubernetes controllers continuously reconcile the actual state of your cluster with the desired state. When a pod crashes, the Deployment controller notices the discrepancy and schedules a replacement. This reconciliation loop is the heartbeat of Kubernetes resilience.&lt;/p&gt;
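&lt;p&gt;The reconciliation pattern itself is easy to see in miniature. This toy loop (an illustration only, not Kubernetes source code) captures the control loop's shape: observe actual state, compare it to desired state, and act until they match:&lt;/p&gt;

```shell
# Toy reconciliation loop: converge actual replica count toward desired count
desired=3    # replicas declared in the Deployment spec
actual=1     # replicas currently running (two pods just crashed)
while [ "$actual" -lt "$desired" ]; do
  echo "reconcile: actual=$actual desired=$desired, scheduling replacement pod"
  actual=$((actual + 1))
done
echo "reconciled: $actual/$desired replicas running"
# -> reconciled: 3/3 replicas running
```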

&lt;h3&gt;
  
  
  Topology Awareness
&lt;/h3&gt;

&lt;p&gt;EKS allows you to distribute pods across multiple Availability Zones within an AWS region. By using topology spread constraints, you can ensure that a single AZ failure doesn't take down your entire application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;topologySpreadConstraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;maxSkew&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;topology.kubernetes.io/zone&lt;/span&gt;
    &lt;span class="na"&gt;whenUnsatisfiable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DoNotSchedule&lt;/span&gt;
    &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pod Disruption Budgets
&lt;/h3&gt;

&lt;p&gt;PDBs let you specify the minimum number of pods that must remain available during voluntary disruptions. This ensures that even during chaos experiments or cluster upgrades, your service maintains capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api-pdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These mechanisms form the baseline. Chaos engineering tests whether they actually work under real failure conditions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up the Chaos Engineering Environment
&lt;/h2&gt;

&lt;p&gt;Here is a reference example for chaos engineering on EKS. The &lt;a href="https://github.com/eladh/from-kubernetes-to-chaos-mesh" rel="noopener noreferrer"&gt;from-kubernetes-to-chaos-mesh&lt;/a&gt; repository demonstrates how to integrate AWS FIS with Chaos Mesh.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before diving in, ensure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS CLI configured with appropriate credentials&lt;/li&gt;
&lt;li&gt;An existing Amazon EKS cluster (version 1.25+)&lt;/li&gt;
&lt;li&gt;kubectl for Kubernetes cluster management&lt;/li&gt;
&lt;li&gt;Helm for deploying Chaos Mesh&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Connect to the EKS Cluster
&lt;/h3&gt;

&lt;p&gt;First, configure kubectl to communicate with your Amazon EKS cluster. The AWS CLI makes this straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update kubeconfig for your EKS cluster&lt;/span&gt;
aws eks update-kubeconfig &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; your-eks-cluster-name

&lt;span class="c"&gt;# Verify the connection&lt;/span&gt;
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see your cluster nodes listed with &lt;code&gt;Ready&lt;/code&gt; status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                              STATUS   ROLES    AGE   VERSION
ip-10-0-1-123.ec2.internal        Ready    &amp;lt;none&amp;gt;   45d   v1.31.2-eks-5678abc
ip-10-0-2-456.ec2.internal        Ready    &amp;lt;none&amp;gt;   45d   v1.31.2-eks-5678abc
ip-10-0-3-789.ec2.internal        Ready    &amp;lt;none&amp;gt;   45d   v1.31.2-eks-5678abc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify your cluster is spread across multiple Availability Zones — this is critical for meaningful resilience testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check node distribution across AZs&lt;/span&gt;
kubectl get nodes &lt;span class="nt"&gt;-L&lt;/span&gt; topology.kubernetes.io/zone

&lt;span class="c"&gt;# Expected output shows nodes in different AZs:&lt;/span&gt;
&lt;span class="c"&gt;# ip-10-0-1-123   Ready   us-east-1a&lt;/span&gt;
&lt;span class="c"&gt;# ip-10-0-2-456   Ready   us-east-1b&lt;/span&gt;
&lt;span class="c"&gt;# ip-10-0-3-789   Ready   us-east-1c&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm you have the necessary permissions by checking cluster info:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl cluster-info
&lt;span class="c"&gt;# Kubernetes control plane is running at https://ABCD1234.gr7.us-east-1.eks.amazonaws.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Installing Chaos Mesh
&lt;/h3&gt;

&lt;p&gt;Chaos Mesh deploys as a set of controllers and custom resource definitions in your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add chaos-mesh https://charts.chaos-mesh.org

helm &lt;span class="nb"&gt;install &lt;/span&gt;chaos-mesh chaos-mesh/chaos-mesh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; chaos-mesh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; chaosDaemon.runtime&lt;span class="o"&gt;=&lt;/span&gt;containerd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; chaosDaemon.socketPath&lt;span class="o"&gt;=&lt;/span&gt;/run/containerd/containerd.sock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Controller Manager&lt;/strong&gt; — Orchestrates chaos experiments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Daemon&lt;/strong&gt; — Runs on each node to execute failure injections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt; — Web UI for managing experiments (optional)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  AWS Fault Injection Simulator — The Control Plane for Chaos
&lt;/h2&gt;

&lt;p&gt;AWS FIS is what ties everything together. Rather than running chaos experiments in isolation, FIS provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Centralized experiment management&lt;/strong&gt; — Define, run, and monitor experiments from the AWS Console or API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety controls&lt;/strong&gt; — Stop conditions that automatically halt experiments if metrics breach thresholds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging&lt;/strong&gt; — Complete visibility into what experiments ran, when, and their outcomes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM integration&lt;/strong&gt; — Fine-grained permissions for who can run which experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Creating an IAM Role for FIS
&lt;/h3&gt;

&lt;p&gt;FIS needs permissions to interact with your EKS cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create the trust policy&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; fis-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "fis.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Create the role&lt;/span&gt;
aws iam create-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; fis-chaos-experiment-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assume-role-policy-document&lt;/span&gt; file://fis-trust-policy.json

&lt;span class="c"&gt;# Attach necessary policies&lt;/span&gt;
aws iam attach-role-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; fis-chaos-experiment-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring EKS for FIS Integration
&lt;/h3&gt;

&lt;p&gt;FIS needs to authenticate to your EKS cluster. Update the aws-auth ConfigMap to map the FIS role. Note that &lt;code&gt;system:masters&lt;/code&gt; is used here for simplicity; in production, bind the role to a least-privilege RBAC group instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-auth&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mapRoles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;- rolearn: arn:aws:iam::123456789012:role/fis-chaos-experiment-role&lt;/span&gt;
      &lt;span class="s"&gt;username: fis-user&lt;/span&gt;
      &lt;span class="s"&gt;groups:&lt;/span&gt;
        &lt;span class="s"&gt;- system:masters&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-World Chaos Experiments with Chaos Mesh
&lt;/h2&gt;

&lt;p&gt;Let's walk through practical experiments that test different failure modes using Chaos Mesh orchestrated through AWS FIS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 1 — Network Fault Injection
&lt;/h3&gt;

&lt;p&gt;Network issues are among the most common causes of distributed system failures. This experiment simulates complete network isolation for a target service:&lt;/p&gt;

&lt;p&gt;First, identify your target pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pod &lt;span class="nt"&gt;-n&lt;/span&gt; application &lt;span class="nt"&gt;--show-labels&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;order-service
&lt;span class="c"&gt;# order-service-7b68fd5f58-xk9mn   1/1   Running   app.kubernetes.io/name=order-service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the FIS experiment template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws fis create-experiment-template &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cli-input-json&lt;/span&gt; &lt;span class="s1"&gt;'{
    "description": "Chaos Mesh network partition test",
    "targets": {
      "EKS-Cluster": {
        "resourceType": "aws:eks:cluster",
        "resourceArns": [
          "arn:aws:eks:us-east-1:123456789012:cluster/resilience-cluster"
        ],
        "selectionMode": "ALL"
      }
    },
    "actions": {
      "inject-network-partition": {
        "actionId": "aws:eks:inject-kubernetes-custom-resource",
        "description": "Simulate network partition on order-service",
        "parameters": {
          "kubernetesApiVersion": "chaos-mesh.org/v1alpha1",
          "kubernetesKind": "NetworkChaos",
          "kubernetesNamespace": "chaos-mesh",
          "kubernetesSpec": "{\"action\":\"partition\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"application\"],\"labelSelectors\":{\"app.kubernetes.io/name\":\"order-service\"}},\"direction\":\"both\"}",
          "maxDuration": "PT2M"
        },
        "targets": {
          "Cluster": "EKS-Cluster"
        }
      }
    },
    "stopConditions": [{ "source": "none" }],
    "roleArn": "arn:aws:iam::123456789012:role/fis-chaos-experiment-role",
    "tags": {
      "Purpose": "chaos-engineering",
      "Team": "platform"
    },
    "logConfiguration": {
      "cloudWatchLogsConfiguration": {
        "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:fis-experiments:*"
      },
      "logSchemaVersion": 2
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
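&lt;p&gt;The escaped &lt;code&gt;kubernetesSpec&lt;/code&gt; string in the template above is hard to read. Unescaped, it corresponds to this standalone Chaos Mesh manifest (a sketch; the metadata name is hypothetical, since FIS creates the resource for you):&lt;/p&gt;

```yaml
# Unescaped equivalent of the kubernetesSpec that FIS injects into the cluster
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-service-partition   # hypothetical name; FIS names the resource itself
  namespace: chaos-mesh
spec:
  action: partition
  mode: all
  direction: both
  selector:
    namespaces:
      - application
    labelSelectors:
      app.kubernetes.io/name: order-service
```

&lt;p&gt;You could also apply a manifest like this directly with &lt;code&gt;kubectl apply&lt;/code&gt; to run the same experiment without FIS, at the cost of losing FIS's stop conditions and audit trail.&lt;/p&gt;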



&lt;p&gt;During the experiment, you can observe the impact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before experiment&lt;/span&gt;
curl http://order-service.application:8080/health &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="c"&gt;# Response: HTTP/1.1 200 OK&lt;/span&gt;

&lt;span class="c"&gt;# During experiment  &lt;/span&gt;
curl http://order-service.application:8080/health &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="c"&gt;# Response: curl: (7) Failed to connect - Connection refused&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Experiment 2 — Container Kill with Chaos Mesh
&lt;/h3&gt;

&lt;p&gt;This experiment tests your application's ability to recover from container failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws fis create-experiment-template &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cli-input-json&lt;/span&gt; &lt;span class="s1"&gt;'{
    "description": "Chaos Mesh container termination test",
    "targets": {
      "EKS-Cluster": {
        "resourceType": "aws:eks:cluster",
        "resourceArns": [
          "arn:aws:eks:us-east-1:123456789012:cluster/resilience-cluster"
        ],
        "selectionMode": "ALL"
      }
    },
    "actions": {
      "terminate-container": {
        "actionId": "aws:eks:inject-kubernetes-custom-resource",
        "description": "Kill payment-service container",
        "parameters": {
          "kubernetesApiVersion": "chaos-mesh.org/v1alpha1",
          "kubernetesKind": "PodChaos",
          "kubernetesNamespace": "chaos-mesh",
          "kubernetesSpec": "{\"action\":\"container-kill\",\"mode\":\"one\",\"containerNames\":[\"payment-service\"],\"selector\":{\"namespaces\":[\"application\"],\"labelSelectors\":{\"app.kubernetes.io/name\":\"payment-service\"}}}",
          "maxDuration": "PT1M"
        },
        "targets": {
          "Cluster": "EKS-Cluster"
        }
      }
    },
    "stopConditions": [{ "source": "none" }],
    "roleArn": "arn:aws:iam::123456789012:role/fis-chaos-experiment-role",
    "tags": {
      "Purpose": "chaos-engineering"
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
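&lt;p&gt;As with the network experiment, the &lt;code&gt;kubernetesSpec&lt;/code&gt; above unescapes to a plain Chaos Mesh resource (a sketch; the name is hypothetical):&lt;/p&gt;

```yaml
# Unescaped equivalent of the PodChaos spec in the FIS template above
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-service-container-kill   # hypothetical name
  namespace: chaos-mesh
spec:
  action: container-kill
  mode: one                  # kill the container in exactly one matching pod
  containerNames:
    - payment-service
  selector:
    namespaces:
      - application
    labelSelectors:
      app.kubernetes.io/name: payment-service
```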



&lt;p&gt;Run the experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws fis start-experiment &lt;span class="nt"&gt;--experiment-template-id&lt;/span&gt; EXTabc123def456
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitor experiment status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws fis get-experiment &lt;span class="nt"&gt;--id&lt;/span&gt; EXPxyz789abc123 | jq &lt;span class="s1"&gt;'.experiment.state'&lt;/span&gt;
&lt;span class="c"&gt;# Output: { "status": "completed" }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the container restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pod &lt;span class="nt"&gt;-n&lt;/span&gt; application | &lt;span class="nb"&gt;grep &lt;/span&gt;payment-service
&lt;span class="c"&gt;# payment-service-5d8f9c7b6a-m2k9p   1/1   Running   1 (3m12s ago)   8m45s&lt;/span&gt;

kubectl describe pod &lt;span class="nt"&gt;-n&lt;/span&gt; application payment-service-5d8f9c7b6a-m2k9p | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A5&lt;/span&gt; &lt;span class="s2"&gt;"Events:"&lt;/span&gt;
&lt;span class="c"&gt;# Events:&lt;/span&gt;
&lt;span class="c"&gt;#   Normal  Pulled   3m15s (x2 over 8m48s)  kubelet  Container image already present&lt;/span&gt;
&lt;span class="c"&gt;#   Normal  Created  3m15s (x2 over 8m48s)  kubelet  Created container payment-service&lt;/span&gt;
&lt;span class="c"&gt;#   Normal  Started  3m14s (x2 over 8m47s)  kubelet  Started container payment-service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The restart count of 1 confirms the chaos injection worked, and the Running status confirms Kubernetes successfully recovered the pod.&lt;/p&gt;




&lt;h2&gt;
  
  
  Measuring Success — What to Monitor
&lt;/h2&gt;

&lt;p&gt;Chaos experiments are only valuable if you can observe their impact. Key metrics to track:&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Request latency (p50, p95, p99)&lt;/li&gt;
&lt;li&gt;Error rates and HTTP status codes&lt;/li&gt;
&lt;li&gt;Request throughput and queue depths&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Kubernetes Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pod restart counts&lt;/li&gt;
&lt;li&gt;Container CPU/memory during recovery&lt;/li&gt;
&lt;li&gt;Time to pod ready state&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AWS Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;EKS control plane API latency&lt;/li&gt;
&lt;li&gt;Node health status across AZs&lt;/li&gt;
&lt;li&gt;Application Load Balancer healthy target counts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS CloudWatch Container Insights provides much of this automatically for EKS clusters. For deeper application-level observability, consider integrating with AWS X-Ray for distributed tracing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementing Safety Guardrails
&lt;/h2&gt;

&lt;p&gt;Chaos engineering isn't about breaking things carelessly. AWS FIS provides stop conditions that automatically halt experiments when things go wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stopConditions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws:cloudwatch:alarm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServiceErrorRateHigh"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
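&lt;p&gt;A stop condition is only as effective as the alarm behind it. As an illustration (the names and thresholds here are placeholders, not values from the repository), an error-rate alarm could be defined in JSON and passed to &lt;code&gt;aws cloudwatch put-metric-alarm --cli-input-json&lt;/code&gt;:&lt;/p&gt;

```json
{
  "AlarmName": "ServiceErrorRateHigh",
  "Namespace": "AWS/ApplicationELB",
  "MetricName": "HTTPCode_Target_5XX_Count",
  "Statistic": "Sum",
  "Period": 60,
  "EvaluationPeriods": 2,
  "Threshold": 50,
  "ComparisonOperator": "GreaterThanThreshold",
  "TreatMissingData": "notBreaching"
}
```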






&lt;h2&gt;
  
  
  The Complete Experiment Flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyq58x737qumkuzz8vih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyq58x737qumkuzz8vih.png" alt="Experiment Complete Flow" width="800" height="3585"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Best Practices for Safe Chaos Experiments
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start in non-production environments&lt;/strong&gt; — Validate experiments in staging before production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define clear rollback procedures&lt;/strong&gt; — Know how to quickly restore normal operation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use blast radius controls&lt;/strong&gt; — Target specific pods/services rather than entire clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run during business hours&lt;/strong&gt; — Have engineers available to respond if needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Communicate with stakeholders&lt;/strong&gt; — Ensure relevant teams know experiments are planned&lt;/li&gt;
&lt;/ol&gt;
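&lt;p&gt;Blast radius control (practice 3) maps directly onto Chaos Mesh selectors. Rather than &lt;code&gt;mode: all&lt;/code&gt;, an experiment can cap how many pods it touches; a sketch with illustrative values:&lt;/p&gt;

```yaml
# Sketch: limit a pod-kill experiment to at most 20% of the selected pods
spec:
  action: pod-kill
  mode: fixed-percent
  value: "20"                # Chaos Mesh expects the percentage as a string
  selector:
    namespaces:
      - application
    labelSelectors:
      app.kubernetes.io/name: order-service
```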




&lt;h2&gt;
  
  
  Breaking Things on Purpose
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr08h242kb9ydmzhvvwoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr08h242kb9ydmzhvvwoa.png" alt="You Break , You Pay" width="800" height="1115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The journey from reactive incident response to proactive resilience engineering represents a fundamental shift in how we think about system reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Celebrating 99.9% Uptime to Fearing It
&lt;/h3&gt;

&lt;p&gt;For years, teams celebrated high uptime numbers as proof of system health. But 99.9% availability still means 8.76 hours of downtime per year—and those hours always seem to happen at the worst possible moment. The realization sets in — we don't actually know &lt;em&gt;why&lt;/em&gt; our systems stay up, which means we don't know what will bring them down.&lt;/p&gt;
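&lt;p&gt;The downtime arithmetic generalizes to any availability target:&lt;/p&gt;

```shell
# Allowed downtime per year implied by an availability target
avail=99.9
downtime=$(awk -v a="$avail" 'BEGIN { printf "%.2f", (100 - a) / 100 * 8760 }')
echo "${avail}% availability still allows ${downtime} hours of downtime per year"
# -> 99.9% availability still allows 8.76 hours of downtime per year
```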

&lt;h3&gt;
  
  
  The Shift from "Prevent Failure" to "Embrace Failure"
&lt;/h3&gt;

&lt;p&gt;Traditional engineering tries to eliminate all possible failure modes. Modern resilience engineering accepts that failure is inevitable and focuses on minimizing impact. This isn't pessimism—it's realism. Complex distributed systems have emergent behaviors that no amount of unit testing can predict.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Netflix's Chaos Monkey to Chaos Engineering
&lt;/h3&gt;

&lt;p&gt;Netflix pioneered this approach with Chaos Monkey in 2011—a tool that randomly terminated EC2 instances in production. The idea seemed radical — why would you intentionally break your own systems? The answer became clear — because you'd rather discover weaknesses on your terms, during business hours, with engineers ready to respond, than at 3 AM during peak traffic.&lt;/p&gt;

&lt;p&gt;Today, chaos engineering has evolved far beyond random instance termination. Tools like Chaos Mesh enable sophisticated experiments — network partitions, DNS failures, clock skew, JVM faults, and more. AWS FIS brings this into the enterprise with centralized management, safety controls, and full audit trails.&lt;/p&gt;
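
&lt;p&gt;These experiments are declared as ordinary Kubernetes resources. As an illustrative sketch (the service name, namespace, and timing values are hypothetical), a Chaos Mesh &lt;code&gt;NetworkChaos&lt;/code&gt; resource that injects latency into a payment service might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-service
  delay:
    latency: "200ms"   # added network delay
    jitter: "50ms"
  duration: "5m"       # the fault is automatically lifted after five minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;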

&lt;h3&gt;
  
  
  Recovery Isn't the Only Goal
&lt;/h3&gt;

&lt;p&gt;The most important outcome of chaos experiments isn't proving your system can recover—it's what you learn in the process. Each experiment reveals something about your architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How does the system behave under partial failure?&lt;/li&gt;
&lt;li&gt;Do circuit breakers trigger at the right thresholds?&lt;/li&gt;
&lt;li&gt;Are timeout values appropriate?&lt;/li&gt;
&lt;li&gt;Do health checks accurately reflect service health?&lt;/li&gt;
&lt;li&gt;How long does recovery actually take?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This learning feeds back into system improvements, creating a virtuous cycle of increasing resilience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The shift from "prevent failure" to "embrace failure" represents a fundamental change in how we build reliable systems. By combining Amazon EKS's orchestration capabilities with AWS Fault Injection Simulator's enterprise-grade chaos management and Chaos Mesh's Kubernetes-native failure injection, you can build platforms that don't just claim to be resilient—they prove it.&lt;/p&gt;

&lt;p&gt;Your systems will fail. The only question is — &lt;strong&gt;will you learn from it?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  About the author
&lt;/h2&gt;

&lt;p&gt;Elad Hirsch is a Tech Lead at TeraSky CTO Office, a global provider of multi-cloud, cloud-native, and innovative IT solutions. With experience in principal engineering positions at Agmatix, Jfrog, IDI, and Finjan Security, his primary areas of expertise revolve around software architecture and DevOps practices. He is a proactive advocate for fostering a DevOps culture, enabling organizations to improve their software architecture and streamline operations in cloud-native environments.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #AWS #Kubernetes #ChaosEngineering #EKS #DevOps #SRE #CloudNative #Resilience #ChaosMesh #FaultInjection&lt;/p&gt;

</description>
      <category>devops</category>
      <category>testing</category>
      <category>aws</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>From 30 Minutes to 4 - How EBS Volume Cloning Transformed Our Customer CI Pipeline</title>
      <dc:creator>Elad Hirsch</dc:creator>
      <pubDate>Wed, 03 Dec 2025 20:14:22 +0000</pubDate>
      <link>https://dev.to/eladh/from-30-minutes-to-4-how-ebs-volume-cloning-transformed-our-ci-pipeline-2b1o</link>
      <guid>https://dev.to/eladh/from-30-minutes-to-4-how-ebs-volume-cloning-transformed-our-ci-pipeline-2b1o</guid>
      <description>&lt;h2&gt;
  
  
  The Silent Killer of Developer Productivity
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31f2b2o28pkedmcmtrmf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31f2b2o28pkedmcmtrmf.png" alt="Waiting for an operation to be completed" width="600" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a developer, I know the frustration of waiting for a build. You push your code and, instead of staring at the screen, you switch to another task, lose context, and by the time the CI pipeline finishes, you've forgotten what you were working on. In my customer's case, this frustration had a very specific price: &lt;strong&gt;30 minutes per build&lt;/strong&gt;, mostly spent downloading Maven dependencies.&lt;/p&gt;

&lt;p&gt;Their setup wasn't unusual. They had GitHub Actions self-hosted runners spinning up on Amazon EKS for each triggered CI job. These runners needed Maven packages stored in their on-premises Nexus repository, connected via VPN. The architecture made sense on paper—until they measured the actual latency impact.&lt;/p&gt;

&lt;p&gt;The math was crystal clear. With dozens of developers triggering hundreds of builds daily, they were hemorrhaging thousands of developer-hours monthly just waiting for package downloads. Something had to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Root Cause
&lt;/h2&gt;

&lt;p&gt;Before diving into solutions, they needed to understand why their builds were so slow. The diagnosis revealed several compounding factors:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network Latency&lt;/strong&gt; - Every CI job started fresh. A new runner pod meant a clean slate—no cached dependencies, no memory of previous builds. Each time, Maven dutifully reached across the VPN to their on-premises Nexus server, pulling hundreds of megabytes of packages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPN Overhead&lt;/strong&gt; -  The VPN connection added its own latency tax. What would be milliseconds on a local network became seconds across the encrypted tunnel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Package Volume&lt;/strong&gt; - Their monorepo had accumulated years of dependencies. A full Maven dependency tree meant downloading a substantial chunk of data for every single build.&lt;/p&gt;

&lt;p&gt;And lastly &lt;strong&gt;Ephemeral Infrastructure&lt;/strong&gt; -  The beauty of EKS-based runners is their isolation and cleanliness. The curse is starting from zero every time.&lt;/p&gt;

&lt;p&gt;They needed a way to preserve the benefits of ephemeral, isolated runners while eliminating the cold-start penalty of downloading dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Road of Failed Experiments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Take #1 - Docker Layer Caching
&lt;/h3&gt;

&lt;p&gt;The obvious first solution was Docker layer caching. If they could cache the Maven dependencies in a Docker layer, subsequent builds could reuse them. They implemented a multi-stage Dockerfile that installed dependencies in an early layer, theoretically allowing Docker to skip the download step if nothing changed.&lt;/p&gt;

&lt;p&gt;The reality was messier than the theory.&lt;/p&gt;

&lt;p&gt;Maven dependency management is notoriously cache-unfriendly. When a single package version updates—even a minor patch—the entire layer containing dependencies becomes invalid. Docker layer caching works on an all-or-nothing principle at each layer. One changed dependency means re-downloading everything.&lt;/p&gt;

&lt;p&gt;In practice, their layer cache was invalidated almost daily. Sometimes multiple times per day. The "optimization" became a false promise—their builds were consistently slow, with occasional fast runs that made the slow ones feel even more painful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Take #2 - EBS Snapshots
&lt;/h3&gt;

&lt;p&gt;Their second approach leveraged EBS snapshots. They created a snapshot of a volume containing all their Maven dependencies, then restored it for each CI run.&lt;/p&gt;

&lt;p&gt;This improved their build times to approximately 9 minutes—a meaningful improvement, but still far from optimal. The snapshot restoration process, while faster than downloading packages over VPN, still added significant overhead. EBS snapshots are designed for disaster recovery and data persistence, not high-frequency, low-latency access patterns.&lt;/p&gt;

&lt;p&gt;They were getting closer, but they knew there had to be a better way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why We Rejected Other Alternatives
&lt;/h3&gt;

&lt;p&gt;During this period, they explored several other options that ultimately didn't work out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 Mount (s3fs/goofys)&lt;/strong&gt; - They experimented with mounting an S3 bucket containing their Maven cache directly into the runner pods. The latency was better than the VPN, reducing build times to around 10 minutes. However, S3's object storage semantics don't align well with the random-access patterns of Maven builds. The filesystem abstraction layer added its own overhead, and they saw inconsistent performance based on S3 service conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon EFS&lt;/strong&gt; - Elastic File System seemed promising—a managed NFS solution that multiple pods could mount simultaneously. However, two concerns stopped them. First, cost: EFS pricing for their access patterns would have been significant. Second, and more concerning, they worried about potential file corruption with multiple concurrent writers. While EFS handles this technically, their Maven usage patterns weren't designed with shared filesystem semantics in mind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dedicated EC2 with Local Nexus&lt;/strong&gt; -  They considered running their own Nexus proxy on an EC2 instance with fast EBS storage. This would have solved the latency problem but introduced new complexity: managing another server, handling inter-AZ traffic costs, and maintaining yet another piece of infrastructure. The operational overhead didn't justify the benefit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Final Take - EBS Volume Cloning
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6plnzlics0l1xg61tvm2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6plnzlics0l1xg61tvm2.png" alt="EBS Volume Cloning" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The solution that finally worked came from an often-overlooked EBS feature: &lt;strong&gt;volume cloning&lt;/strong&gt;. Unlike snapshots, which create a point-in-time copy that needs to be restored, cloned volumes are immediately usable. The cloning operation leverages EBS's underlying storage architecture to create what's effectively a copy-on-write reference to the original volume.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;Their final solution consists of three components working in harmony:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Base Volume&lt;/strong&gt; -  A 100GB GP3 EBS volume named &lt;code&gt;maven-cache-shared&lt;/code&gt; serves as the authoritative source of Maven packages. This volume lives in their cluster, always available, always up-to-date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Nightly Updater&lt;/strong&gt; - A cronjob called &lt;code&gt;maven-cache-updater&lt;/code&gt; runs every night. Its job is simple but crucial:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mount the shared base volume&lt;/li&gt;
&lt;li&gt;Run Maven commands to fetch any new or updated packages from the remote Nexus server&lt;/li&gt;
&lt;li&gt;Unmount the volume immediately after&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This nightly synchronization means their base volume is never more than 24 hours stale. For most practical purposes, it contains everything their builds need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-Demand Cloning&lt;/strong&gt; -  When a CI job triggers, the magic happens. Instead of downloading packages or restoring snapshots, they create a clone of the base volume. This clone becomes the runner's Maven cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Workflow in Action
&lt;/h3&gt;

&lt;p&gt;Here's what happens when a developer pushes code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F284t9x1tejm5uuz1pn4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F284t9x1tejm5uuz1pn4y.png" alt="Workflow" width="800" height="1239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Job Trigger&lt;/strong&gt; - GitHub Actions receives the push event and triggers their CI workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runner Instantiation&lt;/strong&gt; -  A new self-hosted runner pod spins up on EKS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume Cloning&lt;/strong&gt; -  The runner's init process creates a new PVC (Persistent Volume Claim) configured as a clone of &lt;code&gt;maven-cache-shared&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Execution&lt;/strong&gt; -  Maven runs against the cloned volume, finding all dependencies already present&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleanup&lt;/strong&gt; -  After CI completion, both the runner pod and the cloned PVC are terminated&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The entire process—from push to build completion—now takes &lt;strong&gt;3.5 to 4 minutes&lt;/strong&gt;. The cloning operation itself accounts for just 20-25 seconds of that time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Implementation
&lt;/h2&gt;

&lt;p&gt;The Kubernetes configuration for this setup involves a few key pieces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache-runner-${RUN_ID}&lt;/span&gt; &lt;span class="c1"&gt;# templated per run&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp3&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100Gi&lt;/span&gt;
  &lt;span class="na"&gt;dataSource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache-shared&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical element is the &lt;code&gt;dataSource&lt;/code&gt; field. By specifying an existing PVC as the data source, Kubernetes instructs the EBS CSI driver to create a clone rather than an empty volume.&lt;/p&gt;
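
&lt;p&gt;To show how the clone is consumed, here is a simplified and partly illustrative runner pod spec (the container command and mount path are assumptions, not the customer's exact configuration); Maven is pointed at the mounted clone via &lt;code&gt;-Dmaven.repo.local&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: ci-runner-${RUN_ID}        # templated per run
spec:
  restartPolicy: Never
  containers:
  - name: runner
    image: maven:3.9-eclipse-temurin-17
    # resolve dependencies from the pre-warmed clone instead of the network
    command: ["mvn", "-Dmaven.repo.local=/maven-cache", "verify"]
    volumeMounts:
    - name: maven-cache
      mountPath: /maven-cache
  volumes:
  - name: maven-cache
    persistentVolumeClaim:
      claimName: maven-cache-runner-${RUN_ID}  # the per-run clone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;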

&lt;p&gt;The nightly updater cronjob looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache-updater&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;  &lt;span class="c1"&gt;# 2 AM daily&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;updater&lt;/span&gt;
            &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven:3.9-eclipse-temurin-17&lt;/span&gt;
            &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/bin/sh&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
              &lt;span class="s"&gt;cd /maven-cache&lt;/span&gt;
              &lt;span class="s"&gt;mvn -Dmaven.repo.local=/maven-cache dependency:go-offline&lt;/span&gt;
            &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/maven-cache&lt;/span&gt;
          &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache&lt;/span&gt;
            &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache-shared&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OnFailure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why This Works So Well
&lt;/h2&gt;

&lt;p&gt;The elegance of this solution lies in how it aligns with both EBS's strengths and their operational requirements:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instant Availability&lt;/strong&gt; -  EBS cloning is nearly instantaneous because it doesn't copy data immediately. The clone starts as a reference to the original volume's data blocks. Only when writes occur does the copy-on-write mechanism create new blocks. For their read-heavy Maven cache workload, this is perfect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complete Isolation&lt;/strong&gt; - Each CI run gets its own volume. There's no risk of one build corrupting another's cache. No lock contention. No race conditions. Each runner operates in blissful isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predictable Performance&lt;/strong&gt; - GP3 volumes provide consistent IOPS and throughput regardless of volume size. Their cloned volumes perform identically to the base volume from the moment they're created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt; (what they pay for):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One 100GB GP3 base volume (running 24/7)&lt;/li&gt;
&lt;li&gt;Cloned volumes for the duration of each CI run (~4.5 minutes on average)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cloned volumes exist for such short periods that their cost is negligible. A 100GB GP3 volume costs roughly $8/month. Their cloned volumes, existing for minutes at a time, add pennies to the monthly bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Simplicity&lt;/strong&gt; -  There's no complex caching logic to maintain. No cache invalidation strategies to debug. The nightly updater ensures freshness, and the cloning mechanism handles distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving Forward - Future Optimizations
&lt;/h2&gt;

&lt;p&gt;Their current implementation works well, but there's still room for improvement: volume right-sizing, a multi-region strategy, and cache warming. We'll explore these in a follow-up article. :-) &lt;/p&gt;

&lt;h2&gt;
  
  
  About the author
&lt;/h2&gt;

&lt;p&gt;Elad Hirsch is a Tech Lead at TeraSky CTO Office, a global provider of multi-cloud, cloud-native, and innovative IT solutions. With experience in principal engineering positions at Agmatix, Jfrog, IDI, and Finjan Security, his primary areas of expertise revolve around software architecture and DevOps practices. He is a proactive advocate for fostering a DevOps culture, enabling organizations to improve their software architecture and streamline operations in cloud-native environments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you implemented similar optimizations in your CI pipeline? I'd love to hear about your approaches in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: #AWS #DevOps #CICD #Kubernetes #EBS #GitHubActions #Maven #Performance #CloudEngineering&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>performance</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
