<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Elad Hirsch</title>
    <description>The latest articles on DEV Community by Elad Hirsch (@eladh).</description>
    <link>https://dev.to/eladh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F486233%2F9da78142-f4e8-41bc-946b-7939314fe735.jpeg</url>
      <title>DEV Community: Elad Hirsch</title>
      <link>https://dev.to/eladh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eladh"/>
    <language>en</language>
    <item>
      <title>Building Resilient Systems on AWS - Chaos Engineering with Amazon EKS and AWS Fault Injection Simulator</title>
      <dc:creator>Elad Hirsch</dc:creator>
      <pubDate>Fri, 05 Dec 2025 17:15:47 +0000</pubDate>
      <link>https://dev.to/eladh/building-resilient-systems-on-aws-chaos-engineering-with-amazon-eks-and-aws-fault-injection-251a</link>
      <guid>https://dev.to/eladh/building-resilient-systems-on-aws-chaos-engineering-with-amazon-eks-and-aws-fault-injection-251a</guid>
      <description>&lt;p&gt;&lt;em&gt;How to prove your Kubernetes platform can handle failure—before your users find out it can't&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Uncomfortable Truth About Platform Stability
&lt;/h2&gt;

&lt;p&gt;Here's a scenario every platform engineer dreads — your production environment experiences a critical incident. Users can't log in to your SaaS product. Your on-call team rushes to respond—only to discover that the badge readers at your office rely on the same network infrastructure that just went down. The people who are supposed to fix the problem can't even get inside the building.&lt;/p&gt;

&lt;p&gt;Sound far-fetched? It happened to Facebook in 2021. A BGP configuration change during routine maintenance accidentally disconnected all of Facebook's data centers from the internet, taking the company offline for roughly six hours. With its DNS servers unreachable, there was no Facebook—and the on-site badge readers went dark too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkunzjf02wpczqjd116c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftkunzjf02wpczqjd116c.png" alt="Facebook tweet" width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The lesson isn't about BGP misconfigurations. It's about a fundamental shift in how we think about system reliability — &lt;strong&gt;we must stop assuming our systems are resilient and start proving they are.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  From "Prevent Failure" to "Embrace Failure"
&lt;/h2&gt;

&lt;p&gt;For years, the platform engineering playbook was straightforward — maximize uptime, add redundancy, and when something breaks, write a test case so it never happens again. We built increasingly complex architectures—dozens of microservices, heterogeneous storage layers, multiple cloud providers, mixed communication patterns—and somehow convinced ourselves that this complexity equaled robustness.&lt;/p&gt;

&lt;p&gt;It doesn't.&lt;/p&gt;

&lt;p&gt;Modern distributed systems are built on eight dangerous assumptions known as the &lt;a href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing" rel="noopener noreferrer"&gt;Fallacies of Distributed Computing&lt;/a&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The network is reliable&lt;/li&gt;
&lt;li&gt;Latency is zero&lt;/li&gt;
&lt;li&gt;Bandwidth is unlimited&lt;/li&gt;
&lt;li&gt;The network is secure&lt;/li&gt;
&lt;li&gt;Topology doesn't change&lt;/li&gt;
&lt;li&gt;There is one administrator&lt;/li&gt;
&lt;li&gt;Transport cost is zero&lt;/li&gt;
&lt;li&gt;The network is homogeneous&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every one of these assumptions will eventually fail in production. The question isn't &lt;em&gt;if&lt;/em&gt; your system will experience failure—it's &lt;em&gt;when&lt;/em&gt;, and more importantly, &lt;em&gt;will you be ready?&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Network Is Secure — A Dangerous Assumption
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g9btbw2k9jmwmkadvpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5g9btbw2k9jmwmkadvpw.png" alt="Dance Like" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As AWS CTO Werner Vogels famously said — &lt;em&gt;"Dance like nobody's watching, encrypt like everyone is."&lt;/em&gt; This mantra captures a critical truth about distributed systems security. Vogels has repeatedly emphasized the importance of safeguarding encryption keys, noting that "the key is the only tool that ensures you're the only one with access to your data."&lt;/p&gt;

&lt;p&gt;In cloud-native architectures, assuming the network is secure leads to devastating breaches. Zero-trust principles aren't optional—they're essential.&lt;/p&gt;




&lt;h3&gt;
  
  
  Transport Cost Is Zero — The Hidden Budget Killer
&lt;/h3&gt;

&lt;p&gt;Perhaps the most expensive fallacy to ignore is assuming transport cost is zero. In AWS, &lt;strong&gt;Data Transfer Out (DTO)&lt;/strong&gt; costs can quickly become one of the largest line items on your bill if not properly managed.&lt;/p&gt;

&lt;p&gt;Consider a typical microservices architecture — services communicate across availability zones, data flows between regions for disaster recovery, and APIs serve traffic globally. Each of these transfers incurs costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inter-AZ traffic&lt;/strong&gt; — $0.01/GB in each direction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inter-region traffic&lt;/strong&gt; — $0.02-$0.09/GB depending on regions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Internet egress&lt;/strong&gt; — $0.09/GB for the first 10TB/month&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr99fd6tmtd26scf9h4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdr99fd6tmtd26scf9h4o.png" alt="DTO Cost" width="800" height="611"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A service handling 100TB of monthly cross-AZ traffic could face $2,000/month in transfer costs alone—before any compute or storage charges. This is why architecture decisions like service placement, caching strategies, and data locality matter enormously in AWS.&lt;/p&gt;
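&lt;p&gt;As a sanity check on that figure, the arithmetic is simple enough to script. The sketch below assumes the $0.01/GB-per-direction rate quoted above and decimal units (100 TB = 100,000 GB):&lt;/p&gt;

```shell
# Back-of-the-envelope cross-AZ transfer cost (rates assumed from the list above)
TRAFFIC_GB=$((100 * 1000))      # 100 TB/month of cross-AZ traffic, in GB
RATE_CENTS_PER_GB=1             # $0.01/GB, billed in each direction
COST_CENTS=$((TRAFFIC_GB * RATE_CENTS_PER_GB * 2))  # charged on both sides of the AZ boundary
echo "Estimated monthly cross-AZ cost: \$$((COST_CENTS / 100))"
# -> Estimated monthly cross-AZ cost: $2000
```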




&lt;h2&gt;
  
  
  The AWS Chaos Engineering Stack
&lt;/h2&gt;

&lt;p&gt;AWS provides a powerful combination of services for implementing chaos engineering at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon EKS&lt;/strong&gt; — Managed Kubernetes that provides the foundation for container orchestration with built-in resilience features&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS Fault Injection Simulator (FIS)&lt;/strong&gt; — A fully managed service for running chaos experiments against AWS resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Mesh&lt;/strong&gt; — A CNCF project that extends chaos capabilities specifically for Kubernetes workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What makes this combination particularly powerful is the deep integration between FIS and Kubernetes. AWS FIS can inject Kubernetes custom resources directly into your EKS clusters, allowing you to orchestrate Chaos Mesh experiments through a unified AWS control plane.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Amazon EKS Is the Foundation
&lt;/h2&gt;

&lt;p&gt;Before we inject chaos, we need a platform that can actually respond to failure gracefully. Amazon EKS provides several built-in resilience mechanisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-Healing Through Controllers
&lt;/h3&gt;

&lt;p&gt;Kubernetes controllers continuously reconcile the actual state of your cluster with the desired state. When a pod crashes, the Deployment controller notices the discrepancy and schedules a replacement. This reconciliation loop is the heartbeat of Kubernetes resilience.&lt;/p&gt;
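&lt;p&gt;The reconciliation pattern itself is easy to see in miniature. This toy loop (an illustration only, not Kubernetes source code) captures the control loop's shape: observe actual state, compare it to desired state, and act until they match:&lt;/p&gt;

```shell
# Toy reconciliation loop: converge actual replica count toward desired count
desired=3    # replicas declared in the Deployment spec
actual=1     # replicas currently running (two pods just crashed)
while [ "$actual" -lt "$desired" ]; do
  echo "reconcile: actual=$actual desired=$desired, scheduling replacement pod"
  actual=$((actual + 1))
done
echo "reconciled: $actual/$desired replicas running"
# -> reconciled: 3/3 replicas running
```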

&lt;h3&gt;
  
  
  Topology Awareness
&lt;/h3&gt;

&lt;p&gt;EKS allows you to distribute pods across multiple Availability Zones within an AWS region. By using topology spread constraints, you can ensure that a single AZ failure doesn't take down your entire application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;topologySpreadConstraints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;maxSkew&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;topologyKey&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;topology.kubernetes.io/zone&lt;/span&gt;
    &lt;span class="na"&gt;whenUnsatisfiable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DoNotSchedule&lt;/span&gt;
    &lt;span class="na"&gt;labelSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Pod Disruption Budgets
&lt;/h3&gt;

&lt;p&gt;PDBs let you specify the minimum number of pods that must remain available during voluntary disruptions. This ensures that even during chaos experiments or cluster upgrades, your service maintains capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;policy/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PodDisruptionBudget&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api-pdb&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;minAvailable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These mechanisms form the baseline. Chaos engineering tests whether they actually work under real failure conditions.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up the Chaos Engineering Environment
&lt;/h2&gt;

&lt;p&gt;Here is a reference example for chaos engineering on EKS. The &lt;a href="https://github.com/eladh/from-kubernetes-to-chaos-mesh" rel="noopener noreferrer"&gt;from-kubernetes-to-chaos-mesh&lt;/a&gt; repository demonstrates how to integrate AWS FIS with Chaos Mesh.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Before diving in, ensure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS CLI configured with appropriate credentials&lt;/li&gt;
&lt;li&gt;An existing Amazon EKS cluster (version 1.25+)&lt;/li&gt;
&lt;li&gt;kubectl for Kubernetes cluster management&lt;/li&gt;
&lt;li&gt;Helm for deploying Chaos Mesh&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Connect to the EKS Cluster
&lt;/h3&gt;

&lt;p&gt;First, configure kubectl to communicate with your Amazon EKS cluster. The AWS CLI makes this straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update kubeconfig for your EKS cluster&lt;/span&gt;
aws eks update-kubeconfig &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--region&lt;/span&gt; us-east-1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; your-eks-cluster-name

&lt;span class="c"&gt;# Verify the connection&lt;/span&gt;
kubectl get nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see your cluster nodes listed with &lt;code&gt;Ready&lt;/code&gt; status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                              STATUS   ROLES    AGE   VERSION
ip-10-0-1-123.ec2.internal        Ready    &amp;lt;none&amp;gt;   45d   v1.31.2-eks-5678abc
ip-10-0-2-456.ec2.internal        Ready    &amp;lt;none&amp;gt;   45d   v1.31.2-eks-5678abc
ip-10-0-3-789.ec2.internal        Ready    &amp;lt;none&amp;gt;   45d   v1.31.2-eks-5678abc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify your cluster is spread across multiple Availability Zones — this is critical for meaningful resilience testing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check node distribution across AZs&lt;/span&gt;
kubectl get nodes &lt;span class="nt"&gt;-L&lt;/span&gt; topology.kubernetes.io/zone

&lt;span class="c"&gt;# Expected output shows nodes in different AZs:&lt;/span&gt;
&lt;span class="c"&gt;# ip-10-0-1-123   Ready   us-east-1a&lt;/span&gt;
&lt;span class="c"&gt;# ip-10-0-2-456   Ready   us-east-1b&lt;/span&gt;
&lt;span class="c"&gt;# ip-10-0-3-789   Ready   us-east-1c&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm you have the necessary permissions by checking cluster info:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl cluster-info
&lt;span class="c"&gt;# Kubernetes control plane is running at https://ABCD1234.gr7.us-east-1.eks.amazonaws.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Installing Chaos Mesh
&lt;/h3&gt;

&lt;p&gt;Chaos Mesh deploys as a set of controllers and custom resource definitions in your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add chaos-mesh https://charts.chaos-mesh.org

helm &lt;span class="nb"&gt;install &lt;/span&gt;chaos-mesh chaos-mesh/chaos-mesh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; chaos-mesh &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; chaosDaemon.runtime&lt;span class="o"&gt;=&lt;/span&gt;containerd &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; chaosDaemon.socketPath&lt;span class="o"&gt;=&lt;/span&gt;/run/containerd/containerd.sock
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key components include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Controller Manager&lt;/strong&gt; — Orchestrates chaos experiments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Daemon&lt;/strong&gt; — Runs on each node to execute failure injections&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dashboard&lt;/strong&gt; — Web UI for managing experiments (optional)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  AWS Fault Injection Simulator — The Control Plane for Chaos
&lt;/h2&gt;

&lt;p&gt;AWS FIS is what ties everything together. Rather than running chaos experiments in isolation, FIS provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Centralized experiment management&lt;/strong&gt; — Define, run, and monitor experiments from the AWS Console or API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety controls&lt;/strong&gt; — Stop conditions that automatically halt experiments if metrics breach thresholds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Audit logging&lt;/strong&gt; — Complete visibility into what experiments ran, when, and their outcomes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM integration&lt;/strong&gt; — Fine-grained permissions for who can run which experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Creating an IAM Role for FIS
&lt;/h3&gt;

&lt;p&gt;FIS needs permissions to interact with your EKS cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create the trust policy&lt;/span&gt;
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; fis-trust-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "fis.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
&lt;/span&gt;&lt;span class="no"&gt;EOF

&lt;/span&gt;&lt;span class="c"&gt;# Create the role&lt;/span&gt;
aws iam create-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; fis-chaos-experiment-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--assume-role-policy-document&lt;/span&gt; file://fis-trust-policy.json

&lt;span class="c"&gt;# Attach necessary policies&lt;/span&gt;
aws iam attach-role-policy &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--role-name&lt;/span&gt; fis-chaos-experiment-role &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--policy-arn&lt;/span&gt; arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring EKS for FIS Integration
&lt;/h3&gt;

&lt;p&gt;FIS needs to authenticate to your EKS cluster. Update the aws-auth ConfigMap to map the FIS role. Note that &lt;code&gt;system:masters&lt;/code&gt; is used here for simplicity; in production, bind the role to a least-privilege RBAC group instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-auth&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;mapRoles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;- rolearn: arn:aws:iam::123456789012:role/fis-chaos-experiment-role&lt;/span&gt;
      &lt;span class="s"&gt;username: fis-user&lt;/span&gt;
      &lt;span class="s"&gt;groups:&lt;/span&gt;
        &lt;span class="s"&gt;- system:masters&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Real-World Chaos Experiments with Chaos Mesh
&lt;/h2&gt;

&lt;p&gt;Let's walk through practical experiments that test different failure modes using Chaos Mesh orchestrated through AWS FIS.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment 1 — Network Fault Injection
&lt;/h3&gt;

&lt;p&gt;Network issues are among the most common causes of distributed system failures. This experiment simulates complete network isolation for a target service:&lt;/p&gt;

&lt;p&gt;First, identify your target pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pod &lt;span class="nt"&gt;-n&lt;/span&gt; application &lt;span class="nt"&gt;--show-labels&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;order-service
&lt;span class="c"&gt;# order-service-7b68fd5f58-xk9mn   1/1   Running   app.kubernetes.io/name=order-service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the FIS experiment template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws fis create-experiment-template &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cli-input-json&lt;/span&gt; &lt;span class="s1"&gt;'{
    "description": "Chaos Mesh network partition test",
    "targets": {
      "EKS-Cluster": {
        "resourceType": "aws:eks:cluster",
        "resourceArns": [
          "arn:aws:eks:us-east-1:123456789012:cluster/resilience-cluster"
        ],
        "selectionMode": "ALL"
      }
    },
    "actions": {
      "inject-network-partition": {
        "actionId": "aws:eks:inject-kubernetes-custom-resource",
        "description": "Simulate network partition on order-service",
        "parameters": {
          "kubernetesApiVersion": "chaos-mesh.org/v1alpha1",
          "kubernetesKind": "NetworkChaos",
          "kubernetesNamespace": "chaos-mesh",
          "kubernetesSpec": "{\"action\":\"partition\",\"mode\":\"all\",\"selector\":{\"namespaces\":[\"application\"],\"labelSelectors\":{\"app.kubernetes.io/name\":\"order-service\"}},\"direction\":\"both\"}",
          "maxDuration": "PT2M"
        },
        "targets": {
          "Cluster": "EKS-Cluster"
        }
      }
    },
    "stopConditions": [{ "source": "none" }],
    "roleArn": "arn:aws:iam::123456789012:role/fis-chaos-experiment-role",
    "tags": {
      "Purpose": "chaos-engineering",
      "Team": "platform"
    },
    "logConfiguration": {
      "cloudWatchLogsConfiguration": {
        "logGroupArn": "arn:aws:logs:us-east-1:123456789012:log-group:fis-experiments:*"
      },
      "logSchemaVersion": 2
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
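&lt;p&gt;The escaped &lt;code&gt;kubernetesSpec&lt;/code&gt; string in the template above is hard to read. Unescaped, it corresponds to this standalone Chaos Mesh manifest (a sketch; the metadata name is hypothetical, since FIS creates the resource for you):&lt;/p&gt;

```yaml
# Unescaped equivalent of the kubernetesSpec that FIS injects into the cluster
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: order-service-partition   # hypothetical name; FIS names the resource itself
  namespace: chaos-mesh
spec:
  action: partition
  mode: all
  direction: both
  selector:
    namespaces:
      - application
    labelSelectors:
      app.kubernetes.io/name: order-service
```

&lt;p&gt;You could also apply a manifest like this directly with &lt;code&gt;kubectl apply&lt;/code&gt; to run the same experiment without FIS, at the cost of losing FIS's stop conditions and audit trail.&lt;/p&gt;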



&lt;p&gt;During the experiment, you can observe the impact:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Before experiment&lt;/span&gt;
curl http://order-service.application:8080/health &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="c"&gt;# Response: HTTP/1.1 200 OK&lt;/span&gt;

&lt;span class="c"&gt;# During experiment  &lt;/span&gt;
curl http://order-service.application:8080/health &lt;span class="nt"&gt;-v&lt;/span&gt;
&lt;span class="c"&gt;# Response: curl: (7) Failed to connect - Connection refused&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Experiment 2 — Container Kill with Chaos Mesh
&lt;/h3&gt;

&lt;p&gt;This experiment tests your application's ability to recover from container failures:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws fis create-experiment-template &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--cli-input-json&lt;/span&gt; &lt;span class="s1"&gt;'{
    "description": "Chaos Mesh container termination test",
    "targets": {
      "EKS-Cluster": {
        "resourceType": "aws:eks:cluster",
        "resourceArns": [
          "arn:aws:eks:us-east-1:123456789012:cluster/resilience-cluster"
        ],
        "selectionMode": "ALL"
      }
    },
    "actions": {
      "terminate-container": {
        "actionId": "aws:eks:inject-kubernetes-custom-resource",
        "description": "Kill payment-service container",
        "parameters": {
          "kubernetesApiVersion": "chaos-mesh.org/v1alpha1",
          "kubernetesKind": "PodChaos",
          "kubernetesNamespace": "chaos-mesh",
          "kubernetesSpec": "{\"action\":\"container-kill\",\"mode\":\"one\",\"containerNames\":[\"payment-service\"],\"selector\":{\"namespaces\":[\"application\"],\"labelSelectors\":{\"app.kubernetes.io/name\":\"payment-service\"}}}",
          "maxDuration": "PT1M"
        },
        "targets": {
          "Cluster": "EKS-Cluster"
        }
      }
    },
    "stopConditions": [{ "source": "none" }],
    "roleArn": "arn:aws:iam::123456789012:role/fis-chaos-experiment-role",
    "tags": {
      "Purpose": "chaos-engineering"
    }
  }'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
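&lt;p&gt;As with the network experiment, the &lt;code&gt;kubernetesSpec&lt;/code&gt; above unescapes to a plain Chaos Mesh resource (a sketch; the name is hypothetical):&lt;/p&gt;

```yaml
# Unescaped equivalent of the PodChaos spec in the FIS template above
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: payment-service-container-kill   # hypothetical name
  namespace: chaos-mesh
spec:
  action: container-kill
  mode: one                  # kill the container in exactly one matching pod
  containerNames:
    - payment-service
  selector:
    namespaces:
      - application
    labelSelectors:
      app.kubernetes.io/name: payment-service
```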



&lt;p&gt;Run the experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws fis start-experiment &lt;span class="nt"&gt;--experiment-template-id&lt;/span&gt; EXTabc123def456
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monitor experiment status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws fis get-experiment &lt;span class="nt"&gt;--id&lt;/span&gt; EXPxyz789abc123 | jq &lt;span class="s1"&gt;'.experiment.state'&lt;/span&gt;
&lt;span class="c"&gt;# Output: { "status": "completed" }&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify the container restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pod &lt;span class="nt"&gt;-n&lt;/span&gt; application | &lt;span class="nb"&gt;grep &lt;/span&gt;payment-service
&lt;span class="c"&gt;# payment-service-5d8f9c7b6a-m2k9p   1/1   Running   1 (3m12s ago)   8m45s&lt;/span&gt;

kubectl describe pod &lt;span class="nt"&gt;-n&lt;/span&gt; application payment-service-5d8f9c7b6a-m2k9p | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A5&lt;/span&gt; &lt;span class="s2"&gt;"Events:"&lt;/span&gt;
&lt;span class="c"&gt;# Events:&lt;/span&gt;
&lt;span class="c"&gt;#   Normal  Pulled   3m15s (x2 over 8m48s)  kubelet  Container image already present&lt;/span&gt;
&lt;span class="c"&gt;#   Normal  Created  3m15s (x2 over 8m48s)  kubelet  Created container payment-service&lt;/span&gt;
&lt;span class="c"&gt;#   Normal  Started  3m14s (x2 over 8m47s)  kubelet  Started container payment-service&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The restart count of 1 confirms the chaos injection worked, and the Running status confirms Kubernetes successfully recovered the pod.&lt;/p&gt;




&lt;h2&gt;
  
  
  Measuring Success — What to Monitor
&lt;/h2&gt;

&lt;p&gt;Chaos experiments are only valuable if you can observe their impact. Key metrics to track:&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Request latency (p50, p95, p99)&lt;/li&gt;
&lt;li&gt;Error rates and HTTP status codes&lt;/li&gt;
&lt;li&gt;Request throughput and queue depths&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Kubernetes Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Pod restart counts&lt;/li&gt;
&lt;li&gt;Container CPU/memory during recovery&lt;/li&gt;
&lt;li&gt;Time to pod ready state&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  AWS Metrics
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;EKS control plane API latency&lt;/li&gt;
&lt;li&gt;Node health status across AZs&lt;/li&gt;
&lt;li&gt;Application Load Balancer healthy target counts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AWS CloudWatch Container Insights provides much of this automatically for EKS clusters. For deeper application-level observability, consider integrating with AWS X-Ray for distributed tracing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementing Safety Guardrails
&lt;/h2&gt;

&lt;p&gt;Chaos engineering isn't about breaking things carelessly. AWS FIS provides stop conditions that automatically halt experiments when things go wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stopConditions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws:cloudwatch:alarm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:cloudwatch:us-east-1:123456789012:alarm:ServiceErrorRateHigh"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
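&lt;p&gt;A stop condition is only as effective as the alarm behind it. As an illustration (the names and thresholds here are placeholders, not values from the repository), an error-rate alarm could be defined in JSON and passed to &lt;code&gt;aws cloudwatch put-metric-alarm --cli-input-json&lt;/code&gt;:&lt;/p&gt;

```json
{
  "AlarmName": "ServiceErrorRateHigh",
  "Namespace": "AWS/ApplicationELB",
  "MetricName": "HTTPCode_Target_5XX_Count",
  "Statistic": "Sum",
  "Period": 60,
  "EvaluationPeriods": 2,
  "Threshold": 50,
  "ComparisonOperator": "GreaterThanThreshold",
  "TreatMissingData": "notBreaching"
}
```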






&lt;h2&gt;
  
  
  The Complete Experiment Flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyq58x737qumkuzz8vih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyq58x737qumkuzz8vih.png" alt="Experiment Complete Flow" width="800" height="3585"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Best Practices for Safe Chaos Experiments
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start in non-production environments&lt;/strong&gt; — Validate experiments in staging before production&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Define clear rollback procedures&lt;/strong&gt; — Know how to quickly restore normal operation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use blast radius controls&lt;/strong&gt; — Target specific pods/services rather than entire clusters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run during business hours&lt;/strong&gt; — Have engineers available to respond if needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Communicate with stakeholders&lt;/strong&gt; — Ensure relevant teams know experiments are planned&lt;/li&gt;
&lt;/ol&gt;
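&lt;p&gt;Blast radius control (practice 3) maps directly onto Chaos Mesh selectors. Rather than &lt;code&gt;mode: all&lt;/code&gt;, an experiment can cap how many pods it touches; a sketch with illustrative values:&lt;/p&gt;

```yaml
# Sketch: limit a pod-kill experiment to at most 20% of the selected pods
spec:
  action: pod-kill
  mode: fixed-percent
  value: "20"                # Chaos Mesh expects the percentage as a string
  selector:
    namespaces:
      - application
    labelSelectors:
      app.kubernetes.io/name: order-service
```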




&lt;h2&gt;
  
  
  Breaking Things on Purpose
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr08h242kb9ydmzhvvwoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr08h242kb9ydmzhvvwoa.png" alt="You Break , You Pay" width="800" height="1115"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The journey from reactive incident response to proactive resilience engineering represents a fundamental shift in how we think about system reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Celebrating 99.9% Uptime to Fearing It
&lt;/h3&gt;

&lt;p&gt;For years, teams celebrated high uptime numbers as proof of system health. But 99.9% availability still means 8.76 hours of downtime per year—and those hours always seem to happen at the worst possible moment. The realization sets in — we don't actually know &lt;em&gt;why&lt;/em&gt; our systems stay up, which means we don't know what will bring them down.&lt;/p&gt;
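&lt;p&gt;The downtime arithmetic generalizes to any availability target:&lt;/p&gt;

```shell
# Allowed downtime per year implied by an availability target
avail=99.9
downtime=$(awk -v a="$avail" 'BEGIN { printf "%.2f", (100 - a) / 100 * 8760 }')
echo "${avail}% availability still allows ${downtime} hours of downtime per year"
# -> 99.9% availability still allows 8.76 hours of downtime per year
```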

&lt;h3&gt;
  
  
  The Shift from "Prevent Failure" to "Embrace Failure"
&lt;/h3&gt;

&lt;p&gt;Traditional engineering tries to eliminate all possible failure modes. Modern resilience engineering accepts that failure is inevitable and focuses on minimizing impact. This isn't pessimism—it's realism. Complex distributed systems have emergent behaviors that no amount of unit testing can predict.&lt;/p&gt;

&lt;h3&gt;
  
  
  From Netflix's Chaos Monkey to Chaos Engineering
&lt;/h3&gt;

&lt;p&gt;Netflix pioneered this approach with Chaos Monkey in 2011—a tool that randomly terminated EC2 instances in production. The idea seemed radical — why would you intentionally break your own systems? The answer became clear — because you'd rather discover weaknesses on your terms, during business hours, with engineers ready to respond, than at 3 AM during peak traffic.&lt;/p&gt;

&lt;p&gt;Today, chaos engineering has evolved far beyond random instance termination. Tools like Chaos Mesh enable sophisticated experiments — network partitions, DNS failures, clock skew, JVM faults, and more. AWS FIS brings this into the enterprise with centralized management, safety controls, and full audit trails.&lt;/p&gt;
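
&lt;p&gt;These experiments are declared as ordinary Kubernetes resources. As an illustrative sketch (the service name, namespace, and timing values are hypothetical), a Chaos Mesh &lt;code&gt;NetworkChaos&lt;/code&gt; resource that injects latency into a payment service might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: payment-latency
  namespace: chaos-testing
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - staging
    labelSelectors:
      app: payment-service
  delay:
    latency: "200ms"   # added network delay
    jitter: "50ms"
  duration: "5m"       # the fault is automatically lifted after five minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;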

&lt;h3&gt;
  
  
  Recovery Isn't the Only Goal
&lt;/h3&gt;

&lt;p&gt;The most important outcome of chaos experiments isn't proving your system can recover—it's what you learn in the process. Each experiment reveals something about your architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How does the system behave under partial failure?&lt;/li&gt;
&lt;li&gt;Do circuit breakers trigger at the right thresholds?&lt;/li&gt;
&lt;li&gt;Are timeout values appropriate?&lt;/li&gt;
&lt;li&gt;Do health checks accurately reflect service health?&lt;/li&gt;
&lt;li&gt;How long does recovery actually take?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This learning feeds back into system improvements, creating a virtuous cycle of increasing resilience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The shift from "prevent failure" to "embrace failure" represents a fundamental change in how we build reliable systems. By combining Amazon EKS's orchestration capabilities with AWS Fault Injection Simulator's enterprise-grade chaos management and Chaos Mesh's Kubernetes-native failure injection, you can build platforms that don't just claim to be resilient—they prove it.&lt;/p&gt;

&lt;p&gt;Your systems will fail. The only question is — &lt;strong&gt;will you learn from it?&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  About the author
&lt;/h2&gt;

&lt;p&gt;Elad Hirsch is a Tech Lead at TeraSky CTO Office, a global provider of multi-cloud, cloud-native, and innovative IT solutions. With experience in principal engineering positions at Agmatix, Jfrog, IDI, and Finjan Security, his primary areas of expertise revolve around software architecture and DevOps practices. He is a proactive advocate for fostering a DevOps culture, enabling organizations to improve their software architecture and streamline operations in cloud-native environments.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #AWS #Kubernetes #ChaosEngineering #EKS #DevOps #SRE #CloudNative #Resilience #ChaosMesh #FaultInjection&lt;/p&gt;

</description>
      <category>devops</category>
      <category>testing</category>
      <category>aws</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>From 30 Minutes to 4 - How EBS Volume Cloning Transformed Our Customer CI Pipeline</title>
      <dc:creator>Elad Hirsch</dc:creator>
      <pubDate>Wed, 03 Dec 2025 20:14:22 +0000</pubDate>
      <link>https://dev.to/eladh/from-30-minutes-to-4-how-ebs-volume-cloning-transformed-our-ci-pipeline-2b1o</link>
      <guid>https://dev.to/eladh/from-30-minutes-to-4-how-ebs-volume-cloning-transformed-our-ci-pipeline-2b1o</guid>
      <description>&lt;h2&gt;
  
  
  The Silent Killer of Developer Productivity
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31f2b2o28pkedmcmtrmf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F31f2b2o28pkedmcmtrmf.png" alt="Waiting for an operation to be completed" width="600" height="492"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a developer, I know the frustration of waiting for a build. You push your code and, instead of staring at the screen, you switch to another task, lose context, and by the time the CI pipeline finishes, you've forgotten what you were working on. In my customer's case, this frustration had a very specific price: &lt;strong&gt;30 minutes per build&lt;/strong&gt;, mostly spent downloading Maven dependencies.&lt;/p&gt;

&lt;p&gt;Their setup wasn't unusual. They had GitHub Actions self-hosted runners spinning up on Amazon EKS for each triggered CI job. These runners needed Maven packages stored in their on-premises Nexus repository, connected via VPN. The architecture made sense on paper—until they measured the actual latency impact.&lt;/p&gt;

&lt;p&gt;The math was crystal clear. With dozens of developers triggering hundreds of builds daily, they were hemorrhaging thousands of developer-hours monthly just waiting for package downloads. Something had to change.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Root Cause
&lt;/h2&gt;

&lt;p&gt;Before diving into solutions, they needed to understand why their builds were so slow. The diagnosis revealed several compounding factors:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network Latency&lt;/strong&gt; - Every CI job started fresh. A new runner pod meant a clean slate—no cached dependencies, no memory of previous builds. Each time, Maven dutifully reached across the VPN to their on-premises Nexus server, pulling hundreds of megabytes of packages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPN Overhead&lt;/strong&gt; -  The VPN connection added its own latency tax. What would be milliseconds on a local network became seconds across the encrypted tunnel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Package Volume&lt;/strong&gt; - Their monorepo had accumulated years of dependencies. A full Maven dependency tree meant downloading a substantial chunk of data for every single build.&lt;/p&gt;

&lt;p&gt;And lastly &lt;strong&gt;Ephemeral Infrastructure&lt;/strong&gt; -  The beauty of EKS-based runners is their isolation and cleanliness. The curse is starting from zero every time.&lt;/p&gt;

&lt;p&gt;They needed a way to preserve the benefits of ephemeral, isolated runners while eliminating the cold-start penalty of downloading dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Road of Failed Experiments
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Take #1 - Docker Layer Caching
&lt;/h3&gt;

&lt;p&gt;The obvious first solution was Docker layer caching. If they could cache the Maven dependencies in a Docker layer, subsequent builds could reuse them. They implemented a multi-stage Dockerfile that installed dependencies in an early layer, theoretically allowing Docker to skip the download step if nothing changed.&lt;/p&gt;

&lt;p&gt;The reality was messier than the theory.&lt;/p&gt;

&lt;p&gt;Maven dependency management is notoriously cache-unfriendly. When a single package version updates—even a minor patch—the entire layer containing dependencies becomes invalid. Docker layer caching works on an all-or-nothing principle at each layer. One changed dependency means re-downloading everything.&lt;/p&gt;

&lt;p&gt;In practice, their layer cache was invalidated almost daily. Sometimes multiple times per day. The "optimization" became a false promise—their builds were consistently slow, with occasional fast runs that made the slow ones feel even more painful.&lt;/p&gt;

&lt;h3&gt;
  
  
  Take #2 - EBS Snapshots
&lt;/h3&gt;

&lt;p&gt;Their second approach leveraged EBS snapshots. They created a snapshot of a volume containing all their Maven dependencies, then restored it for each CI run.&lt;/p&gt;

&lt;p&gt;This improved their build times to approximately 9 minutes—a meaningful improvement, but still far from optimal. The snapshot restoration process, while faster than downloading packages over VPN, still added significant overhead. EBS snapshots are designed for disaster recovery and data persistence, not high-frequency, low-latency access patterns.&lt;/p&gt;

&lt;p&gt;They were getting closer, but they knew there had to be a better way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why We Rejected Other Alternatives
&lt;/h3&gt;

&lt;p&gt;During this period, they explored several other options that ultimately didn't work out:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 Mount (s3fs/goofys)&lt;/strong&gt; - They experimented with mounting an S3 bucket containing their Maven cache directly into the runner pods. The latency was better than the VPN, reducing build times to around 10 minutes. However, S3's object storage semantics don't align well with the random-access patterns of Maven builds. The filesystem abstraction layer added its own overhead, and they saw inconsistent performance based on S3 service conditions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon EFS&lt;/strong&gt; - Elastic File System seemed promising—a managed NFS solution that multiple pods could mount simultaneously. However, two concerns stopped them. First, cost: EFS pricing for their access patterns would have been significant. Second, and more concerning, they worried about potential file corruption with multiple concurrent writers. While EFS handles this technically, their Maven usage patterns weren't designed with shared filesystem semantics in mind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dedicated EC2 with Local Nexus&lt;/strong&gt; -  They considered running their own Nexus proxy on an EC2 instance with fast EBS storage. This would have solved the latency problem but introduced new complexity: managing another server, handling inter-AZ traffic costs, and maintaining yet another piece of infrastructure. The operational overhead didn't justify the benefit.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Final Take - EBS Volume Cloning
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6plnzlics0l1xg61tvm2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6plnzlics0l1xg61tvm2.png" alt="EBS Volume Cloning" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The solution that finally worked came from an often-overlooked EBS feature: &lt;strong&gt;volume cloning&lt;/strong&gt;. Unlike snapshots, which create a point-in-time copy that needs to be restored, cloned volumes are immediately usable. The cloning operation leverages EBS's underlying storage architecture to create what's effectively a copy-on-write reference to the original volume.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;Their final solution consists of three components working in harmony:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Base Volume&lt;/strong&gt; -  A 100GB GP3 EBS volume named &lt;code&gt;maven-cache-shared&lt;/code&gt; serves as the authoritative source of Maven packages. This volume lives in their cluster, always available, always up-to-date.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Nightly Updater&lt;/strong&gt; - A cronjob called &lt;code&gt;maven-cache-updater&lt;/code&gt; runs every night. Its job is simple but crucial:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Mount the shared base volume&lt;/li&gt;
&lt;li&gt;Run Maven commands to fetch any new or updated packages from the remote Nexus server&lt;/li&gt;
&lt;li&gt;Unmount the volume immediately after&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This nightly synchronization means their base volume is never more than 24 hours stale. For most practical purposes, it contains everything their builds need.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On-Demand Cloning&lt;/strong&gt; -  When a CI job triggers, the magic happens. Instead of downloading packages or restoring snapshots, they create a clone of the base volume. This clone becomes the runner's Maven cache.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Workflow in Action
&lt;/h3&gt;

&lt;p&gt;Here's what happens when a developer pushes code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F284t9x1tejm5uuz1pn4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F284t9x1tejm5uuz1pn4y.png" alt="Workflow" width="800" height="1239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Job Trigger&lt;/strong&gt; - GitHub Actions receives the push event and triggers their CI workflow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Runner Instantiation&lt;/strong&gt; -  A new self-hosted runner pod spins up on EKS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volume Cloning&lt;/strong&gt; -  The runner's init process creates a new PVC (Persistent Volume Claim) configured as a clone of &lt;code&gt;maven-cache-shared&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Execution&lt;/strong&gt; -  Maven runs against the cloned volume, finding all dependencies already present&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleanup&lt;/strong&gt; -  After CI completion, both the runner pod and the cloned PVC are terminated&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The entire process—from push to build completion—now takes &lt;strong&gt;3.5 to 4 minutes&lt;/strong&gt;. The cloning operation itself accounts for just 20-25 seconds of that time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Implementation
&lt;/h2&gt;

&lt;p&gt;The Kubernetes configuration for this setup involves a few key pieces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache-runner-${RUN_ID}&lt;/span&gt; &lt;span class="c1"&gt;# templated per run&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;accessModes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ReadWriteOnce&lt;/span&gt;
  &lt;span class="na"&gt;storageClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp3&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;100Gi&lt;/span&gt;
  &lt;span class="na"&gt;dataSource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PersistentVolumeClaim&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache-shared&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical element is the &lt;code&gt;dataSource&lt;/code&gt; field. By specifying an existing PVC as the data source, Kubernetes instructs the EBS CSI driver to create a clone rather than an empty volume.&lt;/p&gt;
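
&lt;p&gt;To show how the clone is consumed, here is a simplified and partly illustrative runner pod spec (the container command and mount path are assumptions, not the customer's exact configuration); Maven is pointed at the mounted clone via &lt;code&gt;-Dmaven.repo.local&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: ci-runner-${RUN_ID}        # templated per run
spec:
  restartPolicy: Never
  containers:
  - name: runner
    image: maven:3.9-eclipse-temurin-17
    # resolve dependencies from the pre-warmed clone instead of the network
    command: ["mvn", "-Dmaven.repo.local=/maven-cache", "verify"]
    volumeMounts:
    - name: maven-cache
      mountPath: /maven-cache
  volumes:
  - name: maven-cache
    persistentVolumeClaim:
      claimName: maven-cache-runner-${RUN_ID}  # the per-run clone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;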

&lt;p&gt;The nightly updater cronjob looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CronJob&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache-updater&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*"&lt;/span&gt;  &lt;span class="c1"&gt;# 2 AM daily&lt;/span&gt;
  &lt;span class="na"&gt;jobTemplate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;updater&lt;/span&gt;
            &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven:3.9-eclipse-temurin-17&lt;/span&gt;
            &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/bin/sh&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;-c&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
              &lt;span class="s"&gt;cd /maven-cache&lt;/span&gt;
              &lt;span class="s"&gt;mvn -Dmaven.repo.local=/maven-cache dependency:go-offline&lt;/span&gt;
            &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache&lt;/span&gt;
              &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/maven-cache&lt;/span&gt;
          &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache&lt;/span&gt;
            &lt;span class="na"&gt;persistentVolumeClaim&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;claimName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;maven-cache-shared&lt;/span&gt;
          &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OnFailure&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why This Works So Well
&lt;/h2&gt;

&lt;p&gt;The elegance of this solution lies in how it aligns with both EBS's strengths and their operational requirements:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Instant Availability&lt;/strong&gt; -  EBS cloning is nearly instantaneous because it doesn't copy data immediately. The clone starts as a reference to the original volume's data blocks. Only when writes occur does the copy-on-write mechanism create new blocks. For their read-heavy Maven cache workload, this is perfect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Complete Isolation&lt;/strong&gt; - Each CI run gets its own volume. There's no risk of one build corrupting another's cache. No lock contention. No race conditions. Each runner operates in blissful isolation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predictable Performance&lt;/strong&gt; - GP3 volumes provide consistent IOPS and throughput regardless of volume size. Their cloned volumes perform identically to the base volume from the moment they're created.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cost Efficiency&lt;/strong&gt; (what they pay for):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One 100GB GP3 base volume (running 24/7)&lt;/li&gt;
&lt;li&gt;Cloned volumes for the duration of each CI run (~4.5 minutes on average)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cloned volumes exist for such short periods that their cost is negligible. A 100GB GP3 volume costs roughly $8/month. Their cloned volumes, existing for minutes at a time, add pennies to the monthly bill.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Simplicity&lt;/strong&gt; -  There's no complex caching logic to maintain. No cache invalidation strategies to debug. The nightly updater ensures freshness, and the cloning mechanism handles distribution.&lt;/p&gt;

&lt;h2&gt;
  
  
  Moving Forward - Future Optimizations
&lt;/h2&gt;

&lt;p&gt;Their current implementation works well, but there's still room for improvement: volume right-sizing, a multi-region strategy, and cache warming. We'll explore these in a follow-up article. :-) &lt;/p&gt;

&lt;h2&gt;
  
  
  About the author
&lt;/h2&gt;

&lt;p&gt;Elad Hirsch is a Tech Lead at TeraSky CTO Office, a global provider of multi-cloud, cloud-native, and innovative IT solutions. With experience in principal engineering positions at Agmatix, Jfrog, IDI, and Finjan Security, his primary areas of expertise revolve around software architecture and DevOps practices. He is a proactive advocate for fostering a DevOps culture, enabling organizations to improve their software architecture and streamline operations in cloud-native environments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you implemented similar optimizations in your CI pipeline? I'd love to hear about your approaches in the comments.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: #AWS #DevOps #CICD #Kubernetes #EBS #GitHubActions #Maven #Performance #CloudEngineering&lt;/p&gt;

</description>
      <category>aws</category>
      <category>cicd</category>
      <category>performance</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
