The Silent Killer of Developer Productivity
As a developer, I know the frustration of waiting for a build. You push your code and, instead of just staring at the screen, you switch to another task, lose context, and by the time the CI pipeline finishes, you've forgotten what you were working on. In my customer's case, this frustration had a very specific cost: 30 minutes per build, mostly spent downloading Maven dependencies.
Their setup wasn't unusual. They had GitHub Actions self-hosted runners spinning up on Amazon EKS for each triggered CI job. These runners needed Maven packages stored in their on-premises Nexus repository, connected via VPN. The architecture made sense on paper—until they measured the actual latency impact.
The math was crystal clear. With dozens of developers triggering hundreds of builds daily, they were hemorrhaging thousands of developer-hours monthly just waiting for package downloads. Something had to change.
Understanding the Root Cause
Before diving into solutions, they needed to understand why their builds were so slow. The diagnosis revealed several compounding factors:
Network Latency - Every CI job started fresh. A new runner pod meant a clean slate—no cached dependencies, no memory of previous builds. Each time, Maven dutifully reached across the VPN to their on-premises Nexus server, pulling hundreds of megabytes of packages.
VPN Overhead - The VPN connection added its own latency tax. What would be milliseconds on a local network became seconds across the encrypted tunnel.
Package Volume - Their monorepo had accumulated years of dependencies. A full Maven dependency tree meant downloading a substantial chunk of data for every single build.
Ephemeral Infrastructure - The beauty of EKS-based runners is their isolation and cleanliness. The curse is starting from zero every time.
They needed a way to preserve the benefits of ephemeral, isolated runners while eliminating the cold-start penalty of downloading dependencies.
The Road of Failed Experiments
Take #1 - Docker Layer Caching
The obvious first solution was Docker layer caching. If they could cache the Maven dependencies in a Docker layer, subsequent builds could reuse them. They implemented a multi-stage Dockerfile that installed dependencies in an early layer, theoretically allowing Docker to skip the download step if nothing changed.
The reality was messier than the theory.
Maven dependency management is notoriously cache-unfriendly. When a single package version updates—even a minor patch—the entire layer containing dependencies becomes invalid. Docker layer caching works on an all-or-nothing principle at each layer. One changed dependency means re-downloading everything.
In practice, their layer cache was invalidated almost daily. Sometimes multiple times per day. The "optimization" became a false promise—their builds were consistently slow, with occasional fast runs that made the slow ones feel even more painful.
Take #2 - EBS Snapshots
Their second approach leveraged EBS snapshots. They created a snapshot of a volume containing all their Maven dependencies, then restored it for each CI run.
This improved their build times to approximately 9 minutes—a meaningful improvement, but still far from optimal. The snapshot restoration process, while faster than downloading packages over VPN, still added significant overhead. EBS snapshots are designed for disaster recovery and data persistence, not high-frequency, low-latency access patterns.
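Assuming they went through the EBS CSI driver's snapshot support, the per-run restore would have looked roughly like the sketch below. This is an illustration of the approach rather than their actual manifest, and the VolumeSnapshot name is hypothetical:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: maven-cache-restored-${RUN_ID} # hypothetical, one restore per CI run
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: maven-cache-snapshot # hypothetical snapshot of the dependency volume
Each restore provisions a fresh EBS volume from the snapshot, which is where the remaining overhead came from.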
They were getting closer, but they knew there had to be a better way.
Why They Rejected Other Alternatives
During this period, they explored several other options that ultimately didn't work out:
S3 Mount (s3fs/goofys) - They experimented with mounting an S3 bucket containing their Maven cache directly into the runner pods. The latency was better than the VPN, reducing build times to around 10 minutes. However, S3's object storage semantics don't align well with the random-access patterns of Maven builds. The filesystem abstraction layer added its own overhead, and they saw inconsistent performance based on S3 service conditions.
Amazon EFS - Elastic File System seemed promising—a managed NFS solution that multiple pods could mount simultaneously. However, two concerns stopped them. First, cost: EFS pricing for their access patterns would have been significant. Second, and more concerning, they worried about potential file corruption with multiple concurrent writers. While EFS handles this technically, their Maven usage patterns weren't designed with shared filesystem semantics in mind.
Dedicated EC2 with Local Nexus - They considered running their own Nexus proxy on an EC2 instance with fast EBS storage. This would have solved the latency problem but introduced new complexity: managing another server, handling inter-AZ traffic costs, and maintaining yet another piece of infrastructure. The operational overhead didn't justify the benefit.
The Final Take - EBS Volume Cloning
The solution that finally worked came from an often-overlooked EBS feature: volume cloning. Unlike snapshots, which create a point-in-time copy that needs to be restored, cloned volumes are immediately usable. The cloning operation leverages EBS's underlying storage architecture to create what's effectively a copy-on-write reference to the original volume.
The Architecture
Their final solution consists of three components working in harmony:
The Base Volume - A 100GB GP3 EBS volume named maven-cache-shared serves as the authoritative source of Maven packages. This volume lives in their cluster, always available, always up-to-date.
The Nightly Updater - A cronjob called maven-cache-updater runs every night. Its job is simple but crucial:
- Mount the shared base volume
- Run Maven commands to fetch any new or updated packages from the remote Nexus server
- Unmount the volume immediately after
This nightly synchronization means their base volume is never more than 24 hours stale. For most practical purposes, it contains everything their builds need.
On-Demand Cloning - When a CI job triggers, the magic happens. Instead of downloading packages or restoring snapshots, they create a clone of the base volume. This clone becomes the runner's Maven cache.
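For reference, here is a minimal sketch of what the base volume and its storage class could look like. The gp3 StorageClass definition and its parameters are assumptions, since their actual manifests aren't shown in this article:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com # the EBS CSI driver handles provisioning and cloning
parameters:
  type: gp3
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: maven-cache-shared
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi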
The Workflow in Action
Here's what happens when a developer pushes code:
- Job Trigger - GitHub Actions receives the push event and triggers their CI workflow
- Runner Instantiation - A new self-hosted runner pod spins up on EKS
- Volume Cloning - The runner's init process creates a new PVC (Persistent Volume Claim) configured as a clone of maven-cache-shared
- Build Execution - Maven runs against the cloned volume, finding all dependencies already present
- Cleanup - After CI completion, both the runner pod and the cloned PVC are terminated
The entire process—from push to build completion—now takes 3.5 to 4 minutes. The cloning operation itself accounts for just 20-25 seconds of that time.
The Technical Implementation
The Kubernetes configuration for this setup involves a few key pieces:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: maven-cache-runner-${RUN_ID} # templated per run
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3
  resources:
    requests:
      storage: 100Gi
  dataSource:
    kind: PersistentVolumeClaim
    name: maven-cache-shared
The critical element is the dataSource field. By specifying an existing PVC as the data source, Kubernetes instructs the EBS CSI driver to create a clone rather than an empty volume.
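How the clone is consumed depends on the runner controller in use, but as a minimal sketch (the pod name, image, and mount path below are assumptions), the runner pod simply references the per-run claim like any other PVC and points Maven at it:
apiVersion: v1
kind: Pod
metadata:
  name: gha-runner-${RUN_ID} # hypothetical, created per CI run
spec:
  containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      env:
        - name: MAVEN_OPTS
          value: "-Dmaven.repo.local=/maven-cache" # point Maven's local repo at the clone
      volumeMounts:
        - name: maven-cache
          mountPath: /maven-cache
  volumes:
    - name: maven-cache
      persistentVolumeClaim:
        claimName: maven-cache-runner-${RUN_ID} # the cloned PVC from the manifest above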
The nightly updater cronjob looks something like this:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: maven-cache-updater
spec:
  schedule: "0 2 * * *" # 2 AM daily
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: updater
              image: maven:3.9-eclipse-temurin-17
              command:
                - /bin/sh
                - -c
                - |
                  # Resolve the project's dependencies into the mounted cache volume.
                  # Assumes the project sources (pom.xml) are available to the job and
                  # that ${NEXUS_URL} is configured as a mirror in settings.xml.
                  mvn dependency:go-offline -Dmaven.repo.local=/maven-cache
              volumeMounts:
                - name: maven-cache
                  mountPath: /maven-cache
          volumes:
            - name: maven-cache
              persistentVolumeClaim:
                claimName: maven-cache-shared
          restartPolicy: OnFailure
Why This Works So Well
The elegance of this solution lies in how it aligns with both EBS's strengths and their operational requirements:
Instant Availability - EBS cloning is nearly instantaneous because it doesn't copy data immediately. The clone starts as a reference to the original volume's data blocks. Only when writes occur does the copy-on-write mechanism create new blocks. For their read-heavy Maven cache workload, this is perfect.
Complete Isolation - Each CI run gets its own volume. There's no risk of one build corrupting another's cache. No lock contention. No race conditions. Each runner operates in blissful isolation.
Predictable Performance - GP3 volumes provide consistent IOPS and throughput regardless of volume size. Their cloned volumes perform identically to the base volume from the moment they're created.
Cost Efficiency - What they pay for:
- One 100GB GP3 base volume (running 24/7)
- Cloned volumes for the duration of each CI run (~4.5 minutes on average)
The cloned volumes exist for such short periods that their cost is negligible. A 100GB GP3 volume costs roughly $8/month. Their cloned volumes, existing for minutes at a time, add pennies to the monthly bill.
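As a rough sanity check, at gp3's list price of roughly $0.08 per GB-month, a 100GB clone that lives for about 4.5 minutes costs around 100 x 0.08 x (4.5 / 43,200 minutes in a month) ≈ $0.0008, a small fraction of a cent per build.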
Operational Simplicity - There's no complex caching logic to maintain. No cache invalidation strategies to debug. The nightly updater ensures freshness, and the cloning mechanism handles distribution.
Future Optimizations
Their current implementation works well, but there's still room for improvement: volume right-sizing, a multi-region strategy, and cache warming. We'll explore these in a follow-up article :-)
About the author
Elad Hirsch is a Tech Lead at TeraSky CTO Office, a global provider of multi-cloud, cloud-native, and innovative IT solutions. With experience in principal engineering positions at Agmatix, Jfrog, IDI, and Finjan Security, his primary areas of expertise revolve around software architecture and DevOps practices. He is a proactive advocate for fostering a DevOps culture, enabling organizations to improve their software architecture and streamline operations in cloud-native environments.
Have you implemented similar optimizations in your CI pipeline? I'd love to hear about your approaches in the comments.
Tags: #AWS #DevOps #CICD #Kubernetes #EBS #GitHubActions #Maven #Performance #CloudEngineering


