Mustafa ERBAY

Posted on May 31 • Originally published at mustafaerbay.com.tr

Build Cache Strategies: The Operational Burden of Speed

#buildcache #cicd #operasyon #performance

Build times, especially in enterprise software projects, are often an overlooked bottleneck that leads to significant inefficiencies. While working on a production ERP, deployments taking hours after merging a new feature branch became a problem that tested developers' patience and slowed down our delivery speed. Building from scratch with every small change was both a waste of resources and a major operational burden.

To solve this problem, I implemented various build cache strategies. My goal was not only to speed up builds but also to manage the operational complexity that comes with this speed. Speed often brings new problems: keeping the cache correct, manageable, and secure requires constant attention.

Why Were Build Times a Problem? Symptoms and Initial Observations

In a client project, especially when we transitioned to a microservices architecture, build times started to increase sharply. Each service had its own dependencies and its own compilation steps, and all of them were built from scratch every time a CI/CD pipeline ran. The average deployment time after a git push, even for a small change, reached 45 minutes. When there was a major dependency update, this could extend to 1.5 hours.

This situation prolonged the feedback loop for developers, causing them to wait constantly. A developer who pushed a change first thing in the morning would only see the test results in the afternoon. This was a major obstacle, especially during a period when we expected rapid iterations. The main symptoms I observed were:

High CPU and Disk Usage: Build servers were constantly running at 90%+ CPU and IOPS values. Disks were filling up quickly, and storage costs were increasing due to temporary build files.
Low Developer Productivity: Developers found it difficult to focus on other tasks while waiting for builds to finish, forcing them to context switch.
CI/CD Queues: When multiple developers pushed changes simultaneously, long queues formed in the CI/CD pipelines. This further extended everyone's waiting time.

ℹ️ My Observation

In my experience, as build times increase, developers tend to accumulate small changes. This can lead to larger and riskier integrations. Build cache is one of the keys to breaking this vicious cycle.

At the root of these problems was the fact that each build step ran independently and in isolation. Docker does provide a remedy in principle: when creating images, each RUN command creates a new layer, and Docker reuses the cached layer as long as that command and all layers before it are unchanged. However, when dependency files (like package.json, requirements.txt) change frequently or build steps are not well ordered, this cache mechanism is not effective enough. My first goal was to find ways to use this fundamental Docker layer caching principle more efficiently.

Different Build Cache Mechanisms and My Preferences

To optimize build times, I researched and implemented different cache mechanisms in my projects. Each has its own advantages and operational overheads.

1. In-Container (Layer) Caching

Docker's built-in layer caching mechanism is the most basic and often the most effective method. Instructions such as RUN, COPY, and ADD each create a layer. If an instruction, its inputs (for COPY and ADD, the contents of the copied files), and every layer before it are unchanged, Docker reuses the layer from the cache instead of rebuilding it.

While working on a production ERP, I learned to better utilize this cache by optimizing Dockerfiles. For example, separating the dependency installation step (which doesn't change often) from the application code (which changes frequently) significantly shortens build times:

# Base image and working directory are assumptions for illustration
FROM node:20
WORKDIR /app

# First copy only the dependency manifests
COPY package.json package-lock.json ./
# Install dependencies (served from cache while both manifests are unchanged)
RUN npm ci

# Now copy application code
COPY . .
# Compile/build the application
RUN npm run build

This simple change ensured that the npm ci step was served from the cache whenever package.json and package-lock.json were unchanged. In my ERP project, this small optimization reduced the dependency installation step from 3-5 minutes to mere seconds. However, this method alone wasn't sufficient, because the layer cache lives on the machine that ran the build. Different CI/CD runners or freshly provisioned machines couldn't benefit from each other's cache, even if one of them had built the same layer before.
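A toy model can make the ordering effect concrete. The sketch below is not Docker's actual algorithm; it assumes a simplified rule in which each step's cache key is a hash of the previous key, the instruction text, and the contents of any copied files. The file names and contents are invented for illustration.

```python
import hashlib

def layer_keys(steps, files):
    """Return one cache key per step. Each key chains the parent key,
    the instruction text, and the contents of the files the step copies."""
    keys = []
    parent = ""
    for instruction, copied in steps:
        h = hashlib.sha256()
        h.update(parent.encode())
        h.update(instruction.encode())
        for name in sorted(copied):
            h.update(name.encode())
            h.update(files[name].encode())
        parent = h.hexdigest()
        keys.append(parent)
    return keys

# The four cacheable steps of the Dockerfile above, with the files each COPY reads
STEPS = [
    ("COPY package.json package-lock.json ./", ["package.json", "package-lock.json"]),
    ("RUN npm ci", []),
    ("COPY . .", ["package.json", "package-lock.json", "src/index.js"]),
    ("RUN npm run build", []),
]

def cached_steps(old_files, new_files):
    """For each step, report whether the second build could reuse the first build's layer."""
    old = layer_keys(STEPS, old_files)
    new = layer_keys(STEPS, new_files)
    return [a == b for a, b in zip(old, new)]

before = {"package.json": "{}", "package-lock.json": "{}", "src/index.js": "v1"}
after_code_change = dict(before, **{"src/index.js": "v2"})
after_dep_change = dict(before, **{"package.json": '{"dependencies": {}}'})

print(cached_steps(before, after_code_change))  # [True, True, False, False]
print(cached_steps(before, after_dep_change))   # [False, False, False, False]
```

A source-only change keeps the dependency install cached, while a manifest change invalidates every later step. That is why the manifests are copied first.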

2. External (Shared) Build Cache

To allow multiple CI/CD runners or developers to use the same cache, I turned to external cache mechanisms. At this point, Docker Buildx's remote cache feature was a game-changer for me. Buildx can save build results (layers) to a Docker registry or an external storage unit (e.g., S3-compatible storage) and pull them from there.

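In a client project, I set up Buildx with S3-compatible storage in an environment with multiple CI/CD runners. Each build, upon completion, sent its cache to this storage, and each new build first pulled the cache from there. The example below uses the registry backend (type=registry) instead; the same --cache-from and --cache-to flags apply to S3, with type=s3 and its bucket parameters in place of the registry reference.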

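# Create a builder with the docker-container driver, which supports remote cache export
docker buildx create --name mybuilder --driver docker-container --use
# Build, importing the shared cache from the registry and exporting it back afterwards
docker buildx build --builder mybuilder \
  --platform linux/amd64 \
  --cache-from type=registry,ref=myregistry.com/my-app:buildcache \
  --cache-to type=registry,ref=myregistry.com/my-app:buildcache,mode=max \
  -t myregistry.com/my-app:latest --push .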

Thanks to this setup, when different runners started the same build, they could immediately reuse the cache of a previously completed build. In my observation, this method reduced average build times by 60-70%. Specifically, it prevented node_modules or Maven dependencies from being downloaded repeatedly.

💡 A Tip from My Experience

When using remote cache with Buildx, it's important to use the mode=max parameter. The default, mode=min, exports only the layers that end up in the final image. With mode=max, Buildx also exports the layers of intermediate build stages, providing a more comprehensive cache for the next build at the cost of a larger cache.

For my side product's CI/CD pipeline, I also experimented with tools like sccache. These tools can significantly speed up build times for languages like C/C++ by storing compiler outputs on a cache server. This can be particularly useful in large monorepos or projects with many dependencies.

Operational Burden: Cache Management and Maintenance

Implementing build cache strategies not only speeds up builds but also introduces a new operational burden. If we don't manage this burden correctly, we can lose the speed gains to complexity.

Cache Invalidation Strategies

The biggest enemy of cache is "stale" data. Incorrect invalidation can lead to the deployment of old code or erroneous behavior. In my ERP project, we once struggled with strange bugs in the production environment for a week due to a faulty dependency stuck in the cache. The root cause was a build cache holding an incorrect version.

To prevent such situations, I implemented the following strategies:

Dependency-Aware Cache: Invalidate the cache completely when the hashes of dependency definition files like package.json or go.mod change.
Manual Cache Clearing: Sometimes, especially during major version updates or critical security patches, it's necessary to manually clear the entire cache. I added a step to the CI/CD pipeline to trigger this action.
Time-Based Eviction: Automatically delete caches that haven't been used for a certain period.
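The dependency-aware strategy can be sketched as a cache key derived from the manifest contents. This is a minimal sketch under stated assumptions: the function name, the deps prefix, and the 16-character truncation are all illustrative choices, not something a particular CI system prescribes.

```python
import hashlib
import tempfile
from pathlib import Path

def dependency_cache_key(paths, prefix="deps"):
    """Derive a cache key from the contents of dependency manifests.
    The key changes whenever any listed file's content changes."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode())
        digest.update(b"\0")  # separator so name and content cannot blur together
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return f"{prefix}-{digest.hexdigest()[:16]}"

# Demo with a throwaway manifest: the key is stable until the content changes
with tempfile.TemporaryDirectory() as workdir:
    manifest = Path(workdir) / "package.json"
    manifest.write_text('{"dependencies": {"left-pad": "1.3.0"}}')
    key_before = dependency_cache_key([manifest])
    key_same = dependency_cache_key([manifest])
    manifest.write_text('{"dependencies": {"left-pad": "1.3.1"}}')
    key_after = dependency_cache_key([manifest])

print(key_before == key_same)   # True
print(key_before == key_after)  # False
```

Using such a key as part of the cache name or tag means a dependency change starts from a fresh cache instead of reusing a stale one, and old keys can be left to time-based eviction.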

Disk Space Management and Cost

Build caches, especially in large projects, can occupy significant disk space. When using remote cache, this translates directly into storage costs. In one of my client's projects, the Buildx cache on S3 grew to more than 500 GB of billed storage per month.

To optimize this cost and disk usage:

Lifecycle Rules: On storage services like S3, I defined lifecycle rules to automatically delete objects older than a certain period or move them to cheaper storage tiers. For example, deleting cache objects older than 30 days.
Periodic Cleanup Scripts: On self-hosted build servers, I created cron jobs that cleaned up unused Docker images, build caches, and temporary files.

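#!/bin/bash
set -euo pipefail

# Clean up Docker build cache unused for more than 7 days
# (the filter takes Go-style durations, which have no day unit, so 7 days is 168h)
docker builder prune --filter "until=168h" --force

# Clean up all images not used by any container
docker image prune --all --force

# Clean up unused Docker volumes
docker volume prune --force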

These scripts prevented disk fill-ups from reaching critical levels and helped me keep storage costs under control.
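For the remote side, the 30-day lifecycle rule mentioned above could look roughly like the following. This is a sketch that only builds the rule document in the shape S3 expects; the buildcache/ prefix and the rule ID are assumptions, and S3-compatible stores may support only a subset of lifecycle features.

```python
import json

def cache_lifecycle_rules(prefix, expire_days=30):
    """Build an S3 lifecycle configuration that expires objects under
    `prefix` once they are older than `expire_days` days."""
    return {
        "Rules": [
            {
                "ID": f"expire-build-cache-{expire_days}d",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": expire_days},
            }
        ]
    }

rules = cache_lifecycle_rules("buildcache/")
print(json.dumps(rules, indent=2))
```

The printed JSON can be saved to a file and applied with aws s3api put-bucket-lifecycle-configuration. Note that expiration is based on object age, not on when the object was last read, so a frequently used cache entry is still deleted and simply rebuilt by the next build.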

Security Risks

Build cache also carries potential security risks. If a build cache contains sensitive data (API keys, passwords, etc.) and this cache is not properly protected, it can create a serious security vulnerability.

In a production environment, an API key was accidentally included in build arguments. This key posed a risk of leaking into the build cache. To prevent such situations:

Avoid Caching Sensitive Data: Use secret mounts in Dockerfiles to provide sensitive data at build time. Avoid passing it through ARG or ENV, since those values are recorded in the image metadata and can end up in the cache.
Access Control: Implement strict access control on remote cache storage (e.g., S3 buckets). Ensure that only CI/CD runners and authorized personnel can access it.

⚠️ Important Note

Avoid embedding sensitive data directly in RUN commands in your Dockerfile. The command text is recorded in the image metadata and any files it writes become part of a layer, so the data can later be recovered via docker history or docker save. BuildKit's RUN --mount=type=secret feature, available through Buildx, is a good solution for this.
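A lightweight guard against the build-argument mistake described above is to scan Dockerfiles in CI for ARG or ENV names that look like credentials. The sketch below is a heuristic only; the keyword list and the sample Dockerfile are assumptions, and it will miss secrets with innocuous names.

```python
import re

# Names containing these words are treated as suspicious (illustrative list)
SECRET_NAME = re.compile(r"(KEY|TOKEN|SECRET|PASSWORD|PASSWD)", re.IGNORECASE)
# Matches "ARG NAME" or "ENV NAME" at the start of a line
DECLARATION = re.compile(r"^\s*(ARG|ENV)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)

def find_suspect_declarations(dockerfile_text):
    """Return (line number, instruction, name) for each ARG/ENV whose name looks secret."""
    findings = []
    for lineno, line in enumerate(dockerfile_text.splitlines(), start=1):
        match = DECLARATION.match(line)
        if match and SECRET_NAME.search(match.group(2)):
            findings.append((lineno, match.group(1).upper(), match.group(2)))
    return findings

SAMPLE = """\
FROM node:20
ARG API_KEY
ENV NODE_ENV=production
RUN --mount=type=secret,id=api_key cat /run/secrets/api_key > /dev/null
"""

print(find_suspect_declarations(SAMPLE))  # [(2, 'ARG', 'API_KEY')]
```

The last line of the sample shows the secret-mount form, which the check does not flag: the secret is only readable under /run/secrets/ while that RUN step executes and is not written into a layer.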

Performance Metrics and Observation

When optimizing the performance of a system, observability is essential. I closely monitored metrics to understand the effectiveness of build cache strategies and to detect operational problems early.

Build Time Tracking

I collected the total build time as a metric by recording the start and end times of each build in CI/CD systems. I visualized this data using Prometheus and Grafana.

# Example Prometheus scrape config (assuming the CI/CD runners expose build metrics)
scrape_configs:
  - job_name: 'ci_cd_builds'
    metrics_path: /metrics
    static_configs:
      - targets: ['ci-cd-runner-1:9100', 'ci-cd-runner-2:9100']
    relabel_configs:
      # Use the host name without the port as the instance label
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: instance
        replacement: '$1'

In Grafana, I created a graph showing the change in build times over time. This graph allowed me to clearly see unexpected increases in build times overnight or the impact of an optimization. For example, I immediately noticed that build times jumped from 15 minutes to 40 minutes after a dependency update on April 28th, thanks to this graph.
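How the build duration reaches Prometheus depends on the CI system. One possible approach, sketched below, is to time the build and write the result in the Prometheus text exposition format, for example to a file picked up by node_exporter's textfile collector. The metric names and the pipeline label are hypothetical, as is the idea that the runners use this collector.

```python
import time

def timed(build):
    """Run `build` (any callable) and return its wall-clock duration in seconds."""
    start = time.monotonic()
    build()
    return time.monotonic() - start

def format_build_metric(pipeline, duration_seconds, finished_at):
    """Render build metrics in the Prometheus text exposition format."""
    lines = [
        "# HELP ci_build_duration_seconds Wall-clock duration of the last build.",
        "# TYPE ci_build_duration_seconds gauge",
        f'ci_build_duration_seconds{{pipeline="{pipeline}"}} {duration_seconds:.1f}',
        "# HELP ci_build_finished_timestamp_seconds Unix time the last build finished.",
        "# TYPE ci_build_finished_timestamp_seconds gauge",
        f'ci_build_finished_timestamp_seconds{{pipeline="{pipeline}"}} {finished_at:.0f}',
    ]
    return "\n".join(lines) + "\n"

# Stand-in for a real build step
duration = timed(lambda: time.sleep(0.01))
print(format_build_metric("my-app", duration, time.time()))
```

A gauge holding the last build's duration is enough for a time-series graph in Grafana; a histogram would be the better fit if percentiles across many builds matter.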

Cache Hit/Miss Ratios

The most critical metric showing how effectively a build cache is working is the cache hit/miss ratio. Buildx does not print this ratio directly, but with the --progress=plain flag its output marks every step that was served from the cache as CACHED, and the ratio can be derived from that.

# Example docker build output
...
#10 [build 5/5] RUN npm run build
#10 CACHED
#10 DONE 0.0s
...

By parsing these outputs or collecting these ratios from Buildx's build reports, I observed how frequently the cache was used. If the cache hit ratio dropped below 80%, it usually indicated a cache invalidation strategy issue or a deficiency in Buildx settings. When the cache hit ratio dropped to 50% in my side product's CI/CD, I found that the reason was Buildx not having write permissions to the remote cache. After correcting the permissions, the ratio quickly rose above 95%.
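Parsing the plain progress output can be sketched as follows. The sketch assumes the line shapes shown in the example above (a step header such as "#10 [build 5/5] ..." and a "#10 CACHED" marker) and ignores the [internal] bookkeeping steps; the exact output may differ between Buildx versions, so the patterns are assumptions to verify against real logs.

```python
import re

# "#10 [build 5/5] RUN npm run build" but not "#1 [internal] load build definition"
STEP_LINE = re.compile(r"^#(\d+) \[(?!internal\])[^\]]+\] ")
# "#10 CACHED"
CACHED_LINE = re.compile(r"^#(\d+) CACHED\s*$")

def cache_hit_ratio(log_text):
    """Return cached steps / total steps for one build log, or None if no steps were found."""
    steps, cached = set(), set()
    for line in log_text.splitlines():
        match = STEP_LINE.match(line)
        if match:
            steps.add(match.group(1))
            continue
        match = CACHED_LINE.match(line)
        if match:
            cached.add(match.group(1))
    if not steps:
        return None
    return len(cached & steps) / len(steps)

SAMPLE_LOG = """\
#1 [internal] load build definition from Dockerfile
#1 DONE 0.0s
#7 [build 2/5] COPY package.json package-lock.json ./
#7 CACHED
#8 [build 3/5] RUN npm ci
#8 CACHED
#9 [build 4/5] COPY . .
#9 DONE 0.2s
#10 [build 5/5] RUN npm run build
#10 DONE 41.3s
"""

print(cache_hit_ratio(SAMPLE_LOG))  # 0.5
```

A per-step ratio treats a cached 0.1-second step the same as a cached 5-minute step, so it is worth reading alongside total build time instead of in isolation.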

Anomaly Detection and Alerts

I set up alerting systems to automatically detect anomalous changes in build time and cache hit/miss ratios. Using Prometheus Alertmanager, I configured notifications to be sent to a Slack channel or via email if certain thresholds were exceeded (e.g., "build time increased by more than 50% in the last 1 hour" or "cache hit ratio dropped below 70% in the last 4 hours").
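In Prometheus these conditions would be written as alerting rules, but the threshold logic itself is simple enough to sketch independently. The function names and the use of a plain mean are assumptions; the 50% and 70% thresholds come from the examples above.

```python
from statistics import mean

def build_time_alert(recent_minutes, baseline_minutes, max_increase=0.5):
    """True if the recent average build time exceeds the baseline average
    by more than `max_increase` (0.5 means 50%)."""
    if not recent_minutes or not baseline_minutes:
        return False
    return mean(recent_minutes) > mean(baseline_minutes) * (1 + max_increase)

def cache_hit_alert(hit_ratios, min_ratio=0.7):
    """True if the average cache hit ratio over the window is below `min_ratio`."""
    return bool(hit_ratios) and mean(hit_ratios) < min_ratio

# Builds in the last hour versus a longer baseline window (invented numbers)
print(build_time_alert([40, 38], [15, 16, 14]))  # True
print(cache_hit_alert([0.95, 0.9]))              # False
```

Comparing against a rolling baseline instead of a fixed number of minutes keeps the alert meaningful as the project, and its normal build time, grows.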

ℹ️ Related Post

I previously wrote a post on system metrics collection and alerting mechanisms. These principles also apply to build cache metrics.

This proactive monitoring allowed us to address problems before developers noticed them or before the operational burden became severe. For example, when a build time alarm went off at 03:14, I immediately intervened, resolved a disk space issue, and prevented a potential outage.

Trade-offs and Choosing the Right Strategy

When choosing build cache strategies, we always encounter a trade-off. There is no single "one-size-fits-all" solution. In my experience, these decisions were usually based on a balance of speed, complexity, cost, and security.

Speed vs. Complexity

The simplest Docker layer caching method is fast and easy to set up but does not allow cache sharing between different CI/CD runners. This might be sufficient for small projects or single-runner environments. However, it falls short in larger and distributed CI/CD infrastructures.

Remote cache (Buildx + registry/S3) offers much higher speeds because it makes the cache shareable. However, its setup and management are more complex. Setting up a registry or an S3 bucket, managing access permissions, and configuring lifecycle rules add to the operational burden. While developing a production company's ERP, we initially managed with simple layer caching. But as the project grew and the number of developers increased, we had to switch to Buildx. This transition initially required several days of setup and configuration.

Cost vs. Efficiency

Faster builds generally require more resources. If you use cloud storage services like S3 for remote cache, storage and data transfer costs increase. Additionally, build servers might need more CPU and RAM.

For the backend of my side product, I initially self-hosted everything on my own VPS. Instead of using external S3 storage for the build cache, I preferred to keep the cache on my local disk. This saved me storage costs, but build times were longer, and if I ever needed to reinstall the VPS, the cache would be completely lost. At some point, the inefficiency caused by longer build times outweighed the S3 storage cost, and I switched to remote cache.

My cost analysis for a client's project was as follows:

Option A (No Cache): Average build time 45 min. 10 builds per day = 450 min (~7.5 hours) of build time. Cost: developer waiting time (average hourly rate x waiting time) plus the need for more CI/CD runners.
Option B (Remote Cache): Average build time 10 min. 10 builds per day = 100 min (~1.7 hours) of build time. Cost: S3 storage (~50 USD per month) plus Buildx setup and management time.

My calculations showed that Option B's total cost, especially in terms of developer productivity, was much lower. For 5 developers doing 10 builds a day, a gain of 35 minutes per build translated to thousands of dollars in productivity increase per month.
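The arithmetic behind that comparison can be reproduced in a few lines. The build counts, build times, and the 50 USD storage cost come from the options above; the 40 USD hourly rate and 22 working days per month are assumptions for illustration, and the model treats every saved build minute as recovered developer time, which makes it an upper bound.

```python
def monthly_saving_usd(builds_per_day, minutes_without_cache, minutes_with_cache,
                       hourly_rate_usd, working_days=22, cache_cost_usd=50):
    """Estimate the monthly saving from faster builds, net of cache storage cost."""
    saved_minutes = minutes_without_cache - minutes_with_cache
    saved_hours_per_day = builds_per_day * saved_minutes / 60
    return saved_hours_per_day * working_days * hourly_rate_usd - cache_cost_usd

# 10 builds/day, 45 min -> 10 min, hypothetical 40 USD/hour
saving = monthly_saving_usd(10, 45, 10, hourly_rate_usd=40)
print(round(saving))  # 5083
```

Even if only a fraction of the waiting time is truly lost, the saving stays well above the storage cost, which is the point of the comparison.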

💡 My Preference

Generally, I prefer to use remote build cache in CI/CD pipelines. Although there's an initial operational overhead, it more than pays for itself in the long run in terms of developer productivity and resource optimization. Especially when creating Docker images, Buildx has become an indispensable tool.

Conclusion

Build cache strategies are a critical part of increasing performance and ensuring developer productivity in modern software development processes. However, it's not just about "turning on the cache." Choosing the right cache mechanism, defining invalidation strategies, and managing disk space and security risks demand significant operational effort and attention.

In my experience, what's important in this process is always considering the trade-offs and finding the most suitable solution for the project's needs. From a simple setup for a side product to a complex Buildx integration in a large enterprise ERP, I found different balance points in each project. I always acted on the principle that "speed has a cost" and continuously sought new ways to make this cost manageable. The next step is to work on AI-powered cache invalidation mechanisms to make build caches even smarter.
