---
title: "Cut Your CI/CD Bill by 85% with Spot Instance Runners"
published: true
description: "A hands-on walkthrough of self-hosted GitHub Actions runners on Kubernetes spot instances with persistent caching and preemption handling."
tags: devops, kubernetes, cloud, performance
canonical_url: https://blog.mvpfactory.co/cut-cicd-bill-85-percent-spot-instance-runners
---
## What We're Building
By the end of this tutorial, you'll have a self-hosted GitHub Actions runner infrastructure on Kubernetes spot instances that cuts your CI/CD spend by 85%. We'll set up actions-runner-controller (ARC), handle spot preemption gracefully, wire up persistent Gradle and Docker layer caches, and build cost-per-build dashboards that keep the savings honest.
Let me show you a pattern I use in every project that takes CI/CD seriously.
## Prerequisites
- A Kubernetes cluster (EKS, GKE, or similar) with permissions to create node pools
- Familiarity with GitHub Actions workflows
- Helm installed for deploying ARC
- Prometheus and Grafana for metrics (optional but recommended from day one)
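If you don't have ARC running yet, the controller installs via its Helm chart. A minimal sketch — the release and namespace names are my choices, and the classic ARC install also expects cert-manager in the cluster:

```bash
helm repo add actions-runner-controller \
  https://actions-runner-controller.github.io/actions-runner-controller
helm upgrade --install arc actions-runner-controller/actions-runner-controller \
  --namespace actions-runner-system --create-namespace
```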
## Step 1: Understand the Cost Problem
GitHub-hosted runners bill per minute with no volume discount — the $0.064 figure below is GitHub's published Linux rate for 16-core larger runners, the tier heavyweight builds typically land on. Here's what the numbers look like at 2,000 build-hours (120,000 runner-minutes) per month:
| Runner type | vCPU | RAM | Cost/min (Linux) | Monthly cost |
|---|---|---|---|---|
| GitHub-hosted larger runner (16-core) | 16 | 64 GB | $0.064 | ~$7,680 |
| Self-hosted on-demand (c6a.xlarge) | 4 | 8 GB | ~$0.025 | ~$3,000 |
| Self-hosted spot (c6a.xlarge) | 4 | 8 GB | ~$0.008 | ~$960 |
Right-sizing is part of the savings here: most CI jobs never saturate 16 cores, so a 4-vCPU spot node is a fair working replacement.
That bottom row is where the 85% reduction lives. Let's build it.
## Step 2: Create a Spot Node Pool for CI Runners
Dedicate a node pool to CI runners using spot/preemptible instances. Taints keep production workloads off these nodes:
```yaml
nodePool:
  name: ci-runners
  machineType: c6a.xlarge
  spotInstances: true
  taints:
    - key: workload-type
      value: ci
      effect: NoSchedule
  labels:
    role: ci-runner
```
ARC's `RunnerDeployment` targets this pool with matching tolerations and a `nodeSelector`, so runners only land on spot nodes.
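A minimal `RunnerDeployment` wired to that pool might look like this — a sketch assuming the classic `actions.summerwind.dev` API; the repository name and replica count are placeholders:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ci-spot-runners
spec:
  replicas: 4
  template:
    spec:
      repository: my-org/my-repo        # placeholder
      labels:
        - self-hosted
        - spot
      nodeSelector:
        role: ci-runner                 # matches the node pool label
      tolerations:
        - key: workload-type
          operator: Equal
          value: ci
          effect: NoSchedule            # matches the node pool taint
```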
## Step 3: Handle Spot Preemption Gracefully
Here is the gotcha that will save you hours. Spot instances can be reclaimed with a two-minute warning. If you don't handle this, builds get corrupted mid-run.
The approach has three pieces:
1. A termination handler DaemonSet watches the cloud provider's metadata endpoint for interruption notices.
2. On notice, the handler cordons the node and sends `SIGTERM` to the runner process.
3. ARC's runner reports failure gracefully, and the workflow's retry strategy re-queues the job on a healthy node.
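On AWS you don't have to write the watcher in step 1 yourself — the open-source aws-node-termination-handler DaemonSet does exactly this (cordon, drain, notify), and installs from the eks-charts Helm repo:

```bash
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler \
  eks/aws-node-termination-handler --namespace kube-system
```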
GitHub Actions has no built-in job-level retry, so pair a job timeout with a retry wrapper around the build step. The `nick-fields/retry` action is one common choice; the Gradle command is a placeholder:

```yaml
jobs:
  build:
    runs-on: self-hosted
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - name: Build (retried on spot preemption)
        uses: nick-fields/retry@v3
        with:
          max_attempts: 2
          timeout_minutes: 25
          command: ./gradlew build
```
Spot eviction rates on compute-heavy instance families tend to sit between 3% and 8%. With a retry, a job only fails outright if both attempts land on reclaimed nodes — at an 8% eviction rate that's roughly 0.08² ≈ 0.6% — so preemption-driven build failures drop below 1%.
## Step 4: Set Up Persistent Caching
Spot savings are worthless if every evicted job restarts from scratch. You need persistent caching. Full stop.
Provision a persistent volume mounted to all runner pods. Here's what the cache gives you:
| Cache target | Cold build | Warm build | Savings |
|---|---|---|---|
| Gradle dependencies + build cache (2-5 GB) | 8-12 min | 1-3 min | ~75% |
| Docker layer cache via BuildKit (5-15 GB) | 6-10 min | 1-2 min | ~80% |
| Node modules, hashed (1-3 GB) | 2-4 min | 10-20s | ~90% |
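The shared cache itself is just a PVC mounted into every runner pod. A minimal sketch — the storage class and size are assumptions; you need an RWX-capable backend (EFS on AWS, Filestore on GCP) for multiple pods to mount it simultaneously:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ci-cache
spec:
  accessModes:
    - ReadWriteMany        # multiple runner pods share one cache
  storageClassName: efs-sc # assumption: an RWX-capable class
  resources:
    requests:
      storage: 50Gi
```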
Here is the minimal setup to get this working. For Gradle (critical for Kotlin/Android projects):
```properties
# gradle.properties
org.gradle.caching=true
```

One caveat: the cache location can't be set from `gradle.properties`. Point it at the shared mount in `settings.gradle` instead, via `buildCache { local { directory = "/mnt/ci-cache/gradle/build-cache" } }`.
For Docker BuildKit:
```bash
docker buildx build \
  --cache-from type=local,src=/mnt/ci-cache/docker \
  --cache-to type=local,dest=/mnt/ci-cache/docker,mode=max \
  .
```
Add a daily CronJob that prunes entries older than 7 days and caps total size at a fixed threshold. Simple LRU based on access time works fine.
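The pruning logic can be sketched as a small shell function the CronJob invokes — GNU `find` is assumed, and the paths and size cap are placeholders:

```shell
#!/bin/sh
# prune_cache DIR DAYS MAX_KB:
#  1) delete files not accessed in more than DAYS days,
#  2) then evict least-recently-used files until DIR fits under MAX_KB.
prune_cache() {
  dir=$1; days=$2; max_kb=$3
  find "$dir" -type f -atime +"$days" -delete
  while [ "$(du -sk "$dir" | cut -f1)" -gt "$max_kb" ]; do
    # oldest access time first (GNU find: %A@ is atime as epoch seconds)
    oldest=$(find "$dir" -type f -printf '%A@ %p\n' | sort -n | head -n 1 | cut -d' ' -f2-)
    [ -n "$oldest" ] || break
    rm -f "$oldest"
  done
}
```

The CronJob would run something like `prune_cache /mnt/ci-cache 7 $((50 * 1024 * 1024))` for a 7-day, 50 GB policy.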
## Step 5: Instrument Cost-Per-Build Metrics
The docs don't mention this, but without measurement, costs creep back up and nobody notices. Export these from every build via a post-job hook to Prometheus:
- **cost_per_build** — (instance cost/min × duration) + storage cost
- **cache_hit_rate** — percentage of tasks served from cache
- **spot_eviction_rate** — evictions / total jobs
- **queue_wait_time** — time from trigger to runner assignment
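The arithmetic behind `cost_per_build` is worth pinning down inside the post-job hook itself. A sketch — in practice the rate and duration come from the runner's environment, and the storage term is left out here for brevity:

```shell
#!/bin/sh
# cost_per_build RATE_PER_MIN DURATION_SECONDS
# Prints the compute cost of one build: per-minute rate x elapsed minutes.
cost_per_build() {
  awk -v rate="$1" -v secs="$2" 'BEGIN { printf "%.4f\n", rate * secs / 60 }'
}

# e.g. a 10-minute build on a ~$0.008/min spot runner:
cost_per_build 0.008 600   # → 0.0800
```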
Build Grafana dashboards around these. When cost-per-build trends upward, you can see exactly which cache degraded or which workflow lost parallelism.
## Gotchas
- **Scaling runners without caching first** — Adding more runners without shared caches just multiplies cold-build costs. Invest in Gradle build cache and Docker layer cache before parallelism.
- **Ignoring cache eviction** — Without eviction, caches grow forever and your storage costs eat into your savings.
- **No retry strategy** — A bare spot setup without retries surfaces the 3-8% eviction rate directly as a build failure rate. Always add retry logic.
- **Missing metrics** — Without cost-per-build dashboards, optimization conversations stay vibes-based. Instrument from day one.
## Wrapping Up
Start with ARC and a spot node pool. Even a bare-bones setup with retry logic cuts costs by 60%+ with minimal reliability risk. Layer in shared caches for the full 85% reduction, and instrument cost-per-build so the savings stay durable as your team grows.
The infrastructure payoff is immediate — this is one of those rare cases where the engineering investment pays for itself in the first billing cycle.
**Resources:**
- [actions-runner-controller (ARC)](https://github.com/actions/actions-runner-controller)
- [Gradle Build Cache docs](https://docs.gradle.org/current/userguide/build_cache.html)
- [Docker BuildKit cache documentation](https://docs.docker.com/build/cache/backends/)