Khachatur Ashotyan

Posted on May 23

Spot instances as GitHub Actions runners

#aws #devops #cicd #githubactions

Part 1 covered Jenkins as Code with ephemeral workers. Part 2 covered macOS workers. This one is about moving a chunk of the CI workload off Jenkins entirely, onto GitHub Actions, with EC2 spot instances as the runner fleet.

This isn't a "Jenkins is dead, use GitHub Actions" post. Jenkins still handles the heavy builds: macOS, Windows, anything that runs for hours or needs custom orchestration. GitHub Actions runs alongside it for a narrower class of workload where it fits better.

What follows is the self-hosted spot runner pattern: how to point GitHub Actions at your own ephemeral EC2 fleet, and the things that bite once you do.

Why bother

GitHub's managed runners are fine for small teams. There are a few reasons to switch:

1. Cost at volume. GitHub bills its managed Linux runners at $0.008/minute ($0.48/hour). Fine for a few builds a day. Last month we ran 80,887 runner-minutes across 29,347 jobs (~1,350 hours). On managed runners that would have been ~$647. Our actual EC2 bill for the runner fleet was ~$160: $130 on spot plus $28 of EC2-Other (EBS, ENIs, data transfer). Roughly 4x cheaper, and the absolute saving grows with every minute you run.

2. Instance shape. Managed runners come in fixed sizes. Builds that need 16 vCPUs and 64 GB of RAM, a GPU, or arm64 either pay for the largest tier or don't fit at all. Self-hosted lets you pick whatever EC2 instance type the build actually needs.

3. Network access. Builds that talk to private resources (internal artifact registries, RDS, anything behind a VPC) are awkward on managed runners. Self-hosted runners live inside your VPC, so they hit those resources directly without proxies or tunnels.

Cost is what got us to try it. The VPC access and instance flexibility came along for free.
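The arithmetic behind that comparison is easy to redo with your own numbers. A minimal Python sketch using the figures quoted above; the function names are mine, and it takes the runner-minutes at face value rather than modelling how GitHub bills each job.

```python
MANAGED_RATE_PER_MIN = 0.008  # USD per minute for a managed Linux runner, as quoted above


def managed_cost(runner_minutes: float, rate_per_min: float = MANAGED_RATE_PER_MIN) -> float:
    """What the same minutes would cost on managed runners."""
    return runner_minutes * rate_per_min


def savings_ratio(runner_minutes: float, self_hosted_bill: float) -> float:
    """How many times cheaper the self-hosted fleet was for the same minutes."""
    return managed_cost(runner_minutes) / self_hosted_bill


runner_minutes = 80_887
jobs = 29_347
self_hosted_bill = 160.0  # approximate monthly total from the post: spot plus EC2-Other

print(f"hours:        {runner_minutes / 60:.0f}")
print(f"managed cost: ${managed_cost(runner_minutes):.2f}")
print(f"ratio:        {savings_ratio(runner_minutes, self_hosted_bill):.1f}x")
print(f"avg job:      {runner_minutes / jobs:.2f} min")
```

The average job length is worth looking at too: at under three minutes per job, a 60 to 120 second cold start is a large fraction of the total, which is why the warm pool later in this post matters.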

What a "self-hosted spot runner" is

A self-hosted GitHub Actions runner is a small agent. It registers with a GitHub repo or org, polls for jobs matching its labels, runs them, and reports results back. Anything that can run the binary works as a host (bare metal, VM, container, whatever).

It can be either:

Persistent: registered once, sits there, picks up jobs as they come.
Ephemeral: single-use registration token, picks up one job, de-registers, shuts down.

We went with ephemeral. Long-lived self-hosted runners combine the operational burden of managing a host with the build-pollution risk of a shared agent and a security blast radius that never closes.

Every GitHub Actions job gets its own EC2 spot instance, freshly launched from a Packer-baked AMI. The job runs, then the instance is terminated. It's the same one-build-per-worker lifecycle as our Jenkins workers, on a different control plane.

The architecture

No single service does this end-to-end. Either you wire it together yourself, or you reach for one of the open-source modules that already does. I went with terraform-aws-github-runner. It's the most mature module in this space and fits cleanly into a Terraform-managed AWS account. (If you remember the project under its old name, philips-labs/terraform-aws-github-runner, it's the same code, moved to the github-aws-runners org.)

When someone opens a PR that triggers a workflow:

1. GitHub fires a workflow_job webhook the moment a job is queued.
2. API Gateway plus a webhook Lambda check the HMAC signature, filter for relevant runner labels, and push a message onto an SQS queue.
3. A scale-up Lambda drains that queue. For each queued job it launches an EC2 spot instance from a specific AMI, with user-data carrying a single-use registration token.
4. The instance comes up. Cloud-init runs; the runner binary registers itself with GitHub and starts polling for jobs.
5. The runner comes up with matching labels, so GitHub schedules the queued job onto it.
6. The workflow runs whatever your YAML says: checkout, build, test, push artifacts.
7. The runner is registered as --ephemeral, so the agent exits after one job. A scheduled scale-down Lambda cleans up anything left over.
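The HMAC check done by the webhook Lambda is worth understanding even though the module does it for you. GitHub signs each webhook body with the shared secret and sends the result in the X-Hub-Signature-256 header as sha256= followed by the hex digest. Below is a minimal Python sketch of that check plus the label filter. It is an illustration, not the module's own code; the function names and the assumption that labels are compared in lower case are mine.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Compare the X-Hub-Signature-256 header against our own HMAC of the raw body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes the same time wherever the first mismatch is
    return hmac.compare_digest(expected, header)


def should_enqueue(body: bytes, tier_labels: set[str]) -> bool:
    """Accept only 'queued' workflow_job events whose labels this tier can serve."""
    event = json.loads(body)
    if event.get("action") != "queued":
        return False
    wanted = {label.lower() for label in event["workflow_job"]["labels"]}
    return wanted <= tier_labels
```

The signature has to be computed over the raw request bytes, before any JSON parsing or re-serialization, or the digests won't match.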

The Lambda code, the SQS queues, and the IAM glue all live inside the module. You don't write any of that yourself. What you do write is the Terraform configuration that declares which runners exist, which AMI they use, and which instance types are eligible.

Multi-tier runners

This setup gets useful in practice when you split runners into tiers distinguished by labels. Workflows pick a tier by setting runs-on: in the workflow YAML.

We run three:

Small (t3.medium / m5.large): linters, formatters, doc builds, anything that doesn't really stress a CPU. Spawns fast, and spot capacity at this size has not been a problem for us.
Large (m5.xlarge / c5.xlarge): the typical build-and-test workflow that wants some CPU but doesn't hammer it.
Compute-intensive (c7a.4xlarge / c8a.8xlarge): compile-heavy builds, large test suites, anything that scales with cores.

Each tier is its own call to the same Terraform module with different labels and instance-type lists. Sanitized:

module "github-runners" {
  source  = "github-aws-runners/github-runner/aws//modules/multi-runner"
  version = "~> 6.0"

  multi_runner_config = {
    "linux-x64-small" = {
      runner_config = {
        runner_extra_labels = "linux,x64,small"
        instance_types      = local.default_instances
        ami_filter          = { name = ["*ci-runner-x64*"] }
        enable_ephemeral_runners = true
        enable_spot_instances    = true
      }
    }

    "linux-x64-compute-intensive" = {
      runner_config = {
        runner_extra_labels = "linux,x64,compute-intensive"
        instance_types      = local.compute_intensive
        ami_filter          = { name = ["*ci-runner-x64*"] }
        enable_ephemeral_runners = true
        enable_spot_instances    = true
      }
    }

    # ...and so on for the other tiers
  }

  # Common stuff: webhook secret, GitHub app credentials, VPC config, etc.
  github_app     = { ... }
  webhook_secret = random_id.webhook_secret.hex
  vpc_id         = module.vpc.vpc_id
  subnet_ids     = module.vpc.private_subnets
}

Picking the tier in a workflow:

jobs:
  build:
    runs-on: [self-hosted, linux, x64, compute-intensive]
    steps:
      - uses: actions/checkout@v4
      - run: make build

GitHub matches the runs-on array against runner labels and picks any registered runner that has them all. The Lambda only spawns instances on demand, so an unused tier costs nothing.
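The matching rule is a plain subset test: a runner is eligible when its label set contains every label in runs-on. A small Python sketch, with hypothetical runner names and labels:

```python
def eligible_runners(runs_on: list[str], runners: dict[str, set[str]]) -> list[str]:
    """Return the names of runners whose label set contains every requested label."""
    wanted = {label.lower() for label in runs_on}
    return sorted(name for name, labels in runners.items() if wanted <= labels)


# Hypothetical registered runners, keyed by name, with lower-case labels
runners = {
    "small-1": {"self-hosted", "linux", "x64", "small"},
    "compute-1": {"self-hosted", "linux", "x64", "compute-intensive"},
}

print(eligible_runners(["self-hosted", "linux", "x64", "compute-intensive"], runners))
print(eligible_runners(["self-hosted", "linux", "x64"], runners))
```

The second call is the one to watch for: a runs-on that leaves out the tier label matches runners in every tier, so a linter job can end up on a compute-intensive instance. Always include the tier label.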

Scheduling - warm pool during business hours, off at night

Pure on-demand scaling sounds ideal in theory. Zero idle runners, pay per job, the Lambda spawns instances only when GitHub queues something. Two patterns spoil it in practice.

The morning-rush problem. The first few PRs of the day queue around the time people log in. On pure on-demand, every one of those jobs eats the full cold-start latency, somewhere between 60 and 120 seconds from queue to running. A dozen developers pushing at 9am turns into a visible backlog.

The 3am problem. Even on spot, idle runners cost something: the instance-hours themselves plus the EBS volume attached to each warm instance, on top of the orchestration Lambdas that run regardless. Outside business hours the queue is mostly empty, so there's no reason to keep capacity hot.

The runner module addresses both with idle pools and scheduled scaling.

What works for us:

During business hours (weekdays 08:00 to 20:00 in our primary timezone), we keep a warm pool of N runners per tier, sitting registered with GitHub and ready to grab the first matching job. When one claims a job, the scale-up Lambda spawns a replacement, so the pool stays at N. Cold start for the user effectively disappears.
Outside that window, the pool size drops to zero. Late-night and weekend jobs still run; they just pay the cold-start tax. Most of what runs at those hours is scheduled batch work that doesn't care about an extra minute.

The Terraform for it (sanitized, per-tier):

runner_config = {
  # ... labels, instance_types, ami_filter as before ...

  enable_ephemeral_runners = true
  enable_spot_instances    = true

  # Warm pool - kept at this size while the cron expression matches.
  # The module treats cron as a window, not a trigger: it checks whether the
  # expression matches the current time. Fields are second, minute, hour,
  # day-of-month, month, day-of-week, so this matches 08:00-19:59 on weekdays.
  # Outside any matching window the idle count falls back to zero.
  idle_config = [
    {
      cron      = "* * 8-19 * * 1-5"
      timeZone  = "Europe/Berlin"
      idleCount = 3
    },
  ]
}
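Stated as code, the schedule is just a window test on the local time. A minimal Python sketch of the rule we want the pool to follow (weekdays 08:00 to 20:00, three runners, otherwise none); it assumes the datetime passed in is already in the pool's timezone, and the function name is illustrative rather than anything the module exposes.

```python
from datetime import datetime


def target_idle_count(local_now: datetime, pool_size: int = 3) -> int:
    """Warm-pool target for one tier: pool_size on weekdays 08:00-19:59, else 0."""
    is_weekday = local_now.weekday() < 5  # Monday is 0, Sunday is 6
    in_hours = 8 <= local_now.hour < 20
    return pool_size if (is_weekday and in_hours) else 0


print(target_idle_count(datetime(2025, 5, 23, 9, 30)))  # a Friday morning
print(target_idle_count(datetime(2025, 5, 23, 20, 0)))  # same day, window just closed
print(target_idle_count(datetime(2025, 5, 24, 9, 30)))  # Saturday
```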

A few notes about this pattern:

Warm-pool runners are still single-use, just pre-launched. Each one picks up one job and dies, with the pool replenished by fresh instances rather than reused ones. That keeps the byte-identical state property from the Jenkins side intact.
Pool size is a tuning decision. Too small and you still cold-start during the rush; too big and you're burning money on idle capacity. We tune per tier based on the morning queue depth we actually observe. The compute-intensive tier gets a smaller pool because those jobs are rarer.
Spot eviction during pool-idle time is fine. If AWS reclaims a pool runner before it ever picks up a job, the scale-up Lambda just launches a replacement. The pool size is a target, not a fixed set of instances.
Holidays are a remaining problem. The cron schedule doesn't know about public holidays, so on a Monday holiday the pool still ramps up at 08:00 to serve nobody. The cost is small enough that nobody's been motivated to build a calendar-aware scheduler.

During working hours the developer experience is roughly as fast as managed runners. Outside those hours the bill is close to nothing.
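The "burning money on idle capacity" side of the pool-size decision can be bounded with a little arithmetic. A Python sketch, assuming the 12-hour weekday window and pool of three from above; the hourly price is a made-up placeholder, not a figure from our bill, so plug in the current spot price for your instance type.

```python
def weekly_idle_instance_hours(pool_size: int, hours_per_day: float = 12, days_per_week: int = 5) -> float:
    """Upper bound on the instance-hours a warm pool can spend idle in one week."""
    return pool_size * hours_per_day * days_per_week


def weekly_idle_cost(pool_size: int, hourly_price: float) -> float:
    """Upper bound on what that idle time costs at a given hourly price."""
    return weekly_idle_instance_hours(pool_size) * hourly_price


HYPOTHETICAL_SPOT_PRICE = 0.02  # USD per instance-hour; placeholder only

print(weekly_idle_instance_hours(3))
print(round(weekly_idle_cost(3, HYPOTHETICAL_SPOT_PRICE), 2))
```

It is an upper bound because a pool runner that picks up a job is doing work you would have paid for anyway. The real idle cost is whatever fraction of those hours the pool spends waiting.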

The AMI is still a Packer image

As with the Jenkins workers, anything the runner needs to do its job (language runtimes, build tools, Docker, cached dependencies) gets baked into a Packer AMI ahead of time. The AMI is versioned, lives in your AWS account, and is referenced by the runner's ami_filter.

For GitHub Actions the image is usually lighter than the Jenkins worker AMIs, because GHA workflows install most of their tooling at runtime via setup-node, setup-python, setup-java. So the base image just needs:

Ubuntu (or whatever OS).
The GitHub Actions runner binary, pre-downloaded.
Docker (for docker build / docker run).
AWS CLI (most workflows hit S3 or ECR).
Basic build deps: git, curl, jq, unzip.
A stable runtime or two we don't want to redownload every build (Node LTS, Python 3.x).

The image stays around 5 to 10 GB, small enough that launching an instance from it fits comfortably inside the cold-start budget we already have.

Spot interruptions

The thing everyone asks the first time they look at spot for CI is what happens when AWS reclaims the instance mid-build.

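The build fails, and the job has to be re-run: GitHub does not re-queue a job on its own when its runner disappears, so that is either a manual re-run or a small piece of automation. A fresh instance picks up the retry. There's no state to recover, because the runner is ephemeral and the workflow ought to be idempotent anyway. One partial build is lost, the retry starts clean.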

Spot interruption notices give you two minutes of warning before AWS pulls the plug. The runner listens for that signal and de-registers from GitHub cleanly before shutdown, which the module handles for you. Without that, GitHub briefly shows a "runner went offline mid-job" error before the retry. Annoying, but not fatal.
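On the instance itself, the warning shows up in the instance metadata service under /latest/meta-data/spot/instance-action, which returns 404 until a notice has been issued and a small JSON document afterwards. A Python sketch of parsing that document; it covers only the parsing, not the polling or the IMDSv2 token handling, and it is not how the module implements its own handling. Function names are illustrative.

```python
import json
from datetime import datetime, timezone


def parse_instance_action(document: str) -> tuple[str, datetime]:
    """Parse the spot instance-action document into (action, deadline in UTC)."""
    data = json.loads(document)
    deadline = datetime.strptime(data["time"], "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return data["action"], deadline


def seconds_left(deadline: datetime, now: datetime) -> float:
    """Time remaining before the instance is reclaimed; negative once it has passed."""
    return (deadline - now).total_seconds()


# Example document in the shape AWS documents for this endpoint
doc = '{"action": "terminate", "time": "2025-05-23T08:22:00Z"}'
action, deadline = parse_instance_action(doc)
now = datetime(2025, 5, 23, 8, 20, 30, tzinfo=timezone.utc)
print(action, seconds_left(deadline, now))
```

Whatever runs on the notice has to finish well inside the remaining seconds, which is enough to de-register a runner but not to finish a build.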

In practice the interruption rate I see is around 1-3% of jobs on the small and large tiers, and a bit higher on compute-intensive because the larger instance types have less spot capacity per AZ. For most workloads that's a fine trade for the savings. For workflows that genuinely can't tolerate a retry (release builds, deploys with side effects), I either flip enable_spot_instances = false for that tier or send the job over to Jenkins, where the lifecycle is more tightly controlled.
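To put a number on "a fine trade": if each run is interrupted independently with probability p and every interrupted job is retried until it finishes, the mean number of runs per job is 1 / (1 - p). A Python sketch under exactly those assumptions, which are a simplification since long jobs are more exposed than short ones:

```python
def expected_attempts(interruption_rate: float) -> float:
    """Mean number of runs until one finishes, if each run is interrupted independently."""
    if not 0 <= interruption_rate < 1:
        raise ValueError("interruption_rate must be in [0, 1)")
    return 1 / (1 - interruption_rate)


for rate in (0.01, 0.03):
    extra = expected_attempts(rate) - 1
    print(f"{rate:.0%} interrupted -> {extra:.1%} extra runs on average")
```

Even at the top of the range quoted above, the retries add only a few percent more runs, which is small next to the roughly 4x difference in price.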

Trade-offs vs. Jenkins workers

"Should this run on Jenkins or GitHub Actions?" comes up a lot. How I think about it:

Workload shape	Where it goes
PR-triggered, short, idempotent	GitHub Actions on spot. Quick spin-up, cheap, no Jenkins overhead.
Long-running build (1h+)	Jenkins. Spot interruption risk is too high for long jobs.
macOS / Windows builds	Jenkins. The worker setup from Parts 1 and 2 lives there.
Custom orchestration (matrix sharding, dynamic parallelism, gated promotion)	Jenkins. Groovy DSL handles this more flexibly than the GHA matrix.
Deploys / releases with side effects	Jenkins on dedicated workers, or GHA on on-demand. No spot.
Open-source / contributor-facing repos	GitHub Actions. Don't expose Jenkins to contributors.
Builds that need ephemeral access to a specific cloud service	Whichever is in the right VPC. Usually GHA for the small stuff.

The two systems cover different shapes of work. Moving everything to GitHub Actions would have been a mistake, but moving the small PR-scoped jobs off Jenkins freed up real capacity for the big builds that remain.

What I'm still figuring out

Cold-start latency outside the warm pool window. The pool hides it during business hours, but outside those hours every job eats the full 60-120 seconds. We're fine with this trade-off most of the time, though it occasionally annoys people working evenings.
Spot capacity in specific AZs. The compute-intensive tier sometimes can't get spot capacity in our preferred AZ and the queue backs up. The module's multi-AZ fallback helps but doesn't eliminate it. On genuinely bursty days we fall back to on-demand temporarily.
Holiday-aware pool scheduling. The cron schedule doesn't know about public holidays, so we burn a small amount of money ramping up on holidays nobody is working. Low impact, but it's the kind of thing that bothers you every time you remember.
AMI sprawl. Every architecture (x64, arm64) and base-image variant is its own AMI lineage. We rebuild them on a schedule via Packer the same way we do the Jenkins worker AMIs, but the operational overhead is a real cost.
Cost attribution. Spot instances inherit tags from the launch template, but not every downstream resource (EBS volumes, ENIs) picks up the right cost-attribution tags automatically. That's a separate problem and I'm not opening it here.

Closing thought

All three posts in this series end up at the same place: ephemeral workers, baked images, everything orchestrated from git, secrets pulled from a vault at runtime. Jenkins is one way to wire that pattern together, and GitHub Actions on self-hosted spot is another. Nothing says you pick one and only one.

The worker lifecycle is the part you can't compromise on: don't keep workers between builds. Once that's in place, everything else (Jenkins versus GHA, spot versus on-demand, Tart versus vSphere) is swappable, and you can change your mind later without burning the platform down.

That wraps the series for now. If any of it saves you a week of figuring it out yourself, this was worth writing.

Appendix - tools mentioned

terraform-aws-github-runner - the Terraform module that wires up the whole thing. (Formerly philips-labs/terraform-aws-github-runner.)
GitHub Actions self-hosted runners - official docs.
HashiCorp Packer - bakes the runner AMIs.
Terraform - calls the module above.
AWS EC2 spot - cheap interruptible compute.
AWS Lambda and SQS - the queueing/orchestration glue (managed by the module).

Part 3 of My CI/CD Odyssey. Thanks for reading. If you run self-hosted CI differently, I'd be curious to hear about it in the comments.

DEV Community