Sami Chibani
CI/CD in the Era of AI and Platform Engineering: A Deep Dive into Dagger CI (Part 2)

Part 2: Decoupling Pipelines from Infrastructure

In a Platform Engineering world, developers shouldn't care where CI runs. The pipeline code stays the same; only the runner configuration changes.

In Part 1, we built our first Dagger pipeline: typed, testable code that runs identically on a developer's laptop and in GitHub Actions. We left off with a minimal, working CI workflow: a few lines of YAML that run dagger call on GitHub's shared runners, with zero infrastructure to manage.

That setup works, but it has limits. Every job starts with a cold Dagger cache, builds share GitHub's 2-vCPU runners with the rest of the world, and there's no persistent state between runs. For a small project, that's fine. For a team shipping dozens of PRs a day, it's a bottleneck.

This part is more SRE/Platform Engineer oriented. We'll dig into caching internals, runner infrastructure, and the operational tooling that makes Dagger viable at scale. We'll explore three approaches, each with different tradeoffs between simplicity, performance, and control.


The Runner Spectrum

| Approach | Setup Time | Cost | Control | Best For |
|---|---|---|---|---|
| GitHub Actions + Dagger Action | 5 minutes | $0 (free tier) | Low | Getting started, small teams |
| Depot.dev Managed Runners | 15 minutes | ~$0.04/min | Medium | Fast builds, medium teams |
| Kubernetes ARC + Shared Cache | 2-4 hours | Your infra | Full | Enterprise, high volume |

Option 1: GitHub Actions with dagger-for-github

The simplest approach. Zero infrastructure to manage.

Create .github/workflows/ci.yml:

name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      - name: Build and Test
        uses: dagger/dagger-for-github@v8.4.1
        with:
          version: "0.20.3"
          verb: call
          args: all --source=.

The dagger/dagger-for-github action installs Dagger, starts the engine, and runs your command.

Understanding Dagger's Cache Architecture

Before we talk about CI runners, we need to understand how Dagger caches work, because caching is probably the single most impactful optimization you'll make. OpenMeter cut their CI pipeline from 25 minutes to 5 minutes primarily through better caching, and the same principles apply here.

Dagger maintains three distinct cache categories:

| Cache Type | What It Stores | Scope |
|---|---|---|
| Layers | Build instructions and results of API calls | Content-addressed by input hash |
| Volumes | Persistent filesystem directories (package caches, build artifacts) | Named, persisted across engine sessions |
| Function calls | Return values from module function invocations | Keyed by function signature + arguments |

The engine runs a background garbage collector that keeps cache storage below 75% of total disk capacity while preserving at least 20% free space. You can inspect and manage the cache manually:

# View cache usage summary
dagger core engine local-cache entry-set

# View detailed cache entries
dagger core engine local-cache entry-set entries

# Prune all unused cache entries
dagger core engine local-cache prune

# Prune using the default GC policy
dagger core engine local-cache prune --use-default-policy

Now let's see how these three cache types work in practice.

Layer Cache: Content-Addressed Operation Caching

Every operation in Dagger's execution graph is cached based on the hash of its inputs. When you chain .with_exec(["pip", "install", "-r", "requirements.txt"]) after .with_file("requirements.txt", ...), Dagger hashes the requirements file content and the base container state. If neither changed, the entire install step is skipped, not re-executed.

This is similar to Docker layer caching, but with an important difference: Dagger's cache is content-addressed, not order-dependent. Docker invalidates all layers after the first changed layer. Dagger evaluates each operation independently based on its actual inputs.

The ordering trick: Just like Docker, you should place instructions that change infrequently before those that change often. Install dependencies before copying source code:

# Good: dependencies cached separately from source code
(
    dag.container()
    .from_("python:3.13-slim")
    .with_workdir("/app")
    .with_file("/app/requirements.txt", source.file("requirements.txt"))
    .with_exec(["pip", "install", "-r", "requirements.txt"])  # Cached until requirements.txt changes
    .with_directory("/app/src", source.directory("src"))      # Only this invalidates on code changes
)

# Bad: source code change invalidates dependency install
(
    dag.container()
    .from_("python:3.13-slim")
    .with_workdir("/app")
    .with_directory("/app", source)                           # Any file change invalidates everything below
    .with_exec(["pip", "install", "-r", "requirements.txt"])  # Re-runs even if requirements.txt didn't change
)

This single reordering can save minutes per build. The layer cache lives in the Dagger engine's local storage, by default at /var/lib/dagger.
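The content-addressed idea is easy to demonstrate outside Dagger. Here's an illustrative Python sketch (not Dagger's actual implementation) of what it means to key a step on the hash of its inputs: the install step's key depends only on the base image and the requirements file, so editing source code leaves it untouched.

```python
import hashlib

def cache_key(*inputs: bytes) -> str:
    """Derive a cache key from the content of an operation's inputs.
    Any change to any input yields a different key; unrelated
    changes elsewhere don't affect it."""
    h = hashlib.sha256()
    for item in inputs:
        h.update(hashlib.sha256(item).digest())
    return h.hexdigest()

base_image = b"python:3.13-slim"
requirements = b"fastapi==0.115.0\nuvicorn==0.30.0\n"

# The pip-install step is keyed only on the base image and requirements...
install_key = cache_key(base_image, requirements)

# ...so recomputing with identical inputs hits the cache,
assert install_key == cache_key(base_image, requirements)
# ...while editing requirements.txt produces a new key — a cache miss.
assert install_key != cache_key(base_image, requirements + b"httpx==0.27.0\n")
```

Because the source directory never enters this key, a code-only change can't invalidate the dependency install — which is exactly why the ordering above matters.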

Cache Volumes: Persistent Package Manager Storage

Beyond layer caching, Dagger provides cache volumes: named persistent directories you mount into containers. These are the equivalent of Docker's --mount=type=cache but integrated into the Dagger API.

The typical use case: package manager caches. Without a cache volume, pip install re-downloads every package on every run even if the layer cache misses. With a cache volume, pip finds the packages already in its local cache and skips the download.

@function
async def build_backend(
    self,
    source: Annotated[dagger.Directory, Doc("Backend source directory")],
) -> dagger.Container:
    """Build the FastAPI backend container."""
    return (
        dag.container()
        .from_("python:3.13-slim")
        .with_workdir("/app")
        .with_file("/app/requirements.txt", source.file("requirements.txt"))
        # Mount a persistent cache volume for pip's download cache.
        # dag.cache_volume("pip") creates a named volume that persists
        # across runs. pip won't re-download packages it already has.
        .with_mounted_cache("/root/.cache/pip", dag.cache_volume("python-pip"))
        .with_exec(["pip", "install", "-r", "requirements.txt"])
        .with_directory("/app/src", source.directory("src"))
        .with_env_variable("PORT", "8080")
        .with_exposed_port(8080)
        .with_entrypoint(["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8080"])
    )

Here's the pattern for common package managers:

# Python (pip)
.with_mounted_cache("/root/.cache/pip", dag.cache_volume("python-pip"))
.with_exec(["pip", "install", "-r", "requirements.txt"])

# Python (uv) — dramatically faster with cache
.with_mounted_cache("/root/.cache/uv", dag.cache_volume("python-uv"))
.with_exec(["uv", "pip", "install", "--system", "-r", "requirements.txt"])

# Node.js (npm)
.with_mounted_cache("/root/.npm", dag.cache_volume("node-npm"))
.with_exec(["npm", "ci"])

# Go
.with_mounted_cache("/go/pkg/mod", dag.cache_volume("go-mod"))
.with_mounted_cache("/root/.cache/go-build", dag.cache_volume("go-build"))
.with_exec(["go", "build", "./..."])

Cache volumes are scoped to the Dagger engine instance. On your laptop, they persist indefinitely. On an ephemeral CI runner, they need to be persisted externally (we'll cover how below).

How the Three Caches Stack

Here's what happens on a typical CI run where you changed one Python source file:

  1. Function call cache: checks if build_backend() was called with the same source directory hash. If yes, returns the cached result immediately (entire function skipped). If no, proceeds to step 2.
  2. Layer cache: the from_("python:3.13-slim") step is cached (image didn't change). The with_file("requirements.txt", ...) step is cached (file didn't change). The pip install step is cached (requirements + base image didn't change). Only the with_directory("/app/src", ...) step and everything after it re-runs.
  3. Cache volume: if the pip install step does need to re-run (e.g., you added a new dependency), pip finds most packages already downloaded in the cache volume. Only the new package is fetched.

The result: a build that takes 3 minutes cold can drop to 30 seconds when only source code changed, and even a dependency change is fast because the volume cache eliminates most network I/O.
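The three-tier lookup can be modeled with a toy Python sketch. This is purely illustrative — the real engine is far more sophisticated — but it shows why a source change reuses the install layer, and why a new dependency only downloads the missing package:

```python
# Toy model of Dagger's three cache tiers (illustrative only).
function_call_cache: dict = {}  # tier 1: function + argument hashes
layer_cache: dict = {}          # tier 2: operation results, content-addressed
volume_cache: set = set()       # tier 3: packages already downloaded

downloads_performed: list = []

def build_backend(source_hash: str, requirements: frozenset) -> str:
    # Tier 1: identical inputs -> cached result, skip everything below.
    key = ("build_backend", source_hash, requirements)
    if key in function_call_cache:
        return function_call_cache[key]
    # Tier 2: the install layer is keyed on requirements only,
    # so a source-code change does not invalidate it.
    if requirements not in layer_cache:
        # Tier 3: even on a layer miss, only *new* packages download.
        for pkg in requirements - volume_cache:
            downloads_performed.append(pkg)
            volume_cache.add(pkg)
        layer_cache[requirements] = f"layer-{hash(requirements)}"
    result = f"{layer_cache[requirements]}+src-{source_hash}"
    function_call_cache[key] = result
    return result

deps = frozenset({"fastapi", "uvicorn"})
build_backend("abc", deps)              # cold: downloads both packages
build_backend("def", deps)              # source changed: install layer reused
build_backend("def", deps | {"httpx"})  # dep added: only httpx downloads
assert len(downloads_performed) == 3
```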

Persisting Cache on Ephemeral Runners

GitHub Actions runners are ephemeral. Every job starts from a clean VM. Without persistence, every run starts with a cold cache.

Why actions/cache Doesn't Work Here

Your first instinct might be to cache /var/lib/dagger with actions/cache@v4. Unfortunately, this doesn't work. Dagger stores its state inside a Docker volume named dagger-engine, not directly on the host filesystem. The path /var/lib/dagger only exists inside the engine container, not on the host where actions/cache can reach it.

The Solution: Docker Volume Caching

Since the Dagger engine already persists its state in a named Docker volume (dagger-engine), we just need a way to back up and restore that volume across ephemeral CI runs. BYK/docker-volume-cache-action does exactly that:

name: CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6

      # 1. Restore the Dagger engine's Docker volume from cache.
      #    Use a per-job cache key so parallel matrix jobs don't
      #    overwrite each other. restore-keys falls back to any
      #    existing cache for this OS if the exact key doesn't match.
      - name: Restore Dagger engine cache
        uses: BYK/docker-volume-cache-action/restore@v1
        with:
          key: dagger-engine-${{ runner.os }}-backend
          restore-keys: |
            dagger-engine-${{ runner.os }}-
          volumes: dagger-engine

      # 2. Run Dagger — the _EXPERIMENTAL_DAGGER_RUNNER_HOST env var
      #    tells the CLI to use a Docker volume named "dagger-engine"
      #    for engine state. This is what makes the restore/save work.
      - name: Build and Test
        uses: dagger/dagger-for-github@v8.4.1
        env:
          _EXPERIMENTAL_DAGGER_RUNNER_HOST: "docker-image://registry.dagger.io/engine:v0.20.3?volume=dagger-engine"
        with:
          version: "0.20.3"
          verb: call
          args: test-backend --source=./backend

      # 3. Stop the engine so the volume is consistent before saving
      - name: Stop Dagger engine
        if: always()
        run: docker stop $(docker ps -q -f 'name=dagger-engine') 2>/dev/null || true

      # 4. Save the volume back to cache — only on main branch.
      #    PRs restore from main's cache but don't save, avoiding
      #    ~1m30 of overhead per job on every pull request.
      - name: Save Dagger engine cache
        if: always() && github.ref == 'refs/heads/main'
        uses: BYK/docker-volume-cache-action/save@v1
        with:
          key: dagger-engine-${{ runner.os }}-backend
          volumes: dagger-engine

A few things to note about this setup:

  • _EXPERIMENTAL_DAGGER_RUNNER_HOST tells the Dagger CLI to store engine state in a Docker volume named dagger-engine. Without this, the CLI creates an internal volume with an auto-generated name that the cache action can't find.
  • Per-job cache keys (e.g. -backend, -frontend) are important when you run parallel matrix jobs. With a shared key, the last job to save would overwrite the cache with only its own data. Per-job keys keep each job's cache independent.
  • Save only on main avoids wasting ~1m30 per job on pull requests. PRs still benefit from restoring the cache built by main.
  • restore-keys fallback allows a cache miss on the exact key to fall back to any cache for the same OS, which is useful when a job runs for the first time but other jobs have already populated a cache.

On the first run, the cache is cold and everything executes. On the second run, the Docker volume is restored with all Dagger layers, cache volumes, and function call results intact, and Dagger picks up right where it left off.

Performance note: This approach introduces a "cold start" overhead on every job — the restore step decompresses the volume at the start, and the save step compresses it at the end. For a ~1 GB engine cache, this adds roughly 30-60 seconds per job. This overhead partially offsets Dagger's built-in caching gains, making it the least performant of the three runner options. It's a good starting point, but if build speed is critical, Depot runners or self-hosted ARC runners with persistent volumes (Options 2 and 3 below) eliminate this overhead entirely.

Note: GitHub Actions caches are scoped to a branch (with fallback to the default branch) and expire after 7 days of inactivity, with a 10 GB limit per repository. This is generous enough for most projects — but if you hit the limit, Dagger Cloud's distributed caching (covered below) is the next step.

The important point: none of this caching logic is CI-specific. The .with_mounted_cache() calls and operation ordering live in your pipeline code. They work identically on your laptop, on GitHub Actions, on Depot, or on a self-hosted runner. The only CI-specific piece is the volume cache restore/save steps that persist the engine state, and even that becomes unnecessary with persistent runners (see Options 2 and 3 below) or Dagger Cloud.

Dagger Cloud: Distributed Caching and Pipeline Observability

Dagger Cloud is Dagger's centralized control plane. It provides three capabilities that become increasingly valuable as your pipeline usage grows: pipeline visualization, distributed caching, and module management.

Why Dagger Cloud?

The Docker volume caching approach above works well for small projects, but has limitations:

  • 10 GB limit per repository: large monorepos or Docker-heavy builds can blow through this
  • Branch-scoped: cache from main is available to PRs, but PRs don't share cache between each other
  • Single runner: the cache only benefits the specific runner that uploaded it
  • Save/restore overhead: compressing and decompressing a multi-GB Docker volume adds 30-60 seconds per job

Dagger Cloud solves all of these with a distributed cache that syncs layers and volumes across every Dagger Engine connected to your organization. Cache volumes are downloaded at the start of each run and uploaded back at completion. Layers are fetched on demand as the engine needs them. This means a build on one runner benefits from the cache populated by a completely different runner, even on a different branch.

OpenMeter reported cutting their CI from 25 minutes to 10 minutes with Dagger Cloud caching alone (a 2.5x improvement) before adding faster runners for the full 5x speedup.

Setting Up a Dagger Cloud Account

Pricing:

| Plan | Cost | Users | Features |
|---|---|---|---|
| Individual | Free | 1 | Pipeline visualization, traces |
| Team | $50/month | Up to 10 | Visualization + distributed caching + module sharing |
| Enterprise | Custom | Unlimited | Dedicated support, SLA, advanced features |

All plans include a 14-day free trial of Team features.

Step 1: Create your account.

Head to dagger.io/cloud and sign up with your GitHub or Google account. Follow the guided setup: create your organization (alphanumeric + dashes, must be unique), select a plan, and optionally invite teammates.

You can then authenticate from the CLI:

dagger login

This opens your browser with a verification link. Confirm the unique key matches what the CLI displays, and you're connected.

Tip: Use your company name or team name for the organization — you can't change it later.

Step 2: Get your token.

Once logged in, navigate to your organization settings:

https://dagger.cloud/{your-org-name}/settings?tab=Tokens

Click the eye icon to reveal your token. Copy it. You'll need it for CI configuration.

Step 3: Add the token to GitHub Actions.

Go to your repository → Settings → Secrets and variables → Actions → New repository secret. Name it DAGGER_CLOUD_TOKEN and paste your token.

Step 4: Update your workflow.

Add the cloud-token parameter to the Dagger action:

- name: Build and Test
  uses: dagger/dagger-for-github@v8.4.1
  with:
    version: "0.20.3"
    verb: call
    args: all --source=.
    cloud-token: ${{ secrets.DAGGER_CLOUD_TOKEN }}

That's it. You can now remove the volume cache restore/save steps. Dagger Cloud handles cache persistence automatically. Alternatively, keep both for redundancy during the transition.

Dagger Cloud also supports GitLab CI, CircleCI, Jenkins, and Argo Workflows. The setup is the same: store DAGGER_CLOUD_TOKEN as a secret/variable in your CI system, and Dagger picks it up automatically via the environment variable.
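As an example, here is a minimal GitLab CI sketch. This is illustrative only — the image tags and Docker-in-Docker wiring are assumptions to adjust to your runner setup; DAGGER_CLOUD_TOKEN is assumed to be stored as a masked CI/CD variable:

```yaml
# .gitlab-ci.yml — illustrative sketch
build:
  image: docker:27
  services:
    - docker:27-dind
  variables:
    DOCKER_HOST: tcp://docker:2375
    DOCKER_TLS_CERTDIR: ""
  script:
    - apk add --no-cache curl
    - curl -fsSL https://dl.dagger.io/dagger/install.sh | BIN_DIR=/usr/local/bin DAGGER_VERSION=0.20.3 sh
    - dagger call all --source=.
  # DAGGER_CLOUD_TOKEN is defined as a masked CI/CD variable;
  # the CLI picks it up from the environment automatically.
```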

Pipeline Visualization and Traces

Every Dagger run connected to Dagger Cloud generates a trace, a visual breakdown of your pipeline execution showing:

  • Which operations executed and their duration
  • Which operations were cache hits (skipped) vs cache misses (re-executed)
  • Dependency relationships between operations
  • Error locations and output for failed steps

This is invaluable for optimization. Instead of guessing which steps are slow, you can see exactly where time is spent and whether your caching strategy is working. If a step that should be cached keeps re-executing, the trace will show you.

Public traces: Dagger Cloud automatically detects if traces come from public repositories and makes them publicly accessible by default. You can toggle this in organization settings under Visibility.


Example of a Dagger pipeline trace with Gantt view

Install the GitHub App (Optional)

For even tighter integration, install the Dagger Cloud GitHub App. This adds Dagger pipeline status checks directly to your pull requests, with links to the trace view for each run.

Module Management

If you publish Dagger modules (like we do in Part 3), Dagger Cloud can scan your GitHub repositories and display module information: API documentation, activity history, dependencies, and linked traces. Enable this through Settings → Git Sources → Install the GitHub Application → select repositories → Enable module scanning.


Modules page view from Dagger Cloud

Pros: Zero infrastructure setup, free tier for individuals, works with any GitHub repository (personal or organization), automatic updates.
Cons: Slowest of the three options due to volume save/restore overhead on every job, shared infrastructure, 10 GB cache limit per repository (Dagger Cloud removes this limit).


Option 2: Depot.dev Managed Runners

Depot provides managed runners optimized for container builds. The standout features: persistent caching and native arm64 support.

Warning: Depot runners require your repository to be part of a GitHub organization. They do not work with personal GitHub repositories. If you're working on a personal repo, you'll need to create an organization (free tier is sufficient) and transfer or fork the repository there before you can use Depot runners.

name: CI (Depot)

on:
  push:
    branches: [main]
  pull_request:

jobs:
  build:
    runs-on: depot-ubuntu-latest,dagger=0.20.3
    steps:
      - uses: actions/checkout@v6

      - name: Build and Test
        run: dagger call all --source=.

That's it. No depot/setup-action, no dagger-for-github action, no environment variables. By appending dagger=0.20.3 to the runs-on label, Depot pre-configures the runner with the Dagger engine, persistent cache, and all necessary plumbing. You just call dagger directly.

Depot runners have 16 CPU / 32GB RAM standard (vs GitHub's 2 CPU / 7GB), persistent NVMe caching, and native arm64 with no emulation. A build that takes 8 minutes on GitHub's free runners often completes in under 2 minutes on Depot.

Pros: 4-10x faster, persistent cache, native multi-arch, simple migration.
Cons: Paid service (~$0.04/minute), another vendor dependency.


Option 3: Self-Hosted Kubernetes with ARC

For maximum control, run your own runners on Kubernetes using GitHub's Actions Runner Controller (ARC).

Architecture

┌──────────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                          │
│                                                               │
│  ┌───────────────────────────────────────────────────────┐   │
│  │     ARC Controller (Namespace: arc-systems)            │   │
│  └───────────────────────────────────────────────────────┘   │
│                            │                                  │
│                            ▼                                  │
│  ┌───────────────────────────────────────────────────────┐   │
│  │   arc-runners namespace                                │   │
│  │                                                        │   │
│  │   ┌──────────┐  ┌──────────┐  ┌──────────┐           │   │
│  │   │ Runner 1 │  │ Runner 2 │  │ Runner 3 │           │   │
│  │   │ (pod)    │  │ (pod)    │  │ (pod)    │           │   │
│  │   └─────┬────┘  └─────┬────┘  └─────┬────┘           │   │
│  │         │              │              │                │   │
│  │         │    kube-pod:// or Unix socket                │   │
│  │         │              │              │                │   │
│  │   ┌─────▼──────────────▼──────────────▼─────┐         │   │
│  │   │      Dagger Engine (Helm chart)         │         │   │
│  │   │      DaemonSet or StatefulSet           │         │   │
│  │   │                                         │         │   │
│  │   │      /var/lib/dagger (cache)            │         │   │
│  │   │      hostPath (DS) or PVC (STS)         │         │   │
│  │   └─────────────────────┬───────────────────┘         │   │
│  └─────────────────────────┼─────────────────────────────┘   │
│                             │                                 │
│                             ▼                                 │
│                  ┌─────────────────────┐                     │
│                  │    Dagger Cloud     │  (optional)          │
│                  │    Magicache        │                     │
│                  └─────────────────────┘                     │
└──────────────────────────────────────────────────────────────┘

The key architectural difference from Options 1 and 2: the Dagger engine runs as a shared, long-lived process (DaemonSet or StatefulSet) rather than being started fresh inside each runner. Runner pods are lightweight: they just run the GitHub Actions runner process and connect to the engine. This separation means the engine's cache persists across jobs, and runner pods can scale independently from the engine.
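Concretely, each runner pod is pointed at the shared engine through a single environment variable. A sketch of the relevant pod spec fragment — pod names, namespace, and socket path here are illustrative, and the Terraform module covered next sets this up for you:

```yaml
# Runner pod spec fragment (illustrative names)
env:
  # StatefulSet mode: reach the engine pod through the Kubernetes API
  - name: _EXPERIMENTAL_DAGGER_RUNNER_HOST
    value: "kube-pod://dagger-engine-v0-20-3-0?namespace=arc-runners"
  # DaemonSet mode instead uses a Unix socket mounted from the host,
  # e.g. value: "unix:///run/dagger/engine.sock"
```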

Prerequisites: Setting Up the GitHub App

ARC authenticates to GitHub via a GitHub App. You need to create one before deploying the infrastructure.

Step 1: Create the app.

Go to your GitHub App creation page:

  • Organization: https://github.com/organizations/<YOUR_ORG>/settings/apps/new
  • Personal account: https://github.com/settings/apps/new

Fill in the form:

  • GitHub App name: something descriptive like ARC Dagger Runners
  • Homepage URL: your organization's URL
  • Webhook: uncheck Active (ARC doesn't use webhooks)

Set the required Repository permissions:

| Permission | Access |
|---|---|
| Actions | Read-only |
| Administration | Read & write |
| Checks | Read-only |
| Metadata | Read-only |

If your runners will be shared across an organization (recommended), also set this Organization permission:

| Permission | Access |
|---|---|
| Self-hosted runners | Read & write |

Under "Where can this GitHub App be installed?", choose Only on this account. Click Create GitHub App.

On the app's settings page, note the App ID (displayed near the top). Then scroll to Private keys and click Generate a private key. Save the .pem file securely.

Step 2: Install the app.

For an organization (all repos can use the runners):

  1. Go to https://github.com/organizations/<YOUR_ORG>/settings/apps
  2. Click Configure next to your app
  3. Under Repository access, choose All repositories (or select specific ones)
  4. Click Save
  5. Note the Installation ID from the URL: .../installations/<INSTALLATION_ID>

For a single repository:

  1. Go to https://github.com/<OWNER>/<REPO>/settings/installations
  2. Click Configure next to your app
  3. Note the Installation ID from the URL

The gh_config_url in the Terraform module should match the scope:

  • Organization: https://github.com/your-org
  • Single repo: https://github.com/your-org/your-repo

Tip: Start with organization-level installation even if you only have one repo today. It's easier to add repos later without changing the Terraform configuration.

Terraform Setup

Writing the raw Helm releases and Kubernetes resources by hand gives you full understanding of the moving parts, but for production use, you want a reusable module that handles the boilerplate. We've open-sourced one that builds on the same GKE and ARC foundations as the official terraform-google-github-actions-runners module and adds Dagger-specific features.

Full source: github.com/telchak/terraform-dagger-runners

The module is organized in three layers:

| Layer | Module path | What it does |
|---|---|---|
| Engine | `/` (root) | Deploys Dagger engines via Helm. CI-agnostic, cloud-agnostic. |
| CI platform | `modules/github/` | Adds ARC controller + runner scale sets. Calls root for engine. |
| Cloud | `modules/github/gke/` | Provisions a GKE cluster. Calls CI platform module. |

You pick the layer that matches your situation. Need a full GKE cluster? Use modules/github/gke/. Already have a Kubernetes cluster (GKE, EKS, AKS, on-premise)? Use modules/github/ directly.

Required Google APIs — If you're provisioning a GKE cluster for self-hosted runners, enable these APIs on your GCP project:

gcloud services enable \
  container.googleapis.com \
  compute.googleapis.com \
  iam.googleapis.com \
  cloudresourcemanager.googleapis.com

  • Kubernetes Engine API (container.googleapis.com) — create and manage GKE clusters
  • Compute Engine API (compute.googleapis.com) — provision VMs, networks, and disks
  • IAM API (iam.googleapis.com) — service accounts for nodes and workload identity
  • Cloud Resource Manager API (cloudresourcemanager.googleapis.com) — project-level operations

Scenario A: Create a new GKE cluster with runners

This is the all-in-one option. The module provisions the GKE cluster, networking, ARC controller, runner scale sets, and Dagger engines:

module "dagger-runners" {
  source = "github.com/telchak/terraform-dagger-runners//modules/github/gke?ref=v0.1.0"

  # GCP
  project_id     = var.project_id
  region         = "us-central1"
  create_network = true
  machine_type   = "n2-standard-4"   # 4 vCPU, 16 GB RAM — fits ~15 runners
  disk_size_gb   = 100
  min_node_count = 1
  max_node_count = 4
  spot           = true              # Safe with StatefulSet — cache is on PVC

  # GitHub App credentials from Step 1 & 2 above.
  # Store these in terraform.tfvars (git-ignored) or use TF_VAR_ env vars.
  gh_app_id              = var.gh_app_id
  gh_app_installation_id = var.gh_app_installation_id
  gh_app_private_key     = var.gh_app_private_key
  gh_config_url          = "https://github.com/your-org"  # org or repo URL

  # Each version gets its own runner scale set with a unique label.
  # Use runs-on: dagger-v0.20 in your workflows.
  dagger_versions = ["0.20.3"]

  min_runners = 0   # Scale to zero when idle
  max_runners = 10

  # StatefulSet mode: persistent PVC cache per engine version.
  # Cache survives pod restarts, node preemptions, and scale-downs.
  # Runners connect to the engine via kube-pod:// protocol.
  engine_mode                         = "statefulset"
  persistent_cache_size               = "100Gi"
  persistent_cache_storage_class_name = "premium-rwo"  # GKE SSD-backed PD

  # Runner pod resources — runners are lightweight (they just connect
  # to the shared Dagger engine), so requests are minimal.
  runner_size_templates = {
    "" = {
      runner_requests = { cpu = "50m", memory = "256Mi" }
      runner_limits   = {}
    }
  }

  # Optional: Dagger Cloud for distributed caching + traces
  dagger_cloud_token = var.dagger_cloud_token
}

Scenario B: Deploy onto an existing Kubernetes cluster

If you already have a Kubernetes cluster (GKE, EKS, AKS, on-premise, k3s), use modules/github/ directly. You configure the kubernetes and helm providers yourself; the module deploys only the ARC controller, runner scale sets, and Dagger engines:

module "dagger-runners" {
  source = "github.com/telchak/terraform-dagger-runners//modules/github?ref=v0.1.0"

  # GitHub App credentials from Step 1 & 2 above.
  gh_app_id              = var.gh_app_id
  gh_app_installation_id = var.gh_app_installation_id
  gh_app_private_key     = var.gh_app_private_key
  gh_config_url          = "https://github.com/your-org"

  dagger_versions = ["0.20.3"]
  min_runners     = 0
  max_runners     = 10

  # DaemonSet mode (default): simplest setup, no PVC needed.
  # One engine per node, ephemeral cache at /var/lib/dagger on the host.
  # Works on any cluster without a StorageClass.
  engine_mode = "daemonset"

  # Or use StatefulSet mode for persistent cache:
  # engine_mode       = "statefulset"
  # persistent_cache_size = "100Gi"
  # Leave persistent_cache_storage_class_name empty to use the cluster's
  # default StorageClass, or set it explicitly:
  #   GKE:         "premium-rwo"
  #   EKS:         "gp3"
  #   AKS:         "managed-premium"
  #   On-premise:  "longhorn", "rook-ceph-block", etc.

  # Optional: Dagger Cloud for distributed caching + traces
  # dagger_cloud_token = var.dagger_cloud_token
}

This is the option to use when you manage your own infrastructure, or when running on a cloud provider the module doesn't have a dedicated cloud layer for yet (e.g. AWS, Azure). The interface is identical; only the cluster provisioning is your responsibility.

Step 3: Deploy

# Store credentials safely — never commit the private key.
export TF_VAR_gh_app_private_key="$(cat /path/to/your-app.pem)"

terraform init
terraform plan \
  -var="project_id=my-project" \
  -var="gh_app_id=123456" \
  -var="gh_app_installation_id=12345678"
terraform apply

Tip: If you manage your GitHub App secret through an external system like HashiCorp Vault or External Secrets Operator, you can pass gh_app_existing_secret_name instead of the credentials — the module will reference the pre-existing Kubernetes secret and skip creating one.

The module handles: ARC controller, runner scale sets, Dagger engine deployment (via the official Dagger Helm chart), StatefulSet RBAC, VPA, Dagger Cloud secret injection, and failed pod cleanup. The key Dagger-specific features:

  • dagger_versions: list of Dagger engine versions to deploy. Each version creates exact labels (dagger-v0.20.3), minor aliases (dagger-v0.20 pointing to the latest patch), and a dagger-latest label pointing to the newest version overall.
  • Engine modes: daemonset (one engine per node, ephemeral hostPath cache) or statefulset (one engine per version, persistent PVC cache). StatefulSet is recommended for Spot instances and when cache warmth matters.
  • _EXPERIMENTAL_DAGGER_RUNNER_HOST: automatically set on each runner to point at the correct Dagger engine for that scale set (Unix socket for DaemonSet, kube-pod:// for StatefulSet).
  • Dagger Cloud token: injected as a Kubernetes secret and set as DAGGER_CLOUD_TOKEN env var on runner pods when provided. Also enables Magicache on the engine.

Multi-Version Runners

One of the most powerful features is running multiple Dagger versions side-by-side. This is invaluable during version upgrades, as teams can migrate repositories at their own pace:

# Deploy two versions simultaneously
dagger_versions = ["0.19.5", "0.20.3"]

This creates five runner scale sets: exact versions, minor aliases, and a latest pointer:

  • runs-on: dagger-v0.19.5: exact, pinned to 0.19.5
  • runs-on: dagger-v0.19: minor alias, points to 0.19.5
  • runs-on: dagger-v0.20.3: exact, pinned to 0.20.3
  • runs-on: dagger-v0.20: minor alias, points to 0.20.3
  • runs-on: dagger-latest: always the newest version (0.20.3)

Teams that want reproducibility pin to an exact version. Teams that want automatic patch updates use the minor alias. Repositories that just need "whatever is current" use dagger-latest.
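To make the tradeoff concrete, here is a sketch of the three pinning strategies side by side in a workflow (job names are illustrative):

```yaml
jobs:
  release-build:
    runs-on: dagger-v0.20.3   # exact pin — fully reproducible builds
  app-ci:
    runs-on: dagger-v0.20     # minor alias — picks up patch releases automatically
  experiments:
    runs-on: dagger-latest    # always the newest deployed engine version
```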

Runner Size Templates

By default, runner pods have no resource constraints, so they get BestEffort QoS and are the first to be evicted under memory pressure. For production workloads, the module supports size templates that define resource requests and limits per runner. Each label gets a size suffix:

runner_size_templates = {
  # Default size (no suffix): runs-on: dagger-v0.20
  "" = {
    runner_requests = { cpu = "50m", memory = "256Mi" }
    runner_limits   = {}
  }
  # Large size (suffix): runs-on: dagger-v0.20-large
  large = {
    runner_requests = { cpu = "250m", memory = "512Mi" }
    runner_limits   = { cpu = "1", memory = "1Gi" }
  }
}

Runner pods are lightweight: they only run the GitHub Actions runner process. The Dagger engine runs separately as a DaemonSet or StatefulSet and is shared across all runners. This separation means you don't need to give runner pods large CPU/memory allocations; the engine handles the heavy lifting independently.

With the above templates and dagger_versions = ["0.20.3"], you get six labels: dagger-v0.20.3, dagger-v0.20, dagger-latest (default size), and dagger-v0.20.3-large, dagger-v0.20-large, dagger-latest-large. Teams pick the size that matches their workload:

jobs:
  lint:
    runs-on: dagger-v0.20           # default size — fast checks
  build:
    runs-on: dagger-v0.20-large     # heavy builds with work outside Dagger

Node Sizing

The minimum viable node is 2 vCPU / 8 GB RAM, but larger nodes are more cost-efficient because they amortize system pod overhead (kube-dns, ARC controller, Dagger engine) across more runner pods:

| Machine type | vCPU | RAM | Runners per node | Best for |
|---|---|---|---|---|
| e2-standard-2 | 2 | 8 GB | ~5 | Dev/test, low concurrency |
| n2-standard-4 | 4 | 16 GB | ~15 | Small teams |
| n2-standard-8 | 8 | 32 GB | ~30 | Medium teams, multi-version |
| n2-standard-16 | 16 | 64 GB | ~50+ | Large teams, high concurrency |

Runner counts assume 50m CPU / 256Mi memory requests per runner and ~1.5 vCPU / 2 GB overhead for system pods + engine. Actual capacity depends on your workload.

Caching: Engine Modes and Dagger Cloud

With self-hosted runners, you choose how the Dagger engine stores its cache, and the Terraform module supports all options:

DaemonSet mode (simplest): engine_mode = "daemonset". One engine per node, cache on the host disk via hostPath. No PVC needed; works on any cluster. Cache is ephemeral: lost when the node is recycled. Best for stable node pools or when Dagger Cloud handles cache persistence.

StatefulSet mode (persistent): engine_mode = "statefulset". One engine per Dagger version, cache on a PVC managed by the Helm chart. Cache survives pod restarts, node preemptions, and scale-downs. Runners connect via kube-pod:// protocol (RBAC created automatically). Best for Spot instances and when cache warmth is critical.

Dagger Cloud (complementary): Set dagger_cloud_token in the module. Magicache syncs cache layers across all engines connected to your organization, even across clusters and regions. You also get the trace UI and module management described earlier. Works with both DaemonSet and StatefulSet modes.

You can combine them: StatefulSet for fast local cache on PVC + Dagger Cloud for cross-cluster sharing and observability. This is the recommended production setup.
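In module terms, the recommended production combination looks roughly like this (variable names follow the options described in this section — treat it as a sketch, not a verbatim module interface):

```hcl
# Persistent per-version cache on a PVC; survives Spot preemptions
engine_mode = "statefulset"

# Dagger Cloud: enables Magicache cross-cluster cache sync + the trace UI
dagger_cloud_token = var.dagger_cloud_token
```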

Vertical Pod Autoscaler (VPA)

Getting resource requests right for the Dagger engine is tricky. Too low and the scheduler under-provisions nodes; too high and you waste capacity. The module includes optional Vertical Pod Autoscaler support to solve this:

# Enable VPA on the cluster (GKE-specific, creates the VPA objects automatically)
enable_vertical_pod_autoscaling = true

# Start in recommendation-only mode — no automatic resizing
vpa_update_mode = "Off"

With vpa_update_mode = "Off", the VPA controller observes actual engine resource usage and produces recommendations without touching the pods. You can inspect them with:

kubectl get vpa -n arc-runners -o yaml

Look for the recommendation section, which shows target, lowerBound, and upperBound for CPU and memory. After a few days of representative workload, use these values to tune dagger_engine_requests:

# Apply the VPA recommendations to your engine requests
dagger_engine_requests = {
  cpu    = "500m"   # from VPA target recommendation
  memory = "16Gi"   # from VPA target recommendation
}

In-Place Pod Resizing (Kubernetes 1.33+)

Starting with Kubernetes 1.33, the InPlacePodVerticalScaling feature gate enables resizing pod resources without restarting them. This is particularly valuable for the Dagger engine, since a restart means losing in-memory state and warming the cache from disk. The module supports this via:

vpa_update_mode = "InPlaceOrRecreate"

With this mode, the VPA will resize the engine pod's CPU and memory in place when possible, falling back to a recreate only if the node can't accommodate the new size. This feature graduates to GA in Kubernetes 1.35, at which point it works without any feature gate configuration.

Tip: Start with "Off" to collect recommendations, then move to "InPlaceOrRecreate" once you're confident in the VPA's suggestions and your cluster supports it. This gives you automatic right-sizing without engine downtime.

Scale to Zero

The min_runners = 0 setting means no idle runners consuming resources. Combined with min_node_count = 0 on the GKE node pool and cluster autoscaling, this creates fully elastic CI infrastructure that costs nearly nothing when idle.
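Putting the elastic pieces together, a scale-to-zero setup combines module and node-pool settings along these lines (the node-pool block follows the GKE Terraform provider; `max_runners` and the pool name are illustrative):

```hcl
# Runner scale sets: no idle runners when the queue is empty
min_runners = 0
max_runners = 20

# GKE node pool: let the cluster autoscaler remove all CI nodes when idle
resource "google_container_node_pool" "ci" {
  # ... cluster, location, node_config, etc.
  autoscaling {
    min_node_count = 0
    max_node_count = 10
  }
}
```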

Pros: Full control, scale to zero, private networks, infrastructure as code, multi-version Dagger support.
Cons: Initial setup complexity, Kubernetes expertise required, you manage updates.


Image generated with Google's Gemini Imagen "Nano Banana Pro"


Choosing Your Path

| If you... | Choose... |
|---|---|
| Are just starting with Dagger | GitHub Actions + Dagger Action |
| Need fast builds without infra work | Depot.dev |
| Have Kubernetes expertise and volume | Self-hosted ARC |
| Need air-gapped or private network | Self-hosted ARC |
| Want the best of both worlds | Depot for dev, ARC for production |

Dagger Cloud is orthogonal to the runner choice and works with all three options. Start with the free Individual plan for pipeline visualization and traces. Upgrade to Team ($50/month) when you need distributed caching across runners or when the actions/cache 10 GB limit becomes a bottleneck.

Your pipeline code doesn't change. Only the runner configuration does. That's the decoupling that matters.


What's Next

In Part 3, we'll build a library of reusable Dagger modules: GCP Auth, Artifact Registry, Cloud Run, and Firebase Hosting. These modules compose into a single pipeline that deploys both our Angular frontend and FastAPI backend.

The modules will work on all three runner setups we just configured.


Up Next: Part 3: From Scripts to a Platform: Your CI/CD Module Library


This is Part 2 of a 4-part series. Follow for updates.

Tags: #cicd #dagger #kubernetes #github-actions #platform-engineering
