Ederson Brilhante

Posted on Jun 20

No Silver Bullets: Engineering a Multi-Tenant CI Platform a Small Team Can Run

#devops #kubernetes #aws #githubactions

A deep, teaching walkthrough of how Cisco’s internal Forge deployment runs ~40 teams and ~10,000 GitHub Actions jobs a day on AWS — and the dozen deliberate engineering trade-offs that made it survivable with near-zero ops. This is the long version on purpose. I’m going to show you the machinery, not wave at it, because a platform you can’t see inside is just a magic box you’re afraid to touch. By the end you should understand not only what Forge does but why each piece is built the way it is — well enough to argue with the decisions.

TL;DR — Forge (open-source name: ForgeMT) is a multi-tenant GitHub Actions runner platform on AWS. In Cisco's internal production deployment it runs ~40 teams and ~10k jobs/day with near-zero manual provisioning/debugging — which is not the same as zero support. Jobs run on ephemeral runners across two lanes — EC2 VMs and Kubernetes/ARC pods. The design is a stack of deliberate trade-offs, each with a stated cost: Calico to beat VPC-IP exhaustion, per-tenant DinD node pools for blast-radius isolation around privileged Docker builds, zero static credentials with a self-checking trust-validator, immutable blue/green clusters for safe upgrades, directory-as-config Terragrunt so onboarding is a config change, self-healing pipelines + centralized Splunk observability, and Renovate + dogfooded images to stay current. The thesis: near-zero ops isn't luck or a magic tool — it's a dozen good trade-offs that add up. Short on time? Read §4 (how a job runs) and §13 (the pattern).

Contents

The wall everyone hits
What Forge is (and the vocabulary you'll need)
Why it exists
How a job actually runs — the part most explanations skip
Trade-off #1 — Networking: the IP ceiling
Trade-off #2 — Isolation: per-tenant DinD node pools
Trade-off #3 — Identity: zero static credentials
Trade-off #4 — Immutable clusters & blue/green
Trade-off #5 — Config at scale
Trade-off #6 — The connective tissue
Trade-off #7 — Staying fresh: automated dependencies & dogfooded images
Operating it: ownership, where it breaks, and the sharp edges
The pattern

1. The wall everyone hits

Setting up a GitHub Actions self-hosted runner is a fifteen-minute job. You spin up a VM, run the config.sh script GitHub gives you, it registers the machine against your org or repo, you add runs-on: self-hosted to a workflow, and your job lands on your box. It feels great. It is great — for one team.

The trouble never arrives with the first runner. It arrives with the second team. And the fifth. And the twentieth.

Here's the mechanism of the pain, because it's worth being precise about. A self-hosted runner is a long-lived machine you own. That means you own its patching, its security posture, its disk filling up, its runner-agent upgrades, its scaling when ten jobs arrive at once, and its idle cost when none do. The moment a second team needs runners — maybe they need a different toolchain, a bigger instance, or access to an internal network the first team didn't — the path of least resistance is for them to copy the setup and run their own. Now two teams each carry that whole operational burden. Multiply by twenty.

What you end up with isn't "twenty teams using runners." It's twenty subtly different runner platforms, each patched on its own schedule, each drifting in its own direction, each with its own half-built answer to "how do we give this runner AWS access without leaking a key." "It works on Team A's runners but not ours" becomes a real sentence in real incident channels. Your operational cost grows roughly linearly with adoption, and — this is the part that should worry you — your security posture gets worse as you grow, because every team reinvents secrets handling and isolation slightly differently and slightly wrong.

This is the wall. Forge is what one team built after hitting it.

And the single most important thing to understand before we go further: the interesting part is not any one clever component. There is no magic tool in here. Forge is roughly a dozen deliberate decisions, each made in a different problem domain, each with a real cost, composed into something a small team can actually operate. The rest of this article walks the load-bearing ones — with the actual code and the actual trade-off accepted. If you're building anything multi-tenant, the shape of these decisions will transfer even if you never touch a GitHub runner.

2. What Forge is (and the vocabulary you'll need)

Forge is a secure, multi-tenant GitHub Actions runner platform on AWS. Teams run their CI/CD jobs on managed, ephemeral runners that live in company-managed AWS accounts. For a developer in an already-onboarded repo, adoption is usually a one-line runs-on change — no infrastructure to own, no migration project, no workflow rewrite. (Standing up a new tenant is a controlled platform-team step, which we'll be honest about in §4 and §12.)

(A naming note: the open-source project is ForgeMT — "MT" for multi-tenant; internally people usually just say Forge. I use "Forge" throughout for readability, but the repo and docs say "ForgeMT.")

Before we go deeper, let's nail down five terms. If you already live in this world, skim; if you don't, these five unlock everything below.

GitHub-hosted vs. self-hosted runners. When you write runs-on: ubuntu-latest, GitHub spins up a throwaway VM it owns, runs your job, and throws it away. Convenient, but that VM lives in GitHub's network — it cannot reach your private databases, your internal package registries, or your VPC, and you can't customize it much or control its cost at high volume. A self-hosted runner is a machine you register with GitHub; GitHub sends jobs to it, but you own everything about the box. Forge is a platform for self-hosted runners — it exists precisely for the jobs that GitHub-hosted runners can't serve.

Tenant. A tenant is the isolation-and-configuration boundary for a team — its own runners, identity, and labels. Forge runs those runners inside the Forge-managed AWS deployment for that tenant; the tenant can then optionally grant its runners access to its own external AWS accounts by listing the IAM roles the runner role is allowed to assume (more in §7). So "each tenant has its own account" is the wrong mental model — the right one is "each tenant is an isolated boundary, with opt-in bridges to whatever AWS it actually needs." When we say Forge is "multi-tenant," we mean many teams share one platform (one codebase, one upgrade path, one operations team) while staying strictly separated at runtime. Holding both of those at once — shared platform, isolated execution — is the entire engineering problem.

Ephemeral runner. This is the keystone idea. Instead of long-lived machines, Forge creates a brand-new runner for a single job and destroys it the instant that job finishes — pass or fail. Every job gets a pristine machine that has never seen another job. This buys you enormous things almost for free: no state leaks between jobs, no configuration drift accumulating on a box over months, no credentials cached on a runner that outlives the work, and no "clean up after yourself" step that everyone forgets. It costs you one thing — startup latency, since you pay to boot a machine per job — which we'll see Forge manage carefully later.

Control plane vs. tenant plane. The control plane is the Forge-owned brain: it receives the signal that a job needs a runner, provisions the runner, scales the fleet, wires up identity, and collects logs and metrics. The tenant plane is where the actual jobs execute. Teams live entirely in the tenant plane — they write workflows and pick runners — and never touch the control plane. This split is what lets one small team operate the control plane for everyone.

The two lanes. Forge runs jobs two different ways, and lets each job choose:

The EC2 lane gives a job a full virtual machine (or even a bare-metal host). Use it for heavy builds, specialized hardware, custom operating systems, macOS/Windows, or anything that needs real VM-level isolation and control. A job can run directly on the VM or inside a container via a container: block in the workflow. Under the hood this is built on the open-source terraform-aws-github-runner project.
The Kubernetes lane (often called ARC, after Actions Runner Controller, the open-source operator it's built on) runs each job in a Kubernetes pod. Use it for fast, bursty, container-native work where you want pods to spin up in seconds and scale to zero when idle.

Within each lane, tenants pick a size or flavor. On EC2 that's typically small (linting, quick tests), standard (general builds and integration tests), large (heavy builds, load tests), and metal (bare-metal hosts for jobs that need full hardware control). On Kubernetes it's k8s (lightweight pods for jobs that don't need Docker) and dind (Docker-in-Docker, for building container images inside a pod) — plus dedicated named scale sets like dependabot, which is not a third execution model but an ARC scale set typically backed by the DinD template, carved out so dependency-update jobs are isolated and capacity-limited on their own. A tenant can run several of these at once, each with its own parallelism limit, so one team's flood of jobs can't starve another's.

Two lanes is itself a deliberate choice — most platforms pick one. Forge keeps both because the workload genuinely spans both: a 90-minute hardware-in-the-loop build and a 30-second lint check do not want the same execution model.

Tenants pick a lane and a size by label. A workflow says something like runs-on: [self-hosted, x64, "type:standard"] for a medium EC2 runner, or runs-on: [k8s] for a Kubernetes pod, or runs-on: [dind] for a pod that can build Docker images. Those labels are not cosmetic — as we'll see in §9, they're the literal API contract between the tenant and the platform, generated from configuration and matched exactly.

That diagram is the entire mental model at altitude. The rest of this article is what's inside each of those boxes — and why.

3. Why it exists

Forge didn't begin as an architecture diagram. It began as a fix for a specific, escalating mess.

The first teams ran the open-source terraform-aws-github-runner module, each on their own. It's a good module. It worked. What didn't work was the practice of everyone running their own copy. The same bug got fixed three separate times in three repos. The same incident — a runner type out of capacity, a webhook misfiring — recurred across teams that never compared notes. Because each team upgraded on its own cadence, the behavior of "a runner" diverged between teams, so debugging required knowing which team's particular vintage you were looking at. Knowledge became tribal. Three teams meant three deployments, several AWS accounts per environment, and three sets of undocumented quirks, with billing and troubleshooting smeared across all of them.

Two specific constraints turned this from "annoying" into "we cannot continue":

IPv4 exhaustion. The runners had to live in corporate-provisioned AWS subnets, and those subnets were small — and not something an individual team could enlarge. Here's the trap: there was plenty of CPU and memory available, but the network ran out of IP addresses long before the compute did. Every runner needs an IP. When you've handed out every address in the subnet, the next job sits in a queue staring at idle compute it can't use, because there's no address to give it. The bottleneck wasn't the thing everyone watches (compute) — it was the thing nobody watches (IPs). This single constraint forced a real architectural change, and we'll spend all of §5 on it because it's the most instructive decision in the platform.

Internal access and shared networks. CI jobs frequently need to reach internal things — a private artifact registry, an internal API, a database, a service only reachable over the corporate network. GitHub-hosted runners simply cannot do this; they're outside the perimeter. And when CI runs on shared corporate networks, CI traffic (which is spiky and heavy) competes with operational traffic, creating noisy-neighbor problems. The upshot: security and network design stopped being someone else's problem and became part of CI design. You cannot bolt them on afterward.

Faced with this, the team had a binary choice: keep letting every team run its own stack and drown in duplicated maintenance and divergent security, or standardize onto one platform. The bet they made — and it's the thesis of the whole thing — was to keep the flexibility teams loved (custom images, full VM control, internal access, their own AWS resources) while removing the drift (one module, one upgrade path, one set of guardrails). Standardize the boundaries; preserve the freedom inside them.

The next seven sections are what "guardrails without taking away freedom" actually required. But first, we have to look at how a single job flows through the system, because every later decision is in service of that flow.

4. How a job actually runs — the part most explanations skip

If you only remember the altitude diagram, Forge stays a magic box. So let's trace a real job, end to end, twice — once per lane. This is the spine everything else hangs on.

The EC2 lane, step by step

1. A developer pushes code. Their workflow contains runs-on: [self-hosted, x64, "type:standard", "tnt:acme"]. GitHub sees the job needs a self-hosted runner and emits a workflow_job webhook with action: queued.
2. The webhook hits an API Gateway, which is the public front door of that tenant's Forge deployment. It forwards the request to a small webhook Lambda.
3. The webhook Lambda authenticates and matches. It verifies the request's HMAC signature against the GitHub App's webhook secret (so randoms can't trigger your runners), then checks the job's labels against the set of runner types this deployment offers. If a label set matches, it enqueues a message onto an SQS queue dedicated to that runner type. (If nothing matches, it does nothing: the job will sit unfulfilled, which, as we'll see, is itself a debuggable signal.)
4. A scale-up Lambda consumes the SQS message. SQS here isn't decoration; it's a buffer that decouples the burst of webhooks from the rate at which you can launch instances, and it gives you retries for free. The scale-up Lambda decides how many runners to launch and issues an EC2 CreateFleet call in instant mode (give me N instances now), choosing on-demand or spot capacity per the tenant's config.
5. The instance boots and registers itself. Its user-data script pulls a just-in-time (JIT) runner registration token, configures the GitHub Actions runner agent as ephemeral (it will accept exactly one job), and joins the tenant's runner group. The moment it registers, GitHub hands it the queued job.
6. The job runs. Job-started and job-completed hooks fire; the runner streams logs to CloudWatch.
7. The runner self-destructs. Because it registered as ephemeral, the agent deregisters from GitHub after the single job. A separate scale-down Lambda, running on a schedule (every few minutes), reaps the now-idle instance and any stragglers older than a minimum runtime. Supporting Lambdas update tags and archived logs.
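The authenticate-and-match step is the easiest part of this flow to picture in code. What follows is a minimal Python sketch under stated assumptions, not Forge's actual Lambda: the RUNNER_TYPES table, the function names, and the "every job label must be offered by the runner type" matching rule are all illustrative. The signature format itself (sha256= followed by a hex HMAC-SHA256 of the raw request body, sent in the X-Hub-Signature-256 header) is GitHub's documented webhook scheme.

```python
import hashlib
import hmac
from typing import Iterable, Optional

# Hypothetical table: runner type name -> the labels that type offers.
RUNNER_TYPES = {
    "standard": frozenset({"self-hosted", "x64", "type:standard", "tnt:acme"}),
    "large": frozenset({"self-hosted", "x64", "type:large", "tnt:acme"}),
}


def signature_is_valid(secret: bytes, body: bytes, header_value: str) -> bool:
    """Compare the X-Hub-Signature-256 header against our own HMAC of the body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, header_value)


def match_runner_type(job_labels: Iterable[str]) -> Optional[str]:
    """Return the first runner type offering every requested label, else None.

    Assumption: a job matches only if all of its labels are offered.
    Returning None models the "do nothing, job stays queued" branch.
    """
    wanted = frozenset(job_labels)
    if not wanted:
        return None
    for name, offered in RUNNER_TYPES.items():
        if wanted <= offered:
            return name
    return None


if __name__ == "__main__":
    secret = b"example-webhook-secret"  # placeholder, never hard-code a real one
    body = b'{"action": "queued"}'
    header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    print(signature_is_valid(secret, body, header))
    print(match_runner_type(["self-hosted", "x64", "type:standard", "tnt:acme"]))
```

The useful property to notice is that an unmatched label set returns None instead of raising an error, which is exactly why a typo in runs-on shows up as a job that sits in the queue forever.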

Two real-world wrinkles live in steps 4–5, and both are worth knowing. First, capacity isn't guaranteed: a CreateFleet can come back short with errors like InsufficientInstanceCapacity or — tellingly — InsufficientFreeAddressesInSubnet (yes, you can run out of subnet IPs; that's the §5 problem wearing a different hat). Forge classifies these errors rather than treating them all the same: a spot shortfall can fail over to on-demand, and retryable capacity errors are re-queued instead of dropped on the floor. Second, a hard-won lesson: a webhook with a valid HMAC signature is not necessarily a usable one. In one production incident, deliveries were signed correctly but carried an installation ID belonging to a different GitHub App — so the token request, and thus runner creation, failed even though signature validation passed cleanly. The fix that went into the runbook: when runner creation fails, don't stop at "signature verified" — verify that the payload's installation context actually matches the app you think is serving the request. It's a perfect example of why the observability in §10 exists: the symptom ("runners won't start") was three layers away from the cause.
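Both wrinkles reduce to small, testable decisions. The sketch below is one hedged way to express them in Python. The Action names, the function signatures, and the exact policy (which codes are retryable, when failover is allowed) are assumptions for illustration; only the two error codes and the "installation must match the app" rule come from the text above. GitHub App webhook payloads do carry an installation object with an id field.

```python
from enum import Enum


class Action(Enum):
    FAILOVER_TO_ON_DEMAND = "failover_to_on_demand"
    REQUEUE = "requeue"
    FAIL = "fail"


# Assumption: these two codes are treated as retryable capacity errors.
RETRYABLE_CAPACITY_ERRORS = {
    "InsufficientInstanceCapacity",
    "InsufficientFreeAddressesInSubnet",
}


def classify_fleet_error(code: str, capacity_type: str, allow_failover: bool) -> Action:
    """Decide what to do with one CreateFleet error instead of treating all alike."""
    if code not in RETRYABLE_CAPACITY_ERRORS:
        return Action.FAIL
    # A spot shortfall can be answered by asking for on-demand capacity.
    # Running out of subnet IPs cannot: on-demand instances need addresses too.
    if (
        code == "InsufficientInstanceCapacity"
        and capacity_type == "spot"
        and allow_failover
    ):
        return Action.FAILOVER_TO_ON_DEMAND
    return Action.REQUEUE


def installation_matches(payload: dict, expected_installation_id: int) -> bool:
    """A valid signature is not enough: the payload must belong to our app install."""
    installation = payload.get("installation") or {}
    return installation.get("id") == expected_installation_id


if __name__ == "__main__":
    print(classify_fleet_error("InsufficientInstanceCapacity", "spot", True))
    print(classify_fleet_error("InsufficientFreeAddressesInSubnet", "spot", True))
    print(installation_matches({"installation": {"id": 7}}, 42))
```

The second function is the runbook lesson in four lines: run it after signature validation and before requesting a token, so a mismatched installation fails loudly at the front door.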

The Kubernetes (ARC) lane, step by step

The Kubernetes lane reaches the same destination by a different road, because Kubernetes already has a scaling brain.

1. The same workflow_job signal reaches the cluster, but here the ARC listener for the tenant's scale set notices there's demand.
2. ARC creates an ephemeral runner resource (a pod spec) for the job.
3. Karpenter provisions a node if needed. Karpenter is a Kubernetes autoscaler that watches for unschedulable pods and launches right-sized EC2 nodes to fit them (and removes them when idle). If the tenant's nodes are full or scaled to zero, Karpenter boots one.
4. Kubernetes schedules the runner pod. For DinD, tenant isolation is enforced by the per-tenant Karpenter node pool plus taints/tolerations and required node affinity (§6); for plain k8s, the checked-in template relies on namespace, service-account/IAM, runner-group, and GitHub-routing boundaries unless an extra scheduling layer is added.
5. The pod's containers come up: the runner container itself, plus (for Docker builds) a rootless Docker-in-Docker sidecar, plus hook and log sidecars.
6. The job runs in the pod; logs flow to Kubernetes logging and onward.
7. The ephemeral runner resource and its pod are deleted. Idle nodes get consolidated away by Karpenter.

The reason Forge keeps both lanes is now concrete: the EC2 path gives you a whole machine and total control but pays a full VM boot; the ARC path scales pods fast and bin-packs them onto shared nodes but constrains you to a container model. Different jobs genuinely want different trade-offs.

One piece of plumbing makes both flows actually route to the right place: runner groups. Each tenant's runners register into a GitHub runner group scoped to that tenant, and a small reconciler keeps the mapping correct — when a tenant adds a repository, it's automatically registered into the right runner group, so that repo's jobs land only on that tenant's runners. This is what stops one tenant's workflow from ever pulling another tenant's capacity, and it's why onboarding a new repo is a no-op for the platform team rather than a ticket.
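At its core, a reconciler like this is a set difference between what the config says and what GitHub currently has. Here is a minimal Python sketch of that planning step. The function name and the repository names are hypothetical, and whether Forge's reconciler also removes repositories is an assumption on my part; the text only describes the add path.

```python
from typing import List, Set, Tuple


def plan_runner_group_sync(desired: Set[str], actual: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (repos_to_add, repos_to_remove) for one tenant's runner group.

    desired: repositories the tenant's configuration says should have access.
    actual:  repositories currently attached to the runner group.
    """
    to_add = sorted(desired - actual)
    to_remove = sorted(actual - desired)
    return to_add, to_remove


if __name__ == "__main__":
    desired = {"acme/api", "acme/web", "acme/infra"}   # hypothetical repos
    actual = {"acme/api", "acme/legacy"}
    add, remove = plan_runner_group_sync(desired, actual)
    print("add:", add)
    print("remove:", remove)
```

Because the plan is computed from state and not from events, running it twice in a row is harmless: the second run finds nothing to do.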

One caveat on "adopt by changing one line." That's true for a developer in an already-onboarded repo — flip the runs-on label and you're done. But onboarding a new tenant is a controlled platform-team operation, not one line: create the tenant's Terragrunt config, register/install the GitHub App, set repository_selection (all or selected), wire the runner group, configure any optional IAM-role trust and ECR access, and set the EC2/ARC runner specs (sizes, AMIs, images, resource limits). Forge makes that repeatable; it doesn't make it disappear. (We come back to it in §12.)

With the flow in hand, every decision below is really an answer to the question: "what breaks in that flow when you run it for forty teams in a small network, and how do you keep it secure and operable?"

5. Trade-off #1 — Networking: the IP ceiling

Networking is where most multi-tenant Kubernetes platforms quietly die, and it's the constraint that forced Forge's first and most instructive decision. To understand it you need one piece of AWS background.

The background: how pods get IP addresses on EKS

By default, Amazon's EKS uses the AWS VPC CNI (the aws-node component) for pod networking. CNI stands for Container Network Interface — the plugin that decides how a pod gets a network identity. The defining property of the AWS VPC CNI is that every pod gets a real, routable VPC IP address, drawn from the same pool your EC2 instances use. That's lovely for interoperability (a pod is a first-class citizen on your network) and brutal for density, because of how AWS attaches addresses.

An EC2 instance gets IPs through Elastic Network Interfaces (ENIs) — virtual NICs. Each instance type supports a fixed maximum number of ENIs, and each ENI supports a fixed number of private IPs. So the number of pods you can run on a node is capped by hardware-ish limits:

max pods per node = (max ENIs) × (IPs per ENI − 1) + 2

(One address on each ENI is the ENI's own primary IP, hence the − 1; the + 2 accounts for pods that use host networking and so need no ENI address.)

The constraint: a small, fixed network

Forge's runners must live in corporate-provisioned VPCs, and those are small and non-negotiable. The live data plane sits in a /24 VPC (256 total addresses) carved into two /25 subnets of 123 usable addresses each (128 minus the five AWS reserves per subnet). You do not get to enlarge it. That's the box you're in.

The math that kills the default CNI

Take a common worker size, a node that supports 8 ENIs at 30 IPs each. That's 8 × (30 − 1) + 2 = 234 pod IPs per node. Now look at what that means in a /25 with 123 usable addresses: a single fully-packed node would need almost twice the addresses the entire subnet contains. Scale to the cluster's target of five nodes and you'd be asking for ~1,170 IPs, roughly 4.5× the entire /24 VPC.
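If you want to check this arithmetic for your own instance types and subnets, it fits in a few lines of Python. This is a sketch, with one thing worth flagging: AWS's published max-pods formula adds 2 to the ENI product, which is where 234 (and not 232) comes from. The CIDR below is a made-up placeholder; only the prefix length matters.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves five addresses in every subnet


def max_pods_vpc_cni(max_enis: int, ips_per_eni: int) -> int:
    """AWS VPC CNI pod ceiling: ENIs x (IPs per ENI - 1) + 2."""
    return max_enis * (ips_per_eni - 1) + 2


def usable_addresses(cidr: str) -> int:
    """Addresses left in a subnet after AWS takes its reserved five."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


pods_per_node = max_pods_vpc_cni(max_enis=8, ips_per_eni=30)
subnet_usable = usable_addresses("10.0.0.0/25")   # hypothetical CIDR
vpc_total = ipaddress.ip_network("10.0.0.0/24").num_addresses
five_node_demand = 5 * pods_per_node

if __name__ == "__main__":
    print("pod IPs per node:", pods_per_node)
    print("usable per /25:", subnet_usable)
    print("one node vs one subnet:", round(pods_per_node / subnet_usable, 2))
    print("five nodes vs whole /24:", round(five_node_demand / vpc_total, 2))
```

Swap in your own ENI limits and prefix lengths; if the ratio on the last line is above 1, the default CNI will not fit no matter how much compute you add.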

This isn't "the cluster runs slowly." It's "the cluster cannot be scheduled." The IP ceiling is hit before a single meaningful workload lands. And critically, throwing more compute at it makes things worse, because more nodes means more IP demand. The thing everyone instinctively scales (compute) is the thing actively hurting you.

The decision: replace the CNI with an overlay

Forge deletes the AWS VPC CNI and installs Calico, configured as an overlay network:

kubectl delete daemonset -n kube-system aws-node || true
kubectl apply -f .../calico/tigera-operator.yaml --server-side
# Installation custom resource:
spec:
  cni:           { type: Calico }
  calicoNetwork: { bgp: Disabled }

An overlay network gives pods IP addresses from a private, cluster-internal range (a CIDR that exists only inside Kubernetes) and encapsulates pod-to-pod traffic so it can travel across the real VPC without the VPC needing to know about every pod IP. The VPC route tables never see pod addresses. (bgp: Disabled tells Calico not to advertise routes via the BGP routing protocol — appropriate here, since we're encapsulating rather than peering pod routes into the network fabric.)

The consequence is the whole point: VPC IP addresses are now consumed only by nodes, not by pods — and a node is exactly one VPC IP, whether it packs one pod or a hundred. (Hold onto that fact; in §10 it quietly flips a cost recommendation.)

Now do the math again. A /24 with 256 addresses can hold on the order of 250 nodes instead of less than one node's worth of pods. Pod density stops being an AWS-imposed accident and becomes a deliberate policy choice — Forge fixes it at maxPods: 100 per node — completely decoupled from ENI limits. The ceiling that made the platform impossible is simply gone.
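The same back-of-the-envelope check, redone for the overlay case, shows how completely the constraint moves. A hedged sketch: the CIDR is a placeholder, the 100 comes from the maxPods value above, and the node ceiling is an upper bound, since anything else living in those subnets also consumes addresses.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5
MAX_PODS_PER_NODE = 100  # the maxPods policy value quoted in the text

vpc = ipaddress.ip_network("10.0.0.0/24")        # hypothetical CIDR
subnets = list(vpc.subnets(new_prefix=25))       # the two /25 subnets
usable_per_subnet = [s.num_addresses - AWS_RESERVED_PER_SUBNET for s in subnets]
node_ceiling = sum(usable_per_subnet)            # upper bound on nodes


def vpc_ips_needed(nodes: int) -> int:
    """With an overlay, each node costs one VPC IP regardless of pod count."""
    return nodes


def pod_capacity(nodes: int) -> int:
    """Pod density is now a policy knob, not an ENI limit."""
    return nodes * MAX_PODS_PER_NODE


if __name__ == "__main__":
    print("usable per subnet:", usable_per_subnet)
    print("node ceiling:", node_ceiling)
    print("5 nodes need", vpc_ips_needed(5), "VPC IPs for", pod_capacity(5), "pods")
```

Five nodes now cost five VPC addresses for up to 500 pods, where the default CNI would have asked for well over a thousand.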

What it costs (because nothing is free)

What it buys: breaks the IP ceiling; lets you run dense pod workloads inside tiny, fixed subnets; pod density becomes a tuning knob instead of a hard wall.
What it costs: you now run and upgrade a second networking layer with its own release cadence and its own failure modes. Overlay networking has real operational edges — image-pull problems and operator-ordering bugs have caused incidents serious enough to have named fix branches in the repo. And bring-up ordering is genuinely fragile: the aws-node deletion is best-effort (|| true), and every node must carry a synthetic dependency so it can never join the cluster before the CNI swap completes — a node that registers with no CNI has no working pod network at all, which is a confusing way to fail.

The lesson worth taking even if you never run Kubernetes: at scale, the binding constraint is frequently not the resource everyone watches. Here it was IP addresses, not CPU. There was no "add more nodes" answer (that makes it worse) and no "resize the subnet" answer (you don't control it), so the only move was to change the networking layer itself. Find your real ceiling before you optimize the comfortable one. This is decision #1 of about a dozen.

6. Trade-off #2 — Isolation: per-tenant DinD node pools

This is the clearest place in the entire platform where Forge chose isolation over efficiency, knowingly, and paid cash for it. That's exactly what makes it a good teaching example — it's an explicit trade, not a default.

The background: Karpenter, taints, tolerations, affinity

A few Kubernetes concepts make this section legible:

Karpenter is an autoscaler that provisions EC2 nodes on demand to fit pending pods, using a NodePool (rules for what kind of nodes to create) and an EC2NodeClass (the AWS specifics — AMI, IAM role, subnets). When pods need a home and none fits, Karpenter launches a node; when nodes go idle, it removes them.
A taint is a "keep off" mark on a node: by default, pods won't schedule onto a tainted node.
A toleration is a pod's permission slip that lets it tolerate a specific taint — i.e., it's allowed onto that node. Important subtlety: a toleration lets a pod land on a tainted node; it does not force it to. A pod with a toleration could still land somewhere else.
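nodeAffinity is a pod's requirement about which nodes it must run on. A required affinity (the full field name is requiredDuringSchedulingIgnoredDuringExecution, shortened to requiredDuringScheduling here) is a hard rule: schedule me only on nodes matching this.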

The decision: per-tenant nodes for the path that needs them

The strongest node-level isolation is applied where the risk actually lives: the DinD path. Each DinD tenant gets its own Karpenter NodePool and EC2NodeClass (named karpenter-<tenant>), and those nodes are stamped with two taints:

taints:
  - { key: forge.local/scale_set_type, value: dind, effect: NoSchedule }
  - { key: forge.local/tenant,          value: <tenant>, effect: NoSchedule }

A tenant's build pods then carry both matching tolerations and a hard requiredDuringScheduling nodeAffinity on those same two keys. You need both halves, and understanding why is the whole point:

The tolerations let the tenant's pods onto the tenant's (tainted) nodes — and the taints keep everyone else's pods off.
The required affinity stops the tenant's own pods from wandering onto some other, untainted node. Tolerations alone would permit that drift; affinity forbids it.

Belt and suspenders. The result is that a DinD tenant's jobs can land only on that tenant's machines — enforced by the scheduler, not by hope or convention. (To be precise about scope: this hard taint/affinity node-pinning is the DinD template's behavior. The plain k8s runner mode still gets real isolation — its own namespace, service account, runner group, and IAM role — but it doesn't carry the same tenant taint/affinity in the checked-in template; if you need hard node-pinning for non-DinD jobs too, you add that scheduling layer explicitly.)
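The "you need both halves" argument is easy to verify with a toy model of the scheduler's filter. The Python sketch below is a deliberate simplification, not the real Kubernetes scheduler: it models only NoSchedule taints and exact-match label requirements, and it assumes the tenant nodes carry labels under the same two keys as the taints. All class and variable names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

TYPE_KEY = "forge.local/scale_set_type"
TENANT_KEY = "forge.local/tenant"


@dataclass
class Node:
    name: str
    labels: Dict[str, str] = field(default_factory=dict)
    taints: List[Tuple[str, str]] = field(default_factory=list)  # NoSchedule only


@dataclass
class Pod:
    name: str
    tolerations: Set[Tuple[str, str]] = field(default_factory=set)
    required_labels: Dict[str, str] = field(default_factory=dict)  # hard affinity


def can_schedule(pod: Pod, node: Node) -> bool:
    """A pod fits if it tolerates every taint AND the node meets its affinity."""
    tolerated = all(taint in pod.tolerations for taint in node.taints)
    affinity_ok = all(node.labels.get(k) == v for k, v in pod.required_labels.items())
    return tolerated and affinity_ok


acme_pairs = [(TYPE_KEY, "dind"), (TENANT_KEY, "acme")]
acme_node = Node("acme-1", labels=dict(acme_pairs), taints=list(acme_pairs))
shared_node = Node("shared-1")  # untainted, unlabeled

tolerations_only = Pod("drifter", tolerations=set(acme_pairs))
pinned = Pod("pinned", tolerations=set(acme_pairs), required_labels=dict(acme_pairs))
outsider = Pod("outsider")  # another tenant's pod: no tolerations

if __name__ == "__main__":
    for pod in (tolerations_only, pinned, outsider):
        for node in (acme_node, shared_node):
            print(f"{pod.name:>9} -> {node.name:<8} {can_schedule(pod, node)}")
```

The pod with tolerations only is accepted by both nodes, which is the drift the affinity exists to forbid; the pinned pod is accepted by the tenant node alone, and the outsider is kept off it by the taints.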

One implementation detail worth admiring: the per-tenant EC2NodeClass isn't hand-written forty times. A data source reads the shared node class (kubectl get ec2nodeclass karpenter -o yaml), strips the server-managed fields with yq, and renames it with jq to karpenter-<tenant>. So every tenant inherits identical, correct AMI/role/subnet wiring under its own name. Sameness by construction; difference only in the label.
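The read, strip, rename pattern is worth seeing in miniature. Forge does it with kubectl, yq, and jq; the Python sketch below shows the same transformation on an already-parsed object. Which fields Forge actually strips is an assumption here: I've listed the usual server-populated Kubernetes metadata fields plus status. The sample spec values are invented.

```python
import copy

# Fields the API server fills in; they must not be sent back on create.
SERVER_MANAGED_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "generation",
    "managedFields",
)


def clone_node_class(shared: dict, tenant: str) -> dict:
    """Copy the shared EC2NodeClass and rename it karpenter-<tenant>."""
    clone = copy.deepcopy(shared)          # never mutate the source object
    clone.pop("status", None)
    metadata = clone.setdefault("metadata", {})
    for field_name in SERVER_MANAGED_METADATA:
        metadata.pop(field_name, None)
    metadata["name"] = f"karpenter-{tenant}"
    return clone


if __name__ == "__main__":
    shared = {
        "kind": "EC2NodeClass",
        "metadata": {"name": "karpenter", "uid": "abc-123", "resourceVersion": "42"},
        "spec": {"role": "example-node-role"},   # invented value
        "status": {"conditions": []},
    }
    print(clone_node_class(shared, "acme"))
```

Everything under spec passes through untouched, which is the point: the only thing that can differ between tenants is the name.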

Why Docker-in-Docker is the forcing function

You might ask: why pay for per-tenant nodes at all? The answer is Docker-in-Docker (DinD). Many CI jobs build container images, which means they need a Docker daemon, which traditionally means a privileged container — one with elevated host access. Running privileged builds from multiple tenants on shared nodes is an unacceptable blast radius: a malicious or simply buggy privileged build could reach a neighbor. Pinning each tenant to its own nodes contains that.

And even within that, Forge adds a second layer: DinD runs rootless.

dind sidecar: image docker:dind-rootless
  runAsUser 1001
  subuid/subgid 100000:65536
  privileged: true   # the nested runtime needs it — but the user is 1001, not root

The subuid/subgid remapping is user-namespace remapping: processes that think they're running as root inside the build are actually mapped to an unprivileged high-numbered UID (100000+) on the host. So even the one genuinely privileged thing in the system is defanged at the user level. You get node-level isolation and user-level isolation wrapped around the riskiest capability Forge offers.

What it buys: a blast radius of exactly one tenant, enforced by the scheduler; the ability to offer privileged Docker builds safely at all; a noisy or compromised build that cannot touch a neighbor.
What it costs: money. One DinD node pool per tenant means worse bin-packing and more partially-idle nodes than a single shared pool would have. It also leans on a Karpenter capacity-diversity feature the manifest's own comment flags as alpha, and it multiplies the number of NodePool/EC2NodeClass objects to manage. Forge bounds the cost with mitigations — DinD scales to zero when idle (min_runners: 0), idle nodes consolidate quickly (consolidateAfter: 1m), and a karpenter.sh/do-not-disrupt annotation protects a node that's mid-job from being consolidated out from under it.

The lesson: isolation is a dial, not a boolean. The engineering maturity isn't "maximize isolation" or "maximize density" — it's to choose, consciously, where you sit on the cost↔blast-radius curve for your threat model, and to write that choice down so the next person knows it was deliberate. Forge turned the dial toward isolation because privileged builds demanded it, and paid for it on purpose. Decision #2.

7. Trade-off #3 — Identity: zero static credentials

CI is one of the highest-value targets in any organization: it executes code with access to credentials, source, artifacts, and internal networks. If you're going to attack a company, its build system is a wonderful place to start. So Forge's credential model is built around one non-negotiable rule: no long-lived secrets in pipelines, ever.

The background: roles, AssumeRole, and why "no keys" is even possible

In AWS, you can grant access two ways. The old way is a static credential — an access key and secret, long-lived, that you stash somewhere (a workflow secret, an env var) and that works until someone rotates it. Static keys are the thing that leaks: copied into logs, committed by accident, shared between teams, never rotated.

The modern way is role assumption. An IAM role is a set of permissions that an authorized identity can temporarily borrow by calling AWS STS (Security Token Service) AssumeRole, which mints short-lived credentials that expire in minutes to hours. Nothing long-lived is stored anywhere. The question is just: what proves you're authorized to assume the role? That proof is the runner's ambient identity.

The model: short-lived role-chaining off the runner's identity

A Forge runner is given an ambient AWS identity — on EC2, an instance profile; on Kubernetes, an EKS Pod Identity association. That identity is permitted to AssumeRole into the tenant's own role, and the tenant configures that role to trust the Forge runner role. The workflow then uses the standard aws-actions/configure-aws-credentials action with role-to-assume, gets short-lived creds, and does its work. No static key ever exists.

One precise correction, because it's almost always described wrong: this is STS role-chaining off the runner's instance/pod identity — not GitHub OIDC. The trust originates from the AWS identity Forge attaches to the runner, not from a token GitHub issues. (GitHub OIDC is a fine pattern; it's just not the one in play here, and conflating them will mislead you when you debug.) And the tenant stays in control of exactly how far that access reaches: the role Forge assumes can grant direct resource access, or it can be the first hop in a chain of role assumptions into further accounts — the tenant decides the scope by writing the policies, not the platform.

On Kubernetes the path is deliberately dual. Pod Identity is primary. But DinD scale sets also get an IRSA trust — IAM Roles for Service Accounts, where a projected Kubernetes service-account token (audience sts.amazonaws.com) is exchanged for AWS credentials via AssumeRoleWithWebIdentity. Why carry two mechanisms? Because inside Docker-in-Docker, the Pod Identity agent's link-local metadata hop (169.254.170.23) isn't reliably reachable, so the projected token is the fallback that keeps AWS auth working from within the nested Docker runtime. It's the kind of detail you only learn by getting paged. (One precision note, since "OIDC" gets thrown around loosely: GitHub OIDC is not the trust root for tenant access here — but AWS/EKS OIDC does appear, via IRSA, on this DinD fallback path. Two different OIDCs; keep them separate when you debug.)

The problem: the tenant's mistake is your incident

Here's the operational reality that drives the next design. The overwhelming majority of onboarding failures are not runner bugs — they're IAM trust mistakes: the tenant typo'd an ARN, or pointed the trust at the wrong principal, or allowed sts:AssumeRole but forgot sts:TagSession (which Forge needs for session tagging). And these mistakes are invisible until a real job runs and fails — usually at the worst possible moment, in front of the tenant.

In a multi-tenant platform, the boundary you don't control (the tenant's IAM) is the boundary that pages you. So Forge validates that boundary proactively, on a schedule, with a small purpose-built robot.

The robot: a trust-validator

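The trust-validator is a purpose-built system of two Lambdas plus a delay queue, running every ten minutes. A preparer injects temporary trust for the validator's role, the delay queue holds the request while IAM propagates, and the validator then exercises AssumeRole and TagSession against each tenant role before the temporary trust is removed.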

Two design choices here are pure real-world scar tissue, and both teach something:

Why two Lambdas instead of one? The validator has to assume a role, and before it ever runs the preparer has to inject trust for the validator's own role ARN. If the validator's identity were created as a normal Terraform output, you'd have a chicken-and-egg dependency cycle. Forge breaks it by computing the validator's role ARN deterministically in Terraform, so the preparer can reference it ahead of time. Splitting prepare-and-validate into two functions is what makes that clean.
Why a 300-second delay? Because IAM is eventually consistent. When you write a trust policy, that change is not instantly visible at every STS endpoint — for a short window you'll get a spurious AccessDenied even though the policy is correct. A naive validator would report false failures constantly. The delay (the variable is literally named iam_propagation_delay_seconds) gives the change time to propagate, and as a second safety net AccessDenied is explicitly treated as retryable with backoff. "Eventual consistency" sounds academic until it's flaking your validator every ten minutes.

And note the validator tests TagSession separately from AssumeRole, precisely because a tenant role can permit one and forget the other — checking only the first gives you a false green. The whole thing wraps cleanup in a finally so the temporary trust is always removed, even when validation throws.
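The three behaviors just described (retryable AccessDenied with backoff, separate AssumeRole and TagSession checks, cleanup in a finally) can be sketched together. This is not Forge's validator: the function names, the attempt count, and the injected callables are all assumptions, and `AccessDenied` is a stand-in for the real STS error so the sketch runs without AWS.

```python
import time
from typing import Callable, Dict


class AccessDenied(Exception):
    """Stand-in for the STS AccessDenied error."""


def with_backoff(check: Callable[[], None], attempts: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run check(); treat AccessDenied as retryable, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            check()
            return True
        except AccessDenied:
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))
    return False


def validate_tenant(assume_role: Callable[[], None],
                    tag_session: Callable[[], None],
                    remove_temporary_trust: Callable[[], None],
                    sleep: Callable[[float], None] = time.sleep) -> Dict[str, bool]:
    """Check AssumeRole and TagSession separately; always clean up."""
    try:
        return {
            "assume_role": with_backoff(assume_role, sleep=sleep),
            "tag_session": with_backoff(tag_session, sleep=sleep),
        }
    finally:
        # Runs even if a check raises something unexpected.
        remove_temporary_trust()


calls = {"assume": 0}
cleaned = []
waits = []


def flaky_assume() -> None:
    # Fails twice, as if the trust policy had not propagated yet.
    calls["assume"] += 1
    if calls["assume"] < 3:
        raise AccessDenied("trust not propagated yet")


def forgotten_tag_session() -> None:
    raise AccessDenied("sts:TagSession not allowed")


result = validate_tenant(flaky_assume, forgotten_tag_session,
                         lambda: cleaned.append(True), sleep=waits.append)
print(result)
```

In this run the AssumeRole check recovers after two retries while the TagSession check stays red, which is exactly the false green a single combined check would have hidden.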

What it buys: trust problems caught proactively, before any real job depends on them; a clear per-tenant pass/fail across both AssumeRole and TagSession; what would be recurring 2am pages turned into a ten-minute cron job.
What it costs: you build and operate an actual small distributed system — two Lambdas, a delay queue, live IAM mutation — purely to validate configuration. Cleanup has to be bulletproof (that finally is load-bearing — a missed cleanup leaves a trust door ajar), and handling eventual consistency adds genuine complexity.

The lesson: in any multi-tenant system, the boundary you don't own is the boundary that wakes you up. Investing in synthetic, scheduled validation of that boundary is what converts an entire class of inevitable incidents into a dashboard you glance at. For a platform run with near-zero ops, that conversion isn't a nicety — it's the mechanism that makes near-zero ops possible. Decision #3.

8. Trade-off #4 — Immutable clusters & blue/green

Upgrading a live Kubernetes cluster is one of the more anxiety-inducing operations in this whole space. An EKS upgrade is really a coordinated upgrade of many coupled things at once — the control plane version, the add-ons (CoreDNS, kube-proxy, the EBS CSI driver), Karpenter, and the CNI — and any one of them can regress in a way that's hard to undo while production traffic is flowing through the cluster. Forge sidesteps the anxiety with a simple, radical stance: never upgrade a cluster in place.

The background: immutable infrastructure

"Immutable infrastructure" means you don't modify running things — you replace them. Instead of patching a server, you build a new image and roll it out. Instead of upgrading a cluster, you build a new cluster and move work to it. The payoff is that rollback becomes trivial (the old thing is still there, untouched) and "upgrade" stops being a high-wire act. The cost is that you need a clean way to move work between the old and new things.

The decision: disposable clusters in blue/green pairs

Forge runs EKS clusters as immutable and disposable, in blue/green pairs — at any time there's a -blue and a -green cluster, one active. To upgrade, you stand up the fresh sibling and migrate tenants onto it, one at a time. The entire cutover is controlled by just two values in a tenant's configuration: arc_cluster_name (which cluster this tenant lives on) and migrate_arc_cluster (a switch that tears down the tenant's footprint on its current cluster).

The mechanism that makes this clean is subtle and worth internalizing: the ARC module never hardcodes a cluster endpoint. Its Kubernetes and Helm providers resolve dynamically from a data source keyed on the cluster name:

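data "aws_eks_cluster" "c" { name = var.eks_cluster_name }   # = arc_cluster_name
# Auth data source that supplies the token referenced below, keyed on the same name.
data "aws_eks_cluster_auth" "c" { name = var.eks_cluster_name }
provider "kubernetes" {
  host                   = data.aws_eks_cluster.c.endpoint
  # CA bundle for the resolved cluster (the usual companion to host and token).
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.c.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.c.token
}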

Because everything resolves from the name, flipping arc_cluster_name from -green to -blue makes the very next apply repoint every in-cluster resource at the other cluster — no endpoints to edit, no state to surgically move. And migrate_arc_cluster: true acts as a master switch threaded through the module tree: when true, every in-cluster resource (the namespace, the ARC controller, the scale set, the Pod Identity association, the Karpenter node pool) is set to count = 0 and destroyed. One detail makes the whole thing safe: the IAM runner role is deliberately not gated by that switch, so it survives the migration — which means the tenant's trust relationships (and the trust-validator from §7) don't churn just because you moved clusters.

The sequence, concretely

A migration script drives this per tenant:

1. Detect direction: read the current arc_cluster_name and figure out green→blue or blue→green (and hard-error if it's neither, rather than guess).
2. Drain to zero: patch each runner scale set to minRunners: 0, maxRunners: 0, then wait until no runner pods remain. You don't yank work out from under running jobs; you stop accepting new ones and let the in-flight ones finish.
3. Disable on OLD: set migrate_arc_cluster = true with the old name and apply. The tenant's footprint on the source cluster is torn down.
4. Pre-stage on NEW: flip the name to the target, keep migrate_arc_cluster = true, and apply. Providers now point at the new cluster, but nothing is installed yet.
5. Enable on NEW: set migrate_arc_cluster = false with the target name and apply, then re-apply the trust-validator against the new home.
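The sequence above boils down to three applies with different values of the two config keys, once the drain has finished. A minimal sketch, assuming clusters are named with a `-blue` or `-green` suffix; the function names and the `forge-green` cluster name are hypothetical, and the real migration script is not shown in this article.

```python
from typing import List, Tuple


def sibling(cluster: str) -> str:
    """Return the other cluster of a blue/green pair; refuse to guess otherwise."""
    if cluster.endswith("-blue"):
        return cluster[: -len("-blue")] + "-green"
    if cluster.endswith("-green"):
        return cluster[: -len("-green")] + "-blue"
    raise ValueError(f"cluster {cluster!r} is neither -blue nor -green")


def migration_applies(current: str) -> List[Tuple[str, bool]]:
    """(arc_cluster_name, migrate_arc_cluster) for each apply, after the drain."""
    target = sibling(current)
    return [
        (current, True),   # disable on OLD: footprint torn down
        (target, True),    # pre-stage on NEW: providers repointed, nothing installed
        (target, False),   # enable on NEW: footprint created on the target
    ]


print(migration_applies("forge-green"))
# [('forge-green', True), ('forge-blue', True), ('forge-blue', False)]
```

Seen this way, rollback before the final apply is just the same list walked in the other direction, which is why the name flip is the only thing a human has to reason about.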

The honest trade-off

What it buys: safe, reversible cluster upgrades — the source is untouched until the target is proven, so rollback is "flip the name back"; other tenants are completely unaffected because each has its own namespace and, for DinD workloads, its own node pool; you migrate on your own schedule, tenant by tenant.
What it costs: the migrating tenant has a runner gap — the window between drain (step 2) and enable (step 5) when its jobs queue and wait. And the tooling has sharp edges worth knowing: a step that strips Kubernetes finalizers busy-loops with no timeout (if the controller is wedged, it spins forever), and the drain check is a substring match on pod names that a stuck terminating pod can block.

Here's the part you must say out loud, because it's where a careless write-up would lie to you: this is not zero-downtime for the migrating tenant. It is blast-radius-of-one — the platform keeps serving everyone else flawlessly while a single tenant cuts over with a brief gap. The unit of "no disruption" is the platform, not the individual tenant. Naming that precisely is the difference between a talk a senior engineer trusts and one they tune out. Decision #4.

9. Trade-off #5 — Config at scale

Everything above only matters if you can actually onboard forty teams without it being forty projects. A platform that's painful to add tenants to never gets forty tenants — it gets five and a backlog. So the configuration architecture is load-bearing for the whole low-ops story.

The background: Terraform vs. Terragrunt, and DRY

Terraform describes infrastructure as code. But across dozens of near-identical deployments (same module, different tenant/region/account), plain Terraform pushes you toward copy-paste — and copy-paste is where drift is born. Terragrunt is a thin wrapper over Terraform whose main job is to keep things DRY (Don't Repeat Yourself): shared configuration is defined once and inherited, and each deployment is just the small set of values that make it unique.

The decision: the directory path is the configuration

Forge deploys one module per tenant × region × account, arranged so that the filesystem layout encodes the coordinates:

environments/<aws_account>/regions/<aws_region>/vpcs/<vpc_alias>/tenants/<tenant_name>/

The tenant's name is literally basename() of its directory, and the path coordinates are account → region → VPC alias → tenant. Configuration cascades down the tree via Terragrunt's find_in_parent_folders: organization-wide constants at the top; an account/environment layer that supplies the AWS account id, profile, and remote-state settings; a region layer with the region and its short alias; a VPC layer with the concrete VPC and subnet IDs; and finally the tenant's own config.yml. A small config.hcl at the tenant level yamldecodes that YAML and expands it into the runner specs — and, crucially, into the runs-on label set that is the tenant's API:

runner_labels = [
  "type:${spec.type}",
  "self-hosted",
  spec.runner_architecture,
  "env:ops-${include.env.locals.env}",
]
extra_labels = [
  "ec2",
  "rgn:${local.region_alias}",
  "vpc:${local.vpc_alias}",
  "tnt:${local.tenant_name}",
]

There's a clever bit in how those labels are matched. Rather than requiring a workflow to specify the exact full label set, Forge generates matchers for every contiguous sub-slice of the extra labels appended to the base — so a job can match with any reasonable subset (just type:standard and self-hosted, say) while the platform still always knows the full identity (tenant, region, VPC, lane). Flexibility for the user, precision for the platform.
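The sub-slice idea is easier to see in code than in prose. A minimal sketch, assuming a job matches when its labels are a subset of at least one generated matcher; the label values are made up, and including the base labels on their own is my assumption, not something the article states.

```python
from typing import Iterable, List


def label_matchers(base: List[str], extra: List[str]) -> List[List[str]]:
    """Base labels alone, plus base + every contiguous sub-slice of extra."""
    matchers = [list(base)]
    for start in range(len(extra)):
        for end in range(start + 1, len(extra) + 1):
            matchers.append(list(base) + extra[start:end])
    return matchers


def job_matches(job_labels: Iterable[str], matchers: List[List[str]]) -> bool:
    """A job matches if every label it asks for appears in some matcher."""
    wanted = set(job_labels)
    return any(wanted <= set(matcher) for matcher in matchers)


# Hypothetical label values, shaped like the runner_labels / extra_labels above.
base = ["type:standard", "self-hosted", "x64", "env:ops-prod"]
extra = ["ec2", "rgn:euw1", "vpc:main", "tnt:team-a"]
matchers = label_matchers(base, extra)

print(len(matchers))                                          # 11
print(job_matches(["type:standard", "self-hosted"], matchers))  # True
print(job_matches(["self-hosted", "tnt:other"], matchers))      # False
```

Four extra labels give ten contiguous sub-slices, so the matcher list stays small; generating every subset instead would grow as 2^n.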

Two more touches that make scale comfortable:

Version pinning with a local-dev escape hatch. Module versions are pinned indirectly through a release_versions.yaml (everything at a single release ref), and a use_local_repos flag flips every module source from the git ref to a local file:// checkout for iteration. One switch, zero per-tenant edits — you can test a platform change against a real tenant config without touching the tenant config.
State isolation for free. There's one S3 state bucket and one DynamoDB lock table per account (the bucket name carries the account id), and the Terraform state key is the directory path. So per-tenant/region/VPC state separation isn't designed per tenant — it falls straight out of where the directory sits. The structure does the work.
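Here is a small sketch of what "the path is the configuration" means mechanically: the coordinates and the state key are both read straight off the directory. The account ID, VPC alias, and tenant name are invented, and the `terraform.tfstate` file name at the end of the key is an assumption (a common Terragrunt convention), not something confirmed by the Forge repo.

```python
from pathlib import PurePosixPath

EXPECTED_KEYS = ("environments", "regions", "vpcs", "tenants")


def coordinates(tenant_dir: str) -> dict:
    """Read account, region, VPC alias and tenant name out of a tenant directory path."""
    parts = PurePosixPath(tenant_dir).parts
    # Even positions are the fixed folder names, odd positions are the values.
    if len(parts) != 8 or parts[0::2] != EXPECTED_KEYS:
        raise ValueError(f"unexpected layout: {tenant_dir!r}")
    account, region, vpc_alias, tenant = parts[1::2]
    return {"account": account, "region": region,
            "vpc_alias": vpc_alias, "tenant": tenant}


def state_key(tenant_dir: str) -> str:
    """State key derived from the directory path (file name is an assumption)."""
    return str(PurePosixPath(tenant_dir) / "terraform.tfstate")


path = "environments/111111111111/regions/eu-west-1/vpcs/main/tenants/team-a/"
print(coordinates(path))
print(state_key(path))
```

Because two tenants can never share a directory, they can never share a state key either, which is the "for free" part.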

What it buys: adding a tenant is a configuration change, not a project — minutes, not days; dev and prod can run different module versions safely; genuine DRY across ~40 deployments behind one module and one upgrade path.
What it costs: Terragrunt plus layered HCL has a real learning curve, and there are footguns worth flagging — remote state can live in one region even for resources in another (a cross-region surprise), and the state bucket and lock table sharing a name can confuse newcomers. Forge also deliberately avoids cross-unit dependency blocks (each tenant is one self-contained module instance); the rare cross-module ordering, like a full cluster rebuild, is handled by an external DAG-resolver script with a pragmatic stabilization delay. Pragmatic over pure.

The lesson: when you have N near-identical deployments, make the difference between them a small data file and let structure carry the sameness. If your marginal deployment costs minutes, "forty tenants" stops being a scaling problem and becomes a directory listing. Decision #5.

10. Trade-off #6 — The connective tissue

The big architectural calls get the headlines, but a platform actually survives on its connective tissue — the small, resilient systems that quietly remove whole categories of toil so a human never has to. Three of them, then the thing that ties it all together.

Audit-grade job-log archival

GitHub keeps job logs for a limited window, but you often need them longer — for audits, for debugging a flaky job a week later, for compliance evidence. So when a job finishes (workflow_job with action: completed), Forge archives its logs through a deliberately decoupled pipeline:

The event flows EventBridge → a dispatcher Lambda (which filters and forwards) → SQS → an archiver Lambda that authenticates as the GitHub App (minting a short-lived installation token), downloads the logs, and writes them to a per-tenant S3 bucket, KMS-encrypted, keyed by {repo}/{run}/{attempt}/{job} as both a raw .log and a structured .json. Two details show the care: the SQS visibility timeout is set just above the archiver's Lambda timeout (the event-source mapping refuses to be created otherwise — there's a comment in the code so the next person doesn't "fix" it), and failed messages retry up to ten times before landing in a dead-letter queue (DLQ). A separate redrive Lambda re-injects DLQ messages back into the pipeline every ten minutes, so a transient GitHub API hiccup self-heals instead of silently dropping audit logs. Nobody gets paged for a blip.
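Two pieces of that pipeline are simple enough to sketch without AWS: the object key layout and the "retry up to ten times, then dead-letter" behavior. The function names and the sample repo and IDs are hypothetical; the retry loop only models what SQS redelivery does and is not how the real archiver is written.

```python
from typing import Callable, Tuple


def archive_keys(repo: str, run_id: int, attempt: int, job_id: int) -> Tuple[str, str]:
    """S3 keys for the raw log and the structured record of one job."""
    prefix = f"{repo}/{run_id}/{attempt}/{job_id}"
    return f"{prefix}.log", f"{prefix}.json"


def deliver(handler: Callable[[dict], None], message: dict,
            max_receives: int = 10) -> Tuple[str, int]:
    """Retry handler up to max_receives times; report where the message ends up."""
    for receive_count in range(1, max_receives + 1):
        try:
            handler(message)
            return "archived", receive_count
        except Exception:
            continue  # the message becomes visible again and is redelivered
    return "dlq", max_receives


failures_left = {"n": 3}


def flaky_archiver(message: dict) -> None:
    # Simulates a transient GitHub API problem that clears after three tries.
    if failures_left["n"] > 0:
        failures_left["n"] -= 1
        raise RuntimeError("GitHub API hiccup")


def broken_archiver(message: dict) -> None:
    raise RuntimeError("still failing")


print(archive_keys("org/app", 42, 1, 7))
print(deliver(flaky_archiver, {"job": 7}))   # ('archived', 4)
print(deliver(broken_archiver, {"job": 7}))  # ('dlq', 10)
```

The redrive Lambda is what turns the second outcome into a delayed retry: every ten minutes the dead-lettered message goes back to the front of the same loop.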

A self-healing global lock

Some operations must not run concurrently across repos or workflows. Forge provides a distributed mutex backed by DynamoDB: a table keyed on a lock_id, with secondary indexes on the workflow run and attempt, and a TTL on a timestamp. Workflows acquire and release the lock (every runner role carries the small policy needed to do so). The self-healing part is a janitor Lambda that runs every ten minutes, checks each held lock's workflow-run status via the GitHub App, and deletes any whose run has already completed — with the DynamoDB TTL as a final backstop for anything the janitor can't resolve. A lock that would otherwise wedge the system instead expires and cleans itself up.
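A minimal in-memory sketch of the lock and its janitor, under stated assumptions: a plain dict stands in for the DynamoDB table, `run_is_completed` stands in for the GitHub App status lookup, and the one-hour default TTL is invented. In the real system the TTL expiry is done by DynamoDB itself; the sketch folds it into the sweep only so both release paths are visible in one place.

```python
from typing import Callable, Dict, List


def acquire(locks: Dict[str, dict], lock_id: str, run_id: int,
            now: float, ttl_seconds: float = 3600) -> bool:
    """Take the lock only if nobody holds it (models a conditional write)."""
    if lock_id in locks:
        return False
    locks[lock_id] = {"run_id": run_id, "expires_at": now + ttl_seconds}
    return True


def sweep_locks(locks: Dict[str, dict],
                run_is_completed: Callable[[int], bool],
                now: float) -> List[str]:
    """Release locks whose run has completed or whose TTL has passed."""
    released = []
    for lock_id, lock in list(locks.items()):
        expired = lock["expires_at"] <= now
        if expired or run_is_completed(lock["run_id"]):
            del locks[lock_id]
            released.append(lock_id)
    return released


locks: Dict[str, dict] = {}
first = acquire(locks, "deploy-prod", run_id=101, now=0.0)
second = acquire(locks, "deploy-prod", run_id=102, now=1.0)   # already held
acquire(locks, "publish", run_id=103, now=0.0, ttl_seconds=60)
acquire(locks, "release", run_id=104, now=0.0)

completed_runs = {101}
released = sweep_locks(locks, lambda run_id: run_id in completed_runs, now=120.0)
print(first, second)   # True False
print(released)        # ['deploy-prod', 'publish']
print(list(locks))     # ['release']
```

Run 101 finished without releasing, so the janitor frees its lock; run 103's status is unknown to the janitor, so the TTL frees it; run 104 is still working and keeps its lock.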

Warm pools, used surgically

Remember the one cost of ephemeral runners: per-job startup latency. And the numbers are worth being concrete about, because the picture is "fast when something's already warm, slow when you have to boot a machine":

EC2, warm-pool hit: a pre-booted runner picks up the job in under ~20 seconds.
EC2, cold: booting a fresh instance (provision, bootstrap, register) takes on the order of 5 minutes for Linux. Cold start also climbs with the OS and host type: Windows boots noticeably slower than Linux, and macOS is slowest of all, because Mac runners live on AWS Dedicated Hosts, so you can wait first on dedicated-host allocation and then on a Mac instance boot, which is far slower than a normal VM.
Kubernetes pod on an already-running (idle) node, image cached: also under ~20 seconds — the pod just schedules and the container starts. But this one is genuinely case-by-case: if the runner image isn't already cached on the node, you add image-pull time; and if no node is available, you add node boot (below). So a non-DinD pod's start is "fast when the node's up and the image is local," and degrades from there.
Kubernetes/DinD when no node is available (e.g., scaled to zero): you pay the worst case — Karpenter has to boot a node first, then the scheduler places the pod, then it pulls/starts the container — which lands in minutes, comparable to an EC2 cold start, because you're booting an EC2 node either way.

So the real asymmetry isn't "EC2 slow, Kubernetes fast" — it's "warm/idle is fast on both lanes; cold means booting an EC2 machine on both lanes, with the extra K8s variable of whether the image is already on the node." The blunt fix is a warm pool, and the expensive mistake is keeping warm capacity everywhere, always.

There's also a cost asymmetry between the two warm strategies that ties straight back to §5, and it's sharp. A node is exactly one VPC IP — with Calico, it makes no difference whether that node runs one pod or a hundred; it still consumes a single VPC address. So if you want idle, warm capacity sitting ready: an EC2 warm pool of N runners holds N VPC IPs the entire time it waits (spending your scarcest resource just to shave startup), whereas you can park many idle DinD pods behind one warm node and spend exactly one IP.

That single fact inverts the default recommendation at scale. For ordinary use, Forge actually recommends EC2 over DinD-on-EKS, because per job it's cheaper. But for high-churn, scaled workloads that want warm capacity hot and ready, DinD wins — you can keep a pool of idle runners behind a single locked node (one IP) instead of paying one IP per warm EC2 runner. And this isn't a rule set once and forgotten: Forge measures each tenant's actual usage (those cost/usage dashboards again) and periodically recalculates and recommends the cheaper mix as the tenant's pattern shifts. The trade-off is computed from data, not guessed at onboarding — usage-aware, and revisited.
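The IP side of that asymmetry is plain arithmetic. A tiny sketch, assuming the overlay model from §5 where a node costs one VPC address regardless of pod count; `pods_per_node` is a hypothetical packing figure, not a Forge setting.

```python
import math


def warm_ips_ec2(warm_runners: int) -> int:
    """Each warm EC2 runner is its own instance, so it holds its own VPC IP."""
    return warm_runners


def warm_ips_dind(warm_runners: int, pods_per_node: int) -> int:
    """Warm pods share nodes; only the nodes consume VPC IPs."""
    return math.ceil(warm_runners / pods_per_node)


print(warm_ips_ec2(20))        # 20
print(warm_ips_dind(20, 20))   # 1
print(warm_ips_dind(21, 20))   # 2
```

This counts only addresses, not dollars; the per-job cost comparison that favors EC2 for ordinary use is a separate calculation driven by the usage dashboards.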

So Forge uses warm capacity surgically: most runner types run none (pure on-demand, accept the cold-start), and the few latency-sensitive EC2 types keep a small warm pool only during business hours, doubled by timezone. For example, one warm runner is topped up on cron(*/5 9-18 ? * MON-FRI *) in both America/Los_Angeles and Europe/Madrid (AWS cron expressions take six fields, the last being the year), and the pool sits idle nights and weekends. On Kubernetes the same effect comes from keeping a little node headroom rather than scaling hard to zero on a latency-sensitive tenant, and it costs you nodes, not precious IPs. Cost is spent exactly where it buys responsiveness and nowhere else: the difference between "we have warm pools" and "we have a warm-pool bill."

Central logging & observability with Splunk — the thing that makes one team possible

Underneath all of it sits the system that makes operating forty tenants by a small team even conceivable: centralized logging and metrics. You cannot watch forty tenants by hand; you can only instrument them.

The pipeline has a few distinct paths, and it's worth being precise about them because "it all goes to Splunk" hides the real shape. The clean way to hold it in your head is: logs go one way (to Splunk Cloud, via CloudWatch), metrics go another (to Splunk O11y).

Logs — everything — go to CloudWatch Logs, then to Splunk Cloud. EC2 runners, the control-plane Lambdas, and EKS (both node and pod logs) all send their logs to CloudWatch Logs. From there, ingestion into Splunk Cloud (the log-analytics product) is done through Splunk's Data Manager app — the supported service for pulling AWS data into Splunk Cloud — and, true to the rest of the platform, Forge configures Data Manager as code: there's a Terraform module in the open-source repo that wires up the ingestion rather than anyone clicking through a console. A runner alone emits several log streams — OS syslog, the cloud-init / EC2 user-data bootstrap output (where most "the runner never came up" failures hide), the GitHub Actions job logs, the runner agent/worker logs, and the job-started/completed hook output — and they all travel this same path. (The archived job logs from §10's pipeline also land in per-tenant S3 and feed Splunk Cloud.) The result: a job's full story — bootstrap, execution, outcome — is reconstructable from one place after the runner itself is long gone, which, being ephemeral, it always is.
AWS metrics go to Splunk O11y. The Splunk Observability (O11y) + AWS integration pulls the CloudWatch metrics for the platform's AWS resources into Splunk O11y (the metrics/observability product) — and it, too, is configured by a Terraform module in the Forge repo, so the observability wiring is reproducible and reviewable like everything else.
EKS adds extra pod metrics to O11y via OTel. On top of the AWS metrics, each EKS cluster runs the Splunk OpenTelemetry (OTel) collector to send additional, pod-level metrics to Splunk O11y — the fine-grained container telemetry CloudWatch's AWS metrics don't capture. Note this is the metrics path only; EKS logs still go the log route above (CloudWatch → Splunk Cloud).

The payoff isn't just that the data lands somewhere — it's that Splunk Cloud carries a large set of Forge-specific field extractions, so the logs aren't an opaque blob: you can slice and filter them by meaningful dimensions (tenant, region, account, runner type, job, conclusion, and more), and the metrics in O11y are likewise sliceable by dimension. That's what turns "we have the logs" into "I can answer a question in one query." And — consistent with everything else in this article — those extractions and the dashboards themselves are managed as code, through Terraform's Splunk and SignalFx providers, not hand-built in a console; the observability is as reproducible as the infrastructure it watches. On top of it, each tenant gets its own dashboards — runner-lifecycle health, queue/capacity, trust-validation failures, the webhook/job-log pipeline, optimization signals like high-memory-job detection, and cost breakdowns fed by AWS Billing Data Exports so a team can see and right-size its own spend.

This isn't abstract. It's a concrete set of dashboards, each mapping a symptom to a subsystem:

Runner capacity: GitHub queue pressure vs. ARC scale-set state.
Lambda operations: scale errors, tagging, ingestion retries.
K8s storage/network: PVC/EBS attach, scheduler capacity, CNI readiness, API-audit.
Trust failures: the AssumeRole/TagSession detail from §7.
ARC/DinD health: init containers, hook sidecars, runner versions, Karpenter signals.
EC2 runner lifecycle: webhook → scale → AMI/SSM → user-data → hook.
Webhook/job-log pipeline: dispatcher → SQS/DLQ → archiver → ingestion.
Ingestion quality: tellingly, a view that watches the telemetry itself for missing fields and dropped logs, because a dashboard you can't trust is worse than no dashboard.

In the internal production deployment, the clusters also run Falco for runtime security monitoring (it's not in the public repo), so anomalous syscall behavior inside a running job is observable too.

The throughline: every GitHub-side symptom ("my job is stuck") has a backend view that walks you down the chain (label mismatch → webhook → Lambda/SQS → EC2/ARC → runner logs) and localizes it to a subsystem in a click or two, the difference between an operator who guesses and one who knows. And the same central data does two more jobs beyond troubleshooting: it provides the audit trail compliance needs (who ran what, where, with which permissions), and it makes per-tenant cost attribution possible instead of one undifferentiated AWS bill.

The payoff is an operating loop rather than a permanent fire drill: monitor → alert → automate. When something recurs, it becomes a detector or a script, and then it stops being work. That loop is why "near-zero ops" is a true statement and not a brag — the rigor of the instrumentation is precisely what removes the human toil.

Debugging an ephemeral platform — Teleport for auditable SSH

Ephemerality (§6, §7) is wonderful for security and cleanliness, but it creates a real problem: how do you debug a machine that deletes itself the instant the job ends? You can't SSH into a failed runner after the fact — it's gone. Forge answers this three ways, in order of how often you reach for them.

First and most often, the logs already have it. Because every stream is centralized in Splunk (above), the large majority of debugging is post-hoc and hands-off — you read the bootstrap output, the job log, and the hook output without touching a box at all.

Second, rerun it. Every job runs in a fully reproducible, ephemeral environment, so re-running a failed job from the GitHub UI replays the exact conditions with no leftover side effects — a luxury you simply don't have with long-lived, drifted runners.

Third, when you genuinely need to be on the machine, Teleport provides live, auditable SSH — to both EC2 runners and Kubernetes pods. The developer logs in through corporate SSO (tsh login --proxy=<teleport-proxy> --auth=CloudSSO), and because the runner would normally vanish at job end, they keep it alive deliberately — a sleep step in the workflow, or a wrapper — then tsh ssh into the EC2 instance or tsh kube into the pod and inspect it as it runs. Crucially, this is break-glass, not a backdoor: access is gated by AD/identity-group entitlement (requested via a ticket), it's narrow and time-bound, and Teleport records sessions centrally — so live debugging never means an untracked shell or a shared bastion key. You get the convenience of "just SSH in" without surrendering the audit trail that made the platform compliant in the first place.

That's the trade-off in miniature: ephemeral runners buy you security and a clean slate every time, and cost you easy debugging — and Forge pays that cost with central logs, reproducible reruns, and auditable Teleport access, rather than by giving up ephemerality. You keep the security property and stay able to troubleshoot.

11. Trade-off #7 — Staying fresh: automated dependencies & dogfooded images

A note on scope: unlike the previous six, this one is mostly internal process rather than open-source code. The building blocks are public — the Forge repo ships a Renovate config and Packer/Ansible image-build examples — but the end-to-end pipeline described here (the image repos and the dogfood CI) lives in internal repos. I'm including it because it's the part that makes everything above survivable over time, and the pattern is what's worth stealing, not the hostnames.

Here's a truth that gets left out of most platform write-ups: a platform is never "done." Even with zero new features, it sits on a foundation that moves constantly underneath you — GitHub changes the Actions runtime, the runner agent ships a new minimum version, ARC and Karpenter and Calico cut releases, Terraform providers update, base-OS packages get CVEs, and hundreds of transitive dependencies drift. (A concrete one: when GitHub migrated Actions from the Node 20 runtime to Node 24, every runner image and a pile of tenant workflows had to be tested and updated — not because Forge changed anything, but because the ground moved underneath it.) Standing still is not free; standing still is how you rot. The hard part of operating a platform with near-zero ops is therefore not the initial build — it's keeping a large, moving dependency surface current without a human babysitting it. Forge does that with three interlocking pieces.

Piece 1 — Renovate: turn the dependency treadmill into reviewable PRs

The first piece is Renovate, a bot that continuously scans every dependency source in every repo — Terraform/OpenTofu modules, GitHub Actions, Docker base images, Helm charts, the pinned upstream runner module, even tool versions — and, when something updates, opens a pull request automatically. A shared Renovate configuration (defined once, extended by every repo) sets the policy: updates are grouped sensibly (all AWS provider bumps together, all Docker bumps together) so you review related changes as a unit; safe changes (patch bumps, digest pins, pre-commit hooks) can auto-merge; major version bumps are labeled and delayed so a human looks; and security patches are prioritized.

The effect is a mindset shift. Dependency maintenance stops being a periodic, dreaded, manual sweep and becomes a steady stream of small, pre-tested PRs you either glance at and approve or let auto-merge. You're never "behind" in a scary way — you're continuously, incrementally current. For a one-person-shaped operation, this is the difference between drowning in the treadmill and riding it.

But automated dependency PRs are only safe if something proves each bump doesn't break the platform. That's where the other two pieces come in.

Piece 2 — Images as code: a layered base image

Forge's runners don't use off-the-shelf images; they run images built as code with Packer (which bakes machine images) and Ansible (which configures what goes inside). The structure is two-layered, across three operating systems:

Base images for Ubuntu, macOS, and Windows — deliberately minimal. Each starts from a CIS-aligned hardened OS and installs only the common foundation every runner of that OS needs: the GitHub Actions runner agent, the container/Docker runtime where applicable, the AWS CLI, and the platform's own agents for access and observability (Teleport for break-glass, the Splunk/CloudWatch log shippers). That's it. The base is kept lean on purpose. (For container jobs, images are pulled from either a Forge ECR or the tenant's own ECR, depending on configuration — so a team can ship a private container image without it ever passing through the platform.)
Tenants own their toolchains. A tenant either runs on the minimal base image as-is, or builds its own custom image on top of the base with whatever it actually needs — Go, Python, Terraform/OpenTofu, Packer, and so on. Forge deliberately does not manage those toolchains. Curating every team's language and tool versions doesn't scale and isn't the platform's job — it's a treadmill that never ends. Keeping the base minimal and pushing toolchains into tenant-owned custom images is exactly what stops the platform from drowning in "can you add X to the image" tickets, and it means one team's toolchain choices can't destabilize anyone else's runners.
The Forge team is itself a tenant. The team that operates Forge runs its own custom image for its own runners — because it consumes Forge exactly like any other tenant, through the same base-image → custom-image path it gives everyone else. That's not a metaphor for dogfooding; it is the dogfooding. The platform team eats the same food it serves.
A build-matrix control system lets a PR skip specific OS/version/architecture combinations via tokens in the PR title or commit message (e.g. skip:windows, skip:ubuntu:22:arm64) — so a change that only touches one image doesn't pay to rebuild all of them across three OSes and multiple architectures.
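The skip tokens lend themselves to a short sketch. The token grammar (`skip:<os>[:<version>[:<arch>]]`) is inferred from the two examples above, and treating a shorter token as a prefix that skips everything beneath it is my assumption; the matrix entries and the PR title are invented.

```python
import re
from typing import Dict, List

SKIP_TOKEN = re.compile(r"\bskip:(\w+(?::\w+)*)")


def parse_skip_tokens(text: str) -> List[List[str]]:
    """Extract skip tokens from a PR title or commit message as lists of fields."""
    return [token.split(":") for token in SKIP_TOKEN.findall(text)]


def is_skipped(combo: Dict[str, str], skips: List[List[str]]) -> bool:
    """A combo is skipped when some token matches its leading fields."""
    fields = [combo["os"], combo["version"], combo["arch"]]
    return any(fields[: len(skip)] == skip for skip in skips)


# Hypothetical build matrix and PR title.
matrix = [
    {"os": "ubuntu", "version": "22", "arch": "amd64"},
    {"os": "ubuntu", "version": "22", "arch": "arm64"},
    {"os": "ubuntu", "version": "24", "arch": "arm64"},
    {"os": "windows", "version": "2022", "arch": "amd64"},
    {"os": "macos", "version": "14", "arch": "arm64"},
]
title = "Bump Docker base image (skip:windows, skip:ubuntu:22:arm64)"

skips = parse_skip_tokens(title)
remaining = [combo for combo in matrix if not is_skipped(combo, skips)]
print(skips)            # [['windows'], ['ubuntu', '22', 'arm64']]
print(len(remaining))   # 3
```

With this title, all Windows builds and the one Ubuntu 22 arm64 build drop out, leaving three of the five combinations to rebuild.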

And there's a second trade-off nested inside the tenant's choice — when to get its tools onto the runner. A team can install everything at workflow runtime (apt-get, pip install, download a toolchain at the top of the job): zero image to build or maintain, but every single run pays that installation cost and is exposed to a slow or flaky upstream mirror. Or it can bake those tools into a custom image: more work to build and keep fresh, but runs start instantly with everything already present and reproducible. Neither is "right" — it's the classic build-time-vs-run-time trade-off, and Forge deliberately leaves it to the tenant, because only the tenant knows whether it values low maintenance or fast, repeatable runs. The platform's job is to make both paths work cleanly, not to pick for them.

Building images as code matters for two reasons. First, freshness is not optional: GitHub will reject self-hosted runners whose agent is too old, so a stale image doesn't just lag — it eventually fails jobs outright. Scheduled, automated rebuilds keep images inside that window. Second, an image defined as code is an image you can build and exercise in a pull request — which is the whole game.

Piece 3 — Dogfooding: every image is tested as a real runner before merge

This is the keystone, and it's stronger than "run some tests." When Renovate (or a human) opens a PR that bumps a dependency or changes an image, the pipeline builds the candidate image inside that PR and then registers it as a real GitHub Actions runner and runs actual builds on it — before the PR can merge. Not a mock, not a smoke check: the freshly built image is brought up as a genuine runner and made to do real work. And it runs on the Forge team's own runners — because, as above, the Forge team is a tenant of its own platform. So a change to the platform is proven on the platform, by the team that owns it, acting as a real user of it. If a dependency bump or an image change breaks the runner, it breaks in the PR, visibly, on a real runner — not in production three days later.

"Dogfooding" — using your own product to build your own product — is doing real safety work here, not just signaling virtue. Because the Forge team runs as a tenant and every image is validated as an actual runner before it ships, the automation in Piece 1 becomes trustworthy: you can let safe dependency PRs auto-merge precisely because a broken image could never have passed a real build on the team's own runners. The three pieces only work as a set — Renovate generates the change, images-as-code make it buildable per PR, and the real-runner dogfood test (on the Forge-team tenant) proves it before it ships. And because tenants consume Forge through pinned versions (§9), a freshly validated release rolls out to teams on their cadence, not all at once.

What it buys: a large, fast-moving dependency surface stays current with little human effort; dependency and image changes are proven on the real platform before merge, which makes safe auto-merge trustworthy; runner images never silently rot past GitHub's agent-version cutoff.
What it costs: you build and maintain a non-trivial CI/image pipeline (Packer, Ansible, build matrices, the dogfood wiring); you spend CI minutes building and testing images on every relevant PR; and you take on the discipline of keeping the dogfood loop green, because once you trust it enough to auto-merge, a flaky loop is a real liability.

The lesson: "done" is a myth for a platform; the ecosystem moves whether you do or not. The way a small team keeps up is to make staying current automatic and self-proving — generate the changes with a bot, define your artifacts as code so they're buildable per change, and dogfood so every change is tested on the very system it modifies. That trio is what lets "near-zero ops" survive contact with a year of upstream churn. Decision #7.

12. Operating it: ownership, where it breaks, and the sharp edges

An explainer that stops at architecture skips the half that actually fills a platform team's week. Four operational realities round out the picture.

Ownership boundaries — most incidents live here. A surprising share of "Forge is broken" tickets aren't Forge bugs; they sit on a seam between owners:

Platform / Forge team owns the modules, EKS clusters, runner lifecycle, GitHub App plumbing, base images, shared observability, and guardrails.
Tenant team owns its workflows, workload permissions, custom toolchains/images, repository selection, and the external IAM roles it asks runners to assume.
Security / infra teams own the corporate VPCs, subnets and routing, Teleport entitlements, and Splunk access.

Knowing the boundary is half the triage: a job that can't reach an internal API is usually network/routing (security-infra); a job that can't assume a role is usually tenant IAM; "no runner picked it up" is usually labels/runner-group (platform + tenant config).

Where it breaks, and where to look first.

| Symptom | Likely layer | First place to look |
| --- | --- | --- |
| Job stuck waiting for a runner | Label mismatch, runner group, webhook, EC2/ARC capacity | GitHub labels; runner-group reconciler logs; webhook/Lambda/SQS dashboard |
| EC2 runner never registers | AMI, user-data, subnet IPs, EC2 capacity, App token | CloudWatch user-data logs; scale-up Lambda; EC2-lifecycle dashboard |
| ARC pod stuck pending | Karpenter capacity, storage, node taints, resource requests | K8s storage/network dashboard; ARC-lifecycle dashboard |
| AWS auth fails inside a job | Tenant trust, missing `sts:TagSession`, wrong role ARN | Trust-validator dashboard; tenant IAM trust policy |
| Logs missing | Job-log archiver, SQS/DLQ, Splunk ingestion | Webhook/job-log-pipeline dashboard; ingestion-quality dashboard |
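
The "AWS auth fails inside a job" row often comes down to a tenant trust policy that allows `sts:AssumeRole` but not `sts:TagSession`. The following is a minimal sketch of an offline check for that case, not the trust-validator's actual implementation. It assumes the standard IAM policy JSON shape and only inspects `Principal.AWS` entries; the function name and the role ARN used in the usage note are placeholders.

```python
import fnmatch
import json

REQUIRED_ACTIONS = ("sts:AssumeRole", "sts:TagSession")


def _as_list(value):
    """IAM allows a single value or a list in most fields; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def missing_trust_actions(policy_json: str, runner_role_arn: str) -> list[str]:
    """Return the required STS actions the policy does NOT allow for the runner role."""
    policy = json.loads(policy_json)
    granted = set()
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        # Simplification: only exact ARN matches under Principal.AWS are considered.
        if not isinstance(principal, dict):
            continue
        if runner_role_arn not in _as_list(principal.get("AWS")):
            continue
        for pattern in _as_list(stmt.get("Action")):
            for action in REQUIRED_ACTIONS:
                # IAM action names are case-insensitive and may use * and ? wildcards.
                if fnmatch.fnmatchcase(action.lower(), pattern.lower()):
                    granted.add(action)
    return [action for action in REQUIRED_ACTIONS if action not in granted]
```

Calling `missing_trust_actions(policy, "arn:aws:iam::111122223333:role/example-runner")` on a policy that only lists `sts:AssumeRole` returns `["sts:TagSession"]`, which is the gap to fix in the tenant's trust policy.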

Sharp edges worth respecting — the platform isn't as simple as a happy-path diagram suggests:

`scale_set_type` must be exactly `dind` or `k8s`.
Kubernetes CPU/memory need units (`500m`, `1Gi`), not bare numbers.
`ami_kms_key_arn` must be `""` for an unencrypted AMI.
macOS requires `use_dedicated_host: true` and matching placement (plus License Manager gotchas).
Warm-pool schedules use AWS cron syntax, not Unix cron.
`migrate_arc_cluster` should be `true` only during an intentional migration.
Subnet-IP exhaustion and EC2 capacity errors are first-class failure modes, not edge cases (§5, again).
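
Several of these edges can be caught before a plan ever runs. Below is a minimal pre-flight sketch under stated assumptions: `scale_set_type` and the example values come from the list above, while the function names, the deliberately narrow unit regexes, and the six-field (EventBridge-style) reading of "AWS cron syntax" are assumptions for illustration, not Forge's own validation.

```python
import re

# The allowed values come from the text; everything else here is an assumed sketch.
VALID_SCALE_SET_TYPES = {"dind", "k8s"}
CPU_WITH_UNIT = re.compile(r"^\d+m$")                 # e.g. 500m
MEMORY_WITH_UNIT = re.compile(r"^\d+(Ki|Mi|Gi|Ti)$")  # e.g. 1Gi


def check_scale_set_type(value: str) -> bool:
    """Exact match only: 'DinD' or ' k8s' are rejected."""
    return value in VALID_SCALE_SET_TYPES


def check_resources(cpu: str, memory: str) -> list[str]:
    """Return human-readable problems for CPU/memory values without units."""
    problems = []
    if not CPU_WITH_UNIT.match(str(cpu)):
        problems.append(f"cpu {cpu!r} needs a unit, e.g. 500m")
    if not MEMORY_WITH_UNIT.match(str(memory)):
        problems.append(f"memory {memory!r} needs a unit, e.g. 1Gi")
    return problems


def looks_like_aws_cron(expr: str) -> bool:
    """AWS (EventBridge-style) cron has six fields; Unix cron has five."""
    inner = expr.strip()
    if inner.startswith("cron(") and inner.endswith(")"):
        inner = inner[len("cron("):-1]
    return len(inner.split()) == 6
```

A check like this is cheap to run in a PR, so a Unix-style `0 8 * * 1-5` schedule or a bare `2` for CPU fails in review rather than at apply time.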

Security at a glance — the controls, in one place: GitHub App keys in SSM Parameter Store; webhook HMAC validation; per-tenant runner groups + repository selection; short-lived AWS creds via runner-role assumption (no static keys in workflows); sts:AssumeRole + sts:TagSession; KMS-encrypted S3 job logs; Teleport for audited break-glass access; Falco runtime monitoring (internal deployment); rootless DinD on per-tenant nodes; CIS-aligned hardened base images.
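
Of these controls, webhook HMAC validation is the easiest to show in a few lines. GitHub signs each webhook delivery with HMAC-SHA256 over the raw request body and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below verifies that header; the function name is hypothetical, and how Forge's own webhook handler is written is not shown in this article.

```python
import hashlib
import hmac


def verify_github_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Check an X-Hub-Signature-256 header value against the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The important details are to hash the raw bytes exactly as received (not a re-serialized JSON object) and to reject the delivery before any parsing or queueing when the check fails.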

And one honest definition: "near-zero ops" is not zero support. It does not mean nobody operates Forge. It means recurring manual work has been converted into automation, dashboards, scheduled validators, redrive loops, and self-service config — so the human job shifts from "SSH into a random runner and guess" to "look at the subsystem that owns the symptom, and when a pattern repeats, improve the detector or the automation." The support load — capacity, IAM trust, Teleport/onboarding, ARC/DinD issues, image updates, tenant guidance — is real. It's bounded and routed, not eliminated.

13. The pattern

Here is the entire argument on one page:

| Decision | What it bought | What it cost |
| --- | --- | --- |
| Calico overlay CNI (§5) | Broke the IP ceiling | A second networking layer to operate |
| Per-tenant DinD node pools (§6) | Blast-radius isolation for Docker builds | Money: worse bin-packing & more Karpenter objects |
| Trust-validator (§7) | Proactive trust checks | A small distributed system to run |
| Blue/green clusters (§8) | Safe, reversible upgrades | A per-tenant runner gap on migration |
| Directory-as-config (§9) | Onboarding in minutes | Terragrunt complexity & footguns |
| Self-healing tissue (§10) | Resilience without babysitting | More moving parts |
| Automated deps + dogfooding (§11) | Stay current without manual toil | A build/test pipeline to maintain |

A credibility note before the numbers: everything about the design in this article is in the open-source repo and you can read the modules yourself; the scale figures that follow — ~40 teams, ~10k jobs/day, near-zero ops — come from internal production experience, not from the public code. I've tried to keep that line visible throughout.

Put together, this runs around 40 tenant teams and ~10,000 CI jobs a day, across multiple AWS regions and both execution lanes, grown organically from 3 teams to around 40, and operated with near-zero ops, by design. Read that last phrase carefully, because it's the most misunderstood claim in the whole story. It does not mean a heroically overworked person is quietly holding it together, and it does not mean an under-resourced org is getting lucky. It means low operational cost is the output of the stack of trade-offs above: ephemeral runners that can't drift, infrastructure that's entirely code, a trust-validator that catches tenant mistakes before they page anyone, immutable clusters that make upgrades boring, configuration that makes onboarding a directory entry, and observability that turns operations into a loop. Low ops is what rigor looks like from the outside.

And that's the takeaway worth carrying even if you never run a GitHub Actions runner in your life:

Great platforms are not a lucky idea or a magic tool. They are a pattern of many correct decisions, made across many different problem domains — networking, isolation, identity, lifecycle, configuration, operations, and supply chain — each chosen with clear eyes about its cost, and composed so the whole holds together. The craft is not picking the single best technology for one problem. It's making a dozen good trade-offs in a dozen domains and having them add up. Anyone can copy one of these decisions. The engineering is in the composition.

So the next time you design something, don't ask "what's the best tech here?" Ask, for each choice: what does this buy me, what does it cost me, and is that the trade I want at my scale? Then write the answer down. That document — the one you just read — is as much the deliverable as the code, because it's the thing that turns a magic box back into an understandable machine.

When Forge is overkill

One last decision, and it's the one that keeps the rest honest: knowing when not to build this. If you're a small team with a handful of repos, Forge is too much. Start with the basics — ephemeral runners (the upstream EC2 module or ARC on its own), GitHub Actions, and a bit of Terraform — and only reach for tenancy, per-tenant isolation, trust validation, blue/green clusters, and a dogfooded image pipeline when you actually feel the pain they remove. Every trade-off in this article buys something real, but each also costs real complexity, and at small scale you'd be paying the cost without needing the benefit. Forge earns its keep in multi-team environments where governance, isolation, and platform automation genuinely matter. The same judgment that makes the dozen decisions good is the judgment that says: at three repos, don't make any of them. Maturity isn't building the biggest system — it's building the right-sized one, and being able to say which is which.

Read the code

This article keeps saying "read the modules" — so here's the map. Everything below is in the public repo:

| Topic | Where to read |
| --- | --- |
| Tenant orchestration (umbrella) | `modules/platform/forge_runners` |
| EC2 runners | `modules/platform/ec2_deployment` |
| ARC / Kubernetes runners | `modules/platform/arc_deployment`, `modules/core/arc` |
| Calico + Karpenter on EKS | `modules/infra/eks` |
| Trust-validator (§7) | `modules/platform/forge_runners/forge_trust_validator` |
| Job-log archival (§10) | `modules/platform/forge_runners/github_actions_job_logs` |
| Global lock (§10) | `modules/platform/forge_runners/github_global_lock` |
| Runner groups + repo registration (§4) | `modules/platform/forge_runners/github_app_runner_group` |
| Splunk dashboards / config (§10) | `modules/integrations/splunk_cloud_conf_shared`, `modules/integrations/splunk_o11y_conf_shared` |

Forge is open source: github.com/cisco-open/forge (Apache-2.0).

Docs: cisco-open.github.io/forge.

Read the modules — the trade-offs are all in there if you know where to look.