DEV Community: Ederson Brilhante

No Silver Bullets: Engineering a Multi-Tenant CI Platform a Small Team Can Run

Ederson Brilhante — Sat, 20 Jun 2026 20:41:43 +0000

A deep, teaching walkthrough of how Cisco’s internal Forge deployment runs ~40 teams and ~10,000 GitHub Actions jobs a day on AWS — and the dozen deliberate engineering trade-offs that made it survivable with near-zero ops. This is the long version on purpose. I’m going to show you the machinery, not wave at it, because a platform you can’t see inside is just a magic box you’re afraid to touch. By the end you should understand not only what Forge does but why each piece is built the way it is — well enough to argue with the decisions.

TL;DR — Forge (open-source name: ForgeMT) is a multi-tenant GitHub Actions runner platform on AWS. In Cisco's internal production deployment it runs ~40 teams and ~10k jobs/day with near-zero manual provisioning/debugging — which is not the same as zero support. Jobs run on ephemeral runners across two lanes — EC2 VMs and Kubernetes/ARC pods. The design is a stack of deliberate trade-offs, each with a stated cost: Calico to beat VPC-IP exhaustion, per-tenant DinD node pools for blast-radius isolation around privileged Docker builds, zero static credentials with a self-checking trust-validator, immutable blue/green clusters for safe upgrades, directory-as-config Terragrunt so onboarding is a config change, self-healing pipelines + centralized Splunk observability, and Renovate + dogfooded images to stay current. The thesis: near-zero ops isn't luck or a magic tool — it's a dozen good trade-offs that add up. Short on time? Read §4 (how a job runs) and §13 (the pattern).

Contents

1. The wall everyone hits
2. What Forge is (and the vocabulary you'll need)
3. Why it exists
4. How a job actually runs: the part most explanations skip
5. Trade-off #1: Networking: the IP ceiling
6. Trade-off #2: Isolation: per-tenant DinD node pools
7. Trade-off #3: Identity: zero static credentials
8. Trade-off #4: Immutable clusters & blue/green
9. Trade-off #5: Config at scale
10. Trade-off #6: The connective tissue
11. Trade-off #7: Staying fresh: automated dependencies & dogfooded images
12. Operating it: ownership, where it breaks, and the sharp edges
13. The pattern

1. The wall everyone hits

Setting up a GitHub Actions self-hosted runner is a fifteen-minute job. You spin up a VM, run the config.sh script GitHub gives you, it registers the machine against your org or repo, you add runs-on: self-hosted to a workflow, and your job lands on your box. It feels great. It is great — for one team.

The trouble never arrives with the first runner. It arrives with the second team. And the fifth. And the twentieth.

Here's the mechanism of the pain, because it's worth being precise about. A self-hosted runner is a long-lived machine you own. That means you own its patching, its security posture, its disk filling up, its runner-agent upgrades, its scaling when ten jobs arrive at once, and its idle cost when none do. The moment a second team needs runners — maybe they need a different toolchain, a bigger instance, or access to an internal network the first team didn't — the path of least resistance is for them to copy the setup and run their own. Now two teams each carry that whole operational burden. Multiply by twenty.

What you end up with isn't "twenty teams using runners." It's twenty subtly different runner platforms, each patched on its own schedule, each drifting in its own direction, each with its own half-built answer to "how do we give this runner AWS access without leaking a key." "It works on Team A's runners but not ours" becomes a real sentence in real incident channels. Your operational cost grows roughly linearly with adoption, and — this is the part that should worry you — your security posture gets worse as you grow, because every team reinvents secrets handling and isolation slightly differently and slightly wrong.

This is the wall. Forge is what one team built after hitting it.

And the single most important thing to understand before we go further: the interesting part is not any one clever component. There is no magic tool in here. Forge is roughly a dozen deliberate decisions, each made in a different problem domain, each with a real cost, composed into something a small team can actually operate. The rest of this article walks the load-bearing ones — with the actual code and the actual trade-off accepted. If you're building anything multi-tenant, the shape of these decisions will transfer even if you never touch a GitHub runner.

2. What Forge is (and the vocabulary you'll need)

Forge is a secure, multi-tenant GitHub Actions runner platform on AWS. Teams run their CI/CD jobs on managed, ephemeral runners that live in company-managed AWS accounts. For a developer in an already-onboarded repo, adoption is usually a one-line runs-on change — no infrastructure to own, no migration project, no workflow rewrite. (Standing up a new tenant is a controlled platform-team step, which we'll be honest about in §4 and §12.)

(A naming note: the open-source project is ForgeMT — "MT" for multi-tenant; internally people usually just say Forge. I use "Forge" throughout for readability, but the repo and docs say "ForgeMT.")

Before we go deeper, let's nail down five terms. If you already live in this world, skim; if you don't, these five unlock everything below.

GitHub-hosted vs. self-hosted runners. When you write runs-on: ubuntu-latest, GitHub spins up a throwaway VM it owns, runs your job, and throws it away. Convenient, but that VM lives in GitHub's network — it cannot reach your private databases, your internal package registries, or your VPC, and you can't customize it much or control its cost at high volume. A self-hosted runner is a machine you register with GitHub; GitHub sends jobs to it, but you own everything about the box. Forge is a platform for self-hosted runners — it exists precisely for the jobs that GitHub-hosted runners can't serve.

Tenant. A tenant is the isolation-and-configuration boundary for a team — its own runners, identity, and labels. Forge runs those runners inside the Forge-managed AWS deployment for that tenant; the tenant can then optionally grant its runners access to its own external AWS accounts by listing the IAM roles the runner role is allowed to assume (more in §7). So "each tenant has its own account" is the wrong mental model — the right one is "each tenant is an isolated boundary, with opt-in bridges to whatever AWS it actually needs." When we say Forge is "multi-tenant," we mean many teams share one platform (one codebase, one upgrade path, one operations team) while staying strictly separated at runtime. Holding both of those at once — shared platform, isolated execution — is the entire engineering problem.

Ephemeral runner. This is the keystone idea. Instead of long-lived machines, Forge creates a brand-new runner for a single job and destroys it the instant that job finishes — pass or fail. Every job gets a pristine machine that has never seen another job. This buys you enormous things almost for free: no state leaks between jobs, no configuration drift accumulating on a box over months, no credentials cached on a runner that outlives the work, and no "clean up after yourself" step that everyone forgets. It costs you one thing — startup latency, since you pay to boot a machine per job — which we'll see Forge manage carefully later.

Control plane vs. tenant plane. The control plane is the Forge-owned brain: it receives the signal that a job needs a runner, provisions the runner, scales the fleet, wires up identity, and collects logs and metrics. The tenant plane is where the actual jobs execute. Teams live entirely in the tenant plane — they write workflows and pick runners — and never touch the control plane. This split is what lets one small team operate the control plane for everyone.

The two lanes. Forge runs jobs two different ways, and lets each job choose:

The EC2 lane gives a job a full virtual machine (or even a bare-metal host). Use it for heavy builds, specialized hardware, custom operating systems, macOS/Windows, or anything that needs real VM-level isolation and control. A job can run directly on the VM or inside a container via a container: block in the workflow. Under the hood this is built on the open-source terraform-aws-github-runner project.
The Kubernetes lane (often called ARC, after Actions Runner Controller, the open-source operator it's built on) runs each job in a Kubernetes pod. Use it for fast, bursty, container-native work where you want pods to spin up in seconds and scale to zero when idle.

Within each lane, tenants pick a size or flavor. On EC2 that's typically small (linting, quick tests), standard (general builds and integration tests), large (heavy builds, load tests), and metal (bare-metal hosts for jobs that need full hardware control). On Kubernetes it's k8s (lightweight pods for jobs that don't need Docker) and dind (Docker-in-Docker, for building container images inside a pod) — plus dedicated named scale sets like dependabot, which is not a third execution model but an ARC scale set typically backed by the DinD template, carved out so dependency-update jobs are isolated and capacity-limited on their own. A tenant can run several of these at once, each with its own parallelism limit, so one team's flood of jobs can't starve another's.

Two lanes is itself a deliberate choice — most platforms pick one. Forge keeps both because the workload genuinely spans both: a 90-minute hardware-in-the-loop build and a 30-second lint check do not want the same execution model.

Tenants pick a lane and a size by label. A workflow says something like runs-on: [self-hosted, x64, "type:standard"] for a medium EC2 runner, or runs-on: [k8s] for a Kubernetes pod, or runs-on: [dind] for a pod that can build Docker images. Those labels are not cosmetic — as we'll see in §9, they're the literal API contract between the tenant and the platform, generated from configuration and matched exactly.

That diagram is the entire mental model at altitude. The rest of this article is what's inside each of those boxes — and why.

3. Why it exists

Forge didn't begin as an architecture diagram. It began as a fix for a specific, escalating mess.

The first teams ran the open-source terraform-aws-github-runner module, each on their own. It's a good module. It worked. What didn't work was the practice of everyone running their own copy. The same bug got fixed three separate times in three repos. The same incident — a runner type out of capacity, a webhook misfiring — recurred across teams that never compared notes. Because each team upgraded on its own cadence, the behavior of "a runner" diverged between teams, so debugging required knowing which team's particular vintage you were looking at. Knowledge became tribal. Three teams meant three deployments, several AWS accounts per environment, and three sets of undocumented quirks, with billing and troubleshooting smeared across all of them.

Two specific constraints turned this from "annoying" into "we cannot continue":

IPv4 exhaustion. The runners had to live in corporate-provisioned AWS subnets, and those subnets were small — and not something an individual team could enlarge. Here's the trap: there was plenty of CPU and memory available, but the network ran out of IP addresses long before the compute did. Every runner needs an IP. When you've handed out every address in the subnet, the next job sits in a queue staring at idle compute it can't use, because there's no address to give it. The bottleneck wasn't the thing everyone watches (compute) — it was the thing nobody watches (IPs). This single constraint forced a real architectural change, and we'll spend all of §5 on it because it's the most instructive decision in the platform.

Internal access and shared networks. CI jobs frequently need to reach internal things — a private artifact registry, an internal API, a database, a service only reachable over the corporate network. GitHub-hosted runners simply cannot do this; they're outside the perimeter. And when CI runs on shared corporate networks, CI traffic (which is spiky and heavy) competes with operational traffic, creating noisy-neighbor problems. The upshot: security and network design stopped being someone else's problem and became part of CI design. You cannot bolt them on afterward.

Faced with this, the team had a binary choice: keep letting every team run its own stack and drown in duplicated maintenance and divergent security, or standardize onto one platform. The bet they made — and it's the thesis of the whole thing — was to keep the flexibility teams loved (custom images, full VM control, internal access, their own AWS resources) while removing the drift (one module, one upgrade path, one set of guardrails). Standardize the boundaries; preserve the freedom inside them.

The next seven sections are what "guardrails without taking away freedom" actually required. But first, we have to look at how a single job flows through the system, because every later decision is in service of that flow.

4. How a job actually runs — the part most explanations skip

If you only remember the altitude diagram, Forge stays a magic box. So let's trace a real job, end to end, twice — once per lane. This is the spine everything else hangs on.

The EC2 lane, step by step

1. A developer pushes code. Their workflow contains runs-on: [self-hosted, x64, "type:standard", "tnt:acme"]. GitHub sees the job needs a self-hosted runner and emits a workflow_job webhook with action: queued.
2. The webhook hits an API Gateway, which is the public front door of that tenant's Forge deployment. It forwards the request to a small webhook Lambda.
3. The webhook Lambda authenticates and matches. It verifies the request's HMAC signature against the GitHub App's webhook secret (so randoms can't trigger your runners), then checks the job's labels against the set of runner types this deployment offers. If a label set matches, it enqueues a message onto an SQS queue dedicated to that runner type. (If nothing matches, it does nothing: the job will sit unfulfilled, which, as we'll see, is itself a debuggable signal.)
4. A scale-up Lambda consumes the SQS message. SQS here isn't decoration: it's a buffer that decouples the burst of webhooks from the rate at which you can launch instances, and it gives you retries for free. The scale-up Lambda decides how many runners to launch and issues an EC2 CreateFleet call in instant mode (give me N instances now), choosing on-demand or spot capacity per the tenant's config.
5. The instance boots and registers itself. Its user-data script pulls a just-in-time (JIT) runner registration token, configures the GitHub Actions runner agent as ephemeral (it will accept exactly one job), and joins the tenant's runner group. The moment it registers, GitHub hands it the queued job.
6. The job runs. Job-started and job-completed hooks fire; the runner streams logs to CloudWatch.
7. The runner self-destructs. Because it registered as ephemeral, the agent deregisters from GitHub after the single job. A separate scale-down Lambda, running on a schedule (every few minutes), reaps the now-idle instance and any stragglers older than a minimum runtime. Supporting Lambdas update tags and archive logs.
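
The authenticate-and-match step is small enough to sketch. This is an illustration under assumptions, not Forge's Lambda: the function names, the RUNNER_TYPES table, and the queue names are hypothetical, and exact set equality is just one plausible reading of "matched exactly." What is standard is the signature scheme itself: GitHub sends an X-Hub-Signature-256 header containing sha256= followed by the hex HMAC-SHA256 of the raw request body.

```python
import hashlib
import hmac
import json

# Hypothetical routing table: offered label set -> queue name (illustrative only).
RUNNER_TYPES = {
    frozenset({"self-hosted", "x64", "type:standard", "tnt:acme"}): "acme-standard-queue",
    frozenset({"self-hosted", "x64", "type:large", "tnt:acme"}): "acme-large-queue",
}


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, header)


def route(secret: bytes, body: bytes, header: str):
    """Return the queue name for a queued job, or None if it should be ignored."""
    if not signature_is_valid(secret, body, header):
        return None
    event = json.loads(body)
    if event.get("action") != "queued":
        return None
    labels = frozenset(event["workflow_job"]["labels"])
    # Exact match is an assumption of this sketch: no match means no runner.
    return RUNNER_TYPES.get(labels)
```

Note that the signature is computed over the raw bytes, before any JSON parsing. Re-serializing the parsed payload and hashing that is a classic way to get a mismatch on a perfectly valid delivery.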

Two real-world wrinkles live in steps 4–5, and both are worth knowing. First, capacity isn't guaranteed: a CreateFleet can come back short with errors like InsufficientInstanceCapacity or — tellingly — InsufficientFreeAddressesInSubnet (yes, you can run out of subnet IPs; that's the §5 problem wearing a different hat). Forge classifies these errors rather than treating them all the same: a spot shortfall can fail over to on-demand, and retryable capacity errors are re-queued instead of dropped on the floor. Second, a hard-won lesson: a webhook with a valid HMAC signature is not necessarily a usable one. In one production incident, deliveries were signed correctly but carried an installation ID belonging to a different GitHub App — so the token request, and thus runner creation, failed even though signature validation passed cleanly. The fix that went into the runbook: when runner creation fails, don't stop at "signature verified" — verify that the payload's installation context actually matches the app you think is serving the request. It's a perfect example of why the observability in §10 exists: the symptom ("runners won't start") was three layers away from the cause.
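
Both wrinkles reduce to small decision functions. The sketch below is hypothetical: the two error codes are the ones named above, but the Action enum, the retry policy, and the function names are assumptions made for illustration, not Forge's code. The installation check relies only on the installation.id field that GitHub includes in webhook payloads delivered to a GitHub App.

```python
from enum import Enum


class Action(Enum):
    FALLBACK_TO_ON_DEMAND = "fallback_to_on_demand"
    REQUEUE = "requeue"
    FAIL = "fail"


# Capacity errors named in the text; treating both as retryable is an assumption.
CAPACITY_ERRORS = {"InsufficientInstanceCapacity", "InsufficientFreeAddressesInSubnet"}


def classify(error_code: str, was_spot: bool) -> Action:
    """Decide what to do with a launch that came back short."""
    if was_spot and error_code == "InsufficientInstanceCapacity":
        # A spot shortfall can be retried against on-demand capacity.
        return Action.FALLBACK_TO_ON_DEMAND
    if error_code in CAPACITY_ERRORS:
        # Switching purchase model will not free subnet IPs, so wait and retry.
        return Action.REQUEUE
    return Action.FAIL


def installation_matches(payload: dict, expected_installation_id: int) -> bool:
    """A valid signature is not enough: the payload must belong to our app install."""
    return payload.get("installation", {}).get("id") == expected_installation_id
```

The point of the second function is ordering: run it right after signature validation, so a mismatched installation shows up as its own log line instead of as a token failure further down the pipeline.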

The Kubernetes (ARC) lane, step by step

The Kubernetes lane reaches the same destination by a different road, because Kubernetes already has a scaling brain.

1. The same workflow_job signal reaches the cluster, but here the ARC listener for the tenant's scale set notices there's demand.
2. ARC creates an ephemeral runner resource (a pod spec) for the job.
3. Karpenter provisions a node if needed. Karpenter is a Kubernetes autoscaler that watches for unschedulable pods and launches right-sized EC2 nodes to fit them (and removes them when idle). If the tenant's nodes are full or scaled to zero, Karpenter boots one.
4. Kubernetes schedules the runner pod. For DinD, tenant isolation is enforced by the per-tenant Karpenter node pool plus taints/tolerations and required node affinity (§6); for plain k8s, the checked-in template relies on namespace, service-account/IAM, runner-group, and GitHub-routing boundaries unless an extra scheduling layer is added.
5. The pod's containers come up: the runner container itself, plus, for Docker builds, a rootless Docker-in-Docker sidecar, plus hook and log sidecars.
6. The job runs in the pod; logs flow to Kubernetes logging and onward.
7. The ephemeral runner resource and its pod are deleted. Idle nodes get consolidated away by Karpenter.

The reason Forge keeps both lanes is now concrete: the EC2 path gives you a whole machine and total control but pays a full VM boot; the ARC path scales pods fast and bin-packs them onto shared nodes but constrains you to a container model. Different jobs genuinely want different trade-offs.

One piece of plumbing makes both flows actually route to the right place: runner groups. Each tenant's runners register into a GitHub runner group scoped to that tenant, and a small reconciler keeps the mapping correct — when a tenant adds a repository, it's automatically registered into the right runner group, so that repo's jobs land only on that tenant's runners. This is what stops one tenant's workflow from ever pulling another tenant's capacity, and it's why onboarding a new repo is a no-op for the platform team rather than a ticket.

One caveat on "adopt by changing one line." That's true for a developer in an already-onboarded repo — flip the runs-on label and you're done. But onboarding a new tenant is a controlled platform-team operation, not one line: create the tenant's Terragrunt config, register/install the GitHub App, set repository_selection (all or selected), wire the runner group, configure any optional IAM-role trust and ECR access, and set the EC2/ARC runner specs (sizes, AMIs, images, resource limits). Forge makes that repeatable; it doesn't make it disappear. (We come back to it in §12.)

With the flow in hand, every decision below is really an answer to the question: "what breaks in that flow when you run it for forty teams in a small network, and how do you keep it secure and operable?"

5. Trade-off #1 — Networking: the IP ceiling

Networking is where most multi-tenant Kubernetes platforms quietly die, and it's the constraint that forced Forge's first and most instructive decision. To understand it you need one piece of AWS background.

The background: how pods get IP addresses on EKS

By default, Amazon's EKS uses the AWS VPC CNI (the aws-node component) for pod networking. CNI stands for Container Network Interface — the plugin that decides how a pod gets a network identity. The defining property of the AWS VPC CNI is that every pod gets a real, routable VPC IP address, drawn from the same pool your EC2 instances use. That's lovely for interoperability (a pod is a first-class citizen on your network) and brutal for density, because of how AWS attaches addresses.

An EC2 instance gets IPs through Elastic Network Interfaces (ENIs) — virtual NICs. Each instance type supports a fixed maximum number of ENIs, and each ENI supports a fixed number of private IPs. So the number of pods you can run on a node is capped by hardware-ish limits:

max pods per node ≈ (max ENIs) × (IPs per ENI − 1)

The constraint: a small, fixed network

Forge's runners must live in corporate-provisioned VPCs, and those are small and non-negotiable. The live data plane sits in a /24 VPC — 256 total addresses — carved into two /25 subnets of about 123 usable addresses each (AWS reserves five per subnet). You do not get to enlarge it. That's the box you're in.

The math that kills the default CNI

Take a common worker size, a node that supports 8 ENIs at 30 IPs each. That's 8 × (30 − 1) = 232 pod IPs; AWS's published max-pods figure adds 2 for host-networked pods, giving 234 pods per node. Now look at what that means in a /25 with 123 usable addresses: a single fully-packed node would need almost twice the addresses the entire subnet contains. Scale to the cluster's target of five nodes and you'd be asking for ~1,170 IPs, roughly 4.5× the entire /24 VPC.

This isn't "the cluster runs slowly." It's "the cluster cannot be scheduled." The IP ceiling is hit before a single meaningful workload lands. And critically, throwing more compute at it makes things worse, because more nodes means more IP demand. The thing everyone instinctively scales (compute) is the thing actively hurting you.
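
If you want to rerun this arithmetic for your own subnets and instance types, here is a short sketch. The CIDR is a placeholder (the real one isn't given here) and the function name is mine. The + 2 is the allowance for host-networked pods in AWS's published max-pods formula, which is where 234 comes from rather than 232.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves five addresses in every subnet

# Placeholder CIDR: any /24 split into two /25s gives the same counts.
vpc = ipaddress.ip_network("10.0.0.0/24")
subnets = list(vpc.subnets(new_prefix=25))
usable_per_subnet = subnets[0].num_addresses - AWS_RESERVED_PER_SUBNET


def vpc_cni_max_pods(max_enis: int, ips_per_eni: int) -> int:
    """AWS VPC CNI pod ceiling: one IP per ENI is the ENI's own primary address."""
    pod_ips = max_enis * (ips_per_eni - 1)
    return pod_ips + 2  # plus two host-networked pods that need no pod IP


per_node = vpc_cni_max_pods(max_enis=8, ips_per_eni=30)
cluster_demand = 5 * per_node  # the five-node target from the text

print(f"usable addresses per /25: {usable_per_subnet}")
print(f"max pods per node:        {per_node}")
print(f"five-node demand:         {cluster_demand} vs {vpc.num_addresses} in the whole VPC")
```

Swap in your own ENI limits and prefix lengths. If the last line shows demand above supply, you have the same problem Forge had.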

The decision: replace the CNI with an overlay

Forge deletes the AWS VPC CNI and installs Calico, configured as an overlay network:

kubectl delete daemonset -n kube-system aws-node || true
kubectl apply -f .../calico/tigera-operator.yaml --server-side
# Installation custom resource:
spec:
  cni:           { type: Calico }
  calicoNetwork: { bgp: Disabled }

An overlay network gives pods IP addresses from a private, cluster-internal range (a CIDR that exists only inside Kubernetes) and encapsulates pod-to-pod traffic so it can travel across the real VPC without the VPC needing to know about every pod IP. The VPC route tables never see pod addresses. (bgp: Disabled tells Calico not to advertise routes via the BGP routing protocol — appropriate here, since we're encapsulating rather than peering pod routes into the network fabric.)

The consequence is the whole point: VPC IP addresses are now consumed only by nodes, not by pods — and a node is exactly one VPC IP, whether it packs one pod or a hundred. (Hold onto that fact; in §10 it quietly flips a cost recommendation.)

Now do the math again. A /24 split into two /25 subnets has 246 usable addresses, so it can hold on the order of 250 nodes instead of barely one node's worth of pods. Pod density stops being an AWS-imposed accident and becomes a deliberate policy choice (Forge fixes it at maxPods: 100 per node), completely decoupled from ENI limits. The ceiling that made the platform impossible is simply gone.
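
The same back-of-the-envelope check works for the overlay case. Treat the result as an upper bound, not a capacity plan: this sketch assumes every usable address goes to a node, while in practice load balancers, endpoints, and other ENIs take addresses too. The function name is hypothetical.

```python
AWS_RESERVED_PER_SUBNET = 5  # AWS reserves five addresses in every subnet


def overlay_ceiling(subnet_prefixes, max_pods_per_node):
    """With an overlay CNI, only nodes consume VPC addresses: one per node."""
    node_slots = sum(
        2 ** (32 - prefix) - AWS_RESERVED_PER_SUBNET for prefix in subnet_prefixes
    )
    return node_slots, node_slots * max_pods_per_node


# Two /25 subnets and the maxPods value quoted in the text.
node_slots, pod_ceiling = overlay_ceiling([25, 25], max_pods_per_node=100)
print(f"node slots: {node_slots}, theoretical pod ceiling: {pod_ceiling}")
```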

What it costs (because nothing is free)

What it buys: breaks the IP ceiling; lets you run dense pod workloads inside tiny, fixed subnets; pod density becomes a tuning knob instead of a hard wall.
What it costs: you now run and upgrade a second networking layer with its own release cadence and its own failure modes. Overlay networking has real operational edges — image-pull problems and operator-ordering bugs have caused incidents serious enough to have named fix branches in the repo. And bring-up ordering is genuinely fragile: the aws-node deletion is best-effort (|| true), and every node must carry a synthetic dependency so it can never join the cluster before the CNI swap completes — a node that registers with no CNI has no working pod network at all, which is a confusing way to fail.

The lesson worth taking even if you never run Kubernetes: at scale, the binding constraint is frequently not the resource everyone watches. Here it was IP addresses, not CPU. There was no "add more nodes" answer (that makes it worse) and no "resize the subnet" answer (you don't control it), so the only move was to change the networking layer itself. Find your real ceiling before you optimize the comfortable one. This is decision #1 of about a dozen.

6. Trade-off #2 — Isolation: per-tenant DinD node pools

This is the clearest place in the entire platform where Forge chose isolation over efficiency, knowingly, and paid cash for it. That's exactly what makes it a good teaching example — it's an explicit trade, not a default.

The background: Karpenter, taints, tolerations, affinity

A few Kubernetes concepts make this section legible:

Karpenter is an autoscaler that provisions EC2 nodes on demand to fit pending pods, using a NodePool (rules for what kind of nodes to create) and an EC2NodeClass (the AWS specifics — AMI, IAM role, subnets). When pods need a home and none fits, Karpenter launches a node; when nodes go idle, it removes them.
A taint is a "keep off" mark on a node: by default, pods won't schedule onto a tainted node.
A toleration is a pod's permission slip that lets it tolerate a specific taint — i.e., it's allowed onto that node. Important subtlety: a toleration lets a pod land on a tainted node; it does not force it to. A pod with a toleration could still land somewhere else.
nodeAffinity is a pod's requirement about which nodes it must run on. A requiredDuringScheduling affinity is a hard rule: schedule me only on nodes matching this.

The decision: per-tenant nodes for the path that needs them

The strongest node-level isolation is applied where the risk actually lives: the DinD path. Each DinD tenant gets its own Karpenter NodePool and EC2NodeClass (named karpenter-<tenant>), and those nodes are stamped with two taints:

taints:
  - { key: forge.local/scale_set_type, value: dind, effect: NoSchedule }
  - { key: forge.local/tenant,          value: <tenant>, effect: NoSchedule }

A tenant's build pods then carry both matching tolerations and a hard requiredDuringScheduling nodeAffinity on those same two keys. You need both halves, and understanding why is the whole point:

The tolerations let the tenant's pods onto the tenant's (tainted) nodes — and the taints keep everyone else's pods off.
The required affinity stops the tenant's own pods from wandering onto some other, untainted node. Tolerations alone would permit that drift; affinity forbids it.

Belt and suspenders. The result is that a DinD tenant's jobs can land only on that tenant's machines — enforced by the scheduler, not by hope or convention. (To be precise about scope: this hard taint/affinity node-pinning is the DinD template's behavior. The plain k8s runner mode still gets real isolation — its own namespace, service account, runner group, and IAM role — but it doesn't carry the same tenant taint/affinity in the checked-in template; if you need hard node-pinning for non-DinD jobs too, you add that scheduling layer explicitly.)
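
To see why both halves are needed, it helps to model the scheduler's rule in a few lines. This is a deliberately simplified toy, not the Kubernetes scheduler: it handles only NoSchedule taints and equality matches, and it assumes the tenant's nodes carry labels with the same two keys as the taints (which the affinity needs in order to match). The node and pod names are made up.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str = "NoSchedule"


@dataclass
class Node:
    name: str
    labels: dict = field(default_factory=dict)
    taints: list = field(default_factory=list)


@dataclass
class Pod:
    name: str
    tolerations: set = field(default_factory=set)        # {(key, value), ...}
    required_labels: dict = field(default_factory=dict)  # hard node affinity


def can_schedule(pod: Pod, node: Node) -> bool:
    """Toy rule: tolerate every NoSchedule taint AND satisfy required affinity."""
    tolerated = all(
        (t.key, t.value) in pod.tolerations
        for t in node.taints
        if t.effect == "NoSchedule"
    )
    affine = all(node.labels.get(k) == v for k, v in pod.required_labels.items())
    return tolerated and affine


ACME = {"forge.local/scale_set_type": "dind", "forge.local/tenant": "acme"}
acme_node = Node("acme-dind", labels=dict(ACME), taints=[Taint(k, v) for k, v in ACME.items()])
shared_node = Node("shared")  # untainted, unlabeled

pinned = Pod("acme-build", tolerations=set(ACME.items()), required_labels=dict(ACME))
tolerating_only = Pod("acme-drifter", tolerations=set(ACME.items()))
stranger = Pod("other-build")

for pod in (pinned, tolerating_only, stranger):
    homes = [n.name for n in (acme_node, shared_node) if can_schedule(pod, n)]
    print(f"{pod.name}: {homes}")
```

The middle pod is the instructive one. It has the right tolerations, and it can still land on the shared node, because a toleration is a permission and not a requirement. Only the pod that also carries the required affinity is confined to the tenant's nodes.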

One implementation detail worth admiring: the per-tenant EC2NodeClass isn't hand-written forty times. A data source reads the shared node class (kubectl get ec2nodeclass karpenter -o yaml), strips the server-managed fields with yq, and renames it with jq to karpenter-<tenant>. So every tenant inherits identical, correct AMI/role/subnet wiring under its own name. Sameness by construction; difference only in the label.
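
The read, strip, rename transformation looks roughly like this when written out. Forge does it with kubectl, yq, and jq inside Terraform; the Python below is only a sketch of the same idea. The exact list of stripped fields is my assumption (these are the usual server-managed metadata fields), and the spec contents are placeholders.

```python
import copy

# Fields the API server owns; resubmitting them on a new object causes trouble.
SERVER_MANAGED_METADATA = (
    "uid",
    "resourceVersion",
    "generation",
    "creationTimestamp",
    "managedFields",
)


def clone_node_class(shared: dict, tenant: str) -> dict:
    """Copy the shared EC2NodeClass and rename it for one tenant."""
    clone = copy.deepcopy(shared)  # never mutate the object we read
    metadata = clone.setdefault("metadata", {})
    for name in SERVER_MANAGED_METADATA:
        metadata.pop(name, None)
    clone.pop("status", None)  # status is reported by the controller, not declared
    metadata["name"] = f"karpenter-{tenant}"
    return clone


shared = {
    "kind": "EC2NodeClass",
    "metadata": {"name": "karpenter", "uid": "placeholder-uid", "resourceVersion": "12345"},
    "spec": {"role": "placeholder-node-role"},
    "status": {"conditions": []},
}
print(clone_node_class(shared, "acme")["metadata"])
```

Everything under spec passes through untouched, which is the point: the AMI, role, and subnet wiring cannot differ between tenants, because nobody retyped it.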

Why Docker-in-Docker is the forcing function

You might ask: why pay for per-tenant nodes at all? The answer is Docker-in-Docker (DinD). Many CI jobs build container images, which means they need a Docker daemon, which traditionally means a privileged container — one with elevated host access. Running privileged builds from multiple tenants on shared nodes is an unacceptable blast radius: a malicious or simply buggy privileged build could reach a neighbor. Pinning each tenant to its own nodes contains that.

And even within that, Forge adds a second layer: DinD runs rootless.

dind sidecar: image docker:dind-rootless
  runAsUser 1001
  subuid/subgid 100000:65536
  privileged: true   # the nested runtime needs it — but the user is 1001, not root

The subuid/subgid remapping is user-namespace remapping: processes that think they're running as root inside the build are actually mapped to an unprivileged high-numbered UID (100000+) on the host. So even the one genuinely privileged thing in the system is defanged at the user level. You get node-level isolation and user-level isolation wrapped around the riskiest capability Forge offers.

What it buys: a blast radius of exactly one tenant, enforced by the scheduler; the ability to offer privileged Docker builds safely at all; a noisy or compromised build that cannot touch a neighbor.
What it costs: money. One DinD node pool per tenant means worse bin-packing and more partially-idle nodes than a single shared pool would have. It also leans on a Karpenter capacity-diversity feature the manifest's own comment flags as alpha, and it multiplies the number of NodePool/EC2NodeClass objects to manage. Forge bounds the cost with mitigations — DinD scales to zero when idle (min_runners: 0), idle nodes consolidate quickly (consolidateAfter: 1m), and a karpenter.sh/do-not-disrupt annotation protects a node that's mid-job from being consolidated out from under it.

The lesson: isolation is a dial, not a boolean. The engineering maturity isn't "maximize isolation" or "maximize density" — it's to choose, consciously, where you sit on the cost↔blast-radius curve for your threat model, and to write that choice down so the next person knows it was deliberate. Forge turned the dial toward isolation because privileged builds demanded it, and paid for it on purpose. Decision #2.

7. Trade-off #3 — Identity: zero static credentials

CI is one of the highest-value targets in any organization: it executes code with access to credentials, source, artifacts, and internal networks. If you're going to attack a company, its build system is a wonderful place to start. So Forge's credential model is built around one non-negotiable rule: no long-lived secrets in pipelines, ever.

The background: roles, AssumeRole, and why "no keys" is even possible

In AWS, you can grant access two ways. The old way is a static credential — an access key and secret, long-lived, that you stash somewhere (a workflow secret, an env var) and that works until someone rotates it. Static keys are the thing that leaks: copied into logs, committed by accident, shared between teams, never rotated.

The modern way is role assumption. An IAM role is a set of permissions that an authorized identity can temporarily borrow by calling AWS STS (Security Token Service) AssumeRole, which mints short-lived credentials that expire in minutes to hours. Nothing long-lived is stored anywhere. The question is just: what proves you're authorized to assume the role? That proof is the runner's ambient identity.

The model: short-lived role-chaining off the runner's identity

A Forge runner is given an ambient AWS identity — on EC2, an instance profile; on Kubernetes, an EKS Pod Identity association. That identity is permitted to AssumeRole into the tenant's own role, and the tenant configures that role to trust the Forge runner role. The workflow then uses the standard aws-actions/configure-aws-credentials action with role-to-assume, gets short-lived creds, and does its work. No static key ever exists.

One precise correction, because it's almost always described wrong: this is STS role-chaining off the runner's instance/pod identity — not GitHub OIDC. The trust originates from the AWS identity Forge attaches to the runner, not from a token GitHub issues. (GitHub OIDC is a fine pattern; it's just not the one in play here, and conflating them will mislead you when you debug.) And the tenant stays in control of exactly how far that access reaches: the role Forge assumes can grant direct resource access, or it can be the first hop in a chain of role assumptions into further accounts — the tenant decides the scope by writing the policies, not the platform.

On Kubernetes the path is deliberately dual. Pod Identity is primary. But DinD scale sets also get an IRSA trust — IAM Roles for Service Accounts, where a projected Kubernetes service-account token (audience sts.amazonaws.com) is exchanged for AWS credentials via AssumeRoleWithWebIdentity. Why carry two mechanisms? Because inside Docker-in-Docker, the Pod Identity agent's link-local metadata hop (169.254.170.23) isn't reliably reachable, so the projected token is the fallback that keeps AWS auth working from within the nested Docker runtime. It's the kind of detail you only learn by getting paged. (One precision note, since "OIDC" gets thrown around loosely: GitHub OIDC is not the trust root for tenant access here — but AWS/EKS OIDC does appear, via IRSA, on this DinD fallback path. Two different OIDCs; keep them separate when you debug.)

The problem: the tenant's mistake is your incident

Here's the operational reality that drives the next design. The overwhelming majority of onboarding failures are not runner bugs — they're IAM trust mistakes: the tenant typo'd an ARN, or pointed the trust at the wrong principal, or allowed sts:AssumeRole but forgot sts:TagSession (which Forge needs for session tagging). And these mistakes are invisible until a real job runs and fails — usually at the worst possible moment, in front of the tenant.
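Several of these mistakes are visible in the trust policy document itself, before any job runs. A minimal sketch of such a check, assuming the standard IAM policy JSON shape; the function name `missing_trust_actions`, the account ID, and the role name are hypothetical, and this sketch ignores wildcards, Deny statements, and Condition blocks that a real policy may contain:

```python
REQUIRED_ACTIONS = ("sts:AssumeRole", "sts:TagSession")

def _as_list(value):
    """IAM JSON allows one item or a list of items; normalize to a list."""
    if value is None:
        return []
    return list(value) if isinstance(value, (list, tuple)) else [value]

def missing_trust_actions(trust_policy, runner_role_arn):
    """Return the required STS actions the policy does not allow for the runner role.

    Only exact-match Allow statements with an AWS principal are considered.
    """
    allowed = set()
    for stmt in _as_list(trust_policy.get("Statement")):
        principal = stmt.get("Principal")
        if stmt.get("Effect") != "Allow" or not isinstance(principal, dict):
            continue
        if runner_role_arn in _as_list(principal.get("AWS")):
            allowed.update(_as_list(stmt.get("Action")))
    return [action for action in REQUIRED_ACTIONS if action not in allowed]

RUNNER = "arn:aws:iam::111122223333:role/forge-runner"  # hypothetical ARN
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Principal": {"AWS": RUNNER}, "Action": "sts:AssumeRole"}
    ],
}
print(missing_trust_actions(policy, RUNNER))  # ['sts:TagSession']
```

A typo'd ARN shows up the same way: the runner role matches no statement, so both actions come back as missing. A static check like this cannot see propagation delays or permission boundaries, which is one reason a live probe is still worth having.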

In a multi-tenant platform, the boundary you don't control (the tenant's IAM) is the boundary that pages you. So Forge validates that boundary proactively, on a schedule, with a small purpose-built robot.

The robot: a trust-validator

The trust-validator is a deliberately built system of two Lambdas (a preparer and a validator) plus a delay queue, running every ten minutes.

Two design choices here are pure real-world scar tissue, and both teach something:

Why two Lambdas instead of one? The validator must assume a role whose ARN the preparer has to inject trust for before the validator ever runs. If the validator's identity were created as a normal Terraform output, you'd have a chicken-and-egg dependency cycle. Forge breaks it by computing the validator's role ARN deterministically in Terraform, so the preparer can reference it ahead of time. Splitting prepare-and-validate into two functions is what makes that clean.
Why a 300-second delay? Because IAM is eventually consistent. When you write a trust policy, that change is not instantly visible at every STS endpoint — for a short window you'll get a spurious AccessDenied even though the policy is correct. A naive validator would report false failures constantly. The delay (the variable is literally named iam_propagation_delay_seconds) gives the change time to propagate, and as a second safety net AccessDenied is explicitly treated as retryable with backoff. "Eventual consistency" sounds academic until it's flaking your validator every ten minutes.
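The "retryable with backoff" half of that second answer can be sketched in a few lines. This is an illustration under stated assumptions, not Forge's code: `AccessDenied` is a stand-in exception type, and `call_with_backoff`, its defaults, and the doubling schedule are hypothetical choices.

```python
import time

class AccessDenied(Exception):
    """Stand-in for the AccessDenied error STS returns (hypothetical type)."""

def call_with_backoff(attempt_fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call attempt_fn until it succeeds, retrying only on AccessDenied.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts and
    re-raises the last AccessDenied once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_fn()
        except AccessDenied:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` in as a parameter keeps the function testable without real waiting. Any other exception propagates immediately, so only the error that eventual consistency can fake gets a second chance.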

And note the validator tests TagSession separately from AssumeRole, precisely because a tenant role can permit one and forget the other — checking only the first gives you a false green. The whole thing wraps cleanup in a finally so the temporary trust is always removed, even when validation throws.
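Both properties, independent checks and unconditional cleanup, fit in one small function. The sketch below collapses the two Lambdas and the delay into a single call for readability; `sts` and `trust` are hypothetical adapter objects, not real SDK clients, and their method names are assumptions.

```python
def validate_tenant(tenant_role_arn, sts, trust):
    """Run both checks for one tenant and always remove the temporary trust.

    sts.assume_role / sts.tag_session return True or False;
    trust.add / trust.remove create and delete the temporary trust entry.
    """
    results = {"AssumeRole": False, "TagSession": False}
    trust.add(tenant_role_arn)
    try:
        results["AssumeRole"] = sts.assume_role(tenant_role_arn)
        # Checked independently: a role can allow AssumeRole and still deny TagSession.
        results["TagSession"] = sts.tag_session(tenant_role_arn)
    finally:
        trust.remove(tenant_role_arn)  # runs even if a check raises
    return results
```

Reporting the two results separately is what keeps a half-configured role from showing up as green.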

What it buys: trust problems caught proactively, before any real job depends on them; a clear per-tenant pass/fail across both AssumeRole and TagSession; what would be recurring 2am pages turned into a ten-minute cron job.
What it costs: you build and operate an actual small distributed system — two Lambdas, a delay queue, live IAM mutation — purely to validate configuration. Cleanup has to be bulletproof (that finally is load-bearing — a missed cleanup leaves a trust door ajar), and handling eventual consistency adds genuine complexity.

The lesson: in any multi-tenant system, the boundary you don't own is the boundary that wakes you up. Investing in synthetic, scheduled validation of that boundary is what converts an entire class of inevitable incidents into a dashboard you glance at. For a platform run with near-zero ops, that conversion isn't a nicety — it's the mechanism that makes near-zero ops possible. Decision #3.

8. Trade-off #4 — Immutable clusters & blue/green

Upgrading a live Kubernetes cluster is one of the more anxiety-inducing operations in this whole space. An EKS upgrade is really a coordinated upgrade of many coupled things at once — the control plane version, the add-ons (CoreDNS, kube-proxy, the EBS CSI driver), Karpenter, and the CNI — and any one of them can regress in a way that's hard to undo while production traffic is flowing through the cluster. Forge sidesteps the anxiety with a simple, radical stance: never upgrade a cluster in place.

The background: immutable infrastructure

"Immutable infrastructure" means you don't modify running things — you replace them. Instead of patching a server, you build a new image and roll it out. Instead of upgrading a cluster, you build a new cluster and move work to it. The payoff is that rollback becomes trivial (the old thing is still there, untouched) and "upgrade" stops being a high-wire act. The cost is that you need a clean way to move work between the old and new things.

The decision: disposable clusters in blue/green pairs

Forge runs EKS clusters as immutable and disposable, in blue/green pairs — at any time there's a -blue and a -green cluster, one active. To upgrade, you stand up the fresh sibling and migrate tenants onto it, one at a time. The entire cutover is controlled by just two values in a tenant's configuration: arc_cluster_name (which cluster this tenant lives on) and migrate_arc_cluster (a switch that tears down the tenant's footprint on its current cluster).

The mechanism that makes this clean is subtle and worth internalizing: the ARC module never hardcodes a cluster endpoint. Its Kubernetes and Helm providers resolve dynamically from a data source keyed on the cluster name:

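data "aws_eks_cluster" "c" { name = var.eks_cluster_name }        # = arc_cluster_name
data "aws_eks_cluster_auth" "c" { name = var.eks_cluster_name }   # token source used below
provider "kubernetes" {
  host                   = data.aws_eks_cluster.c.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.c.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.c.token
}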

Because everything resolves from the name, flipping arc_cluster_name from -green to -blue makes the very next apply repoint every in-cluster resource at the other cluster — no endpoints to edit, no state to surgically move. And migrate_arc_cluster: true acts as a master switch threaded through the module tree: when true, every in-cluster resource (the namespace, the ARC controller, the scale set, the Pod Identity association, the Karpenter node pool) is set to count = 0 and destroyed. One detail makes the whole thing safe: the IAM runner role is deliberately not gated by that switch, so it survives the migration — which means the tenant's trust relationships (and the trust-validator from §7) don't churn just because you moved clusters.

The sequence, concretely

A migration script drives this per tenant:

1. Detect direction: read the current arc_cluster_name and work out whether the move is green→blue or blue→green (and hard-error if it's neither, rather than guess).
2. Drain to zero: patch each runner scale set to minRunners: 0, maxRunners: 0, then wait until no runner pods remain. You don't yank work out from under running jobs; you stop accepting new ones and let the in-flight ones finish.
3. Disable on OLD: set migrate_arc_cluster = true with the old name and apply, which tears down the tenant's footprint on the source cluster.
4. Pre-stage on NEW: flip arc_cluster_name to the target, keep migrate_arc_cluster = true, and apply. Providers now point at the new cluster, but nothing is installed yet.
5. Enable on NEW: set migrate_arc_cluster = false with the target name and apply, then re-apply the trust-validator against the new home.
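The sequence above is small enough to write down as data. A minimal sketch, assuming cluster names end in -green or -blue; the function name `migration_plan`, the tuple layout, and the example cluster name are hypothetical, and the real script does more than this (waiting, applying, re-running the validator):

```python
def migration_plan(current_cluster):
    """Return the ordered steps for moving one tenant to the sibling cluster.

    Each step is (label, arc_cluster_name, migrate_arc_cluster). The drain
    step carries None for the flag because it patches scale sets rather
    than changing tenant configuration.
    """
    if current_cluster.endswith("-green"):
        target = current_cluster[: -len("-green")] + "-blue"
    elif current_cluster.endswith("-blue"):
        target = current_cluster[: -len("-blue")] + "-green"
    else:
        # Hard-error instead of guessing a direction.
        raise ValueError(f"cannot detect direction from {current_cluster!r}")
    return [
        ("drain to zero", current_cluster, None),
        ("disable on old", current_cluster, True),
        ("pre-stage on new", target, True),
        ("enable on new", target, False),
    ]

for step in migration_plan("forge-green"):  # hypothetical cluster name
    print(step)
```

Writing it this way makes the ordering explicit: the flag stays true across the name flip, and only the final apply turns it off.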

The honest trade-off

What it buys: safe, reversible cluster upgrades — the source is untouched until the target is proven, so rollback is "flip the name back"; other tenants are completely unaffected because each has its own namespace and, for DinD workloads, its own node pool; you migrate on your own schedule, tenant by tenant.
What it costs: the migrating tenant has a runner gap — the window between drain (step 2) and enable (step 5) when its jobs queue and wait. And the tooling has sharp edges worth knowing: a step that strips Kubernetes finalizers busy-loops with no timeout (if the controller is wedged, it spins forever), and the drain check is a substring match on pod names that a stuck terminating pod can block.
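For the busy-loop edge specifically, the usual remedy is a deadline. The sketch below is not Forge's script; it only shows one way a wait like that could be bounded, with `wait_until` and its parameters as hypothetical names:

```python
import time

def wait_until(done, timeout_s, interval_s=5.0, clock=time.monotonic, sleep=time.sleep):
    """Poll done() until it returns True or timeout_s elapses.

    Returns True on success and False on timeout, so the caller can fail
    loudly instead of spinning forever on a wedged controller.
    """
    deadline = clock() + timeout_s
    while True:
        if done():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass a predicate such as "no runner pods remain" or "the finalizer is gone" and abort the migration step when the function returns False.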

Here's the part you must say out loud, because it's where a careless write-up would lie to you: this is not zero-downtime for the migrating tenant. It is blast-radius-of-one: the platform keeps serving everyone else while a single tenant cuts over with a brief gap. The unit of "no disruption" is the platform, not the individual tenant. Naming that precisely is the difference between a write-up a senior engineer trusts and one they tune out. Decision #4.

9. Trade-off #5 — Config at scale

Everything above only matters if you can actually onboard forty teams without it being forty projects. A platform that's painful to add tenants to never gets forty tenants — it gets five and a backlog. So the configuration architecture is load-bearing for the whole low-ops story.

The background: Terraform vs. Terragrunt, and DRY

Terraform describes infrastructure as code. But across dozens of near-identical deployments (same module, different tenant/region/account), plain Terraform pushes you toward copy-paste — and copy-paste is where drift is born. Terragrunt is a thin wrapper over Terraform whose main job is to keep things DRY (Don't Repeat Yourself): shared configuration is defined once and inherited, and each deployment is just the small set of values that make it unique.

The decision: the directory path is the configuration

Forge deploys one module per tenant × region × account, arranged so that the filesystem layout encodes the coordinates:

environments/<aws_account>/regions/<aws_region>/vpcs/<vpc_alias>/tenants/<tenant_name>/

The tenant's name is literally basename() of its directory, and the path coordinates are account → region → VPC alias → tenant. Configuration cascades down the tree via Terragrunt's find_in_parent_folders: organization-wide constants at the top; an account/environment layer that supplies the AWS account id, profile, and remote-state settings; a region layer with the region and its short alias; a VPC layer with the concrete VPC and subnet IDs; and finally the tenant's own config.yml. A small config.hcl at the tenant level yamldecodes that YAML and expands it into the runner specs — and, crucially, into the runs-on label set that is the tenant's API:

runner_labels = [
  "type:${spec.type}",
  "self-hosted",
  spec.runner_architecture,
  "env:ops-${include.env.locals.env}",
]
extra_labels = [
  "ec2",
  "rgn:${local.region_alias}",
  "vpc:${local.vpc_alias}",
  "tnt:${local.tenant_name}",
]

There's a clever bit in how those labels are matched. Rather than requiring a workflow to specify the exact full label set, Forge generates matchers for every contiguous sub-slice of the extra labels appended to the base — so a job can match with any reasonable subset (just type:standard and self-hosted, say) while the platform still always knows the full identity (tenant, region, VPC, lane). Flexibility for the user, precision for the platform.
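To make "every contiguous sub-slice" concrete, here is a sketch of the generation step only. The function name `label_matchers` and the example label values are hypothetical, and how a job's labels are then compared against these matcher lists is not specified in this article, so the sketch stops at producing them:

```python
def label_matchers(base, extra):
    """Return base alone plus base followed by every contiguous sub-slice of extra."""
    matchers = [list(base)]
    for start in range(len(extra)):
        for end in range(start + 1, len(extra) + 1):
            matchers.append(list(base) + list(extra[start:end]))
    return matchers

base = ["type:standard", "self-hosted", "x64", "env:ops-prod"]    # hypothetical values
extra = ["ec2", "rgn:usw2", "vpc:main", "tnt:team-a"]             # hypothetical values
print(len(label_matchers(base, extra)))  # 11
```

With four extra labels there are ten non-empty contiguous slices plus the bare base, hence eleven matchers. Non-contiguous picks such as ec2 with vpc:main are deliberately not generated, which keeps the count quadratic rather than exponential.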

Two more touches that make scale comfortable:

Version pinning with a local-dev escape hatch. Module versions are pinned indirectly through a release_versions.yaml (everything at a single release ref), and a use_local_repos flag flips every module source from the git ref to a local file:// checkout for iteration. One switch, zero per-tenant edits — you can test a platform change against a real tenant config without touching the tenant config.
State isolation for free. There's one S3 state bucket and one DynamoDB lock table per account (the bucket name carries the account id), and the Terraform state key is the directory path. So per-tenant/region/VPC state separation isn't designed per tenant — it falls straight out of where the directory sits. The structure does the work.
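The "path is the configuration" idea is easy to see in code. A minimal sketch, assuming the layout shown earlier in this section; the function names and example values are hypothetical, and the exact state key format (here the directory path plus a fixed file name) is an assumption rather than something the article states:

```python
from pathlib import PurePosixPath

LAYOUT = ("environments", "regions", "vpcs", "tenants")

def coordinates(tenant_dir):
    """Split a tenant directory into its account/region/VPC/tenant coordinates."""
    parts = PurePosixPath(tenant_dir).parts
    if len(parts) != 8 or parts[0::2] != LAYOUT:
        raise ValueError(f"unexpected layout: {tenant_dir!r}")
    account, region, vpc_alias, tenant = parts[1::2]
    return {
        "aws_account": account,
        "aws_region": region,
        "vpc_alias": vpc_alias,
        "tenant_name": tenant,  # same as basename() of the directory
    }

def state_key(tenant_dir):
    """Assumed key format: the directory path plus a fixed state file name."""
    return f"{PurePosixPath(tenant_dir)}/terraform.tfstate"

path = "environments/prod/regions/us-west-2/vpcs/main/tenants/team-a"  # hypothetical
print(coordinates(path)["tenant_name"])  # team-a
print(state_key(path))
```

Because both the coordinates and the state key are pure functions of the path, two tenants can only collide if they share a directory, which the filesystem already forbids.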

What it buys: adding a tenant is a configuration change, not a project — minutes, not days; dev and prod can run different module versions safely; genuine DRY across ~40 deployments behind one module and one upgrade path.
What it costs: Terragrunt plus layered HCL has a real learning curve, and there are footguns worth flagging — remote state can live in one region even for resources in another (a cross-region surprise), and the state bucket and lock table sharing a name can confuse newcomers. Forge also deliberately avoids cross-unit dependency blocks (each tenant is one self-contained module instance); the rare cross-module ordering, like a full cluster rebuild, is handled by an external DAG-resolver script with a pragmatic stabilization delay. Pragmatic over pure.

The lesson: when you have N near-identical deployments, make the difference between them a small data file and let structure carry the sameness. If your marginal deployment costs minutes, "forty tenants" stops being a scaling problem and becomes a directory listing. Decision #5.

10. Trade-off #6 — The connective tissue

The big architectural calls get the headlines, but a platform actually survives on its connective tissue — the small, resilient systems that quietly remove whole categories of toil so a human never has to. Three of them, then the thing that ties it all together.

Audit-grade job-log archival

GitHub keeps job logs for a limited window, but you often need them longer — for audits, for debugging a flaky job a week later, for compliance evidence. So when a job finishes (workflow_job with action: completed), Forge archives its logs through a deliberately decoupled pipeline:

The event flows EventBridge → a dispatcher Lambda (which filters and forwards) → SQS → an archiver Lambda that authenticates as the GitHub App (minting a short-lived installation token), downloads the logs, and writes them to a per-tenant S3 bucket, KMS-encrypted, keyed by {repo}/{run}/{attempt}/{job} as both a raw .log and a structured .json. Two details show the care: the SQS visibility timeout is set just above the archiver's Lambda timeout (the event-source mapping refuses to be created otherwise — there's a comment in the code so the next person doesn't "fix" it), and failed messages retry up to ten times before landing in a dead-letter queue (DLQ). A separate redrive Lambda re-injects DLQ messages back into the pipeline every ten minutes, so a transient GitHub API hiccup self-heals instead of silently dropping audit logs. Nobody gets paged for a blip.
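The retry, dead-letter, and redrive behavior can be simulated in memory to see why a transient outage loses nothing. This is a toy model under stated assumptions: in-process deques stand in for SQS, `RuntimeError` stands in for a GitHub API failure, retries happen immediately rather than after a visibility timeout, and all function names are hypothetical.

```python
from collections import deque

MAX_RECEIVES = 10  # from the text: ten attempts before the dead-letter queue

def archive_key(repo, run, attempt, job, ext):
    """Object key in the per-tenant bucket, following {repo}/{run}/{attempt}/{job}."""
    return f"{repo}/{run}/{attempt}/{job}.{ext}"

def drain(queue, dlq, handler):
    """Process every message; one that fails MAX_RECEIVES times moves to dlq."""
    archived = []
    while queue:
        msg = queue.popleft()
        for _ in range(MAX_RECEIVES):
            try:
                archived.append(handler(msg))
                break
            except RuntimeError:
                continue
        else:
            dlq.append(msg)  # every attempt failed
    return archived

def redrive(queue, dlq):
    """Move everything from the DLQ back onto the main queue (the periodic redrive)."""
    while dlq:
        queue.append(dlq.popleft())
```

If the handler fails for a while, messages pile up in the DLQ rather than vanishing; once the upstream recovers, a single `redrive` followed by `drain` archives them as if nothing had happened.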

A self-healing global lock

Some operations must not run concurrently across repos or workflows. Forge provides a distributed mutex backed by DynamoDB: a table keyed on a lock_id, with secondary indexes on the workflow run and attempt, and a TTL on a timestamp. Workflows acquire and release the lock (every runner role carries the small policy needed to do so). The self-healing part is a janitor Lambda that runs every ten minutes, checks each held lock's workflow-run status via the GitHub App, and deletes any whose run has already completed — with the DynamoDB TTL as a final backstop for anything the janitor can't resolve. A lock that would otherwise wedge the system instead expires and cleans itself up.
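The janitor's decision is simple enough to sketch as a pure function. The field names (`lock_id`, `run_id`, `expires_at`), the function name, and the `run_status` callback are hypothetical; in the real system the TTL expiry is performed by DynamoDB itself rather than by the janitor, and this sketch folds both cleanup paths into one place only to show them side by side:

```python
def stale_locks(locks, run_status, now):
    """Return the lock_ids that should be cleaned up.

    locks: dicts with lock_id, run_id and expires_at (epoch seconds).
    run_status: callable mapping a run_id to a status string, or None if unknown.
    """
    stale = []
    for lock in locks:
        if run_status(lock["run_id"]) == "completed":
            stale.append(lock["lock_id"])   # owner finished without releasing
        elif lock["expires_at"] <= now:
            stale.append(lock["lock_id"])   # TTL backstop for everything else
    return stale
```

A lock held by a run that is still in progress and not yet past its TTL is left alone, which is the only case where the lock is doing its job.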

Warm pools, used surgically

Remember the one cost of ephemeral runners: per-job startup latency. And the numbers are worth being concrete about, because the picture is "fast when something's already warm, slow when you have to boot a machine":

EC2, warm-pool hit: a pre-booted runner picks up the job in under ~20 seconds.
EC2, cold: booting a fresh instance (provision, bootstrap, register) takes roughly 5 minutes for Linux. Cold start also climbs with the OS and host type: Windows boots noticeably slower than Linux, and macOS is slowest of all, because Mac runners live on AWS Dedicated Hosts. You can wait on dedicated-host allocation and then on a Mac instance boot, which is far slower than a normal VM.
Kubernetes pod on an already-running (idle) node, image cached: also under ~20 seconds — the pod just schedules and the container starts. But this one is genuinely case-by-case: if the runner image isn't already cached on the node, you add image-pull time; and if no node is available, you add node boot (below). So a non-DinD pod's start is "fast when the node's up and the image is local," and degrades from there.
Kubernetes/DinD when no node is available (e.g., scaled to zero): you pay the worst case — Karpenter has to boot a node first, then the scheduler places the pod, then it pulls/starts the container — which lands in minutes, comparable to an EC2 cold start, because you're booting an EC2 node either way.

So the real asymmetry isn't "EC2 slow, Kubernetes fast" — it's "warm/idle is fast on both lanes; cold means booting an EC2 machine on both lanes, with the extra K8s variable of whether the image is already on the node." The blunt fix is a warm pool, and the expensive mistake is keeping warm capacity everywhere, always.

There's also a cost asymmetry between the two warm strategies that ties straight back to §5, and it's sharp. A node is exactly one VPC IP — with Calico, it makes no difference whether that node runs one pod or a hundred; it still consumes a single VPC address. So if you want idle, warm capacity sitting ready: an EC2 warm pool of N runners holds N VPC IPs the entire time it waits (spending your scarcest resource just to shave startup), whereas you can park many idle DinD pods behind one warm node and spend exactly one IP.
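The arithmetic is worth spelling out. A minimal sketch; `pods_per_node` is a hypothetical parameter that depends on node size and pod resource requests, which this article does not specify:

```python
import math

def warm_ips_ec2(warm_runners):
    """Each warm EC2 runner is its own instance, so it holds one VPC IP."""
    return warm_runners

def warm_ips_dind(warm_runners, pods_per_node):
    """With an overlay CNI only nodes consume VPC IPs, so count nodes, not pods."""
    return math.ceil(warm_runners / pods_per_node)

print(warm_ips_ec2(20))       # 20
print(warm_ips_dind(20, 20))  # 1
print(warm_ips_dind(20, 8))   # 3
```

Even when the warm pods do not all fit on one node, the IP cost grows with the node count rather than the runner count.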

That single fact inverts the default recommendation at scale. For ordinary use, Forge actually recommends EC2 over DinD-on-EKS, because per job it's cheaper. But for high-churn, scaled workloads that want warm capacity hot and ready, DinD wins — you can keep a pool of idle runners behind a single locked node (one IP) instead of paying one IP per warm EC2 runner. And this isn't a rule set once and forgotten: Forge measures each tenant's actual usage (those cost/usage dashboards again) and periodically recalculates and recommends the cheaper mix as the tenant's pattern shifts. The trade-off is computed from data, not guessed at onboarding — usage-aware, and revisited.

So Forge uses warm capacity surgically: most runner types run none (pure on-demand, accept the cold-start), and the few latency-sensitive EC2 types keep a small warm pool only during business hours, doubled by timezone — e.g. one warm runner topped up cron(*/5 9-18 ? * MON-FRI) in both America/Los_Angeles and Europe/Madrid, idle nights and weekends. On Kubernetes the same effect comes from keeping a little node headroom rather than scaling hard to zero on a latency-sensitive tenant — and it costs you nodes, not precious IPs. Cost is spent exactly where it buys responsiveness and nowhere else — the difference between "we have warm pools" and "we have a warm-pool bill."
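The "business hours, doubled by timezone" rule amounts to a union of local-time windows. A sketch of that predicate, assuming the 9-18 hour range and Monday to Friday from the schedule above; the function name is hypothetical, and the example uses fixed UTC offsets to stay self-contained, whereas real code would use named zones such as `zoneinfo.ZoneInfo("Europe/Madrid")` so daylight saving is handled:

```python
from datetime import datetime, timedelta, timezone

def pool_should_be_warm(now_utc, zones, start_hour=9, end_hour=18):
    """True if now_utc is a weekday between start_hour:00 and end_hour:59
    local time in at least one of the given tzinfo objects."""
    for tz in zones:
        local = now_utc.astimezone(tz)
        if local.weekday() < 5 and start_hour <= local.hour <= end_hour:
            return True
    return False

# Fixed summer offsets as stand-ins for America/Los_Angeles and Europe/Madrid.
ZONES = [timezone(timedelta(hours=-7)), timezone(timedelta(hours=2))]
wednesday_morning_madrid = datetime(2026, 6, 17, 8, 0, tzinfo=timezone.utc)
print(pool_should_be_warm(wednesday_morning_madrid, ZONES))  # True
```

The two windows overlap only partially, so the pool is warm for longer than either office's working day but still idle overnight in both and all weekend.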

Central logging & observability with Splunk — the thing that makes one team possible

Underneath all of it sits the system that makes operating forty tenants by a small team even conceivable: centralized logging and metrics. You cannot watch forty tenants by hand; you can only instrument them.

The pipeline has a few distinct paths, and it's worth being precise about them because "it all goes to Splunk" hides the real shape. The clean way to hold it in your head is: logs go one way (to Splunk Cloud, via CloudWatch), metrics go another (to Splunk O11y).

Logs (all of them) go to CloudWatch Logs, then to Splunk Cloud. EC2 runners, the control-plane Lambdas, and EKS (both node and pod logs) all send their logs to CloudWatch Logs. From there, ingestion into Splunk Cloud (the log-analytics product) is done through Splunk's Data Manager app, the supported service for pulling AWS data into Splunk Cloud. True to the rest of the platform, Forge configures Data Manager as code: there's a Terraform module in the open-source repo that wires up the ingestion rather than anyone clicking through a console. A runner alone emits several log streams: OS syslog, the cloud-init / EC2 user-data bootstrap output (where most "the runner never came up" failures hide), the GitHub Actions job logs, the runner agent/worker logs, and the job-started/completed hook output. They all travel this same path. (The archived job logs from the archival pipeline described earlier in this section also land in per-tenant S3 and feed Splunk Cloud.) The result: a job's full story (bootstrap, execution, outcome) is reconstructable from one place after the runner itself is long gone, which, being ephemeral, it always is.
AWS metrics go to Splunk O11y. The Splunk Observability (O11y) + AWS integration pulls the CloudWatch metrics for the platform's AWS resources into Splunk O11y (the metrics/observability product) — and it, too, is configured by a Terraform module in the Forge repo, so the observability wiring is reproducible and reviewable like everything else.
EKS adds extra pod metrics to O11y via OTel. On top of the AWS metrics, each EKS cluster runs the Splunk OpenTelemetry (OTel) collector to send additional, pod-level metrics to Splunk O11y — the fine-grained container telemetry CloudWatch's AWS metrics don't capture. Note this is the metrics path only; EKS logs still go the log route above (CloudWatch → Splunk Cloud).

The payoff isn't just that the data lands somewhere — it's that Splunk Cloud carries a large set of Forge-specific field extractions, so the logs aren't an opaque blob: you can slice and filter them by meaningful dimensions (tenant, region, account, runner type, job, conclusion, and more), and the metrics in O11y are likewise sliceable by dimension. That's what turns "we have the logs" into "I can answer a question in one query." And — consistent with everything else in this article — those extractions and the dashboards themselves are managed as code, through Terraform's Splunk and SignalFx providers, not hand-built in a console; the observability is as reproducible as the infrastructure it watches. On top of it, each tenant gets its own dashboards — runner-lifecycle health, queue/capacity, trust-validation failures, the webhook/job-log pipeline, optimization signals like high-memory-job detection, and cost breakdowns fed by AWS Billing Data Exports so a team can see and right-size its own spend.

This isn't abstract — it's a concrete set of dashboards, each mapping a symptom to a subsystem: a runner-capacity view (GitHub queue pressure vs. ARC scale-set state), Lambda operations (scale errors, tagging, ingestion retries), K8s storage/network (PVC/EBS attach, scheduler capacity, CNI readiness, API-audit), trust failures (the AssumeRole/TagSession detail from §7), ARC/DinD health (init containers, hook sidecars, runner versions, Karpenter signals), EC2 runner lifecycle (webhook → scale → AMI/SSM → user-data → hook), the webhook/job-log pipeline (dispatcher → SQS/DLQ → archiver → ingestion), and — tellingly — an ingestion-quality view that watches the telemetry itself for missing fields and dropped logs, because a dashboard you can't trust is worse than no dashboard. In the internal production deployment, the clusters also run Falco for runtime security monitoring (it's not in the public repo), so anomalous syscall behavior inside a running job is observable too. The throughline: every GitHub-side symptom ("my job is stuck") has a backend view that walks you down the chain — label mismatch → webhook → Lambda/SQS → EC2/ARC → runner logs — and localizes it to a subsystem in a click or two, the difference between an operator who guesses and one who knows. And the same central data does two more jobs beyond troubleshooting: it provides the audit trail compliance needs (who ran what, where, with which permissions), and it makes per-tenant cost attribution possible instead of one undifferentiated AWS bill.

The payoff is an operating loop rather than a permanent fire drill: monitor → alert → automate. When something recurs, it becomes a detector or a script, and then it stops being work. That loop is why "near-zero ops" is a true statement and not a brag — the rigor of the instrumentation is precisely what removes the human toil.

Debugging an ephemeral platform — Teleport for auditable SSH

Ephemerality (§6, §7) is wonderful for security and cleanliness, but it creates a real problem: how do you debug a machine that deletes itself the instant the job ends? You can't SSH into a failed runner after the fact — it's gone. Forge answers this three ways, in order of how often you reach for them.

First and most often, the logs already have it. Because every stream is centralized in Splunk (above), the large majority of debugging is post-hoc and hands-off — you read the bootstrap output, the job log, and the hook output without touching a box at all.

Second, rerun it. Every job runs in a fully reproducible, ephemeral environment, so re-running a failed job from the GitHub UI replays the exact conditions with no leftover side effects — a luxury you simply don't have with long-lived, drifted runners.

Third, when you genuinely need to be on the machine, Teleport provides live, auditable SSH — to both EC2 runners and Kubernetes pods. The developer logs in through corporate SSO (tsh login --proxy=<teleport-proxy> --auth=CloudSSO), and because the runner would normally vanish at job end, they keep it alive deliberately — a sleep step in the workflow, or a wrapper — then tsh ssh into the EC2 instance or tsh kube into the pod and inspect it as it runs. Crucially, this is break-glass, not a backdoor: access is gated by AD/identity-group entitlement (requested via a ticket), it's narrow and time-bound, and Teleport records sessions centrally — so live debugging never means an untracked shell or a shared bastion key. You get the convenience of "just SSH in" without surrendering the audit trail that made the platform compliant in the first place.

That's the trade-off in miniature: ephemeral runners buy you security and a clean slate every time, and cost you easy debugging — and Forge pays that cost with central logs, reproducible reruns, and auditable Teleport access, rather than by giving up ephemerality. You keep the security property and stay able to troubleshoot.

11. Trade-off #7 — Staying fresh: automated dependencies & dogfooded images

A note on scope: unlike the previous six, this one is mostly internal process rather than open-source code. The building blocks are public — the Forge repo ships a Renovate config and Packer/Ansible image-build examples — but the end-to-end pipeline described here (the image repos and the dogfood CI) lives in internal repos. I'm including it because it's the part that makes everything above survivable over time, and the pattern is what's worth stealing, not the hostnames.

Here's a truth that gets left out of most platform write-ups: a platform is never "done." Even with zero new features, it sits on a foundation that moves constantly underneath you — GitHub changes the Actions runtime, the runner agent ships a new minimum version, ARC and Karpenter and Calico cut releases, Terraform providers update, base-OS packages get CVEs, and hundreds of transitive dependencies drift. (A concrete one: when GitHub migrated Actions from the Node 20 runtime to Node 24, every runner image and a pile of tenant workflows had to be tested and updated — not because Forge changed anything, but because the ground moved underneath it.) Standing still is not free; standing still is how you rot. The hard part of operating a platform with near-zero ops is therefore not the initial build — it's keeping a large, moving dependency surface current without a human babysitting it. Forge does that with three interlocking pieces.

Piece 1 — Renovate: turn the dependency treadmill into reviewable PRs

The first piece is Renovate, a bot that continuously scans every dependency source in every repo — Terraform/OpenTofu modules, GitHub Actions, Docker base images, Helm charts, the pinned upstream runner module, even tool versions — and, when something updates, opens a pull request automatically. A shared Renovate configuration (defined once, extended by every repo) sets the policy: updates are grouped sensibly (all AWS provider bumps together, all Docker bumps together) so you review related changes as a unit; safe changes (patch bumps, digest pins, pre-commit hooks) can auto-merge; major version bumps are labeled and delayed so a human looks; and security patches are prioritized.
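The policy reads naturally as a small decision function. This is a sketch of the decision logic only, not Renovate configuration: the function name and the returned labels are hypothetical, versions are assumed to be plain MAJOR.MINOR.PATCH strings, and the handling of minor bumps (normal review) is an assumption, since the policy above only names patch, major, and security updates:

```python
def classify_update(current, new, security=False):
    """Map a version bump to a handling policy.

    Returns one of 'prioritize', 'review-delayed', 'review', 'automerge'.
    """
    cur = tuple(int(part) for part in current.split("."))
    nxt = tuple(int(part) for part in new.split("."))
    if security:
        return "prioritize"        # security patches jump the queue
    if nxt[0] != cur[0]:
        return "review-delayed"    # major bumps are labeled and wait for a human
    if nxt[1] != cur[1]:
        return "review"            # assumption: minor bumps get a normal review
    return "automerge"             # patch bumps are the safe class

print(classify_update("5.31.0", "5.31.2"))  # automerge
print(classify_update("5.31.0", "6.0.0"))   # review-delayed
```

The point of encoding the policy once, in a shared place, is that every repo makes the same call on the same kind of bump.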

The effect is a mindset shift. Dependency maintenance stops being a periodic, dreaded, manual sweep and becomes a steady stream of small, pre-tested PRs you either glance at and approve or let auto-merge. You're never "behind" in a scary way — you're continuously, incrementally current. For a one-person-shaped operation, this is the difference between drowning in the treadmill and riding it.

But automated dependency PRs are only safe if something proves each bump doesn't break the platform. That's where the other two pieces come in.

Piece 2 — Images as code: a layered base image

Forge's runners don't use off-the-shelf images; they run images built as code with Packer (which bakes machine images) and Ansible (which configures what goes inside). The structure is two-layered, across three operating systems:

Base images for Ubuntu, macOS, and Windows — deliberately minimal. Each starts from a CIS-aligned hardened OS and installs only the common foundation every runner of that OS needs: the GitHub Actions runner agent, the container/Docker runtime where applicable, the AWS CLI, and the platform's own agents for access and observability (Teleport for break-glass, the Splunk/CloudWatch log shippers). That's it. The base is kept lean on purpose. (For container jobs, images are pulled from either a Forge ECR or the tenant's own ECR, depending on configuration — so a team can ship a private container image without it ever passing through the platform.)
Tenants own their toolchains. A tenant either runs on the minimal base image as-is, or builds its own custom image on top of the base with whatever it actually needs — Go, Python, Terraform/OpenTofu, Packer, and so on. Forge deliberately does not manage those toolchains. Curating every team's language and tool versions doesn't scale and isn't the platform's job — it's a treadmill that never ends. Keeping the base minimal and pushing toolchains into tenant-owned custom images is exactly what stops the platform from drowning in "can you add X to the image" tickets, and it means one team's toolchain choices can't destabilize anyone else's runners.
The Forge team is itself a tenant. The team that operates Forge runs its own custom image for its own runners — because it consumes Forge exactly like any other tenant, through the same base-image → custom-image path it gives everyone else. That's not a metaphor for dogfooding; it is the dogfooding. The platform team eats the same food it serves.
A build-matrix control system lets a PR skip specific OS/version/architecture combinations via tokens in the PR title or commit message (e.g. skip:windows, skip:ubuntu:22:arm64) — so a change that only touches one image doesn't pay to rebuild all of them across three OSes and multiple architectures.
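The skip-token mechanism in the last item can be sketched as a parse-and-filter step. The token syntax comes from the examples above; the positional reading of a token as os, then version, then architecture, the prefix-matching rule, the function names, and the matrix values are all assumptions made for illustration:

```python
import re

def parse_skips(text):
    """Extract skip tokens such as 'skip:windows' or 'skip:ubuntu:22:arm64'."""
    return [tuple(tok.split(":")) for tok in re.findall(r"skip:(\w+(?::\w+)*)", text)]

def is_skipped(combo, skips):
    """combo is (os, version, arch); a token matches when it is a prefix of combo."""
    return any(combo[: len(skip)] == skip for skip in skips)

def filter_matrix(matrix, text):
    """Drop every matrix entry matched by a skip token found in text."""
    skips = parse_skips(text)
    return [combo for combo in matrix if not is_skipped(combo, skips)]

matrix = [
    ("ubuntu", "22", "amd64"),
    ("ubuntu", "22", "arm64"),
    ("ubuntu", "24", "arm64"),
    ("windows", "2022", "amd64"),
]
title = "Bump base image skip:windows skip:ubuntu:22:arm64"
print(filter_matrix(matrix, title))
# [('ubuntu', '22', 'amd64'), ('ubuntu', '24', 'arm64')]
```

Prefix matching is what lets a short token like skip:windows remove a whole OS while a longer one removes a single combination.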

And there's a second trade-off nested inside the tenant's choice — when to get its tools onto the runner. A team can install everything at workflow runtime (apt-get, pip install, download a toolchain at the top of the job): zero image to build or maintain, but every single run pays that installation cost and is exposed to a slow or flaky upstream mirror. Or it can bake those tools into a custom image: more work to build and keep fresh, but runs start instantly with everything already present and reproducible. Neither is "right" — it's the classic build-time-vs-run-time trade-off, and Forge deliberately leaves it to the tenant, because only the tenant knows whether it values low maintenance or fast, repeatable runs. The platform's job is to make both paths work cleanly, not to pick for them.

Building images as code matters for two reasons. First, freshness is not optional: GitHub will reject self-hosted runners whose agent is too old, so a stale image doesn't just lag — it eventually fails jobs outright. Scheduled, automated rebuilds keep images inside that window. Second, an image defined as code is an image you can build and exercise in a pull request — which is the whole game.

Piece 3 — Dogfooding: every image is tested as a real runner before merge

This is the keystone, and it's stronger than "run some tests." When Renovate (or a human) opens a PR that bumps a dependency or changes an image, the pipeline builds the candidate image inside that PR and then registers it as a real GitHub Actions runner and runs actual builds on it — before the PR can merge. Not a mock, not a smoke check: the freshly built image is brought up as a genuine runner and made to do real work. And it runs on the Forge team's own runners — because, as above, the Forge team is a tenant of its own platform. So a change to the platform is proven on the platform, by the team that owns it, acting as a real user of it. If a dependency bump or an image change breaks the runner, it breaks in the PR, visibly, on a real runner — not in production three days later.

"Dogfooding" — using your own product to build your own product — is doing real safety work here, not just signaling virtue. Because the Forge team runs as a tenant and every image is validated as an actual runner before it ships, the automation in Piece 1 becomes trustworthy: you can let safe dependency PRs auto-merge precisely because a broken image could never have passed a real build on the team's own runners. The three pieces only work as a set — Renovate generates the change, images-as-code make it buildable per PR, and the real-runner dogfood test (on the Forge-team tenant) proves it before it ships. And because tenants consume Forge through pinned versions (§9), a freshly validated release rolls out to teams on their cadence, not all at once.

What it buys: a large, fast-moving dependency surface stays current with little human effort; dependency and image changes are proven on the real platform before merge, which makes safe auto-merge trustworthy; runner images never silently rot past GitHub's agent-version cutoff.
What it costs: you build and maintain a non-trivial CI/image pipeline (Packer, Ansible, build matrices, the dogfood wiring); you spend CI minutes building and testing images on every relevant PR; and you take on the discipline of keeping the dogfood loop green, because once you trust it enough to auto-merge, a flaky loop is a real liability.

The lesson: "done" is a myth for a platform; the ecosystem moves whether you do or not. The way a small team keeps up is to make staying current automatic and self-proving — generate the changes with a bot, define your artifacts as code so they're buildable per change, and dogfood so every change is tested on the very system it modifies. That trio is what lets "near-zero ops" survive contact with a year of upstream churn. Decision #7.

12. Operating it: ownership, where it breaks, and the sharp edges

An explainer that stops at architecture skips the half that actually fills a platform team's week. Four operational realities round out the picture.

Ownership boundaries — most incidents live here. A surprising share of "Forge is broken" tickets aren't Forge bugs; they sit on a seam between owners:

Platform / Forge team owns the modules, EKS clusters, runner lifecycle, GitHub App plumbing, base images, shared observability, and guardrails.
Tenant team owns its workflows, workload permissions, custom toolchains/images, repository selection, and the external IAM roles it asks runners to assume.
Security / infra teams own the corporate VPCs, subnets and routing, Teleport entitlements, and Splunk access.

Knowing the boundary is half the triage: a job that can't reach an internal API is usually network/routing (security-infra); a job that can't assume a role is usually tenant IAM; "no runner picked it up" is usually labels/runner-group (platform + tenant config).
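The "no runner picked it up" case is often the easiest to check mechanically: a job is only routed to a runner that carries every label in its `runs-on` list. The sketch below shows that subset check. The function names are hypothetical, and treating labels as case-insensitive is an assumption worth verifying against the GitHub documentation.

```python
from typing import Iterable, List


def missing_labels(runs_on: Iterable[str], runner_labels: Iterable[str]) -> List[str]:
    """Labels the job asks for that the runner does not offer."""
    wanted = {label.lower() for label in runs_on}
    offered = {label.lower() for label in runner_labels}
    return sorted(wanted - offered)


def can_pick_up(runs_on: Iterable[str], runner_labels: Iterable[str]) -> bool:
    """True when the runner carries every label the job requires."""
    return not missing_labels(runs_on, runner_labels)
```

Printing the result of `missing_labels` next to the tenant's configured labels usually shows at a glance whether the ticket belongs to tenant config or to the platform.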

Where it breaks, and where to look first.

Symptom	Likely layer	First place to look
Job stuck waiting for a runner	Label mismatch, runner group, webhook, EC2/ARC capacity	GitHub labels; runner-group reconciler logs; webhook/Lambda/SQS dashboard
EC2 runner never registers	AMI, user-data, subnet IPs, EC2 capacity, App token	CloudWatch user-data logs; scale-up Lambda; EC2-lifecycle dashboard
ARC pod stuck pending	Karpenter capacity, storage, node taints, resource requests	K8s storage/network dashboard; ARC-lifecycle dashboard
AWS auth fails inside a job	Tenant trust, missing `sts:TagSession`, wrong role ARN	Trust-validator dashboard; tenant IAM trust policy
Logs missing	Job-log archiver, SQS/DLQ, Splunk ingestion	Webhook/job-log-pipeline dashboard; ingestion-quality dashboard
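For the "AWS auth fails inside a job" row, the check a validator performs can be sketched as a small function over the tenant role's trust policy document. This is a simplified illustration, not the trust-validator's actual code: it only matches an exact role ARN in `Principal.AWS`, ignores `Condition` blocks and wildcards, and the function name is hypothetical.

```python
from typing import Any, Dict, List, Set

REQUIRED_ACTIONS = {"sts:AssumeRole", "sts:TagSession"}


def _as_list(value: Any) -> List[Any]:
    """IAM allows a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def missing_trust_actions(trust_policy: Dict[str, Any], runner_role_arn: str) -> Set[str]:
    """Return the required STS actions the policy does not grant to the runner role."""
    granted: Set[str] = set()
    for statement in _as_list(trust_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        if not isinstance(principal, dict):
            continue  # e.g. "*", which this sketch deliberately does not handle
        if runner_role_arn not in _as_list(principal.get("AWS")):
            continue
        granted.update(_as_list(statement.get("Action")))
    return REQUIRED_ACTIONS - granted
```

An empty result means both actions are present for that principal. A result of `{"sts:TagSession"}` is the failure mode named in the table.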

Sharp edges worth respecting — the platform isn't as simple as a happy-path diagram suggests:

scale_set_type must be exactly dind or k8s.
Kubernetes CPU/memory need units — 500m, 1Gi — not bare numbers.
ami_kms_key_arn must be "" for an unencrypted AMI.
macOS requires use_dedicated_host: true and matching placement (plus License Manager gotchas).
warm-pool schedules use AWS cron syntax, not Unix cron.
migrate_arc_cluster should be true only during an intentional migration.
subnet-IP exhaustion and EC2 capacity errors are first-class failure modes, not edge cases (§5, again).
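Several of these edges can be caught before a plan ever runs. The sketch below is a hypothetical pre-flight check, not part of Forge. It assumes the runner spec has already been loaded into a dict, that quantities must arrive as strings (so a bare YAML number is rejected), that a quoted core count such as `'1'` is acceptable for CPU, and that warm-pool schedules are written as `cron(...)` with six fields.

```python
import re
from typing import Any, Dict, List

VALID_SCALE_SET_TYPES = {"dind", "k8s"}
# CPU: millicores ("500m") or a quoted core count ("1").
CPU_QUANTITY = re.compile(r"^\d+m$|^\d+(\.\d+)?$")
# Memory: a number followed by a binary or decimal suffix ("1Gi", "512M").
MEMORY_QUANTITY = re.compile(r"^\d+(\.\d+)?(Ki|Mi|Gi|Ti|K|M|G|T)$")
# Assumed shape of an AWS schedule expression: cron(<six fields>).
AWS_CRON = re.compile(r"^cron\((.+)\)$")

QUANTITY_CHECKS = {
    "container_requests_cpu": CPU_QUANTITY,
    "container_limits_cpu": CPU_QUANTITY,
    "container_requests_memory": MEMORY_QUANTITY,
    "container_limits_memory": MEMORY_QUANTITY,
}


def looks_like_aws_cron(expression: str) -> bool:
    """True for cron(...) with six fields; a five-field Unix cron line fails."""
    match = AWS_CRON.match(expression.strip())
    return bool(match) and len(match.group(1).split()) == 6


def validate_arc_spec(name: str, spec: Dict[str, Any]) -> List[str]:
    """Collect human-readable problems for one ARC runner spec."""
    errors = []
    if spec.get("scale_set_type") not in VALID_SCALE_SET_TYPES:
        errors.append(f"{name}: scale_set_type must be exactly 'dind' or 'k8s'")
    for key, pattern in QUANTITY_CHECKS.items():
        value = spec.get(key)
        # A bare YAML number is parsed as int/float, not str, and is rejected here.
        if not isinstance(value, str) or not pattern.match(value):
            errors.append(f"{name}: {key}={value!r} is not a valid quantity string")
    return errors
```

Returning a list of messages instead of raising on the first problem lets a tenant fix every mistake in one pass.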

Security at a glance — the controls, in one place: GitHub App keys in SSM Parameter Store; webhook HMAC validation; per-tenant runner groups + repository selection; short-lived AWS creds via runner-role assumption (no static keys in workflows); sts:AssumeRole + sts:TagSession; KMS-encrypted S3 job logs; Teleport for audited break-glass access; Falco runtime monitoring (internal deployment); rootless DinD on per-tenant nodes; CIS-aligned hardened base images.
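Of these controls, webhook HMAC validation is the simplest to show in a few lines. GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below is a generic verifier under that scheme, not Forge's Lambda code; the function name is hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def verify_github_signature(secret: bytes, payload: bytes, signature_header: Optional[str]) -> bool:
    """Check a webhook body against its X-Hub-Signature-256 header value."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, signature_header)
```

The payload must be the raw bytes as received. Re-serializing parsed JSON before hashing changes the bytes and makes valid deliveries fail.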

And one honest definition: "near-zero ops" is not zero support. It does not mean nobody operates Forge. It means recurring manual work has been converted into automation, dashboards, scheduled validators, redrive loops, and self-service config — so the human job shifts from "SSH into a random runner and guess" to "look at the subsystem that owns the symptom, and when a pattern repeats, improve the detector or the automation." The support load — capacity, IAM trust, Teleport/onboarding, ARC/DinD issues, image updates, tenant guidance — is real. It's bounded and routed, not eliminated.

13. The pattern

Here is the entire argument on one page:

Decision	What it bought	What it cost
Calico overlay CNI (§5)	Broke the IP ceiling	A second networking layer to operate
Per-tenant DinD node pools (§6)	Blast-radius isolation for Docker builds	Money — worse bin-packing & more Karpenter objects
Trust-validator (§7)	Proactive trust checks	A small distributed system to run
Blue/green clusters (§8)	Safe, reversible upgrades	A per-tenant runner gap on migration
Directory-as-config (§9)	Onboarding in minutes	Terragrunt complexity & footguns
Self-healing tissue (§10)	Resilience without babysitting	More moving parts
Automated deps + dogfooding (§11)	Stay current without manual toil	A build/test pipeline to maintain

A credibility note before the numbers: everything about the design in this article is in the open-source repo and you can read the modules yourself; the scale figures that follow — ~40 teams, ~10k jobs/day, near-zero ops — come from internal production experience, not from the public code. I've tried to keep that line visible throughout.

Put together, this runs around 40 tenant teams and ~10,000 CI jobs a day, across multiple AWS regions and both execution lanes, grown organically from 3 teams to 40+ — operated with near-zero ops, by design. Read that last phrase carefully, because it's the most misunderstood claim in the whole story. It does not mean a heroically overworked person is quietly holding it together, and it does not mean an under-resourced org is getting lucky. It means low operational cost is the output of the stack of trade-offs above: ephemeral runners that can't drift, infrastructure that's entirely code, a boundary-validator that catches tenant mistakes before they page anyone, immutable clusters that make upgrades boring, configuration that makes onboarding a directory entry, and observability that turns operations into a loop. Low ops is what rigor looks like from the outside.

And that's the takeaway worth carrying even if you never run a GitHub Actions runner in your life:

Great platforms are not a lucky idea or a magic tool. They are a pattern of many correct decisions, made across many different problem domains — networking, isolation, identity, lifecycle, configuration, operations, and supply chain — each chosen with clear eyes about its cost, and composed so the whole holds together. The craft is not picking the single best technology for one problem. It's making a dozen good trade-offs in a dozen domains and having them add up. Anyone can copy one of these decisions. The engineering is in the composition.

So the next time you design something, don't ask "what's the best tech here?" Ask, for each choice: what does this buy me, what does it cost me, and is that the trade I want at my scale? Then write the answer down. That document — the one you just read — is as much the deliverable as the code, because it's the thing that turns a magic box back into an understandable machine.

When Forge is overkill

One last decision, and it's the one that keeps the rest honest: knowing when not to build this. If you're a small team with a handful of repos, Forge is too much. Start with the basics — ephemeral runners (the upstream EC2 module or ARC on its own), GitHub Actions, and a bit of Terraform — and only reach for tenancy, per-tenant isolation, trust validation, blue/green clusters, and a dogfooded image pipeline when you actually feel the pain they remove. Every trade-off in this article buys something real, but each also costs real complexity, and at small scale you'd be paying the cost without needing the benefit. Forge earns its keep in multi-team environments where governance, isolation, and platform automation genuinely matter. The same judgment that makes the dozen decisions good is the judgment that says: at three repos, don't make any of them. Maturity isn't building the biggest system — it's building the right-sized one, and being able to say which is which.

Read the code

This article keeps saying "read the modules" — so here's the map. Everything below is in the public repo:

Topic	Where to read
Tenant orchestration (umbrella)	`modules/platform/forge_runners`
EC2 runners	`modules/platform/ec2_deployment`
ARC / Kubernetes runners	`modules/platform/arc_deployment`, `modules/core/arc`
Calico + Karpenter on EKS	`modules/infra/eks`
Trust-validator (§7)	`modules/platform/forge_runners/forge_trust_validator`
Job-log archival (§10)	`modules/platform/forge_runners/github_actions_job_logs`
Global lock (§10)	`modules/platform/forge_runners/github_global_lock`
Runner groups + repo registration (§4)	`modules/platform/forge_runners/github_app_runner_group`
Splunk dashboards / config (§10)	`modules/integrations/splunk_cloud_conf_shared`, `modules/integrations/splunk_o11y_conf_shared`

Forge is open source: github.com/cisco-open/forge (Apache-2.0).

Docs: cisco-open.github.io/forge.

Read the modules — the trade-offs are all in there if you know where to look.

ForgeMT: GitHub Actions at Scale with Security and Multi-Tenancy on AWS

Ederson Brilhante — Tue, 12 Aug 2025 14:51:47 +0000

Introduction

GitHub Actions is the go-to CI/CD tool for many teams. But when your organization runs thousands of pipelines daily, the default setup breaks down. You hit limits on scale, security, and governance — plus skyrocketing costs.

GitHub-hosted runners are easy but expensive and don’t meet strict compliance needs. Existing self-hosted solutions like Actions Runner Controller (ARC) or Terraform EC2 modules don’t fully solve multi-tenant isolation, automation, or centralized control.

ForgeMT, built inside Cisco’s Security Business Group, fills that gap. It’s an open-source AWS-native platform that manages ephemeral runners with strong tenant isolation, full automation, and enterprise-grade governance.

This article explains why ForgeMT matters and how it works — providing a practical look at building scalable, secure GitHub Actions runner platforms.

Why Enterprise CI/CD Runners Fail at Scale

At large organizations, scaling GitHub Actions runners encounters four key bottlenecks:

Fragmented Infrastructure:
Teams independently choose their CI/CD tools (Jenkins, Travis, CircleCI, or self-hosted runners), which accelerates local delivery but creates duplicated effort, configuration drift, and fragmented monitoring. Without a unified platform, scalability, security, and reliability degrade.

Weak Tenant Isolation:
Runners run untrusted code across teams. Without strong isolation, one compromised job can leak credentials or escalate attacks across tenants. Poor audit trails slow breach detection and hinder compliance.

Scalability Limits:
Static IP pools cause IPv4 exhaustion, and manual provisioning delays runner startup. Without elastic scaling, resources are wasted or pipelines queue up, killing developer velocity.

Maintenance and Governance Overhead:
Uneven patching weakens security, infrastructure drift complicates troubleshooting, and audits become expensive and error-prone. Secure scaling demands centralized governance, consistent policy enforcement, and automation.

In short, enterprises fail to scale GitHub Actions runners without a platform that:

Centralizes multi-tenancy
Automates lifecycle management
Provides enterprise-grade observability and governance

But beware—over-centralization can kill flexibility and introduce new challenges.

Why GitHub Actions — And Why It’s Not Enough at Enterprise Scale

GitHub Actions is popular because it offers:

Deep GitHub integration: triggers on PRs, branches, and tags with no extra logins, plus automatic secret and artifact handling.
Extensible ecosystem: thousands of marketplace actions simplify workflow creation.
Flexible runners: GitHub-hosted runners for convenience, or self-hosted for control, cost savings, and compliance.
Granular security: native GitHub Apps, OIDC tokens, and fine-grained permissions enforce least privilege.
Rapid scale: pipelines at repo or org level enable smooth CI/CD growth.

However, GitHub Actions alone can’t meet enterprise-scale demands. Enterprises require:

Strong tenant isolation and centralized governance across thousands of pipelines.
A unified platform to avoid fragmented infrastructure and scaling bottlenecks.
Fine-grained identity, network controls, and compliance enforcement.
Automation for onboarding, patching, and auditing to reduce operational overhead.

Cloud providers like AWS supply the identity, networking, and automation building blocks (IAM/OIDC, VPC segmentation, EC2, EKS) needed to build secure, scalable, multi-tenant CI/CD platforms.

Existing Solutions and Why They Fall Short

Actions Runner Controller (ARC) runs ephemeral Kubernetes pods as GitHub runners, scaling dynamically with declarative config and Kubernetes-native integration. But:

Kubernetes namespaces alone don’t provide strong security isolation.
No native AWS IAM/OIDC integration.
Lacks onboarding, governance, and audit automation.
Network policy management is manual, increasing operational overhead.

Terraform AWS GitHub Runner Module provisions EC2 self-hosted runners with customizable AMIs, integrating well with IaC pipelines. However:

Typically deployed per team, causing fragmentation.
No native multi-tenant isolation.
Requires manual IAM and account setup.
No onboarding or patching automation.

Commercial Runner-as-a-Service options offer simple UX, automatic scaling, and vendor-managed maintenance with SLAs, but:

High costs at scale.
Vendor lock-in risks.
Limited multi-tenant isolation.
Often don’t meet strict compliance requirements.

Where ForgeMT Fits In

ForgeMT combines the best of these approaches to deliver an enterprise-ready platform:

Orchestrates ephemeral runners seamlessly.
Uses AWS-native identity and network isolation (IAM/OIDC).
Builds in governance with full lifecycle automation.
Targets large, security-focused organizations.

ForgeMT doesn’t reinvent ARC or EC2 modules but extends them with:

Strict multi-tenant isolation: Each team runs in a separate AWS account to contain blast radius. IAM/OIDC enforces least privilege. Calico CNI manages Kubernetes network segmentation.
Full automation: Tenant onboarding, runner patching, centralized monitoring, and drift remediation happen automatically, cutting manual toil and errors.
Centralized control plane: One dashboard securely manages all tenants with governance, audit logs, and compliance-ready traceability.
Cost optimization: Spot instances, warm pools, and autoscaling based on real-time metrics and spot prices reduce costs without sacrificing availability.
Open-source transparency: 100% open source—no vendor lock-in, no license fees, full customization freedom.

Architecture Overview

At its core, ForgeMT is a centralized control plane that orchestrates ephemeral runner provisioning and lifecycle management across multiple tenants running on both EC2 and Kubernetes.

Key Components

Terraform module for EC2 runners — provisions ephemeral EC2 runners with autoscaling, spot/on-demand, and ephemeral lifecycle.
Actions Runner Controller (ARC) — manages EKS-based runners as Kubernetes pods with tenant namespace isolation.
OpenTofu + Terragrunt — Infrastructure as Code managing tenant/account/region deployments declaratively.
IAM Trust Policies — secure runner access with ephemeral credentials via role assumption.
Splunk & Observability — centralized logs and metrics per tenant.
Teleport — secure SSH access to ephemeral runners for auditing and debugging.
EKS + Calico CNI — scalable pod networking with strong tenant segmentation and minimal IP usage.
EKS + Karpenter — demand-driven node autoscaling with spot and on-demand instances, plus warm pools.

ForgeMT Control Plane

The control plane is the platform’s brain — managing runner provisioning, lifecycle, security, scaling, and observability.

Centralized Orchestration: Decides when and where to spin up ephemeral runners (EC2 or Kubernetes pods).
Multi-Tenant Isolation: Isolates each tenant via dedicated AWS accounts or Kubernetes namespaces, IAM roles, and network policies.
Security Enforcement: Applies hardened runner configurations, automates ephemeral credential rotation, and enforces least privilege.
Scaling & Optimization: Integrates with Karpenter and EC2 autoscaling to scale runners up/down with demand and cost awareness.
Observability & Governance: Streams logs and metrics to Splunk; provides audit trails and compliance dashboards.

Runner Types and Usage

Tenant Isolation

Each ForgeMT deployment is single-tenant and region-specific. IAM roles, policies, VPCs, and services are scoped exclusively to that tenant-region pair. This hard boundary prevents cross-tenant access, simplifies compliance, and minimizes blast radius.

EC2 Runners

Ephemeral VMs booted from Forge-provided or tenant-custom AMIs.
Jobs run directly on VMs or inside containers.
IAM role assumption replaces static credentials.
Terminated after each job to avoid drift or leaks.

EKS Runners

Managed by ARC as Kubernetes pods in tenant namespaces.
Images pulled from Forge or tenant ECR repositories.
Scales dynamically for burst workloads.

Warm Pools and Limits

ForgeMT supports warm pools of pre-initialized runners to minimize cold start latency—especially beneficial for EC2 runners with slower boot times.

Per-tenant limits enforce:

Max concurrent runners
Warm pool size
Runner lifetime (auto-termination after jobs)

These controls prevent resource abuse and keep costs predictable.
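How such limits interact can be sketched as a single scale-up decision. This is an illustrative model only: the `TenantLimits` type, the function name, and the rule that idle warm runners absorb queued jobs before anything new is launched are assumptions, not ForgeMT's actual scaling logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantLimits:
    max_runners: int      # cap on concurrent runners (busy + idle)
    warm_pool_size: int   # idle runners to keep ready


def runners_to_launch(limits: TenantLimits, active: int, idle: int, queued_jobs: int) -> int:
    """Number of new runners to start, never exceeding the tenant cap.

    `active` counts every existing runner, `idle` is the warm subset of those.
    """
    # Idle warm runners take queued jobs first.
    uncovered_jobs = max(queued_jobs - idle, 0)
    idle_after_assignment = max(idle - queued_jobs, 0)
    # Then top the warm pool back up to its configured size.
    warm_refill = max(limits.warm_pool_size - idle_after_assignment, 0)
    headroom = max(limits.max_runners - active, 0)
    return min(uncovered_jobs + warm_refill, headroom)
```

With a cap of 10 and a warm pool of 2, a tenant that already has 9 runners and 5 queued jobs gets exactly one more runner. The remaining jobs wait, which is the predictable-cost behavior the limits are meant to guarantee.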

Tenant Onboarding

Deploying a new tenant is straightforward and fully automated via a single declarative config file, for example:

gh_config:
  ghes_url: ''
  ghes_org: cisco-open
tenant:
  iam_roles_to_assume:
    - arn:aws:iam::123456789012:role/role_for_forge_runners
  ecr_registries:
    - 123456789012.dkr.ecr.eu-west-1.amazonaws.com
ec2_runner_specs:
  small:
    ami_name: forge-gh-runner-v*
    ami_owner: '123456789012'
    ami_kms_key_arn: ''
    max_instances: 1
    instance_types:
      - t2.small
      - t2.medium
      - t2.large
      - t3.small
      - t3.medium
      - t3.large
    pool_config: []
    volume:
      size: 200
      iops: 3000
      throughput: 125
      type: gp3
  large:
    ami_name: forge-gh-runner-v*
    ami_owner: '123456789012'
    ami_kms_key_arn: ''
    max_instances: 1
    instance_types:
      - c6i.8xlarge
      - c5.9xlarge
      - c5.12xlarge
      - c6i.12xlarge
      - c6i.16xlarge
    pool_config: []
    volume:
      size: 200
      iops: 3000
      throughput: 125
      type: gp3
arc_runner_specs:
  dind:
    runner_size:
      max_runners: 100
      min_runners: 1
    scale_set_name: dependabot
    scale_set_type: dind
    container_actions_runner: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/actions-runner:latest
    container_requests_cpu: 500m
    container_requests_memory: 1Gi
    container_limits_cpu: '1'
    container_limits_memory: 2Gi
    volume_requests_storage_type: gp2
    volume_requests_storage_size: 10Gi
  k8s:
    runner_size:
      max_runners: 100
      min_runners: 1
    scale_set_name: k8s
    scale_set_type: k8s
    container_actions_runner: 123456789012.dkr.ecr.eu-west-1.amazonaws.com/actions-runner:latest
    container_requests_cpu: 500m
    container_requests_memory: 1Gi
    container_limits_cpu: '1'
    container_limits_memory: 2Gi
    volume_requests_storage_type: gp2
    volume_requests_storage_size: 10Gi

The ForgeMT platform uses this config to:

Provision tenant-specific AWS accounts and resources.
Set IAM roles with least privilege trust policies.
Configure GitHub integration and runner specs.
Enforce tenant limits and runner types.

This automation enables zero-touch onboarding with no manual AWS or GitHub setup required by the tenant.

Extensibility

ForgeMT lets tenants customize their environments and control runner access:

Custom AMIs for EC2 runners with tenant-specific tooling.
Private ECR repositories to host container images for VMs or Kubernetes.
Tenant IAM roles with trust policies so ForgeMT runners assume them securely without static keys.
Advanced access patterns like chained role assumptions or resource-based policies for complex needs.

This lets each team tune cost, security, and performance independently without affecting core platform stability.

Security Model

ForgeMT’s foundation is strong isolation and ephemeral execution to reduce risk:

Dedicated IAM roles, namespaces, and AWS accounts per tenant.
No cross-tenant visibility or access.
Ephemeral runners destroyed immediately after job completion to prevent credential or data leakage.
Temporary credentials via IAM role assumption replace static AWS keys.
Fine-grained access control configurable by tenants for resource permissions.
Full audit trail of provisioning, execution, and shutdown logged via CloudWatch → Splunk.
Meets CIS Benchmarks and internal security policies.

Debugging in a Secure, Ephemeral World

Ephemeral runners mean persistent debugging isn’t possible by design, but ForgeMT offers:

Live debugging with Teleport: Keep runners alive temporarily via workflow tweaks to enable SSH into running jobs.
Reproducible reruns: Failed jobs can be rerun identically from GitHub UI.
Log-based troubleshooting: Access runner telemetry, syslogs, and job logs centrally without infrastructure exposure.
Kubernetes support: Same debugging mechanisms apply to EKS runners, preserving isolation and auditability.

Conclusion

ForgeMT is likely overkill for small teams. Start simple with ephemeral runners (EC2 or ARC), GitHub Actions, and Terraform automation. Only scale up when you hit real pain points. ForgeMT shines in multi-team environments where tenant isolation, governance, and platform automation are mission-critical. For solo teams, it just adds unnecessary complexity.

ForgeMT addresses the major enterprise challenges of running GitHub Actions runners at scale by delivering:

Strong multi-tenant isolation
Fully automated lifecycle management and governance
Flexible runner types with cost-aware autoscaling and warm pools
Secure, ephemeral environments that meet compliance needs
An open-source, extensible platform for customization

For organizations struggling to scale self-hosted runners securely and efficiently on AWS, ForgeMT provides a battle-tested, transparent platform that combines AWS best practices with developer-friendly automation.

Dive Into the ForgeMT Project

Ideas are cheap — execution is what counts. ForgeMT’s source code is public — check it out:

👉 https://github.com/cisco-open/forge/

⭐️ If you find it useful, don’t forget to drop a star!


Learn how ForgeMT simplifies multi-tenant GitHub Actions runners with security, scalability, and automation. Read the full case study to see how it can streamline your CI/CD pipelines:

Ederson Brilhante — Sat, 17 May 2025 10:14:18 +0000


ForgeMT: A Scalable, Secure Multi-Tenant GitHub Runner Platform at Cisco


ForgeMT: A Scalable, Secure Multi-Tenant GitHub Runner Platform at Cisco

Ederson Brilhante — Fri, 16 May 2025 08:55:41 +0000

🧭 Why ForgeMT Exists

ForgeMT is a centralized platform that enables engineering teams to run GitHub Actions securely and efficiently — without building or managing their own CI infrastructure.

It provides ephemeral runners (EC2 or Kubernetes), strict tenant isolation, and full automation behind a hardened, shared control plane.

Before ForgeMT, every team in Cisco’s Security Business Group had to build and maintain their own CI setup — leading to duplicated effort, inconsistent security, slow onboarding, and rising operational overhead.

ForgeMT replaced this fragmented approach with a secure, scalable, multi-tenant platform — saving time, reducing risk, and accelerating adoption.

⚡ Fast Facts (ForgeMT Impact)

⏱️ 80+ engineering hours saved/month per team
📦 40,000+ GitHub Actions jobs/month
✅ 99.9% success rate across tenants

This post explains:

🚀 Why ForgeMT was needed
💼 What impact it had: from reliability to cost savings and security compliance
🧱 How it works: a deep dive into the architecture

👉 Jump to what matters most:

💼 Business Impact – For leadership and stakeholders
🏗️ Architecture – For platform engineers and DevOps
🧠 Or keep reading for full technical context and background

🚨 From Fragmented CI to Scalable, Secure Solutions: The Journey to ForgeMT

Credit: Prototype by Matthew Giassa, MASc, EIT, who championed the Philips Labs GitHub Runner module across multiple teams.

Before ForgeMT, each team used its own CI stack—Jenkins, Travis, or Concourse. While these tools met local needs, they created long-term issues: inconsistent patching, security gaps, and poor scalability.

Matthew built a promising PoC, but it was a siloed setup with manual AWS, GitHub, and Terraform steps. Rigid subnetting caused IPv4 exhaustion, and teams copy-pasting Terraform modules led to high maintenance overhead and config drift.

To address this complexity, I drove the end‑to‑end technical design and implementation of ForgeMT—a centralized, multi‑tenant GitHub Actions runner service on AWS—while coordinating with infrastructure, security, and platform stakeholders to ensure a smooth production launch.

At scale, teams were running thousands of Actions jobs across dozens of isolated environments—each with its own patch cadence, network quirks, and IAM policies. ForgeMT unifies these into a single control plane, delivering consistent security, predictable performance, and dramatically simplified operations.

For detailed business impact metrics (time saved, reliability gains, cost optimization), see the Business Impact section.

ForgeMT builds on proven ephemeral EC2 and EKS/ARC runner modules, adding:

IAM/OIDC-based tenant isolation
Built-in observability (metrics, logs, dashboards)
Automation for patching, Terraform drift, repo onboarding, and global Actions locks

By consolidating infrastructure into a hardened control plane, ForgeMT ensured that security and compliance were at the forefront while enabling rapid onboarding, eliminating manual patching, and solving IPv4 exhaustion. This was achieved by scaling pod-based runners via EKS + Calico CNI, with a strong focus on tenant isolation, IAM roles, and security groups (SG) to control access. The hardened control plane preserved the security and flexibility of the original prototype, delivering a secure, compliant, and scalable platform.

📊 Business Impact

ForgeMT has met the demands of multiple teams and scaled securely within Cisco, optimizing cloud spend and increasing reliability for all stakeholders:

Dramatic time savings (80+ hours/month per team): By automating every aspect of the runner lifecycle (OS patching, Terraform module updates, ephemeral provisioning, and even repository registration), ForgeMT freed teams from manual CI maintenance so they could refocus on shipping features.
Optimized cloud spend: Spot and On-demand Instances, right‑sized instance selection per job type, and EKS + Calico’s IP‑efficient networking cut infrastructure costs without slowing builds. ForgeMT also supports warm instance pools for high-frequency jobs, avoiding cold starts when speed is critical—striking a smart balance between performance and cost.
Rock‑solid reliability (99.9% success over 40K+ jobs/month): Centralizing infrastructure eliminated snowflake environments and drift, reducing job failures caused by misconfiguration or stale runners to near zero.
Enterprise‑grade security & compliance: IAM/OIDC per‑tenant isolation, CIS‑benchmarked AMIs, and end‑to‑end logging into Splunk ensured every action was auditable, vault‑grade credentials were never exposed, and internal audits passed with zero findings.
True multi‑tenancy at scale: Teams retain autonomy over AMIs, ECRs, and workflow definitions while ForgeMT transparently handles networking, isolation, and autoscaling—supporting dozens of teams without additional IP consumption or operational overhead.
AWS account isolation per tenant: Each tenant can have one or more individual AWS accounts, with full control over their own network setup. This includes the flexibility to configure internal or public subnets within their AWS accounts, ensuring strong security boundaries and independent resource management without ForgeMT managing their network.

Together, these outcomes turned a fractured, high‑toil CI landscape into a self‑service platform that scales securely, reduces costs, and accelerates delivery.

⚙️ ForgeMT Architecture Overview

📦 Core Components & Technical Foundations

These results are enabled by the following technical components:

Terraform module for EC2 runners: Provisions ephemeral EC2-based GitHub Actions runners, with auto-scaling and cost optimization across AWS spot and on-demand instances. Runners are created on demand and terminated after use, in line with the ephemeral nature of ForgeMT's infrastructure.
ARC (Actions Runner Controller): Employed to manage EKS-based GitHub Actions runners, enabling containerized, isolated job execution via Kubernetes. This approach leverages Kubernetes' orchestration capabilities for efficient scaling and management of CI/CD workloads.
OpenTofu + Terragrunt: Implemented for Infrastructure as Code (IaC), ensuring region-, account-, and tenant-specific infrastructure deployments with DRY (Don't Repeat Yourself) principles. This methodology facilitates consistent and repeatable infrastructure provisioning across multiple environments.
IAM Trust Policies: Adopted to secure runner access using short-lived credentials via IAM roles and trust relationships, eliminating the need for static credentials and enhancing security.
Splunk Cloud & O11y (Observability): Integrated for centralized logging and metrics aggregation, providing real-time observability across ForgeMT components. This setup enables detailed telemetry, including per-tenant dashboards for monitoring resource usage and optimization insights.
Teleport: Utilized to provide secure, auditable SSH access to EC2 runners and Kubernetes pods, enhancing compliance, access control, and auditing capabilities.
EKS + Calico CNI: Leveraged to scale pod provisioning without consuming additional VPC IPs, utilizing Calico's efficient networking. This setup ensures tenant isolation and optimizes network resource usage within limited VPC subnets.
EKS + Karpenter: Enables dynamic, demand-driven autoscaling of Kubernetes worker nodes. Automatically provisions the most suitable and cost-effective EC2 instance types based on real-time pod requirements. Supports spot and on-demand capacity, prioritizing efficiency and performance. Warm pools can be configured to reduce cold start latency while maintaining cost control—ideal for high-churn CI/CD workloads.

These technologies form the backbone of ForgeMT, enabling its robust performance and scalability.

🧠 ForgeMT Control Plane (Managed by Forge Team)

The ForgeMT control plane hosts shared infrastructure and reusable IaC modules:

ForgeMT GitHub App: Installed on tenant repositories to listen for GitHub workflow events and dynamically register ephemeral runners.
ForgeMT AMIs & Forge ECR: Default base images for runners (VMs and containers).
Terraform Modules: Each tenant-region pair deploys an isolated ForgeMT instance.
API Gateway + Lambda: Processes GitHub webhook jobs to trigger runner provisioning.
Centralized Logging: Runner logs are forwarded to CloudWatch, then into Splunk Cloud Platform.
Centralized Observability: All AWS metrics are sent to Splunk O11y Cloud.
Teleport: Secure, role-based SSH access to VM runners (if needed), with session logging.

🏗️ Tenant Isolation

Each ForgeMT deployment is dedicated to a single tenant, ensuring full isolation within a specific AWS region. This approach guarantees that IAM roles, policies, services, and AWS resources are scoped uniquely for each tenant-region pair, enforcing strict security, compliance, and minimizing the blast radius.

💻 Runner Types

🧱 AWS EC2-Based Runners (VM and Metal)

Ephemeral Runner Provisioning: EC2 runners are provisioned using Forge-provided AMIs or tenant-specific custom AMIs. These instances are pre-configured with the necessary tools to execute CI/CD jobs.
Workload Execution: Jobs can be executed directly on the EC2 instance or via containers, using container: blocks in GitHub workflows.
Security: Authentication to tenant AWS resources is handled through IAM roles and trust policies, eliminating the need for static credentials and ensuring dynamic, secure access control.
Ephemeral Nature: Once a job is completed, the EC2 instance is terminated, maintaining a completely stateless environment.

☸️ EKS-Based Runners (Kubernetes)

Kubernetes-Orchestrated Actions: Using the Actions Runner Controller (ARC), EKS runners are provisioned as pods within an Amazon EKS cluster.
Resource Isolation: Each tenant is assigned a dedicated namespace, service account, and IAM role, ensuring strict isolation of resources and permissions.
Container Images: Runners can pull container images from either the Forge ECR or the tenant’s own ECR, depending on the configuration.
Scalability: EKS is ideal for high-scale operations, leveraging Kubernetes' orchestration capabilities to manage the lifecycle of runners efficiently.

🔁 Warm Pool

Reducing Startup Latency: An optional warm pool can be configured for both EC2 and EKS runners, pre-initializing instances or pods to reduce waiting times during high demand.
Importance for EKS: For EKS runners, the need for warm pools is significantly reduced, as Kubernetes already provides rapid scaling and efficient pod initialization.
Usage in EC2: The warm pool helps minimize the initialization time for EC2 instances, resulting in faster job execution times for critical tasks.

💻 Examples of Runner Types in ForgeMT

ForgeMT offers flexibility for tenants to configure multiple runner types simultaneously, adapting to their workload needs. Each tenant can define as many runners as needed, with a parallelism limit set per tenant and runner type. Here are some typical runner examples and their use cases:

🧱 EC2 Runners

Small: Lightweight instances for tasks with minimal resource usage, such as quick tests or linting.
Standard: Instances for balanced workloads, ideal for code compilation or integration tests.
Large: High-performance instances for tasks requiring more processing power, such as complex builds or load tests.
Bare Metal: Bare-metal instances for applications that need full control over the hardware, such as simulations or intensive processing tasks.

☸️ Kubernetes Runners

Dependabot: Used for automated dependency update jobs.
Light (k8s): Runners for simple tasks that don't require Docker, like linting or unit test execution.
Docker-in-Docker (DinD): Used for jobs that require Docker inside Kubernetes, such as image building or integration tests involving containers.

🔄 Configurable Parallelism per Tenant and Runner Type

Each tenant can configure their own set of runners and use different EC2 instance types or Kubernetes pods simultaneously.
The parallelism limit can be configured per runner type and tenant, ensuring that running multiple jobs does not overload resources.
This allows each team to run jobs in parallel based on their needs without impacting the performance of other tenants or jobs.
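
The per-tenant, per-runner-type limit described above can be pictured as a small admission check. The following is a minimal sketch, not ForgeMT's actual implementation: the `ParallelismLimiter` class, the shape of the `limits` mapping, and the tenant names are all hypothetical and exist only to show how a cap on one tenant leaves another tenant untouched.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ParallelismLimiter:
    """Tracks running jobs per (tenant, runner type) against configured caps."""

    # Hypothetical config shape: {(tenant, runner_type): max_parallel_jobs}
    limits: dict[tuple[str, str], int]
    running: Counter = field(default_factory=Counter)

    def try_start(self, tenant: str, runner_type: str) -> bool:
        """Admit a job only if its tenant/type pair is below its cap."""
        key = (tenant, runner_type)
        limit = self.limits.get(key, 0)  # unknown pairs get no capacity
        if self.running[key] >= limit:
            return False
        self.running[key] += 1
        return True

    def finish(self, tenant: str, runner_type: str) -> None:
        """Release a slot once the ephemeral runner has been destroyed."""
        key = (tenant, runner_type)
        if self.running[key] > 0:
            self.running[key] -= 1


limiter = ParallelismLimiter(limits={("team-a", "dind"): 2, ("team-b", "dind"): 1})
results = [
    limiter.try_start("team-a", "dind"),
    limiter.try_start("team-a", "dind"),
    limiter.try_start("team-a", "dind"),  # third job exceeds team-a's cap
    limiter.try_start("team-b", "dind"),  # team-b is unaffected by team-a
]
print(results)  # [True, True, False, True]
```

Keying the counter on the (tenant, runner type) pair is what keeps one team's burst from consuming another team's capacity.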

Considerations

Choosing the Right Runner: Depending on workload complexity and job requirements, you may choose EC2 or EKS runners. EKS is generally preferred for lightweight, scalable workloads, while EC2 may be necessary for jobs with specific hardware or memory requirements.

⚙️ GitHub Integration

GitHub events trigger ForgeMT through a webhook via API Gateway, dynamically registering runners into the appropriate GitHub Runner Groups associated with the tenant. The runner lifecycle is designed to be ephemeral: runners are registered just-in-time for job execution and are destroyed once the job is completed. When a new repository is installed, it is automatically registered with the correct GitHub Runner Group, ensuring seamless integration with the right tenant's runners.
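
As an illustration of the webhook step, here is a minimal sketch of what a handler behind API Gateway might do before provisioning a runner. It assumes the handler validates GitHub's `X-Hub-Signature-256` header (an HMAC-SHA256 of the raw body) and reacts only to `workflow_job` events with the `queued` action. The function names and the example secret are hypothetical, and the source does not describe ForgeMT's Lambda code at this level of detail.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header against the raw request body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information during the comparison
    return hmac.compare_digest(expected, signature_header)


def should_provision(body: bytes) -> bool:
    """Return True only for workflow_job payloads whose action is 'queued'."""
    payload = json.loads(body)
    return payload.get("action") == "queued" and "workflow_job" in payload


# Hypothetical secret for illustration; a real one would come from a secret store.
secret = b"example-webhook-secret"
body = json.dumps(
    {"action": "queued", "workflow_job": {"labels": ["self-hosted", "dind"]}}
).encode()
header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

print(verify_signature(secret, body, header), should_provision(body))  # True True
```

Rejecting unsigned or non-queued deliveries early means only genuine job requests ever reach the provisioning path.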

🔌 Extensibility

Each tenant account can optionally manage the following resources:

Tenant AMIs (for AWS EC2 runners): Custom-built images with pre-installed tooling tailored to the tenant's specific requirements.
Tenant ECR: Houses custom container images used for VM-based container jobs, GitHub composite actions, or full pod images in EKS.
Tenant IAM Role: Configured with trust relationships to allow ForgeMT runners to securely assume roles without the need for AWS access keys.

ForgeMT offers flexibility for teams to customize their runners according to their specific needs. If a tenant requires a custom Amazon Machine Image (AMI) or container image, it is their responsibility to build and maintain it. We provide a base image to get them started, but the final configuration is under their control. Once the custom image is ready, it can be shared with our accounts and integrated into the ForgeMT platform, enabling the team to meet their unique requirements.

🔄 Optional Configurations

Tenants can choose to configure the following based on their specific needs:

Accessing AWS Resources via Runners: To enable runners to interact with AWS services within the tenant's account, an IAM role must be established with a trust relationship permitting ForgeMT to assume it.
Pulling Images from Tenant ECR: If runners need to pull images from the tenant's ECR—be it for container jobs, composite actions, or Kubernetes pods—the tenant must configure appropriate repository policies and IAM permissions to allow these operations.
Accessing Additional Tenant Resources: For runners to access other AWS resources within the tenant's account, the IAM role assumed by ForgeMT must have policies granting the necessary permissions. This might involve setting up a chain of role assumptions or defining specific resource-based policies.
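
To make the trust relationship concrete, the sketch below builds the kind of trust policy document a tenant could attach to its IAM role so that a single runner role is allowed to assume it. The account ID and role name are hypothetical placeholders, and the exact principal ForgeMT uses is an assumption here. Only the overall shape of an IAM trust policy is standard.

```python
import json


def tenant_trust_policy(runner_role_arn: str) -> dict:
    """Build a trust policy that lets one runner role assume the tenant role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Only this principal may assume the tenant role.
                "Principal": {"AWS": runner_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


# Hypothetical account ID and role name, for illustration only.
policy = tenant_trust_policy("arn:aws:iam::111111111111:role/forge-runner-team-a")
print(json.dumps(policy, indent=2))
```

Depending on how the role is assumed, the statement may need further actions (for example `sts:TagSession` when session tags are passed), so treat this as a starting point and not a complete policy.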

📊 Observability: Splunk Cloud & O11y

ForgeMT delivers full-stack observability with centralized logging and per-tenant metrics:

Centralized logging: All relevant logs — syslog, AWS EC2 user data, GitHub runner job logs, worker logs, and agent logs — are sent to CloudWatch Logs and forwarded to Splunk Cloud for full visibility and auditability.
Metrics via Splunk O11y: Captures detailed telemetry.
Per-tenant dashboards: Each team gets dedicated dashboards showing cost breakdowns, resource usage, and optimization insights (e.g., high-memory job detection).

🔐 Security Model

Strong tenant isolation: Every tenant has its own IAM roles, namespaces, and resources.
IAM Role Assumption: Eliminates use of long-lived AWS credentials.
No cross-tenant visibility: Runners cannot access other tenant workloads or secrets.
Fine-grained access control: Each tenant defines what their runners can access by configuring the IAM role being assumed—this can include direct resource access or chained role assumptions for more advanced patterns.
🔒 Ephemeral Isolation: ForgeMT runners are automatically destroyed after every job — success or failure. This guarantees a clean slate every time, eliminates environment drift, blocks credential persistence, and prevents resource leaks by default.

🛡️ Compliance & Observability

ForgeMT ensures strict compliance and security throughout the lifecycle of its ephemeral runners, from provisioning to execution and shutdown.

Full Audit Trail: Every runner lifecycle event — including provisioning, execution, and shutdown — is logged, ensuring complete visibility and traceability for compliance audits. This audit trail is vital for maintaining transparency in high-security environments.
CloudWatch → Splunk Integration: Logs from the runners are forwarded from CloudWatch to Splunk, enabling teams to perform real-time queries on logs. This integration supports compliance audits by providing detailed, queryable logs that can be easily reviewed and accessed for regulatory requirements.
IAM Integration: By using IAM (Identity and Access Management), ForgeMT eliminates the use of hardcoded credentials or AWS long-term access keys. This significantly reduces the risk of unauthorized access and enhances security by enforcing role-based access and temporary credentials that follow the principle of least privilege.
Security Standards Compliance: ForgeMT meets internal security standards, which are aligned with industry best practices such as CIS Benchmarks. This ensures that the platform adheres to rigorous security controls and provides a secure environment for multi-tenant workloads.

🔍 Debugging Securely and Effectively

ForgeMT offers teams the option to choose between EC2 Spot Instances and On-Demand Instances, allowing for flexibility in cost optimization. While Spot Instances can provide significant cost savings, they come with the inherent risk that AWS may reclaim the instance at any time. Teams are responsible for evaluating this risk and determining whether to use Spot or On-Demand Instances based on the criticality of their workloads.

Given ForgeMT's design of ephemeral runners, which are terminated immediately after each job to prevent state persistence and credential leakage, debugging presents unique challenges. However, the platform offers robust solutions to address these challenges.

For real-time debugging, developers can access running jobs via Teleport. By including a sleep step in the workflow or using a custom wrapper, the runner can be kept alive temporarily. This allows for manual inspection and troubleshooting while the job is still running.

Additionally, even without live access, ForgeMT maintains comprehensive observability. Teams can rely on syslogs, GitHub Actions job logs, and runner-level telemetry to understand job behavior. Every job runs in a fully reproducible environment, meaning developers can simply rerun failed jobs through the GitHub UI, replicating the exact conditions without side effects while maintaining full auditability.

For Kubernetes-based runners, the same debugging approach applies: Teleport can be used for live access to running jobs. The integration with Kubernetes allows teams to extend the same debugging capabilities while leveraging the scalability and flexibility of the containerized environment.

🚀 ForgeMT: Powering Tenants with Flexibility and Control

💥 Ephemeral by design — Runners are created per job and disappear afterward. No drift. No patching. No residual garbage.
🛠️ Infra-as-Code from top to bottom — Fully automated. Declarative. Version-controlled. No snowflakes.
🔐 Strong isolation baked in — IAM, OIDC, and security group segmentation per tenant. No cross-tenant blast radius.
📦 Run anything, per tenant — EC2 or EKS. k8s, dind, or metal. Each tenant defines their own mix.
🚦 Control usage at scale — Enforce parallelism limits per tenant/type. No surprises. No abuse.
🕹️ Custom policies, zero effort — Tenants define autoscaling, labels, and configurations via GitHub — no AWS skills required.
🧘 No infra for tenants to manage — No patching, no VPCs, no accounts. Just push code.
🕵️ Observability without ownership — Logs, metrics, and traces exposed per tenant. No nodes to babysit.
⚡ Fast time-to-first-run — Cold starts optimized. Most runners boot in <20s, even for large jobs.
🌎 Network-aware provisioning — Runners automatically deploy into the correct subnet, zone, or region.
📊 Usage-aware scaling — Instance types are selected based on cost/performance tradeoffs — no more overprovisioning by default.
🧩 GitHub-native workflows — No toolchain rewrites required. Just drop in the runs-on labels and go.
🚫 No global queues — Each tenant is scoped, isolated, and throttled independently.

🛠️ Implementation & Adoption

It took about 2 months to evolve from a single-tenant, EC2-only setup into a fully multi-tenant platform. Highlights:

🔹 Kubernetes support with Calico CNI + Karpenter
🔹 Tenant isolation by design
🔹 Per-tenant automation & base images
🔹 EKS pod identity for secure access
🔹 Integrated with Teleport, Splunk, and full observability
🔹 Custom dashboards with enriched telemetry

🚀 Frictionless Adoption

Onboarding was dead simple.

For most tenants, switching to ForgeMT meant updating just the runs-on label in their GitHub Actions workflows — ⚡ No rewrites. No migrations. No downtime.

For teams that required deeper isolation, assuming their own IAM role was just as straightforward:

- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::<tenant-account>:role/<role-name>
    aws-region: <aws-region>
    role-duration-seconds: 900

- name: Example
  run: aws cloudformation list-stacks

💡 This approach made adoption fast, safe, and low-friction — even for teams skeptical of platform changes.

🚫 Overkill Warning: When ForgeMT Is Too Much

If you're a small team, ForgeMT might be overkill. Start with the basics: ephemeral runners (EC2 or ARC), GitHub Actions, and Terraform automation. Scale up only when you hit real pain. ForgeMT shines in multi-team setups where governance, tenant isolation, and platform automation matter. For solo teams, it may just add complexity you don’t need.

🔭 What’s Next

I’m currently focused on:

Cost-aware scheduling — Prioritizing jobs based on real-time pricing and instance efficiency, optimizing for performance while reducing costs.
Dynamic autoscaling — Moving from static warm pool rules to a more responsive, metrics-driven approach that adapts to the bursty nature of GitHub Actions workloads.
Deeper observability — Integrating GitHub metrics for actionable insights that drive optimized runner performance.
AI-driven scaling optimization — Leveraging historical data to predict workload demands, optimize resource allocation, and automate scaling decisions based on both performance and cost metrics.

If you’re tackling similar problems — or looking to adopt, extend, or contribute to ForgeMT — let’s talk. I’m always open to collaborating with engineers building serious DevSecOps infrastructure.

🧪 Dive Into the ForgeMT Project

Ideas are cheap — execution is everything. The ForgeMT source code is now publicly available — check it out:

👉 https://github.com/cisco-open/forge/

⭐️ Don’t forget to give it a star ;)!

✍️ In Short

ForgeMT emerged from real-world CI pain at enterprise scale. What began as a prototype to fix local inefficiencies has grown into a secure, multi-tenant, production-grade runner platform. I’m sharing this so others can skip the trial-and-error and build smarter from the start.

🤝 Connect

Let’s connect on LinkedIn and GitHub.

Always happy to trade notes with like-minded builders.

This article was originally published on LinkedIn.

Decoding the Myth of 'Junior' in DevOps and SRE: Navigating Challenges and Cultivating Expertise

Ederson Brilhante — Tue, 16 Jan 2024 15:38:01 +0000

In my view, assigning roles such as 'Junior DevOps' and 'Junior SRE (Site Reliability Engineer)' seems impractical, reminiscent of labeling someone an 'Entry-Level Software Architect.'

Navigating the intricate landscape

Navigating the intricate landscape of DevOps and SRE demands proficiency in coding, networking, cloud technologies, security, and system administration. Envisioning someone with limited experience adeptly maneuvering through this multifaceted skill set poses a significant challenge.

The Software Architect analogy

Similarly, giving the title "Software Architect" to beginners doesn't align with the intricate demands of the role. Crafting sophisticated software solutions requires years of practical experience, involving intricate system design and understanding. Expecting a junior engineer to architect and implement a secure, scalable microservices architecture without in-depth knowledge and experience in the design principles of distributed systems is unrealistic.

Quantity vs. Experience fallacy

Furthermore, the belief that numerous junior roles collectively can achieve the same level of effectiveness as a seasoned professional echoes the fallacy of favoring quantity over experience. While each junior role contributes to the team's growth, the efficiency and strategic thinking of an experienced architect often outpace the combined efforts of multiple entry-level professionals.

Pressure on companies

In addition, the pressure on companies to leverage the benefits of DevOps and SRE roles within their organization often stems from the growing need for seamless integration between development and operations. Individuals in these positions are expected to possess a profound understanding of both coding and operations, creating a unique blend of skills. Unfortunately, finding professionals who embody this multidisciplinary expertise is a formidable challenge. Those who can seamlessly bridge the gap between traditional sysadmins and developers are not only rare but also come at a premium, given the scarcity of individuals with such comprehensive skills in the overall job market.

Scarcity leading to desperation

This scarcity sometimes leads companies to consider entry-level candidates, hoping to quickly train them to fill the void. However, the complex nature of the disciplines touched upon by DevOps and SRE roles means that becoming proficient in each area takes years of hands-on experience. The high demand and limited supply of individuals with these multifaceted skills contribute to the desperation companies feel in recruiting for these roles.

Acknowledging the shortage

Acknowledging this shortage is crucial, especially as it extends beyond DevOps and SRE roles to other senior positions. Over the past two decades, the industry has witnessed a trend of companies poaching professionals from one another rather than investing in training new talents. This cycle has created a snowball effect, further exacerbating the shortage of skilled individuals.

Solution: Attracting seasoned developers

A potential solution lies in attracting seasoned developers with a penchant for infrastructure and operations to transition into roles in DevOps and SRE. These individuals often bring a wealth of experience, having naturally acquired knowledge in areas beyond coding, such as security, infrastructure, databases, and operations. Their diverse skill set aligns with the demands of contemporary senior developers who are expected to possess expertise beyond language-specific coding skills. By encouraging such transitions, companies can tap into a pool of experienced professionals and mitigate the challenges associated with the scarcity of multidisciplinary talent in the market.

Recommended pathway for aspiring professionals

For aspiring professionals entering the tech industry, a recommended pathway involves starting as a developer before venturing into the multifaceted realms of DevOps and SRE. Beginning as a developer allows individuals to hone their coding skills and gain a solid foundation in software engineering principles. As they accumulate experience and familiarity with the development lifecycle, they can then gradually navigate towards operations, infrastructure, and other related disciplines. This gradual journey not only provides a comprehensive understanding of the intricacies of both coding and operations but also allows individuals to develop a deeper appreciation for the challenges addressed by DevOps and SRE roles. This approach acknowledges the value of hands-on experience and ensures that individuals entering these dynamic fields are well-equipped to contribute meaningfully to the integration of development and operations within an organization.

Combining Packer, QEMU, Ubuntu Cloud Images, and Ansible

Ederson Brilhante — Wed, 14 Jun 2023 20:50:24 +0000

Hello everyone! I want to share a current use case at my company where I have the opportunity to work with Packer, QEMU, Ansible and Ubuntu Cloud Images leveraging the concept of Infrastructure as Code (IaC).

Infrastructure as Code (IaC) is a software engineering practice that enables the management and provisioning of infrastructure resources through code. Instead of manually configuring servers and infrastructure components, IaC allows you to define your desired infrastructure state using declarative or imperative code. It brings automation, version control, and consistency to infrastructure management.

In our case, we utilize Packer, which is a powerful tool falling under the umbrella of IaC. Packer enables the creation of identical machine images for multiple platforms, such as virtual machines, containers, or cloud instances. With Packer, we define the configuration of our desired machine image, including the operating system, software stack, and customizations, all through code. Packer then automates the process of building these machine images, ensuring consistency and reproducibility.

To further enhance our image-building process, we integrate Ansible as the provisioner for Packer. Ansible is an open-source automation tool that enables the configuration and management of systems through simple, human-readable YAML files. With Ansible, we define the desired state of our machine image, including the installation of packages, configuration files, and any other necessary setup steps. Ansible seamlessly integrates with Packer, allowing us to provision our machine image with ease.

In our deployments, we rely on Ubuntu images to meet our diverse cloud computing needs. Ubuntu offers three types of images: live, server, and cloud. Live images provide a fully functional Ubuntu desktop environment that can be run directly from a USB drive or DVD without the need for installation, while server images are optimized for server deployments. However, for our specific use case, we have different requirements depending on the deployment environment. For our deployments in public cloud environments, we leverage the official Ubuntu images provided by the cloud provider, which are tailored and certified for their specific platform. Similarly, in our private on-premises cloud, we utilize the cloud version of Ubuntu images. These cloud images are specifically designed and pre-configured for cloud computing platforms, offering optimized performance and scalability. They enable us to efficiently deploy and manage Ubuntu instances in both our public and private cloud environments.

Now, let's delve into our challenge. Building VM images for the public cloud uses the appropriate Packer plugin for each provider. However, when it came to our on-premises cloud, we encountered an obstacle: our existing process relied on a deprecated plugin that used QEMU as its underlying technology. QEMU is an open-source virtualization tool that lets us run and manage virtual machines in various disk formats, including qcow2. To overcome this hurdle, our aim was to use QEMU through an official, maintained Packer plugin, keeping QEMU in our image-building process while improving efficiency and reliability.

While I had prior experience with Packer, my familiarity with QEMU was limited, especially when it came to using Packer with QEMU. To address this knowledge gap, I referred to the official documentation of Packer. However, I encountered a challenge: the documentation provided an example using a server version of CentOS, which wasn't suitable for my requirements. I needed a cloud version of Ubuntu, which does not come with default user and password credentials. To overcome this hurdle, I created a seed image that included user-data and meta-data. This seed image allows us to "emulate" the cloud-init functionality. By combining this seed image with the Ubuntu image, Packer can establish an SSH connection to the virtual machine successfully.

In the seed image, we create a user with the necessary credentials for the initial build process. It's important to note that the user created in the seed image is only intended for the build phase and is not present in the final image. This approach ensures that the final image does not contain any unnecessary or insecure credentials, maintaining a clean and secure environment.

Here's the packer_qemu.seed.pkr.hcl script, which generates the seed image:

source "file" "user_data" {
  content = <<EOF
#cloud-config
ssh_pwauth: True
users:
  - name: user
    plain_text_passwd: packer
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    lock_passwd: false
EOF
  target  = "user-data"
}

source "file" "meta_data" {
  content = <<EOF
{"instance-id":"packer-worker.tenant-local","local-hostname":"packer-worker"}
EOF
  target  = "meta-data"
}

build {
  sources = ["sources.file.user_data", "sources.file.meta_data"]

  provisioner "shell-local" {
    inline = ["genisoimage -output cidata.iso -input-charset utf-8 -volid cidata -joliet -r user-data meta-data"]
  }
}

In the code above, the genisoimage command-line tool packages the cloud-init configuration. It creates the cidata.iso file, which contains the user-data and meta-data files. These files hold the cloud-init configuration, such as the user credentials and the instance metadata.
The ISO is not a bootable image: it is a small data disk whose volume label, cidata, is what cloud-init's NoCloud data source looks for at boot. Packer attaches it to the virtual machine through the -cdrom entry in qemuargs in the next script.
To learn more about the genisoimage command-line options, refer to the genisoimage documentation.

Here's the code for the packer_qemu.qcow2.pkr.hcl script that uses the seed image and cloud image to build a new image and then runs an Ansible playbook to configure the new image:

packer {
  required_plugins {
    qemu = {
      version = "1.0.9"
      source  = "github.com/hashicorp/qemu"
    }
    ansible = {
      version = "1.0.4"
      source  = "github.com/hashicorp/ansible"
    }
  }
}

source "qemu" "ubuntu" {
  format           = "qcow2"
  disk_image       = true
  disk_size        = "10G"
  headless         = true
  iso_checksum     = "file:https://cloud-images.ubuntu.com/focal/current/SHA256SUMS"
  iso_url          = "https://cloud-images.ubuntu.com/focal/current/focal-server-cloudimg-amd64.img"
  qemuargs         = [["-m", "12G"], ["-smp", "8"], ["-cdrom", "cidata.iso"], ["-serial", "mon:stdio"]]
  shutdown_command = "echo 'packer' | sudo -S shutdown -P now"
  ssh_password     = "packer"
  ssh_username     = "user"
  vm_name          = "build.qcow2"
  output_directory = "output"
}

build {
  sources = ["source.qemu.ubuntu"]

  provisioner "ansible" {
    playbook_file = "ansible/qemu.yml"
  }
}

This code snippet showcases the Packer configuration language to orchestrate the build process. It begins with the packer block, which outlines the essential plugins required for Packer, including qemu and ansible. The source block focuses on configuring the QEMU source, encompassing various settings such as the format, disk size, ISO checksum and URL, QEMU arguments, SSH credentials, and more. Within the build block, the QEMU source is designated for the build process.
Additionally, the provisioner block adds the ansible provisioner, specifying the Ansible playbook (ansible/qemu.yml) to run for further customization of the newly created image. For the full list of arguments accepted by the qemu plugin, refer to the official Packer QEMU plugin documentation.

By combining Packer, QEMU, Ubuntu Cloud Images, and Ansible, we are able to automate the process of building consistent and reproducible machine images for our on-premises Data Center. This streamlined approach saves time, ensures consistency across our environments, and maintains a secure image without unnecessary credentials.

I hope sharing our experience and providing these code snippets will help the community facing similar challenges in the future. Let's continue building and automating together! If you have any questions or suggestions, feel free to reach out.

Building labs using component-based architecture with Terraform and Ansible

Ederson Brilhante — Thu, 14 Apr 2022 15:48:35 +0000

Currently, I am a Site Reliability Engineer(SRE) in the observability team at Splunk. But when I worked in this solution I was part of the GDI(Get Data In) organisation at Splunk.

Now, let's talk about the problem.
Part of the engineer's job in GDI is building add-ons to Splunk. Add-ons, in a nutshell, are plugins to connect third party data sources to Splunk platform.

Every time we need to work on a new add-on version of a specific third party, we need to set up 2 labs, 1 for development purposes, and the other with QA specifications.

The GDI organisation owns many add-ons, so we use a strategy to make rotations in which team and who in the team will work in a new version every time.
This is good to spread the knowledge, but we had problems keeping reliable and consistent labs across the dev cycle and the teams.

A big fraction of people's time was manual work to set up the labs(manual configuration or writing new bash/power-shell scripts). Along with a lot of time expended in the development process, the manual work creates a great deal of headache for the developers

The teams agreed that some automation was needed to reduce the pain of creating labs and to avoid duplication and rework.

We came up with the idea of using infrastructure as code (IaC), which was nothing special: other companies were already doing it.

Because the teams are small and focused on developing add-ons, we needed an approach where teams could have customised labs without having to write IaC scripts.

Based on the design principles of React components, we came up with the idea of creating components that can be reused and plugged into other components. Each component would be a Terraform module, an Ansible playbook, or an Ansible role.

To illustrate, let's use this example: build a lab with 3 different environments:

Environment A will have 4 Windows instances:
- 1 Windows Server 2016 as Domain Controller.
- 3 Windows Servers as member servers:
  - 1 Windows Server 2016 with Windows Event Collector:
    - Splunk Universal Forwarder
    - Collecting only Sysmon events from nodes
  - 1 Windows Server 2016 with Windows Event Forwarding
  - 1 Windows Server 2019 with Windows Event Forwarding
Environment B will have 7 Windows instances:
- 1 Windows Server 2016 as Domain Controller.
- 6 Windows Servers as member servers:
  - 1 Windows Server 2016 with Windows Event Collector (WEC A):
    - Splunk Universal Forwarder
    - Collecting only Application events from nodes
  - 1 Windows Server 2016 with Windows Event Forwarding, sending logs to WEC A
  - 1 Windows Server 2019 with Windows Event Forwarding, sending logs to WEC A
  - 1 Windows Server 2019 with Windows Event Collector (WEC B):
    - Splunk Universal Forwarder
    - Collecting only Security events from nodes
  - 1 Windows Server 2016 with Windows Event Forwarding, sending logs to WEC B
  - 1 Windows Server 2019 with Windows Event Forwarding, sending logs to WEC B
Environment C will have 3 Windows instances:
- 1 Windows Server 2016 as Domain Controller with Windows Event Collector:
  - Splunk Universal Forwarder
  - Collecting only Security and Sysmon events from nodes
- 2 Windows Servers as member servers:
  - 1 Windows Server 2016 with Windows Event Forwarding
  - 1 Windows Server 2019 with Windows Event Forwarding

Normally, using Terraform modules and Ansible playbooks, we could reproduce these environments.
But we would need to create specific playbooks and Terraform configs for each environment.
And here comes the problem: spending time coding permutations of very similar configurations.

To avoid that, with our component-based architecture we only have to write a single config file describing which modules these labs need, without touching any Terraform script or Ansible playbook.

Architecture

The solution we built supports many kinds of lab configurations deployed in AWS.

Terraform scripts deploy the infrastructure, spinning up EC2 instances and other AWS resources. To provision software and system configuration inside each EC2 instance, Terraform calls the appropriate Ansible playbooks.

Playbooks are groups of roles. A role implements one specific configuration in an independent way.

Take the role windows_splunk_universal_forward as an example. This role downloads, installs, and configures a Splunk Universal Forwarder instance on Windows. It is written to work on any Windows version.

Repo Structure:

   ├── ansible
   │   ├── all_roles
   │   │   └── distros
   │   │       ├── linux
   │   │       │   └── roles
   │   │       │       └── <new-linux-role>
   │   │       └── windows
   │   │           └── roles
   │   │               └── <new-windows-role>
   │   └── playbooks
   │       └── <new-playbook>
   └── terraform
       └── modules
           ├── distros
           │   └── <new-distro-type>
           └── environments
               └── <new-environment-type>

Terraform

Terraform is an open-source infrastructure as code software tool that provides a consistent CLI workflow to manage hundreds of cloud services. Terraform codifies cloud APIs into declarative configuration files.

For more information, check the official documentation.

Terraform Structure:

   terraform/
   ├── modules
   │   ├── constants
   │   ├── core
   │   ├── distros
   │   │   └── <distro-type>
   │   └── environments
   │       └── <environment-type>
   └── wire

What is an environment?

An environment is a predefined set of relations between nodes.
Each environment module lives under terraform/modules/environments and uses the modules in terraform/modules/distros to build those relations.

To illustrate, take this case as an example:

1 Windows Domain Controller.
X number of Member Servers.

We have this hierarchy because the DC must be created first so that it can pass some data, such as its IP, to the member servers:

```
# file: terraform/modules/environments/linux-standalone/main.tf

module "windows-domain-controller" {
    source = "../../distros/windows-server"
    ...
}

module "windows-server-member" {
    source = "../../distros/windows-server"
    ...
    windows_domain_controller = module.windows-domain-controller
    ...
}
```

What is a distro?

A distro is a predefined kind of AMI with a specific setup and/or provisioning.
Each distro module lives under terraform/modules/distros and has a matching Ansible playbook to run the provisioning.

To illustrate, take these cases as examples:

Linux
Windows
Splunk
FreeBSD

Terraform example:

```

locals {
  ...
  # "$PUBLIC_IP" is a placeholder; it is swapped for the instance IP below.
  provisioning_command = "ansible-playbook -i $PUBLIC_IP /opt/automation/tools/ansible/playbooks/windows.yml --extra-vars='${local.extra_vars}'"
}

...

resource "aws_instance" "windows_server" {
  ...
}

resource "null_resource" "ansible" {
  # Re-run the provisioner whenever the final command changes.
  triggers = {
    command = replace(local.provisioning_command, "$PUBLIC_IP", "'${aws_instance.windows_server.public_ip},'")
  }

  # The trailing comma makes Ansible read the IP as an inline host list.
  provisioner "local-exec" {
    command = replace(local.provisioning_command, "$PUBLIC_IP", "'${aws_instance.windows_server.public_ip},'")
  }
}
```
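The placeholder substitution in the Terraform example can be hard to read because of the nested quoting. Here is a minimal Python sketch of the same string manipulation, only to show what the final command looks like. The IP address and the extra-vars value are made-up illustration values, and `build_provisioning_command` is a hypothetical helper, not part of the project.

```python
# Sketch of the "$PUBLIC_IP" substitution that Terraform's replace() performs.
# The playbook path comes from the article; the IP and extra-vars are made up.


def build_provisioning_command(template: str, public_ip: str) -> str:
    """Swap the placeholder for a quoted, comma-terminated inline inventory."""
    # The trailing comma tells Ansible "this is a host list, not an inventory file".
    return template.replace("$PUBLIC_IP", f"'{public_ip},'")


TEMPLATE = (
    "ansible-playbook -i $PUBLIC_IP "
    "/opt/automation/tools/ansible/playbooks/windows.yml "
    "--extra-vars='{\"example\": true}'"
)

command = build_provisioning_command(TEMPLATE, "203.0.113.10")
print(command)
```

Because the rendered command is also stored in `triggers`, any change to the IP or to the extra vars changes the trigger value, which makes Terraform recreate the null_resource and run the playbook again.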

Ansible

Ansible is an open-source software provisioning, configuration management, and application-deployment tool enabling infrastructure as code. It runs on many Unix-like systems, and can configure both Unix-like systems as well as Microsoft Windows.

For more information, check the official documentation.

Ansible Structure:

```
ansible/
├── all_roles
│   ├── distros
│   │   └── <distro-type>
│   │       └── roles
│   │           └── <distro-role>
└── playbooks
```

What is a distro type?

A distro type is a folder that centralizes all Ansible roles that can be executed on a specific kind of machine.

Take windows as an example: ansible/all_roles/distros/windows. This folder centralizes all Ansible roles that can be executed on Windows machines.

What is a distro role?

A distro role is a group of Ansible tasks that implement related configurations and together represent one piece of functionality.

To illustrate, here is the list of tasks from the Splunk UF role:
- Downloads Splunk UF
- Installs the downloaded file
- Sets the default configuration
- Starts Splunk UF

Explaining the config file

Here you can find a complete config example:

config = {
  "myenv01" = {
    "type" = "windows_standalone"
    "nodes" = {
      "myvm01" = {
        "type" = "windows"
        "enabled_roles" = {
          "windows_funcionality01" = true
          "windows_funcionality02" = true
        }
        "os" = {
          "size"    = "t2.medium"
          "distro"  = "windows"
          "type"    = "windows"
          "version" = "2016"
        }
      }
    }
  }
  "myenv02" = {
    "type" = "linux_standalone"
    "nodes" = {
      "mylinux01" = {
        "type" = "linux"
        "enabled_roles" = {
          "linux_funcionality01" = true
          "linux_funcionality02" = true
        }
        "os" = {
          "size"    = "t2.medium"
          "distro"  = "ubuntu"
          "type"    = "linux"
          "version" = "20"
        }
      }
    }
  }
}

Under the hood this configuration will be translated to create 2 EC2 instances in AWS, and each instance will run playbooks with specific roles.

Block explanation:

Each myenv0x block describes how an environment will be deployed. Its type selects which predefined environment to use.
Each myvm0x block represents a VM that will be created. Its type selects which predefined distro to use.
The os block has 4 properties used to create the right EC2 instance:
- size: the AWS instance type
- type: the kind of distro (windows, linux, etc.)
- distro: the OS distro (ubuntu, debian, suse, windows, etc.)
- version: the version of the OS distro

With this information, Terraform knows which AWS AMI to use when spinning up the EC2 instance.

The enabled_roles block lists the Ansible roles to execute on each instance.
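To make the block explanation concrete, here is a small Python sketch that walks a config of the same shape and lists what would be created for each node. It is only an illustration of how the structure can be read, not the project's Terraform logic. The environment, node, and role names are hypothetical, as are `NodePlan` and `expand_config`.

```python
from typing import Dict, List, NamedTuple


class NodePlan(NamedTuple):
    environment: str       # key of the environment block
    environment_type: str  # which predefined environment is used
    node: str              # key of the VM block
    instance_size: str     # os.size, the AWS instance type
    os_label: str          # os.distro plus os.version
    roles: List[str]       # Ansible roles switched on for this node


def expand_config(config: Dict[str, dict]) -> List[NodePlan]:
    """Flatten environments -> nodes into one plan entry per VM."""
    plans = []
    for env_name, env in config.items():
        for node_name, node in env["nodes"].items():
            os_block = node["os"]
            # Keep only the roles whose flag is true.
            roles = [r for r, on in node["enabled_roles"].items() if on]
            label = f'{os_block["distro"]}-{os_block["version"]}'
            plans.append(NodePlan(env_name, env["type"], node_name,
                                  os_block["size"], label, roles))
    return plans


# Hypothetical config with the same shape as the example above.
CONFIG = {
    "env_a": {
        "type": "windows_standalone",
        "nodes": {
            "vm_a": {
                "type": "windows",
                "enabled_roles": {"role_one": True, "role_two": False},
                "os": {"size": "t2.medium", "distro": "windows",
                       "type": "windows", "version": "2016"},
            }
        },
    },
    "env_b": {
        "type": "linux_standalone",
        "nodes": {
            "vm_b": {
                "type": "linux",
                "enabled_roles": {"role_three": True},
                "os": {"size": "t2.medium", "distro": "ubuntu",
                       "type": "linux", "version": "20"},
            }
        },
    },
}

plans = expand_config(CONFIG)
for plan in plans:
    print(plan)
```

Each entry corresponds to one EC2 instance plus the list of roles its playbook would run, which is the translation described above.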

For more details about the code and implementation, check the fully functional code demo.

A serverless full-stack application using only git, google drive, and public ci/cd runners

Ederson Brilhante — Fri, 16 Apr 2021 13:40:49 +0000

TL;DR; How I built the Vilicus Service, a serverless full-stack application with backend workers and database only using git and ci/cd runners.

What is Vilicus?

Vilicus is an open-source tool that orchestrates security scans of container images (Docker/OCI) and centralizes all results into a database for further analysis and metrics.

Vilicus can be used in several ways:

Your own installation;
GitHub Action in your GitHub workflows;
Template CI in your GitLab CI/CD pipelines;
Free Online Service.

This article explains how it was possible to build the Free Online Service without using a traditional deployment.

Architecture

The frontend is hosted on GitHub Pages. It is a landing page with a free service to scan container images or display their vulnerabilities.

The results of container image scans are stored in a GitLab repository.

When the user asks to see the results for an image, the frontend calls the GitLab API to retrieve the file with that image's vulnerabilities. If the image has not been scanned yet, the user has the option to schedule a scan using a Google Form.
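As a rough sketch of that lookup, the snippet below builds the URL for GitLab's "raw file from repository" endpoint. The real frontend is JavaScript; Python is used here only to show the URL shape. The project path, the `reports/<image>.json` layout, and the `report_url` helper are assumptions for illustration, not the layout Vilicus actually uses.

```python
from urllib.parse import quote

GITLAB_API = "https://gitlab.com/api/v4"


def report_url(project: str, image: str, ref: str = "main") -> str:
    """Build the raw-file URL for one image's scan report."""
    # Hypothetical file layout: one JSON report per image name.
    file_path = f"reports/{image}.json"
    # GitLab expects both the project path and the file path URL-encoded,
    # including the slashes, so safe="" is needed.
    return (
        f"{GITLAB_API}/projects/{quote(project, safe='')}"
        f"/repository/files/{quote(file_path, safe='')}/raw"
        f"?ref={quote(ref, safe='')}"
    )


url = report_url("example-group/example-reports", "docker.io/library/alpine:3.13")
print(url)
```

A 404 from this endpoint would then correspond to the "image not scanned yet" case, where the user is offered the form instead.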

When the form is submitted, the data is sent to a Google Spreadsheet.

A GitHub workflow runs every 5 minutes to check for new answers in this Spreadsheet. For each new image in the Spreadsheet, this workflow triggers another workflow that scans the image and saves the result in the GitLab repository.

Why store in GitLab?

GitLab provides bigger limits.

Here's a summary of differences in offering on public cloud and free tier:

| Provider | Free users | Max repo size (GB) | Max file size (MB) | Max API calls per hour (per client) |
|---|---|---|---|---|
| GitHub | 3 | 2 | 100 | 5000 |
| BitBucket | 5 | 1 | Unlimited (up to repo size) | 5000 |
| GitLab | Unlimited | 10 | Unlimited (up to repo size) | 36000 |

Google Drive

This choice was a "quick win". In a usual deployment, the backend could call an API and pass secrets without the clients ever seeing them.

But because I am using GitHub Pages, I cannot do that. (Well, I could do it in the JavaScript, but anyone using the browser inspector would see the secrets. So let's not do it 😉)

Instead, the Google Spreadsheet acts as a queue.

Google Form:

Google Spreadsheet with answers:

GitHub Workflows

The Schedule workflow runs at most once every 5 minutes. It executes a Python script that checks for new rows in the Google Spreadsheet and, for each row, makes an HTTP request to trigger the repository_dispatch event.

This makes the workflows act as backend workers.
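The queue-polling step can be sketched roughly as follows. This is not the actual Vilicus script: the row layout (timestamp, image), the `scan_image` event type, and the helper names are assumptions for illustration. Only the endpoint `POST /repos/{owner}/{repo}/dispatches` and its `event_type` / `client_payload` fields come from the GitHub REST API. The sketch builds the requests without sending them.

```python
import json
import urllib.request
from typing import List


def new_rows(rows: List[List[str]], already_processed: int) -> List[List[str]]:
    """Return the spreadsheet rows that were added since the last run."""
    return rows[already_processed:]


def build_dispatch(repo: str, token: str, image: str) -> urllib.request.Request:
    """Build (but do not send) a repository_dispatch request for one image."""
    body = json.dumps({
        "event_type": "scan_image",            # hypothetical event name
        "client_payload": {"image": image},    # read by the triggered workflow
    }).encode("utf-8")
    return urllib.request.Request(
        f"https://api.github.com/repos/{repo}/dispatches",
        data=body,
        method="POST",
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    # Hypothetical sheet content: [timestamp, image] per form answer.
    sheet = [
        ["2021-04-16 10:00", "docker.io/library/alpine:3.13"],
        ["2021-04-16 10:05", "docker.io/library/nginx:1.19"],
    ]
    for _, image in new_rows(sheet, already_processed=1):
        request = build_dispatch("example-user/example-repo", "dummy-token", image)
        print(request.get_method(), request.full_url, request.data.decode())
        # urllib.request.urlopen(request) would actually trigger the workflow.
```

How the script remembers `already_processed` between runs is left open here; any persistent marker would do.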

Schedule in workflow:

name: Schedule
on:
  schedule:
    - cron:  '*/5 * * * *'
...

Event repository_dispatch in WorkFlow:

name: Report
on: [repository_dispatch]
...

Screenshots:

Schedule History:

Schedule WorkFlow:

Scans History:

Report Workflow:

Scan Report stored in GitLab:

Source Code:

Do you want to know more about GitHub Actions?

GitHub Pages

The frontend runs on GitHub Pages.

By default, an application running on GH Pages is hosted at http://<github-user>.github.io/<repository>.

But GitHub allows you to use a custom domain, which is why it's possible to access Vilicus at https://vilicus.edersonbrilhante.com.br instead of http://edersonbrilhante.github.io/vilicus.

GitHub Workflow to build the application and deploy it in GH Pages

Building the source code:

- name: Build
  run: |
    cd website
    npm install
    npm run-script build
  env:
    REACT_APP_GA_CODE: ${{ secrets.REACT_APP_GA_CODE }}
    REACT_APP_FORM_SCAN: ${{ secrets.REACT_APP_FORM_SCAN }}

Deploying the build:

- name: Deploy
  uses: JamesIves/github-pages-deploy-action@releases/v3
  with:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    BRANCH: gh-pages
    FOLDER: website/build

Source Code:

Do you want to know more about GitHub Pages?

That’s it!

In case you have any questions, please leave a comment here or ping me on LinkedIn.

Fast startup application with database stored in container images

Ederson Brilhante — Thu, 01 Apr 2021 15:15:54 +0000

TL;DR; This article shows the strategy I implemented to make an application ready to use in a few minutes rather than many hours.

In this article, I will talk about the strategy I used in the Vilicus project to have big databases already synced in new setups. For those who don't know Vilicus yet, I recommend reading my article about it.

Why does the application take so long to start?

At the moment, Vilicus uses Anchore, Clair, and Trivy as vendors to run security scans on container images. Each vendor has its own programming language, database, and internal dependencies, and can use different vulnerability databases.

Vilicus itself starts in milliseconds, but before it is ready to use, the vendors have to sync their vulnerability databases with the latest changes, and these syncs can take a lot of time.

Take Anchore, for example, the vendor whose sync takes the longest:

There is no exact time frame for the initial sync to complete as it depends heavily on environmental factors, such as the host's memory/cpu allocation, disk space, and network bandwidth. Generally, the initial sync should complete within 8 hours but may take longer. Subsequent feed updates are much faster as only deltas are updated.
https://docs.anchore.com/current/docs/faq/

Clair takes more or less 20 minutes. And Trivy is ready in a few seconds.

If you run everything from scratch, it will take almost 1 day to sync all the vulnerability databases; after this major sync, subsequent syncs are faster.

This is a problem if you want to run an ephemeral instance in your CI/CD: waiting hours for the sync to complete before you can run the first scan is not viable. Thinking about how to fix this, I came up with a solution: save updated database snapshots in container images every day.

Now you must be thinking that this is not good practice, and normally I would agree. But I believe there are exceptions, such as cases where fixing the problem matters more than following conventions.

Saving the database in a container image

I'll show in detail how I made it work for Anchore; Clair and Trivy are not much different.

Anchore

First, I have a compressed SQL dump, with the database already synced within the last 6 months, stored in a container image: vilicus/anchoredb:dumpsql. So we don't need to wait many hours; we only need to sync the delta.

I used this image as a base to create a local image (vilicus/anchoredb:files) with a script that restores the database when the image runs as a container.

Dockerfile content

FROM vilicus/anchoredb:dumpsql as dumpsql

FROM postgres:9.6.21-alpine
LABEL vilicus.app.version=9.6.21-alpine

COPY --chown=postgres:postgres --from=dumpsql /opt/vilicus/data/anchore_db.tar.gz /opt/vilicus/data/anchore_db.tar.gz
COPY deployments/dockerfiles/anchore/db/files/restore-dbs.sh /docker-entrypoint-initdb.d/01.restore-dbs.sh

Building the container image

docker build -f deployments/dockerfiles/anchore/db/files/Dockerfile -t vilicus/anchoredb:files .

The image vilicus/anchoredb:files is referenced in deployments/docker-compose.updater.yml

Here we start the anchore and the anchoredb.

docker-compose -f deployments/docker-compose.updater.yml up \
    --build -d --force \
    --remove-orphans \
    --renew-anon-volumes anchore

After that, we run this command to restore the database.

docker exec anchoredb sh -c 'docker-entrypoint.sh postgres' &

Then we wait until the restore has finished and the database is ready for connections.

docker run --network container:anchore vilicus/vilicus:latest \
    sh -c "dockerize -wait http://anchore:8228/health -wait-retry-interval 10s -timeout 1000s echo done"

With the Anchore Engine and the DB ready, we start the sync.

docker exec anchore sh -c 'anchore-cli system wait'

When the sync finishes, we stop anchore and gracefully stop the Postgres process in anchoredb.

docker stop anchore
docker exec -u postgres anchoredb sh -c 'pg_ctl stop -m smart'

We commit the container, with the changes made by the sync, into a new container image vilicus/anchoredb:local-update

CID=$(docker inspect --format="{{.Id}}" anchoredb)
docker commit $CID vilicus/anchoredb:local-update

So we finally build the container image that goes to docker hub, by copying the Postgres data from the image vilicus/anchoredb:local-update

Dockerfile content

FROM vilicus/anchoredb:local-update as db
FROM postgres:9.6.21-alpine
COPY --chown=postgres:postgres --from=db /data/ /data

Building the container image

docker build -f deployments/dockerfiles/anchore/db/Dockerfile -t vilicus/anchoredb:latest .

Check the complete script here

Clair and Trivy

For Clair check here.

For Trivy check here.

Updating the images every day

To keep the databases up to date with the latest changes, I have a GitHub workflow that runs a job every day, building the images and pushing them to Docker Hub.

Check the workflow

Complete workflow

That's it!

In case you have any questions, please leave a comment here or ping me on 🔗 LinkedIn.

GitLab Runners as a Service with Github Action

Ederson Brilhante — Thu, 01 Apr 2021 14:52:43 +0000

TL;DR; This article shows how to use the action "Gitlab Runner Service Action" in a GitHub workflow that is triggered by a GitLab CI job, giving you temporary GitLab Runners hosted by GitHub.

For more info about GitHub workflow, check the official documentation

For more info about GitLab-CI, check the official documentation

Steps

Step 1

Create a new GitHub repository with the following GitHub Workflow. File location: .github/workflows/gitlab-runner.yaml

name: Gitlab Runner Service
on: [repository_dispatch]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Maximize Build Space
        uses: easimon/maximize-build-space@master
        with:
          root-reserve-mb: 512
          swap-size-mb: 1024
          remove-dotnet: 'true'
          remove-android: 'true'
          remove-haskell: 'true'

      - name: Gitlab Runner
        uses: edersonbrilhante/gitlab-runner-action@main
        with:
          registration-token: "${{ github.event.client_payload.registration_token }}"
          docker-image: "docker:19.03.12"
          name: ${{ github.run_id }}
          tag-list: "crosscicd"

What does this workflow do?

This workflow runs only when the repository_dispatch event is triggered. The first step frees up disk space by removing packages our GitLab Runner does not need. The second step runs the action, which registers a new GitLab Runner with the tag crosscicd, starts it, and unregisters it once a GitLab CI job completes, whether it succeeds or fails.

Step 2

Create a new GitLab repository with the following GitLab-CI config. File location: .gitlab-ci.yml

start-crosscicd:
  image: alpine
  before_script:
    - apk add --update curl && rm -rf /var/cache/apk/*
  script: |
    curl -H "Authorization: token ${GITHUB_TOKEN}" \
    -H 'Accept: application/vnd.github.everest-preview+json' \
    "https://api.github.com/repos/${GITHUB_REPO}/dispatches" \
    -d '{"event_type": "gitlab_trigger_'${CI_PIPELINE_ID}'", "client_payload": {"registration_token": "'${GITLAB_REGISTRATION_TOKEN}'"}}'

github:
  image: docker:latest
  services:
    - name: docker:dind
      alias: thedockerhost
  variables:
    DOCKER_HOST: tcp://thedockerhost:2375/
    DOCKER_DRIVER: overlay2
    DOCKER_TLS_CERTDIR: ""
  script:
    - df -h
    - docker run --privileged ubuntu df -h
  tags:
    - crosscicd

What does this GitLab CI config do?

The job start-crosscicd triggers the GitHub workflow, which creates the GitLab Runner with the tag crosscicd. The job github then waits for a runner with the tag crosscicd.

Step 3

Set the EnvVars in the new GitLab Repo

    GITHUB_REPO:<username>/<github-repo>
    GITHUB_TOKEN:<GitHub Access Token>
    GITLAB_REGISTRATION_TOKEN:<GitLab Registration Token>

How to create a new GitHub Access Token:

Go to https://github.com/settings/tokens/new
Tick the workflow scope and click Generate token

How to get the Registration Token:

Go to https://gitlab.com/<username>/<repo>/-/settings/ci_cd and expand Runners
Copy the Registration Token

Where to store the EnvVars?

Go to https://gitlab.com/<username>/<repo>/-/settings/ci_cd and expand Variables
Click Add Variable and save each EnvVar

Step 4

Now your pipeline is ready to run the GitLab Runner on GitHub, triggered by a GitLab CI job :)

Example

Video Demo

Screenshots

Job start-crosscicd trigger Github Workflow

Workflow triggered by Gitlab-CI job

There is 17GB free by default

After Maximize we have 54GB free to use

Code

That’s it!

In case you have any questions, please leave a comment here or ping me on 🔗 LinkedIn.

Vilicus — An overseer for security scanning of container images

Ederson Brilhante — Wed, 31 Mar 2021 20:19:18 +0000

Vilicus is an open-source tool that orchestrates security scans of container images (Docker/OCI) and centralizes all results into a database for further analysis and metrics.

Why scan for vulnerabilities?

A recent analysis of around 4 million Docker Hub images by cyber security firm Prevasio found that 51% of the images had exploitable vulnerabilities. A large number of these were cryptocurrency miners, both open and hidden, and 6,432 of the images had malware.
https://www.infoq.com/news/2020/12/dockerhub-image-vulnerabilities/

Image from https://prevasio.com/static/web/viewer.html?file=/static/Red_Kangaroo.pdf

Docker image security scanning is a process for finding security vulnerabilities within your Docker image files.
Typically, image scanning works by parsing through the packages or other dependencies that are defined in a container image file, then checking to see whether there are any known vulnerabilities in those packages or dependencies.
https://resources.whitesourcesoftware.com/blog-whitesource/docker-image-security-scanning

How does it work?

There are many tools for scanning container images for vulnerabilities, such as Anchore, Clair, and Trivy. But the results for the same image can differ from tool to tool. This project helps developers improve the quality of their container images by finding vulnerabilities, and then addressing them, with a vendor-agnostic view.
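To illustrate why a combined view is useful, here is a minimal Python sketch that merges per-vendor findings and lists the ones the vendors disagree on. It does not reflect how Vilicus stores or merges results internally. The input shape, the function names, and the placeholder CVE identifiers are all assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def merge_findings(results: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each vulnerability ID to the set of vendors that reported it."""
    merged: Dict[str, Set[str]] = defaultdict(set)
    for vendor, vuln_ids in results.items():
        for vuln_id in vuln_ids:
            merged[vuln_id].add(vendor)
    return dict(merged)


def disputed(merged: Dict[str, Set[str]], vendor_count: int) -> List[str]:
    """Vulnerabilities that at least one vendor did not report."""
    return sorted(v for v, vendors in merged.items() if len(vendors) < vendor_count)


# Placeholder IDs, not real scan output.
scan_results = {
    "anchore": ["CVE-0000-0001", "CVE-0000-0002"],
    "clair": ["CVE-0000-0001"],
    "trivy": ["CVE-0000-0001", "CVE-0000-0003"],
}

merged = merge_findings(scan_results)
print(disputed(merged, vendor_count=len(scan_results)))
```

A finding reported by only one vendor is not necessarily a false positive; seeing the disagreement simply makes it easier to decide what to investigate first.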

Some articles comparing the scanning tools:

Architecture

Cached Database

Vilicus updates the vendor databases daily with the latest changes from the vulnerability DBs.

By storing the database data in layers of Docker images, the whole platform is ready to use in minutes instead of hours. Syncing the vulnerability feeds from scratch can take at least 6 hours.

Check the strategy used in Anchore, Clair and Trivy

Local Registry

Vilicus provides a local registry, so you can build a local image and scan it without pushing it to a remote repository.

docker build -t localhost:5000/local-image:my-tag .

curl -o docker-compose.yml https://raw.githubusercontent.com/edersonbrilhante/vilicus/main/deployments/docker-compose.yml

docker-compose up -d

IMAGE=localregistry.vilicus.svc:5000/local-image:my-tag

docker run -v ${PWD}/artifacts:/artifacts \
  --network container:vilicus \
  vilicus/vilicus:latest \
  sh -c "dockerize -wait http://vilicus:8080/healthz -wait-retry-interval 60s -timeout 2000s vilicus-client -p /opt/vilicus/configs/conf.yaml -i ${IMAGE}  -t /opt/vilicus/contrib/sarif.tpl -o /artifacts/results.sarif"

GitHub Action

GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub. Make code reviews, branch management, and issue triaging work the way you want.
https://github.com/features/actions

Vilicus provides a GitHub Action to help you scan container images in your CI/CD.

Container scanning

A scan can be run against either a remote image or a local image. With a remote registry such as docker.io, the image will be docker.io/your-organization/image:tag:

  - name: Scan image
    uses: edersonbrilhante/vilicus-github-action@main
    with:
      image: "docker.io/myorganization/myimage:tag"

And to use a local image, it needs to be tagged as localhost:5000/image:tag:

  - name: Scan image
    uses: edersonbrilhante/vilicus-github-action@main
    with:
      image: "localhost:5000/myimage:tag"

Full example

Complete example with steps for cleaning space, building local image, Vilicus scanning, and uploading results to GitHub Security

name: Container Image CI
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Maximize Build Space
        uses: easimon/maximize-build-space@master
        with:
          root-reserve-mb: 512
          swap-size-mb: 1024
          remove-dotnet: 'true'
          remove-android: 'true'
          remove-haskell: 'true'
      - name: Checkout branch
        uses: actions/checkout@v2
      - name: Build the Container image
        run: docker build -t localhost:5000/local-image:${GITHUB_SHA} .
      - name: Vilicus Scan
        uses: edersonbrilhante/vilicus-github-action@main
        with:
          image: localhost:5000/local-image:${{ github.sha }}
      - name: Upload results to github security
        uses: github/codeql-action/upload-sarif@v1
        with:
          sarif_file: artifacts/results.sarif

Results in GitHub Security

Check an example using Vilicus GitHub Action

Pipeline example

List with all vulns found

Vuln details

Source Code

Vilicus GitHub Action

Vilicus

That’s it!

In case you have any questions, please leave a comment here or ping me on 🔗 LinkedIn.