<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ederson Brilhante</title>
    <description>The latest articles on DEV Community by Ederson Brilhante (@edersonbrilhante).</description>
    <link>https://dev.to/edersonbrilhante</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F606500%2F67cc9470-d75a-4b86-bb9f-07329fb2558a.jpeg</url>
      <title>DEV Community: Ederson Brilhante</title>
      <link>https://dev.to/edersonbrilhante</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/edersonbrilhante"/>
    <language>en</language>
    <item>
      <title>No Silver Bullets: Engineering a Multi-Tenant CI Platform a Small Team Can Run</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Sat, 20 Jun 2026 20:41:43 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/no-silver-bullets-engineering-a-multi-tenant-ci-platform-a-small-team-can-run-if</link>
      <guid>https://dev.to/edersonbrilhante/no-silver-bullets-engineering-a-multi-tenant-ci-platform-a-small-team-can-run-if</guid>
      <description>&lt;p&gt;&lt;em&gt;A deep, teaching walkthrough of how Cisco’s internal Forge deployment runs ~40 teams and ~10,000 GitHub Actions jobs a day on AWS — and the dozen deliberate engineering trade-offs that made it survivable with near-zero ops. This is the long version on purpose. I’m going to show you the machinery, not wave at it, because a platform you can’t see inside is just a magic box you’re afraid to touch. By the end you should understand not only what Forge does but why each piece is built the way it is — well enough to argue with the decisions.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Forge (open-source name: &lt;em&gt;ForgeMT&lt;/em&gt;) is a multi-tenant GitHub Actions runner platform on AWS. &lt;em&gt;In Cisco's internal production deployment&lt;/em&gt; it runs ~40 teams and ~10k jobs/day with &lt;strong&gt;near-zero manual provisioning/debugging&lt;/strong&gt; — which is not the same as zero support. Jobs run on &lt;strong&gt;ephemeral&lt;/strong&gt; runners across two lanes — EC2 VMs and Kubernetes/ARC pods. The design is a stack of deliberate trade-offs, each with a stated cost: &lt;strong&gt;Calico&lt;/strong&gt; to beat VPC-IP exhaustion, &lt;strong&gt;per-tenant DinD node pools&lt;/strong&gt; for blast-radius isolation around privileged Docker builds, &lt;strong&gt;zero static credentials&lt;/strong&gt; with a self-checking trust-validator, &lt;strong&gt;immutable blue/green clusters&lt;/strong&gt; for safe upgrades, &lt;strong&gt;directory-as-config Terragrunt&lt;/strong&gt; so onboarding is a config change, &lt;strong&gt;self-healing pipelines + centralized Splunk observability&lt;/strong&gt;, and &lt;strong&gt;Renovate + dogfooded images&lt;/strong&gt; to stay current. The thesis: near-zero ops isn't luck or a magic tool — it's a dozen good trade-offs that add up. Short on time? Read §4 (how a job runs) and §13 (the pattern).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Contents&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The wall everyone hits&lt;/li&gt;
&lt;li&gt;What Forge is (and the vocabulary you'll need)&lt;/li&gt;
&lt;li&gt;Why it exists&lt;/li&gt;
&lt;li&gt;How a job actually runs — the part most explanations skip&lt;/li&gt;
&lt;li&gt;Trade-off #1 — Networking: the IP ceiling&lt;/li&gt;
&lt;li&gt;Trade-off #2 — Isolation: per-tenant DinD node pools&lt;/li&gt;
&lt;li&gt;Trade-off #3 — Identity: zero static credentials&lt;/li&gt;
&lt;li&gt;Trade-off #4 — Immutable clusters &amp;amp; blue/green&lt;/li&gt;
&lt;li&gt;Trade-off #5 — Config at scale&lt;/li&gt;
&lt;li&gt;Trade-off #6 — The connective tissue&lt;/li&gt;
&lt;li&gt;Trade-off #7 — Staying fresh: automated dependencies &amp;amp; dogfooded images&lt;/li&gt;
&lt;li&gt;Operating it: ownership, where it breaks, and the sharp edges&lt;/li&gt;
&lt;li&gt;The pattern&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  1. The wall everyone hits
&lt;/h2&gt;

&lt;p&gt;Setting up a GitHub Actions self-hosted runner is a fifteen-minute job. You spin up a VM, run the &lt;code&gt;config.sh&lt;/code&gt; script GitHub gives you, it registers the machine against your org or repo, you add &lt;code&gt;runs-on: self-hosted&lt;/code&gt; to a workflow, and your job lands on your box. It feels great. It &lt;em&gt;is&lt;/em&gt; great — for one team.&lt;/p&gt;

&lt;p&gt;The trouble never arrives with the first runner. It arrives with the second team. And the fifth. And the twentieth.&lt;/p&gt;

&lt;p&gt;Here's the mechanism of the pain, because it's worth being precise about. A self-hosted runner is a long-lived machine you own. That means &lt;em&gt;you&lt;/em&gt; own its patching, its security posture, its disk filling up, its runner-agent upgrades, its scaling when ten jobs arrive at once, and its idle cost when none do. The moment a second team needs runners — maybe they need a different toolchain, a bigger instance, or access to an internal network the first team didn't — the path of least resistance is for them to copy the setup and run their own. Now two teams each carry that whole operational burden. Multiply by twenty.&lt;/p&gt;

&lt;p&gt;What you end up with isn't "twenty teams using runners." It's &lt;strong&gt;twenty subtly different runner platforms&lt;/strong&gt;, each patched on its own schedule, each drifting in its own direction, each with its own half-built answer to "how do we give this runner AWS access without leaking a key." "It works on Team A's runners but not ours" becomes a real sentence in real incident channels. Your operational cost grows roughly linearly with adoption, and — this is the part that should worry you — your &lt;em&gt;security posture gets worse&lt;/em&gt; as you grow, because every team reinvents secrets handling and isolation slightly differently and slightly wrong.&lt;/p&gt;

&lt;p&gt;This is the wall. Forge is what one team built after hitting it.&lt;/p&gt;

&lt;p&gt;And the single most important thing to understand before we go further: &lt;strong&gt;the interesting part is not any one clever component.&lt;/strong&gt; There is no magic tool in here. Forge is roughly a dozen deliberate decisions, each made in a different problem domain, each with a real cost, composed into something a small team can actually operate. The rest of this article walks the load-bearing ones — with the actual code and the actual trade-off accepted. If you're building anything multi-tenant, the &lt;em&gt;shape&lt;/em&gt; of these decisions will transfer even if you never touch a GitHub runner.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. What Forge is (and the vocabulary you'll need)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Forge is a secure, multi-tenant GitHub Actions runner platform on AWS.&lt;/strong&gt; Teams run their CI/CD jobs on managed, ephemeral runners that live in company-managed AWS accounts. For a developer in an &lt;em&gt;already-onboarded&lt;/em&gt; repo, adoption is usually a one-line &lt;code&gt;runs-on&lt;/code&gt; change — no infrastructure to own, no migration project, no workflow rewrite. (Standing up a &lt;em&gt;new&lt;/em&gt; tenant is a controlled platform-team step, which we'll be honest about in §4 and §12.)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(A naming note: the open-source project is ForgeMT — "MT" for multi-tenant; internally people usually just say Forge. I use "Forge" throughout for readability, but the repo and docs say "ForgeMT.")&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Before we go deeper, let's nail down five terms. If you already live in this world, skim; if you don't, these five unlock everything below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub-hosted vs. self-hosted runners.&lt;/strong&gt; When you write &lt;code&gt;runs-on: ubuntu-latest&lt;/code&gt;, GitHub spins up a throwaway VM &lt;em&gt;it&lt;/em&gt; owns, runs your job, and throws it away. Convenient, but that VM lives in GitHub's network — it cannot reach your private databases, your internal package registries, or your VPC, and you can't customize it much or control its cost at high volume. A &lt;em&gt;self-hosted&lt;/em&gt; runner is a machine &lt;em&gt;you&lt;/em&gt; register with GitHub; GitHub sends jobs to it, but you own everything about the box. Forge is a platform for self-hosted runners — it exists precisely for the jobs that GitHub-hosted runners can't serve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tenant.&lt;/strong&gt; A tenant is the isolation-and-configuration boundary for a team — its own runners, identity, and labels. Forge runs those runners &lt;em&gt;inside the Forge-managed AWS deployment for that tenant&lt;/em&gt;; the tenant can then &lt;strong&gt;optionally&lt;/strong&gt; grant its runners access to its &lt;em&gt;own&lt;/em&gt; external AWS accounts by listing the IAM roles the runner role is allowed to assume (more in §7). So "each tenant has its own account" is the wrong mental model — the right one is "each tenant is an isolated boundary, with opt-in bridges to whatever AWS it actually needs." When we say Forge is "multi-tenant," we mean many teams share one &lt;em&gt;platform&lt;/em&gt; (one codebase, one upgrade path, one operations team) while staying strictly separated at runtime. Holding both of those at once — shared platform, isolated execution — is the entire engineering problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ephemeral runner.&lt;/strong&gt; This is the keystone idea. Instead of long-lived machines, Forge creates a brand-new runner for a &lt;em&gt;single job&lt;/em&gt; and destroys it the instant that job finishes — pass or fail. Every job gets a pristine machine that has never seen another job. This buys you enormous things almost for free: no state leaks between jobs, no configuration drift accumulating on a box over months, no credentials cached on a runner that outlives the work, and no "clean up after yourself" step that everyone forgets. It costs you one thing — startup latency, since you pay to boot a machine per job — which we'll see Forge manage carefully later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Control plane vs. tenant plane.&lt;/strong&gt; The &lt;em&gt;control plane&lt;/em&gt; is the Forge-owned brain: it receives the signal that a job needs a runner, provisions the runner, scales the fleet, wires up identity, and collects logs and metrics. The &lt;em&gt;tenant plane&lt;/em&gt; is where the actual jobs execute. Teams live entirely in the tenant plane — they write workflows and pick runners — and never touch the control plane. This split is what lets one small team operate the control plane for everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The two lanes.&lt;/strong&gt; Forge runs jobs two different ways, and lets each job choose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;EC2 lane&lt;/strong&gt; gives a job a full virtual machine (or even a bare-metal host). Use it for heavy builds, specialized hardware, custom operating systems, macOS/Windows, or anything that needs real VM-level isolation and control. A job can run directly on the VM &lt;em&gt;or&lt;/em&gt; inside a container via a &lt;code&gt;container:&lt;/code&gt; block in the workflow. Under the hood this is built on the open-source &lt;code&gt;terraform-aws-github-runner&lt;/code&gt; project.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;Kubernetes lane&lt;/strong&gt; (often called ARC, after &lt;em&gt;Actions Runner Controller&lt;/em&gt;, the open-source operator it's built on) runs each job in a Kubernetes pod. Use it for fast, bursty, container-native work where you want pods to spin up in seconds and scale to zero when idle.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within each lane, tenants pick a &lt;em&gt;size or flavor&lt;/em&gt;. On EC2 that's typically &lt;code&gt;small&lt;/code&gt; (linting, quick tests), &lt;code&gt;standard&lt;/code&gt; (general builds and integration tests), &lt;code&gt;large&lt;/code&gt; (heavy builds, load tests), and &lt;code&gt;metal&lt;/code&gt; (bare-metal hosts for jobs that need full hardware control). On Kubernetes it's &lt;code&gt;k8s&lt;/code&gt; (lightweight pods for jobs that don't need Docker) and &lt;code&gt;dind&lt;/code&gt; (Docker-in-Docker, for building container images inside a pod) — plus dedicated &lt;em&gt;named scale sets&lt;/em&gt; like &lt;code&gt;dependabot&lt;/code&gt;, which is not a third execution model but an ARC scale set typically backed by the DinD template, carved out so dependency-update jobs are isolated and capacity-limited on their own. A tenant can run several of these at once, each with its own parallelism limit, so one team's flood of jobs can't starve another's.&lt;/p&gt;

&lt;p&gt;Two lanes is itself a deliberate choice — most platforms pick one. Forge keeps both because the workload genuinely spans both: a 90-minute hardware-in-the-loop build and a 30-second lint check do not want the same execution model.&lt;/p&gt;

&lt;p&gt;Tenants pick a lane and a size by &lt;em&gt;label&lt;/em&gt;. A workflow says something like &lt;code&gt;runs-on: [self-hosted, x64, "type:standard"]&lt;/code&gt; for a medium EC2 runner, or &lt;code&gt;runs-on: [k8s]&lt;/code&gt; for a Kubernetes pod, or &lt;code&gt;runs-on: [dind]&lt;/code&gt; for a pod that can build Docker images. Those labels are not cosmetic — as we'll see in §9, they're the literal API contract between the tenant and the platform, generated from configuration and matched exactly.&lt;/p&gt;
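
&lt;p&gt;To make "labels as a contract" concrete, here is a minimal Python sketch of exact label matching. The &lt;code&gt;RUNNER_TYPES&lt;/code&gt; catalogue, its keys, and the set-equality rule are assumptions for illustration, not Forge's actual implementation; only the label strings come from the examples above.&lt;/p&gt;

```python
# Hypothetical catalogue: runner-type name -> the exact label set it answers to.
RUNNER_TYPES = {
    "ec2-standard": frozenset({"self-hosted", "x64", "type:standard"}),
    "arc-k8s": frozenset({"k8s"}),
    "arc-dind": frozenset({"dind"}),
}


def match_runner_type(job_labels, runner_types=RUNNER_TYPES):
    """Return the runner type whose label set equals the job's labels, or None."""
    requested = frozenset(job_labels)
    for name, offered in runner_types.items():
        if requested == offered:
            return name
    return None  # no match: nothing is launched for this job


print(match_runner_type(["self-hosted", "x64", "type:standard"]))  # ec2-standard
print(match_runner_type(["dind"]))                                 # arc-dind
print(match_runner_type(["self-hosted", "x64"]))                   # None
```

&lt;p&gt;Under this reading a near miss matches nothing at all, which is one way a small typo in &lt;code&gt;runs-on&lt;/code&gt; can leave a job sitting in the queue.&lt;/p&gt;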

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9jinwbpo8og2ij6hh90o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9jinwbpo8og2ij6hh90o.png" alt="Forge architecture: developer to GitHub to control plane to ephemeral EC2/Kubernetes runners, with optional tenant AWS access." width="800" height="820"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That diagram is the entire mental model at altitude. The rest of this article is what's inside each of those boxes — and why.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Why it exists
&lt;/h2&gt;

&lt;p&gt;Forge didn't begin as an architecture diagram. It began as a fix for a specific, escalating mess.&lt;/p&gt;

&lt;p&gt;The first teams ran the open-source &lt;code&gt;terraform-aws-github-runner&lt;/code&gt; module, each on their own. It's a good module. It worked. What &lt;em&gt;didn't&lt;/em&gt; work was the practice of everyone running their own copy. The same bug got fixed three separate times in three repos. The same incident — a runner type out of capacity, a webhook misfiring — recurred across teams that never compared notes. Because each team upgraded on its own cadence, the behavior of "a runner" diverged between teams, so debugging required knowing which team's particular vintage you were looking at. Knowledge became tribal. Three teams meant three deployments, several AWS accounts per environment, and three sets of undocumented quirks, with billing and troubleshooting smeared across all of them.&lt;/p&gt;

&lt;p&gt;Two specific constraints turned this from "annoying" into "we cannot continue":&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IPv4 exhaustion.&lt;/strong&gt; The runners had to live in corporate-provisioned AWS subnets, and those subnets were &lt;em&gt;small&lt;/em&gt; — and not something an individual team could enlarge. Here's the trap: there was plenty of CPU and memory available, but the &lt;em&gt;network&lt;/em&gt; ran out of IP addresses long before the compute did. Every runner needs an IP. When you've handed out every address in the subnet, the next job sits in a queue staring at idle compute it can't use, because there's no address to give it. The bottleneck wasn't the thing everyone watches (compute) — it was the thing nobody watches (IPs). This single constraint forced a real architectural change, and we'll spend all of §5 on it because it's the most instructive decision in the platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Internal access and shared networks.&lt;/strong&gt; CI jobs frequently need to reach &lt;em&gt;internal&lt;/em&gt; things — a private artifact registry, an internal API, a database, a service only reachable over the corporate network. GitHub-hosted runners simply cannot do this; they're outside the perimeter. And when CI runs on shared corporate networks, CI traffic (which is spiky and heavy) competes with operational traffic, creating noisy-neighbor problems. The upshot: security and network design stopped being someone else's problem and became &lt;em&gt;part of CI design&lt;/em&gt;. You cannot bolt them on afterward.&lt;/p&gt;

&lt;p&gt;Faced with this, the team had a binary choice: keep letting every team run its own stack and drown in duplicated maintenance and divergent security, or &lt;strong&gt;standardize onto one platform&lt;/strong&gt;. The bet they made — and it's the thesis of the whole thing — was to keep the &lt;em&gt;flexibility&lt;/em&gt; teams loved (custom images, full VM control, internal access, their own AWS resources) while removing the &lt;em&gt;drift&lt;/em&gt; (one module, one upgrade path, one set of guardrails). Standardize the boundaries; preserve the freedom inside them.&lt;/p&gt;

&lt;p&gt;The seven trade-off sections that follow (§5 through §11) are what "guardrails without taking away freedom" actually required. But first, we have to look at how a single job flows through the system, because every later decision is in service of that flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. How a job actually runs — the part most explanations skip
&lt;/h2&gt;

&lt;p&gt;If you only remember the altitude diagram, Forge stays a magic box. So let's trace a real job, end to end, twice — once per lane. This is the spine everything else hangs on.&lt;/p&gt;

&lt;h3&gt;
  
  
  The EC2 lane, step by step
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;A developer pushes code.&lt;/strong&gt; Their workflow contains &lt;code&gt;runs-on: [self-hosted, x64, "type:standard", "tnt:acme"]&lt;/code&gt;. GitHub sees the job needs a self-hosted runner and emits a &lt;strong&gt;&lt;code&gt;workflow_job&lt;/code&gt; webhook&lt;/strong&gt; with &lt;code&gt;action: queued&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The webhook hits an API Gateway&lt;/strong&gt;, which is the public front door of that tenant's Forge deployment. It forwards the request to a small &lt;strong&gt;webhook Lambda&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The webhook Lambda authenticates and matches.&lt;/strong&gt; It verifies the request's HMAC signature against the GitHub App's webhook secret (so randoms can't trigger your runners), then checks the job's labels against the set of runner types this deployment offers. If a label set matches, it &lt;strong&gt;enqueues a message onto an SQS queue&lt;/strong&gt; dedicated to that runner type. (If nothing matches, it does nothing — the job will sit unfulfilled, which, as we'll see, is itself a debuggable signal.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A scale-up Lambda consumes the SQS message.&lt;/strong&gt; SQS here isn't decoration — it's a buffer that decouples the burst of webhooks from the rate at which you can launch instances, and it gives you retries for free. The scale-up Lambda decides how many runners to launch and issues an EC2 &lt;strong&gt;&lt;code&gt;CreateFleet&lt;/code&gt;&lt;/strong&gt; call in &lt;code&gt;instant&lt;/code&gt; mode (give me N instances now), choosing on-demand or spot capacity per the tenant's config.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The instance boots and registers itself.&lt;/strong&gt; Its user-data script pulls a &lt;em&gt;just-in-time&lt;/em&gt; (JIT) runner registration token, configures the GitHub Actions runner agent as &lt;strong&gt;ephemeral&lt;/strong&gt; (it will accept exactly one job), and joins the tenant's &lt;em&gt;runner group&lt;/em&gt;. The moment it registers, GitHub hands it the queued job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The job runs.&lt;/strong&gt; Job-started and job-completed hooks fire; the runner streams logs to CloudWatch.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The runner self-destructs.&lt;/strong&gt; Because it registered as ephemeral, the agent deregisters from GitHub after the single job. A separate &lt;strong&gt;scale-down Lambda&lt;/strong&gt;, running on a schedule (every few minutes), reaps the now-idle instance and any stragglers older than a minimum runtime. Supporting Lambdas update tags and archive logs.&lt;/li&gt;
&lt;/ol&gt;
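
&lt;p&gt;Step 3 is the security gate of the whole flow, so it is worth a sketch. GitHub signs each delivery with HMAC-SHA256 and sends the result in the &lt;code&gt;X-Hub-Signature-256&lt;/code&gt; header as &lt;code&gt;sha256=&lt;/code&gt; followed by a hex digest. The Python below shows that check plus the "only react to queued jobs" filter. The function names and the return tuples are hypothetical, and this is not the code of the webhook Lambda itself.&lt;/p&gt;

```python
import hashlib
import hmac
import json


def signature_is_valid(secret, body, signature_header):
    """Verify a 'sha256=HEX' signature header against the raw request body."""
    prefix = "sha256="
    if not signature_header.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    received = signature_header[len(prefix):]
    # compare_digest runs in constant time, so timing does not leak the digest.
    return hmac.compare_digest(expected.encode(), received.encode())


def handle_delivery(secret, body, signature_header):
    """Return (decision, labels) for one webhook delivery."""
    if not signature_is_valid(secret, body, signature_header):
        return ("reject", [])
    event = json.loads(body)
    if event.get("action") != "queued":
        return ("ignore", [])  # other workflow_job actions launch nothing
    labels = event.get("workflow_job", {}).get("labels", [])
    return ("enqueue", labels)
```

&lt;p&gt;The signature must be computed over the raw request bytes, before any JSON parsing; re-serializing the payload first can change the bytes and make a legitimate delivery fail the check.&lt;/p&gt;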

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fb8tn1ysgudcbwbgtbulk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fb8tn1ysgudcbwbgtbulk.png" alt="EC2 runner job lifecycle: GitHub webhook to API Gateway to webhook Lambda to SQS to scale-up Lambda to EC2 fleet, register, run, reap." width="800" height="2150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two real-world wrinkles live in steps 4–5, and both are worth knowing. First, &lt;strong&gt;capacity isn't guaranteed&lt;/strong&gt;: a &lt;code&gt;CreateFleet&lt;/code&gt; can come back short with errors like &lt;code&gt;InsufficientInstanceCapacity&lt;/code&gt; or — tellingly — &lt;code&gt;InsufficientFreeAddressesInSubnet&lt;/code&gt; (yes, you can run out of &lt;em&gt;subnet IPs&lt;/em&gt;; that's the §5 problem wearing a different hat). Forge classifies these errors rather than treating them all the same: a spot shortfall can &lt;strong&gt;fail over to on-demand&lt;/strong&gt;, and retryable capacity errors are re-queued instead of dropped on the floor. Second, a hard-won lesson: &lt;em&gt;a webhook with a valid HMAC signature is not necessarily a usable one.&lt;/em&gt; In one production incident, deliveries were signed correctly but carried an &lt;strong&gt;installation ID belonging to a different GitHub App&lt;/strong&gt; — so the token request, and thus runner creation, failed even though signature validation passed cleanly. The fix that went into the runbook: when runner creation fails, don't stop at "signature verified" — verify that the payload's installation context actually matches the app you think is serving the request. It's a perfect example of why the observability in §10 exists: the symptom ("runners won't start") was three layers away from the cause.&lt;/p&gt;
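
&lt;p&gt;Both wrinkles reduce to small decision functions. The sketch below is an assumption-heavy illustration in Python: the &lt;code&gt;Action&lt;/code&gt; names, the choice of which codes are retryable, and the function signatures are all hypothetical. Only the two EC2 error codes and the "check the installation, not just the signature" rule come from the text.&lt;/p&gt;

```python
from enum import Enum


class Action(Enum):
    FAIL_OVER_TO_ON_DEMAND = "fail over to on-demand"
    REQUEUE = "re-queue"
    FAIL = "fail"


# Both codes appear in the text; treating them as retryable is an assumed policy.
RETRYABLE_CODES = frozenset({
    "InsufficientInstanceCapacity",
    "InsufficientFreeAddressesInSubnet",
})


def classify_fleet_error(code, requested_spot, failover_enabled):
    """Map one CreateFleet error code to the next step for the queued job."""
    spot_shortfall = code == "InsufficientInstanceCapacity" and requested_spot
    if spot_shortfall and failover_enabled:
        return Action.FAIL_OVER_TO_ON_DEMAND
    if code in RETRYABLE_CODES:
        return Action.REQUEUE
    return Action.FAIL


def installation_matches(payload, expected_installation_id):
    """True only if the delivery belongs to the GitHub App installation we serve."""
    return payload.get("installation", {}).get("id") == expected_installation_id
```

&lt;p&gt;The second function is deliberately separate from signature validation: a delivery can pass one check and fail the other, which is exactly the incident described above.&lt;/p&gt;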

&lt;h3&gt;
  
  
  The Kubernetes (ARC) lane, step by step
&lt;/h3&gt;

&lt;p&gt;The Kubernetes lane reaches the same destination by a different road, because Kubernetes already has a scaling brain.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The same &lt;code&gt;workflow_job&lt;/code&gt; signal reaches the cluster, but here the &lt;strong&gt;ARC listener&lt;/strong&gt; for the tenant's scale set notices there's demand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ARC creates an ephemeral runner resource&lt;/strong&gt; (a pod spec) for the job.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Karpenter provisions a node if needed.&lt;/strong&gt; Karpenter is a Kubernetes autoscaler that watches for unschedulable pods and launches right-sized EC2 nodes to fit them (and removes them when idle). If the tenant's nodes are full or scaled to zero, Karpenter boots one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes schedules the runner pod.&lt;/strong&gt; For DinD, tenant isolation is enforced by the per-tenant Karpenter node pool plus taints/tolerations and required node affinity (§6); for plain &lt;code&gt;k8s&lt;/code&gt;, the checked-in template relies on namespace, service-account/IAM, runner-group, and GitHub-routing boundaries unless an extra scheduling layer is added.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The pod's containers come up:&lt;/strong&gt; the runner container itself, plus — for Docker builds — a rootless Docker-in-Docker sidecar, plus hook and log sidecars.&lt;/li&gt;
&lt;li&gt;The job runs in the pod; logs flow to Kubernetes logging and onward.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The ephemeral runner resource and its pod are deleted.&lt;/strong&gt; Idle nodes get consolidated away by Karpenter.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The reason Forge keeps &lt;em&gt;both&lt;/em&gt; lanes is now concrete: the EC2 path gives you a whole machine and total control but pays a full VM boot; the ARC path scales pods fast and bin-packs them onto shared nodes but constrains you to a container model. Different jobs genuinely want different trade-offs.&lt;/p&gt;

&lt;p&gt;One piece of plumbing makes both flows actually &lt;em&gt;route&lt;/em&gt; to the right place: &lt;strong&gt;runner groups.&lt;/strong&gt; Each tenant's runners register into a GitHub &lt;em&gt;runner group&lt;/em&gt; scoped to that tenant, and a small reconciler keeps the mapping correct — when a tenant adds a repository, it's automatically registered into the right runner group, so that repo's jobs land only on that tenant's runners. This is what stops one tenant's workflow from ever pulling another tenant's capacity, and it's why onboarding a new repo is a no-op for the platform team rather than a ticket.&lt;/p&gt;
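
&lt;p&gt;A reconciler of this kind is, at its core, a set difference between what configuration says and what GitHub currently has. The Python sketch below shows only that planning step; the data shapes, tenant names, and repository names are hypothetical, and the real reconciler would also have to call the GitHub API to apply the plan.&lt;/p&gt;

```python
def plan_runner_group_changes(desired, actual):
    """Compare tenant -> set-of-repos mappings and return the changes needed.

    Returns a dict of tenant -> (repos_to_add, repos_to_remove), listing only
    tenants whose runner group is out of sync.
    """
    plan = {}
    for tenant in sorted(set(desired) | set(actual)):
        want = desired.get(tenant, set())
        have = actual.get(tenant, set())
        to_add = sorted(want - have)
        to_remove = sorted(have - want)
        if to_add or to_remove:
            plan[tenant] = (to_add, to_remove)
    return plan


desired = {"acme": {"acme/api", "acme/web"}, "globex": {"globex/core"}}
actual = {"acme": {"acme/api", "acme/old"}, "globex": {"globex/core"}}
print(plan_runner_group_changes(desired, actual))
# {'acme': (['acme/web'], ['acme/old'])}
```

&lt;p&gt;Because the plan is computed from the full desired state each time, running it twice in a row produces an empty plan the second time, which is the property that makes a reconciler safe to run on a schedule.&lt;/p&gt;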

&lt;p&gt;&lt;strong&gt;One caveat on "adopt by changing one line."&lt;/strong&gt; That's true for a developer in an &lt;em&gt;already-onboarded&lt;/em&gt; repo — flip the &lt;code&gt;runs-on&lt;/code&gt; label and you're done. But onboarding a &lt;em&gt;new tenant&lt;/em&gt; is a controlled platform-team operation, not one line: create the tenant's Terragrunt config, register/install the GitHub App, set &lt;code&gt;repository_selection&lt;/code&gt; (&lt;code&gt;all&lt;/code&gt; or &lt;code&gt;selected&lt;/code&gt;), wire the runner group, configure any optional IAM-role trust and ECR access, and set the EC2/ARC runner specs (sizes, AMIs, images, resource limits). Forge makes that repeatable; it doesn't make it disappear. (We come back to it in §12.)&lt;/p&gt;

&lt;p&gt;With the flow in hand, every decision below is really an answer to the question: &lt;em&gt;"what breaks in that flow when you run it for forty teams in a small network, and how do you keep it secure and operable?"&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Trade-off #1 — Networking: the IP ceiling
&lt;/h2&gt;

&lt;p&gt;Networking is where most multi-tenant Kubernetes platforms quietly die, and it's the constraint that forced Forge's first and most instructive decision. To understand it you need one piece of AWS background.&lt;/p&gt;

&lt;h3&gt;
  
  
  The background: how pods get IP addresses on EKS
&lt;/h3&gt;

&lt;p&gt;By default, Amazon's EKS uses the &lt;strong&gt;AWS VPC CNI&lt;/strong&gt; (the &lt;code&gt;aws-node&lt;/code&gt; component) for pod networking. CNI stands for &lt;em&gt;Container Network Interface&lt;/em&gt; — the plugin that decides how a pod gets a network identity. The defining property of the AWS VPC CNI is that &lt;strong&gt;every pod gets a real, routable VPC IP address&lt;/strong&gt;, drawn from the same pool your EC2 instances use. That's lovely for interoperability (a pod is a first-class citizen on your network) and brutal for density, because of how AWS attaches addresses.&lt;/p&gt;

&lt;p&gt;An EC2 instance gets IPs through &lt;strong&gt;Elastic Network Interfaces (ENIs)&lt;/strong&gt;, which are virtual NICs. Each instance type supports a fixed maximum number of ENIs, and each ENI supports a fixed number of private IPs. So the number of pods you can run on a node is capped by hardware-ish limits:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;max pods per node ≈ (max ENIs) × (IPs per ENI − 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The constraint: a small, fixed network
&lt;/h3&gt;

&lt;p&gt;Forge's runners must live in corporate-provisioned VPCs, and those are &lt;strong&gt;small and non-negotiable&lt;/strong&gt;. The live data plane sits in a &lt;strong&gt;/24 VPC — 256 total addresses&lt;/strong&gt; — carved into two &lt;strong&gt;/25 subnets&lt;/strong&gt; of about 123 usable addresses each (AWS reserves five per subnet). You do not get to enlarge it. That's the box you're in.&lt;/p&gt;

&lt;h3&gt;
  
  
  The math that kills the default CNI
&lt;/h3&gt;

&lt;p&gt;Take a common worker size, a node that supports 8 ENIs at 30 IPs each. That's &lt;code&gt;8 × (30 − 1) = 232&lt;/code&gt; secondary addresses; EKS's published max-pods figure adds 2 for host-networked pods, so call it &lt;strong&gt;~234 pod IPs per node&lt;/strong&gt;. Now look at what that means in a /25 with ~123 usable addresses: &lt;strong&gt;a single fully-packed node would need almost twice the addresses the entire subnet contains.&lt;/strong&gt; Scale to the cluster's target of five nodes and you'd be asking for &lt;strong&gt;~1,170 IPs, roughly 4.5× the entire /24 VPC.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This isn't "the cluster runs slowly." It's "the cluster cannot be scheduled." The IP ceiling is hit before a single meaningful workload lands. And critically, throwing more compute at it makes things &lt;em&gt;worse&lt;/em&gt;, because more nodes means more IP demand. The thing everyone instinctively scales (compute) is the thing actively hurting you.&lt;/p&gt;
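
&lt;p&gt;The arithmetic is easy to reproduce. The Python sketch below recomputes the figures from this section; the CIDR &lt;code&gt;10.0.0.0/25&lt;/code&gt; is a placeholder (the real ranges are not given), and the &lt;code&gt;+ 2&lt;/code&gt; term follows EKS's published max-pods calculation, which counts two host-networked pods on top of the ENI-backed addresses.&lt;/p&gt;

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS holds back five addresses in every subnet


def usable_addresses(cidr):
    """Addresses a subnet can actually hand out."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def vpc_cni_max_pods(max_enis, ips_per_eni):
    """EKS max-pods figure: each ENI keeps one IP for itself; +2 host-networked pods."""
    return max_enis * (ips_per_eni - 1) + 2


subnet_usable = usable_addresses("10.0.0.0/25")  # placeholder CIDR
per_node = vpc_cni_max_pods(max_enis=8, ips_per_eni=30)
five_nodes = 5 * per_node

print(subnet_usable, per_node, five_nodes)  # 123 234 1170
print(round(per_node / subnet_usable, 2))   # 1.9
print(round(five_nodes / 256, 2))           # 4.57
```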

&lt;h3&gt;
  
  
  The decision: replace the CNI with an overlay
&lt;/h3&gt;

&lt;p&gt;Forge deletes the AWS VPC CNI and installs &lt;strong&gt;Calico&lt;/strong&gt;, configured as an &lt;strong&gt;overlay network&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl delete daemonset &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system aws-node &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true
&lt;/span&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; .../calico/tigera-operator.yaml &lt;span class="nt"&gt;--server-side&lt;/span&gt;
&lt;span class="c"&gt;# Installation custom resource:&lt;/span&gt;
spec:
  cni:           &lt;span class="o"&gt;{&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;: Calico &lt;span class="o"&gt;}&lt;/span&gt;
  calicoNetwork: &lt;span class="o"&gt;{&lt;/span&gt; bgp: Disabled &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An &lt;strong&gt;overlay&lt;/strong&gt; network gives pods IP addresses from a private, cluster-internal range (a CIDR that exists only inside Kubernetes) and &lt;em&gt;encapsulates&lt;/em&gt; pod-to-pod traffic so it can travel across the real VPC without the VPC needing to know about every pod IP. The VPC route tables never see pod addresses. (&lt;code&gt;bgp: Disabled&lt;/code&gt; tells Calico not to advertise routes via the BGP routing protocol — appropriate here, since we're encapsulating rather than peering pod routes into the network fabric.)&lt;/p&gt;

&lt;p&gt;The consequence is the whole point: &lt;strong&gt;VPC IP addresses are now consumed only by nodes, not by pods&lt;/strong&gt; — and &lt;strong&gt;a node is exactly one VPC IP&lt;/strong&gt;, whether it packs one pod or a hundred. (Hold onto that fact; in §10 it quietly flips a cost recommendation.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcxlguxl7diifblb8n1c1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fcxlguxl7diifblb8n1c1.png" alt="VPC CNI vs Calico: with the default CNI every pod consumes a VPC IP; with Calico only nodes consume VPC IPs (a node = 1 IP)." width="799" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now do the math again. A /24 has 256 addresses, of which AWS reserves five per subnet, so it can hold on the order of &lt;strong&gt;250 nodes&lt;/strong&gt; instead of &lt;em&gt;less than one node's worth of pods&lt;/em&gt;. Pod density stops being an AWS-imposed accident and becomes a deliberate policy choice (Forge fixes it at &lt;code&gt;maxPods: 100&lt;/code&gt; per node), completely decoupled from ENI limits. The ceiling that made the platform impossible is simply gone.&lt;/p&gt;
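
&lt;p&gt;To make that arithmetic concrete, here is a minimal sketch. The CIDR &lt;code&gt;10.0.0.0/24&lt;/code&gt; and the function names are illustrative assumptions, not values from Forge; the five reserved addresses per subnet are standard AWS behavior, and the pod figure is an upper bound that ignores system pods.&lt;/p&gt;

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr):
    """Addresses in a subnet that can actually be assigned."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def overlay_capacity(cidr, max_pods_per_node):
    """With an overlay CNI each node costs one VPC IP and pods cost none."""
    nodes = usable_ips(cidr)
    return {"nodes": nodes, "pods": nodes * max_pods_per_node}


# Hypothetical subnet; maxPods value taken from the article.
print(overlay_capacity("10.0.0.0/24", 100))  # {'nodes': 251, 'pods': 25100}
```

&lt;p&gt;The point of the sketch is the shape of the formula: once pods stop consuming VPC addresses, subnet size bounds the node count only, and pod capacity is that count multiplied by a number you choose.&lt;/p&gt;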

&lt;h3&gt;
  
  
  What it costs (because nothing is free)
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What it buys:&lt;/strong&gt; breaks the IP ceiling; lets you run dense pod workloads inside tiny, fixed subnets; pod density becomes a tuning knob instead of a hard wall.&lt;br&gt;
&lt;strong&gt;What it costs:&lt;/strong&gt; you now run and upgrade a &lt;em&gt;second networking layer&lt;/em&gt; with its own release cadence and its own failure modes. Overlay networking has real operational edges — image-pull problems and operator-ordering bugs have caused incidents serious enough to have named fix branches in the repo. And bring-up ordering is genuinely fragile: the &lt;code&gt;aws-node&lt;/code&gt; deletion is best-effort (&lt;code&gt;|| true&lt;/code&gt;), and every node must carry a synthetic dependency so it can never &lt;em&gt;join the cluster before the CNI swap completes&lt;/em&gt; — a node that registers with no CNI has no working pod network at all, which is a confusing way to fail.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The lesson worth taking even if you never run Kubernetes:&lt;/strong&gt; at scale, the binding constraint is frequently &lt;em&gt;not&lt;/em&gt; the resource everyone watches. Here it was IP addresses, not CPU. There was no "add more nodes" answer (that makes it worse) and no "resize the subnet" answer (you don't control it), so the only move was to change the networking layer itself. Find your real ceiling before you optimize the comfortable one. This is decision #1 of about a dozen.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Trade-off #2 — Isolation: per-tenant DinD node pools
&lt;/h2&gt;

&lt;p&gt;This is the clearest place in the entire platform where Forge chose isolation &lt;em&gt;over&lt;/em&gt; efficiency, knowingly, and paid cash for it. That's exactly what makes it a good teaching example — it's an explicit trade, not a default.&lt;/p&gt;

&lt;h3&gt;
  
  
  The background: Karpenter, taints, tolerations, affinity
&lt;/h3&gt;

&lt;p&gt;A few Kubernetes concepts make this section legible:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Karpenter&lt;/strong&gt; is an autoscaler that provisions EC2 nodes on demand to fit pending pods, using a &lt;strong&gt;&lt;code&gt;NodePool&lt;/code&gt;&lt;/strong&gt; (rules for what kind of nodes to create) and an &lt;strong&gt;&lt;code&gt;EC2NodeClass&lt;/code&gt;&lt;/strong&gt; (the AWS specifics — AMI, IAM role, subnets). When pods need a home and none fits, Karpenter launches a node; when nodes go idle, it removes them.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;taint&lt;/strong&gt; is a "keep off" mark on a node: by default, pods won't schedule onto a tainted node.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;toleration&lt;/strong&gt; is a pod's permission slip that lets it &lt;em&gt;tolerate&lt;/em&gt; a specific taint — i.e., it's &lt;em&gt;allowed&lt;/em&gt; onto that node. Important subtlety: a toleration lets a pod land on a tainted node; it does &lt;strong&gt;not&lt;/strong&gt; force it to. A pod with a toleration could still land somewhere else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;nodeAffinity&lt;/code&gt;&lt;/strong&gt; is a pod's &lt;em&gt;requirement&lt;/em&gt; about which nodes it must run on. A &lt;code&gt;requiredDuringSchedulingIgnoredDuringExecution&lt;/code&gt; affinity (shortened to &lt;code&gt;requiredDuringScheduling&lt;/code&gt; below) is a hard rule: schedule me &lt;em&gt;only&lt;/em&gt; on nodes matching this.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The decision: per-tenant nodes for the path that needs them
&lt;/h3&gt;

&lt;p&gt;The strongest node-level isolation is applied where the risk actually lives: the &lt;strong&gt;DinD path&lt;/strong&gt;. Each DinD tenant gets its &lt;strong&gt;own Karpenter &lt;code&gt;NodePool&lt;/code&gt; and &lt;code&gt;EC2NodeClass&lt;/code&gt;&lt;/strong&gt; (named &lt;code&gt;karpenter-&amp;lt;tenant&amp;gt;&lt;/code&gt;), and those nodes are stamped with two taints:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;taints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;forge.local/scale_set_type&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;dind&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;NoSchedule&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;forge.local/tenant&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;          &lt;span class="nv"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;&amp;lt;tenant&amp;gt;&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;NoSchedule&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A tenant's build pods then carry &lt;strong&gt;both&lt;/strong&gt; matching tolerations &lt;strong&gt;and&lt;/strong&gt; a hard &lt;code&gt;requiredDuringScheduling&lt;/code&gt; nodeAffinity on those same two keys. You need both halves, and understanding why is the whole point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;tolerations&lt;/strong&gt; let the tenant's pods onto the tenant's (tainted) nodes — and the taints keep &lt;em&gt;everyone else's&lt;/em&gt; pods off.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;required affinity&lt;/strong&gt; stops the tenant's own pods from wandering onto some other, untainted node. Tolerations alone would permit that drift; affinity forbids it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Belt and suspenders. The result is that a DinD tenant's jobs can land &lt;strong&gt;only&lt;/strong&gt; on that tenant's machines — enforced by the scheduler, not by hope or convention. (To be precise about scope: this hard taint/affinity node-pinning is the DinD template's behavior. The plain &lt;code&gt;k8s&lt;/code&gt; runner mode still gets real isolation — its own namespace, service account, runner group, and IAM role — but it doesn't carry the same tenant taint/affinity in the checked-in template; if you need hard node-pinning for non-DinD jobs too, you add that scheduling layer explicitly.)&lt;/p&gt;
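
&lt;p&gt;The two-halves argument can be checked with a toy model of the scheduler's decision. This is a simplified sketch, not the real Kubernetes scheduler: taints and tolerations are compared as exact tuples, and it assumes the tenant nodes carry labels under the same two keys as the taints. The tenant names &lt;code&gt;team-a&lt;/code&gt; and &lt;code&gt;team-b&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
KEY_TYPE = "forge.local/scale_set_type"
KEY_TENANT = "forge.local/tenant"


def tolerates(pod, node):
    """True if every taint on the node is matched by one of the pod's tolerations."""
    return all(taint in pod["tolerations"] for taint in node["taints"])


def affinity_ok(pod, node):
    """True if the node's labels satisfy every required key/value on the pod."""
    required = pod["required_node_labels"]
    return all(node["labels"].get(k) == v for k, v in required.items())


def can_schedule(pod, node):
    return tolerates(pod, node) and affinity_ok(pod, node)


def dind_taints(tenant):
    return [(KEY_TYPE, "dind", "NoSchedule"), (KEY_TENANT, tenant, "NoSchedule")]


def dind_labels(tenant):
    return {KEY_TYPE: "dind", KEY_TENANT: tenant}


node_a = {"taints": dind_taints("team-a"), "labels": dind_labels("team-a")}
shared = {"taints": [], "labels": {}}

pinned_a = {"tolerations": dind_taints("team-a"), "required_node_labels": dind_labels("team-a")}
tolerate_only_a = {"tolerations": dind_taints("team-a"), "required_node_labels": {}}
pinned_b = {"tolerations": dind_taints("team-b"), "required_node_labels": dind_labels("team-b")}

print(can_schedule(pinned_a, node_a))         # True: the intended placement
print(can_schedule(pinned_b, node_a))         # False: the taints keep other tenants off
print(can_schedule(tolerate_only_a, shared))  # True: tolerations alone allow drift
print(can_schedule(pinned_a, shared))         # False: required affinity forbids it
```

&lt;p&gt;The third line is the failure mode the affinity exists to close: a pod that only tolerates the taints is still welcome on any untainted node.&lt;/p&gt;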

&lt;p&gt;One implementation detail worth admiring: the per-tenant &lt;code&gt;EC2NodeClass&lt;/code&gt; isn't hand-written forty times. A data source reads the shared node class (&lt;code&gt;kubectl get ec2nodeclass karpenter -o yaml&lt;/code&gt;), strips the server-managed fields with &lt;code&gt;yq&lt;/code&gt;, and renames it with &lt;code&gt;jq&lt;/code&gt; to &lt;code&gt;karpenter-&amp;lt;tenant&amp;gt;&lt;/code&gt;. So every tenant inherits identical, correct AMI/role/subnet wiring under its own name. Sameness by construction; difference only in the label.&lt;/p&gt;
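
&lt;p&gt;The clone-and-rename step is easy to picture on plain data. The sketch below works on a dictionary rather than calling &lt;code&gt;kubectl&lt;/code&gt;, &lt;code&gt;yq&lt;/code&gt;, or &lt;code&gt;jq&lt;/code&gt;; the exact list of stripped fields and the sample manifest values are assumptions for illustration, not a copy of what the repo does.&lt;/p&gt;

```python
import copy

# Fields the API server fills in (assumed list); a manifest that is
# re-applied under a new name should not carry them.
SERVER_MANAGED_METADATA = (
    "uid",
    "resourceVersion",
    "creationTimestamp",
    "generation",
    "managedFields",
)


def clone_node_class(shared, tenant):
    """Copy the shared node class, drop server-managed fields, rename per tenant."""
    clone = copy.deepcopy(shared)
    clone.pop("status", None)
    metadata = clone["metadata"]
    for field in SERVER_MANAGED_METADATA:
        metadata.pop(field, None)
    metadata["name"] = "karpenter-" + tenant
    return clone


# Hypothetical shared node class as it might come back from the cluster.
shared_node_class = {
    "kind": "EC2NodeClass",
    "metadata": {"name": "karpenter", "uid": "example-uid", "resourceVersion": "42"},
    "spec": {"role": "example-node-role", "amiFamily": "AL2023"},
    "status": {"conditions": []},
}

tenant_class = clone_node_class(shared_node_class, "team-a")
print(tenant_class["metadata"])  # {'name': 'karpenter-team-a'}
```

&lt;p&gt;Because the &lt;code&gt;spec&lt;/code&gt; is copied untouched, every tenant's node class can differ from the shared one only in its name.&lt;/p&gt;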

&lt;h3&gt;
  
  
  Why Docker-in-Docker is the forcing function
&lt;/h3&gt;

&lt;p&gt;You might ask: why pay for per-tenant nodes at all? The answer is &lt;strong&gt;Docker-in-Docker (DinD)&lt;/strong&gt;. Many CI jobs build container images, which means they need a Docker daemon, which traditionally means a &lt;em&gt;privileged&lt;/em&gt; container — one with elevated host access. Running privileged builds from multiple tenants on shared nodes is an unacceptable blast radius: a malicious or simply buggy privileged build could reach a neighbor. Pinning each tenant to its own nodes contains that.&lt;/p&gt;

&lt;p&gt;And even within that, Forge adds a second layer: DinD runs &lt;strong&gt;rootless&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# dind sidecar container (abridged)&lt;/span&gt;
&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker:dind-rootless&lt;/span&gt;
&lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;runAsUser&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1001&lt;/span&gt;
  &lt;span class="na"&gt;privileged&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;   &lt;span class="c1"&gt;# the nested runtime needs it, but the user is 1001, not root&lt;/span&gt;
&lt;span class="c1"&gt;# subuid/subgid: 100000:65536 (user-namespace remap range)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;subuid/subgid&lt;/code&gt; remapping is &lt;strong&gt;user-namespace remapping&lt;/strong&gt;: processes that think they're running as root &lt;em&gt;inside&lt;/em&gt; the build are actually mapped to an unprivileged high-numbered UID (100000+) on the host. So even the one genuinely privileged thing in the system is defanged at the user level. You get node-level isolation &lt;em&gt;and&lt;/em&gt; user-level isolation wrapped around the riskiest capability Forge offers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What it buys:&lt;/strong&gt; a blast radius of exactly one tenant, enforced by the scheduler; the ability to offer privileged Docker builds &lt;em&gt;safely at all&lt;/em&gt;; a noisy or compromised build that cannot touch a neighbor.&lt;br&gt;
&lt;strong&gt;What it costs:&lt;/strong&gt; &lt;strong&gt;money.&lt;/strong&gt; One DinD node pool per tenant means worse bin-packing and more partially-idle nodes than a single shared pool would have. It also leans on a Karpenter capacity-diversity feature the manifest's own comment flags as &lt;em&gt;alpha&lt;/em&gt;, and it multiplies the number of &lt;code&gt;NodePool&lt;/code&gt;/&lt;code&gt;EC2NodeClass&lt;/code&gt; objects to manage. Forge bounds the cost with mitigations — DinD scales to zero when idle (&lt;code&gt;min_runners: 0&lt;/code&gt;), idle nodes consolidate quickly (&lt;code&gt;consolidateAfter: 1m&lt;/code&gt;), and a &lt;code&gt;karpenter.sh/do-not-disrupt&lt;/code&gt; annotation protects a node that's mid-job from being consolidated out from under it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; isolation is a &lt;em&gt;dial, not a boolean&lt;/em&gt;. The engineering maturity isn't "maximize isolation" or "maximize density" — it's to choose, consciously, where you sit on the cost↔blast-radius curve for &lt;em&gt;your&lt;/em&gt; threat model, and to write that choice down so the next person knows it was deliberate. Forge turned the dial toward isolation because privileged builds demanded it, and paid for it on purpose. Decision #2.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Trade-off #3 — Identity: zero static credentials
&lt;/h2&gt;

&lt;p&gt;CI is one of the highest-value targets in any organization: it executes code with access to credentials, source, artifacts, and internal networks. If you're going to attack a company, its build system is a wonderful place to start. So Forge's credential model is built around one non-negotiable rule: &lt;strong&gt;no long-lived secrets in pipelines, ever.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The background: roles, AssumeRole, and why "no keys" is even possible
&lt;/h3&gt;

&lt;p&gt;In AWS, you can grant access two ways. The old way is a &lt;strong&gt;static credential&lt;/strong&gt; — an access key and secret, long-lived, that you stash somewhere (a workflow secret, an env var) and that works until someone rotates it. Static keys are the thing that leaks: copied into logs, committed by accident, shared between teams, never rotated.&lt;/p&gt;

&lt;p&gt;The modern way is &lt;strong&gt;role assumption&lt;/strong&gt;. An IAM &lt;strong&gt;role&lt;/strong&gt; is a set of permissions that an &lt;em&gt;authorized identity&lt;/em&gt; can temporarily borrow by calling AWS STS (Security Token Service) &lt;code&gt;AssumeRole&lt;/code&gt;, which mints &lt;strong&gt;short-lived credentials&lt;/strong&gt; that expire in minutes to hours. Nothing long-lived is stored anywhere. The question is just: what proves you're authorized to assume the role? That proof is the runner's &lt;em&gt;ambient&lt;/em&gt; identity.&lt;/p&gt;

&lt;h3&gt;
  
  
  The model: short-lived role-chaining off the runner's identity
&lt;/h3&gt;

&lt;p&gt;A Forge runner is given an ambient AWS identity — on EC2, an &lt;strong&gt;instance profile&lt;/strong&gt;; on Kubernetes, an &lt;strong&gt;EKS Pod Identity&lt;/strong&gt; association. That identity is permitted to &lt;code&gt;AssumeRole&lt;/code&gt; into the &lt;em&gt;tenant's own&lt;/em&gt; role, and the tenant configures that role to &lt;em&gt;trust&lt;/em&gt; the Forge runner role. The workflow then uses the standard &lt;code&gt;aws-actions/configure-aws-credentials&lt;/code&gt; action with &lt;code&gt;role-to-assume&lt;/code&gt;, gets short-lived creds, and does its work. No static key ever exists.&lt;/p&gt;

&lt;p&gt;One precise correction, because it's almost always described wrong: this is &lt;strong&gt;STS role-chaining off the runner's instance/pod identity — not GitHub OIDC.&lt;/strong&gt; The trust originates from the AWS identity Forge attaches to the runner, not from a token GitHub issues. (GitHub OIDC is a fine pattern; it's just not the one in play here, and conflating them will mislead you when you debug.) And the tenant stays in control of exactly how far that access reaches: the role Forge assumes can grant direct resource access, or it can be the first hop in a &lt;strong&gt;chain of role assumptions&lt;/strong&gt; into further accounts — the tenant decides the scope by writing the policies, not the platform.&lt;/p&gt;

&lt;p&gt;On Kubernetes the path is deliberately &lt;em&gt;dual&lt;/em&gt;. &lt;strong&gt;Pod Identity is primary.&lt;/strong&gt; But DinD scale sets &lt;em&gt;also&lt;/em&gt; get an &lt;strong&gt;IRSA&lt;/strong&gt; trust — IAM Roles for Service Accounts, where a projected Kubernetes service-account token (audience &lt;code&gt;sts.amazonaws.com&lt;/code&gt;) is exchanged for AWS credentials via &lt;code&gt;AssumeRoleWithWebIdentity&lt;/code&gt;. Why carry two mechanisms? Because inside Docker-in-Docker, the Pod Identity agent's link-local metadata hop (&lt;code&gt;169.254.170.23&lt;/code&gt;) isn't reliably reachable, so the projected token is the fallback that keeps AWS auth working from &lt;em&gt;within&lt;/em&gt; the nested Docker runtime. It's the kind of detail you only learn by getting paged. (One precision note, since "OIDC" gets thrown around loosely: &lt;em&gt;GitHub&lt;/em&gt; OIDC is not the trust root for tenant access here — but &lt;em&gt;AWS/EKS&lt;/em&gt; OIDC does appear, via IRSA, on this DinD fallback path. Two different OIDCs; keep them separate when you debug.)&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem: the tenant's mistake is your incident
&lt;/h3&gt;

&lt;p&gt;Here's the operational reality that drives the next design. The overwhelming majority of onboarding failures are &lt;strong&gt;not runner bugs&lt;/strong&gt; — they're &lt;strong&gt;IAM trust mistakes&lt;/strong&gt;: the tenant typo'd an ARN, or pointed the trust at the wrong principal, or allowed &lt;code&gt;sts:AssumeRole&lt;/code&gt; but forgot &lt;code&gt;sts:TagSession&lt;/code&gt; (which Forge needs for session tagging). And these mistakes are invisible until a &lt;em&gt;real&lt;/em&gt; job runs and fails — usually at the worst possible moment, in front of the tenant.&lt;/p&gt;

&lt;p&gt;In a multi-tenant platform, the boundary you don't control (the tenant's IAM) is the boundary that pages &lt;em&gt;you&lt;/em&gt;. So Forge validates that boundary proactively, on a schedule, with a small purpose-built robot.&lt;/p&gt;
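
&lt;p&gt;Two of those mistakes (wrong principal, missing &lt;code&gt;sts:TagSession&lt;/code&gt;) are visible in the trust policy document itself. As a rough illustration of what "correct" looks like, here is a small offline check. It is a sketch under stated assumptions: the ARNs are made up, and it deliberately ignores wildcards, conditions, and &lt;code&gt;Deny&lt;/code&gt; statements, so it complements a live test rather than replacing one.&lt;/p&gt;

```python
REQUIRED_ACTIONS = {"sts:AssumeRole", "sts:TagSession"}


def as_list(value):
    """IAM lets many fields be either a single value or a list."""
    return value if isinstance(value, list) else [value]


def trust_problems(trust_policy, runner_role_arn):
    """Return human-readable problems; an empty list means the trust looks right."""
    allowed = set()
    for stmt in as_list(trust_policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        aws = as_list(principal.get("AWS", [])) if isinstance(principal, dict) else []
        if runner_role_arn in aws:
            allowed.update(as_list(stmt.get("Action", [])))
    if not allowed:
        return ["no Allow statement trusts " + runner_role_arn]
    return ["missing action " + a for a in sorted(REQUIRED_ACTIONS - allowed)]


# Hypothetical ARN for the Forge runner role.
RUNNER = "arn:aws:iam::111111111111:role/forge-runner-example"

forgot_tag_session = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Principal": {"AWS": RUNNER}, "Action": "sts:AssumeRole"}
    ],
}
correct = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": RUNNER},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }
    ],
}

print(trust_problems(forgot_tag_session, RUNNER))  # ['missing action sts:TagSession']
print(trust_problems(correct, RUNNER))             # []
```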

&lt;h3&gt;
  
  
  The robot: a trust-validator
&lt;/h3&gt;

&lt;p&gt;The trust-validator is a deliberately built system of &lt;strong&gt;two Lambdas plus a delay queue&lt;/strong&gt;, running every ten minutes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Frvkw8csyopkuq68ygeat.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Frvkw8csyopkuq68ygeat.png" alt="Trust-validator: preparer Lambda injects temp trust, SQS delay ~300s for IAM propagation, validator tests AssumeRole + TagSession, then strips trust." width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two design choices here are pure real-world scar tissue, and both teach something:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why two Lambdas instead of one?&lt;/strong&gt; The &lt;em&gt;preparer&lt;/em&gt; has to inject temporary trust for the validator's role ARN &lt;em&gt;before&lt;/em&gt; the validator ever runs, so it needs that ARN in advance. If the validator's role ARN were only available as a normal Terraform output, you'd have a chicken-and-egg dependency cycle. Forge breaks it by computing the validator's role ARN deterministically in Terraform, so the preparer can reference it ahead of time. Splitting prepare-and-validate into two functions is what makes that clean.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why a 300-second delay?&lt;/strong&gt; Because &lt;strong&gt;IAM is eventually consistent.&lt;/strong&gt; When you write a trust policy, that change is not instantly visible at every STS endpoint — for a short window you'll get a spurious &lt;code&gt;AccessDenied&lt;/code&gt; even though the policy is correct. A naive validator would report false failures constantly. The delay (the variable is literally named &lt;code&gt;iam_propagation_delay_seconds&lt;/code&gt;) gives the change time to propagate, and as a second safety net &lt;code&gt;AccessDenied&lt;/code&gt; is explicitly treated as &lt;em&gt;retryable&lt;/em&gt; with backoff. "Eventual consistency" sounds academic until it's flaking your validator every ten minutes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And note the validator tests &lt;code&gt;TagSession&lt;/code&gt; &lt;em&gt;separately&lt;/em&gt; from &lt;code&gt;AssumeRole&lt;/code&gt;, precisely because a tenant role can permit one and forget the other — checking only the first gives you a false green. The whole thing wraps cleanup in a &lt;code&gt;finally&lt;/code&gt; so the temporary trust is &lt;em&gt;always&lt;/em&gt; removed, even when validation throws.&lt;/p&gt;
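
&lt;p&gt;The control flow described here (retry on &lt;code&gt;AccessDenied&lt;/code&gt;, test the two permissions separately, always clean up) fits in a few lines. This is a minimal sketch, not the Lambda's code: the AWS calls are passed in as plain callables, &lt;code&gt;AccessDenied&lt;/code&gt; is a stand-in exception class, and the retry counts and delays are assumed values.&lt;/p&gt;

```python
import time


class AccessDenied(Exception):
    """Stand-in for the AccessDenied error an STS call would raise."""


def with_retry(call, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run call(); on AccessDenied wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except AccessDenied:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def validate_tenant(assume_role, tag_session, remove_trust, sleep=time.sleep):
    """Check both permissions separately; always strip the temporary trust."""
    result = {"AssumeRole": False, "TagSession": False}
    try:
        for name, check in (("AssumeRole", assume_role), ("TagSession", tag_session)):
            try:
                with_retry(check, sleep=sleep)
                result[name] = True
            except AccessDenied:
                pass  # still denied after every retry: record a real failure
    finally:
        remove_trust()  # runs even if a check raises something unexpected
    return result


# Demo with fakes: AssumeRole is denied twice (propagation lag) and then works;
# TagSession is never permitted.
calls = {"assume": 0, "cleanup": 0}


def flaky_assume():
    calls["assume"] += 1
    if calls["assume"] in (1, 2):
        raise AccessDenied()


def denied_tag():
    raise AccessDenied()


def cleanup():
    calls["cleanup"] += 1


outcome = validate_tenant(flaky_assume, denied_tag, cleanup, sleep=lambda s: None)
print(outcome)  # {'AssumeRole': True, 'TagSession': False}
```

&lt;p&gt;The demo shows the false green that a single combined check would produce: &lt;code&gt;AssumeRole&lt;/code&gt; eventually passes, &lt;code&gt;TagSession&lt;/code&gt; never does, and only a separate result per permission reveals it.&lt;/p&gt;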

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What it buys:&lt;/strong&gt; trust problems caught proactively, before any real job depends on them; a clear per-tenant pass/fail across both &lt;code&gt;AssumeRole&lt;/code&gt; and &lt;code&gt;TagSession&lt;/code&gt;; what would be recurring 2am pages turned into a ten-minute cron job.&lt;br&gt;
&lt;strong&gt;What it costs:&lt;/strong&gt; you build and operate an actual small distributed system — two Lambdas, a delay queue, live IAM mutation — &lt;em&gt;purely to validate configuration&lt;/em&gt;. Cleanup has to be bulletproof (that &lt;code&gt;finally&lt;/code&gt; is load-bearing — a missed cleanup leaves a trust door ajar), and handling eventual consistency adds genuine complexity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; in any multi-tenant system, the boundary you don't own is the boundary that wakes you up. Investing in &lt;em&gt;synthetic, scheduled validation&lt;/em&gt; of that boundary is what converts an entire class of inevitable incidents into a dashboard you glance at. For a platform run with near-zero ops, that conversion isn't a nicety — it's the mechanism that makes near-zero ops possible. Decision #3.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Trade-off #4 — Immutable clusters &amp;amp; blue/green
&lt;/h2&gt;

&lt;p&gt;Upgrading a live Kubernetes cluster is one of the more anxiety-inducing operations in this whole space. An EKS upgrade is really a coordinated upgrade of many coupled things at once — the control plane version, the add-ons (CoreDNS, kube-proxy, the EBS CSI driver), Karpenter, and the CNI — and any one of them can regress in a way that's hard to undo while production traffic is flowing through the cluster. Forge sidesteps the anxiety with a simple, radical stance: &lt;strong&gt;never upgrade a cluster in place.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The background: immutable infrastructure
&lt;/h3&gt;

&lt;p&gt;"Immutable infrastructure" means you don't modify running things — you replace them. Instead of patching a server, you build a new image and roll it out. Instead of upgrading a cluster, you build a &lt;em&gt;new&lt;/em&gt; cluster and move work to it. The payoff is that rollback becomes trivial (the old thing is still there, untouched) and "upgrade" stops being a high-wire act. The cost is that you need a clean way to &lt;em&gt;move work&lt;/em&gt; between the old and new things.&lt;/p&gt;

&lt;h3&gt;
  
  
  The decision: disposable clusters in blue/green pairs
&lt;/h3&gt;

&lt;p&gt;Forge runs EKS clusters as immutable and disposable, in &lt;strong&gt;blue/green pairs&lt;/strong&gt; — at any time there's a &lt;code&gt;-blue&lt;/code&gt; and a &lt;code&gt;-green&lt;/code&gt; cluster, one active. To upgrade, you stand up the fresh sibling and migrate tenants onto it, one at a time. The entire cutover is controlled by just &lt;strong&gt;two values in a tenant's configuration&lt;/strong&gt;: &lt;code&gt;arc_cluster_name&lt;/code&gt; (which cluster this tenant lives on) and &lt;code&gt;migrate_arc_cluster&lt;/code&gt; (a switch that tears down the tenant's footprint on its current cluster).&lt;/p&gt;

&lt;p&gt;The mechanism that makes this clean is subtle and worth internalizing: the ARC module &lt;strong&gt;never hardcodes a cluster endpoint.&lt;/strong&gt; Its Kubernetes and Helm providers resolve dynamically from a data source keyed on the cluster &lt;em&gt;name&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
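&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_eks_cluster"&lt;/span&gt; &lt;span class="s2"&gt;"c"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks_cluster_name&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;# = arc_cluster_name&lt;/span&gt;
&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"aws_eks_cluster_auth"&lt;/span&gt; &lt;span class="s2"&gt;"c"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;eks_cluster_name&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"kubernetes"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;host&lt;/span&gt;                   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_eks_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_ca_certificate&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;base64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_eks_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;certificate_authority&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;token&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_eks_cluster_auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;c&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;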
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because everything resolves from the &lt;em&gt;name&lt;/em&gt;, flipping &lt;code&gt;arc_cluster_name&lt;/code&gt; from &lt;code&gt;-green&lt;/code&gt; to &lt;code&gt;-blue&lt;/code&gt; makes the very next apply repoint &lt;em&gt;every&lt;/em&gt; in-cluster resource at the other cluster — no endpoints to edit, no state to surgically move. And &lt;code&gt;migrate_arc_cluster: true&lt;/code&gt; acts as a master switch threaded through the module tree: when true, every in-cluster resource (the namespace, the ARC controller, the scale set, the Pod Identity association, the Karpenter node pool) is set to &lt;code&gt;count = 0&lt;/code&gt; and destroyed. One detail makes the whole thing safe: the &lt;strong&gt;IAM runner role is deliberately &lt;em&gt;not&lt;/em&gt; gated&lt;/strong&gt; by that switch, so it survives the migration — which means the tenant's trust relationships (and the trust-validator from §7) don't churn just because you moved clusters.&lt;/p&gt;
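
&lt;p&gt;The gating rule is small enough to state as a table. In this sketch the resource names are informal labels for the items listed above, not the module's actual resource addresses; the only thing it models is which objects follow the switch and which one does not.&lt;/p&gt;

```python
# Resources that live inside the cluster and follow the migration switch.
GATED = (
    "namespace",
    "arc_controller",
    "scale_set",
    "pod_identity_association",
    "karpenter_node_pool",
)
# The IAM runner role is deliberately left out of the gate.
UNGATED = ("iam_runner_role",)


def resource_counts(migrate_arc_cluster):
    """Mirror of the count = 0 gate: gated resources vanish while migrating."""
    counts = {name: 0 if migrate_arc_cluster else 1 for name in GATED}
    counts.update({name: 1 for name in UNGATED})
    return counts


print(resource_counts(True)["scale_set"])        # 0
print(resource_counts(True)["iam_runner_role"])  # 1
```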

&lt;h3&gt;
  
  
  The sequence, concretely
&lt;/h3&gt;

&lt;p&gt;A migration script drives this per tenant:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9d1k9sf3edpltdvzatsj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F9d1k9sf3edpltdvzatsj.png" alt="Blue/green migration: detect direction, drain to zero, disable on old, pre-stage new, enable on new, re-apply trust-validator." width="800" height="1787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Detect direction&lt;/strong&gt;: read the current &lt;code&gt;arc_cluster_name&lt;/code&gt;; figure out green→blue or blue→green (and hard-error if it's neither, rather than guess).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drain to zero&lt;/strong&gt;: patch each runner scale set to &lt;code&gt;minRunners: 0, maxRunners: 0&lt;/code&gt;, then wait until no runner pods remain. You don't yank work out from under running jobs; you stop accepting new ones and let the in-flight ones finish.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disable on OLD&lt;/strong&gt;: set &lt;code&gt;migrate_arc_cluster = true&lt;/code&gt; with the old name and apply. The tenant's footprint on the source cluster is torn down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pre-stage on NEW&lt;/strong&gt;: flip &lt;code&gt;arc_cluster_name&lt;/code&gt; to the target, keep &lt;code&gt;migrate_arc_cluster = true&lt;/code&gt;, and apply. Providers now point at the new cluster, but nothing is installed yet.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable on NEW&lt;/strong&gt;: set &lt;code&gt;migrate_arc_cluster = false&lt;/code&gt; with the target name and apply, then re-apply the trust-validator against the new home.&lt;/li&gt;
&lt;/ol&gt;
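
&lt;p&gt;The five steps reduce to a direction check plus an ordered list of configuration states. The sketch below is an illustration rather than the migration script itself: it assumes cluster names end in &lt;code&gt;-blue&lt;/code&gt; or &lt;code&gt;-green&lt;/code&gt;, and the name &lt;code&gt;forge-example-green&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def other_color(cluster_name):
    """Return the sibling cluster name, or raise if the name is neither color."""
    for old, new in (("-green", "-blue"), ("-blue", "-green")):
        if cluster_name.endswith(old):
            return cluster_name[: -len(old)] + new
    raise ValueError("cannot detect migration direction from " + repr(cluster_name))


def migration_plan(current_cluster):
    """Ordered (step, settings) pairs; each apply uses the settings shown."""
    target = other_color(current_cluster)  # step 1: detect direction or fail
    return [
        ("drain to zero", {"minRunners": 0, "maxRunners": 0}),
        ("disable on old", {"arc_cluster_name": current_cluster, "migrate_arc_cluster": True}),
        ("pre-stage on new", {"arc_cluster_name": target, "migrate_arc_cluster": True}),
        ("enable on new", {"arc_cluster_name": target, "migrate_arc_cluster": False}),
    ]


for step, settings in migration_plan("forge-example-green"):
    print(step, settings)
```

&lt;p&gt;Writing it this way makes the runner gap visible: the tenant has no runners from the first entry until the last one is applied.&lt;/p&gt;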

&lt;h3&gt;
  
  
  The honest trade-off
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What it buys:&lt;/strong&gt; safe, &lt;em&gt;reversible&lt;/em&gt; cluster upgrades — the source is untouched until the target is proven, so rollback is "flip the name back"; other tenants are completely unaffected because each has its own namespace and, for DinD workloads, its own node pool; you migrate on your own schedule, tenant by tenant.&lt;br&gt;
&lt;strong&gt;What it costs:&lt;/strong&gt; the &lt;em&gt;migrating&lt;/em&gt; tenant has a &lt;strong&gt;runner gap&lt;/strong&gt; — the window between drain (step 2) and enable (step 5) when its jobs queue and wait. And the tooling has sharp edges worth knowing: a step that strips Kubernetes finalizers busy-loops with no timeout (if the controller is wedged, it spins forever), and the drain check is a substring match on pod names that a stuck terminating pod can block.&lt;/p&gt;
&lt;/blockquote&gt;
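
&lt;p&gt;For the unbounded busy-loop in particular, the usual guard is a deadline. The helper below is a generic sketch of that idea, not something the Forge tooling is claimed to contain; the function name and parameters are hypothetical.&lt;/p&gt;

```python
import time


def wait_until(condition, timeout_s, interval_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it is true; return False once timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while not condition():
        if clock() >= deadline:
            return False  # give up and let the caller report a wedged controller
        sleep(interval_s)
    return True


class FakeClock:
    """Deterministic clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


fake = FakeClock()
# A condition that never becomes true stands in for a stuck finalizer.
print(wait_until(lambda: False, timeout_s=3, clock=fake.time, sleep=fake.sleep))  # False
```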

&lt;p&gt;Here's the part you must say out loud, because it's where a careless write-up would lie to you: &lt;strong&gt;this is not zero-downtime for the migrating tenant.&lt;/strong&gt; It is &lt;em&gt;blast-radius-of-one&lt;/em&gt; — the platform keeps serving everyone else flawlessly while a single tenant cuts over with a brief gap. The unit of "no disruption" is the platform, not the individual tenant. Naming that precisely is the difference between a talk a senior engineer trusts and one they tune out. Decision #4.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Trade-off #5 — Config at scale
&lt;/h2&gt;

&lt;p&gt;Everything above only matters if you can actually &lt;em&gt;onboard&lt;/em&gt; forty teams without it being forty projects. A platform that's painful to add tenants to never gets forty tenants — it gets five and a backlog. So the configuration architecture is load-bearing for the whole low-ops story.&lt;/p&gt;

&lt;h3&gt;
  
  
  The background: Terraform vs. Terragrunt, and DRY
&lt;/h3&gt;

&lt;p&gt;Terraform describes infrastructure as code. But across dozens of near-identical deployments (same module, different tenant/region/account), plain Terraform pushes you toward copy-paste — and copy-paste is where drift is born. &lt;strong&gt;Terragrunt&lt;/strong&gt; is a thin wrapper over Terraform whose main job is to keep things &lt;strong&gt;DRY&lt;/strong&gt; (Don't Repeat Yourself): shared configuration is defined once and &lt;em&gt;inherited&lt;/em&gt;, and each deployment is just the small set of values that make it unique.&lt;/p&gt;

&lt;h3&gt;
  
  
  The decision: the directory path &lt;em&gt;is&lt;/em&gt; the configuration
&lt;/h3&gt;

&lt;p&gt;Forge deploys &lt;strong&gt;one module per tenant × region × account&lt;/strong&gt;, arranged so that the &lt;strong&gt;filesystem layout encodes the coordinates&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;environments/&amp;lt;aws_account&amp;gt;/regions/&amp;lt;aws_region&amp;gt;/vpcs/&amp;lt;vpc_alias&amp;gt;/tenants/&amp;lt;tenant_name&amp;gt;/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tenant's name is literally &lt;code&gt;basename()&lt;/code&gt; of its directory, and the path coordinates are account → region → VPC alias → tenant. Configuration cascades &lt;em&gt;down&lt;/em&gt; the tree via Terragrunt's &lt;code&gt;find_in_parent_folders&lt;/code&gt;: organization-wide constants at the top; an &lt;strong&gt;account/environment&lt;/strong&gt; layer that supplies the AWS account id, profile, and remote-state settings; a &lt;strong&gt;region&lt;/strong&gt; layer with the region and its short alias; a &lt;strong&gt;VPC&lt;/strong&gt; layer with the concrete VPC and subnet IDs; and finally the tenant's own &lt;code&gt;config.yml&lt;/code&gt;. A small &lt;code&gt;config.hcl&lt;/code&gt; at the tenant level &lt;code&gt;yamldecode&lt;/code&gt;s that YAML and expands it into the runner specs — and, crucially, into the &lt;strong&gt;&lt;code&gt;runs-on&lt;/code&gt; label set&lt;/strong&gt; that is the tenant's API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;runner_labels&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s2"&gt;"type:${spec.type}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"self-hosted"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;runner_architecture&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"env:ops-${include.env.locals.env}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="nx"&gt;extra_labels&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="s2"&gt;"ec2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"rgn:${local.region_alias}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"vpc:${local.vpc_alias}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="s2"&gt;"tnt:${local.tenant_name}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a clever bit in how those labels are &lt;em&gt;matched&lt;/em&gt;. Rather than requiring a workflow to specify the exact full label set, Forge generates matchers for &lt;strong&gt;every contiguous sub-slice&lt;/strong&gt; of the extra labels appended to the base — so a job can match with any reasonable subset (just &lt;code&gt;type:standard&lt;/code&gt; and &lt;code&gt;self-hosted&lt;/code&gt;, say) while the platform still always knows the full identity (tenant, region, VPC, lane). Flexibility for the user, precision for the platform.&lt;/p&gt;
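&lt;p&gt;A minimal sketch of that sub-slice expansion, in Python rather than Forge's HCL. The function name &lt;code&gt;label_matchers&lt;/code&gt; and the label values are hypothetical, and including the bare base set as its own matcher is an assumption rather than something the Forge source confirms.&lt;/p&gt;

```python
from typing import List


def label_matchers(base: List[str], extra: List[str]) -> List[List[str]]:
    """Build base + each contiguous sub-slice of extra, in order."""
    # Assumption: the bare base set (no extra labels) is also a valid matcher.
    matchers = [list(base)]
    for start in range(len(extra)):
        for end in range(start + 1, len(extra) + 1):
            # extra[start:end] is contiguous, so label order is preserved.
            matchers.append(list(base) + extra[start:end])
    return matchers


# Hypothetical label values, shaped like the HCL above.
base = ["type:standard", "self-hosted", "x64", "env:ops-prod"]
extra = ["ec2", "rgn:use1", "vpc:main", "tnt:payments"]

matchers = label_matchers(base, extra)
print(len(matchers))  # 1 bare base + 10 contiguous sub-slices of 4 extras
```

&lt;p&gt;With &lt;em&gt;n&lt;/em&gt; extra labels this yields n(n+1)/2 slices, so the matcher list grows quadratically instead of exponentially, which is what enumerating every arbitrary subset would cost.&lt;/p&gt;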

&lt;p&gt;Two more touches that make scale comfortable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Version pinning with a local-dev escape hatch.&lt;/strong&gt; Module versions are pinned indirectly through a &lt;code&gt;release_versions.yaml&lt;/code&gt; (everything at a single release ref), and a &lt;code&gt;use_local_repos&lt;/code&gt; flag flips &lt;em&gt;every&lt;/em&gt; module source from the git ref to a local &lt;code&gt;file://&lt;/code&gt; checkout for iteration. One switch, zero per-tenant edits — you can test a platform change against a real tenant config without touching the tenant config.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State isolation for free.&lt;/strong&gt; There's one S3 state bucket and one DynamoDB lock table &lt;em&gt;per account&lt;/em&gt; (the bucket name carries the account id), and the &lt;strong&gt;Terraform state key is the directory path&lt;/strong&gt;. So per-tenant/region/VPC state separation isn't designed per tenant — it falls straight out of where the directory sits. The structure does the work.&lt;/li&gt;
&lt;/ul&gt;
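&lt;p&gt;To make "the structure does the work" concrete, here is a rough Python sketch of how a tenant's coordinates and state location can be derived from nothing but its directory path. The bucket name pattern &lt;code&gt;forge-tfstate-...&lt;/code&gt; and the &lt;code&gt;terraform.tfstate&lt;/code&gt; suffix are assumptions for illustration (the suffix follows the common Terragrunt convention); the real work happens in Terragrunt, not Python.&lt;/p&gt;

```python
from pathlib import PurePosixPath
from typing import Dict

LAYERS = ("environments", "regions", "vpcs", "tenants")


def tenant_coordinates(tenant_dir: str) -> Dict[str, str]:
    """Split a tenant directory into account, region, VPC alias, and tenant."""
    parts = PurePosixPath(tenant_dir).parts
    # Even positions hold the fixed layer names, odd positions hold the values.
    if len(parts) != 2 * len(LAYERS) or parts[0::2] != LAYERS:
        raise ValueError(f"not a tenant directory: {tenant_dir}")
    account, region, vpc_alias, tenant = parts[1::2]
    return {"account": account, "region": region,
            "vpc_alias": vpc_alias, "tenant": tenant}


def state_location(tenant_dir: str) -> Dict[str, str]:
    """Derive where the tenant's state would live from its path alone."""
    coords = tenant_coordinates(tenant_dir)
    return {
        # Hypothetical bucket naming; the text only says it carries the account id.
        "bucket": f"forge-tfstate-{coords['account']}",
        # Assumed suffix, following the common Terragrunt convention.
        "key": f"{PurePosixPath(tenant_dir)}/terraform.tfstate",
    }


path = "environments/123456789012/regions/us-east-1/vpcs/main/tenants/payments"
print(tenant_coordinates(path)["tenant"])
print(state_location(path)["key"])
```

&lt;p&gt;Two tenants can only collide on state if they share a directory, which the filesystem already forbids.&lt;/p&gt;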

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What it buys:&lt;/strong&gt; adding a tenant is a &lt;em&gt;configuration change, not a project&lt;/em&gt; — minutes, not days; dev and prod can run &lt;em&gt;different module versions&lt;/em&gt; safely; genuine DRY across ~40 deployments behind one module and one upgrade path.&lt;br&gt;
&lt;strong&gt;What it costs:&lt;/strong&gt; Terragrunt plus layered HCL has a real learning curve, and there are footguns worth flagging — remote state can live in one region even for resources in another (a cross-region surprise), and the state bucket and lock table sharing a name can confuse newcomers. Forge also deliberately avoids cross-unit &lt;code&gt;dependency&lt;/code&gt; blocks (each tenant is one self-contained module instance); the rare cross-module ordering, like a full cluster rebuild, is handled by an external DAG-resolver script with a pragmatic stabilization delay. Pragmatic over pure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; when you have &lt;em&gt;N&lt;/em&gt; near-identical deployments, make the &lt;strong&gt;difference&lt;/strong&gt; between them a small data file and let &lt;strong&gt;structure&lt;/strong&gt; carry the sameness. If your marginal deployment costs minutes, "forty tenants" stops being a scaling problem and becomes a directory listing. Decision #5.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. Trade-off #6 — The connective tissue
&lt;/h2&gt;

&lt;p&gt;The big architectural calls get the headlines, but a platform actually &lt;em&gt;survives&lt;/em&gt; on its connective tissue: the small, resilient systems that quietly remove whole categories of toil so that no human has to handle them. Three of them follow, then the thing that ties it all together, and finally how you debug whatever is left.&lt;/p&gt;

&lt;h3&gt;
  
  
  Audit-grade job-log archival
&lt;/h3&gt;

&lt;p&gt;GitHub keeps job logs for a limited window, but you often need them longer — for audits, for debugging a flaky job a week later, for compliance evidence. So when a job finishes (&lt;code&gt;workflow_job&lt;/code&gt; with &lt;code&gt;action: completed&lt;/code&gt;), Forge archives its logs through a deliberately decoupled pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Flbwj707v3q6zw5uellw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Flbwj707v3q6zw5uellw9.png" alt="Job-log pipeline: workflow_job.completed to EventBridge to dispatcher Lambda to SQS to archiver Lambda to per-tenant S3, with DLQ redrive every 10 minutes." width="800" height="1183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The event flows EventBridge → a &lt;strong&gt;dispatcher Lambda&lt;/strong&gt; (which filters and forwards) → &lt;strong&gt;SQS&lt;/strong&gt; → an &lt;strong&gt;archiver Lambda&lt;/strong&gt; that authenticates as the GitHub App (minting a short-lived installation token), downloads the logs, and writes them to a &lt;strong&gt;per-tenant S3 bucket&lt;/strong&gt;, KMS-encrypted, keyed by &lt;code&gt;{repo}/{run}/{attempt}/{job}&lt;/code&gt; as both a raw &lt;code&gt;.log&lt;/code&gt; and a structured &lt;code&gt;.json&lt;/code&gt;. Two details show the care: the SQS visibility timeout is set &lt;em&gt;just above&lt;/em&gt; the archiver's Lambda timeout (the event-source mapping refuses to be created otherwise — there's a comment in the code so the next person doesn't "fix" it), and failed messages retry up to ten times before landing in a &lt;strong&gt;dead-letter queue (DLQ)&lt;/strong&gt;. A separate redrive Lambda re-injects DLQ messages back into the pipeline &lt;strong&gt;every ten minutes&lt;/strong&gt;, so a transient GitHub API hiccup &lt;em&gt;self-heals&lt;/em&gt; instead of silently dropping audit logs. Nobody gets paged for a blip.&lt;/p&gt;
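&lt;p&gt;The three rules in that paragraph (the object key layout, the timeout ordering, and the ten-attempt retry budget) can be sketched in a few lines of Python. This is a toy model, not the archiver's code: the function names are hypothetical, and placing the &lt;code&gt;.log&lt;/code&gt; and &lt;code&gt;.json&lt;/code&gt; suffixes directly on the job segment is an assumption.&lt;/p&gt;

```python
from typing import Tuple

MAX_RECEIVE_COUNT = 10  # from the text: ten attempts before the DLQ


def archive_keys(repo: str, run_id: int, attempt: int, job_id: int) -> Tuple[str, str]:
    """Return the raw-log and structured-JSON object keys for one job."""
    prefix = f"{repo}/{run_id}/{attempt}/{job_id}"
    return f"{prefix}.log", f"{prefix}.json"


def visibility_timeout_ok(visibility_timeout_s: int, lambda_timeout_s: int) -> bool:
    """The queue's visibility timeout must not be shorter than the function timeout."""
    return visibility_timeout_s >= lambda_timeout_s


def route_after_failure(receive_count: int) -> str:
    """Decide where a message goes after a failed processing attempt."""
    return "dlq" if receive_count >= MAX_RECEIVE_COUNT else "retry"


print(archive_keys("my-org/my-repo", 1234, 1, 5678))
print(visibility_timeout_ok(910, 900))
print(route_after_failure(3), route_after_failure(10))
```

&lt;p&gt;If the visibility timeout were shorter than the function timeout, a message could reappear on the queue while its first invocation was still running, which is why the ordering is enforced when the event-source mapping is created.&lt;/p&gt;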

&lt;h3&gt;
  
  
  A self-healing global lock
&lt;/h3&gt;

&lt;p&gt;Some operations must not run concurrently across repos or workflows. Forge provides a distributed mutex backed by &lt;strong&gt;DynamoDB&lt;/strong&gt;: a table keyed on a &lt;code&gt;lock_id&lt;/code&gt;, with secondary indexes on the workflow run and attempt, and a TTL on a timestamp. Workflows acquire and release the lock (every runner role carries the small policy needed to do so). The self-healing part is a janitor Lambda that runs every ten minutes, checks each held lock's workflow-run status via the GitHub App, and deletes any whose run has already completed — with the DynamoDB TTL as a final backstop for anything the janitor can't resolve. A lock that would otherwise wedge the system instead expires and cleans itself up.&lt;/p&gt;
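&lt;p&gt;The janitor's decision rule is simple enough to sketch as a pure function. This Python model keeps the locks in memory and takes the GitHub lookup as a callback; the &lt;code&gt;Lock&lt;/code&gt; fields other than &lt;code&gt;lock_id&lt;/code&gt;, and the use of &lt;code&gt;LookupError&lt;/code&gt; to mean "could not resolve the run", are assumptions for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Lock:
    lock_id: str
    run_id: int
    run_attempt: int
    expires_at: int  # epoch seconds; models the DynamoDB TTL attribute


def locks_to_release(locks: List[Lock], run_status: Callable[[int], str]) -> List[Lock]:
    """Return held locks whose workflow run has already completed."""
    stale = []
    for lock in locks:
        try:
            status = run_status(lock.run_id)
        except LookupError:
            # Run cannot be resolved: leave the lock for the TTL backstop.
            continue
        if status == "completed":
            stale.append(lock)
    return stale


def ttl_expired(lock: Lock, now: int) -> bool:
    """The backstop: the table's TTL eventually removes anything left over."""
    return now >= lock.expires_at


statuses = {101: "completed", 102: "in_progress"}
locks = [Lock("deploy", 101, 1, 2000), Lock("release", 102, 1, 2000),
         Lock("orphan", 999, 1, 1500)]
print([l.lock_id for l in locks_to_release(locks, lambda run: statuses[run])])
```

&lt;p&gt;Here the lookup for run 999 raises &lt;code&gt;KeyError&lt;/code&gt; (a &lt;code&gt;LookupError&lt;/code&gt;), so the "orphan" lock is skipped by the janitor and only disappears once its TTL passes.&lt;/p&gt;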

&lt;h3&gt;
  
  
  Warm pools, used surgically
&lt;/h3&gt;

&lt;p&gt;Remember the one cost of ephemeral runners: per-job startup latency. And the numbers are worth being concrete about, because the picture is "fast when something's already warm, slow when you have to boot a machine":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EC2, warm-pool hit:&lt;/strong&gt; a pre-booted runner picks up the job in &lt;strong&gt;under ~20 seconds.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EC2, cold:&lt;/strong&gt; booting a fresh instance (provision, bootstrap, register) takes roughly &lt;strong&gt;5 minutes&lt;/strong&gt; for Linux. Cold-start time climbs with the OS and host type: &lt;strong&gt;Windows&lt;/strong&gt; boots noticeably slower than Linux, and &lt;strong&gt;macOS&lt;/strong&gt; is slowest of all, because Mac runners live on &lt;strong&gt;AWS Dedicated Hosts&lt;/strong&gt;: you may wait on dedicated-host allocation &lt;em&gt;and then&lt;/em&gt; on a Mac instance boot, which is far slower than a normal VM boot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes pod on an already-running (idle) node, image cached:&lt;/strong&gt; also &lt;strong&gt;under ~20 seconds&lt;/strong&gt; — the pod just schedules and the container starts. But this one is genuinely case-by-case: if the runner image &lt;em&gt;isn't&lt;/em&gt; already cached on the node, you add image-pull time; and if no node is available, you add node boot (below). So a non-DinD pod's start is "fast when the node's up and the image is local," and degrades from there.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes/DinD when no node is available&lt;/strong&gt; (e.g., scaled to zero): you pay the worst case — &lt;strong&gt;Karpenter has to boot a node first, &lt;em&gt;then&lt;/em&gt; the scheduler places the pod, &lt;em&gt;then&lt;/em&gt; it pulls/starts the container&lt;/strong&gt; — which lands in minutes, comparable to an EC2 cold start, because you're booting an EC2 node either way.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the real asymmetry isn't "EC2 slow, Kubernetes fast" — it's "&lt;strong&gt;warm/idle is fast on both lanes; cold means booting an EC2 machine on both lanes&lt;/strong&gt;, with the extra K8s variable of whether the image is already on the node." The blunt fix is a warm pool, and the expensive mistake is keeping warm capacity everywhere, always.&lt;/p&gt;

&lt;p&gt;There's also a cost asymmetry between the two warm strategies that ties straight back to §5, and it's sharp. &lt;strong&gt;A node is exactly one VPC IP&lt;/strong&gt; — with Calico, it makes no difference whether that node runs one pod or a hundred; it still consumes a single VPC address. So if you want &lt;em&gt;idle, warm&lt;/em&gt; capacity sitting ready: an EC2 warm pool of N runners holds &lt;strong&gt;N VPC IPs&lt;/strong&gt; the entire time it waits (spending your scarcest resource just to shave startup), whereas you can park many idle DinD pods behind &lt;strong&gt;one warm node and spend exactly one IP.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That single fact inverts the default recommendation at scale. For ordinary use, Forge actually recommends &lt;strong&gt;EC2 over DinD-on-EKS&lt;/strong&gt;, because per job it's cheaper. But for &lt;strong&gt;high-churn, scaled workloads&lt;/strong&gt; that want warm capacity hot and ready, &lt;strong&gt;DinD wins&lt;/strong&gt; — you can keep a pool of idle runners behind a single locked node (one IP) instead of paying one IP per warm EC2 runner. And this isn't a rule set once and forgotten: Forge &lt;strong&gt;measures each tenant's actual usage&lt;/strong&gt; (those cost/usage dashboards again) and periodically &lt;strong&gt;recalculates and recommends&lt;/strong&gt; the cheaper mix as the tenant's pattern shifts. The trade-off is computed from data, not guessed at onboarding — usage-aware, and revisited.&lt;/p&gt;
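&lt;p&gt;The IP arithmetic behind that recommendation is small enough to write down. A sketch in Python, assuming a hypothetical &lt;code&gt;pods_per_node&lt;/code&gt; capacity figure (the text gives no number) and counting only the VPC addresses held by idle warm capacity:&lt;/p&gt;

```python
import math


def warm_ips_ec2(warm_runners: int) -> int:
    """Each warm EC2 runner is its own instance, so it holds its own VPC IP."""
    return warm_runners


def warm_ips_dind(warm_runners: int, pods_per_node: int) -> int:
    """With an overlay CNI, only the nodes hold VPC IPs, not the pods on them."""
    if warm_runners == 0:
        return 0
    return math.ceil(warm_runners / pods_per_node)


# Hypothetical sizing: 20 warm runners, nodes that each fit 20 idle pods.
print(warm_ips_ec2(20))       # one IP per warm instance
print(warm_ips_dind(20, 20))  # all pods share a single node's IP
```

&lt;p&gt;The gap widens linearly with the size of the warm pool, which is why the recommendation flips only for high-churn tenants that want a lot of warm capacity.&lt;/p&gt;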

&lt;p&gt;So Forge uses warm capacity &lt;em&gt;surgically&lt;/em&gt;: most runner types run &lt;strong&gt;none&lt;/strong&gt; (pure on-demand, accepting the cold start), and the few latency-sensitive EC2 types keep a small warm pool only &lt;strong&gt;during business hours, doubled by timezone&lt;/strong&gt;: for example, one warm runner topped up on &lt;code&gt;cron(*/5 9-18 ? * MON-FRI *)&lt;/code&gt; in both &lt;code&gt;America/Los_Angeles&lt;/code&gt; and &lt;code&gt;Europe/Madrid&lt;/code&gt;, and idle on nights and weekends. On Kubernetes the same effect comes from keeping a little node headroom rather than scaling hard to zero on a latency-sensitive tenant, and it costs you nodes, not precious IPs. Cost is spent exactly where it buys responsiveness and nowhere else, which is the difference between "we have warm pools" and "we have a warm-pool bill."&lt;/p&gt;

&lt;h3&gt;
  
  
  Central logging &amp;amp; observability with Splunk — the thing that makes one team possible
&lt;/h3&gt;

&lt;p&gt;Underneath all of it sits the system that makes operating forty tenants by a small team even conceivable: &lt;strong&gt;centralized logging and metrics.&lt;/strong&gt; You cannot &lt;em&gt;watch&lt;/em&gt; forty tenants by hand; you can only &lt;em&gt;instrument&lt;/em&gt; them.&lt;/p&gt;

&lt;p&gt;The pipeline has a few distinct paths, and it's worth being precise about them because "it all goes to Splunk" hides the real shape. The clean way to hold it in your head is: &lt;strong&gt;logs go one way (to Splunk Cloud, via CloudWatch), metrics go another (to Splunk O11y).&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Logs (all of them) go to CloudWatch Logs, then to Splunk Cloud.&lt;/strong&gt; EC2 runners, the control-plane Lambdas, &lt;em&gt;and&lt;/em&gt; EKS (both node and pod logs) all send their logs to &lt;strong&gt;CloudWatch Logs&lt;/strong&gt;. From there, ingestion into &lt;strong&gt;Splunk Cloud&lt;/strong&gt; (the log-analytics product) is done through Splunk's &lt;strong&gt;Data Manager&lt;/strong&gt; app, the supported service for pulling AWS data into Splunk Cloud. True to the rest of the platform, Forge &lt;em&gt;configures Data Manager as code&lt;/em&gt;: there's a Terraform module in the open-source repo that wires up the ingestion rather than anyone clicking through a console. A runner alone emits several log streams: OS &lt;strong&gt;syslog&lt;/strong&gt;, the &lt;strong&gt;cloud-init / EC2 user-data&lt;/strong&gt; bootstrap output (where most "the runner never came up" failures hide), the &lt;strong&gt;GitHub Actions job logs&lt;/strong&gt;, the runner &lt;strong&gt;agent/worker logs&lt;/strong&gt;, and the &lt;strong&gt;job-started/completed hook&lt;/strong&gt; output. They all travel this same path. (The job logs archived by the pipeline described earlier in this section land in per-tenant S3 and feed Splunk Cloud as well.) The result: a job's full story (bootstrap, execution, outcome) is reconstructable from one place after the runner itself is long gone, which, being ephemeral, it always is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS metrics go to Splunk O11y.&lt;/strong&gt; The &lt;strong&gt;Splunk Observability (O11y) + AWS integration&lt;/strong&gt; pulls the CloudWatch metrics for the platform's AWS resources into &lt;strong&gt;Splunk O11y&lt;/strong&gt; (the metrics/observability product) — and it, too, is &lt;strong&gt;configured by a Terraform module in the Forge repo&lt;/strong&gt;, so the observability wiring is reproducible and reviewable like everything else.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;EKS adds extra pod metrics to O11y via OTel.&lt;/strong&gt; On top of the AWS metrics, each EKS cluster runs the &lt;strong&gt;Splunk OpenTelemetry (OTel) collector&lt;/strong&gt; to send &lt;em&gt;additional, pod-level&lt;/em&gt; metrics to Splunk O11y — the fine-grained container telemetry CloudWatch's AWS metrics don't capture. Note this is the &lt;em&gt;metrics&lt;/em&gt; path only; EKS &lt;strong&gt;logs&lt;/strong&gt; still go the log route above (CloudWatch → Splunk Cloud).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The payoff isn't just that the data lands somewhere — it's that &lt;strong&gt;Splunk Cloud carries a large set of Forge-specific field extractions&lt;/strong&gt;, so the logs aren't an opaque blob: you can slice and filter them by meaningful dimensions (tenant, region, account, runner type, job, conclusion, and more), and the metrics in O11y are likewise sliceable by dimension. That's what turns "we have the logs" into "I can answer a question in one query." And — consistent with everything else in this article — &lt;em&gt;those extractions and the dashboards themselves are managed as code&lt;/em&gt;, through Terraform's Splunk and SignalFx providers, not hand-built in a console; the observability is as reproducible as the infrastructure it watches. On top of it, each tenant gets its own dashboards — runner-lifecycle health, queue/capacity, trust-validation failures, the webhook/job-log pipeline, optimization signals like high-memory-job detection, and &lt;strong&gt;cost breakdowns fed by AWS Billing Data Exports&lt;/strong&gt; so a team can see and right-size its own spend.&lt;/p&gt;

&lt;p&gt;This isn't abstract — it's a concrete set of dashboards, each mapping a &lt;em&gt;symptom&lt;/em&gt; to a &lt;em&gt;subsystem&lt;/em&gt;: a runner-&lt;strong&gt;capacity&lt;/strong&gt; view (GitHub queue pressure vs. ARC scale-set state), &lt;strong&gt;Lambda operations&lt;/strong&gt; (scale errors, tagging, ingestion retries), &lt;strong&gt;K8s storage/network&lt;/strong&gt; (PVC/EBS attach, scheduler capacity, CNI readiness, API-audit), &lt;strong&gt;trust failures&lt;/strong&gt; (the AssumeRole/TagSession detail from §7), &lt;strong&gt;ARC/DinD health&lt;/strong&gt; (init containers, hook sidecars, runner versions, Karpenter signals), &lt;strong&gt;EC2 runner lifecycle&lt;/strong&gt; (webhook → scale → AMI/SSM → user-data → hook), the &lt;strong&gt;webhook/job-log pipeline&lt;/strong&gt; (dispatcher → SQS/DLQ → archiver → ingestion), and — tellingly — an &lt;strong&gt;ingestion-quality&lt;/strong&gt; view that watches the telemetry &lt;em&gt;itself&lt;/em&gt; for missing fields and dropped logs, because a dashboard you can't trust is worse than no dashboard. In the internal production deployment, the clusters also run &lt;strong&gt;Falco&lt;/strong&gt; for runtime security monitoring (it's not in the public repo), so anomalous syscall behavior inside a running job is observable too. The throughline: every GitHub-side symptom ("my job is stuck") has a backend view that walks you down the chain — label mismatch → webhook → Lambda/SQS → EC2/ARC → runner logs — and localizes it to a subsystem in a click or two, the difference between an operator who &lt;em&gt;guesses&lt;/em&gt; and one who &lt;em&gt;knows&lt;/em&gt;. And the same central data does two more jobs beyond troubleshooting: it provides the &lt;strong&gt;audit trail&lt;/strong&gt; compliance needs (who ran what, where, with which permissions), and it makes &lt;strong&gt;per-tenant cost attribution&lt;/strong&gt; possible instead of one undifferentiated AWS bill.&lt;/p&gt;

&lt;p&gt;The payoff is an operating &lt;em&gt;loop&lt;/em&gt; rather than a permanent fire drill: &lt;strong&gt;monitor → alert → automate.&lt;/strong&gt; When something recurs, it becomes a detector or a script, and then it stops being work. That loop is why "near-zero ops" is a true statement and not a brag — the rigor of the instrumentation is precisely what removes the human toil.&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging an ephemeral platform — Teleport for auditable SSH
&lt;/h3&gt;

&lt;p&gt;Ephemerality (§6, §7) is wonderful for security and cleanliness, but it creates a real problem: &lt;strong&gt;how do you debug a machine that deletes itself the instant the job ends?&lt;/strong&gt; You can't SSH into a failed runner after the fact — it's gone. Forge answers this three ways, in order of how often you reach for them.&lt;/p&gt;

&lt;p&gt;First and most often, &lt;strong&gt;the logs already have it.&lt;/strong&gt; Because every stream is centralized in Splunk (above), the large majority of debugging is post-hoc and hands-off — you read the bootstrap output, the job log, and the hook output without touching a box at all.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;rerun it.&lt;/strong&gt; Every job runs in a fresh ephemeral environment built from the same image and configuration, so re-running a failed job from the GitHub UI reproduces the original conditions with no leftover side effects. That is a luxury you simply don't have with long-lived, drifted runners.&lt;/p&gt;

&lt;p&gt;Third, when you genuinely need to be &lt;em&gt;on&lt;/em&gt; the machine, &lt;strong&gt;Teleport provides live, auditable SSH&lt;/strong&gt; — to both EC2 runners and Kubernetes pods. The developer logs in through corporate SSO (&lt;code&gt;tsh login --proxy=&amp;lt;teleport-proxy&amp;gt; --auth=CloudSSO&lt;/code&gt;), and because the runner would normally vanish at job end, they keep it alive deliberately — a &lt;code&gt;sleep&lt;/code&gt; step in the workflow, or a wrapper — then &lt;code&gt;tsh ssh&lt;/code&gt; into the EC2 instance or &lt;code&gt;tsh kube&lt;/code&gt; into the pod and inspect it as it runs. Crucially, this is &lt;em&gt;break-glass, not a backdoor&lt;/em&gt;: access is gated by AD/identity-group entitlement (requested via a ticket), it's narrow and time-bound, and Teleport records sessions centrally — so live debugging never means an untracked shell or a shared bastion key. You get the convenience of "just SSH in" without surrendering the audit trail that made the platform compliant in the first place.&lt;/p&gt;

&lt;p&gt;That's the trade-off in miniature: ephemeral runners &lt;strong&gt;buy&lt;/strong&gt; you security and a clean slate every time, and &lt;strong&gt;cost&lt;/strong&gt; you easy debugging — and Forge pays that cost with central logs, reproducible reruns, and auditable Teleport access, rather than by giving up ephemerality. You keep the security property &lt;em&gt;and&lt;/em&gt; stay able to troubleshoot.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Trade-off #7 — Staying fresh: automated dependencies &amp;amp; dogfooded images
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A note on scope: unlike the previous six, this one is mostly internal process rather than open-source code. The building blocks are public — the Forge repo ships a Renovate config and Packer/Ansible image-build examples — but the end-to-end pipeline described here (the image repos and the dogfood CI) lives in internal repos. I'm including it because it's the part that makes everything above survivable over time, and the pattern is what's worth stealing, not the hostnames.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here's a truth that gets left out of most platform write-ups: &lt;strong&gt;a platform is never "done."&lt;/strong&gt; Even with zero new features, it sits on a foundation that moves constantly underneath you — GitHub changes the Actions runtime, the runner agent ships a new minimum version, ARC and Karpenter and Calico cut releases, Terraform providers update, base-OS packages get CVEs, and hundreds of transitive dependencies drift. (A concrete one: when GitHub migrated Actions from the Node 20 runtime to Node 24, every runner image &lt;em&gt;and&lt;/em&gt; a pile of tenant workflows had to be tested and updated — not because Forge changed anything, but because the ground moved underneath it.) Standing still is not free; standing still is how you rot. The hard part of operating a platform with near-zero ops is therefore not the initial build — it's keeping a large, moving dependency surface current &lt;em&gt;without&lt;/em&gt; a human babysitting it. Forge does that with three interlocking pieces.&lt;/p&gt;

&lt;h3&gt;
  
  
  Piece 1 — Renovate: turn the dependency treadmill into reviewable PRs
&lt;/h3&gt;

&lt;p&gt;The first piece is &lt;strong&gt;Renovate&lt;/strong&gt;, a bot that continuously scans every dependency source in every repo — Terraform/OpenTofu modules, GitHub Actions, Docker base images, Helm charts, the pinned upstream runner module, even tool versions — and, when something updates, &lt;strong&gt;opens a pull request automatically.&lt;/strong&gt; A shared Renovate configuration (defined once, extended by every repo) sets the policy: updates are grouped sensibly (all AWS provider bumps together, all Docker bumps together) so you review related changes as a unit; safe changes (patch bumps, digest pins, pre-commit hooks) can &lt;strong&gt;auto-merge&lt;/strong&gt;; major version bumps are labeled and delayed so a human looks; and security patches are prioritized.&lt;/p&gt;
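&lt;p&gt;Renovate expresses this policy as JSON package rules; the Python below is only a model of the decision logic described above, not Forge's configuration. The &lt;code&gt;Update&lt;/code&gt; and &lt;code&gt;Decision&lt;/code&gt; types, the seven-day delay, and the choice to send minor bumps to ordinary human review are all assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Update:
    dependency: str
    update_type: str          # "patch", "minor", "major", or "digest"
    manager: str = ""         # e.g. "terraform", "dockerfile", "pre-commit"
    is_security: bool = False


@dataclass
class Decision:
    automerge: bool = False
    delay_days: int = 0
    labels: List[str] = field(default_factory=list)
    prioritized: bool = False


def decide(update: Update) -> Decision:
    """Map one proposed update to how its pull request is handled."""
    decision = Decision(prioritized=update.is_security)
    if update.update_type == "major":
        # Majors are labeled and delayed so a human looks first.
        decision.labels.append("major")
        decision.delay_days = 7  # hypothetical delay
    elif update.update_type in ("patch", "digest") or update.manager == "pre-commit":
        # Safe changes may merge on their own once CI passes.
        decision.automerge = True
    return decision


print(decide(Update("hashicorp/aws", "patch", "terraform")))
print(decide(Update("actions/checkout", "major", "github-actions")))
```

&lt;p&gt;Checking for a major bump first means even a normally "safe" source such as a pre-commit hook still gets human eyes when it crosses a major version.&lt;/p&gt;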

&lt;p&gt;The effect is a mindset shift. Dependency maintenance stops being a periodic, dreaded, manual sweep and becomes a steady stream of small, pre-tested PRs you either glance at and approve or let auto-merge. You're never "behind" in a scary way — you're continuously, incrementally current. For a one-person-shaped operation, this is the difference between drowning in the treadmill and riding it.&lt;/p&gt;

&lt;p&gt;But automated dependency PRs are only safe if something &lt;em&gt;proves&lt;/em&gt; each bump doesn't break the platform. That's where the other two pieces come in.&lt;/p&gt;

&lt;h3&gt;
  
  
  Piece 2 — Images as code: a layered base image
&lt;/h3&gt;

&lt;p&gt;Forge's runners don't use off-the-shelf images; they run images built &lt;strong&gt;as code&lt;/strong&gt; with Packer (which bakes machine images) and Ansible (which configures what goes inside). The structure is two-layered, across three operating systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base images for Ubuntu, macOS, and Windows — deliberately minimal.&lt;/strong&gt; Each starts from a &lt;strong&gt;CIS-aligned hardened OS&lt;/strong&gt; and installs only the common foundation every runner of that OS needs: the GitHub Actions runner agent, the container/Docker runtime where applicable, the AWS CLI, and the platform's own agents for access and observability (Teleport for break-glass, the Splunk/CloudWatch log shippers). That's &lt;em&gt;it&lt;/em&gt;. The base is kept lean on purpose. (For container jobs, images are pulled from either a Forge ECR or the tenant's &lt;em&gt;own&lt;/em&gt; ECR, depending on configuration — so a team can ship a private container image without it ever passing through the platform.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenants own their toolchains.&lt;/strong&gt; A tenant either runs on the minimal base image as-is, or builds its &lt;strong&gt;own custom image&lt;/strong&gt; on top of the base with whatever it actually needs — Go, Python, Terraform/OpenTofu, Packer, and so on. Forge deliberately does &lt;strong&gt;not&lt;/strong&gt; manage those toolchains. Curating every team's language and tool versions doesn't scale and isn't the platform's job — it's a treadmill that never ends. Keeping the base minimal and pushing toolchains into &lt;em&gt;tenant-owned&lt;/em&gt; custom images is exactly what stops the platform from drowning in "can you add X to the image" tickets, and it means one team's toolchain choices can't destabilize anyone else's runners.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Forge team is itself a tenant.&lt;/strong&gt; The team that operates Forge runs its &lt;em&gt;own&lt;/em&gt; custom image for its &lt;em&gt;own&lt;/em&gt; runners — because it consumes Forge exactly like any other tenant, through the same base-image → custom-image path it gives everyone else. That's not a metaphor for dogfooding; it &lt;em&gt;is&lt;/em&gt; the dogfooding. The platform team eats the same food it serves.&lt;/li&gt;
&lt;li&gt;A build-matrix control system lets a PR &lt;strong&gt;skip&lt;/strong&gt; specific OS/version/architecture combinations via tokens in the PR title or commit message (e.g. &lt;code&gt;skip:windows&lt;/code&gt;, &lt;code&gt;skip:ubuntu:22:arm64&lt;/code&gt;) — so a change that only touches one image doesn't pay to rebuild all of them across three OSes and multiple architectures.&lt;/li&gt;
&lt;/ul&gt;
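&lt;p&gt;The skip-token mechanism from the last bullet can be sketched in Python as follows. The token grammar is inferred from the two examples given (&lt;code&gt;skip:windows&lt;/code&gt;, &lt;code&gt;skip:ubuntu:22:arm64&lt;/code&gt;): fields are assumed to be ordered OS, version, architecture, and a shorter token is assumed to match as a prefix. The matrix entries are hypothetical.&lt;/p&gt;

```python
from typing import List, Tuple

Combo = Tuple[str, str, str]  # (os, version, architecture)


def parse_skips(text: str) -> List[Tuple[str, ...]]:
    """Extract skip tokens from a PR title or commit message."""
    skips = []
    for word in text.split():
        word = word.strip(",.;()[]").lower()
        if word.startswith("skip:"):
            fields = tuple(part for part in word.split(":")[1:] if part)
            if fields:
                skips.append(fields)
    return skips


def is_skipped(combo: Combo, skips: List[Tuple[str, ...]]) -> bool:
    """A token skips every combination it is a prefix of."""
    return any(combo[:len(skip)] == skip for skip in skips)


def filter_matrix(matrix: List[Combo], text: str) -> List[Combo]:
    skips = parse_skips(text)
    return [combo for combo in matrix if not is_skipped(combo, skips)]


matrix = [
    ("ubuntu", "22", "amd64"), ("ubuntu", "22", "arm64"),
    ("ubuntu", "24", "arm64"), ("windows", "2022", "amd64"),
    ("macos", "14", "arm64"),
]
title = "Bump base packages skip:windows skip:ubuntu:22:arm64"
for combo in filter_matrix(matrix, title):
    print(combo)
```

&lt;p&gt;Prefix matching keeps the common case short (one token drops a whole OS) while still allowing a single cell of the matrix to be excluded.&lt;/p&gt;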

&lt;p&gt;And there's a second trade-off nested inside the tenant's choice — &lt;em&gt;when&lt;/em&gt; to get its tools onto the runner. A team can &lt;strong&gt;install everything at workflow runtime&lt;/strong&gt; (&lt;code&gt;apt-get&lt;/code&gt;, &lt;code&gt;pip install&lt;/code&gt;, download a toolchain at the top of the job): zero image to build or maintain, but every single run pays that installation cost and is exposed to a slow or flaky upstream mirror. Or it can &lt;strong&gt;bake those tools into a custom image&lt;/strong&gt;: more work to build and keep fresh, but runs start instantly with everything already present and reproducible. Neither is "right" — it's the classic build-time-vs-run-time trade-off, and Forge deliberately leaves it to the tenant, because only the tenant knows whether it values low maintenance or fast, repeatable runs. The platform's job is to make &lt;em&gt;both&lt;/em&gt; paths work cleanly, not to pick for them.&lt;/p&gt;

&lt;p&gt;Building images as code matters for two reasons. First, freshness is not optional: GitHub will &lt;strong&gt;reject self-hosted runners whose agent is too old&lt;/strong&gt;, so a stale image doesn't just lag — it eventually fails jobs outright. Scheduled, automated rebuilds keep images inside that window. Second, an image defined as code is an image you can build &lt;em&gt;and exercise&lt;/em&gt; in a pull request — which is the whole game.&lt;/p&gt;

&lt;h3&gt;
  
  
  Piece 3 — Dogfooding: every image is tested as a real runner before merge
&lt;/h3&gt;

&lt;p&gt;This is the keystone, and it's stronger than "run some tests." When Renovate (or a human) opens a PR that bumps a dependency or changes an image, the pipeline &lt;strong&gt;builds the candidate image inside that PR&lt;/strong&gt; and then &lt;strong&gt;registers it as a real GitHub Actions runner and runs actual builds on it&lt;/strong&gt; — before the PR can merge. Not a mock, not a smoke check: the freshly built image is brought up as a genuine runner and made to do real work. And it runs on the &lt;strong&gt;Forge team's own runners&lt;/strong&gt; — because, as above, the Forge team is a tenant of its own platform. So a change to the platform is proven &lt;em&gt;on&lt;/em&gt; the platform, by the team that owns it, acting as a real user of it. If a dependency bump or an image change breaks the runner, it breaks &lt;em&gt;in the PR&lt;/em&gt;, visibly, on a real runner — not in production three days later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fj0xmn5dv8ijga53ryha6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fj0xmn5dv8ijga53ryha6.png" alt="Dogfood flow: PR builds a candidate image, registers it as a real runner on the Forge-team tenant, runs real builds, then merges/releases." width="800" height="997"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Dogfooding" — using your own product to build your own product — is doing real safety work here, not just signaling virtue. Because the Forge team runs &lt;em&gt;as a tenant&lt;/em&gt; and every image is validated &lt;em&gt;as an actual runner&lt;/em&gt; before it ships, the automation in Piece 1 becomes trustworthy: you can let safe dependency PRs auto-merge precisely because a broken image could never have passed a real build on the team's own runners. The three pieces only work as a set — Renovate generates the change, images-as-code make it buildable per PR, and the real-runner dogfood test (on the Forge-team tenant) proves it before it ships. And because tenants consume Forge through pinned versions (§9), a freshly validated release rolls out to teams on &lt;em&gt;their&lt;/em&gt; cadence, not all at once.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What it buys:&lt;/strong&gt; a large, fast-moving dependency surface stays current with little human effort; dependency and image changes are &lt;em&gt;proven on the real platform&lt;/em&gt; before merge, which makes safe auto-merge trustworthy; runner images never silently rot past GitHub's agent-version cutoff.&lt;br&gt;
&lt;strong&gt;What it costs:&lt;/strong&gt; you build and maintain a non-trivial CI/image pipeline (Packer, Ansible, build matrices, the dogfood wiring); you spend CI minutes building and testing images on every relevant PR; and you take on the discipline of keeping the dogfood loop green, because once you trust it enough to auto-merge, a flaky loop is a real liability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;The lesson:&lt;/strong&gt; "done" is a myth for a platform; the ecosystem moves whether you do or not. The way a small team keeps up is to make staying current &lt;em&gt;automatic and self-proving&lt;/em&gt; — generate the changes with a bot, define your artifacts as code so they're buildable per change, and dogfood so every change is tested on the very system it modifies. That trio is what lets "near-zero ops" survive contact with a year of upstream churn. Decision #7.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Operating it: ownership, where it breaks, and the sharp edges
&lt;/h2&gt;

&lt;p&gt;An explainer that stops at architecture skips the half that actually fills a platform team's week. Four operational realities round out the picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ownership boundaries — most incidents live here.&lt;/strong&gt; A surprising share of "Forge is broken" tickets aren't Forge bugs; they sit on a seam between owners:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Platform / Forge team&lt;/strong&gt; owns the modules, EKS clusters, runner lifecycle, GitHub App plumbing, base images, shared observability, and guardrails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant team&lt;/strong&gt; owns its workflows, workload permissions, custom toolchains/images, repository selection, and the external IAM roles it asks runners to assume.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security / infra teams&lt;/strong&gt; own the corporate VPCs, subnets and routing, Teleport entitlements, and Splunk access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Knowing the boundary &lt;em&gt;is&lt;/em&gt; half the triage: a job that can't reach an internal API is usually network/routing (security-infra); a job that can't assume a role is usually tenant IAM; "no runner picked it up" is usually labels/runner-group (platform + tenant config).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where it breaks, and where to look first.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symptom&lt;/th&gt;
&lt;th&gt;Likely layer&lt;/th&gt;
&lt;th&gt;First place to look&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Job stuck waiting for a runner&lt;/td&gt;
&lt;td&gt;Label mismatch, runner group, webhook, EC2/ARC capacity&lt;/td&gt;
&lt;td&gt;GitHub labels; runner-group reconciler logs; webhook/Lambda/SQS dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EC2 runner never registers&lt;/td&gt;
&lt;td&gt;AMI, user-data, subnet IPs, EC2 capacity, App token&lt;/td&gt;
&lt;td&gt;CloudWatch user-data logs; scale-up Lambda; EC2-lifecycle dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARC pod stuck pending&lt;/td&gt;
&lt;td&gt;Karpenter capacity, storage, node taints, resource requests&lt;/td&gt;
&lt;td&gt;K8s storage/network dashboard; ARC-lifecycle dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AWS auth fails inside a job&lt;/td&gt;
&lt;td&gt;Tenant trust, missing &lt;code&gt;sts:TagSession&lt;/code&gt;, wrong role ARN&lt;/td&gt;
&lt;td&gt;Trust-validator dashboard; tenant IAM trust policy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs missing&lt;/td&gt;
&lt;td&gt;Job-log archiver, SQS/DLQ, Splunk ingestion&lt;/td&gt;
&lt;td&gt;Webhook/job-log-pipeline dashboard; ingestion-quality dashboard&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
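&lt;p&gt;The table can also be kept as data, so a runbook script or chat bot gives the same first answer every time. A minimal sketch in Python, assuming hypothetical names (&lt;code&gt;TRIAGE&lt;/code&gt;, &lt;code&gt;first_look&lt;/code&gt;) that are not part of Forge; the entries only restate the rows above:&lt;/p&gt;

```python
# Hypothetical triage helper; the rows restate the table above.
# Keys are lowercased symptoms; values are (likely layer, first place to look).
TRIAGE = {
    "job stuck waiting for a runner": (
        "Label mismatch, runner group, webhook, EC2/ARC capacity",
        "GitHub labels; runner-group reconciler logs; webhook/Lambda/SQS dashboard",
    ),
    "ec2 runner never registers": (
        "AMI, user-data, subnet IPs, EC2 capacity, App token",
        "CloudWatch user-data logs; scale-up Lambda; EC2-lifecycle dashboard",
    ),
    "arc pod stuck pending": (
        "Karpenter capacity, storage, node taints, resource requests",
        "K8s storage/network dashboard; ARC-lifecycle dashboard",
    ),
    "aws auth fails inside a job": (
        "Tenant trust, missing sts:TagSession, wrong role ARN",
        "Trust-validator dashboard; tenant IAM trust policy",
    ),
    "logs missing": (
        "Job-log archiver, SQS/DLQ, Splunk ingestion",
        "Webhook/job-log-pipeline dashboard; ingestion-quality dashboard",
    ),
}


def first_look(symptom):
    """Return the first place to look, or a fallback for unknown symptoms."""
    entry = TRIAGE.get(symptom.strip().lower())
    if entry is None:
        return "Unknown symptom: start from the ownership boundaries"
    _layer, where = entry
    return where
```

&lt;p&gt;For example, &lt;code&gt;first_look("Logs missing")&lt;/code&gt; returns the job-log pipeline and ingestion-quality dashboards, matching the last row of the table.&lt;/p&gt;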

&lt;p&gt;&lt;strong&gt;Sharp edges worth respecting&lt;/strong&gt; — the platform isn't as simple as a happy-path diagram suggests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;scale_set_type&lt;/code&gt; must be exactly &lt;code&gt;dind&lt;/code&gt; or &lt;code&gt;k8s&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Kubernetes CPU/memory need units — &lt;code&gt;500m&lt;/code&gt;, &lt;code&gt;1Gi&lt;/code&gt; — not bare numbers.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ami_kms_key_arn&lt;/code&gt; must be &lt;code&gt;""&lt;/code&gt; for an unencrypted AMI.&lt;/li&gt;
&lt;li&gt;macOS requires &lt;code&gt;use_dedicated_host: true&lt;/code&gt; and matching placement (plus License Manager gotchas).&lt;/li&gt;
&lt;li&gt;Warm-pool schedules use &lt;strong&gt;AWS&lt;/strong&gt; cron syntax, not Unix cron.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;migrate_arc_cluster&lt;/code&gt; should be &lt;code&gt;true&lt;/code&gt; &lt;strong&gt;only&lt;/strong&gt; during an intentional migration.&lt;/li&gt;
&lt;li&gt;Subnet-IP exhaustion and EC2 capacity errors are first-class failure modes, not edge cases (§5, again).&lt;/li&gt;
&lt;/ul&gt;
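&lt;p&gt;Several of these edges are cheap to catch before a plan ever runs. The sketch below is a hypothetical pre-flight check, not Forge's own validation: the function name &lt;code&gt;check_arc_spec&lt;/code&gt; and the plain-dict input are assumptions, and the unit pattern is deliberately simplified to a handful of common suffixes. It covers only the first two items in the list.&lt;/p&gt;

```python
import re

# Hypothetical pre-flight check for one ARC runner spec given as a plain dict.
# The rules mirror the list above; this is a sketch, not Forge's validation.
ALLOWED_SCALE_SET_TYPES = {"dind", "k8s"}

# A number followed by a unit suffix, e.g. 500m or 1Gi. Bare numbers are
# rejected. The suffix set is a simplified assumption.
QUANTITY_WITH_UNIT = re.compile(r"^[0-9]+(\.[0-9]+)?(m|Ki|Mi|Gi|Ti)$")


def check_arc_spec(spec):
    """Return a list of human-readable problems found in the spec."""
    problems = []
    if spec.get("scale_set_type") not in ALLOWED_SCALE_SET_TYPES:
        problems.append("scale_set_type must be exactly 'dind' or 'k8s'")
    for field in ("container_requests_cpu", "container_requests_memory"):
        value = str(spec.get(field, ""))
        if not QUANTITY_WITH_UNIT.match(value):
            problems.append(field + " needs a unit such as 500m or 1Gi")
    return problems
```

&lt;p&gt;An empty list means the spec passed these two checks; anything else is a message a tenant can act on before opening a ticket.&lt;/p&gt;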

&lt;p&gt;&lt;strong&gt;Security at a glance&lt;/strong&gt; — the controls, in one place: GitHub App keys in SSM Parameter Store; webhook HMAC validation; per-tenant runner groups + repository selection; short-lived AWS creds via runner-role assumption (no static keys in workflows); &lt;code&gt;sts:AssumeRole&lt;/code&gt; + &lt;code&gt;sts:TagSession&lt;/code&gt;; KMS-encrypted S3 job logs; Teleport for audited break-glass access; Falco runtime monitoring (internal deployment); rootless DinD on per-tenant nodes; CIS-aligned hardened base images.&lt;/p&gt;
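&lt;p&gt;Of these controls, webhook HMAC validation is the easiest to show in a few lines. GitHub signs each delivery with the shared secret and sends the result in the &lt;code&gt;X-Hub-Signature-256&lt;/code&gt; header as &lt;code&gt;sha256=&lt;/code&gt; followed by a hex digest. The sketch below shows the general technique in Python; the function name is hypothetical, and it is not a copy of how Forge's own webhook handler does it.&lt;/p&gt;

```python
import hashlib
import hmac


def is_valid_signature(secret, body, signature_header):
    """Check a GitHub webhook body against its X-Hub-Signature-256 header.

    secret and body are bytes; signature_header is the raw header value,
    which GitHub sends as 'sha256=' followed by a hex digest.
    """
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes constant time, so it does not leak how many
    # leading characters of the signature matched.
    return hmac.compare_digest(expected, signature_header)
```

&lt;p&gt;The digest must be computed over the raw request bytes, before any JSON parsing, because re-serializing the payload can change whitespace and break the comparison.&lt;/p&gt;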

&lt;p&gt;&lt;strong&gt;And one honest definition: "near-zero ops" is not zero support.&lt;/strong&gt; It does not mean nobody operates Forge. It means recurring manual work has been converted into automation, dashboards, scheduled validators, redrive loops, and self-service config — so the human job shifts from "SSH into a random runner and guess" to "look at the subsystem that owns the symptom, and when a pattern repeats, improve the detector or the automation." The support load — capacity, IAM trust, Teleport/onboarding, ARC/DinD issues, image updates, tenant guidance — is real. It's bounded and routed, not eliminated.&lt;/p&gt;

&lt;h2&gt;
  
  
  13. The pattern
&lt;/h2&gt;

&lt;p&gt;Here is the entire argument on one page:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;th&gt;What it bought&lt;/th&gt;
&lt;th&gt;What it cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Calico overlay CNI (§5)&lt;/td&gt;
&lt;td&gt;Broke the IP ceiling&lt;/td&gt;
&lt;td&gt;A second networking layer to operate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-tenant DinD node pools (§6)&lt;/td&gt;
&lt;td&gt;Blast-radius isolation for Docker builds&lt;/td&gt;
&lt;td&gt;Money — worse bin-packing &amp;amp; more Karpenter objects&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trust-validator (§7)&lt;/td&gt;
&lt;td&gt;Proactive trust checks&lt;/td&gt;
&lt;td&gt;A small distributed system to run&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Blue/green clusters (§8)&lt;/td&gt;
&lt;td&gt;Safe, reversible upgrades&lt;/td&gt;
&lt;td&gt;A per-tenant runner gap on migration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Directory-as-config (§9)&lt;/td&gt;
&lt;td&gt;Onboarding in minutes&lt;/td&gt;
&lt;td&gt;Terragrunt complexity &amp;amp; footguns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Self-healing tissue (§10)&lt;/td&gt;
&lt;td&gt;Resilience without babysitting&lt;/td&gt;
&lt;td&gt;More moving parts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Automated deps + dogfooding (§11)&lt;/td&gt;
&lt;td&gt;Stay current without manual toil&lt;/td&gt;
&lt;td&gt;A build/test pipeline to maintain&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A credibility note before the numbers: everything about the &lt;em&gt;design&lt;/em&gt; in this article is in the open-source repo and you can read the modules yourself; the &lt;em&gt;scale&lt;/em&gt; figures that follow — ~40 teams, ~10k jobs/day, near-zero ops — come from internal production experience, not from the public code. I've tried to keep that line visible throughout.&lt;/p&gt;

&lt;p&gt;Put together, this runs around &lt;strong&gt;40 tenant teams and ~10,000 CI jobs a day&lt;/strong&gt;, across multiple AWS regions and both execution lanes, grown organically from 3 teams to 40+ — operated with &lt;strong&gt;near-zero ops, by design.&lt;/strong&gt; Read that last phrase carefully, because it's the most misunderstood claim in the whole story. It does &lt;em&gt;not&lt;/em&gt; mean a heroically overworked person is quietly holding it together, and it does &lt;em&gt;not&lt;/em&gt; mean an under-resourced org is getting lucky. It means low operational cost is the &lt;strong&gt;output&lt;/strong&gt; of the stack of trade-offs above: ephemeral runners that can't drift, infrastructure that's entirely code, a boundary-validator that catches tenant mistakes before they page anyone, immutable clusters that make upgrades boring, configuration that makes onboarding a directory entry, and observability that turns operations into a loop. Low ops is what &lt;em&gt;rigor&lt;/em&gt; looks like from the outside.&lt;/p&gt;

&lt;p&gt;And that's the takeaway worth carrying even if you never run a GitHub Actions runner in your life:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Great platforms are not a lucky idea or a magic tool. They are a pattern of many correct decisions, made across many different problem domains — networking, isolation, identity, lifecycle, configuration, operations, and supply chain — each chosen with clear eyes about its cost, and composed so the whole holds together.&lt;/strong&gt; The craft is not picking the single best technology for one problem. It's making a dozen good trade-offs in a dozen domains and having them &lt;em&gt;add up&lt;/em&gt;. Anyone can copy one of these decisions. The engineering is in the composition.&lt;/p&gt;

&lt;p&gt;So the next time you design something, don't ask "what's the best tech here?" Ask, for each choice: what does this &lt;em&gt;buy&lt;/em&gt; me, what does it &lt;em&gt;cost&lt;/em&gt; me, and is that the trade I want at my scale? Then write the answer down. That document — the one you just read — is as much the deliverable as the code, because it's the thing that turns a magic box back into an understandable machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  When Forge is overkill
&lt;/h2&gt;

&lt;p&gt;One last decision, and it's the one that keeps the rest honest: &lt;strong&gt;knowing when &lt;em&gt;not&lt;/em&gt; to build this.&lt;/strong&gt; If you're a small team with a handful of repos, Forge is too much. Start with the basics — ephemeral runners (the upstream EC2 module or ARC on its own), GitHub Actions, and a bit of Terraform — and only reach for tenancy, per-tenant isolation, trust validation, blue/green clusters, and a dogfooded image pipeline when you actually feel the pain they remove. Every trade-off in this article &lt;em&gt;buys&lt;/em&gt; something real, but each also &lt;em&gt;costs&lt;/em&gt; real complexity, and at small scale you'd be paying the cost without needing the benefit. Forge earns its keep in multi-team environments where governance, isolation, and platform automation genuinely matter. The same judgment that makes the dozen decisions good is the judgment that says: at three repos, don't make any of them. Maturity isn't building the biggest system — it's building the right-sized one, and being able to say which is which.&lt;/p&gt;

&lt;h2&gt;
  
  
  Read the code
&lt;/h2&gt;

&lt;p&gt;This article keeps saying "read the modules" — so here's the map. Everything below is in the public repo:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Where to read&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tenant orchestration (umbrella)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;modules/platform/forge_runners&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EC2 runners&lt;/td&gt;
&lt;td&gt;&lt;code&gt;modules/platform/ec2_deployment&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARC / Kubernetes runners&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;modules/platform/arc_deployment&lt;/code&gt;, &lt;code&gt;modules/core/arc&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Calico + Karpenter on EKS&lt;/td&gt;
&lt;td&gt;&lt;code&gt;modules/infra/eks&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trust-validator (§7)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;modules/platform/forge_runners/forge_trust_validator&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Job-log archival (§10)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;modules/platform/forge_runners/github_actions_job_logs&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global lock (§10)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;modules/platform/forge_runners/github_global_lock&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Runner groups + repo registration (§4)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;modules/platform/forge_runners/github_app_runner_group&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Splunk dashboards / config (§10)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;modules/integrations/splunk_cloud_conf_shared&lt;/code&gt;, &lt;code&gt;modules/integrations/splunk_o11y_conf_shared&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;p&gt;&lt;em&gt;Forge is open source: &lt;a href="https://github.com/cisco-open/forge" rel="noopener noreferrer"&gt;&lt;code&gt;github.com/cisco-open/forge&lt;/code&gt;&lt;/a&gt; (Apache-2.0).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Docs: &lt;a href="https://cisco-open.github.io/forge" rel="noopener noreferrer"&gt;&lt;code&gt;cisco-open.github.io/forge&lt;/code&gt;&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Read the modules — the trade-offs are all in there if you know where to look.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>kubernetes</category>
      <category>aws</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>ForgeMT: GitHub Actions at Scale with Security and Multi-Tenancy on AWS</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Tue, 12 Aug 2025 14:51:47 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/forgemt-github-actions-at-scale-with-security-and-multi-tenancy-on-aws-3no9</link>
      <guid>https://dev.to/edersonbrilhante/forgemt-github-actions-at-scale-with-security-and-multi-tenancy-on-aws-3no9</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;GitHub Actions is the go-to CI/CD tool for many teams. But when your organization runs thousands of pipelines daily, the default setup breaks down. You hit limits on scale, security, and governance — plus skyrocketing costs.&lt;/p&gt;

&lt;p&gt;GitHub-hosted runners are easy but expensive and don’t meet strict compliance needs. Existing self-hosted solutions like Actions Runner Controller (ARC) or Terraform EC2 modules don’t fully solve multi-tenant isolation, automation, or centralized control.&lt;/p&gt;

&lt;p&gt;ForgeMT, built inside Cisco’s Security Business Group, fills that gap. It’s an open-source AWS-native platform that manages ephemeral runners with strong tenant isolation, full automation, and enterprise-grade governance.&lt;/p&gt;

&lt;p&gt;This article explains why ForgeMT matters and how it works — providing a practical look at building scalable, secure GitHub Actions runner platforms.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why Enterprise CI/CD Runners Fail at Scale
&lt;/h1&gt;

&lt;p&gt;At large organizations, scaling GitHub Actions runners runs into four bottlenecks:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fragmented Infrastructure:&lt;/strong&gt;&lt;br&gt;
Teams independently choose their own CI/CD tools (Jenkins, Travis, CircleCI, or self-hosted runners), which accelerates local delivery but creates duplicated effort, configuration drift, and fragmented monitoring. Without a unified platform, scalability, security, and reliability degrade.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Weak Tenant Isolation:&lt;/strong&gt;&lt;br&gt;
Runners run untrusted code across teams. Without strong isolation, one compromised job can leak credentials or escalate attacks across tenants. Poor audit trails slow breach detection and hinder compliance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scalability Limits:&lt;/strong&gt;&lt;br&gt;
Static IP pools cause IPv4 exhaustion, and manual provisioning delays runner startup. Without elastic scaling, resources are wasted or pipelines queue up, killing developer velocity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintenance and Governance Overhead:&lt;/strong&gt;&lt;br&gt;
Uneven patching weakens security, infrastructure drift complicates troubleshooting, and audits become expensive and error-prone. Secure scaling demands centralized governance, consistent policy enforcement, and automation.&lt;/p&gt;

&lt;p&gt;In short, enterprises fail to scale GitHub Actions runners without a platform that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Centralizes multi-tenancy&lt;/li&gt;
&lt;li&gt;Automates lifecycle management&lt;/li&gt;
&lt;li&gt;Provides enterprise-grade observability and governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But beware—over-centralization can kill flexibility and introduce new challenges.&lt;/p&gt;


&lt;h1&gt;
  
  
  Why GitHub Actions — And Why It’s Not Enough at Enterprise Scale
&lt;/h1&gt;

&lt;p&gt;GitHub Actions is popular because it offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deep GitHub integration:&lt;/strong&gt; triggers on PRs, branches, and tags with no extra logins, plus automatic secret and artifact handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensible ecosystem:&lt;/strong&gt; thousands of marketplace actions simplify workflow creation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible runners:&lt;/strong&gt; GitHub-hosted runners for convenience, or self-hosted for control, cost savings, and compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Granular security:&lt;/strong&gt; native GitHub Apps, OIDC tokens, and fine-grained permissions enforce least privilege.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rapid scale:&lt;/strong&gt; pipelines at repo or org level enable smooth CI/CD growth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, GitHub Actions alone can’t meet enterprise-scale demands. Enterprises require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong tenant isolation and centralized governance across thousands of pipelines.&lt;/li&gt;
&lt;li&gt;A unified platform to avoid fragmented infrastructure and scaling bottlenecks.&lt;/li&gt;
&lt;li&gt;Fine-grained identity, network controls, and compliance enforcement.&lt;/li&gt;
&lt;li&gt;Automation for onboarding, patching, and auditing to reduce operational overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cloud providers like AWS supply the identity, networking, and automation building blocks (IAM/OIDC, VPC segmentation, EC2, EKS) needed to build secure, scalable, multi-tenant CI/CD platforms.&lt;/p&gt;


&lt;h1&gt;
  
  
  Existing Solutions and Why They Fall Short
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Actions Runner Controller (ARC)&lt;/strong&gt; runs ephemeral Kubernetes pods as GitHub runners, scaling dynamically with declarative config and Kubernetes-native integration. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes namespaces alone don’t provide strong security isolation.&lt;/li&gt;
&lt;li&gt;No native AWS IAM/OIDC integration.&lt;/li&gt;
&lt;li&gt;Lacks onboarding, governance, and audit automation.&lt;/li&gt;
&lt;li&gt;Network policy management is manual, increasing operational overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Terraform AWS GitHub Runner Module&lt;/strong&gt; provisions EC2 self-hosted runners with customizable AMIs, integrating well with IaC pipelines. However:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Typically deployed per team, causing fragmentation.&lt;/li&gt;
&lt;li&gt;No native multi-tenant isolation.&lt;/li&gt;
&lt;li&gt;Requires manual IAM and account setup.&lt;/li&gt;
&lt;li&gt;No onboarding or patching automation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Commercial Runner-as-a-Service&lt;/strong&gt; options offer simple UX, automatic scaling, and vendor-managed maintenance with SLAs, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High costs at scale.&lt;/li&gt;
&lt;li&gt;Vendor lock-in risks.&lt;/li&gt;
&lt;li&gt;Limited multi-tenant isolation.&lt;/li&gt;
&lt;li&gt;Often don’t meet strict compliance requirements.&lt;/li&gt;
&lt;/ul&gt;


&lt;h1&gt;
  
  
  Where ForgeMT Fits In
&lt;/h1&gt;

&lt;p&gt;ForgeMT combines the best of these approaches to deliver an enterprise-ready platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Orchestrates ephemeral runners across EC2 and Kubernetes.&lt;/li&gt;
&lt;li&gt;Uses AWS-native identity (IAM/OIDC) and network isolation.&lt;/li&gt;
&lt;li&gt;Provides built-in governance with full lifecycle automation.&lt;/li&gt;
&lt;li&gt;Targets large, security-focused organizations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;ForgeMT doesn’t reinvent ARC or EC2 modules but extends them with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strict multi-tenant isolation:&lt;/strong&gt; Each team runs in a separate AWS account to contain blast radius. IAM/OIDC enforces least privilege. Calico CNI manages Kubernetes network segmentation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full automation:&lt;/strong&gt; Tenant onboarding, runner patching, centralized monitoring, and drift remediation happen automatically, cutting manual toil and errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized control plane:&lt;/strong&gt; One dashboard securely manages all tenants with governance, audit logs, and compliance-ready traceability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization:&lt;/strong&gt; Spot instances, warm pools, and autoscaling based on real-time metrics and spot prices reduce costs without sacrificing availability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open-source transparency:&lt;/strong&gt; 100% open source—no vendor lock-in, no license fees, full customization freedom.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe53c06e4uen18938cyez.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe53c06e4uen18938cyez.jpg" alt="10k ft view" width="800" height="635"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h1&gt;
  
  
  Architecture Overview
&lt;/h1&gt;

&lt;p&gt;At its core, ForgeMT is a centralized control plane that orchestrates ephemeral runner provisioning and lifecycle management across multiple tenants running on both EC2 and Kubernetes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Key Components
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/github-aws-runners/terraform-aws-github-runner" rel="noopener noreferrer"&gt;Terraform module for EC2 runners&lt;/a&gt; — provisions ephemeral EC2 runners with autoscaling, spot/on-demand, and ephemeral lifecycle.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/actions/actions-runner-controller" rel="noopener noreferrer"&gt;Actions Runner Controller (ARC)&lt;/a&gt; — manages EKS-based runners as Kubernetes pods with tenant namespace isolation.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://opentofu.org/" rel="noopener noreferrer"&gt;OpenTofu + Terragrunt&lt;/a&gt; — Infrastructure as Code managing tenant/account/region deployments declaratively.&lt;/li&gt;
&lt;li&gt;IAM Trust Policies — secure runner access with ephemeral credentials via role assumption.&lt;/li&gt;
&lt;li&gt;Splunk &amp;amp; Observability — centralized logs and metrics per tenant.&lt;/li&gt;
&lt;li&gt;Teleport — secure SSH access to ephemeral runners for auditing and debugging.&lt;/li&gt;
&lt;li&gt;EKS + Calico CNI — scalable pod networking with strong tenant segmentation and minimal IP usage.&lt;/li&gt;
&lt;li&gt;EKS + Karpenter — demand-driven node autoscaling with spot and on-demand instances, plus warm pools.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxf5xdxe5o5vt3f12lx7k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxf5xdxe5o5vt3f12lx7k.jpg" alt="10k ft view multi tenants" width="800" height="701"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h1&gt;
  
  
  ForgeMT Control Plane
&lt;/h1&gt;

&lt;p&gt;The control plane is the platform’s brain — managing runner provisioning, lifecycle, security, scaling, and observability.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Orchestration:&lt;/strong&gt; Decides when and where to spin up ephemeral runners (EC2 or Kubernetes pods).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Tenant Isolation:&lt;/strong&gt; Isolates each tenant via dedicated AWS accounts or Kubernetes namespaces, IAM roles, and network policies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Enforcement:&lt;/strong&gt; Applies hardened runner configurations, automates ephemeral credential rotation, and enforces least privilege.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaling &amp;amp; Optimization:&lt;/strong&gt; Integrates with Karpenter and EC2 autoscaling to scale runners up/down with demand and cost awareness.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability &amp;amp; Governance:&lt;/strong&gt; Streams logs and metrics to Splunk; provides audit trails and compliance dashboards.&lt;/li&gt;
&lt;/ol&gt;


&lt;h1&gt;
  
  
  Runner Types and Usage
&lt;/h1&gt;
&lt;h2&gt;
  
  
  Tenant Isolation
&lt;/h2&gt;

&lt;p&gt;Each ForgeMT deployment is single-tenant and region-specific. IAM roles, policies, VPCs, and services are scoped exclusively to that tenant-region pair. This hard boundary prevents cross-tenant access, simplifies compliance, and minimizes blast radius.&lt;/p&gt;
&lt;h2&gt;
  
  
  EC2 Runners
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ephemeral VMs booted from Forge-provided or tenant-custom AMIs.&lt;/li&gt;
&lt;li&gt;Jobs run directly on VMs or inside containers.&lt;/li&gt;
&lt;li&gt;IAM role assumption replaces static credentials.&lt;/li&gt;
&lt;li&gt;Terminated after each job to avoid drift or leaks.&lt;/li&gt;
&lt;/ul&gt;
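&lt;p&gt;Role assumption only works if the tenant-owned role trusts the runner's role. The sketch below builds such a trust policy as a Python dict. The function name and the example ARN are placeholders, not values from ForgeMT, and &lt;code&gt;sts:TagSession&lt;/code&gt; is included on the assumption that the runner passes session tags when it assumes the role (AWS requires that action in the trust policy whenever session tags are sent).&lt;/p&gt;

```python
import json


def tenant_trust_policy(runner_role_arn):
    """Build a trust policy document for a tenant-owned IAM role.

    runner_role_arn is the ARN of the role the runner runs as; the caller
    supplies it, and the value used in the usage note is a placeholder.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": runner_role_arn},
                # AssumeRole lets the runner obtain short-lived credentials;
                # TagSession is needed only if session tags are passed.
                "Action": ["sts:AssumeRole", "sts:TagSession"],
            }
        ],
    }


def render(policy):
    """Serialize the policy as indented JSON, ready to paste or apply."""
    return json.dumps(policy, indent=2)
```

&lt;p&gt;Calling &lt;code&gt;render(tenant_trust_policy("arn:aws:iam::111111111111:role/example-runner-role"))&lt;/code&gt; yields the JSON document to attach to the tenant role; the account ID and role name there are illustrative only.&lt;/p&gt;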

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcbtubvwhxfmlrl3fiai.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcbtubvwhxfmlrl3fiai.jpg" alt="EC2 runner" width="800" height="713"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  EKS Runners
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Managed by ARC as Kubernetes pods in tenant namespaces.&lt;/li&gt;
&lt;li&gt;Images pulled from Forge or tenant ECR repositories.&lt;/li&gt;
&lt;li&gt;Scales dynamically for burst workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9pkf6g69uz87ovz0vrx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq9pkf6g69uz87ovz0vrx.jpg" alt="EKS runner" width="800" height="703"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Warm Pools and Limits
&lt;/h2&gt;

&lt;p&gt;ForgeMT supports warm pools of pre-initialized runners to minimize cold start latency—especially beneficial for EC2 runners with slower boot times.&lt;/p&gt;

&lt;p&gt;Per-tenant limits enforce:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Max concurrent runners&lt;/li&gt;
&lt;li&gt;Warm pool size&lt;/li&gt;
&lt;li&gt;Runner lifetime (auto-termination after jobs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These controls prevent resource abuse and keep costs predictable.&lt;/p&gt;
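&lt;p&gt;To make the interaction between these limits concrete, here is a small sketch of how a scaler might combine them. It is an illustration under stated assumptions, not ForgeMT's actual scaling logic: the function and parameter names are hypothetical, idle runners are assumed to pick up queued jobs first, and the concurrency cap always wins over the warm-pool target.&lt;/p&gt;

```python
def runners_to_start(queued_jobs, running, idle, max_runners, warm_pool_size):
    """Return how many new runners to launch under per-tenant limits.

    queued_jobs:    jobs waiting for a runner
    running:        runners currently alive (busy plus idle)
    idle:           runners currently idle
    max_runners:    hard cap on concurrent runners for the tenant
    warm_pool_size: idle runners to keep ready after the queue is served
    """
    # Idle runners absorb queued jobs first.
    unmet_jobs = max(queued_jobs - idle, 0)
    idle_left = max(idle - queued_jobs, 0)
    # Top the warm pool back up to its target.
    warm_pool_gap = max(warm_pool_size - idle_left, 0)
    needed = unmet_jobs + warm_pool_gap
    # Never exceed the per-tenant concurrency cap.
    headroom = max(max_runners - running, 0)
    return min(needed, headroom)
```

&lt;p&gt;With five queued jobs, eight runners alive, none idle, a cap of ten, and a warm pool of one, the function returns 2: the cap leaves room for only two more runners, so the remaining jobs wait in the queue rather than pushing the tenant past its limit.&lt;/p&gt;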


&lt;h1&gt;
  
  
  Tenant Onboarding
&lt;/h1&gt;

&lt;p&gt;Deploying a new tenant is straightforward and fully automated via a single declarative config file, for example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;gh_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ghes_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
  &lt;span class="na"&gt;ghes_org&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cisco-open&lt;/span&gt;
&lt;span class="na"&gt;tenant&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;iam_roles_to_assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/role_for_forge_runners&lt;/span&gt;
  &lt;span class="na"&gt;ecr_registries&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;123456789012.dkr.ecr.eu-west-1.amazonaws.com&lt;/span&gt;
&lt;span class="na"&gt;ec2_runner_specs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;small&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ami_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;forge-gh-runner-v*&lt;/span&gt;
    &lt;span class="na"&gt;ami_owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;123456789012'&lt;/span&gt;
    &lt;span class="na"&gt;ami_kms_key_arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
    &lt;span class="na"&gt;max_instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;instance_types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;t2.small&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;t2.medium&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;t2.large&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;t3.small&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;t3.medium&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;t3.large&lt;/span&gt;
    &lt;span class="na"&gt;pool_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
    &lt;span class="na"&gt;volume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
      &lt;span class="na"&gt;iops&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
      &lt;span class="na"&gt;throughput&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;125&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp3&lt;/span&gt;
  &lt;span class="na"&gt;large&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ami_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;forge-gh-runner-v*&lt;/span&gt;
    &lt;span class="na"&gt;ami_owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;123456789012'&lt;/span&gt;
    &lt;span class="na"&gt;ami_kms_key_arn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
    &lt;span class="na"&gt;max_instances&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;instance_types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;c6i.8xlarge&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;c5.9xlarge&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;c5.12xlarge&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;c6i.12xlarge&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;c6i.16xlarge&lt;/span&gt;
    &lt;span class="na"&gt;pool_config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
    &lt;span class="na"&gt;volume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
      &lt;span class="na"&gt;iops&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
      &lt;span class="na"&gt;throughput&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;125&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp3&lt;/span&gt;
&lt;span class="na"&gt;arc_runner_specs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runner_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_runners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
      &lt;span class="na"&gt;min_runners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;scale_set_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dependabot&lt;/span&gt;
    &lt;span class="na"&gt;scale_set_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dind&lt;/span&gt;
    &lt;span class="na"&gt;container_actions_runner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;123456789012.dkr.ecr.eu-west-1.amazonaws.com/actions-runner:latest&lt;/span&gt;
    &lt;span class="na"&gt;container_requests_cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
    &lt;span class="na"&gt;container_requests_memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
    &lt;span class="na"&gt;container_limits_cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
    &lt;span class="na"&gt;container_limits_memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2Gi&lt;/span&gt;
    &lt;span class="na"&gt;volume_requests_storage_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp2&lt;/span&gt;
    &lt;span class="na"&gt;volume_requests_storage_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
  &lt;span class="na"&gt;k8s&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runner_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;max_runners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
      &lt;span class="na"&gt;min_runners&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;scale_set_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s&lt;/span&gt;
    &lt;span class="na"&gt;scale_set_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;k8s&lt;/span&gt;
    &lt;span class="na"&gt;container_actions_runner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;123456789012.dkr.ecr.eu-west-1.amazonaws.com/actions-runner:latest&lt;/span&gt;
    &lt;span class="na"&gt;container_requests_cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
    &lt;span class="na"&gt;container_requests_memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
    &lt;span class="na"&gt;container_limits_cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1'&lt;/span&gt;
    &lt;span class="na"&gt;container_limits_memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2Gi&lt;/span&gt;
    &lt;span class="na"&gt;volume_requests_storage_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gp2&lt;/span&gt;
    &lt;span class="na"&gt;volume_requests_storage_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10Gi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ForgeMT platform uses this config to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Provision tenant-specific AWS accounts and resources.&lt;/li&gt;
&lt;li&gt;Set IAM roles with least-privilege trust policies.&lt;/li&gt;
&lt;li&gt;Configure GitHub integration and runner specs.&lt;/li&gt;
&lt;li&gt;Enforce tenant limits and runner types.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This automation enables &lt;strong&gt;zero-touch onboarding&lt;/strong&gt; with no manual AWS or GitHub setup required by the tenant.&lt;/p&gt;




&lt;h1&gt;
  
  
  Extensibility
&lt;/h1&gt;

&lt;p&gt;ForgeMT lets tenants customize their environments and control runner access:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Custom AMIs&lt;/strong&gt; for EC2 runners with tenant-specific tooling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Private ECR repositories&lt;/strong&gt; to host container images for VMs or Kubernetes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant IAM roles&lt;/strong&gt; with trust policies so ForgeMT runners assume them securely without static keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced access patterns&lt;/strong&gt; like chained role assumptions or resource-based policies for complex needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lets each team tune cost, security, and performance independently without affecting core platform stability.&lt;/p&gt;
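&lt;p&gt;As a rough illustration of the tenant IAM role pattern, the fragment below sketches the shape of a trust policy in CloudFormation-style YAML. The role name, account ID, and runner role ARN are placeholders, not values taken from ForgeMT, and a tenant could just as well define the same role with Terraform/OpenTofu.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: a tenant-owned role that a (hypothetical) ForgeMT runner role may assume.
Resources:
  TenantCiRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: tenant-ci-deploy          # hypothetical name
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              # Placeholder ARN for the role attached to the tenant's runners.
              AWS: arn:aws:iam::123456789012:role/forge-runner-role
            Action: sts:AssumeRole
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the trust is expressed on the tenant side, the tenant decides which runner role may assume it, and no long-lived access keys need to be stored in GitHub.&lt;/p&gt;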

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fas07rit1wmg6ale1zos1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fas07rit1wmg6ale1zos1.jpg" alt=" " width="800" height="323"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Security Model
&lt;/h1&gt;

&lt;p&gt;ForgeMT’s foundation is strong isolation and ephemeral execution to reduce risk:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dedicated IAM roles, namespaces, and AWS accounts&lt;/strong&gt; per tenant.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No cross-tenant visibility or access.&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ephemeral runners&lt;/strong&gt; destroyed immediately after job completion to prevent credential or data leakage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temporary credentials via IAM role assumption&lt;/strong&gt; replace static AWS keys.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained access control&lt;/strong&gt; configurable by tenants for resource permissions.&lt;/li&gt;
&lt;li&gt;Full audit trail of provisioning, execution, and shutdown logged via CloudWatch → Splunk.&lt;/li&gt;
&lt;li&gt;Meets CIS Benchmarks and internal security policies.&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Debugging in a Secure, Ephemeral World
&lt;/h1&gt;

&lt;p&gt;Ephemeral runners mean persistent debugging isn’t possible by design, but ForgeMT offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Live debugging with Teleport:&lt;/strong&gt; Keep runners alive temporarily via workflow tweaks to enable SSH into running jobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducible reruns:&lt;/strong&gt; Failed jobs can be rerun identically from the GitHub UI.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log-based troubleshooting:&lt;/strong&gt; Access runner telemetry, syslogs, and job logs centrally without infrastructure exposure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes support:&lt;/strong&gt; Same debugging mechanisms apply to EKS runners, preserving isolation and auditability.&lt;/li&gt;
&lt;/ul&gt;
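&lt;p&gt;One possible form of the "workflow tweak" mentioned above is a step that runs only on failure and sleeps for a bounded time, which keeps the ephemeral runner alive long enough to connect. This is a minimal sketch, not ForgeMT's documented mechanism; the runner labels, the build command, and the 30-minute window are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;name: debug-hold-example
on: workflow_dispatch
jobs:
  build:
    # Hypothetical labels; use the ones your tenant's runners register with.
    runs-on: [self-hosted, linux, x64]
    # Upper bound so a forgotten debug session cannot hold the runner forever.
    timeout-minutes: 60
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: make build   # placeholder build command
      - name: Hold runner for debugging
        # Runs only when an earlier step failed, then keeps the job
        # (and so the runner) alive for up to 30 minutes.
        if: failure()
        run: sleep 1800
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Remove the hold step once the issue is understood, since it keeps paid capacity running for the whole sleep.&lt;/p&gt;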




&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;ForgeMT is likely overkill for small teams. Start simple with ephemeral runners (EC2 or ARC), GitHub Actions, and Terraform automation. Only scale up when you hit real pain points. ForgeMT shines in multi-team environments where tenant isolation, governance, and platform automation are mission-critical. For solo teams, it just adds unnecessary complexity.&lt;/p&gt;

&lt;p&gt;ForgeMT addresses the major enterprise challenges of running GitHub Actions runners at scale by delivering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong multi-tenant isolation&lt;/li&gt;
&lt;li&gt;Fully automated lifecycle management and governance&lt;/li&gt;
&lt;li&gt;Flexible runner types with cost-aware autoscaling and warm pools&lt;/li&gt;
&lt;li&gt;Secure, ephemeral environments that meet compliance needs&lt;/li&gt;
&lt;li&gt;An open-source, extensible platform for customization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For organizations struggling to scale self-hosted runners securely and efficiently on AWS, ForgeMT provides a battle-tested, transparent platform that combines AWS best practices with developer-friendly automation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Dive Into the ForgeMT Project
&lt;/h2&gt;

&lt;p&gt;Ideas are cheap — execution is what counts. ForgeMT’s source code is public — check it out:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/cisco-open/forge/" rel="noopener noreferrer"&gt;https://github.com/cisco-open/forge/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⭐️ If you find it useful, don’t forget to drop a star!&lt;/p&gt;




&lt;h2&gt;
  
  
  🤝 Connect
&lt;/h2&gt;

&lt;p&gt;Let’s connect on &lt;a href="https://www.linkedin.com/in/edersonbrilhante/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://github.com/edersonbrilhante" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>githubactions</category>
      <category>devops</category>
      <category>cicd</category>
      <category>aws</category>
    </item>
    <item>
      <title>Learn how ForgeMT simplifies multi-tenant GitHub Actions runners with security, scalability, and automation. Read the full case study to see how it can streamline your CI/CD pipelines:</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Sat, 17 May 2025 10:14:18 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/learn-how-forgemt-simplifies-multi-tenant-github-actions-runners-with-security-scalability-and-215a</link>
      <guid>https://dev.to/edersonbrilhante/learn-how-forgemt-simplifies-multi-tenant-github-actions-runners-with-security-scalability-and-215a</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/edersonbrilhante/forgemt-a-scalable-secure-multi-tenant-github-runner-platform-at-cisco-735" class="crayons-story__hidden-navigation-link"&gt;ForgeMT: A Scalable, Secure Multi-Tenant GitHub Runner Platform at Cisco&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/edersonbrilhante" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F606500%2F67cc9470-d75a-4b86-bb9f-07329fb2558a.jpeg" alt="edersonbrilhante profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/edersonbrilhante" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Ederson Brilhante
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Ederson Brilhante
                
              
              &lt;div id="story-author-preview-content-2493311" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/edersonbrilhante" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F606500%2F67cc9470-d75a-4b86-bb9f-07329fb2558a.jpeg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Ederson Brilhante&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/edersonbrilhante/forgemt-a-scalable-secure-multi-tenant-github-runner-platform-at-cisco-735" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 16 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/edersonbrilhante/forgemt-a-scalable-secure-multi-tenant-github-runner-platform-at-cisco-735" id="article-link-2493311"&gt;
          ForgeMT: A Scalable, Secure Multi-Tenant GitHub Runner Platform at Cisco
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/terraform"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;terraform&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/devops"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;devops&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/platformengineering"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;platformengineering&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/githubactions"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;githubactions&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/edersonbrilhante/forgemt-a-scalable-secure-multi-tenant-github-runner-platform-at-cisco-735" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/exploding-head-daceb38d627e6ae9b730f36a1e390fca556a4289d5a41abb2c35068ad3e2c4b5.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/multi-unicorn-b44d6f8c23cdd00964192bedc38af3e82463978aa611b4365bd33a0f1f4f3e97.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;10&lt;span class="hidden s:inline"&gt;&amp;nbsp;reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/edersonbrilhante/forgemt-a-scalable-secure-multi-tenant-github-runner-platform-at-cisco-735#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              

              1&lt;span class="hidden s:inline"&gt;&amp;nbsp;comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            14 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>terraform</category>
      <category>devops</category>
      <category>platformengineering</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>ForgeMT: A Scalable, Secure Multi-Tenant GitHub Runner Platform at Cisco</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Fri, 16 May 2025 08:55:41 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/forgemt-a-scalable-secure-multi-tenant-github-runner-platform-at-cisco-735</link>
      <guid>https://dev.to/edersonbrilhante/forgemt-a-scalable-secure-multi-tenant-github-runner-platform-at-cisco-735</guid>
      <description>&lt;h2&gt;
  
  
  🧭 Why ForgeMT Exists
&lt;/h2&gt;

&lt;p&gt;ForgeMT is a centralized platform that enables engineering teams to run GitHub Actions securely and efficiently — without building or managing their own CI infrastructure.&lt;/p&gt;

&lt;p&gt;It provides ephemeral runners (EC2 or Kubernetes), strict tenant isolation, and full automation behind a hardened, shared control plane.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before ForgeMT&lt;/strong&gt;, every team in Cisco’s Security Business Group had to build and maintain their own CI setup — leading to duplicated effort, inconsistent security, slow onboarding, and rising operational overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ForgeMT&lt;/strong&gt; replaced this fragmented approach with a secure, scalable, multi-tenant platform — saving time, reducing risk, and accelerating adoption.&lt;/p&gt;

&lt;h3&gt;
  
  
  ⚡ Fast Facts (ForgeMT Impact)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;⏱️ 80+ engineering hours saved/month per team&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;📦 40,000+ GitHub Actions jobs/month&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;✅ 99.9% success rate across tenants&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;This post explains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;🚀 Why ForgeMT was needed&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;💼 What impact it had&lt;/strong&gt;: from reliability to cost savings and security compliance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🧱 How it works&lt;/strong&gt;: a deep dive into the architecture.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;👉 Jump to what matters most:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;💼 Business Impact&lt;/strong&gt; – For leadership and stakeholders&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🏗️ Architecture&lt;/strong&gt; – For platform engineers and DevOps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🧠 Or keep reading&lt;/strong&gt; for full technical context and background&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🚨 From Fragmented CI to Scalable, Secure Solutions: The Journey to ForgeMT
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Credit: Prototype by Matthew Giassa (MASc, EIT), who championed the Philips Labs GitHub Runner module across multiple teams.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before ForgeMT, each team used its own CI stack—Jenkins, Travis, or Concourse. While these tools met local needs, they created long-term issues: inconsistent patching, security gaps, and poor scalability.&lt;/p&gt;

&lt;p&gt;Matthew built a promising PoC, but it was a siloed setup with manual AWS, GitHub, and Terraform steps. Rigid subnetting caused IPv4 exhaustion, and teams copy-pasting Terraform modules led to high maintenance overhead and config drift.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2s3dw0mqyuv1kpi2g553.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2s3dw0mqyuv1kpi2g553.jpeg" alt="Before: Siloed Team Runners" width="800" height="653"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To address this complexity, I drove the end‑to‑end technical design and implementation of ForgeMT&lt;/strong&gt;—a centralized, multi‑tenant GitHub Actions runner service on AWS—while coordinating with infrastructure, security, and platform stakeholders to ensure a smooth production launch.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69k0jhzeo620078h7ita.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69k0jhzeo620078h7ita.png" alt="After: 10k feet view" width="800" height="782"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At scale, teams were running thousands of Actions jobs across dozens of isolated environments—each with its own patch cadence, network quirks, and IAM policies. ForgeMT unifies these into a single control plane, delivering consistent security, predictable performance, and dramatically simplified operations.&lt;/p&gt;

&lt;p&gt;For detailed business impact metrics (time saved, reliability gains, cost optimization), see the Business Impact section.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It builds on proven ephemeral EC2 and EKS/ARC runner modules, adding:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IAM/OIDC-based tenant isolation&lt;/li&gt;
&lt;li&gt;Built-in observability (metrics, logs, dashboards)&lt;/li&gt;
&lt;li&gt;Automation for patching, Terraform drift, repo onboarding, and global Actions locks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consolidating infrastructure into a hardened control plane kept security and compliance at the forefront while enabling rapid onboarding, eliminating manual patching, and solving IPv4 exhaustion. The IPv4 fix came from scaling pod-based runners on EKS with the Calico CNI, and tenant isolation is enforced with IAM roles and security groups that control access. The result preserves the security and flexibility of the original prototype in a compliant, scalable platform.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F71lzhc49vpvee1lutwr5.jpeg" alt="After: Centralized ForgeMT Control Plane" width="800" height="782"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 Business Impact
&lt;/h2&gt;

&lt;p&gt;ForgeMT has met the needs of its tenant teams while scaling securely within Cisco, optimizing cloud spend, and improving reliability for all stakeholders:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dramatic time savings (80+ hours/month per team):&lt;/strong&gt; By automating every aspect of the runner lifecycle (OS patching, Terraform module updates, ephemeral provisioning, and even repository registration), ForgeMT freed teams from manual CI maintenance so they could refocus on shipping features.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized cloud spend:&lt;/strong&gt; Spot and On-Demand Instances, right‑sized instance selection per job type, and EKS + Calico’s IP‑efficient networking cut infrastructure costs without slowing builds. ForgeMT also supports warm instance pools for high-frequency jobs, avoiding cold starts when speed is critical and balancing performance against cost.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rock‑solid reliability (99.9% success over 40K+ jobs/month):&lt;/strong&gt; Centralizing infrastructure eliminated snowflake environments and drift, reducing job failures caused by misconfiguration or stale runners to near zero.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise‑grade security &amp;amp; compliance:&lt;/strong&gt; IAM/OIDC per‑tenant isolation, CIS‑benchmarked AMIs, and end‑to‑end logging into Splunk ensured every action was auditable, vault‑grade credentials were never exposed, and internal audits passed with zero findings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;True multi‑tenancy at scale:&lt;/strong&gt; Teams retain autonomy over AMIs, ECRs, and workflow definitions while ForgeMT transparently handles networking, isolation, and autoscaling—supporting dozens of teams without additional IP consumption or operational overhead.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AWS account isolation per tenant:&lt;/strong&gt; Each tenant can have one or more individual AWS accounts, with full control over their own network setup. This includes the flexibility to configure internal or public subnets within their AWS accounts, ensuring strong security boundaries and independent resource management without ForgeMT managing their network.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Together, these outcomes turned a fractured, high‑toil CI landscape into a self‑service platform that scales securely, reduces costs, and accelerates delivery.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚙️ ForgeMT Architecture Overview
&lt;/h2&gt;

&lt;h3&gt;
  
  
  📦 Core Components &amp;amp; Technical Foundations
&lt;/h3&gt;

&lt;p&gt;These results are enabled by the following technical components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/github-aws-runners/terraform-aws-github-runner" rel="noopener noreferrer"&gt;Terraform module for EC2 runners:&lt;/a&gt; Utilized as a Terraform module to provision ephemeral EC2-based GitHub Actions runners, supporting auto-scaling and cost optimization by using AWS spot and on-demand instances. This setup ensures that runners are created on-demand and terminated after use, aligning with the ephemeral nature of ForgeMT's infrastructure. &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/actions/actions-runner-controller" rel="noopener noreferrer"&gt;ARC (Actions Runner Controller)&lt;/a&gt;: Employed to manage EKS-based GitHub Actions runners, enabling containerized, isolated job execution via Kubernetes. This approach leverages Kubernetes' orchestration capabilities for efficient scaling and management of CI/CD workloads.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://opentofu.org/" rel="noopener noreferrer"&gt;OpenTofu&lt;/a&gt; + &lt;a href="https://terragrunt.gruntwork.io/" rel="noopener noreferrer"&gt;Terragrunt&lt;/a&gt;: Implemented for Infrastructure as Code (IaC), ensuring region-, account-, and tenant-specific infrastructure deployments with DRY (Don't Repeat Yourself) principles. This methodology facilitates consistent and repeatable infrastructure provisioning across multiple environments.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/blogs/security/how-to-use-trust-policies-with-iam-roles/" rel="noopener noreferrer"&gt;IAM Trust Policies&lt;/a&gt;: Adopted to secure runner access using short-lived credentials via IAM roles and trust relationships, eliminating the need for static credentials and enhancing security.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.splunk.com/en_us/download/o11y-cloud-free-trial.html" rel="noopener noreferrer"&gt;Splunk Cloud&lt;/a&gt; &amp;amp; &lt;a href="https://www.splunk.com/en_us/download/o11y-cloud-free-trial.html" rel="noopener noreferrer"&gt;O11y(Observability)&lt;/a&gt;: Integrated for centralized logging and metrics aggregation, providing real-time observability across ForgeMT components. This setup enables detailed telemetry, including per-tenant dashboards for monitoring resource usage and optimization insights.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://goteleport.com/" rel="noopener noreferrer"&gt;Teleport&lt;/a&gt;: Utilized to provide secure, auditable SSH access to EC2 runners and Kubernetes pods, enhancing compliance, access control, and auditing capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/eks/" rel="noopener noreferrer"&gt;EKS&lt;/a&gt; + &lt;a href="https://docs.tigera.io/calico/latest/about/" rel="noopener noreferrer"&gt;Calico CNI&lt;/a&gt;: Leveraged to scale pod provisioning without consuming additional VPC IPs, utilizing Calico's efficient networking. This setup ensures tenant isolation and optimizes network resource usage within limited VPC subnets.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/eks/" rel="noopener noreferrer"&gt;EKS&lt;/a&gt; + &lt;a href="https://karpenter.sh/" rel="noopener noreferrer"&gt;Karpenter&lt;/a&gt;: Enables dynamic, demand-driven autoscaling of Kubernetes worker nodes. Automatically provisions the most suitable and cost-effective EC2 instance types based on real-time pod requirements. Supports spot and on-demand capacity, prioritizing efficiency and performance. Warm pools can be configured to reduce cold start latency while maintaining cost control—ideal for high-churn CI/CD workloads.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These technologies form the backbone of ForgeMT, enabling its robust performance and scalability.&lt;/p&gt;
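&lt;p&gt;To make the Karpenter piece more concrete, here is a hedged sketch of a per-tenant node pool that accepts both Spot and On-Demand capacity. It assumes Karpenter's &lt;code&gt;karpenter.sh/v1&lt;/code&gt; NodePool API and an existing EC2NodeClass named &lt;code&gt;default&lt;/code&gt;; the pool name, the &lt;code&gt;forge/tenant&lt;/code&gt; label and taint key, and the CPU limit are illustrative and not taken from the ForgeMT repository.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: names, labels, and limits below are hypothetical.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: tenant-a-runners
spec:
  template:
    metadata:
      labels:
        forge/tenant: tenant-a
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default              # assumed to exist already
      requirements:
        # Let Karpenter choose between Spot and On-Demand capacity.
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      taints:
        # Only pods that tolerate this taint land on the tenant's nodes.
        - key: forge/tenant
          value: tenant-a
          effect: NoSchedule
  limits:
    cpu: "200"                     # cap on total CPU this pool may provision
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Runner pods for the tenant would then carry a matching toleration and node selector, so one tenant's jobs do not share nodes with another's.&lt;/p&gt;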




&lt;h3&gt;
  
  
  🧠 ForgeMT Control Plane (Managed by Forge Team)
&lt;/h3&gt;

&lt;p&gt;The ForgeMT control plane hosts shared infrastructure and reusable IaC modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ForgeMT GitHub App:&lt;/strong&gt; Installed on tenant repositories to listen for GitHub workflow events and dynamically register ephemeral runners.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ForgeMT AMIs &amp;amp; Forge ECR:&lt;/strong&gt; Default base images for runners (VMs and containers).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform Modules:&lt;/strong&gt; Each tenant-region pair deploys an isolated ForgeMT instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Gateway + Lambda:&lt;/strong&gt; Processes GitHub webhook jobs to trigger runner provisioning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Logging:&lt;/strong&gt; Runner logs are forwarded to CloudWatch, then into Splunk Cloud Platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Centralized Observability:&lt;/strong&gt; All AWS metrics are sent to Splunk O11y Cloud.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teleport:&lt;/strong&gt; Secure, role-based SSH access to VM runners (if needed), with session logging.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🏗️ Tenant Isolation
&lt;/h3&gt;

&lt;p&gt;Each ForgeMT deployment is dedicated to a single tenant, ensuring full isolation within a specific AWS region. IAM roles, policies, services, and AWS resources are scoped uniquely to each tenant-region pair, which enforces strict security and compliance while minimizing the blast radius.&lt;/p&gt;




&lt;h3&gt;
  
  
  💻 Runner Types
&lt;/h3&gt;

&lt;h4&gt;
  
  
  🧱 AWS EC2-Based Runners (VM and Metal)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Ephemeral Runner Provisioning:&lt;/strong&gt; EC2 runners are provisioned using Forge-provided AMIs or tenant-specific custom AMIs. These instances are pre-configured with the necessary tools to execute CI/CD jobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workload Execution:&lt;/strong&gt; Jobs can run directly on the EC2 instance or inside containers, using &lt;code&gt;container:&lt;/code&gt; blocks in GitHub workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Authentication to tenant AWS resources is handled through IAM roles and trust policies, eliminating the need for static credentials and ensuring dynamic, secure access control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ephemeral Nature:&lt;/strong&gt; Once a job is completed, the EC2 instance is terminated, maintaining a completely stateless environment.&lt;/li&gt;
&lt;/ul&gt;
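
&lt;p&gt;A minimal sketch of the container option, assuming a hypothetical runner label &lt;code&gt;forge-standard&lt;/code&gt; and a public &lt;code&gt;node:20&lt;/code&gt; image. Neither name comes from ForgeMT itself, so substitute the labels and images your tenant actually uses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs:
  test:
    # Hypothetical labels; use the labels exposed by your tenant's EC2 runner type.
    runs-on: [self-hosted, forge-standard]
    # Steps run inside this container instead of directly on the EC2 host.
    container:
      image: node:20
    steps:
      - uses: actions/checkout@v4
      - run: node --version
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dropping the &lt;code&gt;container:&lt;/code&gt; block runs the same steps directly on the instance, which suits jobs that rely on tooling baked into the AMI.&lt;/p&gt;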

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11dsb73l5s74zu4jgidq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11dsb73l5s74zu4jgidq.jpeg" alt="Forge Control Plane for AWS EC2 Runners" width="800" height="849"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  ☸️ EKS-Based Runners (Kubernetes)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes-Orchestrated Actions:&lt;/strong&gt; Using the Actions Runner Controller (ARC), EKS runners are provisioned as pods within an Amazon EKS cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Isolation:&lt;/strong&gt; Each tenant is assigned a dedicated namespace, service account, and IAM role, ensuring strict isolation of resources and permissions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container Images:&lt;/strong&gt; Runners can pull container images from either the Forge ECR or the tenant’s own ECR, depending on the configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability:&lt;/strong&gt; EKS is ideal for high-scale operations, leveraging Kubernetes' orchestration capabilities to manage the lifecycle of runners efficiently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy234v7u4zcqkhut34yn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy234v7u4zcqkhut34yn.jpeg" alt="Forge Control Plane for K8S Pod Runners" width="800" height="674"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  🔁 Warm Pool
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reducing Startup Latency:&lt;/strong&gt; An optional warm pool can be configured for both EC2 and EKS runners, pre-initializing instances or pods to reduce waiting times during high demand.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Importance for EKS:&lt;/strong&gt; For EKS runners, the need for warm pools is significantly reduced, as Kubernetes already provides rapid scaling and efficient pod initialization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Usage in EC2:&lt;/strong&gt; The warm pool cuts the initialization time of EC2 instances, so critical jobs start sooner.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  💻 Examples of Runner Types in ForgeMT
&lt;/h3&gt;

&lt;p&gt;ForgeMT offers flexibility for tenants to configure multiple runner types simultaneously, adapting to their workload needs. Each tenant can define as many runners as needed, with a parallelism limit set per tenant and runner type. Here are some typical runner examples and their use cases:&lt;/p&gt;
&lt;h4&gt;
  
  
  🧱 EC2 Runners
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small:&lt;/strong&gt; Lightweight instances for tasks with minimal resource usage, such as quick tests or linting.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard:&lt;/strong&gt; Instances for balanced workloads, ideal for code compilation or integration tests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large:&lt;/strong&gt; High-performance instances for tasks requiring more processing power, such as complex builds or load tests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bare Metal:&lt;/strong&gt; Bare-metal instances for applications that need full control over the hardware, such as simulations or intensive processing tasks.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  ☸️ Kubernetes Runners
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dependabot:&lt;/strong&gt; Used for automated dependency update jobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Light (k8s):&lt;/strong&gt; Runners for simple tasks that don't require Docker, like linting or unit test execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker-in-Docker (DinD):&lt;/strong&gt; Used for jobs that require Docker inside Kubernetes, such as image building or integration tests involving containers.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  🔄 Configurable Parallelism per Tenant and Runner Type
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Each &lt;strong&gt;tenant&lt;/strong&gt; can configure their own set of runners and use different EC2 instance types or Kubernetes pods simultaneously.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;parallelism limit&lt;/strong&gt; can be configured per runner type and tenant, ensuring that running multiple jobs does not overload resources.&lt;/li&gt;
&lt;li&gt;This allows each team to run jobs in parallel based on their needs without impacting the performance of other tenants or jobs.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Considerations
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Choosing the Right Runner:&lt;/strong&gt; Depending on workload complexity and job requirements, you may choose EC2 or EKS runners. EKS is generally preferred for lightweight, scalable workloads, while EC2 may be necessary for jobs with specific hardware or memory requirements.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  ⚙️ GitHub Integration
&lt;/h3&gt;

&lt;p&gt;GitHub events reach ForgeMT through a webhook behind API Gateway, and ForgeMT dynamically registers runners into the GitHub Runner Groups associated with the tenant. The runner lifecycle is ephemeral: runners are registered just in time for a job and destroyed once it completes. When the ForgeMT GitHub App is installed on a new repository, that repository is automatically registered with the correct GitHub Runner Group, so its jobs land on the right tenant's runners.&lt;/p&gt;
&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1nhvzmwcst3cngxv9h4.jpeg" alt="GitHub Integration Forge Control Plane" width="800" height="1314"&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  🔌 Extensibility
&lt;/h3&gt;

&lt;p&gt;Each tenant account can optionally manage the following resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tenant AMIs (for AWS EC2 runners):&lt;/strong&gt; Custom-built images with pre-installed tooling tailored to the tenant's specific requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant ECR:&lt;/strong&gt; Houses custom container images used for VM-based container jobs, GitHub composite actions, or full pod images in EKS.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tenant IAM Role:&lt;/strong&gt; Configured with trust relationships to allow ForgeMT runners to securely assume roles without the need for AWS access keys.&lt;/li&gt;
&lt;li&gt;Tenants that need a custom Amazon Machine Image (AMI) or container image are responsible for building and maintaining it. ForgeMT provides a base image as a starting point, but the final configuration is under the tenant's control. Once the custom image is ready, the tenant shares it with the ForgeMT accounts so it can be integrated into the platform.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  🔄 Optional Configurations
&lt;/h4&gt;

&lt;p&gt;Tenants can choose to configure the following based on their specific needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accessing AWS Resources via Runners:&lt;/strong&gt; To enable runners to interact with AWS services within the tenant's account, an IAM role must be established with a trust relationship permitting ForgeMT to assume it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pulling Images from Tenant ECR:&lt;/strong&gt; If runners need to pull images from the tenant's ECR—be it for container jobs, composite actions, or Kubernetes pods—the tenant must configure appropriate repository policies and IAM permissions to allow these operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accessing Additional Tenant Resources:&lt;/strong&gt; For runners to access other AWS resources within the tenant's account, the IAM role assumed by ForgeMT must have policies granting the necessary permissions. This might involve setting up a chain of role assumptions or defining specific resource-based policies.&lt;/li&gt;
&lt;/ul&gt;
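
&lt;p&gt;The trust relationship described above could look roughly like the following CloudFormation sketch, deployed in the tenant account. The role names, account ID, and attached policy are all placeholders chosen for illustration, not values defined by ForgeMT:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Deployed in the tenant account. All names and ARNs below are placeholders.
Resources:
  ForgeRunnerAccessRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: forge-runner-access   # hypothetical role name
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              # Hypothetical ARN of the ForgeMT runner role for this tenant.
              AWS: arn:aws:iam::111111111111:role/forgemt-tenant-runner
            Action:
              - sts:AssumeRole
              # configure-aws-credentials tags the session by default.
              - sts:TagSession
      ManagedPolicyArns:
        # Example only; replace with least-privilege policies for your workloads.
        - arn:aws:iam::aws:policy/ReadOnlyAccess
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The trust policy controls who may assume the role, while the attached policies control what the runner can do once it has. Keeping both in the tenant account leaves access decisions with the tenant.&lt;/p&gt;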
&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8i4l0u09vehborjbdtdk.jpeg" alt="Tenant's AWS Account Integration with ForgeMT Control Plane" width="800" height="368"&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  📊 Observability: Splunk Cloud &amp;amp; O11y
&lt;/h2&gt;

&lt;p&gt;ForgeMT delivers full-stack observability with centralized logging and per-tenant metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Centralized logging:&lt;/strong&gt; All relevant logs — syslog, AWS EC2 user data, GitHub runner job logs, worker logs, and agent logs — are sent to CloudWatch Logs and forwarded to Splunk Cloud for full visibility and auditability.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metrics via Splunk O11y:&lt;/strong&gt; AWS metrics are sent to Splunk O11y Cloud for detailed telemetry.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-tenant dashboards:&lt;/strong&gt; Each team gets dedicated dashboards showing cost breakdowns, resource usage, and optimization insights (e.g., high-memory job detection).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdll4w5y6ze3qonrl0k3w.jpeg" alt="ForgeMT Control Plane integration with Splunk Cloud" width="800" height="613"&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  🔐 Security Model
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strong tenant isolation:&lt;/strong&gt; Every tenant has its own IAM roles, namespaces, and resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Role Assumption:&lt;/strong&gt; Eliminates use of long-lived AWS credentials.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No cross-tenant visibility:&lt;/strong&gt; Runners cannot access other tenant workloads or secrets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained access control:&lt;/strong&gt; Each tenant defines what their runners can access by configuring the IAM role being assumed—this can include direct resource access or chained role assumptions for more advanced patterns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔒 Ephemeral Isolation:&lt;/strong&gt; ForgeMT runners are automatically destroyed after every job — success or failure. This guarantees a clean slate every time, eliminates environment drift, blocks credential persistence, and prevents resource leaks by default.&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  🛡️ Compliance &amp;amp; Observability
&lt;/h3&gt;

&lt;p&gt;ForgeMT ensures strict compliance and security throughout the lifecycle of its ephemeral runners, from provisioning to execution and shutdown.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Full Audit Trail:&lt;/strong&gt; Every runner lifecycle event — including provisioning, execution, and shutdown — is logged, ensuring complete visibility and traceability for compliance audits. This audit trail is vital for maintaining transparency in high-security environments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CloudWatch → Splunk Integration:&lt;/strong&gt; Logs from the runners are forwarded from CloudWatch to Splunk, enabling teams to perform real-time queries on logs. This integration supports compliance audits by providing detailed, queryable logs that can be easily reviewed and accessed for regulatory requirements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IAM Integration:&lt;/strong&gt; By relying on IAM (Identity and Access Management) roles, ForgeMT eliminates hardcoded credentials and long-term AWS access keys. This significantly reduces the risk of unauthorized access: access is role-based, and credentials are temporary and follow the principle of least privilege.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security Standards Compliance:&lt;/strong&gt; ForgeMT meets internal security standards, which are aligned with industry best practices such as CIS Benchmarks. This ensures that the platform adheres to rigorous security controls and provides a secure environment for multi-tenant workloads.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  🔍 Debugging Securely and Effectively
&lt;/h3&gt;

&lt;p&gt;ForgeMT offers teams the option to choose between EC2 Spot Instances and On-Demand Instances, allowing for flexibility in cost optimization. While Spot Instances can provide significant cost savings, they come with the inherent risk that AWS may reclaim the instance at any time. Teams are responsible for evaluating this risk and determining whether to use Spot or On-Demand Instances based on the criticality of their workloads.&lt;/p&gt;

&lt;p&gt;Given ForgeMT's design of ephemeral runners, which are terminated immediately after each job to prevent state persistence and credential leakage, debugging presents unique challenges. However, the platform offers robust solutions to address these challenges.&lt;/p&gt;

&lt;p&gt;For real-time debugging, developers can access running jobs via &lt;strong&gt;Teleport&lt;/strong&gt;. By including a sleep step in the workflow or using a custom wrapper, the runner can be kept alive temporarily. This allows for manual inspection and troubleshooting while the job is still running.&lt;/p&gt;

&lt;p&gt;Additionally, even without live access, ForgeMT maintains comprehensive observability. Teams can rely on syslogs, GitHub Actions job logs, and runner-level telemetry to understand job behavior. Every job runs in a fully reproducible environment, meaning developers can simply rerun failed jobs through the GitHub UI, replicating the exact conditions without side effects while maintaining full auditability.&lt;/p&gt;

&lt;p&gt;For &lt;strong&gt;Kubernetes-based runners&lt;/strong&gt;, the same debugging approach applies: &lt;strong&gt;Teleport&lt;/strong&gt; can be used for live access to running jobs. The integration with Kubernetes allows teams to extend the same debugging capabilities while leveraging the scalability and flexibility of the containerized environment.&lt;/p&gt;
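
&lt;p&gt;The keep-alive step mentioned above can be sketched as follows. The &lt;code&gt;make test&lt;/code&gt; command and the 30-minute window are illustrative assumptions, not ForgeMT defaults:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;steps:
  - name: Run tests
    run: make test   # placeholder for the real job command

  # Runs only if an earlier step failed; keeps the runner alive for inspection.
  - name: Hold runner for debugging
    if: failure()
    run: sleep 1800   # 30 minutes; cancel the run in the GitHub UI to release it earlier
    timeout-minutes: 35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the hold step is guarded by &lt;code&gt;if: failure()&lt;/code&gt;, successful runs still terminate immediately and the runner stays ephemeral. Remove the step once the investigation is over so failed jobs do not keep paying for idle runners.&lt;/p&gt;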


&lt;h3&gt;
  
  
  🚀 ForgeMT: Powering Tenants with Flexibility and Control
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;💥 Ephemeral by design —&lt;/strong&gt; Runners are created per job and disappear afterward. No drift. No patching. No residual garbage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🛠️ Infra-as-Code from top to bottom —&lt;/strong&gt; Fully automated. Declarative. Version-controlled. No snowflakes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔐 Strong isolation baked in —&lt;/strong&gt; IAM, OIDC, and security group segmentation per tenant. No cross-tenant blast radius.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📦 Run anything, per tenant —&lt;/strong&gt; EC2 or EKS. k8s, dind, or metal. Each tenant defines their own mix.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🚦 Control usage at scale —&lt;/strong&gt; Enforce parallelism limits per tenant/type. No surprises. No abuse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🕹️ Custom policies, zero effort —&lt;/strong&gt; Tenants define autoscaling, labels, and configurations via GitHub — no AWS skills required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🧘 No infra for tenants to manage —&lt;/strong&gt; No patching, no VPCs, no accounts. Just push code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🕵️ Observability without ownership —&lt;/strong&gt; Logs, metrics, and traces exposed per tenant. No nodes to babysit.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ Fast time-to-first-run —&lt;/strong&gt; Cold starts optimized. Most runners boot in &amp;lt;20s, even for large jobs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🌎 Network-aware provisioning —&lt;/strong&gt; Runners automatically deploy into the correct subnet, zone, or region.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📊 Usage-aware scaling —&lt;/strong&gt; Instance types are selected based on cost/performance tradeoffs — no more overprovisioning by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🧩 GitHub-native workflows —&lt;/strong&gt; No toolchain rewrites required. Just drop in the &lt;code&gt;runs-on&lt;/code&gt; labels and go.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🚫 No global queues —&lt;/strong&gt; Each tenant is scoped, isolated, and throttled independently.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgmluht56f3mhucb6goz.jpeg" alt="ForgeMT Control Plane and all the integrations" width="800" height="782"&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  🛠️ Implementation &amp;amp; Adoption
&lt;/h3&gt;

&lt;p&gt;It took about 2 months to evolve from a single-tenant, EC2-only setup into a fully multi-tenant platform. Highlights:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔹 &lt;strong&gt;Kubernetes support&lt;/strong&gt; with Calico CNI + Karpenter&lt;/li&gt;
&lt;li&gt;🔹 &lt;strong&gt;Tenant isolation&lt;/strong&gt; by design&lt;/li&gt;
&lt;li&gt;🔹 &lt;strong&gt;Per-tenant automation&lt;/strong&gt; &amp;amp; base images&lt;/li&gt;
&lt;li&gt;🔹 &lt;strong&gt;EKS pod identity&lt;/strong&gt; for secure access&lt;/li&gt;
&lt;li&gt;🔹 Integrated with &lt;strong&gt;Teleport, Splunk&lt;/strong&gt;, and full observability&lt;/li&gt;
&lt;li&gt;🔹 Custom dashboards with enriched telemetry&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  🚀 Frictionless Adoption
&lt;/h3&gt;

&lt;p&gt;Onboarding was dead simple.&lt;/p&gt;

&lt;p&gt;For most tenants, switching to ForgeMT meant updating &lt;strong&gt;just the&lt;/strong&gt; &lt;code&gt;runs-on&lt;/code&gt; label in their GitHub Actions workflows. ⚡ No rewrites. No migrations. No downtime.&lt;/p&gt;
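
&lt;p&gt;As a sketch, the change looks like this. The &lt;code&gt;forge-standard&lt;/code&gt; label is a hypothetical placeholder, since the actual labels depend on the runner types each tenant defines:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;jobs:
  build:
    # Before: GitHub-hosted runner
    # runs-on: ubuntu-latest

    # After: hypothetical ForgeMT labels for this tenant's runner type
    runs-on: [self-hosted, forge-standard]
    steps:
      - uses: actions/checkout@v4
      - run: make build   # placeholder build command; the steps stay unchanged
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;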

&lt;p&gt;For teams that required deeper isolation, assuming their own IAM role was just as straightforward:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::&amp;lt;tenant-account&amp;gt;:role/&amp;lt;role-name&amp;gt;
    aws-region: &amp;lt;aws-region&amp;gt;
    role-duration-seconds: 900

- name: Example
  run: aws cloudformation list-stacks
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;💡 This approach made adoption &lt;strong&gt;fast, safe, and low-friction&lt;/strong&gt; — even for teams skeptical of platform changes.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚫 Overkill Warning: When ForgeMT Is Too Much
&lt;/h2&gt;

&lt;p&gt;If you're a small team, ForgeMT might be overkill. Start with the basics: ephemeral runners (EC2 or ARC), GitHub Actions, and Terraform automation. Scale up only when you hit real pain. ForgeMT shines in multi-team setups where governance, tenant isolation, and platform automation matter. For solo teams, it may just add complexity you don’t need.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔭 What’s Next
&lt;/h2&gt;

&lt;p&gt;I’m currently focused on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost-aware scheduling&lt;/strong&gt; — Prioritizing jobs based on real-time pricing and instance efficiency, optimizing for performance while reducing costs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic autoscaling&lt;/strong&gt; — Moving from static warm pool rules to a more responsive, metrics-driven approach that adapts to the bursty nature of GitHub Actions workloads.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deeper observability&lt;/strong&gt; — Integrating GitHub metrics for actionable insights that drive optimized runner performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI-driven scaling optimization&lt;/strong&gt; — Leveraging historical data to predict workload demands, optimize resource allocation, and automate scaling decisions based on both performance and cost metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re tackling similar problems — or looking to adopt, extend, or contribute to ForgeMT — let’s talk. I’m always open to collaborating with engineers building serious DevSecOps infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧪 Dive Into the ForgeMT Project
&lt;/h2&gt;

&lt;p&gt;Ideas are cheap — execution is everything. The ForgeMT source code is now publicly available — check it out:&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/cisco-open/forge/" rel="noopener noreferrer"&gt;https://github.com/cisco-open/forge/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;⭐️ Don’t forget to give it a star ;)!&lt;/p&gt;




&lt;h2&gt;
  
  
  ✍️ In Short
&lt;/h2&gt;

&lt;p&gt;ForgeMT emerged from real-world CI pain at enterprise scale. What began as a prototype to fix local inefficiencies has grown into a secure, multi-tenant, production-grade runner platform. I’m sharing this so others can skip the trial-and-error and build smarter from the start.&lt;/p&gt;




&lt;h2&gt;
  
  
  🤝 Connect
&lt;/h2&gt;

&lt;p&gt;Let’s connect on &lt;a href="https://www.linkedin.com/in/edersonbrilhante/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; and &lt;a href="https://github.com/edersonbrilhante" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Always happy to trade notes with like-minded builders.&lt;/p&gt;

&lt;p&gt;This article was originally published on &lt;a href="https://www.linkedin.com/pulse/forge-scalable-secure-multi-tenant-github-runner-brilhante--fyxbf" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>devops</category>
      <category>platformengineering</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>Decoding the Myth of 'Junior' in DevOps and SRE: Navigating Challenges and Cultivating Expertise</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Tue, 16 Jan 2024 15:38:01 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/decoding-the-myth-of-junior-in-devops-and-sre-navigating-challenges-and-cultivating-expertise-4bmk</link>
      <guid>https://dev.to/edersonbrilhante/decoding-the-myth-of-junior-in-devops-and-sre-navigating-challenges-and-cultivating-expertise-4bmk</guid>
      <description>&lt;p&gt;In my view, assigning roles such as &lt;strong&gt;'Junior DevOps'&lt;/strong&gt; and &lt;strong&gt;'Junior SRE (Site Reliability Engineer)'&lt;/strong&gt; seems impractical, reminiscent of labeling someone an &lt;strong&gt;'Entry-Level Software Architect.'&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Navigating the intricate landscape
&lt;/h2&gt;

&lt;p&gt;Navigating the intricate landscape of DevOps and SRE demands proficiency in &lt;strong&gt;coding, networking, cloud technologies, security, and system administration.&lt;/strong&gt; Envisioning someone with limited experience adeptly maneuvering through this multifaceted skill set poses a significant challenge.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Software Architect analogy
&lt;/h2&gt;

&lt;p&gt;Similarly, giving the title &lt;strong&gt;"Software Architect"&lt;/strong&gt; to beginners doesn't align with the intricate demands of the role. Crafting sophisticated software solutions requires years of practical experience, involving intricate system design and understanding. Expecting a junior engineer to architect and implement a secure, scalable microservices architecture without in-depth knowledge and experience in the design principles of distributed systems is unrealistic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quantity vs. Experience fallacy
&lt;/h2&gt;

&lt;p&gt;Furthermore, the belief that numerous junior roles collectively can achieve the same level of effectiveness as a seasoned professional echoes the fallacy of favoring &lt;strong&gt;quantity over experience.&lt;/strong&gt; While each junior role contributes to the team's growth, the efficiency and strategic thinking of an experienced architect often outpace the combined efforts of multiple entry-level professionals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pressure on companies
&lt;/h2&gt;

&lt;p&gt;In addition, the pressure on companies to leverage the benefits of DevOps and SRE roles within their organization often stems from the growing need for seamless integration between development and operations. Individuals in these positions are expected to possess a profound understanding of both coding and operations, creating a unique blend of skills. Unfortunately, finding professionals who embody this multidisciplinary expertise is a formidable challenge. Those who can seamlessly bridge the gap between traditional sysadmins and developers are not only rare but also come at a premium, given the scarcity of individuals with such comprehensive skills in the overall job market.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scarcity leading to desperation
&lt;/h2&gt;

&lt;p&gt;This scarcity sometimes leads companies to consider entry-level candidates, hoping to quickly train them to fill the void. However, the complex nature of the disciplines touched upon by DevOps and SRE roles means that becoming proficient in each area takes &lt;strong&gt;years of hands-on experience.&lt;/strong&gt; The high demand and limited supply of individuals with these multifaceted skills contribute to the desperation companies feel in recruiting for these roles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Acknowledging the shortage
&lt;/h2&gt;

&lt;p&gt;Acknowledging this shortage is crucial, especially as it extends beyond DevOps and SRE roles to other senior positions. Over the past two decades, the industry has witnessed a trend of companies poaching professionals from one another rather than investing in training new talents. This cycle has created a snowball effect, further exacerbating the shortage of skilled individuals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solution: Attracting seasoned developers
&lt;/h2&gt;

&lt;p&gt;A potential solution lies in attracting seasoned developers with a penchant for infrastructure and operations to transition into roles in DevOps and SRE. These individuals often bring a wealth of experience, having naturally acquired knowledge in areas beyond coding, such as security, infrastructure, databases, and operations. Their diverse skill set aligns with the demands of contemporary senior developers who are expected to possess expertise beyond language-specific coding skills. By encouraging such transitions, companies can tap into a pool of experienced professionals and mitigate the challenges associated with the scarcity of multidisciplinary talent in the market.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recommended pathway for aspiring professionals
&lt;/h2&gt;

&lt;p&gt;For aspiring professionals entering the tech industry, a recommended pathway involves starting as a developer before venturing into the multifaceted realms of DevOps and SRE. Beginning as a developer allows individuals to hone their coding skills and gain a solid foundation in software engineering principles. As they accumulate experience and familiarity with the development lifecycle, they can then gradually navigate towards operations, infrastructure, and other related disciplines. This gradual journey not only provides a comprehensive understanding of the intricacies of both coding and operations but also allows individuals to develop a deeper appreciation for the challenges addressed by DevOps and SRE roles. This approach acknowledges the value of hands-on experience and ensures that individuals entering these dynamic fields are well-equipped to contribute meaningfully to the integration of development and operations within an organization.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>devops</category>
      <category>discuss</category>
      <category>career</category>
    </item>
    <item>
      <title>Combining Packer, QEMU, Ubuntu Cloud Images, and Ansible</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Wed, 14 Jun 2023 20:50:24 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/combining-packer-qemu-ubuntu-cloud-images-and-ansible-47eg</link>
      <guid>https://dev.to/edersonbrilhante/combining-packer-qemu-ubuntu-cloud-images-and-ansible-47eg</guid>
      <description>&lt;p&gt;Hello everyone! I want to share a current use case at my company where I have the opportunity to work with &lt;a href="https://www.packer.io/"&gt;Packer&lt;/a&gt;, &lt;a href="https://www.qemu.org/"&gt;QEMU&lt;/a&gt;, &lt;a href="https://www.ansible.com/"&gt;Ansible&lt;/a&gt; and &lt;a href="https://cloud-images.ubuntu.com/"&gt;Ubuntu Cloud Images&lt;/a&gt; leveraging the concept of Infrastructure as Code (IaC).&lt;/p&gt;

&lt;p&gt;Infrastructure as Code (IaC) is a software engineering practice that enables the management and provisioning of infrastructure resources through code. Instead of manually configuring servers and infrastructure components, IaC allows you to define your desired infrastructure state using declarative or imperative code. It brings automation, version control, and consistency to infrastructure management.&lt;/p&gt;

&lt;p&gt;In our case, we utilize Packer, which is a powerful tool falling under the umbrella of IaC. Packer enables the creation of identical machine images for multiple platforms, such as virtual machines, containers, or cloud instances. With Packer, we define the configuration of our desired machine image, including the operating system, software stack, and customizations, all through code. Packer then automates the process of building these machine images, ensuring consistency and reproducibility.&lt;/p&gt;

&lt;p&gt;To further enhance our image-building process, we integrate Ansible as the provisioner for Packer. Ansible is an open-source automation tool that enables the configuration and management of systems through simple, human-readable YAML files. With Ansible, we define the desired state of our machine image, including the installation of packages, configuration files, and any other necessary setup steps. Ansible seamlessly integrates with Packer, allowing us to provision our machine image with ease.&lt;/p&gt;

&lt;p&gt;In our deployments, we rely on Ubuntu images to meet our diverse cloud computing needs. Ubuntu offers three types of images: live, server, and cloud. Live images provide a fully functional Ubuntu desktop environment that can be run directly from a USB drive or DVD without the need for installation, while server images are optimized for server deployments. However, for our specific use case, we have different requirements depending on the deployment environment. For our deployments in public cloud environments, we leverage the official Ubuntu images provided by the cloud provider, which are tailored and certified for their specific platform. Similarly, in our private on-premises cloud, we utilize the cloud version of Ubuntu images. These cloud images are specifically designed and pre-configured for cloud computing platforms, offering optimized performance and scalability. They enable us to efficiently deploy and manage Ubuntu instances in both our public and private cloud environments.&lt;/p&gt;

&lt;p&gt;Now, let's delve into our challenge. Building VM images for the public cloud only requires the appropriate Packer plugins. Our on-premises cloud, however, presented an obstacle: our existing process relied on a deprecated plugin built on QEMU. QEMU is an open-source emulator and virtualizer that runs virtual machines from disk images in various formats, including qcow2. To overcome this hurdle, we set out to use QEMU through the official, maintained Packer plugin, so that our image builds no longer depend on deprecated tooling.&lt;/p&gt;

&lt;p&gt;While I had prior experience with Packer, my familiarity with QEMU was limited, especially when it came to using Packer with QEMU. To address this knowledge gap, I referred to the official documentation of Packer. However, I encountered a challenge: the documentation provided an example using a server version of CentOS, which wasn't suitable for my requirements. I needed a cloud version of Ubuntu, which does not come with default user and password credentials. To overcome this hurdle, I created a seed image that included user-data and meta-data. This seed image allows us to "emulate" the cloud-init functionality. By combining this seed image with the Ubuntu image, Packer can establish an SSH connection to the virtual machine successfully.&lt;/p&gt;

&lt;p&gt;In the seed image, we create a user with the necessary credentials for the initial build process. It's important to note that the user created in the seed image is only intended for the build phase and is not present in the final image. This approach ensures that the final image does not contain any unnecessary or insecure credentials, maintaining a clean and secure environment.&lt;/p&gt;

&lt;p&gt;Here's the packer_qemu.seed.pkr.hcl script, which generates the seed image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source "file" "user_data" {
  content = &amp;lt;&amp;lt;EOF
#cloud-config
ssh_pwauth: True
users:
  - name: user
    plain_text_passwd: packer
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    lock_passwd: false
EOF
  target  = "user-data"
}

source "file" "meta_data" {
  content = &amp;lt;&amp;lt;EOF
{"instance-id":"packer-worker.tenant-local","local-hostname":"packer-worker"}
EOF
  target  = "meta-data"
}

build {
  sources = ["sources.file.user_data", "sources.file.meta_data"]

  provisioner "shell-local" {
    inline = ["genisoimage -output cidata.iso -input-charset utf-8 -volid cidata -joliet -r user-data meta-data"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;In the provided code, the genisoimage command-line tool packages the cloud-init configuration. It creates the cidata.iso file, which contains the user-data and meta-data files. These files hold the cloud-init configuration, such as the user credentials and the instance metadata.&lt;br&gt;
The ISO is not bootable: it is a small data disk whose volume label, cidata, is what cloud-init's NoCloud data source looks for at boot. Packer attaches this ISO to the virtual machine during the image build.&lt;br&gt;
To gain a better understanding of the genisoimage command-line options and functionality, you can refer to the official documentation at &lt;a href="https://manpages.debian.org/buster/genisoimage/genisoimage.1.en.html"&gt;Genisoimage Documentation&lt;/a&gt;. The documentation provides detailed explanations and examples to help you effectively utilize genisoimage in your image-building workflow.&lt;/p&gt;
&lt;/blockquote&gt;
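
&lt;p&gt;To make the seed step concrete, here is a minimal Python sketch of the same idea: it writes the two cloud-init files and assembles the same genisoimage invocation. The helper names (&lt;code&gt;seed_files&lt;/code&gt;, &lt;code&gt;genisoimage_command&lt;/code&gt;, &lt;code&gt;write_seed&lt;/code&gt;) are hypothetical and not part of our Packer scripts, and the ISO is only built when &lt;code&gt;run=True&lt;/code&gt; and genisoimage is installed.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import json
import subprocess
from pathlib import Path

# Same cloud-config that the Packer "file" source writes to user-data.
USER_DATA = """\
#cloud-config
ssh_pwauth: True
users:
  - name: user
    plain_text_passwd: packer
    sudo: ALL=(ALL) NOPASSWD:ALL
    shell: /bin/bash
    lock_passwd: false
"""


def seed_files():
    """Return the NoCloud seed files as a {filename: content} mapping."""
    meta_data = json.dumps(
        {
            "instance-id": "packer-worker.tenant-local",
            "local-hostname": "packer-worker",
        }
    )
    return {"user-data": USER_DATA, "meta-data": meta_data + "\n"}


def genisoimage_command(output="cidata.iso"):
    """Build the argument list used by the shell-local provisioner."""
    return [
        "genisoimage",
        "-output", output,
        "-input-charset", "utf-8",
        "-volid", "cidata",  # volume label the NoCloud data source looks for
        "-joliet",
        "-r",
        "user-data",
        "meta-data",
    ]


def write_seed(directory, run=False):
    """Write the seed files; optionally call genisoimage (must be installed)."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    for name, content in seed_files().items():
        (target / name).write_text(content, encoding="utf-8")
    if run:
        subprocess.run(genisoimage_command(), cwd=target, check=True)
    return sorted(path.name for path in target.iterdir())
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;write_seed("seed", run=True)&lt;/code&gt; would leave user-data, meta-data, and cidata.iso in the seed directory, which mirrors what the Packer build does.&lt;/p&gt;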

&lt;p&gt;Here's the code for the packer_qemu.qcow2.pkr.hcl script that uses the seed image and cloud image to build a new image and then runs an Ansible playbook to configure the new image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;packer {
  required_plugins {
    qemu = {
      version = "1.0.9"
      source  = "github.com/hashicorp/qemu"
    }
    ansible = {
      version = "1.0.4"
      source  = "github.com/hashicorp/ansible"
    }
  }
}

source "qemu" "ubuntu" {
  format           = "qcow2"
  disk_image       = true
  disk_size        = "10G"
  headless         = true
  iso_checksum     = "file:https://cloud-images.ubuntu.com/focal/current/SHA256SUMS"
  iso_url          = "https://cloud-images.ubuntu.com/focal/current/focal-server-cloudimg-amd64.img"
  qemuargs         = [["-m", "12G"], ["-smp", "8"], ["-cdrom", "cidata.iso"], ["-serial", "mon:stdio"]]
  shutdown_command = "echo 'packer' | sudo -S shutdown -P now"
  ssh_password     = "packer"
  ssh_username     = "user"
  vm_name          = "build.qcow2"
  output_directory = "output"
}

build {
  sources = ["source.qemu.ubuntu"]

  provisioner "ansible" {
    playbook_file = "ansible/qemu.yml"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This code snippet showcases the Packer configuration language to orchestrate the build process. It begins with the packer block, which outlines the essential plugins required for Packer, including qemu and ansible. The source block focuses on configuring the QEMU source, encompassing various settings such as the format, disk size, ISO checksum and URL, QEMU arguments, SSH credentials, and more. Within the build block, the QEMU source is designated for the build process. &lt;br&gt;
Additionally, the provisioner section incorporates the ansible provisioner, specifying the Ansible playbook (ansible/qemu.yml) to execute for further customization of the newly created image. For a comprehensive understanding of the packer plugin arguments pertaining to qemu, you can refer to the official documentation at &lt;a href="https://developer.hashicorp.com/packer/plugins/builders/qemu"&gt;Packer QEMU Plugin&lt;/a&gt;. The documentation offers detailed insights into the various plugin options and configurations available for QEMU integration within Packer.&lt;/p&gt;
&lt;/blockquote&gt;
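
&lt;p&gt;The &lt;code&gt;iso_checksum&lt;/code&gt; value with the &lt;code&gt;file:&lt;/code&gt; prefix tells Packer to look up the expected checksum of the image in the remote SHA256SUMS file. Purely as an illustration, the same check could be done by hand roughly as follows. The function names are hypothetical, and the sketch assumes the usual sha256sum line format (digest, whitespace, an optional asterisk, then the file name).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import hashlib


def parse_sha256sums(text):
    """Parse sha256sum-style lines ("HEX  name" or "HEX *name") into a dict."""
    sums = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        digest, name = parts
        name = name.strip()
        if name.startswith("*"):
            name = name[1:]  # drop the binary-mode marker
        sums[name] = digest.lower()
    return sums


def verify_image(data, name, sums_text):
    """Return True when the SHA-256 of data matches the entry for name."""
    expected = parse_sha256sums(sums_text).get(name)
    if expected is None:
        raise KeyError(f"{name} is not listed in the checksum file")
    return hashlib.sha256(data).hexdigest() == expected
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For a multi-gigabyte image you would hash the file in chunks instead of loading it into memory; the sketch keeps everything in memory for brevity.&lt;/p&gt;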

&lt;p&gt;By combining Packer, QEMU, Ubuntu Cloud Images, and Ansible, we are able to automate the process of building consistent and reproducible machine images for our on-premises Data Center. This streamlined approach saves time, ensures consistency across our environments, and maintains a secure image without unnecessary credentials.&lt;/p&gt;

&lt;p&gt;I hope sharing our experience and providing these code snippets will help the community facing similar challenges in the future. Let's continue building and automating together! If you have any questions or suggestions, feel free to reach out.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>ubuntu</category>
      <category>iac</category>
      <category>automation</category>
    </item>
    <item>
      <title>Building labs using component-based architecture with Terraform and Ansible</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Thu, 14 Apr 2022 15:48:35 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/building-labs-using-component-based-architecture-with-terraform-and-ansible-15dm</link>
      <guid>https://dev.to/edersonbrilhante/building-labs-using-component-based-architecture-with-terraform-and-ansible-15dm</guid>
      <description>&lt;p&gt;Currently, I am a Site Reliability Engineer (SRE) on the observability team at Splunk. When I worked on this solution, however, I was part of the GDI (Get Data In) organisation at Splunk.&lt;/p&gt;

&lt;p&gt;Now, let's talk about the problem. &lt;br&gt;
Part of an engineer's job in GDI is building add-ons for Splunk. Add-ons, in a nutshell, are plugins that connect third-party data sources to the Splunk platform. &lt;/p&gt;

&lt;p&gt;Every time we work on a new version of the add-on for a specific third party, we need to set up 2 labs: 1 for development and 1 built to QA specifications. &lt;/p&gt;

&lt;p&gt;The GDI organisation owns many add-ons, so we rotate which team, and which person on that team, works on each new version.&lt;br&gt;
This spreads the knowledge, but it made it hard to keep labs reliable and consistent across the dev cycle and across teams.&lt;/p&gt;

&lt;p&gt;A large share of people's time went into setting up the labs by hand (manual configuration or writing new Bash/PowerShell scripts). On top of the time it took away from development, this manual work was a constant headache for the developers.&lt;/p&gt;

&lt;p&gt;The teams agreed that lab setup needed automation, both to reduce the pain of creating labs and to avoid duplication and rework. &lt;/p&gt;

&lt;p&gt;We came up with the idea of using infrastructure as code (IaC), which is nothing special: plenty of other companies were already doing it.&lt;/p&gt;

&lt;p&gt;Because the teams are small and focused on developing add-ons, we needed an approach that gives teams customised labs without requiring them to write IaC scripts.&lt;/p&gt;

&lt;p&gt;Borrowing from the design principles of React components, we decided to create components that can be reused and plugged into other components. Each component is a Terraform module, an Ansible playbook, or an Ansible role.&lt;/p&gt;

&lt;p&gt;To illustrate, consider this example: build a lab with 3 different environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Environment A will have 4 Windows instances:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 Windows Server 2016 as Domain Controller.&lt;/li&gt;
&lt;li&gt;3 Windows Servers as member servers:

&lt;ul&gt;
&lt;li&gt;1 Windows Server 2016 with Windows Event Collector:
&lt;ul&gt;
&lt;li&gt;Splunk Universal Forwarder&lt;/li&gt;
&lt;li&gt;Collecting only Sysmon events from nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;1 Windows Server 2016 with Windows Event Forwarding&lt;/li&gt;
&lt;li&gt;1 Windows Server 2019 with Windows Event Forwarding&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Environment B will have 7 Windows instances:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 Windows Server 2016 as Domain Controller.&lt;/li&gt;
&lt;li&gt;6 Windows Servers as member servers:

&lt;ul&gt;
&lt;li&gt;1 Windows Server 2016 with Windows Event Collector (WEC A):
&lt;ul&gt;
&lt;li&gt;Splunk Universal Forwarder&lt;/li&gt;
&lt;li&gt;Collecting only Application events from nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;1 Windows Server 2016 with Windows Event Forwarding, sending logs to WEC A&lt;/li&gt;
&lt;li&gt;1 Windows Server 2019 with Windows Event Forwarding, sending logs to WEC A&lt;/li&gt;
&lt;li&gt;1 Windows Server 2019 with Windows Event Collector (WEC B):
&lt;ul&gt;
&lt;li&gt;Splunk Universal Forwarder&lt;/li&gt;
&lt;li&gt;Collecting only Security events from nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;1 Windows Server 2016 with Windows Event Forwarding, sending logs to WEC B&lt;/li&gt;
&lt;li&gt;1 Windows Server 2019 with Windows Event Forwarding, sending logs to WEC B&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Environment C will have 3 Windows instances:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 Windows Server 2016 as Domain Controller with Windows Event Collector:

&lt;ul&gt;
&lt;li&gt;Splunk Universal Forwarder&lt;/li&gt;
&lt;li&gt;Collecting only Security and Sysmon events from nodes&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;2 Windows Servers as member servers:

&lt;ul&gt;
&lt;li&gt;1 Windows Server 2016 with Windows Event Forwarding&lt;/li&gt;
&lt;li&gt;1 Windows Server 2019 with Windows Event Forwarding&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Normally, we could reproduce these environments with Terraform modules and Ansible playbooks. &lt;br&gt;
We would need to create specific playbooks and Terraform configs for each environment. &lt;br&gt;
And here comes the problem: we would spend our time coding permutations of very similar configurations. &lt;/p&gt;

&lt;p&gt;With our component-based architecture, we avoid that: we only write a single config file describing which modules each lab needs, without touching any Terraform script or Ansible playbook.&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehhewqqr3phz7t9831np.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fehhewqqr3phz7t9831np.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The solution supports many kinds of lab configurations deployed in AWS.&lt;/p&gt;

&lt;p&gt;Terraform scripts deploy the infrastructure, spinning up EC2 instances and other AWS resources. To provision software and system configuration inside each EC2 instance, Terraform calls the matching Ansible playbooks.&lt;/p&gt;

&lt;p&gt;A playbook is a group of roles. A role implements one specific piece of configuration, independently of the other roles.&lt;/p&gt;

&lt;p&gt;Take the role &lt;code&gt;windows_splunk_universal_forward&lt;/code&gt; as an example. This role downloads, installs, and configures a Splunk Universal Forwarder instance on Windows. It is written to work on any Windows version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo Structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ├── ansible
   │   ├── all_roles
   │   │   └── distros
   │   │       ├── linux
   │   │       │   └── roles
   │   │       │       └── &amp;lt;new-linux-role&amp;gt;
   │   │       └── windows
   │   │           └── roles
   │   │               └── &amp;lt;new-windows-role&amp;gt;
   │   └── playbooks
   │       └── &amp;lt;new-playbook&amp;gt;
   └── terraform
       └── modules
           ├── distros
           │   └── &amp;lt;new-distro-type&amp;gt;
           └── environments
               └── &amp;lt;new-environment-type&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Terraform
&lt;/h3&gt;

&lt;p&gt;Terraform is an open-source infrastructure as code software tool that provides a consistent CLI workflow to manage hundreds of cloud services. Terraform codifies cloud APIs into declarative configuration files.&lt;/p&gt;

&lt;p&gt;For more info, check the &lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terraform Structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   terraform/
   ├── modules
   │   ├── constants
   │   ├── core
   │   ├── distros
   │   │   └── &amp;lt;distro-type&amp;gt;
   │   └── environments
   │       └── &amp;lt;environment-type&amp;gt;
   └── wire
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What is an environment?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An environment is a predefined set of relations between nodes.&lt;br&gt;
Each &lt;code&gt;environment&lt;/code&gt; module lives in &lt;code&gt;terraform/modules/environments&lt;/code&gt; and uses the modules in &lt;code&gt;terraform/modules/distros&lt;/code&gt; to build those relations.&lt;/p&gt;

&lt;p&gt;To illustrate, take this case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 Windows Domain Controller.&lt;/li&gt;
&lt;li&gt;X number of Member Servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This hierarchy exists because the DC must be created first, so that it can pass data, such as its IP address, to the member servers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
# file: terraform/modules/environments/linux-standalone/main.tf

module "windows-domain-controller" {
    source = "../../distros/windows-server"
    ...
}

module "windows-server-member" {
    source = "../../distros/windows-server"
    ...
    windows_domain_controller = module.windows-domain-controller
    ...
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
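
&lt;p&gt;Terraform works out this ordering on its own from the reference to &lt;code&gt;module.windows-domain-controller&lt;/code&gt;. The idea behind it is a dependency graph that gets sorted topologically, which the following Python sketch illustrates with the standard library (Python 3.9 or newer). The &lt;code&gt;dependencies&lt;/code&gt; mapping and the &lt;code&gt;creation_order&lt;/code&gt; helper are hypothetical and only mirror the two modules from the example.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from graphlib import TopologicalSorter

# Each key depends on the modules in its set, mirroring how the member-server
# module references module.windows-domain-controller.
dependencies = {
    "windows-domain-controller": set(),
    "windows-server-member": {"windows-domain-controller"},
}


def creation_order(graph):
    """Return module names in an order that satisfies every dependency."""
    return list(TopologicalSorter(graph).static_order())
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here &lt;code&gt;creation_order(dependencies)&lt;/code&gt; places the domain controller before the member server, which is the order the lab needs.&lt;/p&gt;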



&lt;p&gt;&lt;strong&gt;What is a distro?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A distro is a predefined kind of AMI with a specific setup and/or provisioning.&lt;br&gt;
Each &lt;code&gt;distro&lt;/code&gt; module lives in &lt;code&gt;terraform/modules/distros&lt;/code&gt; and has a matching Ansible playbook that executes the provisioning.&lt;/p&gt;

&lt;p&gt;To illustrate, here are some distros:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linux&lt;/li&gt;
&lt;li&gt;Windows&lt;/li&gt;
&lt;li&gt;Splunk&lt;/li&gt;
&lt;li&gt;FreeBSD&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Terraform example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```

locals {
  ...
  provisioning_command = "ansible-playbook -i $PUBLIC_IP /opt/automation/tools/ansible/playbooks/windows.yml --extra-vars='${local.extra_vars}'"
}

...

resource "aws_instance" "windows_server" {
  ...
}

resource "null_resource" "ansible" {
  # Re-run the provisioner whenever the rendered command changes.
  triggers = {
    command = replace(local.provisioning_command, "$PUBLIC_IP", "'${aws_instance.windows_server.public_ip},'")
  }

  provisioner "local-exec" {
    command = replace(local.provisioning_command, "$PUBLIC_IP", "'${aws_instance.windows_server.public_ip},'")
  }
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
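
&lt;p&gt;The string substitution in this module is easy to misread, so here is a small Python sketch of what the &lt;code&gt;replace&lt;/code&gt; call produces. The helper names and the sample IP address are hypothetical. The trailing comma inside the quotes matters: ansible-playbook treats a value such as &lt;code&gt;203.0.113.10,&lt;/code&gt; as an inline host list rather than as a path to an inventory file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import shlex

# Playbook path taken from the Terraform example.
PLAYBOOK = "/opt/automation/tools/ansible/playbooks/windows.yml"
TEMPLATE = "ansible-playbook -i $PUBLIC_IP " + PLAYBOOK + " --extra-vars='{extra_vars}'"


def provisioning_command(public_ip, extra_vars):
    """Mimic the Terraform replace(): fill in the inline inventory."""
    command = TEMPLATE.format(extra_vars=extra_vars)
    # Quote the address and keep the trailing comma so Ansible reads it
    # as a host list instead of an inventory file name.
    return command.replace("$PUBLIC_IP", "'" + public_ip + ",'")


def as_argv(command):
    """Split the command the way a POSIX shell would."""
    return shlex.split(command)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example, &lt;code&gt;as_argv(provisioning_command("203.0.113.10", "role=demo"))&lt;/code&gt; shows the inventory argument arriving at ansible-playbook as &lt;code&gt;203.0.113.10,&lt;/code&gt; with the comma intact.&lt;/p&gt;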



&lt;h3&gt;
  
  
  Ansible
&lt;/h3&gt;

&lt;p&gt;Ansible is an open-source software provisioning, configuration management, and application-deployment tool enabling infrastructure as code. It runs on many Unix-like systems, and can configure both Unix-like systems as well as Microsoft Windows.&lt;/p&gt;

&lt;p&gt;For more info, check the &lt;a href="https://www.ansible.com/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ansible Structure:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```
ansible/
├── all_roles
│&amp;nbsp;&amp;nbsp; ├── distros
│&amp;nbsp;&amp;nbsp; │&amp;nbsp;&amp;nbsp; └── &amp;lt;distro-type&amp;gt;
│&amp;nbsp;&amp;nbsp; │&amp;nbsp;&amp;nbsp;     └── roles
│&amp;nbsp;&amp;nbsp; │&amp;nbsp;&amp;nbsp;         └── &amp;lt;distro-role&amp;gt;
└── playbooks
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What is a distro type?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A distro type is a folder that centralizes all Ansible roles that can be executed on a specific kind of machine.&lt;/p&gt;

&lt;p&gt;Take Windows as an example: &lt;code&gt;ansible/all_roles/distros/windows&lt;/code&gt;. This folder centralizes all Ansible roles that can be executed on Windows machines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a distro role?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A distro role is a group of Ansible tasks that together implement one piece of functionality.&lt;/p&gt;

&lt;p&gt;To illustrate, here is the list of tasks from the Splunk UF role:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Downloads Splunk UF&lt;/li&gt;
&lt;li&gt;Installs the downloaded file&lt;/li&gt;
&lt;li&gt;Sets the default configuration&lt;/li&gt;
&lt;li&gt;Starts Splunk UF&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Explaining the config file
&lt;/h3&gt;

&lt;p&gt;Here you can find a complete config example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;config = {
  "myenv01" = {
    "type" = "windows_standalone"
    "nodes" = {
      "myvm01" = {
        "type" = "windows"
        "enabled_roles" = {
          "windows_funcionality01" = true
          "windows_funcionality02" = true
        }
        "os" = {
          "size"    = "t2.medium"
          "distro"  = "windows"
          "type"    = "windows"
          "version" = "2016"
        }
      }
    }
  }
  "myenv02" = {
    "type" = "linux_standalone"
    "nodes" = {
      "mylinux01" = {
        "type" = "linux"
        "enabled_roles" = {
          "linux_funcionality01" = true
          "linux_funcionality02" = true
        }
        "os" = {
          "size"    = "t2.medium"
          "distro"  = "ubuntu"
          "type"    = "linux"
          "version" = "20"
        }
      }
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under the hood this configuration will be translated to create 2 EC2 instances in AWS, and each instance will run playbooks with specific roles.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Block explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Each block &lt;code&gt;myenv0x&lt;/code&gt; represents how the environment will be deployed. The type represents which predefined environment will be used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Each block &lt;code&gt;myvm0x&lt;/code&gt; represents a VM that will be created. The type represents which predefined distro will be used.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;The block &lt;code&gt;os&lt;/code&gt; has 4 properties that determine which EC2 instance is created: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;size&lt;/code&gt;: the AWS instance type&lt;/li&gt;
&lt;li&gt;&lt;code&gt;distro&lt;/code&gt;: the OS distro (ubuntu, debian, suse, windows, etc.)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;type&lt;/code&gt;: the type of distro (windows, linux, etc.)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;version&lt;/code&gt;: the version of the OS distro&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;With this information, Terraform knows which AWS AMI to use when spinning up the EC2 instance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The block &lt;code&gt;enabled_roles&lt;/code&gt; lists the Ansible roles to execute on each instance&lt;/li&gt;
&lt;/ul&gt;
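
&lt;p&gt;To show how a config like this can be read, here is a minimal Python sketch that flattens a trimmed copy of it into one record per VM. It is not how the Terraform modules are implemented: the &lt;code&gt;plan&lt;/code&gt; function and the &lt;code&gt;ami_lookup&lt;/code&gt; field are hypothetical, and the sketch assumes a role runs only when its value is true.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
# Trimmed copy of the example config, written as a Python dict.
config = {
    "myenv01": {
        "type": "windows_standalone",
        "nodes": {
            "myvm01": {
                "type": "windows",
                "enabled_roles": {
                    "windows_funcionality01": True,
                    "windows_funcionality02": True,
                },
                "os": {
                    "size": "t2.medium",
                    "distro": "windows",
                    "type": "windows",
                    "version": "2016",
                },
            }
        },
    },
}


def plan(config):
    """Flatten the config into one record per VM."""
    records = []
    for env_name, env in config.items():
        for node_name, node in env["nodes"].items():
            enabled = node.get("enabled_roles", {})
            roles = sorted(name for name, on in enabled.items() if on)
            records.append(
                {
                    "environment": env_name,
                    "environment_type": env["type"],
                    "node": node_name,
                    "instance_type": node["os"]["size"],
                    # Hypothetical key for choosing an AMI.
                    "ami_lookup": (node["os"]["distro"], node["os"]["version"]),
                    "roles": roles,
                }
            )
    return records
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each record corresponds to one EC2 instance plus the roles its playbook run should apply, which is the translation the config file is meant to hide from the add-on teams.&lt;/p&gt;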

&lt;p&gt;For more details about the code and implementation, check the fully functional &lt;a href="https://github.com/edersonbrilhante/lab-builder-demo" rel="noopener noreferrer"&gt;code demo&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>terraform</category>
      <category>ansible</category>
      <category>architecture</category>
    </item>
    <item>
      <title>A serverless full-stack application using only git, google drive, and public ci/cd runners</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Fri, 16 Apr 2021 13:40:49 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/a-serverless-full-stack-application-using-only-git-google-drive-and-public-ci-cd-runners-262l</link>
      <guid>https://dev.to/edersonbrilhante/a-serverless-full-stack-application-using-only-git-google-drive-and-public-ci-cd-runners-262l</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR; How I built the Vilicus Service, a serverless full-stack application with backend workers and database only using git and ci/cd runners.&lt;/strong&gt;&lt;/p&gt;




&lt;h4&gt;
  
  
  What is Vilicus?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://dev.to/edersonbrilhante/vilicus-a-overseer-for-security-scanning-of-container-images-eji"&gt;Vilicus&lt;/a&gt; is an open-source tool that orchestrates security scans of container images (Docker/OCI) and centralizes all results into a database for further analysis and metrics.&lt;/p&gt;

&lt;p&gt;Vilicus can be used in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/edersonbrilhante/vilicus"&gt;Own Installation&lt;/a&gt;;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/edersonbrilhante/vilicus-github-action"&gt;GitHub Action&lt;/a&gt; in your GitHub workflows;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/edersonbrilhante/vilicus-gitlab"&gt;Template CI&lt;/a&gt; in your GitLab CI/CD pipelines;&lt;/li&gt;
&lt;li&gt;
&lt;a href="http://vilicus.edersonbrilhante.com.br/"&gt;Free Online Service&lt;/a&gt;;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article explains how it was possible to build the Free Online Service without using a traditional deployment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MaLtB1bF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i9eolikd4iicgb3r37js.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MaLtB1bF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i9eolikd4iicgb3r37js.png" alt="Architecture" width="701" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Frontend is hosted in GitHub Pages. This frontend is a landing page with a free service to scan or display the vulnerabilities in container images. &lt;/p&gt;

&lt;p&gt;The results of container image scans are stored in a GitLab Repository.&lt;/p&gt;

&lt;p&gt;When the user asks to see the results for an image, the frontend calls the GitLab API to retrieve the file listing that image's vulnerabilities. If the image has not been scanned yet, the user can schedule a scan using a Google Form.&lt;/p&gt;

&lt;p&gt;When the form is submitted, the data is sent to a Google Spreadsheet.&lt;/p&gt;

&lt;p&gt;A GitHub Workflow runs every 5 minutes to check if there are new answers in this Spreadsheet. For each new image in the Spreadsheet, this workflow triggers another Workflow to scan the image and save the result in the GitLab Repository.&lt;/p&gt;
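
&lt;p&gt;The read path described above, where the frontend fetches a report file through the GitLab API, comes down to building one URL for GitLab's repository files endpoint. The real frontend is JavaScript; the Python sketch below only illustrates the URL shape. The report file path and the &lt;code&gt;main&lt;/code&gt; branch are assumptions, and &lt;code&gt;raw_file_url&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
from urllib.parse import quote

GITLAB_API = "https://gitlab.com/api/v4"


def raw_file_url(project, file_path, ref="main"):
    """Build the GitLab repository-files URL that returns one raw file."""
    # Both the project path and the file path must be URL-encoded,
    # slashes included, hence safe="".
    project_id = quote(project, safe="")
    encoded_path = quote(file_path, safe="")
    return (
        f"{GITLAB_API}/projects/{project_id}"
        f"/repository/files/{encoded_path}/raw?ref={ref}"
    )
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the reports repository is public, a request like this needs no token, which is what lets a static page read the data without exposing secrets.&lt;/p&gt;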

&lt;h4&gt;
  
  
  Why store in GitLab?
&lt;/h4&gt;

&lt;p&gt;GitLab provides bigger limits. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's a summary of the free-tier limits of each hosted service:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Free users&lt;/th&gt;
&lt;th&gt;Max repo size (GB)&lt;/th&gt;
&lt;th&gt;Max file size (MB)&lt;/th&gt;
&lt;th&gt;Max API calls per hour (per client)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GitHub&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;5000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BitBucket&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Unlimited (up to repo size)&lt;/td&gt;
&lt;td&gt;5000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitLab&lt;/td&gt;
&lt;td&gt;Unlimited&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Unlimited (up to repo size)&lt;/td&gt;
&lt;td&gt;36000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  Google Drive
&lt;/h3&gt;

&lt;p&gt;This choice was a "quick win". In a usual deployment, the backend could call an API passing secrets without the clients knowing the secrets. &lt;/p&gt;

&lt;p&gt;But because the frontend is hosted on GitHub Pages, there is no backend to do that. (I could call the API from the JavaScript, but anyone opening the browser's inspector would see the secrets. So let's not do that 😉)&lt;/p&gt;

&lt;p&gt;This makes the Google Spreadsheet act as a queue.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Google Form:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pyUQWnG5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nyxtv3xkmk90k2i6ljhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pyUQWnG5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nyxtv3xkmk90k2i6ljhg.png" alt="Form" width="800" height="664"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Google Spreadsheet with answers:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OY2FJpFH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ezihbao6fboqhhjkrg4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OY2FJpFH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ezihbao6fboqhhjkrg4l.png" alt="Answers" width="800" height="110"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  GitHub Workflows
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;Schedule Workflow&lt;/code&gt; runs at most every 5 minutes. It executes a Python script that checks for new rows in the &lt;code&gt;Google Spreadsheet&lt;/code&gt; and, for each new row, makes an HTTP request that triggers a &lt;code&gt;repository_dispatch&lt;/code&gt; event.&lt;/p&gt;

&lt;p&gt;This makes the workflows perform as backend workers.&lt;/p&gt;
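
&lt;p&gt;As a rough sketch of that polling step, the Python below picks the unprocessed rows and builds the GitHub API request that fires a &lt;code&gt;repository_dispatch&lt;/code&gt; event. It is not the actual script from the repository: the event name &lt;code&gt;scan-image&lt;/code&gt;, the payload shape, and the way processed rows are counted are all assumptions, and the request is built but never sent.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
```python
import json
import urllib.request


def new_rows(rows, already_processed):
    """Treat the sheet as a queue: rows past the processed count are new."""
    return rows[already_processed:]


def dispatch_request(owner, repo, token, image):
    """Build (but do not send) the repository_dispatch API request."""
    body = json.dumps(
        {
            "event_type": "scan-image",  # hypothetical event name
            "client_payload": {"image": image},
        }
    ).encode("utf-8")
    return urllib.request.Request(
        f"https://api.github.com/repos/{owner}/{repo}/dispatches",
        data=body,
        method="POST",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Passing the result to &lt;code&gt;urllib.request.urlopen&lt;/code&gt; would send it. The token comes from a workflow secret, so it never reaches the browser.&lt;/p&gt;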

&lt;p&gt;&lt;em&gt;Schedule in workflow:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Schedule&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;schedule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;cron&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*/5&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Event &lt;code&gt;repository_dispatch&lt;/code&gt; in workflow:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Report&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;repository_dispatch&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Screenshots:
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Schedule History:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--52iyEBwk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vx0omaxnghu518cjqfqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--52iyEBwk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vx0omaxnghu518cjqfqf.png" alt="Schedules" width="800" height="417"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Schedule WorkFlow:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Csmo2bfX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0zn81m0ai0je3puhr56v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Csmo2bfX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0zn81m0ai0je3puhr56v.png" alt="Schedule" width="800" height="240"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Scans History:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1U2jDv5N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pzhsml4624pfb2edj4u2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1U2jDv5N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pzhsml4624pfb2edj4u2.png" alt="Scans" width="800" height="363"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Report Workflow:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_pAnFAc5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c6p60gvam145vyajlfmu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_pAnFAc5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c6p60gvam145vyajlfmu.png" alt="Report" width="800" height="415"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Scan Report stored in GitLab:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xhlDDfsA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5qg7eshm699k7f22t5p1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xhlDDfsA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5qg7eshm699k7f22t5p1.png" alt="Report File Example" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Source Code:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus-report/blob/main/.github/workflows/schedule-jobs.yml"&gt;Schedule Workflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus-report/blob/main/.github/workflows/report.yml"&gt;Report Workflow&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus-report/blob/main/commit.py"&gt;Script to upload the report file to GitLab&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus-report/blob/main/schedule-jobs.py"&gt;Script to iterate the answers and trigger new scans&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://gitlab.com/edersonbrilhante/vilicus-reports-db"&gt;GitLab Repo with report files&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Do you want to know more about GitHub Actions?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/actions/learn-github-actions"&gt;Learn GitHub Actions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions"&gt;Workflow syntax&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  GitHub Pages
&lt;/h3&gt;

&lt;p&gt;The frontend runs on GitHub Pages.&lt;/p&gt;

&lt;p&gt;By default, an application running on GH Pages is hosted at &lt;code&gt;http://&amp;lt;github-user&amp;gt;.github.io/&amp;lt;repository&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;GitHub also lets you configure a custom domain, which is why Vilicus can be reached at &lt;code&gt;https://vilicus.edersonbrilhante.com.br&lt;/code&gt; instead of &lt;code&gt;http://edersonbrilhante.github.io/vilicus&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;
  
  
  GitHub Workflow to build the application and deploy it in GH Pages
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Building the source code:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;cd website&lt;/span&gt;
    &lt;span class="s"&gt;npm install&lt;/span&gt;
    &lt;span class="s"&gt;npm run-script build&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;REACT_APP_GA_CODE&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.REACT_APP_GA_CODE }}&lt;/span&gt;
    &lt;span class="na"&gt;REACT_APP_FORM_SCAN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.REACT_APP_FORM_SCAN }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Deploying the build:&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
  &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;JamesIves/github-pages-deploy-action@releases/v3&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;GITHUB_TOKEN&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.GITHUB_TOKEN }}&lt;/span&gt;
    &lt;span class="na"&gt;BRANCH&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gh-pages&lt;/span&gt;
    &lt;span class="na"&gt;FOLDER&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;website/build&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Source Code:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus/blob/main/.github/workflows/build-gh-pages.yml"&gt;Workflow to deploy code in GitHub Pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus/tree/main/website"&gt;Application Source Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus/tree/gh-pages"&gt;Deployed code in GH Pages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Do you want to know more about GitHub Pages?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/pages/getting-started-with-github-pages/configuring-a-publishing-source-for-your-github-pages-site"&gt;Configuring a publishing source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/pages/configuring-a-custom-domain-for-your-github-pages-site"&gt;Configuring a custom domain&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  That’s it!
&lt;/h2&gt;

&lt;p&gt;In case you have any questions, please leave a comment here or ping me on &lt;a href="https://www.linkedin.com/in/edersonbrilhante"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>serverless</category>
      <category>showdev</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Fast startup application with database stored in container images</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Thu, 01 Apr 2021 15:15:54 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/fast-startup-application-with-database-stored-in-container-images-1k04</link>
      <guid>https://dev.to/edersonbrilhante/fast-startup-application-with-database-stored-in-container-images-1k04</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR;&lt;/strong&gt; This article shows the strategy I implemented to make an application ready to use in a few minutes rather than many hours.&lt;/p&gt;

&lt;p&gt;In this article, I will talk about the strategy I used in the project Vilicus to have big databases synced in new setups. For those who don't know Vilicus yet, I recommend reading &lt;a href="https://dev.to/edersonbrilhante/vilicus-a-overseer-for-security-scanning-of-container-images-eji"&gt;my article&lt;/a&gt; about it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why does the application take so long to start?
&lt;/h2&gt;

&lt;p&gt;At this moment the project &lt;a href="https://github.com/edersonbrilhante/vilicus"&gt;Vilicus&lt;/a&gt; uses &lt;a href="https://github.com/anchore/anchore-engine"&gt;Anchore&lt;/a&gt;, &lt;a href="https://github.com/quay/clair"&gt;Clair&lt;/a&gt;, and &lt;a href="https://github.com/aquasecurity/trivy"&gt;Trivy&lt;/a&gt; as vendors to run security scans on container images. Each vendor has its own programming language, database, and internal dependencies, and may use a different vulnerability database.&lt;/p&gt;

&lt;p&gt;Vilicus itself starts in milliseconds, but before it is ready to use, the vendors must sync their vulnerability databases with the latest changes, and these syncs can take a long time.&lt;/p&gt;

&lt;p&gt;Take Anchore, for example, whose sync takes the longest:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;There is no exact time frame for the initial sync to complete as it depends heavily on environmental factors, such as the host's memory/cpu allocation, disk space, and network bandwidth. Generally, the initial sync should complete within 8 hours but may take longer. Subsequent feed updates are much faster as only deltas are updated.&lt;br&gt;
&lt;a href="https://docs.anchore.com/current/docs/faq/"&gt;https://docs.anchore.com/current/docs/faq/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Clair takes roughly 20 minutes, and Trivy is ready in a few seconds.&lt;/p&gt;

&lt;p&gt;If you run everything from scratch, it takes almost a day to sync all the vulnerability databases. After this initial sync, subsequent syncs are faster.&lt;/p&gt;

&lt;p&gt;This is a problem if you want to run an ephemeral instance in your CI/CD pipeline: waiting hours for the sync to complete before you can run the first scan is not viable. Thinking about how to fix this, I came up with a solution: save updated database snapshots in container images every day.&lt;/p&gt;

&lt;p&gt;You may be thinking that this is not good practice, and normally I would agree. But I believe there are exceptions, such as when fixing the problem matters more than following convention.&lt;/p&gt;




&lt;h2&gt;
  
  
  Saving the database in a container image
&lt;/h2&gt;

&lt;p&gt;I'll show you in detail how I made Anchore work; Clair and Trivy are not much different.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anchore
&lt;/h3&gt;

&lt;p&gt;First, I have a compressed SQL dump, with the database already synced with at least the last 6 months of data, stored in a container image: &lt;a href="https://hub.docker.com/layers/vilicus/anchoredb/dumpsql/images/sha256-d9fffded216ee40bf31580467af80eeadab0988cbefe5cd82acfde410f683370?context=explore"&gt;vilicus/anchoredb:dumpsql&lt;/a&gt;. So we don't need to wait many hours; we only update the delta.&lt;/p&gt;

&lt;p&gt;I used this image as a base to create a local image (&lt;code&gt;vilicus/anchoredb:files&lt;/code&gt;) with a script that restores the database when the image runs as a container.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus/blob/v0.0.3/deployments/dockerfiles/anchore/db/files/Dockerfile"&gt;Dockerfile content&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM vilicus/anchoredb:dumpsql as dumpsql

FROM postgres:9.6.21-alpine
LABEL vilicus.app.version=9.6.21-alpine

COPY --chown=postgres:postgres --from=dumpsql /opt/vilicus/data/anchore_db.tar.gz /opt/vilicus/data/anchore_db.tar.gz
COPY deployments/dockerfiles/anchore/db/files/restore-dbs.sh /docker-entrypoint-initdb.d/01.restore-dbs.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus/blob/e700a01a08483e188aad9129d0bda9c12067a6cc/scripts/build-anchore-image.sh#L57-L61"&gt;Building the container image&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -f deployments/dockerfiles/anchore/db/files/Dockerfile -t vilicus/anchoredb:files .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The image &lt;code&gt;vilicus/anchoredb:files&lt;/code&gt; is referenced in &lt;a href="https://github.com/edersonbrilhante/vilicus/blob/v0.0.3/deployments/docker-compose.updater.yml"&gt;deployments/docker-compose.updater.yml&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we start the &lt;code&gt;anchore&lt;/code&gt; and &lt;code&gt;anchoredb&lt;/code&gt; containers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose -f deployments/docker-compose.updater.yml up \
    --build -d --force-recreate \
    --remove-orphans \
    --renew-anon-volumes anchore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we run this command to restore the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec anchoredb sh -c 'docker-entrypoint.sh postgres' &amp;amp;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we wait for the restore to finish and for the database to be ready to accept connections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker run --network container:anchore vilicus/vilicus:latest \
    sh -c "dockerize -wait http://anchore:8228/health -wait-retry-interval 10s -timeout 1000s echo done"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the Anchore Engine and the DB ready, we start the sync.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker exec anchore sh -c 'anchore-cli system wait'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the sync finishes, we stop anchore and gracefully shut down the Postgres process in anchoredb.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker stop anchore
docker exec -u postgres anchoredb sh -c 'pg_ctl stop -m smart'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We commit the container, with the changes made by the sync, into a new container image &lt;code&gt;vilicus/anchoredb:local-update&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CID=$(docker inspect --format="{{.Id}}" anchoredb)
docker commit $CID vilicus/anchoredb:local-update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we build the container image that is pushed to Docker Hub by copying the Postgres data from the image &lt;code&gt;vilicus/anchoredb:local-update&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://raw.githubusercontent.com/edersonbrilhante/vilicus/v0.0.3/deployments/dockerfiles/anchore/db/Dockerfile"&gt;Dockerfile content&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM vilicus/anchoredb:local-update as db
FROM postgres:9.6.21-alpine
COPY --chown=postgres:postgres --from=db /data/ /data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus/blob/e700a01a08483e188aad9129d0bda9c12067a6cc/scripts/build-anchore-image.sh#L30"&gt;Building the container image&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -f deployments/dockerfiles/anchore/db/Dockerfile -t vilicus/anchoredb:latest .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the complete script &lt;a href="https://github.com/edersonbrilhante/vilicus/blob/v0.0.3/scripts/build-anchore-image.sh"&gt;here&lt;/a&gt;&lt;/p&gt;
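
&lt;p&gt;To see the whole refresh at a glance, the steps above can be written down as an ordered plan. This is only a sketch in Python, not the project's script (which is a shell script): the helper names &lt;code&gt;build_update_plan&lt;/code&gt; and &lt;code&gt;run_plan&lt;/code&gt; are made up for illustration, it assumes the &lt;code&gt;vilicus/anchoredb:files&lt;/code&gt; image has already been built, and it passes the container name straight to &lt;code&gt;docker commit&lt;/code&gt; instead of looking up the ID first. By default it only prints the commands.&lt;/p&gt;

```python
import subprocess

COMPOSE_FILE = "deployments/docker-compose.updater.yml"


def build_update_plan():
    """Return the refresh steps as (description, argv, run_in_background) tuples."""
    return [
        ("start anchore and anchoredb",
         ["docker-compose", "-f", COMPOSE_FILE, "up", "--build", "-d",
          "--force-recreate", "--remove-orphans", "--renew-anon-volumes", "anchore"],
         False),
        # The entrypoint keeps Postgres running, so this step must not block.
        ("restore the dump",
         ["docker", "exec", "anchoredb", "sh", "-c", "docker-entrypoint.sh postgres"],
         True),
        ("wait for the Anchore health endpoint",
         ["docker", "run", "--network", "container:anchore", "vilicus/vilicus:latest",
          "sh", "-c",
          "dockerize -wait http://anchore:8228/health "
          "-wait-retry-interval 10s -timeout 1000s echo done"],
         False),
        ("wait for the feed sync",
         ["docker", "exec", "anchore", "sh", "-c", "anchore-cli system wait"],
         False),
        ("stop anchore", ["docker", "stop", "anchore"], False),
        ("stop Postgres gracefully",
         ["docker", "exec", "-u", "postgres", "anchoredb", "sh", "-c", "pg_ctl stop -m smart"],
         False),
        # Simplification: commit by container name rather than by inspected ID.
        ("commit the synced container",
         ["docker", "commit", "anchoredb", "vilicus/anchoredb:local-update"],
         False),
        ("build the final image",
         ["docker", "build", "-f", "deployments/dockerfiles/anchore/db/Dockerfile",
          "-t", "vilicus/anchoredb:latest", "."],
         False),
    ]


def run_plan(plan, dry_run=True):
    """Print every step; execute it only when dry_run is False."""
    background = []
    for description, argv, in_background in plan:
        print(description + ": " + " ".join(argv))
        if dry_run:
            continue
        if in_background:
            background.append(subprocess.Popen(argv))
        else:
            # check=True aborts the plan as soon as one command fails.
            subprocess.run(argv, check=True)
    return background


if __name__ == "__main__":
    run_plan(build_update_plan())
```

&lt;p&gt;Keeping the plan as data makes the order of operations easy to review before anything touches Docker; calling &lt;code&gt;run_plan(build_update_plan(), dry_run=False)&lt;/code&gt; would execute it.&lt;/p&gt;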

&lt;h3&gt;
  
  
  Clair and Trivy
&lt;/h3&gt;

&lt;p&gt;For Clair check &lt;a href="https://github.com/edersonbrilhante/vilicus/blob/main/scripts/build-clair-image.sh"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For Trivy check &lt;a href="https://github.com/edersonbrilhante/vilicus/blob/main/scripts/push-trivy-image.sh"&gt;here&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Updating the images every day
&lt;/h2&gt;

&lt;p&gt;To keep the databases up to date with the latest changes, I have a GitHub workflow that runs a job every day, building the images and pushing them to Docker Hub.&lt;/p&gt;

&lt;p&gt;Check the &lt;a href="https://github.com/edersonbrilhante/vilicus/blob/main/.github/workflows/build-images.yml"&gt;workflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IGZwEcKB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/grpk74axc68q1fbzwwo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IGZwEcKB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/grpk74axc68q1fbzwwo2.png" alt="Complete workflow" width="800" height="480"&gt;&lt;/a&gt;Complete workflow&lt;p&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  That's it!
&lt;/h2&gt;

&lt;p&gt;In case you have any questions, please leave a comment here or ping me on &lt;a href="https://www.linkedin.com/in/edersonbrilhante"&gt;🔗 LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>docker</category>
      <category>devops</category>
      <category>database</category>
      <category>programming</category>
    </item>
    <item>
      <title>GitLab Runners as a Service with Github Action</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Thu, 01 Apr 2021 14:52:43 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/gitlab-runners-as-a-service-with-github-action-149n</link>
      <guid>https://dev.to/edersonbrilhante/gitlab-runners-as-a-service-with-github-action-149n</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR;&lt;/strong&gt; This article shows how to use the action "Gitlab Runner Service Action" in a "GitHub Workflow" that is triggered by a "GitLab-CI job", giving you temporary GitLab Runners hosted by GitHub.&lt;/p&gt;

&lt;p&gt;For more info about &lt;code&gt;GitHub workflow&lt;/code&gt;, check the official &lt;a href="https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more info about &lt;code&gt;GitLab-CI&lt;/code&gt;, check the official &lt;a href="https://docs.gitlab.com/ee/ci/yaml/gitlab_ci_yaml.html" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1
&lt;/h3&gt;

&lt;p&gt;Create a new GitHub repository with the following GitHub Workflow. File location: &lt;code&gt;.github/workflows/gitlab-runner.yaml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: Gitlab Runner Service
on: [repository_dispatch]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Maximize Build Space
        uses: easimon/maximize-build-space@master
        with:
          root-reserve-mb: 512
          swap-size-mb: 1024
          remove-dotnet: 'true'
          remove-android: 'true'
          remove-haskell: 'true'

      - name: Gitlab Runner
        uses: edersonbrilhante/gitlab-runner-action@main
        with:
          registration-token: "${{ github.event.client_payload.registration_token }}"
          docker-image: "docker:19.03.12"
          name: ${{ github.run_id }}
          tag-list: "crosscicd"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What does this workflow do?
&lt;/h4&gt;

&lt;p&gt;This workflow runs only when the &lt;code&gt;repository_dispatch&lt;/code&gt; event is triggered. The first step frees up disk space by removing packages that our GitLab runner doesn't need. The second step runs the action that registers a new GitLab Runner with the tag &lt;code&gt;crosscicd&lt;/code&gt;, starts it, and unregisters it once a GitLab-CI job completes, whether it succeeds or fails.&lt;/p&gt;
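
&lt;p&gt;The "unregister on success or failure" behaviour is the important part of that second step. A minimal sketch of the pattern in Python, assuming hypothetical &lt;code&gt;register&lt;/code&gt; and &lt;code&gt;unregister&lt;/code&gt; callbacks that stand in for whatever the action does internally (this is not the action's own code):&lt;/p&gt;

```python
from contextlib import contextmanager


@contextmanager
def temporary_runner(register, unregister, tag):
    """Register a runner, hand it to the caller, and always unregister it."""
    runner_id = register(tag)
    try:
        yield runner_id
    finally:
        # Runs whether the job finished normally or raised an error.
        unregister(runner_id)


if __name__ == "__main__":
    events = []

    def fake_register(tag):
        # Hypothetical stand-in for the real registration call.
        events.append("register:" + tag)
        return "runner-1"

    def fake_unregister(runner_id):
        # Hypothetical stand-in for the real unregistration call.
        events.append("unregister:" + runner_id)

    try:
        with temporary_runner(fake_register, fake_unregister, "crosscicd") as runner_id:
            events.append("run:" + runner_id)
            raise RuntimeError("simulated job failure")
    except RuntimeError:
        pass

    print(events)
```

&lt;p&gt;Even though the simulated job fails, the unregister callback still runs, so no stale runner is left registered.&lt;/p&gt;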

&lt;h3&gt;
  
  
  Step 2
&lt;/h3&gt;

&lt;p&gt;Create a new GitLab repository with the following GitLab-CI config. File location: &lt;code&gt;.gitlab-ci.yml&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;start-crosscicd:
  image: alpine
  before_script:
    - apk add --update curl &amp;amp;&amp;amp; rm -rf /var/cache/apk/*
  script: |
    curl -H "Authorization: token ${GITHUB_TOKEN}" \
    -H 'Accept: application/vnd.github.everest-preview+json' \
    "https://api.github.com/repos/${GITHUB_REPO}/dispatches" \
    -d '{"event_type": "gitlab_trigger_'${CI_PIPELINE_ID}'", "client_payload": {"registration_token": "'${GITLAB_REGISTRATION_TOKEN}'"}}'

github:
  image: docker:latest
  services:
    - name: docker:dind
      alias: thedockerhost
  variables:
    DOCKER_HOST: tcp://thedockerhost:2375/
    DOCKER_DRIVER: overlay2
    DOCKER_TLS_CERTDIR: ""
  script:
    - df -h
    - docker run --privileged ubuntu df -h
  tags:
    - crosscicd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  What does this GitLab-CI config do?
&lt;/h4&gt;

&lt;p&gt;The job &lt;strong&gt;&lt;em&gt;start-crosscicd&lt;/em&gt;&lt;/strong&gt; triggers the GitHub workflow, which creates the GitLab runner with the tag &lt;code&gt;crosscicd&lt;/code&gt;. The job &lt;code&gt;github&lt;/code&gt; then waits for a runner with the tag &lt;code&gt;crosscicd&lt;/code&gt;.&lt;/p&gt;
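
&lt;p&gt;For readers who prefer to see the trigger outside of &lt;code&gt;curl&lt;/code&gt;, here is a rough Python equivalent of the request that &lt;strong&gt;&lt;em&gt;start-crosscicd&lt;/em&gt;&lt;/strong&gt; sends. The function name &lt;code&gt;build_dispatch_request&lt;/code&gt; and the fallback values in the demo are assumptions for illustration; the URL, headers, and payload mirror the job above. The sketch only builds the request and does not send it.&lt;/p&gt;

```python
import json
import os
import urllib.request


def build_dispatch_request(repo, github_token, pipeline_id, registration_token):
    """Build (without sending) the POST request that fires repository_dispatch."""
    body = {
        "event_type": "gitlab_trigger_" + str(pipeline_id),
        # The workflow reads this as github.event.client_payload.registration_token.
        "client_payload": {"registration_token": registration_token},
    }
    return urllib.request.Request(
        url="https://api.github.com/repos/" + repo + "/dispatches",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "token " + github_token,
            "Accept": "application/vnd.github.everest-preview+json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder fallbacks so the demo runs outside GitLab-CI.
    request = build_dispatch_request(
        repo=os.environ.get("GITHUB_REPO", "example-user/example-repo"),
        github_token=os.environ.get("GITHUB_TOKEN", "placeholder-token"),
        pipeline_id=os.environ.get("CI_PIPELINE_ID", "0"),
        registration_token=os.environ.get("GITLAB_REGISTRATION_TOKEN", "placeholder"),
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

&lt;p&gt;Including the pipeline ID in &lt;code&gt;event_type&lt;/code&gt; makes each dispatch easy to tell apart in the GitHub Actions run history.&lt;/p&gt;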

&lt;h3&gt;
  
  
  Step 3
&lt;/h3&gt;

&lt;p&gt;Set the environment variables in the new GitLab repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    GITHUB_REPO:&amp;lt;username&amp;gt;/&amp;lt;github-repo&amp;gt;
    GITHUB_TOKEN:&amp;lt;GitHub Access Token&amp;gt;
    GITLAB_REGISTRATION_TOKEN:&amp;lt;GitLab Registration Token&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
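
&lt;p&gt;A missing variable only shows up later as a failed dispatch or a runner that never registers, so it can help to check for all three up front. A small sketch, assuming a hypothetical helper &lt;code&gt;missing_variables&lt;/code&gt; that you could run at the start of a job:&lt;/p&gt;

```python
import os

REQUIRED_VARS = ("GITHUB_REPO", "GITHUB_TOKEN", "GITLAB_REGISTRATION_TOKEN")


def missing_variables(environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_variables(os.environ)
    print("Missing CI/CD variables:", ", ".join(missing) if missing else "none")
```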



&lt;h4&gt;
  
  
  How to create a new GitHub Access Token:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Go to &lt;a href="https://github.com/settings/tokens/new" rel="noopener noreferrer"&gt;https://github.com/settings/tokens/new&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Select the &lt;code&gt;workflow&lt;/code&gt; scope and click the button to generate the token&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7t41gibobma1uyfa7tnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7t41gibobma1uyfa7tnz.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  How to get the Registration Token:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Go to &lt;code&gt;https://gitlab.com/&amp;lt;username&amp;gt;/&amp;lt;repo&amp;gt;/-/settings/ci_cd&lt;/code&gt; and click and expand &lt;code&gt;Runners&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Copy the Registration Token&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5m0s0al49w76jiqjyjsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5m0s0al49w76jiqjyjsq.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Where to store the EnvVars?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Go to  &lt;code&gt;https://gitlab.com/&amp;lt;username&amp;gt;/&amp;lt;repo&amp;gt;/-/settings/ci_cd&lt;/code&gt; and click and expand &lt;code&gt;Variables&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Click Add Variable and save; repeat for each environment variable&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8kfcu8v4p37m12z1e1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm8kfcu8v4p37m12z1e1w.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvncw3tiqqdle5lpk1yd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvncw3tiqqdle5lpk1yd.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4
&lt;/h3&gt;

&lt;p&gt;Now your pipeline is ready to run the GitLab Runner on GitHub, triggered by a GitLab-CI job :)&lt;/p&gt;




&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Video Demo
&lt;/h3&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/jI5U1iMboOs"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;p&gt;&lt;/p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frs0l7kh49uiyb7z71w4p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frs0l7kh49uiyb7z71w4p.png" alt="Job start-crosscicd triggers the GitHub Workflow"&gt;&lt;/a&gt;Job start-crosscicd triggers the GitHub Workflow&lt;p&gt;&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd97dpb7iu5zabbrn1ab3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd97dpb7iu5zabbrn1ab3.png" alt="Workflow triggered by GitLab-CI job"&gt;&lt;/a&gt;Workflow triggered by GitLab-CI job&lt;p&gt;&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2rzlaf1yxq3uzwyaa3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq2rzlaf1yxq3uzwyaa3f.png" alt="There is 17GB free by default"&gt;&lt;/a&gt;There is 17GB free by default&lt;p&gt;&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycjolujvyhkgsjutgwdd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycjolujvyhkgsjutgwdd.png" alt="After Maximize we have 54GB free to use"&gt;&lt;/a&gt;After Maximize we have 54GB free to use&lt;p&gt;&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzcfrksmy46gss18m4db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzcfrksmy46gss18m4db.png" alt="Register a Runner, Start it, and Unregister after the job in GitLab is completed"&gt;&lt;/a&gt;Register a Runner, Start it, and Unregister after the job in GitLab is completed&lt;p&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Code
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/edersonbrilhante/gitlab-runner-service-example" rel="noopener noreferrer"&gt;Repo with example code&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/edersonbrilhante/gitlab-runner-action" rel="noopener noreferrer"&gt;GitHub Action&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  That’s it!
&lt;/h2&gt;

&lt;p&gt;In case you have any questions, please leave a comment here or ping me on &lt;a href="https://www.linkedin.com/in/edersonbrilhante" rel="noopener noreferrer"&gt;🔗 LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>gitlabci</category>
      <category>cicd</category>
      <category>devops</category>
      <category>githubactions</category>
    </item>
    <item>
      <title>Vilicus — An overseer for security scanning of container images</title>
      <dc:creator>Ederson Brilhante</dc:creator>
      <pubDate>Wed, 31 Mar 2021 20:19:18 +0000</pubDate>
      <link>https://dev.to/edersonbrilhante/vilicus-a-overseer-for-security-scanning-of-container-images-eji</link>
      <guid>https://dev.to/edersonbrilhante/vilicus-a-overseer-for-security-scanning-of-container-images-eji</guid>
      <description>&lt;p&gt;Vilicus is an open-source tool that orchestrates security scans of container images (Docker/OCI) and centralizes all results into a database for further analysis and metrics.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why scan for vulnerabilities?
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;A recent &lt;a href="https://blog.prevasio.com/2020/12/operation-red-kangaroo-industrys-first.html" rel="noopener noreferrer"&gt;analysis&lt;/a&gt; of around 4 million Docker Hub images by cyber security firm Prevasio found that 51% of the images had exploitable vulnerabilities. A large number of these were cryptocurrency miners, both open and hidden, and 6,432 of the images had malware.&lt;br&gt;
&lt;a href="https://www.infoq.com/news/2020/12/dockerhub-image-vulnerabilities/" rel="noopener noreferrer"&gt;https://www.infoq.com/news/2020/12/dockerhub-image-vulnerabilities/&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;/p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F971e3mglk3o8dkls3h03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F971e3mglk3o8dkls3h03.png" alt="Image from https://prevasio.com/static/web/viewer.html?file=/static/Red_Kangaroo.pdf"&gt;&lt;/a&gt;Image from &lt;a href="https://prevasio.com/static/web/viewer.html?file=/static/Red_Kangaroo.pdf" rel="noopener noreferrer"&gt;https://prevasio.com/static/web/viewer.html?file=/static/Red_Kangaroo.pdf&lt;/a&gt;&lt;p&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Docker image security scanning is a process for finding security vulnerabilities within your Docker image files.&lt;br&gt;
Typically, image scanning works by parsing through the packages or other dependencies that are defined in a container image file, then checking to see whether there are any known vulnerabilities in those packages or dependencies.&lt;br&gt;
&lt;a href="https://resources.whitesourcesoftware.com/blog-whitesource/docker-image-security-scanning" rel="noopener noreferrer"&gt;https://resources.whitesourcesoftware.com/blog-whitesource/docker-image-security-scanning&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  How does it work?
&lt;/h2&gt;

&lt;p&gt;There are many tools to scan container images for vulnerabilities, such as &lt;a href="https://github.com/anchore/anchore-engine" rel="noopener noreferrer"&gt;Anchore&lt;/a&gt;, &lt;a href="https://github.com/quay/clair" rel="noopener noreferrer"&gt;Clair&lt;/a&gt;, and &lt;a href="https://github.com/aquasecurity/trivy" rel="noopener noreferrer"&gt;Trivy&lt;/a&gt;. But the results for the same image can differ from one tool to another. This project helps developers improve the quality of their container images by finding vulnerabilities and addressing them from a vendor-agnostic point of view.&lt;/p&gt;
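
&lt;p&gt;To make "different results for the same image" concrete, here is a small sketch of how findings from several scanners could be lined up side by side. It is not how Vilicus stores or compares results; the function names and the sample IDs (&lt;code&gt;VULN-0001&lt;/code&gt; and so on) are invented placeholders.&lt;/p&gt;

```python
from collections import defaultdict


def merge_findings(reports):
    """Map each vulnerability ID to the sorted list of vendors that reported it."""
    vendors_by_vuln = defaultdict(set)
    for vendor, vuln_ids in reports.items():
        for vuln_id in vuln_ids:
            vendors_by_vuln[vuln_id].add(vendor)
    return {vuln_id: sorted(vendors)
            for vuln_id, vendors in sorted(vendors_by_vuln.items())}


def disputed(merged, vendor_count):
    """Return the IDs that were not reported by every vendor."""
    return [vuln_id for vuln_id, vendors in merged.items()
            if len(vendors) != vendor_count]


if __name__ == "__main__":
    # Made-up findings for a single image; the IDs are placeholders.
    sample = {
        "anchore": ["VULN-0001", "VULN-0002"],
        "clair": ["VULN-0001"],
        "trivy": ["VULN-0001", "VULN-0003"],
    }
    merged = merge_findings(sample)
    print(merged)
    print(disputed(merged, vendor_count=len(sample)))
```

&lt;p&gt;Findings reported by only some vendors are the ones worth a second look, which is the kind of cross-checking a single scanner cannot give you.&lt;/p&gt;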

&lt;p&gt;Some articles comparing the scanning tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://boxboat.com/2020/04/24/image-scanning-tech-compared/" rel="noopener noreferrer"&gt;Open Source CVE Scanner Round-Up: Clair vs Anchore vs Trivy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://opensource.com/article/18/8/tools-container-security" rel="noopener noreferrer"&gt;5 open source tools for container security&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.a10o.net/devsecops/docker-image-security-static-analysis-tool-comparison-anchore-engine-vs-clair-vs-trivy/" rel="noopener noreferrer"&gt;Docker Image Security: Static Analysis Tool Comparison — Anchore Engine vs Clair vs Trivy&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
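
&lt;p&gt;To make the vendor-agnostic idea concrete, here is a minimal Python sketch of merging findings from several scanners into one view. It is an illustration only, not how Vilicus is implemented: the &lt;code&gt;findings&lt;/code&gt; data, the placeholder vulnerability IDs, and the &lt;code&gt;merge_findings&lt;/code&gt; helper are all hypothetical.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical findings: scanner name -> set of vulnerability IDs for one image.
# The IDs are placeholders for illustration, not real scan output.
findings = {
    "anchore": {"CVE-0000-0001", "CVE-0000-0002"},
    "clair": {"CVE-0000-0002", "CVE-0000-0003"},
    "trivy": {"CVE-0000-0001", "CVE-0000-0002", "CVE-0000-0003"},
}


def merge_findings(per_scanner):
    """Map each vulnerability ID to the list of scanners that reported it."""
    merged = defaultdict(list)
    # Iterate scanners in name order so each list comes out sorted.
    for scanner, vuln_ids in sorted(per_scanner.items()):
        for vuln_id in vuln_ids:
            merged[vuln_id].append(scanner)
    return dict(merged)


merged = merge_findings(findings)
for vuln_id in sorted(merged):
    print(vuln_id, "reported by", ", ".join(merged[vuln_id]))
```

&lt;p&gt;In this made-up data no single scanner disagrees about everything, but only one ID is reported by all three, which is the kind of gap a combined view is meant to expose.&lt;/p&gt;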




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobow7zp9z7oirw5r6l2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobow7zp9z7oirw5r6l2n.png"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Cached Database
&lt;/h2&gt;

&lt;p&gt;Vilicus updates the vendor databases daily with the latest changes from the vulnerability databases.&lt;/p&gt;

&lt;p&gt;Because the database data is stored in Docker image layers, the whole platform is ready to use in minutes instead of hours. Syncing the vulnerability feeds from scratch can take at least 6 hours.&lt;/p&gt;

&lt;p&gt;See the build scripts that implement this strategy for &lt;a href="https://github.com/edersonbrilhante/vilicus/blob/main/scripts/build-anchore-image.sh" rel="noopener noreferrer"&gt;Anchore&lt;/a&gt;, &lt;a href="https://github.com/edersonbrilhante/vilicus/blob/main/scripts/build-clair-image.sh" rel="noopener noreferrer"&gt;Clair&lt;/a&gt;, and &lt;a href="https://github.com/edersonbrilhante/vilicus/blob/main/scripts/build-trivy-image.sh" rel="noopener noreferrer"&gt;Trivy&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Local Registry
&lt;/h2&gt;

&lt;p&gt;Vilicus provides a local registry, so you can build a local image and scan it without pushing it to a remote repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t localhost:5000/local-image:my-tag .

curl -o docker-compose.yml https://raw.githubusercontent.com/edersonbrilhante/vilicus/main/deployments/docker-compose.yml

docker-compose up -d

IMAGE=localregistry.vilicus.svc:5000/local-image:my-tag

docker run -v ${PWD}/artifacts:/artifacts \
  --network container:vilicus \
  vilicus/vilicus:latest \
  sh -c "dockerize -wait http://vilicus:8080/healthz -wait-retry-interval 60s -timeout 2000s vilicus-client -p /opt/vilicus/configs/conf.yaml -i ${IMAGE}  -t /opt/vilicus/contrib/sarif.tpl -o /artifacts/results.sarif"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  GitHub Action
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub. Make code reviews, branch management, and issue triaging work the way you want.&lt;br&gt;
&lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;https://github.com/features/actions&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Vilicus provides a &lt;a href="https://github.com/marketplace/actions/vilicus-scan" rel="noopener noreferrer"&gt;GitHub Action&lt;/a&gt; to help you scan container images in your CI/CD pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container scanning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A scan can target either a remote image or a local image. For an image in a remote repository such as docker.io, reference it as &lt;strong&gt;&lt;em&gt;&lt;code&gt;docker.io/your-organization/image:tag&lt;/code&gt;&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  - name: Scan image
    uses: edersonbrilhante/vilicus-github-action@main
    with:
      image: "docker.io/myorganization/myimage:tag"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To scan a local image, tag it as &lt;strong&gt;&lt;em&gt;&lt;code&gt;localhost:5000/image:tag&lt;/code&gt;&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  - name: Scan image
    uses: edersonbrilhante/vilicus-github-action@main
    with:
      image: "localhost:5000/myimage:tag"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Full example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A complete example with steps for freeing disk space, building a local image, scanning it with Vilicus, and uploading the results to GitHub Security:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: Container Image CI
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Maximize Build Space
        uses: easimon/maximize-build-space@master
        with:
          root-reserve-mb: 512
          swap-size-mb: 1024
          remove-dotnet: 'true'
          remove-android: 'true'
          remove-haskell: 'true'
      - name: Checkout branch
        uses: actions/checkout@v2
      - name: Build the Container image
        run: docker build -t localhost:5000/local-image:${GITHUB_SHA} .
      - name: Vilicus Scan
        uses: edersonbrilhante/vilicus-github-action@main
        with:
          image: localhost:5000/local-image:${{ github.sha }}
      - name: Upload results to github security
        uses: github/codeql-action/upload-sarif@v1
        with:
          sarif_file: artifacts/results.sarif
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
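
&lt;p&gt;The workflow above writes its findings to a SARIF file. As a rough sketch of how you might inspect such a file yourself, the Python snippet below counts results per severity level. It assumes only the standard SARIF 2.1.0 layout (&lt;code&gt;runs&lt;/code&gt;, &lt;code&gt;results&lt;/code&gt;, &lt;code&gt;level&lt;/code&gt;); the &lt;code&gt;count_by_level&lt;/code&gt; helper and the sample document are hypothetical, and the exact fields Vilicus emits have not been checked here.&lt;/p&gt;

```python
import json
from collections import Counter


def count_by_level(sarif):
    """Count SARIF results per level across all runs."""
    counts = Counter()
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Assumption: a result without an explicit level counts as "warning".
            counts[result.get("level", "warning")] += 1
    return counts


# Tiny hand-written SARIF document standing in for a real results file.
sample = json.loads("""
{
  "version": "2.1.0",
  "runs": [
    {
      "tool": {"driver": {"name": "example-scanner"}},
      "results": [
        {"ruleId": "EXAMPLE-1", "level": "error", "message": {"text": "placeholder"}},
        {"ruleId": "EXAMPLE-2", "level": "warning", "message": {"text": "placeholder"}},
        {"ruleId": "EXAMPLE-3", "message": {"text": "placeholder"}}
      ]
    }
  ]
}
""")

counts = count_by_level(sample)
print(dict(counts))
```

&lt;p&gt;To try it on real output, replace &lt;code&gt;sample&lt;/code&gt; with the parsed contents of &lt;code&gt;artifacts/results.sarif&lt;/code&gt;, for example via &lt;code&gt;json.load&lt;/code&gt; on the opened file.&lt;/p&gt;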






&lt;h2&gt;
  
  
  Results in GitHub Security
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus-scan-examples" rel="noopener noreferrer"&gt;Check an example&lt;/a&gt; using Vilicus GitHub Action&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtmicxlyd96ikk9a0xqu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtmicxlyd96ikk9a0xqu.png" alt="Pipeline example"&gt;&lt;/a&gt;Pipeline example&lt;p&gt;&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcq9ox1c1zifdusp9k6a.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcq9ox1c1zifdusp9k6a.jpeg" alt="List with all vulns found"&gt;&lt;/a&gt;List with all vulns found&lt;p&gt;&lt;/p&gt;

&lt;p&gt;&lt;/p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx8g581t9k2gqexyusms.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx8g581t9k2gqexyusms.jpeg" alt="Vuln details"&gt;&lt;/a&gt;Vuln details&lt;p&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus-github-action" rel="noopener noreferrer"&gt;VIlicus GitHub Action&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/edersonbrilhante/vilicus" rel="noopener noreferrer"&gt;Vilicus&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  That’s it!
&lt;/h2&gt;

&lt;p&gt;In case you have any questions, please leave a comment here or ping me on &lt;a href="https://www.linkedin.com/in/edersonbrilhante" rel="noopener noreferrer"&gt;🔗 LinkedIn&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
      <category>docker</category>
      <category>devops</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
