I've been running Jenkins in one form or another for years now. Different companies, different sizes of teams, but somehow the same story keeps repeating itself, and at some point I just couldn't take it anymore. So I decided to write down what I went through, what I learned, and where this journey took me. This is Part 1 of what I'm calling My CI/CD Odyssey — a series where I want to share the ideas, the mistakes, and the things that actually worked.
Future chapters will go deeper into the painful stuff — building macOS workers without losing your mind, using spot instances as GitHub Actions runners to cut costs, and a few other rabbit holes I went into. But before we get there, let's start at the beginning, because the beginning is where most of the pain lives.
The "before" picture, and why it hurts
If you've worked with Jenkins for any reasonable amount of time, you probably know this scene: someone opens the Jenkins UI, clicks "New Item", picks a freestyle or pipeline job, fills in twenty-something fields, scrolls past a wall of plugin options, and clicks Save. Then a month later somebody has to figure out why a job behaves differently in dev than in prod, and the answer is "because Arthur clicked a different checkbox in February and nobody remembers".
That was basically my world for a long time. We had multi-tier environments — dev, stage, sometimes more — and on top of that, sometimes more than one Jenkins instance per tier. Each one was configured by hand. Plugins installed by hand. Pipelines copy-pasted from one Jenkins to another and edited by hand. Credentials added by hand. Workers attached by hand. Then one day you wake up and realize:
- Nobody remembers what plugins are installed where.
- The "stage" Jenkins doesn't match production anymore, and you only notice when a pipeline breaks in prod.
- A plugin update on Friday afternoon kills a build, and rolling it back means a human clicking buttons under stress.
- A new team member joins and you spend three days explaining tribal knowledge that should really live in a repo.
That last point is what really got me. Tribal knowledge is fine when there are two of you. It stops being fine very quickly.
The idea: treat Jenkins like any other piece of code
So I started doing some research, and the direction was pretty obvious in hindsight: if Jenkins is a piece of infrastructure, and we treat infrastructure as code everywhere else (Terraform for cloud, Helm for Kubernetes, Ansible for hosts), then Jenkins itself shouldn't be the special snowflake we manage by hand. The whole controller, all the jobs, all the credentials wiring, the workers — everything should come out of a git repo. End to end.
The goal I wrote down for myself was something like this:
I want a Jenkins instance where I can throw away the whole VM, the whole cluster, the whole config, run a pipeline, and ten minutes later have an identical Jenkins back. And I want dev to be code-to-code identical to prod, so when I test a plugin upgrade or a pipeline change in dev, I actually know it will behave the same in prod.
If you've ever burned yourself on a "but it worked in stage" deploy, you know exactly why that sentence matters.
The building blocks
Once I started designing this, the picture broke down into a few moving pieces. None of these are revolutionary on their own — what matters is how they fit together.
1. JCasC — Jenkins Configuration as Code
This is the foundation. JCasC is a Jenkins plugin that lets you define the entire controller config in YAML. System settings, security realm, authorization strategy, clouds, credentials wiring, tools, global libraries — all of it. The controller reads the YAML on boot and configures itself.
The moment I plugged JCasC in and could rebuild a controller from a YAML file, I knew I wasn't going back. No more "what's installed where". Whatever is in the YAML is the truth. If it's not in the YAML, it doesn't exist.
A minimal taste of what that looks like:
jenkins:
  systemMessage: "Managed by JCasC — do not edit in the UI"
  numExecutors: 0
  mode: EXCLUSIVE
  securityRealm:
    github:
      clientID: ${GITHUB_CLIENT_ID}
      clientSecret: ${GITHUB_CLIENT_SECRET}
  clouds:
    - kubernetes:
        name: "eks"
        namespace: "jenkins"
        jenkinsUrl: "http://jenkins.jenkins.svc.cluster.local:8080"
unclassified:
  globalLibraries:
    libraries:
      - name: "ci-libs"
        defaultVersion: "main"
        retriever:
          modernSCM:
            scm:
              git:
                remote: "https://github.com/<org>/ci-libs.git"
About two dozen lines, and the whole controller knows who it is.
2. Job DSL — jobs from a git repo
JCasC handles the controller, but it doesn't really handle jobs. For that I leaned on the Job DSL plugin. Jobs are defined in Groovy files in a git repo, and a small "seeder" job in Jenkins polls the repo, picks up all the DSL files, and recreates jobs from them. If a job is removed from git, it disappears from Jenkins. If a parameter changes in git, it changes in Jenkins on the next seed run.
This means the Jenkins UI becomes basically read-only from a configuration point of view. Nobody edits a job in the UI anymore — if you do, the seeder will overwrite you on the next run. That's a feature, not a bug.
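As a rough illustration, the seeder itself can be declared in the JCasC YAML, because JCasC accepts inline Job DSL under a top-level `jobs` key when the Job DSL plugin is installed. The sketch below is an assumption about how such a seeder might look, not the exact job from this setup: the job name `seed`, the repository URL, the `jobs/**/*.groovy` layout, and the five-minute polling schedule are all hypothetical.

```yaml
# Sketch only: job name, repo URL, file layout, and schedule are hypothetical.
jobs:
  - script: |
      job('seed') {
        // Check out the repo that holds the Job DSL files.
        scm {
          git {
            remote { url('https://github.com/<org>/jenkins-jobs.git') }
            branch('main')
          }
        }
        // Poll the repo for changes roughly every five minutes.
        triggers {
          scm('H/5 * * * *')
        }
        steps {
          dsl {
            // Process every DSL file under jobs/.
            external('jobs/**/*.groovy')
            // Delete jobs and views whose definitions were removed from git.
            removeAction('DELETE')
            removeViewAction('DELETE')
          }
        }
      }
```

The `removeAction('DELETE')` line is what gives the "removed from git, gone from Jenkins" behavior; by default Job DSL leaves previously generated jobs in place. Depending on the script security settings, the seed script may also need approval or sandboxing before it runs.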
3. Helm + Kubernetes for the controller
I run the Jenkins controller in Kubernetes. Helm chart for the deploy, persistent volume for the home dir, a sidecar that injects JCasC config from a ConfigMap. Upgrading Jenkins is just bumping a chart version. Rolling back is rolling back a chart version. Plugin lists are values in a Helm values.yaml file, version-pinned, and reviewed in a pull request like any other change.
This is honestly the part that made plugin upgrades stop being scary. They go through a PR. They get tested in dev first. They get the same review as application code.
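A values file for this might look roughly like the following. It is a sketch against the official Jenkins Helm chart's `controller.installPlugins` and `controller.JCasC.configScripts` settings; the plugin versions are left as placeholders, and the script name `welcome-message` and the volume size are made up for illustration. Key names can shift between chart versions, so check them against the chart version you actually pin.

```yaml
# values.yaml sketch for the official Jenkins Helm chart.
# Versions are placeholders; pin real ones in your repo.
controller:
  # Every plugin is listed explicitly as id:version, so an upgrade is a reviewed diff.
  installPlugins:
    - "configuration-as-code:<pinned-version>"
    - "job-dsl:<pinned-version>"
    - "kubernetes:<pinned-version>"
    - "workflow-aggregator:<pinned-version>"
  JCasC:
    # Each entry becomes a YAML file that the controller loads as JCasC config.
    configScripts:
      welcome-message: |
        jenkins:
          systemMessage: "Managed by JCasC, do not edit in the UI"
  sidecars:
    configAutoReload:
      # Sidecar that watches the JCasC ConfigMaps and reloads config on change.
      enabled: true
persistence:
  # Persistent volume backing the Jenkins home directory.
  enabled: true
  size: "20Gi"
```

With this layout, a plugin bump is a one-line change to `installPlugins`, and rolling it back is a revert of that same line.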
Side note: if you'd rather not deal with Helm at all, the community also maintains a Jenkins Kubernetes Operator that takes a CRD-first approach. I went with Helm for the simpler upgrade story, but the operator is a perfectly reasonable alternative if you're already heavy into the operator pattern.
4. Packer for worker images
The next big piece is the workers — the actual machines that run your builds. Here I went all-in on Packer. Every worker image is baked from a Packer template that lives in git: base OS, language runtimes, SDKs, build tools, everything pre-installed. The image gets a version. The version gets pinned in the worker config.
This was the moment that builds started to feel reproducible. Before Packer, every worker was a slightly different snowflake, hand-installed and slowly drifting. After Packer, every worker that boots from image v1.2.3 is byte-for-byte the same as every other worker booted from image v1.2.3. If a dependency upgrade breaks something, you know exactly which image introduced it, and you can pin back to the previous one in a one-line PR.
5. Ephemeral workers — born, used, destroyed
This is the part that connects everything, and honestly the part I'm proudest of. Workers in this setup are ephemeral. Not "long-lived agents we reboot once a week", but actually ephemeral. A pipeline asks Jenkins for a worker, something spins one up from a known Packer image, the worker runs the build, the worker dies. Always. Every build gets a clean environment.
The "something" depends on the platform, but the pattern is identical across all of them:
- Linux builds — the Jenkins Kubernetes plugin schedules a pod in the EKS cluster from a container image we baked. Build finishes, pod is deleted. Lifecycle is seconds to minutes.
- AWS EC2 / Azure VMs (Linux and Windows) — a dedicated job runs Terraform to provision and de-provision instances from Packer-built images.
- macOS VMs — same idea, but the underlying virtualization is its own world. We spin up a fresh macOS VM from a Packer-baked image on each build (via Tart on Apple Silicon hosts, or vSphere for older fleets, or Orchard for pooled remote Macs), the build runs, and the VM is torn down at the end. macOS is messier and deserves its own post — that's Part 2 — but the contract is the same: born for one build, destroyed after.
The point is: every build starts from byte-identical state. Not "mostly the same". Not "the same modulo ~/.cache". Identical. If v1.2.3 of an image is what's running, then every build on that image starts from the exact same filesystem snapshot the Packer pipeline produced. There's no human in between leaving footprints.
That kills a whole category of bugs. No more "leftover state on the agent". No more "this worker has a weird ~/.cache somebody never cleaned up". No more "the disk filled up because of build artifacts from three weeks ago". No more "this only fails on Friday because the agent's been up since Monday and something is leaking". The worker simply doesn't live long enough to accumulate any of that.
It also makes "build is non-reproducible" investigations a lot shorter. If two builds against the same commit produce different artifacts, the cause is almost never the worker — because the worker is brand new in both cases. That narrows the search dramatically.
And it turns out to be a beautiful security property too: secrets that get pulled onto a worker disappear with it. There's no long-lived agent holding old tokens. If a credential leaks into a build environment, its blast radius is measured in minutes, not weeks.
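For the Linux case, the "born for one build" contract can be expressed as a pod template on the Kubernetes cloud in JCasC. This is a minimal sketch rather than the real configuration: the template name, label, and image reference are hypothetical, and the exact keys should be checked against the Kubernetes plugin version in use.

```yaml
# Sketch: pod template name, label, and image are hypothetical.
jenkins:
  clouds:
    - kubernetes:
        name: "eks"
        namespace: "jenkins"
        jenkinsUrl: "http://jenkins.jenkins.svc.cluster.local:8080"
        templates:
          - name: "linux-build"
            # Pipelines request this worker with: agent { label 'linux-build' }
            label: "linux-build"
            # Do not keep the pod around for reuse after the build.
            idleMinutes: 0
            # Delete the pod whether the build passed or failed.
            podRetention: "never"
            containers:
              - name: "build"
                # Version-pinned image baked by the image pipeline.
                image: "<registry>/ci-linux:1.2.3"
                # Keep the container alive so Jenkins can run build steps in it.
                command: "sleep"
                args: "99d"
```

Pinning the image tag here is what ties a build back to a specific baked image: changing `1.2.3` is a one-line PR, and so is reverting it.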
6. Terraform / Terragrunt for everything else
All the things that aren't Jenkins itself — VPCs, IAM, secret stores, the EKS cluster, image galleries — live in Terraform, organized with Terragrunt so the same modules get reused across dev and prod with different inputs. Same code, different variables. That's how I get dev to be code-to-code identical to prod.
If you ever want to test how production will behave, just run the same Terraform with ENV=stage instead of ENV=prod. Same modules, same versions, just a different namespace. No surprises.
How it all clicks together
The flow ends up looking like this:
- Somebody opens a pull request — could be a new job, a plugin bump, a JCasC tweak, a new Packer image.
- CI runs validation: YAML lint, Groovy compile checks, Terraform plan, Packer build for changed images.
- PR gets reviewed and merged.
- On merge, GitHub Actions applies infra changes via Terraform, and the Jenkins seeder picks up new DSL files on its next poll.
- Next build that needs a worker pulls the new image. No human in the loop.
That's the loop. That's the whole point. The Jenkins UI becomes a window into what the repo says should be running, not the source of truth.
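The merge-driven half of that loop might be wired up with a GitHub Actions workflow along these lines. Treat it as a sketch: the `infra/` directory, the job names, and the choice to run plain `terraform` commands (rather than going through Terragrunt) are assumptions, and cloud credentials are omitted entirely.

```yaml
# .github/workflows/infra.yml (sketch; paths and job names are hypothetical)
name: infra
on:
  pull_request:
    paths: ["infra/**"]
  push:
    branches: ["main"]
    paths: ["infra/**"]

jobs:
  plan:
    # On pull requests: show reviewers what would change.
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform validate
      - run: terraform plan -input=false

  apply:
    # On merge to main: apply the reviewed change with no human in the loop.
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: infra
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -input=false
      - run: terraform apply -input=false -auto-approve
```

The seeder needs no step in this workflow, because it polls the jobs repo on its own schedule.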
What this fixed for me
Here's what I noticed had actually changed:
- No more "works on stage, breaks on prod". Because the two are literally the same code with different inputs. If it works on stage, it works on prod, modulo data differences.
- Plugin upgrades stopped being scary. They go through a PR. They get tried on dev. They roll back with git revert.
- Onboarding got faster. New engineers read the repo. They don't have to be told secrets or shown a Jenkins UI tour.
- Disaster recovery got real. I can lose the controller VM, the EKS cluster, even the entire account, and as long as I have the repo I can rebuild.
- Audit trail came for free. Every change to any pipeline is a git commit, with an author, a timestamp, and a PR description. No more "who changed this and when".
What I'm still figuring out
I don't want to make this sound like a finished story, because it's not. A few things still keep me up at night:
macOS workers are their own special kind of hell. You can't just spin up a Mac VM in AWS the same way you spin up Linux. There's a whole ecosystem of hypervisors, licensing rules, and hardware constraints to deal with. This deserves its own post — and it's getting one. Part 2 will be all about macOS workers: Tart, virtualization on Apple Silicon, the trade-offs between self-hosted and cloud-mac providers, and how to make signing and notarization not feel like a horror movie.
GitHub Actions cost at scale. There is a fairly easy way to run spot instances as GitHub Actions runners to offload certain workloads cheaply, but that's its own rabbit hole, with different trade-offs, different failure modes, and different cost curves. Part 3 will cover spot-based GitHub Actions runners end to end.
Closing thought
If there's one thing I'd say to anyone reading this who's still managing Jenkins by clicking buttons, it's this: you're not lazy for doing it, you're just paying the cost in places that don't show up on a dashboard. The cost shows up when someone leaves the team, when a plugin update breaks a build at 2am, when a customer-facing deploy fails because stage lied to you. Jenkins as Code doesn't make those costs disappear, but it makes them visible and reviewable. And that, honestly, has been worth all the work.
Appendix — tools and plugins I leaned on
For anyone who wants to skip straight to the implementations, here's the short list of what's actually wired up in this setup:
Jenkins plugins
- Configuration as Code (JCasC) — the controller config in YAML.
- Job DSL — jobs defined in Groovy in a git repo.
- Kubernetes plugin — ephemeral pod agents in EKS.
- Pipeline: Shared Groovy Libraries — the global libraries that hold reusable pipeline code.
Deployment
- Jenkins official Helm chart — what I use to deploy the controller.
- Jenkins Kubernetes Operator — the CRD-based alternative, if you prefer operators over Helm.
Image building
- HashiCorp Packer — bakes all the worker images (Linux, Windows, macOS).
Infrastructure
- Terraform — everything outside Jenkins (VPCs, IAM, secrets, EKS, image galleries).
- Terragrunt — keeps the same modules DRY across dev/stage/prod.
- Kubernetes / Amazon EKS — where the Jenkins controller lives.
- Helm — package manager for the Kubernetes side.
- GitHub Actions — applies Terraform on merge.
Coming up in later parts
- Tart — macOS VMs on Apple Silicon (Part 2).
- Orchard — Tart cluster orchestration for macOS fleets (Part 2).
This is Part 1 of My CI/CD Odyssey. If you want to be pinged when Part 2 drops, follow me here on dev.to. And if you're doing Jenkins as Code differently, I'd love to hear about it in the comments.


