Khachatur Ashotyan

Posted on May 18 • Edited on May 23

Jenkins as Code, or how I stopped clicking around in the UI

#jenkins #devops #cicd #gitops

I've been running Jenkins for years now. Different companies, different team sizes, but the same story keeps repeating, and at some point I couldn't take it anymore. So I decided to write some of it down. This is Part 1 of what I'm calling My CI/CD Odyssey - ideas I tried, things that blew up in my face, and stuff I still use today.

Later chapters get into the painful stuff: building macOS workers without losing your mind, spot instances as GitHub Actions runners to cut costs, plus a few other rabbit holes. First, the beginning. That's where most of the pain came from.

The "before" picture, and why it hurts

Anyone who's worked with Jenkins for a while knows this scene. Somebody opens the Jenkins UI, clicks "New Item", picks a freestyle or pipeline job, fills in twenty-something fields, scrolls past a wall of plugin options, hits Save. A month later somebody else has to figure out why a job behaves differently in stage than in prod, and the answer is "because Arthur clicked a different checkbox in February and nobody remembers".

That was my world for a long time. Multi-tier environments (stage, prod, sometimes more), and on top of that, sometimes more than one Jenkins instance per tier. Each one configured by hand: plugins installed manually, pipelines copy-pasted from one Jenkins to another and edited in place, credentials added by hand, workers attached one at a time. Then one day you wake up and realize:

Nobody remembers what plugins are installed where.
The "stage" Jenkins doesn't match prod anymore. You find out when a pipeline breaks in prod.
A Friday afternoon plugin update kills a build. Rolling it back is a human clicking buttons under stress.
A new team member joins, and you burn three days explaining tribal knowledge that should live in a repo.

That last one was what finally pushed me. Tribal knowledge is fine in a team of two sharing a desk, but past that it costs weeks of onboarding for every new hire.

The idea: treat Jenkins like any other piece of code

So I started reading. Jenkins is infrastructure. We already do infrastructure-as-code for everything else - Terraform for cloud, Helm for Kubernetes, Ansible for hosts - so why is Jenkins the one piece still managed by hand? Controller, jobs, credentials wiring, workers - all of it should be built from a git repo.

What I wrote down for myself:

I want a Jenkins where I can throw away the VM, the cluster, the config, run a pipeline, and ten minutes later have the same Jenkins back. And stage should be code-to-code identical to prod, so when I test a plugin upgrade in stage I know how it'll behave in prod.

Anyone who's been burned by a "but it worked in stage" deploy knows why this matters.

The building blocks

When I started designing this, it broke into a handful of moving pieces. None of them are revolutionary on their own, but wiring them together is where the value lives.

1. JCasC - Jenkins Configuration as Code

This is the foundation. JCasC is a Jenkins plugin that lets you declare the controller config in YAML - system settings, security realm, authorization strategy, clouds, credentials, tools, global libraries. The controller reads the YAML on boot and configures itself.

The first time I rebuilt a controller from a YAML file, I stopped clicking through the UI for good. The controller only knows about things in the YAML, so anything else might as well not exist.

Minimal example:

jenkins:
  systemMessage: "Managed by JCasC - do not edit in the UI"
  numExecutors: 0
  mode: EXCLUSIVE
  securityRealm:
    github:
      clientID: ${GITHUB_CLIENT_ID}
      clientSecret: ${GITHUB_CLIENT_SECRET}
  clouds:
    - kubernetes:
        name: "eks"
        namespace: "jenkins"
        jenkinsUrl: "http://jenkins.jenkins.svc.cluster.local:8080"
unclassified:
  globalLibraries:
    libraries:
      - name: "ci-libs"
        defaultVersion: "main"
        retriever:
          modernSCM:
            scm:
              git:
                remote: "https://github.com/<org>/ci-libs.git"

About twenty lines of YAML, and that's most of the controller.

2. Job DSL - jobs from a git repo

JCasC handles the controller but not the jobs. For that I used the Job DSL plugin. Jobs live as Groovy files in a git repo, and a small "seeder" job in Jenkins polls the repo and regenerates jobs from the DSL files on each run. Deleting a job from git removes it from Jenkins on the next run, provided the seeder is configured to delete removed jobs (Job DSL ignores them by default); changing a parameter in git rolls forward the same way.

The Jenkins UI ends up effectively read-only from a configuration perspective. Anyone who tries to edit a job in the UI gets overwritten by the next seeder run, which is by design.

The Job DSL plugin's API reference lists the available declarative methods.
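
A seeder like the one described in this section can itself be declared in JCasC, because the Job DSL plugin adds a `jobs` root key. This is a minimal sketch, not a copy of a production config: the job name `seed-jobs`, the `jenkins-jobs` repo, the `jobs/` path, and the five-minute poll are all illustrative assumptions.

```yaml
# Sketch only: job name, repo URL, DSL path, and schedule are placeholders.
jobs:
  - script: |
      job('seed-jobs') {
        description('Regenerates all jobs from the DSL files in git')
        scm {
          git {
            remote {
              url('https://github.com/<org>/jenkins-jobs.git')
            }
            branch('main')
          }
        }
        triggers {
          // Poll the repo roughly every five minutes
          scm('H/5 * * * *')
        }
        steps {
          dsl {
            // Every Groovy file under jobs/ is treated as a DSL script
            external('jobs/**/*.groovy')
            // Removed jobs are ignored by default; DELETE makes git authoritative
            removeAction('DELETE')
            removeViewAction('DELETE')
          }
        }
      }
```

With script security enabled, the DSL scripts may need approval or sandboxing before the seeder can run them, so it's worth testing that on stage first.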

3. Helm + Kubernetes for the controller

I run the Jenkins controller in Kubernetes. The deployment uses the official Helm chart, with a persistent volume for the home directory and a sidecar that injects JCasC config from a ConfigMap. Upgrading Jenkins is a chart version bump, rolling back is the same chart at the previous version. The plugin list sits in values.yaml, version-pinned and reviewed in a PR like any other code change.

This is when plugin upgrades stopped feeling like Friday-night events. Each upgrade goes through stage in a PR and gets the same review as application code.
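
For a sense of what that looks like in values.yaml, here is a rough sketch against the official Jenkins Helm chart. The plugin versions are deliberately left as placeholders, and the volume size and the `system-message` key name are assumptions for illustration; check the key names against the chart version you actually deploy.

```yaml
# values.yaml sketch for the official Jenkins Helm chart.
# Versions are placeholders: pin the ones you actually tested in stage.
controller:
  installPlugins:
    - "configuration-as-code:<pinned-version>"
    - "job-dsl:<pinned-version>"
    - "kubernetes:<pinned-version>"
    - "workflow-aggregator:<pinned-version>"
  JCasC:
    defaultConfig: true
    configScripts:
      # Each key here becomes its own JCasC file for the controller
      system-message: |
        jenkins:
          systemMessage: "Managed by JCasC - do not edit in the UI"
  sidecars:
    configAutoReload:
      enabled: true    # sidecar that reloads JCasC when the config changes
persistence:
  enabled: true        # keep the Jenkins home directory on a persistent volume
  size: "20Gi"         # assumed size, adjust to your build history
```

A plugin bump is then a one-line diff in `installPlugins`, and a chart upgrade or rollback is the same `helm upgrade --install` command with a different `--version`.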

Side note: if you don't want to deal with Helm, the community maintains a Jenkins Kubernetes Operator that's CRD-first. I went with Helm because the upgrade story is simpler, but the operator is fine if you're already deep in operators.

4. Packer for worker images

Then there's the workers, the machines that actually run builds. I went all-in on Packer. Every worker image is baked from a Packer template in git, with the base OS, language runtimes, SDKs, and build tools pre-installed. Each image has a version, and the worker config pins to a specific one.

Before Packer, every worker was a slightly different snowflake, hand-installed and slowly drifting. After Packer, every worker booted from v1.2.3 is byte-for-byte identical to every other one. When a dependency upgrade breaks something, you know which image introduced it, and pinning back to the previous version is a one-line PR.

5. Ephemeral workers - born, used, destroyed

The ephemeral worker piece is what ties everything together, and it's the part I'm proudest of. Workers in this setup are strictly ephemeral: a new worker per build, never a long-lived agent we reboot once a week. A pipeline asks Jenkins for a worker; a dedicated job spins one up from a known Packer image, the build runs on it, and the worker gets destroyed when the build finishes. Every build starts on a fresh machine.

The spin-up mechanism varies by platform:

Linux builds: the Jenkins Kubernetes plugin schedules a pod in EKS from a container image we baked. Build finishes, pod is deleted. Lifecycle is seconds to minutes.
AWS EC2 / Azure VMs (Linux and Windows): a dedicated job runs Terraform to provision and de-provision instances from Packer-built images.
macOS VMs: the same idea, but macOS virtualization is its own ecosystem. A fresh macOS VM gets booted from a Packer-baked image on each build (Tart on Apple Silicon hosts, vSphere for the older fleet, or Orchard for pooled remote Macs), the build runs, the VM is torn down. macOS deserves its own post (Part 2), but the lifecycle is the same: provisioned for one build, then torn down.
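
To make the Linux case concrete, this is roughly how a pod template for the Kubernetes plugin can be declared in JCasC. Treat it as a sketch under assumptions: the template name, the `linux-build` label, and the `<registry>/ci-linux` image are hypothetical, and only the cloud settings are taken from the minimal example earlier in this post.

```yaml
# Sketch only: template name, label, and image are placeholders.
jenkins:
  clouds:
    - kubernetes:
        name: "eks"
        namespace: "jenkins"
        jenkinsUrl: "http://jenkins.jenkins.svc.cluster.local:8080"
        templates:
          - name: "linux-build"
            label: "linux-build"
            idleMinutes: 0    # do not keep the pod around for reuse
            containers:
              - name: "build"
                # Pinned, pre-baked image: changing this tag is the upgrade
                image: "<registry>/ci-linux:v1.2.3"
                command: "sleep"
                args: "infinity"
```

A pipeline that requests the `linux-build` label gets a fresh pod from this template, and pinning the image tag here is what makes rolling back a worker image a one-line PR.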

Every build starts from byte-identical state. Not "mostly the same", not "the same except for ~/.cache". If the image tag is v1.2.3, every build on it starts from the exact filesystem snapshot Packer produced. There's no operator history sitting on the disk.

That eliminates a whole class of bugs: leftover state on the agent, the weird ~/.cache nobody cleaned up, a disk full of artifacts from three weeks ago, the Friday-only flake from a leak that's been growing since Monday. None of it survives, because the worker doesn't live long enough to accumulate it.

It also makes "build is non-reproducible" investigations faster. If two builds against the same commit produce different artifacts, the cause is almost never the worker, since both ran on a fresh one.

Security gets simpler too. Secrets pulled onto a worker disappear with the worker, so no long-lived agent holds old tokens. If a credential ever leaks into a build environment, the worker is gone within minutes and the leak goes with it.

6. Terraform / Terragrunt for everything else

Everything that isn't Jenkins itself (VPCs, IAM, secret stores, the EKS cluster, image galleries) lives in Terraform, wrapped with Terragrunt so the same modules get reused across stage and prod with different inputs. That's why stage ends up code-to-code identical to prod: the same modules at the same versions, just with different variables.

To check how prod will behave under a change, you run the same Terraform with ENV=stage instead of ENV=prod.

How it all clicks together

The flow ends up looking like this:

Somebody opens a PR - new job, plugin bump, JCasC tweak, new Packer image, whatever.
CI validates: YAML lint, Groovy compile checks, terraform plan, Packer build for any changed images.
PR gets reviewed and merged.
On merge, GitHub Actions applies infra via Terraform. The Jenkins seeder picks up new DSL files on its next poll.
The next build that needs a worker pulls the new image.

The Jenkins UI becomes a view onto what the repo says should be running, while the repo itself holds the truth.
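
The PR and merge steps above could be sketched as a single GitHub Actions workflow. This is an assumption-heavy outline: the `jcasc/`, `helm/`, and `terraform/` directories are hypothetical, plain `terraform` stands in for the Terragrunt wrapper, and cloud credentials, Groovy compile checks, and Packer builds are left out to keep it short.

```yaml
# .github/workflows/jenkins-as-code.yml
# Sketch only: directory names are assumptions, credentials are omitted.
name: jenkins-as-code
on:
  pull_request:
  push:
    branches: [main]

jobs:
  validate:
    # Runs on every PR: lint the YAML and show the Terraform plan
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Lint JCasC and Helm values
        run: |
          pip install yamllint
          yamllint jcasc/ helm/
      - name: Terraform plan
        working-directory: terraform
        run: |
          terraform init -input=false
          terraform plan -input=false

  apply:
    # Runs only after merge to main
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Terraform apply
        working-directory: terraform
        run: |
          terraform init -input=false
          terraform apply -input=false -auto-approve
```

Keeping validation and apply in one file means the plan a reviewer sees on the PR comes from the same code path that runs on merge.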

What this fixed for me

What changed:

We stopped seeing "works on stage, breaks on prod" bugs. Because stage runs the same code as prod with the same modules at the same versions, when it works in stage it works in prod (modulo data).
Plugin upgrades aren't Friday-night events anymore. A bad one gets reverted like any other change.
Onboarding got much faster. New engineers read the repo instead of getting a Jenkins UI tour and a Slack thread of secrets.
Disaster recovery actually works. If I lost the controller VM, the EKS cluster, or even the whole account, the repo alone is enough to rebuild it.
We get an audit trail without writing one. Every pipeline change is a git commit with an author, a timestamp, and a PR description.

What I'm still figuring out

This isn't a finished story. A few things still keep me up at night:

macOS workers are the hardest piece. AWS does offer Mac instances, but the 24-hour minimum allocation and bare-metal model make them nothing like spinning up a Linux VM, and the hypervisor, licensing, and hardware constraints push the whole macOS story onto its own track. Part 2 covers it: Tart, virtualization on Apple Silicon, the trade-offs between self-hosted and cloud-mac providers, and the signing and notarization pain.
GitHub Actions costs add up at scale. You can offload heavier workloads to spot-instance runners cheaply, though spot brings its own trade-offs. Part 3 walks through that.

Closing thought

If you're still managing Jenkins through the UI, it's rarely about laziness. The cost shows up in places that don't make it onto any dashboard: the engineer who leaves and takes the only working configuration in their head, the 2am plugin-upgrade breakage, the customer-facing deploy that fails because stage and prod had quietly drifted apart for six months. Jenkins as Code doesn't make those costs disappear, but it surfaces them as PRs I can see and review, which for me has been worth the work.

Appendix - tools and plugins I leaned on

For anyone who wants to skip straight to the implementations, here's what's wired up:

Jenkins plugins

Configuration as Code (JCasC): the controller config in YAML.
Job DSL: jobs defined in Groovy in a git repo.
Kubernetes plugin: ephemeral pod agents in EKS.
Pipeline: Shared Groovy Libraries: the global libraries that hold reusable pipeline code.

Deployment

Jenkins official Helm chart: what I use to deploy the controller.
Jenkins Kubernetes Operator: the CRD-based alternative, if you prefer operators over Helm.

Image building

HashiCorp Packer: bakes all the worker images (Linux, Windows, macOS).

Infrastructure

Terraform: everything outside Jenkins (VPCs, IAM, secrets, EKS, image galleries).
Terragrunt: keeps the same modules DRY across stage and prod.
Kubernetes / Amazon EKS: where the Jenkins controller lives.
Helm: package manager for the Kubernetes side.
GitHub Actions: applies Terraform on merge.

Coming up in later parts

Tart: macOS VMs on Apple Silicon (Part 2).
Orchard: Tart cluster orchestration for macOS fleets (Part 2).

This is Part 1 of My CI/CD Odyssey. Follow me here on dev.to if you want to be pinged when Part 2 drops. And if you're doing Jenkins as Code differently, I'd love to hear about it in the comments.
