DEV Community

chengkai


How I Keep a Kubernetes CLI Lean: Vault + Jenkins + Istio, Loaded Only On Demand

Why This Exists — The Real Story

I work in a large enterprise IT department. Some downtime between projects gave me space to think about the problems I'd been quietly accumulating for years — the kind of problems that never make it onto the official backlog because they're too small to justify a ticket and too big to fix in a lunch break.

The Jenkins cluster needed an upgrade. It was 10,000 miles away from actually happening — the Kubernetes cluster was completely locked down, access restricted, every change requiring approvals through a process that had more steps than the actual work. One person could request it. A different team owned it. A third team controlled the firewall. Nobody could move fast enough to matter.

Manual TLS certificate updates were breaking things constantly. Every rotation was a ceremony — someone had to remember, someone had to do it, something always went wrong. Password expiry was the same story. Credentials rotated on a schedule nobody tracked, services broke, and the fix was always manual.

I looked at all of this and thought: I know how to automate this. Why am I waiting for permission?

So during some "quiet time" I started building. Not for work — for myself, on my own machine, to prove it was possible. The Kubernetes cluster was locked down, so I used k3d locally. If I couldn't fix the corporate environment, I could at least build the thing that should exist and understand it deeply enough to make the case for it later.

That became k3d-manager — a modular shell CLI that handles the three things I kept fighting with: Identity (LDAP/AD), Secrets (Vault), and Service Mesh (Istio) — all from a single command. With automated cert rotation. With automated password rotation. No manual steps. No ceremonies. No waiting.

No sprawling Helm umbrella charts. No operator frameworks to re-learn every six months. Just shell.


The "Cost of Stability" Problem

Picture this: you need to spin up a local Kubernetes cluster with Vault, Jenkins, LDAP, and Istio — all wired together — for a demo in two hours. Your approved toolchain involves three separate repos, a Helm umbrella chart nobody fully understands, and a Confluence page last updated in 2023.

That's the reality of working in a large enterprise IT department. It's a constant tug-of-war between speed and compliance. Every change needs an audit trail. Every tool needs a sign-off. And by the time something gets approved, the problem it was solving has already mutated into something else.

I got tired of waiting.


Visual Architecture Overview

Here's how the pieces fit together. It's more connected than it might look at first glance.

k3d-manager Architectural Framework

Worth noting: while k3d runs Kubernetes in Docker containers (great for local dev), k3d-manager also supports k3s directly — which means it works on bare metal servers, cloud instances, or any Linux VM. It's not just a laptop tool.

minikube was the obvious candidate when I was evaluating options. I didn't go that route — partly instinct, partly because a former coworker had mentioned k3s years earlier and it had stuck in the back of my mind. I hadn't paid it much attention at the time. But once the scope expanded beyond pure local dev, k3s made more sense: it's production-grade, it runs the same way everywhere, and the gap between your local environment and a real deployment is much smaller.

I developed the k3s support on an Ubuntu VM running in Parallels Desktop on my Mac. Not a dedicated server, not a cloud instance — a VM on the same machine I was writing code on. If it works there, it works anywhere k3s runs: a cheap VPS, an on-premise bare metal cluster, an edge device, or a cloud instance where you want k3s instead of a managed Kubernetes service. That covers a lot of ground — including startups that can't justify EKS costs yet and enterprises with compliance requirements that keep workloads off public cloud.


Why Shell? (And How to Do It Right)

I know — "shell scripts for enterprise Kubernetes" sounds like a joke. But hear me out.

The problem with most shell projects isn't the language. It's the lack of discipline. Scripts grow organically, functions get copy-pasted, and six months later nobody knows what anything does. I know because it started happening here — early on, the main file was accumulating too many responsibilities, with functions landing wherever they fit at the time. The refactoring came before it got unmanageable: system functions moved to system.sh, test logic to test.sh, everything organised under a proper lib/ directory. From that point on I set a few rules.

Rule 1: Lazy-Load Everything

The main entry point — scripts/k3d-manager — is only 92 lines. It loads six core libraries at startup and nothing else. When you call a function that isn't built-in, it hands off to _try_load_plugin:

# scripts/k3d-manager
if [[ "$(type -t "$function_name")" == "function" ]]; then
    $function_name "$@"
else
    _try_load_plugin "$function_name" "$@"
fi

_try_load_plugin validates the function name (blocks private _-prefixed functions, rejects invalid names), then calls _load_plugin_function, which does the actual work:

# scripts/lib/system.sh
for plugin in "$PLUGINS_DIR"/*.sh; do
    if command grep -Eq \
      "^[[:space:]]*(function[[:space:]]+${func}[[:space:]]*\(\)|${func}[[:space:]]*\(\))[[:space:]]*\{" \
      "$plugin"; then
        source "$plugin"
        "$func" "$@"
        return $?
    fi
done

It greps plugin files for the function definition, sources only the matching file, and executes. The jenkins.sh plugin is 74KB. The vault.sh plugin is 60KB. None of that gets loaded until you actually call something from it. Startup stays near-instant regardless of how many plugins exist.
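The validation step isn't shown in the post, so here's a rough sketch of the idea. The function body and error message are my assumptions, not the project's actual code:

```shell
# Hypothetical sketch of the validation in _try_load_plugin.
# Rejects private helpers (leading underscore) and anything that
# isn't a plain shell identifier before dispatching to the loader.
_try_load_plugin_sketch() {
    local func="$1"; shift
    if [[ "$func" == _* || ! "$func" =~ ^[A-Za-z][A-Za-z0-9_]*$ ]]; then
        echo "k3d-manager: unknown command: $func" >&2
        return 127
    fi
    _load_plugin_function "$func" "$@"
}
```

Validating before sourcing matters: the dispatcher takes arbitrary user input as a function name, so anything that isn't a plain identifier should never reach the loader.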

Rule 2: One Wrapper for All Commands

Every external command — kubectl, helm, apt-get, mkdir — goes through _run_command. It handles sudo detection, error reporting, and exit codes consistently:

# _run_command [--soft] [--quiet] [--prefer-sudo|--require-sudo] [--probe '<subcmd>'] -- <prog> [args...]

# Try direct first, elevate if needed
_run_command --prefer-sudo -- mkdir -p "$dir"

# Require sudo or abort
_run_command --require-sudo -- mv "$tmpfile" /etc/rancher/k3s/k3s.yaml

# Try user first; if probe fails, fall back to sudo
_run_command --probe 'config current-context' -- kubectl get nodes

# Don't exit on failure — just return the code
_run_command --soft --quiet --prefer-sudo -- test -r "$kubeconfig_src"

When something fails, you get the exit code and the full command that blew up — no guessing. That alone has saved me hours of debugging.
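For illustration, the core of a wrapper like this fits in a dozen lines. This is my simplified sketch of the pattern, not the real _run_command, which handles more modes:

```shell
# Simplified sketch of the wrapper idea: run the command, optionally
# retry with sudo, and always report the exact command on failure.
_run_command_sketch() {
    local prefer_sudo=0
    if [[ "$1" == "--prefer-sudo" ]]; then prefer_sudo=1; shift; fi
    [[ "$1" == "--" ]] && shift
    "$@" && return 0
    local rc=$?
    if (( prefer_sudo )) && command -v sudo >/dev/null 2>&1; then
        sudo -n "$@" && return 0    # -n: fail rather than prompt
        rc=$?
    fi
    echo "command failed (exit ${rc}): $*" >&2
    return "$rc"
}
```

The `--` separator is doing real work here: it cleanly splits wrapper options from the wrapped command, so the wrapper never misinterprets the target program's own flags.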

Rule 3: Secrets Never Hit the Trace Log

One thing I'm especially happy with: the trace guard. If you run with set -x, bash logs every command — including the ones that contain passwords. So the dispatcher suppresses tracing for any invocation that carries a sensitive flag:

if [[ $- == *x* ]]; then
    set +x
    if _args_have_sensitive_flag "$@"; then
        secret_trace_guard=1
    else
        set -x
    fi
fi

Tracing gets re-enabled after the sensitive operation completes. Your debug logs stay useful without leaking credentials.
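The post doesn't show _args_have_sensitive_flag, but the shape is easy to guess. A sketch, with flag names that are my guesses rather than the project's actual list:

```shell
# Hypothetical sketch: treat the invocation as sensitive if any
# argument looks like it carries a credential.
_args_have_sensitive_flag_sketch() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            --password|--password=*|--token|--token=*|--secret|--secret=*)
                return 0 ;;
        esac
    done
    return 1
}
```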

The Part Nobody Wants to Admit

Look at what those three rules actually are. The plugin loader is runtime polymorphism — call a function name, the dispatcher resolves it to the right implementation at runtime, sources only what's needed. The _run_command wrapper is a unified interface over heterogeneous external commands. The directory service abstraction — where the same deploy_ldap command works whether you're talking to OpenLDAP or Active Directory — is the strategy pattern: one interface, interchangeable implementations selected by configuration.

Nobody calls it that, because it's shell. But the patterns are identical to what Java textbooks call object-oriented design.

OOP isn't a language feature. It's a design discipline. Java gives you class and interface keywords and produces unmaintainable spaghetti all the time. Shell gives you nothing — and produces a clean, modular, extensible system if you're disciplined enough. The language didn't make this maintainable. The choices did.


The "Secret-Free" Architecture

Secret management is where a lot of local dev setups quietly fail. Passwords end up hardcoded in ConfigMaps, .env files get committed, and rotation means hunting down every place a credential was copy-pasted.

I wired in HashiCorp Vault with the External Secrets Operator (ESO) to avoid all of that. The configuration lives in scripts/etc/ and is driven entirely by environment variables:

# scripts/etc/jenkins/vars.sh (excerpt)
export VAULT_PKI_ISSUE_SECRET="${VAULT_PKI_ISSUE_SECRET:-1}"
export VAULT_PKI_SECRET_NS="${VAULT_PKI_SECRET_NS:-istio-system}"
export VAULT_PKI_LEAF_HOST="${VAULT_PKI_LEAF_HOST:-jenkins.dev.local.me}"
export JENKINS_CERT_ROTATOR_SCHEDULE="${JENKINS_CERT_ROTATOR_SCHEDULE:-0 */12 * * *}"
export JENKINS_LDAP_VAULT_PATH="${JENKINS_LDAP_VAULT_PATH:-ldap/openldap-admin}"

What this gives you in practice:

  • No hardcoded passwords — Credentials never touch a ConfigMap. They're injected at runtime into a memory-backed volume via Vault Agent sidecars. When the pod dies, the secret dies with it.
  • Flexible Jenkins Auth — Jenkins can run against built-in auth, a local OpenLDAP instance (great for testing AD integration without touching production), or a real enterprise Active Directory. You just pick the mode that matches your environment. Switching is one flag: --enable-ldap, --enable-ad, or --enable-ad-prod.
  • Automated Rotation — A CronJob handles TLS certificate rotation on a 0 */12 * * * schedule, revokes the old leaf cert in Vault, and restarts affected pods. No manual steps, no stale certs sitting around.
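To make the rotation logic concrete, here's a hedged sketch of the kind of expiry check a rotator performs before deciding to act. The helper name and default window are mine; the actual CronJob logic lives in the repo:

```shell
# Hypothetical helper: succeed (exit 0) if the PEM cert expires
# within the given window and should therefore be rotated.
cert_needs_rotation() {
    local pem="$1" window_secs="${2:-43200}"    # default: 12 hours
    # openssl -checkend exits 0 if the cert is still valid past the
    # window, non-zero if it expires inside it; negate to flip that.
    ! openssl x509 -checkend "$window_secs" -noout -in "$pem"
}
```

A check like this makes the CronJob idempotent: running it more often than necessary costs nothing, because it only issues a new leaf when the old one is actually inside the renewal window.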

The first time I rotated a credential without touching a single config file or redeploying Jenkins, it felt almost too easy.


Real Bugs, Real Fixes

This is the part nobody puts in blog posts. Here are three real incidents from my docs/issues/ folder.

The envsubst Trap

Templates are rendered with envsubst. The problem: envsubst doesn't understand bash default syntax like ${VAULT_PKI_PATH:-pki}. It just passes the literal string through into the YAML, which then breaks the deployment silently.

The fix was boring but necessary — explicitly export every variable with its default before calling envsubst:

# scripts/plugins/jenkins.sh
export VAULT_PKI_PATH="${VAULT_PKI_PATH:-pki}"
export VAULT_PKI_ROLE_TTL="${VAULT_PKI_ROLE_TTL:-}"
export VAULT_NAMESPACE="${VAULT_NAMESPACE:-}"
export VAULT_SKIP_VERIFY="${VAULT_SKIP_VERIFY:-}"
export JENKINS_CERT_ROTATOR_ALT_NAMES="${JENKINS_CERT_ROTATOR_ALT_NAMES:-}"

I caught it by comparing cert serials before and after a rotation. The serial wasn't changing, which meant the rotator was running but the YAML it generated was garbage.

The Pod Readiness Timeout

The original Jenkins readiness check had a 5-minute timeout. Plugin installation on a fresh pod takes longer than that. So deployments were failing — not because something was wrong, but because the wait loop gave up too early.

The fix: increase to 10 minutes, check for pod existence first, and show progress so you know it's not hanging:

Waiting for Jenkins controller pod to be created... (3s elapsed)
Waiting for Jenkins controller pod to be ready... (45s elapsed, status: true false false)
Waiting for Jenkins controller pod to be ready... (120s elapsed, status: true true false)
Waiting for Jenkins controller pod to be ready... (180s elapsed, status: true true true)
Jenkins controller pod is ready.

On timeout, it automatically runs kubectl get pod and kubectl describe pod for diagnostics. No more mystery failures.
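A wait loop with that behaviour can be sketched like this. It's my reconstruction of the pattern, not the project's actual function:

```shell
# Hypothetical sketch of the readiness loop: poll the container
# ready flags, print progress, and dump diagnostics on timeout.
wait_for_pod_ready_sketch() {
    local ns="$1" selector="$2" timeout="${3:-600}" elapsed=0 statuses
    while (( elapsed < timeout )); do
        statuses="$(kubectl -n "$ns" get pod -l "$selector" -o \
            jsonpath='{range .items[*].status.containerStatuses[*]}{.ready}{" "}{end}' \
            2>/dev/null)"
        if [[ -n "$statuses" && "$statuses" != *false* ]]; then
            echo "pod is ready."
            return 0
        fi
        echo "Waiting for pod... (${elapsed}s elapsed, status: ${statuses:-not created})"
        sleep 5; elapsed=$((elapsed + 5))
    done
    # on timeout, surface diagnostics instead of failing silently
    kubectl -n "$ns" get pod -l "$selector"
    kubectl -n "$ns" describe pod -l "$selector"
    return 1
}
```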

The LDAP Password Drift

Password rotation that actually works — not a scheduled ceremony, not a manual handoff, just automated — was something I'd wanted to see done properly for years across multiple enterprise jobs. Every org I'd worked at had the same problem: passwords rotated on a schedule nobody tracked, services broke, and the fix was always someone manually hunting down where the credential was cached.

So I kept pushing: make it automated, make it reliable, no manual steps.

Pushing through that process with Claude is what exposed the failure modes I hadn't anticipated. First, time sync — if the clock drift between Vault and Active Directory exceeds a threshold, the rotation silently fails. Second, the ConfigMap drift: Jenkins reads the LDAP bind password from a ConfigMap rendered at deploy time. When Vault rotates the credential, the ConfigMap goes stale. Jenkins keeps using the old one until the pod is redeployed — which means your "automated rotation" is half-automated at best.

The root cause of the second issue runs deeper: envsubst substitutes ${LDAP_BIND_PASSWORD} at render time, and JCasC bakes the literal value into the config — there's no live lookup. The fix involves JCasC's Kubernetes secret provider syntax, which is finicky enough that it's still an open issue.

But here's what I take from it: I never would have found either of these failure modes by reading documentation. They only appeared because I kept pushing past the first working demo. The issue is documented and version-controlled — nobody has to rediscover it from scratch.

The Brittle Test Suite

At one point the project had 140 BATS tests. 67 of them were passing. That sounds like progress — until you look at what they were actually verifying.

Most of the tests mocked internal call sequences: deploy_jenkins calls _create_jenkins_vault_ldap_reader_role, which calls _wait_for_jenkins_ready, which makes exactly two kubectl calls. When any internal step changed — a function renamed, an extra call added during refactoring — the test broke. Not because the code was wrong. Because the test knew too much about how the implementation worked.

When I did a full test run on ldap-develop, 18 tests were failing. Not one of them represented an actual bug in production code. They'd drifted from the implementation and nobody had updated them — including me.

The harder realisation: the 67 passing tests weren't much better. None of them verified that Jenkins actually deployed. None confirmed that LDAP authentication worked end-to-end or that Vault issued a real certificate. They verified internal call sequences against mocks. That's not confidence — it's noise with a green checkmark.

The fix was blunt: delete 2,500+ lines of brittle mock-heavy tests, keep only the pure logic tests that don't require a live cluster, and replace integration coverage with E2E smoke tests against a real k3s instance. After cleanup: 84 tests, zero failures, all of them testing something observable.
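For contrast, the kind of check that earns its keep asserts observable state instead of call order. A hedged sketch, with the namespace and deployment names assumed:

```shell
# Hypothetical smoke check: Jenkins counts as "deployed" only if at
# least one replica reports ready. No mocks, no internal call order.
smoke_jenkins_deployed() {
    kubectl -n jenkins get deploy jenkins \
        -o jsonpath='{.status.readyReplicas}' | grep -q '^[1-9]'
}
```

A check like this survives any amount of internal refactoring, because the only thing that can break it is the thing you actually care about.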

The commit message was test: replace brittle bats suites with smoke entrypoint. The word "brittle" took a few weeks of watching tests fail for the wrong reasons to earn.


AI as my "Audit-as-Code" Partner

This is the part I'm most proud of, and also the part that gets the most puzzled looks when I describe it.

My audit trail doesn't live in Jira. It lives in my Git repo, right next to the code.

When I'm planning a change or debugging something tricky, I talk it through with an AI agent. Once we've settled on an approach, the AI generates a structured Markdown planning or issue log — the decision, the reasoning, the steps, and the outcome. That file gets committed alongside the code change.

The LDAP password drift issue above? Here's what that log actually looks like in docs/issues/:

## Issue: LDAP Password envsubst Issue
**Date:** 2025-11-21
**Status:** In Progress

### What happened
Jenkins ConfigMap hardcodes LDAP_BIND_PASSWORD at deployment time.
Password changes in Vault don't propagate to Jenkins config.

### Root cause
envsubst substitutes ${LDAP_BIND_PASSWORD} at render time.
JCasC then bakes the literal value into the config — no live lookup.

### Attempted fix
Changed template variable to escape envsubst:
  ${LDAP_BIND_PASSWORD} => $${jenkins-ldap-config:LDAP_BIND_PASSWORD}

### Status
Still failing. Needs investigation into JCasC Kubernetes secret
provider syntax. Next step: test with secretSource.KubernetesSecretSource.

Every decision has a paper trail. Every incident has a write-up. And because it's all in Git, it's searchable, diffable, and version-controlled — which ticks every enterprise audit checkbox I've ever run into, without the overhead of a ticketing system that nobody actually keeps up to date.

Honestly, it's changed how I think about documentation. Instead of writing docs after the fact, they just... happen as part of the process.


The Honest Accounting: What AI Assistance Actually Costs

I want to be straight about something, because I think the "I used AI to build this" framing can be misleading if you don't know what it actually means in practice.

GitHub Copilot was genuinely useful for autocomplete — filling in boilerplate, suggesting the next line when I knew what I wanted but didn't want to type it. That part worked the way the demos show.

Claude was different. When it was good, it was very good — helping me think through the architecture, writing out full functions, catching edge cases I would have missed. But Claude burns tokens fast. On the free tier, I'd hit my weekly quota after about 12 hours of serious work. Then I had to wait a week before I could continue.

That's the real reason this project took three months to complete — not the complexity of the code, but the rhythm of working in 12-hour weekly windows. Plan on Monday, build what I can, hit the limit, come back next week, reload context, continue.

It changed how I approach sessions. Every time I came back, I had to remind Claude of where we left off — what had been decided, what was still open, what the current state of the code was. That's what eventually pushed me toward keeping commit messages and in-repo docs precise enough to function as session handoff notes. The lazy-loading architecture, the docs/issues/ folder, the consistent naming conventions — all of it was shaped partly by the constraint that I'd be dropping and restarting context regularly.

You can see it in the commit history too — commits tagged (WIP) for features that weren't ready to call done: the LDAP password rotation CronJob, the Active Directory provider. Not buried in a branch or quietly abandoned, just marked honestly as incomplete and kept moving. That habit came from the same place: every session had to start from a known state, and "nearly done" is not a known state.

Looking back, the constraints of your tools shape the discipline you develop. Working in weekly 12-hour windows forced a kind of engineering hygiene I might not have bothered with otherwise. Every session had to start cleanly. Every decision had to be recorded somewhere findable.

If you're building something similar, budget for this. The token cost is real. The context-reloading overhead is real. The value is still there — I wouldn't have built this without it — but it's a different kind of "AI-assisted" than the marketing suggests.

Here's the part that surprised me: the tools themselves are accessible. GitHub Copilot, Claude Pro, ChatGPT Plus, GitHub Pro, and an A Cloud Guru (ACG) subscription for real AWS, Azure, and Google Cloud sandbox time — all in, you're looking at roughly $1,000–$1,200 a year. A laptop you already own. That's the full stack to do what I did here.

One note on ACG: some employers provide it as a benefit. I chose to pay for it myself. That's not martyrdom — it's a deliberate choice. Paying out of pocket meant no company requirements, no restrictions on what I could build or publish, no approval process for which cloud provider to test against. $1,000 a year is the price of not having to ask anyone for permission. Given that the whole reason this project exists is that I got tired of waiting for permission, that felt like the right call.

What's not in that bill is what you bring to it. A fresh graduate with the same subscriptions would struggle to reproduce this project — not because of the tools, but because they wouldn't know that password rotation is a real unsolved problem in most enterprises, or what a locked-down cluster environment actually feels like, or why manual certificate rotation always breaks something eventually. That domain knowledge, built up over years in real IT environments, is what the AI amplifies. The $1,000 a year removes the cost barrier on execution. It doesn't supply the judgment about what's worth executing.

Once I thought about it that way, it stopped being "look what AI can do" and became "look how far your existing expertise goes once the execution cost drops to nearly zero."


Final Thoughts

I'm not here to tell you to throw away Terraform or Helm — those tools earn their keep at scale. But there's a whole category of infrastructure work — local dev environments, integration rigs, bootstrapping pipelines — where a well-structured shell project is genuinely the right fit. It's fast, it has zero runtime dependencies, and any engineer on the team can read it without learning a new DSL or installing a plugin ecosystem.

One thing I'd say plainly to anyone thinking about doing this: AI made the coding fast. Finishing the product was still on me. The context reloading, the direction calls, the knowing-when-something-is-wrong-before-you've-wasted-10-hours-on-it — none of that is something an AI session can carry. The human overhead is real, and it doesn't show up in the demos.

The three things that made this work:

  1. Discipline over cleverness — lazy loading, a single command wrapper, consistent naming conventions
  2. Configuration drives behavior — same command, different results based on environment variables; no code changes needed to switch providers
  3. Failures are first-class — issue logs committed alongside the code, not buried in a ticket queue

One thing worth stating directly: k3d-manager is designed to be a self-contained, all-in-one package. One command spins up a full local Kubernetes stack — Vault, Jenkins, LDAP/AD, Istio — all wired together, all configured, automated cert and password rotation included. No cloud account required. No 40-step Confluence page to follow. No shared cluster to fight over.

That fills a gap most teams don't realise they have. They've got either a sprawling Helm umbrella chart nobody fully understands, or a collection of disconnected scripts that only work on one person's machine. A disciplined, documented, self-contained package sits between those two extremes and covers real use cases:

  • Onboarding — new engineers have a working local environment in one command
  • Demo environments — spin up a credible full stack fast, tear it down when done
  • Training labs — teach Vault, Istio, Jenkins integration without needing a shared cluster
  • Bootstrap pipelines — repeatable, scriptable, no manual steps

If your enterprise demands traceability without the overhead, this setup might be worth stealing.

The repo is open at wilddog64/k3d-manager — take what's useful, leave what isn't, and let me know what you'd do differently.
