chengkai

Posted on Feb 27 • Edited on Mar 12

I Gave Gemini One Job: Prove It Actually Ran the Test

#gemini #kubernetes #devops #ai

This is a submission for the Built with Google Gemini: Writing Challenge

What I Built with Google Gemini

I've been building a tool called k3d-manager — a shell CLI that spins up a full local Kubernetes stack (Vault, Jenkins, LDAP, Istio) from a single command. It started as a macOS project, grew to support Linux too, and at some point became too complex to develop alone. Not because the code was hard. Because validating changes across two machines, multiple providers, and a pile of moving parts was eating all the time that should have gone into building.

So I set up a three-agent workflow on my M4 MacBook Air. Codex writes and modifies code. Gemini validates changes against the local k3d cluster running on the same machine. Claude audits the work, catches drift, and keeps the notes. A second Mac — an M2 MacBook Air — acts as the self-hosted GitHub Actions runner, only involved at the final CI stage when a PR is ready to merge.

Gemini's specific role: run the test suite against the local cluster and report back with real terminal output. Not a summary. Not a paraphrase. The actual output, so I can see what ran, where it ran, and whether it passed.

That's the job. Simple in theory. Harder to enforce than it sounds.

Demo

The full workflow — memory-bank pattern, agent protocols, .clinerules — lives in the repo:

github.com/wilddog64/k3d-manager (ldap-develop branch)

The agent instructions are in memory-bank/ and .clinerules. The test suite Gemini validates is in scripts/lib/test.sh. If you want to see what "proof of execution required" looks like as a written protocol, that's where to look.

Note: active development is on the ldap-develop branch — not yet merged to main.

What I Learned

Gemini lied to me. Not maliciously — it just did what language models do when instructions leave room for interpretation.

I asked Gemini to validate three test commands against the local cluster: test_vault, test_eso, test_istio. It came back with a tidy update. Tests passed. All green. Moving on.

Except there was no terminal output. No hostname. No timestamps. Just a confident summary.

I pushed back and asked for the actual output. By then I'd already lost confidence in the result. Had it run the tests at all? On the right machine? I genuinely couldn't tell.

That's not a Gemini-specific failure. It's a property of any agent that can write text: given an ambiguous instruction, it will produce a plausible response. "Validate these tests" is ambiguous. "Run these commands and paste the raw terminal output including the hostname" is not.

The fix was a protocol, not a prompt. I rewrote the instructions with one hard rule: every validation session requires literal terminal output including the hostname of the machine it ran on. No summaries. No paraphrasing. If any of that is missing, the validation doesn't count and the work gets done again.
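As a rough sketch of what "literal terminal output including the hostname" can look like in practice, a small wrapper can stamp every run with its own evidence. The function name `run_with_proof` and the `===` line format are assumptions made for illustration; they are not taken from .clinerules or scripts/lib/test.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and emit a transcript that carries
# its own evidence (host, UTC time, exact command, raw output, exit code).
run_with_proof() {
  local status=0
  # uname -n prints the node name, typically the same value as `hostname`.
  printf '=== host: %s\n' "$(uname -n)"
  printf '=== utc:  %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf '=== cmd:  %s\n' "$*"
  # Merge stderr into stdout so nothing is dropped from the transcript.
  "$@" 2>&1 || status=$?
  printf '=== exit: %d\n' "$status"
  return "$status"
}

# Stand-in command; in the workflow described here it would be one of
# the test functions (test_vault, test_eso, test_istio).
run_with_proof echo "hello from the cluster host"
```

The point of the design is that the proof is produced by the shell, not by the agent: a summary cannot contain the host, timestamp, and exit lines unless the command actually ran.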

Once that was in place, Gemini caught three real bugs that had slipped through code review:

- test_eso was referencing a deprecated API version (v1beta1 instead of v1), which only surfaced on a real cluster
- test_eso had a jsonpath expression wrapped in single quotes, which prevented shell variable expansion, silently querying the wrong key on every run
- test_istio had a hardcoded namespace instead of using the parameterized variable, so cleanup broke when the namespace changed
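The quoting bug is easy to reproduce without a cluster. The snippet below is a minimal illustration, not the actual test code; the variable name `secret_key` and its value are made up for the example.

```shell
#!/usr/bin/env bash
# Hypothetical variable; the real test may use a different name.
secret_key="password"

# Single quotes: the shell passes the text through untouched, so the
# query asks for a key literally named "${secret_key}".
broken='{.data.${secret_key}}'

# Double quotes: the shell substitutes the variable before the
# expression is handed to any command.
fixed="{.data.${secret_key}}"

printf 'broken: %s\n' "$broken"   # broken: {.data.${secret_key}}
printf 'fixed:  %s\n' "$fixed"    # fixed:  {.data.password}
```

Passed to something like `kubectl get secret <name> -o jsonpath="$fixed"`, the second form queries the intended key. The first form never names it, and nothing in a passing summary would reveal that.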

All three fixed. All three re-validated green on the local cluster. None of that happened because Gemini is clever. It happened because it ran real commands on real hardware and I required proof.

The broader lesson: In December 2024, the Financial Times reported that an internal Amazon AI coding agent decided to "delete and recreate" a customer-facing system, resulting in a 13-hour outage. Amazon attributed it to misconfigured access controls, not AI autonomy, and the full picture is still unclear. Regardless of where the fault actually sat, the pattern is a useful reference point: an automated system took a destructive action, the validation in place didn't catch it, and the damage was done before anyone intervened. That pattern isn't unique to any one company; it's the thing everyone building with AI agents needs to think about. My incident cost me an afternoon. The stakes scale up fast.

The hostname requirement is now in .clinerules — the shared rules file all agents read at the start of every session. One failure turned into a permanent guardrail.
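A guardrail like this can also be checked mechanically. The sketch below assumes a report format with a line of the form `host: <name>`; both that format and the function name `report_has_host` are hypothetical and not taken from the repo.

```shell
#!/usr/bin/env bash
# Hypothetical gate: a validation report counts only if it contains a
# line naming the expected host. Assumes a "host: <name>" line format.
report_has_host() {
  local report_file="$1" expected_host="$2"
  # -F: fixed string, -x: match the whole line, -q: no output.
  grep -Fxq -- "host: ${expected_host}" "$report_file"
}

# Usage sketch with a throwaway report file.
report="$(mktemp)"
printf 'host: %s\nall tests passed\n' "$(uname -n)" > "$report"
if report_has_host "$report" "$(uname -n)"; then
  echo "accepted"
else
  echo "rejected: no hostname proof, rerun the validation"
fi
rm -f "$report"
```

A check this small will not prove the tests ran correctly, but it does turn "the validation doesn't count" from a judgment call into a yes-or-no answer.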

Google Gemini Feedback

What worked well: Gemini is genuinely good at the thing I needed it for — executing commands on real machines and reporting what actually happened. When the protocol is tight, the output is reliable. It navigates a Linux environment, reads logs, runs test scripts, and produces honest results. The key word is "when." The tool is only as reliable as the protocol around it.

Where it struggled: Loose instructions. If you say "check if this works," you'll get a confident answer that may or may not be grounded in anything real. This isn't a criticism unique to Gemini — it's how language models work. But it's worth saying plainly because most articles about AI agents skip this part. The agent will fill the gap between what you asked and what you meant with something plausible. Close the gap in the instructions, not after the fact.

What helped most: Structured context over long conversations. I keep a set of markdown files committed to git — a memory-bank — that each agent reads at the start of a session. Current state, open issues, what's been validated, what hasn't. Gemini picks that up quickly and stays on track across separate sessions. Without it, there's drift. With it, there's continuity. That pattern turned out to be more valuable than any single Gemini capability.
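For a sense of how little machinery the memory-bank pattern needs, here is one possible way to hand the files to an agent at the start of a session. The function name `print_memory_bank` and the header format are assumptions for illustration; the post only says that the markdown files live in memory-bank/ and are read at session start.

```shell
#!/usr/bin/env bash
# Hypothetical session bootstrap: print every markdown file in a
# memory-bank directory, each under a header naming the file.
print_memory_bank() {
  local dir="${1:-memory-bank}" file
  for file in "$dir"/*.md; do
    [ -e "$file" ] || continue   # the glob matched nothing
    printf '##### %s\n' "$file"
    cat "$file"
    printf '\n'
  done
}
```

Because the files are committed to git, the same snapshot of state is available to every agent, and changes to it show up in review like any other diff.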

The honest summary: Gemini didn't change how I work. The failure did. I built a protocol that made Gemini useful, and now the protocol does most of the work. That's probably how it's supposed to go.
