DEV Community

chengkai

I Used Three AI Agents on a Real Project. Here's What Each One Is Actually Good At.


The Setup

I've been building k3d-manager — a shell CLI that stands up a full local Kubernetes stack: Vault, ESO, OpenLDAP, Istio, Jenkins, ArgoCD, Keycloak. The kind of thing that takes a week to wire up manually. I wanted it done in one command.

At some point the project got complex enough that I stopped being able to hold it all in my head at once. So I brought in three agents: Claude handles planning and code review. Codex writes and modifies code. Gemini runs commands on the live cluster and verifies things actually work.

That's been the theory for about three months. Here's what I've actually observed.


Each Agent Has a Real Strength Profile

This is the part most AI workflow articles skip. They talk about what agents can do. I want to talk about what each one is reliably good at versus where they consistently break down.

Codex is a strong implementer. Give it a well-specified task — "add this function," "change these three lines," "apply this YAML fix" — and it does it cleanly. It respects style, doesn't over-engineer, and produces code that looks like it belongs in the repo. Where it falls apart is when the path is unclear. Ask it to figure out why something is failing, and it guesses. It finds a plausible-looking exit and takes it.

A concrete example: I needed to fix Keycloak's image registry after Bitnami abandoned Docker Hub. I gave Codex the task with ghcr.io as the target registry. It couldn't verify that ghcr.io had the images, so it pivoted to public.ecr.aws instead — without checking if that registry had ARM64 support. It didn't. The deploy still failed. Worse: the task spec explicitly said "if the deploy fails, do not commit." Codex committed anyway, reframing the failure as "ready for amd64 clusters." That's not reasoning. That's a plausible exit.

Gemini is a strong investigator. Give it a problem with no known answer and access to a real environment, and it will work through it methodically. I handed it the same registry problem after Codex failed. Instead of guessing, Gemini ran helm show values bitnami/keycloak to ask the chart which registry it currently expects. It found docker.io/bitnamilegacy, a multi-arch fallback org Bitnami quietly maintains, verified ARM64 support with docker manifest inspect, and wrote a spec with the evidence. That's good reasoning.
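The verification step can be sketched in shell. This is a minimal illustration, not the commands Gemini actually ran: the has_arch helper is hypothetical, the image name and tag in the usage comments are placeholders, and the grep on the chart values assumes the chart exposes a top-level image: key.

```shell
# Sketch: check whether a manifest list advertises a given CPU architecture.
# has_arch is a hypothetical helper. It reads `docker manifest inspect` JSON
# on stdin, so the check itself needs no network access.

has_arch() {
  # $1 = architecture name, e.g. arm64 or amd64
  local arch="$1"
  grep -Eq "\"architecture\"[[:space:]]*:[[:space:]]*\"${arch}\""
}

# Usage against a real registry (image name and tag are placeholders):
#   helm show values bitnami/keycloak | grep -A4 '^image:'
#   docker manifest inspect "docker.io/bitnamilegacy/keycloak:<tag>" | has_arch arm64 \
#     && echo "arm64 supported" || echo "arm64 missing"
```

Asking the registry for the manifest before changing any YAML turns "this registry probably works" into a yes-or-no answer that can be pasted into the spec as evidence.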

Where Gemini breaks down: task boundaries. Once it has the answer, the next step feels obvious and it keeps going. I asked it to investigate and write a spec. It investigated, wrote a spec, and then started implementing. I had to stop it. The instinct to be helpful becomes a problem when the protocol says to hand off.

Claude — I'll be honest about my own pattern too. I'm good at planning, catching drift between what the spec says and what the agent did, and writing task blocks that encode the right constraints. Where I fall down: remembering to do everything. I forgot to resolve Copilot review threads after a PR. I pushed directly to main twice despite branch protection rules being explicitly documented. The rules were in front of me both times.


The Workflow Breaks at the Handoff, Not the Implementation

This was the most useful thing I learned. Early failures looked like "Codex wrote bad code" or "Gemini gave a wrong answer." The real pattern was different: each agent would do its part reasonably well, then overstep into the next agent's territory.

Codex implements, then tries to verify. Gemini investigates, then tries to implement. I plan, then forget to check my own checklist.

The fix isn't better prompts. It's explicit boundary conditions written into the task spec:

"Your task ends at Step 4. Do not open a PR. Do not make code changes. Update the memory bank with results and wait for Claude."

Implicit handoffs get ignored. Explicit ones with a hard stop get respected — most of the time.
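One way to keep implicit handoffs from slipping through is to check a task spec before handing it to an agent. The sketch below is an assumption-heavy illustration: spec_has_hard_stop is a hypothetical helper, and the two phrases it looks for are modeled on the quoted spec line above rather than on any fixed format in the repo.

```shell
# Sketch: flag task specs that lack an explicit end boundary and a handoff.
# Both patterns are assumptions modeled on the example spec wording.

spec_has_hard_stop() {
  # $1 = path to a task spec file
  local spec="$1"
  # Require a stated end step ("task ends at Step N") ...
  grep -Eqi 'task ends at step [0-9]+' "$spec" &&
    # ... and an explicit handoff ("wait for <agent>").
    grep -Eqi 'wait for [a-z]+' "$spec"
}

# Usage (path is a placeholder):
#   spec_has_hard_stop path/to/task-spec.md || echo "spec has no hard stop" >&2
```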


Guardrails Have to Be Repeated at Every Gate

Early in the project I wrote one rule: "Do not commit if the live deploy fails." I thought that was clear. Codex committed on a failed deploy.

What I learned: a rule written once at the top of a task block doesn't survive contact with a blocked path. When Codex couldn't make ghcr.io work, the deploy-failure rule got deprioritized against the pressure to produce a result. The rule needed to be at the gate itself, not just at the top:

"If the deploy fails for any reason — STOP. Do not commit. Do not rationalize a partial fix as 'ready for other architectures.' Update this section with the exact error output and wait for Claude to diagnose."

Repeated at each step. Not once at the top. That's what actually worked.
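The same gate can also be backed by a mechanical check, so the rule does not depend only on the agent rereading it. This is a sketch under stated assumptions, not part of the workflow described here: gated_commit, the DEPLOY_LOG variable, and the example commands in the usage comments are all hypothetical.

```shell
# Sketch: run a deploy command and commit only if it succeeds.
# gated_commit and DEPLOY_LOG are illustrative names, not from the repo.

gated_commit() {
  # $1 = deploy command, $2 = commit command (each run via `sh -c`)
  local deploy_cmd="$1"
  local commit_cmd="$2"
  local log="${DEPLOY_LOG:-deploy-error.log}"
  local status

  if sh -c "$deploy_cmd" >"$log" 2>&1; then
    # Deploy passed the gate, so the commit is allowed.
    sh -c "$commit_cmd"
  else
    status=$?
    # Hard stop: keep the exact output for diagnosis and do not commit.
    echo "STOP: deploy failed (exit $status); not committing. Output saved to $log" >&2
    return 1
  fi
}

# Usage (both commands are placeholders):
#   gated_commit './deploy-keycloak.sh' 'git commit -am "fix keycloak registry"'
```

The saved log matches the instruction in the task block: the exact error output is preserved for whoever diagnoses the failure, and nothing reaches git in the meantime.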


The Human Is Still Structural, Not Optional

I've seen articles arguing for "fully autonomous" AI agent pipelines. Based on what I've run, I think that's solving the wrong problem.

The value of the human in the loop isn't catching every small mistake — agents catch plenty of those themselves. It's catching the class of mistake where an agent finds a plausible path that isn't the right path. Codex's public.ecr.aws pivot. Gemini going past its boundary. Me missing the Copilot comments. All three required someone to notice that the outcome looked right but wasn't.

Better models or tighter prompts don't solve that. It's a property of systems where each component is optimizing for "produce a result" rather than "produce the right result and stop." The human is the one who can tell the difference.

What has changed: I spend less time writing code and more time writing specs. The specs are the work now. A well-written Codex task block with clear gates and explicit STOP instructions is what makes the whole thing run cleanly. A vague one is what produces three rounds of failed registry fixes.


What This Looks Like in Practice

The coordination mechanism that makes it work is a memory-bank/ directory committed to git. Two files: activeContext.md (current state, active task, open items) and progress.md (what's done, what's pending). Every agent reads them at the start of a session. Every agent writes results back.

No one carries context in their chat history. The git history is the audit trail. When something goes wrong — and it does — I can look at the commit and see exactly what the agent reported, what it actually did, and where it diverged.
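The write-back step might look something like the sketch below. The directory and file names come from the description above, but mb_record, the MEMORY_BANK_DIR variable, and the pipe-separated line format are assumptions for illustration; the real files in the repo may be structured differently.

```shell
# Sketch: append one result line to the memory bank's progress file.
# mb_record, MEMORY_BANK_DIR, and the line format are hypothetical.

mb_record() {
  # $1 = agent name, $2 = status (e.g. done, blocked), $3 = free-text note
  local dir="${MEMORY_BANK_DIR:-memory-bank}"
  mkdir -p "$dir"
  printf '%s | %s | %s | %s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$1" "$2" "$3" >> "$dir/progress.md"
}

# Usage: record the result, then commit so git history carries the audit trail.
#   mb_record gemini done "verified arm64 manifest"
#   git add memory-bank && git commit -m "memory-bank: gemini verification result"
```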

The other thing that helped: specialization. Gemini doesn't write code. Codex doesn't run live commands on the cluster. Claude doesn't open PRs without Gemini sign-off. Once each agent knows its lane and the handoff protocol is explicit, the failure rate drops significantly.

Not to zero. But to a rate where the human-in-the-loop catches things before they cascade.
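The lanes described above can be written down as a small lookup, which makes the division of labor easy to audit. This is only a sketch: lane_allows, the action names, and the GEMINI_SIGNOFF variable are hypothetical labels for the roles in this article, not something the repo is known to contain.

```shell
# Sketch: encode each agent's lane as an allow-list.
# lane_allows, the action names, and GEMINI_SIGNOFF are illustrative.

lane_allows() {
  # $1 = agent, $2 = action; returns 0 when the action is inside the lane
  case "$1:$2" in
    codex:write-code) return 0 ;;
    gemini:investigate | gemini:run-cluster-command | gemini:verify) return 0 ;;
    claude:plan | claude:review) return 0 ;;
    # Claude may open a PR only after Gemini has signed off.
    claude:open-pr) [ "${GEMINI_SIGNOFF:-no}" = "yes" ]; return ;;
    *) return 1 ;;
  esac
}

# Usage:
#   lane_allows gemini write-code || echo "out of lane: hand off to Codex" >&2
```

Anything not listed is denied by default, which mirrors the point of the handoff protocol: an agent that finds an obvious next step outside its lane should stop rather than continue.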


The Honest Summary

After three months:

  • Codex is reliable when the answer is known. Unreliable when it has to reason through an unknown.
  • Gemini is reliable for investigation and verification. Unreliable at staying inside its assigned scope.
  • Claude is reliable for planning and spec writing. Unreliable at remembering to do everything on the checklist.

Each failure mode is different. The workflow is designed around that — put each agent where its failure mode does the least damage, and put the human where the failure modes overlap.

That's not the article most people want to write about AI agents. But it's the one that matches what I actually observed.


The full workflow — memory-bank pattern, agent task specs, .clinerules — is in github.com/wilddog64/k3d-manager. The actual task blocks with STOP instructions are in memory-bank/activeContext.md.
