Three AI Agents, One Ansible Role, and a Branch That Had to Be Reset From Scratch
This started with a conversation with a coworker — but the foundation goes back nearly a decade.
In December 2016, working at an AWS shop running Linux, I built the original provision-tomcat Ansible role. I was learning Ansible at the time by converting existing Chef cookbooks — the company had a Chef Tomcat cookbook, and I translated it into Ansible. Along the way I also learned Vagrant and Test Kitchen to properly test and verify that the role actually worked — not just assume it did. That became 103 commits on Bitbucket over the following year. Then it sat there, largely untouched, while I moved on to other things.
That test discipline carried forward. When the Windows work started in 2026, the same approach applied — Vagrant, Test Kitchen, WinRM, a real Windows 11 VM. The tooling evolved but the principle didn't: you don't ship Ansible until you've watched it run against something real.
Fast forward to early 2026. A coworker mentioned that our team was looking to migrate from Puppet to Ansible, and brought up Tomcat on Windows as one of the hard cases. "This is what people will do," he said, describing the candidate upgrade pattern. Then: "Can you make this happen?"
The original role was Linux-only. Windows has its own quirks — symlinks work differently, service management is different, file paths are different. And the zero-downtime requirement was new — the 2016 version had no concept of a candidate upgrade workflow. You can't just stop the service, swap files, and hope.
So I turned to AI to close the gap.
That wasn't a naive decision. I'd been building k3d-manager in parallel — a modular shell CLI for local Kubernetes environments — using the same AI-assisted workflow. That project had already taught me what AI sessions actually do in practice: they degrade. Context drifts. Long sessions start hallucinating. The first time I lost an hour of decisions because a session ran too long, I started committing everything into a docs/ folder just to survive the next restart. By the time this project started, I wasn't figuring out how to work with AI. I already knew what didn't work.
What followed was not a clean "I described the problem and AI delivered the solution" story. It was Gemini writing the Windows-specific Ansible logic, Copilot reviewing PRs, Claude handling architecture decisions, a shared memory system keeping them from losing context between sessions, and GitHub Actions running integration tests on a real Windows 11 VM on my Mac. And one session where Gemini lost context entirely, generated 25+ commits of increasingly useless trial-and-error fixes, and pushed them all upstream before I caught it. The azure-dev branch had to be reset to main and rebuilt from scratch.
This is the honest story of both the thing that got built and the process that built it.
The Problem That Started It All
Upgrading Tomcat in a production enterprise environment sounds simple. Download the new version, stop the service, swap the files, restart. Done.
Except it's never that simple.
The old version might have local config changes nobody documented. The new version might silently fail to start due to a Java compatibility issue. And by the time you realise something went wrong, users are already hitting errors and your phone is ringing.
So let's start with what the role actually does — then we'll get to the agents.
The Symlink Strategy
The foundation of the whole approach is simple: never point your Windows service directly at a versioned directory.
D:/Tomcat/
├── apache-tomcat-9.0.113/ ← previous version
├── apache-tomcat-9.0.120/ ← new version
└── current -> apache-tomcat-9.0.120/ ← service always points here
The Windows service is installed once, pointing at D:/Tomcat/current/. When you upgrade, you:
- Extract the new version into its own versioned directory
- Flip the symlink to point at the new version
- Restart the service
The service config never changes. Multiple versions coexist on disk. And if something goes wrong, rollback is just flipping the symlink back — no reinstall, no re-download.
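The flip itself can be sketched in a few lines of Python — a hypothetical `flip_current_symlink` helper, shown here with POSIX symlink semantics (the actual role drives this through Ansible's Windows modules, where directory symlinks need `mklink /D` or equivalent privileges):

```python
import os

def flip_current_symlink(base_dir: str, new_version: str) -> None:
    """Repoint base_dir/current at new_version without a delete/create gap."""
    target = os.path.join(base_dir, new_version)
    tmp_link = os.path.join(base_dir, "current.tmp")
    current = os.path.join(base_dir, "current")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)           # clear leftovers from an aborted flip
    os.symlink(target, tmp_link)      # build the new link off to the side
    os.replace(tmp_link, current)     # atomic rename: never a moment with no link
```

Rollback is the same call with the previous version's directory name — which is exactly why the strategy makes recovery cheap.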
The role also enforces a version retention policy (tomcat_keep_versions: 10 by default) to prevent disk exhaustion over time without losing the ability to roll back.
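The retention logic amounts to sorting the versioned directories and deleting everything past the newest N. A hypothetical `prune_old_versions` sketch of that idea (the real role implements this in Ansible tasks, and would additionally never delete the directory `current` points at):

```python
import os
import re
import shutil

VERSION_RE = re.compile(r"apache-tomcat-(\d+)\.(\d+)\.(\d+)$")

def prune_old_versions(base_dir: str, keep: int = 10) -> list:
    """Delete all but the `keep` newest versioned Tomcat dirs; return what was removed."""
    versioned = []
    for name in os.listdir(base_dir):
        m = VERSION_RE.match(name)
        if m:
            # Sort numerically, not lexically: (9, 0, 120) > (9, 0, 99)
            versioned.append((tuple(int(p) for p in m.groups()), name))
    versioned.sort(reverse=True)                  # newest version first
    doomed = [name for _, name in versioned[keep:]]
    for name in doomed:
        shutil.rmtree(os.path.join(base_dir, name))
    return doomed
```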
The Candidate Upgrade Workflow
The symlink strategy solves rollback. But it doesn't solve the harder problem: how do you know the new version actually works before you flip the symlink?
That's where the candidate workflow comes in.
When tomcat_candidate_enabled: true, the role runs a parallel Tomcat instance on an alternate port (9080 by default) before touching the production service:
- Install candidate — Extract the new Tomcat version, configure it on port 9080, and install it as a separate Tomcat9CandidateWindows service
- Validate from guest — Run a local health check via win_uri on the candidate port
- Validate from controller — Run a network-level check from the Ansible controller (more on this below)
- Promote or abort — If both checks pass, flip the symlink and restart production. If either fails, clean up the candidate and leave production untouched.
There's also a manual gate: set tomcat_candidate_manual_control: true to leave the candidate running for ops team approval before promoting. Useful when you want a human to sign off before the cutover.
Visual Architecture Overview

The Custom Lookup Plugins — The Real Innovation
Here's the part that took the most thought.
Ansible's win_uri module validates that Tomcat responds on the guest machine. But that doesn't tell you whether it's reachable from the network — which is what actually matters for users and load balancers.
So I wrote two custom Python lookup plugins that run from the Ansible controller:
controller_port.py — TCP port reachability check:
    # lookup_plugins/controller_port.py
    import socket

    from ansible.plugins.lookup import LookupBase

    class LookupModule(LookupBase):
        def run(self, terms, variables=None, **kwargs):
            host = kwargs.get('host')
            port = int(kwargs.get('port', 80))
            timeout = int(kwargs.get('timeout', 5))
            try:
                sock = socket.create_connection((host, port), timeout=timeout)
                sock.close()
                return [{"reachable": True}]
            except (socket.timeout, ConnectionRefusedError, OSError):
                return [{"reachable": False}]
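Stripped of the Ansible plumbing, the check is a thin wrapper around `socket.create_connection`. A standalone sketch you can run anywhere:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and attempts a full TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike
        return False
```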
controller_http.py — Full HTTP response check:
    # lookup_plugins/controller_http.py
    import urllib.request

    from ansible.plugins.lookup import LookupBase

    class LookupModule(LookupBase):
        def run(self, terms, variables=None, **kwargs):
            url = kwargs.get('url')
            timeout = int(kwargs.get('timeout', 10))
            try:
                response = urllib.request.urlopen(url, timeout=timeout)
                return [{"status": response.status, "ok": response.status in [200, 404]}]
            except Exception as e:
                return [{"status": None, "ok": False, "error": str(e)}]
And here's how they're used in the role with retry logic:
    - name: Wait for candidate port from controller
      set_fact:
        tomcat_candidate_controller_port_check: >-
          {{ lookup('controller_port',
                    host=tomcat_candidate_delegate_host,
                    port=tomcat_candidate_delegate_port,
                    timeout=tomcat_candidate_controller_delay) }}
      until: tomcat_candidate_controller_port_check.reachable | bool
      retries: "{{ tomcat_candidate_controller_attempts }}"
      delay: "{{ tomcat_candidate_controller_delay }}"
The guest says it's healthy. The controller confirms it's reachable from the network. Only then does production get touched.
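The until/retries/delay triple is Ansible's spelling of an ordinary polling loop. The same pattern in plain Python, as a hypothetical `wait_for` helper:

```python
import time

def wait_for(check, attempts: int = 5, delay: float = 2.0) -> bool:
    """Call `check()` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:   # no pointless sleep after the last failure
            time.sleep(delay)
    return False
```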
Real Bugs, Real Fixes
This role went through 18 documented incidents. A few worth calling out:
The pywinrm / urllib3 Compatibility Break
urllib3 2.x quietly broke pywinrm. WinRM connections started failing with cryptic SSL errors, not a clear version conflict message. The fix was pinning urllib3<2 in the test dependencies — but finding it took longer than it should have.
Lesson: pin your transitive dependencies in Windows Ansible toolchains. They're less stable than the Linux equivalents.
The Vagrant Bundler Conflict
rbenv and Bundler were polluting each other's environment when running Kitchen tests. Vagrant would pick up the wrong Ruby environment and fail to load plugins. The fix was explicit environment isolation in the test runner scripts.
The Stale VirtualBox Disk Registrations
VirtualBox keeps a registry of disk images. When VMs are force-destroyed (after a failed test), the disk registration stays behind. Next run fails because the UUID is already registered. The bin/vbox-cleanup-disks helper script was born from this — it clears stale registrations before each test run.
Testing on Windows — Harder Than It Looks
Getting Test Kitchen working against Windows VMs is its own project.
The .kitchen.yml handles multiple platforms and drivers:
    transport:
      name: winrm
      elevated: true
      ssl_peer_verification: false
      port: 5985
      winrm_transport: :plaintext
      connection_retries: 5
      connection_retry_sleep: 10
Key lessons:
- Use port 55985 for Kitchen's WinRM forwarding (not the standard 5985) to avoid conflicts with other tools
- elevated: true is required for most Windows service operations
- AWS EC2 Windows instances need NTLM transport, not plaintext — the driver config is different from Vagrant
The Makefile has 60+ targets covering every platform/suite combination. make test-upgrade-win11 runs the full upgrade cycle on a Windows 11 VM. make test-idempotence-win11 verifies the role can run twice without making unintended changes.
GitHub Actions on a Mac — My Own CI Runner
One thing I'm particularly happy with: the CI pipeline runs on a self-hosted GitHub Actions runner on my Mac (ARM64). No cloud runner costs, no queue wait times, and integration tests spin up real Windows VMs via Vagrant locally.
The workflow has two jobs that run in sequence:
    jobs:
      lint:
        runs-on: [self-hosted, macOS, ARM64]
        steps:
          - uses: ./.github/actions/setup
          - run: make check
      integration:
        needs: lint
        runs-on: [self-hosted, macOS, ARM64]
        steps:
          - uses: ./.github/actions/setup
          - run: |
              make test-win11
              make test-upgrade-win11
Lint first, integration only if lint passes. Simple but effective — no point spinning up a Windows VM if ansible-lint is going to fail.
The setup action does the heavy lifting: checks out four dependent roles (provision-java, windows-base, provision-windows-security, and the role itself symlinked into roles/), installs Python and Ruby dependencies, and pins Ruby to 3.3.x explicitly to avoid a dependency spiral that Ruby 4.0 introduced on M2 runners.
Smart path filtering means CI only triggers when actual code changes — not when docs or the memory-bank get updated:
    on:
      push:
        paths:
          - '**.yml'
          - 'tasks/**'
          - 'defaults/**'
          - '!docs/**'
          - '!memory-bank/**'
That alone saves a lot of unnecessary test runs.
AWS and Azure — A Pragmatic Retreat
The workflow has branches for aws-dev and azure-dev, and the CI jobs for cloud integration tests are fully written. They're currently disabled:
    - name: Azure Integration (Deferred)
      if: false  # TODO: Re-enable when ACG TAP model stabilizes
    - name: AWS Integration (Deferred)
      if: false  # TODO: Re-enable when AWS sandbox is active
The honest reason: Microsoft deprecated Service Principals mid-project and forced a migration to Workload Identity Federation. What was working Azure authentication suddenly wasn't. Gemini spent considerable time trying to resolve the new auth model — Workload Identity Federation is more secure but significantly more complex to automate in a self-hosted runner context — and we hit a wall.
Rather than let a platform-level breaking change keep the entire pipeline red indefinitely, we made the pragmatic call: disable the Azure jobs, keep the Vagrant tests green, and revisit cloud integration once Workload Identity support stabilises in the toolchain.
It's a tradeoff, not a failure. The local Mac runner running full Kitchen tests against real Windows 11 VMs catches the same role-level bugs. The cloud tests add network topology and real VM provisioning coverage — valuable, but not worth blocking everything else for.
The workflow_dispatch trigger still lets you manually kick off tests against a specific environment when you need it — useful for on-demand cloud validation without waiting for a push to trigger the full pipeline.
One detail I'm glad I added: mandatory cleanup:
    - name: Mandatory Cleanup
      if: always()
      run: make vagrant-destroy || true
Vagrant VMs get destroyed after every run, pass or fail. Without this, a failed test leaves a running Windows VM consuming memory and disk until someone manually cleans it up. The if: always() ensures cleanup runs even when earlier steps blow up.
How We Got Here — The Real Learning Curve
The multi-agent workflow didn't come from a blog post or a framework tutorial. Phases 1 through 3 were lived through on k3d-manager before this project started — which is why provision-tomcat could begin at Phase 4 instead of repeating the same mistakes from scratch.
Phase 1: Keep the session open
The first approach was simple — just keep the Claude session running as long as possible. It worked, until it didn't. The longer a session ran, the more context degraded. Claude would start hallucinating — confidently suggesting things that contradicted earlier decisions, losing track of what had already been done. At some point the session became more harmful than helpful.
Phase 2: Capture the session token
The workaround was capturing the session token before things went sideways and restarting with it to resume context. Fragile, manual, and only partially effective. The context would partially restore but something was always lost.
Phase 3: Build a docs/ folder
The real breakthrough was realising the session wasn't the right unit of persistence — the repository was. Everything important got written into a docs/ folder and committed alongside the code. Design decisions, incident reports, implementation plans. If Claude forgot, the docs didn't.
This was proto-memory-bank, discovered independently before knowing the term existed. That's where k3d-manager ended up. provision-tomcat started there.
Phase 4: Gemini introduces .clinerules and memory-bank
When Gemini entered the workflow via Cline, it introduced the formal memory-bank structure — a set of Markdown files with a defined protocol for what to read at session start and what to update after each task. It formalised exactly what the docs/ folder approach was already doing instinctively.
Phase 5: Agents don't always follow the protocol
Here's what the framework tutorials don't tell you: agents drift. Claude would sometimes skip updating the memory-bank entirely — even after three session restarts. The protocol existed. The agent just... didn't follow it.
The fix was manual: use Cline to scan docs/ and recent commits, then reconstruct and update the memory-bank from scratch. A human had to notice the drift and intervene.
What this means in practice:
The memory-bank isn't automatic. It's a discipline. The .clinerules file sets the expectation, the agents try to follow it, and a human stays in the loop to catch when they don't. It reduces the problem significantly — but it doesn't eliminate it.
Anyone implementing this workflow should expect to occasionally find a stale activeContext.md and spend 15 minutes manually bringing it back up to date. That's the real cost. It's worth it, but it's not zero.
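Detecting that staleness can be partly mechanised: if anything under the task tree changed after activeContext.md was last touched, the bank is suspect. A hypothetical mtime-based check (the real workflow relies on a human plus a Cline scan; comparing git commit timestamps would be more robust than filesystem mtimes):

```python
import os

def newest_mtime(path: str) -> float:
    """Most recent modification time of any file under `path`."""
    latest = 0.0
    for root, _dirs, files in os.walk(path):
        for name in files:
            latest = max(latest, os.path.getmtime(os.path.join(root, name)))
    return latest

def memory_bank_is_stale(repo_root: str, code_dir: str = "tasks",
                         bank_file: str = "memory-bank/activeContext.md") -> bool:
    """True if code changed after the memory bank was last updated."""
    bank_mtime = os.path.getmtime(os.path.join(repo_root, bank_file))
    return newest_mtime(os.path.join(repo_root, code_dir)) > bank_mtime
```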
A Multi-Agent AI Workflow — With Defined Roles
This is the part that surprises people most when I describe it.
This project isn't built by a single AI assistant. It runs a three-agent collaboration model, each with a distinct role:
| Agent | Role |
|---|---|
| Gemini (via Cline) | Primary coding agent — implements tasks, writes Ansible logic, fixes bugs |
| GitHub Copilot | Code reviewer — reviews PRs, flags architectural drift, suggests improvements |
| Claude Code | Architecture and planning — design decisions, documentation, cross-cutting concerns |
The roles weren't assigned upfront from a plan. They evolved from actual results.
Claude started as the primary coding agent. When it got stuck on a specific GitHub Actions issue longer than was acceptable, Gemini was brought in. Gemini fixed the problem quickly. That result — not a blog post, not a framework recommendation — is what earned Gemini the coding role. The workflow grew from there: if Gemini codes and Claude plans, what reviews the code? Copilot. And if Copilot reviews, should Gemini just accept every suggestion? No — so the "counter-argue" rule got written into .clinerules before it could become a rubber stamp.
That sequence matters. Most AI workflow tutorials assign roles before the project starts. This one assigned them after the project revealed what each agent was actually good at.
The .clinerules file defines how agents interact:
Since this project alternates between using Claude Code, Cline, and Gemini:
- Document Decisions: Record "Why" a decision was made in activeContext.md,
not just "What" was done, to ensure the next agent can take over effectively.
- TODO Tracking: Clearly mark the pending task list in progress.md to prevent
progress loss caused by token exhaustion or session resets.
Copilot gets its own protocol too:
When interacting with Copilot or other review agents, do not prioritize consensus.
Prioritize Architectural Integrity. If a reviewer suggests a change that conflicts
with systemPatterns.md, you MUST offer a counter-argument and explain the reasoning
based on our specific production constraints.
That last rule matters more than it looks. Without it, a coding agent will accept every review suggestion to move fast — even ones that contradict the established architecture. This rule forces a principled rebuttal instead.
In practice, the Copilot workflow looks like this: Gemini pushes a fix, I open a PR, tag @copilot in a comment, and Copilot opens a sub-PR with its suggested changes. On PR #2 alone, Copilot reviewed 23 out of 30 changed files and generated 15 comments — covering everything from WinRM timeout configurations to the memory-bank documentation structure. Each sub-PR gets its own CI run on the Mac runner before anything merges.
PR #2 ← Gemini's work (CI stabilization + Azure infrastructure)
└── copilot/sub-pr-2 ← Copilot's suggested fixes
└── copilot/sub-pr-4 ← Follow-up after @copilot mention
└── copilot/sub-pr-5 ← Critical review of Makefile + ansible.cfg
The agent doesn't merge its own suggestions. I review the sub-PR, decide what to accept, and merge selectively. It's collaborative, not autonomous.
The Memory Bank — Shared Context Across Sessions
Every AI agent has the same problem: context doesn't survive session boundaries. When Gemini finishes a session and Claude picks up the next day, the new session starts cold.
The memory-bank/ directory solves this. It's a set of Markdown files that any agent reads before starting work and updates after completing a task:
memory-bank/
├── projectbrief.md ← What this project is and why it exists
├── systemPatterns.md ← Established architectural patterns (don't break these)
├── techContext.md ← Tech stack, dependencies, constraints
├── activeContext.md ← Current state: what's in progress and WHY decisions were made
├── progress.md ← Pending TODOs — prevents handoff loss
└── task_state.md ← Current task status
The activeContext.md is the most critical file. It captures not just what changed, but why — because the next agent needs the reasoning, not just the diff.
Here's a real excerpt from the azure-dev recovery session:
## The Strategy: Clean-Room Rebuild & Recovery (2026-02-17)
1. Branch Reset: azure-dev is now a clean branch from main.
The unstable history is preserved in azure-dev-stale.
2. Ruby Stabilization: Pinning CI to Ruby 3.3.x to eliminate
the Ruby 4.0 dependency spiral found on the self-hosted runner.
3. Vagrant-Only CI: Azure tests parked with if: false until
the ACG TAP-shift model is stabilized locally.
## Operational Protocols (Anti-Regressive)
- Local-First Mandate: No "push-to-test" on GitHub.
Every change must pass kitchen converge or make check locally first.
- Single-Commit Delivery: Technical fixes committed as atomic units
once verified, keeping branch history clean and auditable.
This isn't just documentation — it's the operating context for the next agent. When a new session starts, reading activeContext.md takes 30 seconds and prevents 30 minutes of re-investigation.
The Hallucination Problem — What Nobody Tells You About Long-Running AI Sessions
Gemini works well in short, focused sessions. But in longer ones, context degrades. It starts losing track of what actually needs to be done and generates plausible-looking activity instead — commits that look meaningful but aren't, fixes that don't address the real problem, pushes that trigger CI for no good reason.
We learned this the hard way. The azure-dev branch accumulated 25+ commits from a single long session — 7 of them were trial-and-error WinRM fixes, each leaving residue, none properly reverted. The branch had to be reset to main and rebuilt from scratch.
The recovery sequence became its own protocol. When a collapse was detected — Gemini generating commits without real progress — the pattern was: stop the session, bring Claude in to audit what Gemini had done, update the memory-bank from scratch, then restart Gemini with clean context. It worked, partially. Gemini would regain coherence for a while, then drift again. Multiple restarts, same pattern. Eventually Gemini was asked directly: what happened and how do you fix it? The diagnosis was reasonable — test locally until it works, then push. The execution made a mess of the branch. Claude fixed it.
One finding came out of all this that isn't in any product documentation: bigger context windows don't solve agent hallucination. Gemini has one of the largest context windows available. It still lost coherence after long runs. The issue isn't how much memory the model has — it's that accumulated ambiguity compounds over a long session. More context just means more confused state fits in memory. The model doesn't get more coherent; it gets more confidently wrong. You learn this by running sessions long enough to watch it happen.
The eventual fix — "local-first, commit once, push verified" — emerged from the agent collaboration itself. Whether it was Claude's original diagnosis or Gemini rediscovering it after a session restart and reading the memory-bank, the important thing is that it got documented, committed to .clinerules, and became a hard rule for all subsequent sessions.
The fix isn't n8n or a smarter orchestration layer — that was actually Gemini's suggestion when asked how to solve its own hallucination problem, which is itself a symptom of the issue. The real guard rails are architectural:
- Shorter sessions with hard stops
- No direct push access to protected branches — everything goes through a PR
- Mandatory memory-bank read at the start of every session
- Human checkpoint after every 3 commits
Attribution gets blurry in a multi-agent workflow. You don't always know which agent originated an idea versus which one picked it up from context and surfaced it at the right moment. That ambiguity is part of the reality — and why the memory-bank exists. Good decisions need to be written down regardless of who made them.
GitHub Actions as the Continuous Loop
The CI pipeline completes the feedback loop. When Gemini pushes a fix, GitHub Actions automatically:
- Runs ansible-lint and syntax checks (make check)
- Spins up a Windows 11 VM on the self-hosted Mac runner
- Runs the full install and upgrade test suites
- Destroys the VM (pass or fail)
- Reports results back to the PR
Gemini doesn't have to wait for a human to run tests. It pushes, CI validates, and the next agent (or the next Gemini session) picks up from a known state.
The progress.md tracks exactly where things stand across sessions:
### Phase 5: Verification & Cleanup
- [x] Run make check locally. (PASSED)
- [ ] Push single clean commit to azure-dev.
- [ ] Verify green CI on azure-dev.
- [ ] Delete stale branches.
No context lost. No re-investigation. The next agent just looks at the checkboxes and picks up where the last one stopped.
AI as my "Audit-as-Code" Partner
Every incident gets a committed Markdown file in docs/issues/. Not a Jira ticket, not a Slack thread that disappears — a structured document living in the repo, searchable alongside the code that caused the problem.
Here's a real example. During make test-upgrade-baseline-win11, two bugs surfaced at once:
## Issue: upgrade-baseline-win11 — 404 Download Failure and C:/D: Drive Mismatch
**Date:** 2026-02-17
### Error 1 — 404 Not Found on Tomcat Download
ERROR: downloading 'https://dlcdn.apache.org/tomcat/tomcat-9/v9.0.113/bin/...'
(404) Not Found.
### Root Cause
tests/playbook-upgrade.yml hardcodes tomcat_version: "9.0.113".
Apache's CDN only hosts the current patch release — 9.0.113 was
superseded by 9.0.115 and removed. defaults/main.yml was already
updated, but the playbook bypasses that default with its own set_fact.
### Error 2 — Wrong Install Drive (C: instead of D:)
playbook-upgrade.yml sets install_drive: "C:" at the vars level,
overriding the role default of "D:". The upgrade-baseline suite in
.kitchen.yml never passes install_drive: "D:" as an extra_var,
so the entire run uses C:.
### Required Fixes
- Fix 1: Update 9.0.113 -> 9.0.115 in playbook-upgrade.yml
- Fix 2: Change playbook default to D:, make drive intent explicit
at the suite level in .kitchen.yml
- Prerequisite: Verify baseline box was built with D: before switching
Two bugs, one test run, full root cause documented. The next agent — Gemini, Claude, or a human — picks up the issue log, knows exactly what broke, why it broke, and what the preferred fix is. No archaeology required.
That's 18 incident documents so far. Each one is a bug that won't need to be rediscovered.
Final Thoughts
Windows + Ansible + Java middleware is a corner of the DevOps world that doesn't get written about much. Most content assumes Linux. Most roles assume you can just restart the service and users won't notice.
In enterprise environments, that's rarely true.
The candidate workflow, the custom lookup plugins, and the symlink strategy together give you a deployment process you can actually trust — one that validates from the network, not just from the guest, before promoting to production.
But what I'm most interested in is the broader pattern this project represents: a multi-agent AI workflow where Gemini codes, Copilot reviews, Claude plans, and GitHub Actions validates — all stitched together by a shared memory system and a self-hosted runner on a MacBook. No cloud compute bills. No manual test runs. No context lost between sessions.
It's not perfect — the Azure sandbox is still stabilising, there are open bugs in the upgrade baseline, and Ruby 4.0 nearly derailed the whole CI pipeline. But it's a real system, built incrementally, with every failure documented and every decision traceable.
That's what production-grade looks like in practice.
The repo is open at wilddog64/provision-tomcat — if you're managing Java middleware on Windows with Ansible, hopefully it saves you some of the pain I went through building it.