
Yuji Suzuki

My AI Escaped Its Container and Did Everything — Except Review Its Own Code

Previously: The Complete Dev Cycle

In Part 4 of this series, my AI assistant achieved something remarkable. Running inside a secure Docker container, it could now execute the entire development cycle:

Code → Test → Build → Deploy → Commit

I called it the finale. The trilogy was complete. The AI could write code, run tests, build artifacts, deploy to containers, and commit changes — all while keeping secrets safely hidden.

I was wrong. Something was missing.

The Missing Piece

Look at that cycle again. Now think about how a real development team works.

Code → Test → Build → Deploy → Commit → PR → ...

Where's the review?

In any professional team, code doesn't just flow from writing to deployment. Someone reads it. Someone checks for bugs, security issues, architectural problems. Someone asks "did you consider this edge case?"

My AI could do everything — except check its own work.

The Official Plugin

Claude Code has an official /code-review plugin. When I discovered it, I was impressed by its design:

  • Parallel agents: Multiple AI agents analyze code simultaneously from different angles — bug scanning, CLAUDE.md compliance checking
  • Confidence scoring: Each finding gets a score, filtering out noise
  • Verification step: A separate agent re-checks findings to eliminate false positives

This is serious engineering. Not "ask AI to review code" but a structured, multi-stage pipeline designed to produce high-signal results.

I installed it immediately.

And it didn't work.

Why It Couldn't Reach

The official plugin is designed for a standard GitHub workflow. It expects:

  • gh CLI — to fetch PR details from GitHub
  • A GitHub PR — the review target is a pull request
  • A single repository — it operates within one project

My AI Sandbox environment has none of that:

  • No gh CLI (the container has no GitHub authentication)
  • No PR yet (I want review before pushing, not after)
  • Multiple independent repositories in one workspace (API, Web, iOS — each with their own Git history)

The plugin couldn't reach my code. Not because it was poorly designed — it's excellent at what it does. But it was built for a different moment in the development cycle: after you push. I needed something before.

Learning From the Design

I couldn't use the plugin directly, but I could learn from it.

The plugins documentation showed me that Claude Code's custom commands are just Markdown files — structured instructions that become slash commands. The official /code-review demonstrated what a well-designed review pipeline looks like: parallel analysis, scoring, verification.
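To make this concrete: a custom slash command is just a Markdown file placed under .claude/commands/, whose body becomes the instructions the agent follows. The file below is a hypothetical sketch of what a local review command might contain, not the actual plugin or my exact command:

```markdown
---
description: Run a local multi-agent code review before pushing (hypothetical sketch)
---

You are reviewing local changes before they are pushed.

1. Ask the user which project and branch to review.
2. Launch parallel review agents: bug scan, CLAUDE.md compliance.
3. Score each finding 0-100 and discard low-confidence results.
4. Verify the remaining findings and report only confirmed issues.
```

Saving this as .claude/commands/ais-local-review.md would expose it as a /ais-local-review slash command.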

So I did what my AI Sandbox was built for. I asked the AI:

Analyze the code-review plugin and create a custom command that works locally. Allow selecting which project to review. Confirm the target branch with the user. Run the same kind of review, but without GitHub access.

The AI read the official plugin, understood its structure, and produced a local version. No gh dependency. Multi-project support. Git and non-Git modes.

It worked.

From One to Nine

Once the local review command was running, the next thought was obvious.

If I can have a general code reviewer, why not a security reviewer? A performance reviewer? An architecture reviewer?

Each review type needs different expertise. A security review looks for injection vulnerabilities, authentication gaps, and data exposure. A performance review looks for N+1 queries, unnecessary allocations, and missing caching. A general review catches bugs and checks CLAUDE.md compliance.

One command became nine:

Command                         Purpose
ais-local-review                General code review (bugs, CLAUDE.md)
ais-local-security-review       Security vulnerabilities
ais-local-performance-review    Performance bottlenecks
ais-local-architecture-review   Structural concerns
ais-local-test-review           Test quality assessment
ais-local-doc-review            Documentation accuracy
ais-local-prompt-review         AI prompt/command quality
ais-refactor                    Concrete refactoring suggestions
ais-test-gen                    Automated test generation
All nine share the same pipeline architecture inspired by the official plugin:

Parallel Analysis → Scoring → Verification → Report
(4-5 Sonnet agents)  (Haiku)   (Sonnet)

Each specialized command sends parallel agents with different review perspectives. A scoring agent evaluates confidence. A verification agent eliminates false positives. Only high-confidence, verified findings make it to the final report.

The Pipeline in Action

Here's what happens when you run /ais-local-review:

Step 1: Select a project and branch (or files, if no Git)

Step 2: Four Sonnet agents launch in parallel:

  • Agent #1: CLAUDE.md compliance — does the code follow project conventions?
  • Agent #2: Bug scan — obvious logic errors, edge cases
  • Agent #3: History analysis — are we reintroducing a previously fixed bug?
  • Agent #4: Comment check — does the code match its own documentation?

Step 3: A Haiku agent scores every finding (0-100)

Step 4: A Sonnet verification agent re-checks anything scoring 75+

Step 5: Only confirmed, high-confidence issues appear in the report

The result is a focused report. Not a wall of nitpicks — a short list of things that actually matter.
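The five steps above can be sketched in code. This is a minimal, hypothetical Python model of the pipeline with the agent calls stubbed out; in the real commands each stub would be a Sonnet or Haiku agent invocation:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Finding:
    agent: str
    issue: str
    score: int = 0       # confidence 0-100, filled in by the scoring step
    verified: bool = False

# Stubbed analysis agents (the real commands launch Sonnet agents in parallel)
def compliance_agent(diff): return [Finding("compliance", "naming violates CLAUDE.md")]
def bug_agent(diff):        return [Finding("bugs", "possible off-by-one in loop")]
def history_agent(diff):    return [Finding("history", "may reintroduce an old bug")]
def comment_agent(diff):    return []  # nothing found

def score(finding):   # stub for the Haiku scoring agent
    return 80 if finding.agent != "history" else 60

def verify(finding):  # stub for the Sonnet verification agent
    return finding.agent == "bugs"

def run_review(diff, threshold=75):
    agents = [compliance_agent, bug_agent, history_agent, comment_agent]
    # Step 2: run the four analysis agents in parallel
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        findings = [f for result in pool.map(lambda a: a(diff), agents)
                    for f in result]
    # Step 3: score every finding
    for f in findings:
        f.score = score(f)
    # Step 4: re-check only findings at or above the threshold
    candidates = [f for f in findings if f.score >= threshold]
    for f in candidates:
        f.verified = verify(f)
    # Step 5: only confirmed, high-confidence issues reach the report
    return [f for f in candidates if f.verified]

report = run_review("fake diff")
```

With these stubs, the compliance finding scores high but fails verification, the history finding is filtered by the threshold, and only the bug finding survives to the report, which is exactly the funnel shape that keeps the output short.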

Two Reviews, Two Moments

Here's what's interesting: the official plugin and my local commands aren't competing. They serve different moments in the development cycle.

Code → Review → Test → Build → Deploy → Commit → PR → Review
         ↑                                                ↑
    ais-* commands                              Official /code-review
    Before you push                              After you push
    Quality gate                                 Team review
    Local, private                               GitHub, collaborative

The official /code-review is for when your code is ready for team eyes. It posts comments on PRs, suggests changes, integrates with GitHub's collaboration features.

My ais-* commands are for before that moment. While you're still developing. Before you've committed, sometimes before you've even finished writing tests. A private quality gate that catches issues early, when they're cheapest to fix.

The Completed Cycle

Remember the development cycle from Part 4?

Code → Test → Build → Deploy → Commit

Here's what it looks like now:

Code → Review → Test → Build → Deploy → Commit
         ↑
    The missing piece

The AI can write code, review its own work (from multiple perspectives), run tests, build, deploy, and commit. The quality gate that was missing is now in place.

What I Learned

This project started because the official plugin couldn't reach my code. But that limitation led somewhere unexpected.

The official plugin's design — parallel agents, confidence scoring, false positive elimination — was the blueprint. Open source at its best: you read how something works, understand the principles, and adapt them to your environment.

I didn't just get a code reviewer. I got nine specialized review tools, a refactoring assistant, and an automated test generator. All because the official plugin showed me what a well-designed review pipeline looks like, and my AI Sandbox gave me a place to build one that works locally.

The Series So Far

What started as "my AI can see my API keys" has become something larger:

  1. Secrets: Hide sensitive files from AI using Docker volume mounts
  2. Toolbox: AI discovers and uses tools autonomously via SandboxMCP
  3. Host Access: AI breaks out of its container with controlled host OS access
  4. Review (this article): AI reviews its own code, completing the dev cycle

The trilogy became a tetralogy. I'll stop promising it's complete.


The AI Sandbox with DockMCP is open source: GitHub repository

If you've built custom review commands for your AI workflow, I'd love to hear about them in the comments.
