Serhii Kravchenko
Why I Make Claude and Gemini Argue: Building an Adversarial Agentic Workflow (Open-Source Skill)

In traditional engineering, you'd never let a developer merge code without a peer review.

So why are we letting AI grade its own homework?

I've been building with Claude Code for 750+ sessions across multiple projects — a content pipeline, a marketing site, a design system, decision frameworks. Somewhere around session 200, I noticed a pattern: Claude is brilliant, but it has consistent blind spots. It favors certain architectures. It misses edge cases in its own prompts. It quietly accepts assumptions that a different perspective would challenge.

So I did something unconventional: I gave Claude a sparring partner.

I built an open-source skill called Brainstorm that runs a structured 3-round adversarial dialogue between Claude Code and Google's Gemini. Not a simple "ask two models the same question" approach — a real debate where each model challenges the other's reasoning, and they converge on a single actionable recommendation.

Here's the repo: Claude Starter Kit — the Brainstorm skill is included alongside memory, hooks, and three other Claude Code skills. MIT license, works out of the box.

Let me show you what happened when I put this to work.


The Problem: Single-Model Agentic Workflows Hit a Ceiling

If you use Claude Code daily, you know how productive it is. It reads your codebase, makes changes, runs tests, iterates. For straightforward tasks, it's incredible.

But for decisions — architecture choices, design system approaches, prompt engineering, evaluation criteria — a single model creates a feedback loop. Claude designs a solution, Claude evaluates it, Claude declares it good. There's no external challenge.

You might ask: why not just prompt Claude to red-team its own output? Or use a cheaper Claude model to draft and a smarter one to review? We tried both. The problem is fundamental — same provider, same training corpus, same architectural biases. Claude challenging Claude is like asking someone to proofread their own essay. They'll catch typos but miss the structural issues. A different model family with different training data and different instincts is what breaks the loop.

I hit this wall three separate times before I built a systematic fix:

Wall 1: Design system architecture. Claude recommended using Google's Stitch tool with post-processing to fix design token adherence. Sounded reasonable. Spent two days implementing it. Token adherence: 35%. The approach was fundamentally flawed, and Claude couldn't see it because it designed it.

Wall 2: Content pipeline prompts. Claude wrote evaluation prompts for our 7-stage content pipeline. The prompts looked great — well-structured, detailed, comprehensive. But when we actually measured output quality, the scores were mediocre. The prompts had loopholes that Claude couldn't identify in its own work.

Wall 3: Quality metrics. Claude designed metrics to evaluate content quality, then evaluated content using those same metrics. Circular validation. The scores looked good on paper but didn't reflect real quality improvements.

Every one of these failures had the same root cause: no adversarial pressure. The model was reviewing its own work with its own biases.


Enter the Brainstorm Skill: Claude vs Gemini in the Terminal

The Brainstorm skill runs a 3-round structured debate between Claude and Gemini. It's not random back-and-forth — each round has a specific purpose:

Round 1 — Diverge. Both models propose different approaches to the problem. Claude brings codebase context (it can read your files). Gemini brings a fresh perspective from a completely different model family with different training biases.

Round 2 — Deepen. Each model challenges the other's proposal. "What happens when input is empty?" "What about the edge case where the user has 12 languages?" "Your approach assumes X, but what if Y?" This is where the real value emerges — the challenges neither model would generate reviewing its own work.

Round 3 — Converge. After two rounds of productive conflict, the models synthesize a single recommendation with clear reasoning. You get one actionable path forward, not two competing opinions.

How Gemini gets context: Claude orchestrates the entire flow. It reads your local files, summarizes the relevant context, and passes it to Gemini via the Google GenAI API along with the debate prompt. Gemini never touches your filesystem directly — Claude acts as the bridge, deciding what context is relevant to share. This means your code stays local while Gemini gets exactly the context it needs to give meaningful critique.
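The bridge described above can be sketched roughly like this. This is a simplified illustration, not the skill's actual source: the helper names (`summarize_context`, `build_debate_prompt`), the prompt wording, and the model name are assumptions.

```python
# Hypothetical sketch of the Claude-to-Gemini bridge: the orchestrator
# condenses local files into a context block and forwards it with the
# debate prompt. Gemini never reads the filesystem itself.

def summarize_context(files: dict[str, str], max_chars: int = 2000) -> str:
    """Condense relevant files into a short context block for Gemini."""
    parts = []
    budget = max_chars // max(len(files), 1)
    for path, text in files.items():
        parts.append(f"## {path}\n{text[:budget]}")
    return "\n\n".join(parts)

def build_debate_prompt(question: str, context: str, round_no: int) -> str:
    """Frame one debate round around the summarized context."""
    return (
        f"Round {round_no} of a structured debate.\n"
        f"Question: {question}\n"
        f"Context (summarized by the orchestrator):\n{context}\n"
        "Challenge the other model's reasoning and propose your own approach."
    )

# The actual call would go through the google-genai client, e.g.:
#   from google import genai
#   client = genai.Client()  # reads GOOGLE_API_KEY from the environment
#   reply = client.models.generate_content(
#       model="gemini-2.5-pro",
#       contents=build_debate_prompt(question, context, 1),
#   ).text
```

Keeping the summarization step explicit is what lets the code stay local: only the condensed context crosses the API boundary.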

The architecture is deliberate. Gemini uses a two-layer approach: Flash-Lite with Google Search grounding gathers real-world facts first (the "ground truth" phase), then Pro reasons on verified data. A mandatory fact-check phase at the end catches any claims that slipped through. A typical brainstorm takes about 40-60 seconds and costs roughly $0.02-0.05 in API calls — overkill for fixing a typo, but invaluable for architecture decisions.
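The two-layer flow plus the mandatory fact-check can be expressed as a three-stage chain. The sketch below injects a `call_model` callable so the control flow is visible without live API access; the model identifiers and phase prompts are illustrative, not the skill's real configuration.

```python
# Illustrative two-layer pipeline: a cheap grounded pass gathers facts,
# a stronger model reasons over them, and a final pass fact-checks the
# answer. call_model is injected so the structure is testable offline.

def two_layer_brainstorm(question: str, call_model) -> dict:
    facts = call_model(
        model="gemini-flash-lite",   # grounding pass (with search)
        prompt=f"Gather verifiable facts relevant to: {question}",
    )
    answer = call_model(
        model="gemini-pro",          # reasoning pass on verified data
        prompt=f"Facts:\n{facts}\n\nReason carefully and answer: {question}",
    )
    verdict = call_model(
        model="gemini-flash-lite",   # mandatory fact-check pass
        prompt=f"Fact-check this answer against the facts above:\n{answer}",
    )
    return {"facts": facts, "answer": answer, "fact_check": verdict}
```

In practice, `call_model` would wrap `client.models.generate_content` from the google-genai SDK, with the grounding pass given the search tool.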

```bash
# Install (one command)
git clone https://github.com/awrshift/claude-starter-kit.git my-project
cd my-project && claude

# Or add just the skill to an existing project
git clone https://github.com/awrshift/skill-brainstorm.git .claude/skills/brainstorm
```

Then just say "brainstorm" in Claude Code, describe your problem, and watch the debate unfold.


Case Study 1: How an AI Code Review Loop Replaced Our Designer

This is the one that convinced me adversarial agentic workflows are the future of development.

We were building a marketing site called Avoid Content. The question: how should Claude generate UI components that precisely follow our design tokens (colors, typography, spacing)?

Claude's position (Round 1): Use Google Stitch to generate screens, then post-process the output to replace colors and fonts with our design tokens. Reasonable — Stitch is a powerful UI generation tool.

Gemini's challenge (Round 2): "Post-processing is fragile. What happens when Stitch generates a gradient that mixes two non-token colors? What about hover states? You'll spend more time fixing edge cases than you save." Gemini argued for generating code directly from design tokens — skip Stitch entirely, use a frontend-design approach where tokens are injected into the generation prompt.

The convergence (Round 3): Test both approaches, measure token adherence.

Results:

  • Stitch + post-processing: 35% token adherence (18 color references in benchmark components, only 6 matched)
  • Direct generation from tokens: 100% token adherence on benchmark components (18/18 exact hex matches — the strict token schema forced compliance)

That brainstorm literally replaced our entire design workflow. And it went further — we built a visual QA loop on top of it:

  1. Claude generates a component using design tokens
  2. Playwright takes a screenshot
  3. Gemini reviews the screenshot visually against a reference design
  4. Claude fixes issues Gemini identified
  5. Repeat (max 2 iterations)
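The five steps above reduce to a short loop. In this sketch the `generate`, `screenshot`, and `review` steps are injected callables so the control flow is clear; in a real setup `screenshot` would wrap Playwright's `page.screenshot()` and `review` would send the image to Gemini. The function names are assumptions, not the kit's API.

```python
# Sketch of the visual QA loop. generate/screenshot/review are injected;
# the loop is capped at max_iters just like the workflow described above.

def visual_qa_loop(spec: str, generate, screenshot, review, max_iters: int = 2):
    component = generate(spec)                    # 1. Claude generates
    history = []
    for _ in range(max_iters):
        image = screenshot(component)             # 2. Playwright screenshot
        issues = review(image)                    # 3. Gemini visual review
        history.append(issues)
        if not issues:                            # no issues: models agree
            break
        component = generate(spec, fixes=issues)  # 4. Claude fixes issues
    return component, history                     # 5. bounded iteration
```

The cap matters: without `max_iters`, two models that never fully agree would loop forever on subjective polish.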

Typography scores went from 5/10 to 8/10 in a single iteration. Spacing, visual hierarchy, overall polish — all improved measurably because a different model family was doing the AI code review.

We effectively replaced manual design reviews with an automated Claude + Gemini loop. Not "AI-assisted design" — AI-driven design with AI-driven quality assurance. The entire Avoid Content site was built this way: Claude prototyping, Gemini reviewing screenshots, iterating until both models agreed the output was solid.


Case Study 2: Gemini Gates in a 7-Stage Content Pipeline

Our content generation platform runs articles through seven stages: Strategy, Outline, Research, Generate, Verify, Optimize, Finalize. Each stage has specific quality metrics.

The breakthrough wasn't using Gemini to generate content. It was using Gemini as a gate — a checkpoint that must approve output before it moves to the next stage.

At the prompt design phase for Stage 4 (article generation), we ran the prompt through Gemini for stress-testing:

"Identify how an LLM could misinterpret this prompt. Find loopholes, missing constraints, ambiguous rules."

Gemini found three critical loopholes Claude missed:

  • The prompt said "avoid repetitive sentence starters" but didn't define what counts as repetitive (per-section vs. article-level)
  • Temperature 1.0 + negative instructions ("don't use X") triggered the pink elephant effect — the model used X more, not less
  • Word count targets lacked adaptive coefficients, causing +25% overshoot
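The third loophole has a simple mechanical fix: scale the requested word count by the model's observed overshoot ratio. A minimal sketch of the idea; the function and its formula are illustrative, not the pipeline's actual coefficient logic.

```python
def adaptive_target(desired_words: int, observed_outputs: list[int],
                    requested_targets: list[int]) -> int:
    """Scale the requested word count by historical overshoot.

    If the model typically produces 25% more words than asked for,
    ask for proportionally fewer so output lands near the real target.
    Illustrative only, not the pipeline's exact formula.
    """
    if not observed_outputs:
        return desired_words
    ratio = sum(observed_outputs) / sum(requested_targets)  # e.g. 1.25
    return round(desired_words / ratio)

# Model was asked for 1000 words and produced 1250 (+25% overshoot):
# adaptive_target(1000, [1250], [1000]) -> 800
```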

After fixing these based on Gemini's critique, article quality jumped measurably. And here's the finding we've now confirmed three separate times:

Prompt quality > model quality. A basic prompt on Gemini Pro performed identically to Flash (50% quality score). The same models with a stress-tested prompt hit 75-80%. The bottleneck was never the model — it was the prompt.

This is the single most important lesson from running two AI models together. You don't need a more expensive model. You need a different model family to find the holes in your prompts.

The same pattern held for our AWRSHIFT decision framework — a structured tool for non-trivial choices (which architecture? build vs. buy?). Through brainstorm sessions, Gemini pushed back on Claude's overcomplicated 5-mode design and the system converged on a single adaptive flow. Simpler for users, more flexible for the system. That framework is also open-source: skill-awrshift.


A Mini Claude Code Tutorial: The Technical Setup

Getting this running takes about five minutes. Here's what you need:

Prerequisites: Claude Code installed, a Google API key for Gemini (exposed as GOOGLE_API_KEY), and Python with pip available for the google-genai package.

Option 1: Full Starter Kit (recommended)

```bash
git clone https://github.com/awrshift/claude-starter-kit.git my-project
cd my-project
claude
```

Claude runs the setup automatically — asks your name, project description, language preference, and configures everything. You get four Claude Code skills out of the box:

| Skill | What it does |
| --- | --- |
| Brainstorm | 3-round Claude x Gemini adversarial dialogue |
| Gemini | Quick second opinions, prompt stress-tests, visual reviews |
| AWRSHIFT | Structured decision framework with Gemini gates |
| Skill Creator | Build and test your own custom skills |

Plus a persistent memory system, session hooks, multi-project journals, and experiments tracking.

Option 2: Just the Brainstorm skill

```bash
# Add to existing Claude Code project
git clone https://github.com/awrshift/skill-brainstorm.git .claude/skills/brainstorm

# Set up Gemini
echo "GOOGLE_API_KEY=your-key-here" >> .env
pip install google-genai
```

Then in Claude Code, just say "brainstorm [your question]" and it works.

Option 3: Gemini skill only (for quick second opinions)

```bash
git clone https://github.com/awrshift/skill-gemini.git .claude/skills/gemini
```

Use it for one-off checks: "ask Gemini if this architecture makes sense" or "get a second opinion on this prompt."


When to Use Brainstorm vs. Second Opinion

Not every decision needs a 3-round debate. Here's how we think about it after 750+ sessions:

| Situation | Tool | Why |
| --- | --- | --- |
| One clear path, need validation | gemini second-opinion | Quick (~5s), single-round, catches obvious issues |
| Multiple viable approaches | brainstorm | Full 3-round debate (~45s), converges on one answer |
| Prompt stress-testing | gemini second-opinion | Find loopholes before deploying prompts |
| Architecture decisions | brainstorm | Different model = different design instincts |
| Visual design review | gemini --image | Multimodal review of screenshots |
| Fact verification | gemini second-opinion | Cross-model validation of claims |

The rule is simple: if there's one path and you need a sanity check, use Gemini directly. If there are multiple paths and you need to converge, use Brainstorm.


What We Learned: Five Rules for Multi-Model Agentic Workflows

After building production systems with Claude + Gemini, these patterns held up consistently:

1. Prompt quality always beats model upgrades.
We proved this three times across different domains. A well-crafted prompt on a cheaper model outperforms a lazy prompt on an expensive one. Use Gemini to stress-test your prompts before optimizing your model tier.

2. Gemini's critique is an input, never the decision.
44% of Gemini's insights were genuinely unique — things Claude would never catch on its own. But Gemini lacks your codebase context, your prior decisions, your constraints. Always evaluate its recommendations critically before acting.

3. Different model families catch different blind spots.
This is the whole thesis. Claude and Gemini have different training data, different architectures, different biases. When they disagree, that's where the most valuable insights hide. When they agree, you can be more confident the answer is solid.

4. Fact-check after every brainstorm.
We found that 2 out of 6 brainstorm decisions were invalidated when we checked the claims against live web data. The mandatory fact-check phase (built into Brainstorm v2.1) catches these before they become expensive mistakes.

5. The visual QA loop is underrated.
Most developers only use text-to-text generation. Adding Playwright screenshots + Gemini visual review creates an AI code review feedback loop that catches UI issues no text-based review ever could. Typography, spacing, color contrast — Gemini sees what Claude can only describe.


Exploring the Claude Code Skills Ecosystem

If you're new to Claude Code skills, they're essentially reusable capabilities you add to your agent. A skill is a folder with a SKILL.md file that tells Claude when and how to use it.
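A minimal SKILL.md might look like the following. This is a hypothetical example following the common frontmatter convention of a name plus a description that tells Claude when to trigger the skill; check the Skill Creator in the kit for the exact schema it expects.

```markdown
---
name: brainstorm
description: Run a 3-round adversarial debate between Claude and Gemini
  when the user asks to "brainstorm" a decision with multiple viable paths.
---

# Brainstorm

When invoked, read the relevant project files, summarize the context,
and run the Diverge / Deepen / Converge rounds described below.
```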

The ecosystem is growing fast — skills can be shared through the Claude Code plugins marketplace, and the Starter Kit includes a Skill Creator that lets you build your own skills and test them with an eval framework.

Some ideas for custom skills you could build:

  • A skill that runs your test suite and interprets failures
  • A skill that checks your PR against your team's code style guide
  • A skill that queries your production logs when debugging

Check the Claude skills on GitHub for more examples and inspiration.


Try It Yourself

Everything mentioned in this article is open-source and free:

  • Claude Starter Kit (claude-starter-kit): memory, hooks, and all four skills
  • Brainstorm skill (skill-brainstorm)
  • Gemini skill (skill-gemini)
  • AWRSHIFT decision framework (skill-awrshift)

The setup takes five minutes. The first brainstorm session will probably change how you think about AI pair programming.

Because the real power isn't in having a smarter AI. It's in having two AIs that think differently — and making them argue until the best answer wins.


Built by Serhii Kravchenko — based on 750+ sessions building AI content pipelines, multi-agent systems, and design automation with Claude Code.

We're working toward getting Brainstorm into the official Claude Code plugins directory. If this workflow saved you time, a star on the Starter Kit repo helps us get there faster.
