I've been experimenting with something that feels slightly unhinged: using different AI models at different stages of building a feature.
Not because I'm indecisive. Because each model has a different superpower.
GPT-5.2 is great at structured documentation and architectural thinking. Claude Opus 4.6 is terrifyingly good at catching edge cases and writing precise code. So why would I force one
model to do everything when I could use them like specialized tools?
This is the story of building a tiny feature called printraw - and how a five-stage, multi-model workflow caught bugs that a single-model approach would have missed entirely.
The Feature: Stop Making Me Fight the Terminal
Here's the problem: Aye Chat renders AI responses in pretty Rich panels with Markdown formatting and box-drawing characters. Looks great. Feels polished.
But try to copy that text and paste it somewhere else.
You get a mess of line breaks, box characters, and formatting artifacts that make you want to throw your laptop into the sea.
The fix seemed simple: add a printraw command that reprints the last response as plain, copy-friendly text. No panels. No Rich formatting. Just raw text wrapped in delimiters you can
select and copy.
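The core idea can be sketched in a few lines. This is an illustrative stand-in, not Aye Chat's actual code; the function name and delimiter style are assumptions:

```python
# Minimal sketch of the printraw idea: reprint text with no Rich formatting,
# wrapped in delimiters that are easy to select and copy.
# Names here are illustrative, not Aye Chat's real API.
def print_raw(text: str) -> None:
    """Print text as-is between plain delimiter lines."""
    delim = "-" * 40
    print(delim)
    print(text)
    print(delim)

print_raw("Here is the answer, unstyled and copyable.")
```

The built-in `print()` is the whole trick: no panels, no markup interpretation, just bytes to stdout.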
The feature itself? Trivial. The workflow I used to build it? That's what got interesting.
The Five-Stage Pipeline
Here's what I ended up doing:
| Stage | Task | Model |
|---|---|---|
| 1 | Write the plan | GPT-5.2 |
| 2 | Validate the plan | Claude Opus 4.6 |
| 3 | Implement | Claude Opus 4.6 |
| 4 | Write tests | GPT-5.2 + Claude Opus 4.6 (alternating) |
| 5 | Fix until green | GPT-5.2 + Claude Opus 4.6 (alternating) |
This isn't "use one model for everything." It's a staged pipeline where model selection is intentional - like choosing a screwdriver vs a hammer based on what you're actually fastening.
Stage 1: Planning with GPT-5.2
I started by describing the UX problem to GPT-5.2 and asking for a complete implementation plan.
GPT-5.2 produced a thorough document covering:
- Command syntax and output format
- Where to capture the last response text
- Two architecture options (store in REPL vs. store in presenter)
- Which files to modify
- Testing approach
- Edge cases
Why GPT-5.2 for planning? It's genuinely good at organized technical writing. It thinks through tradeoffs without needing to see every line of code. The output was clean, structured,
and gave me something concrete to react to.
Stage 2: Validation with Claude Opus 4.6
Here's where it gets interesting.
I handed the plan to Claude 4.6 with a simple prompt: "Review and validate this plan. Let me know if you'd recommend any adjustments."
Claude came back with seven specific recommendations, prioritized by impact:
| # | Recommendation | Priority |
|---|---|---|
| 1 | Add `raw` as a short alias | High - usability |
| 2 | Use plain `print()`, not Rich `console.print()` | High - correctness |
| 3 | Shorten delimiter lines | Low - taste |
| 4 | Clarify: summary-only output, not file changes | Medium |
| 5 | Treat whitespace-only summary as empty | Medium - edge case |
| 6 | Note that mid-stream `printraw` is N/A | Low - docs only |
| 7 | Add Rich-markup-leak test case | Medium - correctness |
Recommendation #2 was the one that made me sit up.
The Rich markup leak problem: if you use Rich's `console.print()` to output "raw" text, and the AI's response happens to contain bracketed tokens like `[bold]` or `[/]`, Rich interprets them as markup instead of printing them literally. Your "raw" output comes out formatted. The whole point of the feature is defeated.
The fix - using Python's built-in print() - is trivial. But I would have missed it without a dedicated review pass.
Why Claude 4.6 for validation? It's like hiring a polite pedant to review your work. The structured table with priority ratings made it easy to cherry-pick which adjustments to accept.
I took recommendations #1 through #5 and skipped the documentation-only items.
Stage 3: Implementation with Claude Opus 4.6
With a validated plan in hand, I asked Claude to implement it.
The first implementation changed the return types of `handle_with_command()` and `handle_blog_command()` from `Optional` to `Tuple[Optional, Optional]` - threading the response text back to the REPL.
I flagged this immediately: "Won't that introduce regressions and break existing functionality?"
Claude acknowledged the risk and proposed something cleaner: capture the text at the source of truth - inside print_assistant_response() itself, using a module-level variable.
This approach:
- Required zero signature changes
- Had zero regression risk
- Was guaranteed to capture the correct text (whatever was actually printed)
- Worked automatically for all code paths
Much better.
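The pattern Claude proposed can be sketched like this. It assumes, per the article, that `print_assistant_response()` is the single function every rendered response passes through; the variable and getter names are illustrative:

```python
from typing import Optional

# Module-level capture: store whatever text was actually printed, at the one
# place it is guaranteed to exist. Names are illustrative, not Aye Chat's API.
_last_response_text: Optional[str] = None

def print_assistant_response(text: str) -> None:
    global _last_response_text
    _last_response_text = text   # capture exactly what gets displayed
    print(text)                  # rendering (Rich panel, etc.) would happen here

def get_last_response_text() -> Optional[str]:
    """Return the most recently printed response, or None if there isn't one."""
    return _last_response_text
```

Because the capture lives inside the display function, every code path that prints a response updates it automatically - no signature changes required.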
The Bug That Almost Shipped
Even after the refactor, the first test showed the command printing "No assistant response available yet" after a valid response.
Root cause: the initial code tried to capture text using getattr(llm_response, 'answer_summary', None), but the response object's attribute was actually .summary, not
.answer_summary.
The fix was exactly the module-level capture approach - store the text inside print_assistant_response() where the correct string is guaranteed to exist, regardless of what the response
object's attributes are named.
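The bug was silent for a specific reason: `getattr()` with a default never raises, so a wrong attribute name just yields the fallback. A minimal illustration, with a stand-in class rather than the real response object:

```python
# getattr() with a default swallows wrong attribute names silently.
class LLMResponse:
    """Stand-in for the real response object, which exposes .summary."""
    def __init__(self, summary: str):
        self.summary = summary

resp = LLMResponse("Here is your answer.")

wrong = getattr(resp, "answer_summary", None)  # None - no error, no warning
right = getattr(resp, "summary", None)         # the actual text
```

Capturing at the print site sidesteps the whole class of "guess the attribute name" bugs.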
Final implementation touched four files:
- `presenter/repl_ui.py` - module variable + getter + capture logic
- `presenter/raw_output.py` - new file: `plainprint()` with delimiters
- `controller/command_handlers.py` - new `handle_printraw_command()` handler
- `controller/repl.py` - added `printraw` and `raw` to built-in commands
Stages 4 & 5: The Adversarial Testing Loop
Here's where the multi-model approach got really interesting.
With implementation done, I didn't just ask one model to write tests and fix them. I ping-ponged between models: one writes, the other critiques and fixes, repeat.
The test coverage needed to include:
- Normal output with delimiters
- Rich markup leak prevention (the `[bold]something[/bold]` case)
- `None` input → warning message
- Whitespace-only input → warning message
- Empty string → warning message
- The capture mechanism
- The handler integration
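A couple of the cases above can be sketched as plain pytest-style tests. The `plainprint` stub here is a minimal stand-in for the real helper in `presenter/raw_output.py`, whose exact signature is an assumption:

```python
import io
import contextlib

def plainprint(text: str) -> None:
    """Stand-in for the raw-output helper: built-in print() plus delimiters."""
    delim = "-" * 40
    print(delim)
    print(text)
    print(delim)

def test_markup_not_interpreted():
    # Rich-style tokens must survive verbatim in the raw output.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        plainprint("keep [bold]these[/bold] tokens literal")
    assert "[bold]these[/bold]" in buf.getvalue()

def test_delimiters_present():
    # Output must be wrapped in delimiter lines for easy selection.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        plainprint("hello")
    lines = buf.getvalue().splitlines()
    assert lines[0] == "-" * 40 and lines[-1] == "-" * 40
```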
Then came the iteration loop - but with a twist.
(ツ» model gpt-5.2
(ツ» write tests for the printraw feature
GPT writes tests.
(ツ» pytest tests/test_raw_output.py -v
Oh God. OH GOD. Red everywhere.
(ツ» model claude-opus-4.6
(ツ» fix the failing tests
Claude fixes things - and often rewrites chunks of GPT's approach entirely.
(ツ» pytest tests/test_raw_output.py -v
Still red, but fewer failures.
(ツ» model gpt-5.2
(ツ» these tests are still failing, fix them
GPT takes a different angle. Catches something Claude missed.
(ツ» pytest tests/test_raw_output.py -v
Green. Finally green.
Why alternate models? Because each model has different blind spots. GPT might write a test that's technically correct but uses mocking patterns Claude handles better. Claude might fix
the mock but miss an assertion edge case that GPT catches on the next pass.
It's adversarial collaboration. Each model is essentially reviewing the other's work, and bugs that survive one model's scrutiny often get caught by the other.
No context-switching. No copying error messages between terminals. Everything in one session - just swapping which brain is on the case.
Why This Workflow Works
Different models for different cognitive tasks
Planning is a different skill than code review which is a different skill than implementation. Using one model for everything is like using a hammer for screws - it technically works but
you're fighting the tool.
The staged approach catches errors early
The validation stage caught the Rich markup leak before any code was written. Without it, that bug would have surfaced (maybe) when users reported garbled output weeks later.
Regression risk is managed explicitly
By questioning the return-type changes, I avoided an entire class of integration issues. The "capture at the source of truth" pattern emerged from that pushback.
Alternating models surfaces hidden bugs
The ping-pong pattern during testing caught issues that a single model iterating with itself would have missed. Each model brings a different failure mode - and different solutions.
The conversation is the development environment
Every stage happened in the same Aye Chat session:
- Model switching via the `model` command
- File generation via prompts
- Test execution via `pytest`
- Undo via `restore` when something went wrong
- Diff inspection via `diff` to verify changes
No IDE. No separate terminal. No copy-pasting between tools.
The Takeaways
Plan first, validate second, implement third. Writing a plan document forces clarity before you touch code.
Switch models for validation. The model that wrote the plan won't catch its own blind spots. A fresh perspective - even from a different AI - brings a different analytical lens.
Capture at the source of truth. When multiple code paths need the same data, find the single point where it's guaranteed to be correct. Don't thread it through function signatures.
Question regression risk explicitly. When implementation requires changing existing contracts, ask: "Is there a way to do this without breaking things?" Usually there is.
Alternate models during test/fix loops. One model writes, the other critiques. Bugs that slip past one often get caught by the other. It's like having two reviewers who never get tired.
Keep tests in the same session. Running `pytest`, reading failures, and fixing them without leaving the terminal keeps iteration tight and fast.
This whole feature - planned, validated, implemented, tested, debugged, and shipped - happened in a single Aye Chat session across a few hours.
Not because the feature was hard. Because the workflow made it frictionless.
About Aye Chat
Aye Chat is an open-source, AI-powered terminal workspace that brings AI directly into command-line workflows. Edit files, run commands, and chat with your codebase without leaving the
terminal - with an optimistic workflow backed by instant local snapshots.
Support Us
- Star our GitHub repository - it helps new users discover Aye Chat.
- Spread the word. Share Aye Chat with your team and friends who live in the terminal.