I've been experimenting with something that feels slightly unhinged: using different AI models at different stages of building a feature.
Not because I'm indecisive. Because each model has a different superpower.
GPT-5.2 is great at structured documentation and architectural thinking. Claude Opus 4.6 is terrifyingly good at catching edge cases and writing precise code. So why would I force one
model to do everything when I could use them like specialized tools?
This is the story of building a tiny feature called printraw - and how a five-stage, multi-model workflow caught bugs that a single-model approach would have missed entirely.
The Feature: Stop Making Me Fight the Terminal
Here's the problem: Aye Chat renders AI responses in pretty Rich panels with Markdown formatting and box-drawing characters. Looks great. Feels polished.
But try to copy that text and paste it somewhere else.
You get a mess of line breaks, box characters, and formatting artifacts that make you want to throw your laptop into the sea.
The fix seemed simple: add a printraw command that reprints the last response as plain, copy-friendly text. No panels. No Rich formatting. Just raw text wrapped in delimiters you can
select and copy.
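The core idea can be sketched in a few lines. This is an illustrative stand-in, not Aye Chat's actual code; the function name and delimiter style are assumptions:

```python
# Minimal sketch of the printraw idea: reprint text with no Rich formatting,
# wrapped in delimiters that are easy to select and copy.
# Names here are illustrative, not Aye Chat's real API.
def print_raw(text: str) -> None:
    """Print text as-is between plain delimiter lines."""
    delim = "-" * 40
    print(delim)
    print(text)
    print(delim)

print_raw("Here is the answer, unstyled and copyable.")
```

The built-in `print()` is the whole trick: no panels, no markup interpretation, just bytes to stdout.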
The feature itself? Trivial. The workflow I used to build it? That's what got interesting.
The Five-Stage Pipeline
Here's what I ended up doing:
| Stage | Task | Model |
|---|---|---|
| 1 | Write the plan | GPT-5.2 |
| 2 | Validate the plan | Claude Opus 4.6 |
| 3 | Implement | Claude Opus 4.6 |
| 4 | Write tests | GPT-5.2 + Claude Opus 4.6 (alternating) |
| 5 | Fix until green | GPT-5.2 + Claude Opus 4.6 (alternating) |
This isn't "use one model for everything." It's a staged pipeline where model selection is intentional - like choosing a screwdriver vs a hammer based on what you're actually fastening.
Stage 1: Planning with GPT-5.2
I started by describing the UX problem to GPT-5.2 and asking for a complete implementation plan.
GPT-5.2 produced a thorough document covering:
- Command syntax and output format
- Where to capture the last response text
- Two architecture options (store in REPL vs. store in presenter)
- Which files to modify
- Testing approach
- Edge cases
Why GPT-5.2 for planning? It's genuinely good at organized technical writing. It thinks through tradeoffs without needing to see every line of code. The output was clean, structured,
and gave me something concrete to react to.
Stage 2: Validation with Claude Opus 4.6
Here's where it gets interesting.
I handed the plan to Claude 4.6 with a simple prompt: "Review and validate this plan. Let me know if you'd recommend any adjustments."
Claude came back with seven specific recommendations, prioritized by impact:
| # | Recommendation | Priority |
|---|---|---|
| 1 | Add `raw` as a short alias | High - usability |
| 2 | Use plain `print()`, not Rich `console.print()` | High - correctness |
| 3 | Shorten delimiter lines | Low - taste |
| 4 | Clarify: summary-only output, not file changes | Medium |
| 5 | Treat whitespace-only summary as empty | Medium - edge case |
| 6 | Note that mid-stream `printraw` is N/A | Low - docs only |
| 7 | Add Rich-markup-leak test case | Medium - correctness |
Recommendation #2 was the one that made me sit up.
The Rich markup leak problem: if you use Rich's `console.print()` to output "raw" text, and the AI's response happens to contain bracketed tokens like `[bold]` or `[/]`, Rich interprets them as markup instead of printing them literally. Your "raw" output comes out formatted. The whole point of the feature is defeated.
The fix - using Python's built-in print() - is trivial. But I would have missed it without a dedicated review pass.
Why Claude 4.6 for validation? It's like hiring a polite pedant to review your work. The structured table with priority ratings made it easy to cherry-pick which adjustments to accept.
I took recommendations #1 through #5 and skipped the documentation-only items.
Stage 3: Implementation with Claude Opus 4.6
With a validated plan in hand, I asked Claude to implement it.
The first implementation changed the return types of `handle_with_command()` and `handle_blog_command()` from `Optional` to `Tuple[Optional, Optional]` - threading the response text back to the REPL.
I flagged this immediately: "Won't that introduce regressions and break existing functionality?"
Claude acknowledged the risk and proposed something cleaner: capture the text at the source of truth - inside print_assistant_response() itself, using a module-level variable.
This approach:
- Required zero signature changes
- Had zero regression risk
- Was guaranteed to capture the correct text (whatever was actually printed)
- Worked automatically for all code paths
Much better.
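The pattern Claude proposed can be sketched like this. It assumes, per the article, that `print_assistant_response()` is the single function every rendered response passes through; the variable and getter names are illustrative:

```python
from typing import Optional

# Module-level capture: store whatever text was actually printed, at the one
# place it is guaranteed to exist. Names are illustrative, not Aye Chat's API.
_last_response_text: Optional[str] = None

def print_assistant_response(text: str) -> None:
    global _last_response_text
    _last_response_text = text   # capture exactly what gets displayed
    print(text)                  # rendering (Rich panel, etc.) would happen here

def get_last_response_text() -> Optional[str]:
    """Return the most recently printed response, or None if there isn't one."""
    return _last_response_text
```

Because the capture lives inside the display function, every code path that prints a response updates it automatically - no signature changes required.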
The Bug That Almost Shipped
Even after the refactor, the first test showed the command printing "No assistant response available yet" after a valid response.
Root cause: the initial code tried to capture text using getattr(llm_response, 'answer_summary', None), but the response object's attribute was actually .summary, not
.answer_summary.
The fix was exactly the module-level capture approach - store the text inside print_assistant_response() where the correct string is guaranteed to exist, regardless of what the response
object's attributes are named.
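The bug was silent for a specific reason: `getattr()` with a default never raises, so a wrong attribute name just yields the fallback. A minimal illustration, with a stand-in class rather than the real response object:

```python
# getattr() with a default swallows wrong attribute names silently.
class LLMResponse:
    """Stand-in for the real response object, which exposes .summary."""
    def __init__(self, summary: str):
        self.summary = summary

resp = LLMResponse("Here is your answer.")

wrong = getattr(resp, "answer_summary", None)  # None - no error, no warning
right = getattr(resp, "summary", None)         # the actual text
```

Capturing at the print site sidesteps the whole class of "guess the attribute name" bugs.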
Final implementation touched four files:
- `presenter/repl_ui.py` - module variable + getter + capture logic
- `presenter/raw_output.py` - new file: `plainprint()` with delimiters
- `controller/command_handlers.py` - new `handle_printraw_command()` handler
- `controller/repl.py` - added `printraw` and `raw` to built-in commands
Stages 4 & 5: The Adversarial Testing Loop
Here's where the multi-model approach got really interesting.
With implementation done, I didn't just ask one model to write tests and fix them. I ping-ponged between models: one writes, the other critiques and fixes, repeat.
The test coverage needed to include:
- Normal output with delimiters
- Rich markup leak prevention (the `[bold]something[/bold]` case)
- `None` input → warning message
- Whitespace-only input → warning message
- Empty string → warning message
- The capture mechanism
- The handler integration
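A couple of the cases above can be sketched as plain pytest-style tests. The `plainprint` stub here is a minimal stand-in for the real helper in `presenter/raw_output.py`, whose exact signature is an assumption:

```python
import io
import contextlib

def plainprint(text: str) -> None:
    """Stand-in for the raw-output helper: built-in print() plus delimiters."""
    delim = "-" * 40
    print(delim)
    print(text)
    print(delim)

def test_markup_not_interpreted():
    # Rich-style tokens must survive verbatim in the raw output.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        plainprint("keep [bold]these[/bold] tokens literal")
    assert "[bold]these[/bold]" in buf.getvalue()

def test_delimiters_present():
    # Output must be wrapped in delimiter lines for easy selection.
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        plainprint("hello")
    lines = buf.getvalue().splitlines()
    assert lines[0] == "-" * 40 and lines[-1] == "-" * 40
```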
Then came the iteration loop - but with a twist.
(ツ» model gpt-5.2
(ツ» write tests for the printraw feature
GPT writes tests.
(ツ» pytest tests/test_raw_output.py -v
Oh God. OH GOD. Red everywhere.
(ツ» model claude-opus-4.6
(ツ» fix the failing tests
Claude fixes things - and often rewrites chunks of GPT's approach entirely.
(ツ» pytest tests/test_raw_output.py -v
Still red, but fewer failures.
(ツ» model gpt-5.2
(ツ» these tests are still failing, fix them
GPT takes a different angle. Catches something Claude missed.
(ツ» pytest tests/test_raw_output.py -v
Green. Finally green.
Why alternate models? Because each model has different blind spots. GPT might write a test that's technically correct but uses mocking patterns Claude handles better. Claude might fix
the mock but miss an assertion edge case that GPT catches on the next pass.
It's adversarial collaboration. Each model is essentially reviewing the other's work, and bugs that survive one model's scrutiny often get caught by the other.
No context-switching. No copying error messages between terminals. Everything in one session - just swapping which brain is on the case.
Why This Workflow Works
Different models for different cognitive tasks
Planning is a different skill than code review which is a different skill than implementation. Using one model for everything is like using a hammer for screws - it technically works but
you're fighting the tool.
The staged approach catches errors early
The validation stage caught the Rich markup leak before any code was written. Without it, that bug would have surfaced (maybe) when users reported garbled output weeks later.
Regression risk is managed explicitly
By questioning the return-type changes, I avoided an entire class of integration issues. The "capture at the source of truth" pattern emerged from that pushback.
Alternating models surfaces hidden bugs
The ping-pong pattern during testing caught issues that a single model iterating with itself would have missed. Each model brings a different failure mode - and different solutions.
The conversation is the development environment
Every stage happened in the same Aye Chat session:
- Model switching via the `model` command
- File generation via prompts
- Test execution via `pytest`
- Undo via `restore` when something went wrong
- Diff inspection via `diff` to verify changes
No IDE. No separate terminal. No copy-pasting between tools.
The Takeaways
Plan first, validate second, implement third. Writing a plan document forces clarity before you touch code.
Switch models for validation. The model that wrote the plan won't catch its own blind spots. A fresh perspective - even from a different AI - brings a different analytical lens.
Capture at the source of truth. When multiple code paths need the same data, find the single point where it's guaranteed to be correct. Don't thread it through function signatures.
Question regression risk explicitly. When implementation requires changing existing contracts, ask: "Is there a way to do this without breaking things?" Usually there is.
Alternate models during test/fix loops. One model writes, the other critiques. Bugs that slip past one often get caught by the other. It's like having two reviewers who never get tired.
Keep tests in the same session. Running `pytest`, reading failures, and fixing them without leaving the terminal keeps iteration tight and fast.
This whole feature - planned, validated, implemented, tested, debugged, and shipped - happened in a single Aye Chat session across a few hours.
Not because the feature was hard. Because the workflow made it frictionless.
About Aye Chat
Aye Chat is an open-source, AI-powered terminal workspace that brings AI directly into command-line workflows. Edit files, run commands, and chat with your codebase without leaving the
terminal - with an optimistic workflow backed by instant local snapshots.
Support Us
- Star our GitHub repository - it helps new users discover Aye Chat.
- Spread the word. Share Aye Chat with your team and friends who live in the terminal.