TL;DR. I tried to drop the self-critique literature into my one-person stack and most of it did not fit. MetaCrit needs four agents. MAR needs a multi-persona debate. PR-CoT needs an external orchestrator. Reflexion needs a reward signal I do not have a budget for. Self-Reflection is the closest, but it is a two-step loop and does not include a stage that separates fake weaknesses from real ones. So I adapted the pattern down to what runs on a single 8GB GPU in a single agent session. Three stages. Negative-self → self-audit → mind-change. I'm calling it MINDCHANGE and shipping the spec as a seventh MD axis in the context-engineering kit. This post explains the adaptation, names the existing lines it borrows from, presents a 5-model experiment design (Claude Opus 4.7 + Gemma 4 31B + Gemini 3.5 Flash + DeepSeek V4 Pro + Qwen 3.6 Max preview (proxy for Qwen 3.7-Max, not yet on OpenRouter at publish time)), and proposes a direct orthogonal combination with thehwang's num_ctx harness.
Why the existing lines did not fit my stack
The self-critique literature is rich. Reading through it over the past two weeks I kept hitting the same wall. The papers assume infrastructure I do not have.
MetaCrit (arxiv 2507.15015) is a four-agent metacognitive framework grounded in the Nelson-Narens model. An object-level agent generates the initial response. A monitoring agent assesses validity. A control agent critiques logic. A meta-level synthesizer reconciles all three. Cleanly designed. Also four model calls per pass. On my routing tier that is 4x the cost of a single-shot. On a self-hosted 8GB GPU it is four times the wall time. For workloads I run hundreds of times a week through cron, the math kills it.
MAR (Multi-Agent Reflexion) (arxiv 2512.20845) replaces single-agent self-critique with structured debate among persona-based critics. The goal is to dodge self-bias by importing multiple external perspectives. Same scaling problem. Now you have a debate panel to maintain. And the personas need to be authored and tuned. For a solo builder maintaining 18 active projects, that maintenance cost is real.
MyGO PR-CoT (arxiv 2601.07780) is a poly-reflective chain-of-thought. The model self-evaluates across four pre-defined angles. Closer to single-agent but still needs an external orchestrator to enforce the four angles per pass. Doable. Still extra plumbing.
Reflect-Retry-Reward (arxiv 2505.24726) is reinforcement-learning based self-improvement. Requires a reward signal. I do not have a labeled reward dataset for the audits my cron pipeline runs. Cannot use it as-is.
PopuLoRA (Co-Evolving LLM Populations for Reasoning Self-Play) (HN announcement, 2026-05) is on the opposite axis: it evolves multiple LLM populations together through reasoning self-play. Strong line for population-level evolution. Orthogonal to MINDCHANGE. PopuLoRA improves the population over time. MINDCHANGE improves a single model's output within a single session through a personality sequence. They could compose in principle, though I have not tested it.
Self-Reflection is the most generic pattern. First answer → critique → refine. Closest to what a single-agent, single-session setup can support. But it is two stages. There is no stage that asks "is this critique even real or did the model just complain to look thorough?" That missing third stage is what causes self-reflection in practice to either bounce off real weaknesses (negative spiral) or rewrite a perfectly good answer into something worse (over-edit).
So I needed something that:
- Runs in a single model call sequence (single agent, single session, no orchestrator)
- Includes a stage that separates real weaknesses from fake ones (the missing third stage)
- Costs in the 2-4x range of a single-shot, not 4-8x
- Sits inside an MD file alongside the existing context-engineering kit, not in a framework
That is the adaptation work. The pattern I landed on is what I am calling MINDCHANGE.
The MINDCHANGE pattern
Three stages. Personality transitions inside one model session. The transitions are explicit in the prompt.
Stage 1. Negative-self
The model is told to look at its own previous output as if a stranger wrote it, then find weaknesses in four named categories.
You are now a *critical reviewer*. The output above is yours,
but treat it as if a stranger produced it. Find weaknesses
in these four categories:
(1) Factual accuracy: are quoted numbers, dates, sources correct?
(2) Logical consistency: are claim-evidence chains broken anywhere?
(3) Vague phrasing: any "well / appropriately / sufficiently"
predicates with no concrete definition?
(4) Missing counter-arguments: has the author preempted reasonable
objections, or skipped them?
Find a minimum of 2 and a maximum of 5 in each category.
If a category genuinely has none, say so explicitly.
Be sharp. No sycophancy.
Four design choices in this prompt that matter:
- "You are now" pins the personality inside the user prompt, not the system prompt. This keeps it portable across models that have weak system-prompt adherence (small open models often do).
- The four categories give the model a task scope. Without scope, "find weaknesses" returns either nothing or surface noise.
- The 2-minimum cuts the sycophancy escape. The 5-maximum cuts the negative spiral escape. Both bounds matter.
- The "if none, say so" line forces the model to commit to a position, not hedge with "could not find any."
Stage 2. Self-audit
The critique from stage 1 is handed back to the model. The model now switches personality from critical reviewer to self-auditor. For each critique item, the model assesses whether it is a real weakness (Yes / No / Unclear) and gives a one-line reason.
Critique list from Stage 1 received. Switch personality:
you are now a *self-auditor*, not a critic. For each item:
(a) Is this a real weakness an external reader would agree with?
Yes / No / Unclear.
(b) If Yes, one-line fix recommendation.
(c) If No or Unclear, one-line reason.
Then report what percentage of items were classified as real weaknesses
(example: 7 of 12 items were real). The classification criterion is
"would an external reader agree." That phrase exists to dodge self-bias.
This is the stage missing from generic Self-Reflection. The model is forced to grade its own critique, which means the over-eager critic from Stage 1 has to defend its claims to a different personality inside the same session. The three-way classification (Yes / No / Unclear) gives the model an honest escape if a critique was fake. The "external reader" framing is the explicit anti-self-bias prompt.
Stage 3. Mind-change
The real weaknesses from Stage 2 go to a third personality: the original author returning to the work. Only the weaknesses get fixed. Strong parts are preserved.
List of items classified as *real weaknesses* received. Switch
personality back to *original author*. Rewrite the original output:
(a) Apply fixes to all real-weakness items.
(b) Keep strong parts unchanged. No over-editing.
(c) Maintain original flow, tone, length.
Output the rewrite only. No fix-explanation commentary.
The third personality switch matters. By the time the model gets to Stage 3 it has been a critic, then an auditor. If the prompt does not return it to "author" mode, it tends to keep critiquing in the rewrite. Naming the personality is cheap and works.
The rewrite-only output (no fix-explanation) keeps the artifact clean. Downstream tooling parses the rewrite directly without needing to strip meta-commentary.
Comparison table
How MINDCHANGE differs from the five existing lines.
| Dimension | MetaCrit | MAR | PR-CoT | Reflect-Retry-Reward | Self-Reflection | MINDCHANGE |
|---|---|---|---|---|---|---|
| Agent count | 4 | Multi-persona | 1 + orchestrator | 1 + reward | 1 | 1 |
| Session boundary | Across agents | Across personas | Across passes | Across episodes | Within session | Within session |
| Stage count | 4 | N (debate length) | 4 | Continuous | 2 | 3 |
| Personality transitions | Implicit (different agents) | Explicit personas | None inside agent | None | None | Explicit, inside one agent |
| External reward needed | No | No | No | Yes | No | No |
| External orchestrator | Yes | Yes | Yes | Yes | No | No |
| Marginal cost | 4x | N x | 4x | Training pass | 2x | 2-4x |
| Fits in MD file | No | No | No | No | Partial | Yes (seventh axis) |
The honest framing: MINDCHANGE borrows the personality-transition idea from MAR, the staged-evaluation idea from MetaCrit, the same-session constraint from Self-Reflection, and the no-reward constraint from PR-CoT. None of it is novel as research. The adaptation is the contribution. It runs.
5-model experiment design and results
The MINDCHANGE pattern is testable. I ran the experiment over the past 24 hours, ahead of schedule.
Hypothesis (before the run). Adding the MINDCHANGE 3-stage prompt sequence to a single-pass model call improves output quality by a measurable lift across most model classes, at a cost penalty of 2-4x wall time and tokens. The lift will be larger for models with strong self-bias (small open models) than for models with weaker self-bias (frontier closed models).
Setup.
- Models (5): Claude Opus 4.7 (frontier closed, baseline) / Gemma 4 31B (open weights, mid-size) / Gemini 3.5 Flash (frontier closed, fast tier) / DeepSeek V4 Pro (open weights, frontier-competitive) / Qwen 3.6 Max preview (proxy for Qwen 3.7-Max, not yet on OpenRouter at publish time, HN 553 points, agent-focused)
- Conditions (2): MINDCHANGE on / off.
- Task fixture: Same 47-day Sniper trading bot log used in the cost-engineering and production-deployment posts. Audit task: surface 12 named structural issues. Gold-truth catch rate scored by substring pattern match against ground-truth list.
- Runs: 3 per cell, 30 total. Actual cost: $7.14 (over the $1-3 estimate; Qwen on-mode 3 runs failed at HTTP 403 "key limit exceeded" before completion).
- Metrics: catch rate (of 12) / wall time / token cost / negative spiral rate / real-weakness rate.
Measured results (catch rate, mean of 3 replicates).
| Model | off | on | lift | time ratio | cost ratio |
|---|---|---|---|---|---|
| Claude Opus 4.7 | 11.7 / 12 | 12.0 / 12 | +0.3 | 1.00 | 2.48 |
| DeepSeek V4 Pro | 7.7 / 12 | 7.0 / 12 | −0.7 | 3.23 | 3.92 |
| Gemini 3.5 Flash | 2.0 / 12 | 1.0 / 12 | −1.0 | 4.00 | 3.82 |
| Gemma 4 31B | 5.7 / 12 | 5.7 / 12 | +0.0 | 4.24 | 4.03 |
| Qwen 3.6 Max preview | 8.0 / 12 | (3 on-runs failed at API key limit) | n/a | n/a | n/a |
Negative spiral rate (on-mode runs where the rewrite scored worse than the original).
- Claude on: 0% (stable, 0/3)
- DeepSeek on: 33% (1/3)
- Gemini on: 33% (1/3)
- Gemma 4 on: 33% (1/3)
Real-weakness rate (Stage 2 self-audit Yes-rate, mean across all 4 models that completed): 76-77%, very consistent.
The hypothesis is wrong, in a specific way.
Claude Opus 4.7 showed the smallest lift, just inside the predicted band (+0.3, hypothesis said +0.5 to +1.5). Every other model went sideways or negative. DeepSeek and Gemini scored worse under MINDCHANGE than under the single-shot baseline. Gemma 4 31B was unchanged. The "stronger lift on small open models" prediction inverted.
Why I think the hypothesis broke:
Scoring ceiling on substring match. Claude was already at 11.7 / 12 baseline. There was almost no room to lift. The +0.3 measured is the model going from "missed one in some runs" to "caught everything in all 3 runs." It is a real signal but a tiny one.
Negative spiral on weaker models. When DeepSeek / Gemini / Gemma 4 went through Stage 1 (critical reviewer) and Stage 2 (self-auditor), they generated critiques. The 76% real-weakness rate means the model believed 3 of every 4 critiques were genuine. But the substring scorer cannot tell whether a fix introduced new framing that breaks the gold-truth pattern match. Three of every nine on-mode runs across non-Claude models scored lower after the rewrite. The model was being thorough; the scoring was punishing thoroughness.
Personality-switching cost on smaller models. The 3.23-4.24x wall-time ratio for non-Claude on-runs is mostly the four sequential model calls plus reasoning time on personalities. Smaller models spend more tokens on each personality switch ("you are now a critical reviewer...") and produce more disorganized output by the rewrite stage. The cost penalty hit harder than the hypothesis allowed.
Qwen ran out of room. The 3 on-mode runs for Qwen 3.6 Max preview failed at OpenRouter HTTP 403 "key limit exceeded" once cumulative spend crossed the $4 ceiling. So the most interesting unknown in the matrix is still unknown. Qwen off ran at 8.0 / 12, which is similar to DeepSeek baseline, but the on-mode test is gone for this wave.
What this means for MINDCHANGE as a seventh axis.
The pattern works on one model out of five tested, and the lift on that one model is +0.3 out of 12. Cost penalty is 2.5-4.2x. The 33% negative-spiral rate on the other four models means stacking MINDCHANGE blindly into a one-person pipeline would worsen output one time in three on non-Claude models.
This is a negative result. I am shipping it anyway, because the alternative is shipping a thesis I cannot defend, and the dev community I am writing into rewards honest negative results. The MINDCHANGE.md axis stays in the kit, but the README will be updated to flag it as model-specific (Claude-class only) and not a general lift.
The right next experiment is not a re-run of this one. The right next experiment is the 2x2 with thehwang's num_ctx harness on the same fixture, to see whether MINDCHANGE has any orthogonal lift when stacked with a different intervention. That experiment is described in the next section.
Orthogonal combination with thehwang's num_ctx harness
The previous post in this series documented thehwang's harness (Scripta) for measuring how num_ctx (Ollama context window parameter) shapes output quality. The cross-replication on RTX 4060 8GB confirmed his Mac 16GB findings, and one of our findings inverted depending on fixture shape.
The MINDCHANGE pattern lives on a different axis from num_ctx. The hypothesis worth testing in a follow-up:
-
num_ctxcontrols how much input the model sees per call - MINDCHANGE controls what personality sequence the model goes through across calls
These are orthogonal in the cleanest sense. They address different failure modes. num_ctx addresses "the model missed a structural issue because the input was silently truncated." MINDCHANGE addresses "the model saw the input but did not push back on its own output." Stacking both should produce additive lift, not redundant lift, since the gaps they close are non-overlapping.
A 2x2 matrix on the same task fixture would be the cleanest experiment:
num_ctx=2048 num_ctx=32768
MINDCHANGE off cell A cell B
MINDCHANGE on cell C cell D
Hypothesis: D > B > C > A, with the lift from B → D smaller than from A → C (because B already has the input-shape lift, so the personality-sequence lift adds less). The interesting unknown is whether the two lifts compose linearly or with diminishing returns.
That follow-up experiment is wave 3 of this series. Wave 2 is the 5-model MINDCHANGE matrix above. Wave 3 is the 2x2 combination with thehwang's harness. Both will publish as standalone posts.
Implementation note
MINDCHANGE ships as MINDCHANGE.md in the agent-starter-kit templates folder, alongside the existing six axes (CLAUDE.md, AGENTS.md, MEMORY.md, TESTING.md, GLOSSARY.md, ADR). MIT licensed.
The kit usage pattern is:
- Drop the six axes (or seven, with MINDCHANGE) into a project root
- The first six define content (project conventions, output schemas, memory, tests, vocabulary, decisions)
- MINDCHANGE defines sequence (how to walk a model through the content axes over a personality transition)
The seventh axis sits on top of the other six rather than alongside them. That layering matters for the comparison table above: MINDCHANGE is not a competing axis to MetaCrit or MAR, it is a composition layer.
What I am running next
- Wave 2 (target ~5-7 days): 5-model MINDCHANGE matrix, results post.
- Wave 3 (target ~14-21 days): 2x2 combination with thehwang's num_ctx harness on the same fixture, joint results post.
- Wave 4 (target ~30 days): MINDCHANGE adoption in the agent-starter-kit Kmong bundle for paying users + a Korean-language walkthrough for the claude-code-masterpack 5/28 release.
The kit and the axis are MIT. The cron pipeline that runs the experiments is the same one documented in the production-deployment post. The fixture is the same 47-day Sniper log used across the series.
If you test MINDCHANGE on your own workload, the comparison I would most like to see is the 2x2: kit-only context engineering on/off, crossed with MINDCHANGE on/off. Same task. Same model. Counter-experiments welcome.
Footer
This post follows the Gemma 4 Challenge production-deployment post which closed out the 5-piece challenge series. MINDCHANGE is the first axis of the next-stack series.
Related:
- MINDCHANGE.md axis spec (MIT, 9.5KB)
- agent-starter-kit (MIT) / Kmong bundle ₩39K
- thehwang's Scripta harness (MIT)
Jack. wildeconforce.com
Top comments (0)