
Rishi Mohanty

Originally published at Medium

The Human Habit That Made My AI 40% Smarter

You can make an o4-mini model about 40% smarter on your own laptop for under $20 — no fine-tuning, no extra compute, no specialized knowledge. The secret isn’t technical complexity, but a powerful human habit that you’ve been using every day, now applied systematically to AI.

However, before continuing, it’s important to ask: why does model quality even matter anymore? We can already equip a model with hundreds of tools, integrate it into complex workflows, and have it coordinate with 10 other models across dozens of steps. Doesn’t this already do the job pretty well?

Here’s the misconception: piling complexity on top of a mediocre model doesn’t inherently make the system better. A model can have an unlimited number of tools, but still have no clue how to actually accomplish the goal.

Yes, tools, workflows, and complexity make it easier for a model to perform, but they don’t fix the core issue. Intelligence does. It’s like giving a 12-year-old a recipe, ChatGPT, and industry-grade tools, then asking them to boil an egg. Now imagine replacing the kid with an AI, giving it tools and other models, and then asking it to make core infrastructure-related changes. Maybe it gets the job done, but does it do it properly? Efficiently? Reliably?

Intelligence is what determines whether a model quietly solves the task or burns hours of time and resources hallucinating or retrying the same failed solution. Whether it finishes in 10 steps or 30. Whether it pauses to ask for help or loops itself into failure. In an emerging society where AI is being tasked with running workflows, calling APIs, and managing real-world responsibilities, those small errors accumulate and become system-breaking.

This is the crux: model intelligence and quality are inseparable. High-quality models, capable of achieving assigned goals efficiently and reliably, remain core to effective AI systems — regardless of tools and complexity layered on.

This problem of model intelligence is well known and documented, and sure, solutions exist. But most of them demand hours of work, piles of money, and significant resources. Cheaper alternatives do exist, if you’re ready to dive into low-level modeling, parameter fine-tuning, and arcane math. For the rest of us, there’s a simpler, more human approach that gets most of the benefits without all the overhead.

So here we are at the human habit. It’s a habit you’ve been doing your whole life without noticing. It’s so subconscious you do it automatically, even when you explicitly don’t want to. It’s precisely this habit that separates people who stumble through tasks from those who solve them efficiently, anticipate mistakes, and already have a plan C.

Think about the last time you wrote an email and reread it four times before hitting send. Or when you double-checked that calculation for the third time. Or even simpler, when you hit Ctrl+C eight times to make sure it worked. It’s that voice in your head that makes you pause before you act. That tiny, seemingly mundane pause, where you notice what’s right, what’s wrong, and what can be done better, is the secret ingredient humans use to constantly improve.

Now imagine if a model could do the same thing. Not as a vague system prompt or a theory, but as a concrete step embedded in its response process. With every response, the model has its own little voice giving it doubt or confidence, and it thinks through its answer multiple times before finally proceeding. Mistakes aren’t blindly repeated. Inefficiencies aren’t blindly amplified. Instead, it truly thinks through its actions before doing something.

It sounds so simple. So obvious. But this is where the magic happens. By turning our own natural, subconscious habit of self-reflection into a process for AI, we make the model meaningfully more intelligent without piling on unnecessary complexity.

And yes, applied correctly, this habit didn’t just make o4-mini a bit better; it made it 40% smarter, faster, and more reliable.

Self-reflection and self-criticism already exist as techniques, so how do we take them to the next level? Once again, let’s visit the human mind. Yes, humans critique themselves, but how effective is that critique when you’re the one being critiqued? It’s hard to convince yourself to go outside for a run on a rainy day when you’re curled up drinking hot chocolate. Objectively, you should get up and exercise, but bias and emotion-driven decisions blunt that self-criticism. That’s why we decided to pivot and opt for multi-model critique along with objective rubric evaluations.

Given our testing and cost constraints, we decided to go with o4-mini as the main model and gpt-5-mini as the critique engine. Here are their baseline performances:


Baseline resolution rates for the individual models on SWE-bench (bash-only): o4-mini resolves 45.00% of tasks, gpt-5-mini resolves 59.80%, and the theoretical upper bound is 100%.
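
Conceptually, a rubric-based critique call is a single request to the critique model asking it to grade the main model’s draft against explicit criteria. Here’s a minimal sketch using the OpenAI Python SDK; the rubric wording, criteria, and function shape are illustrative, not the exact code from our repository:

```python
# Minimal sketch of a rubric-based critique call (illustrative, not the exact repo code).
import json
from openai import OpenAI

client = OpenAI()

RUBRIC = """You are an objective reviewer. Score the candidate response from 0-10
on each criterion:
- correctness: does it actually solve the stated task?
- efficiency: does it avoid unnecessary steps and tool calls?
- reliability: would it survive edge cases and obvious failure modes?
Return JSON: {"correctness": int, "efficiency": int, "reliability": int, "feedback": str}"""

def critique(task: str, draft: str) -> dict:
    """Ask the critique model (gpt-5-mini in our setup) to grade a draft against the rubric."""
    result = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Task:\n{task}\n\nCandidate response:\n{draft}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(result.choices[0].message.content)
```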

Multi-model critique already exists too, but once again, it comes with its own problems. A single critique pass only happens at the end of each response, so the model doesn’t actually act on the feedback until it gets another prompt. Our fix was to isolate the critique cycle within each response: the model critiques and revises its answer several times, and only returns a result once it has reached an optimal score.
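
In code, that per-response cycle might look something like the sketch below. The generate() and revise() helpers are simplified stand-ins for calls to the main model (o4-mini), critique() is the rubric call from the previous sketch, and the fixed score threshold is a placeholder for “optimal,” which, as you’ll see, turned out to be the wrong stopping rule:

```python
def generate(task: str) -> str:
    """First attempt from the main model."""
    out = client.chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": task}],
    )
    return out.choices[0].message.content

def revise(task: str, draft: str, feedback: str) -> str:
    """The main model rewrites its own draft using the critique feedback."""
    out = client.chat.completions.create(
        model="o4-mini",
        messages=[
            {"role": "user", "content": task},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": f"Revise your answer using this critique:\n{feedback}"},
        ],
    )
    return out.choices[0].message.content

def respond_with_critique(task: str, max_cycles: int = 5, target: float = 9.0) -> str:
    """One response = draft, critique, revise, repeat until the scores clear a fixed bar."""
    draft = generate(task)
    for _ in range(max_cycles):
        scores = critique(task, draft)
        avg = (scores["correctness"] + scores["efficiency"] + scores["reliability"]) / 3
        if avg >= target:  # "optimal" treated here as a fixed threshold
            break
        draft = revise(task, draft, scores["feedback"])
    return draft
```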

However, this raises the question: what counts as an optimal score? Is 90/100 on the rubric optimal? What if the system is literally unable to hit that 90? That’s why we decided to track improvement instead. By tracking improvement, the system only returns a response once the model has critiqued and improved it to the best of its ability. This also fixed another problem we had.

Each time the model critiqued itself, the gain was less and less. This is a common phenomenon for humans, too, known as the learning curve. Once again, tracking improvement saved us from the model driving itself into the ground.


Learning curve graph (source: valamis.com)
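
Tracking improvement changes the stopping rule from “clear a fixed bar” to “stop once the gains dry up.” Here’s a sketch of that variant; the epsilon value is purely illustrative, and a fuller version would also keep the best-scoring draft seen so far:

```python
def respond_with_improvement_tracking(task: str, max_cycles: int = 5, epsilon: float = 0.25) -> str:
    """Stop critiquing once the score gain between rounds falls below epsilon."""
    draft = generate(task)
    prev_avg = None
    for _ in range(max_cycles):
        scores = critique(task, draft)
        avg = (scores["correctness"] + scores["efficiency"] + scores["reliability"]) / 3
        if prev_avg is not None and avg - prev_avg < epsilon:
            break  # diminishing returns: return now instead of grinding on
        prev_avg = avg
        draft = revise(task, draft, scores["feedback"])
    return draft
```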

Now, in theory, if we had unlimited resources, our model would be able to achieve almost perfect results, right? Wrong again; this time, though, our error had an uncanny similarity to human behavior. Think about a time when you were stuck in traffic and the ETA jumped up by five or ten minutes. You’re sitting in the car, beginning to sweat, thinking to yourself: “Well, dammit, I’m going to be five minutes late, and that’s going to become 15 minutes. John’s already going to start presenting, Michael’s going to be mad at me, and I’m going to get fired. I hate traffic!” All from 10 minutes of delay. Such a small problem can cause people to overthink and start overcompensating, which just leads to more problems.

Like humans, our critique system began obsessively overthinking. A problem as simple as a missing package caused the model to critique itself almost fifteen times before finally returning a response, over and over again, across several steps. Each returned solution tried to solve the given problem and more, leading to an avalanche of additional crashes.

This inspired our next addition: priority tuning. At the time, our critique model only considered a response acceptable if it was production-grade, bulletproof, absolutely perfect code. Most of the time we don’t need that obsessive perfection, so we adjusted the system’s priorities. This was a relatively easy fix: we simply added system instructions with more depth and more specific goals for the model.
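
In practice, priority tuning mostly means spelling out what matters (and what doesn’t) in the critique model’s system prompt. The wording below is an example of the idea, not our exact prompt:

```python
# Illustrative "priority tuning" system prompt for the critique model.
CRITIQUE_SYSTEM_PROMPT = """You are reviewing a candidate patch for a single SWE-bench issue.
Priorities, in order:
1. The patch resolves the reported issue and keeps the existing tests passing.
2. The change is minimal and scoped to the issue; do not demand refactors,
   extra hardening, or features beyond the task.
3. Flag only problems that would actually block resolution.
If priorities 1 and 2 are met, mark the response acceptable."""
```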

All of these components came together to produce that 40% result.

Now, at first, all these problems were overwhelming. Hours of testing were wasted because the model was obsessive or began overengineering, followed by hours of whiteboarding, building new solutions, and going back to testing. In the end, the system resolved 63.16% of issues on SWE-bench bash-only (a minimal agent on Princeton’s SWE-bench Lite dataset). That’s a ~40% relative increase over o4-mini’s 45.00% resolved. Here’s a look at the baseline again:


Baseline resolution rates for the individual models on SWE-bench (bash-only): o4-mini resolves 45.00% of tasks, gpt-5-mini resolves 59.80%, and the theoretical upper bound is 100%.

One of our hypotheses was that it would cap at our critique model’s score of 59.80% resolved (gpt-5-mini). However, our system even exceeded this. Here is a visualization of our improvement:


Performance increase of the enhanced o4-mini system on SWE-bench (bash-only) with task resolution from 45% to 63.16%, a relative improvement of ~40%

You can find the code we used to build this system in our repository, linked below:

github.com/Rishi138/multi-agent-critique/

After tuning prompts, thresholds, and other components of the system, the entire building process cost $0.55. Actually testing the system on the benchmark cost $10.15. For our purposes, the optimal cycle length was about 2–3 critique calls before returning a result; keep in mind, though, that our code was optimized for SWE-bench. Other domains may require your own tuning and testing, but the cost should stay well under $20.
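
For reference, these are the kinds of knobs that tuning touches. The cycle cap reflects what worked for us on SWE-bench, while the epsilon value is illustrative and will vary by domain:

```python
# Example configuration; only the 2-3 cycle cap is grounded in our SWE-bench runs.
CONFIG = {
    "main_model": "o4-mini",
    "critique_model": "gpt-5-mini",
    "max_critique_cycles": 3,      # 2-3 cycles was the sweet spot on SWE-bench
    "improvement_epsilon": 0.25,   # illustrative; tune per domain
}
```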

Overall, the system’s relative improvement of 40% was exactly what we had hoped for. That quality gain does come with tradeoffs, though, including increased latency and cost per call, so this architecture isn’t right for every application. If your priority is speed and minimal cost, your program may do better without it.

In conclusion, by translating the simple human habit of self-reflection into a concrete AI process, we were able to make o4-mini ~40% smarter, faster, and more reliable. The larger lesson, though, is that bigger doesn’t mean better. This approach demonstrates that improving model intelligence doesn’t always require complex fine-tuning or massive resources; sometimes, the solution is closer to home.

By applying, refining, and integrating these concepts, you can take your own AI-powered system to the next level.

For those interested in experimenting further, the implementation and code are available in our repository. This opens the door to more accessible, human-inspired AI improvements across a variety of tasks.

Follow if you’re interested in AI/ML but aren’t a PhD-level mathematician.
