Yannick Loth
Iterative review-fix loops remove LLM hallucinations, and there is a formula for it

If you have used LLMs for anything that requires accuracy, you have probably noticed the pattern: the first output is a mix of genuine insight and confident fabrication. You read through it, spot the errors, fix them, and move on. Most people treat this manual review step as an unavoidable cost of working with these models.

But here is something worth pausing on: when you ask the same model to review its own output, it often finds the very errors it just made. It spots hallucinated facts, logical gaps, and inconsistencies that it introduced moments earlier. And if you ask it to fix those findings and then review again, it finds more. You can repeat this until a review round produces zero findings, and the result is remarkably clean.

This seems paradoxical at first. If the model was capable of recognizing these errors, why did it make them in the first place? The answer turns out to be that recognizing a mistake is fundamentally easier than not making one. Generation and verification are not the same cognitive task, and LLMs are measurably better at the second. This asymmetry is the key to everything that follows.

I have been using this review-fix loop for academic papers, code, and technical documentation, and it reliably converges within a few rounds. Recent research now provides a formal mathematical model explaining why it converges and when you should stop.

The pattern

The loop itself is straightforward:

output = generate(task)
changes = output
round = 0
while True:
    round += 1
    findings = review(changes)
    if not findings and round >= 2:
        break
    changes = fix(changes, findings)

You generate once, then enter a review-fix loop. Notice the round >= 2 condition: even if the first review finds nothing, you run at least two review passes. The reason is that verification itself is probabilistic. A single review pass gives the model one chance to catch each error, and given that CS (the probability of spotting a given error) is less than 1, some errors will slip through on any single pass. A second pass gives the model another independent chance at those, which is where most of the accuracy gain concentrates according to the convergence formula.

An important detail is that each review round focuses on what was changed in the previous round, not the entire output from scratch. The first round reviews the initial generation, but subsequent rounds only review the fixes that were just applied. This keeps each round focused on material that has not yet been reviewed, rather than re-scanning everything and risking the model second-guessing content it already validated.
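As a sketch, this diff-focused loop looks like the following in Python. The `review_fn` and `fix_fn` callables are hypothetical stand-ins for LLM calls: `review_fn` returns a list of findings, and `fix_fn` returns the updated document plus the text of the spans it changed.

```python
def review_fix_loop(draft, review_fn, fix_fn, max_rounds=6):
    """Review-fix loop that re-reviews only the spans changed last round."""
    doc = draft
    to_review = draft  # round 1 reviews the whole draft
    for round_no in range(1, max_rounds + 1):
        findings = review_fn(to_review)
        if not findings and round_no >= 2:  # require at least two passes
            break
        doc, changed = fix_fn(doc, findings)
        to_review = changed  # later rounds review only the fixes
    return doc

# Toy stand-ins for illustration: "teh" is the only possible error.
def toy_review(text):
    return [w for w in text.split() if w == "teh"]

def toy_fix(doc, findings):
    fixed = doc.replace("teh", "the")
    return fixed, " ".join("the" for _ in findings)

print(review_fix_loop("teh quick brown fox", toy_review, toy_fix))
# → the quick brown fox
```

With real LLM calls, `changed` would carry the edited passages (plus enough surrounding context to judge them), so each review pass stays focused on unreviewed material.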

The reason this is effective is that LLMs tend to be better at verifying content than generating it. They spot errors more reliably than they avoid making them in the first place.

This applies broadly. For documents, each round catches factual errors, inconsistencies, or missing context. For code, each round catches bugs, edge cases, or security issues. For technical writing, each round catches logical gaps and unsupported claims. The pattern does not require external feedback, fine-tuning, or multiple models, though using different prompts for generation and review does improve results.

Why it works: the solver-verifier gap

To understand why the same model that made the errors can then find them, it helps to think about what generation and verification actually require.

When generating, the model samples from a vast probability distribution. Many paths through that distribution lead to plausible-sounding but wrong outputs. The model has to construct the correct answer from scratch, navigating through all of those plausible-but-wrong alternatives. This is where hallucinations come from: the model lands on a confident-sounding path that happens to be wrong.

When reviewing, the task is different. The model does not have to construct anything. It just has to look at a specific claim and assess whether it is correct. This is a much more constrained problem, and the model's latent knowledge handles it more reliably. Think of it this way: it is hard to write a correct proof from scratch, but it is much easier to spot a logical gap in a proof that is already written.

Liu et al. (2024) provide an explanation for this in "On the Intrinsic Self-Correction Capability of LLMs": the instruction to review and fix errors shifts the model's internal state toward higher-certainty regions of its training distribution. Each review round activates latent knowledge that reduces uncertainty, progressively steering the output away from noise.

The important consequence is that the model is not learning anything new during refinement. It is accessing knowledge it already had but failed to retrieve during generation. The review prompt effectively resamples from a better part of the distribution, and each round extracts a bit more of that latent knowledge.

The convergence math

Yang et al. (2025) formalized this process in "A Probabilistic Inference Scaling Theory for LLM Self-Correction" at EMNLP 2025. They model it as a Markov chain with a recursive formula:

Acc_t = Acc_{t-1} · CL + (1 - Acc_{t-1}) · CS

Here, Acc_t is the probability of the output being correct after round t, CL (Confidence Level) is the probability that the model keeps correct content correct, and CS (Critique Score) is the probability that the model successfully fixes an error.
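The recursion is easy to iterate numerically. The sketch below uses illustrative values (CL = 0.9, CS = 0.4, Acc_0 = 0.3), not constants from the paper:

```python
CL = 0.9  # probability correct content stays correct
CS = 0.4  # probability an existing error gets fixed

def acc_next(acc):
    # One round of the Markov chain: correct content survives with
    # probability CL, incorrect content is repaired with probability CS.
    return acc * CL + (1 - acc) * CS

acc = 0.3  # Acc_0: accuracy of the initial draft
history = [acc]
for _ in range(5):
    acc = acc_next(acc)
    history.append(acc)

print([round(a, 6) for a in history])
# [0.3, 0.55, 0.675, 0.7375, 0.76875, 0.784375]
```

Note how each round's gain shrinks: +0.25, then +0.125, then +0.0625, and so on, halving every round.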

Through mathematical induction, this resolves into a closed-form convergence equation:

Acc_t = Upp - α^t · (Upp - Acc_0)

In this formula, Upp = CS / (1 - CL + CS) is the theoretical accuracy ceiling, α = CL - CS is the convergence rate (where a smaller α means faster convergence), and Acc_0 is the initial accuracy of the first draft.
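The closed form should agree with the recursion round for round, which is easy to verify numerically with the same illustrative values (CL = 0.9, CS = 0.4, Acc_0 = 0.3):

```python
CL, CS, ACC0 = 0.9, 0.4, 0.3
UPP = CS / (1 - CL + CS)  # accuracy ceiling: 0.8
ALPHA = CL - CS           # convergence rate: 0.5

def acc_closed(t):
    # Closed-form accuracy after t review-fix rounds.
    return UPP - ALPHA ** t * (UPP - ACC0)

# Check the closed form against the recursive definition.
acc = ACC0
for t in range(1, 10):
    acc = acc * CL + (1 - acc) * CS
    assert abs(acc - acc_closed(t)) < 1e-12
```

With these numbers the ceiling is 0.80: no amount of iteration pushes accuracy past it.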

This describes exponential decay toward the ceiling: each iteration removes a fixed fraction of remaining error. In practice, this translates to the following trajectory:

| Round | What typically happens |
| --- | --- |
| 0 | Initial generation, with hallucinations present |
| 1 to 2 | Major errors are removed, including factual and logical contradictions |
| 3 to 5 | Refinement phase, where nuances, edge cases, and subtle inconsistencies are resolved |
| 6+ | Diminishing returns, with a risk of the model "fixing" things that were not broken |

For current models, convergence typically happens in 3 to 5 rounds.
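That "3 to 5 rounds" figure falls out of the decay term: after t rounds, a fraction α^t of the reachable improvement is still unrealized. With an illustrative α = 0.5:

```python
ALPHA = 0.5  # illustrative convergence rate (CL - CS)

for t in range(1, 7):
    captured = 1 - ALPHA ** t  # share of reachable improvement realized by round t
    print(f"round {t}: {captured:.1%} captured")
```

After round 3 about 87.5% of the reachable improvement is captured, and after round 5 about 96.9%, so rounds beyond that buy very little.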

How many review passes does each text need?

This is the question that the convergence formula actually answers, and the answer is not "one."

Verification is probabilistic. When the model reviews a piece of generated text, it has some probability CS of catching any given error. If CS is 0.4 (a reasonable estimate for current models on non-trivial content), then after a single review pass, each error has a 60% chance of surviving undetected. After two passes, that drops to 36%. After three, to about 22%.
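Those survival figures are just the miss probability compounded over independent passes, with the illustrative CS = 0.4:

```python
CS = 0.4  # illustrative probability of catching a given error in one pass

for passes in range(1, 4):
    survival = (1 - CS) ** passes  # chance an error evades every pass
    print(f"after {passes} pass(es): {survival:.1%} of errors survive")
# 60.0%, 36.0%, 21.6%
```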

You can compute this directly from the convergence equation. The accuracy gain from round t-1 to round t is:

ΔAcc_t = (1 - α) · α^(t-1) · (Upp - Acc_0)

If you plug in typical values (say CL = 0.9, CS = 0.4, so α = 0.5 and Upp = 0.80), the relative share of total improvement per round looks like this:

| Round | Share of total improvement |
| --- | --- |
| 1 | 50% |
| 2 | 25% |
| 3 | 12.5% |
| 4 | 6.25% |
| 5 | 3.125% |

Round 1 catches half the errors. Round 2 catches half of what remains. Together, rounds 1 and 2 account for 75% of the total improvement the loop will ever achieve. This is why a single review pass is not enough: stopping after round 1 leaves half of the reachable improvement on the table, and the second pass alone recovers half of that remainder.
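The shares in the table are the normalized per-round gains, (1 - α) · α^(t-1), which form a geometric series summing to 1. Checking with the illustrative α = 0.5:

```python
ALPHA = 0.5  # illustrative convergence rate

# Share of total reachable improvement captured in round t.
shares = [(1 - ALPHA) * ALPHA ** (t - 1) for t in range(1, 6)]
print(shares)  # [0.5, 0.25, 0.125, 0.0625, 0.03125]

assert abs(shares[0] + shares[1] - 0.75) < 1e-12  # rounds 1 and 2: 75%
```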

The practical rule I follow is: every generated text gets reviewed at least twice. A single pass gives the model one chance to catch each error, and given the probabilistic nature of verification, that is not enough. Two passes give it a second independent chance, and that second chance is where a large share of the remaining errors get caught.

Beyond two passes, you get diminishing returns, but they are not zero. For content where quality is critical (proofs, security-sensitive code, text that will be published), I typically run 3 to 5 rounds. For less critical content (internal documentation, draft notes), two rounds is usually sufficient.

The formula also tells you something important: there is a maximum accuracy that the loop can reach, regardless of how many rounds you run. This ceiling, Upp, depends entirely on the model's ability to preserve correct content (CL) and fix errors (CS).

The ceiling and when to stop

If Upp is 0.80, the remaining 20% consists of errors that the model cannot recognize as errors, because they are blind spots shared by both its generator and its critic. This aligns with the finding from Huang et al. (2024) in "Large Language Models Cannot Self-Correct Reasoning Yet" at ICLR 2024: when the evaluator has the same blind spots as the generator, iteration rearranges errors without removing them.

There are a few practical rules for when to stop:

  1. Always run at least two review passes. As discussed above, the first two rounds capture 75% of the reachable improvement. Stopping after one pass leaves too much on the table.
  2. Track findings per round. If the count is decreasing, you are converging. If it plateaus or increases, you should stop.
  3. Set a hard cap at 5 or 6 rounds. Beyond this, the risk of introducing new errors tends to exceed the benefit of fixing remaining ones.
  4. Watch for stochastic drift. If a round introduces issues that previous rounds had already fixed, you have overshot, and you should use the previous round's output.
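These rules compose into a single decision function. The thresholds below mirror the heuristics above and are judgment calls of mine, not values from the cited papers:

```python
def should_stop(findings_per_round, hard_cap=6):
    """Decide whether to stop, given each round's findings count so far."""
    rounds = len(findings_per_round)
    if rounds < 2:
        return False  # rule 1: always run at least two review passes
    if rounds >= hard_cap:
        return True   # rule 3: hard cap on total rounds
    if findings_per_round[-1] == 0:
        return True   # clean review after two or more passes: converged
    if findings_per_round[-1] >= findings_per_round[-2]:
        return True   # rule 2: findings plateaued or increased
    return False

print(should_stop([7, 3]))     # False: still converging
print(should_stop([7, 3, 0]))  # True: clean round
print(should_stop([7, 3, 3]))  # True: plateau
```

Rule 4 (stochastic drift) needs the round outputs themselves, not just counts, so it stays a manual check: if a round reintroduces an already-fixed issue, roll back to the previous round's output.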

Making the loop more effective

A few things I have learned from running this pattern extensively:

Separate the reviewer from the generator. Using a different system prompt for review than for generation yields better results. The Self-Refine framework by Madaan et al. (NeurIPS 2023) showed that distinct generation, critique, and refinement prompts outperform monolithic approaches by roughly 20% on average across diverse tasks.

Be specific about what to review. A vague instruction like "review this for errors" produces weak findings, whereas "check for factual accuracy, logical consistency, missing edge cases, and unsupported claims" produces targeted and actionable findings. The more precise the review prompt, the higher the effective CS.

Quality-gate the initial generation. If the first draft is structurally broken or fundamentally confused about the topic, the review loop will not save it, because the critique signal gets drowned out by noise. Investing in better prompts, few-shot examples, or a more capable model for the initial generation will do more than adding extra review rounds on a poor draft.

Number your rounds. Tracking R1, R2, R3 with findings counts makes convergence visible and gives you a clear signal of when diminishing returns set in.

Recognize when to hand off. The ceiling tells you when human involvement is needed. If round 5 still shows findings but they are subtle judgment calls rather than clear errors, you have reached the model's intrinsic limit, and the remaining gap requires human expertise.

When it does not work

The loop fails in a few specific situations. If the task requires reasoning that the model cannot perform (for example, novel mathematical proofs or deeply specialized domain knowledge), then CS is close to zero and the ceiling is very low. If the initial quality is near zero, the model cannot bootstrap refinement from noise, and you need to fix the generation step first. And if the model is too aggressive (low CL), it frequently "fixes" correct content into incorrect content, so that each round both fixes and breaks things, and you oscillate instead of converging.

These are edge cases, however. For the broad range of practical tasks involving writing, coding, analysis, and documentation, the loop-until-convergence pattern works well.

In summary

If you are using LLMs for anything where quality matters, it is worth treating generation not as a one-shot process but as the first step in an iterative refinement loop. This is not a hack: it is a mathematically grounded strategy that takes advantage of the fact that these models can verify more reliably than they can generate.

The approach is to generate once, review and fix in a loop, and stop when findings reach zero or at around round 5. The convergence math from Yang et al. predicts this, and my daily experience with it confirms the prediction.


References

  1. Yang, Z., Zhang, Y., Wang, Y., Xu, Z., Lin, J., & Sui, Z. (2025). A Probabilistic Inference Scaling Theory for LLM Self-Correction. EMNLP 2025.

  2. Madaan, A., Tandon, N., Gupta, P., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. NeurIPS 2023.

  3. Huang, J., Chen, X., Mishra, S., et al. (2024). Large Language Models Cannot Self-Correct Reasoning Yet. ICLR 2024.

  4. Liu, G., Mao, H., Cao, B., et al. (2024). On the Intrinsic Self-Correction Capability of LLMs: Uncertainty and Latent Concept. arXiv:2406.02378.
