Long-horizon reasoning is where production LLM agents tend to quietly break. A model can produce a plausible-looking chain of thought, accept a wrong intermediate answer, and continue building on that error for every step that follows. By the time the final output appears, the damage is compounded and invisible. The paper behind ReFlect (arXiv:2605.05737, May 2026) quantifies exactly how bad this is: in controlled experiments, LLMs wrongly accept incorrect answers at least 76% of the time when using standard prompt-level self-critique — the "check your work" approach most developers reach for first.
ReFlect proposes a different model. Instead of asking the LLM to critique itself (which mostly produces formulaic acknowledgment templates rather than actual error signals), it inserts a deterministic harness between steps — an external wrapper that checks for numerical inconsistency, grounding failures, and logical contradictions. No fine-tuning. No model changes. Just inference-time scaffolding.
## Why Prompt-Level Self-Critique Fails
The paper's evaluation of 100 audited reflection blocks found that 90 produced zero genuine issue flags — the model wrote "my reasoning appears correct" regardless of whether it was. This isn't a Claude or GPT-specific finding; it held across all six models tested. The core problem is that the same model that produced the error is being asked to detect it, using the same weights, at the same inference step. The model has no structural incentive to contradict its own prior output.
ReFlect moves the error detection logic outside the model entirely. The harness is deterministic: it checks whether numeric values in step N can be derived from steps 1 through N-1, whether stated conclusions match extracted premises, and whether the current step violates constraints established earlier in the chain.
## The Harness Pattern
Here is a minimal Python 3.12 reproduction of the core error-detection logic. No API keys required — this models the harness structure from the paper's description:
```python
from dataclasses import dataclass
import re


@dataclass
class ReasoningStep:
    step_num: int
    content: str
    numeric_claims: list[float]
    conclusion: str | None


def extract_numeric_claims(text: str) -> list[float]:
    """Pull every standalone number out of a step's text."""
    return [float(m) for m in re.findall(r'\b\d+(?:\.\d+)?\b', text)]


def detect_inconsistency(
    step: ReasoningStep,
    prior_steps: list[ReasoningStep],
) -> str | None:
    """Deterministic cross-step consistency check — the core of ReFlect's harness."""
    if not prior_steps:
        return None
    prior_nums = {n for s in prior_steps for n in s.numeric_claims}
    for num in step.numeric_claims:
        # Flag large values with no grounding in prior steps. The magnitude
        # threshold keeps small derived quantities (percentages, intermediate
        # sums) from producing false positives.
        if num not in prior_nums and abs(num) > 1000:
            return f"Step {step.step_num}: value {num} not grounded in prior steps"
    return None


def reflect_harness(steps: list[str]) -> list[tuple[str, str | None]]:
    parsed: list[ReasoningStep] = []
    results: list[tuple[str, str | None]] = []
    for i, content in enumerate(steps):
        step = ReasoningStep(
            step_num=i + 1,
            content=content,
            numeric_claims=extract_numeric_claims(content),
            # Case-insensitive split so "Therefore" is caught as well.
            conclusion=(
                re.split(r'therefore', content, flags=re.IGNORECASE)[-1].strip()
                if "therefore" in content.lower() else None
            ),
        )
        # Validate against everything accumulated so far, then accumulate.
        error = detect_inconsistency(step, parsed)
        parsed.append(step)
        results.append((content, error))
    return results
```
The key insight is in `detect_inconsistency`: it checks whether a numeric value appearing in step N was established anywhere in steps 1 through N-1. Large values that appear without derivation — a telltale sign of hallucination or arithmetic drift — get flagged, while the magnitude threshold spares small derived quantities (percentages, running totals) from false positives. The harness does not regenerate internally; it passes the error flag back to the caller, which decides whether to retry, branch, or escalate.
Running this against a four-step chain where step 4 introduces an unsupported value of 9500 (while prior context only establishes 25, 4, 100, 10, and 90):
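A minimal driver for that chain (the first three step sentences appear in the demo output; step 4's sentence is my reconstruction, since the scenario only specifies the injected 9500):

```python
if __name__ == "__main__":
    chain = [
        "The widget costs 25 dollars, we have 4 units.",
        "Total cost is therefore 100 dollars.",
        "After 10% discount, final price is 90 dollars.",
        # Step 4 injects an ungrounded value; the wording is illustrative.
        "Projected bulk revenue is therefore 9500 dollars.",
    ]
    print("ReFlect Harness — Step Validation Demo")
    for i, (content, error) in enumerate(reflect_harness(chain), start=1):
        if error:
            print(f"⚠ ERROR: {error}")
        else:
            print(f"✓ OK — Step {i}: {content}")
```

which prints: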
```
ReFlect Harness — Step Validation Demo
✓ OK — Step 1: The widget costs 25 dollars, we have 4 units.
✓ OK — Step 2: Total cost is therefore 100 dollars.
✓ OK — Step 3: After 10% discount, final price is 90 dollars.
⚠ ERROR: Step 4: value 9500.0 not grounded in prior steps
```
The harness correctly surfaces the injection. A full implementation would pass this signal back to the model with the error location and ask it to re-derive from step 3.
## Benchmark Numbers
The paper's evaluation ran across six LLMs on six reasoning domains (arithmetic, symbolic, multi-hop QA, planning, and two others). Success rates against Direct Chain-of-Thought baselines:
| Model | Direct CoT | ReFlect | Gain |
|---|---|---|---|
| GPT-4o-mini | ~34% | ~41% | +7 pp |
| Qwen2.5-72B | ~47% | ~54% | +7 pp |
| Claude Sonnet 4.5 | ~27% | ~56% | +29 pp |
| Average (6 models) | ~35% | ~48% | +13 pp |
The Claude Sonnet 4.5 result is notable: the model's baseline Direct CoT performance is relatively low on these hard multi-step tasks, but the harness unlocks a 29 pp gain — suggesting ReFlect's external correction mechanism pairs especially well with models that tend toward overconfident reasoning chains.
## Where This Fits in a Production Agent
The practical insertion point is between any two sequential LLM calls in an agentic loop. You run step N, extract numeric values and stated conclusions, check consistency with the prior context accumulator, and either proceed (clean) or re-prompt (flagged). The re-prompt can be as simple as:
```
Your prior step stated [CONCLUSION]. The harness detected that [VALUE] does not
follow from the established context. Please re-derive step [N] starting from [LAST_CLEAN_STEP].
```
No model switch, no retraining, no new weights. The harness is a wrapper you add to an existing inference loop.
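Here is a minimal sketch of that loop, reusing `reflect_harness` from above. The `call_llm` stub, the retry budget, and the recovery wording are my assumptions, not the paper's API:

```python
def call_llm(prompt: str) -> str:
    """Placeholder: wire this to your model client (OpenAI, Anthropic, local)."""
    raise NotImplementedError


def run_guarded_step(history: list[str], step_prompt: str, max_retries: int = 2) -> str:
    """Generate one reasoning step and validate it before accepting it.

    On a flagged step, re-prompts with the error location instead of
    silently appending the bad step to the chain.
    """
    prompt = step_prompt
    error = None
    for _ in range(max_retries + 1):
        candidate = call_llm(prompt)
        # Validate the candidate against the accepted chain so far.
        _, error = reflect_harness(history + [candidate])[-1]
        if error is None:
            return candidate  # clean: caller appends it to history
        last_clean = history[-1] if history else "the original task"
        prompt = (
            f"{step_prompt}\n\nYour prior attempt was rejected: {error}. "
            f"Please re-derive this step starting from: {last_clean}"
        )
    # Retry budget exhausted: escalate rather than build on a flagged step.
    raise RuntimeError(f"step failed harness validation: {error}")
```

The caller owns the history accumulator, so branching and escalation strategies stay outside the harness itself, exactly as in the conceptual version above.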
## Limitations
ReFlect's numeric-grounding check works well for quantitative reasoning but is harder to apply cleanly to purely qualitative multi-hop chains. The paper notes that the deterministic validator needs domain-specific rules for non-numeric tasks — "arbitrary string appeared in step 4" is harder to flag than "9500 appeared with no derivation."
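As a rough illustration of what such a domain-specific rule could look like (the patterns below are my own, not from the paper), a validator for qualitative chains might scan for direct negations of earlier claims:

```python
import re

# Illustrative contradiction patterns; a real deployment would need
# domain-specific rules, which is exactly the limitation the paper flags.
NEGATION_PAIRS = [
    (r"\bis valid\b", r"\bis not valid\b"),
    (r"\bis possible\b", r"\bis impossible\b"),
]


def detect_contradiction(step_text: str, prior_text: str) -> str | None:
    """Flag a step that directly negates a claim established earlier."""
    for claim, negation in NEGATION_PAIRS:
        if re.search(claim, prior_text, re.I) and re.search(negation, step_text, re.I):
            return f"contradiction: earlier claim {claim!r} negated by {negation!r}"
    return None
```

Even this toy version shows the problem: every domain needs its own pattern list, where the numeric check needed none.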
Also: no GitHub code has been released as of this writing (2026-05-11). The above implementation is a conceptual reproduction based on the paper's description, not an official release. Watch arXiv:2605.05737 for updates.
## Takeaway for Developers
ReFlect is a good pattern to steal even without the paper's exact implementation. The underlying idea — insert a deterministic validator between LLM steps rather than asking the model to self-critique — addresses a real production failure mode with zero training overhead. If you are building any agent that does more than 3-4 sequential reasoning steps, this approach is worth the wrapper complexity.
The full benchmark results and methodology are in arXiv:2605.05737.