Shinsuke KAGAWA

Letting LLMs Jump — and Then Verifying Ruthlessly

The "First Plausible Answer" Problem

You've probably seen this: you ask an LLM to investigate a bug, and it latches onto the first plausible explanation. It confidently proposes a fix before thoroughly exploring alternatives. Sometimes it works. Often it doesn't—and you're left debugging the debugger.

I ran into this repeatedly in my personal projects. The LLM would find something that looked like the cause, stop investigating, and immediately suggest a solution. When the codebase was small, this worked fine. As it grew, I started getting fixes that didn't actually fix anything.

To be clear: this approach is not for small scripts or simple bugs. I only started needing it once my codebase grew large enough that "just try a fix" stopped working.

The root issue? How I was defining the task's purpose.

Planning works well when the problem is understood. But when the problem itself is unclear, planning alone is not enough. This article focuses on those cases.

The Factor That Made the Difference: Purpose

When delegating tasks to LLMs, two factors affect execution accuracy: Context (staying within ~70% of the context window) and Purpose (how you define the task's goal).

Context management matters, but this article focuses on the second factor—because that's where I was getting it wrong.

Where you set the task's goal matters more than you might think. The purpose you define determines the task granularity, and the right granularity depends on your codebase complexity.

A Real Example: Bug Investigation

The Old Approach

A single session handling "Investigation → Solution Proposal → Verification," followed by a separate review session.

What I Changed

My original goal was simple: "propose a solution" and "review it objectively."

Originally, I'd just have the LLM investigate, propose a fix, and implement it directly. But as the codebase grew, I started getting solutions that didn't actually work. So I added a review step—opening a fresh session to check the proposal with clean context.

This worked for about 60-70% of problems, but occasionally even this approach couldn't reach the root cause no matter how many iterations I ran.

Here's what I changed:

  1. Problem Structuring: Structure my instructions upfront to make them easier for LLMs to parse in later steps
  2. Investigation: Conduct comprehensive investigation and report results
  3. Verification: If there's uncertainty in the report, perform additional verification
  4. Solution Derivation: Receive investigation and verification results, then derive solutions

By setting "investigation" as the purpose, the model stopped jumping to the first candidate and instead collected information from multiple angles.
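The four steps above can be sketched as separate sessions that exchange only structured output. This is a minimal sketch, not the actual implementation: `run_session` is a hypothetical stand-in for launching a fresh LLM session (in practice, a Claude Code sub-agent invocation), and the step names are illustrative.

```python
from typing import Callable

# Hypothetical stand-in for launching a fresh LLM session with a prompt and
# a structured payload; in practice this would invoke a Claude Code sub-agent.
Session = Callable[[str, dict], dict]

def diagnose(problem: str, run_session: Session) -> dict:
    # Step 1: Problem Structuring -- make instructions parseable downstream
    structured = run_session("structure", {"problem": problem})
    # Step 2: Investigation -- evidence and hypotheses only, no solutions
    findings = run_session("investigate", structured)
    # Step 3: Verification -- a fresh session tries to refute the findings
    verified = run_session("verify", findings)
    # Step 4: Solution Derivation -- only now are solutions proposed
    return run_session("solve", verified)
```

Each step receives only the previous step's JSON, so no step inherits the earlier session's reasoning or its premature conclusions.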

Implementation Example

This setup is probably overkill for small scripts. I only started doing this after my codebase crossed a certain complexity threshold.

Here's how I structured the diagnosis workflow using Claude Code's slash commands and sub-agents. Full implementation is available at github.com/shinpr/claude-code-workflows.

Main Command (diagnose.md)

```markdown
---
description: "Investigate problem, verify findings, and derive solutions"
---

**Command Context**: Diagnosis flow to identify root cause and present solutions

Target problem: $ARGUMENTS

## Step 0: Problem Structuring (Before investigator invocation)

### 0.1 Problem Type Determination

| Type | Criteria |
|------|----------|
| Change Failure | Indicates some change occurred before the problem appeared |
| New Discovery | No relation to changes is indicated |

### 0.2 Information Supplementation for Change Failures

If the following are unclear, **ask with AskUserQuestion** before proceeding:
- What was changed (cause change)
- What broke (affected area)
- Relationship between both (shared components, etc.)

## Diagnosis Flow Overview

The goal of investigation is not to propose solutions.
It is to eliminate wrong explanations.

**Context Separation**: Pass only structured JSON output to each step.
Each step starts fresh with the JSON data only.
```

Sub-agent: Investigator

Think of the Investigator as a junior engineer whose only job is to gather facts—not to be clever. This is one concrete implementation; what matters is the separation of purpose, not the specific tooling. The Investigator's scope is explicitly limited to evidence collection only—no solutions:

```markdown
## Output Scope

This agent outputs **evidence matrix and factual observations only**.
Solution derivation is out of scope for this agent.

## Core Responsibilities

1. **Cross-check multiple sources** - Don't rely on a single source
2. **Search external info (WebSearch)** - Official docs, Stack Overflow, GitHub Issues
3. **List hypotheses and trace causes** - Multiple candidates, not just the first one
4. **Identify impact scope** - Where else might this pattern exist?
5. **Disclose blind spots** - Honestly report areas that could not be investigated
```

Key output structure:

```json
{
  "hypotheses": [
    {
      "id": "H1",
      "description": "Hypothesis description",
      "causeCategory": "typo|logic_error|missing_constraint|design_gap|external_factor",
      "causalChain": ["Phenomenon", "→ Direct cause", "→ Root cause"],
      "supportingEvidence": [...],
      "contradictingEvidence": [...],
      "unexploredAspects": ["Unverified aspects"]
    }
  ],
  "comparisonAnalysis": {
    "normalImplementation": "Path to working implementation",
    "failingImplementation": "Path to problematic implementation",
    "keyDifferences": ["Differences"]
  }
}
```

Sub-agent: Verifier

The Verifier plays the annoying senior reviewer who assumes everything is wrong. It actively seeks refutation:

```markdown
## Core Responsibilities

1. **Cross-check multiple sources** - Explore information sources not covered
2. **Generate alternative hypotheses** - What else could explain this?
3. **Play devil's advocate** - Assume "the investigation results are wrong"
4. **Pick the hypothesis with fewest holes** - Not "most evidence," but "least refuted"
```
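The "fewest holes" rule can be expressed as a selection over the Investigator's JSON: rank hypotheses by how little contradicting evidence survived verification, not by how much supporting evidence they collected. This is a hypothetical sketch of that rule, using the field names from the output structure above:

```python
# "Least refuted" selection: prefer the hypothesis with the fewest pieces of
# contradicting evidence, breaking ties by fewest unexplored aspects.
# Supporting-evidence counts are deliberately ignored.
def pick_least_refuted(hypotheses: list[dict]) -> dict:
    return min(
        hypotheses,
        key=lambda h: (len(h.get("contradictingEvidence", [])),
                       len(h.get("unexploredAspects", []))),
    )
```

Note that a hypothesis with three supporting facts and one refutation loses to a hypothesis with one supporting fact and no refutations—which is exactly the point.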

Sub-agent: Solver

The Solver is the engineer who actually has to ship something. Only after verification does it derive solutions:

```markdown
## Output Scope

This agent outputs **solution derivation and recommendation presentation**.
Trust the given conclusion and proceed directly to solution derivation.

## Core Responsibilities

1. **Multiple solution generation** - At least 3 different approaches
2. **Tradeoff analysis** - Cost, risk, impact scope, maintainability
3. **Recommendation selection** - Optimal solution with selection rationale
4. **Implementation steps presentation** - Concrete, actionable steps
```

Practical Guidelines

When designing LLM tasks, I now check two things:

  1. Purpose Clarity - "Don't create tasks with unclear purposes"
  2. Context Efficiency - Can it be completed in one session with sufficient information? (Ideally using 60-70% of the context window)

I don't blindly split tasks into smaller pieces. Instead, I consider ROI and break down from larger tasks only when necessary.
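The context-efficiency check can be approximated before launching a session. This sketch uses the rough ~4-characters-per-token heuristic; the window size, the ratio, and the heuristic itself are all ballpark assumptions:

```python
# Rough pre-flight check for the 60-70% context-budget guideline.
# The 4-chars-per-token estimate and the defaults are assumptions.
def fits_in_budget(task_materials: list[str],
                   context_window_tokens: int = 200_000,
                   budget_ratio: float = 0.7) -> bool:
    estimated_tokens = sum(len(text) for text in task_materials) // 4
    return estimated_tokens <= context_window_tokens * budget_ratio
```

If a task's materials blow the budget, that is the signal to break it down—not an arbitrary preference for small tasks.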

By explicitly separating "investigation" from "solution," you prevent the model from rushing to conclusions before it has gathered sufficient evidence.

A Lesson I Learned the Hard Way

Early on, I made the Verifier run every single time. The problem? Even when the investigation was clearly off track, the Verifier would dutifully try to verify nonsense.

That's when I realized: you need a quality gate between steps, not just separation.

Now I have a checkpoint between Investigation and Verification. If the investigation output doesn't meet basic quality criteria (missing comparison analysis, shallow causal chains, etc.), it loops back instead of wasting cycles on verification.
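The checkpoint can be sketched as a predicate plus a retry loop. The specific criteria below (presence of the comparison analysis, causal chains of at least three links) mirror the examples mentioned above, but the thresholds and function names are hypothetical:

```python
# Hypothetical quality gate between Investigation and Verification.
def gate_ok(report: dict) -> bool:
    if "comparisonAnalysis" not in report:
        return False  # missing comparison analysis
    chains = [h.get("causalChain", []) for h in report.get("hypotheses", [])]
    # Reject shallow chains: phenomenon -> direct cause -> root cause minimum
    return bool(chains) and all(len(c) >= 3 for c in chains)

def investigate_with_gate(run_investigation, max_attempts: int = 3) -> dict:
    # Loop back to investigation instead of verifying a low-quality report
    for _ in range(max_attempts):
        report = run_investigation()
        if gate_ok(report):
            return report
    raise RuntimeError("investigation never passed the quality gate")
```

The point is that the Verifier only ever sees reports that cleared the gate, so its refutation effort is spent on plausible findings rather than nonsense.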

I also added Step 0 (Problem Structuring) to help the LLM understand my intent better before diving in. These two changes—quality gates and upfront structuring—made the whole pipeline actually usable.
