The 70% to 94% Problem: Why Your AI Skills Are Probably Wrong
You built a Claude Skill. It works... sometimes. Maybe 70% of the time. The other 30%? It drifts, hallucinates, or produces something vaguely correct but not what you needed.
You manually review and fix every output. That's not automation. That's expensive autocomplete.
Andrej Karpathy released something called autoresearch last month that solves this exact problem. The technique is deceptively simple, and it works on any prompt, skill, or instruction set where you can define what "good" looks like.
Here's the core insight.
The Method: What Actually Changes Your Hit Rate
Most people try to improve their prompts by adding more instructions. "Make sure to do X." "Don't forget Y." "Always format like Z."
This is the wrong approach.
The autoresearch method flips this: instead of adding rules, you add tests.
You write evaluation functions that check whether the output meets your standards. Then you run your prompt through multiple iterations, measuring which version performs best against your evals.
The optimization happens automatically. You're not guessing what to fix. The tests tell you.
The Three Components You Need
1. A clear definition of "good"
Not a description. An actual test.
❌ "The output should be well-formatted"
✅ "The output must have exactly 5 H2 headings, no paragraph longer than 3 lines, and every bullet point starts with an action verb"
2. Multiple versions of your prompt
You can't optimize what you can't compare. Write 3-5 variants of your skill or prompt, each with slightly different phrasing, structure, or emphasis.
3. A scoring mechanism
For each test input, run all prompt variants, score each output against your eval, and track which variant wins.
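The ✅ example from component 1 can be written directly as a test. Here is a minimal sketch; the regex rules mirror the stated criteria, and the action-verb list is a placeholder assumption you'd replace with your own:

```python
import re

# Placeholder verb list -- substitute the verbs that matter for your domain
ACTION_VERBS = {"build", "write", "run", "test", "measure", "track"}

def eval_formatting(output: str) -> int:
    """Score one output against objectively checkable formatting rules."""
    score = 0
    # Exactly 5 H2 headings
    if len(re.findall(r"^## ", output, flags=re.MULTILINE)) == 5:
        score += 1
    # No paragraph longer than 3 lines (headings and bullets excluded)
    paragraphs = [p for p in output.split("\n\n")
                  if p.strip() and not p.startswith(("#", "-"))]
    if all(len(p.splitlines()) <= 3 for p in paragraphs):
        score += 1
    # Every bullet point starts with an action verb
    bullets = re.findall(r"^- (\w+)", output, flags=re.MULTILINE)
    if bullets and all(b.lower() in ACTION_VERBS for b in bullets):
        score += 1
    return score
```

Every criterion returns a number, not an opinion, which is what makes the comparison in component 3 possible.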
What This Looks Like in Practice
Let's say you have a skill that generates pitch deck outlines. Your current hit rate is 70%. Here's how you improve it:
Step 1: Define your evaluation
def eval_pitch_deck(output):
    score = 0
    # Must include each of these sections
    for section in ("Problem", "Solution", "Market Size", "Business Model", "Ask"):
        if has_section(output, section):
            score += 1
    # "Ask" should land on slide 9-11
    if slide_position(output, "Ask") in range(9, 12):
        score += 1
    # First slide must state the problem
    if first_slide_is_problem(output):
        score += 1
    return score
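The helpers (`has_section`, `slide_position`, `first_slide_is_problem`) depend on your output format, which the article leaves open. A sketch assuming the outline is plain text with one "Slide N: Title" line per slide (that format is an assumption, not part of the original):

```python
import re

def parse_slides(output):
    """Extract (number, title) pairs from lines like 'Slide 3: Market Size'."""
    return [(int(m.group(1)), m.group(2).strip())
            for m in re.finditer(r"^Slide (\d+): (.+)$", output, flags=re.MULTILINE)]

def has_section(output, name):
    return any(name.lower() in title.lower() for _, title in parse_slides(output))

def slide_position(output, name):
    """Slide number of the first matching section, or -1 if absent."""
    for num, title in parse_slides(output):
        if name.lower() in title.lower():
            return num
    return -1

def first_slide_is_problem(output):
    slides = parse_slides(output)
    return bool(slides) and "problem" in slides[0][1].lower()
```

If your skill emits markdown or JSON instead, only these parsers change; the eval function above stays the same.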
Step 2: Generate prompt variants
Create 5 versions of your skill with different instruction phrasings:
- Variant A: Detailed step-by-step
- Variant B: Example-heavy with "do this, not that"
- Variant C: Constraint-focused with explicit negatives
- Variant D: Minimal instructions, maximum examples
- Variant E: Structured template with fill-in-the-blank
Step 3: Run the comparison
Feed 50 sample requests through each variant. Score each output. Track the winner.
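Step 3 is a plain loop: every variant sees every test input, and the mean eval score picks the winner. A sketch, assuming a `call_llm(prompt, request)` wrapper around your API client (that wrapper and its signature are assumptions) plus a scorer like `eval_pitch_deck` from Step 1:

```python
from statistics import mean

def compare_variants(variants, test_inputs, call_llm, evaluate):
    """Run every prompt variant over every test input; return mean scores.

    variants:    dict mapping variant name -> prompt text
    test_inputs: list of sample requests
    call_llm:    callable(prompt, request) -> model output (assumed wrapper)
    evaluate:    callable(output) -> numeric score
    """
    results = {}
    for name, prompt in variants.items():
        scores = [evaluate(call_llm(prompt, req)) for req in test_inputs]
        results[name] = mean(scores)
    return results
```

Then `max(results, key=results.get)` is the variant you ship. With 50 inputs and 5 variants, that's 250 calls total.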
The Results People Are Seeing
Applied systematically, the method produces dramatic improvements:
- 70% → 94% for pitch deck structure skills
- 65% → 91% for MEDDIC qualification frameworks
- 60% → 87% for email tone matching
These aren't hypothetical numbers. They're from real skills optimized with this method.
Why This Works (The Theory)
LLMs are probabilistic. They don't "understand" your instructions the way a human would. They pattern-match against their training data and the context you provide.
When you write a skill, you're not programming. You're steering probability distributions.
The problem is that small changes in phrasing can create large changes in output. "Format nicely" means nothing to a model. "Use H2 headings for each section, bold the first sentence, keep paragraphs under 3 lines" means something specific.
The autoresearch method systematically explores the phrasing space and finds the high-performing regions.
The Cost of Optimization
Running 50 test inputs through 5 prompt variants costs maybe $2-5 in API calls. A few hours of mostly-unattended compute time.
Compare that to:
- Manually reviewing 100 outputs
- Rewriting your skill 10 times by hand
- Never actually knowing if your skill improved
The ROI is absurd.
The Two Mistakes That Waste Everything
Mistake 1: Vague evaluations
If your eval says "check if output is good," you've built nothing. Every criterion must be objectively measurable.
Mistake 2: Testing on clean inputs only
Your skill works on clean, well-formed requests. But in production, you'll get edge cases, malformed inputs, and ambiguous requests. Test those too.
When to Stop Optimizing
The method has a built-in stopping point: diminishing returns.
If variant A scores 7.2 and variant B scores 7.3, the difference is noise. You're done. The optimization has found the plateau.
If variant A scores 6.0 and variant B scores 8.5, keep going. There's more to find.
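Whether 7.2 vs. 7.3 is noise is itself checkable: compare the gap to the spread of per-input score differences. A rough sketch using a paired comparison; the 2-standard-error threshold is a conventional rule of thumb, not something the method prescribes:

```python
from statistics import mean, stdev

def difference_is_noise(scores_a, scores_b):
    """True if the mean score gap between paired runs is within ~2 standard errors."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    standard_error = stdev(diffs) / len(diffs) ** 0.5
    return abs(mean(diffs)) < 2 * standard_error
```

If this returns True, you've hit the plateau; if False, the gap is real and worth chasing.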
What This Means for AI Workflows
We're entering an era where the skill isn't in writing the perfect prompt. The skill is in defining what you want and letting the system find it.
This is the difference between:
- Hand-tuning a model
- Writing an evaluation function
The first requires intuition and expertise. The second requires clarity and precision.
The second scales. The first doesn't.
The Takeaway
If your skills, prompts, or instruction sets work "most of the time," that's not good enough. The gap between 70% and 94% is the gap between "I have to check everything" and "I can trust this."
Write tests. Run comparisons. Let the optimization find the phrasing that works.
Stop guessing. Start measuring.