The 70% to 94% Problem: Why Your AI Skills Are Probably Wrong
You built a Claude Skill. It works... sometimes. Maybe 70% of the time. The other 30%? It drifts, hallucinates, or produces something vaguely correct but not what you needed.
You manually review and fix every output. That's not automation. That's expensive autocomplete.
Andrej Karpathy released something called autoresearch last month that solves this exact problem. The technique is deceptively simple, and it works on any prompt, skill, or instruction set where you can define what "good" looks like.
Here's the core insight.
The Method: What Actually Changes Your Hit Rate
Most people try to improve their prompts by adding more instructions. "Make sure to do X." "Don't forget Y." "Always format like Z."
This is the wrong approach.
The autoresearch method flips this: instead of adding rules, you add tests.
You write evaluation functions that check whether the output meets your standards. Then you run your prompt through multiple iterations, measuring which version performs best against your evals.
The optimization happens automatically. You're not guessing what to fix. The tests tell you.
The Three Components You Need
1. A clear definition of "good"
Not a description. An actual test.
❌ "The output should be well-formatted"
✅ "The output must have exactly 5 H2 headings, no paragraph longer than 3 lines, and every bullet point starts with an action verb"
2. Multiple versions of your prompt
You can't optimize what you can't compare. Write 3-5 variants of your skill or prompt, each with slightly different phrasing, structure, or emphasis.
3. A scoring mechanism
For each test input, run all prompt variants, score each output against your eval, and track which variant wins.
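The ✅ example from component 1 can be written directly as a test. Here is a minimal sketch; the regex rules mirror the stated criteria, and the action-verb list is a placeholder assumption you'd replace with your own:

```python
import re

# Placeholder verb list -- substitute the verbs that matter for your domain
ACTION_VERBS = {"build", "write", "run", "test", "measure", "track"}

def eval_formatting(output: str) -> int:
    """Score one output against objectively checkable formatting rules."""
    score = 0
    # Exactly 5 H2 headings
    if len(re.findall(r"^## ", output, flags=re.MULTILINE)) == 5:
        score += 1
    # No paragraph longer than 3 lines (headings and bullets excluded)
    paragraphs = [p for p in output.split("\n\n")
                  if p.strip() and not p.startswith(("#", "-"))]
    if all(len(p.splitlines()) <= 3 for p in paragraphs):
        score += 1
    # Every bullet point starts with an action verb
    bullets = re.findall(r"^- (\w+)", output, flags=re.MULTILINE)
    if bullets and all(b.lower() in ACTION_VERBS for b in bullets):
        score += 1
    return score
```

Every criterion returns a number, not an opinion, which is what makes the comparison in component 3 possible.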
What This Looks Like in Practice
Let's say you have a skill that generates pitch deck outlines. Your current hit rate is 70%. Here's how you improve it:
Step 1: Define your evaluation
def eval_pitch_deck(output):
    score = 0
    # Must include each of these sections
    for section in ("Problem", "Solution", "Market Size", "Business Model", "Ask"):
        if has_section(output, section):
            score += 1
    # "Ask" should land on slide 9-11
    if slide_position(output, "Ask") in range(9, 12):
        score += 1
    # First slide must state the problem
    if first_slide_is_problem(output):
        score += 1
    return score
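The helpers (`has_section`, `slide_position`, `first_slide_is_problem`) depend on your output format, which the article leaves open. A sketch assuming the outline is plain text with one "Slide N: Title" line per slide (that format is an assumption, not part of the original):

```python
import re

def parse_slides(output):
    """Extract (number, title) pairs from lines like 'Slide 3: Market Size'."""
    return [(int(m.group(1)), m.group(2).strip())
            for m in re.finditer(r"^Slide (\d+): (.+)$", output, flags=re.MULTILINE)]

def has_section(output, name):
    return any(name.lower() in title.lower() for _, title in parse_slides(output))

def slide_position(output, name):
    """Slide number of the first matching section, or -1 if absent."""
    for num, title in parse_slides(output):
        if name.lower() in title.lower():
            return num
    return -1

def first_slide_is_problem(output):
    slides = parse_slides(output)
    return bool(slides) and "problem" in slides[0][1].lower()
```

If your skill emits markdown or JSON instead, only these parsers change; the eval function above stays the same.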
Step 2: Generate prompt variants
Create 5 versions of your skill with different instruction phrasings:
- Variant A: Detailed step-by-step
- Variant B: Example-heavy with "do this, not that"
- Variant C: Constraint-focused with explicit negatives
- Variant D: Minimal instructions, maximum examples
- Variant E: Structured template with fill-in-the-blank
Step 3: Run the comparison
Feed 50 sample requests through each variant. Score each output. Track the winner.
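Step 3 is a plain loop: every variant sees every test input, and the mean eval score picks the winner. A sketch, assuming a `call_llm(prompt, request)` wrapper around your API client (that wrapper and its signature are assumptions) plus a scorer like `eval_pitch_deck` from Step 1:

```python
from statistics import mean

def compare_variants(variants, test_inputs, call_llm, evaluate):
    """Run every prompt variant over every test input; return mean scores.

    variants:    dict mapping variant name -> prompt text
    test_inputs: list of sample requests
    call_llm:    callable(prompt, request) -> model output (assumed wrapper)
    evaluate:    callable(output) -> numeric score
    """
    results = {}
    for name, prompt in variants.items():
        scores = [evaluate(call_llm(prompt, req)) for req in test_inputs]
        results[name] = mean(scores)
    return results
```

Then `max(results, key=results.get)` is the variant you ship. With 50 inputs and 5 variants, that's 250 calls total.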
The Results People Are Seeing
Applied systematically, the method produces dramatic improvements:
- 70% → 94% for pitch deck structure skills
- 65% → 91% for MEDDIC qualification frameworks
- 60% → 87% for email tone matching
These aren't hypothetical numbers. They're from real skills optimized with this method.
Why This Works (The Theory)
LLMs are probabilistic. They don't "understand" your instructions the way a human would. They pattern-match against their training data and the context you provide.
When you write a skill, you're not programming. You're steering probability distributions.
The problem is that small changes in phrasing can create large changes in output. "Format nicely" means nothing to a model. "Use H2 headings for each section, bold the first sentence, keep paragraphs under 3 lines" means something specific.
The autoresearch method systematically explores the phrasing space and finds the high-performing regions.
The Cost of Optimization
Running 50 test inputs through 5 prompt variants costs maybe $2-5 in API calls. A few hours of mostly-unattended compute time.
Compare that to:
- Manually reviewing 100 outputs
- Rewriting your skill 10 times by hand
- Never actually knowing if your skill improved
The ROI is absurd.
The Two Mistakes That Waste Everything
Mistake 1: Vague evaluations
If your eval says "check if output is good," you've built nothing. Every criterion must be objectively measurable.
Mistake 2: Testing on clean inputs only
Your skill works on clean, well-formed requests. But in production, you'll get edge cases, malformed inputs, and ambiguous requests. Test those too.
When to Stop Optimizing
The method has a built-in stopping point: diminishing returns.
If variant A scores 7.2 and variant B scores 7.3, the difference is noise. You're done. The optimization has found the plateau.
If variant A scores 6.0 and variant B scores 8.5, keep going. There's more to find.
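Whether 7.2 vs. 7.3 is noise is itself checkable: compare the gap to the spread of per-input score differences. A rough sketch using a paired comparison; the 2-standard-error threshold is a conventional rule of thumb, not something the method prescribes:

```python
from statistics import mean, stdev

def difference_is_noise(scores_a, scores_b):
    """True if the mean score gap between paired runs is within ~2 standard errors."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    standard_error = stdev(diffs) / len(diffs) ** 0.5
    return abs(mean(diffs)) < 2 * standard_error
```

If this returns True, you've hit the plateau; if False, the gap is real and worth chasing.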
What This Means for AI Workflows
We're entering an era where the skill isn't in writing the perfect prompt. The skill is in defining what you want and letting the system find it.
This is the difference between:
- Hand-tuning a model
- Writing an evaluation function
The first requires intuition and expertise. The second requires clarity and precision.
The second scales. The first doesn't.
The Takeaway
If your skills, prompts, or instruction sets work "most of the time," that's not good enough. The gap between 70% and 94% is the gap between "I have to check everything" and "I can trust this."
Write tests. Run comparisons. Let the optimization find the phrasing that works.
Stop guessing. Start measuring.