Andrej Karpathy Just Taught Us How to Make Skills That Actually Work
42,000 GitHub stars in one week. That's how fast Andrej Karpathy's autoresearch method spread.
Not because it's a new model. Not because it's a clever prompt trick.
Because it solves the problem everyone building with AI has: how do you know your agent instructions actually work?
The method he built for optimizing ML training code turns out to be exactly what agent builders need. The pattern is universal: define what good looks like, measure against it, iterate until you get there.
Here's what this means for anyone building Skills, agents, or AI workflows.
The Problem Karpathy Solved
His fundraising Skill followed Sequoia/YC pitch deck format about 70% of the time.
The other 30%? Drift. Missing the traction slide. Burying the ask on page 9. Opening with market size instead of the problem.
His sales Skill nailed MEDDIC qualification two-thirds of the time and produced something vague and generic the rest.
His idea validation Skill sometimes ran the Mom Test framework perfectly and sometimes gave advice you'd get from asking ChatGPT "how to validate a startup idea."
He was manually reviewing and fixing every output.
That's not an AI operating system. That's fancy autocomplete with extra steps.
The Autoresearch Method Applied to Skills
Karpathy's insight: the same pattern that optimizes ML training runs can optimize agent instructions.
Instead of hoping your Skill works, you run it through controlled evaluations:
1. Define eval cases
Create test inputs with expected outputs. For a pitch deck Skill:
- Input: "Create a pitch deck for my B2B SaaS startup"
- Expected: Sequoia/YC hybrid format, 10-12 slides, specific sections in specific order
2. Run the Skill against evals
Fire each test case through the Skill. Capture every output.
3. Measure pass/fail rates
Not "does it look good?" but "does it match the expected output format?"
- Did the traction slide appear?
- Is the ask on slide 2?
- Does the deck avoid opening with market size?
4. Iterate until clean
Fix the instructions. Rerun. Repeat until pass rate hits your target.
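The four steps can be sketched in a few dozen lines. This is an illustrative sketch, not Karpathy's actual harness: `run_skill` is a hypothetical stand-in for whatever invokes your Skill (an API call, a CLI, an agent framework), and the checks are toy predicates.

```python
# Sketch of the eval loop. `run_skill` is a hypothetical placeholder for
# whatever invokes your Skill and returns its output as text.
def run_skill(prompt: str) -> str:
    # Placeholder: call your agent here. Stubbed for illustration.
    return "problem ... traction ... ask"

# Each eval case pairs an input with named checks on the output.
EVALS = [
    {
        "input": "Create a pitch deck for my B2B SaaS startup",
        "checks": {
            "has_traction": lambda out: "traction" in out.lower(),
            "has_ask": lambda out: "ask" in out.lower(),
            "opens_with_problem": lambda out: out.lower().startswith("problem"),
        },
    },
]

def run_evals(evals):
    """Run every case; return overall pass rate and per-check failures."""
    passed, failures = 0, []
    for case in evals:
        output = run_skill(case["input"])
        failed = [name for name, check in case["checks"].items()
                  if not check(output)]
        if failed:
            failures.append((case["input"], failed))
        else:
            passed += 1
    return passed / len(evals), failures

rate, failures = run_evals(EVALS)
print(f"pass rate: {rate:.0%}")
```

The failure list is the point: "has_traction failed on 3 of 10 cases" tells you exactly which instruction to tighten before the next run.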
The fundraising Skill went from 70% to 94% accuracy. The sales Skill jumped from 65% to 91% MEDDIC compliance.
Each optimization run took hours, mostly unattended, and cost less than a coffee.
Why This Matters More Than You Think
Most agent development follows a demo cycle:
- Write instructions
- Test manually on 3-5 examples
- Tweak when it fails
- Ship when it "works"
This is like shipping code without tests. Except your code can change its behavior based on subtle context shifts you didn't anticipate.
The agent that followed instructions perfectly in test might interpret them differently when:
- The user asks a question slightly differently
- The context window fills with conversation history
- A new model version ships with different biases
- The task parameters shift
You won't know until you're debugging a production incident.
The A/B Test You Should Be Running
Here's the uncomfortable truth Karpathy discovered.
Your Skill might be making outputs worse.
He benchmarked his Skills against raw Claude. In some cases, raw Claude (no instructions) outperformed the carefully-crafted Skill.
Why? Instructions written for an older, less capable model now constrain the newer model unnecessarily. The Skill that helped Claude 3.5 is actively limiting Claude 4.
The solution is continuous A/B benchmarking:
- Run the same task set with your Skill loaded
- Run the same task set with raw model (no Skill)
- Compare outputs blind
- If raw wins, your Skill needs rewrites—or retirement
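A minimal sketch of the blind comparison, under stated assumptions: `run_with_skill` and `run_raw` are hypothetical runners, and a human (or judge model) picks winners without knowing which output is which.

```python
import random

# Hypothetical runners: one with the Skill loaded, one with the raw model.
def run_with_skill(task: str) -> str:
    return f"[skill] {task}"  # stub for illustration

def run_raw(task: str) -> str:
    return f"[raw] {task}"  # stub for illustration

def blind_pairs(tasks, seed=0):
    """Produce shuffled (label, output) pairs so the reviewer cannot tell
    which output came from the Skill until after judging."""
    rng = random.Random(seed)
    pairs = []
    for task in tasks:
        candidates = [("skill", run_with_skill(task)),
                      ("raw", run_raw(task))]
        rng.shuffle(candidates)  # hide which side is which
        pairs.append((task, candidates))
    return pairs

def tally(judgments):
    """judgments: list of 'skill' or 'raw' winners, chosen blind."""
    wins = {"skill": 0, "raw": 0}
    for winner in judgments:
        wins[winner] += 1
    return wins

pairs = blind_pairs(["Draft a cold outreach email", "Summarize this call"])
# After blind review, tally the winners. If raw wins, rewrite or retire.
print(tally(["skill", "raw"]))
```

The shuffle is what makes the test honest: reviewers who know which output has the Skill loaded reliably favor it.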
Most teams never do this. They assume their instructions help because they helped once.
The Description Problem
The most common failure: the Skill doesn't activate when it should.
You build a sophisticated agent. Test thoroughly. Ship. Users type requests that should trigger it.
Nothing happens.
The agent sits dormant because the description—what tells the system when to use this Skill—doesn't match how users actually ask.
You wrote "Email Drafter." Users type "write a message to my team."
You wrote "Data Cleaner." Users type "fix this spreadsheet."
The optimization loop:
- Generate test prompts that should trigger your Skill
- Generate test prompts that should NOT trigger your Skill
- Measure activation accuracy
- Rewrite descriptions to improve triggering
- Repeat until the right agent fires for the right requests
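That loop can also be scripted. In this sketch, `router_activates` stands in for however your platform decides whether a Skill fires for a prompt; here it is a naive keyword match purely for illustration, and the prompt sets are toy examples.

```python
# Sketch of measuring activation accuracy. `router_activates` is a naive
# stand-in for the real routing decision, included only for illustration.
def router_activates(prompt: str, description: str) -> bool:
    return any(word in prompt.lower() for word in description.lower().split())

DESCRIPTION = "draft write email message team outreach"

SHOULD_TRIGGER = ["write a message to my team", "draft an outreach email"]
SHOULD_NOT_TRIGGER = ["fix this spreadsheet", "summarize this PDF"]

def activation_accuracy(description):
    """Fraction of prompts routed correctly: fires on the positives,
    stays quiet on the negatives."""
    hits = sum(router_activates(p, description) for p in SHOULD_TRIGGER)
    misses = sum(not router_activates(p, description)
                 for p in SHOULD_NOT_TRIGGER)
    total = len(SHOULD_TRIGGER) + len(SHOULD_NOT_TRIGGER)
    return (hits + misses) / total

print(f"activation accuracy: {activation_accuracy(DESCRIPTION):.0%}")
```

Note that the negative set matters as much as the positive one: a description that triggers on everything scores perfectly on positives and still ruins the product.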
What Production-Grade Skill Development Looks Like
Teams winning with Skills follow a different playbook.
Before shipping:
- Define eval cases with expected outputs
- Run the Skill against the eval suite
- Measure pass rate, not "it looks good"
- A/B benchmark against raw model
- Optimize descriptions for triggering accuracy
After shipping:
- Log every Skill activation
- Sample outputs for manual review weekly
- Track drift metrics (output consistency over time)
- Re-run evals after model updates
- Retire instructions that stop helping
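The drift-tracking item reduces to logging each eval run's pass rate and flagging when the current rate falls meaningfully below the recent baseline. A sketch, with illustrative thresholds rather than recommended ones:

```python
from statistics import mean

def check_drift(history, current, window=5, tolerance=0.05):
    """history: past pass rates, oldest first. Flag drift when `current`
    falls more than `tolerance` below the mean of the last `window` runs.
    Thresholds are illustrative, not a recommendation."""
    if len(history) < window:
        return False  # not enough data to establish a baseline
    baseline = mean(history[-window:])
    return current < baseline - tolerance

history = [0.94, 0.93, 0.95, 0.94, 0.92]
print(check_drift(history, 0.84))  # well below baseline: drift flagged
```

Run this after every eval cycle, and especially after a model update: a quiet drop from 94% to 84% is exactly the regression the checklist exists to catch.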
The Practical Setup
Want to try this? The pattern translates directly to any AI workflow.
Create an evals file:
{
  "evals": [
    {
      "input": "Create a pitch deck for my fintech startup",
      "expected_format": "sequoia_yc_hybrid",
      "must_include": ["traction", "ask", "problem"],
      "must_not_open_with": "market_size"
    }
  ]
}
Run evaluations:
Point Claude at your evals file. Tell it to run each input through your Skill, check outputs against expectations, and report pass/fail rates.
Iterate systematically:
When the eval shows "traction slide missing 30% of the time," add explicit instruction: "The traction slide MUST appear between slides 6-8. Never skip it."
Rerun. Check if pass rate improved.
The Takeaway
Demos sell agents. Evaluations keep them working.
The gap between "I built a Skill" and "I built a Skill that works 99% of the time" is the gap between "it compiled" and "it passed all tests."
Karpathy's method is what separates people who dabble from people who run entire workflows on autopilot:
- Define what good looks like
- Measure against it systematically
- Iterate until you hit the target
- Re-test when models change
The autoresearch method took optimization from ML training to agent instructions. The pattern is the same. The leverage is different.
Your agent worked in the demo. Does it still work now?
The only way to know is to measure.
The teams that adopt these practices will ship agents that stay reliable. The ones that don't will debug production incidents forever. Karpathy showed us the method. Now we have to use it.