Andrej Karpathy Just Taught Us How to Make Skills That Actually Work
42,000 GitHub stars in one week. That's how fast Andrej Karpathy's autoresearch method spread.
Not because it's a new model. Not because it's a clever prompt trick.
Because it solves the problem everyone building with AI has: how do you know your agent instructions actually work?
The method he built for optimizing ML training code turns out to be exactly what agent builders need. The pattern is universal: define what good looks like, measure against it, iterate until you get there.
Here's what this means for anyone building Skills, agents, or AI workflows.
The Problem Karpathy Solved
His fundraising Skill followed Sequoia/YC pitch deck format about 70% of the time.
The other 30%? Drift. Missing the traction slide. Burying the ask on page 9. Opening with market size instead of the problem.
His sales Skill nailed MEDDIC qualification two-thirds of the time and produced something vague and generic the rest.
His idea validation Skill sometimes ran the Mom Test framework perfectly and sometimes gave advice you'd get from asking ChatGPT "how to validate a startup idea."
He was manually reviewing and fixing every output.
That's not an AI operating system. That's fancy autocomplete with extra steps.
The Autoresearch Method Applied to Skills
Karpathy's insight: the same pattern that optimizes ML training runs can optimize agent instructions.
Instead of hoping your Skill works, you run it through controlled evaluations:
1. Define eval cases
Create test inputs with expected outputs. For a pitch deck Skill:
- Input: "Create a pitch deck for my B2B SaaS startup"
- Expected: Sequoia/YC hybrid format, 10-12 slides, specific sections in specific order
2. Run the Skill against evals
Fire each test case through the Skill. Capture every output.
3. Measure pass/fail rates
Not "does it look good?" but "does it match the expected output format?"
- Did the traction slide appear?
- Is the ask on slide 2?
- Does the deck avoid opening with market size?
4. Iterate until clean
Fix the instructions. Rerun. Repeat until pass rate hits your target.
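The four steps can be sketched in a few dozen lines. This is an illustrative sketch, not Karpathy's actual harness: `run_skill` is a hypothetical stand-in for whatever invokes your Skill (an API call, a CLI, an agent framework), and the checks are toy predicates.

```python
# Sketch of the eval loop. `run_skill` is a hypothetical placeholder for
# whatever invokes your Skill and returns its output as text.
def run_skill(prompt: str) -> str:
    # Placeholder: call your agent here. Stubbed for illustration.
    return "problem ... traction ... ask"

# Each eval case pairs an input with named checks on the output.
EVALS = [
    {
        "input": "Create a pitch deck for my B2B SaaS startup",
        "checks": {
            "has_traction": lambda out: "traction" in out.lower(),
            "has_ask": lambda out: "ask" in out.lower(),
            "opens_with_problem": lambda out: out.lower().startswith("problem"),
        },
    },
]

def run_evals(evals):
    """Run every case; return overall pass rate and per-check failures."""
    passed, failures = 0, []
    for case in evals:
        output = run_skill(case["input"])
        failed = [name for name, check in case["checks"].items()
                  if not check(output)]
        if failed:
            failures.append((case["input"], failed))
        else:
            passed += 1
    return passed / len(evals), failures

rate, failures = run_evals(EVALS)
print(f"pass rate: {rate:.0%}")
```

The failure list is the point: "has_traction failed on 3 of 10 cases" tells you exactly which instruction to tighten before the next run.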
The fundraising Skill went from 70% to 94% accuracy. The sales Skill jumped from 65% to 91% MEDDIC compliance.
Each optimization run took hours, mostly unattended, and cost less than a coffee.
Why This Matters More Than You Think
Most agent development follows a demo cycle:
- Write instructions
- Test manually on 3-5 examples
- Tweak when it fails
- Ship when it "works"
This is like shipping code without tests. Except your code can change its behavior based on subtle context shifts you didn't anticipate.
The agent that followed instructions perfectly in test might interpret them differently when:
- The user asks a question slightly differently
- The context window fills with conversation history
- A new model version ships with different biases
- The task parameters shift
You won't know until you're debugging a production incident.
The A/B Test You Should Be Running
Here's the uncomfortable truth Karpathy discovered.
Your Skill might be making outputs worse.
He benchmarked his Skills against raw Claude. In some cases, raw Claude (no instructions) outperformed the carefully-crafted Skill.
Why? Instructions written for an older, less capable model now constrain the newer model unnecessarily. The Skill that helped Claude 3.5 is actively limiting Claude 4.
The solution is continuous A/B benchmarking:
- Run the same task set with your Skill loaded
- Run the same task set with raw model (no Skill)
- Compare outputs blind
- If raw wins, your Skill needs rewrites—or retirement
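A minimal sketch of the blind comparison, under stated assumptions: `run_with_skill` and `run_raw` are hypothetical runners, and a human (or judge model) picks winners without knowing which output is which.

```python
import random

# Hypothetical runners: one with the Skill loaded, one with the raw model.
def run_with_skill(task: str) -> str:
    return f"[skill] {task}"  # stub for illustration

def run_raw(task: str) -> str:
    return f"[raw] {task}"  # stub for illustration

def blind_pairs(tasks, seed=0):
    """Produce shuffled (label, output) pairs so the reviewer cannot tell
    which output came from the Skill until after judging."""
    rng = random.Random(seed)
    pairs = []
    for task in tasks:
        candidates = [("skill", run_with_skill(task)),
                      ("raw", run_raw(task))]
        rng.shuffle(candidates)  # hide which side is which
        pairs.append((task, candidates))
    return pairs

def tally(judgments):
    """judgments: list of 'skill' or 'raw' winners, chosen blind."""
    wins = {"skill": 0, "raw": 0}
    for winner in judgments:
        wins[winner] += 1
    return wins

pairs = blind_pairs(["Draft a cold outreach email", "Summarize this call"])
# After blind review, tally the winners. If raw wins, rewrite or retire.
print(tally(["skill", "raw"]))
```

The shuffle is what makes the test honest: reviewers who know which output has the Skill loaded reliably favor it.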
Most teams never do this. They assume their instructions help because they helped once.
The Description Problem
The most common failure: the Skill doesn't activate when it should.
You build a sophisticated agent. Test thoroughly. Ship. Users type requests that should trigger it.
Nothing happens.
The agent sits dormant because the description—what tells the system when to use this Skill—doesn't match how users actually ask.
You wrote "Email Drafter." Users type "write a message to my team."
You wrote "Data Cleaner." Users type "fix this spreadsheet."
The optimization loop:
- Generate test prompts that should trigger your Skill
- Generate test prompts that should NOT trigger your Skill
- Measure activation accuracy
- Rewrite descriptions to improve triggering
- Repeat until the right agent fires for the right requests
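That loop can also be scripted. In this sketch, `router_activates` stands in for however your platform decides whether a Skill fires for a prompt; here it is a naive keyword match purely for illustration, and the prompt sets are toy examples.

```python
# Sketch of measuring activation accuracy. `router_activates` is a naive
# stand-in for the real routing decision, included only for illustration.
def router_activates(prompt: str, description: str) -> bool:
    return any(word in prompt.lower() for word in description.lower().split())

DESCRIPTION = "draft write email message team outreach"

SHOULD_TRIGGER = ["write a message to my team", "draft an outreach email"]
SHOULD_NOT_TRIGGER = ["fix this spreadsheet", "summarize this PDF"]

def activation_accuracy(description):
    """Fraction of prompts routed correctly: fires on the positives,
    stays quiet on the negatives."""
    hits = sum(router_activates(p, description) for p in SHOULD_TRIGGER)
    misses = sum(not router_activates(p, description)
                 for p in SHOULD_NOT_TRIGGER)
    total = len(SHOULD_TRIGGER) + len(SHOULD_NOT_TRIGGER)
    return (hits + misses) / total

print(f"activation accuracy: {activation_accuracy(DESCRIPTION):.0%}")
```

Note that the negative set matters as much as the positive one: a description that triggers on everything scores perfectly on positives and still ruins the product.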
What Production-Grade Skill Development Looks Like
Teams winning with Skills follow a different playbook.
Before shipping:
- Define eval cases with expected outputs
- Run the Skill against the eval suite
- Measure pass rate, not "it looks good"
- A/B benchmark against raw model
- Optimize descriptions for triggering accuracy
After shipping:
- Log every Skill activation
- Sample outputs for manual review weekly
- Track drift metrics (output consistency over time)
- Re-run evals after model updates
- Retire instructions that stop helping
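The drift-tracking item reduces to logging each eval run's pass rate and flagging when the current rate falls meaningfully below the recent baseline. A sketch, with illustrative thresholds rather than recommended ones:

```python
from statistics import mean

def check_drift(history, current, window=5, tolerance=0.05):
    """history: past pass rates, oldest first. Flag drift when `current`
    falls more than `tolerance` below the mean of the last `window` runs.
    Thresholds are illustrative, not a recommendation."""
    if len(history) < window:
        return False  # not enough data to establish a baseline
    baseline = mean(history[-window:])
    return current < baseline - tolerance

history = [0.94, 0.93, 0.95, 0.94, 0.92]
print(check_drift(history, 0.84))  # well below baseline: drift flagged
```

Run this after every eval cycle, and especially after a model update: a quiet drop from 94% to 84% is exactly the regression the checklist exists to catch.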
The Practical Setup
Want to try this? The pattern translates directly to any AI workflow.
Create an evals file:
{
  "evals": [
    {
      "input": "Create a pitch deck for my fintech startup",
      "expected_format": "sequoia_yc_hybrid",
      "must_include": ["traction", "ask", "problem"],
      "must_not_open_with": "market_size"
    }
  ]
}
Run evaluations:
Point Claude at your evals file. Tell it to run each input through your Skill, check outputs against expectations, and report pass/fail rates.
Iterate systematically:
When the eval shows "traction slide missing 30% of the time," add explicit instruction: "The traction slide MUST appear between slides 6-8. Never skip it."
Rerun. Check if pass rate improved.
The Takeaway
Demos sell agents. Evaluations keep them working.
The gap between "I built a Skill" and "I built a Skill that works 99% of the time" is the gap between "it compiled" and "it passed all tests."
Karpathy's method is what separates people who dabble from people who run entire workflows on autopilot:
- Define what good looks like
- Measure against it systematically
- Iterate until you hit the target
- Re-test when models change
The autoresearch method took optimization from ML training to agent instructions. The pattern is the same. The leverage is different.
Your agent worked in the demo. Does it still work now?
The only way to know is to measure.
The teams that adopt these practices will ship agents that stay reliable. The ones that don't will debug production incidents forever. Karpathy showed us the method. Now we have to use it.