The Karpathy loop is an AI-assisted optimization technique where an autonomous agent runs hundreds of edit-benchmark-discard cycles overnight, surfacing performance improvements that no single engineer would have the time to find manually — and it just made Shopify's Liquid template engine 53% faster with 61% fewer memory allocations.
On March 13, 2026, Tobias Lütke — Shopify's CEO, the person who originally built Liquid over 20 years ago — submitted GitHub PR #2056 against the Liquid codebase with those numbers attached. The method: approximately 120 automated experiment loops using a variant of Andrej Karpathy's autoresearch system. The results: 974 unit tests pass, zero regressions, and a performance improvement visible in production-scale benchmarks against real Shopify themes.
Here's how the technique works, how to run it yourself, and what it actually found inside Liquid's parser.
What the Karpathy Loop Actually Is
Andrej Karpathy open-sourced autoresearch on March 8, 2026 — a deliberately minimal Python tool (~630 lines) that turns performance experimentation into a fully autonomous overnight process.
The loop is conceptually simple:
read code + fitness metric
→ form hypothesis
→ edit code
→ run tests (correctness gate)
→ run benchmark (performance gate)
→ keep if metric improved, discard if not
→ log to JSONL
→ repeat
Each experiment runs inside a fixed time budget (roughly 5 minutes), which makes all runs directly comparable regardless of what changed. There is a single, unambiguous fitness metric, and tests run before benchmarks: correctness is non-negotiable; performance is the optimization target.
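In code, the loop above reduces to something like the following Ruby sketch. The four lambdas (propose, test, bench, revert) are hypothetical stand-ins for the agent call, the test suite, the benchmark script, and a git revert; the real autoresearch tool shells out to each of these rather than taking callbacks.

```ruby
require "json"

# Minimal sketch of the keep/discard loop, assuming a smaller-is-better score.
def autoresearch_loop(propose:, test:, bench:, revert:, budget:, log_path:)
  best = bench.call                        # baseline score before any edits
  budget.times do |i|
    patch = propose.call                   # agent forms a hypothesis and edits code
    unless test.call(patch)                # correctness gate: any failure discards
      revert.call(patch)
      next
    end
    score = bench.call                     # performance gate: one unambiguous number
    kept = score < best
    best = score if kept
    File.open(log_path, "a") do |f|        # append one JSONL record per experiment
      f.puts({ experiment: i, score: score, kept: kept }.to_json)
    end
    revert.call(patch) unless kept
  end
  best
end
```

The key structural choices are visible even in the sketch: tests gate before the benchmark ever runs, every experiment is logged whether it survives or not, and only a strict improvement is kept.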
Karpathy's own results on ML training scripts: a 126-experiment overnight run dropped validation loss from 0.9979 to 0.9697. After approximately 700 autonomous changes over two days, the "time to GPT-2" training metric improved by 11%. On the night of March 8–9 alone, 35 autonomous agents on the Hyperspace network ran 333 unsupervised experiments.
The repo is at github.com/karpathy/autoresearch. It was originally designed for ML training scripts, but the core architecture is domain-agnostic.
How Tobi Adapted It for a Ruby Codebase
Lütke used Pi as his coding agent and adapted the autoresearch pattern through a plugin called pi-autoresearch (co-authored with David Cortés, available at github.com/davebcn87/pi-autoresearch).
The adaptation requires two files you write, plus a state file the loop maintains:
- autoresearch.md — the agent's instruction file (what to optimize, what to measure)
- autoresearch.sh — the benchmark runner (runs tests and reports scores)
- autoresearch.jsonl — state file (experiment history: what worked, what didn't)
The fitness metric was the ThemeRunner benchmark: a real Shopify theme executing against production-like template data, measured in milliseconds of combined parse+render time and tracked object allocations. The correctness gate was Liquid's existing 974-test suite — any experiment that breaks a test gets discarded immediately, before benchmarking.
Lütke's X post after the run: "I ran /autoresearch on the liquid codebase. 53% faster combined parse+render time, 61% fewer object allocations. This is probably somewhat overfit, but there are absolutely amazing ideas in this."
The "somewhat overfit" caveat matters. Automated loops can find improvements that look good on the benchmark fixture without generalizing to all production patterns. The next step — which Shopify's engineering team will handle — is verifying which changes survive broader production testing. That's expected from automated research, not a flaw in the technique.
What the Agent Actually Found: StringScanner → Byte-Level Scanning
The single biggest individual optimization surfaced by the loop: replacing the StringScanner-based tokenizer with String#byteindex for finding {% and {{ delimiters inside template strings.
String#byteindex searching for a fixed two-byte sequence is approximately 40% faster than StringScanner#skip_until with a regex pattern. On the ThemeRunner benchmark, the old approach was calling StringScanner#string= reset on every {% %} token — 878 times. The new byte-level cursor eliminates that reset overhead entirely, reducing parse time by roughly 12% on its own.
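The difference can be illustrated with a toy version of both approaches. This is a simplified sketch, not Liquid's actual tokenizer; `String#byteindex` requires Ruby 3.2+.

```ruby
require "strscan"

TEMPLATE = "Hello {{ name }}! {% if admin %}secret{% endif %}"

# Old style (simplified): regex search via StringScanner.
def find_tags_with_scanner(source)
  scanner = StringScanner.new(source)
  positions = []
  while scanner.skip_until(/\{\{|\{%/)
    positions << scanner.pos - 2   # start of the two-byte delimiter just matched
  end
  positions
end

# New style (simplified): a byte-level cursor searching for each fixed
# two-byte delimiter directly, with no regex engine and no scanner state.
def find_tags_with_byteindex(source)
  positions = []
  cursor = 0
  loop do
    var = source.byteindex("{{", cursor)
    tag = source.byteindex("{%", cursor)
    pos = [var, tag].compact.min
    break if pos.nil?
    positions << pos
    cursor = pos + 2
  end
  positions
end
```

Both return the same delimiter positions; the byte-level version skips the regex machinery entirely and carries its own cursor, which is the overhead the agent's change eliminated.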
The PR went further. The automated loop also identified:
- VariableParser refactor: the regex-based parser replaced with a manual byte scanner that extracts name expressions and filter chains without touching the Lexer or Parser at all
- Variable#try_fast_parse fast path: 100% of variables in the benchmark (1,197 variables) now route through this byte-level path, bypassing the full parse stack
- FullToken regex elimination: cursor-based scanning replaces the regex in the main tokenization path
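To give a flavor of what a "manual byte scanner" means here, a toy regex-free split of a variable body into a name expression and filter names might look like this. This is illustrative only; Liquid's real VariableParser handles quoting, filter arguments, ranges, and much more.

```ruby
# Toy byte-level parse of a variable body like "product.title | upcase | truncate: 20"
# into [name_expression, filter_names], using only byteindex/byteslice.
def fast_parse_variable(body)
  pipe = body.byteindex("|")
  return [body.strip, []] if pipe.nil?
  name = body.byteslice(0, pipe).strip
  filters = []
  cursor = pipe + 1
  while cursor < body.bytesize
    nxt = body.byteindex("|", cursor) || body.bytesize
    segment = body.byteslice(cursor, nxt - cursor).strip
    colon = segment.byteindex(":")              # drop filter arguments after ":"
    filters << (colon ? segment.byteslice(0, colon).strip : segment)
    cursor = nxt + 1
  end
  [name, filters]
end
```

The point of the style is that every operation is a fixed byte search or slice, so there is no regex compilation, no backtracking, and no intermediate match objects to allocate.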
Aaron Patterson's canonical 2023 post on fast tokenizers with StringScanner (tenderlovemaking.com) represented the previous state of the art in Ruby tokenizer performance. The Liquid PR effectively goes one level deeper — byte operations instead of scanner operations.
How to Run This on Your Own Codebase
You don't need to use Pi as your coding agent. The pattern is agent-agnostic. Here's what you actually need:
Step 1: Define your fitness metric precisely.
One number. Smaller is better (or larger is better — pick one). For Liquid it was benchmark milliseconds and allocation count. For an API endpoint it might be p95 response time. For a database query, it's execution time on a fixed dataset. Ambiguous metrics produce ambiguous results.
Step 2: Write your benchmark runner as a script.
```bash
#!/bin/bash
bundle exec ruby bench/themerunner.rb 2>&1 | tail -5
```
The agent needs to read a single number from stdout. Make the output format explicit in your instruction file.
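A minimal Ruby benchmark script in that spirit might look like the following. The filename, workload, and `score_ms:` output format are all placeholders; the only real requirement is that the script ends with one machine-readable number.

```ruby
# bench/themerunner.rb (illustrative): run the workload several times and
# print a single smaller-is-better score in milliseconds as the last line
# of stdout, so the agent can parse it unambiguously.
require "benchmark"

def render_workload
  # placeholder for parse + render of a real theme fixture
  1_000.times { "Hello {{ name }}".gsub("{{ name }}", "world") }
end

runs = 5
times_ms = runs.times.map do
  Benchmark.realtime { render_workload } * 1000.0
end

# report the best of N runs to reduce noise; format is "score_ms: <number>"
puts format("score_ms: %.3f", times_ms.min)
```

Taking the minimum of several runs is one common way to reduce timer noise; whatever you choose, document it in the instruction file so the agent compares like with like.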
Step 3: Set up your instruction file (autoresearch.md).
Tell the agent: what the codebase does, what the benchmark measures, what kinds of changes are in-scope (tokenizer, parser, memory allocation patterns), and what's out of scope (don't touch the public API surface, don't change error messages).
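A skeleton of such a file might look like this (an illustrative sketch, not Lütke's actual autoresearch.md):

```markdown
# autoresearch.md (illustrative skeleton)

## Goal
Reduce combined parse+render time (ms) and object allocations on the
ThemeRunner benchmark. Smaller is better.

## How to measure
Run ./autoresearch.sh. The last line of stdout is the score.

## In scope
- Tokenizer, parser, and memory allocation patterns

## Out of scope
- Public API surface, error messages, rendering semantics
```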
Step 4: Run with a time budget.
120 experiments at 5 minutes each = 10 hours. That's overnight. The correct use of this technique is to start it before you sleep and read the JSONL log in the morning.
Step 5: Review the winning experiments manually.
The agent produces hypotheses and results. You still own the decision about what ships. The JSONL log gives you a complete record of what was tried, what worked, and what the agent's reasoning was for each change.
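Because the log is plain JSONL, the morning review can start from a few lines of Ruby. The field names below (`kept`, `score`, `hypothesis`) are assumptions for illustration; check what your actual loop writes.

```ruby
require "json"

# Read autoresearch.jsonl (one experiment record per line), keep only the
# experiments that survived the gates, and sort them best-first.
def winning_experiments(log_path)
  File.readlines(log_path)
      .map { |line| JSON.parse(line, symbolize_names: true) }
      .select { |exp| exp[:kept] }
      .sort_by { |exp| exp[:score] }
end
```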
Who Else Is Running This Pattern
Beyond Karpathy and Lütke, the technique is spreading across different ecosystems:
- autoresearch-mlx (github.com/trevin-creator/autoresearch-mlx) — Apple Silicon port using MLX instead of PyTorch; runs natively on Mac without CUDA
- autoresearch-win-rtx (github.com/jsegov/autoresearch-win-rtx) — Windows + RTX GPU adaptation
- autoexp gist (gist.github.com/adhishthite) — generalized version for any project with a quantifiable metric, not just ML training
Simon Willison's coverage of the Liquid PR (simonwillison.net, March 13, 2026) frames this as one concrete instance of a broader class of agentic engineering patterns — AI-assisted iteration that produces better code through automated feedback loops rather than through AI writing code from scratch.
What This Means for Builders
The technique works on any codebase with a measurable benchmark. Ruby, Python, Rust, Go — the autoresearch pattern is language-agnostic. The requirement is a fitness metric you can print to stdout from a script.
Start with allocation profiling, not execution time. The Liquid PR's most productive discovery thread came from object allocation counts, not raw speed. Memory allocation patterns are often the root cause of performance issues, and they're cheaper to measure and easier for an agent to reason about than cache behavior or I/O.
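In Ruby, allocation counts are one `GC.stat` call away, which is part of what makes them such a clean fitness metric:

```ruby
# Count object allocations performed by a block using GC.stat.
# Disabling GC during the measurement keeps the delta stable.
def allocations_for
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end
```

Something like `allocations_for { Liquid::Template.parse(source).render(assigns) }` gives a deterministic integer, which an automated loop can compare far more reliably than noisy wall-clock timings.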
The agent is a hypothesis generator, not an autonomous engineer. The loop produces candidates. The correctness gate (your test suite) is non-negotiable — run it before benchmarking every single experiment. Ship nothing without human review of the winning diffs.
120 experiments overnight is now a realistic option for individual contributors. Lütke submitted this PR as one person using a weekend afternoon to set up the loop. The productivity floor for a single engineer with the right tools has shifted significantly — not because AI writes better code than humans, but because it can test more hypotheses per hour than any human team can.
Built with IntelFlow — open-source AI intelligence engine.