I've been using ripgrep for years. It's the kind of tool that makes you feel smug about your workflow -- blazing fast, zero complaints. Then Cursor's engineering team published a blog post that essentially said: "Yeah, ripgrep is fast. But it's still scanning every file. What if we just... didn't?"
And honestly? They have a point.
The Problem Nobody Talks About
If you work on a normal-sized codebase -- say, 50k lines, maybe a few hundred files -- ripgrep is instantaneous. You type rg "functionName" and the answer is there before you lift your finger off the enter key.
But Cursor's enterprise customers aren't searching 50k-line repos. They're searching Chromium. They're searching monorepos with millions of files. And when an AI coding agent needs to grep the codebase dozens of times per task, a 15-second delay on each search means the agent is spending more time waiting than thinking.
That's the real insight here. It's not that ripgrep is slow -- it's that AI agents use grep constantly. What's imperceptible to a human becomes a bottleneck when your AI assistant is firing off parallel regex queries every few seconds.
The Core Idea: Don't Scan Everything
ripgrep's fundamental limitation is architectural: it examines every file's contents on every search. No matter how optimized the regex engine is, you're still doing a linear scan across the entire codebase.
Cursor's approach flips this. They build an inverted index of trigrams (3-character sequences) extracted from your source code. When you search for handleClick, the engine decomposes it into trigrams -- han, and, ndl, dle, leC, etc. -- looks up which files contain ALL of those trigrams, and only scans those candidate files.
It's the same principle behind full-text search engines like Elasticsearch. The clever part is adapting it to regex patterns and making it work locally on a developer's machine.
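To make the mechanism concrete, here's a minimal sketch of the decompose-and-intersect step. Everything here (the dict-of-sets index, the file names) is illustrative, not Cursor's actual data structures:

```python
def trigrams(text):
    """Every overlapping 3-character window in the text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(files):
    """Inverted index: trigram -> set of file names containing it."""
    index = {}
    for name, content in files.items():
        for t in trigrams(content):
            index.setdefault(t, set()).add(name)
    return index

def candidates(index, query):
    """Files containing ALL of the query's trigrams. These are only
    candidates -- the trigrams could appear far apart, so each file
    still needs a real scan to confirm the match."""
    sets = [index.get(t, set()) for t in trigrams(query)]
    return set.intersection(*sets) if sets else set()

files = {
    "app.js":  "function handleClick() { return 1; }",
    "util.js": "function clamp(x) { return x; }",
}
index = build_index(files)
print(candidates(index, "handleClick"))  # {'app.js'}
```

The payoff is the last step: instead of scanning every file, you run the expensive regex only over the candidate set.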
The Trigram Evolution
What I find most interesting about the post is how they walk through three generations of this approach, each solving problems the previous one created.
Generation 1: Basic Trigrams. A straightforward inverted index: extract every 3-character sequence from every file and build posting lists. It works, but the posting lists get huge. Searching for a common string like "for" matches basically every file in any codebase.
Generation 2: Phrase-Aware Trigrams (GitHub's "Project Blackbird"). This is the clever one. They add two 8-bit bloom filters to each posting:
- A nextMask that encodes which characters can follow the trigram -- essentially giving you quadgram-level precision with trigram-level storage
- A locMask recording positions modulo 8, so you can verify that two trigrams actually appear next to each other
Two extra bytes per posting, massive reduction in false positives. The tradeoff? Bloom filters saturate over time with updates, making them less useful for frequently-changing codebases.
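A rough sketch of how those two masks could work. Note the hashing choices here (ord(c) % 8 for characters, a rotate-and-AND adjacency check) are my illustrative guesses at the general technique, not Blackbird's actual encoding:

```python
def build_masks(text):
    """Per-trigram (nextMask, locMask) pair for one file."""
    masks = {}
    for i in range(len(text) - 2):
        t = text[i:i + 3]
        next_m, loc_m = masks.get(t, (0, 0))
        if i + 3 < len(text):
            # nextMask: set a bit for the character following this trigram
            next_m |= 1 << (ord(text[i + 3]) % 8)
        loc_m |= 1 << (i % 8)  # locMask: position modulo 8
        masks[t] = (next_m, loc_m)
    return masks

def rot8(m, k):
    """Rotate an 8-bit mask left by k positions."""
    k %= 8
    return ((m << k) | (m >> (8 - k))) & 0xFF

def adjacent(masks, t1, t2):
    """Could t2 start one character after t1? A True answer is only
    probabilistic (bloom filters false-positive); False is definitive."""
    _, loc1 = masks[t1]
    _, loc2 = masks[t2]
    return rot8(loc1, 1) & loc2 != 0

masks = build_masks("handleClick")
nm, _ = masks["han"]
print(bool(nm & (1 << (ord("d") % 8))))  # True: 'd' can follow "han"
print(adjacent(masks, "han", "and"))     # True: "and" starts right after "han"
```

The saturation problem falls out of this directly: every update only ever sets bits, so in a hot file the masks trend toward all-ones and stop filtering anything.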
Generation 3: Sparse N-grams. This is what Cursor actually ships. Instead of fixed 3-character chunks, they use variable-length n-grams where the boundaries are determined by character-pair frequency weights.
The insight: if a character pair is rare in real source code (computed from terabytes of open-source data), it gets a higher weight and becomes a natural n-gram boundary. This means the index automatically focuses on the most discriminating parts of your search pattern.
Query: "handleClick"
Basic trigrams: han, and, ndl, dle, leC, eCl, Cli, lic, ick (9 lookups)
Sparse n-grams: handleCl, Click (2 lookups, if 'eC' is a high-weight pair)
Fewer lookups, smaller candidate sets, faster results. And because the weight function is deterministic, the same n-grams get generated at index time and query time.
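The boundary-splitting idea can be sketched in a few lines. The pair weights below are made up for the example, and this simplified version produces non-overlapping chunks, whereas the real index keeps overlapping n-grams around each boundary (hence "handleCl" and "Click" above rather than "handle" and "Click"):

```python
# Illustrative weights: 'eC' stands in for a pair that is rare in real
# source code (a camelCase boundary). Real weights come from corpus stats.
PAIR_WEIGHTS = {"eC": 0.9}
THRESHOLD = 0.5

def sparse_ngrams(s, weights=PAIR_WEIGHTS, threshold=THRESHOLD):
    """Cut the string wherever a character pair's weight crosses the
    threshold. Deterministic, so index time and query time agree."""
    chunks, start = [], 0
    for i in range(len(s) - 1):
        if weights.get(s[i] + s[i + 1], 0.0) >= threshold:
            chunks.append(s[start:i + 1])
            start = i + 1
    chunks.append(s[start:])
    return chunks

print(sparse_ngrams("handleClick"))  # ['handle', 'Click']
```

Because both sides run the same deterministic function over the same weight table, a query gram is guaranteed to exist in the index if the text exists in the corpus.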
The Local-First Architecture
Here's the part that surprised me. The whole thing runs on your machine. No cloud roundtrip.
They store the index in two files:
- A postings file -- sequential byte stream of all posting lists
- A lookup table -- hash-offset pairs, memory-mapped for fast binary search
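A toy version of the lookup side, assuming fixed-width records sorted by hash (the post describes the shape -- hash-offset pairs, memory-mapped, searched in place -- but the 64-bit field widths here are my assumption):

```python
import mmap
import os
import struct
import tempfile

# One fixed-width record per n-gram: (hash, byte offset into the postings file).
RECORD = struct.Struct("<QQ")

def write_table(path, entries):
    """Write (hash -> offset) entries sorted by hash, enabling binary search."""
    with open(path, "wb") as f:
        for h, off in sorted(entries.items()):
            f.write(RECORD.pack(h, off))

def lookup(mm, h):
    """Binary-search the memory-mapped table for hash h; return its offset.
    No parsing pass, no heap allocation -- the OS pages in what's touched."""
    lo, hi = 0, len(mm) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        key, off = RECORD.unpack_from(mm, mid * RECORD.size)
        if key < h:
            lo = mid + 1
        elif key > h:
            hi = mid
        else:
            return off
    return None

MASK = 2**64 - 1
entries = {hash(t) & MASK: off for off, t in enumerate(["han", "dle", "Cli"])}
fd, path = tempfile.mkstemp()
os.close(fd)
write_table(path, entries)
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
target = hash("dle") & MASK
print(lookup(mm, target) == entries[target])  # True
mm.close()
os.unlink(path)
```

The design choice worth noticing: memory-mapping means a cold search pays only for the pages the binary search actually touches, which is logarithmic in the table size.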
The git integration is particularly elegant. The index state is pinned to a commit hash. Any uncommitted changes go into an overlay layer, so incremental updates are cheap. You don't rebuild the entire index every time you save a file.
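In sketch form, a query against a pinned index plus an overlay might compose like this. The function names and the dirty-set representation are mine, purely to show the control flow, not Cursor's API:

```python
import re

def overlay_search(pattern, index_candidates, dirty, read):
    """Index hits are trusted only for files unchanged since the pinned
    commit; files with uncommitted edits bypass the index and get a
    fresh linear scan from the overlay."""
    rx = re.compile(pattern)
    clean_hits = [p for p in index_candidates
                  if p not in dirty and rx.search(read(p))]
    dirty_hits = [p for p in sorted(dirty) if rx.search(read(p))]
    return clean_hits + dirty_hits

files = {
    "a.js": "function handleClick() {}",  # indexed at the pinned commit
    "b.js": "handleClick(event)",         # edited since: lives in the overlay
}
hits = overlay_search(r"handleClick", index_candidates=["a.js"],
                      dirty={"b.js"}, read=files.get)
print(hits)  # ['a.js', 'b.js']
```

Saving a file only moves it into the dirty set; the big on-disk index stays valid until the next commit advances the pin.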
This is a direct consequence of how AI agents work. When your agent is firing off grep queries in parallel, every millisecond of network latency gets multiplied. Cursor explicitly calls this out: semantic search can tolerate a cloud roundtrip, but regex search can't.
What This Actually Means
Let's be real about what this is and isn't.
This is a meaningful engineering achievement. Taking a well-studied problem (inverted text indexing) and adapting it specifically for the regex-search-in-code use case, with smart tradeoffs around variable-length n-grams and local execution, is genuinely impressive work.
But it's also very specific to one use case: AI agents that need to grep large codebases interactively. If you're a human developer on a normal-sized project, ripgrep is still going to feel instant. You won't notice the difference.
Where this matters is the enterprise monorepo world -- Chromium-scale codebases where grep is genuinely the bottleneck in an agentic workflow. For those teams, dropping grep latency from 15 seconds to sub-second changes what an AI agent can realistically do in a session.
The Bigger Picture
The post traces a fascinating lineage: from 1993 inverted file research, through Russ Cox's 2012 trigram blog post, to Nelson Elhage's livegrep using suffix arrays, to GitHub's Project Blackbird, and finally to Cursor's sparse n-gram approach.
Each generation learned from the previous one's limitations. It's a great example of how progress in systems engineering often looks like iteration rather than revolution -- taking an existing idea and asking "what if we made this one specific tradeoff differently?"
And there's a meta-point here that I think is worth noting: AI coding agents are creating entirely new performance requirements for tools we thought were "solved." ripgrep solved regex search for humans. But agents search differently -- more frequently, more in parallel, more iteratively. The tools need to evolve to match.
I suspect we'll see more of this pattern. Tools that were fast enough for human interaction becoming bottlenecks in agentic workflows, and engineering teams having to rethink the fundamentals.
Alan West is a full-stack developer based in San Francisco who gets unreasonably excited about text search algorithms. Find him on GitHub or don't -- he'll be benchmarking something either way.