Tawan Shamsanor

Posted on May 18

How Semble Cuts AI Code Search Tokens by 98%

#ai #codesearch #devtools #llm

Grep-plus-read can waste about 98% of the tokens an AI coding agent spends on code search. Every time an agent like Claude Code, Cursor, or Codex searches a codebase, it fires off a grep command, finds the matching files, and dumps their entire contents into the LLM prompt. That brute-force approach works, but it is extremely expensive. A new open-source tool called Semble takes the opposite approach: it returns the exact code snippets an agent needs while using roughly 2% of the tokens that grep-plus-read consumes.

Key Facts Most People Don't Know

A typical grep search on a 50,000-line codebase consumes 1.2 million tokens when fed to GPT-4, costing $14.40 per query at $0.012 per 1K tokens.
Semble's abstract syntax tree parser strips comments, whitespace, and duplicate imports, reducing a 500KB Python file to just 23KB of semantic nodes.
Traditional code search tools like ripgrep scan at 2.3GB/s, but the agent then sends 100% of the matched file contents to the LLM, while Semble sends only about 2% by extracting function signatures.

Launched in May 2026 by MinishLab, Semble is a code search engine designed specifically for AI agents. It indexes any repository in under 250 milliseconds, answers queries in 1.5 milliseconds, and runs entirely on CPU — no GPU, no API keys, no external services. On Hacker News, it rocketed to over 300 upvotes in its first day. The reason is simple: Semble addresses the single biggest cost and performance bottleneck facing AI coding agents today.

The Token Economics of Code Search

To understand why Semble matters, you need to understand the hidden economics of how coding agents work. When you ask Claude Code "how does authentication work in this project?", the agent runs a grep-style search, reads the matching files, and feeds them into the LLM context window. A typical grep search on a 50,000-line codebase consumes 1.2 million tokens when fed to GPT-4, costing $14.40 per query at $0.012 per 1K tokens. Multiply that across dozens of queries per task and hundreds of tasks per day, and the token bill becomes the dominant cost of running AI coding assistants.

The problem isn't the search itself — ripgrep is blazing fast at 2.3GB/s. The problem is what happens after the search: traditional code search tools send 100% of matched file contents to LLMs, while Semble sends only 2% by extracting function signatures. That's not a marginal improvement. It's a 50x reduction in context that the LLM has to process, which means faster responses, lower costs, and less chance of the agent getting confused by irrelevant code.
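
The arithmetic behind those figures can be checked with a short sketch. It uses only the numbers quoted above; the helper name `query_cost` is hypothetical and not part of Semble.

```python
# Figures quoted in the article; query_cost is a hypothetical helper.
PRICE_PER_1K_TOKENS = 0.012  # USD per 1K tokens, as quoted above
GREP_TOKENS = 1_200_000      # grep-plus-read on a 50,000-line codebase
SEMBLE_FRACTION = 0.02       # Semble sends roughly 2% of that


def query_cost(tokens: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Return the USD cost of sending `tokens` tokens to the model."""
    return tokens / 1000 * price_per_1k


grep_cost = query_cost(GREP_TOKENS)
semble_cost = query_cost(round(GREP_TOKENS * SEMBLE_FRACTION))

print(f"grep-plus-read: ${grep_cost:.2f} per query")
print(f"semble:         ${semble_cost:.2f} per query")
print(f"reduction:      {grep_cost / semble_cost:.0f}x")
```

Sending 2% of the tokens is the same thing as the 50x reduction: the ratio of the two costs is simply 1 / 0.02.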

In benchmark tests on the Linux kernel repository — 28 million lines of code — Semble reduced token usage from 8.4 million to 168,000 tokens per semantic search query. That's a 98% reduction on one of the largest codebases in existence.

How Semble Searches Code: The Internal Process

Semble doesn't try to out-search grep. It tries to out-think it. Here's how the system works internally.

Step 1: Parse Source Code into an Abstract Syntax Tree

Semble parses source code into an abstract syntax tree using language-specific parsers such as tree-sitter grammars, converting raw text into hierarchical nodes representing functions, classes, and imports. Instead of treating code as flat text, which is what grep does, Semble understands its structure. A 500KB Python file full of comments, whitespace, and duplicate imports gets stripped down to just 23KB of semantic nodes. This parsing happens once, at index time, so queries never pay for it, and Semble can still index an entire repository in under 250 milliseconds.
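
To make the idea concrete, here is a minimal sketch of structural parsing. It uses Python's standard-library `ast` module as a stand-in for tree-sitter, so it only handles Python source and is not Semble's actual parser; the `outline` function and the sample source are hypothetical.

```python
import ast

# Hypothetical sample file with a comment and a duplicate import.
SOURCE = '''
import os
import os  # duplicate import

# A comment that carries no structure.
class TokenStore:
    """Keeps issued tokens in memory."""

    def issue(self, user_id: int) -> str:
        return f"token-{user_id}"


def authenticate(user: str, password: str) -> bool:
    """Check credentials against the store."""
    return bool(user and password)
'''


def outline(source: str) -> list[tuple[str, str, int]]:
    """Return (kind, name, line) for imports, classes, and functions."""
    nodes = []
    seen_imports = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name not in seen_imports:  # drop duplicate imports
                    seen_imports.add(alias.name)
                    nodes.append(("import", alias.name, node.lineno))
        elif isinstance(node, ast.ClassDef):
            nodes.append(("class", node.name, node.lineno))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            nodes.append(("function", node.name, node.lineno))
    return sorted(nodes, key=lambda n: n[2])  # source order


for kind, name, line in outline(SOURCE):
    print(f"{line:>3}  {kind:<8}  {name}")
```

Comments and blank lines never appear in the tree, which is where the size reduction comes from: only the nodes that carry structure survive.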

Step 2: Extract Semantic Chunks with Chonkie

The AST walker extracts only semantic elements — function signatures, class definitions, type annotations, and docstrings — while discarding comments, formatting, and implementation details. Semble uses a chunking library called Chonkie that's code-aware: it splits files at natural boundaries like function definitions and class blocks rather than at arbitrary character limits. Each chunk contains one semantic unit — a single function or class — making retrieval far more precise than line-based splitting.
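
A rough sketch of boundary-aware chunking is shown below. It does not use Chonkie's API; it is an illustration with Python's standard-library `ast`, and the `Chunk` class and `chunk_by_definition` function are hypothetical names.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    start_line: int  # 1-indexed, inclusive
    end_line: int    # 1-indexed, inclusive
    text: str


def chunk_by_definition(source: str) -> list[Chunk]:
    """Split a module into one chunk per top-level function or class."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Slice the original text using the node's line span.
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append(Chunk(node.name, node.lineno, node.end_lineno, text))
    return chunks


SOURCE = """def load(path):
    with open(path) as fh:
        return fh.read()

class Parser:
    def parse(self, text):
        return text.split()
"""

chunks = chunk_by_definition(SOURCE)
for chunk in chunks:
    print(chunk.name, chunk.start_line, chunk.end_line)
```

Because each chunk carries its own line span, a search result can point the agent at an exact range in the file instead of the whole file.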

Step 3: Build Dual Retrieval Indexes

When a repository is indexed, Semble builds two complementary retrieval indexes. The first uses static Model2Vec embeddings — specifically the code-specialized potion-code-16M model from MinishLab's own research — which creates dense vector representations of each chunk for semantic similarity matching. The second uses BM25, a classic information retrieval algorithm, for lexical matching on identifiers and API names.

Why both? Because code search has a dual nature. When you search "authentication flow," you want semantic understanding. When you search "save_pretrained," you want exact lexical matching. Semble's dual-retriever approach handles both cases automatically.
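
The lexical half is simple enough to sketch in plain Python. What follows is a textbook Okapi BM25 scorer, not Semble's implementation; the parameters `k1=1.5` and `b=0.75` are common defaults assumed here, and the three sample chunks are made up. The embedding half is omitted because it needs a model download.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase identifier-like tokens; underscores stay inside a token."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(t) for t in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(docs)
        doc_count = Counter()
        for tokens in self.doc_tokens:
            doc_count.update(set(tokens))
        n = len(docs)
        # Smoothed IDF that stays positive even for very common terms.
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_count.items()
        }

    def score(self, query: str, index: int) -> float:
        freqs = self.doc_freqs[index]
        length = len(self.doc_tokens[index])
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def rank(self, query: str) -> list[int]:
        """Return indexes of matching chunks, best first."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        return [i for s, i in sorted(scored, key=lambda p: -p[0]) if s > 0]


CHUNKS = [
    "def save_pretrained(self, path): write_weights(path)",
    "def login(user, password): return check_password(user, password)",
    "def load_config(path): return parse(path)",
]
bm25 = BM25(CHUNKS)
print(bm25.rank("save_pretrained"))
```

An exact identifier such as `save_pretrained` matches only the chunk that contains it, which is the behavior an embedding model alone cannot guarantee.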

Step 4: Fuse Rankings with Reciprocal Rank Fusion

When a search query arrives — whether it's natural language like "how is authentication handled?" or a symbol like "getUserById" — Semble queries both retrievers simultaneously. The two ranked result lists are then fused using Reciprocal Rank Fusion (RRF), a technique from information retrieval that combines rankings more robustly than simple score averaging. RRF gives each result a score based on its position in both lists, so a chunk that ranks high in both semantic and lexical results gets strongly promoted.
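
Reciprocal Rank Fusion fits in a few lines. This sketch uses the standard formula, where each list contributes 1 / (k + rank) to a result's score; `k=60` is the conventional constant and an assumption here, since the post does not say which value Semble uses. The chunk identifiers are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists into one, best result first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Sort by fused score; break ties by name so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical result lists from the two retrievers.
semantic = ["auth.py:login", "session.py:create", "utils.py:hash_password"]
lexical = ["utils.py:hash_password", "auth.py:login", "models.py:User"]

fused = reciprocal_rank_fusion([semantic, lexical])
print(fused)
```

Only ranks enter the formula, so the two retrievers never need comparable score scales. A chunk that appears near the top of both lists beats one that tops a single list.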

Step 5: Rerank with Code-Aware Signals

After fusion, Semble applies a set of code-aware reranking signals that make the results significantly more useful for developers:

Adaptive weighting: Symbol-like queries (Foo::bar, _private, getUserById) get more lexical weight, while natural-language queries stay balanced between both retrievers.
Definition boosts: A chunk that defines the queried symbol (a class, def, or func) is ranked above chunks that merely reference it.
Identifier stemming: Query tokens are stemmed and matched against identifier stems in a chunk. Searching "parse config" boosts chunks containing parseConfig, ConfigParser, or config_parser.
File coherence: When multiple chunks from the same file match, the file gets boosted so the top result reflects broad file-level relevance rather than a single out-of-context chunk.
Noise penalties: Test files, compat/legacy shims, example code, and .d.ts declaration stubs are down-ranked so canonical implementations rise to the top.
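
Three of these signals can be approximated with regular expressions. The sketch below is a guess at the general shape, not Semble's code: every function name is hypothetical, the heuristics are deliberately crude, and real stemming is left out, so "parse" matches `parseConfig` but not `ConfigParser`.

```python
import re


def split_identifier(name: str) -> list[str]:
    """Split camelCase, PascalCase, and snake_case into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def is_symbol_like(query: str) -> bool:
    """Heuristic: a single token with '::', '_', '.', or an inner capital."""
    if " " in query.strip():
        return False
    return bool(re.search(r"::|_|\.|[a-z][A-Z]", query))


def defines_symbol(symbol: str, chunk: str) -> bool:
    """True if the chunk declares `symbol` with class, def, or func."""
    pattern = rf"\b(?:class|def|func)\s+{re.escape(symbol)}\b"
    return re.search(pattern, chunk) is not None


def identifier_overlap(query: str, chunk: str) -> float:
    """Fraction of query words that occur inside identifiers in the chunk."""
    query_words = {w for tok in query.split() for w in split_identifier(tok)}
    chunk_words = {
        w
        for ident in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", chunk)
        for w in split_identifier(ident)
    }
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


print(is_symbol_like("getUserById"), is_symbol_like("how is auth handled?"))
print(defines_symbol("getUserById", "def getUserById(user_id): ..."))
print(identifier_overlap("parse config", "def parseConfig(path): pass"))
```

A reranker could multiply or add these values into the fused score. How the signals are weighted is not described in the post, so that part is left out.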

Step 6: Return Minimal Context to the Agent

The final step is where the 98% token savings actually materialize. Instead of returning entire files, Semble returns only the matched chunks — typically 50-200 lines instead of thousands. Each result includes the file path, start and end line numbers, and the chunk content. The agent gets exactly what it needs: the function definition, its signature, and surrounding context. No boilerplate, no imports, no unrelated helper functions.

How to Set Up Semble with Your AI Agent

Semble works as either an MCP server or a bash tool, making it compatible with virtually every coding agent on the market.

For Claude Code, the MCP setup is a single command:

claude mcp add semble -s user -- uvx --from "semble[mcp]" semble

For Codex, add it to ~/.codex/config.toml:

[mcp_servers.semble]
command = "uvx"
args = ["--from", "semble[mcp]", "semble"]

For Cursor, add to ~/.cursor/mcp.json:

{
  "mcpServers": {
    "semble": {
      "command": "uvx",
      "args": ["--from", "semble[mcp]", "semble"]
    }
  }
}

For agents that can't call MCP tools directly (like Claude Code sub-agents), Semble also provides a bash integration. Just add a few lines to your AGENTS.md or CLAUDE.md, and the agent will automatically use semble search instead of grep:

semble search "authentication flow" ./my-project
semble search "save_pretrained" ./my-project

Once installed, you can track how many tokens Semble has saved you with the semble savings command, which shows a breakdown by time period. Early users report savings of 89-95% across their workflows, somewhat below the 98% headline figure.

Benchmarks: Speed and Accuracy

Semble's performance numbers are worth examining closely because they challenge the assumption that you need heavy transformer models for good code search.

Metric	Semble	Code Transformer
Indexing speed	~250ms per repo	~50 seconds per repo
Query latency	~1.5ms	~15ms
NDCG@10	0.854	0.862
Token usage vs grep	~2%	~100%
Hardware required	CPU only	GPU recommended
External dependencies	None	API key or GPU

Semble achieves about 99% of the retrieval quality of a code-specialized transformer (NDCG@10 of 0.854 versus 0.862) while being 200x faster to index (250 ms versus 50 seconds) and 10x faster to query (1.5 ms versus 15 ms), all on CPU with no external services. The remaining gap may narrow as the Model2Vec embedding models improve.
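
For readers unfamiliar with the metric, NDCG@10 rewards placing relevant results near the top of the first ten. The sketch below uses the standard definition with made-up relevance labels. It normalizes against the best ordering of the returned list only, which is a simplification, and it says nothing about how Semble's benchmark suite assigns labels.

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: later positions count for less."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG of the top k, divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


# Hypothetical labels: 1 = relevant chunk, 0 = irrelevant, in ranked order.
perfect = ndcg_at_k([1, 1, 0])
imperfect = ndcg_at_k([1, 0, 1])
print(f"{perfect:.3f} {imperfect:.3f}")
```

A score of 1.0 means every relevant result is ranked ahead of every irrelevant one, so both 0.854 and 0.862 describe rankings that are close to ideal, but not perfect.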

Why This Matters for the AI Agent Ecosystem

The AI coding agent space is converging on a common architecture: a large language model orchestrates tool calls — file reads, shell commands, code searches — to navigate and modify codebases. The cost and latency of those tool calls, especially context-heavy ones like file reads, directly determine how much an agent can accomplish per dollar and per minute.

Semble's dependency graph algorithm uses topological sorting to identify only the 12-15 functions actually called by a target method, ignoring 400+ unused functions in the same file. This kind of surgical context delivery is what makes complex multi-step agent workflows economically viable. At the grep-plus-read price of roughly $14.40 per query, an agent that needs 20 queries to understand a codebase spends about $288 on search alone, which rules it out for routine tasks. At 2% of the tokens, the same 20 queries cost under $6.
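
The post does not describe the dependency algorithm in detail, so the following is only a sketch of the general idea: walk a call graph from the target to find the functions it can reach, then order them with a topological sort so callees come before callers. The call graph and function names are invented, and the sketch assumes there is no recursion, because `graphlib` raises `CycleError` on cycles.

```python
from graphlib import TopologicalSorter

# Hypothetical call graph: each function maps to the functions it calls.
CALLS = {
    "handle_login": ["validate_form", "authenticate"],
    "authenticate": ["hash_password", "load_user"],
    "validate_form": [],
    "hash_password": [],
    "load_user": [],
    "export_csv": ["format_row"],  # unrelated to handle_login
    "format_row": [],
}


def reachable(target: str, calls: dict[str, list[str]]) -> set[str]:
    """Return the target plus every function it calls, directly or not."""
    seen, stack = set(), [target]
    while stack:
        fn = stack.pop()
        if fn not in seen:
            seen.add(fn)
            stack.extend(calls.get(fn, []))
    return seen


def dependency_order(target: str, calls: dict[str, list[str]]) -> list[str]:
    """Order the reachable functions so that callees precede callers."""
    needed = reachable(target, calls)
    graph = {fn: [c for c in calls.get(fn, []) if c in needed] for fn in needed}
    return list(TopologicalSorter(graph).static_order())


order = dependency_order("handle_login", CALLS)
print(order)
```

Functions outside the reachable set, such as `export_csv` here, never enter the context, which is the effect the paragraph above describes.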

The deeper insight is that Semble represents a new category of tool: context optimization infrastructure for AI agents. As agents become more capable and are given larger codebases to work with, the bottleneck shifts from model intelligence to context efficiency. Tools like Semble that can compress the right information into minimal tokens will become as essential to agent architectures as vector databases were to RAG pipelines.

But what happens when Semble encounters dynamically typed code where dependencies can't be statically traced? That's a question the MinishLab team is actively working on — and it may define the next frontier in AI code intelligence.

Semble is open source and available at github.com/MinishLab/semble. Install it with pip install semble or uv tool install semble.

DEV Community