<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yichuan Wang</title>
    <description>The latest articles on DEV Community by Yichuan Wang (@yichuan_wang_fcf06c22a529).</description>
    <link>https://dev.to/yichuan_wang_fcf06c22a529</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3439701%2Fd7916cf2-c54f-45a7-99a7-770c443f1ec3.jpg</url>
      <title>DEV Community: Yichuan Wang</title>
      <link>https://dev.to/yichuan_wang_fcf06c22a529</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yichuan_wang_fcf06c22a529"/>
    <language>en</language>
    <item>
      <title>LEANN: The World's Most Lightweight Semantic Search Backend for RAG Everything 🎉</title>
      <dc:creator>Yichuan Wang</dc:creator>
      <pubDate>Sun, 17 Aug 2025 00:47:15 +0000</pubDate>
      <link>https://dev.to/yichuan_wang_fcf06c22a529/leann-the-worlds-most-lightweight-semantic-search-backend-for-rag-everything-57l9</link>
      <guid>https://dev.to/yichuan_wang_fcf06c22a529/leann-the-worlds-most-lightweight-semantic-search-backend-for-rag-everything-57l9</guid>
      <description>&lt;p&gt;&lt;em&gt;Introducing our team's latest creation - a revolutionary approach to local RAG applications&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: We built LEANN, the world's most "lightweight" semantic search backend that achieves &lt;strong&gt;97% storage savings&lt;/strong&gt; compared to traditional solutions while maintaining high accuracy and performance. Perfect for privacy-focused RAG applications on your local machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 Quick Start
&lt;/h2&gt;

&lt;p&gt;Want to try it right now? Run this single command on your MacBook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv pip &lt;span class="nb"&gt;install &lt;/span&gt;leann
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  📚 Repository &amp;amp; Paper
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/yichuan-w/LEANN" rel="noopener noreferrer"&gt;https://github.com/yichuan-w/LEANN&lt;/a&gt; ⭐ (Star us!)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paper&lt;/strong&gt;: Available on arXiv&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0szyfj3ha75e0qf8gpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0szyfj3ha75e0qf8gpu.png" alt=" " width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RAG Everything?
&lt;/h2&gt;

&lt;p&gt;RAG (Retrieval-Augmented Generation) has become the first true "killer application" of the LLM era. It seamlessly integrates private data that wasn't part of the training set into large model inference pipelines.&lt;/p&gt;

&lt;p&gt;Privacy-sensitive scenarios are the most important deployment direction - especially for your personal data and in highly sensitive domains like healthcare and finance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG Everything&lt;/strong&gt; starts from the most essential needs of personal laptops. We natively support a bunch of out-of-the-box scenarios (currently supporting macOS and Linux, Windows users need WSL):&lt;/p&gt;

&lt;h3&gt;
  
  
  🔍 Supported Applications
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. &lt;strong&gt;File System RAG&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Replace Spotlight search entirely. Spotlight not only consumes disk space, it is also limited to keyword matching. We turn file search into a semantic search powerhouse.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. &lt;strong&gt;Apple Mail RAG&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Easily find answers to personal questions (like "How many courses should Berkeley EECS freshmen take in their first semester?").&lt;/p&gt;

&lt;h4&gt;
  
  
  3. &lt;strong&gt;Google Browser History RAG&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Track down pages you half-remember visiting - the ones you have only a fuzzy impression of.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. &lt;strong&gt;WeChat Chat History RAG&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;This is what I use most!&lt;/em&gt; I've used LEANN to summarize conversations with friends and extract research ideas + slides. We implemented a small hack to bypass WeChat's encrypted database and extract chat records - don't worry, everything stays local with zero leakage.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. &lt;strong&gt;Claude Code Semantic Search Enhancement&lt;/strong&gt; 🔥
&lt;/h4&gt;

&lt;p&gt;One of Claude Code's biggest pain points is that it's always grepping and finding nothing. LEANN is one of the first open-source projects to bring true semantic search to Claude Code through an MCP server - enabling it with just one line of code.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03zqn7k7cxfv7tylgo1r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F03zqn7k7cxfv7tylgo1r.png" alt=" " width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;These are just the scenarios we think have the most potential - we'll keep integrating features based on user feedback until LEANN becomes a personalized local agent that maintains your LLM's long-term memory and has full command of your private data.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why LEANN? The Technical Deep Dive
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem with Current Vector Databases
&lt;/h3&gt;

&lt;p&gt;Current mainstream vector databases excel at &lt;strong&gt;latency&lt;/strong&gt; - most queries complete within 10-100 ms even with millions of data points. In RAG's search + generation pipeline, search time is far below generation time, especially with reasoning models and long chain-of-thought outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency isn't the bottleneck in RAG - storage is.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The most important RAG deployment scenario is &lt;strong&gt;privacy&lt;/strong&gt;, especially on personal computers where resources are naturally scarce. Consider this reality check:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For high recall in text RAG, you need fine chunk sizes → embedding storage becomes &lt;strong&gt;3-10x the original text size&lt;/strong&gt; → Real example: 70GB raw data → 220GB+ index storage&lt;/p&gt;
&lt;/blockquote&gt;
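&lt;p&gt;&lt;em&gt;A back-of-envelope sketch of where the blow-up comes from. The chunk size and embedding dimension below are illustrative assumptions, not the numbers behind the 70GB example above:&lt;/em&gt;&lt;/p&gt;

```python
# Rough estimate of the embedding-to-text storage ratio for fine-grained chunking.
# Assumed (illustrative) parameters: 384-dim float32 embeddings and
# 256-character chunks, counting roughly 1 byte of raw text per character.

def embedding_blowup(chunk_chars=256, dim=384, bytes_per_float=4):
    """Ratio of embedding bytes to raw text bytes for one chunk."""
    embedding_bytes = dim * bytes_per_float  # 384 * 4 = 1536 bytes per chunk
    return embedding_bytes / chunk_chars

print(f"embeddings are {embedding_blowup():.0f}x the raw text")  # 6x here
```

&lt;p&gt;&lt;em&gt;Halve the chunk size or double the embedding dimension and the ratio doubles - which is how fine chunking pushes a 70GB corpus past 200GB of index.&lt;/em&gt;&lt;/p&gt;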

&lt;h3&gt;
  
  
  Our Solution: Trade Storage for Compute
&lt;/h3&gt;

&lt;p&gt;LEANN makes a bold design choice: &lt;strong&gt;replace storage with recomputation&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Innovation
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Observation&lt;/strong&gt;: In graph-based indices, a query actually accesses very few nodes → Why store all embeddings?&lt;/p&gt;

&lt;p&gt;Our pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build&lt;/strong&gt; a normal vector store&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delete&lt;/strong&gt; all embeddings, keeping only the Proximity Graph to record relationships between data chunks
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Convert&lt;/strong&gt; memory loading to recomputation during inference&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage&lt;/strong&gt; lightweight embedding models for efficient graph-based recomputation&lt;/li&gt;
&lt;/ol&gt;
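&lt;p&gt;&lt;em&gt;The pipeline above can be sketched in a few lines: a toy best-first search over a proximity graph that recomputes each node's embedding on demand instead of loading it from storage. Here &lt;code&gt;embed()&lt;/code&gt; is a stand-in for a real lightweight embedding model, and the graph and chunk texts are hypothetical:&lt;/em&gt;&lt;/p&gt;

```python
# Toy best-first graph search with on-demand embedding recomputation.
# After building the index, only `graph` (node adjacency) and `texts`
# (the raw chunks) are kept; no embeddings are stored.
import heapq

def embed(text):
    # Toy deterministic "embedding": a normalized 4-bucket character histogram.
    # A real system would call a lightweight embedding model here.
    vec = [0.0] * 4
    for ch in text:
        vec[ord(ch) % 4] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search(graph, texts, query, entry, k=2):
    """graph: node -&gt; neighbor list; texts: node -&gt; chunk text (embeddings deleted)."""
    q_vec = embed(query)
    visited = {entry}
    # Recompute the entry node's embedding instead of reading it from disk.
    heap = [(dist(q_vec, embed(texts[entry])), entry)]
    best = []
    while heap:
        d, node = heapq.heappop(heap)
        best.append((d, node))
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                # Each neighbor's embedding is recomputed only when visited.
                heapq.heappush(heap, (dist(q_vec, embed(texts[nb])), nb))
    return [n for _, n in sorted(best)[:k]]
```

&lt;p&gt;&lt;em&gt;The key point is that only the handful of nodes the search actually touches ever get embedded - which is why deleting all stored embeddings is affordable.&lt;/em&gt;&lt;/p&gt;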

&lt;h3&gt;
  
  
  Graph Structure Pruning
&lt;/h3&gt;

&lt;p&gt;We observed that node visits in post-RNG graphs are heavily skewed toward a small set of nodes. Our strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Keep high-degree nodes&lt;/strong&gt; to ensure connectivity&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limit out-edges&lt;/strong&gt; for low-degree nodes while allowing unlimited in-edges&lt;/li&gt;
&lt;li&gt;Use heuristics to preserve only essential high-degree nodes&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Results That Matter
&lt;/h3&gt;

&lt;p&gt;✅ &lt;strong&gt;97%+ reduction&lt;/strong&gt; in index size&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;&amp;lt;2 seconds&lt;/strong&gt; retrieval time on 3090-level hardware&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;90%+ Top-3 recall&lt;/strong&gt; on real RAG benchmarks&lt;br&gt;&lt;br&gt;
✅ &lt;strong&gt;Zero embedding storage&lt;/strong&gt; - no more 200GB+ embedding files&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: At this compression rate, PQ, OPQ, and even the state-of-the-art RaBitQ cannot maintain high accuracy, as we show in our paper.&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Performance Optimizations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive pipeline&lt;/strong&gt; combining coarse-grained and accurate search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient GPU batching&lt;/strong&gt; for better utilization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ZMQ communication&lt;/strong&gt; using distances instead of embeddings&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;CPU/GPU overlapping&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Selective caching&lt;/strong&gt; of high-degree nodes&lt;/li&gt;
&lt;/ul&gt;
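&lt;p&gt;&lt;em&gt;To make the batching point concrete, here is a hypothetical per-hop sketch: rather than embedding one neighbor at a time, collect every unvisited neighbor of the current frontier and recompute their embeddings in a single call, which keeps a GPU busy. &lt;code&gt;embed_batch()&lt;/code&gt; is a placeholder for one batched model forward pass, not a LEANN API:&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of per-hop batched recomputation for graph search.

def embed_batch(texts):
    # Placeholder for a real batched embedding call (one model forward pass).
    return [[float(len(t))] for t in texts]

def expand_frontier(graph, texts, frontier, visited):
    """Gather all unvisited neighbors of the frontier, then embed them together."""
    candidates = []
    for node in frontier:
        for nb in graph[node]:
            if nb not in visited:
                visited.add(nb)
                candidates.append(nb)
    vectors = embed_batch([texts[n] for n in candidates])  # one batched call
    return dict(zip(candidates, vectors))
```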
&lt;h2&gt;
  
  
  The Vision: RAG Everything
&lt;/h2&gt;

&lt;p&gt;We're continuously maintaining this open-source project at &lt;strong&gt;Berkeley SkyLab&lt;/strong&gt; with full-stack optimization across algorithms, applications, system design, vector databases, and kernel acceleration.&lt;/p&gt;
&lt;h3&gt;
  
  
  Our Goals
&lt;/h3&gt;

&lt;p&gt;🎯 &lt;strong&gt;Seamlessly connect&lt;/strong&gt; all your private data&lt;br&gt;&lt;br&gt;
🧠 &lt;strong&gt;Build long-term&lt;/strong&gt; local AI memory and agents&lt;br&gt;&lt;br&gt;
💻 &lt;strong&gt;Zero cloud dependency&lt;/strong&gt;, low-cost operation  &lt;/p&gt;
&lt;h2&gt;
  
  
  Technical Details &amp;amp; Future Work
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;If you want to dive deeper into implementation details, check our arXiv paper and repository. I can write a follow-up post covering all implementation specifics if there's interest.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;We hope LEANN inspires more vector search researchers to think about vector databases from a different angle, especially in popular RAG settings. We were fortunate to discuss our work at SIGMOD/ICML vector search workshops this year and received great recognition from the community.&lt;/p&gt;


&lt;h2&gt;
  
  
  Get Involved
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;⭐ &lt;strong&gt;Star&lt;/strong&gt; our repository&lt;/li&gt;
&lt;li&gt;🤝 &lt;strong&gt;Contribute&lt;/strong&gt; to the project
&lt;/li&gt;
&lt;li&gt;🔗 &lt;strong&gt;Join&lt;/strong&gt; our Berkeley SkyLab team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ready to transform your local machine into a RAG powerhouse?&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv pip &lt;span class="nb"&gt;install &lt;/span&gt;leann
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;What private data would you want to RAG first? Drop a comment below! 👇&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Tags
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;#rag&lt;/code&gt; &lt;code&gt;#vectordatabase&lt;/code&gt; &lt;code&gt;#semanticsearch&lt;/code&gt; &lt;code&gt;#privacy&lt;/code&gt; &lt;code&gt;#opensource&lt;/code&gt; &lt;code&gt;#machinelearning&lt;/code&gt; &lt;code&gt;#ai&lt;/code&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
