LangChain has over 200,000 lines of code. I wanted to understand what RAG actually does — not what LangChain says it does.
So I built the whole pipeline from scratch in Rust. 300 lines. No magic.
GitHub: LakshmiSravyaVedantham/nano-rag
What RAG actually is (in one sentence)
You take a document, split it into chunks, turn each chunk into a vector, store those vectors, and when a user asks a question, you find the most similar chunks and feed them to an LLM as context.
That's it. That's all of RAG.
Here's the complete architecture — 4 files:
src/
├── chunk.rs # Text → overlapping chunks (~40 lines)
├── embed.rs # Chunks → embeddings + cosine sim (~55 lines)
├── store.rs # In-memory vector store + top-K (~50 lines)
└── main.rs # CLI: index + query commands (~80 lines)
Let me walk through each one.
Step 1: Chunking (chunk.rs)
Before you can embed anything, you need to split your document into pieces small enough to fit in an embedding model's context window.
pub fn split_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    if text.is_empty() { return vec![]; }
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    let step = chunk_size.saturating_sub(overlap).max(1);
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() { break; }
        start += step;
    }
    chunks
}
Two parameters matter: chunk_size (how much text goes in each chunk) and overlap (how much consecutive chunks share at their boundaries).
Why overlap? Because a sentence about a concept often spans chunk boundaries. Without overlap, you'd cut relevant context in half.
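To see what those two knobs actually do, here's the function run on a toy string. The function is reproduced so the snippet compiles on its own; the parameter values are just illustrative, not recommendations.

```rust
pub fn split_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    if text.is_empty() { return vec![]; }
    let chars: Vec<char> = text.chars().collect();
    let mut chunks = Vec::new();
    let mut start = 0;
    // Slide forward by chunk_size minus overlap, but always at least 1 char.
    let step = chunk_size.saturating_sub(overlap).max(1);
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() { break; }
        start += step;
    }
    chunks
}

fn main() {
    // 10 chars, chunks of 4, overlap of 2 → each window slides by 2,
    // so every boundary character appears in two chunks.
    let chunks = split_chunks("abcdefghij", 4, 2);
    assert_eq!(chunks, vec!["abcd", "cdef", "efgh", "ghij"]);
    println!("{:?}", chunks);
}
```

Notice how "cd", "ef", and "gh" each show up twice: that duplication is what keeps a sentence straddling a boundary retrievable from at least one chunk.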
Step 2: Embeddings + Cosine Similarity (embed.rs)
An embedding is just a list of floats (1536 of them for text-embedding-3-small) that encodes the meaning of a piece of text. The OpenAI embedding API does the heavy lifting.
pub async fn embed(&self, texts: &[&str]) -> Result<Vec<Vec<f32>>> {
    let body = json!({
        "model": "text-embedding-3-small",
        "input": texts
    });
    let resp: Value = self.client
        .post("https://api.openai.com/v1/embeddings")
        .bearer_auth(&self.api_key)
        .json(&body)
        .send().await?
        .json().await?;
    // Extract the float arrays from the response
    Ok(resp["data"].as_array().unwrap()
        .iter()
        .map(|item| item["embedding"].as_array().unwrap()
            .iter().map(|v| v.as_f64().unwrap() as f32).collect())
        .collect())
}
To measure how similar a chunk is to the query, we use cosine similarity — the angle between two vectors, not their magnitude:
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { return 0.0; }
    dot / (norm_a * norm_b)
}
Why cosine and not dot product? Because dot product favors longer chunks (more words = bigger magnitude). Cosine normalizes by length, so a short precise chunk can beat a long vague one.
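Here's a tiny sketch of that difference, reusing cosine_similarity from above plus a small dot helper added just for the comparison. The vectors are made up for illustration: one points exactly at the query, the other is longer but only partially aligned.

```rust
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 { return 0.0; }
    dot / (norm_a * norm_b)
}

// Raw dot product, for comparison.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let query = [1.0, 0.0];
    let short_precise = [1.0, 0.0]; // same direction, small magnitude
    let long_vague = [3.0, 4.0];    // partially aligned, magnitude 5

    // Dot product rewards the bigger vector: 3.0 beats 1.0 ...
    assert!(dot(&query, &long_vague) > dot(&query, &short_precise));
    // ... while cosine rewards the better-aligned one: 1.0 beats 0.6.
    assert!(cosine_similarity(&query, &short_precise)
        > cosine_similarity(&query, &long_vague));
}
```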
Step 3: The Vector Store (store.rs)
This is the niche that vector databases like Pinecone, Weaviate, and Qdrant fill. For nano-rag, it's 50 lines:
pub struct ScoredDocument {
    pub text: String,
    pub score: f32,
}

pub struct VectorStore {
    docs: Vec<(String, Vec<f32>)>,
}

impl VectorStore {
    pub fn top_k(&self, query_embedding: &[f32], k: usize) -> Vec<ScoredDocument> {
        let mut scored: Vec<ScoredDocument> = self.docs.iter()
            .map(|(text, emb)| ScoredDocument {
                text: text.clone(),
                score: cosine_similarity(query_embedding, emb),
            })
            .collect();
        // total_cmp avoids the panic that partial_cmp().unwrap() hits on NaN.
        scored.sort_by(|a, b| b.score.total_cmp(&a.score));
        scored.truncate(k);
        scored
    }
}
For production you'd swap this Vec for Qdrant or pgvector. The interface stays identical — that's the point.
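If you want that swap to really be a one-line change, one option is to hide the store behind a small trait that both the Vec version and a database-backed client implement. This is a hypothetical sketch, not nano-rag's actual code; the trait and type names are mine.

```rust
pub struct ScoredDocument {
    pub text: String,
    pub score: f32,
}

// Hypothetical trait: anything that can return the k most similar chunks.
pub trait Retriever {
    fn top_k(&self, query_embedding: &[f32], k: usize) -> Vec<ScoredDocument>;
}

pub struct InMemoryStore {
    docs: Vec<(String, Vec<f32>)>,
}

impl Retriever for InMemoryStore {
    fn top_k(&self, query_embedding: &[f32], k: usize) -> Vec<ScoredDocument> {
        let mut scored: Vec<ScoredDocument> = self.docs.iter()
            .map(|(text, emb)| ScoredDocument {
                text: text.clone(),
                score: cosine(query_embedding, emb),
            })
            .collect();
        scored.sort_by(|a, b| b.score.total_cmp(&a.score));
        scored.truncate(k);
        scored
    }
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    let store = InMemoryStore {
        docs: vec![
            ("chunk about cats".into(), vec![1.0, 0.0]),
            ("chunk about dogs".into(), vec![0.0, 1.0]),
        ],
    };
    // A query vector pointing mostly at the first chunk's direction.
    let hits = store.top_k(&[0.9, 0.1], 1);
    assert_eq!(hits[0].text, "chunk about cats");
}
```

A Qdrant- or pgvector-backed struct would implement the same trait, and the rest of the pipeline would never know the difference.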
Step 4: Putting it together
# Install
cargo install nano-rag
# Index a document
export OPENAI_API_KEY=sk-...
nano-rag index --file my_doc.txt --output index.json
# Query it
nano-rag query --question "What does this say about X?" --index index.json
Output:
Query: "What does this say about X?"
[1] score=0.891
The document discusses X extensively in chapter 3...
[2] score=0.743
Related to X, the author also mentions...
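Under the hood, index and query just wire the earlier pieces together. Here's a condensed, offline sketch of that flow — the toy letter-frequency "embedder" is a stand-in I'm using so the example runs without an API key; the real main.rs calls OpenAI instead and handles chunking and file I/O.

```rust
// Toy stand-in for the OpenAI embedding call: a 26-dim letter-frequency
// vector. Real embeddings capture meaning; this only captures spelling,
// but it lets the whole pipeline run end to end offline.
fn toy_embed(text: &str) -> Vec<f32> {
    let mut v = vec![0.0f32; 26];
    for c in text.to_lowercase().chars() {
        if c.is_ascii_lowercase() {
            v[(c as u8 - b'a') as usize] += 1.0;
        }
    }
    v
}

fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

fn main() {
    // index: chunk (skipped — one chunk per doc here) → embed → store
    let docs = ["rust is fast", "python is slow"];
    let store: Vec<(&str, Vec<f32>)> =
        docs.iter().map(|d| (*d, toy_embed(d))).collect();

    // query: embed the question, rank every stored chunk by similarity
    let q = toy_embed("fast rust");
    let mut scored: Vec<(&str, f32)> = store.iter()
        .map(|(text, emb)| (*text, cosine(&q, emb)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));

    assert_eq!(scored[0].0, "rust is fast");
    println!("top hit: {} (score {:.3})", scored[0].0, scored[0].1);
}
```

Swap toy_embed for the real embed call and append the top chunks to an LLM prompt, and that's the entire query path.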
What LangChain is hiding
After building this, I realized LangChain isn't hiding anything complex — it's hiding a lot of configuration. Document loaders, text splitters, vectorstore adapters, retriever chains, memory modules... all layers on top of the same 4 steps.
That's fine if you need the flexibility. But if you want to understand what's happening in your RAG pipeline, reading nano-rag's 300 lines is more valuable than reading any LangChain documentation.
The code is at github.com/LakshmiSravyaVedantham/nano-rag. Fork it, break it, make it yours.
Why Rust?
I originally started writing this in Python. It worked. But I wanted to understand the whole stack — including why Python LLM tools are slow to start up.
Rust forced me to think about ownership of the embedding data, the async runtime underneath reqwest, and the memory layout of the vector store. I learned more about RAG building this than I did using LangChain for months.
Next up: llm-bench — benchmarking OpenAI vs Claude vs Groq on your actual prompts, in milliseconds.