
Sumayea Rahman

Originally published at github.com

Why I Built RAG From Scratch Before Using LangChain

Technical Note #01: Why I Built RAG From Scratch Before Using LangChain

Part of the Agentic Finance Beast Technical Notes series

Published: June 7, 2026
Reading Time: ~6 minutes


About This Note

This technical note documents my first implementation of a Retrieval-Augmented Generation (RAG) pipeline.

The goal was not to build a production-ready system.

The goal was to understand what actually happens between a user's question and an AI-generated answer before relying on frameworks such as LangChain.

Rather than starting with abstractions, I wanted to build the core pieces myself and learn where real-world AI systems succeed and fail.


TL;DR

I built a minimal RAG pipeline from scratch using Gemini Embeddings, cosine similarity search, and Mistral.

The biggest lesson wasn't prompt engineering.

It was discovering that retrieval quality often has a greater impact on answer quality than the language model itself.


The Question That Started It

Most RAG tutorials follow a similar pattern:

  1. Install LangChain.
  2. Connect a vector database.
  3. Load a document.
  4. Ask questions.

Within minutes, you have a working application.

That's impressive.

But it left me with a question:

If the system retrieves the wrong information, how would I debug it?

Frameworks make development faster, but they also hide implementation details.

Before using those abstractions, I wanted to understand the individual components behind Retrieval-Augmented Generation.

So I built a simple version myself.

No LangChain.

No vector database.

No orchestration framework.

Just Python, embeddings, similarity search, and an LLM.


What Is RAG?

Retrieval-Augmented Generation combines two systems:

  • A retrieval system that finds relevant information.
  • A generation system that uses that information to answer questions.

Instead of relying entirely on the language model's training data, relevant information is retrieved at runtime and injected into the prompt.

The simplified workflow looks like this:

Document
   ↓
Chunking
   ↓
Embeddings
   ↓
Similarity Search
   ↓
Context Retrieval
   ↓
LLM Response

This architecture allows AI systems to answer questions using external knowledge without retraining the model.


The Architecture I Built

My implementation consisted of five core stages:

Document
   ↓
Sentence-Based Chunking
   ↓
Gemini Embeddings
   ↓
Cosine Similarity Search
   ↓
Mistral Answer Generation

The knowledge base contained a short document about AI agents.

Users could ask questions, and the system would retrieve relevant information before generating a response.


Step 1: Chunking the Document

The first step was splitting the document into smaller pieces.

I used a simple sentence-based approach:

chunks = [
    # Drop any period left on the final sentence before adding one back.
    s.strip().rstrip(".") + "."
    for s in document.replace("\n", " ").split(". ")
    if s.strip()
]

At first, this felt like a minor preprocessing step.

It wasn't.

I quickly realized that chunking affects retrieval quality directly.

Large chunks preserve context but often include irrelevant information.

Small chunks improve retrieval precision but can lose important context.

Even in this small project, chunking turned out to be a meaningful engineering decision.
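
One way to sit between the two extremes is to group neighbouring sentences into overlapping windows. The block below is a minimal sketch of that idea, not the approach used in this project: the function name `sentence_windows`, its `size` and `overlap` parameters, and the sample sentences are all assumptions made for illustration.

```python
def sentence_windows(sentences, size=2, overlap=1):
    """Group sentences into windows of `size` that share `overlap` sentences."""
    if size < 1 or not 0 <= overlap < size:
        raise ValueError("need size >= 1 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(sentences), step):
        windows.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break  # this window already reaches the last sentence
    return windows


# Hypothetical sentences standing in for the real document.
sentences = [
    "AI agents perceive their environment.",
    "They decide what to do next.",
    "They act on that decision.",
]
windows = sentence_windows(sentences, size=2, overlap=1)
for w in windows:
    print(w)
```

With `size=1` and `overlap=0` this reduces to plain sentence-level chunks, so the two parameters give a single dial for trading precision against context.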


Step 2: Generating Embeddings

After chunking the document, I generated embeddings using Gemini's embedding API.

Each chunk was converted into a high-dimensional vector representation.

embedding = data.get("embedding", {}).get("values", [])

Before building this project, embeddings felt somewhat magical.

After seeing the actual vectors returned by the API, the concept became easier to understand.

Embeddings allow machines to compare meaning instead of matching exact words.

For example, a query about decision-making could retrieve information related to reasoning even if the exact keywords do not appear.

That capability is what makes semantic search possible.
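
The parsing line above silently falls back to an empty list when the response is missing fields, and an empty vector only causes trouble later during similarity search. A hedged sketch of a stricter helper follows. It assumes the response shape implied by that line, `{"embedding": {"values": [...]}}`; the helper name `extract_embedding` and the sample payload are hypothetical.

```python
def extract_embedding(data):
    """Return the vector from a response shaped like {"embedding": {"values": [...]}}."""
    values = data.get("embedding", {}).get("values", [])
    if not values:
        # Fail here rather than passing an empty vector to similarity search.
        raise ValueError("response contained no embedding values")
    return [float(v) for v in values]


# Hypothetical payload; real embedding vectors have far more dimensions.
sample_response = {"embedding": {"values": [0.12, -0.03, 0.4]}}
vector = extract_embedding(sample_response)
print(len(vector), vector)
```

Failing at parse time keeps an API error from surfacing as a confusing retrieval result several steps later.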


Step 3: Implementing Similarity Search

Instead of using a vector database, I implemented cosine similarity manually.

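import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)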

This was one of the most interesting parts of the project.

Before building it, vector search seemed complicated.

After implementing it myself, I realized the mathematics behind retrieval is relatively straightforward.

The challenge is not the formula.

The challenge is consistently retrieving the most useful context.
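
To make the formula concrete, here is a small worked example on two-dimensional toy vectors. The function is restated in compact form so the block runs on its own; the vectors are made up for illustration and are not real embeddings.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Same direction, different length: score close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: score of 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite direction: score close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

The score depends only on direction, not magnitude, which is why a long chunk and a short query can still match closely.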


Step 4: Finding Relevant Information

When a user asks a question, the same embedding model converts the question into a vector.

The query embedding is then compared against every document embedding.

best_idx = similarities.index(max(similarities))

The chunk with the highest similarity score becomes the retrieved context.

For example, questions such as:

  • What is an AI agent?
  • How do AI agents differ from traditional programs?
  • What can financial AI agents do?

successfully retrieved relevant information from the knowledge base.

For a minimal implementation, the results were surprisingly effective.
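
The retrieval step can be sketched end to end as follows. This is an illustration under stated assumptions rather than the project's actual code: `retrieve_best` is a hypothetical helper, and the three-dimensional vectors are toy stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def retrieve_best(query_vec, chunk_vecs, chunks):
    """Return (chunk, score) for the chunk most similar to the query."""
    similarities = [cosine_similarity(query_vec, v) for v in chunk_vecs]
    best_idx = similarities.index(max(similarities))
    return chunks[best_idx], similarities[best_idx]


# Toy data: hypothetical chunks with made-up 3-dimensional vectors.
chunks = [
    "An AI agent perceives, decides, and acts.",
    "Traditional programs follow fixed rules.",
]
chunk_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2]]
query_vec = [0.8, 0.2, 0.1]

best_chunk, score = retrieve_best(query_vec, chunk_vecs, chunks)
print(f"{score:.3f}  {best_chunk}")
```

Returning the score alongside the chunk is a small design choice that helps with debugging: a low best score suggests that nothing in the knowledge base matches the question well.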


Step 5: Grounding the Response

Once relevant context is retrieved, it is passed to Mistral alongside the user's question.

Context: {context}

Question: {question}

The model is instructed to answer using only the provided context.

This is where retrieval and generation come together.

Without retrieval, the model answers based on its training data.

With retrieval, the model answers using information supplied at runtime.

This simple shift grounds the answer in text that can be inspected, rather than in whatever the model recalls from training.
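
A minimal sketch of assembling such a prompt is shown below. The `Context:` and `Question:` layout follows the template above, but the instruction wording and the `build_prompt` name are assumptions, not the exact prompt sent to Mistral in this project.

```python
def build_prompt(context, question):
    """Combine retrieved context and the user's question into one grounded prompt."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context: {context}\n\n"
        f"Question: {question}"
    )


prompt = build_prompt(
    context="An AI agent perceives, decides, and acts.",
    question="What is an AI agent?",
)
print(prompt)
```

Giving the model an explicit way out ("say you do not know") is one common guard against answers that drift back to training data when retrieval returns weak context.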


What Surprised Me Most

Before building this project, I assumed the language model would be the most important part of the system.

I was wrong.

Most answer quality issues were retrieval issues.

When retrieval returned weak context, answer quality suffered.

When retrieval returned relevant context, answer quality improved significantly.

This changed how I think about AI applications.

Prompt engineering matters.

Model selection matters.

But retrieval quality often determines whether an answer is useful in the first place.


Engineering Tradeoffs I Encountered

Even in a small project, several tradeoffs became visible.

Simplicity vs Context

Sentence-level chunking was easy to implement.

However, preserving context becomes more difficult as chunk sizes become smaller.

Precision vs Coverage

Retrieving a single best match is simple.

Retrieving multiple relevant chunks provides broader coverage but introduces additional complexity.
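
As a rough illustration of what multi-chunk retrieval involves, the sketch below sorts all chunks by score and keeps the first `k`. The name `retrieve_top_k`, the default `k=2`, and the toy vectors are assumptions made for this example; none of it is part of the implementation described in this note.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec, chunk_vecs, chunks, k=2):
    """Return the k highest-scoring (score, chunk) pairs, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk)
        for vec, chunk in zip(chunk_vecs, chunks)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy data: hypothetical chunks with made-up 3-dimensional vectors.
chunks = [
    "An AI agent perceives, decides, and acts.",
    "Traditional programs follow fixed rules.",
    "Agents can plan across several steps.",
]
chunk_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.2], [0.6, 0.4, 0.0]]
query_vec = [1.0, 0.0, 0.0]

top = retrieve_top_k(query_vec, chunk_vecs, chunks, k=2)
context = "\n".join(chunk for _, chunk in top)
print(context)
```

The added complexity shows up in what follows the sort: choosing `k`, ordering the chunks inside the prompt, and dealing with chunks that overlap or contradict each other.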

Learning vs Production

Building retrieval manually helped me understand the system.

In production environments, a linear scan over in-memory vectors stops scaling as the corpus grows, which is where dedicated vector databases and retrieval frameworks earn their place.


What I Got Wrong

Before starting this project, I believed prompt engineering would have the greatest impact on answer quality.

The implementation showed otherwise.

Poor retrieval produced poor answers regardless of prompt quality.

Improving retrieval had a larger effect than rewriting prompts.

That was one of the most valuable lessons from the entire exercise.


Limitations of This Version

This implementation intentionally prioritizes learning over scalability.

Several important features are missing:

  • Top-K retrieval
  • Persistent vector storage
  • Metadata filtering
  • Conversation memory
  • Retrieval evaluation metrics
  • Hybrid search techniques

These limitations are acceptable because the objective was understanding the fundamentals rather than building a production-ready system.
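
Of the missing pieces, a retrieval metric is probably the cheapest to prototype. The sketch below computes a simple hit rate: the fraction of test questions for which the retrieved chunk index matches a hand-labelled expected index. The function name and the sample indices are hypothetical and are not results from this project.

```python
def hit_rate(retrieved_indices, expected_indices):
    """Fraction of questions whose retrieved chunk matches the expected chunk."""
    if len(retrieved_indices) != len(expected_indices):
        raise ValueError("both lists must have one entry per question")
    if not expected_indices:
        return 0.0
    hits = sum(r == e for r, e in zip(retrieved_indices, expected_indices))
    return hits / len(expected_indices)


# Hypothetical outcome for four test questions (not measured data).
retrieved = [0, 2, 1, 3]
expected = [0, 2, 2, 3]
print(hit_rate(retrieved, expected))  # 3 of 4 match -> 0.75
```

Even a small labelled set like this makes it possible to compare two chunking strategies by number instead of by impression.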


What's Next

This implementation serves as the foundation for future work within Agentic Finance Beast.

The next improvements I plan to explore include:

  • Top-K retrieval instead of single-result retrieval
  • Better chunking strategies
  • Vector storage using pgvector
  • Financial document retrieval
  • Multi-step workflows with LangGraph
  • Agent memory and reasoning systems

Each improvement builds upon the concepts explored in this first implementation.


Final Thoughts

Building a RAG pipeline from scratch did not make me an expert in retrieval systems.

What it did provide was a practical understanding of how retrieval, embeddings, similarity search, and generation work together.

Frameworks such as LangChain are incredibly useful.

But understanding the fundamentals behind those abstractions provides a different kind of value.

When something breaks, I now have a mental model for where to investigate.

For me, that understanding made building from scratch worthwhile.


Repository

GitHub: https://github.com/Sumayea104/agentic-finance-beast
