Introduction
LLMs store information within their own parameters: by training on massive datasets, they internalize that data. But what happens when they are asked about information they don't know? Such queries are likely to produce hallucinations or outright wrong answers.
Updating a model with current data is difficult and resource-intensive, so most AI service providers do not retrain their models frequently. Instead, they usually leave a model as it is after release, because retraining is highly inefficient. That is why every model has a knowledge cutoff date.
How, then, can they answer questions about up-to-date information? For example, "Who is the president of the U.S. right now?" or "Tell me today's news regarding the U.S.-Iran conflict." Without external tools, they simply can't.

Knowledge Integration Strategies
Fundamentally, LLMs hold intrinsic knowledge within their own parameters. Additionally, users can inject specific information through their prompts. Therefore, there are three main methods to provide models with the proper information: Fine-tuning, Prompt Engineering, and RAG.
Three Techniques to Optimize LLMs
| Method | Required Resources (Cost) | Inference Time (Latency) | Training / Data Prep Time |
|---|---|---|---|
| Fine-Tuning | High | Low | High |
| Prompt Engineering | Very Low | Low to Medium | Zero |
| RAG | Medium | High | Low (Ingestion/Indexing) |
Each of these three methods has its own pros and cons.
1) Fine-tuning
Fine-tuning adjusts the actual weights of the model by training it on a dataset.
Pros
- Deep Customization: Bakes your specific domain knowledge, stylistic tone, or formatting rules directly into the model's "brain."
- Shorter Prompts & Lower Latency: Because the knowledge is embedded in the weights, you no longer need to stuff massive context into your prompts. This drastically reduces the time-to-first-token and bypasses context window limits.
Cons
- High Upfront Cost: Requires GPUs to train.
- Data Hungry: You need a high-quality, perfectly curated dataset (often hundreds or thousands of examples).
- Knowledge Stagnation: The model's knowledge is frozen at the exact time of training. Retraining and deploying are very inconvenient and take a long time.
2) Prompt engineering
This is the simplest way to inject information. You're tweaking the input text to guide the model's output without changing its underlying neural network weights.
Pros
- Fastest Iteration: You can test, tweak, and deploy changes in seconds.
- Zero Training Cost: No heavy GPU compute is required.
- Highly Flexible: Switch tasks instantly just by altering the prompt instructions.
Cons
- Context Limits: You are strictly restricted by the model's maximum context window.
- Token Costs: Stuffing prompts with massive context gets expensive at scale.
- Inconsistent Reliability: Highly complex, multi-step instructions can confuse the model or trigger hallucinations.
3) RAG (Retrieval-Augmented Generation)
RAG connects your LLM to an external knowledge base (like a vector database). When a user asks a query, the model retrieves relevant data and feeds it to the LLM as context.
Pros
- Reduces Hallucinations: Answers are explicitly grounded in your specific, verifiable data.
- Up-to-Date Knowledge: You don't need to retrain the model when your data changes; simply update the vector database.
- Source Citations: You can trace exactly which document the model used to generate its response, adding trustworthiness.
Cons
- Higher Latency: Fetching embeddings, querying the database, and processing a bloated context window slows down inference time.
- Additional Infrastructure: Requires maintaining extra infrastructure (embedding models, vector DBs, retrieval pipelines).
- Garbage In, Garbage Out: If your retrieval step fails to find the right chunk, the LLM will fail to answer correctly regardless of its size.
Those methods are not inherently superior or inferior to one another. When you build an LLM system, you might use a combination of them.
For example, let's assume that you are building an AI-powered pet service that provides medical diagnoses for pets and locates nearby veterinarians. In this case, you need to provide the model with a veterinary knowledge base. Because this information doesn't change frequently, if you train the model on it once, you won't need to retrain it often. You could also input the information directly into the prompt, but that makes the prompt excessively long. Therefore, it is better to fine-tune the model on this knowledge.
Next, you have to write a basic instruction prompt outlining what services it should provide and what kind of persona or tone it should adopt.
Finally, how do we provide information about the vets' locations? If you simply train the model on this data, there is no guarantee it will retrieve the information accurately. Furthermore, you would have to retrain it whenever new clinics open or existing ones close across the state. This requires frequent updates, but it is also impossible to fit all the state's veterinary data into a single prompt. That is why you need to build a RAG system for this.
Now, let's dive deep into RAG.
RAG (Retrieval-Augmented Generation)
RAG, as the name implies, retrieves data or information from a database. As mentioned above, the model's parametric memory is static and obscures data provenance. And even if we train the model on our data, we can't be sure that it will reference it properly. Prompt engineering is relatively surefire, but more often than not, we can't input an entire dataset, nor change the prompt for every inference.
So, how exactly does RAG retrieve the data? There are several methods it uses.
Lexical (Sparse) RAG
When people talk about RAG today, they usually mean Dense RAG—converting text into dense vector embeddings—but before that, there was a simpler way called Sparse (Lexical) RAG.
Lexical retrieval looks for exact keyword matches and doesn't require high-dimensional embedding models. Its vectors are mostly zeros, since each document contains only a tiny fraction of the vocabulary; that's why it is called "sparse."
The undisputed king of traditional lexical retrieval is the Okapi BM25 algorithm.
Here is the master equation used to calculate the relevance score of a document $D$ given a user query $Q$:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Let's break down these variables:
- $q_i$: The $i$-th keyword in the user's query.
- $f(q_i, D)$: The term frequency (how many times the keyword appears in document $D$).
- $|D|$: The total word count (length) of the document.
- $\text{avgdl}$: The average document length across your entire knowledge base.
- $k_1$ and $b$: Tunable constants. $k_1$ (usually between 1.2 and 2.0) controls how quickly the term frequency score saturates. $b$ (usually around 0.75) controls how much the document length penalizes the score, preventing massive, wordy documents from automatically dominating the top results.

The IDF (Inverse Document Frequency) portion ensures that rare words carry significantly more weight than common words like "the" or "animal." It is calculated as:

$$\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)$$

- $N$: The total number of documents in the database.
- $n(q_i)$: The number of documents that contain the keyword $q_i$.
Essentially, BM25 says: "Reward documents where the query terms appear frequently, but only if those terms are rare across the whole database, and penalize documents that are just incredibly long."
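To make the formulas concrete, here is a toy, from-scratch BM25 scorer over whitespace-tokenized documents (illustrative only; real engines add tokenizers, stemming, and inverted indexes):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of one tokenized document against a tokenized query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for q in query_terms:
        n_q = sum(1 for d in corpus if q in d)             # documents containing q
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5) + 1)  # rare terms weigh more
        f = doc.count(q)                                   # term frequency in this doc
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [text.split() for text in [
    "the cat sat on the mat",
    "dogs and cats are loyal pets",
    "the quick brown fox jumps",
]]
print(bm25_score(["cat"], corpus[0], corpus))
```

A document with no matching term contributes a score of zero, which is exactly the exact-match behavior discussed below.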
Therefore, if you enter a specific query, it will calculate a score for each document and retrieve the top-$k$ documents. Let's try this with a real dataset and some code.
Sparse RAG Example
There are several frameworks that support BM25, such as LangChain. In this example, I'm going to use the Qdrant/bm25 model from fastembed.
from fastembed import SparseTextEmbedding
sparse_model = SparseTextEmbedding(model_name="Qdrant/bm25")
documents = ["Hello. Who are you?", "Hello World, who the hell are you?"]
print(list(sparse_model.embed(documents)))
[
SparseEmbedding(values=array([1.6877]), indices=array([613153351])),
SparseEmbedding(values=array([1.6786, 1.6786, 1.6786]), indices=array([613153351, 74040069, 1587029005]))
]
I just embedded two sentences with the BM25 model. The indices array identifies the distinct words in each sentence. But you might notice that the first sentence has only one index: this model automatically filters out stop words (common, high-frequency words like "the" and "is"). Then what are the values?
The values are the pre-calculated term weights:

$$w(q_i, D) = \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

But this is pretty weird. How can these values be pre-calculated? To evaluate the full scoring equation, you still need the IDF and the avgdl, which depend on the entire corpus. In fact, this is a trick that fastembed uses.
The secret lies in the name of the model: Qdrant/bm25. This is not a dynamic BM25 algorithm calculating stats from your specific pet clinic database. It is a pre-trained model. Researchers ran the BM25 algorithm over a massive, generic dataset (usually MS MARCO, a dataset of millions of Bing searches). From that massive dataset, they permanently froze two values:
The Global IDF: A massive lookup table of how rare words are in the English language.
The Global avgdl: A static constant representing the average document length in their training corpus.
Given that both variables are treated as fixed constants, the resulting values effectively encode the entire scoring equation the moment you embed the document. This is the efficient trick that the framework leverages.
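To see why freezing these statistics makes single-document embedding possible, here is a minimal sketch (the IDF values and the avgdl below are made-up illustrative numbers, not fastembed's actual constants, and `frozen_bm25_weight` is a hypothetical helper):

```python
import math

# Hypothetical frozen global statistics (illustrative numbers only)
FROZEN_IDF = {"hello": 1.2, "world": 2.8, "hell": 3.5}
FROZEN_AVGDL = 256.0

def frozen_bm25_weight(term, doc_tokens, k1=1.2, b=0.75):
    """With IDF and avgdl frozen, a term's full BM25 weight needs only this one document."""
    f = doc_tokens.count(term)
    tf_part = f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc_tokens) / FROZEN_AVGDL))
    return FROZEN_IDF[term] * tf_part

print(frozen_bm25_weight("hello", "hello world".split()))
```

Because nothing in this function looks at the rest of the corpus, the weight can be computed once at embedding time, which is exactly what the model's output values represent.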
To test this, I downloaded the hotpot_qa dataset and embedded 1,000 rows of it using a BM25 scorer.
{'id': '5ac2a912554299218029dae8', 'question': 'Which band was founded first, Hole, the rock band that Courtney Love was a frontwoman of, or The Wolfhounds?', 'answer': 'The Wolfhounds', 'type': 'comparison', 'level': 'medium', 'supporting_facts': {'title': ['Courtney Love', 'Courtney Love', 'The Wolfhounds'], 'sent_id': [0, 2, 0]}, 'context': {'title': ["Nobody's Daughter", 'Courtney Love filmography', 'Patty Schemel', 'Beautiful Son', 'The Wolfhounds', 'Live Through This', 'Turpentine (song)', 'Miss World (song)', 'Softer, Softest', 'Courtney Love'], 'sentences': [["Nobody's Daughter is the fourth and final studio album by American alternative rock band Hole, released worldwide on April 27, 2010, through Mercury Records.", ' The album was originally conceived by Hole frontwoman Courtney Love as a solo project titled "How Dirty Girls Get Clean", following her poorly received solo debut "America's Sweetheart" (2004).', ' Much of the material featured on "Nobody's Daughter" originated from studio sessions for "How Dirty Girls Get Clean", which had been conceived in 2006 after a multitude of legal issues, drug addiction, and rehabilitation sentences had left Love "suicidal".', ' Love financed the making of the record herself, which cost nearly two million dollars.'], ['Courtney Love is an American ...
The dataset looks something like this. Now, I'm going to try retrieving data using the sample question.
query = "Which band was founded first, Hole, the rock band that Courtney Love was a frontwoman of, or The Wolfhounds?"
retrieved_results = store.search(query, top_k=3)
for i, r in enumerate(retrieved_results):
    print(f"Retrieved #{i+1}: ", r, "\n")
output:
Retrieved #1: {'score': 37.0975227355957, 'text': 'Miss World (song). "Miss World" is a song by American alternative rock band Hole, written by frontwoman Courtney Love and lead guitarist Eric Erlandson. The song was released as the band\'s fifth single and the first from their second studio album, "Live Through This", in March 1994.', 'metadata': {'title': 'Miss World (song)', 'source': 'wikipedia_extract', 'original_doc_id': '5ac2a912554299218029dae8_Miss_World_(song)'}}
Retrieved #2: {'score': 35.72223663330078, 'text': 'Courtney Love. Courtney Michelle Love (born Courtney Michelle Harrison; July 9, 1964) is an American singer, songwriter, actress, and visual artist. Prolific in the punk and grunge scenes of the 1990s, Love has enjoyed a career that spans four decades. She rose to prominence as the frontwoman of the alternative rock band Hole, which she formed in 1989. Love has drawn public attention for her uninhibited live performances and confrontational lyrics, as well as her highly publicized personal life following her marriage to Kurt Cobain.', 'metadata': {'title': 'Courtney Love', 'source': 'wikipedia_extract', 'original_doc_id': '5ac2a912554299218029dae8_Courtney_Love'}}
Retrieved #3: {'score': 34.49491882324219, 'text': 'Beautiful Son. "Beautiful Son" is a song by American alternative rock band Hole, written collectively by frontwoman Courtney Love, lead guitarist Eric Erlandson and drummer Patty Schemel. The song was released as the band\'s fourth single in April 1993 on the European label City Slang. To coincide with the song\'s lyrics, Love used a photograph of her husband, Kurt Cobain, at age 7 as the single\'s artwork.', 'metadata': {'title': 'Beautiful Son', 'source': 'wikipedia_extract', 'original_doc_id': '5ac2a912554299218029dae8_Beautiful_Son'}}
As you can see, the RAG system failed to retrieve the relevant data. In fact, this type of query is poorly suited to sparse RAG. As mentioned above, sparse RAG finds documents containing the exact words from a query, so if you mention the word "dog" in a query, it can't match documents about "pet" or "puppy."
Therefore, it is better to use sparse RAG like this:
retrieved_results = store.search("Wolfhounds", top_k=3)
for i, r in enumerate(retrieved_results):
    print(f"Retrieved #{i+1}: ", r, "\n")
output:
Retrieved #1: {'score': 11.775125503540039, 'text': 'The Wolfhounds. The Wolfhounds are an indie pop/noise pop band formed in Romford, UK in 1985 by Dave Callahan, Paul Clark, Andy Golding, Andy Bolton and Frank Stebbing, and originally active until 1990. The band reformed in 2005 and continues to write, record and play live, releasing new albums in 2014 and 2016.', 'metadata': {'title': 'The Wolfhounds', 'source': 'wikipedia_extract', 'original_doc_id': '5ac2a912554299218029dae8_The_Wolfhounds'}}
If you want the correct answer to the question—"Which band was founded first, Hole, the rock band that Courtney Love was a frontwoman of, or The Wolfhounds?"—it is better to search for individual keywords: "Hole", "Courtney Love", "Wolfhounds" and combine them in a query.
Cons of Sparse RAG
Sparse RAG has some significant weaknesses.
First, it is strictly an exact-match algorithm. It can't correct minor typos or recognize conceptually related words the way LLMs do.
retrieved_results = store.search("dog", top_k=1)
print(retrieved_results)
>>> [{'score': 10.09609603881836, 'text': 'Salty dog (cocktail). A salty dog is a cocktail of tequila, or gin, and grapefruit juice, served in a highball glass with a salted rim. The salt is the only difference between a salty dog and a greyhound. Vodka may be used as a substitute for tequila; nevertheless, it is historically a tequila drink.', 'metadata': {'title': 'Salty dog (cocktail)', 'source': 'wikipedia_extract', 'original_doc_id': '5ac3ad225542995ef918c1da_Salty_dog_(cocktail)'}}]
retrieved_results = store.search("puppy", top_k=1)
print(retrieved_results)
>>> []
Second, some languages, such as Korean and Japanese, don't separate a word from its postposition. If this were applied to English, "A boy is" would look like "A boyis". Therefore, these languages need a specific preprocessing step called morphological analysis (morpheme separation) before searching with sparse RAG.
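A tiny sketch of why this breaks (the fused token "boyis" stands in for an agglutinated word, and `exact_match_count` is a hypothetical helper, not a library function):

```python
def exact_match_count(query_token, doc_tokens):
    """Sparse/lexical matching ultimately boils down to exact token counts."""
    return doc_tokens.count(query_token)

# English: whitespace separates the word from the particle, so "boy" matches.
print(exact_match_count("boy", "a boy is here".split()))
# Fused form: without morphological analysis, "boy" never matches "boyis".
print(exact_match_count("boy", "a boyis here".split()))
```

A morphological analyzer would first split "boyis" into "boy" + "is", restoring the exact match.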
There are other algorithms, such as BM42 or SPLADE, that make up for sparse RAG's limitations. However, they are not widely used due to their complexity. BM25 remains the industry standard, and if you need a more precise and complex search tool, it is better to utilize the other RAG methods that I will explain below.
Dense RAG (Semantic Vector Search)
Due to the limitations of sparse RAG, dense RAG is the most widely used RAG method. When people refer to RAG, they usually mean dense RAG.
While sparse RAG is great for exact matches, Dense RAG is where the magic of "understanding" happens. By converting text into dense vectors (typically 768 or 1024 dimensions), we can find documents that are conceptually related, even if they share zero common words.
The method is simple: compute the distance between the query vector and the document vectors, and retrieve the closest top-$k$ documents. The relevance score (distance score) is computed instantaneously, usually using the cosine similarity:

$$\text{cosine}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\|\,\|\mathbf{B}\|}$$
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from langchain_ollama import OllamaEmbeddings
embedding_model = OllamaEmbeddings(model="qwen3-embedding:4b")
vector_dog = np.array(embedding_model.embed_query("dog")).reshape(1, -1)
vector_puppy = np.array(embedding_model.embed_query("puppy")).reshape(1, -1)
vector_cat = np.array(embedding_model.embed_query("cat")).reshape(1, -1)
vector_missile = np.array(embedding_model.embed_query("patriot missile")).reshape(1, -1)
cosine_similarity(vector_dog, vector_puppy)
>>> Out[15]: array([[0.84683586]])
cosine_similarity(vector_dog, vector_cat)
>>> Out[16]: array([[0.79599878]])
cosine_similarity(vector_dog, vector_missile)
>>> Out[17]: array([[0.54258399]])
Should we then compute all of the relevance scores and retrieve the K-Nearest Neighbors (KNN)? This exhaustive computation becomes brutal when the vector dimension is high: the complexity is $O(N \cdot d)$, where $N$ is the total number of documents and $d$ is the vector dimensionality.
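For intuition, exhaustive KNN looks like this (a minimal NumPy sketch, where each query pays for $N$ dot products of $d$ dimensions; `knn_search` is a hypothetical helper):

```python
import numpy as np

def knn_search(query, doc_vectors, k=3):
    """Brute-force cosine-similarity search: O(N * d) work per query."""
    q = query / np.linalg.norm(query)
    docs_norm = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    sims = docs_norm @ q              # N dot products of d dimensions each
    return np.argsort(-sims)[:k]      # indices of the top-k most similar documents

rng = np.random.default_rng(0)
docs = rng.normal(size=(1000, 64))
query = docs[42] + rng.normal(scale=0.01, size=64)  # a near-duplicate of document #42
print(knn_search(query, docs))
```

This is fine for a thousand vectors, but the per-query cost grows linearly with the corpus, which is exactly what the ANN methods below avoid.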
So, there are alternative methods to full KNN. This approach is called ANN (Approximate Nearest Neighbor), and the two most representative ANN algorithms are HNSW and IVF-PQ.
Hierarchical Navigable Small World (HNSW)
HNSW is currently the "gold standard" for most vector databases. It is a graph-based algorithm that builds a multi-layered structure of vectors.
How it works
Think of it like a "skip list" for graphs. The top layers have fewer points and long-distance links (for fast traversal across the data "map"), while the bottom layers have all the points and short-distance links (for precise local searching). You start at the top, zoom in to the right neighborhood, and move down a layer to refine your search.
How it builds layers
The process of building and deploying nodes is a bit complicated. I will explain this step by step.
Step 1) Layer 0 (the lowest layer) contains all inserted vectors. As you move to higher layers (Layer 1, Layer 2, etc.), the number of nodes decreases exponentially. How does it choose which nodes will remain? First, it rolls the dice for each node.

The maximum layer $l$ for a new node is determined by an exponentially decaying probability distribution:

$$l = \left\lfloor -\ln(U) \cdot m_L \right\rfloor$$

Where:
- $U$ is a uniformly distributed random number between 0 and 1.
- $m_L$ is the Level Generation Multiplier.

The Role of $m_L$ and $M$

$m_L$ is mathematically tied to the hyperparameter $M$ (the maximum number of connections per node). The theoretical value of $m_L$ that minimizes search complexity is:

$$m_L = \frac{1}{\ln(M)}$$

Therefore, this ensures that the number of nodes in each subsequent layer decreases by a factor of exactly $M$.
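The layer-assignment rule above is easy to verify with a quick simulation (a toy sketch assuming $M = 16$; `assign_layer` is a hypothetical helper, not part of any HNSW library):

```python
import math
import random

def assign_layer(M=16):
    """Draw a node's maximum layer: l = floor(-ln(U) * mL), where mL = 1 / ln(M)."""
    mL = 1.0 / math.log(M)
    U = 1.0 - random.random()  # uniform in (0, 1], avoids log(0)
    return int(-math.log(U) * mL)

random.seed(0)
layers = [assign_layer() for _ in range(100_000)]
# P(l >= k) = M^(-k): with M=16, about 6.25% of nodes reach layer 1, ~0.39% layer 2, ...
counts = [sum(1 for l in layers if l >= k) for k in range(4)]
print(counts)
```

Each successive layer holds roughly 1/16 of the nodes of the layer below it, matching the $1/M$ decay the formula promises.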
Step 2) Let's assume our new vector $q$ was assigned a maximum layer $l = 2$, while the graph currently has a maximum layer of $L = 4$.
The algorithm starts at the top (Layer 4) at a predefined entry point node. It evaluates the distance between $q$ and the entry point's neighbors, then greedily jumps to the neighbor closest to $q$, repeating this process until it reaches Layer 2.
Step 3) Now that the routing has brought us spatially close to where $q$ belongs in Layer 2, the actual graph building begins.
From Layer 2 down to Layer 0, the algorithm performs a local search to find the nearest neighbors to connect $q$ to.
- The algorithm maintains a dynamic list of the closest nodes it has found so far, capped at the size of ef_construction.
- It continually explores the neighbors of the nodes in this queue. If it finds a closer node, it adds it to the queue.
- Once the local area is fully explored, it selects up to $M$ nodes from the queue to create bidirectional edges with $q$.
- It prunes the worst edge if adding $q$ causes an existing node to exceed its maximum allowed connections (usually $M$ for upper layers, $2M$ for Layer 0). But it doesn't just prune the furthest edge; it drops nodes that are clustered together, ensuring connections are spread out in different directions to maintain the "small world" navigability.
- After connecting $q$ in Layer 2, it goes down to Layer 1, uses the best nodes found in Layer 2 as the new entry points, fills a new queue, and connects up to $M$ nodes again. This process repeats until Layer 0 is fully connected.
Configuring HNSW (Hyperparameter tuning)
- $M$ (Typical range: 16 to 64): Controls the number of bidirectional links per node and dictates the layer density.
  - Trade-off: A higher $M$ yields better accuracy, but drastically increases RAM usage and insertion time.
- ef_construction (Typical range: 100 to 500): Controls the depth of the search during insertion.
  - Trade-off: A higher ef_construction builds a significantly higher-quality graph, but the penalty is a linear increase in index build time. It does not affect query latency.
- ef_search (Typical range: 50 to 200): The equivalent of ef_construction, but used purely at query time.
  - Trade-off: Controls the speed vs. recall trade-off for your users.

Note: When my company first introduced a RAG system, our Cloud Service Provider shared their tips and technical know-how on HNSW. In that presentation, they recommended setting $M$ to 16 and both ef_construction and ef_search to 128. They told us that these values strike the optimal balance across memory usage, latency, and recall. According to our internal evaluations, that turned out to be true, but I don't think it is an absolute standard. Just consider it a useful tip; you need to test it with your own data.
The query complexity of HNSW scales logarithmically, $O(\log N)$, facilitating exceptionally fast retrieval. However, this demands a massive memory footprint, as the system must keep the complex adjacency lists and bidirectional edge pointers in RAM.
Inverted File Index (IVF) & Product Quantization (PQ)
IVF is a clustering-based approach, often paired with Product Quantization (PQ) to save memory. It consumes less memory compared to HNSW, but requires a training process.
IVF (Inverted File Index) — The Macro Partition
IVF partitions the high-dimensional space into nlist distinct regions (Voronoi cells) and only searches the regions closest to the query.
How it works
Training (K-Means): During index initialization, a clustering algorithm (typically K-Means) is run across a representative sample of your dataset to find nlist cluster centers (centroids).
Assignment: Every vector in your database is assigned to its nearest centroid. The index builds an Inverted List: a dictionary mapping each centroid to the IDs of all vectors assigned to it.
Querying: When a query vector $q$ arrives, the index calculates the distance from $q$ to each of the nlist centroids. Then it selects only the nprobe nearest centroids and computes distances to only the vectors residing in those specific cells.
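A minimal sketch of the IVF idea in NumPy (a toy implementation with plain k-means and dict-based inverted lists; real engines such as FAISS are far more optimized, and `build_ivf`/`ivf_search` are hypothetical helpers):

```python
import numpy as np

def build_ivf(vectors, nlist=16, iters=10, seed=0):
    """Toy IVF index: k-means centroids plus inverted lists (centroid -> vector IDs)."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
    for _ in range(iters):  # plain Lloyd's k-means
        assign = np.argmin(np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
        for c in range(nlist):
            if np.any(assign == c):
                centroids[c] = vectors[assign == c].mean(axis=0)
    # Final assignment against the updated centroids becomes the inverted lists
    assign = np.argmin(np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
    inv_lists = {c: np.where(assign == c)[0] for c in range(nlist)}
    return centroids, inv_lists

def ivf_search(query, vectors, centroids, inv_lists, nprobe=4, k=3):
    """Scan only the nprobe cells whose centroids are nearest to the query."""
    nearest_cells = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    candidates = np.concatenate([inv_lists[c] for c in nearest_cells])
    dists = np.linalg.norm(vectors[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(1)
data = rng.normal(size=(500, 32))
centroids, inv_lists = build_ivf(data)
query = data[7] + rng.normal(scale=0.01, size=32)  # a slightly perturbed copy of vector #7
print(ivf_search(query, data, centroids, inv_lists))
```

With nprobe=4, only the vectors in 4 of the 16 cells are compared against the query, which is where the speedup over flat search comes from.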
PQ (Product Quantization) — The Micro Compression
While IVF reduces the number of vectors it searches, PQ reduces the size of the vectors themselves.
How it works
Splitting: A high-dimensional vector is split into $m$ equally sized sub-vectors. For example, if $m = 8$ and the vector dimension is 1,024, PQ splits each vector into 8 sub-vectors of 128 dimensions each.
Sub-Clustering: For each of the $m$ sub-spaces, it runs K-Means to find a set of sub-centroids. (Usually, the number of sub-centroids per sub-space is set to 256, so that a centroid ID fits in a single 8-bit byte.)
Encoding: Every sub-vector is replaced by the ID of its nearest sub-centroid (0–255).
- Memory Saving: A 1,024-dim float32 vector (4,096 bytes) is compressed into $m$ bytes. If $m = 8$, that is just 8 bytes, achieving a 512x compression ratio.
Querying: When a query vector $q$ arrives, it is split into $m$ sub-vectors. For each query sub-vector $q_j$, it calculates the distance to all 256 sub-centroids in the $j$-th sub-space and stores them in a lookup table. Therefore, distances are calculated 256 × $m$ times for each query.
To calculate the distance between $q$ and any compressed vector $y$, we simply sum the pre-computed distances from the lookup table using the stored centroid IDs:

$$d(q, y)^2 \approx \sum_{j=1}^{m} d\left(q_j, c_j(y)\right)^2$$

(Where $c_j(y)$ is the sub-centroid assigned to the $j$-th sub-vector of $y$.)
For example, let's assume $m = 8$, the vector dimensionality is 1,024, and we have a dataset containing 100,000 vectors.
Without PQ (Flat Search): You must run 100,000 distance calculations between 1,024-dimension vectors. That requires 102,400,000 multiplications.
With PQ: You would run 2,048 (256 × 8) distance computations to build the lookup table. Then, without any further multiplications, you merely reference the lookup table 100,000 × 8 times. Not only are fewer computations required, but the file size of the vector DB is also reduced significantly.
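The whole pipeline can be sketched in NumPy (a toy version assuming 256 sub-centroids per sub-space and plain k-means; `pq_train_encode` and `pq_search` are hypothetical helpers, not a library API):

```python
import numpy as np

def pq_train_encode(vectors, m=8, ksub=256, iters=5, seed=0):
    """Split each vector into m sub-vectors; k-means each sub-space into ksub centroids."""
    N, d = vectors.shape
    dsub = d // m
    rng = np.random.default_rng(seed)
    codebooks = []
    codes = np.empty((N, m), dtype=np.uint8)  # each sub-vector becomes one byte
    for j in range(m):
        sub = vectors[:, j * dsub:(j + 1) * dsub]
        cents = sub[rng.choice(N, ksub, replace=False)]
        for _ in range(iters):  # plain k-means in this sub-space
            assign = np.argmin(((sub[:, None] - cents[None]) ** 2).sum(-1), axis=1)
            for c in np.unique(assign):
                cents[c] = sub[assign == c].mean(axis=0)
        codebooks.append(cents)
        codes[:, j] = np.argmin(((sub[:, None] - cents[None]) ** 2).sum(-1), axis=1)
    return codebooks, codes

def pq_search(query, codebooks, codes, k=3):
    """Asymmetric distance: one 256*m lookup table per query, then table sums per vector."""
    m = len(codebooks)
    dsub = len(query) // m
    table = np.stack([((codebooks[j] - query[j * dsub:(j + 1) * dsub]) ** 2).sum(-1)
                      for j in range(m)])           # shape (m, 256)
    dists = table[np.arange(m), codes].sum(axis=1)  # one table lookup per sub-space per vector
    return np.argsort(dists)[:k]

rng = np.random.default_rng(2)
data = rng.normal(size=(1000, 64))
codebooks, codes = pq_train_encode(data)
print(pq_search(data[5], codebooks, codes))  # vector #5 itself as the query
```

Note that after the table is built, scoring each database vector is just $m$ table lookups and additions, with no multiplications at all.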
HNSW vs. IVF-PQ
| Method | Inference Time (Latency) | RAM consumption | Index Build Time |
|---|---|---|---|
| HNSW | Very Low | High | High |
| IVF-PQ | Medium | Low | Medium (Requires Training) |

In most cases, people tend to prefer HNSW. It is essentially plug-and-play: it doesn't require a training process, and its search is fast even on a CPU alone. So it appears superior to IVF in almost every aspect.
But if the database becomes very large (over 100M vectors), HNSW will eat up too much RAM. To save on massive RAM costs, IVF-PQ can be the better choice in this case.
A Real Example of Using Dense RAG and Vector Search
I will demonstrate dense RAG and a vector search DB using HNSW. I'm going to input the entire transcript of the series "Demon Slayer" Seasons 1 through 4. (I think it would be boring to just use the typical data scattered across the internet!)
First, you need to build the client and DB. I will use the Qdrant framework. There are a bunch of RAG frameworks, and they all have their own pros and cons, so you should find out which framework best fits your project.
import os

import docx
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

# Text splitter: splits the text into several chunks.
# You can also set an overlap window to maintain context across chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1500,
    chunk_overlap=150,  # 10% overlap
    length_function=len,
)
embedding_dimension = 1024
model_name = "qwen3-embedding:4b"
docs_texts, docs_metadata = [], []  # accumulators for chunk texts and their metadata
# 'files' is the list of .docx transcript paths gathered from the input directory
for filepath in files:
    try:
        document = docx.Document(filepath)
        doc_content = []
        # The transcripts store their text in tables, so extract the cell text
        for table in document.tables:
            for row in table.rows:
                for idx, cell in enumerate(row.cells):
                    if (idx == 0 or cell.text != row.cells[0].text) and cell.text:
                        doc_content.append(cell.text)
        content = "\n".join(doc_content).strip()
        if content:
            # Split content into smaller chunks with overlap
            chunks = text_splitter.split_text(content)
            for chunk_idx, chunk in enumerate(chunks):
                docs_texts.append(chunk)
                docs_metadata.append({
                    "source": filepath,
                    "filename": os.path.basename(filepath),
                    "chunk_index": chunk_idx,
                    "page_content": chunk  # Store chunk text here
                })
    except Exception as e:
        print(f"Warning: Could not read file {filepath} - {e}")

embeddings = OllamaEmbeddings(model=model_name, dimensions=embedding_dimension)  # "qwen3-embedding:4b"
vectors = embeddings.embed_documents(docs_texts)
Next, I build the Qdrant vector DB using HNSW. I set m to 16, and both ef_construct and the query-time hnsw_ef to 128.
client = QdrantClient(path=output_dir)
collection_name = "semantic_rag_demon_slayer_collection"
client.create_collection(
    collection_name=collection_name,
    vectors_config=models.VectorParams(
        size=embedding_dimension,
        distance=models.Distance.COSINE,
        hnsw_config=models.HnswConfigDiff(
            m=16,
            ef_construct=128,
            full_scan_threshold=0,  # Always use HNSW search, never fall back to brute force
        ),
    ),
    optimizers_config=models.OptimizersConfigDiff(
        indexing_threshold=100,  # Low threshold to force HNSW index build on small data
    ),
)
Next, I need to combine the vector points and metadata using PointStruct, and then insert them into the DB.
import uuid

points = []
for i, (vector, meta) in enumerate(zip(vectors, docs_metadata)):
    point_id = str(uuid.uuid4())
    points.append(
        models.PointStruct(
            id=point_id,
            vector=vector,
            payload=meta
        )
    )
client.upsert(
    collection_name=collection_name,
    points=points
)
uv run semantic_rag/embed_docs.py --input_dir database/demon_slayer --output_dir qdrant_db/vector_db/demon_slayer
---
Created new collection: semantic_rag_demon_slayer_collection
Processing document: database/demon_slayer\Demon Slayer _ S.1 E.01 (ENG sub).docx
Loaded 13 documents. Initializing the embedding model 'qwen3-embedding:4b'...
Generating embeddings... this might take a while.
Successfully saved the document: database/demon_slayer\Demon Slayer _ S.1 E.01 (ENG sub).docx
Processing document: database/demon_slayer\Demon Slayer _ S.1 E.02 (ENG sub).docx
Loaded 12 documents. Initializing the embedding model 'qwen3-embedding:4b'...
Generating embeddings... this might take a while.
Successfully saved the document: database/demon_slayer\Demon Slayer _ S.1 E.02 (ENG sub).docx
...
Successfully saved the document: database/demon_slayer\Demon Slayer _ S.4 E.07 (ENG sub).docx
Processing document: database/demon_slayer\Demon Slayer _ S.4 E.08 (ENG sub).docx
Loaded 13 documents. Initializing the embedding model 'qwen3-embedding:4b'...
Generating embeddings... this might take a while.
Successfully saved the document: database/demon_slayer\Demon Slayer _ S.4 E.08 (ENG sub).docx
Successfully vectorized and saved all documents locally.
I successfully saved all the documents in the vector DB.
Now, we need to test this with an LLM model to see if it retrieves the relevant information accurately.
You can retrieve the vectors in Qdrant this way:
embeddings = OllamaEmbeddings(model=model_name, dimensions=1024)
query_vector = embeddings.embed_query(query)
client = QdrantClient(path=db_dir)
collection_name = "semantic_rag_demon_slayer_collection"
results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    limit=top_k,
    search_params=models.SearchParams(hnsw_ef=128),
).points
retrieved_context = [res.payload.get("page_content", "") for res in results if res.payload]
Now, let's see whether it retrieves the Demon Slayer knowledge accurately.
top_k = 5 # retrieve 5 vectors
query = "Why did the Master ask Himejima to kill Muzan?"
query_vector = embeddings.embed_query(query)
results = client.query_points(
    collection_name=collection_name,
    query=query_vector,
    limit=top_k,
    search_params=models.SearchParams(hnsw_ef=128),
).points
retrieved_context = [res.payload.get("page_content", "") for res in results if res.payload]
for r in retrieved_context:
    print(r)
...
(Flashback) Use me as bait… and cut off… Muzan’s head.
Himejima
(Flashback) What makes you think that?
Kagaya
(Flashback) Fufu… Just my intuition. That’s all. No reason.
Himejima
(Thoughts) Along with his special voice, what he called “intuition” was prodigious among the Ubuyashiki clan.
(Thoughts) It’s also known as “foresight”. The power to see into the future. Using this, they built up their fortune and avoided crises many times over.
Kagaya
(Flashback) The other children… won’t agree… to using me as bait.
(Flashback) You’re the only one that I can ask… Gyoumei.
Himejima
(Flashback) Understood. If that is your wish, Master.
Kagaya
(Flashback) Thank you.
...
It successfully retrieved the relevant documents from the DB. Actually, this was the second most relevant chunk. The first one is somewhat relevant, but it doesn't include the right information to answer the question.
Next, before trying this with an LLM, I want to check if the GPT-5-nano model can answer questions about the series without any external information. Some models have the information about the series baked into their own weights.
I will ask these four questions:
- "Why did Akaza want Kyojuro to become a demon?"
- "Name all the Hashiras who entered the Infinity Castle"
- "What advanced versions of Thunder Breathing the First Form Thunderclap and Flash can Zenitsu use?"
- "What is the meaning of 'Musical Score' of Tengen?"
import os

from dotenv import load_dotenv
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

load_dotenv()
llm = ChatOpenAI(model="gpt-5-nano", api_key=os.getenv("OPENAI_API_KEY"))
template = """
Answer the query about the Japanese anime series Demon Slayer.
If you cannot find the answer, say that you do not know. Do not hallucinate.
Query:
{question}
"""
prompt = PromptTemplate(input_variables=["question"], template=template)
chain = prompt | llm
query_list = [
"Why did Akaza want Kyojuro to become a demon?",
"Name all the Hashiras who entered the Infinity Castle",
"What advanced versions of Thunder Breathing the First Form Thunderclap and Flash can Zenitsu use?",
"What is the meaning of 'Musical Score' of Tengen?"
]
answer_list = []
for query in query_list:
    response = chain.invoke({
        "question": query
    })
    answer_list.append(response.content)

for idx, answer in enumerate(answer_list):
    print(f"Answer #{idx+1}: ", answer)
Answer #1: Akaza wanted Kyojuro Rengoku to become a demon because he was looking for a powerful human to recruit into the Twelve Demon Moons. He admired Rengoku’s exceptional fighting ability and believed that, as a demon, Rengoku would be stronger and capable of fighting forever, giving Akaza the opportunity to face and test the strongest opponent.
Answer #2: I don’t know.
Answer #3: I don’t know. If you’d like, I can look up or summarize what’s officially stated about Thunder Breathing forms for Zenitsu.
Answer #4: I don’t know the meaning of a term called “Musical Score” related to Tengen. In official Demon Slayer material, Tengen Uzui’s fighting style is “Sound Breathing” (音の呼吸), and there’s no canon term or technique known as “Musical Score.” If you saw that phrase somewhere, it’s likely a fan translation or a metaphor/metonym for his sound/music motif rather than an official term. If you can share the source, I can help interpret it.
None of the answers are correct. The first one is somewhat close, but Akaza wasn't trying to recruit anyone into the Twelve Demon Moons.
But what if the model has access to the RAG DB? Let's try it with the vector search.
answer_list_with_rag = []

for query in query_list:
    query_vector = embeddings.embed_query(query)
    results = client.query_points(
        collection_name=collection_name,
        query=query_vector,
        limit=top_k,
        search_params=models.SearchParams(hnsw_ef=128),
    ).points
    retrieved_context = [res.payload.get("page_content", "") for res in results if res.payload]
    answer = llm_inference.generate_answer(query=query, retrieved_context=retrieved_context)
    answer_list_with_rag.append(answer)

for idx, answer in enumerate(answer_list_with_rag):
    print(f"Answer #{idx+1}: ", answer)
Answer #1: Akaza offered to make Kyojurou a demon so that he could keep training for 100–200 years and become much stronger. He pointed out that humans age and die, implying demonhood would grant him the time to grow far stronger.
Answer #2: - Tokitou (Muichiro Tokito) – Mist Hashira
- Shinazugawa (Sanemi Shinazugawa) – Wind Hashira
- Iguro (Obanai Iguro) – Serpent Hashira
- Kanroji (Mitsuri Kanroji) – Love Hashira
- Shinobu (Shinobu Kocho) – Insect Hashira
- Tomioka (Giyu Tomioka) – Water Hashira
- Himejima (Gyomei Himejima) – Stone Hashira
Answer #3: He can use Thunderclap and Flash: Godlike Speed, Sixfold, and Eightfold.
Answer #4: It refers to the name of Tengen Uzui’s technique, a music-based fighting style. “Musical Score” (the Musical Score Technique) uses the idea of a musical score to guide and counter his moves—even turning the opponent’s Blood Demon Art into a “song” to read and deflect it.
Now, it answers very accurately. This is the power of RAG: even if the backbone model doesn't have enough knowledge about a certain topic, the RAG system can efficiently inject the required knowledge.
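The `generate_answer` helper above is part of my project code, so here is a minimal, hypothetical sketch of what such a helper typically does internally: it stuffs the retrieved chunks into the prompt before calling the LLM. The function name `build_rag_prompt` and the instruction wording are illustrative, not the actual implementation.

```python
# Hypothetical sketch of a generate_answer-style helper's prompt assembly.
# It numbers each retrieved chunk and prepends it to the user's query.
def build_rag_prompt(query: str, retrieved_context: list[str]) -> str:
    context_block = "\n\n".join(
        f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_context)
    )
    return (
        "Answer the query using ONLY the context below. "
        "If the answer is not in the context, say you do not know.\n\n"
        f"Context:\n{context_block}\n\n"
        f"Query:\n{query}"
    )

prompt_text = build_rag_prompt(
    "Why did Akaza want Kyojuro to become a demon?",
    ["Akaza: Become a demon, Kyojuro!", "Kyojuro: I will die as a human."],
)
print(prompt_text)
```

The resulting string would then be sent to the LLM in place of the bare question, which is all "augmentation" really means in Retrieval-Augmented Generation.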
Limitations of Vector Search RAG
However, RAG is not a silver bullet. Basic vector-search RAG can only compare semantic vectors; it cannot reason about the entities behind the words, and it is not good at multi-hop inference.
For example, consider this query: "What does the main character do in the second most populous state of the U.S.?"
To answer this query, the system has to work out who the main character is and which state is the second most populous, then combine that information and search the database. But basic RAG is not designed to perform this kind of multi-hop inference.
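To make the multi-hop gap concrete, here is a toy, hand-written "decomposition" step that resolves the indirect reference before searching. A real system would use an LLM or a knowledge graph to resolve the intermediate fact; the lookup table below is purely illustrative.

```python
# Illustrative only: a real query-rewriting step would call an LLM or a
# knowledge graph rather than use a hardcoded fact table.
INTERMEDIATE_FACTS = {
    "the second most populous state of the U.S.": "Texas",
}

def decompose(query: str) -> str:
    # Resolve indirect references into concrete entities, producing a
    # rewritten query that a plain vector search can actually handle.
    for reference, entity in INTERMEDIATE_FACTS.items():
        query = query.replace(reference, entity)
    return query

rewritten = decompose(
    "What does the main character do in the second most populous state of the U.S.?"
)
print(rewritten)  # → "What does the main character do in Texas?"
```

Basic RAG skips this resolution step entirely and embeds the original query as-is, which is exactly why it stumbles on such questions.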
If I send the query "How did the Hashira who lost his brother to a demon manage to defeat one of the Twelve Demon Moons?" to the RAG system, it returns entirely irrelevant documents. Given how vector search fundamentally works, this is natural: the semantic search client cannot figure out who the Hashira that lost his brother is, let alone how he defeated the demon. Only one character fits this condition: Muichiro Tokito. But the vector distance between "Muichiro Tokito" and "the Hashira who lost his brother to a demon" is not close; in fact, the two phrases may sit far apart in the embedding space.
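This distance argument can be sketched with plain cosine similarity. The embedding values below are fabricated purely to illustrate the point: two phrases that refer to the same character can still sit far apart in vector space when they share no surface-level semantics.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

vec_name = [0.9, 0.1, 0.0]          # pretend embedding of "Muichiro Tokito"
vec_description = [0.1, 0.2, 0.95]  # pretend embedding of "the Hashira who lost his brother"
vec_paraphrase = [0.85, 0.2, 0.1]   # pretend embedding of "Tokito Muichiro"

print(cosine_similarity(vec_name, vec_paraphrase))   # high: near-identical phrasing
print(cosine_similarity(vec_name, vec_description))  # low: same referent, different surface form
```

Real embedding models are far better than this caricature at capturing paraphrases, but they still rank by semantic closeness of the text itself, not by who the text refers to.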
query = "How did the Hashira who lost his brother to a demon manage to defeat one of the Twelve Demon Moons?"
"""
--- Retrieved Context Summary ---
[1] Source: database/demon_slayer\Demon Slayer _ S.4 E.02 (ENG sub).docx | Preview: I’m leaving now to be trained by the Wind Hashira. Shinobu I see. Kanao Can your training session wait until after the Stone Hashira’s? Shinobu I won’...
[2] Source: database/demon_slayer\Demon Slayer _ S.2 E.08 (ENG sub).docx | Preview: It’s just that… as he suffers from a skin disease, he can’t go outside during the day. Guest Oh dear, the poor thing. Father I was hoping that we coul...
[3] Source: database/demon_slayer\Demon Slayer _ S.1 E.24 (ENG sub).docx | Preview: And also, I’d like to entrust my dream to you. Tanjirou Dream? Shinobu Yes. My dream that we can become friends with the demons. I’m quite sure you c...
[4] Source: database/demon_slayer\Demon Slayer _ S.1 E.23 (ENG).docx | Preview: Not to mention that the times have changed considerably in this era. Himejima Other than those who’ve had their loved ones brutally massacred and join...
[5] Source: database/demon_slayer\Demon Slayer _ S.1.5 – The Movie_ Mugen Train (ENG sub).docx | Preview: (Thoughts) Even though I’d taken 200 humans hostage, I still struggled! I was held at bay! Is this the power of a Hashira? (Thoughts) And him… He was...
================ FINAL ANSWER ================
I do not know. The provided context mentions Shinobu Kocho losing her older sister Kanae to a demon and wanting to teach how to kill that demon, but it does not describe how she or any Hashira defeated a Twelve Demon Moon.
==============================================
"""
There are several ways to adapt RAG to overcome these limitations. A prime example is GraphRAG, which utilizes knowledge graphs. Alternatively, you can combine several methods into a hybrid approach. I will cover some of these advanced RAG methods in the next post.
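As a taste of the hybrid approach, here is a minimal sketch of Reciprocal Rank Fusion (RRF), one common way to merge a sparse (keyword) ranking with a dense (vector) ranking. The document IDs and rankings below are illustrative, and `k=60` is just the conventional default from the RRF literature.

```python
# RRF: each document scores 1 / (k + rank) in every ranking it appears
# in; summing across rankings rewards documents both retrievers like.
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

sparse_ranking = ["doc_3", "doc_1", "doc_7"]  # e.g. BM25 results
dense_ranking = ["doc_1", "doc_5", "doc_3"]   # e.g. vector search results
print(reciprocal_rank_fusion([sparse_ranking, dense_ranking]))
```

Here `doc_1` and `doc_3` rise to the top because both retrievers rank them highly, which is exactly the behavior a hybrid RAG pipeline wants.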
Conclusion
RAG is one of the most popular methods for grounding LLMs in external information. Sometimes simple sparse retrieval is enough; other times you need dense retrieval or even more advanced RAG algorithms. Consider your database, resources, backbone LLM, and use case to choose the method that best fits your project.