From Raw Docs to Retrieval Gold: A Deep Dive into Chunking Strategies & Embedding Techniques (with Qdrant)
TL;DR: Great RAG systems don’t start at the vector DB—they start at chunking. This post walks through when and how to chunk, how to choose and generate embeddings, and how to index/search in Qdrant with dense, sparse, and hybrid retrieval. It includes runnable code, diagrams, and sample chunking illustrations you can paste directly into Markdown.
Table of Contents
- Why chunking matters
- Chunking strategies
- Embedding techniques
- Qdrant as the vector DB
- End-to-end example
- Evaluation & tuning
- Practical guidance
- Further reading & references
Why chunking matters
Long contexts are seductive, but LLMs still show primacy/recency bias and degrade when key facts live in the middle of long inputs ("lost in the middle"). Thoughtful chunking with overlap keeps the right facts adjacent at retrieval time and improves end-to-end accuracy and latency (Liu et al., 2024).
Modern RAG stacks pair well-formed chunks with strong embeddings and a vector DB that supports dense + sparse retrieval and reranking. Qdrant provides production-grade vectors, filters, payloads, and hybrid retrieval in one place. See the Qdrant README and Payload docs. [Qdrant README] [Payload docs].
Visual overview
flowchart LR
  A["Raw Documents"] --> B["Parse & Clean"]
  B --> C["Chunking Strategy<br/>(Fixed / Recursive / Semantic / Hierarchical / Sentence-window)"]
  C --> D["Embedder(s)<br/>BGE-M3 / Nomic / Voyage"]
  D --> E["Qdrant Index<br/>(dense + sparse vectors, payload)"]
  E --> F["Hybrid Retrieval<br/>(dense ⊕ sparse)"]
  F --> G["Reranker (optional)<br/>(e.g., ColBERT)"]
  G --> H["LLM + Prompt<br/>(answers, grounded)"]
Qdrant supports dense vectors, sparse vectors, and hybrid workflows; reranking with late interaction (e.g., ColBERT) is a documented pattern. [Sparse vectors in Qdrant] [Hybrid tutorial].
Chunking strategies
Below are practical chunking strategies you can mix and match. Each section includes sample text, what the chunks look like, and code where useful.
1) Fixed-size window (tokens or characters) + overlap
- When: fast baselines, logs, transcripts.
- Why: predictable chunk lengths, simple to implement.
- Risk: can split sentences mid-thought; consider overlap to preserve context.
Sample text (we’ll reuse this):
[Doc] The GPU kernel uses tiling to reduce global memory access.
Block-level synchronization is required. See Algorithm 2 for warp-level primitives.
Fixed-size (≈ 80 chars) with 20-char overlap
Chunk 1:
"The GPU kernel uses tiling to reduce global memory access. Block-level"
Chunk 2:
"access. Block-level synchronization is required. See Algorithm 2 for"
Chunk 3:
"See Algorithm 2 for warp-level primitives."
Code (LangChain)
from langchain_text_splitters import CharacterTextSplitter

# Note: chunk_size / chunk_overlap here are measured in tokens (cl100k_base),
# whereas the illustration above counted ~80 characters.
splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=80, chunk_overlap=20
)
chunks = splitter.split_text("""The GPU kernel uses tiling to reduce global memory access.
Block-level synchronization is required. See Algorithm 2 for warp-level primitives.""")
for i, c in enumerate(chunks, 1):
    print(f"Chunk {i}:\n{c}\n")
LangChain’s token/character splitters are standard, and RecursiveCharacterTextSplitter is the recommended default for general text. [LangChain splitters].
2) Recursive split (paragraph → sentence → word)
- When: prose, docs, Markdown, HTML.
- Why: preserves natural boundaries first; falls back only when needed.
- How: tries ["\n\n", "\n", " ", ""] in order, keeping larger units intact and falling back only when needed. [Recursive splitter]
Code (Recursive + Markdown-aware)
from langchain_text_splitters import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=120, chunk_overlap=30, is_separator_regex=False
)
chunks = splitter.split_text(long_markdown_string)
3) Document-structure–aware split (Markdown/HTML/JSON)
- When: knowledge bases, docs, webpages, specs.
- Why: avoid chopping headings, lists, code blocks; align chunk meaning to structure.
LangChain provides Markdown/HTML splitters; LlamaIndex offers file-based node parsers (e.g., MarkdownNodeParser, HTMLNodeParser). [LangChain splitters] [LlamaIndex node parsers]
Code (LlamaIndex HTML)
from llama_index.core.node_parser import HTMLNodeParser
parser = HTMLNodeParser(tags=["h1","h2","p","li","code"])
nodes = parser.get_nodes_from_documents(html_docs)
4) Sentence-window retrieval
- When: you want precise grounding while preserving local context.
- Why: index at sentence granularity, but expand context during retrieval by adding a window of neighboring sentences.
Code (LlamaIndex SentenceWindow)
from llama_index.core.node_parser import SentenceWindowNodeParser
parser = SentenceWindowNodeParser(window_size=2) # ±2 sentences of context
nodes = parser.get_nodes_from_documents(documents)
LlamaIndex provides SentenceWindowNodeParser specifically for this pattern. [LlamaIndex node parsers]
5) Semantic chunking (boundary by meaning, not characters)
- When: long-form text with shifting topics (papers, handbooks).
- Why: create chunk boundaries where semantic similarity between consecutive sentences drops.
A practical recipe: embed each sentence, compute cosine distance, start a new chunk when distance exceeds a threshold (e.g., 95th percentile). LlamaIndex provides a pack implementing this (“semantic chunking” popularized by Greg Kamradt). [LlamaIndex semantic chunking pack]
Illustration (semantic boundaries)
S1: Intro to tiling ─┐
S2: Memory coalescing ─┤ (high similarity → same chunk)
S3: Warp shuffles ─┘
S4: Runtime flags ← [semantic drop: new topic → new chunk]
S5: Env setup
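Beyond the LlamaIndex pack, the recipe is only a few lines to hand-roll. A minimal sketch, assuming sentence-transformers is installed; the model name and the 95th-percentile cutoff are illustrative choices, not fixed parts of the method:

import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, model_name="all-MiniLM-L6-v2", pct=95):
    """Start a new chunk wherever the cosine distance between consecutive
    sentence embeddings exceeds the pct-th percentile (needs >= 2 sentences)."""
    model = SentenceTransformer(model_name)
    emb = model.encode(sentences, normalize_embeddings=True)
    # cosine distance between consecutive sentences (vectors are unit-norm)
    dists = 1 - np.sum(emb[:-1] * emb[1:], axis=1)
    threshold = np.percentile(dists, pct)
    chunks, current = [], [sentences[0]]
    for sent, d in zip(sentences[1:], dists):
        if d > threshold:          # semantic drop → close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# Toy usage mirroring the illustration above (S1–S3 vs S4–S5)
sentences = [
    "Tiling splits work into blocks that fit in shared memory.",
    "Memory coalescing groups adjacent loads into one transaction.",
    "Warp shuffles exchange registers without touching shared memory.",
    "Runtime flags are configured through environment variables.",
    "Set them in your shell profile before launching the job.",
]
for chunk in semantic_chunks(sentences):
    print(chunk, "\n---")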
6) Hierarchical chunking (multi-level nodes)
- When: large manuals/books; need both overview and details.
- Why: index multiple granularities (e.g., 2k/512/128 tokens) and let the retriever blend them.
Code (LlamaIndex Hierarchical)
from llama_index.core.node_parser import HierarchicalNodeParser

parser = HierarchicalNodeParser.from_defaults(
    chunk_sizes=[2048, 512, 128],  # parent → child → grandchild
    chunk_overlap=40
)
nodes = parser.get_nodes_from_documents(documents)
Hierarchical node parsers create a flat list of nodes but preserve parent/child relationships—ideal for multi-granularity retrieval. [LlamaIndex hierarchical]
Choosing chunk sizes & overlaps
- Start with 512–1,024 tokens and 10–20% overlap for prose; increase overlap for dense technical content and code.
- Keep chunks semantically coherent (recursive/semantic methods) to mitigate “lost in the middle.” [Recursive splitter] [Lost in the Middle]
Embedding techniques
Picking an embedding model (2024–2025 snapshot)
- BGE-M3 (open, multilingual, can output dense + sparse + multi-vector; long-text up to ~8k tokens). Strong hybrid story. [HF card] [arXiv]
- Nomic Embed (v1.5 / v2 MoE) (open, long-context, Matryoshka—truncate dims without retraining; task-prefix prompts). [HF v1.5] [Tech report] [V2 MoE overview]
- Voyage-3/3.5(-lite) (hosted API, strong multilingual retrieval; domain variants for code/finance/law). [Voyage docs]
- Benchmark by task, not just overall averages; on MTEB, the retrieval and STS categories are the most relevant signals for RAG. [MTEB leaderboard]
Tip: Retrieval usually uses cosine on L2-normalized vectors (most libraries handle this). Validate that your client and DB use the same similarity metric (e.g., COSINE in Qdrant). [Qdrant client]
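A quick sanity check you can run locally: on L2-normalized vectors, cosine similarity reduces to a plain dot product, which is why normalizing embeddings and using COSINE (or DOT) end-to-end gives consistent scores. A minimal sketch:

import numpy as np

a, b = np.random.rand(1024), np.random.rand(1024)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)   # L2-normalize both vectors

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot(a, b)
assert np.isclose(cosine, dot)   # on unit vectors, cosine similarity == dot product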
Task-aware prompting for embeddings
Some models require instruction prefixes to ensure query/document embeddings live in compatible subspaces. For Nomic:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
doc_emb = model.encode(["search_document: GPU tiling improves mem access"])
qry_emb = model.encode(["search_query: What improves memory access on GPUs?"])
These search_document / search_query prefixes are part of the model spec. [HF v1.5]
Hybrid-ready models
BGE-M3 can produce both dense and sparse signals—useful if you plan to feed Qdrant’s hybrid flow (dense + sparse). [BGE-M3 docs]
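If you want both signals from one model, the FlagEmbedding package returns them from a single encode call. A minimal sketch; the output field names (dense_vecs, lexical_weights) follow the BGE-M3 docs, but verify them against your installed version:

# pip install -U FlagEmbedding
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3")
out = model.encode(
    ["The GPU kernel uses tiling to reduce global memory access."],
    return_dense=True,
    return_sparse=True,   # lexical weights usable as a sparse vector
)
dense_vec = out["dense_vecs"][0]       # 1024-d dense embedding
lexical = out["lexical_weights"][0]    # {token_id: weight} mapping for sparse retrieval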
Qdrant as the Vector DB
Why Qdrant?
- Vectors + payloads (schemaless JSON metadata + filters). [Payload docs]
- Sparse vectors and hybrid search patterns (dense ⊕ sparse; reranking with ColBERT). [Sparse vectors] [Hybrid tutorial]
- Client ergonomics: the Python client supports local :memory: mode, FastEmbed integration (client.add / client.query), async usage, and Qdrant Cloud. [Qdrant Quickstart] [PyPI]
There is also active work on hybrid scoring algorithms (e.g., BM42) and on how IDF/term statistics interact with sparse representations. [BM42 news] [BM25/IDF discussion]
A. “Batteries-included” quickstart (FastEmbed via Qdrant client)
Install & run Qdrant
docker run -p 6333:6333 qdrant/qdrant:latest
Python client with auto-embedding and simplified APIs
# pip install "qdrant-client[fastembed]"
from qdrant_client import QdrantClient
client = QdrantClient(":memory:") # or url="http://localhost:6333"
docs = [
    "Qdrant has LangChain integrations",
    "Qdrant also has LlamaIndex integrations"
]
# Simple add → auto-embeds via FastEmbed
ids = client.add(collection_name="demo_collection", documents=docs)
# Query by text directly (embeds the query under the hood)
result = client.query(
    collection_name="demo_collection",
    query_text="Which vector DB works with LangChain?",
    limit=2
)
print(result)
This “add / query” path is documented in the official Qdrant Python Client Quickstart. [Quickstart]
B. Manual control (your own embeddings + payload schema)
Create a collection with COSINE distance
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
client = QdrantClient(host="localhost", port=6333)
if not client.collection_exists("ml_notes"):
    client.create_collection(
        collection_name="ml_notes",
        vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # set to your model dim
    )
Upsert points with payloads
# assume `embeddings` is a list of 1024-d vectors
# and `texts` is the corresponding list of strings
points = [
    PointStruct(
        id=i,
        vector=embeddings[i],
        payload={"doc_id": "kernel_guide", "section": i, "text": texts[i], "tags": ["gpu", "tiling"]}
    )
    for i in range(len(embeddings))
]
client.upsert(collection_name="ml_notes", points=points)
Qdrant’s Python client covers collection creation, upserts, searches, and filtering; payloads are schemaless JSON, filterable by field. [Client docs] [Payload]
Search (vector) with metadata filter
from qdrant_client.models import Filter, FieldCondition, MatchValue
hits = client.search(
    collection_name="ml_notes",
    query_vector=query_vec,
    query_filter=Filter(
        must=[FieldCondition(key="tags", match=MatchValue(value="gpu"))]
    ),
    limit=5
)
C. Hybrid retrieval with Qdrant (dense + sparse)
Concept: Store both dense vectors and sparse vectors per point; retrieve with a hybrid pipeline (then optionally rerank). Qdrant documents sparse vectors and shows reranking patterns; LlamaIndex exposes a simple enable_hybrid=True switch powered by fastembed (e.g., Qdrant/bm25 or SPLADE). [Sparse vectors] [Hybrid tutorial] [LlamaIndex Qdrant hybrid]
Code (LlamaIndex + Qdrant hybrid)
# pip install -U llama-index llama-index-vector-stores-qdrant fastembed qdrant-client
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, StorageContext, Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient, AsyncQdrantClient
docs = SimpleDirectoryReader("./data").load_data()
client = QdrantClient(host="localhost", port=6333)
aclient = AsyncQdrantClient(host="localhost", port=6333)
vector_store = QdrantVectorStore(
    "gpu_notes",
    client=client,
    aclient=aclient,
    enable_hybrid=True,                     # <-- dense + sparse
    fastembed_sparse_model="Qdrant/bm25",   # or "prithvida/Splade_PP_en_v1"
    batch_size=64,
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
Settings.chunk_size = 512
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
# Query in hybrid mode: fuse dense and sparse candidates, then return the final top-k
retriever = index.as_retriever(
    vector_store_query_mode="hybrid", similarity_top_k=10, sparse_top_k=10
)
nodes = retriever.retrieve("warp-level primitives vs block-level sync")
for n in nodes:
    print(n.metadata.get("source"), n.score)
This setup and parameters are demonstrated in LlamaIndex’s Qdrant Hybrid example. [LlamaIndex hybrid]
Note: Qdrant docs also describe sparse vectors’ JSON shape and their role in hybrid pipelines; pairing dense semantics with sparse exact-term matching—then reranking—is a robust recipe. [Qdrant course excerpt] [Hybrid tutorial]
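To make that shape concrete, here is a minimal sketch of a collection holding both a named dense vector and a named sparse vector, with one point upserted manually. The vector names ("dense", "text-sparse"), the toy indices/values, and the 1024-dim size are all illustrative:

from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")
if not client.collection_exists("hybrid_demo"):
    client.create_collection(
        collection_name="hybrid_demo",
        vectors_config={"dense": models.VectorParams(size=1024, distance=models.Distance.COSINE)},
        sparse_vectors_config={"text-sparse": models.SparseVectorParams()},
    )

client.upsert(
    collection_name="hybrid_demo",
    points=[
        models.PointStruct(
            id=1,
            vector={
                "dense": [0.1] * 1024,                   # your dense embedding goes here
                "text-sparse": models.SparseVector(      # parallel arrays: token ids + weights
                    indices=[102, 2043], values=[0.71, 0.33]
                ),
            },
            payload={"text": "GPU tiling reduces global memory access."},
        )
    ],
)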
D. Reranking (optional but often impactful)
After retrieving top-N (e.g., N=50) via hybrid, rerank with ColBERT or a cross-encoder for final top-k. Qdrant’s advanced tutorial covers hybrid + reranking architecture and code paths. [Hybrid + Reranking]
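ColBERT is the pattern Qdrant documents; a plain cross-encoder from sentence-transformers is a drop-in alternative at small scale. A minimal sketch (the model name and candidate texts are illustrative; in practice the candidates are the top-N texts returned by the hybrid stage):

from sentence_transformers import CrossEncoder

query = "How is global memory access reduced?"
candidates = [
    "The GPU kernel uses tiling to reduce global memory access.",
    "Block-level synchronization is required.",
    "See Algorithm 2 for warp-level primitives.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, c) for c in candidates])   # relevance score per pair
for text, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {text}")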
End-to-end example: Chunk → Embed → Qdrant → Hybrid Query
1) Chunk (Recursive) + overlap
from langchain_text_splitters import RecursiveCharacterTextSplitter
text = """
The GPU kernel uses tiling to reduce global memory access.
Block-level synchronization is required. See Algorithm 2 for warp-level primitives.
"""
splitter = RecursiveCharacterTextSplitter(chunk_size=120, chunk_overlap=30)
chunks = splitter.split_text(text)
2) Embed (choose one model)
Option A: BGE-M3 (dense)
from sentence_transformers import SentenceTransformer
bge = SentenceTransformer("BAAI/bge-m3")
vecs = bge.encode(chunks, normalize_embeddings=True)
Option B: Nomic Embed v1.5 (task prefixes + Matryoshka)
from sentence_transformers import SentenceTransformer
nomic = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
vecs = nomic.encode([f"search_document: {c}" for c in chunks])
3) Index in Qdrant (manual)
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
client = QdrantClient(url="http://localhost:6333")
if not client.collection_exists("chunks_demo"):
    client.create_collection(
        collection_name="chunks_demo",
        vectors_config=VectorParams(size=len(vecs[0]), distance=Distance.COSINE),
    )
points = [
    PointStruct(id=i, vector=vecs[i], payload={"text": chunks[i], "order": i})
    for i in range(len(chunks))
]
client.upsert(collection_name="chunks_demo", points=points)
4) Query (dense)
qry = nomic.encode(["search_query: How is global memory access reduced?"])[0]
hits = client.search(collection_name="chunks_demo", query_vector=qry, limit=3)
for h in hits:
    print(h.payload["text"], "→ score:", h.score)
5) Hybrid (dense ⊕ sparse) via LlamaIndex wrapper
For production hybrid, prefer the Qdrant + LlamaIndex path shown earlier—enables SPLADE/BM25 sparse vectors automatically and combines them with dense vectors before (optional) reranking. [LlamaIndex Qdrant hybrid]
Evaluation & tuning
- A/B test chunk sizes/overlaps using retrieval metrics (Recall@k, MRR) and downstream QA accuracy; a minimal metric sketch follows this list.
- Reference datasets from BEIR/MTEB for repeatable measurement; focus on retrieval and STS categories to reflect RAG performance. [MTEB]
- Watch for middle-of-context degradation; shorter, semantically-tight chunks often help. [Lost in the Middle]
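If you roll your own harness, Recall@k and MRR are a few lines each. A minimal sketch, assuming you have, per query, the ordered list of retrieved chunk IDs and the set of gold (relevant) IDs:

def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant chunk IDs found in the top-k results."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none is retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: gold chunks {3, 7}, system returned [5, 3, 1, 7, 2]
print(recall_at_k([5, 3, 1, 7, 2], [3, 7], k=5))  # 1.0 (both found in top-5)
print(mrr([5, 3, 1, 7, 2], [3, 7]))               # 0.5 (first hit at rank 2)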
Practical guidance (battle-tested)
- Start simple: Recursive splitter, 512–1,024 tokens, 10–20% overlap; adjust per domain. [Recursive splitter]
- Use hybrid retrieval for noisy queries and long-tail terms (dense semantics + sparse keywords). Qdrant + LlamaIndex makes this one line (enable_hybrid=True). [LlamaIndex hybrid]
- Normalize embeddings and match distance functions (e.g., COSINE end-to-end). [Qdrant client]
- Task-aware prompting (e.g., Nomic’s prefixes) to avoid query–doc space drift. [Nomic v1.5]
- Payloads matter: store source, span offsets, titles, section IDs for robust filtering & citations. Qdrant payloads are schemaless and filterable. [Payload docs]
- Rerank top candidates when quality trumps latency (ColBERT/cross-encoder). [Hybrid + Reranking]
- Monitor updates to sparse/hybrid algorithms (e.g., BM42) as the ecosystem evolves. [BM42 news]
Further reading & references
- Qdrant Python Client (Quickstart; FastEmbed add/query): https://python-client.qdrant.tech/quickstart
- Qdrant payloads: https://qdrant.tech/documentation/concepts/payload/
- Sparse vectors / hybrid with reranking:
  - Qdrant Hybrid Tutorial: https://qdrant.tech/documentation/advanced-tutorials/reranking-hybrid-search/
  - Sparse vectors intro (Qdrant): https://dev.to/qdrant/sparse-vectors-in-qdrant-pure-vector-based-hybrid-search-3j64
- LlamaIndex:
  - Qdrant Hybrid: https://developers.llamaindex.ai/python/examples/vector_stores/qdrant_hybrid/
  - Node Parsers (HTML/Sentence/Hierarchical): https://developers.llamaindex.ai/python/framework/module_guides/loading/node_parsers/modules/
  - Hierarchical API: https://developers.llamaindex.ai/python/framework-api-reference/node_parsers/hierarchical/
- LangChain splitters (recursive, token/character, Markdown/HTML):
- Lost in the Middle (motivation for chunk sizes/overlap): https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/Lost-in-the-Middle-How-Language-Models-Use-Long
- Embedding models:
  - BGE-M3: https://huggingface.co/BAAI/bge-m3 ; paper https://arxiv.org/abs/2402.03216 ; docs https://bge-model.com/bge/bge_m3.html
  - Nomic Embed: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5 ; report https://arxiv.org/abs/2402.01613 ; v2 MoE overview https://simonwillison.net/2025/Feb/12/nomic-embed-text-v2/
  - Voyage embeddings: https://docs.voyageai.com/docs/embeddings ; SDK https://github.com/voyage-ai/voyageai-python
- MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
- BM42 news: https://blocksandfiles.com/2024/07/02/qdrant-launches-combined-vector-and-keyword-search-for-rag-and-ai-apps/