Understanding the Pain Points of Traditional Chunking
When I first built a basic RAG (Retrieval-Augmented Generation) pipeline using traditional chunking methods—like fixed-size segments or sliding windows—I noticed an immediate flaw: context fragmentation. My LLM would often return incoherent or incomplete answers when queried on long documents. This was especially evident when a user asked about features in "Milvus 2.4.13"—the model couldn't semantically link the header in one chunk with the feature list in another.
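To make the comparison concrete, by "traditional chunking" I mean something like the fixed-size, sliding-window splitter below (a minimal sketch; the character-based chunk_size and overlap values are illustrative, not the exact settings I used):

```python
def sliding_window_chunks(text, chunk_size=512, overlap=64):
    # Fixed-size character windows with a small overlap.
    # Each chunk is embedded in isolation, with no view of the rest of the document.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```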
That’s when I encountered Late Chunking, which flips the process: instead of chunking first and embedding later, it embeds the entire document upfront and then slices it using token-based annotations. This context-first approach changed how I think about vector embedding pipelines.
Why Late Chunking Matters
In my tests, traditional chunking often led to missed associations. Embedding small segments in isolation meant that semantically important relationships were lost. For example:
- Chunk 1 had the string "Milvus 2.4.13"
- Chunk 2 listed new features
A query like "What are the new features in Milvus 2.4.13?" would yield poor matches because the embedding model treated each chunk independently.
Late Chunking solves this by leveraging long-context models (like jina-embeddings-v2-base-en, which handles up to 8,192 tokens) to embed the full document first. Then it pools token vectors based on predefined spans, preserving global semantic awareness.
Implementation: From Theory to Code
Here's a simplified pipeline I used to implement Late Chunking.
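The snippets below assume an embedding model and tokenizer are already loaded. Here is a minimal setup sketch, assuming the Hugging Face checkpoint jinaai/jina-embeddings-v2-base-en (its custom modeling code requires trust_remote_code):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-base-en"  # long-context model, up to 8,192 tokens
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()  # inference only: no dropout, no gradient tracking needed
```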
1. Sentence Chunking with Span Annotation
```python
import spacy
from spacy.tokens import Doc

def sentence_chunker(document, batch_size=10000):
    # Lightweight sentence splitting: a blank English pipeline with only a sentencizer.
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer", config={"punct_chars": None})

    # Process the document in character batches, then stitch the pieces into one Doc.
    docs = [nlp(document[i:i + batch_size]) for i in range(0, len(document), batch_size)]
    doc = Doc.from_docs(docs)

    chunks, span_annotations = [], []
    for sent in doc.sents:
        # (start, end) are token indices of the sentence within the combined Doc.
        span_annotations.append((sent.start, sent.end))
        chunks.append(sent.text)
    return chunks, span_annotations
```
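One caveat worth flagging: sent.start and sent.end above are spaCy token indices, while the embeddings pooled in step 3 come from the embedding model's own tokenizer, and the two token streams don't necessarily line up one-to-one. A minimal sketch of how spans could instead be derived from character offsets via the tokenizer's offset mapping (the helper name char_spans_to_token_spans is my own, and it assumes a fast Hugging Face tokenizer):

```python
def char_spans_to_token_spans(document, char_spans, tokenizer):
    # char_spans: list of (start_char, end_char) pairs, e.g. from spaCy's
    # sent.start_char / sent.end_char. Returns (start_token, end_token) indices
    # in the embedding model's own token sequence.
    encoded = tokenizer(document, return_offsets_mapping=True, add_special_tokens=True)
    offsets = encoded["offset_mapping"]  # per-token (char_start, char_end); (0, 0) for special tokens
    token_spans = []
    for start_char, end_char in char_spans:
        token_ids = [i for i, (s, e) in enumerate(offsets) if s < end_char and e > start_char]
        token_spans.append((token_ids[0], token_ids[-1] + 1))
    return token_spans
```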
2. Token Embedding for Full Document
```python
import torch

def document_to_token_embeddings(model, tokenizer, document, batch_size=4096):
    # Tokenize the entire document once so spans can be defined over a single token sequence.
    tokenized = tokenizer(document, return_tensors="pt")
    tokens = tokenized.tokens()

    # Run the model over the sequence in slices to bound memory, then
    # concatenate the per-token embeddings back along the sequence axis.
    outputs = []
    for i in range(0, len(tokens), batch_size):
        batch_inputs = {k: v[:, i:i + batch_size] for k, v in tokenized.items()}
        with torch.no_grad():
            output = model(**batch_inputs).last_hidden_state
        outputs.append(output)
    return torch.cat(outputs, dim=1)
```
3. Embedding Pooling per Span
```python
def late_chunking(token_embeddings, span_annotations):
    # Mean-pool the full-document token embeddings inside each annotated span,
    # so every chunk vector retains context from the whole document.
    pooled = []
    for start, end in span_annotations:
        if end - start >= 1:
            pooled_vec = token_embeddings[start:end].sum(dim=0) / (end - start)
            pooled.append(pooled_vec.detach().cpu().numpy())
    return pooled
```
Full Flow
```python
# End-to-end: sentence spans first, one full-document embedding pass, then per-span pooling.
chunks, span_annotations = sentence_chunker(document)
token_embeddings = document_to_token_embeddings(model, tokenizer, document)
# token_embeddings has shape [1, seq_len, dim]; [0] drops the batch dimension.
chunk_embeddings = late_chunking(token_embeddings[0], span_annotations)
```
Benchmarks: Accuracy Comparison with Traditional Chunking
To validate the approach, I computed the cosine similarity between the query "milvus 2.4.13" and each chunk embedding. Late Chunking consistently delivered higher similarity scores:
| Chunk | Late Chunking | Traditional Chunking |
|---|---|---|
| Dynamic replica load | 0.8785 | 0.8354 |
| Bug fixes | 0.8483 | 0.7222 |
| MMAP improvements | 0.8494 | 0.6907 |
| Recommendation note | 0.8543 | 0.7186 |
These results empirically confirmed that Late Chunking captures global context more effectively than any local chunking trick I had previously used.
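For the traditional-chunking column, each chunk has to be embedded on its own. A minimal sketch of such a baseline (the helper name naive_chunk_embeddings is mine; it mean-pools each chunk's token embeddings with the same model and tokenizer as above):

```python
def naive_chunk_embeddings(chunks):
    # Embed each chunk in isolation: the model never sees the surrounding document,
    # which is exactly the context loss that Late Chunking avoids.
    embeddings = []
    with torch.no_grad():
        for chunk in chunks:
            inputs = tokenizer(chunk, return_tensors="pt", truncation=True)
            output = model(**inputs).last_hidden_state
            embeddings.append(output.mean(dim=1).squeeze(0).cpu().numpy())
    return embeddings
```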
Vector Search Validation Using Milvus
After embedding, I stored the chunk vectors in Milvus. I compared its native ANN search against a brute-force cosine similarity scan. Both approaches returned identical top-3 matches for queries like "What are new features in milvus 2.4.13". This gave me high confidence in Milvus’s indexing fidelity.
```python
# batch_data: one {"id", "vector", "text"} dict per chunk (see the setup sketch below)
res = client.insert(collection_name=collection, data=batch_data)
```
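That insert assumes a client, a collection, and batch_data already exist. A minimal sketch of that setup plus the ANN query, using pymilvus's MilvusClient (the Milvus Lite file name, the collection name, and the 768-dimensional vectors of jina-embeddings-v2-base-en are my assumptions):

```python
from pymilvus import MilvusClient

client = MilvusClient("late_chunking_demo.db")  # Milvus Lite file; swap in a server URI for production
collection = "late_chunking_demo"
client.create_collection(collection_name=collection, dimension=768)

# One row per chunk: an integer id, the pooled vector, and the raw text for inspection
batch_data = [
    {"id": i, "vector": emb.tolist(), "text": chunk}
    for i, (chunk, emb) in enumerate(zip(chunks, chunk_embeddings))
]

# ANN search; query_vector is the query embedded the same way as the chunks
# (see the manual cosine-similarity snippet below for how it is built)
results = client.search(
    collection_name=collection,
    data=[query_vector.tolist()],
    limit=3,
    output_fields=["text"],
)
```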
Manual Cosine Similarity Query
```python
import numpy as np

def cos_sim(a, b):  # cosine similarity between two 1-D vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_sim_query(query):
    with torch.no_grad():  # embed the query the same way as the chunks
        query_vector = model(**tokenizer(query, return_tensors="pt")).last_hidden_state.mean(1).cpu().numpy().flatten()
    sims = [cos_sim(query_vector, emb) for emb in chunk_embeddings]
    return sorted(zip(chunks, sims), key=lambda x: x[1], reverse=True)[:3]
```
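A quick usage example (the printing format is mine):

```python
for chunk, score in cosine_sim_query("What are the new features in Milvus 2.4.13?"):
    print(f"{score:.4f}  {chunk[:80]}")
```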
Reflections and What’s Next
Late Chunking changed how I think about retrieval granularity. Instead of forcing context into artificial boundaries, I now treat documents as cohesive semantic units until the final step. This works particularly well with long-context embedding models and scales easily with vector databases like Milvus.
Next, I plan to explore hybrid strategies—where semantic chunking (topic-wise grouping) is applied after Late Chunking to further improve query precision. I’m also interested in testing other chunking techniques on domain-specific corpora such as legal contracts or scientific papers.
Late Chunking isn’t just a tweak—it’s a rethinking of how context should be preserved in modern AI pipelines.