Elise Tanaka

Milvus and Late Chunking: What I Learned About Context-Aware Embedding in RAG

Understanding the Pain Points of Traditional Chunking

When I first built a basic RAG (Retrieval-Augmented Generation) pipeline using traditional chunking methods—like fixed-size segments or sliding windows—I noticed an immediate flaw: context fragmentation. My LLM would often return incoherent or incomplete answers when queried on long documents. This was especially evident when a user asked about features in "Milvus 2.4.13"—the model couldn't semantically link the header in one chunk with the feature list in another.

That’s when I encountered Late Chunking, which flips the process: instead of chunking first and embedding later, it embeds the entire document upfront and then slices it using token-based annotations. This context-first approach changed how I think about vector embedding pipelines.

Why Late Chunking Matters

In my tests, traditional chunking often led to missed associations. Embedding small segments in isolation meant that semantically important relationships were lost. For example:

  • Chunk 1 had the string "Milvus 2.4.13"
  • Chunk 2 listed new features

A query like "What are the new features in Milvus 2.4.13?" would yield poor matches because the embedding model treated each chunk independently.

Late Chunking solves this by leveraging long-context models (like jina-embeddings-v2-base-en, which handles up to 8,192 tokens) to embed the full document first. Then it pools token vectors based on predefined spans, preserving global semantic awareness.

Implementation: From Theory to Code

Here's a simplified pipeline I used to implement Late Chunking.

1. Sentence Chunking with Span Annotation

import spacy
from spacy.tokens import Doc

def sentence_chunker(document, batch_size=10000):
    # Lightweight sentence splitter: blank English pipeline + rule-based sentencizer
    nlp = spacy.blank("en")
    nlp.add_pipe("sentencizer", config={"punct_chars": None})

    # Process the document in character batches, then stitch the pieces back together
    docs = [nlp(document[i:i + batch_size]) for i in range(0, len(document), batch_size)]
    doc = Doc.from_docs(docs)

    chunks, span_annotations = [], []
    for sent in doc.sents:
        span_annotations.append((sent.start, sent.end))  # token-level (start, end) per sentence
        chunks.append(sent.text)
    return chunks, span_annotations
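A quick sanity check with a made-up two-sentence string (the sample text is my own; exact token indices depend on how spaCy tokenizes it):

sample = "Milvus 2.4.13 adds dynamic replica load. It also ships several bug fixes."
chunks, spans = sentence_chunker(sample)
print(chunks)  # two sentence strings
print(spans)   # [(start, end), ...] spaCy token indices, one pair per sentence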

2. Token Embedding for Full Document

import torch

def document_to_token_embeddings(model, tokenizer, document, batch_size=4096):
    # Tokenize the full document once so spans can be expressed in a single token sequence
    tokenized = tokenizer(document, return_tensors="pt")
    tokens = tokenized.tokens()

    # Run the encoder over consecutive windows of up to batch_size tokens
    outputs = []
    for i in range(0, len(tokens), batch_size):
        batch_inputs = {k: v[:, i:i + batch_size] for k, v in tokenized.items()}
        with torch.no_grad():
            output = model(**batch_inputs).last_hidden_state
        outputs.append(output)

    # Concatenate back into a single (1, num_tokens, hidden_dim) tensor
    return torch.cat(outputs, dim=1)
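For completeness, this is roughly how I load the long-context model and tokenizer this function expects. jinaai/jina-embeddings-v2-base-en ships custom modeling code, so trust_remote_code=True is required; the variable names are just my choices for this sketch:

from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v2-base-en"  # 8,192-token context, 768-dim output
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()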

3. Embedding Pooling per Span

def late_chunking(token_embeddings, span_annotations):
    # token_embeddings: (num_tokens, hidden_dim); spans must index into that token sequence
    pooled = []
    for start, end in span_annotations:
        if end - start >= 1:
            # Mean-pool the contextualized token vectors inside each span
            pooled_vec = token_embeddings[start:end].sum(dim=0) / (end - start)
            pooled.append(pooled_vec.detach().cpu().numpy())
    return pooled
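One caveat worth flagging: the (start, end) spans here index the embedding model's token sequence, while sentence_chunker returns spaCy token indices, which only approximately line up with a WordPiece tokenizer. If you want exact alignment, one option is to record character spans instead (e.g. spaCy's sent.start_char and sent.end_char) and remap them with the fast tokenizer's offset mapping. This helper is my own sketch, not part of the pipeline above:

# Optional helper (my own addition): convert character spans into (start, end)
# positions in the embedding model's token sequence.
# Requires a "fast" Hugging Face tokenizer, which exposes per-token character offsets.
def char_spans_to_token_spans(document, tokenizer, char_spans):
    offsets = tokenizer(document, return_offsets_mapping=True)["offset_mapping"]
    token_spans = []
    for c_start, c_end in char_spans:
        idx = [i for i, (s, e) in enumerate(offsets)
               if s >= c_start and e <= c_end and e > s]  # e > s skips special tokens
        if idx:
            token_spans.append((idx[0], idx[-1] + 1))
    return token_spans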

Full Flow

chunks, span_annotations = sentence_chunker(document)
token_embeddings = document_to_token_embeddings(model, tokenizer, document)
# Drop the batch dimension before pooling: shape (num_tokens, hidden_dim)
chunk_embeddings = late_chunking(token_embeddings[0], span_annotations)

Benchmarks: Accuracy Comparison with Traditional Chunking

To validate the approach, I measured cosine similarity between the query "milvus 2.4.13" and each chunk embedding. Late Chunking consistently delivered higher similarity scores:

Chunk                   Late Chunking   Traditional Chunking
Dynamic replica load    0.8785          0.8354
Bug fixes               0.8483          0.7222
MMAP improvements       0.8494          0.6907
Recommendation note     0.8543          0.7186

These results empirically confirmed that Late Chunking captures global context more effectively than any local chunking trick I had previously used.

Vector Search Validation Using Milvus

After embedding, I stored the chunk vectors in Milvus. I compared its native ANN search against a brute-force cosine similarity scan. Both approaches returned identical top-3 matches for queries like "What are new features in milvus 2.4.13". This gave me high confidence in Milvus’s indexing fidelity.
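Before the insert call below, I set up the collection with pymilvus's MilvusClient quick-setup API. The local URI, collection name, and metric are my choices for this sketch; the 768 dimension matches jina-embeddings-v2-base-en's output size:

from pymilvus import MilvusClient

client = MilvusClient(uri="./late_chunking_demo.db")  # Milvus Lite file; a server URI works too
collection = "late_chunking_demo"
client.create_collection(collection_name=collection, dimension=768, metric_type="COSINE")

# One row per chunk; "text" lands in the collection's dynamic field
batch_data = [
    {"id": i, "vector": emb.tolist(), "text": chunk}
    for i, (chunk, emb) in enumerate(zip(chunks, chunk_embeddings))
]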

res = client.insert(collection_name=collection, data=batch_data)
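The ANN side of the comparison looked roughly like this, again as a sketch: I embed the query the same way as in cosine_sim_query below and let Milvus return the closest chunks.

query = "What are new features in milvus 2.4.13"
query_vector = model(**tokenizer(query, return_tensors="pt")).last_hidden_state.mean(1).detach().cpu().numpy().flatten()

hits = client.search(
    collection_name=collection,
    data=[query_vector.tolist()],
    limit=3,
    output_fields=["text"],
)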

Manual Cosine Similarity Query

import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_sim_query(query):
    # Mean-pool the query's token embeddings into a single vector
    query_vector = model(**tokenizer(query, return_tensors="pt")).last_hidden_state.mean(1).detach().cpu().numpy().flatten()
    sims = [cos_sim(query_vector, emb) for emb in chunk_embeddings]
    return sorted(zip(chunks, sims), key=lambda x: x[1], reverse=True)[:3]
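Calling it with the same question I ran against Milvus:

for chunk, score in cosine_sim_query("What are new features in milvus 2.4.13"):
    print(round(score, 4), chunk[:80])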

Reflections and What’s Next

Late Chunking changed how I think about retrieval granularity. Instead of forcing context into artificial boundaries, I now treat documents as cohesive semantic units until the final step. This works particularly well with long-context embedding models and scales easily with vector databases like Milvus.

Next, I plan to explore hybrid strategies—where semantic chunking (topic-wise grouping) is applied after Late Chunking to further improve query precision. I’m also interested in testing other chunking techniques on domain-specific corpora such as legal contracts or scientific papers.

Late Chunking isn’t just a tweak—it’s a rethinking of how context should be preserved in modern AI pipelines.
