A deep dive into semantic deduplication for LLM context windows
If you're building with RAG (Retrieval-Augmented Generation), you've probably noticed something frustrating: your LLM keeps getting the same information from different sources. The same answer appears in your documentation, your tool outputs, your memory system—just worded slightly differently.
This isn't a minor inefficiency. In production RAG systems, 30-40% of retrieved context is semantically redundant. That's wasted tokens, higher API costs, and confused model outputs.
I built GoVectorSync to fix this. Here's a technical deep dive into the problem and the solution.
The Problem: Semantic Redundancy in Multi-Source RAG
Modern AI agents pull context from multiple sources:
┌─────────────────────────────────────────────────────────────┐
│ User Query │
│ "How do I reset my password?" │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ Context Sources │
├─────────────────────────────────────────┤
│ 📄 RAG (Documentation) │
│ 🔧 MCP Tools (API responses) │
│ 🧠 Memory (Past conversations) │
│ ⚡ Skills (Procedural knowledge) │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Retrieved Chunks │
├─────────────────────────────────────────────────────────────┤
│ [RAG] "To reset your password, click 'Forgot Password' │
│ on the login page..." │
│ [RAG] "Password reset: Navigate to login, select │
│ 'Forgot Password'..." │
│ [MCP] "User password can be reset via the forgot │
│ password flow in the auth system" │
│ [Memory] "Last time you asked, I explained the password │
│ reset uses the forgot password link..." │
│ [RAG] "Account deletion is available in Settings..." │
│ [MCP] "Delete account: Settings > Account > Delete" │
│ [RAG] "Set up 2FA for extra security..." │
│ [Skills] "Contact support at support@example.com" │
└─────────────────────────────────────────────────────────────┘
Look at those first four results. They're all saying the same thing: use the forgot password flow. But because they come from different sources with different wording, naive top-k retrieval treats them as distinct.
The Math of Waste
If you retrieve 8 chunks and 5 are duplicates:
- 62.5% of your context window (5 of 8 chunks) is wasted
- You're paying for tokens that add no information
- The LLM sees repetition, which can bias its response
- You're missing diverse information that could help
Why Cosine Similarity Isn't Enough
You might think: "Just dedupe by cosine similarity threshold."
The problem is choosing that threshold:
Similarity between chunks about password reset:
"To reset your password, click 'Forgot Password'..."
vs
"Password reset: Navigate to login, select 'Forgot'..."
Cosine Similarity: 0.82
Is 0.82 a duplicate? What about 0.75? 0.68?
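For concreteness, here's what that naive approach looks like. A minimal sketch, assuming a cosineSimilarity helper (the inverse of the CosineDistance shown later in this post):
func dedupeByThreshold(chunks []types.Chunk, threshold float64) []types.Chunk {
    kept := make([]types.Chunk, 0, len(chunks))
    for _, c := range chunks {
        isDup := false
        for _, k := range kept {
            // Drop the chunk if it is too similar to anything already kept
            if cosineSimilarity(c.Embedding, k.Embedding) >= threshold {
                isDup = true
                break
            }
        }
        if !isDup {
            kept = append(kept, c)
        }
    }
    return kept
}
Every decision this function makes hinges on a single magic number.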
A fixed threshold fails because:
- Domain variance: Technical docs cluster tighter than conversational text
- Length effects: Longer chunks have different similarity distributions
- Embedding model quirks: Different models have different similarity ranges
You need something smarter.
The Solution: Cluster → Select → Diversify
GoVectorSync runs a pipeline of three core stages, plus an optional fourth pass for MMR re-ranking:
┌─────────────────────────────────────────────────────────────┐
│ GoVectorSync Pipeline │
└─────────────────────────────────────────────────────────────┘
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ STAGE 1 │ │ STAGE 2 │ │ STAGE 3 │
│ Over-fetch │────▶│ Cluster │────▶│ Select + │
│ (3-5x K) │ │ (Semantic) │ │ MMR │
└──────────────┘ └──────────────┘ └──────────────┘
│ │ │
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ 50 chunks │ │ 15 clusters │ │ 8 chunks │
│ from VectorDB│ │ by meaning │ │ diverse │
└──────────────┘ └──────────────┘ └──────────────┘
Stage 1: Over-fetch
Instead of retrieving exactly K chunks, retrieve 3-5x more:
// Request 50 chunks when you need 8
req.TopK = cfg.OverFetchK // 50
This gives us a pool to deduplicate from. The extra retrieval cost is negligible compared to the LLM inference cost you'll save.
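Concretely, the request just inflates TopK and asks the vector DB to return embeddings, since the clustering stage needs the raw vectors. A sketch (the request shape is illustrative, not any particular client's API):
req := QueryRequest{
    Vector:            queryEmbedding,
    TopK:              cfg.OverFetchK, // e.g. 50 when the caller asked for 8
    IncludeEmbeddings: true,           // Stage 2 clusters on these vectors
}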
Stage 2: Agglomerative Clustering
We group semantically similar chunks using hierarchical clustering:
┌─────────────────────────────────────────────────────────────┐
│ Agglomerative Clustering │
└─────────────────────────────────────────────────────────────┘
Initial state (each chunk is its own cluster):
[C1] [C2] [C3] [C4] [C5] [C6] [C7] [C8]
Step 1: Merge closest pair (C1, C2) - both about password reset
[C1,C2] [C3] [C4] [C5] [C6] [C7] [C8]
Step 2: Merge (C1,C2) with C3 - also password reset
[C1,C2,C3] [C4] [C5] [C6] [C7] [C8]
Step 3: Merge C4 into password cluster
[C1,C2,C3,C4] [C5] [C6] [C7] [C8]
Step 4: Merge C5,C6 - both about account deletion
[C1,C2,C3,C4] [C5,C6] [C7] [C8]
Stop when distance > threshold (0.15)
Final clusters:
┌─────────────────┐ ┌─────────────────┐ ┌────────┐ ┌────────┐
│ Password Reset │ │ Account Delete │ │ 2FA │ │Support │
│ C1, C2, C3, C4 │ │ C5, C6 │ │ C7 │ │ C8 │
└─────────────────┘ └─────────────────┘ └────────┘ └────────┘
The algorithm:
func (c *Clusterer) Cluster(chunks []types.Chunk) *types.ClusterResult {
    n := len(chunks)
    // Initialize: each chunk is its own cluster
    nodes := make([]*clusterNode, n)
    for i := range chunks {
        nodes[i] = &clusterNode{
            members:  []int{i},
            centroid: chunks[i].Embedding,
            active:   true,
        }
    }
    // Compute pairwise distance matrix
    distMatrix := c.computeDistanceMatrix(chunks)
    // Agglomerative merging
    activeCount := n
    for activeCount > 1 {
        // Find closest pair of active clusters
        minDist, minI, minJ := findClosestPair(nodes, distMatrix)
        // Stop if distance exceeds threshold
        if minDist > c.cfg.Threshold {
            break
        }
        // Merge clusters and deactivate the absorbed one
        c.mergeClusters(nodes[minI], nodes[minJ], chunks)
        nodes[minJ].active = false
        activeCount--
    }
    return buildResult(nodes, chunks)
}
Key insight: We use average linkage by default, which computes cluster distance as the mean of all pairwise distances. This is more robust than single-linkage (chaining problem) or complete-linkage (too conservative).
func (c *Clusterer) clusterDistance(a, b *clusterNode, distMatrix [][]float64) float64 {
    switch c.cfg.Linkage {
    case "single":
        // Min distance - can cause chaining
        return minPairwiseDistance(a, b, distMatrix)
    case "complete":
        // Max distance - very conservative
        return maxPairwiseDistance(a, b, distMatrix)
    default: // "average"
        // Mean distance - balanced
        return avgPairwiseDistance(a, b, distMatrix)
    }
}
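The linkage helpers are straightforward. For example, avgPairwiseDistance (not shown in the excerpt above, so this is a sketch) is just the mean over all cross-cluster pairs:
func avgPairwiseDistance(a, b *clusterNode, distMatrix [][]float64) float64 {
    sum := 0.0
    for _, i := range a.members {
        for _, j := range b.members {
            sum += distMatrix[i][j]
        }
    }
    // Mean over all |a| x |b| cross-cluster pairs
    return sum / float64(len(a.members)*len(b.members))
}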
Stage 3: Representative Selection
From each cluster, we pick one representative. Multiple strategies:
┌─────────────────────────────────────────────────────────────┐
│ Selection Strategies │
└─────────────────────────────────────────────────────────────┘
Strategy: "score" (default)
─────────────────────────────
Pick the chunk with highest retrieval score.
Best for: Preserving relevance ranking
Cluster: [C1: 0.92, C2: 0.89, C3: 0.85, C4: 0.78]
Selected: C1 (score 0.92)
Strategy: "centroid"
─────────────────────────────
Pick the chunk closest to cluster centroid.
Best for: Finding the most "typical" chunk
Cluster centroid: [0.12, -0.34, 0.56, ...]
C1 distance to centroid: 0.08
C2 distance to centroid: 0.12
C3 distance to centroid: 0.05 ← Selected
C4 distance to centroid: 0.15
Strategy: "hybrid"
─────────────────────────────
Weighted combination of score + centroid proximity.
Best for: Balancing relevance and typicality
hybrid_score = 0.7 * normalized_score + 0.3 * centroid_proximity
func (s *Selector) SelectFromCluster(cluster *types.Cluster) *types.Chunk {
    switch s.cfg.Strategy {
    case SelectByScore:
        return s.selectByScore(cluster) // Highest retrieval score
    case SelectByCentroid:
        return s.selectByCentroid(cluster) // Closest to centroid
    default: // SelectByHybrid
        return s.selectByHybrid(cluster) // Weighted combination
    }
}
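The hybrid path deserves a closer look. A minimal sketch of what selectByHybrid might do, assuming the cluster exposes its members as Chunks and that normalizeScores and proximityToCentroid are helpers (both assumptions, not the actual codebase):
func (s *Selector) selectByHybrid(cluster *types.Cluster) *types.Chunk {
    scores := normalizeScores(cluster)   // retrieval scores scaled to [0,1]
    prox := proximityToCentroid(cluster) // 1 - cosine distance to centroid
    bestIdx, bestScore := 0, -1.0
    for i := range cluster.Chunks {
        // Weighted combination matching the hybrid formula above
        h := 0.7*scores[i] + 0.3*prox[i]
        if h > bestScore {
            bestIdx, bestScore = i, h
        }
    }
    return &cluster.Chunks[bestIdx]
}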
Stage 4: MMR Re-ranking (Optional)
After selecting representatives, we may still have more chunks than needed. MMR (Maximal Marginal Relevance) ensures the final set is diverse:
┌─────────────────────────────────────────────────────────────┐
│ Maximal Marginal Relevance (MMR) │
└─────────────────────────────────────────────────────────────┘
Formula:
MMR(chunk) = λ × relevance(chunk)
- (1-λ) × max_similarity(chunk, already_selected)
Where:
λ = 0.5 (balanced)
λ = 1.0 (pure relevance, no diversity)
λ = 0.0 (pure diversity, ignore relevance)
Example with λ = 0.5:
─────────────────────────────
Candidates: [A, B, C, D, E] (already selected: none)
Round 1:
MMR(A) = 0.5 × 0.95 - 0 = 0.475 ← Selected (highest relevance)
Round 2: (A already selected)
MMR(B) = 0.5 × 0.90 - 0.5 × sim(B,A)
= 0.45 - 0.5 × 0.85 = 0.025
MMR(C) = 0.5 × 0.85 - 0.5 × sim(C,A)
= 0.425 - 0.5 × 0.20 = 0.325 ← Selected (diverse!)
Round 3: (A, C already selected)
MMR(B) = 0.5 × 0.90 - 0.5 × max(sim(B,A), sim(B,C))
...
The key insight: MMR penalizes chunks that are similar to already-selected chunks, naturally promoting diversity.
func (m *MMR) computeMMRScore(candidateIdx int, selected []int,
    scores []float64, simMatrix [][]float64) float64 {
    relevance := scores[candidateIdx]
    if len(selected) == 0 {
        return m.cfg.Lambda * relevance
    }
    // Find max similarity to any selected chunk
    maxSim := 0.0
    for _, selIdx := range selected {
        sim := simMatrix[candidateIdx][selIdx]
        if sim > maxSim {
            maxSim = sim
        }
    }
    // MMR formula
    return m.cfg.Lambda*relevance - (1-m.cfg.Lambda)*maxSim
}
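computeMMRScore is driven by a greedy loop: each round, score every remaining candidate and keep the best. A sketch of that loop (the Rerank name and signature are assumptions):
func (m *MMR) Rerank(k int, scores []float64, simMatrix [][]float64) []int {
    selected := make([]int, 0, k)
    used := make([]bool, len(scores))
    for len(selected) < k && len(selected) < len(scores) {
        bestIdx := -1
        bestScore := -2.0 // safe floor if scores and similarities are in [0,1]
        for i := range scores {
            if used[i] {
                continue
            }
            // Score candidate i against everything selected so far
            if s := m.computeMMRScore(i, selected, scores, simMatrix); s > bestScore {
                bestIdx, bestScore = i, s
            }
        }
        selected = append(selected, bestIdx)
        used[bestIdx] = true
    }
    return selected
}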
The Full Pipeline
Here's how it all fits together:
┌─────────────────────────────────────────────────────────────┐
│ GoVectorSync Broker │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 1. EMBED QUERY │
│ "How do I reset my password?" → [0.12, -0.34, ...] │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. OVER-FETCH FROM VECTOR DB │
│ Query Pinecone/Qdrant for top 50 chunks │
│ Include embeddings for clustering │
│ │
│ Latency: ~15ms │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. AGGLOMERATIVE CLUSTERING │
│ 50 chunks → 15 clusters │
│ Threshold: 0.15 cosine distance │
│ │
│ Latency: ~8ms │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. REPRESENTATIVE SELECTION │
│ 15 clusters → 15 representatives │
│ Strategy: highest score per cluster │
│ │
│ Latency: ~1ms │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. MMR RE-RANKING │
│ 15 representatives → 8 diverse chunks │
│ Lambda: 0.5 (balanced relevance/diversity) │
│ │
│ Latency: ~3ms │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ RESULT │
│ 8 chunks, each covering a distinct topic: │
│ • Password reset │
│ • Account deletion │
│ • 2FA setup │
│ • Support contact │
│ • Billing │
│ • API keys │
│ • Data export │
│ • Team management │
│ │
│ Total added latency: ~12ms │
└─────────────────────────────────────────────────────────────┘
Implementation Details
Distance Matrix Computation
For N chunks, we compute an N×N distance matrix. This is O(N²) but N is small (50-100 chunks):
func (c *Clusterer) computeDistanceMatrix(chunks []types.Chunk) [][]float64 {
    n := len(chunks)
    // Allocate all rows up front: the symmetric write below touches
    // row j before the outer loop reaches it
    matrix := make([][]float64, n)
    for i := 0; i < n; i++ {
        matrix[i] = make([]float64, n)
    }
    for i := 0; i < n; i++ {
        for j := i + 1; j < n; j++ {
            dist := math.CosineDistance(chunks[i].Embedding, chunks[j].Embedding)
            matrix[i][j] = dist
            matrix[j][i] = dist // Symmetric
        }
    }
    return matrix
}
Cosine Distance
func CosineDistance(a, b []float32) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += float64(a[i] * b[i])
        normA += float64(a[i] * a[i])
        normB += float64(b[i] * b[i])
    }
    if normA == 0 || normB == 0 {
        return 1.0 // Max distance
    }
    similarity := dot / (math.Sqrt(normA) * math.Sqrt(normB))
    return 1.0 - similarity // Convert to distance
}
Centroid Update on Merge
When merging clusters, we recompute the centroid as the mean of all member embeddings:
func (c *Clusterer) mergeClusters(a, b *clusterNode, chunks []types.Chunk) {
    a.members = append(a.members, b.members...)
    // Recompute centroid
    dim := len(chunks[0].Embedding)
    newCentroid := make([]float32, dim)
    for _, idx := range a.members {
        for d := 0; d < dim; d++ {
            newCentroid[d] += chunks[idx].Embedding[d]
        }
    }
    invN := float32(1.0 / float64(len(a.members)))
    for d := 0; d < dim; d++ {
        newCentroid[d] *= invN
    }
    a.centroid = newCentroid
}
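One design note: because the centroid is a mean, you don't strictly need to re-sum every member. A size-weighted average of the two existing centroids gives the same result in O(d) rather than O(|members| × d). A sketch of that alternative (not from the codebase), computed before b's members are appended to a:
func mergedCentroid(a, b *clusterNode) []float32 {
    na := float32(len(a.members))
    nb := float32(len(b.members))
    merged := make([]float32, len(a.centroid))
    for d := range merged {
        // Size-weighted mean of the two centroids
        merged[d] = (na*a.centroid[d] + nb*b.centroid[d]) / (na + nb)
    }
    return merged
}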
Configuration Tuning
Cluster Threshold
The threshold controls how aggressively we merge:
| Threshold | Effect | Use Case |
|---|---|---|
| 0.10 | Conservative, more clusters | High-precision domains |
| 0.15 | Balanced (default) | General purpose |
| 0.20 | Aggressive, fewer clusters | Noisy/redundant data |
| 0.25+ | Very aggressive | Heavy deduplication |
MMR Lambda
| Lambda | Effect | Use Case |
|---|---|---|
| 0.3 | Diversity-focused | Exploratory queries |
| 0.5 | Balanced (default) | General purpose |
| 0.7 | Relevance-focused | Precise queries |
| 1.0 | Pure relevance | No diversity needed |
Over-fetch Ratio
| Ratio | Chunks Retrieved | Trade-off |
|---|---|---|
| 2x | 16 for K=8 | Minimal overhead, less dedup potential |
| 3x | 24 for K=8 | Good balance |
| 5x | 40 for K=8 | Maximum dedup, higher retrieval cost |
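Putting the knobs together, a configuration might look like this. The field names are illustrative, not necessarily GoVectorSync's actual API:
type Config struct {
    OverFetchK int     // chunks requested from the vector DB (3-5x FinalK)
    FinalK     int     // chunks ultimately returned to the LLM
    Threshold  float64 // cosine-distance cutoff for cluster merging
    Linkage    string  // "single" | "complete" | "average"
    Strategy   string  // "score" | "centroid" | "hybrid"
    Lambda     float64 // MMR relevance/diversity balance
}

cfg := Config{
    OverFetchK: 24, // 3x over-fetch for K=8
    FinalK:     8,
    Threshold:  0.15,
    Linkage:    "average",
    Strategy:   "score",
    Lambda:     0.5,
}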
Performance
Benchmarks on a typical workload (50 chunks, 1536-dim embeddings):
| Stage | Latency |
|---|---|
| Distance matrix | 2ms |
| Clustering | 6ms |
| Selection | <1ms |
| MMR | 3ms |
| Total overhead | ~12ms |
This is negligible compared to:
- Vector DB query: 15-50ms
- LLM inference: 500-2000ms
But the savings are significant:
- 35% fewer tokens per query
- 2x more diverse context
- Better LLM answers (less confusion from repetition)
Usage
As a Proxy
GoVectorSync sits between your app and vector DB:
┌─────────┐ ┌──────────────┐ ┌────────────┐
│ Your │────▶│ GoVectorSync │────▶│ Pinecone │
│ App │◀────│ Proxy │◀────│ Qdrant │
└─────────┘ └──────────────┘ └────────────┘
Direct Integration
Request access for the Go SDK & demo.
What's Next
I'm building GoVectorSync as an open-source tool for the RAG community. Current focus:
- More vector DB integrations (Weaviate, Milvus, Chroma)
- Ingestion-time dedup (deduplicate before storing)
- Adaptive thresholds (learn optimal settings per namespace)
- Streaming support (deduplicate as chunks arrive)
Try It
Join the waitlist for early access:
waitlist.siddhantkhare.com/?project=GoVectorSync
Found this useful? I write about AI infrastructure, security, and the engineering challenges of building production AI systems. Connect with me on LinkedIn or Twitter/X.
*Built by Siddhant Khare*