Siddhant Khare

Posted on Dec 23

The Engineering guide to Context window efficiency

#programming #ai #rag #llm

A deep dive into semantic deduplication for LLM context windows

If you're building with RAG (Retrieval-Augmented Generation), you've probably noticed something frustrating: your LLM keeps getting the same information from different sources. The same answer appears in your documentation, your tool outputs, your memory system—just worded slightly differently.

This isn't a minor inefficiency. In production RAG systems, 30-40% of retrieved context is semantically redundant. That's wasted tokens, higher API costs, and confused model outputs.

I built GoVectorSync to fix this. Here's the technical deep-dive on the problem and solution.

The Problem: Semantic Redundancy in Multi-Source RAG

Modern AI agents pull context from multiple sources:

┌─────────────────────────────────────────────────────────────┐
│                        User Query                           │
│                "How do I reset my password?"                │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │           Context Sources               │
        ├─────────────────────────────────────────┤
        │  📄 RAG (Documentation)                 │
        │  🔧 MCP Tools (API responses)           │
        │  🧠 Memory (Past conversations)         │
        │  ⚡ Skills (Procedural knowledge)       │
        └─────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Retrieved Chunks                          │
├─────────────────────────────────────────────────────────────┤
│ [RAG]    "To reset your password, click 'Forgot Password'  │
│           on the login page..."                             │
│ [RAG]    "Password reset: Navigate to login, select        │
│           'Forgot Password'..."                             │
│ [MCP]    "User password can be reset via the forgot        │
│           password flow in the auth system"                 │
│ [Memory] "Last time you asked, I explained the password    │
│           reset uses the forgot password link..."           │
│ [RAG]    "Account deletion is available in Settings..."    │
│ [MCP]    "Delete account: Settings > Account > Delete"     │
│ [RAG]    "Set up 2FA for extra security..."                │
│ [Skills] "Contact support at support@example.com"          │
└─────────────────────────────────────────────────────────────┘

Look at those first four results. They're all saying the same thing: use the forgot password flow. But because they come from different sources with different wording, naive top-k retrieval treats them as distinct.

The Math of Waste

If you retrieve 8 chunks and 5 are duplicates:

62% of your context window is wasted
You're paying for tokens that add no information
The LLM sees repetition, which can bias its response
You're missing diverse information that could help

Why Cosine Similarity Isn't Enough

You might think: "Just dedupe by cosine similarity threshold."

The problem is choosing that threshold:

Similarity between chunks about password reset:

"To reset your password, click 'Forgot Password'..."
    vs
"Password reset: Navigate to login, select 'Forgot'..."

Cosine Similarity: 0.82

Is 0.82 a duplicate? What about 0.75? 0.68?

A fixed threshold fails because:

Domain variance: Technical docs cluster tighter than conversational text
Length effects: Longer chunks have different similarity distributions
Embedding model quirks: Different models have different similarity ranges

You need something smarter.

The Solution: Cluster → Select → Diversify

GoVectorSync uses a three-stage pipeline:

┌─────────────────────────────────────────────────────────────┐
│                    GoVectorSync Pipeline                    │
└─────────────────────────────────────────────────────────────┘

     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
     │   STAGE 1    │     │   STAGE 2    │     │   STAGE 3    │
     │  Over-fetch  │────▶│   Cluster    │────▶│   Select +   │
     │   (3-5x K)   │     │ (Semantic)   │     │     MMR      │
     └──────────────┘     └──────────────┘     └──────────────┘
           │                    │                    │
           ▼                    ▼                    ▼
     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
     │  50 chunks   │     │  12 clusters │     │   8 chunks   │
     │  from VectorDB│     │  by meaning  │     │   diverse    │
     └──────────────┘     └──────────────┘     └──────────────┘

Stage 1: Over-fetch

Instead of retrieving exactly K chunks, retrieve 3-5x more:

// Request 50 chunks when you need 8
req.TopK = cfg.OverFetchK  // 50

This gives us a pool to deduplicate from. The extra retrieval cost is negligible compared to the LLM inference cost you'll save.

Stage 2: Agglomerative Clustering

We group semantically similar chunks using hierarchical clustering:

┌─────────────────────────────────────────────────────────────┐
│              Agglomerative Clustering                       │
└─────────────────────────────────────────────────────────────┘

Initial state (each chunk is its own cluster):
[C1] [C2] [C3] [C4] [C5] [C6] [C7] [C8]

Step 1: Merge closest pair (C1, C2) - both about password reset
[C1,C2] [C3] [C4] [C5] [C6] [C7] [C8]

Step 2: Merge (C1,C2) with C3 - also password reset
[C1,C2,C3] [C4] [C5] [C6] [C7] [C8]

Step 3: Merge C4 into password cluster
[C1,C2,C3,C4] [C5] [C6] [C7] [C8]

Step 4: Merge C5,C6 - both about account deletion
[C1,C2,C3,C4] [C5,C6] [C7] [C8]

Stop when distance > threshold (0.15)

Final clusters:
┌─────────────────┐ ┌─────────────────┐ ┌────────┐ ┌────────┐
│ Password Reset  │ │ Account Delete  │ │  2FA   │ │Support │
│ C1, C2, C3, C4  │ │    C5, C6       │ │  C7    │ │  C8    │
└─────────────────┘ └─────────────────┘ └────────┘ └────────┘

The algorithm:

func (c *Clusterer) Cluster(chunks []types.Chunk) *types.ClusterResult {
    // Initialize: each chunk is its own cluster
    nodes := make([]*clusterNode, n)
    for i := range chunks {
        nodes[i] = &clusterNode{
            members:  []int{i},
            centroid: chunks[i].Embedding,
            active:   true,
        }
    }

    // Compute pairwise distance matrix
    distMatrix := c.computeDistanceMatrix(chunks)

    // Agglomerative merging
    for activeCount > 1 {
        // Find closest pair
        minDist, minI, minJ := findClosestPair(nodes, distMatrix)

        // Stop if distance exceeds threshold
        if minDist > c.cfg.Threshold {
            break
        }

        // Merge clusters
        mergeClusters(nodes[minI], nodes[minJ], chunks)
        nodes[minJ].active = false
    }

    return buildResult(nodes, chunks)
}

Key insight: We use average linkage by default, which computes cluster distance as the mean of all pairwise distances. This is more robust than single-linkage (chaining problem) or complete-linkage (too conservative).

func (c *Clusterer) clusterDistance(a, b *clusterNode, distMatrix [][]float64) float64 {
    switch c.cfg.Linkage {
    case "single":
        // Min distance - can cause chaining
        return minPairwiseDistance(a, b, distMatrix)
    case "complete":
        // Max distance - very conservative
        return maxPairwiseDistance(a, b, distMatrix)
    case "average":
        // Mean distance - balanced
        return avgPairwiseDistance(a, b, distMatrix)
    }
}

Stage 3: Representative Selection

From each cluster, we pick one representative. Multiple strategies:

┌─────────────────────────────────────────────────────────────┐
│                Selection Strategies                         │
└─────────────────────────────────────────────────────────────┘

Strategy: "score" (default)
─────────────────────────────
Pick the chunk with highest retrieval score.
Best for: Preserving relevance ranking

    Cluster: [C1: 0.92, C2: 0.89, C3: 0.85, C4: 0.78]
    Selected: C1 (score 0.92)


Strategy: "centroid"
─────────────────────────────
Pick the chunk closest to cluster centroid.
Best for: Finding the most "typical" chunk

    Cluster centroid: [0.12, -0.34, 0.56, ...]

    C1 distance to centroid: 0.08
    C2 distance to centroid: 0.12  
    C3 distance to centroid: 0.05  ← Selected
    C4 distance to centroid: 0.15


Strategy: "hybrid"
─────────────────────────────
Weighted combination of score + centroid proximity.
Best for: Balancing relevance and typicality

    hybrid_score = 0.7 * normalized_score + 0.3 * centroid_proximity

func (s *Selector) SelectFromCluster(cluster *types.Cluster) *types.Chunk {
    switch s.cfg.Strategy {
    case SelectByScore:
        return s.selectByScore(cluster)      // Highest retrieval score
    case SelectByCentroid:
        return s.selectByCentroid(cluster)   // Closest to centroid
    case SelectByHybrid:
        return s.selectByHybrid(cluster)     // Weighted combination
    }
}

Stage 4: MMR Re-ranking (Optional)

After selecting representatives, we may still have more chunks than needed. MMR (Maximal Marginal Relevance) ensures the final set is diverse:

┌─────────────────────────────────────────────────────────────┐
│           Maximal Marginal Relevance (MMR)                  │
└─────────────────────────────────────────────────────────────┘

Formula:
    MMR(chunk) = λ × relevance(chunk) 
               - (1-λ) × max_similarity(chunk, already_selected)

Where:
    λ = 0.5 (balanced)
    λ = 1.0 (pure relevance, no diversity)
    λ = 0.0 (pure diversity, ignore relevance)


Example with λ = 0.5:
─────────────────────────────

Candidates: [A, B, C, D, E]  (already selected: none)

Round 1:
    MMR(A) = 0.5 × 0.95 - 0 = 0.475  ← Selected (highest relevance)

Round 2: (A already selected)
    MMR(B) = 0.5 × 0.90 - 0.5 × sim(B,A)
           = 0.45 - 0.5 × 0.85 = 0.025
    MMR(C) = 0.5 × 0.85 - 0.5 × sim(C,A)
           = 0.425 - 0.5 × 0.20 = 0.325  ← Selected (diverse!)

Round 3: (A, C already selected)
    MMR(B) = 0.5 × 0.90 - 0.5 × max(sim(B,A), sim(B,C))
    ...

The key insight: MMR penalizes chunks that are similar to already-selected chunks, naturally promoting diversity.

func (m *MMR) computeMMRScore(candidateIdx int, selected []int, 
                               scores []float64, simMatrix [][]float64) float64 {
    relevance := scores[candidateIdx]

    if len(selected) == 0 {
        return m.cfg.Lambda * relevance
    }

    // Find max similarity to any selected chunk
    maxSim := 0.0
    for _, selIdx := range selected {
        sim := simMatrix[candidateIdx][selIdx]
        if sim > maxSim {
            maxSim = sim
        }
    }

    // MMR formula
    return m.cfg.Lambda*relevance - (1-m.cfg.Lambda)*maxSim
}

The Full Pipeline

Here's how it all fits together:

┌─────────────────────────────────────────────────────────────┐
│                  GoVectorSync Broker                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 1. EMBED QUERY                                              │
│    "How do I reset my password?" → [0.12, -0.34, ...]      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. OVER-FETCH FROM VECTOR DB                                │
│    Query Pinecone/Qdrant for top 50 chunks                  │
│    Include embeddings for clustering                        │
│                                                             │
│    Latency: ~15ms                                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. AGGLOMERATIVE CLUSTERING                                 │
│    50 chunks → 15 clusters                                  │
│    Threshold: 0.15 cosine distance                          │
│                                                             │
│    Latency: ~8ms                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. REPRESENTATIVE SELECTION                                 │
│    15 clusters → 15 representatives                         │
│    Strategy: highest score per cluster                      │
│                                                             │
│    Latency: ~1ms                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. MMR RE-RANKING                                           │
│    15 representatives → 8 diverse chunks                    │
│    Lambda: 0.5 (balanced relevance/diversity)               │
│                                                             │
│    Latency: ~3ms                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ RESULT                                                      │
│ 8 chunks, each covering a distinct topic:                   │
│ • Password reset                                            │
│ • Account deletion                                          │
│ • 2FA setup                                                 │
│ • Support contact                                           │
│ • Billing                                                   │
│ • API keys                                                  │
│ • Data export                                               │
│ • Team management                                           │
│                                                             │
│ Total added latency: ~12ms                                  │
└─────────────────────────────────────────────────────────────┘

Implementation Details

Distance Matrix Computation

For N chunks, we compute an N×N distance matrix. This is O(N²) but N is small (50-100 chunks):

func (c *Clusterer) computeDistanceMatrix(chunks []types.Chunk) [][]float64 {
    n := len(chunks)
    matrix := make([][]float64, n)

    for i := 0; i < n; i++ {
        matrix[i] = make([]float64, n)
        for j := i + 1; j < n; j++ {
            dist := math.CosineDistance(chunks[i].Embedding, chunks[j].Embedding)
            matrix[i][j] = dist
            matrix[j][i] = dist  // Symmetric
        }
    }
    return matrix
}

Cosine Distance

func CosineDistance(a, b []float32) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += float64(a[i] * b[i])
        normA += float64(a[i] * a[i])
        normB += float64(b[i] * b[i])
    }

    if normA == 0 || normB == 0 {
        return 1.0  // Max distance
    }

    similarity := dot / (math.Sqrt(normA) * math.Sqrt(normB))
    return 1.0 - similarity  // Convert to distance
}

Centroid Update on Merge

When merging clusters, we recompute the centroid as the mean of all member embeddings:

func (c *Clusterer) mergeClusters(a, b *clusterNode, chunks []types.Chunk) {
    a.members = append(a.members, b.members...)

    // Recompute centroid
    dim := len(chunks[0].Embedding)
    newCentroid := make([]float32, dim)

    for _, idx := range a.members {
        for d := 0; d < dim; d++ {
            newCentroid[d] += chunks[idx].Embedding[d]
        }
    }

    invN := float32(1.0 / float64(len(a.members)))
    for d := 0; d < dim; d++ {
        newCentroid[d] *= invN
    }

    a.centroid = newCentroid
}

Configuration Tuning

Cluster Threshold

The threshold controls how aggressively we merge:

Threshold	Effect	Use Case
0.10	Conservative, more clusters	High-precision domains
0.15	Balanced (default)	General purpose
0.20	Aggressive, fewer clusters	Noisy/redundant data
0.25+	Very aggressive	Heavy deduplication

MMR Lambda

Lambda	Effect	Use Case
0.3	Diversity-focused	Exploratory queries
0.5	Balanced (default)	General purpose
0.7	Relevance-focused	Precise queries
1.0	Pure relevance	No diversity needed

Over-fetch Ratio

Ratio	Chunks Retrieved	Trade-off
2x	16 for K=8	Minimal overhead, less dedup potential
3x	24 for K=8	Good balance
5x	40 for K=8	Maximum dedup, higher retrieval cost

Performance

Benchmarks on a typical workload (50 chunks, 1536-dim embeddings):

Stage	Latency
Distance matrix	2ms
Clustering	6ms
Selection	<1ms
MMR	3ms
Total overhead	~12ms

This is negligible compared to:

Vector DB query: 15-50ms
LLM inference: 500-2000ms

But the savings are significant:

35% fewer tokens per query
2x more diverse context
Better LLM answers (less confusion from repetition)

Usage

As a Proxy

GoVectorSync sits between your app and vector DB:

┌─────────┐     ┌──────────────┐     ┌────────────┐
│  Your   │────▶│ GoVectorSync │────▶│  Pinecone  │
│   App   │◀────│    Proxy     │◀────│   Qdrant   │
└─────────┘     └──────────────┘     └────────────┘

Direct Integration

Request for access for Go SDK & demo

What's Next

I'm building GoVectorSync as an open-source tool for the RAG community. Current focus:

More vector DB integrations (Weaviate, Milvus, Chroma)
Ingestion-time dedup (deduplicate before storing)
Adaptive thresholds (learn optimal settings per namespace)
Streaming support (deduplicate as chunks arrive)

Try It

Join the waitlist for early access:

waitlist.siddhantkhare.com/?project=GoVectorSync

Found this useful? I write about AI infrastructure, security, and the engineering challenges of building production AI systems. Connect with me on LinkedIn or Twitter/X.

*Built by Siddhant Khare

DEV Community