DEV Community

Cover image for The Engineering guide to Context window efficiency
Siddhant Khare
Siddhant Khare

Posted on

The Engineering guide to Context window efficiency

A deep dive into semantic deduplication for LLM context windows


If you're building with RAG (Retrieval-Augmented Generation), you've probably noticed something frustrating: your LLM keeps getting the same information from different sources. The same answer appears in your documentation, your tool outputs, your memory system—just worded slightly differently.

This isn't a minor inefficiency. In production RAG systems, 30-40% of retrieved context is semantically redundant. That's wasted tokens, higher API costs, and confused model outputs.

I built GoVectorSync to fix this. Here's the technical deep-dive on the problem and solution.


The Problem: Semantic Redundancy in Multi-Source RAG

Modern AI agents pull context from multiple sources:

┌─────────────────────────────────────────────────────────────┐
│                        User Query                           │
│                "How do I reset my password?"                │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
        ┌─────────────────────────────────────────┐
        │           Context Sources               │
        ├─────────────────────────────────────────┤
        │  📄 RAG (Documentation)                 │
        │  🔧 MCP Tools (API responses)           │
        │  🧠 Memory (Past conversations)         │
        │  ⚡ Skills (Procedural knowledge)       │
        └─────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│                   Retrieved Chunks                          │
├─────────────────────────────────────────────────────────────┤
│ [RAG]    "To reset your password, click 'Forgot Password'  │
│           on the login page..."                             │
│ [RAG]    "Password reset: Navigate to login, select        │
│           'Forgot Password'..."                             │
│ [MCP]    "User password can be reset via the forgot        │
│           password flow in the auth system"                 │
│ [Memory] "Last time you asked, I explained the password    │
│           reset uses the forgot password link..."           │
│ [RAG]    "Account deletion is available in Settings..."    │
│ [MCP]    "Delete account: Settings > Account > Delete"     │
│ [RAG]    "Set up 2FA for extra security..."                │
│ [Skills] "Contact support at support@example.com"          │
└─────────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Look at those first four results. They're all saying the same thing: use the forgot password flow. But because they come from different sources with different wording, naive top-k retrieval treats them as distinct.

The Math of Waste

If you retrieve 8 chunks and 5 are duplicates:

  • 62% of your context window is wasted
  • You're paying for tokens that add no information
  • The LLM sees repetition, which can bias its response
  • You're missing diverse information that could help

Why Cosine Similarity Isn't Enough

You might think: "Just dedupe by cosine similarity threshold."

The problem is choosing that threshold:

Similarity between chunks about password reset:

"To reset your password, click 'Forgot Password'..."
    vs
"Password reset: Navigate to login, select 'Forgot'..."

Cosine Similarity: 0.82
Enter fullscreen mode Exit fullscreen mode

Is 0.82 a duplicate? What about 0.75? 0.68?

A fixed threshold fails because:

  1. Domain variance: Technical docs cluster tighter than conversational text
  2. Length effects: Longer chunks have different similarity distributions
  3. Embedding model quirks: Different models have different similarity ranges

You need something smarter.


The Solution: Cluster → Select → Diversify

GoVectorSync uses a three-stage pipeline:

┌─────────────────────────────────────────────────────────────┐
│                    GoVectorSync Pipeline                    │
└─────────────────────────────────────────────────────────────┘

     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
     │   STAGE 1    │     │   STAGE 2    │     │   STAGE 3    │
     │  Over-fetch  │────▶│   Cluster    │────▶│   Select +   │
     │   (3-5x K)   │     │ (Semantic)   │     │     MMR      │
     └──────────────┘     └──────────────┘     └──────────────┘
           │                    │                    │
           ▼                    ▼                    ▼
     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
     │  50 chunks   │     │  12 clusters │     │   8 chunks   │
     │  from VectorDB│     │  by meaning  │     │   diverse    │
     └──────────────┘     └──────────────┘     └──────────────┘
Enter fullscreen mode Exit fullscreen mode

Stage 1: Over-fetch

Instead of retrieving exactly K chunks, retrieve 3-5x more:

// Request 50 chunks when you need 8
req.TopK = cfg.OverFetchK  // 50
Enter fullscreen mode Exit fullscreen mode

This gives us a pool to deduplicate from. The extra retrieval cost is negligible compared to the LLM inference cost you'll save.

Stage 2: Agglomerative Clustering

We group semantically similar chunks using hierarchical clustering:

┌─────────────────────────────────────────────────────────────┐
│              Agglomerative Clustering                       │
└─────────────────────────────────────────────────────────────┘

Initial state (each chunk is its own cluster):
[C1] [C2] [C3] [C4] [C5] [C6] [C7] [C8]

Step 1: Merge closest pair (C1, C2) - both about password reset
[C1,C2] [C3] [C4] [C5] [C6] [C7] [C8]

Step 2: Merge (C1,C2) with C3 - also password reset
[C1,C2,C3] [C4] [C5] [C6] [C7] [C8]

Step 3: Merge C4 into password cluster
[C1,C2,C3,C4] [C5] [C6] [C7] [C8]

Step 4: Merge C5,C6 - both about account deletion
[C1,C2,C3,C4] [C5,C6] [C7] [C8]

Stop when distance > threshold (0.15)

Final clusters:
┌─────────────────┐ ┌─────────────────┐ ┌────────┐ ┌────────┐
│ Password Reset  │ │ Account Delete  │ │  2FA   │ │Support │
│ C1, C2, C3, C4  │ │    C5, C6       │ │  C7    │ │  C8    │
└─────────────────┘ └─────────────────┘ └────────┘ └────────┘
Enter fullscreen mode Exit fullscreen mode

The algorithm:

func (c *Clusterer) Cluster(chunks []types.Chunk) *types.ClusterResult {
    // Initialize: each chunk is its own cluster
    nodes := make([]*clusterNode, n)
    for i := range chunks {
        nodes[i] = &clusterNode{
            members:  []int{i},
            centroid: chunks[i].Embedding,
            active:   true,
        }
    }

    // Compute pairwise distance matrix
    distMatrix := c.computeDistanceMatrix(chunks)

    // Agglomerative merging
    for activeCount > 1 {
        // Find closest pair
        minDist, minI, minJ := findClosestPair(nodes, distMatrix)

        // Stop if distance exceeds threshold
        if minDist > c.cfg.Threshold {
            break
        }

        // Merge clusters
        mergeClusters(nodes[minI], nodes[minJ], chunks)
        nodes[minJ].active = false
    }

    return buildResult(nodes, chunks)
}
Enter fullscreen mode Exit fullscreen mode

Key insight: We use average linkage by default, which computes cluster distance as the mean of all pairwise distances. This is more robust than single-linkage (chaining problem) or complete-linkage (too conservative).

func (c *Clusterer) clusterDistance(a, b *clusterNode, distMatrix [][]float64) float64 {
    switch c.cfg.Linkage {
    case "single":
        // Min distance - can cause chaining
        return minPairwiseDistance(a, b, distMatrix)
    case "complete":
        // Max distance - very conservative
        return maxPairwiseDistance(a, b, distMatrix)
    case "average":
        // Mean distance - balanced
        return avgPairwiseDistance(a, b, distMatrix)
    }
}
Enter fullscreen mode Exit fullscreen mode

Stage 3: Representative Selection

From each cluster, we pick one representative. Multiple strategies:

┌─────────────────────────────────────────────────────────────┐
│                Selection Strategies                         │
└─────────────────────────────────────────────────────────────┘

Strategy: "score" (default)
─────────────────────────────
Pick the chunk with highest retrieval score.
Best for: Preserving relevance ranking

    Cluster: [C1: 0.92, C2: 0.89, C3: 0.85, C4: 0.78]
    Selected: C1 (score 0.92)


Strategy: "centroid"
─────────────────────────────
Pick the chunk closest to cluster centroid.
Best for: Finding the most "typical" chunk

    Cluster centroid: [0.12, -0.34, 0.56, ...]

    C1 distance to centroid: 0.08
    C2 distance to centroid: 0.12  
    C3 distance to centroid: 0.05  ← Selected
    C4 distance to centroid: 0.15


Strategy: "hybrid"
─────────────────────────────
Weighted combination of score + centroid proximity.
Best for: Balancing relevance and typicality

    hybrid_score = 0.7 * normalized_score + 0.3 * centroid_proximity
Enter fullscreen mode Exit fullscreen mode
func (s *Selector) SelectFromCluster(cluster *types.Cluster) *types.Chunk {
    switch s.cfg.Strategy {
    case SelectByScore:
        return s.selectByScore(cluster)      // Highest retrieval score
    case SelectByCentroid:
        return s.selectByCentroid(cluster)   // Closest to centroid
    case SelectByHybrid:
        return s.selectByHybrid(cluster)     // Weighted combination
    }
}
Enter fullscreen mode Exit fullscreen mode

Stage 4: MMR Re-ranking (Optional)

After selecting representatives, we may still have more chunks than needed. MMR (Maximal Marginal Relevance) ensures the final set is diverse:

┌─────────────────────────────────────────────────────────────┐
│           Maximal Marginal Relevance (MMR)                  │
└─────────────────────────────────────────────────────────────┘

Formula:
    MMR(chunk) = λ × relevance(chunk) 
               - (1-λ) × max_similarity(chunk, already_selected)

Where:
    λ = 0.5 (balanced)
    λ = 1.0 (pure relevance, no diversity)
    λ = 0.0 (pure diversity, ignore relevance)


Example with λ = 0.5:
─────────────────────────────

Candidates: [A, B, C, D, E]  (already selected: none)

Round 1:
    MMR(A) = 0.5 × 0.95 - 0 = 0.475  ← Selected (highest relevance)

Round 2: (A already selected)
    MMR(B) = 0.5 × 0.90 - 0.5 × sim(B,A)
           = 0.45 - 0.5 × 0.85 = 0.025
    MMR(C) = 0.5 × 0.85 - 0.5 × sim(C,A)
           = 0.425 - 0.5 × 0.20 = 0.325  ← Selected (diverse!)

Round 3: (A, C already selected)
    MMR(B) = 0.5 × 0.90 - 0.5 × max(sim(B,A), sim(B,C))
    ...
Enter fullscreen mode Exit fullscreen mode

The key insight: MMR penalizes chunks that are similar to already-selected chunks, naturally promoting diversity.

func (m *MMR) computeMMRScore(candidateIdx int, selected []int, 
                               scores []float64, simMatrix [][]float64) float64 {
    relevance := scores[candidateIdx]

    if len(selected) == 0 {
        return m.cfg.Lambda * relevance
    }

    // Find max similarity to any selected chunk
    maxSim := 0.0
    for _, selIdx := range selected {
        sim := simMatrix[candidateIdx][selIdx]
        if sim > maxSim {
            maxSim = sim
        }
    }

    // MMR formula
    return m.cfg.Lambda*relevance - (1-m.cfg.Lambda)*maxSim
}
Enter fullscreen mode Exit fullscreen mode

The Full Pipeline

Here's how it all fits together:

┌─────────────────────────────────────────────────────────────┐
│                  GoVectorSync Broker                        │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 1. EMBED QUERY                                              │
│    "How do I reset my password?" → [0.12, -0.34, ...]      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 2. OVER-FETCH FROM VECTOR DB                                │
│    Query Pinecone/Qdrant for top 50 chunks                  │
│    Include embeddings for clustering                        │
│                                                             │
│    Latency: ~15ms                                           │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 3. AGGLOMERATIVE CLUSTERING                                 │
│    50 chunks → 15 clusters                                  │
│    Threshold: 0.15 cosine distance                          │
│                                                             │
│    Latency: ~8ms                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 4. REPRESENTATIVE SELECTION                                 │
│    15 clusters → 15 representatives                         │
│    Strategy: highest score per cluster                      │
│                                                             │
│    Latency: ~1ms                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ 5. MMR RE-RANKING                                           │
│    15 representatives → 8 diverse chunks                    │
│    Lambda: 0.5 (balanced relevance/diversity)               │
│                                                             │
│    Latency: ~3ms                                            │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────┐
│ RESULT                                                      │
│ 8 chunks, each covering a distinct topic:                   │
│ • Password reset                                            │
│ • Account deletion                                          │
│ • 2FA setup                                                 │
│ • Support contact                                           │
│ • Billing                                                   │
│ • API keys                                                  │
│ • Data export                                               │
│ • Team management                                           │
│                                                             │
│ Total added latency: ~12ms                                  │
└─────────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

Implementation Details

Distance Matrix Computation

For N chunks, we compute an N×N distance matrix. This is O(N²) but N is small (50-100 chunks):

func (c *Clusterer) computeDistanceMatrix(chunks []types.Chunk) [][]float64 {
    n := len(chunks)
    matrix := make([][]float64, n)

    for i := 0; i < n; i++ {
        matrix[i] = make([]float64, n)
        for j := i + 1; j < n; j++ {
            dist := math.CosineDistance(chunks[i].Embedding, chunks[j].Embedding)
            matrix[i][j] = dist
            matrix[j][i] = dist  // Symmetric
        }
    }
    return matrix
}
Enter fullscreen mode Exit fullscreen mode

Cosine Distance

func CosineDistance(a, b []float32) float64 {
    var dot, normA, normB float64
    for i := range a {
        dot += float64(a[i] * b[i])
        normA += float64(a[i] * a[i])
        normB += float64(b[i] * b[i])
    }

    if normA == 0 || normB == 0 {
        return 1.0  // Max distance
    }

    similarity := dot / (math.Sqrt(normA) * math.Sqrt(normB))
    return 1.0 - similarity  // Convert to distance
}
Enter fullscreen mode Exit fullscreen mode

Centroid Update on Merge

When merging clusters, we recompute the centroid as the mean of all member embeddings:

func (c *Clusterer) mergeClusters(a, b *clusterNode, chunks []types.Chunk) {
    a.members = append(a.members, b.members...)

    // Recompute centroid
    dim := len(chunks[0].Embedding)
    newCentroid := make([]float32, dim)

    for _, idx := range a.members {
        for d := 0; d < dim; d++ {
            newCentroid[d] += chunks[idx].Embedding[d]
        }
    }

    invN := float32(1.0 / float64(len(a.members)))
    for d := 0; d < dim; d++ {
        newCentroid[d] *= invN
    }

    a.centroid = newCentroid
}
Enter fullscreen mode Exit fullscreen mode

Configuration Tuning

Cluster Threshold

The threshold controls how aggressively we merge:

Threshold Effect Use Case
0.10 Conservative, more clusters High-precision domains
0.15 Balanced (default) General purpose
0.20 Aggressive, fewer clusters Noisy/redundant data
0.25+ Very aggressive Heavy deduplication

MMR Lambda

Lambda Effect Use Case
0.3 Diversity-focused Exploratory queries
0.5 Balanced (default) General purpose
0.7 Relevance-focused Precise queries
1.0 Pure relevance No diversity needed

Over-fetch Ratio

Ratio Chunks Retrieved Trade-off
2x 16 for K=8 Minimal overhead, less dedup potential
3x 24 for K=8 Good balance
5x 40 for K=8 Maximum dedup, higher retrieval cost

Performance

Benchmarks on a typical workload (50 chunks, 1536-dim embeddings):

Stage Latency
Distance matrix 2ms
Clustering 6ms
Selection <1ms
MMR 3ms
Total overhead ~12ms

This is negligible compared to:

  • Vector DB query: 15-50ms
  • LLM inference: 500-2000ms

But the savings are significant:

  • 35% fewer tokens per query
  • 2x more diverse context
  • Better LLM answers (less confusion from repetition)

Usage

As a Proxy

GoVectorSync sits between your app and vector DB:

┌─────────┐     ┌──────────────┐     ┌────────────┐
│  Your   │────▶│ GoVectorSync │────▶│  Pinecone  │
│   App   │◀────│    Proxy     │◀────│   Qdrant   │
└─────────┘     └──────────────┘     └────────────┘
Enter fullscreen mode Exit fullscreen mode

Direct Integration

Request for access for Go SDK & demo


What's Next

I'm building GoVectorSync as an open-source tool for the RAG community. Current focus:

  1. More vector DB integrations (Weaviate, Milvus, Chroma)
  2. Ingestion-time dedup (deduplicate before storing)
  3. Adaptive thresholds (learn optimal settings per namespace)
  4. Streaming support (deduplicate as chunks arrive)

Try It

Join the waitlist for early access:

waitlist.siddhantkhare.com/?project=GoVectorSync


Found this useful? I write about AI infrastructure, security, and the engineering challenges of building production AI systems. Connect with me on LinkedIn or Twitter/X.

*Built by Siddhant Khare

Top comments (0)