A deep dive for users who want results and developers who want control
TL;DR (For the Impatient)
Normal users: Install context-portfolio-optimizer, run cpo compile ./your-docs --budget 4000, and stop overpaying for tokens.
Developers: Middleware pipeline that ingests heterogeneous sources → normalizes → precomputes → optimizes via multi-objective knapsack → compiles provider-specific payloads with delta fusion for agents.
Both groups get 60–99% token reduction with identical answer quality.
Part 1: For Normal Users — “Just Make My LLM Cheaper and Faster”
The Problem You Actually Face
You’re building with LLMs. Maybe it’s a chatbot over your company docs. Maybe it’s a coding assistant. Maybe it’s an agent that needs to remember context across 20 turns.
You keep hitting the same frustrations:
- “Why is this API call so expensive?” — You’re sending 8,000 tokens when 800 would suffice
- “Why does it take 10 seconds to respond?” — Latency scales with prompt size
- “Why does my agent forget everything?” — You’re not managing context deltas across turns
- “Why do I have to rewrite everything when I switch from GPT-4 to Claude?” — Hardcoded prompt formats
You’ve tried RAG. You’ve tried chunking. But you’re still blindly stuffing retrieved chunks into prompts without knowing which ones actually matter.
What ContextFusion Does (No Jargon)
Think of it like a smart travel packer for your LLM trips.
You have a weight limit (token budget). You have dozens of items (documents, code, images). Some items are essential. Some are nice-to-have. Some are duplicates. Some are risky (outdated, untrusted).
ContextFusion:
- Unpacks everything — PDFs, Word docs, spreadsheets, images, code files
- Weighs and labels each item — How useful? How risky? How heavy?
- Packs the optimal suitcase — Maximum value within your weight limit
- Formats it for your destination — OpenAI’s preferred style, Anthropic’s format, or local Ollama
- And for return trips (agent conversations), it remembers what you already packed and only adds what’s new.
Real Results
Benchmarks were run with Claude Sonnet 4.6 on production-like workloads; full methodology at github.com/rotsl/context-fusion/benchmarks
Getting Started (Three Options)
Option A: NPM Wrapper (Easiest — No Python Required)
# One-time setup
npm install -g @rotsl/contextfusion
npx @rotsl/contextfusion setup
# Create API keys file
npx @rotsl/contextfusion env
# Edit .env with your OPENAI_API_KEY or ANTHROPIC_API_KEY
# Run optimization
npx @rotsl/contextfusion run ./my-documents \
--query "Summarize key findings" \
--provider anthropic \
--model claude-sonnet-4-6 \
--budget 4000
# Launch Web UI
npx @rotsl/contextfusion ui --port 8080
Option B: Python Package (More Control)
pip install context-portfolio-optimizer
# Set up environment
cat > .env << 'EOF'
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
EOF
# Run CLI
cpo run ./my-documents --budget 4000 --query "What are the main points?"
# Or compile for specific task type
cpo compile ./my-codebase \
--task "Explain this function" \
--provider openai \
--model gpt-5-mini \
--mode code \
--budget 3000
Option C: Docker (Isolated, Reproducible)
docker build -t context-fusion:latest .
docker run --rm -it -v "$(pwd)":/app context-fusion:latest run ./data --budget 3000
The Web UI: See What Your LLM Actually Receives
Run cpo ui --port 8080 and open your browser. You'll see:
- Run stats: Files ingested, blocks selected, total tokens
- Representation usage: Which compact variants were chosen
- Selected blocks: Source, representation type, utility score, token estimate
- Context preview: Exactly what gets sent to the LLM
- Model answer: Optional direct comparison
This transparency is rare. Most RAG tools are black boxes. ContextFusion shows its work.
Common Use Cases
When ContextFusion Helps Most
✅ Multi-provider setups — Same pipeline, different output formats
✅ Cost-sensitive production — 60–99% token reduction
✅ Agent conversations — Delta fusion prevents token churn
✅ Complex ingestion — PDFs, images, code, spreadsheets unified
✅ Latency requirements — Precomputation + caching
When You Might Not Need It
❌ Simple single-turn Q&A with tiny documents
❌ You’re already heavily invested in a specific RAG framework and happy with costs
❌ You need real-time streaming with sub-100ms latency (ContextFusion adds 50–200ms optimization overhead)
Part 2: For Developers — “How This Actually Works”
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ INGESTION LAYER │
│ PDF │ DOCX │ CSV │ JSON │ Images (OCR) │ Code │ Markdown │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ NORMALIZATION LAYER │
│ Convert all sources to uniform ContextBlock objects │
│ - source_type, content_hash, created_at, metadata │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ REPRESENTATION LAYER │
│ Precompute compact variants per block: │
│ - universal_summary (general purpose) │
│ - qa_extractive (question-answering focused) │
│ - code_signature (functions, classes, dependencies) │
│ - agent_condensed (working memory format) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PRECOMPUTE PIPELINE │
│ Store: fingerprints, summaries, token stats, │
│ retrieval features, compact variants in .cpo_cache/ │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ RETRIEVAL LAYER │
│ Query classification → Lexical retrieval (top-100) │
│ → Fast rerank (top-20/25) → Candidate set │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ MULTI-OBJECTIVE PLANNER (Core) │
│ │
│ maximize Σ( w_u·utility - w_r·risk - w_t·token_cost │
│ - w_l·latency + w_c·cacheability + w_d·diversity ) │
│ │
│ subject to: Σ(token_i) ≤ budget │
│ │
│ Selects optimal representation variant per block │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ COMPRESSION LAYER │
│ - JSON minification │
│ - Citation compaction (Source URI → [id]) │
│ - Schema field pruning │
│ Levels: none │ light │ medium │ aggressive │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ DELTA FUSION (Agent Mode) │
│ Compute ContextDelta: │
│ - added_blocks: new since last turn │
│ - updated_blocks: changed content │
│ - removed_blocks: no longer relevant │
│ - unchanged_block_ids: reuse from cache │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PROVIDER ADAPTER LAYER │
│ Compile provider-specific payloads: │
│ - openai: chat.completions format │
│ - anthropic: messages with XML citations │
│ - ollama: local API structure │
│ - openai_compatible: generic wrapper │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ CACHE-AWARE ASSEMBLY │
│ Segment into: │
│ - stable: system instructions, citation maps, cacheable blocks │
│ - dynamic: volatile content, real-time data │
└─────────────────────────────────────────────────────────────────┘
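The normalization layer's uniform block can be sketched as a plain dataclass. This is an illustrative sketch only: the fields match those listed in the diagram (source_type, content_hash, created_at, metadata), but everything else — defaults, hash length, the `__post_init__` behavior — is an assumption, not the library's actual schema.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class ContextBlock:
    """Uniform unit produced by the normalization layer (illustrative sketch)."""
    content: str
    source_type: str  # e.g. "pdf" | "docx" | "csv" | "code" | "markdown"
    content_hash: str = ""
    created_at: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        # Derive a stable fingerprint and timestamp if not supplied
        if not self.content_hash:
            self.content_hash = hashlib.sha256(self.content.encode()).hexdigest()[:16]
        if not self.created_at:
            self.created_at = datetime.now(timezone.utc).isoformat()

block = ContextBlock(content="def login(user): ...", source_type="code")
```

Because the hash is derived from content alone, two blocks with identical text collapse to the same fingerprint — which is what makes deduplication and delta tracking cheap downstream.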
The Knapsack Formulation: Why This Isn’t Just “Smart Chunking”
Most RAG tools use semantic similarity: embed query, embed chunks, return top-k. This fails when:
- Your budget is 4,000 tokens and you have 50 relevant chunks of 500 tokens each
- Some chunks are high-utility but high-risk (outdated documentation)
- Some chunks are cacheable, others must be fresh
- You need diversity (don’t send 5 versions of the same information)
ContextFusion’s planner treats this as a constrained optimization problem:
# Pseudocode of the core algorithm
def select_context_blocks(candidates, budget, weights):
    """
    candidates: List[ContextBlock with multiple representation variants]
    budget: int (token limit)
    weights: dict[str, float] (utility, risk, latency, cacheability, diversity)
    """
    selected = []  # grows as blocks are chosen; the diversity bonus is scored against it
    # Generate all (block, variant) pairs with scores
    items = []
    for block in candidates:
        for variant in block.representations:
            score = (
                weights['utility'] * variant.utility_score
                - weights['risk'] * block.risk_score
                - weights['token_cost'] * variant.token_count
                - weights['latency'] * variant.latency_estimate
                + weights['cacheability'] * block.cache_score
                + weights['diversity'] * diversity_bonus(block, selected)
            )
            items.append((block.id, variant, score, variant.token_count))
    # Solve 0/1 knapsack for maximum score within budget
    selected = knapsack_01(items, budget)
    return selected
This is NP-hard, but with proper indexing and heuristics, it runs in <100ms for typical workloads.
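One common way to stay under that latency mark is a greedy approximation: rank candidates by score density (score per token) and pack in a single pass. The sketch below illustrates that general heuristic — it is not ContextFusion's actual solver, and the tuple layout mirrors the pseudocode above by assumption.

```python
def greedy_knapsack(items, budget):
    """Greedy 0/1-knapsack approximation: pick by score-per-token density.

    items: list of (block_id, variant, score, token_count) tuples
    budget: token limit
    """
    # Rank by density, best first; drop non-positive scores outright
    ranked = sorted(
        (it for it in items if it[2] > 0),
        key=lambda it: it[2] / max(it[3], 1),
        reverse=True,
    )
    selected, used = [], 0
    for block_id, variant, score, tokens in ranked:
        if used + tokens <= budget:  # take it only if it still fits
            selected.append((block_id, variant))
            used += tokens
    return selected, used

items = [("a", "summary", 9.0, 300), ("b", "full", 5.0, 900), ("c", "summary", 4.0, 100)]
chosen, used = greedy_knapsack(items, budget=500)
# "c" (density 0.04) and "a" (0.03) fit within 500 tokens; "b" (0.0056) does not
```

Greedy packing gives no optimality guarantee, but it is O(n log n) and in practice lands close to the exact solution when token counts are small relative to the budget.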
Code Example: Pipeline Integration
from context_portfolio_optimizer import PipelineRunner, Config
from context_portfolio_optimizer.providers import AnthropicAdapter

# Custom configuration
config = Config.from_yaml("""
budget:
  instructions: 1000
  retrieval: 3000
  memory: 2000
  examples: 1500
  tool_trace: 1000
  output_reserve: 1000
scoring:
  utility_weights:
    retrieval: 0.25
    trust: 0.20
    freshness: 0.15
    structure: 0.15
    diversity: 0.15
    token_cost: -0.10
provider:
  name: anthropic
  model: claude-sonnet-4-6
""")

# Initialize pipeline
runner = PipelineRunner(config=config)

# Run full pipeline
result = runner.run(
    sources=["./docs/architecture.pdf", "./src/api.py", "./data/metrics.csv"],
    query="How does the authentication flow work?",
    task_mode="qa",  # chat | qa | code | agent
    budget=4000,
    use_precomputed=True,
    compute_delta=False,  # Set True for agent loops
)

# Inspect results
print(f"Selected {result['stats']['blocks_selected']} blocks")
print(f"Total tokens: {result['stats']['total_tokens']}")
print(f"Context preview:\n{result['context'][:500]}...")

# Direct provider compilation
adapter = AnthropicAdapter(config.provider)
payload = adapter.compile_packet(
    context_blocks=result['selected_blocks'],
    task="Answer with citations",
    model="claude-sonnet-4-6",
)
# payload is ready for anthropic.messages.create(**payload)
Delta Fusion: The Secret to Efficient Agents
Standard agent implementations re-send the entire conversation history plus retrieved context on every turn. Across 10 turns at 4,000 tokens each, that's 40,000 tokens sent — most of them repeats of content the model has already seen.
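The arithmetic is easy to check. Assuming a hypothetical 4,000-token context and, say, 400 tokens of genuinely new content per turn (both numbers illustrative), the savings compound:

```python
TURNS, FULL_CONTEXT, NEW_PER_TURN = 10, 4000, 400  # illustrative numbers

# Naive agent loop: re-send the full context on every turn
naive = TURNS * FULL_CONTEXT

# Delta approach: full context once, then only the additions
delta = FULL_CONTEXT + (TURNS - 1) * NEW_PER_TURN

print(naive, delta)  # 40000 vs 7600 — an 81% reduction in this scenario
```

The longer the agent runs, the wider the gap, since the naive cost grows linearly in the full context size while the delta cost grows only in new content.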
ContextFusion’s delta tracking:
# Turn 1: Full context
turn1_result = runner.run(sources, query="Step 1...", task_mode="agent")
turn1_packet = turn1_result['context_packet']
# Turn 2: Only send what changed
turn2_result = runner.run(
sources,
query="Step 2...",
task_mode="agent",
previous_packet=turn1_packet, # Enable delta computation
compute_delta=True
)
# turn2_result['context_delta'] contains:
# {
# 'added_blocks': [new_retrieved_content],
# 'updated_blocks': [changed_blocks],
# 'removed_blocks': [no_longer_relevant],
# 'unchanged_block_ids': [ids_to_reuse_from_cache],
# 'full_context_hash': 'abc123...' # For cache validation
# }
The provider adapter assembles:
- System instructions (stable, cached)
- Citation map (stable, cached)
- New/updated blocks (dynamic, sent)
- Unchanged block references (cached, not sent)
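A minimal sketch of how such a delta can be derived from two turns' block sets, keyed by content hash. The `{block_id: content_hash}` mapping and the function name are assumptions for illustration, not the library's actual API:

```python
def compute_delta(prev_blocks, curr_blocks):
    """Diff two {block_id: content_hash} mappings into a ContextDelta-style dict."""
    prev_ids, curr_ids = set(prev_blocks), set(curr_blocks)
    common = prev_ids & curr_ids
    return {
        "added_blocks": sorted(curr_ids - prev_ids),        # new since last turn
        "removed_blocks": sorted(prev_ids - curr_ids),      # no longer relevant
        "updated_blocks": sorted(                           # same id, changed content
            i for i in common if prev_blocks[i] != curr_blocks[i]
        ),
        "unchanged_block_ids": sorted(                      # safe to reuse from cache
            i for i in common if prev_blocks[i] == curr_blocks[i]
        ),
    }

delta = compute_delta(
    prev_blocks={"doc1": "aaa", "doc2": "bbb"},
    curr_blocks={"doc2": "ccc", "doc3": "ddd"},
)
# added: ["doc3"], removed: ["doc1"], updated: ["doc2"], unchanged: []
```

Only the added and updated entries cost tokens on the next turn; unchanged ids reduce to cheap references.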
Precompute Pipeline: Latency Optimization
For production workloads, precompute expensive operations:
# One-time setup (can run offline, on CI, or scheduled)
cpo precompute ./corpus \
--store-dir .cpo_cache/precompute \
--semantic-dedup \
--generate-all-representations
# Runtime query uses precomputed artifacts
cpo compile ./corpus \
--precomputed-only \
--query "Quick question" \
--budget 2000
Precomputed artifacts:
- fingerprints.jsonl: Content hashes for deduplication
- representations/: All compact variants per block
- token_stats.json: Pre-counted tokens per variant
- retrieval_index.faiss: FAISS index for fast similarity search
- features.jsonl: Utility/risk/cacheability scores
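The fingerprint file is what makes deduplication a cheap set lookup rather than a pairwise comparison. A sketch of the idea — the actual fingerprints.jsonl schema may differ, and the field names here are assumptions:

```python
import hashlib
import json

def dedup_by_fingerprint(texts):
    """Drop exact-duplicate texts via content hashes, keeping first occurrences."""
    seen, unique, fingerprint_lines = set(), [], []
    for text in texts:
        fp = hashlib.sha256(text.encode()).hexdigest()
        if fp not in seen:  # O(1) membership test per block
            seen.add(fp)
            unique.append(text)
            # One JSONL record per unique block, ready to append to fingerprints.jsonl
            fingerprint_lines.append(json.dumps({"hash": fp, "chars": len(text)}))
    return unique, fingerprint_lines

unique, lines = dedup_by_fingerprint(["intro section", "body section", "intro section"])
# unique keeps only ["intro section", "body section"]
```

Semantic dedup (the `--semantic-dedup` flag) goes further than exact hashing by clustering near-duplicates, but the exact-hash pass is the cheap first filter.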
MCP Server Integration
Expose ContextFusion as an MCP (Model Context Protocol) server:
cpo serve-mcp --host localhost --port 8765
MCP clients can now call:
- tools/ingest: Add documents to context
- tools/compile: Optimize and compile context
- resources/context/{session_id}: Retrieve compiled packets
- tools/delta: Compute context deltas
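MCP frames tool invocations as JSON-RPC 2.0 messages, so a client calling the compile tool would send something along these lines. The tool name and argument keys (`path`, `query`, `budget`) are illustrative assumptions — the server defines the exact schema:

```python
import json

# JSON-RPC 2.0 request an MCP client would send to invoke a tool.
# "tools/call" is the standard MCP method; the arguments below are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "compile",
        "arguments": {"path": "./docs", "query": "Quick question", "budget": 2000},
    },
}
wire = json.dumps(request)  # serialized message, ready for the transport layer
```

Any MCP-capable client (an IDE plugin, an agent framework) can drive ContextFusion this way without importing the Python package.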
Framework Integrations
LangChain:
from context_portfolio_optimizer.integrations import ContextFusionRetriever

retriever = ContextFusionRetriever(
    sources=["./docs"],
    budget=3000,
    task_mode="qa",
)

# Use in any LangChain chain
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=chat_model,
    chain_type="stuff",
    retriever=retriever,
)
LlamaIndex:
from context_portfolio_optimizer.integrations import ContextFusionNodeParser

parser = ContextFusionNodeParser(
    budget_per_query=4000,
    precompute_dir=".cpo_cache",
)

# Use with LlamaIndex index construction
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents,
    node_parser=parser,
)
Development Setup
git clone https://github.com/rotsl/context-fusion.git
cd context-fusion
make bootstrap # Install dev dependencies
# Development workflow
make test # Run test suite (49 tests)
make lint # Ruff + mypy
make type-check # Strict type checking
make format # Auto-format code
# Local servers
make ui # Web UI on :8080
make serve-mcp # MCP server on :8765
# Benchmarking
make benchmark # Run full benchmark suite
Project Structure
Performance Characteristics
Extending ContextFusion
Custom representation:
from context_portfolio_optimizer.representations import Representation, register_representation

@register_representation("my_custom")
class MyCustomRepresentation(Representation):
    def generate(self, block: ContextBlock) -> str:
        # Your custom summarization logic
        return custom_summarize(block.content)

    def estimate_tokens(self, text: str) -> int:
        # Rough heuristic; cast keeps the declared int return type honest
        return int(len(text.split()) * 1.3)
Custom provider adapter:
from context_portfolio_optimizer.providers import BaseProviderAdapter, register_adapter

@register_adapter("my_provider")
class MyProviderAdapter(BaseProviderAdapter):
    def compile_packet(self, context_blocks, task, model, **kwargs):
        # Format for your custom LLM API
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": self.format_system()},
                {"role": "user", "content": self.format_context(context_blocks, task)},
            ],
        }
Part 3: Common Questions
Q: How is this different from LangChain’s ContextualCompressionRetriever?
LangChain’s version compresses after retrieval using an LLM call. ContextFusion optimizes which content to retrieve and which representation to use, without requiring an LLM for compression. It’s also provider-agnostic and handles delta fusion for agents.
Q: Does this replace my vector database?
No. ContextFusion sits after retrieval. Use Pinecone, Weaviate, pgvector, or FAISS for initial retrieval — then pass candidates through ContextFusion for optimization.
Q: What about streaming responses?
ContextFusion optimizes the input context. Streaming the LLM’s output is unaffected. The optimization adds 50–200ms overhead, which is usually offset by reduced LLM latency from shorter prompts.
Q: Can I use this with local models?
Yes. The Ollama adapter works with any OpenAI-compatible local server. Budget planning and compression are even more valuable with slower local hardware.
Q: How do I debug suboptimal context selection?
Run cpo ui and inspect the "Selected Blocks" panel. Each block shows its utility score, risk score, token count, and why it was included/excluded. Run cpo ablate ./data to see which blocks contribute most to answer quality.
Final Thoughts
ContextFusion isn’t just another RAG tool. It’s a bet that context optimization — treating token budgets as scarce resources to be allocated intelligently — will become as essential as retrieval itself.
For normal users: Install it, run it, pay less.
For developers: Extend it, integrate it, build smarter systems.
Fuse less context. Keep more signal. Ship faster answers.
⭐ Star the repo, file issues, submit PRs. ContextFusion is Apache-2.0 and built for production.





