A deep dive for users who want results and developers who want control
TL;DR (For the Impatient)
Normal users: Install context-portfolio-optimizer, run cpo compile ./your-docs --budget 4000, and stop overpaying for tokens.
Developers: Middleware pipeline that ingests heterogeneous sources → normalizes → precomputes → optimizes via multi-objective knapsack → compiles provider-specific payloads with delta fusion for agents.
Both groups get 60–99% token reduction with identical answer quality.
Part 1: For Normal Users — “Just Make My LLM Cheaper and Faster”
The Problem You Actually Face
You’re building with LLMs. Maybe it’s a chatbot over your company docs. Maybe it’s a coding assistant. Maybe it’s an agent that needs to remember context across 20 turns.
You keep hitting the same frustrations:
- “Why is this API call so expensive?” — You’re sending 8,000 tokens when 800 would suffice
- “Why does it take 10 seconds to respond?” — Latency scales with prompt size
- “Why does my agent forget everything?” — You’re not managing context deltas across turns
- “Why do I have to rewrite everything when I switch from GPT-4 to Claude?” — Hardcoded prompt formats
You’ve tried RAG. You’ve tried chunking. But you’re still blindly stuffing retrieved chunks into prompts without knowing which ones actually matter.
What ContextFusion Does (No Jargon)
Think of it like a smart travel packer for your LLM trips.
You have a weight limit (token budget). You have dozens of items (documents, code, images). Some items are essential. Some are nice-to-have. Some are duplicates. Some are risky (outdated, untrusted).
ContextFusion:
- Unpacks everything — PDFs, Word docs, spreadsheets, images, code files
- Weighs and labels each item — How useful? How risky? How heavy?
- Packs the optimal suitcase — Maximum value within your weight limit
- Formats it for your destination — OpenAI’s preferred style, Anthropic’s format, or local Ollama
- And for return trips (agent conversations), it remembers what you already packed and only adds what’s new.
Real Results
Benchmarks were run with Claude Sonnet 4.6 on production-like workloads; full methodology at github.com/rotsl/context-fusion/benchmarks
Getting Started (Three Options)
Option A: NPM Wrapper (Easiest — No Python Required)
# One-time setup
npm install -g @rotsl/contextfusion
npx @rotsl/contextfusion setup
# Create API keys file
npx @rotsl/contextfusion env
# Edit .env with your OPENAI_API_KEY or ANTHROPIC_API_KEY
# Run optimization
npx @rotsl/contextfusion run ./my-documents \
--query "Summarize key findings" \
--provider anthropic \
--model claude-sonnet-4-6 \
--budget 4000
# Launch Web UI
npx @rotsl/contextfusion ui --port 8080
Option B: Python Package (More Control)
pip install context-portfolio-optimizer
# Set up environment
cat > .env << 'EOF'
ANTHROPIC_API_KEY=your_key_here
OPENAI_API_KEY=your_key_here
EOF
# Run CLI
cpo run ./my-documents --budget 4000 --query "What are the main points?"
# Or compile for specific task type
cpo compile ./my-codebase \
--task "Explain this function" \
--provider openai \
--model gpt-5-mini \
--mode code \
--budget 3000
Option C: Docker (Isolated, Reproducible)
docker build -t context-fusion:latest .
docker run --rm -it -v "$(pwd)":/app context-fusion:latest run ./data --budget 3000
The Web UI: See What Your LLM Actually Receives
Run cpo ui --port 8080 and open your browser. You'll see:
- Run stats: Files ingested, blocks selected, total tokens
- Representation usage: Which compact variants were chosen
- Selected blocks: Source, representation type, utility score, token estimate
- Context preview: Exactly what gets sent to the LLM
- Model answer: Optional direct comparison
This transparency is rare. Most RAG tools are black boxes. ContextFusion shows its work.
Common Use Cases
When ContextFusion Helps Most
✅ Multi-provider setups — Same pipeline, different output formats
✅ Cost-sensitive production — 60–99% token reduction
✅ Agent conversations — Delta fusion prevents token churn
✅ Complex ingestion — PDFs, images, code, spreadsheets unified
✅ Latency requirements — Precomputation + caching
When You Might Not Need It
❌ Simple single-turn Q&A with tiny documents
❌ You’re already heavily invested in a specific RAG framework and happy with costs
❌ You need real-time streaming with sub-100ms latency (ContextFusion adds 50–200ms optimization overhead)
Part 2: For Developers — “How This Actually Works”
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ INGESTION LAYER │
│ PDF │ DOCX │ CSV │ JSON │ Images (OCR) │ Code │ Markdown │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ NORMALIZATION LAYER │
│ Convert all sources to uniform ContextBlock objects │
│ - source_type, content_hash, created_at, metadata │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ REPRESENTATION LAYER │
│ Precompute compact variants per block: │
│ - universal_summary (general purpose) │
│ - qa_extractive (question-answering focused) │
│ - code_signature (functions, classes, dependencies) │
│ - agent_condensed (working memory format) │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PRECOMPUTE PIPELINE │
│ Store: fingerprints, summaries, token stats, │
│ retrieval features, compact variants in .cpo_cache/ │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ RETRIEVAL LAYER │
│ Query classification → Lexical retrieval (top-100) │
│ → Fast rerank (top-20/25) → Candidate set │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ MULTI-OBJECTIVE PLANNER (Core) │
│ │
│ maximize Σ( w_u·utility - w_r·risk - w_t·token_cost │
│ - w_l·latency + w_c·cacheability + w_d·diversity ) │
│ │
│ subject to: Σ(token_i) ≤ budget │
│ │
│ Selects optimal representation variant per block │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ COMPRESSION LAYER │
│ - JSON minification │
│ - Citation compaction (Source URI → [id]) │
│ - Schema field pruning │
│ Levels: none │ light │ medium │ aggressive │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ DELTA FUSION (Agent Mode) │
│ Compute ContextDelta: │
│ - added_blocks: new since last turn │
│ - updated_blocks: changed content │
│ - removed_blocks: no longer relevant │
│ - unchanged_block_ids: reuse from cache │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ PROVIDER ADAPTER LAYER │
│ Compile provider-specific payloads: │
│ - openai: chat.completions format │
│ - anthropic: messages with XML citations │
│ - ollama: local API structure │
│ - openai_compatible: generic wrapper │
└─────────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────────┐
│ CACHE-AWARE ASSEMBLY │
│ Segment into: │
│ - stable: system instructions, citation maps, cacheable blocks │
│ - dynamic: volatile content, real-time data │
└─────────────────────────────────────────────────────────────────┘
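The normalization layer's uniform block can be sketched as a plain dataclass. This is an illustrative sketch only: the fields match those listed in the diagram (source_type, content_hash, created_at, metadata), but everything else — defaults, hash length, the `__post_init__` behavior — is an assumption, not the library's actual schema.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class ContextBlock:
    """Uniform unit produced by the normalization layer (illustrative sketch)."""
    content: str
    source_type: str  # e.g. "pdf" | "docx" | "csv" | "code" | "markdown"
    content_hash: str = ""
    created_at: str = ""
    metadata: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self):
        # Derive a stable fingerprint and timestamp if not supplied
        if not self.content_hash:
            self.content_hash = hashlib.sha256(self.content.encode()).hexdigest()[:16]
        if not self.created_at:
            self.created_at = datetime.now(timezone.utc).isoformat()

block = ContextBlock(content="def login(user): ...", source_type="code")
```

Because the hash is derived from content alone, two blocks with identical text collapse to the same fingerprint — which is what makes deduplication and delta tracking cheap downstream.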
The Knapsack Formulation: Why This Isn’t Just “Smart Chunking”
Most RAG tools use semantic similarity: embed query, embed chunks, return top-k. This fails when:
- Your budget is 4,000 tokens and you have 50 relevant chunks of 500 tokens each
- Some chunks are high-utility but high-risk (outdated documentation)
- Some chunks are cacheable, others must be fresh
- You need diversity (don’t send 5 versions of the same information)
ContextFusion’s planner treats this as a constrained optimization problem:
# Pseudocode of the core algorithm
def select_context_blocks(candidates, budget, weights):
    """
    candidates: List[ContextBlock with multiple representation variants]
    budget: int (token limit)
    weights: dict[str, float] (utility, risk, latency, cacheability, diversity)
    """
    selected = []  # grows as blocks are chosen; the diversity bonus is scored against it
    # Generate all (block, variant) pairs with scores
    items = []
    for block in candidates:
        for variant in block.representations:
            score = (
                weights['utility'] * variant.utility_score
                - weights['risk'] * block.risk_score
                - weights['token_cost'] * variant.token_count
                - weights['latency'] * variant.latency_estimate
                + weights['cacheability'] * block.cache_score
                + weights['diversity'] * diversity_bonus(block, selected)
            )
            items.append((block.id, variant, score, variant.token_count))
    # Solve 0/1 knapsack for maximum score within budget
    selected = knapsack_01(items, budget)
    return selected
This is NP-hard, but with proper indexing and heuristics, it runs in <100ms for typical workloads.
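One common way to stay under that latency mark is a greedy approximation: rank candidates by score density (score per token) and pack in a single pass. The sketch below illustrates that general heuristic — it is not ContextFusion's actual solver, and the tuple layout mirrors the pseudocode above by assumption.

```python
def greedy_knapsack(items, budget):
    """Greedy 0/1-knapsack approximation: pick by score-per-token density.

    items: list of (block_id, variant, score, token_count) tuples
    budget: token limit
    """
    # Rank by density, best first; drop non-positive scores outright
    ranked = sorted(
        (it for it in items if it[2] > 0),
        key=lambda it: it[2] / max(it[3], 1),
        reverse=True,
    )
    selected, used = [], 0
    for block_id, variant, score, tokens in ranked:
        if used + tokens <= budget:  # take it only if it still fits
            selected.append((block_id, variant))
            used += tokens
    return selected, used

items = [("a", "summary", 9.0, 300), ("b", "full", 5.0, 900), ("c", "summary", 4.0, 100)]
chosen, used = greedy_knapsack(items, budget=500)
# "c" (density 0.04) and "a" (0.03) fit within 500 tokens; "b" (0.0056) does not
```

Greedy packing gives no optimality guarantee, but it is O(n log n) and in practice lands close to the exact solution when token counts are small relative to the budget.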
Code Example: Pipeline Integration
from context_portfolio_optimizer import PipelineRunner, Config
from context_portfolio_optimizer.providers import AnthropicAdapter

# Custom configuration
config = Config.from_yaml("""
budget:
  instructions: 1000
  retrieval: 3000
  memory: 2000
  examples: 1500
  tool_trace: 1000
  output_reserve: 1000
scoring:
  utility_weights:
    retrieval: 0.25
    trust: 0.20
    freshness: 0.15
    structure: 0.15
    diversity: 0.15
    token_cost: -0.10
provider:
  name: anthropic
  model: claude-sonnet-4-6
""")

# Initialize pipeline
runner = PipelineRunner(config=config)

# Run full pipeline
result = runner.run(
    sources=["./docs/architecture.pdf", "./src/api.py", "./data/metrics.csv"],
    query="How does the authentication flow work?",
    task_mode="qa",  # chat | qa | code | agent
    budget=4000,
    use_precomputed=True,
    compute_delta=False,  # Set True for agent loops
)

# Inspect results
print(f"Selected {result['stats']['blocks_selected']} blocks")
print(f"Total tokens: {result['stats']['total_tokens']}")
print(f"Context preview:\n{result['context'][:500]}...")

# Direct provider compilation
adapter = AnthropicAdapter(config.provider)
payload = adapter.compile_packet(
    context_blocks=result['selected_blocks'],
    task="Answer with citations",
    model="claude-sonnet-4-6",
)
# payload is ready for anthropic.messages.create(**payload)
Delta Fusion: The Secret to Efficient Agents
Standard agent implementations re-send the entire conversation history plus retrieved context on every turn. Across 10 turns at 4,000 tokens each, that's 40,000 tokens sent — most of them repeats of content the model has already seen.
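The arithmetic is easy to check. Assuming a hypothetical 4,000-token context and, say, 400 tokens of genuinely new content per turn (both numbers illustrative), the savings compound:

```python
TURNS, FULL_CONTEXT, NEW_PER_TURN = 10, 4000, 400  # illustrative numbers

# Naive agent loop: re-send the full context on every turn
naive = TURNS * FULL_CONTEXT

# Delta approach: full context once, then only the additions
delta = FULL_CONTEXT + (TURNS - 1) * NEW_PER_TURN

print(naive, delta)  # 40000 vs 7600 — an 81% reduction in this scenario
```

The longer the agent runs, the wider the gap, since the naive cost grows linearly in the full context size while the delta cost grows only in new content.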
ContextFusion’s delta tracking:
# Turn 1: Full context
turn1_result = runner.run(sources, query="Step 1...", task_mode="agent")
turn1_packet = turn1_result['context_packet']
# Turn 2: Only send what changed
turn2_result = runner.run(
sources,
query="Step 2...",
task_mode="agent",
previous_packet=turn1_packet, # Enable delta computation
compute_delta=True
)
# turn2_result['context_delta'] contains:
# {
# 'added_blocks': [new_retrieved_content],
# 'updated_blocks': [changed_blocks],
# 'removed_blocks': [no_longer_relevant],
# 'unchanged_block_ids': [ids_to_reuse_from_cache],
# 'full_context_hash': 'abc123...' # For cache validation
# }
The provider adapter assembles:
- System instructions (stable, cached)
- Citation map (stable, cached)
- New/updated blocks (dynamic, sent)
- Unchanged block references (cached, not sent)
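A minimal sketch of how such a delta can be derived from two turns' block sets, keyed by content hash. The `{block_id: content_hash}` mapping and the function name are assumptions for illustration, not the library's actual API:

```python
def compute_delta(prev_blocks, curr_blocks):
    """Diff two {block_id: content_hash} mappings into a ContextDelta-style dict."""
    prev_ids, curr_ids = set(prev_blocks), set(curr_blocks)
    common = prev_ids & curr_ids
    return {
        "added_blocks": sorted(curr_ids - prev_ids),        # new since last turn
        "removed_blocks": sorted(prev_ids - curr_ids),      # no longer relevant
        "updated_blocks": sorted(                           # same id, changed content
            i for i in common if prev_blocks[i] != curr_blocks[i]
        ),
        "unchanged_block_ids": sorted(                      # safe to reuse from cache
            i for i in common if prev_blocks[i] == curr_blocks[i]
        ),
    }

delta = compute_delta(
    prev_blocks={"doc1": "aaa", "doc2": "bbb"},
    curr_blocks={"doc2": "ccc", "doc3": "ddd"},
)
# added: ["doc3"], removed: ["doc1"], updated: ["doc2"], unchanged: []
```

Only the added and updated entries cost tokens on the next turn; unchanged ids reduce to cheap references.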
Precompute Pipeline: Latency Optimization
For production workloads, precompute expensive operations:
# One-time setup (can run offline, on CI, or scheduled)
cpo precompute ./corpus \
--store-dir .cpo_cache/precompute \
--semantic-dedup \
--generate-all-representations
# Runtime query uses precomputed artifacts
cpo compile ./corpus \
--precomputed-only \
--query "Quick question" \
--budget 2000
Precomputed artifacts:
- fingerprints.jsonl: Content hashes for deduplication
- representations/: All compact variants per block
- token_stats.json: Pre-counted tokens per variant
- retrieval_index.faiss: FAISS index for fast similarity search
- features.jsonl: Utility/risk/cacheability scores
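The fingerprint file is what makes deduplication a cheap set lookup rather than a pairwise comparison. A sketch of the idea — the actual fingerprints.jsonl schema may differ, and the field names here are assumptions:

```python
import hashlib
import json

def dedup_by_fingerprint(texts):
    """Drop exact-duplicate texts via content hashes, keeping first occurrences."""
    seen, unique, fingerprint_lines = set(), [], []
    for text in texts:
        fp = hashlib.sha256(text.encode()).hexdigest()
        if fp not in seen:  # O(1) membership test per block
            seen.add(fp)
            unique.append(text)
            # One JSONL record per unique block, ready to append to fingerprints.jsonl
            fingerprint_lines.append(json.dumps({"hash": fp, "chars": len(text)}))
    return unique, fingerprint_lines

unique, lines = dedup_by_fingerprint(["intro section", "body section", "intro section"])
# unique keeps only ["intro section", "body section"]
```

Semantic dedup (the `--semantic-dedup` flag) goes further than exact hashing by clustering near-duplicates, but the exact-hash pass is the cheap first filter.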
MCP Server Integration
Expose ContextFusion as an MCP (Model Context Protocol) server:
cpo serve-mcp --host localhost --port 8765
MCP clients can now call:
- tools/ingest: Add documents to context
- tools/compile: Optimize and compile context
- resources/context/{session_id}: Retrieve compiled packets
- tools/delta: Compute context deltas
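MCP frames tool invocations as JSON-RPC 2.0 messages, so a client calling the compile tool would send something along these lines. The tool name and argument keys (`path`, `query`, `budget`) are illustrative assumptions — the server defines the exact schema:

```python
import json

# JSON-RPC 2.0 request an MCP client would send to invoke a tool.
# "tools/call" is the standard MCP method; the arguments below are hypothetical.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "compile",
        "arguments": {"path": "./docs", "query": "Quick question", "budget": 2000},
    },
}
wire = json.dumps(request)  # serialized message, ready for the transport layer
```

Any MCP-capable client (an IDE plugin, an agent framework) can drive ContextFusion this way without importing the Python package.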
Framework Integrations
LangChain:
from context_portfolio_optimizer.integrations import ContextFusionRetriever

retriever = ContextFusionRetriever(
    sources=["./docs"],
    budget=3000,
    task_mode="qa",
)

# Use in any LangChain chain
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=chat_model,
    chain_type="stuff",
    retriever=retriever,
)
LlamaIndex:
from context_portfolio_optimizer.integrations import ContextFusionNodeParser

parser = ContextFusionNodeParser(
    budget_per_query=4000,
    precompute_dir=".cpo_cache",
)

# Use with LlamaIndex index construction
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    documents,
    node_parser=parser,
)
Development Setup
git clone https://github.com/rotsl/context-fusion.git
cd context-fusion
make bootstrap # Install dev dependencies
# Development workflow
make test # Run test suite (49 tests)
make lint # Ruff + mypy
make type-check # Strict type checking
make format # Auto-format code
# Local servers
make ui # Web UI on :8080
make serve-mcp # MCP server on :8765
# Benchmarking
make benchmark # Run full benchmark suite
Project Structure
Performance Characteristics
Extending ContextFusion
Custom representation:
from context_portfolio_optimizer.representations import Representation, register_representation

@register_representation("my_custom")
class MyCustomRepresentation(Representation):
    def generate(self, block: ContextBlock) -> str:
        # Your custom summarization logic
        return custom_summarize(block.content)

    def estimate_tokens(self, text: str) -> int:
        # Rough heuristic; cast keeps the declared int return type honest
        return int(len(text.split()) * 1.3)
Custom provider adapter:
from context_portfolio_optimizer.providers import BaseProviderAdapter, register_adapter

@register_adapter("my_provider")
class MyProviderAdapter(BaseProviderAdapter):
    def compile_packet(self, context_blocks, task, model, **kwargs):
        # Format for your custom LLM API
        return {
            "model": model,
            "messages": [
                {"role": "system", "content": self.format_system()},
                {"role": "user", "content": self.format_context(context_blocks, task)},
            ],
        }
Part 3: Common Questions
Q: How is this different from LangChain’s ContextualCompressionRetriever?
LangChain’s version compresses after retrieval using an LLM call. ContextFusion optimizes which content to retrieve and which representation to use, without requiring an LLM for compression. It’s also provider-agnostic and handles delta fusion for agents.
Q: Does this replace my vector database?
No. ContextFusion sits after retrieval. Use Pinecone, Weaviate, pgvector, or FAISS for initial retrieval — then pass candidates through ContextFusion for optimization.
Q: What about streaming responses?
ContextFusion optimizes the input context. Streaming the LLM’s output is unaffected. The optimization adds 50–200ms overhead, which is usually offset by reduced LLM latency from shorter prompts.
Q: Can I use this with local models?
Yes. The Ollama adapter works with any OpenAI-compatible local server. Budget planning and compression are even more valuable with slower local hardware.
Q: How do I debug suboptimal context selection?
Run cpo ui and inspect the "Selected Blocks" panel. Each block shows its utility score, risk score, token count, and why it was included/excluded. Run cpo ablate ./data to see which blocks contribute most to answer quality.
Final Thoughts
ContextFusion isn’t just another RAG tool. It’s a bet that context optimization — treating token budgets as scarce resources to be allocated intelligently — will become as essential as retrieval itself.
For normal users: Install it, run it, pay less.
For developers: Extend it, integrate it, build smarter systems.
Fuse less context. Keep more signal. Ship faster answers.
⭐ Star the repo, file issues, submit PRs. ContextFusion is Apache-2.0 and built for production.





