
I keep running into the same problem with LLM apps.
This builds on my earlier dev.to article: https://dev.to/rotsl/contextfusion-the-context-brain-your-llm-apps-are-missing-2gkm
You build a retrieval pipeline, hook it up to an API, and then quietly ship prompts that are full of stuff the model doesn’t need. Extra chunks. Duplicates. Half-relevant context that just bloats everything.
And you pay for all of it.
CFAdv is basically an attempt to stop doing that.
It builds on context-fusion, but adds something that turns out to matter more than I expected: even if you pick the right context, you can still mess it up by putting it in the wrong place.
Most pipelines are still doing this
Let’s be honest about the default pattern:
```python
chunks = retriever.top_k(query, k=5)
prompt = "\n\n".join(chunks)
response = llm(prompt)
```
That’s it.
No budget. No filtering beyond retrieval. No thought about ordering.
More context is assumed to be better. It often isn’t.
CFAdv splits the problem in two
Instead of one “context step”, it does two separate things:
1. Decide what gets in
2. Decide where it goes
That separation is the whole point.
Step 1: selecting context under a budget
Instead of top-k, CFAdv treats selection like an optimization problem.
Each chunk gets a score based on things like:
• relevance
• trust
• freshness
• diversity
• token cost
Then it tries to pick the best combination under a fixed token budget.
At a high level:
```python
def value(chunk):
    utility = (
        0.25 * chunk.relevance +
        0.20 * chunk.trust +
        0.15 * chunk.freshness +
        0.15 * chunk.structure +
        0.15 * chunk.diversity
    )
    risk = (
        0.40 * chunk.hallucination +
        0.35 * chunk.staleness +
        0.25 * chunk.privacy
    )
    return utility - risk
```
Then rank by value density:
```python
density = value(chunk) / max(chunk.tokens, 1)
```
And greedily pack until you hit the budget.
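Put together, selection is a greedy knapsack pass over density-ranked chunks. A minimal sketch, with a hypothetical `Chunk` type and a single `score` field standing in for the `value(chunk)` computation above:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float   # stand-in for the combined utility - risk value
    tokens: int

def greedy_pack(chunks, budget):
    # Rank by value density (score per token), highest first,
    # then take chunks while they still fit in the token budget.
    ranked = sorted(chunks, key=lambda c: c.score / max(c.tokens, 1), reverse=True)
    selected, used = [], 0
    for c in ranked:
        if used + c.tokens <= budget:
            selected.append(c)
            used += c.tokens
    return selected

chunks = [
    Chunk("a", score=0.9, tokens=50),
    Chunk("b", score=0.8, tokens=20),
    Chunk("c", score=0.3, tokens=60),
]
picked = greedy_pack(chunks, budget=80)
```

Here "b" has the highest density despite a lower score, so it packs first; "c" gets skipped because it would blow the budget. Greedy packing isn't optimal for the knapsack problem in general, but it's cheap and close enough for this use.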
The small trick that makes a big difference
There’s a simple filter before any of that:
```python
floor = max_score * 0.15
selected = [c for c in candidates if c.score >= floor]
```
Anything below 15% of the best chunk just gets dropped.
That sounds minor, but it changes behavior a lot.
• If your data is clean, everything stays
• If it’s noisy, most of it disappears
So you don’t fill your prompt with mediocre content just because you have space.
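To see that behavior concretely, here is a toy version of the floor with made-up scores for a clean and a noisy candidate set:

```python
def relative_floor(candidates, ratio=0.15):
    # candidates: list of (name, score); keep anything within
    # `ratio` of the best score, drop the rest
    max_score = max(s for _, s in candidates)
    floor = max_score * ratio
    return [(n, s) for n, s in candidates if s >= floor]

clean = [("a", 0.9), ("b", 0.8), ("c", 0.7)]   # all close to the best
noisy = [("a", 0.9), ("b", 0.05), ("c", 0.02)] # one clear winner

kept_clean = relative_floor(clean)  # keeps all three
kept_noisy = relative_floor(noisy)  # keeps only "a"
```

Because the floor is relative to the best score rather than an absolute threshold, it adapts to each query's score distribution.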
Step 2: ordering for attention
This is the part I underestimated.
Even if you pick the right chunks, models don’t treat all positions equally. Stuff at the start tends to get more attention than stuff buried in the middle.
So CFAdv reorders the selected chunks based on similarity to the query.
Basic version:
```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(embed(query), embed(chunk)) for chunk in chunks]
weights = softmax(scores)
# sort on the weight alone; comparing chunk objects on ties would raise
ordered = [chunk for _, chunk in sorted(
    zip(weights, chunks),
    key=lambda pair: pair[0],
    reverse=True
)]
```
Higher weight goes earlier in the prompt.
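The snippet leaves `softmax` undefined; a minimal numerically stable version (my sketch, not necessarily what the library ships) looks like this:

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())  # shift by the max for numerical stability
    return e / e.sum()

weights = softmax([0.2, 0.5, 0.9])  # sums to 1, preserves the ordering of the inputs
```

Worth noting: softmax is monotonic, so sorting by `weights` gives exactly the same order as sorting by the raw cosine scores. The weights only matter if you use them for more than ordering, e.g. to allocate tokens proportionally.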
No embeddings API required
Instead of calling an external model, it uses a simple hashed bag-of-words vector.
```python
import hashlib
import re

import numpy as np

def embed(text, dim=64):
    vec = np.zeros(dim)
    tokens = re.findall(r"\b\w+\b", text.lower())
    for t in tokens:
        h = int(hashlib.sha256(t.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
        vec[(h >> 16) % dim] += 0.5
    return vec / (np.linalg.norm(vec) + 1e-8)
```
It’s not fancy. No positional info, no learned weights. But for short chunks it works surprisingly well.
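A quick sanity check, reusing the same hashed embedding. The example phrases are mine; exact similarity values depend on hash collisions in the 64 dimensions, so only the relative ordering is meaningful:

```python
import hashlib
import re

import numpy as np

def embed(text, dim=64):
    # same hashed bag-of-words embedding as above
    vec = np.zeros(dim)
    for t in re.findall(r"\b\w+\b", text.lower()):
        h = int(hashlib.sha256(t.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
        vec[(h >> 16) % dim] += 0.5
    return vec / (np.linalg.norm(vec) + 1e-8)

v1 = embed("token budgets for llm prompts")
v2 = embed("llm prompt token budgets")
v3 = embed("completely unrelated text about gardening")

sim_same = float(v1 @ v2)  # shares "token", "budgets", "llm"
sim_diff = float(v1 @ v3)  # no shared tokens, only hash collisions
```

Related phrasings score clearly higher than unrelated text, which is all the reordering step needs.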
Two levels of ordering
There’s also a second layer.
Instead of treating everything as one list, CFAdv groups context into blocks:
• system
• history
• retrieval
• tools
Then it does:
1. sort chunks inside each block
2. sort the blocks themselves
Sketch:
```python
# intra-block
for block in blocks:
    block.chunks.sort(key=lambda c: similarity(query, c), reverse=True)

# cross-block
block_scores = {
    block: similarity(query, mean_embed(block.chunks))
    for block in blocks
}
ordered_blocks = sorted(blocks, key=lambda b: block_scores[b], reverse=True)
```
So you end up shaping the whole prompt, not just shuffling pieces.
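Here's a runnable toy version of that two-level sort. `Block`, the sample data, and the Jaccard `similarity` are all stand-ins I made up; it also ranks blocks by their best chunk rather than the mean embedding in the sketch above:

```python
def similarity(query, text):
    # toy stand-in: Jaccard overlap of lowercase tokens
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q | t), 1)

class Block:
    def __init__(self, name, chunks):
        self.name = name
        self.chunks = chunks  # list of chunk strings

def order_blocks(query, blocks):
    # intra-block: most similar chunk first
    for block in blocks:
        block.chunks.sort(key=lambda c: similarity(query, c), reverse=True)
    # cross-block: rank each block by its best chunk
    return sorted(
        blocks,
        key=lambda b: max((similarity(query, c) for c in b.chunks), default=0.0),
        reverse=True,
    )

blocks = [
    Block("history", ["user asked about invoices last week"]),
    Block("retrieval", ["billing faq", "how to reset your password"]),
]
ordered = order_blocks("reset password", blocks)
```

For the query "reset password", the retrieval block wins the cross-block sort, and the password chunk moves ahead of the billing one inside it.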
The full pipeline
CFAdv is an 8-stage pipeline, but it’s easier to think of it like this:
```python
docs = ingest(files)
blocks = normalize(docs)
variants = represent(blocks)
candidates = retrieve(query, variants)
selected = plan(candidates, budget=120)
ordered = attention_fuse(query, selected)
packet = assemble(ordered)
prompt = compile(packet, mode="qa")
```
Each step is stateless. That makes it easier to test and reason about.
What happens in practice
You can cut most of the prompt without losing the answer, as long as:
• retrieval pulls in some noise
• there is redundancy
• the query only needs a subset of the data
If everything is relevant, the system mostly leaves it alone.
If only one chunk survives selection, ordering doesn’t matter.
Where this actually helps
This kind of pipeline shines when:
• your retrieval step is messy
• you’re concatenating multiple documents
• prompts are long enough for attention effects to matter
If you already have clean, minimal context, you won’t see much change.
The part that stuck with me
This isn’t really about attention or embeddings.
It’s about treating prompt assembly as something worth optimizing.
Right now most systems act like prompts are just containers. You throw things in and hope the model figures it out.
CFAdv flips that.
It asks a simple question: what is the smallest amount of context that still works?
Then it enforces it.
And once you start thinking that way, it’s hard to go back to dumping chunks into a string and calling it a day.
Try it yourself
If you want to see how this works in practice or plug it into your own workflow:
GitHub repo
Contains the full Python library, CLI, benchmarks, and tests. You can run it locally, inspect the pipeline stages, or integrate it into your own RAG setup.
Live demo
Lets you compare raw prompts vs CFAdv-compiled prompts side by side. Useful for quickly seeing how much context gets removed and how ordering changes.
If you’re already using retrieval + concatenation, the repo is the easiest place to start. Swap your prompt assembly step with CFAdv’s planner + fusion stages and see what drops out.