RoTSL
Your LLM prompts are probably wasting 90% of tokens. Here’s how I fixed mine.

[Image: Tokens in LLM]
I keep running into the same problem with LLM apps.

This builds on my previous dev.to article: https://dev.to/rotsl/contextfusion-the-context-brain-your-llm-apps-are-missing-2gkm

You build a retrieval pipeline, hook it up to an API, and then quietly ship prompts that are full of stuff the model doesn’t need. Extra chunks. Duplicates. Half-relevant context that just bloats everything.

And you pay for all of it.

CFAdv is basically an attempt to stop doing that.

It builds on context-fusion, but adds something that turns out to matter more than I expected: even if you pick the right context, you can still mess it up by putting it in the wrong place.


Most pipelines are still doing this

Let’s be honest about the default pattern:

chunks = retriever.top_k(query, k=5)
prompt = "\n\n".join(chunks)
response = llm(prompt)


That’s it.

No budget. No filtering beyond retrieval. No thought about ordering.

More context is assumed to be better. It often isn’t.


CFAdv splits the problem in two

Instead of one “context step”, it does two separate things:
1. Decide what gets in
2. Decide where it goes

That separation is the whole point.


Step 1: selecting context under a budget

Instead of top-k, CFAdv treats selection like an optimization problem.

Each chunk gets a score based on things like:
• relevance
• trust
• freshness
• diversity
• token cost

Then it tries to pick the best combination under a fixed token budget.

At a high level:

def value(chunk):
    utility = (
        0.25 * chunk.relevance +
        0.20 * chunk.trust +
        0.15 * chunk.freshness +
        0.15 * chunk.structure +
        0.15 * chunk.diversity
    )
    risk = (
        0.40 * chunk.hallucination +
        0.35 * chunk.staleness +
        0.25 * chunk.privacy
    )
    return utility - risk

Then rank by value density:

density = value(chunk) / max(chunk.tokens, 1)


And greedily pack until you hit the budget.
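The article doesn't show the packing loop itself, so here's a minimal sketch of what greedy packing by value density might look like. The Chunk dataclass and its precomputed value field are assumptions for the example, standing in for whatever CFAdv uses internally:

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    value: float   # assumed precomputed by something like value() above
    tokens: int

def greedy_pack(chunks, budget):
    # rank by value density, then take chunks while they still fit the budget
    ranked = sorted(chunks, key=lambda c: c.value / max(c.tokens, 1), reverse=True)
    selected, used = [], 0
    for c in ranked:
        if used + c.tokens <= budget:
            selected.append(c)
            used += c.tokens
    return selected

chunks = [
    Chunk("dense, on-topic", value=0.9, tokens=30),
    Chunk("long and meh", value=0.5, tokens=90),
    Chunk("short and decent", value=0.4, tokens=20),
]
picked = greedy_pack(chunks, budget=60)
```

Note what happens here: the long chunk has a higher raw value than the short one, but its density is worse, so under a tight budget the short chunk wins.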


The small trick that makes a big difference

There’s a simple filter before any of that:

floor = max_score * 0.15
selected = [c for c in candidates if c.score >= floor]


Anything below 15% of the best chunk just gets dropped.

That sounds minor, but it changes behavior a lot.
• If your data is clean, everything stays
• If it’s noisy, most of it disappears

So you don’t fill your prompt with mediocre content just because you have space.
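To see those two behaviors concretely, here's the floor filter run on a clean score distribution and a noisy one (floor_filter is just the two lines above wrapped in a function, operating on raw scores for simplicity):

```python
def floor_filter(scores, ratio=0.15):
    # drop anything scoring below `ratio` of the best candidate
    floor = max(scores) * ratio
    return [s for s in scores if s >= floor]

clean = floor_filter([0.90, 0.85, 0.80])   # tight cluster: everything survives
noisy = floor_filter([0.90, 0.10, 0.05])   # long tail: only the best survives
```

Same rule, opposite outcomes, driven entirely by the shape of the data.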


Step 2: ordering for attention

This is the part I underestimated.

Even if you pick the right chunks, models don’t treat all positions equally. Stuff at the start tends to get more attention than stuff buried in the middle.

So CFAdv reorders the selected chunks based on similarity to the query.

Basic version:

import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(embed(query), embed(chunk)) for chunk in chunks]
weights = softmax(scores)

# sort by weight alone; bare (weight, chunk) tuples would try to compare
# the chunks themselves whenever two weights tie
ordered = [chunk for _, chunk in sorted(
    zip(weights, chunks),
    key=lambda pair: pair[0],
    reverse=True
)]


Higher weight goes earlier in the prompt.
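The snippet above leans on softmax without defining it; a minimal, numerically stable version looks like this:

```python
import numpy as np

def softmax(x):
    # shift by the max before exponentiating to avoid overflow
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

w = softmax([2.0, 1.0, 0.1])
```

Since softmax is monotone, ranking by the weights gives the same order as ranking by the raw scores; the normalized weights only matter if you later want to use them as weights rather than just a sort key.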


No embeddings API required

Instead of calling an external model, it uses a simple hashed bag-of-words vector.

import hashlib
import numpy as np
import re

def embed(text, dim=64):
    vec = np.zeros(dim)
    tokens = re.findall(r"\b\w+\b", text.lower())

    for t in tokens:
        h = int(hashlib.sha256(t.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
        vec[(h >> 16) % dim] += 0.5

    return vec / (np.linalg.norm(vec) + 1e-8)


It’s not fancy. No positional info, no learned weights. But for short chunks it works surprisingly well.
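A quick sanity check on that claim: pair the embed function with a cosine on a toy query and two candidate chunks (the example strings are mine, not from CFAdv). Shared surface tokens dominate the similarity, which is exactly what you want for this kind of cheap ordering:

```python
import hashlib
import re
import numpy as np

def embed(text, dim=64):
    vec = np.zeros(dim)
    for t in re.findall(r"\b\w+\b", text.lower()):
        h = int(hashlib.sha256(t.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
        vec[(h >> 16) % dim] += 0.5
    return vec / (np.linalg.norm(vec) + 1e-8)

def cosine(a, b):
    # vectors come out of embed() unit-normalized, so the dot product suffices
    return float(a @ b)

query = "reset my password"
sim_on = cosine(embed(query), embed("how to reset a forgotten password"))
sim_off = cosine(embed(query), embed("quarterly revenue grew five percent"))
self_sim = cosine(embed(query), embed(query))
```

The on-topic chunk shares two exact tokens with the query and scores well above the off-topic one; a paraphrase with zero token overlap would fool it, which is the trade-off you accept for skipping the embeddings API.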


Two levels of ordering

There’s also a second layer.

Instead of treating everything as one list, CFAdv groups context into blocks:
• system
• history
• retrieval
• tools

Then it does:
1. sort chunks inside each block
2. sort the blocks themselves

Sketch:

# intra-block
for block in blocks:
    block.chunks.sort(key=lambda c: similarity(query, c), reverse=True)

# cross-block
block_scores = {
    block: similarity(query, mean_embed(block.chunks))
    for block in blocks
}

ordered_blocks = sorted(blocks, key=lambda b: block_scores[b], reverse=True)


So you end up shaping the whole prompt, not just shuffling pieces.
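The sketch leaves mean_embed undefined; a plausible version averages the chunk vectors and re-normalizes so the cosine against the query stays comparable across blocks. Here it is on toy 2-d vectors standing in for real embeddings (the vectors and the dot-product scoring are illustrative assumptions):

```python
import numpy as np

def mean_embed(vectors):
    # average the chunk vectors, then re-normalize so cosine stays comparable
    m = np.mean(vectors, axis=0)
    return m / (np.linalg.norm(m) + 1e-8)

# toy 2-d "embeddings": one block leans toward the query, one points away
query_vec = np.array([1.0, 0.0])
block_a = [np.array([0.9, 0.1]), np.array([0.8, 0.2])]
block_b = [np.array([0.1, 0.9]), np.array([0.0, 1.0])]

score_a = float(query_vec @ mean_embed(block_a))
score_b = float(query_vec @ mean_embed(block_b))
```

Averaging gives you a single centroid per block, so the cross-block sort stays cheap no matter how many chunks each block holds.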


The full pipeline

CFAdv is an 8-stage pipeline, but it’s easier to think of it like this:

docs = ingest(files)
blocks = normalize(docs)
variants = represent(blocks)

candidates = retrieve(query, variants)
selected = plan(candidates, budget=120)

ordered = attention_fuse(query, selected)
packet = assemble(ordered)

prompt = compile(packet, mode="qa")


Each step is stateless. That makes it easier to test and reason about.


What happens in practice

You can cut most of the prompt without losing the answer, as long as:
• retrieval pulls in some noise
• there is redundancy
• the query only needs a subset of the data

If everything is relevant, the system mostly leaves it alone.

If only one chunk survives selection, ordering doesn’t matter.


Where this actually helps

This kind of pipeline shines when:
• your retrieval step is messy
• you’re concatenating multiple documents
• prompts are long enough for attention effects to matter

If you already have clean, minimal context, you won’t see much change.


The part that stuck with me

This isn’t really about attention or embeddings.

It’s about treating prompt assembly as something worth optimizing.

Right now most systems act like prompts are just containers. You throw things in and hope the model figures it out.

CFAdv flips that.

It asks a simple question: what is the smallest amount of context that still works?

Then it enforces it.

And once you start thinking that way, it’s hard to go back to dumping chunks into a string and calling it a day.


Try it yourself

If you want to see how this works in practice or plug it into your own workflow:

  • GitHub repo
    Contains the full Python library, CLI, benchmarks, and tests. You can run it locally, inspect the pipeline stages, or integrate it into your own RAG setup.

  • Live demo
    Lets you compare raw prompts vs CFAdv-compiled prompts side by side. Useful for quickly seeing how much context gets removed and how ordering changes.

If you’re already using retrieval + concatenation, the repo is the easiest place to start. Swap your prompt assembly step with CFAdv’s planner + fusion stages and see what drops out.
