Most AI workflow posts are just a screenshot of a chat box and a hopeful caption.
This one is different: I ran the same local model twice on the same question, once with a raw prompt and once with a memory + retrieval stack around it.
What changed
Before:
- raw prompt
- no compression
- no semantic retrieval
- more clutter in context
After:
- compressed working context
- semantic retrieval from memory notes (sketched in the code below)
- fewer prompt tokens
- same model, same task, less nonsense
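If you want a feel for what the "after" path does, here is a minimal sketch. It is not the actual llm-foundry code: the memory notes, the bag-of-words scoring (a cheap stand-in for a real embedding model), and the prompt template are all made up for illustration.

```python
# Minimal sketch of the "after" path: retrieve a few relevant memory notes,
# keep the working context small, and prepend only what the question needs.
from collections import Counter
import math

memory_notes = [
    "KV cache compression trades a little precision for a smaller footprint.",
    "The eval question asks about latency versus accuracy trade-offs.",
    "Unrelated note: grocery list and weekend plans.",
]

def bow(text: str) -> Counter:
    """Bag-of-words vector; a cheap stand-in for semantic embeddings."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, notes: list[str], k: int = 2) -> list[str]:
    """Return the k notes most similar to the question."""
    q = bow(question)
    return sorted(notes, key=lambda n: cosine(q, bow(n)), reverse=True)[:k]

def build_prompt(question: str, notes: list[str]) -> str:
    """Compressed working context: only the retrieved notes plus the question."""
    context = "\n".join(f"- {n}" for n in notes)
    return f"Relevant notes:\n{context}\n\nQuestion: {question}\nAnswer concisely."

question = "How does KV cache compression affect latency and accuracy?"
print(build_prompt(question, retrieve(question, memory_notes)))
```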
The measured result
From the proof pack:
- Before latency: 28,590.3 ms
- After latency: 25,008.9 ms
- Before accuracy: 0.500
- After accuracy: 1.000
- Before prompt tokens: 87
- After prompt tokens: 108
- Memory saved: -24.1%
That last line is the fun one: the "after" run used more prompt tokens here, because I tuned it to answer the question better. Token count is a tool, not a religion.
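If you want to sanity-check the arithmetic, the deltas fall straight out of the numbers above. The only assumption is that "memory saved" in the proof pack is computed as prompt-token savings relative to the before run.

```python
# Recompute the deltas from the proof-pack numbers above.
before = {"latency_ms": 28590.3, "accuracy": 0.500, "prompt_tokens": 87}
after  = {"latency_ms": 25008.9, "accuracy": 1.000, "prompt_tokens": 108}

latency_change = (after["latency_ms"] - before["latency_ms"]) / before["latency_ms"]
token_savings  = (before["prompt_tokens"] - after["prompt_tokens"]) / before["prompt_tokens"]

print(f"Latency change: {latency_change:+.1%}")  # -12.5% (faster)
print(f"Token savings:  {token_savings:+.1%}")   # -24.1% (more tokens, on purpose)
print(f"Accuracy:       {before['accuracy']} -> {after['accuracy']}")
```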
Why this matters
The model did not become magical. The workflow got smarter.
That is the whole game with KV-cache compression and prompt-shaping work: make the task clearer, measure the result, and keep the same model honest across versions.
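Keeping the model honest mostly means wrapping both prompt variants in the same tiny harness and letting the numbers speak. Here is the shape of that harness; `call_model` is a placeholder for whatever local runtime you use (llama.cpp, Ollama, etc.), and the keyword check is deliberately crude.

```python
import time

def call_model(prompt: str) -> str:
    # Placeholder: plug in your local model call here.
    raise NotImplementedError

def run(prompt: str, expected_keyword: str) -> dict:
    """Run one prompt variant and record latency plus a crude correctness check."""
    start = time.perf_counter()
    answer = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "latency_ms": round(latency_ms, 1),
        "correct": expected_keyword.lower() in answer.lower(),
        "prompt_tokens_est": len(prompt.split()),  # rough stand-in for a real tokenizer
    }

# Same question, two prompt variants, same model underneath:
# results = {name: run(p, "kv cache")
#            for name, p in {"before": raw_prompt, "after": shaped_prompt}.items()}
```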
Proof pack & links
- GitHub: https://github.com/AmSach/llm-foundry
- Proof pack: https://zo.pub/man42/kvquant-bitforge-real-prompt-proof
- GitHub profile: https://github.com/AmSach
- Instagram: https://www.instagram.com/i.amsach
- LinkedIn: https://www.linkedin.com/in/theamansachan