DEV Community

Aman Sachan

KVQuant / BitForge: same model, smarter context, better answer

Most AI workflow posts are just a screenshot of a chat box and a hopeful caption.

This one is different: I ran the same local model twice on the same question, once with a raw prompt and once with a memory + retrieval stack around it.

What changed

Before:

  • raw prompt
  • no compression
  • no semantic retrieval
  • more clutter in context

After:

  • compressed working context
  • semantic retrieval from memory notes
  • fewer prompt tokens
  • same model, same task, less nonsense
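
The "semantic retrieval from memory notes" step can be sketched in a few lines. This is a minimal stand-in, not the actual stack: it ranks notes against the query with bag-of-words cosine similarity, where a real setup would use embeddings, and the note texts and function names are illustrative.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, notes: list[str], k: int = 2) -> list[str]:
    # Rank memory notes by similarity to the query, keep the top k.
    q = Counter(query.lower().split())
    ranked = sorted(
        notes,
        key=lambda n: cosine(q, Counter(n.lower().split())),
        reverse=True,
    )
    return ranked[:k]

notes = [
    "KV cache compression keeps attention memory bounded",
    "the build pipeline uses cargo workspaces",
    "prompt shaping: put the question last, context first",
]
print(retrieve("how does kv cache compression affect memory", notes, k=1))
```

Only the retrieved notes go into the working context, which is how the "after" run trades a bigger-but-relevant prompt for a cluttered one.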

The measured result

From the proof pack:

  • Before latency: 28,590.3 ms
  • After latency: 25,008.9 ms
  • Before accuracy: 0.500
  • After accuracy: 1.000
  • Before prompt tokens: 87
  • After prompt tokens: 108
  • Memory saved: -24.1%

That last line is the fun one: "memory saved" is negative because the β€œafter” run used more prompt tokens (87 β†’ 108), a tradeoff I made deliberately to answer the question better. Token count is a tool, not a religion.
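
A before/after comparison only stays honest if both runs go through the same harness. Here is a minimal sketch of that idea; the `measure` helper, the substring accuracy check, and the whitespace token count are all assumptions for illustration, not the actual proof-pack tooling.

```python
import time
from dataclasses import dataclass

@dataclass
class RunResult:
    latency_ms: float
    accuracy: float
    prompt_tokens: int

def measure(run_fn, prompt: str, expected: str) -> RunResult:
    # Time a single model call and score the answer against an expected string.
    start = time.perf_counter()
    answer = run_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    accuracy = 1.0 if expected.lower() in answer.lower() else 0.0
    # Whitespace split is a crude proxy for real tokenizer counts.
    return RunResult(latency_ms, accuracy, prompt_tokens=len(prompt.split()))

def fake_model(prompt: str) -> str:
    # Stand-in for the local model call; the real runs hit the same model twice.
    return "Paris"

before = measure(fake_model, "Capital of France?", "paris")
after = measure(fake_model, "Context: France. Question: capital city?", "paris")
print(f"delta tokens: {after.prompt_tokens - before.prompt_tokens:+d}")
```

The point is structural: same model function, same scoring, only the prompt construction changes between runs.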

Why this matters

The model did not become magical. The workflow got smarter.

That is the whole game with KV cache compression and prompt shaping: make the task clearer, measure the result, and keep the same model honest across versions.
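
The "compressed working context" half of that can be as simple as packing notes into a fixed token budget. A hedged sketch, again using whitespace tokens as a proxy for real tokenizer counts; the function name and budget value are illustrative, not from the actual pipeline.

```python
def compress_context(notes: list[str], budget: int) -> str:
    # Greedily pack notes, in priority order, into a token budget.
    kept, used = [], 0
    for note in notes:
        cost = len(note.split())
        if used + cost <= budget:
            kept.append(note)
            used += cost
    return "\n".join(kept)

working_set = [
    "alpha beta gamma",
    "one two three four",
    "tail end",
]
print(compress_context(working_set, budget=6))
```

Anything that doesn't fit the budget is simply dropped, which is where "fewer prompt tokens" and "less clutter" come from in the before/after lists.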

Proof pack

  • Before/after view
  • Scores panel
  • Terminal transcript
