Most AI workflow posts are just a screenshot of a chat box and a hopeful caption.
This one is different: I ran the same local model twice on the same question, once with a raw prompt and once with a memory + retrieval stack around it.
What changed
Before:
- raw prompt
- no compression
- no semantic retrieval
- more clutter in context
After:
- compressed working context
- semantic retrieval from memory notes (sketched in the code below)
- fewer prompt tokens
- same model, same task, less nonsense
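If you want a feel for what the "after" path does, here is a minimal sketch. It is not the actual llm-foundry code: the memory notes, the bag-of-words scoring (a cheap stand-in for a real embedding model), and the prompt template are all made up for illustration.

```python
# Minimal sketch of the "after" path: retrieve a few relevant memory notes,
# keep the working context small, and prepend only what the question needs.
from collections import Counter
import math

memory_notes = [
    "KV cache compression trades a little precision for a smaller footprint.",
    "The eval question asks about latency versus accuracy trade-offs.",
    "Unrelated note: grocery list and weekend plans.",
]

def bow(text: str) -> Counter:
    """Bag-of-words vector; a cheap stand-in for semantic embeddings."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, notes: list[str], k: int = 2) -> list[str]:
    """Return the k notes most similar to the question."""
    q = bow(question)
    return sorted(notes, key=lambda n: cosine(q, bow(n)), reverse=True)[:k]

def build_prompt(question: str, notes: list[str]) -> str:
    """Compressed working context: only the retrieved notes plus the question."""
    context = "\n".join(f"- {n}" for n in notes)
    return f"Relevant notes:\n{context}\n\nQuestion: {question}\nAnswer concisely."

question = "How does KV cache compression affect latency and accuracy?"
print(build_prompt(question, retrieve(question, memory_notes)))
```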
The measured result
From the proof pack:
- Before latency: 28,590.3 ms
- After latency: 25,008.9 ms
- Before accuracy: 0.500
- After accuracy: 1.000
- Before prompt tokens: 87
- After prompt tokens: 108
- Memory saved: -24.1%
That last line is the fun one: the "after" run used more prompt tokens here, because I tuned it to answer the question better. Token count is a tool, not a religion.
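If you want to sanity-check the arithmetic, the deltas fall straight out of the numbers above. The only assumption is that "memory saved" in the proof pack is computed as prompt-token savings relative to the before run.

```python
# Recompute the deltas from the proof-pack numbers above.
before = {"latency_ms": 28590.3, "accuracy": 0.500, "prompt_tokens": 87}
after  = {"latency_ms": 25008.9, "accuracy": 1.000, "prompt_tokens": 108}

latency_change = (after["latency_ms"] - before["latency_ms"]) / before["latency_ms"]
token_savings  = (before["prompt_tokens"] - after["prompt_tokens"]) / before["prompt_tokens"]

print(f"Latency change: {latency_change:+.1%}")  # -12.5% (faster)
print(f"Token savings:  {token_savings:+.1%}")   # -24.1% (more tokens, on purpose)
print(f"Accuracy:       {before['accuracy']} -> {after['accuracy']}")
```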
Why this matters
The model did not become magical. The workflow got smarter.
That is the whole game with KV-cache compression and prompt-shaping work: make the task clearer, measure the result, and keep the same model honest across versions.
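Keeping the model honest mostly means wrapping both prompt variants in the same tiny harness and letting the numbers speak. Here is the shape of that harness; `call_model` is a placeholder for whatever local runtime you use (llama.cpp, Ollama, etc.), and the keyword check is deliberately crude.

```python
import time

def call_model(prompt: str) -> str:
    # Placeholder: plug in your local model call here.
    raise NotImplementedError

def run(prompt: str, expected_keyword: str) -> dict:
    """Run one prompt variant and record latency plus a crude correctness check."""
    start = time.perf_counter()
    answer = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    return {
        "latency_ms": round(latency_ms, 1),
        "correct": expected_keyword.lower() in answer.lower(),
        "prompt_tokens_est": len(prompt.split()),  # rough stand-in for a real tokenizer
    }

# Same question, two prompt variants, same model underneath:
# results = {name: run(p, "kv cache")
#            for name, p in {"before": raw_prompt, "after": shaped_prompt}.items()}
```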
Proof pack & links
- GitHub: https://github.com/AmSach/llm-foundry
- Proof pack: https://zo.pub/man42/kvquant-bitforge-real-prompt-proof
- GitHub profile: https://github.com/AmSach
- Instagram: https://www.instagram.com/i.amsach
- LinkedIn: https://www.linkedin.com/in/theamansachan