DEV Community

Aman Sachan

Qwen sky proof: compressed memory made a tiny model behave better — with the receipts

This was a before/after run on a tiny model with a very ordinary goal: keep the answer useful when the wording changes.

The setup used Qwen2.5-0.5B-Instruct with a memory layer around it.

The measured result

From the proof pack:

  • Before latency: 10,061.7 ms
  • After latency: 4,652.6 ms
  • Before tokens: 35
  • After tokens: 97
  • Tokens saved: -177.1% (the after run spent more tokens, not fewer)
  • Latency delta: -5,409.1 ms (the after run was faster)
  • Peak RSS: 1,794 MB
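Those two percentage/delta lines follow directly from the raw numbers. A quick sanity check of the arithmetic:

```python
# Recompute the deltas from the proof-pack numbers above.
before_ms, after_ms = 10_061.7, 4_652.6
before_tok, after_tok = 35, 97

# Negative delta = the after run was faster.
latency_delta_ms = after_ms - before_ms

# Negative "saved" = the after run actually used MORE tokens.
tokens_saved_pct = (before_tok - after_tok) / before_tok * 100

print(f"Latency delta: {latency_delta_ms:+.1f} ms")   # -5409.1 ms
print(f"Tokens saved: {tokens_saved_pct:.1f}%")       # -177.1%
```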

That is a nice reminder that “smaller prompt” is not always the same thing as “better answer”. Sometimes the smarter move is to give the model the right memory, even if it costs a few more tokens.

What the demo showed

The before run sent the raw prompt with no memory. The after run used a compressed memory summary that kept the useful facts and dropped the filler.

That is the point of this kind of system: stay useful when the wording changes.

Proof pack

Side-by-side proof

Terminal capture

Links and artefacts
