DEV Community

Aman Sachan

Qwen sky proof: compressed memory made a tiny model behave better — with the receipts

This was a before/after run on a tiny model with a very ordinary goal: keep the answer useful when the wording changes.

The setup used Qwen2.5-0.5B-Instruct with a memory layer around it.

The measured result

From the proof pack:

  • Before latency: 10,061.7 ms
  • After latency: 4,652.6 ms
  • Before tokens: 35
  • After tokens: 97
  • Tokens saved: -177.1% (the after run spent more tokens, not fewer)
  • Latency delta: -5,409.1 ms (the after run was faster)
  • Peak RSS: 1,794 MB
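Those two percentage/delta lines follow directly from the raw numbers. A quick sanity check of the arithmetic:

```python
# Recompute the deltas from the proof-pack numbers above.
before_ms, after_ms = 10_061.7, 4_652.6
before_tok, after_tok = 35, 97

# Negative delta = the after run was faster.
latency_delta_ms = after_ms - before_ms

# Negative "saved" = the after run actually used MORE tokens.
tokens_saved_pct = (before_tok - after_tok) / before_tok * 100

print(f"Latency delta: {latency_delta_ms:+.1f} ms")   # -5409.1 ms
print(f"Tokens saved: {tokens_saved_pct:.1f}%")       # -177.1%
```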

That is a nice reminder that “smaller prompt” is not always the same thing as “better answer”. Sometimes the smarter move is to give the model the right memory, even if it costs a few more tokens.

What the demo showed

The before run sent the raw prompt with no memory. The after run used a compressed memory summary that kept the useful facts and dropped the filler.

That is the point of this kind of system: stay useful when the wording changes.

Proof pack

Side-by-side proof

Terminal capture

Links and artefacts
