Samsung's Trick Makes a Tiny 4B Agent Nearly Match a Model 18 Times Bigger

#samsung #ondevice #distillation #agents

Samsung R&D UK and Queen Mary University of London have published a method, called DuoMem, that lets a small 4-billion-parameter AI agent nearly match the task-completion ability of a model 18 times its size. The paper, posted to arXiv on June 27, 2026, reports that the small model's success rate on a standard household-task benchmark jumped from 4.3 percent to 77.9 percent after applying the technique, closing most of the distance to the 87.1 percent scored by a 72-billion-parameter teacher model.

Key facts

Task success on ALFWorld rose from 4.3% to 77.9% for a 4B model, versus 87.1% for the 72B teacher (arXiv 2606.29961)
Published June 27, 2026, by Samsung R&D UK and Queen Mary University of London
The 4B model finishes tasks more than 3x faster in wall-clock time than the 72B teacher
Method combines two channels at once: injected "procedural memory" plus lightweight LoRA fine-tuning (under 10 million extra trainable values)
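
The headline numbers can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above; the variable names are ours, not the paper's.

```python
# Figures as quoted in this article (ALFWorld task success, in percent).
baseline = 4.3   # 4B student before DuoMem
student = 77.9   # 4B student after DuoMem
teacher = 87.1   # 72B teacher

size_ratio = 72 / 4                               # teacher-to-student parameter ratio
gap_before = teacher - baseline                   # points separating the two at the start
gap_after = teacher - student                     # points still separating them
share_closed = (student - baseline) / gap_before  # fraction of the original gap recovered

print(f"size ratio: {size_ratio:.0f}x")
print(f"gap before: {gap_before:.1f} points")
print(f"gap after: {gap_after:.1f} points")
print(f"share of gap closed: {share_closed:.1%}")
```

On these figures, the student recovers roughly eight-ninths of the original gap while remaining 9.2 points behind the teacher.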

The problem DuoMem tackles is a familiar one: big AI models are good at multi-step tasks like "find the mug, then put it in the microwave," but they're too slow and too heavy to run on a phone or a home device. Small models are fast enough to live on that hardware, but historically they've been bad at exactly this kind of sequential, plan-then-act reasoning. Distillation - having a large "teacher" model pass its skill down to a small "student" - has been the standard fix, but most existing approaches pull only one lever at a time: they either hand the student some hints before it acts or retrain its weights on the teacher's behavior, and rarely do both together.

DuoMem's contribution is doing both simultaneously, and the authors report that combining them beats either approach used alone. The first channel works before the small model even starts acting: the system prepends the large teacher's procedural memories - essentially a written playbook of how similar tasks were solved successfully - directly into the small model's input. The second channel changes the small model itself: it is fine-tuned with LoRA adapters, a lightweight technique (covered in our fine-tuning and LoRA guide) that adjusts a model's behavior using well under 10 million extra trainable values, a tiny fraction of the model's full parameter count. Those adapters are trained specifically on the teacher's successful task runs.
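
To make the two channels concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names, the prompt wording, the playbook entries, and the model shape (36 layers, hidden size 2560, rank 8, two square projection matrices adapted per layer) are all hypothetical. The only general fact it relies on is that LoRA adds a low-rank pair of matrices per adapted weight, costing rank x (d_in + d_out) extra parameters each.

```python
def build_student_prompt(playbook: list[str], task: str) -> str:
    """Channel 1: prepend teacher-derived procedural memory to the student's input."""
    memory = "\n".join(f"- {step}" for step in playbook)
    return f"Procedures that worked on similar tasks:\n{memory}\n\nTask: {task}\n"


def lora_trainable_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """Channel 2: count the extra weights LoRA adds.

    LoRA learns a low-rank update B @ A for each adapted weight matrix,
    with A of shape (rank, d_in) and B of shape (d_out, rank).
    """
    return n_matrices * rank * (d_in + d_out)


# Hypothetical playbook entries; the paper's actual memory format may differ.
playbook = [
    "Locate the target object before moving to the appliance.",
    "Open the appliance before trying to place anything inside it.",
]
prompt = build_student_prompt(playbook, "put a mug in the microwave")

# Hypothetical student shape, chosen only to show the order of magnitude.
extra = lora_trainable_params(d_in=2560, d_out=2560, rank=8, n_matrices=36 * 2)

print(prompt)
print(f"extra trainable parameters: {extra:,}")
```

Under these assumed dimensions the adapters add about 2.9 million trainable parameters, which is consistent with the "well under 10 million" figure but should not be read as the paper's actual configuration.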

A useful analogy is training a new employee. You could hand them a laminated cheat sheet of "how we do this job here" to read before every shift - that's the procedural-memory channel. Or you could have them shadow an expert for a few weeks and internalize the habits - that's the fine-tuning channel. Most training programs pick one. DuoMem does both: the cheat sheet for immediate guidance, plus enough practice that some of it becomes instinct. The paper's results suggest that pairing the two gets further than doubling down on either alone.

The result matters because it attacks one of the biggest practical bottlenecks in shipping AI agents: running a capable, multi-step agent on-device, rather than routing every request to a giant model in the cloud. A 4B model is small enough to plausibly run locally on a phone or a laptop, while a 72B model generally is not without serious hardware or a cloud connection. If a 4B model can close most of the performance gap to a model 18 times its size, that changes the calculus for who can build responsive, private, always-available AI agents, opening the door to more than just companies that can afford massive inference bills. The wall-clock speedup, more than three times faster than the teacher, is the other half of that story: the small model is not just cheaper to run, it's faster to actually use.
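
A rough back-of-the-envelope calculation shows why model size decides what fits on a device. The sketch below counts memory for the weights alone, assuming exactly 4 billion and 72 billion parameters and two common storage precisions (16-bit and 4-bit). The precisions are our assumption, not something the paper specifies, and the figures ignore activations and the attention cache, so real requirements are higher.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for name, n_params in [("4B student", 4e9), ("72B teacher", 72e9)]:
    for bits in (16, 4):
        gb = weight_memory_gb(n_params, bits)
        print(f"{name} at {bits}-bit: {gb:.0f} GB")
```

Under these assumptions the student needs roughly 2 to 8 GB for its weights, while the teacher needs roughly 36 to 144 GB.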

The one honest caveat is that ALFWorld is a simulated benchmark - a text-based simulation of household chores, not a messy real kitchen or a real customer-support ticket queue. The 4B model, even after DuoMem, still trails the 72B teacher by 9.2 percentage points (77.9 percent versus 87.1 percent), and the reported results do not show these gains holding up in an actual shipped product running on real hardware. Whether DuoMem's gains survive contact with genuinely open-ended, real-world tasks - where the "successful task runs" used for training won't perfectly anticipate what comes next - is still an open question the paper doesn't answer.

For more on how AI agents are built and what "agent memory" means in practice, see our explainer on AI agents and our coverage of agent memory.

Originally published on Ground Truth, where every claim is checked against the primary source.