When running a model at the 4B-parameter scale locally, you expect a trade-off: it's fast, but it has the memory of a goldfish. In my testing of Qwen3-4B, I kept hitting a wall. I'd ask the model a fairly complex Linux diagnostic question:
I’m working on a Linux server and during peak hours, the system becomes sluggish even though free -h shows plenty of free RAM and top doesn’t reveal any single process hogging CPU or memory. What is the most likely cause of this slowdown?
From my experience testing the model, I felt this question wasn't beyond Qwen's capability. But the model would think and think, then stall. I had the context size set at 2048 tokens, a reasonable value, yet the model was filling its own context window with internal deliberation before it could even start the answer.
I researched the issue and found the answer: a "soft switch," a control token that can be appended to the end of a prompt:
/no_think
It turns out Qwen3-4B isn't just one model; it's a hybrid. It has a high-gear "Reasoning Mode" and a low-gear "Direct Mode." Adding the control token /no_think to your prompt isn't a suggestion the model can ignore; it's a switch the model was fine-tuned to obey, flipping it into Direct Mode.
This is a game changer for SLMs. With chain-of-thought disabled, the model doesn't burn hundreds of tokens talking to itself in the background, which leaves the entire context window (2048 tokens in my setup) for the actual answer.
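Mechanically, using the switch is nothing more than string handling at the end of the prompt. Here's a minimal sketch (the helper name with_no_think is my own, not part of any Qwen tooling):

```python
def with_no_think(prompt: str, token: str = "/no_think") -> str:
    """Append Qwen3's soft switch to a prompt.

    The token goes at the very end of the prompt; this sketch also
    avoids appending it twice if it's already there.
    """
    prompt = prompt.rstrip()
    if prompt.endswith(token):
        return prompt
    return f"{prompt} {token}"

question = (
    "The system is sluggish during peak hours even though free -h shows "
    "plenty of free RAM. What is the most likely cause?"
)
print(with_no_think(question))
```

Whatever frontend you use (llama.cpp, Ollama, a chat UI), the prompt string that reaches the model should end with the token.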
It's a bit like asking the model for a gut reaction, the same response you might want from an experienced tech. Without the /no_think trigger, the model treats every prompt like a PhD thesis. With it, I got a very quick answer, and in this case a reasonable diagnosis of the Linux issue.
On my machine (an HP with an i7 processor using AVX-512, running in CPU-only mode), the difference in performance was night and day. Thinking mode was a slow crawl that eventually crashed the model; adding /no_think at the end of the prompt delivered a near-instant, on-target answer.
Don't let the "small" in SLM fool you. These models are built with "hidden" controls that allow them to punch way above their weight class if you know how to toggle their internal logic and use precise, careful prompting.
If you're using Qwen3, try adding /no_think at the very end of your prompt. (The slash matters: it keeps the model from reading no_think as just another word in the sentence.)
This control token is a specific sequence created by the Qwen team that the model was fine-tuned to recognize. In practice, it tells the model to skip generating its long chain-of-thought block and answer directly from what it already knows.
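One practical wrinkle: even in no-think mode, Qwen3 typically still emits an empty <think></think> block at the start of its output. If you're scripting against the model, you may want to strip that before displaying the answer. A small sketch (the helper name is mine):

```python
import re

def strip_think_block(text: str) -> str:
    """Remove a leading (possibly empty) <think>...</think> block
    that Qwen3 may emit even when /no_think is used."""
    return re.sub(r"^\s*<think>.*?</think>\s*", "", text,
                  count=1, flags=re.DOTALL)

# Example raw output shape from a no-think response:
raw = "<think>\n\n</think>\nLikely cause: high iowait from disk contention."
print(strip_think_block(raw))
```

This only touches a block at the very start of the response, so a model answer that happens to mention the tags later is left alone.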
Ben Santora - January 2026