Discussion on: Should you use Gemma 4 for your Development? A Multiversal Analysis to Determine if Gemma 4 is Right for You!


Klaudia Grzondziel The DEVengers • May 22

Ahahah, I had the same issues running Gemma locally with Ollama – my computer slowly turned into a snail, everything felt super slow, and I had to close almost every app 😅 In the end, it completely froze anyway!

Good job with your multiversal analysis! That's a top example of collaboration!👏🏻

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 22

Hey Klaudia! Yeah, that's a common issue when running any local model. You just have to hope it will be about as fast as the cloud version (which I had to use in this case). I am surprised @codingwithjiro (Elmar) got his to run on a laptop. If I ran it on my laptop, it would be cooked.

Appreciate the comment! Glad you liked it :D

Elmar Chavez The DEVengers • May 22

Glad it's not just me @klaudiagrz. Thanks for reading!

UnitBuilds • May 26

Have you tried the E4B and E2B models? They're quite fast and easy to run. I used them for my agentic browser swarm with a custom MCP (which cut token usage by 80%, so it's extremely lightweight) to run concurrent instances. I got up to 4 concurrent E2B instances on an 8 GB GPU, each running at 100+ TPS, using an RX 9060 XT and LM Studio with Vulkan (still trying to get llama.cpp with ROCm working).

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

Even with smaller models, I tend to get the same result (at least for me, and I'm on a desktop). Maybe if you get lucky? Not sure if she has tried them yet, but I'd assume so.

UnitBuilds • May 26

Well, that probably means something is set up wrong? Try LM Studio, and make sure you set the pipeline to use your native accelerator (CUDA/ROCm); if that isn't supported, run Vulkan. Turn on KV cache quantization at Q8 and give that a try. If it's still slow, turn on the shared KV cache; just be sure to scale your KV cache accordingly for all your parallel runners.
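
For a rough idea of what scaling the KV cache for parallel runners means in memory terms, here is a minimal back-of-the-envelope sketch. The model shape below is hypothetical (not any specific Gemma variant), and it assumes full attention on every layer, so models that use sliding-window layers will need less. The Q8 figure assumes llama.cpp's Q8_0 layout of 34 bytes per block of 32 values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx_len,
                   bytes_per_elem, n_parallel=1):
    """Estimate KV cache size, assuming full attention on every layer."""
    # Factor of 2: one K and one V vector per layer, per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    # Each parallel runner gets its own ctx_len tokens of cache.
    return per_token * ctx_len * n_parallel


F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # 32 int8 values plus one fp16 scale per block

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, ctx_len=8192)

f16_total = kv_cache_bytes(**shape, bytes_per_elem=F16_BYTES, n_parallel=4)
q8_total = kv_cache_bytes(**shape, bytes_per_elem=Q8_0_BYTES, n_parallel=4)

print(f"F16 KV cache, 4 runners: {f16_total / 1024**3:.2f} GiB")
print(f"Q8  KV cache, 4 runners: {q8_total / 1024**3:.2f} GiB")
```

With these made-up numbers, Q8 takes a little over half the memory of F16, which is the headroom that lets more runners fit on a small card.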

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

I guess, to be fair, most people getting into local LLMs start by running them either in Ollama or in a plain terminal. It runs slowly in both cases.

Never tried LM Studio surprisingly. Will give it a shot sometime in the future. Thanks for the suggestions!

UnitBuilds • May 26

It works quite well for me. Otherwise, you can try an Ollama Docker container; just make sure you test WSL2 GPU support if you're on NVIDIA.

UnitBuilds • May 26

Do you see high GPU usage? If so, how's your VRAM looking? If it overflows to 'shared memory' (RAM) or mmap (storage), every single token is dragged down by the need to swap memory, which instantly tanks performance. Most tools run llama.cpp under the hood, so make sure you're running the right build of it.
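
To see why a small overflow hurts so much, here is a rough sketch that treats token generation as purely memory-bandwidth-bound, meaning every token has to read all the weights once. All the numbers are hypothetical, and it ignores KV cache reads and compute time, so treat it as an illustration rather than a benchmark.

```python
def est_tokens_per_sec(model_gb, vram_free_gb, vram_bw_gbs, spill_bw_gbs):
    """Bandwidth-bound estimate: each token reads every weight once."""
    in_vram = min(model_gb, vram_free_gb)
    spilled = model_gb - in_vram  # part that overflowed to system RAM
    seconds_per_token = in_vram / vram_bw_gbs + spilled / spill_bw_gbs
    return 1.0 / seconds_per_token


# Hypothetical hardware: 300 GB/s VRAM, 25 GB/s effective path to system RAM.
fits = est_tokens_per_sec(model_gb=6, vram_free_gb=8,
                          vram_bw_gbs=300, spill_bw_gbs=25)
overflow = est_tokens_per_sec(model_gb=6, vram_free_gb=5,
                              vram_bw_gbs=300, spill_bw_gbs=25)

print(f"fits in VRAM:   {fits:.1f} tok/s")
print(f"1 GB overflows: {overflow:.1f} tok/s")
```

In this toy setup, spilling just 1 GB of a 6 GB model cuts the estimate to roughly a third of the all-in-VRAM figure, because the slow path dominates the time per token.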

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

I would have to check, but I would assume so. Good to know!

UnitBuilds • May 26

Whenever I see an LLM that I know 'should' fit in my VRAM chug, that's usually the culprit: either the wrong underlying architecture is active, or it's overflowing. You can also check whether you have thinking on. I know that with Qwen models it can look slow, but that's actually because thinking goes into a separate channel, so it's not really logged accurately.
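
A quick sketch of how hidden thinking tokens can distort the speed you think you're getting. The token counts here are made up; in practice you would need to read the reasoning token count from whatever your runtime reports, if it reports one at all.

```python
def throughput(visible_tokens, thinking_tokens, elapsed_s):
    """Return (apparent, actual) tokens per second for one response."""
    # Apparent speed: only the tokens that show up in the answer.
    visible_tps = visible_tokens / elapsed_s
    # Actual speed: the model also generated the hidden thinking tokens.
    total_tps = (visible_tokens + thinking_tokens) / elapsed_s
    return visible_tps, total_tps


# Hypothetical run: 200 answer tokens, 1800 thinking tokens, 20 seconds.
apparent, actual = throughput(visible_tokens=200,
                              thinking_tokens=1800,
                              elapsed_s=20)
print(f"apparent: {apparent:.0f} tok/s, actual: {actual:.0f} tok/s")
```

If the two numbers are far apart, the model is not slow; most of the time is going into reasoning you never see.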
