DEV Community

Discussion on: Should you use Gemma 4 for your Development? A Multiversal Analysis to Determine if Gemma 4 is Right for You!

klaudiagrz profile image
Klaudia Grzondziel The DEVengers

Ahahah, I had the same issues running Gemma locally with Ollama – my computer slowly turned into a snail, everything felt super slow, and I had to close almost every app 😅 In the end, it completely froze anyway!

Good job with your multiversal analysis! That's a top example of collaboration!👏🏻

francistrdev profile image
FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers

Hey Klaudia! Yeah, that's a common issue when running any local model. You just have to hope it will be about as fast as the cloud version (which I had to use in this case). I am surprised @codingwithjiro (Elmar) got his to run on a laptop. If I were to run it on my laptop, it would be cooked.

Appreciate the comment! Glad you liked it :D

codingwithjiro profile image
Elmar Chavez The DEVengers

Glad it's not just me @klaudiagrz. Thanks for reading!

unitbuilds profile image
UnitBuilds

Have you tried the E4B and E2B models? They're quite fast and easy to run. I used them for my agentic browser swarm with a custom MCP (which also cut token drain by 80%, so it's extremely lightweight) to run concurrent instances. I got to 4 concurrent E2Bs on an 8 GB GPU, each running at 100+ TPS, using an RX 9060 XT and LM Studio with Vulkan (still trying to get llama.cpp ROCm working).

francistrdev profile image
FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers

Even with smaller models, I tend to get the same result (for me at least, and I'm using a desktop). Maybe if you get lucky? Not sure if she has tried them yet, but I'd assume she has.

unitbuilds profile image
UnitBuilds

Well, that probably means you're running it wrong. Try LM Studio, then make sure you set the pipeline to use your native accelerator (CUDA/ROCm); if that isn't supported, run Vulkan. Turn on KV cache quantization to Q8 and give that a try. If it's still slow, turn on shared KV cache, just be sure to scale your KV cache accordingly for all your parallel runners.
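
To make the "scale your KV cache" part concrete, here is a rough sizing sketch. The model dimensions are hypothetical placeholders (not the real Gemma numbers), it assumes every parallel runner gets its own full context, and it ignores tricks like sliding-window attention that shrink the cache further. The Q8 figure assumes llama.cpp's `q8_0` block layout (32 int8 values plus one fp16 scale).

```python
# Rough KV cache sizing. All model dimensions below are hypothetical.
F16_BYTES = 2.0        # unquantized cache: 2 bytes per element
Q8_0_BYTES = 34 / 32   # q8_0: 34 bytes per block of 32 elements

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens,
                   bytes_per_element, parallel_slots=1):
    """Estimate KV cache size in bytes for a plain attention stack."""
    # K and V each hold n_kv_heads * head_dim values per layer per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens * parallel_slots

GIB = 1024 ** 3
example_model = dict(n_layers=32, n_kv_heads=8, head_dim=128,
                     context_tokens=8192)

f16_one_runner = kv_cache_bytes(**example_model, bytes_per_element=F16_BYTES)
q8_four_runners = kv_cache_bytes(**example_model,
                                 bytes_per_element=Q8_0_BYTES,
                                 parallel_slots=4)

print(f"F16, 1 runner:  {f16_one_runner / GIB:.3f} GiB")
print(f"Q8,  4 runners: {q8_four_runners / GIB:.3f} GiB")
```

With these made-up dimensions, Q8 roughly halves the per-runner cache, but four runners still need about twice what a single F16 runner does, so the context length usually has to come down as the runner count goes up.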

francistrdev profile image
FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers

To be fair, I guess most people getting into local LLMs start by running them either in Ollama or in a plain terminal. It runs slowly either way.

Surprisingly, I've never tried LM Studio. Will give it a shot sometime in the future. Thanks for the suggestions!

unitbuilds profile image
UnitBuilds

It works quite well for me. Otherwise you can try an Ollama Docker container, just make sure you test WSL2 GPU support if you're on Nvidia.

unitbuilds profile image
UnitBuilds

Do you see high GPU usage? If so, how's your VRAM looking? If it overflows into 'shared memory' (RAM) or mmap (storage), every single token is dragged down by the need to swap memory, which instantly tanks performance. Most of these tools run llama.cpp under the hood, so make sure you're running the right one.
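
A quick back-of-the-envelope check for overflow, as a sketch only: the sizes are numbers you would read off your own setup (model file size, estimated KV cache, VRAM total), and the 1 GiB default for overhead (compute buffers, desktop, other apps) is an assumption, not a measured value.

```python
def vram_headroom_gib(vram_gib, weights_gib, kv_cache_gib, overhead_gib=1.0):
    """Free VRAM in GiB after loading everything; negative means overflow."""
    return vram_gib - (weights_gib + kv_cache_gib + overhead_gib)

def describe(headroom_gib):
    """Turn the headroom number into a plain-language verdict."""
    if headroom_gib < 0:
        return f"overflows by {-headroom_gib:.1f} GiB, expect spill into RAM"
    return f"fits with {headroom_gib:.1f} GiB to spare"

# Hypothetical example: 8 GiB card, 5 GiB of weights, 2.5 GiB of KV cache.
headroom = vram_headroom_gib(vram_gib=8, weights_gib=5.0, kv_cache_gib=2.5)
print(describe(headroom))
```

If the number comes out negative, a smaller quant, a shorter context, or fewer parallel runners are the usual levers before blaming the model itself.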

francistrdev profile image
FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers

I would have to check, but I would assume so. Good to know!

unitbuilds profile image
UnitBuilds

Whenever I see an LLM that I know 'should' fit in my VRAM chug, that's usually the culprit: either the wrong underlying architecture is active, or it's overflowing. You can also check whether you have thinking on. I know with Qwen models it looks slow, but that's actually because thinking goes into a separate channel, so it's not really logged accurately.
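
To show why thinking can make a fast model look slow, here is a tiny sketch. The token counts are made up, and how you actually get the thinking-token count depends on your runner, so treat the inputs as assumptions.

```python
def tokens_per_second(visible_tokens, thinking_tokens, elapsed_seconds):
    """Return (true TPS, apparent TPS) for one generation.

    True TPS counts every generated token; apparent TPS counts only
    the tokens that show up in the visible answer.
    """
    true_tps = (visible_tokens + thinking_tokens) / elapsed_seconds
    apparent_tps = visible_tokens / elapsed_seconds
    return true_tps, apparent_tps

# Hypothetical run: 200 answer tokens, 1800 hidden thinking tokens, 20 s.
true_tps, apparent_tps = tokens_per_second(200, 1800, 20)
print(f"true: {true_tps:.0f} TPS, apparent: {apparent_tps:.0f} TPS")
```

In that example the model is generating at 100 TPS, but it feels like 10 TPS if you only watch the visible output.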
