Ahahah, I had the same issues running Gemma locally with Ollama – my computer slowly turned into a snail, everything felt super slow, and I had to close almost every app 😅 In the end, it completely froze anyway!
Good job with your multiversal analysis! That's a top example of collaboration!👏🏻
Hey Klaudia! Yeah, it's a common issue when running any local model. You just have to hope it will be about as fast as the cloud version (which I had to use for this case). I am surprised @codingwithjiro (Elmar) got his to run on a laptop. If I ran it on my laptop, it would be cooked.
Appreciate the comment! Glad you liked it :D
Glad it's not just me @klaudiagrz. Thanks for reading!
Have you tried the E4B and E2B models? They're quite fast and easy to run. I used them for my agentic browser swarm with a custom MCP (which cut token drain by 80%, so it's extremely lightweight) to run concurrent instances. I got to 4 concurrent E2Bs on an 8 GB GPU running at 100+ TPS each, using an RX 9060 XT and LM Studio with Vulkan (still trying to get llama.cpp ROCm working).
Even with smaller models, I tend to get the same result (at least for me, and I am on a desktop). Maybe if you get lucky? Not sure if she has tried them yet, but I would assume she has.
Well, that probably means something is set up wrong? Try LM Studio, then make sure you set the pipeline to use your native accelerator (CUDA/ROCm); if that's not supported, run Vulkan. Turn on KV cache quantization at Q8 and give that a try. If it's still slow, turn on shared KV cache; just be sure to scale your KV cache accordingly for all your parallel runners.
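To put rough numbers on the KV cache advice above, here is a minimal back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128), the 8192-token context, and the helper name `kv_cache_bytes` are all hypothetical and not taken from any specific model or tool. It also treats Q8 as roughly 1 byte per element, which slightly underestimates real quantized formats that store extra scale data, and it assumes each parallel runner gets its own full context.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem, n_runners=1):
    """Rough KV cache size: one key and one value vector per layer,
    per KV head, per token, multiplied by the number of parallel runners."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * context_len * n_runners

GIB = 1024 ** 3

# Hypothetical model shape and settings, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT_LEN = 8192
RUNNERS = 4

# 2 bytes per element for an unquantized f16 cache.
f16_total = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN, 2, RUNNERS)
# ~1 byte per element as an approximation of a Q8 cache.
q8_total = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CONTEXT_LEN, 1, RUNNERS)

print(f"f16 KV cache for {RUNNERS} runners: {f16_total / GIB:.1f} GiB")
print(f"Q8  KV cache for {RUNNERS} runners: {q8_total / GIB:.1f} GiB")
```

Under these assumptions, four runners at f16 would need about 4 GiB for the cache alone, and Q8 roughly halves that, which is why the cache settings matter so much on an 8 GB card.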
I guess, to be fair, most people getting into local LLMs start by running them either in Ollama or in a plain terminal. Either way, it runs slow in both cases.
Never tried LM Studio, surprisingly. Will give it a shot sometime in the future. Thanks for the suggestions!
It works quite well for me. Otherwise you can try an Ollama Docker container; just make sure you test that it works through WSL2 if you're on Nvidia.
Do you see high GPU usage? If so, how's your VRAM looking? If it overflows to 'shared memory' (RAM) or mmap (storage), every single token is dragged down by the need to swap memory, which instantly tanks performance. Most things run on llama.cpp under the hood, so make sure you're running the right one.
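The overflow point above comes down to simple arithmetic, sketched below. The sizes (an 8 GiB card, 5 GiB of weights, a fixed 0.5 GiB overhead) and the helper name `vram_report` are hypothetical placeholders; real usage depends on the runtime, driver, and whatever else is using the GPU.

```python
GIB = 1024 ** 3

def vram_report(vram_bytes, weights_bytes, kv_cache_bytes,
                overhead_bytes=GIB // 2):
    """Compare an estimated memory need against available VRAM.

    Anything that does not fit is reported as 'spill', i.e. the part
    that would end up in shared memory (RAM) or be read from storage.
    """
    needed = weights_bytes + kv_cache_bytes + overhead_bytes
    spill = max(0, needed - vram_bytes)
    return {
        "needed_gib": needed / GIB,
        "spill_gib": spill / GIB,
        "fits": spill == 0,
    }

# Hypothetical numbers: 8 GiB card, 5 GiB of weights.
big_cache = vram_report(8 * GIB, 5 * GIB, 4 * GIB)    # larger KV cache
small_cache = vram_report(8 * GIB, 5 * GIB, 2 * GIB)  # smaller KV cache

print(big_cache)
print(small_cache)
```

With these made-up numbers, the larger cache spills about 1.5 GiB past the card while the smaller one fits, which matches the pattern of a model that 'should' fit but still crawls.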
I would have to check, but I would assume so. Good to know!
Whenever I see an LLM that I know 'should' fit in my VRAM chug, that's usually the culprit: either the wrong underlying architecture is active, or it's overflowing. You can also check whether you have thinking on. I know with Qwen models it looks slow, but that's actually because thinking goes into a separate channel, so it's not really logged accurately.