This is a submission for the Gemma 4 Challenge: Write About Gemma 4
Running modern AI models locally on older hardware sounds almost impossible. B...
The 2015-desktop angle is the more interesting half of the local-AI story right now. The "1x H100" crowd gets all the airtime, but the actual unlock for hobbyist devs is that a CPU-only or modest-iGPU machine can now run a model that's genuinely useful for code-completion or summarization workloads.
Two things I'd be curious to see in a follow-up: tokens/sec under sustained load rather than first-token (thermal throttling on old desktops is brutal once you get past the first minute), and whether you saw a meaningful quality difference between 2B and 4B on tasks that matter to you, not just benchmark scores. In our testing the 2B-vs-4B gap is small on classification and pretty large on anything requiring two-step reasoning, but it's very task-dependent.
Did you try llamafile or just stick with one runtime? llamafile's been surprising on old AVX2-only CPUs.
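To make the sustained-versus-first-run distinction above concrete, here is a minimal sketch of how the two numbers could be computed. It assumes each run is summarized by the `eval_count` (generated tokens) and `eval_duration` (nanoseconds) fields that Ollama reports in its generate responses. The sample numbers are made up for illustration and are not measurements from the article.

```python
from typing import Iterable, Mapping

NANOS_PER_SECOND = 1_000_000_000


def decode_tokens_per_second(response: Mapping[str, int]) -> float:
    """Generation speed for a single run, ignoring prompt processing and model load."""
    return response["eval_count"] / (response["eval_duration"] / NANOS_PER_SECOND)


def sustained_tokens_per_second(responses: Iterable[Mapping[str, int]]) -> float:
    """Aggregate speed over back-to-back runs: total tokens over total decode time."""
    total_tokens = 0
    total_nanos = 0
    for response in responses:
        total_tokens += response["eval_count"]
        total_nanos += response["eval_duration"]
    return total_tokens / (total_nanos / NANOS_PER_SECOND)


# Illustrative, made-up values: one run on a cool CPU, then two throttled runs.
runs = [
    {"eval_count": 200, "eval_duration": 20 * NANOS_PER_SECOND},
    {"eval_count": 200, "eval_duration": 40 * NANOS_PER_SECOND},
    {"eval_count": 200, "eval_duration": 40 * NANOS_PER_SECOND},
]

first_run_speed = decode_tokens_per_second(runs[0])
sustained_speed = sustained_tokens_per_second(runs)
print(f"first run: {first_run_speed:.1f} tok/s, sustained: {sustained_speed:.1f} tok/s")
```

Summing tokens and durations before dividing weights each run by how long it actually took, so a few slow, throttled runs pull the sustained figure down in a way that a simple average of per-run speeds would understate.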
Thanks! I completely agree. The fact that older hardware can now run actually useful local models is probably the most exciting part for hobby developers right now.
And jumping to your last question, I stuck with Ollama mainly because it’s much more approachable for tech people in general, not just developers. But I’ll definitely try llamafile, especially once I start integrating models into the app I’m planning to build.
These are actually really good insights, and I’d like to focus more on them in follow-up testing. Right now I’m planning to evolve the measurements in two directions:
If E2B or E4B prove capable enough, I’d also like to experiment with MCP, RAG, and similar integrations to see how far they can be pushed.
Good test of local LLMs. 😀 I wanted to use AI for free, so I tried local LLMs last year, but they were quite slow and low quality. My CPU and memory usage hit 100%, so I gave up. But they might be better now.
Definitely give it another try. I’m pretty sure you have better hardware than my archaic PC, so the E4B model should run fine for you (maybe even 26B 😀).
The reasoning quality also depends on your expectations. E4B and E2B are still relatively small models, so they won’t compete with models like Anthropic Claude Sonnet 4.6 or Google Gemini 3.1 for programming tasks, but they’re definitely usable.
Yes, I might try simple tasks with local LLMs and avoid comparing them with Claude Code. There should be a good way to make the most of local LLMs. 🤔
This is such a thorough breakdown — the CPU becoming the bottleneck instead of RAM was genuinely surprising to me. I've been working with Gemma 4-27B via API for my space app and the instruction-following precision you mentioned is exactly what made it work for persona-switching (NASA commander → planetarium narrator in the same app). Would love to see how the E4B handles creative + factual tasks together in your trip planner MVP!
Thank you for reading!
Your space app sounds really interesting. I’ll definitely check it out.
Glad you want to see more 😄 For my trip planner, I actually selected E2B because lower precision on factual tasks will not hurt that much there. I also want to show how smaller models can be used smartly, so even a less capable model can still be an important part of an app.
So stay tuned 😄
Wow, this is such an amazing breakdown 😄 I also wanted to participate in this contest, but now I’m honestly a bit embarrassed after reading this 😀
Local LLMs have always tempted me too. I experimented a bit with browser-based ones, but on a real computer you can definitely feel the difference in quality/performance.
Also, I find it fascinating that it struggles so much with Czech 😄 Such a beautiful language! 😄
Thank you! You should definitely go for it. These challenges are a great way to push our knowledge further.
That actually sounds like a challenge article idea now: “Gemma E2B in the browser?” 😄
And to be honest, I’m not surprised it struggles with Czech. It’s my native language and even I struggle with it sometimes 😀
I love "szukajmy szczotek" in Czech, which are totally neutral words in Polish 🤣
Yep, generally “szukaj” just sounds funny to Czech speakers 😄
Fascinating
Glad you found it fascinating!
Hopefully the fascinating part is the article itself, not the fact that I’m still developing side projects on a machine from 2015 😄
It’s rare to see someone testing the 2015 hardware vs. modern LLM threshold so thoroughly.
2015 hardware might be a bit too old, but I believe a lot of people are still on older machines (around 2020 or earlier), so this kind of testing can be quite valuable for them.
I’m curious how far current small models can realistically go before hardware becomes the real bottleneck.
Running benchmarks on actual old hardware instead of speculating is the right approach. The real usability threshold for local models is around 30 tok/s — below that it becomes "submit and wait" rather than interactive, and that's where model size selection matters more than raw benchmark scores. For a 2015 i7, Gemma 4 2B with Q4 quantization is probably the sweet spot between quality and speed. The practical question is always: fast enough to stay in flow, or slow enough to break concentration?
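As a rough illustration of why throughput decides whether a model feels interactive, the sketch below turns tokens per second into wall-clock waiting time. The 30 tok/s default is the threshold suggested in the comment above, not a measured constant, and the function names are hypothetical.

```python
def seconds_to_generate(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to stream a full answer at a steady generation speed."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def feels_interactive(tokens_per_second: float, threshold: float = 30.0) -> bool:
    """Apply the commenter's rule of thumb; the threshold is an assumption."""
    return tokens_per_second >= threshold


# A 300-token answer at two hypothetical speeds.
for speed in (30.0, 6.0):
    wait = seconds_to_generate(300, speed)
    print(f"{speed:.0f} tok/s -> {wait:.0f} s, interactive: {feels_interactive(speed)}")
```

For short, structured outputs (a label or a small JSON object of a few dozen tokens) the same arithmetic gives waits of only a few seconds even at low speeds.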
For daily engineering work or general usage, you’re probably right about the ~30 tok/s threshold for a good UX.
But I see these smaller models more as tools inside developer apps, for example intent extraction, classification, or handling simple prompts, where token speed is not necessarily the bottleneck.
That said, I still want to try them in VS Code, mostly out of curiosity rather than for real daily usage.
Nice take on the agent orchestration approach! I'd be curious how it handled production traffic versus benchmarks, though. What was the biggest unexpected challenge you hit along the way?
Followed! Looking forward to more content like this.
Thanks, glad you liked it!
I’ve actually already implemented an MVP using Gemma 4 E2B and wrote an article about it: How to Use Gemma 4 E2B the Smart Way: Family Trip Advisor. There’s also a small benchmark section with real MVP token usage if you’re interested (still running on the old machine 😄).
Love benchmarks like this because they puncture the "you need an H100 to do anything" myth. A 2015 desktop running Gemma 4 2B/4B usefully is exactly the point: for a huge class of tasks (classification, extraction, light reasoning, the mechanical bulk), small local models on modest hardware are already good enough - and good enough + free + private beats frontier-but-metered for that work.
The 2B-vs-4B tradeoff is the interesting practical knob: 2B for the high-volume trivial stuff, 4B when you need a bit more reasoning, and escalation to an API only for the genuinely hard cases. That tiering is the whole game on constrained hardware. Great hands-on data - what tokens/sec did you land on, and was 4B usable for real coding tasks or more for Q&A?
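The tiering idea described above could be sketched as a small router. Everything here is hypothetical: the tier names, the task labels, and the mapping between them are assumptions chosen for illustration, not something taken from the article or from any library.

```python
from enum import Enum


class Tier(Enum):
    LOCAL_2B = "local-2b"      # high-volume, trivial work
    LOCAL_4B = "local-4b"      # a bit more reasoning
    REMOTE_API = "remote-api"  # genuinely hard requests only


# Hypothetical task labels; a real app would define its own.
TRIVIAL_TASKS = {"classification", "extraction", "intent"}
LIGHT_REASONING_TASKS = {"summarization", "qa"}


def pick_tier(task_kind: str) -> Tier:
    """Route a task to the cheapest tier assumed to handle it."""
    if task_kind in TRIVIAL_TASKS:
        return Tier.LOCAL_2B
    if task_kind in LIGHT_REASONING_TASKS:
        return Tier.LOCAL_4B
    # Unknown or hard tasks escalate to the metered API.
    return Tier.REMOTE_API


for kind in ("classification", "summarization", "multi-step-planning"):
    print(kind, "->", pick_tier(kind).value)
```

Defaulting unknown task kinds to the remote tier trades cost for safety; an app that must stay fully offline would default to the larger local model instead.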
Thanks for the comment!
That's exactly what motivated me to run the benchmark. There is a lot of discussion around high-end AI hardware, but for many practical tasks, older hardware is still surprisingly capable.
I haven't tried them for coding tasks yet; that's still on my TODO list.
I also used the 2B model in a real MVP project: How to Use Gemma 4 E2B the Smart Way: Family Trip Advisor
For smaller, focused prompts it was definitely usable. With a good architecture, both the 2B and 4B models can deliver surprisingly good results in real applications.
Love the benchmarking of these two!
Thanks, glad you found the benchmark useful!
Damn. A 2015 laptop with 24 GB RAM???????????????
I was here expecting a 4 GB RAM kinda stuff.
It’s actually a PC, not a laptop 🙂 Originally it had 8 GB (2x4 GB), but over the years I added another 16 GB as software, especially browsers and IDEs, started needing more and more RAM.
Good insights, but in 2015 I was in college.
This is the kind of benchmarking I love: real hardware, not cloud instances. The 4B model on a 2015 desktop is impressive for inference, but I wonder about the token generation speed for interactive use. 2B would probably be the sweet spot for a daily driver.
I run Gemma models locally on my laptop too. What quantization level did you use? Q4_K_M seems to be the best balance for consumer hardware in my experience.
Followed you for more local LLM content! 👀
Wow, amazing breakdown 😄
Thanks, glad you liked it! 😄
Awesome, this is a real turning point.
Yes, it’s exciting that we can finally run useful models locally. I’m curious how far we can push these edge models.
I think it's of no use to be like that.