FrancisTRᴅᴇᴠ (っ◔◡◔)っ

for The DEVengers

Posted on May 22

Should you use Gemma 4 for your Development? A Multiversal Analysis to Determine if Gemma 4 is Right for You!

#devchallenge #gemmachallenge #gemma #discuss

Gemma 4 Challenge: Write about Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Disclaimer: This is an individual submission for Francis Tran (@francistrdev). Everyone involved has their own accounts of using Gemma 4 and has been paraphrased in this post. See below for the list of people involved.

Author: Francis Tran (@francistrdev)
Featuring: Elmar Chavez (@codingwithjiro), Konark Sharma (@konark_13), and Julien Avezou (@javz)

The cover image uses characters from various anime chosen by the people featured in this post. It was put together in Pixlr, with AI used to enhance the image quality. Some content started as AI-generated drafts and has been paraphrased in the author's own words.

Introduction
Setting up Gemma 4 using Ollama
Using Gemma 4 as a Confused Developer
Building a Prototype using Gemma 4
- Results of Gemma 4 Vs. GPT 5.4
  - Technical Depth
  - Architecture & System Thinking
- Julien's Verdict on Gemma 4
Using Gemma 4 as an AI-Agent
Summary

Thank you for Reading!

Introduction

In the age of AI, many developers are using AI in their workflows. This can range from AI assistants cleaning up a large code-base to developers vibe coding a prototype to showcase to others.

Regardless of how you use AI in your development, it is important to choose the model you want to work with. There are many options, such as ChatGPT, Gemini, and Copilot. The one we are talking about today is Google's latest model: Gemma 4!

The goal of this article is to share our experiences using Gemma 4 from four different perspectives of development: setting up Gemma 4 using Ollama, testing how easily Gemma 4 gets confused, comparing Gemma 4's output when prototyping, and using Gemma 4 as an AI-Agent. Additionally, I will share our discoveries and whether we recommend Gemma 4 based on our experiences! We hope this helps you decide if Gemma 4 is right for you at any skill level of your developer career!

With that said, we will now talk about the setup process. In any development, we all experience setting up our environment, whether it is OpenClaw or a simple NPM package! Running a local LLM is no different. Here is Elmar's experience of setting up Gemma 4 using Ollama!

Setting up Gemma 4 using Ollama

Elmar Chavez

Licensed civil engineer turned full stack developer building accessible, responsive web applications. I also review code in Frontend Mentor and participate in collaborative projects.

Elmar has been using ChatGPT for the past year, mainly as a mentor when doing projects. He was interested in having an LLM locally. He believes a local LLM is convenient in his case because of the power outages he has to face in the Philippines. Having Gemma 4 locally will allow him to stay productive when coding without worrying about an internet connection. With that said, he started researching how to set up Gemma 4 using Ollama.

Installing Ollama and Gemma 4

Elmar started to search on the internet, looking for tutorials on getting started on Gemma 4.

One of the ways to get started is installing Ollama, which Elmar thought was a weird name until he realized the logo was an actual llama.

When downloading Ollama, the main issue was storage: his C:\ drive was nearing 100% capacity, and the Ollama installer defaults to the C:\ drive.

After searching for a way to install Ollama on a different drive, he found this command to run in the terminal:

C:\Users\ELMAR>cd Downloads
C:\Users\ELMAR\Downloads>OllamaSetup.exe /DIR="D:\Ollama"

Ultimately, it worked well. He also made other configuration changes, such as redirecting model downloads to the same drive under D:\OllamaModels. He then downloaded the gemma4:e4b model.
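If you are in the same situation as Elmar, it can help to check free space before choosing where Ollama and its models should live. The sketch below is only an illustration: `pick_drive`, the 12 GiB figure, and the `probe` parameter are assumptions made for this example, not part of Ollama. For the model folder itself, Ollama's FAQ documents an `OLLAMA_MODELS` environment variable; the post does not say whether that is what Elmar used.

```python
import shutil
from typing import Callable, Iterable, Optional

GIB = 1024 ** 3


def free_gib(path: str) -> float:
    """Return the free space on the drive that holds `path`, in GiB."""
    return shutil.disk_usage(path).free / GIB


def pick_drive(
    candidates: Iterable[str],
    needed_gib: float,
    probe: Callable[[str], float] = free_gib,
) -> Optional[str]:
    """Return the first candidate with at least `needed_gib` free, else None.

    `probe` is injectable so the logic can be tried without real drives.
    """
    for path in candidates:
        try:
            if probe(path) >= needed_gib:
                return path
        except OSError:
            continue  # drive is missing or cannot be read
    return None
```

On Windows, something like `pick_drive(["C:\\", "D:\\"], needed_gib=12)` would return the first drive with enough room, where 12 GiB is a made-up amount of headroom for the installer plus one model.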

Trying out Gemma 4

The first thing Elmar tried was "hi gemma4". Although the expected outcome is Gemma 4 saying hello back, it gave him this error:

Error
500 Internal Server Error: model requires more system memory (9.8 GiB) than is available (4.9 GiB)

For context, these are his device specifications:

His laptop is great for programming and other development work. The issue is that it is only suitable for "lightweight models" when it comes to running LLMs locally.

Elmar discovered that local AI depends heavily on your own hardware, unlike browser-based AI tools like ChatGPT.

One of the things Elmar did was to run a smaller version of Gemma 4:

ollama run gemma4:e2b

It gave him a similar error as before:

Error
500 Internal Server Error: model requires more system memory (7.2 GiB) than is available (5.7 GiB)

After investigating a bit more, he closed other applications that were running on his computer and taking up memory, and the model finally responded!

The first run took longer than he expected, but he mentioned that he had VSCode, Notepad, and 4 Chrome tabs open at the time.
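The two error messages above already tell you how much memory has to be freed. Here is a small sketch that reads the numbers out of the message text; the function and class names are made up for this example, and the pattern only covers the exact wording shown in this post.

```python
import re
from typing import NamedTuple


class MemoryGap(NamedTuple):
    required_gib: float
    available_gib: float

    @property
    def shortfall_gib(self) -> float:
        """How much memory still has to be freed, rounded to 0.1 GiB."""
        return round(self.required_gib - self.available_gib, 1)


# Matches the wording of the two errors quoted in this post.
_PATTERN = re.compile(
    r"requires more system memory \(([\d.]+) GiB\) "
    r"than is available \(([\d.]+) GiB\)"
)


def parse_memory_error(message: str) -> MemoryGap:
    """Extract the required and available memory from an Ollama error line."""
    match = _PATTERN.search(message)
    if match is None:
        raise ValueError("not a recognised memory error")
    return MemoryGap(float(match.group(1)), float(match.group(2)))


e2b_error = (
    "500 Internal Server Error: model requires more system memory "
    "(7.2 GiB) than is available (5.7 GiB)"
)
print(parse_memory_error(e2b_error).shortfall_gib)  # 1.5
```

For gemma4:e2b that is about 1.5 GiB, which is roughly what closing a few applications can give back. For gemma4:e4b the gap was 4.9 GiB, which is much harder to free on the same laptop.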

Running Gemma 4 Offline

Elmar was curious about the result of using Gemma 4 without connecting to the internet.

It ran slower than usual, but still responded in under a minute.

Elmar then asked Gemma 4 questions that he would normally ask ChatGPT as part of his development workflow.

In this case, he is currently building his portfolio website. Elmar asked a question and attached a screenshot, and here is what Gemma 4 output:

Surprisingly, it took less time, even though he attached an image for this use case.

The wait time for the LLM to respond, which is around a minute, is reasonable for Elmar since he has a decent laptop. It is quite useful, especially in countries that commonly have power outages or unreliable internet.
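If you want numbers instead of a stopwatch, Ollama's local REST API reports timings with each reply. The sketch below assumes Ollama is running on its default port (11434) and uses the `total_duration`, `eval_count`, and `eval_duration` fields from the `/api/generate` response, which are reported in nanoseconds. The helper names are made up for this example, and this is not how Elmar measured his runs.

```python
import json
import urllib.request

# Default address of a locally running Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"


def ask(model: str, prompt: str) -> dict:
    """Send one non-streaming prompt to the local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize_timing(reply: dict) -> tuple[float, float]:
    """Return (total seconds, generated tokens per second) for one reply."""
    total_s = reply["total_duration"] / 1e9  # nanoseconds to seconds
    eval_s = reply["eval_duration"] / 1e9
    tokens_per_s = reply["eval_count"] / eval_s if eval_s else 0.0
    return total_s, tokens_per_s
```

With the server running, `summarize_timing(ask("gemma4:e2b", "hi gemma4"))` would give the total wait and the generation speed, which makes it easier to compare runs with and without other applications open.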

Elmar's Verdict on Gemma 4

This is Elmar's Verdict of Gemma 4 in terms of setting up via Ollama and using it for the first time:

Web-based LLMs are great for convenience and device flexibility. While local LLMs are great for offline reliability and privacy, I'd say the experience of experimenting with Ollama + Gemma 4 only made my perspective with AI clearer.

It is not magic, it's just data combined with probability and rendering models that require large amounts of RAM for compute power.

This begs the question, if I have a higher end device, would I invest fully into local LLM's?

I would not sacrifice my own hardware for running AI computations just to get responses faster.

There are data centers that have the hardware especially built for AI. I would rather give the job to them.

Overall, ChatGPT is still far more practical for my current workflow so I will be sticking to it as my default. However, I would definitely use Ollama + Gemma 4 as a reliable alternative.

That is Elmar's Verdict on Gemma 4! In the early stage, it is fast to set up, but the response time and hardware requirements make it harder to use, especially considering most users do not have a powerful laptop.

Now you know about the setup with Gemma 4! Starting off with Gemma 4 and seeing its outputs is a good sign that everything works! However, although Gemma 4 is a small and powerful model, how good is it at not being confused? We now transition to Konark, who tested Gemma 4 to the limit!

Using Gemma 4 as a Confused Developer

Konark Sharma

I am software engineer who finds fun and creativity in Frontend. I would love to be a part of a team to help, develop and learn from new people and add my knowledge to the project.

Konark has been using Gemma 4 for some time, and the one thing he wanted to know was whether he could push Gemma 4 into confusion.

We all have that one moment where we randomly test an AI just to see “How smart are you actually?” with questions like “How many r’s are there in strawberry?”. Konark ran three tests: the Strawberry Test, the Tic Tac Toe Rotation Test, and the Broken Cup Test.

The Strawberry Test

Konark asked Gemma 4: “How many r’s are there in strawberry?” (The answer is three.) This type of question used to trip up many AI models.
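For reference, here is the ground truth for the Strawberry Test, counted by code instead of by a language model:

```python
word = "strawberry"

# 1-based positions of every "r" in the word.
positions = [index + 1 for index, letter in enumerate(word) if letter == "r"]
count = len(positions)

print(count, positions)  # 3 [3, 8, 9]
```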

Gemma 4 thought for a while and gave Konark three different interpretations regarding the number of "r’s" in strawberry.

Konark realized that AI sometimes overcomplicates simple questions, and he would prefer simple answers. For example, suppose there are 10 apples in a basket and you ask Gemma 4: “How many apples are there?”

Would you prefer this answer:

“This is my first time counting. Do you mean apples inside the basket or around the basket?”

Or would you simply prefer:

“There are 10 apples.”

Konark would prefer the second answer.

With these results, he was curious about how Gemini would answer the Strawberry Test. He asked Gemini and it simply replied:

3 r’s.

It is straightforward and it is always good to have a simple, non-complex answer.

The Tic Tac Toe Rotation Test

Another fun question Konark asked Gemma 4 was: “How would rotating a tic tac toe board by 90 degrees change the rules of the game and its strategy?”

The goal is to see whether Gemma 4 recognizes that the game is still the same, except that the board is rotated.

However, Gemma 4 took quite some time to reach this conclusion. It explained things properly and even walked through a rotated example.

Although it is a helpful response, a more straightforward answer would have been nicer, as with the Strawberry Test.
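The short answer to the Tic Tac Toe question can be checked in a few lines. This sketch rotates every winning line by 90 degrees clockwise and confirms that the set of winning lines stays the same, so neither the rules nor the strategy change. The function names are made up for this example.

```python
def winning_lines() -> set:
    """All 8 winning lines on a 3x3 board, each as a frozenset of (row, col)."""
    rows = [frozenset((r, c) for c in range(3)) for r in range(3)]
    cols = [frozenset((r, c) for r in range(3)) for c in range(3)]
    diagonals = [
        frozenset((i, i) for i in range(3)),
        frozenset((i, 2 - i) for i in range(3)),
    ]
    return set(rows + cols + diagonals)


def rotate_cell(cell: tuple) -> tuple:
    """Rotate one cell 90 degrees clockwise: (row, col) -> (col, 2 - row)."""
    row, col = cell
    return (col, 2 - row)


original = winning_lines()
rotated = {frozenset(rotate_cell(cell) for cell in line) for line in original}

print(rotated == original)  # True
```

Rows become columns, columns become rows, and the two diagonals swap places, which is why the rotated game is the same game.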

The Broken Cup Test

This is Konark's favorite test, in which he asked Gemma 4: “I have a metal cup with the bottom missing and the top sealed. How can I use this cup?”

The simple answer would be: "Rotate the cup, and suddenly it becomes usable again." Flipped upside down, the sealed top becomes the base and the missing bottom becomes the opening.

However, Gemma 4 did not really understand the trick. Instead, it started suggesting alternative ways to use the object.

Although it is creative, it is not entirely correct.

On the other hand, Gemini understood the trick faster and then even expanded on practical uses.

Overall, Gemma 4 still struggles with certain trick questions and edge-case reasoning.

Konark's Verdict on Gemma 4

This is Konark's verdict of using Gemma 4.

After playing around with both models, I reached a simple conclusion.

Gemma 4 feels good for:

- learning
- content generation
- straightforward explanations
- slower, detailed responses

Gemini feels better for:

- trick questions
- quick reasoning
- direct answers
- faster conclusions

And honestly, both have their place.

Sometimes I want a detailed explanation.

Sometimes I just want: “Bro, just give me the answer.”

And that matters.

Because developers don’t always want essays. Sometimes we just want the bug fixed.

This is important to consider when you want to use Gemma 4. Sometimes, we just want a simple answer to the issue we are trying to solve. Of course, prompting varies and this is Konark's experience using Gemma 4 so it can be different from others.

Speaking of prompting, AI is used to prototype projects in development! We now transition to Julien's experience on using Gemma 4 when it comes to prototyping a project and how it compares to what he usually uses, which is Codex!

Building a Prototype using Gemma 4

Julien Avezou

Builder | Software Engineer | Author of The Thinking Engineer Toolkit

Julien planned on using AI to create a fast prototype.

He was curious about how Gemma 4 performs compared to his usual setup, Codex (GPT 5.4).

For context, the project is a Chrome extension that helps developers make better use of LLMs. He will share more on his blog in the future!

His goal is to compare Gemma 4 vs. GPT 5.4 when it comes to prototyping. For this comparison, he ran tests across the following 2 categories, using the same prompt as input and comparing the quality of the outputs:

Technical Depth: The factual accuracy, understanding of deep system mechanics and ability to compare non-trivial trade-offs.
Architecture & System Thinking: The ability to think at the macro level, identifying bottlenecks, scaling issues, and making justifiable "when/why" decisions.

Results of Gemma 4 Vs. GPT 5.4

Here are the results for the cases of Technical Depth and Architecture & System Thinking!

Technical Depth

This is the prompt given to both LLMs:

I am building a Chrome extension which detects when a developer is using ChatGPT, Claude, or Gemini, shows a floating button, opens a side panel, lets the user describe their task, recommends one of five AI thinking modes, generates a better prompt, and lets the user copy it manually.

Modes:
1. Explore — when I don’t fully understand the problem yet
2. Challenge — when I have a plan, but it might be wrong
3. Decide — when I need to choose between options
4. Audit — when I need to verify quality or correctness
5. Reflect — when I want to actually learn from what I did

MVP constraints:
- Chrome extension only
- no login
- no dashboard
- no automatic prompt insertion
- no usage tracking yet

Task:
Give me a technically detailed implementation plan for the MVP.

Here are the results:

Model	Pros	Cons
Gemma 4	Clear high-level structure. Correctly identifies manifest, background script, content script, overlay UI, and template-based prompt generation. Easy to read. Good for a conceptual first pass.	Too generic. Does not mention Chrome Side Panel API. Blurs overlay and side panel. Drifts from your exact modes. Weak on privacy/security. Less actionable for immediate implementation.
GPT 5.4	Much more implementation-ready. Strong file structure. Useful code snippets. Correct Chrome APIs. Better MVP scoping. Better maintenance thinking. Exact mode alignment. Clear recommendation to avoid LLM/API in v1.	Still light on deeper security analysis. The manifest/build setup would need adjustment because TypeScript files are not directly used by Chrome without bundling. Could have compared local/API/hybrid tradeoffs more explicitly.

Overall, GPT 5.4 gives the stronger Technical Depth output since it is more accurate, more build-ready, and more specific.
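To make "recommends one of five AI thinking modes" more concrete, here is a minimal sketch of a local recommender with no LLM or API call, in the spirit of the "avoid LLM/API in v1" advice. It is written in Python only for illustration, since a real Chrome extension would be JavaScript or TypeScript. The keywords are lifted from the mode descriptions in the prompt, and the matching rule and function name are assumptions for this example, not Julien's implementation or either model's output.

```python
# Keywords taken from the mode descriptions in the prompt above.
# The scoring rule is a hypothetical stand-in for the real extension logic.
MODE_KEYWORDS = {
    "Explore": ("understand",),
    "Challenge": ("plan",),
    "Decide": ("choose", "option"),
    "Audit": ("verify", "quality", "correct"),
    "Reflect": ("learn",),
}


def recommend_mode(task: str, default: str = "Explore") -> str:
    """Pick the mode whose keywords appear most often in the task text."""
    text = task.lower()
    scores = {
        mode: sum(word in text for word in words)
        for mode, words in MODE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default when nothing matches at all.
    return best if scores[best] > 0 else default


print(recommend_mode("I need to choose between two options"))  # Decide
```

Plain keyword matching is crude, but it runs fully inside the extension, which fits the "no login, no usage tracking" constraints of the MVP.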

Architecture & System Thinking

This is the prompt given to both LLMs:

I am building a Chrome extension that helps developers use AI more intentionally.

It sits on top of ChatGPT, Claude, and Gemini. The user describes their task, then it recommends one of five thinking modes and generates a better prompt to copy.

Modes:
1. Explore — when I don’t fully understand the problem yet
2. Challenge — when I have a plan, but it might be wrong
3. Decide — when I need to choose between options
4. Audit — when I need to verify quality or correctness
5. Reflect — when I want to actually learn from what I did

MVP:
- detect supported AI chat pages
- show floating ModeCheck button
- open side panel
- recommend mode
- generate better prompt
- copy manually
- soft CTA to the Thinking Engineer Toolkit

Future:
A SaaS dashboard may later track AI usage modes, cognitive cost, dependency patterns, and optional hard blocks.

Task:
Think at the system architecture level.

Cover:
- best one-week MVP architecture
- what to build now vs postpone
- key system boundaries
- privacy/trust risks
- maintenance risks for a solo builder
- how to preserve future optionality for analytics/dashboard
- strongest argument for and against the browser extension approach
- recommended roadmap: week 1, month 1, month 3

Be concise but thoughtful. Focus on trade-offs and “when/why” decisions.

Here are the results:

Model	Pros	Cons
Gemma 4	Good strategic restraint. Correctly recommends client-side MVP, hardcoded templates, no backend, and no dashboard at first. Good emphasis on fast time-to-value. Useful suggestion to model prompts as structured objects/schema for future flexibility.	Less complete system thinking. Weak privacy/trust analysis. Weak argument for/against extension approach. Maintenance advice is shallow. Roadmap becomes generic around logging, API integration, billing. Does not fully address future cognitive-cost/hard-block architecture.
GPT 5.4	Stronger system boundaries. Better local-first architecture. Better privacy/trust framing. Better solo-builder maintenance analysis. Stronger roadmap. Better future optionality through event/data modeling without collecting data yet. Better distinction between host page, extension, and future backend.	Does not go very deep on future scaling bottlenecks or hard-block mechanics. Could have included Gemma’s useful prompt-schema idea. Slightly less focused on concrete “engine” abstraction for prompt assembly.

Overall, GPT 5.4 wins on Architecture & System Thinking. Gemma 4 provides a solid, restrained MVP strategy, but GPT 5.4 gives a more complete architectural view: system boundaries, trust, local-first design, future analytics optionality, solo-builder maintenance risk, and roadmap discipline.
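Gemma 4's suggestion to model prompts as structured objects is easy to picture with a small sketch. The mode names and "when" descriptions below come from the prompt, while the class, the field names, and the instruction wording are hypothetical and do not come from either model's output. Python is used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    """One thinking mode modelled as data instead of a hard-coded string."""

    mode: str
    when: str         # when the mode applies, copied from the mode list
    instruction: str  # hypothetical wording, made up for this sketch

    def render(self, task: str) -> str:
        """Assemble the prompt the user would copy manually."""
        return f"Mode: {self.mode}\n{self.instruction}\n\nTask: {task.strip()}"


TEMPLATES = {
    template.mode: template
    for template in (
        PromptTemplate("Explore", "I don't fully understand the problem yet",
                       "Ask clarifying questions before proposing a solution."),
        PromptTemplate("Challenge", "I have a plan, but it might be wrong",
                       "Look for flaws and risky assumptions in my plan."),
        PromptTemplate("Decide", "I need to choose between options",
                       "Compare the options and state the trade-offs."),
        PromptTemplate("Audit", "I need to verify quality or correctness",
                       "Review the work and list concrete problems."),
        PromptTemplate("Reflect", "I want to actually learn from what I did",
                       "Help me summarise what I learned and what to change."),
    )
}

print(TEMPLATES["Audit"].render("  check my pull request  "))
```

Keeping the templates as data means a future dashboard or analytics layer could reuse the same mode names without touching the code that assembles the prompt.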

Julien's Verdict on Gemma 4

This is Julien's verdict on the use of Gemma 4:

GPT 5.4 seems to be the strongest at high-level reasoning and synthesis as you would expect from a recent frontier model.

However a good use case for Gemma 4 in terms of prototyping work would be for codebase analysis, running experiments through constraints and for privacy-sensitive workflows.

Being an open model which can be run locally has great specific and complementary use cases when prototyping developer tools.

I will consider this going forward.

Even though GPT is preferred when it comes to high-level reasoning and synthesis, there are other cases where Gemma 4 is useful for prototyping.

Speaking of prototyping, now we transition over to using Gemma 4 as an AI-Agent!

Using Gemma 4 as an AI-Agent

FrancisTRᴅᴇᴠ (っ◔◡◔)っ

📚 𝗙𝘂𝗹𝗹-𝗦𝘁𝗮𝗰𝗸 𝗗𝗲𝘃𝗲𝗹𝗼𝗽𝗲𝗿 📚 🚀 𝗗𝗘𝗩 𝗖𝗼𝗺𝗺𝘂𝗻𝗶𝘁𝘆 𝗠𝗼𝗱𝗲𝗿𝗮𝘁𝗼𝗿 🚀 ༼ つ ◕_◕ ༽つ🍰🍔🍕 ₍₍⚞(˶>ᗜ<˶)⚟⁾⁾ "ᴀ ꜱᴍᴏᴏᴛʜ ꜱᴇᴀ ɴᴇᴠᴇʀ ᴍᴀᴅᴇ ᴀ ꜱᴋɪʟʟᴇᴅ ꜱᴀɪʟᴏʀ" - ꜰʀᴀɴᴋʟɪɴ ᴅ. ʀᴏᴏꜱᴇᴠᴇʟᴛ

Whenever I am contributing to Open Source, it is quite useful to have an AI-Agent to work with you. My preference is to use an agent living in an application like Visual Studio Code.

I have been using GitHub Copilot when it comes to contributing to Forem and it has been useful when it comes to navigating code bases, problem solving issues, etc.

When I heard about Gemma 4, I thought it would be a good idea to try it out as an AI-Agent. I specifically used the cloud version through Ollama since my hardware cannot handle an LLM locally, even when I tried the smaller versions of the model.

Setting up wasn't too difficult on my part. Since I had already tried Ollama before the release of Gemma 4, it was as easy as pulling the model using this command:

ollama pull gemma4:31b-cloud

After pulling the model, I selected it and was ready to go!

When using the model, I noticed there wasn't a big change compared to my usual workflow of using Copilot as my agent.

Both models asked me for permission before running commands, were able to solve the issue I requested, etc. To be fair, I wasn't using the "Continue" Visual Studio Code extension, which is quite popular for local LLMs. Also, Visual Studio Code is flexible about adding your own model alongside Copilot, so I wasn't surprised that Gemma 4 behaves similarly to Copilot.

However, there are a couple of things I noticed when using Gemma 4 that I believe are important to address.

Gemma 4 is Strict

The main difference between Gemma 4 and Copilot is that Gemma 4 is quite firm about the task it is given. I noticed that whenever I request a plan for tackling an issue, it provides a to-do list of the steps it plans to take to fix the issue.

For example, this is the prompt I gave to Gemma 4:

I have this issue. How do I approach this?

https://github.com/forem/forem/issues/23277

It outputs its findings like any other LLM. However, it does not act right away and instead lists out the next steps:

In cases like these, I just follow up with "Go for it" or "Proceed". Sometimes it does everything in the to-do list in one go, and in other cases it goes one by one, where I have to tell the Agent to "Do all the tasks". Do note that I did not change any settings while setting up Gemma 4; I just pulled the model and selected it in Visual Studio Code.

Compare that to Copilot, which takes action right away regardless of the prompt. There were cases where I forgot to change the mode from "Agent" to "Chat" when I was specifically asking the AI for a plan to tackle the issue. Instead of listing out the steps, it carried them out right away, which I did not want.

In all fairness, Gemma 4 and Copilot are from different companies (Google and Microsoft), and since Copilot is built into Visual Studio Code, it is fair to assume that its actions strictly follow the Copilot settings you have in Visual Studio Code.

However, it is always nice to see Gemma 4 listing out the to-do list and asking the user before proceeding. I believe it is a good default if you are the type of user who wants an AI's advice on what to do and then does it yourself from there.

Please Google I need this.

Gemma 4 is kind of Looping Itself

It is exactly what you are thinking. A common occurrence is that Gemma 4 reads a file from top to bottom and then rereads it again. Here is an example that I encountered when using Gemma 4:

I am not sure what causes this issue since it happens randomly.

The solution is to stop the Agent and send the same prompt again. This fixes the issue, though why it occurs in the first place remains a mystery.
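Since the loop shows up as the same action repeated over and over, it could in principle be spotted automatically. The sketch below is hypothetical: I do not know of a Visual Studio Code or Ollama setting that does this, and the action format, the file names, and the limit of 3 are all assumptions made up for this example.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# (tool name, argument), for example ("read_file", "some/file.rb").
Action = Tuple[str, str]


def first_repeated_action(
    actions: Iterable[Action], limit: int = 3
) -> Optional[Action]:
    """Return the first action seen `limit` times, or None if none repeats that often."""
    seen: Counter = Counter()
    for action in actions:
        seen[action] += 1
        if seen[action] >= limit:
            return action
    return None


# Made-up agent log: the same file gets read three times.
log = [
    ("read_file", "a.rb"),
    ("read_file", "b.rb"),
    ("read_file", "a.rb"),
    ("edit_file", "a.rb"),
    ("read_file", "a.rb"),
]
print(first_repeated_action(log))  # ('read_file', 'a.rb')
```

Rereading a file is sometimes legitimate (for example after an edit), so a check like this would only be a hint to stop the Agent and resend the prompt, not proof of a loop.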

Francis' Verdict on Gemma 4

This is my verdict on the use of Gemma 4 as an AI-Agent:

I believe Gemma 4 works really well for open source developers. It can navigate large code-bases and solve complex problems well for a model that is quite small.

Although there isn't a big difference between Gemma 4 and other models I have used, it is good to consider token usage. I would recommend Ollama when using Gemma 4. The cloud version is free to use and the tokens reset weekly, whereas Copilot resets monthly. If you rely heavily on AI-Agents, I would recommend the Ollama solution since it is more flexible. Even if you reach the daily limit of tokens, it resets every few hours.

I would highly recommend Gemma 4 as a great starting point for using an Agent when you want to contribute to open source. Just remember to monitor the response time, since there is a very good chance it loops itself and burns tokens in the process.

Summary

We have looked at Gemma 4 from several angles: setting it up with Ollama, testing the model's responses, prototyping, and using it as an Agent. Here are the main things to consider based on what I have shared above.

1. Convenience

Like any other local model, you have to consider not only whether you are comfortable downloading a model locally, but also whether you have enough compute power to run an LLM. Although there are options to use an LLM in the cloud (for Ollama at least), a local LLM is very convenient for developers who do not want to worry about:

Token limits and paid services.
The environmental impact of data centers in the age of AI.

2. Performance

Speaking of compute power, running an LLM locally, even with the smaller Gemma 4 models, still needs a lot of power. We saw in Elmar's experience that Gemma 4 took a while to produce an output compared to a cloud model. If you are comfortable with the wait time, then it shouldn't be a problem!

3. Context Output

Every model behaves differently, whether cloud or local, and Gemma 4 is no different. If you are using Gemma 4, make sure your prompt is specific to your needs. Being as specific as possible is best practice for prompting in general.

4. Role as an AI-Agent

It is common for developers nowadays to use a local model to run as an Agent. Using Gemma 4 via Ollama has been a great experience for me. Do note that you may encounter issues when using Gemma 4. I would recommend monitoring Gemma 4's token usage in case it loops itself and burns unnecessary tokens.

Verdict: Overall, if you have enough power to run a local LLM, would like a smaller model that is still powerful for its size, and know the limitations of local LLMs in general, then I believe Gemma 4 is right for you!

I hope that this Multiversal Analysis has helped you to determine if Gemma 4 is right for you! If you would like to learn more about Gemma 4 in detail, you can read here: https://deepmind.google/models/gemma/gemma-4/

Note: Gemma 4 comes in different model sizes, and the conclusions in these verdicts may not match your own experience. However, I believe these experiences capture what to expect when using Gemma 4 and local AI models in general.

Visit the Google Documentation to learn more: https://ai.google.dev/gemma/docs/core

Thank you for Reading!

If you have made it this far, thank you for taking the time to read this article! I hope you have learned something based on our experiences using Gemma 4 in different areas of development.

I would like to give credit to the Virtual Coffee Group where I used their "#co-working" room to meet with Elmar Chavez (@codingwithjiro), Konark Sharma (@konark_13), and Julien Avezou (@javz) for the first time outside of DEV and would love to work with them again in the future!

Check out Virtual Coffee below to see what they do!

Virtual Coffee

Virtual Coffee is a laid-back conversation with developers twice a week. It's the conversation that keeps going in slack. It's the online events that support developers at all stages of the journey.

Feel free to send us love by following us individually and our DEVenger org!

The DEVengers

This is an organization where we assemble the greatest minds the community has ever known! The question isn't "How fast can I code?". The question is "What will I learn from the farthest below?".

FrancisTRᴅᴇᴠ (っ◔◡◔)っ

Julien Avezou


Konark Sharma


Elmar Chavez


Challenge #1: Did you find 4 hidden random videos in this article?

Challenge #2: How many "r's" are there in this post in total?

Challenge #3: Who are the anime characters in this article's cover image?

Challenge #4: Based on Challenge #3, who chose each of those anime characters, respectively?

Any questions/comments? Love to hear your thoughts on Gemma 4! What's your experience using Gemma 4, and do you recommend it?

Top comments (58)

Klaudia Grzondziel The DEVengers • May 22

Ahahah, I had the same issues running Gemma locally with Ollama – my computer slowly turned into a snail, everything felt super slow, and I had to close almost every app 😅 In the end, it completely froze anyway!

Good job with your multiversal analysis! That's a top example of collaboration!👏🏻

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 22

Hey Klaudia! Yea it is always a common issue when running any local model in particular. You just have to hope that it will be relatively fast as if you are using the Cloud Version (which I had to use for this case). I am surprised @codingwithjiro (Elmar) got his to run on a laptop. If I were to run on my laptop, it would be cooked.

Appreciate the comment! Glad you liked it :D

Elmar Chavez The DEVengers • May 22

Glad it's not just me @klaudiagrz. Thanks for reading!

UnitBuilds • May 26

Have you tried the E4B and E2B models? They're quite fast and easy to run. I used them for my agentic browser swarm using a custom MCP (albeit it dropped token drain by 80%, so extremely lightweight), to run concurrent instances. I got to 4 concurrent E2Bs on an 8 GB GPU running at 100+ TPS each using an RX 9060 XT and LM Studio using Vulkan (trying to get llama.cpp ROCm working).

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

Even with lower models, it tends to be the same result (for me at least and I am using a Desktop). Maybe if you get lucky? Not sure if she tried it yet, but would assume she had?

UnitBuilds • May 26

Well, that probably means you're running it wrong? Try LM Studio, then make sure you set the pipeline to use your native accelerator (CUDA/ROCm); if that's not supported, run Vulkan. Turn on KV Cache quantization to Q8 and give that a try. If it's still slow, turn on shared KV Cache, just be sure to scale your KV Cache accordingly for all your parallel runners.

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

I guess to be fair, the main start for people getting into local LLM is running it either in Ollama or in a plain terminal. Either way, it is running slow for both cases.

Never tried LM Studio surprisingly. Will give it a shot sometime in the future. Thanks for the suggestions!

UnitBuilds • May 26

It works quite well for me. Otherwise you can try an Ollama Docker container, just make sure that you test the WSL2 ability if you're on Nvidia.

UnitBuilds • May 26

Do you see high GPU usage? If so, how's your VRAM looking? If it overflows to 'shared memory' (RAM), or mmap (storage), every single token is dragged down by the need to swap the memory, which instantly tanks performance. Most things operate under the hood with llama.cpp, so make sure you're running the right one.

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

I would have to check, but I would assume so. Good to know!

UnitBuilds • May 26

Whenever I see an LLM that I know 'should' fit my VRAM chug, that's usually the culprit: either the wrong underlying architecture is active, or there's overflow. You can also check whether you have thinking on. I know with Qwen models it looks like it's slow, but actually it's because thinking goes into a separate channel, so it's not really logged accurately.

Julien Avezou The DEVengers • May 22

The Strawberry Test, “How many r’s are there in strawberry?” (There are three), is interesting. Why need 3 interpretations for that? That seems unnecessary.

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 22

I find that to be interesting as well since I ran it in Ollama with the latest Gemma 4 model and it gave me this:

@konark_13 How did you run Gemma 4?

Konark Sharma The DEVengers • May 23

I ran Gemma4 on the terminal and played and tested it. Am I using temu version of Gemma4? I think I need to check my model and then try it. haha

Julien Avezou The DEVengers • May 23

Haha yeah it would be interesting to observe if you get a similar output again

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

Yea ran the same thing on terminal and same outcome. How did you install Gemma 4?

Allan Kipruto • May 23

Really interesting breakdown — especially the way you frame Gemma 4 in terms of “when to use vs when not to use” across different application scales.

One thing I’ve noticed building with Gemma 4 (specifically e4b-it) is that the real advantage isn’t just capability, but deployability in constrained environments.

I’ve been working on an offline-first education system where Gemma 4 runs locally in classrooms (no cloud dependency). In that context, the “small but efficient model” argument becomes more important than raw benchmark performance.

For example, latency + affordability + offline inference matter more than peak reasoning ability when you’re trying to support real students in low-connectivity regions.

Curious if you think the tradeoff between “model power vs local deployability” will become a bigger deciding factor than benchmarks in the next wave of LLM adoption?

Elmar Chavez The DEVengers • May 23

I agree, a local, small, and efficient AI like Gemma 4 is good for areas with low connectivity. Personally, the first pro that comes to mind is its local capabilities, not its model power. What's important is that I can use AI while offline, and that is already a great feature in itself.

What's interesting would be the future local AI models that use less compute power. Imagine an efficient and reliable AI in a low-end device powered locally. This is perfect since not all people need big data centers from the cloud for everyday AI use.

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

Thanks Allan! Sorry for the late response!

I believe the "model power vs local deployability" tradeoff is important. It is already used as a deciding factor, and I think it will receive more attention. Benchmarks aren't a good way to measure AI capabilities, since there are cases where the data is artificially modified to hit those numbers instead of showing whether the AI is accessible and powerful enough for others to use. Hope that makes sense! Thanks again Allan :D

Konark Sharma The DEVengers • May 23

What a wonderful article. The collaboration and team-up was awesome. I learned a lot while discussing ideas and how to divide them up. We let loose on Gemma4 and tried every way possible to check its capabilities. If we missed any, next time we will bring something even better.

Awesome collaborating with you all. Thanks for the time and lessons @francistrdev, @codingwithjiro and @javz

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

Yea, thanks for sharing your experience with Gemma 4. It was fun to meet you guys on call (especially not knowing who is AI lol).

Thắng Thắng • May 26

I'm also a small-time dev, a junior programmer, and I'd like to join in researching some projects together 🥰

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26 • Edited

Sounds great, Thắng! Which research topics specifically?

S M Tahosin • May 24

This is a great, well-rounded breakdown. The open-weight space is moving so fast that it's hard to know which model fits the local-dev workflow best. Highlighting Gemma 4's specific strengths—especially its coding and multimodal capabilities—against the hardware requirements makes the decision-making process much clearer.

Elmar Chavez The DEVengers • May 24

Glad it helped one of your decisions Tahosin!

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

Indeed! It is always important not only to factor in the hardware needed to run local LLMs, but also to determine which model suits your needs! Thanks Tahosin :D

Suny Choudhary • May 27

This is a fun framing, but the practical question is exactly right.

Gemma 4 might be good for development, but I would not judge it only by benchmark numbers. For real dev work, the test is much messier: can it understand an existing codebase, follow project conventions, avoid over-editing, explain tradeoffs, and recover when the first attempt fails?

A model can look strong in isolated coding tasks and still struggle with repo-level context, dependency issues, tests, edge cases, and debugging across multiple files.

For me, the best use case for models like this is not “replace the developer.” It is fast scaffolding, code explanation, refactoring help, test generation, and catching obvious mistakes.

The real value depends on whether it reduces thinking friction without adding cleanup debt.

Elmar Chavez The DEVengers • May 27

@sunnysingh1997 that's a really mature and practical take. A model only delivers value when it helps a developer's thinking and goals for a project, because in reality the bottleneck still lies in the developer's decisions for the project. I'd say that as long as the model keeps the developer sane and productive without mental overload, that model is valuable enough.

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 27

Hey Suny! The big thing is to never judge an AI by its benchmarks, because there is a history of data being skewed to hit the target numbers.

"The real value depends on whether it reduces thinking friction without adding cleanup debt."

I agree! This is quite common for those using AI in general and I think it's a good idea to determine if an AI can do that. Thanks Suny for sharing :D

Andy Stewart • May 27

This multi-perspective review is remarkably grounded. On-device LLMs are never built in a vacuum; they are strictly bound by hardware constraints. Balancing edge-cloud boundaries, managing token loops, and handling contextual freezing on standard hardware with limited RAM requires a deterministic architectural mindset. Navigating these constraints is the exact engineering literacy every developer needs in the era of local AI.

Elmar Chavez The DEVengers • May 27

@lcmd007 I just really hope that local AIs will take way less compute power than what we currently have. That would be a complete game-changer for sure.

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 27

Thanks Andy! Adding to @codingwithjiro's point, it would be neat to need less compute power. I believe Google is currently researching how to maximize its potential without the need to build more data centers, which I think is why Gemma4 exists, though I might be wrong. Thanks again Andy!

Jasmine Park • May 23

SRE lens worth adding: model comparison without a golden eval suite and a drift monitor is theatre. We swapped Llama-3.3 for Gemma-3 on a classification surface and the win on benchmark turned into a 12% regression in production, because the training distribution differed from real traffic. Now we run a paired-comparison test: same 500 inputs on both models, scored against a human-labeled gold set, with a McNemar test on the disagreement vector. Plus an OTel recording rule that alerts on any model-swap-day classification distribution divergence. Without that, the benchmark numbers are just press releases.
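The paired-comparison idea described above can be sketched roughly as follows. This is a minimal illustration, not the commenter's actual pipeline: the function name `mcnemar_exact` and the toy labels are hypothetical, and it uses the exact (binomial) form of McNemar's test, since the comment does not say which variant was used.

```python
from math import comb


def mcnemar_exact(gold, preds_a, preds_b):
    """Exact two-sided McNemar test on paired model predictions.

    Returns (b, c, p_value), where
      b = inputs model A got right and model B got wrong,
      c = inputs model A got wrong and model B got right.
    Pairs where both models agree on correctness carry no information
    for this test and are ignored.
    """
    if not (len(gold) == len(preds_a) == len(preds_b)):
        raise ValueError("gold, preds_a and preds_b must have the same length")

    b = sum(1 for g, a, p in zip(gold, preds_a, preds_b) if a == g and p != g)
    c = sum(1 for g, a, p in zip(gold, preds_a, preds_b) if a != g and p == g)

    n = b + c
    if n == 0:
        return b, c, 1.0  # the models never disagree on correctness

    # Under the null hypothesis, each discordant pair favours A or B
    # with probability 0.5, so min(b, c) follows Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (made up): model A is right on all 10 inputs, model B on only 2.
gold = [1] * 10
model_a = [1] * 10
model_b = [0] * 8 + [1] * 2
print(mcnemar_exact(gold, model_a, model_b))
```

A small p-value suggests the two models differ in accuracy on the same inputs; it does not say anything about production drift, which is what the separate monitoring in the comment is for.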

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

Interesting share Jasmine!

Kirill • May 26

Interesting observation about Gemma being more "careful" as an agent.

I noticed something similar while integrating multiple LLMs into an audio-first product. Once the summaries became "good enough", the main differences stopped being raw intelligence and became things like tone, density, pacing and reliability under load.

That was a weird moment because it made the model feel more like one component inside a media pipeline rather than "the product".

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

Hey Kirill! Hope you are well. I am curious why you decided to use multiple LLMs in your audio product. I'm probably not understanding what your product does specifically with audio. Otherwise, thanks for sharing!! :D

Kirill • May 26

The audio part is actually the core idea 🙂

I built a small audio-first system where you can dump long reads into a Telegram bot and get back short spoken summaries for passive listening while walking, commuting, cooking, etc. So I ended up testing multiple LLMs not because I wanted "the smartest model", but because different models create noticeably different listening experiences once converted to speech.

Some feel more like concise radio hosts
Some feel more chaotic
Some ramble
Some compress information better

At some point the model itself stopped feeling like the product and started feeling more like casting different voices into the same media pipeline.

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

Oh that makes sense! Have you figured out the solution or are you still trying to? I am curious to see if there is a way to do this without an LLM summarizing it (I know it's possible, but can't pinpoint it).

Kirill • May 26

I suspect the funny part is that once summaries become "good enough", users stop caring how the summary was technically produced. At that point they care more about:
Does playback feel seamless?
Can I consume this while doing something else?
Does the pacing feel natural?
Does the voice become mentally tiring after 20 minutes?
Does the app interrupt my flow?

That was the weird realization for me - the product gradually became less about "AI summarization" and more about minimizing cognitive friction around information consumption

FrancisTRᴅᴇᴠ (っ◔◡◔)っ The DEVengers • May 26

Yea, fair enough. It's all about how the users are using the product and ensuring their use cases are met. Obviously, you can't please everyone, but that's the reality of it.
