GroverTek

Running Gemma 4 Locally with Ollama and OpenCode

First steps

The usual first step in getting Gemma 4 running with Ollama is to pull the model:

ollama pull gemma4:e4b

See the available models and select the correct version for your system.

The e4b variant is a good starting point if your hardware can support it.

Use the ollama list command to ensure your version is now available to Ollama.
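If you prefer to check from a script, here is a quick sketch (it assumes ollama is on your PATH and just greps for the model name used above):

```shell
# confirm the pulled model shows up in the local model list;
# print a notice instead if ollama is not installed
if command -v ollama >/dev/null 2>&1; then
  ollama list | grep gemma4 || echo "gemma4 not found in local models"
else
  echo "ollama not found on PATH"
fi
```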

Testing

Now, run the model to ensure it works as expected:

ollama run gemma4:e4b

Ask a simple question or just say "Hello", then use /bye to exit.

Immediately run ollama ps. You should see something like this:

NAME          ID              SIZE     PROCESSOR    CONTEXT    UNTIL              
gemma4:e4b    c6eb396dbd59    10 GB    100% GPU     4096       4 minutes from now  

Pay close attention to that CONTEXT value. If you see 4096 like this, Ollama is using the default 4K context window, and that will bite you when you try to work with the model in OpenCode. A symptom of the small context window is the model constantly replying with "Just let me know what you want to do", or similar. The cause is that the system prompts eat up the bulk of the available context space, so your actual prompt gets truncated or dropped entirely. To fix this we need a larger context window.
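You can also inspect a model's metadata before running it with `ollama show`; recent Ollama versions report the model's context length there (a sketch - the exact output format varies by Ollama version):

```shell
# print model metadata, including the context length on recent versions;
# print a notice instead if ollama is not installed
if command -v ollama >/dev/null 2>&1; then
  ollama show gemma4:e4b
else
  echo "ollama not found on PATH"
fi
```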

OpenCode supports specifying a context window size in its configuration, but I have not seen that work with Ollama-based models. Instead, we need to set the context window size within Ollama itself: we create a new version of the model with the desired context size, then use that model in OpenCode. The easiest way to do this is with Ollama's interactive prompt.

ollama run gemma4:e4b

# then within the Ollama model prompt run these commands:
/set parameter num_ctx 32768
/save gemma4:e4b-32k
/bye

# then confirm the new model is in place
ollama list

A note about the num_ctx value: it should be a power of 2. In this case a 32K context window is 32768 tokens. You can experiment for your use case/needs to see if a smaller 16K window would be sufficient, or if you need a larger 64K or 128K window. Keep in mind that bigger values mean more VRAM / memory usage - adjust for your hardware as needed.
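To get a feel for the memory cost, here is a rough back-of-envelope sketch. The layer count, KV-head count, and head dimension below are illustrative placeholders, not Gemma's actual architecture - substitute real numbers for your model if you know them:

```shell
# rough KV-cache estimate: 2 (K and V) * layers * kv_heads * head_dim
# * context_length * bytes_per_element (2 for fp16)
# NOTE: architecture numbers are placeholders, not Gemma's real config
layers=32; kv_heads=8; head_dim=128; bytes=2
for ctx in 4096 16384 32768 65536; do
  size=$((2 * layers * kv_heads * head_dim * ctx * bytes))
  echo "num_ctx=$ctx -> ~$((size / 1024 / 1024)) MiB of KV cache"
done
```

The takeaway: the KV cache grows linearly with num_ctx, so jumping from the 4K default to 32K multiplies that slice of VRAM usage by eight.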

The /save model name is arbitrary - give it a unique name that works for you. I normally just append "-32k" or whatever context size I've adjusted to.
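If you'd rather script this than type into the interactive prompt, the same variant can be built from a Modelfile (the names here match the example above; adjust as needed):

```shell
# bake num_ctx into a new model variant via a Modelfile
cat > Modelfile <<'EOF'
FROM gemma4:e4b
PARAMETER num_ctx 32768
EOF

# build the variant and confirm it exists (skipped if ollama is not installed)
if command -v ollama >/dev/null 2>&1; then
  ollama create gemma4:e4b-32k -f Modelfile
  ollama list
fi
```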

Using the model in OpenCode

I assume you are familiar with OpenCode, so I will skip most of the tutorial-level items.

We need to tell OpenCode the new model is an available option. We do that by adding an entry into the opencode.json file - either globally (mine is at ~/.config/opencode/opencode.json) or in your specific project folder.

{
  "$schema": "https://opencode.ai/config.json",
  "default_agent": "plan",
  "compaction": {
    "auto": true,
    "prune": true,
    "reserved": 8192
  },
  "provider": {
    "ollama": {
      "name": "Ollama",
      "npm": "@ai-sdk/openai-compatible",
      "options": {
        "baseURL": "http://localhost:11434/v1"
      },
      "models": {
        "gemma4:e4b-32k": {
          "name": "gemma4:e4b-32k",
          "_launch": true,
          "id": "gemma4:e4b-32k",
          "tool_call": true,
          "options": {
            "temperature": 0.1
          },
          "maxTokens": 16384
        },
        "ministral-3:8b-OC32k": {
          "_launch": true,
          "id": "ministral-3:8b-OC32k",
          "name": "Ministral 3 8B (32k)",
          "options": {
            "temperature": 0.1
          },
          "maxTokens": 16384
        }        
      }
    },
  }
}

This example may not match your specific setup. The important part is where the gemma4:e4b-32k model is defined. You can copy/paste that block and then just modify it to fit your system if needed.

  • "name" can be made more friendly if you'd like - e.g. "Gemma 4 (32k)". This is what you will see when selecting a model in the OpenCode TUI.
  • "id" is what you use to reference the particular model - for example if you call opencode run PROMPT --model gemma4:e4b-32k from the command line.
  • "_launch" - I don't know if this is really needed, but I have had better luck with it set to "true". When you call ollama launch opencode, this is the value that gets set.
  • "tool_call" - if this is not defined and true, Gemma will run but will simply stop when it tries to use the "task" tool call.
  • "options" - set these as needed for your use case. There is plenty of documentation on the available options.
  • "maxTokens" - set this to a reasonable size that covers the output you expect from a prompt.
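Before launching OpenCode, it's worth confirming that the baseURL from the config actually responds. Ollama's OpenAI-compatible API exposes a model listing at /v1/models (a sketch; assumes curl is available and Ollama is running on its default port):

```shell
# list models through the OpenAI-compatible endpoint OpenCode will use;
# fall back to a message if curl is missing or Ollama is not running
if command -v curl >/dev/null 2>&1; then
  curl -s http://localhost:11434/v1/models || echo "Ollama does not appear to be running on :11434"
else
  echo "curl not found on PATH"
fi
```

The gemma4:e4b-32k variant should appear in that listing; if it doesn't, recheck the /save step above.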

Using OpenCode

Once you have the opencode.json file configured you can launch opencode:

cd project/folder
opencode

This launches the OpenCode TUI. Once in there, open the model selection tool - either via ctrl-p or ctrl-x m. I type "ollama" to filter to my Ollama-based models. You should now see the gemma4:e4b-32k model listed. Select it.

Now you are using the local Gemma model. Interact with it as needed - ask questions or build your empire. Performance will only be as good as your hardware, and you will hit some issues you wouldn't on a cloud instance of the model, but for most use cases this should be up to the task. On a system with 16 GB of VRAM the performance is reasonable, with only a second or two of delay for a response once the model has been loaded.

Observations

I've only just started using Gemma 4 for real-world tasks. It is capable, but I can see it needs a little more guidance than other models - like the default Zen Big Pickle model - though that is not an apples-to-apples comparison either. The real test will come when I try to use it with my market analysis project.

Conclusion

I hope this has been helpful. Let me know if my information is incorrect or can be improved.

What's your experience with running LLMs locally? Let me know in the comments.
