
Enny Rodríguez

My Local Copilot: Gemma 4 + Open WebUI + OpenHands for Coding Without Leaving My Machine


This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Note: this post describes a real local architecture I use for development. Exact model names in Ollama, Hugging Face or Kaggle may vary depending on the runtime you use. The important part is not memorizing one command, but understanding how to separate chat, reasoning, multimodal context, code execution and repositories on your own machine.

For a long time, I used local models as if they were just another chat window.

I pasted an error, copied the answer, went back to my editor, ran tests, copied the next error, and repeated the loop.

That works, but it leaves a lot on the table.

What makes Gemma 4 interesting to me is not only that it is an open model family with multimodal capabilities and variants that can target different hardware profiles. What makes it interesting is that it lets me think about a different kind of setup: an environment where the model is not isolated in a tab, but connected to a local development workflow.

My goal for this experiment was simple:

I want a development copilot that runs locally, can reason with me, can understand visual and textual context, can read project files, and, when it makes sense, can act on a repository without sending my entire codebase to an external service.

To do that, I built a stack with a few pieces that complement each other:

  • Gemma 4 as the open model family for reasoning, explanation and assistance.
  • Ollama running natively on macOS to use the local hardware efficiently.
  • Open WebUI as the general interface for chat, model comparison, multimodal input and image generation.
  • OpenHands as the development agent that can read files, use a terminal and work on repositories.
  • GitHub and GitLab as the real source of issues, pull requests, merge requests and product context.

The main idea is this:

Gemma 4 becomes much more useful when it stops being "a model in a box" and becomes part of a local development architecture.

The Architecture

The stack separates responsibilities. Ollama runs on the host because, on Apple Silicon Macs, that is the practical way to take advantage of the local hardware. The interfaces run in Docker. Open WebUI is where I think, compare, inspect visual context and generate supporting images. OpenHands is where I move from conversation to action.

That separation changes the experience:

  • If I need to think, summarize, compare approaches or work with images, I use Open WebUI.
  • If I need the model to read files, propose changes and run commands, I use OpenHands.
  • If the task comes from real work, I start from GitHub or GitLab and bring the context into my local workspace.
  • If I want to change the model, I do it through Ollama without redesigning the rest of the stack.

Why Gemma 4 Fits This Workflow

Google introduced Gemma 4 as an open model family with variants for different hardware and use cases. That matters for local development because not every task needs the same model.

For my workflow, four capabilities are especially relevant.

First, model size becomes a routing decision. Sometimes I want a quick answer about a function. Sometimes I want a deeper review of a multi-module change. Those are not the same task.

Second, longer context changes how a model can work with code. A useful coding assistant needs to understand conventions, nearby files, previous decisions and test structure.

Third, agents need more than good text generation. A coding agent has to hold instructions, use tools, read results and correct itself. The model matters, but the surrounding architecture matters too.

Fourth, multimodality changes how software tasks are described. Sometimes the context is not in a .py or .ts file. It is a broken UI screenshot, a diagram, a wireframe, a generated asset, a chart or a screenshot of an error. Open WebUI gives me a natural entry point for that material before turning it into a development task.

My Local Setup

My setup uses Docker Compose for the interfaces and keeps Ollama running directly on the host.

The key detail is that OpenHands talks to Ollama through the OpenAI-compatible endpoint:

[llm]
model = "openai/gemma4:e4b"
base_url = "http://host.docker.internal:11434/v1"
ollama_base_url = "http://host.docker.internal:11434"
api_key = "local-llm"
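Before pointing OpenHands at that endpoint, I confirm the OpenAI-compatible API actually answers. A minimal check with curl, run from the host (from inside a container the host name would be host.docker.internal instead of localhost), using the gemma4:e4b tag from the config above:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4:e4b",
    "messages": [{"role": "user", "content": "Reply with OK if you can read this."}]
  }'

If that returns a normal chat completion, OpenHands has everything it needs; if it does not, the problem is in the runtime, not in the agent.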

In my configuration repo, this same pattern already works with other local models. For Gemma 4, the conceptual change is to replace the model with the variant I want to test: a smaller one for latency, a stronger one for planning, or a larger one for architectural review.

I also keep multiple models available in OpenHands. I do not use one model for everything. I can start with a fast variant for inspection, move to a stronger variant for implementation planning, and reserve a larger variant for decisions where the cost of being wrong is higher.
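A quick way to check which variants are actually installed before switching lanes (plain Ollama commands, nothing specific to this stack):

ollama list
# or, through the HTTP API:
curl http://localhost:11434/api/tags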

Open WebUI as the Multimodal Lane

I do not use Open WebUI only as a nicer chat UI. In my workflow it has three roles:

  • Technical chat: discuss a bug, explain a module, compare implementation approaches.
  • Multimodal input: upload screenshots, diagrams, error captures, UI images or visual material that helps describe a task.
  • Image generation: create quick assets, documentation visuals, cover images or architecture illustrations.

This is useful because many real tasks start visually:

  • "This component looks broken."
  • "This onboarding flow is confusing."
  • "This chart does not explain the data."
  • "This error appears on screen after checkout."

Instead of manually translating all of that into text, I use Open WebUI to turn visual material into actionable context.

For images, my stack can use Ollama's OpenAI-compatible API from Open WebUI. I also keep a separate ComfyUI lane for more controlled image workflows. I do not mix that with OpenHands: multimodal reasoning and image generation live in Open WebUI; code editing lives in OpenHands.

The Workflow

The pattern that works best for me is not asking the agent to do everything at once. I use an explicit workflow, and it often starts in GitHub or GitLab.

Open WebUI and OpenHands do not play the same role.

Open WebUI is the table where I lay out reasoning and multimodal context. OpenHands is my workbench. GitHub and GitLab are the real task queue.

GitHub and GitLab as Workflow Inputs

There is a big difference between "trying a model" and "working with a copilot." The difference is where tasks come from.

In my case, many tasks already exist as:

  • GitHub issues;
  • GitLab issues;
  • pull requests with pending review comments;
  • merge requests with feedback;
  • bugs reported with screenshots;
  • technical discussions that need to become code changes.

The flow is straightforward: a task starts as an issue or a review thread, I synthesize it into acceptance criteria in Open WebUI, hand a scoped version to OpenHands on a local branch, and bring the reviewed diff back to the pull request or merge request.

This helps me avoid vague prompts. Instead of telling the agent "improve this project," I start from a concrete task that already has social and product context: who asked for it, why it matters, what was discussed, which files it may touch and how it will be reviewed.

Example: From Bug Report to Local Patch

Suppose I have this bug:

The search endpoint returns duplicate results when the user sends the same filter with different casing.

In Open WebUI, I start broadly:

I am working on a backend with search endpoints.
There is a bug: if the user sends repeated filters with different casing,
the endpoint returns duplicate results.

Before touching code, give me an investigation plan:
- which files would you look for
- which tests would you expect to find
- which edge cases should be covered

Gemma 4 does not need to touch the repository yet. I only want help thinking.

Then I move to OpenHands with a more concrete task:

Work in /workspace/my-repo.

Goal:
Fix the bug where repeated filters with different casing generate duplicate results.

Constraints:
- Do not change the public API.
- Keep the existing project style.
- Add or adjust focused tests.
- Run the relevant suite before finishing.

Deliverable:
- Summary of changed files.
- Short explanation of the fix.
- Commands executed and their result.

That prompt change is intentional. I do not say "fix it" in a generic way. I give context, boundaries and a verifiable deliverable.

If the bug comes from GitHub or GitLab, I add one more layer:

Remote context:
- Issue: https://github.com/org/repo/issues/123
- Base branch: main
- Suggested work branch: fix/search-filter-deduplication

Read the issue as the functional specification.
If there is ambiguity between the issue and the current code,
prioritize existing behavior and call out the question in the final summary.

When the issue includes screenshots, I inspect them first in Open WebUI with Gemma 4. That lets me turn visual evidence into acceptance criteria before asking OpenHands to edit files.

How I Choose a Gemma 4 Variant

I do not think about models as a ladder where "bigger always wins." I think in lanes.

| Task type | Gemma 4 variant I would try first | Why |
|---|---|---|
| Quick chat, classification, short summaries | E2B | Low latency and a good fit for simple tasks |
| Screenshots, diagrams, UI explanation, task drafting | E4B | Good balance for multimodal reasoning and general assistance |
| Explaining code, reviewing functions, drafting tests | E4B / 26B A4B | Depends on the size of the change and the context |
| Medium refactors, multi-file debugging | 26B A4B | More capacity without always jumping to the heaviest model |
| Architecture review, long context, complex decisions | 31B | When quality matters more than latency |

This table is not a universal truth. It is a practical starting point. Local hardware, quantization, runtime and configured context size can change the experience a lot.

In OpenHands, I like having more than one option configured because the agent's behavior changes with the model. A smaller variant may be enough for short inspection tasks. For multi-module planning, I prefer a stronger one. For architectural review, I accept more latency if the answer is more careful.

My Prompt Template for Local Agents

This is the structure I use most often with OpenHands:

Context:
I am in an existing repository. Read before editing.
The task comes from [GitHub/GitLab issue or PR/MR].

Goal:
[describe the expected result in one sentence]

Constraints:
- Keep existing patterns.
- Do not do unrelated refactors.
- Do not change global configuration unless required.
- If there is ambiguity, explain the decision.

Verification:
- Run the related tests.
- If something cannot be run, explain why.

Deliverable:
- Changed files.
- Summary of the change.
- Commands executed.
- Link or reference to the remote task.
- Risks or follow-ups.

With local models, this structure helps a lot. It reduces ambiguity and pushes the agent to behave like a software collaborator instead of a text generator.

The Real Cycle I Use

stateDiagram-v2
    [*] --> Think
    Think: Open WebUI\nunderstand problem\ntext + images
    Think --> Scope
    Scope: small task\nissue/PR/MR + constraints
    Scope --> Act
    Act: OpenHands\nselected Gemma model\nread edit run
    Act --> Review
    Review: inspect diff\nvalidate tests
    Review --> Commit: if good
    Review --> Scope: if context is missing
    Commit --> [*]

The key is keeping tasks small. A local agent can be very useful, but it is still probabilistic software. My rule is simple: if I could not review the diff in a few minutes, the task is too large.
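A small habit that keeps me honest about that rule is checking the shape of the diff before reading it (main is the base branch, as in the earlier example):

# From the working branch: how many files and lines changed?
git diff --stat main...HEAD

If the file list alone scrolls past one screen, the task was scoped too large and goes back to the Scope step.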

What Worked Well

The best part of the setup is the feeling of control.

I can start the local stack, switch models, test prompts, share only the folders I want and shut everything down when I am done. For private projects, prototypes and learning, that reduced friction matters.

I also like having separate modes:

  • Multimodal conversation mode: I think with Gemma 4 in Open WebUI using text, images, screenshots and diagrams.
  • Visual generation mode: I create images or supporting assets from Open WebUI when a post, documentation page or product task needs them.
  • Action mode: I delegate a concrete task to OpenHands and choose the Gemma model that best fits.
  • Repository mode: I bring context from GitHub or GitLab and turn it into a local branch with a reviewable diff.

That boundary prevents every conversation from becoming an execution. Not every prompt deserves filesystem access.

What Still Requires Care

Not everything is automatic.

Local agents are sensitive to:

  • prompt quality;
  • configured context size;
  • quantization choices;
  • hardware latency;
  • runtime stability;
  • the model's ability to follow tool instructions.

I also learned that it is useful to keep fallback models. In my stack, I keep coding-specialized models next to the general model. That lets me compare answers or switch lanes if a specific task gets stuck.

Another lesson: connected repositories speed things up, but they also require discipline. A GitHub or GitLab issue can carry a lot of context, but not all of that context is specification. Sometimes it includes opinions, old assumptions or contradictory comments. That is why I like passing through Open WebUI first to synthesize acceptance criteria before opening the OpenHands lane.

Local Security: Not Magic, But Better Boundaries

Running locally does not automatically mean "secure." It means I have more control over where the code lives and which processes can read it.

My basic rules are:

  • expose Open WebUI and OpenHands only on 127.0.0.1;
  • mount a scoped working directory, not the whole disk;
  • review diffs before committing;
  • do not give real secrets to the agent;
  • use GitHub/GitLab tokens with minimum required permissions when needed;
  • avoid mounting global credentials into the sandbox;
  • use disposable repositories for aggressive experiments;
  • keep logs and configuration outside the application repository.

Privacy does not come from one tool. It comes from designing the workflow with clear limits.
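One small check I run after bringing the stack up, to confirm the interfaces really are bound to the loopback interface only (macOS, ports from my compose file):

# List listening TCP sockets and look for the UI ports
lsof -iTCP -sTCP:LISTEN -P -n | grep -E '3000|3001|8188'

Each of those ports should show 127.0.0.1, not *, in the address column.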

Why This Matters

The discussion around open models often stays at the benchmark level. Benchmarks matter, but as a developer I care about a more practical question:

What can I do today, on my own machine, with enough quality and control to actually change my development workflow?

Gemma 4 points directly at that question. Not because it automatically replaces every closed model, but because it makes a category of local setups more viable: assistants that can reason over text and images, generate supporting material, work with repositories and integrate with open tools.

For me, the near future is not one giant cloud copilot. It is a combination of:

  • open models;
  • local runtimes;
  • hackable interfaces;
  • multimodal inputs;
  • agents with limited permissions;
  • repositories connected to real tasks;
  • developers who understand their own architecture.

Gemma 4 fits that direction well.

Base Commands for the Stack

My local flow starts with Ollama on the host:

OLLAMA_CONTEXT_LENGTH=32768 \
OLLAMA_KEEP_ALIVE=30m \
OLLAMA_HOST=0.0.0.0:11434 \
ollama serve

Then I pull the model I want to test:

ollama pull gemma4:e4b

I can also keep multiple variants available and choose by task:

ollama pull gemma4:e2b
ollama pull gemma4:e4b
ollama pull gemma4:26b-a4b
ollama pull gemma4:31b

If your runtime publishes the variants under different names, replace those identifiers with the correct names for Ollama, Hugging Face or Kaggle.
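Before wiring a variant into any interface, I run a quick smoke test straight from the terminal (same tag as above; adjust it to your runtime's naming):

ollama run gemma4:e4b "Summarize what a race condition is in two sentences."

If the answer comes back at a reasonable speed, the variant is worth adding to the Open WebUI and OpenHands lanes.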

For image generation from Open WebUI, my stack uses a local OpenAI-compatible endpoint. For example:

ollama pull x/flux2-klein:4b
ollama pull x/z-image-turbo

Then I start the interfaces:

docker compose up -d open-webui openhands comfyui

Local URLs:

  • Open WebUI: http://localhost:3000
  • OpenHands: http://localhost:3001
  • ComfyUI: http://localhost:8188
  • Ollama API: http://localhost:11434

To bring in tasks and branches from remote repositories:

git clone git@github.com:org/repo.git
git clone git@gitlab.com:org/repo.git

You can also use gh or glab to fetch issues, check out PRs/MRs or inspect review comments from the terminal.
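For example (the issue and PR/MR numbers here are placeholders):

# GitHub: read an issue and check out a pull request locally
gh issue view 123 --repo org/repo
gh pr checkout 456

# GitLab: the same idea with glab
glab issue view 123
glab mr checkout 456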

Minimal OpenHands Configuration

[core]

[llm]
model = "openai/gemma4:e4b"
base_url = "http://host.docker.internal:11434/v1"
ollama_base_url = "http://host.docker.internal:11434"
api_key = "local-llm"

To switch models, I keep explicit model values for the task:

# Fast inspection
model = "openai/gemma4:e2b"

# General balance
model = "openai/gemma4:e4b"

# More complex changes
model = "openai/gemma4:26b-a4b"

# Deeper review
model = "openai/gemma4:31b"

In Docker Compose, the important part is mounting the workspace and pointing OpenHands to the local endpoint:

openhands:
  image: docker.openhands.dev/openhands/openhands:1.6
  ports:
    - "127.0.0.1:3001:3000"
  environment:
    RUNTIME: "docker"
    LLM_MODEL: "openai/gemma4:e4b"
    LLM_BASE_URL: "http://host.docker.internal:11434/v1"
    LLM_OLLAMA_BASE_URL: "http://host.docker.internal:11434"
    LLM_API_KEY: "local-llm"
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
    - ./workspace:/workspace:rw
    - /Users/me/projects:/workspace/host-projects:rw

Image Generation in Open WebUI

In Open WebUI, I enable image generation against my local endpoint:

ENABLE_IMAGE_GENERATION=true
IMAGE_GENERATION_ENGINE=openai
IMAGES_OPENAI_API_BASE_URL=http://host.docker.internal:11434/v1
IMAGES_OPENAI_API_KEY=ollama
IMAGE_GENERATION_MODEL=x/flux2-klein:4b

Final Mental Model

The most important part of this mental model is the last piece: developer judgment.

The model accelerates. The agent executes. But the engineering judgment is still mine.

Closing

Gemma 4 is exciting because it lowers the barrier for building more useful local assistants. Not just chatbots. Not just demos. Real workflows where an open model can help understand text and images, generate supporting assets, modify code and validate software inside a machine I control.

My conclusion after building this setup is simple: the leap is not only in the model. It is in connecting the model to a well-designed workflow.

Gemma 4 + Open WebUI + OpenHands + GitHub/GitLab is one concrete way to do that.
