<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Enny Rodríguez</title>
    <description>The latest articles on DEV Community by Enny Rodríguez (@theelmix).</description>
    <link>https://dev.to/theelmix</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3920032%2Fa3392222-c55f-426f-9b9e-3f6e7f8340ca.png</url>
      <title>DEV Community: Enny Rodríguez</title>
      <link>https://dev.to/theelmix</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/theelmix"/>
    <language>en</language>
    <item>
      <title>My Local Copilot: Gemma 4 + Open WebUI + OpenHands for Coding Without Leaving My Machine</title>
      <dc:creator>Enny Rodríguez</dc:creator>
      <pubDate>Fri, 08 May 2026 22:42:23 +0000</pubDate>
      <link>https://dev.to/theelmix/my-local-copilot-gemma-4-open-webui-openhands-for-coding-without-leaving-my-machine-180j</link>
      <guid>https://dev.to/theelmix/my-local-copilot-gemma-4-open-webui-openhands-for-coding-without-leaving-my-machine-180j</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: this post describes a real local architecture I use for development. Exact model names in Ollama, Hugging Face or Kaggle may vary depending on the runtime you use. The important part is not memorizing one command, but understanding how to separate chat, reasoning, multimodal context, code execution and repositories on your own machine.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  My Local Copilot: Gemma 4 + Open WebUI + OpenHands for Coding Without Leaving My Machine
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfa2clwdus9nw1reo9mz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfa2clwdus9nw1reo9mz.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For a long time, I used local models as if they were just another chat window.&lt;/p&gt;

&lt;p&gt;I pasted an error, copied the answer, went back to my editor, ran tests, copied the next error, and repeated the loop.&lt;/p&gt;

&lt;p&gt;That works, but it leaves a lot on the table.&lt;/p&gt;

&lt;p&gt;What makes Gemma 4 interesting to me is not only that it is an open model family with multimodal capabilities and variants that can target different hardware profiles. What makes it interesting is that it lets me think about a different kind of setup: an environment where the model is not isolated in a tab, but connected to a local development workflow.&lt;/p&gt;

&lt;p&gt;My goal for this experiment was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I want a development copilot that runs locally, can reason with me, can understand visual and textual context, can read project files, and, when it makes sense, can act on a repository without sending my entire codebase to an external service.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To do that, I built a stack with a few pieces that complement each other:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4&lt;/strong&gt; as the open model family for reasoning, explanation and assistance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; running natively on macOS to use the local hardware efficiently.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI&lt;/strong&gt; as the general interface for chat, model comparison, multimodal input and image generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenHands&lt;/strong&gt; as the development agent that can read files, use a terminal and work on repositories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub and GitLab&lt;/strong&gt; as the real source of issues, pull requests, merge requests and product context.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The main idea is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Gemma 4 becomes much more useful when it stops being "a model in a box" and becomes part of a local development architecture.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The stack separates responsibilities. Ollama runs on the host because, on macOS Apple Silicon, running natively is the practical way to use the GPU through Metal; Docker on macOS does not pass the GPU through to containers. The interfaces run in Docker. Open WebUI is where I think, compare, inspect visual context and generate supporting images. OpenHands is where I move from conversation to action.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxovt4ivj5a9jzy3b7mfv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxovt4ivj5a9jzy3b7mfv.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That separation changes the experience:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If I need to think, summarize, compare approaches or work with images, I use Open WebUI.&lt;/li&gt;
&lt;li&gt;If I need the model to read files, propose changes and run commands, I use OpenHands.&lt;/li&gt;
&lt;li&gt;If the task comes from real work, I start from GitHub or GitLab and bring the context into my local workspace.&lt;/li&gt;
&lt;li&gt;If I want to change the model, I do it through Ollama without redesigning the rest of the stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why Gemma 4 Fits This Workflow
&lt;/h2&gt;

&lt;p&gt;Google introduced Gemma 4 as an open model family with variants for different hardware and use cases. That matters for local development because not every task needs the same model.&lt;/p&gt;

&lt;p&gt;For my workflow, four capabilities are especially relevant.&lt;/p&gt;

&lt;p&gt;First, &lt;strong&gt;model size becomes a routing decision&lt;/strong&gt;. Sometimes I want a quick answer about a function. Sometimes I want a deeper review of a multi-module change. Those are not the same task.&lt;/p&gt;

&lt;p&gt;Second, &lt;strong&gt;longer context changes how a model can work with code&lt;/strong&gt;. A useful coding assistant needs to understand conventions, nearby files, previous decisions and test structure.&lt;/p&gt;

&lt;p&gt;Third, &lt;strong&gt;agents need more than good text generation&lt;/strong&gt;. A coding agent has to hold instructions, use tools, read results and correct itself. The model matters, but the surrounding architecture matters too.&lt;/p&gt;

&lt;p&gt;Fourth, &lt;strong&gt;multimodality changes how software tasks are described&lt;/strong&gt;. Sometimes the context is not in a &lt;code&gt;.py&lt;/code&gt; or &lt;code&gt;.ts&lt;/code&gt; file. It is a broken UI screenshot, a diagram, a wireframe, a generated asset, a chart or an error capture. Open WebUI gives me a natural entry point for that material before turning it into a development task.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Local Setup
&lt;/h2&gt;

&lt;p&gt;My setup uses Docker Compose for the interfaces and keeps Ollama running directly on the host.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faotep940wltgfoscs5zu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faotep940wltgfoscs5zu.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The key detail is that OpenHands talks to Ollama through the OpenAI-compatible endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[llm]&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gemma4:e4b"&lt;/span&gt;
&lt;span class="py"&gt;base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://host.docker.internal:11434/v1"&lt;/span&gt;
&lt;span class="py"&gt;ollama_base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://host.docker.internal:11434"&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"local-llm"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my configuration repo, this same pattern already works with other local models. For Gemma 4, the conceptual change is to replace the model with the variant I want to test: a smaller one for latency, a stronger one for planning, or a larger one for architectural review.&lt;/p&gt;

&lt;p&gt;I also keep multiple models available in OpenHands. I do not use one model for everything. I can start with a fast variant for inspection, move to a stronger variant for implementation planning, and reserve a larger variant for decisions where the cost of being wrong is higher.&lt;/p&gt;
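
&lt;p&gt;Before switching, a quick check of what is available locally helps (the exact tags depend on what your runtime publishes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# List the variants already pulled locally
ollama list

# Inspect a variant's parameters and context window before pointing OpenHands at it
ollama show gemma4:e4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;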

&lt;h2&gt;
  
  
  Open WebUI as the Multimodal Lane
&lt;/h2&gt;

&lt;p&gt;I do not use Open WebUI only as a nicer chat UI. In my workflow it has three roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical chat:&lt;/strong&gt; discuss a bug, explain a module, compare implementation approaches.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal input:&lt;/strong&gt; upload screenshots, diagrams, error captures, UI images or visual material that helps describe a task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image generation:&lt;/strong&gt; create quick assets, documentation visuals, cover images or architecture illustrations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxc3ydpxcx8v7tw9497t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxc3ydpxcx8v7tw9497t.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is useful because many real tasks start visually:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"This component looks broken."&lt;/li&gt;
&lt;li&gt;"This onboarding flow is confusing."&lt;/li&gt;
&lt;li&gt;"This chart does not explain the data."&lt;/li&gt;
&lt;li&gt;"This error appears on screen after checkout."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of manually translating all of that into text, I use Open WebUI to turn visual material into actionable context.&lt;/p&gt;

&lt;p&gt;For images, my stack can use Ollama's OpenAI-compatible API from Open WebUI. I also keep a separate ComfyUI lane for more controlled image workflows. I do not mix that with OpenHands: multimodal reasoning and image generation live in Open WebUI; code editing lives in OpenHands.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Workflow
&lt;/h2&gt;

&lt;p&gt;The pattern that works best for me is not asking the agent to do everything at once. I use an explicit workflow, and it often starts in GitHub or GitLab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv79jbsi3i39n88kvb9t3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv79jbsi3i39n88kvb9t3.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Open WebUI and OpenHands do not play the same role.&lt;/p&gt;

&lt;p&gt;Open WebUI is the table where I lay out reasoning and multimodal context. OpenHands is the workbench where the work gets done. GitHub and GitLab are the real task queue.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub and GitLab as Workflow Inputs
&lt;/h2&gt;

&lt;p&gt;There is a big difference between "trying a model" and "working with a copilot." The difference is where tasks come from.&lt;/p&gt;

&lt;p&gt;In my case, many tasks already exist as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub issues;&lt;/li&gt;
&lt;li&gt;GitLab issues;&lt;/li&gt;
&lt;li&gt;pull requests with pending review comments;&lt;/li&gt;
&lt;li&gt;merge requests with feedback;&lt;/li&gt;
&lt;li&gt;bugs reported with screenshots;&lt;/li&gt;
&lt;li&gt;technical discussions that need to become code changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The flow looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovqnfctnhpsq9k6ftaq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovqnfctnhpsq9k6ftaq6.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This helps me avoid vague prompts. Instead of telling the agent "improve this project," I start from a concrete task that already has social and product context: who asked for it, why it matters, what was discussed, which files it may touch and how it will be reviewed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Example: From Bug Report to Local Patch
&lt;/h2&gt;

&lt;p&gt;Suppose I have this bug:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The search endpoint returns duplicate results when the user sends the same filter with different casing.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Open WebUI, I start broadly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I am working on a backend with search endpoints.
There is a bug: if the user sends repeated filters with different casing,
the endpoint returns duplicate results.

Before touching code, give me an investigation plan:
- which files would you look for
- which tests would you expect to find
- which edge cases should be covered
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gemma 4 does not need to touch the repository yet. I only want help thinking.&lt;/p&gt;

&lt;p&gt;Then I move to OpenHands with a more concrete task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Work in /workspace/my-repo.

Goal:
Fix the bug where repeated filters with different casing generate duplicate results.

Constraints:
- Do not change the public API.
- Keep the existing project style.
- Add or adjust focused tests.
- Run the relevant suite before finishing.

Deliverable:
- Summary of changed files.
- Short explanation of the fix.
- Commands executed and their result.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That prompt change is intentional. I do not say "fix it" in a generic way. I give context, boundaries and a verifiable deliverable.&lt;/p&gt;

&lt;p&gt;If the bug comes from GitHub or GitLab, I add one more layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Remote context:
- Issue: https://github.com/org/repo/issues/123
- Base branch: main
- Suggested work branch: fix/search-filter-deduplication

Read the issue as the functional specification.
If there is ambiguity between the issue and the current code,
prioritize existing behavior and call out the question in the final summary.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the issue includes screenshots, I inspect them first in Open WebUI with Gemma 4. That lets me turn visual evidence into acceptance criteria before asking OpenHands to edit files.&lt;/p&gt;
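
&lt;p&gt;The prompt for that step can stay simple. An illustrative example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Here is a screenshot attached to the issue.

Describe:
- what the UI currently shows
- what the expected behavior appears to be
- which acceptance criteria would verify a fix

Output the acceptance criteria as a short checklist
I can paste into the OpenHands task.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;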

&lt;h2&gt;
  
  
  How I Choose a Gemma 4 Variant
&lt;/h2&gt;

&lt;p&gt;I do not think about models as a ladder where "bigger always wins." I think in lanes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task type&lt;/th&gt;
&lt;th&gt;Gemma 4 variant I would try first&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Quick chat, classification, short summaries&lt;/td&gt;
&lt;td&gt;E2B&lt;/td&gt;
&lt;td&gt;Low latency and a good fit for simple tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Screenshots, diagrams, UI explanation, task drafting&lt;/td&gt;
&lt;td&gt;E4B&lt;/td&gt;
&lt;td&gt;Good balance for multimodal reasoning and general assistance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explaining code, reviewing functions, drafting tests&lt;/td&gt;
&lt;td&gt;E4B / 26B A4B&lt;/td&gt;
&lt;td&gt;Depends on the size of the change and the context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Medium refactors, multi-file debugging&lt;/td&gt;
&lt;td&gt;26B A4B&lt;/td&gt;
&lt;td&gt;More capacity without always jumping to the heaviest model&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture review, long context, complex decisions&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;When quality matters more than latency&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This table is not a universal truth. It is a practical starting point. Local hardware, quantization, runtime and configured context size can change the experience a lot.&lt;/p&gt;

&lt;p&gt;In OpenHands, I like having more than one option configured because the agent's behavior changes with the model. A smaller variant may be enough for short inspection tasks. For multi-module planning, I prefer a stronger one. For architectural review, I accept more latency if the answer is more careful.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Prompt Template for Local Agents
&lt;/h2&gt;

&lt;p&gt;This is the structure I use most often with OpenHands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context:
I am in an existing repository. Read before editing.
The task comes from [GitHub/GitLab issue or PR/MR].

Goal:
[describe the expected result in one sentence]

Constraints:
- Keep existing patterns.
- Do not do unrelated refactors.
- Do not change global configuration unless required.
- If there is ambiguity, explain the decision.

Verification:
- Run the related tests.
- If something cannot be run, explain why.

Deliverable:
- Changed files.
- Summary of the change.
- Commands executed.
- Link or reference to the remote task.
- Risks or follow-ups.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With local models, this structure helps a lot. It reduces ambiguity and pushes the agent to behave like a software collaborator instead of a text generator.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Cycle I Use
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;stateDiagram-v2
    [*] --&amp;gt; Think
    Think: Open WebUI\nunderstand problem\ntext + images
    Think --&amp;gt; Scope
    Scope: small task\nissue/PR/MR + constraints
    Scope --&amp;gt; Act
    Act: OpenHands\nselected Gemma model\nread edit run
    Act --&amp;gt; Review
    Review: inspect diff\nvalidate tests
    Review --&amp;gt; Commit: if good
    Review --&amp;gt; Scope: if context is missing
    Commit --&amp;gt; [*]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is keeping tasks small. A local agent can be very useful, but it is still probabilistic software. My rule is simple: if I could not review the diff in a few minutes, the task is too large.&lt;/p&gt;
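
&lt;p&gt;A quick way to gauge that before reading line by line, reusing the branch name from the earlier example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# How big is the diff, really? A short stat output means a reviewable task
git diff --stat main...fix/search-filter-deduplication
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;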

&lt;h2&gt;
  
  
  What Worked Well
&lt;/h2&gt;

&lt;p&gt;The best part of the setup is the feeling of control.&lt;/p&gt;

&lt;p&gt;I can start the local stack, switch models, test prompts, share only the folders I want and shut everything down when I am done. For private projects, prototypes and learning, that reduced friction matters.&lt;/p&gt;

&lt;p&gt;I also like having separate modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal conversation mode:&lt;/strong&gt; I think with Gemma 4 in Open WebUI using text, images, screenshots and diagrams.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visual generation mode:&lt;/strong&gt; I create images or supporting assets from Open WebUI when a post, documentation page or product task needs them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action mode:&lt;/strong&gt; I delegate a concrete task to OpenHands and choose the Gemma model that best fits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository mode:&lt;/strong&gt; I bring context from GitHub or GitLab and turn it into a local branch with a reviewable diff.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That boundary prevents every conversation from becoming an execution. Not every prompt deserves filesystem access.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Still Requires Care
&lt;/h2&gt;

&lt;p&gt;Not everything is automatic.&lt;/p&gt;

&lt;p&gt;Local agents are sensitive to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prompt quality;&lt;/li&gt;
&lt;li&gt;configured context size;&lt;/li&gt;
&lt;li&gt;quantization choices;&lt;/li&gt;
&lt;li&gt;hardware latency;&lt;/li&gt;
&lt;li&gt;runtime stability;&lt;/li&gt;
&lt;li&gt;the model's ability to follow tool instructions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I also learned that it is useful to keep fallback models. In my stack, I keep coding-specialized models next to the general model. That lets me compare answers or switch lanes if a specific task gets stuck.&lt;/p&gt;
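
&lt;p&gt;For example, something like this (these tags are illustrative; any coding-focused local model your runtime offers plays the same role):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Coding-specialized fallbacks next to the general Gemma models
ollama pull qwen2.5-coder:7b
ollama pull deepseek-coder-v2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;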

&lt;p&gt;Another lesson: connected repositories speed things up, but they also require discipline. A GitHub or GitLab issue can carry a lot of context, but not all of that context is specification. Sometimes it includes opinions, old assumptions or contradictory comments. That is why I like passing through Open WebUI first to synthesize acceptance criteria before opening the OpenHands lane.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local Security: Not Magic, But Better Boundaries
&lt;/h2&gt;

&lt;p&gt;Running locally does not automatically mean "secure." It means I have more control over where the code lives and which processes can read it.&lt;/p&gt;

&lt;p&gt;My basic rules are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;expose Open WebUI and OpenHands only on &lt;code&gt;127.0.0.1&lt;/code&gt;;&lt;/li&gt;
&lt;li&gt;mount a scoped working directory, not the whole disk;&lt;/li&gt;
&lt;li&gt;review diffs before committing;&lt;/li&gt;
&lt;li&gt;do not give real secrets to the agent;&lt;/li&gt;
&lt;li&gt;use GitHub/GitLab tokens with minimum required permissions when needed;&lt;/li&gt;
&lt;li&gt;avoid mounting global credentials into the sandbox;&lt;/li&gt;
&lt;li&gt;use disposable repositories for aggressive experiments;&lt;/li&gt;
&lt;li&gt;keep logs and configuration outside the application repository.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Privacy does not come from one tool. It comes from designing the workflow with clear limits.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;The discussion around open models often stays at the benchmark level. Benchmarks matter, but as a developer I care about a more practical question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What can I do today, on my own machine, with enough quality and control to actually change my development workflow?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Gemma 4 points directly at that question. Not because it automatically replaces every closed model, but because it makes a category of local setups more viable: assistants that can reason over text and images, generate supporting material, work with repositories and integrate with open tools.&lt;/p&gt;

&lt;p&gt;For me, the near future is not one giant cloud copilot. It is a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;open models;&lt;/li&gt;
&lt;li&gt;local runtimes;&lt;/li&gt;
&lt;li&gt;hackable interfaces;&lt;/li&gt;
&lt;li&gt;multimodal inputs;&lt;/li&gt;
&lt;li&gt;agents with limited permissions;&lt;/li&gt;
&lt;li&gt;repositories connected to real tasks;&lt;/li&gt;
&lt;li&gt;developers who understand their own architecture.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Gemma 4 fits that direction well.&lt;/p&gt;

&lt;h2&gt;
  
  
  Base Commands for the Stack
&lt;/h2&gt;

&lt;p&gt;My local flow starts with Ollama on the host:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OLLAMA_CONTEXT_LENGTH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;32768 &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nv"&gt;OLLAMA_KEEP_ALIVE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;30m &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nv"&gt;OLLAMA_HOST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.0.0.0:11434 &lt;span class="se"&gt;\&lt;/span&gt;
ollama serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I pull the model I want to test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4:e4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I can also keep multiple variants available and choose by task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4:e2b
ollama pull gemma4:e4b
ollama pull gemma4:26b-a4b
ollama pull gemma4:31b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your runtime publishes the variants under different names, replace those identifiers with the correct names for Ollama, Hugging Face or Kaggle.&lt;/p&gt;

&lt;p&gt;For image generation from Open WebUI, my stack uses a local OpenAI-compatible endpoint. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull x/flux2-klein:4b
ollama pull x/z-image-turbo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I start the interfaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt; open-webui openhands comfyui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Local URLs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open WebUI: &lt;code&gt;http://localhost:3000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;OpenHands: &lt;code&gt;http://localhost:3001&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;ComfyUI: &lt;code&gt;http://localhost:8188&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Ollama API: &lt;code&gt;http://localhost:11434&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
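
&lt;p&gt;Before opening a browser, I can confirm each lane is answering. Ollama's &lt;code&gt;/api/tags&lt;/code&gt; endpoint lists the installed models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Quick liveness checks for each lane
curl -s http://localhost:11434/api/tags | head -c 200   # Ollama model list
curl -sI http://localhost:3000 | head -n 1              # Open WebUI
curl -sI http://localhost:3001 | head -n 1              # OpenHands
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;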

&lt;p&gt;To bring in tasks and branches from remote repositories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone git@github.com:org/repo.git
git clone git@gitlab.com:org/repo.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use &lt;code&gt;gh&lt;/code&gt; or &lt;code&gt;glab&lt;/code&gt; to fetch issues, check out PRs/MRs or inspect review comments from the terminal.&lt;/p&gt;
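
&lt;p&gt;For example (the issue and PR/MR numbers are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# GitHub: read an issue, then check out a pull request locally
gh issue view 123
gh pr checkout 45

# GitLab: the same idea with glab
glab issue view 123
glab mr checkout 45
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;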

&lt;h2&gt;
  
  
  Minimal OpenHands Configuration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[core]&lt;/span&gt;

&lt;span class="nn"&gt;[llm]&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gemma4:e4b"&lt;/span&gt;
&lt;span class="py"&gt;base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://host.docker.internal:11434/v1"&lt;/span&gt;
&lt;span class="py"&gt;ollama_base_url&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"http://host.docker.internal:11434"&lt;/span&gt;
&lt;span class="py"&gt;api_key&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"local-llm"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To switch models, I keep explicit &lt;code&gt;model&lt;/code&gt; values for the task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# Fast inspection&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gemma4:e2b"&lt;/span&gt;

&lt;span class="c"&gt;# General balance&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gemma4:e4b"&lt;/span&gt;

&lt;span class="c"&gt;# More complex changes&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gemma4:26b-a4b"&lt;/span&gt;

&lt;span class="c"&gt;# Deeper review&lt;/span&gt;
&lt;span class="py"&gt;model&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gemma4:31b"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Docker Compose, the important part is mounting the workspace and pointing OpenHands to the local endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openhands&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker.openhands.dev/openhands/openhands:1.6&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1:3001:3000"&lt;/span&gt;
  &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;RUNTIME&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;docker"&lt;/span&gt;
    &lt;span class="na"&gt;LLM_MODEL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gemma4:e4b"&lt;/span&gt;
    &lt;span class="na"&gt;LLM_BASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://host.docker.internal:11434/v1"&lt;/span&gt;
    &lt;span class="na"&gt;LLM_OLLAMA_BASE_URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://host.docker.internal:11434"&lt;/span&gt;
    &lt;span class="na"&gt;LLM_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-llm"&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/var/run/docker.sock:/var/run/docker.sock&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./workspace:/workspace:rw&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/Users/me/projects:/workspace/host-projects:rw&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Image Generation in Open WebUI
&lt;/h2&gt;

&lt;p&gt;In Open WebUI, I enable image generation against my local endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENABLE_IMAGE_GENERATION=true
IMAGE_GENERATION_ENGINE=openai
IMAGES_OPENAI_API_BASE_URL=http://host.docker.internal:11434/v1
IMAGES_OPENAI_API_KEY=ollama
IMAGE_GENERATION_MODEL=x/flux2-klein:4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Final Mental Model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudgy025xx7ys2cfjkqfl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fudgy025xx7ys2cfjkqfl.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The most important part of the diagram is the last one: &lt;strong&gt;developer judgment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The model accelerates. The agent executes. But the engineering judgment is still mine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;Gemma 4 is exciting because it lowers the barrier for building more useful local assistants. Not just chatbots. Not just demos. Real workflows where an open model can help understand text and images, generate supporting assets, modify code and validate software inside a machine I control.&lt;/p&gt;

&lt;p&gt;My conclusion after building this setup is simple: the leap is not only in the model. It is in connecting the model to a well-designed workflow.&lt;/p&gt;

&lt;p&gt;Gemma 4 + Open WebUI + OpenHands + GitHub/GitLab is one concrete way to do that.&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ai</category>
    </item>
    <item>
      <title>Gemma Local Code Mentor: A Local-First VS Code AI Assistant Powered by Gemma 4</title>
      <dc:creator>Enny Rodríguez</dc:creator>
      <pubDate>Fri, 08 May 2026 21:12:08 +0000</pubDate>
      <link>https://dev.to/theelmix/i-built-a-local-first-vscode-code-mentor-with-gemma-4-your-code-never-leaves-your-machine-143c</link>
      <guid>https://dev.to/theelmix/i-built-a-local-first-vscode-code-mentor-with-gemma-4-your-code-never-leaves-your-machine-143c</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Build with Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Most AI coding tools ask for the same tradeoff:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Give me your code, and I'll give you help."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I wanted to try the opposite.&lt;/p&gt;

&lt;p&gt;What if a coding mentor lived inside VS Code, understood your repository, helped with real developer tasks, and kept your code on your own machine by default?&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;Gemma Local Code Mentor&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gemma Local Code Mentor&lt;/strong&gt; is a local-first VS Code extension powered by Gemma 4.&lt;/p&gt;

&lt;p&gt;It can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explain selected code&lt;/li&gt;
&lt;li&gt;Suggest refactors&lt;/li&gt;
&lt;li&gt;Generate tests&lt;/li&gt;
&lt;li&gt;Summarize files&lt;/li&gt;
&lt;li&gt;Summarize repository architecture&lt;/li&gt;
&lt;li&gt;Answer questions about the repo&lt;/li&gt;
&lt;li&gt;Run through a local FastAPI backend&lt;/li&gt;
&lt;li&gt;Use Ollama as the default local model runtime&lt;/li&gt;
&lt;li&gt;Keep Local Only Mode enabled by default&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No telemetry.&lt;br&gt;
No cloud fallback.&lt;br&gt;
No external API calls while Local Only Mode is on.&lt;br&gt;
Your code stays where it belongs: on your machine.&lt;/p&gt;
&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;I built a VS Code extension plus a Dockerized FastAPI backend for developers who want AI help without sending private code to a remote API.&lt;/p&gt;

&lt;p&gt;The workflow is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select code in VS Code.&lt;/li&gt;
&lt;li&gt;Run a &lt;code&gt;Gemma:&lt;/code&gt; command.&lt;/li&gt;
&lt;li&gt;The extension sends context to &lt;code&gt;127.0.0.1:8765&lt;/code&gt; (a request sketch follows this list).&lt;/li&gt;
&lt;li&gt;The backend builds a task-specific prompt.&lt;/li&gt;
&lt;li&gt;Gemma 4 responds through a local provider.&lt;/li&gt;
&lt;li&gt;The result appears in a VS Code side panel.&lt;/li&gt;
&lt;/ol&gt;
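
&lt;p&gt;To make step 3 concrete, here is a hedged sketch of the kind of request the extension sends. The route and JSON fields are illustrative assumptions, not the backend's documented API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Hypothetical request shape; the real routes live in the backend code
curl -s -X POST http://127.0.0.1:8765/explain \
  -H "Content-Type: application/json" \
  -d '{"code": "def add(a, b): return a + b", "language": "python"}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;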

&lt;p&gt;The extension currently includes these commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Gemma: Explain Selection&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma: Refactor Selection&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma: Generate Tests&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma: Summarize File&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma: Summarize Architecture&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma: Ask Repository&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma: Toggle Local Only Mode&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Gemma: Open Panel&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not just a chat box glued into an editor. The backend has structured prompt builders, response parsing, provider routing, tests, repository context handling, and privacy checks.&lt;/p&gt;
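
&lt;p&gt;To give a flavor of that layer, here is a minimal sketch of a task-specific prompt builder (the function and template are illustrative, not the repo's exact code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative task-specific prompt builder; the real templates live in the backend.
def build_explain_prompt(code: str, language: str, file_path: str) -&amp;gt; str:
    return (
        f"You are a local code mentor. Explain this {language} code from {file_path}.\n"
        "Focus on intent, inputs and outputs, and edge cases. Be concise.\n\n"
        f"```{language}\n{code}\n```"
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;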
&lt;h2&gt;
  
  
  Why I Built It
&lt;/h2&gt;

&lt;p&gt;There are many AI coding assistants now, but the privacy model often feels backwards.&lt;/p&gt;

&lt;p&gt;For open source code, cloud tools are usually fine.&lt;/p&gt;

&lt;p&gt;For client code, internal company projects, security-sensitive prototypes, or early startup ideas, uploading code somewhere else can be a blocker.&lt;/p&gt;

&lt;p&gt;I wanted a coding assistant with different defaults:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Typical Cloud Assistant&lt;/th&gt;
&lt;th&gt;Gemma Local Code Mentor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Runs in VS Code&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explains code&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generates tests&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Refactors code&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sends code to cloud&lt;/td&gt;
&lt;td&gt;Often&lt;/td&gt;
&lt;td&gt;No by default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Works with local models&lt;/td&gt;
&lt;td&gt;Usually no&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Has a local-only switch&lt;/td&gt;
&lt;td&gt;Rare&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Can be hacked by contributors&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Fully open source&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The goal is not to beat every commercial coding assistant.&lt;/p&gt;

&lt;p&gt;The goal is to prove that a useful AI coding mentor can be local-first from day one.&lt;/p&gt;
&lt;h2&gt;
  
  
  Demo
&lt;/h2&gt;

&lt;p&gt;Suggested demo flow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open a real code file in VS Code.&lt;/li&gt;
&lt;li&gt;Select a function.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;Gemma: Explain Selection&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Run &lt;code&gt;Gemma: Generate Tests&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Ask a repository-level question.&lt;/li&gt;
&lt;li&gt;Show the side panel with &lt;code&gt;Local Only Mode: ON&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Show the backend running locally.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;Repository:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/ennydev-2026" rel="noopener noreferrer"&gt;
        ennydev-2026
      &lt;/a&gt; / &lt;a href="https://github.com/ennydev-2026/GemmaLocalCodeMentor" rel="noopener noreferrer"&gt;
        GemmaLocalCodeMentor
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      GLCM - Gemma Local Code Mentor
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Gemma Local Code Mentor&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;Gemma Local Code Mentor is a local-first VSCode extension and Dockerized FastAPI backend for explaining, refactoring, testing, and summarizing code with local Gemma models.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What It Does&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;The project runs on the developer's machine:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;VSCode extension in TypeScript.&lt;/li&gt;
&lt;li&gt;Local FastAPI backend on &lt;code&gt;127.0.0.1:8765&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Ollama as the default local model runtime.&lt;/li&gt;
&lt;li&gt;Local sample provider for development and tests without installed models.&lt;/li&gt;
&lt;li&gt;Dual-model routing
&lt;ul&gt;
&lt;li&gt;Fast model for short explanations and lightweight chat.&lt;/li&gt;
&lt;li&gt;Deep model for refactors, tests, architecture, and larger context.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Local Only Mode enabled by default.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;

&lt;/div&gt;

  &lt;div class="js-render-enrichment-target"&gt;
    &lt;div class="render-plaintext-hidden"&gt;
      &lt;pre&gt;flowchart LR
    A["VSCode Extension"] --&amp;gt; B["FastAPI Backend :8765"]
    B --&amp;gt; C["Prompt Orchestrator"]
    B --&amp;gt; D["Repo Context Builder"]
    B --&amp;gt; E["Local Index Store"]
    B --&amp;gt; F["Model Router"]
    F --&amp;gt; G["Fast Gemma Model"]
    F --&amp;gt; H["Deep Gemma Model"]
    G --&amp;gt; I["Ollama"]
    H --&amp;gt; I
    B --&amp;gt; J["Response Parser"]
    J --&amp;gt; A
&lt;/pre&gt;
    &lt;/div&gt;
  &lt;/div&gt;


&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Commands&lt;/h2&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;gemma.explainSelection&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma.refactorSelection&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma.generateTests&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma.summarizeFile&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma.summarizeArchitecture&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma.askRepo&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma.togglePrivacyMode&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;gemma.openPanel&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/ennydev-2026/GemmaLocalCodeMentor" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;Direct link:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ennydev-2026/GemmaLocalCodeMentor" rel="noopener noreferrer"&gt;https://github.com/ennydev-2026/GemmaLocalCodeMentor&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Used Gemma 4
&lt;/h2&gt;

&lt;p&gt;I used Gemma 4 as the reasoning layer behind the local code mentor.&lt;/p&gt;

&lt;p&gt;The project is designed around two model roles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4 E4B&lt;/strong&gt; for fast tasks like short explanations and lightweight chat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemma 4 31B Dense&lt;/strong&gt; for deeper tasks like refactoring, test generation, architecture summaries, and larger context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That choice was intentional.&lt;/p&gt;

&lt;p&gt;A code mentor should not use the largest model for every single request. If I ask what a small function does, I want a fast answer. If I ask for tests, architecture, or a refactor, I want deeper reasoning.&lt;/p&gt;

&lt;p&gt;So the backend includes a model router:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;fast&lt;/code&gt; mode uses the fast model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;deep&lt;/code&gt; mode uses the deep model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;auto&lt;/code&gt; mode chooses based on task type and context size&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This routing makes Gemma 4 feel like a practical local development tool rather than a single hardcoded model call.&lt;/p&gt;
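
&lt;p&gt;As a minimal sketch, the &lt;code&gt;auto&lt;/code&gt; decision can be as small as this (model tags and the threshold are assumptions, not the repo's exact values):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of an auto-routing policy between a fast and a deep local model.
# Model tags and the context threshold are illustrative assumptions.
FAST_MODEL = "gemma4:e4b"
DEEP_MODEL = "gemma4:31b"

DEEP_TASKS = {"refactor", "generate_tests", "summarize_architecture"}
CONTEXT_THRESHOLD = 8_000  # characters of code context; tune per machine


def route_model(task: str, context: str, mode: str = "auto") -&amp;gt; str:
    """Pick a model tag based on mode, task type and context size."""
    if mode == "fast":
        return FAST_MODEL
    if mode == "deep":
        return DEEP_MODEL
    # auto: heavy task types or large context go to the deep model
    if task in DEEP_TASKS or len(context) &amp;gt; CONTEXT_THRESHOLD:
        return DEEP_MODEL
    return FAST_MODEL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;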

&lt;h2&gt;
  
  
  Why Gemma 4 Specifically
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Explain selection&lt;/td&gt;
&lt;td&gt;Gemma 4 E4B&lt;/td&gt;
&lt;td&gt;Fast local response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Generate tests&lt;/td&gt;
&lt;td&gt;Gemma 4 31B Dense&lt;/td&gt;
&lt;td&gt;More reasoning depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture summary&lt;/td&gt;
&lt;td&gt;Gemma 4 31B Dense&lt;/td&gt;
&lt;td&gt;Larger context and better synthesis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ask repo&lt;/td&gt;
&lt;td&gt;Auto router&lt;/td&gt;
&lt;td&gt;Chooses by task/context size&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvdfm6yovralalqz9idr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvdfm6yovralalqz9idr.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;VS Code extension in TypeScript&lt;/li&gt;
&lt;li&gt;FastAPI backend in Python&lt;/li&gt;
&lt;li&gt;Ollama as the default local runtime&lt;/li&gt;
&lt;li&gt;Docker support&lt;/li&gt;
&lt;li&gt;Mock provider for development and tests&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.gemmaignore&lt;/code&gt; support&lt;/li&gt;
&lt;li&gt;Local URL safety checks&lt;/li&gt;
&lt;li&gt;Backend test coverage with &lt;code&gt;pytest&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Local-First Is a Product Decision
&lt;/h2&gt;

&lt;p&gt;The privacy layer is not just a README promise.&lt;/p&gt;

&lt;p&gt;The repo includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local Only Mode enabled by default&lt;/li&gt;
&lt;li&gt;Backend URL validation (sketched after this list)&lt;/li&gt;
&lt;li&gt;No telemetry&lt;/li&gt;
&lt;li&gt;No cloud fallback&lt;/li&gt;
&lt;li&gt;No external API calls while local-only is enabled&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.gemmaignore&lt;/code&gt; for excluding sensitive files&lt;/li&gt;
&lt;li&gt;Mock mode so contributors can work without installing a model first&lt;/li&gt;
&lt;/ul&gt;
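
&lt;p&gt;The URL validation from that list can be as small as a loopback allowlist. A minimal sketch, assuming the guard runs before any request leaves the machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of a local-only URL guard; the actual check lives in the repo.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"127.0.0.1", "localhost", "::1"}


def is_local_url(url: str) -&amp;gt; bool:
    """Allow only loopback backends while Local Only Mode is enabled."""
    parsed = urlparse(url)
    return parsed.scheme in {"http", "https"} and parsed.hostname in ALLOWED_HOSTS


assert is_local_url("http://127.0.0.1:8765")
assert not is_local_url("https://api.example.com/v1")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;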

&lt;p&gt;That matters because local AI changes who can safely use these tools.&lt;/p&gt;

&lt;p&gt;A freelancer can use it on client code.&lt;br&gt;
A company can test AI workflows without sending source code away.&lt;br&gt;
A student can learn from a mentor without paying API costs.&lt;br&gt;
An open-source maintainer can customize the whole stack.&lt;/p&gt;
&lt;h2&gt;
  
  
  Run It Locally
&lt;/h2&gt;

&lt;p&gt;Backend:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;backend
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
uvicorn app.main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8765 &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Mock mode, no model required:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;backend
&lt;span class="nv"&gt;GEMMA_PROVIDER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;mock uvicorn app.main:app &lt;span class="nt"&gt;--host&lt;/span&gt; 127.0.0.1 &lt;span class="nt"&gt;--port&lt;/span&gt; 8765 &lt;span class="nt"&gt;--reload&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;extension
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run compile
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open the project in VS Code, press &lt;code&gt;F5&lt;/code&gt;, and run any &lt;code&gt;Gemma:&lt;/code&gt; command.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Want Help With
&lt;/h2&gt;

&lt;p&gt;This is where I want the community involved.&lt;/p&gt;

&lt;p&gt;I would love contributors for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better repository indexing&lt;/li&gt;
&lt;li&gt;Smarter prompt templates&lt;/li&gt;
&lt;li&gt;More language-aware code analysis&lt;/li&gt;
&lt;li&gt;Inline code actions&lt;/li&gt;
&lt;li&gt;Diff previews before applying refactors&lt;/li&gt;
&lt;li&gt;Local embeddings for repo search&lt;/li&gt;
&lt;li&gt;Better test framework detection&lt;/li&gt;
&lt;li&gt;llama.cpp provider support&lt;/li&gt;
&lt;li&gt;MLX provider support&lt;/li&gt;
&lt;li&gt;A polished marketplace-ready VSIX&lt;/li&gt;
&lt;li&gt;UI improvements for the side panel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you care about local AI, open models, privacy-respecting devtools, or VS Code extensions, jump in.&lt;/p&gt;

&lt;p&gt;Fork it.&lt;br&gt;
Open an issue.&lt;br&gt;
Try another Gemma 4 model.&lt;br&gt;
Add a provider.&lt;br&gt;
Improve the prompts.&lt;br&gt;
Make the UX better.&lt;/p&gt;
&lt;h2&gt;
  
  
  Install the VS Code Extension
&lt;/h2&gt;

&lt;p&gt;You can install the extension directly in VS Code using this identifier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ennydev-2026.gemma-local-code-mentor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
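
&lt;p&gt;Or from a terminal, using the VS Code CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;code --install-extension ennydev-2026.gemma-local-code-mentor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;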



&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;AI coding tools are becoming part of the daily developer workflow.&lt;/p&gt;

&lt;p&gt;That means defaults matter.&lt;/p&gt;

&lt;p&gt;The default should not always be:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Upload your code first."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sometimes the best place for your code is exactly where it already is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;on your machine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What would you add to a local-first VS Code code mentor?&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>vscode</category>
    </item>
  </channel>
</rss>
