I Built an AI Agent Orchestrator Where Gemma 4 Only Knows What You Teach It

#devchallenge #gemmachallenge #gemma #springboot

Gemma 4 Challenge: Build With Gemma 4 Submission

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

GemmaOrch is a skill-based AI agent orchestrator: you define what an agent knows by dropping Markdown files into a folder, assign those skills to a named agent, and chat with it. The agent powered by Gemma 4 will only answer within the boundaries of those files — it refuses anything outside scope with a precise phrase, never hallucinates expertise it wasn't given.

The core idea: agent behavior lives in .md files, not in code. No prompts hardcoded in the application. No domain logic baked into the service layer. The skill files are
the agent.

What it solves: building specialized AI assistants usually means either fine-tuning a model (expensive, slow to iterate) or writing complex prompt engineering into your codebase (brittle, hard to maintain). GemmaOrch separates the two concerns — the orchestration logic stays in Java, the expertise lives in plain Markdown that anyone can read and edit.

Key features:

Skill-driven agents — each agent's system prompt is built entirely from its assigned skill files at runtime.
GitHub skill importer — paste a public GitHub folder URL and GemmaOrch fetches every .md file recursively, creating the skill locally.
Streaming chat — token-by-token streaming via Spring WebFlux, rendered as Markdown client-side.
MCP server — every agent is automatically exposed as a JSON-RPC 2.0 tool on POST /mcp, callable from Claude Code, Cursor, or any MCP-compatible IDE.
REST API — POST /api/chat/{agentId} for integrating agents into external services, with a one-click "Copy curl" button in the UI.
Zero infrastructure — H2 file-based database, no external services required beyond the AI Studio API key.

Built with: Java 25 · Spring Boot 3.5 · Spring AI 1.1.5 · Thymeleaf · HTMX 2.0

Demo

The app runs locally — see the Quick Start in the repo or the Docker section to spin it up in two commands.

Dashboard — 3-panel layout (skills · main · agents):

Creating and importing skills:

Chatting with a skill-scoped agent:

API Access panel — copy the curl command directly from the agent detail view:

Agents as MCP tools in the IDE:

Code

Repository: Bzaid94/gemmorch-agents

How I Used Gemma 4

I used the gemma-4-31b-it model — the 31B dense instruction-tuned variant — via Google AI Studio through Spring AI's spring-ai-starter-model-google-genai.

Why the 31B dense, specifically:

The project enforces a hard constraint: agents must refuse anything outside their assigned skills and must do so with an exact phrase. This is a correctness requirement, not a quality preference — if the constraint breaks, the product doesn't work.

I tested smaller variants first. The 4B model followed the constraint most of the time, but would occasionally drift: offering "related" information outside its skills, or partially revealing the system prompt when directly asked. With the 31B dense, these failures essentially disappeared. The constraint held reliably across multi-turn conversations and adversarial inputs.

Two specific things the 31B unlocked that smaller models couldn't deliver consistently:

Long-context constraint adherence. A single agent's system prompt can carry 10,000+ tokens of skill content (multiple skill files, each with reference documents). The 31B model kept the opening STRICT CONSTRAINTS block in effect even with extensive context following it — smaller models would silently "forget" early instructions as context
grew.
Role disambiguation. Many skill files written for Claude Code or agentic CLI tools contain dispatch instructions like "invoke subagent X" or "request tool Y." Injected directly into a system prompt, smaller models would sometimes output those templates literally. The 31B correctly understood the meta-instruction — "you are the agent being invoked, not the orchestrator invoking agents" — and applied the skill knowledge directly instead of outputting workflow templates.

Why not the 26B MoE? The MoE variant optimizes for throughput across concurrent requests. GemmaOrch is a single-tenant orchestrator where precision per response matters more
than requests-per-second. The dense model's full parameter activation per token is worth the inference cost for this use case.

Why not the 4B? For a general assistant or creative tool, the 4B is genuinely capable and would be my first choice to keep costs and latency low. But when "breaking the constraint" is a correctness failure — not just a quality degradation — the extra capacity of the 31B is justified.

The open-weights advantage: Gemma 4 is open. The application is architected so the model is an environment variable — swap AI Studio for a local Ollama instance and nothing else changes. For users with sensitive skill content (internal knowledge bases, proprietary processes), self-hosting is a real deployment path, not a future promise.