This is a submission for the Gemma 4 Challenge: Write About Gemma 4
The architecture review was going fine. The customer was a European bank, the gateway was Azure APIM, the integration was a week from sign-off. Then someone joined the call from their AI governance team. She asked one question.
"Where does the inference happen?"
I said Frankfurt. She thanked me. The deal stalled for nine months.
I spent 6 months as a solutions engineer at an API observability company. Sat in rooms like that across European banks, Dutch insurers, Indian fintechs, German utilities. The customers wanted AI on their API traffic. Anomaly detection. Schema inference. Half a dozen things the marketing deck promised.
They couldn't have any of it, because the only models good enough to actually do the work lived in someone else's data centre, and "someone else's data centre" was the answer that ended every conversation.
Gemma 4 is the first open model I've used that holds up when you hand it a real OpenAPI spec and ask it to reason about consumer policies. That's the threshold. Everything below it was a demo.
The objection was never one objection
- At UBS it was Swiss data residency law.
- At an NN Group subsidiary it was an internal policy nobody could find the author of.
- At a Polish bank the infosec lead didn't trust any vendor that hadn't been in the building for five years.
- At a healthcare customer it was a clause buried in a downstream contract with a hospital network that nobody on our side had read until the security review.
Regulatory, contractual, political, sometimes all three on the same call. The common thread was that "we'll send your traffic to OpenAI" ended the conversation, and the workarounds available were bad.
Azure OpenAI with managed identity worked for the customers who could swallow Microsoft. The customers who couldn't got Llama 2 or Mistral 7B, which couldn't reason over a real OpenAPI spec, hallucinated method signatures when you fed them traces, and broke the moment you asked them to consider rate limit policies and consumer tiers in the same prompt.
The gap between models that demo well and models that survive a real enterprise API workload was wider than vendors wanted to admit.
What actually shifted on April 2
Google released Gemma 4 on April 2, 2026. Four sizes, multimodal, native function calling, 256K context on the larger variants. The 31B dense variant sits at #3 on the Arena AI open leaderboard.
That's been covered.
The thing nobody's writing about: Apache 2.0.
Earlier Gemma releases shipped under Google's own licence. Procurement teams at large companies had to read it, debate it, sometimes escalate it. Apache 2.0 they've read a thousand times.
The legal review goes from a quarter to an afternoon. For a writer that sounds boring. For a deal it's the difference between closing in Q2 and closing in Q4.
The other two pieces matter, but matter less. Function calling is native: trained with dedicated tokens, not bolted on through prompt engineering, which means tool use is reliable enough to put behind a real workflow instead of a demo. And the size distribution gives you actual deployment options.
The 26B MoE activates roughly 3.8B parameters per token, so it runs at small-model speed with medium-model reasoning, and fits on a single H100 with breathing room.
The 31B dense is for higher-margin workloads where quality justifies the GPU. E2B and E4B run on a laptop, which matters because that's where the integration work happens before anyone provisions infrastructure.
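A rough way to sanity-check the "single H100 with breathing room" claim is to do the weight arithmetic. This is a back-of-the-envelope sketch, not a sizing guide. It assumes the nominal total parameter counts from the variant names, 16-bit weights (2 bytes per parameter), and an 80 GB H100. It ignores KV cache and activations, so real headroom is smaller. The function name is mine. One thing it does make visible: a MoE keeps all its experts resident, so memory follows total parameters, not the roughly 3.8B active ones.

```python
# Back-of-the-envelope weight memory.
# Assumptions (mine, not from the release notes): nominal total parameter
# counts taken from the variant names, 2 bytes per parameter, 80 GB H100.
# KV cache and activations are ignored.
H100_MEMORY_GB = 80
BYTES_PER_PARAM = 2


def weight_memory_gb(total_params_billion: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return total_params_billion * 1e9 * bytes_per_param / 1e9


for name, params_b in [("26B MoE", 26), ("31B dense", 31)]:
    need = weight_memory_gb(params_b)
    headroom = H100_MEMORY_GB - need
    print(f"{name}: ~{need:.0f} GB of weights, ~{headroom:.0f} GB left on one H100")
```

Under those assumptions the MoE leaves noticeably more room for context than the dense variant, which is the practical meaning of "breathing room".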
Apache 2.0 is the procurement story. The rest is the engineering story. They both have to be true for this to work, but only one of them was the blocker.
What this means for an API platform
The standard enterprise stack is a gateway with identity in front and observability behind it, plus a developer portal and a governance layer somewhere.
You know the shape.
The places AI wants to live in it are obvious: reasoning over traces to spot anomalies, suggesting policy fixes when a consumer trips a limit, generating documentation from observed traffic, triaging error patterns, answering "why did this call fail" from a request payload plus the spec.
What changes with Gemma 4 is that the workload sits on a single GPU node inside the customer's VPC.
The model takes traffic data, OpenAPI specs, and policy definitions as context, calls back into the gateway's admin API through function calling, and produces structured output the rest of the system can act on.
Nothing leaves the customer's network.
The procurement story most engineers underestimate: you ship a Helm chart, a model weights file, and a licence. The customer deploys it in their cluster. There's nothing to negotiate because nothing crosses the boundary.
The demo
Gemma 4 acting as a reasoning layer over a mock API gateway. Three tools (list consumers on an API, get a consumer's rate limit, propose a new limit) and an agent loop that picks which to call.
Pull the model:
ollama pull gemma4:e4b
E4B runs on a laptop and function calling works the same as on the larger variants. For production traffic volume you'd move to 26B MoE or 31B dense, but for the demo this is the right pick.
The mock gateway.
In a real system this is your WSO2 admin API, your Kong admin API, whatever you're running.
For the demo it's a dictionary.
import json
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "gemma4:e4b"

GATEWAY_STATE = {
    "payments-api": {
        "consumers": [
            {"id": "acme-corp", "tier": "gold", "limit_per_minute": 1000},
            {"id": "beta-startup", "tier": "free", "limit_per_minute": 60},
        ]
    },
    "users-api": {
        "consumers": [
            {"id": "internal-dashboard", "tier": "internal", "limit_per_minute": 5000},
        ]
    },
}

def list_consumers(api_name: str) -> dict:
    api = GATEWAY_STATE.get(api_name)
    if not api:
        return {"error": f"API {api_name} not found"}
    return {"api": api_name, "consumers": api["consumers"]}


def get_rate_limit(api_name: str, consumer_id: str) -> dict:
    api = GATEWAY_STATE.get(api_name)
    if not api:
        return {"error": f"API {api_name} not found"}
    for c in api["consumers"]:
        if c["id"] == consumer_id:
            return {"consumer": consumer_id, "limit_per_minute": c["limit_per_minute"], "tier": c["tier"]}
    return {"error": f"Consumer {consumer_id} not found on {api_name}"}


def propose_rate_limit(api_name: str, consumer_id: str, new_limit_per_minute: int, justification: str) -> dict:
    return {
        "proposal": {
            "api": api_name,
            "consumer": consumer_id,
            "new_limit_per_minute": new_limit_per_minute,
            "justification": justification,
            "status": "pending_approval",
        }
    }


TOOL_MAP = {
    "list_consumers": list_consumers,
    "get_rate_limit": get_rate_limit,
    "propose_rate_limit": propose_rate_limit,
}
Tool schemas.
The descriptions are what the model reads to decide what to call, and vague descriptions produce vague selection. Worth more time than people give them.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "list_consumers",
            "description": "List all consumers registered on a specific API. Use when the user asks who is calling an API, or before reasoning about a specific consumer.",
            "parameters": {
                "type": "object",
                "properties": {
                    "api_name": {"type": "string", "description": "The API identifier, e.g. 'payments-api'"},
                },
                "required": ["api_name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_rate_limit",
            "description": "Get the current rate limit and tier for a specific consumer on a specific API.",
            "parameters": {
                "type": "object",
                "properties": {
                    "api_name": {"type": "string"},
                    "consumer_id": {"type": "string"},
                },
                "required": ["api_name", "consumer_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "propose_rate_limit",
            "description": "Propose a new rate limit for a consumer. The proposal is queued for human approval, never applied directly.",
            "parameters": {
                "type": "object",
                "properties": {
                    "api_name": {"type": "string"},
                    "consumer_id": {"type": "string"},
                    "new_limit_per_minute": {"type": "integer"},
                    "justification": {"type": "string", "description": "Why this limit is being proposed, in one sentence."},
                },
                "required": ["api_name", "consumer_id", "new_limit_per_minute", "justification"],
            },
        },
    },
]
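Since the descriptions do the steering, it can be worth linting them before the model ever sees them. What follows is a minimal sketch of that idea, not part of the demo: `lint_tool_schemas` and the eight-word threshold are my own illustrative choices, and the example tool is a copy of the `get_rate_limit` schema from above.

```python
from __future__ import annotations

MIN_DESCRIPTION_WORDS = 8  # arbitrary threshold, chosen for illustration


def lint_tool_schemas(tools: list[dict]) -> list[str]:
    """Return human-readable warnings for under-described tools and parameters."""
    warnings = []
    for tool in tools:
        fn = tool.get("function", {})
        name = fn.get("name", "<unnamed>")
        if len(fn.get("description", "").split()) < MIN_DESCRIPTION_WORDS:
            warnings.append(f"{name}: description is very short")
        props = fn.get("parameters", {}).get("properties", {})
        for param, spec in props.items():
            if not spec.get("description"):
                warnings.append(f"{name}.{param}: missing description")
    return warnings


EXAMPLE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_rate_limit",
            "description": "Get the current rate limit and tier for a specific consumer on a specific API.",
            "parameters": {
                "type": "object",
                "properties": {
                    "api_name": {"type": "string"},
                    "consumer_id": {"type": "string"},
                },
                "required": ["api_name", "consumer_id"],
            },
        },
    },
]

for warning in lint_tool_schemas(EXAMPLE_TOOLS):
    print(warning)
```

Run against the demo schemas, a check like this flags the parameters that have a type but no description, which is exactly the kind of gap that is easy to miss by eye.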
The agent loop. Send the message, check for tool calls, execute them, feed the results back.
def run_agent(user_message: str, max_steps: int = 5) -> str:
    messages = [
        {
            "role": "system",
            "content": (
                "You are an API platform assistant. You have access to tools that read "
                "and propose changes against the API gateway. Never invent consumer IDs "
                "or limits. If you need data, call a tool. Propose changes only when "
                "the user asks for a change."
            ),
        },
        {"role": "user", "content": user_message},
    ]
    for _ in range(max_steps):
        # One non-streaming chat turn against the local Ollama server.
        response = requests.post(
            OLLAMA_URL,
            json={"model": MODEL, "messages": messages, "tools": TOOLS, "stream": False},
        ).json()
        msg = response["message"]
        messages.append(msg)
        tool_calls = msg.get("tool_calls")
        if not tool_calls:
            # No tool requested: the model has produced its final answer.
            return msg.get("content", "")
        for call in tool_calls:
            fn_name = call["function"]["name"]
            args = call["function"]["arguments"]
            # Arguments may arrive as a JSON string rather than a dict.
            if isinstance(args, str):
                args = json.loads(args)
            result = TOOL_MAP[fn_name](**args)
            # Feed the tool result back so the next turn can reason over it.
            messages.append({"role": "tool", "content": json.dumps(result)})
    return "Step limit reached"
Run it:
print(run_agent(
    "acme-corp is a paying customer on payments-api and they say they're hitting limits. "
    "Check their current limit and propose doubling it. Use one-sentence justification."
))
Running this on an M5 MacBook with E4B, the model picked get_rate_limit first to read the actual tier and limit (1000/min, gold), then called propose_rate_limit with the doubled value and a one-sentence justification. The proposal came back queued for approval.
Three model calls total, around 15 seconds end-to-end after the model warmed up.
No hallucinated current limit, no guess at the tier. The tools were the only path to data the model could trust.
That's the pattern. The model reasons. The tools fetch and propose. The output is structured. No part of this needs a cloud endpoint.
What this doesn't solve
Pretending Gemma 4 fixes everything is the kind of vendor pitch I sat through for three years, so a few caveats from running this on real hardware.
Latency.
The 26B MoE on a single H100 returns a tool call quickly enough for back-office workflows, but agent loops with multiple tool calls stack up.
On my M5 MacBook running E4B locally, the cold-start first call took 15 seconds while the model warmed up, then subsequent calls settled at 3-7 seconds each. A two-tool-call agent loop completed end-to-end in about 15 seconds wall time after warm-up.
Fine for ops tooling. Bad for anything customer-facing. Cache tool definitions, batch where the workload allows, and route to 31B dense only when the smaller variant fails a validation check.
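The "route up only on failure" idea can be sketched as a small wrapper. Treat this as an illustration under assumptions: `route_with_escalation`, `call_model`, and `validate` are names I made up, the `gemma4:31b` tag is a guess at what the larger variant would be called locally, and the model call is stubbed out so the routing logic is visible on its own.

```python
from __future__ import annotations

from typing import Callable, Optional


def route_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], dict],
    validate: Callable[[dict], bool],
    models: tuple[str, ...] = ("gemma4:e4b", "gemma4:31b"),  # second tag is an assumption
) -> tuple[Optional[str], Optional[dict]]:
    """Try models smallest-first and return the first output that passes validation."""
    for model in models:
        output = call_model(model, prompt)
        if validate(output):
            return model, output
    # Nothing validated: the caller decides whether to fail or hand off to a human.
    return None, None


def fake_call_model(model: str, prompt: str) -> dict:
    # Stand-in for a real inference call: pretend only the larger model gets it right.
    return {"model": model, "valid": model == "gemma4:31b"}


model_used, output = route_with_escalation(
    "propose a limit for acme-corp", fake_call_model, lambda out: out["valid"]
)
print(model_used, output)
```

The design choice is that the expensive model is only paid for when the cheap one demonstrably failed, so the validation check has to be strict enough to be worth trusting.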
Context window.
The 256K number sounds enormous until you try to fit a real enterprise OpenAPI spec, a week of structured traces, and a tool registry into one prompt.
A heavily documented spec with hundreds of endpoints will eat through the budget before you've added the trace data. Quality also degrades well before you hit the technical ceiling.
RAG is still the right pattern for observability workloads.
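To make the retrieval step concrete, here is a deliberately naive sketch: instead of pasting the whole spec into the prompt, pick only the operations that share words with the question, up to a size budget. A real system would more likely use embeddings and a token counter. The keyword overlap, the character budget, and the name `select_operations` are all my simplifications.

```python
from __future__ import annotations

import json
import re


def _words(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_operations(spec: dict, question: str, max_chars: int = 4000) -> list[dict]:
    """Pick the OpenAPI operations whose text overlaps most with the question."""
    terms = _words(question)
    scored = []
    for path, methods in spec.get("paths", {}).items():
        for method, op in methods.items():
            if not isinstance(op, dict):
                continue  # skip path-level keys such as a shared "parameters" list
            text = f"{path} {op.get('summary', '')} {op.get('description', '')}"
            overlap = len(terms & _words(text))
            if overlap:
                scored.append((overlap, {"path": path, "method": method.upper(), "operation": op}))
    # Sort on the score only; the sort is stable, so ties keep spec order.
    scored.sort(key=lambda item: item[0], reverse=True)
    selected, used = [], 0
    for _, entry in scored:
        size = len(json.dumps(entry))
        if used + size > max_chars:
            break
        selected.append(entry)
        used += size
    return selected


EXAMPLE_SPEC = {
    "paths": {
        "/payments/{id}/refund": {"post": {"summary": "Refund a payment"}},
        "/users": {"get": {"summary": "List users"}},
    }
}

for entry in select_operations(EXAMPLE_SPEC, "Why did the refund call fail?"):
    print(entry["method"], entry["path"])
```

Only the selected operations go into the prompt, alongside the traces that mention them. The full spec stays on disk.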
Function calling reliability on the smaller variants is good, not perfect. When I ran this exact demo on E4B, the model picked the right tools in the right order on the first try, with correct arguments.
But it's a 4B effective-parameter model. Push it into a wider tool registry or more ambiguous tool descriptions and the failure modes start showing up. Validate every tool call against your schema and refuse to execute if it doesn't match.
The proposal pattern in the demo, where every change is queued for human approval rather than applied, is not optional in production.
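A minimal sketch of "validate, then refuse" using only the standard library. It checks a tool call against the same schema shape the demo uses: known tool name, required arguments present, no unexpected arguments, basic type match. `validate_tool_call` is a name I made up, and it covers only the handful of JSON Schema features the demo schemas rely on. A full JSON Schema validator would be the sturdier choice in production.

```python
from __future__ import annotations

# Maps the JSON Schema type names used in the demo schemas to Python types.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_tool_call(name: str, args: dict, tools: list[dict]) -> list[str]:
    """Return a list of problems; an empty list means the call may run."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    if name not in schemas:
        return [f"unknown tool: {name}"]
    schema = schemas[name]
    props = schema.get("properties", {})
    problems = [f"missing argument: {key}" for key in schema.get("required", []) if key not in args]
    for key, value in args.items():
        if key not in props:
            problems.append(f"unexpected argument: {key}")
            continue
        type_name = props[key].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it unless a boolean was asked for.
        is_stray_bool = isinstance(value, bool) and expected is not bool
        if not isinstance(value, expected) or is_stray_bool:
            problems.append(f"{key}: expected {type_name}, got {type(value).__name__}")
    return problems


EXAMPLE_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "propose_rate_limit",
            "parameters": {
                "type": "object",
                "properties": {
                    "api_name": {"type": "string"},
                    "consumer_id": {"type": "string"},
                    "new_limit_per_minute": {"type": "integer"},
                    "justification": {"type": "string"},
                },
                "required": ["api_name", "consumer_id", "new_limit_per_minute", "justification"],
            },
        },
    },
]

bad_call = {"api_name": "payments-api", "consumer_id": "acme-corp", "new_limit_per_minute": "2000"}
print(validate_tool_call("propose_rate_limit", bad_call, EXAMPLE_TOOLS))
```

In the agent loop, a non-empty result would be sent back to the model as the tool response instead of executing anything, which gives it a chance to correct the call.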
And the model is still a model.
It will be confidently wrong about something at the worst possible moment. Build the system so the model is a useful component and the gateway is the authority.
Not the other way around.
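One way to keep that ordering honest in code is to give the model a path that can only record a request, and keep the path that changes state somewhere the model cannot reach. A small sketch, assuming an in-memory store: the `ProposalQueue` class and its method names are hypothetical, not part of any gateway's API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProposalQueue:
    """Holds model proposals until a named human approves them."""

    limits: dict[tuple[str, str], int]  # (api, consumer) -> requests per minute
    pending: dict[int, dict] = field(default_factory=dict)
    next_id: int = 1

    def propose(self, api: str, consumer: str, new_limit: int, justification: str) -> int:
        """Called from the tool layer. Records the request and changes nothing."""
        if (api, consumer) not in self.limits:
            raise KeyError(f"{consumer} is not registered on {api}")
        proposal_id = self.next_id
        self.next_id += 1
        self.pending[proposal_id] = {
            "api": api,
            "consumer": consumer,
            "new_limit": new_limit,
            "justification": justification,
        }
        return proposal_id

    def approve(self, proposal_id: int, approver: str) -> int:
        """Called from a human-facing path only. Applies the change."""
        if not approver:
            raise PermissionError("approval requires a named approver")
        proposal = self.pending.pop(proposal_id)
        key = (proposal["api"], proposal["consumer"])
        self.limits[key] = proposal["new_limit"]
        return self.limits[key]


queue = ProposalQueue(limits={("payments-api", "acme-corp"): 1000})
pid = queue.propose("payments-api", "acme-corp", 2000, "Paying customer is hitting the limit.")
print(queue.limits[("payments-api", "acme-corp")])  # still the old limit: nothing applied yet
queue.approve(pid, approver="ops-lead")
print(queue.limits[("payments-api", "acme-corp")])  # the approved limit
```

Only `propose` is exposed as a tool. `approve` lives behind whatever the humans use, so a confidently wrong model can at worst fill a queue.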