Joe Provence

Posted on • Originally published at sltcreative.com
Adding a Free Overflow Model to Your MCP Server: Gemma via the Gemini API

Most agentic workflows have a single failure mode nobody plans for: the primary LLM hits its rate limit mid-session and everything stops. You can't log a result. You can't draft the next section. The workflow is blocked until the window resets. After hitting this enough times, I started treating it as an architecture problem rather than a billing problem.

The fix turned out to be simpler than I expected.


The Insight Hidden in the Gemini Docs

While auditing our Google AI Studio integration, I noticed that Gemma — Google's open-weight model family — is served through the exact same API endpoint as Gemini. Same Python SDK, same API key, different model string. And Gemma 3 27B costs $0 per million tokens on the free tier. If you already have a Gemini API key, you already have free access to a capable open-weight model. No new credentials, no additional SDK, no separate account.

That's the whole unlock.


Registering the Tool in FastMCP

Adding query_gemma to a FastMCP server is a thin wrapper — roughly fifteen lines:

import os

import google.generativeai as genai
from fastmcp import FastMCP

# configure() can be skipped if GOOGLE_API_KEY is already set in the environment
genai.configure(api_key=os.environ["GEMINI_API_KEY"])

mcp = FastMCP("my-server")

@mcp.tool()
def query_gemma(prompt: str, model: str = "gemma-3-27b-it") -> str:
    """Send a prompt to Gemma. Use for generation tasks to reduce primary LLM token usage."""
    client = genai.GenerativeModel(model)
    response = client.generate_content(prompt)
    return response.text

The model parameter defaults to gemma-3-27b-it but accepts the full family:

| Model | Best for |
| --- | --- |
| gemma-3-1b-it | Minimal tasks, fastest |
| gemma-3-4b-it | Classification, simple formatting |
| gemma-3-12b-it | General use |
| gemma-3-27b-it | Default — best Gemma 3 quality |
| gemma-4-26b-a4b-it | Gemma 4, efficient |
| gemma-4-31b-it | Gemma 4, highest quality |

After adding the tool, reconnect your MCP connector to reload the manifest. That's the entire deployment.


The Workflow Split That Makes This Useful

The important constraint: query_gemma is text in, text out. Gemma has no access to your tool registry. It can't call other MCP tools, query your data layer, or read session state. It only knows what you explicitly pass in the prompt.

This forces a clean separation that turns out to be the right design anyway. The primary LLM handles tool calls, data retrieval, QA, and logging. Gemma handles generation-heavy tasks — drafting, summarizing, classifying, formatting. The primary LLM does less of the expensive token work. When it hits rate limits, Gemma absorbs the generation queue while the primary LLM recovers.
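That separation can be made explicit in the orchestration layer. A minimal routing sketch (the task categories and function name here are illustrative assumptions, not part of FastMCP or any SDK):

```python
# Generation-heavy task types that Gemma handles well without tool access.
GENERATION_TASKS = {"draft", "summarize", "classify", "format"}

def route(task_type: str) -> str:
    """Send generation-heavy work to Gemma; keep tool calls,
    data retrieval, and stateful work on the primary LLM."""
    return "gemma" if task_type in GENERATION_TASKS else "primary"
```

The point is less the code than the contract: anything requiring tools or session state routes to the primary model by default.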

The split also makes each model's role legible. If something fails, you know immediately which layer to look at.


The Gap That Remains

The free tier rate limits are real. Gemma 3 models allow 5–15 requests per minute depending on model size. For interactive workflows, that's usually fine. For anything resembling batch processing, you'll hit the ceiling fast and need retry logic.
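One way to absorb those per-minute caps is a small backoff wrapper around the Gemma call. A minimal sketch (the `call` parameter stands in for `query_gemma` or any client call; narrow the caught exception to whatever rate-limit error your SDK version raises):

```python
import random
import time

def with_backoff(call, *, retries=4, base_delay=2.0):
    """Retry a rate-limited call with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:  # narrow this to your SDK's rate-limit error
            if attempt == retries - 1:
                raise
            # waits ~2s, 4s, 8s... with jitter to avoid synchronized retries
            time.sleep(base_delay * (2 ** attempt) + random.random())
```

Usage would look like `with_backoff(lambda: query_gemma("Summarize: ..."))`.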

The deeper limitation is context. Gemma doesn't know what your other tools returned unless you tell it. Every query_gemma call needs to be self-contained — task description, relevant data, output format, all passed explicitly. That's more prompt engineering overhead than calling a context-aware primary LLM, and it matters for complex tasks.
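In practice that means assembling the full context before every call. A hypothetical helper (the parameter names are illustrative, not part of any API) might look like:

```python
def build_gemma_prompt(task: str, data: str, output_format: str) -> str:
    """Pack task, input data, and output format into one self-contained
    prompt, since Gemma sees nothing beyond this string."""
    return (
        f"Task: {task}\n\n"
        f"Input data:\n{data}\n\n"
        f"Output format: {output_format}"
    )
```

For example: `query_gemma(build_gemma_prompt("Summarize the changelog", changelog_text, "three bullet points"))` — where `changelog_text` is whatever your primary LLM already retrieved.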


What This Is and Isn't

This isn't a replacement for your primary LLM. For tasks requiring tool calls, structured reasoning over live data, or anything where the model needs to know what happened earlier in the session — you still need the primary stack.

For pure generation tasks, it works well and it's free. The practical framing: treat it as a relief valve on your token budget, not a second brain.


Build your overflow capacity the same way you build your primary stack — thin interfaces, clear contracts, explicit failure modes. A model you can swap in when the primary one is saturated is worth more than a more powerful model you can't afford to run continuously.

