
Agustin Sacco


How to Unlock Local Inference in the Google Gemini SDK (Without Forking)

There is growing demand in the google/gemini-cli issue tracker for local model support. The reality? The functionality is already there.

The @google/gemini-cli-core SDK was architected as a modular orchestrator, not just a cloud wrapper. At Tars, we’ve tapped into the SDK’s native ContentGenerator interface and OverrideStrategy to run 100% local agentic loops without forking the core.

1. The Strategy: Bypassing the Cloud Router

By default, the Gemini SDK uses a ClassifierStrategy that pings Google's flash-lite model to route each prompt. This cloud dependency is what causes "API Key Missing" errors when you try to run fully locally.

We bypass this natively by exploiting the SDK's internal routing priority:

  1. FallbackStrategy
  2. OverrideStrategy (Triggered when a concrete model is provided)
  3. ClassifierStrategy (The default cloud ping)

By simply passing a concrete model name (e.g., qwen-3b) instead of auto during initialization, we trip the OverrideStrategy. This "amputates" the cloud router, forcing the SDK to talk directly to our local bridge with no routing round-trip and zero cloud pings.
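The priority order above can be sketched as a tiny decision function. This is an illustrative model of the behavior, not the SDK's actual exports; the strategy names mirror the list, and `resolveStrategy` is a stand-in for the SDK's internal router.

```typescript
// Illustrative sketch of the routing priority: fallback > override > classifier.
// These names mirror the SDK's strategies but the types are our own stand-ins.
type RoutingStrategy = "FallbackStrategy" | "OverrideStrategy" | "ClassifierStrategy";

interface RouterConfig {
  model: string;           // "auto" or a concrete model name, e.g. "qwen-3b"
  fallbackActive?: boolean; // set when the SDK has already fallen back
}

function resolveStrategy(config: RouterConfig): RoutingStrategy {
  if (config.fallbackActive) return "FallbackStrategy";
  // Any concrete model name (anything but "auto") trips the override path,
  // so the classifier's cloud ping never happens.
  if (config.model !== "auto") return "OverrideStrategy";
  return "ClassifierStrategy";
}

console.log(resolveStrategy({ model: "qwen-3b" })); // "OverrideStrategy"
console.log(resolveStrategy({ model: "auto" }));    // "ClassifierStrategy"
```

The key point: the bypass needs no patching, only a concrete model name at init time.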

2. The Implementation: LlamaCppGenerator

Tars implements the SDK's ContentGenerator interface. This allows us to intercept the SDK’s generateContent and streamGenerateContent calls. We then:

  • Map Gemini Parts to OpenAI: Translate the SDK’s complex multi-part messages (text + function calls) into flat OpenAI-compatible JSON.
  • Native Tool-Calling Bridge: To make the SDK recognize local tool calls, we manually wire them into the response.functionCalls getter the SDK reads. This lets local models (like Qwen 3.5) participate in the exact same multi-turn tool loops as Gemini 1.5 Pro.
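The two bullets above can be sketched in miniature. The types below are simplified stand-ins, not the SDK's real definitions: `toOpenAIMessage` illustrates flattening Gemini-style Parts into an OpenAI-compatible chat message, and `LocalResponse` illustrates exposing tool calls through a `functionCalls` getter of the kind the SDK reads.

```typescript
// Simplified stand-in types; the real SDK types are richer.
interface GeminiPart { text?: string; functionCall?: { name: string; args: object } }
interface GeminiContent { role: string; parts: GeminiPart[] }

interface OpenAIToolCall { id: string; type: "function"; function: { name: string; arguments: string } }
interface OpenAIMessage { role: "user" | "assistant"; content: string | null; tool_calls?: OpenAIToolCall[] }

// Flatten multi-part Gemini content (text + function calls) into one
// OpenAI-compatible message for a llama.cpp-style server.
function toOpenAIMessage(content: GeminiContent, idSeed = 0): OpenAIMessage {
  const text = content.parts.filter(p => p.text).map(p => p.text).join("\n");
  const toolCalls: OpenAIToolCall[] = content.parts
    .filter(p => p.functionCall)
    .map((p, i) => ({
      id: `call_${idSeed + i}`,
      type: "function" as const,
      function: { name: p.functionCall!.name, arguments: JSON.stringify(p.functionCall!.args) },
    }));
  return {
    role: content.role === "model" ? "assistant" : "user",
    content: text || null,
    ...(toolCalls.length ? { tool_calls: toolCalls } : {}),
  };
}

// On the way back, expose local tool calls via a getter so the SDK's
// agent loop sees the same shape it expects from a cloud response.
class LocalResponse {
  constructor(private calls: { name: string; args: object }[]) {}
  get functionCalls() { return this.calls.length ? this.calls : undefined; }
}
```

Everything else (streaming, retries, token accounting) lives behind the same `ContentGenerator` boundary, so the SDK's agent loop is none the wiser.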

3. Future-Proofing: Upgrading Core without Breaking

Because Tars uses the standard ContentGenerator interface, we can upgrade @google/gemini-cli-core to the latest version (e.g., for new Gemini 2.0 features) without breaking our local inference logic. We aren't hacking the SDK; we are using it exactly as it was designed to be extended.


The Verdict

The Gemini CLI doesn't need a "Local Mode" feature request—it needs an implementation that respects its modular architecture. Tars is that implementation.

Key Benefits:

  • 100% Privacy: No telemetry or classifier pings to Google.
  • Agentic Power: Full MCP extension support (Gmail, Drive, Shell) on local hardware.
  • Telemetry: Captures local usageMetadata (tokens) for real-time dashboard tracking.
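The telemetry point can be sketched as a straightforward field mapping. This assumes a llama.cpp server that returns an OpenAI-style `usage` block; the `UsageMetadata` field names follow Gemini's camelCase convention, but the exact shape is an assumption, not the SDK's verified contract.

```typescript
// OpenAI-style usage block, as returned by llama.cpp's OpenAI-compatible server.
interface OpenAIUsage { prompt_tokens: number; completion_tokens: number; total_tokens: number }

// Gemini-style usageMetadata (field names assumed from the API's camelCase convention).
interface UsageMetadata { promptTokenCount: number; candidatesTokenCount: number; totalTokenCount: number }

// Map local token counts into the shape the dashboard already consumes.
function toUsageMetadata(u: OpenAIUsage): UsageMetadata {
  return {
    promptTokenCount: u.prompt_tokens,
    candidatesTokenCount: u.completion_tokens,
    totalTokenCount: u.total_tokens,
  };
}
```

Because the local response carries the same `usageMetadata` shape, the existing dashboard tracks local tokens with no special casing.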

Recommended Model: Qwen 3.5 (35B or 80B) for the most reliable tool-calling and JSON output.


> [!TIP]
> **Get Started:** You can test this today by running `tars setup` and selecting the Llama.cpp backend.
> **Repository:** github.com/agustinsacco/tars
