polar3130

Posted on Feb 27

Using Gemini CLI with a Local LLM

#cli #gemini #llm #tutorial

Gemini CLI, an open-source AI agent published by Google, lets you interact with Gemini models from your terminal. It normally connects to Google's API endpoint, but by redirecting the API destination, you can also use a locally running LLM as its backend.

In this post, I'll walk through how to combine LiteLLM Proxy and Ollama to swap Gemini CLI's backend to a local LLM, along with a few gotchas I encountered during setup.

I've also covered using LiteLLM Proxy for centralized LLM API management in a previous post, if you're interested.

Architecture Overview

Here is the overall architecture:

By setting the GOOGLE_GEMINI_BASE_URL environment variable from the @google/genai SDK, you can redirect all of Gemini CLI's API requests to an arbitrary endpoint. This variable doesn't appear to be documented in the Gemini CLI docs, but it is supported on the SDK side (reference PR).

LiteLLM Proxy exposes Gemini API-compatible endpoints (/v1beta/models/{model}:streamGenerateContent, etc.) and relays incoming requests to a local model running on Ollama. LiteLLM Proxy has a feature called model_group_alias that routes a requested model name to a different model, which allows you to map model names sent by Gemini CLI (such as gemini-3-flash-preview) to a local model.

Test Environment

macOS (Apple Silicon, Tahoe 26)
Gemini CLI v0.30.0
LiteLLM v1.81.16
Ollama v0.17.0
Python 3.14.0
Node.js v22.17.0

Setup

Installing Ollama and Pulling a Model

Install via Homebrew and start it as a service.

brew install ollama
brew services start ollama

Pull a model. I initially planned to use gemma3, but as described later, gemma3 doesn't support tool calling in the Ollama template format, so I went with the lightweight qwen2.5:3b (~1.9 GB) for this proof of concept.

ollama pull qwen2.5:3b

Installing LiteLLM

Create a Python virtual environment and install LiteLLM.

python3 -m venv .venv
source .venv/bin/activate
pip install 'litellm[proxy]'

Configuring LiteLLM Proxy

Create a litellm_config.yaml file.

model_list:
  - model_name: local-model
    litellm_params:
      model: ollama_chat/qwen2.5:3b
      api_base: "http://localhost:11434"

router_settings:
  model_group_alias:
    "gemini-3.1-pro-preview": "local-model"
    "gemini-3.1-pro-preview-customtools": "local-model"
    "gemini-3-flash-preview": "local-model"
    "gemini-3-flash-preview-customtools": "local-model"
    "gemini-2.5-pro": "local-model"
    "gemini-2.5-flash": "local-model"
    "gemini-2.5-flash-lite": "local-model"

The key here is the model_group_alias configuration. Gemini CLI uses multiple models internally — a main generation model (gemini-3-flash-preview, etc.) as well as a lighter model for input classification (gemini-2.5-flash-lite). Aliases for all of these model names need to be defined. It would be nice if wildcards were supported, but for now, each model name requires its own alias.

Start the proxy with the config file.

litellm --config litellm_config.yaml --port 4000

Starting Gemini CLI

Set the environment variables and start Gemini CLI.

export GOOGLE_GEMINI_BASE_URL="http://localhost:4000"
export GEMINI_API_KEY="sk-dummy-key"
gemini --sandbox=false

The API key isn't actually used, so any dummy value will do.

You should now be getting responses from the local LLM.

Note that --sandbox=false is specified because, in sandbox mode, GOOGLE_GEMINI_BASE_URL is not passed into the sandbox container — a known issue (Issue #2168).

Gotchas During Setup

Missing model_group_alias Entries Cause 500 Errors

Gemini CLI uses different models for different purposes depending on the version. With v0.30.0, which I used, models such as gemini-3-flash-preview and gemini-2.5-flash-lite were observed in requests.

If the corresponding alias is not defined in LiteLLM Proxy, you'll get a BadRequestError: There are no healthy deployments for this model error. Since the models in use may change with Gemini CLI upgrades or new Gemini model releases, you'll likely need to monitor the proxy logs for requested model names and add any missing entries to model_group_alias as they appear.

The Model Must Support Tool Calling in Its Ollama Template

I initially used Google's gemma3:4b, but it failed with a does not support tools error.

Gemini CLI sends tool definitions (for file operations, command execution, etc.) as part of its requests. For Ollama to handle the tools parameter, the model's chat template needs to support tool calling.

An important nuance here is that a model's function calling capability and Ollama template support are separate concerns.

gemma3 is capable of prompt-based function calling at the model level (reference), but its Ollama template does not support it (ollama/ollama#9941).

Qwen 2.5, on the other hand, supports tool calling in its official Ollama template. The qwen2.5:3b model I used is only about 1.9 GB at 3B parameters, making it a convenient choice for a proof of concept.

Wrap-Up

I've shown how to swap Gemini CLI's backend to a local LLM by combining LiteLLM Proxy and Ollama.

The setup itself is relatively straightforward, but there were a few things that were hard to notice without actually running it — such as model name changes across Gemini CLI versions and the tool calling support status of Ollama models.

That said, a 3B-parameter model doesn't have the capacity to reliably handle Gemini CLI's AI agent features like file operations and code generation. For serious use as a coding assistant, you'll likely want to consider larger models.

Top comments (3)

JC • Mar 21

in the current situation pip install uvloop==0.22.1, because Python 3.14.2 is incompatible with the older uvloop.

polar3130 • Mar 23

Thanks for the tip! I didn't run into this issue in my setup, but that's helpful for anyone who does.

Sid • Apr 14

It's important to note if you are logged into gemini using your google account, this does not work - it only works when Gemini CLI is being used with an API Key