
Using Gemini CLI Through LiteLLM Proxy

Organizations adopting LLMs at scale often struggle with fragmented API usage, inconsistent authentication methods, and lack of visibility across teams. Tools like Gemini CLI make local development easier, but they also introduce governance challenges—especially when authentication silently bypasses centralized gateways.

In this article, I walk through how to route Gemini CLI traffic through LiteLLM Proxy, explain why this configuration matters for enterprise environments, and highlight key operational considerations learned from hands-on testing.


Why Use a Proxy for Gemini CLI?

Before diving into configuration, it’s worth clarifying why an LLM gateway is needed in the first place.

Problems with direct Gemini CLI usage

If developers run Gemini CLI with default settings:

  • Authentication may fall back to Google Account login → usage disappears from organizational audits
  • API traffic may hit multiple GCP projects/regions → inconsistent cost attribution
  • Personal API keys or user identities may be used → security and compliance risks
  • Team-wide visibility into token usage becomes impossible → cost governance cannot scale

LiteLLM Proxy as a solution

LiteLLM Proxy provides:

  • A unified OpenAI-compatible API endpoint
  • Virtual API keys with per-user / per-project scoping
  • Rate, budget, and quota enforcement
  • Centralized monitoring & analytics
  • Governance applied regardless of client tool (CLI, IDE, scripts)

This makes it suitable for organizations where 50–300+ developers may use Gemini, GPT, Claude, or Llama models across multiple teams.
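As a concrete illustration of the unified endpoint, any OpenAI-compatible client can call the proxy with a virtual key (key issuance is covered later in this article). The proxy URL, key, and model name below are placeholders:

# Call LiteLLM's OpenAI-compatible chat completions route with a virtual key.
# <proxy> and the key are placeholders; "gemini-2.5-pro" must exist in model_list.
curl -s https://<proxy>/v1/chat/completions \
  -H "Authorization: Bearer sk-<virtual key>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemini-2.5-pro",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'

The same request shape works whether the underlying model is Gemini, GPT, Claude, or Llama, which is what lets the gateway approach scale across teams and tools.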


Architecture Overview

For this walkthrough, I deployed LiteLLM Proxy on Cloud Run, with Cloud SQL for metadata storage.

Why this design?

  • Cloud Run scales automatically and supports secure invocations.
  • Cloud SQL stores virtual keys, usage analytics, and configuration.
  • Vertex AI IAM is handled via the LiteLLM Proxy’s service account.
  • API visibility is centralized and independent of client behavior.
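
For reference, a minimal deployment along these lines might look like the sketch below. The service name, image tag, service account, and Cloud SQL instance are assumptions for illustration; the master key and database connection string are better supplied from Secret Manager, as discussed later.

# Minimal sketch: LiteLLM Proxy on Cloud Run (all names and images are placeholders).
gcloud run deploy litellm-proxy \
  --image=ghcr.io/berriai/litellm:main-stable \
  --region=us-central1 \
  --service-account=litellm-proxy@my-project.iam.gserviceaccount.com \
  --add-cloudsql-instances=my-project:us-central1:litellm-db \
  --set-env-vars=GOOGLE_CLOUD_PROJECT=my-project \
  --no-allow-unauthenticated \
  --max-instances=5    # cap instances with Cloud SQL connection limits in mind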

Caveats

  • Cloud SQL connection limits must be considered when scaling Cloud Run.
  • Cold starts may slightly increase latency for short-lived CLI invocations.
  • Multi-region routing is out of scope but may be required for HA.

Configuration: LiteLLM Proxy

Below is a minimal configuration enabling Gemini models via Vertex AI:

model_list:
  - model_name: gemini-2.5-pro
    litellm_params:
      model: vertex_ai/gemini-2.5-pro
      vertex_project: os.environ/GOOGLE_CLOUD_PROJECT
      vertex_location: us-central1

  - model_name: gemini-2.5-flash
    litellm_params:
      model: vertex_ai/gemini-2.5-flash
      vertex_project: os.environ/GOOGLE_CLOUD_PROJECT
      vertex_location: us-central1

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  ui_username: admin
  ui_password: os.environ/LITELLM_UI_PASSWORD
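Before containerizing this, the same file can be validated locally with the LiteLLM CLI. The file name, port, and placeholder values below are assumptions; Vertex AI calls rely on Application Default Credentials:

# Local sanity check of the config (requires: pip install 'litellm[proxy]').
gcloud auth application-default login     # ADC is used for the Vertex AI calls
export GOOGLE_CLOUD_PROJECT=my-project    # placeholder project ID
export LITELLM_MASTER_KEY=sk-local-test   # throwaway key for local testing only
export LITELLM_UI_PASSWORD=change-me
litellm --config config.yaml --port 4000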

Operational notes & recommendations

  • Region selection: Vertex AI availability varies by location; us-central1 is generally safest for new Gemini releases.
  • Key management: Store LITELLM_MASTER_KEY and the UI credentials in Secret Manager rather than hard-coding them as plaintext environment variables (see the sketch after this list).
  • Production settings to consider: num_retries, timeout, async_calls, request logging policies.
  • Access control: Use Cloud Run’s invoker IAM or an API Gateway layer for a stronger security boundary.
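
A sketch of the Secret Manager approach, assuming the Cloud Run service from earlier and a secret named litellm-master-key (both placeholders):

# Generate a master key, store it in Secret Manager, and mount it into Cloud Run.
printf '%s' "sk-$(openssl rand -hex 24)" | \
  gcloud secrets create litellm-master-key --data-file=-

gcloud run services update litellm-proxy \
  --region=us-central1 \
  --set-secrets=LITELLM_MASTER_KEY=litellm-master-key:latest

The proxy’s service account also needs the Secret Manager Secret Accessor role on that secret.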

Virtual key issuance

curl -X POST https://<proxy>/key/generate \
  -H "Authorization: Bearer <master key>" \
  -H "Content-Type: application/json" \
  -d '{"models": ["gemini-2.5-pro","gemini-2.5-flash"], "duration":"30d"}'

This key will later be used by the Gemini CLI.
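
The /key/generate response is JSON with the generated key in its key field, so the value can be captured directly for the next step (jq assumed to be installed):

# Issue a virtual key and capture it from the JSON response.
VIRTUAL_KEY=$(curl -s -X POST https://<proxy>/key/generate \
  -H "Authorization: Bearer <master key>" \
  -H "Content-Type: application/json" \
  -d '{"models": ["gemini-2.5-pro","gemini-2.5-flash"], "duration":"30d"}' \
  | jq -r '.key')

echo "$VIRTUAL_KEY"   # virtual keys look like sk-...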


Configuration: Gemini CLI

Point the CLI to LiteLLM Proxy:

export GOOGLE_GEMINI_BASE_URL="https://<LiteLLM Proxy URL>"
export GEMINI_API_KEY="<virtual key>"

Important

GEMINI_API_KEY must be a LiteLLM virtual key, not a Google Cloud API key.

From the CLI’s point of view it is simply calling a Gemini API endpoint; in reality, every request flows through LiteLLM Proxy, which forwards it to Vertex AI.
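
Before running the CLI, it is worth confirming that the proxy accepts the virtual key at all, for example by listing the models the key is scoped to via LiteLLM’s OpenAI-style models route:

# The virtual key should see exactly the models it was scoped to at issuance.
curl -s "$GOOGLE_GEMINI_BASE_URL/v1/models" \
  -H "Authorization: Bearer $GEMINI_API_KEY" | jq -r '.data[].id'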


Testing the End-to-End Path

Once configured, run a simple test through Gemini CLI:

$ gemini hello
Loaded cached credentials.
Hello! I'm ready for your first command.

On the LiteLLM dashboard, you should see request logs, latency, and token usage.
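
The same check can be scripted against the proxy’s management API instead of the dashboard; for example, key-level spend can be read from the /key/info route (response field names may differ slightly between LiteLLM versions):

# Query accumulated spend for the virtual key (requires the master key).
curl -s "https://<proxy>/key/info?key=$GEMINI_API_KEY" \
  -H "Authorization: Bearer <master key>" | jq '.info.spend'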


Important Note: Authentication Bypass in Gemini CLI

During testing, I observed situations where:

  • Gemini CLI worked normally
  • but LiteLLM Proxy showed zero usage

Why it happens

Gemini CLI supports three authentication methods:

  1. Login with Google
  2. Use Gemini API Key
  3. Vertex AI

When a user authenticates with Login with Google:

  • The CLI uses Google OAuth credentials
  • Requests are sent directly to Google’s endpoints
  • GOOGLE_GEMINI_BASE_URL is ignored
  • LiteLLM Proxy is completely bypassed

If OAuth login is left enabled:

  • Teams lose visibility of CLI usage
  • Costs appear under personal or unintended projects
  • Security review cannot track data flowing to Vertex AI
  • API limits and budgets set on LiteLLM do not apply

This is the number one issue organizations should be aware of.
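
One pragmatic mitigation is a small pre-flight wrapper that refuses to start the CLI unless proxy routing is configured, and that warns when cached Google OAuth credentials are present. The ~/.gemini/oauth_creds.json path below is an assumption about where the CLI caches OAuth tokens; verify it for your CLI version.

#!/usr/bin/env bash
# Pre-flight guard: only run Gemini CLI when proxy routing is configured.
set -euo pipefail

: "${GOOGLE_GEMINI_BASE_URL:?must point at LiteLLM Proxy}"
: "${GEMINI_API_KEY:?must be set to a LiteLLM virtual key}"

# Cached OAuth credentials allow the CLI to bypass the proxy entirely.
# (Path is an assumption; confirm where your CLI version stores OAuth tokens.)
if [ -f "$HOME/.gemini/oauth_creds.json" ]; then
  echo "WARNING: cached Google OAuth credentials found; traffic may bypass LiteLLM." >&2
fi

exec gemini "$@"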


Summary

In this article, we walked through how to route Gemini CLI traffic through LiteLLM Proxy and highlighted key lessons from testing.

Benefits

  • Unifies API governance across CLI, IDE, and backend services
  • Enables per-user quotas, budgets, and access scopes
  • Provides analytics across all models and providers
  • Gives SRE/PFE teams full visibility into LLM usage patterns

Limitations / Things to Consider

  • Gemini CLI’s Google-auth login bypasses proxies unless explicitly disabled
  • Cloud Run + Cloud SQL requires connection pooling considerations
  • Model list updates must be maintained when Vertex releases new versions
  • LiteLLM Enterprise features (SSO, RBAC, audit logging) may be necessary for large orgs
