
Using Gemini CLI Through LiteLLM Proxy

Organizations adopting LLMs at scale often struggle with fragmented API usage, inconsistent authentication methods, and lack of visibility across teams. Tools like Gemini CLI make local development easier, but they also introduce governance challenges—especially when authentication silently bypasses centralized gateways.

In this article, I walk through how to route Gemini CLI traffic through LiteLLM Proxy, explain why this configuration matters for enterprise environments, and highlight key operational considerations learned from hands-on testing.


Why Use a Proxy for Gemini CLI?

Before diving into configuration, it’s worth clarifying why an LLM gateway is needed in the first place.

Problems with direct Gemini CLI usage

If developers run Gemini CLI with default settings:

  • Authentication may fall back to Google Account login → usage disappears from organizational audits
  • API traffic may hit multiple GCP projects/regions → inconsistent cost attribution
  • Personal API keys or user identities may be used → security and compliance risks
  • Team-wide visibility into token usage becomes impossible → cost governance cannot scale

LiteLLM Proxy as a solution

LiteLLM Proxy provides:

  • A unified OpenAI-compatible API endpoint
  • Virtual API keys with per-user / per-project scoping
  • Rate, budget, and quota enforcement
  • Centralized monitoring & analytics
  • Governance applied regardless of client tool (CLI, IDE, scripts)

This makes it suitable for organizations where 50–300+ developers may use Gemini, GPT, Claude, or Llama models across multiple teams.
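As a concrete illustration of the unified endpoint, any OpenAI-compatible client can call the proxy with a virtual key (key issuance is covered later in this article). The proxy URL, key, and model name below are placeholders:

# Call LiteLLM's OpenAI-compatible chat completions route with a virtual key.
# <proxy> and the key are placeholders; "gemini-2.5-pro" must exist in model_list.
curl -s https://<proxy>/v1/chat/completions \
  -H "Authorization: Bearer sk-<virtual key>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "gemini-2.5-pro",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'

The same request shape works whether the underlying model is Gemini, GPT, Claude, or Llama, which is what lets the gateway approach scale across teams and tools.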


Architecture Overview

For this walkthrough, I deployed LiteLLM Proxy on Cloud Run, with Cloud SQL for metadata storage.

Why this design?

  • Cloud Run scales automatically and supports secure invocations.
  • Cloud SQL stores virtual keys, usage analytics, and configuration.
  • Vertex AI IAM is handled via the LiteLLM Proxy’s service account.
  • API visibility is centralized and independent of client behavior.
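
For reference, a minimal deployment along these lines might look like the sketch below. The service name, image tag, service account, and Cloud SQL instance are assumptions for illustration; the master key and database connection string are better supplied from Secret Manager, as discussed later.

# Minimal sketch: LiteLLM Proxy on Cloud Run (all names and images are placeholders).
gcloud run deploy litellm-proxy \
  --image=ghcr.io/berriai/litellm:main-stable \
  --region=us-central1 \
  --service-account=litellm-proxy@my-project.iam.gserviceaccount.com \
  --add-cloudsql-instances=my-project:us-central1:litellm-db \
  --set-env-vars=GOOGLE_CLOUD_PROJECT=my-project \
  --no-allow-unauthenticated \
  --max-instances=5    # cap instances with Cloud SQL connection limits in mind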

Caveats

  • Cloud SQL connection limits must be considered when scaling Cloud Run.
  • Cold starts may slightly increase latency for short-lived CLI invocations.
  • Multi-region routing is out of scope but may be required for HA.

Configuration: LiteLLM Proxy

Below is a minimal configuration enabling Gemini models via Vertex AI:

model_list:
  - model_name: gemini-2.5-pro
    litellm_params:
      model: vertex_ai/gemini-2.5-pro
      vertex_project: os.environ/GOOGLE_CLOUD_PROJECT
      vertex_location: us-central1

  - model_name: gemini-2.5-flash
    litellm_params:
      model: vertex_ai/gemini-2.5-flash
      vertex_project: os.environ/GOOGLE_CLOUD_PROJECT
      vertex_location: us-central1

general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  ui_username: admin
  ui_password: os.environ/LITELLM_UI_PASSWORD
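Before containerizing this, the same file can be validated locally with the LiteLLM CLI. The file name, port, and placeholder values below are assumptions; Vertex AI calls rely on Application Default Credentials:

# Local sanity check of the config (requires: pip install 'litellm[proxy]').
gcloud auth application-default login     # ADC is used for the Vertex AI calls
export GOOGLE_CLOUD_PROJECT=my-project    # placeholder project ID
export LITELLM_MASTER_KEY=sk-local-test   # throwaway key for local testing only
export LITELLM_UI_PASSWORD=change-me
litellm --config config.yaml --port 4000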

Operational notes & recommendations

  • Region selection: Vertex AI availability varies by location; us-central1 is generally safest for new Gemini releases.
  • Key management: Store LITELLM_MASTER_KEY and the UI credentials in Secret Manager rather than hard-coding them as plaintext environment variables (see the sketch after this list).
  • Production settings to consider: num_retries, timeout, async_calls, request logging policies.
  • Access control: Use Cloud Run’s invoker IAM or an API Gateway layer for a stronger security boundary.
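
A sketch of the Secret Manager approach, assuming the Cloud Run service from earlier and a secret named litellm-master-key (both placeholders):

# Generate a master key, store it in Secret Manager, and mount it into Cloud Run.
printf '%s' "sk-$(openssl rand -hex 24)" | \
  gcloud secrets create litellm-master-key --data-file=-

gcloud run services update litellm-proxy \
  --region=us-central1 \
  --set-secrets=LITELLM_MASTER_KEY=litellm-master-key:latest

The proxy’s service account also needs the Secret Manager Secret Accessor role on that secret.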

Virtual key issuance

curl -X POST https://<proxy>/key/generate \
  -H "Authorization: Bearer <master key>" \
  -H "Content-Type: application/json" \
  -d '{"models": ["gemini-2.5-pro","gemini-2.5-flash"], "duration":"30d"}'

This key will later be used by the Gemini CLI.
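
The /key/generate response is JSON with the generated key in its key field, so the value can be captured directly for the next step (jq assumed to be installed):

# Issue a virtual key and capture it from the JSON response.
VIRTUAL_KEY=$(curl -s -X POST https://<proxy>/key/generate \
  -H "Authorization: Bearer <master key>" \
  -H "Content-Type: application/json" \
  -d '{"models": ["gemini-2.5-pro","gemini-2.5-flash"], "duration":"30d"}' \
  | jq -r '.key')

echo "$VIRTUAL_KEY"   # virtual keys look like sk-...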


Configuration: Gemini CLI

Point the CLI to LiteLLM Proxy:

export GOOGLE_GEMINI_BASE_URL="https://<LiteLLM Proxy URL>"
export GEMINI_API_KEY="<virtual key>"

Important

GEMINI_API_KEY must be a LiteLLM virtual key, not a Google Cloud API key.

From the CLI’s point of view it is simply calling a Gemini API endpoint; in reality, every request flows through LiteLLM Proxy, which forwards it to Vertex AI.
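
Before running the CLI, it is worth confirming that the proxy accepts the virtual key at all, for example by listing the models the key is scoped to via LiteLLM’s OpenAI-style models route:

# The virtual key should see exactly the models it was scoped to at issuance.
curl -s "$GOOGLE_GEMINI_BASE_URL/v1/models" \
  -H "Authorization: Bearer $GEMINI_API_KEY" | jq -r '.data[].id'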


Testing the End-to-End Path

Once configured, run a simple test through Gemini CLI:

$ gemini hello
Loaded cached credentials.
Hello! I'm ready for your first command.

On the LiteLLM dashboard, you should see request logs, latency, and token usage.
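
The same check can be scripted against the proxy’s management API instead of the dashboard; for example, key-level spend can be read from the /key/info route (response field names may differ slightly between LiteLLM versions):

# Query accumulated spend for the virtual key (requires the master key).
curl -s "https://<proxy>/key/info?key=$GEMINI_API_KEY" \
  -H "Authorization: Bearer <master key>" | jq '.info.spend'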


Important Note: Authentication Bypass in Gemini CLI

During testing, I observed situations where:

  • Gemini CLI worked normally
  • but LiteLLM Proxy showed zero usage

Why it happens

Gemini CLI supports three authentication methods:

  1. Login with Google
  2. Use Gemini API Key
  3. Vertex AI

When a user authenticates with Login with Google:

  • The CLI uses Google OAuth credentials
  • Requests are sent directly to Google’s endpoints
  • GOOGLE_GEMINI_BASE_URL is ignored
  • LiteLLM Proxy is completely bypassed

If OAuth login is left enabled:

  • Teams lose visibility of CLI usage
  • Costs appear under personal or unintended projects
  • Security review cannot track data flowing to Vertex AI
  • API limits and budgets set on LiteLLM do not apply

This is the number one issue organizations should be aware of.
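
One pragmatic mitigation is a small pre-flight wrapper that refuses to start the CLI unless proxy routing is configured, and that warns when cached Google OAuth credentials are present. The ~/.gemini/oauth_creds.json path below is an assumption about where the CLI caches OAuth tokens; verify it for your CLI version.

#!/usr/bin/env bash
# Pre-flight guard: only run Gemini CLI when proxy routing is configured.
set -euo pipefail

: "${GOOGLE_GEMINI_BASE_URL:?must point at LiteLLM Proxy}"
: "${GEMINI_API_KEY:?must be set to a LiteLLM virtual key}"

# Cached OAuth credentials allow the CLI to bypass the proxy entirely.
# (Path is an assumption; confirm where your CLI version stores OAuth tokens.)
if [ -f "$HOME/.gemini/oauth_creds.json" ]; then
  echo "WARNING: cached Google OAuth credentials found; traffic may bypass LiteLLM." >&2
fi

exec gemini "$@"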


Summary

In this article, we walked through how to route Gemini CLI traffic through LiteLLM Proxy and highlighted key lessons from testing.

Benefits

  • Unifies API governance across CLI, IDE, and backend services
  • Enables per-user quotas, budgets, and access scopes
  • Provides analytics across all models and providers
  • Gives SRE/PFE teams full visibility into LLM usage patterns

Limitations / Things to Consider

  • Gemini CLI’s Google-auth login bypasses proxies unless explicitly disabled
  • Cloud Run + Cloud SQL requires connection pooling considerations
  • Model list updates must be maintained when Vertex releases new versions
  • LiteLLM Enterprise features (SSO, RBAC, audit logging) may be necessary for large orgs
