<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: polar3130</title>
    <description>The latest articles on DEV Community by polar3130 (@polar3130).</description>
    <link>https://dev.to/polar3130</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1034515%2Fdf1609c4-233e-482b-81ff-1d1a5e49a948.jpg</url>
      <title>DEV Community: polar3130</title>
      <link>https://dev.to/polar3130</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/polar3130"/>
    <language>en</language>
    <item>
      <title>Using Gemini CLI with a Local LLM</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Fri, 27 Feb 2026 09:39:44 +0000</pubDate>
      <link>https://dev.to/polar3130/using-gemini-cli-with-a-local-llm-5f5l</link>
      <guid>https://dev.to/polar3130/using-gemini-cli-with-a-local-llm-5f5l</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;, an open-source AI agent published by Google, lets you interact with Gemini models from your terminal. It normally connects to Google's API endpoint, but by redirecting the API destination, you can also use a locally running LLM as its backend.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk through how to combine &lt;a href="https://docs.litellm.ai/" rel="noopener noreferrer"&gt;LiteLLM Proxy&lt;/a&gt; and &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; to swap Gemini CLI's backend to a local LLM, along with a few gotchas I encountered during setup.&lt;/p&gt;

&lt;p&gt;I've also covered using LiteLLM Proxy for centralized LLM API management in a &lt;a href="https://dev.to/polar3130/using-gemini-cli-through-litellm-proxy-1627"&gt;previous post&lt;/a&gt;, if you're interested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Here is the overall architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnmr9791ehlwcodfs3yw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnmr9791ehlwcodfs3yw.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By setting the &lt;code&gt;GOOGLE_GEMINI_BASE_URL&lt;/code&gt; environment variable, which is read by the &lt;code&gt;@google/genai&lt;/code&gt; SDK that Gemini CLI is built on, you can redirect all of Gemini CLI's API requests to an arbitrary endpoint. The variable doesn't appear to be documented in the Gemini CLI docs, but it is supported on the SDK side (&lt;a href="https://github.com/google-gemini/gemini-cli/pull/6380" rel="noopener noreferrer"&gt;reference PR&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;LiteLLM Proxy exposes Gemini API-compatible endpoints (&lt;code&gt;/v1beta/models/{model}:streamGenerateContent&lt;/code&gt;, etc.) and relays incoming requests to a local model running on Ollama. LiteLLM Proxy has a feature called &lt;code&gt;model_group_alias&lt;/code&gt; that routes a requested model name to a different model, which allows you to map model names sent by Gemini CLI (such as &lt;code&gt;gemini-3-flash-preview&lt;/code&gt;) to a local model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;macOS (Apple Silicon, Tahoe 26)&lt;/li&gt;
&lt;li&gt;Gemini CLI v0.30.0&lt;/li&gt;
&lt;li&gt;LiteLLM v1.81.16&lt;/li&gt;
&lt;li&gt;Ollama v0.17.0&lt;/li&gt;
&lt;li&gt;Python 3.14.0&lt;/li&gt;
&lt;li&gt;Node.js v22.17.0&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installing Ollama and Pulling a Model
&lt;/h3&gt;

&lt;p&gt;Install via Homebrew and start it as a service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
brew services start ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pull a model. I initially planned to use gemma3, but as described later, gemma3 doesn't support tool calling in the Ollama template format, so I went with the lightweight &lt;code&gt;qwen2.5:3b&lt;/code&gt; (~1.9 GB) for this proof of concept.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5:3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Installing LiteLLM
&lt;/h3&gt;

&lt;p&gt;Create a Python virtual environment and install LiteLLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'litellm[proxy]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring LiteLLM Proxy
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;litellm_config.yaml&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-model&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama_chat/qwen2.5:3b&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434"&lt;/span&gt;

&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model_group_alias&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-pro-preview"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-pro-preview-customtools"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview-customtools"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash-lite"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key here is the &lt;code&gt;model_group_alias&lt;/code&gt; configuration. Gemini CLI uses multiple models internally — a main generation model (&lt;code&gt;gemini-3-flash-preview&lt;/code&gt;, etc.) as well as a lighter model for input classification (&lt;code&gt;gemini-2.5-flash-lite&lt;/code&gt;). Aliases for all of these model names need to be defined. It would be nice if wildcards were supported, but for now, each model name requires its own alias.&lt;/p&gt;

&lt;p&gt;Start the proxy with the config file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;litellm &lt;span class="nt"&gt;--config&lt;/span&gt; litellm_config.yaml &lt;span class="nt"&gt;--port&lt;/span&gt; 4000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Starting Gemini CLI
&lt;/h3&gt;

&lt;p&gt;Set the environment variables and start Gemini CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_GEMINI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:4000"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-dummy-key"&lt;/span&gt;
gemini &lt;span class="nt"&gt;--sandbox&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API key isn't actually used, so any dummy value will do.&lt;/p&gt;

&lt;p&gt;You should now be getting responses from the local LLM.&lt;/p&gt;

&lt;p&gt;Note that &lt;code&gt;--sandbox=false&lt;/code&gt; is specified because, in sandbox mode, &lt;code&gt;GOOGLE_GEMINI_BASE_URL&lt;/code&gt; is not passed into the sandbox container — a known issue (&lt;a href="https://github.com/google-gemini/gemini-cli/issues/2168" rel="noopener noreferrer"&gt;Issue #2168&lt;/a&gt;).&lt;/p&gt;
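
&lt;p&gt;If something doesn't work, you can take Gemini CLI out of the loop and test the proxy path directly. The request below uses the Gemini-format endpoint mentioned earlier; the &lt;code&gt;x-goog-api-key&lt;/code&gt; header name follows the Gemini API convention and is my assumption here, so adjust it if your LiteLLM version expects a different header.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Send a minimal generateContent request straight to the proxy
curl -s "http://localhost:4000/v1beta/models/gemini-2.5-flash:generateContent" \
  -H "Content-Type: application/json" \
  -H "x-goog-api-key: sk-dummy-key" \
  -d '{"contents": [{"parts": [{"text": "Say hello"}]}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A JSON response containing &lt;code&gt;candidates&lt;/code&gt; confirms the proxy-to-Ollama path independently of the CLI.&lt;/p&gt;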

&lt;h2&gt;
  
  
  Gotchas During Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Missing model_group_alias Entries Cause Request Errors
&lt;/h3&gt;

&lt;p&gt;Gemini CLI uses different models for different purposes depending on the version. With v0.30.0, which I used, models such as &lt;code&gt;gemini-3-flash-preview&lt;/code&gt; and &lt;code&gt;gemini-2.5-flash-lite&lt;/code&gt; were observed in requests.&lt;/p&gt;

&lt;p&gt;If the corresponding alias is not defined in LiteLLM Proxy, you'll get a &lt;code&gt;BadRequestError: There are no healthy deployments for this model&lt;/code&gt; error. Since the models in use may change with Gemini CLI upgrades or new Gemini model releases, you'll likely need to monitor the proxy logs for requested model names and add any missing entries to &lt;code&gt;model_group_alias&lt;/code&gt; as they appear.&lt;/p&gt;
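
&lt;p&gt;One way to catch missing aliases early is to run the proxy with verbose logging and watch for unrecognized model names in incoming requests (the &lt;code&gt;--detailed_debug&lt;/code&gt; flag is from the LiteLLM CLI; it's noisy, so use it for debugging only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Verbose logging prints each incoming request, including the requested model name
litellm --config litellm_config.yaml --port 4000 --detailed_debug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;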

&lt;h3&gt;
  
  
  The Model Must Support Tool Calling in Its Ollama Template
&lt;/h3&gt;

&lt;p&gt;I initially used Google's &lt;code&gt;gemma3:4b&lt;/code&gt;, but it failed with a &lt;code&gt;does not support tools&lt;/code&gt; error.&lt;/p&gt;

&lt;p&gt;Gemini CLI sends tool definitions (for file operations, command execution, etc.) as part of its requests. For Ollama to handle the &lt;code&gt;tools&lt;/code&gt; parameter, the model's chat template needs to support tool calling.&lt;/p&gt;

&lt;p&gt;An important nuance here is that a model's function calling capability and Ollama template support are separate concerns.&lt;/p&gt;

&lt;p&gt;gemma3 is capable of prompt-based function calling at the model level (&lt;a href="https://ai.google.dev/gemma/docs/capabilities/function-calling" rel="noopener noreferrer"&gt;reference&lt;/a&gt;), but its Ollama template does not support it (&lt;a href="https://github.com/ollama/ollama/issues/9941" rel="noopener noreferrer"&gt;ollama/ollama#9941&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Qwen 2.5, on the other hand, supports tool calling in its official Ollama template. The &lt;code&gt;qwen2.5:3b&lt;/code&gt; model I used is only about 1.9 GB at 3B parameters, making it a convenient choice for a proof of concept.&lt;/p&gt;
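
&lt;p&gt;To check a pulled model's template before wiring everything up, you can print it with &lt;code&gt;ollama show&lt;/code&gt;. As a rough heuristic (my own, not an official check), templates that support tool calling reference a &lt;code&gt;.Tools&lt;/code&gt; variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print the model's chat template and look for tool-related sections
ollama show qwen2.5:3b --template | grep -i tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the grep comes back empty, the template likely can't render the &lt;code&gt;tools&lt;/code&gt; parameter even if the underlying model understands function calling.&lt;/p&gt;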

&lt;h2&gt;
  
  
  Wrap-Up
&lt;/h2&gt;

&lt;p&gt;I've shown how to swap Gemini CLI's backend to a local LLM by combining LiteLLM Proxy and Ollama.&lt;/p&gt;

&lt;p&gt;The setup itself is relatively straightforward, but there were a few things that were hard to notice without actually running it — such as model name changes across Gemini CLI versions and the tool calling support status of Ollama models.&lt;/p&gt;

&lt;p&gt;That said, a 3B-parameter model doesn't have the capacity to reliably handle Gemini CLI's AI agent features like file operations and code generation. For serious use as a coding assistant, you'll likely want to consider larger models.&lt;/p&gt;

</description>
      <category>cli</category>
      <category>gemini</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Using Gemini CLI Through LiteLLM Proxy</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Tue, 25 Nov 2025 03:31:00 +0000</pubDate>
      <link>https://dev.to/polar3130/using-gemini-cli-through-litellm-proxy-1627</link>
      <guid>https://dev.to/polar3130/using-gemini-cli-through-litellm-proxy-1627</guid>
      <description>&lt;p&gt;Organizations adopting LLMs at scale often struggle with fragmented API usage, inconsistent authentication methods, and lack of visibility across teams. Tools like Gemini CLI make local development easier, but they also introduce governance challenges—especially when authentication silently bypasses centralized gateways.&lt;/p&gt;

&lt;p&gt;In this article, I walk through how to route Gemini CLI traffic through LiteLLM Proxy, explain why this configuration matters for enterprise environments, and highlight key operational considerations learned from hands-on testing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Use a Proxy for Gemini CLI?
&lt;/h2&gt;

&lt;p&gt;Before diving into configuration, it’s worth clarifying why an LLM gateway is needed in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problems with direct Gemini CLI usage
&lt;/h3&gt;

&lt;p&gt;If developers run Gemini CLI with default settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authentication may fall back to Google Account login
→ usage disappears from organizational audits&lt;/li&gt;
&lt;li&gt;API traffic may hit multiple GCP projects/regions
→ inconsistent cost attribution&lt;/li&gt;
&lt;li&gt;Personal API keys or user identities may be used
→ security and compliance risks&lt;/li&gt;
&lt;li&gt;Team-wide visibility into token usage becomes impossible
→ cost governance cannot scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LiteLLM Proxy as a solution
&lt;/h3&gt;

&lt;p&gt;LiteLLM Proxy provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A unified OpenAI-compatible API endpoint&lt;/li&gt;
&lt;li&gt;Virtual API keys with per-user / per-project scoping&lt;/li&gt;
&lt;li&gt;Rate, budget, and quota enforcement&lt;/li&gt;
&lt;li&gt;Centralized monitoring &amp;amp; analytics&lt;/li&gt;
&lt;li&gt;Governance applied regardless of client tool (CLI, IDE, scripts)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it suitable for organizations where 50–300+ developers may use Gemini, GPT, Claude, or Llama models across multiple teams.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;For this walkthrough, I deployed LiteLLM Proxy onto Cloud Run, using Cloud SQL for metadata storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnexvfqb0ew76jfmix6x7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnexvfqb0ew76jfmix6x7.png" alt=" " width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this design?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud Run scales automatically and supports secure invocations.&lt;/li&gt;
&lt;li&gt;Cloud SQL stores key usage, analytics, and configuration.&lt;/li&gt;
&lt;li&gt;Vertex AI IAM is handled via the LiteLLM Proxy’s service account.&lt;/li&gt;
&lt;li&gt;API visibility is centralized and independent of client behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Caveats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud SQL connection limits must be considered when scaling Cloud Run.&lt;/li&gt;
&lt;li&gt;Cold starts may slightly increase latency for short-lived CLI invocations.&lt;/li&gt;
&lt;li&gt;Multi-region routing is out of scope but may be required for HA.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Configuration: LiteLLM Proxy
&lt;/h2&gt;

&lt;p&gt;Below is a minimal configuration enabling Gemini models via Vertex AI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemini-2.5-pro&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vertex_ai/gemini-2.5-pro&lt;/span&gt;
      &lt;span class="na"&gt;vertex_project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/GOOGLE_CLOUD_PROJECT&lt;/span&gt;
      &lt;span class="na"&gt;vertex_location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vertex_ai/gemini-2.5-flash&lt;/span&gt;
      &lt;span class="na"&gt;vertex_project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/GOOGLE_CLOUD_PROJECT&lt;/span&gt;
      &lt;span class="na"&gt;vertex_location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1&lt;/span&gt;

&lt;span class="na"&gt;general_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;master_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LITELLM_MASTER_KEY&lt;/span&gt;
  &lt;span class="na"&gt;ui_username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
  &lt;span class="na"&gt;ui_password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LITELLM_UI_PASSWORD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Operational notes &amp;amp; recommendations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Region selection: Vertex AI availability varies by location; &lt;code&gt;us-central1&lt;/code&gt; is generally safest for new Gemini releases.&lt;/li&gt;
&lt;li&gt;Key management:
Store &lt;code&gt;LITELLM_MASTER_KEY&lt;/code&gt; and UI credentials in Secret Manager, not environment variables.&lt;/li&gt;
&lt;li&gt;Production settings to consider:
&lt;code&gt;num_retries&lt;/code&gt;, &lt;code&gt;timeout&lt;/code&gt;, &lt;code&gt;async_calls&lt;/code&gt;, request logging policies.&lt;/li&gt;
&lt;li&gt;Access control:
Use Cloud Run’s invoker IAM or an API Gateway layer for stronger security boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Virtual key issuance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://&amp;lt;proxy&amp;gt;/key/generate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &amp;lt;master key&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"models": ["gemini-2.5-pro","gemini-2.5-flash"], "duration":"30d"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This key will later be used by the Gemini CLI.&lt;/p&gt;
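
&lt;p&gt;The JSON response includes the generated key in a &lt;code&gt;key&lt;/code&gt; field. To check later what a key is scoped to, LiteLLM also exposes a key-inspection endpoint (sketched below from LiteLLM's key management API; verify the path against your version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -s "https://&amp;lt;proxy&amp;gt;/key/info?key=&amp;lt;virtual key&amp;gt;" \
  -H "Authorization: Bearer &amp;lt;master key&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;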




&lt;h2&gt;
  
  
  Configuration: Gemini CLI
&lt;/h2&gt;

&lt;p&gt;Point the CLI to LiteLLM Proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_GEMINI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;LiteLLM Proxy URL&amp;gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;virtual key&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Important
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;GEMINI_API_KEY&lt;/code&gt; must be a LiteLLM virtual key, not a Google Cloud API key.&lt;/p&gt;

&lt;p&gt;Gemini CLI now behaves as if it were talking directly to the Gemini API, while the traffic actually flows through LiteLLM Proxy to Vertex AI.&lt;/p&gt;
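
&lt;p&gt;Before involving the CLI, a quick way to confirm the key and base URL are wired correctly is to list models through the proxy's OpenAI-compatible route (a sanity check of my own; &lt;code&gt;/v1/models&lt;/code&gt; is part of LiteLLM's OpenAI-compatible surface):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Should return the models the virtual key is allowed to use
curl -s "https://&amp;lt;LiteLLM Proxy URL&amp;gt;/v1/models" \
  -H "Authorization: Bearer &amp;lt;virtual key&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;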




&lt;h2&gt;
  
  
  Testing the End-to-End Path
&lt;/h2&gt;

&lt;p&gt;Once configured, run a simple test through Gemini CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gemini hello
Loaded cached credentials.
Hello! I'm ready for your first command.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the LiteLLM dashboard, you should see request logs, latency, and token usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ply4igltd01md2yprhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ply4igltd01md2yprhh.png" alt=" " width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Important Note: Authentication Bypass in Gemini CLI
&lt;/h2&gt;

&lt;p&gt;During testing, I observed situations where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini CLI worked normally&lt;/li&gt;
&lt;li&gt;but LiteLLM Proxy showed zero usage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why it happens
&lt;/h3&gt;

&lt;p&gt;Gemini CLI supports three authentication methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Login with Google&lt;/li&gt;
&lt;li&gt;Use Gemini API Key&lt;/li&gt;
&lt;li&gt;Vertex AI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2dwzw3hajd5gspl2sng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2dwzw3hajd5gspl2sng.png" alt=" " width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a user logs in with Google Login:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CLI uses Google OAuth credentials&lt;/li&gt;
&lt;li&gt;These credentials automatically route traffic directly to Vertex AI&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GOOGLE_GEMINI_BASE_URL&lt;/code&gt; is ignored&lt;/li&gt;
&lt;li&gt;LiteLLM Proxy is completely bypassed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If OAuth login is left enabled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teams lose visibility of CLI usage&lt;/li&gt;
&lt;li&gt;Costs appear under personal or unintended projects&lt;/li&gt;
&lt;li&gt;Security review cannot track data flowing to Vertex AI&lt;/li&gt;
&lt;li&gt;API limits and budgets set on LiteLLM do not apply&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the number one issue organizations should be aware of.&lt;/p&gt;
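
&lt;p&gt;One mitigation is to pin the CLI's auth method so the interactive OAuth option can't silently win. Gemini CLI persists the selected method in &lt;code&gt;~/.gemini/settings.json&lt;/code&gt;; the field name below reflects the versions I tested and may differ in newer releases, so verify against your install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "selectedAuthType": "gemini-api-key"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;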




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we walked through how to route Gemini CLI traffic through LiteLLM Proxy and highlighted key lessons from testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Unifies API governance across CLI, IDE, and backend services&lt;/li&gt;
&lt;li&gt;Enables per-user quotas, budgets, and access scopes&lt;/li&gt;
&lt;li&gt;Provides analytics across all models and providers&lt;/li&gt;
&lt;li&gt;Gives SRE/PFE teams full visibility into LLM usage patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations / Things to Consider
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Gemini CLI’s Google-auth login bypasses proxies unless explicitly disabled&lt;/li&gt;
&lt;li&gt;Cloud Run + Cloud SQL requires connection pooling considerations&lt;/li&gt;
&lt;li&gt;The model list must be kept up to date as Vertex AI releases new model versions&lt;/li&gt;
&lt;li&gt;LiteLLM Enterprise features (SSO, RBAC, audit logging) may be necessary for large orgs&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gemini</category>
      <category>cli</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Differences in Response Models between the Vertex AI SDK and the Gen AI SDK</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Thu, 31 Jul 2025 13:15:56 +0000</pubDate>
      <link>https://dev.to/polar3130/differences-in-response-models-between-the-vertex-ai-sdk-and-the-gen-ai-sdk-4m49</link>
      <guid>https://dev.to/polar3130/differences-in-response-models-between-the-vertex-ai-sdk-and-the-gen-ai-sdk-4m49</guid>
      <description>&lt;p&gt;When migrating a Python-based AI application from the Vertex AI SDK to the Gen AI SDK, I made an interesting discovery: the Gen AI SDK uses a Pydantic-based response model (&lt;code&gt;GenerateContentResponse&lt;/code&gt;), which means you can serialize it with &lt;code&gt;model_dump()&lt;/code&gt; or &lt;code&gt;model_dump_json()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For anyone unfamiliar with the current landscape, it can be confusing that Google offers &lt;em&gt;multiple&lt;/em&gt; official SDKs for working with the Gemini API. Below is some background before we dive in. At the moment, Gemini exposes &lt;strong&gt;two main APIs&lt;/strong&gt; and &lt;strong&gt;three Python SDKs&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gemini APIs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Gemini API in Vertex AI&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Access Gemini models via Google Cloud’s Vertex AI&lt;/li&gt;
&lt;li&gt;Requires a Google Cloud project&lt;/li&gt;
&lt;li&gt;IAM-based authentication and access control&lt;/li&gt;
&lt;li&gt;Per-project quotas that throttle usage as needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See: &lt;strong&gt;“Migrate from the Gemini Developer API to the Vertex AI Gemini API”&lt;/strong&gt; (Google Cloud Docs)&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Gemini Developer API&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Access Gemini through Google AI Studio&lt;/li&gt;
&lt;li&gt;Works even without a Google Cloud project&lt;/li&gt;
&lt;li&gt;Generous free tier—ideal for learning and prototyping&lt;/li&gt;
&lt;li&gt;Enterprise-grade features and advanced settings are limited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See: &lt;strong&gt;“Get a Gemini API key | Google AI for Developers”&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Official Python SDKs
&lt;/h2&gt;

&lt;p&gt;Google currently maintains three Python SDKs:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Google Gen AI SDK (&lt;code&gt;google-genai&lt;/code&gt;)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Supports &lt;em&gt;both&lt;/em&gt; the Gemini Developer API and the Vertex AI Gemini API&lt;/li&gt;
&lt;li&gt;A single code base that handles API-key auth (AI Studio) &lt;em&gt;and&lt;/em&gt; IAM auth (Vertex AI)&lt;/li&gt;
&lt;li&gt;Newer than the others and updated most frequently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docs: &lt;a href="https://googleapis.github.io/python-genai/" rel="noopener noreferrer"&gt;https://googleapis.github.io/python-genai/&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Vertex AI SDK (&lt;code&gt;google-cloud-aiplatform&lt;/code&gt;)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Dedicated to the Gemini API in Vertex AI&lt;/li&gt;
&lt;li&gt;Lets you use Gemini models through Vertex AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/googleapis/python-aiplatform" rel="noopener noreferrer"&gt;https://github.com/googleapis/python-aiplatform&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Google AI Python SDK (&lt;code&gt;google-generativeai&lt;/code&gt;)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Targets the Gemini Developer API only&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Does not&lt;/em&gt; work with Vertex AI&lt;/li&gt;
&lt;li&gt;Now deprecated; scheduled for EOL at the end of August 2025&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/google-gemini/deprecated-generative-ai-python" rel="noopener noreferrer"&gt;https://github.com/google-gemini/deprecated-generative-ai-python&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Response Models Differ
&lt;/h2&gt;

&lt;p&gt;Both the Vertex AI SDK and the Gen AI SDK can call the Gemini API in Vertex AI, but their usage patterns differ slightly. Below are minimal examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example with the Vertex AI SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.generative_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenerativeModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vertexai&lt;/span&gt;

&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*****************&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;REGION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;REGION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of Japan?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;# -&amp;gt; Tokyo is the capital of Japan.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example with the Gen AI SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpOptions&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*****************&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;http_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of Japan?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;# -&amp;gt; Tokyo is the capital of Japan.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although you still retrieve the generated text via &lt;code&gt;response.text&lt;/code&gt;, the underlying &lt;em&gt;response object&lt;/em&gt; differs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SDK&lt;/th&gt;
&lt;th&gt;Response class&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vertex AI SDK&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GenerationResponse&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gen AI SDK&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GenerateContentResponse&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each response bundles the generated content plus rich metadata. If you want to dump &lt;em&gt;everything&lt;/em&gt; to JSON—for example, to inspect intermediate artifacts—here’s how you do it with each SDK.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serializing with the Vertex AI SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vertexai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.generative_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenerativeModel&lt;/span&gt;

&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*****************&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of Japan?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GenerationResponse&lt;/code&gt; exposes a handy &lt;code&gt;to_dict()&lt;/code&gt; method.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serializing with the Gen AI SDK
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;GenerateContentResponse&lt;/code&gt; is built on &lt;strong&gt;Pydantic’s &lt;code&gt;BaseModel&lt;/code&gt;&lt;/strong&gt;, so you use &lt;code&gt;model_dump()&lt;/code&gt; (or &lt;code&gt;model_dump_json()&lt;/code&gt;) instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpOptions&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*****************&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;http_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of Japan?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;br&gt;
If you call &lt;code&gt;to_dict()&lt;/code&gt; on a &lt;code&gt;GenerateContentResponse&lt;/code&gt;, you’ll get an error—one of those “gotchas” to watch for when migrating.&lt;/p&gt;
&lt;/blockquote&gt;
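&lt;p&gt;If your code has to handle responses from both SDKs during a migration, a small compatibility helper can hide the difference. The sketch below is my own; the &lt;code&gt;response_to_json&lt;/code&gt; name is not part of either SDK:&lt;/p&gt;

```python
import json


def response_to_json(response) -> str:
    """Serialize a response from either SDK to a JSON string."""
    # Gen AI SDK: GenerateContentResponse is Pydantic-based, so use model_dump()
    if hasattr(response, "model_dump"):
        payload = response.model_dump(mode="json")
    # Vertex AI SDK: GenerationResponse exposes to_dict()
    elif hasattr(response, "to_dict"):
        payload = response.to_dict()
    else:
        raise TypeError(f"Unsupported response type: {type(response)!r}")
    return json.dumps(payload, indent=2, ensure_ascii=False)
```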

&lt;p&gt;Google’s migration guide and GitHub issues both mention this explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Migration guide: &lt;em&gt;Generate-content section&lt;/em&gt; (ai.google.dev)&lt;/li&gt;
&lt;li&gt;Issue tracker: &lt;code&gt;googleapis/python-genai#709&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interestingly, the Gen AI SDK surfaces &lt;em&gt;more&lt;/em&gt; generation-time metadata than the Vertex AI SDK (though that’s unrelated to the serialization method itself).&lt;/p&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When moving from the Vertex AI SDK to the Gen AI SDK, remember that &lt;strong&gt;&lt;code&gt;GenerateContentResponse&lt;/code&gt; is Pydantic-based&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serialize with &lt;code&gt;model_dump()&lt;/code&gt; or &lt;code&gt;model_dump_json()&lt;/code&gt;, &lt;em&gt;not&lt;/em&gt; &lt;code&gt;to_dict()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Some features are still exclusive to the Vertex AI SDK, so the Gen AI SDK isn’t yet a drop-in replacement for every use case.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;That said, Google’s docs now recommend the Gen AI SDK, and it’s seeing the most active development—so consolidation onto this SDK feels inevitable.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Looking forward to future updates!&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>python</category>
      <category>vertexai</category>
    </item>
    <item>
      <title>Understanding Quota Project Warnings When Using Google Cloud ADC</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Mon, 30 Jun 2025 01:48:56 +0000</pubDate>
      <link>https://dev.to/polar3130/understanding-quota-project-warnings-when-using-google-cloud-adc-4bpd</link>
      <guid>https://dev.to/polar3130/understanding-quota-project-warnings-when-using-google-cloud-adc-4bpd</guid>
      <description>&lt;p&gt;This article covers authentication credentials when developing applications on Google Cloud.&lt;/p&gt;

&lt;p&gt;When you run the &lt;code&gt;gcloud auth application-default login&lt;/code&gt; command locally, you may see a warning like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING: 
Cannot add the project "dazzling-pillar-4369" to ADC as the quota project because the account in ADC does not have the "serviceusage.services.use" permission on this project. You might receive a "quota_exceeded" or "API not enabled" error. Run $ gcloud auth application-default set-quota-project to add a quota project.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This article explains the meaning, cause, and recommended practices related to this warning.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Are Application Default Credentials (ADC)?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/docs/authentication/application-default-credentials" rel="noopener noreferrer"&gt;Application Default Credentials (ADC)&lt;/a&gt; are a mechanism for obtaining default credentials for applications to access Google Cloud services and APIs.&lt;br&gt;
With ADC, applications can automatically discover the appropriate credentials depending on their environment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If the &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; environment variable is set, the service account key or config file at that path is used.&lt;/li&gt;
&lt;li&gt;Otherwise, credentials saved via the &lt;code&gt;gcloud auth application-default login&lt;/code&gt; command (user credentials saved locally in the ADC file) are used.&lt;/li&gt;
&lt;li&gt;If neither is present, the default service account credentials for the environment (e.g., GCE or GKE) are retrieved from the metadata server.&lt;/li&gt;
&lt;/ol&gt;
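&lt;p&gt;The lookup order above can be sketched in plain Python. This only illustrates the order of precedence; it is not how the auth libraries are actually implemented:&lt;/p&gt;

```python
import os
from pathlib import Path


def adc_credential_source(env=None, home=None):
    """Report which source ADC would use, mirroring the three-step lookup."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # 1. Explicit credentials file named by the environment variable
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "GOOGLE_APPLICATION_CREDENTIALS file"
    # 2. User credentials saved by `gcloud auth application-default login`
    if (home / ".config" / "gcloud" / "application_default_credentials.json").exists():
        return "local ADC file"
    # 3. Fall back to the environment's metadata server (e.g. GCE, GKE)
    return "metadata server"
```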

&lt;p&gt;ADC allows developers to access Google Cloud resources without hardcoding credentials.&lt;br&gt;
When running &lt;code&gt;gcloud auth application-default login&lt;/code&gt; locally, a credentials file is created (usually at &lt;code&gt;~/.config/gcloud/application_default_credentials.json&lt;/code&gt;), which is then used by client libraries.&lt;/p&gt;

&lt;p&gt;Note: As mentioned in point 2 above, the &lt;code&gt;gcloud auth application-default login&lt;/code&gt; command uses the permissions of your user account.&lt;br&gt;
If you want your application to impersonate a specific service account, use &lt;code&gt;gcloud auth application-default login --impersonate-service-account&lt;/code&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Understanding the Quota Project Warning
&lt;/h2&gt;

&lt;p&gt;The warning shown at the start of this article relates to the &lt;strong&gt;Quota Project&lt;/strong&gt; used with ADC.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The project ID &lt;code&gt;"dazzling-pillar-4369"&lt;/code&gt; could not be set as the Quota Project because the account used by ADC lacks the &lt;code&gt;serviceusage.services.use&lt;/code&gt; permission on that project.&lt;/li&gt;
&lt;li&gt;As a result, errors like &lt;code&gt;quota_exceeded&lt;/code&gt; or &lt;code&gt;API not enabled&lt;/code&gt; may occur.&lt;/li&gt;
&lt;li&gt;The recommended fix is to run &lt;code&gt;gcloud auth application-default set-quota-project&lt;/code&gt; to explicitly set the Quota Project.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s clarify what a “Quota Project” is.&lt;br&gt;
There are two types of Google Cloud APIs: &lt;strong&gt;resource-based&lt;/strong&gt; and &lt;strong&gt;client-based&lt;/strong&gt; (&lt;a href="https://cloud.google.com/docs/authentication/troubleshoot-adc" rel="noopener noreferrer"&gt;Troubleshoot your ADC setup&lt;/a&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource-based APIs&lt;/strong&gt; use the settings and quotas of the project containing the resource.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-based APIs&lt;/strong&gt;, since they aren’t tied to a specific project, require explicit specification of the Quota Project (also referred to as the billing project).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When using user credentials (rather than a service account), client libraries require you to specify which project should be used for quota and billing purposes.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;gcloud auth application-default login&lt;/code&gt; is executed, the Cloud SDK attempts to associate the current project as the Quota Project for ADC.&lt;br&gt;
However, if the user credentials (e.g., a Viewer role) do not include &lt;code&gt;serviceusage.services.use&lt;/code&gt; permission for that project, this association fails, resulting in the warning message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/docs/authentication/troubleshoot-adc?hl=ja#:~:text=%E3%83%97%E3%83%AD%E3%82%B8%E3%82%A7%E3%82%AF%E3%83%88%E3%82%92%E8%AB%8B%E6%B1%82%E5%85%88%E3%83%97%E3%83%AD%E3%82%B8%E3%82%A7%E3%82%AF%E3%83%88%E3%81%A8%E3%81%97%E3%81%A6%E6%8C%87%E5%AE%9A" rel="noopener noreferrer"&gt;Reference - ADC Troubleshooting Docs&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  What Is &lt;code&gt;serviceusage.services.use&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;This is a permission required to use (and be billed for) services within a specific project.&lt;br&gt;
It is included in the &lt;strong&gt;Service Usage Consumer&lt;/strong&gt; role (&lt;code&gt;roles/serviceusage.serviceUsageConsumer&lt;/code&gt;).&lt;br&gt;
Viewer roles do not include this permission. So if your credentials only allow viewing a project, you cannot set it as a Quota Project.&lt;/p&gt;



&lt;p&gt;As a side note, the warning mentions the following two possible errors if a Quota Project is not set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;quota_exceeded&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;API not enabled&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might wonder why &lt;code&gt;quota_exceeded&lt;/code&gt; would occur if no project is associated.&lt;br&gt;
It turns out that in the absence of a properly set Quota Project, some client-based API calls fall back to a &lt;strong&gt;shared Google-owned quota project&lt;/strong&gt; (&lt;a href="https://stackoverflow.com/questions/72745805/warnings-because-of-user-credentials-without-quota-project" rel="noopener noreferrer"&gt;StackOverflow&lt;/a&gt;, &lt;a href="https://medium.com/google-cloud/google-oauth-credential-going-deeper-the-hard-way-f403cf3edf9d" rel="noopener noreferrer"&gt;Google Cloud Medium article&lt;/a&gt;).&lt;br&gt;
If this fallback project doesn’t have the necessary APIs enabled, you may also see &lt;code&gt;API not enabled&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Official documentation also mentions a mysterious fallback project ID (&lt;a href="https://cloud.google.com/docs/authentication/troubleshoot-adc#unknown_project_764086051850_used_for_request" rel="noopener noreferrer"&gt;link&lt;/a&gt;), which is likely the shared quota project.&lt;/p&gt;


&lt;h2&gt;
  
  
  How to Set the Quota Project for ADC
&lt;/h2&gt;

&lt;p&gt;First, ensure the account used by ADC has the &lt;code&gt;serviceusage.services.use&lt;/code&gt; permission on the project you want to use.&lt;br&gt;
Grant the &lt;strong&gt;Service Usage Consumer&lt;/strong&gt; role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &amp;lt;PROJECT_ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"user:&amp;lt;your-email@example.com&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/serviceusage.serviceUsageConsumer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives the necessary permission to use the project’s services, allowing it to be set as the Quota Project.&lt;/p&gt;

&lt;p&gt;Then, set the Quota Project in ADC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth application-default set-quota-project &amp;lt;PROJECT_ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This updates your local ADC file with a &lt;code&gt;quota_project_id&lt;/code&gt; field.&lt;br&gt;
From then on, any API calls made using ADC will use this project for billing and quota tracking.&lt;/p&gt;
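&lt;p&gt;For reference, the resulting ADC file looks roughly like this (the values are elided, the project ID is a placeholder, and the exact set of fields varies by credential type):&lt;/p&gt;

```json
{
  "type": "authorized_user",
  "client_id": "...",
  "client_secret": "...",
  "refresh_token": "...",
  "quota_project_id": "my-project-id"
}
```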


&lt;h2&gt;
  
  
  Best Practices for Project Configuration
&lt;/h2&gt;

&lt;p&gt;To prevent these types of errors in the future, here are some best practices:&lt;/p&gt;
&lt;h3&gt;
  
  
  Set a Default Quota Project in gcloud Configuration
&lt;/h3&gt;

&lt;p&gt;While running &lt;code&gt;gcloud auth application-default set-quota-project&lt;/code&gt; manually each time works, it can be tedious.&lt;br&gt;
Instead, you can set the quota project in your gcloud config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;billing/quota_project &amp;lt;PROJECT_ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures any future ADC credentials will automatically use this project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consider Using Service Accounts
&lt;/h3&gt;

&lt;p&gt;For long-lived automated processes like CI/CD pipelines, using service accounts is better than user-based ADC.&lt;br&gt;
With service accounts, you can assign only the necessary IAM roles and explicitly define the quota project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don’t Ignore Warnings or Errors
&lt;/h3&gt;

&lt;p&gt;This last point is more of a mindset: don’t ignore warnings like these.&lt;br&gt;
If left unresolved, your application may rely on fallback shared projects and break unexpectedly when quotas are hit or APIs are disabled.&lt;br&gt;
Understanding and fixing such issues helps you prevent incidents, deepen your knowledge, and strengthen your team’s technical capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Starting with a common ADC warning message, we explored what a Quota Project is, how to configure it, and some best practices for stable development.&lt;br&gt;
Failing to set a proper Quota Project can result in unexpected errors and usage being billed to Google’s fallback project.&lt;br&gt;
Setting a correct quota project ensures your API calls are tracked and billed properly under your own project.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>gemini</category>
      <category>vertexai</category>
    </item>
    <item>
      <title>Optimizing Image Management Efficiency Using AWS ECR Pull-Through Cache</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Tue, 25 Mar 2025 02:43:10 +0000</pubDate>
      <link>https://dev.to/polar3130/optimizing-image-management-efficiency-using-aws-ecr-pull-through-cache-4846</link>
      <guid>https://dev.to/polar3130/optimizing-image-management-efficiency-using-aws-ecr-pull-through-cache-4846</guid>
      <description>&lt;p&gt;This time, we will take a look at AWS Elastic Container Registry (ECR), an essential service when building container execution environments on AWS. We will introduce the pull-through cache feature of ECR by covering its recent updates—such as support for private ECR repositories—enterprise use cases, and insights gained during our testing process.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ECR's Pull-Through Cache Feature?
&lt;/h2&gt;

&lt;p&gt;ECR’s pull-through cache is a feature that &lt;strong&gt;dynamically retrieves external container images on-demand and caches them within your private ECR&lt;/strong&gt;. By creating a “pull-through cache rule” in your account’s private ECR and associating an upstream registry with a namespace (a prefix applied to the repository name in your ECR), developers and deployment environments can &lt;strong&gt;pull external images via the ECR repository URI&lt;/strong&gt;, while ECR automatically fetches the image from the upstream registry in the background.&lt;/p&gt;

&lt;p&gt;At the time of the first pull, a repository is automatically created in your private ECR using the specified prefix combined with the original image name, and the image layers are cached. During this initial retrieval, AWS uses an IP address from its managed infrastructure to fetch the image via the pull-through cache feature. For subsequent pulls, the image is served directly from the cached copy in ECR, so there is no further access to the external registry.&lt;/p&gt;

&lt;p&gt;In addition, ECR checks at least once every 24 hours whether the image on the upstream side with the same tag has been updated, and if an update is detected, the cached copy in your private ECR is refreshed. This means that even for tags that are frequently updated, such as &lt;code&gt;latest&lt;/code&gt;, the ECR cache will follow the latest image at least once every 24 hours (as of the time of writing, the image update interval cannot be modified).&lt;/p&gt;

&lt;p&gt;Through this mechanism, you can use ECR like a proxy cache to access external container images via your internal repository.&lt;/p&gt;
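&lt;p&gt;Concretely, the mapping from an upstream image reference to its cached URI can be sketched as follows (the account ID and region are placeholders, and &lt;code&gt;cached_image_uri&lt;/code&gt; is a helper of my own):&lt;/p&gt;

```python
def cached_image_uri(account_id: str, region: str, prefix: str, upstream_image: str) -> str:
    """Build the private-ECR URI that a pull-through cache rule maps an upstream image to."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{prefix}/{upstream_image}"


# docker.io/library/nginx:latest, pulled through a rule with prefix "docker-hub":
print(cached_image_uri("123456789012", "us-east-1", "docker-hub", "library/nginx:latest"))
# -> 123456789012.dkr.ecr.us-east-1.amazonaws.com/docker-hub/library/nginx:latest
```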

&lt;h2&gt;
  
  
  Main Benefits of the Pull-Through Cache
&lt;/h2&gt;

&lt;p&gt;The most obvious benefit of caching is improved performance. Since the required container images are cached in ECR, retrieving the same external image for the second time and beyond is significantly faster, reducing your application’s startup time. Of course, this is not the only advantage; ECR’s pull-through cache offers several additional benefits in terms of security and other aspects:&lt;/p&gt;

&lt;h3&gt;
  
  
  Reduced External Dependency and Improved Availability
&lt;/h3&gt;

&lt;p&gt;Even if an external container registry is down during deployment, having a cached copy in ECR can help avoid or mitigate service impact. Moreover, since the image is obtained from within ECR, you can also expect lower network latency and faster performance due to intra-region replication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoidance of External Registry Rate Limits
&lt;/h3&gt;

&lt;p&gt;For example, Docker Hub imposes rate limits on the number of pulls an anonymous user can perform within a given time period. However, if you retrieve images from Docker Hub via ECR’s pull-through cache, you can use them without worrying about these limits. As explained earlier, since ECR fetches images from Docker Hub on your behalf using AWS infrastructure, your environment does not need to directly access Docker Hub.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralized Security Controls
&lt;/h3&gt;

&lt;p&gt;By consolidating the origin of your container images to ECR, you can centrally manage access controls and vulnerability scans. ECR integrates with IAM for authentication and offers security features such as KMS encryption at rest and image scanning integrated with Amazon Inspector. This means that even images imported from external sources can benefit from consistent security measures. For instance, images retrieved through the pull-through cache can be automatically scanned for vulnerabilities, and lifecycle policies can be used to periodically remove older tags. Additionally, you can enforce policies that only grant production clusters access to ECR while blocking direct access to external registries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Optimization
&lt;/h3&gt;

&lt;p&gt;The pull-through cache only caches images that are actually pulled, so there is no need to pre-replicate all potentially used images, which helps reduce storage costs. Moreover, using cached images within the same region minimizes cross-region data transfers, lowering the cost associated with inter-region image transfers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Upstream Sources
&lt;/h2&gt;

&lt;p&gt;Currently, ECR’s pull-through cache supports specifying several major public registries as well as some private registries as upstream sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache-creating-rule.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache-creating-rule.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the initial release in 2021, only Quay.io and Amazon ECR Public were supported, but support has since expanded to include registries that require authentication, such as Docker Hub and GitHub Container Registry. Additionally, with the update released this month (March 2025), even private Amazon ECR repositories can now act as upstream sources—regardless of cross-region or cross-account boundaries.&lt;/p&gt;

&lt;p&gt;You can find the release notes here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/jp/about-aws/whats-new/2025/03/amazon-ecr-pull-through-cache/" rel="noopener noreferrer"&gt;https://aws.amazon.com/jp/about-aws/whats-new/2025/03/amazon-ecr-pull-through-cache/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;You can create pull-through cache rules using the AWS CLI with the &lt;code&gt;aws ecr create-pull-through-cache-rule&lt;/code&gt; command. The main options include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--ecr-repository-prefix&lt;/code&gt;: The prefix for the repository to be created in your private ECR (e.g., &lt;code&gt;docker-hub&lt;/code&gt; or &lt;code&gt;quay&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--upstream-registry-url&lt;/code&gt;: The endpoint URL of the external upstream registry (e.g., for Docker Hub, &lt;code&gt;registry-1.docker.io&lt;/code&gt;; for ECR Public, &lt;code&gt;public.ecr.aws&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--credential-arn&lt;/code&gt;: (For upstream sources that require authentication) The ARN of the secret stored in Secrets Manager containing your credentials.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--registry-id&lt;/code&gt;: (Optional) The AWS account ID of the private registry in which to create the rule (if not specified, the default registry is used).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--custom-role-arn&lt;/code&gt;: (For the case where the upstream is another AWS account’s ECR) The ARN of the IAM role used for cross-account access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is an example of setting Docker Hub as the upstream source. Since authentication is required for retrieving images from Docker Hub, a Secrets Manager secret ARN is specified. The secret name in Secrets Manager must begin with the prefix &lt;code&gt;ecr-pullthroughcache/&lt;/code&gt; (&lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache-creating-rule.html" rel="noopener noreferrer"&gt;Creating a pull through cache rule in Amazon ECR - Amazon ECR&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set Docker Hub as the upstream (authentication required)&lt;/span&gt;
aws ecr create-pull-through-cache-rule &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ecr-repository-prefix&lt;/span&gt; &lt;span class="s2"&gt;"docker-hub"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--upstream-registry-url&lt;/span&gt; &lt;span class="s2"&gt;"registry-1.docker.io"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--credential-arn&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:secretsmanager:&amp;lt;region&amp;gt;:&amp;lt;accountID&amp;gt;:secret:ecr-pullthroughcache/&amp;lt;DockerHub-secret-name&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use ECR Public as the upstream source (which does not require authentication), the command looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set ECR Public as the upstream (no credentials required)&lt;/span&gt;
aws ecr create-pull-through-cache-rule &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ecr-repository-prefix&lt;/span&gt; &lt;span class="s2"&gt;"ecr-public"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--upstream-registry-url&lt;/span&gt; &lt;span class="s2"&gt;"public.ecr.aws"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
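&lt;p&gt;Once a rule is in place, images are pulled through the cache by prefixing the upstream repository path with the configured prefix. The sketch below only constructs the image URI to illustrate the naming convention; the account ID and region are placeholder values, not real ones.&lt;/p&gt;

```shell
# Placeholder account ID and region for illustration only
ACCOUNT_ID="123456789012"
REGION="ap-northeast-1"

# Cached images are addressed as:
#   ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/PREFIX/UPSTREAM_REPO:TAG
# For Docker Hub official images, the upstream repo lives under "library/"
IMAGE_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/docker-hub/library/nginx:latest"
echo "${IMAGE_URI}"

# The first pull populates the cache; subsequent pulls are served from ECR
# docker pull "${IMAGE_URI}"
```

&lt;p&gt;The first pull of a given tag creates the repository under the prefix automatically, so no manual &lt;code&gt;create-repository&lt;/code&gt; step is needed.&lt;/p&gt;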



&lt;h2&gt;
  
  
  Considerations When Using the Pull-Through Cache
&lt;/h2&gt;

&lt;p&gt;Beyond general considerations such as supported regions and quotas, here are some points to note:&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Secrets Manager
&lt;/h3&gt;

&lt;p&gt;For registries like Docker Hub and GitHub that require authentication, the secret name in Secrets Manager must begin with the prefix &lt;code&gt;ecr-pullthroughcache/&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
For example, in the case of Docker Hub, including the keys &lt;code&gt;username&lt;/code&gt; and &lt;code&gt;accessToken&lt;/code&gt; in the secret allows ECR to perform upstream authentication using those credentials.&lt;br&gt;&lt;br&gt;
When setting up via the console, only secrets with this prefix will be displayed as options, so if your secret is not shown, verify that the prefix is correctly applied.&lt;br&gt;&lt;br&gt;
This prefix must also be used when provisioning secrets through CloudFormation or Terraform.&lt;/p&gt;
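&lt;p&gt;As a sketch of provisioning such a secret with the AWS CLI (the credential values are placeholders to substitute with your own):&lt;/p&gt;

```shell
# The secret name must start with the mandatory prefix
SECRET_NAME="ecr-pullthroughcache/docker-hub"

# Docker Hub credentials go in the "username" and "accessToken" keys
# (placeholder values; substitute your own)
SECRET_JSON='{"username":"my-dockerhub-user","accessToken":"my-dockerhub-token"}'

# Guard against a missing prefix before creating the secret
case "${SECRET_NAME}" in
  ecr-pullthroughcache/*) echo "prefix ok" ;;
  *) echo "prefix missing" ;;
esac

# aws secretsmanager create-secret \
#     --name "${SECRET_NAME}" \
#     --secret-string "${SECRET_JSON}"
```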

&lt;h3&gt;
  
  
  Tag Updates and Immutability
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, pull-through cache checks the upstream for updates on a tag-by-tag basis every 24 hours. If you have enabled tag immutability in your ECR repository, which prevents overwriting images with the same tag, you might expect the cache to stop updating. However, there are reports (see &lt;a href="https://github.com/aws/containers-roadmap/issues/2275" rel="noopener noreferrer"&gt;pull through cache rule related cache repository's immutability not ...&lt;/a&gt;) that image replacement still occurs even when immutability is enabled, so keep this behavior in mind when using the feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying the Concept to Enterprise Environment Segregation
&lt;/h2&gt;

&lt;p&gt;Let’s explore some potential use cases that might arise in an enterprise setting where separation between development and production environments is a key requirement. Often, due to the security and reliability demands of a production environment, a policy may be adopted where “external images can be used flexibly in the development environment, while only trusted internal images are allowed in production.” The pull-through cache feature can be very useful in implementing such segregation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Controlling the Image Flow in Development Environments
&lt;/h3&gt;

&lt;p&gt;When developers test new middleware or OSS images, they typically pull various images from Docker Hub or GitHub Container Registry. By setting up pull-through cache rules in the ECR of the development AWS account and retrieving external images via ECR, you can centralize the image flow and maintain visibility into which images have been used. For example, since images cached in the development ECR can be scanned for vulnerabilities using ECR’s scanning feature, potential vulnerabilities in images being used during development can be identified early. In organizations with a dedicated platform or infrastructure operations team, this centralized caching can also aid in monitoring the usage of external images by application teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Commonizing Sources Across Multiple Environments
&lt;/h3&gt;

&lt;p&gt;If you are managing multiple AWS accounts or regions for development, staging, and regression testing, you can create a central repository and use pull-through cache rules to ensure that all environments retrieve images from a common source. For instance, only images that pass tests in the development environment could be exported to the central repository, and each environment could then create its own pull-through cache rule pointing to this common upstream repository. While push-based image distribution (e.g., using replication) is also an option, pull-based caching via the pull-through cache can simplify management as each environment’s ECR automatically caches from the central repository without requiring additional distribution mechanisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Moving Images from a Development Repository to a Production Repository
&lt;/h3&gt;

&lt;p&gt;In enterprise environments, it is common to have separate repositories for development and production to meet different isolation requirements. In such cases, when exporting images from a central development repository to a production repository, a design that uses replication or individual image export/import might be more secure than using on-demand retrieval via pull-through cache. This is because on-demand retrieval from the development repository could inadvertently cache images in the production repository that are not authorized for production deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we have introduced the pull-through cache feature of ECR along with its latest updates and potential enterprise use cases. Although caching is often associated with improved performance, as highlighted here, the feature also offers significant benefits in terms of availability and security. As discussed, pull-through cache can play a key role in managing image distribution channels and vulnerability scanning scopes in production environments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecr</category>
      <category>container</category>
    </item>
    <item>
      <title>Implementing a Fallback Strategy for Experimental Vertex AI Models</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Fri, 28 Feb 2025 14:16:08 +0000</pubDate>
      <link>https://dev.to/polar3130/implementing-a-fallback-strategy-for-experimental-vertex-ai-models-28lj</link>
      <guid>https://dev.to/polar3130/implementing-a-fallback-strategy-for-experimental-vertex-ai-models-28lj</guid>
      <description>&lt;p&gt;When integrating &lt;strong&gt;experimental AI models&lt;/strong&gt; into your application, there's always a risk that they may become unavailable due to frequent updates, deprecations, or API changes. To mitigate this risk and enhance the resilience and operational stability of your application, having a well-planned &lt;strong&gt;fallback mechanism&lt;/strong&gt; using a Generally Available (GA) model can be highly effective.&lt;/p&gt;

&lt;p&gt;This blog post explores the advantages of maintaining a fallback model strategy in Vertex AI and provides an implementation guide using Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why a Fallback Model is Essential
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Ensuring Service Continuity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Experimental models can sometimes be temporarily or permanently deprecated. Having a GA model as a backup allows your application to continue running without interruptions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Handling API Changes &amp;amp; Compatibility Issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Experimental models undergo frequent API updates that may introduce breaking changes. GA models, on the other hand, offer a more stable and backward-compatible alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Maintaining Output Quality &amp;amp; Stability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Experimental models may produce unpredictable or inconsistent outputs. A GA model ensures a baseline of output quality when the experimental model fails.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Managing Costs Effectively&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;GA models are often more cost-effective. You may choose to use the experimental model only for specific high-value use cases while keeping the GA model as the default option.&lt;/p&gt;




&lt;h2&gt;
  
  
  Considerations When Implementing a Fallback Strategy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Automatic Failover Handling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Your application should detect API failures such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;404 Not Found&lt;/strong&gt; (Model deprecated or removed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;500 Internal Server Error&lt;/strong&gt; (Service outage)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rate-limiting issues (429 Too Many Requests)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When such failures occur, your system should &lt;strong&gt;automatically switch&lt;/strong&gt; to a GA model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note on Rate Limits and Fallback Strategy for Error 429&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When applying a fallback strategy for handling error 429 (Too Many Requests), be aware that it may not always be effective if both the experimental and GA models share the same base model. For example, &lt;code&gt;gemini-2.0-flash-thinking-exp-01-21&lt;/code&gt; and &lt;code&gt;gemini-2.0-flash&lt;/code&gt; are both based on &lt;code&gt;gemini-2.0-flash&lt;/code&gt;. In Gemini models, rate limits are not only applied to individual models but also to the underlying base model.&lt;/p&gt;

&lt;p&gt;This means that if you attempt to switch to another model that shares the same base model, you might still be subject to the same rate limit, rendering the fallback ineffective. &lt;/p&gt;

&lt;p&gt;For more details, refer to the official documentation: &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/quotas" rel="noopener noreferrer"&gt;Vertex AI Quotas&lt;/a&gt;.&lt;/p&gt;
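&lt;p&gt;One way to act on this is to select the fallback by base model rather than by model name. Below is a minimal sketch assuming a hand-maintained mapping of model IDs to base models; the first two entries come from the note above, while the &lt;code&gt;gemini-1.5-pro&lt;/code&gt; entry is an illustrative assumption, not taken from the quota documentation.&lt;/p&gt;

```python
# Base model for each model ID. The first two pairs come from the note above;
# the gemini-1.5-pro entry is an illustrative assumption.
BASE_MODEL = {
    "gemini-2.0-flash-thinking-exp-01-21": "gemini-2.0-flash",
    "gemini-2.0-flash": "gemini-2.0-flash",
    "gemini-1.5-pro": "gemini-1.5-pro",
}

def pick_fallback(primary, candidates):
    """Return the first candidate whose base model differs from the primary's,
    so a base-model-level rate limit is not hit twice. Returns None if no
    candidate qualifies."""
    primary_base = BASE_MODEL.get(primary, primary)
    for candidate in candidates:
        if BASE_MODEL.get(candidate, candidate) != primary_base:
            return candidate
    return None
```

&lt;p&gt;With &lt;code&gt;gemini-2.0-flash-thinking-exp-01-21&lt;/code&gt; as the primary, &lt;code&gt;gemini-2.0-flash&lt;/code&gt; is skipped because it shares the same base model.&lt;/p&gt;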

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling Model Output Differences&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Experimental and GA models may generate different responses. Implementing &lt;strong&gt;pre-processing and post-processing&lt;/strong&gt; logic can help normalize outputs.&lt;/p&gt;
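&lt;p&gt;As a minimal sketch of such post-processing, the helper below trims whitespace and unwraps Markdown code fences that some models place around their answers; the exact rules will depend on the models you pair.&lt;/p&gt;

```python
FENCE = chr(96) * 3  # a triple-backtick marker, built indirectly to keep this example fence-safe

def normalize_output(text):
    """Normalize model output so downstream code sees a consistent shape
    regardless of which model produced it (the rules here are examples)."""
    text = text.strip()
    # Some models wrap answers in Markdown code fences; unwrap them if present
    if text.startswith(FENCE):
        lines = text.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[1:-1]  # drop the opening fence (and language tag) plus the closing fence
        else:
            lines = lines[1:]    # unterminated fence: drop only the opening line
        text = "\n".join(lines).strip()
    return text
```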

&lt;h3&gt;
  
  
  &lt;strong&gt;Parallel Testing Before Deployment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To prevent unexpected issues in production, &lt;strong&gt;test both models in parallel&lt;/strong&gt; and evaluate their responses to ensure the fallback model meets your requirements.&lt;/p&gt;
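&lt;p&gt;A simple harness for this kind of side-by-side evaluation can take the model calls as injected callables, so the harness itself runs without API access. A sketch:&lt;/p&gt;

```python
def compare_models(generate_fns, prompts):
    """Run the same prompts through each model callable and collect the outputs
    side by side for review. Failures are recorded instead of aborting the run."""
    results = {}
    for name, generate in generate_fns.items():
        results[name] = []
        for prompt in prompts:
            try:
                results[name].append(generate(prompt))
            except Exception as exc:  # keep going so other models still get evaluated
                results[name].append("ERROR: " + str(exc))
    return results
```

&lt;p&gt;In practice each callable would wrap a &lt;code&gt;GenerativeModel(...).generate_content(...)&lt;/code&gt; call for one model ID.&lt;/p&gt;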




&lt;h2&gt;
  
  
  Python Implementation: Fallback from Experimental to GA Model
&lt;/h2&gt;

&lt;p&gt;Here's how you can implement a &lt;strong&gt;fallback strategy&lt;/strong&gt; using Vertex AI's generative models in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.generative_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GenerationConfig&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash-thinking-exp-01-21&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Experimental first, then GA model
&lt;/span&gt;    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerationConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Trying model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generation_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Success with model:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# Fall back to the next model
&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All models failed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the significance of Kubernetes in modern cloud computing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;predict_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated text:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to generate text with all models.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;How This Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prioritizes the experimental model (&lt;code&gt;gemini-2.0-flash-thinking-exp-01-21&lt;/code&gt;)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If it fails, falls back to the GA model (&lt;code&gt;gemini-2.0-flash&lt;/code&gt;)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles API errors and exceptions&lt;/strong&gt; to ensure continuous operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prints logs&lt;/strong&gt; to track which model is being used.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using an &lt;strong&gt;experimental AI model without a fallback mechanism&lt;/strong&gt; is risky, as these models frequently change or become unavailable. By implementing a &lt;strong&gt;fallback strategy with a stable GA model&lt;/strong&gt;, you ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seamless service continuity&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent API compatibility&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality assurance in generated outputs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effective AI usage&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When designing AI-driven applications, always plan for model unavailability scenarios. A structured fallback mechanism allows your system to &lt;strong&gt;adapt dynamically&lt;/strong&gt; while maintaining a &lt;strong&gt;high-quality user experience&lt;/strong&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Handling Error Code 429 in Vertex AI: Implementing Retries with Python</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Sun, 26 Jan 2025 14:24:24 +0000</pubDate>
      <link>https://dev.to/polar3130/handling-error-code-429-in-vertex-ai-implementing-retries-with-python-1iag</link>
      <guid>https://dev.to/polar3130/handling-error-code-429-in-vertex-ai-implementing-retries-with-python-1iag</guid>
      <description>&lt;p&gt;When developing applications using Vertex AI, encountering error code &lt;strong&gt;429 (Too Many Requests)&lt;/strong&gt; is a common scenario. As described in the &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;, this error occurs when the number of requests exceeds the allocated capacity for processing. To handle such situations effectively, implementing retries with exponential backoff can be crucial.&lt;/p&gt;

&lt;p&gt;This blog post introduces a simple and effective way to implement retries using the &lt;code&gt;google.api_core.retry&lt;/code&gt; package in Python. By the end of this post, you'll have a clear understanding of how to handle error 429 gracefully and ensure your application remains robust and responsive.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What is Error Code 429?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Error 429 signifies that the service has hit its processing capacity limits for your requests. To mitigate this, Google Cloud provides two primary approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use truncated exponential backoff for retries&lt;/strong&gt; (recommended for pay-as-you-go users).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subscribe to Provisioned Throughput&lt;/strong&gt;, a monthly subscription service to reserve throughput capacity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post focuses on the first approach, showcasing how to implement retries programmatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Using Exponential Backoff with Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;google.api_core.retry&lt;/code&gt; package offers a convenient way to handle retries. Here’s the Python implementation for retrying API requests with Vertex AI's generative model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.api_core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;exceptions&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.api_core.retry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Retry&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.generative_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GenerationConfig&lt;/span&gt;

&lt;span class="n"&gt;RETRIABLE_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TooManyRequests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 429
&lt;/span&gt;    &lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InternalServerError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 500
&lt;/span&gt;    &lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BadGateway&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 502
&lt;/span&gt;    &lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServiceUnavailable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 503
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@Retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;predicate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RETRIABLE_TYPES&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;initial&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Initial wait time in seconds
&lt;/span&gt;        &lt;span class="n"&gt;maximum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;60.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Maximum wait time in seconds
&lt;/span&gt;        &lt;span class="n"&gt;multipier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Exponential backoff multiplier
&lt;/span&gt;        &lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;300.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Total retry period in seconds
&lt;/span&gt;        &lt;span class="n"&gt;on_error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;retry_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrying due to: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retry_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_attempt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_content_with_retry&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generation_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;generate_content_with_retry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Key Features of the Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retryable Exceptions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;RETRIABLE_TYPES&lt;/code&gt; set defines the exceptions that should trigger a retry. These include common transient errors such as:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;429 Too Many Requests&lt;/strong&gt;: Indicates rate limits have been exceeded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;500 Internal Server Error&lt;/strong&gt;: A general server-side error.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;502 Bad Gateway&lt;/strong&gt;: Received an invalid response from an upstream server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;503 Service Unavailable&lt;/strong&gt;: Temporary service unavailability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retry Parameters&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;initial&lt;/code&gt;&lt;/strong&gt;: The initial wait time before retrying (e.g., 10 seconds).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;maximum&lt;/code&gt;&lt;/strong&gt;: The upper limit for the wait time (e.g., 60 seconds).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;multiplier&lt;/code&gt;&lt;/strong&gt;: Controls the exponential growth of wait times (e.g., doubling).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;deadline&lt;/code&gt;&lt;/strong&gt;: Total retry period, ensuring retries don't exceed a fixed duration (e.g., 5 minutes).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Logging on Retry&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;on_error&lt;/code&gt; callback logs the error that triggered the retry, providing visibility into the retry process.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reusable Logic&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;generate_content_with_retry&lt;/code&gt; function encapsulates the retry logic, making it reusable for multiple API calls.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding the &lt;code&gt;@Retry&lt;/code&gt; Decorator&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Retry&lt;/code&gt; decorator simplifies the process of implementing retries by abstracting common retry mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exponential Backoff&lt;/strong&gt;: Increases the wait time between retries exponentially, preventing overwhelming the service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Conditions&lt;/strong&gt;: Retries can be configured to handle specific exception types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customizable Parameters&lt;/strong&gt;: Allows fine-tuning of wait times, retry limits, and error handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By leveraging this decorator, developers can build robust applications that gracefully handle transient errors and ensure high availability.&lt;/p&gt;
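&lt;p&gt;To see what the parameters used above imply, the sketch below computes the jitter-free wait schedule for &lt;code&gt;initial=10&lt;/code&gt;, &lt;code&gt;maximum=60&lt;/code&gt;, &lt;code&gt;multiplier=2&lt;/code&gt;, and &lt;code&gt;deadline=300&lt;/code&gt;; note that the real &lt;code&gt;Retry&lt;/code&gt; also randomizes each wait.&lt;/p&gt;

```python
def backoff_schedule(initial, maximum, multiplier, deadline):
    """Jitter-free sketch of the truncated-exponential waits implied by the
    parameters used above (the real Retry randomizes each wait)."""
    waits, wait, elapsed = [], initial, 0.0
    # keep waiting as long as the next sleep still fits inside the deadline
    while deadline >= elapsed + wait:
        waits.append(wait)
        elapsed += wait
        wait = min(wait * multiplier, maximum)  # grow, but cap at the maximum
    return waits
```

&lt;p&gt;With the values above this yields waits of 10, 20, 40, and then 60 seconds capped at the maximum, until the 300-second deadline is reached.&lt;/p&gt;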




&lt;h2&gt;
  
  
  &lt;strong&gt;Dynamic Shared Quota and Error 429&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is Dynamic Shared Quota?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/dsq" rel="noopener noreferrer"&gt;Dynamic Shared Quota (DSQ)&lt;/a&gt; is a resource management framework introduced by Google Cloud for Vertex AI Generative AI services. It dynamically allocates the available compute capacity among ongoing requests, making it an essential feature for applications requiring flexible and efficient use of resources.&lt;/p&gt;

&lt;p&gt;DSQ eliminates the need for fixed quotas and adapts in real time to the demands of your application. This system is particularly beneficial in environments where workload demands fluctuate significantly, as it ensures optimal resource utilization without manual intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Benefits of DSQ for Users:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible Resource Utilization:&lt;/strong&gt;&lt;br&gt;
With DSQ, you don’t need to pre-allocate fixed quotas. Resources are dynamically adjusted based on the real-time demand of your application. This approach reduces resource wastage and ensures that you have sufficient capacity when you need it most.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improved Scalability:&lt;/strong&gt;&lt;br&gt;
During traffic surges, DSQ dynamically reallocates resources to maintain application performance. This capability ensures high availability and responsiveness, even under peak loads, enabling seamless user experiences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplified Quota Management:&lt;/strong&gt;&lt;br&gt;
Traditional quota systems require you to monitor resource usage and request increases as needed. DSQ streamlines this process, automatically managing resource allocation and saving you time and effort.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Understanding Error 429: "Too Many Requests"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While DSQ provides significant advantages, it operates within the constraints of the available shared capacity. When your application sends requests that exceed the currently available capacity, you might encounter an HTTP 429 error, indicating &lt;strong&gt;“Too Many Requests.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This error doesn’t mean that the service is unavailable—it’s a signal that your requests are temporarily exceeding the dynamic quota. The official &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429" rel="noopener noreferrer"&gt;documentation on Error 429&lt;/a&gt; provides guidance on handling this situation.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Best Practices for Handling Error 429:&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exponential Backoff with Retry:&lt;/strong&gt;&lt;br&gt;
Implementing an exponential backoff strategy is the recommended approach to handle 429 errors. By introducing progressively longer delays between retries, you allow the system time to recover and allocate additional capacity for your requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consider Provisioned Throughput:&lt;/strong&gt;&lt;br&gt;
If your application consistently requires higher throughput, you might benefit from &lt;a href="https://cloud.google.com/vertex-ai/pricing#generative_ai" rel="noopener noreferrer"&gt;Provisioned Throughput&lt;/a&gt;. This subscription-based service reserves capacity for your usage, reducing the likelihood of encountering 429 errors during high-demand periods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor and Optimize Requests:&lt;/strong&gt;&lt;br&gt;
Analyze your application’s request patterns to identify opportunities for optimization. Consolidating redundant requests or adjusting the frequency of non-essential calls can help you stay within DSQ’s dynamically allocated capacity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
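
&lt;p&gt;Practice 1 can also be written as a simple retry loop. The snippet below simulates it with a stand-in exception and a fake call (&lt;code&gt;TooManyRequests&lt;/code&gt; and &lt;code&gt;fake_vertex_call&lt;/code&gt; are hypothetical names for this sketch; a real client would catch the SDK's 429 error instead):&lt;/p&gt;

```python
import random
import time

class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 response from the API."""

def call_with_backoff(call, max_attempts=6, base=1.0, cap=60.0):
    """Retry `call` on 429s, doubling the backoff each attempt with full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # out of attempts
            backoff = min(cap, base * (2 ** attempt))
            # Full jitter: sleep a random fraction of the capped backoff.
            time.sleep(random.uniform(0, backoff))

# Simulate a service that rejects the first three calls with 429.
state = {"calls": 0}

def fake_vertex_call():
    state["calls"] += 1
    if state["calls"] in (1, 2, 3):
        raise TooManyRequests()
    return {"status": 200}

print(call_with_backoff(fake_vertex_call, base=0.01))  # {'status': 200}
```

&lt;p&gt;Random jitter spreads the retries of many clients over time, which helps avoid synchronized retry storms against the shared capacity pool.&lt;/p&gt;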




&lt;h3&gt;
  
  
  &lt;strong&gt;The Intersection of DSQ and Error 429&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dynamic Shared Quota and error code 429 are inherently linked. DSQ’s ability to dynamically adjust resource allocation helps you avoid unnecessary over-provisioning, but it also requires careful handling of temporary resource constraints. Understanding and leveraging DSQ allows you to design robust applications that gracefully adapt to fluctuating resource availability.&lt;/p&gt;

&lt;p&gt;By implementing exponential backoff and optimizing your resource usage, you can maximize the benefits of DSQ while minimizing disruptions caused by error 429. Whether you’re building lightweight prototypes or deploying enterprise-grade solutions, DSQ offers a powerful, scalable foundation for your Vertex AI applications.&lt;/p&gt;

&lt;p&gt;For more details, refer to the &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/dsq" rel="noopener noreferrer"&gt;DSQ documentation&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Handling error 429 in Vertex AI can be straightforward with the right tools and strategies. By using the &lt;code&gt;google.api_core.retry&lt;/code&gt; package, you can implement exponential backoff with minimal effort, ensuring your application remains robust even during transient capacity issues.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building a Kubernetes Client for Google Kubernetes Engine (GKE) in Python</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Mon, 25 Nov 2024 14:16:00 +0000</pubDate>
      <link>https://dev.to/polar3130/building-a-kubernetes-client-for-google-kubernetes-engine-gke-in-python-16mg</link>
      <guid>https://dev.to/polar3130/building-a-kubernetes-client-for-google-kubernetes-engine-gke-in-python-16mg</guid>
      <description>&lt;p&gt;This blog post introduces an effective method for creating a Kubernetes client for GKE in Python. By leveraging the &lt;code&gt;google-cloud-container&lt;/code&gt;, &lt;code&gt;google-auth&lt;/code&gt;, and &lt;code&gt;kubernetes&lt;/code&gt; libraries, you can use the same code to interact with the Kubernetes API regardless of whether your application is running locally or on Google Cloud. This flexibility comes from using &lt;strong&gt;Application Default Credentials (ADC)&lt;/strong&gt; to authenticate and dynamically construct the requests needed for Kubernetes API interactions, eliminating the need for additional tools or configuration files like &lt;code&gt;kubeconfig&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When running locally, a common approach is to use the &lt;code&gt;gcloud container clusters get-credentials&lt;/code&gt; command to generate a &lt;code&gt;kubeconfig&lt;/code&gt; file and interact with the Kubernetes API using &lt;code&gt;kubectl&lt;/code&gt;. While this workflow is natural and effective for local setups, it becomes less practical in environments like Cloud Run or other Google Cloud services.&lt;/p&gt;

&lt;p&gt;With ADC, you can streamline access to the Kubernetes API for GKE clusters by dynamically configuring the Kubernetes client. This approach ensures a consistent, efficient way to connect to your cluster without the overhead of managing external configuration files or installing extra tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Authentication with Google Cloud
&lt;/h3&gt;

&lt;p&gt;If you're running the code locally, simply authenticate using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth application-default login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will use your user account credentials as the Application Default Credentials (ADC). &lt;/p&gt;

&lt;p&gt;If you're running the code on Google Cloud services like Cloud Run, you don’t need to handle authentication manually. Just ensure that the service has a properly configured service account attached with the necessary permissions to access the GKE cluster.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Gather Your Cluster Details
&lt;/h3&gt;

&lt;p&gt;Before running the script, make sure you have the following details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Project ID&lt;/strong&gt;: The ID of the project where your GKE cluster is hosted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Location&lt;/strong&gt;: The region or zone where your cluster is located (e.g., &lt;code&gt;us-central1-a&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Name&lt;/strong&gt;: The name of the Kubernetes cluster you want to connect to.&lt;/li&gt;
&lt;/ul&gt;
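
&lt;p&gt;These three values are combined into the fully qualified resource name that the GKE API expects, exactly as the script below does. A minimal sketch with placeholder values:&lt;/p&gt;

```python
project_id = "your-project-id"
location = "us-central1-a"      # region or zone
cluster_id = "your-cluster-id"

# ClusterManagerClient.get_cluster() addresses the cluster by this name.
name = f"projects/{project_id}/locations/{location}/clusters/{cluster_id}"
print(name)
```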




&lt;h2&gt;
  
  
  The Script
&lt;/h2&gt;

&lt;p&gt;Below is the Python function that sets up a Kubernetes client for a GKE cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;container_v1&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.auth&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.auth.transport.requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kubernetes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;kubernetes_client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NamedTemporaryFile&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_k8s_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;kubernetes_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Fetches a Kubernetes client for the specified GCP project, location, and cluster ID.

    Args:
        project_id (str): Google Cloud Project ID
        location (str): Location of the cluster (e.g., &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
        cluster_id (str): Name of the Kubernetes cluster

    Returns:
        kubernetes_client.CoreV1Api: Kubernetes CoreV1 API client
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Retrieve cluster information
&lt;/span&gt;    &lt;span class="n"&gt;gke_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;container_v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClusterManagerClient&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projects/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/locations/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/clusters/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# Obtain Google authentication credentials
&lt;/span&gt;    &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;auth_req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Refresh the token
&lt;/span&gt;    &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth_req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Initialize the Kubernetes client configuration object
&lt;/span&gt;    &lt;span class="n"&gt;configuration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kubernetes_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Set the cluster endpoint
&lt;/span&gt;    &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gke_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Write the cluster CA certificate to a temporary file
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;NamedTemporaryFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ca_cert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ca_cert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gke_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;master_auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_ca_certificate&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ssl_ca_cert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ca_cert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;

    &lt;span class="c1"&gt;# Set the authentication token
&lt;/span&gt;    &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key_prefix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bearer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;

    &lt;span class="c1"&gt;# Create and return the Kubernetes CoreV1 API client
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;kubernetes_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kubernetes_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ApiClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-project-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Google Cloud Project ID
&lt;/span&gt;    &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-cluster-location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Cluster location (region or zone, e.g., "us-central1-a")
&lt;/span&gt;    &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-cluster-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Cluster name
&lt;/span&gt;
    &lt;span class="c1"&gt;# Retrieve the Kubernetes client
&lt;/span&gt;    &lt;span class="n"&gt;core_v1_api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_k8s_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fetch the kube-system Namespace
&lt;/span&gt;    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;core_v1_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_namespace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kube-system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Output the Namespace resource in YAML format
&lt;/span&gt;    &lt;span class="n"&gt;yaml_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;default_flow_style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yaml_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Connecting to the GKE Cluster
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;get_k8s_client&lt;/code&gt; function begins by fetching cluster details from GKE using the &lt;code&gt;google-cloud-container&lt;/code&gt; library. This library interacts with the GKE service, allowing you to retrieve information such as the cluster's API endpoint and certificate authority (CA). These details are essential for configuring the Kubernetes client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;gke_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;container_v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClusterManagerClient&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projects/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/locations/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/clusters/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s important to note that the &lt;code&gt;google-cloud-container&lt;/code&gt; library is designed for interacting with GKE as a service, not directly with Kubernetes APIs. For example, while you can use this library to retrieve cluster information, upgrade clusters, or configure maintenance policies—similar to what you can do with the &lt;code&gt;gcloud container clusters&lt;/code&gt; command—you cannot use it to directly obtain a Kubernetes API client. This distinction is why the function constructs a Kubernetes client separately after fetching the necessary cluster details from GKE.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Authenticating with Google Cloud
&lt;/h3&gt;

&lt;p&gt;To interact with GKE and Kubernetes APIs, the function uses Google Cloud’s Application Default Credentials (ADC) to authenticate. Here's how each step of the authentication process works:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;code&gt;google.auth.default()&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This function retrieves the ADC for the environment in which the code is running. Depending on the context, it may return:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User account credentials&lt;/strong&gt; (e.g., from &lt;code&gt;gcloud auth application-default login&lt;/code&gt; in a local development setup).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service account credentials&lt;/strong&gt; (e.g., when running in a Google Cloud environment like Cloud Run).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also returns the associated project ID if available, although in this case, only the credentials are used.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;code&gt;google.auth.transport.requests.Request()&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This creates an HTTP request object for handling authentication-related network requests. It uses Python's &lt;code&gt;requests&lt;/code&gt; library internally and provides a standardized way to refresh credentials or request access tokens.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;code&gt;creds.refresh(auth_req)&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When ADC is retrieved using &lt;code&gt;google.auth.default()&lt;/code&gt;, the credentials object does not initially include an access token (at least in a local environment). The &lt;code&gt;refresh()&lt;/code&gt; method explicitly obtains an access token and attaches it to the credentials object, enabling it to authenticate API requests.&lt;/p&gt;

&lt;p&gt;The following code demonstrates how you can verify this behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Obtain Google authentication credentials
&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;auth_req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Inspect credentials before refreshing
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Access Token (before refresh()): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Token Expiry (before refresh()): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expiry&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Refresh the token
&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth_req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inspect credentials after refreshing
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Access Token (after): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Token Expiry (after): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expiry&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Access Token (before refresh()): None
Token Expiry (before refresh()): 2024-11-24 06:11:19.640651

Access Token (after): **********
Token Expiry (after): 2024-11-24 07:16:06.866467
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before calling &lt;code&gt;refresh()&lt;/code&gt;, the &lt;code&gt;token&lt;/code&gt; attribute is &lt;code&gt;None&lt;/code&gt;. After &lt;code&gt;refresh()&lt;/code&gt; is invoked, the credentials are populated with a valid access token and its expiry time.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Configuring the Kubernetes Client
&lt;/h3&gt;

&lt;p&gt;The Kubernetes client is configured using the cluster’s API endpoint, a temporary file for the CA certificate, and the refreshed Bearer token. This ensures that the client can securely authenticate and communicate with the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gke_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CA certificate is stored temporarily and referenced by the client for secure SSL communication. With these settings, the Kubernetes client is fully configured and ready to interact with the cluster.&lt;/p&gt;
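
&lt;p&gt;The certificate-handling step can be exercised on its own. In the real script the base64-encoded PEM comes from &lt;code&gt;gke_cluster.master_auth.cluster_ca_certificate&lt;/code&gt;; here a placeholder string stands in for it. Note that &lt;code&gt;delete=False&lt;/code&gt; means the temporary file outlives the &lt;code&gt;with&lt;/code&gt; block, so it is up to the caller to remove it:&lt;/p&gt;

```python
import base64
import os
from tempfile import NamedTemporaryFile

# Placeholder for gke_cluster.master_auth.cluster_ca_certificate,
# which is a base64-encoded PEM string in the real API response.
pem = b"-----BEGIN CERTIFICATE-----\nMIIB...placeholder...\n-----END CERTIFICATE-----\n"
encoded_ca = base64.b64encode(pem).decode()

# delete=False keeps the file on disk after the `with` block exits,
# so the Kubernetes client can read it later; remove it when done.
with NamedTemporaryFile(delete=False, suffix=".pem") as ca_cert:
    ca_cert.write(base64.b64decode(encoded_ca))
    ssl_ca_cert_path = ca_cert.name

with open(ssl_ca_cert_path, "rb") as f:
    data = f.read()
print(data == pem)  # True

os.remove(ssl_ca_cert_path)  # caller is responsible for cleanup
```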




&lt;h2&gt;
  
  
  Example Output
&lt;/h2&gt;

&lt;p&gt;Here’s an example of the YAML output for the &lt;code&gt;kube-system&lt;/code&gt; Namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;api_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;creation_timestamp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2024-11-24 04:49:48+00:00&lt;/span&gt;
  &lt;span class="na"&gt;deletion_grace_period_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;deletion_timestamp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;finalizers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;generate_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;generation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/metadata.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
  &lt;span class="na"&gt;managed_fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;api_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
    &lt;span class="na"&gt;fields_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FieldsV1&lt;/span&gt;
    &lt;span class="na"&gt;fields_v1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;f:metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;f:labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;.&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
          &lt;span class="na"&gt;f:kubernetes.io/metadata.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
    &lt;span class="na"&gt;manager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-apiserver&lt;/span&gt;
    &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update&lt;/span&gt;
    &lt;span class="na"&gt;subresource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="na"&gt;time&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2024-11-24 04:49:48+00:00&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;owner_references&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;resource_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;15'&lt;/span&gt;
  &lt;span class="na"&gt;self_link&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;01132228-7e86-4b74-8b78-8ceaa8df9913&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;finalizers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kubernetes&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;phase&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Active&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This approach lets the same code interact with the Kubernetes API whether it runs locally or on a Google Cloud service such as Cloud Run. By leveraging Application Default Credentials (ADC), we’ve shown a flexible way to generate a Kubernetes API client dynamically, without relying on pre-generated kubeconfig files or external tools. That makes it easy to build applications that adapt seamlessly to different environments, simplifying both development and deployment workflows.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>python</category>
      <category>googlekubernetesengine</category>
      <category>gke</category>
    </item>
  </channel>
</rss>
