<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: polar3130</title>
    <description>The latest articles on DEV Community by polar3130 (@polar3130).</description>
    <link>https://dev.to/polar3130</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1034515%2Fdf1609c4-233e-482b-81ff-1d1a5e49a948.jpg</url>
      <title>DEV Community: polar3130</title>
      <link>https://dev.to/polar3130</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/polar3130"/>
    <language>en</language>
    <item>
      <title>Using Gemini CLI with a Local LLM</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Fri, 27 Feb 2026 09:39:44 +0000</pubDate>
      <link>https://dev.to/polar3130/using-gemini-cli-with-a-local-llm-5f5l</link>
      <guid>https://dev.to/polar3130/using-gemini-cli-with-a-local-llm-5f5l</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;Gemini CLI&lt;/a&gt;, an open-source AI agent published by Google, lets you interact with Gemini models from your terminal. It normally connects to Google's API endpoint, but by redirecting the API destination, you can also use a locally running LLM as its backend.&lt;/p&gt;

&lt;p&gt;In this post, I'll walk through how to combine &lt;a href="https://docs.litellm.ai/" rel="noopener noreferrer"&gt;LiteLLM Proxy&lt;/a&gt; and &lt;a href="https://ollama.com/" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; to swap Gemini CLI's backend to a local LLM, along with a few gotchas I encountered during setup.&lt;/p&gt;

&lt;p&gt;I've also covered using LiteLLM Proxy for centralized LLM API management in a &lt;a href="https://dev.to/polar3130/using-gemini-cli-through-litellm-proxy-1627"&gt;previous post&lt;/a&gt;, if you're interested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;Here is the overall architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnmr9791ehlwcodfs3yw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpnmr9791ehlwcodfs3yw.png" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By setting the &lt;code&gt;GOOGLE_GEMINI_BASE_URL&lt;/code&gt; environment variable, which is read by the &lt;code&gt;@google/genai&lt;/code&gt; SDK that Gemini CLI is built on, you can redirect all of Gemini CLI's API requests to an arbitrary endpoint. The variable doesn't appear to be documented in the Gemini CLI docs, but it is supported on the SDK side (&lt;a href="https://github.com/google-gemini/gemini-cli/pull/6380" rel="noopener noreferrer"&gt;reference PR&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;LiteLLM Proxy exposes Gemini API-compatible endpoints (&lt;code&gt;/v1beta/models/{model}:streamGenerateContent&lt;/code&gt;, etc.) and relays incoming requests to a local model running on Ollama. LiteLLM Proxy has a feature called &lt;code&gt;model_group_alias&lt;/code&gt; that routes a requested model name to a different model, which allows you to map model names sent by Gemini CLI (such as &lt;code&gt;gemini-3-flash-preview&lt;/code&gt;) to a local model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Environment
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;macOS (Apple Silicon, Tahoe 26)&lt;/li&gt;
&lt;li&gt;Gemini CLI v0.30.0&lt;/li&gt;
&lt;li&gt;LiteLLM v1.81.16&lt;/li&gt;
&lt;li&gt;Ollama v0.17.0&lt;/li&gt;
&lt;li&gt;Python 3.14.0&lt;/li&gt;
&lt;li&gt;Node.js v22.17.0&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Installing Ollama and Pulling a Model
&lt;/h3&gt;

&lt;p&gt;Install via Homebrew and start it as a service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;ollama
brew services start ollama
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pull a model. I initially planned to use gemma3, but as described later, gemma3 doesn't support tool calling in the Ollama template format, so I went with the lightweight &lt;code&gt;qwen2.5:3b&lt;/code&gt; (~1.9 GB) for this proof of concept.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull qwen2.5:3b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Installing LiteLLM
&lt;/h3&gt;

&lt;p&gt;Create a Python virtual environment and install LiteLLM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s1"&gt;'litellm[proxy]'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring LiteLLM Proxy
&lt;/h3&gt;

&lt;p&gt;Create a &lt;code&gt;litellm_config.yaml&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local-model&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ollama_chat/qwen2.5:3b&lt;/span&gt;
      &lt;span class="na"&gt;api_base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:11434"&lt;/span&gt;

&lt;span class="na"&gt;router_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model_group_alias&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-pro-preview"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3.1-pro-preview-customtools"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-3-flash-preview-customtools"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-pro"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
    &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5-flash-lite"&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;local-model"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key here is the &lt;code&gt;model_group_alias&lt;/code&gt; configuration. Gemini CLI uses multiple models internally — a main generation model (&lt;code&gt;gemini-3-flash-preview&lt;/code&gt;, etc.) as well as a lighter model for input classification (&lt;code&gt;gemini-2.5-flash-lite&lt;/code&gt;). Aliases for all of these model names need to be defined. It would be nice if wildcards were supported, but for now, each model name requires its own alias.&lt;/p&gt;

&lt;p&gt;Start the proxy with the config file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;litellm &lt;span class="nt"&gt;--config&lt;/span&gt; litellm_config.yaml &lt;span class="nt"&gt;--port&lt;/span&gt; 4000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Starting Gemini CLI
&lt;/h3&gt;

&lt;p&gt;Set the environment variables and start Gemini CLI.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_GEMINI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:4000"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"sk-dummy-key"&lt;/span&gt;
gemini &lt;span class="nt"&gt;--sandbox&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The API key isn't actually used, so any dummy value will do.&lt;/p&gt;

&lt;p&gt;You should now be getting responses from the local LLM.&lt;/p&gt;

&lt;p&gt;Note that &lt;code&gt;--sandbox=false&lt;/code&gt; is specified because, in sandbox mode, &lt;code&gt;GOOGLE_GEMINI_BASE_URL&lt;/code&gt; is not passed into the sandbox container — a known issue (&lt;a href="https://github.com/google-gemini/gemini-cli/issues/2168" rel="noopener noreferrer"&gt;Issue #2168&lt;/a&gt;).&lt;/p&gt;
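
&lt;p&gt;If something doesn't work, you can take Gemini CLI out of the loop and test the proxy path directly. The request below uses the Gemini-format endpoint mentioned earlier; the &lt;code&gt;x-goog-api-key&lt;/code&gt; header name follows the Gemini API convention and is my assumption here, so adjust it if your LiteLLM version expects a different header.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Send a minimal generateContent request straight to the proxy
curl -s "http://localhost:4000/v1beta/models/gemini-2.5-flash:generateContent" \
  -H "Content-Type: application/json" \
  -H "x-goog-api-key: sk-dummy-key" \
  -d '{"contents": [{"parts": [{"text": "Say hello"}]}]}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A JSON response containing &lt;code&gt;candidates&lt;/code&gt; confirms the proxy-to-Ollama path independently of the CLI.&lt;/p&gt;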

&lt;h2&gt;
  
  
  Gotchas During Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Missing model_group_alias Entries Cause Request Errors
&lt;/h3&gt;

&lt;p&gt;Gemini CLI uses different models for different purposes depending on the version. With v0.30.0, which I used, models such as &lt;code&gt;gemini-3-flash-preview&lt;/code&gt; and &lt;code&gt;gemini-2.5-flash-lite&lt;/code&gt; were observed in requests.&lt;/p&gt;

&lt;p&gt;If the corresponding alias is not defined in LiteLLM Proxy, you'll get a &lt;code&gt;BadRequestError: There are no healthy deployments for this model&lt;/code&gt; error. Since the models in use may change with Gemini CLI upgrades or new Gemini model releases, you'll likely need to monitor the proxy logs for requested model names and add any missing entries to &lt;code&gt;model_group_alias&lt;/code&gt; as they appear.&lt;/p&gt;
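
&lt;p&gt;One way to catch missing aliases early is to run the proxy with verbose logging and watch for unrecognized model names in incoming requests (the &lt;code&gt;--detailed_debug&lt;/code&gt; flag is from the LiteLLM CLI; it's noisy, so use it for debugging only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Verbose logging prints each incoming request, including the requested model name
litellm --config litellm_config.yaml --port 4000 --detailed_debug
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;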

&lt;h3&gt;
  
  
  The Model Must Support Tool Calling in Its Ollama Template
&lt;/h3&gt;

&lt;p&gt;I initially used Google's &lt;code&gt;gemma3:4b&lt;/code&gt;, but it failed with a &lt;code&gt;does not support tools&lt;/code&gt; error.&lt;/p&gt;

&lt;p&gt;Gemini CLI sends tool definitions (for file operations, command execution, etc.) as part of its requests. For Ollama to handle the &lt;code&gt;tools&lt;/code&gt; parameter, the model's chat template needs to support tool calling.&lt;/p&gt;

&lt;p&gt;An important nuance here is that a model's function calling capability and Ollama template support are separate concerns.&lt;/p&gt;

&lt;p&gt;gemma3 is capable of prompt-based function calling at the model level (&lt;a href="https://ai.google.dev/gemma/docs/capabilities/function-calling" rel="noopener noreferrer"&gt;reference&lt;/a&gt;), but its Ollama template does not support it (&lt;a href="https://github.com/ollama/ollama/issues/9941" rel="noopener noreferrer"&gt;ollama/ollama#9941&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Qwen 2.5, on the other hand, supports tool calling in its official Ollama template. The &lt;code&gt;qwen2.5:3b&lt;/code&gt; model I used is only about 1.9 GB at 3B parameters, making it a convenient choice for a proof of concept.&lt;/p&gt;
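
&lt;p&gt;To check a pulled model's template before wiring everything up, you can print it with &lt;code&gt;ollama show&lt;/code&gt;. As a rough heuristic (my own, not an official check), templates that support tool calling reference a &lt;code&gt;.Tools&lt;/code&gt; variable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Print the model's chat template and look for tool-related sections
ollama show qwen2.5:3b --template | grep -i tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the grep comes back empty, the template likely can't render the &lt;code&gt;tools&lt;/code&gt; parameter even if the underlying model understands function calling.&lt;/p&gt;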

&lt;h2&gt;
  
  
  Wrap-Up
&lt;/h2&gt;

&lt;p&gt;I've shown how to swap Gemini CLI's backend to a local LLM by combining LiteLLM Proxy and Ollama.&lt;/p&gt;

&lt;p&gt;The setup itself is relatively straightforward, but there were a few things that were hard to notice without actually running it — such as model name changes across Gemini CLI versions and the tool calling support status of Ollama models.&lt;/p&gt;

&lt;p&gt;That said, a 3B-parameter model doesn't have the capacity to reliably handle Gemini CLI's AI agent features like file operations and code generation. For serious use as a coding assistant, you'll likely want to consider larger models.&lt;/p&gt;

</description>
      <category>cli</category>
      <category>gemini</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Using Gemini CLI Through LiteLLM Proxy</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Tue, 25 Nov 2025 03:31:00 +0000</pubDate>
      <link>https://dev.to/polar3130/using-gemini-cli-through-litellm-proxy-1627</link>
      <guid>https://dev.to/polar3130/using-gemini-cli-through-litellm-proxy-1627</guid>
      <description>&lt;p&gt;Organizations adopting LLMs at scale often struggle with fragmented API usage, inconsistent authentication methods, and lack of visibility across teams. Tools like Gemini CLI make local development easier, but they also introduce governance challenges—especially when authentication silently bypasses centralized gateways.&lt;/p&gt;

&lt;p&gt;In this article, I walk through how to route Gemini CLI traffic through LiteLLM Proxy, explain why this configuration matters for enterprise environments, and highlight key operational considerations learned from hands-on testing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Use a Proxy for Gemini CLI?
&lt;/h2&gt;

&lt;p&gt;Before diving into configuration, it’s worth clarifying why an LLM gateway is needed in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problems with direct Gemini CLI usage
&lt;/h3&gt;

&lt;p&gt;If developers run Gemini CLI with default settings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Authentication may fall back to Google Account login
→ usage disappears from organizational audits&lt;/li&gt;
&lt;li&gt;API traffic may hit multiple GCP projects/regions
→ inconsistent cost attribution&lt;/li&gt;
&lt;li&gt;Personal API keys or user identities may be used
→ security and compliance risks&lt;/li&gt;
&lt;li&gt;Team-wide visibility into token usage becomes impossible
→ cost governance cannot scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  LiteLLM Proxy as a solution
&lt;/h3&gt;

&lt;p&gt;LiteLLM Proxy provides:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A unified OpenAI-compatible API endpoint&lt;/li&gt;
&lt;li&gt;Virtual API keys with per-user / per-project scoping&lt;/li&gt;
&lt;li&gt;Rate, budget, and quota enforcement&lt;/li&gt;
&lt;li&gt;Centralized monitoring &amp;amp; analytics&lt;/li&gt;
&lt;li&gt;Governance applied regardless of client tool (CLI, IDE, scripts)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes it suitable for organizations where 50–300+ developers may use Gemini, GPT, Claude, or Llama models across multiple teams.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;For this walkthrough, I deployed LiteLLM Proxy onto Cloud Run, using Cloud SQL for metadata storage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnexvfqb0ew76jfmix6x7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnexvfqb0ew76jfmix6x7.png" alt=" " width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this design?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud Run scales automatically and supports secure invocations.&lt;/li&gt;
&lt;li&gt;Cloud SQL stores key usage, analytics, and configuration.&lt;/li&gt;
&lt;li&gt;Vertex AI IAM is handled via the LiteLLM Proxy’s service account.&lt;/li&gt;
&lt;li&gt;API visibility is centralized and independent of client behavior.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Caveats
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud SQL connection limits must be considered when scaling Cloud Run.&lt;/li&gt;
&lt;li&gt;Cold starts may slightly increase latency for short-lived CLI invocations.&lt;/li&gt;
&lt;li&gt;Multi-region routing is out of scope but may be required for HA.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Configuration: LiteLLM Proxy
&lt;/h2&gt;

&lt;p&gt;Below is a minimal configuration enabling Gemini models via Vertex AI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemini-2.5-pro&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vertex_ai/gemini-2.5-pro&lt;/span&gt;
      &lt;span class="na"&gt;vertex_project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/GOOGLE_CLOUD_PROJECT&lt;/span&gt;
      &lt;span class="na"&gt;vertex_location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;model_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gemini-2.5-flash&lt;/span&gt;
    &lt;span class="na"&gt;litellm_params&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vertex_ai/gemini-2.5-flash&lt;/span&gt;
      &lt;span class="na"&gt;vertex_project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/GOOGLE_CLOUD_PROJECT&lt;/span&gt;
      &lt;span class="na"&gt;vertex_location&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-central1&lt;/span&gt;

&lt;span class="na"&gt;general_settings&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;master_key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LITELLM_MASTER_KEY&lt;/span&gt;
  &lt;span class="na"&gt;ui_username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;admin&lt;/span&gt;
  &lt;span class="na"&gt;ui_password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;os.environ/LITELLM_UI_PASSWORD&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Operational notes &amp;amp; recommendations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Region selection: Vertex AI availability varies by location; &lt;code&gt;us-central1&lt;/code&gt; is generally safest for new Gemini releases.&lt;/li&gt;
&lt;li&gt;Key management:
Store &lt;code&gt;LITELLM_MASTER_KEY&lt;/code&gt; and UI credentials in Secret Manager, not environment variables.&lt;/li&gt;
&lt;li&gt;Production settings to consider:
&lt;code&gt;num_retries&lt;/code&gt;, &lt;code&gt;timeout&lt;/code&gt;, &lt;code&gt;async_calls&lt;/code&gt;, request logging policies.&lt;/li&gt;
&lt;li&gt;Access control:
Use Cloud Run’s invoker IAM or an API Gateway layer for stronger security boundaries.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Virtual key issuance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST https://&amp;lt;proxy&amp;gt;/key/generate &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Authorization: Bearer &amp;lt;master key&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"models": ["gemini-2.5-pro","gemini-2.5-flash"], "duration":"30d"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This key will later be used by the Gemini CLI.&lt;/p&gt;
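
&lt;p&gt;The JSON response includes the generated key in a &lt;code&gt;key&lt;/code&gt; field. To check later what a key is scoped to, LiteLLM also exposes a key-inspection endpoint (sketched below from LiteLLM's key management API; verify the path against your version):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -s "https://&amp;lt;proxy&amp;gt;/key/info?key=&amp;lt;virtual key&amp;gt;" \
  -H "Authorization: Bearer &amp;lt;master key&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;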




&lt;h2&gt;
  
  
  Configuration: Gemini CLI
&lt;/h2&gt;

&lt;p&gt;Point the CLI to LiteLLM Proxy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_GEMINI_BASE_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://&amp;lt;LiteLLM Proxy URL&amp;gt;"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;virtual key&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Important
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;GEMINI_API_KEY&lt;/code&gt; must be a LiteLLM virtual key, not a Google Cloud API key.&lt;/p&gt;

&lt;p&gt;Gemini CLI now behaves as if it were talking directly to the Gemini API, while the traffic actually flows through LiteLLM Proxy to Vertex AI.&lt;/p&gt;
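
&lt;p&gt;Before involving the CLI, a quick way to confirm the key and base URL are wired correctly is to list models through the proxy's OpenAI-compatible route (a sanity check of my own; &lt;code&gt;/v1/models&lt;/code&gt; is part of LiteLLM's OpenAI-compatible surface):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# Should return the models the virtual key is allowed to use
curl -s "https://&amp;lt;LiteLLM Proxy URL&amp;gt;/v1/models" \
  -H "Authorization: Bearer &amp;lt;virtual key&amp;gt;"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;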




&lt;h2&gt;
  
  
  Testing the End-to-End Path
&lt;/h2&gt;

&lt;p&gt;Once configured, run a simple test through Gemini CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ gemini hello
Loaded cached credentials.
Hello! I'm ready for your first command.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On the LiteLLM dashboard, you should see request logs, latency, and token usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ply4igltd01md2yprhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ply4igltd01md2yprhh.png" alt=" " width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Important Note: Authentication Bypass in Gemini CLI
&lt;/h2&gt;

&lt;p&gt;During testing, I observed situations where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini CLI worked normally&lt;/li&gt;
&lt;li&gt;but LiteLLM Proxy showed zero usage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why it happens
&lt;/h3&gt;

&lt;p&gt;Gemini CLI supports three authentication methods:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Login with Google&lt;/li&gt;
&lt;li&gt;Use Gemini API Key&lt;/li&gt;
&lt;li&gt;Vertex AI&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2dwzw3hajd5gspl2sng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk2dwzw3hajd5gspl2sng.png" alt=" " width="800" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When a user logs in with Google Login:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The CLI uses Google OAuth credentials&lt;/li&gt;
&lt;li&gt;These credentials automatically route traffic directly to Vertex AI&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GOOGLE_GEMINI_BASE_URL&lt;/code&gt; is ignored&lt;/li&gt;
&lt;li&gt;LiteLLM Proxy is completely bypassed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If OAuth login is left enabled:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Teams lose visibility of CLI usage&lt;/li&gt;
&lt;li&gt;Costs appear under personal or unintended projects&lt;/li&gt;
&lt;li&gt;Security review cannot track data flowing to Vertex AI&lt;/li&gt;
&lt;li&gt;API limits and budgets set on LiteLLM do not apply&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the number one issue organizations should be aware of.&lt;/p&gt;
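
&lt;p&gt;One mitigation is to pin the CLI's auth method so the interactive OAuth option can't silently win. Gemini CLI persists the selected method in &lt;code&gt;~/.gemini/settings.json&lt;/code&gt;; the field name below reflects the versions I tested and may differ in newer releases, so verify against your install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;{
  "selectedAuthType": "gemini-api-key"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;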




&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this article, we walked through how to route Gemini CLI traffic through LiteLLM Proxy and highlighted key lessons from testing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Unifies API governance across CLI, IDE, and backend services&lt;/li&gt;
&lt;li&gt;Enables per-user quotas, budgets, and access scopes&lt;/li&gt;
&lt;li&gt;Provides analytics across all models and providers&lt;/li&gt;
&lt;li&gt;Gives SRE/PFE teams full visibility into LLM usage patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limitations / Things to Consider
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Gemini CLI’s Google-auth login bypasses proxies unless explicitly disabled&lt;/li&gt;
&lt;li&gt;Cloud Run + Cloud SQL requires connection pooling considerations&lt;/li&gt;
&lt;li&gt;The model list must be kept up to date as Vertex AI releases new model versions&lt;/li&gt;
&lt;li&gt;LiteLLM Enterprise features (SSO, RBAC, audit logging) may be necessary for large orgs&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gemini</category>
      <category>cli</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Differences in Response Models between the Vertex AI SDK and the Gen AI SDK</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Thu, 31 Jul 2025 13:15:56 +0000</pubDate>
      <link>https://dev.to/polar3130/differences-in-response-models-between-the-vertex-ai-sdk-and-the-gen-ai-sdk-4m49</link>
      <guid>https://dev.to/polar3130/differences-in-response-models-between-the-vertex-ai-sdk-and-the-gen-ai-sdk-4m49</guid>
      <description>&lt;p&gt;When migrating a Python-based AI application from the Vertex AI SDK to the Gen AI SDK, I made an interesting discovery: the Gen AI SDK uses a Pydantic-based response model (&lt;code&gt;GenerateContentResponse&lt;/code&gt;), which means you can serialize it with &lt;code&gt;model_dump()&lt;/code&gt; or &lt;code&gt;model_dump_json()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For anyone unfamiliar with the current landscape, it can be confusing that Google offers &lt;em&gt;multiple&lt;/em&gt; official SDKs for working with the Gemini API. Below is some background before we dive in. At the moment, Gemini exposes &lt;strong&gt;two main APIs&lt;/strong&gt; and &lt;strong&gt;three Python SDKs&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gemini APIs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Gemini API in Vertex AI&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Access Gemini models via Google Cloud’s Vertex AI&lt;/li&gt;
&lt;li&gt;Requires a Google Cloud project&lt;/li&gt;
&lt;li&gt;IAM-based authentication and access control&lt;/li&gt;
&lt;li&gt;Per-project quotas that throttle usage as needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See: &lt;strong&gt;“Migrate from the Gemini Developer API to the Vertex AI Gemini API”&lt;/strong&gt; (Google Cloud Docs)&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Gemini Developer API&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Access Gemini through Google AI Studio&lt;/li&gt;
&lt;li&gt;Works even without a Google Cloud project&lt;/li&gt;
&lt;li&gt;Generous free tier—ideal for learning and prototyping&lt;/li&gt;
&lt;li&gt;Enterprise-grade features and advanced settings are limited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;See: &lt;strong&gt;“Get a Gemini API key | Google AI for Developers”&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Official Python SDKs
&lt;/h2&gt;

&lt;p&gt;Google currently maintains three Python SDKs:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Google Gen AI SDK (&lt;code&gt;google-genai&lt;/code&gt;)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Supports &lt;em&gt;both&lt;/em&gt; the Gemini Developer API and the Vertex AI Gemini API&lt;/li&gt;
&lt;li&gt;A single code base that handles API-key auth (AI Studio) &lt;em&gt;and&lt;/em&gt; IAM auth (Vertex AI)&lt;/li&gt;
&lt;li&gt;Newer than the others and updated most frequently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docs: &lt;a href="https://googleapis.github.io/python-genai/" rel="noopener noreferrer"&gt;https://googleapis.github.io/python-genai/&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Vertex AI SDK (&lt;code&gt;google-cloud-aiplatform&lt;/code&gt;)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Dedicated to the Gemini API in Vertex AI&lt;/li&gt;
&lt;li&gt;Lets you use Gemini models through Vertex AI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/googleapis/python-aiplatform" rel="noopener noreferrer"&gt;https://github.com/googleapis/python-aiplatform&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Google AI Python SDK (&lt;code&gt;google-generativeai&lt;/code&gt;)&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Targets the Gemini Developer API only&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Does not&lt;/em&gt; work with Vertex AI&lt;/li&gt;
&lt;li&gt;Now deprecated; scheduled for EOL at the end of August 2025&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Repo: &lt;a href="https://github.com/google-gemini/deprecated-generative-ai-python" rel="noopener noreferrer"&gt;https://github.com/google-gemini/deprecated-generative-ai-python&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  How the Response Models Differ
&lt;/h2&gt;

&lt;p&gt;Both the Vertex AI SDK and the Gen AI SDK can call the Gemini API in Vertex AI, but their usage patterns differ slightly. Below are minimal examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example with the Vertex AI SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.generative_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenerativeModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vertexai&lt;/span&gt;

&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*****************&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;REGION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;REGION&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of Japan?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;# -&amp;gt; Tokyo is the capital of Japan.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example with the Gen AI SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpOptions&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*****************&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;http_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of Japan?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;           &lt;span class="c1"&gt;# -&amp;gt; Tokyo is the capital of Japan.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Although you still retrieve the generated text via &lt;code&gt;response.text&lt;/code&gt;, the underlying &lt;em&gt;response object&lt;/em&gt; differs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;SDK&lt;/th&gt;
&lt;th&gt;Response class&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Vertex AI SDK&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GenerationResponse&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gen AI SDK&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GenerateContentResponse&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each response bundles the generated content plus rich metadata. If you want to dump &lt;em&gt;everything&lt;/em&gt; to JSON—for example, to inspect intermediate artifacts—here’s how you do it with each SDK.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serializing with the Vertex AI SDK
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;vertexai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.generative_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenerativeModel&lt;/span&gt;

&lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*****************&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of Japan?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;GenerationResponse&lt;/code&gt; exposes a handy &lt;code&gt;to_dict()&lt;/code&gt; method.&lt;/p&gt;

&lt;h3&gt;
  
  
  Serializing with the Gen AI SDK
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;GenerateContentResponse&lt;/code&gt; is built on &lt;strong&gt;Pydantic’s &lt;code&gt;BaseModel&lt;/code&gt;&lt;/strong&gt;, so you use &lt;code&gt;model_dump()&lt;/code&gt; (or &lt;code&gt;model_dump_json()&lt;/code&gt;) instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.genai.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HttpOptions&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;genai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;vertexai&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;*****************&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;http_options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;HttpOptions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of Japan?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ensure_ascii&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip&lt;/strong&gt;&lt;br&gt;
If you call &lt;code&gt;to_dict()&lt;/code&gt; on a &lt;code&gt;GenerateContentResponse&lt;/code&gt;, you’ll get an error—one of those “gotchas” to watch for when migrating.&lt;/p&gt;
&lt;/blockquote&gt;
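&lt;p&gt;If your code has to handle responses from both SDKs during a migration, a small compatibility helper can hide the difference. The sketch below is my own; the &lt;code&gt;response_to_json&lt;/code&gt; name is not part of either SDK:&lt;/p&gt;

```python
import json


def response_to_json(response) -> str:
    """Serialize a response from either SDK to a JSON string."""
    # Gen AI SDK: GenerateContentResponse is Pydantic-based, so use model_dump()
    if hasattr(response, "model_dump"):
        payload = response.model_dump(mode="json")
    # Vertex AI SDK: GenerationResponse exposes to_dict()
    elif hasattr(response, "to_dict"):
        payload = response.to_dict()
    else:
        raise TypeError(f"Unsupported response type: {type(response)!r}")
    return json.dumps(payload, indent=2, ensure_ascii=False)
```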

&lt;p&gt;Google’s migration guide and GitHub issues both mention this explicitly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Migration guide: &lt;em&gt;Generate-content section&lt;/em&gt; (ai.google.dev)&lt;/li&gt;
&lt;li&gt;Issue tracker: &lt;code&gt;googleapis/python-genai#709&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Interestingly, the Gen AI SDK surfaces &lt;em&gt;more&lt;/em&gt; generation-time metadata than the Vertex AI SDK (though that’s unrelated to the serialization method itself).&lt;/p&gt;




&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;When moving from the Vertex AI SDK to the Gen AI SDK, remember that &lt;strong&gt;&lt;code&gt;GenerateContentResponse&lt;/code&gt; is Pydantic-based&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serialize with &lt;code&gt;model_dump()&lt;/code&gt; or &lt;code&gt;model_dump_json()&lt;/code&gt;, &lt;em&gt;not&lt;/em&gt; &lt;code&gt;to_dict()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;Some features are still exclusive to the Vertex AI SDK, so the Gen AI SDK isn’t yet a drop-in replacement for every use case.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;That said, Google’s docs now recommend the Gen AI SDK, and it’s seeing the most active development—so consolidation onto this SDK feels inevitable.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;Looking forward to future updates!&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>python</category>
      <category>vertexai</category>
    </item>
    <item>
      <title>Understanding Quota Project Warnings When Using Google Cloud ADC</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Mon, 30 Jun 2025 01:48:56 +0000</pubDate>
      <link>https://dev.to/polar3130/understanding-quota-project-warnings-when-using-google-cloud-adc-4bpd</link>
      <guid>https://dev.to/polar3130/understanding-quota-project-warnings-when-using-google-cloud-adc-4bpd</guid>
      <description>&lt;p&gt;This article covers authentication credentials when developing applications on Google Cloud.&lt;/p&gt;

&lt;p&gt;When you run the &lt;code&gt;gcloud auth application-default login&lt;/code&gt; command locally, you may see a warning like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WARNING: 
Cannot add the project "dazzling-pillar-4369" to ADC as the quota project because the account in ADC does not have the "serviceusage.services.use" permission on this project. You might receive a "quota_exceeded" or "API not enabled" error. Run $ gcloud auth application-default set-quota-project to add a quota project.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This article explains the meaning, cause, and recommended practices related to this warning.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Are Application Default Credentials (ADC)?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/docs/authentication/application-default-credentials" rel="noopener noreferrer"&gt;Application Default Credentials (ADC)&lt;/a&gt; are a mechanism for obtaining default credentials for applications to access Google Cloud services and APIs.&lt;br&gt;
With ADC, applications can automatically discover the appropriate credentials depending on their environment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;If the &lt;code&gt;GOOGLE_APPLICATION_CREDENTIALS&lt;/code&gt; environment variable is set, the service account key or config file at that path is used.&lt;/li&gt;
&lt;li&gt;Otherwise, credentials saved via the &lt;code&gt;gcloud auth application-default login&lt;/code&gt; command (user credentials saved locally in the ADC file) are used.&lt;/li&gt;
&lt;li&gt;If neither is present, the default service account credentials for the environment (e.g., GCE or GKE) are retrieved from the metadata server.&lt;/li&gt;
&lt;/ol&gt;
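&lt;p&gt;The lookup order above can be sketched in plain Python. This only illustrates the order of precedence; it is not how the auth libraries are actually implemented:&lt;/p&gt;

```python
import os
from pathlib import Path


def adc_credential_source(env=None, home=None):
    """Report which source ADC would use, mirroring the three-step lookup."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    # 1. Explicit credentials file named by the environment variable
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        return "GOOGLE_APPLICATION_CREDENTIALS file"
    # 2. User credentials saved by `gcloud auth application-default login`
    if (home / ".config" / "gcloud" / "application_default_credentials.json").exists():
        return "local ADC file"
    # 3. Fall back to the environment's metadata server (e.g. GCE, GKE)
    return "metadata server"
```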

&lt;p&gt;ADC allows developers to access Google Cloud resources without hardcoding credentials.&lt;br&gt;
When running &lt;code&gt;gcloud auth application-default login&lt;/code&gt; locally, a credentials file is created (usually at &lt;code&gt;~/.config/gcloud/application_default_credentials.json&lt;/code&gt;), which is then used by client libraries.&lt;/p&gt;

&lt;p&gt;Note: As mentioned in point 2 above, the &lt;code&gt;gcloud auth application-default login&lt;/code&gt; command uses the permissions of your user account.&lt;br&gt;
If you want your application to impersonate a specific service account, use &lt;code&gt;gcloud auth application-default login --impersonate-service-account&lt;/code&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Understanding the Quota Project Warning
&lt;/h2&gt;

&lt;p&gt;The warning shown at the start of this article relates to the &lt;strong&gt;Quota Project&lt;/strong&gt; used with ADC.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The project ID &lt;code&gt;"dazzling-pillar-4369"&lt;/code&gt; could not be set as the Quota Project because the account used by ADC lacks the &lt;code&gt;serviceusage.services.use&lt;/code&gt; permission on that project.&lt;/li&gt;
&lt;li&gt;As a result, errors like &lt;code&gt;quota_exceeded&lt;/code&gt; or &lt;code&gt;API not enabled&lt;/code&gt; may occur.&lt;/li&gt;
&lt;li&gt;The recommended fix is to run &lt;code&gt;gcloud auth application-default set-quota-project&lt;/code&gt; to explicitly set the Quota Project.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s clarify what a “Quota Project” is.&lt;br&gt;
There are two types of Google Cloud APIs: &lt;strong&gt;resource-based&lt;/strong&gt; and &lt;strong&gt;client-based&lt;/strong&gt; (&lt;a href="https://cloud.google.com/docs/authentication/troubleshoot-adc" rel="noopener noreferrer"&gt;Troubleshoot your ADC setup&lt;/a&gt;).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Resource-based APIs&lt;/strong&gt; use the settings and quotas of the project containing the resource.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-based APIs&lt;/strong&gt;, since they aren’t tied to a specific project, require explicit specification of the Quota Project (also referred to as the billing project).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When using user credentials (rather than a service account), client libraries require you to specify which project should be used for quota and billing purposes.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;gcloud auth application-default login&lt;/code&gt; is executed, the Cloud SDK attempts to associate the current project as the Quota Project for ADC.&lt;br&gt;
However, if the user credentials (e.g., a Viewer role) do not include &lt;code&gt;serviceusage.services.use&lt;/code&gt; permission for that project, this association fails, resulting in the warning message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/docs/authentication/troubleshoot-adc?hl=ja#:~:text=%E3%83%97%E3%83%AD%E3%82%B8%E3%82%A7%E3%82%AF%E3%83%88%E3%82%92%E8%AB%8B%E6%B1%82%E5%85%88%E3%83%97%E3%83%AD%E3%82%B8%E3%82%A7%E3%82%AF%E3%83%88%E3%81%A8%E3%81%97%E3%81%A6%E6%8C%87%E5%AE%9A" rel="noopener noreferrer"&gt;Reference - ADC Troubleshooting Docs&lt;/a&gt;&lt;/p&gt;


&lt;h3&gt;
  
  
  What Is &lt;code&gt;serviceusage.services.use&lt;/code&gt;?
&lt;/h3&gt;

&lt;p&gt;This is a permission required to use (and be billed for) services within a specific project.&lt;br&gt;
It is included in the &lt;strong&gt;Service Usage Consumer&lt;/strong&gt; role (&lt;code&gt;roles/serviceusage.serviceUsageConsumer&lt;/code&gt;).&lt;br&gt;
Viewer roles do not include this permission. So if your credentials only allow viewing a project, you cannot set it as a Quota Project.&lt;/p&gt;



&lt;p&gt;As a side note, the warning mentions the following two possible errors if a Quota Project is not set:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;quota_exceeded&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;API not enabled&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You might wonder why &lt;code&gt;quota_exceeded&lt;/code&gt; would occur if no project is associated.&lt;br&gt;
It turns out that in the absence of a properly set Quota Project, some client-based API calls fall back to a &lt;strong&gt;shared Google-owned quota project&lt;/strong&gt; (&lt;a href="https://stackoverflow.com/questions/72745805/warnings-because-of-user-credentials-without-quota-project" rel="noopener noreferrer"&gt;StackOverflow&lt;/a&gt;, &lt;a href="https://medium.com/google-cloud/google-oauth-credential-going-deeper-the-hard-way-f403cf3edf9d" rel="noopener noreferrer"&gt;Google Cloud Medium article&lt;/a&gt;).&lt;br&gt;
If this fallback project doesn’t have the necessary APIs enabled, you may also see &lt;code&gt;API not enabled&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Official documentation also mentions a mysterious fallback project ID (&lt;a href="https://cloud.google.com/docs/authentication/troubleshoot-adc#unknown_project_764086051850_used_for_request" rel="noopener noreferrer"&gt;link&lt;/a&gt;), which is likely the shared quota project.&lt;/p&gt;


&lt;h2&gt;
  
  
  How to Set the Quota Project for ADC
&lt;/h2&gt;

&lt;p&gt;First, ensure the account used by ADC has the &lt;code&gt;serviceusage.services.use&lt;/code&gt; permission on the project you want to use.&lt;br&gt;
Grant the &lt;strong&gt;Service Usage Consumer&lt;/strong&gt; role:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud projects add-iam-policy-binding &amp;lt;PROJECT_ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--member&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"user:&amp;lt;your-email@example.com&amp;gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"roles/serviceusage.serviceUsageConsumer"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives the necessary permission to use the project’s services, allowing it to be set as the Quota Project.&lt;/p&gt;

&lt;p&gt;Then, set the Quota Project in ADC:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth application-default set-quota-project &amp;lt;PROJECT_ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This updates your local ADC file with a &lt;code&gt;quota_project_id&lt;/code&gt; field.&lt;br&gt;
From then on, any API calls made using ADC will use this project for billing and quota tracking.&lt;/p&gt;
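&lt;p&gt;For reference, the resulting ADC file looks roughly like this (the values are elided, the project ID is a placeholder, and the exact set of fields varies by credential type):&lt;/p&gt;

```json
{
  "type": "authorized_user",
  "client_id": "...",
  "client_secret": "...",
  "refresh_token": "...",
  "quota_project_id": "my-project-id"
}
```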


&lt;h2&gt;
  
  
  Best Practices for Project Configuration
&lt;/h2&gt;

&lt;p&gt;To prevent these types of errors in the future, here are some best practices:&lt;/p&gt;
&lt;h3&gt;
  
  
  Set a Default Quota Project in gcloud Configuration
&lt;/h3&gt;

&lt;p&gt;While running &lt;code&gt;gcloud auth application-default set-quota-project&lt;/code&gt; manually each time works, it can be tedious.&lt;br&gt;
Instead, you can set the quota project in your gcloud config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud config &lt;span class="nb"&gt;set &lt;/span&gt;billing/quota_project &amp;lt;PROJECT_ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures any future ADC credentials will automatically use this project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Consider Using Service Accounts
&lt;/h3&gt;

&lt;p&gt;For long-lived automated processes like CI/CD pipelines, using service accounts is better than user-based ADC.&lt;br&gt;
With service accounts, you can assign only the necessary IAM roles and explicitly define the quota project.&lt;/p&gt;

&lt;h3&gt;
  
  
  Don’t Ignore Warnings or Errors
&lt;/h3&gt;

&lt;p&gt;This last point is more of a mindset: don’t ignore warnings like these.&lt;br&gt;
If left unresolved, your application may rely on fallback shared projects and break unexpectedly when quotas are hit or APIs are disabled.&lt;br&gt;
Understanding and fixing such issues helps you prevent incidents, deepen your knowledge, and strengthen your team’s technical capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Starting with a common ADC warning message, we explored what a Quota Project is, how to configure it, and some best practices for stable development.&lt;br&gt;
Failing to set a proper Quota Project can result in unexpected errors and usage being billed to Google’s fallback project.&lt;br&gt;
Setting a correct quota project ensures your API calls are tracked and billed properly under your own project.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>gemini</category>
      <category>vertexai</category>
    </item>
    <item>
      <title>Optimizing Image Management Efficiency Using AWS ECR Pull-Through Cache</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Tue, 25 Mar 2025 02:43:10 +0000</pubDate>
      <link>https://dev.to/polar3130/optimizing-image-management-efficiency-using-aws-ecr-pull-through-cache-4846</link>
      <guid>https://dev.to/polar3130/optimizing-image-management-efficiency-using-aws-ecr-pull-through-cache-4846</guid>
      <description>&lt;p&gt;This time, we will take a look at AWS Elastic Container Registry (ECR), an essential service when building container execution environments on AWS. We will introduce the pull-through cache feature of ECR by covering its recent updates—such as support for private ECR repositories—enterprise use cases, and insights gained during our testing process.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ECR's Pull-Through Cache Feature?
&lt;/h2&gt;

&lt;p&gt;ECR’s pull-through cache is a feature that &lt;strong&gt;dynamically retrieves external container images on-demand and caches them within your private ECR&lt;/strong&gt;. By creating a “pull-through cache rule” in your account’s private ECR and associating an upstream registry with a namespace (a prefix applied to the repository name in your ECR), developers and deployment environments can &lt;strong&gt;pull external images via the ECR repository URI&lt;/strong&gt;, while ECR automatically fetches the image from the upstream registry in the background.&lt;/p&gt;

&lt;p&gt;At the time of the first pull, a repository is automatically created in your private ECR using the specified prefix combined with the original image name, and the image layers are cached. During this initial retrieval, AWS uses an IP address from its managed infrastructure to fetch the image via the pull-through cache feature. For subsequent pulls, the image is served directly from the cached copy in ECR, so there is no further access to the external registry.&lt;/p&gt;

&lt;p&gt;In addition, ECR checks at least once every 24 hours whether the image on the upstream side with the same tag has been updated, and if an update is detected, the cached copy in your private ECR is refreshed. This means that even for tags that are frequently updated, such as &lt;code&gt;latest&lt;/code&gt;, the ECR cache will follow the latest image at least once every 24 hours (as of the time of writing, the image update interval cannot be modified).&lt;/p&gt;

&lt;p&gt;Through this mechanism, you can use ECR like a proxy cache to access external container images via your internal repository.&lt;/p&gt;
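&lt;p&gt;Concretely, the mapping from an upstream image reference to its cached URI can be sketched as follows (the account ID and region are placeholders, and &lt;code&gt;cached_image_uri&lt;/code&gt; is a helper of my own):&lt;/p&gt;

```python
def cached_image_uri(account_id: str, region: str, prefix: str, upstream_image: str) -> str:
    """Build the private-ECR URI that a pull-through cache rule maps an upstream image to."""
    registry = f"{account_id}.dkr.ecr.{region}.amazonaws.com"
    return f"{registry}/{prefix}/{upstream_image}"


# docker.io/library/nginx:latest, pulled through a rule with prefix "docker-hub":
print(cached_image_uri("123456789012", "us-east-1", "docker-hub", "library/nginx:latest"))
# -> 123456789012.dkr.ecr.us-east-1.amazonaws.com/docker-hub/library/nginx:latest
```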

&lt;h2&gt;
  
  
  Main Benefits of the Pull-Through Cache
&lt;/h2&gt;

&lt;p&gt;The most obvious benefit of caching is improved performance. Since the required container images are cached in ECR, retrieving the same external image for the second time and beyond is significantly faster, reducing your application’s startup time. Of course, this is not the only advantage; ECR’s pull-through cache offers several additional benefits in terms of security and other aspects:&lt;/p&gt;

&lt;h3&gt;
  
  
  Reduced External Dependency and Improved Availability
&lt;/h3&gt;

&lt;p&gt;Even if an external container registry is down during deployment, having a cached copy in ECR can help avoid or mitigate service impact. Moreover, since the image is obtained from within ECR, you can also expect lower network latency and faster performance due to intra-region replication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoidance of External Registry Rate Limits
&lt;/h3&gt;

&lt;p&gt;For example, Docker Hub imposes rate limits on the number of pulls an anonymous user can perform within a given time period. However, if you retrieve images from Docker Hub via ECR’s pull-through cache, you can use them without worrying about these limits. As explained earlier, since ECR fetches images from Docker Hub on your behalf using AWS infrastructure, your environment does not need to directly access Docker Hub.&lt;/p&gt;

&lt;h3&gt;
  
  
  Centralized Security Controls
&lt;/h3&gt;

&lt;p&gt;By consolidating the origin of your container images to ECR, you can centrally manage access controls and vulnerability scans. ECR integrates with IAM for authentication and offers security features such as KMS encryption at rest and image scanning integrated with Amazon Inspector. This means that even images imported from external sources can benefit from consistent security measures. For instance, images retrieved through the pull-through cache can be automatically scanned for vulnerabilities, and lifecycle policies can be used to periodically remove older tags. Additionally, you can enforce policies that only grant production clusters access to ECR while blocking direct access to external registries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost Optimization
&lt;/h3&gt;

&lt;p&gt;The pull-through cache only caches images that are actually pulled, so there is no need to pre-replicate all potentially used images, which helps reduce storage costs. Moreover, using cached images within the same region minimizes cross-region data transfers, lowering the cost associated with inter-region image transfers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Supported Upstream Sources
&lt;/h2&gt;

&lt;p&gt;Currently, ECR’s pull-through cache supports specifying several major public registries as well as some private registries as upstream sources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache-creating-rule.html" rel="noopener noreferrer"&gt;https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache-creating-rule.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the initial release in 2021, only Quay.io and Amazon ECR Public were supported, but support has since expanded to include registries that require authentication, such as Docker Hub and GitHub Container Registry. Additionally, with the update released this month (March 2025), even private Amazon ECR repositories can now act as upstream sources—regardless of cross-region or cross-account boundaries.&lt;/p&gt;

&lt;p&gt;You can find the release notes here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://aws.amazon.com/jp/about-aws/whats-new/2025/03/amazon-ecr-pull-through-cache/" rel="noopener noreferrer"&gt;https://aws.amazon.com/jp/about-aws/whats-new/2025/03/amazon-ecr-pull-through-cache/&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;You can create pull-through cache rules using the AWS CLI with the &lt;code&gt;aws ecr create-pull-through-cache-rule&lt;/code&gt; command. The main options include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--ecr-repository-prefix&lt;/code&gt;: The prefix for the repository to be created in your private ECR (e.g., &lt;code&gt;docker-hub&lt;/code&gt; or &lt;code&gt;quay&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--upstream-registry-url&lt;/code&gt;: The endpoint URL of the external upstream registry (e.g., for Docker Hub, &lt;code&gt;registry-1.docker.io&lt;/code&gt;; for ECR Public, &lt;code&gt;public.ecr.aws&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--credential-arn&lt;/code&gt;: (For upstream sources that require authentication) The ARN of the secret stored in Secrets Manager containing your credentials.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--registry-id&lt;/code&gt;: (Optional) The AWS account ID of the private registry in which to create the rule (if not specified, the default registry is used).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--custom-role-arn&lt;/code&gt;: (For the case where the upstream is another AWS account’s ECR) The ARN of the IAM role used for cross-account access.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Below is an example of setting Docker Hub as the upstream source. Since authentication is required for retrieving images from Docker Hub, a Secrets Manager secret ARN is specified. The secret name in Secrets Manager must begin with the prefix &lt;code&gt;ecr-pullthroughcache/&lt;/code&gt; (&lt;a href="https://docs.aws.amazon.com/AmazonECR/latest/userguide/pull-through-cache-creating-rule.html" rel="noopener noreferrer"&gt;Creating a pull through cache rule in Amazon ECR - Amazon ECR&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set Docker Hub as the upstream (authentication required)&lt;/span&gt;
aws ecr create-pull-through-cache-rule &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ecr-repository-prefix&lt;/span&gt; &lt;span class="s2"&gt;"docker-hub"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--upstream-registry-url&lt;/span&gt; &lt;span class="s2"&gt;"registry-1.docker.io"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--credential-arn&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:secretsmanager:&amp;lt;region&amp;gt;:&amp;lt;accountID&amp;gt;:secret:ecr-pullthroughcache/&amp;lt;DockerHub-secret-name&amp;gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To use ECR Public as the upstream source (which does not require authentication), the command looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set ECR Public as the upstream (no credentials required)&lt;/span&gt;
aws ecr create-pull-through-cache-rule &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--ecr-repository-prefix&lt;/span&gt; &lt;span class="s2"&gt;"ecr-public"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--upstream-registry-url&lt;/span&gt; &lt;span class="s2"&gt;"public.ecr.aws"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
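&lt;p&gt;Once a rule is in place, images are pulled through the cache by prefixing the upstream repository path with the configured prefix. The sketch below only constructs the image URI to illustrate the naming convention; the account ID and region are placeholder values, not real ones.&lt;/p&gt;

```shell
# Placeholder account ID and region for illustration only
ACCOUNT_ID="123456789012"
REGION="ap-northeast-1"

# Cached images are addressed as:
#   ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/PREFIX/UPSTREAM_REPO:TAG
# For Docker Hub official images, the upstream repo lives under "library/"
IMAGE_URI="${ACCOUNT_ID}.dkr.ecr.${REGION}.amazonaws.com/docker-hub/library/nginx:latest"
echo "${IMAGE_URI}"

# The first pull populates the cache; subsequent pulls are served from ECR
# docker pull "${IMAGE_URI}"
```

&lt;p&gt;The first pull of a given tag creates the repository under the prefix automatically, so no manual &lt;code&gt;create-repository&lt;/code&gt; step is needed.&lt;/p&gt;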



&lt;h2&gt;
  
  
  Considerations When Using the Pull-Through Cache
&lt;/h2&gt;

&lt;p&gt;Beyond general considerations such as supported regions and quotas, here are some points to note:&lt;/p&gt;

&lt;h3&gt;
  
  
  Using Secrets Manager
&lt;/h3&gt;

&lt;p&gt;For registries like Docker Hub and GitHub that require authentication, the secret name in Secrets Manager must begin with the prefix &lt;code&gt;ecr-pullthroughcache/&lt;/code&gt;.&lt;br&gt;&lt;br&gt;
For example, in the case of Docker Hub, including the keys &lt;code&gt;username&lt;/code&gt; and &lt;code&gt;accessToken&lt;/code&gt; in the secret allows ECR to perform upstream authentication using those credentials.&lt;br&gt;&lt;br&gt;
When setting up via the console, only secrets with this prefix will be displayed as options, so if your secret is not shown, verify that the prefix is correctly applied.&lt;br&gt;&lt;br&gt;
This prefix must also be used when provisioning secrets through CloudFormation or Terraform.&lt;/p&gt;
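&lt;p&gt;As a sketch of provisioning such a secret with the AWS CLI (the credential values are placeholders to substitute with your own):&lt;/p&gt;

```shell
# The secret name must start with the mandatory prefix
SECRET_NAME="ecr-pullthroughcache/docker-hub"

# Docker Hub credentials go in the "username" and "accessToken" keys
# (placeholder values; substitute your own)
SECRET_JSON='{"username":"my-dockerhub-user","accessToken":"my-dockerhub-token"}'

# Guard against a missing prefix before creating the secret
case "${SECRET_NAME}" in
  ecr-pullthroughcache/*) echo "prefix ok" ;;
  *) echo "prefix missing" ;;
esac

# aws secretsmanager create-secret \
#     --name "${SECRET_NAME}" \
#     --secret-string "${SECRET_JSON}"
```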

&lt;h3&gt;
  
  
  Tag Updates and Immutability
&lt;/h3&gt;

&lt;p&gt;As mentioned earlier, pull-through cache checks the upstream for updates on a tag-by-tag basis every 24 hours. If you have enabled tag immutability in your ECR repository, which prevents overwriting images with the same tag, you might expect the cache to stop updating. However, there are reports (see &lt;a href="https://github.com/aws/containers-roadmap/issues/2275" rel="noopener noreferrer"&gt;pull through cache rule related cache repository's immutability not ...&lt;/a&gt;) that image replacement still occurs even when immutability is enabled, so keep this behavior in mind when using the feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying the Concept to Enterprise Environment Segregation
&lt;/h2&gt;

&lt;p&gt;Let’s explore some potential use cases that might arise in an enterprise setting where separation between development and production environments is a key requirement. Often, due to the security and reliability demands of a production environment, a policy may be adopted where “external images can be used flexibly in the development environment, while only trusted internal images are allowed in production.” The pull-through cache feature can be very useful in implementing such segregation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Controlling the Image Flow in Development Environments
&lt;/h3&gt;

&lt;p&gt;When developers test new middleware or OSS images, they typically pull various images from Docker Hub or GitHub Container Registry. By setting up pull-through cache rules in the ECR of the development AWS account and retrieving external images via ECR, you can centralize the image flow and maintain visibility into which images have been used. For example, since images cached in the development ECR can be scanned for vulnerabilities using ECR’s scanning feature, potential vulnerabilities in images being used during development can be identified early. In organizations with a dedicated platform or infrastructure operations team, this centralized caching can also aid in monitoring the usage of external images by application teams.&lt;/p&gt;

&lt;h3&gt;
  
  
  Commonizing Sources Across Multiple Environments
&lt;/h3&gt;

&lt;p&gt;If you are managing multiple AWS accounts or regions for development, staging, and regression testing, you can create a central repository and use pull-through cache rules to ensure that all environments retrieve images from a common source. For instance, only images that pass tests in the development environment could be exported to the central repository, and each environment could then create its own pull-through cache rule pointing to this common upstream repository. While push-based image distribution (e.g., using replication) is also an option, pull-based caching via the pull-through cache can simplify management as each environment’s ECR automatically caches from the central repository without requiring additional distribution mechanisms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Moving Images from a Development Repository to a Production Repository
&lt;/h3&gt;

&lt;p&gt;In enterprise environments, it is common to have separate repositories for development and production to meet different isolation requirements. In such cases, when exporting images from a central development repository to a production repository, a design that uses replication or individual image export/import might be more secure than using on-demand retrieval via pull-through cache. This is because on-demand retrieval from the development repository could inadvertently cache images in the production repository that are not authorized for production deployment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we have introduced the pull-through cache feature of ECR along with its latest updates and potential enterprise use cases. Although caching is often associated with improved performance, as highlighted here, the feature also offers significant benefits in terms of availability and security. As discussed, pull-through cache can play a key role in managing image distribution channels and vulnerability scanning scopes in production environments.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecr</category>
      <category>container</category>
    </item>
    <item>
      <title>Implementing a Fallback Strategy for Experimental Vertex AI Models</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Fri, 28 Feb 2025 14:16:08 +0000</pubDate>
      <link>https://dev.to/polar3130/implementing-a-fallback-strategy-for-experimental-vertex-ai-models-28lj</link>
      <guid>https://dev.to/polar3130/implementing-a-fallback-strategy-for-experimental-vertex-ai-models-28lj</guid>
      <description>&lt;p&gt;When integrating &lt;strong&gt;experimental AI models&lt;/strong&gt; into your application, there's always a risk that they may become unavailable due to frequent updates, deprecations, or API changes. To mitigate this risk and enhance the resilience and operational stability of your application, having a well-planned &lt;strong&gt;fallback mechanism&lt;/strong&gt; using a Generally Available (GA) model can be highly effective.&lt;/p&gt;

&lt;p&gt;This blog post explores the advantages of maintaining a fallback model strategy in Vertex AI and provides an implementation guide using Python.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why a Fallback Model is Essential
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Ensuring Service Continuity&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Experimental models can sometimes be temporarily or permanently deprecated. Having a GA model as a backup allows your application to continue running without interruptions.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Handling API Changes &amp;amp; Compatibility Issues&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Experimental models undergo frequent API updates that may introduce breaking changes. GA models, on the other hand, offer a more stable and backward-compatible alternative.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Maintaining Output Quality &amp;amp; Stability&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Experimental models may produce unpredictable or inconsistent outputs. A GA model ensures a baseline of output quality when the experimental model fails.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Managing Costs Effectively&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;GA models are often more cost-effective. You may choose to use the experimental model only for specific high-value use cases while keeping the GA model as the default option.&lt;/p&gt;




&lt;h2&gt;
  
  
  Considerations When Implementing a Fallback Strategy
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Automatic Failover Handling&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Your application should detect API failures such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;404 Not Found&lt;/strong&gt; (Model deprecated or removed)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;500 Internal Server Error&lt;/strong&gt; (Service outage)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rate-limiting issues (429 Too Many Requests)&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When such failures occur, your system should &lt;strong&gt;automatically switch&lt;/strong&gt; to a GA model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note on Rate Limits and Fallback Strategy for Error 429&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When applying a fallback strategy for handling error 429 (Too Many Requests), be aware that it may not always be effective if both the experimental and GA models share the same base model. For example, &lt;code&gt;gemini-2.0-flash-thinking-exp-01-21&lt;/code&gt; and &lt;code&gt;gemini-2.0-flash&lt;/code&gt; are both based on &lt;code&gt;gemini-2.0-flash&lt;/code&gt;. In Gemini models, rate limits are not only applied to individual models but also to the underlying base model.&lt;/p&gt;

&lt;p&gt;This means that if you attempt to switch to another model that shares the same base model, you might still be subject to the same rate limit, rendering the fallback ineffective. &lt;/p&gt;

&lt;p&gt;For more details, refer to the official documentation: &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/quotas" rel="noopener noreferrer"&gt;Vertex AI Quotas&lt;/a&gt;.&lt;/p&gt;
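&lt;p&gt;One way to act on this is to select the fallback by base model rather than by model name. Below is a minimal sketch assuming a hand-maintained mapping of model IDs to base models; the first two entries come from the note above, while the &lt;code&gt;gemini-1.5-pro&lt;/code&gt; entry is an illustrative assumption, not taken from the quota documentation.&lt;/p&gt;

```python
# Base model for each model ID. The first two pairs come from the note above;
# the gemini-1.5-pro entry is an illustrative assumption.
BASE_MODEL = {
    "gemini-2.0-flash-thinking-exp-01-21": "gemini-2.0-flash",
    "gemini-2.0-flash": "gemini-2.0-flash",
    "gemini-1.5-pro": "gemini-1.5-pro",
}

def pick_fallback(primary, candidates):
    """Return the first candidate whose base model differs from the primary's,
    so a base-model-level rate limit is not hit twice. Returns None if no
    candidate qualifies."""
    primary_base = BASE_MODEL.get(primary, primary)
    for candidate in candidates:
        if BASE_MODEL.get(candidate, candidate) != primary_base:
            return candidate
    return None
```

&lt;p&gt;With &lt;code&gt;gemini-2.0-flash-thinking-exp-01-21&lt;/code&gt; as the primary, &lt;code&gt;gemini-2.0-flash&lt;/code&gt; is skipped because it shares the same base model.&lt;/p&gt;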

&lt;h3&gt;
  
  
  &lt;strong&gt;Handling Model Output Differences&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Experimental and GA models may generate different responses. Implementing &lt;strong&gt;pre-processing and post-processing&lt;/strong&gt; logic can help normalize outputs.&lt;/p&gt;
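&lt;p&gt;As a minimal sketch of such post-processing, the helper below trims whitespace and unwraps Markdown code fences that some models place around their answers; the exact rules will depend on the models you pair.&lt;/p&gt;

```python
FENCE = chr(96) * 3  # a triple-backtick marker, built indirectly to keep this example fence-safe

def normalize_output(text):
    """Normalize model output so downstream code sees a consistent shape
    regardless of which model produced it (the rules here are examples)."""
    text = text.strip()
    # Some models wrap answers in Markdown code fences; unwrap them if present
    if text.startswith(FENCE):
        lines = text.splitlines()
        if lines[-1].strip() == FENCE:
            lines = lines[1:-1]  # drop the opening fence (and language tag) plus the closing fence
        else:
            lines = lines[1:]    # unterminated fence: drop only the opening line
        text = "\n".join(lines).strip()
    return text
```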

&lt;h3&gt;
  
  
  &lt;strong&gt;Parallel Testing Before Deployment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To prevent unexpected issues in production, &lt;strong&gt;test both models in parallel&lt;/strong&gt; and evaluate their responses to ensure the fallback model meets your requirements.&lt;/p&gt;
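&lt;p&gt;A simple harness for this kind of side-by-side evaluation can take the model calls as injected callables, so the harness itself runs without API access. A sketch:&lt;/p&gt;

```python
def compare_models(generate_fns, prompts):
    """Run the same prompts through each model callable and collect the outputs
    side by side for review. Failures are recorded instead of aborting the run."""
    results = {}
    for name, generate in generate_fns.items():
        results[name] = []
        for prompt in prompts:
            try:
                results[name].append(generate(prompt))
            except Exception as exc:  # keep going so other models still get evaluated
                results[name].append("ERROR: " + str(exc))
    return results
```

&lt;p&gt;In practice each callable would wrap a &lt;code&gt;GenerativeModel(...).generate_content(...)&lt;/code&gt; call for one model ID.&lt;/p&gt;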




&lt;h2&gt;
  
  
  Python Implementation: Fallback from Experimental to GA Model
&lt;/h2&gt;

&lt;p&gt;Here's how you can implement a &lt;strong&gt;fallback strategy&lt;/strong&gt; using Vertex AI's generative models in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.generative_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GenerationConfig&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;predict_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash-thinking-exp-01-21&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.0-flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Experimental first, then GA model
&lt;/span&gt;    &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerationConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_output_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Trying model: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generation_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Success with model:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# Fall back to the next model
&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;All models failed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the significance of Kubernetes in modern cloud computing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;predict_with_fallback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated text:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to generate text with all models.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;How This Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Prioritizes the experimental model (&lt;code&gt;gemini-2.0-flash-thinking-exp-01-21&lt;/code&gt;)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;If it fails, falls back to the GA model (&lt;code&gt;gemini-2.0-flash&lt;/code&gt;)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handles API errors and exceptions&lt;/strong&gt; to ensure continuous operation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prints logs&lt;/strong&gt; to track which model is being used.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Using an &lt;strong&gt;experimental AI model without a fallback mechanism&lt;/strong&gt; is risky, as these models frequently change or become unavailable. By implementing a &lt;strong&gt;fallback strategy with a stable GA model&lt;/strong&gt;, you ensure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Seamless service continuity&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent API compatibility&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality assurance in generated outputs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effective AI usage&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When designing AI-driven applications, always plan for model unavailability scenarios. A structured fallback mechanism allows your system to &lt;strong&gt;adapt dynamically&lt;/strong&gt; while maintaining a &lt;strong&gt;high-quality user experience&lt;/strong&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Handling Error Code 429 in Vertex AI: Implementing Retries with Python</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Sun, 26 Jan 2025 14:24:24 +0000</pubDate>
      <link>https://dev.to/polar3130/handling-error-code-429-in-vertex-ai-implementing-retries-with-python-1iag</link>
      <guid>https://dev.to/polar3130/handling-error-code-429-in-vertex-ai-implementing-retries-with-python-1iag</guid>
      <description>&lt;p&gt;When developing applications using Vertex AI, encountering error code &lt;strong&gt;429 (Too Many Requests)&lt;/strong&gt; is a common scenario. As described in the &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;, this error occurs when the number of requests exceeds the allocated capacity for processing. To handle such situations effectively, implementing retries with exponential backoff can be crucial.&lt;/p&gt;

&lt;p&gt;This blog post introduces a simple and effective way to implement retries using the &lt;code&gt;google.api_core.retry&lt;/code&gt; package in Python. By the end of this post, you'll have a clear understanding of how to handle error 429 gracefully and ensure your application remains robust and responsive.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What is Error Code 429?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Error 429 signifies that the service has hit its processing capacity limits for your requests. To mitigate this, Google Cloud provides two primary approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use truncated exponential backoff for retries&lt;/strong&gt; (recommended for pay-as-you-go users).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Subscribe to Provisioned Throughput&lt;/strong&gt;, a monthly subscription service to reserve throughput capacity.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This post focuses on the first approach, showcasing how to implement retries programmatically.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Using Exponential Backoff with Python&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;google.api_core.retry&lt;/code&gt; package offers a convenient way to handle retries. Here’s the Python implementation for retrying API requests with Vertex AI's generative model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.api_core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;exceptions&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.api_core.retry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Retry&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;vertexai.generative_models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GenerationConfig&lt;/span&gt;

&lt;span class="n"&gt;RETRIABLE_TYPES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TooManyRequests&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 429
&lt;/span&gt;    &lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InternalServerError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 500
&lt;/span&gt;    &lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;BadGateway&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 502
&lt;/span&gt;    &lt;span class="n"&gt;exceptions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ServiceUnavailable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# 503
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nd"&gt;@Retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;predicate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RETRIABLE_TYPES&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;initial&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Initial wait time in seconds
&lt;/span&gt;        &lt;span class="n"&gt;maximum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;60.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Maximum wait time in seconds
&lt;/span&gt;        &lt;span class="n"&gt;multipier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Exponential backoff multiplier
&lt;/span&gt;        &lt;span class="n"&gt;deadline&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;300.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Total retry period in seconds
&lt;/span&gt;        &lt;span class="n"&gt;on_error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;retry_state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrying due to: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retry_state&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;last_attempt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_content_with_retry&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;GenerativeModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;generate_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;contents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;generation_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;generate_content_with_retry&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Key Features of the Code&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retryable Exceptions&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;RETRIABLE_TYPES&lt;/code&gt; set defines the exceptions that should trigger a retry. These include common transient errors such as:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;429 Too Many Requests&lt;/strong&gt;: Indicates rate limits have been exceeded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;500 Internal Server Error&lt;/strong&gt;: A general server-side error.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;502 Bad Gateway&lt;/strong&gt;: Received an invalid response from an upstream server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;503 Service Unavailable&lt;/strong&gt;: Temporary service unavailability.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Retry Parameters&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;initial&lt;/code&gt;&lt;/strong&gt;: The initial wait time before retrying (e.g., 10 seconds).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;maximum&lt;/code&gt;&lt;/strong&gt;: The upper limit for the wait time (e.g., 60 seconds).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;multiplier&lt;/code&gt;&lt;/strong&gt;: Controls the exponential growth of wait times (e.g., doubling).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;deadline&lt;/code&gt;&lt;/strong&gt;: Total retry period, ensuring retries don't exceed a fixed duration (e.g., 5 minutes).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Logging on Retry&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;on_error&lt;/code&gt; callback logs the error that triggered the retry, providing visibility into the retry process.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reusable Logic&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;code&gt;generate_content_with_retry&lt;/code&gt; function encapsulates the retry logic, making it reusable for multiple API calls.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding the &lt;code&gt;@Retry&lt;/code&gt; Decorator&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;Retry&lt;/code&gt; decorator simplifies the process of implementing retries by abstracting common retry mechanisms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Exponential Backoff&lt;/strong&gt;: Increases the wait time between retries exponentially, preventing overwhelming the service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexible Conditions&lt;/strong&gt;: Retries can be configured to handle specific exception types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customizable Parameters&lt;/strong&gt;: Allows fine-tuning of wait times, retry limits, and error handling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By leveraging this decorator, developers can build robust applications that gracefully handle transient errors and ensure high availability.&lt;/p&gt;
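&lt;p&gt;To see what the parameters used above imply, the sketch below computes the jitter-free wait schedule for &lt;code&gt;initial=10&lt;/code&gt;, &lt;code&gt;maximum=60&lt;/code&gt;, &lt;code&gt;multiplier=2&lt;/code&gt;, and &lt;code&gt;deadline=300&lt;/code&gt;; note that the real &lt;code&gt;Retry&lt;/code&gt; also randomizes each wait.&lt;/p&gt;

```python
def backoff_schedule(initial, maximum, multiplier, deadline):
    """Jitter-free sketch of the truncated-exponential waits implied by the
    parameters used above (the real Retry randomizes each wait)."""
    waits, wait, elapsed = [], initial, 0.0
    # keep waiting as long as the next sleep still fits inside the deadline
    while deadline >= elapsed + wait:
        waits.append(wait)
        elapsed += wait
        wait = min(wait * multiplier, maximum)  # grow, but cap at the maximum
    return waits
```

&lt;p&gt;With the values above this yields waits of 10, 20, 40, and then 60 seconds capped at the maximum, until the 300-second deadline is reached.&lt;/p&gt;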




&lt;h2&gt;
  
  
  &lt;strong&gt;Dynamic Shared Quota and Error 429&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is Dynamic Shared Quota?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/dsq" rel="noopener noreferrer"&gt;Dynamic Shared Quota (DSQ)&lt;/a&gt; is a resource management framework introduced by Google Cloud for Vertex AI Generative AI services. It dynamically allocates the available compute capacity among ongoing requests, making it an essential feature for applications requiring flexible and efficient use of resources.&lt;/p&gt;

&lt;p&gt;DSQ eliminates the need for fixed quotas and adapts in real time to the demands of your application. This system is particularly beneficial in environments where workload demands fluctuate significantly, as it ensures optimal resource utilization without manual intervention.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Key Benefits of DSQ for Users:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flexible Resource Utilization:&lt;/strong&gt;&lt;br&gt;
With DSQ, you don’t need to pre-allocate fixed quotas. Resources are dynamically adjusted based on the real-time demand of your application. This approach reduces resource wastage and ensures that you have sufficient capacity when you need it most.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Improved Scalability:&lt;/strong&gt;&lt;br&gt;
During traffic surges, DSQ dynamically reallocates resources to maintain application performance. This capability ensures high availability and responsiveness, even under peak loads, enabling seamless user experiences.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Simplified Quota Management:&lt;/strong&gt;&lt;br&gt;
Traditional quota systems require you to monitor resource usage and request increases as needed. DSQ streamlines this process, automatically managing resource allocation and saving you time and effort.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Understanding Error 429: "Too Many Requests"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;While DSQ provides significant advantages, it operates within the constraints of the available shared capacity. When your application sends requests that exceed the currently available capacity, you might encounter an HTTP 429 error, indicating &lt;strong&gt;“Too Many Requests.”&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This error doesn’t mean that the service is unavailable—it’s a signal that your requests are temporarily exceeding the dynamic quota. The official &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/error-code-429" rel="noopener noreferrer"&gt;documentation on Error 429&lt;/a&gt; provides guidance on handling this situation.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Best Practices for Handling Error 429:&lt;/strong&gt;
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Exponential Backoff with Retry:&lt;/strong&gt;&lt;br&gt;
Implementing an exponential backoff strategy is the recommended approach to handle 429 errors. By introducing progressively longer delays between retries, you allow the system time to recover and allocate additional capacity for your requests.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Consider Provisioned Throughput:&lt;/strong&gt;&lt;br&gt;
If your application consistently requires higher throughput, you might benefit from &lt;a href="https://cloud.google.com/vertex-ai/pricing#generative_ai" rel="noopener noreferrer"&gt;Provisioned Throughput&lt;/a&gt;. This subscription-based service reserves capacity for your usage, reducing the likelihood of encountering 429 errors during high-demand periods.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitor and Optimize Requests:&lt;/strong&gt;&lt;br&gt;
Analyze your application’s request patterns to identify opportunities for optimization. Consolidating redundant requests or adjusting the frequency of non-essential calls can help you stay within DSQ’s dynamically allocated capacity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
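
&lt;p&gt;Practice 1 can also be written as a simple retry loop. The snippet below simulates it with a stand-in exception and a fake call (&lt;code&gt;TooManyRequests&lt;/code&gt; and &lt;code&gt;fake_vertex_call&lt;/code&gt; are hypothetical names for this sketch; a real client would catch the SDK's 429 error instead):&lt;/p&gt;

```python
import random
import time

class TooManyRequests(Exception):
    """Stand-in for an HTTP 429 response from the API."""

def call_with_backoff(call, max_attempts=6, base=1.0, cap=60.0):
    """Retry `call` on 429s, doubling the backoff each attempt with full jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TooManyRequests:
            if attempt == max_attempts - 1:
                raise  # out of attempts
            backoff = min(cap, base * (2 ** attempt))
            # Full jitter: sleep a random fraction of the capped backoff.
            time.sleep(random.uniform(0, backoff))

# Simulate a service that rejects the first three calls with 429.
state = {"calls": 0}

def fake_vertex_call():
    state["calls"] += 1
    if state["calls"] in (1, 2, 3):
        raise TooManyRequests()
    return {"status": 200}

print(call_with_backoff(fake_vertex_call, base=0.01))  # {'status': 200}
```

&lt;p&gt;Random jitter spreads the retries of many clients over time, which helps avoid synchronized retry storms against the shared capacity pool.&lt;/p&gt;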




&lt;h3&gt;
  
  
  &lt;strong&gt;The Intersection of DSQ and Error 429&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dynamic Shared Quota and error code 429 are inherently linked. DSQ’s ability to dynamically adjust resource allocation helps you avoid unnecessary over-provisioning, but it also requires careful handling of temporary resource constraints. Understanding and leveraging DSQ allows you to design robust applications that gracefully adapt to fluctuating resource availability.&lt;/p&gt;

&lt;p&gt;By implementing exponential backoff and optimizing your resource usage, you can maximize the benefits of DSQ while minimizing disruptions caused by error 429. Whether you’re building lightweight prototypes or deploying enterprise-grade solutions, DSQ offers a powerful, scalable foundation for your Vertex AI applications.&lt;/p&gt;

&lt;p&gt;For more details, refer to the &lt;a href="https://cloud.google.com/vertex-ai/generative-ai/docs/dsq" rel="noopener noreferrer"&gt;DSQ documentation&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Handling error 429 in Vertex AI can be straightforward with the right tools and strategies. By using the &lt;code&gt;google.api_core.retry&lt;/code&gt; package, you can implement exponential backoff with minimal effort, ensuring your application remains robust even during transient capacity issues.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building a Kubernetes Client for Google Kubernetes Engine (GKE) in Python</title>
      <dc:creator>polar3130</dc:creator>
      <pubDate>Mon, 25 Nov 2024 14:16:00 +0000</pubDate>
      <link>https://dev.to/polar3130/building-a-kubernetes-client-for-google-kubernetes-engine-gke-in-python-16mg</link>
      <guid>https://dev.to/polar3130/building-a-kubernetes-client-for-google-kubernetes-engine-gke-in-python-16mg</guid>
      <description>&lt;p&gt;This blog post introduces an effective method for creating a Kubernetes client for GKE in Python. By leveraging the &lt;code&gt;google-cloud-container&lt;/code&gt;, &lt;code&gt;google-auth&lt;/code&gt;, and &lt;code&gt;kubernetes&lt;/code&gt; libraries, you can use the same code to interact with the Kubernetes API regardless of whether your application is running locally or on Google Cloud. This flexibility comes from using &lt;strong&gt;Application Default Credentials (ADC)&lt;/strong&gt; to authenticate and dynamically construct the requests needed for Kubernetes API interactions, eliminating the need for additional tools or configuration files like &lt;code&gt;kubeconfig&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When running locally, a common approach is to use the &lt;code&gt;gcloud container clusters get-credentials&lt;/code&gt; command to generate a &lt;code&gt;kubeconfig&lt;/code&gt; file and interact with the Kubernetes API using &lt;code&gt;kubectl&lt;/code&gt;. While this workflow is natural and effective for local setups, it becomes less practical in environments like Cloud Run or other Google Cloud services.&lt;/p&gt;

&lt;p&gt;With ADC, you can streamline access to the Kubernetes API for GKE clusters by dynamically configuring the Kubernetes client. This approach ensures a consistent, efficient way to connect to your cluster without the overhead of managing external configuration files or installing extra tools.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Authentication with Google Cloud
&lt;/h3&gt;

&lt;p&gt;If you're running the code locally, simply authenticate using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gcloud auth application-default login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will use your user account credentials as the Application Default Credentials (ADC). &lt;/p&gt;

&lt;p&gt;If you're running the code on Google Cloud services like Cloud Run, you don’t need to handle authentication manually. Just ensure that the service has a properly configured service account attached with the necessary permissions to access the GKE cluster.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Gather Your Cluster Details
&lt;/h3&gt;

&lt;p&gt;Before running the script, make sure you have the following details:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Project ID&lt;/strong&gt;: The ID of the project where your GKE cluster is hosted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Location&lt;/strong&gt;: The region or zone where your cluster is located (e.g., &lt;code&gt;us-central1-a&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Name&lt;/strong&gt;: The name of the Kubernetes cluster you want to connect to.&lt;/li&gt;
&lt;/ul&gt;
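
&lt;p&gt;These three values are combined into the fully qualified resource name that the GKE API expects, exactly as the script below does. A minimal sketch with placeholder values:&lt;/p&gt;

```python
project_id = "your-project-id"
location = "us-central1-a"      # region or zone
cluster_id = "your-cluster-id"

# ClusterManagerClient.get_cluster() addresses the cluster by this name.
name = f"projects/{project_id}/locations/{location}/clusters/{cluster_id}"
print(name)
```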




&lt;h2&gt;
  
  
  The Script
&lt;/h2&gt;

&lt;p&gt;Below is the Python function that sets up a Kubernetes client for a GKE cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;container_v1&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.auth&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;google.auth.transport.requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kubernetes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;kubernetes_client&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tempfile&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;NamedTemporaryFile&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_k8s_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;kubernetes_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Fetches a Kubernetes client for the specified GCP project, location, and cluster ID.

    Args:
        project_id (str): Google Cloud Project ID
        location (str): Location of the cluster (e.g., &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-central1-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;)
        cluster_id (str): Name of the Kubernetes cluster

    Returns:
        kubernetes_client.CoreV1Api: Kubernetes CoreV1 API client
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="c1"&gt;# Retrieve cluster information
&lt;/span&gt;    &lt;span class="n"&gt;gke_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;container_v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClusterManagerClient&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projects/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/locations/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/clusters/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;# Obtain Google authentication credentials
&lt;/span&gt;    &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;auth_req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Refresh the token
&lt;/span&gt;    &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth_req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Initialize the Kubernetes client configuration object
&lt;/span&gt;    &lt;span class="n"&gt;configuration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kubernetes_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Configuration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c1"&gt;# Set the cluster endpoint
&lt;/span&gt;    &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gke_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# Write the cluster CA certificate to a temporary file
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nc"&gt;NamedTemporaryFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;delete&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ca_cert&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;ca_cert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gke_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;master_auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cluster_ca_certificate&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ssl_ca_cert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ca_cert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;

    &lt;span class="c1"&gt;# Set the authentication token
&lt;/span&gt;    &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key_prefix&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bearer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
    &lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;

    &lt;span class="c1"&gt;# Create and return the Kubernetes CoreV1 API client
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;kubernetes_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CoreV1Api&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kubernetes_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ApiClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-project-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Google Cloud Project ID
&lt;/span&gt;    &lt;span class="n"&gt;location&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-cluster-location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Cluster location (region or zone, e.g., "us-central1-a")
&lt;/span&gt;    &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-cluster-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# Cluster name
&lt;/span&gt;
    &lt;span class="c1"&gt;# Retrieve the Kubernetes client
&lt;/span&gt;    &lt;span class="n"&gt;core_v1_api&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_k8s_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Fetch the kube-system Namespace
&lt;/span&gt;    &lt;span class="n"&gt;namespace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;core_v1_api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_namespace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kube-system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Output the Namespace resource in YAML format
&lt;/span&gt;    &lt;span class="n"&gt;yaml_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;yaml&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;default_flow_style&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;yaml_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Connecting to the GKE Cluster
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;get_k8s_client&lt;/code&gt; function begins by fetching cluster details from GKE using the &lt;code&gt;google-cloud-container&lt;/code&gt; library. This library interacts with the GKE service, allowing you to retrieve information such as the cluster's API endpoint and certificate authority (CA). These details are essential for configuring the Kubernetes client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;gke_cluster&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;container_v1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ClusterManagerClient&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get_cluster&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;projects/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/locations/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;location&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/clusters/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s important to note that the &lt;code&gt;google-cloud-container&lt;/code&gt; library is designed for interacting with GKE as a service, not directly with Kubernetes APIs. For example, while you can use this library to retrieve cluster information, upgrade clusters, or configure maintenance policies—similar to what you can do with the &lt;code&gt;gcloud container clusters&lt;/code&gt; command—you cannot use it to directly obtain a Kubernetes API client. This distinction is why the function constructs a Kubernetes client separately after fetching the necessary cluster details from GKE.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Authenticating with Google Cloud
&lt;/h3&gt;

&lt;p&gt;To interact with GKE and Kubernetes APIs, the function uses Google Cloud’s Application Default Credentials (ADC) to authenticate. Here's how each step of the authentication process works:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;code&gt;google.auth.default()&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This function retrieves the ADC for the environment in which the code is running. Depending on the context, it may return:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User account credentials&lt;/strong&gt; (e.g., from &lt;code&gt;gcloud auth application-default login&lt;/code&gt; in a local development setup).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service account credentials&lt;/strong&gt; (e.g., when running in a Google Cloud environment like Cloud Run).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It also returns the associated project ID if available, although in this case, only the credentials are used.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;code&gt;google.auth.transport.requests.Request()&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This creates an HTTP request object for handling authentication-related network requests. It uses Python's &lt;code&gt;requests&lt;/code&gt; library internally and provides a standardized way to refresh credentials or request access tokens.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;code&gt;creds.refresh(auth_req)&lt;/code&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;When ADC is retrieved using &lt;code&gt;google.auth.default()&lt;/code&gt;, the credentials object does not initially include an access token (at least in a local environment). The &lt;code&gt;refresh()&lt;/code&gt; method explicitly obtains an access token and attaches it to the credentials object, enabling it to authenticate API requests.&lt;/p&gt;

&lt;p&gt;The following code demonstrates how you can verify this behavior:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Obtain Google authentication credentials
&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;auth_req&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;google&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Inspect credentials before refreshing
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Access Token (before refresh()): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Token Expiry (before refresh()): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expiry&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Refresh the token
&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refresh&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;auth_req&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Inspect credentials after refreshing
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Access Token (after): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Token Expiry (after): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;expiry&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Example output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Access Token (before refresh()): None
Token Expiry (before refresh()): 2024-11-24 06:11:19.640651

Access Token (after): **********
Token Expiry (after): 2024-11-24 07:16:06.866467
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before calling &lt;code&gt;refresh()&lt;/code&gt;, the &lt;code&gt;token&lt;/code&gt; attribute is &lt;code&gt;None&lt;/code&gt;. After &lt;code&gt;refresh()&lt;/code&gt; is invoked, the credentials are populated with a valid access token and its expiry time.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Configuring the Kubernetes Client
&lt;/h3&gt;

&lt;p&gt;The Kubernetes client is configured using the cluster’s API endpoint, a temporary file for the CA certificate, and the refreshed Bearer token. This ensures that the client can securely authenticate and communicate with the cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;gke_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;endpoint&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;authorization&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;creds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CA certificate is stored temporarily and referenced by the client for secure SSL communication. With these settings, the Kubernetes client is fully configured and ready to interact with the cluster.&lt;/p&gt;
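
&lt;p&gt;The certificate-handling step can be exercised on its own. In the real script the base64-encoded PEM comes from &lt;code&gt;gke_cluster.master_auth.cluster_ca_certificate&lt;/code&gt;; here a placeholder string stands in for it. Note that &lt;code&gt;delete=False&lt;/code&gt; means the temporary file outlives the &lt;code&gt;with&lt;/code&gt; block, so it is up to the caller to remove it:&lt;/p&gt;

```python
import base64
import os
from tempfile import NamedTemporaryFile

# Placeholder for gke_cluster.master_auth.cluster_ca_certificate,
# which is a base64-encoded PEM string in the real API response.
pem = b"-----BEGIN CERTIFICATE-----\nMIIB...placeholder...\n-----END CERTIFICATE-----\n"
encoded_ca = base64.b64encode(pem).decode()

# delete=False keeps the file on disk after the `with` block exits,
# so the Kubernetes client can read it later; remove it when done.
with NamedTemporaryFile(delete=False, suffix=".pem") as ca_cert:
    ca_cert.write(base64.b64decode(encoded_ca))
    ssl_ca_cert_path = ca_cert.name

with open(ssl_ca_cert_path, "rb") as f:
    data = f.read()
print(data == pem)  # True

os.remove(ssl_ca_cert_path)  # caller is responsible for cleanup
```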




&lt;h2&gt;
  
  
  Example Output
&lt;/h2&gt;

&lt;p&gt;Here’s an example of the YAML output for the &lt;code&gt;kube-system&lt;/code&gt; Namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;api_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;creation_timestamp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2024-11-24 04:49:48+00:00&lt;/span&gt;
  &lt;span class="na"&gt;deletion_grace_period_seconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;deletion_timestamp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;finalizers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;generate_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;generation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;kubernetes.io/metadata.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
  &lt;span class="na"&gt;managed_fields&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;api_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
    &lt;span class="na"&gt;fields_type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FieldsV1&lt;/span&gt;
    &lt;span class="na"&gt;fields_v1&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;f:metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;f:labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;.&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
          &lt;span class="na"&gt;f:kubernetes.io/metadata.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
    &lt;span class="na"&gt;manager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-apiserver&lt;/span&gt;
    &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Update&lt;/span&gt;
    &lt;span class="na"&gt;subresource&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="na"&gt;time&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2024-11-24 04:49:48+00:00&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kube-system&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;owner_references&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;resource_version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;15'&lt;/span&gt;
  &lt;span class="na"&gt;self_link&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;uid&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;01132228-7e86-4b74-8b78-8ceaa8df9913&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;finalizers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;kubernetes&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;conditions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
  &lt;span class="na"&gt;phase&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Active&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This approach lets the same code interact with the Kubernetes API whether it runs locally or on a Google Cloud service such as Cloud Run. By leveraging Application Default Credentials (ADC), we’ve shown a flexible way to generate a Kubernetes API client dynamically, without relying on pre-generated kubeconfig files or external tools. That makes it easy to build applications that adapt seamlessly to different environments, simplifying both development and deployment workflows.&lt;/p&gt;

</description>
      <category>googlecloud</category>
      <category>python</category>
      <category>googlekubernetesengine</category>
      <category>gke</category>
    </item>
  </channel>
</rss>
