DEV Community

Pratham Chauhan


Setting Up Gemini on Vertex AI for Production: A No-Nonsense Walkthrough

If you want to call Gemini through Google Cloud's Vertex AI from a real production environment, not just a local script, there's a specific order of operations that saves you from a lot of pain later. This is that walkthrough: project setup, a properly scoped service account, local testing, and finally getting it running securely on an EC2 instance, including the keyless Workload Identity Federation (WIF) path.

You might be wondering why bother with all this instead of just grabbing an API key from AI Studio and dropping it into an environment variable. Honestly, for a weekend project, do that. But a raw API key is a single static string that grants full access to whatever it's scoped to until you remember to revoke it. If it ends up in a public repo, a client-side bundle, or a log file, anyone holding it can run up your bill or worse. The setup in this post trades a bit of upfront complexity for something much safer: identity-based access that can be scoped down to exactly one role, rotated without touching your app code, and, in the WIF case, backed by no secret file you could accidentally leak. It's the difference between handing someone a house key and giving them a temporary badge that only opens one door and expires on its own.

Before diving in, a couple of terms that will come up a lot. An IAM role is just a labeled bundle of permissions, like a job title that comes with a fixed list of things you're allowed to do. A service account is not a human user. Think of it as a robot identity that your app logs in as, instead of a person typing a password. We'll use both throughout.

Part 2 covers everything that goes wrong if you skip a step or fat-finger a detail here, and trust me, there's a lot that can go wrong. But first, let's do it right.

Step 1: Pick your project and enable the right API

Open the Google Cloud Console and confirm you're in the correct project (not just "a" project, since billing and permissions are scoped per project, and it's easy to set things up in the wrong one if you have several).

PROJECT_ID="your-gcp-project-id"
LOCATION="us-central1"
gcloud config set project "$PROJECT_ID"

The API you want is aiplatform.googleapis.com, labeled "Vertex AI API" in the console. An API here just means a specific Google Cloud service you have to switch on before you can use it. There's a deceptively similar sounding one called Vertex AI Search for commerce (retail.googleapis.com) that has nothing to do with calling Gemini models. Enable the right one:

gcloud services enable aiplatform.googleapis.com
gcloud services list --enabled --filter="aiplatform.googleapis.com"

Vertex AI locations aren't inherited from your project. You choose a region explicitly, meaning the physical data center area where your requests get processed. us-central1 is a safe, well supported default. Whatever you pick, use the same value everywhere: local testing, EC2 env vars, your app code. Mismatched regions are a surprisingly common source of "it works locally but not in prod."

Step 2: Set a budget before you test anything

Before your first API call, set up a budget alert. Go to Billing, then Budgets & alerts, then Create budget. Scope it to this specific project, and set a monthly amount you're comfortable with. A hundred dollars is a reasonable starting point. Add alert thresholds at every 10% so you get early warning, not just a surprise at the end of the month.

One important caveat: budgets are alerts, not hard stops. Google Cloud won't automatically cut you off at your limit. If you need an actual ceiling, build it into your own backend. Check estimated monthly spend before each Gemini call and reject the request if you're over:

// monthlySpendUsd: your own running estimate of this month's Gemini spend, in USD.
if (monthlySpendUsd >= 100) {
  throw new Error("Monthly Gemini budget reached");
}

This is more targeted than a billing-disable webhook: it rejects the individual request instead of cutting off billing for the whole project, which matters if this project hosts anything else.
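The one-line check above can be fleshed out into a small guard object. This is a minimal sketch, not a billing-accurate implementation: the SpendGuard class and its method names are hypothetical, the per-request cost is whatever estimate you compute yourself, and the total lives in memory, so it resets on restart and isn't shared across processes. A real deployment would likely persist the running total somewhere durable.

```javascript
// Hypothetical helper: tracks an estimated monthly spend in memory.
function monthKey(date) {
  // e.g. "2025-01", using UTC so the rollover doesn't depend on server timezone
  return `${date.getUTCFullYear()}-${String(date.getUTCMonth() + 1).padStart(2, "0")}`;
}

class SpendGuard {
  constructor(monthlyLimitUsd) {
    this.monthlyLimitUsd = monthlyLimitUsd;
    this.currentMonth = null;
    this.spentUsd = 0;
  }

  // Reset the running total when the calendar month (UTC) changes.
  rollOver(now) {
    const key = monthKey(now);
    if (key !== this.currentMonth) {
      this.currentMonth = key;
      this.spentUsd = 0;
    }
  }

  // Call before each Gemini request; throws once the limit is reached.
  assertWithinBudget(now = new Date()) {
    this.rollOver(now);
    if (this.spentUsd >= this.monthlyLimitUsd) {
      throw new Error("Monthly Gemini budget reached");
    }
  }

  // Call after each request with your own cost estimate for it.
  record(costUsd, now = new Date()) {
    this.rollOver(now);
    this.spentUsd += costUsd;
  }
}
```

In use, you'd create one guard at startup (for example new SpendGuard(100)), call assertWithinBudget() before each Gemini request, and call record() afterwards with your cost estimate for that request.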

Step 3: Create a least privilege service account

"Least privilege" just means giving something only the exact permissions it needs to do its job, nothing extra "just in case." Resist the urge to reuse an existing service account or grant Editor or Owner, which are broad, all access roles. Create a fresh service account dedicated to this one purpose:

IAM & Admin → Service Accounts → Create service account
Name: gemini-prod-runner

Then grant it exactly one role, roles/aiplatform.user (shown as "Vertex AI User" in the console). This single role covers everything needed to call Gemini models. The service account doesn't need Owner, Editor, or any billing or admin permissions, so even if this identity were somehow compromised, the blast radius is small: someone could run Vertex AI workloads on your dime, but they couldn't touch your other cloud resources or billing settings.

gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:gemini-prod-runner@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"

Step 4: Test locally before touching any servers

For local development, Application Default Credentials, usually shortened to ADC, are the path of least resistance. ADC is just Google's term for "let the command line tool log you in once, then every script on this machine can quietly reuse that login" instead of you managing key files by hand.

gcloud auth application-default login
gcloud auth application-default set-quota-project "$PROJECT_ID"

That second command matters more than it looks. It tells Google which project to bill your test requests against. Skip it and you'll get vague "quota exceeded" or "API not enabled" errors that have nothing to do with quotas or the API.

Set your environment:

export GOOGLE_CLOUD_PROJECT="$PROJECT_ID"
export GOOGLE_CLOUD_LOCATION="$LOCATION"  # same region you chose in Step 1
export GOOGLE_GENAI_USE_VERTEXAI="true"

And a minimal test script:

// Save as test.mjs (top-level await needs an ES module), then run: node test.mjs
import { GoogleGenAI } from "@google/genai";

const ai = new GoogleGenAI({
  vertexai: true,
  project: process.env.GOOGLE_CLOUD_PROJECT,
  location: process.env.GOOGLE_CLOUD_LOCATION || "us-central1",
});

const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "Say hello in one sentence.",
});

console.log(response.text);

If you'd rather test with the exact kind of credentials you'll use in production, generate a service account key, which is just a downloadable JSON file containing a long lived password for that robot identity. This is fine for local testing only (more on why not to ship this to prod below):

mkdir -p ~/secrets
gcloud iam service-accounts keys create ~/secrets/gemini-prod-runner.json \
  --iam-account="gemini-prod-runner@${PROJECT_ID}.iam.gserviceaccount.com"
chmod 600 ~/secrets/gemini-prod-runner.json
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/secrets/gemini-prod-runner.json"

Add secrets/, .env, and anything matching *credentials* or *service-account* to .gitignore immediately. Don't wait until after the first commit.

Step 5: Deploying to EC2, two paths

You have two reasonable options here, and which one you pick is a real tradeoff, not just a "more secure equals always better" decision.

Option A is putting that service account JSON key file on the server's disk. Copy the key file to the instance, point GOOGLE_APPLICATION_CREDENTIALS at it, and you're done. Fast to set up, easy to debug, and perfectly fine for getting a first deployment working. The tradeoff is that it's a long lived static secret sitting on a real machine. If it leaks, it's valid until you manually revoke it.

scp ~/secrets/gemini-prod-runner.json ubuntu@<EC2_IP>:/home/ubuntu/gemini-prod-runner.json
ssh ubuntu@<EC2_IP> chmod 600 /home/ubuntu/gemini-prod-runner.json

Option B is Workload Identity Federation, or WIF. In plain terms, WIF lets two clouds vouch for each other without ever sharing a password. AWS already knows, with certainty, which EC2 instance is making a request, and WIF lets Google trust that AWS-issued proof instead of asking for a Google-specific secret. No Google key ever touches the instance. Instead, each time the app needs access, AWS proves the instance's identity and Google exchanges that proof for a short-lived token that expires on its own. More setup work, but nothing sitting on disk to rotate or leak.

If this is your first time deploying this app, I'd genuinely recommend starting with Option A, confirming everything else works end to end, and then swapping to WIF as an isolated second step. Debugging two unfamiliar systems at once, your app's Vertex AI integration and a federated trust chain between two clouds, is much harder than debugging them one at a time.

Step 6: Setting up WIF properly

This is the part with the most moving pieces, so go slowly. Here are the building blocks in plain language before the steps.

An IAM role on the AWS side (yes, AWS has its own separate concept also called a role, easy to confuse with Google's IAM role) is what gives your EC2 instance a verifiable identity card. It's not about granting AWS permissions here, it's purely "this machine is who it says it is."

A Workload Identity Pool on the Google side is basically a waiting room where Google agrees to listen to identities from somewhere outside Google, like AWS. A provider inside that pool is the specific configuration that says "and here's exactly how to verify an AWS identity, and here's which AWS account I trust." A principal is Google's general word for "the identity asking for access," whether that's a person, a service account, or in this case, an AWS role being recognized through the pool.

On the AWS side, create that IAM role for EC2. It doesn't need any special AWS permissions, since it exists purely to give the instance a verifiable identity, not to grant AWS side access to anything:

IAM → Roles → Create role → Trusted entity: AWS service → Use case: EC2
Name: ec2-gemini-runner

Attach it to your running instance:

EC2 → Instances → select instance → Actions → Security → Modify IAM role

On the Google side, create the Workload Identity Pool and an AWS provider inside it:

IAM & Admin → Workload Identity Federation → Create pool
Pool ID: aws-ec2-pool
Provider type: AWS
Provider ID: aws-ec2-provider
AWS account ID: <your AWS account ID>

Then either grant that pool identity permission to act as your service account (the Workload Identity User role, roles/iam.workloadIdentityUser, on the service account), or grant roles/aiplatform.user directly to the AWS identity as its own principal. Google supports both patterns, and which one your setup uses depends on how you configure the binding (a binding is just the rule that says "this identity gets this role"). The console's "Connected service accounts" flow under your pool will show you which path you're on. Either way, the end result needs to be that this specific AWS role can obtain a Google access token scoped to call Vertex AI.
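Whichever pattern you use, the binding has to name the AWS role as a principal, and that string is easy to get subtly wrong. Here is a rough sketch of how it is assembled, assuming the provider keeps Google's default AWS attribute mapping (where attribute.aws_role holds the assumed-role ARN). The function name and parameter names are hypothetical. Note that the path uses the numeric project number, not the project ID.

```javascript
// Hypothetical helper: builds the IAM member string that matches one AWS role
// seen through a Workload Identity Pool.
// Assumption: the provider uses the default AWS mapping for attribute.aws_role.
function awsRolePrincipal({ projectNumber, poolId, awsAccountId, awsRoleName }) {
  // Pools are addressed by project NUMBER and always live in the "global" location.
  const pool =
    `projects/${projectNumber}/locations/global/workloadIdentityPools/${poolId}`;
  // EC2 instance roles show up to Google as STS assumed-role ARNs.
  const assumedRoleArn = `arn:aws:sts::${awsAccountId}:assumed-role/${awsRoleName}`;
  return `principalSet://iam.googleapis.com/${pool}/attribute.aws_role/${assumedRoleArn}`;
}
```

With the names used in this post (pool aws-ec2-pool, role ec2-gemini-runner), the result is the value you'd pass as the member of the binding. If your provider has a custom attribute mapping, the attribute.aws_role segment may not apply, so check the provider's mapping first.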

Download the generated credential configuration file and place it on the instance:

export GOOGLE_APPLICATION_CREDENTIALS="/home/ubuntu/gcp-wif-credential.json"

This file is not a private key. It's a small JSON document of instructions telling Google's auth library how to go ask AWS who it's talking to. Worth opening it once and reading it, since knowing its shape will save you real time if something goes wrong later (Part 2 has a lot more on this).
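If you want a quick sanity check that the file on the instance really is a WIF configuration and not a service account key copied over by mistake, the top-level "type" field is the thing to look at. A small sketch, assuming the standard credential file shapes ("external_account" for WIF, "service_account" for a key file, "authorized_user" for gcloud's local ADC login); the function name and the returned labels are made up for illustration.

```javascript
// Hypothetical helper: classifies the JSON text of whatever file
// GOOGLE_APPLICATION_CREDENTIALS points at, without using it to authenticate.
function describeCredentialFile(jsonText) {
  const config = JSON.parse(jsonText);
  // A private_key field means a long-lived secret is sitting in this file.
  const hasPrivateKey = Object.prototype.hasOwnProperty.call(config, "private_key");

  switch (config.type) {
    case "external_account":
      // WIF: instructions for exchanging an outside identity for a Google token.
      return { kind: "workload-identity-federation", hasPrivateKey, audience: config.audience };
    case "service_account":
      return { kind: "service-account-key", hasPrivateKey };
    case "authorized_user":
      return { kind: "user-adc-login", hasPrivateKey };
    default:
      return { kind: "unknown", hasPrivateKey };
  }
}
```

To use it, read the file's contents as a string and pass them in. On a WIF-only instance you'd expect the workload-identity-federation kind with hasPrivateKey false; the audience value should name your pool and provider.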

Step 7: Wire it into your process manager

Whatever you use to run the app in production (PM2, systemd, Docker), make sure these environment variables actually reach the running process, not just your shell:

GOOGLE_CLOUD_PROJECT=your-project-id
GOOGLE_CLOUD_LOCATION=us-central1
GOOGLE_GENAI_USE_VERTEXAI=true
GOOGLE_APPLICATION_CREDENTIALS=/home/ubuntu/gcp-wif-credential.json

A systemd unit, for reference:

[Unit]
Description=My Gemini Vertex AI App
After=network.target

[Service]
Type=simple
User=ubuntu
WorkingDirectory=/home/ubuntu/my-app
Environment="GOOGLE_CLOUD_PROJECT=your-project-id"
Environment="GOOGLE_CLOUD_LOCATION=us-central1"
Environment="GOOGLE_GENAI_USE_VERTEXAI=true"
Environment="GOOGLE_APPLICATION_CREDENTIALS=/home/ubuntu/gcp-wif-credential.json"
ExecStart=/usr/bin/npm start
Restart=always

[Install]
WantedBy=multi-user.target
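One way to confirm the variables really reached the process is to check them at startup and fail loudly if any are missing. This is a minimal sketch with hypothetical function names. It assumes the key file or WIF path, where GOOGLE_APPLICATION_CREDENTIALS is expected; if you run locally on ADC alone, drop that entry from the list.

```javascript
// The four variables this walkthrough sets for a production process.
const REQUIRED_VARS = [
  "GOOGLE_CLOUD_PROJECT",
  "GOOGLE_CLOUD_LOCATION",
  "GOOGLE_GENAI_USE_VERTEXAI",
  "GOOGLE_APPLICATION_CREDENTIALS",
];

// Returns the names of required variables that are unset or blank in `env`.
function missingVertexEnv(env) {
  return REQUIRED_VARS.filter((name) => !env[name] || env[name].trim() === "");
}

// Throws with a readable message so the process manager's logs show the cause.
function assertVertexEnv(env) {
  const missing = missingVertexEnv(env);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
}
```

Calling assertVertexEnv(process.env) at the top of your entry file means a misconfigured unit shows up as one clear line in the service logs instead of as an auth error on the first request.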

Pre-production checklist

A quick list worth running through before calling this done. The correct API, aiplatform.googleapis.com and not retail.googleapis.com, is enabled. A budget with staged alerts exists. The service account has roles/aiplatform.user only, nothing broader. Any key files are outside the repo, chmod 600, and .gitignore'd. On EC2, you've picked one auth path deliberately rather than half configuring both. And your application has its own spend guard rather than relying solely on Google's billing alerts.

That's the setup. If everything above goes exactly to plan, you're done. If it doesn't, and there are a surprising number of ways it doesn't, Part 2 walks through every failure mode I personally hit, in the order I hit them, and how to actually diagnose each one instead of guessing.
