Vertex AI RAG Engine handles chunking, embedding, and retrieval with a managed vector database. Terraform provisions the infrastructure - GCS bucket, service account, engine config - while the Python SDK manages corpus and file operations.
You have a Vertex AI endpoint answering general questions. Ask it about your company's internal docs and it hallucinates with confidence. RAG fixes this by grounding model responses in your actual data.
Vertex AI RAG Engine is GCP's fully managed RAG service. You create a corpus (the index), import files from Cloud Storage or Google Drive, and the engine handles chunking, embedding with a model like text-embedding-005, and vector storage in a managed Spanner-based database. At query time, it retrieves relevant chunks and feeds them as context to Gemini.
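Stripped to its essentials, retrieve-then-generate looks like this. A toy sketch with a word-overlap "retriever" standing in for real embeddings; every name here is illustrative, nothing below is the Vertex SDK:

```python
def retrieve(question: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Toy retriever: rank chunks by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))
    return ranked[:top_k]

def build_prompt(question: str, contexts: list[str]) -> str:
    """Ground the model by prepending retrieved chunks as context."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"

docs = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
]
question = "within how many days are refunds issued"
prompt = build_prompt(question, retrieve(question, docs))
print(prompt)
```

RAG Engine does exactly this, except retrieval runs over embedding vectors in a managed store and the prompt assembly happens server-side.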
🏗️ Architecture Overview
```
┌──────────────┐      ┌──────────────┐      ┌───────────────────┐
│  GCS Bucket  │ ───> │  RAG Engine  │ ───> │   RagManagedDb    │
│ (Documents)  │      │   (Corpus)   │      │ (Managed Spanner  │
│              │      │              │      │   Vector Store)   │
└──────────────┘      └──────┬───────┘      └───────────────────┘
                             │
                  ┌──────────┴──────────┐
                  │  retrieveContexts   │
                  │  + generateContent  │
                  └─────────────────────┘
```
Data flow: Documents in GCS are imported into a corpus. RAG Engine chunks them, generates embeddings via the configured model, and stores vectors in RagManagedDb. Queries use retrieveContexts for retrieval or generateContent with a RAG tool for end-to-end generation.
📦 Step 1: GCS Bucket for Documents
This bucket holds your source documents - PDFs, plain text, HTML, and Markdown:
```hcl
# rag/gcs.tf
resource "google_storage_bucket" "rag_documents" {
  name     = "${var.environment}-${var.corpus_name}-rag-docs-${var.project_id}"
  location = var.region
  project  = var.project_id

  force_destroy               = var.environment != "prod"
  uniform_bucket_level_access = true

  versioning {
    enabled = true
  }

  labels = {
    environment = var.environment
    purpose     = "rag-document-source"
  }
}
```
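Because GCS bucket names are globally unique and capped at 63 characters, the interpolated name is worth sanity-checking before `terraform apply`. A small sketch mirroring the interpolation above (the helper is mine, not part of any SDK):

```python
def rag_bucket_name(environment: str, corpus_name: str, project_id: str) -> str:
    """Mirror the name interpolation in rag/gcs.tf."""
    name = f"{environment}-{corpus_name}-rag-docs-{project_id}"
    if len(name) > 63:
        raise ValueError(f"GCS bucket names are capped at 63 characters: {name!r}")
    return name

print(rag_bucket_name("dev", "company-docs", "your-project"))
# dev-company-docs-rag-docs-your-project
```

A long project ID plus a long corpus name can push the interpolation past the limit, and Terraform will only surface that at apply time.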
🔐 Step 2: Service Account and IAM
RAG Engine uses a Google-managed service agent (service-PROJECT_NUMBER@gcp-sa-vertex-rag.iam.gserviceaccount.com), but your application code needs a service account with the right permissions to create corpora and import files:
```hcl
# rag/iam.tf
data "google_project" "current" {
  project_id = var.project_id
}

resource "google_service_account" "rag_app" {
  account_id   = "${var.environment}-rag-app"
  display_name = "RAG Application Service Account"
  project      = var.project_id
}

# Permission to use Vertex AI RAG Engine APIs
resource "google_project_iam_member" "rag_user" {
  project = var.project_id
  role    = "roles/aiplatform.user"
  member  = "serviceAccount:${google_service_account.rag_app.email}"
}

# Permission to read documents from GCS
resource "google_storage_bucket_iam_member" "rag_reader" {
  bucket = google_storage_bucket.rag_documents.name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_service_account.rag_app.email}"
}

# Grant RAG service agent access to the GCS bucket
resource "google_storage_bucket_iam_member" "rag_agent_reader" {
  bucket = google_storage_bucket.rag_documents.name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:service-${data.google_project.current.number}@gcp-sa-vertex-rag.iam.gserviceaccount.com"
}
```
Critical: The RAG service agent (gcp-sa-vertex-rag) needs storage.objectViewer on your bucket. Without this, file imports silently fail.
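A related gotcha: the service agent email is keyed by the project *number*, not the project ID, which is why the Terraform above pulls `data.google_project.current.number`. The format, sketched as a one-liner (helper name is illustrative):

```python
def rag_service_agent(project_number: int) -> str:
    """Vertex AI RAG Engine service agent, derived from the project *number*."""
    return f"service-{project_number}@gcp-sa-vertex-rag.iam.gserviceaccount.com"

print(rag_service_agent(123456789012))
# service-123456789012@gcp-sa-vertex-rag.iam.gserviceaccount.com
```

If you hand-write this member string with the project ID instead of the number, the IAM binding applies cleanly but grants nothing to the real agent.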
⚙️ Step 3: RAG Engine Configuration
The google_vertex_ai_rag_engine_config resource controls the managed database tier. This is a project-level setting that affects all corpora using RagManagedDb:
```hcl
# rag/engine_config.tf
resource "google_vertex_ai_rag_engine_config" "this" {
  region  = var.region
  project = var.project_id

  rag_managed_db_config {
    dynamic "basic" {
      for_each = var.environment == "dev" ? [1] : []
      content {}
    }

    dynamic "scaled" {
      for_each = var.environment == "prod" ? [1] : []
      content {}
    }
  }
}
```
Tier options:
| Tier | Use Case | Billing |
|---|---|---|
| Basic | Development, small corpora | Lower cost, sufficient for testing |
| Scaled | Production workloads | Higher throughput, better latency |
| Unprovisioned | Cleanup only | Deletes RAG Engine data |
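The two dynamic blocks in the config above are mutually exclusive: exactly one tier block is emitted per environment. The same selection logic, mirrored in plain Python (illustrative, not an SDK call):

```python
def selected_tier(environment: str) -> str:
    """Mirror the for_each conditions in rag/engine_config.tf."""
    if environment == "dev":
        return "basic"
    if environment == "prod":
        return "scaled"
    raise ValueError(f"No tier mapped for environment {environment!r}")

print(selected_tier("dev"))   # basic
print(selected_tier("prod"))  # scaled
```

Note the gap this exposes: an environment that is neither `dev` nor `prod` (say, `staging`) would emit neither block in the Terraform above, so extend the conditions if you add more environments.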
🔧 Step 4: Variables
```hcl
# rag/variables.tf
variable "project_id" { type = string }
variable "environment" { type = string }
variable "region" { type = string }
variable "corpus_name" { type = string }

variable "embedding_model" {
  type    = string
  default = "text-embedding-005"
}

variable "chunk_size" {
  type    = number
  default = 512
}

variable "chunk_overlap" {
  type    = number
  default = 100
}
```
Per-environment configs:
```hcl
# environments/dev.tfvars
corpus_name     = "company-docs"
embedding_model = "text-embedding-005"
chunk_size      = 300
chunk_overlap   = 50
```

```hcl
# environments/prod.tfvars
corpus_name     = "company-docs"
embedding_model = "text-embedding-005"
chunk_size      = 512
chunk_overlap   = 100
```
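To build intuition for these two knobs: with a sliding window, each chunk after the first adds roughly `chunk_size - chunk_overlap` new tokens. A back-of-envelope estimate of chunk count per document (this assumes a simple token-based window and approximates, but won't exactly match, RAG Engine's chunker):

```python
import math

def estimated_chunks(doc_tokens: int, chunk_size: int, chunk_overlap: int) -> int:
    """Rough chunk count for a sliding window with overlap."""
    if doc_tokens <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap
    return 1 + math.ceil((doc_tokens - chunk_size) / stride)

# A ~10,000-token document under the dev vs prod settings above:
print(estimated_chunks(10_000, 300, 50))   # dev:  40 chunks
print(estimated_chunks(10_000, 512, 100))  # prod: 25 chunks
```

Smaller chunks with less overlap give more precise retrieval targets but produce more vectors to embed, store, and search; the dev settings trade some context per chunk for cheaper experimentation.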
🔧 Step 5: Create Corpus and Import Files (Python SDK)
Unlike AWS Bedrock, where the knowledge base and data source are first-class Terraform resources, RAG Engine corpus management is done through the Python SDK or REST API. This is the operational layer that sits on top of the Terraform infrastructure:
```python
from vertexai import rag
from vertexai.generative_models import GenerativeModel, Tool
import vertexai

vertexai.init(project="YOUR_PROJECT_ID", location="us-east4")

# Configure the embedding model
embedding_config = rag.RagEmbeddingModelConfig(
    vertex_prediction_endpoint=rag.VertexPredictionEndpoint(
        publisher_model="publishers/google/models/text-embedding-005"
    )
)

# Create the corpus
rag_corpus = rag.create_corpus(
    display_name="company-docs",
    description="Internal company documentation",
    backend_config=rag.RagVectorDbConfig(
        rag_embedding_model_config=embedding_config
    ),
)
print(f"Corpus created: {rag_corpus.name}")

# Import files from GCS
rag.import_files(
    corpus_name=rag_corpus.name,
    paths=["gs://dev-company-docs-rag-docs-your-project/"],
    transformation_config=rag.TransformationConfig(
        chunking_config=rag.ChunkingConfig(chunk_size=512, chunk_overlap=100)
    ),
)
```
Why not Terraform? Corpus and file operations are mutable, long-running, and frequently updated (new documents added regularly). They fit the operational workflow better than the infrastructure layer. Terraform manages the things that change rarely (buckets, IAM, engine config). The SDK manages the things that change often (corpus content).
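One consequence of this imperative layer: re-running a setup script will happily create a second corpus with the same display name. A small idempotency guard helps; here is a sketch with the SDK calls injected as callables so the logic is testable (the helper and `Corpus` stand-in are mine, not the SDK's; in real code you would pass `rag.list_corpora` and a wrapper around `rag.create_corpus`):

```python
from typing import Callable, Iterable

class Corpus:
    """Minimal stand-in for the SDK's corpus object."""
    def __init__(self, display_name: str):
        self.display_name = display_name

def get_or_create_corpus(
    display_name: str,
    list_corpora: Callable[[], Iterable[Corpus]],
    create_corpus: Callable[[str], Corpus],
) -> Corpus:
    """Reuse an existing corpus with this display name, else create one."""
    for corpus in list_corpora():
        if corpus.display_name == display_name:
            return corpus
    return create_corpus(display_name)

existing = Corpus("company-docs")
reused = get_or_create_corpus("company-docs", lambda: [existing], Corpus)
print(reused is existing)  # True: no duplicate created
```

Injecting the calls keeps the guard unit-testable without credentials, which suits this operational layer: it runs from CI or a notebook, not from `terraform apply`.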
🔍 Step 6: Query Your Knowledge Base
Two approaches for retrieval:
```python
# Approach 1: Retrieve contexts only
response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
    text="What is our refund policy?",
    rag_retrieval_config=rag.RagRetrievalConfig(
        top_k=5,
        filter=rag.Filter(vector_distance_threshold=0.5),
    ),
)

for context in response.contexts.contexts:
    print(f"Source: {context.source_uri}")
    print(f"Text: {context.text[:200]}")
```

```python
# Approach 2: End-to-end RAG with Gemini
rag_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
            rag_retrieval_config=rag.RagRetrievalConfig(top_k=3),
        )
    )
)

model = GenerativeModel(
    model_name="gemini-2.0-flash-001",
    tools=[rag_tool],
)
response = model.generate_content("What is our refund policy?")
print(response.text)
```
Approach 2 is the most common pattern - Gemini automatically retrieves relevant chunks and generates a grounded response in a single call.
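The retrieval knobs in both approaches interact: `vector_distance_threshold` drops chunks too far from the query, then `top_k` caps what remains. The semantics (lower distance means more similar) can be illustrated locally with pre-scored chunks (a toy sketch, not an SDK call):

```python
def select_contexts(scored_chunks, top_k, vector_distance_threshold):
    """Keep chunks within the distance threshold, closest first, capped at top_k."""
    kept = [c for c in scored_chunks if c[1] <= vector_distance_threshold]
    kept.sort(key=lambda c: c[1])
    return kept[:top_k]

scored = [
    ("refund policy ...", 0.21),
    ("holiday hours ...", 0.62),
    ("returns form ...", 0.35),
]
print(select_contexts(scored, top_k=5, vector_distance_threshold=0.5))
# [('refund policy ...', 0.21), ('returns form ...', 0.35)]
```

A threshold that is too tight can return fewer than `top_k` chunks, or none at all, so if answers suddenly become vague, check the threshold before blaming the model.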
🔌 Step 7: Enable APIs
Don't forget the API enablement in Terraform:
```hcl
# rag/apis.tf
resource "google_project_service" "vertex_ai" {
  project = var.project_id
  service = "aiplatform.googleapis.com"
}

resource "google_project_service" "storage" {
  project = var.project_id
  service = "storage.googleapis.com"
}
```
Your first GCP RAG pipeline is deployed. Documents in GCS, vectors in RagManagedDb, retrieval via Gemini - infrastructure in Terraform, operations in Python, all repeatable across environments. 🎉
Found this helpful? Follow for the full RAG Pipeline with Terraform series! 💬
Top comments (2)
Thanks for making this approachable with Terraform! One thing I'd add for production use: the infrastructure is only half the story. The other half is the quality of your document ingestion pipeline. We spent a lot of time getting the GCP infrastructure right, then discovered that our retrieval quality was mediocre because we hadn't thought carefully about preprocessing - removing boilerplate, normalizing formatting, handling code blocks differently from prose. I'd recommend setting up a small evaluation harness early - a handful of question-answer pairs from your actual documents - so you can measure whether infrastructure changes and ingestion tweaks are actually improving things.
Thanks for going through the post. You're right that this only covers the infrastructure part. I'll consider your inputs for future posts on measuring and improving the quality of the RAG pipeline we've built.