Suhas Mallesh
Vertex AI RAG Engine Advanced RAG with Terraform: Chunking, Hybrid Search, and Reranking 🧠

Basic chunking gets you a demo. Hybrid search, reranking with the Vertex AI Ranking API, metadata filtering, and tuned retrieval configs turn a RAG Engine corpus into a production system. All wired through Terraform and the Python SDK.

In RAG Post 1, we deployed a Vertex AI RAG Engine corpus with basic fixed-size chunking. It works, but retrieval quality is mediocre. Your users ask nuanced questions and get incomplete or irrelevant answers back.

The fix isn't a better generation model. It's better retrieval. RAG Engine supports chunking tuning, hybrid search with configurable alpha weighting, reranking via the Vertex AI Ranking API, metadata filtering, and vector distance thresholds. The infrastructure layer (Terraform) and the operational layer (Python SDK) each handle different parts. This post covers the production patterns that make the difference. 🎯

🧱 Chunking: The Biggest Lever You Control

RAG Engine uses fixed-size token chunking configured at file import time. Unlike AWS Bedrock (which offers semantic and hierarchical strategies as native options), GCP keeps chunking straightforward but gives you fine-grained control over size and overlap.

The key insight: chunking configuration is set per import operation, not per corpus. You can re-import the same files with different chunking to test what works best.

```python
from vertexai import rag

# Production chunking config
rag.import_files(
    corpus_name=corpus.name,
    paths=["gs://company-docs-prod/policies/"],
    transformation_config=rag.TransformationConfig(
        chunking_config=rag.ChunkingConfig(
            chunk_size=512,
            chunk_overlap=100,
        )
    ),
    max_embedding_requests_per_min=900,
)
```

Chunking Size Guide

| Document Type | Chunk Size | Overlap | Why |
|---|---|---|---|
| Short FAQs, Q&A pairs | 256 | 30 | Small chunks = precise matching |
| General docs, guides | 512 | 100 | Balanced precision and context |
| Long legal/technical docs | 1024 | 200 | Preserves cross-reference context |
| Pre-processed content | Use as-is | 0 | Already split at natural boundaries |

Tuning approach: Start with 512/100. If answers lack context, increase to 1024/200. If retrieval returns irrelevant chunks, decrease to 256/50. Re-import and compare - the corpus supports multiple imports with different configs against the same files.
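That tuning loop is mechanical enough to encode as a small helper. This is a hypothetical sketch (the function name and symptom labels are made up here); the config values mirror the guidance above:

```python
# Hypothetical helper encoding the chunking tuning loop.
# Symptom labels are illustrative; values come from the guide above.
def next_chunking_config(chunk_size: int, overlap: int, symptom: str) -> tuple:
    if symptom == "answers_lack_context":
        # Bigger chunks preserve more surrounding context
        return (1024, 200)
    if symptom == "irrelevant_chunks":
        # Smaller chunks match more precisely
        return (256, 50)
    return (chunk_size, overlap)  # No symptom: keep the current config

# Start with the recommended default, then adjust per observed symptom
config = (512, 100)
config = next_chunking_config(*config, "answers_lack_context")
```

Each adjustment maps to a fresh `rag.import_files` call with a new `ChunkingConfig`, which is exactly what per-import chunking makes cheap.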

Rate Limiting Embeddings

The max_embedding_requests_per_min parameter is critical in production. Without it, large imports can exhaust your embedding model quota and fail partway through. Set it below your project's QPM limit for the embedding model:

```hcl
# Terraform outputs feed into SDK config
# environments/prod.tfvars sets the quota boundary
embedding_qpm_rate = 900 # Leave headroom below the 1,000 QPM limit
```

πŸ” Hybrid Search

By default, RAG Engine uses pure vector (dense) search. Hybrid search combines vector similarity with keyword (sparse/token-based) matching using Reciprocal Rank Fusion (RRF). The alpha parameter controls the balance:
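Google doesn't publish the exact fusion formula, but the standard alpha-weighted RRF idea can be sketched in a few lines. This is illustrative only: `k=60` is the conventional RRF constant from the literature, not a documented RAG Engine value.

```python
def rrf_fuse(dense: list, sparse: list, alpha: float = 0.5, k: int = 60) -> list:
    """Fuse a vector ranking and a keyword ranking with alpha-weighted RRF."""
    scores = {}
    for rank, doc in enumerate(dense, start=1):   # vector (semantic) ranking
        scores[doc] = scores.get(doc, 0.0) + alpha / (k + rank)
    for rank, doc in enumerate(sparse, start=1):  # keyword (sparse) ranking
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# alpha=1.0 reproduces the pure vector order; alpha=0.0 the pure keyword order
```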

| Alpha Value | Behavior |
|---|---|
| 1.0 | Pure vector/semantic search |
| 0.5 | Equal weight (default) |
| 0.0 | Pure keyword search |

```python
from vertexai import rag

rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=10,
    filter=rag.Filter(
        vector_distance_threshold=0.3
    ),
    hybrid_search=rag.HybridSearch(
        alpha=0.6  # Slightly favor semantic, but include keyword matching
    ),
)

# Retrieve-only (no generation)
response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=corpus.name)],
    text="What is policy ABC-123 regarding overtime?",
    rag_retrieval_config=rag_retrieval_config,
)
```

When to adjust alpha: If users search for specific codes, IDs, or exact terminology (policy numbers, product SKUs, error codes), lower alpha toward 0.3-0.4 to boost keyword matching. For natural language questions about concepts, keep alpha at 0.6-0.8.
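If you can detect "code-like" queries up front, you can pick alpha per request instead of fixing one value for all traffic. A hypothetical router (the regex and the cutoff values are illustrative, not part of the SDK):

```python
import re

# Matches identifiers like ABC-123, SKU12345, ERR-404 (illustrative pattern)
_CODE_PATTERN = re.compile(r"\b[A-Z]{2,}-?\d+\b")

def choose_alpha(query: str) -> float:
    """Lower alpha (more keyword weight) when the query contains exact codes/IDs."""
    return 0.35 if _CODE_PATTERN.search(query) else 0.7
```

The result feeds straight into `rag.HybridSearch(alpha=choose_alpha(query))` at query time.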

πŸ“Š Reranking with the Vertex AI Ranking API

Retrieval returns the top-K chunks by similarity. But similarity isn't the same as relevance. Reranking re-scores those chunks using a deeper query-document understanding.

RAG Engine integrates with two reranking approaches:

Rank Service (Recommended for Production)

Uses Google's pre-trained ranking models via the Discovery Engine API. Requires enabling discoveryengine.googleapis.com:

```hcl
# rag/apis.tf

resource "google_project_service" "discovery_engine" {
  project = var.project_id
  service = "discoveryengine.googleapis.com"

  disable_dependent_services = false
  disable_on_destroy         = false
}
```

Then configure at retrieval time:

```python
rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=15,  # Retrieve wide; the rank service narrows the results
    ranking=rag.Ranking(
        rank_service=rag.RankService(
            model_name="semantic-ranker-default@latest"
        )
    ),
    hybrid_search=rag.HybridSearch(alpha=0.6),
)

response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=corpus.name)],
    text="What are the penalties for late contract delivery?",
    rag_retrieval_config=rag_retrieval_config,
)
```

Pattern: Retrieve 15 chunks with hybrid search, let the rank service re-score and return the most relevant. This "retrieve wide, rerank narrow" approach consistently outperforms retrieving 5 directly.
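One way to keep that ratio consistent across environments is to derive `top_k` from the number of chunks you actually want after reranking. This is a hypothetical convention (the helper and its `widen_factor` are not SDK features):

```python
def rerank_params(final_k: int, widen_factor: int = 3) -> dict:
    """Retrieve widen_factor x final_k candidates, keep final_k after reranking."""
    return {"top_k": widen_factor * final_k, "keep_after_rerank": final_k}

# e.g. final_k=5 -> retrieve 15 candidates, matching the config above
```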

LLM Ranker (Alternative)

Uses an LLM to re-rank results. Higher latency but can handle nuanced relevance judgments:

```python
rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=10,
    ranking=rag.Ranking(
        llm_ranker=rag.LlmRanker(
            model_name="gemini-2.0-flash"
        )
    ),
)
```

Trade-off: Rank Service is faster and cheaper. LLM Ranker handles complex, ambiguous queries better. Start with Rank Service and switch to LLM Ranker only for specific query patterns where relevance is poor.

🏷️ Metadata Filtering

Scope retrieval to specific document categories using metadata filters. Metadata is applied at query time as a filter string:

```python
rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=10,
    filter=rag.Filter(
        vector_distance_threshold=0.3,
        metadata_filter="department = 'legal' AND year >= 2024",
    ),
    hybrid_search=rag.HybridSearch(alpha=0.6),
)

response = rag.retrieval_query(
    rag_resources=[rag.RagResource(rag_corpus=corpus.name)],
    text="What changed in the refund policy?",
    rag_retrieval_config=rag_retrieval_config,
)
```

Metadata is attached during file import. For GCS-sourced files, metadata comes from the file's properties or can be set programmatically during import operations.
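Filter strings are easy to get wrong when concatenated by hand. A small builder helps; this is a hypothetical helper with simplified quoting (no escaping of quotes inside values):

```python
def build_metadata_filter(equals: dict = None, at_least: dict = None) -> str:
    """Compose a metadata filter string from equality and >= conditions."""
    clauses = []
    for key, value in (equals or {}).items():
        clauses.append(f"{key} = '{value}'")      # string equality
    for key, value in (at_least or {}).items():
        clauses.append(f"{key} >= {value}")       # numeric lower bound
    return " AND ".join(clauses)
```

`build_metadata_filter(equals={"department": "legal"}, at_least={"year": 2024})` produces the filter string used in the example above.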

πŸ“ Vector Distance Threshold

The vector_distance_threshold parameter filters out low-relevance chunks before they reach the model. Only chunks with a vector distance below the threshold are returned:

```python
# Strict filtering - only highly relevant chunks
filter=rag.Filter(vector_distance_threshold=0.3)

# Relaxed filtering - cast a wider net
filter=rag.Filter(vector_distance_threshold=0.7)
```

Tuning guide: Start with 0.5. If you get irrelevant chunks, tighten to 0.3. If too few results come back, relax to 0.7. This is especially important when using reranking - set a relaxed threshold to let more candidates through, then let the reranker sort by relevance.
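Those tuning rules reduce to a tiny decision function (an illustrative sketch; the flag names are made up, the values come from this section):

```python
def pick_distance_threshold(irrelevant_results: bool = False,
                            too_few_results: bool = False) -> float:
    """Encode the threshold tuning guide: start at 0.5, then tighten or relax."""
    if irrelevant_results:
        return 0.3  # tighten: only highly relevant chunks pass
    if too_few_results:
        return 0.7  # relax: cast a wider net
    return 0.5      # starting point
```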

πŸ“ Production Terraform + SDK Config

The infrastructure layer (Terraform) provisions APIs, GCS, IAM, and engine config. The operational layer (Python SDK) handles corpus creation, import, and retrieval tuning:

```hcl
# rag/main.tf

resource "google_project_service" "required_apis" {
  for_each = toset([
    "aiplatform.googleapis.com",
    "discoveryengine.googleapis.com",
    "storage.googleapis.com",
  ])

  project = var.project_id
  service = each.value

  disable_dependent_services = false
  disable_on_destroy         = false
}

resource "google_vertex_ai_rag_engine_config" "this" {
  region = var.region

  rag_managed_db {
    type = var.rag_db_tier
  }

  depends_on = [google_project_service.required_apis]
}

resource "google_storage_bucket" "rag_docs" {
  name     = "${var.project_id}-${var.environment}-rag-docs"
  location = var.region

  uniform_bucket_level_access = true

  lifecycle_rule {
    condition {
      age = var.doc_retention_days
    }
    action {
      type = "Delete"
    }
  }
}
```

With environment-specific variables:

```hcl
# environments/dev.tfvars
rag_db_tier        = "BASIC"
doc_retention_days = 90
embedding_qpm_rate = 500

# Retrieval config (passed to the SDK)
chunk_size         = 300
chunk_overlap      = 50
retrieval_top_k    = 5
alpha              = 0.5
distance_threshold = 0.5
reranker           = "none"

# environments/prod.tfvars
rag_db_tier        = "SCALED"
doc_retention_days = 2555
embedding_qpm_rate = 900

# Retrieval config (passed to the SDK)
chunk_size         = 512
chunk_overlap      = 100
retrieval_top_k    = 15
alpha              = 0.6
distance_threshold = 0.3
reranker           = "semantic-ranker-default@latest"
```
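The SDK side needs those retrieval values at runtime. One lightweight option is to parse the tfvars file directly. A minimal sketch, assuming flat `key = value` lines with no HCL nesting (a real setup might use Terraform outputs instead):

```python
def load_tfvars(text: str) -> dict:
    """Parse flat `key = value` tfvars lines into a dict (ints, floats, strings)."""
    config = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue
        key, _, raw = line.partition("=")
        raw = raw.strip().strip('"')
        for cast in (int, float):             # try numeric types first
            try:
                config[key.strip()] = cast(raw)
                break
            except ValueError:
                continue
        else:
            config[key.strip()] = raw         # fall back to string

    return config

prod = load_tfvars('alpha = 0.6\nretrieval_top_k = 15\nreranker = "semantic-ranker-default@latest"')
```

The resulting dict maps directly onto `rag.RagRetrievalConfig` and `rag.ChunkingConfig` keyword arguments.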

πŸ”„ Azure vs AWS vs GCP: Advanced RAG Comparison

| Feature | Azure AI Search | AWS Bedrock KB | GCP RAG Engine |
|---|---|---|---|
| Chunking | Fixed-size + Document Layout skill | Fixed, hierarchical, semantic, Lambda | Fixed-size only |
| Hybrid search | BM25 + vector via RRF (built-in) | Supported on OpenSearch | Alpha-weighted dense/sparse |
| Semantic reranking | Built-in transformer ranker (L2) | Cohere Rerank | Rank Service + LLM Ranker |
| Query decomposition | Agentic retrieval (native) | Native API parameter | Not built-in |
| Metadata filtering | Filterable index fields + OData | JSON metadata files in S3 | Filter string at query time |
| Strictness control | 1-5 scale on data source | Not built-in | Vector distance threshold |
| Reranker score range | 0-4 (calibrated, cross-query consistent) | Model-dependent | Model-dependent |

GCP's advantage is operational simplicity - the managed vector DB and per-import chunking make experimentation faster. AWS offers more built-in chunking strategies and native query decomposition.

πŸ’‘ Decision Framework

| Your Situation | Chunking | Hybrid Alpha | Reranking | Threshold |
|---|---|---|---|---|
| Getting started, mixed docs | 512 / 100 | 0.5 | None | 0.5 |
| Users search by codes/IDs | 256 / 50 | 0.3 | Rank Service | 0.5 |
| Long technical documents | 1024 / 200 | 0.7 | Rank Service | 0.3 |
| High-precision production | 512 / 100 | 0.6 | Rank Service | 0.3 |
| Complex, ambiguous queries | 512 / 100 | 0.6 | LLM Ranker | 0.5 |

Start with the "high-precision production" row as your default. Enable the Discovery Engine API, use Rank Service reranking, and tune from there.
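The framework travels well as a preset table in code. The values below are copied from the rows above; the preset names and field layout are illustrative:

```python
# Preset names and field layout are illustrative; values mirror the table above
RETRIEVAL_PRESETS = {
    "default":        {"chunk": (512, 100),  "alpha": 0.5, "reranker": None,           "threshold": 0.5},
    "codes_and_ids":  {"chunk": (256, 50),   "alpha": 0.3, "reranker": "rank_service", "threshold": 0.5},
    "long_technical": {"chunk": (1024, 200), "alpha": 0.7, "reranker": "rank_service", "threshold": 0.3},
    "high_precision": {"chunk": (512, 100),  "alpha": 0.6, "reranker": "rank_service", "threshold": 0.3},
    "ambiguous":      {"chunk": (512, 100),  "alpha": 0.6, "reranker": "llm_ranker",   "threshold": 0.5},
}

preset = RETRIEVAL_PRESETS["high_precision"]  # recommended starting point
```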

⏭️ What's Next

This is Post 2 of the GCP RAG Pipeline with Terraform series.


Your RAG pipeline just leveled up. Hybrid search for precision, Rank Service reranking for relevance, metadata filtering for scope, and vector distance thresholds for noise control - all driven by Terraform variables per environment. 🧠

Found this helpful? Follow for the full RAG Pipeline with Terraform series! πŸ’¬
