Suhas Mallesh
Azure AI Search Advanced RAG with Terraform: Hybrid Search, Semantic Ranking, and Agentic Retrieval 🧠

Vector search alone leaves relevance on the table. Hybrid search with semantic ranking, chunking strategies, metadata filtering, strictness tuning, and the new agentic retrieval pipeline turn Azure AI Search into a production RAG system. All wired through Terraform.

In RAG Post 1, we deployed Azure AI Search with a basic index and connected it to Azure OpenAI. It works, but retrieval quality is mediocre. Your users ask nuanced questions and get incomplete or irrelevant answers.

The fix isn't a better generation model. It's better retrieval. Azure AI Search has the most sophisticated built-in retrieval pipeline of the three major clouds: hybrid search combining BM25 keyword matching with vector similarity via Reciprocal Rank Fusion (RRF), a transformer-based semantic ranker for deep re-scoring, metadata filtering, strictness controls, and a new agentic retrieval mode that decomposes complex queries automatically. This post covers the production patterns. 🎯

Important note: Azure OpenAI "On Your Data" is deprecated and approaching retirement. Microsoft recommends migrating to Foundry Agent Service with Foundry IQ. The patterns in this post use direct Azure AI Search integration, which works with both the current and future architecture.

🧱 Chunking: Getting the Foundation Right

Azure AI Search supports two chunking approaches through its indexer pipeline:

Fixed-size chunking via the Text Split skill - splits by token or character count with configurable overlap. Simple, predictable, cost-effective.

Structure-aware chunking via the Document Layout skill - uses Azure Document Intelligence to recognize headers, sections, and layout elements. Preserves document structure but adds per-page processing cost.

Chunking Size Guidance

Microsoft's own benchmarking on real customer datasets provides clear guidance:

| Chunk Size | Overlap | Best For | Trade-off |
| --- | --- | --- | --- |
| 256 tokens | 25% | Short FAQs, Q&A pairs | High precision, less context |
| 512 tokens | 25% | General documents (recommended default) | Best balance of precision and context |
| 1024 tokens | 10-15% | Long technical/legal documents | More context, risk of noise |

Their data shows that 512-token chunks with 25% overlap and sentence boundary preservation consistently outperform other sizes across query types. Larger chunks actually reduce retrieval performance because each embedding must compress more semantic content into the same number of dimensions.

Key insight: Always preserve sentence boundaries when chunking. Splitting mid-sentence degrades both embedding quality and retrieval accuracy.
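The fixed-size approach is easy to reason about in plain Python. This is a minimal sketch of token-budgeted chunking with sentence boundary preservation and overlap - the function name is mine, word count stands in for real tokenization, and in Azure the Text Split skill does this work for you:

```python
def chunk_text(text, max_tokens=512, overlap_ratio=0.25):
    """Greedy chunker: pack whole sentences up to a token budget, then
    seed the next chunk with roughly overlap_ratio of trailing sentences."""
    # Naive sentence split on "."; a real pipeline uses a proper tokenizer.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # word count as a rough token proxy
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as overlap for the next chunk.
            budget = int(max_tokens * overlap_ratio)
            kept, kept_len = [], 0
            for prev in reversed(current):
                w = len(prev.split())
                if kept_len + w > budget:
                    break
                kept.insert(0, prev)
                kept_len += w
            current, current_len = kept, kept_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only ever break between sentences, no embedding is ever asked to represent half a thought - which is exactly the property the benchmark data rewards.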

πŸ” The Three-Layer Retrieval Pipeline

Azure AI Search's real advantage is its layered retrieval architecture. Each layer improves result quality:

Layer 1: Hybrid Search (L1 Recall)

Hybrid search runs keyword (BM25) and vector queries in parallel, then merges results using RRF. This catches both semantic matches and exact terminology:

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

search_client = SearchClient(
    endpoint=search_endpoint,
    index_name="company-docs",
    credential=credential
)

vector_query = VectorizableTextQuery(
    text="What are the penalties for late delivery?",
    k_nearest_neighbors=50,
    fields="contentVector"
)

results = search_client.search(
    search_text="penalties late delivery",  # BM25 keyword query
    vector_queries=[vector_query],           # Vector query
    select=["title", "content", "source"],
    top=50
)

Why both? Vector search handles synonyms and paraphrasing. Keyword search catches product codes, policy numbers, and exact terminology that embeddings miss. RRF fusion gives you the best of both.
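RRF itself is simple enough to sketch: each document's fused score is the sum of 1/(k + rank) over every result list it appears in, with k conventionally set to 60. The doc IDs below are illustrative, and the service's internal fusion may differ in details:

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked by BOTH queries beats a doc that tops only one list.
bm25   = ["doc7", "doc2", "doc9"]   # keyword (BM25) ranking
vector = ["doc4", "doc2", "doc7"]   # vector-similarity ranking
merged = rrf_merge([bm25, vector])  # -> doc7 and doc2 rise to the top
```

Note that rank position is all that matters - RRF never has to reconcile BM25 scores with cosine similarities, which is why it fuses such different retrieval signals cleanly.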

Layer 2: Semantic Ranking (L2 Reranking)

The semantic ranker is a transformer model that re-scores the top 50 results from L1 using cross-attention between query and document text. It produces a calibrated score from 0 (irrelevant) to 4 (excellent):

results = search_client.search(
    search_text="penalties late delivery",
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="default",
    top=50
)

for result in results:
    print(f"Score: {result['@search.reranker_score']:.2f} - {result['title']}")

Terraform setup - the semantic ranker requires Basic tier or higher and must be explicitly enabled:

resource "azurerm_search_service" "this" {
  name                = "${var.environment}-${var.project}-search"
  resource_group_name = azurerm_resource_group.this.name
  location            = var.location
  sku                 = var.search_sku

  semantic_search_sku = var.semantic_search_sku

  identity {
    type = "SystemAssigned"
  }
}
# environments/dev.tfvars
search_sku          = "basic"
semantic_search_sku = "free"    # Limited queries/month

# environments/prod.tfvars
search_sku          = "standard"
semantic_search_sku = "standard" # Unlimited semantic queries

Microsoft's benchmark results: Hybrid + Semantic ranking finds the best content at every result set size. Pure vector search alone misses relevant results that hybrid catches, and without semantic ranking, the best results often sit at position 7 or 8 instead of position 1.

Layer 3: Agentic Retrieval (Query Decomposition)

The newest addition to Azure AI Search. Agentic retrieval automatically decomposes complex queries into sub-queries, runs them in parallel, semantically reranks each result set, and merges into a unified response:

User: "Compare our 2024 and 2025 refund policies and highlight what changed"

Agentic retrieval decomposes into:
  β†’ Sub-query 1: "2024 refund policy terms and conditions"
  β†’ Sub-query 2: "2025 refund policy terms and conditions"
  β†’ Sub-query 3: "changes updates refund policy"

Each sub-query: hybrid search β†’ semantic rerank top 50 β†’ merge

This is available through the Knowledge Base object in the 2025-11-01-preview API. The infrastructure (search service, index, semantic config) is the same - agentic retrieval adds an orchestration layer on top.

When to use: Complex questions with multiple intents, comparative queries, or queries that span multiple document categories. Adds latency but improves answer quality by up to 40% on complex queries according to Microsoft's benchmarks.
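If you are not on the preview API yet, the same decompose-search-merge shape can be approximated in application code. A minimal sketch under stated assumptions: `run_hybrid_search` is a stand-in for the hybrid `search_client.search` call shown earlier (wired to canned data here so the flow is visible), and the sub-queries are hard-coded where the preview service would generate them with an LLM:

```python
from concurrent.futures import ThreadPoolExecutor

def run_hybrid_search(query):
    # Stand-in for the hybrid + semantic search_client.search() call.
    # Returns (doc_id, reranker_score) pairs from canned data.
    canned = {
        "2024 refund policy terms": [("policy-2024", 3.6), ("faq-12", 1.9)],
        "2025 refund policy terms": [("policy-2025", 3.8), ("faq-12", 2.0)],
    }
    return canned.get(query, [])

def agentic_retrieve(sub_queries, score_floor=2.0):
    """Run sub-queries in parallel, then merge and dedupe by best score."""
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(run_hybrid_search, sub_queries))
    best = {}
    for results in result_sets:
        for doc_id, score in results:
            best[doc_id] = max(best.get(doc_id, 0.0), score)
    # Keep only chunks the reranker scored confidently (0-4 scale).
    return sorted(
        ((d, s) for d, s in best.items() if s >= score_floor),
        key=lambda x: x[1],
        reverse=True,
    )

merged = agentic_retrieve(["2024 refund policy terms", "2025 refund policy terms"])
```

The deduplication step matters: overlapping sub-queries often surface the same chunk, and keeping its best reranker score prevents it from crowding out distinct evidence in the prompt.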

🏷️ Metadata Filtering

Scope retrieval to specific document categories using filterable fields in your index. Filters execute before vector search, narrowing the candidate set:

results = search_client.search(
    search_text="refund policy changes",
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="default",
    filter="department eq 'legal' and year ge 2024",
    top=20
)

Terraform consideration: Filter fields must be defined as filterable: true in your index schema. Index schema is typically managed through the Portal, SDK, or REST API (not Terraform), but the search service and its SKU/capabilities are Terraform-managed.
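As a sketch of what that schema involves, here is the relevant fragment of an index definition as you would PUT it to the indexes REST API. The field names match the `$filter` expression above; the vector and semantic-configuration sections are omitted for brevity:

```python
# Fragment of an index definition for the Azure AI Search indexes REST API
# (PUT https://<service>.search.windows.net/indexes/company-docs).
# Only fields used in $filter expressions need "filterable": True.
index_schema = {
    "name": "company-docs",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "department", "type": "Edm.String",
         "filterable": True, "facetable": True},
        {"name": "year", "type": "Edm.Int32",
         "filterable": True, "sortable": True},
    ],
}
```

Filterable attributes cannot be toggled on an existing field - plan them up front, because changing them means rebuilding the index.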

πŸ“ Strictness and top_n_documents

When using Azure OpenAI's data source integration, two parameters control retrieval quality:

strictness (1-5): Controls how aggressively irrelevant chunks are filtered out. Higher values filter more aggressively:

| Strictness | Behavior | Use Case |
| --- | --- | --- |
| 1-2 | Lenient - includes borderline results | Exploratory questions, broad topics |
| 3 | Balanced (default) | General purpose |
| 4-5 | Strict - only highly relevant results | Precise factual lookups, compliance |

top_n_documents (1-20): How many chunks to include in the LLM prompt after filtering and reranking. More documents = more context but higher token cost:

completion = client.chat.completions.create(
    model=deployment,
    messages=[{"role": "user", "content": "What changed in the refund policy?"}],
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": search_endpoint,
                "index_name": "company-docs",
                "query_type": "vector_semantic_hybrid",
                "semantic_configuration": "default",
                "strictness": 4,
                "top_n_documents": 5,
                "authentication": {
                    "type": "system_assigned_managed_identity"
                }
            }
        }]
    }
)

Tuning guide: If the model says "I don't have enough information," reduce strictness or increase top_n_documents. If answers include irrelevant context, increase strictness or decrease top_n_documents.

πŸ“ Production Terraform Configuration

# rag/main.tf

resource "azurerm_search_service" "this" {
  name                = "${var.environment}-${var.project}-search"
  resource_group_name = azurerm_resource_group.this.name
  location            = var.location
  sku                 = var.search_sku

  semantic_search_sku = var.semantic_search_sku
  replica_count       = var.search_replicas
  partition_count     = var.search_partitions

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags
}

# Embedding model deployment
resource "azurerm_cognitive_deployment" "embedding" {
  name                 = "text-embedding-3-small"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "text-embedding-3-small"
    version = "1"
  }

  sku {
    name     = "Standard"
    capacity = var.embedding_capacity
  }
}
# environments/dev.tfvars
search_sku          = "basic"
semantic_search_sku = "free"
search_replicas     = 1
search_partitions   = 1
embedding_capacity  = 30
strictness          = 3
top_n_documents     = 5

# environments/prod.tfvars
search_sku          = "standard"
semantic_search_sku = "standard"
search_replicas     = 2
search_partitions   = 1
embedding_capacity  = 120
strictness          = 4
top_n_documents     = 5

πŸ”„ Azure vs AWS vs GCP: Advanced RAG Comparison

| Feature | Azure AI Search | AWS Bedrock KB | GCP RAG Engine |
| --- | --- | --- | --- |
| Chunking | Fixed-size + Document Layout skill | Fixed, hierarchical, semantic, Lambda | Fixed-size only |
| Hybrid search | BM25 + vector via RRF (built-in) | Supported on OpenSearch | Alpha-weighted dense/sparse |
| Semantic reranking | Built-in transformer ranker (L2) | Cohere Rerank | Rank Service + LLM Ranker |
| Query decomposition | Agentic retrieval (native) | Native API parameter | Not built-in |
| Metadata filtering | Filterable index fields + OData | JSON metadata files in S3 | Filter string at query time |
| Strictness control | 1-5 scale on data source | Not built-in | Vector distance threshold |
| Reranker score range | 0-4 (calibrated, cross-query consistent) | Model-dependent | Model-dependent |

Azure's advantage is the most mature retrieval pipeline - three layers (hybrid, semantic ranking, agentic) that compose together. The semantic ranker's calibrated scoring also enables consistent quality thresholds across different indexes and query patterns.

πŸ’‘ Decision Framework

| Your Situation | Query Type | Semantic Ranker | Strictness | top_n |
| --- | --- | --- | --- | --- |
| Getting started | vector_simple_hybrid | Free tier | 3 | 5 |
| Production general | vector_semantic_hybrid | Standard | 3 | 5 |
| Precise factual lookup | vector_semantic_hybrid | Standard | 4-5 | 3 |
| Broad research queries | vector_semantic_hybrid | Standard | 2 | 10 |
| Complex multi-part questions | Agentic retrieval | Standard | 3 | 5 |

Start with vector_semantic_hybrid on Standard tier. It's the recommended default from Microsoft's own benchmarking. Add agentic retrieval for complex query patterns.

⏭️ What's Next

This is Post 2 of the Azure RAG Pipeline with Terraform series.


Your RAG pipeline now has the full Azure AI Search arsenal. Hybrid search for recall, semantic ranking for precision, agentic retrieval for complex queries, metadata filtering for scope, and strictness tuning for noise control - all driven by Terraform variables per environment. 🧠

Found this helpful? Follow for the full RAG Pipeline with Terraform series! πŸ’¬
