Suhas Mallesh
Azure AI Search Advanced RAG with Terraform: Hybrid Search, Semantic Ranking, and Agentic Retrieval 🧠

Vector search alone leaves relevance on the table. Hybrid search with semantic ranking, chunking strategies, metadata filtering, strictness tuning, and the new agentic retrieval pipeline turn Azure AI Search into a production RAG system. All wired through Terraform.

In RAG Post 1, we deployed Azure AI Search with a basic index and connected it to Azure OpenAI. It works, but retrieval quality is mediocre. Your users ask nuanced questions and get incomplete or irrelevant answers.

The fix isn't a better generation model. It's better retrieval. Azure AI Search has the most sophisticated built-in retrieval pipeline of the three major clouds: hybrid search combining BM25 keyword matching with vector similarity via Reciprocal Rank Fusion (RRF), a transformer-based semantic ranker for deep re-scoring, metadata filtering, strictness controls, and a new agentic retrieval mode that decomposes complex queries automatically. This post covers the production patterns. 🎯

Important note: Azure OpenAI "On Your Data" is deprecated and approaching retirement. Microsoft recommends migrating to Foundry Agent Service with Foundry IQ. The patterns in this post use direct Azure AI Search integration, which works with both the current and future architecture.

🧱 Chunking: Getting the Foundation Right

Azure AI Search supports two chunking approaches through its indexer pipeline:

Fixed-size chunking via the Text Split skill - splits by token or character count with configurable overlap. Simple, predictable, cost-effective.

Structure-aware chunking via the Document Layout skill - uses Azure Document Intelligence to recognize headers, sections, and layout elements. Preserves document structure but adds per-page processing cost.

Chunking Size Guidance

Microsoft's own benchmarking on real customer datasets provides clear guidance:

| Chunk Size | Overlap | Best For | Trade-off |
| --- | --- | --- | --- |
| 256 tokens | 25% | Short FAQs, Q&A pairs | High precision, less context |
| 512 tokens | 25% | General documents (recommended default) | Best balance of precision and context |
| 1024 tokens | 10-15% | Long technical/legal documents | More context, risk of noise |

Their data shows that 512-token chunks with 25% overlap and sentence boundary preservation consistently outperform other sizes across query types. Larger chunks actually reduce retrieval performance because each embedding must compress more semantic content into the same number of dimensions.

Key insight: Always preserve sentence boundaries when chunking. Splitting mid-sentence degrades both embedding quality and retrieval accuracy.
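The fixed-size approach is easy to reason about in plain Python. This is a minimal sketch of token-budgeted chunking with sentence boundary preservation and overlap - the function name is mine, word count stands in for real tokenization, and in Azure the Text Split skill does this work for you:

```python
def chunk_text(text, max_tokens=512, overlap_ratio=0.25):
    """Greedy chunker: pack whole sentences up to a token budget, then
    seed the next chunk with roughly overlap_ratio of trailing sentences."""
    # Naive sentence split on "."; a real pipeline uses a proper tokenizer.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())  # word count as a rough token proxy
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            # Carry trailing sentences forward as overlap for the next chunk.
            budget = int(max_tokens * overlap_ratio)
            kept, kept_len = [], 0
            for prev in reversed(current):
                w = len(prev.split())
                if kept_len + w > budget:
                    break
                kept.insert(0, prev)
                kept_len += w
            current, current_len = kept, kept_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only ever break between sentences, no embedding is ever asked to represent half a thought - which is exactly the property the benchmark data rewards.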

πŸ” The Three-Layer Retrieval Pipeline

Azure AI Search's real advantage is its layered retrieval architecture. Each layer improves result quality:

Layer 1: Hybrid Search (L1 Recall)

Hybrid search runs keyword (BM25) and vector queries in parallel, then merges results using RRF. This catches both semantic matches and exact terminology:

from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery

search_client = SearchClient(
    endpoint=search_endpoint,
    index_name="company-docs",
    credential=credential
)

vector_query = VectorizableTextQuery(
    text="What are the penalties for late delivery?",
    k_nearest_neighbors=50,
    fields="contentVector"
)

results = search_client.search(
    search_text="penalties late delivery",  # BM25 keyword query
    vector_queries=[vector_query],           # Vector query
    select=["title", "content", "source"],
    top=50
)

Why both? Vector search handles synonyms and paraphrasing. Keyword search catches product codes, policy numbers, and exact terminology that embeddings miss. RRF fusion gives you the best of both.
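RRF itself is simple enough to sketch: each document's fused score is the sum of 1/(k + rank) over every result list it appears in, with k conventionally set to 60. The doc IDs below are illustrative, and the service's internal fusion may differ in details:

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc ranked by BOTH queries beats a doc that tops only one list.
bm25   = ["doc7", "doc2", "doc9"]   # keyword (BM25) ranking
vector = ["doc4", "doc2", "doc7"]   # vector-similarity ranking
merged = rrf_merge([bm25, vector])  # -> doc7 and doc2 rise to the top
```

Note that rank position is all that matters - RRF never has to reconcile BM25 scores with cosine similarities, which is why it fuses such different retrieval signals cleanly.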

Layer 2: Semantic Ranking (L2 Reranking)

The semantic ranker is a transformer model that re-scores the top 50 results from L1 using cross-attention between query and document text. It produces a calibrated score from 0 (irrelevant) to 4 (excellent):

results = search_client.search(
    search_text="penalties late delivery",
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="default",
    top=50
)

for result in results:
    print(f"Score: {result['@search.reranker_score']:.2f} - {result['title']}")

Terraform setup - the semantic ranker requires Basic tier or higher and must be explicitly enabled:

resource "azurerm_search_service" "this" {
  name                = "${var.environment}-${var.project}-search"
  resource_group_name = azurerm_resource_group.this.name
  location            = var.location
  sku                 = var.search_sku

  semantic_search_sku = var.semantic_search_sku

  identity {
    type = "SystemAssigned"
  }
}
# environments/dev.tfvars
search_sku          = "basic"
semantic_search_sku = "free"    # Limited queries/month

# environments/prod.tfvars
search_sku          = "standard"
semantic_search_sku = "standard" # Unlimited semantic queries

Microsoft's benchmark results: Hybrid + Semantic ranking finds the best content at every result set size. Pure vector search alone misses relevant results that hybrid catches, and without semantic ranking, the best results often sit at position 7 or 8 instead of position 1.

Layer 3: Agentic Retrieval (Query Decomposition)

The newest addition to Azure AI Search. Agentic retrieval automatically decomposes complex queries into sub-queries, runs them in parallel, semantically reranks each result set, and merges into a unified response:

User: "Compare our 2024 and 2025 refund policies and highlight what changed"

Agentic retrieval decomposes into:
  β†’ Sub-query 1: "2024 refund policy terms and conditions"
  β†’ Sub-query 2: "2025 refund policy terms and conditions"
  β†’ Sub-query 3: "changes updates refund policy"

Each sub-query: hybrid search β†’ semantic rerank top 50 β†’ merge

This is available through the Knowledge Base object in the 2025-11-01-preview API. The infrastructure (search service, index, semantic config) is the same - agentic retrieval adds an orchestration layer on top.

When to use: Complex questions with multiple intents, comparative queries, or queries that span multiple document categories. Adds latency but improves answer quality by up to 40% on complex queries according to Microsoft's benchmarks.
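If you are not on the preview API yet, the same decompose-search-merge shape can be approximated in application code. A minimal sketch under stated assumptions: `run_hybrid_search` is a stand-in for the hybrid `search_client.search` call shown earlier (wired to canned data here so the flow is visible), and the sub-queries are hard-coded where the preview service would generate them with an LLM:

```python
from concurrent.futures import ThreadPoolExecutor

def run_hybrid_search(query):
    # Stand-in for the hybrid + semantic search_client.search() call.
    # Returns (doc_id, reranker_score) pairs from canned data.
    canned = {
        "2024 refund policy terms": [("policy-2024", 3.6), ("faq-12", 1.9)],
        "2025 refund policy terms": [("policy-2025", 3.8), ("faq-12", 2.0)],
    }
    return canned.get(query, [])

def agentic_retrieve(sub_queries, score_floor=2.0):
    """Run sub-queries in parallel, then merge and dedupe by best score."""
    with ThreadPoolExecutor() as pool:
        result_sets = list(pool.map(run_hybrid_search, sub_queries))
    best = {}
    for results in result_sets:
        for doc_id, score in results:
            best[doc_id] = max(best.get(doc_id, 0.0), score)
    # Keep only chunks the reranker scored confidently (0-4 scale).
    return sorted(
        ((d, s) for d, s in best.items() if s >= score_floor),
        key=lambda x: x[1],
        reverse=True,
    )

merged = agentic_retrieve(["2024 refund policy terms", "2025 refund policy terms"])
```

The deduplication step matters: overlapping sub-queries often surface the same chunk, and keeping its best reranker score prevents it from crowding out distinct evidence in the prompt.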

🏷️ Metadata Filtering

Scope retrieval to specific document categories using filterable fields in your index. Filters execute before vector search, narrowing the candidate set:

results = search_client.search(
    search_text="refund policy changes",
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="default",
    filter="department eq 'legal' and year ge 2024",
    top=20
)

Terraform consideration: Filter fields must be defined as filterable: true in your index schema. Index schema is typically managed through the Portal, SDK, or REST API (not Terraform), but the search service and its SKU/capabilities are Terraform-managed.
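As a sketch of what that schema involves, here is the relevant fragment of an index definition as you would PUT it to the indexes REST API. The field names match the `$filter` expression above; the vector and semantic-configuration sections are omitted for brevity:

```python
# Fragment of an index definition for the Azure AI Search indexes REST API
# (PUT https://<service>.search.windows.net/indexes/company-docs).
# Only fields used in $filter expressions need "filterable": True.
index_schema = {
    "name": "company-docs",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},
        {"name": "department", "type": "Edm.String",
         "filterable": True, "facetable": True},
        {"name": "year", "type": "Edm.Int32",
         "filterable": True, "sortable": True},
    ],
}
```

Filterable attributes cannot be toggled on an existing field - plan them up front, because changing them means rebuilding the index.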

πŸ“ Strictness and top_n_documents

When using Azure OpenAI's data source integration, two parameters control retrieval quality:

strictness (1-5): Controls how aggressively irrelevant chunks are filtered out. Higher values filter more aggressively:

| Strictness | Behavior | Use Case |
| --- | --- | --- |
| 1-2 | Lenient - includes borderline results | Exploratory questions, broad topics |
| 3 | Balanced (default) | General purpose |
| 4-5 | Strict - only highly relevant results | Precise factual lookups, compliance |

top_n_documents (1-20): How many chunks to include in the LLM prompt after filtering and reranking. More documents = more context but higher token cost:

completion = client.chat.completions.create(
    model=deployment,
    messages=[{"role": "user", "content": "What changed in the refund policy?"}],
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": search_endpoint,
                "index_name": "company-docs",
                "query_type": "vector_semantic_hybrid",
                "semantic_configuration": "default",
                "strictness": 4,
                "top_n_documents": 5,
                "authentication": {
                    "type": "system_assigned_managed_identity"
                }
            }
        }]
    }
)

Tuning guide: If the model says "I don't have enough information," reduce strictness or increase top_n_documents. If answers include irrelevant context, increase strictness or decrease top_n_documents.

πŸ“ Production Terraform Configuration

# rag/main.tf

resource "azurerm_search_service" "this" {
  name                = "${var.environment}-${var.project}-search"
  resource_group_name = azurerm_resource_group.this.name
  location            = var.location
  sku                 = var.search_sku

  semantic_search_sku = var.semantic_search_sku
  replica_count       = var.search_replicas
  partition_count     = var.search_partitions

  identity {
    type = "SystemAssigned"
  }

  tags = var.tags
}

# Embedding model deployment
resource "azurerm_cognitive_deployment" "embedding" {
  name                 = "text-embedding-3-small"
  cognitive_account_id = azurerm_cognitive_account.openai.id

  model {
    format  = "OpenAI"
    name    = "text-embedding-3-small"
    version = "1"
  }

  sku {
    name     = "Standard"
    capacity = var.embedding_capacity
  }
}
# environments/dev.tfvars
search_sku          = "basic"
semantic_search_sku = "free"
search_replicas     = 1
search_partitions   = 1
embedding_capacity  = 30
strictness          = 3
top_n_documents     = 5

# environments/prod.tfvars
search_sku          = "standard"
semantic_search_sku = "standard"
search_replicas     = 2
search_partitions   = 1
embedding_capacity  = 120
strictness          = 4
top_n_documents     = 5

πŸ”„ Azure vs AWS vs GCP: Advanced RAG Comparison

| Feature | Azure AI Search | AWS Bedrock KB | GCP RAG Engine |
| --- | --- | --- | --- |
| Chunking | Fixed-size + Document Layout skill | Fixed, hierarchical, semantic, Lambda | Fixed-size only |
| Hybrid search | BM25 + vector via RRF (built-in) | Supported on OpenSearch | Alpha-weighted dense/sparse |
| Semantic reranking | Built-in transformer ranker (L2) | Cohere Rerank | Rank Service + LLM Ranker |
| Query decomposition | Agentic retrieval (native) | Native API parameter | Not built-in |
| Metadata filtering | Filterable index fields + OData | JSON metadata files in S3 | Filter string at query time |
| Strictness control | 1-5 scale on data source | Not built-in | Vector distance threshold |
| Reranker score range | 0-4 (calibrated, cross-query consistent) | Model-dependent | Model-dependent |

Azure's advantage is the most mature retrieval pipeline - three layers (hybrid, semantic ranking, agentic) that compose together. The semantic ranker's calibrated scoring also enables consistent quality thresholds across different indexes and query patterns.

πŸ’‘ Decision Framework

| Your Situation | Query Type | Semantic Ranker | Strictness | top_n |
| --- | --- | --- | --- | --- |
| Getting started | vector_simple_hybrid | Free tier | 3 | 5 |
| Production general | vector_semantic_hybrid | Standard | 3 | 5 |
| Precise factual lookup | vector_semantic_hybrid | Standard | 4-5 | 3 |
| Broad research queries | vector_semantic_hybrid | Standard | 2 | 10 |
| Complex multi-part questions | Agentic retrieval | Standard | 3 | 5 |

Start with vector_semantic_hybrid on Standard tier. It's the recommended default from Microsoft's own benchmarking. Add agentic retrieval for complex query patterns.

⏭️ What's Next

This is Post 2 of the Azure RAG Pipeline with Terraform series.


Your RAG pipeline now has the full Azure AI Search arsenal. Hybrid search for recall, semantic ranking for precision, agentic retrieval for complex queries, metadata filtering for scope, and strictness tuning for noise control - all driven by Terraform variables per environment. 🧠

Found this helpful? Follow for the full RAG Pipeline with Terraform series! πŸ’¬
