Suhas Mallesh
Cosmos DB Vector Search for RAG: NoSQL-Native DiskANN on Azure with Terraform πŸ”Ž

Azure AI Search is powerful but adds a separate service to manage. Cosmos DB for NoSQL has built-in vector search powered by Microsoft Research's DiskANN - store vectors and data together with sub-20ms latency. Here's the Terraform setup for RAG workloads.

In RAG Post 1, we deployed an Azure AI Search index as the vector store for our RAG pipeline. AI Search gives you a full retrieval stack with semantic ranking, hybrid search, and query decomposition. But it's a separate service with its own pricing tier, and for many RAG workloads, that's more infrastructure than you need.

Azure Cosmos DB for NoSQL now includes built-in vector search powered by DiskANN, a high-performance vector indexing algorithm developed by Microsoft Research. You store vectors alongside your documents in the same database, query them with the familiar NoSQL SQL syntax, and get sub-20ms latency at scale. If you already have data in Cosmos DB or want a single database for both operational and vector workloads, this eliminates an entire service from your architecture. 🎯

πŸ” Why Cosmos DB for RAG?

AI Search is the right choice when you need the full three-layer retrieval pipeline (hybrid search, semantic ranking, agentic retrieval). But many RAG use cases don't need all of that:

| Capability | Azure AI Search | Cosmos DB for NoSQL |
| --- | --- | --- |
| Vector search | βœ… Separate index | βœ… Same container as data |
| Hybrid search | βœ… BM25 + vector via RRF | βœ… FullTextScore + VectorDistance via RRF |
| Semantic ranking | βœ… Cross-encoder reranker | ❌ (preview: semantic reranking) |
| Query decomposition | βœ… Agentic retrieval | ❌ Application-level |
| Index type | HNSW | DiskANN, quantizedFlat, flat |
| Latency | 50-200ms | < 20ms (DiskANN at 10M vectors) |
| Pricing model | Per-tier (Basic/Standard) | Pay-per-RU (serverless or provisioned) |
| Min cost (dev) | ~$75/month (Basic) | ~$0 (serverless, pay per request) |

The trade-off: Cosmos DB gives you lower latency and true pay-per-use pricing, but you lose AI Search's built-in semantic ranker and agentic retrieval. For straightforward RAG against your own documents, Cosmos DB's native vector search is often enough.

πŸ”§ Terraform Setup

Cosmos DB Account with Vector Search

# cosmosdb/main.tf

resource "azurerm_cosmosdb_account" "this" {
  name                = "${var.environment}-${var.project}-cosmos"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
  offer_type          = "Standard"
  kind                = "GlobalDocumentDB"

  capabilities {
    name = "EnableNoSQLVectorSearch"
  }

  capabilities {
    name = "EnableServerless"
  }

  consistency_policy {
    consistency_level = "Session"
  }

  geo_location {
    location          = var.location
    failover_priority = 0
  }
}

The EnableNoSQLVectorSearch capability activates DiskANN-based vector indexing. The EnableServerless capability sets pay-per-request pricing with no minimum cost - ideal for dev/test RAG workloads.

Database and Container

# cosmosdb/database.tf

resource "azurerm_cosmosdb_sql_database" "rag" {
  name                = "rag-database"
  resource_group_name = azurerm_resource_group.this.name
  account_name        = azurerm_cosmosdb_account.this.name
}

resource "azurerm_cosmosdb_sql_container" "chunks" {
  name                = "rag-chunks"
  resource_group_name = azurerm_resource_group.this.name
  account_name        = azurerm_cosmosdb_account.this.name
  database_name       = azurerm_cosmosdb_sql_database.rag.name
  partition_key_paths = ["/department"]

  indexing_policy {
    indexing_mode = "consistent"

    included_path {
      path = "/*"
    }

    excluded_path {
      path = "/contentVector/*"
    }
  }
}

Important: Exclude vector paths from the general index (/contentVector/*). Vectors get their own specialized DiskANN index - including them in the general index wastes RUs and storage.

Vector and Full-Text Policies via AzAPI

Terraform's azurerm provider doesn't yet support vector embedding policies or vector indexes directly on the container resource. Use the azapi provider to set these policies:

# cosmosdb/vector_policy.tf

resource "azapi_update_resource" "vector_policy" {
  type      = "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2024-05-15"
  parent_id = azurerm_cosmosdb_sql_container.chunks.id

  body = jsonencode({
    properties = {
      resource = {
        vectorEmbeddingPolicy = {
          vectorEmbeddings = [{
            path             = "/contentVector"
            dataType         = "float32"
            distanceFunction = "cosine"
            dimensions       = var.embedding_dimensions
          }]
        }
        indexingPolicy = {
          indexingMode = "consistent"
          automatic    = true
          includedPaths = [{ path = "/*" }]
          excludedPaths = [
            { path = "/contentVector/*" },
            { path = "/_etag/?" }
          ]
          vectorIndexes = [{
            path = "/contentVector"
            type = var.vector_index_type
          }]
          fullTextIndexes = [{
            path = "/content"
          }]
        }
        fullTextPolicy = {
          defaultLanguage = "en-US"
          fullTextPaths   = [{
            path     = "/content"
            language = "en-US"
          }]
        }
      }
    }
  })
}

This configures three things at once: the vector embedding policy (dimensions, distance function, data type), the DiskANN vector index, and the full-text index for hybrid search.
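The same policies can also be expressed from application code. Below is a hedged sketch of the policy dictionaries as plain Python, mirroring the azapi body above (the shapes follow Cosmos DB's REST schema; the 1536 dimension count assumes text-embedding-3-small, and the commented `create_container` keyword names are assumptions based on recent azure-cosmos SDK versions):

```python
# Policy dictionaries mirroring the Terraform/azapi configuration above.
vector_embedding_policy = {
    "vectorEmbeddings": [{
        "path": "/contentVector",
        "dataType": "float32",
        "distanceFunction": "cosine",
        "dimensions": 1536,  # assumes text-embedding-3-small
    }]
}

indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    # Keep vectors out of the general index; they get a dedicated index below.
    "excludedPaths": [{"path": "/contentVector/*"}, {"path": "/_etag/?"}],
    "vectorIndexes": [{"path": "/contentVector", "type": "diskANN"}],
}

# With the azure-cosmos Python SDK (keyword names are an assumption; verify
# against your installed SDK version):
# database.create_container(
#     id="rag-chunks",
#     partition_key=PartitionKey(path="/department"),
#     indexing_policy=indexing_policy,
#     vector_embedding_policy=vector_embedding_policy,
# )
```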

πŸ“„ Document Structure

Each document in your container stores the chunk content, metadata, and its vector embedding together:

{
  "id": "chunk-001",
  "sourceUri": "policies/returns-2024.pdf",
  "title": "Return Policy - Section 3",
  "content": "Customers may return items within 30 days...",
  "department": "legal",
  "year": 2024,
  "contentVector": [0.023, -0.041, 0.089, ...]
}

No separate vector store, no sync pipeline, no metadata mapping. The vector lives alongside the data it represents.
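At ingestion time, assembling such a document is a one-step dictionary build; a minimal sketch (the helper name and argument order are illustrative, not from an SDK):

```python
def make_chunk_doc(chunk_id, source_uri, title, content, department, year, embedding):
    """Assemble a Cosmos DB chunk document with its embedding stored inline."""
    return {
        "id": chunk_id,
        "sourceUri": source_uri,
        "title": title,
        "content": content,
        "department": department,    # partition key value
        "year": year,
        "contentVector": embedding,  # must match the embedding policy's dimensions
    }

doc = make_chunk_doc(
    "chunk-001", "policies/returns-2024.pdf", "Return Policy - Section 3",
    "Customers may return items within 30 days...", "legal", 2024,
    [0.023, -0.041, 0.089],
)
```

Writing it is then a single `upsert_item(doc)` call on the container client; no second write to a separate vector store.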

πŸ” Vector Search: Three Patterns

Pattern 1: Pure Vector Search

SELECT TOP 10 c.title, c.content, c.sourceUri,
    VectorDistance(c.contentVector, @queryVector) AS score
FROM c
ORDER BY VectorDistance(c.contentVector, @queryVector)

VectorDistance scores each document against the query vector using the distance function declared in the embedding policy (cosine here), served by the DiskANN index. Always use TOP N so the query stays on the approximate index instead of scanning every vector.
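For the cosine function, the score is just a normalized dot product; a pure-Python reference for intuition (illustrative, not the server implementation; for cosine, distance = 1 - similarity):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: normalized dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```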

Pattern 2: Filtered Vector Search

Combine standard WHERE clauses with vector search:

SELECT TOP 10 c.title, c.content, c.sourceUri,
    VectorDistance(c.contentVector, @queryVector) AS score
FROM c
WHERE c.department = "legal" AND c.year >= 2024
ORDER BY VectorDistance(c.contentVector, @queryVector)

Cosmos DB's Sharded DiskANN optimizes filtered vector search by partitioning indexes per shard key. If you frequently filter by department, use it as the partition key for best performance.

Pattern 3: Hybrid Search (Vector + Full-Text)

Combine vector similarity with BM25 keyword scoring using Reciprocal Rank Fusion:

SELECT TOP 10 c.title, c.content, c.sourceUri
FROM c
ORDER BY RANK RRF(
    VectorDistance(c.contentVector, @queryVector),
    FullTextScore(c.content, ["return", "policy", "30 days"])
)

RRF merges the rankings from both search methods, catching results that are semantically similar and/or contain exact keywords. This is the same fusion approach AI Search uses internally.
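RRF itself is simple enough to sketch: each document's fused score is the sum of 1/(k + rank) over the input rankings, so a document ranked well by either method surfaces near the top. A minimal sketch (k=60 is the conventional constant from the RRF literature; Cosmos DB's internal parameters aren't documented here):

```python
def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids; returns ids sorted by fused RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc-3", "doc-1", "doc-7"]   # ranked by VectorDistance
keyword_hits = ["doc-1", "doc-9", "doc-3"]   # ranked by FullTextScore
print(rrf([vector_hits, keyword_hits]))
# -> ['doc-1', 'doc-3', 'doc-9', 'doc-7']
```

doc-1 wins because it ranks highly in both lists, even though neither method ranked it first.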

⚑ DiskANN Index Types

| Index type | Accuracy | Latency | RU cost | Best for |
| --- | --- | --- | --- | --- |
| flat | 100% (exact) | Higher | Higher | < 1,000 vectors, testing |
| quantizedFlat | ~95% | Medium | Medium | 1K-100K vectors |
| diskANN | ~95% | Lowest | Lowest | 100K+ vectors, production |

Note: Both quantizedFlat and diskANN require at least 1,000 vectors for accurate quantization. Below that threshold, Cosmos DB falls back to a full scan with higher RU charges.
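The table and the 1,000-vector threshold reduce to a simple selection heuristic, sketched below (a rule of thumb distilled from the table above, not an official recommendation):

```python
def pick_vector_index(n_vectors: int) -> str:
    """Heuristic index choice by corpus size (thresholds from the table above)."""
    if n_vectors < 1_000:
        return "flat"           # exact search; quantization needs >= 1,000 vectors
    if n_vectors < 100_000:
        return "quantizedFlat"
    return "diskANN"

print(pick_vector_index(500))        # -> flat
print(pick_vector_index(50_000))     # -> quantizedFlat
print(pick_vector_index(2_000_000))  # -> diskANN
```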

DiskANN tuning parameters:

  • quantizationByteSize (1-512): Higher = better accuracy, higher RU cost
  • indexingSearchListSize (10-500): Higher = better accuracy, slower index builds

For most RAG workloads, the defaults work well. Only tune if you measure recall issues.

πŸ”— Integrating with Azure OpenAI for RAG

from azure.cosmos import CosmosClient
from openai import AzureOpenAI

# Generate query embedding
openai_client = AzureOpenAI(...)
query_embedding = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=user_query
).data[0].embedding

# Vector search in Cosmos DB
cosmos_client = CosmosClient(endpoint, credential)
container = cosmos_client.get_database_client("rag-database") \
    .get_container_client("rag-chunks")

results = container.query_items(
    query="""
        SELECT TOP 5 c.title, c.content, c.sourceUri
        FROM c
        ORDER BY VectorDistance(c.contentVector, @vector)
    """,
    parameters=[{"name": "@vector", "value": query_embedding}],
    enable_cross_partition_query=True
)

chunks = [item["content"] for item in results]

# Generate response with Azure OpenAI
context = "\n".join(chunks)
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": user_query}
    ]
)

πŸ“ Environment-Specific Configuration

# environments/dev.tfvars
cosmos_serverless      = true    # Pay per request, no minimum
vector_index_type      = "quantizedFlat"  # Good for < 100K vectors
embedding_dimensions   = 1536

# environments/prod.tfvars
cosmos_serverless      = false   # Provisioned throughput
vector_index_type      = "diskANN"        # Production scale
embedding_dimensions   = 1536
provisioned_throughput = 4000    # RU/s, autoscale to 40,000

Cost comparison: Serverless Cosmos DB charges per RU consumed with no minimum. A typical RAG query costs 10-50 RUs depending on vector dimensions and result count. At 10,000 queries/month, that's roughly $0.50-$2.50 total - compared to ~$75/month for AI Search Basic tier.
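The arithmetic behind that estimate is worth making explicit, with the per-million-RU serverless rate as a parameter you plug in (the rate below is back-solved from the figures quoted above and is illustrative; check current Azure pricing for your region):

```python
def serverless_query_cost(queries_per_month, ru_per_query, usd_per_million_ru):
    """Monthly serverless query cost; every input is an assumption to plug in."""
    total_ru = queries_per_month * ru_per_query
    return total_ru / 1_000_000 * usd_per_million_ru

# 10,000 queries/month at 10-50 RU each, at an assumed rate per million RU:
RATE = 5.0  # illustrative USD per million RU; verify against current pricing
low  = serverless_query_cost(10_000, 10, RATE)
high = serverless_query_cost(10_000, 50, RATE)
print(f"${low:.2f} - ${high:.2f} per month")  # -> $0.50 - $2.50 per month
```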

πŸ’‘ When to Choose Cosmos DB Vector Search

Pick Cosmos DB when you need:

  • Zero minimum cost for dev/test with serverless mode
  • Sub-20ms vector queries at production scale
  • Data and vectors together in the same document
  • Hybrid search with vector + full-text + filters in one query
  • Global distribution with multi-region replication

Stick with AI Search when you need the semantic ranker, agentic retrieval, or the full three-layer pipeline from our Advanced RAG post.

⏭️ What's Next

This is Post 3 of the Azure RAG Pipeline with Terraform series.


Your RAG pipeline just went serverless. Cosmos DB's built-in DiskANN vector search gives you sub-20ms queries, hybrid search with RRF, and true pay-per-use pricing - all without a separate vector store. One database, one bill, one less service to manage. πŸ’°

Found this helpful? Follow for the full RAG Pipeline with Terraform series! πŸ’¬
