Suhas Mallesh
Cosmos DB Vector Search for RAG: NoSQL-Native DiskANN on Azure with Terraform πŸ”Ž

Azure AI Search is powerful but adds a separate service to manage. Cosmos DB for NoSQL has built-in vector search powered by Microsoft Research's DiskANN - store vectors and data together with sub-20ms latency. Here's the Terraform setup for RAG workloads.

In RAG Post 1, we deployed an Azure AI Search index as the vector store for our RAG pipeline. AI Search gives you a full retrieval stack with semantic ranking, hybrid search, and query decomposition. But it's a separate service with its own pricing tier, and for many RAG workloads, that's more infrastructure than you need.

Azure Cosmos DB for NoSQL now includes built-in vector search powered by DiskANN, a high-performance vector indexing algorithm developed by Microsoft Research. You store vectors alongside your documents in the same database, query them with the familiar NoSQL SQL syntax, and get sub-20ms latency at scale. If you already have data in Cosmos DB or want a single database for both operational and vector workloads, this eliminates an entire service from your architecture. 🎯

πŸ” Why Cosmos DB for RAG?

AI Search is the right choice when you need the full three-layer retrieval pipeline (hybrid search, semantic ranking, agentic retrieval). But many RAG use cases don't need all of that:

| Capability | Azure AI Search | Cosmos DB for NoSQL |
| --- | --- | --- |
| Vector search | βœ… Separate index | βœ… Same container as data |
| Hybrid search | βœ… BM25 + vector via RRF | βœ… FullTextScore + VectorDistance via RRF |
| Semantic ranking | βœ… Cross-encoder reranker | ❌ (preview: semantic reranking) |
| Query decomposition | βœ… Agentic retrieval | ❌ Application-level |
| Index type | HNSW | DiskANN, quantizedFlat, flat |
| Latency | 50-200ms | < 20ms (DiskANN at 10M vectors) |
| Pricing model | Per-tier (Basic/Standard) | Pay-per-RU (serverless or provisioned) |
| Min cost (dev) | ~$75/month (Basic) | ~$0 (serverless, pay per request) |

The trade-off: Cosmos DB gives you lower latency and true pay-per-use pricing, but you lose AI Search's built-in semantic ranker and agentic retrieval. For straightforward RAG against your own documents, Cosmos DB's native vector search is often enough.

πŸ”§ Terraform Setup

Cosmos DB Account with Vector Search

# cosmosdb/main.tf

resource "azurerm_cosmosdb_account" "this" {
  name                = "${var.environment}-${var.project}-cosmos"
  location            = azurerm_resource_group.this.location
  resource_group_name = azurerm_resource_group.this.name
  offer_type          = "Standard"
  kind                = "GlobalDocumentDB"

  capabilities {
    name = "EnableNoSQLVectorSearch"
  }

  capabilities {
    name = "EnableServerless"
  }

  consistency_policy {
    consistency_level = "Session"
  }

  geo_location {
    location          = var.location
    failover_priority = 0
  }
}

The EnableNoSQLVectorSearch capability activates DiskANN-based vector indexing. The EnableServerless capability sets pay-per-request pricing with no minimum cost - ideal for dev/test RAG workloads.

Database and Container

# cosmosdb/database.tf

resource "azurerm_cosmosdb_sql_database" "rag" {
  name                = "rag-database"
  resource_group_name = azurerm_resource_group.this.name
  account_name        = azurerm_cosmosdb_account.this.name
}

resource "azurerm_cosmosdb_sql_container" "chunks" {
  name                = "rag-chunks"
  resource_group_name = azurerm_resource_group.this.name
  account_name        = azurerm_cosmosdb_account.this.name
  database_name       = azurerm_cosmosdb_sql_database.rag.name
  partition_key_paths = ["/department"]

  indexing_policy {
    indexing_mode = "consistent"

    included_path {
      path = "/*"
    }

    excluded_path {
      path = "/contentVector/*"
    }
  }
}

Important: Exclude vector paths from the general index (/contentVector/*). Vectors get their own specialized DiskANN index - including them in the general index wastes RUs and storage.

Vector and Full-Text Policies via AzAPI

Terraform's azurerm provider doesn't yet support vector embedding policies or vector indexes directly on the container resource. Use the azapi provider to set these policies:

# cosmosdb/vector_policy.tf

resource "azapi_update_resource" "vector_policy" {
  type      = "Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2024-05-15"
  parent_id = azurerm_cosmosdb_sql_container.chunks.id

  body = jsonencode({
    properties = {
      resource = {
        vectorEmbeddingPolicy = {
          vectorEmbeddings = [{
            path             = "/contentVector"
            dataType         = "float32"
            distanceFunction = "cosine"
            dimensions       = var.embedding_dimensions
          }]
        }
        indexingPolicy = {
          indexingMode = "consistent"
          automatic    = true
          includedPaths = [{ path = "/*" }]
          excludedPaths = [
            { path = "/contentVector/*" },
            { path = "/_etag/?" }
          ]
          vectorIndexes = [{
            path = "/contentVector"
            type = var.vector_index_type
          }]
          fullTextIndexes = [{
            path = "/content"
          }]
        }
        fullTextPolicy = {
          defaultLanguage = "en-US"
          fullTextPaths   = [{
            path     = "/content"
            language = "en-US"
          }]
        }
      }
    }
  })
}

This configures three things at once: the vector embedding policy (dimensions, distance function, data type), the DiskANN vector index, and the full-text index for hybrid search.
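The same policies can also be expressed from application code. Below is a hedged sketch of the policy dictionaries as plain Python, mirroring the azapi body above (the shapes follow Cosmos DB's REST schema; the 1536 dimension count assumes text-embedding-3-small, and the commented `create_container` keyword names are assumptions based on recent azure-cosmos SDK versions):

```python
# Policy dictionaries mirroring the Terraform/azapi configuration above.
vector_embedding_policy = {
    "vectorEmbeddings": [{
        "path": "/contentVector",
        "dataType": "float32",
        "distanceFunction": "cosine",
        "dimensions": 1536,  # assumes text-embedding-3-small
    }]
}

indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    # Keep vectors out of the general index; they get a dedicated index below.
    "excludedPaths": [{"path": "/contentVector/*"}, {"path": "/_etag/?"}],
    "vectorIndexes": [{"path": "/contentVector", "type": "diskANN"}],
}

# With the azure-cosmos Python SDK (keyword names are an assumption; verify
# against your installed SDK version):
# database.create_container(
#     id="rag-chunks",
#     partition_key=PartitionKey(path="/department"),
#     indexing_policy=indexing_policy,
#     vector_embedding_policy=vector_embedding_policy,
# )
```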

πŸ“„ Document Structure

Each document in your container stores the chunk content, metadata, and its vector embedding together:

{
  "id": "chunk-001",
  "sourceUri": "policies/returns-2024.pdf",
  "title": "Return Policy - Section 3",
  "content": "Customers may return items within 30 days...",
  "department": "legal",
  "year": 2024,
  "contentVector": [0.023, -0.041, 0.089, ...]
}

No separate vector store, no sync pipeline, no metadata mapping. The vector lives alongside the data it represents.
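At ingestion time, assembling such a document is a one-step dictionary build; a minimal sketch (the helper name and argument order are illustrative, not from an SDK):

```python
def make_chunk_doc(chunk_id, source_uri, title, content, department, year, embedding):
    """Assemble a Cosmos DB chunk document with its embedding stored inline."""
    return {
        "id": chunk_id,
        "sourceUri": source_uri,
        "title": title,
        "content": content,
        "department": department,    # partition key value
        "year": year,
        "contentVector": embedding,  # must match the embedding policy's dimensions
    }

doc = make_chunk_doc(
    "chunk-001", "policies/returns-2024.pdf", "Return Policy - Section 3",
    "Customers may return items within 30 days...", "legal", 2024,
    [0.023, -0.041, 0.089],
)
```

Writing it is then a single `upsert_item(doc)` call on the container client; no second write to a separate vector store.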

πŸ” Vector Search: Three Patterns

Pattern 1: Pure Vector Search

SELECT TOP 10 c.title, c.content, c.sourceUri,
    VectorDistance(c.contentVector, @queryVector) AS score
FROM c
ORDER BY VectorDistance(c.contentVector, @queryVector)

VectorDistance scores each document against the query vector using the distance function declared in the embedding policy (cosine here), served by the DiskANN index. Always use TOP N so the query stays on the approximate index instead of scanning every vector.
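For the cosine function, the score is just a normalized dot product; a pure-Python reference for intuition (illustrative, not the server implementation; for cosine, distance = 1 - similarity):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors: normalized dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal -> 0.0
```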

Pattern 2: Filtered Vector Search

Combine standard WHERE clauses with vector search:

SELECT TOP 10 c.title, c.content, c.sourceUri,
    VectorDistance(c.contentVector, @queryVector) AS score
FROM c
WHERE c.department = "legal" AND c.year >= 2024
ORDER BY VectorDistance(c.contentVector, @queryVector)

Cosmos DB's Sharded DiskANN optimizes filtered vector search by partitioning indexes per shard key. If you frequently filter by department, use it as the partition key for best performance.

Pattern 3: Hybrid Search (Vector + Full-Text)

Combine vector similarity with BM25 keyword scoring using Reciprocal Rank Fusion:

SELECT TOP 10 c.title, c.content, c.sourceUri
FROM c
ORDER BY RANK RRF(
    VectorDistance(c.contentVector, @queryVector),
    FullTextScore(c.content, ["return", "policy", "30 days"])
)

RRF merges the rankings from both search methods, catching results that are semantically similar and/or contain exact keywords. This is the same fusion approach AI Search uses internally.
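RRF itself is simple enough to sketch: each document's fused score is the sum of 1/(k + rank) over the input rankings, so a document ranked well by either method surfaces near the top. A minimal sketch (k=60 is the conventional constant from the RRF literature; Cosmos DB's internal parameters aren't documented here):

```python
def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids; returns ids sorted by fused RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits  = ["doc-3", "doc-1", "doc-7"]   # ranked by VectorDistance
keyword_hits = ["doc-1", "doc-9", "doc-3"]   # ranked by FullTextScore
print(rrf([vector_hits, keyword_hits]))
# -> ['doc-1', 'doc-3', 'doc-9', 'doc-7']
```

doc-1 wins because it ranks highly in both lists, even though neither method ranked it first.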

⚑ DiskANN Index Types

| Index type | Accuracy | Latency | RU cost | Best for |
| --- | --- | --- | --- | --- |
| flat | 100% (exact) | Higher | Higher | < 1,000 vectors, testing |
| quantizedFlat | ~95% | Medium | Medium | 1K-100K vectors |
| diskANN | ~95% | Lowest | Lowest | 100K+ vectors, production |

Note: Both quantizedFlat and diskANN require at least 1,000 vectors for accurate quantization. Below that threshold, Cosmos DB falls back to a full scan with higher RU charges.
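The table and the 1,000-vector threshold reduce to a simple selection heuristic, sketched below (a rule of thumb distilled from the table above, not an official recommendation):

```python
def pick_vector_index(n_vectors: int) -> str:
    """Heuristic index choice by corpus size (thresholds from the table above)."""
    if n_vectors < 1_000:
        return "flat"           # exact search; quantization needs >= 1,000 vectors
    if n_vectors < 100_000:
        return "quantizedFlat"
    return "diskANN"

print(pick_vector_index(500))        # -> flat
print(pick_vector_index(50_000))     # -> quantizedFlat
print(pick_vector_index(2_000_000))  # -> diskANN
```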

DiskANN tuning parameters:

  • quantizationByteSize (1-512): Higher = better accuracy, higher RU cost
  • indexingSearchListSize (10-500): Higher = better accuracy, slower index builds

For most RAG workloads, the defaults work well. Only tune if you measure recall issues.

πŸ”— Integrating with Azure OpenAI for RAG

from azure.cosmos import CosmosClient
from openai import AzureOpenAI

# Generate query embedding
openai_client = AzureOpenAI(...)
query_embedding = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=user_query
).data[0].embedding

# Vector search in Cosmos DB
cosmos_client = CosmosClient(endpoint, credential)
container = cosmos_client.get_database_client("rag-database") \
    .get_container_client("rag-chunks")

results = container.query_items(
    query="""
        SELECT TOP 5 c.title, c.content, c.sourceUri
        FROM c
        ORDER BY VectorDistance(c.contentVector, @vector)
    """,
    parameters=[{"name": "@vector", "value": query_embedding}],
    enable_cross_partition_query=True
)

chunks = [item["content"] for item in results]

# Generate response with Azure OpenAI
context = "\n".join(chunks)
response = openai_client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": f"Answer based on context:\n{context}"},
        {"role": "user", "content": user_query}
    ]
)

πŸ“ Environment-Specific Configuration

# environments/dev.tfvars
cosmos_serverless      = true    # Pay per request, no minimum
vector_index_type      = "quantizedFlat"  # Good for < 100K vectors
embedding_dimensions   = 1536

# environments/prod.tfvars
cosmos_serverless      = false   # Provisioned throughput
vector_index_type      = "diskANN"        # Production scale
embedding_dimensions   = 1536
provisioned_throughput = 4000    # RU/s, autoscale to 40,000

Cost comparison: Serverless Cosmos DB charges per RU consumed with no minimum. A typical RAG query costs 10-50 RUs depending on vector dimensions and result count. At 10,000 queries/month, that's roughly $0.50-$2.50 total - compared to ~$75/month for AI Search Basic tier.
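The arithmetic behind that estimate is worth making explicit, with the per-million-RU serverless rate as a parameter you plug in (the rate below is back-solved from the figures quoted above and is illustrative; check current Azure pricing for your region):

```python
def serverless_query_cost(queries_per_month, ru_per_query, usd_per_million_ru):
    """Monthly serverless query cost; every input is an assumption to plug in."""
    total_ru = queries_per_month * ru_per_query
    return total_ru / 1_000_000 * usd_per_million_ru

# 10,000 queries/month at 10-50 RU each, at an assumed rate per million RU:
RATE = 5.0  # illustrative USD per million RU; verify against current pricing
low  = serverless_query_cost(10_000, 10, RATE)
high = serverless_query_cost(10_000, 50, RATE)
print(f"${low:.2f} - ${high:.2f} per month")  # -> $0.50 - $2.50 per month
```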

πŸ’‘ When to Choose Cosmos DB Vector Search

Pick Cosmos DB when you need:

  • Zero minimum cost for dev/test with serverless mode
  • Sub-20ms vector queries at production scale
  • Data and vectors together in the same document
  • Hybrid search with vector + full-text + filters in one query
  • Global distribution with multi-region replication

Stick with AI Search when you need the semantic ranker, agentic retrieval, or the full three-layer pipeline from our Advanced RAG post.

⏭️ What's Next

This is Post 3 of the Azure RAG Pipeline with Terraform series.


Your RAG pipeline just went serverless. Cosmos DB's built-in DiskANN vector search gives you sub-20ms queries, hybrid search with RRF, and true pay-per-use pricing - all without a separate vector store. One database, one bill, one less service to manage. πŸ’°

Found this helpful? Follow for the full RAG Pipeline with Terraform series! πŸ’¬
