Building a RAG System with Azure OpenAI and Cognitive Search: Complete Guide
Introduction
Retrieval-Augmented Generation (RAG) is transforming how we build AI applications. Instead of relying solely on what the model knows, RAG lets you augment responses with your own data: documents, databases, or any other structured information.
In this guide, I'll walk you through building a production-ready RAG system using Azure OpenAI and Azure Cognitive Search. By the end, you'll have a system that can answer questions about your own documents with citations.
Why RAG Matters
Traditional LLM limitations:
- Knowledge cutoff dates
- Hallucinations on specific domains
- No access to private data
RAG solves these by:
- Grounding responses in your data
- Providing source citations
- Keeping data in your control
Architecture Overview
┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
│  Documents  │────>│ Azure Cognitive  │────>│    Azure    │
│ (PDF, etc)  │     │      Search      │     │   OpenAI    │
└─────────────┘     └──────────────────┘     └─────────────┘
                             │                      │
                             v                      v
                     ┌─────────────┐        ┌─────────────┐
                     │  Embedding  │        │    GPT-4    │
                     │    Model    │        │    Model    │
                     └─────────────┘        └─────────────┘
Prerequisites
- Azure subscription
- Azure OpenAI resource with GPT-4 deployment
- Azure Cognitive Search resource
- An embedding model deployment (text-embedding-ada-002) in the same Azure OpenAI resource
- Node.js 18+ or Python 3.9+
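If you follow along in Python, install the client libraries the scripts below import (unpinned here for brevity; pin the versions you test against):

```shell
# Azure Search SDK, Azure OpenAI client, PDF parsing, and tokenization
pip install azure-search-documents openai pypdf tiktoken
```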
Step 1: Setting Up Azure Resources
Create Azure OpenAI Resource
# Create OpenAI resource
az cognitiveservices account create \
  --name openai-rag-demo \
  --resource-group rg-rag-demo \
  --kind OpenAI \
  --sku S0 \
  --location eastus
# Deploy GPT-4
az cognitiveservices account deployment create \
  --name openai-rag-demo \
  --resource-group rg-rag-demo \
  --deployment-name gpt-4 \
  --model-format OpenAI \
  --model-name gpt-4 \
  --model-version "0613" \
  --sku-capacity 1 \
  --sku-name "Standard"
# Deploy text-embedding-ada-002
az cognitiveservices account deployment create \
  --name openai-rag-demo \
  --resource-group rg-rag-demo \
  --deployment-name text-embedding-ada-002 \
  --model-format OpenAI \
  --model-name text-embedding-ada-002 \
  --model-version "2" \
  --sku-capacity 1 \
  --sku-name "Standard"
Create Cognitive Search
# Create search service
az search service create \
--name search-rag-demo \
--resource-group rg-rag-demo \
--sku free \
--location eastus
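The Python scripts below read endpoints and keys from environment variables. Assuming the resource names used above, you can populate them with the CLI (the variable names match what the scripts expect):

```shell
# Search endpoint follows the fixed https://<service-name>.search.windows.net pattern
export AZURE_SEARCH_ENDPOINT="https://search-rag-demo.search.windows.net"
export AZURE_SEARCH_KEY=$(az search admin-key show \
  --service-name search-rag-demo \
  --resource-group rg-rag-demo \
  --query primaryKey -o tsv)
export AZURE_OPENAI_ENDPOINT=$(az cognitiveservices account show \
  --name openai-rag-demo \
  --resource-group rg-rag-demo \
  --query properties.endpoint -o tsv)
export AZURE_OPENAI_KEY=$(az cognitiveservices account keys list \
  --name openai-rag-demo \
  --resource-group rg-rag-demo \
  --query key1 -o tsv)
```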
Step 2: Indexing Documents
Here's a complete Python script to index your documents:
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI
from pypdf import PdfReader
import tiktoken
# Configuration
AZURE_SEARCH_ENDPOINT = os.environ["AZURE_SEARCH_ENDPOINT"]
AZURE_SEARCH_KEY = os.environ["AZURE_SEARCH_KEY"]
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_KEY = os.environ["AZURE_OPENAI_KEY"]
INDEX_NAME = "rag-index"
# Initialize clients
search_client = SearchClient(
    endpoint=AZURE_SEARCH_ENDPOINT,
    index_name=INDEX_NAME,
    credential=AzureKeyCredential(AZURE_SEARCH_KEY)
)
openai_client = AzureOpenAI(
    api_key=AZURE_OPENAI_KEY,
    api_version="2024-02-01",
    azure_endpoint=AZURE_OPENAI_ENDPOINT
)
def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF document"""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text

def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into overlapping chunks"""
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))
    return chunks
def get_embedding(text):
    """Get embedding for text using Azure OpenAI"""
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

def index_documents(folder_path):
    """Index all PDF documents from a folder"""
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.pdf'):
            filepath = os.path.join(folder_path, filename)
            text = extract_text_from_pdf(filepath)
            chunks = chunk_text(text)
            for i, chunk in enumerate(chunks):
                doc = {
                    # Search keys may only contain letters, digits,
                    # dashes, and underscores, so drop the extension
                    "id": f"{os.path.splitext(filename)[0]}-{i}",
                    "content": chunk,
                    "source": filename,
                    "chunk_id": i
                }
                doc["embedding"] = get_embedding(chunk)
                documents.append(doc)
    # Upload to search index
    search_client.upload_documents(documents)
    print(f"Indexed {len(documents)} document chunks")

if __name__ == "__main__":
    index_documents("./documents")
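The script above assumes the rag-index index already exists. A definition along these lines (REST API shape; the vectorSearch algorithm and profile names here are illustrative choices, not required values) creates it with field names matching the documents we upload and a vector field sized for text-embedding-ada-002's 1536 dimensions:

```json
{
  "name": "rag-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "content", "type": "Edm.String", "searchable": true },
    { "name": "source", "type": "Edm.String", "filterable": true, "facetable": true },
    { "name": "chunk_id", "type": "Edm.Int32", "filterable": true },
    { "name": "embedding", "type": "Collection(Edm.Single)",
      "searchable": true, "dimensions": 1536,
      "vectorSearchProfile": "default-profile" }
  ],
  "vectorSearch": {
    "algorithms": [ { "name": "default-hnsw", "kind": "hnsw" } ],
    "profiles": [ { "name": "default-profile", "algorithm": "default-hnsw" } ]
  }
}
```

You can create the same schema programmatically with SearchIndexClient from the azure-search-documents SDK if you prefer to keep everything in Python.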
Step 3: Querying the RAG System
from azure.search.documents.models import VectorizedQuery

def query_rag_system(query, top_k=5):
    """Query the RAG system and get an augmented response"""
    # Get query embedding
    query_embedding = get_embedding(query)
    # Search for relevant documents (hybrid: keyword + vector);
    # materialize the paged results so we can iterate them twice
    search_results = list(search_client.search(
        search_text=query,
        vector_queries=[VectorizedQuery(
            vector=query_embedding,
            k_nearest_neighbors=top_k,
            fields="embedding"
        )],
        select=["content", "source", "chunk_id"],
        top=top_k
    ))
    # Build context from results
    context = "\n\n".join(
        f"[Source: {result['source']}]\n{result['content']}"
        for result in search_results
    )
    # Generate response with context
    system_prompt = f"""You are a helpful assistant that answers questions
based on the provided context. Always cite your sources.

Context:
{context}
"""
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ],
        temperature=0.3
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": [
            {"source": r["source"], "chunk": r["chunk_id"]}
            for r in search_results
        ]
    }

# Example usage
result = query_rag_system("What are the key security considerations?")
print(result["answer"])
print("\nSources:")
for source in result["sources"]:
    print(f"  - {source['source']} (chunk {source['chunk']})")
Step 4: Semantic Search Configuration
For better results, configure semantic search in Cognitive Search:
{
  "semantic": {
    "configurations": [
      {
        "name": "semantic-config",
        "prioritizedFields": {
          "titleField": { "fieldName": "source" },
          "prioritizedContentFields": [
            { "fieldName": "content" }
          ]
        }
      }
    ]
  }
}
Enable semantic search on your index:
from azure.search.documents.indexes.models import (
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch
)

semantic_config = SemanticConfiguration(
    name="default",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="source"),
        content_fields=[
            SemanticField(field_name="content")
        ]
    )
)

# Apply to index
index.semantic_search = SemanticSearch(
    configurations=[semantic_config]
)
Cost Optimization Tips
- Use the free tier for Cognitive Search during development
- Implement caching for repeated queries
- Batch embeddings - process multiple documents together
- Monitor usage via Azure Cost Management
# Simple caching implementation
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_query(query):
    return query_rag_system(query)
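Caching at the embedding level often saves even more, since the same query text recurs across sessions and each embedding call is billed. A minimal sketch, using a stand-in for get_embedding() so the caching behavior is visible without an Azure call:

```python
from functools import lru_cache

call_count = 0  # tracks how many "real" embedding calls happen

@lru_cache(maxsize=1000)
def cached_embedding(text: str):
    """Stand-in for get_embedding(); a real version would call Azure OpenAI."""
    global call_count
    call_count += 1
    return (len(text),)  # placeholder "vector"; ada-002 returns 1536 floats

cached_embedding("What are the key security considerations?")
cached_embedding("What are the key security considerations?")  # served from cache
print(call_count)  # → 1
```

Note that lru_cache requires hashable arguments, which is why the stand-in returns a tuple rather than a list.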
Testing Your RAG System
# Test cases
test_queries = [
    "What is the main topic of the document?",
    "Summarize the key findings",
    "What are the recommendations?"
]

for query in test_queries:
    print(f"\nQuery: {query}")
    result = query_rag_system(query)
    print(f"Answer: {result['answer'][:200]}...")
Production Considerations
Security
- Use managed identities
- Implement role-based access
- Encrypt data at rest

Monitoring
- Log all queries and responses
- Track token usage
- Set up alerts for errors

Scalability
- Use Azure AD auth
- Implement rate limiting
- Consider vector database alternatives
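For rate limiting, a token bucket is a common starting point. This is a client-side, per-process sketch (the rate and capacity values are placeholders; tune them to your Azure OpenAI quota):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise deny the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)
print([bucket.allow() for _ in range(3)])  # → [True, True, False]
```

In a multi-instance deployment you would move this state into a shared store (e.g. Redis) rather than keeping it in-process.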
Key Takeaways
- RAG combines retrieval with generation for accurate, grounded responses
- Azure Cognitive Search provides excellent vector and semantic search
- Proper chunking and embedding are critical for quality results
- Always cite sources in production systems
Next Steps
- Add support for more document formats (DOCX, PPTX)
- Implement hybrid search (keyword + vector)
- Add user authentication and authorization
- Build a web UI with streaming responses
GitHub Repository: [Link to be created - azure-openai-rag-starter]
Tags: #azureopenai #cognitive-search #rag #ai #tutorial #azure
Have questions or want to see more detailed implementation? Let me know in the comments!