Building a RAG System with Azure OpenAI and Cognitive Search: Complete Guide
Introduction
Retrieval-Augmented Generation (RAG) is transforming how we build AI applications. Instead of relying solely on what the model knows, RAG lets you augment responses with your own data: documents, databases, or any other structured information.
In this guide, I'll walk you through building a production-ready RAG system using Azure OpenAI and Azure Cognitive Search. By the end, you'll have a system that can answer questions about your own documents with citations.
Why RAG Matters
Traditional LLM limitations:
- Knowledge cutoff dates
- Hallucinations on specific domains
- No access to private data
RAG solves these by:
- Grounding responses in your data
- Providing source citations
- Keeping data in your control
Architecture Overview
┌─────────────┐     ┌──────────────────┐     ┌─────────────┐
│  Documents  │────>│ Azure Cognitive  │────>│    Azure    │
│ (PDF, etc)  │     │      Search      │     │   OpenAI    │
└─────────────┘     └──────────────────┘     └─────────────┘
                             │                      │
                             v                      v
                     ┌─────────────┐        ┌─────────────┐
                     │  Embedding  │        │    GPT-4    │
                     │    Model    │        │    Model    │
                     └─────────────┘        └─────────────┘
Prerequisites
- Azure subscription
- Azure OpenAI resource with GPT-4 deployment
- Azure Cognitive Search resource
- An embedding model deployment (text-embedding-ada-002) in the same Azure OpenAI resource
- Node.js 18+ or Python 3.9+
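If you follow along in Python, install the client libraries the scripts below import (unpinned here for brevity; pin the versions you test against):

```shell
# Azure Search SDK, Azure OpenAI client, PDF parsing, and tokenization
pip install azure-search-documents openai pypdf tiktoken
```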
Step 1: Setting Up Azure Resources
Create Azure OpenAI Resource
# Create OpenAI resource
az cognitiveservices account create \
  --name openai-rag-demo \
  --resource-group rg-rag-demo \
  --kind OpenAI \
  --sku S0 \
  --location eastus
# Deploy GPT-4
az cognitiveservices account deployment create \
  --name openai-rag-demo \
  --resource-group rg-rag-demo \
  --deployment-name gpt-4 \
  --model-format OpenAI \
  --model-name gpt-4 \
  --model-version "0613" \
  --sku-capacity 1 \
  --sku-name "Standard"
# Deploy text-embedding-ada-002
az cognitiveservices account deployment create \
  --name openai-rag-demo \
  --resource-group rg-rag-demo \
  --deployment-name text-embedding-ada-002 \
  --model-format OpenAI \
  --model-name text-embedding-ada-002 \
  --model-version "2" \
  --sku-capacity 1 \
  --sku-name "Standard"
Create Cognitive Search
# Create search service
az search service create \
--name search-rag-demo \
--resource-group rg-rag-demo \
--sku free \
--location eastus
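The Python scripts below read endpoints and keys from environment variables. Assuming the resource names used above, you can populate them with the CLI (the variable names match what the scripts expect):

```shell
# Search endpoint follows the fixed https://<service-name>.search.windows.net pattern
export AZURE_SEARCH_ENDPOINT="https://search-rag-demo.search.windows.net"
export AZURE_SEARCH_KEY=$(az search admin-key show \
  --service-name search-rag-demo \
  --resource-group rg-rag-demo \
  --query primaryKey -o tsv)
export AZURE_OPENAI_ENDPOINT=$(az cognitiveservices account show \
  --name openai-rag-demo \
  --resource-group rg-rag-demo \
  --query properties.endpoint -o tsv)
export AZURE_OPENAI_KEY=$(az cognitiveservices account keys list \
  --name openai-rag-demo \
  --resource-group rg-rag-demo \
  --query key1 -o tsv)
```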
Step 2: Indexing Documents
Here's a complete Python script to index your documents:
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI
from pypdf import PdfReader
import tiktoken
# Configuration
AZURE_SEARCH_ENDPOINT = os.environ["AZURE_SEARCH_ENDPOINT"]
AZURE_SEARCH_KEY = os.environ["AZURE_SEARCH_KEY"]
AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]
AZURE_OPENAI_KEY = os.environ["AZURE_OPENAI_KEY"]
INDEX_NAME = "rag-index"
# Initialize clients
search_client = SearchClient(
    endpoint=AZURE_SEARCH_ENDPOINT,
    index_name=INDEX_NAME,
    credential=AzureKeyCredential(AZURE_SEARCH_KEY)
)
openai_client = AzureOpenAI(
    api_key=AZURE_OPENAI_KEY,
    api_version="2024-02-01",
    azure_endpoint=AZURE_OPENAI_ENDPOINT
)
def extract_text_from_pdf(pdf_path):
    """Extract text from a PDF document"""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text

def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into overlapping chunks"""
    tokenizer = tiktoken.get_encoding("cl100k_base")
    tokens = tokenizer.encode(text)
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk_tokens = tokens[i:i + chunk_size]
        chunks.append(tokenizer.decode(chunk_tokens))
    return chunks
def get_embedding(text):
    """Get embedding for text using Azure OpenAI"""
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-ada-002"
    )
    return response.data[0].embedding

def index_documents(folder_path):
    """Index all PDF documents from a folder"""
    documents = []
    for filename in os.listdir(folder_path):
        if filename.endswith('.pdf'):
            filepath = os.path.join(folder_path, filename)
            text = extract_text_from_pdf(filepath)
            chunks = chunk_text(text)
            for i, chunk in enumerate(chunks):
                doc = {
                    # Search keys may only contain letters, digits,
                    # dashes, and underscores, so drop the extension
                    "id": f"{os.path.splitext(filename)[0]}-{i}",
                    "content": chunk,
                    "source": filename,
                    "chunk_id": i
                }
                doc["embedding"] = get_embedding(chunk)
                documents.append(doc)
    # Upload to search index
    search_client.upload_documents(documents)
    print(f"Indexed {len(documents)} document chunks")

if __name__ == "__main__":
    index_documents("./documents")
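The script above assumes the rag-index index already exists. A definition along these lines (REST API shape; the vectorSearch algorithm and profile names here are illustrative choices, not required values) creates it with field names matching the documents we upload and a vector field sized for text-embedding-ada-002's 1536 dimensions:

```json
{
  "name": "rag-index",
  "fields": [
    { "name": "id", "type": "Edm.String", "key": true, "filterable": true },
    { "name": "content", "type": "Edm.String", "searchable": true },
    { "name": "source", "type": "Edm.String", "filterable": true, "facetable": true },
    { "name": "chunk_id", "type": "Edm.Int32", "filterable": true },
    { "name": "embedding", "type": "Collection(Edm.Single)",
      "searchable": true, "dimensions": 1536,
      "vectorSearchProfile": "default-profile" }
  ],
  "vectorSearch": {
    "algorithms": [ { "name": "default-hnsw", "kind": "hnsw" } ],
    "profiles": [ { "name": "default-profile", "algorithm": "default-hnsw" } ]
  }
}
```

You can create the same schema programmatically with SearchIndexClient from the azure-search-documents SDK if you prefer to keep everything in Python.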
Step 3: Querying the RAG System
from azure.search.documents.models import VectorizedQuery

def query_rag_system(query, top_k=5):
    """Query the RAG system and get an augmented response"""
    # Get query embedding
    query_embedding = get_embedding(query)
    # Search for relevant documents (hybrid: keyword + vector);
    # materialize the paged results so we can iterate them twice
    search_results = list(search_client.search(
        search_text=query,
        vector_queries=[VectorizedQuery(
            vector=query_embedding,
            k_nearest_neighbors=top_k,
            fields="embedding"
        )],
        select=["content", "source", "chunk_id"],
        top=top_k
    ))
    # Build context from results
    context = "\n\n".join(
        f"[Source: {result['source']}]\n{result['content']}"
        for result in search_results
    )
    # Generate response with context
    system_prompt = f"""You are a helpful assistant that answers questions
based on the provided context. Always cite your sources.

Context:
{context}
"""
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ],
        temperature=0.3
    )
    return {
        "answer": response.choices[0].message.content,
        "sources": [
            {"source": r["source"], "chunk": r["chunk_id"]}
            for r in search_results
        ]
    }

# Example usage
result = query_rag_system("What are the key security considerations?")
print(result["answer"])
print("\nSources:")
for source in result["sources"]:
    print(f"  - {source['source']} (chunk {source['chunk']})")
Step 4: Semantic Search Configuration
For better results, configure semantic search in Cognitive Search:
{
  "semantic": {
    "configurations": [
      {
        "name": "semantic-config",
        "prioritizedFields": {
          "titleField": { "fieldName": "source" },
          "prioritizedContentFields": [
            { "fieldName": "content" }
          ]
        }
      }
    ]
  }
}
Enable semantic search on your index:
from azure.search.documents.indexes.models import (
    SemanticConfiguration,
    SemanticField,
    SemanticPrioritizedFields,
    SemanticSearch
)

semantic_config = SemanticConfiguration(
    name="default",
    prioritized_fields=SemanticPrioritizedFields(
        title_field=SemanticField(field_name="source"),
        content_fields=[
            SemanticField(field_name="content")
        ]
    )
)

# Apply to index
index.semantic_search = SemanticSearch(
    configurations=[semantic_config]
)
Cost Optimization Tips
- Use the free tier for Cognitive Search during development
- Implement caching for repeated queries
- Batch embeddings - process multiple documents together
- Monitor usage via Azure Cost Management
# Simple caching implementation
from functools import lru_cache

@lru_cache(maxsize=1000)
def cached_query(query):
    return query_rag_system(query)
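Caching at the embedding level often saves even more, since the same query text recurs across sessions and each embedding call is billed. A minimal sketch, using a stand-in for get_embedding() so the caching behavior is visible without an Azure call:

```python
from functools import lru_cache

call_count = 0  # tracks how many "real" embedding calls happen

@lru_cache(maxsize=1000)
def cached_embedding(text: str):
    """Stand-in for get_embedding(); a real version would call Azure OpenAI."""
    global call_count
    call_count += 1
    return (len(text),)  # placeholder "vector"; ada-002 returns 1536 floats

cached_embedding("What are the key security considerations?")
cached_embedding("What are the key security considerations?")  # served from cache
print(call_count)  # → 1
```

Note that lru_cache requires hashable arguments, which is why the stand-in returns a tuple rather than a list.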
Testing Your RAG System
# Test cases
test_queries = [
    "What is the main topic of the document?",
    "Summarize the key findings",
    "What are the recommendations?"
]

for query in test_queries:
    print(f"\nQuery: {query}")
    result = query_rag_system(query)
    print(f"Answer: {result['answer'][:200]}...")
Production Considerations
Security
- Use managed identities
- Implement role-based access
- Encrypt data at rest

Monitoring
- Log all queries and responses
- Track token usage
- Set up alerts for errors

Scalability
- Use Azure AD auth
- Implement rate limiting
- Consider vector database alternatives
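For rate limiting, a token bucket is a common starting point. This is a client-side, per-process sketch (the rate and capacity values are placeholders; tune them to your Azure OpenAI quota):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise deny the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=2)
print([bucket.allow() for _ in range(3)])  # → [True, True, False]
```

In a multi-instance deployment you would move this state into a shared store (e.g. Redis) rather than keeping it in-process.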
Key Takeaways
- RAG combines retrieval with generation for accurate, grounded responses
- Azure Cognitive Search provides excellent vector and semantic search
- Proper chunking and embedding are critical for quality results
- Always cite sources in production systems
Next Steps
- Add support for more document formats (DOCX, PPTX)
- Implement hybrid search (keyword + vector)
- Add user authentication and authorization
- Build a web UI with streaming responses
GitHub Repository: [Link to be created - azure-openai-rag-starter]
Tags: #azureopenai #cognitive-search #rag #ai #tutorial #azure
Have questions or want to see more detailed implementation? Let me know in the comments!