Retrieval-Augmented Generation (RAG) has become the default approach for providing context and memory to AI applications. Understanding the RAG pipeline—from document ingestion to final response generation—is essential for building high-performance AI systems. This comprehensive guide explores each stage of the RAG workflow, with special focus on Oracle Database 23ai's native vector capabilities.
The RAG Pipeline: An Overview
A complete RAG pipeline consists of two main phases:
Ingestion Phase (Offline):
Documents → Chunking → Embedding → Index/Store in Vector Database
Retrieval Phase (Runtime):
User Query → Embed Query → Similarity Search → Top-K Results → LLM Generation → Response
The RAG pipeline connects the model to the information it needs at query time. When a user asks a question, the pipeline retrieves the most relevant documents, prepares that text as context, and includes it in the prompt so the model can generate an answer grounded in retrieved material rather than training data alone.
Phase 1: Document Ingestion
Ingestion is the process of loading documents from multiple sources and in multiple formats, transforming them into a structured form suitable for embedding and retrieval.
Document Loading
Raw documents—whether PDFs, Word documents, web pages, or database records—must first be loaded and converted to plain text before they can be cleaned, chunked, and embedded. A minimal loading sketch follows the library list below.
Common Sources:
- Local files (PDF, DOCX, TXT, CSV)
- Web pages and APIs
- Databases and data warehouses
- Cloud storage (S3, GCS, Azure Blob)
- Enterprise systems (SharePoint, Confluence, Notion)
Loading Libraries:
- PyPDF2: For PDF documents
- python-docx: For Word documents
- LangChain Document Loaders: Unified interface for multiple formats
- LlamaIndex Data Connectors: Specialized loaders for various sources
- Oracle OracleDocLoader: Native loader for Oracle Database 23ai
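As a minimal sketch of the loading step with PyPDF2 (the file name and the fields kept per page are illustrative, not prescribed by any of these libraries):
from PyPDF2 import PdfReader
# Hypothetical input file; swap in your own source
reader = PdfReader("quarterly_report.pdf")
pages = []
for page_number, page in enumerate(reader.pages, start=1):
    pages.append({
        "text": page.extract_text() or "",   # extract_text() can return an empty result for image-only pages
        "source_id": "quarterly_report.pdf",
        "page_number": page_number,
    })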
Text Extraction and Cleaning
A clean corpus improves both retrieval accuracy and embedding quality. Use parsing libraries to extract only the primary content from web pages or PDFs, and remove boilerplate, navigation links, and disclaimers. A minimal cleaning sketch follows the list below.
Cleaning Steps:
- Remove Noise: Headers, footers, navigation, ads
- Normalize Structure: Convert to consistent format (JSONL, Parquet)
- Preserve Metadata: Source, timestamp, author, document type
- Handle Special Characters: Normalize encoding (UTF-8)
- Maintain Logical Structure: Keep headings, lists, tables
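A cleaning pass along these lines might look like the following sketch; the regular expressions are illustrative, not exhaustive:
import re
import unicodedata
def clean_text(raw: str) -> str:
    # Normalize encoding to a consistent Unicode form
    text = unicodedata.normalize("NFKC", raw)
    # Drop obvious boilerplate lines such as page footers like "Page 12 of 40"
    text = re.sub(r"^Page \d+ of \d+$", "", text, flags=re.MULTILINE)
    # Collapse runs of spaces and tabs, but preserve paragraph breaks
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()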
Metadata Enrichment
Each chunk should carry metadata such as topic, date, and source, so that retrieval can filter and rank results more effectively. A small example record follows the field list below.
Important Metadata Fields:
- source_id: Unique document identifier
- title: Document title
- created_date: Document creation timestamp
- author: Document creator
- document_type: PDF, email, webpage
- topic/category: For domain filtering
- page_number: For PDF sources
- chunk_id: Unique identifier for the chunk
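In practice, each chunk is usually stored as a record that pairs the text with these fields. A small illustrative sketch (the schema and values are hypothetical):
from dataclasses import dataclass, asdict, field
from datetime import datetime, timezone
@dataclass
class Chunk:
    chunk_id: str
    source_id: str
    title: str
    text: str
    document_type: str
    page_number: int | None = None
    created_date: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
chunk = Chunk(
    chunk_id="quarterly_report.pdf#p3#c2",
    source_id="quarterly_report.pdf",
    title="Quarterly Report",
    text="Revenue grew 12% quarter over quarter...",
    document_type="PDF",
    page_number=3,
)
record = asdict(chunk)   # ready to store alongside the chunk's embedding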
Phase 2: Chunking - The Critical Step
Chunking is simply the act of splitting larger documents into smaller units ("chunks"). Each chunk can be individually indexed, embedded, and retrieved.
Why Chunking Matters
The chunking process directly feeds the embedding step, where each chunk is converted into a vector representation that captures its semantic meaning.
A common source of inaccurate or unstable answers is a structural tension in the traditional "chunk-embed-retrieve" pipeline: a single, fixed-size chunk granularity has to serve two inherently conflicting tasks:
- Semantic matching (recall): smaller chunks (100-256 tokens) are better for precise similarity search
- Context understanding (utilization): larger chunks (1024+ tokens) are better for coherent LLM generation
This creates a difficult trade-off between "precise but fragmented" and "complete but vague."
Pre-Chunking vs Post-Chunking
Pre-chunking is the most common method. Documents are processed ahead of time: they are broken into smaller pieces, embedded, and stored in the vector database before any query arrives. This enables fast retrieval at query time since all chunks are pre-computed.
Post-chunking takes a different approach by embedding entire documents first, then performing chunking at query time only on the documents that are actually retrieved. The chunked results can be cached, so the system becomes faster over time.
Chunking Strategies
Fixed-Length Chunking: This is the simplest approach, segmenting text into equally sized pieces using character, token, or word counts.
Common Chunk Sizes:
- 300-500 tokens: Typical chunks for most use cases, depending on the model and content type
- 100-256 tokens: For high-precision recall in semantic search
- 1024+ tokens: For providing complete context to LLMs
Adopt recursive/semantic chunkers with 10-20% overlap to maintain context across boundaries.
Advanced Chunking Approaches:
Recursive Chunking: This method is a bit smarter. It splits text hierarchically based on cues like paragraphs, line breaks, or sentences. If a section is too long, it recursively divides it further.
Semantic Chunking: Instead of letting token limits decide where ideas get chopped, it uses embeddings to detect where the meaning actually shifts—grouping semantically related sentences together.
Document-Aware Chunking: Preserves document structure (sections, paragraphs, tables) while chunking.
Chunk Overlap
Chunk overlap is the number of overlapping characters (or tokens) between adjacent chunks. This ensures that context isn't lost when information spans chunk boundaries.
Recommended Overlap:
- 10-20% overlap is standard practice
- For 500-token chunks, use 50-100 token overlap
- Maintains continuity without excessive redundancy
Example:
Chunk 1: [Tokens 0-500]
Chunk 2: [Tokens 450-950] ← 50 token overlap
Chunk 3: [Tokens 900-1400] ← 50 token overlap
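A minimal sketch of recursive chunking with overlap, assuming a recent LangChain release that ships the langchain-text-splitters package; the size and overlap values approximate tokens with characters (roughly 4 characters per token):
from langchain_text_splitters import RecursiveCharacterTextSplitter
document_text = "..."  # the cleaned text produced by the ingestion phase
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,       # ~500 tokens
    chunk_overlap=200,     # ~50 tokens, i.e. 10% overlap
    separators=["\n\n", "\n", ". ", " "],   # prefer paragraph, then sentence boundaries
)
chunks = splitter.split_text(document_text)
print(len(chunks))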
Splitting Process
Documents are split recursively until the chunks are small enough, using text splitters that respect:
- Sentence boundaries (don't break mid-sentence)
- Paragraph boundaries (maintain logical grouping)
- Section boundaries (preserve document structure)
Popular Chunking Libraries:
For chunking, the LLM community often turns to the following open-source libraries:
- LangChain: Broad framework with flexible chunking strategies
- LlamaIndex: Designed specifically for RAG pipelines with sophisticated NodeParsers
- Chonkie: Lightweight, dedicated chunking library focusing solely on text splitting
Phase 3: Embedding and Storage
Chunking makes information retrievable; embeddings make it searchable by meaning rather than keywords.
What Are Embeddings?
Embeddings are numerical representations that capture semantic relationships between text. A piece of text—whether a word, phrase, sentence, or paragraph—is converted to a vector of numbers (typically a few hundred to a few thousand dimensions, for example 384 to 3072).
Key Properties:
- Semantic similarity: Texts with similar meanings map to vectors that are numerically close
- Dense representation: High-dimensional vectors capture nuanced meaning
- Model-specific: Different embedding models produce different vector spaces
Embedding Models
Popular Options:
- OpenAI text-embedding-3-large: 3072 dimensions, high quality
- Cohere embed-v4.0: 1024 dimensions, multimodal (text + images)
- Sentence-BERT (SBERT): 768 dimensions, open-source
- all-MiniLM-L6-v2: 384 dimensions, lightweight and fast
Choose a text embedding model such as MiniLM, SBERT, or E5 (including their instruct-tuned variants) for text, and CLIP for images. Normalize embeddings so that cosine and dot-product similarity behave consistently.
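A small sketch of normalization with sentence-transformers (the model choice is just an example):
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(
    ["What is RAG?", "Retrieval-Augmented Generation explained"],
    normalize_embeddings=True,   # unit-length vectors: dot product equals cosine similarity
)
similarity = float(np.dot(vectors[0], vectors[1]))
print(round(similarity, 3))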
Embedding Generation: Outside vs Inside the Database
Option 1: Generate Embeddings Outside Oracle Database 23ai
Vector embeddings can be generated outside Oracle Database 23ai using third-party embedding models, then imported into the database for storage and search.
Workflow:
from sentence_transformers import SentenceTransformer
# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Generate embeddings
chunks = ["First chunk text", "Second chunk text"]
embeddings = model.encode(chunks)
# Store in Oracle Database 23ai (for example, via python-oracledb)
# INSERT INTO documents (text, embedding) VALUES (:1, :2)
Advantages:
- Flexibility in choosing any embedding model
- Can use cloud-based embedding services
- Easy integration with existing ML pipelines
Option 2: Generate Embeddings Inside Oracle Database 23ai
Embeddings can also be generated inside Oracle Database 23ai by downloading pretrained embedding models, converting them to ONNX format, and importing the ONNX format models into Database 23ai.
ONNX (Open Neural Network Exchange) is an open-source format designed to represent deep learning models. It aims to provide interoperability between different deep learning frameworks.
Process:
- Download Pretrained Model from Hugging Face or similar source
- Convert to ONNX Format using Oracle OML4Py utility
- Import into Database 23ai using DBMS_VECTOR package
Run the following command to import the ONNX file into the database:
EXEC dbms_vector.load_onnx_model(
'DM_DUMP',
'all_MiniLM_L12_v2.onnx',
'minilm_l12_v2',
JSON('{"function": "embedding", "embeddingOutput": "embedding", "input": {"input": ["DATA"]}}')
);
Once the model is imported, embeddings can be generated directly in SQL:
SELECT VECTOR_EMBEDDING(minilm_l12_v2 USING 'Your text here' AS data) AS embedding
FROM DUAL;
Advantages:
- Oracle Database implements an ONNX runtime directly within the database. This allows you to generate vector embeddings directly within Oracle Database using SQL
- Eliminates data movement between systems
- Simplified architecture and security
- Native integration with Oracle's vector capabilities
Phase 4: Vector Indexes for Fast Search
Once embeddings are generated and stored, the next step is indexing. Vector indexes are specialized data structures designed for similarity search in high-dimensional vector spaces; they accelerate searches by reducing the number of distance calculations needed.
Why Vector Indexes?
An exact search for the closest matches to a given vector is accurate but can be slow, since a costly distance computation is needed against every vector in the column.
Vector indexes use techniques like clustering, partitioning, and neighbor graphs to group similar vectors together, reducing the search space dramatically.
Oracle AI Vector Search: Supported Index Types
Oracle AI Vector Search supports Hierarchical Navigable Small World (HNSW) indexes and Inverted File (IVF) indexes.
1. HNSW: In-Memory Neighbor Graph
Hierarchical Navigable Small World (HNSW) creates a navigable graph structure to locate similar vectors with exceptional speed and accuracy.
Characteristics:
HNSW is an in-memory neighbor graph vector index: the entire graph is built and held in memory. Because it consumes significant memory, Oracle Database 23ai allocates a dedicated memory area in the SGA called the Vector Pool, sized with the VECTOR_MEMORY_SIZE parameter.
Memory Calculation:
To roughly determine the memory size needed to store an HNSW index, use the following formula: 1.3 × number of vectors × number of dimensions × size of your vector dimension type (for example, a FLOAT32 is 4 bytes).
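As a quick worked example of this formula, assuming 1,000,000 vectors of 384 FLOAT32 dimensions:
# 1.3 x number of vectors x dimensions x bytes per dimension
vectors = 1_000_000
dimensions = 384
bytes_per_dimension = 4     # FLOAT32
estimated_bytes = 1.3 * vectors * dimensions * bytes_per_dimension
print(f"{estimated_bytes / (1024 ** 3):.2f} GiB")   # roughly 1.86 GiB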
Creating HNSW Index:
-- Configure vector pool
ALTER SYSTEM SET vector_memory_size=800M SCOPE=SPFILE;
-- Restart required
-- Create HNSW index
CREATE VECTOR INDEX documents_hnsw_idx
ON documents (embedding)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;
Performance:
HNSW enables sub-100ms retrieval at 95%+ recall—extremely fast for real-time applications.
Limitations:
DML operations (insert, update, delete) are not allowed on the base table after creating an HNSW index; these indexes are designed for read-heavy workloads.
HNSW indexes are not available in RAC environments.
2. IVF: Neighbor Partition Index
Inverted File Flat (IVF) indexes partition data points into clusters, enabling efficient searching within relevant clusters.
Characteristics:
An IVF Flat index is a neighbor partition vector index built on disk; its blocks are cached in the regular buffer cache.
A significant advantage of an IVF vector index is that it is not constrained by the amount of memory available in the vector pool like an HNSW vector index is.
How IVF Works:
It is designed to enhance search efficiency by narrowing the search area through the use of neighbor partitions, or clusters. The centroid of each partition is the average vector of that partition. By default, the number of partitions (and therefore centroids) is set to the square root of the dataset's size; for example, a table of 1,000,000 vectors defaults to roughly 1,000 partitions.
Creating IVF Index:
CREATE VECTOR INDEX documents_ivf_idx
ON documents (embedding)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 95;
Advantages:
- Works with massive datasets
- Supports global and local indexes on partitioned tables
- Supports DML operations (inserts, updates, deletes), although accuracy can degrade over time and the index may need periodic rebuilding
- More memory-efficient than HNSW
Performance:
Although an IVF index won't be as fast as an equivalent HNSW index, it can be used for very large data sets and still provide excellent performance when compared to an exhaustive similarity search.
3. Hybrid Vector Indexes
A hybrid vector index combines full-text and semantic search in one index, so documents can be indexed once and queried with a combination of keyword (full-text) search and semantic vector search.
Use Case:
When you need to match both semantic meaning AND specific keywords/terms in a single query.
GPU Acceleration
At Oracle CloudWorld 2024, Oracle demonstrated GPU-accelerated creation of vector embeddings and HNSW index creation using NVIDIA GPUs.
When a user sends a request to create an HNSW vector index to Oracle Database, the database intelligently compiles and redirects the task to a Compute Offload Server process that leverages GPUs and the CAGRA algorithm from NVIDIA cuVS library to rapidly generate a graph index.
Phase 5: Retrieval - Finding Relevant Context
At runtime, when a user submits a query, the retrieval phase locates the most relevant chunks to provide as context to the LLM.
Query Embedding
The system encodes the user's query with the same model used to embed documents during ingestion.
-- Generate query embedding
SELECT VECTOR_EMBEDDING(minilm_l12_v2 USING 'What is RAG?' AS data) AS query_vector
FROM DUAL;
Similarity Search
The query vector is compared against the stored document vectors using a distance metric, either exhaustively or approximately through a vector index.
Distance Metrics:
Common options include EUCLIDEAN, EUCLIDEAN_SQUARED, COSINE, DOT, MANHATTAN, and HAMMING. You can specify the metric during index creation or within the search query itself; a small numeric comparison follows the list below.
- Cosine Similarity: Most common for text (measures angle between vectors)
- Euclidean Distance: Straight-line distance in vector space
- Dot Product: Efficient for normalized vectors
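For intuition, here is how the common metrics compare on a pair of toy vectors (the values are arbitrary):
import numpy as np
a = np.array([0.1, 0.7, 0.2])
b = np.array([0.2, 0.6, 0.3])
cosine_distance = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean_distance = np.linalg.norm(a - b)
negative_dot = -(a @ b)   # dot-product "distance": a larger dot product means more similar
print(cosine_distance, euclidean_distance, negative_dot)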
Performing Similarity Search:
SELECT
text,
VECTOR_DISTANCE(embedding, :query_vector, COSINE) AS distance
FROM documents
ORDER BY distance
FETCH APPROX FIRST 5 ROWS ONLY;
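The same search can be issued from application code. A minimal sketch, assuming python-oracledb 2.2 or later (which can bind VECTOR values as array.array), the documents table above, and illustrative connection details:
import array
import oracledb
from sentence_transformers import SentenceTransformer
conn = oracledb.connect(user="vector_app", password="...", dsn="localhost/FREEPDB1")   # placeholder credentials
model = SentenceTransformer("all-MiniLM-L6-v2")   # must match the model used at ingestion
query_vector = array.array("f", model.encode("What is RAG?"))
with conn.cursor() as cur:
    cur.execute(
        """SELECT text, VECTOR_DISTANCE(embedding, :qv, COSINE) AS distance
           FROM documents
           ORDER BY distance
           FETCH APPROX FIRST 5 ROWS ONLY""",
        qv=query_vector,
    )
    for text, distance in cur:
        print(round(distance, 4), text[:80])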
Top-K Results
The system searches the index for the nearest vectors, retrieves the top-k chunks, and includes those chunks with the query in the prompt.
Choosing K:
- K=3-5: For focused, specific queries
- K=10-20: For broader research queries
- K=50+: For comprehensive document analysis
Challenges:
Top-k retrieval can return near-duplicates or shallow snippets, so the prompt lacks diversity. It can also miss proper nouns, IDs, or acronyms, because rare terms are poorly represented by dense embeddings.
Solutions:
- Reranking: Rerank with cross-encoders (Cohere ReRank, ColBERT) when precision matters
- Hybrid Search: Combine BM25 (keyword) with dense vectors to boost tail recall (see the fusion sketch after this list)
- Metadata Filtering: Filter by date, source, document type before retrieval
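One simple way to combine keyword and vector results is reciprocal rank fusion (RRF). A minimal sketch, assuming you already have one ranked ID list from BM25 and one from vector search (the document IDs are illustrative):
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists; k dampens the influence of lower ranks."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
bm25_hits = ["doc7", "doc2", "doc9"]        # keyword ranking
vector_hits = ["doc2", "doc5", "doc7"]      # dense-vector ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits])[:3])   # ['doc2', 'doc7', 'doc5']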
Monitoring Index Accuracy
Oracle also provides functions that report index accuracy for the query vectors used in approximate searches, so you can continuously monitor and tune your vector search implementation.
Phase 6: Generation - Creating the Final Response
Once relevant chunks are retrieved, they're combined with the user's query and sent to an LLM for response generation.
Prompt Assembly
Retrieved chunks are formatted into a coherent context block:
Context:
[Chunk 1 text]
[Chunk 2 text]
[Chunk 3 text]
Question: [User query]
Answer based on the provided context:
LLM Generation
The complete prompt (context + query) is sent to the LLM, which generates a response grounded in the retrieved information rather than relying solely on parametric knowledge.
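A minimal sketch of prompt assembly plus the generation call, assuming the OpenAI Python client; the model name and the retrieved snippets are placeholders, and any chat-completion API would work similarly:
from openai import OpenAI
retrieved_chunks = [
    "RAG retrieves relevant documents at query time...",
    "Oracle Database 23ai stores embeddings in VECTOR columns...",
]
question = "What is RAG?"
context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(retrieved_chunks))
prompt = (
    "Answer the question using only the context below and cite the snippet numbers.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}"
)
client = OpenAI()   # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",   # assumed model name; substitute your own
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)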
Benefits:
- Reduces hallucinations
- Provides traceable sources
- Enables up-to-date information
- Maintains factual accuracy
Best Practices for Production RAG Pipelines
1. Optimize Chunking Strategy
Align chunk size with context windows, use task-specific sentence transformers, and default to HNSW + metadata filtering for sub-100ms retrieval at 95%+ recall.
2. Choose Appropriate Embedding Models
- Match embedding dimensions to your dataset size
- Consider domain-specific models for specialized content
- Test multiple models and measure retrieval quality
3. Implement Proper Indexing
- Small-Medium datasets (< 1M vectors): HNSW for maximum speed
- Large datasets (1M+ vectors): IVF for scalability
- Keyword + Semantic: Hybrid vector indexes
4. Monitor and Iterate
Monitor recall@k, latency, and index memory; re-embed on model upgrades.
Track metrics:
- Retrieval accuracy (precision@k, recall@k)
- Query latency (p50, p95, p99)
- Index build time
- Storage requirements
5. Handle Edge Cases
- Query reformulation for better retrieval
- Fallback strategies when retrieval fails
- Confidence thresholds for generated responses
- Citation and source tracking
Advanced RAG Techniques
1. Self-Correction (CRAG)
Even strong retrieval pipelines sometimes return incomplete or irrelevant context. Corrective RAG (CRAG) adds a lightweight feedback loop: before generating, the system checks whether the retrieved set is good enough; if it looks weak, it triggers another retrieval pass.
2. Agentic RAG
Agentic RAG systems plan retrieval strategies and iterate based on partial results—dynamically deciding what to retrieve next.
3. GraphRAG
Build an advanced RAG system by first structuring data into a knowledge graph, then layering graph-aware retrieval to capture entity relationships.
4. Multimodal RAG
Combining text, images, audio, and video retrieval for richer context.
A robust RAG pipeline requires careful attention to every stage:
- Ingestion: Clean, structured data with rich metadata
- Chunking: Balanced chunk sizes with appropriate overlap (10-20%)
- Embedding: Semantic representations using appropriate models
- Indexing: Fast similarity search with HNSW or IVF indexes
- Retrieval: Accurate top-K results with metadata filtering
- Generation: Grounded responses with proper citations
Oracle Database 23ai provides a complete, integrated platform for RAG with native vector support, ONNX model runtime, advanced indexing (HNSW, IVF, Hybrid), and SQL-native operations—eliminating the need for separate vector databases.
By mastering these components and following best practices, you can build production-grade RAG systems that deliver accurate, relevant, and trustworthy AI-powered experiences.