Ryan Giggs

RAG Pipeline Deep Dive: Ingestion, Chunking, Embedding, and Vector Search

Retrieval-Augmented Generation (RAG) has become the default approach for providing context and memory to AI applications. Understanding the RAG pipeline—from document ingestion to final response generation—is essential for building high-performance AI systems. This comprehensive guide explores each stage of the RAG workflow, with special focus on Oracle Database 23ai's native vector capabilities.

The RAG Pipeline: An Overview

A complete RAG pipeline consists of two main phases:

Ingestion Phase (Offline):
Documents → Chunking → Embedding → Index/Store in Vector Database

Retrieval Phase (Runtime):
User Query → Embed Query → Similarity Search → Top-K Results → LLM Generation → Response

The RAG pipeline connects the model to the information it needs at query time. When a user asks a question, the pipeline retrieves the most relevant documents, prepares that text as context, and includes it in the prompt so the model can generate an answer grounded in retrieved material rather than training data alone.

Phase 1: Document Ingestion

Ingestion is the process of loading documents from multiple sources and in multiple formats, transforming them into a structured form suitable for embedding and retrieval.

Document Loading

Raw documents—whether PDFs, Word documents, web pages, or database records—must be transformed into manageable chunks suitable for retrieval.

Common Sources:

  • Local files (PDF, DOCX, TXT, CSV)
  • Web pages and APIs
  • Databases and data warehouses
  • Cloud storage (S3, GCS, Azure Blob)
  • Enterprise systems (SharePoint, Confluence, Notion)

Loading Libraries:

  • PyPDF2: For PDF documents
  • python-docx: For Word documents
  • LangChain Document Loaders: Unified interface for multiple formats
  • LlamaIndex Data Connectors: Specialized loaders for various sources
  • Oracle OracleDocLoader: Native loader for Oracle Database 23ai
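
For example, a minimal loading sketch with PyPDF2 (the file path is just a placeholder):

from PyPDF2 import PdfReader  # pip install PyPDF2

# Extract raw text page by page from a local PDF
reader = PdfReader("report.pdf")
pages = [page.extract_text() or "" for page in reader.pages]
document_text = "\n".join(pages)
print(f"Loaded {len(reader.pages)} pages, {len(document_text)} characters")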

Text Extraction and Cleaning

A clean corpus improves both retrieval accuracy and embedding quality. Use parsing libraries to extract only the primary content from web pages or PDFs, and remove boilerplate, navigation links, and disclaimers.

Cleaning Steps:

  1. Remove Noise: Headers, footers, navigation, ads
  2. Normalize Structure: Convert to consistent format (JSONL, Parquet)
  3. Preserve Metadata: Source, timestamp, author, document type
  4. Handle Special Characters: Normalize encoding (UTF-8)
  5. Maintain Logical Structure: Keep headings, lists, tables
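
A minimal cleaning sketch covering noise removal, encoding normalization, and whitespace cleanup (the boilerplate patterns are illustrative, not a complete rule set):

import re
import unicodedata

def clean_text(raw: str) -> str:
    # Normalize Unicode so visually identical characters compare equal
    text = unicodedata.normalize("NFKC", raw)
    # Drop obvious boilerplate lines -- these patterns are just examples
    text = re.sub(r"(?im)^\s*(page \d+|all rights reserved.*)\s*$", "", text)
    # Collapse runs of spaces/tabs but keep paragraph breaks
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()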

Metadata Enrichment

Each chunk should carry metadata such as topic, date, and source, so that retrieval can filter and rank results more effectively.

Important Metadata Fields:

  • source_id: Unique document identifier
  • title: Document title
  • created_date: Document creation timestamp
  • author: Document creator
  • document_type: PDF, email, webpage
  • topic/category: For domain filtering
  • page_number: For PDF sources
  • chunk_id: Unique identifier for the chunk
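
For illustration, a single chunk record carrying these fields might look like this (all values are placeholders):

chunk_record = {
    "chunk_id": "doc-0042-chunk-0007",
    "source_id": "doc-0042",
    "title": "Quarterly Architecture Review",
    "created_date": "2024-11-05T09:30:00Z",
    "author": "Jane Doe",
    "document_type": "PDF",
    "topic": "architecture",
    "page_number": 12,
    "text": "chunk text goes here",
}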

Phase 2: Chunking - The Critical Step

Chunking is simply the act of splitting larger documents into smaller units ("chunks"). Each chunk can be individually indexed, embedded, and retrieved.

Why Chunking Matters

The chunking process directly feeds into the embedding step, where each chunk is converted into vector representations that capture semantic meaning.

A common source of inaccurate or unstable answers is a structural conflict in the traditional "chunk-embed-retrieve" pipeline: a single fixed-size text chunk is asked to perform two inherently conflicting tasks:

  1. Semantic matching (recall): Smaller chunks (100-256 tokens) needed for precise similarity search
  2. Context understanding (utilization): Larger chunks (1024+ tokens) needed for coherent LLM generation

This creates a difficult trade-off between "precise but fragmented" and "complete but vague."

Pre-Chunking vs Post-Chunking

Pre-chunking is the most common method. It processes documents asynchronously by breaking them into smaller pieces before embedding and storing them in the vector database. This enables fast retrieval at query time since all chunks are pre-computed.

Post-chunking takes a different approach by embedding entire documents first, then performing chunking at query time only on the documents that are actually retrieved. The chunked results can be cached, so the system becomes faster over time.

Chunking Strategies

Fixed-Length Chunking: This is the simplest approach, segmenting text into equally sized pieces using character, token, or word counts.

Common Chunk Sizes:

  • 300-500 tokens: Typical chunks for most use cases, depending on the model and content type
  • 100-256 tokens: For high-precision recall in semantic search
  • 1024+ tokens: For providing complete context to LLMs

Adopt recursive/semantic chunkers with 10-20% overlap to maintain context across boundaries.

Advanced Chunking Approaches:

Recursive Chunking: This method is a bit smarter. It splits text hierarchically based on cues like paragraphs, line breaks, or sentences. If a section is too long, it recursively divides it further.
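
As a hedged sketch, recursive chunking with LangChain's RecursiveCharacterTextSplitter might look like this (sizes here are measured in characters, and the values are illustrative):

from langchain_text_splitters import RecursiveCharacterTextSplitter

document_text = "First paragraph...\n\nSecond paragraph...\n\nThird paragraph..."

# Try paragraph breaks first, then lines, sentences, and words if a piece is still too long
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,  # roughly 10% overlap to preserve context across boundaries
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document_text)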

Semantic Chunking: Instead of letting token limits decide where ideas get chopped, it uses embeddings to detect where the meaning actually shifts—grouping semantically related sentences together.

Document-Aware Chunking: Preserves document structure (sections, paragraphs, tables) while chunking.

Chunk Overlap

Chunk overlap is the number of overlapping characters (or tokens) between adjacent chunks. This ensures that context isn't lost when information spans chunk boundaries.

Recommended Overlap:

  • 10-20% overlap is standard practice
  • For 500-token chunks, use 50-100 token overlap
  • Maintains continuity without excessive redundancy

Example:

Chunk 1: [Tokens 0-500]
Chunk 2: [Tokens 450-950]  ← 50 token overlap
Chunk 3: [Tokens 900-1400] ← 50 token overlap
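
The same sliding-window logic can be sketched in a few lines (tokens are approximated here by whitespace-split words for brevity):

def sliding_window_chunks(tokens, chunk_size=500, overlap=50):
    # Advance by (chunk_size - overlap) so adjacent chunks share `overlap` tokens
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = "some long document text repeated many times".split()
chunks = sliding_window_chunks(tokens)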

Splitting Process

Documents are split recursively until the chunks are small enough, using text splitters that respect:

  • Sentence boundaries (don't break mid-sentence)
  • Paragraph boundaries (maintain logical grouping)
  • Section boundaries (preserve document structure)

Popular Chunking Libraries:

For chunking, the LLM community most often turns to two powerful open-source frameworks, LangChain and LlamaIndex, alongside lighter dedicated tools:

  • LangChain: Broad framework with flexible chunking strategies
  • LlamaIndex: Designed specifically for RAG pipelines with sophisticated NodeParsers
  • Chonkie: Lightweight, dedicated chunking library focusing solely on text splitting

Phase 3: Embedding and Storage

Chunking makes information retrievable; embeddings make it searchable by meaning rather than keywords.

What Are Embeddings?

Embeddings are numerical representations that capture semantic relationships between text. A piece of text—whether a word, phrase, sentence, or paragraph—is converted to a vector of numbers (typically 384-1536 dimensions).

Key Properties:

  • Semantic similarity: Embeddings that are numerically similar are also semantically similar
  • Dense representation: High-dimensional vectors capture nuanced meaning
  • Model-specific: Different embedding models produce different vector spaces
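
A quick sketch of this property with sentence-transformers (using the same lightweight model referenced later in this article); related sentences score noticeably higher than unrelated ones:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([
    "How do I reset my password?",
    "Steps to recover account credentials",
    "Best pizza toppings for a party",
], normalize_embeddings=True)

# Cosine similarity between the first sentence and each candidate
print(util.cos_sim(emb[0], emb[1]).item())  # relatively high (similar meaning)
print(util.cos_sim(emb[0], emb[2]).item())  # relatively low (unrelated)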

Embedding Models

Popular Options:

  • OpenAI text-embedding-3-large: 3072 dimensions, high quality
  • Cohere embed-v4.0: 1024 dimensions, multimodal (text + images)
  • Sentence-BERT (SBERT): 768 dimensions, open-source
  • all-MiniLM-L6-v2: 384 dimensions, lightweight and fast

Choose MiniLM/SBERT/Instruct/E5 for text, CLIP for images. Normalize embeddings for consistent similarity calculations.

Embedding Generation: Outside vs Inside the Database

Option 1: Generate Embeddings Outside Oracle Database 23ai

Vector embeddings can be generated outside Oracle Database 23ai using third-party embedding models, then imported into the database for storage and search.

Workflow:

import array

import oracledb
from sentence_transformers import SentenceTransformer

# Load embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings
chunks = ["First chunk text", "Second chunk text"]
embeddings = model.encode(chunks)

# Store in Oracle Database 23ai (connection details are placeholders; assumes a
# documents table with a text column and a VECTOR embedding column)
with oracledb.connect(user="user", password="password", dsn="localhost/FREEPDB1") as conn:
    with conn.cursor() as cur:
        rows = [(text, array.array("f", vec)) for text, vec in zip(chunks, embeddings)]
        cur.executemany("INSERT INTO documents (text, embedding) VALUES (:1, :2)", rows)
    conn.commit()

Advantages:

  • Flexibility in choosing any embedding model
  • Can use cloud-based embedding services
  • Easy integration with existing ML pipelines

Option 2: Generate Embeddings Inside Oracle Database 23ai

Embeddings can also be generated inside Oracle Database 23ai by downloading pretrained embedding models, converting them to ONNX format, and importing the converted models into the database.

ONNX (Open Neural Network Exchange) is an open-source format designed to represent deep learning models. It aims to provide interoperability between different deep learning frameworks.

Process:

  1. Download Pretrained Model from Hugging Face or similar source
  2. Convert to ONNX Format using Oracle OML4Py utility
  3. Import into Database 23ai using DBMS_VECTOR package

Run the following command to import the ONNX file into the database:

EXEC dbms_vector.load_onnx_model(
  'DM_DUMP',
  'all_MiniLM_L12_v2.onnx',
  'minilm_l12_v2',
  JSON('{"function": "embedding", "embeddingOutput": "embedding", "input": {"input": ["DATA"]}}')
);
  4. Generate Embeddings in SQL:
SELECT VECTOR_EMBEDDING(minilm_l12_v2 USING 'Your text here' AS data) AS embedding
FROM DUAL;

Advantages:

  • Oracle Database embeds an ONNX runtime, so vector embeddings can be generated directly inside the database using SQL
  • Eliminates data movement between systems
  • Simplified architecture and security
  • Native integration with Oracle's vector capabilities

Phase 4: Vector Indexes for Fast Search

Once embeddings are generated and stored, they can be indexed. Vector indexes are specialized data structures designed for similarity searches in high-dimensional vector spaces; they accelerate those searches by reducing the number of distance calculations needed.

Why Vector Indexes?

An exact search for the closest matches to a given vector is accurate but can be slow, since costly vector distance computations must be performed against every vector in the column.

Vector indexes use techniques like clustering, partitioning, and neighbor graphs to group similar vectors together, reducing the search space dramatically.

Oracle AI Vector Search: Supported Index Types

Oracle AI Vector Search supports Hierarchical Navigable Small World (HNSW) indexes and Inverted File (IVF) indexes.

1. HNSW: In-Memory Neighbor Graph

Hierarchical Navigable Small World (HNSW) creates a navigable graph structure to locate similar vectors with exceptional speed and accuracy.

Characteristics:

HNSW is a Neighbor Graph vector index built entirely in memory. You need to set up a dedicated memory pool, sized with VECTOR_MEMORY_SIZE in the SGA, to accommodate it.

Since the HNSW index consumes a lot of memory, Oracle Database 23ai makes this index available by allocating a memory area called Vector Pool.

Memory Calculation:

To roughly determine the memory size needed to store an HNSW index, use the following formula: 1.3 × number of vectors × number of dimensions × size of your vector dimension type (for example, a FLOAT32 is 4 bytes).
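
For example, one million 384-dimensional FLOAT32 vectors would need roughly:

# 1.3 x number of vectors x dimensions x bytes per dimension (FLOAT32 = 4 bytes)
memory_bytes = 1.3 * 1_000_000 * 384 * 4
print(f"{memory_bytes / 1024**3:.2f} GiB")  # ~1.86 GiB of vector pool memory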

Creating HNSW Index:

-- Configure vector pool
ALTER SYSTEM SET vector_memory_size=800M SCOPE=SPFILE;
-- Restart required

-- Create HNSW index
CREATE VECTOR INDEX documents_hnsw_idx 
ON documents (embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

Performance:

HNSW enables sub-100ms retrieval at 95%+ recall—extremely fast for real-time applications.

Limitations:

DML operations (insert, update, delete) are not allowed after creating an HNSW index—designed for read-heavy workloads.

HNSW indexes are not available in RAC environments.

2. IVF: Neighbor Partition Index

Inverted File Flat (IVF) indexes partition data points into clusters, enabling efficient searching within relevant clusters.

Characteristics:

IVF Flat index is a Neighbor Partition Vector index built on disk, and its blocks are cached in the regular buffer cache.

A significant advantage of an IVF vector index is that it is not constrained by the amount of memory available in the vector pool like an HNSW vector index is.

How IVF Works:

It is designed to enhance search efficiency by narrowing the search area through the use of neighbor partitions or clusters. The center of each partition, or the centroid, represents the average vector for each partition. By default, the number of partitions and their corresponding centroids is set to the square root of the dataset's size.

Creating IVF Index:

CREATE VECTOR INDEX documents_ivf_idx 
ON documents (embedding)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 95;

Advantages:

  • Works with massive datasets
  • Supports global and local indexes on partitioned tables
  • Supports DML (inserts, updates, deletes), though heavy DML can degrade accuracy over time and periodic index rebuilds may be necessary
  • More memory-efficient than HNSW

Performance:

Although an IVF index won't be as fast as an equivalent HNSW index, it can be used for very large data sets and still provide excellent performance when compared to an exhaustive similarity search.

3. Hybrid Vector Indexes

A hybrid vector index combines full-text and semantic vector search in a single index, so documents can be indexed once and queried by keywords, by meaning, or both.

Use Case:
When you need to match both semantic meaning AND specific keywords/terms in a single query.

GPU Acceleration

At Oracle CloudWorld 2024, Oracle demonstrated GPU-accelerated creation of vector embeddings and HNSW index creation using NVIDIA GPUs.

When a user sends a request to create an HNSW vector index to Oracle Database, the database intelligently compiles and redirects the task to a Compute Offload Server process that leverages GPUs and the CAGRA algorithm from NVIDIA cuVS library to rapidly generate a graph index.

Phase 5: Retrieval - Finding Relevant Context

At runtime, when a user submits a query, the retrieval phase locates the most relevant chunks to provide as context to the LLM.

Query Embedding

The system encodes the user's query with the same model used to embed documents during ingestion.

-- Generate query embedding
SELECT VECTOR_EMBEDDING(minilm_l12_v2 USING 'What is RAG?' AS data) AS query_vector
FROM DUAL;

Similarity Search

The query vector is compared against all document vectors in the database using a distance metric.

Distance Metrics:

Common options include EUCLIDEAN, L2_SQUARED, COSINE, DOT, MANHATTAN, and HAMMING. You can specify the metric during index creation or within the search query itself.

  • Cosine Similarity: Most common for text (measures angle between vectors)
  • Euclidean Distance: Straight-line distance in vector space
  • Dot Product: Efficient for normalized vectors
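
A small NumPy illustration of how these metrics relate; once vectors are normalized to unit length, cosine similarity and dot product give the same value:

import numpy as np

a = np.array([0.3, 0.4, 0.5])
b = np.array([0.2, 0.1, 0.7])

# Normalize both vectors to unit length
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cosine, np.dot(a_n, b_n))    # identical values
print(np.linalg.norm(a_n - b_n))   # Euclidean distance; for unit vectors, d^2 = 2 - 2*cosine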

Performing Similarity Search:

SELECT 
  text,
  VECTOR_DISTANCE(embedding, :query_vector, COSINE) AS distance
FROM documents
ORDER BY distance
FETCH APPROX FIRST 5 ROWS ONLY;

Top-K Results

The system searches the index for the nearest vectors, retrieves the top-k chunks, and includes those chunks with the query in the prompt.

Choosing K:

  • K=3-5: For focused, specific queries
  • K=10-20: For broader research queries
  • K=50+: For comprehensive document analysis

Challenges:

Top-k retrieval can return near duplicates or shallow snippets, leaving the prompt with little diversity. Dense retrieval can also miss proper nouns, IDs, or acronyms, since such rare terms are poorly represented in embedding space.

Solutions:

  • Reranking: Rerank with cross-encoders (Cohere ReRank, ColBERT) when precision matters
  • Hybrid Search: Combine BM25 (keyword) with dense vectors to boost tail recall
  • Metadata Filtering: Filter by date, source, document type before retrieval
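
As a sketch of the reranking step (the cross-encoder model named here is one common public choice, not a requirement of the pipeline):

from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, chunk) pair jointly: slower than the
# bi-encoder retrieval stage, so it is applied only to the top-k candidates
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is RAG?"
candidates = ["chunk one text ...", "chunk two text ...", "chunk three text ..."]

scores = reranker.predict([(query, c) for c in candidates])
reranked = [c for _, c in sorted(zip(scores, candidates), reverse=True)]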

Monitoring Index Accuracy

Oracle also provides functions that report index accuracy for the query vectors used in approximate searches, so you can continuously monitor and optimize the accuracy of your vector search implementation.

Phase 6: Generation - Creating the Final Response

Once relevant chunks are retrieved, they're combined with the user's query and sent to an LLM for response generation.

Prompt Assembly

Retrieved chunks are formatted into a coherent context block:

Context:
[Chunk 1 text]
[Chunk 2 text]
[Chunk 3 text]

Question: [User query]

Answer based on the provided context:
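
A tiny helper that assembles this prompt layout from retrieved chunks (the function and variable names are illustrative):

def build_prompt(query: str, chunks: list[str]) -> str:
    # Join the retrieved chunks into one context block, then append the question
    context = "\n".join(chunks)
    return (
        "Context:\n"
        f"{context}\n\n"
        f"Question: {query}\n\n"
        "Answer based on the provided context:"
    )

prompt = build_prompt("What is RAG?", ["First retrieved chunk", "Second retrieved chunk"])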

LLM Generation

The complete prompt (context + query) is sent to the LLM, which generates a response grounded in the retrieved information rather than relying solely on parametric knowledge.

Benefits:

  • Reduces hallucinations
  • Provides traceable sources
  • Enables up-to-date information
  • Maintains factual accuracy

Best Practices for Production RAG Pipelines

1. Optimize Chunking Strategy

Align chunk size with context windows, use task-specific sentence transformers, and default to HNSW + metadata filtering for sub-100ms retrieval at 95%+ recall.

2. Choose Appropriate Embedding Models

  • Match embedding dimensions to your dataset size
  • Consider domain-specific models for specialized content
  • Test multiple models and measure retrieval quality

3. Implement Proper Indexing

  • Small-Medium datasets (< 1M vectors): HNSW for maximum speed
  • Large datasets (1M+ vectors): IVF for scalability
  • Keyword + Semantic: Hybrid vector indexes

4. Monitor and Iterate

Monitor recall@k, latency, and index memory; re-embed on model upgrades.

Track metrics:

  • Retrieval accuracy (precision@k, recall@k)
  • Query latency (p50, p95, p99)
  • Index build time
  • Storage requirements
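
A small sketch of computing precision@k and recall@k from retrieved chunk IDs against a labeled set of relevant IDs (the IDs are hypothetical):

def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    top_k = list(retrieved_ids)[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

# Example: 3 of the top-5 retrieved chunks are actually relevant
p, r = precision_recall_at_k(["c1", "c7", "c3", "c9", "c2"], {"c1", "c2", "c3", "c4"}, k=5)
print(p, r)  # 0.6 0.75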

5. Handle Edge Cases

  • Query reformulation for better retrieval
  • Fallback strategies when retrieval fails
  • Confidence thresholds for generated responses
  • Citation and source tracking

Advanced RAG Techniques

1. Self-Correction (CRAG)

Even strong retrieval pipelines sometimes return incomplete or irrelevant context. CRAG adds a lightweight feedback loop: before generating, the system checks whether the retrieved set is good enough. If it looks weak, the system triggers another retrieval pass.

2. Agentic RAG

Agentic RAG systems plan retrieval strategies and iterate based on partial results—dynamically deciding what to retrieve next.

3. GraphRAG

Build an advanced RAG system by first structuring data into a knowledge graph, then layering graph-aware retrieval to capture entity relationships.

4. Multimodal RAG

Combining text, images, audio, and video retrieval for richer context.

A robust RAG pipeline requires careful attention to every stage:

Ingestion: Clean, structured data with rich metadata
Chunking: Balanced chunk sizes with appropriate overlap (10-20%)
Embedding: Semantic representations using appropriate models
Indexing: Fast similarity search with HNSW or IVF indexes
Retrieval: Accurate top-K results with metadata filtering
Generation: Grounded responses with proper citations

Oracle Database 23ai provides a complete, integrated platform for RAG with native vector support, ONNX model runtime, advanced indexing (HNSW, IVF, Hybrid), and SQL-native operations—eliminating the need for separate vector databases.

By mastering these components and following best practices, you can build production-grade RAG systems that deliver accurate, relevant, and trustworthy AI-powered experiences.
