Oracle Database 23ai introduces native vector capabilities that enable semantic search directly within SQL. Understanding how to create vectors, calculate distances, and choose appropriate metrics is fundamental to building effective AI-powered applications. This comprehensive guide explores vector operations in Oracle 23ai with practical examples and best practices.
The VECTOR Data Type
Oracle 23ai introduces a native VECTOR data type designed specifically to store and manage vector embeddings efficiently within the database.
Declaration Syntax:
-- Flexible: any dimensions and format
embedding VECTOR
-- Specific dimensions, flexible format
embedding VECTOR(512)
-- Fully specified: dimensions and format
embedding VECTOR(512, FLOAT32)
Format Options:
- INT8: 8-bit integers
- FLOAT32: 32-bit floating-point (IEEE standard, most common)
- FLOAT64: 64-bit floating-point (IEEE standard, higher precision)
- BINARY: Binary vectors for specialized use cases
Oracle Database automatically casts values between formats when needed.
Creating Vectors with the VECTOR Constructor
The VECTOR constructor is a function that allows us to create vectors without storing them. It's particularly useful for learning purposes, testing, and ad-hoc queries.
Basic Syntax
VECTOR(array_literal [, number_of_dimensions] [, format])
Parameters:
- array_literal: String representation of vector values
- number_of_dimensions: Optional, inferred from array if not specified
- format: Optional (INT8, FLOAT32, FLOAT64, BINARY)
Examples
Simple 2D Vector:
SELECT VECTOR('[0, 0]');
Output: [0, 0]
Vector with Scientific Notation:
SELECT VECTOR('[10, 0]');
Output: [1.0E+001, 0] (equivalent to [10, 0])
Specifying Dimensions:
SELECT VECTOR('[1, 2, 3]', 3, FLOAT32);
Creating Vectors from Variables:
DECLARE
  my_vector VECTOR;
BEGIN
  my_vector := VECTOR('[0.5, 0.8, 0.2]');
  DBMS_OUTPUT.PUT_LINE(my_vector);
END;
/
Best Practices for Learning
Use a small number of dimensions (2-4) when learning vector concepts, as they're easier to visualize and understand. For production, match the dimensions to your embedding model (typically 384-1536).
The VECTOR_DISTANCE Function
VECTOR_DISTANCE is the main function for calculating the distance between two vectors. It takes two vector expressions, plus an optional distance metric.
Syntax
VECTOR_DISTANCE(expr1, expr2 [, distance_metric])
Parameters:
- expr1, expr2: Two vector expressions to compare
- distance_metric: Optional; defaults to COSINE (or HAMMING for BINARY vectors)
Returns: BINARY_DOUBLE representing the distance
Default Behavior
If you do not specify a distance metric, then the default distance metric is cosine. If the input vectors are BINARY vectors, the default metric is hamming.
Basic Example
SELECT
  VECTOR_DISTANCE(
    VECTOR('[1, 0]'),
    VECTOR('[0, 1]'),
    COSINE
  ) AS distance;
Vector Distance Metrics: Choosing the Right One
Oracle AI Vector Search supports multiple distance metrics, each suited for different use cases and data characteristics.
1. Euclidean Distance (L2)
Euclidean distance gives us the straight-line distance between two vectors. It uses the Pythagorean theorem and is sensitive to both vector size and direction.
Formula:
d = √[(x₂-x₁)² + (y₂-y₁)²]
SQL Example:
SELECT TO_NUMBER(
  VECTOR_DISTANCE(
    VECTOR('[3, 0]'),
    VECTOR('[0, 4]'),
    EUCLIDEAN
  )
) AS distance;
Result: 5 (forming a 3-4-5 right triangle)
Use Cases:
- Spatial data and geographical coordinates
- Physical measurements
- Image similarity (pixel-level comparisons)
- When absolute distance matters
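For intuition, the 3-4-5 example above can be reproduced outside the database with a few lines of Python (an illustrative sketch of the formula, not Oracle code):

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Mirrors VECTOR_DISTANCE(VECTOR('[3, 0]'), VECTOR('[0, 4]'), EUCLIDEAN)
print(euclidean([3, 0], [0, 4]))  # 5.0
```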
2. Euclidean Squared Distance (L2_SQUARED)
Euclidean squared is simply the Euclidean distance without the final square root. When the ordering of results matters more than the actual distance values, it is a useful shortcut: skipping the square root makes it faster to compute while preserving the ranking.
Advantage: Faster computation for ranking/sorting
SQL Example:
SELECT TO_NUMBER(
  VECTOR_DISTANCE(
    VECTOR('[3, 0]'),
    VECTOR('[0, 4]'),
    EUCLIDEAN_SQUARED
  )
) AS distance_squared;
Result: 25 (5²)
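Because squared distance is monotonic in the true distance, it preserves ranking; a quick Python sketch (illustrative only, not Oracle code):

```python
def euclidean_squared(a, b):
    """Euclidean distance without the final square root."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Mirrors the SQL example above
print(euclidean_squared([3, 0], [0, 4]))  # 25

# Skipping the square root never changes the ordering of candidates
q = [0, 0]
candidates = [[3, 4], [1, 1], [6, 8]]
by_squared = sorted(candidates, key=lambda v: euclidean_squared(q, v))
print(by_squared)  # [[1, 1], [3, 4], [6, 8]]
```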
3. Cosine Similarity / Cosine Distance
Cosine similarity is the most widely used similarity metric, especially in natural language processing (NLP). The smaller the angle, the more similar the vectors are.
Cosine measures the angle between two vectors rather than their magnitude. It's ideal for text embeddings where direction matters more than magnitude.
Formula:
cosine_similarity = (A · B) / (||A|| × ||B||)
cosine_distance = 1 - cosine_similarity
SQL Example:
SELECT
  VECTOR_DISTANCE(
    VECTOR('[1, 0]'),
    VECTOR('[1, 1]'),
    COSINE
  ) AS cosine_dist;
Key Characteristics:
- Range: 0 (identical) to 2 (opposite)
- Normalized: Insensitive to vector magnitude
- Best for: Text embeddings, document similarity, semantic search
Why It's Popular:
Cosine is one of the most useful metrics since it measures the angle between two vectors instead of the difference in size or position. This makes it perfect for comparing text embeddings where the semantic meaning is encoded in direction, not magnitude.
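A small Python sketch of the formula makes the magnitude-invariance concrete (illustrative only, not Oracle code):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle, not the magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Mirrors VECTOR_DISTANCE(VECTOR('[1, 0]'), VECTOR('[1, 1]'), COSINE)
print(round(cosine_distance([1, 0], [1, 1]), 4))    # 0.2929

# Magnitude-invariant: scaling a vector does not change the distance
print(round(cosine_distance([1, 0], [10, 10]), 4))  # 0.2929
```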
4. Dot Product Similarity
The dot product equals the product of the two vectors' magnitudes and the cosine of the angle between them; computationally, it is the sum of the products of their corresponding coordinates. Larger values mean more similar; smaller values mean less similar.
Formula:
dot_product = Σ(aᵢ × bᵢ)
Note: Oracle's DOT metric calculates the negated dot product, so more negative values indicate greater similarity.
SQL Example:
SELECT
  VECTOR_DISTANCE(
    VECTOR('[2, 3]'),
    VECTOR('[4, 5]'),
    DOT
  ) AS dot_distance;
Use Cases:
- Recommendation systems
- Similarity ranking with normalized vectors
- Fast approximate nearest neighbor search
Important: For meaningful results, vectors should be normalized to unit length.
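A few lines of Python (illustrative only, not Oracle code) show how the negation flips the "larger means more similar" convention into a distance where smaller means more similar:

```python
def negated_dot(a, b):
    """Oracle's DOT metric negates the dot product, so lower = more similar."""
    return -sum(x * y for x, y in zip(a, b))

# Mirrors VECTOR_DISTANCE(VECTOR('[2, 3]'), VECTOR('[4, 5]'), DOT)
# Dot product is 2*4 + 3*5 = 23; negated it becomes -23
print(negated_dot([2, 3], [4, 5]))  # -23
```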
5. Manhattan Distance (L1)
Manhattan distance suits uniform grids such as city blocks, power grids, and chessboards, and is faster to compute than Euclidean distance.
Also known as taxicab distance or L1 distance, it calculates the sum of absolute differences between vector components.
Formula:
manhattan = Σ|aᵢ - bᵢ|
SQL Example:
SELECT
  VECTOR_DISTANCE(
    VECTOR('[1, 2]'),
    VECTOR('[4, 6]'),
    MANHATTAN
  ) AS manhattan_dist;
Result: |4-1| + |6-2| = 7
Use Cases:
- Grid-based problems
- Route planning on city streets
- Feature selection in machine learning
- When diagonal movement isn't allowed
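The sum-of-absolute-differences formula is easy to verify in a few lines of Python (illustrative only, not Oracle code):

```python
def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Mirrors VECTOR_DISTANCE(VECTOR('[1, 2]'), VECTOR('[4, 6]'), MANHATTAN)
# |4-1| + |6-2| = 3 + 4
print(manhattan([1, 2], [4, 6]))  # 7
```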
6. Hamming Distance
Hamming distance counts the positions at which two vectors differ. For binary vectors, it is the number of bits that must be flipped to turn one vector into the other. Because it is a simple bit-by-bit comparison, it is widely used in network error detection and correction.
SQL Example:
SELECT
  VECTOR_DISTANCE(
    VECTOR('[1, 0, 1, 1, 0]', 5, BINARY),
    VECTOR('[1, 1, 1, 0, 0]', 5, BINARY),
    HAMMING
  ) AS hamming_dist;
Result: 2 (positions 2 and 4 differ)
Use Cases:
- Error detection and correction
- Genetic sequence comparison
- Digital communication
- Binary classification problems
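A Python sketch of the bit-by-bit comparison (illustrative only, not Oracle code):

```python
def hamming(a, b):
    """Number of positions at which two binary vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

# Mirrors the BINARY example above: positions 2 and 4 differ
print(hamming([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))  # 2
```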
7. Jaccard Distance
Jaccard distance measures dissimilarity between binary vectors based on the ratio of intersection to union.
Requirements: Both vectors must be BINARY
Formula:
jaccard = 1 - (|A ∩ B| / |A ∪ B|)
SQL Example:
SELECT
  VECTOR_DISTANCE(
    VECTOR('[1, 1, 0, 1]', 4, BINARY),
    VECTOR('[1, 0, 1, 1]', 4, BINARY),
    JACCARD
  ) AS jaccard_dist;
Use Cases:
- Set similarity comparison
- Document deduplication
- Recommendation systems
- Clustering binary data
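The intersection-over-union calculation can be checked with a short Python sketch (illustrative only, not Oracle code):

```python
def jaccard_distance(a, b):
    """1 - |intersection| / |union| over the set bits of two binary vectors."""
    intersection = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return 1 - intersection / union

# Mirrors the BINARY example above: intersection = 2 bits, union = 4 bits
print(jaccard_distance([1, 1, 0, 1], [1, 0, 1, 1]))  # 0.5
```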
Shorthand Distance Operators
Oracle provides convenient shorthand operators for common distance calculations, making SQL queries more concise and readable.
Available Operators
<-> Euclidean Distance Operator
SELECT '[1, 2]' <-> '[0, 1]' AS euclidean_dist;
Equivalent to:
SELECT L2_DISTANCE(VECTOR('[1, 2]'), VECTOR('[0, 1]')) AS euclidean_dist;
or
SELECT VECTOR_DISTANCE(VECTOR('[1, 2]'), VECTOR('[0, 1]'), EUCLIDEAN) AS euclidean_dist;
<=> Cosine Distance Operator
SELECT '[1, 0]' <=> '[0, 1]' AS cosine_dist;
Equivalent to:
SELECT COSINE_DISTANCE(VECTOR('[1, 0]'), VECTOR('[0, 1]')) AS cosine_dist;
<#> Negative Dot Product Operator
SELECT '[2, 3]' <#> '[4, 5]' AS neg_dot_product;
Equivalent to:
SELECT -1 * INNER_PRODUCT(VECTOR('[2, 3]'), VECTOR('[4, 5]')) AS neg_dot_product;
Practical Example
-- Compare products by embedding similarity
SELECT
  p1.product_name,
  p2.product_name,
  p1.embedding <=> p2.embedding AS cosine_dist
FROM products p1, products p2
WHERE p1.product_id = 100
  AND p2.product_id != 100
ORDER BY cosine_dist
FETCH FIRST 5 ROWS ONLY;
Shorthand Distance Functions
In addition to operators, Oracle provides shorthand functions for cleaner code:
L1_DISTANCE - Manhattan distance
L2_DISTANCE - Euclidean distance
COSINE_DISTANCE - Cosine distance
INNER_PRODUCT - Dot product (not negated)
HAMMING_DISTANCE - Hamming distance
JACCARD_DISTANCE - Jaccard distance
Example:
SELECT
  L2_DISTANCE(v1.embedding, v2.embedding) AS euclidean,
  COSINE_DISTANCE(v1.embedding, v2.embedding) AS cosine,
  INNER_PRODUCT(v1.embedding, v2.embedding) AS dot_prod
FROM vectors v1, vectors v2
WHERE v1.id = 1 AND v2.id = 2;
Performing Similarity Search
The VECTOR_DISTANCE function can be used to perform similarity search by ordering results based on vector proximity.
Exact Similarity Search
SELECT
  id,
  description,
  VECTOR_DISTANCE(embedding, :query_vector, COSINE) AS distance
FROM documents
ORDER BY distance
FETCH FIRST 10 ROWS ONLY;
Approximate Similarity Search (with Index)
-- Using FETCH APPROXIMATE with a vector index
SELECT
  id,
  description
FROM documents
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;
Key Difference:
- EXACT: Compares query vector with every vector (slower, 100% accurate)
- APPROXIMATE: Uses a vector index (HNSW or IVF) for fast search, trading a small amount of accuracy for speed (governed by the index's TARGET ACCURACY setting)
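Conceptually, EXACT search is a brute-force scan: compute the distance to every stored row, sort, and keep the top k. A Python sketch with a hypothetical in-memory stand-in for a table (names like `documents` and `exact_top_k` are made up for illustration):

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Hypothetical in-memory stand-in for a documents table: id -> embedding
documents = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.0, 1.0],
}

def exact_top_k(query, table, k):
    """Scan every row, order by distance, keep the k nearest (what EXACT does)."""
    ranked = sorted(table, key=lambda doc_id: cosine_distance(table[doc_id], query))
    return ranked[:k]

print(exact_top_k([1.0, 0.0], documents, 2))  # [1, 2]
```

An approximate index such as HNSW avoids this full scan by navigating a graph of neighbors, which is why it scales to large tables at a small cost in recall.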
Choosing the Right Distance Metric
Decision Matrix
| Use Case | Recommended Metric | Why |
|---|---|---|
| Text embeddings | COSINE | Captures semantic similarity, magnitude-invariant |
| Image similarity | EUCLIDEAN | Pixel-level comparisons benefit from absolute distance |
| Recommendation systems | DOT (normalized vectors) | Fast computation, works well with normalized data |
| Grid/route problems | MANHATTAN | Natural fit for grid-based navigation |
| Binary classification | HAMMING | Direct bit difference counting |
| Error detection | HAMMING | Counts differing positions |
| Set similarity | JACCARD | Measures intersection/union ratio |
Performance Considerations
Fastest to Slowest:
- DOT (simple multiplication and sum)
- EUCLIDEAN_SQUARED (avoids square root)
- MANHATTAN (absolute values and sum)
- EUCLIDEAN (includes square root)
- COSINE (normalization overhead)
Matching Index and Query Metrics
If a similarity search query specifies a distance metric that conflicts with the metric in a vector index, the vector index is not used and an exact search is performed instead.
Best Practice: Ensure your query metric matches your index metric for optimal performance.
-- Index created with COSINE
CREATE VECTOR INDEX docs_idx ON documents(embedding)
  ORGANIZATION NEIGHBOR GRAPH
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;
-- Query should also use COSINE for the index to be used
SELECT id FROM documents
ORDER BY embedding <=> :query_vector -- Uses COSINE operator
FETCH APPROXIMATE FIRST 10 ROWS ONLY;
Complete Working Example
Here's a comprehensive example demonstrating vector creation, storage, and similarity search:
-- Create table with vector column
CREATE TABLE product_embeddings (
  product_id   NUMBER PRIMARY KEY,
  product_name VARCHAR2(100),
  description  CLOB,
  embedding    VECTOR(384, FLOAT32)
);
-- Insert sample data
INSERT INTO product_embeddings VALUES (
  1,
  'Laptop Computer',
  'High-performance laptop for developers',
  VECTOR('[0.2, 0.8, 0.5, ...]', 384, FLOAT32)
);
INSERT INTO product_embeddings VALUES (
  2,
  'Wireless Mouse',
  'Ergonomic wireless mouse',
  VECTOR('[0.1, 0.3, 0.7, ...]', 384, FLOAT32)
);
COMMIT;
-- Create vector index for fast similarity search
CREATE VECTOR INDEX product_emb_idx
  ON product_embeddings(embedding)
  ORGANIZATION NEIGHBOR GRAPH
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;
-- Perform similarity search
DECLARE
  query_vec VECTOR;
BEGIN
  -- Create query vector (in practice, from an embedding model)
  query_vec := VECTOR('[0.15, 0.75, 0.6, ...]', 384, FLOAT32);
  -- Find similar products
  FOR rec IN (
    SELECT
      product_name,
      description,
      COSINE_DISTANCE(embedding, query_vec) AS distance
    FROM product_embeddings
    ORDER BY embedding <=> query_vec
    FETCH APPROXIMATE FIRST 5 ROWS ONLY
  ) LOOP
    DBMS_OUTPUT.PUT_LINE(
      'Product: ' || rec.product_name ||
      ' | Distance: ' || rec.distance
    );
  END LOOP;
END;
/
Best Practices
1. Choose Appropriate Dimensions
Match embedding dimensions to your model:
- 384: MiniLM, lightweight models
- 768: BERT, sentence transformers
- 1024: Cohere embedding models
- 1536: OpenAI ada-002
3. Normalize Vectors When Using DOT
The DOT metric is only meaningful on unit-length vectors. Many embedding models already emit normalized vectors; if yours does not, normalize in your application before inserting. You can check a vector's magnitude with VECTOR_NORM:
-- A unit-length vector has a norm of 1
SELECT VECTOR_NORM(embedding) AS magnitude
FROM documents
WHERE id = 1;
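Normalization itself is a one-line operation in application code; a minimal Python sketch (illustrative, not Oracle-specific):

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product behaves like cosine."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

vec = [3.0, 4.0]       # magnitude 5
unit = normalize(vec)
print(unit)            # [0.6, 0.8]
# The magnitude of the result is 1 (within floating-point tolerance)
print(math.sqrt(sum(x * x for x in unit)))
```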
3. Use Appropriate Formats
- FLOAT32: Default, balances precision and performance
- FLOAT64: When high precision is critical
- INT8: For quantized models, saves storage
4. Monitor Index Accuracy
-- Check whether the index is being used
EXPLAIN PLAN FOR
SELECT id FROM documents
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
5. Benchmark Different Metrics
Test multiple metrics on your data to find the best performer:
-- Compare metrics
SELECT
  'COSINE' AS metric,
  AVG(COSINE_DISTANCE(e1.embedding, e2.embedding)) AS avg_dist
FROM embeddings e1, embeddings e2
WHERE e1.id != e2.id
UNION ALL
SELECT
  'EUCLIDEAN',
  AVG(L2_DISTANCE(e1.embedding, e2.embedding))
FROM embeddings e1, embeddings e2
WHERE e1.id != e2.id;
Common Pitfalls and Solutions
Pitfall 1: Dimension Mismatch
Problem: Comparing vectors with different dimensions
-- This will error
SELECT VECTOR('[1, 2]') <=> VECTOR('[1, 2, 3]');
Solution: Ensure all vectors have the same dimensions
Pitfall 2: Wrong Metric for Data Type
Problem: Using JACCARD on non-binary vectors
Solution: Use JACCARD only with BINARY vectors
Pitfall 3: Not Using Indexes
Problem: Slow similarity searches on large datasets
Solution: Create appropriate vector indexes (HNSW for speed, IVF for scale)
Pitfall 4: Metric Mismatch with Index
Problem: Query metric conflicts with index metric, causing full table scan
Solution: Match query metric to index metric
Conclusion
Oracle Database 23ai's native vector capabilities provide a powerful, integrated platform for semantic search and AI-powered applications. Key takeaways:
Vector Creation:
- VECTOR constructor for creating vectors directly in SQL
- Support for multiple formats (INT8, FLOAT32, FLOAT64, BINARY)
- Flexible dimension specification
Distance Metrics:
- COSINE: Default, best for text embeddings and semantic similarity
- EUCLIDEAN: Straight-line distance, good for spatial data
- DOT: Fast for normalized vectors, recommendation systems
- MANHATTAN: Grid-based problems, faster than Euclidean
- HAMMING: Binary vectors, error detection
- JACCARD: Set similarity with binary vectors
Shorthand Operators:
- <-> for Euclidean distance
- <=> for Cosine distance
- <#> for negative dot product
Best Practices:
- Match metric to your use case and embedding model
- Create appropriate vector indexes
- Ensure metric consistency between index and queries
- Use approximate search for large datasets
- Benchmark different metrics on your data
By understanding these vector operations and distance metrics, you can build efficient, accurate similarity search applications entirely within Oracle Database 23ai.