DEV Community

Ryan Giggs

Oracle Database 23ai: Creating Vectors and Understanding Distance Metrics for Similarity Search

Oracle Database 23ai introduces native vector capabilities that enable semantic search directly within SQL. Understanding how to create vectors, calculate distances, and choose appropriate metrics is fundamental to building effective AI-powered applications. This comprehensive guide explores vector operations in Oracle 23ai with practical examples and best practices.

The VECTOR Data Type

Oracle 23ai introduces a native VECTOR data type designed specifically to store and manage vector embeddings efficiently within the database.

Declaration Syntax:

-- Flexible: any dimensions and format
embedding VECTOR

-- Specific dimensions, flexible format
embedding VECTOR(512)

-- Fully specified: dimensions and format
embedding VECTOR(512, FLOAT32)

Format Options:

  • INT8: 8-bit integers
  • FLOAT32: 32-bit floating-point (IEEE standard, most common)
  • FLOAT64: 64-bit floating-point (IEEE standard, higher precision)
  • BINARY: Binary vectors for specialized use cases

Oracle Database automatically casts values between formats when needed.

Creating Vectors with the VECTOR Constructor

The VECTOR constructor is a function that allows us to create vectors without storing them. It's particularly useful for learning purposes, testing, and ad-hoc queries.

Basic Syntax

VECTOR(array_literal [, number_of_dimensions] [, format])

Parameters:

  • array_literal: String representation of vector values
  • number_of_dimensions: Optional, inferred from array if not specified
  • format: Optional (INT8, FLOAT32, FLOAT64, BINARY)

Examples

Simple 2D Vector:

SELECT VECTOR('[0, 0]');

Output: [0, 0]

Vector with Scientific Notation:

SELECT VECTOR('[10, 0]');

Output: [1.0E+001, 0] (equivalent to [10, 0])

Specifying Dimensions:

SELECT VECTOR('[1, 2, 3]', 3, FLOAT32);

Creating Vectors from Variables:

DECLARE
  my_vector VECTOR;
BEGIN
  my_vector := VECTOR('[0.5, 0.8, 0.2]');
  DBMS_OUTPUT.PUT_LINE(my_vector);
END;
/

Best Practices for Learning

Use small number of dimensions (2-4) when learning vector concepts, as they're easier to visualize and understand. For production, match the dimensions to your embedding model (typically 384-1536).

The VECTOR_DISTANCE Function

VECTOR_DISTANCE is the main function for calculating the distance between two vectors. It takes two vector expressions and an optional distance metric.

Syntax

VECTOR_DISTANCE(expr1, expr2 [, distance_metric])

Parameters:

  • expr1, expr2: Two vector expressions to compare
  • distance_metric: Optional; defaults to COSINE (or HAMMING for BINARY vectors)

Returns: BINARY_DOUBLE representing the distance

Default Behavior

If you do not specify a distance metric, then the default distance metric is cosine. If the input vectors are BINARY vectors, the default metric is hamming.

Basic Example

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[1, 0]'),
    VECTOR('[0, 1]'),
    COSINE
  ) AS distance;

Vector Distance Metrics: Choosing the Right One

Oracle AI Vector Search supports multiple distance metrics, each suited for different use cases and data characteristics.

1. Euclidean Distance (L2)

Euclidean distance is the straight-line distance between two vectors. It follows from the Pythagorean theorem and is sensitive to both magnitude and direction.

Formula:

d = √[(x₂-x₁)² + (y₂-y₁)²]

SQL Example:

SELECT TO_NUMBER(
  VECTOR_DISTANCE(
    VECTOR('[3, 0]'),
    VECTOR('[0, 4]'),
    EUCLIDEAN
  )
) AS distance;

Result: 5 (forming a 3-4-5 right triangle)
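The same 3-4-5 arithmetic is easy to sanity-check outside the database. A quick standalone Python sketch (plain stdlib, nothing Oracle-specific):

```python
import math

# Straight-line (L2) distance between [3, 0] and [0, 4]
a, b = [3, 0], [0, 4]
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(euclidean)  # 5.0 -- the classic 3-4-5 right triangle
```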

Use Cases:

  • Spatial data and geographical coordinates
  • Physical measurements
  • Image similarity (pixel-level comparisons)
  • When absolute distance matters

2. Euclidean Squared Distance (L2_SQUARED)

This is Euclidean distance without the final square root. When relative ordering matters more than the actual distance values, squared Euclidean distance is useful because it skips the square-root computation while preserving rank order.

Advantage: Faster computation for ranking/sorting

SQL Example:

SELECT TO_NUMBER(
  VECTOR_DISTANCE(
    VECTOR('[3, 0]'),
    VECTOR('[0, 4]'),
    EUCLIDEAN_SQUARED
  )
) AS distance_squared;

Result: 25 (5²)
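A standalone Python check that the squared distance is 25, and that sorting by it gives the same ranking as sorting by true Euclidean distance (an illustration, not Oracle code):

```python
import math

# Squared Euclidean distance: skips the square root but preserves ordering
def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(l2_squared([3, 0], [0, 4]))  # 25

# Ranking candidates by squared distance matches ranking by true distance
points = [[0, 4], [1, 1], [10, 10]]
by_squared = sorted(points, key=lambda p: l2_squared([3, 0], p))
by_true = sorted(points, key=lambda p: math.dist([3, 0], p))
print(by_squared == by_true)  # True
```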

3. Cosine Similarity / Cosine Distance

Cosine similarity is the most widely used similarity metric, especially in natural language processing (NLP). The smaller the angle, the more similar the vectors are.

Cosine measures the angle between two vectors rather than their magnitude. It's ideal for text embeddings where direction matters more than magnitude.

Formula:

cosine_similarity = (A · B) / (||A|| × ||B||)
cosine_distance = 1 - cosine_similarity

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[1, 0]'),
    VECTOR('[1, 1]'),
    COSINE
  ) AS cosine_dist;
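For the vectors above, the angle is 45 degrees, so the distance works out to about 0.2929. A standalone Python check of the formula (not Oracle code):

```python
import math

# cosine_distance = 1 - (A . B) / (||A|| * ||B||)
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

print(round(cosine_distance([1, 0], [1, 1]), 4))  # 0.2929 -- a 45-degree angle
```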

Key Characteristics:

  • Range: 0 (identical) to 2 (opposite)
  • Normalized: Insensitive to vector magnitude
  • Best for: Text embeddings, document similarity, semantic search

Why It's Popular:

Cosine is one of the most useful metrics since it measures the angle between two vectors instead of the difference in size or position. This makes it perfect for comparing text embeddings where the semantic meaning is encoded in direction, not magnitude.

4. Dot Product Similarity

The dot product multiplies the magnitudes of the two vectors by the cosine of the angle between them; computationally, it is the sum of the products of corresponding coordinates. Larger values mean more similar; smaller values mean less similar.

Formula:

dot_product = Σ(aᵢ × bᵢ)

Note: Oracle's DOT metric calculates the negated dot product, so more negative values indicate greater similarity.

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[2, 3]'),
    VECTOR('[4, 5]'),
    DOT
  ) AS dot_distance;
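The raw dot product of [2, 3] and [4, 5] is 2×4 + 3×5 = 23, so with the negation noted above the DOT metric reports -23. A quick Python check of the arithmetic:

```python
# Dot product: sum of products of corresponding coordinates
a, b = [2, 3], [4, 5]
dot = sum(x * y for x, y in zip(a, b))
print(dot)   # 23
print(-dot)  # -23 -- the negated value the DOT metric reports
```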

Use Cases:

  • Recommendation systems
  • Similarity ranking with normalized vectors
  • Fast approximate nearest neighbor search

Important: For meaningful results, vectors should be normalized to unit length.

5. Manhattan Distance (L1)

Manhattan distance fits uniform grids such as city blocks, power grids, and chessboards, and is faster to compute than Euclidean distance.

Also known as taxicab distance or L1 distance, it calculates the sum of absolute differences between vector components.

Formula:

manhattan = Σ|aᵢ - bᵢ|

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[1, 2]'),
    VECTOR('[4, 6]'),
    MANHATTAN
  ) AS manhattan_dist;

Result: |4-1| + |6-2| = 7
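The sum-of-absolute-differences arithmetic, checked in standalone Python (not Oracle code):

```python
# Manhattan (L1) distance: sum of absolute component differences
a, b = [1, 2], [4, 6]
manhattan = sum(abs(x - y) for x, y in zip(a, b))
print(manhattan)  # 7  (|4-1| + |6-2|)
```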

Use Cases:

  • Grid-based problems
  • Route planning on city streets
  • Feature selection in machine learning
  • When diagonal movement isn't allowed

6. Hamming Distance

Hamming distance counts the positions at which two vectors differ. For BINARY vectors, it is the number of bits that would need to flip to make the two vectors match.

Because it compares bit positions directly, Hamming distance is widely used for network error detection and correction.

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[176]', 8, BINARY),  -- bit pattern 10110000
    VECTOR('[224]', 8, BINARY),  -- bit pattern 11100000
    HAMMING
  ) AS hamming_dist;

Result: 2 (the two bit patterns differ in two positions)

Note: Oracle packs BINARY vectors 8 dimensions per byte, so the declared dimension count must be a multiple of 8 and each array element is an 8-bit unsigned integer.
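Hamming distance is simply the population count of the XOR of the two bit patterns, which is easy to verify with plain Python integers (an illustration, not Oracle code):

```python
# Two example 8-bit patterns; XOR marks the positions that differ
x, y = 0b10110000, 0b11100000
print(bin(x ^ y).count("1"))  # 2 -- two bit positions differ
```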

Use Cases:

  • Error detection and correction
  • Genetic sequence comparison
  • Digital communication
  • Binary classification problems

7. Jaccard Distance

Jaccard distance measures dissimilarity between binary vectors based on the ratio of intersection to union.

Requirements: Both vectors must be BINARY

Formula:

jaccard = 1 - (|A ∩ B| / |A ∪ B|)

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[208]', 8, BINARY),  -- bit pattern 11010000
    VECTOR('[176]', 8, BINARY),  -- bit pattern 10110000
    JACCARD
  ) AS jaccard_dist;

Result: 0.5 (2 bits set in both vectors, out of 4 bits set in either)
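With the formula above, Jaccard distance reduces to popcounts of the bitwise AND and OR, which a few lines of plain Python can illustrate (not Oracle code):

```python
# Jaccard distance on two example bit patterns
x, y = 0b11010000, 0b10110000
intersection = bin(x & y).count("1")  # bits set in both
union = bin(x | y).count("1")         # bits set in either
print(1 - intersection / union)       # 0.5
```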

Use Cases:

  • Set similarity comparison
  • Document deduplication
  • Recommendation systems
  • Clustering binary data

Shorthand Distance Operators

Oracle provides convenient shorthand operators for common distance calculations, making SQL queries more concise and readable.

Available Operators

<-> Euclidean Distance Operator

SELECT '[1, 2]' <-> '[0, 1]' AS euclidean_dist;

Equivalent to:

SELECT L2_DISTANCE(VECTOR('[1, 2]'), VECTOR('[0, 1]')) AS euclidean_dist;

or

SELECT VECTOR_DISTANCE(VECTOR('[1, 2]'), VECTOR('[0, 1]'), EUCLIDEAN) AS euclidean_dist;

<=> Cosine Distance Operator

SELECT '[1, 0]' <=> '[0, 1]' AS cosine_dist;

Equivalent to:

SELECT COSINE_DISTANCE(VECTOR('[1, 0]'), VECTOR('[0, 1]')) AS cosine_dist;

<#> Negative Dot Product Operator

SELECT '[2, 3]' <#> '[4, 5]' AS neg_dot_product;

Equivalent to:

SELECT -1 * INNER_PRODUCT(VECTOR('[2, 3]'), VECTOR('[4, 5]')) AS neg_dot_product;

Practical Example

-- Find the 5 products nearest to product 100 by embedding distance
SELECT 
  p1.product_name,
  p2.product_name,
  p1.embedding <=> p2.embedding AS cosine_dist
FROM products p1, products p2
WHERE p1.product_id = 100
  AND p2.product_id != 100
ORDER BY cosine_dist
FETCH FIRST 5 ROWS ONLY;

Shorthand Distance Functions

In addition to operators, Oracle provides shorthand functions for cleaner code:

  • L1_DISTANCE: Manhattan distance
  • L2_DISTANCE: Euclidean distance
  • COSINE_DISTANCE: Cosine distance
  • INNER_PRODUCT: Dot product (not negated)
  • HAMMING_DISTANCE: Hamming distance
  • JACCARD_DISTANCE: Jaccard distance

Example:

SELECT 
  L2_DISTANCE(v1.embedding, v2.embedding) AS euclidean,
  COSINE_DISTANCE(v1.embedding, v2.embedding) AS cosine,
  INNER_PRODUCT(v1.embedding, v2.embedding) AS dot_prod
FROM vectors v1, vectors v2
WHERE v1.id = 1 AND v2.id = 2;

Performing Similarity Search

The VECTOR_DISTANCE function can be used to perform similarity search by ordering results based on vector proximity.

Exact Similarity Search

SELECT 
  id,
  description,
  VECTOR_DISTANCE(embedding, :query_vector, COSINE) AS distance
FROM documents
ORDER BY distance
FETCH FIRST 10 ROWS ONLY;

Approximate Similarity Search (with Index)

-- Using FETCH APPROXIMATE with vector index
SELECT 
  id,
  description
FROM documents
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

Key Difference:

  • EXACT: Compares query vector with every vector (slower, 100% accurate)
  • APPROXIMATE: Uses vector indexes (HNSW/IVF) for fast search with ~95% accuracy
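Conceptually, the exact path is a brute-force scan: compute the distance from the query to every stored vector, then sort. A minimal standalone Python sketch with made-up data (illustrating the idea, not Oracle's implementation):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Hypothetical tiny corpus; exact search compares the query to every vector
corpus = {"doc1": [1.0, 0.0], "doc2": [0.7, 0.7], "doc3": [0.0, 1.0]}
query = [0.9, 0.1]

top2 = sorted(corpus, key=lambda k: cosine_distance(corpus[k], query))[:2]
print(top2)  # ['doc1', 'doc2'] -- nearest first
```

Vector indexes such as HNSW avoid this full scan by probing only a neighborhood of the query, trading a little recall for speed.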

Choosing the Right Distance Metric

Decision Matrix

| Use Case | Recommended Metric | Why |
| --- | --- | --- |
| Text embeddings | COSINE | Captures semantic similarity, magnitude-invariant |
| Image similarity | EUCLIDEAN | Pixel-level comparisons benefit from absolute distance |
| Recommendation systems | DOT (normalized vectors) | Fast computation, works well with normalized data |
| Grid/route problems | MANHATTAN | Natural fit for grid-based navigation |
| Binary classification | HAMMING | Direct bit difference counting |
| Error detection | HAMMING | Counts differing positions |
| Set similarity | JACCARD | Measures intersection/union ratio |

Performance Considerations

Fastest to Slowest:

  1. DOT (simple multiplication and sum)
  2. EUCLIDEAN_SQUARED (avoids square root)
  3. MANHATTAN (absolute values and sum)
  4. EUCLIDEAN (includes square root)
  5. COSINE (normalization overhead)

Matching Index and Query Metrics

If a similarity search query specifies a distance metric that conflicts with the metric in a vector index, the vector index is not used and an exact search is performed instead.

Best Practice: Ensure your query metric matches your index metric for optimal performance.

-- Index created with COSINE
CREATE VECTOR INDEX docs_idx ON documents(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

-- Query should also use COSINE for index usage
SELECT id FROM documents
ORDER BY embedding <=> :query_vector  -- Uses COSINE operator
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

Complete Working Example

Here's a comprehensive example demonstrating vector creation, storage, and similarity search:

-- Create table with vector column
CREATE TABLE product_embeddings (
  product_id NUMBER PRIMARY KEY,
  product_name VARCHAR2(100),
  description CLOB,
  embedding VECTOR(384, FLOAT32)
);

-- Insert sample data
INSERT INTO product_embeddings VALUES (
  1,
  'Laptop Computer',
  'High-performance laptop for developers',
  VECTOR('[0.2, 0.8, 0.5, ...]', 384, FLOAT32)
);

INSERT INTO product_embeddings VALUES (
  2,
  'Wireless Mouse',
  'Ergonomic wireless mouse',
  VECTOR('[0.1, 0.3, 0.7, ...]', 384, FLOAT32)
);

COMMIT;

-- Create vector index for fast similarity search
CREATE VECTOR INDEX product_emb_idx 
ON product_embeddings(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

-- Perform similarity search
DECLARE
  query_vec VECTOR;
BEGIN
  -- Create query vector (in practice, from embedding model)
  query_vec := VECTOR('[0.15, 0.75, 0.6, ...]', 384, FLOAT32);

  -- Find similar products
  FOR rec IN (
    SELECT 
      product_name,
      description,
      COSINE_DISTANCE(embedding, query_vec) AS cos_distance
    FROM product_embeddings
    ORDER BY embedding <=> query_vec
    FETCH APPROXIMATE FIRST 5 ROWS ONLY
  ) LOOP
    DBMS_OUTPUT.PUT_LINE(
      'Product: ' || rec.product_name || 
      ' | Distance: ' || rec.cos_distance
    );
  END LOOP;
END;
/

Best Practices

1. Choose Appropriate Dimensions

Match embedding dimensions to your model:

  • 384: MiniLM, lightweight models
  • 768: BERT, sentence transformers
  • 1024: Cohere embedding models
  • 1536: OpenAI ada-002

2. Normalize Vectors When Using DOT

Normalization is usually done in the application or embedding pipeline before vectors are loaded. Inside the database, VECTOR_NORM lets you verify that stored vectors are unit length:

-- Unit-length vectors report a norm of 1
SELECT VECTOR_NORM(embedding) AS magnitude
FROM documents
WHERE id = 1;
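A unit-length vector has norm 1, so normalizing is just dividing each component by the vector's norm. A standalone Python sketch of what that pipeline step looks like (not Oracle code):

```python
import math

def normalize(v):
    """Scale a vector to unit length before loading it for DOT-based search."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

v = normalize([3.0, 4.0])
print(v)                                           # [0.6, 0.8]
print(round(math.sqrt(sum(x * x for x in v)), 6))  # 1.0
```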

3. Use Appropriate Formats

  • FLOAT32: Default, balances precision and performance
  • FLOAT64: When high precision is critical
  • INT8: For quantized models, saves storage

4. Monitor Index Accuracy

-- Check if index is being used
EXPLAIN PLAN FOR
SELECT id FROM documents
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

5. Benchmark Different Metrics

Test multiple metrics on your data to find the best performer:

-- Compare metrics
SELECT 
  'COSINE' AS metric,
  AVG(COSINE_DISTANCE(e1.embedding, e2.embedding)) AS avg_dist
FROM embeddings e1, embeddings e2
WHERE e1.id != e2.id
UNION ALL
SELECT 
  'EUCLIDEAN',
  AVG(L2_DISTANCE(e1.embedding, e2.embedding))
FROM embeddings e1, embeddings e2
WHERE e1.id != e2.id;

Common Pitfalls and Solutions

Pitfall 1: Dimension Mismatch

Problem: Comparing vectors with different dimensions

-- This will error
SELECT VECTOR('[1, 2]') <=> VECTOR('[1, 2, 3]');

Solution: Ensure all vectors have the same dimensions

Pitfall 2: Wrong Metric for Data Type

Problem: Using JACCARD on non-binary vectors

Solution: Use JACCARD only with BINARY vectors

Pitfall 3: Not Using Indexes

Problem: Slow similarity searches on large datasets

Solution: Create appropriate vector indexes (HNSW for speed, IVF for scale)

Pitfall 4: Metric Mismatch with Index

Problem: Query metric conflicts with index metric, causing full table scan

Solution: Match query metric to index metric

Conclusion

Oracle Database 23ai's native vector capabilities provide a powerful, integrated platform for semantic search and AI-powered applications. Key takeaways:

Vector Creation:

  • VECTOR constructor for creating vectors directly in SQL
  • Support for multiple formats (INT8, FLOAT32, FLOAT64, BINARY)
  • Flexible dimension specification

Distance Metrics:

  • COSINE: Default, best for text embeddings and semantic similarity
  • EUCLIDEAN: Straight-line distance, good for spatial data
  • DOT: Fast for normalized vectors, recommendation systems
  • MANHATTAN: Grid-based problems, faster than Euclidean
  • HAMMING: Binary vectors, error detection
  • JACCARD: Set similarity with binary vectors

Shorthand Operators:

  • <-> for Euclidean distance
  • <=> for Cosine distance
  • <#> for negative dot product

Best Practices:

  • Match metric to your use case and embedding model
  • Create appropriate vector indexes
  • Ensure metric consistency between index and queries
  • Use approximate search for large datasets
  • Benchmark different metrics on your data

By understanding these vector operations and distance metrics, you can build efficient, accurate similarity search applications entirely within Oracle Database 23ai.
