DEV Community

Ryan Giggs

Oracle Database 23ai: Creating Vectors and Understanding Distance Metrics for Similarity Search

Oracle Database 23ai introduces native vector capabilities that enable semantic search directly within SQL. Understanding how to create vectors, calculate distances, and choose appropriate metrics is fundamental to building effective AI-powered applications. This comprehensive guide explores vector operations in Oracle 23ai with practical examples and best practices.

The VECTOR Data Type

Oracle 23ai introduces a native VECTOR data type designed specifically to store and manage vector embeddings efficiently within the database.

Declaration Syntax:

-- Flexible: any dimensions and format
embedding VECTOR

-- Specific dimensions, flexible format
embedding VECTOR(512)

-- Fully specified: dimensions and format
embedding VECTOR(512, FLOAT32)

Format Options:

  • INT8: 8-bit integers
  • FLOAT32: 32-bit floating-point (IEEE standard, most common)
  • FLOAT64: 64-bit floating-point (IEEE standard, higher precision)
  • BINARY: Binary vectors for specialized use cases

Oracle Database automatically casts values between formats when needed.

Creating Vectors with the VECTOR Constructor

The VECTOR constructor is a function that allows us to create vectors without storing them. It's particularly useful for learning purposes, testing, and ad-hoc queries.

Basic Syntax

VECTOR(array_literal [, number_of_dimensions] [, format])

Parameters:

  • array_literal: String representation of vector values
  • number_of_dimensions: Optional, inferred from array if not specified
  • format: Optional (INT8, FLOAT32, FLOAT64, BINARY)

Examples

Simple 2D Vector:

SELECT VECTOR('[0, 0]');

Output: [0, 0]

Vector with Scientific Notation:

SELECT VECTOR('[10, 0]');

Output: [1.0E+001, 0] (equivalent to [10, 0])

Specifying Dimensions:

SELECT VECTOR('[1, 2, 3]', 3, FLOAT32);

Creating Vectors from Variables:

DECLARE
  my_vector VECTOR;
BEGIN
  my_vector := VECTOR('[0.5, 0.8, 0.2]');
  DBMS_OUTPUT.PUT_LINE(my_vector);
END;
/

Best Practices for Learning

Use small number of dimensions (2-4) when learning vector concepts, as they're easier to visualize and understand. For production, match the dimensions to your embedding model (typically 384-1536).

The VECTOR_DISTANCE Function

VECTOR_DISTANCE is the main function for calculating the distance between two vectors. It takes two vector expressions and an optional distance metric.

Syntax

VECTOR_DISTANCE(expr1, expr2 [, distance_metric])

Parameters:

  • expr1, expr2: Two vector expressions to compare
  • distance_metric: Optional; defaults to COSINE (or HAMMING for BINARY vectors)

Returns: BINARY_DOUBLE representing the distance

Default Behavior

If you do not specify a distance metric, then the default distance metric is cosine. If the input vectors are BINARY vectors, the default metric is hamming.

Basic Example

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[1, 0]'),
    VECTOR('[0, 1]'),
    COSINE
  ) AS distance;

Vector Distance Metrics: Choosing the Right One

Oracle AI Vector Search supports multiple distance metrics, each suited for different use cases and data characteristics.

1. Euclidean Distance (L2)

Euclidean distance is the straight-line distance between two vectors. It follows from the Pythagorean theorem and is sensitive to both magnitude and direction.

Formula:

d = √[(x₂-x₁)² + (y₂-y₁)²]

SQL Example:

SELECT TO_NUMBER(
  VECTOR_DISTANCE(
    VECTOR('[3, 0]'),
    VECTOR('[0, 4]'),
    EUCLIDEAN
  )
) AS distance;

Result: 5 (forming a 3-4-5 right triangle)
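The same 3-4-5 arithmetic is easy to sanity-check outside the database. A quick standalone Python sketch (plain stdlib, nothing Oracle-specific):

```python
import math

# Straight-line (L2) distance between [3, 0] and [0, 4]
a, b = [3, 0], [0, 4]
euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
print(euclidean)  # 5.0 -- the classic 3-4-5 right triangle
```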

Use Cases:

  • Spatial data and geographical coordinates
  • Physical measurements
  • Image similarity (pixel-level comparisons)
  • When absolute distance matters

2. Euclidean Squared Distance (L2_SQUARED)

This is Euclidean distance without the final square root. When relative ordering matters more than the actual distance values, squared Euclidean distance is useful because it skips the square-root computation while preserving rank order.

Advantage: Faster computation for ranking/sorting

SQL Example:

SELECT TO_NUMBER(
  VECTOR_DISTANCE(
    VECTOR('[3, 0]'),
    VECTOR('[0, 4]'),
    EUCLIDEAN_SQUARED
  )
) AS distance_squared;

Result: 25 (5²)
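A standalone Python check that the squared distance is 25, and that sorting by it gives the same ranking as sorting by true Euclidean distance (an illustration, not Oracle code):

```python
import math

# Squared Euclidean distance: skips the square root but preserves ordering
def l2_squared(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

print(l2_squared([3, 0], [0, 4]))  # 25

# Ranking candidates by squared distance matches ranking by true distance
points = [[0, 4], [1, 1], [10, 10]]
by_squared = sorted(points, key=lambda p: l2_squared([3, 0], p))
by_true = sorted(points, key=lambda p: math.dist([3, 0], p))
print(by_squared == by_true)  # True
```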

3. Cosine Similarity / Cosine Distance

Cosine similarity is the most widely used similarity metric, especially in natural language processing (NLP). The smaller the angle, the more similar the vectors are.

Cosine measures the angle between two vectors rather than their magnitude. It's ideal for text embeddings where direction matters more than magnitude.

Formula:

cosine_similarity = (A · B) / (||A|| × ||B||)
cosine_distance = 1 - cosine_similarity

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[1, 0]'),
    VECTOR('[1, 1]'),
    COSINE
  ) AS cosine_dist;
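For the vectors above, the angle is 45 degrees, so the distance works out to about 0.2929. A standalone Python check of the formula (not Oracle code):

```python
import math

# cosine_distance = 1 - (A . B) / (||A|| * ||B||)
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

print(round(cosine_distance([1, 0], [1, 1]), 4))  # 0.2929 -- a 45-degree angle
```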

Key Characteristics:

  • Range: 0 (identical) to 2 (opposite)
  • Normalized: Insensitive to vector magnitude
  • Best for: Text embeddings, document similarity, semantic search

Why It's Popular:

Cosine is one of the most useful metrics since it measures the angle between two vectors instead of the difference in size or position. This makes it perfect for comparing text embeddings where the semantic meaning is encoded in direction, not magnitude.

4. Dot Product Similarity

The dot product multiplies the magnitudes of the two vectors by the cosine of the angle between them; computationally, it is the sum of the products of corresponding coordinates. Larger values mean more similar; smaller values mean less similar.

Formula:

dot_product = Σ(aᵢ × bᵢ)

Note: Oracle's DOT metric calculates the negated dot product, so more negative values indicate greater similarity.

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[2, 3]'),
    VECTOR('[4, 5]'),
    DOT
  ) AS dot_distance;
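The raw dot product of [2, 3] and [4, 5] is 2×4 + 3×5 = 23, so with the negation noted above the DOT metric reports -23. A quick Python check of the arithmetic:

```python
# Dot product: sum of products of corresponding coordinates
a, b = [2, 3], [4, 5]
dot = sum(x * y for x, y in zip(a, b))
print(dot)   # 23
print(-dot)  # -23 -- the negated value the DOT metric reports
```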

Use Cases:

  • Recommendation systems
  • Similarity ranking with normalized vectors
  • Fast approximate nearest neighbor search

Important: For meaningful results, vectors should be normalized to unit length.

5. Manhattan Distance (L1)

Manhattan distance fits uniform grids such as city blocks, power grids, and chessboards, and is faster to compute than Euclidean distance.

Also known as taxicab distance or L1 distance, it calculates the sum of absolute differences between vector components.

Formula:

manhattan = Σ|aᵢ - bᵢ|

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[1, 2]'),
    VECTOR('[4, 6]'),
    MANHATTAN
  ) AS manhattan_dist;

Result: |4-1| + |6-2| = 7
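The sum-of-absolute-differences arithmetic, checked in standalone Python (not Oracle code):

```python
# Manhattan (L1) distance: sum of absolute component differences
a, b = [1, 2], [4, 6]
manhattan = sum(abs(x - y) for x, y in zip(a, b))
print(manhattan)  # 7  (|4-1| + |6-2|)
```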

Use Cases:

  • Grid-based problems
  • Route planning on city streets
  • Feature selection in machine learning
  • When diagonal movement isn't allowed

6. Hamming Distance

Hamming distance counts the positions at which two vectors differ. For BINARY vectors, it is the number of bits that would need to flip to make the two vectors match.

Because it compares bit positions directly, Hamming distance is widely used for network error detection and correction.

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[176]', 8, BINARY),  -- bit pattern 10110000
    VECTOR('[224]', 8, BINARY),  -- bit pattern 11100000
    HAMMING
  ) AS hamming_dist;

Result: 2 (the two bit patterns differ in two positions)

Note: Oracle packs BINARY vectors 8 dimensions per byte, so the declared dimension count must be a multiple of 8 and each array element is an 8-bit unsigned integer.
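Hamming distance is simply the population count of the XOR of the two bit patterns, which is easy to verify with plain Python integers (an illustration, not Oracle code):

```python
# Two example 8-bit patterns; XOR marks the positions that differ
x, y = 0b10110000, 0b11100000
print(bin(x ^ y).count("1"))  # 2 -- two bit positions differ
```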

Use Cases:

  • Error detection and correction
  • Genetic sequence comparison
  • Digital communication
  • Binary classification problems

7. Jaccard Distance

Jaccard distance measures dissimilarity between binary vectors based on the ratio of intersection to union.

Requirements: Both vectors must be BINARY

Formula:

jaccard = 1 - (|A ∩ B| / |A ∪ B|)

SQL Example:

SELECT 
  VECTOR_DISTANCE(
    VECTOR('[208]', 8, BINARY),  -- bit pattern 11010000
    VECTOR('[176]', 8, BINARY),  -- bit pattern 10110000
    JACCARD
  ) AS jaccard_dist;

Result: 0.5 (2 bits set in both vectors, out of 4 bits set in either)
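With the formula above, Jaccard distance reduces to popcounts of the bitwise AND and OR, which a few lines of plain Python can illustrate (not Oracle code):

```python
# Jaccard distance on two example bit patterns
x, y = 0b11010000, 0b10110000
intersection = bin(x & y).count("1")  # bits set in both
union = bin(x | y).count("1")         # bits set in either
print(1 - intersection / union)       # 0.5
```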

Use Cases:

  • Set similarity comparison
  • Document deduplication
  • Recommendation systems
  • Clustering binary data

Shorthand Distance Operators

Oracle provides convenient shorthand operators for common distance calculations, making SQL queries more concise and readable.

Available Operators

<-> Euclidean Distance Operator

SELECT '[1, 2]' <-> '[0, 1]' AS euclidean_dist;

Equivalent to:

SELECT L2_DISTANCE(VECTOR('[1, 2]'), VECTOR('[0, 1]')) AS euclidean_dist;

or

SELECT VECTOR_DISTANCE(VECTOR('[1, 2]'), VECTOR('[0, 1]'), EUCLIDEAN) AS euclidean_dist;

<=> Cosine Distance Operator

SELECT '[1, 0]' <=> '[0, 1]' AS cosine_dist;

Equivalent to:

SELECT COSINE_DISTANCE(VECTOR('[1, 0]'), VECTOR('[0, 1]')) AS cosine_dist;

<#> Negative Dot Product Operator

SELECT '[2, 3]' <#> '[4, 5]' AS neg_dot_product;

Equivalent to:

SELECT -1 * INNER_PRODUCT(VECTOR('[2, 3]'), VECTOR('[4, 5]')) AS neg_dot_product;

Practical Example

-- Find the 5 products nearest to product 100 by embedding distance
SELECT 
  p1.product_name,
  p2.product_name,
  p1.embedding <=> p2.embedding AS cosine_dist
FROM products p1, products p2
WHERE p1.product_id = 100
  AND p2.product_id != 100
ORDER BY cosine_dist
FETCH FIRST 5 ROWS ONLY;

Shorthand Distance Functions

In addition to operators, Oracle provides shorthand functions for cleaner code:

  • L1_DISTANCE: Manhattan distance
  • L2_DISTANCE: Euclidean distance
  • COSINE_DISTANCE: Cosine distance
  • INNER_PRODUCT: Dot product (not negated)
  • HAMMING_DISTANCE: Hamming distance
  • JACCARD_DISTANCE: Jaccard distance

Example:

SELECT 
  L2_DISTANCE(v1.embedding, v2.embedding) AS euclidean,
  COSINE_DISTANCE(v1.embedding, v2.embedding) AS cosine,
  INNER_PRODUCT(v1.embedding, v2.embedding) AS dot_prod
FROM vectors v1, vectors v2
WHERE v1.id = 1 AND v2.id = 2;

Performing Similarity Search

The VECTOR_DISTANCE function can be used to perform similarity search by ordering results based on vector proximity.

Exact Similarity Search

SELECT 
  id,
  description,
  VECTOR_DISTANCE(embedding, :query_vector, COSINE) AS distance
FROM documents
ORDER BY distance
FETCH FIRST 10 ROWS ONLY;

Approximate Similarity Search (with Index)

-- Using FETCH APPROXIMATE with vector index
SELECT 
  id,
  description
FROM documents
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

Key Difference:

  • EXACT: Compares query vector with every vector (slower, 100% accurate)
  • APPROXIMATE: Uses vector indexes (HNSW/IVF) for fast search with ~95% accuracy
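Conceptually, the exact path is a brute-force scan: compute the distance from the query to every stored vector, then sort. A minimal standalone Python sketch with made-up data (illustrating the idea, not Oracle's implementation):

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Hypothetical tiny corpus; exact search compares the query to every vector
corpus = {"doc1": [1.0, 0.0], "doc2": [0.7, 0.7], "doc3": [0.0, 1.0]}
query = [0.9, 0.1]

top2 = sorted(corpus, key=lambda k: cosine_distance(corpus[k], query))[:2]
print(top2)  # ['doc1', 'doc2'] -- nearest first
```

Vector indexes such as HNSW avoid this full scan by probing only a neighborhood of the query, trading a little recall for speed.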

Choosing the Right Distance Metric

Decision Matrix

| Use Case | Recommended Metric | Why |
| --- | --- | --- |
| Text embeddings | COSINE | Captures semantic similarity, magnitude-invariant |
| Image similarity | EUCLIDEAN | Pixel-level comparisons benefit from absolute distance |
| Recommendation systems | DOT (normalized vectors) | Fast computation, works well with normalized data |
| Grid/route problems | MANHATTAN | Natural fit for grid-based navigation |
| Binary classification | HAMMING | Direct bit difference counting |
| Error detection | HAMMING | Counts differing positions |
| Set similarity | JACCARD | Measures intersection/union ratio |

Performance Considerations

Fastest to Slowest:

  1. DOT (simple multiplication and sum)
  2. EUCLIDEAN_SQUARED (avoids square root)
  3. MANHATTAN (absolute values and sum)
  4. EUCLIDEAN (includes square root)
  5. COSINE (normalization overhead)

Matching Index and Query Metrics

If a similarity search query specifies a distance metric that conflicts with the metric in a vector index, the vector index is not used and an exact search is performed instead.

Best Practice: Ensure your query metric matches your index metric for optimal performance.

-- Index created with COSINE
CREATE VECTOR INDEX docs_idx ON documents(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

-- Query should also use COSINE for index usage
SELECT id FROM documents
ORDER BY embedding <=> :query_vector  -- Uses COSINE operator
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

Complete Working Example

Here's a comprehensive example demonstrating vector creation, storage, and similarity search:

-- Create table with vector column
CREATE TABLE product_embeddings (
  product_id NUMBER PRIMARY KEY,
  product_name VARCHAR2(100),
  description CLOB,
  embedding VECTOR(384, FLOAT32)
);

-- Insert sample data
INSERT INTO product_embeddings VALUES (
  1,
  'Laptop Computer',
  'High-performance laptop for developers',
  VECTOR('[0.2, 0.8, 0.5, ...]', 384, FLOAT32)
);

INSERT INTO product_embeddings VALUES (
  2,
  'Wireless Mouse',
  'Ergonomic wireless mouse',
  VECTOR('[0.1, 0.3, 0.7, ...]', 384, FLOAT32)
);

COMMIT;

-- Create vector index for fast similarity search
CREATE VECTOR INDEX product_emb_idx 
ON product_embeddings(embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;

-- Perform similarity search
DECLARE
  query_vec VECTOR;
BEGIN
  -- Create query vector (in practice, from embedding model)
  query_vec := VECTOR('[0.15, 0.75, 0.6, ...]', 384, FLOAT32);

  -- Find similar products
  FOR rec IN (
    SELECT 
      product_name,
      description,
      COSINE_DISTANCE(embedding, query_vec) AS cos_distance
    FROM product_embeddings
    ORDER BY embedding <=> query_vec
    FETCH APPROXIMATE FIRST 5 ROWS ONLY
  ) LOOP
    DBMS_OUTPUT.PUT_LINE(
      'Product: ' || rec.product_name || 
      ' | Distance: ' || rec.cos_distance
    );
  END LOOP;
END;
/

Best Practices

1. Choose Appropriate Dimensions

Match embedding dimensions to your model:

  • 384: MiniLM, lightweight models
  • 768: BERT, sentence transformers
  • 1024: Cohere embedding models
  • 1536: OpenAI ada-002

2. Normalize Vectors When Using DOT

Normalization is usually done in the application or embedding pipeline before vectors are loaded. Inside the database, VECTOR_NORM lets you verify that stored vectors are unit length:

-- Unit-length vectors report a norm of 1
SELECT VECTOR_NORM(embedding) AS magnitude
FROM documents
WHERE id = 1;
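A unit-length vector has norm 1, so normalizing is just dividing each component by the vector's norm. A standalone Python sketch of what that pipeline step looks like (not Oracle code):

```python
import math

def normalize(v):
    """Scale a vector to unit length before loading it for DOT-based search."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

v = normalize([3.0, 4.0])
print(v)                                           # [0.6, 0.8]
print(round(math.sqrt(sum(x * x for x in v)), 6))  # 1.0
```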

3. Use Appropriate Formats

  • FLOAT32: Default, balances precision and performance
  • FLOAT64: When high precision is critical
  • INT8: For quantized models, saves storage

4. Monitor Index Accuracy

-- Check if index is being used
EXPLAIN PLAN FOR
SELECT id FROM documents
ORDER BY embedding <=> :query_vector
FETCH APPROXIMATE FIRST 10 ROWS ONLY;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

5. Benchmark Different Metrics

Test multiple metrics on your data to find the best performer:

-- Compare metrics
SELECT 
  'COSINE' AS metric,
  AVG(COSINE_DISTANCE(e1.embedding, e2.embedding)) AS avg_dist
FROM embeddings e1, embeddings e2
WHERE e1.id != e2.id
UNION ALL
SELECT 
  'EUCLIDEAN',
  AVG(L2_DISTANCE(e1.embedding, e2.embedding))
FROM embeddings e1, embeddings e2
WHERE e1.id != e2.id;

Common Pitfalls and Solutions

Pitfall 1: Dimension Mismatch

Problem: Comparing vectors with different dimensions

-- This will error
SELECT VECTOR('[1, 2]') <=> VECTOR('[1, 2, 3]');

Solution: Ensure all vectors have the same dimensions

Pitfall 2: Wrong Metric for Data Type

Problem: Using JACCARD on non-binary vectors

Solution: Use JACCARD only with BINARY vectors

Pitfall 3: Not Using Indexes

Problem: Slow similarity searches on large datasets

Solution: Create appropriate vector indexes (HNSW for speed, IVF for scale)

Pitfall 4: Metric Mismatch with Index

Problem: Query metric conflicts with index metric, causing full table scan

Solution: Match query metric to index metric

Conclusion

Oracle Database 23ai's native vector capabilities provide a powerful, integrated platform for semantic search and AI-powered applications. Key takeaways:

Vector Creation:

  • VECTOR constructor for creating vectors directly in SQL
  • Support for multiple formats (INT8, FLOAT32, FLOAT64, BINARY)
  • Flexible dimension specification

Distance Metrics:

  • COSINE: Default, best for text embeddings and semantic similarity
  • EUCLIDEAN: Straight-line distance, good for spatial data
  • DOT: Fast for normalized vectors, recommendation systems
  • MANHATTAN: Grid-based problems, faster than Euclidean
  • HAMMING: Binary vectors, error detection
  • JACCARD: Set similarity with binary vectors

Shorthand Operators:

  • <-> for Euclidean distance
  • <=> for Cosine distance
  • <#> for negative dot product

Best Practices:

  • Match metric to your use case and embedding model
  • Create appropriate vector indexes
  • Ensure metric consistency between index and queries
  • Use approximate search for large datasets
  • Benchmark different metrics on your data

By understanding these vector operations and distance metrics, you can build efficient, accurate similarity search applications entirely within Oracle Database 23ai.
