## From Relational to Intelligence: Unlocking Your Database's AI Potential
The Artificial Intelligence (AI) revolution is reshaping industries and how we interact with technology. For many, AI seems like a distant universe, accessible only to startups with unlimited budgets. But what if I told you the key to unlocking AI's power might already be in your hands, in a place you least expect: your relational database?
In this post, we'll embark on a technical journey to transform your trusty relational database into a powerful source of AI insights. We'll explore how to leverage your existing data, using modern development practices with TypeScript and Node.js, to build intelligent applications.
### The Challenge: Data Overload, Underutilized Insights
Relational databases, like PostgreSQL, MySQL, and SQL Server, are the backbone of countless applications. They store structured data in an organized manner, enabling efficient queries and referential integrity. However, they were not traditionally designed to handle the complexities of unstructured data or to perform machine learning inferences directly.
The data explosion—text, images, audio, and video—presents a challenge. How can we extract predictive value and actionable insights from these vast relational data repositories, which often contain crucial information for smart decision-making?
### The Solution: Vectorization and Vector Databases
The answer lies in two powerful techniques: vectorization and the rise of vector databases.
- **Vectorization:** This is the process of converting data (text, images, etc.) into high-dimensional numerical representations called "vectors" or "embeddings." Trained AI models (like those from OpenAI, Google, or the open-source ecosystem) perform this conversion, capturing the semantics and context of the data. Similar data ends up with vectors that are close to each other in this multidimensional space (see the short sketch below).
- **Vector Databases:** These are databases optimized for storing and querying these high-dimensional vectors efficiently. They allow you to find vectors "similar" to a query vector, which is the foundation for many AI applications like semantic search, recommendation systems, and anomaly detection.
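To make "close to each other" concrete, here is a minimal, self-contained sketch of cosine similarity, one common way to measure how related two embeddings are. The three-dimensional vectors are invented purely for illustration; real embeddings have hundreds or thousands of dimensions.

```typescript
// A tiny sketch of measuring "closeness" between embeddings with cosine similarity.
// The vectors below are made up for illustration; real embeddings are much larger.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const normB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (normA * normB); // ~1 = same direction, ~0 = unrelated
}

const gamingLaptop = [0.9, 0.1, 0.3];     // hypothetical embedding of "gaming laptop"
const gamingDesktop = [0.85, 0.15, 0.35]; // hypothetical embedding of "gaming desktop"
const coffeeMug = [0.05, 0.9, 0.1];       // hypothetical embedding of "coffee mug"

console.log(cosineSimilarity(gamingLaptop, gamingDesktop)); // high: similar meaning
console.log(cosineSimilarity(gamingLaptop, coffeeMug));     // lower: unrelated meaning
```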
But the good news is you don't need to migrate your entire relational database to a dedicated vector database. We can integrate vector functionality into your existing system.
### Integrating Vectors into Your Relational Database
Many modern relational databases now support extensions or data types that allow for efficient storage and querying of vectors. The pgvector extension for PostgreSQL is a prominent example: it adds a `vector` data type and distance operators that make similarity searches (such as k-nearest-neighbor, or k-NN, queries) possible directly in SQL.
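To get a feel for what pgvector adds, here is a minimal raw-SQL sketch; the table and values are made up for illustration, and it assumes the extension is available in your PostgreSQL instance. The `<->` operator orders rows by Euclidean distance to the query vector.

```sql
-- Minimal sketch of pgvector's vector type and distance operator
-- (illustrative table and values, assuming the extension is available).
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vector(3)
);

INSERT INTO items (embedding) VALUES ('[1, 2, 3]'), ('[4, 5, 6]');

-- k-NN: the rows whose embeddings are closest (Euclidean distance) to the query vector
SELECT id FROM items ORDER BY embedding <-> '[2, 3, 4]' LIMIT 2;
```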
Let's demonstrate how this works using TypeScript and Node.js with a practical example: semantic search in a product catalog.
**Scenario:** We have a relational database with product information (name, description, price). We want to allow users to search for products using natural language, finding matches based on meaning, not just exact keywords.

**Steps:**
- **Database Setup:** Install the `pgvector` extension in your PostgreSQL instance.
- **Embedding Generation:** Use a language model API (like OpenAI's) to convert product descriptions into vectors.
- **Storage:** Store these vectors alongside the product data in your relational database.
- **Search:** When a user makes a query, convert the query into a vector and use `pgvector` to find the most similar products.
### Code Example (TypeScript/Node.js)
First, let's set up our environment. We assume you already have a Node.js project set up with TypeScript and a PostgreSQL client (like `pg` or `typeorm`).
**1. Product Type Definition with Vector:**
```typescript
// src/types/Product.ts

/**
 * Represents a product in our catalog, including its vector representation.
 */
export interface Product {
  /** Assigned by the database on insert, so it is optional before insertion. */
  id?: number;
  name: string;
  description: string;
  price: number;
  /**
   * Vector representation of the product description.
   * Stored as an array of numbers. The dimension must match
   * the output of the embedding model used.
   */
  embedding: number[];
}
```
**2. Function to Generate Embeddings (Mocked):**
In a real-world scenario, you would call an external API here. For demonstration purposes, we'll simulate embedding generation; a sketch of a real API call follows the mock below.
```typescript
// src/services/embeddingService.ts
import { Product } from '../types/Product';

// Simulates calling an embedding service (e.g., OpenAI API).
// In production, use libraries like 'openai' or similar.
async function generateEmbedding(text: string): Promise<number[]> {
  // Simulation: derives an example 8-dimensional vector from the text length,
  // so different texts produce different (but meaningless) embeddings.
  const textLength = text.length;
  const dynamicVector = Array.from({ length: 8 }, (_, i) =>
    parseFloat((Math.sin(i * textLength / 100 + Date.now() / 1e9) * 0.5 + 0.5).toFixed(4))
  );
  console.log(`Generated embedding for text: "${text.substring(0, 30)}..."`);
  return dynamicVector;
}

/**
 * Generates embeddings for product descriptions and attaches them to Product objects.
 * @param products Array of products for which to generate embeddings.
 * @returns Array of products with the 'embedding' property populated.
 */
export async function enrichProductsWithEmbeddings(products: Omit<Product, 'embedding'>[]): Promise<Product[]> {
  const enrichedProducts: Product[] = [];
  for (const product of products) {
    const embedding = await generateEmbedding(product.description);
    enrichedProducts.push({ ...product, embedding });
  }
  return enrichedProducts;
}

/**
 * Generates an embedding for a text query.
 * @param query The user's text query.
 * @returns The embedding vector for the query.
 */
export async function generateQueryEmbedding(query: string): Promise<number[]> {
  return generateEmbedding(query);
}
```
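For reference, the real (non-mocked) version would call an embeddings API instead. A sketch using the official `openai` package is shown below; the model name is only an example, it assumes an `OPENAI_API_KEY` environment variable, and the vector dimension must stay consistent with your database column.

```typescript
// Sketch of a real generateEmbedding using OpenAI's embeddings endpoint.
// Assumes `npm install openai` and an OPENAI_API_KEY environment variable;
// the model name is an example, not a recommendation.
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

export async function generateEmbeddingWithOpenAI(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // 1536-dimensional vectors by default
    input: text,
  });
  return response.data[0].embedding;
}
```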
**3. Database Interaction Functions (using `pg`):**

We need a PostgreSQL table with a column of the `vector` type.
```sql
-- SQL to create the table (execute via psql or your DB client)

-- Enable the extension (once per database)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255) NOT NULL,
  description TEXT NOT NULL,
  price NUMERIC(10, 2) NOT NULL,
  embedding vector(8) -- Assuming 8-dimensional vectors
);

-- Create an index for efficient searches (e.g., IVFFlat).
-- Note: IVFFlat indexes work best when created after the table already has data.
CREATE INDEX ON products USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);

-- Or for cosine similarity:
-- CREATE INDEX ON products USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
```
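As an aside, recent pgvector versions (0.5.0 and later) also support HNSW indexes, which usually trade slower index builds and more memory for better query speed and recall. A one-line sketch, assuming a recent pgvector:

```sql
-- Alternative to IVFFlat: an HNSW index (pgvector 0.5.0+).
CREATE INDEX ON products USING hnsw (embedding vector_l2_ops);
```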
Now, the TypeScript functions:
```typescript
// src/db/productRepository.ts
import { Pool, QueryResult } from 'pg';
import { Product } from '../types/Product';
import { enrichProductsWithEmbeddings } from '../services/embeddingService';

// Configure your database connection pool
// (in production, prefer environment variables over hard-coded credentials).
const pool = new Pool({
  user: 'your_db_user',
  host: 'localhost',
  database: 'your_db_name',
  password: 'your_db_password',
  port: 5432,
});

/**
 * Inserts a new product into the database, including its embedding.
 * @param product The product to insert.
 * @returns The inserted product with its database ID.
 */
export async function insertProduct(product: Product): Promise<Product> {
  const queryText = `
    INSERT INTO products(name, description, price, embedding)
    VALUES($1, $2, $3, $4) RETURNING id
  `;
  // pgvector accepts vectors as a text literal like '[0.1,0.2,...]',
  // which JSON.stringify conveniently produces for a number[].
  const values = [product.name, product.description, product.price, JSON.stringify(product.embedding)];
  try {
    const res: QueryResult = await pool.query(queryText, values);
    console.log(`Product inserted with ID: ${res.rows[0].id}`);
    return { ...product, id: res.rows[0].id };
  } catch (error) {
    console.error('Error inserting product:', error);
    throw error;
  }
}

/**
 * Searches for products by vector similarity.
 * @param queryEmbedding The embedding vector of the query.
 * @param k The number of most similar results to return.
 * @returns A list of products most similar to the query.
 */
export async function searchSimilarProducts(queryEmbedding: number[], k: number = 5): Promise<Product[]> {
  // '<->' is pgvector's Euclidean (L2) distance operator, which matches the
  // vector_l2_ops index created above. Use '<=>' (cosine distance) together
  // with a vector_cosine_ops index if you prefer cosine similarity.
  const queryText = `
    SELECT id, name, description, price, embedding
    FROM products
    ORDER BY embedding <-> $1::vector
    LIMIT $2
  `;
  const values = [JSON.stringify(queryEmbedding), k];
  try {
    const res: QueryResult = await pool.query(queryText, values);
    // pg returns untyped rows, so we map them explicitly into Product objects.
    const products: Product[] = res.rows.map((row: any) => ({
      id: row.id,
      name: row.name,
      description: row.description,
      price: parseFloat(row.price), // NUMERIC is returned as a string
      // The vector column also comes back as a string like '[0.1,0.2,...]'
      embedding: typeof row.embedding === 'string' ? JSON.parse(row.embedding) : row.embedding,
    }));
    return products;
  } catch (error) {
    console.error('Error searching similar products:', error);
    throw error;
  }
}

// Example function to populate the database (one-time use)
export async function populateDatabase() {
  const sampleProductsData = [
    { name: "Gamer Laptop X", description: "Powerful gaming laptop with RTX 4090 and 32GB RAM.", price: 2500.00 },
    { name: "Smartphone Pro Max", description: "The latest smartphone with advanced camera and OLED screen.", price: 1200.00 },
    { name: "Mechanical RGB Keyboard", description: "High-performance keyboard with blue switches and customizable lighting.", price: 150.00 },
    { name: "Ultrawide 4K Monitor", description: "34-inch curved monitor with 4K resolution for total immersion.", price: 800.00 },
    { name: "Full HD Webcam", description: "1080p webcam with built-in microphone, ideal for streaming.", price: 70.00 },
  ];

  console.log("Enriching products with embeddings...");
  const productsWithEmbeddings = await enrichProductsWithEmbeddings(sampleProductsData);

  console.log("Inserting products into the database...");
  for (const product of productsWithEmbeddings) {
    await insertProduct(product);
  }
  console.log("Database populated successfully!");
}

// To close the pool when the application finishes
export async function closeDbConnection() {
  await pool.end();
}
```
**4. Usage Example:**
```typescript
// src/index.ts
import { populateDatabase, searchSimilarProducts, closeDbConnection } from './db/productRepository';
import { generateQueryEmbedding } from './services/embeddingService';

async function main() {
  try {
    // 1. Populate the database (run only once or if you need to reset)
    // await populateDatabase();

    // 2. Define a user query
    const userQuery = "I'm looking for a fast computer for gaming.";
    console.log(`\nUser query: "${userQuery}"`);

    // 3. Generate the embedding for the query
    const queryEmbedding = await generateQueryEmbedding(userQuery);

    // 4. Search for similar products in the database
    const similarProducts = await searchSimilarProducts(queryEmbedding, 3);

    console.log("\n--- Similar Products Found ---");
    if (similarProducts.length > 0) {
      similarProducts.forEach(product => {
        console.log(`ID: ${product.id}, Name: ${product.name}, Description: ${product.description}, Price: $${product.price}`);
      });
    } else {
      console.log("No similar products found.");
    }
  } catch (error) {
    console.error("An error occurred:", error);
  } finally {
    await closeDbConnection();
    console.log("\nDatabase connection closed.");
  }
}

main();
```
**Execution:**

- Ensure PostgreSQL is running and the `pgvector` extension is installed.
- Adjust the database credentials in `productRepository.ts`.
- Run `npm install pg @types/pg` to install dependencies.
- Uncomment `await populateDatabase();` in the `main` function to populate the database with sample products, then run the script (`ts-node src/index.ts`).
- Run again, but this time with the `populateDatabase` line commented out, to perform the search.
This example demonstrates semantic search. Imagine the possibilities: personalized recommendation systems, fraud detection by analyzing transaction patterns, summarizing customer feedback, and more.
### Best Practices and Considerations
- **Choice of Embedding Model:** The quality of your embeddings is crucial. Experiment with different models (OpenAI, Sentence-BERT, etc.) to find what best suits your data and use case.
- **Vector Dimensionality:** Different models produce vectors of varying dimensions (e.g., 768, 1536). Ensure the `vector` column in your database matches this dimension.
- **Vector Indexes:** For databases with millions of records, creating appropriate vector indexes (like IVFFlat or HNSW) is essential to ensure acceptable search performance. The choice of index and its parameters (like `lists` for IVFFlat) affect accuracy and speed.
- **Distance Operator:** Choose the correct distance operator (`vector_l2_ops` for Euclidean, `vector_cosine_ops` for cosine similarity) that matches how your embedding model was trained or how you want to measure similarity (see the short example after this list).
- **Embedding Management:** Keep your embeddings up to date. If the data in your database changes, the corresponding embeddings may also need to be recalculated and updated.
- **Security:** When using third-party APIs for embedding generation, be mindful of the privacy and data security policies for the data being sent.
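As an illustration of the distance-operator point above, here is a short sketch of what switching the example from Euclidean to cosine distance would look like; the vector literal is just the 8-dimensional placeholder used earlier, and the index opclass must match the query operator.

```sql
-- Sketch: using cosine distance instead of Euclidean.
-- The index opclass (vector_cosine_ops) and the query operator (<=>) must match.
CREATE INDEX ON products USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

SELECT id, name
FROM products
ORDER BY embedding <=> '[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]'  -- <=> is cosine distance
LIMIT 5;
```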
### Conclusion
Your relational database is not just a repository of historical data; it's a goldmine waiting to be tapped with the power of AI. By integrating vectorization techniques and leveraging the growing capabilities of modern relational databases (or extensions like pgvector), you can transform your structured data into a source of predictive intelligence and actionable insights.
The journey from relational data to AI doesn't require a complete rewrite of your infrastructure. With the right tools and approaches, you can start building smarter applications today, unlocking a new level of value from your most valuable data assets. The AI revolution is within your reach, and it can start with a simple, enhanced SQL query.