yukaty

Posted on Nov 7, 2024 • Edited on Nov 12, 2024

Part 2: Implementing Vector Search with OpenAI

#postgres #docker #openai #python

In Part 1, we set up PostgreSQL with pgvector. Now, let's see how vector search actually works. 🧠

Prerequisites
Understanding Vector Search
Project Setup
Exploring Vector Search
Working with JSON and Vectors
Performance Tips
Resources

Prerequisites 📋

Completed Part 1 setup for pgvector
OpenAI API key

Understanding Vector Search 🌐

A vector is a list of numbers that represents position or direction:

2D Vector: [x, y]     📍 Like coordinates on a map
3D Vector: [x, y, z]  🎲 Like a point in 3D space

When AI processes content, it creates special vectors called "embeddings" (1536 dimensions) to represent meaning. These embeddings are stored in the database, allowing us to perform similarity search:

📘 "How to use Docker"
    [0.23, 0.45, 0.12, ...]  # 1536-dimensional vector
📗 "Docker tutorial"          
    [0.24, 0.44, 0.11, ...]  🤝 Very Similar! (Distance: 0.2)
📕 "Chocolate cake recipe"    
    [0.89, 0.12, 0.67, ...]  🚫 Not Related! (Distance: 0.9)

Vectors let AI understand similarity mathematically
Vector search finds similar content by comparing distances
pgvector stores embeddings efficiently
Works across any language (it's all just numbers!)

Project Setup ⚙️

Updated Project Structure

vector-search/
├── .env
├── compose.yml
├── requirements.txt
├── postgres/            # Part 1: Database setup
│   └── schema.sql
└── scripts/             # New: Data loading
    ├── Dockerfile
    └── load_data.py

1. Set Up OpenAI API

Create .env:

OPENAI_API_KEY=your_api_key  # Get from platform.openai.com

2. Create Data Loading Script

Create scripts/load_data.py to fetch books and generate embeddings:

import openai

client = openai.OpenAI()

def get_embedding(text: str):
    """Generate embedding using OpenAI API"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def load_books_to_db():
    """Load books with embeddings into PostgreSQL"""

    # 1. Fetches books from Open Library API
    books = fetch_books()

    for book in books:
        # 2.Create text description for embedding
        description = f"Book titled '{book['title']}' by {', '.join(book['authors'])}. "
        description += f"Published in {book['first_publish_year']}. "
        description += f"This is a book about {book['subject']}."

        # 3. Generate embedding using OpenAI
        embedding = get_embedding(description)

        # 4. Stores books and embeddings in PostgreSQL
        store_book(book["title"], json.dumps(book), embedding)

Full source code is available on GitHub 🐙

Also create requirements.txt and scripts/Dockerfile.

3. Update Docker Compose

Update compose.yml to add the data loader:

services:
  # ... existing db service from Part 1

  data_loader:
    build:
      context: .
      dockerfile: scripts/Dockerfile
    environment:
      - DATABASE_URL=postgresql://postgres:password@db:5432/example_db
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    depends_on:
      - db
    command: python load_data.py

4. Load Sample Data

docker compose up --build

Exploring Vector Search 🔦

First, connect to the database:

docker exec -it pgvector-db psql -U postgres -d example_db

Inspecting Embeddings

Check what the vectors look like:

-- View first 5 dimensions of an embedding
SELECT
    name,
    (replace(replace(embedding::text, '[', '{'), ']', '}')::float[])[1:5] as first_dimensions
FROM items
LIMIT 1;

💡 Each embedding from OpenAI's model:

Has 1536 dimensions
Contains values between -1 and 1
Represents text meaning mathematically
Outputs in [...] format, which needs to be converted to PostgreSQL's {...} array format for array operations

Finding Similar Books

Search for books about web development:

WITH web_book AS (
        SELECT embedding FROM items WHERE name LIKE '%Web%' LIMIT 1
)
SELECT 
    item_data->>'title' as title,
    item_data->>'authors' as authors,
    embedding <=> (SELECT embedding FROM web_book) as similarity
FROM items
ORDER BY similarity
LIMIT 3;  -- Returns the 3 most similar books

Working with JSON and Vectors ⚡️

JSON Operators

Use ->> to extract text value from a JSON field:

-- Get title from the 'item_data' JSON column
SELECT item_data->>'title' FROM items;

Vector Search Operators

pgvector supports multiple distance functions. Here are the two most commonly used operators.

L2 Distance: `<->`

Measures straight-line (Euclidean) distance between vectors:

-- Find similar books using L2 distance
SELECT 
    name,
    embedding <-> (
        SELECT embedding FROM items WHERE name LIKE '%Web%' LIMIT 1
    ) as distance
FROM items
ORDER BY distance
LIMIT 3;

Cosine Distance: `<=>`

Measures angle-based (cosine) distance between vectors:

-- Find similar books using Cosine distance
SELECT 
    name,
    embedding <=> (
        SELECT embedding FROM items WHERE name LIKE '%Web%' LIMIT 1
    ) as distance
FROM items
ORDER BY distance
LIMIT 3;

💡 Tips

OpenAI recommends <=> (Cosine distance) for their embeddings.
Smaller distance means higher similarity.

Performance Tips 🚀

Query Optimization

Cache query vectors instead of subquerying:

-- ❌ Inefficient: Subquery runs for every row
SELECT name, embedding <=> (
        SELECT embedding FROM items WHERE name LIKE '%Web%' LIMIT 1
) as distance
FROM items
ORDER BY distance
LIMIT 3;

-- ✅ Better: Query vector calculated once
WITH query_embedding AS (
        SELECT embedding FROM items WHERE name LIKE '%Web%' LIMIT 1
)
SELECT 
    name,
    embedding <=> (SELECT embedding FROM query_embedding) as distance
FROM items
ORDER BY distance
LIMIT 3;

Indexing

Choose an index based on your needs:

-- Option 1: IVFFlat (Less memory, good for development)
CREATE INDEX ON items USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Option 2: HNSW (Faster searches, more memory)
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops);

Resources 🔗

Hope this helps you build something cool. Feel free to drop a comment below! 💬

Top comments (1)

Maksim Sheptiakov • May 26 '25

Hey
Nice tutorial, but you missed some parts/files. So if someone wants to run it:

Full scripts/load_data.py file:

import openai
import requests
import json
import psycopg2
import os
# from psycopg2.extras import Json
from typing import List, Dict, Any
"""https://dev.to/yukaty/part-3-implementing-vector-search-with-ollama-1dop - how to do locally with ollama"""
client = openai.OpenAI()
def get_embedding(text: str):
    """Generate embedding using OpenAI API"""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
def store_book(title: str, book_json: str, embedding: List[float]) -> None:
    """
    Store a book with its embedding in the PostgreSQL database.
    Args:
        title: The title of the book
        book_json: JSON string containing book data
        embedding: Vector embedding of the book description
    Returns:
        None
    """
    # Get database connection string from environment variable
    db_url = os.environ.get("DATABASE_URL", "postgresql://postgres:password@localhost:5432/example_db")
    try:
        # Connect to the PostgreSQL database
        conn = psycopg2.connect(db_url)
        cursor = conn.cursor()
        # Insert the book data and embedding into the database
        cursor.execute(
            """
            INSERT INTO items (name, item_data, embedding)
            VALUES (%s, %s, %s)
            RETURNING id
            """,
            (title, book_json, embedding)
        )
        # Get the ID of the newly inserted book
        book_id = cursor.fetchone()[0]
        # Commit the transaction
        conn.commit()
        print(f"Successfully stored '{title}' with ID {book_id} in the database")
    except Exception as e:
        print(f"Error storing book in database: {e}")
        # Roll back the transaction in case of error
        if conn:
            conn.rollback()
    finally:
        # Close the cursor and connection
        if cursor:
            cursor.close()
        if conn:
            conn.close()
def fetch_books(search_query: str = "python programming",
                limit: int = 10) -> List[Dict[str, Any]]:
    """
    Fetch books from the Open Library API.
    Args:
        search_query: The search term for finding books (default: "python programming")
        limit: Maximum number of books to return (default: 10)
    Returns:
        A list of dictionaries containing book information
    """
    base_url = "https://openlibrary.org/search.json"
    # Define parameters for the API request
    params = {
        "q": search_query,
        "limit": limit,
        "fields": "key,title,author_name,first_publish_year,cover_i,edition_count,subject"
    }
    try:
        # Make the API request
        response = requests.get(base_url, params=params)
        response.raise_for_status()  # Raise exception for HTTP errors
        data = response.json()
        books = data.get("docs", [])
        # Process and format the book data
        formatted_books = []
        for book in books:
            # Create a cleaner book object with consistent fields
            formatted_book = {
                "title": book.get("title", "Unknown Title"),
                "authors": book.get("author_name", ["Unknown Author"]),
                "publish_year": book.get("first_publish_year", "Unknown"),
                "cover_id": book.get("cover_i"),
                "edition_count": book.get("edition_count", 0),
                "subjects": book.get("subject", []),
                "key": book.get("key", "")
            }
            formatted_books.append(formatted_book)
        print(f"Successfully fetched {len(formatted_books)} books about '{search_query}'")
        return formatted_books
    except requests.exceptions.RequestException as e:
        print(f"Error fetching books from Open Library API: {e}")
        return []
def load_books_to_db():
    """Load books with embeddings into PostgreSQL"""
    # 1. Fetches books from Open Library API
    books = fetch_books()
    for book in books:
        # 2.Create text description for embedding
        print(book)
        description = f"Book titled '{book['title']}' by {', '.join(book['authors'])}. "
        description += f"Published in {book['publish_year']}. "
        description += f"This is a book about {[x + ', ' for x in book['subjects']]}."
        # 3. Generate embedding using OpenAI
        embedding = get_embedding(description)
        # 4. Stores books and embeddings in PostgreSQL
        store_book(book["title"], json.dumps(book), embedding)
if __name__ == '__main__':
    print(1)
    load_books_to_db()

scripts/Dockerfile:

FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    gcc \
    libpq-dev \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code
COPY scripts/load_data.py /app/load_data.py

# Set environment variables (these will be overridden by docker-compose)
ENV DATABASE_URL=postgresql://postgres:password@db:5432/example_db
ENV OPENAI_API_KEY=""

DEV Community

Part 2: Implementing Vector Search with OpenAI

Contents

Prerequisites 📋

Understanding Vector Search 🌐

Project Setup ⚙️

Updated Project Structure

1. Set Up OpenAI API

2. Create Data Loading Script

3. Update Docker Compose

4. Load Sample Data

Exploring Vector Search 🔦

Inspecting Embeddings

Finding Similar Books

Working with JSON and Vectors ⚡️

JSON Operators

Vector Search Operators

L2 Distance: `<->`

Cosine Distance: `<=>`

💡 Tips

Performance Tips 🚀

Query Optimization

Indexing

Resources 🔗

Top comments (1)

Contents

Prerequisites 📋

Understanding Vector Search 🌐

Project Setup ⚙️

Updated Project Structure

1. Set Up OpenAI API

2. Create Data Loading Script

3. Update Docker Compose

4. Load Sample Data

Exploring Vector Search 🔦

Inspecting Embeddings

Finding Similar Books

Working with JSON and Vectors ⚡️

JSON Operators

Vector Search Operators

L2 Distance: <->

Cosine Distance: <=>

💡 Tips

Performance Tips 🚀

Query Optimization

Indexing

Resources 🔗

L2 Distance: `<->`

Cosine Distance: `<=>`