Neil Charlton
Intro to Embeddings and Vector Storage with pgvector

I recently found myself exploring the world of vector search—specifically, whether to use pgvector (a PostgreSQL extension) or a dedicated vector database like Pinecone, Weaviate, Milvus, or Qdrant. Here's a detailed, first-person account of what I learned, the trade-offs, and how I approached the problem for my own project.


Understanding the Options

pgvector is an open-source extension for PostgreSQL that lets you store and search vector data (like embeddings from machine learning models) right inside your existing database. In contrast, a dedicated vector database is built from the ground up for storing, indexing, and searching high-dimensional vectors at scale.


Use Cases: When Does Each Make Sense?

When pgvector Shines

For me, pgvector was appealing because I already had a PostgreSQL database running my application. It's perfect if you want to:

  • Add vector search to an existing app without introducing a new system.
  • Prototype or experiment with ML features like semantic search or recommendations.
  • Run hybrid queries that mix vector search with classic SQL (e.g., "find similar posts, but only those by user X").
  • Rely on strong transactional guarantees (ACID) for both relational and vector data.
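The hybrid-query point was the main draw for me. As a rough illustration of the idea in plain Python (no database; the in-memory posts and their tiny vectors are made up), combining a metadata filter with a vector ranking looks like this:

```python
import math

# Toy corpus: each post has metadata plus an embedding (made-up 3-d vectors).
posts = [
    {"id": 1, "author": "alice", "embedding": [1.0, 0.0, 0.0]},
    {"id": 2, "author": "bob",   "embedding": [0.9, 0.1, 0.0]},
    {"id": 3, "author": "alice", "embedding": [0.0, 1.0, 0.0]},
]

def cosine_distance(a, b):
    """1 - cosine similarity, the quantity pgvector's <=> operator computes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / norm

def similar_posts_by(author, query_embedding, limit=10):
    """Equivalent SQL with pgvector:
       SELECT id FROM posts WHERE author = %s
       ORDER BY embedding <=> %s LIMIT %s;"""
    candidates = [p for p in posts if p["author"] == author]          # WHERE
    candidates.sort(key=lambda p: cosine_distance(p["embedding"],
                                                  query_embedding))   # ORDER BY
    return [p["id"] for p in candidates[:limit]]                      # LIMIT

print(similar_posts_by("alice", [1.0, 0.0, 0.0]))  # [1, 3]
```

The point is that the relational filter and the vector ranking live in one query; in PostgreSQL the planner handles both sides together.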

When a Dedicated Vector DB is Better

If you're dealing with:

  • Millions or billions of vectors and need lightning-fast, scalable search.
  • Real-time applications (like chatbots or image search) where latency is critical.
  • Advanced features like distributed indexing, hybrid search, or multi-modal data (text, image, audio).
  • The need for managed, cloud-native solutions with auto-scaling and high availability.

Then a dedicated vector database is likely the way to go.


Advantages and Disadvantages

pgvector: The Good and the Not-So-Good

Advantages:

  • Simplicity—no new infrastructure, just PostgreSQL.
  • Full ACID compliance, so data stays in sync.
  • Cost-effective for small/medium workloads.
  • Easy integration with existing tools and workflows.
  • Effortless hybrid queries with SQL.

Disadvantages:

  • Not as fast or scalable for huge datasets.
  • Fewer indexing options for approximate nearest neighbor (ANN) search.
  • Vector search competes for resources (CPU, memory, I/O) with your other database workloads.
  • Lacks some advanced features found in dedicated vector DBs.

Dedicated Vector DB: Pros and Cons

Advantages:

  • Blazing performance at scale.
  • Designed for billions of vectors and high query throughput.
  • Advanced indexing algorithms (HNSW, IVF, PQ, etc.).
  • Rich feature set: metadata filtering, hybrid search, multi-tenancy.
  • Many offer fully managed, cloud-native services.

Disadvantages:

  • Adds operational complexity (another system to manage).
  • Syncing data between systems can be tricky if you need strong consistency.
  • Managed services can get expensive.
  • New APIs and concepts to learn.

My Scenario: Posts and Comments in PostgreSQL

I have a PostgreSQL database with a posts table and a comments table (each comment belongs to a post). I wanted to track trends and perform similarity searches on data I've scraped from online sources.

My Approach

  • Start with pgvector: Since my data volume was modest and I wanted to keep things simple, I decided to try pgvector first.

  • Install and Configure pgvector:

   CREATE EXTENSION vector;
  • Add Embedding Columns:
   ALTER TABLE posts ADD COLUMN embedding vector(1536); -- Assuming OpenAI embeddings
   ALTER TABLE comments ADD COLUMN embedding vector(1536);
  • Generate Embeddings: I used a model (like OpenAI's text-embedding-ada-002) to generate embeddings for each post and comment.
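One practical detail when writing embeddings back to the table: pgvector accepts vectors as bracketed text literals. A minimal helper (the driver call in the comment is a sketch, not code I'm showing verbatim from my project):

```python
def to_pgvector(vec):
    """Format a Python list of floats as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"

# With a driver like psycopg you could then write the embedding back (sketch):
#   cur.execute("UPDATE posts SET embedding = %s::vector WHERE id = %s",
#               (to_pgvector(embedding), post_id))

print(to_pgvector([0.1, 0.2, 0.3]))  # [0.1,0.2,0.3]
```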

  • Indexing: I created an IVF index to speed up similarity searches:

   CREATE INDEX ON posts USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
   CREATE INDEX ON comments USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
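I picked lists = 100 somewhat arbitrarily. The pgvector docs suggest starting around rows / 1000 for up to about a million rows and sqrt(rows) beyond that, which is easy to encode:

```python
import math

def suggested_lists(row_count):
    """Starting point for ivfflat's `lists` parameter, per pgvector's guidance:
    rows/1000 up to ~1M rows, sqrt(rows) above that (minimum of 1)."""
    if row_count <= 1_000_000:
        return max(1, row_count // 1000)
    return int(math.sqrt(row_count))

print(suggested_lists(100_000))    # 100 — matches the value used above
print(suggested_lists(4_000_000))  # 2000
```

It's only a starting point; the right value depends on your recall/latency trade-off, and the index should be rebuilt as the table grows.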
  • Similarity Search: I could now find similar posts. Since the indexes use vector_cosine_ops, the query has to use the matching cosine distance operator (<=>) for the index to be usable:
   SELECT
       id,
       title
   FROM
       posts
   ORDER BY
       embedding <=> (SELECT embedding FROM posts WHERE id = 123)
   LIMIT 10;
  • Trend Tracking: I used SQL to cluster similar posts/comments and analyze trends over time.
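For the clustering side of trend tracking I also experimented outside the database. A toy k-means over embeddings gets the idea across (pure Python with deterministic, evenly spaced initialization; in practice you'd reach for something like scikit-learn):

```python
import math

def kmeans(points, k, iterations=10):
    """Minimal k-means: returns a cluster label for each point."""
    # Deterministic init: pick k evenly spaced points as starting centroids.
    step = len(points) // k
    centroids = [list(points[i * step]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: label each point with its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels

# Two obvious "topic" clusters in a made-up 2-d embedding space.
embeddings = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
print(kmeans(embeddings, k=2))  # [0, 0, 1, 1]
```

Bucketing the cluster labels by posting date then gives a simple view of which topics are growing over time.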

Monitoring and Scaling

I kept an eye on query latency and resource usage. If performance ever became an issue (as my data grew), I knew I could transition to a dedicated vector database or use a hybrid approach (PostgreSQL for relational data, vector DB for embeddings).


How I Chose the Embedding Size

The embedding size depends on the model you use. For example:

  • OpenAI's text-embedding-ada-002: 1536 dimensions.
  • Sentence Transformers (like all-mpnet-base-v2): 768 dimensions.

I checked the model documentation or ran a quick test to inspect the output shape. If storage or performance became a concern, I considered dimensionality reduction (like PCA), but always balanced that against potential accuracy loss.
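As a cheaper stand-in for PCA, random projection also reduces dimensionality with bounded distortion (this is my own sketch of the Johnson-Lindenstrauss idea, not something from the pgvector docs):

```python
import random

def random_projection_matrix(in_dim, out_dim, seed=42):
    """Gaussian random projection matrix (Johnson-Lindenstrauss style),
    scaled by 1/sqrt(out_dim) to roughly preserve distances."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) / out_dim ** 0.5 for _ in range(in_dim)]
            for _ in range(out_dim)]

def project(vec, matrix):
    """Reduce vec to len(matrix) dimensions via matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

matrix = random_projection_matrix(in_dim=1536, out_dim=256)
reduced = project([0.01] * 1536, matrix)
print(len(reduced))  # 256
```

The reduced column would then be declared as vector(256); as with PCA, the saving in storage and query time is paid for in some retrieval accuracy.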


Combining Post and Comment Embeddings

I also explored creating an aggregate embedding for each post, combining the post content and all its comments:

CREATE TABLE post_aggregate_embeddings (
    post_id INTEGER PRIMARY KEY REFERENCES posts(id),
    embedding vector(1536) -- Adjust size as needed
);

-- Create an index for efficient similarity searches
CREATE INDEX ON post_aggregate_embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
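To populate that table I used simple mean pooling of the post embedding and its comment embeddings — one reasonable choice among many (weighting the post more heavily than the comments is another):

```python
def mean_pool(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(dims) / n for dims in zip(*vectors)]

# Made-up 3-d embeddings for one post and its two comments.
post_embedding = [1.0, 0.0, 0.0]
comment_embeddings = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

aggregate = mean_pool([post_embedding] + comment_embeddings)
print(aggregate)  # each dimension averages to ~0.333
```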

This enabled:

  • Semantic search across the entire conversation (not just the post).
  • Trend analysis based on discussion topics.
  • Anomaly detection (e.g., when comments deviate from the post's topic).

I could then query for similar discussions:

SELECT
    p.id,
    p.title
FROM
    posts p
JOIN
    post_aggregate_embeddings pae ON p.id = pae.post_id
ORDER BY
    pae.embedding <=> (SELECT embedding FROM post_aggregate_embeddings WHERE post_id = 123)
LIMIT 10;

The trade-off was increased complexity and the need to update the aggregate embedding whenever comments changed.


Similarity Metrics: Which One Should I Use?

pgvector supports several similarity metrics:

Euclidean Distance (<->):

   SELECT id, title FROM posts
   ORDER BY embedding <-> (SELECT embedding FROM posts WHERE id = 123)
   LIMIT 10;

Finds posts similar in both topic and length/detail.

Inner Product (<#>):

   SELECT id, title FROM posts
   ORDER BY embedding <#> (SELECT embedding FROM posts WHERE id = 123)
   LIMIT 10;

Finds posts most relevant to a normalized query. Note there is no DESC here: pgvector's <#> operator returns the negative inner product (so ascending-order index scans work), which means the smallest values are already the most similar.

Cosine Distance (<=>):

   SELECT id, title FROM posts
   ORDER BY embedding <=> (SELECT embedding FROM posts WHERE id = 123)
   LIMIT 10;

Finds posts about the same topic, regardless of length.

My rule of thumb:

Use cosine distance for most semantic search tasks, Euclidean if you care about content length as well as topic, and inner product for search/recommendation when your embeddings are normalized (in which case it ranks identically to cosine distance, just slightly cheaper to compute).
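To build intuition, I compared the three metrics on toy vectors (pure Python; pgvector computes the same quantities server-side):

```python
import math

def euclidean(a, b):          # pgvector: <->
    return math.dist(a, b)

def inner_product(a, b):      # pgvector: <#> returns the *negative* of this
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):    # pgvector: <=>
    dot = inner_product(a, b)
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))

a, b = [1.0, 0.0], [2.0, 0.0]  # same direction, different magnitude
print(euclidean(a, b))         # 1.0 -> the vectors differ in "length"
print(inner_product(a, b))     # 2.0
print(cosine_distance(a, b))   # 0.0 -> identical direction, i.e. same topic
```

The same pair of vectors is maximally similar under cosine but measurably different under Euclidean, which is exactly the length-versus-topic distinction above.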


Practical Use Cases in My System

  • Euclidean Distance: Recommending posts that are similar in both topic and depth/detail to a given post.
  • Inner Product: Finding posts most relevant to a user's search query, regardless of post length.
  • Cosine Distance: Clustering or recommending posts purely based on topic similarity, ignoring length.

My Recommendation

If you're already using PostgreSQL and your vector data isn't massive, start with pgvector. It's simple, cost-effective, and integrates seamlessly with your existing stack. Monitor performance as your data grows. If you hit bottlenecks, consider a dedicated vector database or a hybrid approach.

By starting small and iterating, you'll gain valuable insights into your needs and be better equipped to scale when the time comes.


If you have a similar use case or want to discuss your specific scenario, feel free to reach out! I'm always happy to share what I've learned or help you brainstorm the best approach.
