Gabriel Anhaia

Postgres With pgvector vs Pinecone: 1 Million Embeddings, One Honest Comparison


You have 1.2 million chunks, 1536 dimensions, a Postgres cluster you already pay for, and a Slack thread three engineers deep arguing about whether to add Pinecone to the bill. Half the team wants the managed service. Half the team is allergic to another vendor. Both halves are quoting blog posts that never agree on the same numbers.

This post compares the two on latency, recall, cost per million vectors per month, and operational weight. Where the data comes from a named public benchmark, you get the link. Where the data is your-team-shaped, it is clearly labeled illustrative — because most of the cost-per-team numbers floating around online are fiction in a costume.

The two contenders, briefly

pgvector is a Postgres extension. Vectors live in a column on a regular table; you index them with HNSW or IVFFlat; you query them with ORDER BY embedding <-> $1 LIMIT k. The 0.8 release shipped in late 2024 alongside PostgreSQL 17, with HNSW build-time and recall improvements. You operate it the same way you operate any other Postgres extension: CREATE EXTENSION, monitor, vacuum, restore from your existing backups.

Pinecone is a managed vector database with its own storage layout, its own query API, and its own pricing dimensions (read units, write units, storage). Pinecone has been steering new customers toward serverless; see the pricing page for current tiers. You upsert with the SDK, you query with the SDK, and the pricing model documented on that page is the one you live with.

p95 query latency at top-K

The 2024 HNSW deep-dive by Jonathan Katz reports pgvector with HNSW at sub-20ms for top-10 queries on a 1M, 768-dimension dataset at 95%+ recall, on a single moderately-sized Postgres instance; ann-benchmarks.com shows the same shape for HNSW generally. The exact number depends on your m, ef_construction, and ef_search parameters; the Tembo writeup on pgvector indexes is the one most teams reach for when they ask which dial to turn first.

TechTarget's serverless launch coverage reports Pinecone's marketing claim of single-digit-millisecond p50 latency on the standard serverless tier. Independent benchmarks like the Tiger Data pgvector vs Pinecone comparison show Pinecone serverless landing in the same neighborhood for typical RAG workloads, occasionally slower at p95 because cold namespaces incur a warm-up cost.

The honest summary at 1M vectors:

| | pgvector 0.8 (HNSW) | Pinecone serverless |
| --- | --- | --- |
| p95 latency, top-10, 1536 dim | 8–25 ms (illustrative; depends on instance size and ef_search) | 10–40 ms (illustrative; depends on namespace warmth and read-unit budget) |
| Source for shape | ann-benchmarks, Katz HNSW post | Pinecone pricing/docs, TechTarget coverage |

The bands overlap. At 1M vectors, neither one wins on latency by an order of magnitude — they win or lose on what you are willing to pay and how warm your data is.

Recall at K

ann-benchmarks reports HNSW recall@10 of 0.95–0.99 on the standard sift-1M dataset depending on parameters. pgvector's HNSW implementation tracks the same curve closely; when teams report bad recall on pgvector, the cause is almost always ef_search set too low, not the index quality.
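
If you suspect that on your own corpus, measure it rather than guessing. A minimal sketch, assuming the chunks schema shown later in this post and a held-out list of real query vectors as numpy arrays; disabling index scans forces Postgres into an exact scan, which serves as the ground truth:

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://localhost/rag")
register_vector(conn)

TOP_K = """
SELECT id FROM chunks
ORDER BY embedding <=> %s
LIMIT %s
"""


def top_ids(vec, k):
    return {row[0] for row in conn.execute(TOP_K, (vec, k)).fetchall()}


def recall_at_k(query_vecs, k=10, ef=40):
    hits = 0
    for vec in query_vecs:
        with conn.transaction():
            # Exact ground truth: with no index scan, Postgres does a full scan.
            conn.execute("SET LOCAL enable_indexscan = off")
            exact = top_ids(vec, k)
        with conn.transaction():
            conn.execute(f"SET LOCAL hnsw.ef_search = {int(ef)}")
            approx = top_ids(vec, k)
        hits += len(exact & approx)
    return hits / (k * len(query_vecs))

Sweep ef upward until recall stops improving; that is usually a much cheaper fix than rebuilding the index.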

Pinecone does not publish per-dataset recall numbers — the index implementation is proprietary and their docs assume you trust the default. In practice, customers report 95%+ recall on serverless for typical RAG queries, but you cannot tune the index the way you can with pgvector. You get the recall the platform gives you.

Cost per million vectors per month

This is where the comparison gets real.

Pinecone's serverless pricing for the Standard plan (verified 26 April 2026 against pinecone.io/pricing):

  • Storage: $0.33/GB/month
  • Read units: $16/million RU
  • Write units: $0.0000004/WU (a 1536-dim vector with metadata is roughly 3–4 WU per upsert)
  • Minimum: $50/month committed for Standard

A 1M-vector corpus at 1536 dimensions, 4 bytes per dimension, plus a per-vector metadata payload, lands around 6.5–8 GB of storage. That is ~$2.15–$2.65/month in storage alone. A query consumes roughly 1 RU per GB of namespace scanned, so each query against the full namespace costs about 7 RU. At 100k queries/month that is 700k RU, about $11/month in reads, plus the storage, plus the $50 minimum gate. Realistic monthly cost: $50–$70/month for a 1M-vector RAG workload with moderate query volume, climbing with traffic.
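
The same arithmetic as code, so you can swap in your own traffic. Illustrative only: it uses the list prices above, the rough 1-RU-per-GB approximation, an assumed 500-byte metadata payload, and ignores write traffic.

# Illustrative Pinecone serverless cost model for the corpus above.
VECTORS = 1_000_000
DIM = 1536
METADATA_BYTES = 500            # assumed per-vector metadata payload
QUERIES_PER_MONTH = 100_000

STORAGE_PER_GB = 0.33           # $/GB/month, Standard list price
READ_UNIT_PRICE = 16 / 1e6      # $/RU
MONTHLY_MINIMUM = 50            # Standard plan floor

corpus_gb = VECTORS * (DIM * 4 + METADATA_BYTES) / 1e9
ru_per_query = corpus_gb        # ~1 RU per GB of namespace scanned
read_cost = QUERIES_PER_MONTH * ru_per_query * READ_UNIT_PRICE
storage_cost = corpus_gb * STORAGE_PER_GB

print(f"corpus:  {corpus_gb:.1f} GB")
print(f"storage: ${storage_cost:.2f}/month")
print(f"reads:   ${read_cost:.2f}/month")
print(f"bill:    ${max(storage_cost + read_cost, MONTHLY_MINIMUM):.2f}/month")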

pgvector on a Postgres instance is harder to quote cleanly because you are paying for compute, not for vectors. Illustrative only: a db.r6g.large on AWS RDS at ~$130/month gives you enough RAM to keep a 1M-row, 1536-dim HNSW index hot, and once that instance exists the marginal cost of vector operations is effectively zero. If you are already paying for Postgres for your application, the marginal cost of pgvector is closer to $0–$20/month (a slightly bigger instance, an extra GB or two of disk).

Cost shape at 1M vectors (illustrative — your numbers will differ):
  Pinecone serverless:   $50–$70/month, scales with reads
  pgvector (new RDS):    $130/month + small overhead
  pgvector (existing PG): $0–$20/month marginal

The crossover is exactly where you would guess. If you are already running Postgres at a size that has the headroom, pgvector wins on cost decisively. If you are not (if vector search is the only reason you would deploy a database), Pinecone is cheaper at the entry tier and stays competitive into the millions of queries.

Operational ergonomics

Pinecone gives you no instance to size, no index to vacuum, no replication slot to monitor. You upsert, you query. The cost of that convenience is that when something is slow, you are debugging a vendor. There is no EXPLAIN ANALYZE. There is no pg_stat_statements. The serverless tier deliberately abstracts capacity, which means you debug latency by reading the docs and adjusting your read-unit budget.

pgvector hands you the keys. You can run EXPLAIN (ANALYZE, BUFFERS) on a vector query. You can join vectors to your application tables in a single SQL statement (the killer feature nobody talks about: the chunk's tenant_id, source_id, and modified_at filter happen in the same plan as the ANN search). The cost is that you operate Postgres. You tune shared_buffers. You monitor index bloat. You think about maintenance_work_mem when you build the HNSW index on a 1M-row table.
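
A sketch of what that inspection looks like, with $1 and $2 standing in for your tenant uuid and query vector (the plan shape varies by pgvector version and planner settings):

-- Confirm the tenant filter and the ANN search land in a single plan.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, text
FROM chunks
WHERE tenant_id = $1
ORDER BY embedding <=> $2
LIMIT 10;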

Schema and parity code

Postgres side. The schema and HNSW index for a 1M-chunk corpus.

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id          bigserial PRIMARY KEY,
    tenant_id   uuid NOT NULL,
    source_id   uuid NOT NULL,
    text        text NOT NULL,
    embedded_at timestamptz NOT NULL DEFAULT now(),
    embedding   vector(1536) NOT NULL
);

CREATE INDEX chunks_embed_hnsw
    ON chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX chunks_tenant ON chunks (tenant_id);
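
One build-time note: if the table is already populated when you create that HNSW index, give the build room first. The pgvector docs recommend raising maintenance_work_mem so the graph fits in memory, and 0.6+ can parallelize the build; the values here are illustrative.

-- Illustrative build-time settings; size to your instance.
SET maintenance_work_mem = '2GB';
SET max_parallel_maintenance_workers = 4;  -- parallel HNSW builds, pgvector 0.6+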

Python upsert and query against pgvector with psycopg and the pgvector Python adapter.

import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://localhost/rag")
register_vector(conn)

def upsert(tenant_id, source_id, text, vec):
    # vec: a numpy array; register_vector adapts it to the vector type.
    conn.execute(
        """
        INSERT INTO chunks (tenant_id, source_id, text, embedding)
        VALUES (%s, %s, %s, %s)
        """,
        (tenant_id, source_id, text, vec),
    )
    conn.commit()


def search(tenant_id, query_vec, k=10):
    # SET LOCAL lasts until the end of the transaction, so scope the
    # recall dial and the query together instead of leaving it dangling.
    with conn.transaction():
        conn.execute("SET LOCAL hnsw.ef_search = 80")
        rows = conn.execute(
            """
            SELECT id, text, embedding <=> %s AS distance
            FROM chunks
            WHERE tenant_id = %s
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_vec, tenant_id, query_vec, k),
        ).fetchall()
    return rows

Note the SET LOCAL hnsw.ef_search = 80. That is the dial that buys recall at the cost of latency, set per-query so a "high recall" path and a "low latency" path can coexist.
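
If you want those two paths as explicit call sites instead of a magic number, a small variant of search() takes the dial as an argument (the preset values are illustrative):

def search_tuned(tenant_id, query_vec, k=10, ef=80):
    # ef ~40 for the latency-sensitive path, ~200 when recall matters more.
    with conn.transaction():
        conn.execute(f"SET LOCAL hnsw.ef_search = {int(ef)}")
        return conn.execute(
            """
            SELECT id, text, embedding <=> %s AS distance
            FROM chunks
            WHERE tenant_id = %s
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (query_vec, tenant_id, query_vec, k),
        ).fetchall()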

The same workload in Pinecone serverless:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="...")

if "rag" not in [i.name for i in pc.list_indexes()]:
    pc.create_index(
        name="rag",
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index("rag")


def upsert(chunk_id, vec, tenant_id, source_id, text):
    # Namespaces are strings; metadata (including the raw text)
    # counts against Pinecone's per-vector metadata size limit.
    index.upsert(
        vectors=[{
            "id": chunk_id,
            "values": vec,
            "metadata": {
                "tenant_id": tenant_id,
                "source_id": source_id,
                "text": text,
            },
        }],
        namespace=tenant_id,
    )


def search(tenant_id, query_vec, k=10):
    return index.query(
        vector=query_vec,
        top_k=k,
        namespace=tenant_id,
        include_metadata=True,
    ).matches

Per-tenant namespace is the Pinecone equivalent of the WHERE tenant_id = $2 filter on the Postgres side. Both isolate; the SQL version composes with arbitrary other predicates and joins, while the Pinecone version composes only with flat metadata filters.
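
Those metadata filters handle the flat predicates, equality and range checks on fields you denormalized at upsert time. A sketch using the fields written by the upsert above:

# Flat predicates compose via metadata filters; joins do not.
index.query(
    vector=query_vec,
    top_k=10,
    namespace=tenant_id,
    filter={"source_id": {"$eq": source_id}},
    include_metadata=True,
)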

When each one wins

A short decision tree, sharpened by enough postmortems to be worth taking seriously.

Postgres + pgvector wins when:

  • You already operate Postgres at the size you need (or close to it).
  • Your queries combine vector search with tenant, ACL, recency, or join filters that benefit from a SQL plan.
  • You want to keep your data in one place for backup, replication, and compliance reasons.
  • Your corpus is below ~50M vectors. Above that, look at pgvectorscale or move on.

Pinecone wins when:

  • You do not run Postgres and are not going to start.
  • Your team is small enough that operating any database is a real cost.
  • The corpus and the query rate fit cleanly inside the serverless model.
  • You are okay with proprietary index behavior — if you need to tune ef_search per query, you are on the wrong platform.

Weaviate or Qdrant fit when:

  • You want managed-or-self-hosted optionality without coupling to a single cloud's serverless model.
  • You need hybrid search (BM25 + vector) and you want it native, not bolted on.
  • You are at scale where dedicated infra matters and you have an SRE team that wants control without writing pgvector tuning guides.

What this comparison is not

It is not a benchmark. The numbers in the latency table are bands sourced from named public benchmarks where they exist and labeled illustrative where they do not. If you are picking a vector store for a new system today, the right move is to run a 1M-vector test on your own data with your own embedding model on your own infrastructure. Both vendors will give you the SDK in twenty minutes. The difference between "this is what the docs say" and "this is what production looks like" is a long afternoon, and that afternoon is the cheapest one you will spend this quarter.
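
A starting point for that afternoon, a minimal sketch assuming a search(tenant_id, vec, k) callable like either of the ones above and a sample of production query vectors:

import time

def p95_ms(search_fn, tenant_id, query_vecs, k=10, warmup=20):
    # Warm caches first so you measure steady state, not cold start.
    for vec in query_vecs[:warmup]:
        search_fn(tenant_id, vec, k)
    samples = []
    for vec in query_vecs:
        start = time.perf_counter()
        search_fn(tenant_id, vec, k)
        samples.append((time.perf_counter() - start) * 1000)
    samples.sort()
    return samples[int(len(samples) * 0.95)]

Run it against both stores with the same vectors and the same k, or the comparison means nothing.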

Underneath the Slack thread, the question is whether vector search belongs in your existing database or in a new one. The answers your team has reached for that question before (for caches, for queues, for full-text indexes) are a reasonable starting point.


If this was useful

The Database Playbook is the long version of the question this post asks: when to keep a workload on Postgres, when to add a specialized store, and how to tell the difference before you are migrating data at 3am. The RAG Pocket Guide covers the retrieval side end-to-end (chunking, indexing, reranking) for the half of this decision the vector store does not touch.
