Yasser B.
I indexed 30 million Hacker News posts into a single Postgres database. Here is what I learned about pgvector at scale.

For years I have run Kubernetes clusters and CI/CD pipelines for a living. Big org, 5,000 developers, the usual enterprise stuff. But on the side I have always been building things. A few weeks ago I launched Rivestack, a managed Postgres service built around pgvector for AI workloads. The problem was obvious: nobody knew it existed, and I had zero marketing budget.

So instead of writing blog posts nobody would read or tweeting into the void, I decided to build something useful and give it away. I built a semantic search engine over the entire Hacker News archive.
You can try it here: https://ask.rivestack.io

Type any question in plain English and it finds relevant HN threads by meaning, not by matching keywords. Search for "how to deal with burnout as a solo founder" and you will get threads where people talked about exactly that, even if they never used the word burnout.
The whole thing runs on a single Postgres instance with pgvector. No Pinecone. No Weaviate. No Qdrant. Just Postgres.
Here is what I learned building it.

pgvector is way faster than people think
There is this narrative that pgvector is fine for prototyping but you need a "real" vector database for production. That has not been my experience at all.

With HNSW indexes on NVMe storage, I am getting sub-4 ms query latency at 99% recall. For context, that is fast enough that users cannot tell the difference between this and a keyword search. The bottleneck in most RAG and search applications is not the vector lookup; it is everything else around it.

I spent a lot of time tuning HNSW parameters and the biggest takeaway is that the defaults are actually pretty good for most workloads. The main thing that matters is having enough memory for the index and using NVMe storage. If your index fits in memory, pgvector flies.
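To make that concrete, here is the shape of the setup. The table and column names are illustrative, and the WITH values shown are pgvector's documented defaults:

```sql
-- illustrative schema; "posts" and "embedding" are placeholder names
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE posts (
    id         bigint PRIMARY KEY,
    title      text,
    body       text,
    created_at timestamptz,
    embedding  vector(1536)   -- dimension depends on your embedding model
);

-- HNSW index; m = 16 and ef_construction = 64 are pgvector's defaults
CREATE INDEX ON posts USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- per-session knob that trades latency for recall (default 40)
SET hnsw.ef_search = 40;
```

Raising `hnsw.ef_search` is the cheap lever: it improves recall at query time without rebuilding the index, at the cost of some latency.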

HNSW vs IVFFlat is not even a close decision anymore

I started with IVFFlat because it uses less memory. Bad idea. The recall was inconsistent, and it degraded as I added and deleted data because the centroids went stale. I switched to HNSW and never looked back. The build time is slower, but the query performance and recall stability are worth it.

If you are starting a new project in 2026, just use HNSW. The memory overhead is manageable and the results are dramatically better.

Keeping everything in one database is a superpower
This is the part that surprised me the most. When your embeddings live next to your relational data, things get simple in ways you do not expect.

Want to filter search results by date? Just add a WHERE clause. Want to boost recent posts? Order by a combination of vector distance and recency. Want to update an embedding when the source data changes? Do it in a single transaction and you never have stale data.
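Each of those is just SQL. A hedged sketch of the filter-plus-recency case, where the `posts` table, its columns, and the recency weight are all illustrative and `$1` stands for the query embedding passed in by the application:

```sql
-- filter by date, then blend vector distance with a recency penalty;
-- the 0.01-per-day weight is an arbitrary illustration, tune to taste
SELECT id, title
FROM posts
WHERE created_at > now() - interval '1 year'
ORDER BY (embedding <=> $1)                                      -- cosine distance
       + 0.01 * extract(epoch FROM now() - created_at) / 86400   -- days old
LIMIT 20;
```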

With a separate vector database you need to build syncing logic, handle failures where one system updated but the other did not, and manage two sets of credentials, backups, and monitoring. With pgvector it is just Postgres. The same backups, the same monitoring, the same failover you already know.
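The single-transaction update is equally plain. A minimal sketch, assuming a hypothetical `posts` table and an embedding recomputed by the application before this runs:

```sql
-- source text and its embedding change together or not at all
BEGIN;
UPDATE posts
SET body      = $1,   -- new source text
    embedding = $2    -- embedding of the new text
WHERE id = $3;
COMMIT;
```

There is no window where a reader can see new text with an old embedding, which is exactly the failure mode a separate vector store forces you to engineer around.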

Embeddings are the easy part. Data cleaning is the hard part.
I expected the embedding generation to be the big challenge. It was not. The real work was cleaning and preparing the HN data: deleted posts, HTML entities, encoding issues, extremely long comments that needed to be chunked, and short comments that are just "this" or "+1" and add noise to the index.
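Much of that filtering can happen in SQL before anything is sent to an embedding model. A sketch, with a hypothetical `hn_items` table and an arbitrary 40-character cutoff; real HN data needs fuller entity decoding than the two shown here:

```sql
-- pick embeddable items: skip deleted posts, strip simple HTML tags,
-- decode a couple of common entities, drop low-signal short comments
SELECT id,
       replace(replace(regexp_replace(text, '<[^>]+>', ' ', 'g'),
               '&amp;', '&'), '&#x27;', '''') AS cleaned
FROM hn_items
WHERE NOT deleted
  AND length(trim(text)) > 40;   -- filters out "this" and "+1" style noise
```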

I probably spent 70% of my time on data preparation and 30% on everything else combined. If you are building something similar, budget your time accordingly.

The cost surprised me too
I keep hearing that vector search is expensive. Maybe it is if you are paying Pinecone prices at scale. Running this on a single Postgres instance with pgvector costs me almost nothing compared to what a dedicated vector database would charge for the same dataset. The entire infrastructure for ask.rivestack.io costs less than what some teams spend on their vector database alone.

That is not a knock on Pinecone or Qdrant. They solve real problems at very large scale. But for the vast majority of applications, the ones with under 50 million vectors, pgvector on properly configured Postgres is more than enough.

What I would do differently

If I were starting over, three things:
First, I would use HNSW from day one instead of wasting a week on IVFFlat.

Second, I would invest more in data quality upfront. Garbage embeddings give garbage results no matter how good your index is.

Third, I would set up monitoring for recall quality from the start. It is easy to ship something that seems to work and not notice that results are degrading as your dataset grows.
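One way to monitor recall with pgvector: periodically re-run a sample of queries with index scans disabled to get exact ground truth, and measure the overlap with the indexed results. A sketch against a hypothetical `posts` table, with `$1` as a sampled query embedding:

```sql
-- 1) approximate top 20 via the HNSW index
SELECT array_agg(id) FROM (
    SELECT id FROM posts ORDER BY embedding <=> $1 LIMIT 20
) t;

-- 2) exact top 20: force a sequential scan for this transaction only
BEGIN;
SET LOCAL enable_indexscan = off;
SELECT array_agg(id) FROM (
    SELECT id FROM posts ORDER BY embedding <=> $1 LIMIT 20
) t;
COMMIT;

-- recall = |intersection of the two lists| / 20, computed in the application
```

Exact scans are slow on 30 million rows, so this belongs in a scheduled job over a small query sample, not on the request path.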

Why I built this as a product

After going through all of this, I realized that the hard part of pgvector is not pgvector itself. It is everything around it: provisioning, backups, HA, monitoring, index tuning, keeping Postgres updated. The vector search part is actually the easy part once the infrastructure is solid.

That is why I built Rivestack (https://rivestack.io). It is managed Postgres with pgvector already configured and optimized. You get a database in minutes with backups, metrics, SSL, and autoscaling. There is a free tier if you want to try it.

I am a one person team running this alongside my day job so I am not trying to compete with Supabase or Neon on features. I am just trying to be the simplest option for people who want pgvector to work without thinking about infrastructure.

Try the search

The semantic search over HN is free and will stay free. Go play with it: https://ask.rivestack.io
If you are building something with pgvector and have questions about tuning or architecture, drop them in the comments. I have made most of the mistakes already so I might be able to save you some time.
