Faiz MD
How I Built a 37 Million Row Search Engine That Returns Results in 200ms

Published on dev.to | leadvault.to

I got tired of paying $99/month for B2B lead databases. So I built my own.
What started as a side project turned into a production system handling 37 million records with sub-200ms query times. Here's exactly how I did it, what broke along the way, and what I'd do differently.
The Stack
● ClickHouse — columnar database, the core of everything
● FastAPI — Python backend, 6 uvicorn workers
● Next.js — frontend
● Supabase — auth and user credits
● AWS EC2 m7i-flex.large — 2 vCPU, 8GB RAM
The interesting choice here is ClickHouse. Most people would reach for Postgres or MySQL for something like this. That would have been a mistake.
Why ClickHouse Over Postgres
My dataset is 37 million rows, each with ~15 columns: name, email, job title, company, location, seniority, and so on.
A typical query looks like:
SELECT first_name, last_name, email_domain, job_title, company,
location_city, location_country, seniority
FROM leads_clean
WHERE location_country = 'United States'
AND seniority = 'manager'
LIMIT 100
On Postgres with proper indexes, this query on 37M rows takes 8-15 seconds. On ClickHouse it takes 80-200ms.
ClickHouse is a columnar store — it only reads the columns you query instead of full rows. For analytics-style queries with filters across multiple columns, it's dramatically faster.
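The access pattern can be sketched in plain Python: if the table is stored as one list per column, a filter over two columns only touches those two lists, not all fifteen fields of every row. This is a toy illustration of why columnar filtering is cheap, not ClickHouse itself.

```python
# Column store: one list per column (toy data, three columns shown).
columns = {
    "first_name": ["Ada", "Linus", "Grace"],
    "location_country": ["United States", "Finland", "United States"],
    "seniority": ["manager", "ic", "manager"],
}

# WHERE location_country = 'United States' AND seniority = 'manager'
# reads just the two filtered columns to find matching row indexes.
matches = [
    i
    for i, (country, seniority) in enumerate(
        zip(columns["location_country"], columns["seniority"])
    )
    if country == "United States" and seniority == "manager"
]

# Only now fetch the projected column for the matching rows.
result = [columns["first_name"][i] for i in matches]
print(result)  # ['Ada', 'Grace']
```

A row store would have to materialize every field of every row before it could even evaluate the filter; the columnar layout skips the columns the query never mentions.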
The Biggest Mistake: Frontend Reveals at Scale
My first bulk reveal implementation worked like this:
// BAD - don't do this for bulk operations
const results = []
for (const lead of leads) {
  const res = await fetch(`/api/reveal/${lead.domain}`)
  results.push(await res.json())
}
This worked fine for 10 leads. At 60 leads it started failing. The browser would timeout, promises would hang, state would get corrupted.
The fix was moving everything server-side. Result: 200 leads exported as a full CSV in 0.2 seconds. The browser never times out because it's a single HTTP request that streams a file.
Lesson: never do bulk data operations in frontend loops. Always batch server-side.
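A minimal sketch of the server-side pattern (reveal_email is a hypothetical stand-in for the real per-lead lookup): do all the work inside one request handler and return a single CSV payload, so the browser makes exactly one HTTP call.

```python
import csv
import io


def reveal_email(domain: str) -> str:
    # Hypothetical stand-in for the real per-lead reveal logic.
    return f"contact@{domain}"


def bulk_export_csv(leads: list[dict]) -> str:
    """Resolve every lead server-side and return one CSV body."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["domain", "email"])
    for lead in leads:
        writer.writerow([lead["domain"], reveal_email(lead["domain"])])
    return buf.getvalue()


csv_body = bulk_export_csv([{"domain": "sncf.fr"}, {"domain": "example.com"}])
print(csv_body.splitlines()[1])  # sncf.fr,contact@sncf.fr
```

The loop still runs N lookups, but it runs them next to the data, and the client sees one request with one file response instead of N round trips.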
Startup Cache Warming
Some domains have thousands of records (large enterprises). The first query against @sncf.fr (11,893 records) was slow because ClickHouse had to scan many granules.
The solution was warming a cache of the top 500 domains at startup in a background thread. After warmup (~2 minutes), the top 500 domains return in under 5ms from cache.
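One way to sketch that warmup, assuming a dict cache guarded by a lock (query_clickhouse and the domain list are placeholders, not the actual code): fill the cache in a daemon thread at startup, and serve cache hits before falling back to the database.

```python
import threading

cache: dict[str, list[dict]] = {}
cache_lock = threading.Lock()


def query_clickhouse(domain: str) -> list[dict]:
    # Placeholder for the real ClickHouse query.
    return [{"domain": domain}]


def warm_cache(top_domains: list[str]) -> None:
    """Pre-load the hottest domains so their first query is a memory hit."""
    for domain in top_domains:
        rows = query_clickhouse(domain)
        with cache_lock:
            cache[domain] = rows


def lookup(domain: str) -> list[dict]:
    with cache_lock:
        hit = cache.get(domain)
    return hit if hit is not None else query_clickhouse(domain)


# At startup: warm in the background so the app can serve immediately.
warmer = threading.Thread(target=warm_cache, args=(["sncf.fr"],), daemon=True)
warmer.start()
warmer.join()  # joined here only to make the example deterministic
print(lookup("sncf.fr"))
```

In production the thread would keep running while uncached domains fall through to ClickHouse; nothing blocks on the warmup finishing.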
The Query That Killed Performance
LIKE '%@domain' — leading wildcard LIKE queries are slow because they can't use indexes. They scan every row.
-- SLOW - can't use index
WHERE email LIKE '%@sncf.fr'

-- FASTER - vectorized function
WHERE positionCaseInsensitive(email, '@sncf.fr') > 0
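Worth noting: the two predicates are not strictly identical. positionCaseInsensitive is a case-insensitive substring match, while LIKE '%@sncf.fr' is a case-sensitive suffix match; the win comes from the function being a vectorized scan rather than an index miss. A toy Python check of the semantics:

```python
def like_suffix(email: str, pattern: str) -> bool:
    # LIKE '%@sncf.fr' semantics: case-sensitive ends-with match.
    return email.endswith(pattern)


def position_case_insensitive(email: str, needle: str) -> bool:
    # positionCaseInsensitive(email, needle) > 0: substring match, any case.
    return needle.lower() in email.lower()


emails = ["jean.dupont@sncf.fr", "ANNE@SNCF.FR", "someone@example.com"]
print([e for e in emails if position_case_insensitive(e, "@sncf.fr")])
```

If your data mixes letter cases, the substring version actually matches more rows than the original LIKE, which may be what you want for email domains anyway.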
What I'd Do Differently

  1. Design for server-side bulk from day one. The frontend reveal loop was always going to fail at scale.
  2. Add query timeouts to every ClickHouse call. Always use SETTINGS max_execution_time=5.
  3. Use FINAL carefully. ReplacingMergeTree requires FINAL to deduplicate during reads. On 37M rows this adds overhead.

The project is live at leadvault.to: 37M B2B contacts, searchable by country, job title, seniority, and company. 100 free credits daily.
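Point 2 can be enforced mechanically, for example by appending the setting to every query string before it reaches the client library (a sketch; where this hook lives depends on your ClickHouse client):

```python
def with_timeout(query: str, seconds: int = 5) -> str:
    """Append a hard execution cap so no single query can hang a worker."""
    return f"{query.rstrip().rstrip(';')} SETTINGS max_execution_time={seconds}"


q = with_timeout("SELECT count() FROM leads_clean WHERE seniority = 'manager'")
print(q)
```

Routing every call through one helper like this means a forgotten timeout can't slip into production.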
