<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Faiz MD</title>
    <description>The latest articles on DEV Community by Faiz MD (@faiz_md_37b12e52f7275ac34).</description>
    <link>https://dev.to/faiz_md_37b12e52f7275ac34</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3815589%2F833d287b-7a58-490f-b6c5-07e893c1ad54.png</url>
      <title>DEV Community: Faiz MD</title>
      <link>https://dev.to/faiz_md_37b12e52f7275ac34</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/faiz_md_37b12e52f7275ac34"/>
    <language>en</language>
    <item>
      <title>How I Built a 37 Million Row Search Engine That Returns Results in 200ms</title>
      <dc:creator>Faiz MD</dc:creator>
      <pubDate>Mon, 09 Mar 2026 22:45:40 +0000</pubDate>
      <link>https://dev.to/faiz_md_37b12e52f7275ac34/how-i-built-a-37-million-row-search-engine-that-returns-results-in-200ms-k0l</link>
      <guid>https://dev.to/faiz_md_37b12e52f7275ac34/how-i-built-a-37-million-row-search-engine-that-returns-results-in-200ms-k0l</guid>
      <description>&lt;p&gt;Published on dev.to  |  leadvault.to&lt;/p&gt;

&lt;p&gt;I got tired of paying $99/month for B2B lead databases. So I built my own.&lt;br&gt;
What started as a side project turned into a production system handling 37 million records with sub-200ms query times. Here's exactly how I did it, what broke along the way, and what I'd do differently.&lt;br&gt;
The Stack&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ClickHouse — columnar database, the core of everything&lt;/li&gt;
&lt;li&gt;FastAPI — Python backend, 6 uvicorn workers&lt;/li&gt;
&lt;li&gt;Next.js — frontend&lt;/li&gt;
&lt;li&gt;Supabase — auth and user credits&lt;/li&gt;
&lt;li&gt;AWS EC2 m7i-flex.large — 2 vCPU, 8GB RAM&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
The interesting choice here is ClickHouse. Most people would reach for Postgres or MySQL for something like this. That would have been a mistake.&lt;br&gt;
Why ClickHouse Over Postgres&lt;br&gt;
My dataset is 37 million rows, each with ~15 columns: name, email, job title, company, location, seniority, etc.&lt;br&gt;
A typical query looks like:&lt;br&gt;
SELECT first_name, last_name, email_domain, job_title, company,&lt;br&gt;
       location_city, location_country, seniority&lt;br&gt;
FROM leads_clean&lt;br&gt;
WHERE location_country = 'United States'&lt;br&gt;
AND seniority = 'manager'&lt;br&gt;
LIMIT 100&lt;br&gt;
On Postgres with proper indexes, this query on 37M rows takes 8-15 seconds. On ClickHouse it takes 80-200ms.&lt;br&gt;
ClickHouse is a columnar store — it only reads the columns you query instead of full rows. For analytics-style queries with filters across multiple columns, it's dramatically faster.&lt;br&gt;
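To see why, here is a toy Python model of row versus columnar scans (an illustration only, not how ClickHouse is actually implemented): in the columnar layout, a filter touches only the filtered and projected columns instead of every field of every row.

```python
# Toy illustration of row-store vs. column-store scans.
# With ~15 columns, a query filtering on 2 of them and projecting 1
# touches a small fraction of the data in a columnar layout.

rows = [
    {"first_name": f"user{i}", "email_domain": "example.com",
     "location_country": "United States" if i % 2 else "France",
     "seniority": "manager" if i % 3 == 0 else "ic"}
    for i in range(100_000)
]

def row_scan(rows):
    # Row store: every full row is touched to evaluate the filter.
    return [r["first_name"] for r in rows
            if r["location_country"] == "United States"
            and r["seniority"] == "manager"]

# Column store: the same data kept as one list per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

def column_scan(cols):
    # Only three columns are ever read: two filters plus one projection.
    country, senior = cols["location_country"], cols["seniority"]
    return [cols["first_name"][i] for i in range(len(country))
            if country[i] == "United States" and senior[i] == "manager"]

assert row_scan(rows) == column_scan(columns)
```

Both scans return identical results; the difference is how many bytes each layout has to read to get there, which is where the 8-15s vs. 80-200ms gap comes from.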
The Biggest Mistake: Frontend Reveals at Scale&lt;br&gt;
My first bulk reveal implementation worked like this:&lt;br&gt;
// BAD - don't do this for bulk operations&lt;br&gt;
for (const lead of leads) {&lt;br&gt;
  const res = await fetch(`/api/reveal/${lead.domain}`)&lt;br&gt;
  results.push(await res.json())&lt;br&gt;
}&lt;br&gt;
This worked fine for 10 leads. At 60 it started failing: requests would time out, promises would hang, and component state would get corrupted.&lt;br&gt;
The fix was moving everything server-side. Result: 200 leads exported as a full CSV in 0.2 seconds. The browser never times out because it's a single HTTP request that streams a file.&lt;br&gt;
Lesson: never do bulk data operations in frontend loops. Always batch server-side.&lt;br&gt;
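A minimal sketch of the server-side batch export, using only the Python standard library. Here fetch_email is a hypothetical stand-in for the real reveal lookup; the actual endpoint would call ClickHouse or the cache.

```python
import csv
import io

# Hypothetical stand-in for the real reveal lookup (one DB/cache hit).
def fetch_email(domain: str) -> str:
    return f"contact@{domain}"

def export_csv(leads: list) -> str:
    """Resolve every reveal server-side and return one CSV payload.

    The browser makes a single request and downloads the result,
    instead of issuing one fetch per lead from a frontend loop.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["first_name", "domain", "email"])
    writer.writeheader()
    for lead in leads:
        writer.writerow({**lead, "email": fetch_email(lead["domain"])})
    return buf.getvalue()

leads = [{"first_name": "Ada", "domain": "sncf.fr"},
         {"first_name": "Alan", "domain": "example.com"}]
print(export_csv(leads))
```

In production this string would be returned as a streaming response with a text/csv content type; the key property is that the loop over leads runs next to the database, not in the browser.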
Startup Cache Warming&lt;br&gt;
Some domains have thousands of records (large enterprises). The first query against @sncf.fr (11,893 records) was slow because ClickHouse had to scan many granules.&lt;br&gt;
The solution was warming a cache of the top 500 domains at startup in a background thread. After warmup (~2 minutes), the top 500 domains return in under 5ms from cache.&lt;br&gt;
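A sketch of that warming pattern, with run_query standing in for the real ClickHouse call and a plain dict standing in for the cache:

```python
import threading
import time

# Hypothetical stand-in for the real ClickHouse query (slow on big domains).
def run_query(domain: str) -> list:
    time.sleep(0.001)  # simulate scanning granules
    return [f"user@{domain}"]

CACHE = {}

def warm_cache(top_domains: list) -> None:
    # Pre-compute results for the hottest domains so their first
    # real request is a cache hit instead of a cold scan.
    for domain in top_domains:
        CACHE[domain] = run_query(domain)

def lookup(domain: str) -> list:
    # Hit: returned from memory. Miss: fall through to the database.
    if domain in CACHE:
        return CACHE[domain]
    return run_query(domain)

# Run warming in a background thread at startup so it never blocks traffic.
warmer = threading.Thread(target=warm_cache,
                          args=(["sncf.fr", "example.com"],), daemon=True)
warmer.start()
warmer.join()  # joined here only so the example is deterministic
```

The daemon thread means a slow warmup can never hold up server startup; until it finishes, queries simply take the cold path.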
The Query That Killed Performance&lt;br&gt;
LIKE '%@domain' — leading-wildcard LIKE queries are slow because they can't use the primary index. They scan every row.&lt;br&gt;
-- SLOW - can't use index&lt;br&gt;
WHERE email LIKE '%@sncf.fr'&lt;/p&gt;

&lt;p&gt;-- FASTER - vectorized function&lt;br&gt;
WHERE positionCaseInsensitive(email, '@sncf.fr') &amp;gt; 0&lt;br&gt;
What I'd Do Differently&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Design for server-side bulk from day one. The frontend reveal loop was always going to fail at scale.&lt;/li&gt;
&lt;li&gt;Add query timeouts to every ClickHouse call. Always use &lt;code&gt;SETTINGS max_execution_time=5&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Use FINAL carefully. ReplacingMergeTree requires FINAL to deduplicate during reads, and on 37M rows that adds overhead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The project is live at leadvault.to — 37M B2B contacts, searchable by country, job title, seniority, and company. 100 free credits daily.&lt;/p&gt;
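The timeout rule from the list above can be enforced mechanically. A minimal sketch; real code would more likely pass the setting through the client driver rather than rewrite SQL strings:

```python
def with_timeout(sql: str, seconds: int = 5) -> str:
    """Append a hard execution cap to a ClickHouse query.

    A runaway scan then fails fast instead of pinning a 2-vCPU box.
    """
    return f"{sql.rstrip().rstrip(';')} SETTINGS max_execution_time={seconds}"

print(with_timeout("SELECT count() FROM leads_clean WHERE seniority = 'manager'"))
```

Routing every query through one helper like this means no call path can forget the cap.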

</description>
      <category>clickhouse</category>
      <category>python</category>
      <category>database</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
