Why I Index Only 32k Companies When Apollo Has 250M: A B2B Database Trade-off

#architecture #database #saas #startup

One of the questions I get from prospects evaluating findmemail.io against Apollo or ZoomInfo: "your database is so much smaller — why?"

The honest answer: it's a deliberate trade-off, and the smaller-but-cleaner approach actually wins for the use case I'm targeting. Here's the architecture reasoning.

The "more contacts = better" myth

Big B2B databases brag about contact count: Apollo claims 250M+, ZoomInfo claims a similar figure, and Seamless says 1.8B+ emails. The implicit promise is that more is better.

In practice, more contacts = more stale records. A commonly cited figure for B2B contact churn is ~3% per month, which means a 250M-contact database loses ~7.5M valid records every month. Re-verifying all of them at that scale is expensive enough that the providers don't do it on every entry; they batch-verify, delay flagging stale records, and let the "verified" status drift.
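
To make the churn arithmetic concrete, here is a minimal sketch. It assumes the ~3% monthly churn figure above and compounds it month over month; the function name `stale_after` is made up for illustration.

```python
def stale_after(records: int, monthly_churn: float, months: int) -> int:
    """Records expected to have gone stale after `months`, compounding monthly."""
    still_valid = records * (1 - monthly_churn) ** months
    return round(records - still_valid)


DATABASE_SIZE = 250_000_000  # contact count quoted above
MONTHLY_CHURN = 0.03         # ~3% per month, the figure quoted above

# Stale records after one month, and after a year with no re-verification.
print(stale_after(DATABASE_SIZE, MONTHLY_CHURN, 1))
print(stale_after(DATABASE_SIZE, MONTHLY_CHURN, 12))
```

Under these assumptions, the first month accounts for the ~7.5M figure, and a record set left untouched for a year would have roughly 30% of its entries go stale.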

The result: a "verified" Apollo email has roughly a 5-15% chance of bouncing, depending on how recently the record was touched. ZoomInfo's numbers are slightly better because they invest more in verification, but they are still meaningfully above 2%.

Why bounces matter more than database size

Cold email deliverability is gated by sender domain reputation. Once your domain trips a bounce threshold (~5% per send window), Gmail and Outlook start filtering everything from you to spam. Recovering domain reputation is slow and uncertain — sometimes you have to burn the domain entirely.

So the practical question is not "how many contacts can I get" but "how many contacts can I send to without burning my domain." A 1,000-email list at 1% bounce rate is more sendable than a 10,000-email list at 8% bounce rate.
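
The comparison is easy to check with a few lines. This is only a sketch: the ~5% threshold is the rough figure from above, and the helper names `expected_bounces` and `is_sendable` are hypothetical.

```python
BOUNCE_THRESHOLD = 0.05  # ~5% per send window, the rough figure quoted above


def expected_bounces(list_size: int, bounce_rate: float) -> int:
    """Number of addresses on the list expected to bounce."""
    return round(list_size * bounce_rate)


def is_sendable(bounce_rate: float, threshold: float = BOUNCE_THRESHOLD) -> bool:
    """True if the list's bounce rate stays under the reputation threshold."""
    return bounce_rate < threshold


for size, rate in [(1_000, 0.01), (10_000, 0.08)]:
    print(size, expected_bounces(size, rate), is_sendable(rate))
```

The small list produces 10 expected bounces and stays under the threshold; the large one produces 800 and crosses it, so sending to it puts the whole domain at risk.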

The architectural choice

When I designed findmemail.io, I made one constraint up front: never return an email we haven't SMTP-verified at request time. Not pattern-matched, not domain-validated, not cached from a 6-month-old probe. RCPT-TO check, fresh, on every lookup.

That constraint determines the database size:

We can't index 250M contacts because we'd need to constantly re-verify their emails. SMTP probes come with rate limits, IP-rotation costs, and anti-spoofing requirements. Doing it at 250M scale would be a multi-million-dollar/year infrastructure problem.
We CAN index 32k+ companies and re-verify aggressively. That's roughly the size at which quality stays manageable.
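
A back-of-the-envelope sketch of the probe volume helps show why scale is the problem. It assumes every record is re-probed once per refresh interval; the 7-day interval is borrowed from the cache TTL mentioned later in this post, and `probes_per_second` is a hypothetical helper. It says nothing about cost, only sustained load.

```python
SECONDS_PER_DAY = 86_400


def probes_per_second(records: int, refresh_days: float) -> float:
    """Sustained probe rate needed to re-verify every record once per interval."""
    return records / (refresh_days * SECONDS_PER_DAY)


# 250M records, re-verified weekly vs. monthly (both intervals are assumptions).
print(round(probes_per_second(250_000_000, 7)))
print(round(probes_per_second(250_000_000, 30)))
```

Under these assumptions, a weekly refresh of 250M records means sustaining roughly 400 probes per second around the clock, before retries, while each individual mail server only tolerates a small number of probes per source IP.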

What this means for users

The trade-off:

Apollo (large database)	findmemail.io (smaller, fresher)
250M+ contacts	32k+ companies
~5-15% bounce rate	<2% bounce rate
You build broad lists	You build precise lists
Per-seat pricing	Lifetime tier

For a 10,000-person sales org needing to spray-and-pray on a contact firehose, Apollo's size is the right answer.

For an indie founder targeting a precise ICP at indie-founder volumes, the smaller-but-cleaner database wins. You're not trying to send to 50M people. You're trying to send to the right 500 people, get verified emails for them, and not burn your domain.

Implementation notes (for builders curious about the SMTP layer)

The SMTP probe at request time is the hard part. Quick architecture sketch:

client → API → cache(7-day TTL) → SMTP probe pool → MX server → response

Probe steps:

MX lookup for the recipient domain
TCP connect to MX on port 25
HELO/EHLO with throwaway sending domain
MAIL FROM with throwaway sender
RCPT TO with target email
Read response (250 = accept, 550 = reject, 421/451 = retry)
QUIT — never DATA, never deliver
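
The probe steps above can be sketched with Python's standard `smtplib`. This is an illustration under stated assumptions, not the production code: `probe`, `classify`, and the `smtp_factory` parameter are hypothetical names, the MX lookup is assumed to have happened already (the standard library has no MX resolver, so `mx_host` is passed in), and connection errors are treated as "retry" by choice.

```python
import smtplib


def classify(code: int) -> str:
    """Map an SMTP reply code to the outcomes listed in the probe steps."""
    if code == 250:
        return "accept"
    if code == 550:
        return "reject"
    if code in (421, 451):
        return "retry"
    return "unknown"


def probe(mx_host, target, helo_domain, sender,
          smtp_factory=smtplib.SMTP, timeout=10.0):
    """Run HELO/EHLO, MAIL FROM, RCPT TO, then QUIT. Never sends DATA."""
    try:
        conn = smtp_factory(mx_host, 25, timeout=timeout)  # TCP connect on port 25
    except OSError:  # smtplib exceptions subclass OSError
        return "retry"  # assumption: connection failures are worth retrying
    try:
        code, _ = conn.ehlo(helo_domain)
        if not 200 <= code <= 299:
            conn.helo(helo_domain)  # fall back to HELO if EHLO is refused
        code, _ = conn.mail(sender)
        if code != 250:
            return classify(code)
        code, _ = conn.rcpt(target)
        return classify(code)
    except OSError:
        return "retry"
    finally:
        try:
            conn.quit()  # always end the session without delivering anything
        except OSError:
            pass
```

A call would look like `probe("mx.example.test", "jane@example.test", "probe.example.test", "check@probe.example.test")`, where all four values are placeholders. Passing a different `smtp_factory` makes it possible to exercise the logic without touching the network.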

Anti-detection challenges:

Mail servers rate-limit probes per source IP. Need an IP rotation pool.
Some servers (Google Workspace, Microsoft 365) won't give a usable per-address answer to RCPT TO probes and return the same non-committal reply (for example 252, "cannot verify") for everything. We fall back to historical pattern data for these.
Catch-all detection: probe a known-bad address first; if it returns 250, the domain is catch-all and we don't return individual emails.
Greylisting: the server returns 451 on the first attempt and expects a retry. We retry with backoff, up to 3 times over the course of an hour.
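
Two of these, catch-all detection and greylisting retries, fit in a short sketch. It assumes a `probe` callable that returns "accept", "reject", "retry", or "unknown" for an address; the names `verify` and `probe_with_retry` and the 5/10/20-minute delays are illustrative choices that merely fit inside the one-hour window.

```python
import time
import uuid
from typing import Callable


def probe_with_retry(address: str, probe: Callable[[str], str],
                     max_retries: int = 3, base_delay: float = 300.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Re-probe on a 'retry' outcome (e.g. greylisting) with exponential backoff."""
    result = probe(address)
    for attempt in range(max_retries):
        if result != "retry":
            break
        sleep(base_delay * 2 ** attempt)  # 300s, 600s, 1200s: 35 minutes in total
        result = probe(address)
    return result


def verify(email: str, probe: Callable[[str], str],
           sleep: Callable[[float], None] = time.sleep) -> str:
    """Probe a random, almost certainly nonexistent address first, then the target."""
    domain = email.rsplit("@", 1)[1]
    canary = f"{uuid.uuid4().hex}@{domain}"
    if probe_with_retry(canary, probe, sleep=sleep) == "accept":
        return "catch-all"  # the domain accepts anything, so 250 proves nothing
    return probe_with_retry(email, probe, sleep=sleep)
```

The `sleep` parameter is injectable so the backoff can be replaced by a job queue or skipped in tests; blocking a request thread for up to 35 minutes would not be practical in a real service.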

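This pipeline runs on every lookup that is not already covered by the 7-day cache. P50 response time is ~800ms, P95 is ~2s. Caching for 7 days keeps the per-domain load reasonable while keeping results fresh: at ~3% monthly churn, fewer than 1% of addresses are expected to go stale within a 7-day window.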
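
The cache layer in front of the probe pool can be sketched as a small TTL map. This is a minimal in-memory illustration, assuming a `verify` callable that returns a result string for an address; the class name `VerificationCache` and the injectable `clock` are hypothetical, and a real deployment would more likely use a shared store.

```python
import time
from typing import Callable, Dict, Tuple

SEVEN_DAYS = 7 * 24 * 3600  # cache TTL in seconds


class VerificationCache:
    """Serve a stored result while it is younger than the TTL, else re-verify."""

    def __init__(self, verify: Callable[[str], str],
                 ttl: float = SEVEN_DAYS,
                 clock: Callable[[], float] = time.time) -> None:
        self._verify = verify
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, email: str) -> str:
        now = self._clock()
        hit = self._entries.get(email)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]  # still fresh: no SMTP probe needed
        result = self._verify(email)  # cache miss or expired: probe again
        self._entries[email] = (result, now)
        return result
```

This sketch stores every outcome; a production version would probably avoid caching transient results such as "retry", so that a greylisted address is not stuck with a non-answer for a week.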

What I'd build differently if I had a $10M budget

Honest answer: I'd still cap database size, but I'd index more precisely.

Grow from 32k into the 100k+ company range, but stay well short of 1M+
Add an intent-signal layer (job changes, funding rounds); currently you have to pair with Pharow or Clay for that
LinkedIn-derived enrichment, but only for verified + active accounts
Better dedup on people-who-changed-jobs (currently a manual step)

The thing I would NOT do: chase Apollo's 250M number. The trade-off doesn't favor it for the indie founder ICP.

Try it

findmemail.io — free tier on signup, 50 credits, no card. Run the same query you'd run on Apollo and compare the bounce rate on a small batch. The architecture-driven quality difference shows up immediately.

TL;DR

Database size is the wrong KPI for B2B email finder quality. Bounce rate is the right one, because bounce rate determines whether your sender domain stays sendable. Smaller databases with aggressive re-verification beat larger databases with stale records — at indie-founder volumes. That's the architectural bet behind findmemail.io.