The real challenges nobody talks about when you leave Google's shadow.
The Decision
Every developer has that moment. You're integrating a third-party search API, paying per request, watching costs climb, dealing with rate limits, getting irrelevant results you can't control. You think — how hard can it be to build our own?
Spoiler: harder than you think. But also more rewarding than you'd imagine.
This is everything we learned building a full search engine from the ground up — crawling, indexing, ranking, and serving results at scale.
Phase 1: The Crawler
The first challenge nobody warns you about — the web is hostile to crawlers.
from collections import deque

class WebCrawler:
    def __init__(self):
        self.visited = set()
        self.queue = deque()
        self.rate_limiter = RateLimiter(requests_per_second=2)

    async def crawl(self, url: str):
        if url in self.visited:
            return
        self.visited.add(url)  # mark immediately so it never gets re-queued
        await self.rate_limiter.wait()
        try:
            response = await self.fetch(url)
            links = self.extract_links(response.html)
            self.queue.extend(links)  # a worker loop drains this queue
        except BlockedByRobots:
            pass  # Respect robots.txt — always
What actually goes wrong:
- Sites block your user-agent within hours
- JavaScript-rendered pages return empty HTML
- Duplicate content floods your index
- robots.txt parsing edge cases are endless
- Some sites have infinite scroll — your crawler loops forever
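Many of those robots.txt edge cases are already handled by Python's stdlib parser, so you don't need to hand-roll one. A sketch — the robots.txt body and crawler name here are made up:

```python
from urllib.robotparser import RobotFileParser

def build_robots_checker(robots_txt: str, base_url: str) -> RobotFileParser:
    """Parse a robots.txt body you've already fetched and return a checker."""
    parser = RobotFileParser(url=base_url + "/robots.txt")
    parser.parse(robots_txt.splitlines())
    return parser

# Hypothetical robots.txt: everyone is blocked from /private/, 2s crawl delay
robots = build_robots_checker(
    "User-agent: *\nDisallow: /private/\nCrawl-delay: 2",
    "https://example.com",
)
print(robots.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
print(robots.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(robots.crawl_delay("MyCrawler"))  # 2
```

It won't cover every malformed file in the wild, but it's a solid baseline before you add your own fallbacks.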
What worked for us:
Rotating user-agents alone isn't enough. You need proper crawl politeness — respect Crawl-Delay, honor robots.txt strictly, and implement exponential backoff. Sites that see you crawling politely are less likely to block you permanently.
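Here's roughly what that politeness layer can look like — a simplified sketch of the RateLimiter used in the crawler above, with exponential backoff bolted on. The 60-second cap is an assumption; tune it per site:

```python
import asyncio
import time

class RateLimiter:
    """Spaces out requests and adds an exponential penalty after failures."""

    def __init__(self, requests_per_second: float = 2.0):
        self.min_interval = 1.0 / requests_per_second
        self.last_request = 0.0
        self.backoff = 0.0  # extra delay accumulated after failures

    async def wait(self):
        now = time.monotonic()
        delay = max(0.0, self.last_request + self.min_interval - now) + self.backoff
        if delay > 0:
            await asyncio.sleep(delay)
        self.last_request = time.monotonic()

    def record_failure(self):
        # Double the penalty on each failure, capped at 60s (assumed cap)
        self.backoff = min(max(self.backoff * 2, 1.0), 60.0)

    def record_success(self):
        self.backoff = 0.0
```

Call `record_failure()` on 429s and timeouts, `record_success()` on clean responses, and the crawler slows itself down before the site has to do it for you.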
For JS-rendered content, we ended up with a hybrid approach — static HTML fetch first, Playwright fallback only when necessary. Playwright at scale is expensive. Use it sparingly.
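One cheap way to decide when to escalate to the Playwright fallback: check whether the static fetch produced any real visible text. A heuristic sketch — the 200-character threshold is a guess you'd tune against your corpus:

```python
import re

def needs_js_rendering(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: escalate to a headless browser only when the static
    fetch yields almost no visible text (threshold is an assumption)."""
    # Drop script/style blocks, then all tags, then collapse whitespace
    stripped = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html,
                      flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r"<[^>]+>", " ", stripped)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) < min_text_chars

spa = "<html><body><div id='root'></div><script>app()</script></body></html>"
print(needs_js_rendering(spa))  # True — empty shell, needs rendering
```

False positives just cost you one headless render, so it's safe to keep the heuristic crude.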
Phase 2: The Index
Raw HTML is useless. You need to extract signal from noise.
from bs4 import BeautifulSoup

def extract_signals(html: str, url: str) -> PageSignals:
    soup = BeautifulSoup(html, 'lxml')
    return PageSignals(
        title=extract_title(soup),
        meta_description=extract_meta(soup),
        h1_tags=extract_headings(soup),
        body_text=clean_text(soup),
        internal_links=count_internal(soup, url),
        external_links=count_external(soup, url),
        word_count=count_words(soup),
        freshness=extract_publish_date(soup),
    )
The indexing problem nobody mentions:
Storage is cheap. Fast retrieval is not.
We went through three storage approaches:
┌───────────────────────┬─────────────────────────────────┐
│ Approach │ Problem │
├───────────────────────┼─────────────────────────────────┤
│ PostgreSQL full-text │ Too slow past 10M records │
├───────────────────────┼─────────────────────────────────┤
│ Elasticsearch │ Operational overhead was brutal │
├───────────────────────┼─────────────────────────────────┤
│ Custom inverted index │ Full control, worth the pain │
└───────────────────────┴─────────────────────────────────┘
An inverted index maps every term to the documents containing it. Simple concept. Brutal implementation.
"python" → [doc_42, doc_891, doc_2341, ...]
"search" → [doc_7, doc_42, doc_156, ...]
"python search" → intersection([doc_42, ...])
The intersection operation at scale — that's where query latency lives or dies.
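With sorted posting lists, the intersection is a linear two-pointer merge. A sketch using the doc IDs from the example above (891 appearing in both lists is illustrative):

```python
def intersect_postings(a: list[int], b: list[int]) -> list[int]:
    """Merge-style intersection of two sorted posting lists: O(len(a) + len(b))."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

python_docs = [42, 891, 2341]
search_docs = [7, 42, 156, 891]
print(intersect_postings(python_docs, search_docs))  # [42, 891]
```

When one list is tiny and the other is huge, galloping (exponential) search or skip pointers beats the plain merge — that's where most of the latency wins hide.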
Phase 3: Ranking
This is where things get philosophically interesting.
PageRank is brilliant in concept. Graph of the web, pages vote for each other via links, authority flows. In practice — you don't have the entire web's link graph. You have a slice.
What actually moves the needle for relevance:
def calculate_score(doc, query) -> float:
    score = 0.0
    # TF-IDF base score
    score += tfidf_score(doc, query) * 1.0
    # Title match = strong signal
    if query_in_title(doc, query):
        score *= 2.4
    # Freshness boost for news content
    score *= freshness_decay(doc.publish_date)
    # Domain authority proxy
    score += backlink_score(doc.domain) * 0.3
    # Engagement signals
    score += click_through_rate(doc) * 0.5
    return score
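The freshness_decay multiplier can be as simple as exponential decay with a floor, so old evergreen pages aren't zeroed out. The 30-day half-life and 0.5 floor here are assumptions, not tuned values:

```python
import math
from datetime import datetime, timezone

def freshness_decay(publish_date: datetime,
                    half_life_days: float = 30.0,
                    floor: float = 0.5) -> float:
    """Exponential decay on page age, clamped to [floor, 1.0].
    Half-life and floor are illustrative defaults, not tuned values."""
    age_days = (datetime.now(timezone.utc) - publish_date).total_seconds() / 86400
    decay = 0.5 ** (max(age_days, 0.0) / half_life_days)
    return floor + (1.0 - floor) * decay
```

A brand-new page scores ~1.0, a 30-day-old page ~0.75, and anything ancient bottoms out at the floor instead of vanishing.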
The hard truth about ranking:
You will never be Google. And that's okay. The goal isn't perfect universal ranking — it's better ranking for your specific use case.
Niche search engines beat Google on their own turf constantly. Legal search, academic search, developer search — all of them outperform Google within their domain because they optimize for domain-specific signals Google ignores.
Phase 4: Serving at Scale
Query comes in. Clock starts ticking. You have ~200ms before users notice latency.
Request → Load Balancer
→ Cache Layer (Redis, TTL 5min)
→ Query Parser
→ Index Lookup
→ Ranking
→ Result Formatter
→ Response
Caching is everything.
The most searched queries — top 5% — represent ~60% of your traffic. Cache those aggressively. Cold cache = dead service under load.
@cache(ttl=300, key=lambda q: f"search:{normalize(q)}")
async def search(query: str) -> SearchResults:
    parsed = parse_query(query)
    candidates = index.lookup(parsed.terms)
    ranked = ranker.rank(candidates, parsed)
    return format_results(ranked[:10])
What surprised us most:
Query normalization matters more than ranking tweaks. "python tutorial", "Python Tutorial", "PYTHON TUTORIAL" — same intent, same cache hit. Miss this and you're burning compute on identical queries.
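Our normalization is deliberately boring: fold unicode variants, casefold, collapse whitespace. A sketch of what the normalize function above could look like:

```python
import re
import unicodedata

def normalize(query: str) -> str:
    """Collapse trivially-different queries onto one cache key."""
    q = unicodedata.normalize("NFKC", query)  # fold unicode lookalikes
    q = q.lower().strip()
    q = re.sub(r"\s+", " ", q)                # collapse runs of whitespace
    return q

print(normalize("  Python   Tutorial ") == normalize("PYTHON TUTORIAL"))  # True
```

Sorting the tokens would collapse even more queries onto one key, but it throws away phrase order — whether that's safe depends on how your ranker treats phrases.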
Phase 5: The Problems That Never End
Spam & SEO manipulation — The moment you're indexable, spammers find you. Low-quality content farms, keyword stuffing, link schemes. You need content quality signals baked in from day one, not bolted on after.
Freshness vs. authority tradeoff — A 5-year-old authoritative page vs. a fresh article from today. Context dependent. News queries need recency. Evergreen queries need authority. Getting this wrong tanks user trust.
The cold start problem — An index of 10,000 pages feels empty. Users arrive, get sparse results, leave. You need to seed intelligently — crawl the most linked-to pages first, not randomly.
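Seeding by in-link count is a few lines once you have even a partial link graph. A sketch with toy data — the graph maps each page to the pages it links out to:

```python
from collections import Counter

def seed_order(link_graph: dict[str, list[str]]) -> list[str]:
    """Order crawl seeds by in-link count so the densest hubs get
    indexed first. link_graph maps page -> pages it links to."""
    inlinks = Counter(dst for dsts in link_graph.values() for dst in dsts)
    pages = set(link_graph) | set(inlinks)
    return sorted(pages, key=lambda p: -inlinks[p])

graph = {
    "a.com": ["hub.com", "b.com"],
    "b.com": ["hub.com"],
    "c.com": ["hub.com", "a.com"],
}
print(seed_order(graph)[0])  # hub.com — three in-links, crawl it first
```

It's a crude proxy for PageRank, but for bootstrapping an empty index, crude is plenty.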
What We'd Do Differently
Start with a focused vertical. Trying to index everything is how you get mediocre results everywhere. Pick a domain, own it, expand later.
Instrument everything from day one. Query latency, cache hit rates, zero-result queries — you can't optimize what you can't measure.
Respect the web. Polite crawlers live longer than aggressive ones. The web has memory.
The Takeaway
Building a search engine is one of the most complete engineering challenges you can take on — distributed systems, NLP, data pipelines, low-latency APIs, anti-spam. Every computer science concept you've ever learned shows up somewhere.
Is it worth it? For us, absolutely. Full control over relevance, zero dependency on third-party costs, infrastructure tuned exactly for our use case.
The web is still indexable. Search is still a solved-but-not-solved problem. There's room for more players.
We run NestDaddy — a search and news intelligence platform. If you're curious about the infrastructure side of things: https://www.nestdaddy.com