nestdaddy
We Built a Search Engine from Scratch - Here's What We Learned

The real challenges nobody talks about when you leave Google's shadow.


The Decision

Every developer has that moment. You're integrating a third-party search API, paying per request, watching costs climb, dealing with rate limits, getting irrelevant results you can't control. You think — how
hard can it be to build our own?

Spoiler: harder than you think. But also more rewarding than you'd imagine.

This is everything we learned building a full search engine from the ground up — crawling, indexing, ranking, and serving results at scale.


Phase 1: The Crawler

The first challenge nobody warns you about — the web is hostile to crawlers.

from collections import deque

class WebCrawler:
    def __init__(self):
        self.visited = set()
        self.queue = deque()
        self.rate_limiter = RateLimiter(requests_per_second=2)

    async def crawl(self, url: str):
        if url in self.visited:
            return
        self.visited.add(url)  # mark before fetching, or retries re-crawl

        await self.rate_limiter.wait()

        try:
            response = await self.fetch(url)
            links = self.extract_links(response.html)
            self.queue.extend(links)
        except BlockedByRobots:
            pass  # Respect robots.txt — always

What actually goes wrong:

  • Sites block your user-agent within hours
  • JavaScript-rendered pages return empty HTML
  • Duplicate content floods your index
  • robots.txt parsing edge cases are endless
  • Some sites have infinite scroll — your crawler loops forever

What worked for us:

Rotating user-agents alone isn't enough. You need proper crawl politeness — respect Crawl-Delay, honor robots.txt strictly, and implement exponential backoff. A site that can tell you're crawling politely is far less likely to block you permanently.
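Backoff is easy to get wrong without jitter — synchronized retries hammer a recovering site. A minimal sketch (the `base` and `cap` values here are illustrative, not what we ship):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter exponential backoff: sleep a random duration
    between 0 and min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Full jitter spreads retries uniformly, so a burst of blocked requests doesn't come back as a burst.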

For JS-rendered content, we ended up with a hybrid approach — static HTML fetch first, Playwright fallback only when necessary. Playwright at scale is expensive. Use it sparingly.
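The fallback decision itself is just a heuristic on the static response; a sketch with pluggable fetchers (the length threshold and the fetcher signatures are assumptions for illustration, not our exact code):

```python
import asyncio
from typing import Awaitable, Callable

Fetcher = Callable[[str], Awaitable[str]]

async def fetch_page(url: str, static_fetch: Fetcher, render_fetch: Fetcher,
                     min_text_len: int = 200) -> str:
    """Try the cheap static fetch first; fall back to a headless-browser
    render only when the page looks JS-rendered (near-empty body)."""
    html = await static_fetch(url)
    if len(html.strip()) >= min_text_len:
        return html
    return await render_fetch(url)  # expensive path: Playwright, etc.
```

Because the render path only fires on near-empty responses, the expensive browser pool stays small.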


Phase 2: The Index

Raw HTML is useless. You need to extract signal from noise.

from bs4 import BeautifulSoup

def extract_signals(html: str, url: str) -> PageSignals:
    soup = BeautifulSoup(html, 'lxml')

    return PageSignals(
        title=extract_title(soup),
        meta_description=extract_meta(soup),
        h1_tags=extract_headings(soup),
        body_text=clean_text(soup),
        internal_links=count_internal(soup, url),
        external_links=count_external(soup, url),
        word_count=count_words(soup),
        freshness=extract_publish_date(soup),
    )

The indexing problem nobody mentions:

Storage is cheap. Fast retrieval is not.

We went through three storage approaches:

┌───────────────────────┬─────────────────────────────────┐
│ Approach              │ Problem                         │
├───────────────────────┼─────────────────────────────────┤
│ PostgreSQL full-text  │ Too slow past 10M records       │
│ Elasticsearch         │ Operational overhead was brutal │
│ Custom inverted index │ Full control, worth the pain    │
└───────────────────────┴─────────────────────────────────┘

An inverted index maps every term to the documents containing it. Simple concept. Brutal implementation.

"python" → [doc_42, doc_891, doc_2341, ...]
"search" → [doc_7, doc_42, doc_156, ...]
"python search" → intersection([doc_42, ...])

The intersection operation at scale — that's where query latency lives or dies.
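Fast intersection is why posting lists are kept sorted by doc ID: two sorted lists intersect with a linear two-pointer merge. A textbook sketch (not our production code, which adds skip pointers and compression):

```python
def intersect(a: list[int], b: list[int]) -> list[int]:
    """Intersect two sorted posting lists with a two-pointer
    walk in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```

For multi-term queries, intersecting the shortest list first keeps the candidate set small at every step.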


Phase 3: Ranking

This is where things get philosophically interesting.

PageRank is brilliant in concept. Graph of the web, pages vote for each other via links, authority flows. In practice — you don't have the entire web's link graph. You have a slice.

What actually moves the needle for relevance:

def calculate_score(doc, query) -> float:
    score = 0.0

    # TF-IDF base score
    score += tfidf_score(doc, query) * 1.0

    # Title match = strong signal
    if query_in_title(doc, query):
        score *= 2.4

    # Freshness boost for news content
    score *= freshness_decay(doc.publish_date)

    # Domain authority proxy
    score += backlink_score(doc.domain) * 0.3

    # Engagement signals
    score += click_through_rate(doc) * 0.5

    return score

The hard truth about ranking:

You will never be Google. And that's okay. The goal isn't perfect universal ranking — it's better ranking for your specific use case.

Niche search engines beat Google on their own turf constantly. Legal search, academic search, developer search — all of them outperform Google within their domain because they optimize for domain-specific
signals Google ignores.


Phase 4: Serving at Scale

Query comes in. Clock starts ticking. You have ~200ms before users notice latency.

Request → Load Balancer
→ Cache Layer (Redis, TTL 5min)
→ Query Parser
→ Index Lookup
→ Ranking
→ Result Formatter
→ Response

Caching is everything.

The most searched queries — top 5% — represent ~60% of your traffic. Cache those aggressively. Cold cache = dead service under load.

@cache(ttl=300, key=lambda q: f"search:{normalize(q)}")
async def search(query: str) -> SearchResults:
    parsed = parse_query(query)
    candidates = index.lookup(parsed.terms)
    ranked = ranker.rank(candidates, parsed)
    return format_results(ranked[:10])

What surprised us most:

Query normalization matters more than ranking tweaks. "python tutorial", "Python Tutorial", "PYTHON TUTORIAL" — same intent, same cache hit. Miss this and you're burning compute on identical queries.
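The `normalize` used in the cache key above can stay dead simple — lowercase and collapse whitespace. A sketch (stemming and stopword removal are deliberately left out; they change results, not just cache keys):

```python
def normalize(query: str) -> str:
    """Canonical cache key: lowercase and collapse runs of whitespace
    so trivially different spellings hit the same cache entry."""
    return " ".join(query.lower().split())
```

Anything beyond this belongs in the query parser, where it affects retrieval rather than just cache hit rate.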


Phase 5: The Problems That Never End

Spam & SEO manipulation — The moment you're indexable, spammers find you. Low-quality content farms, keyword stuffing, link schemes. You need content quality signals baked in from day one, not bolted on after.

Freshness vs. authority tradeoff — A 5-year-old authoritative page vs. a fresh article from today. Context dependent. News queries need recency. Evergreen queries need authority. Getting this wrong tanks user
trust.

The cold start problem — An index of 10,000 pages feels empty. Users arrive, get sparse results, leave. You need to seed intelligently — crawl the most linked-to pages first, not randomly.
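Seeding by inlink count is just a priority ordering over whatever link data you can bootstrap from (a sketch; in practice the counts would come from an external dataset such as a Common Crawl extract — an assumption, not what the post specifies):

```python
import heapq

def seed_order(inlink_counts: dict[str, int]) -> list[str]:
    """Order the initial crawl frontier so the most
    linked-to URLs are fetched first."""
    heap = [(-count, url) for url, count in inlink_counts.items()]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```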


What We'd Do Differently

Start with a focused vertical. Trying to index everything is how you get mediocre results everywhere. Pick a domain, own it, expand later.

Instrument everything from day one. Query latency, cache hit rates, zero-result queries — you can't optimize what you can't measure.

Respect the web. Polite crawlers live longer than aggressive ones. The web has memory.


The Takeaway

Building a search engine is one of the most complete engineering challenges you can take on — distributed systems, NLP, data pipelines, low-latency APIs, anti-spam. Every computer science concept you've ever
learned shows up somewhere.

Is it worth it? For us, absolutely. Full control over relevance, zero dependency on third-party costs, infrastructure tuned exactly for our use case.

The web is still indexable. Search is still a solved-but-not-solved problem. There's room for more players.


We run NestDaddy — a search and news intelligence platform. If you're curious about the infrastructure side of things, take a look: https://www.nestdaddy.com
