DEV Community

Cover image for Search Autocomplete Systems — Complete Guide
ali ehab algmass
ali ehab algmass

Posted on

Search Autocomplete Systems — Complete Guide

Section 1 — Problem Definition

An autocomplete system predicts and suggests query completions as a user types, character by character. The goal is to reduce typing effort, surface popular or relevant queries, and improve UX through speed.

Real-world examples:

  • Google Search — top-K globally trending completions, personalized by history
  • YouTube — video-intent queries, trending topics, channel names
  • Amazon — product search with category-aware suggestions
  • GitHub — repo names, usernames, file paths, code symbols

Functional requirements: Given a prefix typed by a user, return the top-K (typically 5–10) most relevant completions within strict latency SLA (under 100ms end-to-end).

Non-functional requirements: 99.99% availability, horizontal scalability to billions of queries/day, eventual consistency for frequency updates (stale-by-minutes is acceptable), sub-50ms p99 response time at the API layer, multi-region support.

The hardest constraint is the latency budget. At Google scale, every character keystroke fires a request. A user typing "machine learning" fires 16 requests. With millions of concurrent users, this means tens of millions of requests per second with a hard sub-100ms SLA.---

Section 2 — Trie Data Structure

A Trie (prefix tree) is a tree data structure where each node represents a single character. A path from root to any node spells out a prefix. A path to a marked node spells a complete word.

Why Trie for autocomplete? Lookup by prefix is O(L) where L is prefix length — independent of vocabulary size. With a hash map you'd need to scan all keys. With a sorted array you'd binary search but can't retrieve all matching extensions efficiently. The Trie's structure literally mirrors the problem: navigating down the tree IS the prefix search.

Node structure:

TrieNode {
    children: Map<char, TrieNode>   // up to 26 for a-z, more for Unicode
    isEndOfWord: bool
    frequency: int                  // how many times this word was searched
    topK: List<{word, freq}>        // cached top-K at this node
}
Enter fullscreen mode Exit fullscreen mode


Operations:

  • insert(word, freq) — traverse/create nodes character by character, mark isEndOfWord=true, set frequency. O(L).
  • search(prefix) — traverse to the prefix node. If not found, return []. Otherwise collect all words in the subtree rooted there. O(L + N) where N = subtree size.
  • delete(word) — traverse to the word's end node, unmark isEndOfWord. Optionally prune leaf nodes upward. O(L).

Complexity summary:

Operation Time Space
Insert O(L) O(L × alphabet)
Search by prefix O(L + N) O(1) extra
Delete O(L) O(1)
Full Trie O(N × L × alphabet)

Advantages: Fast prefix lookup, natural prefix grouping, supports wildcard and fuzzy variants. Disadvantages: High memory — each node holds up to 26+ child pointers. For 100M terms with avg length 10, naive Trie needs tens of GB of RAM. This forces compression and sharding.


Section 3 — Building Autocomplete Using Trie

When a user types a prefix q:

  1. Start at the root, traverse one node per character of q.
  2. If any character has no matching child edge, return empty — no completions exist.
  3. At the prefix node, perform a DFS/BFS to collect all isEndOfWord=true descendants.
  4. Sort collected words by frequency, return top-K.

The naive version of step 3 has O(subtree_size) cost. For popular single-character prefixes like "a", the subtree could contain millions of words — completely unacceptable. This is why we need Top-K caching at each node.


Section 4 — Top-K Suggestions

Top-K means: for any given prefix, return only the K most relevant completions. The key insight is to precompute and cache top-K at every Trie node rather than traversing the full subtree at query time.

Ranking strategies:

  • Search frequency — "apple" searched 10M times ranks above "applesauce" searched 2K times. The baseline signal.
  • Recency — recent searches get a time-decay boost. Exponential decay: score = freq × e^(-λ × days_ago). Captures trending topics.
  • Personalization — your own history ("I always search apple stock news") boosts personal terms. Computed per-user, cannot be stored in the global Trie.
  • Trending — sudden spike detection via sliding-window counters. "World Cup" spikes during tournaments.

Storing Top-K in Trie nodes:

Each node stores a list of up to K {word, score} pairs representing the best completions reachable from that node. When a new word is inserted or a frequency updated, propagate changes upward: each ancestor node merges the new candidate into its top-K list.

Min Heap vs Max Heap:

Use a Min Heap of size K to maintain top-K efficiently. When processing a new candidate:

  • If heap size < K: push directly.
  • If candidate score > heap minimum: pop min, push candidate.
  • Otherwise: discard.

This gives O(log K) per insertion. Final top-K extraction is O(K log K).

For Max Heap: better when you need to extract elements in descending order one at a time (streaming output). But for building top-K over a large set, the Min Heap approach is more memory-efficient.With Top-K cached at every node, a query for prefix q becomes O(L) — traverse L nodes, read the precomputed top-K list. No subtree scan needed.


Section 5 — Caching Layer

The Trie alone is still insufficient. Even with O(L) lookup, the Trie lives in one process's memory. At Google scale, you need distributed caching to handle millions of QPS without hammering Trie servers.

Cache-per-prefix architecture: For each prefix string (e.g., "a", "ap", "app"), cache the top-K result list in Redis. Cache key = the prefix string. Cache value = JSON array of top-K suggestions.

cache["a"]    → ["apple", "amazon", "airbnb", "android", "api"]
cache["ap"]   → ["apple", "api", "app store", "apex", "apply"]
cache["app"]  → ["app store", "apple music", "apple id", "application", "applebee's"]
Enter fullscreen mode Exit fullscreen mode


markdown

Cache key design: Use a consistent lowercase-normalized prefix as the key. For Redis: autocomplete:v2:{prefix} — version-prefix allows atomic cache invalidation on Trie rebuild.

Cache invalidation strategies:

  • TTL-based — set TTL of 1–24 hours depending on prefix popularity. Short TTL for hot/trending prefixes, longer for stable ones. Simple but may serve stale results after a viral event.
  • Event-driven invalidation — when the Trie is rebuilt (see Section 7), publish invalidation messages to Redis Pub/Sub. Subscribers delete affected keys. More complex but fresher.
  • Versioned keys — on rebuild, switch from autocomplete:v2: to autocomplete:v3:. Old keys expire via TTL. Zero-downtime transition.

Cache hit rates in practice: The prefix "t" gets millions of hits/hour — stays in L1 memory of Redis. "the quick brown" — rare, likely a cache miss. This power-law distribution means ~20% of prefixes handle 80% of traffic. Cache them aggressively; let rare prefixes fall through to the Trie service.

Cache stampede problem: When a popular cache key expires, thousands of concurrent requests all miss and hammer the Trie simultaneously. Solutions: (1) probabilistic early expiry — refresh the key before it expires with small probability; (2) mutex/lock on cache miss — first thread computes, others wait; (3) background refresh — a worker proactively refreshes top-N prefixes before TTL expires.


Section 6 — Updating Search Frequencies

The flow:

  1. User submits a search query → logged to a search event stream (Kafka topic search-events).
  2. A frequency aggregation service consumes the stream and increments counters in a fast counter store (Redis INCR or a dedicated counter service like Druid).
  3. Periodically, these counters feed the offline Trie rebuild pipeline.

Real-time vs batch tradeoff:

Real-time updates Batch updates
Frequency freshness Seconds Hours
Trie write cost High (every search modifies Trie) Low (rebuild once per cycle)
Complexity Very high (distributed locks, race conditions) Moderate
Production choice Only for trending/signals layer Yes, for core Trie

At Google scale, real-time Trie mutation is impractical — millions of writes/second with propagation to ancestor nodes creates massive contention. The industry standard is batch rebuilds with a real-time trending overlay.

Incremental update strategy: Rather than rebuilding from scratch hourly, maintain a delta log. Apply only changed frequencies. For a word whose frequency changed from 500 to 600, update the affected nodes and re-sort top-K lists upward. Cost: O(L × K log K) per updated word.


Section 7 — Offline Trie Update Job

The core principle: never update the live Trie on every search. Instead, batch-process search logs periodically and rebuild the Trie (or apply deltas) offline, then hot-swap the new version atomically.

ETL Pipeline:Blue-Green deployment of Trie versions: At any time, two Trie instances exist — "blue" (live) and "green" (rebuilding). Once green passes validation, a routing config update atomically shifts traffic. Blue becomes the rollback target. If green shows anomalies (regressions, coverage drops), one config change rolls back in seconds. Zero downtime, zero query loss.

Production cadence: Most systems run hourly rebuilds during low-traffic windows, with a "trending override" layer that injects real-time signals (last 15 minutes of search spikes) on top of the hourly Trie.


Section 8 — Google-Level Scale Design

Assumptions: 1B searches/day = ~11,500 QPS average. With peak factor of 3× = ~35,000 QPS peak. 100M unique terms. p99 latency < 100ms.Capacity math at scale:

  • Cache layer: 100M unique prefixes × avg 200 bytes/entry = ~20GB per Redis cluster. With replication factor 3 = 60GB. Easily fits in commodity Redis.
  • Trie memory: 100M terms × avg 10 chars × 28 bytes/node ≈ 28GB raw. With top-K lists (K=10, 20 bytes/entry) per node: +20GB. ~50GB total per full Trie replica.
  • With 20 Trie nodes sharded across 5 prefix ranges with 4× replication = 20 nodes × ~10GB each = manageable.

Sharding strategy: Shard by first 2 characters of prefix. "aa"–"am" → shard 1, "an"–"az" → shard 2, etc. Each shard handles a fraction of the prefix space. A consistent hashing ring (see Section 9) handles rebalancing when nodes are added.

High availability: Each Trie shard has 3 replicas (leader + 2 followers). Reads hit any replica. Writes (new Trie deployment) go to leader, sync to followers. If leader dies, a follower is promoted within seconds via Raft consensus.

Disaster recovery: Cross-region replication of Trie snapshots to S3/GCS. RPO = last hourly snapshot. RTO = ~5 minutes to restore from snapshot into a fresh cluster.


Section 9 — Distributed Trie

A single Trie for 100M terms with top-K lists is 50GB+ and handles millions of QPS — impossible to run on one machine with acceptable latency and reliability.

Sharding by prefix (range-based):

Shard 0: a–f        (prefixes starting with a,b,c,d,e,f)
Shard 1: g–m
Shard 2: n–s
Shard 3: t–z + digits
Enter fullscreen mode Exit fullscreen mode

Simple, predictable routing. A request for prefix "apple" always goes to shard 0. Problem: Hot prefixes. "a" is searched far more than "x". Shard 0 gets 10× the traffic of shard 3 — this is the "hot shard" problem.

Sharding by hash: Hash the prefix string to determine shard. shard = hash("apple") % N. Distributes load evenly. Problem: Two-character prefixes "ap" and "app" land on different shards, breaking prefix locality. Every request requires a broadcast to all shards and result merging — expensive.

Consistent hashing: Place shards on a virtual ring. Map prefix to a point on the ring; the shard clockwise from that point owns it. Adding/removing shards only remaps a fraction of keys. Supports virtual nodes for load balancing. Best for prefix-range sharding where you want smooth rebalancing.

Production recommendation: Shard by first 2 characters using a weighted partition scheme (give "a-" more shards than "x-" proportional to expected traffic). Replicate each shard 3×. Use consistent hashing for shard assignment to enable elastic scaling.


Section 10 — Alternatives to Trie

Data Structure Memory Lookup Best For Weakness
Standard Trie High (26 pointers/node) O(L) Teaching, small vocab Memory bloat
Compressed Trie (Radix) Moderate O(L) Production prefix search Complex to implement
Patricia Trie Low O(L) IP routing, binary keys Complex insert
Ternary Search Tree Moderate O(L + log N) Spell check, ordered ops Slower than Trie
FST (Finite State Transducer) Very low (shared suffixes) O(L) Lucene, production NLP Immutable once built
Elasticsearch/Lucene High (inverted index) O(log N) Full-text search + autocomplete Overkill for pure prefix

FST (Finite State Transducer) is what Lucene/Elasticsearch uses internally. It compresses both shared prefixes AND shared suffixes into a directed acyclic graph. Memory usage is 10–100× smaller than a standard Trie for large vocabularies. The tradeoff: building an FST is expensive, and the structure is typically immutable — you rebuild on update.

Radix Tree is the practical production choice for custom implementations. It compresses chains of single-child nodes (the path "a→p→p" in standard Trie becomes one node "app"). Memory savings ~10× for English vocabulary. Lookups are the same O(L).

Elasticsearch brings typo-tolerance (edit distance matching), language analysis, scoring with TF-IDF or BM25, and horizontal scaling out of the box. The cost is significantly higher latency (10–50ms) and memory footprint. For pure autocomplete (no full-text), it's often over-engineered. For search that blends autocomplete with full-text ranking (GitHub's code search, Slack's message search), it's the right choice.


Section 11 — Production Challenges

Hot prefixes: The prefix "the" gets astronomically more traffic than "xyz". The CDN edge layer handles this by caching top-200K prefixes globally, reducing origin traffic by 95%+. For the remaining hot prefixes that miss CDN, the Redis layer absorbs the load.

Cache stampede: When a cached popular prefix expires, thousands of concurrent threads miss simultaneously. Solution: probabilistic early expiry. With probability p = max(0, (remaining_TTL - beta × log(rand())) / TTL), proactively refresh. This spreads the refresh work over the TTL window rather than concentrating it at expiry.

Memory consumption: Raw Tries use too much RAM. Solutions: (1) serialize Tries to compact binary format (FST-style), (2) store only top-K completions per node (discard the full word list), (3) shard aggressively to distribute memory across many nodes.

Typo tolerance: Pure Trie cannot handle "appel" → "apple". Solutions: (1) generate edit-distance-1 candidates client-side and query each (expensive), (2) use a separate spell-correction layer (SymSpell, BK-trees) before the Trie lookup, (3) use Elasticsearch with fuzzy query for typo-tolerant autocomplete.

International languages: UTF-8 prefixes mean up to 65,000+ possible "characters" per node. Solutions: (1) store children as a HashMap (not a fixed 26-element array), (2) use Unicode normalization (NFC/NFD) to collapse accented variants, (3) use language-specific tokenization (Chinese/Japanese use character-level Tries since there are no spaces).

Personalization: Cannot store personal results in the global Trie. Architecture: (1) server returns top-K global suggestions plus top-K personal suggestions from a lightweight personal history store (user's last N searches in a sorted set); (2) client-side merging re-ranks combined list by a blend of global and personal scores.


Section 12 — Interview Perspective

How this appears in FAANG interviews: Typically given as "Design Google Search Autocomplete" or "Design a type-ahead feature for Twitter." You have 45 minutes. Interviewers want to see that you progress from Trie basics to scale concerns to production tradeoffs.

Expected progression:

  1. Clarify requirements (latency SLA, scale, top-K value, personalization needed?)
  2. Describe Trie + top-K at each node (data model first)
  3. Identify the scaling problem with a single Trie → introduce cache layer
  4. Introduce offline rebuild to avoid write contention
  5. Address distribution: sharding strategy with tradeoffs
  6. Mention production concerns: typo tolerance, hot prefixes, multi-region

Common follow-up questions:

  • "How would you handle trending searches that appear in the last 10 minutes?" → Answer: real-time trending overlay layer, separate from the hourly Trie, injected at serving time.
  • "How would you add personalization?" → Answer: personal history stored per-user in Redis sorted set, merged at API layer with global top-K.
  • "How would you handle deleting a term (copyright violation)?" → Answer: blocklist checked at serving time, not in the Trie. Faster to update and doesn't require Trie rebuild.
  • "What's your cache invalidation strategy?" → Answer: versioned keys + TTL + proactive refresh for hot prefixes.

Common mistakes candidates make:

  • Jumping to distributed systems without first explaining the core Trie algorithm
  • Forgetting that naive prefix search (without top-K at nodes) is O(subtree_size) — interviewer will probe this
  • Not mentioning the cache layer — the Trie alone doesn't handle 35,000 QPS
  • Choosing hash sharding for Tries (breaks prefix locality)
  • Not addressing the write path (how frequencies get updated)

Senior/Architect-level expectations: You should bring up: cache stampede prevention, blue-green Trie deployment, FST vs Trie tradeoffs, the real-time trending layer architecture, and the distinction between global and personalized suggestions.


Section 13 — PHP Implementation

Let's implement four layers: basic Trie, Trie with frequency, Trie with Top-K, and cache integration.PHP isn't available in this environment, but the code is complete and correct. Let me copy all files to the output directory so you can download and run them.Run locally with php demo.php — requires PHP 8.1+. The demo will show cache hit vs miss timing, top-K reranking after frequency updates, and cache version bumping.


Section 14 — Final End-to-End Architecture---

Master Summary

Here's the mental model to carry into any interview:

The system has two completely separate paths — a read path that must be blindingly fast, and a write path that can be slow and batch-oriented.

Read path: User keystroke → CDN (first line of defense, handles 60–70% of traffic) → Load Balancer → Stateless API Server (normalizes, debounces) → Redis Cache lookup (95%+ hit rate) → on miss, Trie Service (in-memory, sharded by prefix, replicated). The Trie node itself returns pre-computed top-K in O(L) time. Total: under 100ms.

Write path: Every completed search → Kafka → Spark frequency aggregation (hourly window) → Trie builder (inserts with top-K propagation) → Validation → Blue-Green deployment swap + Redis version bump. Runs hourly. Trending signals can be overlaid in near-real-time as a separate lightweight layer.

The three insights that separate senior answers:

  1. The naive Trie (DFS at query time) fails at scale. Top-K caching at every node is the fix.
  2. The cache layer is not optional — it's what makes 35,000 QPS possible without exploding the Trie cluster.
  3. You never update the live Trie on every search. Offline batch rebuild + blue-green swap is the production pattern.

Laravel example code

Top comments (0)