Sumedh Bala
Part 2: Event Discovery

Search & Indexing

Event discovery is the foundation of any ticketing system. The goal is to let users quickly find concerts, sports matches, or theater events by name, location, or venue. While it looks simple from the outside, the design must handle millions of queries, abbreviations (LA vs. Los Angeles), nearby cities (Pasadena), and fuzzy matches (typos like San Franciso).

This breakdown covers the baseline design first, followed by a deep dive into enhanced search capabilities.

1. Baseline Functional Solution

Services

  • Search API Service – Handles user queries, normalizes input, executes searches, and returns ranked results.
  • Indexing Service – Updates the search index whenever new events are created or updated.
  • Event Management Service (EMS) – Manages canonical event metadata and feeds it to the search index.

Databases & Caches

  • Primary Database (PostgreSQL / MySQL / DynamoDB) – Stores event metadata.
  • Search Index (Elasticsearch / OpenSearch / Solr) – Provides full-text and geospatial search.
  • Cache (Redis / Memcached) – Stores trending searches and frequently accessed results.

Database Schema (Events Table)

event_id  UUID or BIGINT (primary key)
event_name  Event title (e.g., "Taylor Swift Eras Tour")
venue_name  Venue (e.g., "SoFi Stadium")
city_normalized  Canonical city name (e.g., "Los Angeles")
state  State or province (e.g., "CA")
country  Country (e.g., "USA")
location_lat  Venue latitude
location_lng  Venue longitude
event_start_time  Start datetime
event_end_time  End datetime
search_synonyms  JSON/array of synonyms (e.g., ["LA", "L.A.", "Los Angeles"])

2. Why Indexes Differ by Column

  • event_id → B-Tree index, fast lookup after search returns candidate IDs.
  • event_name, venue_name, search_synonyms → Full-text or trigram indexes (support partial matches/typos).
  • city_normalized, state, country → B-Tree index (exact match queries).
  • location_lat, location_lng → GiST or SP-GiST index (PostGIS) for geo queries.
  • event_start_time, event_end_time → B-Tree index for range queries.
  • Composite Indexes → (city_normalized, event_start_time) for common multi-column queries like "finding all upcoming events in Los Angeles, ordered by date."

Key Takeaway: Use the right index type per query pattern: B-Trees for equality/range, GIN/trigrams for text, GiST for geo, and composite for multi-column queries. PostgreSQL handles the system of record; Elasticsearch handles fuzzy candidate generation.

3. APIs

Search API (Basic)

Endpoint: GET /search

Query Parameters:

  • q (string) → user query ("LA concerts")
  • location (optional, lat/lng) → geo filtering
  • radius (optional, default 30km)
  • date_range (optional, start + end)

Response: JSON object with an events array, each including:
event_id, event_name, venue, city, state, country, event_start_time, location (lat/lng)

4. Example SQL Queries

Find events in a city:

SELECT * FROM events WHERE city_normalized = 'Los Angeles' AND event_start_time > NOW();

Find events within 30 km radius:
Use a spherical distance function on the lat/lng coordinates, e.g., PostGIS `ST_DWithin` on a geography column (or a Haversine expression in plain SQL).

Find events by name (text search):

SELECT * FROM events WHERE event_name ILIKE '%Taylor Swift%';

5. Problems Using PostgreSQL Alone

  • Free-text search is slow at scale.
  • Fuzzy/typo handling is limited.
  • Synonyms/aliases require hacks.
  • Ranking, autocomplete, and aggregations are inefficient.
  • High-QPS text search is not distributed.

Why Elasticsearch/OpenSearch is Needed:

  • Inverted indices → sub-linear text lookup
  • Built-in typo tolerance, synonyms, stemming
  • BM25 ranking & boosting
  • Efficient geo + text + time queries
  • Autocomplete and faceted counts
  • Horizontally scalable and fault-tolerant

Soundbite for interviews: "Postgres is the source of truth, but breaks down for fuzzy search, synonyms, relevance, autocomplete, or scale. Production systems pair it with Elasticsearch/OpenSearch."

6. Enhancing User Experience

  • Synonym Expansion: Map "LA" → "Los Angeles" and expand queries accordingly.
  • Geolocation Queries: Lat/lng filtering ensures nearby cities (e.g., Pasadena) appear.
  • Similarity Search: Fuzzy matching using Elasticsearch analyzers or trigram indexes.

User Query Flow Example:

  1. User types "LA concerts."
  2. Query normalized (lowercase, punctuation removed).
  3. Synonym expansion (LA → Los Angeles).
  4. Geo coordinates retrieved.
  5. Search index queried (text + geo filter).
  6. Results ranked (relevance, time, popularity, distance).
  7. Results cached if trending.
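The flow above can be sketched in Python. The synonym map, the city coordinates, and the Elasticsearch query body below are illustrative assumptions for this design, not a fixed schema:

```python
# Sketch of the query flow above. SYNONYMS, CITY_COORDS, and the
# Elasticsearch body shape are illustrative assumptions.
import re

SYNONYMS = {"la": "los angeles", "sf": "san francisco"}   # assumed synonym map
CITY_COORDS = {"los angeles": (34.05, -118.24)}           # assumed geocoder data

def normalize(query: str) -> str:
    # Step 2: lowercase and strip punctuation
    return re.sub(r"[^\w\s]", "", query.lower()).strip()

def expand_synonyms(query: str) -> str:
    # Step 3: replace known abbreviations with canonical city names
    return " ".join(SYNONYMS.get(t, t) for t in query.split())

def build_search_request(raw_query: str, radius_km: int = 30) -> dict:
    q = expand_synonyms(normalize(raw_query))
    body = {"query": {"bool": {"must": [{"match": {"event_name": q}}]}}}
    # Steps 4-5: attach a geo filter when the query names a known city
    for city, (lat, lng) in CITY_COORDS.items():
        if city in q:
            body["query"]["bool"]["filter"] = [{
                "geo_distance": {"distance": f"{radius_km}km",
                                 "location": {"lat": lat, "lon": lng}}}]
    return body
```

For example, `build_search_request("LA concerts!")` produces a body that matches on "los angeles concerts" with a 30 km `geo_distance` filter centered on Los Angeles.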

7. Non-Functional Requirements

  • Latency: Redis caching → sub-100ms for popular searches
  • Scalability: Elasticsearch distributed indexing/sharding
  • Availability: Replicated DB + multi-AZ Elasticsearch
  • Consistency: Event Management Service is source of truth; eventual consistency to search index

8. Deeper Dive: Fuzzy Search

Scenario:

  • Dictionary: all cities (~200k–500k entries)
  • User input: "San Franciso" (typo of "San Francisco")
  • Goal: find the closest matching city efficiently, despite typos

8.1 Naive Levenshtein Distance

How it works:

  • Compare input string with every city in the dictionary
  • Compute edit distance (insertions, deletions, substitutions) for each city

Example:

  • Input: "San Franciso"
  • "San Francisco" → edit distance = 1 → match
  • "Los Angeles" → edit distance = 10 → ignored

Data storage: Simple array or list of city names (no preprocessing)

Time complexity:

  • Let n = number of cities, m = average city name length
  • Computing edit distance per city: O(m²) (see detailed explanation at end of document)
  • Total: O(n * m²) → very slow for large n

Pros: Simple to implement
Cons: Not scalable for real-time search with hundreds of thousands of entries
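A minimal sketch of this naive scan (the `edit_distance` recurrence is the one derived in the appendix):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic O(len(a) * len(b)) dynamic program, row by row (see appendix)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def naive_fuzzy_lookup(query: str, cities: list[str], max_dist: int = 2) -> list[str]:
    # O(n * m^2): compares the query against every city in the dictionary
    return [c for c in cities if edit_distance(query, c) <= max_dist]
```

`naive_fuzzy_lookup("San Franciso", ["San Francisco", "Los Angeles"])` returns `["San Francisco"]`: the typo is 1 edit away, while "Los Angeles" is far beyond the threshold.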

8.2 BK-Tree (Burkhard-Keller Tree)

Purpose: BK-Trees are designed for approximate string matching under a metric (commonly Levenshtein distance). They allow you to find all strings within a given "distance" efficiently, without scanning the entire dataset.

Structure:

  • Each node = a string (city name)
  • Each edge = edit distance between parent and child
  • If a child exists at that edge, insertion continues down that edge

Root Choice: The first inserted word becomes the root. Root choice doesn't affect correctness, but a "central" root can reduce tree depth.

Insertion Example (Detailed):
Insertion order: Los Angeles → Pasadena → Glendale → San Jose → San Mateo → San Francisco → Fresno. (The edge distances below are illustrative; the insertion mechanics, not the exact Levenshtein values, are the point.)

  1. Insert Los Angeles → root
  2. Insert Pasadena → dist(Pasadena, Los Angeles) = 8 → add under Los Angeles
  3. Insert Glendale → dist(Glendale, Los Angeles) = 8 → edge exists → recurse into Pasadena → dist(Glendale, Pasadena) = 7 → add under Pasadena
  4. Insert San Jose → dist(San Jose, Los Angeles) = 11 → add under Los Angeles
  5. Insert San Mateo → dist(San Mateo, Los Angeles) = 11 → recurse into San Jose → dist(San Mateo, San Jose) = 4 → add under San Jose
  6. Insert San Francisco → dist(San Francisco, Los Angeles) = 10 → add under Los Angeles
  7. Insert Fresno → dist(Fresno, Los Angeles) = 10 → recurse into San Francisco → dist(Fresno, San Francisco) = 9 → add under San Francisco

Final Tree:

Los Angeles ──(8)──▶  Pasadena       ──(7)──▶ Glendale
            ──(10)──▶ San Francisco  ──(9)──▶ Fresno
            ──(11)──▶ San Jose       ──(4)──▶ San Mateo

Search Logic (Fuzzy Search):
Query: "San Jsoe" (typo for San Jose), max edit distance = 2

  1. Start at Los Angeles → dist = 11 → allowed child edges in [dist - k, dist + k] = [9, 13]
  2. Children within range: San Francisco (10), San Jose (11); Pasadena's edge (8) falls outside, so that branch is never visited
  3. Check San Francisco → dist = 6 → no match; its child range [4, 8] excludes Fresno (edge 9) → prune
  4. Check San Jose → dist = 2 → ✅ Match found (within k = 2)
  5. San Jose's child range [0, 4] includes San Mateo (edge 4) → dist = 4 → no match

Key Observations:

  • Branch pruning via [d-k, d+k] makes search efficient
  • Chains form naturally when cities share distances
  • Insertion order affects tree shape but not correctness

BK-Tree Complexity:

  • Distance computation per node: O(m²)
  • Average case: O(log n) nodes visited
  • Worst case: O(n) nodes visited
  • Total search time: O(v * m²), v = nodes actually visited

Intuition: BK-Tree is like a binary tree where only relevant branches are explored, but each node comparison costs O(m²).
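A minimal BK-tree sketch of the insert and pruning logic above (plain Python; `edit_distance` is the standard dynamic-programming Levenshtein):

```python
def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})        # first word becomes the root
        for w in it:
            self._insert(self.root, w)

    def _insert(self, node, word):
        label, children = node
        d = edit_distance(word, label)
        if d in children:                 # edge exists -> recurse down it
            self._insert(children[d], word)
        else:
            children[d] = (word, {})

    def search(self, query, k):
        matches, stack = [], [self.root]
        while stack:
            label, children = stack.pop()
            d = edit_distance(query, label)
            if d <= k:
                matches.append((label, d))
            # triangle inequality: only edges in [d-k, d+k] can hold matches
            for edge, child in children.items():
                if d - k <= edge <= d + k:
                    stack.append(child)
        return matches
```

Building the tree from the example cities and calling `search("San Jsoe", 2)` returns `("San Jose", 2)` without computing the distance to every city.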

8.3 Trie + Levenshtein Automaton

How it works:

  • Store all city names in a Trie (prefix tree)
  • Example: "San Francisco" → S → a → n → _ → F → r → …
  • Use Levenshtein Automaton to traverse the Trie:
    • Track allowed edits at each step
    • Prune paths exceeding threshold

Time Complexity:

  • Worst case: every trie node is visited → O(N) row updates of O(m) each, where N is the number of trie nodes
  • Average case with pruning: far fewer nodes visited → efficient for large dictionaries

Pros: Scales well, precise, dynamic typo handling
Cons: Higher implementation complexity than BK-Tree

Fuzzy Search with Trie + Levenshtein Automaton (Beginner-Friendly)

Suppose you have a list of city names:

["San Francisco", "San Jose", "San Mateo", "Los Angeles", "Fresno"]

You want to search for "San Jsoe" (typo) and still find "San Jose".

We can do this efficiently using two things together:

  1. A Trie (prefix tree) to store city names.
  2. Levenshtein distance to measure "how different" two strings are.

1. Trie Basics

A Trie is a tree where each node is a letter. Words are formed by following a path from the root to a leaf.

Example:

Root
 |
 S
 |
 a
 |
 n
 |
(space)
 ├── F → "San Francisco"
 ├── J → "San Jose"
 └── M → "San Mateo"
  • Every path from root → leaf = a city name.
  • We can quickly follow letters to see which words match a prefix.

2. Levenshtein Distance Basics

Levenshtein distance = minimum edits to turn one string into another

  • Edit = insert, delete, or change a letter
  • Example: "San Jsoe" → "San Jose" → 2 edits (substitute 's'→'o' and 'o'→'s'). A transposition counts as a single edit only in the Damerau-Levenshtein variant.

We allow a maximum distance (e.g., 2) so we match words even with typos.


3. How the Search Works

Instead of comparing "San Jsoe" with every city (slow), we combine Trie + Levenshtein Automaton.

  • At each Trie node, we maintain a distance row representing edits needed to match the query so far.
  • Row length = query length + 1 = 8 + 1 = 9.

Step 0: Start at empty string

Query = "San Jsoe" (8 chars)
Trie path = ""

Distance row from empty string:

[0, 1, 2, 3, 4, 5, 6, 7, 8]
  • Index = number of query characters considered
  • Value = edits needed to turn empty string → query prefix

Step 1: Add first letter 'S' → Trie path "S"

Compute new row using formula:

new_row[j] = min(
    previous_row[j] + 1,      # deletion
    new_row[j-1] + 1,         # insertion
    previous_row[j-1] + cost  # substitution (0 if match, 1 if not)
)

Example table:

| j | Query prefix | Compare 'S' vs query[j-1] | Cost | Computation | new_row[j] |
|---|--------------|---------------------------|------|-------------|------------|
| 0 | ""           |                           |      | 0 + 1 = 1   | 1 |
| 1 | "S"          | S vs S                    | 0    | min(1+1, 1+1, 0+0) = 0 | 0 |
| 2 | "Sa"         | S vs a                    | 1    | min(2+1, 0+1, 1+1) = 1 | 1 |
| 3 | "San"        | S vs n                    | 1    | min(3+1, 1+1, 2+1) = 2 | 2 |
| 4 | "San "       | S vs (space)              | 1    | min(4+1, 2+1, 3+1) = 3 | 3 |
| 5 | "San J"      | S vs J                    | 1    | min(5+1, 3+1, 4+1) = 4 | 4 |
| 6 | "San Js"     | S vs s                    | 1    | min(6+1, 4+1, 5+1) = 5 | 5 |
| 7 | "San Jso"    | S vs o                    | 1    | min(7+1, 5+1, 6+1) = 6 | 6 |
| 8 | "San Jsoe"   | S vs e                    | 1    | min(8+1, 6+1, 7+1) = 7 | 7 |

Resulting row: [1, 0, 1, 2, 3, 4, 5, 6, 7]


Step 2: Add 'a' → Trie path "Sa"

Example table:

| j | Query prefix | Compare 'a' vs query[j-1] | Cost | Computation | new_row[j] |
|---|--------------|---------------------------|------|-------------|------------|
| 0 | ""           |                           |      | 1 + 1 = 2   | 2 |
| 1 | "S"          | a vs S                    | 1    | min(0+1, 2+1, 1+1) = 1 | 1 |
| 2 | "Sa"         | a vs a                    | 0    | min(1+1, 1+1, 0+0) = 0 | 0 |
| 3 | "San"        | a vs n                    | 1    | min(2+1, 0+1, 1+1) = 1 | 1 |
| 4 | "San "       | a vs (space)              | 1    | min(3+1, 1+1, 2+1) = 2 | 2 |
| 5 | "San J"      | a vs J                    | 1    | min(4+1, 2+1, 3+1) = 3 | 3 |
| 6 | "San Js"     | a vs s                    | 1    | min(5+1, 3+1, 4+1) = 4 | 4 |
| 7 | "San Jso"    | a vs o                    | 1    | min(6+1, 4+1, 5+1) = 5 | 5 |
| 8 | "San Jsoe"   | a vs e                    | 1    | min(7+1, 5+1, 6+1) = 6 | 6 |

Resulting row: [2, 1, 0, 1, 2, 3, 4, 5, 6]


Step 3: Add 'n' → Trie path "San"

Distance row:

[3, 2, 1, 0, 1, 2, 3, 4, 5]
  • Minimum distance = 0 → path "San" matches query "San" perfectly so far.

Step 4: Add space ' ' → Trie path "San "

Distance row:

[4, 3, 2, 1, 0, 1, 2, 3, 4]
  • Minimum distance = 0 → "San " matches query so far

Step 5: Continue down "San J""San Jose"

  • Add 'J' → distance row: [5,4,3,2,1,0,1,2,3]

  • Add 'o' → [6,5,4,3,2,1,1,1,2]

  • Add 's' → [7,6,5,4,3,2,1,2,2]

  • Add 'e' → [8,7,6,5,4,3,2,2,2]

  • Final distance = 2 → within the max allowed distance of 2 → MATCH FOUND "San Jose"


Step 6: Early Pruning

  • "San Francisco" → distance eventually > 2 → stop exploring
  • "San Jose" → distance ≤ 2 → match
  • "San Mateo" → distance > 2 → stop

Step 7: Intuition

  1. Walk down the Trie (letter by letter).
  2. Track minimum edits to match the query at each step (distance row).
  3. Stop exploring paths where edits exceed max allowed distance.
  4. If a complete word reaches end node within max edits → success!

This method powers fuzzy search, autocomplete, and spell correction in real systems.
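The walk above as a sketch: a small trie whose fuzzy search carries the distance row down each path and prunes once the row's minimum exceeds the threshold:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None                  # set when a complete word ends here

class FuzzyTrie:
    def __init__(self, words):
        self.root = TrieNode()
        for w in words:
            node = self.root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
            node.word = w

    def search(self, query, max_dist):
        # Step 0: row for the empty prefix is [0, 1, ..., len(query)]
        row = list(range(len(query) + 1))
        results = []
        for ch, child in self.root.children.items():
            self._walk(child, ch, query, row, max_dist, results)
        return results

    def _walk(self, node, ch, query, prev_row, max_dist, results):
        # Incrementally compute the next distance row: O(m) per character
        row = [prev_row[0] + 1]
        for j in range(1, len(query) + 1):
            cost = 0 if query[j - 1] == ch else 1
            row.append(min(prev_row[j] + 1,          # deletion
                           row[j - 1] + 1,           # insertion
                           prev_row[j - 1] + cost))  # substitution
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        if min(row) <= max_dist:                     # early pruning
            for nxt_ch, child in node.children.items():
                self._walk(child, nxt_ch, query, row, max_dist, results)
```

With the five example cities, `search("San Jsoe", 2)` returns `("San Jose", 2)`; the "San Francisco" and "San Mateo" branches share the "San " prefix rows and are pruned once their minimum exceeds 2.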

4. Key Advantages

  • Early Pruning: Stop exploring paths when distance exceeds threshold
  • Shared Prefixes: Multiple cities with same prefix share computation
  • Incremental Updates: Distance matrix updates are O(m) per character
  • Memory Efficient: Only store one distance matrix per search path

5. Performance Comparison

| Method | Time Complexity | Space | Implementation |
|--------|-----------------|-------|----------------|
| Naive Levenshtein | O(n × m²) | O(m²) | Simple |
| BK-Tree | O(log n × m²) | O(n) | Medium |
| Trie + Levenshtein Automaton | O(n × m) | O(n + m) | Complex |

6. When to Use

  • Large dictionaries (100k+ entries)
  • Frequent fuzzy searches
  • Memory constraints (trie is more compact than BK-tree)
  • Need for prefix-based filtering

The Trie + Levenshtein Automaton approach is the most sophisticated but also the most efficient for large-scale fuzzy string matching in production systems.

8.4 Trigram Index / Elasticsearch Approach

How it works:

  1. Create Trigrams: Break each city name into 3-character overlapping sequences
    • Example: "San Francisco" → ["San", "an ", "n F", " Fr", "Fra", "ran", "anc", "nci", "cis", "isc", "sco"]
  2. Build Inverted Index: Each trigram → list of city IDs
    • Example: "Fra" → [San Francisco, Frankfurt]
  3. Querying: Input "San Franciso" → generate trigrams → retrieve all cities containing at least one trigram → compute similarity score → return top matches
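A toy version of the trigram pipeline (pure Python; Elasticsearch and PostgreSQL's pg_trgm implement the same idea natively). Jaccard similarity is one simple choice of scoring function, used here for illustration:

```python
from collections import defaultdict

def trigrams(text: str) -> set[str]:
    # Overlapping 3-character sequences of the lowercased string
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

class TrigramIndex:
    def __init__(self, names):
        self.names = names
        self.index = defaultdict(set)            # trigram -> set of name ids
        for i, name in enumerate(names):
            for g in trigrams(name):
                self.index[g].add(i)

    def search(self, query, limit=3):
        q = trigrams(query)
        # Candidates share at least one trigram; score each by Jaccard overlap
        candidates = set().union(*(self.index.get(g, set()) for g in q))
        scored = []
        for i in candidates:
            c = trigrams(self.names[i])
            scored.append((self.names[i], len(q & c) / len(q | c)))
        scored.sort(key=lambda x: -x[1])
        return scored[:limit]
```

`TrigramIndex(["San Francisco", "San Jose", "Fresno"]).search("San Franciso")` ranks "San Francisco" first: the typo still shares 9 of its trigrams.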

Pros:

  • Scales to millions of entries
  • Handles typos, partial matches, multi-word queries
  • Produces ranked results for UX

Cons: Requires search engine infrastructure, slightly more memory overhead

9. How Elasticsearch Ranks Results: BM25 + Boosting

BM25 (Best Match 25)

Statistical ranking algorithm (improvement on TF-IDF)

Key ingredients:

  • Term Frequency (TF) → frequency of term in field, diminishing returns
  • Inverse Document Frequency (IDF) → rare terms weigh more
  • Field Length Normalization → shorter fields get higher weight
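A sketch of the single-term BM25 score built from those three ingredients, with the common defaults k1 = 1.2 and b = 0.75 (Elasticsearch's implementation differs in details):

```python
import math

def bm25_score(tf, df, doc_len, avg_len, n_docs, k1=1.2, b=0.75):
    # IDF: rare terms (small df) weigh more
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # Length normalization: longer-than-average fields are penalized
    norm = k1 * (1 - b + b * doc_len / avg_len)
    # TF with diminishing returns: score saturates as tf grows
    return idf * tf * (k1 + 1) / (tf + norm)
```

Note the behaviors the bullets describe: doubling the term frequency less than doubles the score, and the same term scores higher in a shorter field.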

Boosting

Manually tweak ranking

  • Field Boosting: event_name^3, artist_name^2, city^1
  • Query Boosting: recency, proximity
  • Business Logic Boosting: trending events, sponsored events
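As an Elasticsearch query body, the boosts above look roughly like this; the `trending` flag and the decay parameters are assumptions for illustration, not part of the article's schema:

```python
# Hedged sketch of an Elasticsearch body combining field, recency,
# and business-logic boosts. Field names follow this article's schema.
search_body = {
    "query": {
        "function_score": {
            "query": {
                "multi_match": {
                    "query": "LA concerts",
                    # Field boosting: event_name counts 3x, artist_name 2x
                    "fields": ["event_name^3", "artist_name^2", "city^1"],
                    "fuzziness": "AUTO",   # typo tolerance
                }
            },
            "functions": [
                # Business-logic boost: assumed `trending` flag doubles the score
                {"filter": {"term": {"trending": True}}, "weight": 2.0},
                # Recency boost: events further from "now" decay in score
                {"gauss": {"event_start_time": {"origin": "now", "scale": "30d"}}},
            ],
            "score_mode": "multiply",
        }
    }
}
```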

Interview Takeaway: BM25 + Boosting ensures results are relevant (BM25) and aligned with business/user intent (boosting).

10. Indexing Every PostgreSQL Field: Pros & Cons

When designing a large-scale ticketing system, deciding which fields to index in PostgreSQL is crucial, especially when pairing it with a specialized search engine like Elasticsearch.

Pros of PostgreSQL Indexing:

  • Backup System: PostgreSQL can act as a reliable fallback if Elasticsearch experiences an outage, ensuring continuous service.
  • Direct Access for Specific Queries: For certain precise queries or filters (e.g., direct lookups by event_id), direct access through PostgreSQL indexes can be significantly faster.
  • Strong Consistency & Data Integrity: Indexes on crucial fields (like event_id, often as a primary key) are paramount for enforcing strong consistency and data integrity. While the guarantee of integrity comes from database constraints (e.g., PRIMARY KEY, UNIQUE) and transactional properties, indexes are essential for the efficient and scalable enforcement of these guarantees. They enable the database to quickly check for uniqueness and relationships, preventing issues like duplicate events, even when the events table contains millions of records. This is a core requirement that search engines are not designed to fulfill.
  • Simplified Read-Your-Own-Writes: Immediate query results for newly created or updated events are possible directly from PostgreSQL, providing consistent results for the user who made the change.

Cons of PostgreSQL Indexing:

  • Increased Write Overhead: More indexes generally lead to slower insert and update operations, as each index needs to be maintained.
  • Higher Storage Costs: Each index consumes additional disk space, increasing overall storage expenses.
  • Inefficient for Complex Queries: PostgreSQL's built-in text and fuzzy search capabilities are less efficient at scale compared to specialized search engines like Elasticsearch.
  • Maintenance Complexity: Managing and tuning a large number of indexes can add to operational overhead.

Recommendation for Ticketing Systems: Strategic PostgreSQL Indexing

While it's not necessary to index every field in PostgreSQL, it plays a vital role. The primary responsibility for high-volume, complex search queries is offloaded to Elasticsearch, which is horizontally scalable and optimized for such tasks.

PostgreSQL indexes serve two critical, complementary roles in this architecture:

  1. Ensuring Data Integrity and Transactional Consistency: Indexes on crucial fields like event_id (typically as a primary key) are paramount. They enable the database to efficiently enforce UNIQUE and PRIMARY KEY constraints, ensuring that event data is always correct and preventing issues like duplicate event entries. This is a core requirement that search engines cannot guarantee.

  2. Providing a Failsafe: Event creation and updates are low-volume operations compared to search queries, so the write overhead of maintaining additional PostgreSQL indexes is acceptable. That investment buys a robust, consistent backup of critical event data, a crucial safety net if the search engine suffers an outage.

Soundbite for Interview: "Mentioning PostgreSQL indexing demonstrates you've considered distributed system failure modes and have a plan B. Downsides are outweighed by reliability, consistency, and data integrity."

11. Indexing Service & Event Management Service

These services ensure search data is accurate, timely, and scalable, maintaining a strong connection between canonical event data in PostgreSQL and the search index in Elasticsearch/OpenSearch.

11.1 Event Management Service (EMS)

Purpose: EMS is the source of truth for all event metadata. Responsible for creating, updating, and managing canonical event records.

Responsibilities:

  • Manage all event-related data:
    • Event name, venue, city, state, country
    • Start and end times
    • Event capacity and ticket types
    • Synonyms for search queries (e.g., LA → Los Angeles)
  • Validate incoming event data for consistency and correctness
  • Track changes via event versioning or timestamps
  • Emit notifications for downstream systems when events are created or updated

Key Considerations for High Scale:

  • Use atomic transactions in PostgreSQL to ensure consistency
  • Implement soft deletes to prevent broken references in search index
  • Emit event change logs to message broker (Kafka, RabbitMQ) for asynchronous processing by Indexing Service
  • Ensure idempotency in updates to avoid duplicate indexing

Example Flow:

  1. Event created/updated in PostgreSQL
  2. EMS validates the record
  3. EMS publishes event notification (message queue / Kafka topic)
  4. Indexing Service consumes the message to update the search index
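The notification in step 3 might be built like this; the message shape and idempotency-key scheme are illustrative assumptions, not a fixed contract:

```python
import hashlib
from datetime import datetime, timezone

def build_event_message(event: dict, version: int) -> dict:
    """Change message for the Indexing Service (shape is illustrative)."""
    payload = {
        "type": "event_updated",
        "event_id": event["event_id"],
        "version": version,   # lets consumers drop stale, out-of-order updates
        "published_at": datetime.now(timezone.utc).isoformat(),
        "data": event,
    }
    # Idempotency key: the same event_id + version always hashes the same,
    # so a redelivered message can be deduplicated downstream
    key = hashlib.sha256(f'{event["event_id"]}:{version}'.encode()).hexdigest()
    payload["idempotency_key"] = key
    return payload

# e.g., published to a Kafka topic such as "event-changes"
```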

11.2 Indexing Service

Purpose: Keep the search index up-to-date with canonical data for fast, accurate results

Responsibilities:

  • Consume event change notifications from EMS
  • Transform data into search-friendly format:
    • Text fields → tokenized
    • Locations → geo-points
    • Dates → sortable timestamps
  • Update/delete documents in Elasticsearch/OpenSearch
  • Handle batch updates and real-time streaming
  • Maintain versioning to detect stale updates
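The transform and versioning responsibilities as a sketch; the field names mirror the events-table schema from section 1, and `is_stale` shows the check used to drop out-of-order updates:

```python
def to_search_document(row: dict) -> dict:
    """Convert a canonical Postgres row into an Elasticsearch document.
    Field names follow the events-table schema in this article."""
    return {
        "event_id": row["event_id"],
        "event_name": row["event_name"],
        "venue_name": row["venue_name"],
        "city": row["city_normalized"],
        # Geo-point shape Elasticsearch expects for geo_distance queries
        "location": {"lat": row["location_lat"], "lon": row["location_lng"]},
        # ISO-8601 timestamps sort and range-filter naturally
        "event_start_time": row["event_start_time"],
        "search_synonyms": row.get("search_synonyms", []),
    }

def is_stale(incoming_version: int, indexed_version) -> bool:
    # Skip updates older than (or equal to) what is already indexed
    return indexed_version is not None and incoming_version <= indexed_version
```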

Key Implementation Details:

  • Message Queue Consumption: Kafka, RabbitMQ, or Kinesis
  • Idempotent Updates: Each event has a unique ID
  • Bulk Indexing: For new events or large updates
  • Error Handling: Failed updates retried, dead-letter queue stores permanently failed updates

Consistency Model:

  • Elasticsearch is eventually consistent
  • EMS remains the source of truth

Example Flow:

  1. Receive "event_created" message from EMS
  2. Transform event record (normalize names, add synonyms, compute geo coordinates)
  3. Push to Elasticsearch via bulk API
  4. Confirm update or retry if it fails

11.3 Architecture & Reliability

  • Decoupling: EMS and Indexing Service communicate via message queues → prevents blocking writes in Postgres during indexing
  • Resilience: Retry mechanisms and dead-letter queues for failures
  • Scalability: Multiple consumers can scale horizontally for high volumes
  • Observability: Metrics (throughput, error rates, lag), logging full trace (event_id, timestamp, status)

11.4 Interview Takeaways

  • EMS ensures canonical, consistent event data
  • Indexing Service keeps search index current and performant
  • Decoupled, message-driven architecture → scalable & fault-tolerant
  • Idempotent + bulk operations → reliability under high load
  • Plan B for indexing failures demonstrates awareness of real-world distributed system design

12. Performance Benchmarks

Search Performance Comparison

| Category | Method | P95 Latency | QPS | Memory | Notes |
|----------|--------|-------------|-----|--------|-------|
| DB-native | PostgreSQL ILIKE | 200ms | 1,000 | 2GB | Exact text matching |
| DB-native | PostgreSQL Trigram | 500ms | 500 | 4GB | Fuzzy text search |
| Search Engine | Elasticsearch (Fuzzy) | 50ms | 10,000+ | 8GB | Scalable fuzzy search |
| Search Engine | Elasticsearch (Regex) | 80ms | 8,000+ | 10GB | Regex/wildcard queries |
| Search Engine | OpenSearch | 60ms | 9,000+ | 8GB | AWS-led fork of Elasticsearch |
| Search Engine | Solr | 70ms | 8,500+ | 9GB | Enterprise search platform |
| Algorithm | BK-Tree | 20ms | 5,000 | 16GB | Specialized fuzzy matching |
| Algorithm | Trie + Levenshtein | 10ms | 15,000+ | 6GB | Efficient fuzzy matching |
| Cache | Redis Cached | 5ms | 50,000+ | 4GB | Hot queries only |

Why Trigram Indexes Excel at Complex Queries

While Trie + Levenshtein Automaton is faster for simple fuzzy matching, Trigram indexes are superior for complex queries because they handle multiple search patterns simultaneously. Trigram indexes break text into overlapping 3-character sequences, allowing them to efficiently process:

  • Multi-word queries: "Taylor Swift concert" matches documents containing "Taylor", "Swift", and "concert" even with typos
  • Partial matches: "San Fran*" matches "San Francisco", "San Fernando", "San Francisco Bay"
  • Regex patterns: Complex regular expressions like [A-Z][a-z]+ [A-Z][a-z]+ for proper names
  • Wildcard queries: "conc*" matches "concert", "conference", "conclusion"
  • Phrase proximity: Finding documents where "Taylor" and "Swift" appear within 5 words of each other

Performance advantage: Trigram indexes pre-compute all 3-character sequences, turning complex pattern matching into near-constant-time inverted-index lookups instead of an O(n×m²) scan per query. This is why Elasticsearch uses trigram/n-gram techniques for regex and wildcard queries despite the higher memory overhead: the cost of pattern matching makes the trade-off worthwhile.

Real-world example: A query like "LA concerts this weekend" with typos becomes much more efficient with trigram indexing because it can simultaneously match "LA" (city), "concert*" (event type), and "weekend" (time) across multiple fields and handle variations in each term.

Production Metrics

  • Elasticsearch Cluster: 3-node, 16GB RAM, 500GB index
  • Query Latency: P50: 15ms, P95: 45ms, P99: 100ms
  • Throughput: 8,000 QPS sustained, 15,000 QPS burst
  • Cache Hit Rate: 85% for trending searches

13. Caching Strategy

Multi-Layer Architecture

  1. CDN Cache (CloudFlare) - Static content, 24h TTL
  2. Application Cache (Redis) - Query results, 5-15min TTL
  3. Elasticsearch Query Cache - Search index cache

Cache Layers

  • Query Results: Popular searches cached for 15 minutes
  • Autocomplete: City/venue suggestions cached for 1 hour
  • Geo Data: Location lookups cached for 24 hours
  • Trending Searches: Top queries cached for 1 hour

Cache Invalidation

  • Event Updates: Invalidate related searches by city/venue
  • Time-based: Different TTLs based on data volatility
  • Cache Warming: Pre-populate trending searches during off-peak
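A sketch of the key scheme and per-layer TTLs above; the key format and helper names are assumptions for illustration:

```python
import hashlib

# TTLs per cache layer, matching the values listed above (seconds)
TTLS = {"query": 15 * 60, "autocomplete": 60 * 60,
        "geo": 24 * 60 * 60, "trending": 60 * 60}

def cache_key(kind: str, query: str, city: str = "") -> str:
    # Deterministic key: the same normalized query always hits the same entry
    raw = f"{kind}:{query.strip().lower()}:{city.strip().lower()}"
    return hashlib.md5(raw.encode()).hexdigest()

def ttl_for(kind: str) -> int:
    return TTLS[kind]

def invalidation_keys(city: str, venue: str, cached_queries: list[str]) -> list[str]:
    # On an event update, drop cached searches that mention its city or venue
    return [cache_key("query", q, city) for q in cached_queries
            if city.lower() in q.lower() or venue.lower() in q.lower()]
```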

14. Monitoring & Observability

Key Metrics

  • Search Performance: Latency (P50/P95/P99), success rate, error rate
  • System Health: Elasticsearch cluster status, node health, shard status
  • Business Metrics: Search volume, popular queries, user engagement
  • Cache Performance: Hit rate, memory usage, eviction rate

Alerting Thresholds

  • Critical: Search latency > 100ms, error rate > 5%, cluster health = Red
  • Warning: Search latency > 50ms, error rate > 1%, cluster health = Yellow
  • Channels: PagerDuty (critical), Slack (warnings), Email (reports)

Logging

  • Search Queries: Query, user, latency, results count, cache hit status
  • Errors: Timeout, connection failures, retry attempts
  • Performance: Elasticsearch response times, cache hit rates

15. Disaster Recovery

Failure Scenarios

  • Single Node: 0 downtime, automatic shard rebalancing
  • Cluster Failure: 15-30 min downtime, restore from snapshots
  • Data Corruption: 2-4 hours downtime, full reindex from PostgreSQL

Recovery Procedures

  • Cluster Health Check: Verify cluster status and shard allocation
  • Snapshot Restore: Restore from daily snapshots stored in S3
  • Full Rebuild: Reindex entire dataset from PostgreSQL source
  • Incremental Sync: Catch up on missed updates using timestamps

Fallback Strategies

  • PostgreSQL Fallback: Use ILIKE queries with reduced functionality
  • Read-Only Mode: Serve cached results only during outages
  • Graceful Degradation: Show maintenance message with trending events

Backup Strategy

  • Daily Snapshots: Automated daily backups to S3
  • Cross-Region Replication: Real-time replication to backup cluster
  • RTO/RPO: 15-30 min recovery time, 15 min data loss maximum

Appendix: Why Edit Distance is O(m²)

The Levenshtein distance (edit distance) has O(m²) time complexity because it uses a dynamic programming approach with a 2D table where m is the length of the strings being compared.

The Algorithm Structure

For two strings of length m and n, the algorithm creates an (m+1) × (n+1) table (roughly m × m when the lengths are similar) and fills it using the following recurrence relation:

def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)

    # Create (m+1) × (n+1) table
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Base cases
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    # Fill the table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]  # No operation needed
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],      # Deletion
                    dp[i][j-1],      # Insertion  
                    dp[i-1][j-1]     # Substitution
                )

    return dp[m][n]

Understanding the Base Cases

The base cases initialize the first row and first column of the DP table:

# Base cases
for i in range(m + 1):
    dp[i][0] = i
for j in range(n + 1):
    dp[0][j] = j

What these base cases represent:

  1. dp[i][0] = i (First column):

    • Converting string1[0...i-1] to an empty string
    • Requires i deletions (delete all i characters)
    • Example: "San Francisco" → "" needs 13 deletions
  2. dp[0][j] = j (First row):

    • Converting empty string to string2[0...j-1]
    • Requires j insertions (insert all j characters)
    • Example: "" → "San Franciso" needs 12 insertions

Visual representation:

     ""  S  a  n     F  r  a  n  c  i  s  o
""  [0][1][2][3][4][5][6][7][8][9][10][11][12]
S   [1][?][?][?][?][?][?][?][?][?][?][?][?]
a   [2][?][?][?][?][?][?][?][?][?][?][?][?]
n   [3][?][?][?][?][?][?][?][?][?][?][?][?]
    [4][?][?][?][?][?][?][?][?][?][?][?][?]
F   [5][?][?][?][?][?][?][?][?][?][?][?][?]
r   [6][?][?][?][?][?][?][?][?][?][?][?][?]
a   [7][?][?][?][?][?][?][?][?][?][?][?][?]
n   [8][?][?][?][?][?][?][?][?][?][?][?][?]
c   [9][?][?][?][?][?][?][?][?][?][?][?][?]
i   [10][?][?][?][?][?][?][?][?][?][?][?][?]
s   [11][?][?][?][?][?][?][?][?][?][?][?][?]
c   [12][?][?][?][?][?][?][?][?][?][?][?][?]
o   [13][?][?][?][?][?][?][?][?][?][?][?][?]

Why these base cases make sense:

  • Row 0, Column 0: dp[0][0] = 0 (empty string to empty string = 0 operations)
  • Row 0, Column j: dp[0][j] = j (empty string to j-character string = j insertions)
  • Row i, Column 0: dp[i][0] = i (i-character string to empty string = i deletions)

These base cases provide the foundation for the dynamic programming recurrence relation that fills the rest of the table.

Why O(m²)

  1. Table Size: We create a table of size (m+1) × (n+1), which is approximately m × m when strings are similar length
  2. Nested Loops: We have two nested loops:
    • Outer loop: for i in range(1, m + 1)O(m) iterations
    • Inner loop: for j in range(1, n + 1)O(m) iterations (when n ≈ m)
  3. Total Operations: O(m) × O(m) = O(m²)

Visual Example

For strings "San Francisco" (m=13) and "San Franciso" (n=12):

     S  a  n     F  r  a  n  c  i  s  o
   [0][1][2][3][4][5][6][7][8][9][10][11][12]
S  [1][0][1][2][3][4][5][6][7][8][9][10][11]
a  [2][1][0][1][2][3][4][5][6][7][8][9][10]
n  [3][2][1][0][1][2][3][4][5][6][7][8][9]
   [4][3][2][1][0][1][2][3][4][5][6][7][8]
F  [5][4][3][2][1][0][1][2][3][4][5][6][7]
r  [6][5][4][3][2][1][0][1][2][3][4][5][6]
a  [7][6][5][4][3][2][1][0][1][2][3][4][5]
n  [8][7][6][5][4][3][2][1][0][1][2][3][4]
c  [9][8][7][6][5][4][3][2][1][0][1][2][3]
i  [10][9][8][7][6][5][4][3][2][1][0][1][2]
s  [11][10][9][8][7][6][5][4][3][2][1][0][1]
c  [12][11][10][9][8][7][6][5][4][3][2][1][1]
o  [13][12][11][10][9][8][7][6][5][4][3][2][1]

Each cell requires constant time to compute, but we need to fill 13 × 12 = 156 cells, which is O(m²).

Why This Matters in the Ticketing System

In the context of the ticketing system:

  1. City Names: "San Francisco" vs "San Franciso" (typo)
  2. Scale Problem: With 200k-500k cities, computing edit distance for each one would be:
    • Per city: O(m²) where m ≈ 10-15 characters
    • Total: O(n × m²) = O(500,000 × 15²) = O(112.5 million operations)
  3. Performance Impact: This is why the document mentions it's "very slow for large n"

Alternative Solutions

This is exactly why the document suggests:

  • BK-Trees: O(log n) nodes visited, but each comparison still costs O(m²)
  • Trie + Levenshtein Automaton: More complex but can be optimized
  • Trigram Indexing: near-constant-time inverted-index lookups for approximate candidate retrieval

The O(m²) complexity is inherent to the edit distance algorithm itself: it is the mathematical cost of computing the minimum number of operations needed to transform one string into another.
