DEV Community: Sumedh Bala

Hotel Booking: Schema Design Comparison

Sumedh Bala — Wed, 19 Nov 2025 20:07:02 +0000

1. Introduction

Hotel booking systems face a fundamental design decision: how to model availability and reservations in the database. This document compares two primary approaches:

Schema 1: Row-Per-Day for Fungible Room Types

One row per (hotel_id, room_type_id, date)
Rooms of the same type are fungible (interchangeable)
Uses B-tree indexes and optimistic locking

Schema 2: GiST with Individual Rooms

One row per physical room
Rooms are non-fungible (each room tracked separately)
Uses GiST indexes and exclusion constraints to prevent overlapping bookings (see Section 3.3 for a detailed explanation of how GiST works)

Both approaches solve the same problem (preventing double-booking) but with different trade-offs in complexity, performance, and flexibility.

Prerequisite Reading (Shared Building Blocks)

This document extends the event-ticketing series by @sumedhbala.

We reuse the same foundation: PostgreSQL as system of record, transactional outbox + Debezium CDC into Kafka, Elasticsearch for discovery, Redis for low-latency availability, and Stripe-style payment flows. Familiarity with those components is assumed so we can focus on hotel-specific schema and availability trade-offs here.

Key Technical Topics Covered (Interview Highlights)

GiST indexes + exclusion constraints (Section 3): Prevent overlapping range bookings with PostgreSQL GiST/exclusion constraints.
Optimistic vs. pessimistic locking (Sections 2.4 & 3.6): Version columns + FOR UPDATE SKIP LOCKED to avoid overbooking.
Fungible vs. non-fungible inventory modeling (Sections 2 & 3): When to use row-per-day counts vs. per-room range bookings.
Per-room-type pricing (Section 2.1): reservation_room_nights to support multiple room types in one reservation.
Hybrid design (Section 2.6): Schema 1 inventory + GiST room assignments to avoid room switching after check-in.
Scaling patterns (Section 7): Redis/search caches, read replicas, sharding by hotel_id, regional deployments, streaming events.

2. Schema 1: Row-Per-Day for Fungible Room Types

2.1 Schema Design

-- Room types (fungible - Room 101 and 102 of same type are interchangeable)
CREATE TABLE room_types (
    hotel_id VARCHAR(50) NOT NULL,
    room_type_id VARCHAR(100) NOT NULL,
    type_name VARCHAR(100) NOT NULL,  -- e.g., "Deluxe", "Suite"
    max_capacity INT NOT NULL,
    amenities JSON,
    PRIMARY KEY (hotel_id, room_type_id)
);

-- Inventory: One row per room type per night
CREATE TABLE inventory (
    hotel_id VARCHAR(50) NOT NULL,
    room_type_id VARCHAR(100) NOT NULL,
    date DATE NOT NULL,  -- Specific night date
    total_rooms INT NOT NULL DEFAULT 0,
    available_rooms INT NOT NULL DEFAULT 0,
    reserved_rooms INT NOT NULL DEFAULT 0,
    version INT NOT NULL DEFAULT 0,  -- For optimistic locking

    PRIMARY KEY (hotel_id, room_type_id, date),
    FOREIGN KEY (hotel_id, room_type_id) REFERENCES room_types(hotel_id, room_type_id),

    -- Database constraint ensures: total_rooms = available_rooms + reserved_rooms
    CONSTRAINT chk_inventory_balance 
        CHECK (total_rooms = available_rooms + reserved_rooms),

    -- Additional constraints for data integrity
    CONSTRAINT chk_inventory_non_negative 
        CHECK (available_rooms >= 0 AND reserved_rooms >= 0 AND total_rooms >= 0)
);

-- B-tree indexes for fast lookups
-- Note: The PRIMARY KEY already creates an index on (hotel_id, room_type_id, date),
-- so idx_inventory_lookup is technically redundant but kept for clarity/explicit naming.
CREATE INDEX idx_inventory_lookup ON inventory (hotel_id, room_type_id, date);
-- Partial index: Only includes rows where available_rooms > 0
-- This is smaller and faster for availability queries (the most common operation).
-- PostgreSQL can use it only when the query's WHERE clause implies available_rooms > 0
-- (for example, available_rooms >= 1), as the availability queries in Section 2.3 do.
CREATE INDEX idx_inventory_available ON inventory (hotel_id, room_type_id, date) 
WHERE available_rooms > 0;

-- Reservations: Header table (one reservation can have multiple room types)
CREATE TABLE reservations (
    reservation_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    hotel_id VARCHAR(50) NOT NULL,
    check_in_date DATE NOT NULL,
    check_out_date DATE NOT NULL,
    num_guests INT NOT NULL,
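    -- PostgreSQL has no inline ENUM(...) column type; use a CHECK constraint (or CREATE TYPE ... AS ENUM)
    status VARCHAR(20) NOT NULL
        CHECK (status IN ('pending_payment', 'confirmed', 'checked_in', 'checked_out', 'cancelled', 'expired')),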
    total_amount_minor_units BIGINT NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (hotel_id) REFERENCES hotels(hotel_id)
);

-- Reservation rooms: One row per room type in the reservation
-- Allows booking multiple room types in a single reservation
CREATE TABLE reservation_rooms (
    reservation_room_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    reservation_id UUID NOT NULL,
    hotel_id VARCHAR(50) NOT NULL,  -- Needed for composite foreign key
    room_type_id VARCHAR(100) NOT NULL,
    num_rooms INT NOT NULL DEFAULT 1,  -- Number of rooms of this type in the reservation

    FOREIGN KEY (reservation_id) REFERENCES reservations(reservation_id) ON DELETE CASCADE,
    FOREIGN KEY (hotel_id, room_type_id) REFERENCES room_types(hotel_id, room_type_id),
    CONSTRAINT chk_num_rooms_positive CHECK (num_rooms > 0),
    UNIQUE (reservation_id, room_type_id)  -- One row per room type per reservation
);

-- Reservation room nights: One row per room type per night for pricing
-- This allows different room types in the same reservation to have different nightly rates
-- (e.g., Deluxe room at $150/night, Suite at $300/night)
CREATE TABLE reservation_room_nights (
    reservation_room_night_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    reservation_room_id UUID NOT NULL,
    night_date DATE NOT NULL,
    nightly_rate_minor_units BIGINT NOT NULL,  -- Rate per room of this type for this night

    FOREIGN KEY (reservation_room_id) REFERENCES reservation_rooms(reservation_room_id) ON DELETE CASCADE,
    UNIQUE (reservation_room_id, night_date)
);

2.2 Key Characteristics

Fungible Inventory: Room 101 and Room 102 of type "Deluxe" are treated as identical
Multiple Room Types: A single reservation can include multiple room types (e.g., 2 Deluxe + 1 Suite) via the reservation_rooms table
Row-Per-Day: Each night gets its own inventory row
Count-Based: Tracks available_rooms count, not specific room assignments
B-tree Indexes: Standard indexes for point queries
Optimistic Locking: Uses version column to detect concurrent modifications

2.3 Availability Check Query

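-- Check availability for the nights of June 15, 16, and 17 (3 nights, check-out on June 18) for 1 room
-- The stay is available only if all 3 rows come back; a missing row means that night is sold out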
SELECT date, available_rooms, version
FROM inventory
WHERE hotel_id = 'hotel_123' 
  AND room_type_id = 'deluxe_001'
  AND date IN ('2024-06-15', '2024-06-16', '2024-06-17')
  AND available_rooms >= 1;  -- Check for at least 1 room

-- Check availability for multiple rooms (e.g., 3 rooms)
SELECT date, available_rooms, version
FROM inventory
WHERE hotel_id = 'hotel_123' 
  AND room_type_id = 'deluxe_001'
  AND date IN ('2024-06-15', '2024-06-16', '2024-06-17')
  AND available_rooms >= 3;  -- Check for at least 3 rooms
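Both queries return only the nights that have enough rooms, so the caller still has to confirm that every night of the stay came back. Here is a minimal sketch of that check. It assumes rows arrive as (date, available_rooms, version) tuples and that a stay covers the half-open interval [check_in, check_out); the helper names are illustrative and not part of the original design.

```python
from datetime import date, timedelta


def nights_in_stay(check_in, check_out):
    """Return each night of the stay; the check-out day is not a night."""
    return [check_in + timedelta(days=i) for i in range((check_out - check_in).days)]


def stay_is_available(rows, check_in, check_out):
    """True only if the query returned a row for every night of the stay.

    `rows` is assumed to be the result of the availability query above,
    i.e. it contains only nights that already satisfy available_rooms >= N.
    """
    returned = {row[0] for row in rows}
    return all(night in returned for night in nights_in_stay(check_in, check_out))


# Hypothetical result set: June 16 is missing because it is sold out.
rows = [(date(2024, 6, 15), 4, 7), (date(2024, 6, 17), 2, 3)]
print(stay_is_available(rows, date(2024, 6, 15), date(2024, 6, 18)))  # False
```

Comparing the returned row count with the number of nights works too; the set comparison just makes the missing night easy to report.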

2.4 Booking Creation (With Optimistic Locking)

Option 1: Single Room Type (Simplified)

def create_booking_schema1_single_type(hotel_id, room_type_id, check_in_date, check_out_date, num_rooms=1, num_guests=1):
    """
    Create a booking for one or more rooms of the same type.

    Args:
        num_rooms: Number of rooms to book (default: 1)
        num_guests: Total number of guests across all rooms
    """
    # Use the multi-type function with single room type
    return create_booking_schema1(
        hotel_id, 
        [(room_type_id, num_rooms)], 
        check_in_date, 
        check_out_date, 
        num_guests
    )

Option 2: Multiple Room Types (Full Implementation)

Example Usage:

# Book 1 room of single type
reservation_id = create_booking_schema1(
    'hotel_123', 
    [('deluxe_001', 1)], 
    '2024-06-15', 
    '2024-06-17', 
    num_guests=2
)

# Book multiple rooms of same type
reservation_id = create_booking_schema1(
    'hotel_123', 
    [('deluxe_001', 3)], 
    '2024-06-15', 
    '2024-06-17', 
    num_guests=8
)

# Book multiple room types in one reservation
# Example: 2 Deluxe rooms + 1 Suite for a family
reservation_id = create_booking_schema1(
    'hotel_123', 
    [('deluxe_001', 2), ('suite_001', 1)], 
    '2024-06-15', 
    '2024-06-17', 
    num_guests=10
)
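The calls above rely on create_booking_schema1 decrementing inventory for every night under optimistic locking. The sketch below shows only that inventory step, under loud assumptions: sqlite3 stands in for PostgreSQL so the example runs standalone, the table is a cut-down copy of inventory, and reserve_nights is a hypothetical helper rather than the author's implementation. Creating the reservations, reservation_rooms, and reservation_room_nights rows is left out.

```python
import sqlite3
from datetime import date, timedelta


class ConcurrentModificationError(Exception):
    pass


class InsufficientAvailabilityError(Exception):
    pass


def reserve_nights(conn, hotel_id, room_type_id, check_in, check_out, num_rooms):
    """Reserve num_rooms for every night in [check_in, check_out), or change nothing."""
    nights = [(check_in + timedelta(days=i)).isoformat()
              for i in range((check_out - check_in).days)]
    try:
        for night in nights:
            row = conn.execute(
                "SELECT available_rooms, version FROM inventory "
                "WHERE hotel_id = ? AND room_type_id = ? AND date = ?",
                (hotel_id, room_type_id, night),
            ).fetchone()
            if row is None or row[0] < num_rooms:
                raise InsufficientAvailabilityError(night)
            # The version predicate turns the UPDATE into a no-op if another
            # transaction changed the row after we read it.
            cur = conn.execute(
                "UPDATE inventory SET available_rooms = available_rooms - ?, "
                "reserved_rooms = reserved_rooms + ?, version = version + 1 "
                "WHERE hotel_id = ? AND room_type_id = ? AND date = ? AND version = ?",
                (num_rooms, num_rooms, hotel_id, room_type_id, night, row[1]),
            )
            if cur.rowcount != 1:
                raise ConcurrentModificationError(night)
        conn.commit()
    except Exception:
        conn.rollback()  # undo any nights already decremented
        raise


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (hotel_id TEXT, room_type_id TEXT, date TEXT, "
    "total_rooms INT, available_rooms INT, reserved_rooms INT, version INT, "
    "PRIMARY KEY (hotel_id, room_type_id, date))"
)
for day in ("2024-06-15", "2024-06-16"):
    conn.execute(
        "INSERT INTO inventory VALUES ('hotel_123', 'deluxe_001', ?, 2, 2, 0, 0)", (day,)
    )
conn.commit()

# Check-in June 15, check-out June 17: the nights of June 15 and June 16.
reserve_nights(conn, "hotel_123", "deluxe_001", date(2024, 6, 15), date(2024, 6, 17), 1)
print(conn.execute("SELECT date, available_rooms, version FROM inventory ORDER BY date").fetchall())
```

A caller would typically catch ConcurrentModificationError and retry the whole stay, which is the retry logic listed under the cons in Section 2.5.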

2.5 Pros and Cons

Pros:
✅ Simple and intuitive: Easy to understand and maintain
✅ Efficient point queries: Direct date lookups are O(log n) with B-tree index
✅ Easy updates: Update individual nights independently
✅ Supports per-night pricing: Each night is its own row, so rates can vary by night (stored per reservation in reservation_room_nights)
✅ Handles overlapping bookings: User books June 15-17, another books June 16-18
✅ Flexible room assignment: Any available room of the type can be assigned
✅ Works with standard indexes: No special index types needed

Cons:
❌ More rows: 365-day window = 365 rows per room type (scales linearly with booking window)
❌ Optimistic locking complexity: Requires retry logic on conflicts
❌ No specific room assignment at booking: Don't know which exact room guest will get until check-in
❌ Count-based tracking: Must update all three fields together (total, available, reserved) - enforced by database CHECK constraint

2.6 Room Assignment Guarantee After Check-In

Question: Can Schema 1 guarantee that guests won't have to switch rooms once they've checked in?

Answer: Schema 1 alone cannot guarantee this because it only tracks room types, not specific room assignments. However, you can add a room assignment table that locks in the specific room at check-in time.

Option 1: Add Room Assignment at Check-In (Recommended)

Add a table to track specific room assignments after check-in:

-- Physical rooms table (needed for room assignments)
CREATE TABLE rooms (
    hotel_id VARCHAR(50) NOT NULL,
    room_number VARCHAR(20) NOT NULL,  -- e.g., "101", "102"
    room_type_id VARCHAR(100) NOT NULL,
    PRIMARY KEY (hotel_id, room_number),
    FOREIGN KEY (hotel_id, room_type_id) REFERENCES room_types(hotel_id, room_type_id)
);

-- Room assignments: Lock in specific room at check-in
CREATE TABLE room_assignments (
    assignment_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    reservation_id UUID NOT NULL,
    hotel_id VARCHAR(50) NOT NULL,
    room_number VARCHAR(20) NOT NULL,
    stay_dates DATERANGE NOT NULL,  -- [check_in_date, check_out_date)
    assigned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (reservation_id) REFERENCES reservations(reservation_id) ON DELETE CASCADE,
    FOREIGN KEY (hotel_id, room_number) REFERENCES rooms(hotel_id, room_number),

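    -- GiST exclusion constraint prevents overlapping assignments for same room
    -- (GiST is explained in section 3.3). Equality on VARCHAR columns inside a GiST
    -- constraint requires the btree_gist extension: CREATE EXTENSION IF NOT EXISTS btree_gist;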
    EXCLUDE USING GIST (
        hotel_id WITH =,
        room_number WITH =,
        stay_dates WITH &&
    )
);

How This Works:

At Booking Time: Reservation is created with multiple room types via reservation_rooms table (no specific rooms assigned yet)
At Check-In: Hotel staff assigns specific rooms (one room_assignments entry per room) matching the room types and counts
GiST Constraint: Prevents any other reservation from being assigned to the same room for overlapping dates
Guarantee: Once assigned, each room cannot be assigned to anyone else during the guest's stay
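To make the guarantee concrete, here is a toy in-memory model of what the exclusion constraint enforces. It is a sketch, not how PostgreSQL implements it: the RoomAssignments class and its linear scan are illustrative assumptions, and stays are treated as half-open [check_in, check_out) ranges as in the stay_dates column comment.

```python
from datetime import date


class RoomAlreadyAssignedError(Exception):
    pass


class RoomAssignments:
    """Toy stand-in for room_assignments plus its exclusion constraint."""

    def __init__(self):
        self.rows = []  # (hotel_id, room_number, check_in, check_out, reservation_id)

    def assign(self, reservation_id, hotel_id, room_number, check_in, check_out):
        for h, r, start, end, _ in self.rows:
            same_room = (h, r) == (hotel_id, room_number)
            # Half-open ranges overlap when each one starts before the other ends.
            if same_room and check_in < end and start < check_out:
                raise RoomAlreadyAssignedError(f"room {room_number} is taken")
        self.rows.append((hotel_id, room_number, check_in, check_out, reservation_id))


table = RoomAssignments()
table.assign("res_1", "hotel_123", "101", date(2024, 6, 15), date(2024, 6, 17))
# Same-day turnover is allowed: res_1 checks out on the day res_2 checks in.
table.assign("res_2", "hotel_123", "101", date(2024, 6, 17), date(2024, 6, 19))
try:
    table.assign("res_3", "hotel_123", "101", date(2024, 6, 16), date(2024, 6, 18))
except RoomAlreadyAssignedError as exc:
    print(exc)  # room 101 is taken
```

The database version does the same comparison through the GiST index rather than a scan, and it also holds under concurrent inserts, which this single-threaded model does not attempt to show.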

Check-In Flow:

def check_in_guest(reservation_id, room_assignments):
    """
    Assign specific rooms to a reservation at check-in.

    Args:
        reservation_id: The reservation ID
        room_assignments: Dict mapping room_type_id to list of room numbers
                         Example: {'deluxe_001': ['101', '102'], 'suite_001': ['201']}
    """
    reservation = get_reservation(reservation_id)

    if reservation.status != 'confirmed':
        raise InvalidStateError("Reservation must be confirmed")

    # Get reservation_rooms to validate assignments
    reservation_rooms = db.query("""
        SELECT room_type_id, num_rooms
        FROM reservation_rooms
        WHERE reservation_id = ?
    """, reservation_id)

    # Validate room assignments match reservation
    for res_room in reservation_rooms:
        assigned_rooms = room_assignments.get(res_room.room_type_id, [])
        if len(assigned_rooms) != res_room.num_rooms:
            raise InvalidRoomCountError(
                f"Room type {res_room.room_type_id}: Reservation requires {res_room.num_rooms} rooms, "
                f"but {len(assigned_rooms)} provided"
            )

    # Assign each room (GiST constraint prevents overlaps)
    try:
        for room_type_id, room_numbers in room_assignments.items():
            for room_number in room_numbers:
                # '[)' matches the stay_dates definition: the check-out day is
                # excluded, so the next guest can check in that same day
                db.execute("""
                    INSERT INTO room_assignments 
                    (reservation_id, hotel_id, room_number, stay_dates)
                    VALUES (?, ?, ?, daterange(?, ?, '[)'))
                """, reservation_id, reservation.hotel_id, room_number, 
                     reservation.check_in_date, reservation.check_out_date)

        # Update reservation status
        # (checked_in_at is assumed to exist on reservations; it is not in the Section 2.1 DDL)
        db.execute("""
            UPDATE reservations
            SET status = 'checked_in', checked_in_at = NOW()
            WHERE reservation_id = ?
        """, reservation_id)

        db.commit()
    except ExclusionViolation:
        # One or more rooms already assigned to another guest for these dates;
        # roll back so no partial set of assignments is kept
        db.rollback()
        raise RoomAlreadyAssignedError("One or more rooms are not available")

Multiple Room Types Example:

# Reservation for 2 Deluxe + 1 Suite
reservation_id = create_booking_schema1(
    'hotel_123', 
    [('deluxe_001', 2), ('suite_001', 1)], 
    '2024-06-15', 
    '2024-06-17', 
    num_guests=10
)

# At check-in, assign specific rooms for each type
check_in_guest(reservation_id, {
    'deluxe_001': ['101', '102'],  # 2 Deluxe rooms
    'suite_001': ['201']           # 1 Suite
})
# Creates 3 room_assignments entries total
# GiST constraint ensures no overlaps for any of the 3 rooms

Benefits:

✅ Guarantees no room switching: Once assigned, GiST constraint prevents conflicts
✅ Flexibility before check-in: Can reassign rooms if needed before guest arrives
✅ Best of both worlds: Fungible inventory for booking, specific assignment at check-in
✅ Maintenance handling: If room needs maintenance, can reassign before check-in

Trade-offs:

Requires additional rooms and room_assignments tables
More complex than pure Schema 1, but simpler than full Schema 2

Option 2: Pure Schema 1 (No Guarantee)

If you don't add room assignments, Schema 1 cannot guarantee that guests won't switch rooms because:

No specific room is tracked
Hotel could theoretically reassign rooms (though this would be bad practice)
No database constraint prevents it

When This Is Acceptable:

Standard hotels where room switching is rare and handled operationally
Hotels that prioritize flexibility over guarantees
Systems where room assignment happens outside the booking system

Recommendation

For production systems, add room assignments at check-in (Option 1) to:

Guarantee no room switching after check-in
Maintain flexibility before check-in
Provide audit trail of room assignments
Enable room-level tracking for maintenance

This hybrid approach gives you the benefits of Schema 1 (fungible inventory, simple booking) with the guarantees of Schema 2 (specific room assignment, no conflicts).

3. Schema 2: GiST with Individual Rooms

3.1 Tables: Reused from Schema 1 vs. New Tables

Tables Reused from Schema 1:

✅ room_types - Room type metadata (Deluxe, Suite, etc.)
✅ reservations - Reservation header table (check-in/out dates, guest info, status)
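✅ reservation_room_nights - Pricing per room type per night (optional, for pricing calculations). As defined in Section 2.1, its foreign key points at reservation_rooms, which Schema 2 drops, so reusing it means repointing that key (for example at room_bookings)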

Tables NOT Used from Schema 1:

❌ inventory - Not needed (availability calculated from room_bookings overlaps)
❌ reservation_rooms - Not needed (bookings are per-room, not per-room-type)

New Tables for Schema 2:

🆕 rooms - Physical room inventory (hotel_id, room_number, room_type_id)
🆕 room_bookings - Individual room bookings with date ranges (uses GiST index with partial exclusion constraint to prevent overlaps for pending/confirmed/checked_in)

Key Difference:

Schema 1: Tracks availability by room type (fungible: "2 Deluxe rooms available")
Schema 2: Tracks availability by specific room (non-fungible: "Room 101 available, Room 102 booked")

3.2 Schema Design

-- Physical rooms (non-fungible - each room tracked separately)
CREATE TABLE rooms (
    hotel_id VARCHAR(50) NOT NULL,
    room_number VARCHAR(20) NOT NULL,  -- e.g., "101", "102", "Suite-A"
    room_type_id VARCHAR(100) NOT NULL,  -- Links to room_types for metadata
    floor_number INT,

    PRIMARY KEY (hotel_id, room_number),
    FOREIGN KEY (hotel_id, room_type_id) REFERENCES room_types(hotel_id, room_type_id)
);

-- Room bookings: One row per booking with date range
CREATE TABLE room_bookings (
    booking_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    hotel_id VARCHAR(50) NOT NULL,
    room_number VARCHAR(20) NOT NULL,  -- Specific physical room
    reservation_id UUID NOT NULL,

    stay_dates DATERANGE NOT NULL,  -- [check_in_date, check_out_date)
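    -- PostgreSQL has no inline ENUM(...) column type; use a CHECK constraint (or CREATE TYPE ... AS ENUM)
    status VARCHAR(20) NOT NULL
        CHECK (status IN ('pending', 'confirmed', 'checked_in', 'checked_out', 'cancelled')),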

    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,

    FOREIGN KEY (hotel_id, room_number) REFERENCES rooms(hotel_id, room_number),
    FOREIGN KEY (reservation_id) REFERENCES reservations(reservation_id)
);

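-- btree_gist supplies the GiST operator classes needed for the scalar columns (hotel_id, room_number)
CREATE EXTENSION IF NOT EXISTS btree_gist;

-- GiST index for fast range queries and overlap detection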
CREATE INDEX idx_room_bookings_dates 
ON room_bookings USING GIST (hotel_id, room_number, stay_dates);

-- Exclusion constraint prevents overlapping bookings for same room
-- Option 1: Partial exclusion constraint (RECOMMENDED - most idiomatic)
-- This constraint ONLY applies to "active" bookings (pending, confirmed, checked_in)
-- Cancelled/checked_out bookings don't block new bookings
ALTER TABLE room_bookings
ADD CONSTRAINT room_bookings_excl EXCLUDE USING GIST (
    hotel_id WITH =,
    room_number WITH =,
    stay_dates WITH &&
) WHERE (status IN ('pending', 'confirmed', 'checked_in'));

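-- Option 2: Use trigger (more flexible, but less idiomatic)
-- Triggers are more flexible because you can add custom logic beyond just checking overlaps.
-- For example, you could log failed booking attempts, allow VIP users to override conflicts,
-- or enforce additional business rules like "no bookings within 1 hour of each other".
-- However, we don't need this flexibility here - the simple overlap check is sufficient.
-- Caveat: unlike the exclusion constraint, this check is not safe under concurrency by itself.
-- Two transactions inserting overlapping bookings at the same time cannot see each other's
-- uncommitted rows, so both can pass the EXISTS check unless the room row is locked first
-- or the transactions run at SERIALIZABLE isolation.
-- Note: We prevent overlaps for 'pending', 'confirmed', and 'checked_in' to avoid
-- double-booking scenarios where multiple pending reservations could later be confirmed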
CREATE OR REPLACE FUNCTION check_room_overlap() RETURNS TRIGGER AS $$
BEGIN
    IF EXISTS (
        SELECT 1 FROM room_bookings
        WHERE hotel_id = NEW.hotel_id
          AND room_number = NEW.room_number
          AND stay_dates && NEW.stay_dates  -- Overlap operator
          AND status IN ('pending', 'confirmed', 'checked_in')  
          AND booking_id != NEW.booking_id
    ) THEN
        RAISE EXCEPTION 'Room already booked for these dates';
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_check_overlap
BEFORE INSERT OR UPDATE ON room_bookings
FOR EACH ROW
WHEN (NEW.status IN ('pending', 'confirmed', 'checked_in')) 
EXECUTE FUNCTION check_room_overlap();

3.3 How GiST Works

GiST (Generalized Search Tree) is a PostgreSQL index type optimized for complex data types like ranges, geometric data, and full-text search. Unlike B-tree indexes (which work with simple comparisons like <, =, >), GiST indexes support spatial and range operations.

What is GiST?

GiST is an extensible indexing framework that allows you to define custom index methods for complex data types. For date ranges, PostgreSQL provides built-in GiST support that understands range operations.

How GiST Indexes Range Data

1. Range Representation:

-- daterange values are stored in canonical half-open form: [start_date, end_date)
daterange('2024-06-15', '2024-06-17', '[]')  -- Includes both endpoints; stored as [2024-06-15,2024-06-18)
daterange('2024-06-15', '2024-06-17', '[)')  -- Standard: includes start, excludes end (nights of June 15 and 16)
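Because dates are discrete, PostgreSQL normalizes every daterange to the '[)' form, which is why an inclusive upper bound shows up one day later when you read the value back. The conversion can be sketched in a few lines; canonical_daterange is an illustrative helper, not a PostgreSQL API.

```python
from datetime import date, timedelta


def canonical_daterange(lower, upper, bounds="[)"):
    """Return (lower, upper) in half-open '[)' form for a discrete date range."""
    if bounds[0] == "(":
        lower = lower + timedelta(days=1)  # exclusive lower -> next day, inclusive
    if bounds[1] == "]":
        upper = upper + timedelta(days=1)  # inclusive upper -> next day, exclusive
    return lower, upper


print(canonical_daterange(date(2024, 6, 15), date(2024, 6, 17), "[]"))
print(canonical_daterange(date(2024, 6, 15), date(2024, 6, 17), "[)"))
```

For hotel stays this matters: with '[]' the check-out day counts as occupied, so a guest arriving on the previous guest's check-out day would be rejected as an overlap.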

2. GiST Index Structure:

GiST builds a tree where each node contains bounding boxes (the smallest range that contains all child ranges)
Leaf nodes contain actual date ranges
Internal nodes contain bounding boxes of their children

Example GiST Tree Structure:

                    Root Node
              [2024-01-01, 2024-12-31]  (bounding box)
                    /              \
         Node A                    Node B
    [2024-01-01, 2024-06-30]  [2024-07-01, 2024-12-31]
         /        \                  /        \
    Leaf 1      Leaf 2          Leaf 3      Leaf 4
[2024-06-15,  [2024-06-20,   [2024-07-10,  [2024-08-01,
 2024-06-17]   2024-06-22]    2024-07-15]   2024-08-05]

Range Operations GiST Supports

1. Overlap Operator (&&):

-- Check if two ranges overlap
daterange('2024-06-15', '2024-06-17') && daterange('2024-06-16', '2024-06-18')
-- Returns: TRUE (they overlap on June 16)

daterange('2024-06-15', '2024-06-17') && daterange('2024-06-18', '2024-06-20')
-- Returns: FALSE (no overlap)

How GiST Evaluates Overlap:

Start at root node
Check if query range overlaps with node's bounding box
If yes, descend into that branch
If no, skip entire branch (pruning)
Continue until leaf nodes
Check actual ranges for overlap
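The descent-and-prune steps above can be sketched with a small tree that mirrors the example diagram. This is a toy model under stated assumptions: real GiST nodes are disk pages holding many entries, sibling bounding ranges may overlap, and the Node class and search function are illustrative names only. Bounds are treated as inclusive, as drawn in the diagram.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Node:
    lower: date                                   # bounding range start (inclusive)
    upper: date                                   # bounding range end (inclusive)
    children: list = field(default_factory=list)  # empty for leaves


def overlaps(a_lo, a_hi, b_lo, b_hi):
    return a_lo <= b_hi and b_lo <= a_hi


def search(node, q_lo, q_hi, checked):
    """Collect leaves overlapping [q_lo, q_hi], skipping branches whose bounds miss."""
    checked.append((node.lower, node.upper))
    if not overlaps(node.lower, node.upper, q_lo, q_hi):
        return []          # prune: nothing below this node can match
    if not node.children:
        return [node]      # leaf whose actual range overlaps the query
    hits = []
    for child in node.children:
        hits.extend(search(child, q_lo, q_hi, checked))
    return hits


def d(month, day):
    return date(2024, month, day)


leaf1, leaf2 = Node(d(6, 15), d(6, 17)), Node(d(6, 20), d(6, 22))
leaf3, leaf4 = Node(d(7, 10), d(7, 15)), Node(d(8, 1), d(8, 5))
root = Node(d(1, 1), d(12, 31), [
    Node(d(1, 1), d(6, 30), [leaf1, leaf2]),
    Node(d(7, 1), d(12, 31), [leaf3, leaf4]),
])

checked = []
hits = search(root, d(6, 16), d(6, 21), checked)
print([(n.lower.isoformat(), n.upper.isoformat()) for n in hits])
print(len(checked), "of 7 nodes examined")
```

For a query covering June 16 to June 21, Node B is examined once and rejected, so Leaf 3 and Leaf 4 are never touched.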

2. Contains Operator (@>):

-- Check if range contains a date
daterange('2024-06-15', '2024-06-17') @> '2024-06-16'::date
-- Returns: TRUE

-- Check if range contains another range
daterange('2024-06-10', '2024-06-20') @> daterange('2024-06-15', '2024-06-17')
-- Returns: TRUE

3. Contained By Operator (<@):

-- Check if range is contained by another range
daterange('2024-06-15', '2024-06-17') <@ daterange('2024-06-10', '2024-06-20')
-- Returns: TRUE

4. Adjacent Operator (-|-):

-- Check if ranges are adjacent (touch but don't overlap)
daterange('2024-06-15', '2024-06-17') -|- daterange('2024-06-17', '2024-06-19')
-- Returns: TRUE (they touch at June 17)
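The four operators reduce to simple bound comparisons once ranges are in half-open form. The sketch below reproduces the results shown above in plain Python; DateRange is an illustrative class, and empty or unbounded ranges, which PostgreSQL also supports, are deliberately ignored.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DateRange:
    lower: date  # inclusive
    upper: date  # exclusive

    def overlaps(self, other):      # &&
        return self.lower < other.upper and other.lower < self.upper

    def contains(self, other):      # @>  (accepts a range or a single date)
        if isinstance(other, date):
            return self.lower <= other < self.upper
        return self.lower <= other.lower and other.upper <= self.upper

    def contained_by(self, other):  # <@
        return other.contains(self)

    def adjacent(self, other):      # -|-
        return self.upper == other.lower or other.upper == self.lower


def june(day):
    return date(2024, 6, day)


stay = DateRange(june(15), june(17))
print(stay.overlaps(DateRange(june(16), june(18))))      # True
print(stay.overlaps(DateRange(june(18), june(20))))      # False
print(stay.contains(june(16)))                           # True
print(stay.contained_by(DateRange(june(10), june(20))))  # True
print(stay.adjacent(DateRange(june(17), june(19))))      # True
```

Note that adjacent ranges do not overlap, which is what lets back-to-back stays share a changeover day.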

How Exclusion Constraints Use GiST

Exclusion Constraint Syntax:

EXCLUDE USING GIST (
    hotel_id WITH =,        -- Equality operator for hotel_id
    room_number WITH =,     -- Equality operator for room_number
    stay_dates WITH &&      -- Overlap operator for date ranges
)

What This Means:

For the same (hotel_id, room_number) combination
Prevent any two rows where stay_dates overlap (&&)
The constraint is enforced at the database level (before insert/update)

How It Works Internally:

When inserting a new booking, PostgreSQL checks the GiST index
Finds all existing bookings for the same (hotel_id, room_number)
For each existing booking, checks if stay_dates && new_stay_dates
If any overlap is found, the insert fails with an exclusion violation error
This happens atomically - no race conditions possible

Example:

-- Booking 1: Room 101, June 15-17
-- (booking_id is left to its default; the reservation UUIDs are illustrative)
INSERT INTO room_bookings (hotel_id, room_number, reservation_id, stay_dates, status) VALUES
('hotel_123', '101', '11111111-1111-1111-1111-111111111111',
 daterange('2024-06-15', '2024-06-17', '[)'), 'confirmed');
-- SUCCESS

-- Booking 2: Room 101, June 16-18 (OVERLAPS with Booking 1!)
INSERT INTO room_bookings (hotel_id, room_number, reservation_id, stay_dates, status) VALUES
('hotel_123', '101', '22222222-2222-2222-2222-222222222222',
 daterange('2024-06-16', '2024-06-18', '[)'), 'confirmed');
-- ERROR: conflicting key value violates exclusion constraint "room_bookings_excl"
-- The GiST index found that Booking 1's range [2024-06-15, 2024-06-17)
-- overlaps with Booking 2's range [2024-06-16, 2024-06-18)

Why GiST is Efficient for Range Queries

1. Pruning:

GiST can skip entire branches of the tree if their bounding boxes don't overlap
Example: If query range is [2024-06-15, 2024-06-17] and a node's bounding box is [2024-12-01, 2024-12-31], that entire branch is skipped

2. Index-Only Scans:

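For overlap checks, the index narrows the search to candidate entries, so only rows that might match are fetched from the table
Whether a true index-only scan is possible depends on the operator class, the PostgreSQL version, and the query; either way this is much faster than scanning all rows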

3. Spatial Locality:

Ranges that are close in time are stored near each other in the index
Queries for nearby dates are very efficient

Performance Comparison: GiST vs B-tree for Ranges

B-tree with Date Ranges (Inefficient):

-- Without GiST, assuming separate check_in_date / check_out_date columns instead of stay_dates
SELECT * FROM room_bookings
WHERE hotel_id = 'hotel_123' 
  AND room_number = '101'
  AND check_in_date < '2024-06-17'    -- existing stay starts before the new one ends
  AND check_out_date > '2024-06-15';  -- and ends after the new one starts
-- A B-tree can narrow the scan on only one of the two bounds; the other is filtered row by row.
-- Nothing at the database level stops two overlapping rows from being inserted.

GiST with Date Ranges (Efficient):

-- With GiST, single overlap operator
SELECT * FROM room_bookings
WHERE hotel_id = 'hotel_123' 
  AND room_number = '101'
  AND stay_dates && daterange('2024-06-15', '2024-06-17', '[)');
-- Uses GiST index, very fast, simple query

GiST Index Maintenance

Insert Performance:

Inserting a new range requires updating the GiST tree
May need to split a node when its page fills up, and to widen parent bounding ranges to cover the new entry
Generally O(log n) but can be slower than B-tree for simple inserts

Update Performance:

Updating a range may require rebalancing the tree
More expensive than B-tree updates

Query Performance:

Overlap queries: Excellent (O(log n) with pruning)
Point queries: Good (but B-tree is better)
Range containment: Excellent

3.4 Key Characteristics

Non-Fungible Inventory: Each physical room (Room 101, Room 102) is tracked separately
Date Ranges: Uses DATERANGE to store booking periods
GiST Indexes: Optimized for range queries and overlap detection (see section 3.3 for details)
Database-Enforced Overlaps: Exclusion constraints or triggers prevent double-booking
Specific Room Assignment: Know exactly which room guest will get

3.5 Availability Check Query

-- Find available rooms of a specific type for June 15-17
-- Exclude rooms with pending, confirmed, or checked_in bookings
SELECT r.room_number
FROM rooms r
WHERE r.hotel_id = 'hotel_123' 
  AND r.room_type_id = 'deluxe_001'
  AND NOT EXISTS (
      SELECT 1 FROM room_bookings b
      WHERE b.hotel_id = r.hotel_id
        AND b.room_number = r.room_number
        AND b.stay_dates && daterange('2024-06-15', '2024-06-17', '[)')  -- Overlap check (half-open: check-out day stays bookable)
        AND b.status IN ('pending', 'confirmed', 'checked_in') 
  )
LIMIT 1;

3.6 Booking Creation

Single Room Booking

def create_booking_schema2_single(hotel_id, room_type_id, check_in_date, check_out_date, num_guests, max_retries=3):
    """
    Create a booking for a single room.

    Retries with a different room (up to max_retries times) if a concurrent
    booking takes the selected room first.
    """
    for attempt in range(max_retries):
        # Step 1: Find an available room of this type
        # Exclude rooms with pending, confirmed, or checked_in bookings for these dates
        available_room = db.query("""
            SELECT r.room_number
            FROM rooms r
            WHERE r.hotel_id = ? AND r.room_type_id = ?
              AND NOT EXISTS (
                  SELECT 1 FROM room_bookings b
                  WHERE b.hotel_id = r.hotel_id
                    AND b.room_number = r.room_number
                    AND b.stay_dates && daterange(?, ?, '[)')
                    AND b.status IN ('pending', 'confirmed', 'checked_in')  -- Include pending
              )
            LIMIT 1
        """, hotel_id, room_type_id, check_in_date, check_out_date)

        if not available_room:
            raise InsufficientAvailabilityError()

        room_number = available_room.room_number

        # Step 2: Create reservation
        reservation_id = db.execute("""
            INSERT INTO reservations 
            (hotel_id, check_in_date, check_out_date, num_guests, status, total_amount_minor_units)
            VALUES (?, ?, ?, ?, 'pending_payment', 0)
            RETURNING reservation_id
        """, hotel_id, check_in_date, check_out_date, num_guests)

        # Step 3: Create room booking (exclusion constraint prevents overlap)
        try:
            db.execute("""
                INSERT INTO room_bookings 
                (hotel_id, room_number, reservation_id, stay_dates, status)
                VALUES (?, ?, ?, daterange(?, ?, '[)'), 'pending')
            """, hotel_id, room_number, reservation_id, check_in_date, check_out_date)
            db.commit()
            return reservation_id
        except ExclusionViolation:
            # Another booking took this room - roll back (this also discards the
            # reservation row) and retry with a different room
            db.rollback()

    # Bounded retries replace unbounded recursion under heavy contention
    raise InsufficientAvailabilityError()

Multiple Rooms Booking (Complex - Requires Retry Logic)

import time  # needed for the backoff sleep below


def create_booking_schema2(hotel_id, room_type_id, check_in_date, check_out_date, num_rooms, num_guests):
    """
    Create a booking for multiple rooms of the same type.

    CRITICAL: This is significantly more complex than Schema 1's atomic decrement.
    We must find and book N different rooms atomically, which requires retry logic
    if any room becomes unavailable during the transaction.

    Args:
        num_rooms: Number of rooms to book (e.g., 3 Deluxe rooms)
    """
    max_retries = 5

    for attempt in range(max_retries):
        try:
            with db.transaction():
                # Step 1: Find N available rooms of this type
                # Use FOR UPDATE SKIP LOCKED to lock available rooms atomically
                # 
                # FOR UPDATE SKIP LOCKED Explanation (VERY USEFUL FOR INTERVIEWS):
                # This PostgreSQL feature is crucial for preventing race conditions in concurrent systems.
                # 
                # How it works:
                # 1. FOR UPDATE: Locks the selected rows for this transaction (prevents other transactions
                #    from modifying them until we commit/rollback)
                # 2. SKIP LOCKED: If a row is already locked by another transaction, skip it and continue
                #    to the next available row (instead of waiting/blocking)
                # 
                # Why it's useful:
                # - Prevents "lost update" problems: Multiple transactions can't grab the same room
                # - High concurrency: Transactions don't block each other - they just skip locked rows
                # - Perfect for work queues, booking systems, inventory allocation
                # - Common interview question: "How do you prevent two workers from processing the same job?"
                # 
                # Example scenario:
                #   T1: SELECT ... FOR UPDATE SKIP LOCKED LIMIT 3 → locks rooms 101, 102, 103
                #   T2: SELECT ... FOR UPDATE SKIP LOCKED LIMIT 3 → skips 101,102,103, locks 104, 105, 106
                #   Both transactions proceed without blocking each other!
                # 
                # Alternative (FOR UPDATE without SKIP LOCKED):
                #   T1: SELECT ... FOR UPDATE LIMIT 3 → locks rooms 101, 102, 103
                #   T2: SELECT ... FOR UPDATE LIMIT 3 → WAITS for T1 to commit (blocks, reduces concurrency)
                available_rooms = db.query("""
                    SELECT r.room_number
                    FROM rooms r
                    WHERE r.hotel_id = ? AND r.room_type_id = ?
                      AND NOT EXISTS (
                          SELECT 1 FROM room_bookings b
                          WHERE b.hotel_id = r.hotel_id
                            AND b.room_number = r.room_number
                            AND b.stay_dates && daterange(?, ?, '[)')
                            AND b.status IN ('pending', 'confirmed', 'checked_in')
                      )
                    ORDER BY r.room_number
                    LIMIT ?
                    FOR UPDATE SKIP LOCKED  -- Lock rows, skip if already locked
                """, hotel_id, room_type_id, check_in_date, check_out_date, num_rooms)

                if len(available_rooms) < num_rooms:
                    raise InsufficientAvailabilityError(
                        f"Only {len(available_rooms)} rooms available, need {num_rooms}"
                    )

                # Step 2: Create reservation header
                reservation_id = db.execute("""
                    INSERT INTO reservations 
                    (hotel_id, check_in_date, check_out_date, num_guests, status, total_amount_minor_units)
                    VALUES (?, ?, ?, ?, 'pending_payment', 0)
                    RETURNING reservation_id
                """, hotel_id, check_in_date, check_out_date, num_guests)

                # Step 3: Create room bookings for all N rooms
                # The exclusion constraint will prevent overlaps if another transaction
                # grabbed a room between our SELECT and INSERT
                for room in available_rooms:
                    try:
                        db.execute("""
                            INSERT INTO room_bookings 
                            (hotel_id, room_number, reservation_id, stay_dates, status)
                            VALUES (?, ?, ?, daterange(?, ?, '[]'), 'pending')
                        """, hotel_id, room.room_number, reservation_id, check_in_date, check_out_date)
                    except ExclusionViolation:
                        # Room was taken by another transaction - abort entire booking
                        raise ConcurrentModificationError(
                            f"Room {room.room_number} was booked by another transaction"
                        )

                db.commit()
                return reservation_id

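        except (ConcurrentModificationError, ExclusionViolation):
            # Roll back the failed transaction so partial inserts are discarded
            # and row locks are released before the next attempt
            db.rollback()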
            if attempt == max_retries - 1:
                raise InsufficientAvailabilityError(
                    f"Could not book {num_rooms} rooms after {max_retries} attempts"
                )
            # Exponential backoff before retry
            time.sleep(0.1 * (2 ** attempt))
            continue

    raise InsufficientAvailabilityError("Booking failed after retries")

Key Challenges with Multi-Room Bookings in Schema 2:

Concurrency Risk: Even with FOR UPDATE SKIP LOCKED, there's a small window between SELECT and INSERT where another transaction could grab a room. Important: FOR UPDATE locks rows in rooms (not room_bookings) and does NOT prevent other transactions from reading those rows. It only prevents them from:
- Taking a FOR UPDATE lock on the same rows (they'd wait, or skip with SKIP LOCKED)
- Modifying the rows (UPDATE/DELETE)
What this means in practice:
- Other transactions can still read the rows with a regular SELECT (no lock required)
- Another transaction could therefore check availability (read) and insert into room_bookings
- The exclusion constraint on room_bookings is the final protection (it raises ExclusionViolation if an overlap is detected)
- This is why we catch ExclusionViolation and retry the entire booking
All-or-Nothing for Same Room Type: If booking 3 rooms of the same type and we can only find 2 available, the entire booking fails. This is different from Schema 1, where booking 3 rooms of the same type is a single atomic UPDATE inventory SET available_rooms = available_rooms - 3 operation.
Retry Complexity: Requires retry logic with exponential backoff
Performance: FOR UPDATE SKIP LOCKED helps prevent conflicts but adds locking overhead (see detailed explanation above)
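
To make the SKIP LOCKED behavior concrete, here is a toy in-process model of the T1/T2 scenario from the code comments. It is only a sketch: threading.Lock objects stand in for PostgreSQL row locks, and the names room_locks, claim_rooms_skip_locked, and release_rooms are hypothetical. It models lock skipping only, not the exclusion constraint.

```python
import threading

# Hypothetical in-memory stand-in for row locks on the rooms table.
room_locks = {room: threading.Lock() for room in (101, 102, 103, 104, 105, 106)}


def claim_rooms_skip_locked(limit):
    """Lock up to `limit` rooms, skipping any room that is already locked."""
    claimed = []
    for room in sorted(room_locks):
        if len(claimed) == limit:
            break
        # A non-blocking acquire mirrors SKIP LOCKED: never wait on a held lock.
        if room_locks[room].acquire(blocking=False):
            claimed.append(room)
    return claimed


def release_rooms(rooms):
    """Release the locks, as COMMIT or ROLLBACK would."""
    for room in rooms:
        room_locks[room].release()


t1 = claim_rooms_skip_locked(3)  # first "transaction"
t2 = claim_rooms_skip_locked(3)  # second one skips the rooms T1 holds
t3 = claim_rooms_skip_locked(3)  # nothing is left to claim
release_rooms(t1 + t2)
```

A caller that gets back fewer rooms than it asked for (like t3 here) has to fail or retry, which is the same situation the len(available_rooms) < num_rooms check handles in the booking code.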

Comparison with Schema 1:

Schema 1 (same room type): UPDATE inventory SET available_rooms = available_rooms - 3 (single atomic operation for N rooms of same type)
Schema 1 (multiple room types): Also all-or-nothing - if any room type lacks availability, entire booking fails
Schema 2: Must find N different physical rooms, lock them, and insert N rows (multiple operations, higher failure risk even for same room type)
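
For contrast, the Schema 1 decrement can be sketched as a single guarded UPDATE. This is a minimal illustration using Python's built-in sqlite3 so it runs anywhere; the reserve_rooms helper and the sample data are assumptions, and it uses a conditional decrement (available_rooms >= N) rather than the version-column check used elsewhere in this document.

```python
import sqlite3


def reserve_rooms(conn, hotel_id, room_type_id, nights, num_rooms):
    """Decrement availability for every night of the stay, or change nothing."""
    placeholders = ",".join("?" for _ in nights)
    cur = conn.execute(
        f"""
        UPDATE inventory
           SET available_rooms = available_rooms - ?
         WHERE hotel_id = ? AND room_type_id = ?
           AND date IN ({placeholders})
           AND available_rooms >= ?
        """,
        [num_rooms, hotel_id, room_type_id, *nights, num_rooms],
    )
    # Every night must have been decremented; otherwise undo the partial update.
    if cur.rowcount != len(nights):
        conn.rollback()
        return False
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (hotel_id INT, room_type_id INT, date TEXT, "
    "available_rooms INT, PRIMARY KEY (hotel_id, room_type_id, date))"
)
conn.executemany(
    "INSERT INTO inventory VALUES (1, 1, ?, ?)",
    [("2024-06-15", 5), ("2024-06-16", 2)],
)
conn.commit()

stay = ["2024-06-15", "2024-06-16"]
first = reserve_rooms(conn, 1, 1, stay, 2)   # succeeds: both nights have >= 2
second = reserve_rooms(conn, 1, 1, stay, 1)  # fails: 2024-06-16 is sold out
remaining = dict(conn.execute("SELECT date, available_rooms FROM inventory"))
```

The failed second call leaves both nights untouched, which is the all-or-nothing behavior the comparison above relies on.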

3.7 Pros and Cons

Pros:
✅ Automatic overlap prevention: Database enforces no double-booking at constraint level (for single rooms)
✅ Exact room assignment: Know exactly which room guest will get (Room 101, not "any Deluxe room")
✅ Room-level tracking: Can track maintenance, room-specific issues, preferences
✅ Efficient range queries: GiST index optimized for date range operations
✅ No optimistic locking needed: Exclusion constraint handles single-room concurrency

Cons:
❌ Multi-room booking complexity: Booking N rooms requires finding and locking N different rooms, with retry logic if any room becomes unavailable. Much more complex than Schema 1's atomic UPDATE inventory SET available_rooms = available_rooms - N
❌ Concurrency challenges for multi-room: Between finding available rooms and inserting bookings, another transaction may grab one, requiring full retry
❌ Less flexible: Can't reassign rooms easily (guest booked Room 101, but it needs maintenance)
❌ More complex queries: Need to find available rooms by checking non-overlapping ranges
❌ Room assignment logic: Must decide which available room to assign (first available? best view?)
❌ More dynamic booking rows: For a booking of 3 rooms, creates 3 rows in room_bookings (vs. Schema 1's 1 row in reservation_rooms)
❌ GiST index overhead: Larger index size, more complex query planning
❌ Not fungible: Room 101 and 102 are different, even if same type (like event seats)

4. Comparative Analysis

4.1 Schema Structure

Aspect	Schema 1: Row-Per-Day (Fungible)	Schema 2: GiST (Individual Rooms)
Inventory Model	Fungible (Room 101 = Room 102)	Non-fungible (Room 101 ≠ Room 102)
Inventory Table	`inventory(hotel_id, room_type_id, date)`	`rooms(hotel_id, room_number)`
Booking Table	`reservations` + `reservation_rooms(room_type_id, num_rooms)`	`room_bookings(room_number, stay_dates)`
Date Storage	One row per night (DATE)	Date range per booking (DATERANGE)
Index Type	B-tree	GiST
Rows per Booking	1 reservation + N nights	1 room_booking (with date range)

4.2 Query Performance

Operation	Schema 1	Schema 2
Check Availability	`SELECT ... WHERE date IN (...)` - O(log n) per date	`SELECT ... WHERE NOT EXISTS (overlap check)` - O(log n) with GiST
Point Query	Excellent (direct date lookup)	Good (range overlap check)
Range Query	Good (multiple point queries)	Excellent (single range query)
Find Available Room	Count-based (simple)	Must query all rooms, check overlaps (more complex)

4.3 Concurrency Control

Aspect	Schema 1	Schema 2
Method	Optimistic locking (version numbers)	Database constraints (GiST exclusion)
Conflict Detection	Check version in WHERE clause	Constraint violation on overlap
Retry Logic	Required (on version mismatch)	Required for multi-room (must find N rooms atomically)
Deadlocks	Unlikely (row locks are short-lived; update nights in a consistent order)	Possible (row locks from FOR UPDATE plus waits on the exclusion constraint)
Code Complexity	Medium (retry logic for version)	High for multi-room (find N rooms, lock, insert N rows, handle failures)
Single Room	Atomic `UPDATE ... SET available_rooms = available_rooms - 1`	Simple: exclusion constraint handles it
Multi-Room	Atomic `UPDATE ... SET available_rooms = available_rooms - N`	Complex: Must find N different rooms, lock them, insert N rows

4.4 Update Operations

Operation	Schema 1	Schema 2
Create Booking	UPDATE inventory (decrement count)	INSERT room_booking (with date range)
Cancel Booking	UPDATE inventory (increment count)	DELETE room_booking or update status
Modify Dates	UPDATE multiple inventory rows	DELETE + INSERT new date range
Room Reassignment	No change needed (fungible)	DELETE + INSERT (different room)

4.5 Storage and Scalability

Important Distinction: We must separate static inventory rows from dynamic booking rows.

Metric	Schema 1	Schema 2
Static Inventory Rows	365 rows per room type per hotel (scales with booking window)	1 row per physical room (fixed)
Note on "365"	This represents a 1-year (365-day) booking window (common in hotels). The number scales linearly: 90-day window = 90 rows, 365-day window = 365 rows. This is configurable based on business needs.	Fixed regardless of booking window
Rows per Single-Room Booking	1 reservation + 1 reservation_rooms + N reservation_room_nights	1 reservation + 1 room_booking
Rows per Multi-Room Booking	1 reservation + M reservation_rooms + (M × N) reservation_room_nights (M = room types, N = nights)	1 reservation + N room_bookings (one per room)
Index Size	~2MB (B-tree, 100 hotels × 3 types × 365 days)	~200KB (GiST, but more complex)
Scalability	Static rows scale with booking window	Static rows fixed; dynamic rows scale with bookings

Example Calculation (100 hotels, 3 room types, 10 rooms per type, 365-day window):

Schema 1 Static: 100 × 3 × 365 = 109,500 inventory rows (large, but simple queries)
Schema 2 Static: 100 × 3 × 10 = 3,000 room rows (smaller, fixed)

Why 365 days? This is a typical booking window for hotels (1 year ahead). The number is configurable:

90-day window: 90 rows per room type (shorter horizon, less storage)
180-day window: 180 rows per room type (6 months ahead)
365-day window: 365 rows per room type (1 year ahead, most common)
The number scales linearly: rows = booking_window_days

Example: Booking 3 Deluxe Rooms for 2 Nights:

Schema 1 Dynamic: 1 reservation + 1 reservation_rooms + 2 reservation_room_nights = 4 rows
Schema 2 Dynamic: 1 reservation + 3 room_bookings = 4 rows (same count, but different structure)

Example: Booking 2 Deluxe + 1 Suite for 2 Nights (multiple room types):

Schema 1 Dynamic: 1 reservation + 2 reservation_rooms + (2 × 2) reservation_room_nights = 7 rows
- 1 reservation header
- 2 reservation_rooms (one for Deluxe, one for Suite)
- 4 reservation_room_nights (2 nights × 2 room types, each with potentially different rates)
Schema 2 Dynamic: 1 reservation + 3 room_bookings = 4 rows

Key Insight:

Schema 1 has more static rows (inventory table); its dynamic rows per booking grow with room types and nights, not with the number of rooms (1 reservation_rooms row per room type regardless of N)
Schema 2 has fewer static rows (rooms table) but more dynamic rows per multi-room booking (N room_bookings rows for N rooms)
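
The row counts in this subsection can be reproduced with a few lines of arithmetic. The function names below are hypothetical helpers, not part of either schema.

```python
def static_rows_schema1(hotels, room_types_per_hotel, booking_window_days):
    """Inventory rows: one per (hotel, room type, date) in the booking window."""
    return hotels * room_types_per_hotel * booking_window_days


def static_rows_schema2(hotels, room_types_per_hotel, rooms_per_type):
    """Room rows: one per physical room, independent of the booking window."""
    return hotels * room_types_per_hotel * rooms_per_type


def dynamic_rows_schema1(room_types, nights):
    """1 reservation + 1 reservation_rooms per type + 1 night row per type per night."""
    return 1 + room_types + room_types * nights


def dynamic_rows_schema2(total_rooms):
    """1 reservation + 1 room_bookings row per physical room."""
    return 1 + total_rooms


window_rows = {days: static_rows_schema1(100, 3, days) for days in (90, 180, 365)}
room_rows = static_rows_schema2(100, 3, 10)

same_type = (dynamic_rows_schema1(1, 2), dynamic_rows_schema2(3))    # 3 Deluxe, 2 nights
mixed_types = (dynamic_rows_schema1(2, 2), dynamic_rows_schema2(3))  # 2 Deluxe + 1 Suite
```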

4.6 Flexibility and Use Cases

Requirement	Schema 1	Schema 2
Room Reassignment	Easy (any room of type works)	Hard (must change room_number)
Maintenance Scheduling	Flexible (remove from inventory)	Must track per-room
Room Preferences	Not supported	Supported (per-room metadata)
Exact Room Assignment	No (assigned at check-in)	Yes (assigned at booking)
Boutique Hotels	Less suitable	More suitable
Standard Hotels	More suitable	Less suitable

5. Detailed Code Examples

5.1 Availability Check: Schema 1

def check_availability_schema1(hotel_id, room_type_id, check_in_date, check_out_date, num_rooms=1):
    """
    Check if enough rooms are available for all nights.

    Args:
        num_rooms: Number of rooms needed (default: 1)
    """
    nights = get_nights(check_in_date, check_out_date)  # [2024-06-15, 2024-06-16]

    availability = db.query("""
        SELECT date, available_rooms
        FROM inventory
        WHERE hotel_id = ? 
          AND room_type_id = ?
          AND date = ANY(?)  -- bind the list of nights as a single array parameter
          AND available_rooms >= ?
    """, hotel_id, room_type_id, nights, num_rooms)

    # Check if all nights have enough availability
    available_nights = {row.date for row in availability}
    if set(nights) != available_nights:
        missing = set(nights) - available_nights
        raise InsufficientAvailabilityError(
            f"Not enough rooms available on: {missing}. Need {num_rooms} rooms."
        )

    return True
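
The get_nights helper used above is not defined in this document. One plausible sketch, assuming the check-in date is included and the check-out date is excluded (the guest does not sleep there on the night of check-out):

```python
from datetime import date, timedelta


def get_nights(check_in_date, check_out_date):
    """Return each night of the stay: check-in inclusive, check-out exclusive."""
    if check_out_date <= check_in_date:
        raise ValueError("check_out_date must be after check_in_date")
    num_nights = (check_out_date - check_in_date).days
    return [check_in_date + timedelta(days=i) for i in range(num_nights)]


# Assumed example: check in on 2024-06-15, check out on 2024-06-17 (two nights).
nights = get_nights(date(2024, 6, 15), date(2024, 6, 17))
```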

5.2 Availability Check: Schema 2

def check_availability_schema2(hotel_id, room_type_id, check_in_date, check_out_date, num_rooms=1):
    """
    Check if enough rooms are available for the date range.

    Args:
        num_rooms: Number of rooms needed (default: 1)
    """
    available_count = db.query("""
        SELECT COUNT(*) as count
        FROM rooms r
        WHERE r.hotel_id = ? 
          AND r.room_type_id = ?
          AND NOT EXISTS (
              SELECT 1 FROM room_bookings b
              WHERE b.hotel_id = r.hotel_id
                AND b.room_number = r.room_number
                AND b.stay_dates && daterange(?, ?, '[]')
                AND b.status IN ('pending', 'confirmed', 'checked_in')  -- Include pending
          )
    """, hotel_id, room_type_id, check_in_date, check_out_date)

    if available_count.count < num_rooms:
        raise InsufficientAvailabilityError(
            f"Only {available_count.count} rooms available, need {num_rooms}"
        )

    return True

5.3 Booking Creation: Schema 2 (Full Example)

Note: See Section 3.6 for the complete implementation with both single-room and multi-room booking examples. The key difference is that Schema 2 requires significantly more complex logic for multi-room bookings compared to Schema 1's atomic UPDATE inventory SET available_rooms = available_rooms - N.

Key Points:

Single room booking: Relatively straightforward with exclusion constraint
Multi-room booking: Requires finding and locking N rooms, with retry logic if any room becomes unavailable
See Section 3.6 for the full create_booking_schema2() implementation with num_rooms parameter

6. When to Use Each Schema

Use Schema 1 (Row-Per-Day, Fungible) When:

✅ Standard hotel operations: Rooms of the same type are truly interchangeable
✅ Flexibility needed: Want to reassign rooms easily (maintenance, upgrades)
✅ Simplicity priority: Prefer straightforward queries and standard indexes
✅ High booking volume: Need to handle many concurrent bookings efficiently
✅ Per-night pricing: Different rates per night (weekend vs weekday)
✅ Most hotels: Matches how most hotels actually operate

Example Use Cases:

Large chain hotels (Marriott, Hilton)
Standard hotel rooms (not unique suites)
High-volume booking systems

Use Schema 2 (GiST, Individual Rooms) When:

✅ Unique rooms: Rooms are non-fungible (boutique hotels, unique suites)
✅ Exact room assignment: Need to know which specific room guest will get
✅ Room-level tracking: Need to track maintenance, preferences per room
✅ Database-enforced integrity: Want strongest guarantees against double-booking
✅ Event venues: Similar to fixed-seat event booking (non-fungible)
✅ Lower booking volume: Can handle more complex queries

Example Use Cases:

Boutique hotels with unique rooms
Luxury resorts with distinct suites
Event venues (similar pattern)
Hotels where guests request specific rooms

Interview Tidbit: When to Use GiST Indexes

Use GiST when a row-per-entry approach is not scalable or not possible.

Classic Example: Calendar Meeting Booking

Imagine building a calendar system where users can book meetings. You need to check for time slot overlaps (e.g., "Is 2:00 PM - 3:00 PM available?").

❌ Row-Per-Minute Approach (Not Scalable):

-- Bad: One row per minute for every possible time slot
CREATE TABLE time_slots (
    date DATE,
    minute INT,  -- 0 to 1439 (minutes in a day)
    is_booked BOOLEAN
);
-- Problem: 1,440 rows per day × 365 days = 525,600 rows per year per resource
-- For 1000 resources: 525 million rows per year!

✅ GiST Range Approach (Scalable):

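-- Good: One row per booking with a time range
-- btree_gist is required so the GiST exclusion constraint can use = on resource_id
CREATE EXTENSION IF NOT EXISTS btree_gist;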
CREATE TABLE meetings (
    meeting_id UUID PRIMARY KEY,
    resource_id VARCHAR(50),
    meeting_time TSRANGE,  -- Time range: [start_time, end_time)
    EXCLUDE USING GIST (resource_id WITH =, meeting_time WITH &&)
);
-- One row per actual booking - scales with usage, not time granularity

Key Insight:

Row-per-entry: Scales with time granularity (minutes, seconds) × duration × resources → multiplicative growth, regardless of how many bookings actually exist
GiST ranges: Scales with actual bookings → linear growth
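
The overlap test behind the && operator on half-open ranges is small enough to sketch directly. This illustration uses a plain linear scan per resource, not a GiST index, and the Calendar class and its method names are hypothetical; it only shows the rule the exclusion constraint enforces.

```python
from datetime import datetime


def ranges_overlap(start_a, end_a, start_b, end_b):
    """True when half-open ranges [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


class Calendar:
    """One stored entry per booking, mirroring the meetings table."""

    def __init__(self):
        self.meetings = {}  # resource_id -> list of (start, end)

    def book(self, resource_id, start, end):
        if start >= end:
            raise ValueError("start must be before end")
        existing = self.meetings.setdefault(resource_id, [])
        # Reject the booking if it overlaps any existing one for this resource.
        if any(ranges_overlap(start, end, s, e) for s, e in existing):
            return False
        existing.append((start, end))
        return True


cal = Calendar()
two_pm = datetime(2024, 6, 15, 14, 0)
half_past_two = datetime(2024, 6, 15, 14, 30)
three_pm = datetime(2024, 6, 15, 15, 0)
four_pm = datetime(2024, 6, 15, 16, 0)

booked = cal.book("room-a", two_pm, three_pm)
clash = cal.book("room-a", half_past_two, four_pm)    # overlaps 2:00-3:00
back_to_back = cal.book("room-a", three_pm, four_pm)  # starts exactly at the end
other_room = cal.book("room-b", two_pm, three_pm)     # different resource
```

Because the ranges are half-open, a meeting that starts exactly when another ends does not conflict.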

Other Use Cases for GiST:

Resource booking: Conference rooms, equipment, vehicles
IP address ranges: Network allocation, geolocation
Geographic data: Spatial queries, map boundaries
Time-series overlaps: Scheduling, availability windows

7. Recommendation

For most hotel booking systems, Schema 1 (Row-Per-Day, Fungible) is recommended because:

Matches real-world operations: Most hotels treat rooms of the same type as interchangeable
Simpler queries: Direct date lookups are easier to understand and optimize
Predictable performance: B-tree indexes are well understood and perform very well for per-date lookups
More flexible: Can reassign rooms without database changes
Proven pattern: Widely used in production systems

Use Schema 2 (GiST, Individual Rooms) only if:

You have unique, non-fungible rooms
You need exact room assignment at booking time
You're building an event venue system (where this design excels)
Room-level tracking is critical

Note: If you need Schema 1's fungible inventory but also want to guarantee no room switching after check-in, see Section 2.6 for a hybrid approach that adds a room assignment table with GiST constraints.

Scaling Beyond a Single Database

If a single database instance cannot sustain throughput/latency requirements:

Vertical scale + read replicas:
- Keep inventory, reservations, reservation_rooms, and reservation_room_nights on the primary for strong consistency.
- Serve availability/search queries from read replicas (or a Redis cache) to offload the primary.
Logical partitioning (sharding):
- Shard by hotel_id (most queries are scoped to a hotel).
- Each shard owns its own inventory, reservations, room_assignments, etc.
- Global services (payments, user accounts) remain shared; booking services route to the correct shard via a metadata service.
Regional replication:
- Deploy per-region databases (still sharded by hotel) with asynchronous replication for cross-region disaster recovery.
- Keep writes region-local; replicate summary events to a central analytics store.
Streaming + caches:
- Publish inventory delta events to Kafka/PubSub.
- Materialize availability views in Redis/Elastic for search so that the OLTP database handles only booking-critical writes.
- Ensure cache invalidation by listening to the same event stream.
Background reconciliation:
- Periodically reconcile per-shard inventory totals against master room counts (or via nightly jobs) to catch drift.
- Use idempotent booking operations so retries across shards are safe.

These patterns let you grow from a single-node deployment to multi-shard, multi-region topologies without rewriting the core schemas.
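
As a rough illustration of routing by hotel_id, the sketch below maps a hotel to a shard with a stable hash. The shard names and the shard_for_hotel function are assumptions; a real deployment would more likely look the shard up in the metadata service mentioned above, since plain modulo hashing remaps most hotels whenever the shard count changes.

```python
import hashlib

# Hypothetical shard identifiers.
SHARDS = ["bookings-shard-0", "bookings-shard-1", "bookings-shard-2", "bookings-shard-3"]


def shard_for_hotel(hotel_id, shards=SHARDS):
    """Map a hotel to one shard using a hash that is stable across processes."""
    digest = hashlib.sha256(str(hotel_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# Every query for the same hotel lands on the same shard.
shard_a = shard_for_hotel(123)
shard_b = shard_for_hotel(123)
```

Python's built-in hash() is avoided here because string hashing is randomized per process, which would route the same hotel differently after a restart.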

8. Summary

Decision Factor	Winner
Simplicity	Schema 1 (Row-Per-Day)
Flexibility	Schema 1 (Row-Per-Day)
Database-Enforced Integrity	Schema 2 (GiST)
Exact Room Assignment	Schema 2 (GiST)
Query Performance	Tie (depends on query pattern)
Concurrency Control (Single Room)	Schema 2 (GiST) - simpler
Concurrency Control (Multi-Room)	Schema 1 (atomic decrement) - much simpler
Static Rows	Schema 2 (fewer static rows)
Dynamic Rows (Multi-Room)	Schema 1 (1 row per booking vs N rows)
Multi-Room Booking Complexity	Schema 1 (atomic decrement vs find/lock N rooms)
Most Hotels	Schema 1 (Row-Per-Day)

Final Recommendation: Start with Schema 1 (Row-Per-Day, Fungible) for most hotel booking systems. Consider Schema 2 (GiST, Individual Rooms) only if you have specific requirements for exact room assignment or unique, non-fungible rooms.

Hotel Search System Design

Sumedh Bala — Wed, 19 Nov 2025 20:05:19 +0000

1. Introduction

Key Technical Topics Covered (Interview Highlights)

Hybrid OLAP/OLTP architecture (Sections 1 & 2): Elasticsearch for multi-dimensional search + Redis for real-time availability (and why neither alone is sufficient).
Transactional Outbox + CDC (Section 2.2): Keeping Elasticsearch in sync with the primary DB without dual writes.
Elasticsearch modeling (Sections 3 & 4): Denormalized hotel/room documents, geo fields, amenity facets, script scoring, and index aliasing for zero-downtime reindex.
Redis availability design (Section 6): Per-night bitmaps / sorted sets, multi-night intersections, TTL/eviction strategies, and consistency with bookings.
Query pipeline (Section 7): Coordinated ES + Redis flow with fallbacks when availability or index data diverges.
Advanced filtering & ranking (Section 5): Amenity faceting, price buckets, popularity boosts, personalization hooks.
Scaling patterns (Sections 2 & 7): Multi-region clusters, hot/warm node tiers, cache layers, snapshot/alias-based reindexing. (Cross-region replication—e.g., keeping Redis regional while asynchronously syncing aggregates or using multi-region Elasticsearch—follows the same patterns but is out of scope for this doc.)

Prerequisite Reading (Shared Components)

This hotel search design builds on the same platform described in @sumedhbala’s ticketing series:

From those posts we inherit the core stack: PostgreSQL + transactional outbox, Debezium/Kafka pipelines, Elasticsearch clusters, Redis caches, and Stripe-style payment flows. This document focuses on how those building blocks are reconfigured for hotel-specific search, availability, and ranking concerns.

Hotel Search Challenges

Hotel search and booking differ sharply from event search. Two fundamental differences drive the architectural complexity:

1. Availability Integration in Search:

Real-time Availability Validation: Availability must be checked during the search phase, not after results are displayed
Date Range Queries: Users search for stays spanning multiple nights (check-in to check-out), requiring multi-date availability validation
Recurring Availability: Same rooms available daily, unlike one-time events
Real-time Updates: Availability changes constantly as bookings are made, requiring sub-millisecond response times

2. Extensive Filter Requirements:

Multiple Amenity Filters: Users expect to filter by numerous amenities (pool, spa, beachfront, gym, restaurant, concierge, parking, pet-friendly, etc.)
Room Type Complexity: Multiple room categories (Standard, Deluxe, Suite) with different capacities and room-level amenities
Complex Filter Combinations: Location, amenities, price range, room type, guest count, and date combinations all working together

Additional challenges include:

Fungible Inventory: Room 101 and Room 102 of the same type are interchangeable
Dynamic Pricing: Prices vary by date, season, demand, and length of stay

Key Differences from Event Search

Aspect	Event Search	Hotel Search
Inventory Model	Fixed seats, non-fungible	Room types, fungible inventory
Availability	One-time events	Recurring daily availability
Date Handling	Single event date	Date ranges (check-in to check-out)
Pricing	Dynamic per seat/section (demand-based)	Dynamic per room type and date (seasonal, demand-based)
Search Complexity	Simple (venue, date, category)	Complex (location, dates, room type, multiple amenity filters: pool, spa, beachfront, gym, etc.)
Availability in Search	Checked after displaying events	Must be part of search criteria

Hybrid Architecture Overview: Mandatory OLAP/OLTP Separation

Terminology refresher: OLAP (Online Analytical Processing) handles read-heavy, multi-dimensional queries (e.g., Elasticsearch), while OLTP (Online Transaction Processing) powers write-heavy, transactional workloads (e.g., Redis + primary DB). The rest of this document uses these terms frequently.

Hotel search systems face two fundamentally contradictory workloads that cannot be solved by a single system:

1. OLAP Workload (Analytical Search):

Complex, read-intensive, multi-dimensional queries
Example: "Find all 4-star hotels in New York with pool and free breakfast for 2 adults and 2 children from 2024-10-26 to 2024-10-30, sorted by review score"
Optimized for: High-speed reads of large datasets, complex filtering, full-text search
Technology: Elasticsearch (built on Apache Lucene, designed for OLAP)

2. OLTP Workload (Transactional Availability):

High-frequency, simple, isolated transactions
Example: "Decrement room count for Hotel 123, room type 'King', on date 2024-10-26"
Optimized for: High-throughput, low-latency updates, absolute data integrity
Technology: Redis (in-memory database designed for OLTP)

The Architectural Mandate:

The dual-system architecture (Elasticsearch + Redis) is not optional—it is mandatory because:

OLTP optimizes for write efficiency (small, fast, atomic updates)
OLAP optimizes for read efficiency (complex, large-scale, multi-dimensional queries)
These optimization goals are directly contradictory

Attempting to use Elasticsearch for OLTP availability updates would fail catastrophically due to its "update penalty" (see Section 3.2). Conversely, a relational database used for OLAP search struggles to run complex, multi-dimensional, full-text queries efficiently.

Key Takeaway: The combination of OLAP (Elasticsearch) and OLTP (Redis) systems is not merely the "best" solution—it is the only viable one for a business requiring both robust transaction processing and powerful data analysis.

2. Baseline Functional Solution

Services

Search API Service – Handles user queries, coordinates Elasticsearch and Redis operations, returns ranked results with availability
Availability Service – Manages Redis availability data, handles booking updates, ensures data consistency
Hotel Management Service – Source of truth for hotel metadata (writes to primary database)

Note: Data synchronization from the primary database to Elasticsearch is handled by an event-driven pipeline (see Section 2.2: Data Synchronization Architecture)

Data Stores

Elasticsearch Cluster – Search index for hotels, room types, amenities, location data (includes built-in query caching)
Redis Cluster (or AWS ElastiCache) – Real-time availability data exclusively
Primary Database (PostgreSQL/MySQL) – Hotel metadata, room configurations, pricing rules

Note: Redis is used exclusively for availability data. Search result caching is handled by Elasticsearch's built-in query cache or application-level caching mechanisms. (This is explained in detail in Section 6.)

High-Level Architecture

The system follows a three-tier approach:

Search Tier: Elasticsearch handles complex queries (location, amenities, text search)
Availability Tier: Redis provides real-time availability validation
Data Tier: Primary database maintains canonical hotel information

Soundbite for interviews: "Hotel search requires a hybrid architecture because Elasticsearch excels at complex filtering but struggles with real-time availability updates, while Redis provides sub-millisecond availability checks but lacks sophisticated search capabilities."

Data Synchronization Architecture: Transactional Outbox + Change Data Capture (CDC)

At a glance, data flows DB → Outbox → CDC/Debezium → Kafka → Elasticsearch. The subsections below break this down so readers can skim the summary and dive deeper only if needed.

The Dual-Write Anti-Pattern Problem:

A naive approach would use an "Indexing Service" that attempts to write to both the database and Elasticsearch synchronously:

Write A: Update PostgreSQL (hotel name change)
Write B: Update Elasticsearch (hotel name change)

This is a well-known anti-pattern because it cannot guarantee atomic execution across two independent systems. Data inconsistency is not merely a risk; over time and at scale it is a certainty.

Failure Scenarios:

Partial Failure (Stale Search Index): Database update succeeds, Elasticsearch update fails → Search index permanently stale
Partial Failure (Phantom Data): Elasticsearch update succeeds, database update fails → Hotel searchable but doesn't exist in system of record
Race Conditions: Concurrent updates arrive out of order → Permanent data corruption

The Solution: Transactional Outbox Pattern + CDC

The robust solution decouples writes using an asynchronous, event-driven architecture:

1. Transactional Outbox Pattern:

Create an Outbox_Events table in the primary PostgreSQL database
When updating hotel data, perform one atomic database transaction:

  BEGIN TRANSACTION;
  -- Business data write
  UPDATE Hotels SET name = 'New Name' WHERE id = 123;
  -- Event write (atomic within same transaction)
  INSERT INTO Outbox_Events (payload) VALUES ('{"id": 123, "name": "New Name"}');
  COMMIT;

This guarantees atomicity: if UPDATE fails, INSERT is rolled back; if INSERT fails, UPDATE is rolled back
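
The all-or-nothing behavior can be demonstrated with a small runnable sketch. It uses Python's built-in sqlite3 instead of PostgreSQL so it needs no server, and the rename_hotel helper is hypothetical; the failing case passes a NULL payload on purpose to force the event insert to fail after the UPDATE has already run.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hotels (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox_events ("
    "event_id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
)
conn.execute("INSERT INTO hotels (id, name) VALUES (123, 'Old Name')")
conn.commit()


def rename_hotel(conn, hotel_id, new_name, payload):
    """Write the business change and its outbox event in one transaction."""
    try:
        with conn:  # commits on success, rolls back if either statement raises
            conn.execute("UPDATE hotels SET name = ? WHERE id = ?", (new_name, hotel_id))
            conn.execute("INSERT INTO outbox_events (payload) VALUES (?)", (payload,))
        return True
    except sqlite3.IntegrityError:
        return False


ok = rename_hotel(conn, 123, "New Name", json.dumps({"id": 123, "name": "New Name"}))
# Force the event insert to fail; the preceding UPDATE must be rolled back too.
failed = rename_hotel(conn, 123, "Broken Name", None)

name = conn.execute("SELECT name FROM hotels WHERE id = 123").fetchone()[0]
event_count = conn.execute("SELECT COUNT(*) FROM outbox_events").fetchone()[0]
```

After the failed call the hotel still has the name from the successful transaction and the outbox holds exactly one event, so the table and the event log never disagree.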

2. Change Data Capture (CDC):

Use a log-based CDC tool (e.g., Debezium) that reads PostgreSQL's Write-Ahead Log (WAL)
Debezium sees the committed event in Outbox_Events and publishes it to Apache Kafka
This is non-intrusive and low-overhead (reads transaction log, doesn't query database)

3. Event-Driven Pipeline:

PostgreSQL (System of Record)
  ↓ (WAL)
Debezium (CDC Source Connector)
  ↓ (Kafka message)
Apache Kafka (Event Bus - provides resilience and buffering)
  ↓ (Kafka message)
Kafka Connect (Elasticsearch Sink Connector)
  ↓ (index operation)
Elasticsearch (Search Index)

Key Benefits:

Data Integrity: Only committed data is ever published
Resilience: If Elasticsearch is down, events buffer in Kafka with no data loss
Decoupling: Database and Elasticsearch are fully independent
Scalability: Each component can scale independently
Eventual Consistency: Tunable lag (typically seconds to low minutes) is the correct trade-off for high-performance distributed systems

Configuration Requirements:

PostgreSQL: Enable logical replication (wal_level = logical)
Debezium: Configure to read only from Outbox_Events table
Kafka: Long retention period (e.g., 7 days) for disaster recovery
Elasticsearch Sink: Configure for idempotency (key.ignore: false, so the Kafka record key becomes the document ID) and delete handling (behavior.on.null.values: delete)
Dead Letter Queue (DLQ): Route failed messages to DLQ to prevent pipeline halting
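
Why keying documents by the record key makes the sink idempotent can be shown with a toy model. In this sketch a plain dict stands in for the Elasticsearch index, apply_event is a hypothetical function rather than any connector API, and a None value plays the role of a Kafka tombstone.

```python
def apply_event(index, key, value):
    """Apply one Kafka-style record to a dict standing in for the search index."""
    if value is None:
        index.pop(key, None)  # tombstone: delete, and tolerate repeated deletes
    else:
        index[key] = value    # upsert: the record key is the document ID
    return index


events = [
    ("hotel-123", {"name": "Old Name"}),
    ("hotel-123", {"name": "New Name"}),
    ("hotel-456", {"name": "Seaside Inn"}),
    ("hotel-456", None),  # hotel removed from the system of record
]

index_once = {}
for key, value in events:
    apply_event(index_once, key, value)

# Simulate redelivery after a connector restart: the whole batch arrives twice.
index_replayed = {}
for key, value in events + events:
    apply_event(index_replayed, key, value)
```

Both indexes end in the same state, which is why at-least-once delivery from Kafka is acceptable for this pipeline.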

Key Takeaway: The Transactional Outbox + CDC pattern eliminates the dual-write anti-pattern, providing a production-grade, resilient architecture for data synchronization that guarantees data integrity and system availability.

3. Elasticsearch Document Structure

Availability Data Structure Options

Quick overview: We evaluate four availability strategies. Use this summary to choose which option details to read.

Keep all availability in the primary database + Redis (Elasticsearch remains metadata-only).
Embed detailed availability inside Elasticsearch documents.
Hybrid flag in Elasticsearch plus full detail in Redis.
Maintain a separate availability-focused Elasticsearch index.

Option 1: Availability Only in Primary Database and Redis

Elasticsearch: No availability data stored
Redis: Real-time availability for all date ranges
Primary Database: Source of truth for availability

Pros:

Single Source of Truth: No data synchronization complexity
Always Accurate: No stale availability data in Elasticsearch
Simpler Architecture: Fewer moving parts, less to maintain
Real-time Guarantees: Redis provides sub-millisecond updates

Cons:

Post-Filtering Required: Must filter Elasticsearch results with Redis availability check
Two-Step Process: Elasticsearch search → Redis availability check
Slightly Higher Latency: Additional Redis lookup after search
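
The two-step flow in Option 1 can be sketched as a post-filter over the search results. Everything here is illustrative: the availability dict stands in for Redis, and filter_available, stay_nights, and the key layout (hotel_id, room_type, night) are assumptions rather than the design described later in this document.

```python
from datetime import date, timedelta

# Stand-in for Redis: (hotel_id, room_type, night) -> rooms left.
availability = {
    (1, "king", date(2024, 6, 15)): 3,
    (1, "king", date(2024, 6, 16)): 2,
    (2, "king", date(2024, 6, 15)): 1,
    (2, "king", date(2024, 6, 16)): 0,
    (3, "king", date(2024, 6, 15)): 4,  # no entry for June 16
}


def stay_nights(check_in, check_out):
    """Nights of the stay: check-in inclusive, check-out exclusive."""
    return [check_in + timedelta(days=i) for i in range((check_out - check_in).days)]


def filter_available(candidate_hotel_ids, room_type, check_in, check_out, num_rooms):
    """Keep hotels with enough rooms on every night, preserving search ranking order."""
    nights = stay_nights(check_in, check_out)
    return [
        hotel_id
        for hotel_id in candidate_hotel_ids
        if all(
            availability.get((hotel_id, room_type, night), 0) >= num_rooms
            for night in nights
        )
    ]


ranked_candidates = [3, 1, 2]  # order as returned by the search step
two_night_stay = filter_available(
    ranked_candidates, "king", date(2024, 6, 15), date(2024, 6, 17), 1
)
one_night_stay = filter_available(
    ranked_candidates, "king", date(2024, 6, 15), date(2024, 6, 16), 1
)
```

A missing key is treated as zero availability, so a hotel with incomplete data is dropped rather than shown as bookable.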

Option 2: Nested Availability in Elasticsearch

Availability data embedded in room type documents
Good for simple queries, limited scalability for date ranges

Pros:

Filtering in Search Phase: Can filter by availability during Elasticsearch query
Single Query: All filtering happens in one place

Cons:

Data Synchronization: Must keep Elasticsearch in sync with DB
Stale Data Risk: Availability in Elasticsearch may be outdated
Update Complexity: Every booking requires Elasticsearch update
Index Size: Large index with date-based availability fields

Option 3: Hybrid Approach

Elasticsearch: Summary availability flag (e.g., "has_availability_next_30_days: true")
Redis: Detailed real-time availability for specific dates
Primary Database: Source of truth

Pros:

Initial Filtering: Elasticsearch can filter out hotels with no availability
Real-time Accuracy: Redis provides precise availability for date ranges
Reduced Load: Fewer hotels passed to Redis for availability check
Best of Both Worlds: Search performance + real-time accuracy

Cons:

Dual Maintenance: Must update both Elasticsearch summary and Redis details
Complexity: More components to manage

Option 4: Separate Availability Index in Elasticsearch

Dedicated index for availability data
Better for complex date range queries, requires joins

Pros:

Complex Queries: Can handle sophisticated date range queries
Separation of Concerns: Availability data separate from hotel metadata

Cons:

Join Overhead: Requires joining availability index with hotel index
Synchronization Complexity: Multiple indexes to keep in sync
Performance: Joins are slower than single-index queries

Recommendation: Option 1 (Redis/DB Only)

Why Option 1 is the Default Choice:

Availability changes frequently - Every booking updates availability in real-time
Redis is already optimized - Sub-millisecond lookups, perfect for real-time data
Simplicity wins - No sync complexity, no stale data risk
Performance is acceptable - The Redis check runs after the Elasticsearch query, and the per-hotel lookups can be issued in parallel or pipelined, so the added latency stays small
Single source of truth - Database is the authority, Redis is a performance cache (see Section 6 for database authority details)

Why Elasticsearch Cannot Handle OLTP Availability Updates: The "Update Penalty"

The Fundamental Problem: Elasticsearch's Immutable Segment Architecture

Elasticsearch is built on Apache Lucene, which stores data in immutable segments on disk. This design has a critical consequence: Elasticsearch does not support in-place updates or deletes.

How "Updates" Actually Work in Elasticsearch:

When a document is "updated" (e.g., changing num_available_rooms from 10 to 9), Elasticsearch performs two operations:

Soft Delete: Marks the old document (with 10 rooms) as "deleted"
Re-index: Indexes a brand new document (with 9 rooms) into a new, small segment

This means a simple transactional counter change becomes a full document re-index, which is CPU-intensive and requires the entire document to be processed and analyzed again.

The Compounding Cost: Segment Merging

High-frequency updates create thousands of tiny segments and massive numbers of soft-deleted documents. This is catastrophic for search performance because:

Elasticsearch must open and query every tiny segment
CPU cycles are wasted filtering out soft-deleted documents
To combat this, Lucene runs a background segment merging process

Segment Merging Overhead:

Reads small segments
Filters out soft-deleted documents
Merges remaining "live" documents into new, larger segments
Extremely resource-intensive: Consumes substantial CPU, disk I/O, and memory
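
The update and merge behaviour described above can be illustrated with a toy model. This is a deliberately simplified sketch, not how Lucene is implemented: the `ToyIndex` class and its methods are hypothetical names, and real segments track deletions in a separate live-docs structure rather than a flag on the entry.

```python
class ToyIndex:
    """Toy model of append-only segments (illustration only, not Lucene)."""

    def __init__(self):
        self.segments = []  # each segment is a list of [doc_id, value, live]

    def index(self, doc_id, value):
        # An "update" first soft-deletes any live copy of the document...
        for segment in self.segments:
            for entry in segment:
                if entry[0] == doc_id and entry[2]:
                    entry[2] = False
        # ...then writes the new version into a fresh, small segment.
        self.segments.append([[doc_id, value, True]])

    def merge(self):
        # Merging copies only the live documents into one larger segment.
        live = [e for seg in self.segments for e in seg if e[2]]
        self.segments = [live]

    def stats(self):
        """Return (segment_count, soft_deleted_count)."""
        total = sum(len(seg) for seg in self.segments)
        live = sum(1 for seg in self.segments for e in seg if e[2])
        return len(self.segments), total - live


idx = ToyIndex()
for rooms in range(10, 5, -1):  # availability drops 10, 9, 8, 7, 6
    idx.index("hotel_123", rooms)

before = idx.stats()  # one segment per write, all but one copy soft-deleted
idx.merge()
after = idx.stats()
print(before, after)
```

Five writes to the same document leave five segments and four dead copies until a merge rewrites the live data, which is the background cost the list above describes.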

The "Update Penalty" Impact:

If Elasticsearch were used for the OLTP availability workload:

High-frequency updates (thousands per minute) would trigger constant, aggressive segment merging
Intense background CPU and I/O activity would compete directly with primary OLAP search queries
Result: Increased query latency and system instability

This is why using Elasticsearch for frequently updated, mutable data is a well-known architectural anti-pattern. The OLTP availability workload must be handled by a system designed for high-throughput, low-latency updates—Redis.

When to Consider Option 3 (Hybrid with Summary) - Edge Cases Only:

Very high query volume - When you need to reduce Redis load by pre-filtering (rare)
Large candidate sets - When Elasticsearch returns 10,000+ hotels and you want to reduce Redis checks (unusual)
Geographic filtering - When initial geographic filter yields too many results (can be handled with better Elasticsearch queries)

Note: For most hotel search systems, Option 1 (Redis/DB Only) is sufficient and preferred due to its simplicity and real-time accuracy.

Architecture Pattern (Option 1 - Recommended):

Elasticsearch: Filters by location, amenities, room types, price (no availability)
Redis: Validates availability for filtered hotels (date range check)
Result: Only hotels with availability are returned
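
A minimal sketch of this two-step pattern, assuming plain Python stand-ins for both systems: `candidates` plays the role of the Elasticsearch result and the `availability` dict plays the role of Redis. The function names and sample data are hypothetical.

```python
from datetime import date, timedelta


def stay_nights(check_in, check_out):
    """Nights occupied by a stay; the check-out date itself is not a night."""
    return [(check_in + timedelta(days=i)).isoformat()
            for i in range((check_out - check_in).days)]


def filter_available(candidates, availability, check_in, check_out):
    """Keep (hotel_id, room_type_id) pairs with a room free on every night."""
    nights = stay_nights(check_in, check_out)
    return [
        pair for pair in candidates
        if all(availability.get(pair, {}).get(night, 0) > 0 for night in nights)
    ]


# Step 1 (stand-in): Elasticsearch returned these candidates.
candidates = [("hotel_123", "deluxe_001"), ("hotel_456", "suite_ocean_001")]

# Step 2 (stand-in): per-night room counts as they might be held in Redis.
availability = {
    ("hotel_123", "deluxe_001"): {"2024-06-15": 5, "2024-06-16": 3},
    ("hotel_456", "suite_ocean_001"): {"2024-06-15": 2, "2024-06-16": 0},
}

result = filter_available(candidates, availability,
                          date(2024, 6, 15), date(2024, 6, 17))
print(result)
```

The second hotel is dropped because one night of the stay has no rooms left, even though it passed every search filter.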

Performance Impact:

Elasticsearch search: 25ms (filters by location, amenities, etc.)
Redis availability check: 10ms (checks availability for filtered hotels)
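Total: about 35ms (25ms + 10ms; the Redis step runs after Elasticsearch returns candidates, with the per-hotel lookups issued in parallel)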

Key Takeaway: For most hotel search systems, keeping availability only in Redis and the primary database is the recommended approach. It provides simplicity, real-time accuracy, and acceptable performance. Only use Elasticsearch for availability if you have very high query volumes and need to reduce Redis load through pre-filtering.

Hotel Document Schema

Hotel documents in Elasticsearch use nested structures to represent the complex relationships between hotels and their room types:

Recommended Approach (Option 1 - Redis/DB Only):

Hotel Document:
├── Basic Information (hotel_id, name, location, rating)
├── Amenities (pool, spa, gym, restaurant)
├── Location Data (geo_point, address, city, country)
└── Room Types (nested array)
    ├── Room Type 1 (room_type_id: "deluxe_001", type: "deluxe", capacity: 2, base_price: 299, amenities: ["wifi", "tv"])
    ├── Room Type 2 (room_type_id: "suite_garden_001", type: "suite", view: "garden", capacity: 4, base_price: 599, amenities: ["wifi", "tv", "balcony"])
    └── Room Type 3 (room_type_id: "suite_ocean_001", type: "suite", view: "ocean", capacity: 4, base_price: 799, amenities: ["wifi", "tv", "balcony", "ocean_view"])

Note: Room types can have variations (e.g., "suite with garden view" vs "suite with ocean view"). Each variation has a unique room_type_id used for availability tracking in Redis. (See "Important: Room Type ID in All Keys" in Section 6 for detailed explanation.)

Note: No availability data in Elasticsearch. Availability is checked in Redis after Elasticsearch filtering.

Alternative Approach (Option 3 - Hybrid with Summary):

Hotel Document:
├── Basic Information (hotel_id, name, location, rating)
├── Amenities (pool, spa, gym, restaurant)
├── Location Data (geo_point, address, city, country)
├── Room Types (nested array)
│   ├── Room Type 1 (room_type_id: "deluxe_001", type: "deluxe", capacity: 2, base_price: 299, amenities: ["wifi", "tv"])
│   ├── Room Type 2 (room_type_id: "suite_garden_001", type: "suite", view: "garden", capacity: 4, base_price: 599, amenities: ["wifi", "tv", "balcony"])
│   └── Room Type 3 (room_type_id: "suite_ocean_001", type: "suite", view: "ocean", capacity: 4, base_price: 799, amenities: ["wifi", "tv", "balcony", "ocean_view"])
└── Availability Summary (hotel-level)
    ├── has_availability_next_30_days: true/false
    ├── price_range: {min: 199, max: 599}
    └── last_updated: timestamp

What is Availability Summary? (Only for Option 3 - Hybrid Approach)

If using the hybrid approach, the Availability Summary contains aggregated availability information used for initial filtering:

Fields:

has_availability_next_30_days: Boolean flag indicating if hotel has any availability in the next 30 days
- Purpose: Quickly filter out hotels with no availability
- Example: true means hotel has at least one room available in next 30 days
- Updated: Periodically (e.g., every 15 minutes or when significant availability changes)
price_range: Minimum and maximum prices across all available rooms
- Purpose: Pre-filter by price range before detailed Redis check
- Example: {min: 199, max: 599} means cheapest room is $199, most expensive is $599
- Updated: When room prices change significantly
last_updated: Timestamp of when summary was last refreshed
- Purpose: Track data freshness, implement cache invalidation
- Example: 2024-06-15T10:30:00Z
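
One way the three summary fields could be derived from per-night counts is sketched below. The `build_summary` function and its input shapes are assumptions made for illustration; the article does not prescribe how the summary is computed.

```python
from datetime import date, datetime, timedelta, timezone


def build_summary(per_night_rooms, prices, today, now):
    """Aggregate per-night room counts into the coarse summary fields.

    per_night_rooms: {room_type_id: {"YYYY-MM-DD": rooms_left}}
    prices:          {room_type_id: base_price}
    """
    window = {(today + timedelta(days=i)).isoformat() for i in range(30)}
    open_types = [
        room_type for room_type, nights in per_night_rooms.items()
        if any(night in window and rooms > 0 for night, rooms in nights.items())
    ]
    summary = {
        "has_availability_next_30_days": bool(open_types),
        "last_updated": now.isoformat(),
    }
    if open_types:
        open_prices = [prices[room_type] for room_type in open_types]
        summary["price_range"] = {"min": min(open_prices), "max": max(open_prices)}
    return summary


summary = build_summary(
    per_night_rooms={
        "deluxe_001": {"2024-06-16": 0},        # sold out
        "suite_garden_001": {"2024-06-20": 2},  # still bookable
    },
    prices={"deluxe_001": 299, "suite_garden_001": 599},
    today=date(2024, 6, 15),
    now=datetime(2024, 6, 15, 10, 30, tzinfo=timezone.utc),
)
print(summary)
```

Only room types with at least one open night contribute to `price_range`, which matches the "across all available rooms" wording above.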

Important Notes:

Not Real-Time: Summary is updated periodically, not on every booking
Coarse Filtering: Used to eliminate hotels with no availability, not for final availability check
Redis Still Required: Detailed availability check still happens in Redis
Trade-off: Summary reduces Redis load but adds sync complexity

When Availability Summary is Updated:

Periodic refresh (every 15-30 minutes)
When availability drops to zero (no rooms available)
When availability returns after being zero
On-demand refresh triggered by significant booking events

Understanding Nested Structures (Beginner-Friendly)

What is a Nested Structure?

Imagine you have a hotel document. Without nested structures, you might store room types like this:

Hotel: "Luxury Hotel"
Room Types: ["deluxe", "suite", "standard"]
Room Prices: [299, 599, 199]

Problem: If you search for hotels with "deluxe" rooms under $300, Elasticsearch might match this hotel incorrectly. Why? Because it sees "deluxe" (✓) and "$199" (✓) in the arrays, but doesn't know that "$199" belongs to "standard", not "deluxe". It treats all array items as independent.

Solution - Nested Structures:

With nested structures, each room type (including variations) is stored as a separate "mini-document" inside the hotel document:

Hotel: "Luxury Hotel"
Room Types (nested):
  - Room Type 1: {room_type_id: "deluxe_001", type: "deluxe", price: 299, capacity: 2}
  - Room Type 2: {room_type_id: "suite_garden_001", type: "suite", view: "garden", price: 599, capacity: 4}
  - Room Type 3: {room_type_id: "suite_ocean_001", type: "suite", view: "ocean", price: 799, capacity: 4}

Now when you search for "deluxe rooms under $300", Elasticsearch checks each room type as a complete unit. It finds "deluxe" with price "$299" as a matched pair, which is correct.
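
The difference can be reproduced with a small simulation. This is a sketch of the matching semantics only, not of Elasticsearch internals, and both function names are hypothetical. A $250 limit is used because it makes the false match visible with the prices from the example.

```python
hotel = {
    "name": "Luxury Hotel",
    "room_types": [
        {"type": "deluxe", "price": 299},
        {"type": "suite", "price": 599},
        {"type": "standard", "price": 199},
    ],
}


def flattened_match(hotel, room_type, max_price):
    """Plain-array semantics: each field is matched independently."""
    types = [rt["type"] for rt in hotel["room_types"]]
    prices = [rt["price"] for rt in hotel["room_types"]]
    return room_type in types and any(p < max_price for p in prices)


def nested_match(hotel, room_type, max_price):
    """Nested semantics: both conditions must hold on the same room type."""
    return any(rt["type"] == room_type and rt["price"] < max_price
               for rt in hotel["room_types"])


print(flattened_match(hotel, "deluxe", 250))  # matches via the standard room's price
print(nested_match(hotel, "deluxe", 250))     # no deluxe room is under 250
```

The flattened check pairs "deluxe" with the standard room's $199, while the nested check rejects the hotel because no single room type satisfies both conditions.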

Room Type Variations:
The same room type category (e.g., "suite") can have multiple variations with different amenities or views:

"suite with garden view" (room_type_id: "suite_garden_001")
"suite with ocean view" (room_type_id: "suite_ocean_001")

Each variation has its own unique room_type_id used for availability tracking in Redis, allowing separate availability management for each variation. (See Section 6 for how room_type_id is used in Redis keys.)

Real-World Analogy:
Think of a hotel document as a filing cabinet. Without nesting, all room information is in one big drawer mixed together. With nesting, each room type has its own folder in the drawer, keeping related information together.

How Does It Help?

Accurate Filtering: When filtering by "deluxe rooms under $300", you get hotels that actually have deluxe rooms at that price, not hotels that have a deluxe room AND a cheaper room (but unrelated).
Independent Queries: You can ask "show me hotels where the suite has ocean view but the standard room doesn't" - nested structures allow you to query each room type independently.
Better Performance: Elasticsearch can optimize queries better because it knows which fields belong together, reducing false matches.

Why Nested Model Over Parent-Child: Optimizing for OLAP Reads

Elasticsearch offers two options for modeling one-to-many relationships (hotel → room types): Nested and Parent-Child.

Nested Model (Selected):

Hotel and room types stored as a single document
Parent (hotel) and children (rooms) are co-located in the same Lucene block
Pros: Significantly faster queries, low memory overhead
Cons: High update cost—updating any field forces re-indexing of the entire document (hotel + all rooms)

Parent-Child Model (Rejected):

Hotel (parent) and rooms (children) indexed as separate documents
Pros: Low update cost—updating one child only re-indexes that child
Cons: Slower queries due to join overhead, higher memory overhead (the join field relies on global ordinals held in memory)

The Decision: OLTP Workload Removed = No Compromise Needed

The Nested vs Parent-Child debate is a microcosm of the entire OLAP/OLTP problem:

Nested model optimizes for reads (OLAP)
Parent-Child model compromises on read performance to gain update efficiency (a step towards OLTP)

Because the high-frequency OLTP workload (availability) has been completely removed and placed in Redis, the system is no longer forced to compromise. The correct choice is to optimize the search index for its primary, read-heavy workload. Therefore, the Nested model is used.

Rationale:

Infrequent updates to static hotel data (handled by asynchronous CDC pipeline) will incur the higher re-indexing cost
This is a worthwhile trade-off for maximizing query performance for all users
The update penalty only affects static data changes (hotel name, amenities), not high-frequency availability updates

Index Refresh Configuration:

For this OLAP index, index.refresh_interval can be set to a high value (e.g., 30s or 60s) or disabled during bulk loads
A high-frequency OLTP workload would require a low interval (e.g., default 1s), triggering constant, costly refreshes
With OLTP removed, the index can be tuned for maximum indexing performance and stability
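
As a small illustration, the two refresh settings mentioned above could be expressed as request bodies for the index settings API. They are written here as Python dicts; the variable names are hypothetical, and "-1" is the value that disables refresh.

```python
import json

# Body for updating the index settings of a read-heavy search index.
search_tuned_settings = {"index": {"refresh_interval": "30s"}}

# Body to apply before a bulk load; refresh is switched off entirely.
bulk_load_settings = {"index": {"refresh_interval": "-1"}}

print(json.dumps(search_tuned_settings))
print(json.dumps(bulk_load_settings))
```

After a bulk load finishes, the interval would be set back to the search-tuned value so new documents become visible again.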

Key Takeaway: By moving the OLTP availability workload to Redis, Elasticsearch can be fully optimized for OLAP reads using the Nested model, with no need to trade read performance for cheaper updates.

Field Mapping Strategy

What is Field Mapping?

Field mapping tells Elasticsearch how to store and index each field in your documents. Different field types enable different query capabilities and performance characteristics. Choosing the right field type is crucial for search performance and accuracy.

Field Type Breakdown:

hotel_id: keyword (exact matching, aggregations)
- Why: Hotel IDs are unique identifiers that need exact matching only
- Use Case: "Find hotel with ID 'hotel_123'" or "Count hotels by ID"
- Performance: Very fast exact lookups, no text analysis overhead
name: text with keyword subfield (full-text search + exact matching)
- Why: Hotel names need both fuzzy text search AND exact matching
- text field: Analyzed for full-text matching and relevance scoring; typo tolerance comes from fuzzy queries at search time, and matching "Luxury" to "Luxurious" depends on a stemming analyzer
- keyword subfield: Exact matching for autocomplete, sorting, or filtering
- Use Case: With fuzziness enabled, text search finds "Luxury Downtown Hotel" even if the user types "Luxry Downtown", while keyword enables exact filtering
location: geo_point (geospatial queries, distance calculations)
- Why: Geographic coordinates need special handling for distance queries
- Use Case: "Find hotels within 5km of latitude 34.0522, longitude -118.2437" (note that the array form of a geo_point is ordered [lon, lat])
- Performance: Optimized spatial indexing for fast distance calculations
amenities: keyword array (filtering, faceted search)
- Why: Amenities are standardized discrete values (e.g., ["pool", "spa", "gym"]) that need exact matching
- Use Case: Filter by "pool AND spa" - must match exact amenity values, not partial text matches
- Why not text:
- Text analysis would tokenize values, breaking exact matching needed for filtering
- Amenities are categorical data (not free-form text), so they should be matched exactly
- Example: A text field would tokenize an amenity value such as "pool table" into "pool" and "table", so a filter on "pool" would wrongly match billiards; a keyword field matches only the exact value "pool"
- Performance: Fast exact matching for multiple amenities simultaneously
room_types: nested object (complex room type queries)
- Why: Room types have their own attributes (price, capacity, amenities) that need independent filtering
- Use Case: "Find hotels with deluxe rooms under $300" - must check room type AND price together
- Performance: Enables filtering within nested documents without false matches
rating: float (range queries, sorting)
- Why: Ratings are numeric values that need range queries and sorting
- Use Case: "Find hotels with rating >= 4.0" or "Sort by rating descending"
- Performance: Optimized for numeric comparisons and sorting operations

Right field type = Accurate results + Fast queries

Key Takeaway: Each field type is optimized for specific query patterns. Text fields handle fuzzy matching, keyword fields handle exact matching, geo_point handles distance calculations, and nested objects handle complex relationships. Choosing the right type ensures both query accuracy and performance.
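
Put together, the field types above might translate into a mapping along these lines. It is shown as a Python dict that serializes to the JSON body used when creating an index. The room-level field names follow the document examples earlier in this article; the numeric types chosen for capacity and base_price are assumptions.

```python
import json

hotel_mapping = {
    "mappings": {
        "properties": {
            "hotel_id": {"type": "keyword"},
            "name": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},  # exact-match subfield
            },
            "location": {"type": "geo_point"},
            "amenities": {"type": "keyword"},  # arrays need no special type
            "rating": {"type": "float"},
            "room_types": {
                "type": "nested",
                "properties": {
                    "room_type_id": {"type": "keyword"},
                    "type": {"type": "keyword"},
                    "view": {"type": "keyword"},
                    "capacity": {"type": "integer"},   # assumed numeric type
                    "base_price": {"type": "float"},   # assumed numeric type
                    "amenities": {"type": "keyword"},
                },
            },
        }
    }
}

print(json.dumps(hotel_mapping, indent=2))
```

Note that no availability field appears in the mapping, in line with the recommended Option 1.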

4. Elasticsearch Index Design

Index Mapping Configuration

The hotel index uses a carefully designed mapping to optimize for different query patterns:

Text Fields: Analyzed for full-text search. The standard analyzer only tokenizes and lowercases; stemming and stop-word removal require a language analyzer such as english
- Stemming: Reduces words to their root form (e.g., "swimming" → "swim", "pools" → "pool") so that variations of the same word match
- Stop Words: Removes common words like "the", "and", "is" that don't add search value
Keyword Fields: Exact matching for filters and aggregations
Geo Fields: Geo-point mapping for distance-based queries
Nested Fields: Specialized mapping for the one-to-many hotel → room type relationship (not the Parent-Child join model)

Nested Document Configuration

From an index design perspective, nested documents provide:

Multi-level Filtering: Enables filtering by both hotel amenities (pool, spa) and room-level amenities (wifi, minibar) simultaneously
Performance: Nested queries resolve the hotel → room type relationship with efficient BitSet operations, because nested room types are indexed in the same block as their hotel
Flexibility: Easy to add new room types or room attributes without schema changes

Key Takeaway: The index design balances search performance with query flexibility, using appropriate field types for different use cases. For the recommended approach (Option 1), availability data is not stored in Elasticsearch - it's managed entirely in Redis and the primary database for real-time accuracy.

5. Query Patterns and Optimization

Filter-First Execution Strategy

Elasticsearch optimizes query performance by executing filters before expensive scoring operations:

Execution Order:

Filter Context: Execute all filters, create BitSets (fast, cached)
Query Context: Execute text search against the inverted index to find matching documents (fast, but not cached the way filters are)
BitSet Intersection: Combine filters with AND operations (very fast)
Scoring Phase: Apply BM25 scoring only to filtered results (expensive but small set)

Performance Impact:

Without Filter-First: Score 50,000 documents, then filter to 1,000
With Filter-First: Filter to 1,000 documents, then score only those
Performance Improvement: 50x fewer scoring operations in this example (50,000 → 1,000)

Example: Search for "Luxury hotels with pool and spa in Los Angeles"

Query Components:

Text search: "Luxury" (query context - needs scoring)
Location filter: "Los Angeles" (filter context - no scoring)
Amenity filter: "pool AND spa" (filter context - no scoring)
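
A sketch of how these three components might be laid out in a bool query, written as a Python dict that serializes to the search request body. The field names `description` and `city` are assumptions; the article only lists them informally.

```python
import json

search_body = {
    "query": {
        "bool": {
            # Query context: contributes to the relevance score.
            "must": [
                {"multi_match": {"query": "Luxury",
                                 "fields": ["name", "description"]}}
            ],
            # Filter context: yes/no matching, no scoring, cacheable.
            "filter": [
                {"term": {"city": "Los Angeles"}},  # assumes city is a keyword field
                {"term": {"amenities": "pool"}},
                {"term": {"amenities": "spa"}},     # two term filters = pool AND spa
            ],
        }
    }
}

print(json.dumps(search_body, indent=2))
```

Placing the location and amenity clauses under `filter` rather than `must` is what keeps them out of the scoring phase.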

Without Filter-First (Inefficient):

Text search finds 50,000 hotels with "luxury" in name/description
BM25 scoring calculated for all 50,000 hotels (expensive!)
Field boosting applied to all 50,000 hotels
Results ranked: Hotel_A (score 0.95), Hotel_B (score 0.92), ...
Location filter applied: 50,000 → 2,000 hotels in Los Angeles
Amenity filter applied: 2,000 → 600 hotels with pool AND spa
Total cost: 50,000 scoring operations + ranking + filtering

With Filter-First (Optimized):

Location filter: Create BitSet A of hotels in "Los Angeles" → 2,000 hotels (fast, cached)
Amenity filter: Create BitSet B of hotels with "pool AND spa" → 5,000 hotels (fast, cached)
Text search: Create BitSet C of hotels containing "luxury" → 50,000 hotels (fast - from inverted index)
BitSet intersection: A ∩ B ∩ C = 600 hotels (very fast - bitwise AND operation)
BM25 scoring: Calculate scores for only 600 hotels (expensive but small set)
Field boosting: Apply to only 600 hotels
Results ranked: Final 600 hotels with scores
Total cost: 600 scoring operations (about 83x fewer than the 50,000 scored without filter-first)

Key Insight:

Text search BitSet creation: Still processes ALL documents (50,000) from inverted index - this is fast
Filter BitSet creation: Processes ALL documents (fast, uses doc values)
BitSet intersection: Very fast - just bitwise AND operations
Scoring optimization: Only scores the 600 hotels in the intersection, not all 50,000
The magic: BitSet operations are fast even on large sets, but scoring is expensive - so we minimize scoring by intersecting BitSets first
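
The intersect-then-score idea can be sketched with Python integers acting as bitsets. This is an illustration of the principle only: the document ids are made up, `toy_score` is a placeholder rather than BM25, and Elasticsearch's actual execution is more involved.

```python
def to_bitset(doc_ids):
    """Pack document ids into one int; bit i is set when doc i matches."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def from_bitset(bits):
    """Unpack the set bits back into ascending document ids."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


def toy_score(doc_id):
    """Placeholder for an expensive relevance calculation (not BM25)."""
    return 1.0 / (1 + doc_id)


in_los_angeles = to_bitset([0, 1, 2, 5])
has_pool_and_spa = to_bitset([1, 2, 3, 5, 7])
mentions_luxury = to_bitset([0, 1, 4, 5, 6, 7])

# Cheap step: one bitwise AND per clause.
survivors = from_bitset(in_los_angeles & has_pool_and_spa & mentions_luxury)

# Expensive step: score only the documents that survived every clause.
scores = {doc_id: toy_score(doc_id) for doc_id in survivors}
ranked = sorted(scores, key=scores.get, reverse=True)
print(survivors, ranked)
```

Six documents mention "luxury", but only two are scored because the other four were eliminated by the bitwise intersection first.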

Why This Works:

Filters are fast: BitSet operations are O(1) per document
Scoring is expensive: BM25 calculation is O(term_frequency) per document
Reduce scoring set: Filter first to minimize expensive operations

Interview Takeaway: BitSet Caching and Precomputation

Caching Strategy: BitSets are cached for repeated queries (85-90% cache hit rate for popular filters)
Precomputation Opportunity: Amenity-based BitSets can be precomputed since amenities don't change frequently
- Common filters like "pool", "spa", "gym" can be precomputed and stored in memory
- Only updated when hotel amenities are added/removed (rare event)
- Provides instant filter execution without recalculating BitSets on every query
- Key Insight: Identify filters that change infrequently (amenities, location) vs. frequently (availability, price) - precompute the stable ones

Real-World Impact:

Query latency: 200ms → 30ms (6.7x faster)
CPU usage: 50,000 scoring operations → 600 operations (83x reduction)
Memory: Cached BitSets reused for popular queries (85-90% cache hit rate)

Text Search with Multi-Match

Multi-match queries handle complex text search across multiple fields:

Field Boosting: Hotel name gets higher weight than description
Query Types: best_fields, most_fields, cross_fields for different matching strategies
Fuzzy Matching: Typo tolerance via the fuzziness parameter; synonym expansion requires a synonym token filter in the analyzer
Phrase Matching: Exact phrase matching with proximity scoring
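
Two hedged examples of multi-match bodies covering the options above, again as Python dicts. The `description` field, the boost of 3, and the slop of 1 are illustrative assumptions, not values taken from the article.

```python
import json

# Boosted, typo-tolerant search: name counts three times as much as description.
fuzzy_query = {
    "multi_match": {
        "query": "luxry downtown",
        "fields": ["name^3", "description"],
        "type": "best_fields",
        "fuzziness": "AUTO",
    }
}

# Phrase search: terms must appear together, allowing one position of slack.
phrase_query = {
    "multi_match": {
        "query": "luxury downtown",
        "fields": ["name", "description"],
        "type": "phrase",
        "slop": 1,
    }
}

print(json.dumps({"query": fuzzy_query}))
print(json.dumps({"query": phrase_query}))
```

Fuzziness and phrase matching are separate query types here, so a search box that wants both would typically combine the two clauses in a bool query.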

Nested Queries for Room Type Filtering

Nested queries enable complex filtering within room type documents:

Room Type Filtering: Find hotels with specific room types
Capacity Filtering: Filter by guest count requirements
Amenity Filtering: Room-level amenities (wifi, minibar, ocean_view)
Price Range Filtering: Filter by room type pricing
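
The four filters above could be combined in a single nested query, sketched here as a Python dict. Field names follow the hotel document examples in this article; the specific values are illustrative.

```python
import json

room_filter = {
    "nested": {
        "path": "room_types",
        "query": {
            "bool": {
                # All clauses must hold on the same room type entry.
                "filter": [
                    {"term": {"room_types.type": "deluxe"}},
                    {"range": {"room_types.capacity": {"gte": 2}}},
                    {"term": {"room_types.amenities": "wifi"}},
                    {"range": {"room_types.base_price": {"lt": 300}}},
                ]
            }
        },
    }
}

print(json.dumps({"query": room_filter}, indent=2))
```

Because every clause sits inside one `nested` block, a hotel matches only if a single room type satisfies all four conditions.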

How Filtering Works (Inverted Index vs Filtering)

Understanding the Difference:

Text Search (Uses Inverted Index):

Inverted Index: Maps each word → list of documents containing that word
Example: "deluxe" → [Hotel_A, Hotel_B, Hotel_C]
Used for: Finding documents that contain specific words
Fast for: "Find hotels with 'deluxe' in the name"

Filtering (Uses Doc Values):

Doc Values: Columnar storage format - stores all values for a field together
Example: All room prices stored in one column: [299, 599, 199, 450, ...]
Used for: Exact matching, range queries, aggregations
Fast for: "Find hotels with deluxe rooms under $300"

Why Different Data Structures?

Inverted indexes are optimized for finding which documents contain a term, which also covers exact-match filters on keyword fields. Numeric range filters are answered from a separate point index (a BKD tree), and doc values store field values in a columnar format suited to sorting, aggregations, and per-document value checks. This section uses "doc values" as shorthand for these non-text structures.

Comparison to Columnar Databases:
Elasticsearch's doc values use a similar storage strategy to columnar databases (e.g., Amazon Redshift, Google BigQuery, ClickHouse):

Columnar Storage: All values for a field are stored together in a column (e.g., all prices: [299, 599, 199, 450])
Optimized for Analytics: Fast filtering, sorting, and aggregations on numeric/categorical fields
Key Difference: Columnar databases store ALL data in columnar format, while Elasticsearch uses a hybrid approach:
- Inverted indexes (term-oriented: term → documents) for text search
- Doc values (columnar) for filtering/aggregations
- This hybrid approach gives Elasticsearch both fast text search AND fast filtering in one system

Example: "Deluxe Rooms Under $300" Query

Let's trace through how Elasticsearch handles this query:

Step 1: Understanding the Data Structure

Hotel Document:
  hotel_id: "hotel_123"
  name: "Luxury Hotel"
  room_types (nested):
    - {type: "deluxe", price: 299, capacity: 2}
    - {type: "suite", price: 599, capacity: 4}
    - {type: "standard", price: 199, capacity: 2}

Pricing Reality (Interview Tip): In production the price stored in Elasticsearch is usually a summary (min_price_next_30_days, max_price, or a bucketed price range). This lets Elasticsearch filter aggressively without forcing constant reindexing. The precise per-date price (and availability) comes from Redis/the pricing service during the availability step, ensuring tomorrow’s rate (e.g., $400) is accurate even if Elasticsearch still lists $300 as the minimum. The rule of thumb: use Elasticsearch for coarse filtering, Redis for rapidly-changing truth.

Step 2: Nested Query Execution (Using BitSet Operations)

Access Nested Documents: Elasticsearch treats each nested room type as a separate "mini-document"
- Room Type 1: {type: "deluxe", price: 299}
- Room Type 2: {type: "suite", price: 599}
- Room Type 3: {type: "standard", price: 199}
Filter by Type (BitSet Creation): Using doc values, create BitSet of nested documents with type = "deluxe"
- Room Type 1: type = "deluxe" ✓ → BitSet bit 1 = true
- Room Type 2: type = "suite" ✗ → BitSet bit 2 = false
- Room Type 3: type = "standard" ✗ → BitSet bit 3 = false
- Result: BitSet A = [1, 0, 0]
Filter by Price (BitSet Creation): Using doc values, create BitSet of nested documents with price < 300
- Room Type 1: price = 299 < 300 ✓ → BitSet bit 1 = true
- Room Type 2: price = 599 < 300 ✗ → BitSet bit 2 = false
- Room Type 3: price = 199 < 300 ✓ → BitSet bit 3 = true
- Result: BitSet B = [1, 0, 1]
BitSet Intersection: A ∩ B = [1, 0, 0] ∩ [1, 0, 1] = [1, 0, 0]
- Only Room Type 1 matches both conditions (bitwise AND operation)
Return Parent Document: If any nested document matches, the parent hotel document is included
- Result: Hotel_123 is returned

Key Insight:

Doc Values: Used to create BitSets for filtering operations (fast, cached)
BitSet Operations: Filtering by type and price uses BitSet intersection (bitwise AND)
Nested Documents: Each room type is stored separately, allowing independent BitSet creation
Performance: BitSet operations are O(1) per document, much faster than scoring all documents
Combination: The query can use both - text search for hotel names AND BitSet filtering for room types

Performance Comparison:

Operation	Data Structure	Speed
Text search ("deluxe" in name)	Inverted Index	Very Fast
Filter (price < 300)	Doc Values	Very Fast
Filter (type = "deluxe")	Doc Values	Very Fast
Nested filter (deluxe AND price < 300)	Doc Values + Nested	Fast

Why This Matters:

Inverted indexes are great for "find documents containing words"
Doc values are great for "find documents matching criteria"
Nested structures allow filtering on related data (room types) independently
Combining both gives you powerful search + accurate filtering

Query Execution Order and Performance

Optimal Query Structure (Option 1 - Recommended):

Geographic Filters: city, radius (most selective)
Amenity Filters: pool, spa, gym (moderately selective)
Room Type Filters: nested queries (selective)
Text Search: multi-match queries (expensive)
Scoring: BM25 calculation (expensive but small set)

Note: Date range validation is handled by Redis after Elasticsearch filtering, not during the Elasticsearch query phase. This aligns with Option 1 (Redis/DB Only) where availability is not stored in Elasticsearch.

Key Takeaway: Filter-first execution with BitSet caching cuts the expensive scoring work to only the documents that pass all filters (600 instead of 50,000 in the example above). Detailed BitSet mechanics and caching strategies are explained in the "Filter-First Execution Strategy" section above.

6. Redis Availability Architecture

Architecture Note: Redis (or AWS ElastiCache) is used exclusively for availability data. Search result caching and other data caching are handled by Elasticsearch's built-in caching mechanisms or application-level caching.

Redis Data Structures for Availability

Important: Room Type ID in All Keys

All Redis availability keys include room_type_id because:

Hotels have multiple room types and variations (e.g., "suite with garden view" vs "suite with ocean view") with separate availability
Each room type variation has a unique room_type_id (e.g., "suite_garden_001", "suite_ocean_001")
Elasticsearch filters by room type and returns room_type_id, which is used to build Redis keys
This allows separate availability tracking for each variation

Redis uses multiple data structures optimized for availability patterns:

Simple Key-Value Pattern:

Key Format: availability:hotel_id:room_type_id:check_in:check_out
Value: "1" (at least one room available) or the number of available rooms (e.g., "5" for 5 rooms available); an absent key (a nil reply from GET) means unavailable
Use Case: Simple availability checks (not recommended due to update complexity)
Performance: O(1) lookup time

How the Key Gets Created:

Booking Event: When a booking is made or cancelled for a specific room type variation
Room Type ID from Elasticsearch: Elasticsearch nested query filters hotels by room type and returns room_type_id (e.g., "suite_garden_001", "suite_ocean_001")
Key Generation: Combine hotel_id, room_type_id, check_in date, and check_out date
- Example: availability:hotel_123:suite_garden_001:2024-06-15:2024-06-17
Availability Check: Query database for availability of that room type variation across all dates in range
Key Creation: If all dates are available for that room type variation, set key with value "1" (at least one available) or the number of available rooms (e.g., "5" for 5 rooms)
- Redis command: SET availability:hotel_123:suite_garden_001:2024-06-15:2024-06-17 "1" EX 14400 (4 hour TTL) - for at least one room
- Redis command: SET availability:hotel_123:suite_garden_001:2024-06-15:2024-06-17 "5" EX 14400 (4 hour TTL) - for 5 available rooms
Key Deletion: If any date becomes unavailable for that room type variation, delete the key
- Redis command: DEL availability:hotel_123:suite_garden_001:2024-06-15:2024-06-17
Update Trigger: Keys are created/updated/deleted when:
- Bookings are confirmed for specific room type variations
- Bookings are cancelled for specific room type variations
- New availability is added for specific room type variations
- Periodic sync from database

Challenge: Finding Which Keys to Modify - Combinatorial Explosion

The Problem:
When a booking is made for a specific date and room type variation (e.g., 2024-06-16, suite with ocean view), you need to invalidate all keys that include that date for that room type variation.

Combinatorial Explosion Example:
For a 30-day booking window, there are (30 × 31) / 2 = 465 possible check-in/check-out combinations, so up to 465 keys can exist per room type variation. A single booking on one night (e.g., June 16th) requires finding and invalidating every key whose range covers that night. For the k-th night of the window that is k × (31 − k) keys: 30 for the first or last night and up to 240 for a night in the middle. The number of keys to track and update therefore grows quadratically with the window length.

Example Keys to Invalidate:

availability:hotel_123:suite_ocean_001:2024-06-15:2024-06-17 (includes 2024-06-16)
availability:hotel_123:suite_ocean_001:2024-06-16:2024-06-18 (includes 2024-06-16)
availability:hotel_123:suite_ocean_001:2024-06-14:2024-06-17 (includes 2024-06-16)
... and up to 237 more keys for a 30-day window

But Redis doesn't provide an efficient way to find all keys matching a date pattern.
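
The key counts can be checked by brute force. The sketch below enumerates every stay in a 30-day window (check-out exclusive) and counts how many cover a given night; the function names are hypothetical.

```python
def stays_in_window(days):
    """All (check_in, check_out) day offsets, check-out exclusive."""
    return [(i, j) for i in range(days) for j in range(i + 1, days + 1)]


def stays_covering(days, night):
    """Stays whose occupied nights include `night` (0-based offset)."""
    return [(i, j) for i, j in stays_in_window(days) if i <= night < j]


total_keys = len(stays_in_window(30))
keys_per_night = [len(stays_covering(30, n)) for n in range(30)]

print(total_keys)           # every possible range key in the window
print(keys_per_night[0])    # keys touched by a booking on the first night
print(keys_per_night[15])   # keys touched by a booking on the 16th night
print(max(keys_per_night))
```

Of the 465 possible keys, the ones that cover any single night number between 30 and 240, so one booking still forces a large fan-out of invalidations.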

Solution Options:

Option 1: Pattern Matching (Inefficient)

Use KEYS availability:hotel_123:* to find all keys for a hotel
Filter keys that include the affected date
Delete matching keys
Problem: KEYS is O(N) over the whole keyspace and blocks Redis; SCAN avoids the blocking but still walks every key, so neither is suitable for this path in production

Option 2: Secondary Index (Recommended)

Maintain a separate index tracking which keys exist for each date and room type variation
Use a Redis set: availability_index:hotel_123:suite_ocean_001:2024-06-16 → Set of all keys containing this date for this room type variation
When booking affects 2024-06-16 for suite with ocean view:
1. Get all keys from index: SMEMBERS availability_index:hotel_123:suite_ocean_001:2024-06-16
2. Delete all those keys: DEL key1 key2 key3...
3. Remove index entry: DEL availability_index:hotel_123:suite_ocean_001:2024-06-16 (the deleted keys may still be listed in the index sets of other dates; remove them with SREM or tolerate the stale entries)
When creating a key: Add key to index for each date in range and room type variation
Performance: O(M) where M = number of keys containing that date for that room type variation
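
A minimal in-memory sketch of the secondary index, assuming Python dicts and sets as stand-ins for Redis strings and sets. The comments name the Redis command each step corresponds to; the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date, timedelta

cache = {}                      # stand-in for the range keys (SET / DEL)
date_index = defaultdict(set)   # stand-in for the per-date sets (SADD / SMEMBERS)


def cache_range(hotel, room_type, check_in, check_out, value):
    """Create a range key and register it under every night it covers."""
    key = f"availability:{hotel}:{room_type}:{check_in}:{check_out}"
    cache[key] = value                                   # SET key value
    night = check_in
    while night < check_out:                             # check-out day is not a night
        index_key = f"availability_index:{hotel}:{room_type}:{night}"
        date_index[index_key].add(key)                   # SADD index_key key
        night += timedelta(days=1)
    return key


def invalidate_night(hotel, room_type, night):
    """Delete every cached range that covers `night`, then drop its index set."""
    index_key = f"availability_index:{hotel}:{room_type}:{night}"
    stale = date_index.pop(index_key, set())             # SMEMBERS, then DEL index_key
    for key in stale:
        cache.pop(key, None)                             # DEL key
    return stale


first = cache_range("hotel_123", "suite_ocean_001",
                    date(2024, 6, 15), date(2024, 6, 17), "1")
second = cache_range("hotel_123", "suite_ocean_001",
                     date(2024, 6, 17), date(2024, 6, 19), "1")

removed = invalidate_night("hotel_123", "suite_ocean_001", date(2024, 6, 16))
print(removed, list(cache))
```

After the invalidation, the index set for 2024-06-15 still lists the deleted key. A real implementation would either clean those entries up or treat a missing key as a harmless cache miss.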

Option 3: Hash Structure (Better Alternative) - Solves Combinatorial Explosion

Use hash structure instead of simple keys (see Hash Structure section)
Key format: availability:hotel_id:room_type_id
Hash fields: Individual dates (2024-06-15, 2024-06-16, etc.)
Solves the Problem: Instead of 465 keys for a 30-day window, you have only 30 hash fields (one per day)
When booking affects 2024-06-16 for suite with garden view:
- Simply update/delete the hash field: HDEL availability:hotel_123:suite_garden_001 2024-06-16
- No need to find which keys to modify - directly update the specific date field
Performance: O(1) - directly access the date field for specific room type variation
Scalability: Linear growth (30 fields for 30 days) vs. quadratic growth (465 keys for 30 days)

Recommendation:
For simple availability checks, Hash Structure is preferred because:

No need to track which keys to invalidate
Direct date-based access (O(1))
Easier to update individual dates
More efficient for date range queries

Hash Structure:

Key Format: availability:hotel_id:room_type_id
Hash Fields: Individual dates representing nights (2024-06-15, 2024-06-16, 2024-06-17, etc.)
- Each date field represents availability for that night
- Example: Field "2024-06-15" = availability for the night of June 15th
Hash Values: JSON object with room_count and last_updated (price comes from Elasticsearch/database, not Redis)
Use Case: Detailed availability per room type variation
Performance: O(1) field access

Examples:

availability:hotel_123:deluxe_001 (for deluxe rooms)
availability:hotel_123:suite_garden_001 (for suite with garden view)
availability:hotel_123:suite_ocean_001 (for suite with ocean view)

Advantage: Direct Date-Based Updates
Unlike the simple key-value pattern, hash structure allows direct updates to specific dates without needing to find which keys to modify. When a booking affects 2024-06-16 for a specific room type variation (e.g., suite with garden view), you directly update that hash field - no pattern matching or secondary indexes needed.

How the Hash Gets Created:

Initial Creation: When hotel availability data is loaded into Redis for a specific room type variation
Hash Key: Create hash with key availability:hotel_id:room_type_id
- Example: availability:hotel_123:suite_garden_001 (for suite with garden view)
- Example: availability:hotel_123:suite_ocean_001 (for suite with ocean view)
Hash Field Population: For each date with availability, add a field:
- Redis command: HSET availability:hotel_123:suite_garden_001 2024-06-15 '{"room_count":5,"last_updated":"2024-06-15T10:30:00Z"}'
- Redis command: HSET availability:hotel_123:suite_garden_001 2024-06-16 '{"room_count":3,"last_updated":"2024-06-15T10:30:00Z"}'
Field Updates: When availability changes for that room type variation:
- Update specific date field: HSET availability:hotel_123:suite_garden_001 2024-06-15 '{"room_count":4,"last_updated":"2024-06-15T14:20:00Z"}'
Field Deletion: When date becomes unavailable for that room type variation:
- Remove field: HDEL availability:hotel_123:suite_garden_001 2024-06-17
TTL: Set expiration on entire hash: EXPIRE availability:hotel_123:suite_garden_001 14400 (4 hours)
Update Triggers: Hash is updated when:
- Bookings change availability for specific room type variations and dates
- Periodic batch sync from database (availability data only)

Atomic Operations for OLTP Transactions: Redis Hash Design

The OLTP Transaction Pattern:

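The Redis side of a booking (decrementing the cached room count) is a single, atomic Redis command. It requires the field value to be a plain integer rather than JSON, and it mirrors a change the database has already committed: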

HINCRBY availability:hotel_123:suite_garden_001 2024-06-15 -1

Why This Works for OLTP:

O(1) Performance: Constant time operation, designed for high-throughput scenarios
Atomic: The increment/decrement operation is atomic—no race conditions
Single Operation: No multi-step process, no re-indexing, no segment merging
Conceptual Opposite of Elasticsearch: This is the exact opposite of Elasticsearch's multi-stage, resource-intensive re-indexing "update"

Hash Structure Design:

Redis Key: availability:hotel_id:room_type_id (e.g., availability:hotel_123:suite_garden_001)
Hash Fields: {date} (e.g., 2024-06-15, 2024-06-16)
Hash Values: Integer representing room count (e.g., 10, 5, 0)

Alternative: JSON Value with Room Count

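The earlier examples store JSON values with room_count and last_updated. HINCRBY cannot update those: Redis rejects the command with an error when the field does not hold an integer. For pure counter operations, store simple integer values instead: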

Simple Integer: HSET availability:hotel_123:suite_garden_001 2024-06-15 10
Atomic Decrement: HINCRBY availability:hotel_123:suite_garden_001 2024-06-15 -1 → Result: 9
Atomic Increment: HINCRBY availability:hotel_123:suite_garden_001 2024-06-15 1 → Result: 10
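
One detail worth seeing in code: HINCRBY does not stop at zero, so a decrement needs a guard. The sketch below is an assumption-laden illustration, with an in-memory class standing in for a Redis hash and a hypothetical try_reserve helper that decrements and then compensates if the counter went negative.

```python
class FakeRedisHash:
    """In-memory stand-in for one Redis hash; the real HINCRBY is atomic on the server."""

    def __init__(self):
        self.fields = {}

    def hset(self, field, value):
        self.fields[field] = int(value)

    def hincrby(self, field, amount):
        self.fields[field] = self.fields.get(field, 0) + amount
        return self.fields[field]

def try_reserve(h, night, rooms=1):
    """Decrement the counter, then roll back if it went below zero."""
    remaining = h.hincrby(night, -rooms)
    if remaining < 0:
        h.hincrby(night, rooms)   # compensate: HINCRBY itself never stops at zero
        return False
    return True
```

Against a real Redis server, other clients could briefly observe the negative value between the two calls. A Lua script that checks and decrements in one step would avoid that window.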

Date Range Check (Read Query Pattern):

Checking availability for a 5-night stay is a single, efficient command:

HMGET availability:hotel_123:suite_garden_001 2024-06-15 2024-06-16 2024-06-17 2024-06-18 2024-06-19

Performance: O(N) where N = number of fields requested (i.e., number of nights), not the total number of documents in the database. This is extremely fast.

Key Takeaway: Redis Hash structure with atomic HINCRBY operations provides the exact OLTP capabilities needed for high-frequency availability updates—simple, fast, atomic, and designed for this exact workload.

Date Range Validation with Hash Structure (Room Type Variation Aware):

Critical: Hotel Bookings are for Nights

Guest checks in on check-in date (stays that night)
Guest checks out on check-out date (stays the previous night, leaves on check-out date)
Room is occupied every night from the check-in date through (check-out - 1 day), and is available again on the check-out date
Date range to check: [check_in_date, check_out_date - 1 day], inclusive

Process:

Room Type ID from Elasticsearch: Elasticsearch nested query filters hotels by room type and returns room_type_id (e.g., "suite_garden_001", "suite_ocean_001")
Generate date range: Dates from check-in to (check-out - 1 day) inclusive
Hash Key: Use availability:hotel_id:room_type_id (e.g., availability:hotel_123:suite_garden_001)
Batch check: HMGET availability:hotel_id:room_type_id date1 date2... (all nights in range)
Validation: If all dates return non-null values with room_count > 0, room type variation is available for entire stay
Performance: O(N) where N = number of nights (check-out - check-in), not total available dates

Example: Check-in 2024-06-15, check-out 2024-06-17 → Check dates [2024-06-15, 2024-06-16]
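
The night arithmetic and the validation step can be sketched as follows. The function names are hypothetical, and the list passed to all_nights_available mimics an HMGET reply (one JSON string or None per requested night).

```python
import json
from datetime import date, timedelta

def stay_nights(check_in: date, check_out: date) -> list:
    """Return the nights to check: check-in through (check-out - 1 day), inclusive."""
    if check_out <= check_in:
        raise ValueError("check_out must be after check_in")
    return [
        (check_in + timedelta(days=i)).isoformat()
        for i in range((check_out - check_in).days)
    ]

def all_nights_available(hmget_values: list) -> bool:
    """True only if every night has a value and its room_count is above zero."""
    return all(
        v is not None and json.loads(v)["room_count"] > 0
        for v in hmget_values
    )
```

For check-in 2024-06-15 and check-out 2024-06-17, stay_nights returns the two fields 2024-06-15 and 2024-06-16, the same list used in the example above.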

Sorted Set Pattern:

Key Format: availability:hotel_id:room_type_id:dates
Members: date strings (2024-06-15, 2024-06-16)
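Scores: a constant 1 for every member (only available dates are stored; an unavailable date is removed rather than kept with score 0)
Use Case: Date range queries with ranking per room type variation
Performance: O(log N + M) for range queries, where N = dates stored and M = dates returned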

Examples:

availability:hotel_123:suite_garden_001:dates
availability:hotel_123:suite_ocean_001:dates

How the Sorted Set Gets Created:

Initial Creation: When hotel availability data is loaded for a specific room type variation
Sorted Set Key: Create sorted set with key availability:hotel_id:room_type_id:dates
- Example: availability:hotel_123:suite_garden_001:dates (for suite with garden view)
- Example: availability:hotel_123:suite_ocean_001:dates (for suite with ocean view)
Member Addition: For each available date for that room type variation, add as member with score 1:
- Redis command: ZADD availability:hotel_123:suite_garden_001:dates 1 "2024-06-15"
- Redis command: ZADD availability:hotel_123:suite_garden_001:dates 1 "2024-06-16"
- Redis command: ZADD availability:hotel_123:suite_garden_001:dates 1 "2024-06-17"
Member Updates: When availability changes for that room type variation:
- Mark available: ZADD availability:hotel_123:suite_garden_001:dates 1 "2024-06-17" (add date with score 1)
- Mark unavailable: ZREM availability:hotel_123:suite_garden_001:dates "2024-06-17" (remove date from set)
Clean Set: Only available dates are stored in the sorted set (no "unavailable" members with score 0)
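Range Queries: Query available dates for that room type variation:
- Redis command: ZRANGE availability:hotel_123:suite_garden_001:dates 0 -1 (get all available dates; members with equal scores sort lexicographically, which is chronological for YYYY-MM-DD strings)
- Filter dates in desired range (e.g., June 2024) on the client, or fetch only that range server-side: ZRANGEBYLEX availability:hotel_123:suite_garden_001:dates [2024-06-01 [2024-06-30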
TTL: Set expiration: EXPIRE availability:hotel_123:suite_garden_001:dates 14400 (4 hours)
Update Triggers: Sorted set is updated when:
- Bookings change availability
- New availability periods are added
- Periodic sync from database
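
Because every member carries the same score, Redis orders members lexicographically, and ISO dates sort chronologically as strings. The sketch below imitates that behaviour with a sorted Python list and bisect; the class and method names are hypothetical, and range_by_lex is only an analogue of a lexicographic range query such as ZRANGEBYLEX.

```python
import bisect

class AvailableDates:
    """Stand-in for a sorted set whose members all share one score."""

    def __init__(self):
        self.members = []   # kept sorted, like equal-score members in Redis

    def zadd(self, day):
        i = bisect.bisect_left(self.members, day)
        if i == len(self.members) or self.members[i] != day:
            self.members.insert(i, day)   # ignore duplicates, as a set would

    def zrem(self, day):
        i = bisect.bisect_left(self.members, day)
        if i < len(self.members) and self.members[i] == day:
            del self.members[i]

    def range_by_lex(self, first, last):
        """Inclusive range lookup: two binary searches plus the slice returned."""
        lo = bisect.bisect_left(self.members, first)
        hi = bisect.bisect_right(self.members, last)
        return self.members[lo:hi]
```

Marking a date unavailable is a zrem, so a range lookup never has to filter out "unavailable" members.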

Hash Structure vs Sorted Set: When to Use Each

Hash Structure Advantages:

O(1) field access: Direct access to any date
Direct updates: Update specific dates without pattern matching
Simple date range validation: Check all dates in range using HMGET
Efficient storage: Only store dates that have availability
Better for: Checking if specific dates are available, updating individual dates

Example Use Case:

Query: "Is hotel_123 available for check-in 2024-06-15 to check-out 2024-06-17?"
Date range to check: 2024-06-15, 2024-06-16
Hash: HMGET availability:hotel_123:suite_garden_001 2024-06-15 2024-06-16
Check that every date returns a non-null value with room_count > 0
Performance: O(N) where N = number of nights (check-out - check-in)

Sorted Set Advantages:

Range queries: Efficiently find all available dates in a date range
Sorted order: Dates are automatically sorted, making range queries fast
Bulk operations: Get all available dates in a range with one query
Better for: Finding all available dates in a future period, date range discovery

Example Use Case:

Query: "Find all available dates for hotel_123 in June 2024"
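Sorted Set: ZRANGEBYLEX availability:hotel_123:suite_garden_001:dates [2024-06-01 [2024-06-30 (returns only the June dates, already sorted; this works because every member has the same score)
Alternative: ZRANGE availability:hotel_123:suite_garden_001:dates 0 -1 returns every stored date, which the client must then filter to June 2024
Performance: O(log N + M) for ZRANGEBYLEX, where N = total dates and M = dates in range; ZRANGE 0 -1 returns all N dates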
Advantage: Clean set with only available dates, no need to filter out "unavailable" members

Comparison Table:

Feature	Hash Structure	Sorted Set
Single Date Check	O(1)	O(log N)
Date Range Check	O(N) - check each date	O(log N + M) - efficient range query
Update Single Date	O(1) - direct update	O(log N) - need to find member
Find All Available Dates	O(N) - scan all fields	O(log N + M) - efficient range query
Memory Efficiency	Efficient (only available dates)	Efficient (only available dates)
Use Case	Check specific date ranges	Discover available dates in periods

Recommendation for Hotel Search:

Use the hash structure as the default approach because:

Most queries check specific date ranges (check-in to check-out)
Need to verify all dates in range are available
Hash structure provides O(1) access to specific dates
Simpler to update when bookings change
More intuitive for date range validation

Use Sorted Set if you need:

To find all available dates in a future period (e.g., "show me all available dates in July")
Efficient range queries for date discovery
Automatic sorting of dates

For Most Hotel Search Systems: Hash structure is the better choice because the primary use case is checking if a specific date range is available, not discovering all available dates.

Key Naming Conventions

Redis Key Patterns (Availability Only):

Hash Structure (Recommended): availability:hotel_id:room_type_id (hash fields are dates)
Simple Key-Value: availability:hotel_id:room_type_id:check_in:check_out
Sorted Set: availability:hotel_id:room_type_id:dates

TTL Strategies for Availability Data

Time-to-Live Configuration (Redis/ElastiCache):

Availability Keys: 1-4 hours (matches booking timeout, ensures fresh availability data)

Atomic Operations for Concurrent Booking

Important: Database is the Final Authority

The primary database is the source of truth for all availability data. Redis is a fast access layer:

Database: Authoritative data, transactional consistency, ACID guarantees
Redis: Performance optimization layer, real-time access, eventual consistency with database
Synchronization: Redis must be updated to reflect database state, not the other way around
Conflict Resolution: If Redis and database disagree, database state is correct

Race Condition Prevention:

SET with NX: Only set if key doesn't exist (prevent concurrent updates)
WATCH/MULTI/EXEC: Transactional updates for complex operations
Lua Scripts: Atomic multi-step operations

Booking Flow (Database as Authority):

Check Redis: Fast availability check in Redis
Reserve in Database: Create booking record in database (authoritative)
Update Redis: Update Redis to reflect database state
If Database Fails: Booking is not confirmed, Redis remains unchanged
If Redis Fails: Database state is correct, Redis will be synced later
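
The flow above can be sketched end to end. Everything here is a stand-in chosen so the example runs on its own: sqlite3 plays the role of PostgreSQL, a dict plays the role of the Redis hash, and the inventory table and book function are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE inventory (night TEXT PRIMARY KEY, available INTEGER NOT NULL)")
db.execute("INSERT INTO inventory VALUES ('2024-06-15', 1)")
db.commit()
cache = {"2024-06-15": 1}   # stand-in for the Redis hash fields

def book(night):
    if cache.get(night, 0) <= 0:            # 1. fast check in the cache
        return False
    try:
        with db:                            # 2. one transaction: commit or roll back
            cur = db.execute(
                "UPDATE inventory SET available = available - 1 "
                "WHERE night = ? AND available > 0",
                (night,),
            )
            if cur.rowcount == 0:
                raise LookupError("sold out in the database")
    except LookupError:
        cache[night] = 0                    # the database wins: correct the stale cache
        return False
    (left,) = db.execute(
        "SELECT available FROM inventory WHERE night = ?", (night,)
    ).fetchone()
    cache[night] = left                     # 3. cache mirrors the committed state
    return True
```

The conditional UPDATE is what actually prevents overbooking. The cache check only saves a database round trip when the room is already known to be gone.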

Real-Time Update Patterns

Update Flow: Database → Redis (One-Way Sync)

All updates originate from the database. Redis is updated to reflect database state:

Update Strategies:

Immediate Updates: After database transaction commits, immediately update Redis to reflect new availability
Batch Updates: Periodic sync from database to Redis to catch any missed updates
Event-Driven: Database triggers or change data capture (CDC) events update Redis in real-time
Fallback Sync: Periodic reconciliation with database to ensure Redis matches database state

Update Order (Critical):

Database Transaction: Make changes in database first (authoritative)
Transaction Commit: Wait for database commit to succeed
Redis Update: Update Redis to match database state
If Redis Update Fails: Database state is still correct, Redis will be synced later

Why This Matters:

Data Integrity: Database transactions ensure consistency
Redis is Cache: Redis can be rebuilt from database if needed
No Data Loss: If Redis fails, database has all the data
Eventual Consistency: Redis may be temporarily out of sync, but database is always correct

Key Takeaway: Redis provides sub-millisecond availability checks through optimized data structures and atomic operations, enabling real-time booking systems that would be impossible with database-only approaches. The primary database remains the final authority (see "Important: Database is the Final Authority" above) - all updates flow from database to Redis, ensuring data integrity and consistency.

7. Hybrid Search Execution Flow

Sequential vs Parallel Execution

Sequential Execution (Simple but Slower):

Elasticsearch search (25ms) → Get candidate hotels
Redis availability check (10ms) → Filter by availability
Total: 35ms

Parallel Execution (Complex but Faster):

Start Elasticsearch search (25ms) and Redis availability check (15ms) simultaneously
Wait for both to complete (max of both = 25ms)
Filter Elasticsearch results by Redis availability
Total: 25ms (about 29% faster than the 35ms sequential flow)
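
A rough sketch of the parallel variant using asyncio. One assumption goes beyond the text: for the Redis check to start before Elasticsearch answers, it needs a candidate list from somewhere else, so the sketch passes in a precomputed list of hotel IDs for the city. The stub functions, hotel IDs, and sleep durations are all illustrative.

```python
import asyncio

async def search_elasticsearch(query):
    await asyncio.sleep(0.025)                        # simulated 25 ms search
    return ["hotel_123", "hotel_456", "hotel_789"]    # ranked candidates

async def check_redis_availability(city_hotel_ids):
    await asyncio.sleep(0.015)                        # simulated 15 ms availability check
    return {h for h in city_hotel_ids if h != "hotel_456"}

async def parallel_search(query, city_hotel_ids):
    # Both calls run concurrently; total wait is roughly the slower of the two.
    ranked, available = await asyncio.gather(
        search_elasticsearch(query),
        check_redis_availability(city_hotel_ids),
    )
    return [h for h in ranked if h in available]      # keep Elasticsearch order

def improvement_percent(sequential_ms, parallel_ms):
    return (sequential_ms - parallel_ms) / sequential_ms * 100
```

With the numbers above, improvement_percent(35, max(25, 15)) comes out at roughly 28.6.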

Elasticsearch Candidate Generation

Search Phase:

Geographic Filtering: City, radius, location-based filtering
Amenity Filtering: Pool, spa, gym, restaurant availability
Text Search: Hotel name, description, location text matching
Room Type Filtering: Nested queries for room type requirements
Price Range Filtering: Room type pricing constraints

Result Set:

Candidate Hotels: 100-1000 hotels matching search criteria
Hotel Metadata: Name, location, amenities, room types
Search Scores: Relevance ranking for result ordering

Redis Availability Validation

How Elasticsearch and Redis Work Together:

Step 1: Elasticsearch Filters by Room Type

User search includes room type filter (e.g., "suite with ocean view")
Elasticsearch nested query filters hotels that have matching room type variations (e.g., suite with ocean view)
Elasticsearch returns: List of hotels with matching room types (hotel metadata + room type info including room_type_id)
Result: We know which hotels have matching room types and which room_type_id to check in Redis

Step 2: Redis Checks Availability for Specific Room Type Variation

For each hotel from Elasticsearch, we get the room_type_id (e.g., "suite_ocean_001", "deluxe_001")
Use hash key: availability:hotel_id:room_type_id (e.g., availability:hotel_123:suite_ocean_001)
Check availability for the specific room type variation across date range

Availability Check Process:

Extract Room Type ID: From Elasticsearch results, get room_type_id (e.g., "suite_ocean_001", "deluxe_001")
Generate Date List: Create list of dates from check-in to (check-out - 1 day) inclusive
- Critical: Hotel bookings are for nights, not days (see Section 6 for detailed explanation)
- Dates to check: [check_in_date, check_out_date - 1 day]
Build Hash Key: availability:hotel_id:room_type_id
HMGET Check: HMGET availability:hotel_id:room_type_id date1 date2... (all nights in range)
Validate Results: Check if all dates have non-null values with room_count > 0

Example Flow:

User Query: "Suite with ocean view in Los Angeles, check-in: 2024-06-15, check-out: 2024-06-17"

Elasticsearch:
  - Filters: city=Los Angeles, room_type="suite", view="ocean"
  - Returns: [hotel_123 (has suite_ocean_001), hotel_456 (has suite_ocean_001), hotel_789 (has suite_ocean_001)]
  - Each result includes: hotel_id, room_type_id="suite_ocean_001"

Redis Availability Check:
  - Dates to check: [2024-06-15, 2024-06-16] (see Section 6 for explanation of date range logic)
  - For hotel_123: HMGET availability:hotel_123:suite_ocean_001 "2024-06-15" "2024-06-16"
  - For hotel_456: HMGET availability:hotel_456:suite_ocean_001 "2024-06-15" "2024-06-16"
  - For hotel_789: HMGET availability:hotel_789:suite_ocean_001 "2024-06-15" "2024-06-16"

Results:
  - hotel_123: [value, value] → Available (both nights have suite with ocean view)
  - hotel_456: [value, null] → Unavailable (2024-06-16 has no suite with ocean view available)
  - hotel_789: [value, value] → Available

Capacity Validation:

If user needs multiple rooms (e.g., 2 suites with ocean view):
- Check room_count in hash values: {"room_count": 5, ...}
- Ensure room_count >= required_rooms for all dates in range
- Example: If user needs 2 rooms, but room_count is 1 → Unavailable
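
As a minimal sketch, the capacity rule is a per-night comparison on the parsed JSON values. The function name is hypothetical, and the input list mimics an HMGET reply.

```python
import json

def has_capacity(hmget_values, required_rooms):
    """True only if every night in the stay has at least required_rooms left."""
    for raw in hmget_values:
        if raw is None:                                   # no field for that night
            return False
        if json.loads(raw)["room_count"] < required_rooms:
            return False
    return True
```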

Price Validation:

Price comes from Elasticsearch room type data (Redis is exclusively for availability - see Section 6)
Check price from Elasticsearch results against user's budget
Ensure price is within budget for the selected room type variation

Filtering Process:

Hash Key Pattern: availability:hotel_id:room_type_id
Batch Operations: Use Redis pipeline to batch multiple HMGET commands
Early Exit: Stop checking if any date returns null (hotel unavailable)
Result Filtering: Remove hotels where room type is unavailable for any date in range

Result Merging and Ranking Strategy

Merging Process:

Elasticsearch Results: Ranked by relevance score
Redis Availability: Binary available/unavailable status
Intersection: Keep only available hotels from Elasticsearch results
Final Ranking: Maintain Elasticsearch relevance order for available hotels

Ranking Factors:

Search Relevance: BM25 score from Elasticsearch
Availability Status: Available hotels ranked higher
Price Competitiveness: Lower prices get slight boost
User Preferences: Historical booking patterns, loyalty status

Interview Tip: Streaming Results with Pagination

Once Elasticsearch returns ranked results, you don't need to wait for all availability checks before sending results to the UI. Instead, use a streaming/pagination approach:

Streaming Optimization:

Elasticsearch Returns: Ranked list of hotels (e.g., 500 hotels)
Batch Processing: Process hotels in batches of 10-20
Redis Check: Check availability for first batch (e.g., first 10 hotels)
Send to UI: Immediately send available hotels from first batch to UI
Continue Processing: While UI displays first page, check availability for next batch
Pagination: As user scrolls, send next batch of available hotels

Benefits:

Time to First Result (TTFR): User sees results faster (e.g., 30ms instead of 200ms)
Better UX: Progressive loading feels faster than waiting for all results
Reduced Latency: Don't check availability for hotels user might never see
Efficient Resource Usage: Only process what's needed for current page

Example Flow:

Elasticsearch: Returns 500 ranked hotels (25ms)
↓
Batch 1: Check availability for hotels 1-10 (5ms) → Send to UI
↓
Batch 2: Check availability for hotels 11-20 (5ms) → Cache for next page
↓
Batch 3: Check availability for hotels 21-30 (5ms) → Cache for next page
...
User scrolls → Load cached batch 2

Key Insight: Since results are already ranked by Elasticsearch, you can stream them in order. The UI only needs the first 10-20 results immediately, so there's no need to wait for all 500 hotels to be checked.
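
One way to express this is a generator that checks availability one batch at a time, so later batches are only processed when the caller asks for them. The function name and the is_available callback are hypothetical; in practice the callback would wrap a pipelined Redis check.

```python
from itertools import islice

def stream_available(ranked_hotels, is_available, batch_size=10):
    """Yield one list of available hotels per batch, preserving ranked order."""
    it = iter(ranked_hotels)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield [h for h in batch if is_available(h)]
```

Calling next() on the generator checks only the first batch, which is the property that keeps time to first result low.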

Performance Comparison

Execution Pattern	Elasticsearch	Redis	Total Time	Complexity
Sequential	25ms	10ms	35ms	Simple
Parallel	25ms	15ms	25ms	Medium
Batch Redis	25ms	5ms	30ms	Medium
Cached Results	5ms	2ms	7ms	High

Key Takeaway: Parallel execution cuts total latency by about 29% (35ms to 25ms), while batch Redis operations can reduce availability check time by 50% (10ms to 5ms), making the hybrid approach significantly faster than sequential processing.

8. APIs and Query Flow

Search API Endpoint Structure

Primary Endpoint: GET /api/v1/hotels/search

Query Parameters:

location (required): City name or coordinates
check_in (required): Check-in date (YYYY-MM-DD)
check_out (required): Check-out date (YYYY-MM-DD)
guests (optional): Number of guests (default: 2)
room_type (optional): Specific room type (deluxe, suite)
amenities (optional): Array of amenities (pool, spa, gym)
price_min/max (optional): Price range constraints
radius (optional): Search radius in kilometers (default: 30)
sort (optional): Sort order (relevance, price, rating, distance)
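
A sketch of how these parameters might be parsed and validated, covering the required fields and a few of the optional ones. The SearchParams dataclass and parse_search_params function are hypothetical names, and the input is assumed to be a plain dict of query-string values.

```python
from dataclasses import dataclass
from datetime import date

ALLOWED_SORTS = {"relevance", "price", "rating", "distance"}

@dataclass
class SearchParams:
    location: str
    check_in: date
    check_out: date
    guests: int = 2
    radius_km: float = 30.0
    sort: str = "relevance"

def parse_search_params(query: dict) -> SearchParams:
    missing = [k for k in ("location", "check_in", "check_out") if k not in query]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    params = SearchParams(
        location=query["location"],
        check_in=date.fromisoformat(query["check_in"]),    # expects YYYY-MM-DD
        check_out=date.fromisoformat(query["check_out"]),
        guests=int(query.get("guests", 2)),
        radius_km=float(query.get("radius", 30)),
        sort=query.get("sort", "relevance"),
    )
    if params.check_out <= params.check_in:
        raise ValueError("check_out must be after check_in")
    if params.sort not in ALLOWED_SORTS:
        raise ValueError(f"unsupported sort: {params.sort}")
    return params
```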

Step-by-Step Query Execution Flow

Phase 1: Query Processing (5ms)

Input Validation: Validate dates, location, parameters
Query Normalization: Standardize location names, date formats
Parameter Extraction: Extract search criteria and filters
Cache Check: Check for cached results

Phase 2: Elasticsearch Search (25ms)

Geographic Filter: Filter by location and radius
Amenity Filter: Filter by requested amenities
Room Type Filter: Nested queries for room type requirements
Text Search: Multi-match queries for hotel names and descriptions
Price Filter: Range queries for price constraints
Result Ranking: BM25 scoring and relevance ranking

Phase 3: Redis Availability Check (10ms)

Hotel ID Extraction: Get hotel IDs from Elasticsearch results
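Availability Keys: Build one hash key per hotel and room type (availability:hotel_id:room_type_id) plus the list of night fields for the date range
Batch Check: Pipelined HMGET operations, one per hotel
Availability Filtering: Remove unavailable hotels
Price Validation: Verify pricing within constraints using the Elasticsearch price data (Redis holds availability only)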

Phase 4: Result Assembly (5ms)

Result Merging: Combine Elasticsearch and Redis results
Final Ranking: Apply availability and pricing factors
Response Formatting: Structure response with hotel details
Cache Storage: Store results for future queries

Response Format with Availability and Pricing

Response Structure:

{
  "hotels": [
    {
      "hotel_id": "hotel_123",
      "name": "Luxury Downtown Hotel",
      "location": {
        "city": "Los Angeles",
        "coordinates": [34.0522, -118.2437],
        "address": "123 Main St"
      },
      "rating": 4.5,
      "amenities": ["pool", "spa", "gym"],
      "room_types": [
        {
          "type": "deluxe",
          "capacity": 2,
          "price_per_night": 299,
          "available": true,
          "rooms_left": 5
        }
      ],
      "total_price": 598,
      "search_score": 0.95
    }
  ],
  "total_results": 150,
  "search_time": 35,
  "cache_hit": false
}

Error Handling and Fallback Strategies

Error Scenarios:

Elasticsearch Unavailable: Fallback to database with reduced functionality
Redis Unavailable: Skip availability filtering, show all results
Timeout Errors: Return partial results with warning
Invalid Parameters: Return error with parameter validation details

Fallback Strategies:

Database Fallback: Use PostgreSQL for basic search when Elasticsearch fails
Cached Results: Serve cached results during outages
Degraded Mode: Reduce functionality but maintain basic search
Graceful Degradation: Show maintenance message with trending hotels
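
The first two error scenarios can be sketched as a thin wrapper around the search steps. The function and warning names are hypothetical, the three callables are injected stubs, and ConnectionError is only an assumed way for a client to signal an outage.

```python
def search_with_fallbacks(search_es, search_db, filter_by_redis, query):
    """Run the search, degrading step by step instead of failing the request."""
    warnings = []
    try:
        hotels = search_es(query)
    except ConnectionError:
        hotels = search_db(query)                      # basic search, reduced functionality
        warnings.append("search_degraded")
    try:
        hotels = filter_by_redis(hotels)
    except ConnectionError:
        warnings.append("availability_unverified")     # show all results, unfiltered
    return {"hotels": hotels, "warnings": warnings}
```

Returning the warnings alongside the results lets the UI tell the user that availability has not been verified, rather than silently showing rooms that may be gone.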

Monitoring and Alerting:

Latency Monitoring: Alert if response time > 100ms
Error Rate Monitoring: Alert if error rate > 5%
Cache Performance: Monitor cache hit rates and memory usage
Availability Monitoring: Track Elasticsearch and Redis health

Key Takeaway: The API design balances comprehensive search capabilities with performance optimization, using parallel execution and intelligent caching to maintain sub-100ms response times while providing rich search functionality and robust error handling.

Soundbite for interviews: "Hotel search requires a hybrid Elasticsearch + Redis architecture because no single system can efficiently handle both complex search/filtering and real-time availability updates. The key is using Elasticsearch for what it does best (search and filtering) and Redis for what it does best (real-time data), with parallel execution improving time to first results."

Hotel Booking Schema Design

Part 4: Payments and Ticket Issuance

Sumedh Bala — Thu, 23 Oct 2025 17:51:10 +0000

Idempotency, Consistency, and Race-Safe Design

This section focuses on the critical final steps of the user journey: securely processing payments and reliably issuing digital tickets. These operations demand high integrity, fault tolerance, and consistency to prevent double-charging, lost tickets, or overselling — while also protecting sensitive financial data and preventing race conditions.

Today, we're going to architect a production-grade solution using a modern, event-driven approach. Our design is built on three key principles:

Microservice Architecture: We'll separate our system into distinct services, each with one job.
Saga Pattern: We'll manage the complex, multi-step purchase process as a single, coordinated workflow.
Transactional Outbox: We'll use this pattern as the "atomic glue" that guarantees our services stay in sync, even if one of them fails.

The Saga Pattern

The Saga Pattern is a design pattern for managing distributed transactions across multiple services. Instead of using traditional ACID transactions (which don't work well across service boundaries), sagas break down complex business processes into a sequence of local transactions, each with a corresponding compensating action.

How it works:

Each step in the saga is a local transaction within a single service
If any step fails, the saga executes compensating transactions to undo previous steps
This ensures eventual consistency across the entire system
Perfect for complex workflows like: Reserve Seats → Process Payment → Generate Tickets → Send Notifications
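
A toy, in-process sketch of the idea: each step pairs an action with its compensating action, and a failure undoes the completed steps in reverse order. A real saga runs across services with durable state; run_saga and the step names here are hypothetical.

```python
def run_saga(steps):
    """steps: list of (action, compensation) callables. Returns True on success."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(done):   # undo completed steps, newest first
                undo()
            return False
        done.append(compensate)
    return True
```

Only steps that completed are compensated. The step that failed is assumed to have left no effects of its own to undo.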

Transactional Outbox Pattern

The Transactional Outbox Pattern solves the dual-write problem in distributed systems. When you need to update your database AND publish an event (like sending a notification), you can't do both atomically, because the database and the message broker are separate systems that do not share a transaction.

Why it's useful:

Atomicity: Write to database and outbox table in the same transaction
Reliability: Guarantees events are eventually published, even if the service fails
Consistency: Ensures database state and event publishing stay in sync
Resilience: Handles network failures, service restarts, and partial failures gracefully

How it works:

Application writes business data AND event data to the same database transaction
A separate process (outbox processor) reads from the outbox table
Publishes events to message queues (Kafka, RabbitMQ, etc.)
Marks events as processed to prevent duplicates

The central theme is separation of concerns. Let's see how it works.

The Core Architecture: Separating Our Concerns

We'll decompose our system into three primary services. The key is to define what each service is—and is not—responsible for.

Order Management Service: This is our "smart" orchestrator. It acts as the brain of the operation, managing the business workflow (the Saga). Its only concern is the state of an order (e.g., pending, paid, tickets_issued).
Payment Processing Service: This is a "dumb," single-purpose service. Its only concern is interacting with a payment gateway (like Stripe) and reporting success or failure. It knows nothing about seats or tickets.
Ticket Issuance Service: This is another "dumb" service. Its only concern is managing inventory (seats, sections) and generating ticket records. It knows nothing about credit cards or payment intents.

This separation is what allows us to scale and maintain the system. We can update our payment provider without ever touching the ticket generation code. If the ticket service is slow, it won't block new payments from being accepted.

But how do these separate services talk to each other reliably?

The Glue: A Saga for Distributed Transactions

You can't use a single database COMMIT across three different services. So, how do you guarantee that a successful payment always results in a ticket?

This is where the Saga and Transactional Outbox patterns come in.

The Saga: Our "Order Management Service" will manage the purchase as a step-by-step process.
1. Step 1: Tell the Payment Service to process a payment.
2. Step 2 (if success): Tell the Ticket Issuance Service to generate tickets.
3. Step 3 (if fail): Tell the Ticket Issuance Service (or Reservation Service) to release the held seats.
The Transactional Outbox: This is the mechanism that makes the Saga reliable. When the Payment Service confirms a payment, it can't just hope to send a message to the Ticket Issuance Service. What if it crashes right after it saves the payment to its own database? The payment would be successful, but the ticket would never be issued.

The Outbox Pattern solves this. In a single, atomic database transaction, the Payment Service does two things:
1. Updates the payments table to status = 'successful'.
2. Inserts a message into a transactional_outbox table in its own database.
Because this is one transaction, it's 100% atomic. It's impossible for the payment to be marked "successful" without the "issue tickets" message being created. A separate background process (a "message relay") then reliably publishes this message from the outbox to the rest of the system.
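
The atomic pair of writes can be demonstrated with a small runnable sketch. It uses sqlite3 as a stand-in for the service's PostgreSQL database, with simplified tables; the topic name payment.successful and the fail_before_commit flag are assumptions made for the illustration.

```python
import json
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE payments (payment_id TEXT PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE transactional_outbox (
        event_id TEXT PRIMARY KEY,
        topic    TEXT NOT NULL,
        payload  TEXT NOT NULL,
        status   TEXT NOT NULL DEFAULT 'pending'
    );
    INSERT INTO payments VALUES ('pay_1', 'initiated');
""")

def mark_payment_successful(payment_id, fail_before_commit=False):
    with db:   # both writes commit together or roll back together
        db.execute(
            "UPDATE payments SET status = 'successful' WHERE payment_id = ?",
            (payment_id,),
        )
        db.execute(
            "INSERT INTO transactional_outbox (event_id, topic, payload) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), "payment.successful", json.dumps({"payment_id": payment_id})),
        )
        if fail_before_commit:
            raise RuntimeError("simulated crash before commit")
```

When the simulated crash fires, neither the status change nor the outbox row survives, which is exactly the guarantee described above.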

How the Message Relay Works:
The message relay is a critical infrastructure component that bridges the gap between database transactions and message queues. It has two common implementation patterns:

Polling Pattern: A simple service that polls the transactional_outbox table every few seconds for status = 'pending' events, publishes them to Kafka/RabbitMQ, then marks them as status = 'sent' (the value defined in outbox_status_enum). This is reliable but has higher latency (2-5 seconds).
Change Data Capture (CDC) Pattern: A more advanced approach using tools like Debezium that tail the database's transaction log (WAL) and instantly convert table inserts into Kafka messages. This offers lower latency (milliseconds) but requires more sophisticated infrastructure.

Both patterns guarantee that every outbox event is eventually published, even if the service fails mid-process.
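
A minimal sketch of one polling pass, assuming an outbox table with event_id, topic, payload, status, and created_at columns. The relay_once function and the publish callback are hypothetical; publish stands in for a Kafka or RabbitMQ producer call, and sqlite3 stands in for PostgreSQL.

```python
import sqlite3

def relay_once(db: sqlite3.Connection, publish, batch_size=100):
    """Publish pending outbox rows in order, then mark each one 'sent'."""
    rows = db.execute(
        "SELECT event_id, topic, payload FROM transactional_outbox "
        "WHERE status = 'pending' ORDER BY created_at, event_id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(topic, payload)     # may raise; the row then stays 'pending'
        with db:
            db.execute(
                "UPDATE transactional_outbox SET status = 'sent' WHERE event_id = ?",
                (event_id,),
            )
    return len(rows)
```

If the relay crashes after publishing but before the UPDATE, the same event is published again on the next pass. Delivery is therefore at-least-once, and consumers need to tolerate duplicates.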

This is the ultimate separation of concerns. The Payment Service's only job is to update its own state and "leave a memo" in its outbox. It doesn't know or care who reads that memo, allowing our services to be beautifully decoupled.

The Blueprint: Our Production-Grade Database Schema

-- Create ENUM types for reusability
CREATE TYPE order_payment_status_enum AS ENUM (
    'pending',               -- Order created, awaiting payment
    'confirmed_optimistic',  -- Client-side /confirm received, awaiting webhook
    'paid',                  -- Authoritative webhook received
    'failed',                -- Payment failed
    'refunded'               -- Payment refunded
);

CREATE TYPE payment_transaction_status_enum AS ENUM (
    'initiated',             -- Payment intent created
    'confirmed_optimistic',  -- Client-side /confirm received
    'successful',            -- Authoritative webhook received
    'failed',                -- Payment failed
    'refunded'               -- Payment refunded
);

CREATE TYPE ticket_status_enum AS ENUM ('pending', 'issued', 'failed');
CREATE TYPE ticket_scan_status_enum AS ENUM ('active', 'scanned', 'canceled');

-- New ENUM for outbox processing
CREATE TYPE outbox_status_enum AS ENUM ('pending', 'sent');

-- Orders table with constraints
CREATE TABLE orders (
    order_id UUID PRIMARY KEY,
    reservation_id UUID REFERENCES reservations(reservation_id) UNIQUE,
    payment_status order_payment_status_enum NOT NULL DEFAULT 'pending',
    ticket_status ticket_status_enum NOT NULL DEFAULT 'pending',
    created_at TIMESTAMP DEFAULT now(),
    updated_at TIMESTAMP DEFAULT now()
);

-- Payments table with transaction tracking
CREATE TABLE payments (
    payment_id UUID PRIMARY KEY,
    order_id UUID NOT NULL REFERENCES orders(order_id),  -- NO CASCADE DELETE
    payment_gateway_id VARCHAR(100),
    payment_method_token TEXT NOT NULL,
    payment_intent_id VARCHAR(100) NULL,
    transaction_reference VARCHAR(100) NULL,
    amount_minor_units INT NOT NULL,
    currency CHAR(3) NOT NULL CHECK (char_length(currency) = 3),
    status payment_transaction_status_enum NOT NULL DEFAULT 'initiated',
    idempotency_key UUID UNIQUE NOT NULL,
    created_at TIMESTAMP DEFAULT now(),
    updated_at TIMESTAMP DEFAULT now()
);

-- Tickets table with seat/section exclusivity
CREATE TABLE tickets (
    ticket_id UUID PRIMARY KEY,
    order_id UUID NOT NULL REFERENCES orders(order_id),  -- NO CASCADE DELETE
    event_id UUID NOT NULL REFERENCES events(event_id),
    seat_id VARCHAR(50) REFERENCES seats(seat_id) NULL,
    section_id VARCHAR(50) REFERENCES sections(section_id) NULL,
    owner_user_id VARCHAR(50) REFERENCES users(user_id) NULL,
    qr_code_url TEXT,
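    status ticket_scan_status_enum NOT NULL DEFAULT 'active',
    scan_count INT NOT NULL DEFAULT 0,  -- incremented by POST /tickets/{ticket_id}/scan
    scanned_at TIMESTAMP NULL,
    issued_at TIMESTAMP DEFAULT now(),
    updated_at TIMESTAMP DEFAULT now(),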
    UNIQUE(order_id, seat_id),  -- prevent duplicate seat assignment within same order
    CHECK (
        (seat_id IS NOT NULL AND section_id IS NULL)
        OR (seat_id IS NULL AND section_id IS NOT NULL)
    )
);

-- Critical constraint: prevent double-booking across orders for ACTIVE tickets
CREATE UNIQUE INDEX idx_tickets_event_seat_unique_active
ON tickets (event_id, seat_id) 
WHERE seat_id IS NOT NULL AND status IN ('active', 'scanned');

-- NEW: Transactional Outbox Table
-- This table guarantees atomic event publishing
CREATE TABLE transactional_outbox (
    event_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(255) NOT NULL,
    aggregate_id VARCHAR(255) NOT NULL,
    topic VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    status outbox_status_enum NOT NULL DEFAULT 'pending',
    created_at TIMESTAMP WITH TIME ZONE DEFAULT now()
);

-- Index for the message relay to efficiently find pending events
CREATE INDEX idx_outbox_pending_status ON transactional_outbox (status)
WHERE status = 'pending';


-- Indexes for performance
-- orders.reservation_id is already indexed by its UNIQUE constraint, so no separate index is needed
CREATE INDEX idx_payments_order_id ON payments (order_id);
CREATE INDEX idx_tickets_event_status ON tickets (event_id, status);
CREATE INDEX idx_tickets_owner ON tickets (owner_user_id);
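
The schema's partial index on pending outbox rows implies a relay that drains the table. As a minimal sketch of what such a relay might look like, assuming a simple polling loop and a hypothetical `publish` callable standing in for the message-bus client (a CDC tool such as Debezium can replace polling entirely), the following uses an in-memory SQLite table as a trimmed stand-in for transactional_outbox so the example runs anywhere:

```python
import json
import sqlite3
from typing import Callable


def relay_pending_events(conn: sqlite3.Connection,
                         publish: Callable[[str, dict], None],
                         batch_size: int = 100) -> int:
    """Publish pending outbox rows in creation order, then mark them 'sent'.

    Delivery is at-least-once: a crash after publish() but before the UPDATE
    commits causes a re-publish, so consumers must be idempotent.
    """
    rows = conn.execute(
        "SELECT event_id, topic, payload FROM transactional_outbox "
        "WHERE status = 'pending' ORDER BY created_at, event_id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # hypothetical message-bus client
        conn.execute(
            "UPDATE transactional_outbox SET status = 'sent' WHERE event_id = ?",
            (event_id,),
        )
        conn.commit()
    return len(rows)


# Demo with an in-memory stand-in for the PostgreSQL table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactional_outbox ("
    "event_id INTEGER PRIMARY KEY, topic TEXT NOT NULL, payload TEXT NOT NULL, "
    "status TEXT NOT NULL DEFAULT 'pending', created_at INTEGER NOT NULL)"
)
conn.execute(
    "INSERT INTO transactional_outbox (topic, payload, created_at) VALUES (?, ?, ?)",
    ("order.paid", json.dumps({"order_id": "order_789"}), 1),
)
conn.commit()

published = []
count = relay_pending_events(conn, lambda topic, body: published.append((topic, body)))
```

Marking a row as sent only after the publish call returns is what makes the relay safe to restart at any point, at the cost of occasional duplicates.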

The API & Workflow: A Step-by-Step Breakdown

Here is the complete user flow, showing how each API call and background process works within our separated architecture.

Payment Processing APIs

1. Initiate Payment

This is the first step. The client asks the Payment Service to create a payment intent. No money is charged yet - this only creates the payment intent with Stripe.

Endpoint: POST /payments
Request Body:

{
  "order_id": "order_789",
  "payment_method_token": "pm_123456789",
  "idempotency_key": "idem_987654321"
}

Response:

{
  "payment_id": "pay_123456789",
  "status": "initiated",
  "payment_intent_id": "pi_stripe_abc123"
}

Associated SQL Operations:

-- 1. Check for idempotency
SELECT payment_id, status 
FROM payments 
WHERE idempotency_key = 'idem_987654321';

-- 2. Get server-side price (NEVER trust the client)
SELECT 
    r.total_amount_minor_units, r.currency
FROM orders o
JOIN reservations r ON o.reservation_id = r.reservation_id
WHERE o.order_id = 'order_789';

-- 3. [Application calls Stripe API with the server-side amount]

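-- 4. Create the payment record, copying the amount and currency from the
--    reservation inside the statement so the stored values are server-side
INSERT INTO payments (
    payment_id, order_id, payment_method_token, payment_intent_id,
    amount_minor_units, currency, status, idempotency_key, created_at
)
SELECT
    'pay_123456789', o.order_id, 'pm_123456789', 'pi_stripe_abc123',
    r.total_amount_minor_units,  -- ✅ Server-validated amount
    r.currency,                  -- ✅ Server-validated currency
    'initiated', 'idem_987654321', NOW()
FROM orders o
JOIN reservations r ON o.reservation_id = r.reservation_id
WHERE o.order_id = 'order_789';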

-- ⚠️ NOTE: This creates the PaymentIntent but does NOT charge the user yet.
-- The actual payment happens on the client-side when the user confirms.
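
The idempotency check and the insert above are separate statements, so two concurrent retries could both pass the check. One way to close that gap, sketched here under the assumption that the application relies on the UNIQUE constraint on idempotency_key rather than on the preliminary SELECT, is to attempt the insert and fall back to reading the existing row. The function name, the in-memory SQLite table and the amount are illustrative only, and the gateway call is omitted:

```python
import sqlite3
import uuid


def create_payment(conn, order_id, idempotency_key, amount_minor_units, currency):
    """Insert a payment once per idempotency key; replays return the first row.

    Returns (payment_id, status, created) where created is False for a replay.
    """
    try:
        with conn:  # commits on success, rolls back on error
            payment_id = str(uuid.uuid4())
            conn.execute(
                "INSERT INTO payments (payment_id, order_id, amount_minor_units, "
                "currency, status, idempotency_key) VALUES (?, ?, ?, ?, 'initiated', ?)",
                (payment_id, order_id, amount_minor_units, currency, idempotency_key),
            )
            return payment_id, "initiated", True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint fired: this key was already used, so return
        # the original result instead of creating a second payment.
        payment_id, status = conn.execute(
            "SELECT payment_id, status FROM payments WHERE idempotency_key = ?",
            (idempotency_key,),
        ).fetchone()
        return payment_id, status, False


# Demo with an in-memory stand-in for the payments table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (payment_id TEXT PRIMARY KEY, order_id TEXT NOT NULL, "
    "amount_minor_units INTEGER NOT NULL, currency TEXT NOT NULL, "
    "status TEXT NOT NULL, idempotency_key TEXT NOT NULL UNIQUE)"
)
first = create_payment(conn, "order_789", "idem_987654321", 15000, "USD")
retry = create_payment(conn, "order_789", "idem_987654321", 15000, "USD")
```

A retried request gets back the same payment_id as the first call, and only one row exists afterwards.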

Complete Payment Flow:

POST /payments (this endpoint): Creates PaymentIntent, returns client_secret to browser
Client-Side: Browser uses client_secret with Stripe.js to show payment form
Client-Side: User clicks "Pay" → stripe.confirmCardPayment() is called → This is when money moves!
Client-Side: On success, browser calls POST /payments/{id}/confirm (optimistic UX)
Server-Side: Stripe sends POST /webhooks/payments (authoritative confirmation)

2. Payment Webhook (The Asynchronous Source of Truth)

This is the most important endpoint. It's called by Stripe's servers, not the user. It is the authoritative record of a transaction.

Endpoint: POST /webhooks/payments
Headers: X-Webhook-Signature: <HMAC_SHA256>

Hardened Implementation Logic:

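Verify Signature: Cryptographically verify the signature and reject the request if it does not match.
Queue for Processing: Durably place the event on an internal queue.
Respond Quickly: Return a 200 OK to Stripe as soon as the event is safely queued, so that a crash cannot lose an event that was already acknowledged.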
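
The signature check can be sketched with the standard library alone. This assumes the X-Webhook-Signature header carries a hex-encoded HMAC-SHA256 of the raw request body under a shared secret; real gateways usually sign a timestamp as well to limit replay, so the gateway's own documentation decides the exact scheme. The secret and body below are made-up values:

```python
import hashlib
import hmac


def sign_payload(secret: bytes, raw_body: bytes) -> str:
    """Hex HMAC-SHA256 of the raw request body (assumed signing scheme)."""
    return hmac.new(secret, raw_body, hashlib.sha256).hexdigest()


def verify_webhook_signature(secret: bytes, raw_body: bytes, signature_hex: str) -> bool:
    """Constant-time comparison of the expected and received signatures."""
    expected = sign_payload(secret, raw_body)
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


secret = b"whsec_example"  # hypothetical shared secret
body = b'{"type":"payment_intent.succeeded"}'
signature = sign_payload(secret, body)

valid = verify_webhook_signature(secret, body, signature)
tampered = verify_webhook_signature(secret, body + b" ", signature)
```

Verification must run against the raw bytes as received; re-serializing parsed JSON can change whitespace and break the comparison.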

Asynchronous Handler SQL (Processing the event from the internal queue):
This logic demonstrates our separation of concerns perfectly. The Payment Service's only job is to update its own tables and (atomically) create an outbox event. It has no knowledge of how to issue a ticket.

-- CASE 1: Payment Successful (e.g., 'payment_intent.succeeded')
BEGIN;

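    -- Step 1: Update the payment record idempotently.
    UPDATE payments
    SET
        status = 'successful',
        transaction_reference = 'txn_987654321', -- from webhook
        updated_at = NOW()
    WHERE payment_intent_id = 'pi_stripe_abc123'
      AND status IN ('initiated', 'confirmed_optimistic');

    -- [Application Logic]
    -- IF Step 1 updated 0 rows, this webhook is a duplicate (or arrived out of order):
    -- ROLLBACK and acknowledge it without running Steps 2-4, so that no second
    -- 'order.paid' event is written to the outbox.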

    -- Step 2: Update the master order status.
    UPDATE orders 
    SET 
        payment_status = 'paid',
        updated_at = NOW()
    WHERE order_id = (SELECT order_id FROM payments WHERE payment_intent_id = 'pi_stripe_abc123')
      AND payment_status IN ('pending', 'confirmed_optimistic');

    -- Step 3: Update reservation status to confirmed
    UPDATE reservations 
    SET 
        status = 'confirmed',
        confirmed_at = NOW(),
        updated_at = NOW()
    WHERE reservation_id = (
        SELECT reservation_id FROM orders WHERE order_id = (
            SELECT order_id FROM payments WHERE payment_intent_id = 'pi_stripe_abc123'
        )
    )
      AND status = 'pending_payment';

    -- Step 4: Create the outbox event to trigger Ticket Issuance
    -- This is ATOMIC with the reservation update.
    INSERT INTO transactional_outbox (
        aggregate_type, aggregate_id, topic, payload
    )
    VALUES (
        'order',
        (SELECT order_id FROM payments WHERE payment_intent_id = 'pi_stripe_abc123'),
        'order.paid',
        '{ ... webhook payload ... }'
    );

COMMIT;


-- CASE 2: Payment Failed (e.g., 'payment_intent.payment_failed')
BEGIN;

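    -- Step 1: Update payment to failed
    UPDATE payments
    SET status = 'failed', updated_at = NOW()
    WHERE payment_intent_id = 'pi_stripe_abc123'
      AND status IN ('initiated', 'confirmed_optimistic');

    -- [Application Logic]
    -- IF Step 1 updated 0 rows, the payment was already finalized:
    -- ROLLBACK and skip Steps 2-4 so no duplicate 'order.failed' event is emitted.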

    -- Step 2: Update order to failed
    UPDATE orders 
    SET payment_status = 'failed', updated_at = NOW()
    WHERE order_id = (SELECT order_id FROM payments WHERE payment_intent_id = 'pi_stripe_abc123')
      AND payment_status IN ('pending', 'confirmed_optimistic');

    -- Step 3: Update reservation status to canceled
    UPDATE reservations 
    SET 
        status = 'canceled',
        canceled_at = NOW(),
        updated_at = NOW()
    WHERE reservation_id = (
        SELECT reservation_id FROM orders WHERE order_id = (
            SELECT order_id FROM payments WHERE payment_intent_id = 'pi_stripe_abc123'
        )
    )
      AND status = 'pending_payment';

    -- Step 4: Create outbox event to trigger reservation release
    INSERT INTO transactional_outbox (
        aggregate_type, aggregate_id, topic, payload
    )
    VALUES (
        'order',
        (SELECT order_id FROM payments WHERE payment_intent_id = 'pi_stripe_abc123'),
        'order.failed',
        '{ ... failure details ... }'
    );

COMMIT;

3. Confirm Payment (The Optimistic UX Endpoint)

This endpoint is only for user experience. It's called by the user's browser to give them an instant success screen, without waiting for the (slower) webhook.

Endpoint: POST /payments/{payment_id}/confirm
Response:

{
  "payment_id": "pay_123456789",
  "status": "confirmed_optimistic",
  "message": "Payment confirmed. Your tickets will be issued shortly."
}

Associated SQL Operations:
This transaction is deliberately fast and lightweight. It only updates the status to confirmed_optimistic. It does not trigger ticket issuance.

BEGIN;

    -- Step 1: Mark the payment as optimistically confirmed
    UPDATE payments 
    SET 
        status = 'confirmed_optimistic',
        updated_at = NOW()
    WHERE payment_id = 'pay_123456789'
      AND status = 'initiated';

    -- Step 2: Mark the order as optimistically confirmed
    UPDATE orders 
    SET 
        payment_status = 'confirmed_optimistic',
        updated_at = NOW()
    WHERE order_id = (SELECT order_id FROM payments WHERE payment_id = 'pay_123456789')
      AND payment_status = 'pending';

COMMIT;
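
The `AND status = ...` guards in these statements amount to a small state machine: a status may only move forward, so a late /confirm call can never overwrite an authoritative webhook result. As a rough illustration, the transition table below is read off the WHERE clauses in this article and the refund query's `status = 'successful'` filter; the function name is hypothetical:

```python
# Target status -> set of statuses it may be reached from.
ALLOWED_FROM = {
    "confirmed_optimistic": {"initiated"},
    "successful": {"initiated", "confirmed_optimistic"},
    "failed": {"initiated", "confirmed_optimistic"},
    "refunded": {"successful"},
}


def transition(current: str, target: str) -> str:
    """Return the new status, or the unchanged status if the move is not allowed."""
    if current in ALLOWED_FROM.get(target, set()):
        return target
    return current


# The webhook arrived first; the browser's optimistic confirm is then a no-op.
late_confirm = transition("successful", "confirmed_optimistic")
normal_confirm = transition("initiated", "confirmed_optimistic")
webhook_after_confirm = transition("confirmed_optimistic", "successful")
```

In SQL the same rule is expressed by the guarded UPDATE affecting zero rows.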

The Saga Continues: Asynchronous Ticket Issuance

This is not an API. This is where our separation of concerns pays off. The Ticket Issuance Service has one job: listen for order.paid events and make tickets.

Trigger: Consuming an order.paid event from the message bus (which was reliably published by the Payment Service's outbox).

Associated SQL Operations (Deadlock-Proofed):
This is the big, complex transaction. By moving it to a background worker, we prevent it from ever blocking our payment flow. If it fails, it can safely retry without losing data.

BEGIN;

    -- Step 1: IDEMPOTENCY CHECK - Prevent duplicate ticket generation
    -- This makes the entire operation idempotent against message re-delivery
    SELECT ticket_status 
    FROM orders
    WHERE order_id = 'order_789'
    FOR UPDATE; -- Lock the order row

    -- [Application Logic]
    -- IF ticket_status is already 'issued', COMMIT immediately and ACK the message.
    -- The order is already processed - no need to generate tickets again.

    -- IF ticket_status is 'pending', proceed with the rest of the logic...

    -- Step 2: Lock the parent reservation
    SELECT *
    FROM reservations r
    WHERE r.reservation_id = (
        SELECT reservation_id FROM orders WHERE order_id = 'order_789'
    )
    AND r.status = 'confirmed'
    FOR UPDATE;

    -- Step 3: CRITICAL: Lock seats in a consistent order
    -- This (ORDER BY s.seat_id) is the key to preventing deadlocks
    SELECT *
    FROM seats s
    WHERE s.seat_id IN (
        SELECT rs.seat_id 
        FROM reservation_seats rs 
        JOIN orders o ON rs.reservation_id = o.reservation_id
        WHERE o.order_id = 'order_789'
    )
    AND s.status = 'reserved'
    ORDER BY s.seat_id  -- This deterministic ordering is vital
    FOR UPDATE;

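    -- Step 4: Get reservation details for ticket generation
    -- (UNION ALL avoids the seat x GA cross product that two LEFT JOINs
    --  produce when a reservation holds both reserved seats and GA slots)
    SELECT rs.seat_id, NULL AS section_id, 1 AS quantity, r.event_id
    FROM reservations r
    JOIN reservation_seats rs ON r.reservation_id = rs.reservation_id
    WHERE r.reservation_id = (
        SELECT reservation_id FROM orders WHERE order_id = 'order_789'
    )
    UNION ALL
    SELECT NULL AS seat_id, rsg.section_id, rsg.quantity, r.event_id
    FROM reservations r
    JOIN reservation_ga rsg ON r.reservation_id = rsg.reservation_id
    WHERE r.reservation_id = (
        SELECT reservation_id FROM orders WHERE order_id = 'order_789'
    );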

    -- Step 5: Generate individual tickets (example)
    INSERT INTO tickets (
        ticket_id, order_id, event_id, seat_id, 
        section_id, qr_code_url, status, issued_at
    ) VALUES 
        ('TKT-ABC123DEF456', 'order_789', 'event_456', 'A-101', NULL, '...', 'active', NOW()),
        ('TKT-JKL345MNO678', 'order_789', 'event_456', NULL, 'GA1', '...', 'active', NOW());

    -- Step 6: Atomically update seats from 'reserved' to 'sold'
    UPDATE seats 
    SET status = 'sold', updated_at = NOW()
    WHERE seat_id IN (
        SELECT rs.seat_id 
        FROM reservation_seats rs 
        JOIN orders o ON rs.reservation_id = o.reservation_id
        WHERE o.order_id = 'order_789'
    )
      AND status = 'reserved';

    -- Step 7: Atomically update reservation to 'completed'
    UPDATE reservations 
    SET status = 'completed', updated_at = NOW()
    WHERE reservation_id = (
        SELECT reservation_id FROM orders WHERE order_id = 'order_789'
    )
      AND status = 'confirmed';

    -- Step 8: Update order ticket status
    UPDATE orders 
    SET ticket_status = 'issued', updated_at = NOW()
    WHERE order_id = 'order_789';

    -- Step 9: Create outbox event to notify user via email
    INSERT INTO transactional_outbox (
        aggregate_type, aggregate_id, topic, payload
    )
    VALUES (
        'order',
        'order_789',
        'tickets.issued',
        '{ ... ticket URLs and delivery info ... }'
    );

COMMIT;
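
The deadlock argument behind `ORDER BY s.seat_id ... FOR UPDATE` is that every worker acquires its locks in the same global order, so no two workers can each hold a lock the other is waiting for. A small in-process sketch of that idea, using threading locks as stand-ins for row locks (the seat IDs and function name are illustrative):

```python
import threading

seat_locks = {seat_id: threading.Lock() for seat_id in ("A-101", "A-102", "A-103")}


def issue_tickets(seat_ids, log):
    """Acquire every seat lock in sorted order, do the work, then release."""
    ordered = sorted(seat_ids)  # same global order for every worker
    for seat_id in ordered:
        seat_locks[seat_id].acquire()
    try:
        log.append(tuple(ordered))  # stand-in for the ticket inserts
    finally:
        for seat_id in reversed(ordered):
            seat_locks[seat_id].release()


log = []
workers = [
    threading.Thread(target=issue_tickets, args=(["A-101", "A-103"], log)),
    threading.Thread(target=issue_tickets, args=(["A-103", "A-101"], log)),
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join(timeout=5)
```

Without the `sorted` call, the two workers above could each grab one seat and wait forever on the other.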

Supporting Operations

Ticket Status Update (Scan)

Endpoint: POST /tickets/{ticket_id}/scan
Associated SQL Operations:

BEGIN;
    -- Lock the specific ticket row to prevent double-scans
    SELECT ticket_id, status
    FROM tickets 
    WHERE ticket_id = 'TKT-ABC123DEF456'
    FOR UPDATE;

    -- Check status and update (if active)
    UPDATE tickets 
    SET 
        status = 'scanned',
        scan_count = scan_count + 1,
        scanned_at = NOW(),
        updated_at = NOW()
    WHERE ticket_id = 'TKT-ABC123DEF456'
      AND status = 'active';
COMMIT;

Event Cancellation Refunds (Bulk Refunds)

Endpoint: POST /refunds/event-cancellation
Associated SQL Operations:

-- This query is CRITICAL: It selects the `transaction_reference`
-- which is the *actual charge ID* needed by the gateway to process a refund.
SELECT 
    p.payment_id, p.transaction_reference, p.amount_minor_units
FROM payments p
JOIN orders o ON p.order_id = o.order_id
JOIN reservations r ON o.reservation_id = r.reservation_id
WHERE r.event_id = 'event_456'
  AND p.status = 'successful'
  AND p.transaction_reference IS NOT NULL;

(The application then loops over these results, calls the gateway for each refund, and updates the payments and orders tables to status = 'refunded'.)

Verify Anonymous Refund Request

Endpoint: POST /refunds/verify
Associated SQL Operations:

-- 1. Validate order exists and is eligible for refund
SELECT 
    o.order_id, o.payment_status
FROM orders o
JOIN reservations r ON o.reservation_id = r.reservation_id
JOIN payments p ON o.order_id = p.order_id
WHERE o.order_id = 'order_789'
  AND o.payment_status = 'paid'
  AND p.status = 'successful';

-- 2. [Application Logic: Check rate limits]

-- 3. Generate and store verification code
INSERT INTO refund_verifications (
    verification_id, order_id, contact_method, 
    contact_value, verification_code, 
    expires_at, created_at
) VALUES (
    'ver_001', 'order_789', 'email',
    'user@example.com', '123456', -- Securely generated OTP
    NOW() + INTERVAL '15 minutes', NOW()
);
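
For the "securely generated OTP" noted above, a minimal sketch using Python's `secrets` module follows. Storing only a hash of the code instead of the code itself is an assumption of this example, not something the schema above requires, and the function names are hypothetical:

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta, timezone


def generate_verification_code(ttl_minutes: int = 15):
    """Return (code, code_hash, expires_at); only the hash needs to be stored."""
    code = f"{secrets.randbelow(10**6):06d}"  # 6 digits from a CSPRNG
    code_hash = hashlib.sha256(code.encode()).hexdigest()
    expires_at = datetime.now(timezone.utc) + timedelta(minutes=ttl_minutes)
    return code, code_hash, expires_at


def check_verification_code(submitted, stored_hash, expires_at, now=None):
    """Reject expired codes, then compare hashes in constant time."""
    now = now or datetime.now(timezone.utc)
    if now >= expires_at:
        return False
    submitted_hash = hashlib.sha256(submitted.encode()).hexdigest()
    return hmac.compare_digest(submitted_hash, stored_hash)


code, code_hash, expires_at = generate_verification_code()
wrong_code = "000000" if code != "000000" else "000001"
ok = check_verification_code(code, code_hash, expires_at)
bad = check_verification_code(wrong_code, code_hash, expires_at)
late = check_verification_code(code, code_hash, expires_at,
                               now=expires_at + timedelta(seconds=1))
```

A six-digit code has only a million possibilities, so the rate limit in step 2 is what actually prevents guessing.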

Our Design's Superpowers

By strictly separating concerns and using the Saga + Transactional Outbox patterns, our system is now:

Resilient: If the Ticket Issuance Service fails, payments are still safely accepted. The order.paid events will simply wait in the outbox to be processed when the service recovers.
Consistent: We have an atomic guarantee that a committed payment will always result in an event to generate a ticket. It's impossible to charge a customer and forget to issue their ticket.
Scalable: We can add 100 more Ticket Issuance workers to handle a massive on-sale event without ever needing to touch the Payment Service.
Maintainable: Our services are small and focused. Our Payment Service team can work on compliance and gateways, while the Ticket Issuance team can optimize for inventory and concurrency. They don't need to be in the same meetings.

Further Improving Consistency: The Reconciliation Service

While our Saga + Transactional Outbox patterns provide strong consistency guarantees, production systems can benefit from an additional layer of validation. A Reconciliation Service can further enhance data consistency across all system components by continuously monitoring and validating that all financial transactions, seat allocations, and ticket issuances are properly synchronized.

Why it's valuable as an additional safeguard:

Financial Accuracy: Ensures payment amounts match between our system and payment gateways
Inventory Consistency: Verifies seat allocations match between reservations and actual seat status
Ticket Integrity: Confirms all issued tickets correspond to valid, paid orders
Audit Compliance: Provides complete audit trails for regulatory requirements
Discrepancy Detection: Identifies and flags inconsistencies before they impact customers

What it reconciles:

Payment Gateway vs Database: Stripe/PayPal transaction records vs our payment records
Reservation vs Seat Status: Reserved seats in our system vs actual seat availability
Order vs Ticket Count: Number of tickets issued vs order quantities
Revenue vs Transactions: Total revenue calculations vs individual payment sums
Refund Consistency: Refund amounts vs original payment amounts

How it works:

Scheduled Reconciliation: Runs every 15 minutes to check recent transactions
Real-time Monitoring: Continuously validates critical operations (payments, ticket generation)
Cross-system Validation: Compares data across payment gateways, databases, and external systems
Automated Correction: Fixes minor discrepancies automatically (e.g., status updates)
Alert Generation: Flags major discrepancies for manual investigation
Audit Reporting: Generates comprehensive reports for compliance and analysis

This reconciliation service acts as a safety net, providing an additional layer of confidence that our distributed system maintains data integrity even under extreme load and complex failure scenarios.
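
At its core, the "Payment Gateway vs Database" check is a keyed comparison of two record sets. A minimal sketch, assuming both sides can be loaded into dictionaries keyed by transaction reference (the record shape, discrepancy labels and sample data are invented for illustration):

```python
def reconcile_payments(gateway_records: dict, db_records: dict) -> list:
    """Compare gateway charges with local payments keyed by transaction reference."""
    discrepancies = []
    for ref, gateway in gateway_records.items():
        local = db_records.get(ref)
        if local is None:
            discrepancies.append((ref, "missing_in_db"))
        elif local["amount_minor_units"] != gateway["amount_minor_units"]:
            discrepancies.append((ref, "amount_mismatch"))
        elif local["status"] != gateway["status"]:
            discrepancies.append((ref, "status_mismatch"))
    for ref in db_records.keys() - gateway_records.keys():
        discrepancies.append((ref, "missing_at_gateway"))
    return sorted(discrepancies)


gateway_records = {
    "txn_1": {"amount_minor_units": 5000, "status": "successful"},
    "txn_2": {"amount_minor_units": 7500, "status": "successful"},
    "txn_3": {"amount_minor_units": 1200, "status": "successful"},
}
db_records = {
    "txn_1": {"amount_minor_units": 5000, "status": "successful"},
    "txn_2": {"amount_minor_units": 7000, "status": "successful"},
    "txn_4": {"amount_minor_units": 900, "status": "successful"},
}
report = reconcile_payments(gateway_records, db_records)
```

Status mismatches are the kind of minor discrepancy that could be corrected automatically, while amount mismatches and missing records would more likely go to manual review.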

Review by Google Gemini

Part 3: Seat Management

Sumedh Bala — Thu, 23 Oct 2025 17:48:52 +0000

Reservations, Locking, Availability & Queuing

Seat management ensures users can reliably reserve seats without conflicts, overselling, or double-booking. It handles reserved vs General Admission seats, tracks reservations until payment, manages concurrency, and provides real-time availability updates. Queuing prevents overload during high-demand events.

1. Baseline Functional Solution

Service Responsibilities & API Mapping

Service	Primary Responsibilities	API Endpoints	Data Sources
Seat Management Service	Seat reservations, locking, cancellations, expired cleanup	POST /events/{event_id}/reservations	Primary Database, Redis Cache
Queue Management Service	User queuing, position tracking, admission control	POST /events/{event_id}/queue GET /events/{event_id}/queue/status	Redis Queue, Database
Availability Aggregation Service	Section counts, real-time availability tracking	GET /events/{event_id}/sections	Message Bus, Redis Cache
Real-time Notification Service	SSE connections for live updates	SSE /events/{event_id}/sections/updates	Redis Pub/Sub, Message Bus

Databases & Caches

Primary Database (PostgreSQL / MySQL / DynamoDB): Stores canonical seat and reservation state.
Reservation Store (DynamoDB / Redis / PostgreSQL): Tracks in-progress reservations. Only stores queue_token for anonymous users; logged-in users are identified via JWT.
Cache (Redis / Memcached): Hot lookups for section counts, seat maps, and user queue positions.
Message Bus (Kafka / Pulsar / Kinesis / Pub/Sub): Streams reservation events for real-time updates and availability aggregation.
Queueing Infrastructure: Queued requests for high-demand events. Distributed coordination ensures unique positions and queue ordering.
Real-Time Updates Infrastructure: In-memory counters maintain section-level seats_remaining. Fanout layer scales push notifications to many clients.

Example Database Schema

Note: The following are simplified schema overviews for understanding the basic structure.

Seats Table

seat_id – string (primary key, e.g., "A-101")
event_id – string (foreign key)
section_id – string (foreign key)
row – string
seat_number – string
status – ENUM(available, reserved, sold)

Reservations Table (Header)

reservation_id – UUID (primary key)
event_id – string (foreign key)
queue_token – string (nullable, only for anonymous users)
user_id – string (foreign key, nullable, for logged-in users)
status – ENUM(pending_payment, confirmed, completed, canceled, expired)
total_amount_minor_units – integer (amount in smallest currency unit to avoid floating point precision issues)
currency – string (e.g., "USD")
payment_intent_id – string (nullable)
created_at – TIMESTAMP
expires_at – TIMESTAMP
confirmed_at – TIMESTAMP (nullable)

Reservation_Seats Table (Reserved Seats)

reservation_seat_id – UUID (primary key)
reservation_id – UUID (foreign key to reservations)
seat_id – string (foreign key to seats)
created_at – TIMESTAMP

Reservation_GA Table (General Admission)

reservation_ga_id – UUID (primary key)
reservation_id – UUID (foreign key to reservations)
section_id – string (foreign key to sections)
quantity – integer (number of GA tickets)
created_at – TIMESTAMP

Sections Table

section_id – string (primary key, e.g., "section_A")
event_id – string (foreign key)
name – string (e.g., "104")
capacity – integer
seats_remaining – integer (updated in real time)

Queue Table

queue_id – UUID (primary key)
event_id – FK
user_id – nullable (logged-in)
queue_token – nullable (anonymous, signed)
status – ENUM(waiting, ready, expired, completed)
created_at, last_updated_at – timestamps

2. Reservation Flow

Enter Queue (High-Demand Events)

Anonymous users receive a queue_token.
Logged-in users are identified via JWT; no token issued.
If the queue is empty, users are immediately ready to reserve seats.

Browse Section Availability

Anyone can fetch aggregated section counts before joining the queue.
Individual reserved seat details require ready status (queue_token or JWT).

Reserve Seats / General Admission Slots

Must be ready in the queue.
Backend validates queue readiness via queue_token (anonymous) or JWT/account ID (logged-in).
Reserved seats have TTL until payment is completed.

Real-Time Updates

Users receive live updates on section availability via Server-Sent Events (SSE).
Anonymous users subscribe with queue_token.
Logged-in users subscribe with JWT.

3. APIs (in order of user flow)

Enter Queue

Endpoint: POST /events/{event_id}/queue
Request Body: {}
Response:

{
  "queue_token": "<signed_token>",
  "estimated_wait_time": 0,
  "position_in_queue": 0,
  "status": "ready"
}

Notes: Logged-in users receive no token; identity tracked via JWT.

Queue Status

Endpoint: GET /events/{event_id}/queue/status
Headers: Authorization: Bearer <queue_token> (anonymous) OR Authorization: Bearer <JWT> (logged-in)
Response:

{
  "position_in_queue": 123,
  "status": "waiting"
}

Notes: All users must check queue status before reserving seats during high-demand events.

Get Section Availability

Endpoint: GET /events/{event_id}/sections
Headers: Optional
Response:

{
  "sections": [
    { "section_id": "A", "total_seats": 500, "available_seats": 120 },
    { "section_id": "B", "total_seats": 800, "available_seats": 300 },
    { "section_id": "GA1", "type": "General Admission", "available_seats": 2000 }
  ]
}

Notes: Anyone can fetch counts even before joining the queue. Reserved seat details are not returned.

Get Seats in a Section (Reserved Seating Only, Paginated)

Endpoint: GET /events/{event_id}/sections/{section_id}/seats
Headers: Authorization: Bearer <queue_token> (anonymous, ready) OR Authorization: Bearer <JWT> (logged-in, ready)
Query Params: page=1, page_size=50
Response:

{
  "section_id": "A",
  "total_seats": 500,
  "page": 1,
  "page_size": 50,
  "seats": [
    { "seat_id": "A-101", "row": "A", "number": 1, "status": "available" },
    { "seat_id": "A-102", "row": "A", "number": 2, "status": "reserved" },
    { "seat_id": "A-103", "row": "A", "number": 3, "status": "available" }
  ]
}

Reserve Seats / General Admission Slots

Endpoint: POST /events/{event_id}/reservations
Headers: Authorization: Bearer <queue_token> (anonymous) OR Authorization: Bearer <JWT> (logged-in)
Request Body:

{
  "seats": ["A-101","A-103"],
  "ga_section": "GA1",
  "ga_quantity": 2
}

Database Operations:

Create Reservation Header: Insert into reservations table
Link Reserved Seats: Insert into reservation_seats table for each seat
Link GA Quantities: Insert into reservation_ga table for GA sections
Update Seat Status: Mark seats as 'reserved' in seats table
Update Section Counts: Decrement seats_remaining in sections table

Response:

{
  "reservation_id": "res_789",
  "reserved_until": "2025-09-02T12:35:00Z",
  "seats": ["A-101","A-103"],
  "ga_section": "GA1",
  "ga_quantity": 2,
  "status": "pending_payment"
}

Notes: Backend validates queue readiness. TTL ensures release if payment isn't completed.
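
The "Update Seat Status" step has to be all-or-nothing: either every requested seat is still available and becomes reserved, or none of them changes. One way to sketch that, assuming a guarded UPDATE whose affected-row count is compared with the number of distinct seats requested (shown against an in-memory SQLite table; the function name is illustrative):

```python
import sqlite3


def reserve_seats(conn: sqlite3.Connection, seat_ids: list) -> bool:
    """Flip every requested seat from 'available' to 'reserved', or none of them.

    seat_ids is assumed to contain distinct IDs.
    """
    placeholders = ",".join("?" for _ in seat_ids)
    try:
        with conn:  # one transaction: commit on success, roll back on error
            cursor = conn.execute(
                "UPDATE seats SET status = 'reserved' "
                f"WHERE status = 'available' AND seat_id IN ({placeholders})",
                list(seat_ids),
            )
            if cursor.rowcount != len(seat_ids):
                raise LookupError("at least one seat is no longer available")
        return True
    except LookupError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (seat_id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO seats VALUES (?, ?)",
    [("A-101", "available"), ("A-102", "reserved"), ("A-103", "available")],
)
conn.commit()

first = reserve_seats(conn, ["A-101", "A-102"])   # A-102 is taken: nothing changes
second = reserve_seats(conn, ["A-101", "A-103"])  # both free: both become reserved
```

The first call fails and rolls back, which is why A-101 is still available for the second call.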

Real-Time Section Updates (Server-Sent Events)

Why Server-Sent Events (SSE) for Seat Notifications:

SSE is the optimal choice for seat availability notifications because:

One-Way Communication: Seat updates flow server → client only (no client → server needed)
Simpler Implementation: Native browser support with automatic reconnection
Better Performance: Lower memory overhead (~2KB vs ~8KB per WebSocket connection)
Higher Scalability: Can handle 50K+ connections per pod vs 10K for WebSockets
Built-in Resilience: Automatic reconnection and error handling
Perfect Use Case Match: Ideal for push notifications like seat availability changes

WebSocket vs SSE: When to Use Each

Why WebSocket Has Lower Latency (1ms vs 8ms):

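Persistent Connection: After the initial upgrade handshake, each message carries only a few bytes of framing (SSE also holds one connection open, so the saving is per-message overhead, not connection setup)
Binary Protocol: Supports compact binary frames, whereas SSE is limited to UTF-8 text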
Bidirectional: Client can send acknowledgments, reducing server-side queuing
Direct Connection: No intermediate services (SNS/SQS) adding latency

Use WebSocket When:

Bidirectional Communication: Chat, real-time collaboration, gaming
Low Latency Critical: Real-time trading, live sports updates
Interactive Features: User can send commands/responses
Custom Protocols: Need binary data or custom message formats

Use SSE When:

One-Way Communication: Seat availability updates, notifications
HTTP Infrastructure: Leverage existing HTTP caching, CDNs
Browser Compatibility: Better support for automatic reconnection
Simpler Implementation: No need for connection state management

Endpoint: /events/{event_id}/sections/updates
Headers: Authorization: Bearer <queue_token> (anonymous, ready) OR Authorization: Bearer <JWT> (logged-in, ready)
Message:

{
  "section_id": "A",
  "available_seats": 120,
  "timestamp": "2025-09-02T12:30:00Z"
}

Notes: General Admission counts and reserved seat holds are pushed in real time. Updates are throttled to reduce load.
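
On the wire, each SSE message is a few `field: value` lines terminated by a blank line. A small sketch of how the server might frame the update above; the event name `section_update` is a made-up example, not part of the API described here:

```python
import json
from typing import Optional


def format_sse(data: dict, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one Server-Sent Events message (text/event-stream framing)."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data, separators=(',', ':'))}")
    return "\n".join(lines) + "\n\n"  # the blank line ends the message


frame = format_sse(
    {"section_id": "A", "available_seats": 120},
    event="section_update",  # hypothetical event name
)
```

Sending an `id` field lets a reconnecting browser report the last message it saw via the Last-Event-ID header, which is the basis of SSE's automatic resume.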

Why Params Are in the Path vs. Request Body

Event ID, Section ID → Path Params: Represent resources in a hierarchy (/events/{event_id}/sections/{section_id}/seats). REST convention prefers path params when accessing a specific resource.
Seats, GA Section, GA Quantity → Request Body: Represent actions or state changes (reserving seats). Describe what the client wants to do, not the resource being fetched.
Queue Token / JWT → Headers: Authentication and authorization tokens always go in headers, not bodies or paths, to keep APIs clean, stateless, and cacheable.

This separation keeps the API design RESTful, predictable, and consistent with industry practices.

Key Takeaways

Anonymous users are tracked via queue_token.
Logged-in users are tracked via JWT/account ID; no token issued.
Queue status is required for all users during high-demand events before reserving seats.
Pagination improves performance for large sections.
Real-time updates keep section availability accurate.
Unauthorized users (no JWT or queue_token) cannot reserve seats.

The Virtual Waiting Room: A Deep Dive into High-Scale Queueing

This deep dive focuses on the Queue Management Service that gracefully manages demand spikes for seat reservations. We'll cover the full system design: its runtime architecture, the exact Redis data structures and operations, a modern approach to consistency and recovery, and a head-to-head comparison of two primary Redis queue implementations.

The Problem: Surviving a Demand Spike

Imagine a major concert ticket sale. Millions of users hit your site simultaneously. Without an admission control system, your backend services—for seat reservation and payment—would be instantly overwhelmed. This leads to a chaotic user experience: slow, unresponsive pages, frustrating timeouts, and a flood of angry customer service calls. This is a critical business problem, costing millions in lost sales and brand trust. The solution isn't to build a bigger backend overnight; it's to create a virtual waiting room that gracefully manages the flood of users and ensures an orderly process.

Goals

Protect downstream seat/reservation systems from overload (admission control).
Maintain predictable, mostly FIFO ordering across millions of waiting users.
Provide "where am I?" and "how long left?" with low latency.
Allow anonymous and logged-in users (queue_token vs JWT).
Recover from cache failures quickly without losing order.

Non-Goals (handled elsewhere)

Payments and final ticket issuance.
Full bot mitigation (you will still rate-limit and verify).

High-Level Architecture

The system's core is a hybrid, two-part architecture: a durable database for long-term state and a blazing-fast Redis cache for real-time operations.

Write Path (Join Queue)

The API handles a dual-write process for speed and durability:

A user joins the queue by hitting an API endpoint. The API authenticates them (logged-in via JWT, anonymous gets a signed queue_token).
The API performs two writes in parallel: it persists a minimal user entry into the Queue DB (our single source of truth) and enqueues the user into Redis (our hot path for ordering).
The API instantly returns the user's initial position and ETA, derived from the monotonic sequence number (join_seq) received from the Redis write. We'll discuss the O(1) position calculation algorithm in detail in the "Technical Deep Dive" section below.

Admission (Making People Ready)

A Gatekeeper worker steadily admits users from the head of the queue at a controlled rate (e.g., N users/sec). It marks them "ready" in the DB and emits an event to the message bus for observability and client notification.
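
One tick of the Gatekeeper can be sketched as "pop up to N users from the head and mark them ready". The in-memory deque and status dictionary below are stand-ins for Redis and the Queue DB, and the function name is illustrative:

```python
from collections import deque


def admit_batch(waiting: deque, status: dict, max_per_tick: int) -> list:
    """Admit up to max_per_tick users from the head of the queue, in order."""
    admitted = []
    while waiting and len(admitted) < max_per_tick:
        join_seq, user_key = waiting.popleft()
        status[user_key] = "ready"  # the real worker would also emit an event
        admitted.append(user_key)
    return admitted


waiting = deque([(1, "u:alice"), (2, "u:bob"), (3, "u:carol")])
status = {"u:alice": "waiting", "u:bob": "waiting", "u:carol": "waiting"}
admitted = admit_batch(waiting, status, max_per_tick=2)
```

Calling this once per second with `max_per_tick=N` gives the "N users/sec" admission rate described above.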

Read Path (Position, ETA)

In steady state, all reads are served from Redis, providing low-latency updates. If Redis is degraded, the system falls back to a degraded mode (details in section 5).

State Transitions

waiting → ready → (reserve) → completed/expired. A user who doesn't reserve in time can re-queue subject to system policy.

Data Model (DB + Redis)

Relational DB (minimal, durable)

This is our rock-solid, durable layer. We store only the essential, long-lived data.

Redis (Reconstructible Cache)

This is our high-performance, live-ordering layer. We'll explore two primary implementations.

A) Sorted Set (ZSET) per event

Key: q:{event_id}:z
Member: user_key (e.g., u:{user_id})
Score: join_seq (monotonic sequence number)

B) List + Hash (list as deque, hash for metadata)

Key: q:{event_id}:list (RPUSH/LPOP)
Hash: q:{event_id}:meta:{user_key} → field: join_seq (monotonic sequence number)

Consistency, Failure & Recovery (Event-Sourced Approach)

To build a truly resilient system, we use an event-sourced model where the database is the single source of truth and Redis acts as a reconstructible, high-speed cache. This approach eliminates race conditions and manual intervention.

DB-First Writes: All changes, like a user joining or being admitted, are first written to the database by the API.
Optimized Recovery (The Snapshot + Stream Model): Our primary cache, Redis, is highly available but can still fail. To ensure a fast and accurate rebuild, we combine periodic snapshots with a continuous event stream.
- Periodic Snapshots: The system periodically takes a snapshot of the Redis queue's live ordering state. This is a quick baseline backup of the essential user data and their order. These snapshots are stored in durable storage.
- Rapid Rebuild: If Redis crashes, a dedicated recovery worker performs two steps: (1) it loads the latest durable snapshot into Redis; (2) it queries the event log table and replays only the events that occurred since that snapshot was taken. This is a much smaller number of events, allowing for a fast and reliable catch-up.

How CDC Event Replay Works: CDC maintains a complete log of all database changes with timestamps. When Redis needs to be rebuilt, the recovery worker queries the CDC log for all changes that occurred after the last snapshot timestamp. This approach provides complete coverage since CDC captures every database change automatically, ensuring no events are missed during recovery.
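
The snapshot-plus-replay rebuild can be sketched in a few lines. This is an illustration under stated assumptions: `QueueEvent`, its fields, and `rebuild_queue` are hypothetical names, and the cutoff is a log sequence number rather than the timestamp mentioned above, which avoids ambiguity when two events share a timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueEvent:
    seq: int        # position of the event in the durable log (assumed field)
    kind: str       # "joined" or "admitted"
    user_key: str
    join_seq: int


def rebuild_queue(snapshot: dict, snapshot_seq: int, log: list) -> dict:
    """Rebuild the live ordering: start from the snapshot, replay newer events."""
    queue = dict(snapshot)  # user_key -> join_seq; copy so the snapshot stays intact
    for event in sorted(log, key=lambda e: e.seq):
        if event.seq <= snapshot_seq:
            continue  # already reflected in the snapshot
        if event.kind == "joined":
            queue[event.user_key] = event.join_seq
        elif event.kind == "admitted":
            queue.pop(event.user_key, None)
    return queue
```

Because replay only sets or removes keys, applying the same event twice leaves the result unchanged, which makes a retried recovery safe.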

Redis Design Choices & Time Complexity

A fundamental challenge in high-scale queueing is guaranteeing predictable behavior. A simple timestamp isn't reliable because many users can join the queue within the same millisecond, producing ties whose relative order is arbitrary and non-deterministic. Additionally, system clocks can sometimes go backwards due to clock synchronization issues (NTP adjustments, server reboots, etc.), which would make timestamp-based ordering even more confusing. To solve this, we use a monotonically increasing sequence number (join_seq) generated by Redis's atomic INCR command. This ensures that every single user gets a unique, sequential ticket, making the queue's internal logic perfectly deterministic.

This simple design choice unlocks a massive performance benefit: O(1) position lookup. While native Redis commands for position finding are slow (O(logN) for Sorted Sets and O(N) for Lists), we can bypass them entirely. We simply track the sequence number of the queue's head and subtract it from the user's join_seq to get their exact position in constant time. This is the "Ah-Ha!" moment that makes our system so fast and responsive for millions of users checking their status.

A) Sorted Set (ZSET)

A Redis Sorted Set is a unique data structure that combines a hash table and a skip list to balance performance. A skip list is a probabilistic data structure that offers logarithmic time complexity (O(logN)) for searches, insertions, and deletions, much like a balanced binary search tree, but is generally simpler to implement.

Enqueue (ZADD): O(logN)
- Explanation: When you add a new user to the Sorted Set using the ZADD command, Redis needs to place it in the correct, sorted position. A Redis Sorted Set is a hybrid data structure that uses a skip list to maintain order. Finding the correct spot in a skip list to insert an element requires traversing a subset of the elements, which is a logarithmic time operation, hence O(logN).
Dequeue (Batch ZPOPMIN): O(M⋅logN)
- Explanation: The ZPOPMIN command removes the element with the lowest score (the head of your queue). The complexity is O(logN) for a single removal. When the operation is done in a batch to remove M users, Redis has to perform this O(logN) removal process M times. This multiplies the complexity, resulting in O(M⋅logN).
Get User's Position: O(1) (bypasses the native ZRANK command).

Why ZRANK Has O(logN) Complexity: Redis Sorted Sets use skip lists where each node maintains counts - the number of elements between itself and the next node at each level. To find a user's rank:

Start at the header: Begin at the top level of the skip list
Traverse levels: For each level, follow forward pointers while the next node's score < target score
Accumulate counts: Add the count stored at each node to the running rank total
Level descent: Move down one level and repeat until reaching the bottom level

Count Maintenance: When inserting/deleting elements, Redis must update counts at all affected levels. Each insertion requires updating counts at log(N) levels on average, and each deletion requires similar updates to maintain count accuracy.

Why O(logN): Even with counts, ZRANK must traverse log(N) levels, performing count accumulation at each level. For 1M users: ~20 level traversals × count operations = O(logN) complexity.

B) List + Hash

This design separates the two primary functions of the queue: a List for ordered processing and a Hash for user metadata. A List is a doubly linked list, optimized for fast additions and removals from the head and tail. A Hash provides O(1) access to a user's metadata.

Enqueue (RPUSH): O(1)
Dequeue (LPOP): O(1)
Get User's Position: O(1) (bypasses the native LPOS command).

Why a Native List Lookup Is O(N): Finding the position of a specific user in a Redis List requires the LPOS command (available since Redis 6.0.6), which scans the list from the head and compares each element until it finds the matching user key. (LINDEX does the reverse, returning the element at a given index, and is also O(N) for positions away from either end.) A Redis List is stored as linked nodes, so there is no index to jump to: lookup time is directly proportional to the number of elements in the list. For a large queue with millions of users, this would be prohibitively slow and would not scale.

How We Achieve O(1)

We achieve a constant time position lookup by bypassing native Redis commands entirely and using simple application-level logic. This is possible because our queue design is based on a monotonically increasing sequence number (join_seq).

Monotonic Counter: The key to this approach is that every user is assigned a unique, sequential number upon joining the queue. This number (join_seq) serves as their unchanging ticket number.
Tracking the Head: The system maintains a global counter for the "head" of the queue, representing the join_seq of the user currently being processed or the last user admitted.
The Simple Math: A user's position is simply the difference between their join_seq and the current head of the queue. This is a single subtraction operation.

Example: If the queue head is at join_seq = 1,000,000, and a user has join_seq = 1,000,150, their position is 1,000,150 − 1,000,000 = 150. This calculation is a single step that takes constant time, regardless of whether there are 10 users or 10 million in the queue. This mathematical approach transforms a potentially slow operation into an instant one.
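
The subtraction can be sketched with an in-memory stand-in for the two Redis counters. The class and attribute names (`QueueCounters`, `next_seq`, `head_seq`) are illustrative assumptions, not names from the original design.

```python
class QueueCounters:
    """In-memory stand-in for the join counter and head counter kept in Redis."""

    def __init__(self) -> None:
        self.next_seq = 0   # last join_seq handed out (INCR in Redis)
        self.head_seq = 0   # join_seq of the last admitted user
        self.join_seq = {}  # user_key -> join_seq

    def join(self, user_key: str) -> int:
        self.next_seq += 1
        self.join_seq[user_key] = self.next_seq
        return self.next_seq

    def admit(self, count: int) -> None:
        # The head can never move past the last ticket that was issued.
        self.head_seq = min(self.head_seq + count, self.next_seq)

    def position(self, user_key: str) -> int:
        # One dictionary lookup and one subtraction, independent of queue length.
        return max(self.join_seq[user_key] - self.head_seq, 0)
```

One caveat worth keeping in mind: the subtraction counts every ticket between the head and the user, so if users ahead abandon the queue, the result is an upper bound on the true position rather than an exact value.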

The Con of O(1) Application-Level Logic

While powerful, relying on application-level logic for O(1) lookups creates a risk of race conditions. A user's position could be calculated incorrectly if the head_seq counter changes mid-request.

How to Prevent Race Conditions: The most reliable method is to use a Redis Lua script. The script atomically fetches the user's join_seq and the head_seq counter. Redis guarantees that the entire script runs as a single, indivisible operation, eliminating any risk of a race condition.

Lua Script for Atomic Position Lookup:

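-- Atomic position calculation to prevent race conditions
-- KEYS[1]: sorted-set key of the queue, KEYS[2]: head sequence counter key
-- ARGV[1]: the user's member name (a set member, not a Redis key)
local queue_key = KEYS[1]
local head_seq_key = KEYS[2]
local user_key = ARGV[1]

-- Get user's join sequence number (ZSCORE yields false if the member is missing)
local user_seq = redis.call('ZSCORE', queue_key, user_key)
if not user_seq then
    return redis.error_reply("User not in queue")
end

-- Get current head sequence (default to 0 if no one has been admitted yet)
local head_seq = redis.call('GET', head_seq_key)
if not head_seq then
    head_seq = 0
end

-- Calculate position; both values were read inside the same atomic script
local position = tonumber(user_seq) - tonumber(head_seq)
-- Return an array: Redis drops string-keyed table fields when converting replies
return {position, user_seq, tostring(head_seq)}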

Race Condition Prevention: The Lua script ensures atomicity by fetching both user_seq and head_seq in a single Redis operation, preventing the race condition where head_seq changes between the two reads.

Note: The specific implementation of the Lua script will vary slightly based on the chosen Redis data structure. For the List + Hash model, the join_seq is fetched from the Hash, while for the Sorted Set model, it is retrieved as the score of the user's member in the set.

What This Gives Your Product

This design delivers a robust, scalable, and resilient queue that transforms a chaotic demand spike into a predictable user experience. By separating concerns—the DB for durable identity, Redis for real-time ordering, and an event bus for reliable communication—we've built a system that provides a smooth user experience, ensures business continuity, and is easy to maintain.

The Grand Unification of Ticketing: Seat Management Done Right

The art of building a reliable ticketing platform lies in one core principle: preventing chaos. When millions of fans are clamoring for a limited number of tickets, your system must be a bastion of order. This deep dive into the Seat Management Service reveals the crucial mechanisms—database transactions and locking—that ensure every reservation is a clean, atomic operation, preventing the nightmare of overselling and double-booking.

We'll build this system using three key tables: Seats to manage unique tickets, Sections for overall availability, and Reservations to handle temporary holds.

The Database as Your Security Guard 🛡️

At the heart of our strategy is the database, specifically PostgreSQL. Its powerful transactional capabilities allow us to treat a series of operations as a single, indivisible unit. The entire reservation process either succeeds completely, or it fails and is entirely rolled back, leaving no trace behind. This is the ACID principle in action, guaranteeing atomicity and consistency.

The Atomic Reservation Flow

When a user requests seats, the backend initiates a carefully choreographed sequence of database operations.

Step 1: Start the Transaction

The very first action is to begin a transaction. Think of this as putting a "Do Not Disturb" sign on the data you're about to work on.

BEGIN;

Step 2: Check & Lock for a Flawless Hold

This is where we prevent overselling. The process differs slightly depending on the type of seating.

For Reserved Seating: Pessimistic Locking 🔒
When a user selects a unique seat (e.g., "Seat A101"), the system immediately places a lock on that exact seat row in the database. This is a pessimistic lock, so named because we're being pessimistic and assuming another user might want the same seat. It guarantees that no other transaction can lock, update, or delete that seat row until our transaction is complete. (In PostgreSQL, plain SELECTs can still read the row; only competing writers and other FOR UPDATE requests block.) The other user's reservation attempt is forced to wait, preventing a conflict from ever occurring.

This approach is perfect for reserved seats because they are unique and non-fungible. The query filters on status = 'available', so if it returns fewer rows than the number of seats requested, some were already taken. The transaction is immediately rolled back, and the user receives an error.

-- Select the requested seat only if it is still available, and lock it for update
SELECT seat_id, status FROM seats WHERE seat_id = 'A101' AND status = 'available' FOR UPDATE;

Why not Optimistic Locking for Reserved Seats? While optimistic locking can offer higher concurrency, it's a poor fit for unique, non-fungible items like reserved seats. The approach involves checking for conflicts at the final moment of the transaction. A second user could start a reservation on the same seat, only to have their entire transaction rejected at the very end when the system detects the conflict. This creates a frustrating and unpredictable user experience, as a "seat not available" message delivered late in the process is far worse than an immediate one.

Comparing the Costs:

Cost of a Pessimistic Lock: The cost is a single, brief wait at the beginning of the process. The application sends one SELECT...FOR UPDATE query to the database, which handles all locking and waiting internally. This is a predictable, easy-to-manage cost.
Cost of a Failed Optimistic Lock: This cost is deceptive. While it avoids an up-front wait, it introduces the high cost of a complete re-run when a conflict occurs. The application must perform an entire sequence of operations—fetching data, running business logic, and attempting the final update—only for the update to fail. The application then has to discard all the work and restart the entire process, leading to redundant CPU cycles and wasted database round trips.

The essential difference is that a pessimistic lock's cost is a single, brief, and transparent wait. A failed optimistic lock's cost is the wasteful re-execution of a full and complex reservation attempt.

For General Admission (GA): Optimistic Locking 🟢
For GA, we must avoid a bottleneck. Many users will try to reserve GA slots at the same time, all hitting the same Sections table row. A pessimistic lock would serialize these requests, forcing them to wait in a single line, which defeats the purpose of the queuing system.

Instead, we use optimistic locking, which assumes conflicts are rare. We don't lock the data upfront. We rely on a single, atomic SQL statement to perform a compare-and-swap (CAS) operation, checking and updating the value simultaneously. This approach is highly performant and scalable.

Why Conflicts Are Rare in GA Reservations:

Conflicts occur when seats_remaining drops to 0 or below the requested quantity between the time a user checks availability and attempts to reserve. This is rare because:

Large Capacity Buffers: GA sections typically have thousands of seats (2,000-10,000+), making it unlikely for the last few seats to be contested simultaneously.
Queue-Controlled Access: The queuing system already limits concurrent users. Only users who have been "admitted" from the queue can attempt reservations, reducing simultaneous access.
Time Windows: Users have limited time (10-15 minutes) to complete reservations, and most successful reservations happen within the first few minutes of admission.
Natural Distribution: User behavior naturally spreads out reservation attempts - some users are faster at selecting, others take time to decide.
Batch Processing: The system can process multiple small reservations (1-4 tickets) simultaneously without conflict, as the remaining capacity buffer is usually large enough.

When Conflicts Do Occur: Conflicts become more likely only when seats_remaining approaches very low numbers (e.g., < 10 seats remaining), but by then most users have already completed their reservations, and the queue system has already controlled the flow.

Why Conflicts Are NOT Rare for Reserved Seats: Unlike GA sections with thousands of seats, reserved seating creates high conflict scenarios. Popular seats (front row, center sections, VIP areas) are unique and non-fungible - only one person can have seat A-101. When thousands of users simultaneously target the same premium seats, conflicts are inevitable. This is why reserved seating requires pessimistic locking to prevent overselling and ensure data consistency, while GA's large capacity buffers make optimistic locking viable.

UPDATE sections
SET seats_remaining = seats_remaining - :ga_quantity
WHERE section_id = 'section_GA1' AND seats_remaining >= :ga_quantity
RETURNING seats_remaining;

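This single statement is executed atomically. The WHERE clause acts as our check, ensuring the database only performs the UPDATE if the seats_remaining count is sufficient at that exact moment. If concurrent transactions have already reduced the count below the requested quantity, the WHERE condition fails and the UPDATE affects zero rows; the application detects this from the affected-row count (or from RETURNING producing no row).

Why this is better than Pessimistic Locking for GA: PostgreSQL still takes a row lock for the UPDATE, so writers to the same Sections row are applied one at a time. The difference is how long each writer holds that lock. There is no separate SELECT ... FOR UPDATE followed by application logic and a second round trip; the check and the write happen in one short statement, and the lock is held only from that statement until the transaction commits. Keeping that window small is what makes the approach scale for high-demand GA events.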
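
The conditional UPDATE can be exercised end to end with Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite stands in for PostgreSQL, the `reserve_ga` helper is a hypothetical name, and the affected-row count is used instead of RETURNING so the example runs on older SQLite builds.

```python
import sqlite3


def reserve_ga(conn: sqlite3.Connection, section_id: str, quantity: int) -> bool:
    """Try to take `quantity` GA seats; True on success, False if too few remain."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE sections SET seats_remaining = seats_remaining - ? "
            "WHERE section_id = ? AND seats_remaining >= ?",
            (quantity, section_id, quantity),
        )
        # Zero affected rows means the WHERE check failed: not enough seats left.
        return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sections (section_id TEXT PRIMARY KEY, seats_remaining INTEGER)")
conn.execute("INSERT INTO sections VALUES ('section_GA1', 3)")
conn.commit()
```

Calling `reserve_ga(conn, "section_GA1", 2)` twice succeeds the first time and fails the second, leaving one seat in the section.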

Performance Analysis: Locking Strategies Comparison

Pessimistic Locking Performance (Reserved Seats):

Row Lookup: O(log N) - B-tree index lookup by seat_id
Lock Acquisition: O(1) - Single row lock after finding row
Lock Duration: O(1) - Constant time for transaction
Contention Impact: O(N) - Linear with concurrent users
Best Case: 10-20ms (no contention)
Worst Case: 5-10 seconds (high contention)
Average Case: 50-100ms (moderate contention)

CAS-Based Performance (General Admission):

Row Lookup: O(log N) - B-tree index lookup by section_id
Update Operation: O(1) - Single row update with condition check
No Retries: O(1) - Either succeeds or fails atomically
Success Case: 20-30ms (atomic update succeeds)
Failure Case: 20-30ms (atomic update fails due to insufficient seats)
High Concurrency: O(log N) index lookup per transaction regardless of load; updates to the same row still queue briefly on its row lock

Locking Strategy Decision Matrix:

Factor	Pessimistic	CAS
Consistency	Strong	Strong
Concurrency	Low	High
Latency (Low Load)	Fast	Fast
Latency (High Load)	Slow	Moderate
Memory Usage	High	Low
Implementation	Simple	Complex
Best For	Reserved Seats	General Admission

Step 3: Update Counts

With the seats or GA section securely locked, we can now confidently update their availability.

-- For reserved seating, update the seat status to 'reserved' (with availability check)
UPDATE seats SET status = 'reserved' WHERE seat_id = 'A101' AND status = 'available';

-- Also update the count in the sections table
UPDATE sections SET seats_remaining = seats_remaining - 1 WHERE section_id = 'section_A';

This step ensures the section-level availability information is always accurate for other users browsing the event.

Step 4: Create the Reservation Record

A new record is inserted into the Reservations table. This record is the master holding information, linking the user to their selected seats or GA quantity. Crucially, we set a Time-to-Live (TTL) using the expires_at timestamp. This is an essential safety net. It ensures that if a user doesn't complete the payment, the reservation will automatically expire and release the seats.

-- Create reservation header
INSERT INTO reservations (reservation_id, user_id, expires_at, status, total_amount_minor_units, currency, created_at)
VALUES ('res_789', 'user_123', NOW() + INTERVAL '10 minutes', 'pending_payment', 8950, 'USD', NOW());

-- Link reserved seat to reservation
INSERT INTO reservation_seats (reservation_seat_id, reservation_id, seat_id, created_at)
VALUES ('rs_001', 'res_789', 'A101', NOW());

Step 5: The Final Commitment

If all preceding steps succeed, we execute the COMMIT command. This makes all changes permanent. At this point, the seats are officially marked as reserved (or the GA count is reduced), and the locks are released. If any step failed, the ROLLBACK command is issued, and the database returns to its state before the transaction began.

-- All changes are made permanent and locks are released
COMMIT;

-- If a failure occurs, all changes are undone
-- ROLLBACK;
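
Steps 1 through 5 can be strung together in a single runnable sketch. It uses Python's sqlite3 module as a stand-in for PostgreSQL, so several things are assumptions: SQLite has no SELECT ... FOR UPDATE, so the conditional UPDATE from Step 3 doubles as the availability check; the table shapes are trimmed down; and `reserve_seat` is a hypothetical helper name.

```python
import sqlite3


def reserve_seat(conn, seat_id, section_id, reservation_id, user_id):
    """Reserve one seat atomically; returns True on success, False otherwise."""
    try:
        with conn:  # one transaction: COMMIT on success, ROLLBACK on exception
            cur = conn.execute(
                "UPDATE seats SET status = 'reserved' "
                "WHERE seat_id = ? AND status = 'available'",
                (seat_id,),
            )
            if cur.rowcount != 1:
                raise LookupError("seat not available")
            conn.execute(
                "UPDATE sections SET seats_remaining = seats_remaining - 1 "
                "WHERE section_id = ?",
                (section_id,),
            )
            conn.execute(
                "INSERT INTO reservations (reservation_id, user_id, status) "
                "VALUES (?, ?, 'pending_payment')",
                (reservation_id, user_id),
            )
            conn.execute(
                "INSERT INTO reservation_seats (reservation_id, seat_id) VALUES (?, ?)",
                (reservation_id, seat_id),
            )
        return True
    except (LookupError, sqlite3.Error):
        return False  # everything done inside the transaction was rolled back


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE seats (seat_id TEXT PRIMARY KEY, section_id TEXT, status TEXT);
CREATE TABLE sections (section_id TEXT PRIMARY KEY, seats_remaining INTEGER);
CREATE TABLE reservations (reservation_id TEXT PRIMARY KEY, user_id TEXT, status TEXT);
CREATE TABLE reservation_seats (reservation_id TEXT, seat_id TEXT);
INSERT INTO seats VALUES ('A101', 'section_A', 'available');
INSERT INTO sections VALUES ('section_A', 1);
""")
```

If any statement fails, including a late one such as the reservation insert, the seat status and section count revert together, which is the all-or-nothing behavior the flow depends on.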

Expiring Abandoned Reservations

No matter how robust your system, some users will abandon their carts. To prevent these "seat leaks," a separate background process, or worker, periodically scans the Reservations table for records where expires_at is in the past. When an expired reservation is found, the worker executes a new transaction to return the seats to the available pool. For reserved seats, it updates their status to available. For GA, it adds the ga_quantity back to the seats_remaining in the Sections table. This mechanism ensures that inventory is never held indefinitely.

⚠️ Race Condition Considerations: The cleanup process must handle concurrent operations safely. Multiple cleanup jobs, new reservations being created during cleanup, and concurrent seat releases can all cause data inconsistency if not properly managed.

-- Cleanup log table for monitoring and debugging
-- This table tracks every cleanup operation so we can monitor the system's health
CREATE TABLE cleanup_log (
    log_id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY, -- Unique identifier for each log entry
    section_id VARCHAR(50),                          -- Which section was cleaned up
    expired_ga_count INT,                            -- How many GA seats were released
    expired_reserved_count INT,                      -- How many reserved seats were released
    affected_reservations INT,                       -- How many reservations were marked as expired
    cleaned_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP   -- When the cleanup happened
);

-- Index for efficient queries by time (PostgreSQL has no inline INDEX clause)
CREATE INDEX idx_cleaned_at ON cleanup_log (cleaned_at);

-- PostgreSQL version (more robust with better locking)
CREATE OR REPLACE PROCEDURE cleanup_expired_reservations(p_section_id VARCHAR(50))
LANGUAGE plpgsql
AS $$
DECLARE
    v_expired_ga_count INT := 0;
    v_expired_reserved_count INT := 0;
    v_affected_rows INT := 0;
BEGIN
    -- Start a transaction block (automatically handled in procedures in PG >=11)
    -- We still use EXCEPTION handling for rollback on any error.

    BEGIN
        -- Lock the section row to prevent concurrent cleanups
        PERFORM 1 FROM sections WHERE section_id = p_section_id FOR UPDATE;

        -- Lock all affected reservations early to prevent concurrent modifications
        PERFORM 1
        FROM reservations r
        LEFT JOIN reservation_ga rga ON r.reservation_id = rga.reservation_id
        LEFT JOIN reservation_seats rs ON r.reservation_id = rs.reservation_id
        LEFT JOIN seats s ON rs.seat_id = s.seat_id
        WHERE (rga.section_id = p_section_id OR s.section_id = p_section_id)
          AND r.expires_at < NOW()
          AND r.status = 'pending_payment'
        FOR UPDATE OF r;  -- lock only reservations; FOR UPDATE cannot apply to the nullable side of an outer join

        -- Count expired GA reservations for this section
        SELECT COALESCE(SUM(rga.quantity), 0)
        INTO v_expired_ga_count
        FROM reservations r
        JOIN reservation_ga rga ON r.reservation_id = rga.reservation_id
        WHERE rga.section_id = p_section_id
          AND r.expires_at < NOW()
          AND r.status = 'pending_payment';

        -- Count expired reserved seat reservations for this section
        SELECT COUNT(*)
        INTO v_expired_reserved_count
        FROM reservations r
        JOIN reservation_seats rs ON r.reservation_id = rs.reservation_id
        JOIN seats s ON rs.seat_id = s.seat_id
        WHERE s.section_id = p_section_id
          AND r.expires_at < NOW()
          AND r.status = 'pending_payment';

        -- Restore seats_remaining count
        IF (v_expired_ga_count > 0 OR v_expired_reserved_count > 0) THEN
            UPDATE sections
            SET seats_remaining = seats_remaining + v_expired_ga_count + v_expired_reserved_count
            WHERE section_id = p_section_id;
        END IF;

        -- Release reserved seats for this section only
        UPDATE seats s
        SET status = 'available'
        FROM reservations r
        JOIN reservation_seats rs ON r.reservation_id = rs.reservation_id
        WHERE rs.seat_id = s.seat_id
          AND s.section_id = p_section_id
          AND r.expires_at < NOW()
          AND r.status = 'pending_payment';

        -- Mark expired reservations (both GA and reserved) as expired for this section
        UPDATE reservations r
        SET status = 'expired'
        WHERE r.expires_at < NOW()
          AND r.status = 'pending_payment'
          AND (
              r.reservation_id IN (
                  SELECT rga.reservation_id FROM reservation_ga rga WHERE rga.section_id = p_section_id
              ) OR
              r.reservation_id IN (
                  SELECT rs.reservation_id FROM reservation_seats rs 
                  JOIN seats s ON rs.seat_id = s.seat_id 
                  WHERE s.section_id = p_section_id
              )
          );

        -- Count only the reservations just marked as expired by the UPDATE above
        -- (re-querying by status would also count rows expired by earlier runs)
        GET DIAGNOSTICS v_affected_rows = ROW_COUNT;

        -- Log cleanup operation
        INSERT INTO cleanup_log (
            section_id, 
            expired_ga_count, 
            expired_reserved_count,
            affected_reservations, 
            cleaned_at
        )
        VALUES (
            p_section_id, 
            v_expired_ga_count, 
            v_expired_reserved_count, 
            v_affected_rows, 
            NOW()
        );

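    EXCEPTION
        WHEN OTHERS THEN
            -- Changes made in this block are rolled back automatically when the handler runs;
            -- an explicit ROLLBACK is not allowed inside a block with an EXCEPTION clause.
            RAISE NOTICE 'Error during cleanup: %', SQLERRM;
            RAISE;
    END;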
END;
$$;

Key Race Condition Protections:

Row-Level Locking: FOR UPDATE prevents concurrent section modifications
Atomic Operations: All updates in a single transaction
Error Handling: Automatic rollback on any failure
Audit Trail: Preserves expired reservations for compliance
Monitoring: Cleanup log tracks operations for debugging

The Anatomy of a Hyper-Scale Real-Time System: A Production-Ready Deep Dive

A detailed technical guide to designing a resilient, high-performance, and scalable notification platform.

In the high-stakes world of live event ticketing, a real-time notification system is a critical business tool, not a luxury. It directly drives revenue by preventing lost sales and builds brand trust by providing a transparent user experience.

A static "seats available" number leads to frustration and cart abandonment when users attempt to reserve seats that are already gone. A real-time system provides an accurate, live view of inventory, which eliminates this negative experience. Additionally, seeing seat availability change dynamically creates a sense of urgency that encourages faster purchasing decisions.

Building a real-time system capable of serving millions of concurrent users requires a series of deliberate architectural choices that prioritize decoupling, resilience, and performance. This article provides a deep dive into a production-ready architecture for a large-scale ticketing notification system. We will define each component's role, trace the data flow in detail, and analyze the system's resilience to failure.

Part 1: Foundational Components

This architecture is composed of several specialized components, each responsible for a distinct part of the data lifecycle.

Event Producers & The Durable Log (Amazon MSK)

The system is event-driven. State changes are captured automatically using Change Data Capture (CDC) from the database, ensuring complete decoupling between high-frequency reservation operations and event publishing.

How CDC Works in Practice:

Database Changes: When the Seat Management Service commits a reservation transaction (e.g., seat status changed from 'available' to 'reserved'), the database records the change
Automatic Capture: CDC automatically captures all database changes at the transaction log level, including:
- Table modifications (seats, reservations, sections)
- Before/after values for each change
- Transaction metadata and timestamps
- User context from the application
Event Transformation: A CDC processor transforms raw database changes into business events:
- Raw UPDATE → 'seat_reserved' event
- Raw INSERT → 'reservation_created' event
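- Raw UPDATE of a reservation's status to 'expired' → 'reservation_expired' event (expired rows are kept for the audit trail rather than deleted)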
MSK Publishing: Transformed events are published to appropriate Kafka topics (e.g., 'seat-events', 'reservation-events')
Minimal Performance Impact: CDC reads the database's transaction log asynchronously, so reservation operations do no extra work on the request path
Downstream Processing: Other services (Availability Aggregation, Notifications) consume these events to update their state
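
The event transformation step can be sketched as a pure function. The change-record shape here (`table`, `op`, `before`, `after`, with `op` codes "c", "u", "d") follows the Debezium envelope convention but is simplified; `to_business_event` is a hypothetical name, and treating either a delete or a status change to 'expired' as expiry is an assumption made so the sketch covers both ways a reservation might be retired.

```python
from typing import Optional


def to_business_event(change: dict) -> Optional[dict]:
    """Map one raw row-level change to a business event, or None to drop it."""
    table, op = change["table"], change["op"]  # op: "c" insert, "u" update, "d" delete
    before = change.get("before") or {}
    after = change.get("after") or {}

    if table == "seats" and op == "u":
        if before.get("status") == "available" and after.get("status") == "reserved":
            return {"type": "seat_reserved", "topic": "seat-events",
                    "seat_id": after["seat_id"]}

    if table == "reservations":
        if op == "d" or (op == "u" and after.get("status") == "expired"):
            row = after or before  # a delete carries only the "before" image
            return {"type": "reservation_expired", "topic": "reservation-events",
                    "reservation_id": row["reservation_id"]}
        if op == "c":
            return {"type": "reservation_created", "topic": "reservation-events",
                    "reservation_id": after["reservation_id"]}

    return None  # changes with no business meaning are not published
```

Keeping the mapping free of I/O makes it easy to unit-test separately from the Kafka producer that publishes the result.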

MSK serves as the system's durable log and provides several critical functions:

Decoupling: It decouples producers from consumers, allowing them to operate and scale independently.
Durability & Replayability: Events are durably stored, allowing them to be re-processed in case of a consumer-side error.
Backpressure Management: The bus acts as a large buffer, absorbing traffic spikes.

The Processing Engine (Apache Flink)

The Availability Aggregation Service is implemented as a stateful stream processing application using Apache Flink. It consumes the raw event stream from MSK and transforms it into clean, aggregated availability counts.

Stateful Processing: Flink uses a keyBy operation on the section_id to ensure all events for a single section are processed sequentially by the same task.
Fault Tolerance: Flink is configured to take periodic checkpoints of its state to a durable store like Amazon S3, guaranteeing data accuracy through failures.
Windowed Aggregation & Throttling: Instead of emitting an update for every single event, Flink can operate on windows of time. For example, it can collect all events for a specific section over a 2-second window, perform a single aggregation at the end of that window, and then emit a single, consolidated update. This micro-batching is crucial for throttling the update stream, reducing the load on the entire downstream notification system and preventing the end user's UI from flickering with an excessive number of rapid updates.
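
The windowing idea can be illustrated without Flink. The sketch below is plain Python, not the Flink API: it groups events by section and by fixed 2-second (tumbling) windows and emits one net change per group. The event tuple layout and the `aggregate_windows` name are assumptions for illustration.

```python
from collections import defaultdict


def aggregate_windows(events, window_seconds=2):
    """Collapse (timestamp, section_id, delta) events into one update per section per window."""
    totals = defaultdict(int)  # (window_start, section_id) -> net seat change
    for timestamp, section_id, delta in events:
        # Align each event to the start of its fixed-size window.
        window_start = int(timestamp // window_seconds) * window_seconds
        totals[(window_start, section_id)] += delta
    return [
        {"window_start": start, "section_id": section, "net_delta": net}
        for (start, section), net in sorted(totals.items())
    ]
```

Four raw events spread over two windows and two sections collapse into three updates here; at real traffic levels the reduction is far larger, which is what protects the downstream fleet.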

The Distribution Fleet (Amazon EKS)

The Real-time Notification Service is a fleet of containerized applications deployed on Amazon EKS (Elastic Kubernetes Service). These long-running pods are the "SSE hosts."

Persistent Connections: Each pod is capable of maintaining thousands of persistent Server-Sent Events (SSE) connections with clients.
Local State: Each pod maintains two private, in-memory hash tables to manage its local connections, enabling extremely fast lookups.
Horizontal Scalability: The fleet runs as a Kubernetes Deployment, allowing it to be scaled horizontally.

The Connection Manager (Application Load Balancer)

An Application Load Balancer (ALB) sits in front of the EKS fleet, terminating TLS, performing health checks, and distributing incoming SSE connection requests to the EKS pods using the "least outstanding requests" routing algorithm (the ALB's closest equivalent to least-connections balancing).

The Routing Directory (Amazon ElastiCache for Redis)

A central, low-latency in-memory database acts as our real-time routing table, or "Director."

Function: Its primary job is to maintain a live map between a logical topic (e.g., section_id) and the physical identifiers of the EKS pods that are currently serving clients interested in that topic.
Implementation: This is implemented using Amazon ElastiCache for Redis, configured for high availability. The data structure is a simple Redis Set: section:104:subscribers -> { "pod-a-ip", "pod-c-ip" }.

Part 2: The Core Architecture — A Deep Dive into the Data Flow

The architecture is designed for low-latency, targeted message delivery.

Phase 1: Connection & State Registration

This phase establishes the client connection and builds the two-layer routing map.

A client initiates an SSE connection request to the ALB. The ALB selects a target pod, pod-C, and forwards the request.
The application inside pod-C establishes the SSE connection and assigns it a unique internal ID.
The client sends a subscription message (e.g., for section-104).
pod-C updates its two local, in-memory hash tables:
- Forward Map (connectionId -> sections): Used for efficient cleanup when the client disconnects.
- Reverse Map (section -> connectionIds): Used for O(1) lookups during message delivery.
pod-C then updates the Redis Director, adding its own unique, addressable identifier to the Redis Set for section-104.

   redis.SADD("section:104:subscribers", "pod-c-ip")
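
The two local maps and the director registration can be sketched together. A plain dict of sets stands in for Redis, `PodConnections` is a hypothetical class name, and removing the pod from the director when its last local subscriber leaves (the SREM-like step) is an assumption; the text above only states that the forward map is used for cleanup.

```python
from collections import defaultdict


class PodConnections:
    """Per-pod forward and reverse maps plus registration in a shared director."""

    def __init__(self, pod_id: str, director: dict) -> None:
        self.pod_id = pod_id
        self.director = director          # stand-in for Redis: section -> set of pod ids
        self.forward = defaultdict(set)   # connection_id -> sections
        self.reverse = defaultdict(set)   # section -> connection_ids

    def subscribe(self, connection_id: str, section: str) -> None:
        self.forward[connection_id].add(section)
        self.reverse[section].add(connection_id)
        self.director.setdefault(section, set()).add(self.pod_id)  # like SADD

    def disconnect(self, connection_id: str) -> None:
        # The forward map tells us exactly which reverse entries to clean up.
        for section in self.forward.pop(connection_id, set()):
            self.reverse[section].discard(connection_id)
            if not self.reverse[section]:
                # Last local subscriber left: drop this pod from the director (like SREM).
                del self.reverse[section]
                self.director[section].discard(self.pod_id)
```

Without the forward map, a disconnect would have to scan every section in the reverse map to find the connection, which is the cost the two-map layout avoids.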

Phase 2: Real-Time Message Delivery

This phase traces an event from processing to final delivery.

The Flink application publishes a processed update for section-104 to an Amazon SNS Topic.
SNS invokes a subscribed AWS Lambda function (the "Router Lambda").
The Lambda function queries the Redis Director to get the set of pod identifiers subscribed to section-104.

   redis.SMEMBERS("section:104:subscribers") -> returns { "pod-c-ip", "pod-f-ip" }

The Lambda performs a direct, service-to-service push to the target pods via an internal API endpoint on each pod.
The target pod, pod-C, receives this internal API call.
pod-C performs a highly efficient O(1) lookup in its local in-memory reverse_map to get the list of specific client connectionIds.
The pod's application iterates through this list and writes the message into the open SSE streams for each connection.
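
The delivery path reduces to two lookups: the router asks the director which pods care about a section, and each pod asks its reverse map which connections to write to. In this sketch the `Pod` class, its `sent` list (standing in for open SSE streams), and `route_update` are illustrative names, and direct method calls replace the internal HTTP push.

```python
class Pod:
    """Minimal pod: a reverse map and a record of what was written to clients."""

    def __init__(self) -> None:
        self.reverse = {}   # section -> list of connection ids
        self.sent = []      # (connection_id, message) pairs, standing in for SSE writes

    def deliver(self, section: str, message: str) -> int:
        connection_ids = self.reverse.get(section, [])   # average-case O(1) lookup
        for connection_id in connection_ids:
            self.sent.append((connection_id, message))
        return len(connection_ids)


def route_update(section: str, message: str, director: dict, pods: dict) -> int:
    """Push `message` to every pod registered for `section`; return SSE writes made."""
    writes = 0
    for pod_id in sorted(director.get(section, ())):   # like SMEMBERS
        writes += pods[pod_id].deliver(section, message)
    return writes
```

Pods with no subscribers for the section are never contacted, which is the point of keeping the director: the router's work scales with interested pods, not with fleet size.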

Message Delivery Optimization Analysis

Fanout Pattern Performance:

Direct Broadcast (Naive Approach):

1 message → 100K users = 100K individual WebSocket writes
Time Complexity: O(N) where N = number of subscribers
Memory Complexity: O(N) for connection management

SNS Fanout (Optimized Approach):

1 message → SNS → 100K Lambda invocations → 100K WebSocket writes
Time Complexity: O(1) for publish + O(N) for delivery
Memory Complexity: O(1) for SNS + O(N) for Lambda memory

Batching Optimization:

100 messages → Batch → 1K WebSocket writes
Time Complexity: O(N/K) where K = batch size
Memory Complexity: O(N/K) for batched connections

Message Delivery Guarantees:

At-Most-Once Delivery:

Implementation: Fire-and-forget with no acknowledgments
Use Case: Non-critical updates (seat availability changes)
Performance: Highest throughput, lowest latency
Trade-off: Some messages may be lost

At-Least-Once Delivery:

Implementation: Retry mechanism with acknowledgments
Use Case: Critical updates (payment confirmations)
Performance: Lower throughput, higher latency
Trade-off: Some messages may be duplicated

Exactly-Once Delivery:

Implementation: At-least-once delivery combined with idempotent processing and deduplication, so a redelivered message has no additional effect
Use Case: Financial transactions
Performance: Lowest throughput, highest latency
Trade-off: Highest complexity, highest reliability
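
A minimal sketch of the deduplication idea behind the exactly-once row, assuming each message carries a unique ID. The class name and the in-memory set are illustrative; a real system would keep the seen IDs in a durable store shared by all consumers.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # assumption: stands in for a durable dedup store

    def deliver(self, message_id, payload):
        """Return True if the payload was applied, False if it was a duplicate."""
        if message_id in self.seen:
            return False
        self.handler(payload)
        # Mark as seen only after the handler succeeds, so a failed attempt
        # can be retried by the at-least-once transport.
        self.seen.add(message_id)
        return True
```

With this in place the transport can retry freely: a retried payment confirmation is acknowledged again but applied only once.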

Part 3: Bulletproofing the System — A Deep Dive into Resilience

A production-ready architecture must be designed explicitly for failure.

Resilience of the EKS Fleet

Pod Failure: If a pod crashes, its ReplicaSet launches a replacement, and the ALB health check stops routing new connections to the failed pod. Clients that were connected to it trigger their automatic reconnect logic and are routed to a healthy pod by the ALB, where they re-send their subscriptions.
Node Failure: The EKS worker nodes are managed by an Auto Scaling Group. If an EC2 instance fails, the ASG will terminate it and launch a replacement. Kubernetes will then automatically reschedule the affected pods onto the remaining healthy nodes in the cluster.

Resilience of the Director (Redis Cluster)

Primary Defense (High Availability): The Director is deployed as an Amazon ElastiCache for Redis cluster with Multi-AZ and automatic failover enabled. If the primary Redis node fails, ElastiCache automatically promotes a replica.
Disaster Recovery (Graceful Degradation): In a total service failure, the Router Lambda can be programmed with a fallback. Upon detecting that Redis is unavailable, it could revert to a less efficient broadcast model or log the failure and wait for recovery.

Performance Benchmarks

Queue Management Performance Comparison

Maximum QPS & Memory at 10,000 Concurrent Users:

Category	Method	P95 Latency	Max QPS	Memory (10K users)	Complexity	Notes
Queue	Redis ZSET	5ms	25,000	400MB	O(logN)	Skip list implementation
Queue	Redis List+Hash	2ms	50,000	320MB	O(1)	Linked list + hash table
Queue	Database Queue	50ms	2,000	1GB	O(N)	PostgreSQL-based
Queue	In-Memory Array	1ms	100,000	200MB	O(1)	Single-threaded only
Reservation	Pessimistic Lock	30ms	2,000	800MB	O(1)	Strong consistency
Reservation	Optimistic Lock	15ms	8,000	600MB	O(1)	High concurrency
Reservation	Database Transaction	40ms	1,500	500MB	O(1)	ACID guarantees
Notification	Redis Pub/Sub	2ms	30,000	200MB	O(1)	Fire-and-forget
Notification	WebSocket Direct	1ms	5,000	800MB	O(1)	Bidirectional, persistent connection
Notification	SSE + SNS	8ms	15,000	1.2GB	O(1)	AWS managed

Memory Usage Calculation Breakdown

Performance at 10,000 Concurrent Users:

Queue Methods:

Redis ZSET (400MB): 10K users × 32 bytes (user_id + score + skip list pointers) + 25% Redis overhead
Redis List+Hash (320MB): 10K users × 40 bytes (list + hash entries) + Redis metadata
Database Queue (1GB): 10K users × 100 bytes (PostgreSQL row) + connection pools + indexes
In-Memory Array (200MB): 10K users × 8 bytes (integer user_id) + JVM overhead + GC buffers

Reservation Methods:

Pessimistic Lock (800MB): 10K concurrent locks × 1KB (lock metadata) + database buffers
Optimistic Lock (600MB): No lock overhead, just version numbers + database buffers
Database Transaction (500MB): 10K transactions × 50 bytes (transaction state) + database buffers

Notification Methods:

Redis Pub/Sub (200MB): 10K QPS × 100 bytes (message overhead) + connection buffers
WebSocket Direct (800MB): 10K connections × 8KB (connection state) + WebSocket server overhead
SSE + SNS (1.2GB): 10K connections × 2KB (SSE state) + AWS Lambda + SNS message processing

Key Assumptions: 10K concurrent users baseline, Redis overhead 20-25%, database connection pools, network buffers, AWS service overhead. Actual usage varies by implementation details and configuration.
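
The per-user terms quoted in the breakdown are small next to the totals, so most of each figure has to come from the fixed overheads (process baseline, buffers, connection pools) rather than from per-user data. A quick check, using only numbers taken from the breakdown above:

```python
USERS = 10_000

# (method, bytes per user as quoted above, quoted total in MB)
QUOTED = [
    ("Redis ZSET", 32, 400),
    ("Redis List+Hash", 40, 320),
    ("Database Queue", 100, 1024),
    ("WebSocket Direct", 8 * 1024, 800),
]


def per_user_mb(bytes_per_user, users=USERS):
    """Memory taken by the per-user term alone, in MB."""
    return bytes_per_user * users / (1024 * 1024)


for name, bytes_per_user, total_mb in QUOTED:
    payload = per_user_mb(bytes_per_user)
    print(f"{name}: per-user data is about {payload:.2f} MB of {total_mb} MB quoted")
```

Even the largest per-user term (8KB of WebSocket state) accounts for under 80 MB of the 800 MB quoted, so the totals should be read as whole-process footprints rather than as data-structure sizes.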

Production Metrics

High-Demand Event Benchmarks:

Peak Queue Length: 2M+ users waiting
Admission Rate: 1,000 users/second (controlled)
Reservation Success Rate: 95%+ (5% timeout/abandonment)
Notification Delivery: 99.9% within 2 seconds
System Recovery: <30 seconds from Redis failure

Resource Utilization:

Redis Cluster: 3 nodes, 16GB RAM each, 50% CPU utilization
Database: 8 cores, 32GB RAM, 70% CPU utilization
Notification Fleet: 20 pods, 4GB RAM each, 60% CPU utilization

Monitoring & Observability

Key Metrics

Queue Management Metrics:

Queue Length: Current number of users waiting
Average Wait Time: Mean time from join to admission
Admission Rate: Users admitted per second
Queue Position Accuracy: Consistency of position calculations
Redis Memory Usage: Memory consumption by queue data structures

Reservation System Metrics:

Reservation Success Rate: Percentage of successful reservations
Lock Contention: Number of lock conflicts per second
Transaction Duration: Average time for reservation transactions
Abandonment Rate: Percentage of users who don't complete payment
Seat Release Rate: Expired reservations released per minute

Real-Time Notification Metrics:

Message Delivery Rate: Notifications delivered per second
Delivery Latency: Time from event to user notification
Connection Health: Active WebSocket/SSE connections
Message Loss Rate: Failed deliveries due to connection drops
Fanout Efficiency: Messages per SNS publish

Alerting Thresholds

Critical Alerts:

Queue length > 1M users
Reservation failure rate > 10%
Notification delivery latency > 5 seconds
Redis memory usage > 80%
Database connection pool exhaustion

Warning Alerts:

Queue length > 500K users
Reservation failure rate > 5%
Notification delivery latency > 2 seconds
Redis memory usage > 60%
Average wait time > 15 minutes

Channels:

Critical: PagerDuty (immediate escalation)
Warning: Slack (team notification)
Info: Email (daily reports)

Logging Strategy

Queue System Logs:

User join/leave events with timestamps
Position changes and ETA updates
Redis operation performance
Queue admission decisions

Reservation System Logs:

Lock acquisition/release events
Transaction success/failure with reasons
Seat status changes
Payment timeout events

Notification System Logs:

Message publish events
Delivery confirmations
Connection establishment/teardown
Fanout performance metrics

Disaster Recovery

Failure Scenarios

Queue System Failures:

Redis Cluster Failure: 15-30 second recovery with automatic failover
Queue Order Corruption: 2-5 minute recovery using database reconstruction
Position Calculation Errors: Immediate detection via monitoring, <1 minute fix
Admission Rate Overload: Automatic throttling, 30 second stabilization

Reservation System Failures:

Database Connection Loss: 10-20 second recovery with connection pooling
Lock Deadlock: Automatic detection and resolution in <5 seconds
Transaction Rollback: Immediate cleanup, no data corruption
Seat Status Inconsistency: Background reconciliation process

Notification System Failures:

WebSocket Connection Drops: Automatic reconnection in <3 seconds
SNS Service Outage: Fallback to direct database polling
Lambda Function Timeout: Automatic retry with exponential backoff
Message Bus Failure: Graceful degradation to polling mode

Recovery Procedures

Queue System Recovery:

Redis Failover: Automatic promotion of replica to primary
Queue Reconstruction: Rebuild from database using join_seq ordering
Position Recalculation: Update all user positions using head_seq tracking
Admission Resume: Restart controlled admission at safe rate

Reservation System Recovery:

Connection Pool Reset: Clear failed connections and establish new ones
Lock Cleanup: Release any orphaned locks from failed transactions
Seat Status Audit: Verify all seat statuses match reservation records
Transaction Log Replay: Replay any committed transactions that weren't reflected

Notification System Recovery:

Connection Re-establishment: Reconnect all dropped WebSocket/SSE connections
Message Replay: Replay missed notifications from event log
Fanout Restart: Resume SNS-based message distribution
Client Notification: Notify users of temporary service interruption

Backup Strategy

Queue State Backup:

Real-time Replication: Continuous Redis replication to standby cluster
Event Log Backup: All queue events stored in durable database (primary recovery mechanism)
Queue Reconstruction: Rebuild queue from database events if Redis fails
RTO/RPO: 30 second recovery time, 5 second data loss maximum

Reservation Data Backup:

Database Replication: Real-time replication to standby database
Transaction Log Backup: Continuous backup of all reservation transactions
Seat Map Backup: Daily backup of seat configuration and status
RTO/RPO: 15 second recovery time, 1 second data loss maximum

Notification System Backup:

Message Queue Backup: SNS topic replication across regions
Connection State Backup: Periodic backup of active connections
Event Stream Backup: All notification events stored in durable log
RTO/RPO: 45 second recovery time, 10 second data loss maximum

Part 2: Event Discovery

Sumedh Bala — Thu, 23 Oct 2025 17:46:01 +0000

Search & Indexing

Event discovery is the foundation of any ticketing system. The goal is to let users quickly find concerts, sports matches, or theater events by name, location, or venue. While it looks simple from the outside, the design must handle millions of queries, abbreviations (LA vs. Los Angeles), nearby cities (Pasadena), and fuzzy matches (typos like San Franciso).

This breakdown covers the baseline design first, followed by a deep dive into enhanced search capabilities.

1. Baseline Functional Solution

Services

Search API Service – Handles user queries, normalizes input, executes searches, and returns ranked results.
Indexing Service – Updates the search index whenever new events are created or updated.
Event Management Service (EMS) – Manages canonical event metadata and feeds it to the search index.

Databases & Caches

Primary Database (PostgreSQL / MySQL / DynamoDB) – Stores event metadata.
Search Index (Elasticsearch / OpenSearch / Solr) – Provides full-text and geospatial search.
Cache (Redis / Memcached) – Stores trending searches and frequently accessed results.

Database Schema (Events Table)

event_id – UUID or BIGINT (primary key)
event_name – Event title (e.g., "Taylor Swift Eras Tour")
venue_name – Venue (e.g., "SoFi Stadium")
city_normalized – Canonical city name (e.g., "Los Angeles")
state – State or province (e.g., "CA")
country – Country (e.g., "USA")
location_lat – Venue latitude
location_lng – Venue longitude
event_start_time – Start datetime
event_end_time – End datetime
search_synonyms – JSON/array of synonyms (e.g., ["LA", "L.A.", "Los Angeles"])

2. Why Indexes Differ by Column

event_id → B-Tree index, fast lookup after search returns candidate IDs.
event_name, venue_name, search_synonyms → Full-text or trigram indexes (support partial matches/typos).
city_normalized, state, country → B-Tree index (exact match queries).
location_lat, location_lng → GiST or SP-GiST index (PostGIS) for geo queries.
event_start_time, event_end_time → B-Tree index for range queries.
Composite Indexes → (city_normalized, event_start_time) for common multi-column queries like "finding all upcoming events in Los Angeles, ordered by date."

Key Takeaway: Use the right index type per query pattern: B-Trees for equality/range, GIN/trigrams for text, GiST for geo, and composite for multi-column queries. PostgreSQL handles the system of record; Elasticsearch handles fuzzy candidate generation.

3. APIs

Search API (Basic)

Endpoint: GET /search

Query Parameters:

q (string) → user query ("LA concerts")
location (optional, lat/lng) → geo filtering
radius (optional, default 30km)
date_range (optional, start + end)

Response: JSON object with an events array, each including:
event_id, event_name, venue, city, state, country, event_start_time, location (lat/lng)

4. Example SQL Queries

Find events in a city:

SELECT * FROM events WHERE city_normalized = 'Los Angeles' AND event_start_time > NOW();

Find events within 30 km radius:
Use spherical distance function with lat/lng coordinates.
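
A minimal sketch of the spherical distance check using the haversine formula. In production this filter would normally run inside PostGIS or the search engine; the function names and the in-memory event list here are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two lat/lng points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(events, lat, lng, radius_km=30.0):
    """Keep events whose venue lies within radius_km of (lat, lng)."""
    return [
        e for e in events
        if haversine_km(lat, lng, e["location_lat"], e["location_lng"]) <= radius_km
    ]
```

Searching around downtown Los Angeles with the default 30 km radius keeps a Pasadena venue (roughly 14 km away) and drops a San Francisco one.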

Find events by name (text search):

SELECT * FROM events WHERE event_name ILIKE '%Taylor Swift%';

5. Problems Using PostgreSQL Alone

Free-text search is slow at scale.
Fuzzy/typo handling is limited.
Synonyms/aliases require hacks.
Ranking, autocomplete, aggregations are inefficient.
High-QPS text search is not distributed.

Why Elasticsearch/OpenSearch is Needed:

Inverted indices → sub-linear text lookup
Built-in typo tolerance, synonyms, stemming
BM25 ranking & boosting
Efficient geo + text + time queries
Autocomplete and faceted counts
Horizontally scalable and fault-tolerant

Soundbite for interviews: "Postgres is the source of truth, but breaks down for fuzzy search, synonyms, relevance, autocomplete, or scale. Production systems pair it with Elasticsearch/OpenSearch."

6. Enhancing User Experience

Synonym Expansion: Map "LA" → "Los Angeles" and expand queries accordingly.
Geolocation Queries: Lat/lng filtering ensures nearby cities (e.g., Pasadena) appear.
Similarity Search: Fuzzy matching using Elasticsearch analyzers or trigram indexes.

User Query Flow Example:

User types "LA concerts."
Query normalized (lowercase, punctuation removed).
Synonym expansion (LA → Los Angeles).
Geo coordinates retrieved.
Search index queried (text + geo filter).
Results ranked (relevance, time, popularity, distance).
Results cached if trending.
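
The first two steps of this flow (normalization and synonym expansion) can be sketched as below. The synonym table is a hypothetical stand-in for the search_synonyms data, and real systems usually let the search engine's analyzers do this work.

```python
import string

# Hypothetical synonym table; in the design above this comes from search_synonyms.
SYNONYMS = {"la": "los angeles", "sf": "san francisco"}

_PUNCTUATION = str.maketrans("", "", string.punctuation)


def normalize(query):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return " ".join(query.lower().translate(_PUNCTUATION).split())


def expand(query):
    """Normalize the query, then replace known abbreviations token by token."""
    return " ".join(SYNONYMS.get(token, token) for token in normalize(query).split())
```

Both "LA concerts." and "L.A. Concerts" end up as the same expanded query, which also makes the result a better cache key.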

7. Non-Functional Requirements

Latency: Redis caching → sub-100ms for popular searches
Scalability: Elasticsearch distributed indexing/sharding
Availability: Replicated DB + multi-AZ Elasticsearch
Consistency: Event Management Service is source of truth; eventual consistency to search index

8. Deeper Dive: Fuzzy Search

Scenario:

Dictionary: all cities (~200k–500k entries)
User input: "San Franciso" (typo of "San Francisco")
Goal: find the closest matching city efficiently, despite typos

8.1 Naive Levenshtein Distance

How it works:

Compare input string with every city in the dictionary
Compute edit distance (insertions, deletions, substitutions) for each city

Example:

Input: "San Franciso"
"San Francisco" → edit distance = 1 (one missing "c") → match
"Los Angeles" → edit distance = 11 → ignored

Data storage: Simple array or list of city names (no preprocessing)

Time complexity:

Let n = number of cities, m = average city name length
Computing edit distance per city: O(m²) (see detailed explanation at end of document)
Total: O(n * m²) → very slow for large n

Pros: Simple to implement
Cons: Not scalable for real-time search with hundreds of thousands of entries
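
A minimal sketch of the naive approach: a standard dynamic-programming Levenshtein distance, applied to every city in the list. The function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between strings a and b."""
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest(query, cities, max_distance=2):
    """Scan every city: O(n) distance computations, each O(m^2)."""
    scored = [(levenshtein(query, city), city) for city in cities]
    return sorted((d, city) for d, city in scored if d <= max_distance)
```

Keeping only two rows at a time reduces memory to O(m) per comparison, but the full scan over all n cities remains the bottleneck.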

8.2 BK-Tree (Burkhard-Keller Tree)

Purpose: BK-Trees are designed for approximate string matching under a metric (commonly Levenshtein distance). They allow you to find all strings within a given "distance" efficiently, without scanning the entire dataset.

Structure:

Each node = a string (city name)
Each edge = edit distance between parent and child
If a child exists at that edge, insertion continues down that edge

Root Choice: The first inserted word becomes the root. Root choice doesn't affect correctness, but a "central" root can reduce tree depth.

Insertion Example (Detailed; edge distances are approximate and serve to illustrate the mechanics):
Insertion order: Los Angeles → Pasadena → Glendale → San Jose → San Mateo → San Francisco → Fresno

Insert Los Angeles → root
Insert Pasadena → dist(Pasadena, Los Angeles) = 8 → add under Los Angeles
Insert Glendale → dist(Glendale, Los Angeles) = 8 → edge exists → recurse into Pasadena → dist(Glendale, Pasadena) = 7 → add under Pasadena
Insert San Jose → dist(San Jose, Los Angeles) = 11 → add under Los Angeles
Insert San Mateo → dist(San Mateo, Los Angeles) = 11 → recurse into San Jose → dist(San Mateo, San Jose) = 4 → add under San Jose
Insert San Francisco → dist(San Francisco, Los Angeles) = 10 → add under Los Angeles
Insert Fresno → dist(Fresno, Los Angeles) = 10 → recurse into San Francisco → dist(Fresno, San Francisco) = 9 → add under San Francisco

Final Tree:

Los Angeles──(8)──▶ Pasadena  ──(7)──▶ Glendale
              ──(10)──▶ San Francisco  ──(9)──▶ Fresno
              ──(11)──▶ San Jose       ──(4)──▶ San Mateo

Search Logic (Fuzzy Search):
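Query: "San Jsoe" (typo for San Jose), max edit distance k = 2. Distances to non-matching cities are approximate, for illustration.

Start at Los Angeles → dist = 11 → not a match; allowed child edges [11-2, 11+2] = [9,13]
Children within range: San Francisco (10), San Jose (11); Pasadena (8) and its subtree are skipped
Check San Francisco → dist = 6 → not a match; allowed child edges [4,8] exclude Fresno (9) → prune
Check San Jose → dist = 2 (plain Levenshtein counts the swapped "so" as two substitutions) → ✅ Match found
Explore San Jose's children → allowed edges [0,4] include San Mateo (4) → dist = 4 → not a match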

Key Observations:

Branch pruning via [d-k, d+k] makes search efficient
Chains form naturally when cities share distances
Insertion order affects tree shape but not correctness

BK-Tree Complexity:

Distance computation per node: O(m²)
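Typical case (small threshold k): only a fraction of the nodes is visited; O(log n) is a common rule of thumb, not a guarantee
Worst case: O(n) nodes visited
Total search time: O(v * m²), v = nodes actually visited

Intuition: A BK-Tree is a multi-way search tree in which only the branches that can still contain a match are explored, but each node comparison costs O(m²).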
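
A minimal BK-Tree sketch with insertion and range search, using plain Levenshtein distance as the metric. The class layout is illustrative; running it on the example cities produces the exact edge distances, which may differ from a hand-worked tree.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is (word, children), with children keyed by edge distance."""

    def __init__(self, distance=levenshtein):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child  # edge exists: continue down that edge

    def search(self, query, k):
        """Return (matches within distance k, number of nodes visited)."""
        results, visited = [], 0
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            visited += 1
            d = self.distance(query, word)
            if d <= k:
                results.append((d, word))
            # Triangle inequality: only edges in [d - k, d + k] can hold matches.
            for edge, child in children.items():
                if d - k <= edge <= d + k:
                    stack.append(child)
        return sorted(results), visited
```

The visited count returned by search is the v in the O(v * m²) cost above, which makes it easy to measure how much pruning a given threshold actually buys.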

8.3 Trie + Levenshtein Automaton

How it works:

Store all city names in a Trie (prefix tree)
Example: "San Francisco" → S → a → n → _ → F → r → …
Use Levenshtein Automaton to traverse the Trie:
- Track allowed edits at each step
- Prune paths exceeding threshold

Time Complexity:

Worst case: every trie node is visited, with O(m) work per node (m = query length)
Average case with pruning: much less → efficient for large dictionaries

Pros: Scales well, precise, dynamic typo handling
Cons: Higher implementation complexity than BK-Tree

Fuzzy Search with Trie + Levenshtein Automaton (Beginner-Friendly)

Suppose you have a list of city names:

["San Francisco", "San Jose", "San Mateo", "Los Angeles", "Fresno"]

You want to search for "San Jsoe" (typo) and still find "San Jose".

We can do this efficiently using two things together:

A Trie (prefix tree) to store city names.
Levenshtein distance to measure "how different" two strings are.

1. Trie Basics

A Trie is a tree where each node is a letter. Words are formed by following a path from the root to a leaf.

Example:

Root
 |
 S
 |
 a
 |
 n
 |
(space)
 ├── F → "San Francisco"
 ├── J → "San Jose"
 └── M → "San Mateo"

Every path from root → leaf = a city name.
We can quickly follow letters to see which words match a prefix.

2. Levenshtein Distance Basics

Levenshtein distance = minimum edits to turn one string into another

Edit = insert, delete, or change a letter
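Example: "San Jsoe" → "San Jose" → 2 edits under plain Levenshtein (substitute 's'→'o' and 'o'→'s'); Damerau-Levenshtein, which adds adjacent transposition as an edit, would count the swap as 1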

We allow a maximum distance (e.g., 2) so we match words even with typos.

3. How the Search Works

Instead of comparing "San Jsoe" with every city (slow), we combine Trie + Levenshtein Automaton.

At each Trie node, we maintain a distance row representing edits needed to match the query so far.
Row length = query length + 1 = 8 + 1 = 9.

Step 0: Start at empty string

Query = "San Jsoe" (8 chars)
Trie path = ""

Distance row from empty string:

[0, 1, 2, 3, 4, 5, 6, 7, 8]

Index = number of query characters considered
Value = edits needed to turn empty string → query prefix

Step 1: Add first letter `'S'` → Trie path `"S"`

Compute new row using formula:

new_row[j] = min(
    previous_row[j] + 1,      # deletion
    new_row[j-1] + 1,         # insertion
    previous_row[j-1] + cost  # substitution (0 if match, 1 if not)
)

Example table:

j	Query prefix	Compare 'S' vs query[j-1]	Cost	Computation	new_row[j]
0	""	—	—	0 + 1 = 1	1
1	"S"	S vs S	0	min(1+1,1+1,0+0)=0	0
2	"Sa"	S vs a	1	min(2+1,0+1,1+1)=1	1
3	"San"	S vs n	1	min(3+1,1+1,2+1)=2	2
4	"San "	S vs (space)	1	min(4+1,2+1,3+1)=3	3
5	"San J"	S vs J	1	min(5+1,3+1,4+1)=4	4
6	"San Js"	S vs s	1	min(6+1,4+1,5+1)=5	5
7	"San Jso"	S vs o	1	min(7+1,5+1,6+1)=6	6
8	"San Jsoe"	S vs e	1	min(8+1,6+1,7+1)=7	7

Resulting row: [1, 0, 1, 2, 3, 4, 5, 6, 7]

Step 2: Add `'a'` → Trie path `"Sa"`

Example table:

j	Query prefix	Compare 'a' vs query[j-1]	Cost	Computation	new_row[j]
0	""	—	—	1 + 1 = 2	2
1	"S"	a vs S	1	min(0+1,2+1,1+1)=1	1
2	"Sa"	a vs a	0	min(1+1,1+1,0+0)=0	0
3	"San"	a vs n	1	min(2+1,0+1,1+1)=1	1
4	"San "	a vs (space)	1	min(3+1,1+1,2+1)=2	2
5	"San J"	a vs J	1	min(4+1,2+1,3+1)=3	3
6	"San Js"	a vs s	1	min(5+1,3+1,4+1)=4	4
7	"San Jso"	a vs o	1	min(6+1,4+1,5+1)=5	5
8	"San Jsoe"	a vs e	1	min(7+1,5+1,6+1)=6	6

Resulting row: [2, 1, 0, 1, 2, 3, 4, 5, 6]

Step 3: Add `'n'` → Trie path `"San"`

Distance row:

[3, 2, 1, 0, 1, 2, 3, 4, 5]

Minimum distance = 0 → path "San" matches query "San" perfectly so far.

Step 4: Add space `' '` → Trie path `"San "`

Distance row:

[4, 3, 2, 1, 0, 1, 2, 3, 4]

Minimum distance = 0 → "San " matches query so far

Step 5: Continue down `"San J"` → `"San Jose"`

Add 'J' → distance row: [5,4,3,2,1,0,1,2,3]
Add 'o' → [6,5,4,3,2,1,1,1,2]
Add 's' → [7,6,5,4,3,2,1,2,2]
Add 'e' → [8,7,6,5,4,3,2,2,2] ✅
Final distance = 2 (the last entry of the row, within the max of 2) → MATCH FOUND "San Jose"

Step 6: Early Pruning

"San Francisco" → distance eventually > 2 → stop exploring
"San Jose" → distance ≤ 2 → match
"San Mateo" → distance > 2 → stop

Step 7: Intuition

Walk down the Trie (letter by letter).
Track minimum edits to match the query at each step (distance row).
Stop exploring paths where edits exceed max allowed distance.
If a complete word reaches end node within max edits → success!

This method powers fuzzy search, autocomplete, and spell correction in real systems.
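
The walkthrough above can be sketched as a trie traversal that carries one distance row per node. Strictly speaking this is the dynamic-programming-over-a-trie technique the steps describe, not a precompiled Levenshtein automaton; the function and class names are illustrative.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # character -> TrieNode
        self.word = None    # set when a city name ends at this node


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word
    return root


def next_row(prev_row, ch, query):
    """Distance row after appending ch to the trie path (the formula in Step 1)."""
    row = [prev_row[0] + 1]
    for j in range(1, len(query) + 1):
        cost = 0 if query[j - 1] == ch else 1
        row.append(min(prev_row[j] + 1,          # deletion
                       row[j - 1] + 1,           # insertion
                       prev_row[j - 1] + cost))  # substitution or match
    return row


def fuzzy_search(root, query, k):
    """Return sorted (distance, word) pairs for words within k edits of query."""
    results = []

    def walk(node, row):
        if node.word is not None and row[-1] <= k:
            results.append((row[-1], node.word))
        # Early pruning: if every entry exceeds k, no extension can recover.
        if min(row) <= k:
            for ch, child in node.children.items():
                walk(child, next_row(row, ch, query))

    walk(root, list(range(len(query) + 1)))
    return sorted(results)
```

Because the row for a shared prefix such as "San " is computed once and passed down to every child, the three "San ..." cities reuse the same four row computations.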

5. Key Advantages

Early Pruning: Stop exploring paths when distance exceeds threshold
Shared Prefixes: Multiple cities with same prefix share computation
Incremental Updates: Each step down the trie computes one new distance row in O(m), where m is the query length
Memory Efficient: Only one distance row per node on the current search path is kept

6. Performance Comparison

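Method	Time Complexity	Space	Implementation
Naive Levenshtein	O(n × m²)	O(m²) per comparison (O(m) with two rows)	Simple
BK-Tree	O(v × m²), v = nodes visited (v = n in the worst case)	O(n)	Medium
Trie + Levenshtein Automaton	O(t × m), t = trie nodes visited	O(trie nodes + m)	Complex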

7. When to Use

Large dictionaries (100k+ entries)
Frequent fuzzy searches
Memory constraints (trie is more compact than BK-tree)
Need for prefix-based filtering

The Trie + Levenshtein Automaton approach is the most sophisticated but also the most efficient for large-scale fuzzy string matching in production systems.

8.4 Trigram Index / Elasticsearch Approach

How it works:

Create Trigrams: Break each city name into 3-character overlapping sequences
- Example: "San Francisco" → ["San", "an ", "n F", " Fr", "Fra", "ran", "anc", "nci", "cis", "isc", "sco"]
Build Inverted Index: Each trigram → list of city IDs
- Example: "Fra" → [San Francisco, Frankfurt]
Querying: Input "San Franciso" → generate trigrams → retrieve all cities containing at least one trigram → compute similarity score → return top matches

Pros:

Scales to millions of entries
Handles typos, partial matches, multi-word queries
Produces ranked results for UX

Cons: Requires search engine infrastructure, slightly more memory overhead
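
A minimal sketch of the trigram pipeline described above: generate trigrams, build the inverted index, fetch candidates, and rank them. Jaccard similarity over trigram sets is used as the score here as an assumption; the sketch lowercases input and skips the word-boundary padding that engines such as pg_trgm apply.

```python
from collections import defaultdict


def trigrams(text):
    """Set of overlapping 3-character sequences in the lowercased text."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


def build_index(cities):
    """Inverted index: trigram -> set of cities containing it."""
    index = defaultdict(set)
    for city in cities:
        for gram in trigrams(city):
            index[gram].add(city)
    return index


def search(index, query, top=3):
    """Rank candidate cities by Jaccard similarity of their trigram sets."""
    q = trigrams(query)
    candidates = set()
    for gram in q:
        candidates |= index.get(gram, set())
    scored = []
    for city in candidates:
        c = trigrams(city)
        scored.append((len(q & c) / len(q | c), city))
    return sorted(scored, key=lambda s: (-s[0], s[1]))[:top]
```

Only cities that share at least one trigram with the query are ever scored, which is what keeps the candidate set small compared with a full scan.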

9. How Elasticsearch Ranks Results: BM25 + Boosting

BM25 (Best Match 25)

Statistical ranking algorithm (improvement on TF-IDF)

Key ingredients:

Term Frequency (TF) → frequency of term in field, diminishing returns
Inverse Document Frequency (IDF) → rare terms weigh more
Field Length Normalization → shorter fields get higher weight
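
The three ingredients combine into a per-term score. The sketch below uses the common Okapi BM25 form with the IDF variant and defaults (k1 = 1.2, b = 0.75) used by Lucene-based engines; the function and parameter names are illustrative.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """BM25 contribution of a single query term to one document's score.

    tf             : occurrences of the term in the field
    doc_len        : field length in tokens
    avg_doc_len    : average field length across the index
    n_docs         : total number of documents
    docs_with_term : number of documents containing the term
    """
    # IDF: rare terms weigh more.
    idf = math.log(1 + (n_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Length normalization: shorter-than-average fields shrink the denominator.
    norm = tf + k1 * (1 - b + b * doc_len / avg_doc_len)
    # TF with saturation: each extra occurrence adds less than the previous one.
    return idf * tf * (k1 + 1) / norm
```

A document's score for a multi-word query is the sum of this value over the query terms, before any boosting is applied.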

Boosting

Manually tweak ranking

Field Boosting: event_name^3, artist_name^2, city^1
Query Boosting: recency, proximity
Business Logic Boosting: trending events, sponsored events

Interview Takeaway: BM25 + Boosting ensures results are relevant (BM25) and aligned with business/user intent (boosting).

10. Indexing Every PostgreSQL Field: Pros & Cons

When designing a large-scale ticketing system, deciding which fields to index in PostgreSQL is crucial, especially when pairing it with a specialized search engine like Elasticsearch.

Pros of PostgreSQL Indexing:

Backup System: PostgreSQL can act as a reliable fallback if Elasticsearch experiences an outage, ensuring continuous service.
Direct Access for Specific Queries: For certain precise queries or filters (e.g., direct lookups by event_id), direct access through PostgreSQL indexes can be significantly faster.
Strong Consistency & Data Integrity: Indexes on crucial fields (like event_id, often as a primary key) are paramount for enforcing strong consistency and data integrity. While the guarantee of integrity comes from database constraints (e.g., PRIMARY KEY, UNIQUE) and transactional properties, indexes are essential for the efficient and scalable enforcement of these guarantees. They enable the database to quickly check for uniqueness and relationships, preventing issues like duplicate events, even when the events table contains millions of records. This is a core requirement that search engines are not designed to fulfill.
Simplified Read-Your-Own-Writes: Immediate query results for newly created or updated events are possible directly from PostgreSQL, providing consistent results for the user who made the change.

Cons of PostgreSQL Indexing:

Increased Write Overhead: More indexes generally lead to slower insert and update operations, as each index needs to be maintained.
Higher Storage Costs: Each index consumes additional disk space, increasing overall storage expenses.
Inefficient for Complex Queries: PostgreSQL's built-in text and fuzzy search capabilities are less efficient at scale compared to specialized search engines like Elasticsearch.
Maintenance Complexity: Managing and tuning a large number of indexes can add to operational overhead.

Recommendation for Ticketing Systems: Strategic PostgreSQL Indexing

While it's not necessary to index every field in PostgreSQL, it plays a vital role. The primary responsibility for high-volume, complex search queries is offloaded to Elasticsearch, which is horizontally scalable and optimized for such tasks.

PostgreSQL indexes serve two critical, complementary roles in this architecture:

Ensuring Data Integrity and Transactional Consistency: Indexes on crucial fields like event_id (typically as a primary key) are paramount. They enable the database to efficiently enforce UNIQUE and PRIMARY KEY constraints, ensuring that event data is always correct and preventing issues like duplicate event entries. This is a core requirement that search engines cannot guarantee.
Providing a Failsafe: Event creation and updates are low-volume operations in a ticketing system compared to search queries, so the write overhead of maintaining additional PostgreSQL indexes is acceptable. In return, PostgreSQL can keep serving critical event lookups during a search engine outage.

Soundbite for Interview: "Mentioning PostgreSQL indexing demonstrates you've considered distributed system failure modes and have a plan B. Downsides are outweighed by reliability, consistency, and data integrity."

11. Indexing Service & Event Management Service

These services ensure search data is accurate, timely, and scalable, maintaining a strong connection between canonical event data in PostgreSQL and the search index in Elasticsearch/OpenSearch.

11.1 Event Management Service (EMS)

Purpose: EMS is the source of truth for all event metadata. Responsible for creating, updating, and managing canonical event records.

Responsibilities:

Manage all event-related data:
- Event name, venue, city, state, country
- Start and end times
- Event capacity and ticket types
- Synonyms for search queries (e.g., LA → Los Angeles)
Validate incoming event data for consistency and correctness
Track changes via event versioning or timestamps
Emit notifications for downstream systems when events are created or updated

Key Considerations for High Scale:

Use atomic transactions in PostgreSQL to ensure consistency
Implement soft deletes to prevent broken references in search index
Emit event change logs to message broker (Kafka, RabbitMQ) for asynchronous processing by Indexing Service
Ensure idempotency in updates to avoid duplicate indexing

Example Flow:

Event created/updated in PostgreSQL
EMS validates the record
EMS publishes event notification (message queue / Kafka topic)
Indexing Service consumes the message to update the search index

11.2 Indexing Service

Purpose: Keep the search index up-to-date with canonical data for fast, accurate results

Responsibilities:

Consume event change notifications from EMS
Transform data into search-friendly format:
- Text fields → tokenized
- Locations → geo-points
- Dates → sortable timestamps
Update/delete documents in Elasticsearch/OpenSearch
Handle batch updates and real-time streaming
Maintain versioning to detect stale updates

Key Implementation Details:

Message Queue Consumption: Kafka, RabbitMQ, or Kinesis
Idempotent Updates: Each event has a unique ID
Bulk Indexing: For new events or large updates
Error Handling: Failed updates retried, dead-letter queue stores permanently failed updates

Consistency Model:

Elasticsearch is eventually consistent
EMS remains the source of truth

Example Flow:

Receive "event_created" message from EMS
Transform event record (normalize names, add synonyms, compute geo coordinates)
Push to Elasticsearch via bulk API
Confirm update or retry if it fails
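
Idempotent updates and stale-update detection can be sketched with a version check. The dictionary below stands in for the search index, and the message fields (event_id, version, doc) are assumed names; Elasticsearch provides external versioning that enforces a similar rule on the server side.

```python
def apply_update(index, message):
    """Apply an event change message unless it is stale or a duplicate.

    index   : dict mapping event_id -> {"version": int, "doc": dict}
    message : dict with "event_id", "version", and "doc" (assumed field names)
    Returns True if the index was changed.
    """
    current = index.get(message["event_id"])
    if current is not None and current["version"] >= message["version"]:
        return False  # older or repeated message: safe to acknowledge and skip
    index[message["event_id"]] = {
        "version": message["version"],
        "doc": message["doc"],
    }
    return True
```

With this check, retries and out-of-order delivery from the message queue cannot overwrite a newer document with an older one.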

11.3 Architecture & Reliability

Decoupling: EMS and Indexing Service communicate via message queues → prevents blocking writes in Postgres during indexing
Resilience: Retry mechanisms and dead-letter queues for failures
Scalability: Multiple consumers can scale horizontally for high volumes
Observability: Metrics (throughput, error rates, lag), logging full trace (event_id, timestamp, status)

11.4 Interview Takeaways

EMS ensures canonical, consistent event data
Indexing Service keeps search index current and performant
Decoupled, message-driven architecture → scalable & fault-tolerant
Idempotent + bulk operations → reliability under high load
Plan B for indexing failures demonstrates awareness of real-world distributed system design

12. Performance Benchmarks

Search Performance Comparison

Category	Method	P95 Latency	QPS	Memory	Notes
DB-native	PostgreSQL ILIKE	200ms	1,000	2GB	Case-insensitive substring matching
DB-native	PostgreSQL Trigram	500ms	500	4GB	Fuzzy text search
Search Engine	Elasticsearch (Fuzzy)	50ms	10,000+	8GB	Scalable fuzzy search
Search Engine	Elasticsearch (Regex)	80ms	8,000+	10GB	Regex/wildcard queries
Search Engine	OpenSearch	60ms	9,000+	8GB	Open-source fork of Elasticsearch (also offered as an AWS managed service)
Search Engine	Solr	70ms	8,500+	9GB	Enterprise search platform
Algorithm	BK-Tree	20ms	5,000	16GB	Specialized fuzzy matching
Algorithm	Trie + Levenshtein	10ms	15,000+	6GB	Efficient fuzzy matching
Cache	Redis Cached	5ms	50,000+	4GB	Hot queries only

Why Trigram Indexes Excel at Complex Queries

While Trie + Levenshtein Automaton is faster for simple fuzzy matching, Trigram indexes are superior for complex queries because they handle multiple search patterns simultaneously. Trigram indexes break text into overlapping 3-character sequences, allowing them to efficiently process:

Multi-word queries: "Taylor Swift concert" matches documents containing "Taylor", "Swift", and "concert" even with typos
Partial matches: "San Fran*" matches "San Francisco" and "San Francisco Bay"
Regex patterns: Complex regular expressions like [A-Z][a-z]+ [A-Z][a-z]+ for proper names
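Wildcard queries: "conc*" matches "concert", "concerto", "conclusion" (but not "conference")
Phrase proximity: Finding documents where "Taylor" and "Swift" appear within 5 words of each other (this relies on term positions stored alongside the index rather than on the trigrams themselves)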

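Performance advantage: A trigram index maps every 3-character sequence to the documents that contain it. A pattern is decomposed into its trigrams, the matching posting lists are intersected to produce a small candidate set, and only those candidates are checked against the full pattern. The cost therefore grows with the number of candidates rather than O(n×m²) across the whole corpus. Elasticsearch's wildcard field type takes the same approach for regex/wildcard queries despite the higher memory overhead, because the cost of unindexed pattern matching makes the trade-off worthwhile.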

Real-world example: A query like "LA concerts this weekend" with typos becomes much more efficient with trigram indexing because it can simultaneously match "LA" (city), "concert*" (event type), and "weekend" (time) across multiple fields and handle variations in each term.
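To make the candidate-filtering idea concrete, here is a small sketch of a trigram index over city names. It is an illustration under assumptions, not how PostgreSQL or Elasticsearch implement it: the `TrigramIndex` class, the padding scheme, and the 0.5 similarity threshold are all chosen for the example.

```python
from collections import defaultdict


def trigrams(text: str) -> set:
    """Overlapping 3-character sequences of the lowercased, padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


class TrigramIndex:
    def __init__(self) -> None:
        self.postings = defaultdict(set)  # trigram -> ids of names containing it
        self.names = {}

    def add(self, doc_id: int, name: str) -> None:
        self.names[doc_id] = name
        for gram in trigrams(name):
            self.postings[gram].add(doc_id)

    def search(self, query: str, min_similarity: float = 0.5) -> list:
        """Return (name, similarity) pairs ranked by Jaccard similarity."""
        query_grams = trigrams(query)
        # Step 1: only names sharing at least one trigram are candidates.
        candidates = set()
        for gram in query_grams:
            candidates |= self.postings.get(gram, set())
        # Step 2: score the candidates; the rest of the corpus is never touched.
        results = []
        for doc_id in candidates:
            doc_grams = trigrams(self.names[doc_id])
            score = len(query_grams & doc_grams) / len(query_grams | doc_grams)
            if score >= min_similarity:
                results.append((self.names[doc_id], round(score, 2)))
        return sorted(results, key=lambda item: -item[1])


index = TrigramIndex()
for i, city in enumerate(["San Francisco", "San Fernando", "Los Angeles"]):
    index.add(i, city)
matches = index.search("San Franciso")  # query with a typo
```

"Los Angeles" shares no trigram with the query, so it is never scored at all, while "San Fernando" becomes a candidate but falls below the threshold.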

Production Metrics

Elasticsearch Cluster: 3-node, 16GB RAM, 500GB index
Query Latency: P50: 15ms, P95: 45ms, P99: 100ms
Throughput: 8,000 QPS sustained, 15,000 QPS burst
Cache Hit Rate: 85% for trending searches

13. Caching Strategy

Multi-Layer Architecture

CDN Cache (CloudFlare) - Static content, 24h TTL
Application Cache (Redis) - Query results, 5-15min TTL
Elasticsearch Query Cache - Search index cache

Cache Layers

Query Results: Popular searches cached for 15 minutes
Autocomplete: City/venue suggestions cached for 1 hour
Geo Data: Location lookups cached for 24 hours
Trending Searches: Top queries cached for 1 hour

Cache Invalidation

Event Updates: Invalidate related searches by city/venue
Time-based: Different TTLs based on data volatility
Cache Warming: Pre-populate trending searches during off-peak
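A rough sketch of how per-entry TTLs and city/venue invalidation could fit together is shown below. The `TaggedTTLCache` class and its tag scheme are hypothetical and in-process only; in the architecture above this role is played by Redis, and the injectable clock exists purely so the behaviour can be demonstrated without waiting.

```python
import time
from typing import Any, Callable, Optional


class TaggedTTLCache:
    """In-memory cache with per-entry TTLs and tag-based invalidation."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at, tags)

    def put(self, key: str, value: Any, ttl_seconds: float, tags=()) -> None:
        self._entries[key] = (value, self._clock() + ttl_seconds, frozenset(tags))

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at, _ = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # lazily drop expired entries
            return None
        return value

    def invalidate_tag(self, tag: str) -> int:
        """Remove every entry carrying the tag; returns how many were dropped."""
        stale = [k for k, (_, _, tags) in self._entries.items() if tag in tags]
        for key in stale:
            del self._entries[key]
        return len(stale)


now = [0.0]  # fake clock so the example is deterministic
cache = TaggedTTLCache(clock=lambda: now[0])
cache.put("query:jazz:sf", ["event-1"], ttl_seconds=15 * 60, tags={"city:sf"})
cache.put("autocomplete:san", ["San Francisco"], ttl_seconds=60 * 60)
```

When an event in a city changes, a single `invalidate_tag("city:sf")` call drops the affected query results while longer-lived entries such as autocomplete stay warm.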

14. Monitoring & Observability

Key Metrics

Search Performance: Latency (P50/P95/P99), success rate, error rate
System Health: Elasticsearch cluster status, node health, shard status
Business Metrics: Search volume, popular queries, user engagement
Cache Performance: Hit rate, memory usage, eviction rate

Alerting Thresholds

Critical: Search latency > 100ms, error rate > 5%, cluster health = Red
Warning: Search latency > 50ms, error rate > 1%, cluster health = Yellow
Channels: PagerDuty (critical), Slack (warnings), Email (reports)
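The thresholds above translate directly into a small classification function. This is only a sketch: the `SearchHealth` type, the `classify` function, and the `ROUTES` mapping are names invented for illustration, and real alerting would normally live in the monitoring system's own rule language.

```python
from dataclasses import dataclass


@dataclass
class SearchHealth:
    latency_ms: float
    error_rate: float    # fraction, e.g. 0.02 for 2%
    cluster_status: str  # "green", "yellow", or "red"


def classify(sample: SearchHealth) -> str:
    """Map a health sample to a severity using the thresholds listed above."""
    if sample.latency_ms > 100 or sample.error_rate > 0.05 or sample.cluster_status == "red":
        return "critical"
    if sample.latency_ms > 50 or sample.error_rate > 0.01 or sample.cluster_status == "yellow":
        return "warning"
    return "ok"


# Severity -> notification channel, mirroring the channel list above.
ROUTES = {"critical": "pagerduty", "warning": "slack"}

severity = classify(SearchHealth(latency_ms=72, error_rate=0.004, cluster_status="green"))
```

Checking the critical conditions first ensures a sample that crosses both thresholds is escalated rather than downgraded to a warning.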

Logging

Search Queries: Query, user, latency, results count, cache hit status
Errors: Timeout, connection failures, retry attempts
Performance: Elasticsearch response times, cache hit rates

15. Disaster Recovery

Failure Scenarios

Single Node: 0 downtime, automatic shard rebalancing
Cluster Failure: 15-30 min downtime, restore from snapshots
Data Corruption: 2-4 hours downtime, full reindex from PostgreSQL

Recovery Procedures

Cluster Health Check: Verify cluster status and shard allocation
Snapshot Restore: Restore from daily snapshots stored in S3
Full Rebuild: Reindex entire dataset from PostgreSQL source
Incremental Sync: Catch up on missed updates using timestamps

Fallback Strategies

PostgreSQL Fallback: Use ILIKE queries with reduced functionality
Read-Only Mode: Serve cached results only during outages
Graceful Degradation: Show maintenance message with trending events
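One way to wire these fallbacks together is an ordered chain of backends that is tried until one answers. The sketch below uses stub functions in place of real clients; `search_with_fallback` and the three backend functions are hypothetical names, and the simulated outage is there only to show the chain in action.

```python
from typing import Callable, List, Sequence, Tuple

Backend = Tuple[str, Callable[[str], List[str]]]


def search_with_fallback(query: str, backends: Sequence[Backend]) -> Tuple[str, List[str]]:
    """Try each backend in order; report which one actually served the query."""
    for name, backend in backends:
        try:
            return name, backend(query)
        except Exception:
            continue  # this tier is down, fall through to the next one
    return "maintenance", []  # nothing answered: caller shows a maintenance message


def elasticsearch_search(query: str) -> List[str]:
    raise ConnectionError("cluster unavailable")  # simulated outage


def postgres_ilike_search(query: str) -> List[str]:
    return [f"event matching {query}"]  # stand-in for an ILIKE query


def cached_trending(query: str) -> List[str]:
    return ["trending event"]  # stand-in for cached results


BACKENDS = [
    ("elasticsearch", elasticsearch_search),
    ("postgres_ilike", postgres_ilike_search),
    ("cache", cached_trending),
]
source, results = search_with_fallback("jazz", BACKENDS)
```

Returning the name of the serving tier alongside the results lets the caller log degraded responses and adjust the UI accordingly.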

Backup Strategy

Daily Snapshots: Automated daily backups to S3
Cross-Region Replication: Real-time replication to backup cluster
RTO/RPO: 15-30 min recovery time, 15 min data loss maximum

Appendix: Why Edit Distance is O(m²)

The Levenshtein distance (edit distance) has O(m²) time complexity because it uses a dynamic programming approach with a 2D table where m is the length of the strings being compared.

The Algorithm Structure

For two strings of lengths m and n, the algorithm creates an (m+1) × (n+1) table (roughly m × m when the strings have similar length) and fills it using the following recurrence relation:

def levenshtein_distance(s1, s2):
    m, n = len(s1), len(s2)

    # Create (m+1) × (n+1) table
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Base cases
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    # Fill the table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1]  # No operation needed
            else:
                dp[i][j] = 1 + min(
                    dp[i-1][j],      # Deletion
                    dp[i][j-1],      # Insertion  
                    dp[i-1][j-1]     # Substitution
                )

    return dp[m][n]

Understanding the Base Cases

The base cases initialize the first row and first column of the DP table:

# Base cases
for i in range(m + 1):
    dp[i][0] = i
for j in range(n + 1):
    dp[0][j] = j

What these base cases represent:

dp[i][0] = i (First column):
- Converting s1[0...i-1] to an empty string
- Requires i deletions (delete all i characters)
- Example: "San Francisco" → "" needs 13 deletions
dp[0][j] = j (First row):
- Converting an empty string to s2[0...j-1]
- Requires j insertions (insert all j characters)
- Example: "" → "San Franciso" needs 12 insertions

Visual representation:

     ""  S  a  n     F  r  a  n  c  i  s  o
""  [0][1][2][3][4][5][6][7][8][9][10][11][12]
S   [1][?][?][?][?][?][?][?][?][?][?][?][?]
a   [2][?][?][?][?][?][?][?][?][?][?][?][?]
n   [3][?][?][?][?][?][?][?][?][?][?][?][?]
    [4][?][?][?][?][?][?][?][?][?][?][?][?]
F   [5][?][?][?][?][?][?][?][?][?][?][?][?]
r   [6][?][?][?][?][?][?][?][?][?][?][?][?]
a   [7][?][?][?][?][?][?][?][?][?][?][?][?]
n   [8][?][?][?][?][?][?][?][?][?][?][?][?]
c   [9][?][?][?][?][?][?][?][?][?][?][?][?]
i   [10][?][?][?][?][?][?][?][?][?][?][?][?]
s   [11][?][?][?][?][?][?][?][?][?][?][?][?]
c   [12][?][?][?][?][?][?][?][?][?][?][?][?]
o   [13][?][?][?][?][?][?][?][?][?][?][?][?]

Why these base cases make sense:

Row 0, Column 0: dp[0][0] = 0 (empty string to empty string = 0 operations)
Row 0, Column j: dp[0][j] = j (empty string to j-character string = j insertions)
Row i, Column 0: dp[i][0] = i (i-character string to empty string = i deletions)

These base cases provide the foundation for the dynamic programming recurrence relation that fills the rest of the table.

Why O(m²)

Table Size: We create a table of size (m+1) × (n+1), which is approximately m × m when strings are similar length
Nested Loops: We have two nested loops:
- Outer loop: for i in range(1, m + 1) → O(m) iterations
- Inner loop: for j in range(1, n + 1) → O(m) iterations (when n ≈ m)
Total Operations: O(m) × O(m) = O(m²)

Visual Example

For strings "San Francisco" (m=13) and "San Franciso" (n=12):

     S  a  n     F  r  a  n  c  i  s  o
   [0][1][2][3][4][5][6][7][8][9][10][11][12]
S  [1][0][1][2][3][4][5][6][7][8][9][10][11]
a  [2][1][0][1][2][3][4][5][6][7][8][9][10]
n  [3][2][1][0][1][2][3][4][5][6][7][8][9]
   [4][3][2][1][0][1][2][3][4][5][6][7][8]
F  [5][4][3][2][1][0][1][2][3][4][5][6][7]
r  [6][5][4][3][2][1][0][1][2][3][4][5][6]
a  [7][6][5][4][3][2][1][0][1][2][3][4][5]
n  [8][7][6][5][4][3][2][1][0][1][2][3][4]
c  [9][8][7][6][5][4][3][2][1][0][1][2][3]
i  [10][9][8][7][6][5][4][3][2][1][0][1][2]
s  [11][10][9][8][7][6][5][4][3][2][1][0][1]
c  [12][11][10][9][8][7][6][5][4][3][2][1][1]
o  [13][12][11][10][9][8][7][6][5][4][3][2][1]

Each cell requires constant time to compute, but we need to fill 13 × 12 = 156 cells, which is O(m²).
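The cell count and the final distance can be checked by building the whole table and counting how many cells the nested loops fill. This is the same recurrence as the function above, with a counter added; `levenshtein_table` is just a name chosen for this sketch.

```python
def levenshtein_table(s1: str, s2: str):
    """Return the full DP table and the number of cells filled by the loops."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # i deletions
    for j in range(n + 1):
        dp[0][j] = j  # j insertions

    filled = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
            filled += 1
    return dp, filled


table, filled = levenshtein_table("San Francisco", "San Franciso")
distance = table[-1][-1]  # bottom-right cell holds the final edit distance

for row in table:
    print(" ".join(f"{cell:2d}" for cell in row))
```

The table has 14 rows and 13 columns once the base-case row and column are included; the loops fill the remaining 13 × 12 cells, and the bottom-right cell gives a distance of 1 for the missing "c".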

Why This Matters in the Ticketing System

In the context of the ticketing system:

City Names: "San Francisco" vs "San Franciso" (typo)
Scale Problem: With 200k-500k cities, computing edit distance for each one would be:
- Per city: O(m²) where m ≈ 10-15 characters
- Total: O(n × m²) ≈ 500,000 × 15² = 112.5 million cell updates per query
Performance Impact: This is why the document mentions it's "very slow for large n"

Alternative Solutions

This is exactly why the document suggests:

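BK-Trees: Roughly O(log n) nodes visited for small distance thresholds, but each comparison still costs O(m²)
Trie + Levenshtein Automaton: More complex, but it walks the trie and the automaton together and prunes whole prefixes that already exceed the allowed distance
Trigram Indexing: Index lookups narrow the search to a small candidate set, so only those candidates need an exact comparison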

The O(m²) complexity is inherent to the edit distance algorithm itself - it's the mathematical cost of computing the minimum number of operations needed to transform one string into another.
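As an illustration of the first alternative, here is a compact BK-tree sketch. It is a minimal version under assumptions (tuple-based nodes, a two-row `edit_distance` helper, a hand-picked list of cities) rather than a tuned implementation. The pruning relies on the triangle inequality: a child stored at distance d from its parent can only contain matches if d lies within `max_distance` of the query's distance to that parent.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using two rolling rows instead of the full table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


class BKTree:
    def __init__(self) -> None:
        self.root = None  # node = (word, {distance_to_parent: child_node})

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            distance = edit_distance(word, node[0])
            if distance == 0:
                return  # word already present
            child = node[1].get(distance)
            if child is None:
                node[1][distance] = (word, {})
                return
            node = child

    def search(self, query: str, max_distance: int) -> list:
        """Return sorted (distance, word) pairs within max_distance of query."""
        if self.root is None:
            return []
        matches, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            distance = edit_distance(query, word)
            if distance <= max_distance:
                matches.append((distance, word))
            # Triangle inequality: only these child edges can hold matches.
            for child_distance, child in children.items():
                if distance - max_distance <= child_distance <= distance + max_distance:
                    stack.append(child)
        return sorted(matches)


tree = BKTree()
for city in ["San Francisco", "San Fernando", "Los Angeles", "San Diego", "Santa Fe"]:
    tree.add(city)
nearby = tree.search("San Franciso", max_distance=2)
```

Each visited node still pays the O(m²) comparison cost; the saving comes from skipping subtrees that cannot contain a match.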

Designing a Large-Scale Ticketing System

Sumedh Bala — Thu, 23 Oct 2025 17:41:50 +0000

Introduction

Inspired by https://www.youtube.com/watch?v=fhdPyoO6aXI&t=48s
Ticketing systems such as Ticketmaster or Eventbrite appear simple on the surface: a user finds an event, selects a seat, and completes the purchase. To a user, it feels like a straightforward transaction. In reality, these systems are among the most challenging distributed systems to design because they must handle extreme concurrency, strict consistency requirements, fraud prevention, and unpredictable demand spikes.

This series will walk through the design of a ticketing system step by step, highlighting the functional requirements, non-functional requirements, design trade-offs, and architectural patterns that are relevant for a system design interview.

This series takes a different approach:

Go deep into multiple areas – sometimes interviewers drive the follow-ups, not the candidate. Each functional requirement is treated as a system design interview in itself.
Cover database schemas and APIs – some interviewers want to see practical modeling and integration.
Build reusable solutions – patterns from one domain (e.g., event search) can often apply to others (e.g., product search).

Functional Requirements

A ticketing system supports the entire user journey: discovery, selection, purchase, and ticket delivery.

Event Discovery

Search for events by location, category, or date.
View event details (venue, performers, time).

Seat Selection

Display seating maps with real-time availability.
Support both reserved seating (specific seat numbers) and general admission (ticket buckets).

Reservation

Temporarily reserve seats while a user is in the checkout flow.
Expire reservations if payment is not completed within a time limit (usually 2–10 minutes depending on policy).
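A temporary hold with expiry can be sketched as a map from seat to holder and deadline. The `SeatHolds` class below is a single-process illustration only, and the 5-minute `HOLD_SECONDS` value is an assumption picked from the 2-10 minute range above; a production system would keep this state in a shared store with atomic operations.

```python
import time
from typing import Callable, Dict, Optional, Tuple

HOLD_SECONDS = 300  # assumed 5-minute hold


class SeatHolds:
    """Tracks temporary seat holds that lapse automatically after a deadline."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._holds: Dict[str, Tuple[str, float]] = {}  # seat_id -> (user_id, expires_at)

    def hold(self, seat_id: str, user_id: str) -> bool:
        """Place or refresh a hold; fails if another user holds the seat."""
        now = self._clock()
        current = self._holds.get(seat_id)
        if current is not None and current[1] > now and current[0] != user_id:
            return False
        self._holds[seat_id] = (user_id, now + HOLD_SECONDS)
        return True

    def holder(self, seat_id: str) -> Optional[str]:
        """Return the user with an unexpired hold, or None if the seat is free."""
        current = self._holds.get(seat_id)
        if current is None or current[1] <= self._clock():
            return None
        return current[0]


now = [0.0]  # fake clock so expiry can be shown without waiting
holds = SeatHolds(clock=lambda: now[0])
alice_got_it = holds.hold("A1", "alice")
bob_got_it = holds.hold("A1", "bob")  # rejected while Alice's hold is live
```

Expiry is checked lazily on access, so no background sweeper is needed for correctness, only for reclaiming memory.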

Purchase & Payment

Securely process payments.
Ensure idempotency (no double-charging if retries occur).
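Idempotency is usually achieved by having the client send a unique key with each payment attempt and storing the first result under that key. The sketch below assumes a hypothetical `PaymentService` wrapper and a fake charge function; in practice the key store would need to be durable and enforce uniqueness, which a plain dictionary does not.

```python
from typing import Callable, Dict


class PaymentService:
    """Replays the stored result when the same idempotency key is seen again."""

    def __init__(self, charge_fn: Callable[[int], dict]) -> None:
        self._charge_fn = charge_fn
        self._results: Dict[str, dict] = {}  # idempotency_key -> first response

    def pay(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._results:
            return self._results[idempotency_key]  # retry: no second charge
        result = self._charge_fn(amount_cents)
        self._results[idempotency_key] = result
        return result


charges = []  # records every call that reaches the (fake) payment provider


def fake_charge(amount_cents: int) -> dict:
    charges.append(amount_cents)
    return {"charge_id": len(charges), "amount_cents": amount_cents}


service = PaymentService(fake_charge)
first = service.pay("order-42", 5000)
retry = service.pay("order-42", 5000)  # e.g. the client timed out and retried
```

The retry returns the original response, and the provider is charged exactly once.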

Ticket Issuance

Generate a digital ticket (QR/barcode) upon successful purchase.
Send confirmation via email and/or mobile app.

Non-Functional Requirements

Large-scale ticketing systems are defined as much by how they perform under stress as by their feature set.

Scalability

Handle sudden spikes (e.g., millions of users joining at ticket release time).
Support horizontal scaling across multiple regions.
Adapt to bursty traffic patterns rather than steady-state load.

Low Latency

Fast responses for search, seat availability, and reservation status (<200 ms target for most read queries).
Low latency matters because delays directly increase user drop-off during checkout.

Reliability & Consistency

Ensure no double-booking of seats.
Strong consistency for seat inventory, eventual consistency acceptable for less critical data (e.g., search).
Techniques such as distributed locks or optimistic concurrency control are critical here.
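Optimistic concurrency control can be illustrated with a version number that must match at write time. The `Seat` and `SeatStore` types here are assumptions for the example; in a relational database the same check is typically expressed as an UPDATE whose WHERE clause includes the expected version.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Seat:
    seat_id: str
    status: str
    version: int


class SeatStore:
    """In-memory store whose writes succeed only if the version is unchanged."""

    def __init__(self, seats) -> None:
        self._seats = {seat.seat_id: seat for seat in seats}

    def read(self, seat_id: str) -> Seat:
        return self._seats[seat_id]

    def update_if_unchanged(self, seat_id: str, expected_version: int, new_status: str) -> bool:
        current = self._seats[seat_id]
        if current.version != expected_version:
            return False  # someone else wrote first; caller must re-read
        self._seats[seat_id] = replace(
            current, status=new_status, version=current.version + 1
        )
        return True


store = SeatStore([Seat("A1", "available", 0)])
alice_view = store.read("A1")
bob_view = store.read("A1")  # both users read version 0

alice_ok = store.update_if_unchanged("A1", alice_view.version, "reserved")
bob_ok = store.update_if_unchanged("A1", bob_view.version, "reserved")
```

Only the first writer succeeds; the second sees a version mismatch and has to re-read the seat, at which point it is no longer available.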

Fault Tolerance

Gracefully handle partial failures (e.g., payment succeeds but DB write fails).
Use retries, dead-letter queues, and idempotent APIs.

Fairness & Security

Queueing or rate limiting to prevent system overload.
Anti-bot measures such as CAPTCHA, token-based access, and dynamic queueing systems to protect inventory.
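Rate limiting is commonly implemented with a token bucket, which allows short bursts while capping the sustained rate. The following is a minimal single-process sketch; the `TokenBucket` class and its parameters are illustrative assumptions, and a distributed deployment would need shared state.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `refill_per_second` rate."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._rate = refill_per_second
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self) -> bool:
        now = self._clock()
        # Refill in proportion to elapsed time, never above capacity.
        self._tokens = min(self._capacity, self._tokens + (now - self._last) * self._rate)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


now = [0.0]  # fake clock for a deterministic demonstration
bucket = TokenBucket(capacity=2, refill_per_second=1.0, clock=lambda: now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]  # third request is rejected
now[0] = 1.0
after_refill = bucket.allow()  # one token has been refilled
```

One bucket per user or per IP address gives per-client fairness; a single shared bucket protects the backend as a whole.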

Observability

Real-time logging, monitoring, and alerting.
Traceability for debugging failed transactions.

Roadmap for This Series

Each upcoming part of this series will address a key aspect of the system, with diagrams, real-world examples, and interview tips.

By the end of this series, you should be able to discuss the design of a large-scale ticketing system in a structured, interview-ready manner, while demonstrating the ability to balance correctness, scalability, and user experience.