Building a Real-Time Hotel Booking Engine: How We Solved Double-Booking Across 6 OTAs

#architecture #webdev #postgres #distributedsystems

Last year, our team built a centralized booking engine for a hotel chain operating 12 properties across Southeast Asia. The core challenge: their rooms were listed simultaneously on Booking.com, Expedia, Agoda, Traveloka, Trip.com, and their own direct booking site. Double-bookings were happening 3-4 times per week, each one costing the property an average of $180 in relocation fees and guest compensation.

This post breaks down the architecture we used to eliminate that problem.

Why Double-Booking Is Harder Than a Simple Database Lock

The naive solution sounds straightforward: lock the room row in the database before confirming a booking. But at scale across multiple OTA channels, this breaks down fast.

The root issue is distributed timing. When a guest clicks "Book Now" on Expedia, that request hits Expedia's servers first, then gets forwarded to the hotel's system via API. Meanwhile, another guest on Agoda books the same room type for the same dates. Both requests arrive at the booking engine within a 200-400ms window. PostgreSQL's row-level locks serialize the writes themselves, but a plain SELECT does not wait on them: when two OTA webhooks fire near-simultaneously, the second request often reads availability that is already stale because the first transaction has not yet committed.
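To make the race concrete, here is a minimal sketch that replays the interleaving deterministically in plain Python. It is an illustration only, not our production code: the in-memory `inventory` dict and the function names are hypothetical stand-ins for the database and the booking handler.

```python
# Deterministic replay of the race: both requests read availability
# before either one writes. All names here are hypothetical.
inventory = {"available_count": 1}  # one room left for the date


def read_availability(store):
    return store["available_count"]


def confirm_booking(store, seen_count):
    # Naive check-then-write: trusts the count that was read earlier.
    if seen_count > 0:
        store["available_count"] = seen_count - 1
        return True
    return False


# Both OTA requests read before either one writes.
seen_by_expedia = read_availability(inventory)
seen_by_agoda = read_availability(inventory)

expedia_ok = confirm_booking(inventory, seen_by_expedia)
agoda_ok = confirm_booking(inventory, seen_by_agoda)

# Both bookings are confirmed even though only one room existed.
print(expedia_ok, agoda_ok, inventory["available_count"])
```

Both calls succeed because each one validates against the count it read, not the count that exists at write time.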

The problem gets worse with connection pooling. Under load, requests queue for a free database connection, and the gap between "read availability" and "write confirmation" widens. We measured this gap averaging 150ms in normal conditions, spiking to 800ms during peak booking hours (6-10 PM local time).

Our Architecture: Event-Driven Inventory with Optimistic Locking

We evaluated two approaches: pessimistic locking (SELECT FOR UPDATE) and optimistic locking with a version column. We chose optimistic locking for one reason: pessimistic locks under high concurrency caused connection pool exhaustion in our load tests. With 6 OTAs sending concurrent requests, the lock wait times cascaded, because every waiting transaction held a pooled connection until the lock was released.

The architecture has three core components:

Inventory Service (Node.js): Owns the single source of truth for room availability. Every inventory record carries a version integer. When a booking request arrives, the service reads the current version, validates availability, then attempts an UPDATE with a WHERE clause matching both the room ID and the expected version number. If the version has changed between read and write, the UPDATE affects zero rows, and the service rejects the booking with a retry signal.
Message Queue (RabbitMQ): All incoming OTA booking requests land in a RabbitMQ queue before hitting the Inventory Service. This serializes concurrent requests per room-date combination using consistent hashing on the routing key (property_id.room_type.date). Two bookings for the same room on the same date always route to the same queue consumer.
Channel Sync Worker (Python): After every confirmed booking or cancellation, this worker pushes updated availability to all 6 OTA channels via their respective APIs.

```sql
-- Optimistic lock: only succeeds if version hasn't changed
UPDATE room_inventory
SET available_count = available_count - 1,
    version = version + 1
WHERE property_id = $1
  AND room_type = $2
  AND stay_date = $3
  AND version = $4
  AND available_count > 0;
-- If rows_affected = 0, another booking won the race
```

The key design decision was using RabbitMQ's consistent hash exchange rather than a standard topic exchange. This guaranteed that competing requests for the same inventory naturally serialized through a single consumer, reducing the optimistic lock collision rate from ~12% (in our initial tests without the queue) to under 0.3%.
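The property we relied on is that the same routing key always lands on the same consumer. The sketch below shows only that property, using a plain hash-modulo mapping. It is a deliberately simplified stand-in and not how RabbitMQ's consistent hash exchange is implemented internally; the function names and the consumer count are hypothetical.

```python
import hashlib


def routing_key(property_id, room_type, stay_date):
    # Mirrors the property_id.room_type.date key format described earlier.
    return f"{property_id}.{room_type}.{stay_date}"


def pick_consumer(key, consumer_count):
    # Stable hash of the key, reduced to a consumer index.
    # (Python's built-in hash() is salted per process, so hashlib is used.)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % consumer_count


key = routing_key(7, "deluxe", "2024-07-01")
first_route = pick_consumer(key, consumer_count=4)
second_route = pick_consumer(key, consumer_count=4)
print(key, first_route == second_route)
```

One caveat of the modulo shortcut: changing `consumer_count` remaps most keys, which is the problem a real consistent hashing scheme is designed to limit.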

Syncing Availability Back to 6 OTAs

After a booking confirms, the Channel Sync Worker must update availability across all channels. Each OTA has a different API, different rate limits, and different latency profiles.

We learned two things the hard way:

Batch updates beat individual pushes. Booking.com's API accepts bulk availability updates, but Agoda's API at the time only supported single-date, single-room-type calls. Pushing updates one-by-one to Agoda added 4-6 seconds of total latency per booking. We switched to batching Agoda updates every 30 seconds, which was an acceptable trade-off: a 30-second window where availability might be slightly stale versus a guaranteed fast sync path for the other 5 channels.
Webhook-based sync is faster than polling, but you need a fallback. Expedia and Booking.com support outbound webhooks to notify the hotel system of new bookings. Trip.com and Traveloka did not (at the time of our integration). For channels without webhooks, we poll every 60 seconds. The polling fallback caught roughly 8% of bookings that would have otherwise created conflicts.
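The 30-second batching for a single-call API can be sketched as a small coalescing buffer. This is an assumption-laden illustration rather than our worker's actual code: `UpdateBatcher`, its method names, and the choice to keep only the latest count per room-date are hypothetical, and `send` stands in for whatever function calls the OTA API.

```python
class UpdateBatcher:
    """Collects availability updates and releases them at a fixed interval."""

    def __init__(self, started_at, flush_interval_s=30.0):
        self.flush_interval_s = flush_interval_s
        self.last_flush_at = started_at
        self.pending = {}

    def add(self, key, available_count):
        # A newer count for the same room-date replaces the older one,
        # so each key costs at most one API call per flush.
        self.pending[key] = available_count

    def flush_if_due(self, now, send):
        # Returns the number of updates sent (0 if not due or nothing pending).
        if now - self.last_flush_at < self.flush_interval_s or not self.pending:
            return 0
        batch, self.pending = self.pending, {}
        self.last_flush_at = now
        for key, count in batch.items():
            send(key, count)
        return len(batch)


sent = []
batcher = UpdateBatcher(started_at=0.0)
key = (1, "deluxe", "2024-07-01")
batcher.add(key, 3)
batcher.add(key, 2)  # supersedes the earlier count

early = batcher.flush_if_due(now=10.0, send=lambda k, c: sent.append((k, c)))
due = batcher.flush_if_due(now=30.0, send=lambda k, c: sent.append((k, c)))
print(early, due, sent)
```

Passing `now` in explicitly keeps the timing logic testable without real clocks; a worker loop would pass `time.monotonic()`.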

Handling Edge Cases: Cancellations and Partial Failures

The hardest part was not the booking itself but what happens after. When a guest cancels on Expedia, the system must release that inventory and push the updated count to the other 5 channels. If the push to Booking.com fails (network timeout, API downtime), the room stays marked as unavailable there. Multiply this across hundreds of rooms and dates, and ghost unavailability quietly eats into revenue.

We implemented a Saga pattern with compensating actions. Each availability update to an OTA channel is tracked as a step in a saga. If a step fails, the system retries with exponential backoff (3 attempts, 5s/15s/45s intervals). If all retries fail, the saga marks that channel as "dirty". A reconciliation job runs every 15 minutes and force-syncs dirty channels by pulling their current state and comparing it against the Inventory Service.
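A single saga step with that retry schedule might look like the sketch below. It assumes one reading of the schedule: an initial attempt followed by up to three retries after waits of 5, 15, and 45 seconds. `push_with_retry`, `dirty_channels`, and the injected `sleep` are hypothetical names, and the failing `push` is a fake used only for the demo.

```python
import time

RETRY_DELAYS_S = (5, 15, 45)  # waits before retry 1, 2, and 3


def push_with_retry(push, channel, payload, dirty_channels, sleep=time.sleep):
    # One immediate attempt (delay 0), then one attempt after each wait.
    for delay in (0, *RETRY_DELAYS_S):
        if delay:
            sleep(delay)
        try:
            push(channel, payload)
            return True
        except Exception:
            # Broad catch for brevity; real code would only retry
            # transient errors such as timeouts.
            continue
    # Every attempt failed: hand the channel to the reconciliation job.
    dirty_channels.add(channel)
    return False


def always_timeout(channel, payload):
    raise TimeoutError(f"{channel} did not respond")


waits = []
dirty = set()
ok = push_with_retry(
    always_timeout, "booking.com", {"available_count": 2}, dirty,
    sleep=waits.append,  # record the waits instead of actually sleeping
)
print(ok, waits, dirty)
```

Injecting `sleep` keeps the demo instant; production code would leave the default `time.sleep` or use the scheduler of its task queue.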

This reconciliation loop caught an average of 23 stale records per day across all properties during the first month. After stabilizing the OTA integrations, that number dropped to 2-3 per day.
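The comparison at the heart of the reconciliation job can be sketched as a plain dictionary diff. The data shapes are assumptions for illustration: both sides are modeled as mappings from a (property, room type, date) key to an available count, and `find_stale_records` is a hypothetical name.

```python
def find_stale_records(inventory, channel_state):
    # Maps each mismatched key to (expected_count, channel_count).
    # A key missing on the channel side shows up with channel_count None.
    stale = {}
    for key, expected in inventory.items():
        actual = channel_state.get(key)
        if actual != expected:
            stale[key] = (expected, actual)
    return stale


inventory = {
    (1, "deluxe", "2024-07-01"): 2,
    (1, "deluxe", "2024-07-02"): 0,
    (1, "suite", "2024-07-01"): 1,
}
channel_state = {
    (1, "deluxe", "2024-07-01"): 2,  # in sync
    (1, "deluxe", "2024-07-02"): 1,  # channel still shows a sold room
}

stale = find_stale_records(inventory, channel_state)
print(stale)
```

The Inventory Service is treated as the source of truth, so every stale key is corrected by pushing the expected count to the channel.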

What We Measured After 3 Months

Double-bookings dropped from 3-4 per week to zero in the first 90 days of production
Sync latency averaged 1.2 seconds for webhook-enabled OTAs, with a worst case of 34 seconds for polling-based channels
Optimistic lock collision rate held steady at 0.2-0.4%, meaning no more than 1 in 250 concurrent booking attempts needed a retry
Reconciliation job "dirty" records stabilized at 2-3 per day, almost all from Traveloka API timeouts

What We Would Do Differently

If we started this project today, three things would change:

Redis for the hot inventory cache. We served all availability reads directly from PostgreSQL. Under peak load (Black Friday sale), query latency spiked. A Redis layer for real-time availability reads with PostgreSQL as the write-through backend would have smoothed this out.
Event sourcing from day one. We added event sourcing to the booking flow in month two after debugging a dispute where the hotel claimed a cancellation never came through. Having the full event log from the start would have saved a week of forensic work.
Contract testing for OTA APIs. Three of the six OTA APIs changed their response schemas during our 8-month engagement without prior notice. Pact-style contract tests running against sandbox environments would have caught these before production.
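To show why an event log would have shortened that dispute investigation, here is a minimal sketch of an append-only booking log with replay. It is an illustration under assumptions, not the schema we shipped: `BookingEvent`, its fields, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookingEvent:
    sequence: int    # position in the append-only log
    kind: str        # "booked" or "cancelled"
    booking_id: str
    channel: str


def replay_available(total_rooms, events):
    # Rebuilds current availability purely from the event history.
    active = set()
    for event in sorted(events, key=lambda e: e.sequence):
        if event.kind == "booked":
            active.add(event.booking_id)
        elif event.kind == "cancelled":
            active.discard(event.booking_id)
    return total_rooms - len(active)


def history_for(booking_id, events):
    # Everything that ever happened to one booking, in order.
    matching = [e for e in events if e.booking_id == booking_id]
    return sorted(matching, key=lambda e: e.sequence)


log = [
    BookingEvent(1, "booked", "b1", "expedia"),
    BookingEvent(2, "booked", "b2", "agoda"),
    BookingEvent(3, "cancelled", "b1", "expedia"),
]

available = replay_available(total_rooms=5, events=log)
b1_kinds = [e.kind for e in history_for("b1", log)]
print(available, b1_kinds)
```

With a log like this, answering "did the cancellation ever arrive?" becomes a lookup on `history_for` rather than a forensic search across service logs.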

We are the engineering team at Adamo Software, where we build booking engines, channel managers, and travel platforms for operators worldwide. Got questions? Drop a comment below.