Motivation
I'm currently serving in the military on a remote island, and I often need to take a cruise home during breaks. Unfortunately, the cruise booking app had some critical issues that caused delays and disrupted my plans.
1. Race Condition: During seat selection and payment, the system allowed multiple users to select the same seat at the same time. The seat would ultimately go to whoever completed payment first—even if someone else had selected it earlier.
2. Slow Response Time: When weather or rough seas triggered mass rebookings, the system's response time dropped drastically. Seat reservation operations would lag or fail completely, often causing payment attempts to be aborted.
Because of these issues, I set out to redesign the booking API with a focus on robust, reliable seat reservation logic.
Assumptions
To keep the project scoped around the core booking logic, I made several simplifying assumptions:
- Users are assumed to be authenticated.
- All cruise information, including seat maps and address details, is predefined (no admin panel).
- Payment processing, including webhooks, is simulated and simplified.
Two-Phase Booking Design
To address the first problem, the race condition, I explored a timer-based approach. A countdown timer starts as soon as the user selects a seat. If the user fails to complete checkout within the allotted time, the seat is released for other users.
Using a Status Column
I tried to achieve this in the relational database layer using a status column. The process goes as follows:
- A user sends a POST request to the /trips/reserve endpoint.
- The system checks the seat's status in the database; if the status is AVAILABLE, it updates the status to PENDING.
However, this approach had a significant drawback: the system needs a cron job to periodically clean up stale entries (e.g., every minute), which leaves a stale time window between sweeps. That was unacceptable, as a seat should become reservable immediately when its timeout expires, reflecting its real-time status.
Using an Expiration Column
A more efficient approach uses an expiration column:
- The expiration column stores the reservation's expiration timestamp.
- A POST request checks the status and compares the current time against the expiration column. If the status is PENDING but the expiration time has passed, the system considers the seat available for booking.
This solution is more elegant: there is no stale time window, and the seat becomes reservable the instant its hold expires, which solves the first problem.
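To make the schema concrete, here is a minimal sketch of the SeatReservation row behind this design (annotations assume Spring Boot 3 with jakarta.persistence; other columns that appear later in the Hibernate logs, such as trip_id, payment_intent_id, and transaction_id, are omitted, along with getters and setters):
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.Id;
import java.time.LocalDateTime;

// Hypothetical set of states; PENDING and PAID appear later in this post.
enum ReservationStatus { AVAILABLE, PENDING, PAID }

@Entity
public class SeatReservation {

    @Id
    private Long seatId;

    @Enumerated(EnumType.STRING)
    private ReservationStatus reservationStatus;

    private Long reservedByUserId;

    // null or in the past means the seat is reservable again.
    private LocalDateTime expirationTime;

    // Illustrative helper capturing the availability rule described above.
    public boolean isAvailableAt(LocalDateTime now) {
        return expirationTime == null || expirationTime.isBefore(now);
    }
}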
Concurrency Issues
When I first designed the seat reservation endpoint, it seemed straightforward: let users pick a seat and write to the database. However, simple concurrency tests revealed a last-writer-wins issue when multiple users attempted to reserve the same seat simultaneously. In practice, all users received success notifications—even though only the last update was actually committed—causing data inconsistencies and confusion.
I built a seat reservation service to assign seats to users with a time-based expiration. The core logic is:
@Transactional
public SeatReservation reserveSeat(Long seatId, Long userId) {
    SeatReservation seatReservation = seatReservationRepository.findById(seatId)
            .orElseThrow(() -> new RuntimeException("Seat Not Found"));

    // An unexpired hold means another user's reservation is still active.
    if (seatReservation.getExpirationTime() != null
            && seatReservation.getExpirationTime().isAfter(LocalDateTime.now())) {
        throw new RuntimeException("Another user has reserved it already");
    }

    seatReservation.setReservationStatus(ReservationStatus.PENDING);
    seatReservation.setReservedByUserId(userId);
    seatReservation.setExpirationTime(LocalDateTime.now().plusMinutes(EXPIRATION_TIME));
    return seatReservation; // dirty checking flushes the update at commit
}
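For context, this service sits behind a thin controller. Here is a minimal sketch; the route and class names are hypothetical stand-ins for the /trips/.../reserve endpoint, and the user id would normally come from the authenticated principal rather than a request parameter:
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class SeatReservationController {

    private final SeatReservationService seatService;

    public SeatReservationController(SeatReservationService seatService) {
        this.seatService = seatService;
    }

    // Hypothetical route standing in for the /trips/.../reserve endpoint.
    @PostMapping("/trips/{tripId}/seats/{seatId}/reserve")
    public ResponseEntity<SeatReservation> reserve(@PathVariable Long tripId,
                                                   @PathVariable Long seatId,
                                                   @RequestParam Long userId) {
        return ResponseEntity.ok(seatService.reserveSeat(seatId, userId));
    }
}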
Sequential Testing
To ensure the service worked as intended, I tested it with 20 users reserving a seat one-by-one:
@Test
void shouldReserveSeatSequentiallyWithOnlyOneSuccess() {
    final int userCount = 20;
    final AtomicInteger successCount = new AtomicInteger();
    final AtomicInteger failureCount = new AtomicInteger();

    for (int i = 0; i < userCount; i++) {
        final long userId = i + 1;
        try {
            seatService.reserveSeat(seatId, userId);
            successCount.incrementAndGet();
        } catch (Exception ex) {
            failureCount.incrementAndGet();
        }
    }

    System.out.println("Success: " + successCount.get());
    System.out.println("Failure: " + failureCount.get());
    assertThat(successCount.get()).isEqualTo(1);
}
The test passed—only one user got the seat, as expected.
Success: 1
Failure: 19
In production, users might hit the endpoint /trips/.../reserve simultaneously for popular seats. I simulated this with 20 concurrent users using threads:
@Test
void shouldReserveSeatConcurrentlyWithOnlyOneSuccess() throws InterruptedException {
    final int userCount = 20;
    final ExecutorService executorService = Executors.newFixedThreadPool(userCount);
    final CountDownLatch countDownLatch = new CountDownLatch(userCount);
    final AtomicInteger successCount = new AtomicInteger();
    final AtomicInteger failureCount = new AtomicInteger();

    for (int i = 0; i < userCount; i++) {
        final long userId = i + 1;
        executorService.submit(() -> {
            try {
                seatService.reserveSeat(seatId, userId);
                successCount.incrementAndGet();
            } catch (Exception ex) {
                failureCount.incrementAndGet();
            } finally {
                countDownLatch.countDown();
            }
        });
    }

    countDownLatch.await();
    executorService.shutdown();
    System.out.println("Success: " + successCount.get());
    System.out.println("Failure: " + failureCount.get());
    assertThat(successCount.get()).isEqualTo(1);
}
The test failed—10 users succeeded instead of 1. I diagnosed this as a "last writer wins" race condition: concurrent threads overwrote each other’s updates before the expiration check kicked in.
org.opentest4j.AssertionFailedError:
expected: 1
but was: 10
"Last-Writer-Wins" Scenario
To solve this anomaly, I looked into MySQL's transaction isolation levels and found the root cause. Under MySQL's default REPEATABLE READ isolation, each transaction operates on its own snapshot. For example, User 2 reads an outdated seat state, sees it as available, and updates it—unaware of User 1's concurrent reservation. This allows multiple successes where only one should occur.
My goal was clear: ensure that the first user to hit /trips/.../reserve reserves the seat if it's available, while subsequent users fail without overwriting the reservation, even under heavy concurrency.
To fix this, I evaluated five different approaches:
- SERIALIZABLE Isolation Level
- In-Memory Critical Section
- Optimistic Locking (OCC)
- Pessimistic Locking
- Atomic Database Operation
SERIALIZABLE Isolation Level
This is the most straightforward approach: bumping the isolation level to SERIALIZABLE, which prevents concurrent modifications to the seat during both reads and writes. Specifically, MySQL implements the SERIALIZABLE isolation level with locks, so reads and writes block each other. Below is the code for changing the isolation level with Spring's @Transactional annotation.
@Transactional(isolation = Isolation.SERIALIZABLE)
public SeatReservation reserveSeat(Long seatId, Long userId){
The 20-user concurrency test confirmed it worked:
result:
Success: 1
Failure: 19
However, further research raised concerns: blocking reads and writes reduce throughput and increase the risk of deadlocks, while seat booking must remain highly concurrent. So I explored other approaches as well.
In addition, I considered using PostgreSQL instead of MySQL, since its serializable snapshot isolation (SSI) avoids blocking readers, whereas MySQL achieves serializability with record locks and gap locks at the cost of concurrency. However, I decided to stick with MySQL to tackle this challenge.
In-Memory Critical Section
As an alternative to tweaking database isolation levels, I explored an in-memory solution. I added the synchronized keyword to the reserveSeat method to limit access to one thread at a time, expecting it to prevent concurrency issues:
@Transactional
public synchronized SeatReservation reserveSeat(Long seatId, Long userId) {
    SeatReservation seatReservation = seatReservationRepository.findById(seatId)
            .orElseThrow(() -> new RuntimeException("Seat Not Found"));

    // An unexpired hold means another user's reservation is still active.
    if (seatReservation.getExpirationTime() != null
            && seatReservation.getExpirationTime().isAfter(LocalDateTime.now())) {
        throw new RuntimeException("Another user has reserved it already");
    }

    seatReservation.setReservationStatus(ReservationStatus.PENDING);
    seatReservation.setReservedByUserId(userId);
    seatReservation.setExpirationTime(LocalDateTime.now().plusMinutes(EXPIRATION_TIME));
    return seatReservation;
}
The test failed unexpectedly:
org.opentest4j.AssertionFailedError:
expected: 1
but was: 7
Instead of one success, seven threads succeeded, revealing a flaw in my approach. Digging deeper, I realized the issue stemmed from the interplay between @Transactional and the synchronized keyword.
Spring’s transactional proxy wraps the method, managing transaction begin and commit outside the synchronized block.
The synchronized keyword locks only the method’s execution, not the transaction commit phase. This allowed another thread to enter while the first was committing, leading to multiple successes. A diagram I sketched illustrates the race condition:
Even if this had worked locally, I recognized a major limitation: synchronized is ineffective in a horizontally scaled environment. With multiple servers, in-memory locks don’t coordinate across instances, making this approach unscalable. Therefore, I moved on to try other approaches.
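(For completeness: on a single instance, the commit-outside-the-lock problem could be avoided by managing the transaction programmatically inside the synchronized method, e.g. with Spring's TransactionTemplate. Below is a sketch of that idea, not something I kept, since it still cannot coordinate across servers; the class name and EXPIRATION_TIME value are illustrative.)
import java.time.LocalDateTime;
import org.springframework.stereotype.Service;
import org.springframework.transaction.support.TransactionTemplate;

@Service
public class SynchronizedSeatReservationService {

    private static final long EXPIRATION_TIME = 5; // illustrative value, in minutes

    private final TransactionTemplate transactionTemplate;
    private final SeatReservationRepository seatReservationRepository;

    public SynchronizedSeatReservationService(TransactionTemplate transactionTemplate,
                                              SeatReservationRepository seatReservationRepository) {
        this.transactionTemplate = transactionTemplate;
        this.seatReservationRepository = seatReservationRepository;
    }

    // The transaction now begins and commits *inside* the lock, so no other
    // thread can start its read until the previous commit is visible.
    public synchronized SeatReservation reserveSeat(Long seatId, Long userId) {
        return transactionTemplate.execute(status -> {
            SeatReservation seatReservation = seatReservationRepository.findById(seatId)
                    .orElseThrow(() -> new RuntimeException("Seat Not Found"));
            if (seatReservation.getExpirationTime() != null
                    && seatReservation.getExpirationTime().isAfter(LocalDateTime.now())) {
                throw new RuntimeException("Another user has reserved it already");
            }
            seatReservation.setReservationStatus(ReservationStatus.PENDING);
            seatReservation.setReservedByUserId(userId);
            seatReservation.setExpirationTime(LocalDateTime.now().plusMinutes(EXPIRATION_TIME));
            return seatReservation; // flushed when the template commits
        });
    }
}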
Optimistic Locking (OCC)
Using Hibernate, I added a version column to the SeatReservation entity:
@Version
private Long version;
Hibernate manages the versioning automatically. On read, it fetches the row together with its version:
Hibernate: select sr1_0.seat_id,sr1_0.expiration_time,sr1_0.payment_intent_id,sr1_0.reservation_status,sr1_0.reserved_by_user_id,sr1_0.transaction_id,sr1_0.trip_id,sr1_0.version from seat_reservation sr1_0 where sr1_0.seat_id=?
On write, it updates the row only if the version has not changed:
Hibernate: update seat_reservation set expiration_time=?,payment_intent_id=?,reservation_status=?,reserved_by_user_id=?,transaction_id=?,trip_id=?,version=? where seat_id=? and version=?
I ran my concurrency test with 20 users:
test result:
Success: 1
Failure: 19
The test passed—only one user reserved the seat, as Hibernate detected version mismatches and rolled back conflicting updates.
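One practical detail: Hibernate performs the version check at flush/commit time, so the conflict surfaces only after the transactional method returns. A minimal sketch of translating it into the same "already reserved" error from a non-transactional caller (the facade and service class names are hypothetical; the exception type assumes Spring's exception translation):
import org.springframework.dao.OptimisticLockingFailureException;
import org.springframework.stereotype.Component;

@Component
public class SeatReservationFacade {

    private final SeatReservationService seatService;

    public SeatReservationFacade(SeatReservationService seatService) {
        this.seatService = seatService;
    }

    // Not @Transactional: the service commit (and thus the version check)
    // completes inside the call below, so a conflict is visible here.
    public SeatReservation reserveSeat(Long seatId, Long userId) {
        try {
            return seatService.reserveSeat(seatId, userId);
        } catch (OptimisticLockingFailureException ex) {
            // Another transaction committed first and bumped the @Version column.
            throw new RuntimeException("Another user has reserved it already", ex);
        }
    }
}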
Encouraged by the result, I tested at a smaller scale—three concurrent users. However, a conflict occurred in which the seat remained unreserved by any of them: all the transactions failed simultaneously, and no retries occurred. Adding retry logic was an option, but it clashed with the business rule:
“The first user to call the seat reservation endpoint (e.g., /trips/.../reserve) successfully reserves the seat if it is available.”
Retries shuffle the order, meaning the "first" user might not win—success would hinge on retry timing, not request order. This violated the requirement. Therefore, I went on to explore other solutions.
Pessimistic Locking
I implemented it using Spring Data JPA's @Lock annotation and a custom query:
@Lock(LockModeType.PESSIMISTIC_WRITE)
@Query("SELECT sr FROM SeatReservation sr WHERE sr.seatId = :id")
Optional<SeatReservation> findByIdWithPessimisticLock(@Param("id") Long id);
Hibernate automatically appended a FOR UPDATE clause to the SQL query, ensuring the row stays locked from the read until the transaction commits:
Hibernate: select sr1_0.seat_id,sr1_0.expiration_time,sr1_0.payment_intent_id,sr1_0.reservation_status,sr1_0.reserved_by_user_id,sr1_0.transaction_id,sr1_0.trip_id,sr1_0.version from seat_reservation sr1_0 where sr1_0.seat_id=? for update
Hibernate: update seat_reservation set expiration_time=?,payment_intent_id=?,reservation_status=?,reserved_by_user_id=?,transaction_id=?,trip_id=?,version=? where seat_id=? and version=?
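In the service layer, the locking finder simply replaces findById; a minimal sketch of the method (it lives in the same service class and keeps the read-check-update body from before):
@Transactional
public SeatReservation reserveSeat(Long seatId, Long userId) {
    // SELECT ... FOR UPDATE: the row stays locked until this transaction commits.
    SeatReservation seatReservation = seatReservationRepository
            .findByIdWithPessimisticLock(seatId)
            .orElseThrow(() -> new RuntimeException("Seat Not Found"));

    if (seatReservation.getExpirationTime() != null
            && seatReservation.getExpirationTime().isAfter(LocalDateTime.now())) {
        throw new RuntimeException("Another user has reserved it already");
    }

    seatReservation.setReservationStatus(ReservationStatus.PENDING);
    seatReservation.setReservedByUserId(userId);
    seatReservation.setExpirationTime(LocalDateTime.now().plusMinutes(EXPIRATION_TIME));
    return seatReservation;
}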
My 20-user concurrency test confirmed success:
result:
Success: 1
Failure: 19
Pessimistic locking proved effective, guaranteeing order and consistency in our current architecture. It is a strong fit for our system unless we scale to a multi-leader architecture, which would require reevaluating the approach.
Atomic Database Operation
Pessimistic locking worked quite nicely, but since it holds the row lock from the read until the commit, I looked for a further optimization.
Upon some thinking, I realized that the SELECT and UPDATE operations could be combined into one single atomic query. This leverages the query’s inherent atomicity, cuts database round-trips, and simplifies the operation.
To implement this, I crafted a JPQL query in SeatReservationRepository and adjusted the service layer:
@Transactional
public SeatReservation reserveSeat(Long seatId, Long userId) {
    LocalDateTime now = LocalDateTime.now();
    LocalDateTime newExpirationTime = now.plusMinutes(EXPIRATION_TIME);

    // Try to update the seat in one atomic statement.
    int updatedRows = seatReservationRepository.updateSeatReservationAtomically(
            seatId,
            userId,
            ReservationStatus.PENDING,
            newExpirationTime,
            now
    );

    // If no rows were updated, seat was not free (already reserved by someone else).
    if (updatedRows == 0) {
        throw new RuntimeException("Another user has reserved it already or seat not found.");
    }

    // If here, seat was successfully updated.
    // Now we can fetch the seat from DB to return it (and to record the new state).
    SeatReservation seatReservation = seatReservationRepository.findById(seatId)
            .orElseThrow(() -> new RuntimeException("Seat Not Found after update!"));
    return seatReservation;
}
@Modifying
@Query("UPDATE SeatReservation sr " +
       "SET sr.reservationStatus = :newStatus, " +
       "    sr.reservedByUserId = :userId, " +
       "    sr.expirationTime = :expirationTime " +
       "WHERE sr.seatId = :seatId " +
       "  AND (sr.expirationTime IS NULL OR sr.expirationTime < :currentTime)")
int updateSeatReservationAtomically(@Param("seatId") Long seatId,
                                    @Param("userId") Long userId,
                                    @Param("newStatus") ReservationStatus newStatus,
                                    @Param("expirationTime") LocalDateTime expirationTime,
                                    @Param("currentTime") LocalDateTime currentTime);
The 20-user concurrency test succeeded and only one user successfully reserved the seat, as the atomic update prevented concurrent modifications.
This approach shines for our simple case, reducing overhead. However, if future logic spans multiple tables or adds complex checks between the read and the write, a single query might not be enough. For now, I deemed it efficient and a good fit for our needs.
Tests
To compare the approaches, I also measured execution time locally, focusing on relative rather than absolute performance:
@Test
void shouldReserveSeatConcurrentlyWithOnlyOneSuccess() throws InterruptedException {
...
// Start timing
long startTime = System.nanoTime();
for (int i = 0; i < userCount; i++) {
...
}
// Stop timing
long endTime = System.nanoTime();
long elapsedNanos = endTime - startTime;
long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
System.out.println("Elapsed Time: " + elapsedMillis + " ms");
...
}
Result:
- SERIALIZABLE isolation level: 450 ms
- Optimistic locking: 83 ms
- Pessimistic locking: 130 ms
- Atomic query: 105 ms
The serializable isolation level was the slowest, as expected, while optimistic locking was fast but impractical for high-contention scenarios. Pessimistic locking worked but added latency. The atomic query struck a balance—fast, scalable, and reliable.
Webhook Issues
Payment gateways, including Stripe, operate on an at-least-once delivery model for webhook events. This means that if our application fails to acknowledge a webhook request with an HTTP 200 OK response (due to high traffic, server issues, etc.), the payment gateway will retry the webhook delivery as shown below:
At first, I thought, "Wouldn't it be fine if I just checked the latest payment status?" But then I realized:
- Retries could create duplicate processing if not handled properly.
- Repeated webhook events for the same transaction may overwrite the original timestamp in the SeatReservation table.
To prevent double-processing, I designed the system to handle webhooks idempotently. That is, every time a webhook is received, the system checks:
- Whether the payment_intent_id (or an equivalent unique transaction reference) already exists in the database and has been marked as PAID.
- If it does, the system simply acknowledges the webhook without making any changes and sends an HTTP 200 OK response back to the gateway.
- If it doesn’t, the system updates the reservation status accordingly and records the change.
This way, no matter how many times Stripe sends a webhook event, the same transaction is never processed twice.
Proposed Solution
I have adjusted my database schema to ensure smooth handling of these scenarios (a query-level sketch follows the list):
- When a webhook request arrives with a given payment_intent_id, the system checks the SeatReservation row to see whether that payment_intent_id is already present.
- If it is not present, the system updates the seat reservation to reflect the new payment status (e.g., PAID), records a timestamp (if needed), and logs the update.
- If a duplicate webhook is detected (i.e., the payment_intent_id is already present), the system preserves the original timestamp and record by simply returning an HTTP 200 response, thus preventing unnecessary retries.
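The duplicate check can be pushed into the database in the same atomic-update style as the reservation query. A minimal sketch, where the method name is hypothetical and payment_intent_id is assumed to be stored as a String on SeatReservation:
@Modifying
@Query("UPDATE SeatReservation sr " +
       "SET sr.reservationStatus = :paidStatus, sr.paymentIntentId = :paymentIntentId " +
       "WHERE sr.seatId = :seatId " +
       "  AND (sr.paymentIntentId IS NULL OR sr.paymentIntentId <> :paymentIntentId)")
int markPaidIfNotAlreadyProcessed(@Param("seatId") Long seatId,
                                  @Param("paymentIntentId") String paymentIntentId,
                                  @Param("paidStatus") ReservationStatus paidStatus);
A return value of 1 means this delivery recorded the payment for the first time; 0 means the same payment_intent_id was already present (a retried webhook), so the handler can simply acknowledge it with HTTP 200 and leave the original record untouched.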
Another Webhook Issue
As I fleshed out the payment processing logic, another tricky edge case caught my attention: delayed webhook deliveries. What happens if a payment gateway like Stripe processes a payment successfully, but the webhook arrives after the seat reservation’s timer expires? I investigated this to ensure reservation validity remained intact.
After some reflection, I mapped out the fallout:
- Payment Succeeds, Reservation Expires: The user pays, but the webhook lags, arriving post-timeout. The seat’s reservation is released, despite a valid payment.
- System Misalignment: Since the expired reservation is no longer linked to the payment, the system cannot assign the seat.
- Cleanup Needed: This leaves a paid-but-unreserved seat, likely requiring a refund despite a successful transaction.
If a delayed webhook arrives for a reservation that has already expired, the system must issue a refund while also making sure no seat ends up reserved.
To sum up, I designed the following flow to incorporate both idempotent payment processing and the refund logic (a code-level sketch follows the list):
- Verify Webhook Request:
  - Validate using secret keys/signatures.
- Acquire a Lock / Begin Transaction:
  - Check if the seat exists.
  - Lock the reservation record (or use an optimistic locking strategy).
  - Perform an idempotence check.
- Check UserId and Reservation Expiry:
  - Check to ensure the user matches the reserved user.
  - Check for an expired reservation.
- If Expired:
  - Log the refund action so it can later be aggregated for refund processing.
- If Not Expired:
  - Process the webhook payload:
    - SUCCESS: update reservationStatus to PAID.
    - FAILED/CANCELLED: write a structured log about the message.
  - Ensure updates and history logging occur in one transaction.
- Release the Lock / Commit Transaction:
  - Ensure all changes are committed atomically.
- Return an HTTP 200 Response:
  - Acknowledge receipt of the webhook to prevent further retries.
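And here is a minimal sketch of that flow as a Spring service method. It is a simplified shape rather than the full implementation: PaymentWebhookEvent is a hypothetical, already-signature-verified DTO, the refund is only logged for later aggregation, and the pessimistic-lock finder from earlier is reused.
import java.time.LocalDateTime;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class PaymentWebhookService {

    private static final Logger log = LoggerFactory.getLogger(PaymentWebhookService.class);

    private final SeatReservationRepository seatReservationRepository;

    public PaymentWebhookService(SeatReservationRepository seatReservationRepository) {
        this.seatReservationRepository = seatReservationRepository;
    }

    @Transactional
    public void handlePaymentWebhook(PaymentWebhookEvent event) {
        // Lock the reservation row for the duration of the transaction.
        SeatReservation reservation = seatReservationRepository
                .findByIdWithPessimisticLock(event.getSeatId())
                .orElseThrow(() -> new RuntimeException("Seat Not Found"));

        // Idempotence check: a retry for an already-paid reservation is a no-op.
        if (event.getPaymentIntentId().equals(reservation.getPaymentIntentId())
                && reservation.getReservationStatus() == ReservationStatus.PAID) {
            return; // the controller still answers HTTP 200 so the gateway stops retrying
        }

        // The paying user must match the user holding the reservation.
        if (!event.getUserId().equals(reservation.getReservedByUserId())) {
            throw new RuntimeException("Reservation belongs to another user");
        }

        // Delayed webhook: the reservation expired before the confirmation arrived.
        if (reservation.getExpirationTime() == null
                || reservation.getExpirationTime().isBefore(LocalDateTime.now())) {
            log.warn("Refund required: paymentIntentId={} arrived after expiry", event.getPaymentIntentId());
            return; // the seat stays released; refunds are aggregated separately
        }

        // Normal case: record the payment outcome; changes commit atomically
        // when the transaction completes, and the controller returns HTTP 200.
        if (event.isSuccess()) {
            reservation.setReservationStatus(ReservationStatus.PAID);
            reservation.setPaymentIntentId(event.getPaymentIntentId());
        } else {
            log.info("Payment failed/cancelled: paymentIntentId={}", event.getPaymentIntentId());
        }
    }
}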
Conclusion
In designing a more robust seat reservation API, key issues from my initial experience—race conditions and slow response times—were addressed to improve the booking experience.
To resolve the race condition issue, I introduced a two-phase booking process. This approach allowed a user to acquire a timed lock on a row when selecting a seat, preventing scenarios where a faster payment response inadvertently claims the seat.
Improving response times was challenging due to limited access to Ticket30's codebase. However, through concurrency testing, I was able to approximate the likely bottlenecks and address them. Multiple approaches were considered, and the atomic database operation was deemed the most suitable and efficient. This solution significantly improved performance while avoiding locking overhead.
During the implementation stage, additional edge cases, such as duplicate webhook events and delayed payment confirmations, were identified. These were carefully addressed to ensure system reliability and resilience under peak load.
By resolving these challenges, I believe the new booking system ensures a more efficient, scalable, and user-friendly experience.