What happens when two users try to reserve the last item at the exact same moment? Most systems would crumble, selling the same product twice and creating a nightmare of canceled orders and angry customers. I discovered this isn't a theoretical problem; it's a fundamental flaw in how we handle concurrent operations.
In this article, I'll walk you through the journey of building what seems like a simple reservation system. We'll start with a naive approach, expose its critical flaws, and then systematically rebuild it into a fault-tolerant, production-ready backend that guarantees every single order is handled correctly.
Choosing the Right Tools
Building for reliability requires careful tool selection. Here's why I chose this stack:
PostgreSQL: My source of truth. Its strong ACID compliance guarantees data integrity, making it perfect for tracking reservations and inventory.
Redis: More than just a cache. I used it as a high-speed data structure server for atomic operations and real-time inventory management.
Node.js with Express: Ideal for building fast, non-blocking APIs that can handle high concurrency.
BullMQ: A robust job queue system that makes our application fault-tolerant by handling failures gracefully.
The Race Condition
When I first built the reservation feature, the logic was simple: check the stock, then create the reservation. In Node.js, that looked something like this:
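Reconstructed as a sketch with node-postgres (`pg`) and simplified `products` and `reservations` tables standing in for the original schema:

```javascript
// Naive approach: check the stock, then create the reservation.
// The gap between the two queries is where the trouble starts.
const { Pool } = require("pg");
const pool = new Pool();

async function reserveItem(productId, userId) {
  // Step 1: read the current stock
  const { rows } = await pool.query(
    "SELECT stock FROM products WHERE id = $1",
    [productId]
  );

  if (rows[0].stock <= 0) {
    throw new Error("Out of stock");
  }

  // Step 2: create the reservation and decrement the stock.
  // A concurrent request can read the old stock between step 1 and step 2.
  await pool.query(
    "INSERT INTO reservations (product_id, user_id, status) VALUES ($1, $2, 'pending')",
    [productId, userId]
  );
  await pool.query(
    "UPDATE products SET stock = stock - 1 WHERE id = $1",
    [productId]
  );
}
```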
This works perfectly for one user at a time. But what happens when two users try to buy the last available item at the exact same millisecond? A disaster.
Request A reads the stock from the database. It sees 1 item left.
Request B reads the stock from the database. It also sees 1 item left.
Request A proceeds, creates the reservation, and decrements the stock to 0.
Request B also proceeds, creating a second reservation for an item that no longer exists, and decrements the stock to -1.
We've now sold the same item twice. This is the race condition, and it's the first villain we need to defeat.
Atomic Operations with Lua
To defeat a race condition, we need an operation that is atomic: a single, all-or-nothing step. There can be no gap between checking the stock and decrementing it. This is where Redis and Lua scripting shine.
Why Lua?
My first thought was to use Redis transactions (WATCH, MULTI, EXEC). While powerful, this approach can be complex and requires multiple back-and-forth commands between my Node.js app and Redis.
A better way is to encapsulate the entire check-and-decrement logic in a single Lua script. Redis guarantees that a Lua script executes atomically: it runs the script from start to finish before any other command is allowed to run. This collapses the back-and-forth into a single round trip and guarantees our critical section is safe.
Here is the simple script that became the heart of our reservation system. Because Redis executes it atomically, there is no gap between the check and the update.
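A minimal sketch of such a script, embedded in Node.js and run through ioredis (the `stock:<productId>` key naming is my own convention):

```javascript
const Redis = require("ioredis");
const redis = new Redis();

// KEYS[1] = the stock key, ARGV[1] = quantity requested.
// The check and the decrement run as one atomic unit inside Redis.
const reserveScript = `
  local stock = tonumber(redis.call('GET', KEYS[1]) or '0')
  local qty = tonumber(ARGV[1])
  if stock >= qty then
    redis.call('DECRBY', KEYS[1], qty)
    return 1
  end
  return 0
`;

async function tryReserve(productId, quantity) {
  // eval(script, numberOfKeys, ...keys, ...args)
  const ok = await redis.eval(reserveScript, 1, `stock:${productId}`, quantity);
  return ok === 1;
}
```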
By running this one script, we completely solve the race condition. It's impossible for two requests to get past the if check simultaneously.
Building a Resilient System
With our inventory safe, the rest of the reservation process can proceed. This involves persisting the reservation to PostgreSQL (our source of truth) and setting a "ticking clock" (a key with a TTL) in Redis for abandoned carts.
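Concretely, that step might look like this (a sketch: ioredis for the hold key, `pg` for the insert, and a 15-minute TTL chosen here purely for illustration):

```javascript
const { Pool } = require("pg");
const Redis = require("ioredis");
const pool = new Pool();
const redis = new Redis();

async function persistReservation(reservationId, productId, userId) {
  // PostgreSQL remains the source of truth for the reservation itself
  await pool.query(
    "INSERT INTO reservations (id, product_id, user_id, status) VALUES ($1, $2, $3, 'pending')",
    [reservationId, productId, userId]
  );

  // The "ticking clock": a Redis key that expires when the hold runs out
  await redis.set(`reservation:hold:${reservationId}`, productId, "EX", 15 * 60);
}
```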
But what happens when that clock runs out?
The Self-Healing Cleanup Worker
My first instinct for handling expired reservations was to use Redis's Pub/Sub feature. It seemed elegant and real-time.
However, I quickly discovered its fatal flaw: Pub/Sub is a fire-and-forget system. If your worker service is down for even a second when an expiration event is published, that message is lost forever.
This led me to a more resilient, self-healing solution: a polling-based worker. Instead of passively listening for messages it might miss, this worker proactively asks the database a simple question every 30 seconds: "Are there any reservations that should have expired?"
This approach is incredibly robust. If the worker is offline for an hour, the next time it starts, it will find all the reservations it missed and clean them up correctly. It guarantees eventual consistency and is the kind of reliable pattern required for a production system.
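A sketch of that polling worker, assuming a `reservations` table with `status`, `expires_at`, and `quantity` columns and one Redis stock key per product:

```javascript
const { Pool } = require("pg");
const Redis = require("ioredis");
const pool = new Pool();
const redis = new Redis();

async function cleanupExpiredReservations() {
  // Find every hold that should have expired, no matter how long we were offline
  const { rows } = await pool.query(
    `UPDATE reservations
        SET status = 'expired'
      WHERE status = 'pending' AND expires_at < NOW()
      RETURNING product_id, quantity`
  );

  // Return the reserved stock to the live inventory in Redis
  for (const r of rows) {
    await redis.incrby(`stock:${r.product_id}`, r.quantity);
  }
}

// Poll every 30 seconds, as described above
setInterval(() => {
  cleanupExpiredReservations().catch(console.error);
}, 30_000);
```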
The "Chatty" Checkout
Our system was now safe from race conditions and abandoned carts. But I discovered a new villain: performance.
A user might have 10 or 20 items in their cart, and my Node.js checkout logic made multiple Redis calls for each one. Even a 10-item cart could mean 30-40 separate network round trips. This "chatty" process was slow and inefficient.
The solution was the same as before: delegate the entire task to a single, expert Lua script that validates the whole cart in one atomic, lightning-fast operation.
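A sketch of what that batched script can look like: it receives one stock key per cart item plus the matching quantities, and either reserves everything or nothing (the key layout is my own):

```javascript
const Redis = require("ioredis");
const redis = new Redis();

// All-or-nothing cart validation in a single round trip.
// KEYS = one stock key per cart item, ARGV = the matching quantities.
const checkoutScript = `
  for i = 1, #KEYS do
    local stock = tonumber(redis.call('GET', KEYS[i]) or '0')
    if stock < tonumber(ARGV[i]) then
      return 0  -- abort before touching anything
    end
  end
  for i = 1, #KEYS do
    redis.call('DECRBY', KEYS[i], ARGV[i])
  end
  return 1
`;

async function reserveCart(items) {
  const keys = items.map((i) => `stock:${i.productId}`);
  const quantities = items.map((i) => i.quantity);
  const ok = await redis.eval(checkoutScript, keys.length, ...keys, ...quantities);
  return ok === 1;
}
```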
This reduced dozens of network calls to a single atomic operation, slashing checkout latency by 80%.
Architecting for Failure
Our system was now fast and safe in Redis. But what if the slowest part of our system, the main database, crashes during a purchase?
The Ultimate Challenge: The Unreliable Network
We need a system that guarantees a purchase will be processed, even if our server crashes. This is where we bring in the final piece of our architecture: a job queue.
The Unbreakable System: Decoupling with BullMQ
Instead of making our API handle the slow database work, we decouple it. The API's only job is to accept the purchase request and add a "job" to a BullMQ queue. The user gets an instant response, and their purchase is now safely waiting to be processed by a completely separate, dedicated worker process.
- Fast API responses regardless of database performance
- Automatic retries on database failures
- No lost orders during outages
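Here's roughly what the API side of that decoupling looks like, sketched with BullMQ and Express (the queue name, job name, and retry settings are illustrative, not taken from the original code):

```javascript
const express = require("express");
const { Queue } = require("bullmq");

const app = express();
app.use(express.json());

const purchaseQueue = new Queue("purchases", {
  connection: { host: "127.0.0.1", port: 6379 },
});

app.post("/checkout", async (req, res) => {
  const { reservationId } = req.body;

  // The API only enqueues the work; the slow database writes happen
  // later in a dedicated worker, with automatic retries on failure.
  await purchaseQueue.add(
    "process-purchase",
    { reservationId },
    { attempts: 3, backoff: { type: "exponential", delay: 5000 } }
  );

  // The user gets an instant answer, even if PostgreSQL is slow or down
  res.status(202).json({ status: "queued", reservationId });
});

app.listen(3000);
```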
The Art of Idempotency
What if a job fails and needs to be retried? We must ensure that retrying a job doesn't decrement stock multiple times. We need to make our worker idempotent, meaning it can be run multiple times with the same result as running it once.
I achieved this with a simple but powerful state machine in my reservations table.
A new status: 'processing'. When the worker first picks up a job, it immediately moves the reservation's status from 'pending' to 'processing'. This is a critical signal. It tells my expiresWorker: "Don't touch this! This order is in-flight." It prevents the angry-customer scenario where an order is marked as expired while it's actively being purchased.
(pending -> processing -> completed)
Before the worker attempts to decrement inventory, it first checks the reservation's status. If the status is already 'completed', it means this is a retried job that already succeeded. The worker simply logs a warning and skips to the next item, preventing any duplicate processing.
This makes the worker incredibly safe. It can fail and retry a dozen times, but it will only ever process the purchase once.
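Put together, the worker might look something like this (a sketch: the column names and the exact SQL are assumptions, but the pending -> processing -> completed flow is the one described above):

```javascript
const { Worker } = require("bullmq");
const { Pool } = require("pg");

const pool = new Pool();

const purchaseWorker = new Worker(
  "purchases",
  async (job) => {
    const { reservationId } = job.data;

    // Idempotency guard: if a retried job already succeeded, skip it
    const { rows } = await pool.query(
      "SELECT status, product_id, quantity FROM reservations WHERE id = $1",
      [reservationId]
    );
    const reservation = rows[0];
    if (!reservation || reservation.status === "completed") {
      console.warn(`Reservation ${reservationId} already processed, skipping`);
      return;
    }

    // Claim the order: tells the expiration worker "don't touch this"
    await pool.query(
      "UPDATE reservations SET status = 'processing' WHERE id = $1",
      [reservationId]
    );

    // The slow, failure-prone part: commit the purchase to the source of truth
    await pool.query(
      "UPDATE products SET stock = stock - $2 WHERE id = $1",
      [reservation.product_id, reservation.quantity]
    );
    await pool.query(
      "UPDATE reservations SET status = 'completed' WHERE id = $1",
      [reservationId]
    );
  },
  { connection: { host: "127.0.0.1", port: 6379 } }
);
```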
When Automation Fails: The Human Element
What happens when a job fails all three of its automatic retries? We can't let a customer's order disappear into the void. This is where the failed queue comes in. It's a to-do list for a human.
I built a simple admin dashboard where an administrator can view all failed jobs, inspect the error, and make a decision.
Retry: If the failure was due to a temporary issue that's now fixed, the admin can retry the job with a single click.
Cancel: If the order is invalid, the admin can cancel it, which marks the reservation as 'cancelled' and safely returns the stock to the inventory.
The Final Safety Net: The Automatic Janitor
What if an admin forgets to check the dashboard? The final piece is a "cleanup crew" worker that runs on a cron schedule. Once every 24 hours, it finds any jobs that have been in the failed queue for too long, automatically cancels them, and returns the stock. This ensures the system cleans itself up and no inventory is ever permanently lost.
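One way to wire up that janitor, sketched here with node-cron (the 3 a.m. schedule, the 24-hour cutoff, and the 'failed' status column are assumptions):

```javascript
const cron = require("node-cron");
const { Pool } = require("pg");
const Redis = require("ioredis");
const pool = new Pool();
const redis = new Redis();

// Once a day: cancel reservations whose failed jobs nobody handled,
// and give their stock back to the live inventory in Redis.
cron.schedule("0 3 * * *", async () => {
  const { rows } = await pool.query(
    `UPDATE reservations
        SET status = 'cancelled'
      WHERE status = 'failed'
        AND updated_at < NOW() - INTERVAL '24 hours'
      RETURNING product_id, quantity`
  );

  for (const r of rows) {
    await redis.incrby(`stock:${r.product_id}`, r.quantity);
  }
});
```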
The Final Architecture
After iterating through these challenges, we arrive at a complete, resilient system.
API Layer: Fast, stateless endpoints using Redis for atomic operations
Queue Layer: BullMQ handling background processing with automatic retries
Worker Layer: Idempotent processors ensuring data consistency
Database Layer: PostgreSQL as the source of truth
Admin Layer: Human oversight for edge cases
Cleanup Layer: Automated systems preventing data staleness
Conclusion & Lessons Learned
Building this system taught me invaluable lessons about production-ready development:
Atomicity is non-negotiable for concurrent operations
Eventual consistency beats real-time inconsistency for reliability
Decoupling creates resilience - separate concerns survive failures
Idempotency enables recovery - design for retries from day one
Human oversight complements automation - build admin tools early
The journey from naive implementation to production-ready system transformed not just my code, but my approach to problem-solving. What seemed like simple reservation logic revealed deep lessons in distributed systems design.
You can find the complete source code on GitHub: https://github.com/TheBigWealth89/product_reservation