What happens when two users try to reserve the last item at the exact same moment? Most systems would crumble, selling the same product twice and creating a nightmare of canceled orders and angry customers. I discovered this isn't a theoretical problem; it's a fundamental flaw in how we handle concurrent operations.
In this article, I'll walk you through the journey of building what seems like a simple reservation system. We'll start with a naive approach, expose its critical flaws, and then systematically rebuild it into a fault-tolerant, production-ready backend that guarantees every single order is handled correctly.
Choosing the Right Tools
Building for reliability requires careful tool selection. Here's why I chose this stack:
PostgreSQL: My source of truth. Its strong ACID compliance guarantees data integrity, making it perfect for tracking reservations and inventory.
Redis: More than just a cache. I used it as a high-speed data structure server for atomic operations and real-time inventory management.
Node.js with Express: Ideal for building fast, non-blocking APIs that can handle high concurrency.
BullMQ: A robust job queue system that makes our application fault-tolerant by handling failures gracefully.
The Race Condition
When I first built the reservation feature, the logic was simple: check the stock, then create the reservation. In Node.js, that looked something like this:
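Reconstructed as a sketch with node-postgres (`pg`) and simplified `products` and `reservations` tables standing in for the original schema:

```javascript
// Naive approach: check the stock, then create the reservation.
// The gap between the two queries is where the trouble starts.
const { Pool } = require("pg");
const pool = new Pool();

async function reserveItem(productId, userId) {
  // Step 1: read the current stock
  const { rows } = await pool.query(
    "SELECT stock FROM products WHERE id = $1",
    [productId]
  );

  if (rows[0].stock <= 0) {
    throw new Error("Out of stock");
  }

  // Step 2: create the reservation and decrement the stock.
  // A concurrent request can read the old stock between step 1 and step 2.
  await pool.query(
    "INSERT INTO reservations (product_id, user_id, status) VALUES ($1, $2, 'pending')",
    [productId, userId]
  );
  await pool.query(
    "UPDATE products SET stock = stock - 1 WHERE id = $1",
    [productId]
  );
}
```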
This works perfectly for one user at a time. But what happens when two users try to buy the last available item at the exact same millisecond? A disaster.
Request A reads the stock from the database. It sees 1 item left.
Request B reads the stock from the database. It also sees 1 item left.
Request A proceeds, creates the reservation, and decrements the stock to 0.
Request B also proceeds, creating a second reservation for an item that no longer exists, and decrements the stock to -1.
We've now sold the same item twice. This is the race condition, and it's the first villain we need to defeat.
Atomic Operations with Lua
To defeat a race condition, we need an operation that is atomic: a single, all-or-nothing step. There can be no gap between checking the stock and decrementing it. This is where Redis and Lua scripting shine.
Why Lua?
My first thought was to use Redis transactions (WATCH, MULTI, EXEC). While powerful, this approach can be complex and requires multiple back-and-forth commands between my Node.js app and Redis.
A better way is to encapsulate the entire check-and-decrement logic in a single Lua script. Redis guarantees that a Lua script executes atomically: it runs the script from start to finish before any other command is allowed to run. This collapses the back-and-forth into a single round trip and guarantees our critical section is safe.
Here is the simple script that became the heart of our reservation system. Because Redis executes it atomically, there is no gap between the check and the update.
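A minimal sketch of such a script, embedded in Node.js and run through ioredis (the `stock:<productId>` key naming is my own convention):

```javascript
const Redis = require("ioredis");
const redis = new Redis();

// KEYS[1] = the stock key, ARGV[1] = quantity requested.
// The check and the decrement run as one atomic unit inside Redis.
const reserveScript = `
  local stock = tonumber(redis.call('GET', KEYS[1]) or '0')
  local qty = tonumber(ARGV[1])
  if stock >= qty then
    redis.call('DECRBY', KEYS[1], qty)
    return 1
  end
  return 0
`;

async function tryReserve(productId, quantity) {
  // eval(script, numberOfKeys, ...keys, ...args)
  const ok = await redis.eval(reserveScript, 1, `stock:${productId}`, quantity);
  return ok === 1;
}
```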
By running this one script, we completely solve the race condition. It's impossible for two requests to get past the if check simultaneously.
Building a Resilient System
With our inventory safe, the rest of the reservation process can proceed. This involves persisting the reservation to PostgreSQL (our source of truth) and setting a "ticking clock" (a key with a TTL) in Redis for abandoned carts.
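Concretely, that step might look like this (a sketch: ioredis for the hold key, `pg` for the insert, and a 15-minute TTL chosen here purely for illustration):

```javascript
const { Pool } = require("pg");
const Redis = require("ioredis");
const pool = new Pool();
const redis = new Redis();

async function persistReservation(reservationId, productId, userId) {
  // PostgreSQL remains the source of truth for the reservation itself
  await pool.query(
    "INSERT INTO reservations (id, product_id, user_id, status) VALUES ($1, $2, $3, 'pending')",
    [reservationId, productId, userId]
  );

  // The "ticking clock": a Redis key that expires when the hold runs out
  await redis.set(`reservation:hold:${reservationId}`, productId, "EX", 15 * 60);
}
```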
But what happens when that clock runs out?
The Self-Healing Cleanup Worker
My first instinct for handling expired reservations was to use Redis's Pub/Sub feature. It seemed elegant and real-time.
However, I quickly discovered its fatal flaw: Pub/Sub is a fire-and-forget system. If your worker service is down for even a second when an expiration event is published, that message is lost forever.
This led me to a more resilient, self-healing solution: a polling-based worker. Instead of passively listening for messages it might miss, this worker proactively asks the database a simple question every 30 seconds: "Are there any reservations that should have expired?"
This approach is incredibly robust. If the worker is offline for an hour, the next time it starts, it will find all the reservations it missed and clean them up correctly. It guarantees eventual consistency and is the kind of reliable pattern required for a production system.
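A sketch of that polling worker, assuming a `reservations` table with `status`, `expires_at`, and `quantity` columns and one Redis stock key per product:

```javascript
const { Pool } = require("pg");
const Redis = require("ioredis");
const pool = new Pool();
const redis = new Redis();

async function cleanupExpiredReservations() {
  // Find every hold that should have expired, no matter how long we were offline
  const { rows } = await pool.query(
    `UPDATE reservations
        SET status = 'expired'
      WHERE status = 'pending' AND expires_at < NOW()
      RETURNING product_id, quantity`
  );

  // Return the reserved stock to the live inventory in Redis
  for (const r of rows) {
    await redis.incrby(`stock:${r.product_id}`, r.quantity);
  }
}

// Poll every 30 seconds, as described above
setInterval(() => {
  cleanupExpiredReservations().catch(console.error);
}, 30_000);
```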
The "Chatty" Checkout
Our system was now safe from race conditions and abandoned carts. But I discovered a new villain: performance.
A user might have 10 or 20 items in their cart, and my Node.js checkout logic made multiple Redis calls for each one. Even a 10-item cart could mean 30-40 separate network round trips. This "chatty" process was slow and inefficient.
The solution was the same as before: delegate the entire task to a single, expert Lua script that validates the whole cart in one atomic, lightning-fast operation.
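A sketch of what that batched script can look like: it receives one stock key per cart item plus the matching quantities, and either reserves everything or nothing (the key layout is my own):

```javascript
const Redis = require("ioredis");
const redis = new Redis();

// All-or-nothing cart validation in a single round trip.
// KEYS = one stock key per cart item, ARGV = the matching quantities.
const checkoutScript = `
  for i = 1, #KEYS do
    local stock = tonumber(redis.call('GET', KEYS[i]) or '0')
    if stock < tonumber(ARGV[i]) then
      return 0  -- abort before touching anything
    end
  end
  for i = 1, #KEYS do
    redis.call('DECRBY', KEYS[i], ARGV[i])
  end
  return 1
`;

async function reserveCart(items) {
  const keys = items.map((i) => `stock:${i.productId}`);
  const quantities = items.map((i) => i.quantity);
  const ok = await redis.eval(checkoutScript, keys.length, ...keys, ...quantities);
  return ok === 1;
}
```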
This reduced dozens of network calls to a single atomic operation, slashing checkout latency by 80%.
Architecting for Failure
Our system was now fast and safe in Redis. But what if the slowest part of our system, the main database, crashes during a purchase?
The Ultimate Challenge: The Unreliable Network
We need a system that guarantees a purchase will be processed, even if our server crashes. This is where we bring in the final piece of our architecture: a job queue.
The Unbreakable System: Decoupling with BullMQ
Instead of making our API handle the slow database work, we decouple it. The API's only job is to accept the purchase request and add a "job" to a BullMQ queue. The user gets an instant response, and their purchase is now safely waiting to be processed by a completely separate, dedicated worker process.
- Fast API responses regardless of database performance
- Automatic retries on database failures
- No lost orders during outages
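Here's roughly what the API side of that decoupling looks like, sketched with BullMQ and Express (the queue name, job name, and retry settings are illustrative, not taken from the original code):

```javascript
const express = require("express");
const { Queue } = require("bullmq");

const app = express();
app.use(express.json());

const purchaseQueue = new Queue("purchases", {
  connection: { host: "127.0.0.1", port: 6379 },
});

app.post("/checkout", async (req, res) => {
  const { reservationId } = req.body;

  // The API only enqueues the work; the slow database writes happen
  // later in a dedicated worker, with automatic retries on failure.
  await purchaseQueue.add(
    "process-purchase",
    { reservationId },
    { attempts: 3, backoff: { type: "exponential", delay: 5000 } }
  );

  // The user gets an instant answer, even if PostgreSQL is slow or down
  res.status(202).json({ status: "queued", reservationId });
});

app.listen(3000);
```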
The Art of Idempotency
What if a job fails and needs to be retried? We must ensure that retrying a job doesn't decrement stock multiple times. We need to make our worker idempotent, meaning it can be run multiple times with the same result as running it once.
I achieved this with a simple but powerful state machine in my reservations table.
A new status: 'processing'. When the worker first picks up a job, it immediately moves the reservation's status from 'pending' to 'processing'. This is a critical signal. It tells my expiresWorker: "Don't touch this! This order is in-flight." It prevents the angry-customer scenario where an order is marked as expired while it's actively being purchased.
(pending -> processing -> completed)
Before the worker attempts to decrement inventory, it first checks the reservation's status. If the status is already 'completed', it means this is a retried job that already succeeded. The worker simply logs a warning and skips to the next item, preventing any duplicate processing.
This makes the worker incredibly safe. It can fail and retry a dozen times, but it will only ever process the purchase once.
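Put together, the worker might look something like this (a sketch: the column names and the exact SQL are assumptions, but the pending -> processing -> completed flow is the one described above):

```javascript
const { Worker } = require("bullmq");
const { Pool } = require("pg");

const pool = new Pool();

const purchaseWorker = new Worker(
  "purchases",
  async (job) => {
    const { reservationId } = job.data;

    // Idempotency guard: if a retried job already succeeded, skip it
    const { rows } = await pool.query(
      "SELECT status, product_id, quantity FROM reservations WHERE id = $1",
      [reservationId]
    );
    const reservation = rows[0];
    if (!reservation || reservation.status === "completed") {
      console.warn(`Reservation ${reservationId} already processed, skipping`);
      return;
    }

    // Claim the order: tells the expiration worker "don't touch this"
    await pool.query(
      "UPDATE reservations SET status = 'processing' WHERE id = $1",
      [reservationId]
    );

    // The slow, failure-prone part: commit the purchase to the source of truth
    await pool.query(
      "UPDATE products SET stock = stock - $2 WHERE id = $1",
      [reservation.product_id, reservation.quantity]
    );
    await pool.query(
      "UPDATE reservations SET status = 'completed' WHERE id = $1",
      [reservationId]
    );
  },
  { connection: { host: "127.0.0.1", port: 6379 } }
);
```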
When Automation Fails: The Human Element
What happens when a job fails all three of its automatic retries? We can't let a customer's order disappear into the void. This is where the failed queue comes in. It's a to-do list for a human.
I built a simple admin dashboard where an administrator can view all failed jobs, inspect the error, and make a decision.
Retry: If the failure was due to a temporary issue that's now fixed, the admin can retry the job with a single click.
Cancel: If the order is invalid, the admin can cancel it, which marks the reservation as 'cancelled' and safely returns the stock to the inventory.
The Final Safety Net: The Automatic Janitor
What if an admin forgets to check the dashboard? The final piece is a "cleanup crew" worker that runs on a cron schedule. Once every 24 hours, it finds any jobs that have been in the failed queue for too long, automatically cancels them, and returns the stock. This ensures the system cleans itself up and no inventory is ever permanently lost.
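One way to wire up that janitor, sketched here with node-cron (the 3 a.m. schedule, the 24-hour cutoff, and the 'failed' status column are assumptions):

```javascript
const cron = require("node-cron");
const { Pool } = require("pg");
const Redis = require("ioredis");
const pool = new Pool();
const redis = new Redis();

// Once a day: cancel reservations whose failed jobs nobody handled,
// and give their stock back to the live inventory in Redis.
cron.schedule("0 3 * * *", async () => {
  const { rows } = await pool.query(
    `UPDATE reservations
        SET status = 'cancelled'
      WHERE status = 'failed'
        AND updated_at < NOW() - INTERVAL '24 hours'
      RETURNING product_id, quantity`
  );

  for (const r of rows) {
    await redis.incrby(`stock:${r.product_id}`, r.quantity);
  }
});
```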
The Final Architecture
After iterating through these challenges, we arrive at a complete, resilient system.
API Layer: Fast, stateless endpoints using Redis for atomic operations
Queue Layer: BullMQ handling background processing with automatic retries
Worker Layer: Idempotent processors ensuring data consistency
Database Layer: PostgreSQL as the source of truth
Admin Layer: Human oversight for edge cases
Cleanup Layer: Automated systems preventing data staleness
Conclusion & Lessons Learned
Building this system taught me invaluable lessons about production-ready development:
Atomicity is non-negotiable for concurrent operations
Eventual consistency beats real-time inconsistency for reliability
Decoupling creates resilience - separate concerns survive failures
Idempotency enables recovery - design for retries from day one
Human oversight complements automation - build admin tools early
The journey from naive implementation to production-ready system transformed not just my code, but my approach to problem-solving. What seemed like simple reservation logic revealed deep lessons in distributed systems design.
You can find the complete source code on GitHub: https://github.com/TheBigWealth89/product_reservation