🌱 How It Started
A few years ago, I had a system design interview. The interviewer gave me this scenario:
"Design a national vaccine appointment booking system. Millions of citizens need to register and book slots. Clinics must administer the doses. The government needs audit logs and fraud prevention."
My first thought was simple: just let people book a slot, check the stock, and confirm. I drew a basic flow on the whiteboard and felt pretty good about it. Then the interviewer started asking harder questions.
"What if two people try to book the last slot at the same time?"
"What if the clinic runs out of doses after the booking is already confirmed?"
"How do you undo things if the eligibility check fails in the middle?"
I didn't have good answers. I had only designed for the happy path.
That interview stuck in my mind. Months later, I was doing research on inventory reservation patterns for an internet credit purchase system, and I realized the same ideas could have helped me in that interview. So I went back to the problem and redesigned it. This is what I came up with.
⚡ My Initial (Naïve) Solution
Here's what I proposed during the interview:
Simple, right? But the problems come fast:
- Race conditions: Two people click "Book" at the same time for the last slot. Both get confirmed. Now one citizen has no seat.
- Stock mismatch: The slot is confirmed, but the clinic runs out of vaccine doses between booking day and appointment day.
- Late eligibility failure: The system confirms the appointment first, then finds out the citizen doesn't meet the age or insurance requirement. Now you need to undo everything, but stock is already allocated.
- No rollback: If something fails in the middle, there's no way to release the slot or dose back to the pool.
These are the same problems I found later when designing the internet credit purchase system: the happy path is not enough when you deal with limited resources and many users at the same time.
Rethinking the Flow
The main idea, which I learned from inventory reservation strategies in e-commerce, is: don't confirm anything until everything is verified. Use a multi-stage process: temporary hold first, then verify, then confirm. If anything fails, roll back.
It's like buying concert tickets. When you select a seat, it's held for you while you pay. If you don't finish in time, the seat goes back. Same concept here.
Here's the full flow of the improved design:
🧩 The Improved Design
1. Reserve First (Temporary Hold)
When a citizen selects a clinic, time slot, and vaccine type, the system does not confirm right away. Instead:
- It creates a temporary reservation in Redis with a TTL (time-to-live), for example 5 minutes.
- Appointment status is set to PENDING.
- Slot capacity and vaccine dose count are decreased temporarily, so other users will see less availability.
Why Redis? Because we need something fast and temporary. A relational database could work too, but you would need a separate scheduled job to clean up expired reservations. Redis handles this automatically with TTL: when 5 minutes pass, the key just disappears. (Only the hold key expires on its own; the decremented counters still have to be restored, for example by deriving availability from the holds that are still alive.) For a system that handles millions of bookings during a national vaccine campaign, this performance difference is important.
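To make the hold-with-TTL idea concrete, here is a minimal sketch using an in-memory stand-in instead of a real Redis connection, so it runs anywhere. The `HoldStore` class, its method names, and the key format are my own illustration, not part of the original design. In Redis itself the closest equivalent is a single `SET key value NX EX 300`.

```python
import time
from typing import Callable, Dict


class HoldStore:
    """In-memory stand-in for Redis keys with a TTL (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._expires_at: Dict[str, float] = {}  # hold key -> expiry time

    def is_active(self, key: str) -> bool:
        """True while the hold exists and its TTL has not run out."""
        expires_at = self._expires_at.get(key)
        if expires_at is None:
            return False
        if self._clock() >= expires_at:
            del self._expires_at[key]  # lazy expiry, checked on read
            return False
        return True

    def hold(self, key: str, ttl_seconds: float) -> bool:
        """Create the hold only if no live hold exists (like SET NX EX)."""
        if self.is_active(key):
            return False
        self._expires_at[key] = self._clock() + ttl_seconds
        return True

    def release(self, key: str) -> None:
        """Drop the hold early, e.g. when an eligibility check fails."""
        self._expires_at.pop(key, None)


if __name__ == "__main__":
    now = [0.0]  # fake clock, so the demo does not wait 5 minutes
    store = HoldStore(clock=lambda: now[0])
    key = "hold:clinic-7:0900:citizen-1"  # hypothetical key format
    print(store.hold(key, ttl_seconds=300))
    now[0] = 301.0
    print(store.is_active(key))
```

The injectable clock is only there to make expiry easy to test. With real Redis, the server's own clock decides when the key disappears.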
How to handle race conditions on Redis? We use the Redis DECR command on the slot counter. It is atomic, meaning that if two requests come at the same time, Redis processes them one by one. DECR returns the new value, so a request that gets a negative result is rejected and the counter is incremented back. To avoid that brief dip below zero, you can use a Lua script to make the check-and-decrement happen in one step.
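Here is a rough sketch of the check-and-decrement step. The Lua text shows what such a script could look like, but it is not executed here. The Python class is an in-process stand-in that uses a lock to get the same "one at a time" behaviour. `SlotCounter`, `simulate`, and the script itself are my own illustration under those assumptions.

```python
import threading
from typing import List

# What the Redis-side script could look like (shown only, not executed here).
RESERVE_LUA = """
local remaining = tonumber(redis.call('GET', KEYS[1]) or '0')
if remaining <= 0 then return 0 end
redis.call('DECR', KEYS[1])
return 1
"""


class SlotCounter:
    """In-process stand-in for the slot counter (illustration only)."""

    def __init__(self, capacity: int) -> None:
        self._remaining = capacity
        self._lock = threading.Lock()

    def try_reserve(self) -> bool:
        """Check and decrement as one step; False when nothing is left."""
        with self._lock:
            if self._remaining <= 0:
                return False
            self._remaining -= 1
            return True

    @property
    def remaining(self) -> int:
        with self._lock:
            return self._remaining


def simulate(capacity: int, requests: int) -> int:
    """Fire concurrent booking attempts; return how many were accepted."""
    counter = SlotCounter(capacity)
    results: List[bool] = []

    def worker() -> None:
        results.append(counter.try_reserve())

    threads = [threading.Thread(target=worker) for _ in range(requests)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(results)


if __name__ == "__main__":
    print(simulate(capacity=1, requests=50))  # only one request can win
```

The lock plays the role that single-threaded command execution plays in Redis: the check and the decrement cannot be interleaved with another request.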
2. Eligibility Verification
While the slot is held, the system runs eligibility checks:
- Age requirement (e.g., some vaccines only for 60+).
- Insurance verification through external API.
- Medical history (allergies, previous doses).
- Geographic check (is this citizen in the right region?).
If any check fails, the reservation is released: the Redis key is deleted and the slot goes back to the pool. The citizen gets a clear message explaining why they are not eligible, not just "something went wrong."
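A small sketch of this step, assuming each check is a plain function that returns either nothing or a human-readable reason. The `Citizen` and `Slot` fields, the check functions, and the messages are all hypothetical. The insurance lookup, which would be an external API call in the real system, is reduced here to a boolean field.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Optional


@dataclass(frozen=True)
class Citizen:
    age: int
    region: str
    insured: bool  # stand-in for the external insurance API result
    allergies: FrozenSet[str]


@dataclass(frozen=True)
class Slot:
    region: str
    min_age: int
    allergens: FrozenSet[str]  # ingredients of the vaccine offered here


Check = Callable[[Citizen, Slot], Optional[str]]


def check_age(citizen: Citizen, slot: Slot) -> Optional[str]:
    if citizen.age < slot.min_age:
        return f"This vaccine is only available from age {slot.min_age}."
    return None


def check_insurance(citizen: Citizen, slot: Slot) -> Optional[str]:
    return None if citizen.insured else "No valid insurance record was found."


def check_allergies(citizen: Citizen, slot: Slot) -> Optional[str]:
    if citizen.allergies & slot.allergens:
        return "Your medical record lists an allergy to this vaccine."
    return None


def check_region(citizen: Citizen, slot: Slot) -> Optional[str]:
    if citizen.region != slot.region:
        return "This clinic is outside your registered region."
    return None


CHECKS: List[Check] = [check_age, check_insurance, check_allergies, check_region]


def verify_or_release(
    citizen: Citizen, slot: Slot, release_hold: Callable[[], None]
) -> Optional[str]:
    """Run the checks in order; on the first failure, release the hold
    and return the reason to show to the citizen. None means eligible."""
    for check in CHECKS:
        reason = check(citizen, slot)
        if reason is not None:
            release_hold()
            return reason
    return None
```

Returning the reason as data, instead of raising a generic error, is what makes the "clear message" possible on the frontend.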
3. Confirm Appointment
If all checks pass:
- Slot capacity and vaccine stock are decreased permanently in the main database.
- Appointment status changes from PENDING to CONFIRMED.
- Redis reservation is cleared (not needed anymore).
- Confirmation is sent to the citizen (SMS, email, or push notification).
This is the point of no return. Before this step, everything can be undone.
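One way to make "what can still happen from here" explicit is a table of allowed status transitions. This is a sketch under my own assumptions: PENDING, CONFIRMED, ADMINISTERED, and NO_SHOW are the statuses used in this post, while CANCELLED and RELEASED are names I made up for the cancel and expired-hold cases.

```python
from enum import Enum
from typing import Dict, FrozenSet


class Status(Enum):
    PENDING = "PENDING"
    CONFIRMED = "CONFIRMED"
    ADMINISTERED = "ADMINISTERED"
    NO_SHOW = "NO_SHOW"
    CANCELLED = "CANCELLED"  # assumed name for citizen/clinic cancellation
    RELEASED = "RELEASED"    # assumed name: hold expired or check failed


# Which statuses each status may move to. Terminal statuses map to nothing.
ALLOWED: Dict[Status, FrozenSet[Status]] = {
    Status.PENDING: frozenset({Status.CONFIRMED, Status.RELEASED}),
    Status.CONFIRMED: frozenset(
        {Status.ADMINISTERED, Status.NO_SHOW, Status.CANCELLED}
    ),
    Status.ADMINISTERED: frozenset(),
    Status.NO_SHOW: frozenset(),
    Status.CANCELLED: frozenset(),
    Status.RELEASED: frozenset(),
}


class InvalidTransition(Exception):
    """Raised when a status change is not in the ALLOWED table."""


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value} is not allowed")
    return target
```

With a table like this, a bug that tries to move an appointment from PENDING straight to ADMINISTERED fails loudly instead of silently corrupting stock counts.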
4. Administration (Vaccination Day)
When the citizen arrives at the clinic:
- Clinic staff scans the citizen's QR code. The QR code contains the appointment ID and a verification hash. The hash is generated on the server using appointment ID + citizen ID + a secret key, so it cannot be faked.
- System verifies the QR code against the appointment record.
- Staff records the vaccine batch number and time of administration.
- Appointment status changes to ADMINISTERED.
- An event is sent to other systems: analytics, government reporting, audit logs.
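The QR verification hash can be sketched with Python's standard `hmac` module. I only said "hash" above, so HMAC-SHA256 is my choice for this sketch. The payload format (`appointment_id.tag`), the function names, and the assumption that IDs contain no ":" are also mine.

```python
import hashlib
import hmac


def sign_appointment(appointment_id: str, citizen_id: str, secret: bytes) -> str:
    """HMAC-SHA256 over 'appointment_id:citizen_id', hex encoded.
    Assumes neither ID contains ':' so the message is unambiguous."""
    message = f"{appointment_id}:{citizen_id}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def qr_payload(appointment_id: str, citizen_id: str, secret: bytes) -> str:
    """Text to encode in the QR code: the appointment ID plus the tag."""
    return f"{appointment_id}.{sign_appointment(appointment_id, citizen_id, secret)}"


def verify_qr(payload: str, citizen_id_on_record: str, secret: bytes) -> bool:
    """Recompute the tag from the appointment record and compare.
    citizen_id_on_record comes from looking up the appointment server-side."""
    appointment_id, separator, tag = payload.rpartition(".")
    if not separator or not appointment_id:
        return False
    expected = sign_appointment(appointment_id, citizen_id_on_record, secret)
    # Constant-time comparison, so timing does not leak how much matched.
    return hmac.compare_digest(expected.encode("utf-8"), tag.encode("utf-8"))


if __name__ == "__main__":
    secret = b"demo-secret-not-for-production"
    payload = qr_payload("appt-123", "citizen-42", secret)
    print(verify_qr(payload, "citizen-42", secret))
    print(verify_qr(payload, "citizen-99", secret))
```

The secret never leaves the server, so someone who only sees the QR code cannot produce a valid tag for a different appointment.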
5. Failure & Rollback Scenarios
This is the part I completely missed in my interview. Here's how each failure is handled:
- No-show: A scheduled job checks for CONFIRMED appointments that passed their time window. Status becomes NO_SHOW, stock is released back.
- Citizen cancels: They can cancel through the portal. Stock is released right away.
- Clinic cancels a slot (e.g., not enough staff): All affected appointments are flagged. Citizens get notified and can rebook with priority.
- External API is down (e.g., insurance service): The system uses a circuit breaker pattern. After several failures in a row, the system stops calling that API temporarily. Meanwhile, the booking is either queued for retry (with increasing wait time between retries) or allowed provisionally with a flag for manual review later. The important thing is: one broken dependency should not block the whole flow.
- Redis goes down: The system falls back to database-level reservations with a cleanup job. It's slower, but the booking still works.
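For the external API case, a circuit breaker can be sketched in a few lines. This is a minimal version under my own assumptions: the class name, the threshold, and the timeout values are illustrative, and a production system would more likely use an existing library than hand-roll this.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._failure_threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("dependency is temporarily skipped")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0  # a success closes the breaker again
        self._opened_at = None
        return result


def backoff_delays(base_seconds: float, attempts: int, factor: float = 2.0) -> List[float]:
    """Wait times for queued retries, growing by `factor` each attempt."""
    return [base_seconds * factor ** i for i in range(attempts)]
```

When `CircuitOpenError` is raised, the Appointment Service can take one of the two paths described above: queue the booking for retry using delays like `backoff_delays(1.0, 4)`, or allow it provisionally with a flag for manual review.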
System Components
Here's the high-level architecture:
- Frontend: Booking portal for citizens + Dashboard for clinic staff.
- API Gateway: Authentication, rate limiting (very important during mass booking), and routing.
- Core Services:
  - Auth Service: Login, national ID verification.
  - Patient Service: Medical records, vaccination history.
  - Clinic Service: Slot management, staff schedules, capacity.
  - Inventory Service: Vaccine stock per clinic, batch tracking.
  - Appointment Service: The main service. Manages reservations, confirmations, and status changes.
  - Eligibility Service: Rules engine + external API calls.
  - Notification Service: SMS, email, push. Retries if delivery fails.
  - Audit Service: Append-only logs for every status change. Required for government compliance.
- Data Layer: PostgreSQL for permanent data, Redis for temporary reservations and caching.
- Async Messaging: Kafka for events (AppointmentReserved, AppointmentConfirmed, AppointmentAdministered, AppointmentCancelled). This keeps services separated and makes the system auditable by default.
🎯 What I Would Do Differently Now
Looking back at that interview, the biggest thing I missed was not about technology; it was about mindset. I jumped to the happy path because it felt complete. But the interviewer was not testing if I can design a booking form. They were testing if I can think about what happens when things go wrong.
Here's what I learned from this experience:
- Start with failure scenarios, not the happy path. Ask yourself "what can go wrong at each step?" before finalizing any design.
- Temporary reservation is a pattern, not a hack. Whether it's concert tickets, flash sales, or vaccine slots: if you have limited stock and many users, you need a hold-then-confirm flow.
- Don't be vague about rollbacks. "We'll handle errors" is not a design. Be specific about what happens to the data, the stock, and the user when something fails.
- External services will go down. Always have a plan for when the insurance API or notification service is not available. Circuit breakers and retry queues are not optional; they are necessary.
If you're preparing for system design interviews, I recommend studying inventory reservation patterns. My earlier post on designing an internet credit purchase system covers these patterns with more detail and code examples. The core idea (reserve first, verify, then commit) appears in many systems once you start looking.
Thanks for reading. If you have faced similar interview questions or have ideas to improve this design, I would like to hear about them in the comments.