DEV Community

K. Polash

Why Your System Fails on the Most Predictable Day of the Year

Most applications don't fail because of bad code.

They fail because of bad architecture decisions made early that nobody questioned
until it was too late.

Here's what actually breaks systems at scale:

  • Everything talking to everything (no clear boundaries)
  • The database doing work the application should do
  • Synchronous processing where async was needed
  • One giant service that owns too much responsibility
  • No separation between reads and writes under heavy load

None of these are framework problems. None of these are language problems.
They are thinking problems.


The Scenario: University Enrollment Day

Think about course enrollment day at any university.

Every semester, thousands of students flood the portal at the exact same time, each trying to register for their courses. The load is not a surprise. The date is on the calendar. It happens every single semester like clockwork.

But the system was never designed for it.

Every request hits the same flow: check eligibility, check seat availability, write enrollment, update seat count. All synchronously, all at once. No queue. No cache. No separation between reads and writes. Just a database choking under the weight of an entirely predictable moment.

Students get timeout errors. Duplicate enrollments. Lost seats they were fully eligible for. Everyone refreshes in panic. And every semester, someone calls IT to "increase the server capacity" and nothing really changes.

The code isn't broken. The thinking is.


What's Actually Going On

Most teams look at this and see one problem: too much traffic. So they throw more servers at it. It helps a little, then fails again next semester.

The reality is there are 5 separate problems here, each requiring a different solution. Solving one without the others just moves the failure to a different place.

🔥 The Spike

Thousands of simultaneous requests will bring any database to its knees regardless of hardware. A queue, a cache layer, and read/write separation need to work together. Most teams implement one and call it done.
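To make the queue piece concrete, here's a minimal sketch (hypothetical names, Python stdlib only, a dict standing in for the database): requests land in a queue, and a single worker drains it, so seat writes happen one at a time instead of stampeding the store.

```python
import queue
import threading

requests_q = queue.Queue()
seats = {"cs101": 1}        # hypothetical seat store; one seat left
results = []

def worker():
    # A single consumer drains the queue, so seat writes are serialized.
    while True:
        item = requests_q.get()
        if item is None:    # sentinel: stop the worker
            break
        student, course = item
        if seats[course] > 0:
            seats[course] -= 1
            results.append((student, "enrolled"))
        else:
            results.append((student, "full"))

t = threading.Thread(target=worker)
t.start()
for student in ["alice", "bob"]:        # both arrive "at once"
    requests_q.put((student, "cs101"))
requests_q.put(None)
t.join()
```

Intake is now smooth, but notice what this does not solve: with more than one worker, the read-then-write inside the loop is exactly the race condition described next.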

๐Ÿ” The Race Condition

Even with a queue, two workers can read "1 seat available" at the same time, both pass eligibility, and both enroll into the last seat. The queue serialized intake. It did not serialize processing. You need locking (pessimistic, optimistic, or distributed), and each has real tradeoffs.
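One way to serialize the last-seat decision, sketched here with an in-memory SQLite table (table and column names are made up): let the database perform an atomic conditional decrement, and treat the affected row count as the verdict. Check-and-take becomes a single step, so two workers can never both pass the "seats > 0" check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courses (id INTEGER PRIMARY KEY, seats INTEGER)")
conn.execute("INSERT INTO courses VALUES (1, 1)")   # one seat left
conn.commit()

def try_enroll(course_id):
    # The WHERE clause makes the check and the decrement one atomic statement.
    cur = conn.execute(
        "UPDATE courses SET seats = seats - 1 WHERE id = ? AND seats > 0",
        (course_id,),
    )
    conn.commit()
    return cur.rowcount == 1    # True only for the caller who got the seat

first = try_enroll(1)
second = try_enroll(1)          # the losing "worker" gets a clean False
```

The same shape shows up as a compare-and-swap with a version column (optimistic locking) or a `SELECT ... FOR UPDATE` (pessimistic locking) in a real system.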

👆 The Double Click

A student hits Enroll and the page is slow. They click again. Now two identical requests are in flight. Even with locking in place, without idempotency handling both clicks can create two enrollment records. This is not a database problem. It is an API design problem, and the one most teams discover only after finding duplicate records in production.
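The usual fix is an idempotency key: the client attaches the same key to both clicks, and the server stores the response per key. A minimal sketch (dict in place of a persistent key store, hypothetical field names):

```python
processed = {}   # idempotency_key -> stored response

def enroll(idempotency_key, student_id, course_id):
    # A retried request with the same key gets the stored response back
    # instead of creating a second enrollment record.
    if idempotency_key in processed:
        return processed[idempotency_key]
    response = {"enrollment_id": f"{student_id}:{course_id}", "status": "enrolled"}
    processed[idempotency_key] = response
    return response

first_click = enroll("req-abc123", "s42", "cs101")
second_click = enroll("req-abc123", "s42", "cs101")  # impatient double click
```

Both clicks get an identical response, and only one record exists. In production the key store needs the same durability as the enrollment data, or a crash between the write and the key insert reopens the hole.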

💥 The Half-Enrolled Student

A successful enrollment is not one database write. It is a chain: write the record, decrement the seat count, send confirmation, update the academic record, generate a fee entry. What happens if the system crashes after step 2?

The seat is taken. The student has no confirmation. The database says enrolled. The academic record disagrees. Designing for the middle of failure is architecture. Designing only for the happy path is wishful thinking.
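One pattern for designing for the middle of failure is a compensation chain (a saga, in effect): every step ships with an undo, and a failure rolls back the completed steps in reverse. A toy sketch, with a simulated crash after step 2 and all names hypothetical:

```python
state = {"seats": 10, "enrolled": False}

def run_saga(steps):
    # Run each (do, undo) pair in order; if any step fails,
    # undo the completed steps in reverse so nothing is left half-done.
    done = []
    try:
        for do, undo in steps:
            do()
            done.append(undo)
        return True
    except Exception:
        for undo in reversed(done):
            undo()
        return False

def write_record():      state.update(enrolled=True)
def undo_record():       state.update(enrolled=False)
def take_seat():         state.update(seats=state["seats"] - 1)
def release_seat():      state.update(seats=state["seats"] + 1)
def send_confirmation(): raise RuntimeError("mail service down")  # the crash

ok = run_saga([
    (write_record, undo_record),
    (take_seat, release_seat),
    (send_confirmation, lambda: None),
])
```

After the failure, the seat count and the enrollment flag are back where they started, instead of the seat being taken with no confirmation sent.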

๐Ÿ•ณ๏ธ The Stale Cache Trap

The cache says 5 seats available. The database already has 0. Students attempt to enroll based on stale data, hit the lock, fail, and get a confusing error even though the portal showed availability 3 seconds ago. The cache improved performance but silently introduced a trust problem. Cache invalidation strategy needs to be designed upfront, not patched after support tickets pile up.
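The core of the upfront design is deciding when the cache is told the truth. A minimal sketch of invalidate-on-write (a dict as the "database", another as the cache, all names hypothetical): an enrollment evicts the cached count immediately instead of waiting out the TTL.

```python
import time

db = {"cs101": 1}    # authoritative seat count
cache = {}           # course_id -> (value, expires_at)
TTL = 300            # the "harmless" 5-minute read cache

def cached_seats(course):
    entry = cache.get(course)
    if entry and entry[1] > time.monotonic():
        return entry[0]                     # cache hit
    value = db[course]
    cache[course] = (value, time.monotonic() + TTL)
    return value

def enroll(course):
    if db[course] > 0:
        db[course] -= 1
        cache.pop(course, None)   # invalidate on write; don't wait for the TTL
        return True
    return False

before = cached_seats("cs101")    # primes the cache with 1
enroll("cs101")
after = cached_seats("cs101")     # re-reads the db: 0, not a stale 1
```

This still isn't airtight under concurrency (a read can re-cache a value mid-write), which is exactly why the strategy deserves design time rather than a patch after the support tickets arrive.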


The Questions That Actually Matter

The engineers who scale systems well ask different questions from the start:

  • Where are my bottlenecks under 10x load?
  • What happens if this one service goes down?
  • Am I coupling things that should be independent?
  • What happens in the middle of a failure, not just at the end?
  • Can the same request safely arrive twice?

None of these have anything to do with which framework you picked or which cloud provider you use. They are design questions. Thinking questions.


I wrote a full deep dive on all 5 problems with concrete solutions for each one.

👉 Read the full article here →


Architecture is not about knowing the right tools. It is about asking the right questions early enough that the answers still matter.

Top comments (1)

Andre Cytryn

the race condition + stale cache trap combo is brutal in real production systems. we had this exact scenario with a ticketing platform where the cache was showing 3 seats available while the db was already at 0. what made it worse was the cache TTL was set to 5 minutes because "it's just a read cache" -- nobody thought about enrollment spikes when that decision was made.

one pattern that helped us: showing architecture diagrams of these flows upfront during design reviews instead of just describing them in prose. when you draw out the cache-db-queue interactions visually, the failure modes become obvious in a way that's hard to see in text. i've been building a tool for exactly this kind of system design diagramming and it's changed how our team spots these issues early.