1. Start from First Principles: What Is a “Failure Class”?
A failure class is not:
- a bug
- a timeout
- an outage
A failure class is:
A category of things that can go wrong because of how responsibility, time, and state are structured
So we ask:
- What must be true for correctness?
- What assumptions does the model silently make?
- What breaks when those assumptions are false?
2. Core Difference (One Sentence)
Synchronous systems fail by blocking and cascading; asynchronous systems fail by duplication, reordering, and invisibility.
Everything else is a consequence.
3. Synchronous Systems — Failure Classes
Definition (First Principles)
A synchronous system assumes:
“The caller waits while the callee finishes the work.”
This couples:
- time
- availability
- correctness
Failure Class 1: Blocking Amplification
Question asked:
What happens while the system waits?
Reality:
- Threads blocked
- Connections held
- Memory retained
Failure mode:
Load increases → latency increases → throughput collapses
This is not just “slow.”
It is non-linear failure.
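The non-linearity can be made concrete with a back-of-the-envelope model. The sketch below assumes a fixed pool of blocking workers where each worker is held for the full duration of a call. The function names and the numbers in the usage note are illustrative assumptions, not measurements from any real system.

```python
def max_throughput(workers: int, latency_s: float) -> float:
    """Upper bound on requests/second for a pool of blocking workers.

    Each worker is occupied for the whole call, so at best it completes
    1 / latency_s requests per second.
    """
    if workers <= 0 or latency_s <= 0:
        raise ValueError("workers and latency_s must be positive")
    return workers / latency_s


def backlog_after(seconds: float, arrival_rate: float,
                  workers: int, latency_s: float) -> float:
    """Requests still waiting after `seconds` of steady arrivals.

    Returns 0.0 when the pool has enough capacity to keep up.
    """
    excess = arrival_rate - max_throughput(workers, latency_s)
    return max(0.0, excess * seconds)
```

Under these assumptions, 100 workers at 0.5 s per call can serve up to 200 requests per second. If latency rises to 2 s, capacity drops to 50 requests per second, and a steady 150 requests per second leaves 6,000 requests queued after one minute.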
Failure Class 2: Cascading Failure
Question asked:
What if a dependency slows down?
Because everything is waiting:
- Agent slows → backend slows
- Backend slows → frontend retries
- Retries amplify load
Failure mode:
One slow dependency can take down the entire system
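Retry amplification can be estimated with simple arithmetic. The sketch below assumes a worst case in which every layer in the call chain retries independently and every attempt fails. `worst_case_calls` is a hypothetical helper name used only for illustration.

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Calls reaching the deepest dependency for one user request.

    attempts_per_layer counts the original call plus its retries.
    Each layer multiplies the load on the layer below it.
    """
    if attempts_per_layer < 1 or layers < 0:
        raise ValueError("need attempts_per_layer >= 1 and layers >= 0")
    return attempts_per_layer ** layers
```

With three layers that each make up to three attempts, a single user request can produce 27 calls against the slow dependency, which is exactly when it can least afford extra load.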
Failure Class 3: Availability Coupling
Question asked:
Can the system function if the dependency is down?
Answer in sync systems:
- No, not unless a fallback is built in: the caller cannot finish without the callee's response
Failure mode:
Partial outage becomes total outage
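The coupling can be quantified. The sketch below assumes a request needs every inline dependency to succeed and that dependency failures are independent. Both are simplifying assumptions, and `serial_availability` is an illustrative name.

```python
import math
from typing import Iterable


def serial_availability(availabilities: Iterable[float]) -> float:
    """Availability of a request that needs every inline dependency to be up.

    Assumes independent failures, so the individual probabilities multiply.
    """
    return math.prod(availabilities)
```

Five inline dependencies at 99.9% each give roughly 99.5% for the request as a whole, and a single dependency at 0% takes the whole request to 0%.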
Summary: Sync Failure Classes
| Category | Root Cause |
|---|---|
| Blocking | Time is coupled |
| Cascades | Dependencies are inline |
| Global outage | Availability is transitive |
4. Asynchronous Systems — Failure Classes
Definition (First Principles)
An async system assumes:
“Work can finish later, possibly multiple times, possibly out of order.”
This decouples time but removes the guarantees that waiting provided: that work runs once, in order, and that the caller sees the result.
Failure Class 1: Duplicate Execution
Question asked:
What happens if work is retried?
Reality:
- At-least-once delivery
- Worker crashes
- Message reprocessed
Failure mode:
Same logical action happens multiple times
This breaks:
- Any design that assumed exactly-once execution
- Any handler that was assumed to be idempotent but is not
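A common mitigation is to make the handler idempotent by deduplicating on a message ID. The sketch below is a minimal in-memory version, assuming each message carries a stable, unique `message_id`. `IdempotentHandler` is a hypothetical class; a real system would need a durable store and would have to record the ID atomically with the side effect.

```python
from typing import Any, Callable, Dict


class IdempotentHandler:
    """Wraps an action so that a redelivered message does not run it twice."""

    def __init__(self, action: Callable[[Any], Any]) -> None:
        self._action = action
        self._results: Dict[str, Any] = {}  # message_id -> stored result

    def handle(self, message_id: str, payload: Any) -> Any:
        # Redelivery: return the stored result instead of repeating the effect.
        if message_id in self._results:
            return self._results[message_id]
        result = self._action(payload)
        self._results[message_id] = result
        return result
```

Delivering the same `message_id` twice runs the action once; the second delivery only reads the stored result.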
Failure Class 2: Ordering Violations
Question asked:
What defines sequence?
Reality:
- Queues don’t know business order
- Workers process independently
Failure mode:
Effects appear out of logical order
For chat systems:
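- Responses generated from messages that were sent after the one being answered
- Context corruption: the conversation history is assembled in the wrong order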
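One way to restore business order is to stamp each message with a sequence number at the source and hold back anything that arrives early. The sketch below assumes the producer assigns consecutive integers per conversation. `ReorderBuffer` is an illustrative name, not a library class.

```python
from typing import Any, Dict, List


class ReorderBuffer:
    """Releases items strictly in sequence order, buffering early arrivals."""

    def __init__(self, first_seq: int = 0) -> None:
        self._next = first_seq
        self._pending: Dict[int, Any] = {}

    def push(self, seq: int, item: Any) -> List[Any]:
        """Accept one item and return every item that is now deliverable."""
        if seq < self._next or seq in self._pending:
            return []  # already released or already buffered: ignore it
        self._pending[seq] = item
        ready: List[Any] = []
        # Drain the consecutive run that starts at the next expected number.
        while self._next in self._pending:
            ready.append(self._pending.pop(self._next))
            self._next += 1
        return ready
```

If message 1 arrives before message 0, the buffer holds it and releases both in order once message 0 shows up. The cost is that one missing message stalls everything behind it, so a production version would also need a timeout or gap policy.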
Failure Class 3: Completion Invisibility
Question asked:
How does the user know when work is done?
Reality:
- No direct signal
- Polling or guessing
Failure mode:
Users wait blindly or see stale state
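Making completion visible usually means giving each job an explicit, queryable state. The sketch below is a minimal in-memory tracker. The state names and the `JobTracker` class are assumptions made for illustration; a real system would persist the state and might push updates to the client rather than rely on polling.

```python
from enum import Enum
from typing import Any, Dict, Optional, Tuple


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


class JobTracker:
    """Keeps an explicit state per job so clients can ask instead of guess."""

    def __init__(self) -> None:
        self._jobs: Dict[str, Tuple[JobState, Any]] = {}

    def submit(self, job_id: str) -> None:
        self._jobs[job_id] = (JobState.PENDING, None)

    def _set(self, job_id: str, state: JobState, detail: Any) -> None:
        if job_id not in self._jobs:
            raise KeyError(job_id)
        self._jobs[job_id] = (state, detail)

    def start(self, job_id: str) -> None:
        self._set(job_id, JobState.RUNNING, None)

    def finish(self, job_id: str, result: Any) -> None:
        self._set(job_id, JobState.DONE, result)

    def fail(self, job_id: str, error: str) -> None:
        self._set(job_id, JobState.FAILED, error)

    def status(self, job_id: str) -> Optional[Tuple[JobState, Any]]:
        """Return (state, detail), or None for an unknown job."""
        return self._jobs.get(job_id)
```

A client that polls `status` can now distinguish "still running" from "failed" from "never submitted", rather than seeing the same silence for all three.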
Failure Class 4: Orphaned Work
Question asked:
What if the user disappears?
Reality:
- Job keeps running
- Response stored but never consumed
Failure mode:
Wasted compute, leaked state
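Orphans can be bounded by tying each job to a lease that the interested client must keep renewing. The sketch below assumes clients send periodic heartbeats and that the caller supplies the clock value `now`, which keeps the example deterministic. `LeaseRegistry` and its methods are hypothetical names.

```python
from typing import Dict, List


class LeaseRegistry:
    """Tracks which jobs still have a live client, using heartbeats and a TTL."""

    def __init__(self, ttl_s: float) -> None:
        self._ttl_s = ttl_s
        self._last_seen: Dict[str, float] = {}

    def heartbeat(self, job_id: str, now: float) -> None:
        """Record that the client for job_id was alive at time `now`."""
        self._last_seen[job_id] = now

    def reap(self, now: float) -> List[str]:
        """Remove and return jobs whose last heartbeat is older than the TTL."""
        expired = sorted(
            job_id for job_id, seen in self._last_seen.items()
            if now - seen > self._ttl_s
        )
        for job_id in expired:
            del self._last_seen[job_id]
        return expired
```

Whatever calls `reap` is then responsible for cancelling the returned jobs and deleting their stored results.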
Summary: Async Failure Classes
| Category | Root Cause |
|---|---|
| Duplication | Retries |
| Reordering | Decoupled execution |
| Invisibility | No direct completion path |
| Orphans | Detached lifecycles |
5. Side-by-Side Contrast (Mental Model)
| Dimension | Synchronous | Asynchronous |
|---|---|---|
| Time | Coupled | Decoupled |
| Failure style | Blocking, cascades | Duplication, disorder |
| Availability | All-or-nothing | Partial |
| Correctness risk | Timeouts, overload | Double writes, wrong order |
| Debugging | Easier | Harder |
6. Deep Insight (This Is the Interview Gold)
Synchronous systems fail loudly and immediately.
Asynchronous systems fail quietly and later.
- Sync failures are obvious (timeouts, errors)
- Async failures are subtle (double writes, wrong order)
7. Why Neither Is “Better”
From first principles:
- Sync systems protect causality but sacrifice availability
- Async systems protect availability but sacrifice causality
Real systems add mechanisms to reintroduce the lost property:
- Async systems add idempotency, ordering, state machines
- Sync systems add timeouts, circuit breakers, fallbacks
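As an example of the sync-side additions, a circuit breaker stops waiting on a dependency that keeps failing. The sketch below is a simplified, single-threaded version with an explicit `now` argument in place of a real clock. The class name, thresholds, and error type are all illustrative assumptions and do not mirror any particular library.

```python
from typing import Any, Callable, Optional


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int, reset_after_s: float) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], Any], now: float) -> Any:
        if self._opened_at is not None:
            if now - self._opened_at < self._reset_after_s:
                # Open: fail fast instead of blocking on a sick dependency.
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._threshold:
                self._opened_at = now  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

After `failure_threshold` consecutive failures the breaker rejects calls immediately until `reset_after_s` has passed, which turns blocking and cascading into a fast, explicit error that a fallback can handle.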
8. One-Line Rule to Remember
Sync breaks under load.
Async breaks under ambiguity.