TAGS: system-design, reliability-engineering, distributed-systems, devops
Last month, a major DeFi protocol lost $47 million because their circuit breaker did exactly what it was designed to do: it failed closed. When the price oracle lagged by 90 seconds, the system shut down entirely. Traders couldn't exit positions. Liquidations froze. By the time humans intervened, underwater positions had become unrecoverable.
The lesson? Sometimes the safest failure mode is not stopping everything.
What "Fail-Open" Actually Means
In security engineering, "fail-open" describes systems that permit operation when controls malfunction. This sounds dangerous—and it can be. But in trading systems, payment networks, and real-time data pipelines, the alternative often hurts worse.
Traditional reliability engineering optimizes for fail-closed behavior: if validation fails, reject the request. This works beautifully for authentication, authorization, and financial settlement. But it fails catastrophically in scenarios where time-sensitive decisions outweigh perfect accuracy.
Consider three domains where fail-open patterns shine:
| Domain | Fail-Closed Risk | Fail-Open Strategy |
|---|---|---|
| Trading | Missed liquidation windows, cascade failures | Degraded execution with position limits |
| Streaming data | Pipeline stalls, unbounded backpressure | Sampling + estimated values with confidence flags |
| ML inference | Request timeouts, queue explosions | Cached predictions with staleness metadata |
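The ML-inference row of the table can be sketched as a cached-prediction fallback that serves the last good value, flagged with staleness metadata, when the model call fails. This is an illustrative sketch, not any particular framework's API; `model_fn`, `Prediction`, and the TTL value are all assumptions.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Prediction:
    value: float
    computed_at: float   # unix timestamp of the underlying model call
    stale: bool = False  # True when served from cache past its TTL

class InferenceFallback:
    """Fail open on model errors: serve the cached prediction with a
    staleness flag instead of timing out the request."""

    def __init__(self, model_fn: Callable[[float], float], ttl_seconds: float = 2.0):
        self.model_fn = model_fn
        self.ttl = ttl_seconds
        self._cache: Optional[Prediction] = None

    def predict(self, features: float) -> Optional[Prediction]:
        try:
            value = self.model_fn(features)
            self._cache = Prediction(value=value, computed_at=time.time())
            return self._cache
        except Exception:
            if self._cache is None:
                # Nothing to fall back to: the caller must fail closed here
                return None
            age = time.time() - self._cache.computed_at
            # Fail open: last good value, marked stale once past the TTL
            return Prediction(self._cache.value, self._cache.computed_at,
                              stale=age > self.ttl)
```

The staleness flag is the point: downstream consumers see that they are operating on degraded data rather than silently trusting it.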
## The Architecture of Controlled Degradation
Effective fail-open systems don't simply "let everything through." They establish graduated degradation planes—predefined operational modes that sacrifice precision for continuity.
### 1. Confidence-Weighted Fallbacks
When your primary signal degrades, don't flip a binary switch to a backup. Instead, maintain a confidence score that modulates position sizing or decision latency:
```python
class DegradedExecutionEngine:
    def execute(self, signal: Signal) -> Order:
        confidence = self.assess_signal_quality(signal)
        if confidence < 0.3:
            # Fail open: execute minimal size with extended slippage
            return self._conservative_order(signal, max_notional=confidence * self.limit)
        if confidence < 0.7:
            # Partial degradation
            return self._throttled_order(signal, throttle_factor=confidence)
        return self._standard_order(signal)
```
The key insight: degradation should be continuous, not discrete. Binary failover creates cliff effects that trigger cascading failures.
### 2. Stale Data Contracts
Define explicit contracts for data freshness. Rather than rejecting stale prices, annotate them:
```typescript
interface MarketData {
  price: Decimal;
  timestamp: UnixTimestamp;
  stalenessTier: 'realtime' | 't+1s' | 't+5s' | 'stale';
  confidenceInterval?: [Decimal, Decimal];
}
```
Downstream systems then decide their own tolerance. A high-frequency market maker might reject t+5s data. A portfolio rebalancer might accept stale data with widened execution bands.
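A consumer-side tolerance check against that contract can be sketched in a few lines. The tier ordering mirrors the interface above; the band-widening values in basis points are illustrative assumptions, not a recommendation.

```python
# Staleness tiers, freshest first, mirroring the MarketData contract
TIER_ORDER = ["realtime", "t+1s", "t+5s", "stale"]

# Assumed band widening (basis points) a consumer applies per tier
WIDEN_BPS = {"realtime": 0, "t+1s": 5, "t+5s": 25, "stale": 100}

def execution_band(price: float, tier: str, max_tier: str):
    """Return a (low, high) execution band widened by staleness,
    or None when the data exceeds this consumer's tolerance."""
    if TIER_ORDER.index(tier) > TIER_ORDER.index(max_tier):
        return None  # too stale for this consumer: reject
    width = price * WIDEN_BPS[tier] / 10_000
    return (price - width, price + width)
```

The same quote yields different decisions per consumer: a market maker with `max_tier="t+1s"` rejects t+5s data outright, while a rebalancer with `max_tier="t+5s"` accepts it with a wider band.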
### 3. Circuit Breakers with Partial Closure
Traditional circuit breakers are binary: open or closed. Graduated breakers throttle rather than stop:
- Green: Normal operation
- Yellow: 50% sampling, async validation, expanded timeouts
- Red: Essential operations only, manual confirmation required
The transition between states should be hysteresis-based to prevent flapping, with clear observability into which degradation plane is active.
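A minimal state machine for such a breaker might look like the following. The error-rate thresholds are illustrative assumptions; the point is their asymmetry, since escalating requires a higher error rate than recovering, which is what provides the hysteresis.

```python
class GraduatedBreaker:
    """Green/yellow/red breaker with hysteresis: the threshold to
    escalate is higher than the threshold to recover, so a noisy
    error rate near a boundary cannot cause flapping."""

    # Illustrative error-rate thresholds (assumptions, tune per system)
    GREEN_TO_YELLOW = 0.05
    YELLOW_TO_GREEN = 0.02
    YELLOW_TO_RED = 0.20
    RED_TO_YELLOW = 0.10

    def __init__(self):
        self.state = "green"

    def observe(self, error_rate: float) -> str:
        """Feed one observation window's error rate; returns the new state."""
        if self.state == "green" and error_rate > self.GREEN_TO_YELLOW:
            self.state = "yellow"
        elif self.state == "yellow":
            if error_rate > self.YELLOW_TO_RED:
                self.state = "red"
            elif error_rate < self.YELLOW_TO_GREEN:
                self.state = "green"
        elif self.state == "red" and error_rate < self.RED_TO_YELLOW:
            self.state = "yellow"
        return self.state
```

Note that an error rate of 0.03 escalates nothing from green but also recovers nothing from yellow: that dead band between 0.02 and 0.05 is the hysteresis.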
## Implementation Patterns That Actually Work
### Shadow Validation
Run your expensive validation checks asynchronously. Permit the operation on the primary path, but queue validation for retroactive audit. If validation fails, you have a compensating transaction ready, not a blocked user.
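The pattern can be sketched with a background queue: the primary path never blocks, and a worker runs the audit retroactively. `validate_fn` and `compensate_fn` are hypothetical hooks standing in for your real checks and compensating transactions.

```python
import queue
import threading

class ShadowValidator:
    """Permit on the primary path; validate asynchronously and invoke a
    compensating action when the retroactive audit fails."""

    def __init__(self, validate_fn, compensate_fn):
        self.validate_fn = validate_fn
        self.compensate_fn = compensate_fn
        self._q: queue.Queue = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def permit(self, operation) -> bool:
        # Primary path: enqueue for audit and return immediately
        self._q.put(operation)
        return True

    def _drain(self):
        while True:
            op = self._q.get()
            if not self.validate_fn(op):
                # Audit failed after the fact: run the compensating transaction
                self.compensate_fn(op)
            self._q.task_done()

    def flush(self):
        """Block until all queued audits have completed (useful in tests)."""
        self._q.join()
```

The trade-off is explicit: you accept a short window where an invalid operation has already happened, in exchange for never stalling the user on the validator's latency.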
### Feature Flags as Safety Valves
Every critical path should have a kill switch that bypasses non-essential validation. These aren't "temporary hacks"—they're operational necessities. Document them, test them quarterly, and monitor their activation.
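In its simplest form, such a kill switch is a flag check that bypasses only the non-essential validation while the essential operation always runs. The flag name and the env-var-backed lookup here are illustrative assumptions; production systems usually read flags from a dedicated service.

```python
import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Env-var-backed feature flag (illustrative stand-in for a flag service)."""
    return os.environ.get(name, str(default)).lower() in ("1", "true", "yes")

def place_order(order, deep_validate, submit):
    # Kill switch bypasses only the expensive, non-essential check;
    # submission itself is essential and always runs.
    if not flag_enabled("SKIP_DEEP_VALIDATION"):
        deep_validate(order)
    return submit(order)
```

Monitoring the flag's activation matters as much as the flag itself: a kill switch that stays on silently is a degradation plane nobody knows they are in.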
### Explicit Assumption Tracking
When operating in degraded mode, log the assumptions you're violating:
```python
with DegradedContext(
    violated=["real_time_pricing", "full_orderbook_depth"],
    compensating_controls=["position_limit_10pct", "manual_review_queue"]
) as ctx:
    execute_trade(signal)
```
This creates audit trails and enables automatic recovery when assumptions are restored.
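A minimal version of such a context manager just needs to log the violated assumptions on entry and the duration of degraded operation on exit. The field names match the snippet above; the logger name and log format are assumptions.

```python
import logging
import time

class DegradedContext:
    """Sketch of an assumption-tracking context manager: records which
    invariants are being violated and which compensating controls apply,
    and logs how long the system ran in degraded mode."""

    def __init__(self, violated, compensating_controls):
        self.violated = violated
        self.compensating_controls = compensating_controls
        self.log = logging.getLogger("degraded")

    def __enter__(self):
        self.entered_at = time.monotonic()
        self.log.warning("entering degraded mode: violated=%s controls=%s",
                         self.violated, self.compensating_controls)
        return self

    def __exit__(self, exc_type, exc, tb):
        duration = time.monotonic() - self.entered_at
        self.log.warning("leaving degraded mode after %.1fs", duration)
        return False  # never swallow exceptions from the degraded block
```

A natural extension is to emit these entries as structured events, so recovery automation can watch for the moment all violated assumptions are restored.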
## When Not to Fail Open
Fail-open is not a universal virtue. Never apply it to:
- Authentication and authorization (fail closed, always)
- Cryptographic verification (fail closed)
- Settlement finality (fail closed with explicit human override)
The pattern applies specifically to decision-support systems where delayed truth is worse than approximate truth.
## Measuring Success
Traditional SLAs mislead here. Track instead:
- Time-to-degraded: How quickly you detect and enter a safe degradation plane
- Decision quality under degradation: Backtest your approximate decisions against optimal
- Recovery latency: Time to return to full precision without manual intervention
## The Hard Truth
Most production incidents aren't caused by systems failing. They're caused by systems failing in the wrong direction—shutting down when they should limp, or limping when they should shut down.
The discipline is in designing the limp: explicit, tested, observable degradation modes that preserve core function without creating hidden risk.
Your next architecture review should include one question: "If this component fails, what's our degradation plane?" If the answer is "it stops working," you may have already made the expensive choice.
*Building autonomous systems that know when to trust themselves—and when not to. Engineering at A3E Ecosystem.*