TAGS: system-design, distributed-systems, fintech, reliability-engineering
Most engineers have heard of "fail-safe" design. Circuit breakers trip. Transactions roll back. Systems degrade gracefully. But in high-frequency trading infrastructure, the opposite pattern—fail-open—is equally critical and far less understood. When your safety mechanisms themselves become the single point of failure, you need a different playbook.
I've spent the last three years architecting execution systems where a "safe" shutdown can cost more than a controlled continuation. Here's what actually works in production.
The Problem: Fail-Safe Isn't Always Safe
Consider a typical risk gateway: it validates orders against position limits, credit checks, and market conditions. The naïve implementation fails closed—any validation error, any timeout, any anomaly blocks the order. This feels correct until you hit edge cases:
- Latency spikes in validation services trigger cascading rejections during volatile markets when you most need liquidity
- Consensus failures in distributed limit checks deadlock legitimate trades
- Garbage collection pauses in your risk engine turn protective throttling into accidental denial-of-service
In 2021, a major European exchange saw exactly this: their market-wide circuit breaker, designed to halt trading on volatility spikes, triggered repeatedly due to a clock synchronization bug. Each "protective" halt lasted 15 minutes. The system was technically fail-safe. It was also unusable.
Fail-Open: Definition and When It Applies
Fail-open means your system continues operating with degraded guarantees rather than stopping entirely. This isn't recklessness—it's explicit trade-off engineering.
You should consider fail-open patterns when:
- The cost of false positives exceeds false negatives (blocking a valid trade vs. allowing a risky one)
- Validation dependencies are unreliable (external credit checks, cross-region consensus)
- Human operators can intervene faster than automated recovery (minutes, not milliseconds)
- Partial correctness is observable and correctable (you can detect and fix bad trades post-hoc)
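The first criterion above can be made concrete as an expected-cost comparison: fail-open is justified when the expected cost of blocking valid orders exceeds the expected cost of letting risky ones through. A back-of-envelope sketch (all probabilities and dollar figures below are hypothetical, not from the article):

```python
def fail_open_justified(p_false_positive: float, cost_block_valid: float,
                        p_false_negative: float, cost_allow_risky: float) -> bool:
    """Compare expected costs of the two failure directions."""
    return p_false_positive * cost_block_valid > p_false_negative * cost_allow_risky

# Hypothetical numbers: during a volatile session, blocking valid flow
# costs far more in expectation than the rare risky trade slipping through.
print(fail_open_justified(0.05, 250_000, 0.001, 1_000_000))  # True
```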
Production Pattern: Tiered Degradation
Don't implement binary fail-open. Build explicit degradation tiers:
Tier 1: Full Validation (normal operation)
All checks execute: pre-trade risk, real-time position limits, counterparty credit, regulatory constraints.
Tier 2: Cached Validation (degraded)
When real-time checks lag, fall back to cached snapshots with bounded staleness. Log everything. Alert immediately. Continue trading with known, quantified risk exposure.
Tier 3: Notional Limits Only (emergency)
If even cached data is unavailable, enforce only hard notional limits (e.g., $10M max per symbol). This prevents catastrophic errors while preserving core functionality.
Tier 4: Manual Override (last resort)
Require human authorization, but don't block indefinitely. Queue orders with explicit "degraded mode" flags for post-trade reconciliation.
Each tier has explicit invariants: what assumptions are violated, what risks are accepted, what compensating controls apply.
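One way to make those invariants explicit is to encode each tier as configuration rather than scattered conditionals. A minimal sketch, with illustrative names and limits (not the production schema):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierConfig:
    name: str
    checks: tuple          # which validations run at this tier
    max_staleness_ms: int  # bound on cached data age (0 = real-time, -1 = n/a)
    max_notional: float    # hard cap enforced regardless of other checks
    requires_human: bool   # Tier 4: queue for manual authorization

TIERS = (
    TierConfig("full_validation",   ("risk", "limits", "credit", "regulatory"), 0,   10_000_000, False),
    TierConfig("cached_validation", ("risk", "limits"),                         500, 10_000_000, False),
    TierConfig("notional_only",     (),                                         -1,  10_000_000, False),
    TierConfig("manual_override",   (),                                         -1,  10_000_000, True),
)

# Each step down drops checks but never loosens the hard notional cap.
print([t.name for t in TIERS])
```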
Implementation: The Degradation Controller
Here's the architectural pattern we use at A3E:
```python
class DegradationController:
    def __init__(self):
        self.tiers = [
            FullValidationTier(),
            CachedValidationTier(max_staleness_ms=500),
            NotionalLimitTier(max_notional=10_000_000),
            ManualOverrideTier(),
        ]
        self.current_tier = 0
        self.health_checker = HealthChecker()

    def validate(self, order: Order) -> ValidationResult:
        # Attempt validation at the current tier
        tier = self.tiers[self.current_tier]
        result = tier.validate(order)
        if result.success:
            return result

        # Tier failure: can we degrade?
        if self.can_degrade(result.failure_reason):
            self.degrade(result.failure_reason)
            return self.validate(order)  # Retry at lower tier

        # Cannot degrade further: explicit rejection with full context
        return ValidationResult.rejected(
            reason="validation_unavailable",
            attempted_tiers=self.tiers[:self.current_tier + 1],
            suggested_action="manual_review",
        )

    def can_degrade(self, reason: FailureReason) -> bool:
        # Business logic: which failures permit degradation?
        return (reason in DEGRADABLE_FAILURES
                and self.current_tier < len(self.tiers) - 1)
```
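The tier classes, `FailureReason`, `DEGRADABLE_FAILURES`, and the `degrade` method are elided above. A self-contained sketch of the same control flow, using stub tiers and hypothetical names so it actually runs:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ValidationResult:
    success: bool
    failure_reason: Optional[str] = None

class StubTier:
    """Hypothetical stand-in for a real validation tier."""
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def validate(self, order: dict) -> ValidationResult:
        if self.healthy:
            return ValidationResult(success=True)
        return ValidationResult(success=False, failure_reason="timeout")

DEGRADABLE_FAILURES = {"timeout", "stale_cache"}

class DegradationController:
    def __init__(self, tiers):
        self.tiers = tiers
        self.current_tier = 0

    def validate(self, order: dict) -> ValidationResult:
        result = self.tiers[self.current_tier].validate(order)
        if result.success:
            return result
        if (result.failure_reason in DEGRADABLE_FAILURES
                and self.current_tier < len(self.tiers) - 1):
            self.current_tier += 1  # in production: logged, measured, alerted
            return self.validate(order)  # retry at lower tier
        return ValidationResult(success=False, failure_reason="validation_unavailable")

# Full validation is down; the controller degrades to the cached tier and continues.
controller = DegradationController([
    StubTier("full", healthy=False),
    StubTier("cached", healthy=True),
])
result = controller.validate({"symbol": "AAPL", "qty": 100})
print(result.success, controller.current_tier)  # True 1
```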
Key insight: degradation is a first-class operation, not an exception handler. It's logged, measured, and alerted. Operators know immediately when the system is running hot.
The Hard Part: Post-Trade Reconciliation
Fail-open without detection is just failure. You need compensating controls:
- Shadow validation: Run full checks asynchronously, flag discrepancies
- Kill switches: Human-operated circuit breakers with clear, fast paths
- Reversible settlement: Design for trade unwinding when detection lags execution
We maintain a 30-second "reconciliation window" where trades can be marked for review before entering clearing. This sounds short—it's enough for automated detection, and human escalation has a parallel path that can freeze settlement chains.
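The shadow-validation control described above can be sketched as a single check that re-runs full validation and flags disagreements inside the reconciliation window. Everything here (the `Trade` shape, the status strings) is illustrative, not the production interface:

```python
import time
from dataclasses import dataclass

RECONCILIATION_WINDOW_S = 30  # trades can be flagged before entering clearing

@dataclass
class Trade:
    trade_id: str
    executed_at: float
    fast_path_approved: bool

def shadow_validate(trade: Trade, full_check) -> str:
    """Re-run full validation asynchronously; flag any discrepancy."""
    if time.time() - trade.executed_at > RECONCILIATION_WINDOW_S:
        return "window_closed"      # too late: escalate via settlement-freeze path
    if full_check(trade) == trade.fast_path_approved:
        return "confirmed"
    return "flagged_for_review"     # discrepancy: mark before clearing

# A trade the degraded fast path approved but full validation rejects:
trade = Trade("T-1", executed_at=time.time(), fast_path_approved=True)
print(shadow_validate(trade, full_check=lambda t: False))  # flagged_for_review
```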
Anti-Patterns to Avoid
Silent degradation: If you're running on cached data, every downstream system must know. We inject tier headers into all internal messages. Degradation that nobody downstream can see becomes invisible risk, and eventually a surprise.
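Injecting tier headers might look like the following, assuming a dict-based message envelope (the header names are hypothetical):

```python
def stamp_tier(message: dict, tier_name: str, tier_index: int) -> dict:
    """Attach degradation metadata so every downstream system can react."""
    return {
        **message,
        "x-validation-tier": tier_name,
        "x-tier-index": tier_index,
        "x-degraded": tier_index > 0,  # anything below full validation
    }

msg = stamp_tier({"order_id": "O-42", "side": "buy"}, "cached_validation", 1)
print(msg["x-degraded"], msg["x-validation-tier"])  # True cached_validation
```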
Automatic recovery: Don't climb back up tiers without explicit validation. Recovery requires proving the system is healthy, not assuming it. We use "health proofs": successful full validations on synthetic orders before restoring normal operation.
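A health proof can be as simple as requiring N consecutive successful synthetic validations before any tier promotion; one failure resets the streak. A minimal sketch (class and parameter names are illustrative):

```python
class HealthProof:
    """Permit one tier promotion only after N consecutive synthetic passes."""
    def __init__(self, required_passes: int = 10):
        self.required = required_passes
        self.streak = 0

    def record(self, synthetic_ok: bool) -> bool:
        """Returns True when recovery is proven; any failure resets the streak."""
        self.streak = self.streak + 1 if synthetic_ok else 0
        if self.streak >= self.required:
            self.streak = 0  # consume the proof; the next promotion starts fresh
            return True
        return False

proof = HealthProof(required_passes=3)
results = [proof.record(ok) for ok in (True, True, False, True, True, True)]
print(results)  # [False, False, False, False, False, True]
```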
Uniform tiering: Different instruments need different rules. Cryptocurrency perpetuals degrade differently than listed equities. Build tier configurations per asset class, not global defaults.
Measuring Success
Track these metrics:
- Degradation frequency: How often you leave Tier 1
- Time-to-recovery: Mean and tail latency for returning to full validation
- False negative rate: Bad trades that slipped through (from shadow validation)
- Operator intervention rate: How often humans needed to use kill switches
If degradation is rare but recovery is slow, your health detection is broken. If degradation is frequent, your dependencies are unreliable—fix upstream, don't mask with fail-open.
Conclusion
Fail-open isn't a license to ignore risk. It's explicit risk acceptance with full instrumentation. The goal isn't to prevent all failures—it's to ensure that when protection systems fail, they fail in a direction you chose, with controls you designed, and visibility you built.
In trading systems, the worst failure mode is the one you didn't know you had. Fail-open, done right, makes every degradation visible and every risk quantified.
Engineering lead at A3E Ecosystem, building autonomous infrastructure for algorithmic trading and digital asset markets.