Designing Solana Programs for Safe Failure: Circuit Breakers, Rate Limits, and the Architecture That Could Have Saved Step Finance $40M

#security #solana #defi #blockchain

The $40 Million Question Nobody Asked

On January 31, 2026, an attacker compromised executive devices at Step Finance and drained 261,854 SOL from multisig wallets. Within weeks, one of Solana's oldest DeFi platforms was dead — along with SolanaFloor and Remora Markets.

The post-mortems focused on the obvious: phishing, key hygiene, multisig configuration. All valid. But they missed the deeper architectural question:

Why could a single compromised key drain the entire treasury in one transaction?

The answer reveals a design philosophy that pervades most Solana programs today: optimistic architecture. We build for the happy path and bolt security on as an afterthought. This article proposes the opposite — pessimistic architecture, where every program assumes it will be compromised and limits the blast radius accordingly.

The Three Pillars of Safe Failure

1. Temporal Rate Limiting (The Velocity Check)

Most Solana programs process withdrawals atomically — request and execute in one transaction. This is a feature for users and a gift for attackers.

// ❌ Dangerous: Unlimited withdrawal in single tx
pub fn withdraw(ctx: Context<Withdraw>, amount: u64) -> Result<()> {
    // Transfer full amount immediately
    transfer_tokens(&ctx.accounts.vault, &ctx.accounts.destination, amount)?;
    Ok(())
}

// ✅ Safer: Velocity-limited withdrawal
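pub fn withdraw(ctx: Context<Withdraw>, amount: u64) -> Result<()> {
    let vault = &mut ctx.accounts.vault_state;
    let clock = Clock::get()?;
    // "Epoch" here is a fixed time bucket of EPOCH_DURATION seconds
    // (e.g., 86_400 for one day), not a Solana epoch
    let current_epoch = clock.unix_timestamp / EPOCH_DURATION;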

    // Reset counter if new epoch
    if vault.last_withdrawal_epoch != current_epoch {
        vault.epoch_withdrawn = 0;
        vault.last_withdrawal_epoch = current_epoch;
    }

    // Enforce per-epoch limit
    let new_total = vault.epoch_withdrawn
        .checked_add(amount)
        .ok_or(ErrorCode::Overflow)?;
    require!(
        new_total <= vault.max_epoch_withdrawal,
        ErrorCode::VelocityLimitExceeded
    );

    vault.epoch_withdrawn = new_total;
    transfer_tokens(&ctx.accounts.vault, &ctx.accounts.destination, amount)?;
    Ok(())
}

The Step Finance application: If their treasury had enforced a 50,000 SOL per-epoch withdrawal limit, the attacker could have taken at most 50,000 SOL in the first epoch (about 19% of the 261,854 SOL actually lost), and the team would have had the rest of that epoch to notice and respond. That's the difference between a painful incident and an extinction event.
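To make the window arithmetic concrete, here is a plain-Rust model of the same check with no Anchor or Solana dependencies. The `VaultState` fields mirror the listing above; `LimitError`, `try_withdraw`, and the one-day `EPOCH_DURATION` are assumptions made for illustration, not part of any real program.

```rust
/// Errors for this illustrative model (hypothetical names).
#[derive(Debug, PartialEq)]
pub enum LimitError {
    Overflow,
    VelocityLimitExceeded,
}

/// Assumed bucket length: one day, in seconds.
pub const EPOCH_DURATION: i64 = 86_400;

pub struct VaultState {
    pub last_withdrawal_epoch: i64,
    pub epoch_withdrawn: u64,
    pub max_epoch_withdrawal: u64,
}

impl VaultState {
    pub fn new(max_epoch_withdrawal: u64) -> Self {
        VaultState {
            last_withdrawal_epoch: 0,
            epoch_withdrawn: 0,
            max_epoch_withdrawal,
        }
    }

    /// Records a withdrawal at `unix_timestamp`, or rejects it without
    /// adding anything to the running total.
    pub fn try_withdraw(&mut self, unix_timestamp: i64, amount: u64) -> Result<(), LimitError> {
        let current_epoch = unix_timestamp / EPOCH_DURATION;

        // A new bucket starts with an empty counter.
        if self.last_withdrawal_epoch != current_epoch {
            self.epoch_withdrawn = 0;
            self.last_withdrawal_epoch = current_epoch;
        }

        let new_total = self
            .epoch_withdrawn
            .checked_add(amount)
            .ok_or(LimitError::Overflow)?;
        if new_total > self.max_epoch_withdrawal {
            return Err(LimitError::VelocityLimitExceeded);
        }

        self.epoch_withdrawn = new_total;
        Ok(())
    }
}
```

One design consequence of fixed buckets: a withdrawal in the last second of one bucket and another in the first second of the next both succeed, so up to twice the cap can leave within moments of a boundary. If that matters for your protocol, a sliding window or a cap set at half the tolerable loss are possible mitigations.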

2. Tiered Authorization (The Blast Radius Limiter)

Most Solana multisigs treat all operations equally: reach the M-of-N threshold and any action is allowed. But not all operations carry equal risk:

Operation Risk Tiers:
┌─────────────────────────────────────┐
│ TIER 3: Emergency (1-of-N)          │  ← Pause, freeze
│ TIER 2: Routine (M-of-N)            │  ← Normal ops, small transfers
│ TIER 1: Critical (M+1-of-N + delay) │  ← Large transfers, upgrades
│ TIER 0: Existential (N-of-N + 72h)  │  ← Authority transfer, shutdown
└─────────────────────────────────────┘

pub fn execute_action(ctx: Context<Execute>, action: Action) -> Result<()> {
    let multisig = &ctx.accounts.multisig;
    let signatures = count_valid_signatures(&ctx.accounts.signatures)?;

    match action.tier() {
        Tier::Emergency => {
            require!(signatures >= 1, ErrorCode::InsufficientSignatures);
        }
        Tier::Routine => {
            require!(signatures >= multisig.threshold, ErrorCode::InsufficientSignatures);
        }
        Tier::Critical => {
            require!(
                signatures >= multisig.threshold + 1,
                ErrorCode::InsufficientSignatures
            );
            require!(
                action.queued_at + CRITICAL_DELAY <= Clock::get()?.unix_timestamp,
                ErrorCode::TimelockActive
            );
        }
        Tier::Existential => {
            require!(
                signatures == multisig.total_signers,
                ErrorCode::RequiresAllSigners
            );
            require!(
                action.queued_at + EXISTENTIAL_DELAY <= Clock::get()?.unix_timestamp,
                ErrorCode::TimelockActive
            );
        }
    }
    Ok(())
}

Why this matters: Even if an attacker compromises enough keys to meet the standard threshold, they still can't execute critical operations without waiting through timelocks — giving defenders time to react.
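The tier table can also be expressed as data, which makes the rules easy to unit-test off-chain. The following is a minimal plain-Rust sketch; `Policy`, `policy_for`, and `is_authorized` are hypothetical names, and the 24-hour `CRITICAL_DELAY` is an assumed value (only the 72-hour existential delay comes from the table above).

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Tier {
    Emergency,
    Routine,
    Critical,
    Existential,
}

/// Assumed value for illustration; the article does not specify it.
pub const CRITICAL_DELAY: i64 = 24 * 3_600;
/// 72 hours, as in the tier table.
pub const EXISTENTIAL_DELAY: i64 = 72 * 3_600;

#[derive(Debug, PartialEq)]
pub struct Policy {
    pub required_signatures: u8,
    pub delay_secs: i64,
}

/// Maps a tier to its signature count and timelock for an M-of-N multisig.
pub fn policy_for(tier: Tier, threshold: u8, total_signers: u8) -> Policy {
    match tier {
        Tier::Emergency => Policy { required_signatures: 1, delay_secs: 0 },
        Tier::Routine => Policy { required_signatures: threshold, delay_secs: 0 },
        // M+1, capped at N so that an N-of-N multisig is not locked out.
        Tier::Critical => Policy {
            required_signatures: threshold.saturating_add(1).min(total_signers),
            delay_secs: CRITICAL_DELAY,
        },
        Tier::Existential => Policy {
            required_signatures: total_signers,
            delay_secs: EXISTENTIAL_DELAY,
        },
    }
}

/// True when enough signatures are present and the timelock has elapsed.
pub fn is_authorized(
    tier: Tier,
    threshold: u8,
    total_signers: u8,
    signatures: u8,
    queued_at: i64,
    now: i64,
) -> bool {
    let policy = policy_for(tier, threshold, total_signers);
    signatures >= policy.required_signatures
        && now >= queued_at.saturating_add(policy.delay_secs)
}
```

For a 3-of-5 multisig this yields 1, 3, 4, and 5 required signatures for the four tiers. The cap on the critical tier is a choice made in this sketch: without it, a multisig whose threshold already equals N could never satisfy M+1.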

3. Programmatic Circuit Breakers (The Dead Man's Switch)

The most underused pattern in Solana security: automatic pause mechanisms that trigger on anomalous behavior.

#[account]
pub struct CircuitBreaker {
    pub is_paused: bool,
    pub total_outflow_current_window: u64,
    pub window_start: i64,
    pub max_window_outflow: u64,       // e.g., 10% of TVL
    pub max_single_transaction: u64,    // e.g., 2% of TVL
    pub consecutive_large_txs: u8,
    pub pause_authority: Pubkey,
    pub auto_resume_after: i64,
}

// Returns Ok(true) if the outflow may proceed, Ok(false) if the breaker tripped.
// A trip must not return Err: a failed Solana transaction rolls back every
// account write, so `is_paused = true` would never be persisted. The caller
// skips the transfer on Ok(false) and lets the transaction succeed.
pub fn process_outflow(breaker: &mut CircuitBreaker, amount: u64, tvl: u64) -> Result<bool> {
    require!(!breaker.is_paused, ErrorCode::CircuitBreakerTripped);

    let clock = Clock::get()?;

    // Reset window if expired
    if clock.unix_timestamp > breaker.window_start + WINDOW_DURATION {
        breaker.total_outflow_current_window = 0;
        breaker.window_start = clock.unix_timestamp;
        breaker.consecutive_large_txs = 0;
    }

    // Percent-of-TVL thresholds, computed in u128 so the multiplication cannot overflow
    let pct_of_tvl = |pct: u128| (tvl as u128 * pct / 100) as u64;

    // Track the streak of large transactions (> 2% of TVL); a small one resets it
    if amount > pct_of_tvl(2) {
        breaker.consecutive_large_txs = breaker.consecutive_large_txs.saturating_add(1);
    } else {
        breaker.consecutive_large_txs = 0;
    }

    let new_total = breaker
        .total_outflow_current_window
        .checked_add(amount)
        .ok_or(ErrorCode::Overflow)?;

    let trip_reason = if amount > pct_of_tvl(5) {
        // Auto-pause: single tx exceeds 5% of TVL
        Some("single_tx_limit")
    } else if new_total > breaker.max_window_outflow {
        // Auto-pause: window outflow exceeds the configured cap (e.g., 10% of TVL)
        Some("window_outflow_limit")
    } else if breaker.consecutive_large_txs >= 3 {
        // Auto-pause: 3 consecutive large transactions
        Some("consecutive_large_txs")
    } else {
        None
    };

    if let Some(reason) = trip_reason {
        breaker.is_paused = true;
        emit!(CircuitBreakerTripped { reason });
        return Ok(false);
    }

    breaker.total_outflow_current_window = new_total;
    Ok(true)
}
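To see which rule fires for a given sequence of outflows, the breaker logic can be modelled off-chain in plain Rust. This is a sketch under stated assumptions: `Breaker`, `Outcome`, the basis-point constants, and the one-hour `WINDOW_DURATION` are hypothetical; the window cap is derived from TVL here rather than read from a stored field; and a small transaction resets the large-transaction streak, which is one reading of "consecutive". The model reports a trip as a value rather than an error so the resulting state stays observable.

```rust
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Allowed,
    Tripped(&'static str),
    AlreadyPaused,
}

pub const WINDOW_DURATION: i64 = 3_600; // assumed: one hour, in seconds
pub const LARGE_TX_BPS: u128 = 200; // 2% of TVL counts as "large"
pub const MAX_TX_BPS: u128 = 500; // 5% of TVL trips immediately
pub const MAX_WINDOW_BPS: u128 = 1_000; // 10% of TVL per window

/// `bps` basis points of `tvl`, computed in u128 to avoid overflow.
fn bps_of(tvl: u64, bps: u128) -> u64 {
    (tvl as u128 * bps / 10_000) as u64
}

pub struct Breaker {
    pub is_paused: bool,
    pub window_outflow: u64,
    pub window_start: i64,
    pub consecutive_large_txs: u8,
}

impl Breaker {
    pub fn new(now: i64) -> Self {
        Breaker { is_paused: false, window_outflow: 0, window_start: now, consecutive_large_txs: 0 }
    }

    pub fn process_outflow(&mut self, now: i64, amount: u64, tvl: u64) -> Outcome {
        if self.is_paused {
            return Outcome::AlreadyPaused;
        }
        // Start a fresh window once the current one has expired.
        if now > self.window_start + WINDOW_DURATION {
            self.window_outflow = 0;
            self.window_start = now;
            self.consecutive_large_txs = 0;
        }
        // Extend the streak on a large tx, reset it on a small one.
        if amount > bps_of(tvl, LARGE_TX_BPS) {
            self.consecutive_large_txs = self.consecutive_large_txs.saturating_add(1);
        } else {
            self.consecutive_large_txs = 0;
        }
        let new_total = self.window_outflow.saturating_add(amount);
        let reason = if amount > bps_of(tvl, MAX_TX_BPS) {
            Some("single_tx_limit")
        } else if new_total > bps_of(tvl, MAX_WINDOW_BPS) {
            Some("window_outflow_limit")
        } else if self.consecutive_large_txs >= 3 {
            Some("consecutive_large_txs")
        } else {
            None
        };
        match reason {
            Some(r) => {
                self.is_paused = true;
                Outcome::Tripped(r)
            }
            None => {
                self.window_outflow = new_total;
                Outcome::Allowed
            }
        }
    }
}
```

With a TVL of 1,000,000 units, outflows of 10,000, 30,000, 30,000, and 30,000 pass three times and then trip on the third consecutive large transaction, while a single 60,000 outflow trips immediately. Running sequences like these against your own thresholds is a cheap way to estimate false-positive rates before deploying.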

Putting It Together: The Defense-in-Depth Stack

Here's how these patterns compose into a layered defense:

Attack Timeline with Safe Failure Architecture:
─────────────────────────────────────────────────────

T+0min:   Attacker compromises key
T+1min:   Attempts large withdrawal
          → Circuit breaker: "5% TVL single-tx limit exceeded"
          → Withdrawal BLOCKED, protocol auto-paused
          → Alert fired to team

T+2min:   Attacker tries smaller withdrawals
          → Protocol is paused: withdrawals rejected
          → Even without the pause, the rate limiter caps outflow at the
            epoch limit and 3 large txs trip the circuit breaker

T+5min:   Attacker tries to change authority
          → Tiered auth: requires ALL signers + 72h timelock
          → Transaction REJECTED

T+15min:  Team receives alerts, initiates emergency pause
          → Emergency tier: single signer can pause
          → All operations frozen

Result:   ~2% of TVL lost instead of 100%

Compare this to what actually happened at Step Finance:

T+0min:   Attacker compromises executive device
T+1min:   Unstakes 261,854 SOL
T+2min:   Transfers everything out
T+3min:   Gone

Practical Implementation Checklist

For teams building on Solana today:

Immediate (This Sprint):

[ ] Add per-epoch withdrawal caps to treasury programs
[ ] Implement emergency pause with single-signer authority
[ ] Set up automated monitoring and alerts for large transfers
[ ] Define what "anomalous" means for your protocol (% of TVL thresholds)

Short-Term (This Quarter):

[ ] Deploy tiered authorization with timelocks for critical operations
[ ] Implement programmatic circuit breakers with auto-pause
[ ] Create a runbook: "Key compromised — now what?" with specific steps
[ ] Run a tabletop exercise: simulate a key compromise with your team

Ongoing:

[ ] Review and adjust rate limits as TVL changes
[ ] Monitor false-positive rates on circuit breakers
[ ] Rotate keys on a schedule, not just after incidents
[ ] Consider time-locked recovery mechanisms (social recovery, guardian keys)
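
Two of the items above, defining thresholds as a percentage of TVL and revisiting them as TVL changes, lend themselves to a small helper. The following plain-Rust sketch is illustrative only: `LimitConfig`, `Limits`, `limits_for`, and `limits_are_stale` are hypothetical names, and the basis-point values you would pass in are choices for each protocol to make.

```rust
/// Thresholds expressed in basis points of TVL (1 bps = 0.01%).
pub struct LimitConfig {
    pub large_tx_bps: u32,
    pub max_tx_bps: u32,
    pub max_window_bps: u32,
}

/// Absolute limits derived from a TVL snapshot.
#[derive(Debug, PartialEq)]
pub struct Limits {
    pub large_tx: u64,
    pub max_tx: u64,
    pub max_window: u64,
}

/// `bps` basis points of `tvl`, computed in u128 and saturating at u64::MAX.
fn bps_of(tvl: u64, bps: u32) -> u64 {
    let value = tvl as u128 * bps as u128 / 10_000;
    u64::try_from(value).unwrap_or(u64::MAX)
}

impl LimitConfig {
    pub fn limits_for(&self, tvl: u64) -> Limits {
        Limits {
            large_tx: bps_of(tvl, self.large_tx_bps),
            max_tx: bps_of(tvl, self.max_tx_bps),
            max_window: bps_of(tvl, self.max_window_bps),
        }
    }
}

/// True when TVL has moved more than `tolerance_bps` away from the TVL
/// the stored limits were last computed from.
pub fn limits_are_stale(reference_tvl: u64, current_tvl: u64, tolerance_bps: u32) -> bool {
    reference_tvl.abs_diff(current_tvl) > bps_of(reference_tvl, tolerance_bps)
}
```

The staleness check matters most when TVL falls: a window cap of 100,000 units is 10% of a 1,000,000 TVL but roughly 14% once TVL drops to 700,000, so an absolute cap quietly becomes looser than intended.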

The Uncomfortable Truth

No amount of key hygiene eliminates compromise risk. Phishing gets more sophisticated. Supply chain attacks hit trusted tools. Insiders go rogue. The question isn't if a key gets compromised — it's when.

The protocols that survive 2026 and beyond won't be the ones that never get breached. They'll be the ones where a breach costs 2% of TVL instead of 100%.

Design for the breach. Build for survival.

This article is part of the DeFi Security Research series. Previous entries cover vulnerability analysis, audit tooling, and security best practices across Solana and EVM ecosystems.