Dante

Posted on Mar 18 • Edited on Mar 22

Building Resilient Systems: Implementing Stateful Failover Between Multiple External Providers

System reliability is a business imperative, not a nice-to-have. When your application relies on external providers for critical functionality, each of those dependencies is a potential point of failure. What happens when your payment processor goes down, or when your authentication service has an outage?

This is where stateful failover between multiple external providers comes into play. In this article, I'll break down the components of a robust traffic management system with failover capabilities, explain implementation approaches, and share practical code examples to help you build more resilient systems.

Understanding the Problem

Let's start with a common scenario: You've built an e-commerce platform that relies on multiple external payment processors. Each processor has its own API, pricing structure, and reliability characteristics. If your primary payment processor experiences downtime, you need to seamlessly transition to a backup without:

Disrupting the user experience
Losing transaction context
Creating duplicate charges
Causing data inconsistencies

A naive solution might just retry failed requests with a different provider, but this approach can lead to serious issues. What if the payment was processed but the confirmation was lost due to a network issue? You could end up charging customers twice!
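
One common guard against this double-charge case is to give each logical payment an idempotency key and to confirm the outcome of a failed attempt before trying another provider. The sketch below is a minimal in-memory illustration under stated assumptions: `PaymentLedger`, `chargeOnce`, and the `charge(key, payment)` and `lookup(key)` methods on each provider are hypothetical names, not any real processor's API.

```javascript
// Minimal sketch: one idempotency key per logical payment.
// PaymentLedger, chargeOnce, provider.charge and provider.lookup are
// hypothetical names used for illustration only.
class PaymentLedger {
  constructor() {
    this.records = new Map(); // idempotencyKey -> { status, provider, result }
  }

  get(key) {
    return this.records.get(key);
  }

  set(key, record) {
    this.records.set(key, record);
  }
}

async function chargeOnce(ledger, providers, key, payment) {
  const existing = ledger.get(key);
  if (existing && existing.status === 'succeeded') {
    return existing.result; // already charged: return the stored result
  }

  for (const provider of providers) {
    ledger.set(key, { status: 'pending', provider: provider.name });
    try {
      const result = await provider.charge(key, payment);
      ledger.set(key, { status: 'succeeded', provider: provider.name, result });
      return result;
    } catch (error) {
      // The failure is ambiguous: the charge may have gone through and only
      // the confirmation was lost. Ask the provider before moving on.
      let found;
      try {
        found = await provider.lookup(key); // null means "no such charge"
      } catch (lookupError) {
        ledger.set(key, { status: 'unknown', provider: provider.name });
        throw new Error(`Outcome unknown for ${key}; reconcile before retrying`);
      }
      if (found) {
        ledger.set(key, { status: 'succeeded', provider: provider.name, result: found });
        return found;
      }
      // Confirmed not charged: it is safe to try the next provider
    }
  }

  ledger.set(key, { status: 'failed' });
  throw new Error('All providers failed');
}
```

The key design choice is that an unconfirmed outcome stops the failover instead of continuing to the next provider, trading availability for the guarantee that a customer is not charged twice.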

Key Components of a Stateful Failover System

1. Load Balancer

The front door of your system, responsible for distributing traffic across providers based on:

Health status
Capacity
Performance characteristics
Business rules (cost, features, etc.)
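
As a rough illustration of how these inputs might combine, the sketch below picks a provider at random in proportion to a weight, considering only healthy providers. The `healthy` and `weight` fields and the `pickProvider` name are assumptions made for this example; the weight stands in for whatever mix of capacity, performance, and business rules you choose.

```javascript
// Sketch of weighted selection among healthy providers.
// `healthy` and `weight` are assumed fields; `random` is injectable so the
// choice can be made deterministic in tests.
function pickProvider(providers, random = Math.random) {
  const candidates = providers.filter(p => p.healthy && p.weight > 0);
  if (candidates.length === 0) {
    throw new Error('No healthy providers available');
  }

  const total = candidates.reduce((sum, p) => sum + p.weight, 0);
  let threshold = random() * total;

  // Walk the candidates, subtracting each weight until the threshold is used up
  for (const provider of candidates) {
    threshold -= provider.weight;
    if (threshold < 0) {
      return provider;
    }
  }

  return candidates[candidates.length - 1]; // guards against rounding
}

// Usage: roughly 90% of calls go to "primary" while both are healthy
const chosen = pickProvider([
  { name: 'primary', healthy: true, weight: 90 },
  { name: 'backup', healthy: true, weight: 10 }
]);
console.log(chosen.name);
```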

2. Health Checks

Continuous monitoring of provider health through:

Active probes (pinging APIs directly)
Passive monitoring (tracking error rates)
Synthetic transactions (simulating real user flows)
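
Passive monitoring can be as simple as keeping a sliding window of recent request outcomes per provider. The following is a minimal sketch, not a production monitor: the `HealthWindow` class, its option names, and its default values are all assumptions chosen for illustration, and it expects outcomes to be recorded in time order.

```javascript
// Sketch of passive health monitoring: a sliding window of recent outcomes.
// HealthWindow and its defaults are illustrative assumptions.
class HealthWindow {
  constructor({ windowMs = 60000, maxErrorRate = 0.5, minSamples = 5 } = {}) {
    this.windowMs = windowMs;
    this.maxErrorRate = maxErrorRate;
    this.minSamples = minSamples;
    this.samples = []; // { time, ok }, oldest first
  }

  // Record the outcome of one real request
  record(ok, now = Date.now()) {
    this.samples.push({ time: now, ok });
    this.prune(now);
  }

  // Drop samples that have fallen out of the window
  prune(now) {
    const cutoff = now - this.windowMs;
    while (this.samples.length > 0 && this.samples[0].time < cutoff) {
      this.samples.shift();
    }
  }

  errorRate(now = Date.now()) {
    this.prune(now);
    if (this.samples.length === 0) {
      return 0;
    }
    const failures = this.samples.filter(s => !s.ok).length;
    return failures / this.samples.length;
  }

  isHealthy(now = Date.now()) {
    this.prune(now);
    // With too few samples to judge, assume healthy rather than flap
    if (this.samples.length < this.minSamples) {
      return true;
    }
    return this.errorRate(now) <= this.maxErrorRate;
  }
}
```

Calling `record(true)` or `record(false)` after each provider call keeps the window current, and `isHealthy()` can then feed routing decisions alongside active probes.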

3. State Management

Tracking the context needed to maintain session continuity:

User session data
Transaction state
Provider-specific tokens and identifiers
Historical interaction data
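
One way to keep this context independent of any single provider is to store each transaction as a small record with an explicit status and a map of provider-specific references. The sketch below assumes an in-memory store and a simple set of statuses; `TransactionStore`, `TRANSITIONS`, and the field names are hypothetical and would need to match your own data model.

```javascript
// Sketch of provider-independent transaction state.
// TransactionStore, TRANSITIONS and all field names are illustrative.
const TRANSITIONS = {
  created: ['pending'],
  pending: ['succeeded', 'failed', 'unknown'],
  unknown: ['succeeded', 'failed'], // resolved by reconciliation
  failed: ['pending'], // retry, possibly with another provider
  succeeded: []
};

class TransactionStore {
  constructor() {
    this.transactions = new Map();
  }

  create(id, sessionId) {
    const tx = {
      id,
      sessionId,
      status: 'created',
      provider: null,
      providerRefs: {}, // provider name -> that provider's own identifier
      history: []
    };
    this.transactions.set(id, tx);
    return tx;
  }

  transition(id, status, { provider, providerRef } = {}) {
    const tx = this.transactions.get(id);
    if (!tx) {
      throw new Error(`Unknown transaction ${id}`);
    }
    if (!TRANSITIONS[tx.status].includes(status)) {
      throw new Error(`Illegal transition ${tx.status} -> ${status}`);
    }
    if (provider) {
      tx.provider = provider;
    }
    if (providerRef) {
      tx.providerRefs[tx.provider] = providerRef;
    }
    tx.history.push({ from: tx.status, to: status, provider: tx.provider });
    tx.status = status;
    return tx;
  }
}
```

Rejecting illegal transitions, such as moving a succeeded transaction back to pending, is what prevents a failover path from re-submitting a payment that has already completed.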

4. Failover Logic

Rules governing when and how to switch providers:

Error thresholds
Performance degradation patterns
Circuit breaker patterns
Gradual traffic shifting
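
Gradual traffic shifting is often implemented as a ramp: after a provider recovers, its share of traffic grows over time instead of jumping straight back to full. A minimal sketch of a linear ramp follows; the function name `rampWeight` and its parameters are assumptions for illustration, and the resulting weight would feed whatever routing logic you already use.

```javascript
// Sketch: linear ramp of traffic back to a recovered provider.
// rampWeight and its parameters are illustrative names.
function rampWeight(recoveredAt, now, rampMs, fullWeight) {
  const elapsed = now - recoveredAt;
  if (elapsed <= 0) {
    return 0; // not recovered yet
  }
  if (elapsed >= rampMs) {
    return fullWeight; // ramp finished
  }
  return fullWeight * (elapsed / rampMs);
}

// Usage: 30 seconds into a 60 second ramp, the provider gets half its weight
console.log(rampWeight(0, 30000, 60000, 100)); // 50
```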

Implementation Approaches

Let's explore three practical approaches to implementing stateful failover.

Approach 1: Proxy-Based Failover

In this model, your application acts as a proxy between users and providers. All requests flow through your system, allowing you to control routing decisions.

// Example using Fastify
const { randomUUID } = require('node:crypto');
const fastify = require('fastify')();
const providerManager = require('./provider-manager');

// Create a session store (in-memory, so it is per-process and lost on restart)
const sessions = new Map();

// Declare the properties the hook below sets on each request
fastify.decorateRequest('sessionId', null);
fastify.decorateRequest('provider', null);

// Pre-request hook to handle provider selection
fastify.addHook('preHandler', async (request, reply) => {
  // Extract or generate session ID, and keep it on the request so later
  // code sees a generated ID as well as one sent by the client
  const sessionId = request.headers['session-id'] || randomUUID();
  request.sessionId = sessionId;

  // Get provider for this session
  let provider = sessions.get(sessionId);
  if (!provider || !providerManager.isHealthy(provider)) {
    // Select a healthy provider
    provider = providerManager.getHealthyProvider();
    sessions.set(sessionId, provider);
  }

  // Attach provider to request for use in route handlers
  request.provider = provider;

  // Ensure client knows their session ID
  if (!request.headers['session-id']) {
    reply.header('session-id', sessionId);
  }
});

// Generic handler that proxies to the selected provider
fastify.all('/*', async (request, reply) => {
  const provider = request.provider;

  try {
    // Forward request to provider
    return await provider.handleRequest(request);
  } catch (error) {
    // Handle provider failure
    if (isFailoverError(error)) {
      const newProvider = providerManager.getNextHealthyProvider(provider);
      if (newProvider) {
        try {
          // Transfer state if needed, then update session's provider
          await transferState(request, provider, newProvider);
          sessions.set(request.sessionId, newProvider);

          // Retry with new provider
          return await newProvider.handleRequest(request);
        } catch (retryError) {
          request.log.error(retryError, 'Failover attempt failed');
        }
      }
    }

    // Either not a failover-eligible error or all providers failed
    return reply.code(503).send({ error: 'Service unavailable' });
  }
});

// Assumed helper: treats network-level errors and 5xx responses as
// failover-eligible. Adapt it to the errors your providers actually return.
// A timeout is ambiguous for payments, because the request may have succeeded.
function isFailoverError(error) {
  const networkCodes = ['ECONNREFUSED', 'ECONNRESET', 'ETIMEDOUT'];
  return networkCodes.includes(error.code) || error.statusCode >= 500;
}

async function transferState(request, oldProvider, newProvider) {
  // Implement provider-specific state transfer logic
  // Example: For payment processors, you might need to check if a transaction
  // was initiated but not completed
  const state = await oldProvider.getSessionState(request.sessionId);
  await newProvider.initializeSession(request.sessionId, state);
}

Approach 2: Service Mesh Failover

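For microservices architectures, a service mesh like Istio can manage traffic routing between your services and external providers. In Istio, external hosts must first be registered with the mesh, typically through a ServiceEntry, before a VirtualService can route to them. The example below splits traffic 90/10 between a primary and a backup provider and retries failed requests.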

# Istio VirtualService example
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: payment-service
spec:
  hosts:
  - payment.example.com
  http:
  - route:
    - destination:
        host: primary-payment-provider
        port:
          number: 443
      weight: 90
    - destination:
        host: backup-payment-provider
        port:
          number: 443
      weight: 10
    retries:
      attempts: 3
      perTryTimeout: 2s
    # Fault injection for failover testing; a value of 0 leaves it disabled
    fault:
      abort:
        percentage:
          value: 0
        httpStatus: 500

With a service mesh, you can:

Gradually shift traffic between providers
Implement retry logic at the network level
Insert fault injection for testing
Collect detailed metrics on provider performance

However, you'll still need application logic to handle state transfer between providers.

Approach 3: Circuit Breaker Pattern

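The circuit breaker pattern prevents cascading failures by "tripping" when failures exceed a threshold. While the circuit is open, requests fail fast instead of waiting on a provider that is known to be struggling, and after a reset timeout a few trial requests decide whether to close it again.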

class CircuitBreaker {
  constructor(provider, options = {}) {
    this.provider = provider;
    this.state = 'CLOSED'; // CLOSED, OPEN, HALF_OPEN
    this.failureCount = 0;
    this.successCount = 0; // successes counted while HALF_OPEN
    this.successThreshold = options.successThreshold || 5;
    this.failureThreshold = options.failureThreshold || 3;
    this.resetTimeout = options.resetTimeout || 30000; // 30 seconds
    this.lastFailureTime = null;
  }

  async executeRequest(request) {
    if (this.state === 'OPEN') {
      // Check if we should try half-open state
      const now = Date.now();
      if ((now - this.lastFailureTime) > this.resetTimeout) {
        this.state = 'HALF_OPEN';
        console.log(`Circuit half-open for ${this.provider.name}`);
      } else {
        throw new Error('Circuit breaker is OPEN');
      }
    }

    try {
      const result = await this.provider.handleRequest(request);

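      // Success handling
      if (this.state === 'HALF_OPEN') {
        this.successCount = (this.successCount || 0) + 1;
        if (this.successCount >= this.successThreshold) {
          this.state = 'CLOSED';
          this.failureCount = 0;
          this.successCount = 0;
          console.log(`Circuit closed for ${this.provider.name}`);
        }
      } else {
        // A success while CLOSED clears the failure streak, so only
        // consecutive failures trip the breaker
        this.failureCount = 0;
      }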

      return result;
    } catch (error) {
      // Failure handling
      this.failureCount++;
      this.lastFailureTime = Date.now();

      if (this.state === 'CLOSED' && this.failureCount >= this.failureThreshold) {
        this.state = 'OPEN';
        console.log(`Circuit opened for ${this.provider.name}`);
      }

      if (this.state === 'HALF_OPEN') {
        this.state = 'OPEN';
        this.successCount = 0;
        console.log(`Circuit re-opened for ${this.provider.name}`);
      }

      throw error;
    }
  }
}

// Usage
const primaryProvider = new CircuitBreaker(paymentProviders.stripe);
const backupProvider = new CircuitBreaker(paymentProviders.paypal);

async function processPayment(paymentDetails) {
  try {
    return await primaryProvider.executeRequest({
      method: 'processPayment',
      data: paymentDetails
    });
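  } catch (error) {
    // Caution: failing over is only safe when the primary attempt definitely
    // did not charge the customer. After a timeout or a lost response, confirm
    // the outcome with the primary (or use an idempotency key) first.
    console.log('Failover to backup provider');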
    return await backupProvider.executeRequest({
      method: 'processPayment',
      data: paymentDetails
    });
  }
}

Advanced Considerations

Managing Stateful Sessions

For truly stateful failover, you need to consider how session state is maintained:

class StatefulProviderManager {
  constructor(providers) {
    this.providers = providers.map(p => ({
      provider: p,
      circuitBreaker: new CircuitBreaker(p)
    }));
    this.sessionStore = new Map();
    this.stateReplicationService = new StateReplicationService();
  }

  // A provider is selectable when its breaker is not OPEN, or has been OPEN
  // long enough that its next call will probe the provider in HALF_OPEN state
  isAvailable({ circuitBreaker }) {
    return circuitBreaker.state !== 'OPEN' ||
      Date.now() - circuitBreaker.lastFailureTime > circuitBreaker.resetTimeout;
  }

  async getProviderForSession(sessionId, excluded = new Set()) {
    // Get existing provider for session
    const current = this.sessionStore.get(sessionId);

    // Keep the current provider while it is usable and has not just failed
    if (current && this.isAvailable(current) && !excluded.has(current)) {
      return current;
    }

    // Get next healthy provider that has not been tried for this request
    const next = this.providers.find(
      p => this.isAvailable(p) && !excluded.has(p)
    );

    if (!next) {
      throw new Error('No healthy providers available');
    }

    // If switching providers, replicate state
    if (current) {
      await this.stateReplicationService.transferState(
        sessionId,
        current.provider,
        next.provider
      );
    }

    this.sessionStore.set(sessionId, next);
    return next;
  }

  async executeRequest(sessionId, request, excluded = new Set()) {
    const providerInfo = await this.getProviderForSession(sessionId, excluded);

    try {
      return await providerInfo.circuitBreaker.executeRequest(request);
    } catch (error) {
      // Exclude the failed provider for the rest of this request. The session
      // mapping is kept so its state can still be transferred, and the retry
      // depth is bounded by the number of providers.
      excluded.add(providerInfo);
      return await this.executeRequest(sessionId, request, excluded);
    }
  }
}

Replication vs. Reconstruction

When switching providers, you have two main approaches for handling state:

State Replication: Actively copy session state between providers
- Pros: Immediate availability, complete state transfer
- Cons: Additional complexity, potential consistency issues
State Reconstruction: Rebuild state from persistent storage
- Pros: Simpler implementation, a single source of truth for state
- Cons: Potentially slower, requires comprehensive data model

Choose based on your specific requirements around recovery time objectives (RTO) and consistency needs.
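
To make the reconstruction option more concrete, here is a minimal sketch that rebuilds session state by folding over an ordered event log loaded from persistent storage. The event types, their fields, and the `reconstructState` name are assumptions invented for this example rather than part of any provider's API.

```javascript
// Sketch of state reconstruction: fold a persisted event log into state.
// Event types and field names are illustrative assumptions.
function reconstructState(events) {
  const initial = { items: [], payment: null };

  return events.reduce((state, event) => {
    switch (event.type) {
      case 'ITEM_ADDED':
        return { ...state, items: [...state.items, event.item] };
      case 'PAYMENT_STARTED':
        return { ...state, payment: { status: 'pending', provider: event.provider } };
      case 'PAYMENT_CONFIRMED':
        return { ...state, payment: { ...state.payment, status: 'succeeded' } };
      default:
        return state; // ignore event types this version does not know
    }
  }, initial);
}

// Usage: the new provider's session is initialized from the rebuilt state
const rebuilt = reconstructState([
  { type: 'ITEM_ADDED', item: 'sku-1' },
  { type: 'PAYMENT_STARTED', provider: 'primary' }
]);
console.log(rebuilt.payment.status); // pending
```

Because the log lives in your own storage, reconstruction works even when the failed provider cannot be reached to export its state.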

Testing Your Failover System

No failover system is complete without rigorous testing. Consider implementing:

1. Chaos Engineering

Regularly introduce failures to test your system's resilience:

// Simple chaos monkey implementation
function startChaosTesting(providers, options = {}) {
  const interval = options.interval ?? 3600000; // Default: hourly
  const failureDuration = options.failureDuration ?? 300000; // Default: 5 minutes
  const failureRate = options.failureRate ?? 0.1; // Default: 10% chance; 0 disables

  setInterval(() => {
    providers.forEach(provider => {
      if (Math.random() < failureRate) {
        console.log(`🐒 Chaos monkey disabling ${provider.name} for ${failureDuration}ms`);
        provider.simulateOutage();

        setTimeout(() => {
          console.log(`🐒 Chaos monkey restoring ${provider.name}`);
          provider.restoreService();
        }, failureDuration);
      }
    });
  }, interval);
}

2. Load Testing During Failover

Test system behavior under load while triggering failovers to identify potential bottlenecks.
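
A small harness can approximate this without a full load-testing tool. The sketch below runs a fixed number of requests at a fixed concurrency and triggers a failover partway through; `loadTestWithFailover`, its option names, and the `execute` and `triggerFailover` callbacks are all hypothetical, and a real test would also record latencies.

```javascript
// Sketch of a load test that triggers a failover mid-run.
// All names and options here are illustrative assumptions.
async function loadTestWithFailover(execute, options) {
  const { total = 200, concurrency = 20, failAt = 100, triggerFailover } = options;
  const results = { ok: 0, failed: 0 };
  let started = 0;
  let triggered = false;

  async function worker() {
    // Claiming a request number is synchronous, so exactly `total` requests run
    while (started < total) {
      const n = started++;
      if (n >= failAt && !triggered) {
        triggered = true;
        await triggerFailover();
      }
      try {
        await execute(n);
        results.ok++;
      } catch (error) {
        results.failed++;
      }
    }
  }

  // Run `concurrency` workers side by side until all requests are done
  await Promise.all(Array.from({ length: concurrency }, () => worker()));
  return results;
}
```

Passing a function that calls your provider manager as `execute`, and one that simulates an outage as `triggerFailover`, shows how many in-flight requests fail during the switch.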

3. Session Persistence Verification

Ensure that user sessions remain intact through provider transitions:

const assert = require('node:assert');

async function verifySessionPersistence() {
  // Create test session
  const sessionId = await createTestSession();
  const testData = { key: 'value', timestamp: Date.now() };

  // Store data with initial provider
  await providerManager.executeRequest(sessionId, {
    method: 'storeData',
    data: testData
  });

  // Force failover to next provider. Entries in `providers` wrap the provider
  // together with its circuit breaker, hence the extra `.provider`
  await providerManager.providers[0].provider.simulateOutage();

  // Retrieve data after failover
  const retrievedData = await providerManager.executeRequest(sessionId, {
    method: 'retrieveData'
  });

  // Verify data persistence
  assert.deepStrictEqual(retrievedData, testData);
}

Production Considerations

When implementing stateful failover in production, consider:

1. Observability

Implement comprehensive monitoring to detect and diagnose issues:

// Example with Prometheus metrics
const Prometheus = require('prom-client');

const failoverCounter = new Prometheus.Counter({
  name: 'provider_failovers_total',
  help: 'Total number of provider failovers',
  labelNames: ['from_provider', 'to_provider', 'reason']
});

const requestLatencyHistogram = new Prometheus.Histogram({
  name: 'provider_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['provider', 'endpoint', 'status'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 10]
});

// Instrument your provider calls
async function instrumentedProviderCall(provider, request) {
  const end = requestLatencyHistogram.startTimer({
    provider: provider.name,
    endpoint: request.endpoint
  });

  try {
    const result = await provider.handleRequest(request);
    end({ status: 'success' });
    return result;
  } catch (error) {
    end({ status: 'error' });
    throw error;
  }
}

2. Geographic Distribution

Use provider endpoints in multiple regions to protect against regional outages:

const providers = [
  {
    name: 'Primary US East',
    region: 'us-east-1',
    priority: 1,
    // Provider-specific config
  },
  {
    name: 'Primary US West',
    region: 'us-west-2',
    priority: 2,
    // Provider-specific config
  },
  {
    name: 'Europe',
    region: 'eu-central-1',
    priority: 3,
    // Provider-specific config
  }
];

// Select provider based on user's region and provider health
function selectOptimalProvider(userRegion, providers) {
  // Filter to healthy providers
  const healthyProviders = providers.filter(p => isProviderHealthy(p));

  if (healthyProviders.length === 0) {
    throw new Error('No healthy providers available');
  }

  // First try: closest healthy provider to user
  const closestProvider = healthyProviders
    .sort((a, b) => getLatency(userRegion, a.region) - getLatency(userRegion, b.region))[0];

  return closestProvider;
}

3. Cost Management

Different providers often have different pricing models. Implement logic to optimize for cost:

function selectProviderWithCostOptimization(request, healthyProviders) {
  // Assumes a higher reliabilityScore means a more reliable provider.
  // Copies are sorted so the caller's array is not reordered.

  // For high-value transactions, prioritize reliability over cost
  if (request.transactionValue > HIGH_VALUE_THRESHOLD) {
    return [...healthyProviders].sort((a, b) => b.reliabilityScore - a.reliabilityScore)[0];
  }

  // For normal transactions, balance cost and reliability
  return [...healthyProviders].sort((a, b) => {
    const aCost = a.baseCost + (request.transactionValue * a.percentageFee);
    const bCost = b.baseCost + (request.transactionValue * b.percentageFee);

    // Lowest score wins: cost (70% weight) adds to it and reliability
    // (30% weight) subtracts from it. Normalize both to comparable scales.
    return (aCost * 0.7 - a.reliabilityScore * 0.3) -
           (bCost * 0.7 - b.reliabilityScore * 0.3);
  })[0];
}

Conclusion

Building a stateful failover system between multiple external providers isn't trivial, but it's an essential component of a resilient architecture. By implementing proper health checks, circuit breakers, session management, and failover logic, you can create systems that gracefully handle provider outages.

Remember that failover systems themselves need regular testing and monitoring. A failover mechanism that hasn't been tested recently is a liability, not an asset.

The code examples in this article provide a starting point, but you'll need to adapt them to your specific providers and requirements. Focus on maintaining user experience and data consistency, and you'll build systems that keep working even when your dependencies don't.

What strategies have you used for managing provider failover? Have you encountered any particularly challenging aspects? Share your experiences in the comments below!
