Introduction
On paper, a crypto-to-utility pipeline is a simple sequence: verify the on-chain transaction, trigger a biller's API, and return a success state.
In production, "simple" integrations are where the most expensive failures happen. When handling user funds, reliability isn't a feature; it's the foundation. As the lead engineer, I built Elites Africa with a Multi-Provider Failover strategy from day one. To balance high availability with latency constraints, I architected the system around a primary provider with a 'hot' secondary backup. This lets us recover from a provider failure in real time without exceeding the timeout thresholds that degrade the user experience.
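To make the 'hot' backup concrete, here is a minimal sketch of what such a failover wrapper can look like. Every name here (BillProvider, executeWithFailover, the 8-second per-provider budget) is an illustrative assumption, not the actual Elites Africa code:

```typescript
// Minimal failover sketch: try the primary, fall back to the hot secondary.
// All names and the timeout budget are illustrative assumptions.
interface BillRequest { internalCode: string; amount: number; customerRef: string; }
interface BillReceipt { providerRef: string; status: "success"; }

interface BillProvider {
  name: string;
  pay(request: BillRequest): Promise<BillReceipt>;
}

const PER_PROVIDER_TIMEOUT_MS = 8_000; // keeps the whole attempt inside the UX budget

function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error("provider timeout")), ms)
    ),
  ]);
}

async function executeWithFailover(
  request: BillRequest,
  primary: BillProvider,
  secondary: BillProvider
): Promise<BillReceipt> {
  try {
    return await withTimeout(primary.pay(request), PER_PROVIDER_TIMEOUT_MS);
  } catch (primaryError) {
    // The secondary is already warm (credentials cached, catalog synced),
    // so the retry costs one network round-trip, not a cold start.
    return await withTimeout(secondary.pay(request), PER_PROVIDER_TIMEOUT_MS);
  }
}
```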
The Architecture: Designing for Failure
I built the system to handle everything from airtime and data to electricity and Chowdeck (food delivery wallet) funding. Given the nature of crypto payments, the user expectation for "instant value" is high.
The Lifecycle
- Selection & Intent: The user picks a service (e.g., MTN Data). Instead of a live fetch, we serve this from our local cache to keep the UI snappy.
- The Crypto Leg: We leverage the Coinley SDK for the wallet handshake. Once the success callback triggers, our backend takes over.
- Synchronous Execution: Instead of relying on asynchronous webhooks that delay user certainty, we process bill payments synchronously. When a provider fails, we detect it immediately and can retry or refund without leaving users waiting.
- Logging & Attempts: Every request is wrapped in a PaymentAttempt entity. This gives us a granular audit trail of exactly which provider failed and why (a minimal shape is sketched right after this list).
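As a rough illustration, the attempt record can be as small as the shape below. The field names are assumptions for the sake of the example, not the production schema:

```typescript
// Illustrative shape of a per-provider attempt record (field names are assumptions).
interface PaymentAttempt {
  id: string;
  transactionId: string;                    // links back to the user's crypto payment
  provider: "provider_a" | "provider_b";
  internalCode: string;                     // e.g. "mtn-data-1gb" in our internal schema
  providerCode: string;                     // the code actually sent to the provider
  status: "pending" | "success" | "failed";
  errorReason?: string;                     // raw provider error, kept for the audit trail
  startedAt: Date;
  completedAt?: Date;
}
```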
The Problem: Data Fragmentation
When we integrated our second provider, we hit a wall. Provider A might represent an MTN 1GB plan as mtn-data-1gb, while Provider B uses an integer ID like 444402.
If the primary provider fails, you can't simply "retry" with the second provider using the first one's metadata. The systems are fundamentally incompatible in their naming conventions and response structures. Without a translation layer, failover logic is dead on arrival.
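To make the mismatch concrete, here is roughly what the same 1GB MTN plan looks like coming back from each side (the payload shapes are simplified illustrations, not the providers' real responses):

```typescript
// Same logical product, two incompatible representations (illustrative shapes only).
const planFromProviderA = {
  code: "mtn-data-1gb",   // human-readable slug
  network: "MTN",
  amount: "550.00",       // price as a string
};

const planFromProviderB = {
  variation_id: 444402,   // opaque integer ID
  service: "mtn_data",
  amount: 550,            // price as a number
};
```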
The Solution: Internal Code Mapping
To solve this, I implemented an Internal Abstraction Layer. We stopped treating provider codes as the source of truth. Instead, we introduced a standardized internal schema that represents the logical service. At runtime, the system resolves internalCode → provider-specific metadata based on provider availability.
The Mapping Strategy:
| Category | Internal Code Pattern | Example |
|---|---|---|
| Airtime | {network}-airtime | mtn-airtime |
| Data | {network}-data-{value} | glo-data-1000 |
| Electricity | {disco}-electricity | ikeja-electricity |
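In code, this can be as simple as a lookup table keyed by the internal code and resolved against whichever provider is currently healthy. A minimal sketch, where the type names, map contents, and resolvePlan are illustrative assumptions:

```typescript
// Illustrative mapping: internal code -> per-provider plan metadata.
type ProviderId = "provider_a" | "provider_b";

interface ProviderPlan {
  code: string;   // the provider's own identifier for this plan
  price: number;  // face value in naira
}

const planMap: Record<string, Partial<Record<ProviderId, ProviderPlan>>> = {
  "mtn-data-1gb": {
    provider_a: { code: "mtn-data-1gb", price: 550 },
    provider_b: { code: "444402", price: 550 },
  },
  "glo-data-1000": {
    provider_a: { code: "glo-1gb-30days", price: 1000 },
    // no provider_b entry: this plan falls into the 20% without a fallback
  },
};

function resolvePlan(internalCode: string, provider: ProviderId): ProviderPlan | null {
  return planMap[internalCode]?.[provider] ?? null;
}
```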
We run a weekly synchronization script that pulls fresh metadata from all providers and maps them to these internal codes. This allows the backend to be provider-agnostic. When a payment fails on Provider A, the system looks up the same internalCode for Provider B and executes the retry instantly.
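The weekly sync itself can be a plain scheduled job that rebuilds that map from each provider's catalog. A hedged sketch, assuming hypothetical ProviderClient, fetchCatalog, and RawPlan helpers rather than any real provider API:

```typescript
// Illustrative weekly sync: pull each provider's catalog and derive internal codes.
interface RawPlan {
  code: string;
  network: string;          // e.g. "MTN", "Glo", or a disco name
  category: "airtime" | "data" | "electricity" | string;
  valueNgn?: number;        // bundle value, where applicable
}

interface ProviderClient {
  id: "provider_a" | "provider_b";
  fetchCatalog(): Promise<RawPlan[]>;
}

type PlanMap = Record<string, Record<string, { code: string; price: number }>>;

// Apply the internal code patterns from the table above.
function toInternalCode(plan: RawPlan): string | null {
  const name = plan.network.toLowerCase();
  switch (plan.category) {
    case "airtime":     return `${name}-airtime`;
    case "data":        return plan.valueNgn ? `${name}-data-${plan.valueNgn}` : null;
    case "electricity": return `${name}-electricity`;
    default:            return null; // unknown categories are skipped, never guessed
  }
}

async function syncProviderCatalogs(providers: ProviderClient[]): Promise<PlanMap> {
  const fresh: PlanMap = {};
  for (const provider of providers) {
    for (const plan of await provider.fetchCatalog()) {
      const internalCode = toInternalCode(plan);
      if (!internalCode) continue;
      fresh[internalCode] = {
        ...fresh[internalCode],
        [provider.id]: { code: plan.code, price: plan.valueNgn ?? 0 },
      };
    }
  }
  return fresh; // persisted atomically so a failed sync never wipes the live mapping
}
```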
Engineering Trade-offs
Every architecture has its "debt." Here is how we weighed our decisions:
- Synchronous Processing vs. Event-Driven: By choosing synchronous calls, we prioritize user certainty over server throughput. The trade-off is increased latency during provider timeouts, but in fintech, a slow success is better than a fast "maybe."
- Database vs. Live Fetch: Storing plans locally allows for provider switching without redeploying code. The downside is the "stale data" risk, which we mitigate with our weekly cron syncs.
- Partial Catalog Failover (80/20 Rule): Not all plans exist on all providers (e.g., certain niche data bundles). Our fallback logic currently covers about 80% of our catalog. For the remaining 20%, we've accepted that a failure on the primary provider will result in a refund or manual intervention, a conscious choice to avoid over-engineering for edge cases (see the sketch after this list).
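In code, that 80/20 decision is just a guard evaluated before any retry: if the secondary provider has no mapping for the internal code, we skip failover and go straight to the refund path. A minimal sketch with assumed names:

```typescript
// Illustrative guard for the 80/20 rule: only fail over when the secondary
// provider actually carries the plan; otherwise refund and flag for review.
type FailoverDecision = "retry_on_secondary" | "refund_and_flag";

function decideOnPrimaryFailure(
  internalCode: string,
  secondaryCatalog: Set<string> // internal codes the secondary provider supports
): FailoverDecision {
  return secondaryCatalog.has(internalCode)
    ? "retry_on_secondary"
    : "refund_and_flag"; // the ~20% we consciously do not over-engineer for
}
```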
Closing Thoughts
After 7 years in engineering, you realize that the "happy path" is easy. The real work is in the edge cases. At Elites Africa, we didn't just build a bill payment platform; we built a system that assumes providers will fail and has a plan for when they do.
True architectural maturity isn't about the newest tech stack; it's about how gracefully your system fails.