DEV Community

KitKeen

Reliability Patterns for Asynchronous APIs in Fintech: A Migration Guide

Disclaimer: The views and opinions expressed in this article are strictly my own and do not reflect the official policy or position of my employer. The architecture, system designs, and workflows discussed have been abstracted, generalized, and simplified for educational purposes. All metrics, timelines, and numbers are approximate or illustrative and do not represent actual company data. No confidential, proprietary, or sensitive business information is disclosed in this post.


Our core banking provider, NexusBank, sent us a notice: their synchronous account-opening API was being deprecated. The deadline was strict: if we didn't migrate to their new asynchronous, webhook-based flow within two weeks, new customer KYC in Belgium and the Netherlands would stop completely.

If you work in fintech, you know that "account opening" is the critical backend pipeline that assigns a real IBAN to a user. Moving this flow from synchronous to asynchronous is not just about replacing one endpoint request with another. It is a fundamental shift in the reliability paradigm of your system.

When you lose synchronous immediate feedback, you can no longer treat interactions as a single atomic request-response transaction. This article serves as an engineering playbook, detailing the architectural patterns and checklists needed to safely migrate a critical distributed pipeline to an asynchronous model.


1. The Architecture Shift: From Blocking to State Machines

In the old synchronous model, our account-opening pipeline was a linear sequence that was easy to reason about:

```
[1] Create User Profile →
[2] Request Provider Account (Blocks) →
[3] Issue Virtual Card
```

Each step ran in sequence. Step 2 blocked the execution thread until the bank generated and returned the new IBAN.

With the asynchronous provider, this atomic pipeline breaks. The account creation API now returns instantly with just a tracking_id. The actual IBAN arrives via a webhook at an unknown point in the future. Since we can't reliably predict when that happens, blocking a thread is no longer viable.

To handle this without rewriting our entire workflow engine, we split the flow and shifted to a state machine pattern by introducing a targeted "Wait State":

```
Phase 1: Initiation
[1] Create User Profile →
[2] Request Provider Account (Async) → (Pipeline Pauses)

Phase 2: Completion
(Webhook Arrives) →
[3] Match Tracking ID & Save IBAN →
[4] Issue Virtual Card
```

Here is the crucial implementation detail: during Phase 1, we immediately create the account record in our database with an InProgress status and an empty IBAN/BIC.

This is critical: the system must have a persistent local database object to act as an anchor. You never want to hold memory or threads hostage waiting for a callback. When the webhook eventually arrives, it finds this pending record, updates the status, fills in the IBAN, and gracefully triggers the pipeline to resume.
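The two phases can be sketched as follows. This is a minimal, illustrative Python sketch (the real system is not Python): the in-memory `accounts` dict stands in for the database, and `issue_virtual_card` is a hypothetical placeholder for step [4].

```python
import enum
from dataclasses import dataclass
from typing import Dict, Optional

class AccountStatus(enum.Enum):
    IN_PROGRESS = "InProgress"
    SUCCEEDED = "Succeeded"

@dataclass
class AccountRecord:
    tracking_id: str
    status: AccountStatus = AccountStatus.IN_PROGRESS
    iban: Optional[str] = None  # empty until the webhook arrives

# Hypothetical in-memory store standing in for the real database.
accounts: Dict[str, AccountRecord] = {}

def start_account_opening(tracking_id: str) -> AccountRecord:
    """Phase 1: persist the anchor record *before* the provider calls back."""
    record = AccountRecord(tracking_id=tracking_id)
    accounts[tracking_id] = record
    return record

def issue_virtual_card(record: AccountRecord) -> None:
    """Placeholder for step [4] of the pipeline."""
    pass

def handle_webhook(tracking_id: str, iban: str) -> AccountRecord:
    """Phase 2: the webhook finds the pending record and resumes the pipeline."""
    record = accounts[tracking_id]       # match on the anchor from Phase 1
    record.iban = iban
    record.status = AccountStatus.SUCCEEDED
    issue_virtual_card(record)           # orchestration resumes here
    return record
```

The key property is that no thread blocks between the two phases; the pending database row is the only state that bridges them.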


2. Handling Transport Unreliability and Fallbacks

Asynchronous integrations are common in fintech, but the primary engineering challenge is delivery guarantees. Webhooks can be delayed, delivered out of order, or simply lost.

We already knew NexusBank webhooks occasionally experienced production delays. You cannot rely purely on the provider calling you back. You must design a deterministic fallback path.

Our fallback policy:

```
Webhook arrives
  ├─ COMPLETED / REJECTED
  │    └─ process terminal status, resume orchestration
  │
  └─ INITIATED / IN_PROGRESS (provider still working)
       └─ schedule retry in 30 min
            └─ poll provider status API directly
                 ├─ terminal status → process
                 └─ still pending → retry
                      └─ 12 hours elapsed?
                           ├─ no  → schedule another retry
                           └─ yes → emit escalation event
                                     → create ops investigation ticket
```

By introducing a polling fallback mechanism that terminates after 12 hours, we bounded our waiting time. If a webhook is permanently lost, the system automatically creates a Jira ticket (AccountOpeningWebhookNotReceived) with a deep link for manual investigation. No orphaned records, no silent failures.
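The scheduler's decision logic boils down to a small pure function. The sketch below is illustrative Python under the assumptions stated in the tree above (30-minute retry interval, 12-hour bound); the action names are hypothetical.

```python
import datetime as dt

TERMINAL_STATUSES = {"COMPLETED", "REJECTED"}
RETRY_INTERVAL = dt.timedelta(minutes=30)
MAX_WAIT = dt.timedelta(hours=12)

def next_action(status: str, started_at: dt.datetime, now: dt.datetime) -> str:
    """Decide what the fallback scheduler does after a poll result arrives."""
    if status in TERMINAL_STATUSES:
        return "process"    # resume orchestration with the terminal status
    if now - started_at >= MAX_WAIT:
        return "escalate"   # emit escalation event -> ops investigation ticket
    return "retry"          # schedule another poll in RETRY_INTERVAL
```

Keeping this as a pure function of (status, elapsed time) makes the bound trivially unit-testable, which matters when the failure mode it guards against is, by definition, rare in staging.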


3. Deduplication: Cache for Burst Protection

Providers often guarantee "at-least-once" delivery, making duplicate webhooks inevitable. A duplicate webhook can cause severe logical errors if processed twice.

To handle this, we implemented a cache-based deduplication layer with a specific key structure:

```
account_opening_request_id_{aorId}_{status}
```

Why include status in the key? Because a single request ID legitimately transitions through multiple states over time (INITIATED → IN_PROGRESS → SUCCEEDED). Processing the "same request" with a "new status" is a valid state transition.

Why Cache instead of the Database? The purpose of this specific lock is to absorb immediate retry bursts (e.g., the provider sending the same webhook three times in two seconds), not long-term state idempotency. The cache key expires after 15 minutes. Long-term idempotency is handled inherently by our database's orchestration state (a SUCCEEDED account cannot be transitioned to SUCCEEDED twice). The cache simply acts as a fast, lightweight shield against transient network spam.
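A minimal sketch of this dedup lock, in illustrative Python: the `_dedup_cache` dict stands in for a real TTL cache such as Redis, and the 15-minute TTL matches the policy described above.

```python
import time
from typing import Dict, Optional

DEDUP_TTL_SECONDS = 15 * 60  # key expires after 15 minutes

# Hypothetical in-memory cache standing in for Redis (key -> expiry timestamp).
_dedup_cache: Dict[str, float] = {}

def try_acquire_dedup_lock(aor_id: str, status: str,
                           now: Optional[float] = None) -> bool:
    """Return True if this (request id, status) pair was not seen recently."""
    now = time.time() if now is None else now
    key = f"account_opening_request_id_{aor_id}_{status}"
    expires_at = _dedup_cache.get(key)
    if expires_at is not None and expires_at > now:
        return False                          # burst duplicate: drop the webhook
    _dedup_cache[key] = now + DEDUP_TTL_SECONDS
    return True                               # first delivery (or TTL expired)
```

Note that a replay after the TTL falls through to the database-level idempotency check, which is exactly the intended division of labor.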


4. Testing Distributed Workflows with a Sandbox

Testing async integrations by mocking external APIs locally often fails to catch edge cases in the webhook processing layer.

Instead of relying on fragile mocks, we built a dedicated Sandbox Endpoint (POST /company/account-opening/set-status-sandbox) in our non-production environments. You send it a request ID and a target status. It publishes a simulated internal webhook message that flows through the exact same production processing pipeline as a real external webhook.

This simulated message hits the deduplication locks, triggers state transitions, and advances the orchestration pipeline. This allowed us to write repeatable, end-to-end integration tests without depending on third-party staging environments.
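The core idea of the sandbox endpoint is that simulated and real webhooks converge on the same internal message. A sketch in illustrative Python, where `publish_internal_webhook` and the payload fields are hypothetical stand-ins for the real message bus and schema:

```python
import json
from typing import Dict, List

# Stand-in for the internal message bus consumed by the webhook pipeline.
published_messages: List[str] = []

def publish_internal_webhook(payload: Dict[str, str]) -> None:
    """Both real webhook ingestion and the sandbox publish through this path."""
    published_messages.append(json.dumps(payload))

def set_status_sandbox(request_id: str, target_status: str) -> Dict[str, str]:
    """Simulates POST /company/account-opening/set-status-sandbox."""
    payload = {
        "account_opening_request_id": request_id,
        "status": target_status,
        "source": "sandbox",  # tagged, but processed by the production pipeline
    }
    publish_internal_webhook(payload)
    return payload
```

Because the sandbox only forges the transport-level event and nothing downstream, every assertion an integration test makes (dedup behavior, state transitions, card issuance) exercises real production code.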


5. Safe Rollouts via Feature Flags

Migrating a core flow with a hard deadline leaves no room for "big bang" release failures. I wrapped the dispatching logic behind dynamic feature flags scoped by market:

```csharp
var isAsyncEnabled = await featureFlags.IsEnabledAsync(
    Features.AsyncAccountOpening,
    marketCode,
    ct);

if (!isAsyncEnabled)
    return await syncOpenAccountService.ExecuteAsync(context, ct);

return await asyncOpenAccountService.ExecuteAsync(context, ct);
```

This approach allowed us to deploy the new microservices before the deadline, alongside the legacy code. We rolled out to the Netherlands (companies, then freelancers) and then Belgium. At each step, if we detected malformed webhook payloads or provider instability, the feature flag allowed us to instantly reroute traffic back to the legacy sync flow: zero service interruptions and no emergency rollback deployments required.


Takeaways Checklist for Backend Teams

Moving to webhook-driven architectures requires an intentional design shift:

  • State Representation: Does your system explicitly track an "In Progress" database state that acts as an anchor for incoming callbacks?
  • Burst Protection: Are you deduplicating rapid webhook retries (e.g., using a short-lived cache lock with a composite key of ID + Status)?
  • Bounded Waiting: Do you have a fallback mechanism (like polling) to catch missing webhooks and prevent indefinite pipeline hanging?
  • Automated Escalation: What happens when the bounds are exceeded? (e.g., automatically generating an Ops Jira ticket).
  • End-to-End Testing: Can you inject simulated webhooks deep into your pipeline to test the actual handling logic?

Incidents in webhook architectures are rarely caused by business logic bugs; they are caused by false assumptions about transport reliability. Design for failure from day one.
