**The Brief**
A fintech startup needed a platform that lets other fintech companies (the kind building neobanks, savings apps, and lending products in Nigeria) provision virtual bank accounts for their customers, process payouts to any Nigerian bank, and hold customer funds with interest accrual. All behind a clean multi-tenant API.
On a whiteboard this looks like three rectangles: Account Management, Payments, and Interest. Four or five endpoints each. Maybe six weeks of work. I did it in two. Yes, two weeks.
What I shipped is 94 entity types, 100+ database migrations, five authentication schemes, two virtual account providers, a general ledger, a webhook delivery system with retry logic, and a distributed job locking system.
Here's the gap between the whiteboard and the reality.
**The First Thing That Humbled Me: Money Doesn't Forgive Retries**
Early on, I was building the payout endpoint: a company initiates a transfer to a Nigerian bank account. The happy path took maybe two days. Then I started thinking about failure modes.
What happens if the HTTP request to the payment provider times out after we've sent the money but before we get the confirmation? What happens if the client retries the same request because they didn't hear back? What happens if our database write succeeds but the network drops before we respond to the caller?
In a normal API, a duplicate POST is annoying. In a payments API, it's catastrophic. A duplicate POST means someone's money gets moved twice.
I implemented idempotency via a clientReference field, a unique key per company that the caller provides. Before processing any payout, we check the idempotency table for that reference. If it exists, we return the original result instead of processing again. Simple in concept. The devil was in what "original result" means across every failure state.
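As a minimal sketch, the insert-first pattern with a unique index on `(CompanyId, ClientReference)` looks roughly like this. Entity names like `IdempotencyRecord` and helpers like `ProcessPayoutAsync` are illustrative, not the real ones:

```csharp
using Microsoft.EntityFrameworkCore;

public async Task<PayoutResult> InitiatePayoutAsync(PayoutRequest request)
{
    // Insert the idempotency record before doing anything else. A unique
    // index on (CompanyId, ClientReference) turns a concurrent duplicate
    // into a constraint violation instead of a second payout.
    _db.IdempotencyRecords.Add(new IdempotencyRecord
    {
        CompanyId = _company.CompanyId,
        ClientReference = request.ClientReference,
        Status = PayoutStatus.Pending
    });

    try
    {
        await _db.SaveChangesAsync();
    }
    catch (DbUpdateException)
    {
        // A previous request with this reference already exists: replay
        // its stored outcome (or report "in flight") instead of
        // processing the payout again.
        var original = await _db.IdempotencyRecords.SingleAsync(r =>
            r.CompanyId == _company.CompanyId &&
            r.ClientReference == request.ClientReference);
        return original.ToResult();
    }

    return await ProcessPayoutAsync(request);
}
```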
What's the right response when a previous request with this reference is currently in flight? What's the right response when the previous request failed at the provider level but succeeded at ours? When it failed at ours but might have succeeded at the provider?
I spent two days on idempotency edge cases. I don't think most people who haven't built payments systems understand that two days is not excessive for this problem. In most software, a wrong answer is a bug. In payments, a wrong answer is someone's payroll being sent twice.
**The Architecture Decision I'm Most Proud Of: Provider Abstraction**
Nigeria's fintech infrastructure runs through a small number of licensed commercial banks acting as settlement partners. Each has a different API, different credential format, different error codes, different rate limits, different webhooks.
The naive approach is to build against one provider and hard-code everything. That's the approach I almost took. What stopped me was thinking about provider reliability: what happens when your primary provider has downtime, or changes their fee structure, or stops onboarding new customers?
So I built an abstraction layer:
```
IVirtualAccountProvider (interface)
├── BankAVirtualAccountProvider
└── BankBVirtualAccountProvider

IVirtualAccountProviderResolver
└── resolves provider at runtime based on company configuration
```
And a separate one for payouts:
```
IPayoutProvider (interface)
├── BankAPayoutProvider
└── BankBPayoutProvider

PayoutProviderResolver
└── reads company payout settings, routes accordingly
```
The resolver pattern sounds simple when described in two bullet points. The hard part was discipline: resisting the urge to let provider-specific logic leak into service layer code. Every time I found myself writing `if provider == Polaris` in the service layer, I pushed it back down into the provider implementation.
The result: switching a company from one provider to another is a single settings change. No code change, no deployment. During a provider incident, affected companies can be migrated in minutes.
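A rough sketch of the payout side, assuming a hypothetical `ICompanySettingsStore` for the settings lookup (the real wiring differs):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public interface IPayoutProvider
{
    Task<PayoutResult> SendAsync(PayoutRequest request);
}

public class PayoutProviderResolver
{
    private readonly IReadOnlyDictionary<string, IPayoutProvider> _providers;
    private readonly ICompanySettingsStore _settings;

    public PayoutProviderResolver(
        IEnumerable<IPayoutProvider> providers,
        ICompanySettingsStore settings)
    {
        // DI registers every implementation; key them by type name so a
        // settings row can select one.
        _providers = providers.ToDictionary(p => p.GetType().Name);
        _settings = settings;
    }

    public async Task<IPayoutProvider> ResolveAsync(Guid companyId)
    {
        // The provider choice lives in company settings, not in code, so
        // switching providers is a settings update, not a deployment.
        var providerName = await _settings.GetPayoutProviderNameAsync(companyId);
        return _providers[providerName];
    }
}
```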
I know we'll need a third provider eventually, and when we do it will be plug and play. Even setting that aside, we would have had to refactor part of this architecture anyway. Doing it first cost maybe two extra days and saved what would have been a weeks-long migration later.
**The Part Nobody Talks About: The General Ledger**
Virtual accounts don't hold real money; every credit is swept into a pool account immediately. Companies pay fees on every payout. Those fees need to be tracked. The money needs to be auditable.
I didn't expect to build a general ledger. I expected to build a transactions table and call it done.
The problem with a transactions table is that it tells you what happened but not why the numbers add up. When a company asks "why is my wallet balance X?" you want to be able to answer that by summing a set of verifiable entries, not by reconstructing logic from disparate event logs.
Double-entry bookkeeping solves this. Every payout posts balanced debit and credit entries across at least two accounts. The payout principal debits the company disbursement account. The fee credits a separate fee accrual account. FX conversion rates get recorded against the transaction at the exact moment of execution, not approximated later.
I built GeneralLedgerAccount, GeneralLedgerJournal, and GeneralLedgerJournalLine entities with a validation invariant: every journal must balance (total debits = total credits). Any code path that creates unbalanced entries fails at commit time.
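A minimal sketch of that invariant, with deliberately simplified entity shapes (the real entities carry more fields):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class GeneralLedgerJournalLine
{
    public Guid AccountId { get; set; }
    public decimal Debit { get; set; }
    public decimal Credit { get; set; }
}

public class GeneralLedgerJournal
{
    public List<GeneralLedgerJournalLine> Lines { get; } = new();

    // Called before commit: an unbalanced journal never reaches the database.
    public void AssertBalanced()
    {
        var debits = Lines.Sum(l => l.Debit);
        var credits = Lines.Sum(l => l.Credit);
        if (debits != credits)
            throw new InvalidOperationException(
                $"Unbalanced journal: debits {debits} != credits {credits}");
    }
}
```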
This sounds like overengineering until you're in a support conversation with a company that believes their balance is wrong. With a proper ledger, you can walk them through every entry, sorted chronologically, and prove where every naira went. Without it, you're doing forensic archaeology in a transactions table, hoping the application logs are complete.
**What I Underestimated: Webhook Reliability Is a Full Problem**
Webhooks seem like an afterthought. They're not.
The inbound side: the payment provider sends a POST to our webhook endpoint when a customer's virtual account receives a credit. We need to credit the customer's balance. Simple. Except providers retry on timeout. They retry on 5xx. They might deliver the same event multiple times due to network partitioning.
I built deduplication into the inbound processor: every incoming credit comes with a provider_transaction_ref that we store with a unique constraint per company. Duplicate delivery returns 200 OK immediately without re-processing. Without this, every network blip between the provider and our servers is a potential double-credit.
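As a rough sketch, assuming a unique index on `(CompanyId, ProviderTransactionRef)`; the controller action, `InboundCredit` entity, and `_balances` service are illustrative names:

```csharp
using Microsoft.AspNetCore.Mvc;
using Microsoft.EntityFrameworkCore;

[HttpPost("webhooks/credits")]
public async Task<IActionResult> HandleCredit([FromBody] ProviderCreditEvent evt)
{
    // Record the event and apply the balance credit in one transaction,
    // so we can never acknowledge an event without having applied it.
    await using var tx = await _db.Database.BeginTransactionAsync();

    _db.InboundCredits.Add(new InboundCredit
    {
        CompanyId = evt.CompanyId,
        ProviderTransactionRef = evt.ProviderTransactionRef,
        Amount = evt.Amount
    });

    try
    {
        await _db.SaveChangesAsync();
        await _balances.CreditAsync(evt.CustomerAccountId, evt.Amount);
        await tx.CommitAsync();
    }
    catch (DbUpdateException)
    {
        // Unique index hit: duplicate delivery. Acknowledge so the provider
        // stops retrying, but don't credit the balance a second time.
    }

    return Ok();
}
```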
The outbound side: when a customer's account gets credited, we need to notify the company that integrated with us so their app can update. That notification needs to be reliable. We sign every payload with HMAC-SHA512 (alongside options like bearer token auth) so companies can verify authenticity. We retry on failed delivery. We track delivery status per event with the HTTP response code and timestamp.
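The signature itself is the simplest part of the contract. Roughly, with an illustrative header name and secret handling:

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

static string SignWebhookPayload(string jsonBody, string companySecret)
{
    // HMAC-SHA512 over the exact bytes we send; the company recomputes
    // this on receipt and compares before trusting the event.
    using var hmac = new HMACSHA512(Encoding.UTF8.GetBytes(companySecret));
    var hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(jsonBody));
    return Convert.ToHexString(hash); // e.g. sent in an X-Signature header
}
```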
The part I underestimated was the operational surface area. Companies need to resend missed events after their systems recover from downtime. They need to test their webhook handlers without real transactions. They need to see delivery history. Every one of those is an endpoint I didn't budget for initially, and collectively they took as long as the core delivery mechanism.
Webhooks are not just sending HTTP requests. They're a reliability contract.
**The Problem I Solved Three Times: Multi-Tenancy**
When I say this platform is multi-tenant, I mean complete data isolation. Company A cannot see Company B's customers, accounts, transactions, or payouts. Not through a bug. Not through a misconfigured query. Not ever.
I solved this at three layers:
Database layer: Every transactional entity has a CompanyId foreign key. Not as a suggestion; as a constraint. Queries without a CompanyId filter don't return results; they fail.
Service layer: I built ICompanyContext, an interface injected into every service that carries the company identity from the authenticated token. Every service method that touches data calls through the context. There is no "global query" method that bypasses it.
Authentication layer: Companies authenticate with time-limited bearer tokens. Those tokens encode the company identity as a claim. CompanyContextMiddleware extracts and validates that claim before any request reaches a controller. There is no path to service code that hasn't been through the context middleware.
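A minimal sketch of those last two layers, with illustrative claim and type names:

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;

public interface ICompanyContext
{
    Guid CompanyId { get; }
}

// Scoped implementation that the middleware populates once per request.
public class CompanyContext : ICompanyContext
{
    public Guid CompanyId { get; set; }
}

public class CompanyContextMiddleware
{
    private readonly RequestDelegate _next;
    public CompanyContextMiddleware(RequestDelegate next) => _next = next;

    public async Task InvokeAsync(HttpContext http, CompanyContext context)
    {
        // No valid company claim, no controller: the request dies here.
        var claim = http.User.FindFirst("company_id")?.Value;
        if (claim is null || !Guid.TryParse(claim, out var companyId))
        {
            http.Response.StatusCode = StatusCodes.Status401Unauthorized;
            return;
        }

        context.CompanyId = companyId;
        await _next(http);
    }
}
```

On the database layer, an EF Core global query filter keyed on the same context is one way to make unfiltered queries impossible rather than merely discouraged.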
The reason I solved it three times is that each layer catches a different class of mistake. The database constraints catch application bugs. The service layer catches architectural drift, a new developer who doesn't know to filter by company. The authentication layer catches authorization failures, a token from one company being used to access another.
Defense in depth is not paranoia in financial software. It's the minimum.
**The Thing I Got Wrong: Abstracting Too Late**
The second virtual account provider I integrated was BankB. By the time I integrated it, the first version of the provider abstraction was already in place. Adding it took about a week, most of which was reading their API documentation and handling their credential format.
What I got wrong was building the abstraction after the first provider was already integrated rather than before. I had to retrofit the interface onto an implementation that had been written with implicit assumptions about what a provider looks like. Some of those assumptions were wrong; they were actually provider-specific behaviors I'd treated as universal.
The refactor wasn't catastrophic. But it was a week of careful surgery that could have been two days of deliberate design if I'd started with the interface. When you're building against an external API under time pressure, it's tempting to optimize for the working implementation rather than the abstraction. That debt comes due when you need the second provider.
If I were starting over, I'd write the interface first even if I only had one provider and force every piece of service code to depend on the interface from day one.
**What Building Payments Taught Me About Correctness**
Most software has some tolerance for imprecision. A bug in a recommendation system means a user sees a slightly less relevant result. A bug in a search ranking means a slightly less optimal ordering. Users don't always notice. Systems recover.
A bug in a payments system means real money moved somewhere it shouldn't have. Users notice immediately. Recovery requires manual intervention, reconciliation, and sometimes regulatory disclosure.
This forces a different mentality than most software engineering work I'd done before. I stopped thinking about correctness as "the tests pass" and started thinking about it as "the system maintains its invariants under every failure mode I can enumerate." That's a higher bar. It's also a more interesting engineering problem.
The idempotency system, the double-entry ledger, the webhook deduplication, the distributed job locking: none of these feel like features. They feel like correctness properties. Things the system must have to be considered correct at all, rather than optional improvements.
I think that reframe from features to correctness properties is the most useful thing I took out of this project.
**What's Next**
The platform is in production. Companies are onboarding. Real money is moving through it.
What's next is the backlog that accumulates when you've built something that works and now want it to work better.
If you're building fintech infrastructure, especially in emerging markets where the provider ecosystem is fragmented and reliability is uneven, I'm happy to talk about what I learned. The comment section works. So does a DM.