What was wrong with v1?
Three things.
The networking was over-engineered. I built a two-node WireGuard setup where a CloudWatch alarm would trigger a Lambda to move an Elastic IP from the primary EC2 node to a standby node on failure. Looked great on paper. Then I found a gap I had documented myself: the failover Lambda moved the Elastic IP but never updated the VPC route table. The route table still pointed at EC2 node 0's ENI after failover. That means traffic from Lambda would still hit the old interface. What I had was not high availability. It was a more complicated single point of failure.
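To make the gap concrete, here is a minimal sketch of what a complete failover step would have needed to do. This is not the v1 Lambda: the function name, its parameters, and any resource IDs passed to it are hypothetical, and `ec2` is assumed to be a boto3 EC2 client (or any object exposing the same two methods).

```python
def fail_over(ec2, allocation_id, standby_eni_id, route_table_id, tunnel_cidr):
    """Move the Elastic IP and repoint the VPC route at the standby node.

    Hypothetical sketch: `ec2` is assumed to be a boto3 EC2 client, and
    `tunnel_cidr` stands for whatever destination the VPC routes through
    the WireGuard gateway.
    """
    # Step 1 (what v1 did): reassociate the Elastic IP with the standby ENI.
    ec2.associate_address(
        AllocationId=allocation_id,
        NetworkInterfaceId=standby_eni_id,
        AllowReassociation=True,
    )
    # Step 2 (what v1 skipped): point the existing route at the standby ENI,
    # so traffic from inside the VPC stops targeting the failed node.
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=tunnel_cidr,
        NetworkInterfaceId=standby_eni_id,
    )
```

Without the second call, only traffic addressed to the public IP follows the failover; anything routed inside the VPC keeps targeting the old interface.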
The application had no real financial data model. PayCore had one table: transaction. You could queue a payment, Lambda would validate it, status would get patched back. That demonstrates async processing. It does not demonstrate a payment backend. A payment backend needs accounts, ledger entries, and idempotency. Without those three things you are building a webhook relay, not a financial system.
There was no observability. I could deploy the thing but I could not show I was able to operate it. Every Cloud and SRE role I have applied to at financial institutions, telcos, or larger tech companies lists Prometheus and Grafana. My project had nothing.
v2 fixes all three.
The networking got simpler
I removed the dual-EC2 failover setup entirely. PayCore now runs a single EC2 instance as the WireGuard gateway, one public subnet, one route table. The Elastic IP is still there for address stability. There is no failover automation.
This is the honest call. At homelab scale, the two-node setup added operational complexity without actually delivering HA because of the route table gap. At real enterprise scale, the right answer is AWS Site-to-Site VPN or AWS Transit Gateway with redundant tunnels. Both handle route propagation automatically. AWS Site-to-Site VPN is around $0.05 per hour per connection, roughly $36 a month. The cost of managing WireGuard HA yourself, patching it, monitoring tunnel health, scripting route updates, and testing failover paths, does not make sense when a managed service exists that handles all of that.
I am genuinely curious how people running hybrid infrastructure in financial services or telco handle this in practice. If you have done it at scale, I want to know what the actual operational overhead looked like.
The app now has a proper financial data model
Three tables got added: accounts, ledger_entry, and idempotency_keys.
accounts is scoped per merchant per currency. Before a merchant can submit an NGN payment, they need an NGN account. Currency isolation is enforced at the data layer, not just the API.
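One way to express that scoping is a uniqueness constraint on the merchant and currency pair, plus a lookup that a payment must pass before it is accepted. The sketch below uses the standard-library sqlite3 module instead of the project's actual database layer, and the column names and the `find_account` helper are assumptions for illustration.

```python
import sqlite3


def create_schema(conn):
    # One account per (merchant, currency) pair, enforced by the database.
    conn.execute(
        """
        CREATE TABLE accounts (
            id INTEGER PRIMARY KEY,
            merchant_id TEXT NOT NULL,
            currency TEXT NOT NULL,
            UNIQUE (merchant_id, currency)
        )
        """
    )


def find_account(conn, merchant_id, currency):
    """Return the account id for this merchant and currency, or None."""
    row = conn.execute(
        "SELECT id FROM accounts WHERE merchant_id = ? AND currency = ?",
        (merchant_id, currency),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
create_schema(conn)
conn.execute("INSERT INTO accounts (merchant_id, currency) VALUES ('m1', 'NGN')")
```

Here `find_account(conn, "m1", "USD")` returns `None`, so a USD payment from that merchant can be rejected before anything is queued.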
ledger_entry records every credit and debit against an account. Balance is computed by summing across entries, not stored as a column. Storing a running balance as a column creates race conditions under concurrent writes and makes it harder to audit individual movements.
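from sqlalchemy import select
from sqlalchemy.orm import Session

# LedgerEntry and LedgerMovement are the project's own ORM model and enum.


def get_balance(db: Session, account_id: str):
    """Derive the balance by summing every ledger entry for the account."""
    rows = db.execute(
        select(LedgerEntry).where(LedgerEntry.account_id == account_id)
    ).scalars().all()
    balance = 0.0
    for row in rows:
        # Credits add to the balance; any other direction is treated as a debit.
        if row.movement_direction == LedgerMovement.credit:
            balance += row.amount
        else:
            balance -= row.amount
    return balance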
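The same derivation can also be pushed into the database as a single aggregate, which avoids loading every row into Python. This is a sketch of that alternative, not the project's code: it uses the standard-library sqlite3 module rather than PostgreSQL, and the column names and the choice to store amounts as integer minor units are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ledger_entry ("
    " account_id TEXT NOT NULL,"
    " movement_direction TEXT NOT NULL"
    " CHECK (movement_direction IN ('credit', 'debit')),"
    " amount INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO ledger_entry VALUES (?, ?, ?)",
    [("acc-1", "credit", 5000), ("acc-1", "debit", 1200), ("acc-2", "credit", 900)],
)


def get_balance_sql(conn, account_id):
    """Sum credits as positive and debits as negative in one query."""
    row = conn.execute(
        "SELECT COALESCE(SUM(CASE WHEN movement_direction = 'credit'"
        " THEN amount ELSE -amount END), 0)"
        " FROM ledger_entry WHERE account_id = ?",
        (account_id,),
    ).fetchone()
    # COALESCE makes an account with no entries report 0 instead of NULL.
    return row[0]
```

With the sample rows above, `get_balance_sql(conn, "acc-1")` is 5000 minus 1200, so 3800 minor units.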
idempotency_keys stores the response body against a client-supplied idempotency key, sent as a request header. If the same key arrives twice from the same merchant, the stored response comes back and the payment is not processed a second time. Network retries are real. A duplicate charge is a support incident. This is not optional in payment systems.
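The replay behaviour can be sketched in a few lines. This is an in-memory stand-in for the idempotency_keys table, not the project's implementation; the class and method names are hypothetical.

```python
class IdempotencyStore:
    """In-memory stand-in for the idempotency_keys table (hypothetical)."""

    def __init__(self):
        self._responses = {}

    def run_once(self, merchant_id, key, handler):
        """Run handler the first time a (merchant, key) pair is seen.

        Later calls with the same pair return the stored response
        without calling handler again.
        """
        scope = (merchant_id, key)
        if scope in self._responses:
            return self._responses[scope]
        response = handler()
        self._responses[scope] = response
        return response


store = IdempotencyStore()
calls = []


def charge():
    # Stands in for queueing the payment; records how often it really ran.
    calls.append(1)
    return {"status": "queued", "attempt": len(calls)}


first = store.run_once("m1", "key-1", charge)
retry = store.run_once("m1", "key-1", charge)
```

After the retry, `charge` has run once and both calls return the same body. The sketch ignores two first requests racing on the same key; a database-backed version would typically lean on a unique constraint for that.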
Prometheus and Grafana, finally
Four exporters running in the same Docker Compose stack:
- prometheus-fastapi-instrumentator instruments the FastAPI app at startup and exposes /metrics. Prometheus scrapes it every 10 seconds. You get request count, latency histograms, and error rates per endpoint without writing custom metric code.
- node-exporter for host-level metrics: CPU, memory, disk, network. It runs with pid: host and sees the host's process tree from inside the container.
- cAdvisor (Container Advisor) for per-container resource usage. Reads directly from the Docker daemon.
- postgres-exporter for database internals: connection counts, query duration, table sizes.
Grafana is provisioned entirely through code. Datasource and dashboard JSON are mounted via Docker volumes at startup. No manual UI setup. Tear it down, redeploy, dashboards come back.
volumes:
  - grafana_data:/var/lib/grafana
  - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
Four alerts in alerts.yml: API down, PostgreSQL down, host CPU above 85% for five minutes, disk below 15% free. Alertmanager is not deployed in the dev setup so these do not deliver anywhere right now. In production this routes to PagerDuty or SNS.
The reason I cared about this specifically: most of the Cloud and SRE roles I have looked at in finance, oil and gas, and telco list observability. The day-to-day work in those environments is not architecture. It is watching dashboards, chasing slow queries, correlating an API error spike with a disk event. This stack is proof I can do that, not just deploy infrastructure.
What is still not production-ready
- Terraform state is local, on disk. Production uses S3 backend with DynamoDB locking.
- No Alertmanager target is configured, to keep things simple.
- No settlement layer. The ledger tracks movements, but there is no reconciliation job, no settlement engine, and no balance sheet. The goal of this project was the infrastructure; implementing the ledger application layer was a good opportunity to go deeper than I normally would.
- Secrets Manager recovery window is 0. Fast teardown for dev. Production minimum is 7 days.
Code is public: github.com/escanut/paycore-hybrid-payment-infra
Victor Ojeje, Cloud and DevOps Engineer
LinkedIn | Dev.to | ojejevictor@gmail.com