It has been a while since I posted here. One of my previous articles, on GitHub Actions OIDC, got picked up and reposted by Dev.to's own X account, which I did not expect. That was a good signal to keep writing, but between job applications, AWS SAA-C03 prep, and a few client things, the writing fell off.
The reason I am back now is a specific realisation I had while reviewing my own portfolio: every single project I have built is greenfield AWS. Clean VPCs, Lambda functions, managed services from day one. There is nothing wrong with that, but the real world is not greenfield. Most companies are running something on-premises, in a data centre, or on a server that someone's CTO refused to migrate five years ago. If I cannot architect across that boundary, I am limited.
So I built PayCore: a hybrid cloud B2B payment middleware that runs a FastAPI application on an on-premises Proxmox VM, bridged to AWS serverless infrastructure via WireGuard VPN. This is not a tutorial scaffold. The infrastructure, secrets management, networking, and deployment pipeline all follow production patterns. I will walk through the architecture, the security decisions, and the parts I deliberately left imperfect and why.
What This Project Actually Is
PayCore accepts payment requests from merchants over HTTPS, tokenises the card number, writes the transaction to a PostgreSQL database, queues it to AWS SQS (Simple Queue Service), and returns a queued status immediately. An AWS Lambda function picks up the message, archives the raw payload to S3, runs fraud detection logic, and PATCHes the transaction status back to the on-premises API over the VPN tunnel.
The on-prem VM has no public IP. AWS Lambda never touches the public internet to reach the API. The only path in is the WireGuard tunnel through the EC2 nodes acting as VPN gateways.
This is a hybrid cloud architecture of the kind most enterprises actually run, not the simplified "everything in AWS" patterns you see in certifications. It covers secure cross-boundary networking, IAM (Identity and Access Management) scoping, event-driven async processing, secrets management, and automated failover. The constraints are real: PAN tokenisation so raw card numbers never hit the database or message queue, callback authentication to prevent spoofed status updates, and a KMS (Key Management Service) Customer Master Key encrypting both S3 objects and Secrets Manager entries.
One core problem I encountered was getting Lambda to reach a private API on a Proxmox VM with no public IP. The solution is a two-node WireGuard setup in which the EC2 instances act purely as VPN gateways; no application logic runs on them. Lambda sits inside a VPC (Virtual Private Cloud) with a route table entry pointing 10.10.0.0/24 at EC2 node 0's ENI (Elastic Network Interface). The API port binds exclusively to the WireGuard interface on the VM.
Network Topology
A merchant sends an HTTPS request which is first handled by Cloudflare. The request passes through a Cloudflare Tunnel into an Nginx reverse proxy, which forwards it to a FastAPI application running on a Proxmox virtual machine with no public IP address.
When the FastAPI service processes the request, it publishes a message to Amazon Simple Queue Service (SQS). This triggers an AWS Lambda function running inside a Virtual Private Cloud (VPC).
The Lambda function needs to communicate back to the FastAPI service on the Proxmox VM. To achieve this, the VPC route table contains a rule that directs traffic for the subnet 10.10.0.0/24 to the Elastic Network Interface (ENI) of the currently active EC2 instance.
This EC2 instance acts as a WireGuard gateway. Traffic from Lambda is routed to this instance, then forwarded through a WireGuard tunnel to the Proxmox VM. The FastAPI service receives the request and processes the PATCH call to update the transaction status.
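To make the routing rule concrete, here is a minimal Python sketch of how a route table picks a target by longest-prefix match. Only the 10.10.0.0/24 WireGuard subnet comes from the setup described above; the VPC CIDR, the ENI ID, and the `next_hop` helper are hypothetical placeholders for illustration.

```python
import ipaddress
from typing import Optional

# Hypothetical route table: the VPC CIDR and ENI ID are placeholders.
# Only 10.10.0.0/24 (the WireGuard subnet) comes from the article.
ROUTES = {
    "172.31.0.0/16": "local",
    "10.10.0.0/24": "eni-0123456789abcdef0",  # EC2 node 0, the WireGuard gateway
}


def next_hop(destination: str, routes: dict = ROUTES) -> Optional[str]:
    """Return the target of the most specific route that matches destination."""
    addr = ipaddress.ip_address(destination)
    matches = []
    for cidr, target in routes.items():
        network = ipaddress.ip_network(cidr)
        if addr in network:
            matches.append((network.prefixlen, target))
    if not matches:
        return None  # no matching route, so the packet has nowhere to go
    # The longest prefix wins, which is how a VPC route table resolves overlaps.
    return max(matches)[1]
```

With this table, a call from Lambda to 10.10.0.1 resolves to the gateway ENI, while traffic to other addresses inside the VPC stays on the local route.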
Failover Mechanism
Two EC2 instances are configured as WireGuard gateways:
- EC2 node 0 is the primary and holds the Elastic IP
- EC2 node 1 is the standby and does not hold the Elastic IP unless failover occurs
Amazon CloudWatch monitors the health of the primary node. When it detects two consecutive failed health checks, it triggers a failover Lambda function.
This function detaches the Elastic IP from the primary node and attaches it to the standby node. Because the WireGuard configuration uses the Elastic IP as its endpoint, the tunnel automatically reconnects to the new active node.
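The VPC route table does not change during this process. It continues to route 10.10.0.0/24 to EC2 node 0's ENI, while the public-facing identity shifts between EC2 instances. That keeps the failover function small, but it also means Lambda-to-on-prem traffic is still aimed at node 0 after a failover, which is one of the gaps I cover under What I Deliberately Left Imperfect.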
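The Elastic IP move can be sketched in a few lines. This is an illustration, not the project's actual failover function: `move_elastic_ip` and its parameters are hypothetical names, `ec2` is assumed to behave like a boto3 EC2 client, and the sketch uses a single re-association call with `AllowReassociation=True` rather than a separate detach and attach.

```python
def move_elastic_ip(ec2, allocation_id: str, standby_instance_id: str) -> str:
    """Re-associate the Elastic IP with the standby gateway instance.

    `ec2` is assumed to be a boto3 EC2 client (or anything with the same
    associate_address method). Returns the new association ID.
    """
    response = ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_instance_id,
        # Take over the address even though it is still attached to the primary.
        AllowReassociation=True,
    )
    return response["AssociationId"]
```

Passing the client in as an argument keeps the function easy to exercise with a stub, which matters for code that only runs for real when a node is already down.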
Key Design Behavior
The system separates two concerns:
- Private routing inside the VPC remains constant and does not require updates during failover
- Public connectivity is abstracted through the Elastic IP, which can be reassigned between instances
This design avoids modifying route tables during failure events and allows faster recovery since only the public endpoint changes while internal routing stays stable.
Node Roles
- Proxmox VM at 10.10.0.1 runs the FastAPI application and initiates the WireGuard tunnel
- EC2 node 0 at 10.10.0.2 acts as the primary WireGuard gateway with the Elastic IP
- EC2 node 1 at 10.10.0.3 acts as the standby gateway and takes over during failover
Security Decisions Worth Explaining
PAN Tokenisation
The PAN (Primary Account Number, the 16-digit card number) is tokenised on arrival before it touches the database or SQS. What gets stored and queued is a UUID token and a masked string like **** **** **** 1111. The raw number is used once, validated, masked, and discarded. This is the minimum floor for anything handling card data.
import uuid


def tokenise_pan(pan: str) -> tuple[str, str]:
    # A random token stands in for the card number; only the last four digits survive.
    token = str(uuid.uuid4())
    masked = "**** **** **** " + pan[-4:]
    return token, masked
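The validation step is not shown in `tokenise_pan`. One common choice is a Luhn checksum, sketched here as an assumption about what that validation could look like. `luhn_valid` is a hypothetical helper, not a function taken from the PayCore repository.

```python
def luhn_valid(pan: str) -> bool:
    """Return True if pan is 13 to 19 ASCII digits and passes the Luhn checksum."""
    digits = pan.replace(" ", "")
    if not (digits.isascii() and digits.isdigit()):
        return False
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9  # same as adding the two digits of the product
        total += value
    return total % 10 == 0
```

A check like this would run before tokenisation, so a mistyped number is rejected before a token is ever issued.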
Dual Authentication: JWT and API Key
Two separate auth flows exist intentionally. Merchant-facing endpoints use JWT (JSON Web Token) with HS256, issued on login and validated via FastAPI dependency injection. The Lambda callback endpoint (PATCH /transactions/{token}/status) uses a separate X-API-Key header, validated against a value fetched from AWS Secrets Manager at Lambda startup.
These are intentionally different because they protect against different threats. A compromised JWT does not give an attacker the ability to spoof transaction statuses. A leaked API key cannot be used to enumerate merchant transactions.
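For the callback key, the comparison itself is worth a note. A minimal sketch, assuming the expected key has already been loaded from configuration: `api_key_matches` is a hypothetical helper, and in FastAPI it would typically sit inside a dependency that rejects the request when it returns False.

```python
import hmac


def api_key_matches(presented: str, expected: str) -> bool:
    """Compare the X-API-Key header value with the expected secret.

    hmac.compare_digest takes time independent of where the inputs differ,
    so response timing does not reveal how much of the key was guessed.
    """
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

A plain `==` would also work functionally; the constant-time comparison is a cheap way to avoid leaking information through timing.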
bcrypt_sha256 Instead of bcrypt
Standard bcrypt silently truncates passwords at 72 bytes. A password longer than 72 bytes hashes identically to its first 72 bytes, and because multi-byte UTF-8 characters count for more than one byte, that limit can be reached in far fewer than 72 characters. bcrypt_sha256 from passlib pre-hashes with SHA-256 before bcrypt, eliminating the truncation issue entirely. This is a real vulnerability, not a theoretical one. It affects production systems using bcrypt with long or Unicode-heavy passwords.
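The idea behind the pre-hash can be shown with the standard library alone. This is a simplified sketch of the concept, not passlib's exact internals (its real scheme differs in detail), and `prehash` is a hypothetical name.

```python
import base64
import hashlib

BCRYPT_MAX_BYTES = 72


def prehash(password: str) -> bytes:
    """SHA-256 the password, then base64-encode the digest (always 44 bytes)."""
    digest = hashlib.sha256(password.encode("utf-8")).digest()
    return base64.b64encode(digest)


short = "a" * 72
longer = short + "extra-suffix"

# Plain bcrypt only sees the first 72 bytes, and those are identical here.
same_bcrypt_input = (
    short.encode("utf-8")[:BCRYPT_MAX_BYTES] == longer.encode("utf-8")[:BCRYPT_MAX_BYTES]
)

# After pre-hashing, the two inputs differ and both fit well inside 72 bytes.
different_after_prehash = prehash(short) != prehash(longer)
```

Because the pre-hash output has a fixed length below the limit, every byte of the original password influences what bcrypt finally sees.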
IAM Scoping
The Lambda validator role has explicit, resource-scoped permissions:
- SQS permissions scoped to the single queue ARN
- S3 PutObject scoped to arn:aws:s3:::${bucket_name}/transactions/*
- Secrets Manager access scoped to paycore/internal/config*
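The EC2 permissions (DescribeInstances, DescribeNetworkInterfaces) are wildcarded. That is a known gap, documented in the README. EC2 Describe* actions do not support resource-level permissions, so tightening this in production means constraining them with condition keys such as ec2:Region, or isolating them in a separate narrowly used role, rather than listing specific resource ARNs.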
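The scoped statements can be written out as a policy document. The sketch below builds one in Python for illustration; the real project defines its policies in Terraform. The S3 and Secrets Manager resource patterns come from the list above, while the specific SQS and Secrets Manager action names, the Sid values, and the `validator_policy` function are assumptions.

```python
def validator_policy(queue_arn: str, bucket_name: str, region: str, account_id: str) -> dict:
    """Build a resource-scoped IAM policy document for the validator Lambda."""
    secret_arn = (
        f"arn:aws:secretsmanager:{region}:{account_id}:secret:paycore/internal/config*"
    )
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ConsumeQueue",
                "Effect": "Allow",
                # Assumed action set: what an SQS-triggered Lambda typically needs.
                "Action": [
                    "sqs:ReceiveMessage",
                    "sqs:DeleteMessage",
                    "sqs:GetQueueAttributes",
                ],
                "Resource": queue_arn,
            },
            {
                "Sid": "ArchivePayloads",
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket_name}/transactions/*",
            },
            {
                "Sid": "ReadConfig",
                "Effect": "Allow",
                "Action": "secretsmanager:GetSecretValue",  # assumed action
                "Resource": secret_arn,
            },
        ],
    }
```

The point of the shape is that no statement uses a bare `*` resource: each permission names the one queue, prefix, or secret path it applies to.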
KMS Encryption
A single Customer Master Key (CMK) with key rotation enabled covers both S3 server-side encryption and Secrets Manager entries. The KMS key policy explicitly grants decrypt access to the Lambda IAM role, rather than relying on root account inheritance. The S3 bucket also has bucket_key_enabled = true, which lets S3 create object data keys from a short-lived bucket-level key instead of calling KMS for every object. That reduces KMS request volume and the cost that comes with it.
The Deployment Pipeline
Provisioning runs in three layers.
1. Build the Proxmox golden image
A cloud-init Ubuntu 22.04 VM is provisioned with Ansible: Docker Engine, Docker Compose plugin, WireGuard tools, SSH hardening, QEMU guest agent. Once provisioned, it is converted to an immutable Proxmox template with qm template. Any VM cloned from that template is pre-hardened and Docker-ready without running a single extra apt command.
2. Provision AWS infrastructure
Terraform manages the full AWS side: VPC, subnets, route tables, security groups, EC2 WireGuard nodes, Elastic IP, SQS queue with DLQ (Dead Letter Queue), SNS (Simple Notification Service) topic, S3 bucket, KMS key, Lambda functions, Secrets Manager secrets, and CloudWatch alarm. The layout is modular across five directories: networking, compute, kms, messaging, storage. Each module has its own variables and outputs. aws-infra/dev/main.tf wires them together.
3. Configure on-prem and deploy the app
A pair of Ansible playbooks (configure.yml, then deploy_app.yml) handles WireGuard setup on the VM, fetches secrets from Secrets Manager, git-pulls the repository, and brings up the Docker Compose stack. The final health check hits http://10.10.0.1:80/api/health over the VPN tunnel. The playbook does not exit cleanly until that endpoint returns 200.
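The playbook itself is Ansible, but the gate it enforces is easy to express in Python for illustration. In this sketch only the health URL comes from the text above; the retry count, delay, and the function names `http_status` and `wait_until_healthy` are assumptions.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://10.10.0.1:80/api/health"  # endpoint named in the article


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, timeout, tunnel down, and so on


def wait_until_healthy(url=HEALTH_URL, attempts=10, delay=3.0,
                       probe=http_status, sleep=time.sleep) -> bool:
    """Poll url until it returns 200 or the attempts run out."""
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

A deployment script would exit non-zero when `wait_until_healthy()` returns False, which mirrors the playbook refusing to finish until the API answers over the tunnel.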
Fraud Detection Flow
Lambda processes one SQS record per invocation (batch_size = 1). Before any fraud check runs, the raw payload is archived to S3. The archive happens before the decision, not after, so both flagged and processed transactions have an immutable record regardless of outcome.
def check_fraud(body: dict) -> bool:
    # Flag anything above the 10M NGN threshold.
    if body["amount"] > 10_000_000:
        return True
    # Flag currencies outside the supported set.
    if body.get("currency") not in ("NGN", "USD", "EUR"):
        return True
    return False
If fraud is detected, Lambda publishes to the SNS topic (email alert) and PATCHes the status to flagged. Otherwise it PATCHes processed. The SQS queue has a 3-retry DLQ policy. If Lambda fails three times on a given message, the message moves to the DLQ for investigation rather than being silently dropped.
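The ordering described above (archive, then decide, then call back) can be sketched as a single function. This is not the project's handler: `process_record` is a hypothetical name, the S3, SNS, and HTTP calls are passed in as plain callables so the flow stays visible, and the `token` field name in the message body is an assumption.

```python
import json
from typing import Callable


def process_record(
    record: dict,
    archive: Callable[[str, str], None],
    is_fraud: Callable[[dict], bool],
    alert: Callable[[dict], None],
    patch_status: Callable[[str, str], None],
) -> str:
    """Handle one SQS record: archive first, then decide, then call back."""
    raw = record["body"]  # SQS delivers the message body as a string
    body = json.loads(raw)
    token = body["token"]  # assumed field name for the transaction token

    # Archive before any decision so every outcome has a stored payload.
    archive(token, raw)

    if is_fraud(body):
        alert(body)  # e.g. publish to the SNS topic
        status = "flagged"
    else:
        status = "processed"

    # Report the outcome back to the on-prem API over the VPN.
    patch_status(token, status)
    return status
```

If any step raises, the invocation fails and SQS redelivers the message, which is what lets the DLQ policy catch records that keep failing.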
What I Deliberately Left Imperfect
Honest projects document their gaps. These are in the README and I will repeat them here.
Local Terraform state. dev.tfstate is on disk. Production requires an S3 backend with DynamoDB state locking to prevent concurrent apply conflicts.
Route table failover gap. The failover Lambda moves the Elastic IP to the standby EC2 node, but the VPC route table entry still points at EC2 node 0's ENI. Full failover requires additional Lambda logic to update the route table. There is currently a window where the Elastic IP is on node 1 but VPN traffic is still routing through node 0's interface.
S3 Object Lock disabled. force_destroy = true for fast teardown in dev. Production would use COMPLIANCE mode with a minimum 365-day retention period, consistent with PCI DSS Requirement 10.7 on audit log retention.
No ledger engine. PayCore validates and processes payment events. It does not maintain merchant balances, settlement records, or fund movement. A production payment platform needs double-entry bookkeeping, reconciliation, and settlement logic sitting on top of this.
Secrets Manager recovery window is 0. Immediate deletion enabled for fast iteration in dev. Production minimum is 7 days.
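For the route table failover gap listed above, the missing step could look roughly like this. It is a sketch of one possible fix, not code from the repository: `repoint_vpn_route` and its parameters are hypothetical, and `ec2` is assumed to behave like a boto3 EC2 client.

```python
def repoint_vpn_route(ec2, route_table_id: str, standby_eni_id: str,
                      vpn_cidr: str = "10.10.0.0/24") -> None:
    """Point the VPN subnet route at the standby gateway's network interface.

    `ec2` is assumed to be a boto3 EC2 client (or anything exposing the same
    replace_route method).
    """
    ec2.replace_route(
        RouteTableId=route_table_id,
        DestinationCidrBlock=vpn_cidr,
        NetworkInterfaceId=standby_eni_id,
    )
```

Called from the failover function right after the Elastic IP move, this would close the window in which the public endpoint is on node 1 but private traffic still targets node 0.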
What the Stack Demonstrates
This project covers hybrid cloud networking (WireGuard VPN bridging AWS and on-prem), event-driven async processing (SQS to Lambda with DLQ), layered authentication (JWT plus API key separation), KMS-backed secrets management, immutable infrastructure patterns (Proxmox golden image via cloud-init and Ansible), and automated failover via CloudWatch alarms. The architecture decisions are reasoned out, and the gaps are explicitly documented, not hidden.
The code is public: github.com/escanut/paycore-hybrid-payment-infra
Victor Ojeje, Cloud and DevOps Engineer LinkedIn | Dev.to | ojejevictor@gmail.com