How we slashed a fintech's AWS bill by 65% with open source infrastructure
A European fintech was hemorrhaging €28,000 monthly on AWS for processing 2.3M transactions. Six months later, they were spending €9,800 for the same workload with better performance. Here's the engineering breakdown.
The problem: classic cloud cost spiral
The fintech ran 40 microservices across AWS with PCI DSS and GDPR requirements. Their architecture looked standard on paper, but the monthly bills told a different story.
Compute waste everywhere:
- 60 EC2 instances running 24/7
- CPU utilization: 23% peak, 8% overnight
- Only 30% reserved instances (paying on-demand for predictable workloads)
Storage bleeding money:
- 2.4TB of PostgreSQL logs per month with no retention policy
- 800GB application logs stored indefinitely
- 15TB of accumulated EBS snapshots
Network transfer costs:
- €3,200/month in cross-AZ microservices chatter
- NAT gateway charges for external API calls
The kicker? Their workloads were completely predictable. Payment processing peaked 9 AM to 6 PM weekdays. Fraud detection ran nightly batches. Customer onboarding spiked during monthly marketing campaigns.
The solution: sovereign open source stack
Instead of AWS optimization theater, we built a dedicated stack using:
- Proxmox: Virtualization and cluster management
- Ceph: Distributed storage with built-in redundancy
- OpenStack: Cloud APIs without vendor lock-in
- Kubernetes: Efficient resource sharing
Implementation highlights
Hardware foundation:
6 bare-metal servers in Frankfurt: 64 cores, 256GB RAM, 4TB NVMe each.
Smart Ceph storage tiering:
# Hot transaction data on NVMe, replicated three ways
ceph osd pool create transactions 128 128 replicated
ceph osd pool set transactions size 3
# Cold analytics data with erasure coding (define the profile before the pool that uses it)
ceph osd erasure-code-profile set ec-profile k=4 m=2
ceph osd pool create analytics 64 64 erasure ec-profile
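To see what the two pool types cost in raw disk, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers from this post (6 servers, 4TB NVMe each, 3x replication, k=4/m=2 erasure coding). The class and function names are made up for illustration, and the figures ignore Ceph metadata overhead and near-full safety margins, so treat them as upper bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    """Raw storage of the cluster described in the post."""
    nodes: int
    tb_per_node: float

    @property
    def raw_tb(self) -> float:
        return self.nodes * self.tb_per_node


def replicated_efficiency(size: int) -> float:
    """Usable fraction of raw space when every object is stored `size` times."""
    return 1 / size


def erasure_efficiency(k: int, m: int) -> float:
    """Usable fraction of raw space with k data chunks and m coding chunks."""
    return k / (k + m)


cluster = Cluster(nodes=6, tb_per_node=4.0)
REPLICATED_EFF = replicated_efficiency(3)   # transactions pool, size 3
ERASURE_EFF = erasure_efficiency(4, 2)      # analytics pool, k=4 m=2

# Upper bounds if the whole cluster were given to a single pool type.
print(f"raw capacity:        {cluster.raw_tb:.0f} TB")
print(f"all replicated (x3): {cluster.raw_tb * REPLICATED_EFF:.0f} TB usable")
print(f"all erasure (4+2):   {cluster.raw_tb * ERASURE_EFF:.0f} TB usable")
```

The point of the tiering is visible in the ratios: replication keeps one third of raw space usable, while 4+2 erasure coding keeps two thirds and still tolerates two lost chunks. With k+m=6 and a host-level failure domain, each chunk can land on a different one of the six servers.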
Resource-aware Kubernetes scheduling:
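apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payment-api-hpa
spec:
  # Assumed target: the original snippet omits scaleTargetRef, which the HPA requires.
  # "payment-api" is a placeholder Deployment name.
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-api
  minReplicas: 2
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70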
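For intuition about how that autoscaler behaves, the core rule from the Kubernetes documentation is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds. The sketch below applies it with the values from the manifest (70% target, 2 to 12 replicas). The function name is illustrative, and the sketch deliberately leaves out the controller's default 10% tolerance band and its scale-down stabilization window.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 70,
    min_replicas: int = 2,
    max_replicas: int = 12,
) -> int:
    """Simplified HPA scaling rule for a CPU utilization target.

    Utilization values are percentages of the pods' CPU requests,
    averaged across all pods of the target workload.
    """
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    # Clamp to the bounds configured in the manifest.
    return max(min_replicas, min(max_replicas, raw))


# Daytime payment peak: 4 pods averaging 90% CPU
print(desired_replicas(4, 90))
# Overnight lull: 4 pods averaging 20% CPU
print(desired_replicas(4, 20))
```

With these inputs the first call scales out to 6 pods and the second falls back to the 2-pod floor, which is how the stack stops paying for idle capacity overnight.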
Migration strategy:
We built the new stack in parallel, migrated non-critical services first, then moved payment processing during a 47-minute maintenance window using PostgreSQL logical replication.
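A cutover like this usually hinges on one check: has the new database caught up with the old one? The sketch below is not the procedure used in this migration. It is one hedged way to express that check, assuming you have read the publisher's current WAL position (for example from pg_current_wal_lsn()) and the subscriber slot's confirmed_flush_lsn from pg_replication_slots. PostgreSQL prints LSNs as two hexadecimal numbers separated by a slash; the helper names and the zero-lag threshold are illustrative.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte offset."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replication_lag_bytes(source_lsn: str, confirmed_lsn: str) -> int:
    """Bytes of WAL the subscriber has not yet confirmed."""
    return parse_lsn(source_lsn) - parse_lsn(confirmed_lsn)


def ready_for_cutover(
    source_lsn: str, confirmed_lsn: str, max_lag_bytes: int = 0
) -> bool:
    """True once the subscriber is within max_lag_bytes of the source.

    Only meaningful after writes to the source have been stopped,
    otherwise the source position keeps moving.
    """
    return replication_lag_bytes(source_lsn, confirmed_lsn) <= max_lag_bytes


# Example positions (made up for illustration)
print(replication_lag_bytes("16/B374D848", "16/B374D800"))
print(ready_for_cutover("16/B374D848", "16/B374D848"))
```

Gating the switch on a measured lag of zero, after writes are paused, is what keeps the maintenance window short without risking lost transactions.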
Results that matter
Performance improvements:
- API response times: 180ms → 95ms average
- Same 99.95% uptime SLA maintained
- Sub-200ms latency requirement met with headroom
Cost breakdown:
- Before: €28,000/month on AWS
- After: €9,800/month total (including €4,200 hardware and €3,200 managed services)
- 65% cost reduction
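The headline percentage is easy to verify, and the same numbers give a per-transaction view. This is a minimal sketch using only figures from the post; it assumes the 2.3M transactions are a monthly volume, which the post implies but does not state outright. Function names are illustrative.

```python
AWS_MONTHLY_EUR = 28_000
NEW_MONTHLY_EUR = 9_800
MONTHLY_TRANSACTIONS = 2_300_000  # assumption: 2.3M is a per-month figure


def reduction_pct(before: float, after: float) -> float:
    """Percentage saved when moving from `before` to `after`."""
    return (before - after) / before * 100


def cost_per_thousand(monthly_cost: float, transactions: int) -> float:
    """Infrastructure cost in EUR per 1,000 transactions."""
    return monthly_cost / transactions * 1000


monthly_saving = AWS_MONTHLY_EUR - NEW_MONTHLY_EUR
print(f"reduction:        {reduction_pct(AWS_MONTHLY_EUR, NEW_MONTHLY_EUR):.0f}%")
print(f"annual saving:    EUR {monthly_saving * 12:,}")
print(f"before, per 1k tx: EUR {cost_per_thousand(AWS_MONTHLY_EUR, MONTHLY_TRANSACTIONS):.2f}")
print(f"after, per 1k tx:  EUR {cost_per_thousand(NEW_MONTHLY_EUR, MONTHLY_TRANSACTIONS):.2f}")
```

Under that assumption, infrastructure cost drops from roughly €12.17 to €4.26 per thousand transactions, or €218,400 saved per year.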
Operational wins:
- No vendor lock-in
- Full EU data residency
- Predictable monthly costs
- Better resource utilization (65% average vs. the previous 23% peak)
Key takeaways for engineers
- Audit first: Most "scaling" problems are resource waste problems
- Predictable workloads don't need cloud premium: If you can forecast it, you can right-size it
- Open source infrastructure scales: Proxmox + Ceph + K8s handles enterprise workloads
- Migration risk is manageable: Parallel builds beat big-bang deployments
The real lesson? Sometimes the best cloud optimization is leaving the cloud entirely.
Originally published on binadit.com