Architecture and Cost Trade-offs
How we designed a scalable, reliable microservices platform on Azure Kubernetes Service while keeping infrastructure costs predictable and operational overhead low.
Introduction
When building cloud-native systems, teams often face a familiar tension: reliability versus cost.
Fully managed services offer strong SLAs and peace of mind, but their pricing compounds quickly as systems grow. Self-hosted alternatives reduce spend, yet introduce operational complexity that many teams underestimate.
In this article, we share how we designed a production-grade microservices platform on Azure Kubernetes Service (AKS) using a hybrid architecture—combining managed services where reliability is non-negotiable and self-hosting components where the risk-to-cost trade-off is acceptable.
This approach resulted in:
- 70–90% lower monthly infrastructure costs compared to fully managed stacks
- Strong reliability guarantees for critical data
- Full observability without per-GB ingestion fees
- A portable, infrastructure-as-code-driven foundation
This post focuses on architecture and decision-making.
Implementation details (Terraform, Helm, autoscaling, identity) are covered in Part 2.
The Core Problem: Cost, Reliability, and Operational Load
Modern microservices platforms are more than application code. Even relatively small systems require:
- Databases, caches, and message queues
- Stateless application services (APIs, workers, frontend)
- Metrics, logs, and dashboards
- Secure networking and access controls
Each component introduces the same decision:
Should this be a managed service—or something we run ourselves?
The Cost Reality
Individually, managed services are reasonably priced. Together, they add up quickly.
| Component | Managed Option | Typical Cost |
|---|---|---|
| PostgreSQL | Azure PostgreSQL | $65–200 / month |
| Redis | Azure Cache for Redis | $50–150 / month |
| RabbitMQ | Managed broker | $100–300 / month |
| Logging | Azure Monitor Logs | $2.50 per GB |
For early-stage products or cost-sensitive deployments, this pricing model can become a constraint long before scale becomes a problem.
The Key Insight: Not All State Is Equal
The most important architectural decision we made was classifying state.
Two Categories of State
Irreplaceable state
- Transactional data
- User data
- Financial records
This data cannot be regenerated. It requires backups, point-in-time recovery, and strong durability guarantees.
Regenerable state
- Caches
- Queues
- Derived logs and metrics
This data can be rebuilt, replayed, or tolerated if lost temporarily.
Once this distinction is made, the managed vs self-hosted decision becomes much clearer.
Our Hybrid Architecture Strategy
Rather than going “all managed” or “all self-hosted,” we adopted a selective hybrid approach.
Managed Services
- PostgreSQL (database of record): chosen for durability, automated backups, and operational guarantees.
In-Cluster Services (Kubernetes)
- Redis for caching
- RabbitMQ for asynchronous workloads
- Observability stack (metrics and logs)
These components are important, but failure is recoverable. Kubernetes handles restarts and rescheduling, making this trade-off reasonable.
This single decision accounted for the majority of cost savings—without increasing the blast radius of real failures.
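For illustration, here is a minimal sketch of what "in-cluster" looks like in practice: a plain Deployment and Service running Redis as a disposable cache. The names, namespace, and resource numbers are placeholders rather than our production manifests; in reality this would typically be installed from a Helm chart.

```yaml
# Sketch: self-hosted Redis as a cache inside the cluster.
# Losing this pod only drops regenerable state (cached entries),
# so a plain Deployment plus automatic rescheduling is acceptable.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args: ["--maxmemory", "256mb", "--maxmemory-policy", "allkeys-lru"]
          ports:
            - containerPort: 6379
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 320Mi
---
apiVersion: v1
kind: Service
metadata:
  name: redis-cache
  namespace: platform
spec:
  selector:
    app: redis-cache
  ports:
    - port: 6379
      targetPort: 6379
```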
Kubernetes-First, Not Kubernetes-Everything
Kubernetes serves as the orchestration layer, not a universal hosting solution.
Where Kubernetes Adds Value
- Consistent deployment and scaling model
- Horizontal scaling built-in
- Strong ecosystem and tooling
- Cloud-agnostic abstraction
Where It Does Not
- Primary databases
- Systems requiring strict consistency and durability guarantees
Using Kubernetes for compute and managed services for critical state strikes a pragmatic balance.
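A small sketch of what this split looks like from the application's point of view, assuming a hypothetical `orders-api` service: the pods run in AKS, while the database of record lives outside the cluster and is reached through a Secret-backed connection string.

```yaml
# Sketch: compute runs in Kubernetes, the database of record does not.
# The connection string points at managed Azure PostgreSQL and is
# injected from a Secret; all names here are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/orders-api:1.0.0
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: orders-db        # created out-of-band, e.g. by Terraform
                  key: connection-string
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
```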
Scaling Without Paying for Idle Capacity
One of the most common cloud cost mistakes is provisioning for peak traffic.
Our approach:
- Keep a small baseline footprint
- Scale pods horizontally based on load
- Allow nodes to scale only when required
At idle, the platform remains inexpensive. During traffic spikes, it scales automatically—and scales back down afterward.
This keeps costs proportional to actual usage rather than theoretical peak demand.
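The pod-level half of this is just a standard HorizontalPodAutoscaler; the cluster autoscaler then adds nodes only when pending pods cannot be scheduled, and removes them again once they are idle. The bounds and target below are illustrative values, not a recommendation.

```yaml
# Sketch: scale the API on CPU utilization.
# minReplicas keeps the idle footprint small; maxReplicas caps spend.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```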
Using Spot Capacity—Deliberately
Not every workload requires guaranteed availability.
Background workers, batch processing, and asynchronous jobs can tolerate interruptions. These workloads run on spot capacity, trading availability guarantees for steep discounts.
What Runs on Spot
- Workers
- Batch jobs
- Non-user-facing tasks
What Never Does
- APIs
- Databases
- Stateful services
By isolating these workloads, we significantly reduced compute costs without affecting user experience.
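Here is a sketch of how a worker Deployment gets steered onto spot nodes. AKS marks spot node pools with the `kubernetes.azure.com/scalesetpriority=spot` label and a matching `NoSchedule` taint, so the worker selects that label and tolerates the taint; treat the exact key as something to verify against your own node pool.

```yaml
# Sketch: pin interruptible workers to the spot node pool.
# The taint/label key shown follows the AKS spot convention;
# verify it on your cluster before relying on it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        kubernetes.azure.com/scalesetpriority: spot
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: worker
          image: ghcr.io/example/batch-worker:1.0.0
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
```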
Observability Without Ingestion-Based Pricing
Observability is not optional—but ingestion-based pricing makes it expensive at scale.
Instead of paying per-GB ingestion fees, we use:
- Open-source tooling for metrics and logs
- Object storage for long-term retention
- Dashboards tailored to the system’s needs
Why This Works
- Logs are queried infrequently
- Storage is inexpensive
- Ingestion costs dominate managed observability pricing
This approach preserves full visibility while keeping observability spend negligible relative to compute.
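As one concrete (and heavily simplified) example, the sketch below shows the storage section of a Loki configuration writing log chunks to Azure Blob Storage with a finite retention window. The account and container names are placeholders, and the exact keys vary between Loki versions, so treat this as the shape of the setup rather than a drop-in config.

```yaml
# Sketch: Loki storing log chunks in Azure Blob Storage, so retention
# costs scale with cheap object storage instead of per-GB ingestion.
# Verify the field names against the Loki version you deploy.
storage_config:
  azure:
    account_name: examplelogs        # placeholder storage account
    container_name: loki-chunks
    use_managed_identity: true       # avoids storing an account key
limits_config:
  retention_period: 720h             # ~30 days before old chunks age out
```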
Security and Access: Identity-First by Design
Security decisions were guided by a few core principles:
Private by Default
- Databases are reachable only within the virtual network
- Internal services are never exposed publicly
Identity Over Secrets
- Workloads authenticate using cloud identity
- No long-lived credentials stored in Kubernetes
Least Privilege
- Day-to-day access uses minimal permissions
- Elevated access is required only during initial setup
This avoids VPN sprawl, bastion hosts, and unnecessary operational overhead.
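To make "identity over secrets" concrete, here is a sketch using Azure Workload Identity: a ServiceAccount annotated with the client ID of a user-assigned managed identity, and a pod labeled to opt into token injection. The client ID and names are placeholders.

```yaml
# Sketch: workloads authenticate with a federated managed identity,
# so no long-lived credentials are stored in the cluster.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-api
  annotations:
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
        azure.workload.identity/use: "true"   # opts the pod into token injection
    spec:
      serviceAccountName: orders-api
      containers:
        - name: api
          image: ghcr.io/example/orders-api:1.0.0
```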
Cost Snapshot (Production)
A representative production environment looked roughly like this:
| Component | Monthly Cost |
|---|---|
| AKS compute (baseline) | $80–100 |
| Managed PostgreSQL | $65 |
| Storage & networking | $30 |
| Persistent volumes | $10 |
| Total | ~$190 |
Comparable fully managed setups typically ran $500–$1,500 per month.
Lessons Learned
1. Don’t Optimize the Wrong Layer
Saving money on the database of record is rarely worth the risk.
2. Spot Capacity Is High-Leverage
Used correctly, it provides some of the largest cost reductions available.
3. Observability Is Mandatory
Skipping it to save money always costs more later.
4. Infrastructure as Code Pays Off Early
Teams that automate early spend less time firefighting later.
5. Kubernetes Is an Enabler, Not the Goal
Use it where it adds leverage—not as a default for everything.
Conclusion
Production-grade microservices don’t require premium managed services at every layer.
By:
- Classifying state correctly
- Using managed services selectively
- Scaling dynamically
- Leveraging open-source observability
it’s possible to build systems that are reliable, cost-efficient, secure, and portable—even with small teams.
Part 2 dives into the implementation details: Terraform modules, AKS autoscaling, spot node pools, Workload Identity, and Kubernetes deployment patterns.