Architecture and Cost Trade-offs
How we designed a scalable, reliable microservices platform on Azure Kubernetes Service while keeping infrastructure costs predictable and operational overhead low.
Introduction
When building cloud-native systems, teams often face a familiar tension: reliability versus cost.
Fully managed services offer strong SLAs and peace of mind, but their pricing compounds quickly as systems grow. Self-hosted alternatives reduce spend, yet introduce operational complexity that many teams underestimate.
In this article, we share how we designed a production-grade microservices platform on Azure Kubernetes Service (AKS) using a hybrid architecture—combining managed services where reliability is non-negotiable and self-hosting components where the risk-to-cost trade-off is acceptable.
This approach resulted in:
- 70–90% lower monthly infrastructure costs compared to fully managed stacks
- Strong reliability guarantees for critical data
- Full observability without per-GB ingestion fees
- A portable, infrastructure-as-code-driven foundation
This post focuses on architecture and decision-making.
Implementation details (Terraform, Helm, autoscaling, identity) are covered in Part 2.
The Core Problem: Cost, Reliability, and Operational Load
Modern microservices platforms are more than application code. Even relatively small systems require:
- Databases, caches, and message queues
- Stateless application services (APIs, workers, frontend)
- Metrics, logs, and dashboards
- Secure networking and access controls
Each component introduces the same decision:
Should this be a managed service—or something we run ourselves?
The Cost Reality
Individually, managed services are reasonably priced. Together, they add up quickly.
| Component | Managed Option | Typical Cost |
|---|---|---|
| PostgreSQL | Azure PostgreSQL | $65–200 / month |
| Redis | Azure Cache for Redis | $50–150 / month |
| RabbitMQ | Managed broker | $100–300 / month |
| Logging | Azure Monitor Logs | $2.50 per GB |
For early-stage products or cost-sensitive deployments, this pricing model can become a constraint long before scale becomes a problem.
The Key Insight: Not All State Is Equal
The most important architectural decision we made was classifying state.
Two Categories of State
Irreplaceable state
- Transactional data
- User data
- Financial records
This data cannot be regenerated. It requires backups, point-in-time recovery, and strong durability guarantees.
Regenerable state
- Caches
- Queues
- Derived logs and metrics
This data can be rebuilt, replayed, or tolerated if lost temporarily.
Once this distinction is made, the managed vs self-hosted decision becomes much clearer.
Our Hybrid Architecture Strategy
Rather than going “all managed” or “all self-hosted,” we adopted a selective hybrid approach.
Managed Services
- PostgreSQL (database of record): chosen for durability, automated backups, and operational guarantees.
In-Cluster Services (Kubernetes)
- Redis for caching
- RabbitMQ for asynchronous workloads
- Observability stack (metrics and logs)
These components are important, but failure is recoverable. Kubernetes handles restarts and rescheduling, making this trade-off reasonable.
This single decision accounted for the majority of cost savings—without increasing the blast radius of real failures.
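For illustration, here is a minimal sketch of what "in-cluster" looks like in practice: a plain Deployment and Service running Redis as a disposable cache. The names, namespace, and resource numbers are placeholders rather than our production manifests; in reality this would typically be installed from a Helm chart.

```yaml
# Sketch: self-hosted Redis as a cache inside the cluster.
# Losing this pod only drops regenerable state (cached entries),
# so a plain Deployment plus automatic rescheduling is acceptable.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args: ["--maxmemory", "256mb", "--maxmemory-policy", "allkeys-lru"]
          ports:
            - containerPort: 6379
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              memory: 320Mi
---
apiVersion: v1
kind: Service
metadata:
  name: redis-cache
  namespace: platform
spec:
  selector:
    app: redis-cache
  ports:
    - port: 6379
      targetPort: 6379
```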
Kubernetes-First, Not Kubernetes-Everything
Kubernetes serves as the orchestration layer, not a universal hosting solution.
Where Kubernetes Adds Value
- Consistent deployment and scaling model
- Horizontal scaling built-in
- Strong ecosystem and tooling
- Cloud-agnostic abstraction
Where It Does Not
- Primary databases
- Systems requiring strict consistency and durability guarantees
Using Kubernetes for compute and managed services for critical state strikes a pragmatic balance.
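A small sketch of what this split looks like from the application's point of view, assuming a hypothetical `orders-api` service: the pods run in AKS, while the database of record lives outside the cluster and is reached through a Secret-backed connection string.

```yaml
# Sketch: compute runs in Kubernetes, the database of record does not.
# The connection string points at managed Azure PostgreSQL and is
# injected from a Secret; all names here are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: api
          image: ghcr.io/example/orders-api:1.0.0
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: orders-db        # created out-of-band, e.g. by Terraform
                  key: connection-string
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
```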
Scaling Without Paying for Idle Capacity
One of the most common cloud cost mistakes is provisioning for peak traffic.
Our approach:
- Keep a small baseline footprint
- Scale pods horizontally based on load
- Allow nodes to scale only when required
At idle, the platform remains inexpensive. During traffic spikes, it scales automatically—and scales back down afterward.
This keeps costs proportional to actual usage rather than theoretical peak demand.
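The pod-level half of this is just a standard HorizontalPodAutoscaler; the cluster autoscaler then adds nodes only when pending pods cannot be scheduled, and removes them again once they are idle. The bounds and target below are illustrative values, not a recommendation.

```yaml
# Sketch: scale the API on CPU utilization.
# minReplicas keeps the idle footprint small; maxReplicas caps spend.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```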
Using Spot Capacity—Deliberately
Not every workload requires guaranteed availability.
Background workers, batch processing, and asynchronous jobs can tolerate interruptions. These workloads run on spot capacity, trading availability guarantees for steep discounts.
What Runs on Spot
- Workers
- Batch jobs
- Non-user-facing tasks
What Never Does
- APIs
- Databases
- Stateful services
By isolating these workloads, we significantly reduced compute costs without affecting user experience.
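Here is a sketch of how a worker Deployment gets steered onto spot nodes. AKS marks spot node pools with the `kubernetes.azure.com/scalesetpriority=spot` label and a matching `NoSchedule` taint, so the worker selects that label and tolerates the taint; treat the exact key as something to verify against your own node pool.

```yaml
# Sketch: pin interruptible workers to the spot node pool.
# The taint/label key shown follows the AKS spot convention;
# verify it on your cluster before relying on it.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        kubernetes.azure.com/scalesetpriority: spot
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: worker
          image: ghcr.io/example/batch-worker:1.0.0
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
```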
Observability Without Ingestion-Based Pricing
Observability is not optional—but ingestion-based pricing makes it expensive at scale.
Instead of paying per-GB ingestion fees, we use:
- Open-source tooling for metrics and logs
- Object storage for long-term retention
- Dashboards tailored to the system’s needs
Why This Works
- Logs are queried infrequently
- Storage is inexpensive
- Ingestion costs dominate managed observability pricing
This approach preserves full visibility while keeping observability spend negligible relative to compute.
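As one concrete (and heavily simplified) example, the sketch below shows the storage section of a Loki configuration writing log chunks to Azure Blob Storage with a finite retention window. The account and container names are placeholders, and the exact keys vary between Loki versions, so treat this as the shape of the setup rather than a drop-in config.

```yaml
# Sketch: Loki storing log chunks in Azure Blob Storage, so retention
# costs scale with cheap object storage instead of per-GB ingestion.
# Verify the field names against the Loki version you deploy.
storage_config:
  azure:
    account_name: examplelogs        # placeholder storage account
    container_name: loki-chunks
    use_managed_identity: true       # avoids storing an account key
limits_config:
  retention_period: 720h             # ~30 days before old chunks age out
```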
Security and Access: Identity-First by Design
Security decisions were guided by a few core principles:
Private by Default
- Databases are reachable only within the virtual network
- Internal services are never exposed publicly
Identity Over Secrets
- Workloads authenticate using cloud identity
- No long-lived credentials stored in Kubernetes
Least Privilege
- Day-to-day access uses minimal permissions
- Elevated access is required only during initial setup
This avoids VPN sprawl, bastion hosts, and unnecessary operational overhead.
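To make "identity over secrets" concrete, here is a sketch using Azure Workload Identity: a ServiceAccount annotated with the client ID of a user-assigned managed identity, and a pod labeled to opt into token injection. The client ID and names are placeholders.

```yaml
# Sketch: workloads authenticate with a federated managed identity,
# so no long-lived credentials are stored in the cluster.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-api
  annotations:
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
        azure.workload.identity/use: "true"   # opts the pod into token injection
    spec:
      serviceAccountName: orders-api
      containers:
        - name: api
          image: ghcr.io/example/orders-api:1.0.0
```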
Cost Snapshot (Production)
A representative production environment looked roughly like this:
| Component | Monthly Cost |
|---|---|
| AKS compute (baseline) | $80–100 |
| Managed PostgreSQL | $65 |
| Storage & networking | $30 |
| Persistent volumes | $10 |
| Total | ~$190 |
Comparable fully managed setups typically ran $500–$1,500 per month.
Lessons Learned
1. Don’t Optimize the Wrong Layer
Saving money on the database of record is rarely worth the risk.
2. Spot Capacity Is High-Leverage
Used correctly, it provides some of the largest cost reductions available.
3. Observability Is Mandatory
Skipping it to save money always costs more later.
4. Infrastructure as Code Pays Off Early
Teams that automate early spend less time firefighting later.
5. Kubernetes Is an Enabler, Not the Goal
Use it where it adds leverage—not as a default for everything.
Conclusion
Production-grade microservices don’t require premium managed services at every layer.
By:
- Classifying state correctly
- Using managed services selectively
- Scaling dynamically
- Leveraging open-source observability
it’s possible to build systems that are reliable, cost-efficient, secure, and portable—even with small teams.
Part 2 dives into the implementation details: Terraform modules, AKS autoscaling, spot node pools, Workload Identity, and Kubernetes deployment patterns.