
Dilip Kola

Building Production-Grade Microservices on Azure Kubernetes Service

Architecture and Cost Trade-offs

How we designed a scalable, reliable microservices platform on Azure Kubernetes Service while keeping infrastructure costs predictable and operational overhead low.

[Figure: Azure Kubernetes hybrid microservices architecture]

Introduction

When building cloud-native systems, teams often face a familiar tension:

reliability versus cost

Fully managed services offer strong SLAs and peace of mind, but their pricing compounds quickly as systems grow. Self-hosted alternatives reduce spend, yet introduce operational complexity that many teams underestimate.

In this article, we share how we designed a production-grade microservices platform on Azure Kubernetes Service (AKS) using a hybrid architecture—combining managed services where reliability is non-negotiable and self-hosting components where the risk-to-cost trade-off is acceptable.

This approach resulted in:

  • 70–90% lower monthly infrastructure costs compared to fully managed stacks
  • Strong reliability guarantees for critical data
  • Full observability without per-GB ingestion fees
  • A portable, infrastructure-as-code-driven foundation

This post focuses on architecture and decision-making.
Implementation details (Terraform, Helm, autoscaling, identity) are covered in Part 2.


The Core Problem: Cost, Reliability, and Operational Load

Modern microservices platforms are more than application code. Even relatively small systems require:

  • Databases, caches, and message queues
  • Stateless application services (APIs, workers, frontend)
  • Metrics, logs, and dashboards
  • Secure networking and access controls

Each component introduces the same decision:

Should this be a managed service—or something we run ourselves?

The Cost Reality

Individually, managed services are reasonably priced. Together, they add up quickly.

Component     Managed Option           Typical Cost
PostgreSQL    Azure PostgreSQL         $65–200 / month
Redis         Azure Cache for Redis    $50–150 / month
RabbitMQ      Managed broker           $100–300 / month
Logging       Azure Monitor Logs       $2.50 per GB ingested

For early-stage products or cost-sensitive deployments, this pricing model can become a constraint long before scale becomes a problem.


The Key Insight: Not All State Is Equal

The most important architectural decision we made was classifying state.

Two Categories of State

Irreplaceable state

  • Transactional data
  • User data
  • Financial records

This data cannot be regenerated. It requires backups, point-in-time recovery, and strong durability guarantees.

Regenerable state

  • Caches
  • Queues
  • Derived logs and metrics

This data can be rebuilt, replayed, or tolerated if lost temporarily.

Once this distinction is made, the managed vs self-hosted decision becomes much clearer.


Our Hybrid Architecture Strategy

Rather than going “all managed” or “all self-hosted,” we adopted a selective hybrid approach.

Managed Services

  • PostgreSQL (database of record), chosen for durability, automated backups, and operational guarantees.

In-Cluster Services (Kubernetes)

  • Redis for caching
  • RabbitMQ for asynchronous workloads
  • Observability stack (metrics and logs)

These components are important, but failure is recoverable. Kubernetes handles restarts and rescheduling, making this trade-off reasonable.

This single decision accounted for the majority of cost savings—without increasing the blast radius of real failures.
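
To make the in-cluster side concrete, here is a minimal sketch of Redis running as a plain Deployment and Service. The namespace, names, and resource figures are assumptions, and in practice a Helm chart (for example bitnami/redis, or an operator for RabbitMQ) is the more typical install path. The point is simply that if this pod is lost, the cost is a cold cache, not lost data.

```yaml
# Minimal in-cluster Redis used purely as a cache (regenerable state).
# Illustrative sketch only: namespace, resource requests/limits, and names are assumptions;
# a Helm chart is the more typical way to install this in a real cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
  namespace: platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis-cache
  template:
    metadata:
      labels:
        app: redis-cache
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          ports:
            - containerPort: 6379
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
            limits:
              memory: 256Mi
---
# ClusterIP Service so application pods reach the cache inside the cluster only.
apiVersion: v1
kind: Service
metadata:
  name: redis-cache
  namespace: platform
spec:
  selector:
    app: redis-cache
  ports:
    - port: 6379
      targetPort: 6379
```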


Kubernetes-First, Not Kubernetes-Everything

Kubernetes serves as the orchestration layer, not a universal hosting solution.

Where Kubernetes Adds Value

  • Consistent deployment and scaling model
  • Horizontal scaling built-in
  • Strong ecosystem and tooling
  • Cloud-agnostic abstraction

Where It Does Not

  • Primary databases
  • Systems requiring strict consistency and durability guarantees

Using Kubernetes for compute and managed services for critical state strikes a pragmatic balance.


Scaling Without Paying for Idle Capacity

One of the most common cloud cost mistakes is provisioning for peak traffic.

Our approach:

  • Keep a small baseline footprint
  • Scale pods horizontally based on load
  • Allow nodes to scale only when required

At idle, the platform remains inexpensive. During traffic spikes, it scales automatically—and scales back down afterward.

This keeps costs proportional to actual usage rather than theoretical peak demand.
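
As a sketch of what this looks like in Kubernetes terms, the manifest below defines an autoscaling/v2 HorizontalPodAutoscaler for a hypothetical api Deployment. The target name, replica bounds, and CPU threshold are assumptions; node-level scaling comes from the AKS cluster autoscaler, which adds nodes only when pending pods no longer fit.

```yaml
# Horizontal Pod Autoscaler for a hypothetical "api" Deployment.
# Keeps a small baseline (minReplicas) and scales out on CPU pressure;
# the AKS cluster autoscaler then adds nodes only if the extra pods don't fit.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```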


Using Spot Capacity—Deliberately

Not every workload requires guaranteed availability.

Background workers, batch processing, and asynchronous jobs can tolerate interruptions. These workloads run on spot capacity, trading availability guarantees for steep discounts.

What Runs on Spot

  • Workers
  • Batch jobs
  • Non-user-facing tasks

What Never Does

  • APIs
  • Databases
  • Stateful services

By isolating these workloads, we significantly reduced compute costs without affecting user experience.
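
In AKS, spot node pools are tainted and labeled with kubernetes.azure.com/scalesetpriority=spot, so workloads land there only if they explicitly opt in. The sketch below shows a hypothetical background worker doing exactly that; the Deployment name and image are placeholders.

```yaml
# Background worker pinned to an AKS spot node pool.
# Spot nodes carry the scalesetpriority=spot taint and label, so only
# workloads that tolerate the taint and select the label are scheduled there.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker
spec:
  replicas: 2
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        kubernetes.azure.com/scalesetpriority: spot
      tolerations:
        - key: kubernetes.azure.com/scalesetpriority
          operator: Equal
          value: spot
          effect: NoSchedule
      containers:
        - name: worker
          image: myregistry.azurecr.io/batch-worker:latest  # placeholder image
```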


Observability Without Ingestion-Based Pricing

Observability is not optional—but ingestion-based pricing makes it expensive at scale.

Instead of paying per-GB ingestion fees, we use:

  • Open-source tooling for metrics and logs
  • Object storage for long-term retention
  • Dashboards tailored to the system’s needs

Why This Works

  • Logs are queried infrequently
  • Storage is inexpensive
  • Ingestion costs dominate managed observability pricing

This approach preserves full visibility while keeping observability spend negligible relative to compute.
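
As one small example of the open-source approach: if the in-cluster Prometheus is configured with the common annotation-based relabeling rules, application pods can expose metrics simply by annotating their pod template. Everything named below (the api Deployment, port, and image) is illustrative, and the prometheus.io/* annotations are a convention, not a Kubernetes built-in.

```yaml
# Hypothetical API pods exposing metrics to an in-cluster Prometheus.
# The prometheus.io/* annotations are honoured only when the Prometheus
# scrape config includes the usual annotation-based relabeling rules.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: api
          image: myregistry.azurecr.io/api:latest  # placeholder image
          ports:
            - containerPort: 8080
```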


Security and Access: Identity-First by Design

Security decisions were guided by a few core principles:

Private by Default

  • Databases are reachable only within the virtual network
  • Internal services are never exposed publicly

Identity Over Secrets

  • Workloads authenticate using cloud identity
  • No long-lived credentials stored in Kubernetes

Least Privilege

  • Day-to-day access uses minimal permissions
  • Elevated access is required only during initial setup

This avoids VPN sprawl, bastion hosts, and unnecessary operational overhead.
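
To make the identity-over-secrets point concrete, here is a sketch of the AKS Workload Identity wiring on the Kubernetes side: a ServiceAccount annotated with the client ID of a user-assigned managed identity, and a pod that opts in via the azure.workload.identity/use label. The names and client ID are placeholders, and the federated identity credential that links the two is created on the Azure side (covered in Part 2).

```yaml
# ServiceAccount tied to a user-assigned managed identity via AKS Workload Identity.
# The client ID is a placeholder; a federated identity credential linking this
# ServiceAccount to the managed identity must exist in Azure.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  annotations:
    azure.workload.identity/client-id: "00000000-0000-0000-0000-000000000000"  # placeholder
---
# Pods opt in with the azure.workload.identity/use label and reference the
# ServiceAccount; the workload then receives short-lived tokens instead of
# long-lived credentials stored in the cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
        azure.workload.identity/use: "true"
    spec:
      serviceAccountName: api
      containers:
        - name: api
          image: myregistry.azurecr.io/api:latest  # placeholder image
```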


Cost Snapshot (Production)

A representative production environment looked roughly like this:

Component                  Monthly Cost
AKS compute (baseline)     $80–100
Managed PostgreSQL         $65
Storage & networking       $30
Persistent volumes         $10
Total                      ~$190

Comparable fully managed setups typically cost $500–$1,500 per month or more.


Lessons Learned

1. Don’t Optimize the Wrong Layer

Saving money on the database of record is rarely worth the risk.

2. Spot Capacity Is High-Leverage

Used correctly, it provides some of the largest cost reductions available.

3. Observability Is Mandatory

Skipping it to save money always costs more later.

4. Infrastructure as Code Pays Off Early

Teams that automate early spend less time firefighting later.

5. Kubernetes Is an Enabler, Not the Goal

Use it where it adds leverage—not as a default for everything.


Conclusion

Production-grade microservices don’t require premium managed services at every layer.

By:

  • Classifying state correctly
  • Using managed services selectively
  • Scaling dynamically
  • Leveraging open-source observability

it’s possible to build systems that are reliable, cost-efficient, secure, and portable—even with small teams.

Part 2 dives into the implementation details: Terraform modules, AKS autoscaling, spot node pools, Workload Identity, and Kubernetes deployment patterns.
