The Microservices Promise vs. Reality
Every architecture diagram looks clean before it meets real traffic.
Microservices on Azure promise independent deployability, team autonomy, granular scaling, and fault isolation. Those benefits are real — but they come with a cost that's rarely discussed honestly in tutorials: operational complexity that scales faster than your team does if you're not careful.
This post isn't a beginner's introduction to microservices. It's an honest account of what we've learned building and running microservice architectures on Azure across multiple production systems — what the platform does well, where you'll get burned, and the specific patterns that separate systems that hold up from systems that fall apart at 3am.
Why Azure for Microservices?
Before getting into the patterns, it's worth being clear about why Azure is a reasonable choice for microservice workloads — and what you're actually signing up for.
Azure's microservices story is primarily built around three services:
Azure Kubernetes Service (AKS) — Managed Kubernetes that handles control plane upgrades, node pool management, and integrates cleanly with the rest of the Azure ecosystem (AAD, ACR, Monitor). If you're running containerised services, AKS is the default choice.
Azure Container Apps — A higher-level abstraction on top of Kubernetes and KEDA. Less control than AKS, but dramatically less operational overhead. Appropriate for teams that want microservice benefits without a full Kubernetes investment.
Azure Service Bus — The backbone of async communication between services. More reliable than rolling your own queue, with dead-letter queuing, message sessions, and duplicate detection built in.
The choice between AKS and Container Apps is the first consequential decision. Our rule: if you have a dedicated platform engineer or SRE, AKS gives you the flexibility you'll eventually need. If you don't, Container Apps will keep you sane.
Service Design: The Decisions That Matter
Get the service boundary right before writing code
The most expensive microservices mistake isn't technical — it's drawing the wrong boundaries.
Services that are too fine-grained (nanoservices) create distributed monolith problems: services that are tightly coupled at runtime even though they're deployed independently. You end up with synchronous chains of service calls, where one slow service creates cascading latency across the whole system.
Services that are too coarse-grained lose the benefits of the architecture. You've added operational complexity without gaining deployment independence.
The right heuristic: services should own their data and be independently deployable without coordination with other services. If you can't deploy Service A without also deploying Service B, you've drawn the boundary wrong.
Domain-Driven Design gives you the vocabulary for this: bounded contexts. Each service should correspond to a bounded context — a domain area with its own data model, its own language, and its own rules. Payments is a bounded context. Inventory is a bounded context. User authentication is a bounded context. "Everything the API needs" is not.
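As a concrete sketch of that separation (the class names here are illustrative, not from any real system), the same real-world person can be modelled differently in each bounded context, sharing only an identifier:

```python
from dataclasses import dataclass

# In the Payments bounded context, a "customer" is someone who can be billed.
@dataclass
class BillingCustomer:
    customer_id: str
    payment_method_token: str
    billing_address: str

# In the Support bounded context, the same person is a ticket requester.
# The shared customer_id is the only overlap; neither context depends on
# the other's data model.
@dataclass
class TicketRequester:
    customer_id: str
    display_name: str
    open_ticket_ids: list
```

Each context is free to evolve its model without coordinating with the other, which is exactly the independence the boundary is meant to buy.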
The database-per-service rule
This is non-negotiable in a proper microservices architecture: each service owns its own database. No shared databases across service boundaries.
This feels wasteful — why run separate database instances when one could serve everything? Because shared databases create coupling at the data layer that defeats the independence you're trying to achieve. Schema changes in a shared database require coordinating across every team that reads that data. You've traded deployment independence for schema coupling.
On Azure, this means each service gets its own Azure SQL database, Cosmos DB container, or PostgreSQL flexible server. Yes, this costs more. The tradeoff is worth it.
For read-heavy cross-service queries (the most common objection to database-per-service), the answer is materialised views and event-driven synchronisation — which brings us to messaging.
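A minimal sketch of that event-driven synchronisation, assuming order.placed events shaped like the one published later in this post: the reading service applies each event to a denormalised read model it owns, so cross-service queries never touch another service's database.

```python
class OrderReadModel:
    """Denormalised, query-optimised copy of order data, owned by the reading service."""

    def __init__(self):
        # In production this would be a table in the reading service's own database.
        self._orders_by_customer: dict = {}

    def apply(self, event: dict):
        """Apply an order.placed event to the local read model."""
        if event["event_type"] != "order.placed":
            return
        self._orders_by_customer.setdefault(event["customer_id"], []).append({
            "order_id": event["order_id"],
            "total": event["total_amount"],
        })

    def orders_for(self, customer_id: str) -> list:
        # Served entirely from local data: no cross-service call at query time.
        return self._orders_by_customer.get(customer_id, [])
```

The read model is eventually consistent, which is the deliberate tradeoff: queries are fast and decoupled, at the cost of a short propagation delay after each write.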
Async Communication with Azure Service Bus
Synchronous REST calls between services are seductive because they're familiar. They're also the primary cause of cascading failures in microservice systems.
If Service A calls Service B synchronously, and Service B is slow or down, Service A is slow or failing. Multiply that across a system with 15 services and synchronous call chains, and you have a brittle distributed monolith.
The rule we follow: synchronous calls for reads that need immediate consistency; async messaging for everything that changes state.
Azure Service Bus is our default for async messaging. Here's the basic pattern for a producer:
```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

from azure.servicebus import ServiceBusClient, ServiceBusMessage
from azure.identity import DefaultAzureCredential


@dataclass
class OrderPlacedEvent:
    event_type: str = "order.placed"
    order_id: str = ""
    customer_id: str = ""
    total_amount: float = 0.0
    items: list = field(default_factory=list)
    placed_at: str = ""

    def __post_init__(self):
        if not self.placed_at:
            self.placed_at = datetime.now(timezone.utc).isoformat()


class OrderEventPublisher:
    def __init__(self, namespace_url: str, topic_name: str):
        credential = DefaultAzureCredential()
        self.client = ServiceBusClient(namespace_url, credential)
        self.topic_name = topic_name

    def publish_order_placed(self, order: dict) -> str:
        event = OrderPlacedEvent(
            order_id=order["id"],
            customer_id=order["customer_id"],
            total_amount=order["total"],
            items=order["items"],
        )
        message = ServiceBusMessage(
            body=json.dumps(asdict(event)),
            content_type="application/json",
            subject=event.event_type,
            message_id=f"order-placed-{event.order_id}",  # Idempotency key
        )
        with self.client.get_topic_sender(self.topic_name) as sender:
            sender.send_messages(message)
        return event.order_id
```
And the consumer side with proper error handling and dead-letter processing:
```python
import json
import logging

from azure.servicebus import ServiceBusClient, ServiceBusReceivedMessage
from azure.identity import DefaultAzureCredential

logger = logging.getLogger(__name__)

MAX_DELIVERY_ATTEMPTS = 5


class OrderEventConsumer:
    def __init__(self, namespace_url: str, topic_name: str, subscription_name: str):
        credential = DefaultAzureCredential()
        self.client = ServiceBusClient(namespace_url, credential)
        self.topic_name = topic_name
        self.subscription_name = subscription_name
        self.processed_message_ids = set()  # In production: use Redis or a database

    def process_messages(self, max_messages: int = 10):
        receiver = self.client.get_subscription_receiver(
            topic_name=self.topic_name,
            subscription_name=self.subscription_name,
            max_wait_time=5,
        )
        with receiver:
            messages = receiver.receive_messages(max_message_count=max_messages)
            for message in messages:
                try:
                    self._handle_message(message, receiver)
                except Exception as e:
                    logger.error(f"Failed to process message {message.message_id}: {e}")
                    if message.delivery_count >= MAX_DELIVERY_ATTEMPTS:
                        # Give up: route to the dead-letter queue for inspection
                        receiver.dead_letter_message(
                            message,
                            reason="ProcessingFailed",
                            error_description=str(e),
                        )
                    else:
                        # Release the lock so Service Bus redelivers the message
                        receiver.abandon_message(message)

    def _handle_message(self, message: ServiceBusReceivedMessage, receiver):
        msg_id = message.message_id
        # Idempotency check — Service Bus guarantees at-least-once delivery
        if msg_id in self.processed_message_ids:
            logger.info(f"Duplicate message {msg_id}, skipping")
            receiver.complete_message(message)
            return
        event = json.loads(str(message))
        if event["event_type"] == "order.placed":
            self._handle_order_placed(event)
        self.processed_message_ids.add(msg_id)
        receiver.complete_message(message)

    def _handle_order_placed(self, event: dict):
        logger.info(f"Processing order {event['order_id']} for customer {event['customer_id']}")
        # Actual business logic here
```
Two things the code above makes explicit that tutorials often skip: idempotency keys on messages (Service Bus guarantees at-least-once delivery, so your consumers must handle duplicates) and dead-letter routing for messages that fail processing (rather than infinitely retrying and blocking the queue).
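One gap the consumer above acknowledges in a comment: an in-memory set of processed IDs won't survive restarts or be shared across replicas. A minimal sketch of a TTL-based idempotency store (in-memory here for illustration; the docstring notes how the same interface maps onto Redis):

```python
import time


class IdempotencyStore:
    """Tracks processed message IDs with a TTL.

    In production, mark_processed maps naturally onto Redis
    SET <key> 1 NX EX <ttl>: the NX flag makes the check-and-set
    atomic across consumer replicas.
    """

    def __init__(self, ttl_seconds: int = 3600):
        self._ttl = ttl_seconds
        self._seen = {}  # message_id -> expiry timestamp

    def mark_processed(self, message_id: str) -> bool:
        """Return True if this ID is new (caller should process the message),
        False if it was already processed within the TTL window."""
        now = time.monotonic()
        expiry = self._seen.get(message_id)
        if expiry is not None and expiry > now:
            return False
        self._seen[message_id] = now + self._ttl
        return True
```

The TTL matters: it bounds memory while still covering the window in which Service Bus could plausibly redeliver a message.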
Service Discovery and API Gateway
On Azure, internal service-to-service communication within AKS uses Kubernetes DNS. Services call each other by name — http://inventory-service/api/v1/stock — and Kubernetes handles the routing.
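Those in-cluster synchronous calls still need timeouts and bounded retries, or one slow dependency stalls every caller upstream of it. A sketch of a generic retry helper (the helper and its parameters are illustrative, not from any Azure SDK; the wrapped call should itself enforce a request timeout):

```python
import time


def call_with_retries(fn, attempts: int = 3, base_delay: float = 0.1):
    """Call fn(), retrying on failure with exponential backoff.

    fn should enforce its own request timeout, e.g.
    requests.get("http://inventory-service/api/v1/stock", timeout=2).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as e:
            last_error = e
            if attempt < attempts - 1:
                # Backoff doubles each attempt: base_delay, 2x, 4x, ...
                time.sleep(base_delay * (2 ** attempt))
    raise last_error
```

Pairing this with the circuit-breaker settings in your service mesh or HTTP client keeps a struggling dependency from being hammered by retries.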
For external traffic, Azure API Management (APIM) is the recommended gateway layer. It handles:
- Authentication and authorisation before requests reach your services
- Rate limiting per consumer
- Request/response transformation
- Analytics and monitoring across all your service endpoints
One pattern that saves a lot of pain: version your APIs from day one. Every endpoint under /api/v1/. When you need to make breaking changes, you add /api/v2/ and run both versions in parallel during migration. This is trivial to enforce at the APIM layer.
Observability: The Thing Teams Leave Too Late
You cannot operate a microservices system without distributed tracing. A request that touches 6 services before returning a result cannot be debugged with per-service logs alone — by the time you've correlated log lines across 6 different log streams, the on-call engineer has aged noticeably.
The Azure-native answer is Application Insights with distributed tracing enabled. Every service emits telemetry with a shared correlation ID that Azure Monitor can use to reconstruct the full trace of a request across service boundaries.
The practical setup:
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter


def configure_tracing(connection_string: str, service_name: str):
    """Configure OpenTelemetry with Azure Monitor export."""
    exporter = AzureMonitorTraceExporter(connection_string=connection_string)
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)


tracer = configure_tracing(
    connection_string="InstrumentationKey=...",
    service_name="order-service",
)


def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        with tracer.start_as_current_span("validate_inventory"):
            # This span will appear as a child in the distributed trace.
            # check_inventory and process_payment are this service's own functions.
            inventory_result = check_inventory(order_id)
        with tracer.start_as_current_span("charge_payment"):
            payment_result = process_payment(order_id)
        return {"order_id": order_id, "status": "processed"}
```
Beyond distributed tracing, every service should emit:
- Health endpoints: /health/live (is the process running?) and /health/ready (is the service ready to receive traffic?)
- Structured logs: JSON-formatted logs with consistent fields — service name, request ID, user ID, duration. Human-readable logs don't scale.
- Business metrics: Not just technical metrics. "Orders processed per minute" and "payment failure rate" are more actionable than CPU utilisation.
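The structured-logs point can be sketched with nothing but the standard library; the field names here are illustrative, not a required schema:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Emit each log record as a single JSON object with consistent fields."""

    def __init__(self, service_name: str):
        super().__init__()
        self.service_name = service_name

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "service": self.service_name,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Extra fields (e.g. request_id) can be attached per call via
            # logger.info("...", extra={"request_id": rid})
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)
```

Attach it once per service, e.g. `handler.setFormatter(JsonFormatter("order-service"))`, and every line that reaches Log Analytics is queryable by field instead of by regex.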
Deployment: AKS Patterns That Hold Up
Rolling deployments with readiness gates
The default Kubernetes rolling deployment will replace pods one at a time, which is almost always what you want. The critical addition is proper readiness probes — Kubernetes won't route traffic to a new pod until the readiness probe passes. Without this, you'll send traffic to pods that are starting up but not yet ready to serve requests.
```yaml
# Excerpt from a Kubernetes deployment manifest
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # Never take a pod down before a replacement is ready
      maxSurge: 1         # Allow one extra pod during rollout
  template:
    spec:
      containers:
        - name: order-service
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
```
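Those probes need endpoints behind them. A framework-agnostic sketch of the two handlers (the dependency checks are illustrative): liveness stays cheap because a failing liveness probe makes Kubernetes restart the pod, while readiness checks real dependencies because a failure only takes the pod out of the traffic rotation.

```python
def liveness():
    """/health/live: is the process running?

    No dependency checks here; a flaky downstream service should not
    trigger pod restarts."""
    return 200, {"status": "alive"}


def readiness(check_database, check_service_bus):
    """/health/ready: can this pod serve traffic right now?

    Takes dependency probes as callables so they can be swapped for
    real connection checks in each service."""
    checks = {
        "database": check_database(),
        "service_bus": check_service_bus(),
    }
    healthy = all(checks.values())
    status = 200 if healthy else 503
    return status, {"status": "ready" if healthy else "not_ready", "checks": checks}
```

Keeping the two endpoints separate is the whole point: conflating them means a database blip causes a restart storm instead of a brief traffic drain.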
Namespace isolation per environment
One AKS cluster with namespace isolation for dev/staging/prod is a reasonable setup for smaller teams. Separate clusters per environment is cleaner but more expensive. The important thing: never mix production and non-production workloads in the same namespace, whichever cluster topology you choose.
GitOps with Azure DevOps
Every deployment should be triggered by a git commit, not a manual kubectl apply. We use Azure DevOps pipelines with a structure that separates build (create and push the container image) from deploy (update the Kubernetes manifest with the new image tag). Flux or ArgoCD manages the sync between the git state and the cluster state.
The Honest Cost of Microservices
Before we close, a direct assessment: microservices add real complexity. If you're a small team building an early-stage product, a well-structured monolith will serve you better. The operational overhead of running distributed services — separate deployments, distributed tracing, inter-service communication, saga patterns for distributed transactions — is significant.
The right time to move to microservices is when you have specific, demonstrated problems that microservices solve: teams that are slowing each other down due to codebase coupling, components with genuinely different scaling requirements, or a need for polyglot services using different runtimes.
If you're evaluating whether microservices are the right move for your current system, or if you're mid-migration and running into the architectural challenges described above, our team at Lycore has written extensively on this and works on these architectures across fintech, SaaS, and enterprise software. Happy to discuss your specific situation.
What's been your biggest challenge with microservices in production? The patterns that worked for us might not be universal — I'd like to hear what others have found.

