Avinash Hedaoo

Posted on Jun 21

Micro-Services And System Designs

#microservices #systemdesign #distributedsystems #interview

Microservice Designs Article: Different Patterns in One System

This consolidated article brings together the microservices patterns into a single practical system example. It uses the existing online retail marketplace scenario and the images already available in this folder. The goal is to unify the blueprint, use case, and individual pattern explanations into one article.

Example System: Online Retail Marketplace

The marketplace contains storefront, order, payment, inventory, user, shipping, and analytics services. Each pattern is described in the context of this system, showing how it helps support scalability, reliability, and maintainability.

Tier I: Foundational Discovery & Boundaries

Purpose: Establish the core infrastructure for service discovery, client communication, and data isolation.

01. Service Registry

Purpose: Acts as the centralized directory for runtime location metadata of dynamically scaling service instances.
Dynamic Registration: Service instances self-register on startup and update their entry with metadata such as host, port, health state, and version.
Health Tracking: Heartbeat mechanisms detect stale registrations and automatically evict failed instances from discovery results.
Client Discovery: Upstream components query the registry to discover healthy endpoints for load balancing or direct service invocation.
Deployment Modes: Can operate in client-side discovery models or server-side discovery through a proxy layer.
Production Options: Common implementations include Consul, Netflix Eureka, ZooKeeper, and Kubernetes service discovery.

Example : A service registry lets the storefront locate the order service and the payment service dynamically. In the marketplace, each microservice instance registers with the registry at startup, so the API gateway can discover healthy endpoints without hardcoding addresses. During a flash sale, new order service instances spin up and register automatically, allowing the system to scale. If an instance fails, the registry removes it and avoids sending traffic to it.
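The registration, heartbeat, and eviction behavior described above can be sketched in a few lines. This is a minimal in-memory illustration, not how Consul or Eureka work internally; the `ServiceRegistry` class, its method names, the TTL value, and the addresses are all assumptions made for the example.

```python
import time


class ServiceRegistry:
    """Hypothetical in-memory registry: instances expire when heartbeats stop."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so eviction can be tested without sleeping
        self._last_seen = {}  # (service, address) -> time of last heartbeat

    def register(self, service, address):
        self._last_seen[(service, address)] = self.clock()

    # A heartbeat simply refreshes the registration timestamp.
    heartbeat = register

    def healthy(self, service):
        now = self.clock()
        # Evict entries whose last heartbeat is older than the TTL.
        self._last_seen = {
            key: seen for key, seen in self._last_seen.items()
            if now - seen <= self.ttl
        }
        return sorted(addr for (name, addr) in self._last_seen if name == service)
```

A gateway would call `healthy("order")` before routing, so an order instance that stops sending heartbeats drops out of the result once its TTL passes.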

02. API Gateway

Edge Abstraction: Provides a consolidated entry point for clients, hiding internal service topology and routing complexity.
Cross-Cutting Concerns: Centralizes SSL termination, authentication, authorization, rate limiting, and request validation.
Request Orchestration: Aggregates calls to multiple backend services into a single client-facing response.
Protocol Translation: Bridges external HTTP/JSON or WebSocket requests to internal RPC or gRPC service calls.
Risk Exposure: Can become a runtime bottleneck and single point of failure if overloaded with business logic.
Implementation Examples: Kong, AWS API Gateway, Apigee, Envoy, and Spring Cloud Gateway.

Example : The API gateway serves as the single entry point for customers and mobile app users. In this retail system, the gateway handles authentication, routing to the storefront service, and request aggregation for search and cart operations. It also enforces rate limits during peak shopping hours to prevent abuse. The gateway helps centralize cross-cutting concerns so backend services stay small and focused.
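One common way to enforce the rate limits mentioned above is a token bucket. The article does not say which algorithm the gateway uses, so treat this as one possible sketch; the `TokenBucket` class and its parameters are hypothetical.

```python
import time


class TokenBucket:
    """Hypothetical per-client limiter: each request spends one token."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Add tokens for the elapsed time, never exceeding the bucket capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # the gateway would typically answer with HTTP 429
```

The capacity sets how large a burst is tolerated, while the refill rate sets the sustained request rate per client.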

03. Backends for Frontends (BFF)

Client-Specific Interfaces: Deploys tailored backend layers differentiated by client type (mobile, web, third-party API).
Payload Optimization: Produces lean, client-specific response shapes to minimize over-fetching and unnecessary data transfer.
Team Autonomy: Separates frontend-specific orchestration from core backend services, enabling independent deployment.
Downstream Multiplexing: Coordinates data retrieval from different services and assembles responses optimized for each UI.
Duplication Risk: May lead to duplicate logic across different BFFs if shared concerns are not factored out.
Best Use Case: Useful for large systems with distinct client experiences and varying performance profiles.

Example : The marketplace uses a separate BFF for the web app and for the mobile app to tailor payloads. The web BFF aggregates product listings, user recommendations, and promotions into a rich storefront response. The mobile BFF returns a lighter response optimized for slow mobile networks and smaller screens. This results in a better user experience and reduced over-fetching on mobile devices.

04. Database Per Service

Data Ownership: Ensures each microservice owns and controls its own private datastore and schema.
Schema Autonomy: Allows services to evolve their storage model without requiring cross-team coordination.
Coupling Reduction: Prevents direct cross-service queries and database joins across service boundaries.
Polyglot Capability: Enables service-specific technology selection such as relational, document, graph, or key-value stores.
Consistency Trade-off: Pushes cross-service consistency concerns into asynchronous patterns like sagas or event-driven sync.
Operational Overhead: Increases the number of databases to administer, monitor, and secure.

Example : Each marketplace service owns its own database: orders use PostgreSQL, inventory uses Redis, and user profiles use MongoDB. This isolation enables each service to choose the best storage model and evolve independently. The product catalog service can scale its database separately from the checkout service. It also reduces coupling because services do not share the same schema.

05. Sidecar Pattern

Infrastructure Companion: Runs an auxiliary helper process alongside the primary service in the same host or pod.
Shared Lifecycle: The sidecar shares the same lifecycle and network namespace as the main application.
Cross-Cutting Offload: Handles infrastructure concerns such as telemetry, configuration, security, and proxying.
Language Independence: Supports non-intrusive enhancements for legacy or polyglot services.
Local Communication: Communicates over local loopback interfaces, reducing network latency while adding host overhead.
Common Use Case: The foundational building block for service mesh proxies and observability sidecars.

Example : A sidecar is attached to the inventory service to provide logging and metrics collection without changing the service code. For example, the inventory pod includes a sidecar proxy that captures stock updates and sends them to a monitoring pipeline. This keeps the inventory service free of observability responsibilities while still delivering telemetry. It also supports network policy enforcement and service mesh integration for the inventory component.

06. Health Check API

Automated Probing: Exposes endpoints such as /healthz, /ready, and /live for orchestrator health monitoring.
Liveness Detection: Indicates whether a service instance is alive; failures trigger restarts by the orchestrator.
Readiness Verification: Signals when an instance is ready to process traffic after startup and dependency initialization.
Lightweight Checks: Must be simple and fast to avoid creating monitoring-induced load on the service.
Dependency Awareness: Should validate only the minimum required runtime dependencies to avoid false positives.
Orchestration Integration: Drives behavior in Kubernetes, ECS, Nomad, and other container orchestrators.

Example : Each service exposes a health check endpoint that the orchestrator polls continuously. The gateway uses health checks to stop routing requests to unhealthy storefront instances. If a payment service instance fails its readiness probe, it is taken out of rotation so customer traffic does not reach it; if it fails its liveness probe, the orchestrator restarts it. This keeps the marketplace resilient under failure.
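The difference between liveness and readiness can be shown with two small handlers. This is a framework-free sketch: the function names, the response shapes, and the dependency names are assumptions, and a real service would expose these behind HTTP routes such as /live and /ready.

```python
def liveness():
    """The process is up and able to answer; no dependencies are checked."""
    return 200, {"status": "alive"}


def readiness(checks):
    """checks maps a dependency name to a zero-argument callable returning bool."""
    failed = [name for name, check in checks.items() if not check()]
    if failed:
        # 503 tells the orchestrator or gateway to stop routing traffic here.
        return 503, {"status": "not ready", "failed": failed}
    return 200, {"status": "ready", "failed": []}
```

Keeping the liveness handler free of dependency checks avoids restart loops when a downstream system, rather than the service itself, is unhealthy.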

Tier II: Data Patterns & Transactional Logic

01. CQRS [Command Query Responsibility Segregation]

Model Separation: Splits the application into command-side write models and query-side read models.
Write Path: Commands focus on state changes, business rules, validation, and transactional updates.
Read Path: Queries serve optimized, denormalized read views for fast retrieval.
Database Asymmetry: Each side can use different data stores suited to its access pattern.
Event Propagation: Updates to read models are typically driven by events emitted by the write side.
Consistency Implication: Introduces eventual consistency between write and read models.

Example : The marketplace separates command and query concerns by using a write model for order processing and a read model for customer dashboards. When an order is placed, the write service updates the transactional store. Events then update a denormalized read store used for fast order status views and reporting. This makes reads efficient without slowing down order writes.
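The write and read split can be sketched as two small classes connected by an event. The class names, event fields, and the `publish` callback are hypothetical, and the event is delivered synchronously here only to keep the example short; in practice it would travel through a broker, which is where the eventual consistency comes from.

```python
class OrderCommands:
    """Write side: validates the command and updates the transactional store."""

    def __init__(self, publish):
        self.orders = {}  # stands in for the transactional database
        self.publish = publish

    def place_order(self, order_id, customer, total):
        if total <= 0:
            raise ValueError("order total must be positive")
        self.orders[order_id] = {"customer": customer, "total": total}
        self.publish({"type": "OrderPlaced", "order_id": order_id, "customer": customer})


class OrderStatusView:
    """Read side: a denormalized view keyed the way the dashboard queries it."""

    def __init__(self):
        self.by_customer = {}

    def apply(self, event):
        if event["type"] == "OrderPlaced":
            self.by_customer.setdefault(event["customer"], []).append(
                {"order_id": event["order_id"], "status": "PLACED"}
            )

    def orders_for(self, customer):
        return self.by_customer.get(customer, [])
```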

02. Event Sourcing

Event Ledger: Persists every state change as an immutable event rather than updating mutable entity state.
Source of Truth: The current state is derived by replaying the event stream from the beginning.
Auditability: Delivers a complete historical trail for debugging, regulatory audits, and rebuilding state.
Snapshot Optimization: Uses periodic snapshots to reduce replay latency for long-lived aggregates.
Read Projections: Builds consumer-specific read models from the event stream asynchronously.
Implementation Fit: Common in event-driven systems and pairs well with CQRS architectures.

Example : The marketplace records order state changes as events in an event store. Each action like OrderPlaced, PaymentAccepted, and OrderShipped becomes an immutable event. This enables replaying history to rebuild order state or diagnose issues after a bug. It also supports audit logs and analytics by preserving the full sequence of changes.
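Rebuilding state by replay, with an optional snapshot as a shortcut, might look like the sketch below. The event names come from the example above; the status values, the `replay` function, and the snapshot shape are assumptions for illustration.

```python
# Hypothetical mapping from event type to the resulting order status.
STATUS_AFTER = {
    "OrderPlaced": "PLACED",
    "PaymentAccepted": "PAID",
    "OrderShipped": "SHIPPED",
}


def replay(events, snapshot=None):
    """Fold the event stream into current state, starting from a snapshot if given."""
    state = dict(snapshot) if snapshot else {"status": None, "version": 0}
    # The version counts events already applied, so replay resumes after them.
    for event_type in events[state["version"]:]:
        state["status"] = STATUS_AFTER[event_type]
        state["version"] += 1
    return state
```

The events are never modified; only the derived state changes, which is what makes the stream usable as an audit trail.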

03. Data Sharding

Horizontal Partitioning: Splits a large dataset into shards across multiple database nodes.
Scale Out: Allows workloads to grow beyond the capacity of a single database instance.
Shard Strategies: Includes range-based, hash-based, and directory-based partitioning.
Routing Logic: Requires a shard map or deterministic function to locate data.
Cross-Shard Complexity: Makes transactions and joins more difficult across shards.
Operational Cost: Increases complexity for re-sharding, backup, and capacity planning.

Example : Customer records are sharded by region so the user service can scale globally. For example, European shoppers are stored in one shard and North American shoppers in another. Cross-region queries are minimized, and each shard handles its own traffic footprint. This reduces latency and improves throughput during localized promotions.
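Two of the routing strategies listed above, directory-based and hash-based, can be sketched as follows. The shard names, region codes, and function names are hypothetical.

```python
import hashlib

# Directory-based routing: an explicit shard map, as in the region example.
REGION_SHARDS = {"EU": "users-eu", "NA": "users-na"}


def shard_for_region(region):
    return REGION_SHARDS[region]


def shard_for_key(key, shard_count):
    """Hash-based routing: a deterministic function of the key."""
    # A stable hash is used because Python's built-in hash() of a str
    # varies between processes.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Note that with plain modulo hashing, changing `shard_count` remaps most keys, which is one reason re-sharding is listed as an operational cost.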

04. Outbox Pattern

Transactional Guarantee: Writes business data and outgoing messages in the same local transaction.
Durable Outbox: Stores outbound events in a local outbox table when the business transaction commits.
Relayer Process: A separate process polls the outbox and publishes events to external brokers.
Atomicity: Eliminates the risk of outbox events being lost when a service crashes after commit.
At-Least-Once Semantics: Requires idempotent consumers to handle duplicates safely.
Change Data Capture: Can also leverage log tailing tools like Debezium for reliable publication.

Example : The order service writes both database changes and inventory update events to an outbox table in one transaction. A separate process reads the outbox and publishes messages to the inventory queue. This prevents lost events when the order commit succeeds but the message publish fails. It ensures reliable communication between order and inventory.

05. Polyglot Persistence

Best-Fit Storage: Matches each service’s data model to the most appropriate database technology.
Relational Use: Uses SQL databases for transactional workloads with strong consistency needs.
Document Use: Chooses document stores for schema-flexible or aggregate-oriented data.
In-Memory Use: Uses Redis or Memcached for fast caching and session state.
Graph Use: Applies graph stores for highly connected relationship queries.
Team Burden: Increases operational and organizational overhead across database platforms.

Example : The marketplace uses multiple databases for different needs: MongoDB for product catalog flexibility, PostgreSQL for transactional orders, and Elasticsearch for search. Each service selects the storage technology that matches its access patterns. This allows the product search team to optimize search indexes separately from transactional order consistency. The system becomes more adaptable to varied data requirements.

06. Externalized Configuration

Config Separation: Keeps configuration outside of the application code or image.
Single Artifact: Enables the same build artifact to deploy across environments with different settings.
Runtime Injection: Loads configuration via environment variables, mounted files, or remote config services.
Centralized Management: Uses configuration servers or vaults for centralized runtime settings.
Secret Handling: Keeps sensitive credentials in vaults or encrypted stores instead of code.
Dynamic Refresh: Supports hot reload for non-sensitive settings without redeploying containers.

Example : All service endpoints, feature flags, and database credentials are stored in a centralized configuration service. The storefront, order, and shipping services retrieve configuration at startup and refresh when changed. This avoids hardcoding environment-specific values into images. It also enables safe toggling of new features in production.
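Runtime injection through environment variables is the simplest of the loading options above. The sketch below assumes hypothetical variable names (`INVENTORY_URL`, `FEATURE_NEW_CHECKOUT`, `DB_POOL_SIZE`) and defaults; a remote config service or vault would replace the `env` mapping.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    inventory_url: str
    new_checkout_enabled: bool
    db_pool_size: int


def load_settings(env=None):
    """Build settings from the environment so the same image runs everywhere."""
    env = os.environ if env is None else env
    return Settings(
        inventory_url=env.get("INVENTORY_URL", "http://localhost:8081"),
        new_checkout_enabled=env.get("FEATURE_NEW_CHECKOUT", "false").lower() == "true",
        db_pool_size=int(env.get("DB_POOL_SIZE", "5")),
    )
```

Passing a plain dict as `env` makes the loader easy to test without touching the real process environment.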

07. Consumer-Driven Contract Testing

Contract Definition: The consumer declares the API contract it expects from a provider.
Provider Verification: The provider tests itself against consumer-defined expectations.
Integration Safety: Prevents breaking changes before services are deployed to shared environments.
Mock-Driven Development: Allows consumers to develop against provider contracts independently.
Change Control: Acts as a safety net for API evolution across independent teams.
Common Tools: Includes Pact, Spring Cloud Contract, and similar contract testing frameworks.

Example : The web storefront team defines a contract for the product service API, and the product team uses it to validate changes. The contract ensures the storefront can still fetch product details after product service updates. If the API response changes, contract tests fail before deployment. This prevents frontend/back-end mismatches in the marketplace.

Tier III: Decoupling, Messaging & Resilience Controls

01. Smart Endpoints

Domain Ownership: Places workflow, validation, and business logic inside the service endpoints.
Thin Middleware: Keeps infrastructure middleware simple and pushes behavior into the service.
Autonomous Decision Making: Services decide when to emit events or call other services.
Clear Responsibility: Improves domain-driven design by aligning behavior with the owning service.
Testability: Makes endpoints easier to test in isolation because logic is not hidden in the pipeline.
Resilience Trade-off: Can increase endpoint complexity while improving service autonomy.

02. Dumb Pipes

Transport Simplicity: Uses the messaging layer only to move data without applying business logic.
Message Transparency: Keeps the event stream or queue as a simple carrier for payloads.
Separation of Concerns: Prevents the pipeline from becoming an execution engine.
Observability: Simplifies tracing, retries, and failure handling in the transport layer.
Consumer Flexibility: Enables new consumers to attach without changing the message broker logic.
Ideal Fit: Best for event-driven architectures where service behavior belongs inside the services.

03. Asynchronous Messaging vs. Synchronous Communication

Synchronous (Blocking): The calling service sends a request and blocks its execution thread, waiting for an immediate, real-time HTTP REST or gRPC response from the receiver.
Temporal Coupling Risk: Excessive nested synchronous chains ($\text{Service A} \rightarrow \text{Service B} \rightarrow \text{Service C}$) cause latency inflation and introduce single points of failure across the entire call path.
Asynchronous (Non-Blocking): The originating service drops a message payload onto an intermediary queue or event stream and returns immediate control back to the caller thread.
Temporal Decoupling: Breaks immediate dependency ties; downstream consumers process incoming message packets at their own pace whenever resources become available.
Design Trade-offs: Synchronous is ideal for real-time operations like user authentication, while asynchronous is perfect for long-running, non-blocking background tasks like processing video updates or sending emails.
Infrastructure Pipeline: Asynchronous messaging relies on event brokers, durable distributed message logs, or message queues (such as Apache Kafka, RabbitMQ, or AWS SQS).

Example : Customer checkout uses synchronous calls for immediate order confirmation, while inventory updates and shipping notifications use asynchronous messaging. When an order is placed, the checkout service calls payment synchronously to approve payment. Once confirmed, it publishes an asynchronous event to the inventory queue and shipping pipeline. This combination gives a fast customer response while decoupling backend processing.

04. Bulkhead Pattern

Fault Containment: Isolates system resources into distinct pools to limit failure blast radius.
Resource Quotas: Uses separate thread pools, connection pools, or service partitions per functional domain.
Failure Isolation: Prevents a heavy failure in one area from starving others.
Service Stability: Allows degraded service behavior without taking down the entire system.
Database Limits: Can extend isolation to separate database connections by traffic type.
Design Inspiration: Named after ship bulkheads that keep damage contained within compartments.

Example : The shipping service and analytics service each have separate bulkheads so bursts in analytics processing do not affect shipping updates. During a promotion, analytics jobs might consume a lot of resources, but the shipping path remains isolated. This prevents the system from collapsing just because one service is busy. It effectively enforces resource limits per service domain.
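A bulkhead can be approximated in-process with one semaphore per functional domain. This is a sketch under stated assumptions: the `Bulkhead` class is hypothetical, and it rejects work when the pool is full rather than queueing it, which is only one of several possible policies.

```python
import threading


class Bulkhead:
    """Hypothetical concurrency cap for one functional domain."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args):
        # Reject immediately instead of waiting, so a saturated domain
        # cannot tie up callers that belong to other domains.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError(f"{self.name} bulkhead is full")
        try:
            return fn(*args)
        finally:
            self._slots.release()
```

Giving analytics and shipping separate `Bulkhead` instances means a burst of analytics jobs can exhaust only the analytics slots.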

05. Service Mesh

Infrastructure Data Plane: An infrastructure networking tier made of lightweight network sidecar proxies deployed alongside application instances to manage system-wide container communication.
Decoupled Traffic Engineering: Allows infrastructure operators to handle traffic routing, canary splitting percentages, mutual TLS (mTLS) encryption, and circuit breaking without altering application code.
The Control Plane Brain: Provides a centralized control plane (e.g., Istio's control architecture) to distribute security policies, routing tables, and encryption certificates down to data plane proxies.
Mutual TLS (mTLS) Security: Automatically encrypts all inter-container network communication with mTLS at the transport layer, handling certificate rotation and validation transparently.
Observability Ingestion: Gathers network performance telemetry data across all proxy hops, generating deep communication flow graphs and mapping system connectivity.
Latency Overhead Trade-off: Introduces an extra local network hop through the sidecar proxy plane, requiring careful memory allocation tuning across dense container environments.
Industry Frameworks: Deployed across cloud-native platforms using open-source service mesh projects like Istio, Linkerd, or Consul Connect.

Example : The marketplace uses a service mesh to handle secure communication, traffic policies, and observability between services. The mesh provides mutual TLS between the order, payment, and shipping services. It also collects metrics and enforces retries centrally. The teams can define policies without changing application code.

06. Distributed Tracing

Cross-Boundary Visibility: Traces the end-to-end path of a single client request as it travels across networks, thread pools, and asynchronous microservice boundaries.
The Global Correlation ID: Injects a unique trace ID into the HTTP/gRPC metadata headers at the edge API gateway; this ID is passed along transparently to every downstream service down the line.
Span Metrics Capture: Every localized step inside a service measures its own timing execution data as a "span," appending its timeline metadata directly back to the global trace ID context.
Latency Bottleneck Detection: Provides clear visual graphs mapping exactly which microservice hop or database query is causing latency spikes or throwing errors.
Sampling Rate Control: Limits network overhead by adjusting the sampling rate (e.g., tracing only 5% of successful requests but capturing 100% of errors).
Open Standard Integration: Configured using standard frameworks like OpenTelemetry, and visualized through distributed tracing platforms like Jaeger, Zipkin, or AWS X-Ray.

Example : Each customer checkout request is tagged with a trace ID across services. When the storefront calls the order service, payment service, and shipping pipeline, the trace carries through. This lets developers see the end-to-end latency and find slow segments. It is particularly useful when diagnosing distributed failures in the marketplace.
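Trace ID propagation and span timing can be illustrated without any tracing library. This is not the OpenTelemetry API; the `X-Trace-Id` header name, the `start_trace` and `span` helpers, and the trace dictionary shape are all assumptions for the sketch.

```python
import time
import uuid
from contextlib import contextmanager


def start_trace(headers):
    """Reuse the caller's trace ID if present; otherwise start one at the edge."""
    trace_id = headers.get("X-Trace-Id") or uuid.uuid4().hex
    return {"trace_id": trace_id, "spans": []}


@contextmanager
def span(trace, name):
    """Time one step and attach the measurement to the shared trace."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        trace["spans"].append({"name": name, "ms": elapsed_ms})


def outgoing_headers(trace):
    # Downstream calls carry the same ID so their spans join the same trace.
    return {"X-Trace-Id": trace["trace_id"]}
```

Spans are appended when they finish, so a nested payment span is recorded before the checkout span that encloses it.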

07. Log Aggregation

Centralized Stream Collection: Consolidates stdout and stderr runtime logs across hundreds of scattered container instances into a single, searchable central index dashboard.
Distributed Tracking Challenge: Replaces isolated server log files, which become unmanageable when scaling containers across elastic cloud networks.
The Data Shipper Pipeline: Deploys daemon data agents (e.g., FluentBit, Logstash, Filebeat) onto application hosts to instantly parse, tag, and forward logs to a centralized ingestion pipeline.
Structured JSON Formatting: Enforces standardized, structured JSON log outputs across all engineering teams to enable efficient indexing, querying, and filtering by metadata.
Storage Ingestion Tier: Deposits log data streams into highly scalable text-search database indices capable of processing millions of rows per second.
Production Observability Stack: Typically implemented using enterprise observability stacks like ELK (Elasticsearch, Logstash, Kibana), Grafana Loki, or OpenSearch.

Example : All marketplace services send logs to a centralized logging platform so operations can search and analyze failures. Order, payment, inventory, and shipping logs are aggregated into a single dashboard. During a high-traffic sale, engineers can trace errors across services from one place. Central logging also enables alerts on unusual error rates.
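Structured JSON output is the part of this pipeline that lives in application code. A minimal formatter using the standard `logging` module could look like this; the field names and the `service` attribute are assumptions, and shipping the lines to a central index is left to an agent such as those named above.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            # "service" is a hypothetical field supplied via the `extra` argument.
            "service": getattr(record, "service", "unknown"),
            "message": record.getMessage(),
        })


def build_logger(name):
    handler = logging.StreamHandler()  # writes to stderr by default
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger("order").error("payment declined", extra={"service": "order"})` then emits a line that can be filtered by `service` or `level` once indexed.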

08. Saga Orchestration vs. Choreography

Distributed Consistency: A design pattern used to maintain data consistency across decoupled service databases by breaking down a long distributed transaction into a chain of smaller, local transactions.
Compensating Transactions: If an update fails mid-chain, the system steps backward down the line, executing explicit compensating transactions to reverse changes and restore a consistent global state.
Saga Orchestration (Centralized): Uses a central orchestrator controller component that acts as a conductor, explicitly directing the execution steps and compensation paths across downstream services.
Orchestration Pro/Con: Simplifies tracking the global transaction state, but risks turning the orchestrator into a complex single point of control that is tightly coupled to all participant domains.
Saga Choreography (Decentralized): Follows a decentralized, reactive approach where services operate without a central controller, listening to a message bus and publishing events to trigger the next localized transaction.
Choreography Pro/Con: Delivers low coupling and high scalability, but makes tracing global transaction state across dozens of services complex and difficult to troubleshoot.

Example : The order workflow uses saga orchestration for payment, inventory reserve, and shipping booking. The orchestrator service executes each step and compensates if one step fails, such as releasing inventory if payment declines. In other cases, shipping and notifications may use choreography by listening to events and acting independently. This gives a clear flow for critical checkout steps while preserving loose coupling for auxiliary actions.
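The orchestration variant reduces to a loop that remembers which steps succeeded and undoes them in reverse order on failure. In this sketch the `run_saga` function, the step tuple layout, and the return strings are hypothetical, and the actions are plain callables standing in for calls to the payment, inventory, and shipping services.

```python
def run_saga(steps):
    """steps is a list of (name, action, compensation) tuples of callables."""
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception:
            # Undo already-completed steps, most recent first.
            for undo in reversed(completed):
                undo()
            return f"rolled back at {name}"
        completed.append(compensation)
    return "completed"
```

If the steps are reserve inventory, charge payment, and book shipping, a declined payment triggers only the inventory release, because shipping was never booked and the failed charge itself has nothing to undo.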

09. Strangler Fig Pattern

Incremental Migration: A migration strategy that decommissions monolith architectures by progressively replacing specific routes with newly designed microservices.
The Interceptor Proxy: Deploys an API routing gateway or reverse proxy at the system entrance to smoothly direct traffic across legacy paths and migrated paths based on endpoint URIs.
Risk Blast Minimization: Avoids risky "Big Bang" architectural rewrites by letting teams safely migrate separate business domains one slice at a time over several months.
Monolith Shrinkage: The legacy system stays alive and serving traffic throughout the migration, shrinking progressively until it can be fully sunsetted.
Data Layer Bridging: Requires careful database synchronization strategies (such as change data capture or dual-writing) to keep legacy databases and new databases in sync during migration phases.
Nomenclature Origin: Named after the tropical strangler fig tree, which grows slowly around a host tree until it completely replaces the original structure.

Example : The marketplace gradually replaces a legacy monolithic order processor by routing new checkout flows to a new microservice. The legacy monolith still handles old payment flows, while new services handle modern cart checkout. Over time, more routes are diverted away from the monolith until it can be removed. This lets the team migrate without taking the system offline.
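The interceptor proxy's decision is essentially a prefix lookup. The route prefixes and backend names below are hypothetical; in practice this rule would live in gateway or reverse proxy configuration rather than application code.

```python
# Hypothetical list of routes already migrated; it grows as the migration proceeds.
MIGRATED_PREFIXES = ["/checkout", "/cart"]


def route(path, migrated=MIGRATED_PREFIXES):
    """Send migrated routes to the new service and everything else to the monolith."""
    for prefix in migrated:
        # Match the prefix itself or a sub-path, but not "/cartoon" for "/cart".
        if path == prefix or path.startswith(prefix + "/"):
            return "new-service"
    return "legacy-monolith"
```

Retiring the monolith then amounts to adding prefixes until no request falls through to the final return.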

10. Stateless vs. Stateful Services

Stateless Service Mechanics: Every client request is entirely independent and contains all the contextual information needed to process it. The service instance does not store any session history or transaction state locally in its memory.
Stateless Horizontal Scaling: Extremely simple to scale horizontally; a load balancer can route requests to any identical instance in the cluster, making it easy to autoscale nodes up or down.
Stateless State Offloading: Persists all durable data state externally by offloading it to shared, highly available databases or distributed cache tiers (like Redis).
Stateful Service Mechanics: Instances retain client session data or transactional history locally in memory across multiple consecutive requests, requiring clients to hit the exact same server instance.
Stateful Scaling Challenges: Scaling out requires complex sticky session routing, partition key constraints, and state replication layers to ensure data is not lost if a node crashes.
Stateful Use Cases: Ideal for ultra-low latency architectures that require instant access to changing local state data—like real-time multiplayer gaming servers, active chat gateways, or streaming processing engines.
Architectural Standard: Modern microservice designs heavily prefer Stateless configurations for general business logic layers, while reserving Stateful setups for dedicated, partitioned data streaming infrastructure components.

Example : The storefront and search services are stateless so they can scale quickly behind the gateway. The shopping cart service is stateful when it pins active sessions in memory for fast access. Customer profile data remains stateful in the user service database, while the checkout path itself stays stateless with state held externally. This balance optimizes scalability while preserving session semantics where needed.

Tier IV: Resilience & Lifecycle Management

01. Circuit Breaker

Fault Isolation: Prevents a temporarily failing downstream dependency or database outage from causing cascading, system-wide thread exhaustion.
State Machine Mechanics: Operates continuously across three distinct runtime states: Closed (passing calls normally), Open (tripping fast and short-circuiting calls), and Half-Open (canary testing).
Threshold Tracking: Monitors remote network execution failure ratios; once errors cross a defined percentage window, the internal state trips to Open.
Fast Failure & Fallback: When the circuit is Open, incoming calls bypass the broken downstream service entirely and execute a safe, locally cached fallback routine to preserve user experience.
Automatic Healing: After a configurable cooldown sleep window, the circuit moves to Half-Open, letting a small trickle of canary traffic pass through to evaluate downstream recovery.
Production Tools: Implemented cleanly in modern application stacks using framework libraries like Resilience4j, Envoy Proxy filters, or Istio service meshes.

Example : The payment service is protected by a circuit breaker to avoid cascading failures. When the downstream payment processor starts timing out, the circuit breaker opens and immediately returns a friendly error instead of waiting. This prevents the order service from becoming overloaded with stuck requests. Once the payment processor recovers, the breaker moves to half-open and tests a few requests before allowing traffic again.
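The three-state machine can be sketched as a small class. To stay short, this version trips on a count of consecutive failures rather than the failure ratio over a window described above, and it lets a single trial call through in the half-open state; the `CircuitBreaker` class and its parameters are hypothetical and not the Resilience4j API.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, fallback):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown:
                return fallback()  # fail fast without touching the dependency
            self.state = "half-open"  # cooldown elapsed: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            return fallback()
        # Any success closes the circuit and resets the failure count.
        self.state = "closed"
        self.failures = 0
        return result
```

In the payment example, `fn` would be the call to the payment processor and `fallback` would return the friendly error shown to the customer.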

02. Retry Strategy

Transient Error Handling: Automatically replays failed network operations to gracefully handle short-lived failures like temporary network drops or quick target service restarts.
Exponential Backoff: Progressively delays consecutive retry attempts exponentially (e.g., $100\text{ms} \rightarrow 200\text{ms} \rightarrow 400\text{ms} \rightarrow 800\text{ms}$) to give struggling downstream systems time to recover.
Random Jitter Injection: Introduces random noise into the backoff calculation; this prevents the Thundering Herd Effect where failed instances hit downstream servers in synchronized waves.
Strict Idempotency Rule: Can only be applied safely to idempotent operations; retrying a timed-out, non-idempotent request without an absolute uniqueness key risks creating duplicate charges or entries.
Amplification Danger: Deeply nested microservice retry loops can trigger massive traffic amplification spikes, turning a minor downstream slowdown into a full cluster outage.
Framework Solutions: Configured and managed at the application code level via resilience engines like Resilience4j, Polly, or at the infrastructure proxy layer using Envoy.

Example : The order service retries transient failures when calling the inventory service with exponential backoff and jitter. If the inventory service briefly rejects a request due to load, the order service retries after a short delay. This increases reliability without overwhelming the backend. It ensures temporary outages do not immediately fail customer checkout.
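Exponential backoff with jitter fits in one function. The sketch assumes that only `ConnectionError` is transient, uses the "full jitter" variant (a random delay between zero and the current backoff), and takes hypothetical default values; the wrapped operation must be idempotent, as noted above.

```python
import random
import time


def retry(fn, attempts=4, base_delay=0.1, max_delay=2.0,
          sleep=time.sleep, rng=random.random):
    """Call fn, retrying transient failures with capped, jittered backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            backoff = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: spread retries so clients do not return in lockstep.
            sleep(backoff * rng())
```

The `sleep` and `rng` parameters exist only so the timing can be controlled in tests; callers would normally leave them at their defaults.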

03. Shadow Deployment

Risk-Free Testing: A deployment pattern that routes a live copy of production traffic to a new microservice version without altering the user response or affecting production state.
Non-Blocking Replication: The traffic duplication layer mirrors the request payload asynchronously, ensuring any latency or failure within the shadow environment has no impact on the live user path.
Production Sandbox Isolation: The shadow microservice processes incoming cloned inputs against a specialized read-only database replica or virtual sandbox to prevent side effects.
The Evaluation Loop: A response comparison engine tracks the outputs of both the live production version and the shadow testing version to validate performance, correctness, and data handling before cutover.
High-Stakes Validation: Perfect for testing complex updates—like fraud detection algorithms or core payment processor updates—under full production load with zero user risk.
Traffic Control Tier: Managed at the networking tier using service mesh sidecar routing policies (e.g., Envoy's traffic mirroring feature) or advanced API gateway routing rules.

Example : A new recommendation engine runs in shadow mode, processing real user traffic but never affecting what customers see. The marketplace compares its output against the current production engine before switching it live. This lets the team validate behavior on real traffic without risk. If results are good, they can promote the shadow service safely.
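The mirror-and-compare loop can be sketched as a wrapper around the live handler. For brevity the shadow call runs inline here, whereas the pattern as described above mirrors traffic asynchronously; the `handle` function and the `mismatches` list are hypothetical.

```python
def handle(request, live, shadow, mismatches):
    """Serve from the live version; record where the shadow version disagrees."""
    response = live(request)
    try:
        if shadow(request) != response:
            mismatches.append(request)
    except Exception:
        # A shadow failure is recorded but never reaches the user.
        mismatches.append(request)
    return response  # the user only ever sees the live response
```

An empty `mismatches` list after a representative traffic window is one signal that the shadow version is ready for promotion.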

04. Rolling Deployment

Zero-Downtime Updates: A progressive release strategy that updates active running instances of a microservice application incrementally across a production cluster.
Node-by-Node Progression: Takes single nodes or a fixed subset percentage of servers offline at a time, upgrades them to the new version, and introduces them back into the load balancer rotation.
Mixed-Version Window: During a rollout, the cluster serves live application traffic from a mixed environment running both the old version and the new version concurrently.
Safe Rollbacks: If validation errors or failure metrics spike mid-deployment, the orchestrator immediately halts the rollout, making a safe rollback as simple as rerouting traffic back to the remaining older nodes.
State Management Caution: Requires careful backward and forward API compatibility, as well as database schema compatibility, since both code versions must run against the database simultaneously.
Cloud Integration: Standard native deployment behavior out of the box for modern container orchestration engines like Kubernetes (strategy: type: RollingUpdate) and AWS ECS.

Example : The marketplace deploys a new version of the recommendations service with a rolling deployment so customers are not disrupted. One instance is updated at a time while others stay live. Traffic shifts gradually from the old version to the new version, and if errors appear, the update stops. This allows safe continuous delivery for large user volumes.
