DEV Community: Randa

Microservices Security: From Fundamentals to Advanced Patterns

Randa — Tue, 09 Sep 2025 13:47:24 +0000

This article explores key security principles and practical tools for protecting distributed microservices. From foundational ideas like least privilege and defense in depth to real-world practices including zero trust, encryption, observability, and service meshes, it guides you through making security decisions in microservice environments.

The Distributed Security Challenge
The Three Core Security Principles
- Least Privilege
- Defense in Depth
- Automation
The Five Functions of Cybersecurity
- Identify
- Protect
- Detect
- Respond
- Recover
Zero Trust
- Zero Trust Principles
- Zero Trust Use Cases
- Zero Trust Architecture
Protection Mechanisms
- Patching
- Authentication and Authorization
- Data in Transit
- Data at Rest
- Observability
- Service Meshes
Wrap-Up
Further Reading

The Distributed Security Challenge

Breaking apart a monolith into microservices creates a fundamental trade-off: you gain flexibility but multiply your security challenges.

Bigger Attack Surface

A monolith typically has three main security concerns: one application server, one database, and a few external APIs. With microservices, each service has its own endpoints, databases, and dependencies, creating a larger attack surface. As a rough illustration, if each service independently has a 1% chance of exposing a vulnerability on a given day, the chance that at least one of 10 services is exposed is 1 - 0.99^10, or about 9.6%, and with 100 services it rises to about 63%.
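The arithmetic behind that kind of estimate can be checked with a few lines of Python. This is only a sketch: it assumes every service carries the same risk and that the risks are independent, which real systems rarely satisfy, and the function name is made up for illustration.

```python
def breach_probability(per_service_risk: float, services: int) -> float:
    """Chance that at least one of `services` independent services is exposed."""
    # P(at least one) = 1 - P(none) = 1 - (1 - p)^n
    return 1.0 - (1.0 - per_service_risk) ** services


for n in (1, 10, 100):
    print(f"{n:>3} services: {breach_probability(0.01, n):.1%}")
# Prints 1.0%, 9.6% and 63.4% respectively.
```

The risk grows quickly with the number of services but never reaches certainty, which is why reducing per-service risk still pays off at scale.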

The Three Core Security Principles

Before diving into specific patterns and practices, we must establish the three core principles that should guide all security decisions in distributed systems.

1. Least Privilege

Grant the minimum access needed for each service to do its job. Nothing more.

Database Access Control

Ensure services only have access to the data they truly need. For example, Order service needs read/write access to the orders table but zero access to the payment table. If an attacker compromises the Order service, they can't touch payment data, minimizing the blast radius.

Network Segmentation

Restrict which services can communicate with each other so that an attacker cannot move freely between them. For example, Order service needs access to Payment but not Inventory. Most organizations deploy on "open by default" networks, which violates least privilege. Use network policies to allowlist permitted connections and reduce lateral movement.

The Default-Deny Approach

Start secure, then open access as needed. This requires more setup but creates stronger security and makes systems easier to reason about. So start with:

No network ports open by default
No database connections allowed by default
No service-to-service communication permitted by default
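The default-deny idea can be sketched in a few lines. In practice the rule set would live in network policies or mesh configuration rather than application code; the service names and the `ALLOWED_CALLS` table below are hypothetical and only mirror the Order/Payment/Inventory example used earlier.

```python
# Hypothetical allowlist of (caller, callee) pairs. Anything absent is denied.
ALLOWED_CALLS = frozenset({
    ("order", "payment"),
})


def is_call_allowed(caller: str, callee: str) -> bool:
    """Default deny: a call is permitted only if it is explicitly listed."""
    return (caller, callee) in ALLOWED_CALLS


print(is_call_allowed("order", "payment"))    # True
print(is_call_allowed("order", "inventory"))  # False
print(is_call_allowed("payment", "order"))    # False, direction matters
```

Because nothing is permitted until it is listed, every open path is a deliberate, reviewable decision.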

2. Defense in Depth

Don't rely on a single security measure. Build overlapping protections so attackers have to breach multiple defenses to cause real damage.

Security Controls

Security controls are the specific measures you put in place to protect your system, the actual defensive tools and processes. We group them into three types:

Preventative - Stop attacks (encryption, authentication, firewalls)
Detective - Spot attacks happening (monitoring, intrusion detection)
Responsive - Handle attacks (incident response, backups, recovery)

A robust security system requires all three types. Having only preventative controls means you won't know when they fail. Having only detective controls means attacks succeed before you can respond.

Layered Defense in Microservices

Microservices introduce multiple layers where you can implement these security controls. These typically include the network layer (securing North-South and East-West traffic through segmentation and encryption), the service layer (enforcing access rules and validation within each microservice), and the data layer (encrypting and restricting sensitive information).

Each layer protects against different threats. An SQL injection might bypass your network perimeter but gets stopped by input validation. A stolen credential might pass authentication but fails at role-based access controls.

3. Automation

Manual security processes don’t scale with microservices. Automation accelerates repetitive tasks, reduces human error, and ensures consistency. As your system grows with microservices, automation becomes essential for:

Consistently applying security configurations
Continuously monitoring and responding to security events
Efficiently applying patches and updates
Automating service-to-service communication

Throughout this article, we'll explore how automation supports these protection mechanisms.

Infrastructure as Code (IaC)

IaC allows you to manage and provision infrastructure through configuration files and scripts rather than manual processes. These files specify network rules, access controls, which services can talk to each other, and more.

By storing security configurations in version control (just like application code), automation tools apply them consistently across your infrastructure, eliminating manual intervention, reducing errors, and enabling quick recovery by rebuilding environments from version-controlled configurations after failures.

The Five Functions of Cybersecurity

The US National Institute of Standards and Technology (NIST) has defined a framework that breaks cybersecurity into five core functions, encouraging a broad, strategic approach rather than focusing only on the technical protection mechanisms.

As developers and architects, we often focus on the "Protect" function because it involves the technical challenges we enjoy solving (ironically, this is also the focus of our article). But truly secure systems require all five functions.

1. Identify

You cannot secure what you don't know exists. In microservices, this challenge multiplies as services spread across teams and environments. Identification involves the following activities:

Asset Inventory

List all deployed services and where they run
Track the version of each service in use
Map what dependencies each service has
Identify data each service handles or stores
Assign ownership - who maintains each service

Threat Modeling

Threat modeling is the process of identifying what attackers might want, how they might try to get it, and what the impact would be. To do this:

Build attack trees: Start with the attacker’s goal and work backward to explore possible attack paths.
Assign costs and impact for each attack path:
- Cost from the attacker's perspective ($ to $$$$)
- Potential impact to your business (High, Medium, Low)
Handle microservices complexities:
- Attack paths can span multiple services
- Service dependencies may cause cascading risk
- Rapid development cycles require frequent updates to threat models
Create multiple threat models:
- System-level model covering overall architecture
- Service-level models for high-risk services
- Integration models for critical service-to-service communications
- Regular cross-team modeling sessions to identify risks
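An attack tree can be represented as a small data structure. The sketch below is one possible encoding, not a standard tool: the `AttackNode` class, the AND/OR modes, the numeric 1 to 4 scale standing in for $ to $$$$, and the example goals are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AttackNode:
    goal: str
    cost: int = 0      # leaf cost for the attacker: 1 ($) to 4 ($$$$)
    mode: str = "OR"   # "OR": any child path works; "AND": all children are required
    children: List["AttackNode"] = field(default_factory=list)

    def cheapest_cost(self) -> int:
        """Cheapest attacker cost to reach this goal."""
        if not self.children:
            return self.cost
        child_costs = [child.cheapest_cost() for child in self.children]
        # Summing costs for AND nodes is a crude approximation.
        return min(child_costs) if self.mode == "OR" else sum(child_costs)


root = AttackNode("Read payment data", children=[
    AttackNode("Phish a database administrator", cost=2),
    AttackNode("Pivot through the Order service", mode="AND", children=[
        AttackNode("Exploit an unpatched dependency", cost=2),
        AttackNode("Bypass network segmentation", cost=3),
    ]),
])
print(root.cheapest_cost())  # 2: phishing is the cheapest path in this example
```

Walking the tree this way highlights the cheapest path, which is usually where raising the attacker's cost matters most.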

Threat Intelligence

While threat modeling analyzes your own system, threat intelligence tracks real-world attacks, so you deal with actual threats and not just theoretical ones. Use it to focus your security efforts where they matter most. A good resource is the Verizon Data Breach Investigations Report, which annually analyzes thousands of real security incidents. Key takeaways from the report:

Credential theft remains one of the most common attack vectors (several editions attribute roughly 80% of hacking-related breaches to stolen or weak credentials)
Unpatched vulnerabilities are increasingly exploited rapidly
Social engineering attacks are growing more sophisticated
Insider threats remain a significant risk

Prioritize Risks

Create a risk prioritization matrix by impact and likelihood. Focus first on attacks that are high-impact and cheap for the attacker, and on attack paths where you can easily raise the attacker's costs.

Impact \ Likelihood	High Likelihood	Medium Likelihood	Low Likelihood
Critical	Critical Risk	High Risk	Medium Risk
High	High Risk	Medium Risk	Low Risk
Medium	Medium Risk	Low Risk	Very Low Risk
Low	Low Risk	Very Low Risk	Minimal Risk
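The matrix translates directly into a lookup table. The sketch below encodes the rows above as a dictionary; the function name and the two sample threats are hypothetical.

```python
# Risk rating by impact (outer key) and likelihood (inner key), copied from the matrix.
RISK_MATRIX = {
    "Critical": {"High": "Critical Risk", "Medium": "High Risk", "Low": "Medium Risk"},
    "High": {"High": "High Risk", "Medium": "Medium Risk", "Low": "Low Risk"},
    "Medium": {"High": "Medium Risk", "Medium": "Low Risk", "Low": "Very Low Risk"},
    "Low": {"High": "Low Risk", "Medium": "Very Low Risk", "Low": "Minimal Risk"},
}


def classify_risk(impact: str, likelihood: str) -> str:
    """Return the rating for an impact/likelihood pair; raises KeyError if unknown."""
    return RISK_MATRIX[impact][likelihood]


# Hypothetical threats taken from a threat model.
threats = [
    ("Stolen service credentials", "Critical", "Medium"),
    ("Verbose error messages", "Low", "High"),
]
for name, impact, likelihood in threats:
    print(f"{name}: {classify_risk(impact, likelihood)}")
```

Keeping the matrix as data makes it easy to review and to apply consistently across teams.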

Make It Ongoing

Threat modeling isn't a one-time exercise. You need to schedule:

Quarterly threat model reviews as part of architecture planning
Post-incident threat model updates incorporating lessons learned
Threat modeling for new features during design phases
Regular review and integration of threat intelligence

2. Protect

Protection means implementing security controls to prevent incidents before they happen. We will talk more about this in Protection Mechanisms, where we will cover:

Authentication and authorization
Data encryption (in transit and at rest)
Vulnerability management and patching
Key management
Service meshes

3. Detect

Protection systems can eventually fail or be bypassed. Detection capabilities help identify security incidents quickly to minimize their impact.

Detection strategies:

Centralized logging and security event correlation
Behavioral analysis to identify unusual patterns
Automated threat detection using known attack signatures
Service mesh observability for network-level monitoring
Application performance monitoring to detect anomalies

Detection challenges in microservices:

Increased monitoring scope - dozens of services generating security events
Distributed attack patterns spanning multiple services
Managing false positives due to numerous alerts
Complexity in correlating events across services

4. Respond

When detection systems alert you to a potential security incident, you need well-defined response procedures. During an active incident, people are stressed and don't think clearly, so predefined playbooks and decision trees are essential.

Response planning considerations:

Escalation procedures - Who needs to be notified and when?
Communication plans - How to notify customers, partners, and regulators?
Containment strategies - How will you isolate compromised services?
Evidence preservation - How will you maintain forensic evidence?
Decision-making authority - Who decides during an incident?

Microservices-specific response challenges:

Service isolation - Can you take services offline without breaking everything?
Blast radius assessment - How do you quickly determine the impact scope?
Rollback procedures - Can you revert to safe versions fast?
Communication coordination - How to align teams’ actions?

5. Recover

Recovery involves restoring systems and applying lessons learned to prevent future incidents and improve resilience.

Recovery considerations:

Service restoration priorities - Which services need to come back online first?
Data integrity - How do you ensure your data hasn't been corrupted?
Dependency management - How to handle interdependent services in recovery?
Customer communication - How do you rebuild trust after an incident?

Learning and improvement:

Blameless post-mortems to understand what went wrong
System and process improvements based on lessons learned
Training and awareness to improve future response

Zero Trust

Zero Trust is a modern security architecture built on one core idea: Never trust, always verify, no matter where a request comes from. Traditional models rely on implicit trust, assuming anything inside the perimeter (like a VPN or internal network) is safe. This assumption fails once attackers breach that perimeter.

The Zero Trust model embodies the three core security principles introduced earlier, as the next sections show. It has been adopted by Google (BeyondCorp), Netflix (LISA), and Microsoft (ZT Model).

Zero Trust Principles

Zero Trust assumes no one is trusted by default, inside or outside the network. The core principles are:

Verify Explicitly: Authenticate and authorize every request at every layer, based on identity, device, and context.
Use Least Privilege Access: Limit access by role, resource, and action, not just broad user groups.
Assume Breach: Design your system as if an attacker is already inside.

Zero Trust Use Cases

Zero trust isn't one-size-fits-all. The decision should be driven by your threat model and business requirements.

Use it when:

You manage sensitive data (e.g. PII, finance, healthcare)
You operate in regulated industries (e.g. PCI, HIPAA, FedRAMP)
Your systems span multiple networks or cloud providers
You face advanced or persistent threats

Avoid it when:

You're a small team with internal-only systems
Your threat model shows low risk or low attacker motivation
You lack time and expertise to maintain strong identity and policy infrastructure

Zero Trust Architecture

Modern Zero Trust systems apply security controls across multiple layers for defense in depth. We’ll explore many of these mechanisms throughout the article:

Identity Layer - Verify who or what is making the request
- OAuth2/OIDC for user authentication (Auth0, Azure AD, Okta)
- Workload identities for services (SPIFFE/SPIRE, AWS IAM roles)
- Short-lived credentials and cert rotation
Network Layer - Don’t trust internal networks
- mTLS for service-to-service encryption
- Use microsegmentation to isolate workloads (e.g. Kubernetes Network Policies)
- Block unauthenticated east-west traffic
Service Layer - Each service enforces its own policy
- Service meshes to manage traffic and identity
- Policy enforcement via tools like OPA and Gatekeeper
- Per-request, context-aware authorization
Data Layer - Limit who can access what data and how
- Encrypt sensitive data at rest and in transit
- Authorize access at the service or method level
- Monitor and audit access to critical data sources

Protection Mechanisms

Now let's dive into the practical mechanisms you can use to protect your microservices. We'll cover the most critical areas where microservices create new security challenges or require different approaches than monolithic applications.

Patching

Patching is the process of applying updates to software, operating systems, and hardware to fix security vulnerabilities and enhance system performance. In microservices, where multiple layers and dependencies interact, patching is critical to reduce exposure to risks.

Microservices create a multi-layered environment that requires attention for patching at each level, from application code and its dependencies through the container OS and orchestration platform down to the host and its hardware.

Each of these layers needs regular patching. Container OS vulnerabilities, for instance, can accumulate even if your application code hasn't changed, making it essential to patch every layer and dependency in your architecture.

Why We Care About Patching

Security Vulnerabilities
Patching mitigates known vulnerabilities, and attackers can easily exploit unpatched systems. The Equifax breach in 2017 was caused by an unpatched Apache Struts vulnerability (CVE-2017-5638), even though a patch had been available for months. The breach affected 147 million Americans and cost over $1.7 billion in damages and regulatory fines.
Operational Stability
Outdated systems or components can become unstable, leading to service disruptions. In the 2017 AWS S3 outage, a mistyped command during routine maintenance removed more servers than intended, and subsystems that had not been fully restarted for years took hours to recover, disrupting access to critical cloud services and impacting many websites and apps.
Regulatory Compliance
Many industries face stringent compliance requirements (e.g. GDPR, HIPAA) that mandate timely patching of security vulnerabilities. Failing to patch systems can result in non-compliance, leading to fines and damage to reputation.

Challenges with Patching

Complex Dependency Chains
Microservices rely on hundreds of third-party libraries, creating tangled dependency trees. A vulnerability in one dependency can affect your entire system. The Log4Shell incident in 2021 is a prime example: the Log4j vulnerability was often buried deep within a service's transitive dependencies, so many organizations did not even know they relied on the library, let alone that they were at risk.
Vendor-Supplied Updates
Even security tools can introduce failures. In July 2024, a CrowdStrike update containing a logic error was pushed by the vendor and crashed systems around the world. This incident emphasizes the importance of thoroughly vetting and testing third-party updates before deploying them in production.
Operational Overhead
As microservices grow in size and complexity, keeping track of the patches for each service and component becomes a daunting task. With thousands of containers and dependencies, ensuring timely patching requires heavy automation and monitoring.
Patching Can Cause Disruptions
Even when patches are available, applying them often involves downtime, which may not always be feasible. This issue is exacerbated in production environments, where continuous availability is a requirement.

Protection Mechanisms

Automated Dependency Scanning
Use tools like Snyk, GitHub Advanced Security, or OWASP Dependency-Check to automatically scan for vulnerabilities in both direct and transitive dependencies. These tools integrate seamlessly with CI/CD pipelines, ensuring vulnerabilities are caught early in the development process before they reach production.
Container Image Scanning
Use image scanning tools such as Aqua Security, Twistlock (now part of Prisma Cloud), or Snyk Container to detect vulnerabilities in container images. Integrate them with your container registry so that vulnerable images are blocked from being deployed to production.
Software Bill of Materials (SBOM)
SBOM is essential for tracking all components and dependencies in your microservices stack, helping you quickly assess which need to be patched.
Automate Patching with Infrastructure as Code (IaC)
Leverage IaC tools (Terraform, CloudFormation) to automate patching across infrastructure.
Staged Rollouts and Canary Deployments
Apply patches through staged rollouts or canary deployments to catch issues early, before a full production rollout.
Managed Services
Offload patching of infrastructure layers (e.g. VMs, container orchestration) to managed cloud services (e.g. AWS ECS, Azure AKS, GKE) to minimize the patching burden on your team and ensure that lower layers of the stack are updated automatically.

Authentication and Authorization

Authentication and authorization represent two of the most critical security challenges in microservices architectures. Authentication checks who is making a request, usually at the system edge. Authorization decides what they’re allowed to do and must be enforced across all services. You authenticate once, but authorize everywhere.

In a monolithic architecture, this is simpler. The entire system runs as a single application, so authentication and authorization can be handled centrally. Access control logic has full visibility into user identity and data, and is enforced consistently across all layers (UI, backend, and database).

In a microservices architecture, responsibilities are split across independent services. Authentication is typically handled at the API gateway, which verifies identity and passes it along. Authorization is more complex: each service must make its own decisions based on the identity and claims it receives. Data is distributed, context is limited, and consistent enforcement requires clear token design and local policy checks within each service.

Authentication

Authentication verifies the identity of users or systems, usually at the perimeter of the system, through two centralized components: an identity provider and an edge proxy.

Identity Provider (IdP): Authenticates users and issues identity tokens using standards like OAuth 2.0 and OIDC. Can be cloud-based (Okta, Auth0) or self-hosted.
Edge proxy/gateway: Validates tokens, redirects unauthenticated requests to the IdP, and passes authenticated traffic to backend services. These proxies/gateways may take the form of traditional API gateways, ingress controllers, service mesh sidecars, or lightweight reverse proxies.

By delegating authentication to the proxy and IdP, the system avoids duplicating authentication logic across services and eliminates the need for individual services to store or validate credentials directly.

Single Sign-On (SSO)

In distributed systems, SSO allows users to authenticate once with the IdP and access multiple services without logging in again. It's typically implemented using identity protocols like OIDC on top of OAuth 2.0. This simplifies the login experience and avoids duplicating authentication logic across microservices.

User requests /login page and is redirected to IdP to submit credentials.
IdP authenticates user and issues a JWT containing identity and claims.
User requests a /checkout with JWT.
API gateway validates JWT locally. If invalid, user is redirected to IdP.
API gateway passes the request and token to the downstream service.
Subsequent requests include the token in Authorization: Bearer header.

If no API gateway is used, each microservice should validate the JWT itself.

Best practices

Use standard IdPs supporting OAuth 2.0 and OIDC.
Centralize authentication enforcement at the gateway level.
Avoid embedding authentication logic or credentials in microservices.
Enable MFA, especially for privileged users.
Choose short-lived credentials to limit risk.

Authorization

Authorization decides what that user or system is allowed to do and must be enforced throughout the system: inside services, between them, and at data boundaries. Four common models for handling authorization are:

Role-Based Access Control (RBAC): Based on user roles (admin, editor).
Attribute-Based Access Control (ABAC): Based on user/resource/env attributes.
Permission-Based Access Control: Use fine-grained explicit permissions (read:order, write:order).
Policy-Based Access Control (PBAC): External policy engines (OPA) manage centralized policies.

Centralized Authorization

One implementation is a centralized authorization service that every other service asks whether a user is allowed to perform a certain action. This approach adds latency, creates a bottleneck and a single point of failure, couples services tightly, and loses business context.

Another implementation is to centralize all authorization logic at the API gateway, so that every request goes through the gateway, where access is evaluated then routed to the services. No authorization checks inside services. This approach causes network overhead, added latency, complex config, and tight coupling.

Decentralized Authorization

A more scalable and resilient approach is to use self-contained tokens, typically JWTs (JSON Web Tokens), to carry authorization data with each request. This allows services to enforce policies locally without relying on a central service.

A JWT is a compact, secure token composed of three parts: Header.Payload.Signature

Header: Specifies the token type and signing algorithm.
Payload: Contains user identity, roles, permissions, and other claims.
Signature: Verifies token integrity using cryptographic keys.

Since a JWT contains all the required information, each microservice can validate and authorize requests independently, improving scalability and fault isolation.

Authenticate user via /login then IdP as explained before.
IdP authenticates user and issues a JWT containing identity and permissions.
User requests a /checkout with JWT.
API gateway validates JWT and passes it to the downstream services.
Order service checks JWT for read:order and write:order permissions.
Payment service checks JWT for write:payment.
Inventory service checks JWT for write:inventory.

If no API gateway is used, each microservice should validate the JWT itself.

JWT Considerations

Validation: Every service must validate the JWT.
Public Key Distribution: Services need the public key to validate JWT signatures. Use JWKS endpoints, service mesh integration, or secrets managers.
Token Size: Keep tokens small by including only necessary claims. Large tokens can exceed header size limits. Fetch additional details with separate calls when a service needs them.
Request-Scoped vs. Session-Scoped Tokens: Use session-scoped tokens for general-purpose, longer-lived access, and request-scoped tokens for short-lived, narrowly scoped operations. Request-scoped tokens enforce least privilege and reduce risk if leaked.

Data in Transit

Data in transit refers to information actively moving between services, across the internet, internal networks, or within distributed systems. This includes API calls, service-to-service communication, or any data exchanged over a network.

Why Protect Data in Transit

When services communicate over networks, four major risks arise:

Observation - Can attackers see your data?

Unencrypted traffic can be intercepted and read, leaking PII, credit card details, or internal API details.
Manipulation - Can attackers modify your data?

Intercepted data can be altered before reaching its destination, for example to change payment amounts, inject malicious payloads, or break business logic.
Access - Can attackers reach your endpoints?

Exposed services can be called directly, letting attackers bypass checks, hit internal APIs, and perform actions.
Impersonation - Can attackers pretend to be your services?

Without identity checks, attackers can pose as legitimate services, enabling man-in-the-middle (MITM) attacks, fake data, and unauthorized access.

TLS vs Mutual TLS

To secure data in transit, systems rely on Transport Layer Security (TLS) or in more secure environments, Mutual TLS (mTLS). Both are cryptographic protocols that encrypt communication, but differ in how they authenticate the parties involved.

TLS: Encrypts data in transit and authenticates the server, but the client is not verified during the handshake. It is the foundation of secure communication on the internet and internal systems.

Observation: Data is encrypted, preventing attackers from reading it in transit.
Manipulation: Integrity checks reject altered data.
Impersonation: The server proves its identity via its certificate, but the client isn't verified.
Access: Any client can connect. TLS does not authenticate the client. Access control must be handled at the application layer using tokens, keys, or credentials.

Mutual TLS: mTLS builds on TLS by requiring both the client and server to present valid certificates, enforcing mutual authentication during the handshake.

Observation: Data is encrypted in transit, exactly as with TLS.
Manipulation: Integrity checks reject altered data.
Impersonation: Both the client and server prove their identity using certificates.
Access: Only clients with valid certificates can connect, enforcing access before the application layer unlike TLS.
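On the server side, the difference between TLS and mTLS often comes down to whether the server demands a client certificate. The sketch below uses Python's standard `ssl` module; the function name and the file-path parameters are placeholders, and certificate loading is skipped when no paths are given so the snippet runs without real certificates.

```python
import ssl
from typing import Optional


def make_server_context(require_client_cert: bool,
                        cert_file: Optional[str] = None,
                        key_file: Optional[str] = None,
                        client_ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Build a server-side context: plain TLS, or mTLS when client certs are required."""
    # Purpose.CLIENT_AUTH yields a context meant for the server end of a connection.
    context = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file:
        # The server's own certificate, presented in both TLS and mTLS.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if require_client_cert:
        # mTLS: the handshake fails unless the client presents a trusted certificate.
        context.verify_mode = ssl.CERT_REQUIRED
        if client_ca_file:
            context.load_verify_locations(cafile=client_ca_file)
    return context


tls_context = make_server_context(require_client_cert=False)
mtls_context = make_server_context(require_client_cert=True)
print(tls_context.verify_mode == ssl.CERT_NONE)       # True
print(mtls_context.verify_mode == ssl.CERT_REQUIRED)  # True
```

With `CERT_REQUIRED`, clients without a certificate signed by the trusted CA are rejected during the handshake, before any application code runs.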

Application-Layer Protocols with TLS/mTLS

TLS and mTLS secure data as it moves over the network, but they’re applied through the protocols your services actually use to communicate with each other in a distributed environment.

Most of these are application-layer protocols built on top of TCP. Here are some of the most commonly used in modern systems:

HTTPS (HTTP over TLS)

The standard for web and API communication. Built on HTTP and secured by TLS.
gRPC

A high-performance communication framework that runs on HTTP/2 and supports TLS and mTLS natively. Suitable for service-to-service communication.
Message Brokers

Systems like Kafka or RabbitMQ support TLS for client-to-broker and broker-to-broker communication.
Custom Protocols

Any custom protocol built on TCP can be secured by layering TLS over the connection.

Protection Mechanisms

To ensure that communication across your systems is private and authenticated, implement the following:

Encrypt All Internal and External Traffic

All external and internal services should communicate over HTTPS or TLS, ensuring sensitive data remains protected at every hop. Suitable for zero trust.
Avoid Terminating HTTPS Too Early

Don't let encryption end at the gateway or load balancer. If TLS is terminated there, re-encrypt the traffic before forwarding it, so that internal hops are not exposed inside the network. Even better, use separate certificates for public and internal traffic.
Use Mutual TLS (mTLS)

Enforce mTLS between services that require strong identity validation. It allows you to reject unauthorized clients before the request even reaches the application layer. Makes sense with zero trust architecture.
Automate with a Service Mesh

Managing TLS and mTLS manually at scale is difficult. Service meshes automate certificate issuance, renewal, and rotation, handling encryption and authentication transparently across all traffic. We will cover service meshes in more detail later.
Apply the Same Standards to Non-HTTP Protocols

TLS and mTLS aren’t just for HTTP. Protocols like gRPC, message brokers, and custom protocols also support them and should be secured at the transport layer.

Data at Rest

Data at rest refers to any stored data (inside databases, file systems, backups, or logs) on disk, SSDs, or cloud storage. Unlike data in transit, it's not moving between systems but sits idle, waiting to be accessed.

In microservices, data is spread across many services, increasing the attack surface. That’s why defense in depth is critical: even with strong network and API security, assume breaches can happen and make sure stolen data is useless.

What Data to Protect

Not all data is equally sensitive. Start by classifying sensitive data per service or database. Common examples include:

PII (Personally Identifiable Information): names, emails, addresses
Authentication credentials: hashed passwords, session tokens, API keys
Payment data: credit card info, billing history
Business data: pricing models, analytics, trade secrets
Logs: which may unintentionally contain PII or secrets
Backups: often overlooked, but contain full data snapshots

How to Protect Data at Rest

Protect sensitive data with encryption and minimize data exposure:

Encryption Strategies

Encrypt sensitive data early, decrypt only when needed, and never store plain text:

Full Disk Encryption: Encrypt the entire disk. Simple to implement, but doesn't protect data if the app is compromised.
Transparent Data Encryption (TDE): Supported by many databases. Automatically encrypts data files and logs.
Column-Level Encryption: Encrypt specific database columns.
Application-Level Encryption: Encrypt data in code before storing it. This gives the application the most control but adds complexity.

Avoid implementing your own encryption algorithms. Use proven, maintained libraries, keep them updated, and track their vulnerabilities.

Key Management

Encryption is ineffective without proper key management. If you store the encryption keys alongside the data they protect, an attacker gets both.

Use a dedicated key management system (KMS) or secret manager
Separate data and key storage
Restrict key access by service identity and role
Rotate keys regularly, and make sure expired keys are removed
Audit key usage in production

Tools like HashiCorp Vault, AWS KMS, Azure Key Vault, and Google Cloud KMS help automate and secure key management.

Data Minimization

The less data you collect and retain, the less you have to protect, and the less an attacker can steal:

Collect only what's necessary for your service to function
Avoid storing sensitive data long-term unless required
Mask, hash, or anonymize data when full details aren’t needed
Regularly delete stale or unused data
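Masking and hashing can be sketched with the standard library. The helper names and the key below are hypothetical; in a real system the key would come from a KMS or secret manager, as described under Key Management.

```python
import hashlib
import hmac


def mask_card_number(card_number: str) -> str:
    """Keep only the last four digits, e.g. for display or logs."""
    digits = [c for c in card_number if c.isdigit()]
    if len(digits) <= 4:
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def pseudonymize(value: str, key: bytes) -> str:
    """Keyed hash (HMAC-SHA256): stable for joins and analytics, not reversible without the key."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


key = b"demo-key-load-from-a-kms-in-practice"  # placeholder, never hard-code real keys
print(mask_card_number("4111 1111 1111 1111"))  # ************1111
print(pseudonymize("Alice@example.com", key) == pseudonymize("alice@example.com", key))  # True
```

A keyed hash is used here instead of a plain hash so that an attacker without the key cannot confirm guesses by hashing candidate values such as email addresses.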

Observability

Observability gives you visibility into how your system behaves, critical in microservices where many services interact. It doesn't just help with spotting bugs, but also helps you detect threats, misconfigurations, or breaches by collecting telemetry that includes logs, metrics, and traces - the three pillars of observability:

Logs - Timestamped event records with structured format for easy search and correlation.
Metrics - Aggregated data like failure rates, latency, auth attempts used for alerting and trend tracking.
Traces - Show the path of a request across services, to spot abnormal access or performance bottlenecks.

To collect and analyze these, teams often use tools like Prometheus, Grafana, Jaeger, and OpenTelemetry.

Use Cases

Authentication/Authorization monitoring - Track failed logins, permission failures. Alert on unusual spikes or suspicious patterns.
Internal movement detection - Observe unexpected service-to-service calls to prevent internal compromise.
Incident audits and compliance - Maintain logs and metrics to trace issues and support regulatory requirements.
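
To make the authentication monitoring use case concrete, here is a small sketch that counts failed logins per source in a sliding time window and flags a source once it crosses a threshold. The class name, the threshold, and the window size are illustrative assumptions; in practice this logic usually lives in an alerting rule of your metrics stack rather than in application code.

```python
from collections import deque


class FailedLoginMonitor:
    """Counts failed logins per source within a sliding time window."""

    def __init__(self, threshold: int = 5, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: dict[str, deque] = {}

    def record_failure(self, source: str, timestamp: float) -> bool:
        """Record a failure and return True if the source should trigger an alert."""
        events = self._events.setdefault(source, deque())
        events.append(timestamp)
        # Drop failures that fell out of the window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


monitor = FailedLoginMonitor(threshold=3, window_seconds=60)
# Three failures within a minute raise an alert; a later isolated failure does not.
alerts = [monitor.record_failure("10.0.0.7", t) for t in (0, 10, 20, 200)]
```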

Best Practices

Use structured, centralized logs (for example, JSON logs shipped to an ELK stack) with correlation IDs to trace requests across services.
Track key health and security metrics, and watch for anything unusual.
Combine logs, metrics, and traces under a unified system to spot problems faster.
Build observability into your system from the start, not after things break.
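
The first practice can be sketched with Python's standard `logging` module: every record is rendered as one JSON object and carries the correlation ID of the current request. The field names, the `order-service` logger name, and the in-memory stream are assumptions for illustration; a real service would usually reuse an ID received in a request header and ship the logs to a central store.

```python
import contextvars
import io
import json
import logging
import uuid

# Holds the correlation ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(service: str, stream) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: set the ID once per request, then log as usual.
buffer = io.StringIO()
log = build_logger("order-service", buffer)
correlation_id.set(str(uuid.uuid4()))
log.info("checkout started")
log.warning("payment declined")
entries = [json.loads(line) for line in buffer.getvalue().splitlines()]
```

Because both entries share the same `correlation_id`, a search for that value in the central log store returns the whole request path.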

Service Meshes

A service mesh is an infrastructure layer that manages secure communication between microservices without requiring code changes in each service. It simplifies certificate management, enforces strong service identities, and ensures encrypted traffic. Widely used solutions include Istio, Linkerd, and Consul Connect.

Architecture Overview

Let's walk through the main components of a service mesh and how a request flows through it.

Data Plane: Composed of sidecar proxies deployed alongside each service. These proxies handle all service-to-service communication (routing, retries, mTLS encryption, and telemetry) without modifying service code. The service communicates locally with its sidecar over plain HTTP, while sidecars handle all outbound/inbound network communication.
Control Plane: A centralized component configures proxies, applies policies, manages certificates, and aggregates telemetry.

Example flow:

User sends a /checkout request via Ingress Gateway:
The request enters the mesh through the gateway, which terminates TLS and handles external-to-mesh traffic.
Ingress Gateway validates and forwards to Order service sidecar:
The gateway validates external identity (JWT, OAuth), applies mesh-level policies (rate limits, IP restrictions), and then establishes mTLS with Order service sidecar using certificates issued by the mesh control plane.
Order service sidecar forwards to local Order service instance:
Order service sidecar receives the request and forwards it to the local Order service instance over HTTP on localhost.
Sidecar-to-sidecar communication between Order and Payment services:
Order service sends a /payment request to its sidecar, which establishes an mTLS connection with the Payment service sidecar. The Payment sidecar then forwards the request to the local Payment service. This process repeats for all other internal service calls.
Telemetry is captured throughout:
Each sidecar emits telemetry, which the control plane aggregates and analyzes.

Security in Service Meshes

As seen, service meshes enhance security by default. The following features enable secure communication and consistent policy enforcement across services.

Automatic mTLS between services: All service-to-service traffic is encrypted using mTLS, enforced by sidecar proxies.
Centralized certificate management: Certificates are automatically issued, rotated, and revoked by the control plane.
Service identity and authentication: Each service gets a cryptographic identity, and authorization policies are managed centrally by the control plane.
Fine-grained authorization policies: Sidecars enforce detailed access rules, controlling which services can communicate.
Centralized JWT validation: Offloads token validation from service code to sidecars.
User identity propagation: Meshes can forward external user identities (from OAuth or SSO) across service calls.
Zero trust enforcement: All connections are authenticated, authorized, and encrypted. No implicit trust.
Observability and resilience: Built-in telemetry, retries, circuit-breaking, and load balancing.
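
To illustrate what a fine-grained, default-deny authorization policy amounts to, the sketch below models the check a sidecar could run once mTLS has verified the caller's identity. This is not the API of Istio, Linkerd, or Consul; the `Rule` type, the `POLICY` list, and the service names are hypothetical and follow the checkout example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str        # authenticated identity of the calling service
    destination: str   # service being called
    method: str
    path_prefix: str


# Hypothetical allow-list; anything not listed is denied.
POLICY = [
    Rule("ingress-gateway", "order", "POST", "/checkout"),
    Rule("order", "payment", "POST", "/payment"),
]


def is_allowed(source: str, destination: str, method: str, path: str) -> bool:
    """Default-deny check a sidecar could run after mTLS has verified `source`."""
    return any(
        r.source == source
        and r.destination == destination
        and r.method == method
        and path.startswith(r.path_prefix)
        for r in POLICY
    )
```

With this policy, Order can call Payment, but a compromised Email service calling Payment directly would be rejected, which limits lateral movement.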

When to Use a Service Mesh

In large architectures with many services
In zero-trust environments
When strict security policies require mTLS everywhere
In polyglot environments where consistent security is hard to maintain manually
In multi-team or multi-tenant environments requiring strong isolation

Wrap-Up

Securing distributed systems requires designing with resilience and layered defenses, knowing that failures and breaches can happen. The key is to assume compromise and build security controls that work together smoothly.

We’ve discussed core principles (least privilege, defense in depth, and automation) and examined how these translate into practical and scalable protections like encryption, zero trust, observability, and service mesh integration.

No single control is enough on its own. Strong security comes from a consistent use of these strategies across the entire architecture, early and continuously, making sure that when one layer weakens, others keep the system safe and reliable.

Designing Distributed Systems: Sagas and Trade-Offs

Randa — Thu, 05 Jun 2025 13:36:23 +0000

This article breaks down the three core forces behind designing distributed systems (communication, coordination and consistency) and shows how they combine into eight saga patterns. You’ll see how each pattern works, where it fits, and what trade-offs come with it. Whether designing a new workflow or improving old ones, this guide helps you reason through the options and make informed design decisions.

Throughout this article, we’ll explain things using an order checkout flow example.

The Three Forces of Service Interaction
Communication
- Synchronous Communication
- Asynchronous Communication
- Choosing Between Synchronous and Asynchronous
Coordination
- Orchestration Pattern
- Choreography Pattern
- Choosing Between Orchestration and Choreography
Consistency
- ACID vs BASE
- Atomic Transactions
- Eventual Transactions
- Choosing Between Atomic and Distributed Transactions
Saga Patterns
- Epic Saga
- Phone Tag Saga
- Fairy Tale Saga
- Time Travel Saga
- Fantasy Fiction Saga
- Horror Story Saga
- Parallel Saga
- Anthology Saga
Wrap-Up
Further Reading

The Three Forces of Service Interaction

Software has evolved from monoliths (one deployable, one database) to SOA (multiple deployables, often one shared database) and finally to microservices (each service owns its data and deploys on its own).

Splitting a system into separate services with the right modularity and granularity is hard, but getting those services to work together is even harder. Business requests like placing an order often span multiple services (Order, Inventory, Payment, Shipping) requiring coordination and introducing new design decisions and trade-offs.

To make sense of those trade-offs, Mark Richards and Neal Ford introduced in their book a useful way to think about service interactions. They identified three forces that show up every time services need to work together:

Communication - How does one service talk to another?
- Synchronous (like REST or gRPC): Caller waits for a response.
- Asynchronous (messaging or events): Caller sends a message and moves on.
Coordination - Who drives the workflow?
- Orchestrator: Central service tells each service what to do.
- Choreography: Services listen and react to events independently.
Consistency - When must the data be correct?
- Atomic: All-or-nothing, like a traditional transaction.
- Eventual: Some inconsistency is fine, resolved over time.

These forces trade off against each other. Atomic consistency leans on sync calls and orchestration. Async flows favor eventual consistency and choreography. Most systems mix styles, like orchestration for payments, choreography for notifications.

Next, we'll explore each of these forces in more detail, then show how they come together in eight saga patterns, practical approaches to handling distributed transactions.

Communication

When two services need to coordinate a task, how they communicate is just as critical as what they exchange. This choice directly impacts system responsiveness, fault tolerance, scalability, and the degree of coupling between services.

The fundamental communication styles are synchronous and asynchronous.

Synchronous Communication

In synchronous communication, one service sends a request to another service and waits for the response before continuing. This is a blocking interaction: the caller is stalled until it hears back. This pattern is common in protocols like HTTP/REST and gRPC.

The frontend sends a POST /checkout to Order Service.
Order Service calls Payment Service and waits for it to confirm the charge.
Once payment is confirmed, it calls Inventory Service to reserve stock.
Inventory Service calls Shipping Service to arrange delivery after successful reservation.
Only once all steps succeed, Order Service returns "Order confirmed." to the user.

We now have tight temporal coupling: all services must be online, responsive, and agree in real-time, or the whole system stalls.

Trade-offs

Upsides	Downsides
Immediate, deterministic feedback to the caller	Lower availability, one service failure breaks the chain
Simple control flow and debugging	Tight coupling between services
Fits user actions that must finish now (login, payment)	Requires resilience mechanisms (retries, timeouts, circuit breakers)
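
One of the resilience mechanisms named in the table, the circuit breaker, can be sketched as follows. After a configurable number of consecutive failures the breaker rejects calls immediately, then allows a single trial call once a cool-down has passed. The class, its parameters, and the fake clock are assumptions for illustration; production systems typically rely on a maintained library or on the service mesh for this.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are rejected without reaching the dependency."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call (half-open).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result


# Usage with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after=10.0, clock=lambda: now[0])


def call_payment():
    raise ConnectionError("Payment Service is down")


outcomes = []
for _ in range(3):
    try:
        breaker.call(call_payment)
    except CircuitOpenError:
        outcomes.append("rejected fast")
    except ConnectionError:
        outcomes.append("failed")

now[0] = 11.0  # cool-down has elapsed
recovered = breaker.call(lambda: "charged")
```

Failing fast keeps the caller from piling up blocked requests while the dependency is down.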

Asynchronous Communication

In asynchronous communication, one service places a message on a queue and moves on without waiting for a response. This is a non-blocking interaction. The other service picks up the message when ready, often using a message broker like Kafka or RabbitMQ. This decouples services in time and allows for more parallelism.

The frontend sends a POST /checkout to Order Service.
Order Service saves the order and emits an OrderPlaced event.
Order Service immediately responds to the user: "Your order is being processed."
Payment Service listens to that event, charges the card, then emits PaymentCaptured.
Inventory Service sees PaymentCaptured, reserves the stock, and emits StockReserved.
Shipping Service sees StockReserved, ships the item, and emits OrderShipped.
Email Service sees OrderShipped and sends the confirmation email.

No service blocks another, and messages queue safely while any service is down, but this also introduces eventual consistency. We will talk about consistency in the next section.
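
The flow above can be sketched with a tiny in-memory event bus standing in for a broker such as Kafka or RabbitMQ. The `EventBus` class and the `on` helper are hypothetical; the point is only that publishing enqueues an event and returns, so the response to the user is ready before any consumer has run.

```python
from collections import defaultdict, deque


class EventBus:
    """Tiny in-memory stand-in for a message broker."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Publishing only enqueues; the publisher never waits for consumers.
        self.queue.append((event_type, payload))

    def drain(self):
        """Deliver queued events until none remain."""
        while self.queue:
            event_type, payload = self.queue.popleft()
            for handler in self.handlers[event_type]:
                handler(payload)


bus = EventBus()
handled = []


def on(event_type, next_event=None):
    """Register a consumer that records the event and optionally emits the next one."""
    def handler(order):
        handled.append(event_type)
        if next_event:
            bus.publish(next_event, order)
    bus.subscribe(event_type, handler)


on("OrderPlaced", "PaymentCaptured")    # Payment Service
on("PaymentCaptured", "StockReserved")  # Inventory Service
on("StockReserved", "OrderShipped")     # Shipping Service
on("OrderShipped")                      # Email Service

bus.publish("OrderPlaced", {"order_id": 42})
response = "Your order is being processed."  # returned before any consumer runs
handled_before_drain = list(handled)
bus.drain()
```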

Trade-offs

Upsides	Downsides
High availability: If the receiver is down, messages queue and are processed once it recovers	No immediate feedback
Loose temporal coupling, highly resilient	Eventual consistency, caller sees only "accepted"
High parallelism and scalability	Requires extra infrastructure (brokers, tracing)

Choosing Between Synchronous and Asynchronous

The choice depends on the trade-offs you're willing to make between responsiveness, reliability, and coupling.

Use synchronous communication when:

The caller needs an immediate result (e.g. credit-card charge, login).
The service's response directly controls what happens next.
Dependencies are reliable and low-latency.

Use asynchronous communication when:

Loose coupling and resilience matter more than speed.
The task can be done later or retried (e.g., sending emails, logging, bulk imports).
You need high throughput or resilience. Services need to keep working even if others are down.
Services are independently deployable or might be temporarily unavailable.

Coordination

When a business request spans multiple services, those services need to work in sync to get the job done. But who drives the workflow? Should one service take charge, or should each one act on its own? That's what coordination is all about.

The coordination style you choose shapes everything, from how you handle errors to where state lives to how complex things get. There are two main patterns: orchestration and choreography.

Orchestration Pattern

A dedicated service (orchestrator) is in charge. It drives the flow by calling each participating service, waiting for their responses, and deciding what happens next. It also owns the workflow state, often storing it in a local table or event log (CREATED, PAID, SHIPPED, etc.). This makes it easy to know exactly where a request stands.

Happy Path

The frontend sends a POST /checkout to the Orchestrator.
The orchestrator calls Order Service (sync) to create the order.
Then it calls Payment Service (sync) to charge the card.
Then it calls Inventory Service (sync) to reserve the stock.
Then it notifies Shipping Service (async) to ship the item.
Then it notifies Email Service (async) to send confirmation.
Finally, it responds to the user with "Order confirmed."
In each step the orchestrator updates the workflow state.

Failure Path

Payment Service says "declined".
The orchestrator updates workflow state to FAILED_PAYMENT.
Then it asks Order Service to undo its changes. This is known as a compensating action.
Then it asks Email Service to notify the user.
Then it responds to the user with "Payment has failed".
No extra communication links are needed because the orchestrator already talks to every service.

These examples illustrate the Fairy Tale Saga. We will talk about sagas later.
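
The happy and failure paths above can be sketched as a small orchestrator that records the workflow state after every step. The service objects are stubs, and names such as `CheckoutOrchestrator`, `notify`, and the `CONFIRMED` state are assumptions for illustration rather than part of any framework.

```python
from types import SimpleNamespace


class PaymentDeclined(Exception):
    pass


class CheckoutOrchestrator:
    """Drives the checkout and records the workflow state after every step."""

    def __init__(self, order, payment, inventory, notify):
        self.order, self.payment, self.inventory = order, payment, inventory
        self.notify = notify  # stand-in for asynchronous notifications
        self.state = {}       # order_id -> workflow state

    def checkout(self, order_id):
        self.order.create(order_id)
        self.state[order_id] = "CREATED"
        try:
            self.payment.charge(order_id)
        except PaymentDeclined:
            self.state[order_id] = "FAILED_PAYMENT"
            self.order.cancel(order_id)  # compensating action
            self.notify("payment_failed_email", order_id)
            return "Payment has failed"
        self.state[order_id] = "PAID"
        self.inventory.reserve(order_id)
        self.state[order_id] = "RESERVED"
        self.notify("ship", order_id)
        self.notify("confirmation_email", order_id)
        self.state[order_id] = "CONFIRMED"
        return "Order confirmed."


# Stub services that only record which calls were made.
calls = []


def record(name):
    return lambda order_id: calls.append(name)


def decline(order_id):
    raise PaymentDeclined(order_id)


order = SimpleNamespace(create=record("order.create"), cancel=record("order.cancel"))
inventory = SimpleNamespace(reserve=record("inventory.reserve"))
notify = lambda kind, order_id: calls.append(f"notify.{kind}")

happy = CheckoutOrchestrator(order, SimpleNamespace(charge=record("payment.charge")), inventory, notify)
failing = CheckoutOrchestrator(order, SimpleNamespace(charge=decline), inventory, notify)
happy_result = happy.checkout(1)
failed_result = failing.checkout(2)
```

Because the orchestrator owns the state dictionary, answering "where does order 2 stand?" is a single lookup.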

Trade-offs

Upsides	Downsides
Single source of truth for progress and errors	Extra network hops add latency
Central place for timeouts, retries, compensations	Orchestrator can bottleneck or fail
Easier to reason about and unit-test complex flows	Limits parallelism, steps are often serialized
	Tighter coupling between orchestrator and services

Choreography Pattern

Choreography works without a central service. Each service reacts to events and publishes its own events. Together, these event-driven reactions form the workflow. Since there's no orchestrator, managing state is trickier. Here are common options:

Front Controller: The first service in the chain (e.g. Order Service) tracks the state. Others report back. Easy to query, but adds responsibilities and coupling.
Stateless: No service tracks workflow state. To know what happened, you query each service and reconstruct the state on the fly. Loose coupling, but lots of network chatter.
Stamp Coupling: Instead of storing state, pass it along. Each service adds its progress to the shared message or event as it moves through the workflow. No extra queries, but messages get heavier.
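
As a rough sketch of the stamp coupling option, each service below appends its progress to the message before passing it on, so the latest message always carries the full workflow history. The message shape and the `stamp` and `workflow_status` helpers are assumptions for illustration.

```python
import copy


def stamp(message: dict, service: str, status: str) -> dict:
    """Return a copy of the message with this service's progress appended."""
    stamped = copy.deepcopy(message)
    stamped.setdefault("progress", []).append({"service": service, "status": status})
    return stamped


def workflow_status(message: dict) -> str:
    """Derive the overall status from the stamps carried by the message."""
    progress = message.get("progress", [])
    return progress[-1]["status"] if progress else "NEW"


event = {"order_id": 42}
event = stamp(event, "order", "ORDER_PLACED")
event = stamp(event, "payment", "PAYMENT_CAPTURED")
event = stamp(event, "inventory", "STOCK_RESERVED")
```

No service has to be queried for status, but the message grows with every hop, which is the trade-off named above.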

Happy Path

The frontend sends a POST /checkout to Order Service.
Order Service saves the order, emits OrderPlaced.
Order Service returns immediately to the user "Your order is being processed".
Payment Service listens, charges the card, emits PaymentCaptured.
Inventory Service listens, reserves the stock, emits StockReserved.
Shipping Service hears StockReserved, ships the item, emits OrderShipped.
Email Service listens for OrderShipped and sends confirmation to the user.

Failure Path

Shipping Service emits OutOfStock.
Payment, Inventory and Order services listen to OutOfStock to undo their changes.
Email Service listens to OutOfStock and notifies the user.
New communication links are added each time you discover a new error path.

These examples illustrate the Anthology Saga.

Trade-offs

Upsides	Downsides
High parallelism, steps run in parallel	Debugging involves multiple logs and topics
Loose coupling, services scale independently	No built-in global state, must design your own approach
Better fault-isolation, no single point of failure	Error handling scatters across services

Choosing Between Orchestration and Choreography

Start with the workflow's priorities, then pick the style that matches.

Complex logic or many ways to fail? Orchestration wins. A single component tracks steps, rolls back work, and hides complexity from others.
Need fast responses and high parallelism? Choreography fits. Each service does its job and moves on, letting the rest catch up through events.
Want an easy way to track the workflow status? Orchestration gives a single source of truth. With choreography, you'll need to reconstruct state from events.
Worried about a single point of failure? Choreography removes the central brain at the cost of more scattered error handling.

Most production systems mix the two. Keep orchestration for high-risk, money-moving steps such as payment and refunds, where clear control and fast rollback matter. Use choreography for high-volume, low-risk steps like sending emails, updating analytics, or syncing inventory, where speed and autonomy pay off.

Consistency

Consistency is the guarantee (strong or weak) that when one service updates data, all other services will immediately or eventually see the same result.

In a distributed system, as soon as a business request involves more than one service, you have to decide how much inconsistency you can tolerate between them, and for how long. Whether you aim for strict, all-or-nothing guarantees (atomic consistency) or let things settle over time (eventual consistency), your consistency strategy shapes how reliable, responsive, and maintainable your system really is.

There are two consistency styles: atomic consistency and eventual consistency. Before exploring them, let's look at how consistency works in the monolith world.

ACID vs BASE

Inside a single service with a single database the "order checkout" workflow is simple. A request starts and triggers a single transaction: insert the order row, reserve stock, charge the card, mark the order ready to ship. If the card step fails, the database rolls everything back. That comes from the four ACID guarantees for transactions:

Atomicity: All-or-nothing. All updates commit or none do.
Consistency: Business rules and constraints stay valid throughout the transaction.
Isolation: During a transaction, other requests can't see its uncommitted changes.
Durability: Once committed, it's permanent, a crash can't erase the data.
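
Atomicity in a single database can be demonstrated with Python's built-in `sqlite3` module. In this sketch the card step always fails, so the order insert and the stock update made earlier in the same transaction are rolled back. The table layout and the `charge_card` stub are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, quantity INTEGER)")
conn.execute("INSERT INTO stock VALUES ('book', 5)")
conn.commit()


class CardDeclined(Exception):
    pass


def charge_card(order_id: int) -> None:
    """Stand-in for the card step; always fails to show the rollback."""
    raise CardDeclined(order_id)


def checkout(order_id: int, sku: str) -> bool:
    try:
        # The connection context manager commits on success and rolls back on error.
        with conn:
            conn.execute("INSERT INTO orders VALUES (?, 'CREATED')", (order_id,))
            conn.execute("UPDATE stock SET quantity = quantity - 1 WHERE sku = ?", (sku,))
            charge_card(order_id)
            conn.execute("UPDATE orders SET status = 'READY_TO_SHIP' WHERE id = ?", (order_id,))
        return True
    except CardDeclined:
        return False


succeeded = checkout(1, "book")
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
stock_left = conn.execute("SELECT quantity FROM stock WHERE sku = 'book'").fetchone()[0]
```

After the failed checkout there is no order row and the stock is back at 5, with no cleanup code needed.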

Move the same workflow into four microservices (Order, Inventory, Payment and Shipping), each with its own database, and ACID breaks. Order and Inventory commit, Payment times out, and there is no global rollback: constraints drift and partial updates leak to users. ACID guarantees only apply within a single database transaction.

You could try a global XA transaction using 2PC, but it means extra network round-trips and long-held locks. The single coordinator can stall the system and kill availability, and every datastore must support the same XA protocol. Most modern teams decide the cost is too high.

Instead, you swap ACID for BASE:

Basic availability: Services respond quickly, even if data is temporarily inconsistent.
Soft state: State may temporarily be incorrect or incomplete.
Eventual consistency: Given retries, compensations or human help, the data will line up.

BASE is a promise to converge, not a guarantee of instant correctness.

Atomic Transactions

If you want an ACID-like experience across services, you typically introduce a central service (orchestrator) that drives the whole workflow. It synchronously invokes each service, commits locally inside each one, and triggers compensating transactions to undo all work if something fails as if it never happened. A response is returned to the caller once all steps succeed or rollback completes.

Happy path

The frontend sends a POST /checkout to the Orchestrator.
The orchestrator calls Order, Payment, Inventory and Shipping services in sequence.
Each service commits to its local database immediately with no failures.
The orchestrator returns "Order confirmed." to the user.

Failure path

Order, Payment, and Inventory services have already committed.
Shipping Service times out.
The orchestrator immediately issues three compensating transactions to undo the earlier steps.
The orchestrator returns "Unable to ship" to the user once every compensation succeeds.

Points to watch

This gives you ACD but no Isolation. Other requests can see intermediate states before compensation finishes, so dirty reads can happen and other requests might overwrite in-progress changes.
Compensation itself might fail (e.g. refund gateway offline), so you need retries or manual dashboards.
Side-effects (email, analytics) already triggered may not be reversible.

This is the Epic Saga, one way to handle atomic transactions.

Trade-offs

Upsides	Downsides
Data consistency and invariants are restored immediately once compensations finish	Lower availability, response time grows with each hop and compensation
User sees one clear success/failure result	Orchestrator is a coordination hot-spot and potential bottleneck
Deterministic rollback logic lives in one place	Isolation is gone, other requests may see half-done state until compensation finishes

Eventual Transactions

The more scalable alternative is to let each service act independently. Services commit changes locally, publish asynchronous events, return immediately, and rely on other services to react to these events in their own time. To handle failures, instead of trying to undo work immediately, they are managed through retries, fallback states, or human intervention.

Happy Path

The frontend sends a POST /checkout to Order Service.
Order Service saves and commits the order, emits OrderCreated event.
Order Service responds to the user immediately "Your order is being processed".
Payment Service processes OrderCreated, charges card and emits PaymentCaptured.
Inventory Service processes PaymentCaptured, reserves stock and emits StockReserved.
Shipping Service hears StockReserved, ships the item and emits OrderShipped.
Email Service hears OrderShipped and notifies the user.
Order Service hears OrderShipped and marks the order as FULFILLED.

Failure Path

Payment Service declines the charge and emits PaymentFailed.
Order Service hears PaymentFailed, marks order as PAYMENT_FAILED.
From here, we have several recovery paths:
- Retry Policy: Payment Service retries the charge and emits PaymentCaptured or PaymentFailed again.
- Human Intervention: A support dashboard highlights stuck orders with PAYMENT_FAILED for a human to manually fix or retry.
- Fallback State: System gives up and issues compensating transactions to clean-up. Here Order Service hears PaymentFailed, marks order as CANCELLED and emails users about this issue. Similar to the example in Choreography - Failure Path.
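
The retry policy and the hand-off to a human can be sketched together: a step is retried a bounded number of times with exponential backoff, and an event that still fails is parked for attention. The function name, the attempt limit, and the in-memory `holding_queue` are assumptions for illustration.

```python
import time


def process_with_retry(handler, event, max_attempts=3, base_delay=0.0, sleep=time.sleep):
    """Retry a failing step with exponential backoff; report the final outcome."""
    for attempt in range(1, max_attempts + 1):
        try:
            handler(event)
            return "PROCESSED", attempt
        except Exception:
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))
    return "NEEDS_ATTENTION", max_attempts


holding_queue = []  # stand-in for a dead-letter queue or support dashboard
attempts_seen = []


def flaky_charge(event):
    """Fails twice with a transient error, then succeeds."""
    attempts_seen.append(event["order_id"])
    if len(attempts_seen) < 3:
        raise TimeoutError("payment gateway timeout")


def always_declined(event):
    raise ValueError("card declined")


first = process_with_retry(flaky_charge, {"order_id": 1})
second = process_with_retry(always_declined, {"order_id": 2})
if second[0] == "NEEDS_ATTENTION":
    holding_queue.append({"order_id": 2, "state": "PAYMENT_FAILED"})
```

A real policy would also distinguish transient errors (timeouts) from permanent ones (a declined card), since retrying the latter rarely helps.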

This is the Anthology Saga, one way to handle eventual transactions.

Points to Watch

Decide where the status lives (row column, side-car table, or event stream). Splitting state across multiple places invites race conditions.
Idempotency is crucial. Every step may be retried. Services must handle duplicate events without breaking state.
For every non-terminal failure state (e.g. PAYMENT_FAILED), identify who's responsible for fixing it and how (automatic retry, human help, or another event).
Failures that can't recover should be moved to a holding queue or flagged for investigation.
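
The idempotency point can be sketched as a consumer that remembers which event IDs it has already handled, so a redelivered event does not charge the card twice. The `event_id` field and the in-memory set are assumptions for illustration; a real service would persist processed IDs, ideally in the same local transaction as the charge.

```python
class PaymentConsumer:
    """Charges each order at most once, even if the same event is delivered again."""

    def __init__(self):
        self.processed_event_ids = set()  # a durable store in a real service
        self.charges = []

    def handle(self, event: dict) -> bool:
        """Return True if the event caused a charge, False if it was a duplicate."""
        if event["event_id"] in self.processed_event_ids:
            return False
        self.charges.append((event["order_id"], event["amount"]))
        # In a real service, record the ID in the same local transaction as the charge.
        self.processed_event_ids.add(event["event_id"])
        return True


consumer = PaymentConsumer()
order_created = {"event_id": "evt-1", "order_id": 42, "amount": 1999}
# The broker delivers the same event three times; only the first one charges.
results = [consumer.handle(order_created) for _ in range(3)]
```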

Trade-offs

Upsides	Downsides
High availability	Short windows of data drift. Dashboards, users, and code must tolerate it
Services scale and deploy independently	Requires retry logic, compensating transactions, or human help to clean-up
High throughput, no tight transaction boundaries	Debugging spans multiple event hops

Choosing Between Atomic and Distributed Transactions

The choice depends on the trade-offs you're willing to make between responsiveness, level of consistency, or effort to recover from failure. Ask yourself a few questions:

How strict is consistency? If any mismatch causes serious issues (money, security), atomic wins. If delay is fine, eventual scales better.
Can you undo steps? Atomic needs safe rollbacks. If not possible, prefer retries or manual repair.
Do users need fast responses? Atomic blocks until all steps finish. Eventual responds fast, even if some parts run later.
What's your fault tolerance? Atomic isolates failure but can reduce availability. Eventual keeps moving, but errors may surface later.
How autonomous are your services? Atomic often requires orchestration. Eventual keeps services decoupled and event-driven.

Most production systems combine atomic transactions for local operations with distributed, asynchronous messaging across services. Some steps might use synchronous calls for strong feedback, while others rely on eventual consistency and retries.

Saga Patterns

We've already explored different ways to handle business workflows that span multiple services. These are known as sagas. A saga breaks the workflow into local transactions, each owned by one service. After each step commits, the next is triggered via a call or an event, depending on the communication style. If any step fails, the saga issues compensations or moves into an error‐handling path, depending on the consistency and coordination model.

There are eight saga patterns. They're simply every possible combination of the three forces we've been using throughout the article. Mark Richards and Neal Ford gave these sagas memorable names:

Pattern name	Communication	Consistency	Coordination
Epic Saga	synchronous	atomic	orchestrated
Phone-Tag Saga	synchronous	atomic	choreographed
Fairy-Tale Saga	synchronous	eventual	orchestrated
Time-Travel Saga	synchronous	eventual	choreographed
Fantasy-Fiction Saga	asynchronous	atomic	orchestrated
Horror-Story Saga	asynchronous	atomic	choreographed
Parallel Saga	asynchronous	eventual	orchestrated
Anthology Saga	asynchronous	eventual	choreographed
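
Since the eight patterns are simply the 2 × 2 × 2 combinations of the three forces, the table can be regenerated in a few lines. This sketch only restates the table; the `saga_for` helper is a hypothetical convenience.

```python
from itertools import product

COMMUNICATION = ("synchronous", "asynchronous")
CONSISTENCY = ("atomic", "eventual")
COORDINATION = ("orchestrated", "choreographed")

NAMES = [
    "Epic", "Phone Tag", "Fairy Tale", "Time Travel",
    "Fantasy Fiction", "Horror Story", "Parallel", "Anthology",
]

# product() varies the last axis fastest, which matches the row order of the table.
SAGAS = dict(zip(product(COMMUNICATION, CONSISTENCY, COORDINATION), NAMES))


def saga_for(communication: str, consistency: str, coordination: str) -> str:
    """Look up the saga name for one combination of the three forces."""
    return SAGAS[(communication, consistency, coordination)] + " Saga"
```

For example, `saga_for("synchronous", "eventual", "orchestrated")` returns the Fairy Tale Saga used in the orchestration example earlier.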

Dotted boxes show atomic consistency. No box means eventual consistency.

Epic Saga

Synchronous • Atomic • Orchestrated

This pattern enforces all-or-nothing behavior via an orchestrator that makes blocking, synchronous calls and triggers compensating actions on failure. This makes the system behave like a monolith.

The orchestrator receives the request and manages the workflow.
It calls each service one after the other, waiting for each to respond.
If all services succeed, the saga completes successfully.
If any step fails, the orchestrator triggers compensating actions in reverse order.
Guarantees atomicity but suffers from bottlenecks and tight coupling.

Choose Epic Saga when you need all-or-nothing behavior and the workflow is relatively short-lived. It’s a familiar approach, but should be avoided for long chains or highly distributed systems.
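
A minimal sketch of the Epic Saga's control flow: steps run one after another, and on the first failure the already-completed steps are compensated in reverse order. The `run_epic_saga` function and the `step` helper are hypothetical, and the sketch deliberately ignores the case where a compensation itself fails.

```python
def run_epic_saga(steps):
    """Run (name, action, compensation) steps in order; undo in reverse on failure."""
    completed = []
    for name, action, compensation in steps:
        try:
            action()
        except Exception as error:
            # Undo the most recent commit first.
            for _, undo in reversed(completed):
                undo()
            return f"Failed at {name}: {error}"
        completed.append((name, compensation))
    return "Order confirmed."


journal = []


def step(name, fail=False):
    """Build a stub step that records its commit or compensation in the journal."""
    def action():
        if fail:
            raise TimeoutError("timed out")
        journal.append(f"{name}: commit")

    def compensation():
        journal.append(f"{name}: compensate")

    return name, action, compensation


result = run_epic_saga([
    step("order"), step("payment"), step("inventory"), step("shipping", fail=True),
])
```

The journal shows three commits followed by three compensations in the opposite order, which is the rollback behavior described in the list above.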

Trade-offs

Characteristic	Value	Description
Coupling	Very High	Sync calls, atomicity, and an orchestrator maximize coupling between services.
Complexity	Low	Sync calls and rollback logic are centralised in the orchestrator.
Availability	Low	One service failure aborts the whole flow. All-or-nothing behavior will affect responsiveness.
Scale	Very Low	Orchestrator and atomicity coupling create bottlenecks and limit scaling.

Phone Tag Saga

Synchronous • Atomic • Choreographed

A fully choreographed version of the Epic Saga where services call each other in a strict order and handle their own rollback logic.

The initiating service starts the chain and calls the next service synchronously.
Each service commits locally and calls the next service.
If any step fails, services must independently send compensating messages upstream.
No orchestrator exists, so each service carries its own coordination and rollback logic, which increases complexity.

This fits only simple, linear workflows that rarely fail. Many error-handling paths and conditional flows make the code unmanageable, so it is best treated as a transitional or legacy-friendly model.

Trade-offs

Characteristic	Value	Description
Coupling	High	Atomicity and sync calls cause high coupling, but distributed coordination makes it less coupled than Epic Saga.
Complexity	High	Each service has coordination and rollback logic.
Availability	Low	Error handling without an orchestrator requires callbacks and multiple round-trips.
Scale	Low	Sync calls and atomicity prevent parallelism.

Fairy Tale Saga

Synchronous • Eventual • Orchestrated

Orchestration with synchronous calls, but each service manages its own commit. Consistency is achieved eventually, not atomically.

The orchestrator sends synchronous calls to services in sequence.
Each service commits its changes independently.
The orchestrator listens for success or failure after each step.
If any step fails, there is no atomic rollback. Retries or compensations bring the data back in line over time.
The orchestrator still can trigger compensating actions but they won't be part of an active transaction.

Ideal for business processes where a central controller is valuable and consistency can be delayed. Think of checkout, signup, or account setup flows that need visibility and control but don’t require strict atomicity, which makes this saga common in many microservices architectures.

Trade-offs

Characteristic	Value	Description
Coupling	High	Uses an orchestrator and sync calls, but avoids global transactions.
Complexity	Very Low	Sync calls and rollback logic are centralised in the orchestrator, and consistency is loosened.
Availability	Medium	Still blocks on each call, but allows for eventual consistency.
Scale	High	Better scalability due to lack of transactional coupling.

Time Travel Saga

Synchronous • Eventual • Choreographed

Fully decentralized version of the Fairy Tale Saga. Services call each other in sequence and own all workflow logic, including failures.

A service begins and completes its local transaction.
It then calls the next service synchronously and passes control forward.
Each service continues this chain until the workflow ends.
If an error occurs, each service must handle its own compensations.

Best for throughput-focused, one-way and linear flows, such as ETL pipelines and simple chains where each step progresses naturally, independently and in-order.

Trade-offs

Characteristic	Value	Description
Coupling	Medium	No orchestrator and no atomicity reduce coupling, but sync calls retain some coupling.
Complexity	Low	No transactional logic, services handle only local logic.
Availability	Medium	Still blocks on each call, but no central bottleneck means fewer hops.
Scale	High	Choreographed flows with local commits scale well.

Fantasy Fiction Saga

Asynchronous • Atomic • Orchestrated

An orchestrated saga that attempts atomic coordination over asynchronous calls, introducing heavy complexity in managing order and state.

The orchestrator sends asynchronous commands to each participating service.
Services perform local transactions and respond, possibly out of order.
The orchestrator tracks progress and handles pending state.
On failure, it issues compensating commands asynchronously.
Coordination logic must handle race conditions and retries.

Only consider this pattern when atomic guarantees are a must and you need some parallelism or better performance. It is hard to get right because managing transactional consistency asynchronously is challenging, and it requires advanced orchestration and observability tooling.

Trade-offs

Characteristic	Value	Description
Coupling	High	Atomic guarantees demand coordination, async makes timing harder.
Complexity	High	Orchestrator must manage out-of-order events, rollbacks, retries, and partial states.
Availability	Low	Async compensations mean long recovery paths, and one service failure affects the whole flow.
Scale	Low	High scale is still challenging with atomic services, async alone can't offset coordination bottlenecks.

Horror Story Saga

Asynchronous • Atomic • Choreographed

The most difficult model that tries to achieve atomic consistency with no orchestrator and only async messaging (the two loosest coupling factors). All services must coordinate rollbacks without global state.

Services exchange messages asynchronously and commit locally.
There is no orchestrator, so each service must track workflow state and handle compensation.
Compensation logic must handle failures across out-of-order, possibly incomplete message chains.
High risk of race conditions, cascading failures, and coordination errors.

Avoid this pattern whenever you can. It's considered a red flag, signaling accidental complexity or under-designed coordination. Consider it only if you truly require atomicity but cannot introduce orchestration due to organizational boundaries.

Trade-offs

Characteristic	Value	Description
Coupling	Medium	No orchestrator helps loosen structure, but atomicity still enforces shared state constraints.
Complexity	Very High	Services must coordinate rollbacks asynchronously, tracking transaction state and order.
Availability	Low	Async chatter to achieve atomicity hurts responsiveness.
Scale	Medium	Parallelism is possible with async calls. No orchestrator helps as well.

Parallel Saga

Asynchronous • Eventual • Orchestrated

A scalable and resilient pattern where the orchestrator coordinates async service calls with eventual consistency, enabling high throughput.

The orchestrator sends async requests to all participating services.
Services execute independently and manage their own commits.
Results are returned asynchronously to the orchestrator.
If the orchestrator receives a failure, it sends async messages to services to compensate for this failed change.
Enables parallel execution and graceful recovery at scale.
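
The steps above can be sketched in a few lines. This is a minimal illustration, not a production orchestrator: it uses in-process `asyncio` calls as a stand-in for real async messaging, and the participant functions (`reserve_stock`, `charge_payment`, `compensate`) are hypothetical names invented for the example.

```python
import asyncio

# Hypothetical participants: each does its local work and raises on failure.
async def reserve_stock(order_id: str) -> str:
    return f"stock reserved for {order_id}"

async def charge_payment(order_id: str) -> str:
    raise RuntimeError("card declined")

async def compensate(step: str, order_id: str, log: list) -> None:
    # A real service would undo its local commit; here we only record the request.
    log.append(f"compensate {step} for {order_id}")

async def run_parallel_saga(order_id: str, steps: dict) -> dict:
    """Fan out all steps concurrently, then compensate the successful ones
    if any step failed (eventual consistency, no global transaction)."""
    names = list(steps)
    results = await asyncio.gather(
        *(steps[name](order_id) for name in names), return_exceptions=True
    )
    failed = [n for n, r in zip(names, results) if isinstance(r, Exception)]
    succeeded = [n for n, r in zip(names, results) if not isinstance(r, Exception)]
    log: list = []
    if failed:
        # Compensations are also sent concurrently.
        await asyncio.gather(*(compensate(n, order_id, log) for n in succeeded))
    return {
        "status": "failed" if failed else "completed",
        "failed": failed,
        "compensations": log,
    }
```

Calling `asyncio.run(run_parallel_saga("o-1", {"stock": reserve_stock, "payment": charge_payment}))` reports the payment failure and records one compensation for the stock step.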

Perfect for high-volume complex business flows, e.g., onboarding, order processing, subscription handling, where speed and observability matter more than atomic guarantees. Great balance of control, resilience, and performance.

Trade-offs

Characteristic	Value	Description
Coupling	Low	No global transaction, services react to events, the orchestrator only sequences steps.
Complexity	Low	The orchestrator's logic is simple due to low coupling.
Availability	High	Fast responses, non-blocking flows.
Scale	High	No atomicity guarantee, services scale at their own pace.

Anthology Saga

Asynchronous • Eventual • Choreographed

The most decoupled pattern: services communicate via events without orchestration, each maintaining its own state and reacting to changes.

Services emit events upon completion of local work.
Other services listen and react to those events asynchronously.
Each service is responsible for its own transaction scope and compensation.
No orchestrator or synchronous links, state is emergent from event flow.
Maximizes scalability and autonomy at the cost of visibility and control.
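
To make the event flow concrete, here is a small sketch of choreography. The `EventBus` class is a toy in-memory stand-in for a message broker, and the services, event names, and the payment limit of 100 are all assumptions made up for the example.

```python
from collections import defaultdict, deque

class EventBus:
    """Toy in-memory stand-in for a message broker (assumption for this sketch)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.queue = deque()

    def subscribe(self, event_type, handler):
        self.handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self.queue.append((event_type, payload))

    def drain(self):
        # Deliver queued events until no service emits anything new.
        delivered = []
        while self.queue:
            event_type, payload = self.queue.popleft()
            delivered.append(event_type)
            for handler in self.handlers[event_type]:
                handler(payload)
        return delivered

bus = EventBus()  # shared only as transport, not as shared state
orders = {}       # state owned by the hypothetical order service

def place_order(order_id, amount):
    orders[order_id] = "PENDING"
    bus.publish("OrderPlaced", {"order_id": order_id, "amount": amount})

def on_order_placed(event):
    # Hypothetical payment service: approves small amounts only.
    approved = event["amount"] <= 100
    bus.publish("PaymentCompleted" if approved else "PaymentFailed", event)

def on_payment_completed(event):
    orders[event["order_id"]] = "CONFIRMED"

def on_payment_failed(event):
    # Compensation: the order service undoes its own local change.
    orders[event["order_id"]] = "CANCELLED"

bus.subscribe("OrderPlaced", on_order_placed)
bus.subscribe("PaymentCompleted", on_payment_completed)
bus.subscribe("PaymentFailed", on_payment_failed)
```

No component knows the whole workflow: the final order status only emerges from the chain of events, which is also why tracing a failure is harder than with an orchestrator.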

Choose it when scale and service independence are priority. Ideal for data ingestion, analytics pipelines, or any process tolerant to loose consistency. Expect reduced observability, but maximum throughput and fault isolation. It's common in many microservices architectures.

Trade-offs

Characteristic	Value	Description
Coupling	Very Low	No orchestrator, no global transaction, and fully decoupled via events.
Complexity	High	Error handling and state reconstruction are tricky.
Availability	High	Services operate independently, queues absorb load spikes.
Scale	Very High	No coupling factors. Ideal for massive scale.

Wrap-Up

There’s no one-size-fits-all saga. Each pattern involves trade-offs across key characteristics like consistency, availability, scalability and performance. You can't maximize them all at once. Strong control often limits scalability, while loose coupling increases flexibility but demands stronger coordination and observability.

In practice, many systems adopt multiple saga patterns. For example, you might use the Epic Saga for critical and atomic flows like payments, and the Parallel Saga for scalable tasks that don't require immediate consistency, like order fulfillment. The key is to choose the right trade-offs for each workflow, guided by the characteristics your business values most and can’t afford to sacrifice.

Microservices Caching: Strategies, Topologies, and Best Practices

Randa — Wed, 26 Feb 2025 23:32:20 +0000

This article offers a thorough look at caching in microservices from the fundamental to more advanced techniques and patterns. Along the way, we’ll see how caching can accelerate performance, keep services decoupled, and respect each microservice’s autonomy. We will go through the following topics:

Introduction: Core Concepts and Definitions
- What Are Microservices?
- Bounded Context in Microservices
- What Is Caching?
- Consistency vs Eventual Consistency
- Why Caching Matters in Microservices?
Cache Implementation Approaches
- IMDG (In-Memory Data Grid)
- IMDB (In-Memory Database)
- IMDG vs. IMDB
Caching Strategies
- Read-Through
- Write-Through
- Write-Behind
Caching Topologies
- Single In-Memory Caching
- Distributed Caching (Client-Server)
- Replicated Caching (In-Process)
- Near-Cache Hybrids
- Topologies Comparison
Caching Patterns and Use Cases
- Data Sharing
- Data Sidecars
- Multi-Instance Caching
- Tuple-Space Pattern
Data Collisions
- Understanding Data Collisions
- Avoiding Data Collisions
- Calculating Collision Probability
Eviction Policies
- Time-to-Live (TTL)
- Archive (ARC) Policy
- Least Frequently Used (LFU)
- Least Recently Used (LRU)
- Random Replacement (RR)
- Selecting the Right Eviction Policy
Wrap-Up
Further Reading

Introduction: Core Concepts and Definitions

Let's first clarify a few key concepts and definitions related to microservices and caching before we dive deep into the caching topologies and strategies.

What Are Microservices?

Microservices is an architectural style where software is composed of multiple independent services, each focused on a single purpose. These services:

Can be deployed, scaled, and updated independently.
Communicate (often via HTTP or messaging) rather than relying on a single monolithic database.
Avoid tightly coupled monolithic structures, enabling faster iteration and smaller failure cycles.

This separation helps teams iterate faster and isolate failures. However, data management across microservices can become more complex, especially when different services need overlapping sets of information.

Bounded Context in Microservices

A bounded context is a principle from domain-driven design, crucial for microservices. It means:

Each microservice owns its domain logic and data.
Internally, the service can structure or store data however it wants (e.g., a relational database schema, NoSQL documents, or a simple file system).
Other services cannot directly query or modify that data store.

This is often called a share-nothing approach at the data level: each service controls its own resources. However, this does not necessarily require each service to have a completely separate physical database instance. A common setup is one database (e.g., PostgreSQL) where each microservice is assigned a dedicated schema or set of tables it alone manages. As long as the service is the only one reading/writing those specific tables (and no other service bypasses it), the bounded context principle holds.

What Is Caching?

Caching means temporarily storing data in a faster medium (often memory) to make subsequent requests for the same data quicker. By avoiding repeated expensive queries or computations, caching can significantly boost performance and scalability. It’s a common technique everywhere from simple in-memory lookups to distributed systems that replicate large data sets.

Consistency vs Eventual Consistency

Consistency or strong consistency means that whenever you read data, you always get the most recent write (like in a traditional database with full ACID guarantees). This is great for correctness but can slow down distributed systems.

Eventual Consistency means data might be out of date for a short while, but eventually, all replicas or caches catch up. In microservices, we often accept a brief window of staleness in exchange for better speed and uptime. For example, if you update user preferences, a remote cache might still have the old version for a few seconds until it’s invalidated or refreshed. That’s “eventual consistency”.

If you want absolute consistency, you might do synchronous writes, which can slow the system or cause partial unavailability. If you accept occasional staleness, you get better performance and resilience.

Why Caching Matters in Microservices?

In microservices, caching can:

Improve Performance: Serve data from memory instead of re-fetching from databases or external APIs. This is crucial when a microservice must repeatedly call another microservice or run expensive queries.
Enhance Scalability: Offloading repeated reads to a cache lightens the load on the original data store or service, allowing the overall system to handle more traffic.
Reduce Inter-Service Chatter: Some services might rely heavily on data “owned” by another service. Instead of making many network calls, a local or shared cache can speed things up.
Partially Decouple Services: If the owner goes offline temporarily, other services can still serve cached data (for read-only cases).

Yet, caching in microservices introduces additional complexity:

Consistency: Cached data can become stale or out-of-sync.
Collision Handling: Multiple services or instances writing the same cached data can overwrite each other.
Bounded Context: We must ensure that caching external data doesn’t break the share-nothing principle by bypassing the owning service’s authority over updates.
Eviction Policies: Which data gets removed when the cache is full or out-of-date?

Cache Implementation Approaches

In many caching products, you’ll find two broad ways to store and query data: IMDG (In-Memory Data Grid) and IMDB (In-Memory Database).

IMDG (In-Memory Data Grid)

Definition: A distributed key-value store kept entirely in RAM.
Data Model: Typically a map or dictionary of name-value pairs, plus some metadata.
Use Case: Fast get/put caching with minimal overhead, primarily for simple data access.
Examples: Hazelcast, Apache Ignite, Infinispan, Coherence, GemFire.

If your caching usage centers on straightforward queries, i.e., fetching or updating objects by key, an IMDG is ideal for its simplicity and speed.

IMDB (In-Memory Database)

Definition: An in-memory system that can behave more like a database, often supporting SQL-like queries, indexing, or advanced data operations.
Data Model: Potentially relational or table-like, capable of handling more complex queries (joins, aggregates).
Use Case: You need robust query capabilities or analytics on cached data, not just key-based lookups.
Trade-Off: Usually higher memory/CPU usage than an IMDG due to indexing and query engines.

An IMDB is valuable if your cache must support complex queries, like filtering or joining multiple data sets in-memory. This can be a big performance gain for analytics or specialized read patterns but requires more resources.

IMDG vs. IMDB

Simplicity: If your data is basically a series of name-value pairs, an IMDG suffices.
Complex Queries: If you want advanced querying (e.g., partial scans, joins, SQL), an IMDB is a better fit.
Performance Overhead: IMDB’s query engines can be slower and more memory-intensive compared to IMDG.
Purpose: Evaluate whether the cache is just a performance booster for repeated gets or a mini-database in memory for more elaborate data logic.

Caching Strategies

These strategies describe how reads and writes flow between your service, the cache, and the underlying data store. You can apply them to almost any caching topology (single in-memory or distributed), though they’re commonly used with local caches.

Read-Through

The microservice always reads from the cache.
If the data is missing (cache miss), the cache itself fetches from the database, updates the cache, and returns the result.
From the microservice’s perspective, it’s only talking to the cache.
Simplifies reading, but if the database belongs to another microservice domain, you bypass the actual owner’s logic.
For purely read-only usage in your domain only, this can be straightforward.
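
A minimal sketch of the read-through flow, assuming a plain dictionary stands in for the database and a hypothetical `loader` callable encapsulates the fetch:

```python
class ReadThroughCache:
    """The caller only talks to the cache; the cache loads missing keys itself."""

    def __init__(self, loader):
        self._loader = loader  # hypothetical function that reads from the data store
        self._store = {}
        self.misses = 0

    def get(self, key):
        if key not in self._store:
            # Cache miss: the cache (not the caller) fetches and remembers the value.
            self.misses += 1
            self._store[key] = self._loader(key)
        return self._store[key]

# A dict standing in for the service's own database.
database = {"user:1": "Alice"}
cache = ReadThroughCache(database.__getitem__)
```

The first `cache.get("user:1")` triggers the loader; later calls for the same key are served from memory.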

Write-Through

The microservice writes directly to the cache.
The cache synchronously writes the change to the underlying database.
From the microservice’s perspective, it’s only talking to the cache.
Keeps data consistent but can slow performance if the database call is slow as it must wait for the write to complete.
Similarly can break domain boundaries if you are writing to another microservice’s database.
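
As a rough sketch of write-through, assuming a hypothetical `writer` callable that persists to the service's own database:

```python
class WriteThroughCache:
    """Every write goes to the data store synchronously before it is cached."""

    def __init__(self, writer):
        self._writer = writer  # hypothetical function that persists to the database
        self._store = {}

    def put(self, key, value):
        # Persist first; if the store rejects the write, the cache stays unchanged.
        self._writer(key, value)
        self._store[key] = value

    def get(self, key):
        return self._store.get(key)

# A dict standing in for the service's own database.
database = {}
cache = WriteThroughCache(database.__setitem__)
```

Because `put` returns only after the writer finishes, the caller pays the database latency on every write, which is the trade-off described above.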

Write-Behind (Write-Back)

The microservice writes to the cache and returns quickly.
The cache asynchronously updates the database afterward.
Reduces write latency since it does not wait for the database write, but risks data loss if the cache node fails before persisting to the database, and can cause timing issues if other processes expect immediate writes.
Similar boundary issues if updating another microservice’s database.
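
A simplified sketch of write-behind follows. It assumes a hypothetical `writer` callable and an explicit `flush()` that a timer or background worker would call in a real system; both are illustrative choices, not a specific product's API.

```python
from collections import deque

class WriteBehindCache:
    """Writes land in memory immediately and are persisted later in order."""

    def __init__(self, writer):
        self._writer = writer  # hypothetical function that persists to the database
        self._store = {}
        self._pending = deque()

    def put(self, key, value):
        # Return quickly: only memory is touched here.
        self._store[key] = value
        self._pending.append((key, value))

    def get(self, key):
        return self._store.get(key)

    @property
    def pending(self):
        return len(self._pending)

    def flush(self):
        """Persist queued writes; an entry is dequeued only after it is written."""
        flushed = 0
        while self._pending:
            key, value = self._pending[0]
            self._writer(key, value)
            self._pending.popleft()
            flushed += 1
        return flushed

# A dict standing in for the service's own database.
database = {}
cache = WriteBehindCache(database.__setitem__)
```

Anything still in the pending queue is lost if the process dies before `flush()` runs, which is the data-loss risk noted above.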

In strict microservices, letting a cache talk directly to another service's database can undermine the bounded context principle unless carefully encapsulated. Often, you'd prefer to apply these strategies to your own domain data and rely on read-only caching for external data. For the latter, consider a data sidecar or data sharing approach (both discussed later), which avoids direct database calls that bypass the rightful domain owner.

Caching Topologies

In microservices, caching can take several architectural forms, each physically arranged in distinct ways. Each topology has strengths and limitations, particularly regarding fault tolerance, data consistency, scalability, and complexity.

Single In-Memory Caching

Here, you simply load data (e.g., user preferences, some small reference set) into local RAM within a microservice instance. Each instance keeps its own cache.

Suitable for:

Small or mostly static data sets.
Your microservice runs as a single instance or you can tolerate minimal updates and data skew.
The data belongs to your domain (bounded context) so you’re not breaking ownership rules.

Pros:

Performance: Extremely fast, as data is stored in local memory with no network latency.
Complexity: Simple to implement. Requires no extra infrastructure.

Cons:

Consistency and Multiple Instances: If your microservice is scaled across containers, each instance has its own local cache. Updates in one instance aren’t automatically propagated to others, leading to data skew or stale data if the data changes often.
Scalability: A single instance’s memory might not handle large data sets.
Write-Heavy Scenarios: Single in-memory caching suits read-heavy loads. For writes, multiple instances might each update local data, leading to divergent caches or stale state if no synchronization is in place.
Bounded Context: If you rely on read-through/write-through for data that belongs to another domain, you skip that domain’s service logic unless you encapsulate calls through their API.

Still, single in-memory caching is simple and great for static or rarely updated data, or small reference sets that every request needs. For bigger or more complex systems, you’ll often turn to more advanced topologies.

Code Snippet:

This example demonstrates how to use an in-memory cache in .NET with IMemoryCache:

// Register IMemoryCache in Program.cs
builder.Services.AddMemoryCache();

// Get or create a cached value
var value = await _memoryCache.GetOrCreateAsync("key 1", _ => Task.FromResult("value 1"));

Distributed Caching (Client-Server)

A distributed cache keeps data in an external caching cluster, often a separate server or group of servers, while microservices connect to it through a client library over the network. Examples include Redis, Memcached, or Apache Ignite/Hazelcast in client-server mode.

How It Works:

You have one unified external caching cluster (e.g., Redis).
You have a cache library in each microservice instance.
Your code calls this library’s API.
The library uses the product's own wire protocol to talk to the external cluster.
The cluster stores and replicates data as configured.

Bounded context is not violated because no service touches another service's database: each service has its own read-only cache in the caching server. Either an IMDG or an IMDB can be used here. If you only need key-value access, you’d likely configure an IMDG mode; if you want to run queries, you might pick an IMDB mode (though that’s less common for a simple caching scenario).

Pros:

Consistency: All instances share one cache to read and update. Consistency is simpler to manage.
Scalability: If the distributed cache cluster is robust (e.g., horizontally sharded or replicated), it can handle large data volumes and concurrency.
Many real-world microservices rely on distributed caching (e.g., Redis) because it’s straightforward to manage and widely supported.

Cons:

Performance: Slower reads/writes compared to local memory (due to network latency).
Complexity: Must manage an external caching layer (e.g., multiple Redis nodes, replication, or clustering).
Availability: If the external cluster is unreachable, caching fails for all microservice instances.
Fault Tolerance: Potential single point of failure unless replicated or clustered properly. Losing the cache node can disrupt everything.

Code Snippet:

This example demonstrates how to use a distributed cache in .NET with Redis:

// Wire Redis in Program.cs and use IDistributedCache to get/set data
builder.Services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = builder.Configuration.GetConnectionString("Redis");
});

// Store a value in the cache
await _distributedCache.SetStringAsync("key 1", "value 1");

// Get the value from the cache
var value = await _distributedCache.GetStringAsync("key 1");

Replicated Caching (In-Process)

This type doesn't require an external server. Each microservice instance has an in-process cache, but updates are replicated to all other nodes, and this is handled by the cache engine. Products like Hazelcast, Apache Ignite, GemFire, Coherence, and Infinispan support this mode.

How It Works:

You still use a library (e.g., Hazelcast, Ignite) in each microservice instance.
Each instance has its own in-process memory cache.
When your app writes to the local cache, updates are automatically replicated to other instances via a proprietary protocol.
So every node eventually has the same data in memory.

Pros:

Performance: Extremely fast local reads (nanosecond-level) because data is in the same process memory.
Fault tolerance: If one instance fails, others still hold a full copy of the data in memory (assuming no partition issues).

Cons:

Scalability: Large data sets can cause scaling issues as every instance must store it.
Collisions: High update rates risk collisions or “split-brain” scenarios if replication lags. This is discussed later in the Data Collisions section.
Complexity: More complex coordination among large numbers of instances.

Code Snippet:

This example demonstrates how to use a replicated cache in .NET with Hazelcast:

var options = new HazelcastOptionsBuilder()
    .With(args)
    .Build();

// Create a Hazelcast client and connect to a server running on localhost
await using var client = await HazelcastClientFactory.StartNewClientAsync(options);

// Get the replicated map from the cluster
await using var replicatedMap = await client.GetReplicatedMapAsync<string, string>("replicated-map-1");

// Store a value in the replicated map
await replicatedMap.PutAsync("key 1", "value 1");

// Get the value from the replicated map
var value = await replicatedMap.GetAsync("key 1");

Near-Cache Hybrids

A near-cache approach combines distributed caching with a local in-process cache: a small front cache in each instance sits in front of a shared backing cache.

How It Works:

A microservice instance has a local “front” cache for “hot” items with a capacity limit and an eviction policy configured. We will talk later about Eviction Policies.
There's also a distributed “backing” cache (like Hazelcast or Ignite cluster) that holds the full data set.
Reads first go to the local near/front cache. If it's not there, they retrieve from the backing cache.
Writes usually go to the backing cache, which sends invalidates or updates to local near-caches for other instances via a proprietary protocol to ensure they remain in sync.
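
The read and write paths above can be sketched as follows. This is an in-process illustration only: `BackingCache` stands in for a distributed cluster, invalidation is a direct method call rather than a network message, and all class names are assumptions for the example.

```python
from collections import OrderedDict

class BackingCache:
    """Stand-in for the distributed backing cache holding the full data set."""

    def __init__(self):
        self.data = {}
        self.near_caches = []
        self.reads = 0

    def get(self, key):
        self.reads += 1
        return self.data.get(key)

    def put(self, key, value):
        self.data[key] = value
        # Tell every front cache to drop its (now stale) local copy.
        for near in self.near_caches:
            near.invalidate(key)

class NearCache:
    """Small local front cache with LRU eviction in front of the backing cache."""

    def __init__(self, backing, capacity):
        self.backing = backing
        self.capacity = capacity
        self.front = OrderedDict()
        backing.near_caches.append(self)

    def get(self, key):
        if key in self.front:
            self.front.move_to_end(key)  # mark as most recently used
            return self.front[key]
        value = self.backing.get(key)    # local miss: go to the backing cache
        if value is not None:
            self.front[key] = value
            if len(self.front) > self.capacity:
                self.front.popitem(last=False)  # evict least recently used
        return value

    def put(self, key, value):
        self.backing.put(key, value)     # writes go to the backing cache

    def invalidate(self, key):
        self.front.pop(key, None)
```

Two `NearCache` instances sharing one `BackingCache` behave like two service instances: a write through one drops the stale entry in the other, so its next read goes back to the backing cache.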

Pros:

Blends scalability of a distributed store with fast local reads for frequently accessed keys.
Reduces repeated remote calls if the item is “hot”.
Limits local memory usage (only “most recently/frequently used” items).

Cons:

Additional complexity in configuring two-tier caching.
Brief staleness possible unless invalidation updates propagate instantaneously.
Doesn’t store the entire data set locally, so cache misses still require network access to the backing store.

Code Snippet:

This example demonstrates how to use a near cache in .NET with Hazelcast:

var options = new HazelcastOptionsBuilder()
    .With(args)
    .Build();

// Configure NearCache
options.NearCaches["near-cache-map-1"] = new NearCacheOptions
{
    Eviction = new Hazelcast.Models.EvictionOptions()
    {
        // Evicts least recently used entries
        EvictionPolicy = EvictionPolicy.Lru,
        // Maximum number of entries in the Near Cache
        Size = 10000,
    },
    // Maximum number of seconds an entry stays in the Near Cache
    TimeToLiveSeconds = 60,
    // Maximum number of seconds an entry can stay in the Near Cache without being accessed
    MaxIdleSeconds = 3600,
    InvalidateOnChange = true
};

// Create a Hazelcast client and connect to a server running on localhost
await using var client = await HazelcastClientFactory.StartNewClientAsync(options);

// Get the distributed map from the cluster
await using var map = await client.GetMapAsync<string, string>("near-cache-map-1");

// Store a value in the cache
await map.SetAsync("key 1", "value 1");

// Get the value from the cache by key
var value = await map.GetAsync("key 1");

Topologies Comparison

	Single In-Memory	Distributed (Client-Server)	Replicated (In-Process)	Near-Cache (Hybrid)
Performance	Extremely fast local	Network-based reads	Nanosecond local reads	Local + distributed store
Data Volume	Small, mostly static	Potentially large	Usually smaller sets	Large in backing
Update Rate	Very low changes	Handles high writes	Moderate updates	Moderate / High
Fault Tolerance	None if multi-instance	Cluster config dependent	Node-level replication	Partial replication
Consistency	Cache is per instance, no unification	Central store	Collision risk under concurrency	Local front can be stale briefly

Caching Patterns and Use Cases

We will discuss some higher-level, application-focused solutions for typical microservice challenges. These patterns can be built on top of different topologies.

Data Sharing

Scenario: Product microservice owns product information, while Order microservice needs to read that data regularly. Having Order microservice constantly call Product microservice’s API might become a bottleneck or add unnecessary network overhead.

How It Works:

Product microservice remains the sole owner of the data (bounded context).
Order microservice, which needs that data, sets up a local cache to store read-only copies.
When Order microservice needs the data, it can check its cache first. If it’s stale or missing, it calls Product microservice’s API.
Order microservice never writes directly to Product microservice’s data store. Product microservice is still the only one responsible to modify its own data.
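
A small sketch of the consumer side, assuming a hypothetical `fetch` callable that wraps the owning service's API and a time-based freshness window (the `ttl_seconds` value is an arbitrary choice for illustration):

```python
import time

class SharedDataCache:
    """Read-only local copies of another service's data, refreshed via its API."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch        # hypothetical call to the owning service's API
        self._ttl = ttl_seconds    # how long a cached copy is trusted
        self._clock = clock        # injectable so the behavior can be tested
        self._entries = {}         # key -> (value, fetched_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # fresh enough: no network call
        value = self._fetch(key)   # missing or stale: ask the owner
        self._entries[key] = (value, now)
        return value
```

The cache exposes no write method on purpose: the owning service stays the only place where the data can change.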

Pros:

Respects boundaries and achieves strong decoupling.
Performance: Faster reads due to the local cache for the other services that need the data.
Fault Tolerance: The other services can continue to operate even if the original service is unavailable.

Cons:

Consistency: The other services might not immediately see changes made by the original service.
Cache Invalidation: The other services must decide how long they trust the cached data before refreshing it from the original service, so avoid this pattern if the owning service is write-heavy.
Memory Overhead: If the dataset is large, the cache can consume significant memory.

Data Sidecars

Scenario: Profile microservice owns detailed user profile data. Several other microservices need to read it heavily. They shouldn’t directly connect to Profile microservice’s database, nor spam the Profile microservice API every time.

How It Works:

Profile microservice writes changes to its domain data as usual.
Whenever data changes, Profile microservice also updates a distributed cache (the “sidecar”).
Other microservices read from the sidecar, which is effectively read-only for them. The domain logic for writes remains in Profile microservice.

Pros:

Respects boundaries and achieves strong decoupling.
Performance: Less load on the microservice. Others read from the sidecar cache instead of making direct calls or querying the database.
Consistency: Everyone sees a consistent (or eventually consistent) picture from the sidecar.
Scalability: Sidecar is scalable and can handle large volumes of data efficiently.

Cons:

Fault Tolerance: If the cache node goes down, reading services lose their data unless there’s replication or a fallback path.
Extra Complexity: Setting up the push/refresh logic or using events to keep the sidecar in sync.

Multi-Instance Caching

Scenario: One microservice, say Order microservice, needs to be scaled to 10 containers to handle high traffic. Each container needs the same reference data or reads from and writes to a shared domain. You want local caching but must keep the instances consistent enough.

How It Works:

If each container does single in-memory caching independently, you get data skew.
Instead, you pick a replicated or near-cache approach so that changes can propagate among instances.
- Replicated: All instances store the full data set in memory. When one node updates a key, it’s broadcast to others via a proprietary protocol.
- Near-Cache: Each node has a partial local cache and fetches from a backing store if missing or stale.

Pros:

Performance: Each instance can quickly respond to read requests from memory.
Scalability: You can add more containers without manually syncing caches.

Cons:

Collisions: If multiple nodes write the same key concurrently, overwrites can happen.
Memory Usage (replicated) or Complex Invalidation (near-cache).
Consistency: Some nodes might see outdated data briefly.

Tuple-Space Pattern

Scenario: You have a system that does high-speed processing (e.g., a stock trading platform), relies on all data being in memory for lightning-fast reads, and can accept the overhead.

How It Works:

You load all relevant data into an IMDG or IMDB (like a huge in-memory store).
Reads are basically memory-speed lookups, no disk or external service.
Writes must also sync with the store or an underlying database eventually.
The entire microservice logic might revolve around the in-memory “space” (hence the name “tuple-space pattern”, and also the “space-based” architecture style).

Pros:

Performance: Ultra-Fast Reads. Everything is in memory.
Ideal for: Very high read or compute-intensive tasks (e.g., real-time analytics, stock trading, or matching engines).

Cons:

Huge Memory usage: Storing all data in RAM can be expensive.
Complex Writes: If multiple services or instances attempt to update data, concurrency and collisions can be tough to handle.

Wrap-Up on Patterns

These patterns often overlap, for example, a sidecar approach might also leverage multi-instance caching or near-cache logic. The key is to keep the domain lines clear so you never override someone else’s data domain rules and choose a pattern that balances performance with the reality of concurrency, staleness, and memory cost.

Data Collisions

Understanding Data Collisions

When using replicated caching (or multi-master distributed caching), two instances can update the same record at nearly the same time, with replication lag. For example:

Instance A decrements an inventory count from 700 to 690.
Instance B decrements from 700 to 695.
Both replication messages cross in flight, overwriting each other. End result might incorrectly show 690 or 695 instead of 685 total.

These inconsistencies are typically called split-brain or data collisions.

Avoiding Data Collisions

Queueing: Instead of writing to the cache directly, each instance sends a message to a queue. A separate service processes these messages sequentially, ensuring no collisions; the trade-off is eventual consistency.
Compare-and-Set (Version or Timestamp Checks): The microservice checks a version (timestamp or sequence) before updating. If the version changed, it means someone else updated the data and the operation should be retried.
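
The compare-and-set idea can be sketched with a version number per entry. This is an in-process illustration: the `VersionedStore` class and the retry limit are assumptions, and a real distributed cache would provide its own atomic conditional update.

```python
import threading

class VersionedStore:
    """Each entry carries a version; a write succeeds only if the version matches."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (value, version)

    def read(self, key):
        with self._lock:
            return self._data.get(key, (None, 0))

    def compare_and_set(self, key, new_value, expected_version):
        with self._lock:
            _, current_version = self._data.get(key, (None, 0))
            if current_version != expected_version:
                return False  # someone else updated the entry first
            self._data[key] = (new_value, current_version + 1)
            return True

def decrement(store, key, amount, max_retries=10):
    """Re-read and retry until the update applies to the latest version."""
    for _ in range(max_retries):
        value, version = store.read(key)
        if store.compare_and_set(key, value - amount, version):
            return True
    return False
```

Replaying the inventory example: if two instances both read 700 at the same version, only the first conditional write succeeds; the second is rejected, re-reads 690, and retries, so the count ends at 685 instead of 690 or 695.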

Calculating Collision Probability

Collision probability can be approximated by the following formula:

Collision_Rate ≈ Number_of_Instances × (Update_Rate² / Cache_Size) × Replication_Latency

Number_of_Instances: How many instances.
Update_Rate: Writes per second.
Cache_Size: Total distinct data entries. The bigger it is, the less often the exact same entry collides.
Replication_Latency: Average time for updates to propagate. It is usually measured in milliseconds and must be converted to seconds for the formula.

If the collision rate is low (like under 1%), you might be fine. If it’s high, you’ll need concurrency mechanisms.

Example:

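Number_of_Instances = 8
Update_Rate (writes per second) = 300
Cache_Size (rows) = 30000
Replication_Latency (milliseconds) = 50
Then the collision rate is 8 × (300² / 30000) × 0.05 = 1.2 collisions per second. If Update_Rate is the total write rate, that is about 0.4% of the 300 writes per second: under the 1% rule of thumb, but still more than 4,000 collisions per hour, so a concurrency mechanism is worth considering for sensitive data such as inventory counts.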
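
The arithmetic can be checked with a short helper. The function names are made up for this sketch, and dividing by the update rate to get a percentage assumes that Update_Rate is the total write rate across instances, which the formula itself does not state.

```python
def collision_rate(instances, update_rate, cache_size, replication_latency_ms):
    """Approximate collisions per second using the formula above."""
    latency_seconds = replication_latency_ms / 1000.0  # formula expects seconds
    return instances * (update_rate ** 2 / cache_size) * latency_seconds

def collision_share(rate_per_second, update_rate):
    """Fraction of writes that collide (assumes update_rate is the total rate)."""
    return rate_per_second / update_rate

rate = collision_rate(instances=8, update_rate=300, cache_size=30000,
                      replication_latency_ms=50)
share = collision_share(rate, update_rate=300)
print(f"{rate:.2f} collisions/second, {share:.2%} of writes")
```

With the example numbers this gives roughly 1.2 collisions per second. Trying different cache sizes or latencies is a quick way to see which factor dominates for your workload.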

Eviction Policies

Caches are finite. When they fill up, something must be removed to make room for new entries. Various eviction policies address different usage patterns.

Time-to-Live (TTL)

Definition: Each entry has an expiration timer. After the time elapses, the cache discards it.
Pros: Good for data that “naturally” becomes stale quickly (like real-time bidding info).
Cons: Does not handle the scenario where the cache is simply full (some items might still be unexpired).

Archive (ARC) Policy

Definition: Evicts items based on creation date, e.g., only keep entries under 6 months old.
Pros: Excellent for storing recent transactions (user orders for the last 6 months) and automatically discarding older data.
Cons: Also doesn’t handle the scenario of a “full” cache. If the cache is at capacity but none of the data is older than X months, new entries cannot be added.

Least Frequently Used (LFU)

Definition: Evicts the entry with the lowest access frequency.
Pros: If data is heavily read over time but rarely updated, this can keep popular items in memory.
Cons: When new items are inserted, many LFU algorithms reset counters. Frequently used items might get evicted if a series of puts occur. Can cause surprising evictions in “put-heavy” workloads.

Least Recently Used (LRU)

Definition: Evicts items that have not been accessed for the longest period.
Pros: Generally the most intuitive for interactive data. Items used recently remain in cache.
Cons: Has overhead in tracking recency (often via a linked list or timestamps).

LRU is a common default for near-cache front caches because it keeps the most recently used items in memory. Don't confuse it with an MRU eviction policy, which does the opposite: it evicts the most recently used item and is rarely beneficial.
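
For reference, a compact LRU cache can be built on Python's `collections.OrderedDict`. This is an illustrative sketch rather than how any particular caching product implements it.

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the entry that has gone the longest without being read or written."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key, default=None):
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # reading counts as "recently used"
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used entry

    def keys(self):
        return list(self._items)
```

The recency bookkeeping on every `get` and `put` is the tracking overhead mentioned in the cons above.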

Random Replacement (RR)

Definition: When the cache is full, pick an item at random to evict.
Pros: Minimal overhead, extremely fast.
Cons: No intelligence about usage patterns; can evict the most popular item.

Selecting the Right Eviction Policy

A recommended approach:

Start with Random (RR) if usage patterns are unknown. Measure cache hit rates (via logs, counters, or built-in metrics).
Experiment with LRU or LFU for a trial period, measuring the difference in hit ratio and overall performance.
Choose the best performer for your data behavior.
Time-based policies (TTL, ARC) shine when data is stale after a certain window or you only want to keep recent or valid data.
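
One way to run such an experiment offline is to replay a recorded access trace against each policy and compare hit ratios. The sketch below covers only LRU and Random Replacement; the function name, the policy labels, and the fixed seed are assumptions for illustration.

```python
import random
from collections import OrderedDict

def hit_ratio(policy, accesses, capacity, seed=0):
    """Replay a list of keys against a cache and return hits / total accesses."""
    rng = random.Random(seed)  # fixed seed keeps the random policy repeatable
    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            if policy == "lru":
                cache.move_to_end(key)  # refresh recency on a hit
            continue
        if len(cache) >= capacity:
            if policy == "lru":
                cache.popitem(last=False)          # evict least recently used
            elif policy == "rr":
                del cache[rng.choice(list(cache))]  # evict a random entry
            else:
                raise ValueError(f"unknown policy: {policy}")
        cache[key] = True
    return hits / len(accesses) if accesses else 0.0
```

Feeding it a trace captured from production logs, for example `hit_ratio("lru", trace, capacity=1000)` versus `hit_ratio("rr", trace, capacity=1000)`, gives a first indication before you change any live configuration.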

Wrap-Up

Caching in microservices isn’t just about speed, it’s about reducing network calls, managing concurrency, and respecting domain boundaries. Make sure to understand your application's characteristics, data behavior, and the trade-offs of each caching approach before committing to any caching strategy.

The Ultimate Cheat Sheet: CLI Man Pages, tldr, and cheat.sh

Randa — Mon, 13 Jan 2025 18:09:58 +0000

Introduction

When you’re coding or working extensively in the command line, having quick references that show how commands work is incredibly handy. Typically, you might Google it (or now, ask ChatGPT) to find a command’s usage and examples. However, that often means leaving the terminal, and context-switching can slow you down. You might also need to look up multiple online resources to get what you want.

This is where CLI cheat-sheet tools come in. They allow you to search or recall command examples on the fly, directly from your shell. In this article, we’ll explore three major solutions:

man pages: Built-in, offline, and highly detailed documentation on Unix-like systems.
tldr: Short, simple, example-driven cheat sheets for popular commands.
cheat.sh: A curl-friendly tool that aggregates both CLI commands and programming snippets (Python, JavaScript, Go, and more).

Man Pages

Man pages (short for "manual pages") are the official documentation method on Unix-like systems. They’re typically installed by default, providing offline references for nearly every system command, library, or config file.

Pros:

Offline & Detailed: Perfect for advanced or obscure flags.
Official Documentation: Maintained by the system or package authors.
Searchable: Type /<pattern> inside a man page to jump to matches.

Cons:

Verbose: Can be overwhelming if you just want a quick example.
No Native Windows Support: You need WSL, or you can rely on PowerShell’s Get-Help as an alternative.

Installation

Windows
- Use WSL (Windows Subsystem for Linux) with a distro like Ubuntu installed to get a real man (no pun intended) out of the box.
- Or rely on PowerShell’s Get-Help for Windows-native commands (e.g. Get-Help dir). While not exactly "man," it serves a similar purpose.
Linux/macOS
- Typically pre-installed. Just type man <command>.

How to Use

Basic Lookup - Shows very detailed information about the given command. If you want to search in the results for the word variable, type /variable, and press n to jump to the next match:
```
man <command>
man grep
```
Find Commands by Keyword - Shows all man-page entries related to the given keyword:
```
man -k zip       # Shows results for `unzip`, `gzip`, etc.
```
Command Descriptions (whatis) - Searches the manual page names and displays the short description of every match. Running man -f <term> is equivalent to running whatis <term>:
```
man -f <term>
man -f ip        # Displays the man page descriptions matching `ip`
```
Man Section Control - Displays the documentation for a topic within the specified manual section. Man pages are grouped into numbered sections (e.g., 1 for user commands, 2 for system calls, 3 for library calls). This is crucial when the same name exists in several sections and you want library-level details rather than the userland command.
```
man <section> <topic>
man 2 open       # Displays the man page for the `open` system call
```

Learn More

An online collection of Linux man pages is available at: https://man7.org/linux/man-pages/

tldr

Tldr (short for "too long; didn't read") is a community-driven project providing concise, example-focused cheat sheets. Instead of swimming through 100 lines in a man page, you get 5–10 lines of the most common usage patterns.

Pros:

Minimal & Fast: Great for everyday tasks.
Actively Updated: Large open-source community.
Offline Cache: While internet is needed initially, tldr can be used offline afterwards thanks to its caching feature.

Cons:

Limited Depth: Doesn’t always show advanced flags or environment info.
Requires Installation: Typically not pre-installed, so you need to install a client.

Installation

Windows
- Common approach via npm (assuming Node.js is installed):
```
npm install -g tldr
```
Linux/macOS
- npm again is straightforward:
```
npm install -g tldr
```
- Alternatively, install the official Rust Client using Homebrew (or other package managers on other operating systems):
```
brew install tlrc
```

How to Use

Basic Lookup - Shows minimal information about the given command:
```
tldr <command>
tldr cat
```
Update Cache - Pulls the latest cheat sheets from the official repo to be used offline:
```
tldr --update
```
Random Page - Shows a randomly selected page:
```
tldr -r
```

Learn More

Official Site: https://tldr.sh
tldr Clients: https://github.com/tldr-pages/tldr/wiki/Clients
GitHub Repo: https://github.com/tldr-pages/tldr

cheat.sh

cheat.sh is a web-based service you can query via curl. Unlike tldr or man pages, it includes programming language snippets (python/regex, go/http, etc.). Perfect for devs who want both CLI commands and language cheat sheets in one place.

Pros:

Zero-Install: Just curl to fetch results.
Covers 56 programming languages, several DBMSes, and more than 1000 most important UNIX/Linux commands.
Ultrafast, returns answers within 100 ms, as a rule.
Ability to add more cheat sheets and modify existing ones.

Cons:

Requires Internet: Unless you self-host cheat.sh.
Inconsistent Formatting: Pulled from various sources, so the style can vary.

Installation

No formal installation is needed; just use the curl command. It's a REST API, so as long as you have internet access and a terminal, you’re all set.

Windows
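- Windows 10 (version 1803 and later) ships with curl.exe. In Windows PowerShell 5.1, curl is an alias for Invoke-WebRequest, so type curl.exe to get the real tool; PowerShell 7 no longer defines that alias.
- Git Bash and WSL also include curl by default, or it can be installed easily.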
Linux/macOS
- curl is usually pre-installed on major distros. Check with curl --version.

How to Use

Basic Lookup - Shows multiple usage examples for the given command, sometimes more extensive than tldr:
```
curl cheat.sh/<command>
curl cheat.sh/tar
```

Subtopic Filtering - Use /~<keyword> to focus on specific usage:

```
curl cheat.sh/scala/~currying   # Looks for currying in scala cheat sheets
```

Programming languages cheat sheets - For each supported programming language there are several special cheat sheets: its own sheet, hello, :list and :learn:

```
curl cheat.sh/lua
curl cheat.sh/lua/hello
curl cheat.sh/lua/:list
curl cheat.sh/lua/:learn
```

For example, to find out how to generate random numbers in C#:

```
curl cheat.sh/csharp/random
```

The output will look something like this:

```
/*
 * The [`Random` class][1] is used to create random numbers. (Pseudo-
 * random that is of course.).
 *
 * Example:
 *
 * <!-- language: c# -->
 */

 Random rnd = new Random();
 int month  = rnd.Next(1, 13);  // creates a number between 1 and 12
 int dice   = rnd.Next(1, 7);   // creates a number between 1 and 6
 int card   = rnd.Next(52);     // creates a number between 0 and 51
 // Rest of details
```

Special Pages - A few examples:

```
curl cheat.sh/:help     # Description of all special pages and options
curl cheat.sh/:intro    # cheat.sh introduction, covering the most important usage questions
curl cheat.sh/:list     # Lists all cheat sheets
```

Pipe into fzf (for super-advanced searching) - If you have fzf installed, you can interactively sift through cheat.sh’s output:
```
curl cheat.sh/python | fzf
```

Alias

The curl command is a bit long to type (at least for me), so we can wrap it in a short function:

Windows: Add the following function to $PROFILE. It calls curl.exe explicitly because curl is an alias for Invoke-WebRequest in Windows PowerShell 5.1:

```
function cheat {
    param([string]$topic)
    curl.exe "cheat.sh/$topic"
}
```

Then you can use it this way:

```
cheat tar
cheat csharp/random
```

Linux/macOS: Add the following function to ~/.bashrc or ~/.zshrc:
```
cheat() {
    curl "cheat.sh/$*"
}
```
Then you can use it this way:
```
cheat csharp/random
```
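The query paths used throughout this section follow a regular shape: a topic, optionally followed by a question or by a ~keyword filter. As a rough illustration only, the Python sketch below composes such URLs. Note that cheat_url is a hypothetical helper and not part of cheat.sh, the https host is an assumption, and joining the words of a multi-word question with + follows the convention shown in the cheat.sh README.

```python
from urllib.parse import quote


def cheat_url(topic, question=None, keyword=None, host="https://cheat.sh"):
    """Build a cheat.sh URL such as https://cheat.sh/scala/~currying."""
    parts = [quote(topic, safe=":")]   # keep ":" so pages like ":help" survive
    if question:
        # Words of a free-form question are joined with "+".
        parts.append("+".join(quote(word, safe=":") for word in question.split()))
    if keyword:
        # "~keyword" filters the topic's cheat sheets by keyword.
        parts.append("~" + quote(keyword, safe=""))
    return host + "/" + "/".join(parts)


print(cheat_url("tar"))                         # https://cheat.sh/tar
print(cheat_url("csharp", question="random"))   # https://cheat.sh/csharp/random
print(cheat_url("scala", keyword="currying"))   # https://cheat.sh/scala/~currying
```

The same composition is what the shell functions above do implicitly when you pass them a path such as csharp/random.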

Learn More

GitHub Repo: https://github.com/chubin/cheat.sh
You're most likely coding in an editor, so you might wonder how to access cheat sheets specific to your programming language. One option is to open a terminal within your editor and run the commands there. Alternatively, you can check out the cheat.sh repo for instructions on how to integrate your editor with cheat.sh directly.

Quick Compare Table

| Feature/Tool | Man Pages | tldr | cheat.sh |
| --- | --- | --- | --- |
| Installation | Pre-installed on most Unix-like systems (use WSL on Windows) | Install a tldr client (e.g., via npm) | No install needed, just `curl` |
| Offline Usage | Fully offline | Cached offline after initial update | Requires internet unless you self-host |
| Detail Level | Extremely comprehensive (official docs) | Concise, covers common commands/features | Varies; includes code snippets & subtopics |
| Advanced Flags | Thorough coverage | Limited coverage of advanced flags | Medium coverage (from multiple community sources) |
| Programming | System-level docs only | CLI commands only | Includes language cheat sheets (Python, JS, Go, etc.) |
| Speed | Instant offline results | Very fast for typical usage | Usually sub-100ms response, but must be online |
| Platform | Unix-like systems (WSL for Windows) | Cross-platform (Windows, Linux, macOS), with a client | Cross-platform (Windows, Linux, macOS) |
| Primary Use | Deep dive into official docs | Quick references for everyday commands | Quick references plus code snippets in multiple languages |

Other Tools to Explore

Beyond man, tldr, and cheat.sh, a few similar tools are worth exploring:

eg: Provides simple, practical command-line examples, acting as a quick-reference companion to man pages.
Cheat: Enables creating and viewing interactive command-line cheatsheets, helping *nix admins recall options for commands they use occasionally.
devhints: Provides quick, easy-to-navigate cheatsheets for developers, offering concise references for various tools, frameworks, and programming languages.

Wrap-Up

Whether you're diving into man pages for detailed offline docs, using tldr for quick command overviews, or exploring cheat.sh for filtered subtopics and snippets, you'll have everything you need right at your fingertips. You can also mix and match these tools to cover all your bases and tackle any situation. We’ve only touched on the basics here, so consider playing with these tools to explore their full potential.

MSBuild and .NET Project Files Explained

Randa — Thu, 26 Dec 2024 20:32:22 +0000

What is MSBuild?

MSBuild (Microsoft Build Engine) is a build system and platform for building applications, primarily in the .NET ecosystem. It orchestrates how code is compiled, tested, packaged, and deployed by processing XML project files like .csproj, .fsproj and .vbproj. Visual Studio uses MSBuild, but you can use MSBuild without Visual Studio to build .NET applications.

Key Features of MSBuild

Project Building:
- Compiles your source code into Intermediate Language (IL) and packages it into binaries (.dll, .exe).
- Resolves project dependencies (e.g., NuGet packages) and includes them in the build process.
- Executes additional build tasks (e.g., running tests, creating packages, deployment tasks).
XML-Based Configuration:
- Uses XML-based project files to define build instructions in a clear and extensible format.
- Project files are used to define build steps, configurations, dependencies, and more.
Highly Customizable:
- You can write custom targets and tasks to extend its functionality.
Integrated with .NET CLI:
- Commands from .NET CLI like dotnet build, dotnet restore, and dotnet publish use MSBuild under the hood to build projects.
- Visual Studio has built-in support for MSBuild. For editors that are not integrated with MSBuild, like VS Code and Zed, the .NET CLI is commonly used to manage builds.
Build Automation and CI/CD Integration:
- Integrates with CI/CD systems like GitHub Actions, Azure Pipelines, and Jenkins to automate builds, tests, and deployments.
- Defines build and deployment pipelines entirely in MSBuild scripts.
Cross-Platform:
- Initially Windows-only, MSBuild became cross-platform starting with .NET Core, allowing builds on Linux and macOS.
- Ensures the same build logic works across operating systems, making it ideal for CI/CD pipelines.

MSBuild Project File

MSBuild processes an XML-based project file that describes build items, configurations, and reusable build rules, which keeps builds consistent across projects.
There are different types of project files, such as .csproj for C# projects, .fsproj for F# projects, and .vbproj for Visual Basic projects.

MSBuild Project File Structure

The following is an overview of the key elements that make up an MSBuild project file, explaining how each contributes to the build process and project configuration.

File Root

The project file begins with the <Project> root element which acts as the container for all other elements.
```
<Project>
</Project>
```
Modern .NET projects use SDK-style projects, where the SDK specifies a predefined set of build logic, properties, and imports. Some available SDKs:
- Microsoft.NET.Sdk: For console apps or libraries.
- Microsoft.NET.Sdk.Web: For web projects like Web APIs or MVC apps.
- Microsoft.NET.Sdk.Worker: For worker services and background jobs.
- Aspire.AppHost.Sdk: For Aspire app host.
- MSTest.Sdk: For MSTest apps.

Ways of declaring the SDK:

```
<!-- Inline SDK declaration -->
<Project Sdk="Microsoft.NET.Sdk">
</Project>

<!-- Using the `<Sdk>` element -->
<Project>
  <Sdk Name="Microsoft.NET.Sdk" />
</Project>
```

Properties

Key-value pairs used to configure builds and global settings like target framework, build configuration, and output paths.
They are defined within a <PropertyGroup>. Multiple <PropertyGroup> sections can be added.
```
<PropertyGroup>
  <TargetFramework>net9.0</TargetFramework>
</PropertyGroup>
```

Conditions can be specified to dynamically enable properties:

```
<PropertyGroup Condition="'$(Configuration)' == 'Release'">
  <Optimize>true</Optimize>
</PropertyGroup>
```

Common properties:

Target Framework
- <TargetFramework>/<TargetFrameworks>: Used to specify the .NET version. This property is required.
```
<TargetFramework>net9.0</TargetFramework>
<TargetFrameworks>net9.0;net40;net45</TargetFrameworks>
```
- Common TFMs: net9.0, net8.0, netstandard2.1, netcoreapp3.1, net481. OS-specific TFMs (e.g., net5.0-windows, net6.0-ios) include platform-specific bindings.
- You can use preprocessor directives in source code to compile conditionally per target framework:
```
#if NET40
Console.WriteLine("Target framework: .NET Framework 4.0");
#endif
```
Output Type
- <OutputType>: Used to specify the application type.
- Available types:
  - Exe for console apps.
  - Library for class libraries. (Default)
  - Module for modules.
  - WinExe for Windows-based programs.
```
<PropertyGroup>
  <OutputType>Exe</OutputType>
</PropertyGroup>
```
Implicit using Directives
- Starting with .NET 6, C# projects can automatically include commonly used namespaces via implicit global using directives, reducing the need to add them manually.
- Supported by SDKs like Microsoft.NET.Sdk, Microsoft.NET.Sdk.Web, Microsoft.NET.Sdk.Worker, and Microsoft.NET.Sdk.WindowsDesktop. New project templates turn the feature on in the project file.
- <ImplicitUsings>: Used to enable/disable the feature:
```
<PropertyGroup>
  <ImplicitUsings>enable</ImplicitUsings>
</PropertyGroup>
```
- <Using>: Used to add additional items to global using directives - We will talk about items in the next section:
```
<ItemGroup>
  <Using Include="System.IO.Pipes" />
</ItemGroup>
```
Compiler and Code Analyzer Warnings
- <TreatWarningsAsErrors>: Converts all compiler warnings into errors.
- <WarningsAsErrors>: Converts specific compiler warnings into errors.
- <CodeAnalysisTreatWarningsAsErrors>: Converts code analysis warnings into errors.
- <NoWarn>: Suppresses specific warnings and doesn't show them in build outputs.
```
<PropertyGroup>
  <TreatWarningsAsErrors>true</TreatWarningsAsErrors>
  <WarningsAsErrors>CS0168</WarningsAsErrors>
  <CodeAnalysisTreatWarningsAsErrors>true</CodeAnalysisTreatWarningsAsErrors>
  <NoWarn>CS2002</NoWarn>
</PropertyGroup>
```
Package properties
- These properties define the metadata used when generating a NuGet package from a project.
- <PackageId>: A unique identifier for the package. Default value is AssemblyName.
- <Version>: The version of the package. Default value is 1.0.0.
- <Authors>: The authors of the package. Default value is AssemblyName.
- <Company>: The company name associated with the project. Default value is AssemblyName.
```
<PropertyGroup>
  <PackageId>ClassLibDotNetStandard</PackageId>
  <Version>1.0.0</Version>
  <Authors>your_name</Authors>
  <Company>your_company</Company>
</PropertyGroup>
```

Items

Specify inputs to the build process, such as source files, packages, dependencies, and resources.
They are defined within a <ItemGroup>. Multiple <ItemGroup> sections can be added.

Common items:
<PackageReference>: Represents a reference to a package. For simplicity, use dotnet add package to add a package instead of manually adding it to the project file.
<ProjectReference>: Represents a reference to another project.
<EmbeddedResource>: Represents a resource to be embedded in the generated assembly.
<Compile>: Represents the source files for the compiler. SDK-style projects include source files by default, so there is no need to add each one explicitly.

```
<ItemGroup>
  <PackageReference Include="Swashbuckle.AspNetCore" Version="6.6.2" />
  <ProjectReference Include="..\OtherProject\OtherProject.csproj" />
  <EmbeddedResource Include="fonts\OpenSans.ttf" />
  <!-- Only for non-SDK-style projects: SDK-style projects already include *.cs files, and listing one again causes a duplicate-item build error -->
  <Compile Include="Program.cs" />
</ItemGroup>
```
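Because a project file is plain XML, its properties and items can be inspected with any XML parser. The Python sketch below uses the standard library's xml.etree.ElementTree to list them. The project content is a hypothetical file assembled from the snippets in this article, and the script only reads what is literally written in the file: it does not evaluate conditions, imports, or the defaults contributed by the SDK the way MSBuild does.

```python
import xml.etree.ElementTree as ET

# Hypothetical project file contents, assembled from the snippets above.
CSPROJ = """
<Project Sdk="Microsoft.NET.Sdk.Web">
  <PropertyGroup>
    <TargetFramework>net9.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>
  <ItemGroup>
    <PackageReference Include="Swashbuckle.AspNetCore" Version="6.6.2" />
    <ProjectReference Include="..\\OtherProject\\OtherProject.csproj" />
  </ItemGroup>
</Project>
"""

root = ET.fromstring(CSPROJ)

# Properties: every child element of every <PropertyGroup>.
properties = {
    prop.tag: (prop.text or "").strip()
    for group in root.findall("PropertyGroup")
    for prop in group
}

# Items: every child element of every <ItemGroup>, with its item type.
items = [
    (item.tag, item.get("Include"), item.get("Version"))
    for group in root.findall("ItemGroup")
    for item in group
]

print(root.get("Sdk"))   # Microsoft.NET.Sdk.Web
print(properties)        # {'TargetFramework': 'net9.0', 'ImplicitUsings': 'enable'}
for item in items:
    print(item)
```

Seeing the file as "a root element, groups of key-value properties, and groups of typed items" is a useful mental model for the Tasks and Targets elements as well.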

Tasks

Individual steps within targets to perform certain actions.

MSBuild includes built-in tasks (e.g., Copy, Exec, MakeDir, Csc) and supports custom ones (by implementing ITask or deriving from the helper class Task).

```
<Target Name="CustomTarget">
  <Exec Command="dotnet restore" />
  <Copy SourceFiles="README.md" DestinationFolder="bin\docs\" />
  <Csc Sources="@(Compile)" OutputAssembly="bin\MyApp.dll" />
</Target>
```

Targets

Group tasks and define sections of the project file as entry points for the build process. For example, one target cleans build artifacts, while another compiles the source code and outputs binaries.

The BeforeTargets, AfterTargets, and DependsOnTargets attributes can be used to order targets. BeforeTargets and AfterTargets hook a target in before or after another target, while DependsOnTargets lists the targets that must run first whenever this target runs.

```
<Target Name="PreBuild" BeforeTargets="PreBuildEvent">
  <Exec Command="echo pre build" />
</Target>

<Target Name="PostBuild" AfterTargets="PostBuildEvent">
  <Exec Command="echo post build" />
</Target>

<Target Name="PostPostBuild" DependsOnTargets="PostBuild">
  <Exec Command="echo post post build" />
</Target>
```
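To see how the three ordering attributes interact, here is a deliberately simplified Python model of target ordering. It is a sketch based on the build order described in the MSBuild documentation, not MSBuild's actual algorithm: it ignores Condition, Inputs/Outputs, InitialTargets, and error handling, and the target names in the example graph are hypothetical.

```python
def build_order(targets, entry):
    """Return the order in which targets run when `entry` is requested.

    `targets` maps a target name to a dict with optional "depends",
    "before", and "after" lists (mirroring DependsOnTargets,
    BeforeTargets, and AfterTargets).
    """
    order, seen = [], set()

    def run(name):
        if name in seen:           # each target runs at most once per build
            return
        seen.add(name)
        spec = targets.get(name, {})
        for dep in spec.get("depends", []):        # DependsOnTargets first
            run(dep)
        for other, other_spec in targets.items():  # then BeforeTargets hooks
            if name in other_spec.get("before", []):
                run(other)
        order.append(name)                         # the target itself
        for other, other_spec in targets.items():  # then AfterTargets hooks
            if name in other_spec.get("after", []):
                run(other)

    run(entry)
    return order


targets = {
    "Build": {"depends": ["Compile"]},
    "Compile": {},
    "PreBuild": {"before": ["Compile"]},
    "PostBuild": {"after": ["Build"]},
    "PostPostBuild": {"depends": ["PostBuild"]},
}

print(build_order(targets, "Build"))
# ['PreBuild', 'Compile', 'Build', 'PostBuild']
print(build_order(targets, "PostPostBuild"))
# ['PostBuild', 'PostPostBuild']
```

In this model, a target that only declares DependsOnTargets (like PostPostBuild) is not pulled into a build on its own. It runs only when something requests it, and its dependencies run first.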

C# Web API Project Example

Let's create a dummy C# Web API project from scratch, without using Visual Studio or the dotnet new command to generate a template. Instead, we will use the terminal to manually create the required files, use the dotnet CLI to build and run the project, and test it using a simple HTTP request. Feel free to use your favorite editor to edit the files.

Open your favorite terminal.
Create the project directory and navigate to it:
```
mkdir DemoApp
cd DemoApp
```
Create the .csproj file:
```
vim DemoApp.csproj
```

Add the following content to the .csproj file:

```
<Project Sdk="Microsoft.NET.Sdk.Web">

  <PropertyGroup>
    <TargetFramework>net9.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
  </PropertyGroup>

</Project>
```

Create the Program.cs file:
```
vim Program.cs
```

Add the following content to Program.cs to set up the web app and add a simple minimal API:

```
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.MapGet("/welcome", () => "Hello, you!");
app.Run();
```

Build the project - The output binaries will be placed in the bin/Debug/net9.0 directory by default.
```
dotnet build
```
Run the project - The output will indicate that the app is listening on: http://localhost:5000.
```
dotnet run
```
Test the project - The output should be: Hello, you!.
```
curl http://localhost:5000/welcome
```
And that's all! You can play around with the .csproj configuration and explore other properties.

Learning Resources

Refer to the following resources if you would like to learn more about MSBuild and project files:
