<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: InstaDevOps</title>
    <description>The latest articles on DEV Community by InstaDevOps (@instadevops).</description>
    <link>https://dev.to/instadevops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2952358%2F474aa7f4-09cf-409d-891e-cfc4f071d18a.png</url>
      <title>DEV Community: InstaDevOps</title>
      <link>https://dev.to/instadevops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/instadevops"/>
    <language>en</language>
    <item>
      <title>Chaos Engineering: Building Resilient Systems in Production</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Sat, 09 May 2026 13:48:13 +0000</pubDate>
      <link>https://dev.to/instadevops/chaos-engineering-building-resilient-systems-in-production-2l33</link>
      <guid>https://dev.to/instadevops/chaos-engineering-building-resilient-systems-in-production-2l33</guid>
      <description>&lt;h2&gt;
  
  
  Chaos Engineering: Building Resilient Systems with Litmus, Gremlin, and Chaos Monkey
&lt;/h2&gt;

&lt;p&gt;Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. It is not about breaking things randomly - it is a scientific method where you form a hypothesis about how your system handles failure, inject a controlled fault, observe the behavior, and improve based on what you learn. The alternative is waiting for production to surprise you at 3 AM.&lt;/p&gt;

&lt;p&gt;The practice starts with steady-state definition: what does normal look like for your system? Define it with metrics - request success rate above 99.9%, P95 latency below 200ms, error rate below 0.1%. Then design experiments: what happens when a database replica fails, when network latency increases by 100ms between two services, or when a pod's CPU is throttled to 50%? Tools like Litmus (Kubernetes-native, open source), Gremlin (SaaS with enterprise features), and Chaos Monkey (Netflix's original tool for random instance termination) let you inject these faults in a controlled manner.&lt;/p&gt;

&lt;p&gt;Start small and expand. Your first chaos experiment should be killing a single pod of a replicated service - if your system cannot handle that, you have bigger problems than chaos engineering can solve. Graduate to network partitions between services, DNS failures, and disk pressure. Run game days where the team practices incident response with injected failures. The goal is not to find every possible failure mode but to build muscle memory for responding to the unexpected and to systematically eliminate single points of failure.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Want to build more resilient infrastructure?&lt;/strong&gt; &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt; helps teams implement chaos engineering practices and improve system reliability. &lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free consultation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>chaos</category>
      <category>reliability</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>Linux Performance Tuning for DevOps Engineers</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Fri, 08 May 2026 13:48:10 +0000</pubDate>
      <link>https://dev.to/instadevops/linux-performance-tuning-for-devops-engineers-2mk4</link>
      <guid>https://dev.to/instadevops/linux-performance-tuning-for-devops-engineers-2mk4</guid>
      <description>&lt;h2&gt;
  
  
  Linux Performance Tuning for DevOps: CPU, Memory, I/O, and Network Optimization
&lt;/h2&gt;

&lt;p&gt;Linux performance tuning is a skill that separates good DevOps engineers from great ones. Default kernel parameters are optimized for general-purpose workloads, but production servers running databases, web applications, or container orchestrators need targeted tuning. The difference between default settings and properly tuned parameters can be 2-5x throughput improvement without any hardware changes.&lt;/p&gt;

&lt;p&gt;Start with understanding your bottleneck. Use &lt;code&gt;top&lt;/code&gt;, &lt;code&gt;htop&lt;/code&gt;, and &lt;code&gt;mpstat&lt;/code&gt; for CPU analysis - look at user vs system time, I/O wait, and per-core utilization. For memory, &lt;code&gt;free -h&lt;/code&gt;, &lt;code&gt;vmstat&lt;/code&gt;, and &lt;code&gt;/proc/meminfo&lt;/code&gt; reveal swap usage, page faults, and buffer/cache behavior. Disk I/O diagnostics use &lt;code&gt;iostat&lt;/code&gt;, &lt;code&gt;iotop&lt;/code&gt;, and &lt;code&gt;blktrace&lt;/code&gt; to identify whether you are IOPS-limited or throughput-limited. Network performance requires &lt;code&gt;ss&lt;/code&gt;, &lt;code&gt;netstat&lt;/code&gt;, and &lt;code&gt;iperf3&lt;/code&gt; to find connection bottlenecks, dropped packets, and bandwidth limits.&lt;/p&gt;

&lt;p&gt;The most impactful sysctl tunings for web-facing servers include increasing &lt;code&gt;net.core.somaxconn&lt;/code&gt; and &lt;code&gt;net.ipv4.tcp_max_syn_backlog&lt;/code&gt; for connection handling, tuning &lt;code&gt;vm.swappiness&lt;/code&gt; to 10 for application servers (keeping data in RAM), adjusting &lt;code&gt;vm.dirty_ratio&lt;/code&gt; and &lt;code&gt;vm.dirty_background_ratio&lt;/code&gt; for write-heavy workloads, and enabling &lt;code&gt;net.ipv4.tcp_fastopen&lt;/code&gt; for reduced latency. For containerized workloads, file descriptor limits (&lt;code&gt;fs.file-max&lt;/code&gt;), inotify watches (&lt;code&gt;fs.inotify.max_user_watches&lt;/code&gt;), and PID limits need attention. Always benchmark before and after changes - tuning without measurement is just guessing.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Need help optimizing your infrastructure?&lt;/strong&gt; &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt; tunes production Linux servers for maximum performance. &lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free consultation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>linux</category>
      <category>performance</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>OpenTelemetry: The Future of Cloud-Native Observability</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Thu, 07 May 2026 13:48:06 +0000</pubDate>
      <link>https://dev.to/instadevops/opentelemetry-the-future-of-cloud-native-observability-2323</link>
      <guid>https://dev.to/instadevops/opentelemetry-the-future-of-cloud-native-observability-2323</guid>
      <description>&lt;h2&gt;
  
  
  OpenTelemetry in Practice: Unified Traces, Metrics, and Logs for Microservices
&lt;/h2&gt;

&lt;p&gt;OpenTelemetry has become the industry standard for instrumenting applications, but most teams struggle with the gap between understanding the concept and running it in production. The documentation covers individual components well, but the real challenge is wiring everything together - auto-instrumentation, the OTel Collector pipeline, sampling strategies, and connecting traces to logs to metrics in a way that actually speeds up debugging.&lt;/p&gt;

&lt;p&gt;Start with auto-instrumentation. For Node.js, the &lt;code&gt;@opentelemetry/auto-instrumentations-node&lt;/code&gt; package automatically traces HTTP requests, database queries, gRPC calls, and message queue operations without changing application code. For Python, &lt;code&gt;opentelemetry-instrumentation&lt;/code&gt; does the same. Deploy the OpenTelemetry Collector as a sidecar or DaemonSet - it receives telemetry via OTLP, processes it (batching, filtering, adding resource attributes), and exports to your backends. The Collector is where you implement tail-based sampling: keep 100% of error traces and slow requests, sample 10% of successful fast requests.&lt;/p&gt;

&lt;p&gt;The real power of OpenTelemetry is correlation. When a trace ID propagates through every service, you can click from a Grafana dashboard showing elevated P99 latency directly to the specific trace causing it, then pivot to the structured logs for that trace ID. Configure your logging library to inject trace and span IDs into every log entry. Use exemplars to link metrics to traces. This workflow - alert fires, check dashboard, find trace, read logs - should take minutes, not hours.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Need help implementing observability?&lt;/strong&gt; &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt; deploys production OpenTelemetry stacks with Grafana, Tempo, Loki, and Prometheus. &lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free consultation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>opentelemetry</category>
      <category>observability</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>AWS VPC Networking: Transit Gateway, Peering &amp; PrivateLink</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Wed, 06 May 2026 13:48:03 +0000</pubDate>
      <link>https://dev.to/instadevops/aws-vpc-networking-transit-gateway-peering-privatelink-775</link>
      <guid>https://dev.to/instadevops/aws-vpc-networking-transit-gateway-peering-privatelink-775</guid>
      <description>&lt;h2&gt;
  
  
  AWS VPC Networking: Subnets, NAT Gateways, Transit Gateway, and PrivateLink
&lt;/h2&gt;

&lt;p&gt;AWS networking is the foundation that everything else sits on, yet it is the area where most teams accumulate the most technical debt. A poorly designed VPC leads to security gaps, connectivity issues, and painful migrations later. Getting your network architecture right from the start - proper CIDR planning, subnet tiers, and connectivity patterns - saves enormous headaches as you scale.&lt;/p&gt;

&lt;p&gt;Every production VPC should have three subnet tiers across multiple availability zones: public subnets for load balancers and bastion hosts, private subnets for application workloads, and isolated subnets for databases with no internet access. NAT Gateways provide outbound internet access for private subnets - deploy one per AZ for high availability, but be aware they are one of the most expensive networking components. Use VPC endpoints (Gateway endpoints for S3/DynamoDB, Interface endpoints for other services) to keep traffic on the AWS backbone and reduce NAT Gateway costs.&lt;/p&gt;

&lt;p&gt;For multi-VPC architectures, Transit Gateway replaces the mesh of VPC peering connections that becomes unmanageable beyond 3-4 VPCs. Transit Gateway acts as a hub that all VPCs connect to, with route tables controlling which VPCs can communicate. PrivateLink exposes services across VPCs or to customers without traversing the public internet - essential for SaaS architectures and shared services platforms. Plan your CIDR ranges carefully to avoid overlaps, and use AWS RAM to share subnets across accounts in an AWS Organizations setup.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Need help designing your AWS network?&lt;/strong&gt; &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt; architects production-grade VPC layouts for startups and scale-ups. &lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free consultation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>networking</category>
      <category>vpc</category>
      <category>devops</category>
    </item>
    <item>
      <title>Terraform State Management: Remote Backends, Locking &amp; Recovery</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Tue, 05 May 2026 13:47:55 +0000</pubDate>
      <link>https://dev.to/instadevops/terraform-state-management-remote-backends-locking-recovery-3ofe</link>
      <guid>https://dev.to/instadevops/terraform-state-management-remote-backends-locking-recovery-3ofe</guid>
      <description>&lt;h2&gt;
  
  
  Terraform State Management: Remote Backends, Locking, and State Surgery
&lt;/h2&gt;

&lt;p&gt;Terraform state is the single most critical file in your infrastructure-as-code workflow. It maps your Terraform configuration to real-world resources, and corrupting or losing it means Terraform loses track of everything it manages. Yet many teams start with local state files, no locking, and no backup strategy - a recipe for disaster the moment two people run terraform apply simultaneously.&lt;/p&gt;

&lt;p&gt;Remote backends solve the fundamentals: store state in S3 with DynamoDB locking, or use Terraform Cloud for managed state with built-in versioning. Every remote backend should have encryption at rest, versioning enabled for rollback, and a DynamoDB table for state locking to prevent concurrent modifications. Structure your state files by environment and component - a single monolithic state file for your entire infrastructure creates blast radius problems and slows down every plan and apply.&lt;/p&gt;

&lt;p&gt;State surgery is the skill you need when things go wrong. &lt;code&gt;terraform state mv&lt;/code&gt; renames resources without destroying them. &lt;code&gt;terraform state rm&lt;/code&gt; removes resources from state without deleting them from your cloud provider. &lt;code&gt;terraform import&lt;/code&gt; brings existing infrastructure under Terraform management. These commands are dangerous - always back up your state file before running them. For large-scale refactoring, &lt;code&gt;terraform state pull&lt;/code&gt; and &lt;code&gt;terraform state push&lt;/code&gt; let you manipulate state as JSON, but treat this as a last resort.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Need help with your Terraform setup?&lt;/strong&gt; &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt; specializes in production-grade infrastructure as code. &lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free consultation&lt;/a&gt; to discuss your IaC strategy.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>iac</category>
      <category>devops</category>
      <category>cloud</category>
    </item>
    <item>
      <title>Kubernetes Autoscaling: HPA, VPA, and KEDA Deep Dive</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Mon, 04 May 2026 13:47:52 +0000</pubDate>
      <link>https://dev.to/instadevops/kubernetes-autoscaling-hpa-vpa-and-keda-deep-dive-32g7</link>
      <guid>https://dev.to/instadevops/kubernetes-autoscaling-hpa-vpa-and-keda-deep-dive-32g7</guid>
      <description>&lt;h2&gt;
  
  
  Kubernetes Autoscaling Deep Dive: HPA, VPA, KEDA, and Cluster Autoscaler
&lt;/h2&gt;

&lt;p&gt;Kubernetes autoscaling is not a single feature - it is a layered system where four components work together to match your infrastructure to actual demand. Getting autoscaling wrong means either wasting money on idle resources or dropping traffic during load spikes. Understanding how HPA, VPA, KEDA, and the Cluster Autoscaler interact is essential for any production Kubernetes deployment.&lt;/p&gt;

&lt;p&gt;The Horizontal Pod Autoscaler (HPA) scales the number of pod replicas based on CPU, memory, or custom metrics. The Vertical Pod Autoscaler (VPA) adjusts resource requests and limits for individual containers based on observed usage - critical for right-sizing workloads you have not profiled yet. KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with scalers for external event sources like SQS queue depth, Kafka consumer lag, or Prometheus queries, enabling scale-to-zero for workloads that do not need to run continuously. The Cluster Autoscaler adds or removes nodes when pods cannot be scheduled due to insufficient cluster capacity.&lt;/p&gt;

&lt;p&gt;The key to effective autoscaling is layering these components correctly. Run VPA in recommendation mode first to establish baseline resource requests. Configure HPA with appropriate metrics and thresholds - CPU utilization of 60-70% is a good starting point. Use KEDA for event-driven workloads like queue processors. And ensure Cluster Autoscaler is configured with appropriate node groups and scale-down policies to avoid unnecessary churn.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Struggling with Kubernetes scaling?&lt;/strong&gt; &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt; helps teams implement autoscaling strategies that balance performance and cost. &lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free consultation&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>autoscaling</category>
      <category>keda</category>
      <category>devops</category>
    </item>
    <item>
      <title>HashiCorp Vault: Production Secrets Management Guide</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Sun, 03 May 2026 13:47:49 +0000</pubDate>
      <link>https://dev.to/instadevops/hashicorp-vault-production-secrets-management-guide-387p</link>
      <guid>https://dev.to/instadevops/hashicorp-vault-production-secrets-management-guide-387p</guid>
      <description>&lt;h2&gt;
  
  
  HashiCorp Vault for DevOps: Dynamic Secrets, PKI, and Zero-Trust Infrastructure
&lt;/h2&gt;

&lt;p&gt;Static secrets are a ticking time bomb. Hardcoded API keys, long-lived database passwords stored in environment variables, and shared credentials passed around in Slack channels - these are the security gaps that attackers exploit every day. HashiCorp Vault solves this by providing a centralized secrets management platform that generates short-lived, dynamic credentials on demand.&lt;/p&gt;

&lt;p&gt;Vault's power lies in its secrets engines. The database secrets engine generates unique credentials for each application instance with automatic expiration. The PKI engine issues TLS certificates programmatically, eliminating manual certificate management. The AWS engine creates temporary IAM credentials scoped to exactly the permissions each service needs. Combined with Vault Agent for automatic secret injection and renewal, you can build infrastructure where no secret lives longer than it needs to.&lt;/p&gt;

&lt;p&gt;Implementing zero-trust with Vault means every service authenticates independently - whether through Kubernetes service accounts, AWS IAM roles, or AppRole for CI/CD pipelines. Vault's audit log tracks every secret access, giving you a complete trail for compliance. The transit engine handles encryption-as-a-service without exposing keys to applications. For teams running Kubernetes, the Vault CSI provider and sidecar injector make integration seamless.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Need help securing your infrastructure?&lt;/strong&gt; At &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;, we implement production-grade secrets management with HashiCorp Vault. &lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free consultation&lt;/a&gt; to discuss your security posture.&lt;/p&gt;

</description>
      <category>vault</category>
      <category>security</category>
      <category>secrets</category>
      <category>devops</category>
    </item>
    <item>
      <title>Microservices Communication: gRPC vs REST, Message Queues, and Sagas</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Sat, 02 May 2026 13:47:45 +0000</pubDate>
      <link>https://dev.to/instadevops/microservices-communication-grpc-vs-rest-message-queues-and-sagas-337b</link>
      <guid>https://dev.to/instadevops/microservices-communication-grpc-vs-rest-message-queues-and-sagas-337b</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The moment you split a monolith into microservices, communication becomes your biggest challenge. In a monolith, a function call is nanoseconds, type-safe, and transactional. In microservices, every interaction is a network call that can fail, timeout, arrive out of order, or silently duplicate. The communication patterns you choose determine your system's reliability, latency, and operational complexity.&lt;/p&gt;

&lt;p&gt;This guide covers the practical decisions you face when designing microservice communication: synchronous vs asynchronous, gRPC vs REST, choosing the right message broker, implementing the saga pattern for distributed transactions, and building circuit breakers to prevent cascade failures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Synchronous vs Asynchronous Communication
&lt;/h2&gt;

&lt;p&gt;The first architectural decision is whether services communicate synchronously (request-response) or asynchronously (event-driven). Most systems need both.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Synchronous (request-response):&lt;/strong&gt; Service A sends a request to Service B and waits for a response. Use for operations where the caller needs an immediate answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → API Gateway → Order Service → Inventory Service (check stock)
                                    ← stock confirmed
                     ← order created
← 201 Created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Asynchronous (event-driven):&lt;/strong&gt; Service A publishes an event and moves on. Other services consume the event whenever they are ready. Use for operations where the caller does not need an immediate answer, or when you want to decouple services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Order Service → publishes "OrderCreated" event → Message Queue
                                                    ├→ Payment Service (processes payment)
                                                    ├→ Notification Service (sends email)
                                                    └→ Analytics Service (records metric)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When to use synchronous:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User-facing requests that need an immediate response&lt;/li&gt;
&lt;li&gt;Data reads where you need the latest state&lt;/li&gt;
&lt;li&gt;Simple request-response queries between two services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use asynchronous:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Operations that can happen in the background (email, analytics, audit logs)&lt;/li&gt;
&lt;li&gt;When you need to fan out to multiple consumers&lt;/li&gt;
&lt;li&gt;When services have different availability requirements&lt;/li&gt;
&lt;li&gt;When you need to handle traffic spikes (the queue acts as a buffer)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest mistake teams make is going 100% synchronous. If your order service synchronously calls payment, inventory, notification, and analytics services in sequence, a slowdown in any one of them degrades the entire order flow. Move non-critical steps to async processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  gRPC vs REST
&lt;/h2&gt;

&lt;p&gt;For synchronous service-to-service communication, you are choosing between REST (HTTP/JSON) and gRPC (HTTP/2, Protocol Buffers).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;REST&lt;/strong&gt; is the default because it is simple, well-understood, and debuggable with curl:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Express.js REST endpoint&lt;/span&gt;
&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/inventory/:productId&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT quantity FROM inventory WHERE product_id = $1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Calling it from another service&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://inventory-service:3000/api/inventory/SKU-123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;gRPC&lt;/strong&gt; uses Protocol Buffers for serialization and HTTP/2 for transport, giving you strong typing, smaller payloads, and bidirectional streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight protobuf"&gt;&lt;code&gt;&lt;span class="c1"&gt;// inventory.proto&lt;/span&gt;
&lt;span class="na"&gt;syntax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"proto3"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kn"&gt;package&lt;/span&gt; &lt;span class="nn"&gt;inventory&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;service&lt;/span&gt; &lt;span class="n"&gt;InventoryService&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;rpc&lt;/span&gt; &lt;span class="n"&gt;CheckStock&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StockRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StockResponse&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;rpc&lt;/span&gt; &lt;span class="n"&gt;WatchStock&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StockRequest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;returns&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="n"&gt;StockUpdate&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Server streaming&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;StockRequest&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;StockResponse&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int32&lt;/span&gt; &lt;span class="na"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="na"&gt;in_stock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kd"&gt;message&lt;/span&gt; &lt;span class="nc"&gt;StockUpdate&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;product_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;int32&lt;/span&gt; &lt;span class="na"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="na"&gt;timestamp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// gRPC server (Node.js)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;grpc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@grpc/grpc-js&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;protoLoader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@grpc/proto-loader&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;packageDef&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;protoLoader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loadSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;inventory.proto&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;proto&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loadPackageDefinition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;packageDef&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Server&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;proto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inventory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;InventoryService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;checkStock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT quantity FROM inventory WHERE product_id = $1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nf"&gt;callback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;product_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;in_stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bindAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;0.0.0.0:50051&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;grpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ServerCredentials&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createInsecure&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gRPC server running on port 50051&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Choose REST when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;External-facing APIs (browsers, mobile apps, third parties)&lt;/li&gt;
&lt;li&gt;Simple CRUD operations&lt;/li&gt;
&lt;li&gt;You want maximum debuggability&lt;/li&gt;
&lt;li&gt;Team is not familiar with protobuf&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose gRPC when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal service-to-service communication with high call volume&lt;/li&gt;
&lt;li&gt;You need streaming (real-time feeds, file uploads)&lt;/li&gt;
&lt;li&gt;Latency-sensitive paths where the ~30% serialization speedup matters&lt;/li&gt;
&lt;li&gt;You want strict API contracts enforced at compile time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, many teams use REST for their public API and gRPC for internal service mesh communication.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing a Message Broker
&lt;/h2&gt;

&lt;p&gt;For asynchronous communication, you need a message broker. The three most common choices are SQS, RabbitMQ, and Kafka, and they serve different use cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon SQS&lt;/strong&gt; - Managed queue, simplest to operate, no infrastructure to manage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;SQSClient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;SendMessageCommand&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ReceiveMessageCommand&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DeleteMessageCommand&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@aws-sdk/client-sqs&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sqs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SQSClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;us-east-1&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;QUEUE_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://sqs.us-east-1.amazonaws.com/123456789/order-events&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// Producer: publish event&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SendMessageCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;QueueUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;QUEUE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;MessageBody&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;eventType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OrderCreated&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ord_abc123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;usr_456&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;total&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;99.99&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;MessageGroupId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;ord_abc123&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// FIFO queue: ensures ordering per order&lt;/span&gt;
  &lt;span class="na"&gt;MessageDeduplicationId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`order-created-ord_abc123`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Prevents duplicates&lt;/span&gt;
&lt;span class="p"&gt;}));&lt;/span&gt;

&lt;span class="c1"&gt;// Consumer: poll for events&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;pollMessages&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ReceiveMessageCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;QueueUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;QUEUE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;MaxNumberOfMessages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;WaitTimeSeconds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Long polling&lt;/span&gt;
      &lt;span class="na"&gt;VisibilityTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}));&lt;/span&gt;

    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Messages&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;processEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;sqs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DeleteMessageCommand&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;QueueUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;QUEUE_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;ReceiptHandle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ReceiptHandle&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;}));&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RabbitMQ&lt;/strong&gt; - Feature-rich broker with flexible routing (exchanges, bindings, dead letter queues):&lt;/p&gt;

&lt;p&gt;Best when you need complex routing patterns (topic-based, headers-based), priority queues, or delayed messages. More operational overhead than SQS since you manage the broker yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt; - Distributed log for high-throughput event streaming:&lt;/p&gt;

&lt;p&gt;Best when you need event replay (consumers can re-read history), very high throughput (millions of events/second), or multiple consumer groups reading the same stream independently. Most complex to operate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision matrix:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;SQS&lt;/th&gt;
&lt;th&gt;RabbitMQ&lt;/th&gt;
&lt;th&gt;Kafka&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Zero ops overhead&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Message ordering&lt;/td&gt;
&lt;td&gt;FIFO queues&lt;/td&gt;
&lt;td&gt;Per queue&lt;/td&gt;
&lt;td&gt;Per partition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Event replay&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex routing&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (use streams)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Medium&lt;/td&gt;
&lt;td&gt;Very high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;~20ms&lt;/td&gt;
&lt;td&gt;~1ms&lt;/td&gt;
&lt;td&gt;~5ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost at low volume&lt;/td&gt;
&lt;td&gt;Cheapest&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;Expensive&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For most startups: &lt;strong&gt;start with SQS&lt;/strong&gt;. You can always migrate to Kafka later when you actually need event replay or million-message-per-second throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Saga Pattern for Distributed Transactions
&lt;/h2&gt;

&lt;p&gt;In a monolith, creating an order means a single database transaction: deduct inventory, charge payment, create order - all or nothing. In microservices, each step is a different service with its own database. You cannot use a distributed transaction (2PC) because it does not scale and couples all services together.&lt;/p&gt;

&lt;p&gt;The saga pattern breaks a distributed transaction into a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the previous steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choreography-based saga&lt;/strong&gt; (each service reacts to events):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OrderService                PaymentService              InventoryService
    │                            │                            │
    ├─ Create order (PENDING)    │                            │
    ├─ Publish "OrderCreated" ──►│                            │
    │                            ├─ Charge payment            │
    │                            ├─ Publish "PaymentCharged" ─►│
    │                            │                            ├─ Reserve stock
    │                            │                            ├─ Publish "StockReserved"
    │◄───────────────────────────┼────────────────────────────┤
    ├─ Update order → CONFIRMED  │                            │
    │                            │                            │
    │   --- If payment fails --- │                            │
    │                            ├─ Publish "PaymentFailed" ──►│
    │◄───────────────────────────┤                            ├─ Release stock
    ├─ Update order → CANCELLED  │                            │
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Orchestration-based saga&lt;/strong&gt; (a central coordinator manages the flow):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// saga-orchestrator.js&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CreateOrderSaga&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;orderService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;paymentService&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;inventoryService&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;inventoryService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reserveStock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="na"&gt;compensate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;inventoryService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;releaseStock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;items&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;paymentService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;total&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="na"&gt;compensate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;paymentService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;refund&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;paymentId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
      &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;orderService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;confirmOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="na"&gt;compensate&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;orderService&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cancelOrder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;orderData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;completedSteps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;steps&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;orderData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;orderData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;orderData&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
        &lt;span class="nx"&gt;completedSteps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;step&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Saga step failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;. Compensating...`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="c1"&gt;// Compensate in reverse order&lt;/span&gt;
        &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;completed&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;completedSteps&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reverse&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;completed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compensate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;orderData&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;compError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Compensation failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;compError&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="c1"&gt;// Alert on-call - manual intervention needed&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Order saga failed: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;orderData&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Choreography vs orchestration:&lt;/strong&gt; Use choreography when you have 2-3 services in the saga and the flow is simple. Use orchestration when you have 4+ services or complex branching logic. Orchestration is easier to understand, debug, and monitor.&lt;/p&gt;

&lt;h2&gt;
  
  
  Circuit Breakers
&lt;/h2&gt;

&lt;p&gt;When a downstream service is failing, you do not want every request to wait for a timeout. A circuit breaker detects failures and short-circuits requests to the failing service, returning an error immediately or falling back to a cached response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// circuit-breaker.js&lt;/span&gt;
&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CircuitBreaker&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;options&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureThreshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureThreshold&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resetTimeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resetTimeout&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// 30 seconds&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lastFailureTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CLOSED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  &lt;span class="c1"&gt;// CLOSED = normal, OPEN = failing, HALF_OPEN = testing&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lastFailureTime&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resetTimeout&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Circuit breaker is OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CLOSED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lastFailureTime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureThreshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Circuit breaker tripped to OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fallback&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Usage&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;inventoryBreaker&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;CircuitBreaker&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;resetTimeout&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;checkStock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;inventoryBreaker&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;// Primary: call inventory service&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`http://inventory-service:3000/api/stock/&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Inventory service returned &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="c1"&gt;// Fallback: return cached/default value&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;in_stock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;fallback&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In production, use a library like &lt;a href="https://github.com/nodeshift/opossum" rel="noopener noreferrer"&gt;opossum&lt;/a&gt; for Node.js or rely on your service mesh (Istio, Linkerd) to handle circuit breaking at the infrastructure level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Istio DestinationRule with circuit breaking&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.istio.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DestinationRule&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inventory-service&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inventory-service&lt;/span&gt;
  &lt;span class="na"&gt;trafficPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;connectionPool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;tcp&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;maxConnections&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;h2UpgradePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DEFAULT&lt;/span&gt;
        &lt;span class="na"&gt;http1MaxPendingRequests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
        &lt;span class="na"&gt;http2MaxRequests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
        &lt;span class="na"&gt;maxRequestsPerConnection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;outlierDetection&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;consecutive5xxErrors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10s&lt;/span&gt;
      &lt;span class="na"&gt;baseEjectionTime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30s&lt;/span&gt;
      &lt;span class="na"&gt;maxEjectionPercent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;50&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Observability for Distributed Communication
&lt;/h2&gt;

&lt;p&gt;Debugging microservice communication without proper observability is like debugging a distributed system blindfolded. Implement these three pillars:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed tracing&lt;/strong&gt; - propagate trace IDs across service boundaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Using OpenTelemetry&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;propagation&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@opentelemetry/api&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// When making an outbound request, propagate context&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;callService&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{};&lt;/span&gt;
  &lt;span class="nx"&gt;propagation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;inject&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;active&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Content-Type&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// Includes traceparent header&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Structured logging&lt;/strong&gt; - include correlation IDs in every log line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Every log entry includes the trace/request ID&lt;/span&gt;
&lt;span class="nx"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Processing order&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;spanContext&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nx"&gt;traceId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;orderId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;order-service&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;action&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;create_order&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;duration_ms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;startTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Metrics&lt;/strong&gt; - track request rate, error rate, and latency (the RED method):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Prometheus metrics&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;httpRequestDuration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;promClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Histogram&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http_request_duration_seconds&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;help&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Duration of HTTP requests&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;labelNames&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;method&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;route&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;status&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;target_service&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;buckets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Designing and implementing microservices communication patterns - from choosing the right broker to implementing sagas and circuit breakers - requires deep infrastructure expertise. At &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;, we help startups and SMBs build reliable, scalable microservice architectures - starting at $2,999/mo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free 15-minute consultation&lt;/a&gt; to discuss your microservices architecture and communication challenges.&lt;/p&gt;

</description>
      <category>microservices</category>
      <category>architecture</category>
      <category>devops</category>
      <category>grpc</category>
    </item>
    <item>
      <title>Backup and Disaster Recovery: RTO/RPO Planning and AWS Backup</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Fri, 01 May 2026 13:47:40 +0000</pubDate>
      <link>https://dev.to/instadevops/backup-and-disaster-recovery-rtorpo-planning-and-aws-backup-41hn</link>
      <guid>https://dev.to/instadevops/backup-and-disaster-recovery-rtorpo-planning-and-aws-backup-41hn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Every startup says they have backups. Very few have actually tested restoring from them. Even fewer have a documented disaster recovery plan with defined recovery objectives. Then something goes wrong - a developer runs a destructive migration against production, a region goes down, ransomware encrypts your EBS volumes - and the team discovers their "backup strategy" was an untested assumption.&lt;/p&gt;

&lt;p&gt;Disaster recovery does not have to be complicated or expensive, especially on AWS. But it does have to be intentional. This guide walks through building a real DR strategy: defining your recovery objectives, implementing the 3-2-1 backup rule, configuring AWS Backup, setting up cross-region replication, and most importantly, testing that everything actually works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding RTO and RPO
&lt;/h2&gt;

&lt;p&gt;Before you configure a single backup, you need two numbers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recovery Point Objective (RPO):&lt;/strong&gt; How much data can you afford to lose? If your RPO is 1 hour, you need backups at least every hour. If your RPO is zero, you need synchronous replication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recovery Time Objective (RTO):&lt;/strong&gt; How long can your service be down before the business impact becomes unacceptable? If your RTO is 4 hours, you need a recovery process that can restore full service within 4 hours.&lt;/p&gt;

&lt;p&gt;These numbers vary by system. Here is a realistic example for a SaaS startup:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;RPO&lt;/th&gt;
&lt;th&gt;RTO&lt;/th&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Primary database (PostgreSQL)&lt;/td&gt;
&lt;td&gt;5 minutes&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;td&gt;Continuous WAL archiving + read replica&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User uploads (S3)&lt;/td&gt;
&lt;td&gt;0 (no loss)&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;td&gt;Cross-region replication&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Application servers&lt;/td&gt;
&lt;td&gt;N/A (stateless)&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;Auto-scaling group, multi-AZ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Redis cache&lt;/td&gt;
&lt;td&gt;1 hour&lt;/td&gt;
&lt;td&gt;15 minutes&lt;/td&gt;
&lt;td&gt;Rebuild from DB if needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Elasticsearch&lt;/td&gt;
&lt;td&gt;24 hours&lt;/td&gt;
&lt;td&gt;4 hours&lt;/td&gt;
&lt;td&gt;Nightly snapshots to S3&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Configuration/secrets&lt;/td&gt;
&lt;td&gt;0 (no loss)&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;td&gt;Versioned in AWS Secrets Manager&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: not everything needs the same level of protection. Your production database needs 5-minute RPO. Your Elasticsearch cluster (which can be rebuilt from the database) can tolerate 24-hour RPO. Treating everything equally means overspending on DR for low-criticality systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 3-2-1 Backup Rule
&lt;/h2&gt;

&lt;p&gt;The 3-2-1 rule is decades old and still the best framework for backup strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3&lt;/strong&gt; copies of your data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2&lt;/strong&gt; different storage media/types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1&lt;/strong&gt; copy offsite (different region or provider)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In AWS terms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Copy 1:&lt;/strong&gt; Production data (RDS instance, EBS volumes, S3 buckets)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copy 2:&lt;/strong&gt; AWS Backup vault in the same region (different storage from production)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Copy 3:&lt;/strong&gt; Cross-region backup copy (different region entirely)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For truly critical data, consider a fourth copy outside AWS entirely (GCP bucket, Azure blob, or even Backblaze B2). This protects against the unlikely but possible scenario of an AWS account compromise.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Example: backup RDS snapshot to an external provider using pg_dump&lt;/span&gt;
pg_dump &lt;span class="nt"&gt;-h&lt;/span&gt; your-rds-endpoint.region.rds.amazonaws.com &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-U&lt;/span&gt; dbadmin &lt;span class="nt"&gt;-d&lt;/span&gt; production &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;custom &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d&lt;span class="si"&gt;)&lt;/span&gt;.dump

&lt;span class="c"&gt;# Encrypt and upload to external storage&lt;/span&gt;
gpg &lt;span class="nt"&gt;--symmetric&lt;/span&gt; &lt;span class="nt"&gt;--cipher-algo&lt;/span&gt; AES256 production-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d&lt;span class="si"&gt;)&lt;/span&gt;.dump
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;production-&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; +%Y%m%d&lt;span class="si"&gt;)&lt;/span&gt;.dump.gpg s3://offsite-backups/ &lt;span class="nt"&gt;--storage-class&lt;/span&gt; GLACIER_IR
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuring AWS Backup
&lt;/h2&gt;

&lt;p&gt;AWS Backup provides a centralized service to manage backups across RDS, EBS, EFS, DynamoDB, S3, and more. Here is a production-ready configuration using Terraform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Backup vault&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_backup_vault"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production-vault"&lt;/span&gt;
  &lt;span class="nx"&gt;kms_key_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_kms_key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;backup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Cross-region vault for DR&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_backup_vault"&lt;/span&gt; &lt;span class="s2"&gt;"dr"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dr_region&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production-vault-dr"&lt;/span&gt;
  &lt;span class="nx"&gt;kms_key_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_kms_key&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;backup_dr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Backup plan&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_backup_plan"&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production-backup-plan"&lt;/span&gt;

  &lt;span class="c1"&gt;# Hourly backups retained for 24 hours&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;rule_name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hourly"&lt;/span&gt;
    &lt;span class="nx"&gt;target_vault_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_backup_vault&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
    &lt;span class="nx"&gt;schedule&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cron(0 * * * ? *)"&lt;/span&gt;
    &lt;span class="nx"&gt;start_window&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="nx"&gt;completion_window&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;120&lt;/span&gt;

    &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;delete_after&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;  &lt;span class="c1"&gt;# days&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Daily backups retained for 30 days&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;rule_name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"daily"&lt;/span&gt;
    &lt;span class="nx"&gt;target_vault_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_backup_vault&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
    &lt;span class="nx"&gt;schedule&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cron(0 3 * * ? *)"&lt;/span&gt;
    &lt;span class="nx"&gt;start_window&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="nx"&gt;completion_window&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt;

    &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;cold_storage_after&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
      &lt;span class="nx"&gt;delete_after&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Copy to DR region&lt;/span&gt;
    &lt;span class="nx"&gt;copy_action&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;destination_vault_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_backup_vault&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
      &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;delete_after&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;# Weekly backups retained for 1 year&lt;/span&gt;
  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;rule_name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"weekly"&lt;/span&gt;
    &lt;span class="nx"&gt;target_vault_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_backup_vault&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
    &lt;span class="nx"&gt;schedule&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cron(0 4 ? * SUN *)"&lt;/span&gt;
    &lt;span class="nx"&gt;start_window&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt;
    &lt;span class="nx"&gt;completion_window&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;

    &lt;span class="nx"&gt;lifecycle&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;cold_storage_after&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
      &lt;span class="nx"&gt;delete_after&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Assign resources to backup plan&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_backup_selection"&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;iam_role_arn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;backup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production-resources"&lt;/span&gt;
  &lt;span class="nx"&gt;plan_id&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_backup_plan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;production&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="c1"&gt;# Backup everything tagged with Backup=true&lt;/span&gt;
  &lt;span class="nx"&gt;selection_tag&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"STRINGEQUALS"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Backup"&lt;/span&gt;
    &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tag your resources to include them in the backup plan:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_db_instance"&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# ... other config&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Backup&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ebs_volume"&lt;/span&gt; &lt;span class="s2"&gt;"data"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;# ... other config&lt;/span&gt;
  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;Backup&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"true"&lt;/span&gt;
    &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cross-Region Replication
&lt;/h2&gt;

&lt;p&gt;For services where you need faster recovery than restoring from backups, set up active replication:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RDS Cross-Region Read Replica:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Primary in us-east-1&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_db_instance"&lt;/span&gt; &lt;span class="s2"&gt;"primary"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;identifier&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production-primary"&lt;/span&gt;
  &lt;span class="nx"&gt;engine&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"postgres"&lt;/span&gt;
  &lt;span class="nx"&gt;engine_version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"16.2"&lt;/span&gt;
  &lt;span class="nx"&gt;instance_class&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"db.r6g.xlarge"&lt;/span&gt;
  &lt;span class="nx"&gt;multi_az&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

  &lt;span class="nx"&gt;backup_retention_period&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;
  &lt;span class="nx"&gt;backup_window&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"03:00-04:00"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Cross-region replica in us-west-2&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_db_instance"&lt;/span&gt; &lt;span class="s2"&gt;"dr_replica"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dr_region&lt;/span&gt;

  &lt;span class="nx"&gt;identifier&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production-dr-replica"&lt;/span&gt;
  &lt;span class="nx"&gt;replicate_source_db&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_db_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
  &lt;span class="nx"&gt;instance_class&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"db.r6g.large"&lt;/span&gt;  &lt;span class="c1"&gt;# Can be smaller to save cost&lt;/span&gt;

  &lt;span class="c1"&gt;# Can be promoted to primary during DR&lt;/span&gt;
  &lt;span class="nx"&gt;auto_minor_version_upgrade&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;S3 Cross-Region Replication:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3_bucket"&lt;/span&gt; &lt;span class="s2"&gt;"primary"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myapp-uploads-primary"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3_bucket_versioning"&lt;/span&gt; &lt;span class="s2"&gt;"primary"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_s3_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;versioning_configuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Enabled"&lt;/span&gt;  &lt;span class="c1"&gt;# Required for replication&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3_bucket_replication_configuration"&lt;/span&gt; &lt;span class="s2"&gt;"primary"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_s3_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;primary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;role&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_iam_role&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;replication&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;

  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"replicate-all"&lt;/span&gt;
    &lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Enabled"&lt;/span&gt;

    &lt;span class="nx"&gt;destination&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;bucket&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_s3_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;dr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
      &lt;span class="nx"&gt;storage_class&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"STANDARD_IA"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DynamoDB Global Tables:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_dynamodb_table"&lt;/span&gt; &lt;span class="s2"&gt;"sessions"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"sessions"&lt;/span&gt;
  &lt;span class="nx"&gt;billing_mode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"PAY_PER_REQUEST"&lt;/span&gt;
  &lt;span class="nx"&gt;hash_key&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"session_id"&lt;/span&gt;

  &lt;span class="nx"&gt;attribute&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"session_id"&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"S"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;replica&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;region_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-west-2"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Testing Your DR Plan
&lt;/h2&gt;

&lt;p&gt;This is where most organizations fail. A backup strategy without tested recovery procedures is not a strategy - it is a hope.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quarterly DR drill checklist:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## DR Drill - Q2 2026&lt;/span&gt;

&lt;span class="gu"&gt;### Pre-drill&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Schedule 4-hour maintenance window
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Notify stakeholders
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Document current production state (record counts, latest transactions)

&lt;span class="gu"&gt;### Database Recovery&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Restore RDS from latest cross-region snapshot
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Verify restore completed without errors
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Compare row counts against production
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Run application smoke tests against restored DB
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Record actual restore time: _____ minutes (RTO target: 30 min)
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Record data loss window: _____ minutes (RPO target: 5 min)

&lt;span class="gu"&gt;### Application Recovery&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Deploy application stack in DR region using Terraform
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Verify DNS failover configuration works
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Run full integration test suite
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Record actual application recovery time: _____ minutes

&lt;span class="gu"&gt;### S3 Data Verification&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Compare object counts between primary and DR bucket
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Download and verify random sample of objects
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Verify replication lag: _____ seconds

&lt;span class="gu"&gt;### Post-drill&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Document any issues encountered
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Update runbooks based on findings
&lt;span class="p"&gt;-&lt;/span&gt; [ ] File tickets for gaps identified
&lt;span class="p"&gt;-&lt;/span&gt; [ ] Update RTO/RPO estimates based on actual times
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Automate what you can. This script verifies your RDS backup is restorable:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="c"&gt;# dr-test-rds-restore.sh&lt;/span&gt;
&lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;-euo&lt;/span&gt; pipefail

&lt;span class="nv"&gt;SNAPSHOT_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws rds describe-db-snapshots &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; production-primary &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'reverse(sort_by(DBSnapshots, &amp;amp;SnapshotCreateTime))[0].DBSnapshotIdentifier'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Restoring from snapshot: &lt;/span&gt;&lt;span class="nv"&gt;$SNAPSHOT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

aws rds restore-db-instance-from-db-snapshot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; dr-test-restore &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-snapshot-identifier&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$SNAPSHOT_ID&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-class&lt;/span&gt; db.t3.medium &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-multi-az&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Waiting for instance to be available..."&lt;/span&gt;
aws rds &lt;span class="nb"&gt;wait &lt;/span&gt;db-instance-available &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; dr-test-restore

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Running verification queries..."&lt;/span&gt;
&lt;span class="nv"&gt;ENDPOINT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws rds describe-db-instances &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; dr-test-restore &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'DBInstances[0].Endpoint.Address'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; text&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;RESTORED_COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;psql &lt;span class="nt"&gt;-h&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ENDPOINT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; dbadmin &lt;span class="nt"&gt;-d&lt;/span&gt; production &lt;span class="nt"&gt;-tAc&lt;/span&gt; &lt;span class="s2"&gt;"SELECT count(*) FROM users"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nv"&gt;PROD_COUNT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;psql &lt;span class="nt"&gt;-h&lt;/span&gt; prod-endpoint &lt;span class="nt"&gt;-U&lt;/span&gt; dbadmin &lt;span class="nt"&gt;-d&lt;/span&gt; production &lt;span class="nt"&gt;-tAc&lt;/span&gt; &lt;span class="s2"&gt;"SELECT count(*) FROM users"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Production users: &lt;/span&gt;&lt;span class="nv"&gt;$PROD_COUNT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Restored users: &lt;/span&gt;&lt;span class="nv"&gt;$RESTORED_COUNT&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# Cleanup&lt;/span&gt;
aws rds delete-db-instance &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--db-instance-identifier&lt;/span&gt; dr-test-restore &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--skip-final-snapshot&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common DR Mistakes to Avoid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Backing up data but not configuration.&lt;/strong&gt; Your database backup is useless if you cannot recreate the VPC, security groups, IAM roles, and application configuration needed to run your service. Infrastructure as code (Terraform, CloudFormation) is a DR requirement, not a nice-to-have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Not encrypting backups.&lt;/strong&gt; Unencrypted backups in S3 are a security liability. Use KMS encryption on all backup vaults and S3 buckets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Using the same KMS key for primary and DR.&lt;/strong&gt; If your primary region's KMS key becomes unavailable, you cannot decrypt backups in the DR region. Use separate KMS keys in each region, or use multi-region KMS keys.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Ignoring DNS TTL.&lt;/strong&gt; If you need to failover DNS to a DR region, a 24-hour TTL means clients will keep hitting the dead primary for up to 24 hours. Set TTLs to 60 seconds for any record that might need to change during DR.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. No communication plan.&lt;/strong&gt; When production goes down, people panic. Document who gets notified, how (Slack, PagerDuty, phone), and who has the authority to initiate DR procedures. Practice this too.&lt;/p&gt;

&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Building a disaster recovery strategy that actually works when you need it requires planning, implementation, and regular testing. At &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;, we help startups and SMBs implement production-grade backup and DR infrastructure - starting at $2,999/mo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free 15-minute consultation&lt;/a&gt; to discuss your backup and disaster recovery needs.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>disasterrecovery</category>
      <category>devops</category>
      <category>backup</category>
    </item>
    <item>
      <title>ArgoCD GitOps Deployment Guide: App-of-Apps and Progressive Delivery</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Thu, 30 Apr 2026 13:47:36 +0000</pubDate>
      <link>https://dev.to/instadevops/argocd-gitops-deployment-guide-app-of-apps-and-progressive-delivery-7c7</link>
      <guid>https://dev.to/instadevops/argocd-gitops-deployment-guide-app-of-apps-and-progressive-delivery-7c7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;GitOps is the practice of using Git as the single source of truth for your infrastructure and application configuration. ArgoCD is the most widely adopted GitOps operator for Kubernetes, and for good reason - it watches your Git repositories and automatically reconciles your cluster state to match what is defined in your manifests.&lt;/p&gt;

&lt;p&gt;But installing ArgoCD is the easy part. The hard part is structuring your repositories, managing multi-environment deployments, implementing progressive delivery, and setting up proper RBAC so your platform team does not become a bottleneck. This guide covers all of that with production-tested patterns.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing and Configuring ArgoCD
&lt;/h2&gt;

&lt;p&gt;Start with a production-ready ArgoCD installation using the HA manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create namespace&lt;/span&gt;
kubectl create namespace argocd

&lt;span class="c"&gt;# Install ArgoCD HA (recommended for production)&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; argocd &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/ha/install.yaml

&lt;span class="c"&gt;# Wait for all pods to be ready&lt;/span&gt;
kubectl &lt;span class="nb"&gt;wait&lt;/span&gt; &lt;span class="nt"&gt;--for&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;condition&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ready pod &lt;span class="nt"&gt;-l&lt;/span&gt; app.kubernetes.io/part-of&lt;span class="o"&gt;=&lt;/span&gt;argocd &lt;span class="nt"&gt;-n&lt;/span&gt; argocd &lt;span class="nt"&gt;--timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;300s

&lt;span class="c"&gt;# Get initial admin password&lt;/span&gt;
kubectl &lt;span class="nt"&gt;-n&lt;/span&gt; argocd get secret argocd-initial-admin-secret &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"{.data.password}"&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expose ArgoCD via an Ingress (assuming you have an ingress controller and cert-manager):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd-server&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/ssl-passthrough&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true"&lt;/span&gt;
    &lt;span class="na"&gt;nginx.ingress.kubernetes.io/backend-protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTTPS"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;argocd.yourcompany.com&lt;/span&gt;
      &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd-tls&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd.yourcompany.com&lt;/span&gt;
      &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
            &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd-server&lt;/span&gt;
                &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
                  &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure ArgoCD to connect to your Git repositories. For private repos, use SSH deploy keys:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;argocd repo add git@github.com:yourorg/k8s-manifests.git &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ssh-private-key-path&lt;/span&gt; ~/.ssh/argocd_deploy_key
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The App-of-Apps Pattern
&lt;/h2&gt;

&lt;p&gt;The app-of-apps pattern is the standard way to manage multiple ArgoCD applications declaratively. Instead of manually creating each Application resource through the UI or CLI, you define a single root Application that points to a directory of Application manifests.&lt;/p&gt;

&lt;p&gt;Repository structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;k8s-manifests/
├── apps/                    # Root app-of-apps directory
│   ├── api.yaml             # Application manifest for API service
│   ├── frontend.yaml        # Application manifest for frontend
│   ├── worker.yaml          # Application manifest for worker
│   ├── redis.yaml           # Application manifest for Redis
│   └── monitoring.yaml      # Application manifest for monitoring stack
├── services/
│   ├── api/
│   │   ├── base/
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   └── kustomization.yaml
│   │   └── overlays/
│   │       ├── staging/
│   │       │   └── kustomization.yaml
│   │       └── production/
│   │           └── kustomization.yaml
│   ├── frontend/
│   │   └── ...
│   └── worker/
│       └── ...
└── infrastructure/
    ├── redis/
    └── monitoring/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The root application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# root-app.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;root&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;git@github.com:yourorg/k8s-manifests.git&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An individual app manifest within the &lt;code&gt;apps/&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# apps/api.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
  &lt;span class="na"&gt;finalizers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;resources-finalizer.argocd.argoproj.io&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;project&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
  &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;repoURL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;git@github.com:yourorg/k8s-manifests.git&lt;/span&gt;
    &lt;span class="na"&gt;targetRevision&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;services/api/overlays/production&lt;/span&gt;
  &lt;span class="na"&gt;destination&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
    &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;syncPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;automated&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;prune&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;selfHeal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;syncOptions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;CreateNamespace=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you add a new service, you just add a new YAML file to the &lt;code&gt;apps/&lt;/code&gt; directory and push to Git. ArgoCD picks it up automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sync Waves and Resource Ordering
&lt;/h2&gt;

&lt;p&gt;Sync waves control the order in which ArgoCD applies resources. This is critical when you have dependencies - you need namespaces before deployments, CRDs before custom resources, and databases before applications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Wave -1: Namespaces and CRDs first&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-1"&lt;/span&gt;

&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# Wave 0: Infrastructure (databases, caches)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;redis&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ... Redis application config&lt;/span&gt;

&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# Wave 1: Shared services (service mesh, secrets)&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;external-secrets&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ... External Secrets Operator config&lt;/span&gt;

&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="c1"&gt;# Wave 2: Application services&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Application&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/sync-wave&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2"&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# ... API application config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combine sync waves with resource hooks for even finer control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Run database migration before deploying new version&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;batch/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;db-migrate&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/hook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PreSync&lt;/span&gt;
    &lt;span class="na"&gt;argocd.argoproj.io/hook-delete-policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;BeforeHookCreation&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;migrate&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yourorg/api:latest&lt;/span&gt;
          &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;migrate.js"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;restartPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Never&lt;/span&gt;
  &lt;span class="na"&gt;backoffLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Progressive Delivery with Argo Rollouts
&lt;/h2&gt;

&lt;p&gt;ArgoCD handles syncing manifests to your cluster, but it does not manage how traffic shifts to new versions. That is where Argo Rollouts comes in. It replaces the standard Kubernetes Deployment with a Rollout resource that supports canary and blue-green deployment strategies.&lt;/p&gt;

&lt;p&gt;Install Argo Rollouts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create namespace argo-rollouts
kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; argo-rollouts &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A canary rollout with automated analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Rollout&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yourorg/api:v2.1.0&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;250m&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;256Mi&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
  &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;canary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;canaryService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-canary&lt;/span&gt;
      &lt;span class="na"&gt;stableService&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-stable&lt;/span&gt;
      &lt;span class="na"&gt;trafficRouting&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;nginx&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;stableIngress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api-ingress&lt;/span&gt;
      &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5m&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;analysis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;templates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;templateName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success-rate&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5m&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;analysis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;templates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;templateName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success-rate&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;60&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pause&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;5m&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;setWeight&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;100&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The analysis template that gates each promotion step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AnalysisTemplate&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success-rate&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;metrics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;success-rate&lt;/span&gt;
      &lt;span class="na"&gt;interval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;60s&lt;/span&gt;
      &lt;span class="na"&gt;count&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;successCondition&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;result[0] &amp;gt;= &lt;/span&gt;&lt;span class="m"&gt;0.99&lt;/span&gt;
      &lt;span class="na"&gt;failureLimit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;prometheus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http://prometheus.monitoring:9090&lt;/span&gt;
          &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(http_requests_total{app="api",status=~"2.."}[5m]))&lt;/span&gt;
            &lt;span class="s"&gt;/&lt;/span&gt;
            &lt;span class="s"&gt;sum(rate(http_requests_total{app="api"}[5m]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the success rate drops below 99% during any analysis phase, Argo Rollouts automatically rolls back to the stable version. No human intervention required at 3 AM.&lt;/p&gt;

&lt;h2&gt;
  
  
  RBAC and Multi-Tenancy
&lt;/h2&gt;

&lt;p&gt;For teams with multiple projects or environments, ArgoCD's RBAC system controls who can see and sync what. Define projects to create boundaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argoproj.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AppProject&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;team-payments&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payments&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;team&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;applications"&lt;/span&gt;
  &lt;span class="na"&gt;sourceRepos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;git@github.com:yourorg/payments-*'&lt;/span&gt;
  &lt;span class="na"&gt;destinations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;payments-*'&lt;/span&gt;
      &lt;span class="na"&gt;server&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://kubernetes.default.svc&lt;/span&gt;
  &lt;span class="na"&gt;clusterResourceWhitelist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Namespace&lt;/span&gt;
  &lt;span class="na"&gt;namespaceResourceWhitelist&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;
  &lt;span class="na"&gt;roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;developer&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Payments&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;team&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;developers"&lt;/span&gt;
      &lt;span class="na"&gt;policies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;p, proj:team-payments:developer, applications, get, team-payments/*, allow&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;p, proj:team-payments:developer, applications, sync, team-payments/*, allow&lt;/span&gt;
      &lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;payments-team&lt;/span&gt;  &lt;span class="c1"&gt;# Maps to SSO group&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure SSO integration (Dex with GitHub example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# argocd-cm ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ConfigMap&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd-cm&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;argocd&lt;/span&gt;
&lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://argocd.yourcompany.com&lt;/span&gt;
  &lt;span class="na"&gt;dex.config&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;connectors:&lt;/span&gt;
      &lt;span class="s"&gt;- type: github&lt;/span&gt;
        &lt;span class="s"&gt;id: github&lt;/span&gt;
        &lt;span class="s"&gt;name: GitHub&lt;/span&gt;
        &lt;span class="s"&gt;config:&lt;/span&gt;
          &lt;span class="s"&gt;clientID: $dex.github.clientID&lt;/span&gt;
          &lt;span class="s"&gt;clientSecret: $dex.github.clientSecret&lt;/span&gt;
          &lt;span class="s"&gt;orgs:&lt;/span&gt;
            &lt;span class="s"&gt;- name: yourorg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Repository Structure Best Practices
&lt;/h2&gt;

&lt;p&gt;After working with dozens of ArgoCD deployments, here are the patterns that hold up:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Separate app manifests from app source code.&lt;/strong&gt; Keep your Kubernetes manifests in a dedicated repository, not alongside your application code. This gives you independent versioning, cleaner git history, and prevents application CI from triggering ArgoCD syncs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Kustomize overlays for environments.&lt;/strong&gt; Do not duplicate manifests for staging and production. Use a base with overlays:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# services/api/overlays/production/kustomization.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kustomize.config.k8s.io/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Kustomization&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;../../base&lt;/span&gt;
&lt;span class="na"&gt;patches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;patch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
      &lt;span class="s"&gt;- op: replace&lt;/span&gt;
        &lt;span class="s"&gt;path: /spec/replicas&lt;/span&gt;
        &lt;span class="s"&gt;value: 5&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
&lt;span class="na"&gt;images&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;yourorg/api&lt;/span&gt;
    &lt;span class="na"&gt;newTag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v2.1.0&lt;/span&gt;   &lt;span class="c1"&gt;# Updated by CI pipeline&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Automate image tag updates.&lt;/strong&gt; Your application CI pipeline should update the image tag in the manifests repo after a successful build. Use &lt;code&gt;kustomize edit set image&lt;/code&gt; or a tool like ArgoCD Image Updater:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In your application CI pipeline (GitHub Actions example)&lt;/span&gt;
- name: Update manifest repo
  run: |
    git clone git@github.com:yourorg/k8s-manifests.git
    &lt;span class="nb"&gt;cd &lt;/span&gt;k8s-manifests/services/api/overlays/production
    kustomize edit &lt;span class="nb"&gt;set &lt;/span&gt;image yourorg/api&lt;span class="o"&gt;=&lt;/span&gt;yourorg/api:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="p"&gt;{ github.sha &lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
    git add &lt;span class="nb"&gt;.&lt;/span&gt;
    git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Update api image to &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="p"&gt;{ github.sha &lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;
    git push
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Implementing GitOps with ArgoCD properly - from repository structure to progressive delivery to RBAC - takes experience and planning. At &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;, we help startups and SMBs set up production-grade Kubernetes infrastructure and deployment pipelines - starting at $2,999/mo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free 15-minute consultation&lt;/a&gt; to discuss your Kubernetes and deployment challenges.&lt;/p&gt;

</description>
      <category>argocd</category>
      <category>gitops</category>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>Elasticsearch Operations Guide: Cluster Sizing and Performance Tuning</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Wed, 29 Apr 2026 13:47:33 +0000</pubDate>
      <link>https://dev.to/instadevops/elasticsearch-operations-guide-cluster-sizing-and-performance-tuning-361j</link>
      <guid>https://dev.to/instadevops/elasticsearch-operations-guide-cluster-sizing-and-performance-tuning-361j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Elasticsearch is one of those tools that is easy to set up but notoriously difficult to operate at scale. A basic single-node cluster handles your first million documents fine, then things start breaking: slow queries, disk pressure alerts, unbalanced shards, and the dreaded cluster yellow or red status at 3 AM.&lt;/p&gt;

&lt;p&gt;This guide covers the operational side of Elasticsearch - the things you learn after running it in production for years. We will go through cluster sizing, index lifecycle management (ILM), shard allocation strategies, snapshot and restore procedures, and performance tuning techniques that actually matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cluster Sizing and Hardware Planning
&lt;/h2&gt;

&lt;p&gt;The most common mistake in Elasticsearch sizing is starting with too few nodes and too little memory. Here are the numbers that matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory:&lt;/strong&gt; Elasticsearch lives and dies by its JVM heap and filesystem cache. The rule of thumb: give Elasticsearch 50% of available RAM as JVM heap (capped at 31GB to stay under compressed oops threshold), and leave the other 50% for the OS filesystem cache.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# jvm.options&lt;/span&gt;
&lt;span class="s"&gt;-Xms16g&lt;/span&gt;
&lt;span class="s"&gt;-Xmx16g&lt;/span&gt;
&lt;span class="c1"&gt;# Never set heap above 31g - you lose compressed ordinary object pointers&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;For a production logging cluster handling ~100GB/day of logs:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Nodes&lt;/th&gt;
&lt;th&gt;CPU&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Master-eligible&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;2 vCPU&lt;/td&gt;
&lt;td&gt;8 GB&lt;/td&gt;
&lt;td&gt;50 GB SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data (hot)&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;8 vCPU&lt;/td&gt;
&lt;td&gt;64 GB&lt;/td&gt;
&lt;td&gt;2 TB NVMe SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data (warm)&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4 vCPU&lt;/td&gt;
&lt;td&gt;32 GB&lt;/td&gt;
&lt;td&gt;8 TB HDD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coordinating&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;4 vCPU&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;100 GB SSD&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Dedicated master nodes are non-negotiable&lt;/strong&gt; for any production cluster. Master nodes handle cluster state, shard allocation, and index creation. If your master nodes are also serving queries and indexing data, cluster instability follows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# elasticsearch.yml - dedicated master node&lt;/span&gt;
&lt;span class="na"&gt;node.roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;master&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;cluster.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-logs&lt;/span&gt;
&lt;span class="na"&gt;node.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;master-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# elasticsearch.yml - hot data node&lt;/span&gt;
&lt;span class="na"&gt;node.roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;data_hot&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;data_content&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;ingest&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;node.attr.data_tier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hot&lt;/span&gt;
&lt;span class="na"&gt;cluster.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-logs&lt;/span&gt;
&lt;span class="na"&gt;node.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hot-data-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;AWS-specific sizing:&lt;/strong&gt; On EC2, use &lt;code&gt;r6g&lt;/code&gt; instances for data nodes (memory-optimized, Graviton for cost savings) and &lt;code&gt;m6g&lt;/code&gt; for master/coordinating nodes. Use gp3 EBS volumes - they offer 3000 baseline IOPS regardless of volume size, which is more cost-effective than gp2 for most Elasticsearch workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Index Lifecycle Management (ILM)
&lt;/h2&gt;

&lt;p&gt;ILM automates the progression of indices through hot, warm, cold, and delete phases. Without ILM, you end up with hundreds of indices eating resources forever, or someone writes a fragile cron job to delete old indices.&lt;/p&gt;

&lt;p&gt;Here is a production ILM policy for application logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;_ilm/policy/logs-lifecycle&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"policy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"phases"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"hot"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"min_age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0ms"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"rollover"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"max_primary_shard_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"50gb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"max_age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1d"&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"set_priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"warm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"min_age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"shrink"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"number_of_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"forcemerge"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"max_num_segments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"allocate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"require"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warm"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"set_priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"min_age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"30d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"allocate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"require"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
              &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cold"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"set_priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"priority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"delete"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"min_age"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"90d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"actions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"delete"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the policy to an index template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;_index_template/logs-template&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"index_patterns"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"logs-*"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"template"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"settings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"number_of_shards"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"number_of_replicas"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"index.lifecycle.name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"logs-lifecycle"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"index.lifecycle.rollover_alias"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"logs-write"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"index.routing.allocation.require.data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hot"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key ILM tuning tips:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rollover on shard size (50GB max primary shard) rather than document count - shard size is what affects performance&lt;/li&gt;
&lt;li&gt;Force merge to 1 segment in the warm phase to reclaim disk and improve query performance on read-only indices&lt;/li&gt;
&lt;li&gt;Shrink indices in warm phase if you started with multiple shards for write throughput but no longer need them&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Shard Allocation and Rebalancing
&lt;/h2&gt;

&lt;p&gt;Sharding is where most Elasticsearch performance problems originate. The guidelines are straightforward but frequently violated:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 1: Keep shards between 10GB and 50GB.&lt;/strong&gt; Shards under 1GB waste resources on overhead. Shards over 50GB make recovery slow and rebalancing painful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rule 2: Aim for fewer than 20 shards per GB of heap.&lt;/strong&gt; A node with 16GB heap should host no more than ~300 shards. Each shard consumes memory for metadata, field data, and segment info regardless of whether it is being queried.&lt;/p&gt;

&lt;p&gt;Check your current shard distribution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Shard count per node&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; localhost:9200/_cat/allocation?v

&lt;span class="c"&gt;# Shards sorted by size&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; localhost:9200/_cat/shards?v&lt;span class="se"&gt;\&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;s&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;store:desc

&lt;span class="c"&gt;# Identify hot spots - nodes with too many shards&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; localhost:9200/_cat/nodes?v&lt;span class="se"&gt;\&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;name,heap.percent,disk.used_percent,shards
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When shards are unevenly distributed, you can adjust allocation awareness:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;_cluster/settings&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"persistent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cluster.routing.allocation.awareness.attributes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"zone"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cluster.routing.allocation.awareness.force.zone.values"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1a,us-east-1b,us-east-1c"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ensures that primary and replica shards land in different availability zones, giving you zone-failure resilience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Snapshot and Restore
&lt;/h2&gt;

&lt;p&gt;Snapshots are your safety net. Without them, a bad mapping change, accidental index deletion, or cluster corruption means data loss. Set up automated snapshots to S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;_snapshot/s&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="err"&gt;-backup&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"settings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"bucket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-es-snapshots"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"base_path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"production"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_snapshot_bytes_per_sec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"200mb"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_restore_bytes_per_sec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"200mb"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a snapshot lifecycle management (SLM) policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;_slm/policy/nightly-backup&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"schedule"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0 0 2 * * ?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;nightly-snap-{now/d}&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"repository"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3-backup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"config"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"indices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"logs-*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"metrics-*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"app-*"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ignore_unavailable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"include_global_state"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"retention"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"expire_after"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"30d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"min_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"max_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test your restores regularly.&lt;/strong&gt; A snapshot you have never restored is a backup you do not have. Schedule a quarterly restore drill:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Restore a specific index to verify backup integrity&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"localhost:9200/_snapshot/s3-backup/nightly-snap-2026.04.01/_restore"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/json'&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{
    "indices": "logs-2026.03.31",
    "rename_pattern": "(.+)",
    "rename_replacement": "restored-$1"
  }'&lt;/span&gt;

&lt;span class="c"&gt;# Verify document count matches&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"localhost:9200/restored-logs-2026.03.31/_count"&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"localhost:9200/logs-2026.03.31/_count"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance Tuning
&lt;/h2&gt;

&lt;p&gt;Here are the performance levers that have the biggest real-world impact, ranked by effort-to-benefit ratio:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Use bulk indexing, always.&lt;/strong&gt; Single-document indexing is orders of magnitude slower. Aim for bulk requests of 5-15MB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Bad: indexing one document at a time&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"localhost:9200/logs/_doc"&lt;/span&gt; &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"message": "log entry"}'&lt;/span&gt;

&lt;span class="c"&gt;# Good: bulk indexing&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"localhost:9200/_bulk"&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s1"&gt;'Content-Type: application/x-ndjson'&lt;/span&gt; &lt;span class="nt"&gt;--data-binary&lt;/span&gt; @bulk-data.ndjson
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Tune refresh interval for write-heavy indices.&lt;/strong&gt; The default 1-second refresh creates a new Lucene segment every second, which is expensive. For logging use cases where near-real-time search is not critical:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;logs-write/_settings&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"refresh_interval"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"30s"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Use doc_values and avoid fielddata.&lt;/strong&gt; For aggregations and sorting on text fields, use the &lt;code&gt;keyword&lt;/code&gt; type or multi-field mapping. Never enable fielddata on analyzed text fields - it loads the entire inverted index into heap memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Profile slow queries.&lt;/strong&gt; Enable slow log to catch queries that degrade cluster performance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;PUT&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;logs-*/_settings&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"index.search.slowlog.threshold.query.warn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"5s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"index.search.slowlog.threshold.query.info"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2s"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"index.search.slowlog.threshold.fetch.warn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1s"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Disable swapping entirely.&lt;/strong&gt; Swapped JVM heap is a death sentence for Elasticsearch performance. Configure this at the OS level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/sysctl.conf&lt;/span&gt;
vm.swappiness &lt;span class="o"&gt;=&lt;/span&gt; 1

&lt;span class="c"&gt;# elasticsearch.yml&lt;/span&gt;
bootstrap.memory_lock: &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitoring and Alerting
&lt;/h2&gt;

&lt;p&gt;At minimum, monitor these metrics (Prometheus with elasticsearch_exporter, Datadog, or Elastic's own monitoring):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster health&lt;/strong&gt; - yellow means missing replicas, red means missing primaries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;JVM heap pressure&lt;/strong&gt; - alert at 75%, panic at 85%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disk watermark&lt;/strong&gt; - ES stops allocating at 85% (low), marks read-only at 95% (flood)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thread pool rejections&lt;/strong&gt; - search, write, and bulk thread pool rejections indicate saturation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GC pause time&lt;/strong&gt; - young GC pauses over 100ms or old GC pauses over 1s are problems
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Quick health check script&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; localhost:9200/_cluster/health | jq &lt;span class="s1"&gt;'{
  status,
  number_of_nodes,
  active_primary_shards,
  relocating_shards,
  unassigned_shards
}'&lt;/span&gt;

&lt;span class="c"&gt;# Thread pool rejections&lt;/span&gt;
curl &lt;span class="nt"&gt;-s&lt;/span&gt; localhost:9200/_cat/thread_pool/search,write?v&lt;span class="se"&gt;\&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt;h&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;node_name,name,active,rejected,completed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Running Elasticsearch in production requires ongoing attention to cluster health, capacity planning, and performance optimization. At &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;, we help startups and SMBs manage complex infrastructure like Elasticsearch clusters without the cost of a full-time hire - starting at $2,999/mo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free 15-minute consultation&lt;/a&gt; to discuss your Elasticsearch challenges.&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>devops</category>
      <category>performance</category>
      <category>operations</category>
    </item>
    <item>
      <title>Git Branching Strategies: Trunk-Based vs GitFlow vs GitHub Flow</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:47:28 +0000</pubDate>
      <link>https://dev.to/instadevops/git-branching-strategies-trunk-based-vs-gitflow-vs-github-flow-4ejb</link>
      <guid>https://dev.to/instadevops/git-branching-strategies-trunk-based-vs-gitflow-vs-github-flow-4ejb</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Your Git branching strategy directly impacts how fast you ship code, how often deployments break, and how much time your team wastes on merge conflicts. Yet most teams pick a branching model once and never revisit it, even as their team size, deployment frequency, and product complexity change.&lt;/p&gt;

&lt;p&gt;In this guide, we will compare the three most popular branching strategies - trunk-based development, GitFlow, and GitHub Flow - with honest assessments of when each works and when each falls apart. We will also cover feature flags, release management, and monorepo-specific considerations that most branching guides skip entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trunk-Based Development
&lt;/h2&gt;

&lt;p&gt;Trunk-based development (TBD) means everyone commits to a single branch (main/trunk), either directly or through very short-lived feature branches that last no more than a day or two. Google, Meta, and Netflix all practice trunk-based development at massive scale.&lt;/p&gt;

&lt;p&gt;The core rules are simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The main branch is always deployable&lt;/li&gt;
&lt;li&gt;Feature branches live less than 24-48 hours&lt;/li&gt;
&lt;li&gt;Broken code is hidden behind feature flags, not long-lived branches&lt;/li&gt;
&lt;li&gt;CI runs on every commit and blocks merges on failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a typical workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start work&lt;/span&gt;
git checkout main
git pull origin main
git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature/add-retry-logic

&lt;span class="c"&gt;# Work for a few hours, then push&lt;/span&gt;
git add &lt;span class="nt"&gt;-A&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add exponential backoff to API client"&lt;/span&gt;
git push origin feature/add-retry-logic

&lt;span class="c"&gt;# Open PR, get review, merge same day&lt;/span&gt;
&lt;span class="c"&gt;# Delete branch immediately after merge&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When it works:&lt;/strong&gt; Teams with strong CI/CD pipelines, good test coverage (70%+ meaningful coverage), and engineers comfortable with feature flags. Ideal for continuous deployment where you ship multiple times per day.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it breaks down:&lt;/strong&gt; Teams without automated testing, regulated industries requiring formal release sign-off, or junior-heavy teams where code review bottlenecks cause branches to pile up.&lt;/p&gt;

&lt;p&gt;The biggest misconception about trunk-based development is that it means no code review. You absolutely still review code - you just review smaller, more frequent changes rather than massive PRs that touch 50 files.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitFlow
&lt;/h2&gt;

&lt;p&gt;GitFlow uses long-lived branches to manage releases: &lt;code&gt;main&lt;/code&gt; holds production code, &lt;code&gt;develop&lt;/code&gt; holds the next release, &lt;code&gt;feature/*&lt;/code&gt; branches come off develop, &lt;code&gt;release/*&lt;/code&gt; branches stabilize a release, and &lt;code&gt;hotfix/*&lt;/code&gt; branches patch production emergencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;main ─────────────────────────────────────────────
  │                               ▲          ▲
  │                               │          │
  └──▶ develop ──────────────────────────────────
         │         ▲       │            ▲
         │         │       │            │
         └── feature/x ──┘  └── feature/y ──┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A typical GitFlow release cycle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start a feature&lt;/span&gt;
git checkout develop
git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; feature/new-dashboard

&lt;span class="c"&gt;# ... work for days/weeks ...&lt;/span&gt;

&lt;span class="c"&gt;# Merge back to develop&lt;/span&gt;
git checkout develop
git merge &lt;span class="nt"&gt;--no-ff&lt;/span&gt; feature/new-dashboard

&lt;span class="c"&gt;# Cut a release&lt;/span&gt;
git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; release/2.4.0 develop

&lt;span class="c"&gt;# Stabilize (bug fixes only)&lt;/span&gt;
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Fix date formatting in dashboard"&lt;/span&gt;

&lt;span class="c"&gt;# Finalize release&lt;/span&gt;
git checkout main
git merge &lt;span class="nt"&gt;--no-ff&lt;/span&gt; release/2.4.0
git tag &lt;span class="nt"&gt;-a&lt;/span&gt; v2.4.0 &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Release 2.4.0"&lt;/span&gt;
git checkout develop
git merge &lt;span class="nt"&gt;--no-ff&lt;/span&gt; release/2.4.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When it works:&lt;/strong&gt; Teams shipping versioned software (desktop apps, mobile apps, SDKs, on-prem installations), teams with scheduled release cycles (bi-weekly, monthly), and teams that need to maintain multiple release versions simultaneously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it breaks down:&lt;/strong&gt; Web applications deploying continuously, small teams (under 5 engineers) where the overhead is not justified, and teams that end up with feature branches living for weeks, creating merge hell.&lt;/p&gt;

&lt;p&gt;The honest truth about GitFlow in 2026: it was designed for a world where shipping software meant burning a CD. If you deploy a web application, GitFlow is almost certainly more process than you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  GitHub Flow
&lt;/h2&gt;

&lt;p&gt;GitHub Flow is the middle ground. You have one long-lived branch (main), create feature branches off it, open pull requests, and merge back to main after review. Main is always deployable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create feature branch from main&lt;/span&gt;
git checkout main
git pull origin main
git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; add-health-check-endpoint

&lt;span class="c"&gt;# Work, commit, push&lt;/span&gt;
git add src/health.ts
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add /health endpoint with dependency checks"&lt;/span&gt;
git push origin add-health-check-endpoint

&lt;span class="c"&gt;# Open PR on GitHub&lt;/span&gt;
gh &lt;span class="nb"&gt;pr &lt;/span&gt;create &lt;span class="nt"&gt;--title&lt;/span&gt; &lt;span class="s2"&gt;"Add health check endpoint"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--body&lt;/span&gt; &lt;span class="s2"&gt;"Adds /health endpoint that checks DB and Redis connectivity"&lt;/span&gt;

&lt;span class="c"&gt;# After review and CI passes, merge via GitHub UI&lt;/span&gt;
&lt;span class="c"&gt;# Deploy automatically from main&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;When it works:&lt;/strong&gt; Most web application teams. It is simple enough for small teams, scales well to medium teams (5-30 engineers), and pairs naturally with CI/CD pipelines that deploy on every merge to main.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When it breaks down:&lt;/strong&gt; When feature branches live too long (the same problem as GitFlow but less structured), or when you need to maintain multiple production versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Feature Flags as a Branching Strategy
&lt;/h2&gt;

&lt;p&gt;Feature flags are not a replacement for branching - they are a complement that makes trunk-based development and GitHub Flow viable for complex features that take weeks to build.&lt;/p&gt;

&lt;p&gt;Here is a practical implementation using environment variables for simple flags and a feature flag service for anything more complex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simple approach: environment-based flags&lt;/span&gt;
&lt;span class="c1"&gt;# docker-compose.yml&lt;/span&gt;
&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;FEATURE_NEW_BILLING=false&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;FEATURE_V2_SEARCH=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Application code&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getSearchResults&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;FEATURE_V2_SEARCH&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;searchV2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// New implementation&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;searchV1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;    &lt;span class="c1"&gt;// Existing implementation&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production systems, use a proper feature flag service (LaunchDarkly, Unleash, Flagsmith, or even a simple Redis-backed solution):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Using Unleash (open source)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;initialize&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;unleash-client&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;unleash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;https://unleash.yourcompany.com/api&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;api-service&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;customHeaders&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;UNLEASH_TOKEN&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/api/dashboard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;unleash&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isEnabled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;new-dashboard&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getNewDashboard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;getLegacyDashboard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The critical rule with feature flags: &lt;strong&gt;clean them up&lt;/strong&gt;. Every flag should have an expiration date. Dead flags are technical debt that makes code harder to reason about. Set a calendar reminder to remove each flag within 2 weeks of full rollout.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release Management Patterns
&lt;/h2&gt;

&lt;p&gt;Regardless of your branching strategy, you need a release management approach. Here are three patterns ranked by deployment frequency:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 1: Continuous Deployment (trunk-based or GitHub Flow)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Every merge to main deploys to production automatically. Requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full automated tests&lt;/li&gt;
&lt;li&gt;Feature flags for incomplete work&lt;/li&gt;
&lt;li&gt;Canary or blue-green deployments&lt;/li&gt;
&lt;li&gt;Fast rollback capability
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Actions - deploy on every merge to main&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci &amp;amp;&amp;amp; npm test&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm run build&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to production&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;aws ecs update-service \&lt;/span&gt;
            &lt;span class="s"&gt;--cluster production \&lt;/span&gt;
            &lt;span class="s"&gt;--service api \&lt;/span&gt;
            &lt;span class="s"&gt;--force-new-deployment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pattern 2: Release Trains (GitHub Flow with scheduled deploys)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Merge to main anytime, but deploy on a schedule (daily, twice a week). Good for teams building confidence in their pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 3: Versioned Releases (GitFlow)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cut release branches, stabilize, tag, and deploy. Necessary for software with multiple live versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monorepo Branching Considerations
&lt;/h2&gt;

&lt;p&gt;Monorepos add complexity to any branching strategy because a single branch may contain changes to multiple services. Key considerations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;monorepo/
├── services/
│   ├── api/
│   ├── worker/
│   └── frontend/
├── packages/
│   ├── shared-types/
│   └── utils/
└── infrastructure/
    ├── terraform/
    └── k8s/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use path-based CI triggers so changes to one service do not trigger builds for unrelated services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# GitHub Actions with path filters&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;API CI&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;services/api/**'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;packages/shared-types/**'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;packages/utils/**'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cd services/api &amp;amp;&amp;amp; npm ci &amp;amp;&amp;amp; npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For monorepos, trunk-based development or GitHub Flow works far better than GitFlow. Long-lived feature branches in a monorepo are a recipe for merge conflicts across service boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Choosing the Right Strategy for Your Team
&lt;/h2&gt;

&lt;p&gt;Here is a decision framework:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Factor&lt;/th&gt;
&lt;th&gt;Trunk-Based&lt;/th&gt;
&lt;th&gt;GitHub Flow&lt;/th&gt;
&lt;th&gt;GitFlow&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Team size&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;2-30&lt;/td&gt;
&lt;td&gt;5-50+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deploy frequency&lt;/td&gt;
&lt;td&gt;Multiple/day&lt;/td&gt;
&lt;td&gt;Daily-weekly&lt;/td&gt;
&lt;td&gt;Weekly-monthly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Test coverage&lt;/td&gt;
&lt;td&gt;High (required)&lt;/td&gt;
&lt;td&gt;Medium-high&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature flag maturity&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low-medium&lt;/td&gt;
&lt;td&gt;Not needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multiple live versions&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Regulatory requirements&lt;/td&gt;
&lt;td&gt;Works with flags&lt;/td&gt;
&lt;td&gt;Works fine&lt;/td&gt;
&lt;td&gt;Natural fit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For most startups and SMBs deploying web applications:&lt;/strong&gt; Start with GitHub Flow. It has the lowest overhead and the fewest ways to shoot yourself in the foot. Move to trunk-based development when your CI/CD pipeline and test coverage are mature enough to support it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid GitFlow unless&lt;/strong&gt; you are shipping versioned software (mobile apps, SDKs, on-prem products) or you have regulatory requirements that demand formal release branches.&lt;/p&gt;

&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Setting up the right branching strategy, CI/CD pipeline, and deployment workflow is just the beginning. At &lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;, we help startups and SMBs build production-grade DevOps infrastructure at a fraction of the cost of a full-time hire - starting at $2,999/mo.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free 15-minute consultation&lt;/a&gt; to discuss your team's workflow and deployment challenges.&lt;/p&gt;

</description>
      <category>git</category>
      <category>devops</category>
      <category>workflow</category>
      <category>cicd</category>
    </item>
  </channel>
</rss>
