<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: InstaDevOps</title>
    <description>The latest articles on DEV Community by InstaDevOps (@instadevops).</description>
    <link>https://dev.to/instadevops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2952358%2F474aa7f4-09cf-409d-891e-cfc4f071d18a.png</url>
      <title>DEV Community: InstaDevOps</title>
      <link>https://dev.to/instadevops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/instadevops"/>
    <language>en</language>
    <item>
      <title>PostgreSQL High Availability: A Practical Guide with Patroni and pgBouncer</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Sun, 19 Apr 2026 13:46:56 +0000</pubDate>
      <link>https://dev.to/instadevops/postgresql-high-availability-a-practical-guide-with-patroni-and-pgbouncer-2p10</link>
      <guid>https://dev.to/instadevops/postgresql-high-availability-a-practical-guide-with-patroni-and-pgbouncer-2p10</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Your database is the last thing that should go down. Yet setting up PostgreSQL for high availability remains one of the most under-documented areas in the DevOps world. Most teams run a single Postgres instance until the day it dies, then scramble to set up replication while their users stare at error pages.&lt;/p&gt;

&lt;p&gt;This guide walks through a production-ready PostgreSQL HA setup using the tools that have become the industry standard: &lt;strong&gt;Patroni&lt;/strong&gt; for automated failover, &lt;strong&gt;pgBouncer&lt;/strong&gt; for connection pooling, &lt;strong&gt;streaming replication&lt;/strong&gt; for data redundancy, and &lt;strong&gt;pgBackRest&lt;/strong&gt; for backups. We'll cover the architecture, actual configuration files, failure scenarios, and the operational runbooks you'll need.&lt;/p&gt;

&lt;p&gt;If you're running Postgres in production and don't have automated failover, this article is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;A production HA Postgres setup has four layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌─────────────┐
                    │   HAProxy   │  (Virtual IP / DNS)
                    │  Port 5432  │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │            │            │
        ┌─────┴─────┐ ┌───┴───┐ ┌─────┴─────┐
        │ pgBouncer  │ │pgBouncer│ │ pgBouncer  │
        │  Node 1    │ │ Node 2 │ │  Node 3    │
        └─────┬──────┘ └───┬───┘ └─────┬──────┘
              │            │            │
        ┌─────┴─────┐ ┌───┴───┐ ┌─────┴─────┐
        │ Patroni +  │ │Patroni│ │ Patroni +  │
        │ Postgres   │ │  + PG  │ │ Postgres   │
        │ (Primary)  │ │(Replica)│ │ (Replica)  │
        └────────────┘ └────────┘ └────────────┘
              │            │            │
        ┌─────────────────────────────────────┐
        │           etcd Cluster (3 nodes)     │
        └─────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Patroni&lt;/strong&gt; manages the Postgres instances and handles leader election via a distributed consensus store (etcd, ZooKeeper, or Consul). When the primary fails, Patroni automatically promotes a replica and reconfigures the others to follow the new primary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pgBouncer&lt;/strong&gt; sits between your application and Postgres, pooling connections to avoid the expensive fork-per-connection model. Each Postgres node runs its own pgBouncer instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HAProxy&lt;/strong&gt; (or a DNS-based solution) routes traffic to the current primary. Patroni exposes a REST API that HAProxy uses for health checks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Patroni
&lt;/h2&gt;

&lt;p&gt;Install Patroni on each Postgres node:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip3 &lt;span class="nb"&gt;install &lt;/span&gt;patroni[etcd]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's a production Patroni configuration (&lt;code&gt;/etc/patroni/patroni.yml&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres-cluster&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pg-node-1&lt;/span&gt;

&lt;span class="na"&gt;restapi&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;listen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:8008&lt;/span&gt;
  &lt;span class="na"&gt;connect_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.1.10:8008&lt;/span&gt;

&lt;span class="na"&gt;etcd3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.1.20:2379,10.0.1.21:2379,10.0.1.22:2379&lt;/span&gt;

&lt;span class="na"&gt;bootstrap&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dcs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;ttl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;
    &lt;span class="na"&gt;loop_wait&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;retry_timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;maximum_lag_on_failover&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1048576&lt;/span&gt;  &lt;span class="c1"&gt;# 1MB&lt;/span&gt;
    &lt;span class="na"&gt;postgresql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;use_pg_rewind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;use_slots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;wal_level&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;replica&lt;/span&gt;
        &lt;span class="na"&gt;hot_standby&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on"&lt;/span&gt;
        &lt;span class="na"&gt;max_wal_senders&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
        &lt;span class="na"&gt;max_replication_slots&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
        &lt;span class="na"&gt;wal_log_hints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on"&lt;/span&gt;
        &lt;span class="na"&gt;archive_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;on"&lt;/span&gt;
        &lt;span class="na"&gt;archive_command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;&amp;gt;&lt;/span&gt;
          &lt;span class="s"&gt;pgbackrest --stanza=main archive-push %p&lt;/span&gt;

  &lt;span class="na"&gt;initdb&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;encoding&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;UTF8&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;data-checksums&lt;/span&gt;

  &lt;span class="na"&gt;pg_hba&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;host replication replicator 10.0.1.0/24 md5&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;host all all 10.0.1.0/24 md5&lt;/span&gt;

&lt;span class="na"&gt;postgresql&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;listen&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0:5432&lt;/span&gt;
  &lt;span class="na"&gt;connect_address&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10.0.1.10:5432&lt;/span&gt;
  &lt;span class="na"&gt;data_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/lib/postgresql/16/main&lt;/span&gt;
  &lt;span class="na"&gt;bin_dir&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/usr/lib/postgresql/16/bin&lt;/span&gt;

  &lt;span class="na"&gt;authentication&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;superuser&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
      &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-secure-password"&lt;/span&gt;
    &lt;span class="na"&gt;replication&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;replicator&lt;/span&gt;
      &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-replication-password"&lt;/span&gt;

  &lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;shared_buffers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;4GB&lt;/span&gt;
    &lt;span class="na"&gt;effective_cache_size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;12GB&lt;/span&gt;
    &lt;span class="na"&gt;work_mem&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;64MB&lt;/span&gt;
    &lt;span class="na"&gt;maintenance_work_mem&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512MB&lt;/span&gt;
    &lt;span class="na"&gt;max_connections&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;200&lt;/span&gt;
    &lt;span class="na"&gt;checkpoint_completion_target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.9&lt;/span&gt;
    &lt;span class="na"&gt;wal_buffers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;64MB&lt;/span&gt;
    &lt;span class="na"&gt;random_page_cost&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.1&lt;/span&gt;  &lt;span class="c1"&gt;# SSD storage&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key settings to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;maximum_lag_on_failover&lt;/code&gt;&lt;/strong&gt;: A replica won't be promoted if it's more than 1MB behind the primary. This prevents data loss but means failover might not happen if all replicas are lagging heavily.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;use_pg_rewind&lt;/code&gt;&lt;/strong&gt;: Allows the old primary to rejoin the cluster as a replica after failover without a full base backup. This is critical for fast recovery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;use_slots&lt;/code&gt;&lt;/strong&gt;: Replication slots prevent the primary from removing WAL segments that replicas still need, avoiding replication breakage during network issues.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start Patroni as a systemd service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;patroni
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start patroni
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check cluster status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;patronictl &lt;span class="nt"&gt;-c&lt;/span&gt; /etc/patroni/patroni.yml list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+ Cluster: postgres-cluster ----+---------+---------+----+-----------+
| Member    | Host       | Role    | State   | TL | Lag in MB |
+-----------+------------+---------+---------+----+-----------+
| pg-node-1 | 10.0.1.10  | Leader  | running |  1 |           |
| pg-node-2 | 10.0.1.11  | Replica | running |  1 |       0.0 |
| pg-node-3 | 10.0.1.12  | Replica | running |  1 |       0.0 |
+-----------+------------+---------+---------+----+-----------+
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Configuring pgBouncer for Connection Pooling
&lt;/h2&gt;

&lt;p&gt;Without connection pooling, each application connection forks a new Postgres backend process (~10MB RAM each). With 500 application connections, that's 5GB just for connection overhead. pgBouncer solves this by multiplexing thousands of application connections over a small pool of actual database connections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;; /etc/pgbouncer/pgbouncer.ini
&lt;/span&gt;
&lt;span class="nn"&gt;[databases]&lt;/span&gt;
&lt;span class="py"&gt;myapp&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;host=127.0.0.1 port=5432 dbname=myapp&lt;/span&gt;

&lt;span class="nn"&gt;[pgbouncer]&lt;/span&gt;
&lt;span class="py"&gt;listen_addr&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="py"&gt;listen_port&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;6432&lt;/span&gt;
&lt;span class="py"&gt;auth_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;md5&lt;/span&gt;
&lt;span class="py"&gt;auth_file&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;/etc/pgbouncer/userlist.txt&lt;/span&gt;

&lt;span class="c"&gt;; Pool settings
&lt;/span&gt;&lt;span class="py"&gt;pool_mode&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;transaction&lt;/span&gt;
&lt;span class="py"&gt;default_pool_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;25&lt;/span&gt;
&lt;span class="py"&gt;min_pool_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;5&lt;/span&gt;
&lt;span class="py"&gt;reserve_pool_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;5&lt;/span&gt;
&lt;span class="py"&gt;reserve_pool_timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;3&lt;/span&gt;

&lt;span class="c"&gt;; Connection limits
&lt;/span&gt;&lt;span class="py"&gt;max_client_conn&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1000&lt;/span&gt;
&lt;span class="py"&gt;max_db_connections&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;100&lt;/span&gt;

&lt;span class="c"&gt;; Timeouts
&lt;/span&gt;&lt;span class="py"&gt;server_idle_timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;300&lt;/span&gt;
&lt;span class="py"&gt;client_idle_timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0&lt;/span&gt;
&lt;span class="py"&gt;query_timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;60&lt;/span&gt;
&lt;span class="py"&gt;query_wait_timeout&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;30&lt;/span&gt;

&lt;span class="c"&gt;; Logging
&lt;/span&gt;&lt;span class="py"&gt;log_connections&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;log_disconnections&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;log_pooler_errors&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="py"&gt;stats_period&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;60&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Critical decision: pool_mode.&lt;/strong&gt; Transaction mode is the right default for most applications - it assigns a server connection for the duration of a transaction, then returns it to the pool. Session mode holds the connection for the entire client session (defeats the purpose of pooling). Statement mode is the most aggressive but breaks multi-statement transactions and prepared statements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; Transaction pooling breaks &lt;code&gt;SET&lt;/code&gt; commands, prepared statements that span transactions, and &lt;code&gt;LISTEN/NOTIFY&lt;/code&gt;. If your application uses these, you'll need session pooling for those specific connections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Backup Strategy with pgBackRest
&lt;/h2&gt;

&lt;p&gt;pgBackRest is the gold standard for Postgres backups. It supports full, differential, and incremental backups with parallel compression and encryption.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;; /etc/pgbackrest/pgbackrest.conf
&lt;/span&gt;
&lt;span class="nn"&gt;[global]&lt;/span&gt;
&lt;span class="py"&gt;repo1-type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;
&lt;span class="py"&gt;repo1-s3-bucket&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your-pg-backups&lt;/span&gt;
&lt;span class="py"&gt;repo1-s3-region&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;
&lt;span class="py"&gt;repo1-s3-endpoint&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;s3.amazonaws.com&lt;/span&gt;
&lt;span class="py"&gt;repo1-path&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/backups&lt;/span&gt;
&lt;span class="py"&gt;repo1-retention-full&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;
&lt;span class="py"&gt;repo1-retention-diff&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;14&lt;/span&gt;
&lt;span class="py"&gt;repo1-cipher-type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;aes-256-cbc&lt;/span&gt;
&lt;span class="py"&gt;repo1-cipher-pass&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;your-encryption-passphrase&lt;/span&gt;

&lt;span class="py"&gt;process-max&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;4&lt;/span&gt;
&lt;span class="py"&gt;compress-type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;zst&lt;/span&gt;
&lt;span class="py"&gt;compress-level&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;

&lt;span class="nn"&gt;[main]&lt;/span&gt;
&lt;span class="py"&gt;pg1-path&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/var/lib/postgresql/16/main&lt;/span&gt;
&lt;span class="py"&gt;pg1-port&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5432&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the stanza and run your first backup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Initialize the backup repository&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main stanza-create

&lt;span class="c"&gt;# Full backup&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;full backup

&lt;span class="c"&gt;# Differential backup (only changed since last full)&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;diff backup

&lt;span class="c"&gt;# Incremental backup (only changed since last backup of any type)&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;incr backup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up a cron schedule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Full backup every Sunday at 1 AM&lt;/span&gt;
0 1 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; 0 pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;full backup

&lt;span class="c"&gt;# Differential backup every day at 1 AM (except Sunday)&lt;/span&gt;
0 1 &lt;span class="k"&gt;*&lt;/span&gt; &lt;span class="k"&gt;*&lt;/span&gt; 1-6 pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;diff backup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Point-in-time recovery&lt;/strong&gt; is where pgBackRest shines. Because Patroni is already archiving WAL segments to pgBackRest, you can restore to any point in time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2026-04-09 14:30:00"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target-action&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;promote restore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Failover Testing and Operational Runbooks
&lt;/h2&gt;

&lt;p&gt;Setting up HA means nothing if you don't test it. Here are the failure scenarios you must validate:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Primary node crash:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Simulate primary failure&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl stop patroni  &lt;span class="c"&gt;# on the primary node&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected: Within 30 seconds (TTL), Patroni promotes a replica. HAProxy health checks detect the change and route traffic to the new primary. Application sees a brief connection error, retries, and reconnects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Network partition:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Isolate primary from etcd&lt;/span&gt;
iptables &lt;span class="nt"&gt;-A&lt;/span&gt; OUTPUT &lt;span class="nt"&gt;-p&lt;/span&gt; tcp &lt;span class="nt"&gt;--dport&lt;/span&gt; 2379 &lt;span class="nt"&gt;-j&lt;/span&gt; DROP
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected: Primary can't reach etcd, Patroni demotes it to read-only. A replica with etcd access gets promoted. When network heals, the old primary uses pg_rewind to rejoin as a replica.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Switchover (planned maintenance):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Graceful switchover to a specific replica&lt;/span&gt;
patronictl &lt;span class="nt"&gt;-c&lt;/span&gt; /etc/patroni/patroni.yml switchover &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--master&lt;/span&gt; pg-node-1 &lt;span class="nt"&gt;--candidate&lt;/span&gt; pg-node-2 &lt;span class="nt"&gt;--force&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected: Zero or near-zero downtime. The primary finishes in-flight transactions, transfers leadership, and becomes a replica.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Full cluster restore from backup:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Stop all Patroni instances&lt;/span&gt;
&lt;span class="c"&gt;# On the new primary node:&lt;/span&gt;
pgbackrest &lt;span class="nt"&gt;--stanza&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;time&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--target&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2026-04-09 10:00:00"&lt;/span&gt; restore

&lt;span class="c"&gt;# Start Patroni - it will bootstrap from the restored data&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl start patroni
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run these tests quarterly. Document the results. Your future self at 3 AM will thank you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Your HA Cluster
&lt;/h2&gt;

&lt;p&gt;You need visibility into replication lag, connection pool utilization, and failover events. Key metrics to track:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Replication lag (run on primary)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;client_addr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sent_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;write_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replay_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;pg_wal_lsn_diff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sent_lsn&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;replay_lsn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;replay_lag_bytes&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_replication&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Connection counts&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- Long-running queries&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;pid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query_start&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;state&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'idle'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;pg_stat_activity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query_start&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;interval&lt;/span&gt; &lt;span class="s1"&gt;'30 seconds'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For Prometheus monitoring, use &lt;code&gt;postgres_exporter&lt;/code&gt; and set up alerts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Prometheus alerting rules&lt;/span&gt;
&lt;span class="na"&gt;groups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;postgres&lt;/span&gt;
    &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PostgresReplicationLag&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pg_replication_lag_seconds &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;30&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;5m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;warning&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Replication&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lag&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$value&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}s&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.instance&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PostgresConnectionPoolExhausted&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pgbouncer_pools_server_active / pgbouncer_pools_server_max &amp;gt; &lt;/span&gt;&lt;span class="m"&gt;0.9&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;2m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pgBouncer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pool&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;gt;90%&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;utilized&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;on&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;$labels.instance&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;}}"&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;PatroniNoLeader&lt;/span&gt;
        &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;patroni_master == &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;
        &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1m&lt;/span&gt;
        &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
        &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Patroni&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;leader&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;detected&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cluster"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Building a production-grade PostgreSQL HA cluster takes time and expertise - and maintaining it takes even more. At &lt;strong&gt;InstaDevOps&lt;/strong&gt;, we design, deploy, and manage database infrastructure alongside your entire DevOps stack, starting at &lt;strong&gt;$2,999/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Book a free 15-minute consultation to discuss your database reliability needs: &lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;https://calendly.com/instadevops/15min&lt;/a&gt;&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>database</category>
      <category>devops</category>
      <category>highavailability</category>
    </item>
    <item>
      <title>Nginx vs Traefik vs Caddy: Which Reverse Proxy Should You Pick?</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Sat, 18 Apr 2026 13:46:53 +0000</pubDate>
      <link>https://dev.to/instadevops/nginx-vs-traefik-vs-caddy-which-reverse-proxy-should-you-pick-3ekl</link>
      <guid>https://dev.to/instadevops/nginx-vs-traefik-vs-caddy-which-reverse-proxy-should-you-pick-3ekl</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Choosing a reverse proxy is one of those decisions that seems trivial until you're debugging TLS certificate renewals at 2 AM or trying to figure out why your service discovery isn't propagating. If you're running microservices, your reverse proxy is the front door to everything - and picking the wrong one can cost you weeks of configuration headaches.&lt;/p&gt;

&lt;p&gt;Here, we'll do a hands-on comparison of the three most popular reverse proxies in the modern DevOps stack: &lt;strong&gt;Nginx&lt;/strong&gt;, &lt;strong&gt;Traefik&lt;/strong&gt;, and &lt;strong&gt;Caddy&lt;/strong&gt;. We'll look at configuration complexity, automatic TLS, service discovery, performance characteristics, and when each one makes sense.&lt;/p&gt;

&lt;p&gt;No fluff. Just real configs, real trade-offs, and a clear recommendation based on your use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Contenders at a Glance
&lt;/h2&gt;

&lt;p&gt;Before we dive in, here's a quick snapshot:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Nginx&lt;/th&gt;
&lt;th&gt;Traefik&lt;/th&gt;
&lt;th&gt;Caddy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;td&gt;Go&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config Format&lt;/td&gt;
&lt;td&gt;Custom DSL&lt;/td&gt;
&lt;td&gt;YAML/TOML/Labels&lt;/td&gt;
&lt;td&gt;Caddyfile/JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auto TLS&lt;/td&gt;
&lt;td&gt;Via Certbot&lt;/td&gt;
&lt;td&gt;Built-in (Let's Encrypt)&lt;/td&gt;
&lt;td&gt;Built-in (Let's Encrypt + ZeroSSL)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service Discovery&lt;/td&gt;
&lt;td&gt;Manual/scripted&lt;/td&gt;
&lt;td&gt;Native (Docker, K8s, Consul)&lt;/td&gt;
&lt;td&gt;Limited (on_demand_tls)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hot Reload&lt;/td&gt;
&lt;td&gt;&lt;code&gt;nginx -s reload&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dashboard&lt;/td&gt;
&lt;td&gt;Nginx Plus only&lt;/td&gt;
&lt;td&gt;Built-in&lt;/td&gt;
&lt;td&gt;Via API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Footprint&lt;/td&gt;
&lt;td&gt;~2-5 MB&lt;/td&gt;
&lt;td&gt;~30-50 MB&lt;/td&gt;
&lt;td&gt;~20-40 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nginx has been the default choice for over a decade. Traefik was built specifically for dynamic microservice environments. Caddy focuses on simplicity and security-by-default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration Complexity
&lt;/h2&gt;

&lt;p&gt;This is where the differences hit you first. Let's set up a simple reverse proxy with TLS for two services: an API on port 3000 and a frontend on port 8080.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server_name&lt;/span&gt; &lt;span class="s"&gt;api.example.com&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;301&lt;/span&gt; &lt;span class="s"&gt;https://&lt;/span&gt;&lt;span class="nv"&gt;$server_name$request_uri&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt; &lt;span class="s"&gt;ssl&lt;/span&gt; &lt;span class="s"&gt;http2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server_name&lt;/span&gt; &lt;span class="s"&gt;api.example.com&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;ssl_certificate&lt;/span&gt; &lt;span class="n"&gt;/etc/letsencrypt/live/api.example.com/fullchain.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;ssl_certificate_key&lt;/span&gt; &lt;span class="n"&gt;/etc/letsencrypt/live/api.example.com/privkey.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;ssl_protocols&lt;/span&gt; &lt;span class="s"&gt;TLSv1.2&lt;/span&gt; &lt;span class="s"&gt;TLSv1.3&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;ssl_ciphers&lt;/span&gt; &lt;span class="s"&gt;HIGH:!aNULL:!MD5&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://localhost:3000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Real-IP&lt;/span&gt; &lt;span class="nv"&gt;$remote_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Forwarded-For&lt;/span&gt; &lt;span class="nv"&gt;$proxy_add_x_forwarded_for&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Forwarded-Proto&lt;/span&gt; &lt;span class="nv"&gt;$scheme&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;listen&lt;/span&gt; &lt;span class="mi"&gt;443&lt;/span&gt; &lt;span class="s"&gt;ssl&lt;/span&gt; &lt;span class="s"&gt;http2&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server_name&lt;/span&gt; &lt;span class="s"&gt;app.example.com&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;ssl_certificate&lt;/span&gt; &lt;span class="n"&gt;/etc/letsencrypt/live/app.example.com/fullchain.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;ssl_certificate_key&lt;/span&gt; &lt;span class="n"&gt;/etc/letsencrypt/live/app.example.com/privkey.pem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://localhost:8080&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;Host&lt;/span&gt; &lt;span class="nv"&gt;$host&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="kn"&gt;proxy_set_header&lt;/span&gt; &lt;span class="s"&gt;X-Real-IP&lt;/span&gt; &lt;span class="nv"&gt;$remote_addr&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus you need Certbot installed, cron jobs for renewal, and you'll reload Nginx manually after any change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traefik (docker-compose labels):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;traefik&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;traefik:v3.1&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--providers.docker=true"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--entrypoints.web.address=:80"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--entrypoints.websecure.address=:443"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--certificatesresolvers.letsencrypt.acme.email=you@example.com"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--certificatesresolvers.letsencrypt.acme.storage=/acme/acme.json"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;--certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web"&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;80:80"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;443:443"&lt;/span&gt;
    &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;/var/run/docker.sock:/var/run/docker.sock:ro&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;./acme:/acme&lt;/span&gt;

  &lt;span class="na"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api:latest&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.routers.api.rule=Host(`api.example.com`)"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.routers.api.tls.certresolver=letsencrypt"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.services.api.loadbalancer.server.port=3000"&lt;/span&gt;

  &lt;span class="na"&gt;frontend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-frontend:latest&lt;/span&gt;
    &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.routers.frontend.rule=Host(`app.example.com`)"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.routers.frontend.tls.certresolver=letsencrypt"&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.services.frontend.loadbalancer.server.port=8080"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Caddy (Caddyfile):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight vcl"&gt;&lt;code&gt;api.example.com &lt;span class="p"&gt;{&lt;/span&gt;
    reverse_proxy localhost:&lt;span class="mi"&gt;3000&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

app.example.com &lt;span class="p"&gt;{&lt;/span&gt;
    reverse_proxy localhost:&lt;span class="mi"&gt;8080&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Caddy automatically provisions TLS certificates, redirects HTTP to HTTPS, and enables HTTP/2. Two lines per service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automatic TLS and Certificate Management
&lt;/h2&gt;

&lt;p&gt;This is Caddy's killer feature and Traefik's strong suit. With Nginx, you're on your own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caddy&lt;/strong&gt; obtains certificates from Let's Encrypt and ZeroSSL automatically. It handles renewal, OCSP stapling, and even falls back to a secondary CA if the primary is down. You literally do nothing - just put a domain name in the Caddyfile and it works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traefik&lt;/strong&gt; supports Let's Encrypt via ACME with HTTP-01 and DNS-01 challenges. DNS-01 is crucial if you're behind a load balancer or need wildcard certificates. Configuration looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;certificatesResolvers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;letsencrypt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;acme&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;email&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ops@example.com&lt;/span&gt;
      &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/acme/acme.json&lt;/span&gt;
      &lt;span class="na"&gt;dnsChallenge&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cloudflare&lt;/span&gt;
        &lt;span class="na"&gt;resolvers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.1.1.1:53"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Nginx&lt;/strong&gt; requires Certbot or a similar ACME client as a separate process. You need to configure it, set up renewal hooks, and make sure the reload happens after certificate updates. It works, but it's another moving part to maintain.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Verdict:&lt;/strong&gt; If automatic TLS matters to you (it should), Caddy wins on simplicity and Traefik wins on flexibility. Nginx is manageable but requires more operational overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Service Discovery and Dynamic Configuration
&lt;/h2&gt;

&lt;p&gt;This is where Traefik pulls ahead for container-heavy environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traefik&lt;/strong&gt; natively watches Docker labels, Kubernetes Ingress resources, Consul catalogs, etcd, and more. When a new container spins up with the right labels, Traefik picks it up immediately - no restart, no reload, no config file change. This is incredibly powerful for dynamic environments where services scale up and down frequently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Caddy&lt;/strong&gt; has an API for dynamic configuration but doesn't natively integrate with container orchestrators the way Traefik does. You can use &lt;code&gt;caddy-docker-proxy&lt;/code&gt; (community plugin) to get Docker label support, but it's not first-party.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nginx&lt;/strong&gt; is static by default. You change a file, you reload. Tools like &lt;code&gt;consul-template&lt;/code&gt; or &lt;code&gt;nginx-proxy&lt;/code&gt; (jwilder) exist to generate configs dynamically, but they're external tools bolted on.&lt;/p&gt;

&lt;p&gt;For Kubernetes specifically, all three have Ingress controller implementations, but Traefik's is the most mature for dynamic environments and ships as the default ingress controller in K3s.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance Benchmarks
&lt;/h2&gt;

&lt;p&gt;Raw performance matters less than you think for most workloads, but here are realistic numbers from common benchmarking setups:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requests per second (simple HTTP proxy, wrk with 100 connections):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Proxy&lt;/th&gt;
&lt;th&gt;RPS (HTTP)&lt;/th&gt;
&lt;th&gt;RPS (HTTPS)&lt;/th&gt;
&lt;th&gt;p99 Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nginx&lt;/td&gt;
&lt;td&gt;~45,000&lt;/td&gt;
&lt;td&gt;~32,000&lt;/td&gt;
&lt;td&gt;2.1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traefik&lt;/td&gt;
&lt;td&gt;~28,000&lt;/td&gt;
&lt;td&gt;~22,000&lt;/td&gt;
&lt;td&gt;3.8 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Caddy&lt;/td&gt;
&lt;td&gt;~30,000&lt;/td&gt;
&lt;td&gt;~24,000&lt;/td&gt;
&lt;td&gt;3.2 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Nginx is faster, period. It's written in C with an event-driven architecture that has been optimized for 20+ years. If you're proxying tens of thousands of requests per second and every millisecond counts, Nginx is the right choice.&lt;/p&gt;

&lt;p&gt;But for most applications - handling hundreds to low thousands of RPS - the difference between 3ms and 2ms latency at p99 is irrelevant. You'll hit bottlenecks in your application code, database, or network long before the proxy becomes the issue.&lt;/p&gt;

&lt;h2&gt;
  
  
  Middleware and Advanced Features
&lt;/h2&gt;

&lt;p&gt;All three support common middleware patterns, but the implementation differs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limiting:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight nginx"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Nginx&lt;/span&gt;
&lt;span class="k"&gt;http&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kn"&gt;limit_req_zone&lt;/span&gt; &lt;span class="nv"&gt;$binary_remote_addr&lt;/span&gt; &lt;span class="s"&gt;zone=api:10m&lt;/span&gt; &lt;span class="s"&gt;rate=10r/s&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kn"&gt;server&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="kn"&gt;location&lt;/span&gt; &lt;span class="n"&gt;/api/&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="kn"&gt;limit_req&lt;/span&gt; &lt;span class="s"&gt;zone=api&lt;/span&gt; &lt;span class="s"&gt;burst=20&lt;/span&gt; &lt;span class="s"&gt;nodelay&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
            &lt;span class="kn"&gt;proxy_pass&lt;/span&gt; &lt;span class="s"&gt;http://backend&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Traefik (middleware via labels)&lt;/span&gt;
&lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.middlewares.ratelimit.ratelimit.average=10"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.middlewares.ratelimit.ratelimit.burst=20"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;traefik.http.routers.api.middlewares=ratelimit"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight vcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Caddy&lt;/span&gt;
api.example.com &lt;span class="p"&gt;{&lt;/span&gt;
    rate_limit &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;remote.ip&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;r&lt;span class="o"&gt;/&lt;/span&gt;s
    reverse_proxy localhost:&lt;span class="mi"&gt;3000&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Authentication, compression, headers, CORS&lt;/strong&gt; - all three handle these well. Traefik has the concept of middleware chains that you can compose and reuse across routers, which is elegant for complex setups. Nginx has the most extensive third-party module ecosystem. Caddy has a growing plugin system but fewer options overall.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use Each
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Choose Nginx when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need maximum raw performance (high-traffic APIs, CDN edge nodes)&lt;/li&gt;
&lt;li&gt;You have a static infrastructure that doesn't change often&lt;/li&gt;
&lt;li&gt;Your team already knows Nginx configuration well&lt;/li&gt;
&lt;li&gt;You need advanced features like TCP/UDP proxying, custom Lua scripting, or GeoIP&lt;/li&gt;
&lt;li&gt;You're running on resource-constrained hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Traefik when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're running Docker or Kubernetes with frequently changing services&lt;/li&gt;
&lt;li&gt;You want automatic service discovery without config file management&lt;/li&gt;
&lt;li&gt;You need the built-in dashboard for visibility into routing&lt;/li&gt;
&lt;li&gt;You're using multiple orchestration platforms (Docker + Consul, for example)&lt;/li&gt;
&lt;li&gt;Your microservices scale up and down dynamically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Caddy when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You want the simplest possible setup with automatic HTTPS&lt;/li&gt;
&lt;li&gt;You're a small team that doesn't want to maintain proxy configuration&lt;/li&gt;
&lt;li&gt;You're running a handful of services that don't change often&lt;/li&gt;
&lt;li&gt;You want a single binary with no external dependencies&lt;/li&gt;
&lt;li&gt;You value security defaults (HTTPS everywhere, modern TLS settings out of the box)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Migration Tips
&lt;/h2&gt;

&lt;p&gt;If you're switching from Nginx to Traefik or Caddy, here's what to watch for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Upstream health checks&lt;/strong&gt; - Nginx does passive health checks by default; Traefik and Caddy have active health check options that behave differently. Test failover behavior.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Websocket proxying&lt;/strong&gt; - Nginx needs explicit &lt;code&gt;proxy_set_header Upgrade&lt;/code&gt; and &lt;code&gt;Connection&lt;/code&gt; headers. Caddy and Traefik handle websockets transparently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom error pages&lt;/strong&gt; - Each proxy handles error responses differently. Test your 502/503 pages after migration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logging format&lt;/strong&gt; - If you're parsing access logs with Loki/ELK, your log format will change. Update your parsers before switching.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Certificate storage&lt;/strong&gt; - If migrating to Traefik or Caddy, your existing Let's Encrypt certificates from Certbot won't transfer. The new proxy will obtain fresh certificates automatically, but be aware of Let's Encrypt rate limits (50 certificates per registered domain per week).&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Setting up and maintaining reverse proxies, load balancers, and TLS infrastructure is just one piece of the DevOps puzzle. At &lt;strong&gt;InstaDevOps&lt;/strong&gt;, we handle your entire infrastructure - from proxy configuration to CI/CD pipelines, monitoring, and Kubernetes management - starting at &lt;strong&gt;$2,999/month&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Book a free 15-minute consultation to discuss your infrastructure needs: &lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;https://calendly.com/instadevops/15min&lt;/a&gt;&lt;/p&gt;

</description>
      <category>nginx</category>
      <category>traefik</category>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Cloud Cost FinOps: Cut Your AWS Bill by 40% Without Sacrificing Performance</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Fri, 17 Apr 2026 13:46:50 +0000</pubDate>
      <link>https://dev.to/instadevops/cloud-cost-finops-cut-your-aws-bill-by-40-without-sacrificing-performance-39md</link>
      <guid>https://dev.to/instadevops/cloud-cost-finops-cut-your-aws-bill-by-40-without-sacrificing-performance-39md</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Cloud spending has a way of creeping up silently. You start with a few EC2 instances and an RDS database, and eighteen months later your AWS bill is five figures and nobody can explain exactly where the money goes. Engineering says they need all the resources. Finance says the cloud was supposed to be cheaper than on-premise. Both are frustrated.&lt;/p&gt;

&lt;p&gt;FinOps - short for Cloud Financial Operations - is the practice of bringing financial accountability to cloud spending. It is not about cutting costs to the bone. It is about making sure every dollar you spend on cloud infrastructure delivers value, and that your team can make informed trade-off decisions between cost, speed, and quality.&lt;/p&gt;

&lt;p&gt;This guide covers the practical FinOps strategies that consistently deliver the biggest savings on AWS: commitment discounts, spot instances, right-sizing, cost allocation, and the organizational practices that make cost optimization sustainable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Your Current Spend
&lt;/h2&gt;

&lt;p&gt;Before optimizing anything, you need visibility into where your money goes. Without data, you are guessing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enable Cost Allocation Tags
&lt;/h3&gt;

&lt;p&gt;Tags are the foundation of cloud cost management. Without them, your bill is one opaque number. With them, you can attribute costs to teams, services, environments, and projects.&lt;/p&gt;

&lt;p&gt;At minimum, enforce these tags on all resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Terraform - enforce tagging via default_tags&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"aws"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"eu-west-1"&lt;/span&gt;

  &lt;span class="nx"&gt;default_tags&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;Environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;environment&lt;/span&gt;      &lt;span class="c1"&gt;# production, staging, development&lt;/span&gt;
      &lt;span class="nx"&gt;Team&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;team_name&lt;/span&gt;        &lt;span class="c1"&gt;# payments, platform, data&lt;/span&gt;
      &lt;span class="nx"&gt;Service&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;service_name&lt;/span&gt;     &lt;span class="c1"&gt;# payment-api, user-service&lt;/span&gt;
      &lt;span class="nx"&gt;CostCenter&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cost_center&lt;/span&gt;      &lt;span class="c1"&gt;# maps to finance department codes&lt;/span&gt;
      &lt;span class="nx"&gt;ManagedBy&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Activate tags in the AWS Billing console under &lt;strong&gt;Cost Allocation Tags&lt;/strong&gt;. It takes 24 hours for newly activated tags to appear in Cost Explorer.&lt;/p&gt;

&lt;h3&gt;
  
  
  AWS Cost Explorer and CUR
&lt;/h3&gt;

&lt;p&gt;Cost Explorer gives you quick visibility. For deeper analysis, enable the &lt;strong&gt;Cost and Usage Report (CUR)&lt;/strong&gt; which exports detailed billing data to S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_cur_report_definition"&lt;/span&gt; &lt;span class="s2"&gt;"cost_report"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;report_name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"daily-cost-report"&lt;/span&gt;
  &lt;span class="nx"&gt;time_unit&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DAILY"&lt;/span&gt;
  &lt;span class="nx"&gt;format&lt;/span&gt;                     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Parquet"&lt;/span&gt;
  &lt;span class="nx"&gt;compression&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Parquet"&lt;/span&gt;
  &lt;span class="nx"&gt;additional_schema_elements&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"RESOURCES"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;s3_bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_s3_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;cost_reports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;s3_region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"eu-west-1"&lt;/span&gt;
  &lt;span class="nx"&gt;s3_prefix&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"cur"&lt;/span&gt;

  &lt;span class="nx"&gt;report_versioning&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"OVERWRITE_REPORT"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query the CUR data with Athena to answer questions like "How much did the payments team spend on RDS last quarter?"&lt;/p&gt;

&lt;h3&gt;
  
  
  Set Up Budget Alerts
&lt;/h3&gt;

&lt;p&gt;Never let a bill surprise you. Create budgets with alerts at 50%, 80%, and 100% thresholds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_budgets_budget"&lt;/span&gt; &lt;span class="s2"&gt;"monthly"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"monthly-total"&lt;/span&gt;
  &lt;span class="nx"&gt;budget_type&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"COST"&lt;/span&gt;
  &lt;span class="nx"&gt;limit_amount&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"5000"&lt;/span&gt;
  &lt;span class="nx"&gt;limit_unit&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"USD"&lt;/span&gt;
  &lt;span class="nx"&gt;time_unit&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"MONTHLY"&lt;/span&gt;

  &lt;span class="nx"&gt;notification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;comparison_operator&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GREATER_THAN"&lt;/span&gt;
    &lt;span class="nx"&gt;threshold&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="nx"&gt;threshold_type&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"PERCENTAGE"&lt;/span&gt;
    &lt;span class="nx"&gt;notification_type&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ACTUAL"&lt;/span&gt;
    &lt;span class="nx"&gt;subscriber_email_addresses&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"engineering@company.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;notification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;comparison_operator&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GREATER_THAN"&lt;/span&gt;
    &lt;span class="nx"&gt;threshold&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="nx"&gt;threshold_type&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"PERCENTAGE"&lt;/span&gt;
    &lt;span class="nx"&gt;notification_type&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FORECASTED"&lt;/span&gt;
    &lt;span class="nx"&gt;subscriber_email_addresses&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"engineering@company.com"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"finance@company.com"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Reserved Instances vs Savings Plans
&lt;/h2&gt;

&lt;p&gt;Commitment discounts are the single largest cost lever for most AWS accounts, typically saving 30-60% over on-demand pricing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Savings Plans (Recommended for Most Teams)
&lt;/h3&gt;

&lt;p&gt;Savings Plans offer flexibility that Reserved Instances lack. You commit to a dollar amount per hour of compute usage, and AWS applies the discount automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compute Savings Plans&lt;/strong&gt; cover EC2, Fargate, and Lambda across all regions and instance families. They offer the most flexibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;On-demand: $0.0416/hour (t3.medium, eu-west-1)
1-year Compute SP (no upfront): $0.0270/hour - 35% savings
3-year Compute SP (all upfront): $0.0166/hour - 60% savings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;EC2 Instance Savings Plans&lt;/strong&gt; are locked to a specific instance family and region but offer slightly deeper discounts.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Calculate Your Commitment
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Open Cost Explorer and view the last 3 months of EC2/Fargate spend&lt;/li&gt;
&lt;li&gt;Identify your baseline (the minimum spend that is consistent every day)&lt;/li&gt;
&lt;li&gt;Commit to 70-80% of that baseline with Savings Plans&lt;/li&gt;
&lt;li&gt;Cover the remaining variable usage with on-demand and spot
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example:
  Average daily compute spend: $200/day
  Minimum daily spend (baseline): $160/day
  Recommended SP commitment: $160 * 0.75 = $120/day = $5.00/hour

  Annual savings: $120/day * 365 * 0.35 = $15,330/year
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Common Mistake: Over-Committing
&lt;/h3&gt;

&lt;p&gt;Do not commit to 100% of current usage. Your architecture will change, services will be rewritten, and traffic patterns will shift. Commit conservatively and re-evaluate quarterly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Spot Instances: Up to 90% Savings
&lt;/h2&gt;

&lt;p&gt;Spot instances use AWS's spare capacity at steep discounts (60-90% off on-demand), but can be interrupted with 2 minutes notice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where Spot Works Well
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;CI/CD build runners&lt;/li&gt;
&lt;li&gt;Batch processing and data pipelines&lt;/li&gt;
&lt;li&gt;Dev/staging environments&lt;/li&gt;
&lt;li&gt;Stateless web workers behind a load balancer (with enough capacity diversity)&lt;/li&gt;
&lt;li&gt;Machine learning training jobs with checkpointing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Where Spot Does Not Work
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Single-instance databases&lt;/li&gt;
&lt;li&gt;Stateful services without replication&lt;/li&gt;
&lt;li&gt;Anything where a 2-minute shutdown causes data loss&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ECS Spot with Capacity Providers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_capacity_provider"&lt;/span&gt; &lt;span class="s2"&gt;"spot"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"spot-provider"&lt;/span&gt;

  &lt;span class="nx"&gt;auto_scaling_group_provider&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;auto_scaling_group_arn&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_autoscaling_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;arn&lt;/span&gt;
    &lt;span class="nx"&gt;managed_termination_protection&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ENABLED"&lt;/span&gt;

    &lt;span class="nx"&gt;managed_scaling&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;maximum_scaling_step_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
      &lt;span class="nx"&gt;minimum_scaling_step_size&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
      &lt;span class="nx"&gt;status&lt;/span&gt;                    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ENABLED"&lt;/span&gt;
      &lt;span class="nx"&gt;target_capacity&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_ecs_cluster_capacity_providers"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cluster_name&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_ecs_cluster&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;capacity_providers&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="nx"&gt;aws_ecs_capacity_provider&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"FARGATE_SPOT"&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;

  &lt;span class="nx"&gt;default_capacity_provider_strategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;base&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;          &lt;span class="c1"&gt;# 2 tasks always on regular Fargate&lt;/span&gt;
    &lt;span class="nx"&gt;weight&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="nx"&gt;capacity_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;default_capacity_provider_strategy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;weight&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;          &lt;span class="c1"&gt;# 3x weight on Fargate Spot&lt;/span&gt;
    &lt;span class="nx"&gt;capacity_provider&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"FARGATE_SPOT"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This runs a baseline of 2 tasks on regular Fargate and scales additional tasks on Fargate Spot at roughly 70% discount.&lt;/p&gt;

&lt;h2&gt;
  
  
  Right-Sizing: Stop Paying for Idle Resources
&lt;/h2&gt;

&lt;p&gt;Most organizations over-provision by 30-50%. Right-sizing means matching resource allocation to actual usage.&lt;/p&gt;

&lt;h3&gt;
  
  
  Identifying Oversized Instances
&lt;/h3&gt;

&lt;p&gt;Use AWS Compute Optimizer (free) or query CloudWatch directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check average CPU utilization for an instance over the last 2 weeks&lt;/span&gt;
aws cloudwatch get-metric-statistics &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; AWS/EC2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--metric-name&lt;/span&gt; CPUUtilization &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--dimensions&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;InstanceId,Value&lt;span class="o"&gt;=&lt;/span&gt;i-0123456789abcdef0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="nt"&gt;-v-14d&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-time&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; +%Y-%m-%dT%H:%M:%S&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--period&lt;/span&gt; 3600 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--statistics&lt;/span&gt; Average Maximum &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rules of thumb:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average CPU under 20%: downsize by one tier&lt;/li&gt;
&lt;li&gt;Average CPU under 10%: downsize by two tiers or consider Fargate&lt;/li&gt;
&lt;li&gt;Max CPU consistently under 40%: safe to downsize one tier&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Graviton Instances: 20% Better Price-Performance
&lt;/h3&gt;

&lt;p&gt;AWS Graviton (ARM) instances offer 20% better price-performance than equivalent x86 instances. If your application runs on Linux and does not depend on x86-specific binaries, switching is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;t3.medium  (x86): $0.0416/hour
t4g.medium (ARM): $0.0336/hour - 19% cheaper, 20% faster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For containerized workloads, build multi-arch images:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker buildx build &lt;span class="nt"&gt;--platform&lt;/span&gt; linux/amd64,linux/arm64 &lt;span class="nt"&gt;-t&lt;/span&gt; myapp:latest &lt;span class="nt"&gt;--push&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Storage and Data Transfer Optimization
&lt;/h2&gt;

&lt;h3&gt;
  
  
  S3 Lifecycle Policies
&lt;/h3&gt;

&lt;p&gt;Most S3 data is accessed frequently for a short period and then rarely again:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_s3_bucket_lifecycle_configuration"&lt;/span&gt; &lt;span class="s2"&gt;"logs"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;bucket&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_s3_bucket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;

  &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"archive-old-logs"&lt;/span&gt;
    &lt;span class="nx"&gt;status&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Enabled"&lt;/span&gt;

    &lt;span class="nx"&gt;transition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;days&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;
      &lt;span class="nx"&gt;storage_class&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"STANDARD_IA"&lt;/span&gt;        &lt;span class="c1"&gt;# 45% cheaper&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;transition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;days&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt;
      &lt;span class="nx"&gt;storage_class&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"GLACIER_IR"&lt;/span&gt;         &lt;span class="c1"&gt;# 68% cheaper, millisecond retrieval&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;transition&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;days&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;365&lt;/span&gt;
      &lt;span class="nx"&gt;storage_class&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"DEEP_ARCHIVE"&lt;/span&gt;       &lt;span class="c1"&gt;# 95% cheaper&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;expiration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;days&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;730&lt;/span&gt;  &lt;span class="c1"&gt;# Delete after 2 years&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EBS Volume Optimization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Switch gp2 volumes to gp3: same performance, 20% cheaper, and you can independently scale IOPS&lt;/li&gt;
&lt;li&gt;Delete unattached EBS volumes (check monthly)&lt;/li&gt;
&lt;li&gt;Use EBS snapshots for infrequently accessed data instead of keeping volumes running
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Find unattached EBS volumes&lt;/span&gt;
aws ec2 describe-volumes &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--filters&lt;/span&gt; &lt;span class="nv"&gt;Name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;status,Values&lt;span class="o"&gt;=&lt;/span&gt;available &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--query&lt;/span&gt; &lt;span class="s1"&gt;'Volumes[*].{ID:VolumeId,Size:Size,Created:CreateTime}'&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output&lt;/span&gt; table
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  NAT Gateway Costs
&lt;/h3&gt;

&lt;p&gt;NAT Gateways charge $0.045/hour ($32.85/month) plus $0.045/GB of data processed. For high-traffic workloads, this adds up fast.&lt;/p&gt;

&lt;p&gt;Alternatives:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;VPC Endpoints&lt;/strong&gt; for AWS services (S3, DynamoDB, ECR) - eliminates NAT data charges for AWS API calls&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NAT instances&lt;/strong&gt; on t4g.nano ($3.07/month) for dev/staging environments
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# S3 Gateway Endpoint (free, eliminates NAT charges for S3)&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.amazonaws.eu-west-1.s3"&lt;/span&gt;
  &lt;span class="nx"&gt;route_table_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_route_table&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# ECR Interface Endpoints&lt;/span&gt;
&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"ecr_api"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.amazonaws.eu-west-1.ecr.api"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_endpoint_type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Interface"&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private&lt;/span&gt;&lt;span class="p"&gt;[*].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;private_dns_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building a FinOps Culture
&lt;/h2&gt;

&lt;p&gt;Tools alone do not reduce cloud costs. You need organizational practices that create accountability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Weekly Cost Reviews
&lt;/h3&gt;

&lt;p&gt;Run a 15-minute weekly meeting reviewing:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Total spend vs budget&lt;/li&gt;
&lt;li&gt;Week-over-week change and top 3 drivers of increase&lt;/li&gt;
&lt;li&gt;Top 5 most expensive services&lt;/li&gt;
&lt;li&gt;Any anomalies or unexpected charges&lt;/li&gt;
&lt;li&gt;Action items from last week&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Team-Level Cost Dashboards
&lt;/h3&gt;

&lt;p&gt;Give each team visibility into their own spending (this is where tags become critical). When engineers can see that their service costs $2,400/month, they start caring about optimization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Decision Records for Cost
&lt;/h3&gt;

&lt;p&gt;For significant infrastructure decisions, document the cost implications:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# ADR-015: Switch payment-api from t3.xlarge to t4g.large&lt;/span&gt;

&lt;span class="gu"&gt;## Context&lt;/span&gt;
Payment API currently runs on 3x t3.xlarge ($0.1664/hr each).
Average CPU: 22%. Average memory: 35%.

&lt;span class="gu"&gt;## Decision&lt;/span&gt;
Migrate to 3x t4g.large (ARM, $0.0672/hr each).
CPU and memory are sufficient based on 30-day CloudWatch data.

&lt;span class="gu"&gt;## Cost Impact&lt;/span&gt;
Before: 3 &lt;span class="ge"&gt;* $0.1664 *&lt;/span&gt; 730 = $364.42/month
After:  3 &lt;span class="ge"&gt;* $0.0672 *&lt;/span&gt; 730 = $147.17/month
Savings: $217.25/month ($2,607/year)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Quick Wins Checklist
&lt;/h2&gt;

&lt;p&gt;If you are just starting with FinOps, tackle these in order. Each one can be done in a day or less:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Delete unused resources&lt;/strong&gt; - unattached EBS volumes, old snapshots, idle load balancers, stopped instances with EBS volumes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable S3 Intelligent-Tiering&lt;/strong&gt; on buckets with unpredictable access patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Switch gp2 to gp3&lt;/strong&gt; EBS volumes (20% savings, no performance loss)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add VPC endpoints&lt;/strong&gt; for S3 and DynamoDB (eliminates NAT data charges)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purchase Savings Plans&lt;/strong&gt; covering 70% of your steady-state compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set up budget alerts&lt;/strong&gt; so you know before the bill arrives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable Compute Optimizer&lt;/strong&gt; recommendations (free)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Review and clean up old ECR images, CloudWatch log groups, and Lambda versions&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Cloud cost optimization is an ongoing practice, not a one-time project. At &lt;strong&gt;&lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;&lt;/strong&gt;, we help startups and growing companies implement FinOps practices that reduce cloud spending by 30-50% while maintaining the performance and reliability your users expect.&lt;/p&gt;

&lt;p&gt;Plans start at &lt;strong&gt;$2,999/mo&lt;/strong&gt; for a dedicated fractional DevOps engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free 15-minute consultation&lt;/a&gt;&lt;/strong&gt; to get a cloud cost assessment.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>finops</category>
      <category>cloud</category>
      <category>costoptimization</category>
    </item>
    <item>
      <title>Building a DevSecOps Pipeline: Shift Security Left Without Slowing Down</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Thu, 16 Apr 2026 13:46:46 +0000</pubDate>
      <link>https://dev.to/instadevops/building-a-devsecops-pipeline-shift-security-left-without-slowing-down-31fl</link>
      <guid>https://dev.to/instadevops/building-a-devsecops-pipeline-shift-security-left-without-slowing-down-31fl</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Every week brings news of another data breach, supply chain attack, or misconfigured cloud resource exposing millions of records. The traditional approach of bolting security on at the end of the development cycle no longer works. By the time a penetration tester finds a vulnerability in staging, the code has been through weeks of development, and fixing it means expensive rework.&lt;/p&gt;

&lt;p&gt;DevSecOps integrates security testing directly into your CI/CD pipeline so vulnerabilities are caught in minutes, not months. The goal is not to slow down development - it is to catch security issues at the same speed you catch bugs: automatically, on every commit.&lt;/p&gt;

&lt;p&gt;This guide walks through building a practical DevSecOps pipeline with real tool configurations, covering static analysis, dependency scanning, container image scanning, infrastructure security, and policy-as-code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The DevSecOps Pipeline Stages
&lt;/h2&gt;

&lt;p&gt;A mature DevSecOps pipeline runs security checks at every stage:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code Commit → SAST → Dependency Scan → Build → Container Scan → IaC Scan → DAST → Deploy → Runtime Protection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You do not need all of these on day one. Start with the highest-impact, lowest-friction stages and layer on additional checks as your team matures.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1: Static Application Security Testing (SAST)
&lt;/h2&gt;

&lt;p&gt;SAST tools analyze your source code for vulnerabilities without executing it. They catch issues like SQL injection, cross-site scripting, hardcoded secrets, and insecure cryptographic patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semgrep for Custom and Community Rules
&lt;/h3&gt;

&lt;p&gt;Semgrep is fast, supports 30+ languages, and lets you write custom rules in a simple YAML syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/sast.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;SAST Scan&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;semgrep&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;container&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;semgrep/semgrep&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;semgrep scan --config auto --error --sarif --output semgrep.sarif .&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github/codeql-action/upload-sarif@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sarif_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;semgrep.sarif&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--config auto&lt;/code&gt; flag uses Semgrep's curated community ruleset. For stricter scanning, add specific rule packs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;semgrep scan &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; p/security-audit &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; p/secrets &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--config&lt;/span&gt; p/owasp-top-ten &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--error&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Writing Custom Security Rules
&lt;/h3&gt;

&lt;p&gt;Create organization-specific rules for patterns your team should avoid:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .semgrep/custom-rules.yml&lt;/span&gt;
&lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;no-raw-sql-queries&lt;/span&gt;
    &lt;span class="na"&gt;patterns&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;db.query($SQL, ...)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;pattern-not&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;db.query($SQL, [...])&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;parameterized&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;queries&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;prevent&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SQL&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;injection"&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ERROR&lt;/span&gt;
    &lt;span class="na"&gt;languages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;javascript&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;typescript&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;no-console-log-in-production&lt;/span&gt;
    &lt;span class="na"&gt;pattern&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;console.log(...)&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remove&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;console.log&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;before&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;merging&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;main"&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WARNING&lt;/span&gt;
    &lt;span class="na"&gt;languages&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;javascript&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;typescript&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;src/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stage 2: Dependency and Supply Chain Scanning
&lt;/h2&gt;

&lt;p&gt;Your application's dependencies are a massive attack surface. A single vulnerable transitive dependency can compromise your entire system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scanning with Trivy
&lt;/h3&gt;

&lt;p&gt;Trivy scans not just containers but also filesystems for vulnerable dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/dependency-scan.yml&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;scan-dependencies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Trivy filesystem scan&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;scan-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fs&lt;/span&gt;
          &lt;span class="na"&gt;scan-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HIGH,CRITICAL&lt;/span&gt;
          &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sarif&lt;/span&gt;
          &lt;span class="na"&gt;output&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trivy-fs.sarif&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github/codeql-action/upload-sarif@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sarif_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trivy-fs.sarif&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Software Bill of Materials (SBOM)
&lt;/h3&gt;

&lt;p&gt;Generate an SBOM for compliance and vulnerability tracking:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate SBOM with Syft&lt;/span&gt;
syft packages &lt;span class="nb"&gt;dir&lt;/span&gt;:. &lt;span class="nt"&gt;-o&lt;/span&gt; spdx-json &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; sbom.spdx.json

&lt;span class="c"&gt;# Scan the SBOM for vulnerabilities with Grype&lt;/span&gt;
grype sbom:sbom.spdx.json &lt;span class="nt"&gt;--fail-on&lt;/span&gt; high
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Automated Dependency Updates
&lt;/h3&gt;

&lt;p&gt;Use Renovate or Dependabot to keep dependencies current:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;renovate.json&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://docs.renovatebot.com/renovate-schema.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"extends"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"config:recommended"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"vulnerabilityAlerts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"labels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"security"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"packageRules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"matchUpdateTypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"patch"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"automerge"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"automergeType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"branch"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"matchUpdateTypes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"major"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"labels"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"breaking-change"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"assignees"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"team-lead"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stage 3: Container Image Scanning
&lt;/h2&gt;

&lt;p&gt;Container images often contain vulnerable OS packages, misconfigured permissions, and unnecessary tools that increase the attack surface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scanning During Build
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build-and-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build image&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t myapp:${{ github.sha }} .&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Scan image with Trivy&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp:${{ github.sha }}&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HIGH,CRITICAL&lt;/span&gt;
          &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check for root user&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;USER=$(docker inspect --format='{{.Config.User}}' myapp:${{ github.sha }})&lt;/span&gt;
          &lt;span class="s"&gt;if [ -z "$USER" ] || [ "$USER" = "root" ]; then&lt;/span&gt;
            &lt;span class="s"&gt;echo "ERROR: Container runs as root. Add a USER directive to your Dockerfile."&lt;/span&gt;
            &lt;span class="s"&gt;exit 1&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Hardening Dockerfiles
&lt;/h3&gt;

&lt;p&gt;A secure Dockerfile follows these patterns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Use specific version, not :latest&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;node:20.11-alpine&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;builder&lt;/span&gt;

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; package*.json ./&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;npm ci &lt;span class="nt"&gt;--only&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;production

&lt;span class="c"&gt;# Multi-stage build - no build tools in final image&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; node:20.11-alpine&lt;/span&gt;

&lt;span class="c"&gt;# Create non-root user&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;addgroup &lt;span class="nt"&gt;-g&lt;/span&gt; 1001 &lt;span class="nt"&gt;-S&lt;/span&gt; appuser &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    adduser &lt;span class="nt"&gt;-S&lt;/span&gt; appuser &lt;span class="nt"&gt;-u&lt;/span&gt; 1001 &lt;span class="nt"&gt;-G&lt;/span&gt; appuser

&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /app&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=builder /app/node_modules ./node_modules&lt;/span&gt;
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --chown=appuser:appuser . .&lt;/span&gt;

&lt;span class="c"&gt;# Drop all capabilities&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; appuser&lt;/span&gt;

&lt;span class="c"&gt;# Use dumb-init to handle PID 1 properly&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;apk add &lt;span class="nt"&gt;--no-cache&lt;/span&gt; dumb-init
&lt;span class="k"&gt;ENTRYPOINT&lt;/span&gt;&lt;span class="s"&gt; ["dumb-init", "--"]&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["node", "server.js"]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stage 4: Infrastructure as Code Security
&lt;/h2&gt;

&lt;p&gt;Misconfigured infrastructure is the leading cause of cloud breaches. Scan your Terraform, CloudFormation, and Kubernetes manifests before they reach production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Checkov for IaC Scanning
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;iac-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run Checkov&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridgecrewio/checkov-action@v12&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform/&lt;/span&gt;
          &lt;span class="na"&gt;framework&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform&lt;/span&gt;
          &lt;span class="na"&gt;soft_fail&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
          &lt;span class="na"&gt;output_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sarif&lt;/span&gt;
          &lt;span class="na"&gt;output_file_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkov.sarif&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github/codeql-action/upload-sarif@v3&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;sarif_file&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkov.sarif&lt;/span&gt;
        &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;always()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Checkov catches misconfigurations like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;S3 buckets without encryption&lt;/li&gt;
&lt;li&gt;Security groups open to 0.0.0.0/0&lt;/li&gt;
&lt;li&gt;RDS instances without encryption at rest&lt;/li&gt;
&lt;li&gt;IAM policies with wildcard permissions&lt;/li&gt;
&lt;li&gt;EBS volumes without encryption&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  tfsec for Terraform-Specific Scanning
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run tfsec on your Terraform directory&lt;/span&gt;
tfsec terraform/ &lt;span class="nt"&gt;--format&lt;/span&gt; sarif &lt;span class="nt"&gt;--out&lt;/span&gt; tfsec.sarif

&lt;span class="c"&gt;# Common findings:&lt;/span&gt;
&lt;span class="c"&gt;# - aws-ec2-no-public-ingress-sgr&lt;/span&gt;
&lt;span class="c"&gt;# - aws-s3-enable-bucket-encryption&lt;/span&gt;
&lt;span class="c"&gt;# - aws-iam-no-policy-wildcards&lt;/span&gt;
&lt;span class="c"&gt;# - aws-rds-encrypt-instance-storage-data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Stage 5: Dynamic Application Security Testing (DAST)
&lt;/h2&gt;

&lt;p&gt;DAST tools test your running application by sending requests and analyzing responses for vulnerabilities. Run these against a staging environment after deployment.&lt;/p&gt;

&lt;h3&gt;
  
  
  OWASP ZAP Baseline Scan
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;dast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy-staging&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OWASP ZAP Baseline Scan&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zaproxy/action-baseline@v0.12.0&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://staging.example.com&lt;/span&gt;
          &lt;span class="na"&gt;rules_file_name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zap-rules.tsv&lt;/span&gt;
          &lt;span class="na"&gt;fail_action&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;cmd_options&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;-a&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-j'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;DAST scans are slower (minutes to hours) and noisier than SAST, so run them post-deployment rather than on every commit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 6: Policy-as-Code with Open Policy Agent
&lt;/h2&gt;

&lt;p&gt;Policy-as-Code enforces organizational rules programmatically. Open Policy Agent (OPA) lets you write policies in Rego that gate deployments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rego"&gt;&lt;code&gt;&lt;span class="c1"&gt;# policy/kubernetes.rego&lt;/span&gt;
&lt;span class="ow"&gt;package&lt;/span&gt; &lt;span class="n"&gt;kubernetes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;admission&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"Deployment"&lt;/span&gt;
  &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;resources&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;limits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Container '%v' must have memory limits set"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"Deployment"&lt;/span&gt;
  &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;
  &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"@sha256:"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"^.*:[0-9]+\\.[0-9]+\\.[0-9]+$"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Container '%v' must use a specific version tag or digest, not :latest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;deny&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kind&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"Deployment"&lt;/span&gt;
  &lt;span class="n"&gt;container&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;containers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;securityContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;privileged&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;sprintf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"Container '%v' must not run in privileged mode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;container&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Conftest for CI Integration
&lt;/h3&gt;

&lt;p&gt;Use Conftest to run OPA policies against Kubernetes manifests, Terraform plans, and Dockerfiles in CI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test Kubernetes manifests&lt;/span&gt;
conftest &lt;span class="nb"&gt;test &lt;/span&gt;k8s/deployment.yaml &lt;span class="nt"&gt;--policy&lt;/span&gt; policy/

&lt;span class="c"&gt;# Test Terraform plans&lt;/span&gt;
terraform plan &lt;span class="nt"&gt;-out&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;tfplan
terraform show &lt;span class="nt"&gt;-json&lt;/span&gt; tfplan &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; tfplan.json
conftest &lt;span class="nb"&gt;test &lt;/span&gt;tfplan.json &lt;span class="nt"&gt;--policy&lt;/span&gt; policy/terraform/

&lt;span class="c"&gt;# Test Dockerfiles&lt;/span&gt;
conftest &lt;span class="nb"&gt;test &lt;/span&gt;Dockerfile &lt;span class="nt"&gt;--policy&lt;/span&gt; policy/docker/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Putting It All Together
&lt;/h2&gt;

&lt;p&gt;Here is a complete pipeline that ties all stages together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;DevSecOps Pipeline&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Fast checks first (&amp;lt; 2 minutes)&lt;/span&gt;
  &lt;span class="na"&gt;secrets-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trufflesecurity/trufflehog@main&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;extra_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--only-verified&lt;/span&gt;

  &lt;span class="na"&gt;sast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;docker run --rm -v "${PWD}:/src" semgrep/semgrep \&lt;/span&gt;
            &lt;span class="s"&gt;semgrep scan --config auto --error /src&lt;/span&gt;

  &lt;span class="c1"&gt;# Medium checks (2-5 minutes)&lt;/span&gt;
  &lt;span class="na"&gt;dependency-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;scan-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fs&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HIGH,CRITICAL&lt;/span&gt;
          &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

  &lt;span class="na"&gt;iac-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridgecrewio/checkov-action@v12&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform/&lt;/span&gt;

  &lt;span class="c1"&gt;# Build and container scan&lt;/span&gt;
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;secrets-scan&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;sast&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;dependency-scan&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;iac-scan&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t app:${{ github.sha }} .&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aquasecurity/trivy-action@master&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;image-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app:${{ github.sha }}&lt;/span&gt;
          &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HIGH,CRITICAL&lt;/span&gt;
          &lt;span class="na"&gt;exit-code&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

  &lt;span class="c1"&gt;# Deploy to staging, then DAST&lt;/span&gt;
  &lt;span class="na"&gt;deploy-staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;github.ref == 'refs/heads/main'&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo "Deploy to staging"&lt;/span&gt;

  &lt;span class="na"&gt;dast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy-staging&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;zaproxy/action-baseline@v0.12.0&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://staging.example.com&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Managing False Positives
&lt;/h2&gt;

&lt;p&gt;Every security scanning tool produces false positives. If you do not manage them, your team will start ignoring security alerts entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a baseline.&lt;/strong&gt; On initial integration, capture existing findings as a baseline and only alert on new issues.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use inline suppressions with justification.&lt;/strong&gt; Every suppressed finding should include a reason:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# nosemgrep: python.lang.security.audit.insecure-hash.insecure-hash
# Justification: MD5 used for non-security cache key, not for cryptographic purposes
&lt;/span&gt;&lt;span class="n"&gt;cache_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Triage weekly.&lt;/strong&gt; Assign someone to review and triage findings weekly. Close false positives, create tickets for real issues, and tune rules that produce too much noise.&lt;/p&gt;

&lt;h2&gt;
  
  
  Security Metrics That Matter
&lt;/h2&gt;

&lt;p&gt;Tracking the right metrics tells you whether your DevSecOps pipeline is actually improving security or just generating noise.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mean Time to Remediate (MTTR)
&lt;/h3&gt;

&lt;p&gt;How long does it take from when a vulnerability is detected to when it is fixed in production? Track this by severity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Critical:&lt;/strong&gt; Target under 24 hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High:&lt;/strong&gt; Target under 7 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium:&lt;/strong&gt; Target under 30 days&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low:&lt;/strong&gt; Target under 90 days&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Vulnerability Escape Rate
&lt;/h3&gt;

&lt;p&gt;What percentage of vulnerabilities make it past your CI/CD pipeline into production? This is your pipeline's effectiveness metric. Track it by comparing CI findings against production security scans and penetration test results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Coverage Percentage
&lt;/h3&gt;

&lt;p&gt;What percentage of your repositories have security scanning enabled? In most organizations, the answer is disturbingly low. Aim for 100% of production services covered by at minimum dependency scanning and SAST.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix Rate vs Find Rate
&lt;/h3&gt;

&lt;p&gt;If you are finding vulnerabilities faster than you are fixing them, your backlog will grow indefinitely. This signals that your pipeline is too noisy (tune it) or your team needs more security training.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started: A Phased Approach
&lt;/h2&gt;

&lt;p&gt;Do not try to implement everything at once. Here is a realistic rollout plan:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 1-2: Secrets scanning and dependency scanning.&lt;/strong&gt; These have the highest signal-to-noise ratio and catch the most impactful issues. Enable Dependabot or Renovate for automated updates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Week 3-4: SAST with Semgrep.&lt;/strong&gt; Start with the default auto config. Do not write custom rules yet. Let the team get comfortable with the findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 2: Container image scanning.&lt;/strong&gt; Add Trivy to your Docker build pipeline. Fix the critical and high findings, suppress the false positives with documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 3: IaC scanning with Checkov.&lt;/strong&gt; Scan your Terraform and Kubernetes manifests. This catches misconfigurations before they reach production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Month 4+: DAST and policy-as-code.&lt;/strong&gt; These require more setup and tuning but provide the deepest security coverage.&lt;/p&gt;

&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Building a DevSecOps pipeline that catches real vulnerabilities without drowning your team in false positives takes careful tuning. At &lt;strong&gt;&lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;&lt;/strong&gt;, we implement security-hardened CI/CD pipelines for startups and growing teams - baking security in from day one.&lt;/p&gt;

&lt;p&gt;Plans start at &lt;strong&gt;$2,999/mo&lt;/strong&gt; for a dedicated fractional DevOps engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free 15-minute consultation&lt;/a&gt;&lt;/strong&gt; to discuss your security posture.&lt;/p&gt;

</description>
      <category>devsecops</category>
      <category>security</category>
      <category>cicd</category>
      <category>devops</category>
    </item>
    <item>
      <title>AWS ECS vs EKS: A Practical Decision Framework for Startups</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Wed, 15 Apr 2026 13:46:43 +0000</pubDate>
      <link>https://dev.to/instadevops/aws-ecs-vs-eks-a-practical-decision-framework-for-startups-3dgp</link>
      <guid>https://dev.to/instadevops/aws-ecs-vs-eks-a-practical-decision-framework-for-startups-3dgp</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;You have containerized your application and now you need to run it in production on AWS. The two main options stare back at you from the AWS console: Elastic Container Service (ECS) and Elastic Kubernetes Service (EKS). Both are fully managed, both run containers, and both integrate deeply with the AWS ecosystem.&lt;/p&gt;

&lt;p&gt;So which one should you pick?&lt;/p&gt;

&lt;p&gt;The answer is not "EKS because Kubernetes is the industry standard" and it is not "ECS because it is simpler." The right choice depends on your team size, operational maturity, workload characteristics, and growth trajectory. This article gives you a practical framework for making this decision, with real cost numbers and migration considerations.&lt;/p&gt;

&lt;h2&gt;
  
  
  ECS: The AWS-Native Path
&lt;/h2&gt;

&lt;p&gt;ECS is AWS's proprietary container orchestration service. It is deeply integrated with the AWS ecosystem and was purpose-built to make running containers on AWS as straightforward as possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  ECS Architecture
&lt;/h3&gt;

&lt;p&gt;ECS has two launch types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EC2 Launch Type:&lt;/strong&gt; You manage the underlying EC2 instances. ECS handles container placement and scheduling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fargate Launch Type:&lt;/strong&gt; Serverless containers. AWS manages the infrastructure entirely. You just define CPU, memory, and your container image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"family"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"my-api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"networkMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"awsvpc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"requiresCompatibilities"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"FARGATE"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"cpu"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"512"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"memory"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1024"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"containerDefinitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-api:latest"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"portMappings"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"containerPort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"protocol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tcp"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"logConfiguration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"logDriver"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"awslogs"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-group"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/ecs/my-api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-region"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"eu-west-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"awslogs-stream-prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ecs"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ECS Strengths
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Zero Kubernetes knowledge required.&lt;/strong&gt; Your team interacts with familiar AWS concepts: task definitions, services, target groups, IAM roles. There is no &lt;code&gt;kubectl&lt;/code&gt;, no YAML manifests with 15 indentation levels, no CRDs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deep AWS integration.&lt;/strong&gt; ECS natively integrates with ALB/NLB, CloudWatch, IAM task roles, Secrets Manager, Parameter Store, App Mesh, and CloudMap for service discovery. These integrations work out of the box with minimal configuration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lower operational overhead.&lt;/strong&gt; Especially with Fargate, there are no nodes to patch, no cluster upgrades to manage, no etcd to worry about. AWS handles all of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simpler networking.&lt;/strong&gt; The &lt;code&gt;awsvpc&lt;/code&gt; network mode gives each task its own ENI and security group. No overlay networks, no kube-proxy, no CNI plugin decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  ECS Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Vendor lock-in to AWS&lt;/li&gt;
&lt;li&gt;Smaller community and ecosystem compared to Kubernetes&lt;/li&gt;
&lt;li&gt;Limited scheduling customization&lt;/li&gt;
&lt;li&gt;No equivalent to Kubernetes operators or CRDs for extending the platform&lt;/li&gt;
&lt;li&gt;Fewer third-party tools (no Helm, no ArgoCD, no Istio)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  EKS: The Kubernetes Path
&lt;/h2&gt;

&lt;p&gt;EKS is AWS's managed Kubernetes service. AWS manages the control plane (API server, etcd, scheduler), and you manage the worker nodes or use Fargate for serverless pods.&lt;/p&gt;

&lt;h3&gt;
  
  
  EKS Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Sample Kubernetes deployment&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-api:latest&lt;/span&gt;
          &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;250m&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;512Mi&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;500m&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1Gi&lt;/span&gt;
          &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
              &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
            &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
          &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/ready&lt;/span&gt;
              &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-api&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EKS Strengths
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Portability.&lt;/strong&gt; Kubernetes manifests work on any cloud provider or on-premises cluster. If multi-cloud or hybrid-cloud is in your future, EKS keeps your options open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rich ecosystem.&lt;/strong&gt; Helm charts, ArgoCD for GitOps, Istio or Linkerd for service mesh, Prometheus and Grafana for monitoring, cert-manager for TLS, external-dns for DNS automation. The Kubernetes ecosystem is enormous.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Advanced scheduling.&lt;/strong&gt; Node affinities, taints and tolerations, pod topology spread constraints, priority classes, and preemption give you fine-grained control over workload placement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extensibility.&lt;/strong&gt; Custom Resource Definitions and operators let you extend Kubernetes to manage databases, message queues, ML pipelines, and virtually anything else as native Kubernetes resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  EKS Limitations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;$0.10/hour ($73/month) control plane cost before any compute&lt;/li&gt;
&lt;li&gt;Steep learning curve (expect 3-6 months for a team new to Kubernetes)&lt;/li&gt;
&lt;li&gt;Cluster upgrades require planning and testing (new version every 4 months)&lt;/li&gt;
&lt;li&gt;More moving parts: CNI plugins, ingress controllers, cluster autoscaler, DNS&lt;/li&gt;
&lt;li&gt;Significantly higher operational burden&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Cost Comparison
&lt;/h2&gt;

&lt;p&gt;Let us compare the cost of running a typical startup workload: 3 services, each running 2 replicas with 0.5 vCPU and 1 GB RAM.&lt;/p&gt;

&lt;h3&gt;
  
  
  ECS Fargate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Per task: 0.5 vCPU + 1 GB RAM
Fargate pricing (eu-west-1):
  vCPU: $0.04048/hour
  Memory: $0.004445/GB/hour

Per task/hour: (0.5 * $0.04048) + (1 * $0.004445) = $0.02469
6 tasks total: $0.02469 * 6 = $0.14814/hour

Monthly: $0.14814 * 730 = $108.14/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ECS EC2 (with Reserved Instances)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2x t3.medium (2 vCPU, 4 GB) with 1-year RI:
  On-demand: $0.0416/hour each
  1-year RI (no upfront): $0.027/hour each

Monthly: 2 * $0.027 * 730 = $39.42/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EKS with Managed Node Group
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EKS control plane: $0.10/hour = $73/month
2x t3.medium nodes (same as ECS EC2):
  1-year RI: 2 * $0.027 * 730 = $39.42/month

Monthly total: $73 + $39.42 = $112.42/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  EKS Fargate
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EKS control plane: $73/month
Fargate pods (same as ECS Fargate): $108.14/month
Plus CoreDNS pods on Fargate: ~$15/month

Monthly total: $73 + $108.14 + $15 = $196.14/month
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At small scale, ECS on EC2 is the cheapest option by far. EKS adds a fixed $73/month control plane cost that only becomes negligible at larger scales.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Decision Framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose ECS When
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Your team is small (under 10 engineers).&lt;/strong&gt; The operational overhead of Kubernetes is not justified when a few people can manage ECS with part-time attention.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are all-in on AWS.&lt;/strong&gt; If you have no plans for multi-cloud and are already using ALB, CloudWatch, and IAM extensively, ECS slots in naturally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You want Fargate's simplicity.&lt;/strong&gt; ECS Fargate is the easiest path from container image to running workload. No nodes, no cluster management, no upgrades.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your workloads are straightforward.&lt;/strong&gt; Web APIs, background workers, and scheduled tasks that do not need advanced scheduling or custom operators.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need to ship fast.&lt;/strong&gt; A team new to containers can have ECS running in production within a week. Kubernetes takes months to operate confidently.&lt;/p&gt;

&lt;h3&gt;
  
  
  Choose EKS When
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Multi-cloud or hybrid is a real requirement.&lt;/strong&gt; Not a hypothetical future possibility, but an actual business constraint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You have Kubernetes experience on the team.&lt;/strong&gt; If your engineers already know Kubernetes, EKS employs that knowledge. Forcing Kubernetes-experienced engineers to learn ECS is a lateral move with no payoff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You need the ecosystem.&lt;/strong&gt; If your architecture requires service mesh, GitOps (ArgoCD), advanced traffic management, or custom operators, the Kubernetes ecosystem is unmatched.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are running 20+ services.&lt;/strong&gt; At scale, Kubernetes' sophisticated scheduling, resource management, and namespace isolation become genuine advantages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are building a platform team.&lt;/strong&gt; If part of your strategy is building an internal developer platform, Kubernetes provides the extensibility to build abstractions on top.&lt;/p&gt;

&lt;h2&gt;
  
  
  Migration Paths
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ECS to EKS
&lt;/h3&gt;

&lt;p&gt;If you outgrow ECS, migration is manageable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Set up EKS cluster alongside existing ECS&lt;/li&gt;
&lt;li&gt;Convert task definitions to Kubernetes deployments (the container config is similar)&lt;/li&gt;
&lt;li&gt;Use AWS ALB Ingress Controller to maintain the same load balancer&lt;/li&gt;
&lt;li&gt;Migrate services one at a time, shifting traffic gradually&lt;/li&gt;
&lt;li&gt;Decommission ECS services after verification&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  EKS to ECS
&lt;/h3&gt;

&lt;p&gt;Rarer, but it happens when teams realize they over-invested in Kubernetes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Convert Kubernetes deployments to ECS task definitions&lt;/li&gt;
&lt;li&gt;Replace Kubernetes services with ECS services + ALB target groups&lt;/li&gt;
&lt;li&gt;Replace Helm/ArgoCD with AWS CodePipeline or similar&lt;/li&gt;
&lt;li&gt;Migrate ConfigMaps/Secrets to AWS Parameter Store/Secrets Manager&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  A Pragmatic Recommendation
&lt;/h2&gt;

&lt;p&gt;For most startups and SMBs running fewer than 15 services on AWS, &lt;strong&gt;start with ECS&lt;/strong&gt;. Here is why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You avoid 3-6 months of Kubernetes learning curve&lt;/li&gt;
&lt;li&gt;You save $73/month in control plane costs (trivial individually, but symbolic of broader overhead)&lt;/li&gt;
&lt;li&gt;You eliminate cluster upgrade maintenance windows&lt;/li&gt;
&lt;li&gt;Your engineers focus on product features instead of infrastructure plumbing&lt;/li&gt;
&lt;li&gt;If you outgrow ECS, the migration to EKS is well-understood&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The exception: if you already have a team member who is deeply experienced with Kubernetes and can own the cluster operations, EKS is fine from day one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operational Overhead: What Nobody Tells You
&lt;/h2&gt;

&lt;p&gt;The cost comparison above only covers compute. The real difference between ECS and EKS is operational overhead - the time your engineers spend managing the orchestration platform itself rather than building product features.&lt;/p&gt;

&lt;h3&gt;
  
  
  ECS Day-2 Operations
&lt;/h3&gt;

&lt;p&gt;With ECS (especially Fargate), your ongoing operational tasks are minimal:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Task definition updates&lt;/strong&gt; when you change container config (straightforward JSON/Terraform changes)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Service scaling policies&lt;/strong&gt; adjustments based on traffic patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ALB health check tuning&lt;/strong&gt; to match your application's startup time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security group and IAM role updates&lt;/strong&gt; as your network requirements change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most of this is standard AWS work that any cloud engineer can handle.&lt;/p&gt;

&lt;h3&gt;
  
  
  EKS Day-2 Operations
&lt;/h3&gt;

&lt;p&gt;EKS adds a significant layer of Kubernetes-specific operational work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cluster version upgrades&lt;/strong&gt; every 4 months (Kubernetes drops support for old versions aggressively). Each upgrade requires testing all workloads, updating add-ons, and often updating manifests for deprecated APIs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add-on management&lt;/strong&gt;: CoreDNS, kube-proxy, VPC CNI, cluster autoscaler, ingress controller, cert-manager, external-dns - each has its own version matrix and upgrade path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Node group rotation&lt;/strong&gt; when you change AMIs or instance types. Cordon, drain, replace.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RBAC policy management&lt;/strong&gt; as teams and services grow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;etcd monitoring&lt;/strong&gt; for the control plane (managed by AWS, but you still need to understand its implications for API server performance).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Debugging Kubernetes networking&lt;/strong&gt; when services cannot reach each other (is it a NetworkPolicy? A CNI issue? A security group? A missing service account?).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A reasonable estimate: EKS adds 10-20 hours per month of platform engineering work compared to ECS. At a senior engineer's loaded cost, that is $3,000-6,000/month in operational overhead that does not show up on your AWS bill.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: 4-Person Startup, 2 Services
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Choose ECS Fargate.&lt;/strong&gt; You do not have the engineering bandwidth to learn and operate Kubernetes. ECS Fargate lets you deploy containers in an afternoon and forget about infrastructure management for the next 12 months.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 2: 25-Person Scale-Up, 12 Services, AWS Only
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Choose ECS EC2.&lt;/strong&gt; You are big enough that Fargate costs add up, but your services are straightforward enough that Kubernetes' advanced features are not needed. ECS on EC2 with Reserved Instances gives you the best cost efficiency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 3: 50-Person Company, 30 Services, Multi-Cloud Mandate
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Choose EKS.&lt;/strong&gt; At this scale, you likely have or can hire a dedicated platform engineer. Kubernetes' portability and ecosystem justify the operational investment, and you will benefit from tools like ArgoCD, Istio, and custom operators.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario 4: Regulated Industry, Compliance Requirements
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;EKS edges ahead.&lt;/strong&gt; Kubernetes' namespace isolation, network policies, pod security standards, and OPA Gatekeeper provide a richer set of compliance controls. Combined with tools like Falco for runtime security, you can build a compliance story that auditors understand and respect.&lt;/p&gt;

&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Choosing the right container orchestration platform is just the first decision. At &lt;strong&gt;&lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;&lt;/strong&gt;, we help startups set up production-grade container infrastructure on AWS - whether that is ECS, EKS, or a migration between the two.&lt;/p&gt;

&lt;p&gt;Plans start at &lt;strong&gt;$2,999/mo&lt;/strong&gt; for a dedicated fractional DevOps engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free 15-minute consultation&lt;/a&gt;&lt;/strong&gt; to discuss your container strategy.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ecs</category>
      <category>eks</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Advanced GitHub Actions: Matrix Builds, Reusable Workflows, and Self-Hosted Runners</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:46:40 +0000</pubDate>
      <link>https://dev.to/instadevops/advanced-github-actions-matrix-builds-reusable-workflows-and-self-hosted-runners-h07</link>
      <guid>https://dev.to/instadevops/advanced-github-actions-matrix-builds-reusable-workflows-and-self-hosted-runners-h07</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;GitHub Actions has matured from a simple CI tool into a full-featured automation platform. Most teams start with a basic build-and-test workflow, but GitHub Actions offers far more powerful patterns that can dramatically reduce pipeline duplication, speed up builds, and cut CI costs.&lt;/p&gt;

&lt;p&gt;This article covers advanced patterns that separate hobby-level GitHub Actions usage from production-grade CI/CD: matrix builds for testing across multiple environments, reusable workflows for eliminating duplication, self-hosted runners for cost and performance, and secrets management strategies that keep your pipelines secure.&lt;/p&gt;

&lt;p&gt;If you are already comfortable writing basic GitHub Actions workflows, this guide will take you to the next level.&lt;/p&gt;

&lt;h2&gt;
  
  
  Matrix Builds: Test Everything in Parallel
&lt;/h2&gt;

&lt;p&gt;Matrix builds let you run the same job across multiple combinations of variables - OS versions, language versions, dependency versions - without duplicating workflow code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Matrix Strategy
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;18&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;20&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;22&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;os&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ubuntu-latest&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;macos-latest&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
      &lt;span class="na"&gt;fail-fast&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ matrix.node-version }}&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Setting &lt;code&gt;fail-fast: false&lt;/code&gt; ensures all matrix combinations run to completion even if one fails. This is critical for understanding the full scope of compatibility issues.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic Matrix Generation
&lt;/h3&gt;

&lt;p&gt;For more complex scenarios, generate the matrix dynamically from a previous job:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;detect-services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;services&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.find.outputs.services }}&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;find&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;# Find all directories with a Dockerfile&lt;/span&gt;
          &lt;span class="s"&gt;SERVICES=$(find services/ -name Dockerfile -exec dirname {} \; | jq -R -s -c 'split("\n")[:-1]')&lt;/span&gt;
          &lt;span class="s"&gt;echo "services=$SERVICES" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;

  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;detect-services&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ fromJson(needs.detect-services.outputs.services) }}&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t ${{ matrix.service }}:latest ${{ matrix.service }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern is invaluable for monorepos: only build and test services that have changed, and automatically pick up new services without modifying the workflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Matrix Include and Exclude
&lt;/h3&gt;

&lt;p&gt;Fine-tune your matrix with include and exclude rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;os&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;ubuntu-latest&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;windows-latest&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;18&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;20&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Add an experimental Node 22 build on Ubuntu only&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;os&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
        &lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;22&lt;/span&gt;
        &lt;span class="na"&gt;experimental&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;exclude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Skip Node 18 on Windows (known incompatibility)&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;os&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;windows-latest&lt;/span&gt;
        &lt;span class="na"&gt;node&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;18&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Reusable Workflows: Eliminate Duplication
&lt;/h2&gt;

&lt;p&gt;If you have 10 microservices with nearly identical CI workflows, reusable workflows let you define the pipeline once and call it from each repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a Reusable Workflow
&lt;/h3&gt;

&lt;p&gt;Define a workflow with &lt;code&gt;workflow_call&lt;/code&gt; trigger in a shared repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/build-and-deploy.yml in org/shared-workflows repo&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and Deploy Service&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workflow_call&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;service-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
      &lt;span class="na"&gt;dockerfile-path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
        &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./Dockerfile'&lt;/span&gt;
      &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;string&lt;/span&gt;
    &lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
      &lt;span class="na"&gt;ECR_REGISTRY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Configure AWS credentials&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;aws-access-key-id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
          &lt;span class="na"&gt;aws-secret-access-key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eu-west-1&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login to ECR&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/amazon-ecr-login@v2&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build and push image&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;IMAGE="${{ secrets.ECR_REGISTRY }}/${{ inputs.service-name }}:${{ github.sha }}"&lt;/span&gt;
          &lt;span class="s"&gt;docker build -f ${{ inputs.dockerfile-path }} -t $IMAGE .&lt;/span&gt;
          &lt;span class="s"&gt;docker push $IMAGE&lt;/span&gt;

  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ inputs.environment }}&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy to ECS&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;aws ecs update-service \&lt;/span&gt;
            &lt;span class="s"&gt;--cluster ${{ inputs.environment }}-cluster \&lt;/span&gt;
            &lt;span class="s"&gt;--service ${{ inputs.service-name }} \&lt;/span&gt;
            &lt;span class="s"&gt;--force-new-deployment&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Calling a Reusable Workflow
&lt;/h3&gt;

&lt;p&gt;Each service repository calls the shared workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/ci.yml in service repo&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CI/CD&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy-staging&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org/shared-workflows/.github/workflows/build-and-deploy.yml@v2.1.0&lt;/span&gt;
    &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;service-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api&lt;/span&gt;
      &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;staging&lt;/span&gt;
    &lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
      &lt;span class="na"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
      &lt;span class="na"&gt;ECR_REGISTRY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ECR_REGISTRY }}&lt;/span&gt;

  &lt;span class="na"&gt;deploy-production&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy-staging&lt;/span&gt;
    &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;org/shared-workflows/.github/workflows/build-and-deploy.yml@v2.1.0&lt;/span&gt;
    &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;service-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payment-api&lt;/span&gt;
      &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
    &lt;span class="na"&gt;secrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_ACCESS_KEY_ID }}&lt;/span&gt;
      &lt;span class="na"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.AWS_SECRET_ACCESS_KEY }}&lt;/span&gt;
      &lt;span class="na"&gt;ECR_REGISTRY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.ECR_REGISTRY }}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pin the reusable workflow to a specific tag (&lt;code&gt;@v2.1.0&lt;/code&gt;) so that upstream changes do not break your pipelines unexpectedly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Hosted Runners: Performance and Cost Control
&lt;/h2&gt;

&lt;p&gt;GitHub-hosted runners are convenient but come with limitations: fixed hardware specs, cold start times, no persistent caches, and costs that scale linearly with usage. Self-hosted runners solve these problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Self-Hosted Runners Make Sense
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Large Docker builds&lt;/strong&gt; that benefit from persistent layer caches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU workloads&lt;/strong&gt; for ML model training or testing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network-restricted environments&lt;/strong&gt; where jobs need access to internal resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High CI volume&lt;/strong&gt; where GitHub-hosted runner costs exceed $1,000/month&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Specialized hardware&lt;/strong&gt; requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Setting Up Ephemeral Self-Hosted Runners on AWS
&lt;/h3&gt;

&lt;p&gt;Use the &lt;code&gt;actions-runner-controller&lt;/code&gt; (ARC) to autoscale runners on Kubernetes, or use EC2 with a simpler setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Terraform for self-hosted runner ASG&lt;/span&gt;
&lt;span class="s"&gt;resource "aws_launch_template" "runner" {&lt;/span&gt;
  &lt;span class="s"&gt;name_prefix   = "github-runner-"&lt;/span&gt;
  &lt;span class="s"&gt;image_id      = data.aws_ami.ubuntu.id&lt;/span&gt;
  &lt;span class="s"&gt;instance_type = "c6i.2xlarge"&lt;/span&gt;

  &lt;span class="s"&gt;user_data = base64encode(templatefile("runner-init.sh", {&lt;/span&gt;
    &lt;span class="s"&gt;github_org    = var.github_org&lt;/span&gt;
    &lt;span class="s"&gt;runner_token  = var.runner_registration_token&lt;/span&gt;
    &lt;span class="s"&gt;runner_labels = "self-hosted,linux,x64,large"&lt;/span&gt;
  &lt;span class="s"&gt;}))&lt;/span&gt;

  &lt;span class="s"&gt;block_device_mappings {&lt;/span&gt;
    &lt;span class="s"&gt;device_name = "/dev/sda1"&lt;/span&gt;
    &lt;span class="s"&gt;ebs {&lt;/span&gt;
      &lt;span class="s"&gt;volume_size = &lt;/span&gt;&lt;span class="m"&gt;100&lt;/span&gt;
      &lt;span class="s"&gt;volume_type = "gp3"&lt;/span&gt;
    &lt;span class="s"&gt;}&lt;/span&gt;
  &lt;span class="s"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;

&lt;span class="s"&gt;resource "aws_autoscaling_group" "runners" {&lt;/span&gt;
  &lt;span class="s"&gt;desired_capacity = &lt;/span&gt;&lt;span class="m"&gt;2&lt;/span&gt;
  &lt;span class="s"&gt;max_size         = &lt;/span&gt;&lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="s"&gt;min_size         = &lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;

  &lt;span class="s"&gt;launch_template {&lt;/span&gt;
    &lt;span class="s"&gt;id      = aws_launch_template.runner.id&lt;/span&gt;
    &lt;span class="s"&gt;version = "$Latest"&lt;/span&gt;
  &lt;span class="s"&gt;}&lt;/span&gt;

  &lt;span class="s"&gt;tag {&lt;/span&gt;
    &lt;span class="s"&gt;key                 = "Purpose"&lt;/span&gt;
    &lt;span class="s"&gt;value               = "github-actions-runner"&lt;/span&gt;
    &lt;span class="s"&gt;propagate_at_launch = &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="s"&gt;}&lt;/span&gt;
&lt;span class="err"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using Runner Labels for Job Routing
&lt;/h3&gt;

&lt;p&gt;Target specific runners using labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;unit-tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Fast tests on GitHub-hosted runners&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm test&lt;/span&gt;

  &lt;span class="na"&gt;docker-build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Heavy builds on self-hosted with Docker cache&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;self-hosted&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;linux&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;large&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker build -t myapp:latest .&lt;/span&gt;

  &lt;span class="na"&gt;gpu-tests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# ML tests on GPU runners&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;self-hosted&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;gpu&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;python -m pytest tests/ml/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Secrets Management Patterns
&lt;/h2&gt;

&lt;p&gt;Secrets handling in CI/CD is a top security concern. GitHub Actions provides several mechanisms, but you need to layer them correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Environment-Scoped Secrets
&lt;/h3&gt;

&lt;p&gt;Use GitHub environments to scope secrets to specific deployment targets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;  &lt;span class="c1"&gt;# Only secrets defined in "production" environment are available&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deploy&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;DB_PASSWORD&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ secrets.DB_PASSWORD }}&lt;/span&gt;  &lt;span class="c1"&gt;# Production-specific secret&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./deploy.sh&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure required reviewers on the production environment so deployments require manual approval.&lt;/p&gt;

&lt;h3&gt;
  
  
  OIDC Authentication (No Long-Lived Keys)
&lt;/h3&gt;

&lt;p&gt;Eliminate static AWS credentials entirely using OpenID Connect:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;id-token&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-actions/configure-aws-credentials@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;role-to-assume&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::123456789012:role/GitHubActionsRole&lt;/span&gt;
          &lt;span class="na"&gt;aws-region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;eu-west-1&lt;/span&gt;
          &lt;span class="c1"&gt;# No access key or secret needed - uses OIDC token exchange&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The IAM role trust policy restricts which repositories and branches can assume it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Principal"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Federated"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts:AssumeRoleWithWebIdentity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"StringEquals"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:aud"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sts.amazonaws.com"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"StringLike"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"token.actions.githubusercontent.com:sub"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"repo:yourorg/yourrepo:ref:refs/heads/main"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Secrets Scanning and Rotation
&lt;/h3&gt;

&lt;p&gt;Add a workflow that scans for accidentally committed secrets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;secrets-scan&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;trufflesecurity/trufflehog@main&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;extra_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;--only-verified&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Caching Strategies for Faster Builds
&lt;/h2&gt;

&lt;p&gt;Effective caching can cut build times by 50-80%.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dependency Caching
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/cache@v4&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;~/.npm&lt;/span&gt;
      &lt;span class="s"&gt;node_modules&lt;/span&gt;
    &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deps-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}&lt;/span&gt;
    &lt;span class="na"&gt;restore-keys&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;deps-${{ runner.os }}-&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docker Layer Caching
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;docker/build-push-action@v5&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;context&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;.&lt;/span&gt;
    &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;myapp:latest&lt;/span&gt;
    &lt;span class="na"&gt;cache-from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type=gha&lt;/span&gt;
    &lt;span class="na"&gt;cache-to&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;type=gha,mode=max&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Artifact Passing Between Jobs
&lt;/h3&gt;

&lt;p&gt;Use artifacts to pass build outputs between jobs without rebuilding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm run build&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/upload-artifact@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build-output&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dist/&lt;/span&gt;
          &lt;span class="na"&gt;retention-days&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/download-artifact@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;build-output&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dist/&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./deploy.sh dist/&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Workflow Organization at Scale
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Path-Based Triggers
&lt;/h3&gt;

&lt;p&gt;Only run workflows when relevant files change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;services/payment-api/**'&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.github/workflows/payment-api.yml'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Concurrency Control
&lt;/h3&gt;

&lt;p&gt;Prevent redundant workflow runs when commits are pushed rapidly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;concurrency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;group&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy-${{ github.ref }}&lt;/span&gt;
  &lt;span class="na"&gt;cancel-in-progress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This cancels the previous run when a new commit is pushed to the same branch, saving runner minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Composite Actions for Shared Steps
&lt;/h3&gt;

&lt;p&gt;When full reusable workflows are overkill, composite actions bundle common steps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/actions/setup-project/action.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Setup Project&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install dependencies and configure environment&lt;/span&gt;
&lt;span class="na"&gt;inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;20'&lt;/span&gt;
&lt;span class="na"&gt;runs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;using&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;composite&lt;/span&gt;
  &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/setup-node@v4&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ inputs.node-version }}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/cache@v4&lt;/span&gt;
      &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node_modules&lt;/span&gt;
        &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deps-${{ hashFiles('package-lock.json') }}&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;npm ci&lt;/span&gt;
      &lt;span class="na"&gt;shell&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bash&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use it in any workflow with a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;./.github/actions/setup-project&lt;/span&gt;
  &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;node-version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;20'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitoring and Debugging Workflows
&lt;/h2&gt;

&lt;p&gt;When workflows fail in production, you need fast diagnosis. GitHub provides several tools for this.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step-Level Timing Analysis
&lt;/h3&gt;

&lt;p&gt;Every workflow run shows timing per step. If your pipeline is slow, look for steps that take disproportionate time. Common culprits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Checkout with full history&lt;/strong&gt; (&lt;code&gt;fetch-depth: 0&lt;/code&gt;) on large repos - use shallow clones unless you need full history&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker builds without caching&lt;/strong&gt; - always use BuildKit cache with &lt;code&gt;cache-from&lt;/code&gt; and &lt;code&gt;cache-to&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NPM/pip installs without cache&lt;/strong&gt; - dependency caching alone can save 30-60 seconds per run&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Workflow Run Logs and Annotations
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;::warning&lt;/code&gt; and &lt;code&gt;::error&lt;/code&gt; workflow commands to surface issues directly in PR annotations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Check bundle size&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;SIZE=$(stat -f%z dist/bundle.js)&lt;/span&gt;
    &lt;span class="s"&gt;if [ "$SIZE" -gt 500000 ]; then&lt;/span&gt;
      &lt;span class="s"&gt;echo "::warning file=dist/bundle.js::Bundle size is ${SIZE} bytes (over 500KB limit)"&lt;/span&gt;
    &lt;span class="s"&gt;fi&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reusable Debug Workflow
&lt;/h3&gt;

&lt;p&gt;Create a debug composite action that dumps useful context when things go wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Debug info&lt;/span&gt;
  &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;failure()&lt;/span&gt;
  &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;echo "## Environment" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
    &lt;span class="s"&gt;echo "- Runner: ${{ runner.os }} ${{ runner.arch }}" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
    &lt;span class="s"&gt;echo "- Node: $(node --version)" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
    &lt;span class="s"&gt;echo "- Disk: $(df -h / | tail -1)" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
    &lt;span class="s"&gt;echo "- Memory: $(free -h 2&amp;gt;/dev/null || vm_stat)" &amp;gt;&amp;gt; $GITHUB_STEP_SUMMARY&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common Pitfalls to Avoid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Not pinning action versions.&lt;/strong&gt; Using &lt;code&gt;actions/checkout@main&lt;/code&gt; means your workflow can break without any change on your side. Always pin to a specific version tag or commit SHA for third-party actions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leaking secrets in logs.&lt;/strong&gt; GitHub automatically masks secrets in logs, but derived values (like a URL containing a token) are not masked. Use &lt;code&gt;::add-mask::&lt;/code&gt; for any sensitive derived values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ignoring workflow permissions.&lt;/strong&gt; The default &lt;code&gt;GITHUB_TOKEN&lt;/code&gt; has broad permissions. Use the &lt;code&gt;permissions&lt;/code&gt; key to restrict to only what each job needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;contents&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read&lt;/span&gt;
  &lt;span class="na"&gt;pull-requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;write&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Running everything on push and pull_request.&lt;/strong&gt; This causes duplicate runs. Use &lt;code&gt;push&lt;/code&gt; for deploys (main branch only) and &lt;code&gt;pull_request&lt;/code&gt; for checks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not using job summaries.&lt;/strong&gt; The &lt;code&gt;$GITHUB_STEP_SUMMARY&lt;/code&gt; file lets you write rich Markdown summaries that appear in the workflow run UI - much better than scrolling through raw logs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Setting up production-grade CI/CD pipelines with proper security, caching, and scaling takes real-world experience. At &lt;strong&gt;&lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;&lt;/strong&gt;, we build and maintain CI/CD infrastructure for startups and growing teams so your developers can focus on shipping features.&lt;/p&gt;

&lt;p&gt;Plans start at &lt;strong&gt;$2,999/mo&lt;/strong&gt; for a dedicated fractional DevOps engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free 15-minute consultation&lt;/a&gt;&lt;/strong&gt; to optimize your GitHub Actions workflows.&lt;/p&gt;

</description>
      <category>github</category>
      <category>cicd</category>
      <category>devops</category>
      <category>automation</category>
    </item>
    <item>
      <title>Terraform Modules Done Right: Mono-Repo, Versioning, and Registry Patterns</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Mon, 13 Apr 2026 13:46:36 +0000</pubDate>
      <link>https://dev.to/instadevops/terraform-modules-done-right-mono-repo-versioning-and-registry-patterns-34g6</link>
      <guid>https://dev.to/instadevops/terraform-modules-done-right-mono-repo-versioning-and-registry-patterns-34g6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As your Terraform codebase grows beyond a handful of resources, you inevitably face a structural decision: how do you organize reusable infrastructure components so that multiple teams can collaborate without stepping on each other's toes?&lt;/p&gt;

&lt;p&gt;The answer is Terraform modules. But writing a module is easy - organizing, versioning, and distributing them at scale is where most teams struggle. Poorly structured modules lead to copy-paste drift, version conflicts, and infrastructure that nobody fully understands.&lt;/p&gt;

&lt;p&gt;In this guide, we will walk through battle-tested patterns for organizing Terraform modules, choosing between mono-repo and multi-repo strategies, leveraging module registries, and implementing versioning that actually works in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes a Good Terraform Module
&lt;/h2&gt;

&lt;p&gt;Before discussing organization, let us establish what a well-designed module looks like. A good Terraform module follows these principles:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single responsibility.&lt;/strong&gt; Each module should manage one logical piece of infrastructure. A VPC module should not also create RDS instances.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sensible defaults with full override capability.&lt;/strong&gt; Provide defaults that work for 80% of use cases, but allow every significant parameter to be overridden:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"instance_type"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"EC2 instance type for the application servers"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"t3.medium"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"enable_enhanced_monitoring"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Enable enhanced monitoring with 60-second granularity"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bool&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"tags"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Additional tags to apply to all resources"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;default&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Clear input/output contracts.&lt;/strong&gt; Every variable should have a description, type constraint, and validation where appropriate. Every output that downstream consumers need should be explicitly exported:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;variable&lt;/span&gt; &lt;span class="s2"&gt;"vpc_cidr"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"CIDR block for the VPC"&lt;/span&gt;
  &lt;span class="nx"&gt;type&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;

  &lt;span class="nx"&gt;validation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;condition&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;can&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;cidrhost&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_cidr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="nx"&gt;error_message&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Must be a valid CIDR block."&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"vpc_id"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ID of the created VPC"&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="s2"&gt;"private_subnet_ids"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;description&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"List of private subnet IDs"&lt;/span&gt;
  &lt;span class="nx"&gt;value&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private&lt;/span&gt;&lt;span class="p"&gt;[*].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Minimal provider assumptions.&lt;/strong&gt; Do not hardcode provider configurations inside modules. Let the caller configure the provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Bad - hardcoded provider in module&lt;/span&gt;
&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"aws"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Good - module relies on caller's provider configuration&lt;/span&gt;
&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;required_providers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;aws&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/aws"&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"&amp;gt;= 5.0"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Mono-Repo vs Multi-Repo: Choosing the Right Strategy
&lt;/h2&gt;

&lt;p&gt;This is the most debated structural decision in Terraform module management. Both approaches have legitimate trade-offs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mono-Repo Pattern
&lt;/h3&gt;

&lt;p&gt;All modules live in a single repository with a directory structure like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform-modules/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── ecs-service/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   ├── rds-postgres/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── README.md
│   └── s3-bucket/
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       └── README.md
├── examples/
│   ├── complete-vpc/
│   └── ecs-with-rds/
├── tests/
│   ├── vpc_test.go
│   └── ecs_test.go
└── .github/
    └── workflows/
        └── test.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Atomic cross-module changes in a single PR&lt;/li&gt;
&lt;li&gt;Unified CI/CD pipeline for testing&lt;/li&gt;
&lt;li&gt;Easier to maintain consistency across modules&lt;/li&gt;
&lt;li&gt;Simpler onboarding for new team members&lt;/li&gt;
&lt;li&gt;One place to search for all infrastructure patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Git tags version the entire repo, not individual modules&lt;/li&gt;
&lt;li&gt;Large repos can slow down CI runs&lt;/li&gt;
&lt;li&gt;Permission boundaries are harder (everyone can see everything)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Teams under 20 engineers, organizations with fewer than 30 modules, or when most modules are tightly coupled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Repo Pattern
&lt;/h3&gt;

&lt;p&gt;Each module gets its own repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;terraform-module-vpc/          (v2.3.1)
terraform-module-ecs-service/  (v1.8.0)
terraform-module-rds-postgres/ (v3.1.0)
terraform-module-s3-bucket/    (v1.2.4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Independent versioning per module using Git tags&lt;/li&gt;
&lt;li&gt;Fine-grained access control per repository&lt;/li&gt;
&lt;li&gt;Smaller, focused CI/CD pipelines&lt;/li&gt;
&lt;li&gt;Clear ownership boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Disadvantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cross-module changes require multiple PRs&lt;/li&gt;
&lt;li&gt;Dependency management becomes complex&lt;/li&gt;
&lt;li&gt;More repositories to maintain&lt;/li&gt;
&lt;li&gt;Harder to ensure consistency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;When to use:&lt;/strong&gt; Organizations with 50+ modules, multiple platform teams owning different modules, or strict compliance requirements needing audit trails per component.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hybrid Approach
&lt;/h3&gt;

&lt;p&gt;Many mature organizations use a hybrid: a mono-repo for foundational modules maintained by the platform team, with separate repos for domain-specific modules owned by product teams:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;platform-terraform-modules/     # Platform team owns VPC, IAM, networking
├── modules/vpc/
├── modules/iam-roles/
└── modules/cloudfront/

team-payments-terraform/        # Payments team owns their service modules
├── modules/payment-service/
└── modules/fraud-detection/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Module Versioning Strategies
&lt;/h2&gt;

&lt;p&gt;Versioning is where most Terraform setups break down. Without proper versioning, a module change can silently break every environment that references it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Semantic Versioning
&lt;/h3&gt;

&lt;p&gt;Follow semver strictly for modules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MAJOR&lt;/strong&gt; (v2.0.0): Breaking changes (removed variables, renamed resources that force replacement)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MINOR&lt;/strong&gt; (v1.3.0): New features, new optional variables with defaults&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PATCH&lt;/strong&gt; (v1.2.1): Bug fixes, documentation updates&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Pinning Module Versions
&lt;/h3&gt;

&lt;p&gt;Always pin module versions in your root configurations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Good - pinned to exact version&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"git::https://github.com/yourorg/terraform-modules.git//modules/vpc?ref=v2.3.1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Acceptable - pinned to minor version range&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"app.terraform.io/yourorg/vpc/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"~&amp;gt; 2.3"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Bad - no version pin, uses latest&lt;/span&gt;
&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"git::https://github.com/yourorg/terraform-modules.git//modules/vpc"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Automated Version Bumping
&lt;/h3&gt;

&lt;p&gt;Use a CI workflow that automatically creates releases when module directories change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .github/workflows/release.yml&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Release Modules&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;main&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;detect-changes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;outputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;changed_modules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ steps.changes.outputs.modules }}&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;changes&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;CHANGED=$(git diff --name-only HEAD~1 HEAD | grep '^modules/' | cut -d'/' -f2 | sort -u | jq -R -s -c 'split("\n")[:-1]')&lt;/span&gt;
          &lt;span class="s"&gt;echo "modules=$CHANGED" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;

  &lt;span class="na"&gt;release&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;detect-changes&lt;/span&gt;
    &lt;span class="na"&gt;if&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;needs.detect-changes.outputs.changed_modules != '[]'&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;module&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${{ fromJson(needs.detect-changes.outputs.changed_modules) }}&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Determine version bump&lt;/span&gt;
        &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;version&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;CURRENT=$(git tag -l "modules/${{ matrix.module }}/v*" | sort -V | tail -1)&lt;/span&gt;
          &lt;span class="s"&gt;# Parse commit messages for bump type&lt;/span&gt;
          &lt;span class="s"&gt;if git log --oneline HEAD~1..HEAD | grep -q "BREAKING"; then&lt;/span&gt;
            &lt;span class="s"&gt;BUMP="major"&lt;/span&gt;
          &lt;span class="s"&gt;elif git log --oneline HEAD~1..HEAD | grep -q "feat"; then&lt;/span&gt;
            &lt;span class="s"&gt;BUMP="minor"&lt;/span&gt;
          &lt;span class="s"&gt;else&lt;/span&gt;
            &lt;span class="s"&gt;BUMP="patch"&lt;/span&gt;
          &lt;span class="s"&gt;fi&lt;/span&gt;
          &lt;span class="s"&gt;echo "bump=$BUMP" &amp;gt;&amp;gt; $GITHUB_OUTPUT&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Create release tag&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;# Bump version and create tag&lt;/span&gt;
          &lt;span class="s"&gt;git tag "modules/${{ matrix.module }}/v${NEW_VERSION}"&lt;/span&gt;
          &lt;span class="s"&gt;git push --tags&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Using a Private Module Registry
&lt;/h2&gt;

&lt;p&gt;A module registry provides a discoverable, versioned catalog of your organization's modules. You have several options.&lt;/p&gt;

&lt;h3&gt;
  
  
  Terraform Cloud / HCP Terraform Registry
&lt;/h3&gt;

&lt;p&gt;The simplest option if you are already using Terraform Cloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"app.terraform.io/yourorg/vpc/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2.3.1"&lt;/span&gt;

  &lt;span class="nx"&gt;cidr_block&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;
  &lt;span class="nx"&gt;environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Publishing is automatic when you connect your VCS repository to the registry.&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-Hosted with Artifactory or S3
&lt;/h3&gt;

&lt;p&gt;For air-gapped or highly regulated environments, you can host modules on S3:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3::https://my-terraform-modules.s3.amazonaws.com/vpc/v2.3.1.zip"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pair this with a CI pipeline that packages and uploads module archives on release:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;
&lt;span class="nv"&gt;MODULE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;
&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$2&lt;/span&gt;

&lt;span class="nb"&gt;cd &lt;/span&gt;modules/&lt;span class="nv"&gt;$MODULE&lt;/span&gt;
zip &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODULE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.zip"&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="s2"&gt;"/tmp/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODULE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;-&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.zip"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"s3://my-terraform-modules/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;MODULE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.zip"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  GitHub Releases as a Registry
&lt;/h3&gt;

&lt;p&gt;A lightweight approach using Git tags directly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"vpc"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"git::https://github.com/yourorg/terraform-modules.git//modules/vpc?ref=v2.3.1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works well for smaller organizations and avoids the overhead of a dedicated registry.&lt;/p&gt;

&lt;h2&gt;
  
  
  Module Testing and Validation
&lt;/h2&gt;

&lt;p&gt;Modules without tests are modules you cannot trust. Here are the layers of testing you should implement.&lt;/p&gt;

&lt;h3&gt;
  
  
  Static Analysis
&lt;/h3&gt;

&lt;p&gt;Run these on every PR:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Format check&lt;/span&gt;
terraform &lt;span class="nb"&gt;fmt&lt;/span&gt; &lt;span class="nt"&gt;-check&lt;/span&gt; &lt;span class="nt"&gt;-recursive&lt;/span&gt; modules/

&lt;span class="c"&gt;# Validation&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="nb"&gt;dir &lt;/span&gt;&lt;span class="k"&gt;in &lt;/span&gt;modules/&lt;span class="k"&gt;*&lt;/span&gt;/&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$dir&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  terraform init &lt;span class="nt"&gt;-backend&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;false
  &lt;/span&gt;terraform validate
  &lt;span class="nb"&gt;cd&lt;/span&gt; ../..
&lt;span class="k"&gt;done&lt;/span&gt;

&lt;span class="c"&gt;# Security scanning with tfsec&lt;/span&gt;
tfsec modules/

&lt;span class="c"&gt;# Linting with tflint&lt;/span&gt;
tflint &lt;span class="nt"&gt;--recursive&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Integration Testing with Terratest
&lt;/h3&gt;

&lt;p&gt;Write Go tests that actually provision and destroy infrastructure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;package&lt;/span&gt; &lt;span class="n"&gt;test&lt;/span&gt;

&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"testing"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/gruntwork-io/terratest/modules/terraform"&lt;/span&gt;
    &lt;span class="s"&gt;"github.com/stretchr/testify/assert"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestVpcModule&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;terraformOptions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithDefaultRetryableErrors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;TerraformDir&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"../modules/vpc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;Vars&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}{&lt;/span&gt;
            &lt;span class="s"&gt;"cidr_block"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;"10.99.0.0/16"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"environment"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"test"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;        &lt;span class="s"&gt;"terratest-vpc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Destroy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terraformOptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InitAndApply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terraformOptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;vpcId&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terraformOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"vpc_id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NotEmpty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vpcId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;privateSubnets&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OutputList&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terraformOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"private_subnet_ids"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;privateSubnets&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example Configurations
&lt;/h3&gt;

&lt;p&gt;Every module should ship with a working example in an &lt;code&gt;examples/&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;modules/vpc/
├── main.tf
├── variables.tf
├── outputs.tf
├── README.md
└── examples/
    ├── simple/
    │   └── main.tf       # Minimal usage
    └── complete/
        └── main.tf       # All features enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Module Composition Patterns
&lt;/h2&gt;

&lt;p&gt;Real infrastructure is built by composing modules together. Here are patterns that work well at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Root Module Pattern
&lt;/h3&gt;

&lt;p&gt;Create environment-specific root modules that compose shared modules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# environments/production/main.tf&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"network"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"app.terraform.io/yourorg/vpc/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2.3.1"&lt;/span&gt;

  &lt;span class="nx"&gt;cidr_block&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;
  &lt;span class="nx"&gt;availability_zones&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"eu-west-1a"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"eu-west-1b"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"eu-west-1c"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;environment&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"database"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"app.terraform.io/yourorg/rds-postgres/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"3.1.0"&lt;/span&gt;

  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;
  &lt;span class="nx"&gt;instance_class&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"db.r6g.xlarge"&lt;/span&gt;
  &lt;span class="nx"&gt;multi_az&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;environment&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;module&lt;/span&gt; &lt;span class="s2"&gt;"application"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"app.terraform.io/yourorg/ecs-service/aws"&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1.8.0"&lt;/span&gt;

  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;vpc_id&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private_subnet_ids&lt;/span&gt;
  &lt;span class="nx"&gt;lb_target_group&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;network&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;alb_target_group_arn&lt;/span&gt;
  &lt;span class="nx"&gt;database_endpoint&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;endpoint&lt;/span&gt;
  &lt;span class="nx"&gt;desired_count&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;
  &lt;span class="nx"&gt;environment&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Terragrunt DRY Pattern
&lt;/h3&gt;

&lt;p&gt;For organizations managing many environments, Terragrunt eliminates repetition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# terragrunt.hcl (root)&lt;/span&gt;
&lt;span class="nx"&gt;remote_state&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"s3"&lt;/span&gt;
  &lt;span class="nx"&gt;generate&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;path&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"backend.tf"&lt;/span&gt;
    &lt;span class="nx"&gt;if_exists&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"overwrite_terragrunt"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;bucket&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"myorg-terraform-state"&lt;/span&gt;
    &lt;span class="nx"&gt;key&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${path_relative_to_include()}/terraform.tfstate"&lt;/span&gt;
    &lt;span class="nx"&gt;region&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"eu-west-1"&lt;/span&gt;
    &lt;span class="nx"&gt;encrypt&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;dynamodb_table&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"terraform-locks"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# environments/production/vpc/terragrunt.hcl&lt;/span&gt;
&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"git::https://github.com/yourorg/terraform-modules.git//modules/vpc?ref=v2.3.1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;inputs&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;cidr_block&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"10.0.0.0/16"&lt;/span&gt;
  &lt;span class="nx"&gt;environment&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"production"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Common Pitfalls and How to Avoid Them
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Over-abstracting too early.&lt;/strong&gt; Do not create a module until you have written the same Terraform code at least twice. Premature abstraction creates rigid modules that fight against real requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nested modules more than two levels deep.&lt;/strong&gt; Module A calls module B which calls module C is already hard to debug. Keep your module hierarchy shallow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not using &lt;code&gt;moved&lt;/code&gt; blocks during refactors.&lt;/strong&gt; When restructuring modules, use &lt;code&gt;moved&lt;/code&gt; blocks to prevent Terraform from destroying and recreating resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;moved&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;web&lt;/span&gt;
  &lt;span class="nx"&gt;to&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;application&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;aws_instance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;web&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Ignoring module documentation.&lt;/strong&gt; Every module should have a README generated by &lt;code&gt;terraform-docs&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Generate docs automatically&lt;/span&gt;
terraform-docs markdown table modules/vpc/ &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; modules/vpc/README.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Storing secrets in tfvars files.&lt;/strong&gt; Use a secrets manager and data sources instead of committing sensitive values.&lt;/p&gt;

&lt;h2&gt;
  
  
  Need Help with Your DevOps?
&lt;/h2&gt;

&lt;p&gt;Building and maintaining a well-structured Terraform module library takes experience. At &lt;strong&gt;&lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;&lt;/strong&gt;, we help startups and growing teams implement production-grade Infrastructure as Code from day one - so you can ship infrastructure changes with the same confidence as application code.&lt;/p&gt;

&lt;p&gt;Plans start at &lt;strong&gt;$2,999/mo&lt;/strong&gt; for a dedicated fractional DevOps engineer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a free 15-minute consultation&lt;/a&gt;&lt;/strong&gt; to discuss your Terraform architecture.&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>iac</category>
      <category>devops</category>
      <category>modules</category>
    </item>
    <item>
      <title>Infrastructure Testing: Terratest, Checkov, and Validating Your IaC</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Sun, 12 Apr 2026 13:46:59 +0000</pubDate>
      <link>https://dev.to/instadevops/infrastructure-testing-terratest-checkov-and-validating-your-iac-2c49</link>
      <guid>https://dev.to/instadevops/infrastructure-testing-terratest-checkov-and-validating-your-iac-2c49</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Infrastructure as Code has transformed cloud management, but when a single Terraform apply can spin up hundreds of resources, testing becomes essential.&lt;/p&gt;

&lt;p&gt;This article explores Terratest for functional testing and Checkov for security scanning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Static Analysis with Checkov
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;checkov
checkov &lt;span class="nt"&gt;-d&lt;/span&gt; ./terraform
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Custom Policies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;checkov.terraform.checks.resource.base_resource_check&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseResourceCheck&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;checkov.common.models.enums&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CheckCategories&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CheckResult&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;EC2HasEnvironmentTag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseResourceCheck&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ensure EC2 instances have environment tag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CKV_CUSTOM_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;supported_resources&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;aws_instance&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                        &lt;span class="n"&gt;supported_resources&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;supported_resources&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;scan_resource_conf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[{}])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Environment&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;CheckResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;PASSED&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;CheckResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FAILED&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Functional Testing with Terratest
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;TestS3Bucket&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;testing&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Parallel&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;terraformOptions&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithDefaultRetryableErrors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Options&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;TerraformDir&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"../modules/s3-bucket"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;Vars&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;interface&lt;/span&gt;&lt;span class="p"&gt;{}{&lt;/span&gt;
                &lt;span class="s"&gt;"bucket_name"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"test-bucket-"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UniqueId&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Destroy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terraformOptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;InitAndApply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terraformOptions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;bucketID&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;terraform&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;terraformOptions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"bucket_id"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;actualStatus&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;aws&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GetS3BucketVersioning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"us-east-1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bucketID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;assert&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Equal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Enabled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;actualStatus&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When to Use Each Tool
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Checkov&lt;/th&gt;
&lt;th&gt;Terratest&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Type&lt;/td&gt;
&lt;td&gt;Static analysis&lt;/td&gt;
&lt;td&gt;Functional testing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Speed&lt;/td&gt;
&lt;td&gt;Seconds&lt;/td&gt;
&lt;td&gt;Minutes to hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;td&gt;Incurs cloud costs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Coverage&lt;/td&gt;
&lt;td&gt;Security, compliance&lt;/td&gt;
&lt;td&gt;Actual functionality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  CI/CD Integration
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;static-analysis&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bridgecrewio/checkov-action@master&lt;/span&gt;
        &lt;span class="na"&gt;with&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;directory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;terraform/&lt;/span&gt;

  &lt;span class="na"&gt;terratest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;needs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;static-analysis&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;cd test&lt;/span&gt;
          &lt;span class="s"&gt;go test -v -timeout 30m ./...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Layer your testing: Start with Checkov for fast feedback, add Terratest for critical modules, and run full integration tests before production.&lt;/p&gt;




&lt;h2&gt;
  
  
  Need Help with Your DevOps Infrastructure?
&lt;/h2&gt;

&lt;p&gt;At &lt;strong&gt;&lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;&lt;/strong&gt;, we specialize in helping startups build production-ready infrastructure.&lt;/p&gt;

&lt;p&gt;📅 &lt;strong&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a Free 15-Min Consultation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://instadevops.com/blog/infrastructure-testing" rel="noopener noreferrer"&gt;instadevops.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>terraform</category>
      <category>testing</category>
      <category>iac</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes Networking Demystified: Services, Ingress, and Network Policies</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Sat, 11 Apr 2026 13:46:57 +0000</pubDate>
      <link>https://dev.to/instadevops/kubernetes-networking-demystified-services-ingress-and-network-policies-43ne</link>
      <guid>https://dev.to/instadevops/kubernetes-networking-demystified-services-ingress-and-network-policies-43ne</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Kubernetes networking is often cited as one of the most challenging aspects of container orchestration. This guide covers the three pillars: Services, Ingress, and Network Policies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Services: Stable Endpoints for Dynamic Pods
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ClusterIP Services
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  LoadBalancer Services
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;public-api&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LoadBalancer&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;443&lt;/span&gt;
    &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8443&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ingress: HTTP/HTTPS Traffic Management
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-ingress&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;ingressClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;tls&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;app.example.com&lt;/span&gt;
    &lt;span class="na"&gt;secretName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app-tls-secret&lt;/span&gt;
  &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app.example.com&lt;/span&gt;
    &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/api&lt;/span&gt;
        &lt;span class="na"&gt;pathType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Prefix&lt;/span&gt;
        &lt;span class="na"&gt;backend&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend-api&lt;/span&gt;
            &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Network Policies: Securing Pod Communication
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Default Deny All
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default-deny-all&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="na"&gt;policyTypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;Egress&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Allow Specific Ingress
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NetworkPolicy&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;allow-frontend-to-backend&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
  &lt;span class="na"&gt;ingress&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;from&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;podSelector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;frontend&lt;/span&gt;
    &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check endpoints&lt;/span&gt;
kubectl get endpoints my-service

&lt;span class="c"&gt;# Test DNS&lt;/span&gt;
kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; my-pod &lt;span class="nt"&gt;--&lt;/span&gt; nslookup kubernetes.default

&lt;span class="c"&gt;# Debug with netshoot&lt;/span&gt;
kubectl run netshoot &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;nicolaka/netshoot &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Master Services, Ingress, and Network Policies, and you'll have a solid foundation for building secure, scalable applications on Kubernetes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Need Help with Your DevOps Infrastructure?
&lt;/h2&gt;

&lt;p&gt;At &lt;strong&gt;&lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;&lt;/strong&gt;, we specialize in helping startups build production-ready infrastructure.&lt;/p&gt;

&lt;p&gt;📅 &lt;strong&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a Free 15-Min Consultation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://instadevops.com/blog/kubernetes-networking" rel="noopener noreferrer"&gt;instadevops.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>networking</category>
      <category>devops</category>
      <category>security</category>
    </item>
    <item>
      <title>Container Registry Best Practices: ECR, Docker Hub, and Self-Hosted Options</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Fri, 10 Apr 2026 13:46:54 +0000</pubDate>
      <link>https://dev.to/instadevops/container-registry-best-practices-ecr-docker-hub-and-self-hosted-options-22kh</link>
      <guid>https://dev.to/instadevops/container-registry-best-practices-ecr-docker-hub-and-self-hosted-options-22kh</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Container registries are the backbone of containerized application deployment. Choosing the right registry and implementing proper practices can mean the difference between smooth deployments and security nightmares.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon ECR: AWS-Native Registry
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a repository&lt;/span&gt;
aws ecr create-repository &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--repository-name&lt;/span&gt; my-app &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--image-scanning-configuration&lt;/span&gt; &lt;span class="nv"&gt;scanOnPush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;

&lt;span class="c"&gt;# Push an image&lt;/span&gt;
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ECR Lifecycle Policies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"rules"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"rulePriority"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Keep last 10 production images"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"selection"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tagStatus"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tagged"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"tagPrefixList"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"prod-"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"countType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"imageCountMoreThan"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"countNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"expire"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Docker Hub
&lt;/h2&gt;

&lt;p&gt;Docker Hub remains the most widely used registry, hosting millions of public images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate Limits&lt;/strong&gt;: Anonymous pulls limited to 100 per 6 hours; authenticated free users get 200.&lt;/p&gt;

&lt;h2&gt;
  
  
  Self-Hosted: Harbor
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;helm install harbor harbor/harbor \&lt;/span&gt;
  &lt;span class="s"&gt;--set expose.type=ingress \&lt;/span&gt;
  &lt;span class="s"&gt;--set expose.ingress.hosts.core=registry.example.com \&lt;/span&gt;
  &lt;span class="s"&gt;--set trivy.enabled=true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Security Best Practices
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Enable image scanning&lt;/li&gt;
&lt;li&gt;Implement least-privilege access&lt;/li&gt;
&lt;li&gt;Sign your images with Cosign&lt;/li&gt;
&lt;li&gt;Use immutable tags&lt;/li&gt;
&lt;li&gt;Scan base images regularly&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Image Tagging Strategies
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"1.2.3"&lt;/span&gt;
&lt;span class="nv"&gt;GIT_SHA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git rev-parse &lt;span class="nt"&gt;--short&lt;/span&gt; HEAD&lt;span class="si"&gt;)&lt;/span&gt;

docker build &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-t&lt;/span&gt; my-app:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-t&lt;/span&gt; my-app:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;-&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GIT_SHA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-t&lt;/span&gt; my-app:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GIT_SHA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Whether you choose ECR for AWS integration, Docker Hub for ubiquity, or Harbor for control, applying security best practices will keep your container infrastructure secure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Need Help with Your DevOps Infrastructure?
&lt;/h2&gt;

&lt;p&gt;At &lt;strong&gt;&lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;&lt;/strong&gt;, we specialize in helping startups build production-ready infrastructure.&lt;/p&gt;

&lt;p&gt;📅 &lt;strong&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a Free 15-Min Consultation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://instadevops.com/blog/container-registry-guide" rel="noopener noreferrer"&gt;instadevops.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>ecr</category>
      <category>containers</category>
      <category>devops</category>
    </item>
    <item>
      <title>Incident Management: Building Effective On-Call Rotations and Runbooks</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Thu, 09 Apr 2026 20:46:50 +0000</pubDate>
      <link>https://dev.to/instadevops/incident-management-building-effective-on-call-rotations-and-runbooks-10ho</link>
      <guid>https://dev.to/instadevops/incident-management-building-effective-on-call-rotations-and-runbooks-10ho</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Every engineering team dreads the 3 AM page. How your team handles these moments defines your reliability as a service provider.&lt;/p&gt;

&lt;p&gt;This guide covers on-call rotation design, runbook creation, incident response processes, and blameless post-mortems.&lt;/p&gt;

&lt;h2&gt;
  
  
  On-Call Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Designing Sustainable Rotations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rotation Length&lt;/strong&gt;: Weekly rotations work best&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimum Team Size&lt;/strong&gt;: Never fewer than 4 engineers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compensation&lt;/strong&gt;: Fair pay for on-call work&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reducing Alert Fatigue
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Actionable alerts only&lt;/li&gt;
&lt;li&gt;Consolidate related alerts&lt;/li&gt;
&lt;li&gt;Regular alert reviews&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Runbook Template Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# Runbook: Database Connection Pool Exhausted&lt;/span&gt;

&lt;span class="gu"&gt;## Impact Assessment&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; User Impact: New requests may fail
&lt;span class="p"&gt;-&lt;/span&gt; Affected Services: All services using primary database

&lt;span class="gu"&gt;## Immediate Actions&lt;/span&gt;
&lt;span class="p"&gt;1.&lt;/span&gt; Check current connections:
   SELECT count(&lt;span class="err"&gt;*&lt;/span&gt;), state FROM pg_stat_activity GROUP BY state;
&lt;span class="p"&gt;
2.&lt;/span&gt; Identify connection hogs:
   SELECT usename, application_name, count(&lt;span class="err"&gt;*&lt;/span&gt;)
   FROM pg_stat_activity GROUP BY usename, application_name;

&lt;span class="gu"&gt;## Resolution Steps&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; If long-running query: pg_terminate_backend(&lt;span class="nt"&gt;&amp;lt;pid&amp;gt;&lt;/span&gt;)
&lt;span class="p"&gt;-&lt;/span&gt; If traffic spike: Scale up read replicas
&lt;span class="p"&gt;-&lt;/span&gt; If connection leak: Restart affected service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Incident Response Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Phase 1: Detection and Triage (0-5 minutes)
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Acknowledge the alert&lt;/li&gt;
&lt;li&gt;Assess severity&lt;/li&gt;
&lt;li&gt;Open relevant runbook&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Phase 2: Response Coordination
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident Commander&lt;/strong&gt;: Coordinates response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Technical Lead&lt;/strong&gt;: Hands-on debugging&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Communications Lead&lt;/strong&gt;: Updates stakeholders&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Blameless Post-Mortems
&lt;/h2&gt;

&lt;p&gt;Focus on systems and processes, not individuals. The goal is learning and improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Effective incident management is built on preparation, not heroics. Well-designed rotations, comprehensive runbooks, and blameless post-mortems transform incident response from stress into a competitive advantage.&lt;/p&gt;




&lt;h2&gt;
  
  
  Need Help with Your DevOps Infrastructure?
&lt;/h2&gt;

&lt;p&gt;At &lt;strong&gt;&lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;&lt;/strong&gt;, we specialize in helping startups build production-ready infrastructure.&lt;/p&gt;

&lt;p&gt;📅 &lt;strong&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a Free 15-Min Consultation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://instadevops.com/blog/incident-management" rel="noopener noreferrer"&gt;instadevops.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>incident</category>
      <category>oncall</category>
      <category>sre</category>
      <category>devops</category>
    </item>
    <item>
      <title>SRE Fundamentals: Defining SLOs, SLIs, and Error Budgets That Actually Work</title>
      <dc:creator>InstaDevOps</dc:creator>
      <pubDate>Thu, 09 Apr 2026 20:33:35 +0000</pubDate>
      <link>https://dev.to/instadevops/sre-fundamentals-defining-slos-slis-and-error-budgets-that-actually-work-42k7</link>
      <guid>https://dev.to/instadevops/sre-fundamentals-defining-slos-slis-and-error-budgets-that-actually-work-42k7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Site Reliability Engineering (SRE) has transformed how organizations think about system reliability. Central to this framework are Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Error Budgets.&lt;/p&gt;

&lt;p&gt;This guide will walk you through defining SLOs, SLIs, and Error Budgets that actually drive meaningful improvements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Reliability Hierarchy
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SLAs (Service Level Agreements)&lt;/strong&gt;: External contracts with customers specifying consequences for failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLOs (Service Level Objectives)&lt;/strong&gt;: Internal targets your team commits to, stricter than SLAs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLIs (Service Level Indicators)&lt;/strong&gt;: The actual measurements determining whether you're meeting SLOs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error Budgets&lt;/strong&gt;: How much unreliability you can tolerate while meeting your SLO.&lt;/p&gt;

&lt;h2&gt;
  
  
  Defining Meaningful SLIs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Four Golden Signals
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: How long requests take&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traffic&lt;/strong&gt;: Request volume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors&lt;/strong&gt;: Rate of failed requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Saturation&lt;/strong&gt;: How "full" your service is
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Availability SLI
sum(rate(http_requests_total{status=~"2.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Setting Realistic SLOs
&lt;/h2&gt;

&lt;p&gt;Each additional "nine" dramatically reduces your error budget:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Availability&lt;/th&gt;
&lt;th&gt;Monthly Downtime&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;99%&lt;/td&gt;
&lt;td&gt;7.2 hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;td&gt;43.8 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;99.99%&lt;/td&gt;
&lt;td&gt;4.38 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Error Budgets: The Key to Balance
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error Budget = 1 - SLO

For a 99.9% SLO over 30 days:
Error Budget = 0.1% = 43.2 minutes of downtime
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;SLOs, SLIs, and Error Budgets aren't just metrics—they're a framework for making better decisions about reliability. The goal is appropriate reliability—enough to keep users happy while maintaining velocity.&lt;/p&gt;




&lt;h2&gt;
  
  
  Need Help with Your DevOps Infrastructure?
&lt;/h2&gt;

&lt;p&gt;At &lt;strong&gt;&lt;a href="https://instadevops.com" rel="noopener noreferrer"&gt;InstaDevOps&lt;/a&gt;&lt;/strong&gt;, we specialize in helping startups build production-ready infrastructure.&lt;/p&gt;

&lt;p&gt;📅 &lt;strong&gt;&lt;a href="https://calendly.com/instadevops/15min" rel="noopener noreferrer"&gt;Book a Free 15-Min Consultation&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://instadevops.com/blog/sre-slos-slis" rel="noopener noreferrer"&gt;instadevops.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sre</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
