<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: DevOps Daily</title>
    <description>The latest articles on DEV Community by DevOps Daily (@devopsdaily).</description>
    <link>https://dev.to/devopsdaily</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F382434%2F3b4f7f10-38d4-4f4f-8351-1dcb0c1bdfc7.png</url>
      <title>DEV Community: DevOps Daily</title>
      <link>https://dev.to/devopsdaily</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/devopsdaily"/>
    <language>en</language>
    <item>
      <title>Secrets Management Best Practices with HashiCorp Vault</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Mon, 22 Jun 2026 10:55:34 +0000</pubDate>
      <link>https://dev.to/devopsdaily/secrets-management-best-practices-with-hashicorp-vault-1pd9</link>
      <guid>https://dev.to/devopsdaily/secrets-management-best-practices-with-hashicorp-vault-1pd9</guid>
      <description>&lt;p&gt;A database password leaks. Maybe it was committed to a private repo three years ago, maybe it sat in a CI log, maybe a contractor copied it into a Slack DM. You do not know, because that password has been valid the entire time and nobody rotated it. Now you are in an incident channel at 2am trying to figure out the blast radius of a credential that every service, every old laptop, and every backup job has used since 2023.&lt;/p&gt;

&lt;p&gt;This is the problem HashiCorp Vault solves, and it is not the problem most teams use it for. Most teams install Vault, run it in dev mode, dump a pile of static key-value secrets into it, and call it done. That gives you an encrypted password store with a nicer API. Useful, but it leaves the worst part untouched: secrets that live forever and that no human can fully account for.&lt;/p&gt;

&lt;p&gt;The real win with Vault is making secrets short-lived and generated on demand, so a leak has an expiry date measured in hours instead of years. This post shows how to run Vault for that: a production server that survives reboots, machine authentication that does not depend on root tokens, dynamic database credentials, and encryption as a service. Every command here is one you can run.&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Never run &lt;code&gt;vault server -dev&lt;/code&gt; for anything real. It is in-memory and unsealed, so a restart wipes every secret.&lt;/li&gt;
&lt;li&gt;Use auto-unseal (AWS KMS, GCP KMS, or another Vault) so a reboot does not need five humans with key shares.&lt;/li&gt;
&lt;li&gt;Authenticate machines with &lt;strong&gt;AppRole&lt;/strong&gt;, not long-lived root or service tokens.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;dynamic secrets&lt;/strong&gt; for databases. Vault creates a unique DB user per request with a short TTL and deletes it when the lease ends.&lt;/li&gt;
&lt;li&gt;Use the &lt;strong&gt;transit engine&lt;/strong&gt; for encryption as a service so your apps never touch the encryption keys.&lt;/li&gt;
&lt;li&gt;Write least-privilege policies, turn on the audit log, and revoke leases when something goes wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;A Linux host (or VM) where you can install the Vault binary&lt;/li&gt;
&lt;li&gt;Vault 1.15 or newer (&lt;code&gt;vault version&lt;/code&gt; to check)&lt;/li&gt;
&lt;li&gt;A PostgreSQL database you can point Vault at for the dynamic secrets section&lt;/li&gt;
&lt;li&gt;An AWS account with a KMS key if you want auto-unseal (optional but recommended)&lt;/li&gt;
&lt;li&gt;Basic comfort with the command line and HCL config files&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stop running Vault in dev mode
&lt;/h2&gt;

&lt;p&gt;Dev mode is the trap. You run one command and get a working Vault:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault server &lt;span class="nt"&gt;-dev&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;==&amp;gt; Vault server configuration:
             Api Address: http://127.0.0.1:8200
                     Cgo: disabled
         Cluster Address: https://127.0.0.1:8201
              Listener 1: tcp (addr: "127.0.0.1:8200", tls: "disabled")
               Log Level: info
                   Mlock: supported: true, enabled: false
           Recovery Mode: false
                 Storage: inmem

WARNING! dev mode is enabled! In this mode, Vault runs entirely in-memory
and starts unsealed with a single unseal key.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read that warning. &lt;code&gt;Storage: inmem&lt;/code&gt; means every secret lives in RAM and disappears on restart. &lt;code&gt;tls: disabled&lt;/code&gt; means traffic is plaintext. It starts unsealed and prints its root token to the terminal, so anyone who gets hold of that token and can reach port 8200 owns it. Dev mode is for trying commands on your laptop, nothing else.&lt;/p&gt;

&lt;p&gt;A production server needs three things dev mode skips: persistent storage, TLS, and a seal. Here is a real &lt;code&gt;vault.hcl&lt;/code&gt; using integrated Raft storage and AWS KMS auto-unseal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# /etc/vault.d/vault.hcl&lt;/span&gt;
&lt;span class="nx"&gt;storage&lt;/span&gt; &lt;span class="s2"&gt;"raft"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;path&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/opt/vault/data"&lt;/span&gt;
  &lt;span class="nx"&gt;node_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"vault-1"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;listener&lt;/span&gt; &lt;span class="s2"&gt;"tcp"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;address&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.0.0.0:8200"&lt;/span&gt;
  &lt;span class="nx"&gt;tls_cert_file&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/opt/vault/tls/vault.crt"&lt;/span&gt;
  &lt;span class="nx"&gt;tls_key_file&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"/opt/vault/tls/vault.key"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Auto-unseal: Vault asks KMS to decrypt its root key on boot.&lt;/span&gt;
&lt;span class="c1"&gt;# No more gathering humans with key shares after every restart.&lt;/span&gt;
&lt;span class="nx"&gt;seal&lt;/span&gt; &lt;span class="s2"&gt;"awskms"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;region&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"us-east-1"&lt;/span&gt;
  &lt;span class="nx"&gt;kms_key_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"arn:aws:kms:us-east-1:111122223333:key/abc-12345"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;api_addr&lt;/span&gt;     &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://vault-1.internal:8200"&lt;/span&gt;
&lt;span class="nx"&gt;cluster_addr&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"https://vault-1.internal:8201"&lt;/span&gt;
&lt;span class="nx"&gt;ui&lt;/span&gt;           &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start it and initialize once:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault server &lt;span class="nt"&gt;-config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/etc/vault.d/vault.hcl &amp;amp;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VAULT_ADDR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://vault-1.internal:8200"&lt;/span&gt;
vault operator init &lt;span class="nt"&gt;-recovery-shares&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5 &lt;span class="nt"&gt;-recovery-threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Recovery Key 1: vR2k9... (give to a different person than key 2)
Recovery Key 2: 8Lp4m...
Recovery Key 3: qW7nZ...
Recovery Key 4: 3xF8t...
Recovery Key 5: hT1bY...

Initial Root Token: hvs.CAESIJ...

Success! Vault is initialized

Recovery key initialized with 5 key shares and a key threshold of 3.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because of auto-unseal you get &lt;strong&gt;recovery keys&lt;/strong&gt; instead of unseal keys. Vault unseals itself on boot using KMS, and the recovery keys are only for emergencies like regenerating the root token. Split them across different people and store them offline. Never keep all of them in one place.&lt;/p&gt;

&lt;p&gt;Now use that root token once to set up authentication and policies, then revoke it (&lt;code&gt;vault token revoke -self&lt;/code&gt;) rather than filing it away. Root tokens are for break-glass moments, not daily use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault login hvs.CAESIJ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you ever see this, your Vault restarted and could not reach its seal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ vault kv get secret/payments/stripe
Error making API request.
URL: GET https://vault-1.internal:8200/v1/secret/data/payments/stripe
Code: 503. Errors:
* Vault is sealed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A sealed Vault answers nothing. That is the whole point. Auto-unseal exists so this state heals itself instead of paging you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authenticate machines with AppRole, not tokens
&lt;/h2&gt;

&lt;p&gt;A common mistake: generate a long-lived token, paste it into an app's environment, and forget it exists. Now you have the same forever-credential problem one layer up. If that token leaks, it works until someone notices.&lt;/p&gt;

&lt;p&gt;For machines, use &lt;strong&gt;AppRole&lt;/strong&gt;. The app proves its identity with a &lt;code&gt;role_id&lt;/code&gt; (think username, not very secret) and a &lt;code&gt;secret_id&lt;/code&gt; (think password, short-lived and delivered separately), and gets back a token scoped to exactly what it needs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault auth &lt;span class="nb"&gt;enable &lt;/span&gt;approle

&lt;span class="c"&gt;# Create a role for the payments service.&lt;/span&gt;
vault write auth/approle/role/payments-api &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;token_policies&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"payments-api"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;token_ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1h &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;token_max_ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4h &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;secret_id_ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;24h &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;secret_id_num_uses&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1

&lt;span class="c"&gt;# role_id is stable and tied to the role.&lt;/span&gt;
vault &lt;span class="nb"&gt;read &lt;/span&gt;auth/approle/role/payments-api/role-id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key        Value
---        -----
role_id    7b1c4e2a-9f3d-4a8e-b6c1-2d5f8e0a1b3c
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;secret_id&lt;/code&gt; is the part that needs care. Generate it just before the app starts and hand it over once. With &lt;code&gt;secret_id_num_uses=1&lt;/code&gt; it works exactly one time, so once the app has logged in, a &lt;code&gt;secret_id&lt;/code&gt; that later turns up in a log is already useless.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault write &lt;span class="nt"&gt;-f&lt;/span&gt; auth/approle/role/payments-api/secret-id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key                   Value
---                   -----
secret_id             d8a3...e91f
secret_id_accessor    4c2b...77a0
secret_id_ttl         24h
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The app logs in with both and gets a short-lived token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault write auth/approle/login &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;role_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"7b1c4e2a-9f3d-4a8e-b6c1-2d5f8e0a1b3c"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;secret_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"d8a3...e91f"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key                  Value
---                  -----
token                hvs.CAESI...
token_duration       1h
token_renewable      true
token_policies       ["default" "payments-api"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That token dies in an hour unless the app renews it. The pattern that delivers the &lt;code&gt;secret_id&lt;/code&gt; securely (a sidecar, a cloud instance identity, or Vault Agent) is its own topic, but the rule is simple: the &lt;code&gt;role_id&lt;/code&gt; can live in config, the &lt;code&gt;secret_id&lt;/code&gt; should be freshly minted and single-use.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dynamic database credentials
&lt;/h2&gt;

&lt;p&gt;This is the feature that changes how you think about secrets. Instead of one shared database password that every service knows, Vault creates a brand new database user for each request, with a short TTL, and deletes it when the lease expires.&lt;/p&gt;

&lt;p&gt;Enable the database engine and point it at PostgreSQL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault secrets &lt;span class="nb"&gt;enable &lt;/span&gt;database

vault write database/config/orders-db &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;plugin_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgresql-database-plugin"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;allowed_roles&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"orders-readonly"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;connection_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgresql://{{username}}:{{password}}@db.internal:5432/orders?sslmode=require"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vault-admin"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$ROOT_DB_PASSWORD&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;vault-admin&lt;/code&gt; account is the only static credential, and it is a privileged account Vault uses to create and drop other users. Now define a role that says what a generated user is allowed to do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault write database/roles/orders-readonly &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;db_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"orders-db"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;creation_statements&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"CREATE ROLE &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;{{name}}&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; WITH LOGIN PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; &lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;&lt;span class="s2"&gt;
      GRANT SELECT ON ALL TABLES IN SCHEMA public TO &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;{{name}}&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;default_ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"1h"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;max_ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"24h"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ask for credentials:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault &lt;span class="nb"&gt;read &lt;/span&gt;database/creds/orders-readonly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key                Value
---                -----
lease_id           database/creds/orders-readonly/Qm9iY...
lease_duration     1h
lease_renewable    true
password           A1a-9Zx2Kp4Lq7Rt0Vn3
username           v-approle-orders-rea-x7Qd2bN9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;username&lt;/code&gt; did not exist a second ago. Run the command again and you get a different user with a different password. Each service instance, each request if you want, gets its own credentials. When the lease ends, Vault runs the revocation statement and the user is gone from PostgreSQL.&lt;/p&gt;

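&lt;p&gt;Here is why this matters in numbers. A static password stays valid until a human rotates it, which in practice means months or years. A dynamic credential with a one-hour TTL stops working within an hour of being issued (at most 24 hours if the lease is renewed all the way to &lt;code&gt;max_ttl&lt;/code&gt;), whether or not anyone notices the leak.&lt;/p&gt;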
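
&lt;p&gt;To put rough numbers on that comparison, here is a minimal shell sketch. The three-year figure is an assumption borrowed from the opening scenario, and the one-hour figure is the &lt;code&gt;default_ttl&lt;/code&gt; on the &lt;code&gt;orders-readonly&lt;/code&gt; role; the function name &lt;code&gt;exposure_hours&lt;/code&gt; is made up for this illustration. Swap in your own values.&lt;/p&gt;

```shell
#!/bin/sh
# Illustrative arithmetic only: how long does a leaked credential stay valid?

# Convert a number of years to hours, ignoring leap days for simplicity.
exposure_hours() {
    echo $(( $1 * 365 * 24 ))
}

# Assumption: the static password went unrotated for 3 years (hypothetical figure).
static_hours=$(exposure_hours 3)

# Matches default_ttl="1h" on the orders-readonly role.
dynamic_ttl_hours=1

# How many times longer the static password stays usable.
ratio=$(( static_hours / dynamic_ttl_hours ))

echo "static password exposure:    ${static_hours} hours"
echo "dynamic credential exposure: ${dynamic_ttl_hours} hour"
echo "static window is ${ratio}x longer"
```

&lt;p&gt;With these inputs the static password is exposed for 26280 hours against one hour for the dynamic credential. The exact ratio matters less than the order of magnitude.&lt;/p&gt;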

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fre4wn1vhsjgp4wf4q6cd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fre4wn1vhsjgp4wf4q6cd.png" alt="How long a leaked credential stays valid" width="799" height="214"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That shrinking of the exposure window is the entire reason to run Vault. If you take one thing from this post, make it this section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Encryption as a service with the transit engine
&lt;/h2&gt;

&lt;p&gt;Sometimes you do not want to store a secret, you want to encrypt application data: a customer's tax ID, a token, a column in your database. The wrong move is to ship an AES key to every app and hope nobody loses it. The transit engine keeps the key inside Vault and exposes encrypt and decrypt operations. Your app sends plaintext and gets ciphertext back. It never sees the key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault secrets &lt;span class="nb"&gt;enable &lt;/span&gt;transit
vault write &lt;span class="nt"&gt;-f&lt;/span&gt; transit/keys/orders-pii
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Encrypt some data (transit takes base64 input):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault write transit/encrypt/orders-pii &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;plaintext&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"4111-1111-1111-1111"&lt;/span&gt; | &lt;span class="nb"&gt;base64&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key            Value
---            -----
ciphertext     vault:v1:8SDd4HCQ9p7Hf2bxN0kZ...
key_version    1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Store &lt;code&gt;vault:v1:8SDd...&lt;/code&gt; in your database. To read it back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault write transit/decrypt/orders-pii &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nv"&gt;ciphertext&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"vault:v1:8SDd4HCQ9p7Hf2bxN0kZ..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Key          Value
---          -----
plaintext    NDExMS0xMTExLTExMTEtMTExMQ==
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Base64-decode that and you are back to the card number. The &lt;code&gt;v1&lt;/code&gt; prefix is the key version, which means you can rotate the key with &lt;code&gt;vault write -f transit/keys/orders-pii/rotate&lt;/code&gt; and old ciphertext still decrypts while new writes use the fresh key. No key ever leaves Vault, so an app compromise leaks data the app could already see, not the key that protects all of it.&lt;/p&gt;
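
&lt;p&gt;The decode step and the version prefix can both be sketched in plain shell. The helper names &lt;code&gt;decode_plaintext&lt;/code&gt; and &lt;code&gt;key_version&lt;/code&gt; are made up for this illustration, &lt;code&gt;EXAMPLEPAYLOAD&lt;/code&gt; is a placeholder rather than real ciphertext, and the sketch assumes a &lt;code&gt;base64&lt;/code&gt; binary that accepts &lt;code&gt;--decode&lt;/code&gt;.&lt;/p&gt;

```shell
#!/bin/sh
# Decode the base64 plaintext that transit/decrypt returns.
# Assumption: base64 accepts --decode (GNU coreutils and recent macOS do).
decode_plaintext() {
    printf '%s' "$1" | base64 --decode
}

# Extract the key version from a transit ciphertext shaped like vault:vN:payload.
key_version() {
    rest="${1#vault:}"            # drop the leading "vault:"
    printf '%s\n' "${rest%%:*}"   # keep everything before the next colon
}

# The plaintext value from the decrypt output above.
decode_plaintext "NDExMS0xMTExLTExMTEtMTExMQ=="
echo

# EXAMPLEPAYLOAD is a hypothetical stand-in for the real ciphertext body.
key_version "vault:v1:EXAMPLEPAYLOAD"
```

&lt;p&gt;The first call prints &lt;code&gt;4111-1111-1111-1111&lt;/code&gt; and the second prints &lt;code&gt;v1&lt;/code&gt;. In an application you would more likely decode in code than in a shell pipeline, but the steps are the same: strip the base64 layer from the plaintext, and read the version prefix if you need to know which key version produced a given ciphertext.&lt;/p&gt;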

&lt;h2&gt;
  
  
  Least-privilege policies and the audit log
&lt;/h2&gt;

&lt;p&gt;Tokens are only as safe as the policy attached to them. The &lt;code&gt;payments-api&lt;/code&gt; policy referenced earlier should grant exactly what the service needs and nothing more:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="c1"&gt;# payments-api.hcl&lt;/span&gt;
&lt;span class="c1"&gt;# Read dynamic DB creds for the orders database.&lt;/span&gt;
&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="s2"&gt;"database/creds/orders-readonly"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;capabilities&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"read"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Encrypt and decrypt PII, but not manage or export the key.&lt;/span&gt;
&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="s2"&gt;"transit/encrypt/orders-pii"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;capabilities&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"update"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;path&lt;/span&gt; &lt;span class="s2"&gt;"transit/decrypt/orders-pii"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;capabilities&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"update"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault policy write payments-api payments-api.hcl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice what is missing. No &lt;code&gt;database/creds/orders-admin&lt;/code&gt;, no &lt;code&gt;transit/keys/*&lt;/code&gt; management, no wildcard paths. If the payments token leaks, the attacker can read orders and decrypt PII for an hour, or up to four if they keep renewing the token until &lt;code&gt;token_max_ttl&lt;/code&gt;, and that is the ceiling. When a request asks for something outside the policy, Vault refuses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ vault read database/creds/orders-admin
Error reading database/creds/orders-admin: Error making API request.
URL: GET https://vault-1.internal:8200/v1/database/creds/orders-admin
Code: 403. Errors:
* 1 error occurred:
    * permission denied
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Turn on the audit log before you put anything real in Vault. It records every request and response (secrets are HMAC'd, not stored in clear) so you can answer "who read this secret and when" during an incident:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault audit &lt;span class="nb"&gt;enable &lt;/span&gt;file &lt;span class="nv"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/var/log/vault/audit.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And when you do have an incident, dynamic secrets give you a clean kill switch. Revoke every active credential a database role has issued in one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vault lease revoke &lt;span class="nt"&gt;-prefix&lt;/span&gt; database/creds/orders-readonly
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;All revocation operations queued successfully!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every dynamic user that role created gets dropped from the database. Try doing that with a shared password that lives in forty places.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where to go next
&lt;/h2&gt;

&lt;p&gt;You now have the shape of a real Vault setup: a persistent server with auto-unseal; AppRole for machines; dynamic database credentials; transit for encryption; tight policies; and an audit trail. The static KV store is still there when you need it, but it should be the exception, not the default.&lt;/p&gt;

&lt;p&gt;Concrete next steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Replace one static database password with a dynamic role this week.&lt;/strong&gt; Pick a low-risk read-only service and cut over. Seeing credentials expire on their own is what makes the model click.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stand up a 3-node Raft cluster&lt;/strong&gt;, not a single server. One Vault node is a single point of failure for every secret you own. Run &lt;code&gt;vault operator raft list-peers&lt;/code&gt; to confirm the cluster.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy Vault Agent&lt;/strong&gt; to handle AppRole login and token renewal so your apps read a rendered file or env var instead of calling the Vault API directly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set short TTLs and test revocation.&lt;/strong&gt; Run &lt;code&gt;vault lease revoke -prefix&lt;/code&gt; against a staging role and confirm the users vanish from your database. Know the command works before you need it at 2am.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ship the audit log to your SIEM&lt;/strong&gt; so secret access shows up next to the rest of your security telemetry.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Start with step one. Turning a single forever-password into a one-hour credential is the smallest change that removes the largest class of secret leaks you have.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>security</category>
      <category>linux</category>
    </item>
    <item>
      <title>10 GitHub Repositories That Will Actually Teach You DevOps in 2026</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Tue, 05 May 2026 18:11:07 +0000</pubDate>
      <link>https://dev.to/devopsdaily/10-github-repositories-that-will-actually-teach-you-devops-in-2026-266e</link>
      <guid>https://dev.to/devopsdaily/10-github-repositories-that-will-actually-teach-you-devops-in-2026-266e</guid>
      <description>&lt;p&gt;There are roughly a thousand "top DevOps repos" listicles, and most of them are the same five awesome-lists in a different order. The problem with awesome-lists is that they are link directories. They tell you where to look, not what to do. If you want to actually get better at DevOps, you need a different shape of repo: ones with exercises, opinionated learning paths, hands-on demos, and source you can read and learn from.&lt;/p&gt;

&lt;p&gt;So here are ten GitHub repositories that have moved real engineers from "I have heard of Kubernetes" to "I run it in production." We will start with the one we maintain on this site, then walk through the rest in order of star count, with notes on who each one is for and how to get the most out of it.&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Repo&lt;/th&gt;
&lt;th&gt;Stars&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/The-DevOps-Daily/devops-daily" rel="noopener noreferrer"&gt;The-DevOps-Daily/devops-daily&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;1k+&lt;/td&gt;
&lt;td&gt;Tutorials, exercises, and quizzes across the stack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/nilbuild/developer-roadmap" rel="noopener noreferrer"&gt;nilbuild/developer-roadmap&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;354k&lt;/td&gt;
&lt;td&gt;Visual roadmap to plan your learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/bregman-arie/devops-exercises" rel="noopener noreferrer"&gt;bregman-arie/devops-exercises&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;82k&lt;/td&gt;
&lt;td&gt;Interview prep and practice questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/kelseyhightower/kubernetes-the-hard-way" rel="noopener noreferrer"&gt;kelseyhightower/kubernetes-the-hard-way&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;48k&lt;/td&gt;
&lt;td&gt;Building Kubernetes from scratch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/MichaelCade/90DaysOfDevOps" rel="noopener noreferrer"&gt;MichaelCade/90DaysOfDevOps&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;29k&lt;/td&gt;
&lt;td&gt;A structured 90-day plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/milanm/DevOps-Roadmap" rel="noopener noreferrer"&gt;milanm/DevOps-Roadmap&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;19k&lt;/td&gt;
&lt;td&gt;Roadmap with linked study resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/ramitsurana/awesome-kubernetes" rel="noopener noreferrer"&gt;ramitsurana/awesome-kubernetes&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;16k&lt;/td&gt;
&lt;td&gt;Curated Kubernetes deep-dive material&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/dastergon/awesome-sre" rel="noopener noreferrer"&gt;dastergon/awesome-sre&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;13k&lt;/td&gt;
&lt;td&gt;SRE-specific reading list&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/stefanprodan/podinfo" rel="noopener noreferrer"&gt;stefanprodan/podinfo&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;6k&lt;/td&gt;
&lt;td&gt;A real microservice to deploy with GitOps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/wmariuss/awesome-devops" rel="noopener noreferrer"&gt;wmariuss/awesome-devops&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;4k&lt;/td&gt;
&lt;td&gt;Broader DevOps tooling and practices&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Star counts are pulled fresh from the GitHub API as of May 2026.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. The-DevOps-Daily/devops-daily
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/The-DevOps-Daily/devops-daily" rel="noopener noreferrer"&gt;github.com/The-DevOps-Daily/devops-daily&lt;/a&gt;. The source for everything you read on this site, fully open source.&lt;/p&gt;

&lt;p&gt;We did not put ourselves at the top because we own the site. We put ourselves at the top because the repo is structured for a fast feedback loop: every blog post, exercise, quiz, flashcard, checklist, and interview question is a Markdown or JSON file you can read, fork, and improve with a PR. If you find a typo, a broken command, or an outdated CLI flag, you can fix it. If you have a better explanation of how kubelet eviction works, you can add a card to the relevant flashcard deck.&lt;/p&gt;

&lt;p&gt;How to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browse the &lt;code&gt;content/&lt;/code&gt; directory. Pick a topic you want to get better at and run through the exercise.&lt;/li&gt;
&lt;li&gt;Use the quizzes for spaced retrieval. Repeat until you stop getting things wrong.&lt;/li&gt;
&lt;li&gt;Submit a PR when you find something to improve. The maintainers (us) review fast and merge most of the time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for engineers who learn by doing, contributing, and seeing the underlying source of every lesson.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. nilbuild/developer-roadmap
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/nilbuild/developer-roadmap" rel="noopener noreferrer"&gt;github.com/nilbuild/developer-roadmap&lt;/a&gt;. 354k stars. Originally &lt;code&gt;kamranahmedse/developer-roadmap&lt;/code&gt;, now under the &lt;code&gt;nilbuild&lt;/code&gt; org. The DevOps roadmap is at &lt;a href="https://roadmap.sh/devops" rel="noopener noreferrer"&gt;roadmap.sh/devops&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is a visual map of the skills, tools, and concepts that make up a DevOps career. It is the single best document on the internet for answering "what should I learn next?" without reinventing your own learning plan from scratch.&lt;/p&gt;

&lt;p&gt;How to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open the DevOps roadmap. Identify the area you are weakest in.&lt;/li&gt;
&lt;li&gt;Click any node to get a short explanation, links, and a checklist.&lt;/li&gt;
&lt;li&gt;Mark items as you go. The site keeps your progress in localStorage if you do not sign up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for people who feel scattered and want a single picture of the field.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. bregman-arie/devops-exercises
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/bregman-arie/devops-exercises" rel="noopener noreferrer"&gt;github.com/bregman-arie/devops-exercises&lt;/a&gt;. 82k stars. Maintained by Arie Bregman, ex-Red Hat.&lt;/p&gt;

&lt;p&gt;This repository is the reason a lot of engineers passed their DevOps interviews. It collects hundreds of practical questions and exercises across Linux, Jenkins, AWS, SRE, Prometheus, Docker, Python, Ansible, Git, Kubernetes, Terraform, OpenStack, SQL, NoSQL, Azure, GCP, and more. Each topic has a mix of explanation questions ("What is X and when do you use it?") and hands-on exercises ("Write the Terraform module that does X").&lt;/p&gt;

&lt;p&gt;How to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pick a topic. Try to answer the questions out loud or in writing without looking at the answers.&lt;/li&gt;
&lt;li&gt;Star the ones you got wrong. Come back to them in a week.&lt;/li&gt;
&lt;li&gt;Use it as a barometer. If you can answer most of the Kubernetes section without help, you know your Kubernetes is solid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for interview preparation and finding gaps in your knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. kelseyhightower/kubernetes-the-hard-way
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kelseyhightower/kubernetes-the-hard-way" rel="noopener noreferrer"&gt;github.com/kelseyhightower/kubernetes-the-hard-way&lt;/a&gt;. 48k stars. The repo description is honest: "Bootstrap Kubernetes the hard way. No scripts."&lt;/p&gt;

&lt;p&gt;If you have only ever used &lt;code&gt;gcloud container clusters create&lt;/code&gt; or &lt;code&gt;eksctl&lt;/code&gt;, you have used Kubernetes. You have not learned it. This walkthrough has you stand up a control plane and worker nodes by hand, with TLS certificates you generated yourself, etcd you configured yourself, and a kubelet you registered yourself.&lt;/p&gt;

&lt;p&gt;It is also a primary reason Kelsey Hightower has the reputation he has, which is its own kind of education.&lt;/p&gt;

&lt;p&gt;How to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Block out a weekend. The full walkthrough takes 6 to 10 hours the first time.&lt;/li&gt;
&lt;li&gt;Do not copy commands. Type them. Read what they do before you run them.&lt;/li&gt;
&lt;li&gt;When something breaks (and it will), debug it. That is the entire point.&lt;/li&gt;
&lt;/ul&gt;
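
&lt;p&gt;To get a feel for the "certificates you generated yourself" part before committing a weekend, the core idea can be sketched with plain &lt;code&gt;openssl&lt;/code&gt;. This is a minimal sketch under stated assumptions, not the walkthrough's own commands: it assumes &lt;code&gt;openssl&lt;/code&gt; is installed, and the function name, file names, and subject names (&lt;code&gt;demo-ca&lt;/code&gt;, &lt;code&gt;node-0&lt;/code&gt;) are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Create a throwaway certificate authority and one certificate signed by it.
# All names below are illustrative, not taken from the walkthrough.
make_demo_pki() {
  local dir="$1"
  mkdir -p "$dir"

  # 1. Self-signed CA: a private key plus a certificate that signs the others.
  openssl genrsa -out "$dir/ca.key" 2048
  openssl req -x509 -new -key "$dir/ca.key" -sha256 -days 365 \
    -subj "/CN=demo-ca" -out "$dir/ca.crt"

  # 2. Component certificate: key, signing request, then the CA signs it.
  openssl genrsa -out "$dir/node.key" 2048
  openssl req -new -key "$dir/node.key" -subj "/CN=node-0" -out "$dir/node.csr"
  openssl x509 -req -in "$dir/node.csr" -CA "$dir/ca.crt" -CAkey "$dir/ca.key" \
    -CAcreateserial -sha256 -days 365 -out "$dir/node.crt"
}

# Usage:
#   make_demo_pki ./pki
#   openssl x509 -in ./pki/node.crt -noout -subject -issuer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Every control-plane and node component in the walkthrough gets a certificate through the same sign-with-your-own-CA pattern, which is why a broken certificate is one of the first things you end up debugging.&lt;/p&gt;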

&lt;p&gt;Best for engineers who want a deep mental model of Kubernetes internals.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. MichaelCade/90DaysOfDevOps
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/MichaelCade/90DaysOfDevOps" rel="noopener noreferrer"&gt;github.com/MichaelCade/90DaysOfDevOps&lt;/a&gt;. 29k stars. Three years of community-curated 90-day plans.&lt;/p&gt;

&lt;p&gt;This started as one engineer's public learning project: 90 days, one DevOps topic per day, write what you learned. It exploded, and is now a structured tour through Linux, networking, programming, containers, Kubernetes, IaC, observability, databases, and serverless across three different yearly cohorts. The format is one folder per day with notes, diagrams, and links.&lt;/p&gt;

&lt;p&gt;How to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Treat it as a TV series, not a textbook. Watch one "episode" a day for 90 days.&lt;/li&gt;
&lt;li&gt;Skip topics you already know. Spend extra time on the ones that feel uncomfortable.&lt;/li&gt;
&lt;li&gt;Read previous cohorts' notes when you finish a day. The 2022, 2023, and 2024 versions cover slightly different angles on the same material.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for engineers early in their career who want a forced curriculum.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. milanm/DevOps-Roadmap
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/milanm/DevOps-Roadmap" rel="noopener noreferrer"&gt;github.com/milanm/DevOps-Roadmap&lt;/a&gt;. 19k stars. A different style of roadmap from #2.&lt;/p&gt;

&lt;p&gt;Where the nilbuild roadmap is a visual node graph, this one is a long markdown document with curated links, books, courses, and YouTube videos for every step of the path. It is heavier on resources, lighter on the conceptual map.&lt;/p&gt;

&lt;p&gt;How to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the introduction. Identify which "phase" of the roadmap you are at.&lt;/li&gt;
&lt;li&gt;Pick one resource per concept. Do not read all five linked resources for the same topic. Pick the format that matches how you learn best.&lt;/li&gt;
&lt;li&gt;Use the prompts at the end of each section as a checklist before moving on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for self-taught engineers building their own curriculum.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. ramitsurana/awesome-kubernetes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/ramitsurana/awesome-kubernetes" rel="noopener noreferrer"&gt;github.com/ramitsurana/awesome-kubernetes&lt;/a&gt;. 16k stars. The most thorough Kubernetes-specific awesome-list.&lt;/p&gt;

&lt;p&gt;If your day job is Kubernetes-heavy and you want to specialize, this is the link directory you want. It has sections for everything: storage, networking, monitoring, security, multi-cluster, GitOps, service mesh, FinOps. Each link is annotated.&lt;/p&gt;

&lt;p&gt;How to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bookmark the page. Use it as a research starting point when you need to evaluate tools in a category.&lt;/li&gt;
&lt;li&gt;Watch the commit log. New tools get added regularly, so it doubles as a "what is happening in Kubernetes" feed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for Kubernetes-track engineers and platform teams researching tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. dastergon/awesome-sre
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/dastergon/awesome-sre" rel="noopener noreferrer"&gt;github.com/dastergon/awesome-sre&lt;/a&gt;. 13k stars. The SRE-flavored cousin.&lt;/p&gt;

&lt;p&gt;DevOps and SRE overlap, but the SRE side weights toward reliability theory, incident response, observability, and the human side of running production systems. This repo is the curated reading list for that side: books (Google's SRE book, Charity Majors' work), papers, postmortems, blog posts, conference talks, training courses.&lt;/p&gt;

&lt;p&gt;How to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read at least one published postmortem a week. The "Postmortems" section is gold.&lt;/li&gt;
&lt;li&gt;The conference talks list is more useful than most paid SRE courses.&lt;/li&gt;
&lt;li&gt;Pair it with &lt;code&gt;kelseyhightower/kubernetes-the-hard-way&lt;/code&gt; if your SRE work is on a Kubernetes platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for engineers moving into SRE or platform-engineering roles.&lt;/p&gt;

&lt;h2&gt;
  
  
  9. stefanprodan/podinfo
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/stefanprodan/podinfo" rel="noopener noreferrer"&gt;github.com/stefanprodan/podinfo&lt;/a&gt;. 6k stars. A small Go web app that exists to be deployed.&lt;/p&gt;

&lt;p&gt;This one is different from the others. podinfo is not a learning resource in the read-and-take-notes sense. It is a real microservice (Go, REST + gRPC, metrics, tracing, health checks) that is purpose-built to be the demo target in tutorials. It is what every Flux, Argo CD, Linkerd, Istio, and Cilium tutorial uses when it needs a service to deploy. If you want to actually try a GitOps tool end-to-end, you build the platform, point it at podinfo's Helm chart, and ship.&lt;/p&gt;

&lt;p&gt;How to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stand up a kind or k3d cluster locally.&lt;/li&gt;
&lt;li&gt;Install Flux or Argo CD and point it at the podinfo chart.&lt;/li&gt;
&lt;li&gt;Roll out a canary. Add Linkerd. Add Prometheus. Each thing you add lets you exercise a different platform skill on a service that already works.&lt;/li&gt;
&lt;/ul&gt;
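
&lt;p&gt;The first steps above can be sketched as a short script. This is a minimal sketch under stated assumptions, not the podinfo project's official procedure: it assumes &lt;code&gt;kind&lt;/code&gt;, &lt;code&gt;helm&lt;/code&gt;, and &lt;code&gt;kubectl&lt;/code&gt; are installed, uses the chart repository URL from the podinfo README, and installs the chart directly with Helm as a smoke test before you hand the same chart to Flux or Argo CD. The cluster name, the namespace, and the &lt;code&gt;run&lt;/code&gt; and &lt;code&gt;deploy_podinfo&lt;/code&gt; helpers are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Illustrative helper: with DRY_RUN=1, print each command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Create a local cluster and install podinfo from its Helm chart.
# "podinfo-lab" and "demo" are arbitrary names chosen for this sketch.
deploy_podinfo() {
  run kind create cluster --name podinfo-lab
  run helm repo add podinfo https://stefanprodan.github.io/podinfo
  run helm upgrade --install podinfo podinfo/podinfo --namespace demo --create-namespace
  run kubectl --namespace demo rollout status deployment/podinfo
}

# Preview the commands first, then run them for real:
#   DRY_RUN=1 deploy_podinfo
#   deploy_podinfo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once the plain Helm install works, delete the release and let your GitOps tool own it instead. That way any later failure points at the GitOps setup, not at the cluster or the chart.&lt;/p&gt;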

&lt;p&gt;Best for engineers who learn by deploying, not reading.&lt;/p&gt;

&lt;h2&gt;
  
  
  10. wmariuss/awesome-devops
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/wmariuss/awesome-devops" rel="noopener noreferrer"&gt;github.com/wmariuss/awesome-devops&lt;/a&gt;. 4k stars. Smaller than &lt;code&gt;awesome-kubernetes&lt;/code&gt;, broader in scope.&lt;/p&gt;

&lt;p&gt;This is the everything-DevOps awesome list: chaos engineering, configuration management, container orchestration, log management, monitoring, package management, secret management, service discovery. The size of the list is approachable, which is its main strength. You can scroll the whole thing in 15 minutes and have a real mental map of the DevOps tooling landscape.&lt;/p&gt;

&lt;p&gt;How to use it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Read the section headings before clicking any links. The taxonomy itself is a learning aid.&lt;/li&gt;
&lt;li&gt;When evaluating a new category of tool (say, you have to pick a secret manager), use this as your starting set rather than Googling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Best for engineers who want a manageable map of the whole DevOps tools world.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Actually Use a List Like This
&lt;/h2&gt;

&lt;p&gt;Lists are starting points, not learning plans. The mistake people make is to star all ten repos and never come back. Avoid that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick exactly one starting repo today.&lt;/strong&gt; If you have no plan, start with #2 (the roadmap) to get one. If you have a plan, start with #4 (kubernetes-the-hard-way) to deepen it. If you are interview-prepping, start with #3 (devops-exercises).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Block calendar time.&lt;/strong&gt; "I will learn DevOps in my spare time" does not work. "I will spend Thursdays from 7 to 9 PM on the kubernetes-the-hard-way walkthrough" works.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build something.&lt;/strong&gt; Pick one of the awesome-list categories you do not understand (say, "service mesh") and use podinfo (#9) plus a tool from the list to build a working setup. You will learn more in two hours of building than two weeks of reading.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teach what you learned.&lt;/strong&gt; Write a blog post. Submit a PR to #1 with a flashcard you made. Give a brown-bag at work. Teaching is the fastest way to find the gaps in what you thought you knew.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Bookmark this page and come back when you finish one repo. The list is not going anywhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Awesome-lists are link directories&lt;/strong&gt;, not learning plans. Pair them with hands-on repos like #1, #4, and #9.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Star counts are not the same as quality&lt;/strong&gt;, but they are a decent first filter. Anything above 5k stars in this space has been read by enough people to be roughly trustworthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The single best learning loop is read → build → teach.&lt;/strong&gt; Most engineers do step one, skip step two, and never reach step three. The repos in this list are picked to support all three.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start one. Finish one.&lt;/strong&gt; Do not collect ten tabs and never close any of them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contribute back.&lt;/strong&gt; Every repo in this list takes PRs. Even small ones (typo fixes, broken-link fixes) count. They also get you GitHub history that future employers can see.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If we missed a repo you think belongs here, &lt;a href="https://github.com/The-DevOps-Daily/devops-daily/issues" rel="noopener noreferrer"&gt;open an issue on our repo&lt;/a&gt; and tell us which one. We update this list when something deserves to be on it.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>opensource</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Claude Code Hidden Features You Probably Missed</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Wed, 01 Apr 2026 17:21:58 +0000</pubDate>
      <link>https://dev.to/devopsdaily/claude-code-hidden-features-you-probably-missed-3ej0</link>
      <guid>https://dev.to/devopsdaily/claude-code-hidden-features-you-probably-missed-3ej0</guid>
      <description>&lt;p&gt;Most people use Claude Code to write code, fix bugs, and maybe generate a commit message. That's fine, but you're leaving a lot on the table.&lt;/p&gt;

&lt;p&gt;Boris Cherny, the creator of Claude Code, recently shared a &lt;a href="https://x.com/bcherny/status/2038454336355999749" rel="noopener noreferrer"&gt;thread on X&lt;/a&gt; about features that even daily users tend to overlook. Some of these genuinely changed how I work. Here's a rundown of the ones worth knowing about.&lt;/p&gt;

&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;Claude Code has mobile sessions, automated scheduling, voice input, parallel agents, git worktrees, hooks, and a browser extension. Most people use about 20% of what it can do.&lt;/p&gt;

&lt;h2&gt;
  
  
  Move Your Session Anywhere with /teleport
&lt;/h2&gt;

&lt;p&gt;You can start a session on your laptop and pick it up on your phone. Or move it to the web. The &lt;code&gt;/teleport&lt;/code&gt; command transfers your full session context between devices.&lt;/p&gt;

&lt;p&gt;The reverse also works. If you're reviewing something on your phone during a commute, you can &lt;code&gt;/teleport&lt;/code&gt; it back to your terminal when you sit down.&lt;/p&gt;

&lt;p&gt;There's also &lt;code&gt;/remote-control&lt;/code&gt;, which lets you connect to a running session from another device without transferring it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On your laptop&lt;/span&gt;
/teleport

&lt;span class="c"&gt;# On your phone or web - enter the code to pick up the session&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is useful when you kick off a long-running task on your workstation and want to check progress from your phone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automate Repetitive Tasks with /loop and /schedule
&lt;/h2&gt;

&lt;p&gt;This one is a genuine workflow changer. You can tell Claude Code to run a task on a recurring schedule for up to a week.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Review PRs every 30 minutes&lt;/span&gt;
/loop 30m review open PRs and post comments

&lt;span class="c"&gt;# Run a health check every hour&lt;/span&gt;
/schedule every 1h check if the staging environment is healthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think about what you do repeatedly: reviewing PRs, checking CI status, monitoring deployments, updating dependencies. You can automate all of it without writing a single script.&lt;/p&gt;

&lt;p&gt;Some practical examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Review all open PRs every morning at 9 AM&lt;/li&gt;
&lt;li&gt;Monitor a Slack channel for feedback and create GitHub issues&lt;/li&gt;
&lt;li&gt;Run your test suite after every push and report failures&lt;/li&gt;
&lt;li&gt;Check for dependency updates weekly&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hooks for Deterministic Automation
&lt;/h2&gt;

&lt;p&gt;Hooks let you run code at specific points in Claude Code's lifecycle. Unlike the AI-driven &lt;code&gt;/loop&lt;/code&gt; command, hooks are deterministic - they always run the same way.&lt;/p&gt;

&lt;p&gt;You configure them in your settings and they fire on events like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Session start&lt;/strong&gt; - set up your environment, load context&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Before bash commands&lt;/strong&gt; - validate or log commands before execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;On permission requests&lt;/strong&gt; - auto-approve specific patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous operation&lt;/strong&gt; - keep Claude running without manual intervention&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is powerful for teams. You can enforce standards (like running linters before every commit) without relying on each engineer to remember.&lt;/p&gt;
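
&lt;p&gt;A deterministic check of this kind can be sketched as a small shell function. This is a sketch under assumptions, not Claude Code's own implementation: the function name &lt;code&gt;is_allowed_command&lt;/code&gt; and the blocked patterns are hypothetical, and the wiring shown in the trailing comment (a JSON payload on stdin with the pending command under &lt;code&gt;tool_input.command&lt;/code&gt;, and exit status 2 to block) should be verified against the current hooks reference before you rely on it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Hypothetical policy check: succeeds (status 0) when the command is allowed,
# fails (status 1) when it contains one of the blocked substrings.
is_allowed_command() {
  case "$1" in
    *"git push --force"*|*"terraform destroy"*|*"kubectl delete namespace"*)
      return 1 ;;
    *)
      return 0 ;;
  esac
}

# Possible wiring inside a hook script (assumption, check the hooks reference):
#   cmd="$(jq -r '.tool_input.command')"
#   is_allowed_command "$cmd" || exit 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the check is ordinary shell, it gives the same answer for the same input every time, which is the property that separates hooks from prompt-driven automation.&lt;/p&gt;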

&lt;h2&gt;
  
  
  Git Worktrees for Parallel Sessions
&lt;/h2&gt;

&lt;p&gt;If you've ever wanted Claude to work on two different branches at the same time, worktrees make this possible. Each session gets its own isolated copy of the repo.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start a session in a worktree&lt;/span&gt;
claude &lt;span class="nt"&gt;--worktree&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why this matters: you can have Claude refactoring module A while simultaneously building feature B. Neither session interferes with the other.&lt;/p&gt;

&lt;p&gt;This pairs well with &lt;code&gt;/batch&lt;/code&gt;, which fans out work across dozens of parallel agents. Need to update 50 files? &lt;code&gt;/batch&lt;/code&gt; can process them concurrently instead of one at a time.&lt;/p&gt;
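
&lt;p&gt;The same isolation can be reproduced by hand with plain &lt;code&gt;git worktree&lt;/code&gt;, which is a useful way to see what a per-session checkout looks like on disk. This is a minimal sketch: the helper name and the sibling-directory naming scheme are illustrative choices, and it does not describe where Claude Code places its own worktrees.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Create a sibling checkout of REPO_DIR on a new branch named after the task.
# Example: add_task_worktree ~/src/app feature-b  creates  ~/src/app-feature-b
add_task_worktree() {
  local repo_dir="$1" task="$2"
  # -C runs git inside the repository, so the relative path is resolved there.
  git -C "$repo_dir" worktree add -b "$task" "../$(basename "$repo_dir")-$task"
}

# Usage: one worktree per parallel task, then list them.
#   add_task_worktree ~/src/app refactor-module-a
#   add_task_worktree ~/src/app feature-b
#   git -C ~/src/app worktree list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Each worktree has its own working directory and branch but shares one object store, so two sessions can edit files at the same time without touching each other's checkout.&lt;/p&gt;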

&lt;h2&gt;
  
  
  Voice Input with /voice
&lt;/h2&gt;

&lt;p&gt;You can dictate to Claude instead of typing. This sounds gimmicky until you try it for longer explanations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/voice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's particularly useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Explaining complex requirements ("I need a migration that handles both the old and new schema formats, with a rollback path if...")&lt;/li&gt;
&lt;li&gt;Code reviews ("Look at the authentication flow in this PR and tell me if...")&lt;/li&gt;
&lt;li&gt;Brainstorming ("What's the best way to structure this API given these constraints...")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typing detailed prompts takes time. Talking is faster for anything longer than a few sentences.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Chrome Extension for Frontend Work
&lt;/h2&gt;

&lt;p&gt;Claude Code has a Chrome extension that lets the AI see what your app looks like in the browser. Instead of describing UI bugs, Claude can verify its own output visually.&lt;/p&gt;

&lt;p&gt;This closes the feedback loop for frontend work. Claude makes a change, checks the browser, adjusts if something looks off. You stop being the human screenshot tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  /branch and --fork-session for Experiments
&lt;/h2&gt;

&lt;p&gt;Want to try two different approaches to the same problem? &lt;code&gt;/branch&lt;/code&gt; creates a copy of your current session so you can explore a different path without losing your progress.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Fork the current session&lt;/span&gt;
/branch

&lt;span class="c"&gt;# Or fork when starting&lt;/span&gt;
claude &lt;span class="nt"&gt;--fork-session&lt;/span&gt; &amp;lt;session-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is like git branches but for your AI conversation. Try approach A in one branch, approach B in another, then pick the winner.&lt;/p&gt;

&lt;h2&gt;
  
  
  /btw for Side Questions
&lt;/h2&gt;

&lt;p&gt;When Claude is working on a long task, you might have an unrelated question. Instead of interrupting the main task, &lt;code&gt;/btw&lt;/code&gt; lets you ask a side question.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/btw what's the difference between SIGTERM and SIGKILL?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude answers your side question and goes right back to what it was doing. No context switching, no lost progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  --bare for SDK Speed
&lt;/h2&gt;

&lt;p&gt;If you're using Claude Code in scripts or CI pipelines, the &lt;code&gt;--bare&lt;/code&gt; flag skips loading plugins and extra features, making startup up to 10x faster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--bare&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"generate a migration for adding user roles"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters when you're calling Claude from automation scripts where every second counts.&lt;/p&gt;

&lt;h2&gt;
  
  
  --add-dir for Multi-Repo Work
&lt;/h2&gt;

&lt;p&gt;Working across multiple repositories? You can give Claude access to all of them in a single session.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--add-dir&lt;/span&gt; ~/projects/api &lt;span class="nt"&gt;--add-dir&lt;/span&gt; ~/projects/frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Claude can see your API schema and your frontend code at the same time. No more copying types between repos or explaining your API structure manually.&lt;/p&gt;

&lt;h2&gt;
  
  
  Custom Agents with --agent
&lt;/h2&gt;

&lt;p&gt;You can create custom agent configurations with their own system prompts and tool permissions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--agent&lt;/span&gt; reviewer    &lt;span class="c"&gt;# Uses your custom reviewer agent config&lt;/span&gt;
claude &lt;span class="nt"&gt;--agent&lt;/span&gt; deployer    &lt;span class="c"&gt;# Uses your custom deployer agent config&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Define these in your &lt;code&gt;.claude/agents/&lt;/code&gt; directory. Each agent can have different instructions, different tool access, and different behaviors. A code reviewer agent doesn't need write access. A deployment agent doesn't need to browse the web.&lt;/p&gt;
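
&lt;p&gt;A read-only reviewer definition might look like the following. This is a sketch, assuming agent files are Markdown with YAML frontmatter carrying &lt;code&gt;name&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, and &lt;code&gt;tools&lt;/code&gt; fields. The helper function, the description, and the prompt text are made up for illustration, so confirm the field names against the current agent documentation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;#!/usr/bin/env bash
# Write an illustrative reviewer agent file into the given directory
# (defaults to .claude/agents in the current project).
write_reviewer_agent() {
  local dir="${1:-.claude/agents}"
  mkdir -p "$dir"
  cat &amp;gt; "$dir/reviewer.md" &amp;lt;&amp;lt;'EOF'
---
name: reviewer
description: Reviews diffs for bugs and style issues. Read-only.
tools: Read, Grep, Glob
---
You are a code reviewer. Point out bugs, risky changes, and missing tests.
Do not modify files.
EOF
}

# Usage:
#   write_reviewer_agent
#   claude --agent reviewer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Listing only read and search tools is what keeps this agent from editing anything. A deployer agent would get a different, equally narrow list.&lt;/p&gt;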

&lt;h2&gt;
  
  
  What This Means for DevOps
&lt;/h2&gt;

&lt;p&gt;These features shift Claude Code from "AI code assistant" to "AI DevOps team member." The combination of scheduling, hooks, parallel sessions, and multi-repo access means you can automate workflows that previously required custom tooling.&lt;/p&gt;

&lt;p&gt;Here's a realistic DevOps setup:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;/schedule&lt;/code&gt; reviews all PRs every morning&lt;/li&gt;
&lt;li&gt;Hooks enforce linting and security scanning on every session&lt;/li&gt;
&lt;li&gt;Worktrees let you debug production while shipping features&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--add-dir&lt;/code&gt; gives Claude access to your infra and app repos simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;/loop&lt;/code&gt; monitors your staging environment and alerts you on issues&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The key insight from Boris's thread: "There is no one right way to use Claude Code." The tool is intentionally flexible. Experiment with these features and build the workflow that fits your team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Out
&lt;/h2&gt;

&lt;p&gt;If you haven't updated Claude Code recently, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Many of these features are recent additions. The mobile app, scheduling, and hooks in particular have been added in the last few months.&lt;/p&gt;

&lt;p&gt;For more DevOps tools and guides, check out our &lt;a href="https://dev.to/exercises"&gt;exercises&lt;/a&gt; and &lt;a href="https://dev.to/quizzes"&gt;quizzes&lt;/a&gt; to sharpen your skills.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was inspired by &lt;a href="https://x.com/bcherny/status/2038454336355999749" rel="noopener noreferrer"&gt;Boris Cherny's thread on X&lt;/a&gt;. Boris is the creator of Claude Code at Anthropic.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>devops</category>
      <category>linux</category>
    </item>
    <item>
      <title>🎄 Advent of DevOps: 25 Days to Level Up Your DevOps Game!</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Sun, 30 Nov 2025 22:00:00 +0000</pubDate>
      <link>https://dev.to/devopsdaily/advent-of-devops-25-days-to-level-up-your-devops-game-2fb5</link>
      <guid>https://dev.to/devopsdaily/advent-of-devops-25-days-to-level-up-your-devops-game-2fb5</guid>
      <description>&lt;p&gt;Hey DevOps enthusiasts! 👋&lt;/p&gt;

&lt;p&gt;Remember how exciting advent calendars were as a kid? Each day bringing a new surprise behind those little doors? Well, we're bringing that same excitement to the DevOps world, but instead of chocolate (sorry! 🍫), you're getting something even better: &lt;strong&gt;real-world DevOps skills that will make you a better engineer&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🎁 What is Advent of DevOps?
&lt;/h2&gt;

&lt;p&gt;Think "Advent of Code" meets real-world DevOps challenges. Starting December 1st, we're releasing &lt;strong&gt;25 daily hands-on challenges&lt;/strong&gt; that cover everything you need to know to thrive in modern DevOps environments.&lt;/p&gt;

&lt;p&gt;Each day unlocks a new practical challenge focusing on tools and techniques you'll actually use in production. No theory-heavy lectures, no boring slides—just pure, hands-on learning that you can apply immediately.&lt;/p&gt;

&lt;h2&gt;
  
  
  🚀 What's Inside?
&lt;/h2&gt;

&lt;p&gt;Here's a taste of what you'll tackle over 25 days:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐳 &lt;strong&gt;Containerization &amp;amp; Orchestration&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;⚙️ &lt;strong&gt;CI/CD &amp;amp; Automation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🏗️ &lt;strong&gt;Infrastructure as Code&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;🔒 &lt;strong&gt;Security &amp;amp; Observability&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;☁️ &lt;strong&gt;Cloud &amp;amp; Scaling&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💡 Why Join?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;🎯 Real-World Skills&lt;/strong&gt;: Every challenge is based on actual scenarios you'll face in production&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;📈 Progressive Learning&lt;/strong&gt;: Start easy, level up gradually. Whether you're a beginner or seasoned pro, there's something for you&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🎮 Fun &amp;amp; Engaging&lt;/strong&gt;: Gamified progress tracking makes learning addictive (in a good way!)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🌟 Community-Driven&lt;/strong&gt;: Share solutions, learn from others, and grow together&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;⏰ Learn at Your Pace&lt;/strong&gt;: Can't keep up daily? No problem! All challenges remain available year-round&lt;/p&gt;

&lt;h2&gt;
  
  
  🎄 How It Works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pick Your Challenge&lt;/strong&gt;: Start with Day 1 or jump to what interests you most&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Get Hands-On&lt;/strong&gt;: Each challenge includes clear tasks, starter code, and success criteria&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build &amp;amp; Learn&lt;/strong&gt;: Complete the challenge at your own pace&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Share &amp;amp; Celebrate&lt;/strong&gt;: Post your wins and solutions with the community&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level Up&lt;/strong&gt;: Review reference solutions and explanations to deepen your understanding&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each challenge includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Clear task description&lt;/li&gt;
&lt;li&gt;🎯 Success criteria&lt;/li&gt;
&lt;li&gt;🔧 Starter code (when applicable)&lt;/li&gt;
&lt;li&gt;💡 Solution &amp;amp; explanation&lt;/li&gt;
&lt;li&gt;🔗 Additional resources&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🌟 Join the Community
&lt;/h2&gt;

&lt;p&gt;This isn't just about solo learning—it's about growing together! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Share your progress:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Follow us on X/Twitter: &lt;a href="https://x.com/thedevopsdaily" rel="noopener noreferrer"&gt;@thedevopsdaily&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Use hashtag: &lt;strong&gt;#AdventOfDevOps&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Share on LinkedIn, dev.to, wherever you hang out!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Contribute:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found a cool solution? Share it!&lt;/li&gt;
&lt;li&gt;Have ideas for challenges? We're open-source!&lt;/li&gt;
&lt;li&gt;Check out our &lt;a href="https://github.com/The-DevOps-Daily/devops-daily" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and contribute&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🎯 Ready to Start?
&lt;/h2&gt;

&lt;p&gt;Don't wait for December 1st to check it out—head over to the page now and get familiar with what's coming:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;&lt;a href="https://devops-daily.com/advent-of-devops" rel="noopener noreferrer"&gt;devops-daily.com/advent-of-devops&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mark your calendar 📅, set your reminders ⏰, and get ready to transform your DevOps skills one day at a time!&lt;/p&gt;

&lt;h2&gt;
  
  
  🤔 Who Should Join?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DevOps Engineers&lt;/strong&gt; looking to sharpen their skills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developers&lt;/strong&gt; wanting to understand the ops side better&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Administrators&lt;/strong&gt; transitioning to DevOps&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Students &amp;amp; Career Changers&lt;/strong&gt; building practical experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Anyone&lt;/strong&gt; curious about modern infrastructure practices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No gatekeeping here: if you're interested in DevOps, you're welcome! 🙌&lt;/p&gt;

&lt;h2&gt;
  
  
  🎊 Let's Make This December Special
&lt;/h2&gt;

&lt;p&gt;Learning doesn't have to be boring. It doesn't have to be stressful. And it definitely doesn't have to be lonely.&lt;/p&gt;

&lt;p&gt;This December, join hundreds (thousands?) of DevOps practitioners around the world in leveling up together. One challenge at a time, one skill at a time, one day at a time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;See you on December 1st! 🎄✨&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;P.S. - Can't wait? Start exploring the challenges now at &lt;a href="https://devops-daily.com/advent-of-devops" rel="noopener noreferrer"&gt;devops-daily.com/advent-of-devops&lt;/a&gt;. They're already live and ready for early birds! 🐦&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P.P.S. - This is completely free, open-source, and community-driven. No paywalls, no upsells, just pure learning. If you find value, give us a star on &lt;a href="https://github.com/The-DevOps-Daily/devops-daily" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and spread the word! ⭐&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Follow DevOps Daily:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🐦 X/Twitter: &lt;a href="https://x.com/thedevopsdaily" rel="noopener noreferrer"&gt;@thedevopsdaily&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 GitHub: &lt;a href="https://github.com/The-DevOps-Daily/devops-daily" rel="noopener noreferrer"&gt;The-DevOps-Daily/devops-daily&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🌐 Website: &lt;a href="https://devops-daily.com" rel="noopener noreferrer"&gt;devops-daily.com&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Happy DevOps-ing! 🚀&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>beginners</category>
      <category>adventofcode</category>
    </item>
    <item>
      <title>Building a DDoS Attack Simulator to Understand Defense Strategies</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Fri, 21 Nov 2025 09:53:22 +0000</pubDate>
      <link>https://dev.to/devopsdaily/building-a-ddos-attack-simulator-to-understand-defense-strategies-lg4</link>
      <guid>https://dev.to/devopsdaily/building-a-ddos-attack-simulator-to-understand-defense-strategies-lg4</guid>
      <description>&lt;p&gt;I created an educational content piece for DevOps Daily and realized something: most explanations of DDoS attacks are either too abstract or too technical. We talk about "request floods" and "mitigation strategies," but it's hard to visualize what's actually happening.&lt;/p&gt;

&lt;p&gt;So I built an interactive simulator to help bridge that gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem with Learning About DDoS 📚
&lt;/h2&gt;

&lt;p&gt;When you're reading about DDoS protection, you see phrases like "distributes load across multiple servers" or "rate limiting prevents abuse." But what does that actually mean when thousands of requests are hitting your infrastructure?&lt;/p&gt;

&lt;p&gt;I wanted something that would help people - especially those newer to infrastructure work - actually see these concepts in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Simulator Does 🎮
&lt;/h2&gt;

&lt;p&gt;You can try it here: &lt;a href="https://devops-daily.com/games/ddos-simulator" rel="noopener noreferrer"&gt;devops-daily.com/games/ddos-simulator&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It lets you simulate three common attack types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HTTP Flood&lt;/strong&gt; 🌊 - overwhelming with legitimate-looking requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SYN Flood&lt;/strong&gt; 🔄 - exploiting TCP handshake mechanics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UDP Flood&lt;/strong&gt; 📦 - connectionless packet storms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting part is watching how different defense mechanisms respond. You can toggle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Firewall&lt;/strong&gt; 🛡️ - blocks about 30% based on signatures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Balancer&lt;/strong&gt; ⚖️ - reduces impact by 50%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto Rate Limit&lt;/strong&gt; 🚦 - blocks high-frequency traffic&lt;/li&gt;
&lt;/ul&gt;
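
&lt;p&gt;To make those percentages concrete, here is a minimal Python sketch of how stacked defenses might reduce the load on a server. It assumes each defense acts independently and multiplicatively; the simulator's actual formula may differ, and the function and constant names are hypothetical. Rate limiting is left out because no figure is given for it.&lt;/p&gt;

```python
# Hypothetical model: each enabled defense scales the remaining load independently.
FIREWALL_PASS = 0.70       # firewall blocks about 30%, so 70% gets through
LOAD_BALANCER_PASS = 0.50  # load balancer reduces the impact by 50%


def effective_load(attack_rps, firewall=False, load_balancer=False):
    """Return the request rate that still reaches a server after defenses."""
    load = float(attack_rps)
    if firewall:
        load *= FIREWALL_PASS
    if load_balancer:
        load *= LOAD_BALANCER_PASS
    return load


if __name__ == "__main__":
    for fw, lb in [(False, False), (True, False), (False, True), (True, True)]:
        print(f"firewall={fw} load_balancer={lb}: {effective_load(10_000, fw, lb):.0f} req/s")
```

&lt;p&gt;Under this assumption, a 10,000 req/s flood still delivers about 3,500 req/s with both defenses switched on.&lt;/p&gt;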

&lt;h2&gt;
  
  
  What I Learned Building It 💡
&lt;/h2&gt;

&lt;p&gt;A few things became clear while working on this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attack intensity matters less than you'd think.&lt;/strong&gt; The attack type and your defense configuration matter way more. A moderate SYN flood with no defenses is worse than an intense HTTP flood with proper rate limiting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single defenses aren't enough.&lt;/strong&gt; This is obvious in theory, but seeing it play out makes it concrete. A firewall alone, or a load balancer alone, only gets you so far.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Visualization helps understanding.&lt;/strong&gt; Watching the server health bar drop while packets animate across the screen creates an intuition that documentation doesn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who Might Find This Useful ⚙️
&lt;/h2&gt;

&lt;p&gt;If you're:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learning about infrastructure security&lt;/li&gt;
&lt;li&gt;Trying to explain DDoS concepts to your team&lt;/li&gt;
&lt;li&gt;Deciding what protections to implement&lt;/li&gt;
&lt;li&gt;Just curious how attacks and defenses interact&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It might be helpful to play around with it for a bit.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Next 🚀
&lt;/h2&gt;

&lt;p&gt;I'm planning to add more waves with additional attack vectors and defense mechanisms. Things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application-layer attacks&lt;/li&gt;
&lt;li&gt;CDN protection&lt;/li&gt;
&lt;li&gt;Anycast routing&lt;/li&gt;
&lt;li&gt;More realistic traffic patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you have thoughts on what would be useful to include, I'd be interested to hear them.&lt;/p&gt;




&lt;p&gt;The goal here is education, not creating chaos. Understanding how attacks work helps you build better defenses. 🛡️&lt;/p&gt;

&lt;p&gt;If you try it out, let me know what you think or if anything is unclear.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>systemdesign</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Right-Sizing Kubernetes Resources with VPA and Karpenter</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Fri, 22 Aug 2025 17:02:04 +0000</pubDate>
      <link>https://dev.to/devopsdaily/right-sizing-kubernetes-resources-with-vpa-and-karpenter-22ah</link>
      <guid>https://dev.to/devopsdaily/right-sizing-kubernetes-resources-with-vpa-and-karpenter-22ah</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;Setting CPU and memory requests too high in Kubernetes wastes money and reduces cluster efficiency. This guide shows you how to identify overprovisioned workloads, use Vertical Pod Autoscaler (VPA) to right-size your pods, and implement Karpenter for smarter node scaling. You'll also learn to monitor costs and validate your improvements with real metrics.&lt;/p&gt;

&lt;p&gt;When you set resource requests too high in Kubernetes, the scheduler reserves more capacity than workloads actually need. This leads to underutilized nodes and higher cloud bills. The problem gets worse at scale: imagine 200 pods each requesting 2 CPU cores but using only 200m. That's 400 reserved cores when actual demand is closer to 40.&lt;/p&gt;
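
&lt;p&gt;The arithmetic behind that example is easy to check. The short Python sketch below only restates the figures from the paragraph above; the helper name is made up for illustration.&lt;/p&gt;

```python
# Figures from the example above: 200 pods, 2 cores requested, 200m actually used.
PODS = 200
REQUEST_MILLICORES = 2000
USAGE_MILLICORES = 200


def cores(pods, millicores_per_pod):
    """Convert a per-pod millicore figure into total cores across all pods."""
    return pods * millicores_per_pod / 1000


reserved = cores(PODS, REQUEST_MILLICORES)
used = cores(PODS, USAGE_MILLICORES)
idle_share = (reserved - used) / reserved

print(f"reserved={reserved:.0f} cores, used={used:.0f} cores, idle={idle_share:.0%}")
```

&lt;p&gt;In this scenario 90% of the reserved CPU sits idle, which is the gap right-sizing is meant to close.&lt;/p&gt;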

&lt;p&gt;The solution involves right-sizing both your pods and nodes. You'll use monitoring data to understand actual usage, apply VPA to adjust pod requests automatically, and leverage Karpenter to provision nodes that match your workload requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before you start, make sure you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Kubernetes cluster (version 1.20 or higher) with metrics-server installed&lt;/li&gt;
&lt;li&gt;kubectl configured with admin access to your cluster&lt;/li&gt;
&lt;li&gt;Prometheus and Grafana deployed for monitoring (or similar observability stack)&lt;/li&gt;
&lt;li&gt;Basic understanding of Kubernetes resource requests and limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You'll also need the ability to install cluster-wide components like VPA and Karpenter.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identifying Overprovisioned Workloads
&lt;/h2&gt;

&lt;p&gt;The first step is understanding how your current workloads use resources compared to what they request. You can start with kubectl to get a quick snapshot of resource usage across your cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check current resource usage for all nodes&lt;/span&gt;
kubectl top nodes

&lt;span class="c"&gt;# View pod resource usage across all namespaces&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cpu

&lt;span class="c"&gt;# Show how much CPU and memory is requested on each node relative to its allocatable capacity&lt;/span&gt;
kubectl describe nodes | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 15 &lt;span class="s2"&gt;"Allocated resources"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These commands show you the gap between requested and actual resource usage. If you see pods consistently using 50Mi of memory while requesting 1Gi, or using 100m CPU while requesting 1000m, those are prime candidates for right-sizing.&lt;/p&gt;

&lt;p&gt;For deeper analysis, you'll want historical data from Prometheus. Here are some key queries to run in your Grafana dashboard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
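&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Requests come from kube-state-metrics (assumed to be scraped by your Prometheus);
# the cAdvisor quota and limit metrics describe limits, not requests.

# CPU utilization percentage (actual usage vs requests)
100 * sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
  / sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"})

# Memory utilization percentage (working set vs requests)
100 * sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
  / sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="memory"})

# Top 10 containers with the highest request-to-usage ratio (biggest waste)
topk(10,
  sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="cpu"})
  / sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!=""}[5m]))
)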
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run these queries over a 2-week period to account for traffic variations and identify consistent patterns. Workloads running at 10-20% utilization with stable traffic are good candidates for optimization.&lt;/p&gt;
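
&lt;p&gt;Once you have exported utilization samples for the observation window, a rough filter like the following Python sketch can shortlist candidates. The 20% cutoff comes from the guidance above; the workload names and sample values are hypothetical, not real measurements.&lt;/p&gt;

```python
from statistics import mean

# Assumed cutoff taken from the 10-20% guidance above; tune it for your cluster.
UTILIZATION_THRESHOLD = 20.0


def is_candidate(samples, threshold=UTILIZATION_THRESHOLD):
    """True when every utilization sample (percent of request) stays at or below the threshold."""
    return bool(samples) and threshold >= max(samples)


# Hypothetical utilization samples (percent of requested CPU), not real measurements.
workloads = {
    "web-service": [12.0, 15.5, 11.0, 18.0],
    "batch-worker": [10.0, 85.0, 12.0, 9.0],
}

for name, samples in workloads.items():
    verdict = "candidate" if is_candidate(samples) else "leave as-is"
    print(f"{name}: mean={mean(samples):.1f}% peak={max(samples):.1f}% -> {verdict}")
```

&lt;p&gt;Checking the peak rather than the mean keeps bursty workloads off the list, even when their average utilization looks low.&lt;/p&gt;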

&lt;h2&gt;
  
  
  Installing and Configuring VPA
&lt;/h2&gt;

&lt;p&gt;Vertical Pod Autoscaler analyzes your workloads and recommends optimal CPU and memory values. Start by installing VPA in your cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone the VPA repository&lt;/span&gt;
git clone https://github.com/kubernetes/autoscaler.git
&lt;span class="nb"&gt;cd &lt;/span&gt;autoscaler/vertical-pod-autoscaler

&lt;span class="c"&gt;# Deploy VPA components&lt;/span&gt;
./hack/vpa-up.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



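&lt;p&gt;This script installs three main components: the VPA recommender (analyzes usage and computes recommendations), the updater (evicts pods whose requests have drifted from the recommendation), and the admission controller (sets the recommended requests on pods as they are created).&lt;/p&gt;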

&lt;p&gt;Next, create a VPA configuration for a workload you want to optimize. Start with recommendation mode to see suggested values before making changes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# vpa-web-service.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;autoscaling.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;VerticalPodAutoscaler&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-service-vpa&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;targetRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;apps/v1'&lt;/span&gt;
    &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-service&lt;/span&gt;
  &lt;span class="na"&gt;updatePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;updateMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Off'&lt;/span&gt; &lt;span class="c1"&gt;# Only provide recommendations, don't auto-update&lt;/span&gt;
  &lt;span class="na"&gt;resourcePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;containerPolicies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-app&lt;/span&gt;
        &lt;span class="c1"&gt;# Set boundaries to prevent extreme recommendations&lt;/span&gt;
        &lt;span class="na"&gt;maxAllowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2'&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4Gi'&lt;/span&gt;
        &lt;span class="na"&gt;minAllowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100m'&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;128Mi'&lt;/span&gt;
        &lt;span class="na"&gt;controlledResources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cpu'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;memory'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the VPA configuration and wait for recommendations to generate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; vpa-web-service.yaml

&lt;span class="c"&gt;# Wait a few minutes, then check recommendations&lt;/span&gt;
kubectl describe vpa web-service-vpa &lt;span class="nt"&gt;-n&lt;/span&gt; production
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



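&lt;p&gt;The output shows recommended values for CPU and memory under the &lt;code&gt;Status&lt;/code&gt; section, including a target along with lower and upper bounds. By default, the recommender targets roughly the 90th percentile of observed usage, draws on up to about 8 days of history with newer samples weighted more heavily, and adds a safety margin. That leaves a buffer while eliminating most of the waste.&lt;/p&gt;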
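
&lt;p&gt;If you would rather read the recommendation programmatically, the same data is available as JSON. The Python sketch below parses a trimmed, hypothetical sample of &lt;code&gt;kubectl get vpa web-service-vpa -n production -o json&lt;/code&gt;; the numbers are invented for illustration, and only the field names under &lt;code&gt;status.recommendation&lt;/code&gt; reflect the VPA object.&lt;/p&gt;

```python
import json

# Hypothetical output of `kubectl get vpa web-service-vpa -n production -o json`,
# trimmed to the fields used here. The numbers are made up for illustration.
VPA_JSON = """
{
  "status": {
    "recommendation": {
      "containerRecommendations": [
        {
          "containerName": "web-app",
          "lowerBound": {"cpu": "150m", "memory": "262144k"},
          "target": {"cpu": "250m", "memory": "524288k"},
          "upperBound": {"cpu": "1", "memory": "1048576k"}
        }
      ]
    }
  }
}
"""


def targets(vpa):
    """Map each container name to its recommended (target) requests."""
    recs = vpa.get("status", {}).get("recommendation", {}).get("containerRecommendations", [])
    return {rec["containerName"]: rec["target"] for rec in recs}


if __name__ == "__main__":
    for container, target in targets(json.loads(VPA_JSON)).items():
        print(f"{container}: cpu={target['cpu']} memory={target['memory']}")
```

&lt;p&gt;The function returns an empty mapping when the recommender has not produced anything yet, so it is safe to run right after creating the VPA.&lt;/p&gt;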

&lt;h2&gt;
  
  
  Applying VPA Recommendations Safely
&lt;/h2&gt;

&lt;p&gt;Once you have solid recommendations, you can apply them gradually. Start with non-critical workloads and monitor for any issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Update your deployment with VPA recommendations&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-service&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-service&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-service&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-app&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:1.21&lt;/span&gt;
          &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;250m'&lt;/span&gt; &lt;span class="c1"&gt;# Reduced from 1000m based on VPA recommendation&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;512Mi'&lt;/span&gt; &lt;span class="c1"&gt;# Reduced from 2Gi based on VPA recommendation&lt;/span&gt;
            &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
              &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;500m'&lt;/span&gt; &lt;span class="c1"&gt;# Set limits 2x requests for burst capacity&lt;/span&gt;
              &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1Gi'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After updating requests, monitor your workloads for at least a week. Watch for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased pod restarts or OOMKilled events&lt;/li&gt;
&lt;li&gt;Higher response times or error rates&lt;/li&gt;
&lt;li&gt;Pods getting evicted under memory pressure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If everything runs smoothly, you can switch VPA to automatic mode:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Update VPA to automatically apply changes&lt;/span&gt;
kubectl patch vpa web-service-vpa &lt;span class="nt"&gt;-n&lt;/span&gt; production &lt;span class="nt"&gt;--type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'merge'&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{"spec":{"updatePolicy":{"updateMode":"Auto"}}}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Auto mode, VPA evicts pods whose requests have drifted too far from the recommendation, and the replacement pods start with the updated values. Make sure you have PodDisruptionBudgets in place to maintain availability during these evictions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up Karpenter for Node Optimization
&lt;/h2&gt;

&lt;p&gt;While VPA optimizes individual pods, Karpenter optimizes your entire node infrastructure. Instead of fixed node groups, Karpenter provisions nodes dynamically based on your workload requirements.&lt;/p&gt;

&lt;p&gt;First, install Karpenter in your cluster. The exact steps depend on your cloud provider, but here's the process for AWS EKS:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Karpenter using Helm&lt;/span&gt;
helm upgrade &lt;span class="nt"&gt;--install&lt;/span&gt; karpenter oci://public.ecr.aws/karpenter/karpenter &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--version&lt;/span&gt; &lt;span class="s2"&gt;"0.32.0"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; &lt;span class="s2"&gt;"karpenter"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="s2"&gt;"settings.clusterName=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="s2"&gt;"settings.interruptionQueueName=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, create a NodePool that defines what types of nodes Karpenter can provision:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# karpenter-nodepool.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NodePool&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general-purpose&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Template for nodes Karpenter will create&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;node-type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general-purpose&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Instance requirements - Karpenter will pick the best fit&lt;/span&gt;
      &lt;span class="na"&gt;requirements&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;kubernetes.io/arch&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;amd64'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/capacity-type&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;spot'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;on-demand'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;# Allow both for cost optimization&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;node.kubernetes.io/instance-type&lt;/span&gt;
          &lt;span class="na"&gt;operator&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;In&lt;/span&gt;
          &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m6i.large'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m6i.xlarge'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;m6i.2xlarge'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r6i.large'&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r6i.xlarge'&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

      &lt;span class="c1"&gt;# Node configuration&lt;/span&gt;
      &lt;span class="na"&gt;nodeClassRef&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1beta1&lt;/span&gt;
        &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EC2NodeClass&lt;/span&gt;
        &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general-purpose&lt;/span&gt;

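      &lt;span class="c1"&gt;# Optional taint: only pods with a matching toleration will be scheduled onto&lt;/span&gt;
      &lt;span class="c1"&gt;# (or trigger provisioning of) these nodes. Remove this block for a shared pool.&lt;/span&gt;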
      &lt;span class="na"&gt;taints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;key&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.sh/unschedulable&lt;/span&gt;
          &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;true'&lt;/span&gt;
          &lt;span class="na"&gt;effect&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NoSchedule&lt;/span&gt;

  &lt;span class="c1"&gt;# Scaling and disruption policies&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt; &lt;span class="c1"&gt;# Maximum CPU across all nodes in this pool&lt;/span&gt;
  &lt;span class="na"&gt;disruption&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;consolidationPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;WhenUnderutilized&lt;/span&gt;
    &lt;span class="c1"&gt;# In v1beta1, consolidateAfter is only accepted together with consolidationPolicy: WhenEmpty&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create the corresponding EC2NodeClass for AWS-specific configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# karpenter-nodeclass.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;karpenter.k8s.aws/v1beta1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;EC2NodeClass&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;general-purpose&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# AMI and instance configuration&lt;/span&gt;
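  &lt;span class="na"&gt;amiFamily&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AL2&lt;/span&gt;
  &lt;span class="c1"&gt;# IAM role for the nodes (required in v1beta1); this name is an assumption, use your own&lt;/span&gt;
  &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;KarpenterNodeRole-${CLUSTER_NAME}'&lt;/span&gt;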
  &lt;span class="na"&gt;subnetSelectorTerms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${CLUSTER_NAME}'&lt;/span&gt;
  &lt;span class="na"&gt;securityGroupSelectorTerms&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;karpenter.sh/discovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;${CLUSTER_NAME}'&lt;/span&gt;

  &lt;span class="c1"&gt;# Custom bootstrap user data&lt;/span&gt;
  &lt;span class="na"&gt;userData&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;#!/bin/bash&lt;/span&gt;
    &lt;span class="s"&gt;/etc/eks/bootstrap.sh ${CLUSTER_NAME}&lt;/span&gt;

  &lt;span class="c1"&gt;# Tags for cost tracking&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;Team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;platform&lt;/span&gt;
    &lt;span class="na"&gt;Environment&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



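&lt;p&gt;Apply both configurations. &lt;code&gt;kubectl&lt;/code&gt; does not expand &lt;code&gt;${CLUSTER_NAME}&lt;/code&gt;, so replace the placeholder with your cluster name (or render the files with a tool such as &lt;code&gt;envsubst&lt;/code&gt;) first:&lt;br&gt;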
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; karpenter-nodepool.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; karpenter-nodeclass.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Karpenter will now monitor unschedulable pods and provision appropriately-sized nodes. When you deploy workloads with right-sized resource requests (thanks to VPA), Karpenter will select smaller, more cost-effective instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Monitoring Cost Impact
&lt;/h2&gt;

&lt;p&gt;To validate your optimizations, you need visibility into resource costs. Kubecost provides detailed insights into how much each workload costs and how much capacity you're wasting.&lt;/p&gt;

&lt;p&gt;Install Kubecost in your cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add the Kubecost Helm repository&lt;/span&gt;
helm repo add kubecost https://kubecost.github.io/cost-analyzer/

&lt;span class="c"&gt;# Install Kubecost with Prometheus integration&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;kubecost kubecost/cost-analyzer &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--namespace&lt;/span&gt; kubecost &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--create-namespace&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; &lt;span class="nv"&gt;kubecostToken&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-token-here"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--set&lt;/span&gt; prometheus.server.global.external_labels.cluster_id&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CLUSTER_NAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Access the Kubecost UI by port-forwarding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; kubecost deployment/kubecost-cost-analyzer 9090:9090
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the Kubecost dashboard, focus on these key metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency scores&lt;/strong&gt;: The percentage of requested resources actually being used&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle costs&lt;/strong&gt;: Money spent on provisioned but unused resources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Right-sizing recommendations&lt;/strong&gt;: Suggestions for adjusting requests and limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Namespace costs&lt;/strong&gt;: Spend per namespace, which shows which teams or applications drive costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Track these metrics before and after implementing VPA and Karpenter to quantify your savings.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Optimization Example
&lt;/h2&gt;

&lt;p&gt;Let's walk through optimizing a typical microservice deployment. You start with a Node.js API that was conservatively configured:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before optimization&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1000m'&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2Gi'&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2000m'&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4Gi'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this workload for two weeks, your monitoring shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Average CPU usage: 150m (15% of requests)&lt;/li&gt;
&lt;li&gt;Average memory usage: 400Mi (20% of requests)&lt;/li&gt;
&lt;li&gt;Peak CPU usage: 300m&lt;/li&gt;
&lt;li&gt;Peak memory usage: 800Mi&lt;/li&gt;
&lt;/ul&gt;

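&lt;p&gt;Based on this data, VPA recommends much lower requests. VPA computes requests only; by default it then scales limits to preserve the original 2:1 limit-to-request ratio:&lt;br&gt;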
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# VPA recommendations (with safety buffer)&lt;/span&gt;
&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;200m'&lt;/span&gt; &lt;span class="c1"&gt;# Covers 99th percentile usage&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;512Mi'&lt;/span&gt; &lt;span class="c1"&gt;# Accounts for memory spikes&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;400m'&lt;/span&gt; &lt;span class="c1"&gt;# 2x requests for burst capacity&lt;/span&gt;
    &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1Gi'&lt;/span&gt; &lt;span class="c1"&gt;# Prevents OOM while allowing growth&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cost impact for 20 replicas of this service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before&lt;/strong&gt;: 20 CPU cores, 40Gi memory requested&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After&lt;/strong&gt;: 4 CPU cores, 10Gi memory requested&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings&lt;/strong&gt;: 80% less CPU and 75% less memory requested&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With Karpenter managing nodes, this workload now runs on smaller instances, further reducing costs by eliminating the need for oversized nodes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Resource Quotas and Guardrails
&lt;/h2&gt;

&lt;p&gt;As you roll out right-sizing across your organization, implement quotas to prevent teams from reverting to oversized requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# namespace-quota.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ResourceQuota&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend-team-quota&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;hard&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;requests.cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;50'&lt;/span&gt; &lt;span class="c1"&gt;# Total CPU requests across all pods&lt;/span&gt;
    &lt;span class="na"&gt;requests.memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100Gi'&lt;/span&gt; &lt;span class="c1"&gt;# Total memory requests&lt;/span&gt;
    &lt;span class="na"&gt;limits.cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100'&lt;/span&gt; &lt;span class="c1"&gt;# Total CPU limits&lt;/span&gt;
    &lt;span class="na"&gt;limits.memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;200Gi'&lt;/span&gt; &lt;span class="c1"&gt;# Total memory limits&lt;/span&gt;
    &lt;span class="na"&gt;pods&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100'&lt;/span&gt; &lt;span class="c1"&gt;# Maximum number of pods&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also create LimitRanges to enforce reasonable defaults:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# limit-range.yaml&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LimitRange&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-limits&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backend&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Container&lt;/span&gt;
      &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Default limits if not specified&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;500m'&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1Gi'&lt;/span&gt;
      &lt;span class="na"&gt;defaultRequest&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Default requests if not specified&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;100m'&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;256Mi'&lt;/span&gt;
      &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Maximum allowed values&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;4'&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;8Gi'&lt;/span&gt;
      &lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Minimum required values&lt;/span&gt;
        &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;50m'&lt;/span&gt;
        &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;64Mi'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These guardrails help maintain optimization gains while giving teams flexibility within reasonable bounds.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting Common Issues
&lt;/h2&gt;

&lt;p&gt;When implementing VPA and Karpenter, you might encounter some challenges. Here are solutions to the most common problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VPA recommendations seem too aggressive&lt;/strong&gt;: VPA sometimes suggests very low values during low-traffic periods. Check that your monitoring data covers representative traffic patterns. You can also limit what VPA is allowed to change through a container policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resourcePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;containerPolicies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-app&lt;/span&gt;
        &lt;span class="na"&gt;controlledValues&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RequestsOnly&lt;/span&gt; &lt;span class="c1"&gt;# Only adjust requests, leave limits alone&lt;/span&gt;
        &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Auto&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Karpenter nodes aren't scaling down&lt;/strong&gt;: This usually happens when pods can't be evicted. Check for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List pods that are not terminating, along with the nodes they run on&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; wide | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; Terminating

&lt;span class="c"&gt;# Check for pods using host networking&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-A&lt;/span&gt; 5 hostNetwork

&lt;span class="c"&gt;# Verify PodDisruptionBudgets allow eviction&lt;/span&gt;
kubectl get pdb &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
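
&lt;p&gt;A PodDisruptionBudget that permits zero disruptions (for example &lt;code&gt;maxUnavailable: 0&lt;/code&gt;, or &lt;code&gt;minAvailable&lt;/code&gt; equal to the replica count) blocks eviction and keeps the node alive. As a minimal sketch, a budget that still lets one pod at a time be drained could look like this. The &lt;code&gt;web-app&lt;/code&gt; name, the &lt;code&gt;app: web-app&lt;/code&gt; label, and the &lt;code&gt;backend&lt;/code&gt; namespace are assumptions for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# pdb.yaml (hypothetical names; adjust the selector to your workload)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-app-pdb
  namespace: backend
spec:
  maxUnavailable: 1 # Allow one pod to be evicted at a time
  selector:
    matchLabels:
      app: web-app
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is also worth checking for pods annotated with &lt;code&gt;karpenter.sh/do-not-disrupt: "true"&lt;/code&gt;, since Karpenter will not voluntarily disrupt the nodes they run on.&lt;/p&gt;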



&lt;p&gt;&lt;strong&gt;Pods getting OOMKilled after VPA optimization&lt;/strong&gt;: This usually means the memory values derived from VPA's recommendation were too low. Temporarily raise the memory request and limit, and check for memory leaks in your application:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check recent OOM events across all namespaces&lt;/span&gt;
kubectl get events &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.metadata.creationTimestamp | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; oom

&lt;span class="c"&gt;# Monitor memory usage patterns&lt;/span&gt;
kubectl top pods &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;memory &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



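&lt;p&gt;You can make VPA more conservative by bounding its recommendations. &lt;code&gt;minAllowed&lt;/code&gt; sets a floor that guards against OOM kills, and &lt;code&gt;maxAllowed&lt;/code&gt; caps the upper end:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;resourcePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;containerPolicies&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;web-app&lt;/span&gt;
        &lt;span class="na"&gt;minAllowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;512Mi'&lt;/span&gt; &lt;span class="c1"&gt;# Example floor; pick a value above your observed peak&lt;/span&gt;
        &lt;span class="na"&gt;maxAllowed&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2Gi'&lt;/span&gt; &lt;span class="c1"&gt;# Set a reasonable upper bound&lt;/span&gt;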
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;Now that you have VPA and Karpenter working together, consider these additional optimizations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Horizontal Pod Autoscaling&lt;/strong&gt;: Combine with VPA to handle both vertical and horizontal scaling, but avoid letting both act on the same CPU or memory metric&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cluster Autoscaler tuning&lt;/strong&gt;: If you run Cluster Autoscaler alongside Karpenter, make sure each one manages a separate set of nodes so they do not compete&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost alerts&lt;/strong&gt;: Set up notifications when resource costs exceed thresholds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regular reviews&lt;/strong&gt;: Schedule monthly reviews of VPA recommendations and cost reports&lt;/li&gt;
&lt;/ul&gt;
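
&lt;p&gt;As a rough sketch of the first item, the HorizontalPodAutoscaler below scales on CPU utilization. It assumes a hypothetical &lt;code&gt;web-app&lt;/code&gt; Deployment in the &lt;code&gt;backend&lt;/code&gt; namespace, and it assumes the matching VPA is restricted to memory (for example with &lt;code&gt;controlledResources: ["memory"]&lt;/code&gt; in its container policy) so the two autoscalers do not react to the same signal. The replica counts and the 70% target are illustrative values, not recommendations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# hpa.yaml (hypothetical names and thresholds)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app
  namespace: backend
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70 # Percentage of the CPU request, averaged across pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;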

&lt;p&gt;You can also explore more advanced Karpenter features like multiple NodePools for different workload types (CPU-intensive, memory-intensive, GPU workloads) and spot instance strategies for non-critical workloads.&lt;/p&gt;
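
&lt;p&gt;For example, a second NodePool dedicated to interruptible work might look roughly like the sketch below. It reuses the &lt;code&gt;general-purpose&lt;/code&gt; EC2NodeClass defined earlier; the pool name, the taint, and the CPU limit are assumptions chosen for illustration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# karpenter-nodepool-spot.yaml (hypothetical pool for non-critical workloads)
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-batch
spec:
  template:
    spec:
      nodeClassRef:
        name: general-purpose
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ['spot'] # Only launch spot capacity from this pool
      taints:
        - key: workload-type # Hypothetical taint; pods need a matching toleration
          value: batch
          effect: NoSchedule
  limits:
    cpu: '100' # Cap the total CPU this pool may provision
  disruption:
    consolidationPolicy: WhenUnderutilized
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The taint keeps ordinary workloads off spot nodes unless they explicitly tolerate it, which is one simple way to limit interruptions to workloads that can handle them.&lt;/p&gt;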

&lt;p&gt;The key is to treat right-sizing as an ongoing process. As your applications evolve and traffic patterns change, continue monitoring and adjusting to maintain optimal resource utilization.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>docker</category>
    </item>
    <item>
      <title>The 5-Minute Kubernetes Cluster Health Check</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Fri, 15 Aug 2025 10:30:08 +0000</pubDate>
      <link>https://dev.to/devopsdaily/the-5-minute-kubernetes-cluster-health-check-b89</link>
      <guid>https://dev.to/devopsdaily/the-5-minute-kubernetes-cluster-health-check-b89</guid>
      <description>&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;You can check your Kubernetes cluster's health in under 5 minutes using five key commands: checking node status, monitoring resource usage, reviewing pod health across namespaces, investigating problem pods, and examining cluster events. This quick routine helps catch issues before they escalate into critical problems.&lt;/p&gt;

&lt;p&gt;Kubernetes is great until it's not. One bad node, a pod stuck in CrashLoopBackOff, or a resource spike can ruin your day. The good news? You don't need to spend an hour digging through dashboards to spot trouble early. With a few quick commands, you can get a solid read on your cluster's health in under 5 minutes.&lt;/p&gt;

&lt;p&gt;Here's how to do it effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Make Sure Your Nodes Are Happy
&lt;/h2&gt;

&lt;p&gt;Start by checking the overall status of your cluster nodes. This gives you the foundation-level health of your infrastructure.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; wide
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command displays all nodes in your cluster along with their detailed information. You'll see each node's status, roles, age, version, internal and external IPs, OS image, kernel version, and container runtime.&lt;/p&gt;

&lt;p&gt;What you want to see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STATUS&lt;/strong&gt; should be &lt;code&gt;Ready&lt;/code&gt; for all nodes&lt;/li&gt;
&lt;li&gt;No mystery nodes suddenly showing up in your cluster&lt;/li&gt;
&lt;li&gt;Roles, IPs, and ages that make sense for your environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you spot &lt;code&gt;NotReady&lt;/code&gt;, that's your cue to dig deeper. A node in this state might be experiencing network issues, resource exhaustion, or kubelet problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check Resource Usage at a Glance
&lt;/h2&gt;

&lt;p&gt;Next, get a quick overview of resource consumption across your nodes to identify potential bottlenecks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl top nodes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command shows CPU and memory usage for each node in your cluster. It provides both absolute values and percentages, making it easy to spot resource pressure.&lt;/p&gt;

&lt;p&gt;Keep an eye out for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU or memory regularly above 80% on any node&lt;/li&gt;
&lt;li&gt;One node doing all the heavy lifting while others are barely working&lt;/li&gt;
&lt;li&gt;Sudden spikes that don't match your expected workload patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No &lt;code&gt;metrics-server&lt;/code&gt; running? Install it with this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The metrics-server powers &lt;code&gt;kubectl top&lt;/code&gt; and is required for CPU- or memory-based horizontal pod autoscaling to work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Look at All Pods Across All Namespaces
&lt;/h2&gt;

&lt;p&gt;Get a bird's-eye view of all pods running in your cluster to quickly identify any that are misbehaving.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command lists every pod across all namespaces, showing their current status, restart count, and age. It's like taking the pulse of your entire application ecosystem.&lt;/p&gt;

&lt;p&gt;Healthy pods should be &lt;code&gt;Running&lt;/code&gt; or &lt;code&gt;Completed&lt;/code&gt;. If you see states like &lt;code&gt;CrashLoopBackOff&lt;/code&gt;, &lt;code&gt;ImagePullBackOff&lt;/code&gt;, &lt;code&gt;Pending&lt;/code&gt;, or &lt;code&gt;Error&lt;/code&gt;, note the namespace and pod name for further investigation.&lt;/p&gt;

&lt;p&gt;Also watch the &lt;strong&gt;RESTARTS&lt;/strong&gt; column closely. If a pod has restarted a dozen times in the last hour, something's definitely off. Frequent restarts often indicate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Application crashes due to bugs or configuration issues&lt;/li&gt;
&lt;li&gt;Failing health checks (readiness or liveness probes)&lt;/li&gt;
&lt;li&gt;Resource limits being exceeded&lt;/li&gt;
&lt;li&gt;Dependencies being unavailable&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Zoom In on Problem Pods
&lt;/h2&gt;

&lt;p&gt;When you spot problematic pods, dig deeper to understand what's causing the issues.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;&amp;lt;pod-name&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;namespace&amp;gt;&lt;/code&gt; with the actual values from your problem pods. This command provides detailed information about the pod's configuration, current state, and recent events.&lt;/p&gt;

&lt;p&gt;Check for these common issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Events at the bottom&lt;/strong&gt; (often the smoking gun that reveals the root cause)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failing readiness or liveness probes&lt;/strong&gt; that prevent the pod from receiving traffic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image pull errors&lt;/strong&gt; indicating registry access problems or incorrect image names&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource limit issues&lt;/strong&gt; where the pod exceeds its memory or CPU constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The events section is particularly valuable because it shows a chronological history of what happened to the pod, including scheduling decisions, volume mounts, and error conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check the Cluster's Event Log
&lt;/h2&gt;

&lt;p&gt;Get insight into what's been happening across your entire cluster by examining recent events.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get events &lt;span class="nt"&gt;--all-namespaces&lt;/span&gt; &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.metadata.creationTimestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command shows cluster-wide events sorted by when they occurred, giving you a timeline of recent activity. Events provide context about system-level operations and can reveal patterns or issues that affect multiple components.&lt;/p&gt;

&lt;p&gt;Events will tell you what's been happening behind the scenes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Failed volume mounts that prevent pods from starting&lt;/li&gt;
&lt;li&gt;DNS resolution errors affecting service communication&lt;/li&gt;
&lt;li&gt;Scheduling issues when pods can't be placed on nodes&lt;/li&gt;
&lt;li&gt;Node pressure warnings indicating resource constraints&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try k9s for a Better View
&lt;/h2&gt;

&lt;p&gt;If you want something more interactive than command-line tools, give &lt;strong&gt;&lt;a href="https://k9scli.io/" rel="noopener noreferrer"&gt;k9s&lt;/a&gt;&lt;/strong&gt; a try. It's a terminal-based UI for Kubernetes that provides real-time cluster information in an intuitive interface.&lt;/p&gt;

&lt;p&gt;k9s lets you browse resources, view logs, and drill into problems without typing long commands. You can navigate between different resource types using simple keystrokes, filter resources, and even perform actions like scaling deployments or deleting pods.&lt;/p&gt;

&lt;p&gt;Once you try k9s, it's hard to go back to plain kubectl for exploratory tasks. It's particularly useful when you need to quickly jump between different namespaces or resource types during troubleshooting.&lt;/p&gt;

&lt;p&gt;Five minutes a day is all it takes to stay ahead of most cluster problems. Make this health check part of your daily routine and you'll catch issues before they blow up and before your pager goes off at 3 a.m. Regular monitoring helps you understand your cluster's normal behavior, making it easier to spot anomalies when they occur.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>linux</category>
    </item>
    <item>
      <title>What’s the Most Underrated DevOps Skill You’ve Learned (and How Did You Learn It)?</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Tue, 05 Aug 2025 07:43:56 +0000</pubDate>
      <link>https://dev.to/devopsdaily/whats-the-most-underrated-devops-skill-youve-learned-and-how-did-you-learn-it-5a7i</link>
      <guid>https://dev.to/devopsdaily/whats-the-most-underrated-devops-skill-youve-learned-and-how-did-you-learn-it-5a7i</guid>
      <description>&lt;p&gt;When we think about DevOps skills, we usually picture Kubernetes, Terraform, CI/CD pipelines, or cloud automation.&lt;/p&gt;

&lt;p&gt;But some of the most valuable skills are the ones that never make it into a certification or a tech stack diagram.&lt;/p&gt;

&lt;p&gt;It could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Staying calm during a production incident and knowing how to prioritize actions&lt;/li&gt;
&lt;li&gt;Communicating effectively with teams under pressure&lt;/li&gt;
&lt;li&gt;Spotting patterns in logs and metrics that others might miss&lt;/li&gt;
&lt;li&gt;Finding ways to optimize cloud costs without slowing down delivery&lt;/li&gt;
&lt;li&gt;Automating the boring stuff so you can focus on the real problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For me, one of the most underrated skills I've learned is knowing when &lt;em&gt;not&lt;/em&gt; to automate something. Sometimes the "manual but reliable" approach saves you from a lot of complexity and maintenance overhead later.&lt;/p&gt;

&lt;p&gt;What about you?&lt;br&gt;
What's the most underrated DevOps skill you've picked up along the way, and how did you learn it?&lt;/p&gt;

&lt;p&gt;P.S. You might find some useful DevOps resources at &lt;a href="http://devops-daily.com/" rel="noopener noreferrer"&gt;devops-daily.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>beginners</category>
      <category>discuss</category>
    </item>
    <item>
      <title>What's the One DevOps "Best Practice" You Secretly Ignore (and Why)?</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Wed, 30 Jul 2025 14:04:41 +0000</pubDate>
      <link>https://dev.to/devopsdaily/whats-the-one-devops-best-practice-you-secretly-ignore-and-why-2460</link>
      <guid>https://dev.to/devopsdaily/whats-the-one-devops-best-practice-you-secretly-ignore-and-why-2460</guid>
      <description>&lt;p&gt;We've all read the books, followed the gurus, and tried to tick every box in the DevOps checklist, but let’s be honest:&lt;/p&gt;

&lt;p&gt;There's always that one best practice that just doesn’t work for your team, your stack, or your sanity.&lt;/p&gt;

&lt;p&gt;Maybe you don't write as many tests as you should.&lt;br&gt;
Maybe you still SSH into production (👀).&lt;br&gt;
Maybe you use &lt;code&gt;latest&lt;/code&gt; tags on your Docker images and pray.&lt;/p&gt;

&lt;p&gt;No judgment here, just real talk from the trenches.&lt;/p&gt;

&lt;p&gt;What's your "ignored" DevOps best practice, and why do you skip it?&lt;/p&gt;

&lt;p&gt;Bonus points if you share how it's actually worked out for you.&lt;/p&gt;




&lt;p&gt;🛠️ Posted by the team behind &lt;a href="https://devops-daily.com" rel="noopener noreferrer"&gt;DevOps Daily&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>linux</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The Complete DevOps Roadmap for 2025 🚀</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Sat, 26 Jul 2025 13:13:38 +0000</pubDate>
      <link>https://dev.to/devopsdaily/the-complete-devops-roadmap-for-2025-4n1h</link>
      <guid>https://dev.to/devopsdaily/the-complete-devops-roadmap-for-2025-4n1h</guid>
      <description>&lt;p&gt;The DevOps landscape continues to evolve rapidly, and 2025 presents incredible opportunities for aspiring engineers. Organizations are increasingly adopting DevOps practices to deliver software faster, more reliably, and at scale. The demand for skilled DevOps professionals has never been higher.&lt;/p&gt;

&lt;p&gt;Whether you're a developer looking to expand into operations, a system administrator aiming to modernize your skills, or a complete beginner drawn to this exciting field, this comprehensive roadmap will guide your journey to DevOps mastery.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why DevOps in 2025? 🌟
&lt;/h2&gt;

&lt;p&gt;DevOps represents a fundamental shift in how software is built, deployed, and maintained. It's not just about tools; it's about culture, collaboration, and continuous improvement. Here's why it matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🔄 Faster Delivery&lt;/strong&gt;: Teams deploy multiple times per day instead of monthly releases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🛡️ Better Reliability&lt;/strong&gt;: Automated testing and monitoring catch issues early&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ Improved Collaboration&lt;/strong&gt;: Breaks down silos between development and operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔧 Enhanced Automation&lt;/strong&gt;: Reduces manual work and human error&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📈 Career Growth&lt;/strong&gt;: High demand for skilled professionals across all industries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But beyond the benefits, DevOps offers intellectually rewarding work where you solve complex problems and see immediate impact on product delivery.&lt;/p&gt;

&lt;h2&gt;
  
  
  DevOps in the Age of AI: Why Infrastructure Matters More Than Ever 🤖
&lt;/h2&gt;

&lt;p&gt;With AI transforming every industry, you might wonder: "Is DevOps still a smart career choice?" The answer is a resounding &lt;strong&gt;yes&lt;/strong&gt;, and here's why:&lt;/p&gt;

&lt;h3&gt;
  
  
  🏗️ AI Runs on Infrastructure
&lt;/h3&gt;

&lt;p&gt;Every AI application, from ChatGPT to autonomous vehicles, depends on robust, scalable infrastructure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🚀 Model Training&lt;/strong&gt;: Requires massive computational resources and distributed systems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ Real-time Inference&lt;/strong&gt;: Needs low-latency, highly available services
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📊 Data Pipelines&lt;/strong&gt;: AI models need continuous data flow and processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔄 Model Deployment&lt;/strong&gt;: Rolling out AI models safely requires sophisticated CI/CD&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🤝 AI Enhances DevOps (Doesn't Replace It)
&lt;/h3&gt;

&lt;p&gt;Rather than replacing DevOps engineers, AI is becoming a powerful tool in our toolkit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🔍 Intelligent Monitoring&lt;/strong&gt;: AI helps predict system failures before they happen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🛠️ Automated Remediation&lt;/strong&gt;: Smart systems can fix common issues automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📈 Resource Optimization&lt;/strong&gt;: AI optimizes cloud costs and performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔐 Security Enhancement&lt;/strong&gt;: AI-powered threat detection and response&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🎯 The Human Element Remains Critical
&lt;/h3&gt;

&lt;p&gt;While AI can automate many tasks, DevOps engineers provide irreplaceable value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🧠 Strategic Thinking&lt;/strong&gt;: Designing architecture and making technology choices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔧 Complex Problem Solving&lt;/strong&gt;: Debugging unique issues and system design&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;👥 Cross-team Collaboration&lt;/strong&gt;: Bridging technical and business requirements&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📋 Compliance &amp;amp; Governance&lt;/strong&gt;: Ensuring systems meet regulatory requirements&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🌐 Growing Complexity Requires Expertise
&lt;/h3&gt;

&lt;p&gt;As AI adoption accelerates, infrastructure becomes more complex:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🔀 Multi-cloud Strategies&lt;/strong&gt;: Managing resources across different providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚓ Container Orchestration&lt;/strong&gt;: Running AI workloads at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔒 Security Challenges&lt;/strong&gt;: Protecting sensitive AI models and data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📊 Observability Needs&lt;/strong&gt;: Understanding performance of AI-driven systems&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The 9-Stage DevOps Learning Journey 🗺️
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stage 1: Master the Fundamentals 💻
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Foundation Skills Every DevOps Engineer Needs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before diving into advanced tools, you need rock-solid fundamentals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🐧 Linux/Unix Systems&lt;/strong&gt;: Command line proficiency is essential&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📜 Shell Scripting (Bash)&lt;/strong&gt;: Automate repetitive tasks efficiently
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔀 Version Control (Git)&lt;/strong&gt;: Collaborate effectively with development teams&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🐍 Basic Programming&lt;/strong&gt;: Python or Go for automation scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🌐 Networking Fundamentals&lt;/strong&gt;: Understand how services communicate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;💡 Pro Tip&lt;/strong&gt;: Don't rush this stage. These skills form the foundation for everything else.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Personal development environment, system monitoring scripts, automated backup solutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Infrastructure as Code 🏗️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Manage Infrastructure Through Code&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Infrastructure as Code (IaC) transforms how we manage infrastructure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;⚙️ Terraform&lt;/strong&gt;: Industry standard for multi-cloud infrastructure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔧 Ansible&lt;/strong&gt;: Configuration management and application deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;☁️ CloudFormation&lt;/strong&gt;: AWS-native infrastructure provisioning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✅ Infrastructure Testing&lt;/strong&gt;: Validate changes before deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real Impact&lt;/strong&gt;: Companies achieve consistent, reproducible deployments while reducing manual configuration errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Multi-environment infrastructure, automated web application stacks, infrastructure testing pipelines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: Containerization &amp;amp; Orchestration 📦
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Package and Orchestrate Applications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Container technology has revolutionized application deployment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🐳 Docker Fundamentals&lt;/strong&gt;: Package applications consistently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔗 Container Networking&lt;/strong&gt;: Understand service communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚓ Kubernetes&lt;/strong&gt;: Orchestrate containers at scale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📋 Helm Charts&lt;/strong&gt;: Simplify Kubernetes application deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔒 Container Security&lt;/strong&gt;: Protect your containerized workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why It Matters&lt;/strong&gt;: Containers solve the "it works on my machine" problem and enable consistent deployments across environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Microservices e-commerce platform, container CI/CD pipeline, production-ready Kubernetes cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 4: CI/CD Pipelines ⚡
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Automate Your Deployment Process&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Continuous Integration and Deployment revolutionize software delivery:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🚀 GitHub Actions&lt;/strong&gt;: Automate workflows directly in your repository&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔄 Jenkins&lt;/strong&gt;: Build complex, enterprise-grade pipelines&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🦊 GitLab CI&lt;/strong&gt;: Integrated DevOps platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🎯 ArgoCD&lt;/strong&gt;: GitOps-style deployments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🧪 Testing Automation&lt;/strong&gt;: Integrate quality gates&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Game Changer&lt;/strong&gt;: Teams can deploy changes safely and frequently, with automatic rollback capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Multi-stage CI/CD pipeline, GitOps deployment system, blue-green deployment strategy.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 5: Cloud Platforms ☁️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Master Modern Cloud Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cloud expertise is essential in today's landscape:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🌐 AWS Fundamentals&lt;/strong&gt;: Learn the most widely adopted cloud platform&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔷 Azure Services&lt;/strong&gt;: Microsoft's comprehensive cloud ecosystem&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔵 Google Cloud Platform&lt;/strong&gt;: Strong in data and AI services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🌍 Multi-Cloud Strategy&lt;/strong&gt;: Many organizations use multiple providers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;💰 Cost Optimization&lt;/strong&gt;: Control and reduce cloud spending&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Industry Reality&lt;/strong&gt;: Most organizations have moved to cloud-first strategies, making these skills essential.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Multi-cloud architecture, serverless application suite, cost optimization dashboard.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 6: Monitoring &amp;amp; Observability 📊
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Ensure System Reliability and Performance&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Observability provides visibility into system behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;📈 Prometheus &amp;amp; Grafana&lt;/strong&gt;: Industry-standard metrics and visualization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📋 ELK Stack&lt;/strong&gt;: Centralized logging and analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔍 Distributed Tracing&lt;/strong&gt;: Track requests across microservices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ APM Tools&lt;/strong&gt;: Application performance monitoring&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🎯 SLO/SLI Design&lt;/strong&gt;: Define and measure service reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Critical Importance&lt;/strong&gt;: You can't improve what you can't measure. Monitoring prevents small issues from becoming major outages.&lt;/p&gt;
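&lt;p&gt;To make the SLO idea concrete, here is a minimal sketch of an error budget calculation. It assumes an availability SLO expressed as a percentage over a fixed window; the function name &lt;code&gt;error_budget_minutes&lt;/code&gt; and the 30-day default are illustrative choices, not part of any particular monitoring tool.&lt;/p&gt;

```python
def error_budget_minutes(slo_percent, period_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window.

    Hypothetical helper for illustration: a 99.9% SLO leaves 0.1% of the
    window as the error budget.
    """
    if slo_percent > 100 or 0 > slo_percent:
        raise ValueError("slo_percent must be between 0 and 100")
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    # A 99.9% SLO over 30 days allows roughly 43 minutes of downtime.
    print(round(error_budget_minutes(99.9), 1))
```

&lt;p&gt;Comparing measured downtime against this budget is one simple way to decide whether a team can keep shipping features or should pause and focus on reliability work.&lt;/p&gt;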

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Complete observability stack, SLO monitoring dashboard, performance analysis tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 7: Security &amp;amp; Compliance 🛡️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Integrate Security Throughout the Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Security must be built-in, not bolted-on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🔐 DevSecOps Practices&lt;/strong&gt;: Shift security left in the development process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🛡️ Container Security&lt;/strong&gt;: Secure runtime and images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔑 Secrets Management&lt;/strong&gt;: Handle credentials safely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📋 Compliance Automation&lt;/strong&gt;: Automate SOC 2 and GDPR compliance checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔍 Security Scanning&lt;/strong&gt;: Integrate vulnerability detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Modern Approach&lt;/strong&gt;: Security teams collaborate with development from day one, rather than reviewing at the end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Secure CI/CD pipeline, zero-trust network, compliance automation systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 8: Database Management 🗄️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Handle Data Persistence and Reliability&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data management remains critical across all applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🗃️ SQL &amp;amp; NoSQL&lt;/strong&gt;: Master both relational and document databases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🤖 Database Automation&lt;/strong&gt;: Automate deployments and migrations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;💾 Backup Strategies&lt;/strong&gt;: Ensure data recovery capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ Performance Tuning&lt;/strong&gt;: Optimize database performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;☁️ Cloud Databases&lt;/strong&gt;: Leverage managed database services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Universal Need&lt;/strong&gt;: Every application needs data persistence, making these skills valuable across all projects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Database migration pipeline, multi-database architecture, database monitoring system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 9: Continuous Learning 🎓
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Embrace Lifelong Growth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Technology evolves rapidly, making continuous learning essential:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🌍 Open Source Contribution&lt;/strong&gt;: Build your reputation in the community&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;✍️ Technical Writing&lt;/strong&gt;: Share knowledge and build authority&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;👥 Mentoring&lt;/strong&gt;: Guide others and develop leadership skills&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🎤 Conference Participation&lt;/strong&gt;: Stay current with industry trends&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🛠️ Side Projects&lt;/strong&gt;: Experiment with new technologies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Long-term Success&lt;/strong&gt;: The most successful DevOps engineers are those who adapt and grow with the technology landscape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What You'll Build&lt;/strong&gt;: Open source contributions, technical blog series, mentorship programs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Learning Approach 🎯
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Hands-On Projects Beat Theory 🔨
&lt;/h3&gt;

&lt;p&gt;Don't just read about tools; build with them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up a complete development environment&lt;/li&gt;
&lt;li&gt;Create infrastructure across multiple cloud providers&lt;/li&gt;
&lt;li&gt;Build and deploy a real application end-to-end&lt;/li&gt;
&lt;li&gt;Implement comprehensive monitoring and alerting&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Learn in Public 📢
&lt;/h3&gt;

&lt;p&gt;Document your journey and help others:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write blog posts about your learnings and challenges&lt;/li&gt;
&lt;li&gt;Share code and configurations on GitHub&lt;/li&gt;
&lt;li&gt;Participate in DevOps communities and forums&lt;/li&gt;
&lt;li&gt;Help others troubleshoot problems you've solved&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Focus on Problem-Solving 🧩
&lt;/h3&gt;

&lt;p&gt;DevOps is about solving business problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Understand why tools exist, not just how to use them&lt;/li&gt;
&lt;li&gt;Practice troubleshooting and debugging systematically&lt;/li&gt;
&lt;li&gt;Learn to communicate with both technical and business stakeholders&lt;/li&gt;
&lt;li&gt;Think about reliability, scalability, and maintainability&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Embrace AI as a Tool 🤖
&lt;/h3&gt;

&lt;p&gt;Learn to work alongside AI rather than compete with it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use AI-powered tools to enhance your productivity&lt;/li&gt;
&lt;li&gt;Understand how to deploy and manage AI workloads&lt;/li&gt;
&lt;li&gt;Learn about MLOps practices and AI model lifecycle management&lt;/li&gt;
&lt;li&gt;Focus on the strategic and creative aspects that AI can't replace&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Industry Trends to Watch in 2025 🔮
&lt;/h2&gt;

&lt;h3&gt;
  
  
  🏗️ Platform Engineering Rise
&lt;/h3&gt;

&lt;p&gt;Organizations are investing in internal developer platforms to improve developer experience and reduce cognitive load.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔄 GitOps Adoption
&lt;/h3&gt;

&lt;p&gt;Git-based deployment workflows are becoming the standard for managing infrastructure and applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  🤖 AI/ML Integration &amp;amp; Infrastructure Demands
&lt;/h3&gt;

&lt;p&gt;AI is transforming DevOps in multiple ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;🧠 AI-Powered Tools&lt;/strong&gt;: Intelligent monitoring, predictive scaling, and automated incident response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🚀 MLOps Emergence&lt;/strong&gt;: New discipline combining ML and DevOps practices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⚡ GPU Infrastructure&lt;/strong&gt;: Managing specialized hardware for AI workloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📊 AI Model Pipelines&lt;/strong&gt;: Deploying and updating AI models safely at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🌱 Sustainability Focus
&lt;/h3&gt;

&lt;p&gt;Green DevOps practices are becoming important as organizations focus on reducing their environmental impact, especially with energy-intensive AI workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔒 Security-First Mindset
&lt;/h3&gt;

&lt;p&gt;Security considerations are moving earlier in the development lifecycle, making DevSecOps skills increasingly valuable, particularly for protecting AI models and data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Your Next Steps 🚶‍♂️
&lt;/h2&gt;

&lt;p&gt;Starting your DevOps journey can feel overwhelming, but remember: every expert was once a beginner. Here's how to begin:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;📚 Start with the Fundamentals&lt;/strong&gt;: Master Linux, Git, and basic programming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;⏰ Practice Consistently&lt;/strong&gt;: Dedicate time each day to hands-on learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;👥 Join Communities&lt;/strong&gt;: Connect with other learners and experienced practitioners&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔨 Build Projects&lt;/strong&gt;: Apply your knowledge to real-world scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;🔍 Stay Curious&lt;/strong&gt;: Technology evolves rapidly, so embrace continuous learning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;📝 Document Everything&lt;/strong&gt;: Keep notes and share your learning journey&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Interactive Learning Resources 🎮
&lt;/h2&gt;

&lt;p&gt;While this roadmap provides the structure, hands-on practice is essential. For an interactive experience with curated resources, practice labs, and detailed guidance for each skill, check out the &lt;a href="https://devops-daily.com/roadmap" rel="noopener noreferrer"&gt;complete DevOps roadmap&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The interactive version includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;📚 Curated learning resources for each skill&lt;/li&gt;
&lt;li&gt;💻 Hands-on project ideas with difficulty levels&lt;/li&gt;
&lt;li&gt;🎯 Skills assessment and progress tracking&lt;/li&gt;
&lt;li&gt;🔗 Direct links to tutorials, documentation, and practice platforms&lt;/li&gt;
&lt;li&gt;🏆 Achievement badges and learning milestones&lt;/li&gt;
&lt;li&gt;💡 Real-world examples and use cases&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion 🎉
&lt;/h2&gt;

&lt;p&gt;The DevOps field offers tremendous opportunities for those willing to invest in learning and skill development. Even in an AI-driven world, infrastructure expertise becomes more valuable, not less. As AI applications proliferate, they all depend on the robust, scalable systems that DevOps engineers build and maintain.&lt;/p&gt;

&lt;p&gt;With the right roadmap and consistent effort, you can build a rewarding career that combines technical challenges with meaningful business impact. The rise of AI doesn't diminish the importance of DevOps; it amplifies it.&lt;/p&gt;

&lt;p&gt;Remember, the goal isn't to master everything at once. Focus on building a strong foundation, then gradually expand your expertise. The industry rewards competence, problem-solving ability, and continuous learning, all qualities that define successful DevOps engineers.&lt;/p&gt;

&lt;p&gt;The journey may seem long, but every step builds upon the previous one. Start where you are, use what you have, and do what you can. Your future self will thank you for starting today.&lt;/p&gt;

&lt;p&gt;Start your journey now. The DevOps community is welcoming and always ready to help newcomers succeed! 🌟&lt;/p&gt;

&lt;p&gt;What's your current position on this roadmap? Share your DevOps learning journey in the comments below! 💬&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>ai</category>
      <category>beginners</category>
    </item>
    <item>
      <title>The 10 Most Common DevOps Mistakes (And How to Avoid Them in 2025)</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Mon, 21 Jul 2025 11:00:00 +0000</pubDate>
      <link>https://dev.to/devopsdaily/the-10-most-common-devops-mistakes-and-how-to-avoid-them-in-2025-52gi</link>
      <guid>https://dev.to/devopsdaily/the-10-most-common-devops-mistakes-and-how-to-avoid-them-in-2025-52gi</guid>
      <description>&lt;p&gt;DevOps isn't just about shipping code faster; it's about doing it smarter, safer, and saner. But let's be real: even the best teams make mistakes. Some are harmless. Others take down production on a Friday afternoon (yes, &lt;em&gt;that&lt;/em&gt; Friday deploy).&lt;/p&gt;

&lt;p&gt;Here are 10 common DevOps mistakes in 2025, how to avoid them, and a few moments that might hit a little too close to home.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Treating Infrastructure as Code Like a One-Off Script
&lt;/h2&gt;

&lt;p&gt;You wrote Terraform once, it worked, and now it lives untouched in a dusty repo folder. That's not IaC, that's tech debt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Version control your IaC.&lt;/li&gt;
&lt;li&gt;Apply formatting and linting.&lt;/li&gt;
&lt;li&gt;Review every change with &lt;code&gt;terraform plan&lt;/code&gt; and test modules with tools like &lt;code&gt;terratest&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav950b4q5xcks0dkuu72.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fav950b4q5xcks0dkuu72.gif" alt="Please don't do this" width="498" height="280"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Not Enforcing Version Control on CI/CD Configs
&lt;/h2&gt;

&lt;p&gt;Your pipeline files are changing, but without versioning, there's no easy way to debug regressions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Store all CI/CD config files (like GitHub Actions, GitLab CI, etc.) in version control.&lt;/li&gt;
&lt;li&gt;Treat pipeline logic like any other critical code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g181s0hwsy4rdow1ihb.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2g181s0hwsy4rdow1ihb.gif" alt="Where did that config go?" width="498" height="280"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Poor Secrets Management
&lt;/h2&gt;

&lt;p&gt;Hardcoding secrets in code or using &lt;code&gt;.env&lt;/code&gt; files without encryption is a fast way to land on the Hacker News front page for the wrong reasons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use Vault, Doppler, AWS Secrets Manager, or SOPS.&lt;/li&gt;
&lt;li&gt;Rotate secrets regularly.&lt;/li&gt;
&lt;/ul&gt;
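&lt;p&gt;As a minimal sketch of the alternative to hardcoding, the snippet below reads a secret from the process environment and fails fast when it is missing. It assumes your secrets manager (Vault, Doppler, AWS Secrets Manager, or similar) injects values as environment variables; &lt;code&gt;load_secret&lt;/code&gt;, &lt;code&gt;MissingSecretError&lt;/code&gt;, and the &lt;code&gt;DB_PASSWORD&lt;/code&gt; name are hypothetical and used only for illustration.&lt;/p&gt;

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not present in the environment."""


def load_secret(name, env=None):
    """Return the secret stored under `name`, or fail fast if it is absent.

    `env` defaults to the real process environment; passing a dict makes
    the function easy to test without touching real credentials.
    """
    source = os.environ if env is None else env
    value = source.get(name, "")
    if not value:
        # Report only the variable name, never a secret value.
        raise MissingSecretError(f"secret {name!r} is not set")
    return value


if __name__ == "__main__":
    # Hypothetical variable name; in practice a secrets manager sets it.
    demo_env = {"DB_PASSWORD": "example-only"}
    password = load_secret("DB_PASSWORD", demo_env)
    print("loaded a secret of length", len(password))
```

&lt;p&gt;Failing at startup when a secret is missing tends to be easier to debug than a connection error minutes later, and it keeps the value itself out of the repository and out of log output.&lt;/p&gt;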

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j89typdjgcgu0chtfhm.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7j89typdjgcgu0chtfhm.gif" alt="It's fine" width="498" height="280"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  4. No Rollback Strategy
&lt;/h2&gt;

&lt;p&gt;You deploy. Something breaks. And there's no plan B.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use blue-green or canary deployments.&lt;/li&gt;
&lt;li&gt;Automate rollbacks on failure.&lt;/li&gt;
&lt;li&gt;Always have a &lt;code&gt;rollback.sh&lt;/code&gt; or previous image ready.&lt;/li&gt;
&lt;/ul&gt;
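&lt;p&gt;The "automate rollbacks on failure" idea can be sketched in a few lines. This is an illustration under stated assumptions rather than a real deployment tool: &lt;code&gt;deploy&lt;/code&gt; and &lt;code&gt;is_healthy&lt;/code&gt; are hypothetical callables that you would back with your own platform's deploy step and health check.&lt;/p&gt;

```python
def deploy_with_rollback(new_version, previous_version, deploy, is_healthy):
    """Deploy `new_version`; redeploy `previous_version` if the health check fails.

    Returns the version that is live when the function returns.
    """
    deploy(new_version)
    if is_healthy():
        return new_version
    # Health check failed: fall back to the last known-good version.
    deploy(previous_version)
    return previous_version


if __name__ == "__main__":
    deployed = []  # records every version handed to the fake deploy step

    live = deploy_with_rollback(
        new_version="v2",
        previous_version="v1",
        deploy=deployed.append,
        is_healthy=lambda: False,  # simulate a broken release
    )
    print(live, deployed)  # prints: v1 ['v2', 'v1']
```

&lt;p&gt;The key design choice is that the previous version is an explicit input, so a rollback target always exists before the new release goes out.&lt;/p&gt;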

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka8tilxaily9knheymoh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fka8tilxaily9knheymoh.gif" width="640" height="640"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Ignoring Observability Until It's Too Late
&lt;/h2&gt;

&lt;p&gt;Monitoring isn't just about uptime. You can't fix what you can't see.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add metrics, logs, and traces from day one.&lt;/li&gt;
&lt;li&gt;Use tools like Prometheus, Grafana, and OpenTelemetry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrxey2qv78t2o1vyc6lf.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyrxey2qv78t2o1vyc6lf.gif" width="498" height="318"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Too Many Tools, Not Enough Integration
&lt;/h2&gt;

&lt;p&gt;Your stack has 25 tools. None of them talk to each other. And your alert fatigue is real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consolidate tools where possible.&lt;/li&gt;
&lt;li&gt;Favor tools that integrate well with your existing stack.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2rk344iw8r99olhtb3x.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2rk344iw8r99olhtb3x.gif" width="600" height="600"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Manual Approval for Every Tiny Change
&lt;/h2&gt;

&lt;p&gt;A typo fix shouldn't need a 3-person review and a Slack war.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up clear policies: auto-approve safe changes, gate critical ones.&lt;/li&gt;
&lt;li&gt;Use GitHub environments, OPA, or custom bots to help.&lt;/li&gt;
&lt;/ul&gt;
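&lt;p&gt;A policy like "auto-approve safe changes, gate critical ones" can start as a simple path check. The sketch below assumes that documentation-only changes are low risk; the &lt;code&gt;SAFE_PATTERNS&lt;/code&gt; list and the &lt;code&gt;needs_manual_approval&lt;/code&gt; function are hypothetical and would need tuning for a real repository.&lt;/p&gt;

```python
from fnmatch import fnmatchcase

# Hypothetical low-risk paths; adjust to match your own repository layout.
SAFE_PATTERNS = ("docs/*", "*.md")


def needs_manual_approval(changed_paths, safe_patterns=SAFE_PATTERNS):
    """Return True unless every changed path matches a safe pattern."""
    if not changed_paths:
        return True  # nothing to classify: stay on the cautious side
    return not all(
        any(fnmatchcase(path, pattern) for pattern in safe_patterns)
        for path in changed_paths
    )


if __name__ == "__main__":
    print(needs_manual_approval(["README.md", "docs/setup/install.md"]))
    print(needs_manual_approval(["README.md", "terraform/main.tf"]))
```

&lt;p&gt;The first call reports that no manual approval is needed, while the second does require it because one file falls outside the safe patterns. Defaulting to manual review for anything unrecognized keeps the policy conservative.&lt;/p&gt;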

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjwt9dt67t62dc3sdgwt.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxjwt9dt67t62dc3sdgwt.gif" alt="The sloth from Zootopia slowly stamping papers" width="498" height="498"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  8. No Documentation = Single Point of Failure
&lt;/h2&gt;

&lt;p&gt;"Ask Alex, they built it." Alex is on vacation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Write docs as you go.&lt;/li&gt;
&lt;li&gt;Use tools like Backstage, Docusaurus, or just plain Markdown.&lt;/li&gt;
&lt;li&gt;Encourage a culture of async knowledge sharing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63q52iz1e7ep44s7ru4y.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F63q52iz1e7ep44s7ru4y.gif" width="426" height="212"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Skipping Tests for Infrastructure Changes
&lt;/h2&gt;

&lt;p&gt;You test app code, but deploy infra changes directly to prod? Bold.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use staging or preview environments.&lt;/li&gt;
&lt;li&gt;Test IaC with &lt;code&gt;checkov&lt;/code&gt;, &lt;code&gt;terratest&lt;/code&gt;, or &lt;code&gt;kitchen&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t1epuvom2iwo2d9n1cq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8t1epuvom2iwo2d9n1cq.gif" width="373" height="280"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Forgetting Security in Your Pipelines
&lt;/h2&gt;

&lt;p&gt;If your pipeline can deploy to prod, attackers might be able to as well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Avoid it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use least privilege for pipeline credentials.&lt;/li&gt;
&lt;li&gt;Run security checks like &lt;code&gt;trivy&lt;/code&gt;, &lt;code&gt;semgrep&lt;/code&gt;, and &lt;code&gt;snyk&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jcp8vru23tsw8bu70ol.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7jcp8vru23tsw8bu70ol.jpg" width="506" height="500"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Final Thoughts
&lt;/h3&gt;

&lt;p&gt;DevOps is a journey. These mistakes are all lessons learned the hard way by teams around the world, and probably you, if you've been around long enough.&lt;/p&gt;

&lt;p&gt;Want to avoid these mistakes before they cost you time, sleep, or your weekend? We're building checklists, guides, and battle-tested content at &lt;a href="https://devops-daily.com" rel="noopener noreferrer"&gt;DevOps Daily&lt;/a&gt;. Come hang out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PS&lt;/strong&gt;: Got a DevOps horror story or lesson to share? Drop it in the comments or tag us on Twitter.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>cloud</category>
      <category>security</category>
      <category>beginners</category>
    </item>
    <item>
      <title>What's Your Go-To Stack for Personal Projects in 2025?</title>
      <dc:creator>DevOps Daily</dc:creator>
      <pubDate>Fri, 18 Jul 2025 17:07:49 +0000</pubDate>
      <link>https://dev.to/devopsdaily/whats-your-go-to-stack-for-personal-projects-in-2025-3pg2</link>
      <guid>https://dev.to/devopsdaily/whats-your-go-to-stack-for-personal-projects-in-2025-3pg2</guid>
      <description>&lt;p&gt;When you're building a side project in 2025, what's your default stack these days?&lt;/p&gt;

&lt;p&gt;Are you still loving the reliability of Laravel or Ruby on Rails, or have you fully embraced Next.js, Bun, or something even more bleeding edge? Maybe you're mixing in tools like Supabase, Neon, or HTMX?&lt;/p&gt;

&lt;p&gt;Curious to hear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What's your go-to stack for quick MVPs or weekend builds?&lt;/li&gt;
&lt;li&gt;Do you keep it simple or try to mirror production setups?&lt;/li&gt;
&lt;li&gt;What are you hosting it on?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Been thinking about this a lot while working on something for &lt;a href="https://devops-daily.com" rel="noopener noreferrer"&gt;DevOps Daily&lt;/a&gt; and it made me wonder what others are using this year.&lt;/p&gt;

&lt;p&gt;Drop your stack below; someone might discover their next favorite combo from your setup!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>discuss</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
